Introduction to Predictive Analytics


Introduction to Predictive Analytics
TIBER TRAINING, 6/12/2015
Short Bio
B.A. from Allegheny College
M.S. from Carnegie Mellon University
Internship at the Department of Veterans Affairs
July 2014 – Present: Senior Consultant, IBM Global Business Services
Advanced Analytics and Optimization
Today’s Goals:
Students Will Be Able To:
1. Identify the purpose of Data Mining
2. Discuss major paradigms in Data Mining
3. Apply paradigms to business problems
4. Discuss major considerations in model performance
5. Evaluate model performance
6. Discuss the general functionality of some major classification algorithms
7. Use RapidMiner software to load data, train a classifier, and evaluate it
8. Use RapidMiner to create clusters from data
Agenda
1. Conceptual Material
   1. Data Mining Foundations
   2. Problem Types and Model Paradigms
   3. Model Performance Considerations
   4. Recognizing a Strong Model
2. Hands-On Activities
   1. K-Nearest Neighbors Classification
   2. Decision Tree Models
   3. K-Means Clustering
Data Mining Foundations – Data Mining Definition
BRAINSTORM
Based on exposure to literature, conversations with
clients, professional publications:
What is data mining?
Data Mining Foundations – Several Definitions
• To extract insight or information from data
• To discover interesting relationships within data
• “Extracting useful knowledge from data to solve business problems”*
* Provost & Fawcett, “Data Science for Business”
CRISP-DM Framework
“Cross-Industry Standard Process for Data Mining”
1. Iterative process
2. Human stays in the loop throughout iterations
3. Understanding of both the business and the data is paramount
System Design
[System diagram: SOURCE SYSTEMS feed a PROCESSED STATE (data warehouse? operational data store? other form?), which supports REPORTING and BI DASHBOARDS as well as DATA MINING – both AD HOC, ONE-OFF and AUTOMATIC. Data mining results flow to RESULTS STORAGE, RESULTS REPORTING, and a RESULTS DASHBOARD.]
It is assumed that some action will occur.
Data Format
Rows, also known as:
• Records
• Observations
Columns, also known as:
• Fields
• Attributes
• Dimensions
NOTE: Expects inputs and outputs to be at the same level (i.e. for an individual, for a visit to the store, for a webpage view, etc.)

Name              Age   Occupation   Purchase?   Total Spent
Jane Smith        38    Designer     Yes         40,000
James Doe         19    Student      No          N/A
Sally Q. Public   50    CEO          Yes         9,000

Name, Age, and Occupation are input fields. Purchase? and Total Spent are labels/output fields, which can be:
• Binary
• Categorical
• Numeric
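As a minimal sketch (not from the deck, assuming pandas is available), the same three records can be laid out as a data frame with the input fields separated from the labels:

```python
# Sketch of the slide's record/field layout as a data frame. Assumes pandas.
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jane Smith", "James Doe", "Sally Q. Public"],
    "Age": [38, 19, 50],
    "Occupation": ["Designer", "Student", "CEO"],
    "Purchase?": ["Yes", "No", "Yes"],           # binary label
    "Total Spent": [40_000, None, 9_000],        # numeric label (N/A for nonpurchase)
})

inputs = df[["Name", "Age", "Occupation"]]       # input fields
labels = df[["Purchase?", "Total Spent"]]        # labels/output fields
print(inputs, labels, sep="\n\n")
```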
Data Format in Different Disciplines
BRAINSTORM
Based on experience within business intelligence and
data warehousing:
How does this data format expectation differ from
other applications?
Does it pose any unique challenges?
Problem Types and Paradigms
SUPERVISED: CLASSIFICATION, REGRESSION
Supervised methods assume labels/outputs.
UNSUPERVISED: CLUSTERING, VISUALIZATION
Unsupervised methods lack labels/outputs.
The preceding slide had a $ amount of purchase. Why not use it?!
Problem Types: Classification
Given a set of inputs, predict the most likely value for a categorical or binary output.
Examples (from preceding slide):
• Every new customer must record their name, age, and occupation. A new record arrives: Jimmy G. Generic, 10, Student. Will this new customer make a purchase?
• Given the dollar amount, GPS location of the store, category of the retailer, date, and time of a purchase – is it fraudulent?
• Given the pulse, respiration, and blood pressure – will the patient have a complication?
Problem Types: Regression
Given a set of inputs, predict the most likely value for a real-valued output.
Example (from preceding slide):
• Every new customer must record their name, age, and occupation. A new record arrives: Jimmy G. Generic, 10, Student. IF this customer makes a purchase, how much are they likely to spend?
Today’s session has minimal coverage of regression. Why?
Problem Types: Clustering
Given a set of inputs, with or without outputs, assign a class label to each record based upon similarity to other records.
[Example scatter plot: Average $ Value Sales/Day vs. Days as a Salesperson]
Problem Types: Visualization
[Social network visualization, courtesy of Wikipedia]
Not necessarily a problem type like the others, but can sometimes create new insight for existing problems.
What can we say about this? Would we necessarily glean the same from clustering?
Problem Types and Business
BRAINSTORM
Now that we’ve introduced the major problem types in
data mining:
Form groups and create a list of at least five business
use cases for each of the problem types.
Model Performance Considerations
PROBLEM SCENARIO: We have clickstream data on how long a customer has viewed an item (total) and how many times they’ve clicked it.
We pull 10 records from the database and build a model (which type of model? Which type of problem?).
[Scatter plot: # of times clicked vs. total time spent viewing; X = purchase, O = nonpurchase]
Model Performance Considerations
PROBLEM SCENARIO (continued): the same 10 records – which type of model? Which type of problem?
[Scatter plot: the same points, now with a threshold q on total time spent viewing separating NONPURCHASE from PURCHASE]
If Total_Time >= q THEN Purchase ELSE Nonpurchase
SUCCESS! Time to cash the check! Right?
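As a minimal sketch of the single-rule model above (my addition; the cutoff value for q is purely hypothetical):

```python
# Sketch of the slide's one-rule classifier; q is a hypothetical cutoff
# (e.g. seconds of total viewing time) fit by eye to the 10 training points.
def predict_purchase(total_time_viewing, q=30.0):
    return "Purchase" if total_time_viewing >= q else "Nonpurchase"

# Score a few hypothetical new visitors.
for t in (5.0, 25.0, 90.0):
    print(t, predict_purchase(t))
```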
Model Performance Considerations
We deploy the system and 10 new customers arrive. Suddenly ... our classifier looks much worse.
[Scatter plot: the new Xs and Os now fall on both sides of the threshold q; a second split point p appears on # of times clicked]
With more rules, using both dimensions, we can still have reasonable accuracy.
We can’t have a perfect classifier using these two dimensions ... Can we imagine a third? And another rule?
Model Performance Considerations
UNDERFIT
OVERFIT
http://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted
http://scott.fortmann-roe.com/docs/BiasVariance.html
Which of these did we display in the original model for our clickstream customers?
What if we kept adding rules, and kept adding dimensions?
Testing a Model
Cross-validation lets us choose random subsets of the data to hold out from the model-building stage.
10-fold cross-validation is the most popular.
https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png
10-Fold Cross Validation Utility
1. Choose between different modeling algorithms
2. Tune parameters of the chosen algorithm to reduce bias or variance
3. Create an estimate of model performance
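The hands-on portion uses RapidMiner’s cross-validation operator; as a rough scikit-learn equivalent (a sketch under the assumption that scikit-learn is available, with synthetic data standing in for real records), 10-fold cross-validation can both compare algorithms and estimate performance:

```python
# 10-fold cross-validation sketch: compare two algorithms and estimate performance.
# Assumes scikit-learn; the synthetic data stand in for real records.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

for name, model in [("5-NN", KNeighborsClassifier(n_neighbors=5)),
                    ("Decision tree", DecisionTreeClassifier(max_depth=3))]:
    scores = cross_val_score(model, X, y, cv=10)   # 10 held-out folds
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```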
Evaluating classifiers
Refer to handout – Confusion Matrix and Performance Metrics
BRAINSTORM
Why not just use accuracy?
Time for the big bucks! I can deliver 99% accuracy for some problems, guaranteed! How?
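One possible answer, sketched below (my addition, assuming scikit-learn; the ~1% positive rate is illustrative): on a heavily imbalanced problem, a classifier that always predicts the majority class already reaches about 99% accuracy while catching nothing, which is why the confusion matrix matters:

```python
# Sketch of the 99%-accuracy trap on an imbalanced problem.
# Assumes scikit-learn; the ~1% positive rate is illustrative (e.g. fraud).
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)   # ~1% positives
y_pred = np.zeros_like(y_true)                     # always predict the majority class

print(accuracy_score(y_true, y_pred))    # ~0.99, yet the model finds nothing
print(confusion_matrix(y_true, y_pred))  # every actual positive is a false negative
```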
Receiver Operating Characteristic (ROC) Curve
IMPORTANT: Points on the ROC curve each represent a specific threshold for a classifier.
Time for the big bucks! I can deliver a 0% False Positive Rate, guaranteed! How?
http://upload.wikimedia.org/wikipedia/commons/8/8c/Receiver_Operating_Characteristic.png
Using ROC Curves to Compare Models
http://www.actaneurocomms.org/content/figures/2051-5960-1-3-11-l.jpg
1. Generally, models closer to the top left are best, e.g. 100% true positive rate and 0% false positive rate
2. Occasionally, one model will tend toward lower false positives while another favors higher true positives
3. Leads to: cost-sensitive or unbalanced classifiers
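A rough sketch of both ideas (my addition, assuming scikit-learn; data and models are illustrative): sweeping the score threshold produces the points of each ROC curve, and the area under the curve gives a single number for comparing models:

```python
# Sketch: ROC points come from sweeping the score threshold, and AUC gives
# a single number for comparing models. Assumes scikit-learn; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("5-NN", KNeighborsClassifier(n_neighbors=5)),
                    ("Decision tree", DecisionTreeClassifier(max_depth=3))]:
    scores = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_te, scores)   # one (FPR, TPR) point per threshold
    print(f"{name}: {len(thresholds)} thresholds, AUC = {roc_auc_score(y_te, scores):.3f}")
```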
K-Nearest Neighbors
1. Non-parametric approach
   ◦ (Does not actually extract model parameters; relies on having all examples in memory)
2. Relies on a distance function between each point and others
3. Each nearby point “votes” on the class label
http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/image/knn/knn.jpg
K-Nearest Neighbors, Continued
1. Parameters can be tuned:
   ◦ K (number of neighbors to use as a vote)
   ◦ Distance metric
2. What about 1 Nearest Neighbor?
3. What about 100 Nearest Neighbors?
K-Nearest Neighbors
1. Fairly interpretable (present example cases)
2. Can handle multiple output categories (not just binary)
3. Can only handle numeric inputs
4. Nonparametric – memory intensive and slower to score, no time to train
5. Distance metrics begin to break down as dimensionality increases
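The hands-on exercise builds k-NN models in RapidMiner; a scikit-learn sketch of the same idea (an assumption on my part, with synthetic numeric inputs) shows how k and the distance metric can be tuned via cross-validation:

```python
# Minimal k-NN sketch. Assumes scikit-learn; the synthetic data stand in
# for numeric input fields with a binary label.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Try a few values of k; each neighbor "votes" on the class label.
for k in (1, 5, 25, 100):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    acc = cross_val_score(knn, X, y, cv=10).mean()
    print(f"k={k:3d}: mean CV accuracy {acc:.3f}")
```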
Decision Tree Models
1. Recursive algorithm approach
2. Uses some measure of purity
   ◦ Given a certain split in an input, how “pure” is the output?
   ◦ Information Gain, based on Entropy, is a popular measure
Decision Trees, continued
1. Test the Information Gain of the output variable given a split on each of the input variables
2. First split – maximum Information Gain
3. Recursively repeat for each of the nodes, which correspond to split points of the original input values
4. Stop when the leaf nodes have reached an adequate level of “purity”
5. Handle numeric fields with cutpoints (beyond the scope of this session)
Decision Trees – Entropy and Information Gain
ENTROPY
Entropy: H(y) = −Σ p(yi) * log2 p(yi)
Example: 7/10 people purchased, 3/10 did not
H(y) = -[0.7*log2(0.7) + 0.3*log2(0.3)]
H(y) = -[0.7*(-0.51) + 0.3*(-1.74)]
H(y) = .88
http://www.talkorigins.org/faqs/information/Hbinary.gif
How does .88 correspond to the chart at right?
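A quick sketch (not from the slides) to check the arithmetic:

```python
# Entropy of a binary label with p(purchase) = 0.7, matching the slide.
from math import log2

def entropy(probs):
    """H(y) = -sum(p * log2(p)) over the class probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(round(entropy([0.7, 0.3]), 2))   # 0.88
```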
Entropy, Another View
(Best Regards to CMU’s Andrew Moore, who used a similar image)
PILLOW STUFFING ENTROPY: LOW
PILLOW STUFFING ENTROPY: HIGH
Decision Trees – Entropy and Information Gain, Continued
INFORMATION GAIN
IG(T, a) = H(T) – H(T|a)
In other words, take the entropy of the outcome and subtract the entropy after you’ve split on some criterion.
For example, if H(y) = .88 for purchases on the preceding slide, what is the entropy if you first break the data into income >= 50,000 and income < 50,000?
Information Gain – Worked Example
Shopped?   Income >= 50k?   Purchased?
Y          Y                Y
Y          Y                Y
Y          Y                Y
Y          Y                Y
Y          N                Y
Y          N                Y
Y          N                Y
Y          Y                N
Y          N                N
Y          N                N

IG, Shopped:
Entropy(Purchased) = .88
Entropy(Purchased | Shopped = Y) = .88
P(Shopped = Y) = 1
IG = .88 – (1 * .88) = 0

IG, Income:
Entropy(Purchased) = .88
Entropy(Purchased | Income >= 50k) = .72
Entropy(Purchased | Income < 50k) = .97
P(Income >= 50k) = .5
P(Income < 50k) = .5
IG = .88 – (.5 * .72 + .5 * .97)
IG = .88 – (.36 + .485)
IG = .035
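A short sketch (my addition) that recomputes both information gains from the ten rows above:

```python
# Recompute the information gains for the worked example, using the
# 10 rows reconstructed from the slide.
from math import log2

def entropy(labels):
    """H over the empirical distribution of the labels."""
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(rows, split_col, target_col):
    """IG(T, a) = H(T) - H(T|a), weighting each split value by its probability."""
    total = entropy([r[target_col] for r in rows])
    remainder = 0.0
    for value in set(r[split_col] for r in rows):
        subset = [r[target_col] for r in rows if r[split_col] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return total - remainder

# (shopped, income >= 50k, purchased) for the 10 customers on the slide
data = [("Y", "Y", "Y")] * 4 + [("Y", "N", "Y")] * 3 + \
       [("Y", "Y", "N")] + [("Y", "N", "N")] * 2
rows = [{"shopped": s, "income": i, "purchased": p} for s, i, p in data]

print(round(information_gain(rows, "shopped", "purchased"), 3))  # 0.0
print(round(information_gain(rows, "income", "purchased"), 3))   # 0.035
```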
Information Gain, Worked Example
Splitting the same table on Income >= 50k:

Income >= 50k   Purchased        Income >= 50k   Purchased
Y               Y                N               Y
Y               Y                N               Y
Y               Y                N               Y
Y               Y                N               N
Y               N                N               N
Why devote so much time to decision trees?
1. Many modern implementations, including ID3 and C4.5/C5.0
2. Intuitive, interpretable models
3. Fast, scalable
4. Can handle many dimensions, mixed categorical and numeric inputs
5. Can handle multiple-valued outputs, not just binary
6. Parametric – requires longer to train, scores very quickly
7. Sometimes called CART – Classification and Regression Trees. Can be used for regression, beyond the scope of this session.
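The hands-on exercise builds trees in RapidMiner; as a rough scikit-learn equivalent (an assumption, with synthetic data), a tree can be grown with entropy/information gain as its split criterion and its rules printed, which is where the interpretability comes from:

```python
# Sketch of a decision tree trained with entropy (information gain) splits.
# Assumes scikit-learn; the data are a synthetic stand-in.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned rules - interpretability is a key selling point.
print(export_text(tree, feature_names=[f"x{i}" for i in range(4)]))
```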
K-Means Clustering
1. K is a given – the number of clusters
2. Place K cluster centers randomly throughout the space, assigning all points to one of the K clusters
3. Calculate the centroid of the points assigned to each cluster (in all dimensions) and move the cluster centers to this new point
4. Calculate the distance of all points to the new centroids and reassign them, then repeat step 3
5. Repeat steps 3 and 4 until convergence (cluster centroids move very little from one iteration to the next)
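A minimal NumPy sketch of these steps (my addition; the two-dimensional data are an illustrative stand-in for the salesperson example from the clustering slide):

```python
# Minimal k-means sketch following the steps above. Assumes NumPy;
# the 2-D data loosely mimic days-as-a-salesperson vs. avg $ sales/day.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + rng.choice([0.0, 5.0], size=(200, 1))  # two loose blobs
k = 2

# Step 2: start with k cluster centers placed at randomly chosen points.
centers = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # Step 4: assign every point to its nearest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: move each center to the centroid of its assigned points.
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 5: stop when the centers barely move from one iteration to the next.
    if np.allclose(new_centers, centers, atol=1e-6):
        break
    centers = new_centers

print(centers)
```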
K-Means Clustering Example
Notice that the clusters are extremely imbalanced in Iteration 1.
Imagine the green points on the bottom left pulling that centroid down and left.
Likewise with blue.
Conclusion
Topics introduced:
1. Definition of Data Mining
2. Data format – inputs, outputs, dimensions
3. Problem Types and Paradigms
4. Model Performance Considerations, measuring and comparing models
5. K-Nearest Neighbors non-parametric classification
6. Decision Trees parametric classification
7. K-Means unsupervised clustering
Topics Not Covered
1. Much of classical statistics (p-values, ANOVA, statistical representativeness, sampling plans)
2. Preprocessing data (binning, normalizing, deriving new variables)
3. Segmentation for model performance
4. IT infrastructure concerns
5. Bayesian methods (Naïve Bayes, Bayes Nets)
6. Geospatial analytics
7. Social Network Analysis and Visualization
8. Recommender Systems
9. Regression, Time Series Analysis
10. Advanced techniques (two-stage modeling, etc.)
11. Many algorithms (Neural Nets, Support Vector Machines, etc.)
12. Operationalizing Data Mining