Data Mining Tutorial


Handling the Data Deluge
Tools for the era of “Big Data”
D. A. Dickey
April 2012
Data Mining - What is it?
• Collection of methods for “big data”
• Fast methods
• Topics
– Association Analysis
– Trees (recursive splitting)
– Logistic Regression
– Neural Networks
– Nearest Neighbor
– Clustering
– Text Mining, Etc.
Association Analysis
• Market basket analysis
– What they’re doing when they scan your “VIP”
card at the grocery
– People who buy diapers tend to also buy
_________ (beer?)
– Just a matter of accounting but with new
terminology (of course  )
Does diaper purchase predict beer purchase?
• Contingency tables

DEPENDENT (yes):
                 Beer Yes   Beer No   Total
  Diapers: No        6         94       100
  Diapers: Yes      40         60       100

INDEPENDENT (no predictability):
                 Beer Yes   Beer No   Total
  Diapers: No       23         77       100
  Diapers: Yes      23         77       100
Terminology
• Baskets: ABC ACD BCD ADE BCE

  Rule X=>Y    Support = Pr{X and Y}    Confidence = Pr{Y|X}
  A => D              2/5                      2/3
  C => A              2/5                      2/4
  B&C => D            1/5                      1/3
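A minimal sketch (not from the original slides) of the same accounting in SAS, with the five toy baskets coded as 0/1 indicator columns:

data baskets;
  input id $ A B C D E;               /* 1 = item in basket, 0 = not */
  datalines;
ABC 1 1 1 0 0
ACD 1 0 1 1 0
BCD 0 1 1 1 0
ADE 1 0 0 1 1
BCE 0 1 1 0 1
;
run;

proc sql;                              /* rule A => D */
  select sum(A=1 and D=1)/count(*) as support,      /* Pr{A and D} = 2/5 */
         sum(A=1 and D=1)/sum(A=1)   as confidence   /* Pr{D|A}    = 2/3 */
  from baskets;
quit;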
Don’t be Fooled!
• Lift = Confidence / Expected Confidence if Independent

                  Checking No   Checking Yes   Total
  Saving No           500           3500        4000
  Saving Yes         1000           5000        6000
  Total             (1500)         (8500)     (10000)

Observed Confidence (savings => checking) is 5000/6000 = 83%
Sounds great but ….
Expect 8500/10000 = 85% if independent
Lift = 83/85 < 1.
Savings account holders actually LESS likely than others to have a checking account!!!
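The lift arithmetic, as a small sketch:

data lift;
  confidence = 5000/6000;            /* Pr{checking | savings} = 0.83 */
  expected   = 8500/10000;           /* Pr{checking} = 0.85 if independent */
  lift       = confidence/expected;  /* 0.98 < 1 */
  put confidence= expected= lift=;
run;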
Trees
• A “divisive” method (splits)
• Start with “root node” – all in one group
• Get splitting rules
• Response often binary
• Result is a “tree”
• Example: Loan Defaults
• Example: Framingham Heart Study
• Example: Automobile fatalities
Recursive Splitting
[Figure: the plane of X1 = debt-to-income ratio vs. X2 = age, recursively split into rectangles labeled Pr{default} = 0.0001, 0.003, 0.006, 0.008, and 0.012, with “default” and “no default” points plotted.]
Some Actual Data
• Framingham Heart Study
• First Stage Coronary Heart Disease
  – P{CHD} = Function of:
    • Age – no drug yet!
    • Cholesterol
    • Systolic BP
Example of a “tree”
[Figure: fitted tree starting from all 1615 patients; split #1 on Age, with a further split on Systolic BP, ending in “terminal nodes.”]
How to make splits?
• Which variable to use?
• Where to split?
– Cholesterol > ____
– Systolic BP > _____
• Goal: Pure “leaves” or “terminal nodes”
• Ideal split: Everyone with BP>x has
problems, nobody with BP<x has
problems
First review Chi Square test
• Contingency tables

DEPENDENT (yes):
             Heart Disease No   Heart Disease Yes   Total
  Low BP            95                   5            100
  High BP           55                  45            100

INDEPENDENT (no):
             Heart Disease No   Heart Disease Yes
  Low BP            75                  25
  High BP           75                  25
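A sketch of the same test in SAS, entering the DEPENDENT table as cell counts (data set and variable names are illustrative):

data bp_chd;
  input bp $ chd $ count;
  datalines;
Low  No  95
Low  Yes  5
High No  55
High Yes 45
;
run;

proc freq data=bp_chd;
  weight count;              /* each record is a cell count, not a person */
  tables bp*chd / chisq;     /* prints the chi-square statistic and its p-value */
run;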
χ² Test Statistic
• Expect 100(150/200) = 75 in upper left if independent (etc., e.g. 100(50/200) = 25)

  χ² = Σ over all cells (observed − expected)² / expected

             Heart Disease No   Heart Disease Yes   Total
  Low BP          95 (75)              5 (25)         100
  High BP         55 (75)             45 (25)         100
  Total             150                 50             200

WHERE IS HIGH BP CUTOFF???

  χ² = 2(400/75) + 2(400/25) = 42.67
  Compare to tables – Significant!
(Significant ???)
Conclusion: Sufficient evidence against the hypothesis of no relationship.

  Courtroom analogy:           Statistics:
  H0: Innocence                H0: No association
  H1: Guilt
  Beyond reasonable doubt      P < 0.05
                               Here P = 0.00000000064
Measuring “Worth” of a Split
• P-value is the probability of a Chi-square as great as that observed if independence is true. (Pr{χ² > 42.67} is 6.4E-11)
• P-values are all too small to compare conveniently.
• Logworth = −log10(p-value) = 10.19
• Best Chi-square ⇔ max logworth.
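A sketch reproducing the chi-square, its p-value, and the logworth in a DATA step:

data logworth;
  chi2 = 2*(400/75) + 2*(400/25);   /* 42.67 from the table above */
  p    = 1 - probchi(chi2, 1);      /* about 6.4E-11 (1 degree of freedom) */
  logworth = -log10(p);             /* about 10.19 */
  put chi2= p= logworth=;
run;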
Logworth for Age Splits
[Plot: logworth of the chi-square test for each candidate age split; Age 47 maximizes logworth.]
How to make splits?
• Which variable to use?
• Where to split?
– Cholesterol > ____
– Systolic BP > _____
• Idea – Pick BP cutoff to minimize p-value for χ²
• What does “significance” mean now?
Multiple testing
• 50 different BPs in data, 49 ways to split
• Sunday football highlights always look
good!
• If he shoots enough times, even a 95% free
throw shooter will miss.
• Tried 49 splits, each has 5% chance of
declaring significance even if there’s no
relationship.
Multiple testing
α = Pr{falsely reject hypothesis 1} (miss on 1st shot)
α = Pr{falsely reject hypothesis 2} (miss on 2nd shot)
Pr{falsely reject one or the other} < 2α
Desired: 0.05 probability or less
Solution: use α = 0.05/2
Or – compare 2(p-value) to 0.05
Multiple testing
• 50 different BPs in data, m = 49 ways to split
• Multiply p-value by 49
• Bonferroni – original idea
• Kass – apply to data mining (trees)
• Stop splitting if minimum p-value is large.
• For m splits, logworth becomes −log10(m × p-value) !!!
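A sketch of the Kass/Bonferroni adjustment for the best of m candidate splits:

data kass;
  m    = 49;                                 /* 49 ways to split 50 BP values */
  chi2 = 42.67;                              /* best split's chi-square */
  p_adj = min(1, m*(1 - probchi(chi2, 1)));  /* multiply the p-value by m */
  logworth = -log10(p_adj);                  /* compare this across candidate variables */
  put p_adj= logworth=;
run;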
Goals
• Split if diversity in parent “node” > summed diversities in child nodes
• Observations should be
  – Homogeneous (not diverse) within leaves
  – Different between leaves
  – Leaves should be diverse (different from each other)
Validation
• Traditional stats – small dataset, need all
observations to estimate parameters of
interest.
• Data mining – loads of data, can afford
“holdout sample”
• Variation: n-fold cross validation
– Randomly divide data into n sets
– Estimate on n-1, validate on 1
– Repeat n times, using each set as holdout.
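A minimal sketch of the fold assignment, assuming a data set named mydata:

data folds;
  set mydata;
  fold = 1 + mod(_n_, 10);    /* n = 10 folds; fit on 9, validate on the one held out */
  /* a random assignment such as fold = ceil(10*ranuni(2012)); also works */
run;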
Pruning
• Grow bushy tree on the “fit data”
• Classify holdout data
• Likely farthest out branches do not
improve, possibly hurt fit on holdout data
• Prune non-helpful branches.
• What is “helpful”? What is good
discriminator criterion?
Goals
• Want diversity in parent “node” > summed
diversities in child nodes
• Goal is to reduce diversity within leaves
• Goal is to maximize differences between
leaves
• Use validation average squared error,
proportion correct decisions, etc.
• Costs (profits) may enter the picture for
splitting or pruning.
Accounting for Costs
• Pardon me (sir, ma’am) can you spare
some change?
• Say “sir” to male +$2.00
• Say “ma’am” to female +$5.00
• Say “sir” to female -$1.00 (balm for
slapped face)
• Say “ma’am” to male -$10.00 (nose splint)
Including Probabilities
Leaf has Pr(M) = .7, Pr(F) = .3. You say:

                        Sir          Ma'am
  True Gender M      0.7 (2)       0.7 (−10)
  True Gender F      0.3 (−1)      0.3 (5)
  Expected profit    +$1.10        −$5.50

Expected profit is 2(0.7) − 1(0.3) = $1.10 if I say “sir”
Expected profit is −7 + 1.5 = −$5.50 (a loss) if I say “Ma'am”
Weight leaf profits by leaf size (# obsns.) and sum.
Prune (and split) to maximize profits.
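The expected-profit arithmetic for that leaf, as a sketch:

data profit;
  pM = 0.7;  pF = 0.3;
  profit_sir  = pM*2     + pF*(-1);   /*  $1.10 */
  profit_maam = pM*(-10) + pF*5;      /* -$5.50 */
  decision = ifc(profit_sir > profit_maam, "Sir", "Ma'am");
  put profit_sir= profit_maam= decision=;
run;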
Additional Ideas
• Forests – Draw samples with replacement
(bootstrap) and grow multiple trees.
• Random Forests – Randomly sample the
“features” (predictors) and build multiple
trees.
• Classify new point in each tree then
average the probabilities, or take a
plurality vote from the trees
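A sketch of the bootstrap resampling step, assuming a training data set named train; a tree would then be grown on each replicate:

proc surveyselect data=train method=urs samprate=1 reps=100
                  seed=2012 out=boot outhits;   /* URS = sampling with replacement */
run;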
Lift
• Cumulative Lift Chart – Go from leaf of most to least predicted response.
[Chart: cumulative lift, starting near 3.3 for the highest-predicted leaves and declining to 1 as the whole population is included; horizontal axis runs from high to low predicted response.]
• Lift = (proportion responding in first p%) / (overall population response rate)
Regression Trees
• Continuous response Y
• Predicted response Pi constant in regions i = 1, …, 5
[Figure: the (X1, X2) plane split into five rectangles with predictions 20, 50, 80, 100, and 130.]
• Predict Pi in cell i.
• Yij = jth response in cell i.
• Split to minimize the sum of squared prediction errors Σi Σj (Yij − Pi)²
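A sketch of scoring one candidate split (at x = 10, say) for hypothetical variables x and y in a data set named train:

data cells;
  set train;
  cell = (x > 10);                      /* 0/1 flag: which child node */
run;

proc means data=cells noprint nway;
  class cell;
  var y;
  output out=fits mean=pred css=sse;    /* css = within-cell sum of squares about the cell mean */
run;                                     /* total the sse column over the two cells to score the split */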
Real data example: Traffic accidents in Portugal*
Y = injury induced “cost to society”
[Figure: the fitted regression tree, captioned “Help – I ran into a ‘tree’.”]
* Tree developed by Guilhermina Torrao (used with permission), NCSU Institute for Transportation Research & Education
Another tool – Regression
“If the Life Line is long and deep, then this represents a long life full of vitality and health. A short line, if strong and deep, also shows great vitality in your life and the ability to overcome health problems. However, if the line is short and shallow, then your life may have the tendency to be controlled by others.”
http://www.ofesite.com/spirit/palm/lines/linelife.htm

Wilson & Mather JAMA 229 (1974)
X = life line length
Y = age at death
Result: Predicted Age at Death = 79.24 – 1.367(lifeline)
(Is this “real”??? Is this repeatable???)
We Use LEAST SQUARES
Squared residuals sum to 9609
proc reg data=life;
model age=line;
run;
Parameter Estimates

  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1        79.23341            14.83229        5.34     <.0001
  Line         1        -1.36697             1.59782       -0.86     0.3965

[Figure: t distribution with tail areas of 0.19825 beyond ±0.86; two-sided p-value 0.39650.]
Logistic Regression
• “Trees” seem to be main tool.
• Logistic – another classifier
• Older – “tried & true” method
• Predict probability of response from input variables (“Features”)
• Linear regression gives infinite range of predictions
• 0 < probability < 1, so not linear regression.
Example: Seat Fabric Ignition
• Flame exposure time = X
• Ignited Y=1, did not ignite Y=0
  – Y=0, X = 3, 5, 9, 10, 13, 16
  – Y=1, X = 11, 12, 14, 15, 17, 25, 30
• Q = (1−p1)(1−p2)(1−p3)(1−p4)p5p6(1−p7)p8p9(1−p10)p11p12p13
• p’s all different: pi = f(a + bXi)
• Find a, b to maximize Q(a,b)
• Logistic idea: Map p in (0,1) to L in whole real line
• Use L = ln(p/(1−p))
• Model L as linear in temperature, e.g.
• Predicted L = a + b(temperature)
• Given temperature X, compute L(X) = a + bX, then p = e^L/(1+e^L)
• p(i) = e^(a+bXi)/(1+e^(a+bXi))
• Write p(i) if response, 1−p(i) if not
• Multiply all n of these together, find a, b to maximize
Generate Q for array of (a,b) values

DATA LIKELIHOOD;
  ARRAY Y(14) Y1-Y14; ARRAY X(14) X1-X14;
  DO I=1 TO 14; INPUT X(I) Y(I) @@; END;   /* read the 14 (X,Y) pairs below     */
  DO A = -3 TO -2 BY .025;                 /* grid of intercepts                 */
    DO B = 0.2 TO 0.3 BY .0025;            /* grid of slopes                     */
      Q=1;
      DO I=1 TO 14;                        /* likelihood = product over points   */
        L=A+B*X(I); P=EXP(L)/(1+EXP(L));
        IF Y(I)=1 THEN Q=Q*P; ELSE Q=Q*(1-P);
      END;
      IF Q<0.0006 THEN Q=0.0006;           /* floor Q for plotting               */
      OUTPUT;
    END;
  END;
CARDS;
3 0 5 0 7 1 9 0 10 0 11 1 12 1 13 0 14 1 15 1 16 0 17 1
25 1 30 1
;
Likelihood function (Q)
[Surface plot of Q over the (a, b) grid, peaking near a = −2.6, b = 0.23.]
[Figure: illustration of a concordant pair vs. a discordant pair.]
IGNITION DATA
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates

  Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
  Intercept    1    -2.5879        1.8469           1.9633          0.1612
  TIME         1     0.2346        0.1502           2.4388          0.1184

Association of Predicted Probabilities and Observed Responses

  Percent Concordant   79.2      Somers' D   0.583
  Percent Discordant   20.8      Gamma       0.583
  Percent Tied          0.0      Tau-a       0.308
  Pairs                48        c           0.792
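These estimates could be obtained directly (rather than by the grid search above) with PROC LOGISTIC, assuming a data set named ignite with variables TIME and Y:

proc logistic data=ignite;
  model y(event='1') = time;    /* fits L = a + b*TIME with p = exp(L)/(1+exp(L)) */
run;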
Example: Shuttle Missions
• O-rings failed in Challenger disaster
• Low temperature
• Prior flights “erosion” and “blowby” in O-rings
• Feature: Temperature at liftoff
• Target: problem (1) – erosion or blowby vs. no problem (0)
Example: Framingham
• X=age
• Y=1 if heart trouble, 0 otherwise
Framingham
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates

  Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
  Intercept    1    -5.4639        0.5563          96.4711          <.0001
  age          1     0.0630        0.0110          32.6152          <.0001
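A sketch of scoring a new age with the printed estimates:

data score;
  a = -5.4639;  b = 0.0630;     /* estimates from the table above */
  age = 55;
  L = a + b*age;
  p = exp(L)/(1 + exp(L));      /* predicted P{CHD} at age 55, about 0.12 */
  put p=;
run;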
Neural Networks
• Very flexible functions
• “Hidden Layers”
• “Multilayer Perceptron”
[Diagram: inputs feed three hidden-layer logistic units p1, p2, p3; arrows with weights b1, b2, b3 combine them into the output Y. The output is a logistic function of logistic functions of the data; arrows represent linear combinations of “basis functions,” e.g. logistic curves (hyperbolic tangents).]
Example:
Y = a + b1 p1 + b2 p2 + b3 p3
Y = 4 + p1 + 2 p2 − 4 p3
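A sketch of that composition, with made-up hidden-layer weights just to show the idea:

data nnet;
  x1 = 1.5;  x2 = -0.4;                /* hypothetical inputs */
  z1 =  0.2 + 1.1*x1 - 0.7*x2;         /* made-up linear combinations of the inputs */
  z2 = -0.5 + 0.3*x1 + 0.9*x2;
  z3 =  1.0 - 0.8*x1 + 0.2*x2;
  p1 = exp(z1)/(1+exp(z1));            /* logistic "basis functions" */
  p2 = exp(z2)/(1+exp(z2));
  p3 = exp(z3)/(1+exp(z3));
  y  = 4 + p1 + 2*p2 - 4*p3;           /* the output-layer combination above */
  put y=;
run;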
Unsupervised Learning
• We have the “features” (predictors)
• We do NOT have the response even on a
training data set (UNsupervised)
• Clustering
– Agglomerative
• Start with each point separated
– Divisive
• Start with all points in one cluster then split
– Direct
• State # clusters beforehand
EM ⇒ PROC FASTCLUS
• Step 1 – find (50) “seeds” as separated as
possible
• Step 2 – cluster points to nearest seed
– Drift: As points are added, change seed
(centroid) to average of each coordinate
– Alternatively: Make full pass then recompute
seed and iterate.
• Step 3 – aggregate clusters using Ward’s
method
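A sketch of these steps, assuming a data set named points with coordinates x and y:

proc fastclus data=points maxclusters=50 out=step2 mean=seeds;  /* steps 1-2; add DRIFT to update seeds as points are assigned */
  var x y;
run;

proc cluster data=seeds method=ward outtree=tree;   /* step 3: Ward's method on the 50 cluster means */
  var x y;
run;

proc tree data=tree nclusters=2 out=final noprint;  /* cut the tree at a chosen number of clusters */
run;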
Clusters as Created / As Clustered – PROC FASTCLUS
[Two scatter plots: the simulated clusters as created, and the same points as clustered by PROC FASTCLUS.]
TEXT MINING
Hypothetical collection of news releases (“corpus”):
release 1: Did the NCAA investigate the basketball scores and vote for sanctions?
release 2: Republicans voted for and Democrats voted against it for the win.
(etc.)
Compute word counts:

              NCAA   basketball   score   vote   Republican   Democrat   win
  Release 1     1         1          1      1         0           0        0
  Release 2     0         0          0      2         1           1        1
Text Mining Mini-Example: Word counts in 14 e-mails

  ----------------------------------------------- words -----------------------------------------------
  document  Election  President  Republican  Basketball  Democrat  Voters  NCAA  Liar  Tournament  Speech  Wins  Score_V  Score_N
      1        20         8          10          12          6        0      1     5       3          8      18      15       21
      2         5         6           9           5          4        2      0     9       0         12      12       9        0
      3         0         2           0          14          0        2     12     0      16          4      24      19       30
      4         8         9           7           0         12       14      2    12       3         15      22       8        2
      5         0         0           4          16          0        0     15     2      17          3       9       0        1
      6        10         6           9           5          5       19      5    20       0         18      13       9       14
      7         2         3           1          13          0        1     12    13      20          0       0       1        6
      8         4         1           4          16          2        4      9     0      12          9       3       0        0
      9        26        13           9           2         16       20      6    24       4         30       9      10       14
     10        19        22          10          11          9       12      0    14      10         22       3       1        0
     11         2         0           0          14          1        3     12     0      16         12      17      23        8
     12        16        19          21           0         13        9      0    16       4         12       0       0        2
     13        14        17          12           0         20       19      0    12       5          9       6       1        4
     14         1         0           4          21          3        6      9     3       8          0       3      10       20
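A sketch of the principal components step (the data set name emails is illustrative); OUT= adds the Prin1, Prin2, … scores for each document:

proc princomp data=emails out=pcs;    /* uses the correlation matrix by default */
  var Election President Republican Basketball Democrat Voters NCAA
      Liar Tournament Speech Wins Score_V Score_N;
run;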
Eigenvalues of the Correlation Matrix

        Eigenvalue    Difference    Proportion    Cumulative
   1    7.10954264    4.80499109      0.5469        0.5469
   2    2.30455155    1.30162837      0.1773        0.7242
   3    1.00292318    0.23404351      0.0771        0.8013
   4    0.76887967    0.21070080      0.0591        0.8605
   5    0.55817886    0.10084923      0.0429        0.9034
   6    0.45732963    0.15563511      0.0352        0.9386
   7    0.30169451    0.13396581      0.0232        0.9618
   8    0.16772870    0.00501411      0.0129        0.9747
   9    0.16271459    0.04345658      0.0125        0.9872
  10    0.1192580     0.08890707      0.0092        0.9964
  11    0.0303509     0.01437903      0.0023        0.9987
  12    0.0159719     0.01509610      0.0012        0.9999
  13    0.0008758                     0.0001        1.0000

[Plot: documents in the (Prin1, Prin2) plane.]

55% of the variation in these 13-dimensional vectors occurs in one dimension.

  Variable      Prin1
  Basketball   -.320074
  NCAA         -.314093
  Tournament   -.277484
  Score_V      -.134625
  Score_N      -.120083
  Wins         -.080110
  Speech       0.273525
  Voters       0.294129
  Liar         0.309145
  Election     0.315647
  Republican   0.318973
  President    0.333439
  Democrat     0.336873
Prin1 coordinate = .707(word1) – .707(word2)
Documents sorted by Prin1, with cluster assignments (word-count columns as in the table above):

  document   CLUSTER     Prin1
      3         1      -3.63815
     11         1      -3.02803
      5         1      -2.98347
     14         1      -2.48381
      7         1      -2.37638
      8         1      -1.79370
                                  (biggest gap)
      1         2      -0.00738
      2         2       0.48514
      6         2       1.54559
      4         2       1.59833
     10         2       2.49069
     13         2       3.16620
     12         2       3.48420
      9         2       3.54077
PROC CLUSTER (single linkage) agrees !
[Plot of Prin1 (vertical) vs. Prin2 (horizontal), points labeled by document number: Cluster 1 (documents 3, 5, 7, 8, 11, 14) sits at Prin1 below about −1.8; Cluster 2 (documents 1, 2, 4, 6, 9, 10, 12, 13) sits at Prin1 near 0 and above.]
Can use two, three or more components (dimensions)
• Should always use holdout sample
• Perturb coefficients to optimize fit (fit data)
– Nonlinear search algorithms
• Eliminate unnecessary complexity using
holdout data.
• Other basis sets
– Radial Basis Functions
– Just normal densities (bell shaped) with
adjustable means and variances.
Statistics to Data Mining Dictionary

  Statistics (nerdy)                              Data Mining (cool)
  Independent variables                           Features
  Dependent variable                              Target
  Estimation                                      Training, Supervised Learning
  Clustering                                      Unsupervised Learning
  Prediction                                      Scoring
  Slopes, Betas                                   Weights (Neural nets)
  Intercept                                       Bias (Neural nets)
  Composition of Hyperbolic Tangent Functions     Neural Network
  Normal Density                                  Radial Basis Function

  and my personal favorite…

  Type I and Type II Errors                       Confusion Matrix
Summary
• Data mining – a set of fast stat methods for
large data sets
• Some new ideas, many old or extensions of old
• Some methods:
– Trees (recursive splitting)
– Logistic Regression
– Neural Networks
– Association Analysis
– Nearest Neighbor
– Clustering
– Etc.
D.A.D.
Cubic Clustering Criterion
(to decide # of Clusters)
• Divide random scatter of (X,Y) points into
4 quadrants
• Pooled within cluster variation much less
than overall variation
• Large variance reduction
• Big R-square despite no real clusters
• CCC compares random scatter R-square
to what you got to decide #clusters
• 3 clusters for “macaroni” data.
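A sketch, assuming a data set named points with coordinates x and y:

proc cluster data=points method=ward ccc pseudo outtree=tree;  /* CCC printed for each number of clusters */
  var x y;
run;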