Transcript Slides

Biostatistics Case Studies 2008
Session 6:
Classification Trees
Peter D. Christenson
Biostatistician
http://gcrc.LAbiomed.org/Biostat
Case Study
Goal of paper: Classify subjects as IR or non-IR
using subject characteristics other than a definitive
IR measure such as the clamp:
Gender, weight, lean body mass, BMI, waist and hip
circumferences, LDL, HDL, Total Chol, triglycerides,
FFA, DBP, SBP, fasting insulin and glucose, HOMA,
family history of diabetes, and some derived ratios
from these.
Major Conclusion Using All Predictors
p 336, 1st column:
[Figure: partition of the HOMA (cutpoints 3.60 and 4.65) by BMI (cutpoints 27.5 and 28.9) plane into IR and non-IR regions.]
Broad Approaches to Classification
1. Regression, discriminant analysis - modeling.
2. Cluster analyses - geometric.
3. Trees - partitioning.
Overview of Classification Trees
General concept, based on subgroupings:
1. Form combinations of subgroups according to
High or Low on each characteristic.
2. Find actual IR rates in each subgroup.
3. Combine subgroups that give similar IR rates.
4. Classify as IR if IR rate is large enough.
Notes:
1. No model or statistical assumptions.
2. And so no p-values.
3. Many options are involved in grouping details.
4. Actually implemented hierarchically – next slide.
Figure 2
[Hierarchical tree diagram from the paper; terminal groups are labeled “Classify as IR” or “Classify as non-IR”.]
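As a rough illustration of steps 1 and 2 from the overview (not code from the paper or the slides), the sketch below forms High/Low subgroups on two characteristics and tabulates the IR rate in each combination; the data frame, cutpoints, and column names are all hypothetical.

```python
# Illustrative sketch only: hypothetical data, arbitrary High/Low cutpoints.
import pandas as pd

df = pd.DataFrame({
    "BMI":  [24, 31, 29, 35, 22, 33, 27, 30],
    "HOMA": [3.2, 5.1, 4.8, 6.0, 2.9, 5.5, 3.9, 4.7],
    "IR":   [0,   1,   1,   1,   0,   1,   0,   1],
})

# Step 1: form High/Low combinations on each characteristic.
df["BMI_grp"]  = (df["BMI"]  > 28).map({True: "High", False: "Low"})
df["HOMA_grp"] = (df["HOMA"] > 4.6).map({True: "High", False: "Low"})

# Step 2: actual IR rate in each subgroup combination.
print(df.groupby(["BMI_grp", "HOMA_grp"])["IR"].mean())

# Steps 3-4: subgroups with similar rates would be combined, and a subgroup
# is classified as IR if its rate is large enough.
```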
Alternative: Logistic Regression
1. Find equation:
Prob(IR) = function(w1*BMI + w2*HOMA + w3*LDL +...)
where the w’s are weights (coefficients).
2. Classify as IR if Prob(IR) is large enough.
Note:
• Assumes specific statistical model.
• Gives p-values (depends on model being
correct).
• Need to use Prob(IR), which is very model-dependent, unlike High/Low categorizations.
Trees or Logistic Regression?
Logistic:
• Not originally designed for classifying, but for finding Prob(IR).
• Requires specification of predictor interrelations, either known or through data examination; thus, not as flexible.
• Dependent on correct model.
• Can prove whether predictors are associated with IR.
Trees:
• Designed for classifying.
• Interrelations not pre-specified, but detected in the analysis.
• Does not prove associations “beyond reasonable doubt”, as provided in regression.
Outline of this Session
Use simulated data to:
• Classify via logistic regression using only one
predictor.
• Classify via trees using one predictor.
• Show that results are identical.
• Show how results differ when 2 predictors are
used.
IR and HOMA
[Scatter plot: observed proportion IR (0.0–0.9) vs. HOMA (3.0–9.0).]
Simulated Data: N=2138 with IR rate increasing with HOMA
as in actual data in paper. Overall IR rate = 700/2138 =
33%.
IR and HOMA: Logistic Fit
[Plot: the same data with the fitted logistic curve of Prob(IR) vs. HOMA (3.0–9.0) overlaid. A logistic curve has this sigmoidal shape.]
Fitted logistic model predicts probability of IR as:
Prob(IR) = e^u/(1 + e^u), where u = -4.83 + 0.933(HOMA)
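For concreteness, here is a minimal sketch (in Python, not the software used for the slides) that simulates data with IR risk increasing in HOMA and fits this logistic model; the simulated HOMA distribution is illustrative only, and the slide’s coefficients are used as the true values.

```python
# Minimal sketch: simulate IR vs. HOMA data and fit the logistic model.
# The HOMA distribution below is illustrative, not the paper's.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2138
homa = rng.uniform(3.0, 9.0, n)               # hypothetical HOMA values
u_true = -4.83 + 0.933 * homa                 # coefficients taken from this slide
ir = rng.binomial(1, np.exp(u_true) / (1 + np.exp(u_true)))   # 0/1 IR indicator

X = sm.add_constant(homa)                     # intercept + HOMA
fit = sm.Logit(ir, X).fit(disp=0)
print(fit.params)                             # estimates should be near (-4.83, 0.933)

# Prob(IR) = e^u / (1 + e^u) at, say, HOMA = 5.0:
u_hat = fit.params[0] + fit.params[1] * 5.0
print(np.exp(u_hat) / (1 + np.exp(u_hat)))
```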
Using Logistic Model for Classification
• The logistic model proves that the risk of IR ↑ as HOMA ↑ (significance of the coefficient 0.933 is p<0.0001).
• How can we classify as IR or not based on
HOMA?
• Use Prob(IR).
– Need cutpoint c so that we classify as:
• If Prob(IR) > c then classify as IR
• If Prob(IR) ≤ c then classify as non-IR
• Regression does not supply c. It is chosen to
balance sensitivity and specificity.
IR and HOMA: Logistic with Arbitrary Cutpoint
[Plot: fitted logistic curve of Prob(IR) vs. HOMA (3.0–9.0) with a horizontal cutpoint at Prob(IR) = 0.50. Above the cutpoint (“Assign IR”): actual IR N=440, non-IR N=99. Below the cutpoint (“Assign non-IR”): actual IR N=260, non-IR N=1339.]
If cutpoint c=0.50 is chosen, then we have:
Sensitivity = 440/(440+260) = 62.9%
Specificity = 1339/(1339+99) = 93.1%
IR and HOMA: Logistic with Other Cutpoints
From SAS (“Event” = IR):
Classification Table

 Prob    ----Correct----   ---Incorrect---   ----------- Percentages -----------
 Level   Event  NonEvent   Event  NonEvent   Correct  Sensi-  Speci-  False  False
                                                      tivity  ficity   POS    NEG
 0.100    700       0       1438      0        32.7   100.0     0.0   67.3     .
 0.200    567    1181        257    133        81.8    81.0    82.1   31.2   10.1
 0.300    521    1277        161    179        84.1    74.4    88.8   23.6   12.3
 0.400    485    1325        113    215        84.7    69.3    92.1   18.9   14.0
 0.500    440    1339         99    260        83.2    62.9    93.1   18.4   16.3
 0.600    386    1354         84    314        81.4    55.1    94.2   17.9   18.8
 0.700    331    1363         75    369        79.2    47.3    94.8   18.5   21.3
 0.800    272    1376         62    428        77.1    38.9    95.7   18.6   23.7
 0.900    171    1404         34    529        73.7    24.4    97.6   16.6   27.4

Often, the overall percentage correct is used to choose the
“optimal” cutpoint. Here, that gives cutpoint=0.37, with %
correct=85.2%, sensitivity=71.7% and specificity=91.7%.
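A short sketch of how such a classification table can be computed, continuing the simulation and logistic fit sketched earlier (so `fit`, `X`, and `ir` come from that block); the cutpoint grid and output format are mine, not SAS’s.

```python
# Sketch: scan Prob(IR) cutpoints and compute % correct, sensitivity, specificity.
# Reuses fit, X, ir from the earlier logistic-fit sketch.
import numpy as np

prob_ir = fit.predict(X)                      # fitted Prob(IR) for each subject
for c in np.arange(0.1, 1.0, 0.1):
    pred = (prob_ir > c).astype(int)          # classify as IR if Prob(IR) > c
    sens = np.mean(pred[ir == 1] == 1)        # fraction of actual IR classified as IR
    spec = np.mean(pred[ir == 0] == 0)        # fraction of actual non-IR classified as non-IR
    correct = np.mean(pred == ir)             # overall proportion correct
    print(f"c={c:.1f}  correct={correct:.3f}  sens={sens:.3f}  spec={spec:.3f}")
```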
Using Classification Trees with One Predictor
• Choose every possible HOMA value and find sensitivity and specificity (as in creating a ROC curve).
• Assign relative weights to sensitivity and specificity, often equal, as previously.
• Need a cutpoint h so that we classify as:
  • If HOMA > h then classify as IR
  • If HOMA ≤ h then classify as non-IR
IR and HOMA: Trees with Other Cutpoints
 HOMA    ----Correct----   ---Incorrect---   ----------- Percentages -----------
 Level   Event  NonEvent   Event  NonEvent   Correct  Sensi-  Speci-  False  False
                                                      tivity  ficity   POS    NEG
 3.5      588    1117        321    112        79.7    84.0    77.7   35.3    9.1
 4.0      542    1237        201    158        83.2    77.4    86.0   27.1   11.3
 4.5      506    1302        136    194        84.6    72.3    90.5   21.2   13.0
 5.0      458    1333        105    242        83.8    65.4    92.7   18.7   15.4
 5.5      400    1350         88    300        81.9    57.1    93.9   18.0   18.2
 6.0      338    1363         75    362        79.6    48.3    94.8   18.2   21.0
 6.5      286    1369         69    414        77.4    40.9    95.2   19.4   23.2
 7.0      233    1390         48    467        75.9    33.3    96.7   17.1   25.1
 7.5      174    1403         35    526        73.8    24.9    97.6   16.7   27.3
 8.0      123    1416         22    577        72.0    17.6    98.5   15.2   29.0
 8.5       60    1428         10    640        69.6     8.6    99.3   14.3   30.9
 9.0        0    1438          0    700        67.3     0.0   100.0     .    32.7
If the overall percentage correct is used to choose the
“optimal” cutpoint, then cutpoint=4.61, with % correct=85.2%,
sensitivity=71.7% and specificity=91.7%.
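The same kind of single split can be obtained with standard tree software. Below is a minimal scikit-learn sketch (using `homa` and `ir` from the earlier simulation sketch); note that it splits by Gini impurity rather than the percent-correct criterion above, so on other data the chosen cutpoint need not equal 4.61.

```python
# Sketch: a one-split ("stump") classification tree on HOMA alone.
# homa and ir come from the earlier simulation sketch.
from sklearn.tree import DecisionTreeClassifier, export_text

stump = DecisionTreeClassifier(max_depth=1)            # a single split
stump.fit(homa.reshape(-1, 1), ir)
print(export_text(stump, feature_names=["HOMA"]))      # shows the chosen HOMA cutpoint
```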
IR and HOMA: Final, Simple Tree
All subjects: 700/2138 = 32.7% IR
  HOMA ≤ 4.61: 198/1517 = 13.1% IR
  HOMA > 4.61: 502/621 = 80.8% IR
This is exactly the result from the logistic regression,
since the logistic function is monotone in HOMA. See
next slide.
IR and HOMA: Logistic equivalent to Tree
[Plot: the fitted logistic curve of Prob(IR) vs. HOMA (3.0–9.0) with a vertical cutpoint at HOMA = 4.61. Assign IR (HOMA > 4.61): actual IR N=502, non-IR N=119. Assign non-IR (HOMA ≤ 4.61): actual IR N=198, non-IR N=1319.]
Summary: Classifying IR from HOMA
One Predictor: Same classification with Trees and Logistic, because:

[Plot: fitted logistic curve of Prob(IR) vs. HOMA (3.0–9.0), as above.]

Logistic regression gives
Prob(IR) = e^u/(1 + e^u), where u = -4.83 + 0.933(HOMA),
so Prob(IR) large is equivalent to HOMA large.
This is not the case with 2 predictors, as in next slide.
IR Rate: BMI and HOMA
[3-D plot: % IR (0–100) over the BMI (20–40) by HOMA (3–9) plane.]
1. %IR increases non-smoothly with both HOMA and BMI.
2. Logistic regression fits a smooth surface to these %s.
Classifying IR from HOMA and BMI
Two Predictors: Different classification with Trees and Logistic, because:

[3-D plot: % IR over the BMI (20–40) by HOMA (3–9) plane, as on the previous slide.]
Logistic regression gives
Prob(IR) = e^u/(1 + e^u), where u = -6.51 + 0.87(HOMA) + 0.07(BMI),
so Prob(IR) increases smoothly as 0.87(HOMA) + 0.07(BMI) does.
Trees allow different IR rates in (HOMA , BMI) subgroups.
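To make the contrast concrete, a rough sketch continuing the earlier simulation: a hypothetical BMI variable is added and the outcome is re-simulated from the two-predictor coefficients shown on this slide; the logistic fit gives one smooth score in HOMA and BMI, while a small tree carves rectangular HOMA-BMI subgroups. This is illustration only and does not reproduce the paper’s data.

```python
# Sketch: two predictors. bmi is hypothetical; ir2 is re-simulated from the
# two-predictor logistic coefficients on this slide, for illustration only.
import numpy as np
import statsmodels.api as sm
from sklearn.tree import DecisionTreeClassifier, export_text

bmi = rng.uniform(20.0, 40.0, n)                       # rng, n, homa from the earlier sketch
u2 = -6.51 + 0.87 * homa + 0.07 * bmi
ir2 = rng.binomial(1, np.exp(u2) / (1 + np.exp(u2)))

# Logistic: one smooth score in HOMA and BMI.
logit2 = sm.Logit(ir2, sm.add_constant(np.column_stack([homa, bmi]))).fit(disp=0)
print(logit2.params)

# Tree: rectangular HOMA-BMI subgroups, each with its own IR rate.
tree2 = DecisionTreeClassifier(max_depth=2).fit(np.column_stack([homa, bmi]), ir2)
print(export_text(tree2, feature_names=["HOMA", "BMI"]))
```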
Classifying IR from HOMA and BMI: Logistic
Logistic regression forces a smooth partition such as the
following, although adding HOMA-BMI interaction could
give curvature to the demarcation line.
[Figure: HOMA (reference lines at 3.60 and 4.65) vs. BMI (reference lines at 27.5 and 28.9), with a straight demarcation line defined by 0.87(HOMA) + 0.07(BMI) = cutpoint; IR is assigned above the line and non-IR below.]
Compare this to the tree partitioning on the next slide.
Classifying IR from HOMA and BMI: Trees
Trees partition HOMA-BMI combinations into subgroups,
some of which are then combined as IR and non-IR.
[Figure: rectangular partition of the same HOMA (3.60, 4.65) by BMI (27.5, 28.9) plane into subgroups, some combined as IR and others as non-IR.]
We now consider the steps and options that need to be
specified in a tree analysis.
Implausible Biological Conclusions from Trees
[Figure: the same rectangular HOMA (3.60, 4.65) by BMI (27.5, 28.9) tree partition as on the previous slide.]
Potential Logistic Modeling Inadequacy
[Two plots: a 3-D surface of IR rate over the BMI (20–40) by HOMA (3–9) plane (“IR Rate: BMI and HOMA”), annotated “Data may not be fit well by” the logistic regression below; and a plot of proportion insulin resistant vs. HOMA (3.0–9.0), annotated “Data does follow logistic curve.”]

Logistic regression:
Prob(IR) = e^u/(1 + e^u), where u = -6.51 + 0.87(HOMA) + 0.07(BMI),
so Prob(IR) increases smoothly as 0.87(HOMA) + 0.07(BMI) does.
Classification Tree Software
• CART from www.salford-systems.com.
• Statistica from www.statsoft.com.
• SAS: in the Enterprise Miner module.
• SPSS: has Salford CART in their Clementine Data Mining module.
Remaining Slides:
Overview of Decisions Needed for
Classification Trees
Classification Tree Steps
There are several flavors of tree methods, each with
many options, but most involve:
• Specifying criteria for predictive accuracy.
• Tree building.
• Tree building stopping rules.
• Pruning.
• Cross-validation.
Specifying criteria for predictive accuracy
• Misclassification cost generalizes the concept of misclassification rates so that some types of misclassification are given greater weight.
• Relative weights, or costs, are assigned to each type of misclassification.
• A prior probability of each outcome is specified, usually as the observed prevalence of the outcome in the data, but it could come from previous research or from other populations.
• The costs and priors together give the criteria for balancing specificity and sensitivity. Observed prevalence and equal weights → minimizing overall misclassification.
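As a loose analogue (not CART’s exact cost/prior mechanism), scikit-learn expresses unequal misclassification weights through `class_weight`; the sketch below makes a missed IR subject count three times as much as a false positive, with the 3:1 ratio chosen arbitrarily for illustration.

```python
# Sketch: unequal misclassification weights via class_weight (an analogue of
# CART's costs/priors, not the same mechanism). The 3:1 ratio is arbitrary.
from sklearn.tree import DecisionTreeClassifier

# Equal weights + observed prevalence ~ minimizing overall misclassification.
tree_equal = DecisionTreeClassifier()

# Make a false negative (missed IR, class 1) three times as costly as a false positive.
tree_costed = DecisionTreeClassifier(class_weight={0: 1.0, 1: 3.0})
```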
Tree Building
• Recursively apply what we did for HOMA to each of the two resulting partitions, then to the next set, etc.
• Every factor is screened at every step. The same factor may be reused.
• Some algorithms allow certain linear combinations of factors (e.g., as logistic regression provides, called discriminant functions) to be screened.
• An “impurity measure” or “splitting function” specifies the criterion for measuring how different two potential new subgroups are. Some choices are “Gini”, chi-square, and G-square.
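A worked sketch of the Gini impurity for a binary outcome, evaluated (for illustration) at the HOMA ≤ 4.61 split from the one-predictor tree above; the function names are mine.

```python
# Sketch: Gini impurity of a node and the size-weighted impurity after a split.
def gini(p):
    # Gini impurity for a binary outcome with event proportion p: 2p(1-p).
    return 2 * p * (1 - p)

def split_impurity(n_left, p_left, n_right, p_right):
    # Size-weighted average impurity of the two subgroups created by a split.
    n = n_left + n_right
    return (n_left / n) * gini(p_left) + (n_right / n) * gini(p_right)

# The HOMA <= 4.61 split from the earlier slide:
# 13.1% IR in 1517 subjects vs. 80.8% IR in 621 subjects.
print(gini(700 / 2138))                          # impurity of the root (~0.44)
print(split_impurity(1517, 0.131, 621, 0.808))   # lower impurity (~0.25) after splitting
```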
Tree Building Stopping Rules
• It is possible to continue splitting and building a tree until all subgroups are “pure”, with only one type of outcome. This may be too fine to be useful.
• One alternative is “minimum N”: allow only subgroups that are pure or that contain at least a minimum number of subjects.
• Another choice is “fraction of objects”, in which splitting stops once a minimum fraction of an outcome class, or a pure class, is obtained.
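In scikit-learn terms (closest analogues, not exactly the rules above), minimum-size stopping can be expressed either as a count or as a fraction of the data, as sketched below.

```python
# Sketch: minimum-N style stopping rules as exposed by scikit-learn.
from sklearn.tree import DecisionTreeClassifier

tree_min_n = DecisionTreeClassifier(
    min_samples_split=50,   # do not try to split a subgroup of fewer than 50 subjects
    min_samples_leaf=20,    # every resulting subgroup must contain at least 20 subjects
)

# A fraction can be used instead of a count:
tree_min_frac = DecisionTreeClassifier(min_samples_leaf=0.05)   # at least 5% of the data per leaf
```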
Tree Pruning
• Pruning tries to solve the problem of lack of generalizability due to over-fitting the results to the data at hand.
• Start at the latest splits and measure the magnitude of the reduction in misclassification due to each split. Remove the split if it is not large.
• How large is “not large”? This can be made at least objective, if not foolproof, by a complexity parameter related to the depth of the tree, i.e., the number of levels of splits. Combining that with the misclassification cost function gives “cost-complexity pruning”, which is used in this paper.
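scikit-learn implements minimal cost-complexity pruning through `ccp_alpha`; a minimal sketch (reusing `homa`, `bmi`, and `ir2` from the two-predictor sketch) is below. Larger values of the complexity parameter prune away more splits.

```python
# Sketch: cost-complexity pruning via ccp_alpha (reusing homa, bmi, ir2 from above).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X2d = np.column_stack([homa, bmi])
full = DecisionTreeClassifier().fit(X2d, ir2)            # grown with no pruning

path = full.cost_complexity_pruning_path(X2d, ir2)       # candidate complexity parameters
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]       # pick one, for illustration

pruned = DecisionTreeClassifier(ccp_alpha=alpha).fit(X2d, ir2)
print(full.get_n_leaves(), pruned.get_n_leaves())        # pruning reduces the number of leaves
```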
Cross Validation
• At least two data sets are used. The decision rule is built with the training set(s) and applied to the test set(s).
• If the misclassification costs for the test sets are similar to those for the training sets, then the decision rule is considered “validated”.
• With large datasets, as in business data mining, only one training set and one test set are used.
• For smaller datasets, “v-fold cross-validation” is used. The data are randomly split into v sets. Each set serves as the test set once, with the combined remaining v-1 sets as the training set, and serves v-1 times as part of the training set, for v analyses in all. The average cost is compared to that for the entire data set.
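A minimal sketch of v-fold cross-validation (v = 10) with scikit-learn, again reusing the two-predictor sketch’s data; plain accuracy stands in for a general misclassification cost.

```python
# Sketch: 10-fold cross-validation of a small tree (homa, bmi, ir2 from above).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X2d = np.column_stack([homa, bmi])
tree = DecisionTreeClassifier(max_depth=3)

scores = cross_val_score(tree, X2d, ir2, cv=10)   # each of the 10 sets is the test set once
print(scores.mean(), scores.std())                # compare to the accuracy on the full data
```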