Tree methods

What is Data mining? (Ronny Kohavi, ICML 1998)

[Diagram: research question -> find data (internal databases, data warehouses, internet, online databases) -> data collection -> data processing -> data analysis -> extract information -> answer the research question.]
Outline

• Data Mining
• Methodology for CART
• Data mining trees, tree sketches
• Applications to clinical data:
  - Categorical response: OAB data
  - Continuous response: Geodon RCT data
• Robustness issues
Moore’s law:

• Processing “capacity” doubles every couple of years (exponential growth).
• Hard disk storage “capacity” doubles every 18 months (it used to be every 36 months).
• Bottlenecks are no longer processing speed.
• Processing capacity is not growing as fast as data acquisition.
Mining Data

50’s-80’s: EDA, data visualization.
1990, Ripley: “that is not statistics, that’s data mining.”
90’s-06: data mining: large datasets, EDA, DV, machine learning, vision, …

Example: biopharmaceutical area. Data repositories:
- Data from many sources: clinical, monitoring, marketing.
- The data are largely unexplored.
Data mining objective: “To extract valuable information.”
“To identify nuggets, clusters of observations in these data that contain potentially valuable information.”

Example: biopharmaceutical data:
- Extract new information from existing databases.
- Answer questions of clinicians and marketing.
- Help design new studies.
Data Mining Software and Recursive Partition

SOFTWARE:
• S-PLUS / Insightful Miner: Tree, CART, C4.5, BG, RF, BT
• R: rpart, CART, BG, RF, BT
• SAS / Enterprise Miner: CART, C4.5, CHAID, tree browser
• SPSS: CART, CHAID
• Clementine: CART, C4.5, Rule Finder
• HelixTree: CART, FIRM, BG, RF, BT
(BG = bagging, RF = random forests, BT = boosting.)
Recursive Partition (CART)

I. Dependent variable is categorical
• Classification trees, decision trees
Example: a doctor might have a rule for choosing which drug to prescribe to high-cholesterol patients.
[Figure: example decision tree splitting on "High blood pressure?", Age > 60 and Age < 30, with leaves Drug A and Drug B.]

II. Dependent variable is numerical
• Regression trees
Example: Dose = function(BMI, AGE)
[Figure: example regression tree on the (AGE, BMI) plane with splits at AGE < 18, AGE < 65 and BMI < 24 and predicted doses 20, 40 and 80.]
Classic Example of CART: Pima Indians Diabetes

• 768 Pima Indian females, 21+ years old; 268 tested positive for diabetes
• 8 predictors: PRG, PLASMA, BP, THICK, INSULIN, BODY, PEDIGREE, AGE
• OBJECTIVE: predict diabetes
BODY  29.9
Y
20
30
40
50
AGE
60
70
80
916.3
913.7
0.8
0.8
RESP
0.6
0.0
0.2
0.4
0.2
0.0
0.0
0.2
RESP
RESP
DIABETES
0.4 0.6 0.8
1.0
N
1.0
N
P(Diabetes)
35%
19%
61%
19%
49%
12%
44%
1.0
AGE  28
Y
N
N
768
485
283
367
401
222
546
0.6
Y
CART
993.5
854.3
0.4
PLASMA127
?
Node
Combined
PLASMA<=127
PLASMA>127
AGE<=28
AGE>28
BODY<=27.8
BODY>27.8
0
50
100
PLASMA
150
200
0
10 20 30 40 50 60
BODY
Classic Example of CART: Pima Indians Diabetes

CART Algorithm
• Grow the tree
• Stop when node sizes are small
• Prune the tree

[Figure: pruned rpart tree with root split PLASMA < 127.5 and further splits on AGE, BODY, PLASMA, PEDIGREE and BP; terminal nodes show the estimated probability of diabetes, ranging from 0.013 to 1.00.]
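The three steps above can be sketched with rpart. This is a hedged illustration, assuming the PimaIndiansDiabetes data from the mlbench package (its column names, e.g. glucose and mass, differ from the PLASMA and BODY labels used on these slides):

library(mlbench)     # assumed source of the Pima data
library(rpart)
data(PimaIndiansDiabetes)
fit <- rpart(diabetes ~ ., data = PimaIndiansDiabetes,
             control = rpart.control(minsplit = 20, cp = 0.01))   # grow, stop at small nodes
printcp(fit)                                    # cross-validated error by complexity
best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
fit2 <- prune(fit, cp = best)                   # prune back
plot(fit2, uniform = TRUE); text(fit2)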
CART criteria functions

For regression trees:
- Equal variances (CART): h = (N_L σ̂²_L + N_R σ̂²_R) / (N_L + N_R)
- Unequal variances: h = (N_L log σ̂²_L + N_R log σ̂²_R) / (N_L + N_R)

For classification trees:
- h = p_L min(p⁰_L, p¹_L) + p_R min(p⁰_R, p¹_R)
- h = -p_L (p⁰_L log p⁰_L + p¹_L log p¹_L) - p_R (p⁰_R log p⁰_R + p¹_R log p¹_R)   (C5)
- h = p_L p⁰_L p¹_L + p_R p⁰_R p¹_R   (CART)

Here N_L, N_R are the sizes of the left and right child nodes, σ̂²_L, σ̂²_R their response variances, p_L, p_R the proportions of observations sent left and right, and p⁰, p¹ the within-child proportions of class 0 and class 1.
DATA PREPROCESSING
RECOMMENDATIONS FOR TREES
a. Make sure that all the factors are declared as factors.
Sometimes factor variables are read into R as numeric or as character variables. Suppose that a variable RACE in a SAS dataset is coded as 1, 2, 3, 4, representing four race groups. We need to be sure that it was not read as a numeric variable, so we first check the types of the variables. We may use the functions “class” and “is.factor” combined with “sapply” in the following way:
sapply(w, is.factor)   or   sapply(w, class)
Suppose that the variable “x” is numeric when it is supposed to be a factor. Then we convert it into a factor:
w$x = factor(w$x)
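A hedged convenience sketch (not from the slides): convert every character column of a data frame w to a factor in one pass; numerically coded factors such as RACE still need the explicit factor() call above.

## convert all character columns of w to factors
w[] <- lapply(w, function(col) if (is.character(col)) factor(col) else col)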
b. Recode factors.
Sometimes the codes assigned to factor levels are very long phrases, and when those codes are inserted into the tree the resulting graph can be very messy. We prefer to use short words to represent the codes. To recode the factor levels you may use the function “f.recode”:
> levels(w$Muscle)
[1] ""                  "Mild Weakness"
[3] "Moderate Weakness" "Normal"
> musc = f.recode(w$Muscle, c("", "Mild", "Mod", "Norm"))
> w$Musclenew = musc
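If f.recode (from the course package) is not available, the same recoding can be done in base R; the new labels must be given in the order of the existing levels:

levels(w$Muscle)                          # check the current level order first
w$Musclenew <- w$Muscle
levels(w$Musclenew) <- c("", "Mild", "Mod", "Norm")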
Example: Hospital data

hospital = read.table("project2/hospital.txt", sep=",")
colnames(hospital) <- c("ZIP","HID","CITY","STATE","BEDS","RBEDS","OUTV","ADM","SIR",
  "SALESY","SALES12","HIP95","KNEE95","TH","TRAUMA","REHAB","HIP96",
  "KNEE96","FEMUR96")
hosp = hospital[,-c(1:4,10)]        # drop ZIP, HID, CITY, STATE and SALESY
hosp$TH = factor(hosp$TH)
hosp$TRAUMA = factor(hosp$TRAUMA)
hosp$REHAB = factor(hosp$REHAB)
library(rpart)
u <- rpart(log(1+SALES12) ~ ., data=hosp, control=rpart.control(cp=0.01))
plot(u); text(u)
u <- rpart(log(1+SALES12) ~ ., data=hosp, control=rpart.control(cp=0.001))
plot(u, uniform=TRUE); text(u)
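A hedged follow-up to the fit above, showing the usual rpart way to inspect the complexity table and prune back the larger tree (the cp value here is only illustrative):

printcp(u)                          # cross-validated error for each cp value
u2 <- prune(u, cp = 0.005)          # prune to a chosen complexity
plot(u2, uniform = TRUE); text(u2)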
Regression Tree for log(1+Sales)

HIP95 < 40.5 [Ave: 1.074, Effect: -0.76 ]
  HIP96 < 16.5 [Ave: 0.775, Effect: -0.298 ]
    RBEDS < 59 [Ave: 0.659, Effect: -0.117 ]
      HIP95 < 0.5 [Ave: 1.09, Effect: +0.431 ] -> 1.09
      HIP95 >= 0.5 [Ave: 0.551, Effect: -0.108 ]
        KNEE96 < 3.5 [Ave: 0.375, Effect: -0.175 ] -> 0.375
        KNEE96 >= 3.5 [Ave: 0.99, Effect: +0.439 ] -> 0.99
    RBEDS >= 59 [Ave: 1.948, Effect: +1.173 ] -> 1.948
  HIP96 >= 16.5 [Ave: 1.569, Effect: +0.495 ]
    FEMUR96 < 27.5 [Ave: 1.201, Effect: -0.368 ] -> 1.201
    FEMUR96 >= 27.5 [Ave: 1.784, Effect: +0.215 ] -> 1.784
HIP95 >= 40.5 [Ave: 2.969, Effect: +1.136 ]
  KNEE95 < 77.5 [Ave: 2.493, Effect: -0.475 ]
    BEDS < 217.5 [Ave: 2.128, Effect: -0.365 ] -> 2.128
    BEDS >= 217.5 [Ave: 2.841, Effect: +0.348 ]
      OUTV < 53937.5 [Ave: 3.108, Effect: +0.267 ] -> 3.108
      OUTV >= 53937.5 [Ave: 2.438, Effect: -0.404 ] -> 2.438
  KNEE95 >= 77.5 [Ave: 3.625, Effect: +0.656 ]
    SIR < 9451 [Ave: 3.213, Effect: -0.412 ] -> 3.213
    SIR >= 9451 [Ave: 3.979, Effect: +0.354 ] -> 3.979
Regression Tree

[Figure: plot of the rpart regression tree for log(1+SALES12), with splits on HIP95, HIP96, RBEDS, KNEE95, KNEE96, FEMUR96, BEDS, ADM, OUTV and SIR; terminal-node means range from about 0.38 to 3.98.]
Classification tree:

> data(tissue)
> gr = rep(1:3, c(11,11,19))
> x <- f.pca(f.toarray(tissue))$scores[,1:4]
> x = data.frame(x, gr=gr)
> library(rpart)
> tr = rpart(factor(gr)~., data=x)
> tr
n= 41
node), split, n, loss, yval, (yprob)
      * denotes terminal node
1) root 41 22 3 (0.26829268 0.26829268 0.46341463)
  2) PC3< -0.9359889 23 12 1 (0.47826087 0.47826087 0.04347826)
    4) PC2< -1.154355 12 1 1 (0.91666667 0.00000000 0.08333333) *
    5) PC2>=-1.154355 11 0 2 (0.00000000 1.00000000 0.00000000) *
  3) PC3>=-0.9359889 18 0 3 (0.00000000 0.00000000 1.00000000) *
> plot(tr)
> text(tr)

[Figure: the fitted tree, splitting on PC3 < -0.936 and then PC2 < -1.154, with leaves labeled 1, 2 and 3.]
Random Forest Algorithm
(a variant of bagging)

• Select ntree, the number of trees to grow, and mtry, a number no larger than the number of variables.
• For i = 1 to ntree:
  1. Draw a bootstrap sample from the data. Call the observations not in the bootstrap sample the "out-of-bag" data.
  2. Grow a "random" tree, where at each node the best split is chosen among mtry randomly selected variables. The tree is grown to maximum size and not pruned back.
  3. Use the tree to predict the out-of-bag data.
• In the end, use the predictions on the out-of-bag data to form majority votes.
• Prediction of test data is done by majority vote over the predictions from the ensemble of trees.
R package: randomForest, with a function also called randomForest (see the sketch below).
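A minimal usage sketch on the hospital data prepared earlier (hosp); ntree and mtry are the quantities described in the algorithm, and the values chosen here are only illustrative:

library(randomForest)
rf <- randomForest(log(1 + SALES12) ~ ., data = hosp,
                   ntree = 500, mtry = 4, importance = TRUE)
print(rf)                      # out-of-bag error summary
round(importance(rf), 2)       # variable importance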
Boosting (AdaBoost)

Input: data (x_i, y_i), i = 1, …, n; initial weights w_i = 1/n.
1. Fit a tree (or any other learning method) to the weighted data: h_1(x_i).
2. Calculate the weighted misclassification error E_1.
3. If E_1 > 0.5, stop and abort the loop.
4. Set b_1 = E_1 / (1 - E_1).
5. For i = 1, …, n: if h_1(x_i) = y_i, set w_i = w_i b_1; otherwise leave w_i unchanged.
6. Normalize the w_i's to add up to 1.
7. Go back to step 1 and repeat until there is no change in the prediction error.
R package: bagboost, with functions called bagboost and adaboost (a sketch of the weight-update loop follows).
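A minimal sketch of the weight-update loop above (not the bagboost package's own implementation), using rpart stumps as the weak learner on a 0/1 response; adaboost_sketch and adaboost_predict are illustrative names:

library(rpart)

adaboost_sketch <- function(x, y, M = 20) {
  d <- data.frame(y = factor(y), x)
  n <- nrow(d)
  w <- rep(1/n, n)                               # w_i = 1/n
  fits <- list(); alpha <- numeric(0)
  for (m in seq_len(M)) {
    fit  <- rpart(y ~ ., data = d, weights = w, method = "class",
                  control = rpart.control(maxdepth = 1))
    pred <- predict(fit, d, type = "class")
    E    <- sum(w * (pred != d$y)) / sum(w)      # weighted misclassification error
    if (E >= 0.5 || E == 0) break                # abort the loop
    b <- E / (1 - E)
    w <- ifelse(pred == d$y, w * b, w)           # downweight correctly classified cases
    w <- w / sum(w)                              # normalize
    fits[[m]] <- fit
    alpha[m]  <- log(1 / b)                      # vote weight of this learner
  }
  list(fits = fits, alpha = alpha)
}

## weighted majority vote over the ensemble (returns 0/1);
## newx must have the same predictor columns used for fitting
adaboost_predict <- function(model, newx) {
  newd  <- data.frame(newx)
  score <- rep(0, nrow(newd))
  for (m in seq_along(model$fits))
    score <- score + model$alpha[m] *
      (predict(model$fits[[m]], newd, type = "class") == "1")
  as.numeric(score > sum(model$alpha) / 2)
}

For example, with the xlearn/ylearn objects built on the next slide, u = adaboost_sketch(data.frame(xlearn[1:100,]), ylearn[1:100]) would fit a small ensemble.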
Boosting (AdaBoost): example code

i = sample(nrow(hospital), 1000, rep=F)
xlearn = f.toarray(hospital[, -c(1:4, 10:11)])   # drop ID columns and the sales variables
ylearn = 1*(hospital$SALES12 > 50)
xtest  = xlearn[i,]
xlearn = xlearn[-i,]
ytest  = ylearn[i]
ylearn = ylearn[-i]
## BOOSTING EXAMPLE
u = bagboost(xlearn[1:100,], ylearn[1:100], xtest, presel=0, mfinal=20)
summarize(u, ytest)
## RANDOM FOREST EXAMPLE
u = randomForest(xlearn[1:100,], ylearn[1:100], xtest, ytest)
# (with a numeric 0/1 response this grows a regression forest; wrap the responses
#  in factor() for a classification forest)
round(importance(u), 2)
Competing methods

Recursive partitioning:
- Finds the partition that best approximates the response.
- For moderate/large datasets the partition tree may be too big.

Bump hunting:
- Finds subsets that optimize some criterion.
- Subsets are more "robust".
- Not all interesting subsets are found.

Data mining trees:
- Paradigm for data mining: selection of interesting subsets.

[Figure: Var 1 vs. Var 2 plane partitioned by recursive partitioning, by bump hunting, and by a data mining tree; the data mining tree marks "high response" and "low response" regions and labels the rest "other data".]
Data Mining Trees
ARF (Active Region Finder)
[Figure: scatter plot of a 0/1 response y against x with an interval (a, b) highlighted.]

Naive thought: for the j-th descriptor variable x_j, an "interesting" subset {a < x_ji < b} is one such that
p = Prob[ Z = 1 | a < x_ji < b ]
is much larger than
π = Prob[ Z = 1 ].
T = (p - π)/p measures how interesting a subset is.
Add a penalty term to prevent selection of subsets that are too small or too large.
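A hedged sketch (not part of the ARF code) of this naive interestingness score for a candidate interval a < x <= b and a 0/1 response z; interest_T is an illustrative name:

interest_T <- function(x, z, a, b) {
  inside <- x > a & x <= b           # observations in the candidate subset
  p   <- mean(z[inside])             # Prob[Z = 1 | a < x <= b]
  pi0 <- mean(z)                     # Prob[Z = 1]
  (p - pi0) / p                      # T; undefined if the interval is empty or p = 0
}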
ARF algorithm diagram

Initialize:
- Create NodeList with one node = FullData.
- Set NodeType = FollowUp.
- Set CurrentNode = 1.

Loop:
1. If NodeType = Terminal, skip the node; otherwise split CurrentNode into a left, center and right bucket.
2. Left and right buckets: if BucketSize > Min, set NodeType = FollowUp, otherwise Terminal; if BucketSize > 0, add the node to NodeList.
3. Center bucket: test whether the split is significant (T or F); if BucketSize > Min, set NodeType = FollowUp, otherwise Terminal; add the bucket to NodeList.
4. Set CurrentNode = CurrentNode + 1.
5. If CurrentNode > LastNode, exit and print the report; otherwise go back to step 1.
Comparing CART & ARF

[Figure: three simulated examples of y plotted against x, comparing the subsets found by ARF and by CART.]
• ARF captures the subset with small variance (but not the rest).
• ARF captures interior subsets.
• CART needs both: a small within-subset variance and a mean difference relative to the rest.
Two Examples

• Point density is important.
• Subsets that are hidden in the middle.
[Figure: two scatter plots, one of a "Poor" outcome against Pain Scale and one of non-respondent status against DURATIL.]
Methodology

1. Methodology objective:
The data space is divided between
- high-response subsets,
- low-response subsets,
- other.
[Figure: Var 1 vs. Var 2 plane with "high response", "low response" and "other data" regions.]
2. Categorical responses: subsets that have a high response in one of the categories, measured by T = (p - π)/p.
3. Continuous responses: high mean response, measured by Z = (x̄ - μ)/σ_x̄.
4. Statistical significance should be based on the entire tree-building process.
5. Categorical predictors.
6. Data visualization.
7. PDF report.
Report

• Simple tree or tree sketch: only the statistically significant nodes.
• Full tree: all nodes.
• Table of numerical outputs: detailed statistics for each node.
• List of interesting subsets: the significant subsets.
• Conditional scatter plot (optional): data visualization.
How about outliers?

For regression trees:
- Popular belief: trees are not affected by outliers (they are robust).
- Outlier detection: run the data mining tree allowing for small buckets. For observation X_i in terminal node j, calculate the score
  Z_i = |X_i - Median| / MAD,
  using the median and MAD of the node.
Z_i counts how many (robust) standard deviations X_i lies from the center of its node; if Z_i > 3.5, X_i is flagged as an outlier, as in the sketch below.

[Figure: histograms of the nodes that capture the outliers (node for outlier n. 1, node for outliers n. 2 & 3, node for outlier n. 4), and the tree before ("Tree with Outliers") and after outlier removal.]
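A hedged sketch of this outlier score, illustrated with an rpart fit rather than the ARF tree itself (here the hospital regression tree u fitted earlier); nodes with very few observations can give a zero MAD and should be treated with care:

node_id <- u$where                                # terminal node of each observation
y       <- log(1 + hosp$SALES12)                  # response used to grow the tree
Z <- ave(y, node_id, FUN = function(v) abs(v - median(v)) / mad(v))
outliers <- which(Z > 3.5)                        # flagged observations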
Robustness issues

ISSUE: in regulatory environments outliers are rarely omitted.
Our method is easily adaptable to robust splits: compute a robust version of the criterion by replacing the mean and standard deviation with suitable estimators of location and scale,
  Z = (T - T_R) / σ_{T_R}.
Further work

Binary/categorical response:
- How do we think about robustness of trees?
- One outlier might not make any difference.
- 5%, 10% or more outliers could make a difference.

Alvir, Cabrera, Caridi and Nguyen (2006), Mining Clinical Trial Data.
R package: www.rci.rutgers.edu/DM/ARF_1.0.zip