“BOF” Trees Diagram as a Visual Way to Improve Interpretability of Tree Ensembles
Vesna Luzar-Stiffler, Ph.D.
University Computing Centre, and CAIR Research Centre,
Zagreb, Croatia
Charles Stiffler, Ph.D.
CAIR Research Centre, Zagreb, Croatia
BOF Trees Visualization  Zagreb, June 12, 2004
Outline
Introduction/Background
  Trees
  Ensemble Trees
  Visualization Tools
Simulation Results
Web Survey Results
Conclusions/Recommendations
Introduction / Background
Classification / Decision Trees
  Data mining (statistical learning) method for classification
  Invented twice:
    Statistical community: Breiman, Friedman, et al. (1984)
    Machine Learning community: Quinlan (1986)
  Many positive features:
    Interpretability, ability to handle data of mixed type and missing values, robustness to outliers, etc.
  Disadvantages:
    Unstable vis-à-vis seemingly minor data perturbations
    Low predictive power
Introduction / Background
Possible improvements: Ensembles
  Bagging, i.e., bootstrapping trees (Breiman, 1996)
  Boosting, e.g., AdaBoost (Freund & Schapire, 1997)
  Random Forests (Breiman, 2001)
  Stacking, randomized trees, etc.
Advantage:
  Improved prediction
Disadvantage:
  Loss of interpretability (“black box”)
Classification Tree
Let $\hat{f}(x)$ be the classification tree prediction at input $x$ obtained from the full “training” data $Z = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$.
Bagging Classification Tree
Let $\hat{f}^{*b}(x)$ be the classification tree prediction at input $x$ obtained from the bootstrap sample $Z^{*b}$, $b = 1, 2, \ldots, B$.
Bagging estimate:
$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$
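The bagging estimate can be illustrated with a minimal sketch. It is not the authors' implementation: to stay short it uses depth-1 trees (stumps) chosen by misclassification count, 0/1 class labels, and the standard library only. The names `fit_stump`, `predict_stump`, and `bagged_predict` are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a depth-1 classification tree by minimizing the misclassification count.

    Returns (variable index, threshold, left class, right class).
    """
    n, p = len(X), len(X[0])
    best = None
    for j in range(p):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue
            cl = Counter(left).most_common(1)[0][0]
            cr = Counter(right).most_common(1)[0][0]
            err = sum(v != cl for v in left) + sum(v != cr for v in right)
            if best is None or err < best[0]:
                best = (err, j, t, cl, cr)
    if best is None:
        # No valid split exists (all rows identical): predict the majority class.
        c = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), c, c)
    return best[1:]


def predict_stump(stump, x):
    j, t, cl, cr = stump
    return cl if x[j] <= t else cr


def bagged_predict(X, y, x_new, B=25, seed=0):
    """Average the predictions of B stumps, each fitted to a bootstrap sample."""
    rng = random.Random(seed)
    n = len(X)
    votes = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample Z*b
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        votes.append(predict_stump(stump, x_new))
    return sum(votes) / B  # (1/B) * sum over b of f*b(x_new)
```

With 0/1 labels the returned value is the share of bootstrap trees voting for class 1, so thresholding it at 0.5 gives a majority-vote classification.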
Visualization tools
Graphs based on predictor “importances”: (B×p) matrix F (p = number of predictors)
For bagged trees, we take the average: $\hat{I}_k^2 = \frac{1}{B} \sum_{b=1}^{B} \hat{I}_k^2(T_b)$
  Diagram 1: importance mean bar chart
  Diagram 2 (“BOF Clusters”): the cluster means chart (NEW)
  Diagram 3 (“BOF MDPREF”): the multidimensional preference bi-plot (NEW)
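As a rough illustration of the (B×p) importance matrix F and its column means (the quantity plotted in Diagram 1), here is a sketch under stated assumptions: each bootstrap tree is reduced to a single split, the improvement measure is the Gini decrease, and labels are 0/1. None of these choices is confirmed by the slides, and the function names are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2.0 * p1 * (1.0 - p1)


def best_split(X, y):
    """Return (variable index, Gini decrease) of the best single split, or (None, 0.0)."""
    n, p = len(X), len(X[0])
    parent = gini(y)
    best_j, best_gain = None, 0.0
    for j in range(p):
        # Dropping the largest value keeps both children non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if gain > best_gain:
                best_j, best_gain = j, gain
    return best_j, best_gain


def importance_matrix(X, y, B=50, seed=0):
    """Row b holds the importance of each predictor in bootstrap tree T_b."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    F = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        j, gain = best_split([X[i] for i in idx], [y[i] for i in idx])
        row = [0.0] * p
        if j is not None:
            row[j] = gain
        F.append(row)
    return F


def mean_importance(F):
    """Column means of F: the average importance of predictor k over the B trees."""
    B = len(F)
    return [sum(row[k] for row in F) / B for k in range(len(F[0]))]
```

The rows of F are also the natural input for clustering the trees or for a bi-plot, since each row describes one bootstrap tree by the predictors it relies on.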
Visualization tools
Graphs based on proximity (n×n) matrix P (n = number of cases)
  Diagram 4 (“Proximity Clusters”): the cluster means chart (Breiman, 2002)
  Diagram 5 (“Proximity MDS”): the multidimensional scaling plot of “similar” cases (Breiman, 2002)
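A proximity matrix of this kind is usually built by counting how often two cases fall into the same terminal node across the trees of the ensemble. The sketch below assumes the terminal-node assignments are already available as a hypothetical input `leaf_ids`, where `leaf_ids[b][i]` is the leaf of case i in tree b. How the slides' software computes P is not stated.

```python
def proximity_matrix(leaf_ids):
    """Return the n x n matrix P of shared-leaf frequencies.

    leaf_ids[b][i] is the terminal node of case i in tree b (hypothetical input).
    P[i][j] is the fraction of trees in which cases i and j share a terminal node.
    """
    B = len(leaf_ids)
    n = len(leaf_ids[0])
    counts = [[0] * n for _ in range(n)]
    for leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    counts[i][j] += 1
    return [[c / B for c in row] for row in counts]


def dissimilarity_matrix(P):
    """1 - P, one common choice of input for clustering or multidimensional scaling."""
    return [[1.0 - v for v in row] for row in P]
```

For example, with two trees and three cases, `proximity_matrix([[0, 0, 1], [0, 1, 1]])` puts cases 0 and 1 together in one tree out of two, so their proximity is 0.5.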
Simulation experiments
S1: Generate a sample of size n=30, two classes, and p=5 variables (x1-x5), with a standard normal distribution and pair-wise correlation 0.95. The responses are generated according to Pr(Y=1|x1≤0.5) = 0.2, Pr(Y=1|x1>0.5) = 0.8.
S2: Generate a sample of size n=30, two classes, and p=5 variables (x1-x5), with a standard normal distribution and pair-wise correlation 0.95 between x1 and x2, and 0 among other predictors. The responses are generated according to Pr(Y=1|x1≤0.5) = 0.2, Pr(Y=1|x1>0.5) = 0.8.
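One way to generate data matching the S1 and S2 descriptions is sketched below. The shared-factor construction is an assumption; the slides do not say how the correlated predictors were drawn. Each correlated variable is sqrt(rho)*z + sqrt(1-rho)*e with independent standard normal z and e, which gives unit variance and pair-wise correlation rho. The function name `simulate` is hypothetical.

```python
import math
import random


def simulate(n=30, p=5, rho=0.95, scenario="S1", seed=0):
    """Draw (X, y) for scenario S1 or S2 as described on the slide."""
    if scenario not in ("S1", "S2"):
        raise ValueError("scenario must be 'S1' or 'S2'")
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    X, y = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)  # shared factor inducing the correlation
        if scenario == "S1":
            # All p predictors share the factor: every pair has correlation rho.
            x = [a * z + b * rng.gauss(0.0, 1.0) for _ in range(p)]
        else:
            # Only x1 and x2 share the factor; x3..xp are independent.
            x = [a * z + b * rng.gauss(0.0, 1.0) for _ in range(2)]
            x += [rng.gauss(0.0, 1.0) for _ in range(p - 2)]
        prob = 0.2 if x[0] <= 0.5 else 0.8  # Pr(Y=1 | x1)
        y.append(1 if rng.random() < prob else 0)
        X.append(x)
    return X, y
```

Calling `simulate(scenario="S2")` returns 30 rows of 5 predictors and a 0/1 response. In both scenarios only x1 drives the response, so the other predictors matter only through their correlation with x1.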
Diagram 1, “Mean importance” (panels: S1, S2)
Diagram 2, “BOF Clusters” (panels: S1, S2)
Diagram 3, “BOF MDPREF” (panels: S1, S2)
Diagram 4, “Proximity Clusters” (panels: S1, S2)
Web Survey data
ICT infrastructure/usage in Croatian primary and secondary schools
  25,000+ teachers (cases)
  200+ variables
  Response: “classroom use of a computer by educators” (yes/no)
Partition:
  50% training
  25% validation
  25% test
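A 50/25/25 partition of case indices could be produced as in the sketch below. It assumes a simple random split with a fixed seed; whether the survey data were split at random or by some stratified scheme is not stated. The function name `partition` is hypothetical.

```python
import random


def partition(n_cases, fractions=(0.50, 0.25, 0.25), seed=0):
    """Shuffle case indices and cut them into training, validation, and test sets."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    n_train = int(fractions[0] * n_cases)
    n_valid = int(fractions[1] * n_cases)
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]  # whatever remains after the first two cuts
    return train, valid, test
```

In this setup the training set would be used to grow the trees, the validation set to tune them, and the test set held back for a final accuracy estimate.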
Initial tree (before bagging)
Diagram 1, “Mean importance”
Diagram 2, “BOF Clusters”
Diagram 3, “BOF MDPREF”
Bootstrap tree 11
Bootstrap tree 22
Bootstrap tree 12
Clustering trees
Diagram 5, “Proximity MDS”
Conclusions / Recommendations
There are software packages for trees
There are some software packages for tree ensembles
There are some visualization tools (old and new)
The problem is:
  they are not “interfaced” (integrated)