
A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees

Wei Fan, IBM T.J. Watson
Joe McCloskey, US Department of Defense
Philip Yu, IBM T.J. Watson
Three DM Problems

- Classification: the label comes from a given set of labels in the training data.
- Probability estimation: similar to the above setting: estimate the probability that x is an example of class y. Difference: no truth is given, i.e., no true probability is observed.
- Regression: the target value is continuous.
Model Approximation

- True model (or correct model): generates y for each x with probability P(y|x); normally never known in reality.
- Perfect model: never makes mistakes, i.e., always has the same prediction as the true model. Not always possible due to:
  • the stochastic nature of the problem,
  • noise in the training data,
  • insufficient data.
Optimal Model

- A loss function L(t, y) is used to evaluate performance, where t is the true target and y the prediction.
- The optimal decision y* is the label that minimizes the expected loss when x is sampled repeatedly: y* = argmin_y E[L(t, y) | x].
- Examples:
  • 0-1 loss: y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud.
  • Cost-sensitive loss: y* is the label that minimizes the "empirical risk", e.g., if P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud.
  • MSE (mean squared error): predict the average, i.e., the conditional mean of the target.
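In symbols (my notation, not formulas from the slides), the three examples instantiate one rule: minimize the expected loss under the posterior P(t|x).

    % optimal decision under a loss L(t, y)
    y^* = \arg\min_{y} \; \mathbb{E}_{t \sim P(t \mid x)}\big[\, L(t, y) \,\big]

    % 0-1 loss: pick the most probable label
    y^* = \arg\max_{y} \; P(y \mid x)

    % squared loss L(t, y) = (t - y)^2: pick the conditional mean
    y^* = \mathbb{E}[\, t \mid x \,]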
How do we look for optimal models?

- Don't impose "exact forms": decision trees, classification based on association rules, production rules, etc.
  • The learner estimates the structure as well as the parameters.
  • NP-hard for most "model representations".
- Impose "exact forms": logistic regression functions, linear regression models, etc.
  • Learners estimate parameters ONLY; the structure is pre-fixed. This is an inductive bias.
- The decision tree is a rather flexible, efficient yet powerful representation.
Consider Decision Tree

- Compromise between accuracy and model complexity: we assume that the simplest-structured hypothesis that fits the data is the best.
- We employ all kinds of heuristics to look for it:
  • purity checks: info gain, gini index, Kearns-Mansour, etc.
  • pruning: MDL pruning, reduced-error pruning, cost-based pruning.
- Reality: tractable, but still pretty expensive.
- Truth: none of the purity check functions guarantees accuracy on testing data.
Random Decision Tree
(classification, regression, probability estimation)

- Key characteristics:
  • The structure is randomly picked.
  • The statistics are summarized from the training data.
- At each node, an unused feature is chosen randomly:
  • A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
  • A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.
Continued

- We stop growing the tree when one of the following happens (a construction sketch follows this list):
  • A node becomes too small.
  • The total height of the tree exceeds some limit, such as the total number of features.
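A minimal sketch of the randomized construction described above, under my own naming and data-layout assumptions (the authors' code is not shown in the talk): discrete features are used at most once per path, continuous features may recur with fresh thresholds, and growth stops when a node is too small or the path length reaches the number of features.

    import random

    def build_random_tree(examples, features, used_discrete=frozenset(),
                          depth=0, min_node_size=4):
        """Grow one random decision tree. The structure is picked at random;
        training data is only routed down so that node statistics can be
        summarized later. `features` maps a name to ('discrete', value_set)
        or ('continuous', (lo, hi)), an assumed input format."""
        # Stopping rules from the slide: node too small, or tree height
        # reaching the total number of features.
        candidates = [f for f in features if f not in used_discrete]
        if (len(examples) < min_node_size or depth >= len(features)
                or not candidates):
            return {'leaf': True, 'examples': examples}

        name = random.choice(candidates)
        kind, meta = features[name]
        if kind == 'discrete':
            # A discrete feature may appear only once on a root-to-leaf path.
            children = {
                value: build_random_tree(
                    [e for e in examples if e[name] == value], features,
                    used_discrete | {name}, depth + 1, min_node_size)
                for value in meta}
            return {'leaf': False, 'feature': name, 'children': children}
        # A continuous feature may recur, but with a fresh random threshold.
        lo, hi = meta
        threshold = random.uniform(lo, hi)
        return {'leaf': False, 'feature': name, 'threshold': threshold,
                'children': {
                    'lt': build_random_tree(
                        [e for e in examples if e[name] < threshold],
                        features, used_discrete, depth + 1, min_node_size),
                    'ge': build_random_tree(
                        [e for e in examples if e[name] >= threshold],
                        features, used_discrete, depth + 1, min_node_size)}}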
Node Statistics

- Classification and probability estimation: each node of the tree keeps the number of examples belonging to each class.
- Regression: each node of the tree keeps the mean target value of the examples sorted into the node.
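A sketch of these per-node statistics (the field names are mine): class counts serve classification and probability estimation, and a running mean of the target serves regression, so raw targets never need to be stored.

    class NodeStats:
        """Statistics summarized at a node from the training examples
        routed through it."""
        def __init__(self):
            self.class_counts = {}  # label -> count, for classification
            self.n = 0              # number of examples seen
            self.mean = 0.0         # running target mean, for regression

        def update(self, label=None, target=None):
            if label is not None:
                self.class_counts[label] = self.class_counts.get(label, 0) + 1
            if target is not None:
                self.n += 1
                # Incremental mean update: mean += (x - mean) / n.
                self.mean += (target - self.mean) / self.n

        def posterior(self, label):
            """P(label | node); e.g. counts P1: 30, P2: 70 give 0.3 for P1."""
            total = sum(self.class_counts.values())
            return self.class_counts.get(label, 0) / total if total else 0.0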
Classification/Prob Estimation

- During classification, each tree outputs a posterior probability:

[Figure: an example tree with tests B1 < 0.5, B2 > 0.7, and B1 > 0.3 (the continuous feature B1 recurs with a different threshold). The leaf reached by x holds counts P1: 30, P2: 70, so the tree outputs P(P1|x) = 30/(30+70) = 0.3; another leaf holds P1: 200, P2: 10.]
Regression

- During prediction, each tree outputs the average target value of the training examples that fall into the same node as x:

[Figure: an example tree with tests Age > 30, Capt > 70%, and Edu = PhD; one leaf stores Avg AGI = 100K, another Avg AGI = 150K.]
Classification

- The predictions from multiple random trees are averaged as the final output (a sketch follows below).
- For classification, a loss function is needed to turn the averaged probability into a label.
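A sketch of this ensemble step; `route_to_leaf` is a hypothetical helper that walks x down one tree to its leaf statistics (the NodeStats above).

    def predict_ensemble(trees, x, task='probability', label=None,
                         threshold=0.5):
        """Average the per-tree outputs: leaf posteriors for classification
        and probability estimation, leaf means for regression."""
        if task == 'regression':
            values = [route_to_leaf(t, x).mean for t in trees]
            return sum(values) / len(values)
        probs = [route_to_leaf(t, x).posterior(label) for t in trees]
        avg = sum(probs) / len(probs)
        if task == 'probability':
            return avg
        # Classification: the threshold comes from the loss function, e.g.
        # 0.5 under 0-1 loss, or 0.09 in the fraud example (90 / 1000).
        return avg > threshold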

A few words about some of its advantages

- Training can be very efficient, particularly for very large datasets.
- Natural multi-class probability.
- Natural multi-label classification and probability estimation.
- Imposes very little about the structure of the model.
Number of trees

- Sampling theory:
  • The random decision tree can be thought of as sampling from a large (infinite, when continuous features exist) population of trees.
  • Unless the data is highly skewed, 30 to 50 trees give a pretty good estimate with reasonably small variance. In most cases, 10 are usually enough.
- Worst scenario: only one feature is relevant; all the rest are noise.
  • Probability: (formula not preserved in the transcript)
  • Variance reduction: (formula not preserved in the transcript)
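The sampling-theory argument rests on a standard identity (my addition, not a formula from the slides): averaging N roughly independent tree outputs divides the variance of the estimate by N, which is why a few dozen trees already give a stable estimate.

    % variance of the average of N i.i.d. tree outputs \hat{y}_i,
    % each with variance \sigma^2
    \mathrm{Var}\!\left( \frac{1}{N} \sum_{i=1}^{N} \hat{y}_i \right)
        = \frac{\sigma^2}{N}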
Donation Dataset
(classification and probability estimation)

- Decide to whom to send a charity solicitation letter.
- It costs $0.68 to send a letter.
- Loss function: (formula not preserved in the transcript; a sketch of the natural decision rule follows below)
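A sketch of the decision rule this loss implies, assuming the usual formulation of this benchmark: mail exactly when the expected donation beats the $0.68 mailing cost. `p_donate` and `expected_amount` are assumed model outputs, not names from the talk.

    MAILING_COST = 0.68  # cost of sending one letter, from the slide

    def should_mail(p_donate, expected_amount):
        """Mail iff the expected donation exceeds the cost of the letter.
        (Assumed rule; the slide's formula was not preserved.)"""
        return p_donate * expected_amount > MAILING_COST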

Result (figure not preserved in the transcript)
Credit Card Fraud
(classification and probability estimation)

- Detect whether a transaction is a fraud.
- There is an overhead to detect a fraud: {$60, $70, $80, $90}.
- Loss function: (formula not preserved in the transcript; a sketch of the natural decision rule follows below)
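Following the cost-sensitive example earlier in the talk (P(fraud|x) * $1000 > $90), the natural rule here, sketched under that assumption, is to flag a transaction when the expected fraud loss exceeds the investigation overhead.

    def should_investigate(p_fraud, amount, overhead):
        """Flag iff the expected fraud loss beats the overhead; with
        amount = $1000 and overhead = $90 this is the talk's
        P(fraud|x) > 0.09 rule."""
        return p_fraud * amount > overhead

    # Mirroring the earlier slide:
    assert should_investigate(0.10, 1000, 90)       # 100 > 90: investigate
    assert not should_investigate(0.08, 1000, 90)   #  80 < 90: don't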

Result (figure not preserved in the transcript)
Comparing with Boosting

- Does not handle multi-class problems naturally; needs ECOC.
- Does not output probabilities.
- Inefficient.
- The number of boosting rounds is tricky to choose; sometimes more rounds lead to overfitting.
- Implementation needs careful numerical manipulation.
Comparing with Bagging

- Can be very inefficient, particularly for very large datasets: bootstrap sampling needs a linear scan of the data.
- Does not output reliable probabilities.
Result slides (figures not preserved in the transcript):
- Probability Estimation (two slides)
- Overfitting
- Non-overfitting of RDT
- Selectivity
- Tolerance to data insufficiency
GUIDE

[Figure: a GUIDE tree. Internal nodes split on grouping variables (Age > 30, Capt > 70%, Edu = PhD); each leaf fits a multiple linear regression, MLR: y = a + a1*x1 + a2*x2 + … + ak*xk.]

[Figure: comparison panels. RDT on a regression with a single independent variable; RDT on a target that depends on a combination of 5 independent variables; and a fit captioned "It grows like …" (the unexpected GUIDE result referred to on the next slide).]
Comparing with GUIDE

- One needs to decide the grouping variables and the independent variables: a non-trivial task.
- If all variables are categorical, GUIDE becomes a single CART regression tree.
- Strong assumptions and greedy-based search can sometimes lead to very unexpected results, like the one shown earlier.
Conclusion

- Imposing a particular form of model is not a good way to train highly accurate models.
- It may not even be efficient for some forms of models.
- RDT has been shown to solve all three major problems in data mining (classification, probability estimation, and regression) simply, efficiently, and accurately.
Selected Bibliography of RDT

- ICDM'03: "Is random model better? On its accuracy and efficiency" (Fan, Wang, Yu, and Ma)
- AAAI'04: "On the Optimality of Posterior Probability Estimation by Random Decision Tree" (Fan)
- ICDM'05: "Effective Estimation of Posterior Probabilities: Explaining the Accuracy of Randomized Decision Tree Approaches" (Fan, Greengrass, McCloskey, Yu, and Drummey)
- ICDM'05: "Learning through Changes: An Empirical Study of Dynamic Behaviors of Probability Estimation Trees" (Zhang, Buckles, Peng, and Xu)
- Master's thesis: "The Utility of Randomness in Decision Tree Construction", Tony Liu, supervised by Kai Ming Ting, Monash University, 2005
- KDD'06: "A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees" (Fan, McCloskey, and Yu)