Transcript: ICDM 2003 talk

Is Random Model Better? On Its Accuracy and Efficiency

Wei Fan, IBM T.J. Watson Research Center
Joint work with Haixun Wang, Philip S. Yu, and Sheng Ma

Optimal Model

• Loss function L(t, y) is used to evaluate performance.
  • t is the true label and y is the prediction.
• The optimal decision y* is the label that minimizes the expected loss when x is sampled repeatedly.
• 0-1 loss: y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud.
• Cost-sensitive loss: y* is the label that minimizes the "empirical risk".
  • If P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud.

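A compact restatement of the rule above (a sketch of the standard expected-loss argument; the $1000 expected fraud loss and $90 overhead are the cost parameters quoted on the slide):

    y^{*} = \arg\min_{y} \mathbb{E}_{t}\bigl[L(t, y) \mid x\bigr]
          = \arg\min_{y} \sum_{t} L(t, y)\, P(t \mid x)

In the cost-sensitive case, predicting "normal" risks losing $1000 with probability P(fraud|x), while predicting "fraud" always incurs the $90 overhead, so

    \text{predict fraud} \iff P(\mathrm{fraud} \mid x) \cdot \$1000 > \$90 \iff P(\mathrm{fraud} \mid x) > 0.09
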
How do we look for the optimal model?

• NP-hard for most "model representations".
• We assume that the simplest hypothesis that fits the data is the best.
• We employ all kinds of heuristics to look for it:
  • splitting criteria: info gain, gini index, Kearns-Mansour, etc.
  • pruning: MDL pruning, reduced-error pruning, cost-based pruning.
• Reality: tractable, but still pretty expensive.

On the other hand

• Occam's Razor's interpretation: given two hypotheses with the same loss, we should prefer the simpler one.
• Yet very complicated hypotheses can be highly accurate:
  • Meta-learning
  • Boosting (weighted voting)
  • Bagging (bootstrap sampling with replacement)
• Where are we? The above are all very complicated to compute.
• Question: do we have to?

Do we have to be "perfect"?

• 0-1 loss, binary problem:
  • If P(positive|x) > 0.5, we predict x to be positive.
  • P(positive|x) = 0.6 and P(positive|x) = 0.9 make no difference in the final prediction!
• Cost-sensitive problems:
  • If P(fraud|x) * $1000 > $90, we predict x to be fraud.
  • Rewritten: P(fraud|x) > 0.09.
  • P(fraud|x) = 1.0 and P(fraud|x) = 0.091 make no difference.

Random Decision Tree

• Build several empty iso-depth tree structures without even looking at the data.
• Each example is sorted through a unique path from the root to a leaf. Each tree node records the number of instances belonging to each class.
• Update the empty nodes by scanning the data set only once.
  • It is like "classifying" the data.
  • When an example reaches a node, the count for that example's class label at the node is incremented.

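A minimal sketch of this construction and the single-scan update, assuming categorical features that are not reused along a path; the names (Node, build_random_tree, update_tree, train) are illustrative, not taken from the paper.

    # Sketch: empty iso-depth random trees built without the data,
    # then filled in with class counts in one scan.
    import random
    from collections import defaultdict

    class Node:
        def __init__(self, feature=None):
            self.feature = feature                 # feature tested at this node (None for a leaf)
            self.children = {}                     # feature value -> child Node
            self.class_counts = defaultdict(int)   # class label -> number of training examples seen

    def build_random_tree(feature_ids, feature_values, depth):
        """Build an empty tree of the given depth without looking at the data."""
        if depth == 0 or not feature_ids:
            return Node()
        f = random.choice(feature_ids)
        node = Node(feature=f)
        remaining = [g for g in feature_ids if g != f]
        for v in feature_values[f]:
            node.children[v] = build_random_tree(remaining, feature_values, depth - 1)
        return node

    def update_tree(root, x, label):
        """Sort one example down its unique path, incrementing counts at every node."""
        node = root
        while True:
            node.class_counts[label] += 1
            if node.feature is None:
                break
            node = node.children[x[node.feature]]

    def train(trees, data):
        """One complete scan of the training data updates every tree."""
        for x, label in data:
            for tree in trees:
                update_tree(tree, x, label)
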
Classification

• Each tree outputs a membership probability:
  • p(fraud|x) = n_fraud / (n_fraud + n_normal)
• The membership probabilities from multiple random trees are averaged to approximate the true probability.
• A loss function is required to make a decision (see the sketch below):
  • 0-1 loss: if p(fraud|x) > 0.5, predict fraud.
  • cost-sensitive loss: if p(fraud|x) * $1000 > $90, predict fraud.

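A sketch of this step: average the per-tree estimates, then apply the loss function. The inputs and names here are illustrative; leaf_counts stands for the (n_fraud, n_normal) pair read from the leaf that each random tree sorts x into.

    # Sketch: average per-tree membership probabilities, then decide under a loss function.
    def p_fraud(leaf_counts):
        probs = [nf / (nf + nn) for nf, nn in leaf_counts if nf + nn > 0]
        return sum(probs) / len(probs)   # averaging approximates the true probability

    def decide(leaf_counts, loss="0-1", fraud_loss=1000.0, overhead=90.0):
        p = p_fraud(leaf_counts)
        if loss == "0-1":
            return "fraud" if p > 0.5 else "normal"
        # cost-sensitive: flag when the expected loss of ignoring (p * fraud_loss)
        # exceeds the overhead of acting.
        return "fraud" if p * fraud_loss > overhead else "normal"

    # Example: three trees give leaf counts (3, 17), (1, 9), (2, 8)
    # -> probabilities 0.15, 0.10, 0.20; average 0.15
    # -> "normal" under 0-1 loss, but "fraud" under the cost-sensitive rule (0.15 > 0.09).
    print(decide([(3, 17), (1, 9), (2, 8)], loss="cost-sensitive"))
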
Tree depth

• Set to half the number of features, to create diversity:
  • the number of possible feature combinations peaks at half the size of the feature set.
  • For example, choosing 2 out of 4 features gives 6 choices (checked in the sketch below).

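The "combinations peak at half" remark can be checked directly; a tiny illustration:

    # Number of size-k feature subsets out of n features is largest at k = n/2.
    from math import comb
    n = 4
    print([comb(n, k) for k in range(n + 1)])   # [1, 4, 6, 4, 1] -> maximum 6 at k = 2
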
Number of trees

• Sampling theory: 30 trees give a pretty good estimate with reasonably small variance (see the note below).
• 10 trees are usually already in the right range.

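One way to read the sampling-theory remark (standard statistics, not a formula from the slides): treating the N trees' outputs as roughly independent estimates with variance \sigma^2 each,

    \operatorname{Var}\Bigl(\frac{1}{N}\sum_{i=1}^{N} \hat p_i\Bigr) \approx \frac{\sigma^{2}}{N}

so the standard error shrinks as 1/\sqrt{N}; going from 10 to 30 trees tightens it only by a further factor of \sqrt{3} \approx 1.7.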


Worst scenario

• Only one feature is relevant; all the rest are noise.
• Probability:
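The probability formula itself is not captured in this transcript. As a rough illustration only (an assumption, not necessarily the authors' derivation): if a root-to-leaf path tests d distinct features drawn uniformly from k candidates, then

    P(\text{relevant feature on } x\text{'s path in one tree}) = \frac{d}{k},
    \qquad
    P(\text{at least one of } N \text{ trees uses it}) = 1 - \Bigl(1 - \frac{d}{k}\Bigr)^{N}

With d = k/2 and N = 10 this is 1 - 2^{-10} \approx 0.999, so even in this worst case the relevant feature is almost surely consulted by some tree.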

Simple Feature Info Gain

• Limitation:
  • there must be at least one feature with info gain by itself.
  • Same limitation as C4.5 and dti.
• Example (see the sketch below for an illustration):
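The slide's own example is not in the transcript; the classic case where no single feature has info gain by itself is an XOR-style concept, sketched here (illustrative, not necessarily the example used in the talk):

    # On y = x0 XOR x1, each feature alone has zero information gain,
    # even though the two features together determine y exactly.
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

    def info_gain(rows, labels, feature):
        gain = entropy(labels)
        for v in set(r[feature] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[feature] == v]
            gain -= len(sub) / len(rows) * entropy(sub)
        return gain

    rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
    labels = [0, 1, 1, 0]
    print(info_gain(rows, labels, 0), info_gain(rows, labels, 1))   # 0.0 0.0
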

Training Efficiency

• One complete scan of the training data.
• Memory requirement: hold one tree (or, better, multiple trees).
• One example is read at a time.

Donation Dataset

• Decide to whom to send a charity solicitation letter.
• It costs $0.68 to send a letter.
• Loss function (see the sketch below):

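The loss function itself is not spelled out in the transcript; in the usual formulation of this benchmark (an assumption here, not a quote from the slide), a letter is mailed when the expected donation exceeds its cost:

    \text{mail to } x \iff P(\text{donate} \mid x) \cdot Y(x) > \$0.68

where Y(x) is the estimated donation amount for x.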

Result
Result
Credit Card Fraud

• Detect whether a transaction is a fraud.
• There is an overhead to detect a fraud: {$60, $70, $80, $90}.
• Loss function (see the sketch below):

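The loss function is not spelled out in the transcript; consistent with the rule used earlier in the talk (the exact form here is an assumption), a transaction is flagged when the expected loss of ignoring it exceeds the investigation overhead:

    \text{predict fraud} \iff P(\text{fraud} \mid x) \cdot \text{amount}(x) > \text{overhead}, \qquad \text{overhead} \in \{\$60, \$70, \$80, \$90\}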

Result
Extreme situation
Tree Depth
Compare with Boosting
Compare with Bagging
Conclusion

• Points out the reality that conventional inductive learning (a single best model, or multiple complicated models) is probably far more complicated than necessary.
• Proposes a very efficient and accurate random decision tree algorithm.