
Machine Learning:
Decision Trees
Homework 4 assigned
Courtesy: Geoffrey Hinton, Yann LeCun, and Tan, Steinbach & Kumar
What is machine learning?
It is very hard to write programs that solve problems like recognizing a face.
• We don't know what program to write because we don't know how it's done.
• Even if we had a good idea about how to do it, the program might be horrendously complicated.
• Instead of writing a program by hand, we collect lots of examples that specify the correct output for a given input.
• A machine learning algorithm then takes these examples and produces a program that does the job.
• The program produced by the learning algorithm may look very different from a typical hand-written program. It may contain millions of numbers.
• If we do it right, the program works for new cases as well as the ones we trained it on.

Handwriting/face recognition/detection
Visual object/action recognition
Aeroplanes, Bicycles, Birds, Boats, Buses, Cars, Cats, Trains, Cows, Chairs, Dogs, Horses
Different types of learning

• Supervised Learning: given training examples of inputs and corresponding outputs, produce the "correct" outputs for new inputs. Example: character recognition.
• Unsupervised Learning: given only inputs as training data, find structure in the world: discover clusters, manifolds, characterize the areas of the space to which the observed inputs belong.
• Reinforcement Learning: an agent takes inputs from the environment and takes actions that affect the environment. Occasionally, the agent gets a scalar reward or punishment. The goal is to learn to produce action sequences that maximize the expected reward.
Applications

• handwriting recognition, OCR: reading checks and zip codes, handwriting recognition for tablet PCs.
• speech recognition, speaker recognition/verification.
• security: face detection and recognition, event detection in videos.
• text classification: indexing, web search.
• computer vision: object detection and recognition.
• diagnosis: medical diagnosis (e.g., Pap smear processing).
• adaptive control: locomotion control for legged robots, navigation for mobile robots, minimizing pollutant emissions for chemical plants, predicting consumption for utilities...
• fraud detection: e.g., detection of "unusual" usage patterns for credit cards or calling cards.
• database marketing: predicting who is more likely to respond to an ad campaign (...and the antidote: spam filtering).
• games (e.g., backgammon).
• financial prediction (many people on Wall Street use machine learning).
Learning ≠ memorization

• Rote learning is easy: just memorize all the training examples and their corresponding outputs.
• When a new input comes in, compare it to all the memorized samples, and produce the output associated with the matching sample.
• PROBLEM: in general, new inputs are different from training samples. The ability to produce correct outputs or behavior on previously unseen inputs is called GENERALIZATION.
• Rote learning is memorization without generalization.
• The big question of Learning Theory (and practice): how to get good generalization with a limited number of examples.
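To make the point concrete, here is a tiny illustrative sketch (not from the slides) of rote learning as a pure lookup table; the example strings are made up:

```python
# A minimal sketch of rote learning: memorize (input, output) pairs and
# answer only for inputs seen verbatim during training.
def rote_learn(examples):
    """examples: list of (input, output) pairs. Returns a 'program'."""
    memory = dict(examples)           # memorization is just a lookup table
    def program(x):
        # Exact-match lookup: no notion of similarity, so any input that
        # differs from every training sample gets no answer at all.
        return memory.get(x, None)    # None = "never seen this input"
    return program

classify = rote_learn([("cat photo #1", "cat"), ("dog photo #1", "dog")])
print(classify("cat photo #1"))   # -> "cat"  (memorized)
print(classify("cat photo #2"))   # -> None   (no generalization)
```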
Look ahead

• Supervised learning
  – Decision trees
  – Linear models
  – Neural networks
• Unsupervised learning
  – K-means clustering
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
Full text in PDF
Learning from examples

Training Set:

Tid | Attrib1 | Attrib2 | Attrib3 | Class
 1  | Yes     | Large   | 125K    | No
 2  | No      | Medium  | 100K    | No
 3  | No      | Small   | 70K     | No
 4  | Yes     | Medium  | 120K    | No
 5  | No      | Large   | 95K     | Yes
 6  | No      | Medium  | 60K     | No
 7  | Yes     | Large   | 220K    | No
 8  | No      | Small   | 85K     | Yes
 9  | No      | Medium  | 75K     | No
 10 | No      | Small   | 90K     | Yes

The training set is fed to a learning algorithm, which learns a model (Induction: Learn Model).

Test Set:

Tid | Attrib1 | Attrib2 | Attrib3 | Class
 11 | No      | Small   | 55K     | ?
 12 | Yes     | Medium  | 80K     | ?
 13 | Yes     | Large   | 110K    | ?
 14 | No      | Small   | 95K     | ?
 15 | No      | Large   | 67K     | ?

The learned model is then applied to the test set to predict the unknown class labels (Deduction: Apply Model).
Examples of Classification Task

• Handwriting recognition
• Face detection
• Speech recognition
• Object recognition
Learning from examples

Different learners use different biases in predicting Russell's waiting habits:

• Decision Trees: examples are used to learn the topology and the order of questions (the slide sketches a tree over Patrons?, Wait time?, and Friday?, whose leaves predict whether Russell waits).
• K-nearest neighbors.
• Association rules: examples are used to learn the support and confidence of association rules, e.g.:
    If patrons=full and day=Friday then wait (0.3/0.7)
    If wait>60 and Reservation=no then wait (0.4/0.9)
• SVMs.
• Neural Nets: examples are used to learn the topology and the edge weights.
• Naïve Bayes (Bayes net learning): examples are used to learn the topology and the CPTs, e.g., a CPT for Patrons given RW (Russell waits):

    RW | None | some | full
    T  | 0.3  | 0.2  | 0.5
    F  | 0.4  | 0.3  | 0.3
Decision Tree Classification Task

The same setup as in the earlier "Learning from examples" slide, with a decision tree as the model: the training set (Tids 1-10) is fed to a tree induction algorithm, which learns a decision tree (Induction); the tree is then applied to the test set (Tids 11-15) to predict their class labels (Deduction).
Apply Model to Test Data

Start from the root of the tree.

Test Data:

Refund | Marital Status | Taxable Income | Cheat
No     | Married        | 80K            | ?

The decision tree:

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO

(The next several slides step this test record down the tree one node at a time.)
Apply Model to Test Data (final step)

Refund = No takes the right branch to MarSt; MarSt = Married leads to the leaf NO.
Assign Cheat to "No".
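What "applying the model" means in code: the sketch below walks the test record down the tree. The nested-tuple encoding of the tree is my own illustrative representation, not from the slides:

```python
# Internal nodes: (attribute, {branch label: subtree}); leaves: class labels.
TREE = ("Refund", {
    "Yes": "NO",
    "No": ("MarSt", {
        "Married": "NO",
        "Single": ("TaxInc", {"< 80K": "NO", "> 80K": "YES"}),
        "Divorced": ("TaxInc", {"< 80K": "NO", "> 80K": "YES"}),
    }),
})

def classify(tree, record):
    """Start at the root; follow the branch each test selects; stop at a leaf."""
    while not isinstance(tree, str):            # a leaf is a plain label
        attribute, branches = tree
        if attribute == "TaxInc":               # continuous: threshold test
            key = "< 80K" if record["TaxInc"] < 80 else "> 80K"
        else:                                   # categorical: exact match
            key = record[attribute]
        tree = branches[key]
    return tree

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(TREE, record))   # -> NO, as in the walkthrough above
```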
Decision Tree Induction

• Many algorithms:
  – Hunt's Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
    http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html
  – SLIQ, SPRINT
  http://www.cs.waikato.ac.nz/ml/weka/
Tree Induction

• Greedy strategy:
  – Split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  – Determine how to split the records:
    · How to specify the attribute test condition?
    · How to determine the best split?
  – Determine when to stop splitting
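The greedy strategy can be written down as a short recursive procedure. This is an illustrative sketch, not any one of the named algorithms: records are assumed to be dicts with a "class" key, splits are multi-way on categorical attributes, and the criterion is the weighted entropy of the children (entropy is defined later in the lecture):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def grow_tree(records, attributes):
    """Greedy top-down induction: pick the best test, split, recurse."""
    labels = [r["class"] for r in records]
    # Stop splitting when the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    def weighted_child_entropy(a):
        groups = {}
        for r in records:
            groups.setdefault(r[a], []).append(r["class"])
        return sum(len(g) / len(records) * entropy(g) for g in groups.values())
    best = min(attributes, key=weighted_child_entropy)  # maximizes info gain
    parts = {}
    for r in records:
        parts.setdefault(r[best], []).append(r)
    rest = [a for a in attributes if a != best]
    return (best, {v: grow_tree(p, rest) for v, p in parts.items()})
```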
How to Specify Test Condition?

• Depends on attribute types:
  – Nominal
  – Ordinal
  – Continuous
• Depends on number of ways to split:
  – 2-way split
  – Multi-way split
Splitting Based on Nominal Attributes

• Multi-way split: use as many partitions as distinct values.
    CarType -> {Family}, {Sports}, {Luxury}
• Binary split: divides values into two subsets; need to find the optimal partitioning.
    CarType -> {Sports, Luxury} vs. {Family}   OR   CarType -> {Family, Luxury} vs. {Sports}
Splitting Based on Continuous Attributes

• Different ways of handling:
  – Discretization to form an ordinal categorical attribute:
    · Static – discretize once at the beginning
    · Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
  – Binary decision: (A < v) or (A ≥ v)
    · considers all possible splits and finds the best cut
    · can be more compute-intensive
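To make the binary-decision approach concrete, here is a minimal sketch of the threshold search; it scores candidate cuts with entropy (defined formally a few slides later), and the variable names are illustrative:

```python
import math

def entropy(labels):
    """H = -sum p log2 p over the class distribution of `labels`."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_cut(values, labels):
    """Pick v minimizing the weighted entropy of (A < v) vs (A >= v)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_v, best_impurity = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no cut between ties
        v = (pairs[i - 1][0] + pairs[i][0]) / 2       # midpoint candidate
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if w < best_impurity:
            best_v, best_impurity = v, w
    return best_v

# Taxable-income column of the training-set slide, with its class labels:
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(income, cheat))   # -> 97.5
```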
How to determine the Best Split

Greedy approach:
– Nodes with a homogeneous class distribution are preferred
– Need a measure of node impurity:

  C0: 5, C1: 5 – Non-homogeneous, high degree of impurity
  C0: 9, C1: 1 – Homogeneous, low degree of impurity
Entropy

• The entropy of a random variable V with values v_k, each with probability P(v_k), is:

  H(V) = − Σ_k P(v_k) log2 P(v_k)
Splitting Criteria based on INFO

• Entropy at a given node t:

  Entropy(t) = − Σ_j p(j | t) log2 p(j | t)

  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Measures the homogeneity of a node:
    · Maximum (log2 nc) when records are equally distributed among all classes, implying least information
    · Minimum (0.0) when all records belong to one class, implying most information
How to Find the Best Split

Before splitting, the node has class counts C0: N00 and C1: N01, with impurity M0.

Candidate split A? (Yes/No) produces:
  Node N1: C0: N10, C1: N11 – impurity M1
  Node N2: C0: N20, C1: N21 – impurity M2
  Weighted impurity of the children: M12

Candidate split B? (Yes/No) produces:
  Node N3: C0: N30, C1: N31 – impurity M3
  Node N4: C0: N40, C1: N41 – impurity M4
  Weighted impurity of the children: M34

Gain = M0 − M12 vs. M0 − M34: choose the split with the larger gain.
Examples for computing Entropy

Entropy(t) = − Σ_j p(j | t) log2 p(j | t)

C1 = 0, C2 = 6:
  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = − 0 log2 0 − 1 log2 1 = − 0 − 0 = 0

C1 = 1, C2 = 5:
  P(C1) = 1/6, P(C2) = 5/6
  Entropy = − (1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65

C1 = 2, C2 = 4:
  P(C1) = 2/6, P(C2) = 4/6
  Entropy = − (2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92
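The three rows are easy to check mechanically; a short sketch (the helper name is mine):

```python
import math

def entropy_from_counts(*counts):
    """Entropy of a node from raw class counts, per the formula above."""
    n = sum(counts)
    # Terms with a zero count are skipped, i.e., 0 log 0 is taken as 0.
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy_from_counts(0, 6))   # 0.0
print(entropy_from_counts(1, 5))   # 0.650...
print(entropy_from_counts(2, 4))   # 0.918..., i.e., 0.92 rounded
```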
Splitting Based on INFO...

• Information Gain:

  GAIN_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) · Entropy(i)

  where parent node p is split into k partitions and n_i is the number of records in partition i.
  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
  – Used in ID3 and C4.5.
The Information Gain Computation

Given k mutually exclusive and exhaustive events E1…Ek whose probabilities are p1…pk, the "information" content (entropy) is defined as Σ_i −p_i log2 p_i.

At the parent node, with N+ positive and N− negative examples:

  P+ = N+ / (N+ + N−),  P− = N− / (N+ + N−)
  I(P+, P−) = −P+ log(P+) − P− log(P−)

Splitting on feature f partitions the examples into (N1+, N1−), …, (Nk+, Nk−). The residual information after the split is:

  Σ_{i=1}^{k} [ (Ni+ + Ni−) / (N+ + N−) ] · I(Pi+, Pi−)

The difference between the two is the information gain. A split is good if it reduces the entropy, so pick the feature with the largest information gain, i.e., the smallest residual information.
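The same computation in a few lines of Python; a minimal sketch (the function names are mine) working from raw (positive, negative) counts:

```python
import math

def info(pos, neg):
    """I(P+, P-): entropy of a two-class node, from raw class counts."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total)
                for c in (pos, neg) if c > 0)

def info_gain(parent, partitions):
    """parent: (N+, N-); partitions: list of (Ni+, Ni-) after splitting on f."""
    n = sum(parent)
    residual = sum((p + q) / n * info(p, q) for p, q in partitions)
    return info(*parent) - residual
```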
A simple example

Ex | Masochistic | Anxious | Nerdy | HATES EXAM
 1 | F           | T       | T     | Y
 2 | F           | F       | T     | N
 3 | T           | F       | F     | N
 4 | T           | T       | F     | Y

V(M) = 2/4 · I(1/2, 1/2) + 2/4 · I(1/2, 1/2) = 1
V(A) = 2/4 · I(1, 0) + 2/4 · I(0, 1) = 0
V(N) = 2/4 · I(1/2, 1/2) + 2/4 · I(1/2, 1/2) = 1

So Anxious is the best attribute to split on: it leaves the smallest residual information, and hence has the largest information gain.
Once you split on Anxious, the problem is solved.
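Continuing the info_gain sketch above, the same numbers fall out mechanically (the parent node holds 2 Y and 2 N, so its entropy is 1 and gain = 1 − V):

```python
# Class counts (Y, N) in each partition created by splitting on the attribute.
parent = (2, 2)
splits = {
    "Masochistic": [(1, 1), (1, 1)],   # F -> {Y, N}, T -> {Y, N}
    "Anxious":     [(2, 0), (0, 2)],   # T -> {Y, Y}, F -> {N, N}
    "Nerdy":       [(1, 1), (1, 1)],   # T -> {Y, N}, F -> {Y, N}
}
for name, parts in splits.items():
    print(name, info_gain(parent, parts))
# Anxious: gain 1.0 (residual information 0); the other two: gain 0.0.
```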
Decision Tree Based Classification

• Advantages:
  – Inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets
Underfitting and Overfitting

(Figure: training and test error as a function of model complexity.)

Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, training error keeps decreasing while test error starts to grow.
Overfitting due to Noise
(Figure: the decision boundary is distorted by a noise point.)
Notes on Overfitting

• Overfitting results in decision trees that are more complex than necessary
• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
• Need new ways of estimating errors
How to Address Overfitting

• Pre-Pruning (Early Stopping Rule)
  – Stop the algorithm before it becomes a fully-grown tree
  – Typical stopping conditions for a node:
    · Stop if all instances belong to the same class
    · Stop if all the attribute values are the same
  – More restrictive conditions:
    · Stop if the number of instances is less than some user-specified threshold
    · Stop if the class distribution of instances is independent of the available features (e.g., using the χ² test)
    · Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
How to Address Overfitting…

• Post-pruning
  – Grow the decision tree to its entirety
  – Trim the nodes of the decision tree in a bottom-up fashion
  – If generalization error improves after trimming, replace the sub-tree by a leaf node
  – The class label of the leaf node is determined from the majority class of instances in the sub-tree
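Both pruning styles are available off the shelf. For instance, scikit-learn's DecisionTreeClassifier exposes pre-pruning through constructor thresholds and post-pruning through cost-complexity pruning (a CART-style variant of the trim-and-compare idea above); the dataset here is just a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: early-stopping thresholds keep the tree from fully growing.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10).fit(X_tr, y_tr)

# Post-pruning: grow fully, then trim; larger ccp_alpha prunes more nodes.
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_tr, y_tr)

print(pre.score(X_te, y_te), post.score(X_te, y_te))
```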
Decision Boundary

(Figure: labeled points in the unit square, partitioned into axis-parallel rectangles by the tree below; each leaf shows its class counts.)

x < 0.43?
  Yes -> y < 0.33?
           Yes -> 4 : 0
           No  -> 0 : 4
  No  -> y < 0.47?
           Yes -> 0 : 3
           No  -> 4 : 0
• The border line between two neighboring regions of different classes is known as the decision boundary.
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
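The axis-parallel character of the boundary is plain when the tree is written as code; a sketch of the tree above (the class names are placeholders for the two point types in the figure):

```python
def predict_region(x, y):
    """The decision tree above as nested axis-parallel threshold tests."""
    if x < 0.43:
        return "class A" if y < 0.33 else "class B"   # leaf counts 4:0 / 0:4
    return "class B" if y < 0.47 else "class A"       # leaf counts 0:3 / 4:0

# Every boundary is a line x = 0.43, y = 0.33, or y = 0.47, parallel to an axis.
print(predict_region(0.2, 0.1), predict_region(0.8, 0.9))
```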
Oblique Decision Trees

(Figure: a single oblique split, x + y < 1, separating Class = + from Class = −.)

• Test condition may involve multiple attributes
• More expressive representation
• Finding the optimal test condition is computationally expensive