Lecture Slides for
INTRODUCTION TO MACHINE LEARNING, 2nd Edition
ETHEM ALPAYDIN
© The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Tree Uses Nodes, and Leaves
Decision Tree Representation
 Internal decision nodes test the values of different attributes.
 Univariate test: uses a single attribute xi of the training example x
 Numeric xi : binary split : xi > wm
 Discrete xi : n-way split for n possible values
 Multivariate test: uses all attributes of the sample x = (x1, x2, …, xn)
 Branches correspond to the outcomes of a test
 Leaves correspond to a prediction
 For classification: class labels, or class proportions (probabilities)
 For regression: a numeric value; the average of the r values, or a local fit
 Learning: build a Decision Tree consistent with the training set
 Greedy learning: find the best split recursively
Ex: Decision Tree for PlayTennis Data
What is the prediction for training example (Sunny, Mild, High, Strong)?
Each training example falls into exactly one leaf
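The tree itself appears on the slide; as a minimal sketch (assuming the usual PlayTennis tree, with Outlook at the root, Humidity tested under Sunny, and Wind tested under Rain), tracing the example through the tree looks like this:

```python
# Hypothetical encoding of the usual PlayTennis tree:
# an internal node is (attribute, {value: subtree}); a leaf is a class label.
tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(tree, example):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

x = {"Outlook": "Sunny", "Temperature": "Mild", "Humidity": "High", "Wind": "Strong"}
print(classify(tree, x))   # -> "No": Sunny branch, then Humidity = High leaf
```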
Ex: Decision Tree for Wisconsin Cancer Data
1. For each sample, test the corresponding attribute at each node
2. Follow the appropriate branch according to the outcome of the test
3. At a leaf, we return the majority class, or the class probabilities, as the prediction for the sample
DTs as Logical Representations
 Decision Tree = Set of If-Then-Rules: rule = path from root to leaf
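For instance (assuming the toy PlayTennis tree encoded in the sketch above; the exact rules depend on the tree shown on the slide), the root-to-leaf paths can be enumerated as if-then rules:

```python
def extract_rules(tree, conditions=()):
    """List every root-to-leaf path of the nested-tuple tree as an if-then rule."""
    if not isinstance(tree, tuple):                       # leaf: emit one rule
        body = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN PlayTennis = {tree}"]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += extract_rules(subtree, conditions + ((attribute, value),))
    return rules

for rule in extract_rules(tree):   # 'tree' as defined in the earlier sketch
    print(rule)
```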
Representation Power and Efficiency of DTs
 Any Boolean function can be represented by a DT
 Any function f: A1 × A2 × … × An → C can be represented by a DT
 Number of tests is between O(n) and O(2^n); n = number of attributes
 DT Construction Algorithms
1. We want an efficient DT learning algorithm that generates the simplest consistent tree
2. Produce comprehensible results
3. Often the first method to try on a new data set
Learning DTs and Complexity
 Many DTs are consistent with the same training set D
 Simplest DTs are preferred over complex DTs
1. Simplest = smallest: least number of nodes
2. Simplest = takes the fewest bits to encode: least space, least memory
 Finding the simplest DT consistent with set D is NP-hard
1. Hypothesis space H = space of all possible trees; exponential in size
 We could enumerate all possible trees, keep the consistent ones, and then return the smallest ones
 Combinatorial problem and overfitting issue
2. Solutions
1. Heuristic search methods, or
2. Limit H to a subset of simple trees: limit the size, or the # of tests
 Learning is greedy: find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993)
Learning a Decision Tree (Univariate)
 Given a subset N of training examples:
 Find the best test attribute A to split N on
 Split N according to each value of the test
 Recurse on each subset of N
 Stop when no more splits are possible, or the tree generalizes well
Will give the full algorithm next…
Top-Down Induction of Decision Trees
Given training set T with classes C1, …, Ck, then:
• For Classification:
1. If all elements of T are of the same class Ci: the DT for T is a leaf with class label i
2. Else: Select the best test attribute AT to split T on
3. Partition T into m subsets T1, …, Tm according to the values of the m outcomes of the test; the DT for T is the test node AT with its m children
4. Recurse on each T1, …, Tm
• For Regression: Same as classification, except
1. The label of a leaf is a mean value or a value obtained by linear regression
2. The definition of the best test attribute is different
Which is the Best Test ?
 The test should provide information about the class label
 Suppose we have 40 examples (30+ and 10-), and we consider two
test attributes that give the following splits below:
 Intuitively, we want a test attribute that separates the training set as
well as possible
 If each leaf was pure, the test attribute would provide maximal
information about the label at the leaf
 What is purity, and how do we measure it?
Information and Uncertainty
 Examples:
1. You are about to observe the outcome of a fair die roll
2. You are about to observe the outcome of a biased coin flip
3. I am about to tell you my name
 You know a priori: I have a fair die, a biased coin, your name, …
 I am sending you a message; that is, the outcome of …
 If the message you receive is totally expected with 100% certainty, then the information content of the message is 0: nothing surprising, and very boring
 In each situation, you have a different amount of uncertainty as to what outcome/message you will observe/receive
 Information = reduction in uncertainty; it is relative to what you know a priori, is related to surprise, and depends on context
Information = Reduction of Uncertainty
 Let E be an event with probability P(E), the amount of information contained in E is:
I(E) = log2 (1 / P(E)) = −log2 P(E) bits of information
 The result of a fair coin flip provides log2 2 = 1 bit of information
 The result of a fair die roll provides log2 6 ≈ 2.58 bits of information
 Telling you my name provides log2 1 = 0 bits of information 
 Information is additive: If you have a sequence of k events E then you
can add the k I(E)’s. Example: a sequence S of k independent fair coin
flips E contains
I(S) = log2 (1 / P(E)^k) = Σi=1..k log2 (1 / P(E)) = k bits
Entropy
 Suppose we have a transmitter S that conveys the results of a random
experiment with m possible discrete outcomes s1, …, sm with
probability distribution P = p1, …, pm
 Entropy is the average amount of information when observing S, that is, the expected information content of the probability distribution of S:
H(S) = Σi pi I(si) = −Σi pi log2 pi
provided pi ≠ 0; for pi = 0, we take pi I(si) = 0 by convention (0 log2 0 = 0).
 Note: the entropy depends only on the probability distribution of S,
not on the outcomes, thus we can write I(pi) instead of I(si).
 So we can really write H(P) instead of H(S) 
 Entropy = the uncertainty the observer has a priori, the average number of bits needed to communicate/encode a message, etc.
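As a small illustration (not from the slides), the entropy of a distribution can be computed directly from this definition:

```python
import math

def entropy(p):
    """H(P) = -sum(p_i * log2 p_i), with the convention 0 * log2(0) = 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))     # fair coin: 1.0 bit
print(entropy([1/6] * 6))      # fair die: ~2.585 bits
print(entropy([1.0]))          # certain outcome: 0 bits (prints -0.0)
```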
Entropy and Coding Theory
 H(P) = −Σi pi log2 pi is the average number of bits needed to communicate a symbol
 Communicating a message from an alphabet of 4 symbols with
equal probabilities requires 2 bits encoding for each symbol
 Conveying an outcome that is certain takes 0 bits.
 Suppose now p0 = 0.5, p1 = 0.25, p2 = p3 = 0.125.
 What is the best encoding for each symbol?
 z0 = 0, z1 = 10, z2 = 110, z3 = 111 (1, 2, 3, and 3 bits)
 What is the average length of a message per symbol?
 Shannon Entropy:
 There are codes that communicate the symbols using, on average, arbitrarily close to H(P) bits/symbol.
 No code can do it with fewer than H(P) bits/symbol on average.
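Working out the example above (my own arithmetic, not on the slide), the expected code length of z0, …, z3 equals the entropy of the distribution, so that code is optimal:

```latex
H(P) = -\sum_i p_i \log_2 p_i
     = 0.5(1) + 0.25(2) + 0.125(3) + 0.125(3)
     = 1.75 \text{ bits/symbol}
```

which is exactly the average codeword length 0.5·1 + 0.25·2 + 0.125·3 + 0.125·3.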
Shannon Entropy as Measure of Information
• For any distribution P, H(P) is the optimal number of binary questions required on average to determine an outcome drawn from P.
Properties of Entropy
• The further P is from uniform, the lower the entropy
• So we can use entropy as a measure of impurity of a set D 
• We want to select a test attribute that reduces the entropy of D to the maximum extent 
Entropy applied to Binary Classification
 Consider a training set D with two classes C+ and C-, and let
 p+ be the proportion of positive examples in D
 p- be the proportion of negative examples in D
 Entropy measures the impurity of D based on the empirical probabilities of the two classes, and is the expected number of bits needed to encode the classes + and −:
H(D) = −(p+ log2 p+ + p− log2 p−)
 So we can use entropy H(D) to measure purity of D !
 D has maximum purity when p+ or p- = 0 or 1.
 D has maximal impurity if it has a uniform probability distribution, that is, when p+ = p− = 0.5
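For example (my own arithmetic), for the earlier set of 40 examples with 30 positives and 10 negatives:

```latex
H(D) = -(0.75 \log_2 0.75 + 0.25 \log_2 0.25) \approx 0.811 \text{ bits}
```

while a 20+/20− set would give the maximum of 1 bit, and a pure 40+/0− set would give 0 bits.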
Classification Trees
(ID3, CART, C4.5)
 For node m, Nm instances reach m, and Nim of them belong to class Ci:
P̂(Ci | x, m) ≡ pim = Nim / Nm
 Node m is pure if pim is 0 or 1
 Measure of impurity is entropy:
Im = −Σi=1..K pim log2 pim
Best Split
 If node m is pure, generate a leaf and stop, otherwise split and
continue recursively
 Impurity after split: Nmj of Nm take branch j, and Nimj of them belong to Ci:
P̂(Ci | x, m, j) ≡ pimj = Nimj / Nmj
I′m = −Σj=1..n (Nmj / Nm) Σi=1..K pimj log2 pimj
 Find the variable and split that minimize impurity (among all variables, and among all split positions for numeric variables)
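As a rough sketch of I′m (illustrative only; the dict layout is my own, and the counts are hypothetical), the after-split impurity can be computed directly from the branch counts Nimj:

```python
import math

def impurity_after_split(counts):
    """I'_m = sum_j (N_mj / N_m) * (-sum_i p_mj^i * log2 p_mj^i),
    where counts[j][i] = N_mj^i, the number of class-i examples taking branch j."""
    n_m = sum(sum(branch.values()) for branch in counts.values())
    total = 0.0
    for branch in counts.values():
        n_mj = sum(branch.values())
        h = -sum((c / n_mj) * math.log2(c / n_mj) for c in branch.values() if c)
        total += (n_mj / n_m) * h
    return total

# Hypothetical counts for a 3-way split (2+/3-, 4+/0-, 3+/2-):
print(impurity_after_split({
    "Sunny":    {"Yes": 2, "No": 3},
    "Overcast": {"Yes": 4, "No": 0},
    "Rain":     {"Yes": 3, "No": 2},
}))   # ~0.694
```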
Conditional Entropy and Information Gain
 Conditional Entropy of a training set D given an attribute A is:
 H(D | A) = Σv∈Values(A) P(A = v) H(D | A = v)
 H(D | A) = Σv∈Values(A) Pv H(Dv)
 Where
 Values(A) = set of all possible values of attribute A
 Pv = P(A = v) = proportion of examples for which A = v
 H(Dv) = H(D | A = v) = entropy of the subset for which A = v
 Information Gain: expected reduction in entropy of D given A
 IG(D, A) = H(D) – H(D | A)
 This is what we need in constructing a DT 
 Select a test attribute A that gives highest information gain
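A minimal sketch of these two quantities (illustrative; the data layout and helper names are my own):

```python
import math
from collections import defaultdict

def label_entropy(labels):
    """H(D) for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(k) for k in set(labels)))

def information_gain(attr_values, labels):
    """IG(D, A) = H(D) - sum_v P(A = v) * H(D_v) for one discrete attribute A."""
    n = len(labels)
    by_value = defaultdict(list)
    for v, y in zip(attr_values, labels):
        by_value[v].append(y)
    h_cond = sum((len(ys) / n) * label_entropy(ys) for ys in by_value.values())
    return label_entropy(labels) - h_cond
```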
Which Attribute is the Best Classifier ?
Selecting the Best Test Attribute
 Choose Outlook as the top test attribute, because:
 IG(D, Outlook) = 0.246
 IG(D, Humidity) = 0.151
 IG(D, Wind) = 0.048
 IG(D, Temperature) = 0.029
 Then repeat for each child of Outlook
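These numbers follow from the definitions above. As a check (assuming the standard 14-example PlayTennis set with 9 positive and 5 negative examples, which Outlook splits 2+/3−, 4+/0−, 3+/2−):

```latex
\begin{aligned}
H(D) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
H(D \mid \text{Outlook}) &= \tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) + \tfrac{5}{14}(0.971) \approx 0.694 \\
IG(D, \text{Outlook}) &= 0.940 - 0.694 = 0.246
\end{aligned}
```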
Which Attribute is the Best Classifier ?
Decision Tree Learning Algorithm: ID3
 Given training set T with classes C1, …, Ck, there are 3 cases:
1. If all elements of T are of the same class Ci: the DT for T is a leaf with label i
2. If T is empty: the DT for T is a leaf with label equal to the majority class in the parent of this node
3. If T contains elements of different classes:
A. Select the test attribute AT with the highest Information Gain
B. Partition T into m subsets T1, …, Tm, one for each outcome of the test with AT
C. The DT for T is a decision node with label AT and one branch for each outcome of the test
D. Repeat from Step 1 for each subset Tj
For Regression: Same as above, but the label of a leaf is a mean value or a value obtained by regression, and the impurity measure is also different.
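A compact sketch of this recursion (illustrative only: it builds the nested-tuple trees used earlier, relies on the information_gain helper above, and uses majority-class leaves when no attributes remain):

```python
from collections import Counter

def id3(examples, labels, attributes, parent_majority=None):
    """examples: list of dicts attribute -> value; labels: parallel list of classes."""
    if not examples:                          # case 2: empty -> parent's majority class
        return parent_majority
    if len(set(labels)) == 1:                 # case 1: pure -> leaf
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                        # no tests left -> majority leaf
        return majority
    # case 3: pick the attribute with the highest information gain and recurse
    best = max(attributes,
               key=lambda a: information_gain([x[a] for x in examples], labels))
    branches = {}
    for v in set(x[best] for x in examples):
        sub = [(x, y) for x, y in zip(examples, labels) if x[best] == v]
        sub_x, sub_y = [x for x, _ in sub], [y for _, y in sub]
        branches[v] = id3(sub_x, sub_y, [a for a in attributes if a != best], majority)
    return (best, branches)
```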
Hypothesis Space of ID3
 Space H = Set of all possible DTs
 Search heuristic based on Information Gain
 Simple-to-Complex hill-climbing search through H
Hypothesis Space Search by ID3
 Space H is complete:
 Target function surely is in H
 Outputs a single consistent hypothesis
 Can't query the teacher
 Can't consider alternative DTs
 Does not backtrack
 May get stuck in local optimum
 Statistically-based search choices
 Use all examples in each step, robust to noisy data
 Less sensitive to errors in individual examples
 Inductive Bias: preference for shorter DTs and for those with high IG at the root
Attributes with Many Values
 Problem:
 Information Gain tends to select attributes with many values first
 Example: attribute Date has too many possible values
 Solution: use the Information Gain Ratio instead of Information Gain
 Information Gain Ratio: the proportion of information generated by the split that is useful for classification:
GainRatio(S, A) = IG(S, A) / SplitInformation(S, A)
 Split Information: measures the potential information generated by partitioning S into c subsets, i.e. the entropy of S with respect to the values of A:
SplitInformation(S, A) = −Σi=1..c (|Si| / |S|) log2 (|Si| / |S|)
where S1, …, Sc are the subsets resulting from partitioning S by the c-valued attribute A
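To see why this penalizes an attribute such as Date (my own illustration, not on the slide): if every example has a distinct date, the split looks maximally informative but is useless, and the denominator grows with the number of values:

```latex
SplitInformation(S, Date) = -\sum_{i=1}^{|S|} \tfrac{1}{|S|}\log_2\tfrac{1}{|S|} = \log_2 |S|
```

so even though IG(S, Date) = H(S), the gain ratio H(S) / log2 |S| shrinks as |S| grows.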
Attributes with Real Values
 Create a discrete attribute to test the continuous attribute
 Simply take the midpoints between two consecutive values
 Example: Temperature = 85, or 54, …
 Which is the best split threshold? Temp > 54 or Temp > 85?
1. Sort the examples by the value of the real attribute to test, and obtain all candidate thresholds. E.g., Temp > 54 and Temp > 85 are the candidate discrete attributes
2. Compute the Information Gain for each discrete attribute
3. Choose the best discrete attribute and …
• replace the continuous attribute with it, and
• … use it as any other attribute for learning the decision tree
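A rough sketch of steps 1–2 (illustrative; the Temperature values and labels are hypothetical, chosen to reproduce the thresholds 54 and 85, and information_gain is the helper defined earlier). Midpoints are taken where the class label changes:

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i - 1][0] + pairs[i][0]) / 2
            for i in range(1, len(pairs))
            if pairs[i - 1][1] != pairs[i][1]]

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
for thr in candidate_thresholds(temps, labels):          # -> 54.0 and 85.0
    gain = information_gain([t > thr for t in temps], labels)
    print(f"Temp > {thr}: IG = {gain:.3f}")
```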
Attributes with Costs
 Attributes may have associated costs or risks
 Example: in medical diagnosis, a BloodTest may cost $150
 Goal: Learn a DT which also minimizes total cost of decision nodes
 Solutions: Replace Information Gain with a measure that takes cost into consideration:
1. IG_Cost(S, A) = IG(S, A)² / Cost(A)
2. IG_Cost(S, A) = (2^IG(S, A) − 1) / (Cost(A) + 1)^w
3. Use a cost matrix: not all misclassifications are equally costly
• E.g., a false alarm is less costly than a failure in a nuclear plant
• A cost matrix is like a confusion matrix, except that the costs of errors are assigned to the elements outside the main diagonal (the misclassifications). Used for diagnosis
• We will discuss confusion matrices later in this chapter
Regression Trees
 Error at node m:
bm(x) = 1 if x ∈ Xm (x reaches node m), 0 otherwise
Em = (1/Nm) Σt (rt − gm)² bm(xt)
gm = Σt bm(xt) rt / Σt bm(xt)
 After splitting:
bmj(x) = 1 if x ∈ Xmj (x reaches node m and takes branch j), 0 otherwise
E′m = (1/Nm) Σj Σt (rt − gmj)² bmj(xt)
gmj = Σt bmj(xt) rt / Σt bmj(xt)
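As a small sketch of these quantities (illustrative; the function name is mine): the leaf value gm is simply the mean of the r values reaching the node, and Em is their mean squared error around that mean:

```python
def node_value_and_error(targets):
    """g_m = mean of the r values reaching node m; E_m = their mean squared error."""
    g_m = sum(targets) / len(targets)
    e_m = sum((r - g_m) ** 2 for r in targets) / len(targets)
    return g_m, e_m

print(node_value_and_error([3.0, 5.0, 4.0, 8.0]))   # (5.0, 3.5)
```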
Model Selection in Trees
Pruning Trees
 Remove subtrees for better generalization (decrease
variance)
 Prepruning: Early stopping
 Postpruning: Grow the whole tree then prune subtrees
which overfit on the pruning set
 Prepruning is faster, postpruning is more accurate
(requires a separate pruning set)
Rule Extraction from Trees
C4.5Rules
(Quinlan, 1993)
Learning Rules
 Rule induction is similar to tree induction but
 tree induction is breadth-first,
 rule induction is depth-first; one rule at a time
 Rule set contains rules; rules are conjunctions of terms
 Rule covers an example if all terms of the rule evaluate to
true for the example
 Sequential covering: Generate rules one at a time until all positive examples are covered (see the sketch after this list)
 IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995)
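A minimal sketch of sequential covering (assuming a learn_one_rule procedure that returns a predicate over examples; both names are my own):

```python
def sequential_covering(pos, neg, learn_one_rule):
    """Greedily build a rule set; each rule should cover some remaining positives."""
    rules = []
    while pos:
        rule = learn_one_rule(pos, neg)          # grow one high-precision rule
        covered = [x for x in pos if rule(x)]
        if not covered:                          # no progress: stop
            break
        rules.append(rule)
        pos = [x for x in pos if not rule(x)]    # remove the covered positives
    return rules
```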
Multivariate Trees