Transcript: Lecture 8, Feb 16
MSB rm 132
Ofc hr: Thur, 11-12 am
• “Learning denotes changes in a system that ... enable a
system to do the same task more efficiently the next time.”
• “Learning is constructing or modifying representations of
what is being experienced.”
• “Learning is making useful changes in our minds.”
• Decision Tree
• In the 1960s, Hunt and colleagues used exhaustive-search decision-tree
methods (CLS) to model human concept learning.
• In the late 70’s, Quinlan developed ID3 with the information
gain heuristic to learn expert systems from examples.
• Quinlan’s updated decision-tree package (C4.5) was released in 1993.
• Decision trees predict a categorical output from categorical and/or real inputs.
Decision trees are the most popular data mining tool
Easy to understand
Easy to implement
Easy to use
• Extremely popular method
– Credit risk assessment
– Medical diagnosis
– Market analysis
– Chemistry …
• Internal decision nodes
– Univariate: Uses a single attribute, xi
– Multivariate: Uses all attributes, x
• Leaf nodes
– Classification: Class labels, or proportions
– Regression: Numeric value r (average of the leaf’s training outputs, or a local fit)
• Learning is greedy; find the best split recursively
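As a rough illustration of this greedy, recursive procedure, here is a minimal Python sketch (not from the lecture) that grows a tree with univariate splits on categorical attributes and classification leaves. The helper names (learn_tree, score_split, majority_label) are hypothetical, and the purity score below is only a placeholder; ID3 would instead score splits with the information-gain heuristic based on the entropy defined later in this lecture.

from collections import Counter

def majority_label(examples):
    """examples: list of (attribute_dict, class_label) pairs."""
    return Counter(y for _, y in examples).most_common(1)[0][0]

def score_split(examples, attr):
    # Placeholder purity score: total number of majority-class examples
    # across the subsets produced by splitting on attr (higher = purer).
    score = 0
    for v in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == v]
        score += Counter(subset).most_common(1)[0][1]
    return score

def learn_tree(examples, attributes):
    labels = {y for _, y in examples}
    if len(labels) == 1:              # subset is pure -> leaf with that label
        return labels.pop()
    if not attributes:                # no attributes left -> majority-class leaf
        return majority_label(examples)
    best = max(attributes, key=lambda a: score_split(examples, a))  # greedy choice
    branches = {}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        rest = [a for a in attributes if a != best]
        branches[v] = learn_tree(subset, rest)   # recurse on each branch
    return (best, branches)           # internal node: (splitting attribute, branches)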
• Occam’s razor (c. 1320):
– Prefer the simplest hypothesis that fits the data.
– The principle states that the explanation of any phenomenon should make
as few assumptions as possible, eliminating those that make no difference
in the observable predictions of the explanatory hypothesis or theory.
• Albert Einstein:
“Make everything as simple as possible, but not simpler.” Why?
– It’s a philosophical problem.
– Simple explanations/classifiers are more robust
– Simple classifiers are more understandable
Shorter trees are preferred over larger trees.
We want attributes that classify examples well: select the attribute that
partitions the learning set into subsets that are as “pure” as possible.
Each branch corresponds to an attribute value
Each internal node has a splitting predicate
Each leaf node assigns a classification
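To show how such a tree is used, here is a small companion sketch that classifies a new example by following the branch matching each attribute value until a leaf is reached. It assumes the (attribute, branches) node tuples produced by the learn_tree sketch above; the weather-style attributes in the usage comment are purely hypothetical.

def classify(node, example):
    while isinstance(node, tuple):       # internal node: splitting predicate
        attr, branches = node
        node = branches[example[attr]]   # follow the branch for this attribute value
    return node                          # leaf node: class label

# Hypothetical usage:
# data = [({"Outlook": "Sunny", "Windy": "No"}, "Play"), ...]
# tree = learn_tree(data, ["Outlook", "Windy"])
# print(classify(tree, {"Outlook": "Sunny", "Windy": "No"}))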
• Entropy (disorder, impurity) of a set of examples, S, relative
to a binary classification is:
Entropy(S) = - p1 log2(p1) - p0 log2(p0)
where p1 is the fraction of positive examples in S and p0 is
the fraction of negatives.
• If all examples are in one category, entropy is zero (we define 0·log2(0) = 0).
• If examples are equally mixed (p1=p0=0.5), entropy is a maximum of 1.
• Entropy can be viewed as the number of bits required on average to
encode the class of an example in S where data compression (e.g.
Huffman coding) is used to give shorter codes to more likely cases.
• For multi-class problems with c categories, entropy generalizes to:
Entropy(S) = - Σ_{i=1..c} pi log2(pi)
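A direct sketch of this entropy formula in Python (the function name and interface are my own); the two special cases noted above are restated in the comments.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# All examples in one category -> entropy 0 (classes with zero count never
# appear in the sum, so the 0*log2(0) terms do not arise):
# entropy(["+", "+", "+", "+"])   == 0.0
# Equally mixed binary set (p1 = p0 = 0.5) -> maximum entropy of 1 bit:
# entropy(["+", "+", "-", "-"])   == 1.0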