Transcript Document

Decisions, Decisions
• Concepts
• Naïve Bayesian Classification
• Decision Trees
– General Algorithm
– Refinements
• Accuracy
• Scalability
– Strengths and Limitations
Lecture 12, CS567
Concepts
• Problem
– Will there be a pop quiz today?
• Data
– (Duration of delay) in entering class
– Instructor’s bag (bulging/not bulging)
– Instructor (has/does not have) wicked/impish smile on face
– There (was/wasn’t) a quiz last class
• Naïve Bayesian
– Calculate P(Pop Quiz) from data with no regard to order of calculation
• Decision Tree
– Evaluate data in a particular branching sequence: If … then … elsif …
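One way to encode this toy problem in Python, so the later sketches have something concrete to work with (the attribute names, values, and rows below are illustrative assumptions, not data from the lecture):

    # Each record holds the four attributes above plus the class label.
    # Values and rows are made up purely for illustration.
    quiz_data = [
        {"delay": "long",  "bag": "bulging",     "smile": True,  "quiz_last": False, "pop_quiz": True},
        {"delay": "short", "bag": "not_bulging", "smile": False, "quiz_last": True,  "pop_quiz": False},
        {"delay": "long",  "bag": "bulging",     "smile": False, "quiz_last": False, "pop_quiz": True},
        {"delay": "short", "bag": "not_bulging", "smile": True,  "quiz_last": True,  "pop_quiz": False},
    ]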
Naïve Bayesian
• Goal: To estimate P(M|D), aka Posterior
• “Naïve” assumption
– All data have “free will”: all attributes have independent probability distributions (mutual information between every pair of attributes = 0)
• Prior = P(M). Binomial distribution with parameter p = P(Pop Quiz). Thus p = ?
• P(Pop Quiz|D) = P(D|Pop Quiz) P(Pop Quiz) / P(D) where, for i attributes constituting the data,
– P(D|Pop Quiz) = Π_i P(Di|Pop Quiz)
– P(D) = K (uniform assumption) OR P(D) = Π_i P(Di)
• Thus, either calculate explicit P(Pop Quiz|D) OR do a max-likelihood comparison of P(Pop Quiz|D) and P(No Pop Quiz|D)
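A minimal Python sketch of this calculation, assuming discrete attribute values and dictionary records like the quiz_data example above (the smoothing constant and the helper name are choices made here, not part of the slides):

    from collections import Counter

    def naive_bayes_posterior(data, instance, label_key="pop_quiz"):
        """P(C|D) is proportional to P(C) * prod_i P(Di|C); normalizing at the end
        plays the role of dividing by P(D)."""
        class_counts = Counter(row[label_key] for row in data)
        scores = {}
        for c, n_c in class_counts.items():
            score = n_c / len(data)                        # prior P(C)
            rows_c = [row for row in data if row[label_key] == c]
            for attr, value in instance.items():           # naive product over attributes
                if attr == label_key:
                    continue
                matches = sum(1 for row in rows_c if row[attr] == value)
                score *= (matches + 1) / (n_c + 2)         # add-one smoothing (assumes ~2 values per attribute)
            scores[c] = score
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}   # explicit posteriors

Calling naive_bayes_posterior(quiz_data, {"delay": "long", "bag": "bulging", "smile": True, "quiz_last": False}) would return both posteriors; simply comparing the unnormalized scores instead is the max-likelihood shortcut from the last bullet.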
Decision Trees
• Directed Graph for reaching a decision
• Decision =
– Verdict
– More generally, classification into one of several classes
– If (..) OR (…) .. then Pop Quiz; if (..) OR (..) .. then no Pop Quiz
• Given i attributes/data about an instance, navigate graph
based on the values of {i}
• Based on minimizing uncertainty
– Greedy approach: Largest drops in uncertainty occur first
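Concretely, a trained tree is applied as a chain of attribute tests, one per node on the path to a leaf; the particular tests and their order below are invented for illustration only:

    def classify(instance):
        # Navigate the directed graph: each internal node tests one attribute,
        # each leaf returns a class. Training (next slides) decides which tests come first.
        if instance["bag"] == "bulging":
            if instance["smile"]:
                return "Pop Quiz"
            return "No Pop Quiz"
        elif instance["quiz_last"]:
            return "No Pop Quiz"
        else:
            return "Pop Quiz"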
Decision Trees - Training
• Given: Set of labeled data (Di = {ai}, Ck)
• Goal: To find best classification tree
• Maximum uncertainty = log2(|class|) = 1 bit for a 2-class
problem
• Entropy for given data: E(D) = -Σ_k P(Ck) log2 P(Ck), summed over the n classes
• Conditional/Residual entropy: E(D|ai, Ck) = Σ_l [ (|Dl| / |D|) · E(Dl) ], summed over the l subsets Dl induced by attribute ai
• Reduction in uncertainty = Gain of Information
• Gain G(ai) = E(D) – E(D|ai, Ck)
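These formulas translate almost line for line into Python; the sketch below assumes discrete attribute values and the dictionary-style records used earlier:

    import math
    from collections import Counter, defaultdict

    def entropy(rows, label_key="pop_quiz"):
        """E(D) = -sum_k P(Ck) * log2 P(Ck) over the class labels present in rows."""
        counts = Counter(row[label_key] for row in rows)
        return -sum((n / len(rows)) * math.log2(n / len(rows)) for n in counts.values())

    def conditional_entropy(rows, attr, label_key="pop_quiz"):
        """E(D|ai) = sum_l (|Dl|/|D|) * E(Dl), one subset Dl per observed value of attribute ai."""
        subsets = defaultdict(list)
        for row in rows:
            subsets[row[attr]].append(row)
        return sum((len(sub) / len(rows)) * entropy(sub, label_key) for sub in subsets.values())

    def gain(rows, attr, label_key="pop_quiz"):
        """G(ai) = E(D) - E(D|ai): the drop in uncertainty from splitting on ai."""
        return entropy(rows, label_key) - conditional_entropy(rows, attr, label_key)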
Decision Trees - Training
• Find ai such that G(ai) > G(aj) for all j ≠ i
• Root = ai with children = subsets of data falling into each
range of ai
• Iterate through remaining list of attributes till all ai have
been considered
• Label each subset with majority class label
• Optional but highly recommended steps:
– Prepruning and postpruning: Avoid over-fitting
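Put together, the greedy construction on this and the previous slide might be sketched as the recursion below (a sketch only; it assumes the gain() helper from the previous block and leaves out pre/post-pruning):

    from collections import Counter, defaultdict

    def build_tree(rows, attrs, label_key="pop_quiz"):
        """Greedy, ID3-style: split on the highest-gain attribute, recurse on each subset,
        and label leaves with the majority class."""
        labels = [row[label_key] for row in rows]
        majority = Counter(labels).most_common(1)[0][0]
        if len(set(labels)) == 1 or not attrs:             # pure node or no attributes left -> leaf
            return majority
        best = max(attrs, key=lambda a: gain(rows, a, label_key))   # largest drop in uncertainty first
        subsets = defaultdict(list)
        for row in rows:
            subsets[row[best]].append(row)
        remaining = [a for a in attrs if a != best]
        return {"attr": best,
                "default": majority,                       # fallback for unseen attribute values
                "children": {v: build_tree(sub, remaining, label_key) for v, sub in subsets.items()}}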
Decision Trees - Training
• Caveat with previous approach:
– Splits that isolate single or very few data points are highly favored
– In other words, attributes with higher resolution (a wider range of possible values) are favored
• Gain ratio
– Alternative to information gain as criterion for choice of attributes
– Compensates for bias towards high scores for ai with high
resolution (higher number of states for that attribute)
– Gain ratio = G(ai) / E(ai)
• Gini
– Recommended for attribute selection on large training sets, for scalability
– Gini Index = 1 - Σ_k Pk²
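Both refinements reuse the same counting machinery; a sketch, again assuming the earlier record format and the gain() helper from before (the zero-split guard is an implementation choice added here):

    import math
    from collections import Counter

    def split_entropy(rows, attr):
        """E(ai): entropy of the attribute's own value distribution (its 'resolution')."""
        counts = Counter(row[attr] for row in rows)
        return -sum((n / len(rows)) * math.log2(n / len(rows)) for n in counts.values())

    def gain_ratio(rows, attr, label_key="pop_quiz"):
        """Gain ratio = G(ai) / E(ai); penalizes attributes with many distinct values."""
        se = split_entropy(rows, attr)
        return 0.0 if se == 0 else gain(rows, attr, label_key) / se   # a constant attribute carries no gain anyway

    def gini_index(rows, label_key="pop_quiz"):
        """Gini Index = 1 - sum_k P(Ck)^2; no logarithms, so it is cheaper on large training sets."""
        counts = Counter(row[label_key] for row in rows)
        return 1 - sum((n / len(rows)) ** 2 for n in counts.values())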
Decision Trees - Limitation
• Rectangular (Hypercuboidal) data space partitioning
assumption
• Not the best solution where the true separating line/hyperplane is not orthogonal to the data dimensions
• Greedy strategy can easily lead to overfitting