Classification Problem


Classification
Today: Basic Problem
Decision Trees
Classification Problem
• Given a database D = {t1, t2, …, tn} and a set of
classes C = {C1, …, Cm}, the Classification
Problem is to define a mapping f: D → C where
each ti is assigned to one class.
• Actually divides D into equivalence classes.
• Prediction is similar, but may be viewed as
having infinite number of classes.
Classification Ex: Grading
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.
[Figure: decision tree with successive splits on x at 90, 80, 70, and 60.]
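As a minimal sketch, the grading rules above translate directly into nested comparisons that mirror the structure of the decision tree (Python, for illustration):

```python
def grade(x):
    """Walk the splits of the grading tree above and return a letter grade."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print(grade(85))  # -> "B"
```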
Classification Techniques
• Approach:
1. Create specific model by evaluating
training data (or using domain experts’
knowledge).
2. Apply model developed to new data.
• Classes must be predefined
• Most common techniques use DTs, or are
based on distances or statistical methods.
Defining Classes
[Figure: two ways of defining classes: distance based and partitioning based.]
Issues in Classification
• Missing Data
– Ignore
– Replace with assumed value (see the sketch after this list)
• Measuring Performance
– Classification accuracy on test data
– Confusion matrix
– OC Curve
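A quick sketch of the second missing-data option above (replace with an assumed value), here taking the attribute mean as the assumed value; the data values are made up for illustration:

```python
# Replace missing values (None) in a numeric attribute with an assumed value,
# here the mean of the observed values (illustrative data only).
heights = [1.6, 2.0, None, 1.88, None, 1.85]

observed = [h for h in heights if h is not None]
assumed = sum(observed) / len(observed)          # assumed value = attribute mean
filled = [h if h is not None else assumed for h in heights]
print(filled)
```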
Height Example Data

Name       Gender  Height  Output1  Output2
Kristina   F       1.6m    Short    Medium
Jim        M       2m      Tall     Medium
Maggie     F       1.9m    Medium   Tall
Martha     F       1.88m   Medium   Tall
Stephanie  F       1.7m    Short    Medium
Bob        M       1.85m   Medium   Medium
Kathy      F       1.6m    Short    Medium
Dave       M       1.7m    Short    Medium
Worth      M       2.2m    Tall     Tall
Steven     M       2.1m    Tall     Tall
Debbie     F       1.8m    Medium   Medium
Todd       M       1.95m   Medium   Medium
Kim        F       1.9m    Medium   Tall
Amy        F       1.8m    Medium   Medium
Wynette    F       1.75m   Medium   Medium
Classification Performance
• True Positive
• False Negative
• False Positive
• True Negative
Confusion Matrix Example
Using height data example with Output1 correct and Output2 actual assignment

                     Assignment
Actual Membership    Short  Medium  Tall
Short                  0      4      0
Medium                 0      5      3
Tall                   0      1      2
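The matrix above can be rebuilt from the Output1 (correct) and Output2 (assigned) columns of the height data; a small sketch:

```python
from collections import Counter

# Output1 = correct class, Output2 = assigned class (height example data)
output1 = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short", "Short",
           "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]
output2 = ["Medium", "Medium", "Tall", "Tall", "Medium", "Medium", "Medium", "Medium",
           "Tall", "Tall", "Medium", "Medium", "Tall", "Medium", "Medium"]

# Count (actual, assigned) pairs and print the matrix row by row.
counts = Counter(zip(output1, output2))
classes = ["Short", "Medium", "Tall"]
for actual in classes:
    print(actual, [counts[(actual, assigned)] for assigned in classes])
# Short  [0, 4, 0]
# Medium [0, 5, 3]
# Tall   [0, 1, 2]
```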
Operating Characteristic Curve
Classification Using Decision
Trees
• Partitioning based: Divide search space into
rectangular regions.
• Tuple placed into class based on the region within
which it falls.
• DT approaches differ in how the tree is built: DT
Induction
• Internal nodes associated with attribute and arcs
with values for that attribute.
• Algorithms: ID3, C4.5, CART
Decision Tree
Given:
– D = {t1, …, tn} where ti = <ti1, …, tih>
– Database schema contains {A1, A2, …, Ah}
– Classes C = {C1, …, Cm}
A Decision or Classification Tree is a tree associated with D such
that
– Each internal node is labeled with an attribute, Ai
– Each arc is labeled with a predicate which can be applied to the
attribute at its parent
– Each leaf node is labeled with a class, Cj
DT Induction
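A hedged sketch of generic top-down DT induction (greedy recursive partitioning); choose_split is a placeholder where ID3, C4.5, or CART would plug in their own split measure:

```python
# Generic top-down decision-tree induction: recursively split the data on one
# attribute at a time until nodes are pure or no attributes remain.
def choose_split(data, attributes):
    return attributes[0]                      # placeholder split heuristic

def build_tree(data, attributes):
    classes = [row["class"] for row in data]
    if len(set(classes)) == 1:                # pure node -> leaf
        return classes[0]
    if not attributes:                        # nothing left to split on -> majority class
        return max(set(classes), key=classes.count)
    attr = choose_split(data, attributes)
    tree = {attr: {}}
    for value in set(row[attr] for row in data):
        subset = [row for row in data if row[attr] == value]
        remaining = [a for a in attributes if a != attr]
        tree[attr][value] = build_tree(subset, remaining)
    return tree

# Tiny illustrative run (toy rows, not the slide data):
toy = [{"gender": "F", "class": "Short"}, {"gender": "M", "class": "Tall"}]
print(build_tree(toy, ["gender"]))            # e.g. {'gender': {'F': 'Short', 'M': 'Tall'}}
```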
DT Splits Area
[Figure: the search space divided into rectangular regions by splits on Gender (M/F) and Height.]
Comparing DTs
[Figure: two decision trees for the same data, one balanced and one deep.]
DT Issues
• Choosing Splitting Attributes
• Ordering of Splitting Attributes
• Splits
• Tree Structure
• Stopping Criteria
• Training Data
• Pruning
Information/Entropy
• Given probabilities p1, p2, …, ps whose sum is 1,
Entropy is defined as:
H(p1, …, ps) = Σ pi log(1/pi), summed over i = 1, …, s
• Entropy measures the amount of randomness or
surprise or uncertainty.
• Goal in classification
– no surprise
– entropy = 0
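A small sketch of this definition (the worked ID3 example below uses base-10 logarithms, so that base is assumed here):

```python
import math

def entropy(probabilities, base=10):
    """H(p1, ..., ps) = sum_i pi * log(1/pi); terms with pi = 0 contribute nothing."""
    return sum(p * math.log(1.0 / p, base) for p in probabilities if p > 0)

# Starting-state entropy of the height data (4 Short, 8 Medium, 3 Tall out of 15):
print(entropy([4/15, 8/15, 3/15]))   # ~0.4384
```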
ID3
• Creates tree using information theory concepts and
tries to reduce the expected number of comparisons.
• ID3 chooses the split attribute with the highest
information gain:
Gain(D, S) = H(D) – Σ P(Di) H(Di), summed over the subsets Di produced by split S
ID3 Example (Output1)
• Starting state entropy:
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
• Gain using gender:
– Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
– Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
– Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
– Gain: 0.4384 – 0.34152 = 0.09688
• Gain using height (heights bucketed into ranges; only one range holds two
tuples of different classes, contributing entropy 0.301):
0.4384 – (2/15)(0.301) = 0.3983
• Choose height as first splitting attribute
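A small sketch that reproduces the gender numbers above from the Output1 labels (base-10 logs, as in the slide); the height gain follows the same pattern once heights are bucketed into ranges:

```python
import math

def entropy(labels, base=10):
    n = len(labels)
    return sum((labels.count(c) / n) * math.log(n / labels.count(c), base)
               for c in set(labels))

# Output1 labels grouped by gender (from the height example data).
female = ["Short", "Medium", "Medium", "Short", "Short", "Medium", "Medium", "Medium", "Medium"]
male   = ["Tall", "Medium", "Short", "Tall", "Tall", "Medium"]
everyone = female + male

h_start  = entropy(everyone)                                     # ~0.4384
weighted = (len(female) / 15) * entropy(female) + (len(male) / 15) * entropy(male)
print("gain(gender) =", h_start - weighted)                      # ~0.097
```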
C4.5
• ID3 favors attributes with a large number of divisions
• Improved version of ID3:
– Missing Data
– Continuous Data
– Pruning
– Rules
– GainRatio: GainRatio(D, S) = Gain(D, S) / H(P(D1), …, P(Ds)), where P(Di) = |Di| / |D|
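A minimal sketch of the GainRatio idea, dividing the gain by the entropy of the split proportions (base-10 logs as elsewhere; the 0.0969 gender gain from the ID3 example is reused for illustration):

```python
import math

def entropy(probabilities, base=10):
    return sum(p * math.log(1.0 / p, base) for p in probabilities if p > 0)

def gain_ratio(gain, subset_sizes):
    """GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, ..., |Ds|/|D|)."""
    total = sum(subset_sizes)
    return gain / entropy([s / total for s in subset_sizes])

# Gender split of the height data: 9 females, 6 males, gain ~0.0969.
print(gain_ratio(0.0969, [9, 6]))   # ~0.33
```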
CART
• Creates a binary tree
• Uses entropy
• Formula to choose split point, s, for node t:
Φ(s|t) = 2 PL PR Σj | P(Cj | tL) – P(Cj | tR) |
• PL, PR: probability that a tuple in the training set
will be on the left or right side of the tree.
CART Example
• At the start, there are six choices for split
point (right branch on equality):
– P(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.288
– P(1.6) = 0
– P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
– P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
– P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.288
– P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
• Split at 1.8
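A sketch reproducing the values above; following the slide's arithmetic, the per-class terms are differences of class counts over the whole training set (|D| = 15), with counts ordered (Short, Medium, Tall) under Output1:

```python
def split_value(left_counts, right_counts, n):
    """2 * PL * PR * sum over classes of |left count - right count| / n,
    matching the arithmetic shown on the slide."""
    p_l, p_r = sum(left_counts) / n, sum(right_counts) / n
    diff = sum(abs(l - r) / n for l, r in zip(left_counts, right_counts))
    return 2 * p_l * p_r * diff

# (Short, Medium, Tall) counts on each side; right branch taken on equality.
print(split_value([3, 6, 0], [1, 2, 3], 15))   # Gender (F left, M right) ~0.288
print(split_value([2, 0, 0], [2, 8, 3], 15))   # split at 1.7             ~0.169
print(split_value([4, 1, 0], [0, 7, 3], 15))   # split at 1.8             ~0.385
print(split_value([4, 5, 0], [0, 3, 3], 15))   # split at 1.9             ~0.288
```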
Problem to Work On: Training Dataset
This follows an example from Quinlan's ID3.

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Output: A Decision Tree for "buys_computer"

age?
  <=30:   student?
            no:  buys_computer = no
            yes: buys_computer = yes
  31..40: buys_computer = yes
  >40:    credit_rating?
            excellent: buys_computer = no
            fair:      buys_computer = yes
Bayesian Classification: Why?
• Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to certain types
of learning problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct.
Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted
by their probabilities
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayesian Theorem: Basics
• Let X be a data sample whose class label is unknown
• Let H be a hypothesis that X belongs to class C
• For classification problems, determine P(H|X): the
probability that the hypothesis holds given the observed
data sample X
• P(H): prior probability of hypothesis H (i.e. the initial
probability before we observe any data, reflects the
background knowledge)
• P(X): probability that sample data is observed
• P(X|H) : probability of observing the sample X, given that
the hypothesis holds
Bayes Theorem (Recap)
• Given training data X, the posterior probability of a
hypothesis H, P(H|X), follows the Bayes theorem:
P(H|X) = P(X|H) P(H) / P(X)
• MAP (maximum a posteriori) hypothesis:
hMAP = argmax over h in H of P(h|D) = argmax over h in H of P(D|h) P(h)
• Practical difficulty: requires initial knowledge of
many probabilities, significant computational cost,
insufficient data
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent:
P(X | Ci) = Π P(xk | Ci), product over k = 1, …, n
• The probability of observing, say, two elements x1 and x2
together, given that the current class is Ci, is the product of the
probabilities of each element taken separately, given the
same class: P([x1, x2] | Ci) = P(x1 | Ci) * P(x2 | Ci)
• No dependence relation between attributes
• Greatly reduces the computation cost: only count the class
distribution.
• Once the probability P(X|Ci) is known, assign X to the
class with maximum P(X|Ci)*P(Ci)
Training dataset

Classes:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayesian Classifier: Example
Compute P(X|Ci) for each class:
P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4
X = (age<=30, income=medium, student=yes, credit_rating=fair)
P(X|Ci):
P(X|buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer="no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
Multiply by the priors P(Ci) and we can conclude that
X belongs to class "buys_computer=yes"
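A short sketch reproducing the computation above from the 14-tuple buys_computer training data:

```python
# Naive Bayes on the buys_computer data: count per-attribute conditional
# probabilities within each class, multiply them, and compare P(X|Ci)*P(Ci).
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")          # the sample X from the slide

for c in ("yes", "no"):
    in_class = [r for r in rows if r[-1] == c]
    p_c = len(in_class) / len(rows)            # prior P(Ci)
    p_x_given_c = 1.0                          # P(X|Ci) as a product of conditionals
    for k, value in enumerate(x):
        p_x_given_c *= sum(1 for r in in_class if r[k] == value) / len(in_class)
    print(c, round(p_x_given_c, 3), round(p_x_given_c * p_c, 4))
# yes 0.044 0.0282   -> X is assigned to buys_computer = "yes"
# no  0.019 0.0069
```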
Naïve Bayesian Classifier: Comments
• Advantages:
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of accuracy
– Practically, dependencies exist among variables
– E.g., hospital patient data: Profile (age, family history, etc.),
Symptoms (fever, cough, etc.), Disease (lung cancer, diabetes, etc.)
– Dependencies among these cannot be modeled by the Naïve Bayesian
Classifier
• How to deal with these dependencies?
– Bayesian Belief Networks
Classification Using Distance
• Place items in class to which they are
“closest”.
• Must determine distance between an item
and a class.
• Classes represented by
– Centroid: Central value (see the sketch after this list).
– Medoid: Representative point.
– Individual points
• Algorithm: KNN
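A minimal sketch of the centroid option above: represent each class by the mean of its members and assign a new item to the class with the nearest centroid (1-D heights from the height example, Output1 classes):

```python
# Represent each class by its centroid (mean height) and assign a new item
# to the class whose centroid is closest (illustrative 1-D version).
classes = {
    "Short":  [1.6, 1.7, 1.6, 1.7],
    "Medium": [1.9, 1.88, 1.85, 1.8, 1.95, 1.9, 1.8, 1.75],
    "Tall":   [2.0, 2.2, 2.1],
}
centroids = {c: sum(v) / len(v) for c, v in classes.items()}

def classify(height):
    return min(centroids, key=lambda c: abs(centroids[c] - height))

print(classify(1.78))   # -> "Medium"
```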
K Nearest Neighbor (KNN):
• Training set includes classes.
• Examine the K items nearest to the item to be classified.
• New item is placed in the class that holds the largest
number of these close items.
• O(q) for each tuple to be classified. (Here q
is the size of the training set.)
KNN
KNN Algorithm
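A hedged sketch of the basic KNN procedure described above: find the K training items nearest to the new item and take a majority vote (heights and Output1 labels from the height example; K = 3 assumed):

```python
from collections import Counter

# Training set: (height, class) pairs from the height example (Output1 labels).
train = [(1.6, "Short"), (2.0, "Tall"), (1.9, "Medium"), (1.88, "Medium"),
         (1.7, "Short"), (1.85, "Medium"), (1.6, "Short"), (1.7, "Short"),
         (2.2, "Tall"), (2.1, "Tall"), (1.8, "Medium"), (1.95, "Medium"),
         (1.9, "Medium"), (1.8, "Medium"), (1.75, "Medium")]

def knn(new_height, k=3):
    # Examine the K items nearest to the new item (O(q) distance computations).
    nearest = sorted(train, key=lambda item: abs(item[0] - new_height))[:k]
    # Place the new item in the class holding the most of those K neighbours.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn(1.94))   # -> "Medium" with K = 3
```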