CS690L
Data Mining: Classification (2)
Reference:
J. Han and M. Kamber, Data Mining: Concepts and Techniques
Bayesian Classification
• Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
• Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem: Basics
• Let X be a data sample whose class label is unknown
• Let H be a hypothesis that X belongs to class C
• For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
• P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; reflects the background knowledge)
• P(X): probability that the sample data is observed
• P(X|H): probability of observing the sample X, given that the hypothesis holds
Bayesian Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H|X) = P(X|H) P(H) / P(X)
• Informally, this can be written as
  posterior = likelihood x prior / evidence
• MAP (maximum a posteriori) hypothesis: choose the hypothesis that maximizes the posterior
• Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
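For reference, the rule and the MAP choice can be written out explicitly; this is the standard formulation and simply restates the informal "posterior = likelihood x prior / evidence" line above.

```latex
% Bayes' rule: posterior = likelihood x prior / evidence
\[
  P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}
\]
% MAP hypothesis: the evidence P(X) is the same for every H, so it can be dropped
\[
  h_{\mathrm{MAP}} = \arg\max_{H} P(H \mid X) = \arg\max_{H} P(X \mid H)\, P(H)
\]
```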
Naïve Bayes Classifier
• A simplifying assumption: attributes are conditionally independent given the class:
  P(X | Ci) = ∏_{k=1}^{n} P(xk | Ci)
• The probability of observing, say, two attribute values x1 and x2 together, given that the class is Ci, is the product of the probabilities of each value taken separately given the same class: P([x1, x2] | Ci) = P(x1 | Ci) * P(x2 | Ci)
• No dependence relation between attributes is assumed
• Greatly reduces the computation cost: only the per-class value counts are needed
• Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci)
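As an illustration of how little bookkeeping this requires, here is a minimal sketch of a categorical naïve Bayes classifier (my own code, not from the slides; names such as train_naive_bayes are made up, and smoothing of zero counts is ignored):

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, class_attr):
    """Count class frequencies for P(Ci) and per-class value frequencies for P(xk | Ci)."""
    class_counts = Counter(r[class_attr] for r in records)
    cond_counts = defaultdict(Counter)          # (class, attribute) -> Counter of values
    for r in records:
        c = r[class_attr]
        for attr, value in r.items():
            if attr != class_attr:
                cond_counts[(c, attr)][value] += 1
    return class_counts, cond_counts

def classify(x, class_counts, cond_counts):
    """Assign x to the class Ci maximizing P(X | Ci) * P(Ci)."""
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, c_count in class_counts.items():
        score = c_count / n                                   # prior P(Ci)
        for attr, value in x.items():
            score *= cond_counts[(c, attr)][value] / c_count  # P(xk | Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

With the buys_computer table from the next slide loaded as a list of dicts called records, classify({'age': '<=30', 'income': 'medium', 'student': 'yes', 'credit_rating': 'fair'}, *train_naive_bayes(records, 'buys_computer')) would return 'yes'.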
Training dataset
Class:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'
Data sample:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Naïve Bayesian Classifier: Example
• Compute P(X|Ci) for each class
  P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
  P(age="<=30" | buys_computer="no") = 3/5 = 0.6
  P(income="medium" | buys_computer="yes") = 4/9 = 0.444
  P(income="medium" | buys_computer="no") = 2/5 = 0.4
  P(student="yes" | buys_computer="yes") = 6/9 = 0.667
  P(student="yes" | buys_computer="no") = 1/5 = 0.2
  P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
  P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4
• Class priors: P(buys_computer="yes") = 9/14 = 0.643, P(buys_computer="no") = 5/14 = 0.357
X = (age<=30, income=medium, student=yes, credit_rating=fair)
P(X|Ci):
  P(X|buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X|buys_computer="no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
  P(X|buys_computer="yes") * P(buys_computer="yes") = 0.028
  P(X|buys_computer="no") * P(buys_computer="no") = 0.007
X belongs to class "buys_computer=yes"
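The hand computation above can be reproduced with a short script (a sketch of my own that hard-codes the training table; it is not part of the slides):

```python
# Each row: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),        ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),        (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),       (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")            # the unseen sample X

for label in ("yes", "no"):
    rows = [r for r in data if r[-1] == label]
    prior = len(rows) / len(data)                # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(x):                # P(xk | Ci) from simple counts
        likelihood *= sum(1 for r in rows if r[k] == value) / len(rows)
    print(label, round(likelihood, 3), round(likelihood * prior, 3))
# prints: yes 0.044 0.028  and  no 0.019 0.007  ->  X is classified as buys_computer = "yes"
```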
Naïve Bayes: Continuous Value (1)
Outlook    Temp (ºF)   Humidity   Windy   Class
sunny      85          85         false   Don't Play
sunny      80          90         true    Don't Play
overcast   83          86         false   Play
rainy      70          96         false   Play
rainy      68          80         false   Play
rainy      65          70         true    Don't Play
overcast   64          65         true    Play
sunny      72          95         false   Don't Play
sunny      69          70         false   Play
rainy      75          80         false   Play
sunny      75          70         true    Play
overcast   72          90         true    Play
overcast   81          75         false   Play
rainy      71          91         true    Don't Play
Naïve Bayes: Continuous Value (2)
Outlook        yes   no        Windy        yes   no        Play   yes   no
  sunny         2     3          false        6     2                9     5
  overcast      4     0          true         3     3
  rainy         3     2

Temperature values   yes: 83, 70, 68, 64, 69, 75, 75, 72, 81
                     no:  85, 80, 65, 72, 71
Humidity values      yes: 86, 96, 80, 65, 70, 80, 70, 90, 75
                     no:  85, 90, 70, 95, 91
Naïve Bayes: Continuous Value (3)
Outlook        yes    no        Windy        yes    no        Play   yes    no
  sunny        2/9    3/5         false       6/9    2/5              9/14   5/14
  overcast     4/9    0/5         true        3/9    3/5
  rainy        3/9    2/5

Temperature   μ: yes = 73,   no = 74.6      σ: yes = 6.2,   no = 7.9
Humidity      μ: yes = 79.1, no = 86.2      σ: yes = 10.2,  no = 9.7

(μ: mean, σ: standard deviation)
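The Gaussian parameters in this table can be checked with Python's statistics module (my own quick check, not from the slides); note that the 6.2 and 7.9 figures correspond to the sample standard deviation, i.e. the n−1 formula:

```python
from statistics import mean, stdev   # stdev uses the n-1 (sample) formula

temp_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
temp_no  = [85, 80, 65, 72, 71]

print(mean(temp_yes), round(stdev(temp_yes), 1))            # 73, 6.2
print(round(mean(temp_no), 1), round(stdev(temp_no), 1))    # 74.6, 7.9
```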
Naïve Bayes: Continuous Value (4)
Outlook   Temp   Humidity   Windy   Class
sunny     66     90         true    ?

• P(Class = yes | Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true)
• P(Class = no  | Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true)
Gaussian (Normal) Density Function
f(x) = (1 / (√(2π) · σ)) · e^( −(x − μ)² / (2σ²) )
Naïve Bayes: Continuous Value (5)
f(Temp = 66 | Class = yes) = (1 / (√(2π) · 6.2)) · e^( −(66 − 73)² / (2 · 6.2²) ) = 0.0340

f(Humidity = 90 | Class = yes) = 0.0221

Pyes = [ 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 ] / P(Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true)
     = 0.000036 / P(Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true)

(the factors in the numerator are P(sunny | yes) × f(Temp = 66 | yes) × f(Humidity = 90 | yes) × P(Windy = true | yes) × P(yes))
Naïve Bayes: Continuous Value (6)
 0.0291 0.0380  53  145
Pno 

P(Outlook  sunny  Temp  66  Humidity  90  Windy  true)
3
5

0.000136
P(Outlook  sunny  Temp  cool  Humidity  high  Windy  true)
Pyes  20.9%
Pno  79.1%
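Putting the pieces together, the following sketch (my own code, using the counts and Gaussian parameters from the tables above) reproduces the split; it prints roughly 21% for yes and 79% for no, matching the slide's 20.9% / 79.1% up to the rounding of the intermediate densities:

```python
from math import sqrt, pi, exp

def gaussian(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# Class-conditional statistics taken from the tables above
stats = {
    "yes": {"p_sunny": 2/9, "p_windy_true": 3/9, "prior": 9/14,
            "temp": (73.0, 6.2), "humidity": (79.1, 10.2)},
    "no":  {"p_sunny": 3/5, "p_windy_true": 3/5, "prior": 5/14,
            "temp": (74.6, 7.9), "humidity": (86.2, 9.7)},
}

# New day: Outlook = sunny, Temp = 66, Humidity = 90, Windy = true
scores = {}
for label, s in stats.items():
    scores[label] = (s["p_sunny"]
                     * gaussian(66, *s["temp"])
                     * gaussian(90, *s["humidity"])
                     * s["p_windy_true"]
                     * s["prior"])

total = sum(scores.values())
for label, score in scores.items():
    print(label, round(100 * score / total, 1), "%")   # ≈ 21% yes, ≈ 79% no
```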
Naïve Bayesian Classifier: Evaluation
• Advantages:
  – Easy to implement
  – Good results are obtained in most cases
• Disadvantages:
  – The class-conditional independence assumption causes a loss of accuracy
  – In practice, dependencies exist among variables
  – E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
  – Dependencies among these cannot be modeled by the naïve Bayesian classifier
• How to deal with these dependencies?
– Bayesian Belief Networks
Bayesian Networks
• A Bayesian belief network allows a subset of the variables to be conditionally independent
• A graphical model of causal relationships
  – Represents dependencies among the variables
  – Gives a specification of the joint probability distribution

[Figure: a small directed acyclic graph over the nodes X, Y, Z, and P]
  – Nodes: random variables
  – Links: dependency
  – X and Y are the parents of Z, and Y is the parent of P
  – There is no dependency between Z and P
  – The graph has no loops or cycles
Bayesian Belief Network: An Example
[Figure: a belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer]

The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of values of its parents:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC       0.8       0.5        0.7        0.1
~LC      0.2       0.5        0.3        0.9

The joint probability of any assignment (z1, …, zn) factors over the network:

  P(z1, …, zn) = ∏_{i=1}^{n} P(zi | Parents(Zi))
Learning Bayesian Networks
• Several cases
  – Given both the network structure and all variables observable: learn only the CPTs
  – Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning
  – Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology
  – Unknown structure, all hidden variables: no good algorithms known for this purpose
Prediction
• Prediction is similar to classification
  – First, construct a model
  – Second, use the model to predict unknown values
• The major method for prediction is regression
  – Linear and multiple regression
  – Non-linear regression
• Prediction is different from classification
  – Classification predicts a categorical class label
  – Prediction models continuous-valued functions
Predictive Modeling in Databases
• Predictive modeling: predict data values or construct generalized linear models based on the database data
• One can only predict value ranges or category distributions
• Method outline:
  – Minimal generalization
  – Attribute relevance analysis
  – Generalized linear model construction
  – Prediction
• Determine the major factors which influence the prediction
  – Data relevance analysis: uncertainty measurement, entropy analysis, expert judgment, etc.
• Multi-level prediction: drill-down and roll-up analysis
Regression Analysis and Log-Linear Models in Prediction
• Linear regression: Y = α + β X
  – The two parameters, α and β, specify the line and are to be estimated using the data at hand
  – They are fitted by applying the least squares criterion to the known values Y1, Y2, …, X1, X2, …
• Multiple regression: Y = b0 + b1 X1 + b2 X2
  – Many nonlinear functions can be transformed into the above
• Log-linear models:
  – The multi-way table of joint probabilities is approximated by a product of lower-order tables
  – Probability: p(a, b, c, d) = αab βac χad δbcd
Regression
• Method of Least Squares
• Principle: minimize Σ_{i=1}^{n} [ Ŷ(Xi) − Yi ]²
• Assume Ŷ(Xi) = a + b·Xi   (linear relationship)
• Minimize Σ_{i=1}^{n} [ Yi − (a + b·Xi) ]²
Linear regression
• Minimizing w.r.t. "a":
  Σ_{i=1}^{n} Yi = n·a + b·Σ_{i=1}^{n} Xi                          (1)
• Minimizing w.r.t. "b":
  Σ_{i=1}^{n} Xi·Yi = a·Σ_{i=1}^{n} Xi + b·Σ_{i=1}^{n} Xi²         (2)
Linear regression
• Solving (1) and (2) we get:
  b = ( n·ΣXY − ΣX·ΣY ) / ( n·ΣX² − (ΣX)² )
  a = (1/n)·( ΣY − b·ΣX )
Linear Regression: Example
X (years)   Y (Salary)
3           30
8           57
9           64
13          72
3           36
6           43
11          59
21          90
1           20
16          83

Ȳ = (1/n)·ΣY = 55.4
X̄ = (1/n)·ΣX = 9.1

b = ( n·ΣXY − ΣX·ΣY ) / ( n·ΣX² − (ΣX)² ) = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
a = Ȳ − b·X̄

b = [ (3 − 9.1)(30 − 55.4) + (8 − 9.1)(57 − 55.4) + … + (16 − 9.1)(83 − 55.4) ]
    / [ (3 − 9.1)² + (8 − 9.1)² + … + (16 − 9.1)² ]
  ≈ 3.5

a = 55.4 − (3.5)(9.1) = 23.6
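The slope and intercept can be double-checked with a short script (my own sketch, hard-coding the table above) that implements the least-squares formulas from the previous slide; it prints b ≈ 3.54 and a ≈ 23.2 — the slide's 23.6 comes from rounding b to 3.5 before computing a:

```python
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]          # years of experience
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]    # salary

n = len(xs)
x_bar = sum(xs) / n                              # 9.1
y_bar = sum(ys) / n                              # 55.4

# Least-squares estimates: b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² ,  a = ȳ - b·x̄
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

print(round(b, 2), round(a, 2))                  # 3.54 23.21
```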
Linear regression
Determine the linear regression line, with Y linearly dependent on X.