Statical Inference
Download
Report
Transcript Statical Inference
Statistical Inference (By Michael Jordon)
Bayesian perspective
– conditional perspective—inferences should be made conditional
on the current data
– natural in the setting of a long-term project with a domain expert
– the optimist: let’s make the best possible use of our sophisticated
inferential tool
Frequentist perspective
– unconditional perspective—inferential methods should give good
answers in repeated use
– natural in the setting of writing software that will be used by many
people with many data sets
– the pessimist: let’s protect ourselves against bad decisions given
that our inferential procedure is inevitably based on a
simplification of reality
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Bayes Classifier
A probabilistic framework for solving classification
problems
Conditional Probability:
P ( A, C )
P (C | A)
P ( A)
P ( A, C )
P( A | C )
P (C )
Bayes theorem:
P( A | C ) P(C )
P(C | A)
P( A)
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Example of Bayes Theorem
Given:
– A doctor knows that C causes A 50% of the time
– Prior probability of any patient having C is 1/50,000
– Prior probability of any patient having A is 1/20
If a patient has A, what’s the probability he/she
has C?
P( A | C ) P(C ) 0.5 1 / 50000
P(C | A)
0.0002
P( A)
1 / 20
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Bayesian Classifiers
Consider each attribute and class label as random
variables
Given a record with attributes (A1, A2,…,An)
– Goal is to predict class C
– Specifically, we want to find the value of C that
maximizes P(C| A1, A2,…,An )
Can we estimate P(C| A1, A2,…,An ) directly from
data?
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Bayesian Classifiers
Approach:
– compute the posterior probability P(C | A1, A2, …, An) for
all values of C using the Bayes theorem
P (C | A A A )
1
2
n
P ( A A A | C ) P (C )
P( A A A )
1
2
n
1
2
n
– Choose value of C that maximizes
P(C | A1, A2, …, An)
– Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
How to estimate P(A1, A2, …, An | C )?
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Naïve Bayes Classifier
Assume independence among attributes Ai when class is
given:
– P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj)
– Can estimate P(Ai| Cj) for all Ai and Cj.
– New point is classified to Cj if P(Cj) P(Ai| Cj) is
maximal.
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Training dataset
age
Class:
<=30
C1:buys_computer= <=30
‘yes’
30…40
C2:buys_computer= >40
>40
‘no’
>40
31…40
Data sample
<=30
X =(age<=30,
<=30
Income=medium,
>40
Student=yes
<=30
Credit_rating=
31…40
Fair)
31…40
>40
© Tan,Steinbach, Kumar
income student credit_rating
high
no fair
high
no excellent
high
no fair
medium
no fair
low
yes fair
low
yes excellent
low
yes excellent
medium
no fair
low
yes fair
medium
yes fair
medium
yes excellent
medium
no excellent
high
yes fair
medium
no excellent
Introduction to Data Mining
buys_computer
no
no
yes
yes
yes
no
yes
no
yes
yes
yes
yes
yes
no
4/18/2004
‹#›
Naïve Bayesian Classifier: Example
Compute P(X|Ci) for each class
P(age=“<30” | buys_computer=“yes”) = 2/9=0.222
P(age=“<30” | buys_computer=“no”) = 3/5 =0.6
P(income=“medium” | buys_computer=“yes”)= 4/9 =0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes)= 6/9 =0.667
P(student=“yes” | buys_computer=“no”)= 1/5=0.2
P(credit_rating=“fair” | buys_computer=“yes”)=6/9=0.667
P(credit_rating=“fair” | buys_computer=“no”)=2/5=0.4
X=(age<=30 , income =medium, student=yes, credit_rating=fair)
P(X|Ci) : P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 =0.019
P(X|Ci)*P(Ci ) : P(X|buys_computer=“yes”) * P(buys_computer=“yes”) = 0.028
P(X|buys_computer=“no”) * P(buys_computer=“no”) = 0.007
X belongs to class “buys_computer=yes”
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Naïve Bayes Classifier
If one of the conditional probability is zero, then
the entire expression becomes zero.
Probability estimation:
N ic
Original : P( Ai | C )
Nc
c: number of classes
N ic 1
Laplace : P( Ai | C )
Nc c
m: parameter (equivalent sample size)
p: prior probability
N ic mp
m - estimate : P( Ai | C )
Nc m
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Naïve Bayes (Summary)
Robust to isolated noise points because such
points are averaged out.
Handle missing values by ignoring the instance
during probability estimate calculations
Robust to irrelevant attributes. If Ai is irrelevant,
then P(Ai | Y) becomes almost uniformly
distributed.
Independence assumption may not hold for some
attributes
– Use other techniques such as Bayesian Belief
Networks (BBN)
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Instance-Based Classifiers
Set of Stored Cases
Atr1
……...
AtrN
Class
A
• Store the training records
• Use training records to
predict the class label of
unseen cases
B
B
C
A
Unseen Case
Atr1
……...
AtrN
C
B
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Nearest Neighbor Classifiers
Basic idea:
– If it walks like a duck, quacks like a duck, then
it’s probably a duck
Compute
Distance
Training
Records
© Tan,Steinbach, Kumar
Test
Record
Choose k of the
“nearest” records
Introduction to Data Mining
4/18/2004
‹#›
Nearest-Neighbor Classifiers
Unknown record
Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
To classify an unknown record:
– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Nearest Neighbor Classification
Compute distance between two points:
– Euclidean distance
d ( p, q )
( pi
i
q )
2
i
Determine the class from nearest neighbor list
– take the majority vote of class labels among
the k-nearest neighbors
– Weigh the vote according to distance
weight factor, w = 1/d2
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Definition of Nearest Neighbor
X
(a) 1-nearest neighbor
X
X
(b) 2-nearest neighbor
(c) 3-nearest neighbor
K-nearest neighbors of a record x are data points
that have the k smallest distance to x
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Nearest Neighbor Classification…
Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes
X
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
K-nearest neighbors
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Distance functions
Dsum(A,B) = Dgender(A,B) + Dage(A,B) + Dsalary(A,B)
Dnorm(A,B) = Dsum(A,B)/max(Dsum)
Deuclid(A,B) = sqrt(Dgender(A,B)2 + Dage(A,B)2 + Dsalary(A,B)2
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Remarks on Lazy vs. Eager Learning
Instance-based learning: lazy evaluation
Decision-tree and Bayesian classification: eager evaluation
Key differences
– Lazy method may consider query instance xq when deciding how to
generalize beyond the training data D
– Eager method cannot since they have already chosen global
approximation when seeing the query
Efficiency: Lazy - less time training but more time predicting
Accuracy
– Lazy method effectively uses a richer hypothesis space since it uses
many local linear functions to form its implicit global approximation
to the target function
– Eager: must commit to a single hypothesis that covers the entire
instance space
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›