Classification II
EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall, 2006
The UNIVERSITY of Kansas
Overview
KNN
Naïve Bayes
FLD
SVM
Classification (i.e. Discrimination)
CS version:
Pattern Recognition
Artificial Intelligence
Neural Networks
Data Mining
Machine Learning
Classification (i.e. discrimination)
Statistics:
Model classes with probability distributions
Use these to study class differences & find rules
CS:
Data are just sets of numbers
Rules distinguish between these
Current thought: combine these
Classification Basics
Point Clouds
Instance-Based Classifiers
[Figure: a set of stored cases (attributes Atr1 … AtrN plus a class label A/B/C) and an unseen case with attributes Atr1 … AtrN to be labeled]
• Store the training records
• Use the training records to predict the class label of unseen cases
Instance Based Classifiers
Examples:
Rote-learner
Memorizes entire training data and performs classification
only if attributes of record match one of the training examples
exactly
Nearest neighbor
Uses k “closest” points (nearest neighbors) for performing
classification
Nearest Neighbor Classifiers
Basic idea:
If it walks like a duck, quacks like a duck, then it’s probably a
duck
[Figure: compute the distance from the test record to the training records, then choose the k "nearest" records]
Nearest-Neighbor Classifiers
Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
– Compute the distance to the training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
Definition of Nearest Neighbor
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
1-Nearest Neighbor
Voronoi Diagram
Nearest Neighbor Classification
Compute distance between two points:
Euclidean distance
$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
Determine the class from nearest neighbor list
take the majority vote of class labels among the k-nearest
neighbors
Weigh the vote according to distance
weight factor, w = 1/d²
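Below is a minimal sketch of the k-NN rule just described, using the Euclidean distance and an optional 1/d² vote weighting; the function name and the toy data are illustrative, not from the slides.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3, weighted=False):
    """Classify x_test by majority (or distance-weighted) vote of its k nearest neighbors."""
    # Euclidean distance d(p, q) = sqrt(sum_i (p_i - q_i)^2) to every training record
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]               # indices of the k closest records
    votes = Counter()
    for i in nearest:
        w = 1.0 / (dists[i] ** 2 + 1e-12) if weighted else 1.0   # weight factor w = 1/d^2
        votes[y_train[i]] += w
    return votes.most_common(1)[0][0]

# toy example (illustrative data)
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y = np.array(["A", "A", "B", "B"])
print(knn_predict(X, y, np.array([1.1, 0.9]), k=3))   # -> "A"
```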
Nearest Neighbor Classification…
Choosing the value of k:
If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from other
classes
Nearest Neighbor Classification…
Scaling issues
Attributes may have to be scaled to prevent distance measures
from being dominated by one of the attributes
Example:
height of a person may vary from 1.5m to 1.8m
weight of a person may vary from 90lb to 300lb
income of a person may vary from $10K to $1M
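As a sketch of one common remedy (an assumption here, since the slide only notes that scaling may be needed), each attribute can be rescaled to [0, 1] before distances are computed:

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column (attribute) of X to the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns

# height (m), weight (lb), income ($): wildly different ranges
X = np.array([[1.5,  90.0,    10_000.0],
              [1.8, 300.0, 1_000_000.0]])
print(min_max_scale(X))   # every attribute now contributes comparably to a distance
```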
Nearest Neighbor Classification…
Problem with Euclidean measure:
High dimensional data
curse of dimensionality
Can produce counter-intuitive results, e.g. for bit vectors:
d(111111111110, 011111111111) = 1.4142
d(100000000000, 000000000001) = 1.4142
Solution: normalize the vectors to unit length
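A short sketch of that fix, normalizing each vector to unit length; the vectors are the ones from the example above.

```python
import numpy as np

def unit_normalize(v):
    """Scale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)

a = np.array([1]*11 + [0], dtype=float)   # 111111111110
b = np.array([0] + [1]*11, dtype=float)   # 011111111111
c = np.array([1] + [0]*11, dtype=float)   # 100000000000
d = np.array([0]*11 + [1], dtype=float)   # 000000000001

print(np.linalg.norm(a - b), np.linalg.norm(c - d))            # both ~1.4142, counter-intuitive
print(np.linalg.norm(unit_normalize(a) - unit_normalize(b)))   # small: a and b are nearly parallel
print(np.linalg.norm(unit_normalize(c) - unit_normalize(d)))   # ~1.4142: c and d share nothing
```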
Nearest neighbor Classification…
k-NN classifiers are lazy learners
They do not build a model explicitly
Unlike eager learners such as decision tree induction
Classifying unknown records is relatively expensive
Example: PEBLS
PEBLS: Parallel Exemplar-Based Learning System (Cost
& Salzberg)
Works with both continuous and nominal features
For nominal features, distance between two nominal values is
computed using modified value difference metric (MVDM)
Each record is assigned a weight factor
Number of nearest neighbors: k = 1
Example: PEBLS
Distance between nominal attribute values:

Training data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single           70K            No
4    Yes     Married         120K            No
5    No      Divorced         95K            Yes
6    No      Married          60K            No
7    Yes     Divorced        220K            No
8    No      Single           85K            Yes
9    No      Married          75K            No
10   No      Single           90K            Yes

Marital Status counts by class:
            Single  Married  Divorced
Class=Yes     2        0        1
Class=No      2        4        1

Refund counts by class:
            Class=Yes  Class=No
Refund=Yes      0          3
Refund=No       3          4

Modified value difference metric:
$d(V_1, V_2) = \sum_i \left| \frac{n_{1i}}{n_1} - \frac{n_{2i}}{n_2} \right|$

Examples:
d(Single, Married)        = |2/4 - 0/4| + |2/4 - 4/4| = 1
d(Single, Divorced)       = |2/4 - 1/2| + |2/4 - 1/2| = 0
d(Married, Divorced)      = |0/4 - 1/2| + |4/4 - 1/2| = 1
d(Refund=Yes, Refund=No)  = |0/3 - 3/7| + |3/3 - 4/7| = 6/7
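A small sketch of the MVDM computation, using the per-class counts from the tables above; the helper name is illustrative.

```python
def mvdm(counts_v1, counts_v2):
    """Modified value difference metric: d(V1, V2) = sum_i |n1i/n1 - n2i/n2|,
    where counts_vk maps each class label to the count of value Vk in that class."""
    n1, n2 = sum(counts_v1.values()), sum(counts_v2.values())
    classes = set(counts_v1) | set(counts_v2)
    return sum(abs(counts_v1.get(c, 0) / n1 - counts_v2.get(c, 0) / n2) for c in classes)

# class counts (Cheat = Yes / No) for each attribute value, from the tables above
single, married, divorced = {"Yes": 2, "No": 2}, {"Yes": 0, "No": 4}, {"Yes": 1, "No": 1}
refund_yes, refund_no = {"Yes": 0, "No": 3}, {"Yes": 3, "No": 4}

print(mvdm(single, married))        # 1.0
print(mvdm(single, divorced))       # 0.0
print(mvdm(refund_yes, refund_no))  # 6/7 ~ 0.857
```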
Example: PEBLS
Tid  Refund  Marital Status  Taxable Income  Cheat
X    Yes     Single          125K            No
Y    No      Married         100K            No

Distance between record X and record Y:
$d(X, Y) = w_X w_Y \sum_{i=1}^{d} d(X_i, Y_i)^2$

where the weight factor is
$w_X = \frac{\text{Number of times X is used for prediction}}{\text{Number of times X predicts correctly}}$

$w_X \approx 1$ if X makes accurate predictions most of the time
$w_X > 1$ if X is not reliable for making predictions
Bayes Classifier
A probabilistic framework for solving classification
problems
Conditional probability:
$P(C \mid A) = \frac{P(A, C)}{P(A)}$,   $P(A \mid C) = \frac{P(A, C)}{P(C)}$

Bayes theorem:
$P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}$
Example of Bayes Theorem
Given:
A doctor knows that meningitis causes stiff neck 50% of the time
Prior probability of any patient having meningitis is 1/50,000
Prior probability of any patient having stiff neck is 1/20
If a patient has stiff neck, what’s the probability he/she has
meningitis?
$P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$
Bayesian Classifiers
Consider each attribute and class label as random variables
Given a record with attributes (A1, A2,…,An)
Goal is to predict class C
Specifically, we want to find the value of C that maximizes P(C|
A1, A2,…,An )
Can we estimate P(C| A1, A2,…,An ) directly from data?
Bayesian Classifiers
Approach:
compute the posterior probability P(C | A1, A2, …, An) for all
values of C using the Bayes theorem
$P(C \mid A_1, A_2, \ldots, A_n) = \frac{P(A_1, A_2, \ldots, A_n \mid C)\, P(C)}{P(A_1, A_2, \ldots, A_n)}$
Choose value of C that maximizes
P(C | A1, A2, …, An)
Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
How to estimate P(A1, A2, …, An | C )?
Naïve Bayes Classifier
Assume independence among attributes Ai when class is
given:
P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj)
Can estimate P(Ai| Cj) for all Ai and Cj.
A new point is classified as Cj if P(Cj) ∏i P(Ai | Cj) is maximal.
How to Estimate Probabilities from Data?

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class):

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single           70K            No
4    Yes     Married         120K            No
5    No      Divorced         95K            Yes
6    No      Married          60K            No
7    Yes     Divorced        220K            No
8    No      Single           85K            Yes
9    No      Married          75K            No
10   No      Single           90K            Yes

Class prior: $P(C) = N_c / N$
e.g., P(No) = 7/10, P(Yes) = 3/10

For discrete attributes:
$P(A_i \mid C_k) = \frac{|A_{ik}|}{N_{c_k}}$
where $|A_{ik}|$ is the number of instances having attribute value $A_i$ and belonging to class $C_k$

Examples:
P(Status=Married | No) = 4/7
P(Refund=Yes | Yes) = 0
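A minimal sketch of these counting estimates on the table above (Income omitted here); the names are illustrative.

```python
from collections import Counter, defaultdict

# (Refund, Marital Status, Evade) from the training table above
records = [("Yes","Single","No"), ("No","Married","No"), ("No","Single","No"),
           ("Yes","Married","No"), ("No","Divorced","Yes"), ("No","Married","No"),
           ("Yes","Divorced","No"), ("No","Single","Yes"), ("No","Married","No"),
           ("No","Single","Yes")]

class_counts = Counter(r[-1] for r in records)                  # N_c
prior = {c: n / len(records) for c, n in class_counts.items()}  # P(C) = N_c / N

cond = defaultdict(Counter)                                     # |A_ik| per (attribute index, class)
for refund, status, cls in records:
    cond[(0, cls)][refund] += 1
    cond[(1, cls)][status] += 1

print(prior)                                             # {'No': 0.7, 'Yes': 0.3}
print(cond[(1, "No")]["Married"] / class_counts["No"])   # P(Status=Married | No) = 4/7
print(cond[(0, "Yes")]["Yes"] / class_counts["Yes"])     # P(Refund=Yes | Yes) = 0.0
```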
How to Estimate Probabilities from Data?
For continuous attributes:
Discretize the range into bins
one ordinal attribute per bin
violates independence assumption
Two-way split: (A < v) or (A > v)
choose only one of the two splits as new attribute
Probability density estimation:
Assume attribute follows a normal distribution
Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
Once probability distribution is known, can use it to estimate
the conditional probability P(Ai|c)
How to Estimate Probabilities from Data?

(Continuous attributes; training table as on the previous slide: Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class.)

Normal distribution, one for each $(A_i, c_j)$ pair:
$P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}}\, e^{-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}}$

For (Income, Class=No):
If Class=No: sample mean = 110, sample variance = 2975

$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)}\, e^{-\frac{(120-110)^2}{2(2975)}} = 0.0072$
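A short sketch reproducing the Gaussian estimate above from the seven Class=No income values.

```python
import math

# Taxable Income for the seven Class=No records (in $K)
income_no = [125, 100, 70, 120, 60, 220, 75]

mu = sum(income_no) / len(income_no)                                 # sample mean = 110
var = sum((x - mu) ** 2 for x in income_no) / (len(income_no) - 1)   # sample variance = 2975

def normal_pdf(x, mu, var):
    """P(A_i | c_j) under the normal assumption."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(mu, var)                   # 110.0 2975.0
print(normal_pdf(120, mu, var))  # ~0.0072
```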
Naïve Bayes Classifier
If one of the conditional probabilities is zero, then the entire expression becomes zero
Probability estimation:

Original:    $P(A_i \mid C) = \frac{N_{ic}}{N_c}$
Laplace:     $P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + c}$
m-estimate:  $P(A_i \mid C) = \frac{N_{ic} + m\,p}{N_c + m}$

c: number of classes
p: prior probability
m: parameter
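A tiny sketch contrasting the three estimates for a zero-count case such as P(Refund=Yes | Yes) above; the m and p values are assumed purely for illustration.

```python
def original(n_ic, n_c):          return n_ic / n_c
def laplace(n_ic, n_c, c):        return (n_ic + 1) / (n_c + c)
def m_estimate(n_ic, n_c, m, p):  return (n_ic + m * p) / (n_c + m)

n_ic, n_c = 0, 3        # Refund=Yes never seen among the 3 Class=Yes records
print(original(n_ic, n_c))                # 0.0 -> wipes out the whole product
print(laplace(n_ic, n_c, c=2))            # (0 + 1) / (3 + 2) = 0.2
print(m_estimate(n_ic, n_c, m=3, p=0.3))  # (0 + 0.9) / (3 + 3) = 0.15  (assumed m, p)
```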
Naïve Bayes (Summary)
Robust to isolated noise points
Handle missing values by ignoring the instance during
probability estimate calculations
Robust to irrelevant attributes
Independence assumption may not hold for some
attributes
Use other techniques such as Bayesian Belief Networks (BBN)
Revisiting Naïve Bayes
Harder Example (slanted clouds):
Data Normalization
Rescale (standardize) coordinate axes
i.e. replace the (full) data matrix:
$X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{d1} & \cdots & x_{dn} \end{pmatrix}$
by
$\begin{pmatrix} 1/s_1 & & 0 \\ & \ddots & \\ 0 & & 1/s_d \end{pmatrix} X = \begin{pmatrix} x_{11}/s_1 & \cdots & x_{1n}/s_1 \\ \vdots & & \vdots \\ x_{d1}/s_d & \cdots & x_{dn}/s_d \end{pmatrix}$
Then do mean difference
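A small sketch of the rescaling step; using the per-coordinate standard deviation for $s_j$ is an assumption, and a mean-difference (nearest-centroid) rule would then be applied in the rescaled coordinates.

```python
import numpy as np

def rescale_rows(X):
    """X is d x n (rows = coordinates, columns = data vectors).
    Divide each row by a scale s_j so all coordinates become comparable."""
    s = X.std(axis=1, keepdims=True)        # assumed choice of s_j: per-coordinate std
    return X / np.where(s > 0, s, 1.0)

X = np.array([[1.5, 1.8, 1.6],              # heights (m)
              [90.0, 300.0, 150.0]])        # weights (lb)
print(rescale_rows(X))                      # a mean-difference rule can now be applied
```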
A Toy 2D Case
Why Covariance is Important
“Orientation” of the distribution is captured by covariance
Which one is red: + or −?
Fisher Linear Discrimination
Useful notation (data vectors of length d ):
Class +1:  $X_1^{(+1)}, \ldots, X_{n_{+1}}^{(+1)}$      Class -1:  $X_1^{(-1)}, \ldots, X_{n_{-1}}^{(-1)}$

Centerpoints:
$\bar{X}^{(+1)} = \frac{1}{n_{+1}} \sum_{i=1}^{n_{+1}} X_i^{(+1)}$   and   $\bar{X}^{(-1)} = \frac{1}{n_{-1}} \sum_{i=1}^{n_{-1}} X_i^{(-1)}$
Based on S. Marron
Fisher Linear Discrimination
Covariances (outer products):
$\hat{\Sigma}^{(k)} = \tilde{X}^{(k)} \tilde{X}^{(k)t}$   for   $k = +1, -1$

Based on the centered, normalized data matrices:
$\tilde{X}^{(k)} = \frac{1}{\sqrt{n_k}} \left( X_1^{(k)} - \bar{X}^{(k)}, \ldots, X_{n_k}^{(k)} - \bar{X}^{(k)} \right)$
Note: use “MLE” version of estimated covariance matrices,
for simpler notation
Assumption
Major Assumption:
Class covariances are the same (or “similar”)
Like this:
Not this:
Fisher Linear Discrimination
Good estimate of the (common) within-class covariance?
Pooled (weighted average) within-class covariance:
$\hat{\Sigma}^{w} = \frac{n_{+1}\hat{\Sigma}^{(+1)} + n_{-1}\hat{\Sigma}^{(-1)}}{n_{+1} + n_{-1}} = \tilde{X}\tilde{X}^{t}$

based on the combined full data matrix:
$\tilde{X} = \frac{1}{\sqrt{n}} \left( \sqrt{n_{+1}}\,\tilde{X}^{(+1)}, \; \sqrt{n_{-1}}\,\tilde{X}^{(-1)} \right)$
Fisher Linear Discrimination
Simple way to find the "correct covariance adjustment":
Individually transform the subpopulations so they are "spherical" about their means
For $k = +1, -1$ define:
$Y_i^{(k)} = \hat{\Sigma}_w^{-1/2} X_i^{(k)}$
Fisher Linear Discrimination
In the transformed space, the best separating hyperplane is the perpendicular bisector of the line between the means
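A compact sketch of the FLD recipe from these slides: pool the within-class covariances, then classify by the perpendicular bisector of the line between the means in the sphered coordinates (written below as the equivalent rule in the original coordinates). The data and names are illustrative.

```python
import numpy as np

def fld_fit(X_pos, X_neg):
    """X_pos, X_neg: (n_k x d) samples of class +1 / -1.
    Returns (w, b) so that sign(w @ x + b) is the FLD rule."""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    n_p, n_n = len(X_pos), len(X_neg)
    # pooled ("MLE"-style) within-class covariance
    cov_p = np.cov(X_pos, rowvar=False, bias=True)
    cov_n = np.cov(X_neg, rowvar=False, bias=True)
    cov_w = (n_p * cov_p + n_n * cov_n) / (n_p + n_n)
    w = np.linalg.solve(cov_w, mu_p - mu_n)     # direction Sigma_w^{-1} (mu_+ - mu_-)
    b = -w @ (mu_p + mu_n) / 2                  # bisector of the line between the means
    return w, b

rng = np.random.default_rng(0)
X_pos = rng.normal([2, 2], 1.0, size=(50, 2))
X_neg = rng.normal([0, 0], 1.0, size=(50, 2))
w, b = fld_fit(X_pos, X_neg)
print(np.sign(w @ np.array([2.0, 2.0]) + b))    # +1 side
```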
Fisher Linear Discrimination
Relationship to Mahalanobis distance
Idea: For $X_1, X_2 \sim N(\mu, \Sigma)$, a natural distance measure is:
$d_M(X_1, X_2) = \left[ (X_1 - X_2)^t \,\Sigma^{-1}\, (X_1 - X_2) \right]^{1/2}$

"unit free", i.e. "standardized"
essentially mods out the covariance structure
Euclidean distance applied to $\Sigma^{-1/2} X_1$ and $\Sigma^{-1/2} X_2$
Same as the key transformation for FLD
I.e., FLD is the mean difference in Mahalanobis space
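A small sketch of the Mahalanobis distance, checking that it matches the Euclidean distance between the $\Sigma^{-1/2}$-transformed points; the covariance matrix here is illustrative.

```python
import numpy as np

def mahalanobis(x1, x2, cov):
    """d_M(x1, x2) = [ (x1 - x2)^t Sigma^{-1} (x1 - x2) ]^{1/2}"""
    diff = x1 - x2
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

cov = np.array([[2.0, 0.8], [0.8, 1.0]])             # illustrative covariance
x1, x2 = np.array([1.0, 2.0]), np.array([3.0, 0.5])

# Sigma^{-1/2} via the eigendecomposition of the symmetric covariance
vals, vecs = np.linalg.eigh(cov)
cov_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T

print(mahalanobis(x1, x2, cov))
print(np.linalg.norm(cov_inv_sqrt @ (x1 - x2)))      # same value: Euclidean dist. after transform
```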
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines
B1
One Possible Solution
Support Vector Machines
B2
Another possible solution
Support Vector Machines
B2
Other possible solutions
Support Vector Machines
B1
B2
Which one is better? B1 or B2?
How do you define better?
Support Vector Machines
[Figure: decision boundaries B1 and B2 with their margins (b11–b12 and b21–b22)]
Find the hyperplane that maximizes the margin => B1 is better than B2
Support Vector Machines
[Figure: decision boundary B1 with margin hyperplanes b11 and b12]

$\vec{w} \cdot \vec{x} + b = 0$   (decision boundary)
$\vec{w} \cdot \vec{x} + b = +1$   and   $\vec{w} \cdot \vec{x} + b = -1$   (margin hyperplanes)

$f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \le -1 \end{cases}$

$\text{Margin} = \frac{2}{\|\vec{w}\|^2}$
Support Vector Machines
We want to maximize:
$\text{Margin} = \frac{2}{\|\vec{w}\|^2}$

Which is equivalent to minimizing:
$L(\vec{w}) = \frac{\|\vec{w}\|^2}{2}$

But subject to the following constraints:
$f(\vec{x}_i) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \le -1 \end{cases}$
This is a constrained optimization problem
– Numerical approaches to solve it (e.g., quadratic
programming)
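As a hedged illustration (scikit-learn is an assumption here, not the course's tooling), a quadratic-programming-based solver returns $\vec{w}$ and $b$ for a separable toy problem; the printed 2/||w|| is the geometric width of the band between the two margin hyperplanes.

```python
import numpy as np
from sklearn.svm import SVC

# linearly separable toy data (illustrative)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [6.0, 6.0], [7.0, 5.5], [6.5, 7.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)      # very large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width = 2 / ||w|| =", 2 / np.linalg.norm(w))
print(clf.predict([[2.0, 2.0], [6.0, 6.0]]))     # -> [-1, 1]
```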
Support Vector Machines
What if the problem is not linearly separable?
Support Vector Machines
What if the problem is not linearly separable?
Introduce slack variables
Need to minimize:
$L(\vec{w}) = \frac{\|\vec{w}\|^2}{2} + C \left( \sum_{i=1}^{N} \xi_i \right)^{k}$

Subject to:
$f(\vec{x}_i) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \le -1 + \xi_i \end{cases}$
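For the non-separable case, a hedged sketch of the role of C in the slack-variable objective above, again using scikit-learn as an assumed stand-in for the QP solver; the overlapping data are generated for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.2, size=(40, 2)), rng.normal(2.5, 1.2, size=(40, 2))])
y = np.array([-1] * 40 + [1] * 40)              # overlapping classes: not linearly separable

for C in (0.01, 1.0, 100.0):                    # C trades margin width against the slack sum
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<6} support vectors: {clf.n_support_.sum()}  "
          f"||w||: {np.linalg.norm(clf.coef_[0]):.3f}")
```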
Nonlinear Support Vector Machines
What if decision boundary is not linear?
Nonlinear Support Vector Machines
Transform data into higher dimensional space
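A small sketch of this idea on an assumed toy problem: an explicit quadratic feature map makes two concentric rings linearly separable, and a nonlinear (RBF) kernel achieves the same effect without building the map explicitly.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# two concentric "rings": no linear boundary exists in the original 2-D space
r = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([-1] * 100 + [1] * 100)

# explicit lift to (x1, x2, x1^2 + x2^2): the classes become linearly separable
X_lift = np.column_stack([X, (X ** 2).sum(axis=1)])
print(SVC(kernel="linear").fit(X_lift, y).score(X_lift, y))   # ~1.0

# same effect without building the map: a nonlinear (RBF) kernel in the original space
print(SVC(kernel="rbf").fit(X, y).score(X, y))                # ~1.0
```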