Pattern Classification: An Introduction
University of Wisconsin–Madison
(C) 2001-2005 by Yu Hen Hu
Outline
• Statistical Pattern Recognition
  – Maximum Posterior Probability (MAP) Classifier
  – Maximum Likelihood (ML) Classifier
  – K-Nearest Neighbor Classifier
  – MLP Classifier
An Example
• Consider classifying eggs into three categories with labels: medium, large, or jumbo.
• The classification is based on the weight (W) and length (L) of each egg.
• Decision rules:
  1. If W < 10 g and L < 3 cm, then the egg is medium.
  2. If W > 20 g and L > 5 cm, then the egg is jumbo.
  3. Otherwise, the egg is large.
• Three components in a pattern classifier:
  – Category (target) label
  – Features
  – Decision rule
[Figure: the eggs plotted in the feature plane with axes W (weight) and L (length).]
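The decision rules above can be written directly as a small rule-based classifier. Below is a minimal sketch, assuming weight in grams and length in centimeters; the function name and test values are illustrative, not from the slides.

```python
def classify_egg(weight_g: float, length_cm: float) -> str:
    """Rule-based egg classifier following the slide's three decision rules."""
    if weight_g < 10 and length_cm < 3:
        return "medium"
    if weight_g > 20 and length_cm > 5:
        return "jumbo"
    return "large"

print(classify_egg(8, 2.5))   # medium
print(classify_egg(25, 6.0))  # jumbo
print(classify_egg(15, 4.0))  # large
```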
Statistical Pattern Classification
• The objective of statistical pattern classification is to derive an optimal decision rule from a set of training samples.
• The decision rule is optimal in the sense that it minimizes a cost function, called the expected risk, in making classification decisions.
• This is a learning problem!

Assumptions
1. Features are given.
   • The feature selection problem needs to be solved separately.
   • Training samples are randomly chosen from a population.
2. Target labels are given.
   • Assume each sample is assigned a specific, unique label by nature.
   • Assume the labels of the training samples are known.
Pattern Classification Problem
Let X be the feature space, and C = {c(i), 1 ≤ i ≤ M} be the set of M class labels. For each x ∈ X, it is assumed that nature assigns a class label t(x) ∈ C according to some probabilistic rule.
Randomly draw a feature vector x from X:
• P(c(i)) = P(x ∈ c(i)) is the a priori probability that t(x) = c(i), without referring to x.
• P(c(i)|x) = P(x ∈ c(i) | x) is the a posteriori probability that t(x) = c(i), given the value of x.
• P(x|c(i)) = P(x | x ∈ c(i)) is the conditional probability (a.k.a. likelihood function) that x will assume its value given that it is drawn from class c(i).
• P(x) is the marginal probability that x will assume its value, without referring to which class it belongs to.
Using Bayes' rule, we have
  P(x|c(i)) P(c(i)) = P(c(i)|x) P(x)
and also
  P(c(i)|x) = P(x|c(i)) P(c(i)) / Σ_{j=1}^{M} P(x|c(j)) P(c(j))
Decision Function and Probability of Mis-Classification
• Given a sample x ∈ X, the objective of statistical pattern classification is to design a decision rule g(x) ∈ C that assigns a label to x.
• If g(x) = t(x), the naturally assigned class label, then it is a correct classification; otherwise, it is a mis-classification.
• Define a 0-1 loss function:
  λ(x | g(x)) = 0 if g(x) = t(x); 1 if g(x) ≠ t(x)
Given that g(x) = c(i*), then
  P(λ(x | g(x) = c(i*)) = 0 | x) = P(t(x) = c(i*) | x) = P(c(i*) | x)
Hence the probability of mis-classification for a specific decision g(x) = c(i*) is
  P(λ(x | g(x) = c(i*)) = 1 | x) = 1 − P(c(i*) | x)
Clearly, to minimize the probability of mis-classification for a given x, the best choice is to choose g(x) = c(i*) if
  P(c(i*)|x) > P(c(i)|x) for all i ≠ i*.
MAP: Maximum A Posteriori Classifier
The MAP classifier stipulates that the classifier that minimizes the probability of mis-classification should choose g(x) = c(i*) if
  P(c(i*)|x) > P(c(i)|x), i ≠ i*.
This is an optimal decision rule. Unfortunately, in real-world applications, it is often difficult to estimate P(c(i)|x).
Fortunately, to derive the optimal MAP decision rule, one can instead estimate a discriminant function Gi(x) such that for any x ∈ X and i ≠ i*,
  Gi*(x) > Gi(x) iff P(c(i*)|x) > P(c(i)|x).
Gi(x) can be an approximation of P(c(i)|x) or any function satisfying the above relationship.
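A minimal sketch of the MAP decision, assuming the discriminant Gi(x) is simply an estimate of the posterior P(c(i)|x); the posterior values below are hypothetical.

```python
import numpy as np

def map_decision(discriminants: np.ndarray) -> int:
    """Return the class index i* that maximizes G_i(x)."""
    return int(np.argmax(discriminants))

# Hypothetical discriminant values G_i(x) ~ P(c(i)|x) for one sample x
print(map_decision(np.array([0.2, 0.7, 0.1])))  # -> 1
```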
Maximum Likelihood Classifier
Using Bayes' rule, p(c(i)|x) = p(x|c(i)) p(c(i)) / p(x). Hence the MAP decision rule can be expressed as:
  g(x) = c(i*) if p(c(i*)) p(x|c(i*)) > p(c(i)) p(x|c(i)), i ≠ i*.
When the a priori probabilities are unknown, we may assume p(c(i)) = 1/M. In that case, maximizing p(x|c(i)) is equivalent to maximizing p(c(i)|x).
• The likelihood function p(x|c(i)) may assume a univariate Gaussian model, that is,
  p(x|c(i)) ~ N(μi, σi)
  where μi, σi can be estimated using the samples {x | t(x) = c(i)}.
• The a priori probability p(c(i)) can be estimated as:
  P(c(i)) ≈ #{x : t(x) = c(i)} / |X|
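A sketch of this classifier with a univariate Gaussian likelihood per class and priors estimated by class counts, following the estimates above; the Gaussian density helper, function names, and toy data are illustrative.

```python
import numpy as np

def gauss_pdf(x: float, mu: float, sd: float) -> float:
    """Univariate Gaussian density N(mu, sd) evaluated at x."""
    return float(np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi)))

def fit_gaussian_classifier(x: np.ndarray, t: np.ndarray):
    """Estimate (mu_i, sigma_i) and prior P(c(i)) for each class from labeled samples."""
    classes = np.unique(t)
    params = {c: (x[t == c].mean(), x[t == c].std(ddof=1)) for c in classes}
    priors = {c: float(np.mean(t == c)) for c in classes}   # #{t(x)=c(i)} / |X|
    return params, priors

def classify(x_new: float, params, priors):
    """Choose c(i*) maximizing p(x|c(i)) P(c(i)); with equal priors this is the ML rule."""
    scores = {c: gauss_pdf(x_new, mu, sd) * priors[c] for c, (mu, sd) in params.items()}
    return max(scores, key=scores.get)

# Toy 1-D data: class 0 centered near 0, class 1 centered near 5
x = np.array([-0.5, 0.0, 0.4, 4.6, 5.0, 5.3])
t = np.array([0, 0, 0, 1, 1, 1])
params, priors = fit_gaussian_classifier(x, t)
print(classify(1.0, params, priors))   # -> 0
```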
Nearest-Neighbor Classifier
• Let {y(1), …, y(n)} ⊂ X be n samples that have already been classified. Given a new sample x, the NN decision rule chooses g(x) = c(i) if the nearest sample y(i*), where
  i* = arg min_{1 ≤ i ≤ n} ||y(i) − x||,
is labeled with c(i).
• As n → ∞, the probability of mis-classification of the NN classifier is at most twice that of the optimal (MAP) classifier.
• The k-nearest neighbor classifier examines the k nearest classified samples and assigns x the majority label among them.
• Implementation problem: it requires large storage to keep ALL the training samples.
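A minimal k-nearest-neighbor sketch following the rule above; the toy data and function name are illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(x: np.ndarray, Y: np.ndarray, labels: np.ndarray, k: int = 1):
    """Assign x the majority label among its k nearest stored samples."""
    dists = np.linalg.norm(Y - x, axis=1)      # ||y(i) - x|| for all stored samples
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbors
    return Counter(labels[nearest]).most_common(1)[0][0]

# Toy usage: two 2-D clusters labeled 0 and 1
Y = np.array([[0., 0.], [0., 1.], [5., 5.], [6., 5.]])
labels = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.2, 0.4]), Y, labels, k=3))   # -> 0
```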
What is “Clustering”?
[Figure: two scatter plots of unlabeled data samples.]
What can we learn from these “unlabeled” data samples?
– Structures: some samples are closer to each other than to other samples.
– The closeness between samples is determined using a “similarity measure”.
– The number of samples per unit volume is related to the concept of “density” or “distribution”.
Clustering Problem Statement
• Given a set of vectors {xk; 1 ≤ k ≤ K}, find a set of c cluster centers {w(i); 1 ≤ i ≤ c} such that each xk is assigned to a cluster, say w(i*), according to a distance (distortion, similarity) measure d(xk, w(i)), such that the average distortion
  D = (1/K) Σ_{i=1}^{c} Σ_{k=1}^{K} I(xk, i) d(xk, w(i))
is minimized.
• I(xk, i) = 1 if xk is assigned to cluster i with cluster center w(i), and = 0 otherwise (an indicator function).
k-means Clustering Algorithm
Initialization: initial cluster centers w(i), 1 ≤ i ≤ c; D(−1) = 0; I(xk, i) = 0 for 1 ≤ i ≤ c, 1 ≤ k ≤ K.
Repeat:
(A) Assign cluster membership (Expectation step):
  Evaluate d(xk, w(i)) for 1 ≤ i ≤ c, 1 ≤ k ≤ K;
  I(xk, i) = 1 if d(xk, w(i)) < d(xk, w(j)) for all j ≠ i; = 0 otherwise.
(B) Evaluate the distortion:
  D(iter) = Σ_{i=1}^{c} Σ_{k=1}^{K} I(xk, i) d(xk, w(i))
(C) Update the code words according to the new assignment (Maximization step):
  w(i) = (1/Ni) Σ_{k=1}^{K} I(xk, i) xk, where Ni = Σ_{k=1}^{K} I(xk, i), 1 ≤ i ≤ c
(D) Check for convergence:
  if |1 − D(iter−1)/D(iter)| < ε, then convergent = TRUE.
A Numerical Example
x = {−1, −2, 0, 2, 3, 4}, W = {2.1, 2.3}
1. Assign membership:
   2.1: {−1, −2, 0, 2}
   2.3: {3, 4}
2. Distortion:
   D = (−1−2.1)² + (−2−2.1)² + (0−2.1)² + (2−2.1)² + (3−2.3)² + (4−2.3)²
3. Update W to minimize the distortion:
   w1 = (−1−2+0+2)/4 = −0.25
   w2 = (3+4)/2 = 3.5
4. Reassign membership:
   −0.25: {−1, −2, 0}
   3.5: {2, 3, 4}
5. Update W:
   w1 = (−1−2+0)/3 = −1
   w2 = (2+3+4)/3 = 3
Converged.
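The worked example can be reproduced with a short k-means sketch. The code below assumes 1-D data, squared-error distortion d(xk, w(i)) = (xk − w(i))², and the initial centers from the example; the loop follows the E/M steps of the previous slide.

```python
import numpy as np

def kmeans_1d(x: np.ndarray, w: np.ndarray, eps: float = 1e-6, max_iter: int = 100):
    """Plain 1-D k-means with squared-error distortion (a sketch of the slide's algorithm)."""
    d_prev = None
    for _ in range(max_iter):
        # (A) membership: each sample goes to its nearest center
        assign = np.argmin((x[:, None] - w[None, :]) ** 2, axis=1)
        # (B) distortion with the current centers
        d = np.sum((x - w[assign]) ** 2)
        # (C) update each center to the mean of its assigned samples
        w = np.array([x[assign == i].mean() if np.any(assign == i) else w[i]
                      for i in range(len(w))])
        # (D) convergence check on the relative change in distortion
        if d_prev is not None and abs(1 - d_prev / d) < eps:
            break
        d_prev = d
    return w, assign

w, assign = kmeans_1d(np.array([-1., -2., 0., 2., 3., 4.]), np.array([2.1, 2.3]))
print(w)   # approximately [-1., 3.], matching the worked example
```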
Kmeans Algorithm Demonstration
[Figure: two scatter plots produced by Clusterdemo.m — the left panel shows the data points and true cluster centers; the right panel shows the data samples, converged centers, true centers, and the cluster boundary.]
Scatter Matrices
• Scatter matrices are defined in the context of the analysis of variance in statistics.
• They are used in linear discriminant analysis.
• However, they can also be used to gauge the fitness of a particular clustering assignment.
• Mean vector for the i-th cluster:
  mi = (1/Ni) Σ_{k=1}^{N} I(xk, i) xk
• Total mean vector:
  m = (1/N) Σ_{i=1}^{c} Ni mi = (1/N) Σ_{k=1}^{N} xk
• Scatter matrix for the i-th cluster:
  Si = Σ_{k=1}^{N} I(xk, i) (xk − mi)(xk − mi)^T
• Within-cluster scatter matrix:
  SW = Σ_{i=1}^{c} Si
• Between-cluster scatter matrix:
  SB = Σ_{i=1}^{c} Ni (mi − m)(mi − m)^T
Scattering Criteria
• Total scatter matrix:
  ST = Σ_{k=1}^{N} (xk − m)(xk − m)^T = SW + SB
• Note that the total scatter matrix is independent of the assignment I(xk, i). But …
• SW and SB both depend on I(xk, i)!
• Desired clustering properties:
  – SW small
  – SB large
• How do we gauge whether SW is small or SB is large? There are several ways.
• Tr(SW) (the trace of SW): let
  SW = Σ_{m=1}^{M} λm vm vm^T
be the eigenvalue decomposition of SW; then
  Tr(SW) = Σ_{m=1}^{M} λm = Σ_{i=1}^{c} Tr(Si) = Σ_{i=1}^{c} Σ_{k=1}^{N} I(xk, i) ||xk − mi||² = D
Classifier Design Issues
• Which classifier to use?
  – Performance:
    • Cross-validation may be used to facilitate performance comparison.
    • Each classifier should be fully developed before the comparison.
  – Cost:
    • CPU time for developing and executing the algorithm.
    • Memory and storage requirements may prevent some classifiers from being used.
Classifier Design Issues
• Order selection
  – Appears in almost all classifiers:
    • the number of neighbors in kNN,
    • the number of Gaussian mixtures per class in an ML classifier,
    • the number of hidden layers and hidden neurons in an MLP.
  – Cross-validation can be used to offer hints for selecting the order of a classifier.
• Which set of parameters to use?
  – When classifiers are trained adaptively (such as an MLP), different sets of parameters may result from multiple training runs.
  – Again, cross-validation may be used to aid the selection of which set of parameters to use.
Feature Representation
A typical pattern classification system:
  raw pattern x → Feature Extractor → feature vector y → Classifier → classification c
• Proper feature representation is essential.
• Feature transformation: exposing important features from among unimportant ones.
• Feature selection: selecting among a set of given features. Bad features will confuse the classifier.
• A feature is a particular dimension of the feature vector. E.g., if the feature vector is x = [x1, x2, x3], then xi, i = 1, 2, 3, are the features.
Symbolic Feature Encoding
Many classifiers allow only numerical input values. Features that are represented with symbols must be encoded in numerical form, e.g. {red, green, blue}, {G, A, C, T}.
1. Real-number encoding: map each feature symbol to a quantized number on the real line, e.g. red → −1, green → 0, and blue → +1.
2. 1-in-N encoding: e.g. red → [1 0 0], green → [0 1 0], and blue → [0 0 1].
3. Fuzzy encoding: e.g. red → [1 0.5 0 0 0], green → [0 0.5 1 0.5 0], and blue → [0 0 0 0.5 1].
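A minimal sketch of the first two encodings for the {red, green, blue} example; the mapping dictionaries are illustrative.

```python
import numpy as np

symbols = ["red", "green", "blue"]

# Real-number encoding: red -> -1, green -> 0, blue -> +1
real_code = {"red": -1.0, "green": 0.0, "blue": +1.0}

# 1-in-N encoding: one indicator dimension per symbol
one_in_n = {s: np.eye(len(symbols))[i] for i, s in enumerate(symbols)}

print(real_code["green"])   # 0.0
print(one_in_n["blue"])     # [0. 0. 1.]
```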
Feature Transformation: Why?
• To expose structures inherent in the data,
  – making a difficult classification problem easier to solve.
  – This requires in-depth understanding of the data and extensive trial-and-error.
• To equalize the influence of different features.
  – Feature value ranges should often be normalized:
    • to have zero mean in each dimension;
    • to have the same (or similar) range or standard deviation in each dimension.
• Linear transformation = projection onto a basis.
• Nonlinear transformation
  – Potentially more powerful, but there is no easy rule to follow.
Feature Transformation: How?
• Bias, shift of origin: x' = x − b
• Linear transformation (data independent): y = T x
  – Rotation:
      y1 =  cos θ · x1 + sin θ · x2
      y2 = −sin θ · x1 + cos θ · x2
  – Scaling: y = a x
  – DFT:
      y(k) = Σ_{m=0}^{N−1} x(m) exp(−j 2πkm / N),  0 ≤ k ≤ N−1
  – Linear digital filtering: FIR, IIR
  – Others: Discrete Cosine Transform, singular value decomposition, etc.
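A short sketch of two of these transformations: per-dimension normalization (zero mean, unit standard deviation, as motivated on the previous slide) and a 2-D rotation as an example of a data-independent linear transform. The toy data and function names are illustrative.

```python
import numpy as np

def normalize(X: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-standard-deviation scaling of each feature dimension."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def rotate(X: np.ndarray, theta: float) -> np.ndarray:
    """Rotate 2-D feature vectors by angle theta (y = T x with T a rotation matrix)."""
    T = np.array([[np.cos(theta),  np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    return X @ T.T

X = np.array([[1., 10.], [2., 20.], [3., 30.]])   # toy feature matrix, one row per sample
print(normalize(X).mean(axis=0))                  # approximately [0, 0]
print(rotate(X, np.pi / 2))                       # 90-degree rotation of each sample
```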
Feature Dimension Reduction
• Irrelevant feature reduction
  – An irrelevant feature (dimension) is one that is uncorrelated with the class label.
  – E.g., the feature value in a dimension is a constant.
  – E.g., the feature value in a dimension is random.
• Redundant feature reduction
  – If the values of a feature are linearly dependent on the remaining features, then this feature can be removed.
  – Method 1: principal component analysis (eigenvalue/singular value decomposition).
  – Method 2: subset selection.
Irrelevant Feature Reduction
• If a feature value remains constant, or nearly constant, for all the training samples, then this feature dimension can be removed.
• Method:
  – Calculate the mean and variance of each feature dimension. If the variance is less than a preset threshold, the feature dimension is marked for removal.
• If the distributions (histograms) of a feature's values for different classes overlap each other, this feature "may be" subject to removal. However, higher-dimensional correlation may exist!
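A minimal sketch of the variance-threshold check described above; the threshold value and toy data are arbitrary.

```python
import numpy as np

def low_variance_dims(X: np.ndarray, threshold: float = 1e-3) -> np.ndarray:
    """Return indices of feature dimensions whose variance falls below the threshold."""
    return np.where(X.var(axis=0) < threshold)[0]

# Toy feature matrix: column 1 is constant and would be marked for removal
X = np.array([[1.0, 5.0, 0.2],
              [2.0, 5.0, 0.9],
              [3.0, 5.0, 0.4]])
print(low_variance_dims(X))   # -> [1]
```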
Redundant Feature Reduction
Given a feature matrix X = [x1, x2, …, xM]:
– Each row is a (sample) feature vector.
– Each column is a feature.
If x3 = a·x1 + b·x2, then
  [x1 x2 x3] = [x1 x2] [1 0 a; 0 1 b]
In other words, x3 is redundant and hence can be removed without affecting the result of classification.
Method:
1. Perform SVD on X to identify its rank r (≤ M).
2. Repeat M − r times:
   find i* such that X = [xi* Xr] and the projection error
     || Xr (Xr^T Xr)^{-1} Xr^T X − X ||
   is minimized; then set X = Xr.
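A small sketch of the SVD rank check and the projection error above, assuming a toy feature matrix whose third column is a linear combination of the first two; the tolerance is arbitrary.

```python
import numpy as np

# Toy feature matrix: column 2 equals 2*x1 + 1*x2, i.e. it is redundant
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
X = np.column_stack([X, 2 * X[:, 0] + 1 * X[:, 1]])

# Rank from SVD: near-zero singular values reveal linearly dependent columns
s = np.linalg.svd(X, compute_uv=False)
rank = int(np.sum(s > 1e-10 * s[0]))
print(rank)                        # 2, although X has 3 columns

# Projection error when keeping only the first two columns: ~0 means the dropped column is redundant
Xr = X[:, :2]
proj = Xr @ np.linalg.inv(Xr.T @ Xr) @ Xr.T @ X
print(np.linalg.norm(proj - X))    # approximately 0
```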
Higher Dimension Features
• In the original two-dimensional feature space, the points • (1,0) and • (0,1) versus x (0,0) and x (1,1) are not linearly separable.
• Mapped into a three-dimensional feature space, the corresponding points • (1,0,1) and • (0,1,1) versus x (1,1,0) and x (0,0,0) become linearly separable.
Data Sampling
• Samples are assumed to be drawn independently from the underlying population.
• Use resampling, i.e. repeated train-and-test partitions, to estimate the error rate.
• M-fold cross-validation: partition all available samples into M mutually exclusive sets. Each time, use one set as the testing set and the remaining sets as the training set. Repeat M times. Take the average testing error as an estimate of the true error rate.
• If the sample size is < 100, use leave-one-out cross-validation, where only one sample is used as the test set each time.
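A compact sketch of M-fold cross-validation as described above; the fit/predict interface and the toy nearest-class-mean classifier are illustrative, not from the slides.

```python
import numpy as np

def m_fold_cv_error(X, t, fit, predict, M=5, seed=0):
    """Average test error over M mutually exclusive folds (M-fold cross-validation)."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, M)
    errors = []
    for i in range(M):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(M) if j != i])
        model = fit(X[train], t[train])
        errors.append(np.mean(predict(model, X[test]) != t[test]))
    return float(np.mean(errors))

# Toy usage with a nearest-class-mean classifier
X = np.array([[0.], [1.], [2.], [10.], [11.], [12.]])
t = np.array([0, 0, 0, 1, 1, 1])
fit = lambda Xtr, ttr: {c: Xtr[ttr == c].mean(axis=0) for c in np.unique(ttr)}
predict = lambda model, Xte: np.array(
    [min(model, key=lambda c: np.linalg.norm(x - model[c])) for x in Xte])
print(m_fold_cv_error(X, t, fit, predict, M=3))
```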