AN ADAPTIVE METRIC MACHINE FOR PATTERN CLASSIFICATION
Classification Problem
• Given a training set $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, with $\mathbf{x}_n \in \mathbb{R}^q$ and $y_n \in \{-, +\}$
[Figure: scatter plot of training points labeled + and − in the $(x_1, x_2)$ plane, with a query point $\mathbf{x}_0$]
• Predict the class label of a given query point $\mathbf{x}_0$
Classification Problem
• Unknown probability distribution $P(\mathbf{x}, y)$
• We need to estimate:
$$P(+ \mid \mathbf{x}_0) \equiv f_+(\mathbf{x}_0), \qquad P(- \mid \mathbf{x}_0) \equiv f_-(\mathbf{x}_0)$$
The Bayesian Classifier
• Loss function: $\lambda(j \mid k)$ = loss incurred by deciding class $j$ when the true class is $k$
• Expected loss (conditional risk) associated with class $j$:
$$R(j \mid \mathbf{x}) = \sum_{k=1}^{J} \lambda(j \mid k)\, P(k \mid \mathbf{x})$$
• Bayes rule:
$$j^* = \arg\min_{1 \le j \le J} R(j \mid \mathbf{x})$$
• Zero-one loss function:
$$\lambda(j \mid k) = \begin{cases} 0 & \text{if } j = k \\ 1 & \text{if } j \neq k \end{cases}$$
Under this loss the Bayes rule becomes
$$j^* = \arg\max_{1 \le j \le J} P(j \mid \mathbf{x}) \qquad \text{(Bayes rule)}$$
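To see why the zero-one loss turns risk minimization into posterior maximization, note (with illustrative posterior values, not from the slides):
$$R(j \mid \mathbf{x}) = \sum_{k \neq j} P(k \mid \mathbf{x}) = 1 - P(j \mid \mathbf{x})$$
For example, if $P(1 \mid \mathbf{x}) = 0.2$, $P(2 \mid \mathbf{x}) = 0.5$, $P(3 \mid \mathbf{x}) = 0.3$, then $R(1 \mid \mathbf{x}) = 0.8$, $R(2 \mid \mathbf{x}) = 0.5$, $R(3 \mid \mathbf{x}) = 0.7$, so $j^* = 2$ under either criterion.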
The Bayesian Classifier
$$j^* = \arg\max_{1 \le j \le J} P(j \mid \mathbf{x})$$
• Bayes rule achieves the minimum error rate
• How to estimate the posterior probabilities $P(j \mid \mathbf{x})$, $j = 1, \ldots, J$?
$$\hat{j}(\mathbf{x}) = \arg\max_{1 \le j \le J} \hat{P}(j \mid \mathbf{x})$$
Density estimation
• Use Bayes theorem to estimate the posterior probability values:
$$P(j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid j)\, P(j)}{\sum_{k=1}^{J} p(\mathbf{x} \mid k)\, P(k)}$$
where $p(\mathbf{x} \mid j)$ is the probability density function of $\mathbf{x}$ given class $j$, and $P(j)$ is the prior probability of class $j$.
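A minimal sketch of this plug-in computation, assuming one-dimensional Gaussian class-conditional densities and illustrative priors (the names and parameter values below are hypothetical, not from the slides):

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D example: two classes with assumed Gaussian densities p(x | j)
priors = np.array([0.6, 0.4])                      # P(j), assumed known
densities = [norm(loc=0.0, scale=1.0),             # p(x | class 0)
             norm(loc=2.0, scale=1.5)]             # p(x | class 1)

def posterior(x0):
    """Bayes theorem: P(j | x0) = p(x0 | j) P(j) / sum_k p(x0 | k) P(k)."""
    likelihoods = np.array([d.pdf(x0) for d in densities])
    unnormalized = likelihoods * priors
    return unnormalized / unnormalized.sum()

print(posterior(1.0))           # posterior probabilities at x0 = 1.0
print(posterior(1.0).argmax())  # Bayes decision under zero-one loss
```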
Naïve Bayes Classifier
• Makes the assumption of independence of the features given the class:
$$p(\mathbf{x} \mid j) = p(x_1, x_2, \ldots, x_q \mid j) = \prod_{i=1}^{q} p(x_i \mid j)$$
• The task of estimating a q-dimensional density function
is reduced to the estimation of q one-dimensional density
functions. Thus, the complexity of the task is drastically
reduced.
• The use of Bayes theorem becomes much simpler.
• It has proven to be effective in practice.
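A minimal Gaussian naïve Bayes sketch under these assumptions, with per-feature one-dimensional Gaussian densities (the class name and the toy data are illustrative, not from the slides):

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes: estimate q one-dimensional Gaussian densities per class."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])       # P(j)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        return self

    def predict(self, X):
        # log p(x | j) = sum_i log p(x_i | j) under the independence assumption
        log_lik = -0.5 * (np.log(2 * np.pi * self.vars_)[None, :, :]
                          + (X[:, None, :] - self.means_[None, :, :]) ** 2
                          / self.vars_[None, :, :]).sum(axis=2)
        log_post = log_lik + np.log(self.priors_)[None, :]    # + log P(j)
        return self.classes_[np.argmax(log_post, axis=1)]

# Tiny usage example with synthetic data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
print(GaussianNaiveBayes().fit(X, y).predict(X[:5]))
```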
Nearest-Neighbor Methods
• Predict the class label of $\mathbf{x}_0$ as the most frequent label among its K nearest neighbors
[Figure: training points labeled + and − in the $(x_1, x_2)$ plane, with the K-nearest neighborhood of the query point $\mathbf{x}_0$]
Basic assumption:
$$f_+(\mathbf{x} + \delta\mathbf{x}) \approx f_+(\mathbf{x}), \qquad f_-(\mathbf{x} + \delta\mathbf{x}) \approx f_-(\mathbf{x}) \quad \text{for small } \delta\mathbf{x}$$
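A minimal K-nearest-neighbor classifier along these lines (a sketch with the standard Euclidean metric; the function name and the toy data are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, K=5):
    """Predict the label of query x0 as the most frequent label among its K nearest neighbors."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to all training points
    nearest = np.argsort(dists)[:K]                # indices of the K closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny usage example
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.array(['-'] * 30 + ['+'] * 30)
print(knn_predict(X, y, x0=np.array([0.2, 0.5]), K=5))
```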
Example: Letter Recognition
[Figure: letter data in a two-dimensional feature space; axes: edge count and first statistical moment]
Asymptotic Properties of
K-NN Methods
$$\lim_{N \to \infty} \hat{f}_j(\mathbf{x}) = f_j(\mathbf{x}) \quad \text{if} \quad \lim_{N \to \infty} K = \infty \ \text{ and } \ \lim_{N \to \infty} K/N = 0$$
• The first condition reduces the variance by making the estimate independent of the accidental characteristics of the K nearest neighbors.
• The second condition reduces the bias by ensuring that the K nearest neighbors are arbitrarily close to the query point.
Asymptotic Properties of
K-NN Methods
$$\lim_{N \to \infty} E_1 \le 2\,E^*$$
$E_1$ = classification error rate of the 1-NN rule
$E^*$ = classification error rate of the Bayes rule
In the asymptotic limit, no decision rule is more than twice as accurate as the 1-NN rule.
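For instance (an illustrative number, not from the slides): if the Bayes error rate is $E^* = 0.05$, the asymptotic 1-NN error rate is at most $2 \times 0.05 = 0.10$.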
Finite-sample settings
• How well does the 1-NN rule work in finite-sample settings?
• If the number of training data N is large and the number
of input features q is small, then the asymptotic results may
still be valid.
• However, for a moderate to large number of input variables, the sample size required for their validity is beyond feasibility.
Curse-of-Dimensionality
• This phenomenon is known as
the curse-of-dimensionality
• It refers to the fact that in high-dimensional spaces data become extremely sparse and far apart from each other
• It affects any estimation problem with
high dimensionality
Curse of Dimensionality
[Figure: distribution of the ratio DMAX/DMIN (farthest to nearest distance) for a sample of size N = 500 uniformly distributed in $[0, 1]^q$]
Curse of Dimensionality
[Figure: DMAX/DMIN versus dimensionality] The distribution of the ratio DMAX/DMIN converges to 1 as the dimensionality increases.
Curse of Dimensionality
[Figure: variance of distances from a given point versus dimensionality] The variance of the distances from a given point converges to 0 as the dimensionality increases.
Curse of Dimensionality
[Figure: distance values from a given point] The distance values flatten out as the dimensionality increases.
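A small simulation consistent with these plots (a sketch; the sample size, the set of dimensions, and the normalization of the distances by their mean are assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500                                  # points uniformly distributed in [0, 1]^q

for q in (2, 5, 10, 20, 50, 100):
    X = rng.uniform(0.0, 1.0, size=(N, q))
    d = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from one given point
    d_rel = d / d.mean()                        # distances normalized by their mean
    print(f"q={q:3d}  DMAX/DMIN={d.max() / d.min():6.2f}  var(rel. dist)={d_rel.var():.4f}")
```

As q grows, the printed ratio approaches 1 and the variance of the normalized distances shrinks toward 0, matching the behavior described above.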
Computing radii of nearest neighborhoods
[Figure: median radius of a nearest neighborhood, for a uniform distribution in the unit cube $[-0.5, 0.5]^q$]
Curse-of-Dimensionality
As the dimensionality increases, the distance from the closest point increases quickly
• Random sample of size N from the uniform distribution in the q-dimensional unit hypercube
• Diameter of a $K = 1$ neighborhood using Euclidean distance: $d(q, N) = O(N^{-1/q})$
q     N         d(q, N)
4     100       0.42
4     1000      0.23
6     100       0.71
6     1000      0.48
10    1000      0.91
10    10000     0.72
20    10000     1.51
20    10^6      1.20
20    10^10     0.76

Large $d(q, N)$ ⇒ highly biased estimates
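The tabled values are consistent with taking $d(q, N)$ to be the diameter of a q-dimensional ball whose volume is $1/N$ of the unit hypercube, i.e. $d(q, N) = 2\,(N V_q)^{-1/q}$ with $V_q$ the volume of the unit q-ball; this closed form is an assumption inferred from the numbers, not stated in the slides. A short check:

```python
import math

def d(q, N):
    """Diameter of a q-ball of volume 1/N (assumed form of d(q, N), matching the table)."""
    V_q = math.pi ** (q / 2) / math.gamma(q / 2 + 1)   # volume of the unit q-ball
    return 2.0 * (1.0 / (N * V_q)) ** (1.0 / q)

for q, N in [(4, 100), (4, 1000), (6, 100), (6, 1000), (10, 1000),
             (10, 10**4), (20, 10**4), (20, 10**6), (20, 10**10)]:
    print(f"q={q:2d}  N={N:>14,}  d={d(q, N):.2f}")
```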
Curse-of-Dimensionality
• It is a serious problem in many
real-world applications
• Microarray data: 3,000-4,000 genes;
• Documents: 10,000-20,000 words in
dictionary;
• Images, face recognition, etc.
How can we deal with
the curse of dimensionality?
An example covariance matrix:
$$\begin{pmatrix} 7.68 & 92.2 \\ 92.2 & 1912.5 \end{pmatrix}$$

For $\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ with mean $\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i$, the $2 \times 2$ covariance matrix is:
$$\Sigma = E\big[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\big]
= \begin{pmatrix} E[(x_1-\mu_1)^2] & E[(x_1-\mu_1)(x_2-\mu_2)] \\ E[(x_1-\mu_1)(x_2-\mu_2)] & E[(x_2-\mu_2)^2] \end{pmatrix}$$

The sample estimate is
$$\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} \begin{pmatrix} (x_1^i-\mu_1)^2 & (x_1^i-\mu_1)(x_2^i-\mu_2) \\ (x_1^i-\mu_1)(x_2^i-\mu_2) & (x_2^i-\mu_2)^2 \end{pmatrix}$$
where the diagonal entries are the variances and the off-diagonal entries are the covariances.

Example covariance matrices:
$$\begin{pmatrix} 0.94 & 0.93 \\ 0.93 & 1.03 \end{pmatrix}, \quad
\begin{pmatrix} 0.97 & 0.49 \\ 0.49 & 1.04 \end{pmatrix}, \quad
\begin{pmatrix} 0.93 & 0.01 \\ 0.01 & 1.05 \end{pmatrix}, \quad
\begin{pmatrix} 0.99 & 0.5 \\ 0.5 & 1.06 \end{pmatrix}, \quad
\begin{pmatrix} 1.04 & 1.05 \\ 1.05 & 1.15 \end{pmatrix}$$
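A minimal numpy sketch of this sample estimate (the data here are synthetic and only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2-D data
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.9], [0.9, 1.2]], size=1000)

mu = X.mean(axis=0)                            # sample mean (mu_1, mu_2)
centered = X - mu
Sigma_hat = centered.T @ centered / len(X)     # (1/N) sum_i (x_i - mu)(x_i - mu)^T
print(Sigma_hat)                               # diagonal: variances, off-diagonal: covariance
print(np.cov(X, rowvar=False, bias=True))      # same estimate via numpy
```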
Dimensionality Reduction
• Many dimensions are often interdependent (correlated). We can:
• Reduce the dimensionality of the problem;
• Transform interdependent coordinates into significant and independent ones (see the sketch below).
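One standard way to do both is principal component analysis via the eigendecomposition of the covariance matrix; PCA is offered here only as an illustration, since the slides do not name the transformation they use:

```python
import numpy as np

def pca_transform(X, n_components):
    """Project the data onto the top eigenvectors of the sample covariance matrix."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X - mu, rowvar=False)                # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]    # keep the top components
    return (X - mu) @ eigvecs[:, order]                 # decorrelated, reduced coordinates

# Usage: two correlated features reduced to one significant, independent coordinate
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.2]], size=500)
Z = pca_transform(X, n_components=1)
print(Z.shape)                                           # (500, 1)
```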