
Nonparametric Techniques
Shyh-Kang Jeng
Department of Electrical Engineering/
Graduate Institute of Communication/
Graduate Institute of Networking and
Multimedia, National Taiwan University
1
Problems of Parameter Estimation
Approaches
Common parametric forms rarely fit
the densities actually encountered in
practice
All of the classical parametric
densities are unimodal
– Many practical problems involve
multimodal densities
A high-dimensional density can seldom
be represented as the product of
one-dimensional functions
2
Nonparametric Methods
Can be used with arbitrary
distributions
– Need no assumption on the forms of the
underlying densities
Basic types
– Estimating the density function p(x|wj)
from samples
– Directly estimating the a posteriori
probabilities p(wj|x)
– Bypassing probability estimation and
going directly to decision functions
3
Density Estimation:
Naïve Approach
P = \int_R p(x')\,dx'
x_1, \ldots, x_n are i.i.d. samples drawn according to p(x)
P_k = \binom{n}{k} P^k (1 - P)^{n-k}, \quad E[k] = nP
Assume that R is small, so that
\int_R p(x')\,dx' \simeq p(x)\,V
p(x) \simeq \frac{k/n}{V}
4
Density Estimation:
Naïve Approach
[Figure: relative probability P_k / P_{k,max} as a function of k/n]
5
Problems of Naïve Approach
If the volume is fixed and we take more samples, we get only a space-averaged value of p(x):
\frac{P}{V} = \frac{\int_R p(x')\,dx'}{\int_R dx'}
If the volume approaches zero and n is fixed, the estimated p(x) will be close to zero or infinity and therefore useless
6
Better Approaches
p_n(x) = \frac{k_n/n}{V_n} \to p(x)
\lim_{n\to\infty} V_n = 0, \quad \lim_{n\to\infty} k_n/n = 0
7
Hypercube Parzen Windows
\varphi(u) = \begin{cases} 1 & |u_j| \le 1/2,\ j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}
k_n = \sum_{i=1}^n \varphi\!\left(\frac{x - x_i}{h_n}\right)
p_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right), \quad V_n = h_n^d
8
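As an aside (not part of the original slides), here is a minimal NumPy sketch of the hypercube Parzen estimate above; the function name, the sample array X, and the edge length h are illustrative choices of mine.

import numpy as np

def parzen_hypercube(x, X, h):
    """Estimate p(x) with a d-dimensional hypercube window of edge length h.

    x : (d,) query point; X : (n, d) i.i.d. samples; h : window edge length.
    """
    X = np.atleast_2d(X)
    n, d = X.shape
    # phi(u) = 1 if every |u_j| <= 1/2, else 0
    u = (x - X) / h
    inside = np.all(np.abs(u) <= 0.5, axis=1)
    k = np.count_nonzero(inside)          # number of samples in the cube
    V = h ** d                            # cube volume V_n = h_n^d
    return k / (n * V)

# Example: estimate a standard normal density at x = 0
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 1))
print(parzen_hypercube(np.array([0.0]), X, h=0.5))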
Parzen Windows for Interpolation
\varphi(x) \ge 0, \quad \int \varphi(u)\,du = 1
\delta_n(x) = \frac{1}{V_n}\, \varphi\!\left(\frac{x}{h_n}\right), \quad V_n = h_n^d
p_n(x) = \frac{1}{n} \sum_{i=1}^n \delta_n(x - x_i)
\int \delta_n(x - x_i)\,dx = \int \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right) dx = \int \varphi(u)\,du = 1
9
Examples of Parzen Windows
10
Convergence Considerations
p_n(x) depends on the random samples x_1, \ldots, x_n
p_n(x) has mean \bar{p}_n(x) and variance \sigma_n^2(x)
Convergence conditions:
\lim_{n\to\infty} \bar{p}_n(x) = p(x), \quad \lim_{n\to\infty} \sigma_n^2(x) = 0
These hold if
\sup_u \varphi(u) < \infty, \quad \lim_{\|u\|\to\infty} \varphi(u) \prod_{i=1}^d u_i = 0
\lim_{n\to\infty} V_n = 0, \quad \lim_{n\to\infty} n V_n = \infty
11
Convergence of the Mean
\bar{p}_n(x) = E[p_n(x)] = E\!\left[\frac{1}{n} \sum_{i=1}^n \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)\right]
= \int \frac{1}{V_n}\, \varphi\!\left(\frac{x - v}{h_n}\right) p(v)\,dv = \int \delta_n(x - v)\, p(v)\,dv
\lim_{n\to\infty} \delta_n(x - v) = \delta(x - v) \;\Rightarrow\; \lim_{n\to\infty} \bar{p}_n(x) = p(x)
12
Convergence of the Variance
\sigma_n^2(x) = \sum_{i=1}^n E\!\left[\left(\frac{1}{n V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right) - \frac{1}{n}\, \bar{p}_n(x)\right)^{\!2}\,\right]
= n\, E\!\left[\frac{1}{n^2 V_n^2}\, \varphi^2\!\left(\frac{x - x_i}{h_n}\right)\right] - \frac{1}{n}\, \bar{p}_n^2(x)
= \frac{1}{n V_n} \int \frac{1}{V_n}\, \varphi^2\!\left(\frac{x - v}{h_n}\right) p(v)\,dv - \frac{1}{n}\, \bar{p}_n^2(x)
13
Convergence of the Variance
\lim_{n\to\infty} \frac{1}{n}\, \bar{p}_n^2(x) = 0
\sigma_n^2(x) \le \frac{\sup \varphi(\cdot)\, \bar{p}_n(x)}{n V_n}
For \lim_{n\to\infty} n V_n = \infty (e.g., V_n = V_1/\sqrt{n} or V_1/\ln n),
\lim_{n\to\infty} \sigma_n^2(x) = 0
14
Illustration 1: 1D Gaussian
p(x) \sim N(0, 1)
\varphi(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}
p_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right), \quad h_n = h_1/\sqrt{n}
15
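A rough Python sketch (my own, not the lecture's code) of Illustration 1: a Gaussian-window Parzen estimate with h_n = h_1/\sqrt{n}, evaluated on a small grid for a few sample sizes; h1 = 1 is an assumed choice.

import numpy as np

def parzen_gaussian(x, X, h_n):
    """Parzen estimate with a Gaussian window of width h_n (1-D samples X)."""
    u = (x - X[:, None]) / h_n                     # shape (n, len(x))
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return np.mean(phi / h_n, axis=0)              # (1/n) sum (1/h_n) phi(.)

rng = np.random.default_rng(1)
xs = np.linspace(-3, 3, 7)
h1 = 1.0
for n in (10, 100, 10000):
    X = rng.standard_normal(n)
    h_n = h1 / np.sqrt(n)                          # h_n = h_1 / sqrt(n)
    print(n, np.round(parzen_gaussian(xs, X, h_n), 3))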
Illustration 1: 1D Gaussian
16
Illustration 2: 2D Gaussian
17
Illustration 3:
Uniform and Triangular
18
Classification Examples
19
Pros and Cons of
Nonparametric Methods
Generality
Number of samples needed may be
very large
– Much larger than required if we know
the form of the density
Curse of dimensionality
– High-dimensional functions are much
more complicated and harder to discern
– Better to incorporate knowledge about
the data that is correct
20
Probabilistic Neural Networks (PNN)
21
PNN Training
begin initialize j ← 0, n ← #patterns, a_ji ← 0 for j = 1, ..., n; i = 1, ..., c
  do j ← j + 1
    normalize: x_jk ← x_jk / (Σ_{i=1}^d x_ji^2)^{1/2}
    train: w_jk ← x_jk
    if x_j ∈ ω_i then a_ji ← 1
  until j = n
end
22
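A compact NumPy sketch (an illustration of mine, not the slides' code) of the training step: normalize each pattern to unit length, store it as a weight vector, and record a one-hot link to its class.

import numpy as np

def pnn_train(X, y, c):
    """X : (n, d) training patterns; y : (n,) integer class indices in 0..c-1."""
    W = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize each pattern
    A = np.zeros((len(X), c))                         # pattern-to-category links
    A[np.arange(len(X)), y] = 1                       # a_ji = 1 for the true class
    return W, A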
Activation Function
net_k = w_k^t x
\varphi\!\left(\frac{x - w_k}{h_n}\right) \propto e^{-(x - w_k)^t (x - w_k)/2\sigma^2}
= e^{-(x^t x + w_k^t w_k - 2 w_k^t x)/2\sigma^2} = e^{(net_k - 1)/\sigma^2}
(since the normalized patterns satisfy x^t x = w_k^t w_k = 1)
23
PNN Classification
begin initialize k ← 0, x ← test pattern, g_i ← 0 for i = 1, ..., c
  do k ← k + 1
    net_k ← w_k^t x
    if a_ki = 1 then g_i ← g_i + exp[(net_k - 1)/σ^2]
  until k = n
  return class ← arg max_i g_i(x)
end
24
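Continuing that sketch (again my own illustration, assuming the W and A returned by pnn_train above), classification sums the activations exp[(net_k − 1)/σ²] over the pattern units linked to each class:

import numpy as np

def pnn_classify(x, W, A, sigma):
    """Classify a test pattern x using the trained weights W and links A."""
    x = x / np.linalg.norm(x)              # normalize the test pattern as well
    net = W @ x                            # net_k = w_k^T x for every pattern unit
    act = np.exp((net - 1.0) / sigma ** 2) # activation of each pattern unit
    g = act @ A                            # g_i = sum of activations linked to class i
    return int(np.argmax(g))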
kn-Nearest-Neighbor Estimation
Let cell volume be a function of the
test data
Prototypes
– Training samples
Estimate p(x)
– Center a cell about x
– Let the cell grow until it captures kn
samples
p_n(x) = \frac{k_n/n}{V_n}
25
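An illustrative one-dimensional NumPy sketch (not from the slides) of the k_n-nearest-neighbor estimate: the cell centered at x grows until it contains k_n samples, so its volume is twice the distance to the k_n-th nearest sample.

import numpy as np

def knn_density_1d(x, X, k):
    """k-nearest-neighbor density estimate at x from 1-D samples X."""
    n = len(X)
    r = np.sort(np.abs(X - x))[k - 1]   # distance to the k-th nearest sample
    V = 2.0 * r                         # length of the interval centered at x
    return (k / n) / V

rng = np.random.default_rng(2)
X = rng.standard_normal(2000)
print(knn_density_1d(0.0, X, k=int(np.sqrt(len(X)))))   # k_n = sqrt(n)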
kn-Nearest-Neighbor Estimation
26
kn-Nearest-Neighbor Estimation
27
kn-Nearest-Neighbor Estimation
Sufficient and necessary conditions for p_n(x) to converge to p(x):
\lim_{n\to\infty} k_n = \infty, \quad \lim_{n\to\infty} k_n/n = 0
Example: k_n = \sqrt{n}
V_n \simeq \frac{1}{\sqrt{n}\, p(x)}, \quad V_n \approx \frac{V_1}{\sqrt{n}} \to 0 \text{ as } n \to \infty
28
kn-Nearest-Neighbor Estimation
29
Estimation of
A Posteriori Probabilities
Place a cell of volume V (Parzen or k_n-nearest-neighbor) around x and capture k samples
– k_i of them are labeled ω_i
Estimate of the joint probability p(x, ω_i):
p_n(x, \omega_i) = \frac{k_i/n}{V}
Estimate of p(ω_i | x):
p_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^c p_n(x, \omega_j)} = \frac{k_i}{k}
30
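A toy NumPy sketch (my own) of the posterior estimate above: among the k training samples nearest to x, the fraction carrying label ω_i estimates P(ω_i | x).

import numpy as np

def knn_posterior(x, X, y, k, c):
    """Estimate P(w_i | x) as k_i / k over the k nearest training samples.

    X : (n, d) samples; y : (n,) integer labels in 0..c-1.
    """
    d = np.linalg.norm(X - x, axis=1)          # Euclidean distances to x
    nearest = np.argsort(d)[:k]                # indices of the k nearest samples
    counts = np.bincount(y[nearest], minlength=c)
    return counts / k                          # k_i / k for each class i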
Nearest-Neighbor Rule
Let D_n = {x_1, ..., x_n} denote a set of n
labeled prototypes
x' in D_n is the prototype nearest to a
test point x
Assign x the label associated with x'
Suboptimal
– Leads to an error rate greater than the
Bayes rate
– but never worse than twice the Bayes
rate
31
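A minimal NumPy sketch (an illustration, not the lecturer's code) of the nearest-neighbor rule itself:

import numpy as np

def nearest_neighbor(x, prototypes, labels):
    """Classify x with the label of its nearest prototype (O(dn) brute force)."""
    d2 = np.sum((prototypes - x) ** 2, axis=1)   # squared Euclidean distances
    return labels[np.argmin(d2)]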
Heuristic Understanding
The label θ' associated with the
nearest neighbor is a random
variable
P(θ' = ω_i | x') = P(ω_i | x')
When the number of samples is very
large, we assume that x' is so close to x
that P(ω_i | x') approximately equals
P(ω_i | x)
P(\omega_m \mid x) = \max_i P(\omega_i \mid x)
32
Voronoi Tessellation
33
Probability of Error
P(e) = \int P(e \mid x)\, p(x)\, dx
Minimum possible value of P(e \mid x): P^*(e \mid x)
Minimum possible value of P(e): P^*
P = \lim_{n\to\infty} P_n(e)
P(e \mid x) = \int P(e \mid x, x')\, p(x' \mid x)\, dx'
34
Convergence of Nearest Neighbor
S: a hypersphere centered about x
P_S = \int_{x' \in S} p(x')\, dx' > 0
Probability that all n samples fall outside S: (1 - P_S)^n \to 0 as n \to \infty
\Rightarrow x' \to x, \quad p(x' \mid x) \to \delta(x' - x)
35
Error Rate for
Nearest-Neighbor Rule
Test point x, label to be determined: θ
Training sample nearest x: x'_n, labeled θ'_n
P(\theta, \theta'_n \mid x, x'_n) = P(\theta \mid x)\, P(\theta'_n \mid x'_n)
P_n(e \mid x, x'_n) = 1 - \sum_{i=1}^c P(\theta = \omega_i, \theta'_n = \omega_i \mid x, x'_n)
= 1 - \sum_{i=1}^c P(\omega_i \mid x)\, P(\omega_i \mid x'_n)
36
Error Rate for
Nearest-Neighbor Rule
P_n(e \mid x) = \int P_n(e \mid x, x'_n)\, p(x'_n \mid x)\, dx'_n
\lim_{n\to\infty} P_n(e \mid x) = \lim_{n\to\infty} \int \left[ 1 - \sum_{i=1}^c P(\omega_i \mid x)\, P(\omega_i \mid x'_n) \right] \delta(x'_n - x)\, dx'_n = 1 - \sum_{i=1}^c P^2(\omega_i \mid x)
P = \lim_{n\to\infty} P_n(e) = \lim_{n\to\infty} \int P_n(e \mid x)\, p(x)\, dx = \int \left[ 1 - \sum_{i=1}^c P^2(\omega_i \mid x) \right] p(x)\, dx
37
Approximate Error Bound
Assume the Bayes rate P^* = \int P^*(e \mid x)\, p(x)\, dx is low, so that
P^*(e \mid x) = 1 - P(\omega_m \mid x), \quad P(\omega_m \mid x) \approx 1
1 - \sum_{i=1}^c P^2(\omega_i \mid x) \approx 1 - P^2(\omega_m \mid x) \approx 2\left[ 1 - P(\omega_m \mid x) \right] = 2 P^*(e \mid x)
P \approx \int 2 P^*(e \mid x)\, p(x)\, dx = 2 P^*
38
A More Rigorous Approach
\sum_{i=1}^c P^2(\omega_i \mid x) = P^2(\omega_m \mid x) + \sum_{i \ne m} P^2(\omega_i \mid x)
P(\omega_i \mid x) \ge 0, \quad \sum_{i \ne m} P(\omega_i \mid x) = 1 - P(\omega_m \mid x)
\sum_{i \ne m} P^2(\omega_i \mid x) is minimized when all the P(\omega_i \mid x) with i \ne m are equal
39
A More Rigorous Approach
P(\omega_i \mid x) = \begin{cases} 1 - P^*(e \mid x) & i = m \\ \dfrac{P^*(e \mid x)}{c - 1} & i \ne m \end{cases}
\sum_{i=1}^c P^2(\omega_i \mid x) \ge \left[ 1 - P^*(e \mid x) \right]^2 + \frac{P^{*2}(e \mid x)}{c - 1}
1 - \sum_{i=1}^c P^2(\omega_i \mid x) \le 2 P^*(e \mid x) - \frac{c\, P^{*2}(e \mid x)}{c - 1}
P = \int \left[ 1 - \sum_{i=1}^c P^2(\omega_i \mid x) \right] p(x)\, dx \le 2 P^*
40
A More Rigorous Approach
\mathrm{Var}\!\left[ P^*(e \mid x) \right] = \int \left[ P^*(e \mid x) - P^* \right]^2 p(x)\, dx = \int P^{*2}(e \mid x)\, p(x)\, dx - P^{*2} \ge 0
\Rightarrow \int P^{*2}(e \mid x)\, p(x)\, dx \ge P^{*2}
P \le \int \left[ 2 P^*(e \mid x) - \frac{c}{c-1}\, P^{*2}(e \mid x) \right] p(x)\, dx \le P^*\!\left( 2 - \frac{c}{c-1}\, P^* \right)
41
A More Rigorous Approach
The upper bound is achieved in the
zero-information case
– p(x|wi) are identical
– P(wi|x) = P(wi)
– P*(e|x) is independent of x
P* is between 0 and (c-1)/c
42
Bounds of
Nearest-Neighbor Error Rate
43
Convergence Speed
Convergence can be arbitrarily slow
Pn(e) need not even decrease
monotonically with n
44
k-Nearest-Neighbor Rule
45
Simplified Analysis Results
Two-class case with k odd
Labels on each of the k nearest neighbors are random variables
– They independently assume the values ω_i with probabilities P(ω_i | x)
Select ω_m if a majority of the k nearest neighbors are labeled ω_m, which happens with probability
\sum_{i=(k+1)/2}^{k} \binom{k}{i} P(\omega_m \mid x)^i \left[ 1 - P(\omega_m \mid x) \right]^{k-i}
46
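A short Python check (my own illustration, with an assumed posterior P(ω_m | x) = 0.7) of the majority-vote probability above:

from math import comb

def majority_prob(p_m, k):
    """Probability that a majority of k neighbors (k odd) carry label w_m,
    when each label is w_m independently with probability p_m = P(w_m | x)."""
    return sum(comb(k, i) * p_m**i * (1 - p_m)**(k - i)
               for i in range((k + 1) // 2, k + 1))

for k in (1, 3, 5, 11):
    print(k, round(majority_prob(0.7, k), 4))   # grows toward 1 as k increases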
Simplified Analysis Results
The larger the value of k, the
greater the probability that wm
will be selected
Large-sample two-class error rate
is bounded above by C_k(P^*), defined
to be the smallest concave
function of P^* greater than
\sum_{i=0}^{(k-1)/2} \binom{k}{i} \left[ (P^*)^{i+1} (1 - P^*)^{k-i} + (P^*)^{k-i} (1 - P^*)^{i+1} \right]
47
Upper Bounds of Error Rate
for Two-Class Cases
48
More Comments on
k-Nearest-Neighbor Rule
As an attempt to estimate P(wi|x)
from samples
– Needs larger k to obtain a reliable
estimate
Want all k nearest neighbors x’ to be
very near x to ensure P(wi|x’) is
approximately the same as P(wi|x)
– k should be a small fraction of n
Only when n goes to infinity can we
be assured of nearly optimal
behavior
49
Computational Complexity of the
Nearest-Neighbor Rule
Inspect each stored point in turn
– O(n)
Calculate its Euclidean distance to x
– Each calculation O(d)
Retain the identity of only the
currently closest one
Total complexity O(dn)
50
A Parallel Nearest-Neighbor Circuit
51
Reducing Computational Burden in
Nearest-Neighbor Search
Computing partial distance
D_r(a, b) = \left( \sum_{k=1}^r (a_k - b_k)^2 \right)^{1/2}, \quad r \le d
Prestructuring
– Create some form of search tree, e.g.,
a quad-tree with representative points
– Prototypes are selectively linked
– Not guaranteed to find the closest
prototype
Editing the stored prototypes
52
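A plain-Python sketch (names are my own) of a brute-force search that uses partial distances: a candidate is abandoned as soon as its accumulated squared distance exceeds the best distance found so far.

def nn_partial_distance(x, prototypes, labels):
    """Brute-force 1-NN search with early termination via partial distances."""
    best_d2, best_label = float("inf"), None
    for p, lab in zip(prototypes, labels):
        d2 = 0.0
        for xk, pk in zip(x, p):
            d2 += (xk - pk) ** 2
            if d2 >= best_d2:          # partial distance already too large
                break
        else:                          # completed the loop: p is the new best
            best_d2, best_label = d2, lab
    return best_label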
Nearest-Neighbor Editing
begin initialize j ← 0, D ← data set, n ← #prototypes
  construct the full Voronoi diagram of D
  do j ← j + 1; for each prototype x'_j
    find the Voronoi neighbors of x'_j
    if any neighbor is not from the same class as x'_j, then mark x'_j
  until j = n
  discard all points that are not marked
  construct the Voronoi diagram of the remaining (marked) prototypes
end
53
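A rough sketch (mine, assuming SciPy is available) of the editing idea: since Voronoi neighbors are exactly the points joined by a Delaunay edge, scipy.spatial.Delaunay can be used to mark prototypes that border a different class.

import numpy as np
from itertools import combinations
from scipy.spatial import Delaunay

def voronoi_edit(X, y):
    """Keep prototypes that border a different class in the Voronoi diagram.

    X : (n, d) array with d >= 2; y : (n,) array of labels.
    """
    tri = Delaunay(X)
    keep = np.zeros(len(X), dtype=bool)
    for simplex in tri.simplices:              # each simplex joins Voronoi neighbors
        for i, j in combinations(simplex, 2):
            if y[i] != y[j]:                   # neighbors from different classes
                keep[i] = keep[j] = True       # mark both prototypes
    return X[keep], y[keep]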
Nearest-Neighbor Editing
Complexity O(d^3 n^{\lfloor d/2 \rfloor} \ln n)
Does not guarantee the minimum set of
points
Reduces complexity without
affecting the accuracy
Generally cannot add training data
later
Can be combined with
prestructuring and partial distance
54
Properties of Metrics
Nonnegativity: D(a, b) \ge 0
Reflexivity: D(a, b) = 0 if and only if a = b
Symmetry: D(a, b) = D(b, a)
Triangle inequality: D(a, b) + D(b, c) \ge D(a, c)
55
Effect of Scaling in
Euclidean Distance
56
Minkowski Metric (Lk Norm)
L_k(a, b) = \left( \sum_{i=1}^d |a_i - b_i|^k \right)^{1/k}
57
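A tiny Python sketch (my own) of the Minkowski L_k metric; k = 2 gives the Euclidean distance and k = 1 the Manhattan (city-block) distance.

def minkowski(a, b, k=2):
    """L_k distance between two equal-length sequences a and b."""
    return sum(abs(ai - bi) ** k for ai, bi in zip(a, b)) ** (1.0 / k)

print(minkowski([0, 0], [3, 4], k=2))   # 5.0 (Euclidean)
print(minkowski([0, 0], [3, 4], k=1))   # 7.0 (Manhattan / city block)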
Tanimoto Metric
Finds most use in taxonomy
–When two patterns or features
are either the same or different
Distance between two sets S_1 and S_2
(n_1, n_2 = numbers of elements in S_1, S_2; n_{12} = number in both):
D_{Tanimoto}(S_1, S_2) = \frac{(n_1 - n_{12}) + (n_2 - n_{12})}{n_1 + n_2 - n_{12}}
58
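A small Python sketch (my own) of the Tanimoto distance on Python sets:

def tanimoto(s1, s2):
    """Tanimoto distance: elements not shared, divided by the union size."""
    n1, n2, n12 = len(s1), len(s2), len(s1 & s2)
    return ((n1 - n12) + (n2 - n12)) / (n1 + n2 - n12)

print(tanimoto({"a", "b", "c"}, {"b", "c", "d"}))   # 0.5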
Uncritical Use of Euclidean Metric
59
A Naïve Approach
Compute distance only when the
patterns have been transformed to
be as similar to one another as
possible
The computational burden is
prohibitive
– Do not know the proper parameters for
the transformation ahead of time
– More serious if several transformations
for each stored prototype during
classification are to be considered
60
Tangent Vector
r transformations are applicable
– e.g., horizontal translation, shear,
line thinning for hand-written images
For each prototype x', perform
each of the transformations F_i(x'; α_i)
Tangent vector for each
transformation:
TV_i = F_i(x'; \alpha_i) - x'
61
Linear Combination of
Tangent Vectors
62
Tangent Distance
Construct an r-by-d matrix T
from the tangent vectors at x'
Tangent distance from x' to x:
D_{tan}(x', x) = \min_{a} \left\| (x' + Ta) - x \right\|
63
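A NumPy sketch (an illustration of mine) of the tangent distance: because (x' + Ta) − x is linear in the coefficients a, the minimization is an ordinary least-squares problem; here T is assumed to hold one tangent vector per row, so the combination is written T.T @ a.

import numpy as np

def tangent_distance(x_prime, x, T):
    """Tangent distance between prototype x_prime and test point x.

    T : (r, d) matrix whose rows are the tangent vectors at x_prime.
    """
    # Minimize ||(x_prime + T.T @ a) - x|| over a: a least-squares problem
    a, *_ = np.linalg.lstsq(T.T, x - x_prime, rcond=None)
    return np.linalg.norm(x_prime + T.T @ a - x)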
Concept of Tangent Distance
64
“Category Membership” Functions
in Fuzzy Logic
[Figure: "category membership" functions for the categories dark, medium-dark, medium-light, and light]
65
Conjunction Rule and
Discriminant Function
66
Cox-Jaynes Axioms for
“Category Membership” Functions
P(a \mid d) \ge P(b \mid d) \text{ and } P(b \mid d) \ge P(c \mid d) \;\Rightarrow\; P(a \mid d) \ge P(c \mid d)
P(\text{not } a \mid d) = F_1\!\left[ P(a \mid d) \right]
P(a, b \mid d) = F_2\!\left[ P(a \mid d), P(b \mid d) \right]
67
Contribution and Limitation
Guiding the steps by which one takes
knowledge in a linguistic form and
casts it into discriminant functions
Do not rely on data
68
Reduced Coulomb Energy (RCE)
Networks
69
RCE Training
70
RCE Training
begin initialize j ← 0, ε ← small parameter, λ_m ← max radius
  do j ← j + 1; x ← x'_j; ω_k ← category of x'_j
    w_ij ← x_i
    x̂ ← arg min_{x' ∉ ω_k} D(x, x')        (nearest point not in ω_k)
    λ_j ← min[ D(x̂, x) − ε, λ_m ]
    a_jk ← 1
  until j = n
end
71
RCE Classification
begin initialize j ← 0, x ← test pattern, D_t ← {}
  do j ← j + 1
    if D(x, x'_j) < λ_j then D_t ← D_t ∪ {x'_j}
  until j = n
  if the labels of all x'_j ∈ D_t are the same
    then return the label of the x'_j ∈ D_t
    else return the "ambiguous" label
end
72
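A compact NumPy sketch (my own illustration of the two procedures above, taking the radius rule as λ_j = min[D(x̂, x) − ε, λ_m]):

import numpy as np

def rce_train(X, y, eps=1e-3, lambda_max=0.5):
    """For each prototype, set its radius just short of the nearest rival class.

    X : (n, d) prototypes; y : (n,) labels (at least two classes assumed).
    """
    radii = np.empty(len(X))
    for j, (xj, yj) in enumerate(zip(X, y)):
        rivals = X[y != yj]                               # points of other classes
        d_hat = np.min(np.linalg.norm(rivals - xj, axis=1))
        radii[j] = min(d_hat - eps, lambda_max)
    return radii

def rce_classify(x, X, y, radii):
    """Return the common label of all spheres containing x, else 'ambiguous'."""
    hits = np.linalg.norm(X - x, axis=1) < radii
    labels = set(y[hits])
    if len(labels) == 1:
        return labels.pop()
    return "ambiguous"                                    # no hit or conflicting hits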
Approximations by
Series Expansions
\varphi\!\left(\frac{x - x_i}{h_n}\right) \cong \sum_{j=1}^m a_j\, \psi_j(x)\, \chi_j(x_i)
\sum_{i=1}^n \varphi\!\left(\frac{x - x_i}{h_n}\right) \cong \sum_{j=1}^m a_j\, \psi_j(x) \sum_{i=1}^n \chi_j(x_i)
p_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right) \cong \sum_{j=1}^m b_j\, \psi_j(x)
b_j = \frac{a_j}{n V_n} \sum_{i=1}^n \chi_j(x_i)
73
One-Dimensional Example
\varphi(u) = e^{-u^2} \cong \sum_{j=0}^m \frac{(-1)^j u^{2j}}{j!}
With the two-term expansion:
\varphi\!\left(\frac{x - x_i}{h}\right) \cong 1 - \left(\frac{x - x_i}{h}\right)^2 = 1 + \frac{2}{h^2}\, x x_i - \frac{1}{h^2}\, x^2 - \frac{1}{h^2}\, x_i^2
\Rightarrow p_n(x) \cong b_0 + b_1 x + b_2 x^2
b_0 = \frac{1}{h} - \frac{1}{h^3 n} \sum_{i=1}^n x_i^2, \quad b_1 = \frac{2}{h^3 n} \sum_{i=1}^n x_i, \quad b_2 = -\frac{1}{h^3}
|x - x_i| \ll h is required
74
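A quick NumPy check (my own, with an assumed sample set and h = 2) that the quadratic b_0 + b_1 x + b_2 x² tracks the exact Parzen estimate with window e^{−u²} when |x − x_i| ≪ h:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(0.0, 0.1, size=200)      # samples tightly clustered so |x - x_i| << h
h, n = 2.0, len(X)

# Expansion coefficients from the slide
b0 = 1.0 / h - np.sum(X ** 2) / (h ** 3 * n)
b1 = 2.0 * np.sum(X) / (h ** 3 * n)
b2 = -1.0 / h ** 3

for x in (-0.2, 0.0, 0.3):
    exact = np.mean(np.exp(-((x - X) / h) ** 2)) / h   # Parzen with phi(u) = e^{-u^2}
    approx = b0 + b1 * x + b2 * x ** 2
    print(x, round(exact, 4), round(approx, 4))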