Transcript Tsypin
Estimating the reliability of the kNN classification
Maxim Tsypin and Heinrich Röder
Biodesix, Steamboat Springs, CO
1. kNN classification
2. Problems with kNN
3. Estimating the reliability of the kNN classification
k-Nearest Neighbor (kNN) classification
• Two classes
• Training set: N1 instances of class 1
N2 instances of class 2
• Each instance is characterized by d values,
and is represented by a point x = (x1, ..., xd)
in d-dimensional space
• k nearest neighbors of the test instance:
k1 instances of class 1
k2 instances of class 2
[Figure: training instances of the two classes scattered in the (x1, x2)
plane, with three test instances A, B, C. With k = 5, the neighbor counts
(class 1 : class 2) are A: 5:0, B: 3:2, C: 0:5.]
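The procedure described above can be sketched in a few lines of Python. This is an illustrative brute-force implementation, not the authors' code; the training points and labels are made up:

```python
import math

def knn_neighbor_counts(train, test_point, k):
    """Return (k1, k2): how many of the k nearest training instances
    belong to class 1 and class 2.  `train` is a list of
    (point, label) pairs with label in {1, 2}."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], test_point))
    labels = [label for _, label in by_distance[:k]]
    return labels.count(1), labels.count(2)

def knn_classify(train, test_point, k):
    """Simple majority vote among the k nearest neighbors."""
    k1, k2 = knn_neighbor_counts(train, test_point, k)
    return 1 if k1 > k2 else 2

# Made-up 2-d training set: class 1 clustered near (0, 0),
# class 2 clustered near (4, 4).
train = [((0.0, 0.0), 1), ((1.0, 0.5), 1), ((0.5, 1.0), 1),
         ((4.0, 4.0), 2), ((3.5, 4.5), 2), ((4.5, 3.5), 2)]

print(knn_neighbor_counts(train, (0.2, 0.3), k=5))  # → (3, 2)
print(knn_classify(train, (0.2, 0.3), k=5))         # → 1
```

For real feature dimensions d in the hundreds, a spatial index (e.g. a k-d tree) would replace the full sort, but the neighbor counts (k1, k2) it returns are the same quantities used throughout these slides.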
Problems of simple kNN
• Works properly only when N1 = N2 . Adding
more instances of a given class to the
training set biases classification results
in favor of that class.
• No information on the confidence of class
assignment for the individual test instances.
Intuitively, the confidence of class
assignment in the 5:0 case should be greater
than in the 3:2 case.
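The first problem can be seen in a toy 1-d example: duplicating training instances of one class flips the majority vote for a borderline test instance. A minimal sketch with made-up numbers:

```python
def knn_vote(train, x, k):
    """Majority vote among the k nearest 1-d training instances.
    `train` is a list of (value, label) pairs, label in {1, 2}."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    votes = [label for _, label in nearest]
    return 1 if votes.count(1) > votes.count(2) else 2

# Balanced training set: N1 = N2 = 3.
balanced = [(0.0, 1), (1.0, 1), (2.0, 1), (3.0, 2), (4.0, 2), (5.0, 2)]
print(knn_vote(balanced, 2.4, k=3))  # → 1 (neighbors 2.0, 3.0, 1.0: a 2:1 vote)

# Oversample class 2 by duplicating one of its instances twice (N2 = 5).
biased = balanced + [(3.0, 2), (3.0, 2)]
print(knn_vote(biased, 2.4, k=3))    # → 2 (neighbors 2.0, 3.0, 3.0: a 1:2 vote)
```

Nothing about the underlying distributions changed; only the sampling did, which is exactly the N1 ≠ N2 bias described above.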
The question
• Two classes
• Training set: N1 instances of class 1
N2 instances of class 2
• k nearest neighbors of a given test instance:
k1 instances of class 1
k2 instances of class 2
k = k1 + k2
Given N1, N2, k1, k2, what is the probability
that this test instance belongs to class 1?
The answer
Two derivations:
1) within the kernel density estimation framework: a fixed vicinity of the
   test instance determines the number of neighbors;
2) within the kNN framework: a fixed number of neighbors determines
   the size of the vicinity.
Both approaches lead to the same result:
P(class 1) = [(k1 + 1) / (k1 + k2 + 2)] · 2F1( 1, k2 + 1; k1 + k2 + 3; 1 − N1/N2 ).

For N1 = N2, this simplifies to:

P(class 1) = (k1 + 1) / (k1 + k2 + 2),    P(class 1) / P(class 2) = (k1 + 1) / (k2 + 1).
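A sketch of evaluating the result numerically. The Gauss hypergeometric function 2F1 is computed here from its defining series, which converges for |1 − N1/N2| < 1 (a library routine such as scipy.special.hyp2f1 could be used instead; the pure-Python version keeps the example self-contained):

```python
def hyp2f1(a, b, c, z, terms=200):
    """Gauss hypergeometric series: sum_n (a)_n (b)_n / (c)_n * z^n / n!.
    Converges for |z| < 1; `terms` truncates the sum."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
    return total

def p_class1(k1, k2, n1, n2):
    """P(class 1) for a test instance whose k nearest neighbors split
    k1:k2, given n1 and n2 training instances of the two classes."""
    prefactor = (k1 + 1) / (k1 + k2 + 2)
    return prefactor * hyp2f1(1, k2 + 1, k1 + k2 + 3, 1 - n1 / n2)

# With N1 = N2 the hypergeometric factor is 1 and the simple rule applies:
print(p_class1(5, 0, 100, 100))  # test instance A, 5:0 split → 6/7 ≈ 0.857
print(p_class1(3, 2, 100, 100))  # test instance B, 3:2 split → 4/7 ≈ 0.571
```

Note that the 5:0 case (A) comes out more confident than the 3:2 case (B), matching the intuition stated on the "Problems" slide.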
• Quantifies the reliability of class assignment for each individual test
  instance, depending only on the (known) training set data.
• Properly accounts for complications arising when the numbers of
  training instances in the two classes are different, i.e. N1 ≠ N2 .
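For N1 = N2, the simplified formula P(class 1) = (k1 + 1)/(k1 + k2 + 2) can be sanity-checked by simulation: draw the local fraction p of class-1 instances from a uniform prior, draw the k = k1 + k2 neighbor labels independently with probability p, and among draws that reproduce the observed k1:k2 split, count how often the test instance itself is class 1. The uniform-prior setup here is our illustration, not necessarily the derivation used in the talk:

```python
import random

def mc_p_class1(k1, k2, trials=200_000, seed=0):
    """Monte Carlo estimate of P(class 1 | k1:k2 split) for N1 = N2.
    p ~ Uniform(0, 1) is the local class-1 fraction; the k = k1 + k2
    neighbor labels are i.i.d. Bernoulli(p)."""
    rng = random.Random(seed)
    k = k1 + k2
    hits = matched = 0
    for _ in range(trials):
        p = rng.random()
        if sum(rng.random() < p for _ in range(k)) == k1:
            matched += 1
            hits += rng.random() < p   # class of the test instance itself
    return hits / matched

print(mc_p_class1(3, 2))  # close to 4/7 ≈ 0.571
print(mc_p_class1(5, 0))  # close to 6/7 ≈ 0.857
```

This is the classical rule-of-succession calculation: the posterior of p given a k1:k2 split under a uniform prior is Beta(k1 + 1, k2 + 1), whose mean is exactly (k1 + 1)/(k1 + k2 + 2).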