Nearest Neighbor


NEAREST NEIGHBORS ALGORITHM
Lecturer: Yishay Mansour
Presentation: Adi Haviv and Guy Lev
Lecture Overview
• NN general overview
• Various methods of NN
• Models of the Nearest Neighbor Algorithm
• NN – Risk Analysis
• KNN – Risk Analysis
• Drawbacks
• Locality Sensitive Hashing (LSH)
• Implementing KNN using LSH
• Extension for Bounded Values
• Extension for Real Numbers
General Overview
• The nearest neighbors algorithm is a simple classification algorithm that classifies a new point x according to its nearest neighbor's class (label):
• Let X = {<x_i, b_i>} be a sample of points and their classifications. Given a new point y, we find argmin_i ‖x_i − y‖ and classify y with the corresponding b_i (a short code sketch follows this slide).
• An implicit assumption is that close points have the same classification.
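To make the rule concrete, here is a minimal 1-NN sketch in Python; the names (nn_classify, sample_points, sample_labels) and the Euclidean metric are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def nn_classify(sample_points, sample_labels, y):
    """Classify y with the label of its single nearest sample point.

    sample_points: array of shape (m, d); sample_labels: array of shape (m,).
    The Euclidean norm is used here; any metric could be substituted.
    """
    distances = np.linalg.norm(sample_points - y, axis=1)  # ||x_i - y|| for every i
    i = np.argmin(distances)                               # argmin_i ||x_i - y||
    return sample_labels[i]                                # classify y with b_i
```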
Various methods of NN
• NN - Given a new point x, we find its nearest sample point and return that point's classification.
• K-NN - Given a new point x, we find its k nearest sample points and return their average (majority) classification.
• Weighted - Given a new point x, we assign a weight to every sample point according to its distance from x and classify x by the weighted average (see the sketch below).
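The three variants differ only in how the neighbors' labels are aggregated. A minimal sketch, assuming 0/1 labels stored in a NumPy array, Euclidean distance, and inverse-distance weights for the weighted variant (the weighting scheme is an illustrative choice, not specified in the slides):

```python
import numpy as np

def knn_classify(sample_points, sample_labels, y, k):
    """k-NN: average (majority) vote of the k nearest sample points."""
    distances = np.linalg.norm(sample_points - y, axis=1)
    nearest = np.argsort(distances)[:k]                 # indices of the k closest x_i
    return int(np.mean(sample_labels[nearest]) >= 0.5)

def weighted_nn_classify(sample_points, sample_labels, y, eps=1e-12):
    """Weighted NN: every sample votes, weighted here by 1/distance."""
    distances = np.linalg.norm(sample_points - y, axis=1)
    weights = 1.0 / (distances + eps)                   # closer points get larger weight
    return int(np.average(sample_labels, weights=weights) >= 0.5)
```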
Models of the Nearest Neighbor Algorithm
1. Distribution D over X.
• Sample x according to D.
• p(x) = Pr[b = 1 | x].
• The problem with this model is that p(x) is not measurable.
2. Two distributions D_0 (negative points) and D_1 (positive points), and a parameter α ∈ [0,1].
• Sample a class b such that Pr[b = 1] = α.
• Sample x from D_b and return <x, b>.
• p(x) = αD_1(x) / (αD_1(x) + (1 − α)D_0(x))  (a small sampling sketch for this model follows the slide)
3. Distribution D over X × {0,1}.
• Sample <x, b> from D.
• p(x) = D(<x,1>) / (D(<x,0>) + D(<x,1>))
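A small sketch of the second model, assuming (purely for illustration) that D_0 and D_1 are one-dimensional Gaussians; it draws a labeled point and evaluates p(x) = αD_1(x)/(αD_1(x) + (1 − α)D_0(x)):

```python
import numpy as np

alpha = 0.3                           # Pr[b = 1]; illustrative value
mu0, mu1, sigma = 0.0, 2.0, 1.0       # illustrative class-conditional Gaussians D_0, D_1

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def sample_point(rng):
    """Sample <x, b>: first the class b with Pr[b = 1] = alpha, then x from D_b."""
    b = int(rng.random() < alpha)
    x = rng.normal(mu1 if b else mu0, sigma)
    return x, b

def p(x):
    """p(x) = alpha*D_1(x) / (alpha*D_1(x) + (1 - alpha)*D_0(x))."""
    num = alpha * gauss_pdf(x, mu1, sigma)
    return num / (num + (1 - alpha) * gauss_pdf(x, mu0, sigma))
```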
NN – Risk Analysis
• The optimal classifier is: h(x) = 1 if p(x) ≥ 0.5, and h(x) = 0 if p(x) < 0.5.
• r(x) = min{p(x), 1 − p(x)}
• The loss of the optimal classifier is the Bayes Risk:
R* = E_x[Loss(h, x)] = ∫ r(x) D(x) dx
• P = E_x[Loss(NN, x)]
• We will prove that: P ≤ 2R*(1 − R*)
NN vs. Bayes Risk Proof
• We split into two cases: a simple case and the general case.
• Simple Case:
• There exists exactly one <x_i, b_i> ∈ X such that x ≡ x_i.
• Proof:
• The NN rule errs exactly when the label of x and the label b_i of its neighbor disagree. Thus we get that the expected error is:
Pr[error | x] = p(x)(1 − p(x_i)) + (1 − p(x))p(x_i) = 2p(x)(1 − p(x)) = 2r(x)(1 − r(x)),
and the averaging step over x is sketched below.
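The remaining step, from the pointwise error 2r(x)(1 − r(x)) to the claimed bound P ≤ 2R*(1 − R*), is an application of Jensen's inequality to R* = E_x[r(x)]; a short sketch:

```latex
P = \mathbb{E}_x\bigl[2\,r(x)\,(1 - r(x))\bigr]
  = 2\,\mathbb{E}_x[r(x)] - 2\,\mathbb{E}_x[r(x)^2]
  \le 2\,\mathbb{E}_x[r(x)] - 2\bigl(\mathbb{E}_x[r(x)]\bigr)^2
  = 2R^*(1 - R^*),
\qquad \text{since } \mathbb{E}[r^2] \ge (\mathbb{E}[r])^2 .
```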
NN vs. Bayes Risk Proof Cont.
• General Case:
• The nearest neighbor of x converges to x as m → ∞.
• The classification of the neighborhood of x is close to that given by p(x).
• Proof
• Simplifications:
• D(x) is non-zero
• p(x) is continuous
• Theorem: for every given x, lim_{m→∞} NN^m(x) = x with probability 1.
• Proof:
• For every ε, let B_ε(x) be the sphere of radius ε centered at x.
• ∀ε, δ > 0 ∃m: Pr[NN^m(x) ∉ B_ε(x)] ≤ δ
• Theorem: r(NN^m(x)) → r(x) when m → ∞.
• Proof:
• e^m = {the event that the NN algorithm makes a mistake with a sample of m points}
• Pr[e^m] → 2r(x)(1 − r(x))
• lim_{m→∞} ∫ Pr[e^m] D(x) dx = ∫ lim_{m→∞} Pr[e^m] D(x) dx ≤ 2R*(1 − R*)
KNN – Risk Analysis
• We go on to the case of k points; here we gain that we do not lose the factor of 2 relative to the Bayes Risk.
• Denote:
• NN_k^m(x) = the k sample points x_i which are closest to x.
• l = the number of those x_i with classification 1.
• The estimate is: p̂(x) = l/k.
• Our conditions are:
• k → ∞
• k/m → 0
• We want to prove that:
1. ∀y ∈ NN_k^m(x): y → x
2. p̂(x) = l/k → p(x)
KNN vs. Bayes Risk Proof
• As before, we split into two cases.
• Simple Case:
• All k nearest neighbors are identical to x, so condition (1) is trivially satisfied.
• Proof:
• By the Chernoff bound:
Pr[ |l/k − p(x)| ≥ λ ] ≤ 2e^(−2λ²k) → 0 when k → ∞
⇒ p̂(x) = l/k → p(x) (a small simulation of this concentration follows below).
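A quick simulation of this concentration claim: draw k neighbor labels, each equal to 1 with probability p(x), and watch |l/k − p(x)| shrink as k grows (the value of p(x) is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p_x = 0.7                                  # illustrative value of p(x)

for k in (10, 100, 1000, 10000):
    labels = rng.random(k) < p_x           # k neighbor labels, each 1 with prob. p(x)
    l = labels.sum()
    print(k, abs(l / k - p_x))             # |l/k - p(x)| tends to 0 as k grows
```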
KNN vs. Bayes Risk Proof
• Proof for the General Case:
• Define B_ε(x) as before.
• q = Pr[x_i ∈ B_ε(x)] > 0
• The expected number of points that fall in B_ε(x) is q·m.
• z = the number of points that fall in B_ε(x).
• Pr[at most k−1 of the x_i fall in B_ε(x)] = Pr[z ≤ k−1] ≤ Pr[|z − q·m| ≥ q·m − (k−1)] → 0, since k/m → 0 implies k−1 ≪ q·m and z is concentrated around its mean q·m.
Drawbacks
• No bound on the number of samples (m): how many samples the nearest neighbor rule needs depends on the actual distribution.
• For example: fix m and take δ such that m ≪ 1/δ.
• The probability of error is then at least (1 − 4δ)^m ≥ 1 − 4δm, which is close to 1.
• Determining the distance function: the distance between points should not be affected by different scales. Two ways to normalize each coordinate (see the sketch below):
• Assuming a normal distribution: (v_i − μ_i)/σ_i
• Assuming a uniform distribution: v_i/(max_i − min_i)
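The two normalizations in code form; a minimal sketch assuming the samples are the rows of a NumPy array, so column i holds feature i:

```python
import numpy as np

def zscore_normalize(samples):
    """Per-feature (v_i - mu_i) / sigma_i, suited to roughly normal features."""
    mu = samples.mean(axis=0)
    sigma = samples.std(axis=0)
    return (samples - mu) / sigma

def range_normalize(samples):
    """Per-feature v_i / (max_i - min_i), suited to roughly uniform features."""
    span = samples.max(axis=0) - samples.min(axis=0)
    return samples / span
```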
Locality Sensitive Hashing (LSH)
• Trivial algorithm for KNN: for every query point, go over all sample points and compute the distance ⇒ linear time.
• We want to pre-process the sample set so that the search time is sub-linear.
• We can look at the following problem: given x and R, find all y such that ‖x − y‖ ≤ cR (c > 1 is an approximation factor).
Locality Sensitive Hashing (LSH)
• A Locality Sensitive Hashing family is a set H of hash functions such that for any p, q:
• If ‖p − q‖ ≤ R then Pr[h(p) = h(q)] ≥ p_1
• If ‖p − q‖ ≥ cR then Pr[h(p) = h(q)] ≤ p_2
for some probabilities p_1 > p_2.
• Example (bit sampling for the Hamming distance; a code sketch follows this slide):
• Points are bit vectors in {0,1}^d and the distance is the Hamming distance Ham(p, q).
• H = {h_i : h_i(x) = x_i, i = 1, …, d}, i.e., a random function from H reads a single random coordinate.
• Pr[h(p) = h(q)] = 1 − Ham(p, q)/d
• If ‖p − q‖ ≤ R then we have Pr[h(p) = h(q)] ≥ 1 − R/d = p_1, and if ‖p − q‖ ≥ cR then Pr[h(p) = h(q)] ≤ 1 − cR/d = p_2 < p_1, as required.
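A minimal sketch of such a bit-sampling family; the function name and layout are mine, the family itself is the one described above:

```python
import numpy as np

def sample_hash(d, rng):
    """Draw h from H: h(x) = x[i] for a uniformly random coordinate i in {0, ..., d-1}."""
    i = rng.integers(d)
    return lambda x: x[i]

# For p, q in {0,1}^d:  Pr[h(p) == h(q)] = 1 - Ham(p, q) / d,
# so pairs at small Hamming distance collide with higher probability.
```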
Implementing KNN using LSH
• Step 1: Amplification
• Use functions of the form g(x) = <h_1(x), …, h_k(x)>, where h_1, …, h_k are randomly selected from H. Then:
• If ‖p − q‖ ≤ R then Pr[g(p) = g(q)] ≥ p_1^k
• If ‖p − q‖ ≥ cR then Pr[g(p) = g(q)] ≤ p_2^k
• k is chosen s.t. p_2^k = 1/n.
• Denote: ρ = log(1/p_1)/log(1/p_2). Thus: p_1^k = (p_2^k)^ρ = n^(−ρ) (see the sketch below).
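A sketch of the amplification step for the bit-sampling family above: g reads k independently chosen coordinates, so two points collide under g only if they agree on all k of them:

```python
import numpy as np

def sample_g(d, k, rng):
    """Amplified function g(x) = <h_1(x), ..., h_k(x)> for the bit-sampling family:
    each h_j reads one uniformly random coordinate."""
    coords = rng.integers(d, size=k)              # one random coordinate per h_j
    return lambda x: tuple(x[j] for j in coords)  # collision requires agreement on all k

# Collision probability drops from p to p^k:
# at least p_1^k for "close" pairs, at most p_2^k for "far" pairs.
```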
Implementing KNN using LSH Cont.
• Step 2: Combination
• Pick L functions g_1, …, g_L (use L hash tables).
• For each i: Pr[g_i(p) = g_i(q)] ≥ p_1^k for "close" p, q.
• Probability of no match under any of the L functions: (1 − p_1^k)^L.
• For a given δ, choose L = ln(1/δ)/p_1^k; then we have: (1 − p_1^k)^L ≤ e^(−p_1^k·L) = δ.
• For "far" points, the probability to hit is at most p_2^k = 1/n, so the probability of hitting a given "far" point in any of the tables is bounded by L·p_2^k = L/n.
Implementing KNN using LSH Cont.
• We are given an LSH family H and a sample set.
• Pre-processing:
• Pick L functions g_1, …, g_L (use L hash tables).
• Insert each sample x into each table i according to the key g_i(x).
• Finding near neighbors of q:
• For each i, calculate g_i(q) and search in the i-th table.
• Thus we obtain a candidate set P (the union of the retrieved buckets).
• Check the distance between q and each point in P (a full sketch follows this slide).
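Putting the pre-processing and query steps together; a minimal sketch assuming 0/1 NumPy vectors, Hamming distance, and the bit-sampling-plus-amplification family from the previous sketches (k, L and the distance threshold are left to the caller):

```python
import numpy as np
from collections import defaultdict

def build_index(samples, d, k, L, rng):
    """Pre-processing: draw L amplified functions g_1..g_L and fill one hash table per function."""
    gs = [rng.integers(d, size=k) for _ in range(L)]       # g_i = k random coordinates
    tables = [defaultdict(list) for _ in range(L)]
    for x in samples:
        for coords, table in zip(gs, tables):
            table[tuple(x[j] for j in coords)].append(x)   # insert x under key g_i(x)
    return gs, tables

def query(q, gs, tables, radius):
    """Search: gather the candidate set P from q's bucket in every table,
    then keep only candidates whose true Hamming distance to q is at most radius."""
    candidates = []
    for coords, table in zip(gs, tables):
        candidates.extend(table.get(tuple(q[j] for j in coords), []))
    return [x for x in candidates if np.count_nonzero(x != q) <= radius]
```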
Implementing KNN using LSH Cont.
• Complexity:
• Space complexity: L tables, each containing the n samples. Therefore: O(nL).
• Search time complexity:
• O(L) queries to hash tables.
• We assume lookup time is constant.
• For each sample retrieved we check if it is "close".
• The expected number of "far" points retrieved is at most L (each "far" point is hit in a given table with probability at most 1/n). Therefore rejecting "far" samples is O(L).
• Time for processing "close" samples: O(kL), where k is the number of desired neighbors.
Extension for Bounded Values
• The sample space is {0, 1, …, s}^d.
• We use L1 as the distance metric.
• Use unary encoding:
• Represent each coordinate by a block of s bits.
• A value t is represented by t consecutive 1s followed by s − t zeros.
• Example: s = 8, x = <5, 7>
• Representation of x: 1111100011111110
• Hamming distance in this representation is the same as the L1 distance in the original representation (see the check below).
• Problems with real values can be reduced to this solution by quantization.
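A quick check that Hamming distance on the unary encodings equals L1 distance on the original vectors (values assumed to lie in {0, …, s}):

```python
import numpy as np

def unary_encode(x, s):
    """Each coordinate t becomes t ones followed by s - t zeros."""
    bits = []
    for t in x:
        bits.extend([1] * t + [0] * (s - t))
    return np.array(bits)

x, y, s = [5, 7], [3, 2], 8
hamming = int(np.count_nonzero(unary_encode(x, s) != unary_encode(y, s)))
l1 = sum(abs(a - b) for a, b in zip(x, y))
assert hamming == l1 == 7   # |5-3| + |7-2| = 7 bit flips
```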
Extension for Real Numbers
• The sample space is X = [0, 1].
• Assume R ≪ 1.
• Pick b randomly and uniformly from [0, 1].
• The hash function is: h(x) = 1 if x ≥ b, and h(x) = 0 otherwise.
• For x, y ∈ [0, 1]: Pr[h(x) ≠ h(y)] = |x − y|.
• Therefore: Pr[h(x) = h(y)] = 1 − |x − y|.
• If R is small then: 1 − R ≈ e^(−R).
Extension for Real Numbers Cont.
• Therefore: for close points (|x − y| ≤ R) the collision probability is at least 1 − R ≈ e^(−R), and for far points (|x − y| ≥ cR) it is at most 1 − cR ≈ e^(−cR).
• So we get a separation between e^(−R) and e^(−cR), given a big enough constant c.