Transcript Slide 1
Database Group
Nearest Neighbor Retrieval Using Distance-Based Hashing
Michalis Potamias and Panagiotis Papapetrou
supervised by Prof. George Kollios
Hash-Based Indexing
Idea:
1. Come up with hash functions that hash similar objects to similar buckets.
2. Hash every database object to some buckets.
3. At query time, apply the same hash functions to the query.
4. Filter: retrieve the collisions; the rest of the database is pruned.
5. Refine: compute the actual distances and return the object with the smallest distance as the NN (see the sketch after this list).
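The poster gives these five steps only as prose, so here is a minimal sketch of the filter-and-refine pipeline, assuming the caller supplies a pool of binary hash functions (hash_family) and a black-box distance dist; every identifier is illustrative rather than from the poster.

import random

def build_index(database, hash_family, k, l):
    # Steps 1-2: pick l hash vectors of k binary functions each and
    # hash every database object into the corresponding l tables.
    hash_vectors = [random.sample(hash_family, k) for _ in range(l)]
    tables = [dict() for _ in range(l)]
    for obj in database:
        for table, hv in zip(tables, hash_vectors):
            key = tuple(h(obj) for h in hv)        # k-bit bucket id
            table.setdefault(key, []).append(obj)
    return hash_vectors, tables

def nn_query(q, hash_vectors, tables, dist):
    # Step 3: hash the query with the same functions.
    # Step 4 (filter): collect collisions; everything else is pruned.
    candidates, seen = [], set()
    for table, hv in zip(tables, hash_vectors):
        key = tuple(h(q) for h in hv)
        for obj in table.get(key, []):
            if id(obj) not in seen:
                seen.add(id(obj))
                candidates.append(obj)
    # Step 5 (refine): compute actual distances, return the closest.
    return min(candidates, key=lambda x: dist(q, x), default=None)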
Analysis
COST MODEL: minimize the number of distance computations.
Probability of collision between any two objects: $C(x_1, x_2) = \Pr_{h \in H_{DBH}}[h(x_1) = h(x_2)]$
Probability of collision on a k-bit hash table: $C_k(x_1, x_2) = C(x_1, x_2)^k$
Probability of collision in at least one of the l hash tables: $C_{k,l}(x_1, x_2) = 1 - (1 - C(x_1, x_2)^k)^l$
Accuracy, i.e. the probability over all queries Q that we retrieve the nearest neighbor N(Q):
$\mathrm{Accuracy}_{k,l} = \int_{Q \in X} C_{k,l}(Q, N(Q)) \Pr(Q)\,dQ$
LookupCost: the expected number of objects that collide with Q in at least one of the l hash tables:
$\mathrm{LookupCost}_{k,l}(Q) = \sum_{x \in U} C_{k,l}(Q, x)$
HashCost: the number of distance computations needed to evaluate the h-functions (two per function, for kl functions):
$\mathrm{HashCost}_{k,l}(Q) = 2kl$
Total cost per query: $\mathrm{Cost}_{k,l}(Q) = \mathrm{LookupCost}_{k,l}(Q) + \mathrm{HashCost}_{k,l}(Q)$
Efficiency (for all queries): $\mathrm{Cost}_{k,l} = \int_{Q \in X} \mathrm{Cost}_{k,l}(Q) \Pr(Q)\,dQ$
Use sampling to estimate Accuracy and Efficiency (see the sketch below):
1. Sample queries.
2. Sample database objects.
3. Sample hash functions.
4. Compute the integrals on the samples.
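A sketch of the sampling-based estimation, under the assumption that queries, objects, and hashes are disjoint samples of queries, database objects, and hash functions; the function names are invented for the sketch.

def collision_prob(x1, x2, hashes):
    # Empirical estimate of C(x1, x2) over the sampled hash functions.
    return sum(h(x1) == h(x2) for h in hashes) / len(hashes)

def c_kl(c, k, l):
    # C_{k,l} = 1 - (1 - C^k)^l: collision in at least one of l tables.
    return 1.0 - (1.0 - c ** k) ** l

def estimate_accuracy_efficiency(queries, objects, hashes, dist, k, l):
    # Replace the integrals over X with averages over the sampled queries.
    acc_sum, cost_sum = 0.0, 0.0
    for q in queries:
        nn = min(objects, key=lambda x: dist(q, x))   # N(Q) on the sample
        acc_sum += c_kl(collision_prob(q, nn, hashes), k, l)
        lookup = sum(c_kl(collision_prob(q, x, hashes), k, l)
                     for x in objects)
        cost_sum += lookup + 2 * k * l                # LookupCost + HashCost
    return acc_sum / len(queries), cost_sum / len(queries)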
Finding optimal k & l
Given a required accuracy (say 90%): for k = 1, 2, …, compute the smallest l that yields the required accuracy, then keep the (k, l) pair that minimizes the estimated cost (see the sketch below).
Typically, the optimal k is the last k for which efficiency improves.
[Figure: histogram of C(Q, N(Q)) over the sample queries; x-axis from 0.5 to 1, y-axis (number of queries) from 0 to 800.]
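One way to realize this search, reusing the hypothetical estimate_accuracy_efficiency() from the sketch above (wrapped so that estimate_fn(k, l) returns an (accuracy, cost) pair):

def optimal_k_l(estimate_fn, target_acc=0.90, max_k=32, max_l=1000):
    # Accuracy grows with l, so for each k scan l upward and stop at
    # the first value that meets the target accuracy.
    best = None                                   # (cost, k, l)
    for k in range(1, max_k + 1):
        for l in range(1, max_l + 1):
            acc, cost = estimate_fn(k, l)
            if acc >= target_acc:
                if best is None or cost < best[0]:
                    best = (cost, k, l)
                break                             # smallest l for this k
    return best

Consistent with the rule of thumb above, the loop over k can also be cut off early, as soon as the best cost stops improving.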
Additional Optimizations
Hierarchical DBH (HDBH):
1. Rank queries according to D(Q, N(Q)).
2. Divide the space into disjoint subsets (equi-height).
3. Train separate indices for each subset (see the sketch after this panel).
Reduce hash cost: use a small number of “pseudoline” points, so that many hash functions share the same reference objects.
[Figure: subsets A, B, C along the D(Q, N(Q)) axis, with example distances d1, d2, d3 and boundaries r1, r2.]
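The poster does not spell out the hierarchical construction, so the following equi-height split is a guess at the details; it assumes nn_dist(q) returns D(Q, N(Q)) for a sample query.

def equi_height_subsets(sample_queries, nn_dist, parts=3):
    # Rank queries by the distance to their nearest neighbor and cut the
    # ranking into `parts` equal-size (equi-height) groups; a separate
    # DBH index would then be trained for each group.
    ranked = sorted(sample_queries, key=nn_dist)
    size, groups = len(ranked) // parts, []
    for i in range(parts):
        end = (i + 1) * size if i < parts - 1 else len(ranked)
        groups.append(ranked[i * size:end])
    return groups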
Experiments
[Figures: accuracy vs. efficiency trade-off curves on several real-world data sets; as the abstract below reports, DBH significantly outperforms VP-trees.]

Locality Sensitive Hashing
A locality sensitive family of functions H satisfies, for some $r_1 < r_2$ and $p_1 > p_2$:
$D(x_1, x_2) \le r_1 \Rightarrow \Pr_{h \in H}[h(x_1) = h(x_2)] \ge p_1$
$D(x_1, x_2) \ge r_2 \Rightarrow \Pr_{h \in H}[h(x_1) = h(x_2)] \le p_2$
Amplify the gap between $p_1$ and $p_2$: randomly pick l hash vectors of k functions each, so that a pair collides on a single k-bit hash table with the same probability raised to the power k. Probability of collision in at least one of the l hash tables:
$D(x_1, x_2) \le r_1 \Rightarrow \Pr[\text{collision}] \ge 1 - (1 - p_1^k)^l$
$D(x_1, x_2) \ge r_2 \Rightarrow \Pr[\text{collision}] \le 1 - (1 - p_2^k)^l$
A numeric illustration follows.
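A quick numeric check of the amplification formulas; the values of p1, p2, k, and l are chosen purely for illustration.

p1, p2, k, l = 0.8, 0.5, 10, 20
amplified = lambda p: 1 - (1 - p ** k) ** l
print(round(amplified(p1), 3))   # ~0.897: close pairs still collide often
print(round(amplified(p2), 3))   # ~0.019: far pairs almost never collide

Raising to the power k pushes both probabilities down, but the l-fold repetition pulls the high one back up: that is the gap amplification.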
Abstract
A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as Locality Sensitive Hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multi-bit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.
Problem
NEAREST NEIGHBOR: Given a database S and a distance function D, our task is: for a previously unseen query q, locate a point p of the database such that the distance between q and every point o of the database is greater than or equal to the distance between p and q.
Computing D may be very expensive:
Dynamic Time Warping for time series.
Edit Distance variants for DNA alignment.
PROBLEM DEFINITION: Define an index structure to answer Nearest Neighbor queries efficiently.
A SOLUTION: Brute force! Try them all and get the exact answer (spelled out in the sketch below).
OUR SOLUTION: Are we willing to trade accuracy for efficiency? …with statistical arguments.
ACCURACY vs. EFFICIENCY:
How often is the actual NN retrieved?
How much time does NN retrieval take?
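The brute-force baseline written out as code; dist is the application's black-box distance, and nothing here goes beyond "try them all".

def brute_force_nn(q, database, dist):
    # Exact but expensive: one distance computation per database object.
    return min(database, key=lambda x: dist(q, x))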
TRAINING PHASE
[Diagram: the inputs to training are the database's distance matrix (pairwise distances, with zeros on the diagonal) and the desired accuracy; the output is the trained DBH Index Structure.]
DBH Index Structure
Define a line projection function that maps an arbitrary space into the real line R:
$F_{x_1,x_2}(x) = \dfrac{D(x,x_1)^2 + D(x_1,x_2)^2 - D(x,x_2)^2}{2\,D(x_1,x_2)}$
From real-valued to discrete-valued:
$F_{x_1,x_2}^{t_1,t_2}(x) = \begin{cases} 0 & \text{if } F_{x_1,x_2}(x) \in [t_1, t_2] \\ 1 & \text{otherwise} \end{cases}$
Hash tables should be balanced; thus $t_1, t_2$ are chosen from the set V:
$V(x_1, x_2) = \{(t_1, t_2) : \Pr_{x \in X}[F_{x_1,x_2}^{t_1,t_2}(x) = 0] = 0.5\}$
This family of functions works on an arbitrary space, but it is not locality sensitive! Hence the statistical analysis above takes the place of the LSH formalism.
[Diagram: at query time, a previously unseen query is projected with F(x), hashed into the DBH index structure, and its NN is returned; the thresholds t1 and t2 partition the projected line R.]
A sketch of this construction follows.
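A sketch combining the pseudoline projection F, its thresholded binary version, and a sampling-based balanced choice of (t1, t2). The quartile thresholds below are one member of the set V, not necessarily the choice used by the authors.

import random

def line_projection(x1, x2, dist):
    # F_{x1,x2}(x): project an arbitrary space onto the real line R.
    # Assumes dist(x1, x2) > 0.
    d12 = dist(x1, x2)
    def F(x):
        return (dist(x, x1) ** 2 + d12 ** 2 - dist(x, x2) ** 2) / (2 * d12)
    return F

def dbh_function(dist, sample):
    # One DBH hash function: pick two reference objects, project, and
    # threshold so that about half of the sample lands inside [t1, t2],
    # which keeps the resulting hash tables balanced.
    x1, x2 = random.sample(sample, 2)
    F = line_projection(x1, x2, dist)
    values = sorted(F(x) for x in sample)
    n = len(values)
    t1, t2 = values[n // 4], values[3 * n // 4]   # middle half of the sample
    return lambda x: 0 if t1 <= F(x) <= t2 else 1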
Conclusion
Pros:
General purpose: the distance function is a black box, and no metric properties are required.
Statistical analysis is possible.
Even when the NN is not returned, a very close neighbor is returned… for many applications that's fine!
Cons:
Not sublinear in the size of the DB.
The guarantees are statistical (not probabilistic).
Needs "representative" sample sets: on the Hands dataset, actual performance differed from the simulation because the training set was not representative!