Similarity Search
in High Dimensions
via Hashing
Aristides Gionis, Piotr Indyk, and Rajeev Motwani
Department of Computer Science
Stanford University
presented by Jiyun Byun
Vision Research Lab in ECE at UCSB
Outline
Introduction
Locality Sensitive Hashing
Analysis
Experiments
Concluding Remarks
Introduction
Nearest neighbor search (NNS)
The curse of dimensionality
experimental approach: use heuristics
analytical approach
Approximate approach
ε-Nearest Neighbor Search (ε-NNS)
Goal: for any given query q ∈ R^d, return a point p ∈ P such that
d(q, p) ≤ (1+ε) · d(q, P)
where d(q, P) is the distance of q to its closest point in P
right answers are much closer than irrelevant ones
time/quality trade-off
Locality Sensitive Hashing (LSH)
Collision probability depends on the distance between points:
higher collision probability for close objects
lower collision probability for those that are far apart
Given a query point,
hash it using a set of hash functions
inspect the entries in each bucket
Locality Sensitive Hashing
Locality Sensitive Hashing (LSH)
Setting
C : the largest coordinate among all points in the given dataset P ⊂ R^d
Embed P into the Hamming cube {0,1}^d' of dimension d' = C · d
v(p) = Unary_C(x1) … Unary_C(xd)
i.e. the unary code for each coordinate: Unary_C(x) is x ones followed by (C − x) zeros
isometric embedding: d1(p, q) = d_H(v(p), v(q))
the embedding preserves the distance between points
e.g. for d = 2, C = 3: p = (2, 1) ∈ R^2 maps to v(p) = 110100 ∈ H^6 = {0,1}^6
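A minimal sketch of the unary embedding in Python (the function names are mine; the example reproduces the slide's p = (2, 1), C = 3 case):

```python
def unary(x, C):
    """Unary code of an integer coordinate x: x ones followed by C - x zeros."""
    return [1] * x + [0] * (C - x)

def embed(p, C):
    """Embed a point with non-negative integer coordinates into {0,1}^(C*d)."""
    bits = []
    for x in p:
        bits.extend(unary(x, C))
    return bits

# The slide's example: d = 2, C = 3, p = (2, 1) maps to 110100.
assert embed((2, 1), 3) == [1, 1, 0, 1, 0, 0]

# Isometry check: Hamming distance of the codes equals the L1 distance.
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

p, q = (2, 1), (0, 3)
assert hamming(embed(p, 3), embed(q, 3)) == sum(abs(a - b) for a, b in zip(p, q))
```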
Locality Sensitive Hashing (LSH)
Hash functions(1/2)
Build a hash function on Hamming cube in d’
dimensions
Choose L subsets of the dimensions: I1, I2, …, IL
each Ij consists of k elements of {1, …, d'}
found by sampling uniformly at random with replacement
Project each point onto each Ij:
gj(p) = projection of p on Ij, obtained by concatenating the
bit values of p for the dimensions in Ij
Store p in buckets gj(p), j = 1, …, L (see the sketch below)
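A sketch of this first-level hashing under the definitions above (Python; `embed()` comes from the previous snippet, the other names are mine):

```python
import random

def make_projections(d_prime, k, L, seed=0):
    """Sample the index sets I_1, ..., I_L: each consists of k coordinates
    drawn uniformly at random, with replacement, from {0, ..., d'-1}."""
    rng = random.Random(seed)
    return [[rng.randrange(d_prime) for _ in range(k)] for _ in range(L)]

def g(bits, I):
    """g_j(p): the bits of the embedded point concatenated over the
    dimensions in I_j."""
    return tuple(bits[i] for i in I)

def build_index(embedded, k, L):
    """Store each point id in the L buckets g_1(p), ..., g_L(p).
    `embedded` is a list of {0,1}-vectors, e.g. produced by embed() above."""
    d_prime = len(embedded[0])
    projections = make_projections(d_prime, k, L)
    tables = [{} for _ in range(L)]
    for pid, bits in enumerate(embedded):
        for table, I in zip(tables, projections):
            table.setdefault(g(bits, I), []).append(pid)
    return projections, tables
```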
Locality Sensitive Hashing (LSH)
Hash functions(2/2)
Two levels of hashing
LSH function
maps a point p to bucket gj(p)
standard hash function
maps the contents of buckets into a hash table of size M
M = n / B
B : bucket capacity (memory utilization parameter)
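The slides do not say which standard hash function is used at the second level; one minimal stand-in (Python's built-in `hash` is only an illustration, not the paper's choice):

```python
def second_level(bucket_label, M):
    """Map a first-level bucket label gj(p) (a k-bit tuple) to one of
    M = n / B slots of an ordinary hash table. Python's built-in hash()
    stands in for the 'standard hash function'; it is salted per process,
    so a persistent on-disk index would need a deterministic hash instead."""
    return hash(bucket_label) % M
```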
Query processing
Search the buckets gj(q), j = 1, …, L
Approximate K-NNS:
search until c·L points are found or all L indices are searched
output the K points closest to q
fewer if fewer than K points are found
(r, ε)-neighbor: the same procedure with parameter r
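Putting the pieces together, a hedged sketch of the approximate K-NNS query (reusing `embed()` and `g()` from the earlier snippets; `c_times_L` is my name for the slide's c·L cutoff):

```python
def query(q, C, points, projections, tables, K, c_times_L):
    """Approximate K-NNS sketch. Inspect buckets g_1(q), ..., g_L(q);
    stop once c*L candidates are collected or all L indices are
    searched; return the K candidates closest to q in L1 distance
    (fewer if fewer were found)."""
    bits = embed(q, C)
    candidates = set()
    for table, I in zip(tables, projections):
        candidates.update(table.get(g(bits, I), ()))
        if len(candidates) >= c_times_L:
            break

    def l1(pid):
        return sum(abs(a - b) for a, b in zip(points[pid], q))

    return sorted(candidates, key=l1)[:K]

# Usage (hypothetical parameters):
# projections, tables = build_index([embed(p, C) for p in points], k=700, L=10)
# answers = query(q, C, points, projections, tables, K=10, c_times_L=20)
```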
Analysis
A family H = {h : {0,1}^d → U} is (r1, r2, P1, P2)-sensitive
if for all p, q ∈ S:
if ||p − q|| ≤ r1 then Pr_H[h(q) = h(p)] ≥ P1
if ||p − q|| > r2 then Pr_H[h(q) = h(p)] ≤ P2
where r1 < r2 and P1 > P2
The family of single-bit projections on the Hamming cube H^d' is
(r, r(1+ε), 1 − r/d', 1 − r(1+ε)/d')-sensitive:
if d_H(q, p) = r (r bits on which p and q differ), then
Pr[h(q) ≠ h(p)] = r/d', so Pr[h(q) = h(p)] = 1 − r/d'
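A quick empirical check of this sensitivity claim for a single random-bit projection (Monte Carlo, reusing `embed()` from the earlier sketch):

```python
import random

def single_bit_collision_rate(u, v, trials=100_000, seed=1):
    """Estimate Pr[h(q) = h(p)] for h(p) = p_i with i uniform over the
    d' bits; it should come out to 1 - d_H(u, v)/d'."""
    rng = random.Random(seed)
    d_prime = len(u)
    misses = 0
    for _ in range(trials):
        i = rng.randrange(d_prime)
        misses += u[i] != v[i]
    return 1 - misses / trials

u, v = embed((2, 1), 3), embed((0, 3), 3)   # d_H(u, v) = 4, d' = 6
print(single_bit_collision_rate(u, v))      # ~ 1 - 4/6 = 0.333...
```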
LSH solves the (r, ε)-neighbor problem:
determine whether there exists a point within distance r of the query point q,
or whether all points are at least distance r(1+ε) away from q
In the former case, return a point within distance r(1+ε) of q.
Repeat the construction to boost the probability of success (see the calculation below).
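The boosting step is the standard LSH amplification argument, spelled out here (my derivation; it follows directly from the (r1, r2, P1, P2) definition above):

```latex
% For one k-bit function g_j and a pair with distance at most r:
%   Pr[g_j(q) = g_j(p)] >= P_1^k.
% Hence the probability that q misses p in all L indices is
\Pr\bigl[g_j(q) \neq g_j(p)\ \text{for all } j\bigr]
  \le \bigl(1 - P_1^{k}\bigr)^{L}
  \le e^{-L P_1^{k}},
% which falls below any failure probability \delta once
L \ge P_1^{-k} \ln(1/\delta).
```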
ε-NN problem
For a given query point q,
return a point p from the dataset P
d(q, p) ≤ (1+ε) · d(q, P)
via multiple instances of the (r, ε)-neighbor solution:
(r0, ε)-neighbor, (r0(1+ε), ε)-neighbor, (r0(1+ε)^2, ε)-neighbor, …, up to r_max (a sketch follows)
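A sketch of this reduction (Python; `neighbor_oracle` is a hypothetical stand-in for one LSH (r, ε)-neighbor structure):

```python
def eps_nn(q, neighbor_oracle, r0, r_max, eps):
    """Reduce eps-NN to (r, eps)-neighbor instances with radii
    r0, r0(1+eps), r0(1+eps)^2, ..., up to r_max.
    neighbor_oracle(q, r) returns a point within r(1+eps) of q, or None."""
    r = r0
    while r <= r_max:
        p = neighbor_oracle(q, r)
        if p is not None:
            return p        # within (1+eps) of the true NN distance
        r *= 1 + eps
    return None
```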
Experiments(1/3)
Datasets
color histograms (Corel Draw)
n = 20,000; d = 8, …, 64
texture features (Aerial photos)
n = 270,000; d = 60
Query sets
Disk : 8 KB / block ⇒ 8192/d pts / block, 2 · d · n bytes / index
each second-level bucket is directly mapped to a disk block
Experiments(2/3)
[Figure: normalized frequency profiles of the interpoint distance, for the color histogram and the texture feature datasets]
Experiments(3/3)
Performance
speed : average number of blocks accessed
effective error
E = (1/|Q|) · Σ_{q ∈ Q} d_LSH(q) / d*(q)
d_LSH(q) : distance from q to the nearest neighbor returned by LSH
d*(q) : distance from q to its true nearest neighbor
miss ratio
the fraction of queries for which no answer was found
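For concreteness, the two metrics could be computed as follows (a sketch; the input format is my assumption, with unanswered queries excluded from E and counted by the miss ratio):

```python
def evaluate(results):
    """Compute effective error E and miss ratio from a list of
    (d_lsh, d_star) pairs, one per query; d_lsh is None when LSH
    returned no answer."""
    answered = [(dl, ds) for dl, ds in results if dl is not None]
    if not answered:
        return None, 1.0
    E = sum(dl / ds for dl, ds in answered) / len(answered)
    miss_ratio = 1 - len(answered) / len(results)
    return E, miss_ratio
```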
Experiments: color histogram (1/4)
Error vs. number of indices (L)
[Figure: effective error vs. L; n = 19,000, k = 700]
Experiments: color histogram (2/4)
Dependence on n
[Figure: disk accesses vs. number of database points, for approximate 1-NNS and approximate 10-NNS]
Experiments: color histogram (3/4)
Miss ratios
[Figure: miss ratio vs. number of database points, for approximate 1-NNS and approximate 10-NNS]
Experiments: color histogram (4/4)
Dependence on d
[Figure: disk accesses vs. number of dimensions, for approximate 1-NNS and approximate 10-NNS]
Experiments: texture features (1/2)
Number of indices vs. error
Experiments: texture features (2/2)
Number of indices vs. size
Concluding remarks
Locality Sensitive Hashing
fast approximate similarity search
has dynamic and join versions
Future work
hybrid techniques: combining tree-based and hashing-based methods