
Locality Sensitive Hashing
Basics and applications
A well-known problem

Given a large collection of documents
Identify the near-duplicate documents

Web search engines

Proliferation of near-duplicate documents
Legitimate – mirrors, local copies, updates, …
Malicious – spam, spider-traps, dynamic URLs, …
30% of web-pages are near-duplicates [1997]
Natural Approaches

Fingerprinting:
only works for exact matches
Karp-Rabin (rolling hash) – collision probability guarantees
MD5 – cryptographically-secure string hashes

Edit-distance:
metric for approximate string-matching
expensive – even for one pair of documents
impossible – for billions of web documents
Random Sampling

sample substrings (phrases, sentences, etc.)
hope: similar documents ⇒ similar samples
But – even samples of the same document will differ
Basic Idea: Shingling
[Broder 1997]

dissect document into q-grams (shingles)
T = I live and study in Pisa, …
If we set q=3 the 3-grams are:
<I live and><live and study><and study in><study in Pisa>…

represent documents by sets of hash[shingle]
⇒ The problem reduces to set intersection among sets of integers
Basic Idea: Shingling
[Broder 1997]
Doc A → shingle set SA;  Doc B → shingle set SB
Set intersection ⇒ Jaccard similarity
sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|
• Claim: A & B are near-duplicates if sim(SA, SB) is high
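To make the idea concrete, here is a minimal Python sketch (not the original [Broder 1997] implementation): it dissects a text into word q-grams, fingerprints each shingle with a 64-bit hash, and computes the Jaccard similarity by set intersection. The function names and the MD5-truncated fingerprints are illustrative choices.

```python
# A minimal, illustrative sketch of shingling + Jaccard similarity.
import hashlib

def shingles(text: str, q: int = 3) -> set[int]:
    """Dissect a document into word q-grams and fingerprint each shingle with 64 bits."""
    words = text.split()
    grams = (" ".join(words[i:i + q]) for i in range(len(words) - q + 1))
    return {int(hashlib.md5(g.encode()).hexdigest()[:16], 16) for g in grams}

def jaccard(sa: set[int], sb: set[int]) -> float:
    """sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|."""
    return len(sa & sb) / len(sa | sb)

A = shingles("I live and study in Pisa and I like it very much")
B = shingles("I live and study in Pisa and I love it very much")
print(jaccard(A, B))  # high value -> A and B are near-duplicate candidates
```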
Sketching of a document
From each shingle-set we build a “sketch vector” (of size ~200)
Postulate: Documents that share ≥ t components of their sketch-vectors are claimed to be near-duplicates
Sketching by Min-Hashing

Consider SA, SB ⊆ P = {0,…,p-1}
Pick a random permutation π of the whole set P (such as ax+b mod p)
Pick the minimal element of SA under π: α = min{π(SA)}
Pick the minimal element of SB under π: β = min{π(SB)}

Lemma:  Pr[α = β] = |SA ∩ SB| / |SA ∪ SB|
Strengthening it…

Similarity sketch sk(A):
d minimal elements under π(SA)
Or take d permutations and the min of each
Note: we can reduce the variance by using a larger d
Typically d is a few hundred mins (~200)
Computing Sketch[i] for Doc1
Document 1:
Start with 64-bit f(shingles), i.e. values in {0, …, 2^64 - 1}
Permute them with π_i
Pick the min value ⇒ Sketch[i]
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
(figure: the i-th min-hash of Document 1 is A, the i-th min-hash of Document 2 is B, both 64-bit values)
Are these equal?
Claim: This happens with probability Size_of_intersection / Size_of_union
Use 200 random permutations (taking the minimum for each), thus create one 200-dim vector per document and evaluate the fraction of shared components
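A hedged Python sketch of the min-hashing step just described: d linear "permutations" π(x) = (a·x + b) mod p are drawn at random, each document sketch keeps the minimum of each permutation over its shingle set, and similarity is estimated as the fraction of shared components. The prime, d = 200, and the reuse of the shingle sets A and B from the previous sketch are illustrative assumptions.

```python
# Illustrative min-hash sketching with d random "permutations" pi(x) = (a*x + b) mod P.
import random

P = (1 << 61) - 1  # a large prime; with random a, b this behaves like a permutation

def make_permutations(d: int = 200, seed: int = 42) -> list[tuple[int, int]]:
    rnd = random.Random(seed)
    return [(rnd.randrange(1, P), rnd.randrange(P)) for _ in range(d)]

def minhash_sketch(shingle_set: set[int], perms: list[tuple[int, int]]) -> list[int]:
    """Sketch[i] = min over the shingle set of pi_i(x) = (a_i * x + b_i) mod P."""
    return [min((a * x + b) % P for x in shingle_set) for a, b in perms]

def estimated_similarity(sk_a: list[int], sk_b: list[int]) -> float:
    """Fraction of shared components ~ Jaccard similarity of the shingle sets."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)

perms = make_permutations(d=200)
skA = minhash_sketch(A, perms)  # A, B: shingle sets from the previous sketch
skB = minhash_sketch(B, perms)
print(estimated_similarity(skA, skB))
```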
It’s even more difficult…

So we have squeezed a few KBs of data (a web page) into a few hundred bytes
But you still need a brute-force comparison (quadratic time) to compute all near-duplicate documents
This is too much, even if it is executed in RAM
Locality Sensitive Hashing
The case of the Hamming distance
How to quickly compute the fraction of different components in d-dim vectors
How to quickly compute the Hamming distance between d-dim vectors
Fraction of different components = HammingDist / d
A warm-up

Consider the case of binary (sketch) vectors, thus living in the hypercube {0,1}^d
Hamming distance: D(p,q) = # coordinates on which p and q differ

Define hash function h by choosing a set I of k random coordinates
h(p) = p|I = projection of p on I
Example: if p=01011 (d=5) and we pick I={1,4} (with k=2), then h(p) = 01
Note the similarity with the Bloom Filter
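For concreteness, a tiny Python sketch of the projection hash h(p) = p|I on the example above (coordinates are 1-based, as in the slide):

```python
def h(p: str, I: list[int]) -> str:
    """Project the binary vector p (a 0/1 string) on the 1-based coordinates in I."""
    return "".join(p[i - 1] for i in I)

print(h("01011", [1, 4]))  # -> "01"
```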
A key property
(figure: d coordinates, of which d - D(p,q) are equal)
Pr[picking an equal component] = (d - D(p,q)) / d
Pr[h(p) = h(q)] = (1 - D(p,q)/d)^k
We can vary the probability by changing k
(plots: Pr[h(p)=h(q)] as a function of distance, for k=1 and k=2)
What about False Negatives?
Reiterate

Repeat L times the k-projections hi
Declare a «match» if at least one hi matches
Example: d=5, k=2, p = 01011 and q = 00101
• I1 = {2,4}: we have h1(p) = 11 and h1(q) = 00
• I2 = {1,4}: we have h2(p) = 01 and h2(q) = 00
• I3 = {1,5}: we have h3(p) = 01 and h3(q) = 01
We set g( ) = < h1( ), h2( ), h3( ) >
⇒ p and q match !!
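A small Python sketch reproducing the slide's example: p and q are declared a match because at least one of the L = 3 projections agrees (the helper h is the same illustrative 1-based projection used above):

```python
def h(p: str, I: list[int]) -> str:
    """k-projection of the 0/1 string p on the 1-based coordinates in I."""
    return "".join(p[i - 1] for i in I)

def lsh_match(p: str, q: str, projections: list[list[int]]) -> bool:
    """Declare a match if at least one of the L projections agrees on p and q."""
    return any(h(p, I) == h(q, I) for I in projections)

# The slide's example: d=5, k=2, L=3
p, q = "01011", "00101"
I1, I2, I3 = [2, 4], [1, 4], [1, 5]
print(lsh_match(p, q, [I1, I2, I3]))  # True: h3(p) == h3(q) == "01"
```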
Measuring the error prob.
Pr[hi(p) = hi(q)] = (1 - D(p,q)/d)^k
The g() consists of L independent hashes hi
Pr[g(p) matches g(q)] = 1 - Pr[hi(p) ≠ hi(q), for all i=1,…,L]
Pr[g(p) matches g(q)] = 1 - (1 - (1 - D(p,q)/d)^k)^L
Setting s = 1 - D(p,q)/d:  Pr[g(p) matches g(q)] = 1 - (1 - s^k)^L
(plot: this probability, as a function of s, is an S-curve whose threshold sits around s = (1/L)^(1/k))
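A short Python sketch of the resulting S-curve: for s = 1 - D(p,q)/d, the match probability 1 - (1 - s^k)^L stays near 0 for dissimilar pairs and jumps towards 1 for similar ones. The values of k and L below are only illustrative.

```python
def match_prob(s: float, k: int, L: int) -> float:
    """Pr[g(p) matches g(q)] = 1 - (1 - s^k)^L, with s = 1 - D(p,q)/d."""
    return 1 - (1 - s ** k) ** L

# Illustrative parameters: the curve is low for dissimilar pairs, high for similar ones.
for s in (0.2, 0.5, 0.8, 0.95):
    print(s, round(match_prob(s, k=10, L=20), 3))
```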
Find groups of similar items

SOL 1: Buckets provide the candidate similar items
«Merge» similar sets if they share items
(figure: each point p is hashed by h1(p), …, hL(p) into tables T1, …, TL; points in a bucket are possibly similar objects)
Find groups of similar items

SOL 1: Buckets provide the candidate similar items
SOL 2:
Sort items by the hi(), and pick as similar candidates the equal ones
Repeat L times, for all hi()
«Merge» candidate sets if they share items
Check candidates !!!
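A hedged Python sketch of SOL 1: build L hash tables keyed by random k-projections, insert every point in its L buckets, and report as candidates the points that share at least one bucket with a query. All names, the toy points, and the parameters k, L, d are illustrative, and the candidates must still be checked explicitly.

```python
# Illustrative SOL 1: L hash tables keyed by random k-projections of 0/1 strings.
from collections import defaultdict
import random

def build_tables(points: dict[str, str], k: int, L: int, d: int, seed: int = 0):
    rnd = random.Random(seed)
    projections = [sorted(rnd.sample(range(d), k)) for _ in range(L)]  # 0-based coords
    tables = [defaultdict(list) for _ in range(L)]
    for pid, vec in points.items():
        for I, T in zip(projections, tables):
            T["".join(vec[i] for i in I)].append(pid)  # bucket of the projected value
    return projections, tables

def candidates(q: str, projections, tables) -> set[str]:
    """Points sharing at least one bucket with q are candidate similar items."""
    out: set[str] = set()
    for I, T in zip(projections, tables):
        out.update(T.get("".join(q[i] for i in I), []))
    return out

pts = {"U1": "01011", "U2": "01100", "U3": "01111"}
proj, tabs = build_tables(pts, k=2, L=3, d=5)
print(candidates("01011", proj, tabs))  # candidate set; still check each candidate!
```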
What about clustering?
LSH versus K-means

What about optimality? K-means is locally optimal [recently, some researchers showed how to introduce some guarantees]
What about the Sim-cost? K-means compares items in Θ(d) time and space [notice that d may be in the millions or billions]
What about the cost per iteration and their number? Typically K-means requires few iterations, each costing K·n comparisons, i.e. a total of I·K·n·d time
What about K? In principle we have to iterate over K = 1, …, n
LSH needs sort(n) time; hence, on disk, a few passes over the data, and with guaranteed error bounds
Also on-line query
Given a query q, check the buckets of hj(q) for j=1,…, L
(figure: q is hashed by h1(q), …, hL(q) into tables T1, …, TL)
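Continuing the illustrative sketch above, an on-line query only hashes q with the L projections and inspects those buckets; the nearest candidate is then picked by an explicit Hamming check.

```python
# On-line query over the tables built in the previous sketch (illustrative).
q = "01110"                       # a new query point, not in the collection
cand = candidates(q, proj, tabs)  # only the L buckets of q are inspected
best = min(cand, key=lambda pid: sum(a != b for a, b in zip(pts[pid], q)), default=None)
print(cand, best)                 # candidates, then the closest one by Hamming check
```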
Locality Sensitive Hashing
and its applications
More problems, indeed
Another classic problem
The problem: Given U users, the goal is to find groups of similar users (or users similar to a given user Q)
Features = personal data, preferences, purchases, navigational behavior, followers/following or +1s, …
A feature is typically a numerical value: binary or real
Feature:  1  2  3  4  5
U1:       0  1  0  1  1
U2:       0  1  1  0  0
U3:       0  1  1  1  1

Hamming distance: # of different components
More than Hamming distance

Example: (figure) q is the query point, p* is its Nearest Neighbor
Approximation helps
(figure: a ball of radius r around the query q, containing p*)
A slightly different problem
Approximate Nearest Neighbor

Given an error parameter ε > 0
For query q and nearest-neighbor p’, return p such that
D(q, p) ≤ (1 + ε) · D(q, p’)
Justification

Mapping objects to metric space is heuristic anyway

Get tremendous performance improvement
A workable approach

Given an error parameter ε > 0 and a distance threshold t > 0
(t,ε)-Approximate NN Query:
If there is no point p with D(q,p) < t, return FAILURE
Else, return any p’ with D(q,p’) < (1+ε)·t
Application: Approximate Nearest Neighbor

Assume the maximum distance is T
Run in parallel for
t = 1, (1+ε), (1+ε)^2, (1+ε)^3, …, T
Time/space – O(log_{1+ε} T) overhead
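A small Python sketch of this threshold grid (the values of T and ε below are illustrative): the number of parallel instances grows only as O(log_{1+ε} T).

```python
import math

def threshold_grid(T: float, eps: float) -> list[float]:
    """Thresholds t = 1, (1+eps), (1+eps)^2, ..., capped at T."""
    levels = math.ceil(math.log(T, 1 + eps)) + 1
    return [min((1 + eps) ** i, T) for i in range(levels)]

grid = threshold_grid(T=1000, eps=0.5)  # illustrative values
print(len(grid))                        # about log_{1+eps} T instances run in parallel
```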
Locality Sensitive Hashing
and its applications
The analysis
LSH Analysis

For a fixed threshold r, we distinguish between
Near: D(p,q) < r
Far:  D(p,q) > (1+ε)·r
(figure: point a with a near point b inside radius r and a far point c outside radius (1+ε)·r)
A locality-sensitive hash h should guarantee

Near points are hashed together with Pr[h(a)=h(b)] ≥ P1

Far points may be mapped together but Pr[h(a)=h(c)] ≤ P2
where, of course, we have that P1 > P2
What about Hamming distance?
Family hi(.) = p|c1,…,ck , where the ci are chosen randomly

If D(a,b) ≤ r, then Pr[hi(a) = hi(b)] = (1 – D(a,b)/d)^k ≥ (1 – r/d)^k = (p1)^k = P1
If D(a,c) > (1+ε)r, then Pr[hi(a) = hi(c)] = (1 – D(a,c)/d)^k < (1 – (1+ε)r/d)^k = (p2)^k = P2
where, of course, we have that p1 > p2 (as P1 > P2)
LSH Analysis
The LSH-algorithm with the L mappings hi() correctly solves the (r,ε)-NN problem on a point query q if the following hold:
I. The total number of points FAR from q that fall into the visited buckets hi(q) is small (≤ 3L).
II. If p* is NEAR to q, then hi(p*) = hi(q) for some i (p* is in a visited bucket).
Theorem. Take k = log_{1/p2} n and L = n^{ln p1 / ln p2}; then the two properties above hold with probability at least 0.298.
Repeating the process Θ(1/δ) times, we ensure a probability of success of at least 1-δ.

Space ≈ n·L = n^{1+ρ}, where ρ = (ln p1 / ln p2) < 1
Query time ≈ L buckets accessed, which are n^ρ
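A hedged Python sketch that instantiates these formulas for the Hamming case: given n, d, r and ε it derives p1, p2, ρ = ln p1 / ln p2, k = log_{1/p2} n and L = n^ρ. The concrete numbers below are only illustrative.

```python
import math

def lsh_params(n: int, d: int, r: int, eps: float):
    """Derive (p1, p2, rho, k, L) for the Hamming LSH family described above."""
    p1 = 1 - r / d               # per-coordinate agreement prob. for a NEAR pair
    p2 = 1 - (1 + eps) * r / d   # ... for a FAR pair
    rho = math.log(p1) / math.log(p2)
    k = math.ceil(math.log(n, 1 / p2))  # k = log_{1/p2} n
    L = math.ceil(n ** rho)             # L = n^rho tables, also the query cost
    return p1, p2, rho, k, L

print(lsh_params(n=1_000_000, d=256, r=16, eps=1.0))  # illustrative numbers
```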
Proof
p* is a point near to q: D(q,p*) < r
FAR(q) = set of points p s.t. D(q,p) > (1+ε)·r
BUCKETj(q) = set of points p s.t. hj(p) = hj(q)
Let us define the following events:
E1 = the number of far points in the visited buckets is at most 3L, i.e.
Σ_{j=1,…,L} |FAR(q) ∩ BUCKETj(q)| ≤ 3·L
E2 = p* occurs in some visited bucket, i.e. ∃j s.t. hj(q) = hj(p*)
Bad collisions: more than 3L
Let p be a point in FAR(q):
Pr[for a fixed j, hj(p) = hj(q)] < P2 = (p2)^k
Given that k = log_{1/p2} n:
Pr[for a fixed j, a far point p satisfies hj(p) = hj(q)] < (p2)^{log_{1/p2} n} = 1/n
E[ Σ_{j=1,…,L} |FAR(q) ∩ BUCKETj(q)| ] = Σ_{j=1,…,L} E[ |FAR(q) ∩ BUCKETj(q)| ] < n · (1/n) · L = L
By Markov’s inequality, Pr[X > 3·E[X]] ≤ 1/3, so it follows that Pr[not E1] ≤ 1/3
Good collision: p* occurs
For any hj, the probability that p* collides with q is
Pr[hj(p*) = hj(q)] ≥ P1 = (p1)^k = (p1)^{log_{1/p2} n} = n^{-ln p1 / ln p2}
Given that L = n^{ln p1 / ln p2}, this is exactly 1/L.
So we have that Pr[not E2] = Pr[not finding p* in q’s buckets]
= (1 - Pr[hj(p*) = hj(q)])^L ≤ (1 – 1/L)^L ≤ 1/e
Finally, Pr[E1 and E2] ≥ 1 – Pr[not E1 OR not E2] ≥ 1 – (Pr[not E1] + Pr[not E2]) ≥ 1 - (1/3) - (1/e) ≈ 0.298