Transcript ppt

Finding Similar Items
Set Similarity
• Problem: Find similar sets.
• Motivation: Many things can be modeled/represented as
sets
• Applications:
–
–
–
–
Face Recognition
Recommender Systems (Collaborative Filtering)
Document Similarity/Clustering
Similar (but infrequent) Items
Application: Face Recognition
• We have a database of (say) 1 million face images.
• We want to find the most similar images in the database.
• Represent faces by (relatively) invariant values, e.g.,
ratio of nose width to eye width.
• Many-one problem : given a new face, see if it is close to
any of the 1 million old faces.
• Many-Many problem : which pairs of the 1 million faces
are similar.
Simple Solution
• Represent each face by a vector of 1000 values and score
the comparisons.
• Sort-of OK for many-one problem.
• Out of the question for the many-many problem
(106*106*1000/2 numerical comparisons).
• We can do better!
Applications - Collaborative Filtering (1)
• Represent a customer, e.g., of Netflix, by the set of movies
they rented.
– Similar customers have a relatively large fraction of their
choices in common.
• Motivation: A customer can be pitched an item that a
similar customer bought, but which they did not buy.
Applications - Collaborative Filtering (2)
• Products are similar if they are bought by many of the
same customers.
– E.g., movies of the same genre are typically rented by
similar sets of Netflix customers.
• Motivation: A customer can be pitched an item that is a
similar to an item that he/she already bought.
• …isn’t this mining frequent pairs…
• Well, not exactly…
Applications: Similar Items
Problem
• Search for pairs of items that appear together a large
fraction of the times that either appears,
– even if neither item appears in very many baskets.
– Such items are considered "similar"
– E.g. caviar and crackers
Modeling
• Each item is a set: the set of baskets in which it appears.
– Thus, the problem becomes: Find similar sets!
Applications: Similar Documents (1)
• Given a body of documents, e.g., Web pages, find pairs of
docs that have a lot of text in common, e.g.:
– Mirror sites, or approximate mirrors.
– Plagiarism, including large quotations.
– Repetitions of news articles at news sites.
• How do you represent a document so it is easy to
compare with others?
– Special cases are easy, e.g., identical documents, or one
document contained verbatim in another.
– General case, where many small pieces of one doc appear
out of order in another, is hard.
Applications: Similar Documents (2)
• Represent doc by its set of shingles (or k -grams).
• A k-shingle (or k-gram) for a document is a sequence of k
characters that appears in the document.
Example.
• k=2; doc = abcab.
• Set of 2-shingles = {ab, bc, ca}.
• At that point, doc problem becomes finding similar sets.
The Jaccard Measure of Similarity
• The similarity of sets S and T is the ratio of the
sizes of the intersection and union of S and T.
– Sim (C1,C2) = |ST|/|ST| = Jaccard similarity.
• Disjoint sets have a similarity of 0, and the similarity of a
set with itself is 1.
• Another example: similarity of sets {1, 2, 3} and {1, 3, 4, 5}
is
– 2/5.
Solutions
1. Divide-Compute-Merge (DCM) uses external sorting,
merging.
2. Locality-Sensitive Hashing (LSH) can be carried out in
main memory, but admits some false negatives.
Divide-Compute-Merge
• Originally designed for “shingles” and docs.
– Applicable for other set similarity problems.
• Based on the merge-sort paradigm.
DCM Steps
doc1: s11,s12,…,s1k
doc2: s21,s22,…,s2k
…
Invert
s11,doc1
s12,doc1
…
s1k,doc1
s21,doc2
…
sort on
shingleId
t1,doc11
t1,doc12
…
t2,doc21
t2,doc22
…
Invert and pair
doc11,doc12,1
doc11,doc12,2
doc11,doc12,1
doc11,doc13,10
…
…
Merge doc11,doc13,1
…
sort on
<docId1,
docId2>
doc11,doc12,1
doc11,doc13,1
…
doc21,doc22,1
…
DCM Summary
1.
2.
3.
4.
5.
6.
Start with the pairs <shingleId, docId>.
Sort by shingleId.
In a sequential scan, generate triplets <docId1, docId2, 1>
for pairs of docs that share a shingle.
Sort on <docId1, docId2>.
Merge triplets with common docIds to generate triplets of
the form <docId1,docId2,count>.
Output document pairs with count > threshold.
Some Optimizations
• “Invert and Pair” is the most expensive step.
• Speed it up by eliminating very common shingles.
– “the”, “404 not found”, “<A HREF”, etc.
• Also, eliminate exact-duplicate docs first.
Locality-Sensitive Hashing
• First represent sets as bit vectors, and organize them as
columns of a signature matrix M.
• Big idea: hash columns of matrix M several times.
• Arrange that (only) similar columns are likely to hash to
the same bucket.
• Candidate pairs are those that hash at least once to the
same bucket.
LSH Roadmap
Similar
customers
Similar
products
Documents
Sets
Technique:
Shingling
Technique:
Minhashing
Signatures
Technique:
Locality-Sensitive
Hashing
Buckets
containing
mostly
similar
items (sets)
Minhashing
• Suppose that the elements of each set are chosen from a
"universal" set of n elements e0, el,...,en-1.
• Pick a random permutation of the n elements.
• Minhash value of a set S is the first element, in the
permuted order, that is a member of S.
Example
• Suppose the universal set is {1, 2, 3, 4, 5} and the
permutation we choose is (3,5,4,2,1).
– Set {2, 3, 5} hashes to
3.
– Set {1, 2, 5} hashes to
5.
– Set {1,2} hashes to
2.
Minhash signatures
• Compute signatures for the sets by using m permutations.
– Typically, m would be about 100.
• Signature of a set S is the list of the minhash values of S,
for each of the m permutations, in order.
Example
• Universal set is {1,2,3,4,5}, m = 3, and the permutations
are:
– 1= (1,2,3,4,5),
– 2= (5,4,3,2,1),
– 3= (3,5,1,4,2).
• Signature of S = {2,3,4} is
– (2,4,3).
Minhashing and Jaccard Distance
Surprising relationship
If we choose a permutation at random, the probability that
it will produce the same minhash values for two sets is the
same as the Jaccard similarity of those sets.
Why?
(see the next slides)
• Thus, if we have the signatures of two sets S and T, we
can estimate the Jaccard similarity of S and T by the
fraction of corresponding minhash values for the two sets
that agree.
Four Types of Rows
• Given columns C1 and C2, each representing a set,
rows may be classified as:
a
b
c
d
C1
1
1
0
0
C2
1
0
1
0
• Also, a = # rows of type a , etc.
• Note Sim (C1, C2) = a /(a +b +c ).
Minhashing–put differently
• Imagine the rows permuted randomly.
• Define “hash” function h (C ) = the number of the first (in
the permuted order) row in which column C has 1.
• Use several independent hash functions to create a
signature.
Minhashing and Jaccard Distance (2)
• The probability (over all permutations of the rows) that h
(C1) = h (C2) is the same as Sim (C1, C2).
• Both are a /(a +b +c )!
• Why?
– Look down columns C1 and C2 until we see a 1.
– If it’s a type-a row, then h (C1) = h (C2). If a type-b or
type-c row, then not.
– Type-d row doesn’t play any role for either of them.
Implementing Minhashing
• Generating a permutation of all the universe of objects is
infeasible.
• Rather, simulate the choice of a random permutation by
picking a random hash function h.
– Pretend that the permutation that h represents places
element e in position h(e).
• Several elements might wind up in the same position.
• As long as number of buckets is large, we can break ties as we
like, and the simulated permutations will be sufficiently random
that the relationship between signatures and similarity still
holds.
Algorithm for minhashing
• To compute the minhash value for a set S = {a1, a2,. . . ,an}
using a hash function h, we can execute:
V =infinity;
FOR i := 1 TO n DO
IF h(ai) < V THEN V = h(ai);
• As a result, V will be set to the hash value of the element
of S that has the smallest hash value.
Algorithm for set signature
• If we have m hash functions h1, h2, . .. , hm, then we can
compute m minhash values in parallel, as we process
each member of S.
FOR j := 1 TO m DO
Vj := infinity;
FOR i := 1 TO n DO
FOR j := 1 TO m DO
IF hj(ai) < Vj THEN Vj = hj(ai)
Example
S = {1,3,4}
T = {2,3,5}
h(x) = x mod 5
g(x) = 2x+1 mod 5
h(1) = 1
h(3) = 3
h(4) = 4
g(1) = 3
g(3) = 2
g(4) = 4
h(2) = 2
h(3) = 3
h(5) = 0
g(2) = 0
g(3) = 2
g(5) = 1
sig(S) = 1,3
sig(T) = 5,2
Locality-Sensitive Hashing of
Signatures
• Organize signatures of the various sets as a matrix M, with
a column for each set's signature and a row for each hash
function.
• Big idea: hash columns of signature matrix M several times.
• Arrange that (only) similar columns are likely to hash to the
same bucket.
• Candidate pairs are those that hash at least once to the
same bucket.
Partition Into Bands
r rows
per band
b bands
Matrix M
Partition Into Bands
• For each band,
hash its portion of
each column to a
hash table with k
buckets.
• Candidate column
pairs are those
that hash to the
same bucket for at
least one band.
Buckets
Matrix M
r rows
b bands
Analysis
• Probability that the signatures agree on one row is
s
(Jaccard similarity)
• Probability that they agree on all r rows of a given band is
s^r.
• Probability that they do not agree on all the rows of a band is
1 - s^r
• Probability that for none of the b bands do they agree in all
rows of that band is
(1 - s^r)^b
• Probability that the signatures will agree in all rows of at least
one band is
1 - (1 - s^r)^b
• This function is the probability that the signatures will be
compared for similarity.
Example
•
•
•
•
Suppose 100,000 columns.
Signatures of 100 integers.
Therefore, signatures take 40Mb.
But 5,000,000,000 pairs of signatures take a while to
compare.
• Choose 20 bands of 5 integers/band.
Suppose C1, C2 are 80% Similar
• Probability C1, C2 agree on one particular band:
– (0.8)5 = 0.328.
• Probability C1, C2 do not agree on any of the 20 bands:
– (1-0.328)20 = .00035 .
– i.e., we miss about 1/3000th of the 80%-similar column
pairs.
• The chance that we do find this pair of signatures together
in at least one bucket is 1 - 0.00035,or 0.99965.
Suppose C1, C2 Only 40% Similar
• Probability C1, C2 agree on one particular band:
– (0.4)5 = 0.01 .
• Probability C1, C2 do not agree on any of the 20 bands:
– (1-0.01)^20  .80
– i.e., we miss a lot...
• The chance that we do find this pair of signatures together
in at least one bucket is 1 - 0.80,or 0.20 (i.e. only 20%).
Analysis of LSH – What We Want
Probability
= 1 if s > t
Probability
of sharing
a bucket
No chance
if s < t
t
Similarity s of two columns
What One Row Gives You
Remember:
probability of
equal hash-values
= similarity
Probability
of sharing
a bucket
t
Similarity s of two columns
What b Bands of r Rows Gives
You
At least
one band
identical
t ~ (1/b)1/r
Probability
of sharing
a bucket
t
Similarity s of two columns
No bands
identical
1 - (1 - s r )b
Some row All rows
of a band of a band
unequal are equal
LSH Summary
• Tune to get almost all pairs with similar signatures,
but eliminate most pairs that do not have similar
signatures.
• Check in main memory that candidate pairs really do
have similar signatures.
• Optional: In another pass through data, check that
the remaining candidate pairs really are similar
columns.