pornographic search engine

Download Report

Transcript pornographic search engine

Agglomerative Clustering of a
Search Engine Query Log
Sixth ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining
Monday, August 21, 2000
Doug Beeferman
Lycos Research Group
Lycos, Inc.
Waltham, MA USA
Adam Berger
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA USA
Talk overview
Motivation
Approach
Algorithmic details
Sample application: Related searches
Experimental results
Other Applications
Motivation
Group together similar Internet entities
Search queries
URLs
Users
{“rare books”, “magazines”, “bookstores”}
{http://www.amazon.com, http://www.bn.com}
{Doug Beeferman, Adam Berger}
Applications
Recommendation systems
Automatic taxonomies
Automatic communities
Approach
 Exploit search result “click log” data, which bridges all three.
1
Lycos
DB
3
2
Search:
books
“DB searched for books and
then went to bn.com”
Search Results
1. www.bn.com
2. www.amazon.com
3. www.salon.com
bn.com
(DB, books, bn.com)
Approach
 Accumulate click log over time:
Time
1
2
3
4




User
<DB>
<AB>
<DB>
<AB>
Query string
<“books”>
<“magazines”>
<“biographies”>
<“books”>
Selected URL
<http://www.bn.com>
<http://www.amazon.com>
<http://www.bn.com>
<http://www.amazon.com>
DB searched for “books” and then went to bn.com
AB searched for “magazines” and then went to amazon.com
DB searched for “biographies” and then went to bn.com
AB searched for “books” and then went to amazon.com
Approach
“Factor” data into bipartite graphs
books
books
magazines
magazines
biographies
biographies
DB
AB
bn.com
amazon.com
Approach
 Choose two of the three entities, say queries and URLs.
 Apply agglomerative clustering algorithm to the nodes in
its bipartite graph:
Merge the two “most similar” nodes in the left-hand-side
Merge the two “most similar” nodes in the right-hand-side
Repeat until some termination condition applies.
At termination queries are clustered on the left-hand-side, and
URLs are clustered on the right-hand-side.
Distance metric
Similarity of two vertices in graph defined
as the number of neighbors in common,
normalized by the total number of unique
neighbors:
 N (q1 )  N (q2 )
, if N (q1 )  N (q2 )  0

D q1 , q2    N (q1 )  N (q2 )

0, otherwise

Approach
 Agglomerative clustering in action
012…n
A
1
B
2
C
3
D
4
E
F
G
H
5
6
7
Approach
 Agglomerative clustering in action
012…n
A
1
B
2
C, F
3
D
4
E
5
6
G
H
7
Approach
 Agglomerative clustering in action
012…n
A
1
B
2
C, F
D
3, 5
4
E
6
G
H
7
Approach
 Agglomerative clustering in action
A, D, E, H
012…n
1
B
C, F
G
2, 6, 3, 5
4
7
Algorithm
 1. Score all pairs of query nodes according to distance
metric D.
 2. Merge the two query nodes q1 and q2 for which the
distance D(q1, q2) is minimal.
 3. Score all pairs of URL nodes according to distance
metric D.
 4. Merge the two URL nodes u1 and u2 for which the
distance D(u1, u2) is minimal.
 5. Unless a termination condition applies, go to step 1.
Algorithm: complexity
 Naïve implementation requires a distance metric
computation per pair of nodes, or (|Q|2 + |U|2) time
per iteration.
 But only a the neighbors of the affected nodes change
during a merge, so the per-iteration cost is proportional
to the maximum degree of any node.
Application: Related searches
 Lycos.com “related searches” feature:
 Baseline algorithm based on past audience overlap of
individual query pairs
 Data-intensive; poor coverage of rare queries
Application: Related searches
“Query”
“Suggestion”
 Current approach
(Query, Suggestion) pairs must be evidenced explicitly in data.
 Alternate approach using query clusters:
Train a set of query clusters using click log data
For an input query q, identify its cluster Q
Output the top members of Q as suggestions
Application: Related searches
 Click log data:
About 500,000 click records from a portion of a day on
Lycos.com
Pornographic queries filtered out
Queries canonicalized:
• downcased
• Whitespace collapsed
No URL canonicalization
243,000 unique queries and 362,000 unique URLs remained
Ran agglomerative clustering algorithm for 100,000 iterations
Examples of query
clusters
lyrics
guitar tabs
tabs
song lyrics
casinos
las vegas
online casinos
las vegas strip
movies
hollywood
hbo
bad movies
disney
vacations
disney world tickets
irs
irs.gov
internal revenue service forms
American airlines
aadvantage
aa.com
fitness
muscle
fitness magazines
Related Searches: Experiments
 1. Baseline
 2. Full-replacement: Draw all suggestions for an input
query from its cluster
 3. Hybrid: Replace only “weak” suggestions
 Impact of experiments measured by “clickthrough rate”
of the related searches feature:
 clicks (q)
q
 impression s(q)
q
Related Searches: Results
Experiment
Clickthrough rate
 clicks (q)
q
 impression s(q)
q
 1. Baseline :
 2. Full-replacement:
 3. Hybrid:
1.16%
1.03%
1.31%
Related Searches: Conclusions
Full-replacement strategy (preferring
cluster-derived suggestions) is inferior to
baseline
Hybrid strategy is superior to both,
overcoming data sparseness problems
inherent in baseline query pair model.
Other applications
 Cluster URLs with respect to queries
 Cluster URLs with respect to users
 Cluster users with respect to URLs, queries
 All these treat documents as mere URLs. Such “contentignorance” has its advantages:
Faster: Less data manipulation
Handles Web pages that lack text
Handles Web pages that are restricted access
Handles Web pages that are dynamic