pornographic search engine
Download
Report
Transcript pornographic search engine
Agglomerative Clustering of a
Search Engine Query Log
Sixth ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining
Monday, August 21, 2000
Doug Beeferman
Lycos Research Group
Lycos, Inc.
Waltham, MA USA
Adam Berger
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA USA
Talk overview
Motivation
Approach
Algorithmic details
Sample application: Related searches
Experimental results
Other Applications
Motivation
Group together similar Internet entities
Search queries
URLs
Users
{“rare books”, “magazines”, “bookstores”}
{http://www.amazon.com, http://www.bn.com}
{Doug Beeferman, Adam Berger}
Applications
Recommendation systems
Automatic taxonomies
Automatic communities
Approach
Exploit search result “click log” data, which bridges all three.
1
Lycos
DB
3
2
Search:
books
“DB searched for books and
then went to bn.com”
Search Results
1. www.bn.com
2. www.amazon.com
3. www.salon.com
bn.com
(DB, books, bn.com)
Approach
Accumulate click log over time:
Time
1
2
3
4
User
<DB>
<AB>
<DB>
<AB>
Query string
<“books”>
<“magazines”>
<“biographies”>
<“books”>
Selected URL
<http://www.bn.com>
<http://www.amazon.com>
<http://www.bn.com>
<http://www.amazon.com>
DB searched for “books” and then went to bn.com
AB searched for “magazines” and then went to amazon.com
DB searched for “biographies” and then went to bn.com
AB searched for “books” and then went to amazon.com
Approach
“Factor” data into bipartite graphs
books
books
magazines
magazines
biographies
biographies
DB
AB
bn.com
amazon.com
Approach
Choose two of the three entities, say queries and URLs.
Apply agglomerative clustering algorithm to the nodes in
its bipartite graph:
Merge the two “most similar” nodes in the left-hand-side
Merge the two “most similar” nodes in the right-hand-side
Repeat until some termination condition applies.
At termination queries are clustered on the left-hand-side, and
URLs are clustered on the right-hand-side.
Distance metric
Similarity of two vertices in graph defined
as the number of neighbors in common,
normalized by the total number of unique
neighbors:
N (q1 ) N (q2 )
, if N (q1 ) N (q2 ) 0
D q1 , q2 N (q1 ) N (q2 )
0, otherwise
Approach
Agglomerative clustering in action
012…n
A
1
B
2
C
3
D
4
E
F
G
H
5
6
7
Approach
Agglomerative clustering in action
012…n
A
1
B
2
C, F
3
D
4
E
5
6
G
H
7
Approach
Agglomerative clustering in action
012…n
A
1
B
2
C, F
D
3, 5
4
E
6
G
H
7
Approach
Agglomerative clustering in action
A, D, E, H
012…n
1
B
C, F
G
2, 6, 3, 5
4
7
Algorithm
1. Score all pairs of query nodes according to distance
metric D.
2. Merge the two query nodes q1 and q2 for which the
distance D(q1, q2) is minimal.
3. Score all pairs of URL nodes according to distance
metric D.
4. Merge the two URL nodes u1 and u2 for which the
distance D(u1, u2) is minimal.
5. Unless a termination condition applies, go to step 1.
Algorithm: complexity
Naïve implementation requires a distance metric
computation per pair of nodes, or (|Q|2 + |U|2) time
per iteration.
But only a the neighbors of the affected nodes change
during a merge, so the per-iteration cost is proportional
to the maximum degree of any node.
Application: Related searches
Lycos.com “related searches” feature:
Baseline algorithm based on past audience overlap of
individual query pairs
Data-intensive; poor coverage of rare queries
Application: Related searches
“Query”
“Suggestion”
Current approach
(Query, Suggestion) pairs must be evidenced explicitly in data.
Alternate approach using query clusters:
Train a set of query clusters using click log data
For an input query q, identify its cluster Q
Output the top members of Q as suggestions
Application: Related searches
Click log data:
About 500,000 click records from a portion of a day on
Lycos.com
Pornographic queries filtered out
Queries canonicalized:
• downcased
• Whitespace collapsed
No URL canonicalization
243,000 unique queries and 362,000 unique URLs remained
Ran agglomerative clustering algorithm for 100,000 iterations
Examples of query
clusters
lyrics
guitar tabs
tabs
song lyrics
casinos
las vegas
online casinos
las vegas strip
movies
hollywood
hbo
bad movies
disney
vacations
disney world tickets
irs
irs.gov
internal revenue service forms
American airlines
aadvantage
aa.com
fitness
muscle
fitness magazines
Related Searches: Experiments
1. Baseline
2. Full-replacement: Draw all suggestions for an input
query from its cluster
3. Hybrid: Replace only “weak” suggestions
Impact of experiments measured by “clickthrough rate”
of the related searches feature:
clicks (q)
q
impression s(q)
q
Related Searches: Results
Experiment
Clickthrough rate
clicks (q)
q
impression s(q)
q
1. Baseline :
2. Full-replacement:
3. Hybrid:
1.16%
1.03%
1.31%
Related Searches: Conclusions
Full-replacement strategy (preferring
cluster-derived suggestions) is inferior to
baseline
Hybrid strategy is superior to both,
overcoming data sparseness problems
inherent in baseline query pair model.
Other applications
Cluster URLs with respect to queries
Cluster URLs with respect to users
Cluster users with respect to URLs, queries
All these treat documents as mere URLs. Such “contentignorance” has its advantages:
Faster: Less data manipulation
Handles Web pages that lack text
Handles Web pages that are restricted access
Handles Web pages that are dynamic