Mining Query Subtopics from Search Log Data

Download Report

Transcript Mining Query Subtopics from Search Log Data

Mining Query Subtopics from
Search Log Data
Hu et al., SIGIR’12
Presented by Baichuan Li
Outline
•
•
•
•
•
•
Problem
Two observations
Approach
Results
Applications
Conclusion
Problem
• Intention != Query
• Many queries are ambiguous or multifaceted
Problem
Problem
• Xbox
– Homepage?
– Online game?
– Where to buy?
• Manchester
– Manchester news?
– Manchester tourism?
– Manchester weather?
– Manchester united?
Solution
• Identifying the major subtopics of a query
from the search log data
– Personalized search
– Query suggestion
– Search result presentation
• Clustering
• Re-ranking
• Diversification
Observations from The Logs
• One subtopic per search (OSS)
– If a user clicks multiple URLs after submitting a
query, then the clicked URLs tend to represent the
same subtopic
Observations from The Logs
• Subtopic clarification by additional keyword
(SCAK)
– Users often add additional keywords (in most cases,
one additional keyword) to a query to expand the
query in order to clarify its subtopic
– The URLs clicked after searching both with the original
and the expanded queries tend to represent the same
subtopic
– The key word tends to be indicative of the subtopic
– E.g.
• harry shum microsoft
One Subtopic per Search
• Use it as the rule for subtopic identification
– 10,000 groups of multiclicks of individual queries
– The rule is accurate when all the URLs within the
multi-clicks are about the same sense or face
Accuracy V.S. Frequency
Accuracy v.s. Click Position
Distribution
• The queries with higher frequencies in search
log data are more likely to have multi-clicks
Subtopic Clarification by Additional
Keyword
• The keywords ‘microsoft’ and ‘jr’ can be used
to represent the two groups (subtopics)
respectively
Query Types
•
•
•
•
‘Q’: the query is a single phrase
‘Q + W’: ‘Q’ + a keyword
‘W + Q’: a keyword + ‘Q’
‘Others’
Subtopic Overlap and URL Overlap
• Randomly selected 500 pairs of queries with
the forms ‘Q’ and ‘Q + W’
• If subtopics of an expanded query are
contained in subtopics of the original query,
then there is ‘subtopic overlap’
• Check whether two queries share identical
clicked URLs (‘URL overlap’)
Distribution
• The more popular (frequent) a query is, the
more likely the rule is applicable
Mining Subtopics
Preprocessing
Clustering
• Similarity function
– The similarity function between URLs ui and uj :
– S1: based on the OSS phenomenon
where
and
denote the vectors of multi-clicks of
ui and uj respectively
– S2: based on the SCAK phenomenon
– S3: based on string similarities
SCAK Similarity (S2)
where
and
denote the
vectors of keywords associated
with ui and uj respectively
Data
• TREC Data: TREC search result diversification
track 3 in 2009
• DataSetA and DataSetB: queries and URLs
randomly sampled from the logs of the
commercial search engine
Results
Application: Search Result Reranking
• The user is first asked which subtopic she is
interested in, with the subtopics shown at the
top of the results page
• When the user selects a subtopic, the URLs
belonging to the subtopic will be moved to the
top (re-ranked)
• The relative order between URLs inside and
outside of the subtopic will be kept
Example
Evaluation
• Data
– Collect search log data of 20, 000 randomly selected
searches
– Each query has at least two subtopics mined by the
method
• Results
– The average position of last clicked URLs is 3.41
– Assume the cost for the user to check the subtopics
and click one of them is 1.0
– The average position of last clicked URLs belonging to
the same subtopics is 1.80
Conclusion
• Study the problem of query subtopic mining
• Discovered two phenomena of user search
behavior
– one subtopic per search
– Subtopic clarification by additional keyword
• Design a novel similarity function
• Applications on search result reranking (and
search result clustering)
• Problem
– Can only be employed when there is enough search
log data