
Towards Context-Aware Search by Learning A
Very Large Variable Length Hidden Markov
Model from Search Logs
Huanhuan Cao1, Daxin Jiang2, Jian Pei3,
Enhong Chen1 and Hang Li2
1University of Science and Technology of China,
2Microsoft Research Asia,
3Simon Fraser University
Context of User Queries
• A user usually issues multiple queries and conducts multiple rounds of
interaction to satisfy a single information need.
[Figure: a session consists of rounds of interaction, each pairing a query with
its clicks; the queries and clicks preceding the current query form its context.]
An Example
• Suppose Ada plans to buy a new car and needs some car reviews.
• But she doesn’t know how to formulate an effective query.
Consequently, she issues a series of queries about different cars.
• Not surprisingly, for each query, the review web sites are ranked low
and easy to miss.
Why Is Context Useful?
• Suppose we have the following search log:

SID | Search session                                    | Clicked URL
S1  | Ford => Toyota => GMC => Allstate                 | www.autohome.com
S2  | Ford cars => Toyota cars => GMC cars => Allstate  | www.autohome.com
S3  | Ford cars => Toyota cars => Allstate              | www.allstate.com
S4  | GMC => GMC dealers                                | www.gmc.com
Patterns in the Search Log
• Pattern 1:
– 50% of users clicked a car review web site, www.autohome.com, after
issuing a series of queries about cars.
• Ada will have a better experience if the search engine knows Pattern 1.
• Pattern 2:
– 75% of users searched for car insurance after a series of queries
about different cars.
• The search engine can provide more appropriate query suggestions and
URL recommendations if it knows Pattern 2.
• Idea:
– Learn from search logs to provide context-aware ranking, query
suggestion, and URL recommendation.
Related Work
Mining the “wisdom of the crowds” from search logs:
• Improving ranking: use click-through data as implicit feedback
(mining click-through data).
• Query suggestion: mining session data; a mixture approach: CACB.
• URL recommendation: mining search trails.
Only CACB considers context, but:
1. CACB constrains each query to a single search intent;
2. CACB doesn’t use click information as context;
3. CACB can only be used for query suggestion.
Modeling Context by vlHMM
(variable length Hidden Markov Model)
Overview of Technical Details
• Definition of vlHMM
• Parameter Estimation
• Challenges and Strategies
• Applications
Formal Definition
• Given:
– A set of hidden states {s1 … sNs};
– A set of queries {q1 … qNq};
– A set of URLs {u1 … uNu};
– The maximal length Tmax of state sequences.
• A vlHMM is a probability model defined as follows:
– The transition probability distribution Δ = {P(si | Sj)}, where Sj is a
state sequence of length at most Tmax;
– The initial state distribution Ψ = {P(si)};
– The emission probability distribution for each state sequence,
Λ = {P(q, U | Sj)}.
(A minimal parameter-container sketch follows.)
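To make the three parameter families concrete, here is a minimal Python sketch
of a container for them. The class and field names (VLHMM, initial, transition,
emit_query, emit_url) are illustrative assumptions, not the paper's
implementation, and factoring the joint emission P(q, U | Sj) into a query term
and per-URL terms is likewise an assumption of the sketch.

```python
from collections import defaultdict

class VLHMM:
    """Container for vlHMM parameters (illustrative sketch).

    A state sequence S_j = (s_1, ..., s_t), t <= T_max, is keyed as a
    tuple of state ids, so variable-length histories are first-class keys.
    """

    def __init__(self, t_max):
        self.t_max = t_max
        # Psi: initial state distribution, P(s_i).
        self.initial = defaultdict(float)
        # Delta: transition distribution, P(s_i | S_j), keyed by (S_j, s_i).
        self.transition = defaultdict(float)
        # Lambda: emission distributions conditioned on a state sequence.
        # Here the joint P(q, U | S_j) is assumed to factor into a query
        # term and per-clicked-URL terms (a modeling assumption).
        self.emit_query = defaultdict(float)  # key: (S_j, q)
        self.emit_url = defaultdict(float)    # key: (S_j, u)
```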
Parameter Estimation
• Let X = {O1 … ON} be the set of training sessions, where:
– On is a sequence of pairs (qn,1, Un,1) … (qn,Tn, Un,Tn);
– qn,t and Un,t are the t-th query and the set of clicked URLs,
respectively;
– moreover, we use un,t,k to denote the k-th URL in Un,t.
• The task is to find the maximum likelihood estimate Θ* such that
Θ* = argmaxΘ P(X | Θ) = argmaxΘ Πn P(On | Θ).
EM
• The original problem is in a complex form which may not have a
closed-form solution.
• Alternatively, we use an iterative method: EM (Expectation-Maximization).
• Objective function (the standard EM Q-function):
Q(Θ; Θg) = Σn ΣSn P(Sn | On, Θg) log P(On, Sn | Θ),
where Sn ranges over the candidate hidden state sequences for session On
and Θg denotes the current parameter estimates.
• E-step: compute the posteriors P(Sn | On, Θg) under Θg.
• M-step: Θg+1 = argmaxΘ Q(Θ; Θg).
(A toy sketch of one iteration follows.)
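A brute-force sketch of one EM iteration for a toy vlHMM, to make the E- and
M-steps concrete. It enumerates all state sequences exhaustively and, for
brevity, lets a state emit only the current query (clicked URLs and the final
renormalization are omitted); all function and variable names are illustrative,
and the paper's actual training is far more scalable (next slide).

```python
from collections import defaultdict
from itertools import product

def em_iteration_toy(sessions, states, init, trans, emit):
    """One EM iteration for a toy vlHMM (illustrative, not the paper's code).

    sessions: list of query sequences, e.g. [["ford", "toyota"], ...].
    init, trans, emit are defaultdict(float) parameter tables:
    init[s] = P(s); trans[(prefix, s)] = P(s | prefix), with prefix a
    tuple of states; emit[(s, q)] = P(q | s).
    """
    new_init = defaultdict(float)
    new_trans = defaultdict(float)
    new_emit = defaultdict(float)
    for obs in sessions:
        T = len(obs)
        # E-step: unnormalized posterior P(S | O) for every state sequence S.
        post = {}
        for seq in product(states, repeat=T):
            p = init[seq[0]] * emit[(seq[0], obs[0])]
            for t in range(1, T):
                p *= trans[(seq[:t], seq[t])] * emit[(seq[t], obs[t])]
            post[seq] = p
        z = sum(post.values()) or 1.0
        # M-step: accumulate expected counts weighted by the posterior.
        for seq, p in post.items():
            w = p / z
            new_init[seq[0]] += w
            for t in range(1, T):
                new_trans[(seq[:t], seq[t])] += w
            for t in range(T):
                new_emit[(seq[t], obs[t])] += w
    # Renormalizing each family over its conditioning context turns the
    # expected counts back into probabilities (omitted for brevity).
    return new_init, new_trans, new_emit
```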
Challenges for Training A Large vlHMM
• Challenge 1:
– The EM algorithm needs a user-specified number of hidden states.
– However, in our problem, the hidden states correspond to users'
search intents, whose number is unknown.
• Strategy:
– We apply the mining techniques developed in our previous work as a
preprocessing step before the parameter learning process.
• Challenge 2:
– Search logs may contain hundreds of millions of training sessions.
– It is impractical to learn a vlHMM from such a huge training data set
on a single machine.
• Strategy:
– We deploy the learning task on a distributed system under the
map-reduce programming model (see the sketch after this list).
• Challenge 3:
– Each machine needs to hold the values of all parameters.
– Since the log data usually contains millions of unique queries and
URLs, the space of parameters is extremely large.
• Strategy:
– We develop a special initialization strategy based on the clusters
mined from the click-through bipartite graph.
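Under the map-reduce strategy, each worker scores its shard of sessions with the
current parameters and emits partial expected counts, which a reducer sums per
parameter key. A skeleton of one distributed EM iteration follows; the split and
all names are illustrative assumptions, with the per-session E-step passed in as
a callback yielding (parameter_key, expected_count) pairs (e.g. the toy E-step
above, adapted to yield counts).

```python
from collections import defaultdict

def map_expected_counts(session_shard, params, e_step):
    """Mapper (illustrative): a worker loads the current parameters,
    runs the per-session E-step on its shard, and emits partial
    expected counts keyed by parameter."""
    counts = defaultdict(float)
    for obs in session_shard:
        for key, weight in e_step(obs, params):
            counts[key] += weight
    return list(counts.items())

def reduce_expected_counts(emitted_pairs):
    """Reducer (illustrative): sums partial counts per parameter key;
    normalizing each conditional distribution afterwards yields the
    parameters for the next EM iteration."""
    totals = defaultdict(float)
    for key, weight in emitted_pairs:
        totals[key] += weight
    return totals

def em_iteration_distributed(shards, params, e_step):
    """One EM iteration over sharded logs; on a real cluster the shards
    would run in parallel rather than in this sequential loop."""
    emitted = []
    for shard in shards:
        emitted.extend(map_expected_counts(shard, params, e_step))
    return reduce_expected_counts(emitted)
```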
Applications
• Given an observation O consisting of q1 … qt and U1 … Ut.
• Document re-ranking:
– Rank by P(u | O) = Σ P(u | st) P(st | O), summing over the current
state st.
• Query suggestion & URL recommendation:
– Suggest the top k queries by P(q | O) = Σ P(q | st+1) P(st+1 | O);
– Recommend the top k URLs by P(u | O) = Σ P(u | st+1) P(st+1 | O),
summing over the next state st+1.
• The advantages of our model: unification and predictive power.
(A scoring sketch follows.)
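Given the posterior over current states P(st | O) (and, via the transition
distribution, over next states P(st+1 | O)), all three applications reduce to
the same weighted sum. A small illustrative sketch; the function names and
dictionary layouts are assumptions, not the paper's code.

```python
def rerank_scores(candidate_urls, state_posterior, emit_url):
    """Re-ranking scores (illustrative): P(u | O) = Σ P(u | st) P(st | O),
    where state_posterior maps each current state st to P(st | O) and
    emit_url maps (state, url) to P(u | s)."""
    return {u: sum(p * emit_url.get((s, u), 0.0)
                   for s, p in state_posterior.items())
            for u in candidate_urls}

def suggest_queries(candidate_queries, next_state_posterior, emit_query, k=5):
    """Top-k query suggestion (illustrative): P(q | O) =
    Σ P(q | st+1) P(st+1 | O); next_state_posterior holds P(st+1 | O),
    obtained by pushing P(st | O) through the transition distribution Δ."""
    scores = {q: sum(p * emit_query.get((s, q), 0.0)
                     for s, p in next_state_posterior.items())
              for q in candidate_queries}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```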
Experiments
• A large-scale search log from Live Search
– Web searches in English from the US market
• Training Data
– 1,812,563,301 search queries
– 2,554,683,191 clicks
– 840,356,624 sessions
– 151,869,102 unique queries
– 114,882,486 unique URLs
• Test Data
– 100,000 sessions extracted from another search log
Coverage
• For each test session <(q1, U1) … (qT, UT)>, the vlHMM handles each
qi. When i > 1, <(q1, U1) … (qi-1, Ui-1)> is used as the context.
• The total coverage is 58.3%.
• Denote the set of test cases without context as Test0 and the others
as Test1.
• For the covered cases in Test1, 25.5% of the contexts are recognized.
Re-ranking
• Baseline:
– Boost the URLs with high click counts for the given query.
• Evaluation:
– Sample 500 re-ranked URL pairs from Test0 and from the cases whose
contexts are recognized in Test1, respectively.
– Each re-ranked URL pair is judged as Improved, Degraded, or Unsure
by 3 experts.
[Figure: the effectiveness of re-ranking by the vlHMM and Baseline 1 on
(a) Test0 and (b) Test1.]
Examples of Re-ranking
[Figure: two examples. After a search for games, the URL about games is
moved up; after a visit to the homepage of Ask Jeeves, the URL introducing
the history of Ask Jeeves is moved up.]
URL Recommendation
• Baseline:
– Recommend the URLs with high click counts following the current query.
• Evaluation:
– “Leave-one-out” method: given <(q1, U1) … (qT, UT)>, we use qT-1 as
the test query and consider UT as the ground truth.
– Let R be the set of recommended URLs; the precision is |R ∩ UT| / |R|
and the recall is |R ∩ UT| / |UT| (a small helper below illustrates the
computation).
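For concreteness, a tiny helper for the leave-one-out metrics; the URL ids in
the example are made up.

```python
def precision_recall(recommended, ground_truth):
    """Precision |R ∩ UT| / |R| and recall |R ∩ UT| / |UT|
    for one leave-one-out test case."""
    hits = len(set(recommended) & set(ground_truth))
    return hits / len(recommended), hits / len(ground_truth)

# Example: 2 of the 4 recommended URLs appear among the 3 clicked URLs,
# giving precision 2/4 = 0.5 and recall 2/3 ≈ 0.67.
p, r = precision_recall(["u1", "u2", "u3", "u4"], ["u1", "u2", "u5"])
```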
[Figure: the precision and recall of the URLs recommended by the vlHMM
and Baseline 2.]
An Example of URL Recommendation
[Figure: after a search for an online electronics store, the recommended
URLs include online stores for electronics and for equipment.]
Query Suggestion
• Baseline:
– CACB, a context-aware, concept-based approach to query suggestion.
• Evaluation:
– The results of the two approaches are directly comparable, since both
consider contexts.
– However, the vlHMM increases the ratio of recognized contexts by 55%.
Summary
• We propose a general approach to context-aware
search by learning a vlHMM from log data.
• We tackle the challenges of learning a large vlHMM
with millions of states from hundreds of millions of
search sessions.
• The experimental results on a large real data set
clearly show that our context-aware approach is both
effective and efficient.
• Our recent work on context-aware search:
– Huanhuan Cao, Derek Hao Hu, Dou Shen, Daxin Jiang, Jian-Tao Sun,
Enhong Chen and Qiang Yang. Context-aware query classification. To
appear in SIGIR’09.
– Huanhuan Cao, Daxin Jiang, Jian Pei, Enhong Chen and Hang Li.
Towards context-aware search by learning a large variable length
Hidden Markov Model from search logs. To appear in WWW’09.
– Huanhuan Cao, Daxin Jiang, Jian Pei, Qi He, Zhen Liao, Enhong Chen
and Hang Li. Context-aware query suggestion by mining click-through
and session data. KDD’08, pages 875-883, 2008. (This paper won the
Best Application Paper Award of KDD’08.)
Thanks