Structured Annotations of Web Queries


Author: N. Sarkas et al.
SIGMOD 2010




Motivation
Approach
Query Processing
Probabilistic Model
2

Do keyword search on structured databases
◦ Users leverage their experience with search engines
to search for records in a database

Problems
◦ Keyword matching may return no results
◦ Misinterpretation of the user’s intent
 Q=“white tiger”
 T=Shoes, Color=white, Shoe Line=tiger
 T=Books, Title=white tiger
 The animal white tiger (not in the database)
◦ Must provide fast responses (in ms) for web users
 Querying every database for every query is not efficient
3

Annotation
◦ An annotation token for a table, AT = (t, T.A), pairs a
query token t with a table attribute T.A
 Q=“50 inch LG”
 (LG, TVs.Brand), (LG, Monitors.Brand), (50 inch, TVs.Diagonal)
◦ Tokens of all types (nominal, ordinal, and numerical) are
annotated against the table
 Numerical tokens are accepted within a certain range
TVs
Type | Brand   | Diagonal
-----+---------+---------
TV   | Samsung | 46 inch
TV   | LG      | 50 inch

Monitors
Type    | Brand   | Diagonal
--------+---------+---------
Monitor | Samsung | 19 inch
Monitor | LG      | 24 inch
4
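The value-to-attribute lookup behind these annotation tokens can be sketched as an inverted index from attribute values to (table, attribute) pairs. This is an illustrative sketch, not the paper's implementation; `build_annotators` is a made-up helper, the tables follow the slide's example, and the range handling for numerical tokens is omitted.

```python
# Sketch: inverted index from attribute values to annotation tokens AT = (t, T.A).
# Table names, attributes, and rows follow the slide's TVs/Monitors example.

def build_annotators(tables):
    """Map each value token to the set of (table, attribute) pairs it can annotate."""
    index = {}
    for table, rows in tables.items():
        for row in rows:
            for attr, value in row.items():
                index.setdefault(value.lower(), set()).add((table, attr))
    return index

tables = {
    "TVs": [
        {"Type": "TV", "Brand": "Samsung", "Diagonal": "46 inch"},
        {"Type": "TV", "Brand": "LG", "Diagonal": "50 inch"},
    ],
    "Monitors": [
        {"Type": "Monitor", "Brand": "Samsung", "Diagonal": "19 inch"},
        {"Type": "Monitor", "Brand": "LG", "Diagonal": "24 inch"},
    ],
}

index = build_annotators(tables)
print(sorted(index["lg"]))       # [('Monitors', 'Brand'), ('TVs', 'Brand')]
print(sorted(index["50 inch"]))  # [('TVs', 'Diagonal')]
```

The query "50 inch LG" thus yields exactly the annotation tokens listed above: LG matches the Brand attribute of both tables, while 50 inch matches only TVs.Diagonal.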



Generate structured annotations for the keywords
Find the maximal annotations
Annotation scoring
5

Generate structured annotations for a query
 Keywords: k1, k2
 (T1, {(k1, T1.Att1), (…)}, {free tokens})
(T2, {(k1, T2.Att1), (…)}, {})
 Free token: a query keyword not associated with an
attribute
 Example “50 inch LG lcd”
 (TVs, {(50 inch, TVs.Diagonal), (LG, TVs.Brand), (lcd, TVs.Screen)}, {})
 (Monitors, {(50 inch, Monitors.Diagonal), (LG, Monitors.Brand), (lcd, Monitors.Screen)}, {})
 (Refrig, {(50 inch, Refrig.Width), (LG, Refrig.Brand)}, {lcd})
6
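The per-table split into annotated tokens AT and free tokens FT can be sketched as follows. This is an assumed helper (`annotate_query` is not from the paper), and for simplicity it picks one matching attribute per token per table.

```python
# Sketch: assemble structured annotations (T, AT, FT) for each candidate table.
def annotate_query(tokens, token_index):
    """token_index maps a token to the set of (table, attribute) pairs it matches."""
    per_table = {}
    all_tables = {t for tok in tokens for (t, _) in token_index.get(tok, ())}
    for table in all_tables:
        at, ft = [], []
        for tok in tokens:
            attrs = [a for (t, a) in token_index.get(tok, ()) if t == table]
            if attrs:
                at.append((tok, f"{table}.{attrs[0]}"))
            else:
                ft.append(tok)  # a query keyword not associated with any attribute of T
        per_table[table] = (table, at, ft)
    return per_table

# Made-up index entries matching the "50 inch LG lcd" example.
token_index = {
    "50 inch": {("TVs", "Diagonal"), ("Refrig", "Width")},
    "lg":      {("TVs", "Brand"), ("Refrig", "Brand")},
    "lcd":     {("TVs", "Screen")},
}
anns = annotate_query(["50 inch", "lg", "lcd"], token_index)
print(anns["Refrig"][2])  # ['lcd'] -- lcd is a free token for Refrig
```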

Find the maximal annotations
◦ Given a table, we want more annotated tokens and
fewer free tokens
◦ An annotation S = (T, AT, FT) is maximal if
 there is no S’ = (T, AT’, FT’) s.t. AT’ ⊇ AT and FT’ ⊆ FT,
with at least one containment strict
 AT: annotated tokens
 FT: free tokens
◦ Example
 S1 = (TVs, {(LG, TVs.Brand), (lcd, TVs.Screen)}, {})
 S2 = (TVs, {(LG, TVs.Brand)}, {lcd})
 S3 = (TVs, {(50 inch, TVs.Diagonal), (LG, TVs.Brand)}, {lcd})
 S2 is not maximal: S3 annotates a strict superset of its tokens
7
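The dominance check can be sketched directly from the definition. This is an assumed representation (tables plus frozensets of tokens), not the paper's data structures.

```python
# Sketch: keep only maximal annotations. An annotation s is dominated if some
# other annotation annotates strictly more tokens without adding free tokens.
def maximal(annotations):
    """annotations: list of (table, frozenset_of_AT, frozenset_of_FT)."""
    keep = []
    for s in annotations:
        dominated = any(
            s2 is not s and s[1] < s2[1] and s2[2] <= s[2]
            for s2 in annotations
        )
        if not dominated:
            keep.append(s)
    return keep

# The S1/S2/S3 example from the slide.
S1 = ("TVs", frozenset({("LG", "TVs.Brand"), ("lcd", "TVs.Screen")}), frozenset())
S2 = ("TVs", frozenset({("LG", "TVs.Brand")}), frozenset({"lcd"}))
S3 = ("TVs", frozenset({("50 inch", "TVs.Diagonal"), ("LG", "TVs.Brand")}), frozenset({"lcd"}))
result = maximal([S1, S2, S3])
print(S2 in result)  # False: S2 is dominated by S3
```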

Scoring annotations
◦ Intuition
 Query: LG 30 inch screen
 Want: TVs, Monitors
 Dislike
 DVD Players
 There is no DVD player with a screen in the database
 People don’t query the size of a DVD player
 Cell Phones
 The screen sizes in the database are significantly smaller
◦ A probabilistic model is chosen for the scoring
8

Generative probabilistic model
◦ If the user searches a table, which words may the user
use, and with what probability?
◦ P(T.Å_i): the probability that the user searches table T
and the subset of attributes T.Å_i
◦ Given the attributes, the user selects tokens with
probability
 P(S_i) = P({AT_i, FT_i} | T.Å_i) · P(T.Å_i)
 Å: the attributes of table T plus a free-token attribute f
 Example “LG 30 inch screen”
 P(S_i) = P({LG, 30 inch}, {screen} | Brand, Diagonal, f) ·
P(TVs.Brand, TVs.Diagonal, TVs.f)
 Need to simplify the equation
9
 Assumption 1
 Annotated and free tokens are independent
 P(AT_i, FT_i | T.Å_i) = P(AT_i | T.Å_i) · P(FT_i | T.Å_i)
 Assumption 2
 The user’s choice of free tokens depends only on the table
 P(FT_i | T.Å_i) = P(FT_i | T)
 The user’s choice of annotated tokens depends only on the
attributes of the table, not on the free tokens
 P(AT_i | T.Å_i) = P(AT_i | T.A_i)
 P(S_i) = P(AT_i | T.A_i) · P(FT_i | T) · P(T.Å_i) ……(2)
 S_i: one of the annotations S_q = {S1, …, Sk} of query q
10

Equation (2) assumes that every query targets some
table in the data collection.
◦ Not true. Ex: Q=“green apple”.
 Annotation: green = color, apple = brand.
 But “green apple” could also mean a fruit.
◦ Approach
 Open Language Model (OLM) table: captures open-world
queries (ex: built from the Bing query log)
 S_q = {S1, …, Sk, Sk+1}, where Sk+1 = S_OLM
 S_OLM = (OLM, {FT_q})
 P(S_OLM) = P(FT_q | OLM) · P(OLM) ……(3)
 Keep only plausible annotations: P(S_i) / P(S_OLM) > θ > 0
11
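The plausibility test is a simple ratio filter against the OLM score; a minimal sketch, with made-up probability values:

```python
# Sketch: keep only annotations S_i with P(S_i) / P(S_OLM) > theta > 0.
def plausible(scored, p_olm, theta=2.0):
    """scored: list of (annotation, probability); keep those that beat the OLM."""
    return [s for s, p in scored if p / p_olm > theta]

# Hypothetical scores: the refrigerator reading of "50 inch LG lcd" is implausible.
scored = [("S_TVs", 0.12), ("S_Monitors", 0.06), ("S_Refrig", 0.0004)]
print(plausible(scored, p_olm=0.01, theta=2.0))  # ['S_TVs', 'S_Monitors']
```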

We have two probabilistic models
◦ P(S_i) = P(AT_i | T.A_i) · P(FT_i | T) · P(T.Å_i) ……(2)
◦ P(S_OLM) = P(FT_q | OLM) · P(OLM) ……(3)

What’s next?
◦ Maximize the probability
 Simplify the equations where necessary
◦ Build a system based on the model
12
Thank You!
13

Consider a query from the web log
◦ It is either formulated from an annotation or is a
free-text query.
 P(q) = Σ_{S_i∈S_q} P(S_i) + P(S_OLM)
      = Σ_{S_i∈S_q} P(AT_i | T_i.A_i) · P(FT_i | T_i) · P(T_i.Å_i) + P(FT_q | OLM) · P(OLM)
      = Σ_{S_i∈S_q} α_i π_ij + β π_0
 β = statistics from the web query log
 α_i = (the number of entries that take the value / the number
of entries) · P(FT_i | T)
 P(FT_i | T) = λ · P(FT_i | UMT) + (1 − λ) · P(FT_i | OLM)
 UMT = all possible names and values that can be associated with table T
 λ = confidence level
 Ex: FT = computer, T1 = Monitors, T2 = TVs
14
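The scoring pieces on this slide can be sketched in a few lines. All probability values below are made up for illustration; `p_ft_given_table` and `p_q` are assumed helper names.

```python
# Sketch: P(FT|T) smoothed between the table's unigram model UMT and the OLM,
# and P(q) in its final form, sum_i alpha_i * pi_i + beta * pi_0.
def p_ft_given_table(ft_tokens, p_umt, p_olm, lam=0.8):
    """P(FT|T) = product over tokens of lam*P(tok|UMT) + (1-lam)*P(tok|OLM)."""
    p = 1.0
    for tok in ft_tokens:
        p *= lam * p_umt.get(tok, 0.0) + (1 - lam) * p_olm.get(tok, 0.0)
    return p

def p_q(alpha_pi_pairs, beta, pi0):
    """P(q) = sum_i alpha_i * pi_i + beta * pi_0."""
    return sum(a * p for a, p in alpha_pi_pairs) + beta * pi0

# "computer" never appears in the TV table's vocabulary (the FT=computer
# example above), but the OLM gives it some mass, so the score stays non-zero.
p_umt_tv = {"hdtv": 0.02}
p_olm = {"computer": 0.01, "hdtv": 0.005}
print(round(p_ft_given_table(["computer"], p_umt_tv, p_olm), 6))  # 0.002
print(round(p_q([(0.002, 0.3)], beta=0.05, pi0=0.1), 6))          # 0.0056
```

The smoothing is why a free token absent from the table's own vocabulary does not zero out an otherwise good annotation.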

Given
 Observed data: web query log Q
 Model: P(q) = Σ_{i=1}^{T} Σ_{A_j∈P_i} α_qij π_ij + β_q π_0

To Find
 π_ij and π_0 that maximize the likelihood
 L(Q) = Σ_{q∈Q} log( Σ_{i=1}^{T} Σ_{A_j∈P_i} α_qij π_ij + β_q π_0 )
 subject to Σ_{ij} π_ij + π_0 = 1
15

Initial step
◦ Select initial π_ij, π_0

Repeat
◦ Expectation step
 Based on the current π_ij, π_0, estimate γ, θ
 γ_qij^{t+1} = α_qij π_ij^t / P^t(q)
 θ_q^{t+1} = β_q π_0^t / P^t(q)
 where P^t(q) = Σ_{ij} α_qij π_ij^t + β_q π_0^t
◦ Maximization step
 Based on the current γ, θ, maximize over π_ij, π_0
 π_ij^{t+1} = Σ_{q∈Q} γ_qij^{t+1} / |Q|
 π_0^{t+1} = Σ_{q∈Q} θ_q^{t+1} / |Q|
16
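These updates can be sketched as a standard mixture-weight EM loop. This is an illustrative sketch under assumed flat indexing (one π per table/attribute-set component plus π_0 for the OLM), with made-up per-query scores; it is not the paper's code.

```python
# Sketch of EM for the mixture weights: alphas[q][j] plays the role of
# alpha_qij, betas[q] is beta_q; responsibilities are gamma and theta.
def em(alphas, betas, n_components, iters=50):
    pis = [1.0 / (n_components + 1)] * n_components
    pi0 = 1.0 / (n_components + 1)
    for _ in range(iters):
        # E-step: responsibilities under the current pis and pi0
        gam, the = [], []
        for a, b in zip(alphas, betas):
            denom = sum(a[j] * pis[j] for j in range(n_components)) + b * pi0
            gam.append([a[j] * pis[j] / denom for j in range(n_components)])
            the.append(b * pi0 / denom)
        # M-step: new weights are average responsibilities over the queries Q
        n = len(alphas)
        pis = [sum(g[j] for g in gam) / n for j in range(n_components)]
        pi0 = sum(the) / n
    return pis, pi0

alphas = [[0.9, 0.1], [0.2, 0.7], [0.05, 0.05]]  # made-up per-query scores
betas = [0.01, 0.02, 0.8]
pis, pi0 = em(alphas, betas, 2)
print(round(sum(pis) + pi0, 6))  # 1.0 -- the weights stay normalized
```

Because each query's responsibilities sum to one, the constraint Σ π_ij + π_0 = 1 is preserved automatically at every iteration.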
Thank You!
17

P(AT_i | T.A_i) = |T(AT_i.V)| / |T|
◦ The fraction of the entries in table T that take the
values AT_i.V.
◦ Q=“50 inch LG lcd”
 S = (TVs, {(LG, TVs.Brand), (50 inch, TVs.Diagonal)}, {lcd})
 T.A = {Brand, Diagonal}
 T(AT.V) = all records in TVs of brand LG with diagonal size
50 inch
 [offline] A mapping from each value to the number of
matching records in the table
18
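The fraction |T(AT.V)| / |T| is just a count over matching rows; a minimal sketch with the earlier TVs rows (`p_at_given_attrs` is an assumed helper name, and the offline count mapping is replaced by a direct scan):

```python
# Sketch: P(AT|T.A) as the fraction of rows in T whose attributes take the
# annotated values.
def p_at_given_attrs(rows, bindings):
    """bindings: {attribute: value}. Fraction of rows matching all bindings."""
    matches = sum(all(r.get(a) == v for a, v in bindings.items()) for r in rows)
    return matches / len(rows)

tvs = [
    {"Brand": "Samsung", "Diagonal": "46 inch"},
    {"Brand": "LG", "Diagonal": "50 inch"},
]
print(p_at_given_attrs(tvs, {"Brand": "LG", "Diagonal": "50 inch"}))  # 0.5
```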

Maximum likelihood estimation
◦ Given
 A set of observed data {x_1, …}
 A proposed model P(x|θ)
◦ To find
 The parameter θ that maximizes the likelihood

Likelihood vs. probability (source: Wikipedia)
 Probability: “If I were to flip a fair coin 100 times, what is
the probability of it landing heads-up every time?”
 Likelihood: “Given that I have flipped a coin 100 times and it
has landed heads-up 100 times, what is the likelihood that the
coin is fair?”

Rephrase
◦ Given
 Observed data: web query log
 Model: P(q) = Σ_{i=1}^{T} Σ_{A_j∈P_i} α_qij π_ij + β_q π_0
◦ To Find
 π_ij and π_0 that maximize the likelihood
L(Q) = Σ_{q∈Q} log P(q)

The EM algorithm can be used to maximize L(Q)
19
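The coin example can be put in numbers. This is a sketch added for illustration, not from the slides; since all flips are heads, the binomial coefficient is 1 and drops out.

```python
# Sketch: binomial likelihood of the coin's heads-probability p given the data.
def likelihood(p, heads, flips):
    """Likelihood of p given heads out of flips (coefficient omitted: heads == flips)."""
    return p ** heads * (1 - p) ** (flips - heads)

# The likelihood that the coin is fair, given 100 heads in 100 flips, is tiny;
# a heavily biased coin explains the data far better.
print(likelihood(0.5, 100, 100))                                  # 0.5**100, about 7.9e-31
print(likelihood(0.99, 100, 100) > likelihood(0.5, 100, 100))     # True
```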


An iterative algorithm with 2 steps
Expectation step
◦ Given the current parameter estimate θ^t, calculate the
expected value Q of the log-likelihood function
Maximization step
◦ Find the parameter θ^{t+1} that maximizes Q
20