Transcript of PPT slides

Probabilistic Information Retrieval
CSE6392 - Database Exploration
Gautam Das
Thursday, March 29 2006
Z.M. Joseph
Spring 2006, CSE, UTA
Basic Rules of Probability
• Recall the product rule:
$P(A \cap B) = P(A)\,P(B \mid A)$
• Bayes' Theorem:
$P(A \cap B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)$
Thus:
$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$
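As a quick illustration (not from the slides), the toy joint distribution below can be used to check the product rule and Bayes' Theorem numerically; the probability values are made up.

```python
# Minimal numeric check of the product rule and Bayes' theorem on a toy
# joint distribution (illustrative values only, not from the slides).
p_joint = {
    ("A", "B"): 0.12, ("A", "notB"): 0.18,
    ("notA", "B"): 0.28, ("notA", "notB"): 0.42,
}

p_A = p_joint[("A", "B")] + p_joint[("A", "notB")]
p_B = p_joint[("A", "B")] + p_joint[("notA", "B")]
p_B_given_A = p_joint[("A", "B")] / p_A
p_A_given_B = p_joint[("A", "B")] / p_B

# Product rule: P(A and B) = P(A) P(B|A) = P(B) P(A|B)
assert abs(p_joint[("A", "B")] - p_A * p_B_given_A) < 1e-12
assert abs(p_joint[("A", "B")] - p_B * p_A_given_B) < 1e-12

# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
assert abs(p_B_given_A - p_A_given_B * p_B / p_A) < 1e-12
```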
Basic Assumptions
• Assume a database D consisting of a set of objects: documents,
tuples, etc.
• Q : Query
• R : ‘Relevant Set’ of tuples
• Goal is to find an R for each Q, given D.
• Instead of a deterministic answer set, consider a probabilistic ordering of results
• Ranking/Scoring function should decide the degree of relevance of a
document
• Thus, given a document d:
$\mathrm{Score}(d) = P(R \mid d)$
[1]
Thus, if the relevance set were known exactly, the members of R would have
probability 1, which is the maximum score, and all other documents would
have probability 0.
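A minimal sketch of this degenerate case, assuming a hypothetical database D and a known relevant set R (names and data are illustrative, not from the slides):

```python
# Sketch (hypothetical data): if the relevant set R were known exactly,
# Score(d) = P(R | d) is 1 for members of R and 0 for everything else.
D = ["d1", "d2", "d3", "d4"]          # database of documents/tuples
R = {"d2", "d4"}                       # hypothetical known relevant set

def score(d):
    """Degenerate scoring function when R is fully known."""
    return 1.0 if d in R else 0.0

ranking = sorted(D, key=score, reverse=True)
print(ranking)   # relevant documents come first
```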
Simplification
• From [1]:
$\mathrm{Score}(d) = P(R \mid d) = \frac{P(d \mid R)\,P(R)}{P(d)}$
• Take ratios of probability that document is in R
to probability that it is not in R:
$\mathrm{ScoreRatio}(d) = \frac{P(R \mid d)}{P(\bar{R} \mid d)} = \frac{P(R \mid d)}{1 - P(R \mid d)}$
• This retains the old ordering, while factoring in the elements of D that
lie outside R.
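A small numeric check, with assumed values of P(R | d), that the odds transform p / (1 - p) preserves the ordering:

```python
# Numeric illustration (hypothetical probabilities): the odds transform
# p / (1 - p) is monotonically increasing, so ranking by the score ratio
# preserves the ordering given by P(R | d) itself.
p_R_given_d = {"d1": 0.9, "d2": 0.6, "d3": 0.2}   # assumed values

def score_ratio(p):
    return p / (1.0 - p)

by_prob  = sorted(p_R_given_d, key=lambda d: p_R_given_d[d], reverse=True)
by_ratio = sorted(p_R_given_d, key=lambda d: score_ratio(p_R_given_d[d]), reverse=True)
assert by_prob == by_ratio    # same ordering under either score
```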
Applying Bayes' Theorem
• Simplify as follows:
$\frac{P(R \mid d)}{1 - P(R \mid d)} = \frac{P(d \mid R)\,P(R)\,/\,P(d)}{P(d \mid \bar{R})\,P(\bar{R})\,/\,P(d)} = \frac{P(d \mid R)\,P(R)}{P(d \mid \bar{R})\,P(\bar{R})}$
Since R and $\bar{R}$ do not change for a given query and the factor
$P(R)/P(\bar{R})$ is constant, then:
$\frac{P(d \mid R)\,P(R)}{P(d \mid \bar{R})\,P(\bar{R})} \propto \frac{P(d \mid R)}{P(d \mid \bar{R})} \approx \frac{P(d \mid R)}{P(d \mid D)}$
We can make the approximation $\bar{R} \approx D$ in a large database with a small relevant set R.
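A sanity check of this approximation with made-up numbers: when P(R) is tiny, P(d | not-R) is very close to P(d | D).

```python
# Numeric sanity check (made-up numbers): when R is a tiny fraction of D,
# P(d | not-R) is close to P(d | D), justifying the approximation above.
p_R      = 0.001                 # assumed: relevant set is tiny
p_notR   = 1.0 - p_R
p_d_R    = 0.30                  # assumed P(d | R)
p_d_notR = 0.05                  # assumed P(d | not-R)

# Law of total probability gives P(d) over the whole database, i.e. P(d | D).
p_d_D = p_d_R * p_R + p_d_notR * p_notR
print(p_d_notR, p_d_D)           # 0.05 vs. 0.05025 -- nearly identical
```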
Observations
• This ratio forms the scoring function
• The equation still retains R, which we do
not know.
• The ordering will still be the same using
this equation as a scoring function
Derivation for Keyword Queries
• Now assume that the query contains a vector of words, with zero
probability assigned to a word that does not occur.
• Then, applying the previous equation to each
word w (instead of to a document) and
combining all the words of the query gives:
$\frac{P(d \mid R)}{P(d \mid D)} = \prod_{w \in Q} \frac{P(w \mid R)}{P(w \mid D)}$
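A minimal sketch of this scoring rule, assuming pre-computed word probabilities; following the worked example on the next slides, query words absent from the document contribute no factor (the helper name and inputs are hypothetical):

```python
# Sketch (hypothetical helper): score a document by the product of
# P(w | R) / P(w | D) over the query words that appear in it.
def keyword_score(doc_words, query_words, p_w_R, p_w_D):
    score = 1.0
    for w in query_words:
        if w in doc_words:        # absent query words contribute no factor
            score *= p_w_R[w] / p_w_D[w]
    return score
```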
Search for “Microsoft Corporation”
• Thus expression would be:
$\frac{P(\mathrm{Microsoft} \mid R)}{P(\mathrm{Microsoft} \mid D)} \cdot \frac{P(\mathrm{Corporation} \mid R)}{P(\mathrm{Corporation} \mid D)}$
• Assume you had two documents:
– D1 : Contains ‘Microsoft’ but not ‘Corporation’
– D2 : Contains ‘Corporation’ but not ‘Microsoft’
• Thus:
$\mathrm{Score}(D1) = \frac{P(\mathrm{Microsoft} \mid R)}{P(\mathrm{Microsoft} \mid D)}$
$\mathrm{Score}(D2) = \frac{P(\mathrm{Corporation} \mid R)}{P(\mathrm{Corporation} \mid D)}$
Search for “Microsoft Corporation”
• Because ‘Corporation’ is more common in the database D,
P(Corporation|D) will be far higher than P(Microsoft|D).
• Thus Score(D1) will be higher than Score(D2).
• Thus the document containing ‘Microsoft’ gets the higher ranking, since
‘Microsoft’ is more specific than the word ‘Corporation’.
• Similar to Vector Space ranking by relevance
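A worked instance of this example with assumed probabilities (the numbers are illustrative only, not from the slides):

```python
# Worked example with assumed probabilities:
# 'Microsoft' is rare in D while 'Corporation' is common, so D1 wins.
p_w_R = {"Microsoft": 0.8, "Corporation": 0.8}     # assumed, similar in R
p_w_D = {"Microsoft": 0.01, "Corporation": 0.30}   # assumed database frequencies

score_D1 = p_w_R["Microsoft"] / p_w_D["Microsoft"]        # 80.0
score_D2 = p_w_R["Corporation"] / p_w_D["Corporation"]    # ~2.67
assert score_D1 > score_D2
```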
Relevance Feedback
• Can keep fine-tuning R by getting user
feedback on initial rankings.
• Once a better R is known, better scoring
and ranking of matches is possible.
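A hedged sketch of one way the feedback could be folded back in: re-estimate P(w | R) from the documents the user marked relevant, with a common add-0.5 smoothing so unseen words do not get probability zero (helper name and data are hypothetical):

```python
# Sketch (hypothetical helper): re-estimate P(w | R) from user-approved
# documents, so later rounds of scoring use a better model of R.
def estimate_p_w_R(relevant_docs, vocabulary):
    """Smoothed fraction of user-approved documents containing each word."""
    n = len(relevant_docs)
    return {
        w: (sum(1 for doc in relevant_docs if w in doc) + 0.5) / (n + 1.0)
        for w in vocabulary
    }

feedback = [{"microsoft", "windows"}, {"microsoft", "corporation"}]  # assumed clicks
print(estimate_p_w_R(feedback, {"microsoft", "corporation", "windows"}))
```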
PIR Applied to Databases
• Originally PIR was applied to documents and not
to databases
• Applying PIR to databases is not easy, since several aspects are
difficult to capture
– These include:
• Different values of an attribute
– PIR is based on the words in a document; in a database, attribute values
such as a car being blue, black, etc. are not as easily captured
» Would you assign each color as a keyword?
– What to sacrifice in ranking is also not easy to capture: if a user
prefers black cars, how is PIR applied when listing results that do not
match entirely?