Information Retrieval
Lecture 10
Probabilistic Information Retrieval
Calculation of tf.idf
Example: What is the tf and idf for the term monstrous?
Definition: tfij = fij / mi
From DocumentFreq1.xls, there is one posting for monstrous, in file19.txt.
From AllFiles1.xls, fij = 1 and mi = 16, so
tfij = 1/16 = 0.0625
Calculation of tf.idf (continued)
Definition: idfj = log2 (n/nj) + 1
From DocumentFreq1.xls, there is one posting for monstrous, so nj = 1; the collection contains n = 20 documents.
idfj = log2 (n/nj) + 1
     = log2 (20/1) + 1
     = 5.322
tfij.idfj = 0.0625 * 5.322 = 0.3326
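The worked example above can be checked with a short Python sketch. The counts (fij = 1, mi = 16, n = 20, nj = 1) are the ones read off the spreadsheets on the slide:

```python
import math

# tf: frequency of "monstrous" in file19.txt, normalized by document length
f_ij = 1    # occurrences of the term in the document
m_i = 16    # total number of term occurrences in the document
tf = f_ij / m_i

# idf: log2(n / n_j) + 1, with n = 20 documents, n_j = 1 containing the term
n, n_j = 20, 1
idf = math.log2(n / n_j) + 1

print(tf)                  # 0.0625
print(round(idf, 3))       # 5.322
print(round(tf * idf, 4))  # 0.3326
```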
Three Approaches to Information Retrieval
Many authors divide the methods of information retrieval
into three categories:
Boolean (based on set theory)
Vector space (based on linear algebra)
Probabilistic (based on Bayesian statistics)
In practice, the latter two have considerable overlap.
Probability: independent random variables
and conditional probability
Notation
Let a, b be two events, with probabilities P(a) and P(b).
Independent events
The events a and b are independent if and only if:
P(a  b) = P(b) P(a)
Conditional probability
P(a | b) is the probability of a given b, also called
the conditional probability of a given b.
P(a | b) P(b) = P(a ∩ b) = P(b | a) P(a)
Example: independent random variables and
conditional probability
Independent
a and b are the results of throwing two dice
P(a=5 | b=3) = P(a=5) = 1/6
Not independent
a and b are the results of throwing two dice
t=a+b
P(t=8 | a=2) = 1/6
P(t=8 | a=1) = 0
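The dice examples can be verified by brute-force enumeration of the 36 equally likely outcomes. This is a minimal sketch; the helper `cond_prob` is just counting, not a library function:

```python
from itertools import product

# All 36 equally likely outcomes (a, b) of throwing two dice
outcomes = list(product(range(1, 7), repeat=2))

def cond_prob(event, given):
    """P(event | given), computed by counting outcomes."""
    g = [o for o in outcomes if given(o)]
    return sum(1 for o in g if event(o)) / len(g)

# Independent: P(a=5 | b=3) = P(a=5) = 1/6
print(cond_prob(lambda o: o[0] == 5, lambda o: o[1] == 3))

# Not independent: t = a + b
print(cond_prob(lambda o: sum(o) == 8, lambda o: o[0] == 2))  # 1/6 (need b=6)
print(cond_prob(lambda o: sum(o) == 8, lambda o: o[0] == 1))  # 0.0 (b=7 impossible)
```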
Probability Theory -- Bayesian Formulas
Notation
Let a, b be two events.
P(a | b) is the probability of a given b
Bayes Theorem
P(a | b) = P(b | a) P(a) / P(b)
P(ā | b) = P(b | ā) P(ā) / P(b)
where ā is the event "not a".
Derivation
P(a | b) P(b) = P(a ∩ b) = P(b | a) P(a)
Example of Bayes Theorem
[Venn diagram: event a is "weight over 200 lb", event b is "height over 6 ft"; region D is the intersection a ∩ b, region C is a only, region A is b only, region B is neither.]
P(a | b) = D / (A + D) = D / P(b)
P(b | a) = D / (D + C) = D / P(a)
D is P(a ∩ b)
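The diagram's regions can be turned into a numerical check of Bayes Theorem. The counts below are hypothetical (chosen only for illustration), with each region holding the number of people in it:

```python
# Hypothetical population counts for the four regions:
# D: both a and b, C: a only, A: b only, B: neither
A, B, C, D = 10, 60, 20, 10
total = A + B + C + D

P_a = (C + D) / total   # weight over 200 lb
P_b = (A + D) / total   # height over 6 ft
P_ab = D / total        # both

P_a_given_b = P_ab / P_b   # = D / (A + D)
P_b_given_a = P_ab / P_a   # = D / (D + C)

# Bayes Theorem: P(a | b) = P(b | a) P(a) / P(b)
assert abs(P_a_given_b - P_b_given_a * P_a / P_b) < 1e-12
print(P_a_given_b)  # 0.5
```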
Probability Ranking Principle
"If a reference retrieval system’s response to each
request is a ranking of the documents in the collection in
order of decreasing probability of usefulness to the
user who submitted the request, where the probabilities
are estimated as accurately as possible on the basis of
whatever data is made available to the system for this
purpose, then the overall effectiveness of the system to
its users will be the best that is obtainable on the basis of
that data."
W.S. Cooper
Probabilistic Ranking
Basic concept:
"For a given query, if we know some documents that are
relevant, terms that occur in those documents should be
given greater weighting in searching for other relevant
documents.
By making assumptions about the distribution of terms
and applying Bayes Theorem, it is possible to derive
weights theoretically."
Van Rijsbergen
Concept
R is a set of documents that are guessed to be relevant.
R̄ is the complement of R.
1. Guess a preliminary probabilistic description of R
and use it to retrieve a first set of documents.
2. Interact with the user to refine the description.
3. Repeat, thus generating a succession of
approximations to R.
Probabilistic Principle
Basic concept:
The probability that a document is relevant to a query is
assumed to depend only on the terms in the query and the
terms used to index the document.
Given a user query q, the ideal answer set, R, is the set of
all relevant documents.
Given a user query q and a document dj in the collection, the
probabilistic model estimates the probability that the user
will find dj relevant, i.e., that dj is a member of R.
Probabilistic Principle
Similarity measure:
The similarity (dj, q) is the ratio of the probability that dj is
relevant to q, to the probability that dj is not relevant to q.
This measure is near zero when the probability that the
document is relevant is small, and grows large as the
probability of relevance approaches one.
Probabilistic Principle
Given a query q and a document dj the model needs an
estimate of the probability that the user finds dj relevant,
i.e., P(R | dj).

similarity (dj, q) = P(R | dj) / P(R̄ | dj)

                  = [P(dj | R) P(R)] / [P(dj | R̄) P(R̄)]   by Bayes Theorem

                  = k × P(dj | R) / P(dj | R̄)

where k = P(R) / P(R̄) is constant for a given query, and
P(dj | R) is the probability of randomly selecting dj from R.
Binary Independence Retrieval Model (BIR)
Let x = (x1, x2, ... xn) be the term incidence vector for dj.
xi = 1 if term i is in the document and 0 otherwise.
Let q = (q1, q2, ... qn) be the term incidence vector for the
query.
We estimate P(dj | R) by P(x | R)
If the index terms are independent
P(x | R) = P(x1 | R) P(x2 | R) ... P(xn | R) = ∏ P(xi | R)
Binary Independence Retrieval Model (BIR)
S = similarity (dj, q) = k × ∏ [P(xi | R) / P(xi | R̄)]

Since the xi are either 0 or 1, this can be written:

S = k × ∏_{xi=1} [P(xi=1 | R) / P(xi=1 | R̄)] × ∏_{xi=0} [P(xi=0 | R) / P(xi=0 | R̄)]
Binary Independence Retrieval Model (BIR)
For terms that appear in the query let
pi = P(xi = 1 | R)
ri = P(xi = 1 | R̄)
For terms that do not appear in the query assume
pi = ri

S = k × ∏_{xi=qi=1} (pi / ri) × ∏_{xi=0, qi=1} [(1 - pi) / (1 - ri)]

  = k × ∏_{xi=qi=1} [pi (1 - ri) / (ri (1 - pi))] × ∏_{qi=1} [(1 - pi) / (1 - ri)]

The final product is taken over all query terms, so it is
constant for a given query.
Binary Independence Retrieval Model (BIR)
Taking logs and ignoring factors that are constant for a
given query, we have:

similarity (d, q) = ∑ log [ pi (1 - ri) / ((1 - pi) ri) ]

where the summation is taken over those terms that
appear in both the query and the document.
This similarity measure can be used to rank all
documents against the query q.
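The BIR log-sum can be sketched in a few lines of Python. The term statistics below are hypothetical values chosen only to exercise the formula:

```python
import math

def bir_similarity(doc_terms, query_terms, p, r):
    """Sum of log[p_i (1 - r_i) / ((1 - p_i) r_i)] over terms
    that appear in both the document and the query."""
    score = 0.0
    for term in query_terms & doc_terms:
        p_i, r_i = p[term], r[term]
        score += math.log(p_i * (1 - r_i) / ((1 - p_i) * r_i))
    return score

# Hypothetical estimates: p_i = 0.5 (initial guess), r_i = n_i / N
p = {"monstrous": 0.5, "whale": 0.5}
r = {"monstrous": 0.05, "whale": 0.4}
doc = {"monstrous", "ship"}
query = {"monstrous", "whale"}

# Only "monstrous" is in both, contributing log(0.5*0.95 / (0.5*0.05)) = log 19
print(bir_similarity(doc, query, p, r))
```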
Estimates of P(xi | R)
Initial guess, with no information to work from:
pi = P(xi = 1 | R) = c
ri = P(xi = 1 | R̄) = ni / N
where:
c is an arbitrary constant, e.g., 0.5
ni is the number of documents that contain xi
N is the total number of documents in the collection
Improving the Estimates of P(xi | R)
Human feedback -- relevance feedback (discussed
later)
Automatically
(a) Run query q using initial values. Consider the t top
ranked documents. Let si be the number of these
documents that contain the term xi.
(b) The new estimates are:
pi = P(xi = 1 | R) = si / t
ri = P(xi = 1 | R̄) = (ni - si) / (N - t)
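The automatic re-estimation step can be sketched directly from the formulas. Documents are modeled here as sets of terms, and the example inputs are hypothetical:

```python
def update_estimates(top_docs, term, n_i, N):
    """Re-estimate p_i and r_i from the t top-ranked documents.

    top_docs: the t top-ranked documents, each a set of terms
    n_i: number of documents in the collection containing the term
    N:   total number of documents in the collection
    """
    t = len(top_docs)
    s_i = sum(1 for d in top_docs if term in d)  # top docs containing the term
    p_i = s_i / t                  # new estimate of P(x_i = 1 | R)
    r_i = (n_i - s_i) / (N - t)    # new estimate of P(x_i = 1 | not R)
    return p_i, r_i

# Hypothetical run: t = 4 top documents, 3 contain "whale"; n_i = 5, N = 20
top = [{"whale"}, {"whale", "sea"}, {"sea"}, {"whale"}]
print(update_estimates(top, "whale", n_i=5, N=20))  # (0.75, 0.125)
```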
Discussion of Probabilistic Model
Advantages
• Firm theoretical basis
Disadvantages
• Initial definition of R has to be guessed.
• Weights ignore term frequency
• Assumes independent index terms (as does vector
model)