Recuperação de Informação B

Download Report

Transcript Recuperação de Informação B

What is coming…

Today:
Probabilistic models
 Improving classical models

Latent Semantic Indexing
 Relevance feedback (Chapter 5)


Monday Feb 5


Chapter 5 continued
Wednesday Feb 7

Web Search Engines

Chapter 13 & Google paper
Annoucement: Free
Food Event



Where & When: GWC 487, Tuesday Feb
6th, 12:15-1:30
What: Pizza, Softdrinks
Catch: A pitch on going to graduate
school at ASU CSE…


Meet admissions committee, some graduate students,
some faculty members
Silver-lining: Did we mention Pizza?

Also not as boring as time-sharing condo
presentations.
Problems with Vector
Model

No semantic basis!
 Keywords
are plotted as axes
 But
are they really independent?
 Or they othogonal?
 No
support for boolean queries
 How
do you ask for papers that
don’t contain a keyword?
Probabilistic Model


Objective: to capture the IR problem using a
probabilistic framework
Given a user query, there is an ideal answer set
 Querying
as specification of the properties of
this ideal answer set (clustering)

But, what are these properties?
 Guess
at the beginning what they could be (i.e.,
guess initial description of ideal answer set)
 Improve by iteration
Probabilistic Model




An initial set of documents is retrieved somehow
User inspects these docs looking for the relevant
ones (in truth, only top 10-20 need to be inspected)
IR system uses this information to refine
description of ideal answer set
By repeting this process, it is expected that the
description of the ideal answer set will improve
 Have
always in mind the need to guess at the very
beginning the description of the ideal answer set
 Description of ideal answer set is modeled in
probabilistic terms
Probabilistic Ranking Principle

Given a user query q and a document dj,
 estimate
the probability that the user will find the
document dj interesting (i.e., relevant).
 The
model assumes that this probability of relevance
depends on the query and the document
representations only.
 Ideal answer set is referred to as R and should
maximize the probability of relevance. Documents in the
set R are predicted to be relevant.

But,
 how
to compute probabilities?
 what is the sample space?
The Ranking

Probabilistic ranking computed as:
 sim(q,dj)
= P(dj relevant-to q) / P(dj non-relevant-to q)
 This
is the odds of the document dj being relevant
 Taking the odds minimize the probability of an erroneous
judgement

Definition:
 wij  {0,1}
 P(R
| vec(dj)) : probability that given doc is relevant
 P(R | vec(dj)) : probability doc is not relevant
The Ranking

sim(dj,q) = P(R | vec(dj)) / P(R | vec(dj))
=
[P(vec(dj) | R) * P(R)]
[P(vec(dj) | R) * P(R)]
~
P(vec(dj) | R)
P(vec(dj) | R)
 P(vec(dj)
| R) : probability of randomly selecting the
document dj from the set R of relevant documents
The Ranking

sim(dj,q) ~
P(vec(dj) | R)
P(vec(dj) | R)
Where vec(dj) is of the form (k1, k2,k3....kt)
Using pair-wise independence assumption among keywords
~

[  P(ki | R)] * [  P(ki | R)]
[  P(ki | R)] * [  P(ki | R)]
P(ki | R) : probability that the index term ki is present in a document randomly
selected from the set R of relevant documents
The Ranking

sim(dj,q) ~ log
[  P(ki | R)] * [  P(kj | R)]
[  P(ki | R)] * [  P(kj | R)]
~ K * [ log  P(ki | R) +
log
P(ki | R)
 P(ki | R) ]
P(ki | R)
~  wiq * wij * (log P(ki | R) + log P(ki | R) )
P(ki | R)
P(ki | R)
where
P(ki | R) = 1 - P(ki | R)
P(ki | R) = 1 - P(ki | R)
The Initial Ranking

sim(dj,q) ~
~  wiq * wij * (log P(ki | R) + log P(ki | R) )
P(ki | R)
P(ki | R)
 Probabilities

P(ki | R) and P(ki | R) ?
Estimates based on assumptions:
 P(ki
| R) = 0.5
 P(ki | R) = ni
N
where ni is the number of docs that contain ki
 Use this initial guess to retrieve an initial ranking
 Improve upon this initial ranking
Pluses and Minuses

Advantages:
 Docs
ranked in decreasing order of probability of
relevance

Disadvantages:
 need
to guess initial estimates for P(ki | R)
 method does not take into account tf and idf factors
Alternative Probabilistic Models

Probability Theory
 Semantically
clear
 Computationally clumsy

Why Bayesian Networks?
 Clear
formalism to combine evidences
 Modularize the world (dependencies)
 Bayesian Network Models for IR
 Inference
Network (Turtle & Croft, 1991)
 Belief Network (Ribeiro-Neto & Muntz, 1996)
Bayesian Networks
Definition:
Bayesian networks are directed acyclic graphs
(DAGS) in which the nodes represent random
variables, the arcs portray causal relationships
between these variables, and the strengths of
these causal influences are expressed by
conditional probabilities.
Bayesian Networks
yi : parent nodes (in this case, root nodes)
x : child node
y1
y2
yi cause x
Y the set of parents of x
The influence of Y on x
x
can be quantified by any function
F(x,Y) such that x F(x,Y) = 1
0 < F(x,Y) < 1
For example, F(x,Y)=P(x|Y)
yt
Bayesian Networks
x1
Given the dependencies declared
in a Bayesian Network, the
x3
x2
expression for the joint
probability can be computed as
a product of local conditional
probabilities, for example,
x4
P(x1, x2, x3, x4, x5)=
P(x1 ) P(x2| x1 ) P(x3| x1 ) P(x4| x2, x3 ) P(x5| x3 ).
P(x1 ) : prior probability of the root node
x5
Bayesian Networks
x1
In a Bayesian network each
variable x is conditionally
independent of all its
non-descendants, given its
parents.
x3
x2
x4
For example:
P(x4, x5| x2 , x3)= P(x4| x2 , x3) P( x5| x3)
x5
An Example Bayes Net
•Typically,
networks written in
causal direction
wind up being
most compact
•need least
number of
probabilities
to be specified
Two Models
dj
k1
k2
…
q
ki
…
kt
k1
and
q
ki
k2
kt
q2
or
or
q1
I
“Inference Network model”
d1
dj
dn
“Belief network model”
ku
Comparison
 Inference
Network Model is the first and well known
• Used in Inquery system
 Belief
Network adopts a set-theoretic view
• a clearly defined sample space
• a separation between query and document portions
• is able to reproduce any ranking produced by the Inference
Network while the converse is not true (for example: the
ranking of the standard vector model)
Belief Network Model

As the Inference Network Model
 Epistemological
view of the IR problem
 Random variables associated with documents,
index terms and queries

Contrary to the Inference Network Model
 Clearly
defined sample space
 Set-theoretic view
 Different network topology
Belief Network Model
The Probability Space
Define:
K={k1, k2, ...,kt} the sample space (a concept space)
u  K a subset of K (a concept)
ki an index term (an elementary concept)
k=(k1, k2, ...,kt) a vector associated to each u such that
gi(k)=1  ki  u
ki a binary random variable associated with the index
term ki , (ki = 1  gi(k)=1  ki  u)
Belief Network Model
A Set-Theoretic View
Define:
a document dj and query q as concepts in K
a generic concept c in K
a probability distribution P over K, as
P(c)=uP(c|u) P(u)
P(u)=(1/2)t
P(c) is the degree of coverage of the space K
by c
Belief Network Model
Network topology
q
query side
k1
ki
k2
kt
ku
document side
d1
dj
dn
Belief Network Model
Assumption
P(dj|q) is adopted as the rank of the document dj with
respect to the query q. It reflects the degree of
coverage provided to the concept dj by the concept
q.
Belief Network Model
q
The rank of dj
k1
P(dj|q) = P(dj  q) / P(q)
~ P(dj  q)
d
~ u P(dj  q | u) P(u)
~ u P(dj | u) P(q | u) P(u)
~ k P(dj | k) P(q | k) P(k)
1
ki
k2
dj
kt
dn
ku
Belief Network Model
For the vector model
Define
Define a vector ki given by
ki = k | ((gi(k)=1)  (ji gj(k)=0))
 in the state ki only the node ki is active and all
the others are inactive
Belief Network Model
For the vector model
Define
(wi,q / |q|)
if k = ki  gi(q)=1
P(q | k) =
0
if k  ki v gi(q)=0
P(¬q | k) = 1 - P(q | k)
 (wi,q / |q|) is a normalized version of weight of
the index term ki in the query q
Belief Network Model
For the vector model
Define
(wi,j / |dj|)
if k = ki  gi(dj)=1
P(dj | k) =
0
if k  ki v gi(dj)=0
P(¬ dj | k) = 1 - P(dj | k)
 (wi,j / |dj|) is a normalized version of the weight
of the index term ki in the document d,j
Bayesian Network Models
Computational costs
 Inference
Network Model one document node at a
time then is linear on number of documents
 Belief Network only the states that activate each
query term are considered
 The networks do not impose additional costs
because the networks do not include cycles.
Bayesian Network Models
Impact
The major strength is net combination of distinct
evidential sources to support the rank of a given
document.