Towards Semantics for IR
Eugene Agichtein
Emory University
Acknowledgements
A bunch of slides in this talk are adapted from lots of people, including Chris
Manning, ChengXiang Zhai, James Allan, Ray Mooney, and Jimmy Lin.
Fall 2006, CS 584: Information Retrieval. Math & Computer Science Department, Emory University
Who is this guy?
Sept 2006-: Assistant Professor in the Math & CS department at Emory.
2004 to 2006: Postdoc in the Text Mining, Search, and Navigation group
at Microsoft Research, Redmond.
2004: Ph.D. in Computer Science from Columbia University: dissertation
on extracting structured relations from large unstructured text databases
1998: B.S. in Engineering from The Cooper Union.
Research interests: accessing, discovering, and managing information
in unstructured (text) data, with current emphasis on developing robust
and scalable text mining techniques for the biology and health domains.
Outline

- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval
  - Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval
  - LSI
  - Language Models for IR
  - PLSI
- Towards real semantic search
  - Entities, Relations, Facts, Events in Text (my research area)
Information Retrieval From Text
[Diagram: a query string and a document corpus go into the IR system, which returns a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Was that the whole story in IR?
[Diagram of the search process: source selection (resource) -> query formulation (query) -> search (ranked list) -> selection (documents) -> examination (documents) -> delivery, with feedback loops for query reformulation, vocabulary learning, relevance feedback, and source reselection.]
Supporting the Search Process
[Same search-process diagram, now showing the system side that supports it: acquisition of a collection, indexing, and the resulting index used by the search step.]
Example: Query

- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
  - Slow (for large corpora)
  - NOT Calpurnia requires inverted matching (egrep)
  - Other operations (e.g., find the word Romans near countrymen, or the top-K scenes "most about" a topic) are not feasible
Term-document incidence
1 if the play contains the word, 0 otherwise:

Term      | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony    |          1           |       1       |      0      |   0    |    0    |    1
Brutus    |          1           |       1       |      0      |   1    |    0    |    0
Caesar    |          1           |       1       |      0      |   1    |    1    |    1
Calpurnia |          0           |       1       |      0      |   0    |    0    |    0
Cleopatra |          1           |       0       |      0      |   0    |    0    |    0
mercy     |          1           |       0       |      1      |   1    |    1    |    1
worser    |          1           |       0       |      1      |   1    |    1    |    0

Query: Brutus AND Caesar but NOT Calpurnia
Incidence vectors

- So we have a 0/1 vector for each term.
- Boolean model:
  - To answer the query, take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them:
    110100 AND 110111 AND 101111 = 100100
- Vector-space model:
  - Compute query-document similarity as the dot product / cosine between query and document vectors
  - Rank by similarity
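A minimal sketch in Python (toy data copied from the incidence matrix above; the helper name boolean_and_not is illustrative, not from the lecture) of how the Boolean model answers this query:

    # Boolean retrieval over 0/1 term incidence vectors.
    plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
             "Hamlet", "Othello", "Macbeth"]
    incidence = {
        "Brutus":    [1, 1, 0, 1, 0, 0],
        "Caesar":    [1, 1, 0, 1, 1, 1],
        "Calpurnia": [0, 1, 0, 0, 0, 0],
    }

    def boolean_and_not(pos_terms, neg_terms):
        """AND the vectors of pos_terms with the complements of neg_terms."""
        result = [1] * len(plays)
        for term in pos_terms:
            result = [r & v for r, v in zip(result, incidence[term])]
        for term in neg_terms:
            result = [r & (1 - v) for r, v in zip(result, incidence[term])]
        return [play for play, hit in zip(plays, result) if hit]

    print(boolean_and_not(["Brutus", "Caesar"], ["Calpurnia"]))
    # ['Antony and Cleopatra', 'Hamlet']  -> the 100100 vector above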
Answers to query
Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar: I was killed i' the
  Capitol; Brutus killed me.
Modern Search Engines in 1 Minute

- Crawl time:
  - Build an "inverted list": terms -> doc IDs (with positions)
  - Store "content chunks" (copies of documents) in the index
- Query time:
  - Look up the query terms in the inverted list to get a "filter set"
  - Get the content chunks for those doc IDs
  - Rank documents using hundreds of features (e.g., term weights, web topology, proximity, position)
  - Retrieve the top K documents for the query (K < 100 << |filter set|)
[Figure: inverted-list entries for terms such as "angina" and "treatment", each pointing to (doc ID, position) postings such as (11222, 11) and (40942, 24), plus a content-chunk store, e.g., doc 11222: "... A myocardial infarction is ..."]
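A minimal sketch (assumed toy documents, not the lecture's engine) of building a positional inverted list and using it to get a query-time filter set:

    from collections import defaultdict

    docs = {
        11222: "a myocardial infarction is treated with angina treatment",
        40942: "angina treatment and prevention",
    }

    index = defaultdict(list)          # term -> list of (doc_id, position)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term].append((doc_id, pos))

    def filter_set(query):
        """Doc IDs containing every query term (candidates to be ranked)."""
        postings = [set(d for d, _ in index[t]) for t in query.split()]
        return set.intersection(*postings) if postings else set()

    print(filter_set("angina treatment"))   # {11222, 40942}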
Outline

- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval
  - Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval
  - LSI
  - Language Models for IR
  - PLSI
- Towards real semantic search
  - Entities, Relations, Facts, Events
The Central Problem in IR
[Diagram: the information seeker's concepts become query terms; the authors' concepts become document terms. Do these represent the same concepts?]
Noisy-Channel Model of IR
[Diagram: the user has an information need, "thinks" of a relevant document in the collection (d1, d2, ..., dn), and writes down some query.]
- Task of information retrieval: given the query, figure out which document it came from.
How is this a noisy-channel?
[Diagram: the classic noisy channel (source message -> transmitter -> channel with noise -> receiver -> destination message), mapped onto IR: the information need is the source, the query formulation process is the encoder, and the query terms are what comes out of the channel.]
- No one seriously claims that this is actually what's going on...
- But this view is mathematically convenient!
Problems with term-based retrieval

- Synonymy
  - "Power law" vs. "Zipf distribution"
- Polysemy
  - "Saturn"
- Ambiguity
  - "What do frogs eat?"
Polysemy and Context

- Document similarity at the single-word level runs into polysemy and context.
[Diagram: the word "saturn" near words like ring, jupiter, planet, space, voyager (meaning 1: the planet) versus near words like car, company, dodge, ford (meaning 2: the carmaker); it contributes to similarity if used in the 1st meaning, but not if in the 2nd.]
Ambiguity

- Different documents with the same keywords may have different meanings...

What do frogs eat?  (keywords: frogs, eat)
  (1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
  (2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
  (3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.

What is the largest volcano in the Solar System?  (keywords: largest, volcano, solar, system)
  (1) Mars boasts many extreme geographic features; for example, Olympus Mons is the largest volcano in the solar system.
  (2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
  (3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.
Indexing Word Synsets/Senses

- How does indexing word senses solve the synonymy/polysemy problem?
  - {dog, canine, doggy, puppy, etc.} -> concept 112986
  - "I deposited my check in the bank."   -> bank = concept 76529
  - "I saw the sailboat from the bank."   -> bank = concept 53107
- Okay, so where do we get the word senses?
  - WordNet: a lexical database for standard English (http://wordnet.princeton.edu/)
  - Automatically find "clusters" of words that describe the same concepts
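A short sketch using NLTK's WordNet interface (this assumes the nltk package and its WordNet data are installed; it is illustrative, not part of the lecture). Each synset plays the role of a concept ID like 76529 above:

    from nltk.corpus import wordnet as wn

    for s in wn.synsets("bank", pos=wn.NOUN)[:4]:
        print(s.name(), "-", s.definition())
    # Among the listed senses are the financial institution and the sloping
    # land beside a river; picking the right one for a given occurrence is
    # the word sense disambiguation problem discussed next.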
Example: Contextual Word Similarity
- Use mutual information: Dagan et al., Computer Speech & Language, 1995.
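A sketch of the underlying quantity (pointwise mutual information between a word and a context word, estimated from co-occurrence counts); this is a simplified, assumed formulation, not the exact estimator from Dagan et al.:

    import math
    from collections import Counter

    def pmi(word, context, pair_counts, word_counts, total_pairs):
        """PMI(w, c) = log[ P(w, c) / (P(w) P(c)) ]."""
        p_wc = pair_counts[(word, context)] / total_pairs
        p_w = word_counts[word] / total_pairs
        p_c = word_counts[context] / total_pairs
        return math.log(p_wc / (p_w * p_c)) if p_wc > 0 else float("-inf")

    pairs = [("saturn", "planet"), ("saturn", "rings"), ("saturn", "car"),
             ("jupiter", "planet"), ("ford", "car")]
    pair_counts = Counter(pairs)
    word_counts = Counter(w for p in pairs for w in p)
    print(pmi("saturn", "planet", pair_counts, word_counts, len(pairs)))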
Word Sense Disambiguation
- Given a word in context, automatically determine its sense (concept)
  - This is the Word Sense Disambiguation (WSD) problem
- Context is the key:
  - For each ambiguous word, note the surrounding words
    - bank {river, sailboat, water, etc.} -> side of a river
    - bank {check, money, account, etc.} -> financial institution
- "Learn" a classifier from a collection of examples
- Use the classifier to determine the senses of words in the documents
Example: Unsupervised WSD

- Hypothesis: same senses of words will have similar neighboring words
- Disambiguation algorithm:
  - Identify context vectors corresponding to all occurrences of a particular word
  - Partition them into regions of high density
  - Assign a sense to each such region
- Example contexts for "chair":
  "Sit on a chair"
  "Take a seat on this chair"
  "The chair of the Math Department"
  "The chair of the meeting"
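A minimal sketch of this unsupervised idea (assumed tools: scikit-learn's CountVectorizer and KMeans; not the lecture's implementation): cluster bag-of-words context vectors of "chair" and treat each cluster as an induced sense.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import CountVectorizer

    contexts = [
        "sit on a chair",
        "take a seat on this chair",
        "the chair of the math department",
        "the chair of the meeting",
    ]
    X = CountVectorizer().fit_transform(contexts)    # one context vector per occurrence
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)    # occurrences grouped into two induced "senses"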
Does it help retrieval?

- Not really...
  - Ellen M. Voorhees (1993). Using WordNet to Disambiguate Word Senses for Text Retrieval. Proceedings of SIGIR 1993.
  - Mark Sanderson (1994). Word-Sense Disambiguation and Information Retrieval. Proceedings of SIGIR 1994.
  - And others...
- Examples of limited success...
  - Hinrich Schütze and Jan O. Pedersen (1995). Information Retrieval Based on Word Senses. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval.
  - Rada Mihalcea and Dan Moldovan (2000). Semantic Indexing Using WordNet Senses. Proceedings of the ACL 2000 Workshop on Recent Advances in NLP and IR.
Why Disambiguation Can Hurt

- Bag-of-words techniques already disambiguate
  - Context for each term is established in the query
  - Heuristics (e.g., always take the most frequent sense) work better
- WSD is hard!
  - Many words are highly polysemous, e.g., interest
  - Granularity of senses is often domain/application specific
  - Queries are short: not enough context for accurate WSD
- WSD tries to improve precision
  - But incorrect sense assignments would hurt recall
  - Slight gains in precision do not offset large drops in recall
Outline

- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval
  - Synonymy, Polysemy, Ambiguity
- A partial solution: word synsets, WSD
- Towards concept retrieval
  - LSI
  - Language Models for IR
  - PLSI
- Towards real semantic search
  - Entities, Relations, Facts, Events
Latent Semantic Analysis


- Perform a low-rank approximation of the document-term matrix (typical rank 100-300)
- General idea:
  - Map documents (and terms) to a low-dimensional representation.
  - Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
  - Compute document similarity based on the inner product in this latent semantic space.
- Goals:
  - Similar terms map to similar locations in the low-dimensional space
  - Noise reduction by dimension reduction
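A minimal LSA sketch (assumed toy term-document counts, numpy only; illustrative, not from the lecture): take a truncated SVD and compare documents in the latent space.

    import numpy as np

    # rows = terms, columns = documents (toy counts)
    A = np.array([[2, 0, 1, 0],
                  [1, 0, 2, 0],
                  [0, 3, 0, 1],
                  [0, 1, 0, 2]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                        # keep the top-k latent dimensions
    docs_latent = (np.diag(s[:k]) @ Vt[:k]).T    # one row per document

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cos(docs_latent[0], docs_latent[2]))   # documents sharing latent structure score high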
Latent Semantic Analysis

[Figure: the latent semantic space, an illustrating example; courtesy of Susan Dumais.]
Simplistic picture
[Figure: a simplistic picture of the latent space with three topics: Topic 1, Topic 2, Topic 3.]
Some (old) empirical evidence

- Precision at or above the median TREC precision
  - Top scorer on almost 20% of TREC 1, 2, 3 topics (c.f. 1990)
- Slightly better on average than the original vector-space model
- Effect of dimensionality:

  Dimensions | Precision
  250        | 0.367
  300        | 0.371
  346        | 0.374
Problems with term-based retrieval

- Synonymy
  - "Power law" vs. "Zipf distribution"
- Polysemy
  - "Saturn"
- Ambiguity
  - "What do frogs eat?"
Outline

- Text Information Retrieval: 5-minute overview
- Problems with lexical retrieval
  - Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval
  - LSI
  - Language Models for IR
  - PLSI
- Towards real semantic search
  - Entities, Relations, Facts, Events
IR based on Language Model (LM)
[Diagram: each document d1 ... dn in the collection has its own language model Md1 ... Mdn; the query is viewed as being generated from a document's model, scored by P(Q | Md).]
- A common search heuristic is to use words that you expect to find in matching documents as your query; why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!
- The LM approach directly exploits that idea!
Formal Language (Model)

- Traditional generative model: generates strings
  - Finite state machines or regular grammars, etc.
- Example (an automaton over the words "I" and "wish"):
  I wish
  I wish I wish
  I wish I wish I wish
  I wish I wish I wish I wish
  ...
  *wish I wish   (cannot be generated)
Stochastic Language Models

- Models the probability of generating strings in the language (commonly all strings over the alphabet ∑)

  Model M:  the 0.2 | a 0.1 | man 0.01 | woman 0.01 | said 0.03 | likes 0.02 | ...

  s = "the man likes the woman"
  P(s | M) = 0.2 x 0.01 x 0.02 x 0.2 x 0.01 = 0.00000008
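A tiny Python sketch of that unigram computation (probabilities copied from the toy model above; illustrative only):

    model_m = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
               "said": 0.03, "likes": 0.02}

    def p_string(s, model):
        """P(s | M): multiply per-word probabilities under a unigram model."""
        p = 1.0
        for w in s.split():
            p *= model.get(w, 0.0)
        return p

    print(p_string("the man likes the woman", model_m))   # 8e-08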
Stochastic Language Models

- Model the probability of generating any string

  Word     | Model M1 | Model M2
  the      |  0.2     |  0.2
  class    |  0.01    |  0.0001
  sayst    |  0.0001  |  0.03
  pleaseth |  0.0001  |  0.02
  yon      |  0.0001  |  0.1
  maiden   |  0.0005  |  0.01
  woman    |  0.01    |  0.0001

  s = "the class pleaseth yon maiden"
  P(s | M1) = 0.2 x 0.01 x 0.0001 x 0.0001 x 0.0005
  P(s | M2) = 0.2 x 0.0001 x 0.02 x 0.1 x 0.01
  P(s | M2) > P(s | M1)
Stochastic Language Models

- A statistical model for generating text
- Probability distribution over strings in a given language

  P(w1 w2 w3 w4 | M) = P(w1 | M) x P(w2 | M, w1) x P(w3 | M, w1 w2) x P(w4 | M, w1 w2 w3)
Unigram and higher-order models
  Chain rule:
  P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)

- Unigram Language Models
  P(w1 w2 w3 w4) = P(w1) P(w2) P(w3) P(w4)
  Easy. Effective!
- Bigram (generally, n-gram) Language Models
  P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
- Other Language Models
  - Grammar-based models (PCFGs), etc.
  - Probably not the first thing to try in IR
Using Language Models in IR
- Treat each document as the basis for a model (e.g., unigram sufficient statistics)
- Rank document d based on P(d | q)
- P(d | q) = P(q | d) x P(d) / P(q)
  - P(q) is the same for all documents, so ignore it
  - P(d) [the prior] is often treated as the same for all d
    - But we could use criteria like authority, length, genre
  - P(q | d) is the probability of q given d's model
- Very general formal approach
The fundamental problem of LMs

- Usually we don't know the model M
- But we have a sample of text representative of that model
- Estimate a language model M from the sample
- Then compute the observation probability P(observation | M(sample))
Language Models for IR

- Language modeling approaches:
  - Attempt to model the query generation process
  - Documents are ranked by the probability that the query would be observed as a random sample from the respective document model
    - Multinomial approach
Retrieval based on probabilistic LM
- Treat the generation of queries as a random process.
- Approach:
  - Infer a language model for each document.
  - Estimate the probability of generating the query according to each of these models.
  - Rank the documents according to these probabilities.
  - Usually a unigram estimate of words is used
    - Some work on bigrams, paralleling van Rijsbergen
Retrieval based on probabilistic LM
- Intuition
  - Users...
    - Have a reasonable idea of terms that are likely to occur in documents of interest.
    - Will choose query terms that distinguish these documents from others in the collection.
- Collection statistics...
  - Are integral parts of the language model.
  - Are not used heuristically, as in many other approaches.
  - In theory. In practice, there's usually some wiggle room for empirically set parameters.
Query generation probability

- Ranking formula:

  p(Q, d) ≈ p(d) p(Q | d) ≈ p(d) p(Q | Md)

- The probability of producing the query given the language model of document d, using MLE, is:

  p̂(Q | Md) = ∏_{t in Q} p̂_ml(t | Md) = ∏_{t in Q} tf(t,d) / dl_d

  where
    Md      : language model of document d
    tf(t,d) : raw term frequency of term t in document d
    dl_d    : total number of tokens in document d

- Unigram assumption: given a particular language model, the query terms occur independently.
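A sketch of that MLE score in Python (toy document; unsmoothed, so any query term missing from the document zeroes the whole score, which motivates the smoothing discussed next):

    def p_query_mle(query, doc):
        tokens = doc.split()
        dl = len(tokens)
        p = 1.0
        for t in query.split():
            p *= tokens.count(t) / dl     # p_ml(t | Md) = tf(t, d) / dl_d
        return p

    print(p_query_mle("revenue down",
                      "xerox reports a profit but revenue is down"))   # 1/64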
Smoothing (continued)

- There's a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, 1/2, or ε to counts, Dirichlet priors, discounting, and interpolation [Chen and Goodman, 98].
- Another simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution.
Smoothing: Mixture model
- P(w | d) = λ Pmle(w | Md) + (1 - λ) Pmle(w | Mc)
- Mixes the probability from the document with the general collection frequency of the word.
- Correctly setting λ is very important:
  - A high value of λ makes the search "conjunctive-like": suitable for short queries
  - A low value is more suitable for long queries
- Can tune λ to optimize performance
  - Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
Basic mixture model summary

- General formulation of the LM for IR:

  p(Q, d) ≈ p(d) ∏_{t in Q} ((1 - λ) p(t) + λ p(t | Md))

  where p(t) is the general (collection) language model and p(t | Md) is the individual-document model.

- The user has a document in mind, and generates the query from this document.
- The equation represents the probability that the document that the user had in mind was in fact this one.
Example
  p(Q, d) ≈ p(d) p(Q | d)

- Document collection (2 documents):
  - d1: Xerox reports a profit but revenue is down
  - d2: Lucent narrows quarter loss but revenue decreases further
- Model: MLE unigrams from documents; λ = 1/2
- Query: revenue down
  - P(Q | d1) = [(1/8 + 2/16)/2] x [(1/8 + 1/16)/2] = 1/8 x 3/32 = 3/256
  - P(Q | d2) = [(1/8 + 2/16)/2] x [(0 + 1/16)/2] = 1/8 x 1/32 = 1/256
- A component of the model is missing... what is it, and why?
- Ranking: d1 > d2
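A short Python sketch reproducing the worked example (λ = 1/2, MLE unigrams; like the slide, it leaves out the missing component, the document prior p(d)):

    docs = {
        "d1": "xerox reports a profit but revenue is down".split(),
        "d2": "lucent narrows quarter loss but revenue decreases further".split(),
    }
    collection = [w for d in docs.values() for w in d]
    lam = 0.5

    def p_query(query, doc):
        p = 1.0
        for t in query.split():
            p_doc = doc.count(t) / len(doc)
            p_col = collection.count(t) / len(collection)
            p *= lam * p_doc + (1 - lam) * p_col
        return p

    for name, d in docs.items():
        print(name, p_query("revenue down", d))   # d1: 3/256, d2: 1/256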
Language Models for IR Tasks
- Cross-lingual IR
- Distributed IR
- Structured document retrieval
- Personalization
- Modeling redundancy
- Predicting query difficulty
- Predicting information extraction accuracy
- PLSI
Standard Probabilistic IR
[Diagram: the standard probabilistic IR view: the information need becomes a query, which is matched against each document d1 ... dn in the collection by estimating P(R | Q, d).]
IR based on Language Model (LM)
[Diagram: each document d1 ... dn in the collection has its own language model Md1 ... Mdn; the query is viewed as being generated from a document's model, scored by P(Q | Md).]
- A common search heuristic is to use words that you expect to find in matching documents as your query; why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!
- The LM approach directly exploits that idea!
Collection-Topic-Document Model
[Diagram: in addition to the per-document models Md1 ... Mdn, there is a whole-collection model MC and topic models MT1 ... MTm; the query is generated with probability P(Q | MC, MT) or P(Q | MC, MT, Md).]
Collection-Topic-Document model
- 3-level model:
  1. Whole-collection model (MC)
  2. Specific-topic model; relevant-documents model (MT)
  3. Individual-document model (Md)
- Relevance hypothesis:
  - A request (query; topic) is generated from a specific-topic model {MC, MT}.
  - Iff a document is relevant to the topic, the same model will apply to the document.
    - It will replace part of the individual-document model in explaining the document.
  - The probability of relevance of a document:
    - The probability that this model explains part of the document
    - The probability that the {MC, MT, Md} combination is better than the {MC, Md} combination
Outline

- Text Information Retrieval: 5-minute overview
- Problems with lexical retrieval
  - Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval
  - LSI
  - Language Models for IR
  - PLSI
- Towards real semantic search
  - Entities, Relations, Facts, Events
Probabilistic LSI
- Uses the LSI idea, but is grounded in probability theory
- Comes from the statistical Aspect [Language] Model
  - Generates a co-occurrence model based on a non-observed (latent) class
- This is a mixture model
  - Models a distribution through a mixture (weighted sum) of other distributions
- Independence assumptions:
  - Observed pairs (doc, word) are generated randomly
  - Conditional independence: conditioned on the latent class, words are generated independently of the document
Aspect Model

- Generation process:
  - Choose a doc d with probability P(d)
    - There are N d's
  - Choose a latent class z with probability P(z|d)
    - There are K z's, and K << N
    - K is chosen in advance (how many topics are in the collection?)
  - Generate a word w with probability P(w|z)
  - This creates the pair (d, w), without direct concern for z
- Joining the probabilities gives you:

  P(d, w) = P(d) Σ_z P(z|d) P(w|z)

  (Remember: P(z|d) means "probability of z, given d".)
Aspect Model (2)

- Applying Bayes' theorem gives the symmetric form:

  P(d, w) = Σ_z P(z) P(d|z) P(w|z)

- This is conceptually different from LSI
  - The word distribution P(w|d) is based on a combination of specific classes/factors/aspects, P(w|z)
Detour: EM Algorithm

- Tune the parameters of distributions with missing/hidden data
  - Topics = hidden classes
- Extremely useful, general technique
Expectation Maximization

- Sketch of an EM algorithm for PLSI:
  - E-step: calculate posterior probabilities of z based on current estimates
  - M-step: update the parameter estimates based on the calculated probabilities
  - Problem: overfitting
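A compact sketch of that EM loop (assumed toy counts, numpy only; illustrative, not the lecture's implementation, and without the tempered-EM fix for overfitting):

    import numpy as np

    rng = np.random.default_rng(0)
    n_dw = np.array([[2, 1, 0, 0],       # n(d, w): term counts per document
                     [0, 0, 3, 1],
                     [1, 0, 1, 2]], dtype=float)
    D, W = n_dw.shape
    K = 2                                 # number of latent classes z

    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w|z)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # P(z|d)

    for _ in range(50):
        # E-step: P(z | d, w) proportional to P(z|d) P(w|z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (D, K, W)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        expected = n_dw[:, None, :] * post                   # n(d,w) P(z|d,w)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    print(np.round(p_w_z, 2))   # each row: a word distribution for one latent class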
Similarities: LSI and PLSI
- Both use intermediate, latent, non-observed variables for classification (hence the "L")
- Can compose a joint probability similar to the LSI SVD:
  - U -> U_hat = P(di | zk)
  - V -> V_hat = P(wj | zk)
  - S -> S_hat = diag(P(zk))
  - JP = U_hat * S_hat * V_hat
    - JP is similar to the SVD of the term-document matrix N
    - Values are calculated probabilistically
Differences: LSI and PLSI

- Basis:
  - LSI: term frequencies (usually); performs dimension reduction via projection or by zeroing out weaker components
  - PLSI: statistical; generates a model of the probabilistic relation between W, D, and Z, and refines it until an effective model is produced
Experiment: 128-factor decomposition
Experiments
pLSI Improves on LSI
- Consistently better accuracy curves than LSI
- TEM vs. SVD, computationally
- Better from a modeling sense:
  - Uses the likelihood of sampling and aims to maximize it
  - SVD uses the L2 norm (or another norm): an implicit Gaussian noise assumption
- More intuitive:
  - Polysemy is recognizable (by viewing P(w|z))
  - Similar handling of synonymy
LSA, pLSA in Practice?

- Only rumors (about Teoma using it)
- Both LSA and pLSA are VERY expensive
- LSA
  - Running times of ~ one day on ~10K docs
- pLSA
  - M. Federico (ICASSP 2002, [hermes.itc.it]) used a corpus of 1.2 million newspaper articles with a vocabulary of 200K words: approximated pLSA using Non-negative Matrix Factorization (NMF)
  - 612 hours of CPU time (7 processors, 2.5 hours/iteration, 35 iterations)
- Do we need (P)LSI for web search?
Did we solve our problem?
Ambiguity

- Different documents with the same keywords may have different meanings...

What do frogs eat?  (keywords: frogs, eat)
  (1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
  (2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
  (3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.

What is the largest volcano in the Solar System?  (keywords: largest, volcano, solar, system)
  (1) Mars boasts many extreme geographic features; for example, Olympus Mons is the largest volcano in the solar system.
  (2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
  (3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.
What we need

- Detect and exploit semantic relations between entities
  - A whole other lecture!