Probabilistic Models in Information Retrieval
SI650: Information Retrieval
Winter 2010
School of Information
University of Michigan
Most slides are from Prof. ChengXiang Zhai’s lectures
2010 © University of Michigan
Essential Probability & Statistics
Prob/Statistics & Text Management
• Probability & statistics provide a principled way to
quantify the uncertainties associated with natural
language
• Allow us to answer questions like:
– Given that we observe “baseball” three times and “game” once in a
news article, how likely is it about “sports”?
(text categorization, information retrieval)
– Given that a user is interested in sports news, how likely would the
user use “baseball” in a query?
(information retrieval)
Basic Concepts in Probability
• Random experiment: an experiment with uncertain outcome
(e.g., flipping a coin, picking a word from text)
• Sample space: all possible outcomes, e.g.,
– Tossing 2 fair coins, S ={HH, HT, TH, TT}
• Event: a subset of the sample space
  – E ⊆ S; E happens iff the outcome is in E, e.g.,
  – E={HH} (all heads)
  – E={HH,TT} (same face)
  – Impossible event (∅), certain event (S)
• Probability of an event E: 0 ≤ P(E) ≤ 1, s.t.
  – P(S)=1 (outcome always in S)
  – P(A∪B)=P(A)+P(B), if A∩B=∅ (e.g., A=same face, B=different face)
Example: Flip a Coin
• Sample space: {Head, Tail}
• Fair coin:
– p(H) = 0.5, p(T) = 0.5
• Unfair coin, e.g.:
– p(H) = 0.3, p(T) = 0.7
• Flipping two fair coins:
– Sample space: {HH, HT, TH, TT}
• Example in modeling text:
– Flip a coin to decide whether or not to
include a word in a document
– Sample space = {appear, absent}
Example: Toss a Die
• Sample space: S = {1,2,3,4,5,6}
• Fair die:
  – p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
• Unfair die: p(1) = 0.3, p(2) = 0.2, ...
• N-sided die:
  – S = {1, 2, 3, 4, …, N}
• Example in modeling text:
– Toss a die to decide which word to
write in the next position
– Sample space = {cat, dog, tiger, …}
Basic Concepts of Prob. (cont.)
• Joint probability: P(A∩B), also written as P(A, B)
• Conditional probability: P(B|A)=P(A∩B)/P(A)
  – P(A∩B) = P(A)P(B|A) = P(B)P(A|B)
  – So, P(A|B) = P(B|A)P(A)/P(B) (Bayes’ Rule)
  – For independent events, P(A∩B) = P(A)P(B), so P(A|B)=P(A)
• Total probability: If A1, …, An form a partition of S, then
  – P(B) = P(B∩S) = P(B, A1) + … + P(B, An) (why?)
  – So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) = P(B|Ai)P(Ai)/[P(B|A1)P(A1)+…+P(B|An)P(An)]
  – This allows us to compute P(Ai|B) based on P(B|Ai)
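As a quick numerical sketch of total probability plus Bayes' rule: the category names, priors P(Ai), and likelihoods P(B|Ai) below are invented for illustration only.

```python
# Toy Bayes' rule computation: P(Ai | B) from priors P(Ai) and likelihoods P(B | Ai).
# All numbers below are illustrative, not from the slides.

priors = {"sports": 0.3, "politics": 0.5, "science": 0.2}           # P(Ai), a partition of S
likelihoods = {"sports": 0.08, "politics": 0.01, "science": 0.02}   # P(B | Ai), B = "article mentions 'baseball'"

# Total probability: P(B) = sum_i P(B | Ai) * P(Ai)
p_b = sum(likelihoods[a] * priors[a] for a in priors)

# Bayes' rule: P(Ai | B) = P(B | Ai) * P(Ai) / P(B)
posteriors = {a: likelihoods[a] * priors[a] / p_b for a in priors}
print(posteriors)   # the posteriors sum to 1; "sports" dominates despite its smaller prior
```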
Revisit: Bayes’ Rule
Hypothesis space: H = {H1, …, Hn};  Evidence: E

    P(Hi|E) = P(E|Hi) P(Hi) / P(E)

  – P(Hi|E): posterior probability of Hi
  – P(Hi): prior probability of Hi
  – P(E|Hi): likelihood of the data/evidence if Hi is true

In text classification: H is the class space; E is the data (features)

If we want to pick the most likely hypothesis H*, we can drop P(E):

    P(Hi|E) ∝ P(E|Hi) P(Hi)
Random Variable
• X: S → ℝ (a “measure” of the outcome)
  – E.g., number of heads, all same face?, …
• Events can be defined according to X
  – E(X=a) = {si | X(si) = a}
  – E(X≥a) = {si | X(si) ≥ a}
• So, probabilities can be defined on X
  – P(X=a) = P(E(X=a))
  – P(X≥a) = P(E(X≥a))
• Discrete vs. continuous random variable (think of
“partitioning the sample space”)
Getting to Statistics ...
• We are flipping an unfair coin, but P(Head)=?
(parameter estimation)
– If we see the results of a huge number of random experiments,
then
    P̂(Head) = count(Heads) / count(Flips) = count(Heads) / [count(Heads) + count(Tails)]
– But what if we only see a small sample (e.g., two flips)? Is this estimate still reliable? If we flip twice and get two tails, does that mean P(Head) = 0?
• In general, statistics has to do with drawing conclusions
on the whole population based on observations of a
sample (data)
Parameter Estimation
• General setting:
– Given a (hypothesized & probabilistic) model that
governs the random experiment
– The model gives the probability of any data, p(D|θ), which depends on the parameter θ
– Now, given actual sample data X={x1,…,xn}, what can we say about the value of θ?
• Intuitively, take your best guess of θ -- “best” means “best explaining/fitting the data”
• Generally an optimization problem
Maximum Likelihood vs. Bayesian
• Maximum likelihood estimation
– “Best” means “data likelihood reaches maximum”
ˆ  arg max P ( X |  )

– Problem: small sample
• Bayesian estimation
– “Best” means being consistent with our “prior”
knowledge and explaining data well
– Problem: how to define prior?
ˆ  arg max P ( | X )  arg max P( X |  ) P( )


Illustration of Bayesian Estimation
[Figure: the prior p(θ), the likelihood p(X|θ) for data X=(x1,…,xN), and the posterior p(θ|X) ∝ p(X|θ)p(θ), plotted over θ. The posterior mode lies between the prior mode and the ML estimate θml.]
Maximum Likelihood Estimate
Data: a document d with counts c(w1), …, c(wN), and length |d|
Model: multinomial distribution M with parameters {p(wi)}
Likelihood: p(d|M)
Maximum likelihood estimator: M* = argmax_M p(d|M)
    p(wi) = c(wi) / |d|

Problem: a document is short; if c(wi) = 0, does that mean p(wi) = 0?
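A minimal sketch of this ML estimate on a toy document (the document text below is made up for illustration):

```python
from collections import Counter

# Maximum likelihood estimate of a unigram model: p(w) = c(w, d) / |d|.
doc = "text mining text data mining algorithms text".split()   # toy document d
counts = Counter(doc)
doc_len = len(doc)                                              # |d|

p_ml = {w: c / doc_len for w, c in counts.items()}
print(p_ml)                          # e.g. p("text") = 3/7
print(p_ml.get("clustering", 0.0))   # an unseen word gets probability 0 (the problem smoothing addresses)
```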
If you want to know how we get it…
Data: a document d with counts c(w1), …, c(wN), and length |d|
Model: multinomial distribution M with parameters {p(wi)}
Likelihood: p(d|M)
Maximum likelihood estimator: M* = argmax_M p(d|M)

    p(d|M) = ( |d| choose c(w1),…,c(wN) ) ∏_{i=1..N} θi^c(wi),   where θi = p(wi) and ∑_{i=1..N} θi = 1

    l(d|M) = log p(d|M) = ∑_{i=1..N} c(wi) log θi      (we tune the θi = p(wi) to maximize l(d|M))

Use the Lagrange multiplier approach:

    l′(d|M) = ∑_{i=1..N} c(wi) log θi + λ (∑_{i=1..N} θi − 1)

Set the partial derivatives to zero:

    ∂l′/∂θi = c(wi)/θi + λ = 0   ⇒   θi = −c(wi)/λ

Since ∑_{i=1..N} θi = 1, we get λ = −∑_{i=1..N} c(wi) = −|d|, so the ML estimate is

    θi = p(wi) = c(wi) / |d|
Statistical Language Models
What is a Statistical Language Model?
• A probability distribution over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic
mechanism for “generating” text, thus also called
a “generative” model
Why is a Language Model Useful?
• Provides a principled way to quantify the
uncertainties associated with natural language
• Allows us to answer questions like:
– Given that we see “John” and “feels”, how likely will we see
“happy” as opposed to “habit” as the next word?
(speech recognition)
– Given that we observe “baseball” three times and “game” once in
a news article, how likely is it about “sports”?
(text categorization, information retrieval)
– Given that a user is interested in sports news, how likely would
the user use “baseball” in a query?
(information retrieval)
Source-Channel Framework
(Communication System)
    Source --X--> Transmitter (encoder) --> Noisy Channel --Y--> Receiver (decoder) --X’--> Destination
    P(X)                                    P(Y|X)               P(X|Y)=?

    X̂ = argmax_X p(X|Y) = argmax_X p(Y|X) p(X)
When X is text, p(X) is a language model
Speech Recognition
    Speaker --Words (W)--> Noisy Channel --Acoustic signal (A)--> Recognizer --> Recognized words
    P(W)                   P(A|W)                                 P(W|A)=?

    Ŵ = argmax_W p(W|A) = argmax_W p(A|W) p(W)
                          (acoustic model P(A|W), language model P(W))
Given acoustic signal A, find the word sequence W
Machine Translation
    English Speaker --English Words (E)--> Noisy Channel --Chinese Words (C)--> Translator --> English Translation
    P(E)                                   P(C|E)                               P(E|C)=?

    Ê = argmax_E p(E|C) = argmax_E p(C|E) p(E)
                          (English→Chinese translation model P(C|E), English language model P(E))
Given Chinese sentence C, find its English translation E
Spelling/OCR Error Correction
    Original Text --Original Words (O)--> Noisy Channel --“Erroneous” Words (E)--> Corrector --> Corrected Text
    P(O)                                  P(E|O)                                   P(O|E)=?

    Ô = argmax_O p(O|E) = argmax_O p(E|O) p(O)
                          (spelling/OCR error model P(E|O), “normal” language model P(O))
Given corrupted text E, find the original text O
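A tiny sketch of this noisy-channel correction for a single token; the candidate words, the language-model probabilities p(O), and the error-model probabilities p(E|O) are all made-up values for illustration.

```python
# Toy noisy-channel corrector: O_hat = argmax_O p(E | O) * p(O).
observed = "teh"                                          # corrupted token E

p_language = {"the": 0.05, "tea": 0.001, "ten": 0.002}    # p(O): "normal" language model (illustrative)
p_error = {"the": 0.4, "tea": 0.05, "ten": 0.03}          # p(E="teh" | O): error model (illustrative)

corrected = max(p_language, key=lambda o: p_error[o] * p_language[o])
print(corrected)   # "the" wins: high prior and high channel probability
```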
Basic Issues
• Define the probabilistic model
– Event, Random Variables, Joint/Conditional Prob’s
– P(w1 w2 ... wn) = f(θ1, θ2, …, θm)
• Estimate model parameters
– Tune the model to best fit the data and our prior knowledge
– θi = ?
• Apply the model to a particular task
– Many applications
The Simplest Language Model
(Unigram Model)
• Generate a piece of text by generating each
word INDEPENDENTLY
• Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)
• Parameters: {p(wi)}, with p(w1)+…+p(wN)=1 (N is voc. size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample
drawn according to this word distribution
Modeling Text As Tossing a Die
• Sample space: S = {w1, w2, w3, … wN}
• The die is unfair, e.g.:
  – P(cat) = 0.3, P(dog) = 0.2, ...
• Toss the die to decide which word to write in the next position
• Concrete example: sample space = {cat, dog, tiger}
  – P(cat) = 0.3; P(dog) = 0.5; P(tiger) = 0.2
  – Text = “cat cat dog”
  – P(Text) = P(cat) * P(cat) * P(dog)
    = 0.3 * 0.3 * 0.5 = 0.045
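The same computation as a short sketch, using the three-word model from this slide; log probabilities are shown as the usual way to avoid underflow on longer texts.

```python
import math

# Unigram model from the slide: P(cat)=0.3, P(dog)=0.5, P(tiger)=0.2.
p = {"cat": 0.3, "dog": 0.5, "tiger": 0.2}

text = "cat cat dog".split()
prob = math.prod(p[w] for w in text)
print(prob)                                   # 0.3 * 0.3 * 0.5 = 0.045

log_prob = sum(math.log(p[w]) for w in text)  # log p(text): preferred for longer texts
print(log_prob)
```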
Text Generation with Unigram LM
A (unigram) language model θ assigns p(w|θ); sampling from it generates a document.

  Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001
    → sampling tends to produce a “text mining paper”
  Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
    → sampling tends to produce a “food nutrition paper”
Estimation of Unigram LM
Estimation goes in the other direction: given a document, estimate the language model θ, i.e., p(w|θ) = ?

  Document: a “text mining paper” (total #words = 100)
    text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1

  ML estimates: p(text) = 10/100, p(mining) = 5/100, p(association) = 3/100, p(database) = 3/100, …, p(query) = 1/100
Empirical distribution of words
• There are stable language-independent patterns
in how people use natural languages
• A few words occur very frequently; most occur
rarely. E.g., in news articles,
– Top 4 words: 10~15% word occurrences
– Top 50 words: 35~40% word occurrences
• The most frequent word in one corpus may be
rare in another
Revisit: Zipf’s Law
• rank × frequency ≈ constant

    F(w) = C / r(w)^α,    with α ≈ 1, C ≈ 0.1

[Figure: word frequency vs. word rank. The most frequent words (stop words) account for the biggest share of the data structure; the “most useful” words sit in the middle range (Luhn 57); at the tail, is “too rare” a problem?]

Generalized Zipf’s law (applicable in many domains):

    F(w) = C / [r(w) + B]^α
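A quick sketch of what "rank × frequency ≈ constant" looks like on a ranked frequency list; the counts below are invented and roughly Zipfian, whereas real corpus counts only follow the law approximately.

```python
# Check rank * frequency on a ranked list of word frequencies (invented, roughly Zipfian).
freqs = [1200, 600, 400, 300, 240, 200]   # already sorted by rank

for rank, f in enumerate(freqs, start=1):
    print(rank, f, rank * f)              # the product stays roughly constant if Zipf's law holds
```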
Problem with the ML Estimator
• What if a word doesn’t appear in the text?
• In general, what probability should we give a
word that has not been observed?
• If we want to assign non-zero probabilities to
such words, we’ll have to discount the
probabilities of observed words
• This is what “smoothing” is about …
Language Model Smoothing (Illustration)
[Figure: P(w) plotted over words w, comparing the maximum likelihood estimate with a smoothed LM; smoothing lowers the probabilities of seen words slightly and gives unseen words a small non-zero probability.]

    p_ML(w) = count of w / count of all words
How to Smooth?
• All smoothing methods try to
– discount the probability of words seen in a
document
– re-allocate the extra counts so that unseen words
will have a non-zero count
• Method 1 (Additive smoothing): add a constant to the count of each word; adding 1 gives “add one” (Laplace) smoothing:

    p(w|d) = [c(w,d) + 1] / (|d| + |V|)

  where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size
• Problems?
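A minimal sketch of add-one smoothing on a toy document and vocabulary (both invented for illustration):

```python
from collections import Counter

# Add-one (Laplace) smoothing: p(w | d) = (c(w, d) + 1) / (|d| + |V|).
doc = "text mining text data".split()                       # toy document, |d| = 4
vocab = {"text", "mining", "data", "clustering", "food"}    # toy vocabulary, |V| = 5

counts = Counter(doc)
doc_len = len(doc)

def p_add_one(w):
    return (counts[w] + 1) / (doc_len + len(vocab))

print(p_add_one("text"))                   # (2 + 1) / (4 + 5)
print(p_add_one("clustering"))             # unseen word now gets (0 + 1) / 9 instead of 0
print(sum(p_add_one(w) for w in vocab))    # probabilities over the vocabulary sum to 1
```

One drawback this makes visible: every unseen word receives exactly the same probability, no matter how implausible it is in the document.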
How to Smooth? (cont.)
• Should all unseen words get equal
probabilities?
• We can use a reference model to discriminate
unseen words
    p(w|d) = p_seen(w|d)       if w is seen in d    (discounted ML estimate)
           = α_d · p(w|REF)    otherwise            (reference language model)

    where  α_d = [1 − ∑_{w seen in d} p_seen(w|d)] / [∑_{w unseen in d} p(w|REF)]
Other Smoothing Methods
• Method 2 (Absolute discounting): subtract a constant δ from the count of each seen word

    p(w|d) = [max(c(w,d) − δ, 0) + δ·|d|_u·p(w|REF)] / |d|,     α_d = δ·|d|_u / |d|

  where |d|_u is the number of unique words in d
• Method 3 (Linear interpolation, Jelinek-Mercer): “shrink” the ML estimate uniformly toward p(w|REF)

    p(w|d) = (1 − λ)·c(w,d)/|d| + λ·p(w|REF),                   α_d = λ

  where λ ∈ [0,1] is a smoothing parameter
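A small sketch of both methods on a toy document; the reference model p(w|REF) and the parameter values λ and δ are invented for illustration (in practice they are tuned, e.g., by cross validation).

```python
from collections import Counter

# Jelinek-Mercer interpolation and absolute discounting (toy setup).
doc = "text mining text data mining algorithms".split()
counts = Counter(doc)
d_len = len(doc)                 # |d|
n_uniq = len(counts)             # |d|_u, number of unique words in d

# Illustrative reference model p(w | REF); sums to 1 over this tiny vocabulary.
p_ref = {"text": 0.2, "mining": 0.1, "data": 0.1,
         "algorithms": 0.05, "clustering": 0.05, "food": 0.5}

lam, delta = 0.5, 0.7            # smoothing parameters (toy values)

def p_jm(w):
    # (1 - lambda) * c(w,d)/|d| + lambda * p(w|REF)
    return (1 - lam) * counts[w] / d_len + lam * p_ref[w]

def p_abs(w):
    # (max(c(w,d) - delta, 0) + delta * |d|_u * p(w|REF)) / |d|
    return (max(counts[w] - delta, 0) + delta * n_uniq * p_ref[w]) / d_len

print(p_jm("clustering"), p_abs("clustering"))   # unseen word now has non-zero probability
print(p_jm("text"), p_abs("text"))               # seen word is discounted relative to 2/6
```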
Other Smoothing Methods (cont.)
• Method 4 (Dirichlet prior / Bayesian): assume μ pseudo counts distributed according to p(w|REF)

    p(w|d) = [c(w,d) + μ·p(w|REF)] / (|d| + μ)
           = |d|/(|d|+μ) · c(w,d)/|d| + μ/(|d|+μ) · p(w|REF),     α_d = μ/(|d|+μ)

  where μ > 0 is a smoothing parameter
• Method 5 (Good-Turing): assume the total count of unseen events is n1 (the number of singletons), and adjust the counts of seen events in the same way

    p(w|d) = c*(w,d) / |d|,   where c*(w,d) = (r+1)·n_{r+1} / n_r  and  r = c(w,d)

    e.g., 0* = n1/n0, 1* = 2·n2/n1, …
    What if n_r = 0? What about p(w|REF)?
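A small sketch of the Good-Turing adjusted counts on invented word counts; as the slide notes, the estimate breaks down whenever n_{r+1} = 0, which this sketch simply skips.

```python
from collections import Counter

# Good-Turing adjusted counts: c*(w) = (r + 1) * n_{r+1} / n_r, with r = c(w, d).
counts = Counter({"text": 3, "mining": 2, "data": 2,
                  "algorithms": 1, "query": 1, "efficient": 1})   # toy counts
d_len = sum(counts.values())                                      # |d| = 10

n = Counter(counts.values())        # n_r = number of words occurring exactly r times

def adjusted(r):
    return (r + 1) * n[r + 1] / n[r]

# Skip words whose adjusted count is undefined or zero because n_{r+1} = 0.
p_gt = {w: adjusted(r) / d_len for w, r in counts.items() if n[r + 1] > 0}
print(p_gt)    # a singleton gets c* = 2 * n_2 / n_1 = 2 * 2 / 3 ≈ 1.33
```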
So, which method is the best?
It depends on the data and the task!
Many other sophisticated smoothing methods have been
proposed…
Cross validation is generally used to choose the best
method and/or set the smoothing parameters…
For retrieval, Dirichlet prior performs well…
Smoothing will be discussed further in the course…
Language Models for IR
Language Models for Retrieval
(Ponte & Croft 98)
  Document 1 (“text mining paper”)    → doc LM 1: text ?, mining ?, association ?, clustering ?, …, food ?
  Document 2 (“food nutrition paper”) → doc LM 2: food ?, nutrition ?, healthy ?, diet ?, …

  Query = “data mining algorithms”

  Which document’s language model would most likely have generated this query?
Ranking Docs by Query Likelihood
  Estimate a language model for each document, then score each document by the likelihood of the query under its model:

    d1 → doc LM → p(q|d1)
    d2 → doc LM → p(q|d2)
    …
    dN → doc LM → p(q|dN)
Retrieval as
Language Model Estimation
• Document ranking based on query likelihood

    log p(q|d) = ∑_i log p(wi|d),   where q = w1 w2 … wn

• The retrieval problem is thus reduced to estimating the document language model p(wi|d)
• Smoothing is an important issue, and distinguishes different approaches
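A minimal end-to-end sketch of query-likelihood ranking with Jelinek-Mercer smoothing against a collection model; the documents, query, λ, and the add-one fallback for words absent from the collection are all toy choices, not part of the slides.

```python
import math
from collections import Counter

# Rank documents by log p(q | d) = sum_i log p(w_i | d), with JM smoothing:
#   p(w | d) = (1 - lam) * c(w,d)/|d| + lam * p(w | C)
docs = {
    "d1": "text mining paper about text clustering".split(),
    "d2": "food nutrition paper about healthy diet".split(),
}
query = "data mining algorithms".split()
lam = 0.5

coll_counts = Counter(w for d in docs.values() for w in d)   # collection model from the pooled docs
coll_len = sum(coll_counts.values())

def log_p_q_given_d(q, d_words):
    counts, d_len = Counter(d_words), len(d_words)
    score = 0.0
    for w in q:
        p_c = (coll_counts[w] + 1) / (coll_len + 1)          # crude fallback so log() is defined
        score += math.log((1 - lam) * counts[w] / d_len + lam * p_c)
    return score

ranking = sorted(docs, key=lambda d: log_p_q_given_d(query, docs[d]), reverse=True)
print(ranking)   # d1 outranks d2: it actually contains "mining"
```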
How to Estimate p(w|d)?
• Simplest solution: Maximum Likelihood
Estimator
– P(w|d) = relative frequency of word w in d
– What if a word doesn’t appear in the text? P(w|d)=0
• In general, what probability should we give a
word that has not been observed?
• If we want to assign non-zero probabilities to
such words, we’ll have to discount the
probabilities of observed words
• This is what “smoothing” is about …
A General Smoothing Scheme
• All smoothing methods try to
– discount the probability of words seen in a doc
– re-allocate the extra probability so that unseen words
will have a non-zero probability
• Most use a reference model (collection language
model) to discriminate unseen words
    p(w|d) = p_seen(w|d)      if w is seen in d    (discounted ML estimate)
           = α_d · p(w|C)     otherwise            (collection language model)
Smoothing & TF-IDF Weighting
• Plugging the general smoothing scheme into the query-likelihood retrieval formula, we obtain

    log p(q|d) = ∑_{wi ∈ d, wi ∈ q} log [ p_seen(wi|d) / (α_d · p(wi|C)) ] + n·log α_d + ∑_i log p(wi|C)

  – the first sum gives TF weighting (via p_seen) and IDF weighting (via dividing by p(wi|C))
  – n·log α_d gives document length normalization (a long doc is expected to have a smaller α_d)
  – the last term does not depend on the document and can be ignored for ranking
• Smoothing with p(w|C) ⟹ TF-IDF weighting + length normalization
Three Smoothing Methods
(Zhai & Lafferty 01)
• Simplified Jelinek-Mercer: shrink uniformly toward p(w|C)

    p(w|d) = (1 − λ)·p_ml(w|d) + λ·p(w|C)

• Dirichlet prior (Bayesian): assume μ pseudo counts distributed according to p(w|C)

    p(w|d) = [c(w,d) + μ·p(w|C)] / (|d| + μ) = |d|/(|d|+μ)·p_ml(w|d) + μ/(|d|+μ)·p(w|C)

• Absolute discounting: subtract a constant δ

    p(w|d) = [max(c(w,d) − δ, 0) + δ·|d|_u·p(w|C)] / |d|
Assume We Use Dirichlet Smoothing
    p(w|d) = [c(w,d) + μ·p(w|C)] / (|d| + μ) = |d|/(|d|+μ) · c(w,d)/|d| + μ/(|d|+μ) · p(w|C)

Plugging this into log p(q|d) and keeping only the query words that occur in the document:

    Score(d,q) = ∑_{w ∈ q, w ∈ d} c(w,q) · log( 1 + c(w,d) / (μ·p(w|C)) ) + |q| · log( μ / (|d| + μ) )

  – c(w,d): document term frequency; c(w,q): query term frequency; |d|: document length
  – the 1/p(w|C) factor plays a role similar to IDF; the last term handles document length
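A minimal sketch of this Dirichlet-smoothed score; the collection model p(w|C), the value of μ, and the toy document/query are invented for illustration.

```python
import math
from collections import Counter

# Dirichlet-smoothed query-likelihood score:
#   Score(d,q) = sum_{w in q, w in d} c(w,q) * log(1 + c(w,d) / (mu * p(w|C)))
#                + |q| * log(mu / (|d| + mu))
mu = 2000.0
p_coll = {"data": 0.001, "mining": 0.002, "algorithms": 0.0005}   # p(w | C), illustrative values

def dirichlet_score(query_words, doc_words):
    c_q, c_d = Counter(query_words), Counter(doc_words)
    d_len, q_len = len(doc_words), len(query_words)
    matched = sum(c_q[w] * math.log(1 + c_d[w] / (mu * p_coll[w]))
                  for w in c_q if c_d[w] > 0)
    return matched + q_len * math.log(mu / (d_len + mu))

doc = ("text mining " * 10 + "data algorithms").split()
print(dirichlet_score("data mining algorithms".split(), doc))
```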
The Notion of Relevance
[Figure: a taxonomy of retrieval models organized by how they capture relevance]

Relevance
• (Rep(q), Rep(d)) similarity: different representations & similarity measures
  – Vector space model (Salton et al., 75)
  – Prob. distr. model (Wong & Yao, 89)
• P(r=1|q,d), r ∈ {0,1}: probability of relevance
  – Regression model (Fox 83)
  – Generative model
    – Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
    – Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
• P(d→q) or P(q→d): probabilistic inference, different inference systems
  – Prob. concept space model (Wong & Yao, 95)
  – Inference network model (Turtle & Croft, 91)
Query Generation and Document
Generation
• Document ranking based on the conditional probability of relevance P(R=1|D,Q), R ∈ {0,1}
• Equivalent to ranking based on the odds ratio

    O(D,Q) = P(R=1|D,Q) / P(R=0|D,Q)  ∝  P(D,Q|R)

  – Document generation, P(D|Q,R) ⟹ classical probabilistic model, Okapi/BM25
  – Query generation, P(Q|D,R) ⟹ language models
The Okapi/BM25 Retrieval Formula
• Classical probabilistic model, among the best-performing retrieval formulas

    Score(d,q) = ∑_{w ∈ q, w ∈ d}  log[ (N − df(w) + 0.5) / (df(w) + 0.5) ]
                 · (k1 + 1)·c(w,d) / [ k1·((1 − b) + b·|d|/avdl) + c(w,d) ]
                 · (k3 + 1)·c(w,q) / [ k3 + c(w,q) ]

  – only query words that occur in the document are considered
  – the first factor is an IDF weight (df(w) = document frequency of w, N = collection size)
  – the second factor is document TF with length normalization (|d| = doc length, avdl = average doc length)
  – the third factor is query TF; k1, k3, and b are parameters
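A short sketch of this BM25 formula; the corpus statistics (N, df, avdl), the parameter settings, and the toy document/query are invented for illustration.

```python
import math
from collections import Counter

# Okapi/BM25 score, following the formula above.
N, avdl = 1000, 150.0                                  # collection size, average doc length (toy)
df = {"data": 120, "mining": 40, "algorithms": 25}     # document frequencies (toy)
k1, k3, b = 1.2, 8.0, 0.75                             # illustrative parameter values

def bm25(query_words, doc_words):
    c_q, c_d = Counter(query_words), Counter(doc_words)
    d_len = len(doc_words)
    score = 0.0
    for w in c_q:
        if c_d[w] == 0:
            continue                                   # only query words that occur in the document
        idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5))
        tf_doc = (k1 + 1) * c_d[w] / (k1 * ((1 - b) + b * d_len / avdl) + c_d[w])
        tf_query = (k3 + 1) * c_q[w] / (k3 + c_q[w])
        score += idf * tf_doc * tf_query
    return score

doc = ("text mining " * 20 + "data mining algorithms").split()
print(bm25("data mining algorithms".split(), doc))
```

In practice k1, k3, and b are tuned on held-out data, just as the smoothing parameters of the language-model approaches are.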
What are the Essential Features in a
Retrieval Formula?
• Document TF – how many occurrences of each query term we observe in a document
• Query TF – how many occurrences of each term we observe in the query
• IDF – how distinctive/important each query term is (a term with a higher IDF is more distinctive)
• Document length – how long the document is (we want to penalize long documents, but don't want to over-penalize them!)
• Others?