CS 479, section 1: Natural Language Processing
Lecture #12: Language Model Smoothing, Interpolation
Thanks to Dan Klein of UC Berkeley for many of the materials used in this lecture.
Announcements
 Reading Report #5
 M&S 6.3-end (of ch. 6)
 Due: today
 Project #1, Part 1
 Build an interpolated language model
 ASAP: Work through the Tutorial with your pair
programming partner
 Early: Wednesday
 Due: Friday
Recap: Language Models
 What is the purpose of a language model?
 What do you think are the main challenges in
building n-gram language models?
Objectives
 Get comfortable with the process of factoring
and smoothing a joint model of a familiar
object: text!
 Motivate smoothing of language models
 Dig into Linear Interpolation as a method for
smoothing
 Feel confident about how to use these
techniques in Project #1, Part 1.
 Discuss how to train interpolation weights
Problem
Cause: Sparsity
 New words appear all the time:
 Synaptitute
 132,701.03
 fuzzificational
 New bigrams: even more often
 Trigrams or larger – still worse!
 What was the point of Zipf’s law for us?
 What will we do about it?
Solution: Smoothing
Very important all over NLP, but easy to do badly!
[Figure: two histograms of the distribution P(w | denied the), before and after smoothing]
 We often want to make predictions from sparse statistics:
 P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total); unseen words such as “attack”, “man”, “outcome” get zero
 Smoothing flattens distributions so they generalize better:
 P(w | denied the): 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)
Smoothing
 Two approaches we will explore:
 Interpolation: combine multiple estimates to give
probability mass to unseen events
 Think: Two heads are better than one!
 Project 1.1
 Today’s lecture
 Discounting: explicitly reserve mass for unseen events (see the numeric sketch after this list)
 Think: Robin Hood – rob from the rich to feed the poor!
 Project 1.2 (not for Fall 2012)
 Next lecture
 Can be used in combination!
 Another approach you read about:
 Back-off: we won’t spend time on this
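
A tiny numeric sketch of the “reserve mass” idea, using the counts from the P(w | denied the) figure above. The 0.5 discount and the <other> bucket are assumptions chosen only to reproduce the figure’s flattened counts; they are not the course’s prescribed method (discounting proper comes next lecture):

```python
counts = {"allegations": 3, "reports": 2, "claims": 1, "request": 1}
total = sum(counts.values())            # 7 observations of "denied the ___"

# Unsmoothed (maximum-likelihood) estimates: unseen words get probability 0.
mle = {w: c / total for w, c in counts.items()}

d = 0.5                                 # assumed discount per seen word type
flattened = {w: c - d for w, c in counts.items()}
flattened["<other>"] = d * len(counts)  # 2.0 of reclaimed mass for unseen words

smoothed = {w: c / total for w, c in flattened.items()}
print(mle)       # allegations 3/7 ≈ 0.43, ...; anything unseen gets 0
print(smoothed)  # allegations 2.5/7 ≈ 0.36, ..., <other> 2/7 ≈ 0.29
```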
Interpolation
 Idea: two heads are better than one
 i.e., combine (less sparse) lower-order model
with higher-order model to get a more robust
higher-order model
$P'(w_i \mid w_{i-1}) = f\bigl(P(w_i \mid w_{i-1}),\; P(w_i),\; \tfrac{1}{V}\bigr)$
Convex, Linear Interpolation
 Convex:
 interpolation constants sum to 1
 ∀𝑖, 0 ≤ 𝜆𝑖 ≤ 1
 One extreme: General linear interpolation
$P_{\text{interp}}(w \mid h_{1..n-1}) = \bigl(1 - \lambda(w, h_{1..n-1})\bigr)\,\hat{P}(w \mid h_{1..n-1}) + \lambda(w, h_{1..n-1})\,\hat{P}(w \mid h_{1..n-2})$
 One interpolation coefficient per history and
predicted word
Linear Interpolation
 The other extreme: a single global mixing weight
 Generally not ideal but it works (see the code sketch below):
$P_{\text{interp}}(w \mid h_{1..n-1}) = (1 - \lambda)\,\hat{P}(w \mid h_{1..n-1}) + \lambda\,\hat{P}(w \mid h_{1..n-2})$
 Middle ground: different weights for classes of
histories defined at other granularities:
 Bucket histories (and their weights) by count k:
 for each bucket_k, use a single weight λ(k)
 Bucket histories by average count (better):
 for a range of buckets bucket_k … bucket_{k+m}, have a single weight
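
A minimal sketch of the single-global-weight case in Python (illustrative only, not the Project #1 reference implementation; the function name and the sample numbers are made up):

```python
def interpolate(p_higher, p_lower, lam):
    """Convex linear interpolation of two estimates of the same event:
    (1 - lam) * higher-order estimate + lam * lower-order estimate."""
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * p_higher + lam * p_lower

# Illustrative numbers only: a bigram estimate of 0.30 and a unigram estimate
# of 0.030, mixed with a single global weight lam = 0.4.
print(interpolate(0.30, 0.030, 0.4))  # 0.192
```

Bucketing by history count would simply replace the constant lam with a weight looked up from the history’s bucket.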
Example: Linear Interpolation
history (h) = “fall into”

w                       P3(w|h)   P2(w|h)   P1(w|h)   interpolated
the                     0.30      0.5       0.030     ?
a                       0.10      0.2       0.010     ?
two                     0.00      0.0       0.001     ?
<other> (<OOV>/<UNK>)   0.6       0.3       0.959     ?
Question: using the following weights
• λ3,”fall into” = 0.1
• λ2, ”fall into” = 0.5
• λ1, ”fall into” = 0.4
How do you compute the combined, interpolated probabilities?
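
For example, a worked calculation for w = “the”, using the table’s values and the weights above:

$P_{\text{interp}}(\text{the} \mid \text{fall into}) = 0.1 \times 0.30 + 0.5 \times 0.5 + 0.4 \times 0.030 = 0.03 + 0.25 + 0.012 = 0.292$

The other rows follow the same pattern; because the weights are convex and each column sums to 1, the four interpolated values also sum to 1.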
Learning the Weights
 How?
Tuning on Held-Out Data
 Important tool for getting models to generalize:
[Diagram: corpus split into Training Data | Held-Out Data | Test Data]

$P_{\mathrm{LIN}(\lambda)}(w \mid w_{-1}) = (1 - \lambda)\,\hat{P}(w \mid w_{-1}) + \lambda\,\hat{P}(w)$
Wisdom
[Diagram: corpus split into Training Data | Held-Out Data | Test Data]
 “A cardinal sin in Statistical NLP is to test on your
training data.”
 Manning & Schuetze, p. 206
 Corollary: “You should always eyeball the training
data – you want to use your human pattern-finding
abilities to get hints on how to proceed. You
shouldn’t eyeball the test data – that’s cheating …”
 M&S, p. 207
Likelihood of the Data
 We want the joint probability of a data set 𝑤 (containing 𝑆 sentences):
$P(w) = \prod_{i=1}^{S} P(w_i)$
 Use our model 𝑀 (w/ local model 𝐿), trained on the training set:
$P_M(w) = \prod_{i=1}^{S} \prod_{j=1}^{N_i} P_L(w_{ij} \mid w_{i,j-1})$
 Take the logarithm: (why?)
$\log P_M(w) = \log \prod_{i=1}^{S} \prod_{j=1}^{N_i} P_L(w_{ij} \mid w_{i,j-1})$
 Distribute the log inward:
$\log P_M(w) = \sum_{i=1}^{S} \sum_{j=1}^{N_i} \log P_L(w_{ij} \mid w_{i,j-1})$
 Compare models using the Log Likelihood function:
$\mathcal{L}(w \mid M) = \sum_{i=1}^{S} \sum_{j=1}^{N_i} \log P_L(w_{ij} \mid w_{i,j-1})$
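
A minimal sketch of this computation (the function name and the <s> start marker are illustrative assumptions, not the Project #1 API):

```python
import math

def log_likelihood(sentences, cond_prob):
    """Sum log P_L(w_ij | w_i,j-1) over every word of every sentence.
    `sentences` is a list of token lists; `cond_prob(word, prev)` is any
    conditional word model, e.g. an interpolated bigram model."""
    total = 0.0
    for sentence in sentences:
        prev = "<s>"  # assumed start-of-sentence marker
        for word in sentence:
            total += math.log(cond_prob(word, prev))
            prev = word
    return total
```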
Maximizing the Likelihood
 Situation: we have a small number of parameters λ1 … λk that control the degree of smoothing in our model
 Goal: set them to maximize the (log-)likelihood of held-out
data:
$\mathcal{L}(w_{\text{heldout}} \mid M(\lambda_1 \ldots \lambda_k)) = \sum_{i=1}^{S} \sum_{j=1}^{N_i} \log P_{L(\lambda_1 \ldots \lambda_k)}(w_{ij} \mid w_{i,j-1})$

$(\hat{\lambda}_1 \ldots \hat{\lambda}_k) = \operatorname*{argmax}_{\lambda_1 \ldots \lambda_k} \mathcal{L}(w_{\text{heldout}} \mid M(\lambda_1 \ldots \lambda_k))$
 Method: use any optimization technique
 line search – easy, OK (sketched below)
 EM (to be discussed later in this course)
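
An illustrative sketch of the line-search option: a coarse grid over a single global weight, reusing the log_likelihood helper sketched above. All names, the grid resolution, and the two-way interpolation are assumptions, not the course’s reference code:

```python
def tune_lambda(heldout_sentences, p_bigram, p_unigram, steps=100):
    """Grid search for the single global weight lam in [0, 1] that maximizes
    held-out log-likelihood of the two-way interpolated model.
    p_bigram(w, prev) and p_unigram(w) are assumed pre-trained estimators."""
    best_lam, best_ll = 0.0, float("-inf")
    for k in range(steps + 1):
        lam = k / steps
        # Interpolated conditional model for this candidate weight.  In practice,
        # skip candidates that give zero probability to any held-out event.
        cond = lambda w, prev, lam=lam: (1 - lam) * p_bigram(w, prev) + lam * p_unigram(w)
        ll = log_likelihood(heldout_sentences, cond)
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam
```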
Tuning on Held-Out Data
 Important tool for getting models to generalize:
[Diagram: corpus split into Training Data | Held-Out Data | Test Data]

$P_{\mathrm{LIN}(\lambda)}(w \mid w_{-1}) = (1 - \lambda)\,\hat{P}(w \mid w_{-1}) + \lambda\,\hat{P}(w)$

[Plot: held-out log-likelihood ℒ as a function of λ]
What’s Next
 Upcoming lectures:
 Discounting strategies
 Reserving mass for Unknown Word (UNK) and
Unseen n-grams
 i.e., “Open Vocabulary”
 Speech Recognition