Transcript D2_ngrams

BASIC TECHNIQUES IN STATISTICAL NLP
Word prediction
n-grams
smoothing
Fall 2004
Statistical Methods in NLE

Two characteristics of NL make it desirable to endow programs with the ability to LEARN from examples of past use:
– VARIETY (no programmer can really take into account all possibilities)
– AMBIGUITY (need to have ways of choosing between alternatives)
In a number of NLE applications, statistical methods are very common
The simplest application: WORD PREDICTION
We are good at word prediction
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began ….
Real Spelling Errors
They are leaving in about fifteen minuets to go to her house
The study was conducted mainly be John Black.
The design an construction of the system will take more than one year.
Hopefully, all with continue smoothly in my absence.
Can they lave him my messages?
I need to notified the bank of this problem.
He is trying to fine out.
The ‘cloze’ task
Pablo did not get up at seven o’clock, as he always does. He
woke up late, at eight o’clock. He dressed quickly and came
out of the house barefoot. He entered the garage __ could not
open his __ door. Therefore, he had __ go to the office __
bus. But when he __ to pay his fare __ the driver, he realized
__ he did not have __ money. Because of that, __ had to
walk. When __ finally got into the __, his boss was offended
__ Pablo treated him impolitely.
Handwriting recognition

From Woody Allen’s Take the Money and Run (1969):
– Allen (a bank robber) walks up to the teller and hands him a note that reads: "I have a gun. Give me all your cash."
The teller, however, is puzzled, because he reads "I have a gub." "No, it's gun", Allen says.
"Looks like 'gub' to me," the teller says, then asks another teller to help him read the note, then another, and finally everyone is arguing over what the note means.
Applications of word prediction
Spelling checkers
Mobile phone texting
Speech recognition
Handwriting recognition
Disabled users
Statistics and word prediction
The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word / correction of spelling error
I.e., to compute
P(w | W1 … WN-1)
for all words w, and predict as next word the one for which this (conditional) probability is highest.
Using corpora to estimate probabilities
But where do we get these probabilities? Idea: estimate them by RELATIVE FREQUENCY.
The simplest method: Maximum Likelihood Estimate (MLE). Count the number of words in a corpus, then count how many times a given sequence is encountered.
P(W1..Wn) = C(W1..Wn) / N
‘Maximum’ because it doesn’t waste any probability on events not in the corpus
Maximum Likelihood Estimation for conditional probabilities
In order to estimate P(Wn | W1 … Wn-1), we can use instead:
P(Wn | W1..Wn-1) = C(W1..Wn) / C(W1..Wn-1)
Cfr.:
– P(A|B) = P(A&B) / P(B)
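To make the MLE recipe concrete, here is a minimal Python sketch (the toy corpus and the function name are illustrative, not from the slides) that estimates P(Wn | Wn-1) as a ratio of counts:

```python
from collections import Counter

def mle_bigram_probability(corpus_tokens, w_prev, w):
    """Estimate P(w | w_prev) by relative frequency: C(w_prev, w) / C(w_prev)."""
    unigram_counts = Counter(corpus_tokens)
    bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    if unigram_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

# Toy corpus (illustrative only)
tokens = "the big dog saw the big cat".split()
print(mle_bigram_probability(tokens, "the", "big"))  # C(the, big) / C(the) = 2/2 = 1.0
print(mle_bigram_probability(tokens, "big", "dog"))  # C(big, dog) / C(big) = 1/2 = 0.5
```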
Aside: counting words in corpora
Keep in mind that it’s not always so obvious what ‘a word’ is (cfr. yesterday)
In text:
– He stepped out into the hall, was delighted to encounter a brother. (From the Brown corpus.)
In speech:
– I do uh main- mainly business data processing
LEMMAS: cats vs cat
TYPES vs. TOKENS
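A quick illustration of the types vs. tokens distinction (the example sentence is made up):

```python
tokens = "the cat saw the other cat".split()
print(len(tokens))       # 6 tokens
print(len(set(tokens)))  # 4 types: 'the', 'cat', 'saw', 'other'
```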
The problem: sparse data
In principle, we would like the n of our models to be fairly large, to model ‘long distance’ dependencies such as:
– Sue SWALLOWED the large green …
However, in practice, most sequences of words of length greater than 3 hardly ever occur in our corpora! (See below)
(Part of the) Solution: we APPROXIMATE the probability of a word given all previous words
The Markov Assumption
The probability of being in a certain state only depends on the previous state:
P(Xn = Sk | X1 … Xn-1) = P(Xn = Sk | Xn-1)
This is equivalent to the assumption that the next state only depends on the previous m inputs, for m finite
(N-gram models / Markov models can be seen as probabilistic finite state automata)
The Markov assumption for language: n-gram models
Making the Markov assumption for word prediction means assuming that the probability of a word only depends on the previous N-1 words (N-GRAM model):
P(Wn | W1..Wn-1) ≈ P(Wn | Wn-N+1..Wn-1)
Bigrams and trigrams
Typical values of n are 2 or 3 (BIGRAM or TRIGRAM models):
P(Wn | W1 … Wn-1) ~ P(Wn | Wn-2, Wn-1)
P(W1 … Wn) ~ ∏ P(Wi | Wi-2, Wi-1)
What the bigram model means in practice:
– Instead of P(rabbit | Just the other day I saw a)
– we use P(rabbit | a)
Unigram: P(dog)
Bigram: P(dog | big)
Trigram: P(dog | the, big)
The chain rule
So how can we compute the probability of sequences of words longer than 2 or 3? We use the CHAIN RULE:
P(W1..Wn) = P(W1) P(W2|W1) P(W3|W1W2) .. P(Wn|W1..Wn-1)
E.g.,
– P(the big dog) = P(the) P(big|the) P(dog|the big)
Then we use the Markov assumption to reduce this to manageable proportions:
P(W1..Wn) ≈ P(W1) P(W2|W1) P(W3|W1W2) .. P(Wn|Wn-2Wn-1)
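A small sketch of how the chain rule looks under the bigram approximation, assuming we already have a table of bigram probabilities (the table and the <s> start symbol below are illustrative):

```python
# Sentence probability under a bigram model:
# P(W1..Wn) ~ P(W1|<s>) * P(W2|W1) * ... * P(Wn|Wn-1)
def bigram_sentence_probability(words, bigram_prob, start="<s>"):
    prob = 1.0
    prev = start
    for w in words:
        prob *= bigram_prob.get((prev, w), 0.0)  # unseen bigrams get probability 0 (hence smoothing, below)
        prev = w
    return prob

# Illustrative probability table
bigram_prob = {("<s>", "the"): 0.25, ("the", "big"): 0.1, ("big", "dog"): 0.05}
print(bigram_sentence_probability(["the", "big", "dog"], bigram_prob))  # 0.25 * 0.1 * 0.05 = 0.00125
```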
Example: the Berkeley Restaurant Project (BERP) corpus
BERP is a speech-based restaurant consultant
The corpus contains user queries; examples include
– I’m looking for Cantonese food
– I’d like to eat dinner someplace nearby
– Tell me about Chez Panisse
– I’m looking for a good place to eat breakfast
Computing the probability of a sentence
Given a corpus like BERP, we can compute the probability of a sentence like “I want to eat Chinese food”
Making the bigram assumption and using the chain rule, the probability can be approximated as follows:
– P(I want to eat Chinese food) ~ P(I|”sentence start”) P(want|I) P(to|want) P(eat|to) P(Chinese|eat) P(food|Chinese)
Bigram counts
How the bigram probabilities are computed
Example of P(I|I):
– C(“I”,”I”): 8
– C(“I”): 8 + 1087 + 13 + … = 3437
– P(“I”|”I”) = 8 / 3437 = .0023
Bigram probabilities: P(.|want)
The probability of the example sentence
P(I want to eat Chinese food) ≈ P(I|”sentence start”) * P(want|I) * P(to|want) * P(eat|to) * P(Chinese|eat) * P(food|Chinese)
Assume P(I|”start of sentence”) = .25
P = .25 * .32 * .65 * .26 * .020 * .56 = .000151
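Checking the arithmetic with the probabilities read off the BERP tables above:

```python
# P(I|<s>) * P(want|I) * P(to|want) * P(eat|to) * P(Chinese|eat) * P(food|Chinese)
p = 0.25 * 0.32 * 0.65 * 0.26 * 0.020 * 0.56
print(round(p, 6))  # 0.000151
```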
Examples of actual bigram probabilities computed using BERP
The tradeoff between prediction and sparsity: comparing Austen n-grams
[Table (1-gram): ranked unigram predictions for each position in “In person she was inferior to”. With a unigram model the ranking is the same in every position: 1. the (.034), 2. to (.032), 3. and (.030), … The words that actually occur rank as follows: to at rank 2 (.032), was at rank 8 (.015), she at rank 13 (.011), inferior at rank 1701 (.00005).]
Comparing Austen n-grams: bigrams
[Table (2-gram): bigram predictions for “In person she was inferior to”, each column conditioning on the previous word. Top candidates shown include and (.099) and who (.099) given “person”, with she down at rank 23 (.009); was (.122) and had (.0141) given “she”; not (.065) and a (.052) given “was”, with inferior assigned probability 0; and to (.212) given “inferior”.]
Comparing Austen n-grams: trigrams
[Table (3-gram): trigram predictions, each column conditioning on the two previous words. The contexts “In person” and “was inferior” are UNSEEN in the training data. Given “person she”, the top candidates are did (.05) and was (.05); given “she was”, the top candidates are not (.057) and very (.038), with inferior again assigned probability 0.]
Evaluating n-gram based language models: the Shannon/Miller/Selfridge method
For unigrams:
– Choose a random value r between 0 and 1
– Print out the word w whose interval in the cumulative distribution contains r
For bigrams:
– Choose a random bigram P(w|<s>) according to its probability
– Then pick bigrams to follow as before
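A minimal sketch of the bigram version of this generation method, assuming the bigram probabilities are stored as nested dictionaries; the table below is a toy illustration, not BERP or Shakespeare data:

```python
import random

def generate(bigram_prob, start="<s>", end="</s>", max_len=20):
    """Sample a word sequence by repeatedly drawing from P(.|previous word)."""
    word, output = start, []
    for _ in range(max_len):
        candidates = bigram_prob[word]                       # distribution over next words
        words, probs = zip(*candidates.items())
        word = random.choices(words, weights=probs, k=1)[0]  # pick w with probability P(w|word)
        if word == end:
            break
        output.append(word)
    return " ".join(output)

# Illustrative probability table
bigram_prob = {
    "<s>": {"I": 0.5, "the": 0.5},
    "I": {"want": 1.0},
    "want": {"to": 1.0},
    "to": {"eat": 1.0},
    "eat": {"</s>": 1.0},
    "the": {"food": 1.0},
    "food": {"</s>": 1.0},
}
print(generate(bigram_prob))  # e.g. "I want to eat"
```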
The Shannon/Miller/Selfridge method trained on Shakespeare
Approximating Shakespeare, cont’d
A more formal evaluation mechanism
Entropy
Cross-entropy
Small corpora?
The entire Shakespeare oeuvre consists of
– 884,647 tokens (N)
– 29,066 types (V)
– 300,000 bigrams
All of Jane Austen’s novels (on Manning and Schuetze’s website, also cc437/data):
– N = 617,091 tokens
– V = 14,585 types
Maybe with a larger corpus?
Words such as ‘ergativity’ are unlikely to be found outside a corpus of linguistic articles
More generally: Zipf’s law
Zipf’s law for the Brown corpus
Addressing the zeroes
SMOOTHING is re-evaluating some of the zero-probability and low-probability n-grams, assigning them non-zero probabilities:
– Add-one
– Witten-Bell
– Good-Turing
BACK-OFF is using the probabilities of lower-order n-grams when higher-order ones are not available:
– Backoff
– Linear interpolation
Add-one (‘Laplace’s Law’)
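A minimal sketch of add-one smoothing for bigram probabilities: add 1 to every bigram count and add the vocabulary size V to the denominator. The counts below are assumptions for illustration (roughly C(want) = 1215, C(want, to) = 786, V = 1616, the figures used in Jurafsky and Martin's presentation of BERP); with them the estimate of P(to|want) drops from about .65 to about .28, as noted on a later slide.

```python
def add_one_bigram_prob(bigram_count, context_count, vocab_size):
    """Add-one (Laplace) estimate: (C(w_prev, w) + 1) / (C(w_prev) + V)."""
    return (bigram_count + 1) / (context_count + vocab_size)

# Assumed BERP-sized counts (see lead-in): C(want) = 1215, C(want, to) = 786, V = 1616
print(786 / 1215)                            # unsmoothed MLE P(to|want), about .65
print(add_one_bigram_prob(786, 1215, 1616))  # add-one P(to|want), about .28
print(add_one_bigram_prob(0, 1215, 1616))    # an unseen bigram after 'want' now gets about .0004
```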
Effect on BERP bigram counts
Add-one bigram probabilities
The problem
The problem
Add-one has a huge effect on probabilities: e.g., P(to|want) went from .65 to .28!
Too much probability gets ‘removed’ from n-grams actually encountered
– (more precisely: the ‘discount factor’ is too large)
Witten-Bell Discounting
How can we get a better estimate of the probabilities of things we haven’t seen?
The Witten-Bell algorithm is based on the idea that a zero-frequency N-gram is just an event that hasn’t happened yet
How often do these events happen? We model this by the probability of seeing an N-gram for the first time (we just count the number of times we first encountered a type)
Witten-Bell: the equations
Total probability mass assigned to zero-frequency N-grams:
(NB: T is OBSERVED types, not V)
So each zero N-gram gets the probability:
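In the standard formulation (as given in Jurafsky and Martin, chapter 6), the total mass assigned to zero-frequency N-grams is T / (N + T), where N is the number of N-gram tokens and T the number of observed N-gram types; each individual zero-count N-gram then receives T / (Z (N + T)), with Z the number of N-grams that have zero counts.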
Witten-Bell: why ‘discounting’
Now of course we have to take away something (‘discount’) from the probability of the events seen more than once:
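In the same formulation, each N-gram seen with count c is discounted from the MLE estimate c / N to c / (N + T).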
Witten-Bell for bigrams
We ‘relativize’ the types to the previous word:
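A minimal Python sketch of the bigram version under the standard formulation, where T(w) is the number of distinct word types seen after w, N(w) the number of bigram tokens starting with w, and Z(w) = V - T(w) the number of continuations never seen after w (the function name and toy counts are illustrative):

```python
def witten_bell_bigram_prob(bigram_counts, vocab, w_prev, w):
    """Witten-Bell estimate of P(w | w_prev); assumes w_prev has been seen at least once."""
    followers = {v: c for (u, v), c in bigram_counts.items() if u == w_prev}
    n = sum(followers.values())   # N(w_prev): bigram tokens starting with w_prev
    t = len(followers)            # T(w_prev): distinct types seen after w_prev
    z = len(vocab) - t            # Z(w_prev): types never seen after w_prev
    c = followers.get(w, 0)
    if c > 0:
        return c / (n + t)        # seen bigram: discounted count
    return t / (z * (n + t))      # unseen bigram: its share of the reserved mass T/(N+T)

# Toy data (illustrative)
vocab = {"the", "big", "dog", "cat", "saw"}
bigram_counts = {("the", "big"): 2, ("the", "dog"): 1}
print(witten_bell_bigram_prob(bigram_counts, vocab, "the", "big"))  # seen: 2 / (3 + 2) = 0.4
print(witten_bell_bigram_prob(bigram_counts, vocab, "the", "cat"))  # unseen: 2 / (3 * (3 + 2)) ≈ 0.133
```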
Add-one vs. Witten-Bell discounts for unigrams in the BERP corpus

Word        Add-One   Witten-Bell
“I”         .68       .97
“want”      .42       .94
“to”        .69       .96
“eat”       .37       .88
“Chinese”   .12       .91
“food”      .48       .94
“lunch”     .22       .91
One last discounting method ….
The best-known discounting method is GOOD-TURING (Good, 1953)
Basic insight: re-estimate the probability of N-grams with zero counts by looking at the number of bigrams that occurred once
For example, the revised count for bigrams that never occurred is estimated by dividing N1, the number of bigrams that occurred once, by N0, the number of bigrams that never occurred
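A short sketch of the Good-Turing re-estimated counts, using the standard formula c* = (c + 1) N_{c+1} / N_c; the counts-of-counts below are made-up illustrative numbers:

```python
def good_turing_revised_count(c, counts_of_counts):
    """Good-Turing: c* = (c + 1) * N_{c+1} / N_c, where N_c = number of bigrams seen c times."""
    return (c + 1) * counts_of_counts.get(c + 1, 0) / counts_of_counts[c]

# Illustrative counts of counts: N0 unseen bigrams, N1 bigrams seen once, N2 seen twice, ...
counts_of_counts = {0: 1_000_000, 1: 50_000, 2: 20_000, 3: 10_000}
print(good_turing_revised_count(0, counts_of_counts))  # revised count for unseen bigrams: N1 / N0 = 0.05
print(good_turing_revised_count(1, counts_of_counts))  # bigrams seen once are discounted to 2 * N2 / N1 = 0.8
```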
Combining estimators
A method often used (generally in combination with discounting methods) is to use lower-order estimates to ‘help’ with higher-order ones:
– Backoff (Katz, 1987)
– Linear interpolation (Jelinek and Mercer, 1980)
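A minimal sketch of linear interpolation, mixing trigram, bigram, and unigram estimates with weights that sum to 1; the λ values and the probability tables below are illustrative, not estimated from data:

```python
def interpolated_trigram_prob(w, context, p_uni, p_bi, p_tri, lambdas=(0.6, 0.3, 0.1)):
    """P_interp(w | w1, w2) = l3 * P(w|w1,w2) + l2 * P(w|w2) + l1 * P(w)."""
    w1, w2 = context
    l3, l2, l1 = lambdas
    return (l3 * p_tri.get((w1, w2, w), 0.0)
            + l2 * p_bi.get((w2, w), 0.0)
            + l1 * p_uni.get(w, 0.0))

# Illustrative estimates: the trigram is unseen, but the bigram and unigram still contribute
p_uni = {"dog": 0.001}
p_bi = {("big", "dog"): 0.05}
p_tri = {}
print(interpolated_trigram_prob("dog", ("the", "big"), p_uni, p_bi, p_tri))
# 0.6 * 0 + 0.3 * 0.05 + 0.1 * 0.001 = 0.0151
```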
Backoff: the basic idea
Backoff with discounting
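A rough sketch of the backoff idea (not Katz’s exact formulation, which additionally discounts the higher-order estimates and renormalizes the backed-off mass): fall back on a lower-order estimate whenever the higher-order one is unavailable. The tables below are illustrative:

```python
def backoff_trigram_prob(w, context, p_uni, p_bi, p_tri):
    """Use the trigram estimate if available, else back off to the bigram, else to the unigram."""
    w1, w2 = context
    if (w1, w2, w) in p_tri:
        return p_tri[(w1, w2, w)]
    if (w2, w) in p_bi:
        return p_bi[(w2, w)]
    return p_uni.get(w, 0.0)

# Illustrative tables: the trigram 'the big dog' is unseen, so we back off to P(dog|big)
p_uni = {"dog": 0.001}
p_bi = {("big", "dog"): 0.05}
p_tri = {}
print(backoff_trigram_prob("dog", ("the", "big"), p_uni, p_bi, p_tri))  # 0.05
```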
A more radical solution: the Web as a corpus
Keller and Lapata (2003): using the Web to obtain frequencies for unseen bigrams
Corpora: the British National Corpus (150M words), Google, Altavista
Average factor by which Web counts are larger than BNC counts: ~ 1,000
Percentage of bigrams unseen in BNC that are unseen using Google: 2% (7/270)
NB: STILL need smoothing!!
Readings
Jurafsky and Martin, chapter 6
The Statistics Glossary
Word prediction:
– For mobile phones
– For disabled users
Further reading: Manning and Schuetze, chapter 6 (Good-Turing)
Acknowledgments
Some of the material in these slides was taken from lecture notes by Diane Litman & James Martin