Basic statistics and n-grams


BASIC TECHNIQUES IN STATISTICAL NLP
– Word prediction
– n-grams
– smoothing

Statistical Methods in NLE

Two characteristics of NL make it desirable to endow programs with the ability to LEARN from examples of past use:
– VARIETY (no programmer can really take into account all possibilities)
– AMBIGUITY (need to have ways of choosing between alternatives)
In a number of NLE applications, statistical methods are very common
The simplest application: WORD PREDICTION

We are good at word prediction

Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began ….

Real Spelling Errors
They are leaving in about fifteen minuets to go to her house
The study was conducted mainly be John Black.
The design an construction of the system will take more than
one year.
Hopefully, all with continue smoothly in my absence.
Can they lave him my messages?
I need to notified the bank of this problem.
He is trying to fine out.
Handwriting recognition

From Woody Allen’s Take the Money and Run (1969):
– Allen (a bank robber) walks up to the teller and hands him a note that reads, "I have a gun. Give me all your cash."
– The teller, however, is puzzled, because he reads "I have a gub." "No, it's gun," Allen says.
– "Looks like 'gub' to me," the teller says, then asks another teller to help him read the note, then another, and finally everyone is arguing over what the note means.

Applications of word prediction

– Spelling checkers
– Mobile phone texting
– Speech recognition
– Handwriting recognition
– Disabled users

Statistics and word prediction

The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word / correction of spelling error
I.e., to compute
P(w | W1 … Wn-1)
for all words w, and predict as next word the one for which this (conditional) probability is highest.

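To make the argmax concrete, a minimal Python sketch (the table cond_prob, the vocabulary and the toy values are hypothetical, not estimated from any corpus):

```python
def predict_next_word(history, vocabulary, cond_prob):
    """Return the word w maximizing P(w | history).
    cond_prob maps (history, word) -> conditional probability;
    words missing from the table are treated as probability 0."""
    return max(vocabulary, key=lambda w: cond_prob.get((history, w), 0.0))

# Toy example for the 'Stocks plunged ...' continuation
vocab = {"trading", "sleeping", "gub"}
cond_prob = {(("Wall", "Street", "began"), "trading"): 0.4,
             (("Wall", "Street", "began"), "sleeping"): 0.01}
print(predict_next_word(("Wall", "Street", "began"), vocab, cond_prob))  # trading
```
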
Using corpora to estimate probabilities

But where do we get these probabilities? Idea: estimate them by RELATIVE FREQUENCY.
The simplest method: Maximum Likelihood Estimation (MLE). Count the number of words N in a corpus, then count how many times a given sequence is encountered:
P(W1..Wn) = C(W1..Wn) / N
‘Maximum’ because it doesn’t waste any probability on events not in the corpus

Maximum Likelihood Estimation for conditional probabilities

In order to estimate P(Wn | W1 … Wn-1), we can instead use:
P(Wn | W1..Wn-1) = C(W1..Wn) / C(W1..Wn-1)
Cfr.:
– P(A|B) = P(A&B) / P(B)

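A minimal Python sketch of this conditional MLE estimate, computed over a toy corpus (the function name and the corpus are illustrative only):

```python
from collections import Counter

def mle_ngram_probs(tokens, n):
    """MLE conditional estimate: P(Wn | W1..Wn-1) = C(W1..Wn) / C(W1..Wn-1),
    computed from raw counts over a list of tokens (n >= 2)."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    histories = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return {gram: count / histories[gram[:-1]] for gram, count in ngrams.items()}

corpus = "the big dog saw the big cat".split()
bigram_probs = mle_ngram_probs(corpus, 2)
print(bigram_probs[("the", "big")])  # C(the big) / C(the) = 2 / 2 = 1.0
print(bigram_probs[("big", "dog")])  # C(big dog) / C(big) = 1 / 2 = 0.5
```
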
Aside: counting words in corpora

Keep in mind that it’s not always so obvious what ‘a word’ is (cfr. yesterday)
In text:
– He stepped out into the hall, was delighted to encounter a brother. (From the Brown corpus.)
In speech:
– I do uh main- mainly business data processing
LEMMAS: cats vs cat
TYPES vs. TOKENS

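A tiny illustration of the TYPES vs. TOKENS distinction (the sentence is made up):

```python
tokens = "the cat saw the other cat".split()
print(len(tokens), len(set(tokens)))  # 6 tokens, 4 types
```
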
The problem: sparse data

In principle, we would like the n of our models to be fairly large, to model ‘long distance’ dependencies such as:
– Sue SWALLOWED the large green …
However, in practice, most sequences of words of length greater than 3 hardly ever occur in our corpora! (See below)
(Part of the) Solution: we APPROXIMATE the probability of a word given all previous words

The Markov Assumption

The probability of being in a certain state only depends on the previous state:
P(Xn = Sk | X1 … Xn-1) = P(Xn = Sk | Xn-1)
This is equivalent to the assumption that the next state only depends on the previous m inputs, for m finite
(N-gram models / Markov models can be seen as probabilistic finite state automata)

The Markov assumption for language: n-gram models

Making the Markov assumption for word prediction means assuming that the probability of a word only depends on the previous N-1 words (N-GRAM model):
P(Wn | W1..Wn-1) ≈ P(Wn | Wn-N+1..Wn-1)

Bigrams and trigrams

Typical values of n are 2 or 3 (BIGRAM or TRIGRAM models):
P(Wn | W1 … Wn-1) ≈ P(Wn | Wn-2, Wn-1)
P(W1, … Wn) ≈ Π P(Wi | Wi-2, Wi-1)
What the bigram model means in practice:
– Instead of P(rabbit | Just the other day I saw a)
– We use P(rabbit | a)
Unigram: P(dog)
Bigram: P(dog | big)
Trigram: P(dog | the, big)

The chain rule

So how can we compute the probability of sequences of words longer than 2 or 3? We use the CHAIN RULE:
P(W1..Wn) = P(W1) P(W2|W1) P(W3|W1W2) .. P(Wn|W1..Wn-1)
E.g.,
– P(the big dog) = P(the) P(big|the) P(dog|the big)
Then we use the Markov assumption to reduce this to manageable proportions:
P(W1..Wn) ≈ P(W1) P(W2|W1) P(W3|W1W2) .. P(Wn|Wn-2Wn-1)

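For instance, under the bigram version of the Markov assumption the example above reduces to
P(the big dog) ≈ P(the) P(big|the) P(dog|big).
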
Example: the Berkeley Restaurant Project (BERP) corpus

BERP is a speech-based restaurant consultant
The corpus contains user queries; examples include:
– I’m looking for Cantonese food
– I’d like to eat dinner someplace nearby
– Tell me about Chez Panisse
– I’m looking for a good place to eat breakfast

Computing the probability of a sentence

Given a corpus like BERP, we can compute the probability of a sentence like “I want to eat Chinese food”
Making the bigram assumption and using the chain rule, the probability can be approximated as follows:
– P(I want to eat Chinese food) ≈ P(I|”sentence start”) P(want|I) P(to|want) P(eat|to) P(Chinese|eat) P(food|Chinese)

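A short Python sketch of this bigram approximation; the probability values are the BERP bigram probabilities quoted a few slides below, and the "<s>" start symbol stands in for "sentence start":

```python
def bigram_sentence_prob(words, bigram_prob, start="<s>"):
    """Chain rule under the bigram assumption:
    P(w1..wn) ~ P(w1|<s>) * P(w2|w1) * ... * P(wn|wn-1).
    bigram_prob maps (previous_word, word) -> P(word | previous_word)."""
    p, history = 1.0, start
    for w in words:
        p *= bigram_prob.get((history, w), 0.0)  # unseen bigrams get probability 0
        history = w
    return p

# Bigram probabilities for the example sentence, as quoted on a later slide
berp = {("<s>", "I"): .25, ("I", "want"): .32, ("want", "to"): .65,
        ("to", "eat"): .26, ("eat", "Chinese"): .002, ("Chinese", "food"): .60}
print(bigram_sentence_prob("I want to eat Chinese food".split(), berp))  # ~ 0.000016
```
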
Bigram counts
How the bigram probabilities are computed

Example of P(I|I):
– C(“I”,”I”): 8
– C(“I”): 8 + 1087 + 13 …. = 3437
– P(“I”|”I”) = 8 / 3437 = .0023

Bigram probabilities
The probability of the example sentence

P(I want to eat Chinese food) ≈
P(I|”sentence start”) * P(want|I) * P(to|want) * P(eat|to) * P(Chinese|eat) * P(food|Chinese) =
.25 * .32 * .65 * .26 * .002 * .60 = .000016

Examples of actual bigram probabilities computed using BERP

Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method

For unigrams:
– Choose a random value r between 0 and 1
– Print out the word w whose interval of the cumulative probability distribution contains r
For bigrams:
– Choose a random bigram starting with <s> in the same way
– Then pick bigrams to follow as before

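A minimal Python sketch of this sampling procedure for bigrams (the toy distribution is made up; random.choices performs the "choose r and find the matching word" step):

```python
import random
from collections import defaultdict

def generate(bigram_prob, start="<s>", end="</s>", max_len=20):
    """Sample a sentence word by word: at each step draw the next word from
    P(. | previous word), stopping at the end marker or a length cap.
    bigram_prob maps (previous, word) -> P(word | previous)."""
    by_history = defaultdict(list)          # group the table by history
    for (hist, w), p in bigram_prob.items():
        by_history[hist].append((w, p))

    sentence, history = [], start
    while len(sentence) < max_len:
        words, weights = zip(*by_history[history])
        w = random.choices(words, weights=weights)[0]   # weighted draw
        if w == end:
            break
        sentence.append(w)
        history = w
    return " ".join(sentence)

toy = {("<s>", "I"): 1.0, ("I", "want"): 0.5, ("I", "eat"): 0.5,
       ("want", "food"): 1.0, ("eat", "food"): 1.0, ("food", "</s>"): 1.0}
print(generate(toy))  # e.g. "I want food"
```
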
The Shannon/Miller/Selfridge method trained on Shakespeare

Approximating Shakespeare, cont’d
A more formal evaluation mechanism

– Entropy
– Cross-entropy

The downside

The entire Shakespeare oeuvre consists of:
– 884,647 tokens (N)
– 29,066 types (V)
– 300,000 bigrams
All of Jane Austen’s novels (on Manning and Schuetze’s website):
– N = 617,091 tokens
– V = 14,585 types

Comparing Austen n-grams: unigrams

Predicting the words of “In person she was inferior to” with a unigram model: the top-ranked unigrams are the same whatever the context (the .034, to .032, and .030, …). The actual words have much lower unigram probabilities (was .015, she .011, inferior .00005) and sit far down the ranking (ranks 8, 13 and 1701 respectively); only to (.032, rank 2) is predicted well.

Comparing Austen n-grams: bigrams

With a bigram model the predictions sharpen: P(.|person) ranks and and who (.099) highest, with the actual word she (.009) only at rank 23; P(.|she) ranks had and was at the top, so the actual word was (.122) comes second; P(.|was) ranks not (.065) and a (.052) highest, but the actual word inferior gets probability 0 (the bigram was inferior never occurs in the training data); P(.|inferior) gives the actual word to probability .212.

Comparing Austen n-grams: trigrams

With a trigram model the contexts themselves start to go missing: the context In person is UNSEEN in the training data, so no prediction can be made; P(.|person, she) ranks did (.05) and was (.05) at the top; P(.|she, was) ranks not (.057) and very (.038) at the top but gives the actual word inferior probability 0; and the context was inferior is itself UNSEEN.

Maybe with a larger corpus?

Words such as ‘ergativity’ are unlikely to be found outside a corpus of linguistic articles
More generally: Zipf’s law

Zipf’s law for the Brown corpus
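
The plotted relationship is Zipf’s law: a word’s frequency f is roughly inversely proportional to its rank r in the frequency list,

$$ f \propto \frac{1}{r}, $$

so even a much larger corpus still has a long tail of words that occur once or not at all.
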
Addressing the zeroes

SMOOTHING is re-evaluating some of the zero-probability and low-probability n-grams, assigning them non-zero probabilities:
– Add-one
– Witten-Bell
– Good-Turing
BACK-OFF is using the probabilities of lower-order n-grams when higher-order ones are not available:
– Backoff
– Linear interpolation

Add-one (‘Laplace’s Law’)
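
The slide’s own equation is not reproduced in the transcript; the standard add-one (Laplace) estimate for bigram probabilities, with V the vocabulary size, is

$$ P_{\text{add-1}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}, $$

i.e. every count is incremented by one and the denominator grows by V so that the probabilities still sum to one.
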
Effect on BERP bigram counts
Add-one bigram probabilities
The problem

Add-one has a huge effect on probabilities: e.g., P(to|want) went from .65 to .28!
Too much probability gets ‘removed’ from n-grams actually encountered
– (more precisely: the ‘discount’ applied to the counts of seen n-grams is too large)

Witten-Bell Discounting

How can we get a better estimate of the probabilities of things we haven’t seen?
The Witten-Bell algorithm is based on the idea that a zero-frequency N-gram is just an event that hasn’t happened yet
How often do such events happen? We model this by the probability of seeing an N-gram for the first time (we just count the number of times we first encountered a type)

Witten-Bell: the equations

Total probability mass assigned to zero-frequency N-grams:
(NB: T is OBSERVED types, not V)
So each zero N-gram gets the probability:

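The equations themselves are not reproduced in the transcript; in the usual formulation (cf. Jurafsky and Martin, ch. 6), with N observed N-gram tokens, T observed types and Z N-grams with zero count, the reserved mass and the share of each unseen N-gram are

$$ \sum_{i:\,c_i = 0} p_i^* = \frac{T}{N+T}, \qquad p_i^* = \frac{T}{Z\,(N+T)} \quad \text{for each } c_i = 0. $$
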
Witten-Bell: why ‘discounting’

Now of course we have to take away something (‘discount’) from the probability of the events we have actually seen:

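Again the formula is not in the transcript; in the usual formulation the seen N-grams keep a correspondingly discounted share:

$$ p_i^* = \frac{c_i}{N+T} \qquad \text{if } c_i > 0. $$
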
Witten-Bell for bigrams

We ‘relativize’ the types to the previous word:

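The conditional version is likewise missing from the transcript; in the usual formulation, with T(w) the number of distinct word types seen after w, N(w) the number of bigram tokens starting with w, and Z(w) the number of word types never seen after w:

$$ p^*(w_i \mid w_{i-1}) = \frac{T(w_{i-1})}{Z(w_{i-1})\,\big(N(w_{i-1}) + T(w_{i-1})\big)} \quad \text{if } C(w_{i-1} w_i) = 0, $$

$$ p^*(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{N(w_{i-1}) + T(w_{i-1})} \quad \text{otherwise.} $$
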
Add-one vs. Witten-Bell discounts for unigrams in the BERP corpus

Word        Add-One   Witten-Bell
“I”         .68       .97
“want”      .42       .94
“to”        .69       .96
“eat”       .37       .88
“Chinese”   .12       .91
“food”      .48       .94
“lunch”     .22       .91

One last discounting method ….

The best-known discounting method is GOOD-TURING (Good, 1953)
Basic insight: re-estimate the probability of N-grams with zero counts by looking at the number of bigrams that occurred once
For example, the revised count for bigrams that never occurred is estimated by dividing N1, the number of bigrams that occurred once, by N0, the number of bigrams that never occurred

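More generally (this generalization is not spelled out on the slide), the Good-Turing re-estimated count for an N-gram seen c times is

$$ c^* = (c+1)\,\frac{N_{c+1}}{N_c}, $$

where Nc is the number of N-grams occurring exactly c times; the slide’s N1/N0 example is the c = 0 case.
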
Combining estimators

A method often used (generally in combination with discounting methods) is to use lower-order estimates to ‘help’ with higher-order ones:
– Backoff (Katz, 1987)
– Linear interpolation (Jelinek and Mercer, 1980)

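As an illustration of the second option (the formula is not spelled out on the slide), simple linear interpolation for trigrams mixes the three MLE estimates with weights λi that sum to one:

$$ \hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n). $$
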
Backoff: the basic idea
Backoff with discounting
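
The equations for these two slides are not reproduced in the transcript; a minimal sketch of the basic (Katz-style) backoff scheme, leaving the discounting and the normalising α weights unspecified, is

$$ \hat{P}(w_n \mid w_{n-2} w_{n-1}) =
\begin{cases}
P(w_n \mid w_{n-2} w_{n-1}) & \text{if } C(w_{n-2} w_{n-1} w_n) > 0 \\
\alpha_1\, P(w_n \mid w_{n-1}) & \text{otherwise, if } C(w_{n-1} w_n) > 0 \\
\alpha_2\, P(w_n) & \text{otherwise.}
\end{cases} $$

Backoff with discounting then replaces the raw higher-order estimates with discounted ones (e.g. Witten-Bell or Good-Turing) and chooses the α’s so that each conditional distribution still sums to one.
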
Readings

Jurafsky and Martin, chapter 6
The Statistics Glossary
Word prediction:
– For mobile phones
– For disabled users
Further reading: Manning and Schuetze, chapter 6 (Good-Turing)

Acknowledgments

Some of the material in these slides was taken from lecture notes by Diane Litman & James Martin