Transcript 771Lec03

CSCE 771
Natural Language Processing
Lecture 3
N-grams
Topics
• Python
• NLTK
• N-grams
• Smoothing
Readings: Chapter 4 – Jurafsky and Martin
January 23, 2013
Last Time

Slides from Lecture 1 30 Regular expressions in Python, (grep, vi, emacs, word)?
 Eliza

Morphology
Today

N-gram models for prediction
Eliza.py
https://github.com/nltk/nltk/blob/master/nltk/chat/eliza.py
• List of (regular expression, response patterns) pairs
• If the regular expression matches the input…
• …then respond with one of the paired responses
pairs = (
(r'I need (.*)',
( "Why do you need %1?",
"Would it really help you to get %1?",
"Are you sure you need %1?")),
(r'Why don\'t you (.*)',
( "Do you really think I don't %1?",
"Perhaps eventually I will %1.",
"Do you really want me to %1?")),
http://nltk.org/book/
Natural Language Processing with Python
--- Analyzing Text with the Natural Language Toolkit
Steven Bird, Ewan Klein, and Edward Loper
Preface (extras)
1. Language Processing and Python (extras)
2. Accessing Text Corpora and Lexical Resources (extras)
3. Processing Raw Text
4. Writing Structured Programs (extras)
5. Categorizing and Tagging Words
6. Learning to Classify Text (extras)
7. Extracting Information from Text
8. Analyzing Sentence Structure (extras)
9. Building Feature Based Grammars
10. Analyzing the Meaning of Sentences (extras)
11. Managing Linguistic Data
12. Afterword: Facing the Language Challenge
Language Processing and Python
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
…
Simple text processing with NLTK
>>> text1.concordance("monstrous")
>>> text1.similar("monstrous")
>>> text2.common_contexts(["monstrous", "very"])
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>> text3.generate()
>>> text5[16715:16735]
Counting Vocabulary
>>> len(text3)
>>> sorted(set(text3))
>>> from __future__ import division
>>> len(text3) / len(set(text3))
>>> text3.count("smote")
lexical_diversity
>>> def lexical_diversity(text):
... return len(text) / len(set(text))
...
>>> def percentage(count, total):
... return 100 * count / total
...
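Usage, with the value the NLTK book reports for the Genesis text:
>>> lexical_diversity(text3)
16.050197203298673
>>> percentage(4, 5)
80.0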
1.3 Computing with Language: Simple Statistics
Frequency Distributions
>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 260819 outcomes>
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
>>> fdist1['whale']
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
List constructors in Python
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
>>> fdist5 = FreqDist(text5)
>>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
Collocations and Bigrams
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>> text4.collocations()
Building collocations list
United States; fellow citizens; years ago; Federal
Government; General Government; American people;
Vice President; Almighty God; Fellow citizens; Chief
Magistrate; Chief Justice; God bless; Indian tribes;
public debt; foreign nations; political parties; State
governments; National Government; United Nations;
public money
Table 1.2: Functions defined for NLTK's frequency distributions

Example                        Description
fdist = FreqDist(samples)      create a frequency distribution containing the given samples
fdist.inc(sample)              increment the count for this sample
fdist['monstrous']             count of the number of times a given sample occurred
fdist.freq('monstrous')        frequency of a given sample
fdist.N()                      total number of samples
fdist.keys()                   the samples sorted in order of decreasing frequency
for sample in fdist:           iterate over the samples, in order of decreasing frequency
fdist.max()                    sample with the greatest count
fdist.tabulate()               tabulate the frequency distribution
fdist.plot()                   graphical plot of the frequency distribution
fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2
Quotes from Chapter 4
But it must be recognized that the notion “probability of
a sentence” is an entirely useless one, under any
known interpretation of this term.
Noam Chomsky 1969
• http://www.chomsky.info/
Anytime a linguist leaves the group the recognition rate
goes up.
Fred Jelinek (then of the IBM speech group)
SLP – Jurafsky and Martin for the rest of the day
Predicting Words
Please turn your homework …
What is the next word?
Language models: N-gram models
Word/Character prediction Uses
1. Spelling correction (at character level)
2. Spelling correction (at a higher level) when the
corrector corrects to the wrong word
3. Augmentative communication – person with
disability chooses words from a menu predicted by
the system
Real-Word Spelling Errors
Mental confusions:
• Their/they're/there
• To/too/two
• Weather/whether
• Peace/piece
• You're/your
Typos that result in real words
Spelling Errors that are Words
Typos
Context:
• Left context
• Right context
Real Word Spelling Errors
Collect a set of common pairs of confusions.
Whenever a member of this set is encountered, compute the probability of the sentence in which it appears.
Substitute the other possibilities and compute the probability of each resulting sentence.
Choose the alternative with the highest probability.
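A minimal sketch of this procedure, assuming a trained sentence_logprob scorer is available (the confusion sets and the scorer's interface here are illustrative, not from the text):

CONFUSION_SETS = [{"their", "there"}, {"to", "too", "two"}, {"peace", "piece"}]

def correct(words, sentence_logprob):
    # For each word that belongs to a confusion set, try every member
    # of the set and keep the highest-probability version of the sentence.
    words = list(words)
    for i, w in enumerate(words):
        for conf_set in CONFUSION_SETS:
            if w in conf_set:
                words[i] = max(conf_set, key=lambda alt: sentence_logprob(
                    words[:i] + [alt] + words[i + 1:]))
    return words

Here sentence_logprob(words) would sum bigram log probabilities over the sentence, in the spirit of the n-gram models below.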
Word Counting
Probability based on counting:
• He stepped out into the hall, was delighted to encounter a water brother. (from the Brown corpus)
• Words?
• Bi-grams
Frequencies of words, but what words? Corpora?
• Web – everything on it
• Shakespeare
• Bible/Koran
• Spoken transcripts (Switchboard)
• Problems with spoken speech: "uh", "um" fillers
6.2 Bigrams from Berkeley Restaurant Proj.
Berkeley Restaurant Project – a speech based restaurant consultant
Handling requests:
• I'm looking for Cantonese food.
• I'm looking for a good place to eat breakfast.
Chain Rule
Recall the definition of conditional probability:
  P(A | B) = P(A ^ B) / P(B)
Rewriting:
  P(A ^ B) = P(A | B) P(B)
Or:
  P(the big) = P(big | the) P(the)
             = P(the) P(big | the)
Example
The big red dog
P(The)*P(big|the)*P(red|the big)*P(dog|the big red)
Better: condition the first word on the beginning of the sentence, written P(The | <s>).
General Case
Write the word sequence from position 1 to n as w_1^n.
The probability of the sequence is then
  P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) … P(w_n | w_1^{n-1})
           = P(w_1) ∏_{k=2}^{n} P(w_k | w_1^{k-1})
Unfortunately
That doesn't help, since it's unlikely we'll ever gather the right statistics for the prefixes.
Markov Assumption
Assume that the entire prefix history isn't necessary. In other words, an event doesn't depend on all of its history, just a fixed-length window of the recent history.
Markov Assumption
So for each component in the product, replace it with its approximation (assuming an N-gram model):
  P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) - Method to
estimate probabilities for the n-gram models
Normalize counts from a corpus
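For bigrams, this amounts to normalizing the bigram count by the count of the preceding word:
  P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})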
N-Grams: The big red dog
Unigrams:   P(dog)
Bigrams:    P(dog | red)
Trigrams:   P(dog | big red)
Four-grams: P(dog | the big red)
In general, we’ll be dealing with
P(Word| Some fixed prefix)
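A small sketch of these estimates over a toy corpus (nothing here is BERP data; p_mle is a hypothetical helper):

from collections import Counter

corpus = "<s> I want to eat Chinese food </s>".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_mle(word, prev):
    # MLE bigram estimate: C(prev word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

>>> p_mle("want", "I")
1.0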
Caveat
The formulation P(Word| Some fixed prefix) is not really
appropriate in many applications.
It is, if we're dealing with real-time speech, where we only have access to prefixes.
But if we're dealing with text, we already have both the right and left contexts. There's no a priori reason to stick to left contexts only.
BERP Table: Counts (fig 4.1)
Then we can normalize by dividing each row by the unigram counts.
BERP Table: Bigram Probabilities
Example
For this example:
• P(I | <s>) = .25
• P(english | want) = .0011
• P(food | english) = .5
• P(</s> | food) = .68
Now consider “<s> I want English food </s>”
P(<s> I want English food </s>)
= P(I | <s>) P(want | i) P(english | want) P(food | english) P(</s>|food)
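Using P(want | i) = .33 from the bigram table, this product works out to about .000031.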
An Aside on Logs
You don't really do all those multiplies. The numbers are too small and lead to underflow.
Convert the probabilities to logs and then do additions.
To get the real probability (if you need it), go back to the antilog.
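For example, with the probabilities from the sentence above:

import math

probs = [0.25, 0.33, 0.0011, 0.5, 0.68]
log_prob = sum(math.log(p) for p in probs)  # add logs instead of multiplying
prob = math.exp(log_prob)                   # antilog recovers ~3.1e-05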
Some Observations
The following numbers are very informative. Think about what they capture.
• P(want | I) = .32
• P(to | want) = .65
• P(eat | to) = .26
• P(food | Chinese) = .56
• P(lunch | eat) = .055
Some More Observations
P(I | I)       as in "I I I want"
P(want | I)    as in "I want I want to"
P(I | food)    as in "The food I want is"
Generation
Choose N-Grams according to their probabilities and
string them together
BERP
I want
want to
to eat
eat Chinese
Chinese food
food .
Some Useful Observations
A small number of events occur with high frequency.
• You can collect reliable statistics on these events with relatively small samples.
A large number of events occur with small frequency.
• You might have to wait a long time to gather statistics on the low frequency events.
Some Useful Observations
Some zeroes are really zeroes.
• Meaning that they represent events that can't or shouldn't occur.
On the other hand, some zeroes aren't really zeroes.
• They represent low frequency events that simply didn't occur in the corpus.
Shannon’s Method
Sentences randomly generated based on the probability
models (n-gram models)
Sample a random bigram (<s>, w) according to its probability.
Now sample a random bigram (w, x) according to its probability,
• where the prefix w matches the suffix of the first.
And so on, until we randomly choose a (y, </s>).
Then string the words together:
<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
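A minimal sketch of this sampling loop, assuming a bigram_dist dict that maps each word to a list of (next word, probability) pairs (the dict and helper are illustrative):

import random

def shannon_generate(bigram_dist):
    # Walk from <s>, sampling each next word from the bigram
    # distribution conditioned on the previous word, until </s>.
    word, words = "<s>", []
    while True:
        nexts, weights = zip(*bigram_dist[word])
        word = random.choices(nexts, weights=weights)[0]
        if word == "</s>":
            return " ".join(words)
        words.append(word)

With a distribution estimated from the BERP counts, one possible output is the chain above: "I want to eat Chinese food".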
Slide from: Speech and Language Processing, Jurafsky and Martin
Shannon's method applied to Shakespeare
Shannon applied to Wall Street Journal
Evaluating N-grams: Perplexity
Training set
Test set: W = w1 w2 … wN
Perplexity (PP) is a measure of how good a model is:
  PP(W) = P(w1 w2 … wN)^(-1/N)
Higher probability → lower perplexity.
Wall Street Journal perplexities of models
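A sketch of the computation in log space, assuming a trained bigram_logprob(prev, word) function (illustrative interface):

import math

def perplexity(words, bigram_logprob):
    # PP(W) = P(w1 ... wN)^(-1/N): exponentiate the average
    # negative log probability per transition.
    log_p = sum(bigram_logprob(w1, w2) for w1, w2 in zip(words, words[1:]))
    return math.exp(-log_p / (len(words) - 1))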
Unknown words: Open versus Closed Vocabularies
<UNK> – token for unrecognized (out-of-vocabulary) words
Google words visualization
http://googlesystem.blogspot.com/2008/05/using-googles-n-gram-corpus.html
Problem
Let's assume we're using N-grams.
How can we assign a probability to a sequence where one of the component n-grams has a value of zero?
Assume all the words are known and have been seen.
• Go to a lower order n-gram
• Back off from bigrams to unigrams
• Replace the zero with something else
Smoothing
Smoothing - reevaluating some of the zero and low
probability N-grams and assigning them non-zero
values
Add-One (Laplace)
Make the zero counts 1.
Rationale: They’re just events you haven’t seen yet. If
you had seen them, chances are you would only
have seen them once… so make the count equal to 1.
Add-One Smoothing
Terminology:
N – total number of words
V – vocabulary size (number of distinct words)
Maximum Likelihood estimate:
  P(w_x) = c(w_x) / Σ_i c(w_i)
Adjusted counts "C*"
Terminology:
N – total number of words
V – vocabulary size (number of distinct words)
Adjusted count:
  c*_i = (c_i + 1) · N / (N + V)
Adjusted probabilities:
  p*_i = c*_i / N
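The same adjustment in code (hypothetical counts; N and V as defined above):

from collections import Counter

counts = Counter({"want": 927, "to": 2417})  # hypothetical unigram counts
N = sum(counts.values())
V = 1616                                     # vocabulary size

def add_one_prob(word):
    # Laplace: (c_i + 1) / (N + V); an unseen word gets 1 / (N + V)
    return (counts[word] + 1) / (N + V)

def adjusted_count(word):
    # c*_i = (c_i + 1) * N / (N + V)
    return (counts[word] + 1) * N / (N + V)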
Discounting
Discounting – lowering some of the larger non-zero counts to get the "probability" to assign to the zero entries:
  d_c = c* / c
d_c – the ratio of the discounted counts to the original counts.
The discounted probabilities can then be directly calculated:
  p*_i = (c_i + 1) / (N + V)
Original BERP Counts (fig 6.4 again)
Berkeley Restaurant Project data
V = 1616
Figure 6.6: Add-one counts and probabilities
[Tables: add-one smoothed counts; add-one smoothed probabilities]
Add-One Smoothed bigram counts
Witten-Bell
Think about the occurrence of an unseen item
(word, bigram, etc) as an event.
The probability of such an event can be measured
in a corpus by just looking at how often it
happens.
Just take the single word case first.
Assume a corpus of N tokens and T types.
How many times was an as yet unseen type
encountered?
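One answer (the Witten-Bell estimate): each of the T types was unseen exactly once, namely the first time it appeared, so the total probability mass of unseen events is estimated as
  T / (N + T)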
Witten Bell
First compute the probability of an unseen event.
Then distribute that probability mass equally among the as yet unseen events.
• That should strike you as odd for a number of reasons
• In the case of words…
• In the case of bigrams…
Witten-Bell
In the case of bigrams, not all conditioning events are equally promiscuous:
• P(x | the) vs.
• P(x | going)
So distribute the mass assigned to the zero-count bigrams according to their promiscuity.
Witten-Bell
Finally, renormalize the whole table so that you still have a valid probability distribution.
Original BERP Counts; Now the Add-1 Counts
Witten-Bell Smoothed and Reconstituted
Add-One Smoothed BERP Reconstituted