Computing with Ngrams


I256: Applied Natural Language Processing
Marti Hearst
Sept 13, 2006

Counting Tokens
Useful for lots of things.
One cute application: see who talks where in a novel. The idea comes from Eick et al., who did it with The Jungle Book by Kipling.

SeeSoft Visualization of Jungle Book Characters, from Eick, Steffen, and Sumner '92

The FreqDist Data Structure
Purpose: collect counts and frequencies for some phenomenon.
Initialize a new FreqDist:
  from nltk_lite.probability import FreqDist
  fd = FreqDist()
When in a counting loop:
  fd.inc('item of interest')
After done counting:
  fd.N()               # total number of tokens counted
  fd.B()               # number of unique tokens
  fd.samples()         # list of all the tokens seen (there are N)
  fd.Nr(10)            # number of samples that occurred 10 times
  fd.count('red')      # number of times the token 'red' was seen
  fd.freq('red')       # frequency of 'red'; that is, fd.count('red')/fd.N()
  fd.max()             # the token with the highest count
  fd.sorted_samples()  # the samples in decreasing order of frequency

FreqDist() in action

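A minimal sketch of that API in action on a toy string (the in-class demo presumably ran over a real text; nltk_lite-era code, so Python 2):

  from nltk_lite.probability import FreqDist

  text = "the quick brown fox jumps over the lazy dog the fox"
  fd = FreqDist()
  for token in text.split():
      fd.inc(token)               # count each whitespace-delimited token

  print fd.N()                    # 11 (total tokens)
  print fd.B()                    # 8 (unique tokens)
  print fd.count('the')           # 3
  print fd.freq('the')            # 3/11, about 0.27
  print fd.max()                  # 'the'
  print fd.sorted_samples()[:2]   # ['the', 'fox']
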
Word Lengths by Language
[charts comparing word-length distributions across languages]

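A sketch of how such a comparison could be computed with FreqDist (the miniature samples here are hypothetical; the lecture presumably used whole corpora):

  from nltk_lite.probability import FreqDist

  samples = {
      'english': "the cat sat on the mat and looked at the dog",
      'german':  "die Katze sass auf der Matte und sah den Hund an",
  }
  for lang in samples:
      fd = FreqDist()
      for word in samples[lang].split():
          fd.inc(len(word))       # the "token" counted is a word length
      print lang, [(n, fd.count(n)) for n in sorted(fd.samples())]
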
Doing Character Distribution

How to determine the characters?
Write some code that takes as input a Gutenberg file and quickly suggests who the main characters are.

How to determine the characters?
My solution: look for words that begin with capital letters and count how often each occurs. Then show the most frequent. (A sketch of this approach follows.)

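A minimal sketch of that solution (the filename is hypothetical, and the heuristic is deliberately crude: it also picks up sentence-initial words and place names):

  from nltk_lite.probability import FreqDist

  def main_characters(path, n=10):
      # Count capitalized words; return the n most frequent.
      fd = FreqDist()
      for line in open(path):
          for word in line.split():
              word = word.strip('.,;:!?()"\'')   # trim punctuation
              if word[:1].isupper():
                  fd.inc(word)
      return fd.sorted_samples()[:n]

  print main_characters('jungle_book.txt')       # hypothetical filename
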
Language Modeling
A fundamental concept in NLP.
Main idea: for a given language, some words are more likely than others to follow each other; equivalently, you can predict (with some degree of accuracy) the probability that a given word will follow another word.

Next Word Prediction
From a NY Times story...
  Stocks ...
  Stocks plunged this ...
  Stocks plunged this morning, despite a cut in interest rates ...
  Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
  Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began ...
Adapted from slide by Bonnie Dorr

Human Word Prediction
Clearly, at least some of us have the ability to predict future words in an utterance.
How?
Domain knowledge
Syntactic knowledge
Lexical knowledge
Adapted from slide by Bonnie Dorr

Simple Statistics Does a Lot
A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques.
In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence).
Adapted from slide by Bonnie Dorr

N-Gram Models of Language
Use the previous N-1 words in a sequence to predict the next word.
How do we train these models? Very large corpora.
Adapted from slide by Bonnie Dorr

Simple N-Grams
Assume a language has V word types in its lexicon; how likely is word x to follow word y?
Simplest model of word probability: 1/V
Alternative 1: estimate the likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability)
  popcorn is more likely to occur than unicorn
Alternative 2: condition the likelihood of x occurring on the context of previous words (bigrams, trigrams, ...)
  mythical unicorn is more likely than mythical popcorn
Adapted from slide by Bonnie Dorr

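Written out (standard formulation, not from the original slides): an N-gram model approximates the full history by a window of the N-1 previous words, and a bigram probability is estimated from corpus counts:

\[
P(w_n \mid w_1, \dots, w_{n-1}) \approx P(w_n \mid w_{n-N+1}, \dots, w_{n-1}),
\qquad
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\,w_n)}{C(w_{n-1})}
\]
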
The ConditionalFreqDist Data Structure
A collection of FreqDist objects, indexed by the "condition" that is being tested or compared.
Initialize a new one:
  from nltk_lite.probability import ConditionalFreqDist
  cfd = ConditionalFreqDist()
Add a count:
  cfd['berkeley'].inc('blue')
  cfd['berkeley'].inc('gold')
  cfd['stanford'].inc('red')
Access each FreqDist object by indexing on its condition:
  cfd['berkeley'].samples()
  cfd['berkeley'].N()
Get a list of the conditions from the cfd object:
  cfd.conditions()   # ['stanford', 'berkeley']

Computing Next Words

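A sketch of the idea using ConditionalFreqDist, conditioning each word on the word before it (the filename is hypothetical; any tokenized text works):

  from nltk_lite.probability import ConditionalFreqDist

  words = open('austen-emma.txt').read().split()   # hypothetical file
  cfd = ConditionalFreqDist()
  for prev, word in zip(words, words[1:]):         # every bigram in the text
      cfd[prev].inc(word)

  print cfd['the'].sorted_samples()[:5]   # likeliest words after 'the'
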
Auto-generate a Story

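One way to auto-generate a "story" from the cfd built above: start at a seed word and repeatedly emit its most frequent successor. A greedy sketch; it assumes each generated word occurs as a condition, it tends to loop on common bigrams, and the in-class version may have sampled instead:

  def generate(cfd, word, length=20):
      out = []
      for i in range(length):
          out.append(word)
          word = cfd[word].max()   # most frequent successor
      return ' '.join(out)

  print generate(cfd, 'The')
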
Applications
Why do we want to predict a word, given some preceding words?
Rank the likelihood of sequences containing various alternative hypotheses, e.g. for automatic speech recognition:
  Theatre owners say popcorn/unicorn sales have doubled...
Assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation:
  The doctor recommended a cat scan.
  El doctor recomendó una exploración del gato. (literally "an exploration of the cat")
Adapted from slide by Bonnie Dorr

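For the ranking use, a very rough sketch with the bigram cfd above: score each hypothesis by the product of its conditional bigram frequencies (no smoothing, so an unseen bigram or an unseen context zeroes out or breaks the score):

  def score(cfd, words):
      p = 1.0
      for prev, word in zip(words, words[1:]):
          p *= cfd[prev].freq(word)   # relative frequency of this bigram
      return p

  s1 = 'say popcorn sales have doubled'.split()
  s2 = 'say unicorn sales have doubled'.split()
  print score(cfd, s1) > score(cfd, s2)   # expect True on news-like text
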
Comparing Modal Verb Counts
How to implement this?

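One possible implementation, sketched under the assumption of one plain-text file per genre (the filenames are hypothetical; the in-class version may have used the Brown corpus):

  from nltk_lite.probability import ConditionalFreqDist

  modals = ['can', 'could', 'may', 'might', 'must', 'will']
  cfd = ConditionalFreqDist()
  for genre in ['news', 'romance']:              # hypothetical files
      for word in open(genre + '.txt').read().split():
          if word.lower() in modals:
              cfd[genre].inc(word.lower())

  for genre in cfd.conditions():
      print genre, [(m, cfd[genre].count(m)) for m in modals]
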
Comparing Modals
[charts of the resulting modal verb counts]

Next Time
Part of Speech Tagging