CS-595: Machine Learning in Natural Language Processing

CS 595-052
Machine Learning and Statistical
Natural Language Processing
Prof. Shlomo Argamon, [email protected]
Room: 237C Office Hours: Mon 3-4 PM
Book: Foundations of Statistical Natural Language Processing
C. D. Manning and H. Schütze
Requirements:
– Several programming projects
– Research Proposal
Machine Learning
[Diagram] Training Examples → Learning Algorithm → Learned Model; Learned Model + Test Examples → Classification/Labeling Results
Modeling
• Decide how to represent learned models:
– Decision rules
– Linear functions
– Markov models
– …
• Type chosen affects generalization accuracy
(on new data)
Generalization
Example Representation
• Set of Features:
– Continuous
– Discrete (ordered and unordered)
– Binary
– Sets vs. Sequences
• Classes:
– Continuous vs. discrete
– Binary vs. multivalued
– Disjoint vs. overlapping
Learning Algorithms
• Find a “good” hypothesis “consistent” with
the training data
– Many hypotheses may be consistent, so may
need a “preference bias”
– No hypothesis may be consistent, so need to
find “nearly” consistent
• May rule out some hypotheses to start with:
– Feature reduction
Estimating Generalization Accuracy
• Accuracy on the training data says nothing about new
examples!
• Must train and test on different example sets
• Estimate generalization accuracy over multiple
train/test divisions
• Sources of estimation error:
– Bias: Systematic error in the estimate
– Variance: How much the estimate changes between
different runs
Cross-validation
1. Divide the training data into k sets
2. Repeat for each set:
1. Train on the remaining k-1 sets
2. Test on the held-out set
3. Average k accuracies (and compute statistics)
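A minimal sketch of this k-fold procedure in Python; `train` and `evaluate` are hypothetical placeholders for any learning algorithm and accuracy measure:

import random

def cross_validate(examples, train, evaluate, k=10, seed=0):
    """Estimate generalization accuracy by k-fold cross-validation."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    folds = [examples[i::k] for i in range(k)]          # k roughly equal folds
    accuracies = []
    for i in range(k):
        test_set = folds[i]
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(train_set)                        # train on the other k-1 folds
        accuracies.append(evaluate(model, test_set))    # test on the held-out fold
    return sum(accuracies) / k                          # average the k accuracies
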
Bootstrapping
For a corpus of n examples:
1. Choose n examples randomly (with replacement)
Note: We expect ~0.632n different examples
2. Train model, and evaluate:
• acc0 = accuracy of the model on the non-chosen examples
• accS = accuracy of the model on the n training examples
3. Estimate accuracy as 0.632*acc0 + 0.368*accS
4. Average accuracies over b different runs
Also note: there are other similar bootstrapping techniques
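A sketch of the 0.632 bootstrap estimate described above, again with hypothetical `train` and `evaluate` placeholders:

import random

def bootstrap_632(examples, train, evaluate, b=50, seed=0):
    """Average the 0.632 bootstrap accuracy estimate over b resamples."""
    rng = random.Random(seed)
    examples = list(examples)
    n = len(examples)
    estimates = []
    for _ in range(b):
        sample = [rng.choice(examples) for _ in range(n)]    # n draws with replacement
        chosen = {id(x) for x in sample}
        unseen = [x for x in examples if id(x) not in chosen]
        if not unseen:                                       # degenerate resample; skip it
            continue
        model = train(sample)
        acc0 = evaluate(model, unseen)                       # accuracy on non-chosen examples
        accS = evaluate(model, sample)                       # accuracy on the n training examples
        estimates.append(0.632 * acc0 + 0.368 * accS)
    return sum(estimates) / len(estimates)
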
Bootstrapping vs. Cross-validation
• Cross-validation:
– Equal participation of all examples
– Class distribution in the test sets depends on the
distribution in the training sets
– Stratified cross-validation: equalize class dist.
• Bootstrap:
– Often has higher bias (fewer distinct examples)
– Best for small datasets
Natural Language Processing
• Extract useful information from natural
language texts (articles, books, web pages,
queries, etc.)
• Traditional method: Handcrafted lexicons,
grammars, parsers
• Statistical approach: Learn how to process
language from a corpus of real usage
Some Statistical NLP Tasks
1. Part of speech tagging - How to distinguish
between book the noun, and book the verb.
2. Shallow parsing – Pick out phrases of different
types from a text, such as the purple people
eater or would have been going
3. Word sense disambiguation - How to
distinguish between river bank and bank as a
financial institution.
4. Alignment – Find the correspondence between
words, sentences and paragraphs of a source text
and its translation.
A Paradigmatic Task
• Language Modeling:
Predict the next word of a text (probabilistically):
P(wn | w1w2…wn-1) = m(wn | w1w2…wn-1)
• To do this perfectly, we must capture true notions of
grammaticality
• So:
Better approximation of prob. of “the next word”
⇒ Better language model
Measuring “Surprise”
• The lower the probability of the actual word,
the more the model is “surprised”:
H(wn | w1…wn-1) = -log2 m(wn | w1…wn-1)
(The conditional entropy of wn given w1,n-1)
Cross-entropy: Suppose the actual distribution
of the language is p(wn | w1…wn-1), then our
model is on average surprised by:
Ep[H(wn|w1,n-1)] = Σw p(wn=w | w1,n-1) H(wn=w | w1,n-1)
= Ep[–log2 m(wn | w1,n-1)]
Estimating the Cross-Entropy
How can we estimate Ep[H(wn|w1,n-1)] when we
don’t (by definition) know p?
Assume:
• Stationarity: The language doesn’t change
• Ergodicity: The language never gets “stuck”
Then:
Ep[H(wn|w1,n-1)] = lim n→∞ (1/n) Σk=1..n H(wk | w1,k-1)
= lim n→∞ –(1/n) log2 m(w1…wn)
Perplexity
Commonly used measure of “model fit”:
perplexity(w1,n, m) = 2^H(w1,n, m)
= m(w1,n)^–(1/n)
How many “choices” for next word on
average?
• Lower perplexity = better model
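A sketch of computing perplexity directly from this definition; `model_prob` is a hypothetical function returning m(wn | w1…wn-1):

import math

def perplexity(words, model_prob):
    """perplexity = 2 ** ((1/n) * sum of -log2 m(w_i | w_1 ... w_{i-1}))."""
    total_surprise = 0.0
    for i, w in enumerate(words):
        p = model_prob(w, words[:i])          # m(w_i | w_1 ... w_{i-1})
        total_surprise += -math.log2(p)       # surprise at the actual word
    return 2 ** (total_surprise / len(words))
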
N-gram Models
• Assume a “limited horizon”:
P(wk | w1w2…wk-1) = P(wk | wk-n+1…wk-1)
– Each word depends only on the last n-1 words
• Specific cases:
– Unigram model: P(wk) – words independent
– Bigram model: P(wk | wk-1)
• Learning task: estimate these probabilities
from a given corpus
Using Bigrams
• Compute probability of a sentence:
W = The cat sat on the mat
P(W) = P(The|START) P(cat|The) P(sat|cat) ×
P(on|sat) P(the|on) P(mat|the) P(END|mat)
• Generate a random text and examine for
“reasonableness”
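A sketch of generating random text from a bigram model to examine it for “reasonableness”; `bigram_probs` is a hypothetical dict mapping each previous word to a {next word: probability} table that includes START and END:

import random

def generate(bigram_probs, max_words=20, seed=0):
    """Sample a word sequence from the bigram model, starting at START."""
    rng = random.Random(seed)
    word, sentence = "START", []
    for _ in range(max_words):
        nexts, probs = zip(*bigram_probs[word].items())
        word = rng.choices(nexts, weights=probs)[0]   # sample the next word
        if word == "END":
            break
        sentence.append(word)
    return " ".join(sentence)
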
Maximum Likelihood Estimation
• PMLE(w1…wn) = C(w1…wn) / N
• PMLE(wn | w1…wn-1) = C(w1…wn) / C(w1…wn-1)
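A sketch of maximum likelihood bigram estimation from a corpus, assuming the corpus is given as tokenized sentences (lists of words):

from collections import Counter

def mle_bigrams(sentences):
    """P_MLE(w_k | w_{k-1}) = C(w_{k-1} w_k) / C(w_{k-1})."""
    context_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        words = ["START"] + list(sent) + ["END"]
        context_counts.update(words[:-1])                 # C(w_{k-1})
        bigram_counts.update(zip(words[:-1], words[1:]))  # C(w_{k-1} w_k)
    return {(prev, w): c / context_counts[prev]
            for (prev, w), c in bigram_counts.items()}
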
Problem: Data Sparseness!!
• For the vast majority of possible n-grams, we
get 0 probability, even in a very large corpus
• The larger the context, the greater the problem
• But there are always new cases not seen before!
Smoothing
• Idea: Take some probability away from seen
events and assign it to unseen events
Simple method (Laplace):
Give every event an a priori count of 1
PLap(X) = (C(X)+1) / (N+B)
where X is any entity, B is the number of entity types
• Problem: Assigns too much probability to new events
The more event types there are, the worse this becomes
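A sketch of Laplace smoothing over a generic count table; `counts` maps observed entities to counts, N is the total count, and B is the number of entity types:

def laplace(counts, N, B):
    """P_Lap(X) = (C(X) + 1) / (N + B); any unseen entity gets 1 / (N + B)."""
    def prob(x):
        return (counts.get(x, 0) + 1) / (N + B)
    return prob

With the bigram counts from the MLE sketch above, B would be the number of possible bigram types, which is exactly why too much mass goes to unseen events.
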
Interpolation
Lidstone:
PLid(X) = (C(X) + d) / (N + dB)
[d<1]
Johnson:
PLid(X) = μ PMLE(X) + (1 – μ)(1/B)
where μ = N/(N + dB)
• How to choose d?
• Doesn’t match low-frequency events well
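A sketch of Lidstone smoothing, together with the equivalent interpolation form from this slide (μ = N/(N + dB)), written over a generic count table:

def lidstone(counts, N, B, d=0.5):
    """P_Lid(X) = (C(X) + d) / (N + d*B)."""
    def prob(x):
        return (counts.get(x, 0) + d) / (N + d * B)
    return prob

def lidstone_interpolated(counts, N, B, d=0.5):
    """Equivalent form: mu * P_MLE(X) + (1 - mu) / B, with mu = N / (N + d*B)."""
    mu = N / (N + d * B)
    def prob(x):
        return mu * (counts.get(x, 0) / N) + (1 - mu) / B
    return prob
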
Held-out Estimation
Idea: Estimate frequencies on unseen data from
held-out (“unseen”) data
• Divide data: “training” &“held out” subsets
C1(X) = freq of X in training data
C2(X) = freq of X in held out data
Tr = Σ{X: C1(X)=r} C2(X)
Nr = number of entity types X with C1(X) = r
Pho(X) = Tr / (Nr N), where C1(X) = r
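A sketch of held-out estimation for entities seen in training; `train_counts` and `heldout_counts` are count tables for the two subsets, and N is taken (as in the formula above) to be the total count of the held-out data:

from collections import defaultdict

def held_out_probs(train_counts, heldout_counts, N):
    """P_ho(X) = T_r / (N_r * N) for each X with training count r."""
    T, Nr = defaultdict(int), defaultdict(int)
    for x, r in train_counts.items():
        T[r] += heldout_counts.get(x, 0)   # T_r: held-out mass of entities with C1 = r
        Nr[r] += 1                         # N_r: number of entity types with C1 = r
    return {x: T[r] / (Nr[r] * N) for x, r in train_counts.items()}
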
Deleted Estimation
Generalize to use all the data:
• Divide data into 2 subsets:
Nr^a = number of entity types s.t. Ca(X) = r
Tr^ab = Σ{X: Ca(X)=r} Cb(X)
Pdel(X) = (Tr^01 + Tr^10) / (N (Nr^0 + Nr^1))
[C(X) = r]
• Needs a large data set
• Overestimates unseen data, underestimates infrequent data
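A sketch of deleted estimation over the two halves of the data; `counts0` and `counts1` are count tables for the two subsets and N is the total data size:

from collections import defaultdict

def deleted_estimation(counts0, counts1, N):
    """P_del(X) = (T_r^01 + T_r^10) / (N * (N_r^0 + N_r^1)) for X with count r."""
    def tables(ca, cb):
        T, Nr = defaultdict(int), defaultdict(int)
        for x, r in ca.items():
            T[r] += cb.get(x, 0)
            Nr[r] += 1
        return T, Nr
    T01, N0 = tables(counts0, counts1)
    T10, N1 = tables(counts1, counts0)
    def prob(x, r):                        # r = C(X), the count of X in the full data
        return (T01[r] + T10[r]) / (N * (N0[r] + N1[r]))
    return prob
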
Good-Turing
For observed items, discount item count:
r* = (r+1) E[Nr+1] / E[Nr]
• The idea is that the chance of seeing the
item one more time, is about E[Nr+1] / E[Nr]
For unobserved items, total probability is:
E[N1] / N
– So, if we assume a uniform distribution over
unknown items, we have:
P(X) = E[N1] / (N0N)
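A sketch of Good-Turing discounting using the empirical Nr in place of E[Nr]; note that the r_max problem on the next slide shows up here as a zero discounted count for the most frequent items:

from collections import Counter

def good_turing(counts):
    """r* = (r+1) * N_{r+1} / N_r; unseen items share a total mass of N_1 / N."""
    Nr = Counter(counts.values())          # N_r = number of types seen exactly r times
    N = sum(counts.values())
    def r_star(r):
        return (r + 1) * Nr.get(r + 1, 0) / Nr[r]
    probs = {x: r_star(r) / N for x, r in counts.items()}
    unseen_mass = Nr.get(1, 0) / N         # total probability reserved for unseen items
    return probs, unseen_mass
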
Good-Turing Issues
• Has problems with high-frequency items
(consider r*max = (rmax+1) E[Nrmax+1]/E[Nrmax] = 0, since Nrmax+1 = 0)
Usual answers:
– Use only for low-frequency items (r < k)
– Smooth E[Nr] by function S(r)
• How to divide probability among unseen
items?
– Uniform distribution
– Estimate which seem more likely than others…
Back-off Models
• If high-order n-gram has insufficient data,
use lower order n-gram:
Pbo(wi | wi-n+1,i-1) =
(1 – d(wi-n+1,i-1)) P(wi | wi-n+1,i-1)    if enough data
α(wi-n+1,i-1) Pbo(wi | wi-n+2,i-1)        otherwise
• Note the recursive formulation
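A sketch of this recursive back-off computation; `p_discounted`, `alpha`, and `has_enough_data` are hypothetical helpers standing in for the discounted higher-order estimate (1 – d(h)) P(wi | h), the back-off weight α(h), and the data-sufficiency test:

def backoff_prob(wi, history, p_discounted, alpha, has_enough_data):
    """Recursive back-off over progressively shorter histories."""
    if has_enough_data(history, wi):
        return p_discounted(wi, history)           # enough data: use this order
    if not history:
        return p_discounted(wi, ())                # nothing left to back off to
    # otherwise: drop the earliest context word and re-weight by alpha(history)
    return alpha(history) * backoff_prob(wi, history[1:],
                                         p_discounted, alpha, has_enough_data)
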
Linear Interpolation
More generally, we can interpolate:
Pint(wi | h) = Σk λk(h) Pk(wi | h)
• Interpolation between different orders
• Usually set weights by iterative training (gradient
descent or the EM algorithm)
• Partition histories h into equivalence classes
• Need to be responsive to the amount of data!
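A sketch of linearly interpolating models of different orders; `models` is a list of hypothetical functions Pk(wi, h) and `weights(h)` returns the λk for the equivalence class of history h (in practice fit by EM on held-out data):

def interpolated_prob(wi, history, models, weights):
    """P_int(wi | h) = sum over k of lambda_k(h) * P_k(wi | h)."""
    lambdas = weights(history)                     # lambda_k(h), summing to 1
    return sum(lam * p(wi, history) for lam, p in zip(lambdas, models))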