Transcript POS tagging

CS60057 Speech & Natural Language Processing
Autumn 2007, Lecture 8
9 August 2007
Lecture 1, 7/21/2005
Natural Language Processing
1
POS Tagging
Task:
- assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word in context
POS taggers:
- need to be fast in order to process large corpora
  - full parsing is slow: e.g. context-free parsing takes O(n^3) time, where n is the length of the sentence
  - tagging should take no more than time linear in the size of the corpus
- POS taggers try to assign the correct tag without actually parsing the sentence
POS Tagging
Components:
- Dictionary of words
  - Exhaustive list of closed-class items. Examples:
    - the, a, an: determiner
    - from, to, of, by: preposition
    - and, or: coordinating conjunction
  - Large set of open-class items (e.g. nouns, verbs, adjectives) with frequency information
POS Tagging
Components:
- Mechanism to assign tags
  - Context-free: by frequency
  - Context-dependent: bigram, trigram, HMM, hand-coded rules
  - Example: Det Noun/*Verb  "the walk…"
- Mechanism to handle unknown words (extra-dictionary)
  - Capitalization
  - Morphology: -ed, -tion
POS Tagging
Words often have more than one POS: back
 The back door = JJ
 On my back = NN
 Win the voters back = RB
 Promised to back the bill = VB
 The POS tagging problem is to determine the
POS tag for a particular instance of a word.
These examples from Dekang Lin
How hard is POS tagging?
Measuring ambiguity
Algorithms for POS Tagging
• Ambiguity: in the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags).
Worse, 40% of the tokens are ambiguous.
Problem Setup
- There are M types of POS tags; tag set: {t1, …, tM}
- The word vocabulary size is V; vocabulary set: {w1, …, wV}
- We have a word sequence of length n: W = w1, w2, …, wn
- We want to find the best sequence of POS tags: T = t1, t2, …, tn
- Tbest = argmax_T Pr(T | W)
Information sources for tagging
All techniques are based on the same observations…
- Some tag sequences are more probable than others:
  - ART+ADJ+N is more probable than ART+ADJ+VB
- Lexical information: knowing the word to be tagged gives a lot of information about the correct tag:
  - "table": {noun, verb} but not {adj, prep, …}
  - "rose": {noun, adj, verb} but not {prep, …}
Algorithms for POS Tagging
Why can't we just look them up in a dictionary?
• Words that aren't in the dictionary:
http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc
• One idea: P(ti | wi) = the probability that a random hapax legomenon in the corpus has tag ti.
  Nouns are more likely than verbs, which are more likely than pronouns.
• Another idea: use morphology.
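The hapax-legomenon idea can be sketched in a few lines: estimate the tag distribution of unknown words from words that occur exactly once in a tagged corpus. The tiny corpus below is made-up toy data, purely for illustration.

```python
from collections import Counter

def hapax_tag_probs(tagged_corpus):
    """Estimate P(tag | unknown word) from hapax legomena:
    words occurring exactly once in the tagged corpus."""
    word_counts = Counter(w for w, _ in tagged_corpus)
    hapax_tags = Counter(t for w, t in tagged_corpus if word_counts[w] == 1)
    total = sum(hapax_tags.values())
    return {t: c / total for t, c in hapax_tags.items()}

# Toy tagged corpus (hypothetical data)
corpus = [("the", "DET"), ("dog", "NN"), ("barks", "VBZ"),
          ("the", "DET"), ("cat", "NN"), ("sleeps", "VBZ"),
          ("the", "DET"), ("koala", "NN")]
probs = hapax_tag_probs(corpus)
# hapaxes: dog, barks, cat, sleeps, koala -> 3 NN, 2 VBZ
print(probs)  # {'NN': 0.6, 'VBZ': 0.4}
```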
Algorithms for POS Tagging - Knowledge
• Dictionary
• Morphological rules, e.g.,
  - _____-tion
  - _____-ly
  - capitalization
• N-gram frequencies
  - to _____
  - DET _____ N
• But what about rare words, e.g., smelt (two verb forms, melt and the past tense of smell, and one noun form, a small fish)?
• Combining these:
  - V _____-ing
Example: I was gracking vs. Gracking is fun.
POS Tagging - Approaches
Approaches:
- Rule-based tagging (ENGTWOL)
- Stochastic (= probabilistic) tagging: HMM (Hidden Markov Model) tagging
- Transformation-based tagging: Brill tagger
Questions:
• Do we return one best answer, or several answers and let later steps decide?
• How does the requisite knowledge get entered?
3 methods for POS tagging
1. Rule-based tagging
- Example: Karlsson (1995) EngCG tagger, based on the Constraint Grammar architecture and the ENGTWOL lexicon
- Basic idea:
  - Assign all possible tags to words (a morphological analyzer is used)
  - Remove wrong tags according to a set of constraint rules (typically more than 1,000 hand-written constraint rules, but they may be machine-learned)
Sample rules
N-IP rule:
 A tag N (noun) cannot be followed by a tag IP (interrogative pronoun)
  … man who …
  - man: {N}
  - who: {RP, IP} --> {RP}  (relative pronoun)
ART-V rule:
 A tag ART (article) cannot be followed by a tag V (verb)
  … the book …
  - the: {ART}
  - book: {N, V} --> {N}
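A toy sketch of how such elimination rules operate on candidate tag sets. The two rules above are hard-coded, and the mini-lexicon of candidate tags is hypothetical:

```python
def apply_constraints(words, candidates):
    """Eliminate candidate tags using hand-written constraint rules,
    in the spirit of the N-IP and ART-V rules (a toy sketch)."""
    tags = [set(candidates[w]) for w in words]
    for i in range(len(words) - 1):
        if tags[i] == {"N"}:           # N-IP rule: a noun cannot be followed
            tags[i + 1].discard("IP")  # by an interrogative pronoun
        if tags[i] == {"ART"}:         # ART-V rule: an article cannot be
            tags[i + 1].discard("V")   # followed by a verb
    return [sorted(t) for t in tags]

candidates = {"man": ["N"], "who": ["RP", "IP"],
              "the": ["ART"], "book": ["N", "V"]}
print(apply_constraints(["man", "who"], candidates))   # [['N'], ['RP']]
print(apply_constraints(["the", "book"], candidates))  # [['ART'], ['N']]
```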
After The First Stage
Example: He had a book.
After the first stage:
- he: he/pronoun
- had: have/verb-past, have/auxiliary-past
- a: a/article
- book: book/noun, book/verb
Tagging rules:
- Rule-1: if (the previous tag is an article) then eliminate all verb tags
- Rule-2: if (the next tag is a verb) then eliminate all verb tags
Rule-Based POS Tagging

ENGTWOL tagger (now ENGCG-2)
 http://www.lingsoft.fi/cgi-bin/engcg
3 methods for POS tagging
2. Transformation-based tagging
- Example: Brill (1995) tagger, a combination of rule-based and stochastic (probabilistic) tagging methodologies
- Basic idea:
  - Start with a tagged corpus + a dictionary (with most frequent tags)
  - Set the most probable tag for each word as a start value
  - Change tags according to rules of the type "if word-1 is a determiner and word is a verb then change the tag to noun", applied in a specific order (like rule-based taggers)
  - Machine learning is used: the rules are automatically induced from a previously tagged training corpus (like the stochastic approach)
An example
1. Assign to words their most likely tag:
   P(NN|race) = .98
   P(VB|race) = .02
2. Change some tags by applying transformation rules.

Rule | Context (trigger: apply the rule when…) | Examples
NN → VB (noun → verb) | the previous tag is the preposition "to" | go to sleep (VB), go to school (VB)
VBR → VB (past tense → base form) | one of the previous 3 tags is a modal (MD) | you may cut (VB)
JJR → RBR (comparative adj → comparative adv) | the next tag is an adjective (JJ) | a more (RBR) valuable
VBP → VB (past tense → base form) | one of the previous 2 words is "n't" | should (VB) n't
Types of context
There is lots of latitude… the trigger can be:
- tag-triggered transformation:
  - the preceding/following word is tagged this way
  - the word two before/after is tagged this way
  - …
- word-triggered transformation:
  - the preceding/following word is this word
  - …
- morphology-triggered transformation:
  - the preceding/following word finishes with an s
  - …
- a combination of the above:
  - the preceding word is tagged this way AND the following word is this word
Learning the transformation rules
Input: a corpus with each word:
- correctly tagged (for reference)
- tagged with its most frequent tag (C0)
Output: a bag of transformation rules
Algorithm:
- Instantiate a small set of hand-written templates (generic rules) by comparing the reference corpus to C0:
  Change tag a to tag b when…
  - the preceding/following word is tagged z
  - the word two before/after is tagged z
  - one of the 2 preceding/following words is tagged z
  - one of the 2 preceding words is z
  - …

Learning the transformation rules (con't)
- Run the initial tagger and compile types of errors:
  - <incorrect tag, desired tag, # of occurrences>
- For each error type, instantiate all templates to generate candidate transformations
- Apply each candidate transformation to the corpus and count the number of corrections and errors that it produces
- Save the transformation that yields the greatest improvement
- Stop when no transformation can reduce the error rate by more than a predetermined threshold
Example
If the initial tagger mistags 159 words as verbs instead of nouns:
- create the error triple <verb, noun, 159>
Suppose template #3 is instantiated as the rule:
- Change the tag from <verb> to <noun> if one of the two preceding words is tagged as a determiner.
When this rule is applied to the corpus:
- it corrects 98 of the 159 errors
- but it also creates 18 new errors
- the error reduction is 98 - 18 = 80


Learning the best transformations
Input:
- a corpus with each word:
  - correctly tagged (for reference)
  - tagged with its most frequent tag (C0)
- a bag of unordered transformation rules
Output:
- an ordering of the best transformation rules
Learning the best transformations
(con’t)
let:
- E(Ck) = the number of words incorrectly tagged in the corpus at iteration k
- v(C) = the corpus obtained after applying rule v to the corpus C
- ε = the minimum error reduction desired

for k := 0 step 1 do
  bt := argmin_t E(t(Ck))        // find the transformation t that minimizes the error rate
  if ((E(Ck) - E(bt(Ck))) < ε)   // if bt does not improve the tagging significantly
    then goto finished
  Ck+1 := bt(Ck)                 // apply rule bt to the current corpus
  Tk+1 := bt                     // bt is kept as the current transformation rule
end
finished: the sequence T1 T2 … Tk is the ordered list of transformation rules
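The greedy loop above can be sketched in code. This toy version instantiates a single template ("change a to b when the previous tag is z") and measures error reduction directly on parallel tag lists; the data and tags below are invented for illustration.

```python
def tbl_learn(ref_tags, cur_tags, min_gain=1):
    """Greedy TBL loop: repeatedly pick the 'change a to b when the
    previous tag is z' rule with the largest error reduction."""
    cur = list(cur_tags)
    rules = []
    while True:
        best, best_gain = None, 0
        # candidate rules instantiated from the current errors
        cands = {(cur[i], ref_tags[i], cur[i - 1])
                 for i in range(1, len(cur)) if cur[i] != ref_tags[i]}
        for a, b, z in cands:
            gain = 0  # corrections minus newly introduced errors
            for i in range(1, len(cur)):
                if cur[i] == a and cur[i - 1] == z:
                    gain += 1 if ref_tags[i] == b else -1
            if gain > best_gain:
                best, best_gain = (a, b, z), gain
        if best is None or best_gain < min_gain:
            return rules
        a, b, z = best
        cur = [b if i and cur[i] == a and cur[i - 1] == z else cur[i]
               for i in range(len(cur))]
        rules.append(best)

# Toy example: the unigram tagger wrongly calls "race" NN after TO
ref = ["PRP", "VBZ", "VBN", "TO", "VB", "NN"]
cur = ["PRP", "VBZ", "VBN", "TO", "NN", "NN"]
print(tbl_learn(ref, cur))  # [('NN', 'VB', 'TO')]
```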
Strengths of transformation-based tagging
- Exploits a wider range of lexical and syntactic regularities
- Can look at a wider context:
  - conditions the tags on preceding/next words, not just preceding tags
  - can use more context than a bigram or trigram model
- Transformation rules are easier to understand than matrices of probabilities
How TBL Rules are Applied
Before the rules are applied, the tagger labels every word with its most likely tag. We get these most likely tags from a tagged corpus.
Example:
- He is expected to race tomorrow
- he/PRN is/VBZ expected/VBN to/TO race/NN tomorrow/NN
After selecting the most likely tags, we apply the transformation rules:
- Change NN to VB when the previous tag is TO
- This rule converts race/NN into race/VB
This may not work for every case:
- … According to race (here race/NN should be left unchanged)
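The two stages (assign most-likely tags, then apply the ordered rules) can be sketched as follows. The mini-lexicon mirrors the example sentence, but the most-likely tags behind it are assumed, not corpus-derived.

```python
def tbl_tag(words, most_likely, rules):
    """Tag each word with its most likely tag, then apply the learned
    transformation rules in order (a minimal sketch, toy lexicon)."""
    tags = [most_likely[w] for w in words]
    for a, b, prev in rules:
        for i in range(1, len(tags)):
            if tags[i] == a and tags[i - 1] == prev:
                tags[i] = b  # e.g. change NN to VB when the previous tag is TO
    return tags

most_likely = {"he": "PRN", "is": "VBZ", "expected": "VBN",
               "to": "TO", "race": "NN", "tomorrow": "NN"}
rules = [("NN", "VB", "TO")]  # change NN to VB when the previous tag is TO
print(tbl_tag("he is expected to race tomorrow".split(), most_likely, rules))
# ['PRN', 'VBZ', 'VBN', 'TO', 'VB', 'NN']
```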
How TBL Rules are Learned
We assume that we have a tagged corpus. Brill's TBL algorithm has three major steps:
1. Tag the corpus with the most likely tag for each word (unigram model).
2. Choose a transformation that deterministically replaces an existing tag with a new tag such that the resulting tagged training corpus has the lowest error rate out of all transformations.
3. Apply the transformation to the training corpus.
These steps are repeated until a stopping criterion is reached.
The result (which will be our tagger) will be:
- first, tag using the most-likely tags
- then, apply the learned transformations
Transformations
A transformation is selected from a small set of templates.
Change tag a to tag b when:
- the preceding (following) word is tagged z
- the word two before (after) is tagged z
- one of the two preceding (following) words is tagged z
- one of the three preceding (following) words is tagged z
- the preceding word is tagged z and the following word is tagged w
- the preceding (following) word is tagged z and the word two before (after) is tagged w
3 methods for POS tagging
3. Stochastic (= probabilistic) tagging
- Assume that a word's tag only depends on the previous tags (not the following ones)
- Use a training set (a manually tagged corpus) to:
  - learn the regularities of tag sequences
  - learn the possible tags for a word
  - model this information through a language model (n-gram)
- Example: HMM (Hidden Markov Model) tagging: a training corpus is used to compute the probability (frequency) of a given word having a given POS tag in a given context
Topics
Probability
Conditional Probability
Independence
Bayes Rule
HMM tagging
Markov Chains
Hidden Markov Models
6. Introduction to Probability
- Experiment (trial): a repeatable procedure with well-defined possible outcomes
- Sample space (S): the set of all possible outcomes; finite or infinite
- Example: coin toss experiment
  - possible outcomes: S = {heads, tails}
- Example: die toss experiment
  - possible outcomes: S = {1,2,3,4,5,6}
Introduction to Probability
- The definition of the sample space depends on what we are asking
- Sample space (S): the set of all possible outcomes
- Example: die toss experiment for whether the number is even or odd
  - possible outcomes: {even, odd}
  - not {1,2,3,4,5,6}
More definitions
Events:
- an event is any subset of outcomes from the sample space
Example: die toss experiment
- let A represent the event that the outcome of the die toss is divisible by 3
- A = {3,6}
- A is a subset of the sample space S = {1,2,3,4,5,6}
Introduction to Probability
Some definitions:
- Events
  - an event is a subset of the sample space
  - simple and compound events
- Example: deck-of-cards draw experiment
  - suppose sample space S = {heart, spade, club, diamond} (four suits)
  - let A represent the event of drawing a heart
  - let B represent the event of drawing a red card
  - A = {heart} (simple event)
  - B = {heart} ∪ {diamond} = {heart, diamond} (compound event)
  - a compound event can be expressed as a set union of simple events
- Example: alternative sample space S = set of 52 cards
  - A and B would both be compound events
Introduction to Probability
Some definitions:
- Counting
  - suppose an operation oi can be performed in ni ways
  - then a set of k operations o1 o2 … ok can be performed in n1 × n2 × … × nk ways
- Example: die toss experiment, 6 possible outcomes
  - two dice are thrown at the same time
  - number of sample points in the sample space = 6 × 6 = 36
Definition of Probability
- The probability law assigns to an event A a nonnegative number, called P(A)
- P(A) is also called the probability of A
- It encodes our knowledge or belief about the collective likelihood of all the elements of A
- The probability law must satisfy certain properties
Probability Axioms
- Nonnegativity
  - P(A) ≥ 0, for every event A
- Additivity
  - If A and B are two disjoint events, then the probability of their union satisfies: P(A ∪ B) = P(A) + P(B)
- Normalization
  - The probability of the entire sample space S is equal to 1, i.e. P(S) = 1.
An example
- An experiment involving a single coin toss
- There are two possible outcomes, H and T
- Sample space S is {H,T}
- If the coin is fair, we should assign equal probabilities to the 2 outcomes, since they have to sum to 1
- P({H}) = 0.5
- P({T}) = 0.5
- P({H,T}) = P({H}) + P({T}) = 1.0
Another example
- Experiment involving 3 coin tosses
- The outcome is a 3-long string of H or T
- S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} (8 outcomes in total)
- Assume each outcome is equiprobable ("uniform distribution")
- What is the probability of the event that exactly 2 heads occur?
- A = {HHT, HTH, THH} (3 outcomes)
- P(A) = P({HHT}) + P({HTH}) + P({THH})  (additivity: union of disjoint events)
       = 1/8 + 1/8 + 1/8 = 3/8
Probability definitions
In summary:
- Probability of drawing a spade from 52 well-shuffled playing cards:
  13/52 = 1/4 = .25
Moving toward language
- What's the probability of drawing a 2 from a deck of 52 cards with four 2s?
  P(drawing a two) = 4/52 = 1/13 ≈ .077
- What's the probability of a random word (from a random dictionary page) being a verb?
  P(drawing a verb) = (# of ways to get a verb) / (all words)
Probability and part of speech tags
• What's the probability of a random word (from a random dictionary page) being a verb?
  P(drawing a verb) = (# of ways to get a verb) / (all words)
• How to compute each of these:
  - All words = just count all the words in the dictionary
  - # of ways to get a verb = # of words which are verbs!
  - If a dictionary has 50,000 entries, and 10,000 are verbs, P(V) is 10,000/50,000 = 1/5 = .20
Conditional Probability
- A way to reason about the outcome of an experiment based on partial information:
  - In a word guessing game, the first letter of the word is a "t". What is the likelihood that the second letter is an "h"?
  - How likely is it that a person has a disease given that a medical test was negative?
  - A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
More precisely
- Given an experiment, a corresponding sample space S, and a probability law
- Suppose we know that the outcome is some event B
- We want to quantify the likelihood that the outcome also belongs to some other event A
- We need a new probability law that gives us the conditional probability of A given B: P(A|B)
An intuition
• Let's say A is "it's raining".
• Let's say P(A) in Kharagpur is 0.2
• Let's say B is "it was sunny ten minutes ago"
• P(A|B) means "what is the probability of it raining now, if it was sunny 10 minutes ago"
• P(A|B) is probably way less than P(A)
• Perhaps P(A|B) is .0001
• Intuition: the knowledge about B should change our estimate of the probability of A.
Conditional Probability
- Let A and B be events in the sample space
- P(A|B) = the conditional probability of event A occurring, given some fixed event B occurring
- Definition: P(A|B) = P(A ∩ B) / P(B)
Conditional probability
- P(A|B) = P(A ∩ B) / P(B), or equivalently P(A|B) = P(A,B) / P(B)
- Note: P(A,B) = P(A|B) · P(B)
- Also: P(A,B) = P(B,A)
[Venn diagram: regions A, A ∩ B, B]
Independence
- What is P(A,B) if A and B are independent?
- P(A,B) = P(A) · P(B) iff A, B independent
  - P(heads, tails) = P(heads) · P(tails) = .5 · .5 = .25
- Note: P(A|B) = P(A) iff A, B independent
- Also: P(B|A) = P(B) iff A, B independent
Bayes Theorem
P(B|A) = P(A|B) P(B) / P(A)
• Idea: the probability of an event A conditional on another event B is generally different from the probability of B conditional on A, but there is a definite relationship between the two.
Deriving Bayes Rule
The probability of event A given event B is
P(A|B) = P(A ∩ B) / P(B)
Deriving Bayes Rule
The probability of event B given event A is
P(B|A) = P(A ∩ B) / P(A)
Deriving Bayes Rule
P(A|B) = P(A ∩ B) / P(B)        P(B|A) = P(A ∩ B) / P(A)
P(A|B) P(B) = P(A ∩ B)          P(B|A) P(A) = P(A ∩ B)
Deriving Bayes Rule
P(A|B) = P(A ∩ B) / P(B)        P(B|A) = P(A ∩ B) / P(A)
P(A|B) P(B) = P(A ∩ B)          P(B|A) P(A) = P(A ∩ B)
P(A|B) P(B) = P(B|A) P(A)
P(A|B) = P(B|A) P(A) / P(B)
Deriving Bayes Rule
P(A|B) = P(B|A) P(A) / P(B)
The theorem may be paraphrased as:
conditional/posterior probability = (LIKELIHOOD multiplied by PRIOR) divided by NORMALIZING CONSTANT
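The derivation can be sanity-checked numerically on the earlier card-drawing events (A = draw a heart, B = draw a red card, from a 52-card deck):

```python
# Check Bayes' rule P(A|B) = P(B|A) P(A) / P(B) on the card example.
p_a = 13 / 52          # P(A): hearts
p_b = 26 / 52          # P(B): red cards
p_a_and_b = 13 / 52    # P(A ∩ B): a heart is always red
p_a_given_b = p_a_and_b / p_b    # by definition: 0.5
p_b_given_a = p_a_and_b / p_a    # by definition: 1.0
bayes = p_b_given_a * p_a / p_b  # Bayes' rule, should equal P(A|B)
print(p_a_given_b, bayes)  # 0.5 0.5
```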
Hidden Markov Model (HMM) Tagging
- Using an HMM to do POS tagging
- HMM is a special case of Bayesian inference
- It is also related to the "noisy channel" model in ASR (Automatic Speech Recognition)
Hidden Markov Model (HMM) Taggers
- Goal: maximize P(word|tag) × P(tag|previous n tags)
- P(word|tag): lexical information
  - word/lexical likelihood
  - the probability that, given this tag, we have this word
  - NOT the probability that this word has this tag
  - modeled through a language model (word-tag matrix)
- P(tag|previous n tags): syntagmatic information
  - tag sequence likelihood
  - the probability that this tag follows these previous tags
  - modeled through a language model (tag-tag matrix)
POS tagging as a sequence classification task
- We are given a sentence (an "observation" or "sequence of observations"):
  - Secretariat is expected to race tomorrow
  - a sequence of n words w1…wn
- What is the best sequence of tags which corresponds to this sequence of observations?
- Probabilistic/Bayesian view:
  - Consider all possible sequences of tags
  - Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.
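The "consider all possible tag sequences" view can be sketched by brute force on a toy example. All tags and probabilities below are invented for illustration; the enumeration is exponential in the sentence length, which is why dynamic programming (Viterbi) is used in practice.

```python
from itertools import product

def best_tag_seq(words, tags, p_trans, p_emit):
    """Brute force over all |tags|^n tag sequences: pick the T that
    maximizes P(T) * P(W|T) under bigram + word-given-tag assumptions."""
    def score(seq):
        p, prev = 1.0, "<s>"
        for w, t in zip(words, seq):
            p *= p_trans.get((prev, t), 0.0) * p_emit.get((t, w), 0.0)
            prev = t
        return p
    return max(product(tags, repeat=len(words)), key=score)

# Toy probabilities (made up): "can" is ambiguous between NN and VB
p_trans = {("<s>", "DT"): 0.8, ("<s>", "NN"): 0.2,
           ("DT", "NN"): 0.7, ("DT", "VB"): 0.1,
           ("NN", "VB"): 0.5, ("NN", "NN"): 0.3}
p_emit = {("DT", "the"): 0.6, ("NN", "can"): 0.02, ("VB", "can"): 0.04,
          ("NN", "rusted"): 0.001, ("VB", "rusted"): 0.01}
print(best_tag_seq(["the", "can", "rusted"], ["DT", "NN", "VB"],
                   p_trans, p_emit))  # ('DT', 'NN', 'VB')
```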
Getting to HMM
- Let T = t1, t2, …, tn
- Let W = w1, w2, …, wn
- Goal: out of all sequences of tags t1…tn, find the most probable sequence of POS tags T underlying the observed sequence of words w1, w2, …, wn:
  T̂ = argmax_T P(T|W)
- The hat ^ means "our estimate of the best, i.e. the most probable, tag sequence"
- argmax_x f(x) means "the x such that f(x) is maximized"; it maximizes our estimate of the best tag sequence
Getting to HMM
- This equation is guaranteed to give us the best tag sequence
- But how do we make it operational? How do we compute this value?
- Intuition of Bayesian classification:
  - Use Bayes rule to transform it into a set of other probabilities that are easier to compute
  - Thomas Bayes: British mathematician (1702-1761)
Bayes Rule
Bayes rule breaks down any conditional probability P(x|y) into three other probabilities:
P(x|y) = P(y|x) P(x) / P(y)
P(x|y): the conditional probability of an event x, assuming that y has occurred
Bayes Rule
Applied to tagging: T̂ = argmax_T P(W|T) P(T) / P(W).
We can drop the denominator: it does not change for each tag sequence; we are looking for the best tag sequence for the same observation, for the same fixed set of words:
T̂ = argmax_T P(W|T) P(T)
Likelihood and prior
T̂ = argmax_T P(W|T) P(T): the likelihood P(W|T) and the prior P(T) still have to be decomposed into computable terms.
Likelihood and prior
Further simplifications:
1. The probability of a word appearing depends only on its own POS tag, i.e. it is independent of the other words around it:
   P(W|T) ≈ Π_{i=1..n} P(wi|ti)
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag:
   P(T) ≈ Π_{i=1..n} P(ti|ti-1)
3. The most probable tag sequence estimated by the bigram tagger:
   T̂ = argmax_T Π_{i=1..n} P(wi|ti) P(ti|ti-1)
Likelihood and prior
Further simplifications:
1. The probability of a word appearing depends only on its own POS tag, independent of the other words around it: P(W|T) ≈ Π_{i=1..n} P(wi|ti)
[Figure: WORDS "the koala put the keys on the table" linked to TAGS {N, V, P, DET}]
Likelihood and prior
Further simplifications:
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag.
- Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-grams.
- Bigrams are used as the basis for simple statistical analysis of text.
- The bigram assumption is related to the first-order Markov assumption.
Likelihood and prior
Further simplifications:
3. The most probable tag sequence estimated by the bigram tagger:
   T̂ = argmax_T Π_{i=1..n} P(wi|ti) P(ti|ti-1)   (bigram assumption)
Two kinds of probabilities (1)
- Tag transition probabilities p(ti|ti-1)
  - Determiners are likely to precede adjectives and nouns:
    - That/DT flight/NN
    - The/DT yellow/JJ hat/NN
  - So we expect P(NN|DT) and P(JJ|DT) to be high
  - But P(DT|JJ) to be low
Two kinds of probabilities (1)
- Tag transition probabilities p(ti|ti-1)
  - Compute P(NN|DT) by counting in a labeled corpus:
    P(NN|DT) = C(DT, NN) / C(DT) = (# of times DT is followed by NN) / (# of times DT occurs)
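This counting estimate is a few lines of code. The tag-sequence corpus below is a made-up toy:

```python
from collections import Counter

def transition_probs(tag_seqs):
    """MLE tag transition probabilities: P(t2|t1) = C(t1,t2) / C(t1),
    counting over every adjacent tag pair in the corpus."""
    bigrams, starts = Counter(), Counter()
    for seq in tag_seqs:
        for t1, t2 in zip(seq, seq[1:]):
            bigrams[(t1, t2)] += 1
            starts[t1] += 1
    return {(t1, t2): c / starts[t1] for (t1, t2), c in bigrams.items()}

corpus_tags = [["DT", "NN", "VBZ"], ["DT", "JJ", "NN"], ["DT", "NN"]]
p = transition_probs(corpus_tags)
print(p[("DT", "NN")])  # 2 of the 3 DT bigrams are (DT, NN) -> 0.666...
```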
Two kinds of probabilities (2)
- Word likelihood probabilities p(wi|ti)
  - P(is|VBZ) = probability of VBZ (3sg pres verb) being "is"
  - If we were expecting a third person singular verb, how likely is it that this verb would be "is"?
  - Compute P(is|VBZ) by counting in a labeled corpus:
    P(is|VBZ) = C(VBZ, is) / C(VBZ)
An Example: the verb “race”
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?
Disambiguating “race”
- P(NN|TO) = .00047,  P(VB|TO) = .83
  The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: "How likely are we to expect a verb/noun given the previous tag TO?"
- P(race|NN) = .00057,  P(race|VB) = .00012
  Lexical likelihoods from the Brown corpus for "race" given the POS tag NN or VB.
- P(NR|VB) = .0027,  P(NR|NN) = .0012
  Tag sequence probabilities for the likelihood of an adverb (NR) occurring given the previous tag verb or noun.
- P(VB|TO) P(NR|VB) P(race|VB) = .00000027
- P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
Multiplying the lexical likelihoods with the tag sequence probabilities, the verb wins.
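The arithmetic above can be checked directly (numbers copied from the slide):

```python
# Reproduce the 'race' disambiguation products from the slide.
p_vb = 0.83 * 0.0027 * 0.00012     # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057  # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_vb:.2e} vs {p_nn:.2e}") # ~2.7e-07 vs ~3.2e-10: the verb wins
```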
Hidden Markov Models
- What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM)
- Let's spend a bit of time tying this into the model
- In order to define an HMM, we will first introduce the Markov chain, or observable Markov model
Definitions
- A weighted finite-state automaton adds probabilities to the arcs
  - The probabilities on the arcs leaving any state must sum to one
- A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through
- Markov chains can't represent inherently ambiguous problems
  - They are useful for assigning probabilities to unambiguous sequences
Markov chain =
“First-order observed Markov Model”
- A set of states: Q = q1, q2, …, qN; the state at time t is qt
- A set of transition probabilities: A = a01 a02 … an1 … ann
  - each aij represents the probability of transitioning from state i to state j
  - the set of these is the transition probability matrix A
  - aij = P(qt = j | qt-1 = i),  1 ≤ i, j ≤ N
  - Σ_{j=1..N} aij = 1,  for 1 ≤ i ≤ N
- Distinguished start and end states
- A special initial probability vector π
  - πi is the probability that the MM will start in state i; each πi expresses the probability p(qi|START)
Markov chain =
“First-order observed Markov Model”
Markov chain for weather: Example 1
- Three types of weather: sunny, rainy, foggy
- We want to find the following conditional probabilities:
  P(qn | qn-1, qn-2, …, q1)
  i.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days
- We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences
- Problem: the larger n is, the more observations we must collect.
  Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories
Markov chain =
“First-order observed Markov Model”
- Therefore, we make a simplifying assumption, called the (first-order) Markov assumption:
  for a sequence of observations q1, …, qn, the current state only depends on the previous state
- The joint probability of certain past and current observations:
  P(q1, …, qn) = Π_{i=1..n} P(qi | qi-1)
Markov chain = "First-order observable Markov Model"
[Figure: state-transition diagram for the weather Markov chain]
Markov chain =
“First-order observed Markov Model”
Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy?
- Using the Markov assumption and the probabilities in table 1, this translates into:
  P(q2 = sunny, q3 = rainy | q1 = sunny) = P(q3 = rainy | q2 = sunny) · P(q2 = sunny | q1 = sunny)
The weather figure: specific example
Markov chain for weather: Example 2
[Figure: weather state-transition diagram with transition probabilities]
Markov chain for weather
- What is the probability of 4 consecutive rainy days?
- The sequence is rainy-rainy-rainy-rainy
- I.e., the state sequence is 3-3-3-3
- P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
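This computation can be sketched as a one-function sequence-probability routine. The only probabilities supplied are the two used on the slide (initial probability 0.2 for the rainy state, self-transition 0.6):

```python
def chain_prob(states, pi, a):
    """Probability of a state sequence in a first-order Markov chain:
    pi[s1] * a[(s1,s2)] * a[(s2,s3)] * ..."""
    p = pi[states[0]]
    for s1, s2 in zip(states, states[1:]):
        p *= a[(s1, s2)]
    return p

pi = {3: 0.2}       # rainy = state 3, pi_3 = 0.2 (from the slide)
a = {(3, 3): 0.6}   # a_33 = 0.6 (from the slide)
print(chain_prob([3, 3, 3, 3], pi, a))  # 0.2 * 0.6**3 = 0.0432
```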
Hidden Markov Model
- For Markov chains, the output symbols are the same as the states:
  - see sunny weather: we're in state sunny
- But in part-of-speech tagging (and other things):
  - the output symbols are words
  - but the hidden states are part-of-speech tags
- So we need an extension!
- A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
- This means we don't know which state we are in.
Markov chain for words
Observed events: words
Hidden events: tags
Hidden Markov Models
- States Q = q1, q2, …, qN
- Observations O = o1, o2, …, oN
  - each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
- Transition probabilities (prior):
  - transition probability matrix A = {aij}
- Observation likelihoods (likelihood):
  - output probability matrix B = {bi(ot)}: a set of observation likelihoods ("emission probabilities"), each expressing the probability of an observation ot being generated from a state i
- Special initial probability vector π
  - πi is the probability that the HMM will start in state i; each πi expresses the probability p(qi|START)
Assumptions
- Markov assumption: the probability of a particular state depends only on the previous state:
  P(qi | q1 … qi-1) = P(qi | qi-1)
- Output-independence assumption: the probability of an output observation depends only on the state that produced that observation:
  P(oi | q1 … qi, o1 … oi-1) = P(oi | qi)
HMM for Ice Cream
- You are a climatologist in the year 2799, studying global warming
- You can't find any records of the weather in Boston, MA for the summer of 2007
- But you find Jason Eisner's diary, which lists how many ice-creams Jason ate every day that summer
- Our job: figure out how hot it was
Noam task
- Given:
  - ice cream observation sequence: 1,2,3,2,2,2,3,… (cf. output symbols)
- Produce:
  - weather sequence: C,C,H,C,C,C,H,… (cf. hidden states, causing states)
HMM for ice cream
Different types of HMM structure
Bakis = left-to-right
Ergodic = fully-connected
HMM Taggers
- Two kinds of probabilities:
  - A: transition probabilities (PRIOR)
  - B: observation likelihoods (LIKELIHOOD)
- HMM taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability
Lecture 1, 7/21/2005
Natural Language Processing
92
Weighted FSM corresponding to hidden states of HMM, showing A probs

[Figure]

B observation likelihoods for POS HMM

[Figure]

The A matrix for the POS HMM

[Figure]

The B matrix for the POS HMM

[Figure]
HMM Taggers

- The probabilities are trained on hand-labeled training corpora (training set)
- Combine different N-gram levels
- Evaluated by comparing their output on a test set to human labels for that test set (gold standard)
The Viterbi Algorithm

- What is the best tag sequence for "John likes to fish in the sea"?
- Viterbi efficiently computes the most likely state sequence given a particular output sequence
- It is based on dynamic programming
A smaller example

[Figure: a four-state FSM with states start, q, r, and end. From the trellis on the next slide: start goes to q with probability 1.0; q loops to itself with probability 0.3 and goes to r with probability 0.7; r goes back to q or loops to itself, each with probability 0.5; q emits b with probability 0.6 and a with probability 0.4; r emits b with probability 0.8 and a with probability 0.2.]

What is the best sequence of states for the input string "bbba"?
Computing all possible paths and finding the one with the max probability is exponential.
A smaller example (con't)

- For each state, store the most likely sequence that could lead to it (and its probability)
- Path probability matrix:
  - An array of states versus time (tags versus words)
  - That stores the probability of being at each state at each time, in terms of the probability of being in each state at the preceding time

Input sequence / time:   ε → b                b → b                       bb → b                         bbb → a

leading to q
  from start    ε→q  0.6      (1.0×0.6)
  coming from q                        q→q  0.108   (0.6×0.3×0.6)   qq→q  0.01944  (0.108×0.3×0.6)   qrq→q  0.012096  (0.1008×0.3×0.4)
  coming from r                        r→q  0       (0×0.5×0.6)     qr→q  0.1008   (0.336×0.5×0.6)   qrr→q  0.02688   (0.1344×0.5×0.4)

leading to r
  from start    ε→r  0        (0×0.8)
  coming from q                        q→r  0.336   (0.6×0.7×0.8)   qq→r  0.06048  (0.108×0.7×0.8)   qrq→r  0.014112  (0.1008×0.7×0.2)
  coming from r                        r→r  0       (0×0.5×0.8)     qr→r  0.1344   (0.336×0.5×0.8)   qrr→r  0.01344   (0.1344×0.5×0.2)
Viterbi intuition: we are looking for the best 'path'

[Trellis figure: columns S1–S5 for the words "promised to back the bill"; each column lists the candidate tags for its word (e.g. VBN/VBD for "promised", TO for "to", VB/NN/RB/JJ for "back", DT for "the", NN/VB for "bill"), and Viterbi finds the best path through them.]

Slide from Dekang Lin
The Viterbi Algorithm

[Figure]
Intuition

- The value in each cell is computed by taking the MAX over all paths that lead to this cell
- An extension of a path from state i at time t-1 is computed by multiplying:
  - the previous path probability from the previous cell, viterbi[t-1, i]
  - the transition probability aij from previous state i to current state j
  - the observation likelihood bj(ot) that current state j matches observation symbol ot
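The recurrence above can be sketched in a few lines of Python and checked against the small q/r example (the transition and emission numbers below are taken from that example):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for obs, by dynamic programming."""
    # V[t][s] = (probability of the best path ending in state s at time t, that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for t in range(1, len(obs)):
        col = {}
        for j in states:
            # extend the best predecessor path: viterbi[t-1, i] * a_ij * b_j(o_t)
            col[j] = max((V[t - 1][i][0] * trans_p[i][j] * emit_p[j][obs[t]],
                          V[t - 1][i][1] + [j]) for i in states)
        V.append(col)
    return max(V[-1].values())  # (best probability, best path)
```

On the q/r machine with input "bbba" this recovers the winning cell of the path probability matrix: probability 0.02688 along the path q, r, r, q.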
Viterbi example

[Figure]
Smoothing of probabilities

- Data sparseness is a problem when estimating probabilities based on corpus data.
- The "add one" smoothing technique:

  P(w1,n) = (C(w1,n) + 1) / (N + B)

  C: absolute frequency
  N: number of training instances
  B: number of different types

- Linear interpolation methods can compensate for data sparseness with higher order models. A common method is interpolating trigrams, bigrams and unigrams:

  P(ti | t1,i-1) = λ1 P1(ti) + λ2 P2(ti | ti-1) + λ3 P3(ti | ti-1,i-2)

  0 ≤ λi ≤ 1,  Σi λi = 1

- The lambda values are automatically determined using a variant of the Expectation Maximization algorithm.
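Both formulas above are one-liners; a minimal sketch:

```python
def add_one(count, N, B):
    """'Add one' estimate: (C + 1) / (N + B)."""
    return (count + 1) / (N + B)

def interpolate(p_uni, p_bi, p_tri, lambdas):
    """Linear interpolation of unigram, bigram and trigram estimates;
    lambdas = (l1, l2, l3) must be nonnegative and sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri
```

For an unseen event (count 0) with N = 4 training instances and B = 3 types, add-one gives 1/7 instead of 0, which is the point of the technique.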
Possible improvements

- In bigram POS tagging, we condition a tag only on the preceding tag
- Why not...
  - use more context (e.g. use a trigram model)
    - more precise:
      - "is clearly marked" --> verb, past participle
      - "he clearly marked" --> verb, past tense
  - combine trigram, bigram, unigram models
  - condition on words too
- But with an n-gram approach, this is too costly (too many parameters to model)
Further issues with Markov Model tagging

- Unknown words are a problem since we don't have the required probabilities. Possible solutions:
  - Assign the word probabilities based on the corpus-wide distribution of POS
  - Use morphological cues (capitalization, suffix) to make a more informed guess
- Using higher order Markov models:
  - A trigram model captures more context
  - However, data sparseness is much more of a problem
TnT

- Efficient statistical POS tagger developed by Thorsten Brants, ANLP-2000
- Underlying model (trigram modelling):

  t̂1…t̂T = argmax_{t1…tT} [ Π_{i=1}^{T} P(ti | ti-1, ti-2) P(wi | ti) ] P(tT+1 | tT)

  - The probability of a POS tag only depends on its two preceding POS tags
  - The probability of a word appearing at a particular position, given that its POS occurs at that position, is independent of everything else
Training

- Maximum likelihood estimates:

  Unigrams:  P̂(t3) = c(t3) / N
  Bigrams:   P̂(t3 | t2) = c(t2, t3) / c(t2)
  Trigrams:  P̂(t3 | t1, t2) = c(t1, t2, t3) / c(t1, t2)
  Lexical:   P̂(w3 | t3) = c(w3, t3) / c(t3)

- Smoothing: a context-independent variant of linear interpolation:

  P(t3 | t1, t2) = λ1 P̂(t3) + λ2 P̂(t3 | t2) + λ3 P̂(t3 | t1, t2)
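The maximum likelihood estimates above are just ratios of counts, so they can be read off a tagged corpus directly. A minimal sketch (`tnt_mle` is a hypothetical helper name, not from TnT itself):

```python
from collections import Counter

def tnt_mle(tagged):
    """MLE estimates for a TnT-style trigram tagger.

    tagged: list of (word, tag) pairs, in corpus order."""
    tags = [t for _, t in tagged]
    N = len(tags)
    uni = Counter(tags)                         # c(t)
    bi = Counter(zip(tags, tags[1:]))           # c(t', t)
    tri = Counter(zip(tags, tags[1:], tags[2:]))  # c(t'', t', t)
    lex = Counter(tagged)                       # c(w, t)
    return {
        "unigram": lambda t3: uni[t3] / N,
        "bigram":  lambda t2, t3: bi[(t2, t3)] / uni[t2],
        "trigram": lambda t1, t2, t3: tri[(t1, t2, t3)] / bi[(t1, t2)],
        "lexical": lambda w3, t3: lex[(w3, t3)] / uni[t3],
    }
```

On a toy corpus "the/DT dog/NN barks/VBZ the/DT cat/NN", P̂(DT) = 2/5 and P̂(NN | DT) = 2/2 = 1.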
Smoothing algorithm

- Set λi = 0
- For each trigram t1 t2 t3 with f(t1,t2,t3) > 0:
  - Depending on which of the following three values is the maximum:
    - (f(t1,t2,t3) - 1) / f(t1,t2):  increment λ3 by f(t1,t2,t3)
    - (f(t2,t3) - 1) / f(t2):        increment λ2 by f(t1,t2,t3)
    - (f(t3) - 1) / (N - 1):         increment λ1 by f(t1,t2,t3)
- Normalize the λi
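The lambda-estimation loop above can be sketched directly; this follows the slide's version of the algorithm (the TnT paper subtracts 1 in the denominators as well), and breaks ties by taking the first maximum, which the slide leaves unspecified:

```python
from collections import Counter

def deleted_interpolation(tags):
    """Estimate (l1, l2, l3) by the counting algorithm on the slide."""
    N = len(tags)
    uni = Counter(tags)
    bi = Counter(zip(tags, tags[1:]))
    tri = Counter(zip(tags, tags[1:], tags[2:]))
    lam = [0.0, 0.0, 0.0]  # lam[0]=l1 (unigram), lam[1]=l2 (bigram), lam[2]=l3 (trigram)
    for (t1, t2, t3), f in tri.items():
        cases = [
            (f - 1) / (N - 1),             # unigram evidence  -> l1
            (bi[(t2, t3)] - 1) / uni[t2],  # bigram evidence   -> l2
            (f - 1) / bi[(t1, t2)],        # trigram evidence  -> l3
        ]
        lam[cases.index(max(cases))] += f  # credit the winning order with f(t1,t2,t3)
    total = sum(lam)
    return [x / total for x in lam]
```

After normalization the three lambdas are nonnegative and sum to 1, as the interpolation requires.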
Evaluation of POS taggers

- Compared with a gold standard of human performance
- Metric:
  - accuracy = % of tags that are identical to the gold standard
- Most taggers achieve ~96-97% accuracy
- Must compare accuracy to:
  - ceiling (best possible results)
    - How do human annotators score compared to each other? (96-97%)
    - So systems are not bad at all!
  - baseline (worst possible results)
    - What if we take the most likely tag (unigram model) regardless of previous tags? (90-91%)
    - So anything less is really bad
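Both the accuracy metric and the most-likely-tag baseline are a few lines each. A minimal sketch (the fallback tag "NN" for unseen words is an assumption for illustration):

```python
from collections import Counter, defaultdict

def accuracy(pred, gold):
    """Fraction of predicted tags identical to the gold standard."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def unigram_baseline(train, words, default="NN"):
    """Tag each word with its most frequent tag in training,
    ignoring context; unseen words get a hypothetical default tag."""
    freq = defaultdict(Counter)
    for w, t in train:
        freq[w][t] += 1
    return [freq[w].most_common(1)[0][0] if w in freq else default
            for w in words]
```

With training data where "back" is NN twice and VB once, the baseline tags every occurrence of "back" as NN, regardless of the surrounding tags.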
More on tagger accuracy

- Is 95% good?
  - That's 5 mistakes every 100 words
  - If, on average, a sentence is 20 words, that's 1 mistake per sentence
- When comparing tagger accuracy, beware of:
  - size of training corpus
    - the bigger, the better the results
  - difference between training & testing corpora (genre, domain…)
    - the closer, the better the results
  - size of tag set
    - prediction versus classification
  - unknown words
    - the more unknown words (not in dictionary), the worse the results
Error Analysis

- Look at a confusion matrix (contingency table)
  - E.g. 4.4% of the total errors caused by mistagging VBD as VBN
- See what errors are causing problems:
  - Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
  - Adverb (RB) vs Particle (RP) vs Prep (IN)
  - Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)

ERROR ANALYSIS IS ESSENTIAL!!!
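A confusion matrix is just a count over (gold, predicted) pairs; a minimal sketch, with a helper to surface the most frequent confusions:

```python
from collections import Counter

def confusion_matrix(gold, pred):
    """Counts of (gold tag, predicted tag) pairs; off-diagonal cells are errors."""
    return Counter(zip(gold, pred))

def top_confusions(gold, pred, n=3):
    """The n most frequent (gold, predicted) error cells."""
    cm = confusion_matrix(gold, pred)
    errors = {cell: count for cell, count in cm.items() if cell[0] != cell[1]}
    return sorted(errors, key=errors.get, reverse=True)[:n]
```

On held-out data, a large (VBD, VBN) cell immediately points at the preterite/participle confusion mentioned above.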
Tag indeterminacy

[Figure]
Major difficulties in POS tagging

- Unknown words (e.g. proper names)
  - because we do not know the set of tags they can take
  - and knowing this takes you a long way (cf. the baseline POS tagger)
  - possible solutions:
    - assign all possible tags, with a probability distribution identical to that of the lexicon as a whole
    - use morphological cues to infer possible tags
      - e.g. words ending in -ed are likely to be past tense verbs or past participles
- Frequently confused tag pairs
  - preposition vs particle
    - <running> <up> a hill (prep) / <running up> a bill (particle)
  - verb past tense vs. past participle vs. adjective
Unknown Words

- Most-frequent-tag approach:
  - What about words that don't appear in the training set?
- Suffix analysis:
  - The probability distribution for a particular suffix is generated from all words in the training set that share the same suffix
  - Suffix estimation: calculate the probability of a tag t given the last i letters of an n-letter word
  - Smoothing: successive abstraction through sequences of increasingly more general contexts (i.e., omit more and more characters of the suffix)
- Use a morphological analyzer to get the restrictions on the possible tags
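The suffix-analysis idea can be sketched as a lookup from suffixes to tag counts, backing off from longer suffixes to shorter ones (a crude stand-in for successive abstraction; the default tag for a total miss is an assumption):

```python
from collections import Counter, defaultdict

def suffix_tag_probs(train, max_suffix=3):
    """Tag counts per suffix, gathered from all training words sharing that suffix."""
    by_suffix = defaultdict(Counter)
    for w, t in train:
        for i in range(1, min(max_suffix, len(w)) + 1):
            by_suffix[w[-i:]][t] += 1
    return by_suffix

def guess_tag(word, by_suffix, max_suffix=3, default="NN"):
    """Back off from the longest matching suffix to shorter ones."""
    for i in range(min(max_suffix, len(word)), 0, -1):
        counts = by_suffix.get(word[-i:])
        if counts:
            return counts.most_common(1)[0][0]
    return default  # hypothetical fallback for a word with no known suffix
```

Trained on "walked/VBD, talked/VBD, nation/NN, station/NN", the unseen word "jogged" backs off to the suffix -ed and is guessed VBD.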
Unknown words

[Figure]
Alternative graphical models for part of speech tagging
Different Models for POS tagging

- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields
Hidden Markov Model (HMM): Generative Modeling

[Diagram: a source model P(Y) generates the label sequence y, which passes through a noisy channel P(X|Y) to produce the observation sequence x]

  Source model:   P(y) = Π_i P(yi | yi-1)
  Noisy channel:  P(x | y) = Π_i P(xi | yi)
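Multiplying the two factorizations gives the joint probability the generative model assigns to a paired sequence; a minimal sketch:

```python
def joint_prob(x, y, trans, emit, start):
    """P(x, y) = P(y) * P(x|y): source model times noisy channel."""
    p = start[y[0]] * emit[y[0]][x[0]]
    for i in range(1, len(y)):
        p *= trans[y[i - 1]][y[i]] * emit[y[i]][x[i]]
    return p
```

With the q/r machine from the Viterbi example, the path q,r over input "bb" gets 1.0 × 0.6 × 0.7 × 0.8 = 0.336, matching the corresponding cell of the path probability matrix.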
Dependency (1st order)

[Figure: a first-order HMM drawn as a graphical model. Each state Yk depends on the previous state Yk-1 through P(Yk | Yk-1), and each observation Xk depends only on its own state through P(Xk | Yk).]
Disadvantage of HMMs (1)

- No rich feature information
  - Rich information is required:
    - when xk is complex
    - when data for xk is sparse
  - Example: POS tagging
    - How do we evaluate P(wk | tk) for unknown words wk?
    - Useful features:
      - suffix, e.g., -ed, -tion, -ing, etc.
      - capitalization
- Generative model
  - Parameter estimation: maximize the joint likelihood of training examples

    Σ_{(x,y)∈T} log2 P(X = x, Y = y)
Generative Models

- Hidden Markov models (HMMs) and stochastic grammars
  - Assign a joint probability to paired observation and label sequences
  - The parameters are typically trained to maximize the joint likelihood of training examples
Generative Models (cont'd)

- Difficulties and disadvantages
  - Need to enumerate all possible observation sequences
  - Not practical to represent multiple interacting features or long-range dependencies of the observations
  - Very strict independence assumptions on the observations
- Better approach
  - Discriminative model, which models P(y | x) directly
  - Maximize the conditional likelihood of training examples:

    Σ_{(x,y)∈T} log2 P(Y = y | X = x)
Maximum Entropy modeling

- N-gram model: probabilities depend on the previous few tokens.
- We may identify a more heterogeneous set of features which contribute in some way to the choice of the current word (whether it is the first word in a story, whether the next word is "to", whether one of the last 5 words is a preposition, etc.)
- Maxent combines these features in a probabilistic model.
- The given features provide constraints on the model.
- We would like a probability distribution which, outside of these constraints, is as uniform as possible – one that has the maximum entropy among all models that satisfy these constraints.
Maximum Entropy Markov Model

- Discriminative sub-models
  - Unify the two parameters of the generative model into one conditional model
    - the two generative parameters: the source-model parameter P(yk | yk-1) and the noisy-channel parameter P(xk | yk)
    - the unified conditional model: P(yk | xk, yk-1)
  - Employ the maximum entropy principle:

    P(y | x) = Π_i P(yi | yi-1, xi)

  → Maximum Entropy Markov Model
General Maximum Entropy Principle

- Model
  - Model the distribution P(Y | X) with a set of features {f1, f2, …, fl} defined on X and Y
- Idea
  - Collect information about the features from training data
- Principle
  - Model what is known
  - Assume nothing else
    - → the flattest distribution
    - → the distribution with the maximum entropy
Example

- (Berger et al., 1996) example
  - Model the translation of the word "in" from English to French
    - Need to model P(wordFrench)
    - Constraints:
      - 1: Possible translations: dans, en, à, au cours de, pendant
      - 2: "dans" or "en" is used 30% of the time
      - 3: "dans" or "à" is used 50% of the time
Features

- Features are 0-1 indicator functions:
  - 1 if (x, y) satisfies a predefined condition
  - 0 if not
- Example: POS tagging

  f1(x, y) = 1 if x ends with -tion and y is NN, 0 otherwise
  f2(x, y) = 1 if x starts with a capitalization and y is NNP, 0 otherwise
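The two indicator features above translate directly into code:

```python
def f1(x, y):
    # 1 if the word ends with -tion and the tag is NN
    return int(x.endswith("tion") and y == "NN")

def f2(x, y):
    # 1 if the word starts with a capital letter and the tag is NNP
    return int(x[:1].isupper() and y == "NNP")
```

Each feature fires only when both its condition on the word and its condition on the tag hold.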
Constraints

- Empirical information
  - Statistics from the training data T:

    P̂(fi) = (1/|T|) Σ_{(x,y)∈T} fi(x, y)

- Expected value
  - From the distribution P(Y | X) we want to model:

    P(fi) = (1/|T|) Σ_{(x,y)∈T} Σ_{y'∈D(Y)} P(Y = y' | X = x) fi(x, y')

- Constraints:

    P̂(fi) = P(fi)
Maximum Entropy: Objective

- Entropy

  I = -(1/|T|) Σ_{(x,y)∈T} Σ_{y'} P(Y = y' | X = x) log2 P(Y = y' | X = x)
    = -Σ_x P̂(x) Σ_y P(Y = y | X = x) log2 P(Y = y | X = x)

- Maximization problem

  max_{P(Y|X)} I    subject to    P̂(f) = P(f)
Dual Problem

- Dual problem
  - Conditional model:

    P(Y = y | X = x) ∝ exp( Σ_{i=1}^{l} λi fi(x, y) )

  - Maximum likelihood of conditional data:

    max_{λ1,…,λl} Σ_{(x,y)∈T} log2 P(Y = y | X = x)

- Solution
  - Improved iterative scaling (IIS) (Berger et al. 1996)
  - Generalized iterative scaling (GIS) (McCallum et al. 2000)
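The exponential form of the conditional model is easy to evaluate once the weights are fixed; a minimal sketch (learning the λi via IIS/GIS is not shown):

```python
import math

def maxent_prob(x, y, tags, feats, lambdas):
    """P(y|x) proportional to exp(sum_i lambda_i * f_i(x, y)),
    normalized over the candidate tags."""
    def score(t):
        return math.exp(sum(l * f(x, t) for f, l in zip(feats, lambdas)))
    z = sum(score(t) for t in tags)  # observation-dependent normalizer
    return score(y) / z
```

With one feature (word ends in -tion and tag is NN) of weight 1.0 and candidate tags {NN, VB}, the word "nation" gets P(NN) = e/(e+1) ≈ 0.731, and the probabilities over the tag set sum to 1.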
Maximum Entropy Markov Model

- Use the maximum entropy approach to model (1st order):

  P(Yk = yk | Xk = xk, Yk-1 = yk-1)

- Features
  - Basic features (like parameters in HMM):
    - Bigram (1st order) or trigram (2nd order) in the source model
    - State-output pair feature (Xk = xk, Yk = yk)
  - Advantage: can incorporate other advanced features on (xk, yk)
HMM vs MEMM (1st order)

[Figure, side by side. HMM: Yk-1 → Yk with P(Yk | Yk-1), and Yk → Xk with P(Xk | Yk). MEMM: a single conditional P(Yk | Xk, Yk-1), with arrows from both Yk-1 and Xk into Yk.]
Performance in POS Tagging

- POS tagging
  - Data set: WSJ
  - Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
- Results (Lafferty et al. 2001)
  - 1st order HMM: 94.31% accuracy, 54.01% OOV accuracy
  - 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy
ME applications

- Part of Speech (POS) Tagging (Ratnaparkhi, 1996)
  - P(POS tag | context)
  - Information sources:
    - Word window (4)
    - Word features (prefix, suffix, capitalization)
    - Previous POS tags
ME applications

- Abbreviation expansion (Pakhomov, 2002)
  - Information sources:
    - Word window (4)
    - Document title
- Word Sense Disambiguation (WSD) (Chao & Dyer, 2002)
  - Information sources:
    - Word window (4)
    - Structurally related words (4)
- Sentence Boundary Detection (Reynar & Ratnaparkhi, 1997)
  - Information sources:
    - Token features (prefix, suffix, capitalization, abbreviation)
    - Word window (2)
Solution

- Global optimization
  - Optimize parameters in a global model simultaneously, not in sub-models separately
- Alternatives
  - Conditional random fields
  - Application of the perceptron algorithm
Why ME?

- Advantages
  - Combine multiple knowledge sources
    - Local
      - Word prefix, suffix, capitalization (POS – Ratnaparkhi, 1996)
      - Word POS, POS class, suffix (WSD – Chao & Dyer, 2002)
      - Token prefix, suffix, capitalization, abbreviation (sentence boundary – Reynar & Ratnaparkhi, 1997)
    - Global
      - N-grams (Rosenfeld, 1997)
      - Word window
      - Document title (Pakhomov, 2002)
      - Structurally related words (Chao & Dyer, 2002)
      - Sentence length, conventional lexicon (Och & Ney, 2002)
  - Combine dependent knowledge sources
Why ME?

- Advantages
  - Add additional knowledge sources
  - Implicit smoothing
- Disadvantages
  - Computational cost
    - Expected value at each iteration
    - Normalizing constant
  - Overfitting
    - Feature selection
      - Cutoffs
      - Basic feature selection (Berger et al., 1996)
Conditional Models

- Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
  - Specify the probability of possible label sequences given an observation sequence
- Allow arbitrary, non-independent features on the observation sequence X
- The probability of a transition between labels may depend on past and future observations
- Relax the strong independence assumptions of generative models
Discriminative Models:
Maximum Entropy Markov Models (MEMMs)

- Exponential model
- Given training set X with label sequence Y:
  - Train a model θ that maximizes P(Y | X, θ)
  - For a new data sequence x, the predicted label y maximizes P(y | x, θ)
  - Notice the per-state normalization
MEMMs (cont'd)

- MEMMs have all the advantages of conditional models
- Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states ("conservation of score mass")
- Subject to the Label Bias Problem
  - Bias toward states with fewer outgoing transitions
Label Bias Problem

- Consider this MEMM:

  [Figure: a small MEMM in which state 1 has only one outgoing transition, to state 2]

- In the training data, label value 2 is the only label value observed after label value 1
- Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
- Now compare:

  P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
  P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)

- Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
- However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro)
- Per-state normalization does not allow the required expectation
Solve the Label Bias Problem

- Change the state-transition structure of the model
  - Not always practical to change the set of states
- Start with a fully-connected model and let the training procedure figure out a good structure
  - Precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction)
Random Field

[Figure]
Conditional Random Fields (CRFs)

- CRFs have all the advantages of MEMMs without the label bias problem
  - An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
  - A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
  - Undirected acyclic graph
  - Allows some transitions to "vote" more strongly than others, depending on the corresponding observations
Definition of CRFs

- X is a random variable over data sequences to be labeled
- Y is a random variable over corresponding label sequences
Example of CRFs

[Figure]

Graphical comparison among HMMs, MEMMs and CRFs

[Figure: HMM (directed, arrows from labels to observations), MEMM (directed, arrows from observations into labels), CRF (an undirected chain over the labels, conditioned on the observations)]
Conditional Distribution

If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is:

  pθ(y | x) ∝ exp( Σ_{e∈E,k} λk fk(e, y|e, x) + Σ_{v∈V,k} μk gk(v, y|v, x) )

- x is a data sequence
- y is a label sequence
- v is a vertex from the vertex set V = set of label random variables
- e is an edge from the edge set E over V
- fk and gk are given and fixed; gk is a Boolean vertex feature, fk is a Boolean edge feature
- k is the number of features
- θ = (λ1, λ2, …, λn; μ1, μ2, …, μn); the λk and μk are parameters to be estimated
- y|e is the set of components of y defined by edge e
- y|v is the set of components of y defined by vertex v
Conditional Distribution (cont'd)

- CRFs use an observation-dependent normalization Z(x) for the conditional distributions:

  pθ(y | x) = (1 / Z(x)) exp( Σ_{e∈E,k} λk fk(e, y|e, x) + Σ_{v∈V,k} μk gk(v, y|v, x) )

- Z(x) is a normalization over the data sequence x
Parameter Estimation for CRFs

- The paper provides iterative scaling algorithms
  - These turn out to be very inefficient
- Prof. Dietterich's group applied a gradient descent algorithm, which is quite efficient
Training of CRFs (From Prof. Dietterich)

- First, we take the log of the equation:

  log pθ(y | x) = Σ_{e∈E,k} λk fk(e, y|e, x) + Σ_{v∈V,k} μk gk(v, y|v, x) - log Z(x)

- Then, take the derivative of the above equation:

  (∂/∂θ) log pθ(y | x) = (∂/∂θ) [ Σ_{e∈E,k} λk fk(e, y|e, x) + Σ_{v∈V,k} μk gk(v, y|v, x) - log Z(x) ]

- For training, the first 2 terms are easy to get.
  - For example, for each k, fk over the edges is a sequence of Boolean numbers, such as 00101110100111; Σ_e fk(e, y|e, x) is just the total number of 1's in the sequence.
- The hardest part is how to calculate Z(x)
Training of CRFs (From Prof. Dietterich) (cont'd)

- Maximal cliques: for the chain y1 – y2 – y3 – y4, the maximal cliques are c1 = {y1, y2}, c2 = {y2, y3}, c3 = {y3, y4}

  c1: exp( μ(y1, x) + μ(y2, x) + λ(y1, y2, x) ) = Φc1(y1, y2, x)
  c2: exp( μ(y3, x) + λ(y2, y3, x) ) = Φc2(y2, y3, x)
  c3: exp( μ(y4, x) + λ(y3, y4, x) ) = Φc3(y3, y4, x)

  Z(x) = Σ_{y1,y2,y3,y4} Φc1(y1, y2, x) Φc2(y2, y3, x) Φc3(y3, y4, x)
       = Σ_{y1} Σ_{y2} Φc1(y1, y2, x) Σ_{y3} Φc2(y2, y3, x) Σ_{y4} Φc3(y3, y4, x)
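The distributive trick in the last line above (pushing each sum as far inward as possible) is what makes Z(x) tractable. A minimal sketch over a 4-label chain, with arbitrary clique potential tables standing in for the exponentiated feature sums:

```python
from itertools import product

def z_bruteforce(phi1, phi2, phi3, labels):
    """Z(x): sum the product of clique potentials over all label assignments."""
    return sum(phi1[y1][y2] * phi2[y2][y3] * phi3[y3][y4]
               for y1, y2, y3, y4 in product(labels, repeat=4))

def z_distributed(phi1, phi2, phi3, labels):
    """Same Z(x), but with the sums pushed inward (linear in chain length)."""
    m3 = {y3: sum(phi3[y3][y4] for y4 in labels) for y3 in labels}
    m2 = {y2: sum(phi2[y2][y3] * m3[y3] for y3 in labels) for y2 in labels}
    return sum(phi1[y1][y2] * m2[y2] for y1 in labels for y2 in labels)
```

Both computations give the same value; only the distributed version avoids enumerating all |Y|^4 assignments.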
POS tagging Experiments

[Figure]
POS tagging Experiments (cont'd)

- Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
- Each word in a given input sentence must be labeled with one of 45 syntactic tags
- Add a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
- oov = out-of-vocabulary (not observed in the training set)
Summary

- Per-state normalized discriminative models (MEMMs) are prone to the label bias problem
- CRFs provide the benefits of discriminative models
- CRFs solve the label bias problem well, and demonstrate good performance