Sequence Learning

Download Report

Transcript Sequence Learning

Parts of Speech
Sudeshna Sarkar
7 Aug 2008
1
Why Do We Care about Parts of Speech?
•Pronunciation
Hand me the lead pipe.
•Predicting what words can be expected next
Personal pronoun (e.g., I, she) ____________
•Stemming
-s means singular for verbs, plural for nouns
•As the basis for syntactic parsing and then meaning extraction
I will lead the group into the lead smelter.
•Machine translation
• (E) content +N  (F) contenu +N
• (E) content +Adj  (F) content +Adj or satisfait +Adj
2
What is a Part of Speech?
Is this a semantic distinction? For example, maybe Noun is the
class of words for people, places and things. Maybe Adjective
is the class of words for properties of nouns.
Consider:
green book
book is a Noun
green is an Adjective
Now consider:
book worm
This green is very soothing.
3
How Many Parts of Speech Are There?
A first cut at the easy distinctions:
Open classes:
•nouns, verbs, adjectives, adverbs
Closed classes: function words
•conjunctions: and, or, but
•pronounts: I, she, him
•prepositions: with, on
•determiners: the, a, an
4
Part of speech tagging
8 (ish) traditional parts of speech
Noun, verb, adjective, preposition, adverb, article,
interjection, pronoun, conjunction, etc
This idea has been around for over 2000 years (Dionysius
Thrax of Alexandria, c. 100 B.C.)
Called: parts-of-speech, lexical category, word classes,
morphological classes, lexical tags, POS
We’ll use POS most frequently
I’ll assume that you all know what these are
5
POS examples
N
V
ADJ
ADV
P
PRO
DET
noun
verb
adj
adverb
preposition
pronoun
determiner
chair, bandwidth, pacing
study, debate, munch
purple, tall, ridiculous
unfortunately, slowly,
of, by, to
I, me, mine
the, a, that, those
6
Tagsets
Brown corpus tagset (87 tags):
http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html
Penn Treebank tagset (45 tags):
http://www.cs.colorado.edu/~martin/SLP/Figures/ (8.6)
C7 tagset (146 tags)
http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
7
POS Tagging: Definition
The process of assigning a part-of-speech or lexical
class marker to each word in a corpus:
WORDS
the
koala
put
the
keys
on
the
table
TAGS
N
V
P
DET
8
POS Tagging example
WORD
tag
the
koala
put
the
keys
on
the
table
DET
N
V
DET
N
P
DET
N
9
POS tagging: Choosing a tagset
There are so many parts of speech, potential
distinctions we can draw
To do POS tagging, need to choose a standard set of
tags to work with
Could pick very coarse tagets
N, V, Adj, Adv.
More commonly used set is finer grained, the “UPenn
TreeBank tagset”, 45 tags
PRP$, WRB, WP$, VBG
Even more fine-grained tagsets exist
10
Penn TreeBank POS Tag set
11
Using the UPenn tagset
The/DT grand/JJ jury/NN commmented/VBD on/IN
a/DT number/NN of/IN other/JJ topics/NNS ./.
Prepositions and subordinating conjunctions marked
IN (“although/IN I/PRP..”)
Except the preposition/complementizer “to” is just
marked “to”.
12
POS Tagging
Words often have more than one POS: back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the
POS tag for a particular instance of a word.
13
How hard is POS tagging? Measuring
ambiguity
14
Algorithms for POS Tagging
•Ambiguity – In the Brown corpus, 11.5% of the word
types are ambiguous (using 87 tags):
Worse, 40% of the tokens are ambiguous.
15
Algorithms for POS Tagging
Why can’t we just look them up in a dictionary?
•Words that aren’t in the dictionary
http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid
=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc
•One idea: P(ti | wi) = the probability that a random hapax
legomenon in the corpus has tag ti.
Nouns are more likely than verbs, which are more likely
than pronouns.
•Another idea: use morphology.
16
Algorithms for POS Tagging - Knowledge
•Dictionary
•Morphological rules, e.g.,
•_____-tion
•_____-ly
•capitalization
•N-gram frequencies
•to _____
•DET _____ N
•But what about rare words, e.g, smelt (two verb forms, melt and past
tense of smell, and one noun form, a small fish)
•Combining these
• V _____-ing
I was gracking vs. Gracking is fun.
17
POS Tagging - Approaches
Approaches
Rule-based tagging
(ENGTWOL)
Stochastic (=Probabilistic) tagging
HMM (Hidden Markov Model) tagging
Transformation-based tagging
Brill tagger
•
Do we return one best answer or several answers and let later
steps decide?
•
How does the requisite knowledge get entered?
18
3 methods for POS tagging
1.
Rule-based tagging
Example: Karlsson (1995) EngCG tagger based on the
Constraint Grammar architecture and ENGTWOL lexicon
– Basic Idea:
 Assign all possible tags to words (morphological analyzer
used)
 Remove wrong tags according to set of constraint rules
(typically more than 1000 hand-written constraint rules,
but may be machine-learned)
19
3 methods for POS tagging
2. Transformation-based tagging
Example: Brill (1995) tagger - combination of rule-based and stochastic
(probabilistic) tagging methodologies
– Basic Idea:
 Start with a tagged corpus + dictionary (with most frequent tags)
 Set the most probable tag for each word as a start value
 Change tags according to rules of type “if word-1 is a determiner
and word is a verb then change the tag to noun” in a specific
order (like rule-based taggers)
 machine learning is used—the rules are automatically induced
from a previously tagged training corpus (like stochastic
approach)
20
3 methods for POS tagging
3. Stochastic (=Probabilistic) tagging
Example: HMM (Hidden Markov Model) tagging - a training
corpus used to compute the probability (frequency) of a
given word having a given POS tag in a given context
21
Hidden Markov Model (HMM) Tagging
Using an HMM to do POS tagging
HMM is a special case of Bayesian inference
It is also related to the “noisy channel” model in ASR
(Automatic Speech Recognition)
22
Hidden Markov Model (HMM) Taggers
Goal: maximize P(word|tag) x P(tag|previous n tags)
Lexical information
Syntagmatic information
P(word|tag)
word/lexical likelihood
probability that given this tag, we have this word
NOT probability that this word has this tag
modeled through language model (word-tag matrix)
P(tag|previous n tags)
tag sequence likelihood
probability that this tag follows these previous tags
modeled through language model (tag-tag matrix)
23
POS tagging as a sequence classification task
We are given a sentence (an “observation” or “sequence of
observations”)
Secretariat is expected to race tomorrow
sequence of n words w1…wn.
What is the best sequence of tags which corresponds to this
sequence of observations?
Probabilistic/Bayesian view:
Consider all possible sequences of tags
Out of this universe of sequences, choose the tag sequence which
is most probable given the observation sequence of n words
w1…wn.
24
Getting to HMM
Let T = t1,t2,…,tn
Let W = w1,w2,…,wn
Goal: Out of all sequences of tags t1…tn, get the the most probable
sequence of POS tags T underlying the observed sequence of
words w1,w2,…,wn
Hat ^ means “our estimate of the best = the most probable tag sequence”
Argmaxx f(x) means “the x such that f(x) is maximized”
it maximazes our estimate of the best tag sequence
25
Getting to HMM
This equation is guaranteed to give us the best tag sequence
But how do we make it operational? How do we compute this value?
Intuition of Bayesian classification:
Use Bayes rule to transform it into a set of other probabilities that
are easier to compute
Thomas Bayes: British mathematician (1702-1761)
26
Bayes Rule
Breaks down any conditional probability P(x|y) into three other
probabilities
P(x|y): The conditional probability of an event x assuming that y
has occurred
27
Bayes Rule
We can drop the denominator: it does not change for each tag
sequence; we are looking for the best tag sequence for the
same observation, for the same fixed set of words
28
Bayes Rule
29
Likelihood and prior
n
30
Likelihood and prior
Further Simplifications
1. the probability of a word appearing depends only on its own POS tag,
i.e, independent of other words around it
n
2. BIGRAM assumption: the probability of a tag appearing depends only
on the previous tag
3. The most probable tag sequence estimated by the bigram tagger
31
Likelihood and prior
Further Simplifications
1. the probability of a word appearing depends only on its own POS tag,
i.e, independent of other words around it
n
WORDS
the
koala
put
the
keys
on
the
table
TAGS
N
V
P
DET
32
Likelihood and prior
Further Simplifications
2. BIGRAM assumption: the probability of a tag appearing depends only
on the previous tag
Bigrams are groups of two written letters, two syllables, or two words; they
are a special case of N-gram.
Bigrams are used as the basis for simple statistical analysis of text
The bigram assumption is related to the first-order Markov assumption
33
Likelihood and prior
Further Simplifications
3. The most probable tag sequence estimated by the bigram tagger
---------------------------------------------------------------------------------------------------------------
n
biagram assumption
34
Two kinds of probabilities (1)
Tag transition probabilities p(ti|ti-1)
Determiners likely to precede adjs and nouns
–
–
–
–
That/DT flight/NN
The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high
But P(DT|JJ) to be:?
35
Two kinds of probabilities (1)
Tag transition probabilities p(ti|ti-1)
Compute P(NN|DT) by counting in a labeled corpus:
# of times DT is followed by NN
36
Two kinds of probabilities (2)
Word likelihood probabilities p(wi|ti)
P(is|VBZ) = probability of VBZ (3sg Pres verb) being “is”
If we were expecting a third person singular verb, how likely is it that
this verb would be is?
Compute P(is|VBZ) by counting in a labeled corpus:
37
An Example: the verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT
reason/NN for/IN the/DT race/NN for/IN outer/JJ
space/NN
How do we pick the right tag?
38
Disambiguating “race”
39
Disambiguating “race”
P(NN|TO) = .00047
P(VB|TO) = .83
The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: ‘How likely
are we to expect verb/noun given the previous tag TO?’
P(race|NN) = .00057
P(race|VB) = .00012
Lexical likelihoods from the Brown corpus for ‘race’ given a POS tag NN or VB.
P(NR|VB) = .0027
P(NR|NN) = .0012
tag sequence probability for the likelihood of an adverb occurring given the previous tag
verb or noun
P(VB|TO)P(NR|VB)P(race|VB) = .00000027
P(NN|TO)P(NR|NN)P(race|NN)=.00000000032
Multiply the lexical likelihoods with the tag sequence probabiliies: the verb wins
40
Hidden Markov Models
What we’ve described with these two kinds of
probabilities is a Hidden Markov Model (HMM)
Let’s just spend a bit of time tying this into the model
In order to define HMM, we will first introduce the
Markov Chain, or observable Markov Model.
41
Definitions
A weighted finite-state automaton adds probabilities to
the arcs
The sum of the probabilities leaving any arc must sum to one
A Markov chain is a special case of a WFST in which
the input sequence uniquely determines which states
the automaton will go through
Markov chains can’t represent inherently ambiguous
problems
Useful for assigning probabilities to unambiguous sequences
42
Markov chain =
“First-order observed Markov Model”
a set of states
Q = q1, q2…qN; the state at time t is qt
a set of transition probabilities:
a set of probabilities A = a01a02…an1…ann.
Each aij represents the probability of transitioning from state i to state j
The set of these is the transition probability matrix A
aij  P(qt  j | qt1  i) 1  i, j  N
N
a
ij
 1;
1 i  N
j1
Distinguished start and end states
Special
 initial probability vector 

i the probability that the MM will start in state i, each i expresses the
probability p(qi|START)
43
Markov chain =
“First-order observed Markov Model”
Markov Chain for weather: Example 1
three types of weather: sunny, rainy, foggy
we want to find the following conditional probabilities:
P(qn|qn-1, qn-2, …, q1)
- I.e., the probability of the unknown weather on day n,
depending on the (known) weather of the preceding
days
- We could infer this probability from the relative frequency (the
statistics) of past observations of weather sequences
Problem: the larger n is, the more observations we must collect.
Suppose that n=6, then we have to collect statistics for 3(6-1) =
243 past histories
44
Markov chain =
“First-order observed Markov Model”
Therefore, we make a simplifying assumption, called the (first-order) Markov
assumption
for a sequence of observations q1, … qn,
current state only depends on previous state
the joint probability of certain past and current observations
45
Markov chain =
“First-order observable Markov Model”
46
Markov chain =
“First-order observed Markov Model”
Given that today the weather is sunny,
what's the probability that tomorrow is
sunny and the day after is rainy?
Using the Markov assumption and the
probabilities in table 1, this translates into:
47
The weather figure: specific example
Markov Chain for weather: Example 2
48
Markov chain for weather
What is the probability of 4 consecutive rainy days?
Sequence is rainy-rainy-rainy-rainy
I.e., state sequence is 3-3-3-3
P(3,3,3,3) =
1a11a11a11a11 = 0.2 x (0.6)3 = 0.0432
49
Hidden Markov Model
For Markov chains, the output symbols are the same
as the states.
See sunny weather: we’re in state sunny
But in part-of-speech tagging (and other things)
The output symbols are words
But the hidden states are part-of-speech tags
So we need an extension!
A Hidden Markov Model is an extension of a Markov
chain in which the output symbols are not the same
as the states.
This means we don’t know which state we are in.
50
Markov chain for weather
51
Markov chain for words
Observed events: words
Hidden events: tags
52
Hidden Markov Models
States Q = q1, q2…qN;
Observations O = o1, o2…oN;
Each observation is a symbol from a vocabulary V = {v1,v2,…vV}
Transition probabilities (prior)
Transition probability matrix A = {aij}
Observation likelihoods (likelihood)
Output probability matrix B={bi(ot)}
a set of observation likelihoods, each expressing the probability of an observation
ot being generated from a state i, emission probabilities
Special initial probability vector 
i the probability that the HMM will start in state i, each i expresses the probability
p(qi|START)
53
Assumptions
Markov assumption: the probability of a particular state depends
only on the previous state
P(qi | q1...qi1)  P(qi | qi1)

Output-independence assumption: the probability of an output
observation depends only on the state that produced that
observation
54
HMM for Ice Cream
You are a climatologist in the year 2799
Studying global warming
You can’t find any records of the weather in Boston,
MA for summer of 2007
But you find Jason Eisner’s diary
Which lists how many ice-creams Jason ate every
date that summer
Our job: figure out how hot it was
55
Noam task
Given
Ice Cream Observation Sequence: 1,2,3,2,2,2,3…
(cp. with output symbols)
Produce:
Weather Sequence: C,C,H,C,C,C,H …
(cp. with hidden states, causing states)
56
HMM for ice cream
57
Different types of HMM structure
Bakis = left-to-right
Ergodic =
fully-connected
58
HMM Taggers
Two kinds of probabilities
A transition probabilities (PRIOR)
B observation likelihoods (LIKELIHOOD)
HMM Taggers choose the tag sequence which
maximizes the product of word likelihood and tag
sequence probability
59
Weighted FSM corresponding to hidden states
of HMM, showing A probs
60
B observation likelihoods for POS HMM
61
The A matrix for the POS HMM
62
The B matrix for the POS HMM
63
HMM Taggers
The probabilities are trained on hand-labeled training
corpora (training set)
Combine different N-gram levels
Evaluated by comparing their output from a test set to
human labels for that test set (Gold Standard)
64
The Viterbi Algorithm
best tag sequence for "John likes to fish in the sea"?
efficiently computes the most likely state sequence given a particular
output sequence
based on dynamic programming
65
A smaller example
a
0.4
start
b
a
0.6
0.2
0.7
q
1
0.3
0.5
b
0.8
r
end
1
0.5
What is the best sequence of states for the input string
“bbba”?
Computing all possible paths and finding the one with the
max probability is exponential
66
A smaller example (con’t)
For each state, store the most likely sequence that could lead to it (and its probability)
Path probability matrix:
An array of states versus time (tags versus words)
That stores the prob. of being at each state at each time in terms of the prob. for being in each state
at the preceding time.
Best sequence
leading
to q
coming
from q
Input sequence / time
ε --> b
b --> b
bb --> b
bbb --> a
ε --> q 0.6
(1.0x0.6)
q --> q 0.108
(0.6x0.3x0.6)
qq --> q 0.01944
(0.108x0.3x0.6)
qrq --> q 0.018144
(0.1008x0.3x0.4)
r --> q 0
(0x0.5x0.6)
qr --> q 0.1008
(0.336x0.5x 0.6)
qrr --> q 0.02688
(0.1344x0.5x0.4)
q --> r 0.336
(0.6x0.7x0.8)
qq --> r 0.0648
(0.108x0.7x0.8)
qrq --> r 0.014112
(0.1008x0.7x0.2)
r --> r 0
(0x0.5x0.8)
qr --> r 0.1344
(0.336x0.5x0.8)
qrr --> r 0.01344
(0.1344x0.5x0.2)
coming
from r
leading
to r
coming
from q
coming
from r
ε --> r 0
(0x0.8)
67
Viterbi intuition: we are looking for the
best ‘path’
S1
S2
S3
S4
S5
JJ
DT
VB
VB
NNP
NN
RB
NN
VBN
TO
VBD
promised
to
back
the
bill
Slide from Dekang Lin
68
The Viterbi Algorithm
69
Intuition
The value in each cell is computed by taking the MAX
over all paths that lead to this cell.
An extension of a path from state i at time t-1 is
computed by multiplying:
Previous path probability from previous cell viterbi[t-1,i]
Transition probability aij from previous state I to current state j
Observation likelihood bj(ot) that current state j matches
observation symbol t
70
Viterbi example
71
Smoothing of probabilities
Data sparseness is a problem when estimating probabilities based on corpus data.
The “add one” smoothing technique –


P w1,n 


C w1,n  1
NB
C- absolute frequency
N: no of training instances
B: no of different types
Linear interpolation methods can compensate for data sparseness with
higher order models. A common method is interpolating trigrams, bigrams
and unigrams:


P ti | t1,i 1  1P1 (ti )  2 P2 (ti | ti 1 )  3 P3 (ti | ti 1,i  2 )
0  i  1,  i  1
i
The lambda values are automatically determined using a variant of the
Expectation Maximization algorithm.
72
Possible improvements
in bigram POS tagging, we condition a tag only on the
preceding tag
why not...
use more context (ex. use trigram model)
– more precise:
 “is clearly marked” --> verb, past participle
 “he clearly marked” --> verb, past tense
– combine trigram, bigram, unigram models
condition on words too
but with an n-gram approach, this is too costly (too many
parameters to model)
75
Further issues with Markov Model tagging
Unknown words are a problem since we don’t have the required
probabilities. Possible solutions:
Assign the word probabilities based on corpus-wide distribution of
POS
Use morphological cues (capitalization, suffix) to assign a more
calculated guess.
Using higher order Markov models:
Using a trigram model captures more context
However, data sparseness is much more of a problem.
76
TnT
Efficient statistical POS tagger developed by Thorsten Brants,
ANLP-2000
Underlying model:
T
arg max  P(ti | ti 1 , ti 2 ) P( wi | ti ) P(tT 1 | tT )
t1tT
i 1
Trigram modelling –
The probability of a POS only depends on its two preceding POS
The probability of a word appearing at a particular position given that its
POS occurs at that position is independent of everything else.
77
Training
c(t3 )
N
^
c(t , t )
Bigrams : P(t3 | t 2 )  2 3
c(t3 )
^
Maximum likelihood
estimates:
Unigrams : P (t 3 ) 
^
Trigrams : P(t3 | t1 , t 2 ) 
Lexical : P( w3 | t3 ) 
c(t1 , t 2 , t3 )
c(t 2 , t3 )
c( w3 , t3 )
c(t3 )
Smoothing : context-independent variant of linear interpolation.
^
^
^
P(t3 | t1 , t2 )  1 P(t3 )  2 P(t3 | t2 )  3 P(t3 | t1 , t2 )
78
Smoothing algorithm
Set λi=0
For each trigram t1 t2 t3 with f(t1,t2,t3 )>0
Depending on the max of the following three values:
– Case (f(t1,t2,t3 )-1)/ f(t1,t2) : incr λ3 by f(t1,t2,t3 )
– Case (f(t2,t3 )-1)/ f(t2)
: incr λ2 by f(t1,t2,t3 )
– Case (f(t3 )-1)/ N-1
: incr λ1 by f(t1,t2,t3 )
Normalize λi
79
Evaluation of POS taggers
compared with gold-standard of human performance
metric:
accuracy = % of tags that are identical to gold standard
most taggers ~96-97% accuracy
must compare accuracy to:
ceiling (best possible results)
– how do human annotators score compared to each other? (96-97%)
– so systems are not bad at all!
baseline (worst possible results)
– what if we take the most-likely tag (unigram model) regardless of previous
tags ? (90-91%)
– so anything less is really bad
80
More on tagger accuracy
is 95% good?
that’s 5 mistakes every 100 words
if on average, a sentence is 20 words, that’s 1 mistake per sentence
when comparing tagger accuracy, beware of:
size of training corpus
– the bigger, the better the results
difference between training & testing corpora (genre, domain…)
– the closer, the better the results
size of tag set
– Prediction versus classification
unknown words
– the more unknown words (not in dictionary), the worst the results
81
Error Analysis
Look at a confusion matrix (contingency table)
E.g. 4.4% of the total errors caused by mistagging VBD as VBN
See what errors are causing problems
Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
Adverb (RB) vs Particle (RP) vs Prep (IN)
Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
ERROR ANALYSIS IS ESSENTIAL!!!
82
Tag indeterminacy
83
Major difficulties in POS tagging
Unknown words (proper names)
because we do not know the set of tags it can take
and knowing this takes you a long way (cf. baseline POS tagger)
possible solutions:
– assign all possible tags with probabilities distribution identical to lexicon as a whole
– use morphological cues to infer possible tags

ex. word ending in -ed are likely to be past tense verbs or past participles
Frequently confused tag pairs
preposition vs particle
<running> <up> a hill (prep) / <running up> a bill (particle)
verb, past tense vs. past participle vs. adjective
84
Unknown Words
Most-frequent-tag approach.
What about words that don’t appear in the training set?
Suffix analysis:
The probability distribution for a particular suffix is generated from all words in
the training set that share the same suffix.
Suffix estimation – Calculate the probability of a tag t given the last i letters
of an n letter word.
Smoothing: successive abstraction through sequences of increasingly
more general contexts (i.e., omit more and more characters of the suffix)
Use a morphological analyzer to get the restriction on the possible tags.
85
Unknown words
86
Alternative graphical models for
part of speech tagging
87
Different Models for POS tagging
HMM
Maximum Entropy Markov Models
Conditional Random Fields
88
Hidden Markov Model (HMM) : Generative
Modeling
y
Source Model
PY
P(y )   P( yi | yi 1 )
i
x
Noisy Channel
PX|Y
P(x | y )   P( xi | yi )
i
89
Dependency (1st order)
X k 2
X k 1
P ( X k  2 | Yk  2 )
P ( X k 1 | Yk 1 )
P (Yk 1 | Yk  2 )
Yk  2
P ( X k | Yk )
P (Yk | Yk 1 )
Yk 1
X k 1
Xk
P ( X k 1 | Yk 1 )
P (Yk 1 | Yk )
Yk
Yk 1
90
Disadvantage of HMMs (1)
No Rich Feature Information
Rich information are required
– When xk is complex
– When data of xk is sparse
Example: POS Tagging
How to evaluate Pwk|tk for unknown words wk ?
Useful features
– Suffix, e.g., -ed, -tion, -ing, etc.
– Capitalization
Generative Model
Parameter estimation: maximize the joint likelihood of training examples
 log
2
P(X  x, Y  y)
( x , y )T
91
Generative Models
Hidden Markov models (HMMs) and stochastic grammars
Assign a joint probability to paired observation and label sequences
The parameters typically trained to maximize the joint likelihood of train examples
92
Generative Models (cont’d)
Difficulties and disadvantages
Need to enumerate all possible observation sequences
Not practical to represent multiple interacting features or long-range
dependencies of the observations
Very strict independence assumptions on the observations
93
Better Approach
Discriminative model which models P(y|x) directly
Maximize the conditional likelihood of training examples
log
2
P(Y  y | X  x)
( x , y )T
94
Maximum Entropy modeling
N-gram model : probabilities depend on the previous few tokens.
We may identify a more heterogeneous set of features which contribute in some
way to the choice of the current word. (whether it is the first word in a story,
whether the next word is to, whether one of the last 5 words is a preposition, etc)
Maxent combines these features in a probabilistic model.
The given features provide a constraint on the model.
We would like to have a probability distribution which, outside of these
constraints, is as uniform as possible – has the maximum entropy among all
models that satisfy these constraints.
95
Maximum Entropy Markov Model
Discriminative Sub Models
Unify two parameters in generative model into one conditional
model
– Two parameters in generative model,
– parameter in source model
channel
– Unified conditional model
P( yk | yk 1 )
and parameter in noisy
P ( xk | y k )
Employ maximum entropy principle P ( y | x , y )
k
k
k 1
P(y | x)   P( yi | yi 1 , xi )
i

Maximum Entropy Markov Model
96
General Maximum Entropy Principle
Model
Model distribution PY |X with a set of features {f1,f2,,fl}
defined on X and Y
Idea
Collect information of features from training data
Principle
– Model what is known
– Assume nothing else
 Flattest distribution
 Distribution with the maximum Entropy
97
Example
(Berger et al., 1996) example
Model translation of word “in” from English to French
– Need to model P(wordFrench)
– Constraints
 1: Possible translations: dans, en, à, au course de, pendant
 2: “dans” or “en” used in 30% of the time
 3: “dans” or “à” in 50% of the time
98
Features
Features
0-1 indicator functions
– 1 if x, y satisfies a predefined condition
– 0 if not
Example: POS Tagging
1, if x ends with - tion and y is NN
f1 ( x, y)  
 0, otherwise
1, if x starts with Captializa tion and y is NNP
f 2 ( x, y )  
0, otherwise
99
Constraints
Empirical Information
Statistics from training data T
1
Pˆ ( f i ) 
f i ( x, y)

| T | ( x , y )T

Expected Value

From the distribution PY |X we want to model
P( f i ) 

Constraints
1
P(Y  y | X  x) f i ( x, y)


| T | ( x , y )T yD (Y )
Pˆ ( f i )  P( f i )
100
Maximum Entropy: Objective
Entropy
1
I 
P(Y  y | X  x) log 2 P(Y  y | X  x)

| T | ( x , y )T

Pˆ ( x) P(Y  y | X  x) log P(Y  y | X  x)

x

2
y
Maximization Problem
max I
P (Y | X )
s.t. Pˆ ( f )  P ( f )
101
Dual Problem
Dual Problem
Conditional model
l
P(Y  y | X  x)  exp(  i f i ( x, y))
i 1
Maximum likelihood of conditional data
max
1 ,, l

 log
2
P(Y  y | X  x)
( x , y )T
Solution


Improved iterative scaling (IIS) (Berger et al. 1996)
Generalized iterative scaling (GIS) (McCallum et al.
2000)
102
Maximum Entropy Markov Model
Use Maximum Entropy Approach to Model
1st order
P(Yk  yk | X k  xk , Yk 1  yk 1 )

Features

Basic features (like parameters in HMM)



Bigram (1st order) or trigram (2nd order) in source
model
State-output pair feature Xk xk, Yk  yk
Advantage: incorporate other advanced
features on xk, yk
103
HMM vs MEMM (1st order)
Xk
Xk
P ( X k | Yk )
P (Yk | Yk 1 )
Yk 1
Yk
HMM
P(Yk | X k , Yk 1 )
Yk 1
Yk
Maximum Entropy
Markov Model (MEMM)
104
Performance in POS Tagging
POS Tagging
Data set: WSJ
Features:
– HMM features, spelling features (like –ed, -tion, -s, -ing, etc.)
Results (Lafferty et al. 2001)
1st order HMM
– 94.31% accuracy, 54.01% OOV accuracy
1st order MEMM
– 95.19% accuracy, 73.01% OOV accuracy
105
ME applications
Part of Speech (POS) Tagging (Ratnaparkhi, 1996)
P(POS tag | context)
Information sources
– Word window (4)
– Word features (prefix, suffix, capitalization)
– Previous POS tags
106
ME applications
Abbreviation expansion (Pakhomov, 2002)
Information sources
– Word window (4)
– Document title
Word Sense Disambiguation (WSD) (Chao & Dyer, 2002)
Information sources
– Word window (4)
– Structurally related words (4)
Sentence Boundary Detection (Reynar & Ratnaparkhi, 1997)
Information sources
– Token features (prefix, suffix, capitalization, abbreviation)
– Word window (2)
107
Solution
Global Optimization
Optimize parameters in a global model simultaneously, not
in sub models separately
Alternatives
Conditional random fields
Application of perceptron algorithm
108
Why ME?
Advantages
Combine multiple knowledge sources
– Local
 Word prefix, suffix, capitalization (POS - (Ratnaparkhi, 1996))
 Word POS, POS class, suffix (WSD - (Chao & Dyer, 2002))
 Token prefix, suffix, capitalization, abbreviation (Sentence Boundary - (Reynar
& Ratnaparkhi, 1997))
– Global
 N-grams (Rosenfeld, 1997)
 Word window
 Document title (Pakhomov, 2002)
 Structurally related words (Chao & Dyer, 2002)
 Sentence length, conventional lexicon (Och & Ney, 2002)
Combine dependent knowledge sources
109
Why ME?
Advantages
Add additional knowledge sources
Implicit smoothing
Disadvantages
Computational
– Expected value at each iteration
– Normalizing constant
Overfitting
– Feature selection
 Cutoffs
 Basic Feature Selection (Berger et al., 1996)
110
Conditional Models
Conditional probability P(label sequence y | observation sequence x) rather than
joint probability P(y, x)
Specify the probability of possible label sequences given an observation sequence
Allow arbitrary, non-independent features on the observation sequence X
The probability of a transition between labels may depend on past and future
observations
Relax strong independence assumptions in generative models
111
Discriminative Models
Maximum Entropy Markov Models (MEMMs)
Exponential model
Given training set X with label sequence Y:
Train a model θ that maximizes P(Y|X, θ)
For a new data sequence x, the predicted label y maximizes P(y|x, θ)
Notice the per-state normalization
112
MEMMs (cont’d)
MEMMs have all the advantages of Conditional Models
Per-state normalization: all the mass that arrives at a state must be distributed
among the possible successor states (“conservation of score mass”)
Subject to Label Bias Problem
Bias toward states with fewer outgoing transitions
113
Label Bias Problem
• Consider this MEMM:
•
P(1 and 2 | ro) = P(2 | 1 and ro)P(1 | ro) = P(2 | 1 and o)P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri)P(1 | ri) = P(2 | 1 and i)P(1 | r)
• Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
In the training data, label value 2 is the only label value observed after label value 1
Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
• Per-state normalization does not allow the required expectation
114
Solve the Label Bias Problem
Change the state-transition structure of the model
Not always practical to change the set of states
Start with a fully-connected model and let the training procedure figure out
a good structure
Prelude the use of prior, which is very valuable (e.g. in information extraction)
115
Random Field
116
Conditional Random Fields (CRFs)
CRFs have all the advantages of MEMMs without label
bias problem
MEMM uses per-state exponential model for the conditional probabilities of
next states given the current state
CRF has a single exponential model for the joint probability of the entire
sequence of labels given the observation sequence
Undirected acyclic graph
Allow some transitions “vote” more strongly than others depending on
the corresponding observations
117
Definition of CRFs
X is a random variable over data sequences to be labeled
Y is a random variable over corresponding label sequences
118
Example of CRFs
119
Graphical comparison among
HMMs, MEMMs and CRFs
HMM
MEMM
CRF
120
Conditional Distribution
If the graph G = (V, E) of Y is a tree, the conditional distribution over the
label sequence Y = y, given X = x, by fundamental theorem of random
fields is:


p (y | x)  exp   k f k (e, y |e , x)   k gk (v, y |v , x) 
vV ,k
 eE,k

x is a data sequence
y is a label sequence
v is a vertex from vertex set V = set of label random variables
e is an edge from edge set E over V
fk and gk are given and fixed. gk is a Boolean vertex feature; fk is a
Boolean edge feature
k is the number of features
  (1 , 2 , , n ; 1 , 2 , , n ); k and k are parameters to be estimated
y|e is the set of components of y defined by edge e
y|v is the set of components of y defined by vertex v
121
Conditional Distribution (cont’d)
• CRFs use the observation-dependent normalization Z(x) for the
conditional distributions:


1
p (y | x) 
exp   k f k (e, y |e , x)   k gk (v, y |v , x) 
Z (x)
vV ,k
 eE,k

Z(x) is a normalization over the data sequence x
122
Parameter Estimation for CRFs
The paper provided iterative scaling algorithms
It turns out to be very inefficient
Prof. Dietterich’s group applied Gradient Descendent Algorithm, which
is quite efficient
123
Training of CRFs (From Prof. Dietterich)
• First, we take the log of the equation
log p ( y | x) 

eE,k
k
f k (e, y |e , x) 
  g (v, y | , x)  log Z (x)
vV ,k
k
k
v
• Then, take the derivative of the above equation

 log p ( y | x)  
   k f k (e, y |e , x)   k gk (v, y |v , x)  log Z (x) 

  eE,k
vV ,k

• For training, the first 2 items are easy to get.
• For example, for each k, fk is a sequence of Boolean numbers, such
as 00101110100111.
is just the total number of 1’s in the sequence.
k f k (e, y |e , x)
• The hardest thing is how to calculate Z(x)
124
Training of CRFs (From Prof. Dietterich) (cont’d)
• Maximal cliques
y1
c1
y2
c1
c2
c2
y3
c3
y4
c3
c1 : exp( (y1 ,x)   (y2 ,x)  (y1 ,y2 ,x))  c1 (y1,y2 ,x)
c2 : exp( (y3 ,x)  (y2 ,y3 ,x))  c2 (y2 ,y3 ,x)
c3 : exp( (y4 ,x)  (y3 ,y4 ,x))  c3 (y3 ,y4 ,x)
Z (x) 

c1 (y1 ,y 2 ,x)c2 (y 2 ,y3 ,x)c3 (y 3 ,y 4 ,x)
y1 ,y 2 ,y3 ,y 4
  c1 (y1 ,y 2 ,x) c2 (y 2 ,y3 ,x) c3 (y3 ,y 4 ,x)
y1
y2
y3
y4
125
POS tagging Experiments
126
POS tagging Experiments (cont’d)
• Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging
• Each word in a given input sentence must be labeled with one of 45 syntactic tags
• Add a small set of orthographic features: whether a spelling begins with a number
or upper case letter, whether it contains a hyphen, and if it contains one of the
following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
• oov = out-of-vocabulary (not observed in the training set)
127
Summary
Discriminative models are prone to the label bias problem
CRFs provide the benefits of discriminative models
CRFs solve the label bias problem well, and demonstrate good
performance
128