Transcript POS tagging

CS60057
Speech & Natural Language Processing
Autumn 2007
Lecture 8
9 August 2007
Why Do We Care about Parts of Speech?
•Pronunciation
Hand me the lead pipe.
•Predicting what words can be expected next
Personal pronoun (e.g., I, she) ____________
•Stemming
-s means singular for verbs, plural for nouns
•As the basis for syntactic parsing and then meaning extraction
I will lead the group into the lead smelter.
•Machine translation
• (E) content +N → (F) contenu +N
• (E) content +Adj → (F) content +Adj or satisfait +Adj
What is a Part of Speech?
Is this a semantic distinction? For example, maybe Noun is the
class of words for people, places and things. Maybe Adjective
is the class of words for properties of nouns.
Consider:
green book
book is a Noun
green is an Adjective
Now consider:
book worm
This green is very soothing.
How Many Parts of Speech Are There?
A first cut at the easy distinctions:
Open classes:
•nouns, verbs, adjectives, adverbs
Closed classes: function words
•conjunctions: and, or, but
•pronouns: I, she, him
•prepositions: with, on
•determiners: the, a, an
Part of speech tagging
• 8 (ish) traditional parts of speech
  • Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
• This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
• Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, POS
• We'll use POS most frequently
• I'll assume that you all know what these are
POS examples
  N    noun         chair, bandwidth, pacing
  V    verb         study, debate, munch
  ADJ  adjective    purple, tall, ridiculous
  ADV  adverb       unfortunately, slowly
  P    preposition  of, by, to
  PRO  pronoun      I, me, mine
  DET  determiner   the, a, that, those
Tagsets
Brown corpus tagset (87 tags):
http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html
Penn Treebank tagset (45 tags):
http://www.cs.colorado.edu/~martin/SLP/Figures/ (8.6)
C7 tagset (146 tags)
http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
POS Tagging: Definition
The process of assigning a part-of-speech or lexical class marker to each word in a corpus:

  WORDS:  the, koala, put, the, keys, on, the, table
  TAGS:   N, V, P, DET
POS Tagging example
  WORD     TAG
  the      DET
  koala    N
  put      V
  the      DET
  keys     N
  on       P
  the      DET
  table    N
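To make the definition concrete, here is a minimal sketch of running an off-the-shelf tagger over this sentence; it assumes the Python nltk package is installed, and the exact resource names may vary across NLTK versions.

```python
# A minimal sketch of tagging the example sentence with an off-the-shelf tagger.
# Assumes the Python package nltk is installed; resource names may differ
# slightly across NLTK versions.
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # default POS tagger model

tokens = nltk.word_tokenize("The koala put the keys on the table")
print(nltk.pos_tag(tokens))
# Roughly expected (Penn Treebank tags):
# [('The', 'DT'), ('koala', 'NN'), ('put', 'VBD'), ('the', 'DT'),
#  ('keys', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('table', 'NN')]
```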
POS tagging: Choosing a tagset
• There are many potential parts of speech and distinctions we could draw
• To do POS tagging, we need to choose a standard set of tags to work with
• Could pick a very coarse tagset
  • N, V, Adj, Adv
• A more commonly used set is finer grained, the "UPenn TreeBank tagset", with 45 tags
  • PRP$, WRB, WP$, VBG
• Even more fine-grained tagsets exist
Penn TreeBank POS Tag set
Using the UPenn tagset
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
• Prepositions and subordinating conjunctions are marked IN ("although/IN I/PRP ...")
• Except the preposition/complementizer "to", which is just marked TO
POS Tagging
Words often have more than one POS: back
• The back door = JJ
• On my back = NN
• Win the voters back = RB
• Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.
These examples from Dekang Lin
How hard is POS tagging?
Measuring ambiguity
Algorithms for POS Tagging
•Ambiguity – In the Brown corpus, 11.5% of the word
types are ambiguous (using 87 tags):
Worse, 40% of the tokens are ambiguous.
Algorithms for POS Tagging
Why can’t we just look them up in a dictionary?
•Words that aren’t in the dictionary
http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc
•One idea: P(ti | wi) = the probability that a random hapax
legomenon in the corpus has tag ti.
Nouns are more likely than verbs, which are more likely
than pronouns.
•Another idea: use morphology.
Algorithms for POS Tagging - Knowledge
•Dictionary
•Morphological rules, e.g.,
•_____-tion
•_____-ly
•capitalization
•N-gram frequencies
•to _____
•DET _____ N
•But what about rare words, e.g., smelt (two verb forms, melt and past tense of smell, and one noun form, a small fish)?
•Combining these
• V _____-ing
I was gracking vs. Gracking is fun.
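As a rough illustration of how these knowledge sources could be combined for rare or unknown words like "gracking", here is a toy sketch; the rules and the guess_tag function are hypothetical, not taken from any particular tagger.

```python
# A toy sketch of combining these knowledge sources for an unknown word such as
# "gracking": suffix morphology plus the preceding tag. The rules and the
# guess_tag interface are hypothetical, not taken from any particular tagger.
def guess_tag(word, prev_tag):
    if word.lower().endswith("tion"):
        return "N"            # ____-tion: usually a noun
    if word.lower().endswith("ly"):
        return "ADV"          # ____-ly: usually an adverb
    if word[0].isupper() and prev_tag != "<s>":
        return "PROPER-N"     # capitalized mid-sentence: likely a name
    if word.lower().endswith("ing") and prev_tag in ("V", "AUX"):
        return "V"            # V ____-ing, as in "I was gracking"
    return "N"                # default: unknown words are most often nouns

print(guess_tag("gracking", "AUX"))   # V    ("I was gracking")
print(guess_tag("Gracking", "<s>"))   # N    ("Gracking is fun." - sentence-initial)
```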
POS Tagging - Approaches
• Rule-based tagging
  • e.g., ENGTWOL
• Stochastic (= probabilistic) tagging
  • e.g., HMM (Hidden Markov Model) tagging
• Transformation-based tagging
  • e.g., the Brill tagger
• Do we return one best answer, or several answers and let later steps decide?
• How does the requisite knowledge get entered?
3 methods for POS tagging
1. Rule-based tagging
  • Example: Karlsson (1995) EngCG tagger, based on the Constraint Grammar architecture and the ENGTWOL lexicon
  • Basic idea:
    • Assign all possible tags to each word (a morphological analyzer is used)
    • Remove wrong tags according to a set of constraint rules (typically more than 1000 hand-written constraint rules, but they may also be machine-learned)
3 methods for POS tagging
2. Transformation-based tagging
  • Example: Brill (1995) tagger - a combination of rule-based and stochastic (probabilistic) tagging methodologies
  • Basic idea:
    • Start with a tagged corpus + dictionary (with most frequent tags)
    • Set the most probable tag for each word as a start value
    • Change tags according to rules of the type "if word-1 is a determiner and word is a verb then change the tag to noun", applied in a specific order (like rule-based taggers)
    • Machine learning is used: the rules are automatically induced from a previously tagged training corpus (like the stochastic approach)
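A toy sketch of the transformation step, under assumed data: start from each word's most frequent tag and apply one rule of the type quoted above. The mini-lexicon and the rule are illustrative only.

```python
# A toy sketch of the transformation step under assumed data: start from each
# word's most frequent tag, then apply one rule of the type quoted above
# ("if word-1 is a determiner and word is a verb, change the tag to noun").
most_frequent_tag = {"the": "DT", "can": "VB", "rusted": "VBD"}   # assumed mini-lexicon

def initial_tagging(words):
    return [most_frequent_tag.get(w, "NN") for w in words]

def apply_rule(tags):
    out = list(tags)
    for i in range(1, len(out)):
        if out[i - 1] == "DT" and out[i] == "VB":   # triggering context
            out[i] = "NN"                           # rewrite VB -> NN
    return out

words = ["the", "can", "rusted"]
print(list(zip(words, apply_rule(initial_tagging(words)))))
# [('the', 'DT'), ('can', 'NN'), ('rusted', 'VBD')]
```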
3 methods for POS tagging
3. Stochastic (= probabilistic) tagging
  • Example: HMM (Hidden Markov Model) tagging. A training corpus is used to compute the probability (frequency) of a given word having a given POS tag in a given context.
Topics
• Probability
• Conditional Probability
• Independence
• Bayes Rule
• HMM tagging
  • Markov Chains
  • Hidden Markov Models
6. Introduction to Probability
• Experiment (trial)
  • Repeatable procedure with well-defined possible outcomes
• Sample Space (S)
  • the set of all possible outcomes
  • finite or infinite
• Example
  • coin toss experiment
  • possible outcomes: S = {heads, tails}
• Example
  • die toss experiment
  • possible outcomes: S = {1,2,3,4,5,6}
Introduction to Probability
• Definition of sample space depends on what we are asking
  • Sample Space (S): the set of all possible outcomes
• Example
  • die toss experiment for whether the number is even or odd
  • possible outcomes: {even, odd}
  • not {1,2,3,4,5,6}
More definitions
• Events
  • an event is any subset of outcomes from the sample space
• Example
  • die toss experiment
  • let A represent the event that the outcome of the die toss is divisible by 3
  • A = {3,6}
  • A is a subset of the sample space S = {1,2,3,4,5,6}
Introduction to Probability
• Some definitions
  • Events
    • an event is a subset of the sample space
    • simple and compound events
• Example
  • deck-of-cards draw experiment
  • suppose sample space S = {heart, spade, club, diamond} (four suits)
  • let A represent the event of drawing a heart
  • let B represent the event of drawing a red card
  • A = {heart} (simple event)
  • B = {heart} ∪ {diamond} = {heart, diamond} (compound event)
  • a compound event can be expressed as a set union of simple events
• Example
  • alternative sample space S = set of 52 cards
  • A and B would both be compound events
Introduction to Probability
• Some definitions
  • Counting
    • suppose an operation oi can be performed in ni ways
    • a set of k operations o1, o2, ..., ok can then be performed in n1 × n2 × ... × nk ways
  • Example
    • die toss experiment, 6 possible outcomes
    • two dice are thrown at the same time
    • number of sample points in the sample space = 6 × 6 = 36
Definition of Probability
• The probability law assigns to an event A a nonnegative number
• Called P(A)
• Also called the probability of A
• It encodes our knowledge or belief about the collective likelihood of all the elements of A
• The probability law must satisfy certain properties
Probability Axioms
• Nonnegativity
  • P(A) >= 0, for every event A
• Additivity
  • If A and B are two disjoint events, then the probability of their union satisfies:
    P(A ∪ B) = P(A) + P(B)
• Normalization
  • The probability of the entire sample space S is equal to 1, i.e., P(S) = 1.
An example
• An experiment involving a single coin toss
• There are two possible outcomes, H and T
• Sample space S is {H,T}
• If the coin is fair, we should assign equal probabilities to the 2 outcomes
• Since they have to sum to 1:
    P({H}) = 0.5
    P({T}) = 0.5
    P({H,T}) = P({H}) + P({T}) = 1.0
Another example
• Experiment involving 3 coin tosses
• Outcome is a 3-long string of H or T
• S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} (8 outcomes in total)
• Assume each outcome is equiprobable ("uniform distribution")
• What is the probability of the event that exactly 2 heads occur?
  • A = {HHT, HTH, THH} (3 outcomes)
  • P(A) = P({HHT}) + P({HTH}) + P({THH})   (additivity: the union of disjoint individual events)
         = 1/8 + 1/8 + 1/8
         = 3/8
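A quick sketch that checks this computation by enumerating the uniform sample space:

```python
# A quick check of this computation: enumerate the uniform sample space for
# three coin tosses and add up the probabilities of outcomes with exactly 2 heads.
from itertools import product

sample_space = ["".join(t) for t in product("HT", repeat=3)]   # 8 equiprobable outcomes
p_outcome = 1 / len(sample_space)

event_A = [s for s in sample_space if s.count("H") == 2]
print(event_A, len(event_A) * p_outcome)   # ['HHT', 'HTH', 'THH'] 0.375 (= 3/8)
```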
Probability definitions
• In summary:
  Probability of drawing a spade from 52 well-shuffled playing cards:
  P(spade) = 13/52 = 1/4 = 0.25
Moving toward language
• What's the probability of drawing a 2 from a deck of 52 cards with four 2s?

    P(drawing a two) = 4/52 = 1/13 ≈ .077

• What's the probability of a random word (from a random dictionary page) being a verb?

    P(drawing a verb) = (# of ways to get a verb) / (all words)
Probability and part of speech tags
• What's the probability of a random word (from a random dictionary page) being a verb?

    P(drawing a verb) = (# of ways to get a verb) / (all words)

• How to compute each of these:
  • All words = just count all the words in the dictionary
  • # of ways to get a verb = # of words which are verbs
  • If a dictionary has 50,000 entries, and 10,000 are verbs, then P(V) is 10000/50000 = 1/5 = .20
Conditional Probability
• A way to reason about the outcome of an experiment based on partial information
  • In a word guessing game the first letter of the word is a "t". What is the likelihood that the second letter is an "h"?
  • How likely is it that a person has a disease given that a medical test was negative?
  • A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
More precisely
• Given an experiment, a corresponding sample space S, and a probability law
• Suppose we know that the outcome is some event B
• We want to quantify the likelihood that the outcome also belongs to some other event A
• We need a new probability law that gives us the conditional probability of A given B: P(A|B)
An intuition
• Let's say A is "it's raining".
• Let's say P(A) in dry Florida is .01
• Let's say B is "it was sunny ten minutes ago"
• P(A|B) means "what is the probability of it raining now if it was sunny 10 minutes ago"
• P(A|B) is probably way less than P(A)
• Perhaps P(A|B) is .0001
• Intuition: The knowledge about B should change our estimate of the probability of A.
Conditional Probability
• Let A and B be events in the sample space
• P(A|B) = the conditional probability of event A occurring given some fixed event B occurring
• Definition: P(A|B) = P(A ∩ B) / P(B)
Conditional probability
• P(A|B) = P(A ∩ B) / P(B)
• Or: P(A|B) = P(A,B) / P(B)
• Note: P(A,B) = P(A|B) · P(B)
• Also: P(A,B) = P(B,A)
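A small sketch tying the definition P(A|B) = P(A ∩ B) / P(B) back to the earlier die-toss event A = "divisible by 3"; the conditioning event B = "outcome is even" is chosen here purely for illustration.

```python
# Conditional probability from counts on the uniform die:
# A = "divisible by 3", B = "outcome is even" (B chosen just for illustration).
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}                   # uniform die
A, B = {3, 6}, {2, 4, 6}

def prob(event):
    return Fraction(len(event), len(S))  # uniform probability law

print(prob(A & B) / prob(B))             # P(A|B) = (1/6) / (1/2) = 1/3
```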
Independence
• What is P(A,B) if A and B are independent?
• P(A,B) = P(A) · P(B) iff A, B independent.
  • P(heads, tails) = P(heads) · P(tails) = .5 · .5 = .25
• Note: P(A|B) = P(A) iff A, B independent
• Also: P(B|A) = P(B) iff A, B independent
Bayes Theorem
  P(B|A) = P(A|B) P(B) / P(A)

• Idea: The probability of an event A conditional on another event B is generally different from the probability of B conditional on A. There is a definite relationship between the two.
Deriving Bayes Rule
The probability of event A given event B is

  P(A|B) = P(A ∩ B) / P(B)
Deriving Bayes Rule
The probability of event B given event A is

  P(B|A) = P(A ∩ B) / P(A)
Deriving Bayes Rule
  P(A|B) = P(A ∩ B) / P(B)        P(B|A) = P(A ∩ B) / P(A)

  P(A|B) P(B) = P(A ∩ B)          P(B|A) P(A) = P(A ∩ B)
Deriving Bayes Rule
  P(A|B) = P(A ∩ B) / P(B)        P(B|A) = P(A ∩ B) / P(A)

  P(A|B) P(B) = P(A ∩ B)          P(B|A) P(A) = P(A ∩ B)

  P(A|B) P(B) = P(B|A) P(A)

  P(A|B) = P(B|A) P(A) / P(B)
Deriving Bayes Rule
  P(A|B) = P(B|A) P(A) / P(B)

The theorem may be paraphrased as:
  conditional/posterior probability = (LIKELIHOOD multiplied by PRIOR) divided by NORMALIZING CONSTANT
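A quick numeric check of the derived rule, reusing the die-toss events from the conditional-probability sketch above:

```python
# A quick numeric check of P(A|B) = P(B|A) P(A) / P(B), reusing the die-toss
# events from the conditional-probability sketch above.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A, B = {3, 6}, {2, 4, 6}

def prob(event):
    return Fraction(len(event), len(S))

lhs = prob(A & B) / prob(B)                        # P(A|B) by definition
rhs = (prob(A & B) / prob(A)) * prob(A) / prob(B)  # P(B|A) P(A) / P(B)
print(lhs, rhs, lhs == rhs)                        # 1/3 1/3 True
```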
Hidden Markov Model (HMM) Tagging
• Using an HMM to do POS tagging
• HMM is a special case of Bayesian inference
• It is also related to the "noisy channel" model in ASR (Automatic Speech Recognition)
POS tagging as a sequence classification task
• We are given a sentence (an "observation" or "sequence of observations")
  • Secretariat is expected to race tomorrow
  • a sequence of n words w1…wn
• What is the best sequence of tags which corresponds to this sequence of observations?
• Probabilistic/Bayesian view:
  • Consider all possible sequences of tags
  • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn
Getting to HMM
• Let T = t1, t2, …, tn
• Let W = w1, w2, …, wn
• Goal: Out of all sequences of tags t1…tn, get the most probable sequence of POS tags T underlying the observed sequence of words w1, w2, …, wn:

    T^ = argmax_T P(T|W)

• The hat ^ means "our estimate of the best", i.e., the most probable tag sequence
• argmax_x f(x) means "the x such that f(x) is maximized"; it maximizes our estimate of the best tag sequence
Getting to HMM
• This equation is guaranteed to give us the best tag sequence
• But how do we make it operational? How do we compute this value?
• Intuition of Bayesian classification:
  • Use Bayes rule to transform it into a set of other probabilities that are easier to compute
  • Thomas Bayes: British mathematician (1702-1761)
Bayes Rule
Breaks down any conditional probability P(x|y) into three other probabilities:

  P(x|y) = P(y|x) P(x) / P(y)

P(x|y): the conditional probability of an event x assuming that y has occurred
Bayes Rule
We can drop the denominator P(W): it does not change across tag sequences; we are looking for the best tag sequence for the same observation, the same fixed set of words:

  T^ = argmax_T P(T|W) = argmax_T P(W|T) P(T)
Likelihood and prior

  T^ = argmax_T P(T|W) = argmax_T P(W|T) P(T)

  P(W|T) is the likelihood; P(T) is the prior.
Likelihood and prior
Further Simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it:
   P(W|T) ≈ ∏_{i=1..n} P(wi|ti)
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag:
   P(T) ≈ ∏_{i=1..n} P(ti|ti-1)
3. The most probable tag sequence estimated by the bigram tagger:
   T^ = argmax_T ∏_{i=1..n} P(wi|ti) P(ti|ti-1)
Likelihood and prior
Further Simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it:
   P(W|T) ≈ ∏_{i=1..n} P(wi|ti)

   WORDS:  the, koala, put, the, keys, on, the, table
   TAGS:   N, V, P, DET
Likelihood and prior
Further Simplifications
2. BIGRAM assumption: the probability of a tag appearing depends only
on the previous tag
Bigrams are groups of two written letters, two syllables, or two words; they
are a special case of N-gram.
Bigrams are used as the basis for simple statistical analysis of text
The bigram assumption is related to the first-order Markov assumption
Likelihood and prior
Further Simplifications
3. The most probable tag sequence estimated by the bigram tagger (using the bigram assumption):
   T^ = argmax_T ∏_{i=1..n} P(wi|ti) P(ti|ti-1)
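A minimal sketch of the resulting bigram scoring function; the probability tables below are hypothetical stand-ins for values estimated from a corpus, and "<s>" marks the sentence start.

```python
# A minimal sketch of the bigram tagger's scoring function under the two
# simplifying assumptions above. The probability tables are hypothetical
# stand-ins for values counted from a corpus; "<s>" marks the sentence start.
from math import prod

trans = {("<s>", "DT"): 0.4, ("DT", "NN"): 0.5, ("NN", "VBZ"): 0.2}        # P(t_i | t_{i-1})
emit = {("the", "DT"): 0.6, ("dog", "NN"): 0.01, ("barks", "VBZ"): 0.005}  # P(w_i | t_i)

def score(words, tags):
    """prod_i P(w_i | t_i) * P(t_i | t_{i-1}), with t_0 = <s>."""
    prev = ["<s>"] + list(tags[:-1])
    return prod(emit.get((w, t), 0.0) * trans.get((p, t), 0.0)
                for w, t, p in zip(words, tags, prev))

print(score(["the", "dog", "barks"], ["DT", "NN", "VBZ"]))   # ~1.2e-06 with these made-up numbers
```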
Two kinds of probabilities (1)
• Tag transition probabilities p(ti|ti-1)
  • Determiners likely to precede adjectives and nouns
    • That/DT flight/NN
    • The/DT yellow/JJ hat/NN
    • So we expect P(NN|DT) and P(JJ|DT) to be high
    • But what about P(DT|JJ)?
Two kinds of probabilities (1)
• Tag transition probabilities p(ti|ti-1)
  • Compute P(NN|DT) by counting in a labeled corpus:

    P(NN|DT) = Count(DT followed by NN) / Count(DT)
Two kinds of probabilities (2)
• Word likelihood probabilities p(wi|ti)
  • P(is|VBZ) = probability of VBZ (3sg Pres verb) being "is"
  • If we were expecting a third person singular verb, how likely is it that this verb would be "is"?
  • Compute P(is|VBZ) by counting in a labeled corpus:

    P(is|VBZ) = Count(VBZ realized as "is") / Count(VBZ)
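Both kinds of probabilities can be estimated by counting, as the slides say; here is a sketch over a tiny made-up labeled corpus (a real tagger would count over a large hand-labeled treebank such as the Brown corpus or the Penn Treebank).

```python
# Estimating both kinds of probabilities by counting over a tiny made-up
# labeled corpus of (word, tag) pairs.
from collections import Counter

tagged = [("the", "DT"), ("flight", "NN"), ("is", "VBZ"), ("late", "JJ"),
          ("the", "DT"), ("yellow", "JJ"), ("hat", "NN")]

tags = [t for _, t in tagged]
tag_count = Counter(tags)
emit_count = Counter(tagged)                   # counts of (word, tag) pairs
trans_count = Counter(zip(tags, tags[1:]))     # counts of (t_{i-1}, t_i) pairs

def p_word_given_tag(word, tag):
    return emit_count[(word, tag)] / tag_count[tag]      # P(w_i | t_i)

def p_tag_given_prev(tag, prev):
    return trans_count[(prev, tag)] / tag_count[prev]    # P(t_i | t_{i-1})

print(p_word_given_tag("is", "VBZ"))   # 1.0 in this toy corpus
print(p_tag_given_prev("NN", "DT"))    # 0.5: one of the two DTs is followed by NN
```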
An Example: the verb "race"
• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• How do we pick the right tag?
Disambiguating “race”
P(NN|TO) = .00047
P(VB|TO) = .83
  The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: "How likely are we to expect a verb/noun given the previous tag TO?"

P(race|NN) = .00057
P(race|VB) = .00012
  Lexical likelihoods from the Brown corpus for "race" given the POS tag NN or VB.

P(NR|VB) = .0027
P(NR|NN) = .0012
  Tag sequence probabilities for the likelihood of an adverb occurring given the previous tag verb or noun.

P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
  Multiply the lexical likelihoods with the tag sequence probabilities: the verb wins.
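Reproducing the comparison numerically with the probabilities quoted above:

```python
# Reproducing the "race" comparison numerically with the probabilities quoted above.
p_vb = 0.83 * 0.0027 * 0.00012       # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057    # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"VB path: {p_vb:.2e}")        # ~2.7e-07
print(f"NN path: {p_nn:.2e}")        # ~3.2e-10
print("race is tagged", "VB" if p_vb > p_nn else "NN")   # VB
```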
Hidden Markov Models
• What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM)
• Let's just spend a bit of time tying this into the model
• In order to define HMM, we will first introduce the Markov Chain, or observable Markov Model.
Definitions
• A weighted finite-state automaton adds probabilities to the arcs
  • The probabilities on the arcs leaving any state must sum to one
• A Markov chain is a special case of a WFST in which the input sequence uniquely determines which states the automaton will go through
• Markov chains can't represent inherently ambiguous problems
  • Useful for assigning probabilities to unambiguous sequences
Markov chain = "First-order observed Markov Model"
• A set of states
  • Q = q1, q2 … qN; the state at time t is qt
• A set of transition probabilities A = a01 a02 … an1 … ann
  • Each aij represents the probability of transitioning from state i to state j
  • The set of these is the transition probability matrix A

    aij = P(qt = j | qt-1 = i),  1 <= i, j <= N
    sum_{j=1..N} aij = 1,  1 <= i <= N

• Distinguished start and end states
• Special initial probability vector π
  • πi is the probability that the MM will start in state i; each πi expresses the probability p(qi|START)
Markov chain = "First-order observed Markov Model"
Markov Chain for weather: Example 1
• three types of weather: sunny, rainy, foggy
• we want to find the following conditional probabilities:
    P(qn | qn-1, qn-2, …, q1)
  i.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days
• We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences
• Problem: the larger n is, the more observations we must collect. Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories
Markov chain = "First-order observed Markov Model"
• Therefore, we make a simplifying assumption, called the (first-order) Markov assumption: for a sequence of observations q1, … qn, the current state only depends on the previous state

    P(qn | qn-1, …, q1) ≈ P(qn | qn-1)

• The joint probability of certain past and current observations:

    P(q1, …, qn) ≈ ∏_{i=1..n} P(qi | qi-1)
Markov chain = "First-order observable Markov Model"
Markov chain = "First-order observed Markov Model"
Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy?
• Using the Markov assumption and the probabilities in table 1, this translates into:

    P(q2 = sunny, q3 = rainy | q1 = sunny) = P(q3 = rainy | q2 = sunny) · P(q2 = sunny | q1 = sunny)
The weather figure: specific example
Markov Chain for weather: Example 2
Markov chain for weather
• What is the probability of 4 consecutive rainy days?
• Sequence is rainy-rainy-rainy-rainy
• I.e., state sequence is 3-3-3-3
• P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
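A quick check of this computation, using the start and self-transition probabilities quoted on the slide:

```python
# A quick check of the rainy-days computation, using the start probability and
# self-transition probability quoted on the slide.
pi_rainy = 0.2               # probability the chain starts in the rainy state
a_rr = 0.6                   # probability of rainy -> rainy

print(round(pi_rainy * a_rr ** 3, 4))   # 0.0432
```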
Hidden Markov Model
• For Markov chains, the output symbols are the same as the states.
  • See sunny weather: we're in state sunny
• But in part-of-speech tagging (and other things):
  • The output symbols are words
  • But the hidden states are part-of-speech tags
• So we need an extension!
• A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
• This means we don't know which state we are in.
Markov chain for weather
Markov chain for words
Observed events: words
Hidden events: tags
Hidden Markov Models
• States Q = q1, q2 … qN
• Observations O = o1, o2 … oN
  • Each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
• Transition probabilities (prior)
  • Transition probability matrix A = {aij}
• Observation likelihoods (likelihood)
  • Output probability matrix B = {bi(ot)}: a set of observation likelihoods, each expressing the probability of an observation ot being generated from a state i (emission probabilities)
• Special initial probability vector π
  • πi is the probability that the HMM will start in state i; each πi expresses the probability p(qi|START)
Assumptions
• Markov assumption: the probability of a particular state depends only on the previous state

    P(qi | q1 … qi-1) ≈ P(qi | qi-1)

• Output-independence assumption: the probability of an output observation depends only on the state that produced that observation

    P(ot | q1 … qt, o1 … ot-1) ≈ P(ot | qt)
HMM for Ice Cream
• You are a climatologist in the year 2799
• Studying global warming
• You can't find any records of the weather in Boston, MA for the summer of 2007
• But you find Jason Eisner's diary
• Which lists how many ice creams Jason ate every day that summer
• Our job: figure out how hot it was
Noam task
• Given
  • Ice Cream Observation Sequence: 1,2,3,2,2,2,3…
    (cp. with output symbols)
• Produce:
  • Weather Sequence: C,C,H,C,C,C,H …
    (cp. with hidden states, causing states)
HMM for ice cream
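A brute-force sketch of the decoding task itself: enumerate every possible weather sequence for a short ice-cream observation sequence and keep the most probable one (the π/A/B values are again illustrative assumptions, not the lecture's figures).

```python
# Brute-force decoding: enumerate every possible weather sequence for a short
# ice-cream observation sequence and keep the most probable one.
# The pi/A/B values are illustrative assumptions, as above.
from itertools import product

Q = ["HOT", "COLD"]
pi = {"HOT": 0.5, "COLD": 0.5}
A = {"HOT": {"HOT": 0.7, "COLD": 0.3}, "COLD": {"HOT": 0.4, "COLD": 0.6}}
B = {"HOT": {1: 0.1, 2: 0.3, 3: 0.6}, "COLD": {1: 0.6, 2: 0.3, 3: 0.1}}

obs = [3, 1, 3]                          # ice creams eaten on three consecutive days

def joint_prob(states):
    """P(states, obs) = pi(s1) b_s1(o1) * prod_t a_{s(t-1),s(t)} b_st(ot)."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for prev, cur, o in zip(states, states[1:], obs[1:]):
        p *= A[prev][cur] * B[cur][o]
    return p

best = max(product(Q, repeat=len(obs)), key=joint_prob)
print(best, joint_prob(best))            # ('HOT', 'COLD', 'HOT'), ~0.013 with these tables
```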
Different types of HMM structure
• Bakis = left-to-right
• Ergodic = fully-connected
HMM Taggers
• Two kinds of probabilities:
  • A: transition probabilities (PRIOR)
  • B: observation likelihoods (LIKELIHOOD)
• HMM taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability
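The efficient way to find that maximizing sequence is the Viterbi algorithm previewed at the end of this lecture; here is a compact sketch for a bigram HMM tagger, with hypothetical probability tables standing in for corpus counts.

```python
# A compact Viterbi sketch for a bigram HMM tagger: it finds the tag sequence
# maximizing prod_i P(w_i|t_i) P(t_i|t_{i-1}). The tiny probability tables are
# hypothetical stand-ins for values counted from a labeled corpus.
TAGS = ["DT", "NN", "VB"]
trans = {("<s>", "DT"): 0.6, ("<s>", "NN"): 0.2, ("<s>", "VB"): 0.2,
         ("DT", "NN"): 0.7, ("DT", "VB"): 0.1, ("DT", "DT"): 0.2,
         ("NN", "VB"): 0.5, ("NN", "NN"): 0.3, ("NN", "DT"): 0.2,
         ("VB", "DT"): 0.5, ("VB", "NN"): 0.4, ("VB", "VB"): 0.1}
emit = {("the", "DT"): 0.7, ("dog", "NN"): 0.04,
        ("barks", "VB"): 0.03, ("barks", "NN"): 0.01}

def viterbi(words):
    # v[t]: best probability of any tag sequence for the words so far ending in t
    # back[t]: the tag sequence achieving v[t]
    v = {t: trans.get(("<s>", t), 0.0) * emit.get((words[0], t), 0.0) for t in TAGS}
    back = {t: [t] for t in TAGS}
    for w in words[1:]:
        new_v, new_back = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: v[p] * trans.get((p, t), 0.0))
            new_v[t] = v[best_prev] * trans.get((best_prev, t), 0.0) * emit.get((w, t), 0.0)
            new_back[t] = back[best_prev] + [t]
        v, back = new_v, new_back
    best_last = max(TAGS, key=lambda t: v[t])
    return back[best_last], v[best_last]

print(viterbi(["the", "dog", "barks"]))   # (['DT', 'NN', 'VB'], ...) with these made-up tables
```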
Weighted FSM corresponding to hidden states of HMM, showing A probs
B observation likelihoods for POS HMM
HMM Taggers
• The probabilities are trained on hand-labeled training corpora (training set)
• Combine different N-gram levels
• Evaluated by comparing their output on a test set against human labels for that test set (Gold Standard)
Next Time
• Minimum Edit Distance
• A "dynamic programming" algorithm
• A probabilistic version of this, called "Viterbi", is a key part of the Hidden Markov Model!
Evaluation
Error Analysis
Tag indeterminacy
Unknown words