Transcript POS Tagging

Chapter 5: POS Tagging
Heshaam Faili
[email protected]
University of Tehran
Part-of-Speech Tags

- Also called POS, word classes, morphological classes, or lexical tags
- Basic distinction: verb vs. noun
- Traditional parts of speech: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article
- Finer distinctions, e.g., possessive pronouns vs. personal pronouns
- Useful for pronunciation: conTENT vs. CONtent
- Useful for stemming in information retrieval
- Useful for word-sense disambiguation
- Useful for shallow parsing to detect names, dates, places, ...
POS Tags

- Based on syntactic and morphological function
- Divided into:
  - Closed-class types: pronouns, prepositions, determiners (articles), conjunctions, auxiliary verbs, particles, numerals
  - Open-class types: nouns, verbs, adjectives, adverbs
- Finer subdivisions:
  - Proper noun vs. common noun
  - Count noun vs. mass noun
  - Main verb vs. auxiliary verb
  - Directional (locative), degree, manner, and temporal adverbs
  - Particles and phrasal verbs
Statistics

- Article frequencies in a large corpus: the: 1,071,676; a: 413,887; an: 59,359
- (A further slide of frequency statistics is not reproduced in this transcript)
English Language

- Pronouns: personal, possessive, WH-pronouns
- Auxiliary verbs: modal verbs, copula
Example English Part-of-Speech Tag Sets

- Brown corpus: 87 tags
  - Allows compound tags: "I'm" is tagged as PPSS+BEM, where PPSS stands for "non-3rd person nominative personal pronoun" and BEM for "am, 'm"
- Others have derived their tagsets from the Brown Corpus:
  - LOB Corpus: 135 tags
  - Lancaster UCREL Group: 165 tags
  - London-Lund Corpus: 197 tags
  - BNC: 61 tags (C5)
  - PTB: 45 tags
- Other languages have developed their own tagsets
PTB Tagset (36 main tags + punctuation tags)

(Tag table not reproduced in this transcript)
Some Examples from Brown

- The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
- There/EX are/VBP 70/CD children/NNS there/RB
- Although/IN preliminary/JJ findings/NNS were/VBD reported/VBN more/RBR than/IN a/DT year/NN ago/IN ,/, the/DT latest/JJS results/NNS appear/VBP in/IN today/NN 's/POS New/NNP England/NNP Journal/NNP of/IN Medicine/NNP ,/,
- Difficulty in tagging 'around': particle, preposition, or adverb?
  - Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG
  - All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN
  - Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD
POS Tagging Problem

- Given a sentence W1…Wn and a tag set of lexical categories, find the most likely tag sequence C1…Cn for the words in the sentence
- Note that many of the words may have unambiguous tags
- Example:
  - Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
  - People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- But enough words are either ambiguous or unknown that it's a nontrivial task
- Tokenization: punctuation marks (period, comma, etc.) must be separated off from the words
More Details of the Problem

- How ambiguous?
  - Most word types in English have only one Brown Corpus tag
    - Unambiguous (1 tag): 35,340 word types
    - Ambiguous (2-7 tags): 4,100 word types = 11.5%
    - 7 tags: 1 word type ("still")
  - But many of the most common words are ambiguous
    - Over 40% of Brown corpus tokens are ambiguous
- Obvious strategies may be suggested based on intuition:
  - to/TO race/VB
  - the/DT race/NN
  - will/MD race/VB
  - This leads to hand-crafted rule-based POS tagging (J&M, 8.4)
- Sentences can also contain unknown words for which tags have to be guessed: Secretariat/NNP is/VBZ
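The ambiguity figures above come from counting, for each word type, how many distinct tags it receives in a tagged corpus. A minimal sketch of that count (Python, with a toy tagged corpus standing in for Brown) might look like this:

```python
# Sketch: measuring tag ambiguity in a POS-annotated corpus.
# Assumes the corpus is available as a list of (word, tag) pairs;
# a tiny toy corpus stands in for the Brown corpus here.
from collections import defaultdict

tagged_tokens = [
    ("The", "DT"), ("race", "NN"), ("will", "MD"), ("race", "VB"),
    ("to", "TO"), ("race", "VB"), ("still", "RB"), ("still", "JJ"),
]

tags_per_type = defaultdict(set)
for word, tag in tagged_tokens:
    tags_per_type[word.lower()].add(tag)

ambiguous_types = {w for w, tags in tags_per_type.items() if len(tags) > 1}
ambiguous_tokens = sum(1 for w, _ in tagged_tokens if w.lower() in ambiguous_types)

print(f"{len(ambiguous_types)}/{len(tags_per_type)} word types are ambiguous")
print(f"{ambiguous_tokens}/{len(tagged_tokens)} tokens are ambiguous")
```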
POS Tagging Methods

- First, we'll look at how POS-annotated corpora are constructed
- Then, we'll turn to various POS tagging methods, which rely on POS-annotated corpora
  - Supervised learning: use POS-annotated corpora as training material
    - HMMs, TBL, Maximum Entropy, MBL
  - Unsupervised learning: use training material which is unannotated
    - HMMs, TBL
- Two main categories of taggers: rule-based and stochastic
Constructing POS-Annotated Corpora

- By examining how POS-annotated corpora are created, we will understand the task even better
- Basic steps:
  1. Tokenize the corpus
  2. Design a tagset, along with guidelines
  3. Apply the tagset to the text
- Knowing what issues the corpus annotators had to deal with in hand-tagging tells us the kinds of issues we'll have to deal with in automatic tagging
1. Tokenize the Corpus

- The corpus is first segmented into sentences, and then each sentence is tokenized into individual tokens (words)
- Punctuation is usually split off
  - Distinguishing sentence-ending periods from abbreviation periods is not always easy, etc.
- Can become tricky:
  - Proper nouns and other multi-word units: one token or separate tokens? e.g., Jimmy James, in spite of
  - Contractions are often split into separate words: can't -> can n't (because these are two different POSs)
  - Hyphenated words are sometimes split, sometimes kept together
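As a rough illustration of the splitting described above, here is a small regular-expression tokenizer sketch; it is a simplification for illustration, not the exact procedure used for any particular corpus:

```python
# Sketch of the kind of tokenization described above: split off punctuation
# and separate common contractions.
import re

CONTRACTIONS = re.compile(r"\b(\w+)(n't|'re|'ve|'ll|'d|'s|'m)\b", re.IGNORECASE)
PUNCT = re.compile(r'([.,!?;:"()])')

def tokenize(sentence):
    text = CONTRACTIONS.sub(r"\1 \2", sentence)   # don't -> do n't, it's -> it 's
    text = PUNCT.sub(r" \1 ", text)               # split punctuation off words
    return text.split()

print(tokenize("They can't go to Mr. Smith's house."))
# ['They', 'ca', "n't", 'go', 'to', 'Mr', '.', 'Smith', "'s", 'house', '.']
# Note the problems flagged above: "Mr." is wrongly split (abbreviation vs.
# sentence-ending period), and the contraction split here is PTB-style "ca n't".
```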
Some Other Preprocessing Issues

- Should other word-form information be included?
  - Lemma: the base, or root, form of a word. So cat and cats have the same lemma, cat
  - Sometimes words are lemmatized by stemming, other times by morphological analysis, using a dictionary and morphological rules
  - Sometimes it makes a big difference: roughly vs. rough
- Fold case or not (usually folded)?
  - The, the, THE
  - Mark versus mark
  - One may need, however, to regenerate the original case when presenting it to the user
2. Design the Tagset

- It is not obvious what the parts of speech are in any language, so a tagset has to be designed
- Criteria:
  - External: what do we need to capture the linguistic properties of the text?
  - Internal: what distinctions can be made reliably in a text?
- Some tagsets are compositional: each character in the tag has a particular meaning, e.g., Vr3s-f = verb-present-3rd-singular-<undefined_for_case>-female
- What do you do with multi-token words, e.g. in terms of?
Tagging Guidelines

- The tagset should be clearly documented and explained, with a "case law" of what to do in problematic cases
- If you don't do this, you will have problems in applying the tagset to the text
- If you have good documentation, you can achieve high inter-annotator agreement and a corpus of high quality
3. Apply Tagset to Text: PTB Tagging Process

1. Tagset developed
2. Automatic tagging by rule-based and statistical POS taggers, with error rates of 2-6%
3. Human correction using a text editor
   - It takes under a month for humans to learn this (at 15 hours a week), and annotation speeds after a month exceed 3,000 words/hour
   - Inter-annotator disagreement (4 annotators, eight 2,000-word docs) was 7.2% for the tagging task and 4.1% for the correcting task
POS Tagging Methods

- Two basic ideas to build from:
  - Establishing a simple baseline with unigrams
  - Hand-coded rules
- Machine learning techniques:
  - Supervised learning techniques
  - Unsupervised learning techniques
A Simple Strategy for POS Tagging

- Choose the most likely tag for each ambiguous word, independent of the previous words
  - i.e., assign each token the POS category it occurred as most often in the training set
  - E.g., race: which POS is more likely in a corpus?
- This strategy gives you about 90% accuracy in controlled tests
  - So this "unigram baseline" is what any other tagger must always be compared against
Example of the Simple Strategy

- Which POS is more likely in a corpus (1,273,000 tokens)?

            NN     VB     Total
  race      400    600    1000

- P(NN|race) = P(race & NN) / P(race), by the definition of conditional probability
  - P(race)      ≈ 1000/1,273,000 = .0008
  - P(race & NN) ≈  400/1,273,000 = .0003
  - P(race & VB) ≈  600/1,273,000 = .0005
- And so we obtain:
  - P(NN|race) = P(race & NN)/P(race) = .0003/.0008 = .4
  - P(VB|race) = P(race & VB)/P(race) = .0005/.0008 = .6
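A minimal sketch of this unigram ("most frequent tag") baseline in Python; the training list below just mirrors the toy race counts (400 NN vs. 600 VB), not a real corpus, and the unknown-word fallback tag is an assumed heuristic:

```python
# Sketch of the unigram ("most frequent tag") baseline described above.
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_tokens):
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    # For each word type, keep the tag it occurred with most often.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, model, default="NN"):
    # Unknown words fall back to a default tag (a common heuristic).
    return [(w, model.get(w.lower(), default)) for w in words]

training = [("race", "NN")] * 400 + [("race", "VB")] * 600 + [("the", "DT")] * 10
model = train_unigram_tagger(training)
print(tag(["the", "race", "Secretariat"], model))
# [('the', 'DT'), ('race', 'VB'), ('Secretariat', 'NN')]
```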
Hand-Coded Rules

- Two-stage system:
  - A dictionary assigns all possible tags to a word
  - Rules winnow the list down to a single tag (or sometimes multiple tags are left, if it cannot be determined)
  - Can also use some probabilistic information
- These systems can be highly effective, but they of course take time to write the rules
  - We'll see an example later of trying to learn the rules automatically (transformation-based learning)
Hand-Coded Rules: The ENGTWOL System

- Based on two-level morphology
- Uses a 56,000-word lexicon which lists the parts of speech for each word
- Each entry is annotated with a set of morphological and syntactic features
ENGTWOL Lexicon

(Sample lexicon entries not reproduced in this transcript)
ENGTWOL Tagger

- First stage: return all possible POS tags
  - Input: "Pavlov had shown that salivation"
  - Output:

      Pavlov      PAVLOV N NOM SG PROPER
      had         HAVE V PAST VFIN SVO
                  HAVE PCP2 SVO
      shown       SHOW PCP2 SVOO SVO SV
      that        ADV
                  PRON DEM SG
                  DET CENTRAL DEM SG
                  CS
      salivation  N NOM SG
ENGTWOL Tagger (continued)

- Second stage: apply a large set of constraints (3,744 constraints) to winnow down the possible tags
Machine Learning

- Machines can learn from examples
- Learning can be supervised or unsupervised
- Given training data, machines analyze the data and learn rules which generalize to new examples
  - Can be sub-symbolic (a rule may be a mathematical function), e.g., neural nets
  - Or it can be symbolic (rules are in a representation similar to that used for hand-coded rules)
- In general, machine learning approaches allow for more tuning to the needs of a corpus, and can be reused across corpora
Ways of Learning

- Supervised learning:
  - A machine learner learns the patterns found in an annotated corpus
- Unsupervised learning:
  - A machine learner learns the patterns found in an unannotated corpus
  - Often uses another database of knowledge, e.g., a dictionary of possible tags
- Techniques used in supervised learning can be adapted to unsupervised learning, as we will see
HMMs

- Using probabilities in tagging was first proposed in 1965, but the approach was fully developed in the 1980s (Marshall, 1983; Garside, 1987; Church, 1988; DeRose, 1988)
- HMM tagging is a special case of Bayesian inference
Transition / Word Likelihood Probabilities

- P(NN|DT) and P(JJ|DT) are high, while P(DT|JJ) is low
HMMs: A Probabilistic Approach

- What we want to do is find the "best sequence" of POS tags T = T1..Tn for a sentence W = W1..Wn
  - (Here Ti is pos_tag(Wi))
  - i.e., find the sequence of POS tags T that maximizes P(T|W)
- Using Bayes' Rule, we can say:
  P(T|W) = P(W|T) * P(T) / P(W)
  - P(W|T) is the word likelihood, and P(T) is the tag-transition prior
  - We want to find the value of T which maximizes the right-hand side
  - The denominator P(W) can be discarded (it's the same for every T)
  - So: find T which maximizes P(W|T) * P(T)
- Example: He will race
  - W = W1 W2 W3 = He will race; T = T1 T2 T3
  - Possible tag sequences:
    - He/PRP will/MD race/NN (T = PRP MD NN)
    - He/PRP will/NN race/NN (T = PRP NN NN)
    - He/PRP will/MD race/VB (T = PRP MD VB)
    - He/PRP will/NN race/VB (T = PRP NN VB)
N-gram Models

- POS problem formulation:
  - Given a sequence of words, find the sequence of categories that maximizes P(T1…Tn | W1…Wn)
  - i.e., that maximizes P(W1…Wn | T1…Tn) * P(T1…Tn) (by Bayes' Rule)
- Chain rule of probability:
  - P(W|T) = ∏ i=1..n P(Wi | W1…Wi-1, T1…Ti)
    (the probability of this word based on the previous words and tags)
  - P(T) = ∏ i=1..n P(Ti | W1…Wi, T1…Ti-1)
    (the probability of this tag based on the previous words and tags)
- But we don't have sufficient data for this, and we would likely overfit the data, so we make some assumptions to simplify the problem …
Independence Assumptions

- Assume that the current event is based only on the previous n-1 events (i.e., for a bigram model, on the previous event alone)
- P(T1…Tn) ≈ ∏ i=1..n P(Ti | Ti-1)
  - Assumes that the event of a POS tag occurring is independent of the event of any other POS tag occurring, except for the immediately previous POS tag
  - From a linguistic standpoint, this seems an unreasonable assumption, due to long-distance dependencies
- P(W1…Wn | T1…Tn) ≈ ∏ i=1..n P(Wi | Ti)
  - Assumes that the event of a word appearing in a category is independent of the event of any surrounding word or tag, except for the tag at this position
Independence Assumptions (continued)

- P(Ti | Ti-1) represents the probability of a tag given the previous tag
  - Determiners are very likely to precede adjectives and nouns, as in sequences like that/DT flight/NN and the/DT yellow/JJ hat/NN
  - P(NN|DT) and P(JJ|DT) are high, but P(DT|JJ) is low
Hidden Markov Models

- Linguists know both these assumptions are incorrect!
- But, nevertheless, statistical approaches based on these assumptions work pretty well for part-of-speech tagging
- In particular, one called a Hidden Markov Model (HMM)
  - Very widely used in both POS tagging and speech recognition, among other problems
  - A Markov model, or Markov chain, is just a weighted finite-state automaton
HMMs = Weighted FSAs

- Nodes (or states): tag options
- Edges: each edge from a node has a (transition) probability to another state
  - The sum of the transition probabilities of all edges going out from a node is 1
- The probability of a path is the product of the probabilities of the edges along the path
  - Why multiply? Because of the assumption of independence (the Markov assumption)
- The probability of a string is the sum of the probabilities of all the paths for the string
  - Why add? Because we are considering the union of all interpretations
  - A string that isn't accepted has probability zero
Example

(Figure not reproduced in this transcript)
HMM

- A weighted finite-state automaton
- Markov chain: a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through
  - Uses only observed events; in the POS tagging problem, only the words
  - Not useful for POS tagging by itself, because of ambiguity
- Hidden Markov Model: uses both
  - Observed events (the words that we see in the sentence)
  - Hidden events (the part-of-speech tags)
Hidden Markov Models

- Why the word "hidden"?
- Called "hidden" because one cannot tell what state the model is in for a given sequence of words
  - e.g., will could be generated from the NN state or from the MD state, with different probabilities
- That is, looking at the words will not tell you directly what tag state you're in
POS Tagging Based on Bigrams

- Problem: find T which maximizes P(W|T) * P(T)
  - Here W = W1…Wn and T = T1…Tn
- Using the bigram model, we get:
  - Transition probabilities (prob. of transitioning from one state/tag to another):
    P(T1…Tn) ≈ ∏ i=1..n P(Ti | Ti-1)
  - Emission probabilities (prob. of emitting a word at a given state):
    P(W1…Wn | T1…Tn) ≈ ∏ i=1..n P(Wi | Ti)
- So, we want to find the value of T1…Tn which maximizes:
  ∏ i=1..n P(Wi | Ti) * P(Ti | Ti-1)
Using POS Bigram Probabilities: Transitions

- P(T1…Tn) ≈ ∏ i=1..n P(Ti | Ti-1)
- Example: He will race
- Choices for T = T1..T3:
  - T = PRP MD NN
  - T = PRP NN VB
  - T = PRP MD VB
  - T = PRP NN NN
- POS bigram probabilities estimated from a training corpus can be used for P(T)

  POS bigram probabilities P(column | row):

  P(col|row)   MD    NN    VB    PRP
  MD           0     .4    .6    0
  NN           0     .3    .7    0
  PRP          .8    .2    0     0
  <s>          0     0     0     1

- E.g., P(PRP-MD-NN) = 1 * .8 * .4 = .32
- (The slide also shows these probabilities as a weighted transition network over the states <s>, PRP, MD, NN, and VB)
Factoring in Lexical Generation Probabilities

- From the training corpus, we need to find the T1…Tn which maximizes
  ∏ i=1..n P(Wi | Ti) * P(Ti | Ti-1)
- So we'll also need to factor in the lexical generation (emission) probabilities:

  Lexical generation probabilities P(word | tag):

  P(word|tag)  MD    NN    VB    PRP
  he           0     0     0     .3
  will         .8    .2    0     0
  race         0     .4    .6    0

- (The slide pairs this table with the transition network over <s>, PRP, MD, NN, and VB, whose nodes are labeled A through F)
Adding Emission Probabilities

- Each transition in the network is now paired with an emission, e.g., he|PRP .3, will|MD .8, will|NN .2, race|NN .4, race|VB .6, while the start state <s> emits nothing
- Like a weighted FSA, except that there are output (emission) probabilities associated with each node

  Lexical generation probabilities P(word | tag):

  P(word|tag)  MD    NN    VB    PRP
  he           0     0     0     .3
  will         .8    .2    0     0
  race         0     .4    .6    0

  POS bigram probabilities P(column | row):

  P(col|row)   MD    NN    VB    PRP
  MD           0     .4    .6    0
  NN           0     .3    .7    0
  PRP          .8    .2    0     0
  <s>          0     0     0     1
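Putting the two tables together, a brute-force sketch can score all four candidate tag sequences for he will race with the factorization P(W|T) * P(T) ≈ ∏ P(Wi|Ti) * P(Ti|Ti-1). The TRANS and EMIT dictionaries simply transcribe the toy tables above:

```python
# Sketch: scoring the four candidate tag sequences for "he will race" with the
# bigram HMM factorization, using the transition and lexical generation tables
# from the slides above.
TRANS = {   # P(tag | previous tag); "<s>" is the start state
    ("<s>", "PRP"): 1.0,
    ("PRP", "MD"): 0.8, ("PRP", "NN"): 0.2,
    ("MD", "NN"): 0.4, ("MD", "VB"): 0.6,
    ("NN", "NN"): 0.3, ("NN", "VB"): 0.7,
}
EMIT = {    # P(word | tag)
    ("he", "PRP"): 0.3,
    ("will", "MD"): 0.8, ("will", "NN"): 0.2,
    ("race", "NN"): 0.4, ("race", "VB"): 0.6,
}

def score(words, tags):
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= TRANS.get((prev, t), 0.0) * EMIT.get((w, t), 0.0)
        prev = t
    return p

words = ["he", "will", "race"]
for tags in [("PRP", "MD", "NN"), ("PRP", "NN", "NN"),
             ("PRP", "MD", "VB"), ("PRP", "NN", "VB")]:
    print(tags, round(score(words, tags), 6))
# The highest-scoring sequence is PRP MD VB (1*.3 * .8*.8 * .6*.6 = .06912)
```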
HMM Formulation

(Two slides giving the formal HMM formulation are not reproduced in this transcript)

Emission & Transition Probability

(Figure not reproduced in this transcript)
Dynamic Programming

- In order to find the most likely sequence of categories for a sequence of words, we don't need to enumerate all possible sequences of categories
- Because of the Markov assumption, if you keep track of the most likely sequence found so far for each possible ending category, you can ignore all the other, less likely, sequences
  - i.e., multiple edges may come into a state, but only the value of the most likely path is kept
  - This is another use of dynamic programming
- The algorithm that does this is called the Viterbi algorithm
The Viterbi Algorithm

- Assume we're at state I in the HMM
- Obtain:
  - the probability of each previous state H1…Hm
  - the transition probabilities of H1-I, …, Hm-I
  - the emission probability for word w at I
- Multiply the probabilities for each new path:
  - P(H1,I) = Score(H1) * P(I|H1) * P(w|I)
  - …
- One of these states (H1…Hm) will give the highest probability
  - Only keep the highest probability when using I for the next state
Finding the Best Path Through an HMM

  Lexical generation probabilities P(word | tag):

  P(word|tag)  MD    NN    VB    PRP
  he           0     0     0     .3
  will         .8    .2    0     0
  race         0     .4    .6    0

  (The slide traces the paths through the network: <s> -> B (he|PRP) -> C (will|MD) or D (will|NN) -> E (race|NN) or F (race|VB))

- Viterbi recurrence: Score(I) = Max over H < I of [Score(H) * transition(I|H)] * lex(I)
- Score(B) = P(PRP|<s>) * P(he|PRP) = 1 * .3 = .3
- Score(C) = Score(B) * P(MD|PRP) * P(will|MD) = .3 * .8 * .8 ≈ .19
- Score(D) = Score(B) * P(NN|PRP) * P(will|NN) = .3 * .2 * .2 = .012
- Score(E) = Max [Score(C) * P(NN|MD), Score(D) * P(NN|NN)] * P(race|NN)
- Score(F) = Max [Score(C) * P(VB|MD), Score(D) * P(VB|NN)] * P(race|VB)
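The same toy model can be run through a minimal Viterbi sketch. The TRANS and EMIT tables repeat the slides' numbers so the sketch is self-contained, and the code keeps only the best path into each tag, as the recurrence above describes:

```python
# A minimal Viterbi sketch over the toy HMM used in the slides above
# (tags PRP, MD, NN, VB; start state "<s>").
TRANS = {
    ("<s>", "PRP"): 1.0,
    ("PRP", "MD"): 0.8, ("PRP", "NN"): 0.2,
    ("MD", "NN"): 0.4, ("MD", "VB"): 0.6,
    ("NN", "NN"): 0.3, ("NN", "VB"): 0.7,
}
EMIT = {
    ("he", "PRP"): 0.3,
    ("will", "MD"): 0.8, ("will", "NN"): 0.2,
    ("race", "NN"): 0.4, ("race", "VB"): 0.6,
}
TAGS = ["PRP", "MD", "NN", "VB"]

def viterbi(words):
    # scores[t] = (best probability of a tag path ending in tag t, that path).
    # Assumes every word has at least one tag with a nonzero emission.
    scores = {"<s>": (1.0, [])}
    for w in words:
        new_scores = {}
        for t in TAGS:
            emit = EMIT.get((w, t), 0.0)
            if emit == 0.0:
                continue
            # Keep only the best incoming path for each tag (the DP step).
            prev, (p, path) = max(
                scores.items(),
                key=lambda kv: kv[1][0] * TRANS.get((kv[0], t), 0.0))
            new_scores[t] = (p * TRANS.get((prev, t), 0.0) * emit, path + [t])
        scores = new_scores
    return max(scores.values())          # (probability, best tag sequence)

print(viterbi(["he", "will", "race"]))   # ≈ (0.06912, ['PRP', 'MD', 'VB'])
```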
Smoothing

- Lexical generation probabilities will lack observations for low-frequency and unknown words
- Most systems do one of the following:
  - Smooth the counts
    - E.g., add a small number to unseen events (zero counts); for example, assume a bigram not seen in the data has a very small probability, e.g., .0001
  - Use lots more data (but you'll still need to smooth)
  - Group items into classes, thus increasing class frequency
    - e.g., group words into ambiguity classes, based on their set of tags; for counting, all words in an ambiguity class are treated as variants of the same "word"
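A small sketch of the first option, add-k smoothing of the lexical generation counts; all counts, tags, and the vocabulary size here are illustrative:

```python
# Sketch of add-k smoothing for lexical generation probabilities: unseen
# (word, tag) pairs get a small non-zero count k instead of zero.
from collections import Counter

def smoothed_emission(counts, tag_totals, vocab_size, k=0.5):
    """Return a function (word, tag) -> P(word|tag) with add-k smoothing."""
    def p(word, tag):
        return (counts[(word, tag)] + k) / (tag_totals[tag] + k * vocab_size)
    return p

counts = Counter({("race", "NN"): 4, ("race", "VB"): 6, ("the", "DT"): 10})
tag_totals = Counter({"NN": 10, "VB": 10, "DT": 10})
p = smoothed_emission(counts, tag_totals, vocab_size=3)
print(p("race", "NN"))       # (4 + .5) / (10 + .5*3) ≈ 0.39
print(p("blargh", "NN"))     # an unseen word still gets a small probability
```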
Extending the HMM Algorithm to Trigrams

- Condition each tag on the two previous tags: P(Ti | Ti-2, Ti-1)
- Main issue: sparseness
2. TBL: A Symbolic Learning Method

- HMMs are sub-symbolic: they don't give you rules that you can inspect
- A method called error-driven Transformation-Based Learning (TBL), the Brill algorithm, can be used for symbolic learning
  - The rules (actually, a sequence of rules) are learned from an annotated corpus
  - It performs at least as accurately as other statistical approaches
  - It has better treatment of context compared to HMMs:
    - Rules can use the next (or previous) POS, whereas HMMs just use P(Ci|Ci-1) or P(Ci|Ci-2,Ci-1)
    - Rules can use the previous (next) word, whereas HMMs just use P(Wi|Ci)
Brill Algorithm (Overview)

- Assume you are given a training corpus G (for "gold standard")
- First, create a tag-free version V of it
- Then:
  1. Label every word token in V with the most likely tag for that word type from G. If this "initial state annotator" is perfect, you're done!
  2. Consider every possible transformational rule, selecting the one that leads to the most improvement in V, using G to measure the error
  3. Retag V based on this rule
  4. Go back to 2, until there is no significant improvement in accuracy over the previous iteration
- Notes:
  - As the algorithm proceeds, each successive rule becomes narrower (covering fewer examples, i.e., changing fewer tags), but also potentially more accurate
  - Some later rules may change tags changed by earlier rules
Rule Templates

- Brill's method learns transformations which fit different templates:
  - Change tag X to tag Y when the previous word is W
    - NN -> VB when previous word = to
  - Change tag X to tag Y when the next tag is Z
    - NN -> NNP when next tag = NNP
  - Change tag X to tag Y when the previous 1st, 2nd, or 3rd word is W
    - VBP -> VB when one of the previous 3 words = has
- The learning process is guided by a small number of templates (e.g., 26) to learn specific rules from the corpus
- Note how these rules roughly match linguistic intuition
Error-Driven Method

- How does one learn the rules?
- The TBL method is error-driven
  - The rule which is learned on a given iteration is the one which reduces the error rate of the corpus the most, e.g.:
    - Rule 1 fixes 50 errors but introduces 25 more: net decrease is 25
    - Rule 2 fixes 45 errors but introduces 15 more: net decrease is 30
    - Choose Rule 2 in this case
- We set a stopping criterion, or threshold: once we stop reducing the error rate by a big enough margin, learning is stopped
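A sketch of this selection step, scoring candidate rules (here, of the "change X to Y when the previous tag is Z" template) by their net error reduction against the gold tags; the corpus and candidate rules are illustrative:

```python
# Sketch of error-driven rule selection: score each candidate rule by its
# net error reduction on the current tagging and keep the best one.
def apply_rule(tags, rule):
    x, y, prev = rule                       # change x -> y when previous tag is prev
    new = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == x and tags[i - 1] == prev:
            new[i] = y
    return new

def net_error_reduction(current, gold, rule):
    proposed = apply_rule(current, rule)
    fixed = sum(1 for c, p, g in zip(current, proposed, gold) if c != g and p == g)
    broken = sum(1 for c, p, g in zip(current, proposed, gold) if c == g and p != g)
    return fixed - broken

gold    = ["VBZ", "VBN", "TO", "VB", "NN"]   # is expected to race tomorrow
current = ["VBZ", "VBN", "TO", "NN", "NN"]   # initial-state annotator output
candidates = [("NN", "VB", "TO"), ("NN", "VB", "VBN")]
best = max(candidates, key=lambda r: net_error_reduction(current, gold, r))
print(best)   # ('NN', 'VB', 'TO'): the rule with the largest net improvement
```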
Brill Algorithm (More Detailed)

1. Label every word token with its most likely tag (based on lexical generation probabilities)
2. List the positions of tagging errors and their counts, by comparing with the "truth" (T)
3. For each error position, consider each instantiation I of X, Y, and Z in the rule template. If Y = T, increment improvements[I], else increment errors[I]
4. Pick the I which results in the greatest error reduction, and add it to the output
   - e.g., VB NN PREV1OR2TAG DT improves 98 errors but produces 18 new errors, so a net decrease of 80 errors
5. Apply that I to the corpus
6. Go to 2, unless the stopping criterion is reached

Worked example:
- Most likely tags: P(NN|race) = .98, P(VB|race) = .02
  - Is/VBZ expected/VBN to/TO race/NN tomorrow/NN
- Rule template: change a word from tag X to tag Y when the previous tag is Z
- Rule instantiation for the above example: NN VB PREV1OR2TAG TO
- Applying this rule yields:
  - Is/VBZ expected/VBN to/TO race/VB tomorrow/NN
Example of Error Reduction

(Figure from Eric Brill (1995), Computational Linguistics, 21(4), p. 7, not reproduced in this transcript)
Rule Ordering

- One rule is learned with every pass through the corpus
  - The set of final rules is the final output
  - Unlike with HMMs, such a representation allows a linguist to look through the rules and make sense of them
- Thus, the rules are learned iteratively and must be applied in an iterative fashion
  - At one stage, it may make sense to change NN to VB after to
  - But at a later stage, it may make sense to change VB back to NN in the same context, e.g., if the current word is school
Example of a Learned Rule Sequence

1. NN VB PREVTAG TO
   - to/TO race/NN -> VB
2. VBP VB PREV1OR2OR3TAG MD
   - might/MD vanish/VBP -> VB
3. NN VB PREV1OR2TAG MD
   - might/MD not/MD reply/NN -> VB
4. VB NN PREV1OR2TAG DT
   - the/DT great/JJ feast/VB -> NN
5. VBD VBN PREV1OR2OR3TAG VBZ
   - He/PRP was/VBZ killed/VBD -> VBN by/IN Chapman/NNP
Handling Unknown Words

- The Brill method can also be used to learn how to tag unknown words
- Instead of using surrounding words and tags, use affix information, capitalization, etc.
  - Example learned rule sequence for unknown words:
    - Guess NNP if capitalized, NN otherwise
    - Or use the tag most common for words ending in the same last 3 letters
    - etc.
- TBL has also been applied to some parsing tasks
Insights on TBL

- TBL takes a long time to train, but is relatively fast at tagging once the rules are learned
- The rules in the sequence may be decomposed into non-interacting subsets, i.e., you can focus only on VB tagging (and need only look at the rules which affect it)
- In cases where the data is sparse, the initial guess needs to be weak enough to allow for learning
- Rules become increasingly specific as you go down the sequence
  - However, the more specific rules don't overfit, because they cover just a few cases
Tag Indeterminacy

Some tokens are annotated with an indeterminate tag (more than one tag could be correct). Options for handling them:
1. Somehow replace the indeterminate tags with only one tag
2. In testing, count a tagger as having correctly tagged an indeterminate token if it gives either of the correct tags; in training, somehow choose only one of the tags for the word
3. Treat the indeterminate tag as a single complex tag
Tokenization

- Word tokens
- Period detection
- Word splitting
  - would/MD n't/RB
- Multi-part words
  - New York
Unknown Words

- Simplest approach: pretend that each unknown word is ambiguous among all possible tags, with equal probability
  - Rely on contextual POS trigrams to suggest the proper tag
- Better: the probability distribution of tags over unknown words is very similar to the distribution of tags over words that occurred only once in the training set (hapax legomena: words that occur once)
  - This is the idea behind the Good-Turing smoothing algorithm
  - Unknown words and hapax legomena are similar in that they are both most likely to be nouns, followed by verbs, but are very unlikely to be determiners or interjections
Unknown Words (continued)

- Use more information: morphology
  - Ends in -s: likely a plural noun
  - Ends in -ed: likely a past tense or past participle verb
  - Ends in -able: likely an adjective
- Use orthographic information: words starting with a capital letter are likely proper nouns
- One approach uses 3 inflectional endings (-ed, -s, -ing), 32 derivational endings (such as -ion, -al, -ive, and -ly), and 4 values of capitalization, depending on whether a word is sentence-initial (+/- capitalization, +/- initial) and whether the word is hyphenated
- P(wi|ti) = P(unknown-word|ti) * P(capital|ti) * P(endings/hyph|ti)
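A sketch of that unknown-word emission estimate; every probability table here is an illustrative stand-in rather than a trained value:

```python
# Sketch of the unknown-word emission estimate above,
# P(w|t) ~ P(unknown-word|t) * P(capitalized|t) * P(suffix|t).
P_UNK = {"NN": 0.4, "NNP": 0.3, "VB": 0.2, "JJ": 0.1}          # P(unknown-word|t)
P_CAP = {"NN": 0.1, "NNP": 0.9, "VB": 0.05, "JJ": 0.05}        # P(capitalized|t)
P_SUFFIX = {                                                    # P(suffix|t)
    "-ed":   {"NN": 0.05, "NNP": 0.05, "VB": 0.6, "JJ": 0.2},
    "-able": {"NN": 0.05, "NNP": 0.05, "VB": 0.05, "JJ": 0.6},
    "other": {"NN": 0.5,  "NNP": 0.5,  "VB": 0.3,  "JJ": 0.2},
}

def unknown_emission(word, tag):
    cap = P_CAP[tag] if word[0].isupper() else 1.0 - P_CAP[tag]
    suffix = next((s for s in ("-ed", "-able") if word.endswith(s[1:])), "other")
    return P_UNK[tag] * cap * P_SUFFIX[suffix][tag]

for tag in ("NN", "NNP", "VB", "JJ"):
    print(tag, round(unknown_emission("Zalumed", tag), 4))
# NNP scores highest for this capitalized, -ed-final nonce word, with VB second.
```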
Unknown Words (continued)

- An HMM-based approach: estimate P(ti | l(n-i+1) … ln), i.e., the tag given the final letters of the word
- Train on open classes only
- Apply it only to words occurring fewer than 10 times
- Use Bayes' rule to calculate P(wi|ti)
3. Maximum Entropy

- Maximum entropy: model what you know as well as you can, and remain uncertain about the rest
- Set up some features, similar to Brill's templates
- When tagging, the features which are true determine the probability of a tag in a particular context
  - Make your probability model match the training data as well as it can, i.e., if an example matches features we've seen, we know exactly what to do
  - But the probability model also maximizes the entropy (uncertainty), so we are noncommittal to anything not seen in the training data
- E.g., MXPOST
4. Decision Tree Tagging

- Each tagging decision can be viewed as a series of questions
  - Each question should make a maximal division in the data, i.e., ask the question which gives you the most information at that point
    - Different metrics can be used here
  - Each question becomes more specific as you go down the tree
- Training consists of constructing a decision tree
- Tagging consists of using the decision tree to give us transition probabilities … and the rest plays out like an HMM
Decision Tree Example

A tree for nouns:

  Tag-1 = JJ?
    YES -> Tag-2 = DT?
             YES -> NN = .85, JJ = .14, ...
             NO  -> ...
    NO  -> ...
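A minimal sketch of this kind of decision-tree tagging: walk a tree of questions about the previous tags and read off a tag distribution at the leaf. The tree mirrors the toy figure above; the fallback branches and their numbers are invented for illustration:

```python
# Sketch: a decision tree over context questions that yields a tag distribution.
TREE = {
    "question": lambda ctx: ctx["tag-1"] == "JJ",
    "yes": {
        "question": lambda ctx: ctx["tag-2"] == "DT",
        "yes": {"leaf": {"NN": 0.85, "JJ": 0.14}},
        "no":  {"leaf": {"NN": 0.60, "JJ": 0.30}},   # illustrative fallback
    },
    "no": {"leaf": {"NN": 0.40, "VB": 0.35}},        # illustrative fallback
}

def tag_distribution(node, context):
    if "leaf" in node:
        return node["leaf"]
    branch = "yes" if node["question"](context) else "no"
    return tag_distribution(node[branch], context)

# Context for a word preceded by ".../DT .../JJ":
print(tag_distribution(TREE, {"tag-1": "JJ", "tag-2": "DT"}))
# {'NN': 0.85, 'JJ': 0.14}
```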
Other Techniques

- Neural networks
- Conditional random fields
- A network of linear functions
- Basically, whatever techniques are being used in machine learning are appropriate techniques to try for POS tagging
Unsupervised Learning

- Unsupervised learning:
  - Uses an unannotated corpus as training data
  - Instead, it has to use another database of knowledge, such as a dictionary of possible tags
- Unsupervised learning uses the same general techniques as supervised learning, but there are differences
- The advantage is that there is much more unannotated data to learn from
  - And annotated data isn't always available
Unsupervised Learning #1: HMMs

- We still want to use transition (P(ti|ti-1)) and emission (P(w|t)) probabilities, but how do we obtain them?
- During training, the values of the states (i.e., the tags) are unknown, or hidden
- Start by using a dictionary, to at least estimate the emission probabilities
- Using the forward-backward algorithm, iteratively derive better probability estimates
  - The best tag may change with each iteration
- Stop when the probability of the sequence of words has been (locally) maximized
Forward-Backward Algorithm

- The Forward-Backward (or Baum-Welch) algorithm for POS tagging works roughly as follows:
  1. Initialize all parameters (forward and backward probability estimates)
  2. Calculate new probabilities
     A. Re-estimate the probability for a given word in the sentence from the forward and backward probabilities
     B. Use that to re-estimate the forward and backward parameters
  3. Terminate when a local maximum for the string is reached
- NB: the initial probabilities for words are based on dictionaries … which you probably need tagged text to obtain
Unsupervised Learning #2: TBL

- With TBL, we want to learn rules over patterns, but how can we learn the rules if there's no annotated data?
- Main idea: look at the distribution of unambiguous words to guide the disambiguation of ambiguous words
- Example: the can, where can can be a noun, modal, or verb
  - Take unambiguous words from a dictionary and count their occurrences after the
    - the elephant
    - the guardian
  - Conclusion: immediately after the, nouns are more common than verbs or modals
Unsupervised TBL

- Initial-state annotator
  - Supervised: assign a random tag to each word
  - Unsupervised: for each word, list all tags in the dictionary
- The templates change accordingly
  - Transformation template: change tag χ of a word to tag Y if the previous (next) tag (word) is Z, where χ is a set of one or more tags
  - Don't change any other tags
Error Reduction in the Unsupervised Method

- Let a rule changing χ to Y in context C be represented as Rule(χ, Y, C):
  - Rule1: {VB, MD, NN} -> NN PREVWORD the
  - Rule2: {VB, MD, NN} -> VB PREVWORD the
- Idea: since annotated data isn't available, score rules so as to prefer those where Y appears much more frequently in the context C than all the other tags in χ
  - Frequency is measured by counting unambiguously tagged words
  - So, prefer {VB, MD, NN} -> NN PREVWORD the over {VB, MD, NN} -> VB PREVWORD the, since dictionary-unambiguous nouns are more common after the in a corpus than dictionary-unambiguous verbs
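A sketch of this scoring idea, counting unambiguous occurrences (per a dictionary) of each candidate tag after the context word; the dictionary, corpus, and the exact comparison used are illustrative simplifications of Brill's unsupervised criterion:

```python
# Sketch: score Rule(chi, Y, C) by how often dictionary-unambiguous words of
# tag Y occur in context C, relative to the other tags in chi.
DICTIONARY = {            # possible tags per word (an external resource)
    "elephant": {"NN"}, "guardian": {"NN"}, "walk": {"VB", "NN"},
    "must": {"MD"}, "see": {"VB"}, "can": {"VB", "MD", "NN"}, "the": {"DT"},
}
corpus = "the elephant must see the guardian walk the can".split()

def score_rule(chi, y, prev_word):
    # Count unambiguous occurrences of each tag in chi right after prev_word.
    counts = {t: 0 for t in chi}
    for prev, word in zip(corpus, corpus[1:]):
        tags = DICTIONARY.get(word, set())
        if prev == prev_word and len(tags) == 1 and next(iter(tags)) in chi:
            counts[next(iter(tags))] += 1
    others = sum(c for t, c in counts.items() if t != y)
    return counts[y] - others          # one simple way to compare Y vs. the rest

chi = {"VB", "MD", "NN"}
print(score_rule(chi, "NN", "the"))    # 2  (two unambiguous nouns follow "the")
print(score_rule(chi, "VB", "the"))    # -2 (no unambiguous verbs do): prefer the NN rule
```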
Using Weakly Supervised Methods

- Supervised and unsupervised learning can be mixed to improve performance. For TBL tagging:
  - Use the unsupervised method for the initial-state tagger
  - Use the supervised method to adjust the tags
- Advantage of weakly supervised over purely supervised learning:
  - Lots of training data is available for the unsupervised part
  - Less training data is available for the supervised part
Voting Taggers

- A more recent technique which seems to improve performance is to set up a system of tagger voting
  - Train multiple taggers on a corpus and learn not only about individual performance, but also about how they interact
- Taggers can vote in different ways:
  - Each tagger gets a fraction of the vote
  - For certain tagging decisions, one tagger is always right
- Works best when the taggers are looking at different, or complementary, information
  - Intuition: if taggers A, B, and C are all looking at different information and two of them agree, that's better evidence than the one that disagrees
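A minimal sketch of one voting scheme mentioned above, weighted voting, where each tagger contributes a fraction of the vote for its proposed tag; the taggers, their outputs, and the weights are stand-ins, not real systems:

```python
# Sketch: combine several taggers' outputs by weighted voting per token.
from collections import defaultdict

def vote(predictions, weights):
    """predictions: {tagger_name: [tag, ...]}, weights: {tagger_name: float}."""
    n = len(next(iter(predictions.values())))
    result = []
    for i in range(n):
        totals = defaultdict(float)
        for tagger, tags in predictions.items():
            totals[tags[i]] += weights[tagger]
        result.append(max(totals, key=totals.get))
    return result

predictions = {
    "hmm":    ["PRP", "MD", "NN"],
    "tbl":    ["PRP", "MD", "VB"],
    "maxent": ["PRP", "NN", "VB"],
}
weights = {"hmm": 0.4, "tbl": 0.35, "maxent": 0.25}
print(vote(predictions, weights))   # ['PRP', 'MD', 'VB']
```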
Ambiguity Classes

- Another way that POS tagging methods have been improved is by grouping words together into classes
  - Words are grouped together if they have the same set of possible tags
- New probabilities can then be calculated based on these classes
  - This can help if a word occurs very rarely but other members of its class are more frequent
Summary: POS Tagging

- Part-of-speech tagging can use:
  - Hand-crafted rules based on inspecting a corpus
  - Machine learning approaches based on corpus statistics
  - Machine learning approaches using rules derived automatically from a corpus
  - And almost any other machine learning technique you can think of
- When machine learning is used, it can be:
  - Supervised (using an annotated corpus)
  - Unsupervised (using an unannotated corpus)
- Combinations of different methods often improve performance
Practice

- Exercises 5.1 and 5.2
- Small project: Exercise 5.8, applied to Persian (due 30 Azar)