Transcript Course 6

Computational Lexicology,
Morphology and Syntax
Course 6
Diana Trandabăț
Academic year: 2015-2016
1
Word Classes
• Words that somehow ‘behave’ alike:
–Appear in similar contexts
–Perform similar functions in sentences
–Undergo similar transformations
• ~10 traditional word classes, or parts of speech (POS)
–Noun, verb, adjective, preposition, adverb, article,
interjection, pronoun, conjunction, numeral
2
Some Examples
• N    noun         chair, bandwidth, pacing
• V    verb         study, debate, munch
• ADJ  adjective    purple, tall, ridiculous
• ADV  adverb       unfortunately, slowly
• P    preposition  of, by, to
• PRO  pronoun      I, me, mine
• DET  determiner   the, a, that, those
3
Ambiguity in POS Tagging
• “Like” can be a verb or a preposition
– I like/VBP candy.
– Time flies like/IN an arrow.
• “Around” can be a preposition, particle, or adverb
– I bought it at the shop around/IN the corner.
– I never got around/RP to getting a car.
– A new Prius costs around/RB $25K.
4
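This kind of ambiguity can be observed directly with an off-the-shelf tagger. A minimal sketch, assuming NLTK with its 'punkt' and 'averaged_perceptron_tagger' data packages installed; the exact tags returned depend on the tagger model:

```python
# Minimal sketch: tagging two sentences where "like" plays different roles.
import nltk

for sentence in ["I like candy.", "Time flies like an arrow."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# Typical (model-dependent) output:
#   [('I', 'PRP'), ('like', 'VBP'), ('candy', 'NN'), ('.', '.')]
#   [('Time', 'NNP'), ('flies', 'VBZ'), ('like', 'IN'), ('an', 'DT'), ('arrow', 'NN'), ('.', '.')]
```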
Closed vs. Open class of words
• Closed class categories are composed of a small, fixed
set of grammatical words for a given language.
– Prepositions: of, in, by, …
– Auxiliary verbs: may, can, will, had, been, …
– Pronouns: I, you, she, mine, his, them, …
– Usually function words (short common words which play a
role in grammar)
• Open class categories have a large number of words, and
new ones are easily invented.
– English has 4 categories: Nouns, Verbs, Adjectives, Adverbs
– Nouns: Googler
– Verbs: Google
– Adjectives: geeky
5
Open Class Words
• Nouns
–Proper nouns
• Columbia University, New York City, Arthi
Ramachandran, Metropolitan Transit Center
• English (and Romanian, and many other languages)
capitalizes these
• Many have abbreviations
–Common nouns
• All the rest
6
Count nouns vs. mass nouns
• Count: Have plurals, countable: goat/goats, one goat, two
goats
• Mass: Not countable (fish, salt, communism) (?two fishes)
• Invariable nouns (same form for singular and plural)
  – ochi 'eye(s)', ardei 'pepper(s)', nume 'name(s)'
• Singularia tantum (nouns having only a singular form)
  – dreptate 'justice', studenţime 'student body', aur 'gold', soare 'sun'
• Pluralia tantum (nouns having only a plural form)
  – icre 'roe', ochelari 'glasses', rechizite 'school supplies'
7
• Adjectives: identify properties or qualities
of nouns
–Color, size, age, …
–Adjective ordering restrictions in English:
• Old blue book, not Blue old book
–In Korean, adjectives are realized as verbs
• Adverbs: also modify things (verbs,
adjectives, adverbs)
–The very happy man walked home extremely
slowly yesterday.
8
– Directional/locative adverbs (here, home,
downhill)
– Degree adverbs (extremely, very, somewhat)
– Manner adverbs (slowly, silkily, delicately)
– Temporal adverbs (Monday, tomorrow)
• Verbs:
– In most languages take morphological affixes
(eat/eats/eaten)
– Represent actions (walk, ate), processes (provide,
see), and states (be, seem)
– Many subclasses, e.g.
• eats/V → eat/VB, eat/VBP, eats/VBZ, ate/VBD, eaten/VBN, eating/VBG, ...
• Reflect morphological form & syntactic function
How Do We Assign Words to Open or
Closed?
• Nouns denote people, places and things and
can be preceded by articles. But…
My typing is very bad.
*The Mary loves John.
• Verbs are used to refer to actions, processes,
states
–But some are closed class and some are open
I will have emailed everyone by noon.
• Adverbs modify actions
–Is Monday a temporal adverbial or a noun?
10
Closed Class Words
• Closed class words (Prep, Det, Pron, Conj,
Aux, Part, Num) are generally easy to
process, since we can enumerate them….
but
–Is it a particle or a preposition?
• George eats up his dinner/George eats his dinner up.
• George eats up the street/*George eats the street up.
–Articles come in 2 flavors: definite (the) and
indefinite (a, an)
• What is this in ‘this guy…’?
11
Choosing a POS Tagset
• To do POS tagging, first need to choose a set
of tags
• Could pick very coarse (small) tagsets
–N, V, Adj, Adv.
• More commonly used: Brown Corpus (Francis
& Kucera ‘82), 1M words, 87 tags – more
informative but more difficult to tag
• Most commonly used: Penn Treebank: hand-annotated corpus of Wall Street Journal, 1M
words, 45 tags
12
English Parts of Speech
• Noun (person, place or thing)
– Singular (NN): dog, fork
– Plural (NNS): dogs, forks
– Proper (NNP, NNPS): John, Springfields
• Pronouns
– Personal pronoun (PRP): I, you, he, she, it
– Wh-pronoun (WP): who, what
• Verb (actions and processes)
– Base, infinitive (VB): eat
– Past tense (VBD): ate
– Gerund (VBG): eating
– Past participle (VBN): eaten
– Non-3rd person singular present tense (VBP): eat
– 3rd person singular present tense (VBZ): eats
– Modal (MD): should, can
– To (TO): to (to eat)
13
English Parts of Speech (cont.)
• Adjective (modify nouns)
– Basic (JJ): red, tall
– Comparative (JJR): redder, taller
– Superlative (JJS): reddest, tallest
• Adverb (modify verbs)
– Basic (RB): quickly
– Comparative (RBR): quicker
– Superlative (RBS): quickest
• Preposition (IN): on, in, by, to, with
• Determiner:
– Basic (DT) a, an, the
– WH-determiner (WDT): which, that
• Coordinating Conjunction (CC): and, but, or
• Particle (RP): off (took off), up (put up)
14
Penn Treebank Tagset
15
Defining POS Tagging
• The process of assigning a part-of-speech or
lexical class marker to each word in a corpus:
WORDS → TAGS (tagset here: N, V, P, DET):
the/DET koala/N put/V the/DET keys/N on/P the/DET table/N
16
Applications for POS Tagging
• Speech synthesis pronunciation
– Lead (noun) vs. Lead (verb)
– INsult (noun) vs. inSULT (verb)
– OBject (noun) vs. obJECT (verb)
– OVERflow (noun) vs. overFLOW (verb)
– DIScount (noun) vs. disCOUNT (verb)
– CONtent (noun) vs. conTENT (adjective/verb)
• Word Sense Disambiguation: e.g. Time flies like an arrow
– Is flies an N or V?
• Word prediction in speech recognition
– Possessive pronouns (my, your, her) are likely to be followed by
nouns
– Personal pronouns (I, you, he) are likely to be followed by verbs
• Machine Translation
17
Part of Speech (POS) Tagging
• Lowest level of syntactic analysis.
John saw the saw and decided to take it to the table.
NNP VBD DT NN CC VBD TO VB PRP IN DT NN
18
POS Tagging Process
• Usually assume a separate initial tokenization process that
separates and/or disambiguates punctuation, including
detecting sentence boundaries.
• Degree of ambiguity in English (based on Brown corpus)
– 11.5% of word types are ambiguous.
– 40% of word tokens are ambiguous.
• Average POS tagging disagreement amongst expert human
judges for the Penn Treebank was 3.5%
– Based on correcting the output of an initial automated tagger, which
was deemed to be more accurate than tagging from scratch.
• Baseline: Picking the most frequent tag for each specific word
type gives about 90% accuracy (a sketch of this baseline follows below)
– 93.7% if a model for unknown words is used, with the Penn Treebank tagset.
19
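A sketch of the most-frequent-tag baseline, assuming NLTK with the Brown corpus available (exact numbers depend on the corpus and the train/test split used):

```python
# Most-frequent-tag baseline: remember, for each word, the tag it carries
# most often in a tagged corpus, and always predict that tag.
import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(brown.tagged_words())   # word -> tag counts
most_likely = {word: cfd[word].max() for word in cfd}

def baseline_tag(tokens, default="NN"):
    # Unknown words fall back to a default tag (here simply NN).
    return [(tok, most_likely.get(tok, default)) for tok in tokens]

print(baseline_tag(["The", "dog", "barked", "."]))
```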
POS Tagging Approaches
• Rule-Based: Human crafted rules based on lexical
and other linguistic knowledge.
• Learning-Based: Trained on human annotated
corpora like the Penn Treebank.
– Statistical models: Hidden Markov Model (HMM),
Maximum Entropy Markov Model (MEMM),
Conditional Random Field (CRF)
– Rule learning: Transformation Based Learning (TBL)
• Generally, learning-based approaches have been
found to be more effective overall, taking into
account the total amount of human expertise and
effort involved.
20
Simple taggers
• Default tagger has one tag per word, and assigns it on the
basis of dictionary lookup
– Tags may indicate ambiguity but not resolve it, e.g. NVB for noun-or-verb
• Words may be assigned different tags with associated
probabilities
– Tagger will assign the most probable tag, unless there is some
way to identify when a less probable tag is in fact correct
• Tag sequences may be defined by regular expressions, and
assigned probabilities (including 0 for illegal sequences –
negative rules)
21/24
Rule-based taggers
• Earliest type of tagging: two stages
• Stage 1: look up word in lexicon to give list of
potential POSs
• Stage 2: Apply rules which certify or disallow
tag sequences
• Rules originally handwritten; more recently
Machine Learning methods can be used
22/24
Start with a POS Dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• Etc… for the ~100,000 words of English
23
Assign All Possible POS to Each Word
She: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
24
Apply Rules Eliminating Some POS
E.g., Eliminate VBN if VBD is an option
when VBN|VBD follows “<start> PRP”
She: PRP
promised: VBN, VBD   (the rule targets VBN here)
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
25
Apply Rules Eliminating Some POS
E.g., Eliminate VBN if VBD is an option
when VBN|VBD follows “<start> PRP”
She: PRP
promised: VBD   (VBN eliminated)
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
26
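A small sketch of the two-stage idea just illustrated: dictionary lookup followed by a hand-written elimination rule. The mini-lexicon and the single rule below are illustrative only, not a full tagger:

```python
# Stage 1: dictionary lookup gives every candidate tag for each word.
# Stage 2: negative rules eliminate candidates in context.
lexicon = {
    "she": {"PRP"}, "promised": {"VBN", "VBD"}, "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"}, "the": {"DT"}, "bill": {"NN", "VB"},
}

def lookup(words):
    return [set(lexicon.get(w.lower(), {"NN"})) for w in words]

def apply_rules(words, candidates):
    # Rule from the slide: eliminate VBN if VBD is also possible and the
    # word follows a sentence-initial personal pronoun (PRP).
    for i in range(1, len(words)):
        if {"VBN", "VBD"} <= candidates[i] and i == 1 and candidates[0] == {"PRP"}:
            candidates[i].discard("VBN")
    return candidates

words = "She promised to back the bill".split()
print(list(zip(words, apply_rules(words, lookup(words)))))
```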
EngCG ENGTWOL Tagger
• Richer dictionary includes morphological and
syntactic features (e.g. subcategorization
frames) as well as possible POS
• Uses two-level morphological analysis on
input and returns all possible POS
• Apply negative constraints (> 3744) to rule out
incorrect POS
Sample ENGTWOL Dictionary
28
ENGTWOL Tagging: Stage 1
• First Stage: Run words through FST morphological analyzer
to get POS info from morph
• E.g.: Pavlov had shown that salivation …
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
29
ENGTWOL Tagging: Stage 2
• Second Stage: Apply NEGATIVE constraints
• E.g., Adverbial that rule
–Eliminate all readings of that except the one in
It isn’t that odd.
Given input: that
If
  (+1 A/ADV/QUANT)   ; next word is an adjective/adverb/quantifier
  (+2 SENT-LIM)      ; followed by end of sentence
  (NOT -1 SVOC/A)    ; and the previous word is not a verb like "consider"
                     ; which allows adjective complements (e.g. I consider that odd)
Then eliminate non-ADV tags
Else eliminate ADV
30
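The same constraint could be expressed procedurally. A sketch only, mirroring the rule above with invented helper sets rather than the real ENGTWOL formalism:

```python
# Sketch of the adverbial-"that" constraint: keep only the ADV reading when
# the next word is an adjective/adverb/quantifier and the sentence ends right
# after it, unless the previous word is a "consider"-type verb (SVOC/A).
def adverbial_that_rule(tokens, readings, i):
    if tokens[i].lower() != "that":
        return
    next_ok = i + 1 < len(tokens) and readings[i + 1] & {"A", "ADV", "QUANT"}
    end_after_next = i + 2 >= len(tokens) or tokens[i + 2] in {".", "!", "?"}
    prev_consider_verb = i > 0 and "SVOC/A" in readings[i - 1]
    if next_ok and end_after_next and not prev_consider_verb:
        readings[i] = {"ADV"}          # eliminate non-ADV tags
    else:
        readings[i].discard("ADV")     # eliminate ADV

tokens = ["It", "is", "n't", "that", "odd", "."]
readings = [{"PRON"}, {"V"}, {"NEG"}, {"ADV", "DET", "PRON", "CS"}, {"A"}, {"."}]
adverbial_that_rule(tokens, readings, 3)
print(readings[3])   # {'ADV'}
```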
How do they work?
• Tagger must be “trained”
• Many different techniques, but typically …
• Small “training corpus” hand-tagged
• Tagging rules learned automatically
• Rules define most likely sequence of tags
• Rules based on
  – Internal evidence (morphology)
  – External evidence (context)
  – Probabilities
What probabilities do we have to learn?
(a) Individual word probabilities:
Probability that a given tag t is appropriate for
a given word w
– Easy (in principle): learn from training corpus:
P(t | w) ≈ f(t, w) / f(w)
Example: run occurs 4800 times in the training corpus, 3600 times
as a verb and 1200 times as a noun, so P(verb | run) = 3600 / 4800 = 0.75.
– Problem of “sparse data”:
• Add a small amount to each count, so we get no zeros (see the smoothing sketch below)
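A small sketch of this estimate with add-one (Laplace) smoothing; the counts reproduce the run example above, and the tagset is cut down for illustration:

```python
from collections import Counter

# Counts of (word, tag) pairs from a hypothetical training corpus.
counts = Counter({("run", "VB"): 3600, ("run", "NN"): 1200})
tagset = ["VB", "NN", "JJ"]            # tiny illustrative tagset

def p_tag_given_word(tag, word, k=1.0):
    # Relative frequency with add-k smoothing so unseen (word, tag)
    # combinations never get probability zero.
    word_total = sum(c for (w, _), c in counts.items() if w == word)
    return (counts[(word, tag)] + k) / (word_total + k * len(tagset))

print(p_tag_given_word("VB", "run"))   # ~0.75, as in the example above
print(p_tag_given_word("JJ", "run"))   # small but non-zero thanks to smoothing
```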
(b) Tag sequence probability:
Probability that a given tag sequence t1, t2, …, tn is appropriate
for a given word sequence w1, w2, …, wn
– P(t1, t2, …, tn | w1, w2, …, wn) = ???
– Too hard to calculate for the entire sequence:
  P(t1, t2, t3, t4, …) = P(t2 | t1) × P(t3 | t1, t2) × P(t4 | t1, t2, t3) × …
– A subsequence is more tractable
– A sequence of 2 or 3 should be enough (see the estimation sketch below):
  Bigram model: P(t1, t2) = P(t2 | t1)
  Trigram model: P(t1, t2, t3) = P(t2 | t1) × P(t3 | t2)
  N-gram model: P(t1, …, tn) ≈ ∏ P(ti | ti−1), for i = 1 … n
More complex taggers
• Bigram taggers assign tags on the basis of sequences of two words
(usually assigning a tag to word n on the basis of word n−1)
• An nth-order tagger assigns tags on the basis
of sequences of n words
• As the value of n increases, so does the
complexity of the statistical calculation
involved in comparing probability
combinations.
Transformation-based tagging
• Eric Brill (1993)
• Start from an initial tagging, and apply a series
of transformations
• Transformations are learned as well, from the
training data
• Captures the tagging data in far fewer parameters than stochastic models
• The transformations learned (often) have
linguistic “reality”
35/24
Transformation-based tagging
• Three stages:
– Lexical look-up
– Lexical rule application for unknown words
– Contextual rule application to correct mis-tags
36/24
Transformation-based learning
• Change tag a to b when:
– Internal evidence (morphology)
– Contextual evidence
• One or more of the preceding/following words has a specific tag
• One or more of the preceding/following words is a specific word
• One or more of the preceding/following words has a certain form
• Order of rules is important
– Rules can change a correct tag into an incorrect tag, so
another rule might correct that “mistake”
37/24
Transformation-based tagging: examples
• if a word is currently tagged NN, and has a suffix of
length 1 which consists of the letter 's', change its tag
to NNS
• if a word has a suffix of length 2 consisting of the
letter sequence 'ly', change its tag to RB (regardless
of the initial tag)
• change VBN to VBD if previous word is tagged as NN
• change VBD to VBN if the following word is ‘by’
38/24
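A minimal sketch of applying such transformations in order; the rules are simplified paraphrases of the ones listed above, operating on (word, tag) pairs:

```python
def apply_transformations(tagged):
    tagged = list(tagged)
    # Lexical rules, e.g. for unknown words.
    for i, (word, tag) in enumerate(tagged):
        if tag == "NN" and word.endswith("s"):
            tagged[i] = (word, "NNS")
        if word.endswith("ly"):
            tagged[i] = (word, "RB")        # regardless of the initial tag
    # Contextual rules; order matters, and a later rule may revise
    # what an earlier one did.
    for i in range(1, len(tagged)):
        word, tag = tagged[i]
        if tag == "VBN" and tagged[i - 1][1] == "NN":
            tagged[i] = (word, "VBD")
        word, tag = tagged[i]
        if tag == "VBD" and i + 1 < len(tagged) and tagged[i + 1][0] == "by":
            tagged[i] = (word, "VBN")
    return tagged

# "shot" is first changed to VBD by the NN-context rule, then restored to VBN
# by the "by" rule: an example of rule interaction.
print(apply_transformations([("Lincoln", "NN"), ("shot", "VBN"), ("by", "IN"), ("Booth", "NN")]))
```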
Transformation-based tagging:
example
Example after lexical lookup
Booth/NP killed/VBN Abraham/NP Lincoln/NP
Abraham/NP Lincoln/NP was/BEDZ shot/VBD by/BY Booth/NP
He/PPS witnessed/VBD Lincoln/NP killed/VBN by/BY Booth/NP
Example after application of contextual rule ’vbn vbd PREVTAG np’
Booth/NP killed/VBD Abraham/NP Lincoln/NP
Abraham/NP Lincoln/NP was/BEDZ shot/VBD by/BY Booth/NP
He/PPS witnessed/VBD Lincoln/NP killed/VBD by/BY Booth/NP
Example after application of contextual rule ’vbd vbn NEXTWORD by’
Booth/NP killed/VBD Abraham/NP Lincoln/NP
Abraham/NP Lincoln/NP was/BEDZ shot/VBN by/BY Booth/NP
He/PPS witnessed/VBD Lincoln/NP killed/VBN by/BY Booth/NP
39/24
Templates for TBL
40
Sample TBL Rule Application
• Labels every word with its most-likely tag
– E.g. race occurrences in the Brown corpus:
• P(NN|race) = .98
• P(VB|race)= .02
• is/VBZ expected/VBN to/TO race/NN tomorrow/NN
• Then TBL applies the following rule
– “Change NN to VB when previous tag is TO”
… is/VBZ expected/VBN to/TO race/NN
tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB
tomorrow/NN
41
TBL Tagging Algorithm
• Step 1: Label every word with most likely tag (from
dictionary)
• Step 2: Check every possible transformation & select one
which most improves tag accuracy (cf Gold)
• Step 3: Re-tag corpus applying this rule, and add rule to
end of rule set
• Repeat 2-3 until some stopping criterion is reached, e.g.,
X% correct with respect to training corpus
• RESULT: Ordered set of transformation rules to use on new
data tagged only with most likely POS tags
42
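A schematic sketch of this greedy loop. Here a rule is simply a function from a tag sequence to a tag sequence; the single candidate rule and the tiny gold/initial sequences are illustrative assumptions:

```python
def tbl_learn(current, gold, candidate_rules, min_gain=1):
    """Greedy TBL: repeatedly pick the transformation that most reduces errors."""
    def errors(tags):
        return sum(t != g for t, g in zip(tags, gold))
    learned = []
    while True:
        scored = [(errors(current) - errors(r(current)), r) for r in candidate_rules]
        gain, rule = max(scored, key=lambda x: x[0])
        if gain < min_gain:                  # stopping criterion
            break
        current = rule(current)              # re-tag corpus with the best rule
        learned.append(rule)                 # add rule to end of rule set
    return learned, current

# Candidate rule: change NN to VB when the previous tag is TO.
def nn_to_vb_after_to(tags):
    return ["VB" if t == "NN" and i > 0 and tags[i - 1] == "TO" else t
            for i, t in enumerate(tags)]

gold    = ["VBZ", "VBN", "TO", "VB", "NN"]
initial = ["VBZ", "VBN", "TO", "NN", "NN"]
rules, retagged = tbl_learn(initial, gold, [nn_to_vb_after_to])
print(retagged)   # ['VBZ', 'VBN', 'TO', 'VB', 'NN']
```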
TBL Issues
• Problem: Could keep applying (new)
transformations ad infinitum
• Problem: Rules are learned in ordered
sequence
• Problem: Rules may interact
• But: Rules are compact and can be
inspected by humans
43
Evaluating Tagging Approaches
• For any NLP problem, we need to know
how to evaluate our solutions
• Possible Gold Standards -- ceiling:
– Annotated naturally occurring corpus
– Human task performance (96-97%)
• How well do humans agree?
• Kappa statistic: avg pairwise agreement
corrected for chance agreement
– Can be hard to obtain for some tasks:
sometimes humans don’t agree
• Baseline: how well does simple method do?
– For tagging, most common tag for each word
(90%)
– How much improvement do we get over
baseline?
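A small sketch of pairwise Cohen's kappa for two annotators' tag sequences, with chance agreement estimated from each annotator's own tag distribution (the two tiny sequences are illustrative):

```python
from collections import Counter

def cohen_kappa(tags_a, tags_b):
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    dist_a, dist_b = Counter(tags_a), Counter(tags_b)
    expected = sum(dist_a[t] * dist_b[t] for t in dist_a) / (n * n)
    return (observed - expected) / (1 - expected)

print(cohen_kappa(["NN", "VB", "DT", "NN"], ["NN", "VB", "DT", "JJ"]))  # ~0.67
```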
Methodology: Error Analysis
• Confusion matrix:
– E.g. which tags did
we most often
confuse with which
other tags?
– How much of the
overall error does
each confusion
account for?
[Confusion matrix with rows and columns for tags such as VB, TO, NN]
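A minimal sketch of tallying such a confusion matrix from gold and predicted tags, so the most damaging confusions can be read off:

```python
from collections import Counter

def confusion(gold, predicted):
    # Maps (gold tag, predicted tag) pairs to how often they co-occur.
    return Counter(zip(gold, predicted))

gold      = ["VB", "TO", "NN", "NN", "VB"]
predicted = ["VB", "TO", "NN", "VB", "VB"]
for (g, p), count in confusion(gold, predicted).most_common():
    if g != p:
        print(f"gold {g} tagged as {p}: {count} times")
```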
More Complex Issues
• Tag indeterminacy: when ‘truth’ isn’t clear
Caribbean cooking, child seat
• Tagging multipart words
wouldn’t --> would/MD n’t/RB
• How to handle unknown words
– Assume all tags equally likely
– Assume same tag distribution as all other
singletons in corpus
– Use morphology, word length,….
Classification Learning
• Typical machine learning addresses the problem of
classifying a feature-vector description into a fixed
number of classes.
• There are many standard learning methods for this
task:
– Decision Trees and Rule Learning
– Naïve Bayes and Bayesian Networks
– Logistic Regression / Maximum Entropy (MaxEnt)
– Perceptron and Neural Networks
– Support Vector Machines (SVMs)
– Nearest-Neighbor / Instance-Based
48
Beyond Classification Learning
• Standard classification problem assumes individual
cases are disconnected and independent (i.i.d.:
independently and identically distributed).
• Many NLP problems do not satisfy this assumption
and involve making many connected decisions, each
resolving a different ambiguity, but which are
mutually dependent.
• More sophisticated learning and inference
techniques are needed to handle such situations in
general.
49
Sequence Labeling Problem
• Many NLP problems can be viewed as sequence
labeling.
• Each token in a sequence is assigned a label.
• Labels of tokens are dependent on the labels of
other tokens in the sequence, particularly their
neighbors (not i.i.d).
(Illustration: a token sequence foo bar blam zonk zonk bar blam, where each token's label depends on the labels of its neighbors.)
50
Information Extraction
• Identify phrases in language that refer to specific types of
entities and relations in text.
• Named entity recognition is task of identifying names of
people, places, organizations, etc. in text.
Entity types: people, organizations, places
– Michael Dell is the CEO of Dell Computer Corporation and lives in
Austin, Texas.
• Extract pieces of information relevant to a specific
application, e.g. used car ads:
Fields: make, model, year, mileage, price
– For sale, 2002 Toyota Prius, 20,000 mi, $15K or best offer.
Available starting July 30, 2006.
51
Semantic Role Labeling
• For each clause, determine the semantic role
played by each noun phrase that is an
argument to the verb.
Semantic roles: agent, patient, source, destination, instrument
– John drove Mary from Austin to Dallas in his
Toyota Prius.
– The hammer broke the window.
• Also referred to as “case role analysis,”
“thematic analysis,” and “shallow semantic
parsing”
52
Bioinformatics
• Sequence labeling is also valuable for labeling
genetic sequences in genome analysis.
Region types: exon, intron
– AGCTAACGTTCGATACGGATTACAGCCT
53
Sequence Labeling as Classification
• Classify each token independently, but use
information about the surrounding tokens as
input features (sliding window).
John saw the saw and decided to take it
to the table.
classifier
NNP
54
(Repeating the same sliding-window classification for each following token yields saw/VBD, the/DT, saw/NN, and/CC, decided/VBD, to/TO, take/VB, it/PRP, to/IN, the/DT, table/NN.)
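A sketch of the sliding-window idea with an off-the-shelf classifier. scikit-learn is an assumed dependency here, and the two tiny training sentences are purely illustrative:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def window_features(tokens, i):
    # Features for token i drawn from a small window around it.
    return {
        "word": tokens[i].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        "suffix2": tokens[i][-2:],
    }

train = [("John saw the saw".split(), ["NNP", "VBD", "DT", "NN"]),
         ("Mary took the keys".split(), ["NNP", "VBD", "DT", "NNS"])]
X = [window_features(toks, i) for toks, _ in train for i in range(len(toks))]
y = [tag for _, tags in train for tag in tags]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

test = "John saw the keys".split()
print(list(zip(test, model.predict([window_features(test, i) for i in range(len(test))]))))
```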
Sequence Labeling as Classification
Using Outputs as Inputs
• Better input features are usually the
categories of the surrounding tokens, but
these are not available yet.
• Can use category of either the preceding or
succeeding tokens by going forward or back
and using previous output.
66
Forward Classification
John saw the saw and decided to take it
to the table.
classifier
NNP
67
(Continuing left to right and feeding each previously assigned tag back in as a feature, the classifier then outputs VBD, DT, NN, CC, VBD, TO, VB, PRP, IN, DT, and NN, giving John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN.)
Backward Classification
• Disambiguating “to” in this case would be
even easier backward.
John saw the saw and decided to take it
to the table.
classifier
NN
79
(Continuing right to left, the classifier then assigns the/DT, to/IN, it/PRP, take/VB, to/TO, decided/VBD, and/CC, saw/NN, the/DT, saw/VBD, and John/NNP.)
Problems with Sequence Labeling as
Classification
• Not easy to integrate information from
category of tokens on both sides.
• Difficult to propagate uncertainty between
decisions and “collectively” determine the
most likely joint assignment of categories to
all of the tokens in a sequence.
91
Stochastic taggers
• Nowadays, pretty much all taggers are
statistics-based, and have been since the 1980s
(some early algorithms were already published in
the 1960s and 1970s)
• Most common is based on Hidden Markov
Models (also found in speech processing, etc.)
92/24
(Hidden) Markov Models
• Probability calculations imply Markov models: we assume
that P(t|w) depends only on the previous word (or on a
short sequence of previous words)
• (Informally) Markov models are the class of probabilistic
models that assume we can predict the future without
taking into account too much of the past
• Markov chains can be modelled by finite state automata: the
next state in a Markov chain is always dependent on some
finite history of previous states
• Model is “hidden” if it is actually a succession of Markov
models, whose intermediate states are of no interest
93/24
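A minimal Viterbi sketch for such a bigram HMM tagger; the tiny transition and emission tables below are illustrative assumptions, not real corpus estimates:

```python
import math

tags = ["DT", "NN", "VB"]
# P(tag | previous tag); "<s>" marks the start of the sentence.
trans = {("<s>", "DT"): 0.6, ("<s>", "NN"): 0.2, ("<s>", "VB"): 0.2,
         ("DT", "NN"): 0.8, ("DT", "VB"): 0.1, ("DT", "DT"): 0.1,
         ("NN", "VB"): 0.5, ("NN", "NN"): 0.3, ("NN", "DT"): 0.2,
         ("VB", "DT"): 0.6, ("VB", "NN"): 0.3, ("VB", "VB"): 0.1}
# P(word | tag); anything missing is treated as probability zero.
emit = {("DT", "the"): 0.9, ("NN", "dog"): 0.4,
        ("NN", "barks"): 0.1, ("VB", "barks"): 0.5}

def viterbi(words):
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {"<s>": (0.0, [])}
    for word in words:
        new_best = {}
        for t in tags:
            new_best[t] = max(
                (score + logp(trans.get((prev, t), 0.0))
                 + logp(emit.get((t, word), 0.0)), path + [t])
                for prev, (score, path) in best.items()
            )
        best = new_best
    return max(best.values())

print(viterbi(["the", "dog", "barks"]))   # best log-score and the path DT NN VB
```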
Supervised vs unsupervised training
• Learning tagging rules from a marked-up corpus
(supervised learning) gives very good results (98%
accuracy)
– Though simply assigning the most probable tag, and “proper
noun” to unknown words, already gives 90%
• But it depends on having a corpus already marked up
to a high quality
• If this is not available, we have to try something else:
– “forward-backward” algorithm
– A kind of “bootstrapping” approach
94/24
Forward-backward (Baum-Welch)
algorithm
• Start with initial probabilities
– If nothing known, assume all Ps equal
• Adjust the individual probabilities so as to increase
the overall probability.
• Re-estimate the probabilities on the basis of the last
iteration
• Continue until convergence
– i.e. there is no improvement, or improvement is below a
threshold
• All this can be done automatically
95/24
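A schematic sketch of that iteration only; 'e_step' (the forward-backward expectation computation) and 'm_step' (re-estimation of the probabilities) are assumed helpers passed in by the caller, not calls into any real library:

```python
def baum_welch(observations, params, e_step, m_step, tol=1e-4, max_iters=100):
    """Iterate expectation and re-estimation until the likelihood stops improving."""
    prev_ll = float("-inf")
    for _ in range(max_iters):
        expected_counts, log_likelihood = e_step(params, observations)
        params = m_step(expected_counts)          # re-estimate probabilities
        if log_likelihood - prev_ll < tol:        # converged, or improvement below threshold
            break
        prev_ll = log_likelihood
    return params
```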
Conclusions
• POS Tagging is the lowest level of syntactic
analysis.
• It is an instance of sequence labeling, a collective
classification task that also has applications in
information extraction, phrase chunking, semantic role
labeling, and bioinformatics.