Stochastic Tagging


CSA2050: Introduction to Computational Linguistics
Part of Speech (POS) Tagging I
- Rule-based Tagging
- Stochastic Tagging

CLINT Lecture IV, April 2005

Acknowledgment

Most slides taken from Bonnie Dorr’s course notes:
www.umiacs.umd.edu/~bonnie/courses/cmsc723-03
In turn based on Jurafsky & Martin, Chapter 8.
Outline

- The Task
- Tags
- Rule-based Tagging
- Stochastic Tagging
Definition: PoS-Tagging

“Part-of-Speech Tagging is the process of assigning a part-of-speech or other lexical class marker to each word in a corpus” (Jurafsky and Martin)

Example, with the tagset {DET, N, V, P}:

the/DET girl/N kissed/V the/DET boy/N on/P the/DET cheek/N
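As a quick, concrete illustration, here is a minimal sketch using NLTK's off-the-shelf tagger. This is an assumption on top of the slides (they do not use NLTK), and NLTK returns Penn Treebank tags rather than the simple tagset above:

```python
# Minimal sketch: tagging the example sentence with NLTK's default
# Penn Treebank tagger. Assumes nltk is installed and the 'punkt' and
# 'averaged_perceptron_tagger' data packages have been downloaded.
import nltk

tokens = nltk.word_tokenize("the girl kissed the boy on the cheek")
print(nltk.pos_tag(tokens))
# expected output along the lines of:
# [('the', 'DT'), ('girl', 'NN'), ('kissed', 'VBD'), ('the', 'DT'),
#  ('boy', 'NN'), ('on', 'IN'), ('the', 'DT'), ('cheek', 'NN')]
```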
Motivation

- Corpus analysis of tagged corpora yields useful information.
- Speech synthesis: pronunciation, e.g. CONtent (N) vs. conTENT (Adj).
- Speech recognition: word-class-based N-grams predict the category of the next word.
- Information retrieval: stemming, selection of high-content words.
- Word-sense disambiguation.
Word Classes

- Parts of speech: words in the same class (same POS) have similar behaviour w.r.t.
  - morphology (what kinds of affixes they take)
  - syntax (what kinds of words can occur nearby)
- Open classes
- Closed classes
Open Class Words

- Nouns: people, places, things
  - classes of nouns: proper vs. common; count vs. mass
- Verbs: actions and processes
- Adjectives: properties, qualities
- Adverbs: many different kinds
Closed Class Words

- Fixed membership
- Examples:
  - prepositions: on, under, over, …
  - particles: up, down, on, off, …
  - determiners: a, an, the, …
  - pronouns: she, who, I, …
  - conjunctions: and, but, or, …
  - auxiliary verbs: can, may, should, …
  - numerals: one, two, three, third, …
Prepositions from CELEX

[Table omitted: preposition frequencies from the COBUILD 16M-word corpus]
Tagsets: How Detailed?

Tagset            Size
Swedish SUC         25
Penn Treebank       46
German STTS         50
Lancaster BNC       61
Lancaster Full     146
Penn Treebank Tagset

[Tagset table omitted; it includes, e.g., PRP (personal pronoun) and PRP$ (possessive pronoun)]
Example of Penn Treebank Tagging of Brown Corpus Sentence

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Book/VB that/DT flight/NN ./.

Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.
The Problem

1. He can can a can.
2. I can light a fire and you can open a can of beans. Now the can is open, and we can eat in the light of the fire.
3. Flying planes can be dangerous.
The Problem

- Words often belong to more than one word class, e.g. “this”:
  - This is a nice day = PRP
  - This day is nice = DT
  - You can go this far = RB (adverb)
- Many of the most common words are ambiguous.
How Hard is the Tagging Task?

- In the Brown Corpus:
  - 11.5% of word types are ambiguous
  - 40% of word tokens are ambiguous
- Most words in English are unambiguous.
- Many of the most common words are ambiguous.
- Typically, the possible tags of an ambiguous word are not equally probable.
Word Class Ambiguity (in the Brown Corpus)

Unambiguous (1 tag):    35,340 types
Ambiguous (2-7 tags):    4,100 types

  2 tags    3,760
  3 tags      264
  4 tags       61
  5 tags       12
  6 tags        2
  7 tags        1

(DeRose, 1988)
3 Approaches to Tagging

1. Rule-based tagger: ENGTWOL tagger (Voutilainen, 1995)
2. Stochastic tagger: HMM-based tagger
3. Transformation-based tagger: Brill tagger (Brill, 1995)
Rule-Based Tagger

- Basic idea:
  - Assign all possible tags to each word.
  - Remove tags according to a set of rules, e.g.: if word+1 is an adjective, adverb, or quantifier and the following word is a sentence boundary, and word-1 is not a verb like “consider”, then eliminate non-adverb tags; else eliminate the adverb tag.
- Typically more than 1,000 hand-written rules, but rules may also be machine-learned (see the sketch below).
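A minimal sketch of this assign-then-eliminate idea. The lexicon and the rule interface are toy placeholders, not ENGTWOL's actual resources:

```python
# Sketch of a two-stage rule-based tagger: assign every candidate tag,
# then let hand-written elimination rules prune the candidates.
LEXICON = {
    "that":  {"ADV", "DET", "PRON", "CS"},  # CS = subordinating conjunction
    "odd":   {"ADJ"},
    "isn't": {"V"},
    "it":    {"PRON"},
}

def candidates(word):
    # Stage 1: all possible tags for a word (unknown words default to N).
    return set(LEXICON.get(word, {"N"}))

def tag(words, rules):
    cands = [candidates(w) for w in words]
    # Stage 2: each rule may narrow the candidate set at a position.
    for rule in rules:
        for i in range(len(words)):
            kept = rule(words, cands, i)
            if kept:                  # never eliminate every candidate
                cands[i] = kept
    return cands
```

An example rule in this format is sketched after the constraint-syntax slide below.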
ENGTWOL

- Based on two-level morphology
- 56,000 entries for English word stems
- Each entry annotated with morphological and syntactic features
Sample ENGTWOL Lexicon

[Lexicon sample omitted]
ENGTWOL Tagger

- Stage 1: Run words through the morphological analyzer to get all parts of speech.
  E.g. for the phrase “the tables”, we get the following output:

  "<the>"
    "the" <Def> DET CENTRAL ART SG/PL
  "<tables>"
    "table" N NOM PL
    "table" <SVO> V PRES SG3 VFIN

- Stage 2: Apply constraints to rule out incorrect POSs.
Examples of Constraints

- Discard all verb readings if to the left there is an unambiguous determiner, and between that determiner and the ambiguous word itself there are no nominals (nouns, abbreviations, etc.).
- Discard all finite verb readings if the immediately preceding word is “to”.
- Discard all subjunctive readings if to the left there are no instances of the subordinating conjunction “that” or “lest”.

The first constraint would discard the verb reading of “tables” in the previous representation. There are about 1,100 constraints.
Example

Pavlov       PAVLOV N NOM SG PROPER
had          HAVE V PAST VFIN SVO
             HAVE PCP2 SVO
shown        SHOW PCP2 SVOO SVO SV
that         ADV
             PRON DEM SG
             DET CENTRAL DEM SG
             CS
salivation   N NOM SG
Actual Constraint Syntax

Given input: “that”

if    (+1 A/ADV/QUANT)
      (+2 SENT-LIM)
      (NOT -1 SVOC/A)
then  eliminate non-ADV tags
else  eliminate ADV tag

This rule eliminates the adverbial sense of “that”, as in “it isn’t that odd”.
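The constraint above might be encoded roughly as follows, pluggable into the `tag()` sketch from the rule-based slide. `SENT_LIM` and the verb list are illustrative stand-ins for ENGTWOL's SENT-LIM and SVOC/A classes:

```python
# Rough encoding of the "that" constraint: keep only ADV when the next
# word can be adjective/adverb/quantifier, the word after that is a
# sentence boundary, and the previous word is not a "consider"-type verb.
SENT_LIM = {".", "!", "?"}
SVOC_A_VERBS = {"consider", "believe", "find"}   # hypothetical members

def that_rule(words, cands, i):
    if words[i] != "that":
        return cands[i]
    next_adv_ok = i + 1 < len(words) and cands[i + 1] & {"ADJ", "ADV", "QUANT"}
    then_limit = i + 2 < len(words) and words[i + 2] in SENT_LIM
    prev_ok = i == 0 or words[i - 1] not in SVOC_A_VERBS
    if next_adv_ok and then_limit and prev_ok:
        return cands[i] & {"ADV"}    # eliminate non-ADV tags
    return cands[i] - {"ADV"}        # eliminate ADV tag

print(tag("it isn't that odd .".split(), [that_rule]))
# the cell for "that" keeps only {'ADV'}
```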
Stochastic Tagging

- Based on the probability of a certain tag given various possibilities.
- Requires a training corpus.
- Difficulties:
  - There are no probabilities for words that are not in the training corpus (addressed by smoothing).
  - The training corpus may be too different from the test corpus.
Stochastic Tagging

- Simple method: choose the most frequent tag in the training text for each word (see the sketch below)!
- Result: 90% accuracy!
- But we can do better than that by employing a more elaborate statistical model.
- Hidden Markov Models (HMMs) are a class of such models.
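A sketch of this baseline, assuming the training corpus is given as (word, tag) pairs (toy data below):

```python
# Most-frequent-tag baseline: for each word, always predict the tag it
# received most often in training.
from collections import Counter, defaultdict

training = [("the", "DT"), ("can", "MD"), ("can", "MD"), ("can", "NN"),
            ("race", "NN"), ("race", "VB"), ("race", "NN")]

counts = defaultdict(Counter)
for word, t in training:
    counts[word][t] += 1

def most_frequent_tag(word, default="NN"):
    # Unknown words fall back to a default tag.
    return counts[word].most_common(1)[0][0] if word in counts else default

print(most_frequent_tag("can"))   # 'MD' (2 of 3 training occurrences)
print(most_frequent_tag("race"))  # 'NN'
```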
Hidden Markov Model (for pronunciation)

[Figure omitted: an HMM for pronunciation]
Three Fundamental Questions for HMMs

- Given an HMM, how likely is a given observation sequence?
- Given an observation sequence, how do we choose a state sequence that best explains the observations?
- Given an observation sequence and a space of possible HMMs, how do we find the HMM that best fits the observed data?
Two Observation Sequences for Tagging

        Time   flies   like   an   arrow
Seq 1:   N      N       P     D     N
Seq 2:   V      V       P     D     N
Two Kinds of Probability Involved in Generating a Sequence

- Transitional: P(tag | previous n tags)
- Output: P(w | t)

[Diagrams omitted: two tag lattices over a sentence w1 … w5, one highlighting tag-to-tag transitions, the other tag-to-word outputs]
Simplifying Assumptions

These assumptions cannot handle all phenomena:

- Limited horizon: a given tag depends only on the N previous tags (usually N = 2).
  - Central embedding? “The cat the dog the bird saw bark meowed.”
  - Long-distance dependencies: “Chris is easy to consider it impossible for anyone but a genius to try to talk to __.”
- Time (sentence-position) invariance: e.g. the pair (P, V) may not be equally likely at the beginning and end of a sentence.
Estimating N-gram Probabilities

- To estimate the probability that “such” appears after “to create”:
  - count how many times “to create such” appears = A
  - count how many times “to create” appears = B
  - estimate = A/B
- The same principle applies to tags.
- We can use these estimates to rank alternative tags for a given word (see the sketch below).
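A sketch of this A/B estimate over a list of tokens (the corpus here is a toy placeholder; real estimates need far more data):

```python
# Estimate P(word | context) = count(context + word) / count(context),
# e.g. P(such | to create) = count("to create such") / count("to create").
def ngram_prob(tokens, context, word):
    n = len(context)
    a = b = 0
    for i in range(len(tokens) - n):
        if tokens[i:i + n] == context:
            b += 1                      # occurrence of the context (B)
            if tokens[i + n] == word:
                a += 1                  # context followed by the word (A)
    return a / b if b else 0.0

tokens = "we want to create such tools and to create such corpora quickly".split()
print(ngram_prob(tokens, ["to", "create"], "such"))  # 1.0 in this toy text
```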
Data Used for Training a Hidden Markov Model

Estimate the probabilities from relative frequencies:

- Transitional probabilities: the probability that a sequence of tags t1, …, tn is followed by a tag t:
  P(t | t1 … tn) = count(t1 … tn followed by t) / count(t1 … tn)
- Output probabilities: the probability that a given tag t will be realised as a word w:
  P(w | t) = count(w tagged as t) / count(t)

(Both estimates are sketched in code below.)
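A sketch of both estimates for the bigram case (N = 1), from a toy tagged corpus:

```python
# Estimate transition P(t | t_prev) and output P(w | t) by relative
# frequency from a tagged corpus. Toy data; "<s>" marks sentence starts.
from collections import Counter

tagged = [("<s>", "<s>"), ("to", "TO"), ("race", "VB"),
          ("<s>", "<s>"), ("the", "DT"), ("race", "NN")]

tag_count = Counter(t for _, t in tagged)
trans = Counter((tagged[i][1], tagged[i + 1][1]) for i in range(len(tagged) - 1))
emit = Counter(tagged)

def p_trans(t, t_prev):
    return trans[(t_prev, t)] / tag_count[t_prev]

def p_emit(w, t):
    return emit[(w, t)] / tag_count[t]

print(p_trans("VB", "TO"))    # count(TO followed by VB) / count(TO) = 1.0
print(p_emit("race", "NN"))   # count(race tagged NN) / count(NN) = 1.0
```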
An Example

- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
- People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

Consider the first sentence and choose between

  A = to/TO race/VB
  B = to/TO race/NN

We need to choose the maximum probability:

  P(A) = P(VB|TO) × P(race|VB)
  P(B) = P(NN|TO) × P(race|NN)
Calculating the Maximum

P(VB|TO) = .34     P(race|VB) = .00003    P(VB|TO) × P(race|VB) = .00001
P(NN|TO) = .021    P(race|NN) = .00041    P(NN|TO) × P(race|NN) = .000008

Since .00001 > .000008, reading A (to/TO race/VB) is chosen.
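Plugging in the numbers from the table, a one-line check of the arithmetic:

```python
# Compare the two hypotheses using the probabilities above.
p_a = 0.34 * 0.00003    # P(VB|TO) * P(race|VB), approx. 0.00001
p_b = 0.021 * 0.00041   # P(NN|TO) * P(race|NN), approx. 0.000008
print(p_a > p_b)        # True: to/TO race/VB is preferred
```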
Remarks

- We have shown how to calculate the most probable tag for one word.
- Normally we are interested in the most probable sequence of tags for the entire sentence.
- This is achieved by the Viterbi algorithm.
- See Jurafsky and Martin, Ch. 5, for an introduction to this algorithm.
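The slides stop before presenting Viterbi itself. As a pointer, here is a minimal sketch of bigram Viterbi decoding, assuming transition and output tables like those estimated above; the probability tables are hypothetical numbers echoing the race example, with no smoothing:

```python
# Minimal Viterbi decoder for a bigram HMM tagger: finds the tag
# sequence maximizing the product of P(t_i | t_{i-1}) * P(w_i | t_i).
def viterbi(words, tags, p_trans, p_emit, start="<s>"):
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (p_trans(t, start) * p_emit(words[0], t), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            prob, path = max(
                (best[tp][0] * p_trans(t, tp), best[tp][1]) for tp in tags
            )
            new[t] = (prob * p_emit(w, t), path + [t])
        best = new
    return max(best.values())[1]

# Hypothetical probability tables; unseen events get a tiny floor value.
TRANS = {("<s>", "TO"): 0.2, ("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
EMIT = {("to", "TO"): 0.99, ("race", "VB"): 0.00003, ("race", "NN"): 0.00041}

p_trans = lambda t, prev: TRANS.get((prev, t), 1e-6)
p_emit = lambda w, t: EMIT.get((w, t), 1e-6)

print(viterbi(["to", "race"], ["TO", "VB", "NN"], p_trans, p_emit))
# ['TO', 'VB']
```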