Transcript ch13

Chapter 13
Part-of-Speech Tagging
13-1
POS Tagging + HMMs
• Part of Speech Tagging
– What and Why?
• What Information is Available?
• Visible Markov Models
• Hidden Markov Models
• Training and Initialization
• Other Methods
13-2
Part Of Speech Tagging
• Assign syntactic categories to words in text
– The-AT representative-NN put-VBD chairs-NNS on-IN
the-AT table-NN. (correct)
– The-AT representative-JJ put-NN chairs-VBZ on-IN
the-AT table-NN. (incorrect)
– Tag set (see next slide)
• Usefulness
– Lexical Acquisition, Shallow/Partial Parse
– Information Extraction
– Question Answering
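As a concrete illustration of what a tagger produces, the sketch below runs the example sentence through NLTK's off-the-shelf tagger. Note it outputs Penn Treebank tags (DT/NN/VBD/...) rather than the Brown tags shown above, and it assumes the nltk package and its default models are available.

```python
# Illustration only: NLTK's default tagger uses Penn Treebank tags.
# May require nltk.download() of the tokenizer and tagger models first.
import nltk

tokens = nltk.word_tokenize("The representative put chairs on the table.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('representative', 'NN'), ('put', 'VBD'),
#       ('chairs', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('table', 'NN'), ('.', '.')]
```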
13-3
Brown/Penn tag sets
13-4
Sources of information
• syntagmatic structural information
– looking at information about tag sequences
– AT JJ NN is common, AT JJ VBP is very unlikely
– syntagmatic patterns alone give 77% accuracy (Greene and Rubin, 1971)
• lexical information
– predicting a tag based on the word concerned
– the word flour is much more likely to be a
noun than a verb
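The lexical cue is easy to quantify from a tagged corpus. Below is a minimal sketch of a most-frequent-tag baseline; the toy corpus and helper names are hypothetical.

```python
# Most-frequent-tag baseline: a sketch of the "lexical information" cue.
# `tagged_corpus` is a hypothetical toy corpus of (word, tag) pairs.
from collections import Counter, defaultdict

tagged_corpus = [("the", "AT"), ("flour", "NN"), ("is", "BEZ"),
                 ("on", "IN"), ("the", "AT"), ("table", "NN")]

counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    """Tag each word with the tag it carried most often in training."""
    return counts[word].most_common(1)[0][0] if word in counts else default

print(most_frequent_tag("flour"))  # -> 'NN'
```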
13-5
Visible Markov
• View as Markov Chain
– Limited Horizon: a word’s tag only depends on
the previous tag
$P(X_{i+1} = t^j \mid X_1, \ldots, X_i) = P(X_{i+1} = t^j \mid X_i)$
– Time Invariant: the dependency does not change
over time
$P(X_{i+1} = t^j \mid X_i) = P(X_2 = t^j \mid X_1)$
13-6
13-7
Find the best tagging $t_{1,n}$ for a sentence $w_{1,n}$:

$$\arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} \frac{P(w_{1,n} \mid t_{1,n})\,P(t_{1,n})}{P(w_{1,n})} = \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n})\,P(t_{1,n})$$

Words are independent of each other:

$$P(w_{1,n} \mid t_{1,n})\,P(t_{1,n}) = \prod_{i=1}^{n} P(w_i \mid t_{1,n}) \cdot P(t_n \mid t_{1,n-1})\,P(t_{n-1} \mid t_{1,n-2}) \cdots P(t_2 \mid t_1)$$

A word's identity depends only on its tag, and by the limited-horizon assumption each tag depends only on the previous tag:

$$= \prod_{i=1}^{n} P(w_i \mid t_i) \cdot P(t_n \mid t_{n-1})\,P(t_{n-1} \mid t_{n-2}) \cdots P(t_2 \mid t_1) = \prod_{i=1}^{n} \big[P(w_i \mid t_i)\,P(t_i \mid t_{i-1})\big]$$
13-8
(We define $P(t_1 \mid t_0) = 1.0$.)

The final equation for determining the optimal tags for a sentence:

$$\hat{t}_{1,n} = \arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$$

Both probabilities are estimated by maximum likelihood from counts in a tagged training corpus:

$$P(t^k \mid t^j) = \frac{C(t^j, t^k)}{C(t^j)} \qquad\qquad P(w^l \mid t^j) = \frac{C(w^l, t^j)}{C(t^j)}$$
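These counts are collected in one pass over a tagged corpus. A minimal sketch, assuming a toy corpus of tagged sentences and no smoothing (unseen words or tag pairs get probability zero):

```python
# Maximum likelihood estimates for the bigram tagging model:
# P(t_k | t_j) = C(t_j, t_k) / C(t_j),  P(w_l | t_j) = C(w_l, t_j) / C(t_j).
# `corpus` is a hypothetical toy list of tagged sentences; "<s>" is a
# start pseudo-tag playing the role of t_0.
from collections import Counter

corpus = [[("the", "AT"), ("representative", "NN"), ("put", "VBD"),
           ("chairs", "NNS"), ("on", "IN"), ("the", "AT"), ("table", "NN")]]

tag_count, trans_count, emit_count = Counter(), Counter(), Counter()
for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        tag_count[tag] += 1
        trans_count[(prev, tag)] += 1
        emit_count[(word, tag)] += 1
        prev = tag
    tag_count["<s>"] += 1  # so transitions out of "<s>" normalize

def p_trans(t_j, t_k):
    return trans_count[(t_j, t_k)] / tag_count[t_j]

def p_emit(w, t_j):
    return emit_count[(w, t_j)] / tag_count[t_j]

print(p_trans("AT", "NN"), p_emit("table", "NN"))  # -> 1.0 0.5
```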
13-9
Viterbi Algorithm
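The argmax over all tag sequences is computed by dynamic programming rather than enumeration. A minimal sketch, reusing the hypothetical p_trans/p_emit estimators from the previous sketch; a real implementation would work in log space and smooth the estimates to avoid underflow and zero probabilities.

```python
def viterbi(words, tags, p_trans, p_emit):
    """Find argmax over tag sequences of prod_i P(w_i|t_i) P(t_i|t_{i-1})."""
    # delta[i][t]: best score of any tag sequence for words[:i+1] ending in t
    delta = [{t: p_trans("<s>", t) * p_emit(words[0], t) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        delta.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda tp: delta[i - 1][tp] * p_trans(tp, t))
            delta[i][t] = (delta[i - 1][best_prev] * p_trans(best_prev, t)
                           * p_emit(words[i], t))
            back[i][t] = best_prev
    # Follow back-pointers from the best final tag
    last = max(tags, key=lambda t: delta[-1][t])
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))

# e.g. viterbi("the table".split(), ["AT", "NN"], p_trans, p_emit) -> ['AT', 'NN']
```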
13-10
Unknown Words
• Simplest model
– Unknown words can be of any part of speech
– Or restricted to open-class parts of speech, i.e.,
nouns, verbs, and so on
• Morphological and other cues
– -ed: past tense forms or past participles
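A minimal sketch of how such cues might be turned into a tag guesser for unknown words; the suffix rules below are illustrative assumptions, not a fixed inventory.

```python
# Guess candidate tags for an unknown word from simple morphological cues.
# The capitalization rule follows the NNP/NN heuristic mentioned later;
# the suffix rules are illustrative assumptions.
OPEN_CLASS = ["NN", "NNS", "VB", "VBD", "VBN", "JJ", "RB"]

def guess_tags(word):
    if word[0].isupper():
        return ["NNP"]           # capitalized -> proper noun
    if word.endswith("ed"):
        return ["VBD", "VBN"]    # past tense form or past participle
    if word.endswith("s"):
        return ["NNS", "VBZ"]    # plural noun or 3sg verb
    return OPEN_CLASS            # fall back to any open-class tag
```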
13-11
Transformation-Based Learning of Tags
• A specification of which “error-correcting”
transformations are admissible
• The learning algorithm
– Tag each word in the training corpus with its
most frequent tag
– Construct a ranked list of transformations that
transform the initial tagging into one that is
close to correct
13-12
Transformations
• Rewrite rule t1 → t2: replace tag t1 by tag t2.
• Triggering environment: specifies when the potential rewriting applies, e.g.:
– Tag tj occurs in one of the three previous positions
– Tag tj occurs two positions
earlier and tag tk occurs in
the following position
13-13
Locations where a trigger will be sought:
(1) Trigger by tags
(2) Trigger by words
– to work in a school vs. go to school (is to an infinitive marker or a preposition?)
– cut, put (base form or past tense?)
– more valuable player
– don't, shouldn't
(3) Trigger by morphology
– e.g., unknown words are tagged as proper nouns (NNP) if
capitalized, as common nouns (NN) otherwise.
13-14
• Example (morphology trigger): Replace NN by NNS if the unknown word's suffix is -s.
• Transformations are selected to reduce the error rate
– E(Ck): the number of words that are mistagged in tagged corpus Ck.
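A minimal sketch of the greedy learning loop, assuming hypothetical helpers: apply_rule instantiates a candidate transformation over the whole tagged corpus, and errors computes E(Ck) against a gold standard.

```python
# Greedy transformation-based learning: start from the most-frequent-tag
# baseline tagging, then repeatedly keep whichever candidate transformation
# lowers E(C_k) the most. Helper names and data shapes are hypothetical:
# `tagged` and `gold` are lists of (word, tag) pairs over the same words.
def errors(tagged, gold):
    """E(C_k): number of words whose tag differs from the gold standard."""
    return sum(t != g for (_, t), (_, g) in zip(tagged, gold))

def learn_transformations(tagged, gold, candidates, apply_rule):
    """Return Brill's ranked list of transformations."""
    ranked = []
    while True:
        best = min(candidates, key=lambda u: errors(apply_rule(u, tagged), gold))
        if errors(apply_rule(best, tagged), gold) >= errors(tagged, gold):
            break                  # no transformation reduces E(C_k); stop
        tagged = apply_rule(best, tagged)
        ranked.append(best)        # at tagging time, rules replay in this order
    return ranked
```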
13-15
Tagging Accuracy
• Typically 95%~97%, depending on:
• The amount of training data available
• The tag set
• The difference between training corpus and
dictionary on the one hand and the corpus
of application on the other
• Unknown words
13-16