POS tagging - Villanova Department of Computing Sciences


Part of Speech (POS) Tagging
CSC 9010: Special Topics. Natural Language Processing.
Paula Matuszek, Mary-Angela Papalaskari
Spring, 2005
Sources (and Resources)
• Some slides adapted from
– Dorr, www.umiacs.umd.edu/~christof/courses/cmsc723-fall04
– Jurafsky, www.stanford.edu/class/linguist238
– McCoy, www.cis.udel.edu/~mccoy/courses/cisc882.03f
• With some additional examples and ideas from
– Martin: www.cs.colorado.edu/~martin/csci5832.html
– Hearst: www.sims.berkeley.edu/courses/is290-2/f04/resources.html
– Litman: www.cs.pitt.edu/~litman/courses/cs2731f03/cs2731.html
– Rich: www.cs.utexas.edu/users/ear/cs378NLP
• You may find some or all of these resources useful throughout the course.
Word Classes and Part-of-Speech Tagging
• What is POS tagging?
• Why do we need POS?
• Word Classes
• Rule-based Tagging
• Stochastic Tagging
• Transformation-Based Tagging
• Tagging Unknown Words
• Evaluating POS Taggers
Parts of Speech
• 8 traditional parts of speech (more or less)
– Noun, verb, adjective, preposition, adverb, article,
pronoun, conjunction.
– This idea has been around for over 2000 years
(Dionysius Thrax of Alexandria, c. 100 B.C.)
– Called: parts-of-speech, lexical category, word
classes, morphological classes, lexical tags, POS
– Actual categories vary by language, by reason for tagging, and by who you ask!
POS examples
• N (noun): chair, bandwidth, pacing
• V (verb): study, debate, munch
• ADJ (adjective): purple, tall, ridiculous
• ADV (adverb): unfortunately, slowly
• P (preposition): of, by, to
• PRO (pronoun): I, me, mine
• DET (determiner): the, a, that, those
Definition of POS Tagging
“The process of assigning a part-of-speech or
other lexical class marker to each word in a
corpus” (Jurafsky and Martin)
WORDS → TAGS
the → DET
girl → N
kissed → V
the → DET
boy → N
on → P
the → DET
cheek → N
POS Tagging example
WORD    TAG
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N
What does Tagging do?
1. Collapses Distinctions
• Lexical identity may be discarded
• e.g., all personal pronouns tagged with PRP
2. Introduces Distinctions
• Ambiguities may be removed
• e.g., deal tagged with NN or VB
• e.g., deal tagged with DEAL1 or DEAL2
3. Helps classification and prediction
Modified from Diane Litman's version of Steve Bird's notes
Significance of Parts of Speech
• A word’s POS tells us a lot about the word
and its neighbors:
– Limits the range of meanings (deal), pronunciation (OBject vs obJECT), or both (wind)
– Helps in stemming
– Limits the range of following words for Speech
Recognition
– Can help select nouns from a document for IR
– Basis for partial parsing (chunked parsing)
– Parsers can build trees directly on the POS tags
instead of maintaining a lexicon
Modified from Diane Litman's version of Steve Bird's notes
Word Classes
• What are we trying to classify words into?
• Classes based on
– Syntactic properties: what can precede/follow them
– Morphological properties: what affixes they take
– Not primarily by semantic coherence (Conjunction
Junction notwithstanding!)
• Broad "grammar" categories are familiar
• NLP uses much richer "tagsets"
Open and closed class words
• Two major categories of classes:
– Closed class: a relatively fixed membership
• Prepositions: of, in, by, …
• Auxiliaries: may, can, will, had, been, …
• Pronouns: I, you, she, mine, his, them, …
• Usually function words (short common words which play a role in grammar)
– Open class: new ones can be created all the
time
• English has 4: Nouns, Verbs, Adjectives, Adverbs
• Many languages have all 4, but not all!
Open Class Words
• Every known human language has nouns
and verbs
• Nouns: people, places, things
– Classes of nouns
• proper vs. common
• count vs. mass
• Verbs: actions and processes
• Adjectives: properties, qualities
• Adverbs: hodgepodge!
– Unfortunately, John walked home extremely
slowly yesterday
Closed Class Words
• Idiosyncratic. Differ more from language to
language.
• Language strongly resists additions
• Examples:
– prepositions: on, under, over, …
– particles: up, down, on, off, …
– determiners: a, an, the, …
– pronouns: she, who, I, …
– conjunctions: and, but, or, …
– auxiliary verbs: can, may, should, …
– numerals: one, two, three, third, …
Prepositions from CELEX
(table not captured in transcript)
English Single-Word Particles
(table not captured in transcript)
Pronouns in CELEX
(table not captured in transcript)
Conjunctions
(table not captured in transcript)
Auxiliaries
(table not captured in transcript)
POS Tagging: Choosing a Tagset
• Many parts of speech, potential distinctions
• To do POS tagging, need to choose a standard set of
tags to work with
• Sets vary in # of tags: a dozen to over 200
• Size of tag sets depends on language,
objectives and purpose
• Need to strike a balance between
– Getting better information about context (best:
introduce more distinctions)
– Making it possible for classifiers to do their job (need to minimize distinctions)
Some of the Best-Known Tagsets
• Brown corpus: 87 tags
• Penn Treebank: 45 tags
• Lancaster UCREL C5 (used to tag the
BNC): 61 tags
• Lancaster C7: 145 tags
Slide modified from Massimo Poesio's
The Brown Corpus
• The first digital corpus (1961)
– Francis and Kucera, Brown University
• Contents: 500 texts, each 2000 words
long
– From American books, newspapers,
magazines
– Representing genres:
• Science fiction, romance fiction, press reportage, scientific writing, popular lore
Modified from Diane Litman's version of Steve Bird's notes
Penn Treebank
• First syntactically annotated corpus
• 1 million words from Wall Street Journal
• Part of speech tags and syntax trees
Modified from Diane Litman's version of Steve Bird's notes
Tag Set Example: Penn Treebank
PRP: personal pronoun (I, you, she, …)
PRP$: possessive pronoun (my, your, her, …)
(rest of the tag table not captured in transcript)
Example of Penn Treebank Tagging of Brown Corpus Sentence
The/DT grand/JJ jury/NN commented/VBD
on/IN a/DT number/NN of/IN other/JJ
topics/NNS ./.
VB   DT   NN     .
Book that flight .

VBZ  DT   NN     VB    NN     ?
Does that flight serve dinner ?
POS Tagging
• Words often have more than one POS: back
– The back door = JJ
– On my back = NN
– Win the voters back = RB
– Promised to back the bill = VB
• The POS tagging problem is to determine the
POS tag for a particular instance of a word.
Word Class Ambiguity
(in the Brown Corpus)
Unambiguous (1 tag): 35,340
Ambiguous (2-7 tags): 4,100
  2 tags: 3,760
  3 tags: 264
  4 tags: 61
  5 tags: 12
  6 tags: 2
  7 tags: 1
(DeRose, 1988)
Part-of-Speech Tagging
• Rule-Based Tagger: ENGTWOL
• Stochastic Tagger: HMM-based
• Transformation-Based Tagger: Brill
Rule-Based Tagging
• Basic Idea:
– Assign all possible tags to words
– Remove tags according to a set of rules of the type: "if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1 is not a verb like 'consider', then eliminate non-adv; else eliminate adv."
– Typically more than 1000 hand-written
rules, but may be machine-learned.
Start With a Dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• Etc… for the ~100,000 words of English
Assign All Possible Tags
She:      PRP
promised: VBN VBD
to:       TO
back:     VB JJ RB NN
the:      DT
bill:     NN VB
Write rules to eliminate tags
Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP":

She:      PRP
promised: VBD   (VBN eliminated)
to:       TO
back:     VB JJ RB NN
the:      DT
bill:     NN VB
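A minimal Python sketch of this assign-then-eliminate idea, using the dictionary above and just the one rule. This is a toy illustration of the approach, not the actual ENGTWOL machinery:

```python
# Step 1: the dictionary assigns every possible tag to each word.
lexicon = {
    "she": {"PRP"},
    "promised": {"VBN", "VBD"},
    "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"},
    "the": {"DT"},
    "bill": {"NN", "VB"},
}

def tag(sentence):
    # Start with all possible tags for each word.
    choices = [set(lexicon[w.lower()]) for w in sentence]
    # Step 2: apply constraint rules to eliminate tags.
    # Rule from the slide: eliminate VBN if VBD is an option
    # when VBN|VBD follows "<start> PRP".
    for i, tags in enumerate(choices):
        if {"VBN", "VBD"} <= tags and i == 1 and "PRP" in choices[0]:
            tags.discard("VBN")
    return choices

print(tag(["She", "promised", "to", "back", "the", "bill"]))
# 'promised' narrows to {'VBD'}; 'back' and 'bill' stay ambiguous
# until further rules apply.
```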
Sample ENGTWOL Lexicon
(lexicon table not captured in transcript)
Stochastic Tagging
• Based on probability of certain tag
occurring given various possibilities
• Necessitates a training corpus
• No probabilities for words not in corpus.
• Training corpus may be too different
from test corpus.
Stochastic Tagging (cont.)
Simple Method: Choose most frequent
tag in training text for each word!
– Result: 90% accuracy
– Why?
– Baseline: Others will do better
– HMM is an example
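A minimal sketch of this most-frequent-tag baseline, trained on a toy hand-tagged word list (a real run would count over a corpus like Brown):

```python
from collections import Counter, defaultdict

# Toy hand-tagged data; 'back' is seen twice as NN and once as VB.
tagged = [("the", "DT"), ("back", "NN"), ("door", "NN"),
          ("on", "IN"), ("my", "PRP$"), ("back", "NN"),
          ("promised", "VBD"), ("to", "TO"), ("back", "VB"),
          ("the", "DT"), ("bill", "NN")]

counts = defaultdict(Counter)
for word, t in tagged:
    counts[word][t] += 1

def most_frequent_tag(word):
    # Pick the tag seen most often with this word in training;
    # fall back to NN for unseen words.
    return counts[word].most_common(1)[0][0] if word in counts else "NN"

print(most_frequent_tag("back"))  # NN: right 2 times out of 3 here
```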
HMM Tagger
• Intuition: Pick the most likely tag for this word.
• HMM Taggers choose tag sequence that
maximizes this formula:
– P(word|tag) × P(tag|previous n tags)
• Let T = t1,t2,…,tn
Let W = w1,w2,…,wn
• Find POS tags that generate a sequence of
words, i.e., look for most probable sequence
of tags T underlying the observed words W.
Conditional Probability
• A brief digression…
• Conditional probability: how do we determine
the likelihood of one event following another if
they are not independent?
• Example:
– I am trying to diagnose a rash in a 6-year-old child.
– Is it measles?
– In other words, given that the child has a rash, what is the probability that it is measles?
Conditional Probabilities cont.
• What would affect your decision?
– The overall frequency of rashes in 6-yr-olds
– The overall frequency of measles in 6-yr-olds
– The frequency with which 6-yr-olds with
measles have rashes.
• P(measles|rash) = P(rash|measles) P(measles) / P(rash)
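A worked instance with made-up illustrative numbers (not real medical data):

```latex
% Suppose, purely for illustration:
% P(rash|measles) = 0.95, P(measles) = 0.001, P(rash) = 0.05.
\[
P(\text{measles}\mid\text{rash})
  = \frac{P(\text{rash}\mid\text{measles})\,P(\text{measles})}{P(\text{rash})}
  = \frac{0.95 \times 0.001}{0.05} = 0.019
\]
```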
Bayes' Theorem
• Bayes' Theorem or Bayes' Rule
formalizes this intuition:
• P(X|Y) = P(Y|X) P(X) / P(Y)
• P(X) is known as the "prior probability" or "prior"; P(Y) is the overall probability of the observed evidence.
Probabilities
• We want the best set of tags for a sequence of words (a sentence)
• W is a sequence of words
• T is a sequence of tags

argmax_T P(T|W) = argmax_T [ P(W|T) P(T) / P(W) ]
Probabilities
• We want the best set of tags for a sequence of words (a sentence)
• W is a sequence of words
• T is a sequence of tags

argmax_T P(T|W) = argmax_T P(W|T) P(T)   (P(W) is the same for every T, so it can be dropped)
Tag Sequence: P(T)
• How do we get the probability of a specific tag
sequence?
– Count the number of times a sequence occurs and
divide by the number of sequences of that length.
Not likely.
– Make a Markov assumption and use N-grams over
tags...
• P(T) is a product of the probabilities of the N-grams that make it up
N-Grams
• The N stands for how many terms are used
– Unigram: 1 term; Bigram: 2 terms; Trigrams: 3 terms
• Usually don’t go beyond 3.
• You can use different kinds of terms, e.g.:
– Character based n-grams
– Word-based n-grams
– POS-based n-grams
• Ordering
– Often adjacent, but not required
• We use N-grams to help determine the context in which
some linguistic phenomenon happens.
– E.g., look at the words before and after the period to see if it is the
end of a sentence or not.
P(T): Bigram Example
• Given a sentence:
<s> Det Adj Adj Noun </s>
• Probability is product of four N-grams:
P(Det|<s>) P(Adj|Det) P(Adj|Adj) P(Noun|Adj)
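A small sketch of that product; the bigram probabilities here are made up for illustration, standing in for corpus estimates:

```python
import math

# P(T) for <s> Det Adj Adj Noun as a product of bigram probabilities.
bigram_p = {
    ("<s>", "Det"): 0.6,   # hypothetical values
    ("Det", "Adj"): 0.3,
    ("Adj", "Adj"): 0.1,
    ("Adj", "Noun"): 0.7,
}

tags = ["<s>", "Det", "Adj", "Adj", "Noun"]
p_T = math.prod(bigram_p[(prev, cur)] for prev, cur in zip(tags, tags[1:]))
print(p_T)  # 0.6 * 0.3 * 0.1 * 0.7 = 0.0126
```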
Counts
• Where do you get the N-gram counts?
• From a large hand-tagged corpus.
– For N-grams, count all the (tag_i, tag_i+1) pairs
– And smooth them to get rid of the zeroes
• Alternatively, you can learn them from
an untagged corpus
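A sketch of the hand-tagged counting described above, with add-one (Laplace) smoothing to get rid of the zeroes; the tag sequences are toy data:

```python
from collections import Counter

tag_seqs = [["DT", "NN", "VBD", "IN", "DT", "NN"],
            ["PRP", "VBD", "TO", "VB", "DT", "NN"]]

bigrams = Counter()
unigrams = Counter()
for seq in tag_seqs:
    for prev, cur in zip(seq, seq[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

tagset = {t for seq in tag_seqs for t in seq}

def p(cur, prev, k=1):
    # Add-one smoothed estimate of P(cur | prev): every bigram gets
    # a nonzero probability, even if never observed.
    return (bigrams[(prev, cur)] + k) / (unigrams[prev] + k * len(tagset))

print(p("NN", "DT"))  # DT -> NN is the most likely transition here
```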
What about P(W|T)
• First, it's odd: it is asking the probability of seeing "The big red dog" given "Det Adj Adj Noun"
– Collect up all the times you see that tag sequence and see how often "The big red dog" shows up. Again, not likely to work.
P(W|T)
• We’ll make the following assumption (because
it’s easy)… Each word in the sequence only
depends on its corresponding tag. So…
P(W|T) = ∏_{i=1..n} P(w_i | t_i)
• How do you get the statistics for that?
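One standard answer (relative frequency in a hand-tagged corpus); a sketch with toy counts:

```python
from collections import Counter

pairs = [("the", "DT"), ("dog", "NN"), ("race", "NN"),
         ("race", "VB"), ("the", "DT"), ("race", "NN")]

tag_count = Counter(t for _, t in pairs)
pair_count = Counter(pairs)

def p_word_given_tag(word, t):
    # P(word|tag) = count(word tagged t) / count(t)
    return pair_count[(word, t)] / tag_count[t]

print(p_word_given_tag("race", "NN"))  # 2/3 in this toy data
```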
So…
• We start with

argmax_T P(T|W) = argmax_T P(W|T) P(T)

• And get

argmax_T [ ∏_{i=1..n} P(w_i|t_i) ] · P(t_1) · [ ∏_{i=2..n} P(t_i|t_{i-1}) ]
HMMs
• This is a Hidden Markov Model (HMM)
argmax_T [ ∏_{i=1..n} P(w_i|t_i) ] · P(t_1) · [ ∏_{i=2..n} P(t_i|t_{i-1}) ]
• The states in the model are the tags, and the
observations are the words.
– The state to state transitions are driven by the bigram statistics
– The observed words are based solely on the state that you’re in
An Example
• Secretariat/NNP is/VBZ expected/VBN to/TO
race/VB tomorrow/NN
• People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• to/TO race/???
• the/DT race/???
• t_i = argmax_j P(t_j|t_{i-1}) P(w_i|t_j)
– max[ P(VB|TO) P(race|VB), P(NN|TO) P(race|NN) ]
• Brown:
– P(NN|TO) × P(race|NN) = .021 × .00041 ≈ .0000086
– P(VB|TO) × P(race|VB) = .34 × .00003 ≈ .0000102
– so race after to/TO is tagged VB
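A two-line check of the comparison above, using the Brown-corpus estimates given on the slide:

```python
p_nn = 0.021 * 0.00041    # P(NN|TO) * P(race|NN)
p_vb = 0.34 * 0.00003     # P(VB|TO) * P(race|VB)
print(f"{p_nn:.7f} vs {p_vb:.7f}")    # 0.0000086 vs 0.0000102
print("VB" if p_vb > p_nn else "NN")  # VB wins: to/TO race/VB
```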
Performance
• This method has achieved 95-96%
correct with reasonably complex English
tagsets and reasonable amounts of
hand-tagged training data.
Transformation-Based Tagging (Brill Tagging)
• Combination of Rule-based and stochastic
tagging methodologies
– Like rule-based because rules are used to
specify tags in a certain environment
– Like stochastic approach because machine
learning is used—with tagged corpus as input
– Transformation-Based Learning (TBL)
• Input:
– tagged corpus
– dictionary (with most frequent tags)
Transformation-Based Tagging (cont.)
• Basic Idea:
– Set the most probable tag for each word as a start value
– Change tags according to rules of type “if word-1 is a
determiner and word is a verb then change the tag to noun” in
a specific order
• Training is done on tagged corpus:
1. Write a set of rule templates
2. Among the set of rules, find the one with the highest score
3. Continue from 2 until the lowest score threshold is passed
4. Keep the ordered set of rules
• Rules make errors, corrected by later rules
TBL Rule Application
• Tagger labels every word with its most-likely
tag
– For example: race has the following probabilities in
the Brown corpus:
• P(NN|race) = .98
• P(VB|race)= .02
• Transformation rules make changes to tags
– “Change NN to VB when previous tag is TO”
… is/VBZ expected/VBN to/TO race/NN
tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB
tomorrow/NN
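A minimal sketch of applying that transformation, starting from the most-likely tags; the initial tags and the one rule are taken from the slide:

```python
# Most-likely tag per word (race starts as NN, since P(NN|race) = .98).
most_likely = {"is": "VBZ", "expected": "VBN", "to": "TO",
               "race": "NN", "tomorrow": "NN"}

def apply_tbl(words):
    tags = [most_likely[w] for w in words]
    # Transformation: "Change NN to VB when previous tag is TO".
    for i in range(1, len(tags)):
        if tags[i] == "NN" and tags[i - 1] == "TO":
            tags[i] = "VB"
    return list(zip(words, tags))

print(apply_tbl(["is", "expected", "to", "race", "tomorrow"]))
# race/NN becomes race/VB; tomorrow/NN is left untouched.
```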
TBL: The Rule-Learning Algorithm
• Step 1: Label every word with most likely tag
(from dictionary)
• Step 2: Check every possible transformation
& select one which most improves tagging
• Step 3: Re-tag corpus applying the rules
• Repeat 2-3 until some criterion is reached,
e.g., X% correct with respect to training
corpus
• RESULT: Sequence of transformation rules
TBL: Rule Learning (cont.)
• Problem: Could apply transformations ad
infinitum!
• Constrain the set of transformations with
“templates”:
– Replace tag X with tag Y, provided tag Z or
word Z’ appears in some position
• Rules are learned in ordered sequence
• Rules may interact.
• Rules are compact and can be inspected
by humans
Templates for TBL
(template table not captured in transcript)
TBL: Problems
• First 100 rules achieve 96.8% accuracy
First 200 rules achieve 97.0% accuracy
• Execution Speed: TBL tagger is slower than
HMM approach
• Learning Speed: Brill’s implementation can
take over a day (600k tokens)
BUT …
(1) Learns small number of simple, non-stochastic
rules
(2) Can be made to work faster with FST
(3) Best performing algorithm on unknown words
Tagging Unknown Words
• Major continuing issue in taggers
– New words are added to the language at 20+ per month
– Plus many proper names …
– Increases error rates by 1-2%
• Method 1: assume they are nouns
• Method 2: assume the unknown words have
a probability distribution similar to words only
occurring once in the training set.
• Method 3: Use morphological information,
e.g., words ending with –ed tend to be
tagged VBN.
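A sketch combining Method 3 with Method 1 as a fallback; the suffix table and the capitalization check are illustrative assumptions, not a complete rule set:

```python
# Ordered suffix-to-tag guesses for unknown words.
SUFFIX_TAGS = [("ed", "VBN"), ("ing", "VBG"), ("ly", "RB"), ("s", "NNS")]

def guess_unknown(word):
    if word[0].isupper():
        return "NNP"                # capitalized unknown: proper noun
    for suffix, t in SUFFIX_TAGS:
        if word.endswith(suffix):
            return t                # Method 3: morphological cue
    return "NN"                     # Method 1 fallback: assume noun

print(guess_unknown("blorked"))     # VBN
print(guess_unknown("Villanova"))   # NNP
```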
Evaluating performance
• How do we know how well a tagger does?
• Say we had a test sentence, or a set of test
sentences, that were already tagged by a
human (a “Gold Standard”)
• We could run a tagger on this set of test
sentences
• And see how many of the tags we got right.
• This is called “Tag accuracy” or “Tag percent
correct”
Test set
• We take a set of test sentences
• Hand-label them for part of speech
• The result is a "Gold Standard" test set
• Who does this?
– Brown corpus: done by U Penn
– Grad students in linguistics
• Don’t they disagree?
– Yes! But on about 97% of tags no disagreements
– And if you let the taggers discuss the remaining 3%, they
often reach agreement
So What's "Good"?
• If we tag every word with its most
frequent POS we get about 90%
accuracy, so this is a minimum tagger.
• Human taggers (without discussion)
agree about 97% of the time, so if we
can get to 97% we have done about as
well as we can.
Training and test sets
• But we can’t train our frequencies on
the test set sentences. (Why?)
• So for testing the Most-Frequent-Tag
algorithm (or any other stochastic
algorithm), we need 2 things:
– A hand-labeled training set: the data that
we compute frequencies from, etc
– A hand-labeled test set: The data that we
use to compute our % correct.
Computing % correct
• Of all the words in the test set
• For what percent of them did the tag chosen by the tagger equal the human-selected tag?
%correct = (# of words tagged correctly in test set) / (total # of words in test set)
• Human tag set: (“Gold Standard” set)
Training and Test sets
• Often they come from the same labeled
corpus!
• We just use 90% of the corpus for
training and save out 10% for testing!
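A minimal end-to-end sketch of the 90/10 methodology: train the most-frequent-tag baseline on the training portion only, then compute tag accuracy on the held-out test portion. Toy data stands in for a real hand-labeled corpus, and a real evaluation would split by sentence rather than by word:

```python
import random
from collections import Counter, defaultdict

# A labeled corpus of (word, gold_tag) pairs.
corpus = [("the", "DT"), ("cat", "NN"), ("sat", "VBD")] * 100
random.shuffle(corpus)
split = int(0.9 * len(corpus))
train, test = corpus[:split], corpus[split:]

# Train frequencies on the training set only.
counts = defaultdict(Counter)
for w, t in train:
    counts[w][t] += 1

def predict(w):
    return counts[w].most_common(1)[0][0] if w in counts else "NN"

# Score against the human-selected tags in the held-out test set.
correct = sum(predict(w) == t for w, t in test)
print(f"accuracy = {correct / len(test):.2%}")
```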