Corpus Annotation for Computational Linguistics


POS Tagging: Introduction
Heng Ji
[email protected]
Feb 2, 2008
Acknowledgement: some slides from Ralph Grishman, Nicolas Nicolov, J&M
1
Some Administrative Stuff
• Assignment 1 due on Feb 17
• Textbook: required for assignments and the final exam
2/39
Outline
• Parts of speech (POS)
• Tagsets
• POS Tagging
  - Rule-based tagging
  - Markup format
  - Open-source toolkits
3/39
What is a Part of Speech (POS)?
• Generally speaking, word classes (= POS):
  - Verb, noun, adjective, adverb, article, …
• We can also include inflection:
  - Verbs: tense, number, …
  - Nouns: number, proper/common, …
  - Adjectives: comparative, superlative, …
  - …
4/39
Parts of Speech
• 8 (ish) traditional parts of speech
  - Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
• Called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, …
• Lots of debate within linguistics about the number, nature, and universality of these
  - We’ll completely ignore this debate.
5/39
7 Traditional POS Categories
N     noun         chair, bandwidth, pacing
V     verb         study, debate, munch
ADJ   adjective    purple, tall, ridiculous
ADV   adverb       unfortunately, slowly
P     preposition  of, by, to
PRO   pronoun      I, me, mine
DET   determiner   the, a, that, those
6/39
POS Tagging
• The process of assigning a part-of-speech or lexical class marker to each word in a collection (a small code sketch follows this slide).

WORD    TAG
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N
7/39
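A tagged sentence like the one above is often represented in code as a sequence of (word, tag) pairs. A minimal Python sketch, using the coarse tags from this slide (not the Penn Treebank tags):

# A tagged sentence as a list of (word, tag) pairs.
tagged_sentence = [
    ("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
    ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N"),
]

# Print it in the common word/TAG inline format.
for word, tag in tagged_sentence:
    print(f"{word}/{tag}", end=" ")
print()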
Penn TreeBank POS Tag Set
• Penn Treebank: hand-annotated corpus of the Wall Street Journal, 1M words
• 45 tags
• Some particularities:
  - to/TO is not disambiguated (preposition vs. infinitive marker)
  - Auxiliaries and main verbs are not distinguished
8/39
Penn Treebank Tagset
9/39
Why is POS tagging useful?
• Speech synthesis:
  - How to pronounce “lead”?
  - INsult vs. inSULT
  - OBject vs. obJECT
  - OVERflow vs. overFLOW
  - DIScount vs. disCOUNT
  - CONtent vs. conTENT
• Stemming for information retrieval
  - A search for “aardvarks” can also return “aardvark”
• Parsing, speech recognition, etc.
  - Possessive pronouns (my, your, her) are followed by nouns
  - Personal pronouns (I, you, he) are likely to be followed by verbs
  - Need to know if a word is an N or a V before you can parse
• Information extraction
  - Finding names, relations, etc.
• Machine translation
10/39
Equivalent Problem in Bioinformatics
• Durbin et al., Biological Sequence Analysis, Cambridge University Press
• Several applications, e.g. proteins
• From the primary structure: ATCPLELLLD
• Infer the secondary structure: HHHBBBBBC..
11/39
Why is POS Tagging Useful?
• First step of a vast number of practical tasks
• Speech synthesis
  - How to pronounce “lead”? (INsult vs. inSULT, OBject vs. obJECT, OVERflow vs. overFLOW, DIScount vs. disCOUNT, CONtent vs. conTENT)
• Parsing
  - Need to know if a word is an N or a V before you can parse
• Information extraction
  - Finding names, relations, etc.
• Machine Translation
12/39
Open and Closed Classes
• Closed class: a small, fixed membership
  - Prepositions: of, in, by, …
  - Auxiliaries: may, can, will, had, been, …
  - Pronouns: I, you, she, mine, his, them, …
  - Usually function words (short common words which play a role in grammar)
• Open class: new ones can be created all the time
  - English has 4: nouns, verbs, adjectives, adverbs
  - Many languages have these 4, but not all!
13/39
Open Class Words
• Nouns
  - Proper nouns (Boulder, Granby, Eli Manning)
    - English capitalizes these.
  - Common nouns (the rest)
  - Count nouns and mass nouns
    - Count: have plurals, get counted: goat/goats, one goat, two goats
    - Mass: don’t get counted (snow, salt, communism) (*two snows)
• Adverbs: tend to modify things
  - Unfortunately, John walked home extremely slowly yesterday
  - Directional/locative adverbs (here, home, downhill)
  - Degree adverbs (extremely, very, somewhat)
  - Manner adverbs (slowly, slinkily, delicately)
• Verbs
  - In English, have morphological affixes (eat/eats/eaten)
14/39
Closed Class Words
Examples:
  - prepositions: on, under, over, …
  - particles: up, down, on, off, …
  - determiners: a, an, the, …
  - pronouns: she, who, I, …
  - conjunctions: and, but, or, …
  - auxiliary verbs: can, may, should, …
  - numerals: one, two, three, third, …
15/39
Prepositions from CELEX
16/39
English Particles
17/39
Conjunctions
18/39
POS Tagging
Choosing a Tagset
• There are many parts of speech and potential distinctions we could draw
• To do POS tagging, we need to choose a standard set of tags to work with
• Could pick a very coarse tagset
  - N, V, Adj, Adv
• The more commonly used set is finer grained: the “Penn TreeBank tagset”, 45 tags
  - PRP$, WRB, WP$, VBG
• Even more fine-grained tagsets exist
19/39
Using the Penn Tagset
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
• Prepositions and subordinating conjunctions are marked IN (“although/IN I/PRP …”)
• Except the preposition/complementizer “to”, which is just marked TO.
20/39
POS Tagging
• Words often have more than one POS: back
  - The back door = JJ
  - On my back = NN
  - Win the voters back = RB
  - Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.
(These examples are from Dekang Lin.)
21/39
How Hard is POS Tagging?
Measuring Ambiguity
22/39
Current Performance
• How many tags are correct?
  - About 97% currently
  - But the baseline is already 90%
  - Baseline algorithm (sketched in code after this slide):
    - Tag every word with its most frequent tag
    - Tag unknown words as nouns
• How well do people do?
23/39
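A minimal sketch of the baseline algorithm above, assuming a training corpus that is already available as (word, tag) pairs; the function names are illustrative, not from any particular toolkit:

from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    # Count how often each word appears with each tag.
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # Keep only the single most frequent tag for each word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def baseline_tag(words, most_frequent_tag):
    # Tag every word with its most frequent tag; unknown words become nouns (NN).
    return [(w, most_frequent_tag.get(w, "NN")) for w in words]

# Toy usage (a real model would be trained on a large tagged corpus).
corpus = [("the", "DT"), ("back", "NN"), ("the", "DT"), ("bill", "NN"), ("back", "RB")]
model = train_baseline(corpus)
print(baseline_tag(["the", "back", "door"], model))   # "door" is unknown -> NN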
Quick Test: Agreement?
 the students went to class
 plays well with others
 fruit flies like a banana
DT: the, this, that
NN: noun
VB: verb
P: preposition
ADV: adverb
24/39
Quick Test
• the students went to class
  DT  NN  VB  P  NN
• plays well with others
  VB  ADV  P  NN
  NN  NN   P  DT
• fruit flies like a banana
  NN  NN  VB  DT  NN
  NN  VB  P   DT  NN
  NN  NN  P   DT  NN
  NN  VB  VB  DT  NN
25/39
How to do it? History
• 1960s: Brown Corpus created (EN-US), 1 million words; Greene and Rubin: rule-based tagging, ~70%
• 1970s: Brown Corpus tagged; LOB Corpus created (EN-UK), 1 million words; HMM tagging (CLAWS), 93%–95%
• 1980s: LOB Corpus tagged; DeRose/Church: efficient HMM, sparse data, 95%+; POS tagging separated from other NLP
• 1990s: Penn Treebank corpus (WSJ, 4.5M words); British National Corpus (tagged by CLAWS); transformation-based tagging (Eric Brill), rule-based, 95%+; tree-based statistics (Helmut Schmid), rule-based, 96%+; trigram tagger (Kempe), 96%+; neural network taggers, 96%+
• 2000s: combined methods, 98%+
26/39
Two Methods for POS Tagging
1. Rule-based tagging
   - e.g. ENGTWOL
2. Stochastic tagging: probabilistic sequence models
   - HMM (Hidden Markov Model) tagging
   - MEMMs (Maximum Entropy Markov Models)
27/39
Rule-Based Tagging
• Start with a dictionary
• Assign all possible tags to words from the dictionary
• Write rules by hand to selectively remove tags
• Leaving the correct tag for each word
28/39
Rule-based taggers
• Early POS taggers were all hand-coded
• Most of these (Harris, 1962; Greene and Rubin, 1971), and the best of the more recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture:
  - Stage 1: look up the word in a lexicon to get a list of potential POS tags
  - Stage 2: apply rules which certify or disallow tag sequences
• Rules were originally handwritten; more recently, machine learning methods can be used
29/39
Start With a Dictionary
• she:       PRP
• promised:  VBN, VBD
• to:        TO
• back:      VB, JJ, RB, NN
• the:       DT
• bill:      NN, VB
• Etc. … for the ~100,000 words of English with more than one tag (a code sketch of such a dictionary follows this slide)
30/39
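A minimal sketch of such a tag dictionary, together with the step shown on the next slide (assigning every possible tag to each word); the entries are the ones listed above, and the dictionary itself is only a toy example:

# Toy tag dictionary: word -> set of possible Penn Treebank tags.
tag_dict = {
    "she":      {"PRP"},
    "promised": {"VBN", "VBD"},
    "to":       {"TO"},
    "back":     {"VB", "JJ", "RB", "NN"},
    "the":      {"DT"},
    "bill":     {"NN", "VB"},
}

def assign_all_tags(words):
    # Look each word up (lower-cased); unknown words fall back to NN here.
    return [(w, tag_dict.get(w.lower(), {"NN"})) for w in words]

for word, tags in assign_all_tags(["She", "promised", "to", "back", "the", "bill"]):
    print(word, sorted(tags))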
Assign Every Possible Tag
She promised to back the bill

She:       PRP
promised:  VBN, VBD
to:        TO
back:      VB, JJ, RB, NN
the:       DT
bill:      NN
31/39
Write Rules to Eliminate Tags
Rule: Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”

She promised to back the bill

She:       PRP
promised:  VBD   (VBN eliminated by the rule)
to:        TO
back:      VB, JJ, RB, NN
the:       DT
bill:      NN

(A code sketch of this rule follows this slide.)
32/39
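A minimal sketch of applying the rule above to the ambiguous tag sets from the previous step; the rule is hard-coded here for illustration, whereas a real rule-based tagger would read many such rules from a rule file:

def apply_vbn_rule(tagged):
    # tagged: list of (word, tag_set) pairs, e.g. the output of assign_all_tags().
    # Rule: eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP".
    pruned = []
    for i, (word, tags) in enumerate(tagged):
        if i == 1 and {"VBN", "VBD"} <= tags and "PRP" in tagged[0][1]:
            tags = tags - {"VBN"}
        pruned.append((word, tags))
    return pruned

ambiguous = [("She", {"PRP"}), ("promised", {"VBN", "VBD"}), ("to", {"TO"}),
             ("back", {"VB", "JJ", "RB", "NN"}), ("the", {"DT"}), ("bill", {"NN", "VB"})]
print(apply_vbn_rule(ambiguous))   # "promised" keeps only VBD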
Stage 1 of ENGTWOL Tagging
• First stage: run the words through an FST morphological analyzer to get all parts of speech.
• Example: Pavlov had shown that salivation …

Pavlov       PAVLOV N NOM SG PROPER
had          HAVE V PAST VFIN SVO
             HAVE PCP2 SVO
shown        SHOW PCP2 SVOO SVO SV
that         ADV
             PRON DEM SG
             DET CENTRAL DEM SG
             CS
salivation   N NOM SG
33/39
Stage 2 of ENGTWOL Tagging
• Second stage: apply NEGATIVE constraints.
• Example: the adverbial “that” rule
  - Eliminates all readings of “that” except the one in “It isn’t that odd” (a Python rendering follows this slide)

Given input: “that”
If
  (+1 A/ADV/QUANT)  ; the next word is an adjective, adverb, or quantifier
  (+2 SENT-LIM)     ; and the word after that is end-of-sentence
  (NOT -1 SVOC/A)   ; and the previous word is not a verb like “consider”,
                    ; which allows adjective complements, as in “I consider that odd”
Then eliminate non-ADV tags
Else eliminate ADV
34/39
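A rough Python rendering of the constraint above, assuming each token carries a set of candidate readings; the predicates is_adj_adv_quant and is_svoc_a_verb are illustrative stand-ins for ENGTWOL's lexical feature tests:

def adverbial_that_rule(tokens, i, is_adj_adv_quant, is_svoc_a_verb):
    # tokens: list of (word, set_of_candidate_readings); i: index of the word "that".
    word, readings = tokens[i]
    if word.lower() != "that":
        return readings
    next_is_a    = i + 1 < len(tokens) and is_adj_adv_quant(tokens[i + 1][0])  # (+1 A/ADV/QUANT)
    next_is_eos  = i + 2 >= len(tokens)                                        # (+2 SENT-LIM)
    prev_is_svoc = i > 0 and is_svoc_a_verb(tokens[i - 1][0])                  # (-1 SVOC/A)
    if next_is_a and next_is_eos and not prev_is_svoc:
        return {"ADV"}              # eliminate non-ADV readings
    return readings - {"ADV"}       # otherwise eliminate the ADV reading

# Toy usage: "It is n't that odd"
toks = [("It", {"PRP"}), ("is", {"VBZ"}), ("n't", {"RB"}),
        ("that", {"ADV", "DET", "PRON", "CS"}), ("odd", {"JJ"})]
print(adverbial_that_rule(toks, 3, lambda w: w == "odd", lambda w: w == "consider"))   # {'ADV'}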
Inline Mark-up
 POS Tagging
http://nlp.cs.qc.cuny.edu/wsj_pos.zip
• Input format (plain tokenized text):
  Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
• Output format (word/TAG pairs; a small parser sketch follows this slide):
  Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
35/39
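A minimal sketch of reading the word/TAG inline mark-up shown above; splitting on the last "/" keeps tokens such as "Nov./NNP" intact:

def parse_tagged_line(line):
    # Split each token on its final "/", so words that contain "/" stay whole.
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

line = "Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN"
print(parse_tagged_line(line))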
POS Tagging Tools
• NYU Prof. Ralph Grishman’s HMM POS tagger (in Java)
http://nlp.cs.qc.cuny.edu/jet.zip
http://nlp.cs.qc.cuny.edu/jet_src.zip
http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
 Demo
 How it works:
Learned HMM: data/pos_hmm.txt
Source code: src/jet/HMM/HMMTagger.java
36/39
POS Tagging Tools
• Stanford tagger (log-linear tagger)
  http://nlp.stanford.edu/software/tagger.shtml
• Brill tagger
  http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
  Usage: tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
 YamCha (SVM)
http://chasen.org/~taku/software/yamcha/
 MXPOST (Maximum Entropy)
ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
 More complete list at:
http://www-nlp.stanford.edu/links/statnlp.html#Taggers
37/39
NLP Toolkits
• Uniform CL annotation platforms
  - UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
  - Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
  - MinorThird (CMU): http://minorthird.sourceforge.net/
  - NLTK: http://nltk.sourceforge.net/
    Natural language toolkit, with data sets (Demo; see the sketch below)
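A short NLTK usage sketch (nltk.pos_tag is the tagging entry point in current NLTK releases; the tokenizer and tagger models may need a one-time download, and model names can differ across versions):

import nltk

# One-time model downloads (names as used by recent NLTK releases).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The grand jury commented on a number of other topics .")
print(nltk.pos_tag(tokens))   # [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN'), ...]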
 Information Extraction
 Jet (NYU IE toolkit)
http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
 Gate: http://gate.ac.uk/download/index.html
University of Sheffield IE toolkit
 Information Retrieval
 INDRI: http://www.lemurproject.org/indri/
Information Retrieval toolkit
 Machine Translation
 Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
 ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
 MOSES: http://www.statmt.org/moses/
38/39
Looking Ahead: Next Class
• Machine Learning for POS Tagging: Hidden Markov Model
39/39