Corpus Annotation for Computational Linguistics
POS Tagging: Introduction
Heng Ji
[email protected]
Feb 2, 2008
Acknowledgement: some slides from Ralph Grishman, Nicolas Nicolov, J&M
Some Administrative Stuff
Assignment 1 due on Feb 17
Textbook: required for assignments and final exam
Outline
Parts of speech (POS)
Tagsets
POS Tagging
Rule-based tagging
Markup Format
Open-Source Toolkits
What is Part-of-Speech (POS)?
Generally speaking, word classes (= POS):
Verb, Noun, Adjective, Adverb, Article, …
We can also include inflection:
Verbs: Tense, number, …
Nouns: Number, proper/common, …
Adjectives: comparative, superlative, …
…
Parts of Speech
8 (ish) traditional parts of speech
Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
Lots of debate within linguistics about the number, nature, and universality of these
We’ll completely ignore this debate.
7 Traditional POS Categories
N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those
POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a collection.

WORD    TAG
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N
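As a quick aside for readers following along at a keyboard: NLTK (listed with the other toolkits at the end of these slides) ships an off-the-shelf tagger that produces Penn Treebank tags. A minimal sketch, assuming NLTK and its tagger model are installed; the resource names below are those of current NLTK releases, not something from these slides:

    import nltk

    # One-time downloads of the tokenizer and tagger models.
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    tokens = nltk.word_tokenize("the koala put the keys on the table")
    print(nltk.pos_tag(tokens))
    # e.g. [('the', 'DT'), ('koala', 'NN'), ('put', 'VBD'), ('the', 'DT'), ...]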
Penn TreeBank POS Tag Set
Penn Treebank: hand-annotated corpus of Wall Street Journal, 1M words
45 tags
Some particularities:
to/TO not disambiguated
Auxiliaries and verbs not distinguished
Penn Treebank Tagset
Why is POS Tagging Useful?
Speech synthesis:
How to pronounce “lead”?
INsult vs. inSULT
OBject vs. obJECT
OVERflow vs. overFLOW
DIScount vs. disCOUNT
CONtent vs. conTENT
Stemming for information retrieval
A search for "aardvarks" can also match "aardvark"
Parsing, speech recognition, etc.
Possessive pronouns (my, your, her) followed by nouns
Personal pronouns (I, you, he) likely to be followed by verbs
Need to know if a word is an N or V before you can parse
Information extraction
Finding names, relations, etc.
Machine Translation
Equivalent Problem in Bioinformatics
Durbin et al., Biological Sequence Analysis, Cambridge University Press.
Several applications, e.g. proteins:
From primary structure ATCPLELLLD
infer secondary structure HHHBBBBBC..
Why is POS Tagging Useful? (cont.)
First step of a vast number of practical tasks: speech synthesis, parsing, information extraction, machine translation (see the examples on the previous slides).
Open and Closed Classes
Closed class: a small fixed membership
Prepositions: of, in, by, …
Auxiliaries: may, can, will, had, been, …
Pronouns: I, you, she, mine, his, them, …
Usually function words (short common words which play a role in grammar)
Open class: new ones can be created all the time
English has 4: Nouns, Verbs, Adjectives, Adverbs
Many languages have these 4, but not all!
Open Class Words
Nouns
Proper nouns (Boulder, Granby, Eli Manning)
English capitalizes these.
Common nouns (the rest).
Count nouns and mass nouns
Count: have plurals, get counted: goat/goats, one goat, two goats
Mass: don’t get counted (snow, salt, communism) (*two snows)
Adverbs: tend to modify things
Unfortunately, John walked home extremely slowly yesterday
Directional/locative adverbs (here, home, downhill)
Degree adverbs (extremely, very, somewhat)
Manner adverbs (slowly, slinkily, delicately)
Verbs
In English, verbs have morphological affixes (eat/eats/eaten)
Closed Class Words
Examples:
prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, …
conjunctions: and, but, or, …
auxiliary verbs: can, may, should, …
numerals: one, two, three, third, …
Prepositions from CELEX
English Particles
Conjunctions
POS Tagging
Choosing a Tagset
There are many parts of speech and many potential distinctions we could draw
To do POS tagging, we need to choose a standard set of tags to work with
Could pick very coarse tagsets
N, V, Adj, Adv.
More commonly used is a finer-grained set, the "Penn Treebank tagset", 45 tags
PRP$, WRB, WP$, VBG
Even more fine-grained tagsets exist
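One way to see the coarse/fine trade-off concretely: fine-grained Penn tags can be collapsed back into the coarse four-way set by prefix. A hypothetical sketch; the mapping table is illustrative, not part of any standard:

    # Collapse fine-grained Penn Treebank tags into coarse classes.
    COARSE = {"NN": "N", "VB": "V", "JJ": "Adj", "RB": "Adv"}

    def coarsen(penn_tag):
        """Map a Penn tag such as 'NNS' or 'VBD' to a coarse class."""
        for prefix, coarse in COARSE.items():
            if penn_tag.startswith(prefix):
                return coarse
        return penn_tag  # leave closed-class tags (DT, IN, ...) unchanged

    print(coarsen("NNS"), coarsen("VBG"), coarsen("JJR"))  # N V Adj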
Using the Penn Tagset
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Prepositions and subordinating conjunctions are marked IN ("although/IN I/PRP..")
Except the preposition/complementizer "to", which is just marked "TO".
POS Tagging
Words often have more than one POS: back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
These examples from Dekang Lin
How Hard is POS Tagging?
Measuring Ambiguity
Current Performance
How many tags are correct?
About 97% currently
But baseline is already 90%
Baseline algorithm:
Tag every word with its most frequent tag
Tag unknown words as nouns
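The baseline just described is simple enough to fit in a few lines. A minimal sketch, assuming training data is available as (word, tag) pairs; the function names are mine, not from a particular toolkit:

    from collections import Counter, defaultdict

    def train_baseline(tagged_words):
        """Learn each word's single most frequent tag from training data."""
        counts = defaultdict(Counter)
        for word, tag in tagged_words:
            counts[word][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag_baseline(words, most_frequent_tag):
        """Tag known words with their most frequent tag, unknown words as NN."""
        return [(w, most_frequent_tag.get(w, "NN")) for w in words]

    train = [("the", "DT"), ("back", "NN"), ("back", "NN"), ("back", "VB")]
    model = train_baseline(train)
    print(tag_baseline(["the", "back", "koala"], model))
    # [('the', 'DT'), ('back', 'NN'), ('koala', 'NN')]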
How well do people do?
Quick Test: Agreement?
the students went to class
plays well with others
fruit flies like a banana
DT: the, this, that
NN: noun
VB: verb
P: preposition
ADV: adverb
Quick Test
the students went to class
DT  NN       VB   P  NN

plays well with others
VB    ADV  P    NN
NN    NN   P    DT

fruit flies like a banana
NN    NN    VB   DT NN
NN    VB    P    DT NN
NN    NN    P    DT NN
NN    VB    VB   DT NN
How to do it? History
A rough timeline (decades approximate):
1960s: Brown Corpus created (EN-US), 1 million words.
1970s: Brown Corpus tagged. Greene and Rubin: rule-based tagging, ~70%. LOB Corpus created (EN-UK), 1 million words.
1980s: LOB Corpus tagged. HMM tagging (CLAWS): 93%-95%. DeRose/Church: efficient HMMs handling sparse data, 95%+.
1990s: Penn Treebank corpus (WSJ, 4.5M words). British National Corpus (tagged by CLAWS). Transformation-based tagging (Eric Brill): rule-based, 95%+. Tree-based statistics (Helmut Schmid): 96%+. Trigram tagger (Kempe): 96%+. Neural networks: 96%+. POS tagging becomes a task separated from other NLP.
2000s: Combined methods: 98%+.
Two Methods for POS Tagging
1. Rule-based tagging (ENGTWOL)
2. Stochastic: probabilistic sequence models
   HMM (Hidden Markov Model) tagging
   MEMMs (Maximum Entropy Markov Models)
Rule-Based Tagging
Start with a dictionary
Assign all possible tags to words from the dictionary
Write rules by hand to selectively remove tags, leaving the correct tag for each word.
Rule-based taggers
Early POS taggers all hand-coded
Most of these (Harris, 1962; Greene and Rubin, 1971) and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture
Stage 1: look up word in lexicon to give a list of potential POSs
Stage 2: apply rules which certify or disallow tag sequences
Rules originally handwritten; more recently, machine learning methods can be used
Start With a Dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• Etc… for the ~100,000 words of English with more than 1 tag
Assign Every Possible Tag
                NN
                RB
    VBN         JJ       VB
PRP VBD      TO VB   DT  NN
She promised to back the bill
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"

                NN
                RB
                JJ       VB
PRP VBD      TO VB   DT  NN
She promised to back the bill
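The two steps (assign all dictionary tags, then eliminate by rule) are easy to mimic in code. A toy sketch with a six-word dictionary invented just for this sentence:

    # Step 1: dictionary lookup gives each word all of its possible tags.
    DICTIONARY = {
        "she": ["PRP"], "promised": ["VBN", "VBD"], "to": ["TO"],
        "back": ["VB", "JJ", "RB", "NN"], "the": ["DT"], "bill": ["NN", "VB"],
    }

    def assign_all_tags(words):
        return [list(DICTIONARY[w.lower()]) for w in words]

    # Step 2: the rule from this slide, applied to the tag lattice.
    def eliminate_vbn_after_initial_prp(lattice):
        """Drop VBN where VBD is also an option and the word follows
        a sentence-initial PRP."""
        for i, tags in enumerate(lattice):
            if (i == 1 and lattice[0] == ["PRP"]
                    and "VBN" in tags and "VBD" in tags):
                tags.remove("VBN")
        return lattice

    words = ["She", "promised", "to", "back", "the", "bill"]
    print(eliminate_vbn_after_initial_prp(assign_all_tags(words)))
    # promised keeps only VBD; all other words are unchanged.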
Stage 1 of ENGTWOL Tagging
First Stage: Run words through an FST morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …

Pavlov       PAVLOV N NOM SG PROPER
had          HAVE V PAST VFIN SVO
             HAVE PCP2 SVO
shown        SHOW PCP2 SVOO SVO SV
that         ADV
             PRON DEM SG
             DET CENTRAL DEM SG
             CS
salivation   N NOM SG
Stage 2 of ENGTWOL Tagging
Second Stage: Apply NEGATIVE constraints.
Example: Adverbial “that” rule
Eliminates all readings of "that" except the one in "It isn't that odd"

Given input: "that"
If
  (+1 A/ADV/QUANT)   ; the next word is an adjective, adverb, or quantifier
  (+2 SENT-LIM)      ; and the word after that is end-of-sentence
  (NOT -1 SVOC/A)    ; and the previous word is not a verb like "consider"
                     ; which allows adjective complements, as in
                     ; "I consider that odd"
Then eliminate non-ADV tags
Else eliminate ADV
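Rendered as Python, the same constraint might look as follows. Tokens are (word, set-of-readings) pairs; the tag names follow the slide, and everything else is an illustrative sketch rather than the real ENGTWOL rule engine:

    def adverbial_that(tokens, i):
        """Keep only the ADV reading of 'that' at position i when all
        three context tests pass; otherwise drop the ADV reading."""
        word, tags = tokens[i]
        next_is_a_adv_quant = (i + 1 < len(tokens)
                               and tokens[i + 1][1] & {"A", "ADV", "QUANT"})
        followed_by_eos = i + 2 >= len(tokens)              # (+2 SENT-LIM)
        prev_not_svoc_a = i == 0 or "SVOC/A" not in tokens[i - 1][1]
        if next_is_a_adv_quant and followed_by_eos and prev_not_svoc_a:
            tokens[i] = (word, tags & {"ADV"})   # eliminate non-ADV tags
        else:
            tokens[i] = (word, tags - {"ADV"})   # eliminate ADV
        return tokens

    # "It isn't that odd": only the ADV reading of "that" survives.
    sent = [("it", {"PRON"}), ("isn't", {"V"}),
            ("that", {"ADV", "DET", "PRON", "CS"}), ("odd", {"A"})]
    print(adverbial_that(sent, 2))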
Inline Mark-up
POS Tagging
http://nlp.cs.qc.cuny.edu/wsj_pos.zip
Input Format
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Output Format
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
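Reading and writing this inline word/TAG format takes only a couple of lines; the helper names below are mine. Splitting on the last "/" keeps tokens such as "Nov./NNP" intact:

    def parse_tagged(line):
        """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
        return [tuple(tok.rsplit("/", 1)) for tok in line.split()]

    def to_tagged(pairs):
        """Inverse: render (word, tag) pairs back to the inline format."""
        return " ".join(f"{w}/{t}" for w, t in pairs)

    line = "Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ"
    pairs = parse_tagged(line)
    print(pairs[0])                   # ('Pierre', 'NNP')
    print(to_tagged(pairs) == line)   # True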
POS Tagging Tools
NYU Prof. Ralph Grishman's HMM POS tagger (in Java)
http://nlp.cs.qc.cuny.edu/jet.zip
http://nlp.cs.qc.cuny.edu/jet_src.zip
http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
Demo
How it works:
Learned HMM: data/pos_hmm.txt
Source code: src/jet/HMM/HMMTagger.java
POS Tagging Tools
Stanford tagger (log-linear tagger)
http://nlp.stanford.edu/software/tagger.shtml
Brill tagger
http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
Usage: tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
YamCha (SVM)
http://chasen.org/~taku/software/yamcha/
MXPOST (Maximum Entropy)
ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
More complete list at:
http://www-nlp.stanford.edu/links/statnlp.html#Taggers
NLP Toolkits
Uniform CL Annotation Platform
UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
MinorThird (CMU): http://minorthird.sourceforge.net/
NLTK: http://nltk.sourceforge.net/
Natural language toolkit, with data sets. Demo
Information Extraction
Jet (NYU IE toolkit)
http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
Gate: http://gate.ac.uk/download/index.html
University of Sheffield IE toolkit
Information Retrieval
INDRI: http://www.lemurproject.org/indri/
Information Retrieval toolkit
Machine Translation
Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
MOSES: http://www.statmt.org/moses/
Looking Ahead: Next Class
Machine Learning for POS Tagging:
Hidden Markov Model