Lecture 9: Part of Speech
Kai-Wei Chang
CS @ University of Virginia
[email protected]
Course webpage: http://kwchang.net/teaching/NLP16
CS6501 Natural Language Processing
This lecture
Parts of speech (POS)
POS Tagsets
Parts of Speech
Traditional parts of speech
~ 8 of them
POS examples
N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those
Parts of Speech
A.k.a. parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
Lots of debate within linguistics about the number, nature, and universality of these
POS Tagging
The process of assigning a part-of-speech to each word in a collection (sentence).
WORD  the  koala  put  the  keys  on  the  table
tag   DET  N      V    DET  N     P   DET  N
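The koala example can be sketched as a toy lookup tagger. This is only an illustration of the input/output shape of tagging: the `LEXICON` table, the `tag_sentence` helper, and the `UNK` fallback are hypothetical names, and a real tagger has to resolve words that carry more than one tag.

```python
# Hypothetical toy lexicon: every word has exactly one coarse tag.
LEXICON = {
    "the": "DET",
    "koala": "N",
    "put": "V",
    "keys": "N",
    "on": "P",
    "table": "N",
}


def tag_sentence(words, lexicon=LEXICON, unknown="UNK"):
    """Return one (word, tag) pair per input word, in order."""
    return [(w, lexicon.get(w.lower(), unknown)) for w in words]


tagged = tag_sentence("the koala put the keys on the table".split())
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

The output is a sequence of (word, tag) pairs with the same length as the input sentence, which is the form the later examples in these notes assume.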
Why is POS Tagging Useful?
First step of a vast number of practical tasks
Parsing
Need to know if a word is an N or V before you can parse
Information extraction
Finding names, relations, etc.
Speech synthesis/recognition
OBject    / obJECT
OVERflow  / overFLOW
DIScount  / disCOUNT
CONtent   / conTENT
Machine Translation
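To make the speech synthesis case concrete, here is a minimal sketch of choosing a stress pattern from the POS tag. The `STRESS` table and `pronounce` helper are hypothetical, and the tag attached to each pattern (noun for initial stress; verb, or adjective for "content", for final stress) is the usual English reading rather than something stated on the slide.

```python
# Capitals mark the stressed syllable. Keys are (word, coarse POS tag).
STRESS = {
    ("object", "N"): "OBject",
    ("object", "V"): "obJECT",
    ("overflow", "N"): "OVERflow",
    ("overflow", "V"): "overFLOW",
    ("discount", "N"): "DIScount",
    ("discount", "V"): "disCOUNT",
    ("content", "N"): "CONtent",
    ("content", "ADJ"): "conTENT",
}


def pronounce(word, tag):
    """Pick a stress pattern for a tagged word; fall back to the plain word."""
    return STRESS.get((word.lower(), tag), word)


print(pronounce("object", "N"))
print(pronounce("object", "V"))
```

Without the tag, the synthesizer has no basis for choosing between the two pronunciations of the same spelling.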
Open and Closed Classes
Closed class: a small fixed membership
Prepositions: of, in, by, …
Pronouns: I, you, she, mine, his, them, …
Usually function words (short common words which play a role in grammar)
Open class: new ones can be created
English has 4: Nouns, Verbs, Adjectives, Adverbs
Many languages have these 4, but not all!
Open Class Words
Nouns
Proper nouns (Boulder, Granby, Eli Manning)
Common nouns (the rest).
Count nouns and mass nouns
Count: have plurals, get counted: goat/goats, one goat, two goats
Mass: don’t get counted (snow, salt, communism) (*two snows)
Verbs
In English, have morphological affixes (eat/eats/eaten)
Closed Class Words
Examples:
prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, …
conjunctions: and, but, or, …
auxiliary verbs: can, may, should, …
numerals: one, two, three, third, …
Prepositions from CELEX
CELEX: online dictionary
Frequency counts are from the COBUILD 16-million-word corpus
English Particles
Conjunctions
Choosing a Tagset
Could pick very coarse tagsets
N, V, Adj, Adv, Other
More commonly used set is finer grained
E.g., “Penn TreeBank tagset”, 45 tags: PRP$, WRB, WP$, VBG
Brown corpus, 87 tags.
Prague Dependency Treebank (Czech): 4452 tags
AAFP3----3N----: (nejnezajímavějším)
Adj Regular Feminine Plural….Superlative [Hajic 2006, VMC tutorial]
Penn TreeBank POS Tagset
Using the Penn Tagset
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
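The word/TAG notation above is easy to read back into (word, tag) pairs. A minimal sketch, assuming whitespace-separated tokens where the tag follows the last slash; `parse_tagged` is a hypothetical helper name.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so tokens such as './.' parse correctly.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


sentence = (
    "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN "
    "of/IN other/JJ topics/NNS ./."
)
pairs = parse_tagged(sentence)
print(pairs)
```

Splitting on the last slash rather than the first is a deliberate choice here: punctuation is tagged as itself, and a word can contain a slash.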
Universal Tag set
~ 12 different tags
NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, “.”, X
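Fine-grained tags can be collapsed onto the universal set with a lookup table. The sketch below covers only the handful of Penn tags used in the grand jury example; `PENN_TO_UNIVERSAL` is a deliberately partial, illustrative table and `to_universal` is a hypothetical helper, with anything unlisted falling back to X.

```python
# Partial Penn Treebank -> universal tag table (illustration only).
PENN_TO_UNIVERSAL = {
    "DT": "DET",
    "JJ": "ADJ",
    "NN": "NOUN",
    "NNS": "NOUN",
    "VBD": "VERB",
    "IN": "ADP",
    ".": ".",
}


def to_universal(penn_tags, table=PENN_TO_UNIVERSAL):
    """Map each Penn tag to a universal tag; unlisted tags become 'X'."""
    return [table.get(tag, "X") for tag in penn_tags]


penn = ["DT", "JJ", "NN", "VBD", "IN", "DT", "NN", "IN", "JJ", "NNS", "."]
print(to_universal(penn))
```

Note that the mapping is many-to-one: NN and NNS both become NOUN, so the number distinction is lost in the coarser tagset.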
POS Tagging vs. Word clustering
Words often have more than one POS:
back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
These examples from Dekang Lin
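The "back" example shows why a fixed word-to-class assignment is not enough: the tag depends on context. A small sketch that collects the set of tags seen per word and lists the ambiguous ones; the `observations` list is made up from the four examples above plus a few unambiguous words added for contrast.

```python
from collections import defaultdict

# (word, tag) observations; the four 'back' entries come from the examples above.
observations = [
    ("back", "JJ"),   # the back door
    ("back", "NN"),   # on my back
    ("back", "RB"),   # win the voters back
    ("back", "VB"),   # promised to back the bill
    ("door", "NN"),
    ("the", "DT"),
    ("bill", "NN"),
]

tags_for = defaultdict(set)
for word, tag in observations:
    tags_for[word].add(tag)

# Words observed with more than one tag are ambiguous.
ambiguous = {w: sorted(tags) for w, tags in tags_for.items() if len(tags) > 1}
print(ambiguous)
```

A clustering that puts each word type in exactly one cluster would have to pick a single class for "back", whereas a tagger labels each occurrence separately.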
How Hard is POS Tagging?
POS tag sequences
Some tag sequences are more likely to occur than others
POS Ngram view
https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C_ADV_+_NOUN_%2C+_ADV_+_VERB_
Existing methods often model POS tagging as a sequence tagging problem
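One way to see that some tag sequences are more likely than others is to count tag bigrams in tagged text. A minimal sketch on a single made-up training sentence (the koala example); `tag_bigram_counts` is a hypothetical helper, and a real estimate would need a large tagged corpus.

```python
from collections import Counter

# Toy tagged data: one sentence, as a list of (word, tag) pairs.
tagged_sentences = [
    [("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
     ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")],
]


def tag_bigram_counts(sentences):
    """Count adjacent tag pairs (t_i, t_{i+1}) within each sentence."""
    counts = Counter()
    for sent in sentences:
        tags = [tag for _, tag in sent]
        counts.update(zip(tags, tags[1:]))
    return counts


counts = tag_bigram_counts(tagged_sentences)
for bigram, n in counts.most_common():
    print(bigram, n)
```

Even in this tiny sample, DET followed by N occurs three times while N followed by DET never occurs, which is the kind of regularity a sequence model can exploit.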
Evaluation
How many words in the unseen test data can be tagged correctly?
Usually evaluated on Penn Treebank
State of the art ~97%
Trivial baseline (most likely tag) ~94%
Human performance ~97%
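The accuracy measure and the "most likely tag" baseline can be sketched as follows. The tiny training and test sets are made up for illustration, and the function names are hypothetical; the percentages quoted above come from the Penn Treebank, not from anything this toy code could reproduce.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Learn each word's most frequent tag, plus the overall most frequent tag."""
    word_tag_counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            word_tag_counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    best = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    default_tag = all_tags.most_common(1)[0][0]  # used for unseen words
    return best, default_tag


def accuracy(best, default_tag, tagged_sentences):
    """Fraction of test tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in tagged_sentences:
        for word, gold in sent:
            predicted = best.get(word.lower(), default_tag)
            correct += predicted == gold
            total += 1
    return correct / total


train = [
    [("the", "DT"), ("back", "JJ"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("hurt", "VBD"), ("my", "PRP$"), ("back", "NN")],
]
test = [
    [("promised", "VBD"), ("to", "TO"), ("back", "VB"),
     ("the", "DT"), ("bill", "NN")],
]

best, default_tag = train_most_frequent_tag(train)
print(accuracy(best, default_tag, test))
```

On this toy split the baseline gets only "the" and "bill" right (2 of 5 tokens): unseen words fall back to the most frequent tag, and "back" is always tagged NN regardless of context.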