Lecture 9: Part of Speech
Kai-Wei Chang
CS @ University of Virginia
Course webpage: http://kwchang.net/teaching/NLP16
CS6501 Natural Language Processing
This lecture
• Parts of speech (POS)
• POS tagsets
Parts of Speech
• Traditional parts of speech
  • ~8 of them
POS examples
N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those
Parts of Speech
• A.k.a. parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
• Lots of debate within linguistics about the number, nature, and universality of these
POS Tagging
• The process of assigning a part-of-speech to each word in a collection (sentence).
WORD  the  koala  put  the  keys  on  the  table
tag   DET  N      V    DET  N     P   DET  N
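A minimal sketch of how the tagged sentence above might be represented in code. The list names and the `format_tagged` helper are illustrative assumptions, not part of the lecture; only the words and tags come from the slide.

```python
# Words and tags from the slide's example sentence.
words = ["the", "koala", "put", "the", "keys", "on", "the", "table"]
tags = ["DET", "N", "V", "DET", "N", "P", "DET", "N"]

# A tagging assigns exactly one tag per token, so the two lists align.
tagged = list(zip(words, tags))


def format_tagged(pairs):
    """Render (word, tag) pairs in the common word/TAG notation."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


print(format_tagged(tagged))
```

Keeping words and tags as aligned pairs makes it straightforward to count tags, compare against a gold annotation, or feed the sequence to a tagger.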
Why is POS Tagging Useful?
• First step of a vast number of practical tasks
• Parsing
  • Need to know if a word is an N or V before you can parse
• Information extraction
  • Finding names, relations, etc.
• Speech synthesis/recognition
  • OBject / obJECT
  • OVERflow / overFLOW
  • DIScount / disCOUNT
  • CONtent / conTENT
• Machine Translation
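The stress pairs on the "Why is POS Tagging Useful?" slide can be read as a lookup keyed on both the spelling and the POS. The sketch below is a toy illustration: the `STRESS` table and `stressed_form` function are hypothetical, and the noun/verb/adjective labels are the usual English readings of these pairs rather than something stated on the slide.

```python
# Hypothetical toy lexicon: (spelling, coarse POS) -> stress pattern.
# The POS labels are an assumption; the slide lists only the stress pairs.
STRESS = {
    ("object", "N"): "OBject",
    ("object", "V"): "obJECT",
    ("overflow", "N"): "OVERflow",
    ("overflow", "V"): "overFLOW",
    ("discount", "N"): "DIScount",
    ("discount", "V"): "disCOUNT",
    ("content", "N"): "CONtent",
    ("content", "ADJ"): "conTENT",
}


def stressed_form(word, pos):
    """Return the stress pattern for a word given its POS.

    Words missing from the toy lexicon are returned unchanged.
    """
    return STRESS.get((word.lower(), pos), word)


print(stressed_form("object", "N"))
print(stressed_form("object", "V"))
```

A speech synthesizer cannot choose between the two pronunciations from the spelling alone, which is why a tag for each word is needed first.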
Open and Closed Classes
• Closed class: a small fixed membership
  • Prepositions: of, in, by, …
  • Pronouns: I, you, she, mine, his, them, …
  • Usually function words (short common words which play a role in grammar)
• Open class: new ones can be created
  • English has 4: Nouns, Verbs, Adjectives, Adverbs
  • Many languages have these 4, but not all!
Open Class Words
• Nouns
  • Proper nouns (Boulder, Granby, Eli Manning)
  • Common nouns (the rest)
  • Count nouns and mass nouns
    • Count: have plurals, get counted: goat/goats, one goat, two goats
    • Mass: don’t get counted (snow, salt, communism) (*two snows)
• Verbs
  • In English, have morphological affixes (eat/eats/eaten)
Closed Class Words
Examples:
prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, …
conjunctions: and, but, or, …
auxiliary verbs: can, may, should, …
numerals: one, two, three, third, …
Prepositions from CELEX
CELEX: online dictionary
Frequency counts are from the COBUILD 16-million-word corpus
English Particles
Conjunctions
Choosing a Tagset
• Could pick very coarse tagsets
  • N, V, Adj, Adv, Other
• More commonly used set is finer grained
  • E.g., “Penn TreeBank tagset”, 45 tags: PRP$, WRB, WP$, VBG
  • Brown corpus, 87 tags
• Prague Dependency Treebank (Czech)
  • 4452 tags
  • AAFP3----3N----: (nejnezajímavějším)
    Adj Regular Feminine Plural….Superlative [Hajic 2006, VMC tutorial]
Penn TreeBank POS Tagset
Using the Penn Tagset
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
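The word/TAG notation in this example is easy to read back into (word, tag) pairs. A small sketch follows; the `parse_tagged` helper is an illustrative assumption, not a function from any particular toolkit.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs.

    rpartition splits on the last slash, so a token whose word part
    itself contains a slash still parses, and './.' yields ('.', '.').
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sentence = (
    "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN "
    "of/IN other/JJ topics/NNS ./."
)
pairs = parse_tagged(sentence)
print(pairs)
```

Splitting on the last slash rather than the first is a deliberate choice here, since the tag is always the final field of a token.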
Universal Tag Set
• ~12 different tags
• NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, “.”, X
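Fine-grained tags can be collapsed onto the universal set with a lookup table. The sketch below covers only a handful of Penn tags, so it is a partial, illustrative mapping; the `PENN_TO_UNIVERSAL` table and `to_universal` function are assumptions made for this example, and sending every unlisted tag to X is a simplification.

```python
# Partial, illustrative mapping from Penn Treebank tags to universal tags.
# Only a few tags are listed; a complete mapping covers all 45 Penn tags.
PENN_TO_UNIVERSAL = {
    "DT": "DET",
    "JJ": "ADJ",
    "NN": "NOUN",
    "NNS": "NOUN",
    "VBD": "VERB",
    "VBG": "VERB",
    "IN": "ADP",
    ".": ".",
}


def to_universal(penn_tag):
    """Map a Penn tag to a universal tag; unlisted tags fall back to X."""
    return PENN_TO_UNIVERSAL.get(penn_tag, "X")


# Penn tags of "The grand jury commented on a number of other topics ."
penn = ["DT", "JJ", "NN", "VBD", "IN", "DT", "NN", "IN", "JJ", "NNS", "."]
universal = [to_universal(tag) for tag in penn]
print(universal)
```

The mapping is many-to-one: NN and NNS both become NOUN, so the singular/plural distinction is lost in exchange for a smaller tagset.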
POS Tagging vs. Word Clustering
• Words often have more than one POS: back
  • The back door = JJ
  • On my back = NN
  • Win the voters back = RB
  • Promised to back the bill = VB
These examples are from Dekang Lin
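One way to see this ambiguity in data is to collect the set of tags each word receives in a tagged corpus. In the sketch below, the tags for "back" come from the slide; the Penn tags on the surrounding words are annotations added for this illustration, and `ambiguous_words` is a hypothetical helper.

```python
from collections import defaultdict

# The four phrases from the slide as (word, tag) pairs.
# Tags for "back" are from the slide; the other tags are added here.
corpus = [
    [("The", "DT"), ("back", "JJ"), ("door", "NN")],
    [("On", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("Win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")],
    [("Promised", "VBD"), ("to", "TO"), ("back", "VB"),
     ("the", "DT"), ("bill", "NN")],
]


def ambiguous_words(tagged_sentences):
    """Return {word: set of tags} for words seen with more than one tag."""
    tags_for = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tags_for[word.lower()].add(tag)
    return {w: t for w, t in tags_for.items() if len(t) > 1}


print(ambiguous_words(corpus))
```

A clustering that assigns each word type to a single class cannot express this: "back" needs a different label in each of the four contexts.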
How Hard is POS Tagging?
POS tag sequences
• Some tag sequences are more likely to occur than others
• POS Ngram view
  https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C_ADV_+_NOUN_%2C+_ADV_+_VERB_
• Existing methods often model POS tagging as a sequence tagging problem
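The idea that some tag sequences are more likely than others can be made concrete by counting tag bigrams. The sketch below uses the tag sequence of the koala sentence from the earlier slide as toy data; `transition_prob` is a hypothetical helper that computes a plain relative-frequency estimate with no smoothing, which is an assumption of this example rather than a method from the lecture.

```python
from collections import Counter

# Tag sequence of "the koala put the keys on the table".
tags = ["DET", "N", "V", "DET", "N", "P", "DET", "N"]

# Count adjacent tag pairs, and how often each tag appears as a predecessor.
bigrams = Counter(zip(tags, tags[1:]))
prev_counts = Counter(tags[:-1])


def transition_prob(prev, nxt):
    """Relative-frequency estimate of P(nxt | prev); 0.0 if prev is unseen."""
    if prev_counts[prev] == 0:
        return 0.0
    return bigrams[(prev, nxt)] / prev_counts[prev]


print(bigrams.most_common())
print(transition_prob("DET", "N"))
print(transition_prob("N", "V"))
```

Sequence tagging models use statistics of this kind so that the tag chosen for one word can inform the tag chosen for its neighbors.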
Evaluation
• How many words in the unseen test data can be tagged correctly?
• Usually evaluated on Penn Treebank
• State of the art ~97%
• Trivial baseline (most likely tag) ~94%
• Human performance ~97%
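The evaluation metric and the "most likely tag" baseline from the Evaluation slide can be sketched in a few lines. Everything below is a toy illustration: the function names are hypothetical, the training and test data are made up from the lecture's example phrases, and tagging unseen words with the overall most frequent tag is one common choice assumed here, not something the slide specifies.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag, plus a default for unseen words."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default = all_tags.most_common(1)[0][0]
    return lexicon, default


def tag_baseline(words, lexicon, default):
    """Tag each word with its most frequent training tag."""
    return [lexicon.get(w.lower(), default) for w in words]


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy training data built from phrases used earlier in the lecture.
train = [
    [("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
     ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")],
    [("on", "P"), ("my", "PRO"), ("back", "N")],
]
lexicon, default = train_baseline(train)

# Toy test sentence: "back" is a verb here but was only seen as a noun.
test_words = ["back", "the", "bill"]
gold = ["V", "DET", "N"]
predicted = tag_baseline(test_words, lexicon, default)
print(predicted, accuracy(gold, predicted))
```

The baseline ignores context entirely, so it fails exactly on ambiguous words such as "back"; the gap between it and the state of the art comes from resolving those cases.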