chapter5part1

Download Report

Transcript chapter5part1

Word classes and part of
speech tagging
Chapter 5
Outline
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
On Part 2: finish stochastic tagging, and continue on to:
evaluation
Slide 1
Definition
“The process of assigning a part-of-speech or
other lexical class marker to each word in a
corpus” (Jurafsky and Martin)
WORDS
the
girl
kissed
the
boy
on
the
cheek
TAGS
N
V
P
DET
Slide 2
An Example
WORD
the
girl
kissed
the
boy
on
the
cheek
LEMMA
the
girl
kiss
the
boy
on
the
cheek
TAG
+DET
+NOUN
+VPAST
+DET
+NOUN
+PREP
+DET
+NOUN
Slide 3
Motivation
Speech synthesis — pronunciation
Speech recognition — class-based N-grams
Information retrieval — stemming, selection high-content
words
Word-sense disambiguation
Corpus analysis of language & lexicography
Slide 4
Word Classes
Basic word classes: Noun, Verb, Adjective, Adverb,
Preposition, …
Open vs. Closed classes
Open:
Nouns, Verbs, Adjectives, Adverbs.
Why “open”?
Closed:
determiners: a, an, the
pronouns: she, he, I
prepositions: on, under, over, near, by, …
Slide 5
Open Class Words
Every known human language has nouns and verbs
Nouns: people, places, things
Classes of nouns
proper vs. common
count vs. mass
Verbs: actions and processes
Adjectives: properties, qualities
Adverbs: hodgepodge!
Unfortunately, John walked home extremely slowly yesterday
Numerals: one, two, three, third, …
Slide 6
Closed Class Words
Differ more from language to language than open class
words
Examples:
prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, ..
conjunctions: and, but, or, …
auxiliary verbs: can, may should, …
Slide 7
Word Classes: Tag Sets
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and
purpose
– Some tagging approaches (e.g., constraint grammar based) make
fewer distinctions e.g., conflating prepositions, conjunctions,
particles
– Simple morphology = more ambiguity = fewer tags
Slide 8
Word Classes: Tag set example
PRP
PRP$
Slide 9
Example of Penn Treebank Tagging of
Brown Corpus Sentence
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
number/NN of/IN other/JJ topics/NNS ./.
VB DT NN .
Book that flight .
VBZ DT NN VB NN ?
Does that flight serve dinner ?
See http://www.infogistics.com/posdemo.htm
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo
Slide 10
The Problem
Words often have more than one word class: this
This is a nice day = PRP
This day is nice = DT
You can go this far = RB
Slide 11
Word Class Ambiguity
(in the Brown Corpus)
Unambiguous (1 tag): 35,340
Ambiguous (2-7 tags): 4,100
2 tags
3,760
3 tags
264
4 tags
61
5 tags
12
6 tags
2
7 tags
1
(Derose, 1988)
Slide 12
Part-of-Speech Tagging
• Rule-Based Tagger: ENGTWOL (ENGlish TWO Level
analysis)
• Stochastic Tagger: HMM-based
• Transformation-Based Tagger (Brill) (we won’t cover this)
Slide 13
Rule-Based Tagging
• Basic Idea:
– Assign all possible tags to words
– Remove tags according to set of rules of type: if word+1 is an
adj, adv, or quantifier and the following is a sentence boundary
and word-1 is not a verb like “consider” then eliminate non-adv
else eliminate adv.
– Typically more than 1000 hand-written rules
Slide 14
Sample ENGTWOL Lexicon
Demo: http://www2.lingsoft.fi/cgi-bin/engtwol
Slide 15
Stage 1 of ENGTWOL Tagging
First Stage: Run words through a morphological analyzer to get all
parts of speech.
Example: Pavlov had shown that salivation …
Pavlov
had
shown
that
salivation
PAVLOV N NOM SG PROPER
HAVE V PAST VFIN SVO
HAVE PCP2 SVO
SHOW PCP2 SVOO SVO SV
ADV
PRON DEM SG
DET CENTRAL DEM SG
CS
N NOM SG
Slide 16
Stage 2 of ENGTWOL Tagging
Second Stage: Apply constraints.
Constraints used in negative way.
Example: Adverbial “that” rule
Given input: “that”
If
(+1 A/ADV/QUANT)
(+2 SENT-LIM)
(NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV
Slide 17