School of Computing
FACULTY OF ENGINEERING
PoS-Tagging theory and terminology
COMP3310 Natural Language Processing
Eric Atwell, Language Research Group
(with thanks to Katja Markert, Marti Hearst,
and other contributors)
Reminder: PoS-tagging programs
Models behind some example PoS-tagging methods in NLTK:
Hand-coded
Statistical taggers
Brill (transformation-based) tagger
NB you don’t have to use NLTK – useful to illustrate
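For illustration, a minimal NLTK sketch of two of these families might look as follows (the Brown corpus slice and the example sentence are choices made here for illustration only):

import nltk
from nltk.corpus import brown   # assumes nltk.download('brown') has been run

train_sents = brown.tagged_sents(categories='news')[:3000]
sentence = "The bear is on the move".split()

# Hand-coded baseline: tag every word as NN
default_tagger = nltk.DefaultTagger('NN')
print(default_tagger.tag(sentence))

# Statistical tagger: unigram model estimated from a tagged corpus
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)
print(unigram_tagger.tag(sentence))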
Training and Testing of
Machine Learning Algorithms
Algorithms that “learn” from data see a set of examples
and try to generalize from them.
Training set:
• Examples trained on
Test set:
• Also called held-out data and unseen data
• Use this for evaluating your algorithm
• Must be separate from the training set
• Otherwise, you cheated!
“Gold standard” evaluation set
• A test set that a community has agreed on and uses as a common
benchmark. DO NOT use it during training or development testing – reserve it for final evaluation
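As a hedged sketch of this split (assuming NLTK and the Brown corpus; the 90/10 proportion is an arbitrary choice):

from nltk.corpus import brown
from nltk import UnigramTagger

tagged_sents = list(brown.tagged_sents(categories='news'))
cut = int(0.9 * len(tagged_sents))
train_set, test_set = tagged_sents[:cut], tagged_sents[cut:]   # kept strictly separate

tagger = UnigramTagger(train_set)      # learn only from the training set
print(tagger.evaluate(test_set))       # evaluate only on held-out data
# (recent NLTK releases name this method .accuracy() instead of .evaluate())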
PoS word classes in English
Word classes, also called syntactic categories or
grammatical categories or Parts of Speech
closed class type: classes with fixed and few members,
function words e.g. prepositions;
open class type: large class of members, many new
additions, content words e.g. nouns
8 major word classes: nouns, verbs, adjectives, adverbs,
prepositions, determiners, conjunctions, pronouns
These classes occur in English, and also in most (perhaps all) natural languages
What properties define “noun”?
Semantic properties: refer to people, places and things
Distributional properties: ability to occur next to
determiners, possessives, adjectives (specific locations)
Morphological properties: most occur in singular and plural
These are properties of a word TYPE,
eg “man” is a noun (usually)
Sometimes a given TOKEN may not meet all these criteria …
The men are happy … the man is happy …
They man the lifeboat (? – here “man” is used as a verb)
Subcategories
Noun
Proper Noun v Common Noun
(Mass noun v Count Noun)
singular v plural
Count v mass (often not covered in PoS-tagsets)
Some tag-sets may have other subcategories,
e.g. NNP = common noun with word-initial capital
(e.g. Englishman)
PoS-tagset Often encodes morphological categories like
person, number, gender, tense, case . . .
Verb: action or process
VB present/infinitive teach, eat
VBZ 3rd-person-singular present (s-form) teaches, eats
VBG progressive (ing-form) teaching, eating
VBD past tense / VBN past participle taught, ate/eaten
Intransitive he died, transitive she killed him, …
(transitivity usually not marked in PoS-tags)
Auxiliaries: Modal verb e.g. can, must, may
Have, be, do can be auxiliary or main verbs
e.g. I have a present v I have given you a present
Adjective: quality or property
(of a thing: noun phrase)
English is simple:
JJ big, JJR comparative bigger, JJT superlative biggest
More features in other languages, eg
Agreement (number, gender) with noun
Before a noun v after “be”
Adverb: quality or property of verb
or adjective (or other functions…)
A hodge-podge (!)
General adverb: often ends in -ly, e.g. slowly, happily (but NOT early)
Place adverb home, downhill
Time adverb now, tomorrow
Degree adverbs very, extremely, somewhat
Function words
Preposition e.g. in, of, on, for, over, with (to)
Determiner e.g. this, that; articles: the, a
Conjunction e.g. and, or, but, because, that
Pronoun e.g. personal pronouns
I we (1st person),
you (2nd person),
he she it they (3rd person)
Possessive pronouns my, your, our, their
WH-pronouns what who whoever
Others: negatives (not), interjections (oh), existential there, …
Parts of “multi word expressions”
Particle – like preposition but “part of” a phrasal verb
I looked up her address v I looked up her skirt
I looked her address up v *I looked her skirt up
Big problem for PoS-tagging: common, and ambiguous
Other multi-word idioms: ditto tags
Bigram Markov Model tagger
Naive Method
1. Get all possible tag sequences for the sentence
2. Compute the probability of each tag sequence given the
sentence, using word-tag and tag-bigram probabilities
3. Take the tag sequence with the maximum probability
Problem: This method has exponential complexity!
Solution: Viterbi Algorithm (not discussed in this module)
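A toy sketch of the naive method (not Viterbi) is given below; the probabilities are invented purely for illustration, where a real tagger would estimate them from a tagged training corpus:

from itertools import product

# p(word | tag) – made-up illustrative values
p_word_tag = {
    ('the', 'AT'): 0.5,
    ('bear', 'NN'): 0.001, ('bear', 'VB'): 0.0002,
    ('is', 'BEZ'): 0.9,
    ('on', 'IN'): 0.1,
    ('move', 'NN'): 0.002, ('move', 'VB'): 0.003,
}
# p(tag_i | tag_i-1) – PERIOD stands for the sentence boundary
p_tag_tag = {
    ('AT', 'PERIOD'): 0.3, ('NN', 'AT'): 0.6, ('VB', 'AT'): 0.01,
    ('BEZ', 'NN'): 0.1, ('BEZ', 'VB'): 0.05,
    ('IN', 'BEZ'): 0.2, ('AT', 'IN'): 0.4,
}

words = ['the', 'bear', 'is', 'on', 'the', 'move']
candidate_tags = [['AT'], ['NN', 'VB'], ['BEZ'], ['IN'], ['AT'], ['NN', 'VB']]

best_seq, best_p = None, 0.0
for seq in product(*candidate_tags):          # enumerating all sequences: exponential!
    p, prev = 1.0, 'PERIOD'
    for word, tag in zip(words, seq):
        p *= p_word_tag.get((word, tag), 0.0) * p_tag_tag.get((tag, prev), 0.0)
        prev = tag
    if p > best_p:
        best_seq, best_p = seq, p
print(best_seq, best_p)                       # take the maximum-probability sequence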
N-gram tagger
Uses the preceding N-1 predicted tags
Also uses the unigram estimate for the current word
Example
p(AT NN BEZ IN AT NN | the bear is on the move)
  = p(the|AT)·p(AT|PERIOD) × p(bear|NN)·p(NN|AT) × ... × p(move|NN)·p(NN|AT)
p(AT NN BEZ IN AT VB | the bear is on the move)
  = p(the|AT)·p(AT|PERIOD) × p(bear|NN)·p(NN|AT) × ... × p(move|VB)·p(VB|AT)
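In NLTK, one way to build such a bigram tagger is sketched below (the Brown corpus slice and the unigram backoff are assumptions of this sketch, not prescribed by the slides):

import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')[:3000]
bigram_tagger = nltk.BigramTagger(train_sents,
                                  backoff=nltk.UnigramTagger(train_sents))
print(bigram_tagger.tag("The bear is on the move".split()))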
Bigram tagger: problems
Unknown words in new input
Parameter estimation: need a tagged training text; what if this
is a different genre/dialect/language-type from the new input?
Tokenization of training text and new input:
contractions (isn’t), multi-word tokens (New York)
Crude assumptions:
only very short-distance dependencies
tags are not conditioned on previous words
Unintuitive
Transformation-based tagging
Markov model tagging: small range of regularities only
TB tagging first used by Brill, 1995
Encodes more complex interdependencies between words
and tags
by learning intuitive rules from a training corpus
exploits linguistic knowledge; rules can be tuned manually
Transformation Templates
Templates specify general, admissible transformations:
Change Tag1 to Tag2 if
The preceding (following) word is tagged Tag3
The word two before (after) is tagged Tag3
One of the two preceding (following) words is tagged Tag3
One of the three preceding (following) words is tagged Tag3
The preceding word is tagged Tag3
and the following word is tagged Tag4
The preceding (following) word is tagged Tag3
and the word two before (after) is tagged Tag4
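For illustration, the templates above could be expressed with NLTK's Brill machinery roughly as follows; this particular template list is an assumption chosen to mirror the slide, not Brill's original 1995 set:

from nltk.tag.brill import Template, Pos

templates = [
    Template(Pos([-1])),            # preceding word is tagged Tag3
    Template(Pos([1])),             # following word is tagged Tag3
    Template(Pos([-2])),            # word two before is tagged Tag3
    Template(Pos([-2, -1])),        # one of the two preceding words is tagged Tag3
    Template(Pos([-3, -2, -1])),    # one of the three preceding words is tagged Tag3
    Template(Pos([-1]), Pos([1])),  # preceding word tagged Tag3 AND following word tagged Tag4
]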
Machine Learning Algorithm
Learns rules from a tagged training corpus by specialising the templates
1. Assume you do not know the precise tagging sequence in your training
corpus
2. Tag each word in the training corpus with its most frequent tag, e.g.
move => VB
3. Consider all possible transformations and apply the one that
improves the tagging most (greedy search),
e.g. Change VB to NN if the preceding word is tagged AT
4. Retag whole corpus applying that rule
5. Go back to 3 and repeat until no significant improvements are reached
6. Output all the rules you learnt in order!
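A hedged NLTK sketch of steps 2-6 (the corpus slice, the two templates and max_rules=10 are illustrative choices):

import nltk
from nltk.corpus import brown
from nltk.tag.brill import Template, Pos
from nltk.tag.brill_trainer import BrillTaggerTrainer

train_sents = brown.tagged_sents(categories='news')[:2000]

# Step 2: initial tagger = most frequent tag per word (NN for unknown words)
initial = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))

# Templates: change Tag1 to Tag2 if the preceding / following word is tagged Tag3
templates = [Template(Pos([-1])), Template(Pos([1]))]

# Steps 3-6: greedily pick the best transformation, retag, repeat, keep the rules in order
trainer = BrillTaggerTrainer(initial, templates, trace=0)
brill_tagger = trainer.train(train_sents, max_rules=10)
for rule in brill_tagger.rules():
    print(rule)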
Example: 1st cycle
First approximation: Initialise with most frequent tag (lexical information)
The/AT bear/VB is/BEZ on/IN the/AT move/VB to/TO race/NN there/RN
Try all possible transformations, choose the most useful one and apply it:
Change VB to NN if previous tag is AT
The/AT bear/NN is/BEZ on/IN the/AT move/NN to/TO race/NN there/RN
Try all possible transformations, choose the most useful one and apply it:
Change NN to VB if previous tag is TO
The/AT bear/NN is/BEZ on/IN the/AT move/NN to/TO race/VB there/RN
Final set of learnt rules
Brill rules corresponding to syntagmatic patterns
1. Change VB to NN if previous tag is AT
2. Change NN to VB if previous tag is TO
Can now be applied to an untagged corpus!
uses pre-encoded linguistic knowledge explicitly
uses wider context + following context
can be expanded to word-driven templates
can be expanded to morphology-driven templates (for
unknown words)
learnt rules are intuitive, easy to understand
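A plain-Python sketch of applying the two learnt rules, in order, to the most-frequent-tag initialisation from the worked example (the helper apply_rule is hypothetical, written only for this illustration):

def apply_rule(tagged, from_tag, to_tag, prev_tag):
    # Change from_tag to to_tag wherever the previous word's tag is prev_tag
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sentence = [('The', 'AT'), ('bear', 'VB'), ('is', 'BEZ'), ('on', 'IN'),
            ('the', 'AT'), ('move', 'VB'), ('to', 'TO'), ('race', 'NN'), ('there', 'RN')]

sentence = apply_rule(sentence, 'VB', 'NN', 'AT')   # rule 1
sentence = apply_rule(sentence, 'NN', 'VB', 'TO')   # rule 2
print(sentence)   # bear/NN, move/NN, race/VB – as in the worked example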
Combining taggers
Can be combined via backoff: if first tagger finds no tag
(None) then try another tagger
This really only makes sense with N-gram taggers:
if the trigram tagger finds no tag, back off to the bigram tagger;
if the bigram tagger fails, back off to the unigram tagger (see the sketch below)
Better: combine tagger results by a voting system
Combinatory Hybrid Elementary Analysis of Text
(combines results of morphological analysers / taggers)
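A minimal NLTK sketch of the backoff chain described above (trigram -> bigram -> unigram -> default; the corpus slice and the default tag NN are illustrative assumptions):

import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')[:3000]

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

print(t3.tag("The bear is on the move".split()))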