Combining Lexical and Syntactic Features for
Supervised Word Sense Disambiguation
Master's Thesis: Saif Mohammad
Advisor: Dr. Ted Pedersen
University of Minnesota, Duluth
Date: August 1, 2003
1
Path Map
Introduction
Background
Data
Experiments
Conclusions
2
Word Sense Disambiguation
Harry cast a bewitching spell
Humans immediately understand spell to mean a
charm or incantation,
not reading out letter by letter, nor a period of time
Words with multiple senses – polysemy, ambiguity
Utilize background knowledge and context
Machines lack background knowledge
Automatically identifying the intended sense of a word in
written text, based on its context, remains a hard problem
Features are identified from the context
Best accuracies at the latest international evaluation exercise: around 65%
3
Why do we need WSD?
Information Retrieval
Query: cricket bat
Documents pertaining to the insect and the mammal, irrelevant
Machine Translation
Consider English to Hindi translation
head to sar (upper part of the body) or adhyaksh (leader)
Human-machine interaction
Instructions to machines
Interactive home system: turn on the lights
Domestic Android: get the door
Applications are widespread and will affect our way of life
4
Terminology
Harry cast a bewitching spell
Target word – the word whose intended sense is to
be identified: spell
Context – the sentence housing the target word and
possibly 1 or 2 sentences around it:
Harry cast a bewitching spell
Instance – the target word along with its context
WSD is a classification problem wherein the occurrence of the
target word is assigned to one of its many possible senses
5
Corpus-Based Supervised Machine Learning
A computer program is said to learn from experience … if its
performance at tasks … improves with experience
- Mitchell
Task : Word Sense Disambiguation of given test instances
Performance : Ratio of instances correctly disambiguated
to the total test instances - accuracy
Experience : Manually created instances such that target
words are marked with intended sense – training instances
Harry cast a bewitching spell / incantation
6
Path Map
Introduction
Background
Data
Experiments
Conclusions
7
Decision Trees
A kind of classifier
Assigns a class by asking a series of questions
Questions correspond to features of the instance
Question asked depends on answer to previous question
Inverted tree structure
Interconnected nodes
The topmost node is called the root
Each node corresponds to a question / feature
Each possible value of feature has corresponding branch
Leaves terminate every path from root
Each leaf is associated with a class
8
Automating Toy Selection for Max
[Figure: toy-selection decision tree – the root node asks "Moving Parts?"; internal nodes ask "Car?", "Size?" and "Color?"; the leaves assign the classes LOVE, SO SO and HATE]
9
WSD Tree
[Figure: example WSD decision tree – internal nodes test binary features (Feature 1 to Feature 4); each leaf assigns one of Sense 1 to Sense 4]
10
Issues…
Why use decision trees for WSD?
How are decision trees learnt?
ID3 and C4.5 algorithms
What is bagging and what are its advantages?
Drawbacks of decision trees and bagging
Pedersen [2002]: choosing the right features is of
greater significance than the learning algorithm itself
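As a rough illustration of the learning step (not the thesis' own code), the sketch below trains a tiny decision tree over invented feature dictionaries and sense labels; scikit-learn's CART learner stands in for the ID3/C4.5-style learners named above.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Invented training instances: a bag of features (unigrams, POS around the
# target) paired with a sense label.
train = [
    ({"unigram=bewitching": 1, "P1=NN": 1}, "incantation"),
    ({"unigram=letter": 1, "P1=VB": 1}, "read_out"),
    ({"unigram=bewitching": 1, "P1=VB": 1}, "incantation"),
    ({"unigram=weeks": 1, "P1=NN": 1}, "period_of_time"),
]
X_dicts, y = zip(*train)

vec = DictVectorizer()                 # feature dicts -> sparse binary vectors
X = vec.fit_transform(X_dicts)

tree = DecisionTreeClassifier().fit(X, y)

test = vec.transform([{"unigram=bewitching": 1, "P1=NN": 1}])
print(tree.predict(test))              # -> ['incantation']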
11
Lexical Features
Surface form
A word we observe in text
case (n)
1. object of investigation, 2. frame or covering, 3. a weird person
Surface forms : case, cases, casing
An occurrence of casing suggests sense 2
Unigrams and Bigrams
One word and two word sequences in text
The interest rate is low
Unigrams: the, interest, rate, is, low
Bigrams: the interest, interest rate, rate is, is low
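A minimal sketch of collecting these unigram and bigram features, assuming only whitespace tokenization:

# Extract unigram and bigram features from the example context.
tokens = "the interest rate is low".split()

unigrams = set(tokens)
bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}

print(sorted(unigrams))   # ['interest', 'is', 'low', 'rate', 'the']
print(sorted(bigrams))    # ['interest rate', 'is low', 'rate is', 'the interest']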
12
Part of Speech Tagging
Prerequisite for many natural language tasks
Parsing, WSD, Anaphora resolution
Brill Tagger – most widely used tool
Accuracy around 95%
Source code available
Easily understood rules
Harry/NNP cast/VBD a/DT bewitching/JJ spell/NN
NNP proper noun, VBD verb past, DT determiner, NN noun
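For illustration only, NLTK's default tagger (not the Brill Tagger used in the thesis) emits the same Penn Treebank tag set:

import nltk
# nltk.download("averaged_perceptron_tagger")   # one-time model download

tokens = "Harry cast a bewitching spell".split()
print(nltk.pos_tag(tokens))
# Expect something close to the slide's tagging:
# [('Harry', 'NNP'), ('cast', 'VBD'), ('a', 'DT'), ('bewitching', 'JJ'), ('spell', 'NN')]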
13
Pre-Tagging
Pre-tagging is the act of manually assigning tags to
selected words in a text prior to tagging
Mona will sit in the pretty chair//NN this time
chair is the pre-tagged word, NN is its pre-tag
Reliable anchors or seeds around which tagging is done
Brill Tagger facilitates pre-tagging
Pre-tag not always respected!
Mona/NNP will/MD sit/VB in/IN the/DT
pretty/RB chair//VB this/DT time/NN
14
Contextual Rules
Initial state tagger – assigns most frequent tag for a type
based on entries in a Lexicon (pre-tag respected)
Final state tagger – may modify tag of word based on
context (pre-tag not given special treatment)
Relevant Lexicon Entries

Type     Most frequent tag   Other possible tags
chair    NN (noun)           VB (verb)
pretty   RB (adverb)         JJ (adjective)

Relevant Contextual Rules

Current Tag   When         New Tag
NN            NEXTTAG DT   VB
RB            NEXTTAG NN   JJ
15
Guaranteed Pre-Tagging
A patch to the tagger is provided – BrillPatch
Application of contextual rules to pre-tagged words is
bypassed
Application of contextual rules to non pre-tagged words is
unchanged
Mona/NNP will/MD sit/VB in/IN the/DT
pretty/JJ chair//NN this/DT time/NN
Tag of chair retained as NN
Contextual rule to change tag of chair from NN to VB not applied
Tag of pretty transformed
Contextual rule to change tag of pretty from RB to JJ applied
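A small sketch of the idea, using the two contextual rules from the previous slide and a simplified skip rule that stands in for the real Brill Tagger and BrillPatch; it reproduces both outputs shown above.

# Each rule: (current tag, (condition, value), new tag); only NEXTTAG is modelled.
RULES = [("NN", ("NEXTTAG", "DT"), "VB"),   # chair/NN -> VB when the next tag is DT
         ("RB", ("NEXTTAG", "NN"), "JJ")]   # pretty/RB -> JJ when the next tag is NN

def apply_contextual_rules(tagged, pretagged=frozenset()):
    """Apply each rule in turn over the whole sentence, Brill-style.
    Positions listed in `pretagged` are skipped, mimicking Guaranteed Pre-Tagging."""
    tagged = list(tagged)
    for cur, (cond, value), new in RULES:
        for i, (word, tag) in enumerate(tagged):
            if i in pretagged:
                continue
            if (tag == cur and cond == "NEXTTAG"
                    and i + 1 < len(tagged) and tagged[i + 1][1] == value):
                tagged[i] = (word, new)
    return tagged

# Initial-state tags, with the pre-tag NN already placed on "chair" (position 6).
sent = [("Mona", "NNP"), ("will", "MD"), ("sit", "VB"), ("in", "IN"), ("the", "DT"),
        ("pretty", "RB"), ("chair", "NN"), ("this", "DT"), ("time", "NN")]

print(apply_contextual_rules(sent))                  # pretty stays RB, chair becomes VB
print(apply_contextual_rules(sent, pretagged={6}))   # pretty becomes JJ, chair keeps NN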
16
Part of Speech Features
A word in different parts of speech has different senses
A word used in different senses is likely to have different
sets of POS tags around it
Why did Jack turn/VB against/IN his/PRP$ team/NN
Why did Jack turn/VB left/VBN at/IN the/DT crossing
Features used
Individual word POS: P-2, P-1, P0, P1, P2*
Sequential POS: P-1P0, P-1P0 P1, and so on
P2 = JJ implies P2 is an adjective
P-1P0 = NN, VB implies P-1 is a noun and P0 is a verb
A combination of the above
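A minimal sketch of building the individual (P-2 to P2) and sequential POS features from an already-tagged sentence; the tags and indices are illustrative only.

tagged = [("Why", "WRB"), ("did", "VBD"), ("Jack", "NNP"),
          ("turn", "VB"), ("against", "IN"), ("his", "PRP$"), ("team", "NN")]
target = 3                                    # index of the target word "turn"

def pos(offset):
    """POS tag at the given offset from the target, or None if out of range."""
    i = target + offset
    return tagged[i][1] if 0 <= i < len(tagged) else None

features = {f"P{k}": pos(k) for k in range(-2, 3)}    # individual tags P-2 .. P2
features["P-1P0"] = (pos(-1), pos(0))                 # sequential two-tag feature
features["P-1P0P1"] = (pos(-1), pos(0), pos(1))       # sequential three-tag feature

print(features["P1"])       # 'IN'
print(features["P-1P0"])    # ('NNP', 'VB')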
17
Parse Features
Collins Parser used to parse the data
Source code available
Uses part of speech tagged data as input
Head word of a phrase (the hard work, the hard surface)
Phrase itself: noun phrase, verb phrase and so on
Parent: head word of the parent phrase (fasten the line, cross the line)
Parent phrase
18
Sample Parse Tree
(SENTENCE (NOUN PHRASE Harry/NNP)
          (VERB PHRASE cast/VBD
                       (NOUN PHRASE a/DT bewitching/JJ spell/NN)))
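As an illustration, the phrase, parent phrase and a head word can be read off a bracketed parse like the one above; nltk's Tree utilities and a crude rightmost-noun head rule stand in here for the Collins head-finding rules actually used.

from nltk import Tree

# The example parse, with labels abbreviated to Penn-style S/NP/VP.
parse = Tree.fromstring(
    "(S (NP (NNP Harry)) (VP (VBD cast) (NP (DT a) (JJ bewitching) (NN spell))))")
target = "spell"

# Deepest phrase (height > 2, i.e. not a preterminal) whose leaves contain the target.
phrase_pos = max(
    (p for p in parse.treepositions()
     if isinstance(parse[p], Tree) and parse[p].height() > 2
     and target in parse[p].leaves()),
    key=len)
phrase = parse[phrase_pos]
parent = parse[phrase_pos[:-1]] if phrase_pos else None

# Crude head rule: rightmost noun in the phrase (an assumption, not the Collins rules).
head = [w for w, t in phrase.pos() if t.startswith("NN")][-1]

print(phrase.label(), head)                   # NP spell
print(parent.label() if parent else None)     # VP (the parent phrase)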
19
Path Map
Introduction
Background
Data
Experiments
Conclusions
20
Sense-Tagged Data
Senseval-2 data
4,328 test instances and 8,611 training instances, ranging over 73 different
nouns, verbs and adjectives
Senseval-1 data
8,512 test instances and 13,276 training instances, ranging over 35
nouns, verbs and adjectives
line, hard, serve and interest data
4,149; 4,337; 4,378 and 2,476 sense-tagged instances with line,
hard, serve and interest as the head words
Around 50,000 sense-tagged instances in all !
21
Data Processing
Packages to convert line, hard, serve and interest data to
Senseval-1 and Senseval-2 data formats
refine preprocesses data in Senseval-2 data format to make it
suitable for tagging
Restores one sentence per line and one line per sentence, pre-tags
the target words, splits long sentences
posSenseval part of speech tags any data in Senseval-2 data
format
Brill Tagger along with Guaranteed Pre-tagging utilized
parseSenseval parses data in a format as output by the Brill
Tagger
Restores xml tags, creating a parsed file in Senseval-2 data format
Uses the Collins Parser
22
Sample line data instance
Original instance:
art} aphb 01301041:
" There's none there . " He hurried outside to see if there were
any dry ones on the line .
Senseval-2 data format:
<instance id="line-n.art} aphb 01301041:">
<answer instance="line-n.art} aphb 01301041:" senseid="cord"/>
<context>
<s> " There's none there . " </s> <s> He hurried outside to see
if there were any dry ones on the <head>line</head> . </s>
</context>
</instance>
23
Sample Output from parseSenseval
<instance id="harry">
<answer instance="harry" senseid="incantation"/>
<context>
Harry cast a bewitching <head>spell</head>
</context>
</instance>
<instance id="harry">
<answer instance="harry" senseid="incantation"/>
<context>
<P="TOP~cast~1~1"> <P="S~cast~2~2"> <P="NPB~Potter~2~2"> Harry
<p="NNP"/> <P="VP~cast~2~1"> cast <p="VB"/> <P="NPB~spell~3~3">
a <p="DT"/> bewitching <p="JJ"/> spell <p="NN"/> </P> </P> </P> </P>
</context>
</instance>
24
Issues…
How is the target word identified in the line, hard and
serve data?
How is the data tokenized for better quality POS
tagging and parsing?
How is the data pre-tagged?
How is the parse output of the Collins Parser interpreted?
How is the parsed output XML'ized and brought back
to Senseval-2 data format?
Idiosyncrasies of the line, hard, serve, interest, Senseval-1 and Senseval-2 data and how they are handled
25
Path Map
Introduction
Background
Data
Experiments
Conclusions
26
Surface Forms Senseval-1 & Senseval-2
               Senseval-2   Senseval-1
Majority         47.7%        56.3%
Surface Form     49.3%        62.9%
Unigrams         55.3%        66.9%
Bigrams          55.1%        66.9%
27
Individual Word POS (Senseval-1)
           All      Nouns    Verbs    Adj.
Majority   56.3%    57.2%    56.9%    64.3%
P-2        57.5%    58.2%    58.6%    64.0%
P-1        59.2%    62.2%    58.2%    64.3%
P0         60.3%    62.5%    58.2%    64.3%
P1         63.9%    65.4%    64.4%    66.2%
P2         59.9%    60.0%    60.8%    65.2%
28
Individual Word POS (Senseval-2)
           All      Nouns    Verbs    Adj.
Majority   47.7%    51.0%    39.7%    59.0%
P-2        47.1%    51.9%    38.0%    57.9%
P-1        49.6%    55.2%    40.2%    59.0%
P0         49.9%    55.7%    40.6%    58.2%
P1         53.1%    53.8%    49.1%    61.0%
P2         48.9%    50.2%    43.2%    59.4%
29
Combining POS Features
                         Senseval-2   Senseval-1   line
Majority                   47.7%        56.3%      54.3%
P0, P1                     54.3%        66.7%      54.1%
P-1, P0, P1                54.6%        68.0%      60.4%
P-2, P-1, P0, P1, P2       54.6%        67.8%      62.3%
30
Effect of Guaranteed Pre-tagging on WSD
                        Senseval-1            Senseval-2
                        Guar. P.   Reg. P.    Guar. P.   Reg. P.
P-1, P0                 62.2%      62.1%      50.8%      50.9%
P0, P1                  66.7%      66.7%      54.3%      53.8%
P-1, P0, P1             68.0%      67.6%      54.6%      54.7%
P-1P0, P0P1             66.7%      66.3%      54.0%      53.7%
P-2, P-1, P0, P1, P2    67.8%      66.1%      54.6%      54.1%
31
Parse Features (Senseval-1)
           All      Nouns    Verbs    Adj.
Majority   56.3%    57.2%    56.9%    64.3%
Head       64.3%    70.9%    59.8%    66.9%
Parent     60.6%    62.6%    60.3%    65.8%
Phrase     58.5%    57.5%    57.2%    66.2%
Par. Phr.  57.9%    58.1%    58.3%    66.2%
32
Parse Features (Senseval-2)
           All      Nouns    Verbs    Adj.
Majority   47.7%    51.0%    39.7%    59.0%
Head       51.7%    58.5%    39.8%    64.0%
Parent     50.0%    56.1%    40.1%    59.3%
Phrase     48.3%    51.7%    40.3%    59.5%
Par. Phr.  48.5%    53.0%    39.1%    60.3%
33
Thoughts…
Both lexical and syntactic features perform
comparably
But do they get the same instances right?
How redundant are the individual feature sets?
Are there instances correctly disambiguated by one
feature set and not by the other?
How complementary are the individual feature sets?
Is the effort to combine lexical and syntactic
features justified?
34
Measures
Baseline Ensemble: accuracy of a hypothetical ensemble
which predicts the sense correctly only if both individual
feature sets do so
Quantifies redundancy amongst feature sets
Optimal Ensemble: accuracy of a hypothetical ensemble
which predicts the sense correctly if either of the individual
feature sets do so
Difference with individual accuracies quantifies complementarity
We used a simple ensemble which sums the
probabilities assigned to each sense by the individual feature
sets to decide the intended sense
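A minimal sketch of the three measures, assuming each feature set yields a per-instance probability distribution over senses (all numbers below are invented):

# Invented per-instance sense distributions for two feature sets, plus gold senses.
lexical   = [{"cord": 0.9, "queue": 0.1}, {"cord": 0.6, "queue": 0.4},
             {"cord": 0.7, "queue": 0.3}, {"cord": 0.2, "queue": 0.8}]
syntactic = [{"cord": 0.8, "queue": 0.2}, {"cord": 0.3, "queue": 0.7},
             {"cord": 0.1, "queue": 0.9}, {"cord": 0.6, "queue": 0.4}]
gold = ["cord", "cord", "queue", "queue"]

def best(dist):
    return max(dist, key=dist.get)

lex_ok = [best(d) == g for d, g in zip(lexical, gold)]
syn_ok = [best(d) == g for d, g in zip(syntactic, gold)]

baseline = sum(a and b for a, b in zip(lex_ok, syn_ok)) / len(gold)  # both sets right
optimal  = sum(a or b for a, b in zip(lex_ok, syn_ok)) / len(gold)   # either set right

# Simple ensemble: sum the two distributions per instance and pick the best sense.
ens_ok = [best({s: d1[s] + d2[s] for s in d1}) == g
          for d1, d2, g in zip(lexical, syntactic, gold)]
ensemble = sum(ens_ok) / len(gold)

print(baseline, ensemble, optimal)    # 0.25 0.75 1.0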
35
Best Combinations
Data       Set 1 (acc.)       Set 2 (acc.)           Base    Maj.    Ens.    Opt.
Sval-2     Unigrams (55.3%)   P-1,P0,P1 (55.3%)      43.6%   47.7%   57.0%   67.9%
Sval-1     Unigrams (66.9%)   P-1,P0,P1 (68.0%)      57.6%   56.3%   71.1%   78.0%
line       Unigrams (74.5%)   P-1,P0,P1 (60.4%)      55.1%   54.3%   74.2%   82.0%
hard       Bigrams (89.5%)    Head, Par (86.1%)      87.7%   81.5%   88.9%   91.3%
serve      Unigrams (73.3%)   P-1,P0,P1 (73.0%)      58.4%   42.2%   81.6%   89.9%
interest   Bigrams (79.9%)    P-1,P0,P1 (78.8%)      67.6%   54.9%   83.2%   90.1%
36
Path Map
Introduction
Background
Data
Experiments
Conclusions
37
Conclusions
Significant amount of complementarity across lexical
and syntactic features
Combination of the two justified
Part of speech of the word immediately to the right of the
target word found most useful
POS of words immediately to the right of the target word best for
verbs and adjectives
Nouns helped by tags on either side
Head word of the phrase particularly useful for adjectives
Nouns helped by both head and parent
38
Other Contributions
Converted the line, hard, serve and interest data into
Senseval-2 data format
Part of speech tagged and parsed the Senseval-2,
Senseval-1, line, hard, serve and interest data
Developed the Guaranteed Pre-tagging mechanism to
improve the quality of POS tagging
Showed that guaranteed pre-tagging improves WSD
39
Code, Data, Resources and Publication
posSenseval : part of speech tags any data in Senseval-2 data format
parseSenseval : parses data in a format as output by the Brill Tagger.
Output is in Senseval-2 data format with part of speech and parse
information as xml tags.
Packages to convert line, hard, serve and interest data to Senseval-1
and Senseval-2 data formats
BrillPatch : Patch to Brill Tagger to employ Guaranteed Pre-Tagging
http://www.d.umn.edu/~tpederse/data.html
Brill Tagger: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
Collins Parser: http://www.ai.mit.edu/people/mcollins
"Guaranteed Pre-Tagging for the Brill Tagger", Mohammad and
Pedersen, Fourth International Conference on Intelligent Text Processing
and Computational Linguistics (CICLing 2003), February 2003, Mexico
40
Thank You
41