Effect of linguistically Motivated Features in Word Sense

Combining Lexical and Syntactic Features for
Supervised Word Sense Disambiguation
Saif Mohammad, University of Toronto, http://www.cs.toronto.edu/~smm
Ted Pedersen, University of Minnesota, http://www.d.umn.edu/~tpederse
1
Word Sense Disambiguation
Harry cast a bewitching spell

Humans immediately understand spell to mean a
charm or incantation.

Why not "reading out letter by letter" or "a period of time"?



Words with multiple senses – polysemy, ambiguity!
Utilize background knowledge and context.
Machines lack background knowledge.


Automatically identifying the intended sense of a word in
written text, based on its context, remains a hard problem.
Best accuracies in a recent international evaluation were around 65%.
2
Why do we need WSD?

Information Retrieval

Query: cricket bat
Documents pertaining to the insect and the mammal are irrelevant.

Machine Translation

Consider English to Hindi translation.
Should head be translated as sar (upper part of the body) or adhyaksh (leader)?

Machine-human interaction

Instructions to machines.


Interactive home system: turn on the lights
Domestic Android: get the door
Applications are widespread and will affect our way of life.
3
Terminology

Harry cast a bewitching spell

Target word – the word whose intended sense is to
be identified.
spell

Context – the sentence housing the target word and
possibly 1 or 2 sentences around it.
Harry cast a bewitching spell

Instance – target word along with its context.
WSD is a classification problem wherein the occurrence of the
target word is assigned to one of its many possible senses.
4
Corpus-Based Supervised Machine Learning
A computer program is said to learn from experience … if its
performance at tasks … improves with experience.
- Mitchell

Task : Word Sense Disambiguation of given test instances.

Performance : Ratio of instances correctly disambiguated
to the total test instances – accuracy.

Experience : Manually created instances such that target
words are marked with intended sense – training
instances.
Harry cast a bewitching spell / incantation
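
The performance measure defined on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `accuracy` and `majority_baseline` and the toy sense labels are assumptions made for the example. The majority baseline, which always predicts the most frequent training sense, corresponds to the "Majority" figures reported in the result tables later in the talk.

```python
from collections import Counter

def accuracy(predicted, gold):
    """Ratio of correctly disambiguated instances to all test instances."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

def majority_baseline(training_senses, n_test):
    """Predict the most frequent training sense for every test instance."""
    most_common_sense, _ = Counter(training_senses).most_common(1)[0]
    return [most_common_sense] * n_test

# Toy sense labels for the target word "spell" (hypothetical data).
train = ["incantation", "incantation", "period", "incantation", "read_out"]
gold = ["incantation", "period", "incantation", "read_out"]

baseline_predictions = majority_baseline(train, len(gold))
print(accuracy(baseline_predictions, gold))  # 0.5
```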
5
Decision Trees

A kind of classifier.




Assigns a class by asking a series of questions.
Questions correspond to features of the instance.
Question asked depends on answer to previous question.
Inverted tree structure.

Interconnected nodes.

Top most node is called the root.

Each node corresponds to a question / feature.
Each possible value of feature has corresponding branch.

Leaves terminate every path from root.


Each leaf is associated with a class.
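
As a rough sketch of the structure described above, a decision tree can be written as nested nodes and classified by walking from the root to a leaf. The feature names (`contains_cast`, `contains_hour`) and sense labels are hypothetical and chosen only for illustration; they are not features from the experiments.

```python
# A node is either a sense label (leaf) or a (feature, {value: subtree}) pair.
# Feature names and senses are hypothetical, for illustration only.
TREE = ("contains_cast", {
    1: "incantation",
    0: ("contains_hour", {
        1: "period_of_time",
        0: "read_out",
    }),
})

def classify(tree, features):
    """Walk from the root to a leaf, asking one feature question per node."""
    node = tree
    while not isinstance(node, str):
        feature, branches = node
        # The answer to this question decides which question is asked next.
        node = branches[features[feature]]
    return node

instance = {"contains_cast": 0, "contains_hour": 1}
print(classify(TREE, instance))  # period_of_time
```

Note that the second question is only reached when the answer to the first is 0, which is what "question asked depends on answer to previous question" means in practice.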
6
WSD Tree
[Figure: example decision tree for WSD. The root node asks Feature 1; internal nodes ask Features 2, 3 and 4; branches are labeled 0 and 1; leaves assign SENSE 1 through SENSE 4.]
7
Choice of Learning Algorithm

Why use decision trees for WSD?
They have drawbacks, such as training data fragmentation.

What about other learning algorithms such as neural
networks?

Context is a rich source of discrete features.

The learned model is likely to be meaningful.

It may provide insight into the interaction of features.
Pedersen [2001]*: Choosing the right features is of
greater significance than the learning algorithm itself.
* T. Pedersen, "A Decision Tree of Bigrams is an Accurate Predictor of Word Sense", In Proceedings of the
Second Meeting of the North American Chapter of the Association for Computational Linguistics
(NAACL-01), June 2-7, 2001, Pittsburgh, PA.
8
Lexical Features

Surface form
A word we observe in text.

Case (n)
1. object of investigation 2. frame or covering 3. a weird person
Surface forms: case, cases, casing
An occurrence of casing suggests sense 2.
Unigrams and Bigrams

One word and two word sequences in text.

The interest rate is low
Unigrams: the, interest, rate, is, low
Bigrams: the interest, interest rate, rate is, is low
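
The unigram and bigram lists above can be reproduced with a short helper. This is a minimal sketch that assumes whitespace tokenization and lowercasing; the function name `ngrams` is introduced here for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: tokens are obtained by lowercasing and splitting on whitespace.
tokens = "The interest rate is low".lower().split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['the', 'interest', 'rate', 'is', 'low']
print(bigrams)   # ['the interest', 'interest rate', 'rate is', 'is low']
```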

9
Part of Speech Tagging

Brill Tagger – most widely used tool.




Accuracy around 95%.
Source code available.
Easily understood rules.
Pre-tagging is the act of manually assigning tags to
selected words in a text prior to tagging.


The Brill Tagger does not guarantee pre-tagging.
A patch to the tagger is provided – BrillPatch*.
* "Guaranteed Pre-Tagging for the Brill Tagger", Mohammad, S. and Pedersen, T., In Proceedings of the
Fourth International Conference of Intelligent Systems and Text Processing, February 2003, Mexico.
10
Part of Speech Features

A word used in different senses is likely to have
different sets of POS tags around it.
Why did Jack turn/VB against/IN his/PRP$ team/NN
Why did Jack turn/VB left/NN at/IN the/DT crossing

Features used

Individual word POS: P-2, P-1, P0, P1, P2


P1 = JJ implies that the word to the right of the target word is an
adjective.
A combination of the above.
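
One way to picture these features is as a small dictionary built from a tagged sentence. The sketch below is an illustration only: the function name `pos_features` and the `"NONE"` placeholder for positions outside the sentence are assumptions, and the tagged sentence is the running example "Harry cast a bewitching spell".

```python
def pos_features(tagged, target_index, window=2):
    """Map P-2..P2 to the POS tags around the target word.

    Positions that fall outside the sentence get the placeholder "NONE"
    (an assumption of this sketch).
    """
    features = {}
    for offset in range(-window, window + 1):
        i = target_index + offset
        features[f"P{offset}"] = tagged[i][1] if 0 <= i < len(tagged) else "NONE"
    return features

tagged = [("Harry", "NNP"), ("cast", "VBD"), ("a", "DT"),
          ("bewitching", "JJ"), ("spell", "NN")]

# Target word: "cast" (index 1).
features = pos_features(tagged, target_index=1)
print(features)
# {'P-2': 'NONE', 'P-1': 'NNP', 'P0': 'VBD', 'P1': 'DT', 'P2': 'JJ'}

# A combination such as (P0, P1) can be treated as a single feature.
combined = (features["P0"], features["P1"])
print(combined)  # ('VBD', 'DT')
```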
11
Parse Features

Collins Parser* used to parse the data.
Source code available.
Uses part of speech tagged data as input.

Head word of a phrase.
the hard work, the hard surface

Phrase itself: noun phrase, verb phrase and so on.

Parent: head word of the parent phrase.
fasten the line, cross the line

Parent phrase.
* http://www.ai.mit.edu/people/mcollins
12
Sample Parse Tree
(SENTENCE
  (NOUN PHRASE Harry/NNP)
  (VERB PHRASE cast/VBD
    (NOUN PHRASE a/DT bewitching/JJ spell/NN)))
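
The four parse features (head, phrase, parent, parent phrase) can be read off such a tree once each phrase carries its head word. The sketch below is only an illustration under stated assumptions: it assumes the usual parse of the example sentence, in which "a bewitching spell" is a noun phrase inside the verb phrase headed by "cast"; the head words are assigned by hand rather than computed by the Collins Parser; and the `Phrase` class and `parse_features` function are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Phrase:
    label: str   # e.g. "NOUN PHRASE"
    head: str    # head word, assigned by hand in this sketch
    words: list  # words directly under this phrase
    children: list = field(default_factory=list)  # nested Phrase nodes

def parse_features(root, target):
    """Return phrase, head, parent phrase and parent head for the target word."""
    def search(node, parent):
        if target in node.words:
            return {
                "phrase": node.label,
                "head": node.head,
                "parent_phrase": parent.label if parent else None,
                "parent": parent.head if parent else None,
            }
        for child in node.children:
            found = search(child, node)
            if found:
                return found
        return None
    return search(root, None)

object_np = Phrase("NOUN PHRASE", head="spell", words=["a", "bewitching", "spell"])
verb_phrase = Phrase("VERB PHRASE", head="cast", words=["cast"], children=[object_np])
subject_np = Phrase("NOUN PHRASE", head="Harry", words=["Harry"])
sentence = Phrase("SENTENCE", head="cast", words=[], children=[subject_np, verb_phrase])

result = parse_features(sentence, "spell")
print(result)
# {'phrase': 'NOUN PHRASE', 'head': 'spell', 'parent_phrase': 'VERB PHRASE', 'parent': 'cast'}
```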
13
Sense-Tagged Data

Senseval-2 data
4,328 instances of test data and 8,611 instances of training data
ranging over 73 different nouns, verbs and adjectives.

Senseval-1 data
8,512 test instances and 13,276 training instances, ranging over 35
nouns, verbs and adjectives.

line, hard, interest, serve data
4,149, 4,337, 4,378 and 2,476 sense-tagged instances with line, hard,
serve and interest as the head words.
Around 50,000 sense-tagged instances in all!
14
Experiments
15
Lexical: Senseval-1 & Senseval-2
               Sval-2   Sval-1   line     hard     serve    interest
Majority       47.7%    56.3%    54.3%    81.5%    42.2%    54.9%
Surface Form   49.3%    62.9%    54.3%    81.5%    44.2%    64.0%
Unigram        55.3%    66.9%    74.5%    83.4%    73.3%    75.7%
Bigram         55.1%    66.9%    72.9%    89.5%    72.1%    79.9%
16
Individual Word POS (Senseval-1)
           All      Nouns    Verbs    Adj.
Majority   56.3%    57.2%    56.9%    64.3%
P-2        57.5%    58.2%    58.6%    64.0%
P-1        59.2%    62.2%    58.2%    64.3%
P0         60.3%    62.5%    58.2%    64.3%
P1         63.9%    65.4%    64.4%    66.2%
P2         59.9%    60.0%    60.8%    65.2%
17
Individual Word POS (Senseval-2)
           All      Nouns    Verbs    Adj.
Majority   47.7%    51.0%    39.7%    59.0%
P-2        47.1%    51.9%    38.0%    57.9%
P-1        49.6%    55.2%    40.2%    59.0%
P0         49.9%    55.7%    40.6%    58.2%
P1         53.1%    53.8%    49.1%    61.0%
P2         48.9%    50.2%    43.2%    59.4%
18
Combining POS Features
                       Sval-2   Sval-1   line     hard     serve    interest
Majority               47.7%    56.3%    54.3%    81.5%    42.2%    54.9%
P0, P1                 54.3%    66.7%    54.1%    81.9%    60.2%    70.5%
P-1, P0, P1            54.6%    68.0%    60.4%    84.8%    73.0%    78.8%
P-2, P-1, P0, P1, P2   54.6%    67.8%    62.3%    86.2%    75.7%    80.6%
19
Parse Features (Senseval-1)
            All      Nouns    Verbs    Adj.
Majority    56.3%    57.2%    56.9%    64.3%
Head        64.3%    70.9%    59.8%    66.9%
Parent      60.6%    62.6%    60.3%    65.8%
Phrase      58.5%    57.5%    57.2%    66.2%
Par. Phr.   57.9%    58.1%    58.3%    66.2%
20
Parse Features (Senseval-2)
            All      Nouns    Verbs    Adj.
Majority    47.7%    51.0%    39.7%    59.0%
Head        51.7%    58.5%    39.8%    64.0%
Parent      50.0%    56.1%    40.1%    59.3%
Phrase      48.3%    51.7%    40.3%    59.5%
Par. Phr.   48.5%    53.0%    39.1%    60.3%
21
Thoughts…


Lexical and syntactic features perform
comparably.
But do they get the same instances right?


How redundant are the individual feature sets?
Are there instances correctly disambiguated by one
feature set and not by the other?

How complementary are the individual feature sets?
Is the effort to combine lexical and syntactic
features justified?
22
Measures

Baseline Ensemble: accuracy of a hypothetical ensemble
which predicts the sense correctly only if both individual
feature sets do so.


Quantifies redundancy amongst feature sets.
Optimal Ensemble: accuracy of a hypothetical ensemble
which predicts the sense correctly if either of the individual
feature sets does so.

Difference with individual accuracies quantifies complementarity.
We used a simple ensemble which sums the probabilities
assigned to each sense by the individual feature sets
to decide the intended sense.
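
The three measures can be made concrete with a short sketch. It is an illustration of the definitions on this slide, not the authors' implementation: the function names, the per-instance correctness flags and the sense probabilities are all hypothetical.

```python
def baseline_ensemble(correct_a, correct_b):
    """Fraction of instances that BOTH feature sets disambiguate correctly."""
    return sum(a and b for a, b in zip(correct_a, correct_b)) / len(correct_a)

def optimal_ensemble(correct_a, correct_b):
    """Fraction of instances that AT LEAST ONE feature set gets right."""
    return sum(a or b for a, b in zip(correct_a, correct_b)) / len(correct_a)

def simple_ensemble(probs_a, probs_b):
    """Pick the sense with the highest summed probability."""
    senses = set(probs_a) | set(probs_b)
    return max(senses, key=lambda s: probs_a.get(s, 0.0) + probs_b.get(s, 0.0))

# Hypothetical per-instance correctness flags for two feature sets.
lexical_correct = [True, True, False, True, False]
syntactic_correct = [True, False, True, True, False]
print(baseline_ensemble(lexical_correct, syntactic_correct))  # 0.4
print(optimal_ensemble(lexical_correct, syntactic_correct))   # 0.8

# Hypothetical sense distributions from the two classifiers for one instance.
lexical_probs = {"incantation": 0.5, "period": 0.4, "read_out": 0.1}
syntactic_probs = {"incantation": 0.2, "period": 0.7, "read_out": 0.1}
print(simple_ensemble(lexical_probs, syntactic_probs))  # period
```

In this toy data each feature set alone scores 0.6, so the gap up to the optimal ensemble (0.8) reflects complementarity, while the baseline ensemble (0.4) reflects how much the two sets agree on correct answers.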
23
Best Combinations
Data       Maj.    Set 1             Set 2                Base    Ens.    Opt.    Best
Sval-2     47.7%   Unigrams 55.3%    P-1, P0, P1 55.3%    43.6%   57.0%   67.9%   66.7%
Sval-1     56.3%   Unigrams 66.9%    P-1, P0, P1 68.0%    57.6%   71.1%   78.0%   81.1%
line       54.3%   Unigrams 74.5%    P-1, P0, P1 60.4%    55.1%   74.2%   82.0%   88.0%
hard       81.5%   Bigrams 89.5%     Head, Par 87.7%      86.1%   88.9%   91.3%   83.0%
serve      42.2%   Unigrams 73.3%    P-1, P0, P1 73.0%    58.4%   81.6%   89.9%   83.0%
interest   54.9%   Bigrams 79.9%     P-1, P0, P1 78.8%    67.6%   83.2%   90.1%   89.0%
24
Conclusions

Significant amount of complementarity across lexical
and syntactic features.

Combination of the two justified.

We show that simple lexical and part of speech
features can achieve state of the art results.

How best to capitalize on the complementarity is still an
open issue.
25
Conclusions (continued)

Part of speech of the word immediately to the right of the
target word found most useful.

POS of the word immediately to the right of the target word is best for
verbs and adjectives.
Nouns helped by tags on either side.
(P0, P1) found to be most potent when training data per
target word is small (Sval data).
Larger POS context size (P-2, P-1, P0, P1, P2) shown to be
beneficial when training data per target word is large (line, hard,
serve and interest data).

Head word of phrase particularly useful for adjectives

Nouns helped by both head and parent.
26
Code, Data & Resources

SyntaLex : A system to do WSD using lexical and syntactic
features. Weka’s decision tree learning algorithm is utilized.

posSenseval : part of speech tags any data in Senseval-2 data
format. Brill Tagger used.

parseSenseval : parses data in a format as output by the Brill
Tagger. Output is in Senseval-2 data format with part of speech
and parse information as xml tags. Uses Collins Parser.

Packages to convert line, hard, serve and interest data to
Senseval-1 and Senseval-2 data formats.

BrillPatch : Patch to Brill Tagger to employ Guaranteed
Pre-Tagging.
http://www.d.umn.edu/~tpederse/code.html
http://www.d.umn.edu/~tpederse/data.html
27
Senseval-3 (March 1 to April 15, 2004)
Around 8000 training and 4000 test instances.
Results expected shortly.
Thank You
28
