Transcript ppt

Finding Predominant Word
Senses in Untagged Text
Diana McCarthy, Rob Koeling, Julie Weeds & John Carroll
Department of Informatics,
University of Sussex
{dianam, robk, juliewe, johnca}@sussex.ac.uk
ACL 2004
Introduction

In word sense disambiguation, the heuristic of choosing the
most common sense is extremely powerful.




Does not take surrounding context into account
Assumes the availability of good-quality hand-tagged data
One would expect the frequency distribution of the senses to
depend on the domain of the text.
We present work on using an automatically acquired thesaurus and the
WordNet similarity package to find the predominant sense.

Does not require any hand-tagged text, such as SemCor


In SENSEVAL-2, even systems which showed superior performance to
this heuristic often fell back on it where evidence from the context
was not sufficient.
There is a strong case for obtaining a predominant sense from
untagged corpus data so that a WSD system can be tuned to
the domain.

SemCor comprises a relatively small sample of 250,000 words.


Our work is aimed at discovering the predominant senses from
raw text.



tiger -> audacious person / carnivorous animal
Hand-tagged data is not always available
Can produce predominant senses for the domain type required.
We believe that automatic means of finding a predominant sense can be
useful for systems that use it as a back-off strategy, and for lexical
acquisition given the limited size of hand-tagged resources.

Many researchers have built thesauruses automatically from
parsed data.



Each target word is entered with a list of “nearest neighbors”,
ordered by their “distributional similarity” to the target word.
Distributional similarity is a measure of the degree to which two words occur in similar contexts.
The quantity and similarity of the neighbors pertaining to
different senses will reflect the dominance of each sense.

In a thesaurus provided by Lin, star has the ordered neighbors:
superstar, player, teammate, …, galaxy, sun, world, …
Method



We use a thesaurus based on the method of Lin (1998), which provides the
k nearest neighbors of each target word along with distributional similarity
scores. We then use the WordNet similarity package to weight the
contribution that each neighbor makes to the various senses of the
target word.
We rank each sense wsi of a target word w using:

  Prevalence(wsi) = Σ_{nj ∈ Nw} dss(w, nj) × wnss(wsi, nj) / Σ_{wsi' ∈ senses(w)} wnss(wsi', nj)

where Nw = {n1, n2, …, nk} are the top-scoring k neighbors with distributional
similarity scores {dss(w, n1), dss(w, n2), …, dss(w, nk)}, and wnss(wsi, nj) is
the maximum WordNet similarity score between wsi and any sense of nj.
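As a rough sketch of this ranking step (all sense names and similarity values below are invented for illustration; the paper derives dss from Lin's thesaurus and wnss from the WordNet similarity package):

```python
# Minimal sketch of the sense-ranking step with made-up numbers.

def rank_senses(senses, neighbors, dss, wnss):
    """Score each sense by summing, over the k nearest neighbors, the
    distributional similarity weighted by the normalized WordNet
    similarity between the sense and the neighbor."""
    scores = {}
    for ws in senses:
        score = 0.0
        for n in neighbors:
            norm = sum(wnss(other, n) for other in senses)
            if norm > 0:
                score += dss[n] * wnss(ws, n) / norm
        scores[ws] = score
    return scores

# Toy example: two hypothetical senses of "star" and three neighbors.
senses = ["celebrity", "celestial_body"]
neighbors = ["superstar", "player", "galaxy"]
dss = {"superstar": 0.24, "player": 0.21, "galaxy": 0.17}
wn_sim = {("celebrity", "superstar"): 0.9, ("celebrity", "player"): 0.6,
          ("celebrity", "galaxy"): 0.1,
          ("celestial_body", "superstar"): 0.1,
          ("celestial_body", "player"): 0.1,
          ("celestial_body", "galaxy"): 0.8}
scores = rank_senses(senses, neighbors, dss, lambda s, n: wn_sim[(s, n)])
```

With these toy values the celebrity sense outscores the celestial-body sense, since the most distributionally similar neighbors relate to it.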
Acquiring the Automatic Thesaurus

The thesaurus was acquired using the method described by
Lin (1998).




For input we use grammatical relation data extracted using an
automatic parser.
A noun w is described using a set of co-occurrence triples <w, r, x>
and associated frequencies, where r is a grammatical relation and x
is a possible co-occurrence with w in the relation.
For every pair of nouns, where each noun has a total frequency of more
than 9 in the triple data, compute their distributional similarity.
If T(w) is the set of co-occurrence types (r, x) such that I(w, r, x)
is positive, then the distributional similarity dss(w, n) of two nouns
w and n is:

  dss(w, n) = Σ_{(r,x) ∈ T(w) ∩ T(n)} (I(w, r, x) + I(n, r, x)) / (Σ_{(r,x) ∈ T(w)} I(w, r, x) + Σ_{(r,x) ∈ T(n)} I(n, r, x))
Automatic Retrieval and Clustering of Similar Words
Dekang Lin
Proceedings of COLING-ACL '98


The meaning of an unknown word can often be inferred from its
context.
We use a broad-coverage parser to extract dependency triples
from a text corpus. A dependency triple consists of two words and
the grammatical relation between them.


The triples extracted from “I have a brown dog” include (have subj I), (have obj dog), (dog adj-mod brown) and (dog det a).
The description of a word w consists of the frequency counts of
all dependency triples that match the pattern (w, *, *).
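The description step can be sketched as follows (the triples below are invented for illustration, not real parser output):

```python
from collections import Counter

# Sketch: a word's "description" is the frequency count of every
# dependency triple matching the pattern (w, *, *).

triples = [
    ("have", "subj", "I"),
    ("have", "obj", "dog"),
    ("dog", "det", "a"),
    ("dog", "adj-mod", "brown"),
    ("dog", "adj-mod", "brown"),
]

def description(w):
    # collect counts of (r, x) over all triples (w, r, x)
    return Counter((r, x) for (tw, r, x) in triples if tw == w)
```

For example, `description("dog")` counts (adj-mod, brown) twice and (det, a) once.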

For example, the description of the word cell is a list of frequency counts ||cell, r, x|| over all relations r and co-occurring words x.

Measure the amount of information in the statement that a
randomly selected triple is (w, r, w’), when we do not know the
value of ||w, r, w’||.





An occurrence of the triple (w, r, w') can be regarded as the co-occurrence of three events:
A: a randomly selected word is w.
B: a randomly selected dependency type is r.
C: a randomly selected word is w’.
Assume that A and C are conditionally independent given B; the
probability of (w, r, w') is then given by:

  P(w, r, w') = P(B) P(A|B) P(C|B)
              = (||*, r, *|| / ||*, *, *||) × (||w, r, *|| / ||*, r, *||) × (||*, r, w'|| / ||*, r, *||)


Measure the amount of information when we know the value of
||w, r, w'||; the difference between the two amounts is the
information contained in ||w, r, w'|| = c:

  I(w, r, w') = log( ||w, r, w'|| × ||*, r, *|| / (||w, r, *|| × ||*, r, w'||) )

Let T(w) be the set of pairs (r, w') such that I(w, r, w') is positive,
and define the similarity sim(w1, w2) between words w1 and w2 as:

  sim(w1, w2) = Σ_{(r,w) ∈ T(w1) ∩ T(w2)} (I(w1, r, w) + I(w2, r, w)) / (Σ_{(r,w) ∈ T(w1)} I(w1, r, w) + Σ_{(r,w) ∈ T(w2)} I(w2, r, w))
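A rough sketch of Lin's similarity over dependency triples (the tiny triple corpus below is invented; the real input is parser output over a large corpus):

```python
import math
from collections import Counter

# Invented triple counts for three nouns in two grammatical relations.
triples = Counter({
    ("cell", "obj-of", "contain"): 3,
    ("room", "obj-of", "contain"): 3,
    ("idea", "obj-of", "discuss"): 3,
    ("cell", "adj-mod", "small"): 2,
    ("room", "adj-mod", "small"): 2,
    ("room", "adj-mod", "large"): 2,
    ("idea", "adj-mod", "good"): 2,
})

def count(w=None, r=None, x=None):
    # ||w, r, x||, with None standing for the wildcard *
    return sum(c for (tw, tr, tx), c in triples.items()
               if (w is None or tw == w)
               and (r is None or tr == r)
               and (x is None or tx == x))

def info(w, r, x):
    # I(w, r, x) = log( ||w,r,x|| * ||*,r,*|| / (||w,r,*|| * ||*,r,x||) )
    return math.log(count(w, r, x) * count(r=r)
                    / (count(w=w, r=r) * count(r=r, x=x)))

def T(w):
    # co-occurrence types (r, x) with positive mutual information
    return {(r, x) for (tw, r, x) in triples if tw == w and info(w, r, x) > 0}

def sim(w1, w2):
    # Lin's similarity: shared information over total information
    shared = T(w1) & T(w2)
    num = sum(info(w1, r, x) + info(w2, r, x) for r, x in shared)
    den = (sum(info(w1, r, x) for r, x in T(w1))
           + sum(info(w2, r, x) for r, x in T(w2)))
    return num / den if den else 0.0
```

In this toy corpus, cell and room share informative contexts (obj-of contain) and so come out similar, while cell and idea share none.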
The WordNet Similarity Package



The WordNet similarity package supports a range of similarity
scores.
lesk: maximizes the number of overlapping words in the gloss,
or definition, of the senses.
jcn: each synset is incremented with the frequency counts from
the corpus of all words belonging to the synset.



Calculate the “information content” IC(s) = -log(p(s))
Djcn(s1, s2) = IC(s1) + IC(s2) – 2* IC(s3), where s3 is the most
informative superordinate synset of s1 and s2.
jcn(s1, s2) = 1/Djcn(s1, s2)
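As a rough illustration of the jcn computation on a toy IS-A hierarchy (the probabilities and hypernym links below are invented; the paper estimates p(s) from BNC frequency counts over WordNet synsets):

```python
import math

# Invented synset probabilities; the root "entity" has probability 1.
p = {"entity": 1.0, "object": 0.4, "organism": 0.3,
     "animal": 0.1, "plant": 0.05}

# Invented hypernym (IS-A) chains up to the root.
hypernyms = {"object": ["entity"], "organism": ["entity"],
             "animal": ["organism", "entity"],
             "plant": ["organism", "entity"]}

def IC(s):
    # information content IC(s) = -log p(s)
    return -math.log(p[s])

def most_informative_superordinate(s1, s2):
    # s3: the shared superordinate (or the synset itself) with maximal IC
    common = (set(hypernyms.get(s1, [])) | {s1}) & (set(hypernyms.get(s2, [])) | {s2})
    return max(common, key=IC)

def jcn(s1, s2):
    # Djcn(s1, s2) = IC(s1) + IC(s2) - 2 * IC(s3); jcn = 1 / Djcn
    s3 = most_informative_superordinate(s1, s2)
    d = IC(s1) + IC(s2) - 2 * IC(s3)
    return 1.0 / d if d else float("inf")
```

Here animal and plant meet at organism, which is more informative than entity, so jcn rates them more similar than animal and object.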
Experiment with SemCor


We generated a thesaurus entry for every polysemous noun which
occurred more than twice in SemCor and more than 9 times in the BNC in
the grammatical relations. jcn uses the BNC corpus for frequency
counts, and the thesaurus entry size k is set to 50.
The accuracy of finding the predominant sense in SemCor and
the WSD accuracy on SemCor when using our first sense in all
contexts are as follows:


We use jcn in the remaining experiments because it gave
good results for finding the predominant sense and is more
efficient.
There are cases where the acquired first sense disagrees with
SemCor yet is intuitively plausible.


pipe -> tobacco pipe (SemCor) / tube made of metal or plastic used to carry
water, oil or gas etc. (acquired), with nearest neighbors tube, cable, wire,
tank, hole, cylinder, …
soil -> filth, stain, the state of being unclean (SemCor) / dirt, ground,
earth (acquired); this seems intuitive given the expected usage in modern
British English.
Experiment with SENSEVAL-2 English All-Words Data

To see how well the predominant sense performs on a WSD task,
we use the SENSEVAL-2 all-words data.


We do not propose this as a stand-alone method of WSD; however, it is
important to know its performance for any system that uses it.
Generate a thesaurus entry for all polysemous nouns in
WordNet and compare the results using the first sense in
SemCor and the SENSEVAL-2 all-words data itself.

Trivially label all monosemous items.

The automatically acquired predominant sense performs nearly
as well as the hand-tagged SemCor.


The items not covered by our method were those with
insufficient grammatical relations for the tuples employed.


We use only raw text, with no manual labeling.
The uncovered items today and one each occurred 5 times in the test data.
Extending the grammatical relations used for building the
thesaurus should improve the coverage.
Experiment with Domain Specific Corpora




A major motivation is to capture changes in the ranking of senses for
documents from different domains.
We selected two domains, SPORTS (35,317 docs) and FINANCE (117,734
docs), from the Reuters corpus and acquired a thesaurus from each
corpus.
We selected a number of words and evaluated them qualitatively.
The words were not chosen randomly, since we anticipated different
predominant senses for these words.
Additionally, we evaluated quantitatively using the Subject Field Codes
resource, which annotates WordNet synsets with domain labels. We
selected words that have at least one SPORTS and one ECONOMY label,
resulting in 38 words.

The results are summarized below with the WordNet sense
number for each word.



Most words show a change in predominant sense across domains.
The first senses of words like division, tie and goal shift towards
the more specific senses in the SPORTS domain.
The predominant sense of share remains the same, but the stock
certificate sense is ranked higher in the FINANCE domain.

The figure shows the distribution of domain labels of the
predominant senses acquired from the SPORTS and FINANCE corpora for
the set of 38 words.


Both domains have a similar percentage of factotum (domain-independent)
labels.
As we expected, the other peaks correspond to the economy label for
the FINANCE corpus and the sports label for the SPORTS corpus.
Conclusions




We have devised a method that uses raw corpus data to
automatically find the predominant sense of nouns in WordNet.
We have demonstrated the possibility of finding predominant
senses in domain-specific corpora.
In the future we will investigate the effect of frequency and the
choice of distributional similarity measure, and apply our method
to words with a PoS other than noun.
It would be possible to use this method with another inventory
given a measure of semantic relatedness between the neighbors
and the senses.