Word Sense Disambiguation and Information Retrieval
Word Sense Disambiguation and Information Retrieval
By
Guitao Gao
Qing Ma
Prof: Jian-Yun Nie
Outline
Introduction
WSD Approaches
Conclusion
Introduction
Task of Information Retrieval
Content Representation
Indexing
Bag of words indexing
Problems:
– Synonymy: query expansion
– Polysemy: Word Sense Disambiguation
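The bag-of-words indexing above, and the polysemy problem it runs into, can be sketched as follows (document texts and ids are hypothetical):

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "the plant produces car engines",   # 'plant' = factory
    2: "the plant needs water and light",  # 'plant' = vegetation
}
index = build_index(docs)
# A query for 'plant' retrieves both documents, whichever sense the
# user intended -- the polysemy problem noted above.
print(sorted(index["plant"]))  # -> [1, 2]
```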
WSD Approaches
Disambiguation based on manually created
rules
Disambiguation using machine readable
dictionaries
Disambiguation using thesauri
Disambiguation based on unsupervised
machine learning with corpora
Disambiguation based on
manually created rules
Weiss’ approach [Weiss 1973]:
– set of rules to disambiguate five words
– context rule: within 5 words
– template rule: specific location
– accuracy : 90%
– IR improvement: 1%
Small & Rieger’s approach [Small 1982]:
– Expert system
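Weiss-style manual rules might be encoded as below; both rule types and the example words are hypothetical, since the original rule sets are not reproduced in the slides:

```python
def disambiguate(word, tokens, pos, context_rules, template_rules):
    """Apply template rules (trigger at an exact relative position)
    first, then context rules (trigger within 5 words)."""
    # Template rules: (offset, trigger_word) -> sense
    for (offset, trigger), sense in template_rules.items():
        i = pos + offset
        if 0 <= i < len(tokens) and tokens[i] == trigger:
            return sense
    # Context rules: trigger anywhere within 5 words of the target
    window = tokens[max(0, pos - 5):pos] + tokens[pos + 1:pos + 6]
    for trigger, sense in context_rules.items():
        if trigger in window:
            return sense
    return None

# Hypothetical rules for the word 'type'
template_rules = {(1, "of"): "kind"}       # "type of" -> kind sense
context_rules = {"printer": "printing"}    # printer nearby -> printing sense

tokens = "this type of error is common".split()
print(disambiguate("type", tokens, 1, context_rules, template_rules))  # -> kind
```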
Disambiguation using machine
readable dictionaries
Lesk’s approach [Lesk 1988]:
– Senses are represented by different definitions
– Looked up context words definitions
– Find co-occurring words
– Select most similar sense
– Accuracy: 50% - 70%.
– Problem: not enough overlapping words
between definitions
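A simplified variant of Lesk's definition-overlap idea can be sketched as follows (here the sense definitions are overlapped directly with the context words; the mini-dictionary is invented):

```python
def lesk(word, context, definitions):
    """Pick the sense whose dictionary definition shares the most
    words with the context."""
    context_words = set(context)
    best_sense, best_overlap = None, -1
    for sense, definition in definitions.items():
        overlap = len(set(definition.split()) & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical mini-dictionary for 'bank'
definitions = {
    "river_bank": "sloping land beside a body of water",
    "money_bank": "institution that accepts deposits and lends money",
}
print(lesk("bank", ["deposits", "money", "loan"], definitions))  # -> money_bank
```

With few or no overlapping words, the choice degenerates to a near-arbitrary pick, which is exactly the sparseness problem noted above.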
Disambiguation using machine
readable dictionaries
Wilks’ approach [Wilks 1990]:
– Attempt to solve Lesk’s problem
– Expanding dictionary definition
– Use Longman Dictionary of Contemporary
English ( LDOCE )
– more word co-occurring evidence collected
– Accuracy: between 53% and 85%.
Wilks’ approach [Wilks 1990]
Commonly co-occurring words in LDOCE. [Wilks 1990]
Disambiguation using machine
readable dictionaries
Luk’s approach [Luk 1995]:
– Statistical sense disambiguation
– Use definitions from LDOCE
– co-occurrence data collected from Brown
corpus
– defining concepts : 1792 words used to write
definitions of LDOCE
– LDOCE pre-processed: conceptual expansion
Luk’s approach [Luk 1995]:
Entry in LDOCE and its conceptual expansion:
1. “(an order given by a judge which fixes) a punishment for a criminal found guilty in court”
→ {order, judge, punish, crime, criminal, find, guilt, court}
2. “a group of words that forms a statement, command, exclamation, or question, usu. contains a subject and a verb, and (in writing) begins with a capital letter and ends with one of the marks . ! ?”
→ {group, word, form, statement, command, question, contain, subject, verb, write, begin, capital, letter, end, mark}
Noun “sentence” and its conceptual expansion [Luk 1995]
Luk’s approach [Luk 1995] cont.
Collect co-occurrence data of defining
concepts by constructing a two-dimensional
Concept Co-occurrence Data Table (CCDT)
– Brown corpus divided into sentences
– collect conceptual co-occurrence data for each
defining concept which occurs in the sentence
– Insert the collected data into the Concept Co-occurrence Data Table.
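The CCDT construction above can be sketched as a sentence-level pair count (defining concepts and sentences here are invented stand-ins for the LDOCE concepts and the Brown corpus):

```python
from collections import defaultdict
from itertools import combinations

def build_ccdt(sentences, defining_concepts):
    """Count, for each pair of defining concepts, how many sentences
    contain both (keys are sorted pairs, so the table is symmetric)."""
    ccdt = defaultdict(int)
    for sentence in sentences:
        present = sorted(set(sentence) & defining_concepts)
        for a, b in combinations(present, 2):
            ccdt[a, b] += 1
    return ccdt

# Invented defining concepts and a two-sentence stand-in corpus
concepts = {"judge", "court", "crime", "word", "letter"}
sentences = [
    "the judge sent the criminal to court".split(),
    "the court heard the crime case before the judge".split(),
]
ccdt = build_ccdt(sentences, concepts)
print(ccdt["court", "judge"])  # -> 2
```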
Luk’s approach [Luk 1995] cont.
– Score each sense S with respect to context C
[Luk 1995]
Luk’s approach [Luk 1995] cont.
– Select sense with the highest score
– Accuracy: 77%
– Human accuracy: 71%
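Luk's actual scoring formula appears only as a figure in the transcript; a simplified additive stand-in (not Luk's exact formula) sums the co-occurrence evidence between each sense's defining concepts and the context words, with all data below invented:

```python
def score_sense(sense_concepts, context_words, ccdt):
    """Sum co-occurrence evidence between a sense's defining concepts
    and the context words (simplified additive score)."""
    return sum(ccdt.get((min(c, w), max(c, w)), 0)
               for c in sense_concepts for w in context_words)

def pick_sense(senses, context_words, ccdt):
    """Select the sense with the highest score."""
    return max(senses, key=lambda s: score_sense(senses[s], context_words, ccdt))

# Hypothetical conceptual expansions of 'sentence' and invented counts
senses = {
    "punishment": {"order", "judge", "punish", "crime", "court"},
    "grammar": {"group", "word", "verb", "letter", "mark"},
}
ccdt = {("court", "trial"): 5, ("judge", "trial"): 3}
print(pick_sense(senses, ["trial"], ccdt))  # -> punishment
```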
Approaches using Roget's
Thesaurus [Yarowsky 1992]
Resources used:
– Roget's Thesaurus
– Grolier Multimedia Encyclopedia
Senses of a word: categories in Roget's
Thesaurus
1042 broad categories covering areas like
tools/machinery or animals/insects
Approaches using Roget's
Thesaurus [Yarowsky 1992] cont.
tool, implement, appliance, contraption, apparatus,
utensil, device, gadget, craft, machine, engine,
motor, dynamo, generator, mill, lathe, equipment,
gear, tackle, tackling, rigging, harness, trappings,
fittings, accoutrements, paraphernalia, equipage,
outfit, appointments, furniture, material, plant,
appurtenances, a wheel, jack, clockwork, wheelwork, spring, screw,
Some words placed into the tools/machinery
category [Yarowsky 1992]
Approaches using Roget's
Thesaurus [Yarowsky 1992] cont.
Collect context for each category:
– From Grolier Encyclopedia
– each occurrence of each member of the category
– extract the 100 surrounding words
Sample occurrence of words in the tools/machinery category
[Yarowsky 1992]
Approaches using Roget's
Thesaurus [Yarowsky 1992] cont.
Identify and weight salient words:
Sample salient words for Roget categories 348 and 414
[Yarowsky 1992]
To disambiguate a word: sum up the weights of
all salient words appearing in its context
Accuracy: 92% disambiguating 12 words
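Yarowsky weighted salient words by how much more likely they are near a category than in general text; a toy sketch using the salience measure log(Pr(w|category)/Pr(w)), with all corpus statistics invented:

```python
import math
from collections import Counter

def train_category_weights(category_contexts, vocab_counts, total):
    """Weight each word w seen near category members by
    log(Pr(w | category) / Pr(w)) -- its salience for the category."""
    counts = Counter(category_contexts)
    n = len(category_contexts)
    return {w: math.log((c / n) / (vocab_counts[w] / total))
            for w, c in counts.items()}

def classify(context, category_weights):
    """Pick the category whose salient words in the context sum
    to the largest total weight."""
    return max(category_weights,
               key=lambda cat: sum(category_weights[cat].get(w, 0.0)
                                   for w in context))

# Invented toy statistics (the real method uses Grolier contexts)
vocab = Counter("machine gear machine water species wing species water".split())
total = sum(vocab.values())
weights = {
    "tools/machinery": train_category_weights(
        ["machine", "gear", "machine", "water"], vocab, total),
    "animals/insects": train_category_weights(
        ["species", "wing", "species", "water"], vocab, total),
}
print(classify(["machine", "gear"], weights))  # -> tools/machinery
```

Note that a word like "water", equally common near both categories, gets weight log(1) = 0 and contributes nothing: only salient words decide the category.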
Introduction to WordNet(1)
Online thesaurus system
Synsets: Synonymous Words
Hierarchical Relationships
Introduction to WordNet(2)
[Sanderson 2000]
Voorhees’ Disambiguation Experiment
Calculation of semantic distance:
synset and context words
Word’s sense: synset closest to the context
words
Retrieval result: worse than without disambiguation
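Voorhees' semantic distance is computed over WordNet's hypernym hierarchy; a toy sketch with an invented hierarchy (WordNet itself is not loaded here):

```python
# Toy hypernym hierarchy (child -> parent); all entries hypothetical
parents = {
    "oak": "tree", "tree": "plant_life", "plant_life": "entity",
    "factory": "building", "building": "entity",
}

def ancestors(node):
    """Chain from a node up to the root of the hierarchy."""
    chain = [node]
    while node in parents:
        node = parents[node]
        chain.append(node)
    return chain

def distance(a, b):
    """Number of edges on the path through the lowest common ancestor."""
    pa, pb = ancestors(a), ancestors(b)
    for i, x in enumerate(pa):
        if x in pb:
            return i + pb.index(x)
    return float("inf")

# 'oak' is much closer to 'tree' than to 'factory', so a context
# containing 'oak' pulls an ambiguous word toward its vegetation synset.
print(distance("oak", "tree"), distance("oak", "factory"))  # -> 1 5
```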
Gonzalo’s IR experiment(1)
Two Questions
Can WordNet really offer any potential for
text retrieval?
How is text Retrieval performance affected
by the disambiguation errors?
Gonzalo’s IR experiment(2)
Text Collection: Summary and Document
Experiments
1. Standard Smart Run
2. Indexed In Terms of Word-Sense
3. Indexed In Terms of Synset
4. Introduction of Disambiguation Error
Gonzalo’s IR experiment(3)
Experiments (% correct documents retrieved):
Indexing by synsets: 62.0
Indexing by word senses: 53.2
Indexing by words: 48.0
Indexing by synsets (5% errors): 62.0
Id. with 10% errors: 60.8
Id. with 20% errors: 56.1
Id. with 30% errors: 54.4
Id. with all possible: 52.6
Id. with 60% errors: 49.1
Gonzalo’s IR experiment(4)
Disambiguation with WordNet can improve
text retrieval
The solution lies in a reliable automatic WSD
technique
Disambiguation With
Unsupervised Learning
Yarowsky’s Unsupervised Method
One Sense Per Collocation
e.g. plant (manufacturing/life)
One Sense Per Discourse
e.g. defense (war/sports)
Yarowsky’s Unsupervised Method cont.
Algorithm Details
Step 1: Store the word and its contexts, one per line
e.g. “… zonal distribution of plant life …”
Step 2: Identify a few seed words that represent each word sense
e.g. plant (manufacturing/life)
Step 3a: Learn rules from the training set
plant + X => A, weight
plant + Y => B, weight
Step 3b: Use the rules created in 3a to classify all occurrences of plant in the sample set.
Yarowsky’s Unsupervised Method cont.
Step 3c: Use the one-sense-per-discourse rule to filter or augment these additions
Step 3d: Repeat steps 3a-3c iteratively.
Step 4: The training converges on a stable residual set.
Step 5: The result is a set of rules, which are then used to disambiguate the word “plant”.
e.g. plant + growth => life
plant + car => manufacturing
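Steps 1-5 above can be sketched as an iterative bootstrap; this simplified version keeps a rule only for words seen with a single sense and omits Yarowsky's log-likelihood ranking and the discourse filter, with all contexts invented:

```python
from collections import defaultdict

def bootstrap(contexts, seeds, iterations=5):
    """Grow collocation -> sense rules from a few seed collocations."""
    rules = dict(seeds)          # collocation word -> sense
    labels = {}                  # context index -> sense
    for _ in range(iterations):
        # Step 3b: classify every context containing a known collocation
        for i, ctx in enumerate(contexts):
            for w in ctx:
                if w in rules:
                    labels[i] = rules[w]
                    break
        # Step 3a (next round): learn new rules from labeled contexts
        seen = defaultdict(set)
        for i, sense in labels.items():
            for w in contexts[i]:
                seen[w].add(sense)
        for w, senses in seen.items():
            if len(senses) == 1 and w not in rules:
                rules[w] = next(iter(senses))
    return rules, labels

# Invented occurrences of 'plant' with their context words
contexts = [
    ["plant", "life", "zonal"],
    ["plant", "growth", "life"],
    ["plant", "growth", "slow"],
    ["car", "plant", "manufacturing"],
    ["car", "plant", "union"],
]
seeds = {"life": "life", "manufacturing": "manufacturing"}
rules, labels = bootstrap(contexts, seeds)
print([labels[i] for i in range(5)])
# -> ['life', 'life', 'life', 'manufacturing', 'manufacturing']
```

Context 2 has no seed word at all; it gets labeled only after "growth" is learned as a rule from context 1, which is the essence of the bootstrapping.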
Yarowsky’s Unsupervised Method cont.
Advantages of this method:
Better accuracy compared to other unsupervised
methods
No need for the costly hand-tagged training
sets required by supervised methods
Schütze and Pedersen’s approach
[Schütze 1995]
Source of word sense definitions
– Not using a dictionary or thesaurus
– Using only the corpus to be disambiguated
(Category B TREC-1 collection)
Thesaurus construction
– Collect a (symmetric) term-term matrix C
– Entry c_ij: number of times that words i and j
co-occur in a symmetric window of total size k
– Use SVD to reduce the dimensionality
Schütze and Pedersen’s approach
[Schütze 1995] cont.
– Thesaurus vector: columns
– Semantic similarity: cosine between columns
– Thesaurus: associate each word with its nearest
neighbors
– Context vector: summing thesaurus vectors of context
words
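The thesaurus-construction steps can be sketched as follows; this version compares raw co-occurrence vectors by cosine and skips the SVD dimensionality reduction used in the real method (sentences are invented):

```python
import math
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Symmetric term-term co-occurrence counts: counts[w][v] is how
    often v appears within `window` words of w."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if i != j:
                    counts[w][s[j]] += 1
    return counts

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(x * v.get(k, 0) for k, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

sentences = [
    "the engine runs".split(),
    "the motor runs".split(),
    "the flower blooms".split(),
]
c = cooccurrence(sentences)
# 'engine' and 'motor' share identical contexts, so they come out as
# nearest neighbors in the derived thesaurus.
print(cosine(c["engine"], c["motor"]) > cosine(c["engine"], c["flower"]))  # -> True
```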
Schütze and Pedersen’s approach
[Schütze 1995] cont.
Disambiguation algorithm
– Identify context vectors corresponding to all
occurrences of a particular word
– Partition them into regions of high density
– Tag a sense for each such region
– Disambiguating a word:
• Compute context vector of its occurrence
• Find the closest centroid of a region
• Assign the occurrence the sense of that centroid
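The assignment step above can be sketched with invented 2-dimensional thesaurus vectors and precomputed cluster centroids (the real method derives both from the corpus via SVD and density-based clustering):

```python
import math

def context_vector(words, thesaurus_vectors):
    """Sum the thesaurus vectors of the context words."""
    dims = len(next(iter(thesaurus_vectors.values())))
    v = [0.0] * dims
    for w in words:
        for k, x in enumerate(thesaurus_vectors.get(w, [0.0] * dims)):
            v[k] += x
    return v

def nearest_sense(vec, centroids):
    """Assign the sense of the closest cluster centroid."""
    return min(centroids, key=lambda s: math.dist(vec, centroids[s]))

# Hypothetical 2-d thesaurus vectors (dim 0 ~ finance, dim 1 ~ nature)
thesaurus = {"money": [1.0, 0.0], "loan": [0.9, 0.1],
             "river": [0.0, 1.0], "water": [0.1, 0.9]}
centroids = {"bank/finance": [1.9, 0.1], "bank/river": [0.1, 1.9]}
v = context_vector(["money", "loan"], thesaurus)
print(nearest_sense(v, centroids))  # -> bank/finance
```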
Schütze and Pedersen’s approach
[Schütze 1995] cont.
Accuracy: 90%
Application to IR
– Replace the words by word senses
– Sense-based retrieval’s average precision over 11 points
of recall increased 4% with respect to word-based retrieval
– Combine the rankings for each document:
• average precision increased: 11%
– Each occurrence is assigned n (2, 3, 4, 5) senses:
• average precision increased: 14% for n=3
Conclusion
How much can WSD help improve IR
effectiveness? Open question
– Weiss: 1%; Voorhees’ method: negative
– Krovetz and Croft, Sanderson: only useful for short queries
– Schütze and Pedersen’s approach and Gonzalo’s experiment: positive results
WSD must be accurate to be useful for IR
Schütze and Pedersen’s and Yarowsky’s algorithms:
promising for IR
Luk’s approach: robust to data sparseness, suitable
for small corpora
References
[Krovetz 1992] R. Krovetz & W.B. Croft, “Lexical Ambiguity and Information Retrieval”, ACM Transactions on Information Systems, 10(1), 1992
[Gonzalo 1998] J. Gonzalo, F. Verdejo, I. Chugur and J. Cigarran, “Indexing with WordNet synsets can improve Text Retrieval”, Proceedings of the COLING/ACL ’98 Workshop on Usage of WordNet for NLP, Montreal, 1998
[Lesk 1988] M. Lesk, “They said true things, but called them by wrong names” – vocabulary problems in retrieval systems, Proc. 4th Annual Conference of the University of Waterloo Centre for the New OED, 1988
[Luk 1995] A.K. Luk, “Statistical sense disambiguation with relatively small corpora using dictionary definitions”, Proceedings of the 33rd Annual Meeting of the ACL, Columbus, Ohio, June 1995
[Salton 1983] G. Salton & M.J. McGill, Introduction to Modern Information Retrieval (the SMART and SIRE experimental retrieval systems), New York: McGraw-Hill, 1983
[Sanderson 1997] M. Sanderson, Word Sense Disambiguation and Information Retrieval, PhD Thesis, Technical Report TR-1997-7, Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK, 1997
[Sanderson 2000] M. Sanderson, “Retrieving with Good Sense”, http://citeseer.nj.nec.com/sanderson00retrieving.html, 2000
References cont.
[Schütze 1995] H. Schütze & J.O. Pedersen, “Information retrieval based on word senses”, Proceedings of the Symposium on Document Analysis and Information Retrieval, 4: 161-175, 1995
[Small 1982] S. Small & C. Rieger, “Parsing and comprehending with word experts (a theory and its realisation)”, in Strategies for Natural Language Processing, W.G. Lehnert & M.H. Ringle, Eds., LEA: 89-148, 1982
[Voorhees 1993] E.M. Voorhees, “Using WordNet to disambiguate word sense for text retrieval”, Proceedings of ACM SIGIR Conference, (16): 171-180, 1993
[Weiss 1973] S.F. Weiss, “Learning to disambiguate”, Information Storage and Retrieval, 9: 33-41, 1973
[Wilks 1990] Y. Wilks, D. Fass, C. Guo, J.E. Mcdonald, T. Plate, B.M. Slator, “Providing Machine Tractable Dictionary Tools”, Machine Translation, 5: 99-154, 1990
[Yarowsky 1992] D. Yarowsky, “Word sense disambiguation using statistical models of Roget’s categories trained on large corpora”, Proceedings of COLING Conference: 454-460, 1992
[Yarowsky 1994] D. Yarowsky, “Decision lists for lexical ambiguity resolution: Application to Accent Restoration in Spanish and French”, Proceedings of the 32nd Annual Meeting of the ACL, Las Cruces, NM, 1994
[Yarowsky 1995] D. Yarowsky, “Unsupervised word sense disambiguation rivaling supervised methods”, Proceedings of the 33rd Annual Meeting of the ACL: 189-196, Cambridge, MA, 1995