
Word Sense Disambiguation
&
Information Retrieval
CMSC 35100
Natural Language Processing
May 20, 2003
Roadmap
• Word Sense Disambiguation
– Knowledge-based Approaches
• Sense similarity in a taxonomy
– Issues in WSD
• Why they work & why they don’t
• Information Retrieval
– Vector Space Model
• Computing similarity
• Term weighting
• Enhancements: Expansion, Stemming, Synonyms
Resnik’s WordNet Labeling: Detail
• Assume a Source of Clusters
• Assume a KB: Word Senses in the WordNet IS-A Hierarchy
• Assume a Text Corpus
• Calculate Informativeness
– For Each KB Node:
• Sum occurrences of it and all children
• Informativeness: $IC(c) = -\log P(c)$
• Disambiguate wrt Cluster & WordNet
– Find the Most Informative Subsumer (MIS) for each sense pair; let I be its informativeness
– For each sense subsumed by the MIS, Vote += I
– Select the Sense with the Highest Vote (see the sketch below)
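A minimal sketch of this voting procedure, assuming NLTK's WordNet interface and its Brown-corpus information-content file; as a simplification, only the sense pair attaining the MIS receives the vote, where the procedure above credits every subsumed sense.

```python
# Sketch of Resnik-style cluster-based sense labeling (simplified).
from itertools import combinations
from collections import defaultdict

from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # IC(c) = -log P(c), precomputed

def label_cluster(words):
    """For each word pair in the cluster, credit the informativeness I
    of their most informative subsumer (MIS) to the sense pair that
    attains it; each word keeps its highest-voted noun sense."""
    votes = defaultdict(float)  # (word, synset) -> accumulated votes
    for w1, w2 in combinations(words, 2):
        best_i, supported = 0.0, []
        for s1 in wn.synsets(w1, pos=wn.NOUN):
            for s2 in wn.synsets(w2, pos=wn.NOUN):
                # res_similarity is the IC of the MIS of (s1, s2)
                i = s1.res_similarity(s2, brown_ic)
                if i > best_i:
                    best_i, supported = i, [(w1, s1), (w2, s2)]
        for key in supported:
            votes[key] += best_i
    best = {}
    for (word, synset), v in votes.items():
        if word not in best or v > votes[(word, best[word])]:
            best[word] = synset
    return best

for word, sense in label_cluster(['plant', 'animal', 'species']).items():
    print(word, '->', sense.name())
```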
Sense Labeling Under WordNet
• Use Local Content Words as Clusters
– Biology: Plants, Animals, Rainforests, Species…
– Industry: Company, Products, Range, Systems…
• Find Common Ancestors in WordNet (see the sketch after this slide)
– Biology: Plants & Animals isa Living Thing
– Industry: Product & Plant isa Artifact isa Entity
– Use the Most Informative
• Result: Correct Selection
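A minimal sketch of finding common ancestors with NLTK's WordNet; the specific synsets chosen below are illustrative assumptions.

```python
# Common ancestors (hypernyms) of two senses in the WordNet hierarchy.
from nltk.corpus import wordnet as wn

plant = wn.synset('plant.n.02')    # the living-organism sense
animal = wn.synset('animal.n.01')
print(plant.lowest_common_hypernyms(animal))
# -> [Synset('organism.n.01')], i.e. a living thing

factory = wn.synset('plant.n.01')  # the industrial-building sense
product = wn.synset('product.n.02')
print(factory.lowest_common_hypernyms(product))
# -> a more general ancestor, e.g. artifact.n.01
```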
The Question of Context
• Shared Intuition:
– Context → Sense: context determines which sense is intended
• Area of Disagreement:
– What is context?
• Wide vs Narrow Window
• Word Co-occurrences
Taxonomy of Contextual Information
• Topical Content
• Word Associations
• Syntactic Constraints
• Selectional Preferences
• World Knowledge & Inference
A Trivial Definition of Context
All Words within X words of Target
• Many words: Schütze – 1000 characters, several sentences
• Unordered “Bag of Words”
• Information Captured: Topic & Word Association
• Limits on Applicability (see the sketch below)
– Nouns vs. Verbs & Adjectives
– Schütze: Nouns – 92%; the verb “train” – 69%
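A minimal sketch of this trivial context definition; the window size and whitespace tokenization are illustrative assumptions.

```python
# All words within `window` positions of the target, as an unordered bag.
from collections import Counter

def context_bag(tokens, target_index, window=10):
    """Unordered bag of words around the target (target itself excluded)."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    bag = Counter(tokens[lo:hi])
    bag[tokens[target_index]] -= 1  # drop one count for the target itself
    return +bag  # unary '+' drops zero/negative counts

tokens = "the plant bolted its doors before the night shift arrived".split()
print(context_bag(tokens, tokens.index("plant"), window=4))
```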
Limits of Wide Context
• Comparison of Wide-Context Techniques (LTV ’93)
– Neural Net, Context Vector, Bayesian Classifier,
Simulated Annealing
• Results: 2 Senses - 90+%; 3+ senses ~ 70%
• People: Sentences ~100%; Bag of Words: ~70%
• Inadequate Context
• Need Narrow Context
– Local Constraints Override
– Retain Order, Adjacency
Surface Regularities = Useful Disambiguators
• Not Necessarily!
• “Scratching her nose” vs “Kicking the bucket” (deMarcken 1995)
• Right for the Wrong Reason
– Burglar Rob… Thieves Stray Crate Chase Lookout
• Learning the Corpus, not the Sense
– The “Ste.” Cluster: Dry Oyster Whisky Hot Float Ice
• Learning Nothing Useful, Wrong Question
– Keeping: Bring Hoping Wiping Could Should Some Them Rest
Interactions Below the Surface
• Constraints Not All Created Equal
– “The Astronomer Married the Star”
– Selectional Restrictions Override Topic
• No Surface Regularities
– “The emigration/immigration bill guaranteed passports to all Soviet citizens”
– No Substitute for Understanding
What is Similar
• Ad-hoc Definitions of Sense
– Cluster in “word space”, WordNet Sense, “Seed Sense”: Circular
• Schutze: Vector Distance in Word Space
• Resnik: Informativeness of WordNet Subsumer + Cluster
– Relation in Cluster, not WordNet is-a hierarchy
• Yarowsky: No Similarity, Only Difference
– Decision Lists: one per sense pair
– Find Discriminants
Information Retrieval
• Query/Document similarity
• Query
– Expression of user’s information need
• Documents
– Searchable units: encode concepts
– E.g. paragraphs, encyclopedia entries, web pages,…
• Collection: searchable group of documents
– Elementary units: terms
• E.g. words, phrases, stems,…
– (Typically) a bag of words:
• man, dog, bit
Vector Space Model
• Represent documents and queries as
– Vectors of term-based features
• Features: tied to occurrence of terms in the collection
– E.g. $\vec{d}_j = (t_{1,j}, t_{2,j}, \ldots, t_{N,j})$; $\vec{q}_k = (t_{1,k}, t_{2,k}, \ldots, t_{N,k})$
• Solution 1: Binary features: t = 1 if present, 0 otherwise
– Similarity: number of terms in common
• Dot product (sketched below):
$sim(\vec{q}_k, \vec{d}_j) = \sum_{i=1}^{N} t_{i,k}\, t_{i,j}$
Vector Space Model II
• Problem: Not all terms equally interesting
– E.g. the vs dog vs Levow
• Solution: Replace binary term features with weights
– E.g. $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{N,j})$; $\vec{q}_k = (w_{1,k}, w_{2,k}, \ldots, w_{N,k})$
– Document collection: term-by-document matrix
– View as vectors in a multidimensional space
• Nearby vectors are related
– Normalize for vector length
Vector Similarity Computation
• Similarity = Dot product
$sim(\vec{q}_k, \vec{d}_j) = \vec{q}_k \cdot \vec{d}_j = \sum_{i=1}^{N} w_{i,k}\, w_{i,j}$
• Normalization:
– Normalize weights in advance
– Normalize post-hoc (cosine similarity, sketched below):
$sim(\vec{q}_k, \vec{d}_j) = \dfrac{\sum_{i=1}^{N} w_{i,k}\, w_{i,j}}{\sqrt{\sum_{i=1}^{N} w_{i,k}^2}\, \sqrt{\sum_{i=1}^{N} w_{i,j}^2}}$
Term Weighting
• “Aboutness”
– To what degree is this term what the document is about?
– Within document measure
– Term frequency (tf): # occurrences of t in doc j
• “Specificity”
– How surprised are you to see this term?
– Collection frequency
– Inverse document frequency (idf), where N is the collection size and $n_i$ the number of documents containing term i:
$idf_i = \log\left(\dfrac{N}{n_i}\right)$
– Combined tf·idf weight: $w_{i,j} = tf_{i,j} \times idf_i$
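A minimal sketch of tf·idf weighting over a toy collection; the documents and whitespace tokenization are illustrative assumptions.

```python
# tf-idf: within-document frequency scaled by collection-wide rarity.
import math
from collections import Counter

docs = [
    "the dog bit the man".split(),
    "the man walked the dog".split(),
    "the cat slept".split(),
]

N = len(docs)
# n_i: number of documents containing each term
doc_freq = Counter(t for doc in docs for t in set(doc))

def tf_idf(doc):
    """w_{i,j} = tf_{i,j} * log(N / n_i) for each term in document j."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / doc_freq[t]) for t in tf}

print(tf_idf(docs[0]))  # 'the' gets weight 0: it occurs in every document
```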
Term Selection & Formation
• Selection:
– Some terms are truly useless
• Too frequent, no content
– E.g. the, a, and,…
– Stop words: ignore such terms altogether
• Creation:
– Too many surface forms for the same concept
• E.g. inflections of words: verb conjugations, plurals
– Stem terms: treat all surface forms as the same underlying stem (see the sketch below)
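A minimal sketch of stop-word removal and stemming, assuming NLTK's English stop-word list and Porter stemmer.

```python
# Term selection (stop words) and term formation (stemming).
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop = set(stopwords.words('english'))
stemmer = PorterStemmer()

def terms(tokens):
    """Drop stop words, then reduce each word to its stem."""
    return [stemmer.stem(t) for t in tokens if t.lower() not in stop]

print(terms("the dogs were biting the men".split()))
# -> ['dog', 'bite', 'men']
```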
Query Refinement
• Typical queries very short, ambiguous
– Cat: animal/Unix command
– Add more terms to disambiguate, improve
• Relevance feedback
– Retrieve with original queries
– Present results
• Ask user to tag relevant/non-relevant
– “Push” toward relevant vectors, away from non-relevant ones (the Rocchio update, sketched below):
$\vec{q}_{i+1} = \vec{q}_i + \dfrac{\beta}{R}\sum_{j=1}^{R}\vec{r}_j - \dfrac{\gamma}{S}\sum_{k=1}^{S}\vec{s}_k$
– β + γ = 1 (e.g. 0.75, 0.25); $\vec{r}_j$: relevant docs, $\vec{s}_k$: non-relevant docs
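A minimal sketch of the relevance-feedback update above, using plain Python lists as vectors; β and γ default to the (0.75, 0.25) setting on the slide.

```python
# Rocchio-style query refinement from user relevance judgments.
def refine_query(q, rel, nonrel, beta=0.75, gamma=0.25):
    """Move q toward the mean relevant vector and away from the
    mean non-relevant vector."""
    def mean(vectors):
        n = len(vectors)
        return [sum(col) / n for col in zip(*vectors)] if n else [0.0] * len(q)
    r_mean, s_mean = mean(rel), mean(nonrel)
    return [qi + beta * ri - gamma * si
            for qi, ri, si in zip(q, r_mean, s_mean)]

q = [1.0, 0.0, 0.5]
relevant = [[1.0, 1.0, 0.0], [0.8, 0.6, 0.2]]
non_relevant = [[0.0, 0.0, 1.0]]
print(refine_query(q, relevant, non_relevant))
```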