Semantic distance & WordNet
Serge B. Potemkin
Moscow State University
Philological faculty
Distance and metrics
Fundamental concept = distance between entities under consideration
Semantic distance between words or concepts
Metric space axioms?
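For reference, a distance function ρ on a set X is a metric if it satisfies the standard axioms below for all x, y, z in X. Whether a given semantic distance satisfies all of them (the triangle inequality in particular) is the question raised here.

\[
\begin{aligned}
&\rho(x, y) \ge 0, \\
&\rho(x, y) = 0 \iff x = y, \\
&\rho(x, y) = \rho(y, x), \\
&\rho(x, z) \le \rho(x, y) + \rho(y, z).
\end{aligned}
\]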
Distance is needed for:
• word sense disambiguation,
• determining the structure of texts,
• text summarization and annotation,
• information extraction and retrieval,
• automatic indexing,
• lexical selection,
• the automatic correction of word errors in text,
• …
Approaches to distance measuring:
• Corpora-based
• Dictionary-based
• Roget-structured thesauri
• WordNet and other semantic networks
WordNet
• Synonym sets (synsets)
• Subsumption hierarchy (hyponymy / hypernymy)
• 3 meronymic (PART-OF) relations: COMPONENT-OF, MEMBER-OF, SUBSTANCE-OF, and their inverses
• Antonymy
• COMPLEMENT-OF
WordNet shortcomings:
• 150000 synsets – inadequate coverage
• Non-English versions cover 20–70% of the English one (100000 synsets for Russian)
• Extension is hard
• Distance measuring is controversial
Corpora-based approach
• Two words wa and wb are the closer, the more often their neighbors (within ±5 words) coincide.
• Ex. (distributional profile of the word)
• star: space 0.28, movie 0.2, famous 0.13, light 0.09, rich 0.04, …
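A minimal sketch of the corpora-based idea: collect the neighbors of a word within a ±5 window, normalize the counts into a profile, and compare two profiles. The relative-frequency weighting and the cosine comparison are assumptions made for illustration; the weights shown for star may have been computed differently. The function names `profile` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def profile(tokens, target, window=5):
    """Relative frequencies of the words seen within +/- window of target."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine of two sparse profiles; 0.0 if either profile is empty."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


# Toy corpus, invented for this example only.
tokens = "the star shines in space the sun shines in space".split()
star = profile(tokens, "star", window=2)
sun = profile(tokens, "sun", window=2)
print(cosine(star, sun))
```

The closer the returned value is to 1, the more the two neighborhoods coincide; a distance can be derived from it, for example as 1 minus the cosine.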
Dictionary-based approach
• Two words wa and wb are the closer, the more often the words in their definitions coincide.
• Ex.: wa = linguistics, wb = stylistics
• {the, study, of, language, in, general, and, of, particular, languages, and, their, structure, and, grammar, and, history}
• {the, study, of, style, in, written, or, spoken, language}.
2 content words (study, language) coincide in the definitions
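The overlap count for this example can be reproduced with a short sketch. It assumes that function words are ignored; the stop-word list below is a hypothetical choice made for this example, and `shared_words` is an illustrative name.

```python
# Assumed stop-word list: function words that are not counted as coincidences.
STOPWORDS = {"the", "of", "in", "and", "or", "their"}


def shared_words(def_a, def_b, stopwords=STOPWORDS):
    """Content words that occur in both definitions."""
    return (set(def_a) - stopwords) & (set(def_b) - stopwords)


linguistics = ["the", "study", "of", "language", "in", "general", "and", "of",
               "particular", "languages", "and", "their", "structure", "and",
               "grammar", "and", "history"]
stylistics = ["the", "study", "of", "style", "in", "written", "or", "spoken",
              "language"]

print(sorted(shared_words(linguistics, stylistics)))  # ['language', 'study']
```

Without the stop-word filter, the, of and in would also count, giving 5 shared words. Note that languages and language are treated as different words here, since no lemmatization is applied.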
Bilingual dictionary approach
• Two words wa and wb are the closer, the more often their equivalents coincide.
$\rho(W_a, W_b) = 1 / \sum_i n_i$,
where the sum runs over all coinciding Russian equivalents and $n_i$ is the number of dictionaries in which equivalent $i$ occurs.
Or: $\rho(W_a, W_b) = \sum_i n_{ai} n_{bi} / (\lVert a_R \rVert \, \lVert b_R \rVert)$
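Both formulas can be sketched on toy data. Everything in the lexicon below is hypothetical, as are the function names. Two assumptions are made: for the first formula, n_i is taken as the smaller of the two dictionary counts for a shared equivalent, and the second formula is read as a cosine between the count vectors of Russian equivalents, so it grows with closeness rather than with distance.

```python
import math

# Hypothetical data: English word -> {Russian equivalent: number of dictionaries}.
LEXICON = {
    "accolade": {"похвала": 2, "награда": 1},
    "praise": {"похвала": 3, "хвала": 1},
    "bicycle": {"велосипед": 4},
}


def rho_inverse_count(a, b, lexicon=LEXICON):
    """1 / sum of n_i over shared equivalents; infinite if nothing is shared."""
    shared = lexicon[a].keys() & lexicon[b].keys()
    # Assumption: n_i is the smaller of the two dictionary counts.
    total = sum(min(lexicon[a][r], lexicon[b][r]) for r in shared)
    return 1.0 / total if total else math.inf


def cosine_measure(a, b, lexicon=LEXICON):
    """Sum of n_ai * n_bi over shared equivalents, divided by the vector norms."""
    va, vb = lexicon[a], lexicon[b]
    dot = sum(va[r] * vb[r] for r in va.keys() & vb.keys())
    norm = math.sqrt(sum(n * n for n in va.values())) * math.sqrt(
        sum(n * n for n in vb.values())
    )
    return dot / norm if norm else 0.0


print(rho_inverse_count("accolade", "praise"))   # 0.5
print(rho_inverse_count("accolade", "bicycle"))  # inf
```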
Multidimensional scaling
• The semantic network is a graph:
nodes: words
edges: links between words via the bilingual lexicon
edge length: $\lVert \text{edge} \rVert = \rho(W_a, W_b)$
• The graph can be embedded in an N-dimensional space, where N is the number of words in the lexicon (>100000)
• Multidimensional scaling is used for visualization
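The embedding step can be sketched in two stages: shortest-path distances over the graph, then a 2-D layout that tries to preserve them. This is a sketch under stated assumptions, not the procedure used for the figures: the layout is found by plain gradient descent on the raw stress (one of several multidimensional scaling variants), the graph is assumed connected, and the three-word graph is invented.

```python
import math
import random


def shortest_paths(nodes, edges):
    """All-pairs shortest paths (Floyd-Warshall) on an undirected graph."""
    d = {a: {b: 0.0 if a == b else math.inf for b in nodes} for a in nodes}
    for (a, b), length in edges.items():
        d[a][b] = d[b][a] = min(d[a][b], length)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def stress(nodes, d, pos):
    """Sum of squared differences between layout and target distances."""
    total = 0.0
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            layout = math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])
            total += (layout - d[a][b]) ** 2
    return total


def mds_2d(nodes, d, steps=2000, lr=0.05, seed=0):
    """Place nodes in the plane by gradient descent on the raw stress."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(steps):
        grad = {n: [0.0, 0.0] for n in nodes}
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                layout = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                coeff = 2.0 * (layout - d[a][b]) / layout
                grad[a][0] += coeff * dx
                grad[a][1] += coeff * dy
                grad[b][0] -= coeff * dx
                grad[b][1] -= coeff * dy
        for n in nodes:
            pos[n][0] -= lr * grad[n][0]
            pos[n][1] -= lr * grad[n][1]
    return pos


# Hypothetical toy graph; edge lengths stand in for rho(Wa, Wb).
NODES = ["accolade", "praise", "award"]
EDGES = {("accolade", "praise"): 1.0, ("praise", "award"): 1.0,
         ("accolade", "award"): 1.0}
DIST = shortest_paths(NODES, EDGES)
LAYOUT = mds_2d(NODES, DIST)
print(stress(NODES, DIST, LAYOUT))
```

For a lexicon of more than 100000 words the cubic shortest-path step and the quadratic stress sum would be far too slow; in practice only a bounded neighborhood of the word under consideration would be laid out.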
New synonyms
1-neighborhood of accolade
• Links between synonyms (black)
• Links between synonyms from the dictionary (green)
• 2 isolated clusters.
Dominant in acerbity neighborhood
• acerbity (терпкость) excluded
• cluster (bold lines) derived by a Markovian process
• asperity (резкость) is the centre of the cluster
2 dominants for bicycling (wheel+crook)
Adjustable parameters
• space dimension;
• minimal number of dictionaries linking synonyms;
• maximal distance from the word under consideration;
• maximal number of displayed words;
• word excluded from clustering;
• …
Compare LDB with WordNet (accolade)
Synset            WordNet       LDB
                  # of syn.     # of syn.
award             3n+2v         80
accolade          1n            8
honor = honour    4n+3v         >100
laurels           2n            15
n – noun, v – verb
Synonyms in LDB
commendation, praise, approbation, applause, + honorable mention, mention, positive mention
Controversy 1
• The immediate hyperonym of the accolade synset in WordNet is symbol (an arbitrary sign, written or printed, that has acquired a conventional significance).
• The immediate hyperonym of commendation (more frequent than accolade) is the accolade synset.
• Actually, accolade is a hyponym of commendation.
• It is impossible to disambiguate accolade (bracket) from accolade (praise).
Controversy 2
• WordNet:
dog 1 – "domestic dog"
hyperonym: canine, canid
further: mammal, …, entity
• Neither animal nor pet is linked to dog as a hyperonym.
• A tree structure is inadequate for semantic coding.
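The limitation can be illustrated with a small sketch. In a tree every word has exactly one parent, so dog can reach only one chain of hyperonyms; in a network a word may have several parents. The graph below is a hypothetical illustration, not WordNet data, and `ancestors` is an illustrative name.

```python
# Hypothetical network: each word may have more than one hyperonym.
HYPERNYMS = {
    "dog": ["canine", "pet"],
    "canine": ["mammal"],
    "pet": ["animal"],
    "mammal": ["animal"],
    "animal": ["entity"],
}


def ancestors(word, graph):
    """All hyperonyms reachable from word, following every parent link."""
    seen = set()
    stack = list(graph.get(word, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


print(sorted(ancestors("dog", HYPERNYMS)))
# ['animal', 'canine', 'entity', 'mammal', 'pet']
```

With a single parent per node, pet could not be reached from dog unless canine were dropped.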
Conclusion
• Each meaning of a polysemous word could be coded as a pair (wE, wR), in contrast to synset coding.
• A metric superimposed over the LDB enables homograph disambiguation and extraction of dominants.
• A network has particular advantages over a hierarchical representation of semantic relations.