WWW 2007 Paper Presentation


Measuring Semantic Similarity between Words
Using Web Search Engines
Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka

Topic


Semantic similarity measures between two words
Why interesting?

In information retrieval




Query expansion
Automatic annotation of Web pages
Community mining
In natural language processing




Word-sense disambiguation
Synonym extraction
Language modeling
…
WWW 2007 Paper Presentation
Zheshen Wang
May 8th, 2007
Solution proposed
By using the information available on the Web
Page Counts + Text Snippets
SVM for an optimal combination



Page Counts


Co-occurrence measures: Jaccard, Overlap (Simpson), Dice, PMI
Modification: Suppress random co-occurrences


Score = 0 if H(P∩Q) < c, where H(x) is the page count for the query x
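The four page-count measures and the cutoff above can be sketched as follows. This is a minimal illustration using the textbook definitions of Jaccard, Overlap, Dice and PMI; the function name `web_scores`, the index size `n_pages`, and the default threshold `c` are placeholders chosen for the example, not values taken from the slides.

```python
import math

ZERO_SCORES = {"jaccard": 0.0, "overlap": 0.0, "dice": 0.0, "pmi": 0.0}

def web_scores(h_p, h_q, h_pq, n_pages=10**10, c=5):
    """Co-occurrence scores from page counts.

    h_p, h_q : page counts H(P), H(Q) for the single-word queries
    h_pq     : page count H(P∩Q) for the conjunctive query
    n_pages  : assumed number of indexed pages (placeholder value)
    c        : assumed cutoff below which co-occurrence is treated as random
    """
    # Suppress random co-occurrences: every score is 0 when H(P∩Q) < c.
    # The extra guard avoids division by zero on inconsistent counts.
    if h_pq < c or min(h_p, h_q) <= 0:
        return dict(ZERO_SCORES)

    jaccard = h_pq / (h_p + h_q - h_pq)
    overlap = h_pq / min(h_p, h_q)
    dice = 2 * h_pq / (h_p + h_q)
    # PMI compares the joint probability with the product of the marginals.
    pmi = math.log2((h_pq / n_pages) / ((h_p / n_pages) * (h_q / n_pages)))
    return {"jaccard": jaccard, "overlap": overlap, "dice": dice, "pmi": pmi}
```

For example, `web_scores(100, 200, 50, n_pages=1000)` gives a Jaccard score of 0.2 and an Overlap score of 0.5, while any pair with fewer than `c` joint hits scores 0 on all four measures.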
Text Snippets (context- and statistics-based): frequencies of the top 200 patterns

Lexico-syntactic Pattern Extraction
e.g. “Toyota and Nissan are two major Japanese car manufacturers.”

If a pattern appears far more often
in snippets for synonymous word pairs than in snippets for non-synonymous word pairs,
it is a reliable indicator of synonymy.
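The pattern extraction step can be sketched as follows. This is a simplified illustration, not the authors' implementation: the two query words are replaced by the placeholders X and Y, and every word n-gram that contains each placeholder exactly once is counted as a pattern. The function name `extract_patterns`, the whitespace tokenizer, and the default n-gram range (2 to 5) are assumptions made for the example.

```python
from collections import Counter

def extract_patterns(snippet, word_a, word_b, n_values=(2, 3, 4, 5)):
    """Count n-gram patterns that contain X and Y exactly once each."""
    tokens = []
    for raw in snippet.split():
        # Naive tokenization: strip surrounding punctuation and lowercase.
        core = raw.strip('.,;:!?"\'').lower()
        if not core:
            continue
        if core == word_a.lower():
            tokens.append("X")
        elif core == word_b.lower():
            tokens.append("Y")
        else:
            tokens.append(core)

    patterns = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram.count("X") == 1 and gram.count("Y") == 1:
                patterns[" ".join(gram)] += 1
    return patterns

snippet = "Toyota and Nissan are two major Japanese car manufacturers."
print(extract_patterns(snippet, "Toyota", "Nissan"))
```

On the example snippet this yields the three patterns "X and Y", "X and Y are" and "X and Y are two", each with count 1. Deciding which patterns are reliable would then require comparing their counts over snippets for synonymous and non-synonymous pairs, which is not shown here.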


Combination


204-D feature vector F = [200 pattern frequencies, 4 co-occurrence measures]
Two-class SVM

synonymous word-pairs (Positive), non-synonymous word-pairs (Negative)
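Assembling the 204-D feature vector for one word pair can be sketched as follows. The names `build_feature_vector`, `MEASURES`, and the dummy pattern list are hypothetical, and raw pattern counts are used without any normalization; training the two-class SVM on these vectors (label +1 for synonymous pairs, -1 for non-synonymous pairs) would be done with an SVM library and is not shown.

```python
MEASURES = ("jaccard", "overlap", "dice", "pmi")

def build_feature_vector(pattern_counts, top_patterns, cooccurrence_scores):
    """Concatenate pattern frequencies and co-occurrence scores.

    pattern_counts      : dict mapping pattern string -> count for this word pair
    top_patterns        : ordered list of the selected patterns (200 in the slides)
    cooccurrence_scores : dict with one score per name in MEASURES
    """
    # A pattern that never occurred for this pair contributes 0.
    pattern_part = [float(pattern_counts.get(p, 0)) for p in top_patterns]
    measure_part = [float(cooccurrence_scores[name]) for name in MEASURES]
    return pattern_part + measure_part

# Illustration with made-up inputs: 2 named patterns padded to 200 entries.
top_patterns = ["X and Y", "X is a Y"] + [f"pattern_{i}" for i in range(198)]
counts = {"X and Y": 3}
scores = {"jaccard": 0.2, "overlap": 0.5, "dice": 0.33, "pmi": 1.3}
vector = build_feature_vector(counts, top_patterns, scores)
print(len(vector))  # 204
```

Keeping the pattern order fixed across all word pairs matters, because the classifier treats each position in the vector as the same feature for every training example.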
My criticisms of the solution

Statistics- and context-based pattern selection is not
reliable (no ontology or syntax templates)




Sparse Distribution
Noise (meaningless patterns)
Correlations (e.g. “X and Y”, “X and Y are”, “X and Y are two”)
Meaningful patterns are missed because of the limited n-gram range
(e.g. when X and Y are far apart, beyond the n-gram range n = 2, 3, 4, 5, as in
“Rose is a very popular flower in the US.”)
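The range limitation can be checked with a small sketch that computes the shortest token window containing both words. The function name `min_window` and the whitespace tokenizer are assumptions for illustration only.

```python
def min_window(snippet, word_a, word_b):
    """Length in tokens of the shortest span covering both words, or None."""
    tokens = [t.strip('.,;:!?"\'').lower() for t in snippet.split()]
    pos_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    if not pos_a or not pos_b:
        return None
    # A span from index i to index j covers abs(i - j) + 1 tokens.
    return min(abs(i - j) for i in pos_a for j in pos_b) + 1

print(min_window("Rose is a very popular flower in the US.", "rose", "flower"))
```

For the Rose example the shortest window is 6 tokens, so no n-gram with n up to 5 can contain both words and the "X is a ... Y" relation goes unobserved.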


Feature vector F = [200 pattern frequencies, 4 co-occurrence measures]
Error-prone for uncommon words



e.g. rarely used professional terms
The base set from the Web is too small to be reliable.
As in the case of CBioC, user voting would be better
How is it related to our course?

Web-based information extraction (Knowledge Extraction)




Making use of Collective Unconscious—Big Idea 3


Analyzing term co-occurrences to capture semantic information
Co-occurrence measures



Extract base level knowledge (“facts”) directly from the web
Page counts (hits), e.g. KnowItAll
Inevitable drawback: error-prone for words that are uncommon on the Web, e.g. CBioC
Similarity measure in terms of co-occurrence
Jaccard, Overlap (Simpson), PMI…
Making use of context based on statistics



Patterns from context rather than from an ontology (“SemTag & Seeker”).
Patterns decided by statistics rather than by templates from a syntax tree (generic
extraction patterns, Hearst ’92).
n-grams for a word, somewhat like the “20-word-window” of “spot(l,c)” in
“SemTag & Seeker”.