Automatic Acquisition of Synonyms Using the Web as a Corpus
3rd Annual South East European Doctoral Student Conference (DSC2008): Infusing Knowledge and Research in South East Europe
Svetlin Nakov, Sofia University "St. Kliment Ohridski"
[email protected]
DSC 2008 – 26-27 June 2008, Thessaloniki, Greece
Introduction
We want to automatically extract all pairs of synonyms within a given text
Our goal:
Design an algorithm that can distinguish between synonyms and non-synonyms
Our approach:
Measure semantic similarity using the Web as a corpus
Synonyms are expected to have higher semantic similarity than non-synonyms
The Paper in One Slide
Measuring semantic similarity
Analyze the words' local contexts
Use the Web as a corpus
Similar contexts → similar words
TF.IDF weighting & reverse context lookup
Evaluation
94 words (Russian fine arts terminology)
50 synonym pairs to be found
11pt average precision: 63.16%
Contextual Web Similarity
What is a local context?
A few words before and after the target word (see the sketch below):
Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers.
The words in the local context of a given word are semantically related to it
Need to exclude stop words: prepositions, pronouns, conjunctions, etc.
Stop words appear in all contexts
A sufficiently big corpus is needed
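As a rough illustration (not part of the original slides), collecting a local context from text snippets could look like the Python sketch below. The window size of 3, the toy English stop-word list and all names are assumptions for illustration only; the real experiments used 507 Russian stop words.

```python
import re
from collections import Counter

# Illustrative stop-word list only; the actual experiments used 507 Russian stop words.
STOP_WORDS = {"of", "and", "for", "by", "from", "our", "the", "a"}

def local_context(snippets, target, window=3):
    """Count the words occurring within `window` positions of `target`,
    skipping stop words, over a collection of text snippets."""
    counts = Counter()
    for snippet in snippets:
        tokens = re.findall(r"\w+", snippet.lower())
        for i, token in enumerate(tokens):
            if token != target:
                continue
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(w for w in neighbours if w not in STOP_WORDS)
    return counts

# Toy usage with the example sentence from the slide:
snippet = ("Same day delivery of fresh flowers, roses, and unique gift baskets "
           "from our online boutique. Flower delivery online by local florists "
           "for birthday flowers.")
print(local_context([snippet], "flowers"))
```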
Contextual Web Similarity
Web as a corpus
The Web can be used as a corpus from which to extract the local context of a given word
The Web is the largest possible corpus
It contains large corpora in any language
Searching for a word in Google returns up to 1 000 text snippets
The target word comes with its local context: a few words before and after it
The target language can be specified
Contextual Web Similarity
Web as a corpus
Example: Google query for "flower"
Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ...
Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears
presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30
years.
Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ...
Wide selection of BOUQUETS, FLORAL ARRANGEMENTS,
CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate
for various occasions. CREDIT cards acceptable.
Flowers, plants, roses, & gifts. Flowers delivery with fewer ...
Flowers, roses, plants and gift delivery. Order flowers from ProFlowers
once, and you will never use flowers delivery from florists again.
Contextual Web Similarity
Measuring semantic similarity
For two given words, their local contexts are extracted from the Web
A local context is a set of words with their frequencies
Semantic similarity is measured as the similarity between these local contexts
Local contexts are represented as frequency vectors over a fixed set of words
The cosine between the frequency vectors in Euclidean space is calculated (see the sketch below)
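A minimal sketch of the cosine computation described above, using a few of the counts from the next slide as toy data; function and variable names are illustrative, not taken from the paper.

```python
import math
from collections import Counter

def cosine_similarity(ctx1: Counter, ctx2: Counter) -> float:
    """Cosine between two context-frequency vectors given as word -> count maps."""
    dot = sum(ctx1[w] * ctx2[w] for w in ctx1.keys() & ctx2.keys())
    norm1 = math.sqrt(sum(c * c for c in ctx1.values()))
    norm2 = math.sqrt(sum(c * c for c in ctx2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)

# Toy vectors built from a few of the counts shown on the next slide:
flower = Counter({"fresh": 217, "order": 204, "rose": 183, "delivery": 165})
computer = Counter({"internet": 291, "pc": 286, "technology": 252, "order": 185})
print(cosine_similarity(flower, computer))  # small value: dissimilar contexts
```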
Contextual Web Similarity
Example of context word frequencies

word: flower               word: computer
word       count           word         count
fresh      217             Internet     291
order      204             PC           286
rose       183             technology   252
delivery   165             order        185
gift       124             new          174
welcome    98              Web          159
red        87              site         146
...        ...             ...          ...
Contextual Web Similarity
Example of frequency vectors

#      word        v1: flower (freq.)   v2: computer (freq.)
0      alias       3                    7
1      alligator   2                    0
2      amateur     0                    8
3      apple       5                    133
...    ...         ...                  ...
4999   zap         0                    3
5000   zoo         6                    0

Similarity = cosine(v1, v2)
TF.IDF Weighting
TF.IDF (term frequency times inverse document frequency; see the sketch below)
A statistical measure used in information retrieval
Shows how important a certain word is for a given document within a collection of documents
Increases proportionally to the number of the word's occurrences in the document
Decreases as the total number of documents containing the word grows
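As a rough illustration (not from the slides), the re-weighting could be implemented as below, treating each word's Web context as one "document". This is one standard TF.IDF formulation; the slides do not specify the exact variant used, and all names are assumed for illustration.

```python
import math
from collections import Counter

def tf_idf_weights(contexts):
    """Re-weight raw context frequencies by TF.IDF.

    `contexts` maps each target word to the Counter of its local context;
    every context is treated as one 'document'."""
    n_docs = len(contexts)
    doc_freq = Counter()
    for ctx in contexts.values():
        doc_freq.update(set(ctx))  # in how many contexts each word appears
    weighted = {}
    for target, ctx in contexts.items():
        total = sum(ctx.values())
        weighted[target] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in ctx.items()
        }
    return weighted
```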
Reverse Context Lookup
Local contexts extracted from the Web can contain arbitrary "parasite" words like "online", "home", "search", "click", etc.
Such Internet terms appear on almost any Web page
Such words are not likely to be genuinely associated with the target word
Example (for the word flowers):
"send flowers online", "flowers here", "order flowers here"
Will the word "flowers" appear in the local contexts of "send", "online" and "here"?
Reverse Context Lookup
If two words are semantically related, then both of them should appear in the local context of each other
Let #(x, y) = the number of occurrences of x in the local context of y
For any word w and a word wc from its local context, we define their strength of semantic association p(w, wc) as follows (see the sketch below):
p(w, wc) = min{ #(w, wc), #(wc, w) }
We use p(w, wc) as the vector coordinates
We introduce a minimal occurrence threshold (e.g. 5) to filter out words appearing just by chance
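A minimal sketch of the association strength defined above, assuming `contexts` maps each word to the Counter of its local context (as in the earlier sketches); names and the data structure are assumptions, not the paper's implementation.

```python
from collections import Counter

def association_strength(contexts, w, wc, threshold=5):
    """p(w, wc) = min(#(w, wc), #(wc, w)), where #(x, y) is the number of
    occurrences of x in the local context of y.  Values below `threshold`
    are treated as chance co-occurrences and zeroed out."""
    count_w_in_wc = contexts.get(wc, Counter())[w]   # #(w, wc)
    count_wc_in_w = contexts.get(w, Counter())[wc]   # #(wc, w)
    strength = min(count_w_in_wc, count_wc_in_w)
    return strength if strength >= threshold else 0
```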
Data Set
We use a list of 94 Russian words:
Terms extracted from texts on the subject of fine arts
Limited to nouns only
The data set:
абрис, адгезия, алмаз, алтарь, амулет, асфальт, беломорит, битум, бородки, ваятель, вермильон, ..., шлифовка, штихель, экспрессивность, экспрессия, эстетизм, эстетство
There are 50 synonym pairs among these words
We expect our algorithms to find them
Experiments
We tested several modifications of our contextual Web similarity algorithm:
The basic algorithm (without modifications)
TF.IDF weighting
Reverse context lookup with different frequency thresholds
Experiments
RAND – random ordering of all the pairs
SIM – the basic algorithm for extracting semantic similarity from the Web
Context size of 3 words
Without analyzing the reverse context
With lemmatization
SIM+TFIDF – modification of the SIM algorithm with TF.IDF weighting
REV2, REV3, REV4, REV5, REV6, REV7 – the SIM algorithm + "reverse context lookup" with frequency thresholds of 2, 3, 4, 5, 6 and 7
Resources Used
We used the following resources:
The Google Web search engine: we extracted the first 1 000 results for 82 645 Russian words
A Russian lemma dictionary: 1 500 000 wordforms and 100 000 lemmata
A list of 507 Russian stop words
Evaluation
Our algorithms order all pairs of words by their semantic similarity
We expect the 50 synonym pairs to be at the top of the result list
We count how many synonyms are found in the top N results (e.g. top 5, top 10, etc.)
We measure precision and recall
We use 11pt average precision to evaluate the results (see the sketch below)
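For reference, a minimal sketch of 11-point interpolated average precision over a ranked list of candidate pairs; the function and argument names are assumptions for illustration, not the evaluation code used by the authors.

```python
def eleven_point_average_precision(is_synonym, total_synonyms):
    """11-point interpolated average precision for a ranked list of word pairs.

    `is_synonym[i]` says whether the pair at rank i+1 is a gold synonym pair;
    `total_synonyms` is the number of synonym pairs to be found (50 here)."""
    precisions, recalls = [], []
    found = 0
    for rank, correct in enumerate(is_synonym, start=1):
        if correct:
            found += 1
        precisions.append(found / rank)
        recalls.append(found / total_synonyms)
    points = []
    for level in (i / 10 for i in range(11)):          # recall levels 0.0 .. 1.0
        attainable = [p for p, r in zip(precisions, recalls) if r >= level]
        points.append(max(attainable) if attainable else 0.0)
    return sum(points) / 11
```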
SIM Algorithm – Results

n   Word 1           Word 2       Semantic Similarity   Synonyms   Precision @ n   Recall @ n
1   выжигание        пирография   0.433805              yes        100.00%          2%
2   тонирование      тонировка    0.382357              yes        100.00%          4%
3   гематит          кровавик     0.325138              yes        100.00%          6%
4   подрамок         подрамник    0.271659              yes        100.00%          8%
5   оливин           перидот      0.252256              yes        100.00%         10%
6   полирование      шлифование   0.220559              no          83.33%         10%
7   полировка        шлифовка     0.216347              no          71.43%         10%
8   амулет           талисман     0.200595              yes         75.00%         12%
9   пластификаторы   мягчители    0.170770              yes         77.78%         14%
...

Precision and recall obtained by the SIM algorithm
Comparison of the Algorithms

Algorithm    1     5     10    20    30    40    50    100   200   Max
RAND         0     0.1   0.1   0.2   0.3   0.4   0.6   1.1   2.3   50
SIM          1     5     8     15    18    23    25    39    48    50
SIM+TFIDF    1     4     8     16    22    27    29    43    48    50
REV2         1     4     8     16    21    27    32    42    43    46
REV3         1     4     8     16    20    28    32    41    42    46
REV4         1     4     8     15    20    28    33    41    42    45
REV5         1     4     8     15    20    28    33    40    41    42
REV6         1     4     8     15    22    28    32    39    40    42
REV7         1     4     8     15    21    27    30    37    39    40

Comparison of the algorithms (number of synonyms in the top N results)
Comparison of the Algorithms (11pt Average Precision)

[Bar chart: 11pt average precision per algorithm. RAND: 1.15%; the two best configurations reach 58.98% and 63.16%; the remaining bar values are not recoverable from the chart.]

Comparing RAND, SIM, SIM+TFIDF and REV2 … REV7
Results (Precision-Recall Graph)
Comparing the recall-precision graphs of the evaluated algorithms
Discussion
Our approach is original because it:
Measures semantic similarity automatically
Uses the Web as a corpus
Does not rely on any preexisting corpora
Does not require semantic resources like WordNet or EuroWordNet
Works for any language
Tested for Bulgarian and Russian
Uses reverse context lookup and TF.IDF
Significant improvement in quality
Discussion
Good accuracy, but still far from 100%
Known problems of the proposed algorithms:
Semantically related words are not always synonyms
red – blue
wood – pine
apple – computer
Similar contexts do not always mean similar words (distributional hypothesis)
The Web as a corpus introduces noise
Google returns only the first 1 000 results
Discussion
Known problems of the proposed algorithms:
Google ranks news portals, travel agencies and retail sites higher than books, articles and forum messages
The local context always contains noise
We work with single words and do not capture phrases
Conclusion and Future Work
Conclusion
Our algorithms can distinguish between synonyms and non-synonyms
The accuracy should be improved
Future Work
Additional techniques to distinguish between synonyms and merely semantically related words
Improve the semantic similarity measurement algorithm
References
Hearst M. (1991). "Noun Homograph Disambiguation Using Local Context in
Large Text Corpora". In Proceedings of the 7th Annual Conference of the
University of Waterloo Centre for the New OED and Text Research, Oxford,
England, pages 1-22.
Nakov P., Nakov S., Paskaleva E. (2007a). “Improved Word Alignments Using
the Web as a Corpus”. In Proceedings of RANLP'2007, pages 400-405,
Borovetz, Bulgaria.
Nakov S., Nakov P., Paskaleva E. (2007b). “Cognate or False Friend? Ask the
Web!”. In Proceedings of the Workshop on Acquisition and Management of
Multilingual Lexicons, held in conjunction with RANLP'2007, pages 55-62,
Borovetz, Bulgaria.
Sparck-Jones K. (1972). “A Statistical Interpretation of Term Specificity and its
Application in Retrieval”. Journal of Documentation, volume 28, pages 11-21.
Salton G., McGill M. (1983), Introduction to Modern Information Retrieval,
McGraw-Hill, New York.
Paskaleva E. (2002). “Processing Bulgarian and Russian Resources in Unified
Format”. In Proceedings of the 8th International Scientific Symposium
MAPRIAL, Veliko Tarnovo, Bulgaria, pages 185-194.
Harris Z. (1954). "Distributional Structure". Word, 10, pages 146-162.
Lin D. (1998). "Automatic Retrieval and Clustering of Similar Words". In
Proceedings of COLING-ACL'98, Montreal, Canada, pages 768-774.
Curran J., Moens M. (2002). "Improvements in Automatic Thesaurus Extraction". In Proceedings of the Workshop on Unsupervised Lexical Acquisition, SIGLEX 2002, Philadelphia, USA, pages 59-67.
References
Plas L., Tiedemann J. (2006). "Finding Synonyms Using Automatic Word
Alignment and Measures of Distributional Similarity". In Proceedings of
COLING/ACL 2006, Sydney, Australia.
Och F., Ney H. (2003). "A Systematic Comparison of Various Statistical
Alignment Models". Computational Linguistics, 29 (1), 2003.
Hagiwara M., Ogawa Y., Toyama K. (2007). "Effectiveness of Indirect Dependency for Automatic Synonym Acquisition". In Proceedings of the CoSMo 2007 Workshop, held in conjunction with CONTEXT 2007, Roskilde, Denmark.
Kilgarriff A., Grefenstette G. (2003). "Introduction to the Special Issue on the
Web as Corpus", Computational Linguistics, 29(3):333–347.
Inkpen D. (2007). "Near-synonym Choice in an Intelligent Thesaurus". In
Proceedings of the NAACL-HLT, New York, USA.
Chen H., Lin M., Wei Y. (2006). "Novel Association Measures Using Web
Search with Double Checking". In Proceedings of the COLING/ACL 2006,
Sydney, Australia, pages 1009-1016.
Sahami M., Heilman T. (2006). "A Web-based Kernel Function for Measuring
the Similarity of Short Text Snippets". In Proceedings of 15th International
World Wide Web Conference, Edinburgh, Scotland.
Bollegala D., Matsuo Y., Ishizuka M. (2007). "Measuring Semantic Similarity
between Words Using Web Search Engines", In Proceedings of the 16th
International World Wide Web Conference (WWW2007), Banff, Canada, pages
757-766.
Sanchez D., Moreno A. (2005), "Automatic Discovery of Synonyms and
Lexicalizations from the Web". Artificial Intelligence Research and
Development, Volume 131, 2005.
Automatic Acquisition of Synonyms Using the Web as a Corpus
Questions?