The Google Similarity Distance

• We’ve been talking about natural language parsing.
• Understanding the meaning of a sentence requires knowing relationships between words, e.g.
  house -> square
  house -> home
  house -> rooms
• There are many of these in our language!
• There are ongoing attempts to build databases of these relationships. They are time and labour intensive.
• The Web is the largest text database on Earth. It contains low-grade information in abundance.
• There are two kinds of objects about which knowledge can be attained: actual objects (a graph) and names of objects (“a graph”).
• Actual objects can be compared for similarity through their features.
• Names of objects can be compared for similarity through ‘Google semantics’, i.e. how they occur together on the web.
The Idea:
• Define a new kind of semantics understandable by a computer.
• Google semantics: the content of the pages returned for a query on a word.
• For a pair of words: the pages returned after querying the words singly, and then together.
• Semantics is the context in which the words appear. Links from the pages to additional context are ignored.
• This only identifies associations, not similarity of meaning. For example, “rich” and “poor” will often occur together.
The method:
• Count how many pages are returned by Google for “monkey”, “president” and “monkey president”.
  Monkey: 74,200,000
  President: 363,000,000
  Monkey president: 2,230,000
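The counting step can be illustrated on a toy scale. The sketch below is not how Google counts pages; it assumes a tiny, invented list of "pages" (PAGES) and a hypothetical helper page_count that counts the pages containing every query word, mirroring the idea of querying the words singly and then together.

```python
# Toy stand-in for a search index: each string is one "page".
# These pages are invented for illustration only.
PAGES = [
    "the monkey sat in the tree",
    "the president gave a speech",
    "a monkey met the president",
    "rooms in a house",
]


def page_count(*words):
    """Return how many pages contain every one of the given words."""
    return sum(
        all(word in page.split() for word in words)
        for page in PAGES
    )


# Query each word singly, then both together.
counts = {
    "monkey": page_count("monkey"),
    "president": page_count("president"),
    "monkey president": page_count("monkey", "president"),
}
print(counts)
```

The combined count can never exceed either single-word count, because a page matching both words also matches each word on its own.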
The Google Distribution:
• The set of pages returned for a word x is the event x.
• The set of pages returned for words x and y together is the event x∩y.
• The total number of pages is taken to be 8 × 10^9.
• The probability L of monkey is
  L(monkey) = 74,200,000 / total number of pages = 0.009275
• The probability L of president is
  L(president) = 363,000,000 / total number of pages = 0.045375
• The probability L of monkey∩president is
  L(monkey∩president) = 2,230,000 / total number of pages = 0.00027875
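The three probabilities can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses the page counts and the total of 8 × 10^9 pages quoted in the slides; the names TOTAL_PAGES, page_counts and google_probability are illustrative and do not come from the source.

```python
# Total number of pages, as quoted in the slides (8 x 10^9).
TOTAL_PAGES = 8_000_000_000

# Page counts quoted in the slides for each query.
page_counts = {
    "monkey": 74_200_000,
    "president": 363_000_000,
    "monkey president": 2_230_000,
}


def google_probability(count, total=TOTAL_PAGES):
    """Fraction of all pages that are returned for a query."""
    return count / total


probabilities = {
    term: google_probability(count) for term, count in page_counts.items()
}

for term, p in probabilities.items():
    print(f"L({term}) = {p:.8f}")
```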
Normalisation:
• The values are normalised to produce a normalised Google distance (NGD).
• N = the sum of the three page counts:
  74,200,000 + 363,000,000 + 2,230,000 = 439,430,000
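The sum for N can be checked directly. The sketch below only repeats the addition shown in the slide, using the three page counts from the monkey/president example; the variable names are illustrative.

```python
# Page counts for "monkey", "president" and "monkey president",
# as quoted in the slides.
page_counts = [74_200_000, 363_000_000, 2_230_000]

# N as defined in the slide: the sum of the three counts.
N = sum(page_counts)
print(f"N = {N:,}")
```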