Transcript ppt
A Webbased Kernel Function for
Measuring the Similarity of
Short Text Snippets
Mehran Sahami
Timothy D. Heilman
Introduction
Wish to determine how similar two
short text snippets are.
High degree of semantic similarity
United Nations Secretary General vs
Kofi Annan
AI vs Articial Intelligence
Share terms
graphical models vs
graphical interface
5%
Related Work
Query expansion techniques
Other means of determining query
similarity
Set overlap (intersection)
SVM for text classification
Latent Semantic Kernels (LSK)
Semantic Proximity Matrix
Cross-lingual techniques
10%
A New Similarity Function
x represent a short text snippet
(query) to a search engine S
R (x ) be the set of n retrieved
documents d1 , d 2 ,..., d n
Compute the TFIDF term vector vi for
each document di R(x)
Truncate each vector vi to include its
m highest weighted term
15%
Normalize
Let C (x ) be the centroid of the L2
normalized vector vi
C ( x)
n
1
n
i 1
vi
vi 2
Let QE(x) be the L2 normalization of the
centroid C(x)
C ( x)
QE ( x)
C ( x)
2
20%
Kernel Function
K ( x, y ) QE ( x) QE ( y )
25%
Initial Results with Kernel
Three genres of text snippet matching
Acronyms
Individuals and their positions
Multi-faceted terms
30%
Acronyms
Text1
Text2 Kernel
Cosine
Set
Overlap
Support vector machine
SVM
0.812 0.0
0.110
Portable document format
PDF
0.732 0.0
0.060
Artificial intelligence
AI
0.831 0.0
0.255
Artificial insemination
AI
0.391 0.0
0.000
term frequency inverse
document frequency
tf idf
0.831 0.0
0.125
term frequency inverse
document frequency
tfidf
0.507 0.0
0.060
35%
Individuals and their positions
40%
Multi-faceted terms
45%
Related Query Suggestion
Kernel function K (u, qi ) for qi Q
u is any newly issued user query
A repository Q of approximately 116 million
popular user queries issued in 2003,
determined by sampling anonymized web
search logs from the Google search engine
50%
Algorithm
Given user query u and list of matched
queries from repository
Output list Z of queries to suggest
Initialize suggestion list Z
Sort kernel scores K (u,q i ) in descending
order to produce an ordered list L (q1 , q2 ,..., qk )
of corresponding queries qi
MAX is set to the maximum number of
suggestions
55%
Post-Filter
|q| denotes the number of terms in query q
60%
Evaluation of
Query Suggestion System
1.
2.
3.
4.
5.
suggestion is totally off topic.
suggestion is not as good as original
query.
suggestion is basically same as original
query.
suggestion is potentially better than
original query.
suggestion is fantastic - should suggest
this query since it might help a user find
what they're looking for if they issued it
instead of the original query.
65%
Evaluations
Original
Query
california
lottery
Suggested Queries
Kernel
Score
Human
Rating
california lotto home
0.812
3
winning lotto numbers in california
0.792
5
california lottery super lotto plus
0.778
3
0.832
3
0.822
4
valentines day greeting cards
0.758
4
I love you valentine
0.736
2
new valentine one
0.671
1
valentines 2003 valentine's day
day
valentine day card
70%
Average ratings at
various kernel thresholds
75%
Average ratings versus average
number of query suggestions
80%
Application in QA
K("Who shot Abraham Lincoln", "John
Wilkes Booth") = 0.730
K("Who shot Abraham Lincoln",
"Abraham Lincoln") = 0.597
85%
Conclusion
A new kernel function for measuring
the semantic similarity between pairs
of short text snippets
The first is improvement in the
generation of query expansions with
the goal of improving the match score
for the kernel function
Term Weighting Scheme
The weight wi , j associated with the
term ti in document d j is defined to
be : w tf log( N )
i, j
i, j
dfi
Where tfi , j is the frequency of ti in d j
N is the total number of ducuments ,
and dfi is the total number of
documents that contain ti
Lp Norm
Given by:
Most common cases
P=1 ,This is the L1 norm, which is also
called Manhattan distance
P=2 ,This is the L2 norm, which is also
called the Euclidean distance
P= , This is the L norm, also called the
infinity norm or the Chebyshev norm