Probabilistic Semantic Similarity Measurements
for Noisy Short Texts Using Wikipedia Entities
Masumi Shirakawa1, Kotaro Nakayama2, Takahiro Hara1, Shojiro Nishio1
1 Osaka University, Osaka, Japan
2 University of Tokyo, Tokyo, Japan
Challenge in short text analysis
Statistics are not always enough.

"A year and a half after Google pulled its popular search engine out of mainland China"
"Baidu and Microsoft did not disclose terms of the agreement"

They are both talking about search engines and China.
How do machines know that the two sentences mention a similar topic?
Reasonable solution
Use external knowledge.

"A year and a half after Google pulled its popular search engine out of mainland China"
"Baidu and Microsoft did not disclose terms of the agreement"

[Figure: the two sentences are connected through Wikipedia Thesaurus [Nakayama06].]
Related work
ESA: Explicit Semantic Analysis [Gabrilovich07]
Add Wikipedia articles (entities) to a text as its semantic representation.
1. Get search ranking of Wikipedia for each term (i.e., Wiki articles and scores).
2. Simply sum up the scores for aggregation.

[Figure: pipeline for input text T = "Apple sells a new product" — key term extraction yields terms t (Apple, product, sells, new); related entity finding yields scored entities c (pear, iPhone, pricing, iPad, Apple Inc., business); aggregation produces the output ranked list of entities c (Apple Inc., iPhone, pear, ...).]
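ESA's aggregation step above is just a per-entity sum of the per-term score vectors. A minimal sketch, with made-up entity scores purely for illustration:

```python
from collections import Counter

def esa_aggregate(term_scores):
    """Sum per-term entity score vectors into one ranked entity list."""
    total = Counter()
    for scores in term_scores:
        total.update(scores)  # Counter.update adds values for shared keys
    return total.most_common()  # sorted by aggregated score, descending

# Toy score vectors for the terms "Apple" and "product" (values invented).
ranked = esa_aggregate([
    {"Apple Inc.": 0.8, "pear": 0.6},
    {"iPhone": 0.7, "Apple Inc.": 0.5},
])
```

Because every term contributes with equal weight, an ambiguous or unimportant term distorts the ranking — exactly the problem the following slides address.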
Problems in real-world noisy short texts
"Noisy" means semantically noisy in this work.
(We do not handle informal or casual surface forms, or misspellings.)

Term ambiguity
• Apple (fruit) should not be related to Microsoft.
Fluctuation of term dominance
• A term is not always important in texts.

We explore a more effective aggregation method.
Probabilistic method
We propose extended naïve Bayes to aggregate related entities.

[Figure: pipeline for input text T = "Apple sells a new product" — key term extraction [Mihalcea07][Milne08] yields terms t (Apple, product, ...) with key-term probability P(t); related entity finding [Nakayama06] yields entities c (pear, iPhone, pricing, iPad, Apple Inc., ...) with relatedness P(c|t); aggregation [Song11] produces the output ranked list of entities c (Apple Inc., iPhone, iPad, ...).]

From text T to related entities c:

P(c|T) = ∏_{k=1..K} [ P(t_k)·P(c|t_k) + (1 − P(t_k))·P(c) ] / P(c)^(K−1)
When input is multiple terms
Apply naïve Bayes [Song11] to multiple terms t_1, …, t_K to obtain related entity c using each probability P(c|t_k):

P(c|t_1,…,t_K) = P(t_1,…,t_K|c)·P(c) / P(t_1,…,t_K)
               = P(c)·∏_k P(t_k|c) / P(t_1,…,t_K)
               = ∏_k P(c|t_k) / P(c)^(K−1)

[Figure: terms t_1 = "Apple", t_2 = "product", …, t_K = "new" are each linked to entity c = "iPhone" with probabilities P(c|t_1), P(c|t_2), …, P(c|t_K). Compute P(c|t_1,…,t_K) for each related entity c.]
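The final form of this naïve Bayes aggregation can be sketched in a few lines; the probability values below are invented for illustration, and since the term-independence assumption is only approximate, the result is best read as a ranking score rather than a calibrated probability:

```python
import math

def naive_bayes_aggregate(p_c_given_t, p_c):
    """Score entity c against terms t_1..t_K:
    P(c|t_1,...,t_K) = prod_k P(c|t_k) / P(c)^(K-1)."""
    K = len(p_c_given_t)
    return math.prod(p_c_given_t) / p_c ** (K - 1)

# Toy values: entity c = "iPhone" against terms "Apple", "product", "new".
score = naive_bayes_aggregate([0.30, 0.10, 0.05], p_c=0.01)
```

An entity with a low prior P(c) that nevertheless matches several terms gets a large boost, which is the effect the next slide highlights.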
When input is multiple terms
By using naïve Bayes, entities that are related to multiple terms can be boosted.
When input is text
Not "multiple terms" but "text," i.e., we don't know which terms are key terms.
We developed extended naïve Bayes to solve this problem.

[Figure: same diagram as before, but we cannot observe which of t_1 = "Apple", t_2 = "product", …, t_K = "new" are key terms for entity c = "iPhone".]
Extended naïve Bayes
Each subset of the candidate key terms (Apple, product, …, new) is a possible state T' of the set of key terms T: T' = {t_1}, T' = {t_1, t_2}, …, T' = {t_1, …, t_K}.
Apply naïve Bayes to each state T', weighted by the probability that the set of key terms T is the state T': P(T = T').
Extended naïve Bayes
Marginalizing over all states T':

P(c|T) = Σ_{T'} P(c|T')·P(T = T')
       = ∏_{k=1..K} [ P(t_k)·P(c|t_k) + (1 − P(t_k))·P(c) ] / P(c)^(K−1)
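The closed-form product avoids enumerating all 2^K key-term states explicitly. A small sketch can check both sides of the identity numerically, with toy probabilities invented for illustration (a state's naïve Bayes score is ∏_{t_k in T'} P(c|t_k) / P(c)^(|T'|−1), and the empty state falls back to the prior P(c)):

```python
from itertools import combinations

def closed_form(p_t, p_c_given_t, p_c):
    """P(c|T) = prod_k [P(t_k) P(c|t_k) + (1 - P(t_k)) P(c)] / P(c)^(K-1)."""
    prod = 1.0
    for pt, pct in zip(p_t, p_c_given_t):
        prod *= pt * pct + (1.0 - pt) * p_c
    return prod / p_c ** (len(p_t) - 1)

def sum_over_states(p_t, p_c_given_t, p_c):
    """Sum P(c|T') P(T = T') over all 2^K key-term states T'."""
    K = len(p_t)
    total = 0.0
    for size in range(K + 1):
        for state in combinations(range(K), size):
            # P(T = T'): each t_k is a key term independently with prob P(t_k)
            p_state = 1.0
            for k in range(K):
                p_state *= p_t[k] if k in state else 1.0 - p_t[k]
            # Naive Bayes over the terms in T'; the empty state gives P(c)
            p_c_given_state = p_c
            for k in state:
                p_c_given_state *= p_c_given_t[k] / p_c
            total += p_state * p_c_given_state
    return total

# Toy probabilities (illustrative only): terms "Apple", "product", "new"
# against entity c = "iPhone".
p_t = [0.9, 0.6, 0.2]           # P(t_k): key-term probabilities
p_c_given_t = [0.3, 0.1, 0.05]  # P(c|t_k): per-term relatedness
p_c = 0.01                      # P(c): entity prior
```

Expanding the product term by term reproduces exactly one summand per state T', which is why the two functions agree.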
Extended naïve Bayes
Term dominance is incorporated into naïve Bayes.
Experiments on short text similarity datasets
[Datasets] Four datasets derived from word similarity datasets using a dictionary
[Comparative methods] Original ESA [Gabrilovich07], ESA with 16 parameter settings
[Metrics] Spearman's rank correlation coefficient

ESA with a well-adjusted parameter is superior to our method for "clean" texts.
Tweet clustering
K-means clustering using the vector of related entities for measuring distance
[Dataset] 12,385 tweets including 13 topics:
#MacBook (1,251), #Silverlight (221), #VMWare (890), #MySQL (1,241), #Ubuntu (988), #Chrome (1,018), #NFL (1,044), #NHL (1,045), #MLB (752), #MLS (981), #NASCAR (878), #NBA (1,085), #UFC (991)
[Comparative methods] Bag-of-words (BOW), ESA with the same parameter, ESA with well-adjusted parameter
[Metric] Average of Normalized Mutual Information (NMI) over 20 runs
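NMI, the clustering quality metric used here, compares predicted clusters against the hashtag topics. A self-contained sketch using one common normalization (the geometric mean of the two entropies; the labels below are toy values, not the paper's data):

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy of a label assignment (natural log)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(true_labels, pred_labels):
    """Normalized mutual information: I(U;V) / sqrt(H(U) * H(V))."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    pu, pv = Counter(true_labels), Counter(pred_labels)
    mi = sum((c / n) * log(n * c / (pu[u] * pv[v]))
             for (u, v), c in joint.items())
    denom = (entropy(true_labels) * entropy(pred_labels)) ** 0.5
    return mi / denom if denom > 0 else 0.0
```

NMI is 1 for a clustering that matches the topics up to a relabeling, and 0 when clusters and topics are independent, which makes it a natural score for K-means output where cluster IDs are arbitrary.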
Results
[Figure: NMI score vs. number of related entities (10–2,000) for BOW, ESA-same, ESA-adjusted, and our method; peak scores 0.567, 0.524, 0.429, and 0.421; p-value < 0.01.]
Results
Our method outperformed ESA with a well-adjusted parameter for noisy short texts.
Conclusion
We proposed extended naïve Bayes to derive related Wikipedia entities from real-world noisy short texts.

[Future work]
Tackle multilingual short texts
Develop applications of the method