GDM@FUDAN http://gdm.fudan.edu.cn
Graph Data Management Lab, School of Computer Science
Public Opinion Mining
on Micro-blogs
By zerup
What is an opinion?
An opinion is a subjective belief, the result of emotion or of the interpretation of facts.
An opinion may be supported by an argument or a set of facts.
An opinion may be the result of a person's perspective, understanding, particular feelings, beliefs, and desires.
Based on these features, we can build a dictionary of subjective words to pick out opinions from micro-blogs that largely state facts.
Problem Definition
Given thousands of micro-blogs on the same topic, mine the top-k most popular opinions.
Solution Framework
• Word partition
• Similarity calculation from the extracted keywords
• Clustering
Why clustering? Why AP?
In information theory, an opinion is a set of encoding symbols.
In human cognition, an opinion is made up of a subjective part, an objective part, and a comment.
In a semantic network, for a given topic, opinions sharing common words have similar meanings.
Based on these three perspectives, we construct a graph with each micro-blog as a data point and mutual information as the edge weight, then apply a clustering algorithm to the graph to find the opinion exemplars.
Basic Idea of Affinity Propagation (AP)
AP exchanges two kinds of messages between a data point i and a candidate exemplar k:
Responsibility r(i,k): sent from i to k, reflecting how well-suited k is to serve as i's exemplar.
Availability a(i,k): sent from k to i, reflecting how appropriate it would be for i to choose k as its exemplar.
s(i,k) for i ≠ k
• For each word w in sentence i, encode w using the words of sentence k when possible, otherwise using the dictionary.
• s(i,k) is the negative sum of the encoding costs (log base 2).
Example:
• Sentence i has keywords: A B
• Sentence k has keywords: A C D E
• The dictionary has keywords: A B C D E F G H
• s(i,k) = −(log 4 + log 8) = −(2 + 3) = −5
• s(k,i) = −(log 2 + 3·log 8) = −(1 + 9) = −10
s(k,k) is set to the input preference for k to be an exemplar.
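This encoding-cost similarity can be sketched as follows (assuming log base 2, which matches the slide's numbers; the function name is illustrative):

```python
import math

def encoding_similarity(sent_i, sent_k, dictionary):
    """s(i, k): negative total cost (in bits) of encoding each keyword
    of sentence i using sentence k's keywords when possible, falling
    back to the full dictionary otherwise."""
    cost = 0.0
    for w in sent_i:
        if w in sent_k:
            cost += math.log2(len(sent_k))      # encode within sentence k
        else:
            cost += math.log2(len(dictionary))  # fall back to the dictionary
    return -cost

# The example above:
i = {"A", "B"}
k = {"A", "C", "D", "E"}
dictionary = {"A", "B", "C", "D", "E", "F", "G", "H"}
print(encoding_similarity(i, k, dictionary))  # -5.0  = -(log2(4) + log2(8))
print(encoding_similarity(k, i, dictionary))  # -10.0 = -(log2(2) + 3*log2(8))
```

Note that the measure is asymmetric, which is one reason a later slide considers making the similarity symmetric.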
r(i,k)
For i ≠ k:
r(i,k) = s(i,k) − max_{k' ≠ k} { a(i,k') + s(i,k') }
For i = k:
r(k,k) = s(k,k) − max_{k' ≠ k} { a(k,k') + s(k,k') }
In the first iteration the availabilities are zero, so r(k,k) = s(k,k) − max_{k' ≠ k} s(k,k').
a(i,k)
For i ≠ k:
a(i,k) = min{ 0, r(k,k) + Σ_{i' ∉ {i,k}} max(0, r(i',k)) }
For i = k:
a(k,k) = Σ_{i' ≠ k} max(0, r(i',k))
Result of AP
The number of iterations is predefined because AP is not guaranteed to converge.
Finally, we get a matrix M = A + R, where m_ik = a(i,k) + r(i,k). For each diagonal entry with m_kk > 0, we take k as an exemplar. Each remaining point i is then assigned to the exemplar k that maximizes m_ik.
Framework of AP
• Input: the filtered micro-blogs and the word dictionary
• Word partition: produces each word together with its property (part of speech)
• Keyword extraction via the semantic network: keeps the words with real meaning
• Clustering: returns the number of clusters and each cluster center with its members
• Output
Preprocessing of Micro-blogs
First, inspecting all the micro-blogs, we find that many advertisements share the same embedded URL http://t.cn/agxZ4i. Using this, about 5,000 of the 165,000 micro-blogs in total are filtered out.
Second, we find that many spam micro-blogs contain lots of unrelated words (so they can show up under many topics) and therefore contain many spaces.
Third, we hope to develop a subjective-word dictionary to pick out the opinion sentences.
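The first two filters can be sketched as below; only the spam URL comes from the slides, while the 0.2 space-ratio threshold and the function name are illustrative assumptions:

```python
def filter_spam(microblogs):
    """Drop micro-blogs that carry the known advertisement URL or that
    have an unusually high ratio of spaces (keyword-stuffed spam)."""
    SPAM_URL = "http://t.cn/agxZ4i"   # shared by many advertisements
    kept = []
    for text in microblogs:
        if SPAM_URL in text:
            continue                   # advertisement with the known URL
        if len(text) > 0 and text.count(" ") / len(text) > 0.2:
            continue                   # many spaces -> likely spam
        kept.append(text)
    return kept
```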
Word Partition of AP
Accuracy of word partition
• Tool: ICTCLAS
• Limitations: recognizing person names; unsuitable granularity
Define a user dictionary and keep it updated
• Store the latest network vocabulary
Which words (w.r.t. their properties) are most related to opinion/sentiment?
• We pick nouns, verbs, and adjectives as meaningful keywords
Translate traditional Chinese into simplified Chinese
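Keyword selection by part of speech might look like this; the (word, pos) input format and the one-letter tag prefixes are assumptions about the segmenter's output, not ICTCLAS's actual API:

```python
def extract_keywords(tagged_words):
    """Keep nouns, verbs, and adjectives as keywords. Input is a list of
    (word, pos) pairs from a segmenter; 'n'/'v'/'a' tag prefixes follow
    common Chinese POS conventions and are an assumption here."""
    keep_prefixes = ("n", "v", "a")
    return [w for w, pos in tagged_words if pos.startswith(keep_prefixes)]

# e.g. hypothetical segmenter output for "大雾天,减速行车,注意安全"
tagged = [("大雾", "n"), ("天", "n"), (",", "w"),
          ("减速", "v"), ("行车", "v"), (",", "w"),
          ("注意", "v"), ("安全", "a")]
print(extract_keywords(tagged))  # ['大雾', '天', '减速', '行车', '注意', '安全']
```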
Example of AP
1. 大雾天,但是温度还好,早上起来心情还不错,连续上班11天,终于休息啊。 (Foggy day, but the temperature is fine; I woke up in a good mood. After 11 days of work in a row, I finally get a rest.)
2. 很奇怪到十一月的修水竟然还不冷 太阳战胜了大雾 又是个周五 心情莫名的愉悦啊。。。 (Strangely, Xiushui is still not cold in November; the sun beat the fog, and it's Friday again, so I'm inexplicably happy...)
3. 虽然大雾,心情尚好。 (Despite the heavy fog, I'm in a good mood.)
4. 大雾中能见度很低,请大家小心驾驶。 (Visibility is very low in the fog; please drive carefully, everyone.)
5. 雾很大,为了保证安全,只能将车速减到最低。 (The fog is thick; to stay safe, I can only slow the car to a minimum.)
6. 大雾天,减速行车,注意安全。 (Foggy day: slow down and drive safely.)
AP after the first iteration
For the pair of messages i→k and k→i, the sign combinations can be read as:
• (+, +): i ≈ k (similar opinions)
• (+, −): i ∈ k (i belongs to k's cluster)
• (−, +): k ∈ i (k belongs to i's cluster)
• (−, −): unrelated
Result of the Example
1. 大雾天,但是温度还好,早上起来心情还不错,连续上班11天,终于休息啊。 (Foggy day, but the temperature is fine; I woke up in a good mood. After 11 days of work in a row, I finally get a rest.)
2. 很奇怪到十一月的修水竟然还不冷 太阳战胜了大雾 又是个周五 心情莫名的愉悦啊。。。 (Strangely, Xiushui is still not cold in November; the sun beat the fog, and it's Friday again, so I'm inexplicably happy...)
3. 虽然大雾,心情尚好。 (Despite the heavy fog, I'm in a good mood.)
4. 大雾中能见度很低,请大家小心驾驶。 (Visibility is very low in the fog; please drive carefully, everyone.)
5. 雾很大,为了保证安全,只能将车速减到最低。 (The fog is thick; to stay safe, I can only slow the car to a minimum.)
6. 大雾天,减速行车,注意安全。 (Foggy day: slow down and drive safely.)
After three iterations, sentences 3 and 6 become the exemplars.
Limitations of AP
• It is hard to know which value of the 'preference' parameter yields an optimal clustering solution.
• Oscillations, if they occur, cannot be eliminated automatically.
• AP is difficult to scale up.
Scalability of AP
Since the matrices S, A, and R are stored, read, and written frequently, AP becomes far too slow once the number of micro-blogs exceeds 10,000.
To scale up AP, we consider three directions:
• Sampling or pruning to reduce the number of micro-blogs
• Sketching the matrix and storing only the values above a predefined threshold
• Using a new similarity measurement that makes the similarity matrix symmetric or sparser
Sampling or Pruning
Sampling:
• Randomly sample from the set of micro-blogs
• Sample over the whole period during which the micro-blogs were sent
• Sample over the social network
Pruning according to specific features of micro-blogs:
• Many comments on a micro-blog mean it is a central opinion, or at least a central topic
• The more a micro-blog is retweeted, the more people approve of it
• Validate the overlap between community centers and opinion exemplars
Validation of Overlap between Social Clusters and Micro-blog Clusters
How do micro-blog clusters spread over a social network?
• If they overlap a lot, clustering micro-blogs is much easier, because the social network is relatively fixed.
• If they overlap only a little, we can choose several social communities as our focus.
• Check whether a micro-blog cluster is coordinated or uncoordinated across the network.
• The validation is topic-dependent.
Focus on the Matrix
• Divide the similarity matrix into a symmetric part and an asymmetric part.
• Store only the values larger than a predefined threshold.
• Use a matrix decomposition: if M = aᵀb, we need only store the two vectors a and b.
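The last two ideas can be sketched as follows; the dict-based sparse storage is a simple stand-in for a real sparse-matrix format, and the rank-1 case M = aᵀb is the simplest instance of the decomposition:

```python
import numpy as np

def sparsify(M, threshold):
    """Keep only the entries with |m_ij| above a threshold, stored as a
    dict keyed by (i, j) instead of the full n x n matrix."""
    idx = np.argwhere(np.abs(M) > threshold)
    return {(i, j): M[i, j] for i, j in idx}

def rank1_entry(a, b, i, j):
    """If M = a^T b (an outer product), any entry m_ij can be recomputed
    from the two stored vectors: m_ij = a_i * b_j."""
    return a[i] * b[j]
```

Both variants trade O(n²) storage for either the number of above-threshold entries or just 2n vector entries.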
A New Similarity Measurement
Information encoding (−log)
Word frequency (cos⟨x, y⟩)
• Tf-idf (unsuitable here)
• Bag-of-words
Semantic network (a long-term effort)
• A large training set
• A large dictionary plus the Jaccard coefficient
Social network
• Based on the validation above
• If the overlap is frequent, we take the relationship between the senders of two micro-blogs as a factor of the similarity
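The bag-of-words cosine measure cos⟨x, y⟩ mentioned above can be sketched as:

```python
import math
from collections import Counter

def cosine_similarity(words_x, words_y):
    """Cosine similarity between two keyword lists represented as
    bag-of-words frequency vectors. Symmetric by construction, unlike
    the encoding-cost similarity."""
    x, y = Counter(words_x), Counter(words_y)
    dot = sum(x[w] * y[w] for w in x)
    nx = math.sqrt(sum(c * c for c in x.values()))
    ny = math.sqrt(sum(c * c for c in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0
```

Because this measure is symmetric, it directly addresses the "make the similarity symmetric" direction from the scalability slide.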
Conclusion
Techniques:
• Word partition
• AP
• Sampling
Challenges and solutions:
• Filtering spam micro-blogs => build a dictionary of subjective words (2)
• Word partition => build a user dictionary and improve the accuracy of partitioning (4)
• Similarity measurement => a semantic network (3)
• Sampling and pruning => leverage the features of the social network and of micro-blogs (1)
The Large-Scale Structure of Semantic Networks
Stanford, Cognitive Science
The paper graph-theoretically analyzes three types of semantic network: word associations, WordNet, and Roget's thesaurus.
Conclusion: they have a small-world structure, characterized by sparse connectivity, short average path lengths between words, and strong local clustering. In addition, the distributions of the number of connections follow power laws, suggesting a hub structure similar to that found in other natural networks such as the World Wide Web.
Basic Ideas for Building Such a Framework
• Because new concepts are preferentially attached to rich (well-connected) concepts, the distribution of connectivity follows a power law.
• Concepts that are learned early in life should show higher connectivity.