
GDM@FUDAN http://gdm.fudan.edu.cn
Graph Data Management Lab, School of Computer Science
Public Opinion Mining
on Micro-blogs
By zerup
What is an opinion?
• An opinion is a subjective belief, resulting from emotion or from an interpretation of facts.
• An opinion may be supported by an argument or by a set of facts.
• An opinion may reflect a person's perspective, understanding, particular feelings, beliefs, and desires.
• Based on these characteristics, we can build a dictionary of subjective words to pick out opinion-bearing micro-blogs from the many that merely state facts (see the sketch below).
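A minimal sketch of this idea, assuming a hand-built set of subjective words; the example words and the one-match threshold are illustrative only, not from the slides:

```python
# Tiny illustrative subjective-word dictionary (a real dictionary is much larger).
SUBJECTIVE_WORDS = {"good", "bad", "happy", "sad", "love", "hate", "hope", "unfortunately"}

def looks_like_opinion(keywords, min_hits=1):
    """Treat a micro-blog as opinion-bearing if it contains at least
    `min_hits` words from the subjective dictionary."""
    return len(set(keywords) & SUBJECTIVE_WORDS) >= min_hits

print(looks_like_opinion(["happy", "fog", "rest"]))      # True  -> opinion
print(looks_like_opinion(["fog", "visibility", "low"]))  # False -> factual
```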
Problem Definition
• Given thousands of micro-blogs on the same topic,
• mine the top-k most popular opinions.
• Solution framework:
  • Word partition (segmentation)
  • Similarity computation using the extracted keywords
  • Clustering
Why clustering? Why AP?
• In information theory, an opinion is a set of encoded symbols.
• In human cognition, an opinion is made up of a subjective part, an objective part, and a comment.
• In a semantic network, opinions on the same topic that share common words have similar meanings.
• Following these three perspectives, we construct a graph with each micro-blog as a data point and mutual information as edge weights, and then apply a clustering algorithm to the graph to find the opinion exemplars.
Basic Idea of Affinity Propagation (AP)
• Data point i and candidate exemplar k
• Responsibility r(i,k): the message sent from i to k, reflecting how well-suited k is to serve as i's exemplar
• Availability a(i,k): the message sent from k to i, reflecting how appropriate it would be for i to choose k as its exemplar
s(i,k)
• For i ≠ k:
  • Each keyword w of sentence i is encoded using the keywords of sentence k if possible, and otherwise using the words of the dictionary.
  • s(i,k) is the negative sum of the encoding costs, where each cost is the base-2 log of the size of the word set used.
• Example:
  • Sentence i has keywords: A B
  • Sentence k has keywords: A C D E
  • The dictionary has keywords: A B C D E F G H
  • s(i,k) = -(log 4 + log 8) = -5   (A is encoded with k's 4 keywords, B with the 8-word dictionary)
  • s(k,i) = -(log 2 + 3·log 8) = -10   (A is encoded with i's 2 keywords; C, D, E with the dictionary)
• s(k,k) is set to the input preference for k to become an exemplar (a code sketch follows).
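A minimal sketch of this encoding-based similarity, assuming base-2 logarithms and that a keyword absent from sentence k is encoded against the whole dictionary; the function name and set representation are ours, not from the slides:

```python
import math

def encoding_similarity(words_i, words_k, dictionary):
    """Negative total encoding cost of sentence i given sentence k.

    Each keyword of i is encoded with the keywords of k when possible
    (cost log2 |k|), otherwise with the full dictionary (cost log2 |dict|).
    """
    cost = 0.0
    for w in words_i:
        if w in words_k:
            cost += math.log2(len(words_k))
        else:
            cost += math.log2(len(dictionary))
    return -cost

# Worked example from the slide.
i = {"A", "B"}
k = {"A", "C", "D", "E"}
dictionary = {"A", "B", "C", "D", "E", "F", "G", "H"}

print(encoding_similarity(i, k, dictionary))  # -(log2 4 + log2 8) = -5.0
print(encoding_similarity(k, i, dictionary))  # -(log2 2 + 3*log2 8) = -10.0
```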
r(i,k)
• For i ≠ k (the standard AP responsibility update):
  r(i,k) = s(i,k) - max over k'≠k of { a(i,k') + s(i,k') }
• For i = k:
  r(k,k) = s(k,k) - max over k'≠k of s(k,k')
• Under this formulation r(k,k) depends only on the similarities, so it stays fixed across iterations.
a(i,k)
• For i ≠ k (the standard AP availability update):
  a(i,k) = min{ 0, r(k,k) + Σ over i'∉{i,k} of max{0, r(i',k)} }
• For i = k:
  a(k,k) = Σ over i'≠k of max{0, r(i',k)}
Result of AP
• The number of iterations is fixed in advance, because AP is not guaranteed to converge.
• Finally we obtain a matrix M = A + R with m(i,j) = a(i,j) + r(i,j). Every point k with m(k,k) > 0 is selected as an exemplar, and every other point i is assigned to the exemplar k that maximizes m(i,k) (a numpy sketch of the whole procedure follows).
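A compact numpy sketch of the AP message passing described on the last few slides, including the exemplar extraction from M = A + R. The damping factor (a common stabilizer against the oscillations mentioned later) and the default of 100 iterations are our additions:

```python
import numpy as np

def affinity_propagation(S, n_iter=100, damping=0.5):
    """Basic affinity propagation on a similarity matrix S.

    S[i, k] is the similarity of point i to candidate exemplar k;
    the diagonal S[k, k] holds the preferences.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities
    A = np.zeros((n, n))  # availabilities

    for _ in range(n_iter):
        # Responsibility: r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second_max = AS.max(axis=1)
        R_new = S - first_max[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second_max
        R = damping * R + (1 - damping) * R_new

        # Availability: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()          # a(k,k) = sum_{i' != k} max(0, r(i',k))
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

    # Exemplar extraction from M = A + R, as on this slide.
    M = A + R
    exemplars = np.where(M.diagonal() > 0)[0]
    labels = exemplars[np.argmax(M[:, exemplars], axis=1)]
    return exemplars, labels
```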
Framework of AP
[Flowchart] input: filtered micro-blogs → word partition (using a word dictionary) → words and their properties → keywords → semantic network → words with real meaning → clustering → number of clusters, centers and members → output (a code sketch of this pipeline follows).
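A sketch of the pipeline in the flowchart above, starting from micro-blogs that have already been segmented into keyword sets and reusing the hypothetical encoding_similarity() and affinity_propagation() sketches from the earlier slides. The median-of-similarities preference is a common default, not a choice stated in the slides:

```python
import numpy as np

def mine_opinions(keyword_sets, dictionary, top_k=5):
    """Keyword sets -> similarity matrix -> AP clustering -> top-k exemplars."""
    n = len(keyword_sets)
    S = np.zeros((n, n))
    for i in range(n):
        for k in range(n):
            if i != k:
                S[i, k] = encoding_similarity(keyword_sets[i], keyword_sets[k], dictionary)

    # Preference s(k,k): the median off-diagonal similarity is a common default.
    np.fill_diagonal(S, np.median(S[~np.eye(n, dtype=bool)]))

    exemplars, labels = affinity_propagation(S)
    cluster_sizes = {e: int((labels == e).sum()) for e in exemplars}
    return sorted(exemplars, key=cluster_sizes.get, reverse=True)[:top_k]
```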
Preprocessing of Micro-blogs
• First, we inspected all the micro-blogs and found that a large number of advertisements share the same embedded URL http://t.cn/agxZ4i. Filtering on this URL removes about 5,000 of the 165,000 micro-blogs in total.
• Second, many spam micro-blogs contain lots of unrelated words (so that they match many different topics) and therefore contain an unusually large number of spaces; both filters are sketched below.
• Third, we plan to build a dictionary of subjective words to pick out the opinion-bearing sentences.
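A minimal sketch of the two filters above, assuming each micro-blog is a plain text string; the space-count threshold is a tunable guess, not a value from the slides:

```python
AD_URL = "http://t.cn/agxZ4i"   # shared URL of the advertisement cluster
MAX_SPACES = 10                 # heuristic spam threshold (assumed value)

def keep_microblog(text: str) -> bool:
    """Drop advertisements sharing the known URL and space-heavy spam."""
    if AD_URL in text:
        return False
    if text.count(" ") > MAX_SPACES:
        return False
    return True

sample = [
    "Great deal today! http://t.cn/agxZ4i",
    "Heavy fog this morning, drive carefully.",
]
filtered = [m for m in sample if keep_microblog(m)]  # keeps only the second one
```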
Word Partition of AP
• Accuracy of word partition (Chinese word segmentation)
  • Tool: ICTCLAS
  • Limitations: recognizing person names; unsuitable granularity
• Define a user dictionary and keep it updated
  • Store the latest Internet vocabulary
• Which words (by part of speech) are most related to opinion and sentiment?
  • We keep nouns, verbs and adjectives as meaningful keywords (see the sketch after this list)
• Convert traditional Chinese into simplified Chinese
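The slides use ICTCLAS; as a stand-in, here is a sketch of the same noun/verb/adjective filtering with the jieba segmenter (a different tool, so treat this only as an illustration; the user-dictionary file name is hypothetical):

```python
import jieba
import jieba.posseg as pseg

# Optional: load a user dictionary with the latest Internet vocabulary.
# jieba.load_userdict("user_dict.txt")   # hypothetical file name

def extract_keywords(text):
    """Keep nouns (n*), verbs (v*) and adjectives (a*) as keywords."""
    return {
        w.word
        for w in pseg.cut(text)
        if w.flag and w.flag[0] in ("n", "v", "a")
    }

print(extract_keywords("虽然大雾,心情尚好。"))  # e.g. {'大雾', '心情', ...}; output depends on the model
```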
Example of AP
1. 大雾天,但是温度还好,早上起来心情还不错,连续上班11天,终于休息啊。 (Foggy day, but the temperature is fine; woke up in a good mood. After eleven straight days of work, finally a day off.)
2. 很奇怪到十一月的修水竟然还不冷 太阳战胜了大雾 又是个周五 心情莫名的愉悦啊。。。 (Strange that Xiushui is still not cold in November; the sun has beaten the fog, it is Friday again, and I am inexplicably happy...)
3. 虽然大雾,心情尚好。 (Despite the heavy fog, I am still in a good mood.)
4. 大雾中能见度很低,请大家小心驾驶。 (Visibility is very low in the fog; please drive carefully.)
5. 雾很大,为了保证安全,只能将车速减到最低。 (The fog is very heavy; to stay safe, I can only reduce my speed to the minimum.)
6. 大雾天,减速行车,注意安全。 (Foggy day: slow down and drive safely.)
AP after the first iteration
Interpreting the signs of the pair of messages (i->k, k->i):
• (+, +): i ≈ k (the two express essentially the same opinion)
• (+, -): i ∈ k (i's content is contained in k's)
• (-, +): k ∈ i (k's content is contained in i's)
• (-, -): unrelated
Result of the Example
• Running AP on the six example micro-blogs above, sentences 3 and 6 become the exemplars after three iterations.
Limitations of AP
• It is hard to know which value of the 'preference' parameter yields an optimal clustering solution.
• Oscillations cannot be eliminated automatically when they occur.
• AP is difficult to scale up.
Scalability of AP
• Because the matrices S, A and R are stored, read and written frequently, AP becomes far too slow once the number of micro-blogs exceeds about 10,000.
• To scale AP up, we consider three directions:
  • Sampling or pruning to reduce the number of micro-blogs
  • Sketching the matrix and storing only the values above a predefined threshold
  • Using a new similarity measure that makes the similarity matrix symmetric or sparser
Sampling or Pruning
• Sampling:
  • Randomly sample from the set of micro-blogs
  • Sample over the whole period during which the micro-blogs were posted
  • Sample over the social network
• Pruning according to specific features of micro-blogs (see the sketch after this list):
  • A micro-blog with many comments represents a central opinion, or at least a central topic
  • The more a micro-blog is retweeted, the more people endorse it
  • Validate the overlap between community centers and opinion exemplars
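A toy sketch of feature-based pruning, assuming each micro-blog carries comment and retweet counts; the field names and thresholds are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Microblog:
    text: str
    comments: int   # number of comments (assumed field)
    retweets: int   # number of retweets (assumed field)

def prune(blogs, min_comments=5, min_retweets=3):
    """Keep only micro-blogs popular enough to anchor an opinion cluster."""
    return [b for b in blogs if b.comments >= min_comments or b.retweets >= min_retweets]

blogs = [
    Microblog("虽然大雾,心情尚好。", comments=12, retweets=4),
    Microblog("大雾天,减速行车,注意安全。", comments=1, retweets=0),
]
print([b.text for b in prune(blogs)])  # keeps only the first micro-blog
```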
Validation of Overlap between Social Clusters and Micro-blog Clusters
• How do micro-blog clusters spread over a social network?
• If they overlap heavily, clustering micro-blogs becomes much easier, because the social network is relatively stable.
• If they overlap only slightly, we can choose several social communities as our focus.
• Check whether a micro-blog cluster is coordinated (concentrated in one part of the network) or uncoordinated (spread across it).
• The validation is topic-dependent.
Focus on the matrix
• Split the similarity matrix into a symmetric part and an asymmetric part.
• Store only the values larger than a predefined threshold.
• Use a matrix decomposition: if M = aᵀb (a rank-one factorization), we only need to store the two vectors a and b (both storage ideas are sketched below).
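A small numpy/scipy sketch of the two storage ideas above, thresholded sparse storage and a rank-one factorization; the toy matrix and the threshold value are arbitrary:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
S = -rng.random((5, 5)) * 10           # dense similarity matrix (toy values)

# 1. Keep only entries above a predefined threshold, stored sparsely.
threshold = -5.0
S_sparse = csr_matrix(np.where(S > threshold, S, 0.0))
print(S_sparse.nnz, "of", S.size, "entries kept")

# 2. Rank-one factorization M = a^T b: store two length-n vectors instead of n^2 values.
a = rng.random(5)
b = rng.random(5)
M = np.outer(a, b)                     # equals a^T b for row vectors a and b
print(np.allclose(M, a[:, None] * b[None, :]))  # True
```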
A new similarity measurement
• Information encoding (the -log cost described earlier)
• Word frequency (cosine similarity cos<x, y>)
  • tf-idf (found unsuitable here)
  • Bag-of-words (bag-of-words cosine and the Jaccard coefficient are sketched after this list)
• Semantic network (a long-term effort)
  • Requires a large training corpus
  • Or a large dictionary combined with the Jaccard coefficient
• Social network
  • Based on the validation described earlier
  • If the overlap is high, the relationship between the senders of two micro-blogs is used as a factor in the similarity
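Sketches of the bag-of-words cosine similarity and the Jaccard coefficient mentioned above, in plain Python with no external tools assumed:

```python
import math
from collections import Counter

def cosine_bow(words_x, words_y):
    """Cosine similarity between bag-of-words count vectors."""
    x, y = Counter(words_x), Counter(words_y)
    dot = sum(x[w] * y[w] for w in x)
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0

def jaccard(set_x, set_y):
    """Jaccard coefficient between two keyword sets."""
    union = set_x | set_y
    return len(set_x & set_y) / len(union) if union else 0.0

print(cosine_bow(["大雾", "心情", "好"], ["大雾", "驾驶", "小心"]))  # ≈ 0.33
print(jaccard({"大雾", "心情"}, {"大雾", "驾驶"}))                    # ≈ 0.33
```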
Conclusion
• Techniques:
  • Word partition
  • Affinity propagation (AP)
  • Sampling
• Challenges and solutions:
  • Filtering spam micro-blogs => build a dictionary of subjective words (2)
  • Word partition => build a user dictionary and improve the accuracy of segmentation (4)
  • Similarity measurement => a semantic network (3)
  • Sampling and pruning => leverage the features of the social network and of micro-blogs (1)
The large-scale structure of semantic networks
• Stanford, Cognitive Science
• The study analyzes three types of semantic networks graph-theoretically: word associations, WordNet, and Roget's thesaurus.
• Conclusion: all three have a small-world structure, characterized by sparse connectivity, short average path lengths between words, and strong local clustering. In addition, the distributions of the number of connections follow power laws, suggesting a hub structure similar to that found in other natural networks such as the World Wide Web.
Basic ideas for building such a framework
• Because new concepts attach preferentially to already well-connected ("rich") concepts, the distribution of connectivity follows a power law (a toy illustration follows).
• Concepts that are learned early in life should show higher connectivity.
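A toy illustration of this growth rule, using networkx's Barabási–Albert generator merely to show the heavy-tailed degree distribution that preferential attachment produces; it is not the framework from the slides:

```python
import networkx as nx

# Grow a network where each new node attaches to 2 existing nodes,
# preferring nodes that already have many connections.
G = nx.barabasi_albert_graph(n=1000, m=2, seed=42)

degrees = sorted((d for _, d in G.degree()), reverse=True)
print("max degree:", degrees[0])                       # a few hub nodes...
print("median degree:", degrees[len(degrees) // 2])    # ...while most nodes stay sparsely connected
```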