Sentiment Analysis of Microblogging Data Using - ISKO


By
OLUGBEMI Eric
ODUMUYIWA Victor
OKUNOYE Olusoji
OBJECTIVES OF THE STUDY
RESEARCH QUESTIONS
SIGNIFICANCE OF THE STUDY
MACHINE LEARNING
SUPERVISED MACHINE LEARNING
WEB MINING AND TEXT CLASSIFICATION
WHY TWITTER?
THE BAG OF WORDS APPROACH
THE NAÏVE BAYES ALGORITHM
SUPPORT VECTOR MACHINES
THE PROCESS OF TEXT CLASSIFICATION
MY DATA SET
RESULTS
PRACTICAL DEMONSTRATION
CONCLUSION


- Improve the accuracy of the SVM and Naïve Bayes classifiers by using sentiment lexicons, rather than emoticons, as noisy labels when creating a training corpus.
- Compare the accuracy of the Naïve Bayes classifier with that of an SVM when sentiment lexicons are used as noisy labels and when emoticons are used as noisy labels.
- Is it better to use sentiment lexicons or emoticons as noisy labels when creating a training corpus for sentiment analysis?
- What is the accuracy of SVM on Twitter data when the training corpus is created using sentiment lexicons as noisy labels?
- What is the accuracy of the Naïve Bayes classifier on Twitter data when the training corpus is created using sentiment lexicons as noisy labels?
- What is the effect of word n-grams on the accuracy of the SVM classifier and the Naïve Bayes classifier using the approach in this study?
- What is the effect of term frequency-inverse document frequency (TF-IDF) on the accuracy of the SVM classifier and the Naïve Bayes classifier using the approach in this study?
- Mining the opinions of customers, electorates, etc.
- Product reviews
A machine can learn if you teach it.
Ways of teaching a machine:
- Supervised learning
- Semi-supervised learning
- Unsupervised learning
[Figure: supervised learning workflow]
TRAINING: labeled tweets -> feature extractor -> features -> classifier
PREDICTION: unlabeled tweets -> feature extractor -> features -> classifier -> label
WEB MINING: Mining web content for information
Sentiment analysis of web content involves extracting sentiment from
web content. Sentiment in this case can be positive, negative, or
neutral.
- Twitter data are messy.
- A large data set can be collected from Twitter.
- Tweets have a fixed maximum length (140 characters).
- Twitter users are heterogeneous.

The sentiment of a text depends only on the type of words in the
text, so each word in a text is assessed independently of the
other words in the same text.
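As a rough illustration of the bag-of-words idea, the sketch below turns a text into word counts and shows that word order is discarded. The function name `bag_of_words` and the simple regular-expression tokenizer are assumptions made for this example, not part of the study.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


a = bag_of_words("The people in Nigeria are friendly")
b = bag_of_words("Friendly are the people in Nigeria")

# Both texts contain the same words, so their bags are identical
# even though the word order differs.
print(a == b)
print(a)
```

Because only the counts survive, any feature that depends on word position is lost; this is the simplification the Naïve Bayes classifier relies on.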
The Naïve Bayes classifier is a very simple classifier that relies
on the “bag of words” representation of a document.
Assumptions:
1. The position of a word in a document does not matter all that much.
2. The feature probabilities P(xi|Cj) are conditionally independent given the class.
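1. Classification rule:
   $$C_{NB} = \arg\max_{C_j} P(C_j) \prod_{i=1}^{n} P(x_i \mid C_j)$$
2. Class prior:
   $$P(C_j) = \frac{n(\text{documents in class } j)}{n(\text{documents})}$$
3. Word likelihood with add-one smoothing:
   $$P(W_i \mid C_j) = \frac{\mathrm{count}(W_i, C_j) + 1}{\sum_{W \in V} \left(\mathrm{count}(W, C_j) + 1\right)}$$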
                Doc   Text                                                   Class of doc
Training data   1     Nigeria is a Good country                              pos
                2     The people in Nigeria are friendly                     pos
                3     The youths in Nigeria are productive                   pos
                4     One word to describe this country: bad leadership.     neg
                5     How do Nigerians cope with erratic power supply        neg
Test data       6     Nigeria is a country with viable manpower              ?
                Doc   Words in doc                             Class of doc
Training data   1     nigeria good country                     pos
                2     people nigeria friendly                  pos
                3     youth nigeria productive                 pos
                4     word describe country bad leadership.    neg
                5     nigeria cope erratic power supply        neg
Test data       6     nigeria country viable youth             ?
$$P(C) = \frac{N_c}{N}, \qquad P(n) = \frac{2}{5}, \qquad P(p) = \frac{3}{5}$$
for the test data
V = {nigeria, good, country, people, friendly, youth, productive, word,
describe, bad, leadership, cope, erratic, power, supply}, |V|= 15
Count (p) = n(nigeria, good, country, people, nigeria, friendly, youth, nigeria,
productive) = 9
Count(n) = n(word, describe, country, bad, leadership, nigeria, cope, erratic,
power, supply) = 10
$$P(W \mid c) = \frac{\mathrm{count}(W, c) + 1}{\mathrm{count}(c) + |V|}$$
P(nigeria|p) = (3+1)/(9+15) = 4/24 = 1/6
P(nigeria|n) = (1+1)/(10+15) = 2/25
P(country|p) = (1+1)/(9+15) = 2/24 = 1/12
P(country|n) = (1+1)/(10+15) = 2/25
P(viable|p) = (0+1)/(9+15) = 1/24
P(viable|n) = (0+1)/(10+15) = 1/25
P(youth|p) = (1+1)/(9+15) = 2/24 = 1/12
P(youth|n) = (0+1)/(10+15) = 1/25
("viable" does not occur in any training document, so its count is 0 in both classes.)
To determine the class of text6:
P(p|text6) ∝ 3/5 * 1/6 * 1/12 * 1/24 * 1/12 ≈ 0.00003
P(n|text6) ∝ 2/5 * 2/25 * 2/25 * 1/25 * 1/25 ≈ 0.000004
Since 0.00003 > 0.000004, text6 is classified as a positive text.
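The worked example can be reproduced with a short script. This is a minimal sketch, not the implementation used in the study: the helper names `likelihood` and `score` are made up for illustration, and exact fractions are used so the intermediate values can be compared by hand. Note that "viable" does not occur in any training document, so the sketch gives it a count of 0 in both classes.

```python
from collections import Counter
from fractions import Fraction

# Preprocessed training documents from the worked example.
train = [
    ("nigeria good country", "pos"),
    ("people nigeria friendly", "pos"),
    ("youth nigeria productive", "pos"),
    ("word describe country bad leadership", "neg"),
    ("nigeria cope erratic power supply", "neg"),
]
test_doc = "nigeria country viable youth"

# Number of documents per class, and word counts per class.
docs_per_class = Counter(label for _, label in train)
word_counts = {label: Counter() for label in docs_per_class}
for text, label in train:
    word_counts[label].update(text.split())

# Vocabulary V is built from the training documents only.
vocab = {word for counts in word_counts.values() for word in counts}


def likelihood(word, label):
    """P(word | label) with add-one smoothing."""
    total = sum(word_counts[label].values())
    return Fraction(word_counts[label][word] + 1, total + len(vocab))


def score(text, label):
    """Class prior times the product of the word likelihoods."""
    result = Fraction(docs_per_class[label], len(train))
    for word in text.split():
        result *= likelihood(word, label)
    return result


scores = {label: score(test_doc, label) for label in docs_per_class}
prediction = max(scores, key=scores.get)
print(prediction)
print({label: float(value) for label, value in scores.items()})
```

The scores are unnormalized, which is enough for choosing the larger one. For longer documents, summing log-probabilities instead of multiplying fractions avoids very small numbers.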
A support vector machine (SVM) searches for the optimal linear or nonlinear
separating hyperplane (i.e., a “decision boundary”) that separates the data
samples of one class from those of another.
$$\min_{W,\,b} \|W\| \quad \text{subject to} \quad y_i (W \cdot X_i - b) \ge 1 \quad \text{for all } i = 1, \dots, n$$
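To make the constraint concrete, the sketch below checks whether a candidate hyperplane (W, b) satisfies yi(W.Xi – b) ≥ 1 for every sample, and computes the resulting margin width 2/||W||. The toy points, the two candidate hyperplanes and the function names are all invented for illustration. No optimization is performed here; a real SVM solver searches for the feasible W with the smallest norm.

```python
import math


def satisfies_margin(w, b, samples):
    """Return True if every (x, y) pair meets y * (w . x - b) >= 1."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) - b) >= 1
        for x, y in samples
    )


def margin_width(w):
    """Distance between the hyperplanes w . x - b = +1 and w . x - b = -1."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


# Toy 2-D data: label +1 for one class, -1 for the other.
samples = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Two feasible candidates; the one with the smaller ||w|| has the wider margin.
candidates = {"small_norm": ((0.5, 0.5), 1.0), "large_norm": ((1.0, 1.0), 2.0)}
for name, (w, b) in candidates.items():
    print(name, satisfies_margin(w, b, samples), round(margin_width(w), 3))
```

Both candidates separate the toy data, but minimizing ||W|| prefers the first one because it leaves the widest gap between the two classes.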
USING EMOTICONS:
POSITIVE EMOTICONS: '=]', ':]', ':-)', ':)', '=)' and ':D'
NEGATIVE EMOTICONS: ':-(', ':(', '=(' and ';('
NEUTRAL EMOTICONS: '=/' and ':/'
Tweets with both positive and negative emoticons are ignored.
USING SENTIMENT LEXICON:
POSITIVE: contains words from the positive lexicon
NEGATIVE: contains words from the negative lexicon
NEUTRAL: contains no words from the negative or the positive lexicon
Tweets with question marks are ignored.
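The two noisy-labeling rules can be sketched as follows. This is an illustrative sketch only: the function names, the tiny `POSITIVE_WORDS` and `NEGATIVE_WORDS` sets (stand-ins for a real sentiment lexicon) and the handling of tweets that contain both positive and negative lexicon words are assumptions, since the slides do not specify them.

```python
POSITIVE_EMOTICONS = ("=]", ":]", ":-)", ":)", "=)", ":D")
NEGATIVE_EMOTICONS = (":-(", ":(", "=(", ";(")
NEUTRAL_EMOTICONS = ("=/", ":/")

# Hypothetical miniature word lists standing in for a real sentiment lexicon.
POSITIVE_WORDS = {"good", "friendly", "productive"}
NEGATIVE_WORDS = {"bad", "erratic"}


def label_by_emoticon(tweet):
    """Return 'pos', 'neg', 'neu', or None when the tweet should be skipped."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos and has_neg:
        return None  # ignored: both positive and negative emoticons
    if has_pos:
        return "pos"
    if has_neg:
        return "neg"
    # Naive substring match: ':/' also occurs inside 'http://', so URLs
    # would need to be stripped first in practice.
    if any(e in tweet for e in NEUTRAL_EMOTICONS):
        return "neu"
    return None  # no emoticon, so no noisy label


def label_by_lexicon(tweet):
    """Return 'pos', 'neg', 'neu', or None when the tweet should be skipped."""
    if "?" in tweet:
        return None  # ignored: tweets with question marks
    words = set(tweet.lower().split())
    has_pos = bool(words & POSITIVE_WORDS)
    has_neg = bool(words & NEGATIVE_WORDS)
    if has_pos and has_neg:
        return None  # assumption: mixed tweets are skipped
    if has_pos:
        return "pos"
    if has_neg:
        return "neg"
    return "neu"


print(label_by_emoticon("great day :)"))
print(label_by_lexicon("nigeria is a good country"))
```

A lexicon rule can label every tweet without a question mark, whereas the emoticon rule can label only tweets that happen to contain an emoticon.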
Training corpus (noisy labels)   pos     neg     neu
Emoticon-based data set          8000    8000    8000
Lexicon-based data set           8000    8000    8000

Hand-labeled test set            pos     neg     neu     Total
                                 748     912     874     2534
WITH EMOTICONS
                    One gram   Two grams   Three grams   Mean
MNB                 64.24%     62.78%      62.46%        63.16%
SVM                 58.94%     60.48%      61.39%        60.27%
With TFIDF (MNB)    63.05%     62.70%      62.94%        62.90%
With TFIDF (SVM)    59.45%     60.44%      61.31%        60.4%
MEAN                61.42%     61.6%       62.03%        61.68%
WITH SENTIMENT LEXICON
                    One gram   Two grams   Three grams   Mean
MNB                 66.96%     66.02%      65.55%        66.18%
SVM                 70.97%     71.88%      71.80%        71.55%
With TFIDF (MNB)    66.61%     66.10%      66.26%        66.32%
With TFIDF (SVM)    70.89%     69.46%      69.19%        69.65%
MEAN                68.86%     68.37%      68.2%         68.45%
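The n-gram and TF-IDF settings compared in the results can be sketched in plain Python as below. Two assumptions are made here because the slides do not state them: "two grams" is taken to mean all word n-grams of length 1 to 2 (and likewise for three grams), and TF-IDF is the textbook form tf × log(N/df). Libraries such as scikit-learn apply smoothing and normalization by default, so their values would differ. The function names are illustrative.

```python
import math
from collections import Counter


def word_ngrams(tokens, max_n):
    """Return all word n-grams of length 1..max_n as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tf_idf(docs, max_n=1):
    """Weight each n-gram by term frequency times log(N / document frequency)."""
    bags = [Counter(word_ngrams(doc.split(), max_n)) for doc in docs]
    # Iterating over a Counter yields each distinct term once per document.
    doc_freq = Counter(term for bag in bags for term in bag)
    n_docs = len(docs)
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in bag.items()}
        for bag in bags
    ]


docs = ["nigeria good country", "people nigeria friendly", "youth nigeria productive"]
print(word_ngrams(docs[0].split(), 2))
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "nigeria" appears in every document, so its weight drops to zero, while words unique to one document keep the highest weight.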
             neg     neu     pos     average
Precision    0.89    0.59    0.83    0.77
Recall       0.55    0.92    0.69    0.72
F1-score     0.68    0.72    0.75    0.72
Support      912     874     742     2528 (total)

CONFUSION MATRIX
       neg    neu    pos
neg    501    29     30
neu    350    805    201
pos    61     40     511
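The per-class scores can be recomputed from the confusion matrix. The sketch below assumes that rows are predicted classes and columns are actual classes; this orientation is not stated on the slide, but it is the one under which the column totals equal the reported support values (912, 874, 742). The function name is illustrative.

```python
LABELS = ["neg", "neu", "pos"]

# Assumption: rows are predicted classes, columns are actual classes.
MATRIX = [
    [501, 29, 30],
    [350, 805, 201],
    [61, 40, 511],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1, support)} for a square matrix."""
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        predicted = sum(matrix[k])               # row total
        actual = sum(row[k] for row in matrix)   # column total (support)
        precision = true_pos / predicted
        recall = true_pos / actual
        f1 = 2 * precision * recall / (precision + recall)
        metrics[label] = (precision, recall, f1, actual)
    return metrics


metrics = per_class_metrics(MATRIX, LABELS)
correct = sum(MATRIX[k][k] for k in range(len(LABELS)))
accuracy = correct / sum(map(sum, MATRIX))

for label, (p, r, f, s) in metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
print(f"accuracy={accuracy:.4f}")
```

Under this reading the overall accuracy is 1817/2528, about 71.88%, which is the same figure reported for the two-gram SVM trained on the lexicon-labeled corpus.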
- Emoticons are noisier than sentiment lexicons; it is therefore better to use sentiment lexicons as noisy labels when training a classifier for sentiment analysis.
- SVM performs better than the Naïve Bayes classifier.
- Increasing the number of grams did not improve the accuracy of our classifiers trained with the corpus generated using sentiment lexicons as noisy labels. The reverse was the case when emoticons were used as noisy labels.
DataGenetics, 2012, “Emoticon Analysis in Twitter”. http://www.datagenetics.com/blog/october52012/index.html
Alec Go, Richa Bhayani, and Lei Huang, 2009, “Twitter Sentiment Analysis”, CS224N Project Report, Stanford.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M. et al., 2011, “Scikit-learn: Machine Learning in Python”, Journal of Machine Learning Research, vol. 12.
Bing Liu, Minqing Hu, and Junsheng Cheng, 2005, “Opinion Observer: Analyzing and Comparing Opinions on the Web”, Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.