#GHC14
Lexicon-Based
Sentiment Analysis
Using the Most-Mentioned
Word Tree
Oct 10th, 2014
Bo-Hyun Kim, Sr. Software Engineer
With Lina Chen, Sr. Software Engineer
HP Big Data Business Unit
What to Expect
Sentiment Analysis
− What is it?
− Why is it interesting?
− How HP Vertica Pulse works
− Achieving greater accuracy
− A different point of view using the most-mentioned word tree
What I Expect
A 5-star rating on the GHC app
I just expect you to enjoy and learn!
Sentiment Analysis
In plain English
− the process of automatically detecting if a text segment contains emotional or opinionated content and determining its polarity (e.g., “thumbs up” or “thumbs down”), is a field of research that has received significant attention in recent years, both in academia and in industry. [Wright, 2009]
Gimme Examples!
Also known as:
− Opinion Mining
− Text Mining
Determine people’s general opinion
− “I just got a new car, and I’m loving it”
− “My new car isn’t as fast as I thought.”
Why are we interested?
Increasing (every minute!) web usage
− Articles
− Blogs
− Comments
Power of Social Media
− Online Shopping
− Customer Reviews
− Recommended products on Amazon
− How other people feel about the product
Product Review
Data… Data… Data…
HP Vertica Pulse
How to Analyze?
Lexicon-based approach – HP Labs [Zhang et al., 2011]
Choose a product, person, event, organization, or topic [Hu and Liu, 2004] to analyze the opinion
Determine the Semantic Orientation score of the opinion lexicons
Word       Semantic Orientation Value
Fabulous   +3
Good       +1
Bad        -1
Nasty      -3
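For illustration, a minimal Python sketch of a Semantic Orientation lexicon lookup; the words and values come from the table above, while the function name and the fallback to 0 for unknown words are assumptions, not Pulse's implementation.

```python
# Minimal Semantic Orientation lexicon, using the example words above.
SO_LEXICON = {
    "fabulous": +3,
    "good": +1,
    "bad": -1,
    "nasty": -3,
}

def orientation(word):
    """Look up a word's Semantic Orientation value; unknown words count as 0."""
    return SO_LEXICON.get(word.lower(), 0)

print(orientation("Fabulous"))  # +3
print(orientation("meh"))       # 0 (not in the lexicon)
```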
Sentiment Scoring
Input: text or sentence
Output: For each attribute or entity, generates a sentiment score ranging from -1 to 1
− -1: Negative sentiment
− 0: Neutral sentiment
− 1: Positive sentiment
Entity-level lexicon-based sentiment scoring
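A hedged sketch of entity-level scoring in this -1..1 range; the simple average of opinion-word values and the divide-by-maximum-weight normalization below are assumptions for illustration, not Pulse's actual formula.

```python
# Sketch: score one entity by averaging the Semantic Orientation of the
# opinion words attributed to it, then squash into [-1, 1]. The averaging
# and the divide-by-max-weight normalization are illustrative assumptions.
SO_LEXICON = {"fabulous": 3, "good": 1, "bad": -1, "nasty": -3}

def entity_score(opinion_words, lexicon=SO_LEXICON, max_weight=3):
    scores = [lexicon[w.lower()] for w in opinion_words if w.lower() in lexicon]
    if not scores:
        return 0.0                      # 0 = neutral sentiment
    avg = sum(scores) / len(scores)
    return max(-1.0, min(1.0, avg / max_weight))

# Opinion words attributed to the "camera" entity in a review
print(entity_score(["fabulous", "bad"]))   # (3 - 1) / 2 / 3 ≈ 0.33
```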
Limitation
Semantic Orientation value (‘missed’) = -1
Gives more weight to the closely located word
Accuracy can suffer
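To illustrate the limitation, a sketch that assumes a simple inverse-distance weighting (the weighting scheme itself is an assumption): a word like “missed” sitting next to the entity dominates the score even though the sentence is not negative about the entity.

```python
# Sketch: inverse-distance weighting (an assumed scheme, for illustration only).
# Opinion words closer to the entity count more, so an incidental "missed"
# right next to the entity drags the score down.
LEXICON = {"missed": -1, "great": +2}

def weighted_score(tokens, entity_index, lexicon):
    total = 0.0
    for i, tok in enumerate(tokens):
        so = lexicon.get(tok.lower())
        if so is not None and i != entity_index:
            total += so / abs(i - entity_index)   # closer word -> larger weight
    return total

tokens = "I missed Obama 's great speech yesterday".split()
print(weighted_score(tokens, tokens.index("Obama"), LEXICON))
# -1/1 + 2/2 = 0.0: "missed" cancels "great", although the tweet praises the speech
```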
Improve accuracy
Accuracy is what we strive for!
More robust pre-processing
− Prune data to fit different types of user opinion (e.g., Twitter vs. YouTube comments)
Naïve Bayes Classifier Training
Tune accordingly
Data Set
Test dataset
− Collected by Stanford students
− In 2009
− Over 3 million tweets with tested scores
− Analyzed 3,500 tweets
Collected dataset
− HP Vertica Pulse Twitter Connector
− In 2014
− Total of 1.2 million tweets
Data Pruning
Remove
− Job postings
• #job, #jobs, #tweetmyjob
− Links
• http://this.is/nogood
− Duplicates
− Twitter specific characters
• RT, @, #
− Emoticons
• “I hate my life :-)” (sarcasm is a widespread disease)
After pruning
− ~287,000 tweets, about 24% of the 1.2 million tweets
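A hedged Python sketch of this kind of pruning; the exact tags, regexes, and drop rules used for the talk's dataset are assumptions here.

```python
import re

JOB_TAGS = {"#job", "#jobs", "#tweetmyjob"}

def keep_tweet(text, seen):
    """Return the pruned tweet text, or None if the tweet should be dropped."""
    lowered = text.lower()
    if any(tag in lowered for tag in JOB_TAGS):
        return None                                  # job postings
    if "http://" in lowered or "https://" in lowered:
        return None                                  # links
    # Strip RT, @mentions, #hashtags, and simple emoticons (illustrative regex).
    cleaned = re.sub(r"(\bRT\b|[@#]\w+|[:;]-?[()DP])", "", text).strip()
    if cleaned in seen:
        return None                                  # duplicates
    seen.add(cleaned)
    return cleaned

seen = set()
print(keep_tweet("RT @user I hate my life :-) #mood", seen))   # -> "I hate my life"
```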
Naïve Bayes Classifier
Supervised learning
− Probabilistic classifier based on Bayes’ theorem
− Requires only a small amount of training data
− Assumes the presence/absence of a particular feature of a class is unrelated to the presence/absence of any other feature
− Classifies an object based on its included features
P(C_j | D) = P(D | C_j) P(C_j) / P(D)
− Open source found at [nltk.org]
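A hedged sketch of training NLTK's Naive Bayes classifier on labeled tweets; the bag-of-words features and the tiny inline corpus are assumptions for illustration.

```python
import nltk

def features(tweet_text):
    """Bag-of-words presence features (an assumed feature set)."""
    return {word.lower(): True for word in tweet_text.split()}

# Tiny stand-in for the labeled tweet corpus described above.
labeled = [
    ("I love my new phone", "positive"),
    ("worst customer service ever", "negative"),
    ("the package arrived today", "neutral"),
]
train_set = [(features(text), label) for text, label in labeled]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(features("love the new update")))
# Evaluated on the training data only for brevity; the talk reports 0.788
# accuracy on the real, held-out tweets.
print(nltk.classify.accuracy(classifier, train_set))
```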
Naïve Bayes Classifier
Results:
− Final accuracy: 0.788
Tuning Pulse
Positive words
Negative words
Neutral words
White lists
Stop words
Synonym mappings
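As an illustration only, these tuning inputs could be represented as plain data structures; Pulse's actual tuning format is not shown here, and the entries below are made up.

```python
# Illustrative tuning inputs as plain Python data; all entries are made up
# and Pulse's real tuning format is not shown here.
tuning = {
    "positive_words": {"fabulous", "love"},
    "negative_words": {"nasty", "hate"},
    "neutral_words": {"okay"},
    "white_list": {"obamacare"},           # always treat as an attribute of interest
    "stop_words": {"the", "a", "is"},
    "synonyms": {"aca": "obamacare"},      # map variants onto one attribute
}

def normalize(token):
    """Drop stop words and map synonyms onto a canonical attribute name."""
    token = token.lower()
    if token in tuning["stop_words"]:
        return None
    return tuning["synonyms"].get(token, token)

print(normalize("ACA"))   # "obamacare"
```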
Accuracy Comparison
Sentiment scores generated for each phase

Keyword      Ideal      Original   Pruning    Training   Tuning
Healthcare   -0.1515    -0.0333    -0.0833    -0.1       -0.125
Obama         0.0944     0.1535     0.1535     0.1842     0.308
Trend/Targeted Analysis
Targeted dataset analysis can help improve accuracy
Identify the most-mentioned words
− Use the most-recurrent words to narrow the scope of analysis
Find new trends
− Government healthcare (2009) vs. Obamacare (2014)
Are we looking at the targeted data?
− “Solve healthcare challenges with technology!”
− “Healthcare After ObamaCare”
− “Get affordable healthcare at HealthCare.gov”
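A hedged sketch of finding the most-mentioned words with Python's collections.Counter; the whitespace tokenization and punctuation stripping are simplifications.

```python
from collections import Counter

tweets = [
    "Solve healthcare challenges with technology!",
    "Healthcare After ObamaCare",
    "Get affordable healthcare at HealthCare.gov",
]

# Count word occurrences across the tweet set (simplified tokenization).
counts = Counter(
    word.strip("!.,").lower()
    for tweet in tweets
    for word in tweet.split()
)
print(counts.most_common(3))   # "healthcare" dominates; use it to narrow the analysis
```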
Generating Tree
Increase the relevancy of the sentiment score by running the sentiment analysis on the entity, as well as on the most-recurrent words, to identify:
− Homonyms that machines do not understand
− More accurate scores based on user interest
Generate the tree using Text Search
− Merge stemmed words, e.g., query, queries, querying…
− Lucene – Apache open source
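The talk uses Lucene's Text Search stemming; as a stand-in illustration only, the same merge shown with NLTK's PorterStemmer in Python.

```python
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["query", "queries", "querying", "queried"]

# Group surface forms under a shared stem so the word tree counts them as one node.
merged = Counter(stemmer.stem(w) for w in words)
print(merged)   # all four forms collapse onto the single stem "queri"
```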
Tree View
[Tree diagram: nodes “healthcare”, “obamacare”, “obama”, “health”, and their complements !(Obamacare), !(Obama), !(health) for tweets that do not mention those words]
Thank you
Questions?
[email protected]
[email protected]
Many thanks to*:
Tim Donar, Solution Engineer
Beth Favini, Tech Pubs Sr. Manager
Judith Plummer, Tech Pubs Editor in Chief
* In alphabetical order
Got Feedback?
Rate and Review the session using the GHC Mobile App
To download, visit www.gracehopper.org