#GHC14
Lexicon-Based Sentiment Analysis Using the Most-Mentioned Word Tree
Oct 10th, 2014
Bo-Hyun Kim, Sr. Software Engineer
With Lina Chen, Sr. Software Engineer
HP Big Data Business Unit
2014
What to Expect
 Sentiment Analysis
− What is it?
− Why is it interesting?
− How HP Vertica Pulse works
− Achieving greater accuracy
− Different point of view using the most-mentioned word tree
What I Expect
 A 5-star rating on GHC app
 I just expect you to enjoy and learn!
Sentiment Analysis
 In plain English
− Sentiment analysis, the process of automatically detecting if a text
segment contains emotional or opinionated content and determining its
polarity (e.g., “thumbs up” or “thumbs down”), is a field of research
that has received significant attention in recent years, both in
academia and in industry. [Wright, 2009]
Gimme Examples!
 Also known as:
− Opinion Mining
− Text Mining
 Determine people’s general opinion
− “I just got a new car, and I’m loving it”
− “My new car isn’t as fast as I thought.”
Why are we interested?
 Increasing (every minute!) web usage
− Articles
− Blogs
− Comments
 Power of Social Media
− Online Shopping
− Customer Reviews
− Recommended products on Amazon
− How other people feel about the product
Product Review
Data… Data… Data…
HP Vertica Pulse
How to Analyze?
 Lexicon-based approach – HP Labs [Zhang et al., 2011]
 Choose a product, person, event, organization, or topic
[Hu and Liu, 2004] to analyze the opinion
 Determine the Semantic Orientation score of opinion
lexicons
Word        Semantic Orientation Value
Fabulous    +3
Good        +1
Bad         -1
Nasty       -3
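A minimal sketch of the lexicon lookup, assuming the lexicon is a plain dictionary. Only the four words and their values come from the slide; the function name `sentence_orientation` and the sample sentence are hypothetical, and nothing here is meant to describe how HP Vertica Pulse is implemented.

```python
import re

# Semantic orientation values from the slide; a real lexicon is far larger.
LEXICON = {"fabulous": 3, "good": 1, "bad": -1, "nasty": -3}

def sentence_orientation(text):
    """Sum the semantic orientation values of the opinion words in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)

# Hypothetical sentence: fabulous (+3) and bad (-1) give 2.
print(sentence_orientation("The seats are fabulous but the service was bad"))
```

Words missing from the lexicon contribute 0, so a sentence with no opinion words comes out neutral.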
Sentiment Scoring
 Input: text or sentence
 Output: For each attribute or entity, generates a sentiment score ranging from -1 to 1
− -1: Negative sentiment
− 0: Neutral sentiment
− 1: Positive sentiment
 Entity-level lexicon-based sentiment scoring
Limitation
 Semantic Orientation value(‘missed’) = -1
 Gives more weight to opinion words located closer to the entity
 Accuracy can suffer
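To make the limitation concrete, here is a rough sketch of entity-level scoring with proximity weighting. The value for 'missed' is from the slide. The reciprocal-distance weight, the normalization into the -1 to 1 range, the function name `entity_score`, and both sample sentences are assumptions made for illustration, not the Pulse algorithm.

```python
import re

# 'missed' = -1 is from the slide; the other values are from the lexicon slide.
LEXICON = {"fabulous": 3, "good": 1, "bad": -1, "nasty": -3, "missed": -1}

def entity_score(text, entity):
    """Distance-weighted sentiment for `entity`, scaled into [-1, 1]."""
    words = re.findall(r"[a-z']+", text.lower())
    if entity not in words:
        return 0.0
    position = words.index(entity)
    weighted_sum = 0.0
    total_weight = 0.0
    for i, word in enumerate(words):
        value = LEXICON.get(word, 0)
        if value == 0 or i == position:
            continue
        weight = 1.0 / abs(i - position)  # closer opinion words count more
        weighted_sum += value * weight
        total_weight += abs(value) * weight
    # Assumed normalization: divide by the largest magnitude the sum could reach.
    return weighted_sum / total_weight if total_weight else 0.0

# 'good' sits next to the entity and outweighs the more distant 'bad'.
print(round(entity_score("bad day, good phone", "phone"), 2))  # 0.5
# 'missed' is scored as negative even though the speaker is fond of the phone.
print(entity_score("I missed my old phone", "phone"))  # -1.0
```

The second call shows how accuracy can suffer: a fixed lexicon value cannot tell "missed" as regret from "missed" as affection.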
Improve accuracy
 Accuracy is what we strive for!
 More robust pre-processing
− Prune data to fit different types of user opinion (e.g., Twitter vs. YouTube comments)
 Naïve Bayes Classifier Training
 Tune accordingly
Data Set
 Test dataset
− Stanford students collected
− In 2009
− Over 3 million tweets with tested score
− Analyzed 3500 tweets
 Collected dataset
− HP Vertica Pulse Twitter Connector
− In 2014
− Total of 1.2 million tweets
Data Pruning
 Remove
− Job postings
• #job, #jobs, #tweetmyjob
− Links
• http://this.is/nogood
− Duplicates
− Twitter specific characters
• RT, @, #
− Emoticons
• “I hate my life :-)” (sarcasm is a widespread disease)
 After pruning
− ~287,000 tweets, 24% of the 1.2 million tweets
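The pruning rules above could be sketched roughly as follows. The regular expressions, the function name `prune`, and the sample tweets are illustrative assumptions. The emoticon pattern only covers a few simple ASCII forms, and this is not the actual pruning code used for the talk.

```python
import re

JOB_TAG = re.compile(r"#(?:jobs?|tweetmyjob)\b", re.IGNORECASE)
LINK = re.compile(r"https?://\S+")
# A few simple ASCII emoticons such as :-) ;) :( =D; real data needs more.
EMOTICON = re.compile(r"(?<!\S)[:;=][-']?[)(DPp/\\|\]\[]")
RETWEET = re.compile(r"\bRT\b")

def prune(tweets):
    """Drop job postings and duplicates; strip links, emoticons, RT, @ and #."""
    seen = set()
    kept = []
    for tweet in tweets:
        if JOB_TAG.search(tweet):
            continue  # job posting
        text = LINK.sub(" ", tweet)
        text = EMOTICON.sub(" ", text)
        text = RETWEET.sub(" ", text)
        text = text.replace("@", "").replace("#", "")
        text = " ".join(text.split())  # collapse leftover whitespace
        key = text.lower()
        if text and key not in seen:  # case-insensitive duplicate check
            seen.add(key)
            kept.append(text)
    return kept

sample = [
    "RT @lina I love this #phone :-) http://this.is/nogood",
    "We are hiring! #jobs http://this.is/nogood",
    "I love this phone",
    "i love this phone",
]
print(prune(sample))  # ['lina I love this phone', 'I love this phone']
```

Duplicates are detected after cleaning, so two tweets that differ only by a link or a hashtag marker collapse into one.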
Naïve Bayes Classifier
 Supervised learning
− Probabilistic classifier based on Bayes’ theorem
− Requires only a small amount of training data
− Assumes the presence/absence of a particular feature of a class is unrelated to the presence/absence of any other feature
− Classifies an object based on the features it contains
$$P(C_j \mid D) = \frac{P(D \mid C_j)\,P(C_j)}{P(D)}$$
− Open-source implementation available at [nltk.org]
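The talk points to the NLTK implementation. As a self-contained illustration of the same idea, the sketch below implements a small multinomial Naïve Bayes classifier using only the standard library. The class name, the add-one smoothing, and the four training sentences are assumptions chosen for the example; they are not the training setup used in the talk.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # log P(C_j) + sum of log P(word | C_j). P(D) is identical for
            # every class, so it can be dropped when comparing classes.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy training data.
docs = ["i love my new car", "great fast car",
        "i hate this slow car", "terrible noisy engine"]
labels = ["pos", "pos", "neg", "neg"]
model = NaiveBayes().fit(docs, labels)
print(model.predict("i love this great car"))  # pos
print(model.predict("terrible slow engine"))   # neg
```

Multiplying per-word probabilities is where the independence assumption from the slide enters: each word contributes to the score without regard to the others.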
Naïve Bayes Classifier
 Results:
− Final accuracy: 0.788
Tuning Pulse
 Positive words
 Negative words
 Neutral words
 White lists
 Stop words
 Synonym mappings
Accuracy Comparison
 Sentiment scores generated for each phase
Keyword      Ideal     Original   Pruning   Training   Tuning
Healthcare   -0.1515   -0.0333    -0.0833   -0.1       -0.125
Obama         0.0944    0.1535     0.1535    0.1842     0.308
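One way to read these numbers is to compute each phase's distance from the ideal score. The sketch below does only that arithmetic on the values from the slide. The helper name `absolute_errors` and the choice of absolute error as the measure are assumptions.

```python
PHASES = ["Original", "Pruning", "Training", "Tuning"]
IDEAL = {"Healthcare": -0.1515, "Obama": 0.0944}
SCORES = {
    "Healthcare": [-0.0333, -0.0833, -0.1, -0.125],
    "Obama": [0.1535, 0.1535, 0.1842, 0.308],
}

def absolute_errors(keyword):
    """Absolute difference between each phase's score and the ideal score."""
    return {
        phase: round(abs(score - IDEAL[keyword]), 4)
        for phase, score in zip(PHASES, SCORES[keyword])
    }

for keyword in SCORES:
    print(keyword, absolute_errors(keyword))
```

Taking the slide's figures at face value, the Healthcare score moves closer to the ideal at every phase (0.1182, 0.0682, 0.0515, 0.0265), while the Obama score moves further away (0.0591, 0.0591, 0.0898, 0.2136).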
Trend/Targeted Analysis
 Targeted dataset analysis can help improve accuracy
 Identify the most-mentioned words
− Use the most-recurrent words to narrow the scope of analysis
 Find new trends
− Government healthcare (2009) vs. Obamacare (2014)
 Are we looking at the targeted data?
− “Solve healthcare challenges with technology!”
− “Healthcare After ObamaCare”
− “Get affordable healthcare at HealthCare.gov”
Generating Tree
 Increase the relevancy of the sentiment score by running the sentiment analysis on the entity, as well as on the most-recurrent words, to identify:
− Homonyms that machines do not understand
− More accurate scores based on user interest
 Generate the tree using text search
− Merge stemmed word forms (e.g., query, queries, querying…)
− Lucene (Apache open source)
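The talk used Lucene for text search and stemming. As a rough stand-in, the sketch below counts the words that most often appear alongside an entity and merges word forms with a deliberately crude suffix stripper. The stop-word list, the stemming rules, and the function names are all assumptions; a real stemmer such as Lucene's behaves differently.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "after", "at", "get", "the", "with"}  # illustrative only

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

def most_mentioned(tweets, entity, top_n=3):
    """Count stemmed words co-occurring with `entity` (given in stemmed form)."""
    counts = Counter()
    for tweet in tweets:
        words = {crude_stem(w) for w in re.findall(r"[a-z]+", tweet.lower())}
        if entity in words:
            # Each word counts once per tweet, however often it is repeated.
            counts.update(w for w in words
                          if w != entity and w not in STOP_WORDS)
    return counts.most_common(top_n)

# Hypothetical tweets: query, queries and querying merge into one entry.
tweets = ["healthcare query", "healthcare queries", "querying healthcare costs"]
print(most_mentioned(tweets, "healthcare", top_n=1))  # [('query', 3)]
```

Merging the forms first is what lets "query" rise to the top; counted separately, each form would appear only once.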
Tree View
healthcare
obamacare
obama
!(Obamacare)
!(Obama)
health
!(health)
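One possible reading of the tree view is that tweets mentioning the root entity are split by whether they also contain each most-mentioned word, so that every branch can be scored separately. The sketch below builds that split as nested dictionaries. The layout, the function names, and the whole-word matching are assumptions, and the exact shape of the tree on the slide may differ.

```python
import re

def words_of(tweet):
    """Lowercased set of alphabetic tokens in a tweet."""
    return set(re.findall(r"[a-z]+", tweet.lower()))

def build_tree(tweets, root, branch_words):
    """Split tweets that mention `root` by presence or absence of each word."""
    root_tweets = [t for t in tweets if root in words_of(t)]
    tree = {}
    for word in branch_words:
        tree[word] = [t for t in root_tweets if word in words_of(t)]
        tree[f"!({word})"] = [t for t in root_tweets if word not in words_of(t)]
    return {root: tree}

# The three example tweets from the targeted-analysis slide.
tweets = [
    "Solve healthcare challenges with technology!",
    "Healthcare After ObamaCare",
    "Get affordable healthcare at HealthCare.gov",
]
tree = build_tree(tweets, "healthcare", ["obamacare"])
print(tree["healthcare"]["obamacare"])  # ['Healthcare After ObamaCare']
print(len(tree["healthcare"]["!(obamacare)"]))  # 2
```

Matching on whole tokens rather than substrings matters here: with substring matching, "health" would match every tweet containing "healthcare".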
Thank you 
Questions?
Many thanks to*:
Tim Donar, Solution Engineer
Beth Favini, Tech Pubs Sr. Manager
Judith Plummer, Tech Pubs Editor in Chief
* In alphabetical order
Got Feedback?
Rate and Review the session using the
GHC Mobile App
To download visit www.gracehopper.org