CEEC_Text_Analytics_Tutorial_Labs_Intro_Sentimentx

TEXT ANALYTICS - LABS
Maha Althobaiti
Udo Kruschwitz
Massimo Poesio
LABS
• Basic text analytics: text classification using
bags-of-words
– Sentiment analysis of tweets using Python’s scikit-learn
library
• More advanced text analytics: information
extraction using NLP pipelines
– Named Entity Recognition
Sentiment analysis using SciKit Learn
• Materials for this part of the tutorial:
– http://csee.essex.ac.uk/staff/poesio/Teach/TextAnalyticsTutorial/SentimentLab
– Based on: chap. 6 of
TEXT ANALYTICS IN PYTHON
• Text manipulation is not quite as easy in Python
as in Perl, but a number of useful packages are
available
– SCIKIT-LEARN for machine learning, including basic
text classification
– NLTK for NLP processing, including libraries for
tokenization, POS tagging, chunking, parsing, NE
recognition; also support for ML-based methods,
e.g., for text classification
SCIKIT-LEARN
• An open-source library supporting machine learning
work
– Based on numpy, scipy, and matplotlib
• Provides implementations of
– Several supervised ML algorithms, including regression,
Naïve Bayes, and SVMs
– Clustering
– Dimensionality reduction
• Includes several facilities to support text classification,
including ways to create NLP pipelines out of components
• Website:
– http://scikit-learn.org/stable/
REMINDER:
SENTIMENT ANALYSIS
• (or opinion mining)
• Develop algorithms that can identify the
‘sentiment’ expressed by a text
– Product X sucks
– I was mesmerized by film Y
SENTIMENT ANALYSIS AS
TEXT CATEGORIZATION
• Sentiment analysis can be viewed as just another type
of text categorization, like spam detection or topic
classification
• Most successful approaches use SUPERVISED
LEARNING:
– Use corpora annotated for subjectivity and/or sentiment
– To train models using supervised machine learning
algorithms:
• Naïve Bayes
• Decision trees
• SVM
• Good results can already be obtained using only
WORDS as features
TEXT CATEGORIZATION USING A NAÏVE
BAYES, WORD-BASED APPROACH
• Attributes are text positions, values are words.
$$
c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)
$$
$$
c_{NB} = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{``our''} \mid c_j) \cdots P(x_n = \text{``text''} \mid c_j)
$$
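The decision rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the four-document toy corpus and the helper names train_nb and classify_nb are invented for the example and are not part of the lab scripts. Log-probabilities replace the raw product to avoid numerical underflow, and add-one smoothing keeps unseen word/class pairs from producing a zero probability.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, label). Returns priors, per-class word counts, vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def classify_nb(words, priors, word_counts, vocab):
    """Return the class maximizing log P(c) + sum_i log P(x_i | c)."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for w in words:
            if w not in vocab:
                continue  # words never seen in training are skipped
            # Add-one smoothed estimate of P(w | c)
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy corpus, for illustration only
toy_docs = [
    ("i love this phone".split(), "positive"),
    ("great film i was mesmerized".split(), "positive"),
    ("this product sucks".split(), "negative"),
    ("i hate this film".split(), "negative"),
]
model = train_nb(toy_docs)
print(classify_nb("i love this film".split(), *model))
```

On this toy corpus the query "i love this film" comes out as positive, because "love" occurs only in the positive documents.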
SENTIMENT ANALYSIS OF TWEETS
• A very popular application of sentiment analysis
is trying to extract sentiment towards products or
organizations from people’s comments about
them on Twitter
• Several datasets are available for this
– E.g., SEMEVAL-2014
• In this lab: Niek Sanders’s dataset
– 5000 Tweets
– Annotated as positive / negative / neutral / irrelevant
– List of ID / sentiment pairs, plus a script to download
the tweets on the basis of their ID
First Script
Start an IDLE window
Open the file: 01_start.py (but do not run it yet!!)
A word-based, Naïve Bayes sentiment
analyzer using SciKit-Learn
• The module sklearn.naive_bayes includes
implementations of three Naïve Bayes
classifiers
– GaussianNB (for features that have a Gaussian
distribution, e.g., physical traits such as height)
– MultinomialNB (when features are frequencies of
words)
– BernoulliNB (for boolean features)
• For sentiment analysis: MultinomialNB
Creating the model
• The words contained in the tweets are used as features. They are
extracted and weighted by the function
create_ngram_model
– create_ngram_model uses the class TfidfVectorizer
from the module sklearn.feature_extraction.text to extract
terms from tweets
• http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
• create_ngram_model uses MultinomialNB to learn a
classifier
– http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
• The Pipeline class of scikit-learn is used to combine the
feature extractor and the classifier in a single object (an
estimator) that can be used to extract features from data,
create (‘fit’) a model, and use the model to classify
– http://scikit-
Tweet term extraction & classification
[Code screenshot of create_ngram_model, with callouts: "Extracts features and weights them" (TfidfVectorizer); "Naïve Bayes classifier" (MultinomialNB); "Creates Pipeline"]
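Since the code on this slide did not survive as text, here is a minimal sketch of what a function like create_ngram_model could look like. It is a reconstruction from the description above, not the lab's actual 01_start.py: the step names "vect" and "clf" are taken from the parameter names used later in the grid search, default parameter values are assumed, and the four toy tweets are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    # Feature extractor: tokenizes the tweets and weights terms by tf-idf
    tfidf_ngrams = TfidfVectorizer()
    # Classifier: multinomial Naive Bayes over the weighted term features
    clf = MultinomialNB()
    # Pipeline: a single estimator that vectorizes, then classifies
    return Pipeline([("vect", tfidf_ngrams), ("clf", clf)])

# Hypothetical toy data, for illustration only
tweets = ["I love my new phone", "What a great update",
          "This update sucks", "I hate this phone"]
labels = ["positive", "positive", "negative", "negative"]

model = create_ngram_model()
model.fit(tweets, labels)           # extracts features and trains the classifier
print(model.predict(["love great"]))
```

Because the pipeline is itself an estimator, the same fit / predict / score calls work on raw tweet strings without a separate vectorization step.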
Training and evaluation
• The function train_model
– Uses a class from the cross_validation module of
scikit-learn (renamed model_selection in later
releases), ShuffleSplit, to compute the
folds to use in cross-validation
– At each iteration, the function creates a model
using fit, then evaluates the results using
score
Creating a model
[Code screenshot of train_model, with callouts: "Identifies the indices in each fold"; "Trains the model"]
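The training loop described above might look roughly like the following. This is a sketch, not the lab's code: it assumes a current scikit-learn release, where ShuffleSplit lives in sklearn.model_selection and takes n_splits, and it uses an invented ten-tweet toy corpus. The returned mean and standard deviation of the per-split accuracy are a choice made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    return Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])

def train_model(clf_factory, X, Y, n_splits=10, test_size=0.3):
    X, Y = np.asarray(X), np.asarray(Y)
    # ShuffleSplit yields random train/test index arrays for each iteration
    cv = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=0)
    scores = []
    for train_idx, test_idx in cv.split(X):
        clf = clf_factory()                      # fresh model for every split
        clf.fit(X[train_idx], Y[train_idx])      # train on the training part
        scores.append(clf.score(X[test_idx], Y[test_idx]))  # accuracy on the held-out part
    return np.mean(scores), np.std(scores)

# Hypothetical toy data, for illustration only
tweets = ["I love my new phone", "great update today", "what a fantastic camera",
          "really happy with this tablet", "best service ever",
          "this update sucks", "I hate this phone", "terrible battery life",
          "worst service ever", "really unhappy with this tablet"]
labels = ["positive"] * 5 + ["negative"] * 5

mean_score, std_score = train_model(create_ngram_model, tweets, labels)
print("mean accuracy: %.3f (+/- %.3f)" % (mean_score, std_score))
```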
Execution
Optimization
• The program above uses the default values of the
parameters for TfidfVectorizer and
MultinomialNB
• In text analytics it’s usually easy to build a first
prototype, but lots of experimentation is needed to
achieve good results
• Alternative choices for TfidfVectorizer:
– Using unigrams, bigrams, trigrams (ngram_range parameter)
– Removing stopwords (stop_words parameter)
– Using binary (presence/absence) counts instead of frequencies (binary parameter)
• Alternative choices for MultinomialNB:
– Which type of SMOOTHING to use
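To make the TfidfVectorizer choices listed above concrete, the sketch below fits a default vectorizer and one with bigrams, English stop-word removal, and binary counts on two invented tweets, then prints the resulting vocabularies. The particular parameter values are picked only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy tweets, for illustration only
tweets = ["I love the new phone", "the new update is not good"]

# Default settings: unigrams only, no stop-word removal, term frequencies
default_vect = TfidfVectorizer()

# Alternative settings: unigrams + bigrams, English stop words removed,
# binary term presence instead of term frequency
bigram_vect = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", binary=True)

default_vect.fit(tweets)
bigram_vect.fit(tweets)

print(sorted(default_vect.vocabulary_))
print(sorted(bigram_vect.vocabulary_))
```

Note that removing stop words before building bigrams also drops "not", which can matter for sentiment; this is one reason such settings need to be evaluated rather than assumed.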
Smoothing
• Even a very large corpus remains a limited
sample of language use, so many words, even
common ones, never occur in it
– The problem is particularly common with tweets,
where a lot of ‘creative’ use of words is found
• Solution: SMOOTHING – redistribute the
probability mass so that every word gets some
• Most used: ADD ONE or LAPLACE smoothing
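As a small worked example of add-one smoothing: with count(w) occurrences of word w among N tokens and a vocabulary of V words, the smoothed estimate is (count(w) + 1) / (N + V). The sketch below uses an invented four-token sample and a hypothetical helper smoothed_prob; the alpha argument plays the same role as the alpha parameter of MultinomialNB.

```python
from collections import Counter

def smoothed_prob(word, counts, vocab_size, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * |V|)."""
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * vocab_size)

# Hypothetical sample: 4 observed tokens, vocabulary of 4 words ("gr8" never observed)
counts = Counter("good good great fine".split())
vocab = {"good", "great", "fine", "gr8"}

unsmoothed_gr8 = counts["gr8"] / sum(counts.values())   # 0/4: the unseen word gets zero
smoothed_gr8 = smoothed_prob("gr8", counts, len(vocab))  # (0+1)/(4+4)
smoothed_good = smoothed_prob("good", counts, len(vocab))  # (2+1)/(4+4)

print(unsmoothed_gr8, smoothed_gr8, smoothed_good)
```

The smoothed probabilities still sum to 1 over the vocabulary; mass has simply been moved from seen words to the unseen one.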
Optimization
• Looking for the best values for the parameters
is a standard operation in machine learning
• Scikit-learn, like Weka and similar packages,
provides a function (GridSearchCV) to explore
the results that can be achieved with different
parameter configurations
Optimizing with GridSearchCV
Putting everything together, we get the following code (the F-measure used for evaluation is implemented as metrics.f1_score):

Callouts on the slide:
– Note the syntax to specify the values of the parameters (pipeline step name, two underscores, parameter name)
– Which smoothing function to use
– Use F metric to evaluate

    # Module names and arguments are those of the scikit-learn release
    # used in the lab; later releases moved these classes to
    # sklearn.model_selection and changed some arguments.
    from sklearn.cross_validation import ShuffleSplit
    from sklearn.grid_search import GridSearchCV
    from sklearn.metrics import f1_score

    def grid_search_model(clf_factory, X, Y):
        # Ten random 70/30 train/test splits
        cv = ShuffleSplit(
            n=len(X), n_iter=10, test_size=0.3, indices=True, random_state=0)

        # Candidate values: "vect__" addresses the TfidfVectorizer step,
        # "clf__" the MultinomialNB step of the pipeline
        param_grid = dict(vect__ngram_range=[(1, 1), (1, 2), (1, 3)],
                          vect__min_df=[1, 2],
                          vect__stop_words=[None, "english"],
                          vect__smooth_idf=[False, True],
                          vect__use_idf=[False, True],
                          vect__sublinear_tf=[False, True],
                          vect__binary=[False, True],
                          clf__alpha=[0, 0.01, 0.05, 0.1, 0.5, 1],
                          )

        # Evaluate every combination with the F-measure
        grid_search = GridSearchCV(clf_factory(),
                                   param_grid=param_grid,
                                   cv=cv,
                                   score_func=f1_score,
                                   verbose=10)
        grid_search.fit(X, Y)

        return grid_search.best_estimator_
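The code on the slide targets an older scikit-learn release. For readers on a current release, a rough equivalent might look like the sketch below. It rests on several assumptions: GridSearchCV and ShuffleSplit are imported from sklearn.model_selection, the scoring="f1_macro" string replaces the old score_func argument (macro averaging is a choice made here, not something the slide specifies), the grid is passed in as an argument, and a much smaller grid and toy corpus are used so the example runs quickly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    return Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])

def grid_search_model(clf_factory, X, Y, param_grid, n_splits=10):
    # Random 70/30 train/test splits, as on the slide
    cv = ShuffleSplit(n_splits=n_splits, test_size=0.3, random_state=0)
    grid_search = GridSearchCV(clf_factory(),
                               param_grid=param_grid,
                               cv=cv,
                               scoring="f1_macro")  # macro-averaged F1 (assumption)
    grid_search.fit(X, Y)
    return grid_search.best_estimator_

# Hypothetical toy data and a deliberately reduced grid, for illustration only
tweets = ["I love my new phone", "great update today", "what a fantastic camera",
          "really happy with this tablet", "best service ever",
          "this update sucks", "I hate this phone", "terrible battery life",
          "worst service ever", "really unhappy with this tablet"]
labels = ["positive"] * 5 + ["negative"] * 5

small_grid = dict(vect__ngram_range=[(1, 1), (1, 2)],
                  vect__binary=[False, True],
                  clf__alpha=[0.01, 0.1, 1.0])

best = grid_search_model(create_ngram_model, tweets, labels, small_grid, n_splits=3)
print(best.get_params()["clf__alpha"], best.get_params()["vect__ngram_range"])
```

The full grid from the slide has 960 parameter combinations, each evaluated on ten splits, so on the real dataset the search takes considerably longer than this toy run.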
Second Script
Start an IDLE window
Open the file: 02_tuning.py (but do not run it yet!!)
Additional improvements:
normalization, preprocessing
• Further improvements may be possible by
doing some form of NORMALIZATION
Example of normalization: emoticons
Normalization: abbreviations
Adding a preprocessing step to
TfidfVectorizer
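The slides for this step contained only code screenshots, so the following is a hedged sketch of what such a normalization step could look like, not the content of 03_clean.py. The emoticon and abbreviation tables are hypothetical and deliberately tiny, and normalize_tweet is an invented name.

```python
import re

# Hypothetical replacement tables, for illustration only
EMOTICONS = {":)": " good ", ":-)": " good ", ";)": " good ",
             ":(": " bad ", ":-(": " bad "}
ABBREVIATIONS = {r"\br\b": "are", r"\bu\b": "you",
                 r"\bgr8\b": "great", r"\bluv\b": "love"}

def normalize_tweet(tweet):
    tweet = tweet.lower()
    # Map emoticons to sentiment-bearing words
    for emoticon, replacement in EMOTICONS.items():
        tweet = tweet.replace(emoticon, replacement)
    # Expand abbreviations; \b ensures only whole tokens are replaced
    for pattern, expansion in ABBREVIATIONS.items():
        tweet = re.sub(pattern, expansion, tweet)
    # Collapse the extra whitespace introduced by the replacements
    return " ".join(tweet.split())

print(normalize_tweet("U r gr8 :)"))   # → you are great good
```

A function like this can be plugged into the vectorizer as TfidfVectorizer(preprocessor=normalize_tweet). A custom preprocessor replaces the vectorizer's built-in preprocessing (lowercasing and accent stripping), which is why the function lowercases the text itself.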
Other possible improvements
• Using NLTK’s POS tagger
• Using a sentiment lexicon such as
SentiWordNet
– http://sentiwordnet.isti.cnr.it/download.php
– (in the data/ directory)
Third Script
(Start an IDLE window)
Open and run the file: 03_clean.py
Overall results
TO LEARN MORE
SCIKIT-LEARN
NLTK
http://www.nltk.org/book