Text Classification task on Yelp restaurant reviews


Thought
 Yelp data
On Multi-Tier Sentiment Analysis using
Supervised Machine Learning
Yan Zhu
Agenda
 Overview
 Objective
 Multi-tier Classification Architecture
 Experiments And Results
 Conclusion
Overview
 Text mining is an important area of data mining.
 Large sets of text data are transformed into numerical values and
linked with a knowledge database in order to summarize words and
determine similarities.
 In this way, text mining is commonly used to help represent
hidden patterns and make relevant information available to
various research interests.
Overview
 Classification of large sets of text data is in high demand.
 Supervised machine learning takes labeled input data from various
sources and uses it to train a machine learning model.
 Based on the built model, labels are predicted for new incoming
text data.
Overview
 Sentiment analysis is an important branch of text analysis. It is
the field of study that analyzes sentiments and emotions.
 The analysis finds the mood of people, and uses natural language
processing and data mining techniques to classify the sentiment.
 There are different levels of sentiment analysis, including document,
sentence, and aspect levels.
Objective
 This paper proposes a multi-tier classification architecture for
sentiment analysis.
 The proposed architecture is implemented using four classifiers:
Naïve Bayes, SVM (Support Vector Machine), Random Forest, and
SGD (Stochastic Gradient Descent).
 Each movie review is classified into one of five levels: very
negative, negative, neutral, positive, and very positive.
Multi-tier Classification Architecture
 The classification system consists of seven modules (stages):
 Data Collection & Cleaning
 Data pre-processing
 Training data
 Feature Selection
 Training the classifier (with prediction model, classifier)
 Test set
 Evaluation measures
Multi-tier Classification Architecture
 Data Collection & Cleaning
 The data must be cleaned so that useless or meaningless data is not
processed in later stages.
Multi-tier Classification Architecture
 Data Pre-Processing
 The data prepared in the previous stage has to be organized and
partitioned into datasets, which are then used for training the model
and for testing it.
 In most machine learning preprocessing methods, the training set
and test set are separated using the 80-20 rule: 80% of the data is
used for training and 20% for testing.
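The 80-20 rule can be sketched in a few lines of Python. This is only an illustration: the function name, the fixed seed, and the toy reviews are assumptions, not part of the original experiments.

```python
import random

def train_test_split_80_20(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)        # reproducible shuffle
    cut = round(len(shuffled) * train_fraction)  # index where the test part starts
    return shuffled[:cut], shuffled[cut:]

# Toy data: (review text, sentiment label 0-4); purely illustrative.
reviews = [(f"review {i}", i % 5) for i in range(100)]
train, test = train_test_split_80_20(reviews)
print(len(train), len(test))
```

A plain random split like this does not guarantee that the class labels stay balanced across the two parts.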
Multi-tier Classification Architecture
 Data Pre-Processing
 Initially we split the data based on the unique id, but the resulting
ratio of class labels was uneven. This can lead to over-fitting and to
too little training data for some classes, which may result in
unreliable accuracy.
 The Hive bucket split method is used instead to extract random
samples from each class.
 The data is loaded into Hadoop, and Hive queries are used to
split the data into equal proportions.
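The slides do not show the Hive queries themselves. As a stand-in, the same idea (drawing an equal-sized random sample from every class) can be sketched in plain Python. The function name, sample sizes, and toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict

def balanced_sample(records, per_class, seed=0):
    """Draw the same number of random records from every class label."""
    by_label = defaultdict(list)
    for text, label in records:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < per_class:
            raise ValueError(f"class {label} has only {len(group)} records")
        sample.extend(rng.sample(group, per_class))  # sampling without replacement
    return sample

# Deliberately skewed toy data: class 4 is far more common than class 0.
skewed = [(f"r{i}", 4) for i in range(60)] + [(f"s{i}", 0) for i in range(15)]
balanced = balanced_sample(skewed, per_class=10)
print(len(balanced))
```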
Multi-tier Classification Architecture
 Feature Selection
 Feature selection is the process of selecting the features that are
most relevant to the classification task.
 When relevant features are successfully selected, accuracy is
improved.
 Training time is also reduced after removing all the unwanted
features.
 Over-fitting is also decreased when noise and redundant data are
removed.
Multi-tier Classification Architecture
 Feature Selection
 Tokenization: break the sentences into meaningful words and phrases
 Stop word removal: remove stop words to reduce noise in the data
 Stemming: strip suffixes from a word to reduce it to its stem.
 N-gram features: sequences of n consecutive words are used as features.
 Part-of-Speech (POS) tagging: classify words into their parts of speech
and label them according to the tagset.
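The first four steps can be sketched with the Python standard library. The stop-word list and the suffix-stripping rule below are tiny illustrative stand-ins (a real system would use a full stop-word list and a proper stemmer), and POS tagging is left out because it needs a trained tagger.

```python
import re

# Assumed, deliberately tiny stop-word list and suffix list for illustration.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "it"}
SUFFIXES = ("ing", "ed", "ly", "s")

def tokenize(text):
    """Lowercase the text and break it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude stemmer: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text, n=2):
    tokens = [stem(t) for t in remove_stop_words(tokenize(text))]
    return tokens + ngrams(tokens, n)  # unigrams followed by n-grams

features = extract_features("The acting was amazing and the plot moved quickly")
print(features)
```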
Multi-tier Classification Architecture
 Prediction Model
 The prediction model is a vital part of the architecture for sentiment
classification.
 It takes the labeled data and trains the classifier. The trained
model then predicts a label for each review instance in the
unclassified test set.
Multi-tier Classification Architecture
 Prediction Model
 In the single-tier prediction model, all the labeled data is treated
as a single tier and the classifier is trained on the entire dataset.
 Reviews are classified into five sentiment levels, 0-4: 0: Very
Negative, 1: Negative, 2: Neutral, 3: Positive, and 4: Very Positive.
Multi-tier Classification Architecture
 Prediction Model

 The proposed architecture consists of three models.
 Model-1: Trains the classifier on the whole review dataset, with the
data relabeled into 3 classes:
 Negative {1} and Very Negative {0} → labeled as class ‘0’.
 Neutral {2} → labeled as class ‘2’.
 Positive {3} and Very Positive {4} → labeled as class ‘4’.
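The relabeling for Model-1 amounts to a fixed lookup from the five original levels to three coarse classes. A minimal sketch, with hypothetical function names and toy records:

```python
# Mapping taken from the slide: original level -> Model-1 class.
FIVE_TO_THREE = {0: 0, 1: 0, 2: 2, 3: 4, 4: 4}

def to_model1_label(label):
    """Collapse a 0-4 sentiment level into the coarse class used by Model-1."""
    return FIVE_TO_THREE[label]

def relabel_for_model1(records):
    """Relabel (text, label) records without modifying the input list."""
    return [(text, to_model1_label(label)) for text, label in records]

# Toy records, purely illustrative.
records = [("awful", 0), ("weak", 1), ("okay", 2), ("nice", 3), ("superb", 4)]
print(relabel_for_model1(records))
```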
Multi-tier Classification Architecture
 Prediction Model

 Model-2: Trains the classifier on a training set consisting only of
the Negative and Very Negative labels.
 The test instances provided to this model are only the reviews that
Model-1 classified as negative.
 Custom dictionaries that distinguish negative from very negative are used.
Multi-tier Classification Architecture
 Prediction Model

 Model-3: Trains the classifier on a training set consisting only of
the Positive and Very Positive labels.
 The test instances provided to this model are only the reviews that
Model-1 classified as positive.
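Building the training sets for the two second-tier models is a matter of filtering the labeled data by its original 0-4 level. A small sketch, where the function name and the toy records are assumptions:

```python
def tier2_training_sets(records):
    """Split (text, label) records into the Model-2 and Model-3 training sets."""
    model2_train = [(t, y) for t, y in records if y in (0, 1)]  # very negative / negative
    model3_train = [(t, y) for t, y in records if y in (3, 4)]  # positive / very positive
    return model2_train, model3_train

# Toy records, purely illustrative.
labeled = [("awful", 0), ("weak", 1), ("okay", 2), ("nice", 3), ("superb", 4)]
m2, m3 = tier2_training_sets(labeled)
print(m2)
print(m3)
```

Neutral reviews (level 2) appear in neither set, since they are resolved by the first tier alone.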
Multi-tier Classification Architecture
 This two-tier approach helps improve classifier accuracy because the
complexity of each classification task is reduced.
 The probability of finding the correct label increases because each
model is now trained on a smaller, more homogeneous, and more focused
dataset.
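At prediction time, the two tiers can be chained as in the sketch below. The three models are passed in as plain callables; the keyword-based stubs are hypothetical placeholders for trained classifiers and exist only so the example runs.

```python
def predict_two_tier(review, model1, model2, model3):
    """Route a review through tier 1, then through the matching tier-2 model."""
    coarse = model1(review)    # 0 = negative side, 2 = neutral, 4 = positive side
    if coarse == 0:
        return model2(review)  # expected to return 0 (very negative) or 1 (negative)
    if coarse == 4:
        return model3(review)  # expected to return 3 (positive) or 4 (very positive)
    return 2                   # neutral needs no second tier

# Hypothetical keyword stubs standing in for trained classifiers.
def stub_model1(review):
    if "bad" in review or "awful" in review:
        return 0
    if "good" in review or "great" in review:
        return 4
    return 2

def stub_model2(review):
    return 0 if "awful" in review else 1

def stub_model3(review):
    return 4 if "great" in review else 3

for text in ["awful food", "bad service", "it was fine", "good pasta", "great place"]:
    print(text, "->", predict_two_tier(text, stub_model1, stub_model2, stub_model3))
```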
Experiments And Results
 The Experiment Data and Features
Total Size    Training set    Test set
150,000       120,000         30,000
 13,000 unique words considered as features
Experiments And Results
 Results of Single-Tier Approach
Experiments And Results
 Multi Tier Approach
 Naive Bayes using Mahout
Experiments And Results
 Multi Tier Approach
 SVM classifier using svmlight
 The c and ε parameters are used to tune the model, and results improve
with the new architecture compared to the single-tier approach.
 This classifier achieved an accuracy of 81.27% using the multi-tier
architecture, an improvement of over 7%.
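The slides do not spell out how accuracy is computed. Assuming the standard definition (the fraction of test reviews whose predicted label matches the true label), it can be sketched as follows. The label lists are made-up toy values, not results from the experiments.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of instances whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Toy labels on the 0-4 scale; 6 of the 8 predictions match.
y_true = [0, 1, 2, 3, 4, 4, 3, 2]
y_pred = [0, 1, 2, 3, 4, 3, 3, 0]
print(f"{accuracy(y_true, y_pred):.2%}")
```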
Experiments And Results
 Multi Tier Approach
 Random Forest using Scikit Learn
Experiments And Results
 Multi Tier Approach
 SGD classifier using Scikit Learn
Experiments And Results
 Multi Tier Approach
 More Experiments by adding custom dictionaries
 Collect the most frequently occurring words from the negative and very
negative classes to distinguish negative from very negative more accurately.
 The same is done for Model-3 (positive vs. very positive).
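One possible way to build such a dictionary is sketched below. The slides do not say how words were scored, so ranking by the difference in raw counts between the two classes is an assumption, as are the function name and the toy reviews.

```python
from collections import Counter

def build_custom_dictionary(docs_a, docs_b, top_k=3):
    """Return the top_k words that occur more often in docs_a than in docs_b."""
    counts_a = Counter(w for d in docs_a for w in d.lower().split())
    counts_b = Counter(w for d in docs_b for w in d.lower().split())
    # Keep only words that are strictly more frequent in class A.
    margin = {w: c - counts_b[w] for w, c in counts_a.items() if c > counts_b[w]}
    # Largest margin first; ties broken alphabetically for a stable result.
    ranked = sorted(margin.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]

# Toy reviews, purely illustrative.
very_negative = ["worst meal ever", "worst service horrible", "horrible and disgusting"]
negative = ["service was slow", "meal was bland", "slow and bland"]

print(build_custom_dictionary(very_negative, negative))
print(build_custom_dictionary(negative, very_negative))
```

Words that are equally common in both classes, such as "service" here, are dropped because they do not help separate the two labels.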
Experiments And Results
 Multi Tier Approach vs Single Tier Approach
 The multi-tier architecture is able to significantly improve the accuracy.
 Among these four methods, SGD Classifier with Scikit learn, using custom
and refined dictionaries, has provided the best results in the multi-tier
architecture.
Conclusion
 A multi-tier classification architecture for sentiment analysis has been
proposed. It includes a multi-tier prediction model, which applies various
supervised machine learning methods to predict sentiment levels.
 Ways to fine-tune parameters, as well as techniques to reduce
features for further improvement, have been demonstrated.
 The approach can increase accuracy in other multi-class text
classification problems.
Further Thought
 Features
 Word as Feature
 Markov Blanket
 Bag of Words
 N-grams
 Score Representation
Further Thought
 Should we also add models for Neutral, Positive, and Negative reviews?
 How about adding another layer to combine the results from the different
classifiers (NB, SVM, RF, SGD), by majority vote or with another model?
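The majority-vote idea can be sketched as follows. This is a hypothetical combination layer, not something evaluated in the slides; the tie-breaking rule (the earliest-listed classifier wins) is an arbitrary assumption.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label; ties go to the earliest-listed classifier."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:  # scan in classifier order to break ties
        if counts[label] == best:
            return label

# Hypothetical per-classifier predictions for one review, in the order NB, SVM, RF, SGD.
print(majority_vote([3, 4, 3, 3]))
print(majority_vote([0, 1, 1, 0]))
```

With an even number of classifiers, ties are possible, so the order in which the classifiers are listed matters under this rule.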