Text Classification task on Yelp restaurant reviews
Thought
Yelp data
On Multi-Tier Sentiment Analysis using
Supervised Machine Learning
Yan Zhu
Agenda
Overview
Objective
Multi-tier Classification Architecture
Experiments And Results
Conclusion
Overview
Text mining is an important area of data mining.
Large sets of text data are transformed into numerical values and
linked with a knowledge database in order to summarize words and
determine similarities.
In this way, text mining is commonly used to help reveal
hidden patterns and make relevant information available to
various research interests.
Overview
Classification of large sets of text data is in high demand.
Supervised machine learning takes labeled input data from
various sources and trains a machine learning model on it.
Based on the built model, labels are predicted for new incoming
text data.
Overview
Sentiment analysis is an important branch of text analysis. It is
the field of study that analyzes sentiments and emotions.
The analysis finds the mood of people, and uses natural language
processing and data mining techniques to classify the sentiment.
There are different levels of sentiment analysis, including document,
sentence, and aspect levels.
Objective
This paper proposes a multi-tier classification architecture for
sentiment analysis.
The proposed architecture is implemented using four classifiers:
Naïve Bayes, SVM (Support Vector Machine), Random Forest, and
SGD (Stochastic Gradient Descent).
Each movie review is classified into one of five levels: very
negative, negative, neutral, positive, and very positive.
Multi-tier Classification Architecture
The classification system consists of seven modules (stages):
Data Collection & Cleaning
Data pre-processing
Training data
Feature Selection
Training the classifier (with prediction model, classifier)
Test set
Evaluation measures
Multi-tier Classification Architecture
Data Collection & Cleaning
The data must be cleaned so that useless or meaningless data is not
processed in later stages.
Multi-tier Classification Architecture
Data Pre-Processing
The data prepared in the previous stage is organized and partitioned
into datasets, which are then used for training and testing the
model.
Most machine learning workflows separate the train set and test set
using the 80-20 rule: 80% of the data is used for training and 20%
for testing.
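The 80-20 rule can be sketched in a few lines of plain Python. This is only an illustration, not the tooling used in the slides: the function name `split_80_20`, the `train_percent` parameter, and the toy `reviews` list are all assumptions made for the example.

```python
import random

def split_80_20(records, train_percent=80, seed=0):
    """Shuffle a copy of `records` and cut it into (train, test) lists.

    `train_percent` and `seed` are illustrative parameters, not values
    taken from the slides (apart from the 80/20 ratio itself).
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

# Toy data: (text, label) pairs with labels 0-4.
reviews = [("review %d" % i, i % 5) for i in range(100)]
train, test = split_80_20(reviews)
print(len(train), len(test))
```

A purely random split like this does not guarantee that every class is equally represented in the train set.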
Multi-tier Classification Architecture
Data Pre-Processing
Initially the data was split based on the unique id, but the ratio of
class labels was uneven. Such a split can lead to over-fitting and may
not be sufficient for training the model, resulting in unreliable
accuracy.
The Hive bucket split method is used to extract random samples from
each class.
The data is loaded into Hadoop, and Hive queries are used to
split the data into equal proportions.
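The idea of drawing equal random samples from each class can be illustrated without Hive. The sketch below swaps in plain Python for the Hive bucket split described above, so it shows the sampling idea only, not the actual queries; `sample_equal_per_class` and its parameters are hypothetical names.

```python
import random
from collections import Counter, defaultdict

def sample_equal_per_class(records, per_class, seed=0):
    """Draw `per_class` random (text, label) records from every label."""
    by_label = defaultdict(list)
    for text, label in records:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < per_class:
            raise ValueError("label %r has only %d records" % (label, len(group)))
        # Sampling without replacement keeps every drawn record unique.
        sample.extend(rng.sample(group, per_class))
    return sample

# Toy data with an uneven label ratio: label 2 dominates.
uneven = [("r%d" % i, 2) for i in range(50)]
uneven += [("r%d" % (50 + i), i % 5) for i in range(50)]

balanced = sample_equal_per_class(uneven, per_class=10)
print(Counter(label for _, label in balanced))
```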
Multi-tier Classification Architecture
Feature Selection
Feature selection is the process of selecting the features most
relevant to the classification task.
When relevant features are successfully selected, accuracy is
improved.
Training time is also reduced after removing all the unwanted
features.
Over-fitting is also decreased when noise and redundant data are
removed.
Multi-tier Classification Architecture
Feature Selection
Tokenization: break sentences into meaningful words and phrases.
Stop word removal: remove stop words to reduce noise in the data.
Stemming: remove the suffix from a word.
N-gram features: sequences of n consecutive words are treated as
single features.
Part-of-speech (POS) tagging: classify words into their parts of
speech and label them according to the tagset.
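A minimal sketch of the first four steps, using only the Python standard library. The stop word list and suffix list are tiny illustrative assumptions, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's; POS tagging is left out because it needs a trained tagger.

```python
import re

# Illustrative assumptions, far smaller than any real list.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "it"}
SUFFIXES = ("ing", "ed", "ly", "s")

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text, n=2):
    """Unigram stems followed by n-grams built from those stems."""
    tokens = [stem(t) for t in remove_stop_words(tokenize(text))]
    return tokens + ngrams(tokens, n)

print(extract_features("The acting was amazing"))
```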
Multi-tier Classification Architecture
Prediction Model
The prediction model is a vital part of the architecture for sentiment
classification.
It takes the labeled data and trains the classifier. The trained
model then predicts a label for each review instance in the
unclassified test set.
Multi-tier Classification Architecture
Prediction Model
In the single-tier prediction model, all the labeled data is treated
as a single tier and one classifier is trained on the entire dataset.
Reviews are classified into five sentiment levels, 0-4: 0: Very
Negative, 1: Negative, 2: Neutral, 3: Positive, and 4: Very Positive.
Multi-tier Classification Architecture
Prediction Model
The proposed architecture consists of three models.
Model-1: Trains the classifier on the whole review dataset, with the data
relabeled into 3 classes:
Negative {1} and Very Negative {0} are labeled as class ‘0’.
Neutral {2} is labeled as class ‘2’.
Positive {3} and Very Positive {4} are labeled as class ‘4’.
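The Model-1 relabeling amounts to a small lookup table. A sketch, with `TIER1_LABEL` and `to_tier1` as hypothetical names chosen for the example:

```python
# Original 5-level label -> Model-1 class, following the mapping above.
TIER1_LABEL = {
    0: 0,  # Very Negative -> class 0
    1: 0,  # Negative      -> class 0
    2: 2,  # Neutral       -> class 2
    3: 4,  # Positive      -> class 4
    4: 4,  # Very Positive -> class 4
}

def to_tier1(label):
    """Collapse a 0-4 sentiment label into the 3 classes used by Model-1."""
    return TIER1_LABEL[label]

print([to_tier1(label) for label in range(5)])
```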
Multi-tier Classification Architecture
Prediction Model
Model-2: Trains the classifier on a train set consisting of Negative and
Very Negative labels.
The test instances provided to this model are only the reviews classified
as negative by Model-1.
Custom dictionaries that distinguish negative from very negative are used.
Multi-tier Classification Architecture
Prediction Model
Model-3: Trains the classifier on a train set consisting of Positive and
Very Positive labels.
The test instances provided to this model are only the reviews classified
as positive by Model-1.
Multi-tier Classification Architecture
This two-tier approach helps improve classifier accuracy because the
complexity of each classification task is reduced.
The probability of finding the correct label increases because each model
is trained on a smaller, more homogeneous, and more focused dataset.
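The routing between the three models can be sketched as follows. The models are passed in as plain callables, and the keyword-based `toy_model*` functions are hypothetical stand-ins used only to make the example runnable; they are not the trained classifiers from the slides.

```python
def predict_two_tier(review, model1, model2, model3):
    """Route a review through Model-1, then Model-2 or Model-3.

    model1 returns 0 (negative side), 2 (neutral) or 4 (positive side);
    model2 returns 0 or 1; model3 returns 3 or 4.
    """
    coarse = model1(review)
    if coarse == 0:
        return model2(review)  # Very Negative (0) vs Negative (1)
    if coarse == 4:
        return model3(review)  # Positive (3) vs Very Positive (4)
    return 2                   # Neutral needs no second tier

# Toy keyword rules standing in for trained classifiers (assumption).
def toy_model1(review):
    if "bad" in review or "awful" in review:
        return 0
    if "good" in review or "great" in review:
        return 4
    return 2

def toy_model2(review):
    return 0 if "awful" in review else 1

def toy_model3(review):
    return 4 if "great" in review else 3

for text in ["awful plot", "bad plot", "a plot", "good plot", "great plot"]:
    print(text, predict_two_tier(text, toy_model1, toy_model2, toy_model3))
```

Because the second-tier models only see reviews that Model-1 routed to them, a Model-1 error cannot be corrected later in this sketch.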
Experiments And Results
The Experiment Data and Features
Total size: 150,000
Training set: 120,000
Test set: 30,000
13,000 unique words considered as features
Experiments And Results
Results of Single-Tier Approach
Experiments And Results
Multi Tier Approach
Naive Bayes using Mahout
Experiments And Results
Multi Tier Approach
SVM classifier using svmlight
The c and ε parameters are used to tune the model, and results improve
with the new architecture compared to the single-tier approach.
This classifier reached an accuracy of 81.27% using the multi-tier
architecture, an improvement of over 7%.
Experiments And Results
Multi Tier Approach
Random Forest using Scikit Learn
Experiments And Results
Multi Tier Approach
SGD classifier using Scikit Learn
Experiments And Results
Multi Tier Approach
More Experiments by adding custom dictionaries
Collect the most frequently occurring words from each label (negative,
very negative) to distinguish negative from very negative more accurately.
The same is done for Model 3.
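One possible way to build such a dictionary is sketched below. The slides do not say how the word lists were refined, so keeping only the frequent words that are not also frequent in the other label is an assumption of this example, as are the function names and the toy reviews.

```python
from collections import Counter

def top_words(reviews, k):
    """Return the k most frequent whitespace-separated words."""
    counts = Counter(word for review in reviews for word in review.lower().split())
    return [word for word, _ in counts.most_common(k)]

def distinguishing_words(reviews_a, reviews_b, k=20):
    """Frequent words of each group that are not frequent in the other."""
    top_a = top_words(reviews_a, k)
    top_b = top_words(reviews_b, k)
    only_a = [w for w in top_a if w not in top_b]
    only_b = [w for w in top_b if w not in top_a]
    return only_a, only_b

# Toy reviews (illustrative only).
negative = ["boring plot", "boring and slow"]
very_negative = ["awful plot", "awful awful mess"]

neg_dict, very_neg_dict = distinguishing_words(negative, very_negative)
print(neg_dict, very_neg_dict)
```

Words shared by both labels, such as "plot" here, are dropped because they do not help separate the two classes.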
Experiments And Results
Multi Tier Approach vs Single Tier Approach
The multi-tier architecture is able to
significantly improve the accuracy.
Among these four methods, SGD Classifier
with Scikit learn, using custom and
refined dictionaries, has provided the best
results in the multitier architecture.
Conclusion
A multi-tier classification architecture for sentiment analysis has been
proposed. It includes a multi-tier prediction model, which applies various
supervised machine learning methods to predict sentiment levels.
The work demonstrates ways to fine-tune parameters, as well as techniques
to reduce features for further improvement.
The approach may also increase the accuracy level in other multi-class
text classification problems.
Further Thought
Features
Word as Feature
Markov Blanket
Bag of Words
N-grams
Score Representation
Further Thought
Should we also add models for Neutral, Positive and Negative reviews?
How about adding another layer to combine results from different classifiers?
NB
SVM
RF
SGD
Majority vote? Another model?
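The majority-vote option could look like the sketch below. This is only one way to combine the four classifiers' outputs; the tie-breaking rule (prefer the earliest classifier in the list) is an assumption of the example, not something proposed in the slides.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    `predictions` is an ordered list, e.g. [NB, SVM, RF, SGD].
    On a tie, the label from the earliest classifier among the tied
    labels wins (an illustrative choice).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

print(majority_vote([3, 3, 4, 2]))
```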