Transcript ARSA2

ARSA: A Sentiment-Aware Model
for Predicting Sales
Performance Using Blogs
Yang Liu, Xiangji Huang, Aijun An and Xiaohui Yu
Department of Computer Science and Engineering
York University, Toronto, Canada
School of Information Technology
York University, Toronto, Canada
SIGIR 2007
Introduction
• What the general public thinks of a product can no doubt influence how well it sells
• Blogs can be commentaries or discussions on a
particular subject
– Ranging from mainstream topics to highly personal
interests
• This paper studies the predictive power of
opinions and sentiments expressed in blogs
– Focus on the blogs that contain reviews on products
• Blogs serve as a very good indicator of the
product’s future sales performance
Introduction (cont.)
• Developing models and algorithms that can
– mine opinions and sentiments from blogs
– use them for predicting product sales
• Investigate how to predict box office revenues
of movies using the sentiment information
obtained from blog mentions
Why Use Movies?
• data availability
– daily box office revenue data are all published on
the Web (IMDB) and readily available
• expect the models and algorithms to be easily
adapted to handle other types of products
that are subject to online discussions
– books, music CDs and electronics
Introduction (cont.)
• S-PLSA Model
– sentiment mining based on Probabilistic Latent
Semantic Analysis
– Appraisal words are exploited to compose the
feature vectors for blogs
• ARSA Model (Autoregressive Sentiment-Aware)
– AR model
• accounts for past sales performance
– combined with sentiment information mined from the blogs
Related Work
• Sentiment mining
– focuses on determining the semantic orientations
of documents
• machine learning approaches
– evaluate the semantic distance from a word to
good/bad with WordNet
• Blog mining
– making use of links or URLs in Blogspace
– analyzing the contents of blogs
Characteristics of Online
Discussion
• based on the number of blog mentions alone
– it is very difficult to make a successful prediction of sales ranks
• two movies both released on May 19, 2006
– The Da Vinci Code
– Over the Hedge
• use the name of each movie as a query to a blog search engine
– queries issued with a fixed time stamp
– starting from one week before the movie release until three weeks after the release
• use the number of returned results for a particular
date as a rough estimate of the number of blog
mentions published on that day
Example
• Movie 1 (The Da Vinci Code) has a higher number of blog mentions
• Their box office revenues are about the same; at times movie 2 (Over the Hedge) even surpasses movie 1
Box Office Data and User
Rating
• collect the average user ratings of the two movies
from the IMDB website
– The Da Vinci Code – 6.5
– Over the Hedge - 7.1
• the number of blog mentions may not be an
accurate indicator of a product’s sales
performance
• people’s opinions (as reflected by the user ratings)
seem to be a good indicator of how the box office
performance evolves
S-PLSA: A Probabilistic Approach to
Sentiment Mining
• Feature Selection
– Traditional way
• Compute the (relative) frequencies of various words in a given
blog
• Use the resulting multidimensional feature vector as the
representation of the blog
– focus on the set containing 2030 appraisal words
extracted from the lexicon constructed by Whitelaw et
al.
• use their frequencies in a blog as a feature vector
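Below is a minimal sketch (not the authors' code) of this frequency-based feature construction; the tiny word list and the function name are placeholders standing in for the 2030-word appraisal lexicon of Whitelaw et al.

    # Minimal sketch: build the appraisal-word frequency vector for one blog entry.
    # APPRAISAL_WORDS is a placeholder; the paper uses 2030 appraisal words.
    from collections import Counter
    import re

    APPRAISAL_WORDS = ["good", "bad", "boring", "brilliant", "awful"]  # placeholder lexicon

    def appraisal_feature_vector(blog_text, lexicon=APPRAISAL_WORDS):
        tokens = re.findall(r"[a-z']+", blog_text.lower())
        counts = Counter(tokens)
        # c(b, w): raw frequency of each appraisal word in the blog entry
        return [counts[w] for w in lexicon]

    print(appraisal_feature_vector("A brilliant movie, not boring at all. Good fun!"))
    # -> [1, 0, 1, 1, 0]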
Sentiment PLSA
• sentiments are often multifaceted
– differ from one another in a variety of ways
• just classifying the sentiments expressed in a blog as either positive or negative
– too simplistic
• a blog can be considered as being generated under the
influence of a number of hidden sentiment factors
– each hidden factor focusing on one specific aspect of the
sentiments
– accommodate the intricate nature of sentiments
• model sentiments and opinions as a mixture of hidden
factors and use PLSA for sentiment mining
S-PLSA: Formal Presentation
• a set of blog entries B = {b1, . . . , bN}
• a set of words (appraisal words) from a
vocabulary W = {w1, . . . ,wM}
• blog data can be described as an N × M matrix D = (c(bi, wj))ij
– c(bi, wj) is the number of times wj appears in blog entry bi
• each row in D is then a frequency vector that
corresponds to a blog entry
S-PLSA (Con.)
• consider the blog entries as being generated
from a number of hidden sentiment factors
Z = {z1, . . . , zK}
– corresponding to the blogger's complex sentiments expressed in the blog review
• Generative model
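In the standard symmetric PLSA formulation, consistent with the parameters P(z), P(b|z), P(w|z) listed on the next slide, each observed (blog entry, word) pair is assumed to be generated by:
1. picking a hidden sentiment factor z with probability P(z)
2. generating a blog entry b with probability P(b|z)
3. generating a word w with probability P(w|z)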
S-PLSA (Con.)
• resulting joint probability P(b, w)
• Assuming blog entry b and the word w are
conditionally independent given the hidden
sentiment factor z
• Estimate model parameters
– P(z), P(b|z), P(w|z)
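Under this conditional-independence assumption, the joint probability takes the standard PLSA form:

\[ P(b, w) \;=\; \sum_{z \in Z} P(z)\, P(b \mid z)\, P(w \mid z) \]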
S-PLSA (Con.)
• maximize the following likelihood function:
• EM algorithm
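For reference, the standard PLSA log-likelihood and EM updates, in the notation above, are:

\[ \mathcal{L} \;=\; \sum_{b \in B} \sum_{w \in W} c(b, w)\, \log \sum_{z \in Z} P(z)\, P(b \mid z)\, P(w \mid z) \]

E-step:
\[ P(z \mid b, w) \;=\; \frac{P(z)\, P(b \mid z)\, P(w \mid z)}{\sum_{z'} P(z')\, P(b \mid z')\, P(w \mid z')} \]

M-step:
\[ P(w \mid z) \propto \sum_{b} c(b, w)\, P(z \mid b, w), \quad
   P(b \mid z) \propto \sum_{w} c(b, w)\, P(z \mid b, w), \quad
   P(z) \propto \sum_{b, w} c(b, w)\, P(z \mid b, w) \]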
S-PLSA (Con.)
• P(z|b) represents how much a hidden
sentiment factor z “contributes” to the blog
document b
• the set of probabilities {P(z|b) | z ∈ Z} can be considered as a succinct summarization of b in terms of sentiments
ARSA : a Sentiment-Aware
Model
• Capture two different factors that can
affect the box office revenue of the current
day
– box office revenue of the preceding days
• autoregressive model (AR)
– people’s sentiments about the movie
The autoregressive model
• denote the box office revenue of the movie of
interest at day t by xt
– t = 1, . . . ,N
• basic AR process of order p:
xt = φ1 xt−1 + φ2 xt−2 + · · · + φp xt−p + εt
– φ1, φ2, . . . , φp : parameters of the model
– εt : error term
• Once this model is learned from training data
– at day t, the box office revenue xt can be predicted by
xt−1, xt−2,. . ., xt−p
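A minimal sketch (not the paper's code) of how an AR(p) model can be learned by least squares and then used to predict xt from the p preceding values; the toy revenue numbers are made up.

    # Fit an AR(p) model by ordinary least squares and predict the next value.
    import numpy as np

    def fit_ar(x, p):
        # Each row is [x_{t-1}, ..., x_{t-p}]; the target is x_t
        X = np.array([x[t - p:t][::-1] for t in range(p, len(x))])
        y = np.array(x[p:])
        phi, *_ = np.linalg.lstsq(X, y, rcond=None)
        return phi

    def predict_next(x, phi):
        p = len(phi)
        # phi_1 * x_{t-1} + ... + phi_p * x_{t-p}
        return float(np.dot(phi, x[-1:-p - 1:-1]))

    revenues = [77.0, 34.0, 29.0, 19.0, 17.0, 15.0, 14.0, 45.0, 21.0, 18.0]  # toy data
    phi = fit_ar(revenues, p=3)
    print(predict_next(revenues, phi))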
New AR Model
• AR models are only appropriate for time series
that are stationary
• 1st step: transform the daily revenue series
• 2nd step: remove seasonality (the weekly pattern in box office revenues)
• New AR model is then applied to the transformed series (one possible form of these steps is sketched below)
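As an illustration only, and an assumption here rather than the paper's exact transformation, a common way to stabilize and de-seasonalize a daily box office series is a log transform followed by weekly differencing:

\[ x'_t = \log x_t, \qquad y_t = x'_t - x'_{t-7} \]

The AR process is then fitted on the transformed series yt.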
Incorporating Sentiments
• Bt denote the set of blogs on the movie of interest
that were posted on day t
• average probability of sentiment factor z = j
conditional on blogs in Bt
– ωt,j represents the average fraction of the sentiment “mass”
that can be attributed to the hidden sentiment factor j
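Written out, this average is:

\[ \omega_{t,j} \;=\; \frac{1}{|B_t|} \sum_{b \in B_t} P(z = j \mid b) \]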
ARSA Model
• Autoregressive Sentiment-Aware model
– p, q, and K : user-chosen parameters
• p : how many preceding days of box office revenue are considered
• q : how many preceding days of sentiment information are considered
• K : the number of hidden sentiment factors used by S-PLSA to represent the sentiment information
– φi and ρi,j : parameters whose values are to be estimated using the training data
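Based on the quantities defined above (the AR terms xt−i and the sentiment averages ωt−i,j), the ARSA equation can be reconstructed in the form:

\[ x_t \;=\; \sum_{i=1}^{p} \phi_i\, x_{t-i} \;+\; \sum_{i=1}^{q} \sum_{j=1}^{K} \rho_{i,j}\, \omega_{t-i,j} \;+\; \epsilon_t \]

(with xt possibly referring to the transformed revenue series from the New AR Model slide).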
Training the ARSA Model
• learning the set of parameters φi(i = 1, . . . , p), and ρi,j(i =
1, . . . , q; j = 1, . . . ,K), from the training data that consist of
the true box office revenues
• For a particular movie m(m = 1, . . . ,M)
– M : total number of movies in the training data
• minimize the squared error between the predicted and the true box office revenues, summed over all movies and days in the training data (least squares; shown below)
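In least-squares form, the training objective can be written as the following reconstruction, where x̂ denotes the ARSA prediction for movie m on day t:

\[ \min_{\phi,\, \rho} \;\; \sum_{m=1}^{M} \sum_{t} \left( \hat{x}^{(m)}_t - x^{(m)}_t \right)^2 \]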
Empirical Study
• Experiment settings
– a set of blog documents on movies of interest collected
from the Web
• from May 1, 2006 to August 8, 2006.
• timestamp ranging from one week before the release to four
weeks after
• the number of blog entries collected for each movie ranges from 663 (for Waist Deep) to 2069 (for Little Man)
• in total, 45,046 blog entries commenting on 30 different movies
– corresponding daily box office revenue data for these
movies
• manually collected from IMDB
Experiment
• choose half of the movies for training, and the other
half for testing
• train an S-PLSA model
– For each blog entry b, the sentiments towards a movie are
summarized using a vector of the posterior probabilities of
the hidden sentiment factors, P(z|b)
• apply ARSA model
– obtain estimates of the parameters
• evaluate the prediction performance of the ARSA model by applying it to the testing data set
MAPE
• use the mean absolute percentage error (MAPE) to
measure the prediction accuracy
– n : total number of predictions made on the testing data
– Predi : predicted value
– Truei : true value of the box office revenue
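With these symbols, MAPE is (often reported as a percentage):

\[ \mathrm{MAPE} \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{\left| \mathrm{Pred}_i - \mathrm{True}_i \right|}{\mathrm{True}_i} \]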
Parameter Selection
• Selecting K, p, and q
– K too small: the low-dimensional sentiment vector loses sentiment information; K too large: overfitting and high computational cost
– p determines how much of the preceding days' sales performance is factored in
– q too large: irrelevant information gets picked up; the sentiment information of blogs posted on the preceding day is the most relevant to the prediction
– best MAPE observed: 12.1%
Results on a Particular Movie
• similar to the observations in Figure 2
• predictions are close to the true values under the optimal parameter settings
Comparison with Alternative
Methods
• Pure AR model (uses past revenues only)
• AR model that utilizes the volume of blog mentions instead of sentiments
– vt−i : number of blog mentions on day t−i
– φi and ρi : parameters to be learned
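Written out, the blog-volume baseline is a reconstruction from the quantities named above, with the lag orders assumed to mirror ARSA's p and q:

\[ x_t \;=\; \sum_{i=1}^{p} \phi_i\, x_{t-i} \;+\; \sum_{i=1}^{q} \rho_i\, v_{t-i} \;+\; \epsilon_t \]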
Comparison with Other Feature
Selection Methods
• feature vectors are
computed using the
(relative) frequencies of
all the words appearing
in the blog entries
– very large feature set, high computational cost
• only select words with higher frequencies (excluding stop words)
– the same number of words as used by ARSA, for fairness
Conclusions and Future Work
• predicting sales performance using sentiment information mined from blogs
– using movies as a case study
• proposal of S-PLSA
– generative model for sentiment analysis
– “summarizing” sentiment information from blogs
• ARSA
– model for predicting sales performance based on
• sentiment information
• product’s past sales performance
• Future Work
– Clustering and classification of blogs based on their sentiments
– use S-PLSA as a tool to help track and monitor the changes and trends
in sentiments expressed online