
Multi-variate Outliers in Data Cubes
2012-03-05
JongHeum Yeon
Intelligent Data Systems Laboratory
Contents
 Sentiment Analysis and Opinion Mining
• Materials from AAAI-2011 Tutorial
 Multi-variate Outliers in Data Cubes
• Motivation
• Technologies
• Issues
IDS Lab.
Page 2
SENTIMENT ANALYSIS AND
OPINION MINING
Sentiment Analysis and Opinion Mining
 Opinion mining or sentiment analysis
• Computational study of opinions, sentiments, subjectivity, evaluations, attitudes,
appraisal, affects, views, emotions, etc., expressed in text.
• Opinion mining ~= sentiment analysis
 Sources: Global Scale
• Word-of-mouth on the Web
• Personal experiences and opinions about anything in reviews, forums, blogs, Twitter,
micro-blogs, etc.
• Comments about articles, issues, topics, reviews, etc.
• Postings at social networking sites, e.g., Facebook.
• Organization internal data
• News and reports
 Applications
• Businesses and organizations
• Individuals
• Ads placements
• Opinion retrieval
Problem Statement
 (1) Opinion Definition
• Id: Abc123 on 5-1-2008 “I bought an iPhone a few days ago. It is such a nice
phone. The touch screen is really cool. The voice quality is clear too. It is much
better than my old Blackberry, which was a terrible phone and so difficult to
type with its tiny keys. However, my mother was mad with me as I did not tell
her before I bought the phone. She also thought the phone was too expensive, …”
• document level, i.e., is this review + or -?
• sentence level, i.e., is each sentence + or -?
• entity and feature/aspect level
• Components
• Opinion targets: entities and their features/aspects
• Sentiments: positive and negative
• Opinion holders: persons who hold the opinions
• Time: when opinions are expressed
 (2) Opinion Summarization
OPINION DEFINITION
Two main types of opinions
 Regular opinions: Sentiment/opinion expressions on some target entities
• Direct opinions:
• “The touch screen is really cool.”
• Indirect opinions:
• “After taking the drug, my pain has gone.”
 Comparative opinions: Comparisons of more than one entity.
• e.g., “iPhone is better than Blackberry.”
 Opinion (a restricted definition)
• An opinion (or regular opinion) is simply a positive or negative sentiment, view,
attitude, emotion, or appraisal about an entity or an aspect of the entity (Hu and
Liu 2004; Liu 2006) from an opinion holder (Bethard et al 2004; Kim and Hovy
2004; Wiebe et al 2005).
• Sentiment orientation of an opinion
• Positive, negative, or neutral (no opinion)
• Also called opinion orientation, semantic orientation, sentiment polarity.
Entity and Aspect
 Definition (entity)
• An entity e is a product, person, event, organization, or topic. e is represented as
• a hierarchy of components, sub-components, and so on.
• Each node represents a component and is associated with a set of attributes of the
component.
 An opinion is a quintuple (ej, ajk, soijkl, hi, tl)
Goal of Opinion Mining
 Id: Abc123 on 5-1-2008 “I bought an iPhone a few days ago. It is such a nice
phone. The touch screen is really cool. The voice quality is clear too. It is
much better than my old Blackberry, which was a terrible phone and so
difficult to type with its tiny keys. However, my mother was mad with me as
I did not tell her before I bought the phone. She also thought the phone was
too expensive, …”
 Quintuples
• (iPhone, GENERAL, +, Abc123, 5-1-2008)
• (iPhone, touch_screen, +, Abc123, 5-1-2008)
 Goal: Given an opinionated document,
• Discover all quintuples (ej, ajk, soijkl, hi, tl),
• Or, solve some simpler forms of the problem (sentiment classification at the
document or sentence level)
• Unstructured Text → Structured Data
Sentiment, subjectivity, and emotion
 Sentiment ≠ Subjective ≠ Emotion
• Sentence subjectivity: An objective sentence presents some factual information,
while a subjective sentence expresses some personal feelings, views, emotions,
or beliefs.
• Emotion: Emotions are people’s subjective feelings and thoughts.
 Most opinionated sentences are subjective, but objective sentences can
imply opinions too.
 Emotion
• Rational evaluation: Many evaluation/opinion sentences express no emotion
• e.g., “The voice of this phone is clear”
• Emotional evaluation
• e.g., “I love this phone”
• “The voice of this phone is crystal clear”
 Sentiment ⊄ Subjectivity
 Emotion ⊂ Subjectivity
 Sentiment ⊄ Emotion
OPINION SUMMARIZATION
Opinion Summarization
 With a lot of opinions, a summary is necessary.
• A multi-document summarization task
 For factual texts, summarization is to select the most important facts and
present them in a sensible order while avoiding repetition
• 1 fact = any number of the same fact
 But for opinion documents, it is different because opinions have a
quantitative side & have targets
• 1 opinion ≠ a number of opinions
 Aspect-based summary is more suitable
• Quintuples form the basis for opinion summarization
Aspect-based Opinion Summary
 Id: Abc123 on 5-1-2008 “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, …”
 Feature-Based Summary of iPhone (figure)
 Opinion Observer (figure)
Aspect-based Opinion Summary
 Opinion Observer
Aspect-based Opinion Summary
 Bing
 Google Product Search
Aspect-based Opinion Summary
 OpinionEQ
 Detailed opinion sentences
OpinionEQ
 % of +ve opinion and # of opinions
 Aggregate opinion trend
Live tracking of two movies (Twitter)
OPINION MINING PROBLEM
Opinion Mining Problem
 (ej, ajk, soijkl, hi, tl)
• ej - a target entity: Named Entity Extraction (more)
• ajk – an aspect of ej: Information Extraction
• soijkl is sentiment: Sentiment Identification
• hi is an opinion holder: Information/Data Extraction
• tl is the time: Information/Data Extraction
• The 5 pieces of information must match
 Coreference resolution
 Synonym match (voice = sound quality)
 …
Opinion Mining Problem
 Tweets from Twitter are the easiest
• short and thus usually straight to the point
 Reviews are next
• entities are given (almost) and there is little noise
 Discussions, comments, and blogs are hard.
• Multiple entities, comparisons, noise, sarcasm, etc.
 Determining sentiments seems to be easier.
 Extracting entities and aspects is harder.
 Combining them is even harder.
Opinion Mining Problem in the Real World
 Source the data, e.g., reviews, blogs, etc
• (1) Crawl all data, store and search them, or
• (2) Crawl only the target data
 Extract the right entities & aspects
• Group entity and aspect expressions,
• Moto = Motorola, photo = picture, etc …
• Aspect-based opinion mining (sentiment analysis)
• Discover all quintuples
• (Store the quintuples in a database)
 Aspect based opinion summary
Problems
 Document sentiment classification
 Sentence subjectivity & sentiment classification
 Aspect-based sentiment analysis
 Aspect-based opinion summarization
 Opinion lexicon generation
 Mining comparative opinions
 Some other problems
• Opinion spam detection
• Utility or helpfulness of reviews
APPROACHES
Approaches
 Knowledge-based approach
• Uses background knowledge of linguistics to identify sentiment polarity of a text
• Background knowledge is generally represented as dictionaries capturing the
sentiments of lexicons
 Learning-based approach
• Based on supervised machine learning techniques
• Formulating the problem of sentiment identification as a text classification,
utilizing bag-of-words model
Document Sentiment Classification
 Classify a whole opinion document (e.g., a review) based on the overall
sentiment of the opinion holder
 A text classification task
• It is basically a text classification problem
 Assumption: The doc is written by a single person and expresses
opinion/sentiment on a single entity.
 Goal: discover (_, _, so, _, _), where e, a, h, and t are ignored
 Reviews usually satisfy the assumption.
• Almost all papers use reviews
• Positive: 4 or 5 stars, negative: 1 or 2 stars
Document Unsupervised Classification
 Data: reviews from epinions.com on automobiles, banks, movies, and travel
destinations.
 Three steps
 Step 1
• Part-of-speech (POS) tagging
• Extracting two consecutive words (two-word phrases) from reviews if their tags
conform to some given patterns, e.g., (1) JJ, (2) NN.
 Step 2: Estimate the sentiment orientation (SO) of the extracted phrases
• Pointwise mutual information
• Semantic orientation (SO)
 Step 3: Compute the average SO of all phrases
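The three steps above follow Turney's (2002) unsupervised method. Steps 2–3 can be sketched as follows, assuming co-occurrence hit counts (e.g., from a search engine's NEAR operator) are already available; the function names and the smoothing constant are illustrative assumptions:

```python
import math

def so_pmi(hits_near_excellent, hits_near_poor,
           hits_excellent, hits_poor, smoothing=0.01):
    """Semantic orientation of a phrase via pointwise mutual information:
    SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor"),
    which reduces to a log-odds of co-occurrence counts."""
    return math.log2(
        ((hits_near_excellent + smoothing) * hits_poor) /
        ((hits_near_poor + smoothing) * hits_excellent))

def classify_review(phrase_sos):
    """Step 3: average the SO of all extracted phrases;
    a positive average means a recommended (positive) review."""
    avg = sum(phrase_sos) / len(phrase_sos)
    return "positive" if avg > 0 else "negative"
```

A phrase that co-occurs with "excellent" far more often than with "poor" gets a positive SO, and the review is classified by the average over all its extracted phrases.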
Document Supervised Learning
 Directly apply supervised learning techniques to classify reviews into
positive and negative
 Three classification techniques were tried
• Naïve Bayes
• Maximum entropy
• Support vector machines
 Pre-processing
• Features: negation tag, unigram (single words), bigram, POS tag, position.
 Training and test data
• Movie reviews with star ratings
• 4-5 stars as positive
• 1-2 stars as negative
 Neutral is ignored.
 SVM gives the best classification accuracy based on balanced training data
• 83%
• Features: unigrams (bag of individual words)
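As a stdlib-only sketch of the bag-of-words formulation, here is a minimal unigram Naïve Bayes (one of the three techniques tried on the slide; the best reported accuracy, 83%, used SVMs). The toy training data is invented for illustration:

```python
from collections import Counter
import math

class UnigramNB:
    """Minimal Naive Bayes over bag-of-words unigram features."""
    def fit(self, docs, labels):
        self.word_counts = {}            # label -> Counter of words
        self.label_counts = Counter(labels)
        self.vocab = set()
        for doc, y in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts.setdefault(y, Counter()).update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        scores = {}
        v = len(self.vocab)
        for y, cnt in self.word_counts.items():
            total = sum(cnt.values())
            score = math.log(self.label_counts[y])
            for w in doc.lower().split():
                # Laplace smoothing so unseen words don't zero the product
                score += math.log((cnt[w] + 1) / (total + v))
            scores[y] = score
        return max(scores, key=scores.get)

clf = UnigramNB().fit(
    ["great movie loved it", "wonderful acting great plot",
     "terrible boring movie", "awful plot terrible acting"],
    ["pos", "pos", "neg", "neg"])
```

The same pipeline (tokenize, count, classify) underlies the SVM and maximum-entropy variants; only the classifier changes.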
Aspect-based Sentiment Analysis
 (ej, ajk, soijkl, hi, tl)
 Aspect extraction
• Goal: Given an opinion corpus, extract all aspects
• A frequency-based approach (Hu and Liu, 2004)
• nouns (NN) that are frequently talked about are likely to be true aspects (called
frequent aspects)
• Infrequent aspect extraction
• To improve recall on infrequent aspects, it uses opinion words to extract
them
• Key idea: opinions have targets, i.e., opinion words are used to modify aspects and
entities.
• “The pictures are absolutely amazing.”
• “This is an amazing software.”
• The modifying relation was approximated with the nearest noun to the opinion word.
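The frequent-noun counting and nearest-noun heuristic above can be sketched as follows, assuming sentences arrive already POS-tagged as (word, tag) pairs; the input format, the `min_support` threshold, and the seed opinion-word set are assumptions:

```python
from collections import Counter

def extract_aspects(tagged_sentences, opinion_words, min_support=2):
    """Hu & Liu (2004)-style sketch: frequent nouns become aspects;
    for sentences with no frequent aspect, the noun nearest to an
    opinion word is taken as an infrequent aspect."""
    noun_freq = Counter(w for sent in tagged_sentences
                        for w, tag in sent if tag == "NN")
    frequent = {w for w, c in noun_freq.items() if c >= min_support}
    infrequent = set()
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        if any(w in frequent for w in words):
            continue
        for i, (w, _) in enumerate(sent):
            if w in opinion_words:
                # nearest noun to the opinion word approximates its target
                nouns = [(abs(j - i), v) for j, (v, t) in enumerate(sent)
                         if t == "NN"]
                if nouns:
                    infrequent.add(min(nouns)[1])
    return frequent, infrequent
```

On the slide's examples, "pictures" would surface as a frequent aspect, while "software" would be recovered via the opinion word "amazing".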
Aspect-based Sentiment Analysis
 Using part-of relationship and the Web
• Improved (Hu and Liu, 2004) by removing those frequent noun phrases that
may not be aspects: better precision (a small drop in recall).
• It identifies part-of relationship
• Each noun phrase is given a pointwise mutual information score between the phrase
and part discriminators associated with the product class, e.g., a scanner class.
• e.g., “of scanner”, “scanner has”, etc, which are used to find parts of scanners by
searching on the Web:
 Extract aspects using DP (Qiu et al. 2009; 2011)
• A double propagation (DP) approach proposed
• Based on the definition earlier, an opinion should have a target, entity or aspect.
• Use dependency of opinions & aspects to extract both aspects & opinion words.
• Knowing one helps find the other.
• E.g., “The rooms are spacious”
• It extracts both aspects and opinion words.
• A domain independent method.
Aspect-based Sentiment Analysis
 DP is a bootstrapping method
• Input: a set of seed opinion words,
• no aspect seeds needed
 Based on dependency grammar (Tesniere 1959).
• “This phone has good screen”
Aspect-based Sentiment Analysis
 iKnow
• Dependency diagram (figure): opinion words “Easy”, “Fast”, “Quite” linked to aspects “Keeping” and “Delivery” via subject/modify relations
Aspect Sentiment Classification
 For each aspect, identify the sentiment or opinion expressed on it.
 Almost all approaches make use of opinion words and phrases.
 But notice:
• Some opinion words have context independent orientations, e.g., “good” and
“bad” (almost)
• Some other words have context dependent orientations, e.g., “small” and “sucks”
(+ve for vacuum cleaners)
 Supervised learning
• Sentence level classification can be used, but …
• Need to consider target and thus to segment a sentence (e.g., Jiang et al. 2011)
 Lexicon-based approach (Ding, Liu and Yu, 2008)
• Need parsing to deal with: Simple sentences, compound sentences, comparative
sentences, conditional sentences, questions; different verb tenses, etc.
• Negation (not), contrary (but), comparisons, etc.
• A large opinion lexicon, context dependency, etc.
• Easy: “Apple is doing well in this bad economy.”
Aspect Sentiment Classification
 A lexicon-based method (Ding, Liu and Yu 2008)
• Input: A set of opinion words and phrases. A pair (a, s), where a is an aspect and
s is a sentence that contains a.
• Output: whether the opinion on a in s is +ve, -ve, or neutral.
• Two steps
• Step 1: split the sentence if needed based on BUT words (but, except that, etc).
• Step 2: work on the segment sf containing a. Let the set of opinion words in sf be
w1, …, wn. Sum up their orientations (1, -1, 0), and assign the orientation to (a, s)
accordingly:
• score(a, s) = Σi wi.o / d(wi, a)
• where wi.o is the opinion orientation of wi.
• d(wi, a) is the distance from a to wi.
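The Step 2 scoring, with each opinion word's orientation weighted by the inverse of its distance to the aspect, can be sketched as follows; the toy lexicon and the word-position distance measure are assumptions, and the BUT-word splitting of Step 1 is omitted:

```python
def aspect_orientation(sentence_words, aspect, lexicon):
    """Ding, Liu & Yu (2008)-style sketch: sum each opinion word's
    orientation (+1/-1 from `lexicon`) divided by its distance to
    the aspect, i.e. sum of w.o / d(w, a)."""
    a_pos = sentence_words.index(aspect)
    score = 0.0
    for i, w in enumerate(sentence_words):
        if w in lexicon and i != a_pos:
            score += lexicon[w] / abs(i - a_pos)   # w.o / d(w, a)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Closer opinion words thus dominate the verdict, which approximates the modifying relation without full parsing.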
MULTI-VARIATE OUTLIERS IN DATA
CUBES
Previous Work
 Jongheum Yeon, Dongjoo Lee, Junho Shim, and Sang-goo Lee, Modeling of Product
Review Data and Sentiment Analysis Processing (in Korean), The Journal of Society for
e-Business Studies, Korea, 2011
 Jongheum Yeon, Dongjoo Lee, Jaehui Park and Sang-goo Lee, A Framework
for Sentiment Analysis on Smartphone Application Stores, AITS, 2012
On-Line Sentiment Analytical Processing
 As opinion data grows, there is increasing demand to analyze it from many
angles and use it for decision support, as with OLAP (On-Line Analytical Processing)
 Existing opinion mining techniques, however, produce fixed-form results, which
makes multi-angle analysis of the data difficult
• They summarize reviews as per-feature scores for prospective buyers
• They judge the opinion polarity associated with particular keywords
 OLSAP: On-Line Sentiment Analytical Processing
• Stores opinion data in a data warehouse for decision support
• Processing techniques that dynamically analyze and aggregate opinion data online
 We propose a modeling scheme for opinion data for OLSAP
Opinion Data Model
 OLSAP models opinion data in the following form:
(oi, fj, ek, ml, vejk, vmkl, um, tn, po)
• oi is the target the opinion is expressed about, e.g., “iPhone”
• fj is a specific feature of oi, e.g., “LCD”
• ek is an opinion lexeme about the feature, e.g., “good”
• ml is a lexeme expressing the strength of the opinion, e.g., “quite”
• vejk and vmkl are real values for the feature opinion and its strength, respectively
• negative for negative opinions, positive for positive ones
• um is the user who expressed the opinion
• tn is the time the opinion was written
• po is the place where the opinion was written
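The nine-field opinion tuple could be represented as a fact-table record along these lines; the field names and example values are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OpinionFact:
    """One fact-table row of the OLSAP opinion model
    (oi, fj, ek, ml, vejk, vmkl, um, tn, po)."""
    target: str            # oi, e.g. "iPhone"
    feature: str           # fj, e.g. "LCD"
    expression: str        # ek, opinion lexeme, e.g. "good"
    modifier: str          # ml, intensity word, e.g. "quite"
    feature_score: float   # vejk: negative for -ve, positive for +ve
    modifier_score: float  # vmkl: strength of the modifier
    user: str              # um
    time: str              # tn
    place: str             # po

fact = OpinionFact("iPhone", "LCD", "good", "quite", 1.0, 1.5,
                   "Abc123", "2008-05-01", "Seoul")
```

Making the record immutable (frozen) fits its role as a fact-table row that dimension tables join against.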
OLSAP Modeling
 OLSAP database schema (figure)
 Opinion-information association tables (figure)
Related Work
 Integration of Opinion into Customer Analysis Model, Eighth IEEE
International Conference on e-Business Engineering, 2011
Motivation
 Opinion Mining on top of Data Cubes
 OnLine Analysis of Data to provide “Clients” with “right” reviews
 Interaction is the key between
• Analysis of Review Data and Clients
 Let the client decide how to view the result of analyzing the reviews
• 1. No opinion mining technique can be perfect.
• 2. The mined data itself can contain “malicious” outliers.
 Data Warehousing
• Data Cubes, Multidimensional Aggregation
• A practical, systematic platform that gave birth to data mining.
 Focus:
• A more system-like approach, towards an integrated algorithm & data structure and
its performance, in order to integrate OLAP with opinion mining.
• In other words, no interest in traditional opinion mining issues such as natural
language processing and polarity classification.
Motivation
 Avg_GroupBy(User=Anti1, Product=Samsung, …) means the (average) value
grouped by (user, product, …) where the values of user and product are the
given literals, as to Anti1 and Samsung, respectively.
 ALL represents “don’t care”.
Motivation
 To find out whether Anti1’s reviews should be taken into account or disregarded,
we are interested in the following values:
• 1. Avg_GroupBy(User=Anti1, Product=Samsung)
• 2. Avg_GroupBy(User=Anti1, Product=~Samsung),
• where Product=~Samsung means U-{Samsung}.
• 3. Avg_GroupBy(User=Anti1, Product=ALL)
• = Avg_GroupBy(User=Anti1)
• 4. Avg_GroupBy(User=~Anti1, Product=Samsung)
• 5. Avg_GroupBy(User=ALL, Product=Samsung)
• = Avg_GroupBy(Product=Samsung)
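The Avg_GroupBy values above can be sketched over a toy review table; the data and the function signature are invented for illustration, with `None` standing in both for ALL (don't care) and for NULL results:

```python
from statistics import mean

# hypothetical review table: (user, product, score) rows
reviews = [("Anti1", "Samsung", 1), ("Anti1", "Samsung", 1),
           ("User3", "Samsung", 3), ("User3", "Samsung", 3),
           ("User4", "LG", 3), ("User4", "Samsung", 3)]

def avg_groupby(rows, user=None, product=None, negate_product=None):
    """Sketch of Avg_GroupBy: average score over rows matching the
    given literals; a None argument plays the role of ALL, and
    negate_product implements Product=~X, i.e. U - {X}."""
    sel = [s for u, p, s in rows
           if (user is None or u == user)
           and (product is None or p == product)
           and (negate_product is None or p != negate_product)]
    return mean(sel) if sel else None   # NULL when no rows match
```

Here `avg_groupby(reviews, user="Anti1", product="Samsung")` and `avg_groupby(reviews, user="Anti1", negate_product="Samsung")` correspond to values 1 and 2 in the list above.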
Motivation
 Look into Behavior of Anti1 & Anti2
• Anti1 provides values only for Samsung, while Anti2 does for others as well.
• 1) Avg_GroupBy(User=Anti1, Product=ALL) = Avg_GroupBy(User=Anti1,
Product=Samsung)
• i.e., Avg_GroupBy(User=Anti1, Product=~Samsung) = NULL
• &&
• 2) Avg_GroupBy(User=Anti1, Product=Samsung) = 1 turns out to be an outlier,
considering that Avg_GroupBy(User=~Anti1, Product=Samsung) = 2.85
 In this case, should only Anti1 be excluded, or Anti2 as well, or should we use the
average that includes them all, i.e., Avg_GroupBy(User=ALL,
Product=Samsung) = 2.18?
 Then why is User-3, who also rated only Samsung, not an outlier?
 The example is thin, but assume User-3’s average is close enough not to be an outlier.
 Look into Behavior of Anti1 & Anti2
• Anti2 provides values not only for Samsung, but for others as well.
• 1) Avg_GroupBy(User=Anti2, Product=ALL) != Avg_GroupBy(User=Anti2, Product=Samsung)
• i.e., Avg_GroupBy(User=Anti2, Product=~Samsung) is NOT NULL
• &&
• 2) Avg_GroupBy(User=Anti2, Product=Samsung) and Avg_GroupBy(User=Anti2,
Product=~Samsung) differ too much
• &&
• 3) Avg_GroupBy(User=Anti2, Product=Samsung) turns out to be an outlier, considering
Avg_GroupBy(User=ALL, Product=Samsung)
 Then why is User-2, who rated both Samsung and other products, not an outlier?
 Because condition 2) above is violated: User-2’s scores are simply harsh across the
board, i.e., the score distribution is not biased.
Motivation
 Summary
1) The user reviews only one particular product group, and that review average
differs greatly from other users’ average for the same group.
2) The user reviews that product group and other groups too, the averages differ
greatly across groups, and the average for the particular group differs greatly
from other users’ average for that group.
• If the underlined part of 2) does not hold, the user is simply a harsh grader.
 User3 should be okay!
• Why? They reviewed only one group, but that average hardly differs from other
users’ average for the group.
 User2 should be okay!
• Simply a harsh grader.
 User4 should be okay!
• They reviewed several groups, and each group average hardly differs from other
users’ average for that group.
Research Perspectives -1
 Outlier Conditions
• Most likely, we must consider some heuristics, to suit the domain; here opinion
(review) data.
• Condition1
• Condition2
• …
• Condition_n, in a form like the following:
• Multi-variate Outlier Detection
• Avg_groupby(X1=x1, X2=x2, …, Xn=xn) is an outlier only if, for Xi_c = X – {Xi}, its value
exceeds Chisq[Avg_groupby(X1=x1, X2=x2, …, Xi-1=xi-1, Xi+1=xi+1, …, Xn=xn)] * Skew.
• Sort of …
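Since the slide deliberately leaves the exact test open ("Sort of …"), one plausible instantiation is a simple deviation test against peer averages; the z-score form and the `k` threshold are assumptions, not the Chisq * Skew condition itself:

```python
from statistics import mean, stdev

def is_outlier(value, peer_values, k=2.0):
    """Flag a group-by average as an outlier when it lies more than
    k standard deviations from the peer averages for the same group,
    e.g. other users' averages for the same product."""
    if len(peer_values) < 2:
        return False            # not enough evidence either way
    mu, sigma = mean(peer_values), stdev(peer_values)
    if sigma == 0:
        return value != mu      # any deviation from a constant peer set
    return abs(value - mu) / sigma > k
```

On the running example, Anti1's Samsung average of 1 would be flagged against the other users' averages around 2.85, while User3's would not.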
 Can these conditions be interactively input by the user? (Rule-based approach)
 For users who are unlikely to explore the interactive outlier-detection features,
can a default rule be applied to give the user hints w.r.t. potential outliers?
Research Perspectives -2
 Outlier-conscious Aggregation – Aggregation Construction Algorithm (&
Data Structures)
• Data cubes are constructed to contain Avg_groupby(X1=x1, X2=x2, …., Xn=xn)
for each dimension X1, …Xn.
• However, after either an interactive (manual) or batch (heuristically automatic)
process of eliminating outliers, the cube also needs to be “effectively or
efficiently” constructed to contain Avg_groupby without those outliers.
• Most likely, cubes need to maintain not only the Avg_groupby value but also
count, sum, max, and min values.
• While
1. Multi-variate Outlier Detection: Avg_groupby(X1=x1, X2=x2, …, Xn=xn) is an outlier
only if, for Xi_c = X – {Xi}, its value exceeds Chisq[Avg_groupby(X1=x1, X2=x2, …,
Xi-1=xi-1, Xi+1=xi+1, …, Xn=xn)] * Skew.
2. Check whether lower-variate aggregates also cause the outlier: |Xn-1|. In other words,
Anti1 can input outlier scores for all individual Samsung products; then
avg_groupby(Anti1, Samsung) will be an outlier while
avg_groupby(Anti1, Samsung, Samsung_prod1) is also an outlier.
=> So find the lowest-dimension outliers, and remove all the containing
outlier elements.
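A sketch of a cube cell that keeps count and sum alongside min/max, so an average can be recomputed after outlier contributions are dropped; this illustrates the idea rather than the actual data structure (note min/max cannot be decrementally maintained and would need a rescan in a real system):

```python
from dataclasses import dataclass

@dataclass
class CubeCell:
    """Aggregate cell maintaining count/sum/min/max for one
    group-by combination, so avg survives outlier removal."""
    count: int = 0
    total: float = 0.0
    lo: float = float("inf")
    hi: float = float("-inf")

    def add(self, v):
        self.count += 1
        self.total += v
        self.lo = min(self.lo, v)
        self.hi = max(self.hi, v)

    def remove(self, v):
        """Drop an outlier contribution; count and sum decrement
        cleanly, min/max would require a rescan to tighten."""
        self.count -= 1
        self.total -= v

    @property
    def avg(self):
        return self.total / self.count if self.count else None

cell = CubeCell()
for v in (1, 1, 3, 3):
    cell.add(v)
cell.remove(1)
cell.remove(1)    # drop the outlier user's scores
```

After removing the two outlier scores, the cell's average is recomputed from the remaining count and sum without touching the raw data.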
Research Perspectives -3
 Outlier-conscious Aggregation – Visualization of Aggregation and possible
outliers and their effects.
• Instead of showing only the averages:
• Not only the average
• Median, or
• Distribution
Research Perspectives -3
Showing the distribution
Containing possible outliers
Showing Not only Mean but Median (and Mode)
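Showing more than the mean is straightforward with the stdlib; the score list below is hypothetical:

```python
from statistics import mean, median, mode

# hypothetical review scores, including a couple of low outliers
scores = [1, 1, 3, 3, 3, 4, 5]

# The median (and mode) resist a few extreme reviews better than the
# mean, which is why the slide suggests showing them alongside it.
summary = {"mean": round(mean(scores), 2),
           "median": median(scores),
           "mode": mode(scores)}
```

Here the two low outlier scores drag the mean well below the median and mode, which is exactly the effect the visualization is meant to expose.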
Research Perspectives -3
Or combining together
(Source: Autos.yahoo.com)
Research Perspectives -4
 Outlier-conscious Aggregation – Aggregation Construction (RP-2) &
Visualization (RP-3)
• After either interactive (manual) or batch (heuristically automatic) process of
eliminating outliers, the cube also needs to be “effectively or efficiently”
constructed to contain Avg_groupby without having those outliers.
This process, hopefully, can be done interactively, i.e., ONLINE.
References to Start with
 Cube Data Structures for outliers
• R* Trees, Efficient Online Aggregates in Dense-Region-Based Data Cube
Representation, K. Haddadin and T. Lauer, Data Warehousing and Knowledge
Discovery, Lecture Notes in Computer Science Vol 5691, p177-188, 2009.
• TP-Trees, Pushing Theoretically-Founded Probabilistic Guarantees in Highly-
Efficient OLAP Engines, A. Cuzzocrea and W. Wang, New Trends in Data
Warehousing and Data Analysis, Annals of Information Systems Vol 3, p1-30,
2009.
 R
• The R Project for Statistical Computing, http://www.r-project.org/
• “Introduction to Data Mining.pdf”, technical document; explains outliers and R
well. (pdf included)
 Etc
• Outlier-based Data Association: Combining OLAP and Data Mining, Technical
Report, University of Virginia, Song Lin & Donald E. Brown, 2002. “pdf included”
• Selected Topics in Graphical Analytic Techniques,
http://www.statsoft.com/textbook/graphical-analytic-techniques/
Applications
 Election (Malicious SNS)
 Biased Product Reviews
 Business Perspectives
• Quick Testbed Environment