103textWeb2 - Website Services - University of Illinois at Urbana

Download Report

Transcript 103textWeb2 - Website Services - University of Illinois at Urbana

Data Mining:
Concepts and Techniques
— Chapter 10 —
10.3.2 Mining Text and Web Data (II)
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
Acknowledgements: Slides by students at CS512 (Spring 2009)
4/10/2016
1
Outline
•
Probabilistic Topic Models (Yue Lu)
•
Opinion Mining (Hyun Duk Kim)
•
Mining Query Logs for Personalized Search (Yuanhua
Lv)
•
Online Analytical Processing on Multidimensional
Text Database (Duo Zhang)
Probabilistic Topic Models
Yue LU
Department of Computer Science
University of Illinois, Urbana-Champaign
Many slides are adapted/taken from different sources, including
presentations by ChengXiang Zhai, Qiaozhu Mei and Tom Griffiths
3
Intuition
•
Documents exhibit
multiple topics.
topic: Social
network website
topic: education
topic: criticism
4
What is a Topic?
Representation: a multinomial
distribution of words,
i.e., a unigram language model
retrieval
0.2
information 0.15
model
0.08
query
0.07
language
0.06
feedback
0.03
……
Topic: A broad concept/theme,
semantically coherent, which is
hidden in documents
5
Organize Information with Topics
price
0.0772
oil
0.0643
gas
0.0454
increase 0.0210
product 0.0203
fuel
0.0188
company 0.0182
…
Resolution
Natural
hazards
Categories
government response
oil price,
Topics
loss statistics, …
new orleans,
president bush..
oil, new, put, …
orleans, is, …
1
several
Entities
50~100
…
Phrases
hundreds
Words
thousands
new orleans,
put together, ..
How many in
a document?
Patterns
Many ...
6
The Usage of Topic Models
Topic 1
government 0.3
response 0.2
...
donate 0.1
relief 0.05
help 0.02
...
Topic 2
…
Topic k
[ Criticism of government response to the hurricane
primarily consisted of criticism of its response to the
approach of the storm and its aftermath, specifically in the
delayed response ] to the [ flooding of New Orleans. …
80% of the 1.3 million residents of the greater New Orleans
metropolitan area evacuated ] …[ Over seventy countries
pledged monetary donations or other assistance]. …
city 0.2
new 0.1
orleans 0.05
...
Background B
is 0.05
the 0.04
a 0.03
...
•
Usage of a topic model:
–
–
–
–
–
–
Summarize themes/aspects
Navigate documents
Retrieve documents
Segment documents
Document classification
Document clustering
7
General Idea of
Probabilistic Topic Models
• Cast intuition into a generative probabilistic
process (Generative Process)
– Each document is a mixture of corpus-wide topics
(multinomial distribution/unigram LM)
– Each word is drawn from one of those topics
• Since we only observe the documents, need to
figure out (Estimation/Inference)
– What are the topics?
– How are the documents divided according to those
topics?
• Two basic models: PLSA and LDA
8
Probabilistic Latent Semantic
Analysis/Indexing [Hofmann 99]
PLSA: Generation Process
Document
[Hofmann 99], [Zhai et al. 04]
Topics
battery 0.3
life 0.2..
design 0.1
screen 0.05
1
2
…
price 0.2
purchase 0.15
k
Generate a word
in a document
d1
d2
w
dk
B
Is 0.05 B
the 0.04
a 0.03 ..
Collection
background
Parameters:
B=noise-level (manually set)
’s and ’s need to be estimated
PLSA: Estimation
[Hofmann 99], [Zhai et al. 04]
Document
Topics
battery ?
life ?
design ?
screen ?
1
2
…
price ?
purchase ?
k
Generate a word
in a document
d1?
d2 ?
dk?
Is ?
the ?
a?
w
B
Collection
B
background
Estimated with Maximum
Likelihood Estimator (MLE)
through an EM algorithm
Log-likelihood of
the collection
Problems with PLSA
– “Documents have no generative probabilistic
semantics”
•i.e., document is just a symbol
– Model has many parameters
•linear in number of documents
•need heuristic methods to prevent overfitting
– Cannot generalize to new documents
Latent Dirichlet Allocation [Blei et al. 03]
Basic Idea of LDA
[Blei et al. 03], [Griffiths&Steyvers 02, 03, 04]
•
β

Topics
1
2
…
k
•
d1
d2
dk
Document
w
•
Adding a Dirichlet Prior α
on topic distribution in
documents
Adding a Dirichlet Prior β
on word distribution in
topics
α, β can be vectors, but
for convenience, α = α1=
α2=…; β = β1 = β2=…
(Smoothed LDA)
Dirichlet Hyperparameters α, β
•
•
•
•
•
Generally have a smoothing effect on multinomial
parameters
Large α, β : more smoothed topic/word distribution
Small α, β: more skewed topic/word distribution (e.g.
bias towards a few words for each topic)
Common settings: α=50/K, β=0.01
PLSA is maximum a posteriori estimated LDA when
using uniform prior: α=1, β=1
Inference
• Exact inference is intractable
• Approximation techniques:
– Mean field variational methods (Blei et al., 2001, 2003)
– Expectation propagation (Minka and Lafferty, 2002)
– Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
– Collapsed variational inference (Teh et al., 2006)
Would like to know more?
• “Parameter estimation for text analysis” by
Gregor Heinrich
• “Probabilistic topic models” by Mark Steyvers
Opinion Mining
Hyun Duk Kim
4/10/2016
Data Mining: Principles and Algorithms
18
Agenda





Overview
Opinion finding & sentiment classification
Opinion Summarization
Other works
Discussion & Conclusion
4/10/2016
Data Mining: Principles and Algorithms
19
Web 2.0


“ Web 2.0 is the business revolution in the
computer industry caused by the move to the
Internet as a platform, and an attempt to
understand the rules for success on that new
platform.” [Wikipedia]
Users participate in content creation
 ex. Blog, review, Q&A forum
4/10/2016
Data Mining: Principles and Algorithms
20
Opinion Mining



Huge volume of
opinions on the Web
 Ex. Product reviews,
Blog posts about
politic issues
Need a good
technique to
summarize them
Example of
commercial system
(MS live search)
4/10/2016
Data Mining: Principles and Algorithms
21
Usefulness of opinion mining



Individuals
 Purchasing a product/ service
 Tracking political topics
 Other decision making tasks
Businesses and organizations
 product and service benchmarking
 survey on a topic
Ads placements
 Place an ad when one praises an product
 Place an ad from a competitor if one criticizes a
product
[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]
4/10/2016
Data Mining: Principles and Algorithms
22
Subtasks

Opinion finding & sentiment classification
 Opinion finding


Sentiment classification




Lexicon based method
Machine learning
Opinion Summarization
 How to show opinion finding/classification results effectively
 Methods





If the opinion is positive or negative
In detail, ‘positive/negative/mixed’
Methods


If the target text is opinion or fact
Basic statistics showing
Feature level summary [Hu & Liu, KDD'04/ Hu & Liu, AAAI'04]
Summary paragraph generation [Kim et al, TAC'08]
Probabilistic analysis [Mei et al, WWW'07]
Other works
4/10/2016
Data Mining: Principles and Algorithms
23
Opinion Finding

Lexicon-based method
 Prepare opinion word list


Check special part of speech expressing opinions



Ex. Adjective: ‘excellent’, ‘horrible’ / Verb: ‘like’, ‘hate’
Decision based on the those words occurrences
Lexicon sources




Ex. Word: ‘good’, ‘bad’ / Phrase: ‘I think’, ‘In my opinion’
Manually classified word lists
WordNet
External sources: Wikipedia (objective), review data (subjective)
Machine learning
 Train with tagged examples
 Main features
Opinion lexicons
 Part-of-speech tag, Punctuation (ex. ! ), Modifiers (ex. not, very)
Word tokens, Dependency

4/10/2016
Data Mining: Principles and Algorithms
24
Opinion Sentiment Classification

Method
 Similar to opinion finding




Lexicon based method
Machine learning
Instead of using ‘opinionated word/examples’,
use ‘positive and negative’ word/examples
If positive/negative dominant
-> positive or negative
Both positive and negative dominantly exist
-> mixed
4/10/2016
Data Mining: Principles and Algorithms
25
Opinion Sentiment Classification

Query dependent sentiment classification

Motivation:




Sentiments are expressed differently in different queries
Ex. Small can be good for ipod size, but can be bad for LCD
monitor size
Use external web sources to obtain positive and negative
opinionated lexicons
Key Ideas


Objective words: Wikipedia, product specification part of
Amazon.com
Subjective words: Reviews from Amazon.com, Rateitall.com and
Epinions.com



[Lee et al, TREC '08/ Jia et al, TREC '08]
Reviews rated 4 or 5 out of 5: positive words
Reviews rated 1 or 2 out of 5: negative words
Top ranked in Text Retrieval Conference
[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]
4/10/2016
Data Mining: Principles and Algorithms
26
Agenda





Overview
Opinion finding & sentiment classification
Opinion Summarization
Other works
Discussion & Conclusion
4/10/2016
Data Mining: Principles and Algorithms
27
Opinion Summarization

Basic statistics
 Show how many numbers of opinions

4/10/2016
Ex. Opinions about ipod
Positive
Negative
80%
20%
Data Mining: Principles and Algorithms
28
Opinion Summarization (cont.)

Feature-based summary [Hu & Liu, KDD '04/ Hu & Liu, AAAI '04]
 Find lower level of features and analyze.

Ex. Opinions about ipod
Battery life


Pos
Neg
Pos
Neg
Pos
Neg
50%
50%
95%
5%
30%
70%
Usually nouns / noun phrases
Frequent feature identification


4/10/2016
Price
Feature extraction


Design
Association mining
Feature pruning and infrequent feature identification
based on heuristic rules
Sentiment summary for each features
Data Mining: Principles and Algorithms
29
Opinion Summarization (cont.)

Summary paragraph generation [Kim et al, TAC '08]
 General NLP summarization techniques


Opinion filtering



Sentence extraction based summary
Show sentences opinionated.
Show sentences having the same polarity to the goal of
the summary
Opinion ordering

Paragraph division by opinion polarity

4/10/2016
[Paragraph1] …
Following are positive opinions…
Following are negative opinions…
[Paragraph2] …
Following are mixed opinions…
…
Data Mining: Principles and Algorithms
30
Opinion Summarization (cont.)

Probabilistic analysis
 Topic sentiment mixture model

[Mei et al, WWW '07]
Topic modeling with opinion priors
Figure.
The generation process
of the topicsentiment mixture model
4/10/2016
Data Mining: Principles and Algorithms
31
Agenda





Overview
Opinion finding & sentiment classification
Opinion Summarization
Other works
Discussion & Conclusion
4/10/2016
Data Mining: Principles and Algorithms
32
Other works

Comparative analysis Focus on texts having
contradiction or comparison.
 Finding comparative sentences [Jindal & Liu, SIGIR '06]




Finding preferred entity


4/10/2016
Comparison indicator such as ‘than’ or ‘as well as’.
Ex. ‘Ipod’ is better than ‘Zune’.
Sequential patterns showing comparative sentences
ex. 〈{NN}{VBZ}{RB}{moreJJR}{NN}{IN}{NN}〉 comparative
[Murthy & Liu, COLING '08]
Rule based approach
Context-dependent orientation finding using Pros and
Cons reviews.
Data Mining: Principles and Algorithms
33
Other works

Opinion Integration [Lu & Zhai, WWW '08]
 Integrate expert reviews with arbitrary text
collection



Arbitrary: not structured, various & updated data
Semi-supervised topic model


4/10/2016
Expert reviews: well structured, easy to find
features, not often updated
Extract structure aspects (features) data from the expert
review to cluster general documents
Add supplementary opinions from general documents
Data Mining: Principles and Algorithms
34
Agenda





Overview
Opinion finding & sentiment classification
Opinion Summarization
Other works
Discussion & Conclusion
4/10/2016
Data Mining: Principles and Algorithms
35
Challenges in opinion mining

Polarity terms are context sensitive

Ex. Small can be good for ipod size, but can be bad for LCD monitor size

Even in the same domain, use different words depending on target
feature

Ex. Long ‘ipod’ battery life vs. long ‘ipod’ loading time
Partially solved (query dependent sentiment classification)
Implicit and complex opinion expressions

Rhetoric expression, metaphor, double negation

Ex. The food was like a stone

Need both good IR and NLP techniques for opinion mining.
Cannot divide into pos/neg clearly

Not all opinions can be classified into two categories

Interpretation can be changed based on conditions

Ex. 1) The battery life is ‘long’ if you do not use LCD a lot (pos)
2) The battery life is ‘short’ if you use LCD a lot (neg)
Current system classify the first one as positive and second one as
negative.
However, actually both are saying the same fact.



[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]
4/10/2016
Data Mining: Principles and Algorithms
36
Discussion



A difficult task
Essential for many blog or review mining
techniques
Current stage of opinion finding
 Good performance in sentence level, specific
domain, sub-problem.
 Still low accuracy in general case

MAP score of TREC ‘08 top performed system



4/10/2016
Opinion finding: 0.4569
Polarity finding: 0.2297~0.2723
A lot of margin to improve !
Data Mining: Principles and Algorithms
37
References








I. Ounis, C. Macdonald and I. Soboroff, Overview of the TREC 2008 Blog Track , TREC, 2008.
Opinion Mining and Summarization: Sentiment Analysis. Tutorial given at WWW-2008, April
21, 2008 in Beijing, China.
Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai. Topic Sentiment Mixture:
Modeling Facets and Opinions in Weblogs, Proceedings of the 16th International World Wide
Web Conference (WWW' 07), pages 171-180, 2007.
Minqing Hu and Bing Liu. "Mining and summarizing customer reviews". To appear in
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining (KDD-2004, full paper), Seattle, Washington, USA, Aug 22-25, 2004.
Minqing Hu and Bing Liu. "Mining Opinion Features in Customer Reviews." To appear in
Proceedings of Nineteeth National Conference on Artificial Intellgience (AAAI-2004), San Jose,
USA, July 2004.
Yue Lu and ChengXiang Zhai. "Opinion Integration Through Semisupervised Topic Modeling",
In Proceedings of the 17th International World Wide Web Conference (WWW'08)
Kavita Ganesan, Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008
Hyun Duk Kim, Dae Hoon Park, V.G.Vinod Vydiswaran, and ChengXiang Zhai,Opinion
Summarization Using Entity Features and Probabilistic Sentence Coherence Optimization:
UIUC at TAC 2008 Opinion Summarization Pilot, Text Analysis Conference (TAC), Maryland,
USA.
4/10/2016
Data Mining: Principles and Algorithms
38
References





Y. Lee, S.-H. Na, J. Kim, S.-H. Nam, H.-Y. Jung and J.-H. Lee , KLE at TREC 2008 Blog
Track: Blog Post and Feed Retrieval , TREC, 2008.
L. Jia, C. Yu and W. Zhang, UIC at TREC 208 Blog Track, TREC, 2008.
Nitin Jindal and Bing Liu. "Identifying Comparative Sentences in Text Documents" To
appear in Proceedings of the 29th Annual International ACM SIGIR Conference on
Research & Development on Information Retrieval (SIGIR-06), Seattle 2006.
Opinion Mining and Summarization (including review spam detection), tutorial given at
WWW-2008, April 21, 2008 in Beijing, China.
Murthy Ganapathibhotla and Bing Liu, Mining opinions in comparative sentences,
Proceedings of the 22nd International Conference on Computational Linguistics (Coling
2008), pages 241–248, Manchester, August 2008
4/10/2016
Data Mining: Principles and Algorithms
39
Thank you
4/10/2016
Data Mining: Principles and Algorithms
40
Mining User Query Logs for
Personalized Search
Yuanhua Lv
(Some slides are taken from Xuehua Shen, Bin Tan, and ChengXiang Zhai’s presentation)
Problem of Current Search Engines
Jaguar
Suppose we know:
1.
2.
Short-term query logs: previous query =
“racing cars”. [Shen et al. 05]
Apple Software
Car
Long-term query logs: “car” occurs far more
frequentlyAnimal
than “Apple” in the user’s query
logs of the recent 2 months. [Tan et al. 06]
Chemistry Software
42
Problem Definition
?
User Information Need
Q1
User Query
e.g., Apple software
{C1,1 , C1,2 ,C1,3 ,…}
C1
User Clickthrough
Apple - Mac OS X
The Apple Mac OS X product page.
Describes features in the current version of
Mac OS X, a screenshot gallery, latest
software downloads, and a directory of ...
e.g.,
Q2
…
{C2,1 , C2,2 ,C2,3 ,… }
C2
Qk
How to model and mine user query logs?
43
Retrieval Model
Basis: Unigram language model + KL divergence
Mining query logs to update query model
U
Qk
D
θ
'
Qk
Q
θD
Similarity Measure
k
' ||  )
D
(

D(QQkk || DD )
Results
| pk ()w| Qpk(,w
p( w p
| (kw) 
Qk|1Q
,...,
k ) Q1 , Ck 1 ,..., C1 )
Query Logs
44
Mining Short-term User Query Logs
[Shen et al. 05]
Average user’s previous clickthrough
C1
p( w | H C ) 
i  k 1
1
k 1
 p( w | C ) …
i
i 1
Ck-1
H

p( w | H Q )  k11
 p( w | Q ) …
i 1
i
Qk-1
H
1 
Average user’s previous queries
Q1
i  k 1
p( w | H )   p( w | H C )  (1   ) p( w | H Q )
C
H
Q
Combine previous clickthrough
and previous queries
1

k
Linearly interpolate current query
and history model
Qk
p(w | k )   p(w | Qk )  (1   ) p(w | H )
45
Four Heuristic Variants
• FixInt: fixed coefficient interpolation
p(w | k )   p(w | Qk )  (1   ) p(w | H )
Mining Short-term User Query Logs
[Shen et al. 05]
Average user’s previous clickthrough
C1
p( w | H C ) 
i  k 1
1
k 1
 p( w | C ) …
i
i 1
Ck-1
H

p( w | H Q )  k11
 p( w | Q ) …
i 1
i
Qk-1
H
1 
Average user’s previous queries
Q1
i  k 1
p( w | H )   p( w | H C )  (1   ) p( w | H Q )
C
H
Q
Combine previous clickthrough
and previous queries
1

k
Fixed α?
Linearly interpolate current query
and history model
Qk
p(w | k )   p(w | Qk )  (1   ) p(w | H )
47
Four Heuristic Variants
• FixInt: fixed coefficient interpolation
• BayesInt: adapt the interpolation coefficient to
different query length
– Intuition: if the current query Qk is longer, we
should trust Qk more
Mining Short-term User Query Logs
[Shen et al. 05]
Average user’s previous clickthrough
C1
p( w | H C ) 
i  k 1
1
k 1
 p( w | C ) …
i
i 1
Ck-1
Average?
H

p( w | H Q )  k11
 p( w | Q ) …
i 1
i
Qk-1
H
1 
Average user’s previous queries
Q1
i  k 1
p( w | H )   p( w | H C )  (1   ) p( w | H Q )
C
H
Q
Combine previous clickthrough
and previous queries
1

k
Fixed α?
Linearly interpolate current query
and history model
Qk
p(w | k )   p(w | Qk )  (1   ) p(w | H )
49
Four Heuristic Variants
• FixInt: fixed coefficient interpolation
• BayesInt: adapt the interpolation coefficient to
different query length
– Intuition: if the current query Qk is longer, we
should trust Qk more
• OnlineUp: assign more weight to more recent
records.
• BatchUp: the user becomes better and better at
query formulation as time goes on, but we do not
need to “decay” the clickthrough.
Data Set of Evaluation
•
•
•
•
Data collection: TREC AP88-90
Topics: 30 hard topics of TREC topics 1-150
System: search engine + RDBMS
Context: Query and clickthrough history of 3
participants.
51
Overall Effect of Search Context
FixInt
Query
BayesInt
OnlineUp
BatchUp
(=0.1,=1.0)
(=0.2,=5.0)
(=5.0,=15.0)
(=2.0,=15.0)
MAP
MAP
pr@20
MAP
pr@20
MAP
pr@20
pr@20
Q3
0.0421 0.1483 0.0421
0.1483
0.0421
0.1483
0.0421 0.1483
Q3+HQ+HC
0.0726 0.1967 0.0816
0.2067
0.0706
0.1783
0.0810 0.2067
Improve
72.4%
93.8%
39.4%
67.7%
20.2%
92.4%
Q4
0.0536 0.1933 0.0536
0.1933
0.0536
0.1933
0.0536 0.1933
Q4+HQ+HC
0.0891 0.2233 0.0955
0.2317
0.0792
0.2067
0.0950 0.2250
Improve
66.2%
19.9%
47.8%
6.9%
77.2%
32.6%
15.5%
78.2%
39.4%
16.4%
• Short-term query log helps system improve retrieval accuracy
• BayesInt better than FixInt; BatchUp better than OnlineUp
52
Mining Long-term User Query Log
[Tan et al. 05]
• Can we mine long-term user query log
similarly?
• Challenge: long-term query log is noisy
– How do we handle the noise?
– Can we still improve performance?
• Solution:
– Assign weights to the query log data (EM
algorithm)
Hierarchical History Models
Weights for query log units
S1
S2
q1D1C1
q2D2C2
θS1
θS2
St-1
...... q D C
t-1 t-1 t-1
......
θH
St
qtDt
unit history model
θSk ← qkDkCk
overall history model
θH = Σwk θSk
θSt-1
original query model
θq
θq
document model
θq,H
{θd}
D(θq,H||θd)
θd
contextual query model
θq,H
Discriminative Weighting with Mixture
Model
S1
S2
q1D1C1
q2D2C2
θS1
Background
model
θB
St-1
...... q D C
t-1 t-1 t-1
......
θS2
λ2?
λ1?
λt-1?
St
qtDt
θSt-1
θH
θq
<d1>
jaguar car perfect for
racing
<d2>
jaguar is a big cat...
<d3>
locate jaguar dealer
in champaign...
λq?
λB?
θMix
Select {λ} to fit the data: maximize p(Dt|θMix)
EM algorithm
Experimental Results
two query types
recurring ≫ fresh
combination ≈ clickthrough > docs > query, contextless
Summary
• Mining user query logs can personalize search
results and improve retrieval performance
– Four different models to exploit short-term query
logs [Shen et al. 05].
– Assign weights to the long-term query logs to
reduce the effect of noise [Tan et al. 06].
Reference
• Xuehua Shen, Bin Tan, ChengXiang Zhai:
Context-sensitive information retrieval using
implicit feedback. SIGIR 2005: 43-50
• Bin Tan, Xuehua Shen, ChengXiang Zhai:
Mining long-term search history to improve
search accuracy. KDD 2006: 718-723
The End
Thank you !
59
Data Mining:
Concepts and Techniques
— Chapter 11 —
11.8. Online Analytical Processing on
Multidimensional Text Database
Duo Zhang
Department of Computer Science
University of Illinois at Urbana-Champaign
http://sifaka.cs.uiuc.edu/~dzhang22/
4/10/2016
60
Online Analytical Processing on
Multidimensional Text Database



Motivation
Text Cube: Computing IR Measures for
Multidimensional Text Database Analysis
Topic Cube: Topic Modeling for OLAP on
Multidimensional Text Databases
4/10/2016
61
Motivation
• Industry and commercial applications often
collect huge amount of data containing both
structured data records and unstructured text
data in a multidimensional text database
•
•
Incident reports
•
Job descriptions
•
Product reviews
•
Service feedback
It is highly desirable and strategically important to
support high-performance search and mining over such
databases
4/10/2016
62
Examples


Aviation Safety Reporting System
Location
Environment
…
Narrative
199801
TX
Daylight
…
…… I TOLD HIM I WAS AT 2000 FT AND HE
SAID OK……
199801
LA
Daylight
…
……WE STOPPED THE DSCNT AT CIRCLING
MINIMUMS……
199801
LA
Night
…
……THE TAXI/LNDG LIGHTS VERY DIM. NO
OTHER VISIBLE TFC IN SIGHT……
199902
FL
Night
…
……I FEEL WE SHOULD ALL EDUCATE
OURSELVES ON CHKLISTS……
How to organize the data to help experts efficiently explore and
digest text information?


Time
e.g. compare the reports in 1998 and reports in 1999?
How to help experts analyze a specific type of anomaly in
different contexts?

e.g. what did pilots say about anomaly “landing without clearance”
during daylight v.s. night?
Online Analytical Processing on
Multidimensional Text Database


Motivation
Text Cube: Computing IR Measures for
Multidimensional Text Database Analysis
C. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao (ICDE’08)

Topic Cube: Topic Modeling for OLAP on
Multidimensional Text Databases
4/10/2016
64
Text Cube

Text Cube
 A novel data cube model integrating the power of
traditional data cube and IR techniques for effective
text mining
 Computing IR measures for multidimensional text
database analysis
 Heterogeneous records to be examined
 Structured categorical attributes
 Unstructured free text
 IR statistics are evaluated
 TF-IDF
 Inverted Index
4/10/2016
65
Text Cube - Implementation



Preprocessing
stemming, stop words elimination, TF-IDF weighting
Concept hierarchy construction
 A dimension hierarchy takes the form of a tree or a
DAG. An attribute at a lower level reveals more details
 Four operations are supported: roll-up, drill-down, slice
and dice
Term hierarchy construction
 A term hierarchy represents semantic levels of terms in
the text and their correlations
 Infusion with expert knowledge
 Two novel operations: Pull-up & Push-down
4/10/2016
66
Text Cube - Implementation


Partial materialization: if a non-materialized cell is
retrieved, we compute it on-the-fly based on the partially
materialized cuboids
A balance between time and space: given a time
threshold δ, we minimize storage size within the query
time bound δ for retrieving all cells to be interested in
4/10/2016
67
Experiment – Efficiency and Effectiveness
Compare avgTF under different
“Environment: Weather Elements”
Compare avgTF under different
“Supplementary: Problem Areas”
68
Online Analytical Processing on
Multidimensional Text Database



Motivation
Text Cube: Computing IR Measures for
Multidimensional Text Database Analysis
Topic Cube: Topic Modeling for OLAP on
Multidimensional Text Databases
D. Zhang, C. Zhai, and J. Han (SDM’09)
4/10/2016
69
Motivation


Aviation Safety Reporting System
Location
Environment
…
Narrative
199801
TX
Daylight
…
…… I TOLD HIM I WAS AT 2000 FT AND HE
SAID OK……
199801
LA
Daylight
…
……WE STOPPED THE DSCNT AT CIRCLING
MINIMUMS……
199801
LA
Night
…
……THE TAXI/LNDG LIGHTS VERY DIM. NO
OTHER VISIBLE TFC IN SIGHT……
199902
FL
Night
…
……I FEEL WE SHOULD ALL EDUCATE
OURSELVES ON CHKLISTS……
How to organize the data to help experts efficiently explore and
digest text information?


Time
e.g. compare the reports in 1998 and reports in 1999?
How to help experts analyze a specific type of anomaly in
different contexts?

e.g. what did pilots say about anomaly “landing without clearance”
during daylight v.s. night?
Solution: Topic Cube
Topic
Topic
turbulence
Encounter
birds
undershoot
overshoot
LAX
SJC MIA
Location
AUS
Deviation
98.02
98.01
99.02
99.01
1998
1999
CA
drill-down
roll-up
FL
TX
Location
Challenges:
 How to support operations along the topic dimension?
 How to quickly extract semantic topics?
Constructing Topic Cube
ALL
roll-up
Anomaly Altitude Deviation…… Anomaly Maintenance
Problem
Undershoot …… Overshoot Improper
Documentatio
n
Time
Loc
Env
…
98.01
TX
Daylight
…
98.01
98.01
99.02
LA
LA
FL
Daylight
Night
Night
…
……
…… Anomaly Inflight
Encounter
Improper
Maintenanc
e
Birds
Altitude
Ft
Climb
…
Narrative
Descent
Cloud
Ft
…
…
0.03
0.02
0.01
….
0.06
0.03
0.01
….
…
Descent
System
View
…
drill-down
…… Turbulence
0.05
0.02
0.01
….
Altitude
Ft
Instruct
…
0.04
0.03
0.01
….
Materialization
Topic Dimension (Anomaly Event)
Mtopic-agg
C
C
LAX-overshoot
LAX-altitude
Mstd-agg
Standard
Dimension
(Location)
CA-altitude
Mstd-agg
Mtopic-agg
pc(0) ( w | i( L ) ) 

C
 c(w, d ) p( zd ,w  j ')

w'

j '{ s( L1) ,
i
i
,e( L1) }
i
 c(w ', d ) p( z
d
Mtopic-agg
US-altitude
j '{ s( L1) , ,e( L1) } d
i
LAX-all
Mstd-agg
C
CA-all
Mstd-agg
US-overshoot
Mtopic-agg:
Mtopic-agg
C
CA-overshoot
C
C
Mstd-agg
Mtopic-agg
C
Mtopic-agg
d ,w'
 j ')
Mstd-agg
C
Mstd-agg:
( Ln )
pc(0)
(
w
|

)
j
a
US-all
  c(w, d ) p( z
ci d Dci
d ,w
  c(w ', d ) p( z
w'
ci d Dci
 j)
d ,w'
 j)
Experimental Results
daylight
night
Word
p(w|θ)
Tower
0.075
Pattern
0.061
Final
0.060
Runway
0.053
Land
0.052
Downwind
0.039
Tower
0.035
Runway
0.029
Light
0.027
Instrument Landing System
0.015
Beacon
0.014
landing without clearance
…WINDS ALOFT AT PATTERN ALT OF 1000 FT MSL, WERE
MUCH STRONGER AND A DIRECT XWIND. NEEDLESS TO
SAY, THE PATTERNS AND LNDGS WERE DIFFICULT FOR
MY STUDENT AND THERE WAS LIGHT TURB ON THE
DOWNWIND…
…I LISTENED TO HWD ATIS AND FOUND THE TWR
CLOSED AND AN ANNOUNCEMENT THAT THE HIGH
INTENSITY LIGHTS FOR RWY 28L WERE INOP.
BROADCASTING IN THE BLIND AND LOOKING FOR THE
TWR BEACON AND LOW INTENSITY LIGHTS AGAINST A
VERY BRIGHT BACKGROUND CLUTTER OF STREET
LIGHTS, ETC…
Iterations
Closeness to the optimum point
1 4 7 1013161922252831343740434649
Objective
Function
-480000
-490000
-500000
-510000
-520000
-530000
-540000
Time
(sec.)
Aggregate
Baseline
200
150
100
50
0
Aggregate
Baseline
0.2568%
0.2466%
0.2365%
0.2263%
0.2162%
0.2060%
0.1959%
0.1857%
0.1756%
0.1654%
0.1553%
0.1451%
0.1350%
0.1248%
Context
4/10/2016
75