Analysis of Causal Topics in Text Data and Time Series with Applications to Presidential Prediction Markets
Hyun Duk Kim, ChengXiang ("Cheng") Zhai (UIUC)
Thomas A. Rietz (Univ. of Iowa)
Daniel Diermeier (Northwestern Univ.)
Meichun Hsu, Malu Castellanos, and Carlos Ceja (HP Labs)
Text Mining for Understanding Time Series
What might have caused the stock market crash? Any clues in the companion news stream?
[Figure: Dow Jones Industrial Average over time, with a sharp drop marked "Sept 11 Attack!". Source: Yahoo Finance]
Analysis of Presidential Prediction Markets
What might have caused the sudden drop of price for this candidate? What "mattered" in this election (tax cut? …)? Any clues in the companion news stream?
[Figure: a candidate's prediction-market price over time]
Joint Analysis of Text and Time Series to Discover "Causal Topics"
• Input:
– Time series
– Text data produced in a similar time period (text stream)
• Output:
– Topics whose coverage in the text stream has strong correlations with the time series ("causal" topics), e.g., tax cut, gun control, …
Related Work
• Topic modeling (e.g., [Hofmann 99], [Blei et al. 03], …)
– Extracts topics from text data and reveals their patterns
– No consideration of time series → extracted topics may not be correlated with the time series
• Stream data mining (e.g., [Agrawal 02])
– Clustering & categorization of time series data
– No topics generated for text data
• Temporal text retrieval and prediction (e.g., [Efron 10], [Smith 10])
– Incorporates the time factor in retrieval or text-based prediction
– No topics generated
New problem: discover causal topics from text streams with time series data for supervision
Background: Topic Models
• Topic = multinomial distribution over words (unigram language model)
• Text is assumed to be a sample of words drawn from a mixture of multiple (unknown) topics
• Parameter estimation and Bayesian inference "reveal":
– All the unknown topics in a text collection
– The coverage of each topic in each document
• A prior can be imposed to bias the inference of both topics and topic coverage
Document as a Sample of Mixed Topics
[Figure: a generative topic model. An example document — "Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response to the flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated … Over seventy countries pledged monetary donations or other assistance. …" — is generated by mixing k topics plus a background distribution:
Topic 1: government 0.3, response 0.2, …
Topic 2: city 0.2, new 0.1, orleans 0.05, …
…
Topic k: donate 0.1, relief 0.05, help 0.02, …
Background: is 0.05, the 0.04, a 0.03, …
Inference/estimation recovers the topics; a prior can be added on them.]
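The generative view above can be sketched in a few lines of Python. This is an illustrative toy, not the estimation procedure from the talk: the topic distributions, coverage weights, and document length below are made up for the example.

```python
import random

# Toy topics in the style of the slide: each topic is a multinomial over words.
# (Probabilities are invented and renormalized for the example.)
TOPICS = [
    {"government": 0.6, "response": 0.4},           # Topic 1
    {"city": 0.57, "new": 0.29, "orleans": 0.14},   # Topic 2
    {"donate": 0.59, "relief": 0.29, "help": 0.12}, # Topic k
]

def generate_doc(topics, coverage, length, rng):
    """Sample a document: pick a topic per word (by coverage), then a word."""
    doc = []
    for _ in range(length):
        topic = rng.choices(topics, weights=coverage)[0]
        words = list(topic)
        doc.append(rng.choices(words, weights=[topic[w] for w in words])[0])
    return doc

doc = generate_doc(TOPICS, coverage=[0.5, 0.3, 0.2], length=20,
                   rng=random.Random(0))
```

Inference runs this story in reverse: given only `doc`, recover the topics and their coverage.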
When a Topic Model Is Applied to a Text Stream
[Figure: the same topics (Topic 1: government 0.3, response 0.2, …; Topic 2: city 0.2, new 0.1, orleans 0.05, …; Topic k: donate 0.1, relief 0.05, help 0.02, …; Background: is 0.05, the 0.04, a 0.03, …), each traced as a coverage curve over time]
New Text Mining Framework: Iterative Causal Topic Modeling
[Figure: the pipeline. Topic modeling is applied to the text stream (Sep 2001, Oct 2001, …) to produce Topics 1–4; each topic's coverage series is tested against the non-text time series to select causal topics; zooming into the word level, each word in a causal topic is marked + or − by its relation to the series; the words are split by sign (e.g., Topic 1 → Topic 1-1 with W1, W3 (+) and Topic 1-2 with W2, W4 (−)); the resulting causal words are fed back as a prior for the next round of topic modeling.]
Iterative Causal Topic Modeling Framework
[Figure: the same pipeline as on the previous slide — topic modeling on the text stream, causality testing against the non-text time series, word-level splitting, and feedback as prior]
• General framework for any topic modeling and any causality measure
• Naturally incorporates the non-text time series in the process
• Topic level + word level → efficiency + granularity
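One round of the zoom-in-and-split step can be sketched as follows. This is a minimal stand-in, not the talk's implementation: it uses a lag-1 Pearson correlation as the causality measure, an arbitrary significance threshold of 0.5, and hypothetical names (`split_words_by_impact`, `lagged_pearson`).

```python
import numpy as np

def lagged_pearson(x, y, lag=1):
    """Pearson correlation between x_t and y_(t+lag)."""
    a = np.asarray(x[:-lag], dtype=float)
    b = np.asarray(y[lag:], dtype=float)
    a, b = a - a.mean(), b - b.mean()
    d = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / d) if d else 0.0

def split_words_by_impact(word_series, price, threshold=0.5):
    """Score each word's coverage series against the time series, keep
    words above the threshold, and split them by sign into two prior
    pseudo-topics (weights normalized to probabilities)."""
    scores = {w: lagged_pearson(s, price) for w, s in word_series.items()}
    pos = {w: c for w, c in scores.items() if c >= threshold}
    neg = {w: -c for w, c in scores.items() if c <= -threshold}
    def norm(d):
        z = sum(d.values())
        return {w: v / z for w, v in d.items()} if z else {}
    return norm(pos), norm(neg)  # fed back as priors for the next round

# Tiny seeded example: "tax" tracks the next-step price, "gun" anti-tracks it.
rng = np.random.default_rng(0)
price = rng.standard_normal(200)
series = {"tax": np.roll(price, -1),
          "gun": -np.roll(price, -1),
          "noise": rng.standard_normal(200)}
pos_prior, neg_prior = split_words_by_impact(series, price)
```

The two returned dictionaries play the role of the split pseudo-topics (Topic 1-1 and Topic 1-2 in the figure).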
Heuristic Optimization of Causality + Coherence
Causality Measures
• Pearson correlation
– Basic correlation between the topic's coverage series and the time series
• Granger test
– For two time series x (topic) and y (stock) with time lag p, fit an auto-regression of y on its own lagged values plus lagged values of x:
y_t = α + β_1 y_{t−1} + … + β_p y_{t−p} + γ_1 x_{t−1} + … + γ_p x_{t−p} + ε_t
– A significance test decides whether the lagged x terms should be retained or not; if they should, x "Granger-causes" y
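The Granger test above can be sketched directly with least squares: compare a restricted auto-regression of y on its own lags against a full model that also includes lags of x, and check whether the lagged-x terms reduce the residual sum of squares significantly. A minimal sketch that computes only the F statistic (in practice one would use a library routine such as statsmodels' `grangercausalitytests`, which also reports p-values):

```python
import numpy as np

def granger_f(x, y, p):
    """F statistic for 'x Granger-causes y' with lag order p."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    Y = y[p:]
    ylags = np.column_stack([y[p - i:n - i] for i in range(1, p + 1)])
    xlags = np.column_stack([x[p - i:n - i] for i in range(1, p + 1)])
    ones = np.ones((n - p, 1))
    Xr = np.hstack([ones, ylags])          # restricted: y lags only
    Xf = np.hstack([ones, ylags, xlags])   # full: y lags + x lags
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        r = Y - X @ beta
        return (r * r).sum()
    df2 = (n - p) - Xf.shape[1]
    return ((rss(Xr) - rss(Xf)) / p) / (rss(Xf) / df2)

# Seeded toy: y is driven by the previous value of x, not vice versa.
rng = np.random.default_rng(1)
x = rng.standard_normal(200)
e = rng.standard_normal(200)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.9 * x[t - 1] + 0.1 * e[t]
f_xy, f_yx = granger_f(x, y, p=1), granger_f(y, x, p=1)
```

A large F for x→y and a small F for y→x is the asymmetry the framework looks for.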
Feedback Prior Generation
Word-level causality for Topic 1 (of 5):

Word       Impact  Significance (%)
Social     +       99
Security   +       96
Gun        −       98
Control    −       96
September  −       99
Airline    −       99
Terrorism  −       97
Attack     −       96
Good       +       96

Prior topics generated by splitting on impact sign:

Word        Prob
Topic 1:    Social 0.8, security 0.2
Topic 2:    Gun 0.75, Control 0.25
Topic 3:    September 0.1, Airline 0.1, Terrorism 0.075, … (5 more words), Attack 0.05, Good 0.0
… (5 more)
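One way to turn the word-level impact table above into a prior: keep the significant words, split them by impact sign, and weight each word within its split. A sketch under the assumption that each word's prior probability is proportional to its significance (the exact weighting scheme is not spelled out on the slide, and the function name is hypothetical):

```python
def feedback_prior(word_impacts, min_sig=95.0):
    """word_impacts: (word, sign, significance%) rows for one topic.
    Returns two prior pseudo-topics: positive-impact and negative-impact,
    each normalized to a probability distribution."""
    def build(sign):
        kept = [(w, s) for w, sg, s in word_impacts
                if sg == sign and s >= min_sig]
        total = sum(s for _, s in kept)
        return {w: s / total for w, s in kept} if total else {}
    return build("+"), build("-")

# A few rows in the style of the table above.
rows = [("social", "+", 99), ("security", "+", 96),
        ("gun", "-", 98), ("control", "-", 96)]
pos_prior, neg_prior = feedback_prior(rows)
```

Each returned distribution then biases one topic in the next round of topic modeling.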
Experiment Design 1: Stock Market Analysis
• Time: June 2000 – Dec. 2011
• Text data
– New York Times
• Time series
– American Airlines stock (AAMRQ)
– Apple stock (AAPL)
• Question: are there any "causal topics" that explain the fluctuations of the two companies' stocks?
Experiment Design 2: 2000 Presidential Election Campaign
• Time: May 2000 – Oct. 2000
• Text data
– New York Times (text mentioning Bush or Gore)
• Time series
– Normalized "Gore stock price" in the Iowa Electronic Markets (IEM), an online futures market:
"Gore price" / ("Gore price" + "Bush price")
• Question: are there any "causal topics" that explain changes in opinions about Gore?
Measuring Topic Quality
• Causality confidence of a topic
– Based on the p-value of the causality test (Granger, Pearson) for the topic
• Topic purity
– Consistency in the direction of the "causal" relation with the time series ("are all words in the topic positively correlated with the time series?")
– Based on the entropy of the distribution of positive/negative words
Topic Purity
p(T = "pos") = #positiveWords / (#positiveWords + #negativeWords)
Entropy H(T) = − p(T = "pos") log p(T = "pos") − p(T = "neg") log p(T = "neg")
Purity(T) = 100 × (1 − H(T))

Example word table (Topic 1 of 5):

Word       Impact  Significance (%)
Social     +       99
Security   +       96
Gun        −       98
Control    −       96
September  −       99
Airline    −       99
Terrorism  −       97
Attack     −       96
Good       +       96

[Figure: H(T) as a function of p(T = "pos"), peaking at 1.0 when p(T = "pos") = 0.5]
• p(T = "pos") = p(T = "neg") = 1/2 → highest entropy → lowest purity (0)
• p(T = "pos") = 1/5, p(T = "neg") = 4/5 → lower entropy → higher purity
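The purity formula above translates directly to code. Log base 2 is assumed here so that the maximum entropy is 1 and purity ranges over 0–100, matching the slide's "lowest purity (0)" at p = 1/2:

```python
import math

def purity(n_pos, n_neg):
    """Purity(T) = 100 * (1 - H(T)), with H in bits so purity is in [0, 100]."""
    p = n_pos / (n_pos + n_neg)
    h = -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)
    return 100 * (1 - h)
```

For the slide's examples: `purity(1, 1)` gives 0 (a perfectly mixed topic), while `purity(1, 4)` gives about 27.8, and a topic whose words all point the same way scores 100.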
Sample Result 1: Topics Discovered for AAMRQ vs. AAPL

AAMRQ                         AAPL
russia russian putin          russia russian europe
europe european germany       olympic games olympics
bush gore presidential        she her ms
police court judge            oil ford prices
airlines airport air          black fashion blacks
united trade terrorism        computer technology software
food foods cheese             internet com web
nets scott basketball         football giants jets
tennis williams open          japan japanese plane
awards gay boy                …
moss minnesota chechnya
paid notice st

– Significant topic lists for the two different external time series:
AAMRQ: airline, terrorism topics; AAPL: IT-industry topics
→ The topics discovered depend on the external time series
Effect of Iterations on Causality Confidence & Purity
[Figure: causality confidence and purity across iterations]
Different Feedback Strength (µ)
[Figure: three panels — Average Confidence, Average Purity, and Number of Significant Topics — over iterations 1–5, one curve per µ ∈ {10, 50, 100, 500, 1000}]
• Significant improvement in confidence and in the number of significant topics from feedback
– Clear benefit of feedback
• A large µ guarantees topic purity improvement
Sample Result 2: Major Topics in the 2000 Presidential Election

Top three words in significant topics:
tax cut 1
screen pataki giuliani
enthusiasm door symbolic
oil energy prices
news w top
pres al vice
love tucker presented
partial abortion privatization
court supreme abortion
gun control nra

• Revealed several important issues
– E.g., tax cut, abortion, gun control, oil/energy
– Such topics are also cited in the political science literature [Pomper '01] and Wikipedia [Link]
Additional Results:
http://sifaka.cs.uiuc.edu/~hkim277/InCaToMi/demo/2000_Presidential_Election/dashboard/Dashboard.html
Conclusions & Future Work
• Meaningful topics can be extracted from a text stream by using time series for supervision
• Such "causal" topics provide potential explanations for changes in the time series data
• Preliminary experimental results on the 2000 presidential prediction markets are promising
• Future work (discussion)
– Issues related to topic models (e.g., local maxima, number of topics, interpretation of topics)
– Issues related to causality analysis (e.g., "local" causality)
– A unified analysis model
– A system to support online interactive analysis of causal topics (time series can be derived from text too)
Thank You!
Questions/Comments?