Discovering Evolutionary Theme Patterns from Text

Download Report

Transcript Discovering Evolutionary Theme Patterns from Text

Discovering Evolutionary Theme
Patterns from Text
- An Exploration of Temporal Text Mining
Qiaozhu Mei and ChengXiang Zhai
Department of Computer Science
University of Illinois at Urbana Champaign
SIGKDD’05
Introduction
• Temporal Text Mining (TTM):
discovering temporal patterns in text information
collected over time.
• In this paper
– Discovering and summarizing the evolutionary
patterns of themes in a text stream
– Revealing the life cycle of a theme
Introduction (cont.)
• We solve this problem through
– 1. Discovering latent themes from text
– 2. Constructing an evolution graph of themes
– 3. Analyzing life cycles of themes.
• Evaluation
– News articles
– The abstracts of the ACM KDD conference
papers
Definitions
• Time indexed documents C = {d1, d2, …, dT}
vocabulary set V = {w1, …, w|V|}
• Theme θ
– A unigram language model {p(w|θ)}
• Theme Span γ = 〈θ, s(γ), t(γ)〉
– If s = 1 and t = T, then γ is a trans-collection theme
• Evolutionary Transition
– If t(γ1) ≦ s(γ2) and similarity(γ1,γ2) ﹥threshold, we say
that there is an evolutionary transition from γ1 to γ2, γ1
γ2

Definitions (cont.)
• Theme Evolution Thread
– A sequence of theme spans γ0, γ1, …, γn such
that γi γi+1 
• Theme Evolution Graph
– Weighted directed graph G = (N,E), where N
is the set of all theme spans, E is the set of
evolutionary transition.
Example of Theme Evolution Graph
Theme Span
Theme
Evolution
Thread
Evolutionary
Transition
Evolution Graph Discovery
• Partition the documents into sub-collections
C = C1∪ C2∪…∪Cn
• Extract the most salient themes  i = {  i ,1, …,  i,ki }
from each sub-collection Ci
• For any themes 1  1 and  2  2 where i < j,
decide whether there is an evolutionary
transition.
Theme Extraction
• Let θ1, …, θk be k themes and θB be a
background model for the whole collection C.
• A document d is regarded as a sample of the
following mixture model:
w: a word in d, πd,j : mixing weight for d choosing
θj, λ: mixing weight for θB
• The log-likelyhood of Ci,
• Using EM algorithm to train
   j ,  d , j | d  Ci , 1  j  k 
Parameter Estimation
• {zd,w} is a hidden variable
• p(zd,w = j) indicates that the word w in document d is
generated using theme j given that w is not generated
from the background mode.
Evolutionary Transition Discovery
• For every pair of theme spans γ1 = 〈θ1,
s(γ1), t(γ1)〉and γ2 = 〈θ2, s(γ2), t(γ2)〉
where t(γ1) ≦s(γ2)
• Kullback-Leibler divergence
• If D(θ2 || θ1) ﹥ξ, then γ1
γ2 
Analysis of Theme Life Cycles
• Theme Life Cycle : the strength distribution of
the trans-collection theme over the entire time
line.
• Assume the collection is
generated from HMM
– States → Themes
– Output symbol set → V
– Output probability distribution
→ the multinomial distribution
of words of that state
• Obtain state sequence with
Viterbi algorithm
Analysis of Theme Life Cycles
(cont.)
• The absolute and relative strengths of theme i at
time t
1 if word
=
0 otherwise
is labeled as theme i
Experimental Data Sets
• News about Asia Tsunami
– Dec. 19 2004 to Feb. 8 2005 (50 days)
– Downloaded with query “tsunami”
• The abstracts in KDD conference
proceeding from 1999 to 2004
Theme Spans from Tsunami
Theme Evolution Graph for Tsunami
c:
Theme Life Cycle in CNN
Absolute life cycle
in CNN data
Theme Life Cycle in XINHUA
• A
Absolute life cycle in XINHUA data
Normalized life cycle in XINHUA data
Theme Spans from KDD
Theme Evolution Graph for KDD
a:
classification
Typical
classification tech.
Web classification
Clustering &
random variables
Theme Life Cycle for KDD
Business
Web Info.
Classification
Association Rule
Biology Data
Time series
Clustering
Conclusions
• We propose methods to discover evolutionary
theme patterns and analyze the life cycle of
each theme
• The proposed methods can generate meaningful
temporal theme structures on the two
experimental data sets.
• Our methods are generally applicable to any text
stream data.
• Future works
– Hierarchical theme clustering
– Temporal theme mining system