Automatic Labeling of Multinomial Topic Models

Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai
University of Illinois at Urbana-Champaign
Outline
• Background: statistical topic models
• Labeling a topic model
– Criteria and challenge
• Our approach: a probabilistic framework
• Experiments
• Summary
Statistical Topic Models for Text Mining
• Probabilistic topic modeling turns text collections into topic models: multinomial word distributions (a minimal representation is sketched below)
[Figure: example topics, e.g. term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, ... and web 0.21, search 0.10, link 0.08, graph 0.05, ...]
• Models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], CPLSA [Mei & Zhai 06], Topic over time [Wang et al. 06], ...
• Applications: subtopic discovery, topical pattern analysis, summarization, opinion comparison, ...
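Each topic above is simply a multinomial distribution over the vocabulary. As a minimal sketch (not the authors' code), such a topic can be held as a word-to-probability mapping; the values reuse the examples shown on this slide.

```python
# A topic = a multinomial distribution over words.
# Illustration only, using the example values shown on this slide.

retrieval_topic = {"term": 0.16, "relevance": 0.08, "weight": 0.07,
                   "feedback": 0.04, "independence": 0.03, "model": 0.03}
web_topic = {"web": 0.21, "search": 0.10, "link": 0.08, "graph": 0.05}

for name, topic in (("theta_1", retrieval_topic), ("theta_2", web_topic)):
    top = sorted(topic.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(name, "->", ", ".join(f"{w} ({p:.2f})" for w, p in top))
```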
Topic Models: Hard to Interpret
• Use the top words
  – Automatic, but hard to make sense of
• Human-generated labels
  – Make sense, but cannot scale up
[Figure: a topic (term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, ...) shown as the word list "term, relevance, weight, feedback" vs. the human label "Retrieval Models"; a second topic (insulin, foraging, foragers, collected, grains, loads, collection, nectar, ...) left with "?"]
Question: Can we automatically generate understandable labels for topics?
What is a Good Label?
[Figure: a topic in SIGIR (Mei and Zhai 06): term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, ...; candidate labels shown: "Retrieval models", "iPod Nano", "じょうほうけんさく" ("information retrieval" in Japanese), "Pseudo-feedback", "Information Retrieval"]
A good label should be:
• Semantically close to the topic (relevance)
• Understandable – phrases?
• Of high coverage inside the topic
• Discriminative across topics
Our Method
Given a topic (term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, ...) estimated from a collection (e.g., SIGIR):
1. Candidate label pool – extract candidate phrases from the collection with an NLP chunker or n-gram statistics: "information retrieval, retrieval model, index structure, relevance feedback, ..." (a toy version is sketched below)
2. Relevance score – rank the candidates by relevance to the topic: information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, ...
3. Discrimination – penalize labels that are also relevant to the other topics (e.g., "filtering 0.21, collaborative 0.15, ..." and "trec 0.18, evaluation 0.10, ..."): the overly general "information retrieval" drops from 0.26 to 0.01, leaving retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, ...
4. Coverage – re-rank with MMR so the selected labels cover different aspects of the topic: retrieval models 0.20, pseudo feedback 0.09, ..., with the redundant "IR models" demoted (0.18 → 0.02)
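A rough, runnable sketch of step 1. Plain within-document bigram counts stand in here for the NLP chunker / significant-bigram statistics named above; the toy documents and the count threshold are assumptions.

```python
# Step 1 (sketch): build a candidate label pool from a collection.
# Within-document bigram counts stand in for the NLP chunker / n-gram
# statistics; the toy documents and the min_count threshold are assumptions.

from collections import Counter

docs = [
    "information retrieval models use relevance feedback",
    "retrieval models weight each term by relevance",
    "pseudo relevance feedback improves retrieval models",
]

def candidate_labels(docs, min_count=2):
    counts = Counter()
    for d in docs:
        toks = d.lower().split()
        counts.update(" ".join(bg) for bg in zip(toks, toks[1:]))
    return sorted(l for l, c in counts.items() if c >= min_count)

print(candidate_labels(docs))  # ['relevance feedback', 'retrieval models']
```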
Relevance (Task 2): the Zero-Order Score
• Intuition: prefer phrases that cover the top words of the topic well
[Figure: a latent topic θ about clustering, with p("clustering"|θ) = 0.4, p("dimensional"|θ) = 0.3, ..., p("shape"|θ) = 0.01, p("body"|θ) = 0.001; other words shown: algorithm, birch, body, ...]
Good label (l1 = "clustering algorithm") vs. bad label (l2 = "body shape"):

  p("clustering algorithm" | θ) / p("clustering algorithm")  >  p("body shape" | θ) / p("body shape")

(a sketch of this score follows below)
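A minimal sketch of the zero-order score, log p(l|θ) − log p(l), treating the words of the label phrase as independent. The values for "clustering", "dimensional", "shape", and "body" follow the slide; the remaining probabilities, the background model, and the smoothing constant are assumptions.

```python
# Zero-order relevance (sketch): score(l, theta) = log p(l|theta) - log p(l),
# with the label's words treated as independent. Background probabilities
# and the smoothing eps are assumed for illustration.

import math

theta = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.15,
         "birch": 0.05, "shape": 0.01, "body": 0.001}          # p(w | theta)
background = {"clustering": 0.01, "algorithm": 0.02,
              "body": 0.05, "shape": 0.04}                     # p(w) in the collection

def zero_order_score(label, theta, background, eps=1e-6):
    return sum(math.log(theta.get(w, eps)) - math.log(background.get(w, eps))
               for w in label.split())

for label in ("clustering algorithm", "body shape"):
    print(label, round(zero_order_score(label, theta, background), 2))
# "clustering algorithm" scores far higher than "body shape", as intended.
```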
Relevance (Task 2): the First-Order Score
• Intuition: prefer phrases whose context (word distribution) is similar to the topic

  Score(l, θ) = Σ_w p(w|θ) · PMI(w, l | C)

  i.e., the expected pointwise mutual information between the label l and the topic words; a label scores high when its context distribution p(w | l) is close to θ (small D(θ || p(w | l))).
[Figure: the clustering topic θ (clustering, dimension, partition, algorithm, ..., hash) compared with the context distributions p(w | "clustering algorithm") and p(w | "hash join"), the latter estimated from contexts such as "key ... hash join ... hash table ... search ...". Good label (l1): "clustering algorithm"; bad label (l2): "hash join".]
(a sketch of this score follows below)
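A rough sketch of the first-order score. Document-level co-occurrence in a toy collection stands in for the real corpus statistics C, and every number here is an illustrative assumption rather than a value from the talk.

```python
# First-order relevance (sketch): E_theta[ PMI(w, l | C) ], the expected PMI
# between the label and each topic word under the topic distribution.
# Document-level co-occurrence in a toy collection stands in for C.

import math

docs = [
    "the clustering algorithm partitions points in each dimension",
    "a hierarchical clustering algorithm for high dimensional data",
    "hash join uses a hash table keyed on the join attribute",
]
theta = {"clustering": 0.25, "dimension": 0.2, "partition": 0.15,
         "algorithm": 0.1, "hash": 0.02}                  # assumed p(w | theta)

def pmi(word, label, docs, eps=1e-6):
    n = len(docs)
    d_w = sum(word in d.split() for d in docs)            # docs containing w
    d_l = sum(label in d for d in docs)                   # docs containing l
    d_wl = sum(word in d.split() and label in d for d in docs)
    return math.log((d_wl / n + eps) / ((d_w / n) * (d_l / n) + eps))

def first_order_score(label, theta, docs):
    return sum(p * pmi(w, label, docs) for w, p in theta.items())

for label in ("clustering algorithm", "hash join"):
    print(label, round(first_order_score(label, theta, docs), 2))
# "clustering algorithm" has a context much closer to theta than "hash join".
```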
Discrimination and Coverage (Tasks 3 & 4)
• Discriminative across topics:
  – High relevance to the target topic, low relevance to the other topics

    Score′(l, θ_i) = Score(l, θ_i) − μ · Score(l, θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_k)

• High coverage inside the topic:
  – Use the MMR (maximal marginal relevance) strategy

    l̂ = argmax_{l ∈ L∖S} [ λ · Score(l, θ) − (1 − λ) · max_{l′ ∈ S} Sim(l′, l) ]

(both steps are sketched below)
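A hedged sketch of both steps. The per-topic relevance scores, the topic names, the word-overlap similarity, and the μ / λ values are all illustrative assumptions; the penalty below averages the score over the other topics rather than using a joint score against them.

```python
# Sketch of tasks 3 & 4: discriminate against other topics, then select a
# diverse label set with MMR. All scores, names, and constants are assumed.

def discriminative_score(label, scores_by_topic, target, mu=0.7):
    """Score'(l, theta_i) = Score(l, theta_i) - mu * (avg score on other topics)."""
    others = [s[label] for t, s in scores_by_topic.items() if t != target]
    return scores_by_topic[target][label] - mu * sum(others) / len(others)

def word_overlap(a, b):
    """Crude label-label similarity: Jaccard overlap of the labels' words."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def mmr_select(scores, k=2, lam=0.7):
    """Greedy MMR: lam * Score(l) - (1 - lam) * max similarity to labels chosen so far."""
    selected = []
    while len(selected) < k and len(selected) < len(scores):
        best = max((l for l in scores if l not in selected),
                   key=lambda l: lam * scores[l]
                   - (1 - lam) * max((word_overlap(l, s) for s in selected),
                                     default=0.0))
        selected.append(best)
    return selected

scores_by_topic = {   # assumed relevance scores for one target and one other topic
    "IR":   {"information retrieval": 0.26, "retrieval models": 0.19,
             "IR models": 0.17, "pseudo feedback": 0.06},
    "TREC": {"information retrieval": 0.20, "retrieval models": 0.05,
             "IR models": 0.04, "pseudo feedback": 0.03},
}
scores = {l: discriminative_score(l, scores_by_topic, "IR")
          for l in scores_by_topic["IR"]}
print(mmr_select(scores, k=2))  # -> ['retrieval models', 'pseudo feedback']
```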
Variations and Applications
• Labeling document clusters
– A document cluster → a unigram language model (sketched after this slide)
– Applicable to any task that produces a unigram language model
• Context sensitive labels
– Label of a topic is sensitive to the context
– An alternative way to approach contextual text mining
tree, prune, root, branch → "tree algorithms" in CS
                          → ? in horticulture
                          → ? in marketing?
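A minimal sketch of the first variation above: estimate a unigram language model from a document cluster (a plain maximum-likelihood estimate here, an assumption) and reuse it wherever a topic θ is expected.

```python
# Sketch: turn a document cluster into a unigram language model so the same
# labeling machinery applies to it as to a topic. Maximum-likelihood estimate
# and the toy cluster are assumptions.

from collections import Counter

cluster = [
    "pruning the decision tree reduces overfitting",
    "each branch of the tree splits on one attribute",
]

def unigram_lm(docs):
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

theta_cluster = unigram_lm(cluster)   # use in place of a topic theta
print(sorted(theta_cluster.items(), key=lambda kv: -kv[1])[:4])
```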
Experiments
• Datasets:
– SIGMOD abstracts; SIGIR abstracts; AP news data
– Candidate labels: significant bigrams; NLP chunks
• Topic models:
– PLSA, LDA
• Evaluation:
– Human annotators compare labels generated by anonymized systems
– The order of systems is randomly perturbed; scores are averaged over all sample topics
Result Summary
• Automatic phrase labels >> top words
• 1-order relevance >> 0-order relevance
• Bigrams > NLP chunks
– Bigrams work better on scientific literature; NLP chunks work better on news
• System labels << human labels
– Scientific literature is the easier task
Results: Sample Topic Labels
[Figure: sample topics with generated labels.
– SIGMOD topic: clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, ... (the most frequent words — the, of, a, and, to, data — all have probability > 0.02); labels shown: "clustering algorithm", "clustering structure" vs. "large data", "data quality", "high data", "data application", ...
– SIGMOD topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01; labels shown: "r tree", "b tree", ... vs. "indexing methods"
– AP news topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007; label "iran contra"]
Results: Context-Sensitive Labeling
[Figure: one topic — sampling, estimation, approximation, histogram, selectivity, histograms, ... — labeled in two contexts:
– Context: Database (SIGMOD proceedings): selectivity estimation; random sampling; approximate answers
– Context: IR (SIGIR proceedings): distributed retrieval; parameter estimation; mixture models]
• Explores the different meanings of a topic in different contexts (content switch)
• An alternative approach to contextual text mining
Summary
• Labeling: a post-processing step for any multinomial topic model
• A probabilistic approach to generating good labels
– understandable, relevant, high coverage, discriminative
• Broadly applicable to mining tasks involving
multinomial word distributions; context-sensitive
• Future work:
– Labeling hierarchical topic models
– Incorporating priors
Thanks!