
Automatic Labeling of
Multinomial Topic Models
Qiaozhu Mei, Xuehua Shen,
ChengXiang Zhai
University of Illinois at Urbana-Champaign
Outline
• Background: statistical topic models
• Labeling a topic model
– Criteria and challenge
• Our approach: a probabilistic framework
• Experiments
• Summary
Statistical Topic Models for Text Mining
[Figure: text collections feed into probabilistic topic modeling, which produces topic models (multinomial word distributions); these in turn support subtopic discovery, topical pattern analysis, summarization, opinion comparison, …]
• Models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], CPLSA [Mei & Zhai 06], Topics over Time [Wang et al. 06], …
• Example topics (word, probability):
– term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …
– web 0.21, search 0.10, link 0.08, graph 0.05, …
Topic Models: Hard to Interpret
• Use top words
– Automatic, but hard to make sense of
– e.g., "term, relevance, weight, feedback, independence" for a topic with term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, …
• Human-generated labels
– Make sense, but cannot scale up
– e.g., "Retrieval Models" for the topic above; but what label fits a topic like insulin, foraging, foragers, collected, grains, loads, collection, nectar, …?
Question: Can we automatically generate understandable labels for topics?
What is a Good Label?
• Example [Mei & Zhai 06]: a topic from SIGIR
– term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
– Suggested label: "Retrieval models"
• A good label should be:
– Semantically close to the topic (relevance): not "iPod Nano"
– Understandable – phrases? (not "じょうほうけんさく", "information retrieval" in Japanese)
– High coverage inside the topic (not just "Pseudo-feedback")
– Discriminative across topics (not the overly broad "Information Retrieval")
Our Method
Input: a topic (e.g., term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …) estimated from a collection (e.g., SIGIR abstracts).
1 Candidate label pool: extract phrases from the collection with an NLP chunker or n-gram statistics, e.g., information retrieval, retrieval model, index structure, relevance feedback, … (a code sketch follows this slide)
2 Relevance score: rank candidates by relevance to the topic, e.g., information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, …
3 Discrimination: penalize labels that are also relevant to other topics (e.g., a filtering/collaborative topic, a trec/evaluation topic); the generic "information retrieval" drops from 0.26 to 0.01, leaving retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, …
4 Coverage: diversify the selected labels; "IR models", redundant with "retrieval models", drops from 0.18 to 0.02, leaving retrieval models 0.20, pseudo feedback 0.09, …
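A minimal sketch of step 1 in Python, assuming the collection is already tokenized; the function name, the PMI test, and the count threshold are illustrative choices, not the paper's exact n-gram significance test:

```python
import math
from collections import Counter

def candidate_labels(docs, min_count=5):
    """Collect frequent, statistically associated bigrams as candidate labels."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue
        # pointwise mutual information of the two words in the collection
        pmi = math.log(c * total / (unigrams[w1] * unigrams[w2]))
        if pmi > 0:
            scored.append((f"{w1} {w2}", pmi))
    return [label for label, _ in sorted(scored, key=lambda x: -x[1])]
```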
Relevance (Task 2): the Zero-Order Score
• Intuition: prefer phrases that cover the topic's top words well
• Example: a latent clustering topic θ with p("clustering"|θ) = 0.4, p("dimensional"|θ) = 0.3, …, p("shape"|θ) = 0.01, …, p("body"|θ) = 0.001
• The good label l1 = "clustering algorithm" beats the bad label l2 = "body shape":
p(clustering algorithm | θ) / p(clustering algorithm) > p(body shape | θ) / p(body shape)
(a code sketch follows this slide)
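The comparison above can be computed with a small sketch of the zero-order score, log p(l|θ) / p(l), treating the label's words as independent; the dictionaries, the smoothing constant, p("algorithm"|θ), and the background probabilities are assumptions for illustration:

```python
import math

def zero_order_score(label, p_topic, p_background, eps=1e-12):
    """Sum over the label's words of log( p(w|theta) / p(w) )."""
    return sum(
        math.log(p_topic.get(w, eps) / p_background.get(w, eps))
        for w in label.split()
    )

# Mirroring the slide's example: "clustering algorithm" should outscore
# "body shape". Topic probabilities follow the slide where given;
# p("algorithm"|theta) and the background model are made up.
p_topic = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.05,
           "shape": 0.01, "body": 0.001}
p_background = {"clustering": 0.01, "dimensional": 0.01, "algorithm": 0.01,
                "shape": 0.01, "body": 0.01}
print(zero_order_score("clustering algorithm", p_topic, p_background))  # high
print(zero_order_score("body shape", p_topic, p_background))            # low
```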
Relevance (Task 2): the First-Order Score
• Intuition: prefer phrases whose context (word distribution) is similar to the topic's
• Example: the context of the good label l1 = "clustering algorithm" (clustering, dimension, partition, rank, algorithm, key, hash, …) resembles the topic θ; the context of l2 = "hash join" (key, code, hash, table, search, join, map, algorithm, …) does not
• Score(l, θ) = −D(θ || l) ≈ Σ_w p(w|θ) · PMI(w, l | C), where PMI is estimated from co-occurrences in a reference collection C (sketched below)
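A sketch of the first-order score under a simplifying assumption: PMI(w, l | C) is estimated from document-level co-occurrence in the context collection C rather than the paper's exact counting scheme:

```python
import math

def first_order_score(label, topic, context_docs, eps=1e-12):
    """topic: dict word -> p(w|theta); context_docs: iterable of token lists.
    Returns sum_w p(w|theta) * PMI(w, label | C)."""
    docs = [set(d) for d in context_docs]
    n = len(docs)
    label_words = set(label.split())
    has_label = [label_words <= d for d in docs]
    p_l = sum(has_label) / n
    score = 0.0
    for w, p_w_theta in topic.items():
        p_w = sum(1 for d in docs if w in d) / n
        p_wl = sum(1 for d, h in zip(docs, has_label) if h and w in d) / n
        pmi = math.log((p_wl + eps) / (p_w * p_l + eps))
        score += p_w_theta * pmi
    return score
```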
Discrimination and Coverage (Tasks 3 & 4)
• Discriminative across topics:
– High relevance to the target topic, low relevance to the other topics
– Score′(l, θ_i) = Score(l, θ_i) − μ · Score(l, θ_1, …, θ_{i-1}, θ_{i+1}, …, θ_k)
• High coverage inside the topic:
– Use the MMR (maximal marginal relevance) strategy
– l̂ = argmax_{l ∈ L \ S} [ λ · Score(l, θ) − (1−λ) · max_{l' ∈ S} Sim(l', l) ]
(both steps are sketched below)
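Both steps can be sketched as follows; μ, λ, k, the use of an average over the other topics, and the externally supplied score and similarity functions are illustrative assumptions:

```python
def discriminative_score(label, i, topics, score_fn, mu=0.7):
    """Score'(l, theta_i) = Score(l, theta_i) - mu * relevance to the other topics."""
    others = [score_fn(label, t) for j, t in enumerate(topics) if j != i]
    penalty = sum(others) / len(others) if others else 0.0
    return score_fn(label, topics[i]) - mu * penalty

def select_labels(candidates, topic, score_fn, sim_fn, k=5, lam=0.8):
    """Greedy MMR: balance relevance against similarity to labels already picked."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(
            pool,
            key=lambda l: lam * score_fn(l, topic)
            - (1 - lam) * max((sim_fn(l, s) for s in selected), default=0.0),
        )
        selected.append(best)
        pool.remove(best)
    return selected
```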
Variations and Applications
• Labeling document clusters
– Document cluster → unigram language model (a sketch follows this slide)
– Applicable to any task that produces a unigram language model
• Context-sensitive labels
– The label of a topic is sensitive to the context
– An alternative way to approach contextual text mining
– tree, prune, root, branch → "tree algorithms" in CS; → ? in horticulture; → ? in marketing?
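A minimal sketch of the document-cluster case: estimate a unigram language model from the cluster and feed it to the same labeling pipeline (maximum-likelihood estimate; smoothing omitted for brevity):

```python
from collections import Counter

def cluster_language_model(docs):
    """docs: iterable of token lists from one cluster -> dict word -> p(w|cluster)."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```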
Experiments
• Datasets:
– SIGMOD abstracts; SIGIR abstracts; AP news data
– Candidate labels: significant bigrams; NLP chunks
• Topic models:
– PLSA, LDA
• Evaluation:
– Human annotators compare labels generated by anonymized systems
– The order of systems is randomly perturbed; scores are averaged over all sample topics
Result Summary
• Automatic phrase labels >> top words
• 1-order relevance >> 0-order relevance
• Bigrams > NLP chunks
– Bigrams work better on scientific literature; NLP chunks work better on news
• System labels << human labels
– Scientific literature is the easier task
Results: Sample Topic Labels
• A SIGMOD topic (function words the, of, a, and, to, and data all have p > 0.02):
– clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, …
– Labels: clustering algorithm; clustering structure; large data; data quality; high data; data application; …
• An AP news topic:
– north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007, …
– Label: iran contra
• A SIGMOD topic:
– tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01, …
– Labels: r tree; b tree; indexing methods; …
Results: Context-Sensitive Labeling
• Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …
– Context: Database (SIGMOD proceedings) → selectivity estimation; random sampling; approximate answers
– Context: IR (SIGIR proceedings) → distributed retrieval; parameter estimation; mixture models
• Explores the different meanings of a topic in different contexts (content switch)
• An alternative approach to contextual text mining
Summary
• Labeling: a post-processing step applicable to any multinomial topic model
• A probabilistic approach to generate good labels
– understandable, relevant, high coverage, discriminative
• Broadly applicable to mining tasks involving
multinomial word distributions; context-sensitive
• Future work:
– Labeling hierarchical topic models
– Incorporating priors
Thanks!
- Please come to our poster tonight (#40)