
CS246
Latent Dirichlet Allocation
LSI

LSI uses SVD to find the best rank-K approximation


The result is difficult to interpret, especially with negative numbers
Q: Can we develop a more interpretable method?
Theory of LDA (Model-based Approach)




Develop a simplified model of how users write a document based on topics.
Fit the model to the existing corpus and “reverse
engineer” the topics used in a document
Q: How do we write a document?
A: (1) Pick the topic(s)
(2) Start writing on the topic(s) with related terms
Two Probability Vectors

For every document d, we assume that the user will first
pick the topics to write about



P(z|d): the probability of picking topic z when the user writes each word in document d

\sum_{i=1}^{T} P(z_i \mid d) = 1

Document-topic vector of d
We also assume that every topic is associated with each term with a certain probability

P(w|z): the probability of picking the term w when the user writes on the topic z

\sum_{j=1}^{W} P(w_j \mid z) = 1

Topic-term vector of z
Probabilistic Topic Model


There exist T topics
The topic-term vector for each topic is set before any document is written


P(wj|zi) is set for every zi and wj
Then for every document d,


The user decides the topics to write on, i.e., P(zi|d)
For each word in d


The user selects a topic zi with probability P(zi|d)
The user selects a word wj with probability P(wj|zi)
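To make this generative process concrete, here is a minimal Python sketch (my own toy example, assuming numpy is available; the two topics, the five-word vocabulary, and all probabilities are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Toy model: T=2 topics over a vocabulary of W=5 words.
vocab = ["river", "stream", "bank", "money", "loan"]
# P(w|z) for each topic z (rows sum to 1), fixed before any document is written.
P_w_given_z = np.array([
    [1/3, 1/3, 1/3, 0.0, 0.0],   # topic 0: "River"
    [0.0, 0.0, 1/3, 1/3, 1/3],   # topic 1: "Money"
])

def write_document(P_z_given_d, n_words=10):
    # Generate one document: for each word, pick a topic z ~ P(z|d),
    # then pick a word w ~ P(w|z).
    words = []
    for _ in range(n_words):
        z = rng.choice(len(P_z_given_d), p=P_z_given_d)
        w = rng.choice(len(vocab), p=P_w_given_z[z])
        words.append(vocab[w])
    return words

print(write_document([0.9, 0.1]))   # a document mostly about "River"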
Probabilistic Document Model
[Figure: two topics generate three documents. Topic 1 (money, loan, bank) and Topic 2 (river, stream, bank) each have a topic-term distribution P(w|z). DOC 1 draws all of its words from Topic 1 (P(z|d) weight 1.0): "money^1 bank^1 loan^1 bank^1 money^1 ...". DOC 2 mixes the two topics with weights 0.5/0.5: "money^1 river^2 bank^1 stream^2 bank^2 ...". DOC 3 draws all of its words from Topic 2 (weight 1.0): "river^2 stream^2 bank^2 stream^2 ...". The superscript on each word marks the topic that generated it.]
Example: Calculating Probability

z1 = {w1:0.8, w2:0.1, w3:0.1}
z2 = {w1:0.1, w2:0.2, w3:0.7}

d’s topics are {z1: 0.9, z2:0.1}
d has three terms {w3^2, w1^1, w2^1}, where the superscript marks the topic assigned to each word.

Q: What is the probability that a user will write such a
document?
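One way to work this out (assuming, as in the earlier figures, that the superscripts mark the topic assigned to each word, so each word contributes P(w|z) · P(z|d), matching the corpus formula on the next slide):

P(d) = [P(w3|z2) P(z2|d)] × [P(w1|z1) P(z1|d)] × [P(w2|z1) P(z1|d)]
     = (0.7 × 0.1) × (0.8 × 0.9) × (0.1 × 0.9)
     = 0.07 × 0.72 × 0.09 ≈ 0.0045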
Corpus Generation Probability




T: # topics
D: # documents
M: # words per document
Probability of generating the corpus C:

P(C) = \prod_{i=1}^{D} \prod_{j=1}^{M} P(w_{i,j} \mid z_{i,j}) \, P(z_{i,j} \mid d_i)
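As a code sketch (my own illustration, not from the slides; it assumes each document is given as a list of (word id, topic id) pairs), the log of this corpus probability for a fixed set of topic assignments is:

import math

def log_corpus_prob(docs, P_w_given_z, P_z_given_d):
    # docs[i] is a list of (word_id, topic_id) pairs for document d_i.
    # P_w_given_z[z][w] = P(w|z); P_z_given_d[i][z] = P(z|d_i).
    total = 0.0
    for i, doc in enumerate(docs):
        for w, z in doc:
            # Each word contributes P(w_ij | z_ij) * P(z_ij | d_i).
            total += math.log(P_w_given_z[z][w] * P_z_given_d[i][z])
    return total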
Generative Model vs Inference (1)
[Figure: the generative direction. The topic-term distributions P(w|z) for Topic 1 (money, loan, bank) and Topic 2 (river, stream, bank), the document-topic weights P(z|d) (1.0 for DOC 1, 0.5/0.5 for DOC 2, 1.0 for DOC 3), and the topic label on every generated word are all known.]
Generative Model vs Inference (2)
[Figure: the inference direction. Only the documents DOC 1-3 are observed; the topic-term distributions, the document-topic weights, and the topic label on every word are unknown (shown as "?") and must be inferred.]
Probabilistic Latent Semantic Indexing (pLSI)

Basic Idea: We pick P(zj|di), P(wk|zj), and zij values to
maximize the corpus generation probability


Maximum-likelihood estimation (MLE)
More discussion later on how to compute the P(zj|di),
P(wk|zj), and zij values that maximize the probability
Problem of pLSI

Q: 1M documents, 1000 topics, 1M words, 1000 words/doc. How much input data? How many variables do we have to estimate? (A rough count is sketched below.)

Q: Too much freedom. How can we avoid the overfitting problem?
A: Add constraints to reduce the degrees of freedom
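A rough count under the numbers in the question (my own back-of-the-envelope sketch, not from the slides):

Input data:  10^6 documents × 10^3 words/doc ≈ 10^9 word occurrences
Parameters:  P(z|d): 10^6 documents × 10^3 topics  = 10^9 values
             P(w|z): 10^3 topics × 10^6 words      = 10^9 values
             z_ij:   one topic per word occurrence = 10^9 values

So pLSI fits on the order of 3 × 10^9 free values to about 10^9 observed word occurrences, which is why overfitting is a serious concern.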

Latent Dirichlet Allocation (LDA)

When term probabilities are selected for each topic

Topic-term probability vector, (P(w1|zj), …, P(wW|zj)), is sampled randomly from a Dirichlet distribution

Dir(p_1, \ldots, p_W; \beta_1, \ldots, \beta_W) = \frac{\Gamma(\sum_j \beta_j)}{\prod_j \Gamma(\beta_j)} \prod_{j=1}^{W} p_j^{\beta_j - 1}
When users select topics for a document

Document-topic probability vector, (P(z1|d), …, P(zT|d)), is sampled randomly from a Dirichlet distribution

Dir(p_1, \ldots, p_T; \alpha_1, \ldots, \alpha_T) = \frac{\Gamma(\sum_j \alpha_j)}{\prod_j \Gamma(\alpha_j)} \prod_{j=1}^{T} p_j^{\alpha_j - 1}
What is Dirichlet Distribution?

Multinomial distribution

Given the probability pi of each event ei, what is the probability that each event ei occurs αi times after n trials?

f(\alpha_1, \ldots, \alpha_k; n, p_1, \ldots, p_k) = \frac{\Gamma(\sum_i \alpha_i + 1)}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} \, p_1^{\alpha_1} \cdots p_k^{\alpha_k}

We assume the pi's; the distribution assigns a probability to the αi's.
Dirichlet distribution

“Inverse” of the multinomial distribution: we assume the αi's; the distribution assigns a probability to the pi's.
(i ( i  1)) 
f ( p1 ,..., pk ;1 ,...,  k )  k
p1 ... pk
 ( i  1)
1
i 1
k
Dirichlet Distribution
f(p_1, \ldots, p_k; \alpha_1, \ldots, \alpha_k) = \frac{\Gamma(\sum_i (\alpha_i + 1))}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} \, p_1^{\alpha_1} \cdots p_k^{\alpha_k}

Q: Given α1, α2, …, αk, what are the most likely p1, p2, …, pk values?
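A sketch of the answer, maximizing the formula above under the constraint \sum_i p_i = 1 with a Lagrange multiplier:

maximize  \sum_i \alpha_i \log p_i + \lambda (1 - \sum_i p_i)
\partial / \partial p_i:  \alpha_i / p_i - \lambda = 0  ⇒  p_i = \alpha_i / \lambda
normalizing:  p_i = \alpha_i / \sum_j \alpha_j

So the most likely vector assigns probability to each component in proportion to its α. (With the standard formula introduced on the "Minor Correction" slide below, the same derivation gives p_i = (\alpha_i - 1) / (\sum_j \alpha_j - k), valid when every \alpha_i > 1.)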
Normalized Probability Vector and Simplex



Remember that \sum_{i=1}^{T} P(z_i \mid d) = 1 and \sum_{j=1}^{W} P(w_j \mid z) = 1
When (p1, …, pn) satisfies p1 + … + pn = 1, they are on a "simplex plane"
[Figure: (p1, p2, p3) and their 2-simplex plane]
Effect of α values
[Figure: Dirichlet densities over the 2-simplex (axes p1, p2, p3) for (α1, α2, α3) = (1, 1, 1) and (1, 2, 1)]
Effect of α values
[Figure: Dirichlet densities over the 2-simplex for (α1, α2, α3) = (1, 2, 1) and (1, 2, 3)]
Effect of α values
[Figure: Dirichlet densities over the 2-simplex for (α1, α2, α3) = (1, 1, 1) and (5, 5, 5)]
Minor Correction
f(p_1, \ldots, p_k; \alpha_1, \ldots, \alpha_k) = \frac{\Gamma(\sum_i (\alpha_i + 1))}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} \, p_1^{\alpha_1} \cdots p_k^{\alpha_k}
is not “standard” Dirichlet distribution.
The “standard” Dirichlet Distribution formula:
f(p_1, \ldots, p_k; \alpha_1, \ldots, \alpha_k) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, p_1^{\alpha_1 - 1} \cdots p_k^{\alpha_k - 1}


Used non-standard to make the connection to multinomial
distribution clear
From now on, we use the standard formula
Back to LDA Document Generation Model

For each topic z


Pick the word probability vector P(wj|z)'s by taking a random sample from Dir(β1, …, βW)
For every document d


The user decides its topic vector P(zi|d)'s by taking a random sample from Dir(α1, …, αT)
For each word in d



The user selects a topic z with probability P(z|d)
The user selects a word w with probability P(w|z)
Once all is said and done, we have



P(wj|z): topic-term vector for each topic
P(zi|d): document-topic vector for each document
Topic assignment to every word in each document
Symmetric Dirichlet Distribution


In principle, we need to assume two vectors, (α1, …, αT) and (β1, …, βW), as input parameters.
In practice, we often assume all αi's are equal to a single α and all βi's to a single β

Use two scalar values α and β, not two vectors.
Symmetric Dirichlet distribution
Q: What is the implication of this assumption?
Effect of α value on Symmetric Dirichlet
[Figure: samples from the symmetric Dirichlet over the 2-simplex for α = 2 and α = 6]
Q: What does it mean? How will the sampled document-topic vectors change as α grows?
Common choice: α = 50/T, β ≈ 200/W
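A small numpy sketch (my own illustration; the topic count and α values are arbitrary) of how the scalar α changes the sampled document-topic vectors:

import numpy as np

rng = np.random.default_rng(1)
T = 10  # number of topics

for alpha in [0.1, 1.0, 50 / T]:
    # One sample from the symmetric Dirichlet Dir(alpha, ..., alpha).
    p = rng.dirichlet([alpha] * T)
    print(f"alpha={alpha:<4} sample={np.round(p, 2)}")

# Small alpha: samples concentrate on a few topics (sparse document-topic vectors).
# Large alpha: samples spread out toward the uniform vector (1/T, ..., 1/T).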
Plate Notation

[Plate diagram: α → P(z|d) → z → w ← P(w|z) ← β. The inner plate over each word (z, w) repeats M times per document, the document plate repeats N times, and the P(w|z) plate repeats T times, once per topic.]
LDA as Topic Inference

Given a corpus
d1: w11, w12, …, w1m
…
dN: wN1, wN2, …, wNm



Find P(z|d), P(w|z), zij that are most “consistent” with the
given corpus
Q: How can we compute such P(z|d), P(w|z), zij?
The best method so far is to use the Monte Carlo method together with Gibbs sampling
Monte Carlo Method (1)


Class of methods that compute a number through
repeated random sampling of certain event(s).
Q: How can we compute π?
Monte Carlo Method (2)
1. Define the domain of possible events
2. Generate the events randomly from the domain using a certain probability distribution
3. Perform a deterministic computation using the events
4. Aggregate the results of the individual computations into the final result
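For instance, the earlier question about computing π maps onto these four steps as follows (a standard textbook sketch, not specific to these slides): the domain is the unit square, points are sampled uniformly at random, the deterministic computation checks whether each point lies inside the quarter circle, and the aggregated fraction estimates π/4.

import random

def estimate_pi(n_samples=1_000_000):
    # 1. Domain: the unit square [0, 1) x [0, 1).
    # 2. Sample points uniformly at random from the square.
    # 3. Deterministic computation: is the point inside the quarter circle?
    # 4. Aggregate: the fraction inside approximates pi / 4.
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

print(estimate_pi())   # roughly 3.14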
Q: How can we take random samples from a particular
distribution?
Gibbs Sampling

Q: How can we take a random sample x from the distribution
f(x)?

Q: How can we take a random sample (x, y) from the
distribution f(x, y)?

Gibbs sampling


Given the current sample (x1, …, xn), pick an axis xi and take a random sample of xi conditioned on all the other values (x1, …, xi-1, xi+1, …, xn)
In practice, we iterate over the xi's sequentially
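As a toy illustration of the idea (my own example, not from the slides): Gibbs sampling from a standard bivariate normal with correlation ρ, where each conditional f(x|y) and f(y|x) is a one-dimensional normal that is easy to sample directly.

import random

def gibbs_bivariate_normal(rho=0.8, n_samples=10_000):
    # Sample (x, y) from a standard bivariate normal with correlation rho
    # by alternately resampling one coordinate given the other.
    samples, x, y = [], 0.0, 0.0
    sd = (1 - rho ** 2) ** 0.5            # conditional standard deviation
    for _ in range(n_samples):
        x = random.gauss(rho * y, sd)     # x | y ~ N(rho * y, 1 - rho^2)
        y = random.gauss(rho * x, sd)     # y | x ~ N(rho * x, 1 - rho^2)
        samples.append((x, y))
    return samples

pts = gibbs_bivariate_normal()   # early samples can be discarded as burn-in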
Markov-Chain Monte-Carlo Method (MCMC)

Gibbs sampling is in the class of Markov Chain sampling


Next sample depends only on the current sample
Markov-Chain Monte-Carlo Method

Generate random events using Markov-Chain sampling and apply the Monte Carlo method to compute the result
Applying MCMC to LDA



Let us apply the Monte Carlo method to estimate the LDA parameters.
Q: How can we map the LDA inference problem to
random events?
We first focus on identifying topics {zij} for each word
{wij}.

Event: Assignment of the topics {zij} to wij’s. The assignment
should be done according to P({zij}|C)

Q: How to sample according to P({zij}|C)?
Q: Can we use Gibbs sampling? How will it work?

Q: What is P(zij = t | z-ij, C)?

P(z_{ij} = t \mid z_{-ij}, C) = \frac{(n_{w_{ij},t} + \beta)\,(n_{d_i,t} + \alpha)}{\left[\sum_{w=1}^{W} (n_{w,t} + \beta)\right] \left[\sum_{k=1}^{T} (n_{d_i,k} + \alpha)\right]}

n_{w,t}: how many times the word w has been assigned to the topic t
n_{d,t}: how many words in the document d have been assigned to the topic t
(Both counts exclude the current assignment of w_ij itself, i.e., they are computed from z_{-ij}.)
Q: What is the meaning of each term?
LDA with Gibbs Sampling

For each word wij
  Assign wij to a topic t with probability

  P(z_{ij} = t \mid z_{-ij}, C) = \frac{(n_{w_{ij},t} + \beta)\,(n_{d_i,t} + \alpha)}{\left[\sum_{w=1}^{W} (n_{w,t} + \beta)\right] \left[\sum_{k=1}^{T} (n_{d_i,k} + \alpha)\right]}

  For the prior topic t of wij, decrease n_{w,t} and n_{d,t} by 1
  For the new topic t of wij, increase n_{w,t} and n_{d,t} by 1
Repeat the process many times
  At least hundreds of times
Once the process is over, we have
  zij for every wij
  n_{w,t} and n_{d,t}

  P(w \mid t) = \frac{n_{w,t} + \beta}{\sum_{i=1}^{W} (n_{w_i,t} + \beta)}

  P(t \mid d) = \frac{n_{d,t} + \alpha}{\sum_{k=1}^{T} (n_{d,k} + \alpha)}
Result of LDA (Latent Dirichlet Allocation)

TASA corpus


37,000 text passages from educational materials collected by
Touchstone Applied Science Associates
Set T=300 (300 topics)
Inferred Topics
Word Topic Assignments
LDA Algorithm: Simulation

Two topics: River, Money
Five words: “river”, “stream”, “bank”, “money”, “loan”
River: river 1/3, stream 1/3, bank 1/3
Money: bank 1/3, money 1/3, loan 1/3
Generate 16 documents by randomly mixing the two
topics and using the LDA model
Generated Documents and Initial Topic
Assignment before Inference
The first 6 and the last 3 documents are purely from one topic; the others are mixtures.
White dot: “River”. Black dot: “Money”
Topic Assignment After LDA Inference
The first 6 and the last 3 documents are purely from one topic; the others are mixtures.
After 64 iterations
Inferred Topic-Term Matrix

Model parameter
  River: river 0.33, stream 0.33, bank 0.33
  Money: bank 0.33, money 0.33, loan 0.33
Estimated parameter
  River: river 0.25, stream 0.40, bank 0.35
  Money: bank 0.32, money 0.29, loan 0.39
Not perfect, but very close especially given the small data
size
SVD vs LDA
Both perform the following decomposition
[Diagram: the (doc × term) matrix X is decomposed into a (doc × topic) matrix times a (topic × term) matrix]
SVD views this as matrix approximation
LDA views this as probabilistic inference based on a generative
model

Each entry corresponds to “probability”: better interpretability
LDA as Soft Classification


Soft vs hard clustering/classification
After LDA, every document is assigned to a small number
of topics with some weights


Documents are not assigned exclusively to a topic
Soft clustering
Summary

Probabilistic Topic Model




Generative model of documents
pLSI and overfitting
LDA, MCMC, and probabilistic interpretation
Statistical parameter estimation



Multinomial distribution and Dirichlet distribution
Monte Carlo method
Gibbs sampling

Markov-Chain class of sampling