CS246
Latent Dirichlet Allocation
LSI
LSI uses SVD to find the best rank-K approximation
The result is difficult to interpret, especially because of the
negative numbers
Q: Can we develop a more interpretable method?
Theory of LDA (Model-based Approach)
Develop a simplified model of how users write a
document based on topics.
Fit the model to the existing corpus and “reverse
engineer” the topics used in a document
Q: How do we write a document?
A: (1) Pick the topic(s)
(2) Start writing on the topic(s) with related terms
Two Probability Vectors
For every document d, we assume that the user will first
pick the topics to write about
P(z|d): the probability of picking topic z when the user writes each
word in document d
$\sum_{i=1}^{T} P(z_i \mid d) = 1$
Document-topic vector of d
We also assume that every topic is associated with each
term with a certain probability
P(w|z): the probability of picking the term w when the user
writes on the topic z
$\sum_{j=1}^{W} P(w_j \mid z) = 1$
Topic-term vector of z
Probabilistic Topic Model
There exists T number of topics
The topic-term vector for each topic is set before any
document is written
P(wj|zi) is set for every zi and wj
Then for every document d,
The user decides the topics to write on, i.e., P(zi|d)
For each word in d
The user selects a topic zi with probability P(zi|d)
The user selects a word wj with probability P(wj|zi)
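As a concrete illustration of this generative process, here is a minimal Python sketch (not from the slides); the topic names, term names, and probability values are made up for the example.

```python
import random

random.seed(0)

# Hypothetical model parameters (illustration only).
P_w_given_z = {                      # P(w|z): topic-term vectors
    "money": {"money": 0.4, "bank": 0.4, "loan": 0.2},
    "river": {"river": 0.4, "stream": 0.3, "bank": 0.3},
}
P_z_given_d = {"money": 0.5, "river": 0.5}   # P(z|d) for one document d

def write_document(n_words):
    """Generate one document word by word, as described above."""
    words = []
    for _ in range(n_words):
        # The user first selects a topic z with probability P(z|d) ...
        z = random.choices(list(P_z_given_d), weights=P_z_given_d.values())[0]
        # ... then selects a word w with probability P(w|z).
        w = random.choices(list(P_w_given_z[z]), weights=P_w_given_z[z].values())[0]
        words.append(w)
    return words

print(write_document(10))
```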
Probabilistic Document Model
[Figure: two topics and three documents generated from them. Topic 1 has a
topic-term vector P(w|z) over {money, bank, loan}; Topic 2 has one over
{river, stream, bank}. DOC 1 is written entirely from Topic 1 (P(z1|d) = 1.0),
e.g., "money1 bank1 loan1 bank1 money1 ..."; DOC 2 mixes the two topics equally
(P(z1|d) = P(z2|d) = 0.5), e.g., "money1 river2 bank1 stream2 bank2 ...";
DOC 3 is written entirely from Topic 2 (P(z2|d) = 1.0), e.g.,
"river2 stream2 river2 bank2 stream2 ...". The superscript on each word marks
the topic it was drawn from.]
Example: Calculating Probability
z1 = {w1:0.8, w2:0.1, w3:0.1}
z2 = {w1:0.1, w2:0.2, w3:0.7}
d’s topics are {z1: 0.9, z2:0.1}
d has three terms {w3^2, w1^1, w2^1}, where the superscript marks the topic each term was drawn from.
Q: What is the probability that a user will write such a
document?
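A worked answer (not given on the slide), assuming the superscript on each term indicates the topic it was drawn from: each word contributes P(z|d)·P(w|z), so the probability is (0.1·0.7)·(0.9·0.8)·(0.9·0.1) ≈ 0.0045. A quick check in Python:

```python
# P(w|z) for the two topics and P(z|d) for document d, as given on the slide.
P_w_given_z = {"z1": {"w1": 0.8, "w2": 0.1, "w3": 0.1},
               "z2": {"w1": 0.1, "w2": 0.2, "w3": 0.7}}
P_z_given_d = {"z1": 0.9, "z2": 0.1}

# The document: each word paired with the topic it was drawn from.
doc = [("w3", "z2"), ("w1", "z1"), ("w2", "z1")]

prob = 1.0
for w, z in doc:
    prob *= P_z_given_d[z] * P_w_given_z[z][w]

print(prob)   # (0.1*0.7) * (0.9*0.8) * (0.9*0.1) = 0.004536
```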
Corpus Generation Probability
T: # topics
D: # documents
M: # words per document
Probability of generating the corpus C
$P(C) = \prod_{i=1}^{D} \prod_{j=1}^{M} P(w_{i,j} \mid z_{i,j}) \, P(z_{i,j} \mid d_i)$
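Here w_{i,j} is the j-th word of document d_i and z_{i,j} is the topic it was drawn from. A minimal sketch of evaluating this probability (in log space, to avoid underflow) under hypothetical parameters and a tiny made-up corpus:

```python
import math

# Hypothetical parameters and a tiny "corpus" (illustration only).
P_w_given_z = {"z1": {"w1": 0.8, "w2": 0.1, "w3": 0.1},
               "z2": {"w1": 0.1, "w2": 0.2, "w3": 0.7}}
P_z_given_d = {"d1": {"z1": 0.9, "z2": 0.1},
               "d2": {"z1": 0.2, "z2": 0.8}}
# Each document: list of (word, topic assignment) pairs.
corpus = {"d1": [("w1", "z1"), ("w2", "z1"), ("w3", "z2")],
          "d2": [("w3", "z2"), ("w3", "z2"), ("w1", "z1")]}

log_P_C = 0.0
for d, words in corpus.items():
    for w, z in words:
        # Each word contributes P(w_ij | z_ij) * P(z_ij | d_i) to the product.
        log_P_C += math.log(P_w_given_z[z][w]) + math.log(P_z_given_d[d][z])

print("log P(C) =", log_P_C, " P(C) =", math.exp(log_P_C))
```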
Generative Model vs Inference (1)
[Figure: the same generative-model picture as on the earlier slide, with all
P(w|z) and P(z|d) values and every word's topic assignment shown.]
Generative Model vs Inference (2)
[Figure: the same three documents, but every P(w|z) and P(z|d) value and every
word's topic assignment is now a "?". Given only the documents, inference must
recover these hidden values.]
Probabilistic Latent Semantic Index (pLSI)
Basic Idea: We pick P(zj|di), P(wk|zj), and zij values to
maximize the corpus generation probability
Maximum-likelihood estimation (MLE)
More discussion later on how to compute the P(zj|di),
P(wk|zj), and zij values that maximize the probability
Problem of pLSI
Q: 1M documents, 1,000 topics, 1M distinct words, 1,000
words/doc. How much input data? How many variables do
we have to estimate? (a rough count is sketched below)
Q: Too much freedom. How can we avoid the overfitting
problem?
A: Add constraints to reduce the degrees of freedom
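A rough count for the numbers above (my arithmetic, not from the slides): the input is about 10^9 word occurrences, while pLSI must estimate roughly 10^9 document-topic probabilities, 10^9 topic-term probabilities, and 10^9 topic assignments, i.e., about three times more free parameters than data points:

```python
docs, topics, vocab, words_per_doc = 10**6, 10**3, 10**6, 10**3

input_size = docs * words_per_doc          # observed word occurrences: 1e9
p_z_given_d = docs * topics                # P(z|d) values: 1e9
p_w_given_z = topics * vocab               # P(w|z) values: 1e9
z_assignments = docs * words_per_doc       # one z_ij per word occurrence: 1e9

print(input_size, p_z_given_d + p_w_given_z + z_assignments)
```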
Latent Dirichlet Allocation (LDA)
When term probabilities are selected for each topic
Topic-term probability vector, (P(w1|zj), …, P(wW|zj)), is
sampled randomly from a Dirichlet distribution
$\mathrm{Dir}(p_1, \ldots, p_W; \beta_1, \ldots, \beta_W) = \frac{\Gamma\left(\sum_{j} \beta_j\right)}{\prod_{j} \Gamma(\beta_j)} \prod_{j=1}^{W} p_j^{\beta_j - 1}$
When users select topics for a document
Document-topic probability vector, (P(z1|d), …, P(zT|d)), is
sampled randomly from a Dirichlet distribution
$\mathrm{Dir}(p_1, \ldots, p_T; \alpha_1, \ldots, \alpha_T) = \frac{\Gamma\left(\sum_{j} \alpha_j\right)}{\prod_{j} \Gamma(\alpha_j)} \prod_{j=1}^{T} p_j^{\alpha_j - 1}$
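For intuition, here is a minimal numpy sketch of these two sampling steps (the sizes T, W and the α, β values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

T, W = 3, 5              # number of topics and vocabulary size (illustration)
beta = np.full(W, 0.1)   # (beta_1, ..., beta_W) for the topic-term Dirichlet
alpha = np.full(T, 1.0)  # (alpha_1, ..., alpha_T) for the document-topic Dirichlet

# One topic-term vector (P(w_1|z), ..., P(w_W|z)) per topic.
topic_term = rng.dirichlet(beta, size=T)
# One document-topic vector (P(z_1|d), ..., P(z_T|d)) for a document d.
doc_topic = rng.dirichlet(alpha)

print(topic_term.sum(axis=1))     # each row sums to 1
print(doc_topic, doc_topic.sum())
```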
What is Dirichlet Distribution?
Multinomial distribution
Given the probability pi of each event ei, what is the probability
that each event ei occurs αi times after n trials?
$f(\alpha_1, \ldots, \alpha_k; n, p_1, \ldots, p_k) = \frac{\Gamma\left(\sum_i \alpha_i + 1\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} \, p_1^{\alpha_1} \cdots p_k^{\alpha_k}$
We assume the pi's. The distribution assigns the αi's their probability.
Dirichlet distribution
“Inverse” of the multinomial distribution: We assume the αi's. The
distribution assigns the pi's their probability.
$f(p_1, \ldots, p_k; \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_i (\alpha_i + 1)\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} \, p_1^{\alpha_1} \cdots p_k^{\alpha_k}$
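To see the "inverse" relationship numerically, here is a small sketch evaluating both formulas above with math.gamma (the event counts and probabilities are made up):

```python
from math import gamma, prod

alphas = [3, 1, 2]        # observed counts of events e1, e2, e3 (n = 6)
ps = [0.5, 0.2, 0.3]      # event probabilities

# Multinomial: probability of these counts, given fixed p_i's.
multinomial = (gamma(sum(alphas) + 1) / prod(gamma(a + 1) for a in alphas)
               * prod(p**a for p, a in zip(ps, alphas)))

# (Non-standard) Dirichlet: density of these p_i's, given fixed alpha_i's.
dirichlet = (gamma(sum(a + 1 for a in alphas)) / prod(gamma(a + 1) for a in alphas)
             * prod(p**a for p, a in zip(ps, alphas)))

print(multinomial)   # == 6!/(3!1!2!) * 0.5^3 * 0.2 * 0.3^2 = 0.135
print(dirichlet)     # same p_i^alpha_i core, different normalization
```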
Dirichlet Distribution
$f(p_1, \ldots, p_k; \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_i (\alpha_i + 1)\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} \, p_1^{\alpha_1} \cdots p_k^{\alpha_k}$
Q: Given α1, α2, …, αk, what are the most likely p1, p2, …, pk
values?
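One way to answer (not spelled out on the slide): maximize the log of the non-standard density above, $\sum_i \alpha_i \log p_i$, subject to $\sum_i p_i = 1$. With a Lagrange multiplier $\lambda$:

$\frac{\partial}{\partial p_i}\left[\sum_i \alpha_i \log p_i + \lambda \Big(1 - \sum_i p_i\Big)\right] = \frac{\alpha_i}{p_i} - \lambda = 0 \;\Rightarrow\; p_i = \frac{\alpha_i}{\lambda} = \frac{\alpha_i}{\sum_j \alpha_j}$

So the most likely probability vector is proportional to the αi's. (Under the standard form introduced later, the mode is $p_i = (\alpha_i - 1)/\left(\sum_j \alpha_j - k\right)$ when all $\alpha_i > 1$.)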
Normalized Probability Vector and Simplex
Remember that
$\sum_{i=1}^{T} P(z_i \mid d) = 1$ and $\sum_{j=1}^{W} P(w_j \mid z) = 1$
When (p1, …, pn) satisfies p1 + … + pn = 1, they are on
a “simplex plane”
(p1, p2, p3) and their 2-simplex plane
Effect of α values
[Figure: Dirichlet density over the (p1, p2, p3) simplex for
(α1, α2, α3) = (1, 1, 1) vs. (α1, α2, α3) = (1, 2, 1).]
Effect of α values
[Figure: Dirichlet density over the (p1, p2, p3) simplex for
(α1, α2, α3) = (1, 2, 1) vs. (α1, α2, α3) = (1, 2, 3).]
Effect of α values
[Figures (two slides): Dirichlet density over the (p1, p2, p3) simplex for
(α1, α2, α3) = (1, 1, 1) vs. (α1, α2, α3) = (5, 5, 5).]
Minor Correction
$f(p_1, \ldots, p_k; \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_i (\alpha_i + 1)\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} \, p_1^{\alpha_1} \cdots p_k^{\alpha_k}$
is not the "standard" Dirichlet distribution.
The “standard” Dirichlet Distribution formula:
$f(p_1, \ldots, p_k; \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, p_1^{\alpha_1 - 1} \cdots p_k^{\alpha_k - 1}$
We used the non-standard form to make the connection to the multinomial
distribution clear
From now on, we use the standard formula
Back to LDA Document Generation Model
For each topic z
Pick the word probability vector P(wj|z)’s by taking a random
sample from Dir(β1, …, βW)
For every document d
The user decides its topic vector P(zi|d)’s by taking a random
sample from Dir(α1, …, αT)
For each word in d
The user selects a topic z with probability P(z|d)
The user selects a word w with probability P(w|z)
Once all is said and done, we have
P(wj|z): topic-term vector for each topic
P(zi|d): document-topic vector for each document
Topic assignment to every word in each document
Symmetric Dirichlet Distribution
In principle, we need to assume two vectors, (α1, …, αT)
and (β1, …, βW), as input parameters.
In practice, we often assume all αi's are equal to α and all
βi's are equal to β
Use two scalar values α and β, not two vectors.
Symmetric Dirichlet distribution
Q: What is the implication of this assumption?
Effect of α value on Symmetric Dirichlet
[Figure: symmetric Dirichlet density over the (p1, p2, p3) simplex for α = 2 vs. α = 6.]
Q: What does it mean? How will the sampled document
topic vectors change as α grows?
Common choice: α = 50/T, β = 200/W
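A small numpy sketch (the number of topics and the α values are illustrative) of how sampled document-topic vectors change as α grows: small α gives sparse vectors concentrated on a few topics, large α gives near-uniform vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5  # number of topics (illustration)

for alpha in (0.1, 1.0, 10.0):
    # Symmetric Dirichlet: every component uses the same scalar alpha.
    samples = rng.dirichlet(np.full(T, alpha), size=3)
    print(f"alpha = {alpha}:")
    print(samples.round(2))   # rows are sampled P(z|d) vectors
```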
Plate Notation
[Figure: plate-notation diagram of the model. A plate repeated T times holds the
topic-term distributions P(w|z); an outer plate repeated N times (one per
document) holds the document-topic distribution P(z|d); its inner plate repeated
M times (one per word) holds the topic variable z and the observed word w.]
LDA as Topic Inference
Given a corpus
d1: w11, w12, …, w1m
…
dN: wN1, wN2, …, wNm
Find P(z|d), P(w|z), zij that are most “consistent” with the
given corpus
Q: How can we compute such P(z|d), P(w|z), zij?
The best method so far is to use the Monte Carlo method
together with Gibbs sampling
Monte Carlo Method (1)
Class of methods that compute a number through
repeated random sampling of certain event(s).
Q: How can we compute π?
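One classic answer, as a sketch: sample points uniformly in the unit square and count the fraction that falls inside the quarter circle; that fraction approaches π/4.

```python
import random

random.seed(0)
n = 1_000_000
inside = 0
for _ in range(n):
    x, y = random.random(), random.random()   # a random point in the unit square
    if x * x + y * y <= 1.0:                  # falls inside the quarter circle
        inside += 1

print(4 * inside / n)   # approaches pi as n grows
```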
Monte Carlo Method (2)
1. Define the domain of possible events
2. Generate the events randomly from the domain using a
certain probability distribution
3. Perform a deterministic computation using the events
4. Aggregate the results of the individual computation into
the final result
Q: How can we take random samples from a particular
distribution?
Gibbs Sampling
Q: How can we take a random sample x from the distribution
f(x)?
Q: How can we take a random sample (x, y) from the
distribution f(x, y)?
Gibbs sampling
Given the current sample (x1, …, xn), pick an axis xi, and take a random
sample of the xi value assuming all the other xj (j ≠ i) values are fixed
In practice, we iterate over the xi's sequentially
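As an illustration of the idea (not from the slides), here is a minimal Gibbs sampler for a bivariate normal with correlation ρ, where each conditional f(x|y) and f(y|x) is itself a 1-D normal:

```python
import random

random.seed(0)
rho = 0.8                      # correlation of the target bivariate normal (illustration)
sd = (1 - rho * rho) ** 0.5

x, y = 0.0, 0.0
samples = []
for _ in range(10_000):
    # Resample each coordinate from its conditional, holding the other fixed:
    x = random.gauss(rho * y, sd)   # x | y ~ N(rho*y, 1 - rho^2)
    y = random.gauss(rho * x, sd)   # y | x ~ N(rho*x, 1 - rho^2)
    samples.append((x, y))

# The empirical correlation of the samples should be close to rho.
n = len(samples)
mx = sum(s[0] for s in samples) / n
my = sum(s[1] for s in samples) / n
cov = sum((s[0] - mx) * (s[1] - my) for s in samples) / n
vx = sum((s[0] - mx) ** 2 for s in samples) / n
vy = sum((s[1] - my) ** 2 for s in samples) / n
print(cov / (vx * vy) ** 0.5)
```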
Markov-Chain Monte-Carlo Method (MCMC)
Gibbs sampling is in the class of Markov Chain sampling
Next sample depends only on the current sample
Markov-Chain Monte-Carlo Method
Generate random events using Markov-Chain sampling and
apply Monte-Carlo method to compute the result
Applying MCMC to LDA
Let us apply the Monte Carlo method to estimate the LDA
parameters.
Q: How can we map the LDA inference problem to
random events?
We first focus on identifying topics {zij} for each word
{wij}.
Event: Assignment of the topics {zij} to wij’s. The assignment
should be done according to P({zij}|C)
Q: How to sample according to P({zij}|C)?
Q: Can we use Gibbs sampling? How will it work?
Q: What is P(zij|{z-ij},C)?
$P(z_{ij} = t \mid z_{-ij}, C) \propto \frac{n_{w_{ij},t} + \beta}{\sum_{w=1}^{W} (n_{wt} + \beta)} \cdot \frac{n_{d_i,t} + \alpha}{\sum_{k=1}^{T} (n_{d_i,t_k} + \alpha)}$
nwt: how many times the word w has been assigned to the
topic t
ndt: how many words in the document d have been
assigned to the topic t
Q: What is the meaning of each term?
LDA with Gibbs Sampling
For each word wij
Assign wij to topic t with probability
$P(z_{ij} = t \mid z_{-ij}, C) \propto \frac{n_{w_{ij},t} + \beta}{\sum_{w=1}^{W} (n_{wt} + \beta)} \cdot \frac{n_{d_i,t} + \alpha}{\sum_{k=1}^{T} (n_{d_i,t_k} + \alpha)}$
For the previously assigned topic t' of wij, decrease nwt' and ndt' by 1
For the new topic t of wij, increase nwt and ndt by 1
Repeat the process many times
At least hundreds of times
Once the process is over, we have
zij for every wij
nwt and ndt
$P(w \mid t) = \frac{n_{wt}}{\sum_{i=1}^{W} n_{w_i t}}$
$P(t \mid d) = \frac{n_{dt}}{\sum_{k=1}^{T} n_{d t_k}}$
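Putting the pieces together, here is a compact collapsed-Gibbs sketch of the procedure described above. The toy corpus and the α, β values are made up; the counts for the current word are removed before sampling (per the z_-ij conditioning), and the final estimates are smoothed with β and α purely to avoid division by zero.

```python
import random

random.seed(0)

# Toy corpus (hypothetical): each document is a list of word ids.
docs = [[0, 1, 2, 0, 1], [3, 4, 2, 3, 4], [0, 1, 2, 3, 4]]
W, T = 5, 2             # vocabulary size and number of topics
alpha, beta = 0.5, 0.1  # symmetric Dirichlet hyperparameters (illustration)

# Random initial topic assignment z_ij for every word, plus the counts n_wt, n_dt.
z = [[random.randrange(T) for _ in doc] for doc in docs]
n_wt = [[0] * T for _ in range(W)]
n_dt = [[0] * T for _ in range(len(docs))]
for i, doc in enumerate(docs):
    for j, w in enumerate(doc):
        n_wt[w][z[i][j]] += 1
        n_dt[i][z[i][j]] += 1

for _ in range(200):                       # repeat the sweep many times
    for i, doc in enumerate(docs):
        for j, w in enumerate(doc):
            t_old = z[i][j]
            n_wt[w][t_old] -= 1            # remove this word's current assignment
            n_dt[i][t_old] -= 1
            # P(z_ij = t | z_-ij, C) for every topic t (unnormalized).
            weights = [
                (n_wt[w][t] + beta) / (sum(n_wt[v][t] for v in range(W)) + W * beta)
                * (n_dt[i][t] + alpha)
                for t in range(T)
            ]
            t_new = random.choices(range(T), weights=weights)[0]
            z[i][j] = t_new
            n_wt[w][t_new] += 1            # add the new assignment back
            n_dt[i][t_new] += 1

# Estimates from the final counts (smoothed to avoid zero divisions).
P_w_given_t = [[(n_wt[w][t] + beta) / (sum(n_wt[v][t] for v in range(W)) + W * beta)
                for w in range(W)] for t in range(T)]
P_t_given_d = [[(n_dt[d][t] + alpha) / (sum(n_dt[d]) + T * alpha)
                for t in range(T)] for d in range(len(docs))]
print(P_w_given_t)
print(P_t_given_d)
```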
Result of LDA (Latent Dirichlet Allocation)
TASA corpus
37,000 text passages from educational materials collected by
Touchstone Applied Science Associates
Set T=300 (300 topics)
Inferred Topics
Word Topic Assignments
LDA Algorithm: Simulation
Two topics: River, Money
Five words: “river”, “stream”, “bank”, “money”, “loan”
River: river 1/3, stream 1/3, bank 1/3
Money: bank 1/3, money 1/3, loan 1/3
Generate 16 documents by randomly mixing the two
topics and using the LDA model
Generated Documents and Initial Topic
Assignment before Inference
First 6 and the last 3 documents are purely from one topic. Others are mixture
White dot: “River”. Black dot: “Money”
Topic Assignment After LDA Inference
First 6 and the last 3 documents are purely from one topic. Others are mixture
After 64 iterations
Inferred Topic-Term Matrix
Model parameter:
  River: river 0.33, stream 0.33, bank 0.33
  Money: bank 0.33, money 0.33, loan 0.33
Estimated parameter:
  River: river 0.25, stream 0.40, bank 0.35
  Money: bank 0.32, money 0.29, loan 0.39
Not perfect, but very close, especially given the small data size
SVD vs LDA
Both perform the following decomposition
[Diagram: the doc-term matrix X is factored as (doc-topic matrix) × (topic-term matrix).]
SVD views this as matrix approximation
LDA views this as probabilistic inference based on a generative
model
Each entry corresponds to a “probability”: better interpretability
LDA as Soft Classification
Soft vs hard clustering/classification
After LDA, every document is assigned to a small number
of topics with some weights
Documents are not assigned exclusively to a topic
Soft clustering
Summary
Probabilistic Topic Model
Generative model of documents
pLSI and overfitting
LDA, MCMC, and probabilistic interpretation
Statistical parameter estimation
Multinomial distribution and Dirichlet distribution
Monte Carlo method
Gibbs sampling
Markov-Chain class of sampling