Transcript Document

Latent Dirichlet Allocation
Presenter: Hsuan-Sheng Chiu
Reference
• D. M. Blei, A. Y. Ng and M. I. Jordan, “Latent Dirichlet
allocation”, Journal of Machine Learning Research, vol. 3,
no. 5, pp. 993-1022, 2003.
Outline
• Introduction
• Notation and terminology
• Latent Dirichlet allocation
• Relationship with other latent variable models
• Inference and parameter estimation
• Discussion
Introduction
• We consider the problem of modeling text corpora
and other collections of discrete data
– The goal is to find short descriptions of the members of a collection
• Significant progress in IR:
– tf-idf scheme (Salton and McGill, 1983)
– Latent Semantic Indexing (LSI, Deerwester et al., 1990)
– Probabilistic LSI (pLSI, aspect model, Hofmann, 1999)
Introduction (cont.)
• Problems with pLSI:
– Incomplete: provides no probabilistic model at the level of documents
– The number of parameters in the model grows linearly with the size of the corpus
– It is not clear how to assign probability to a document outside of the training data
• Exchangeability: the bag-of-words assumption
Notation and terminology
• A word is the basic unit of discrete data, from a vocabulary indexed by {1,…,V}. The vth word is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v
• A document is a sequence of N words denoted by w = (w1,w2,…,wN)
• A corpus is a collection of M documents denoted by D = {w1,w2,…,wM}
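A tiny illustration of this one-hot word representation (the vocabulary size below is an arbitrary toy value, not from the paper):

```python
import numpy as np

V = 5                  # toy vocabulary size
v = 2                  # index of the chosen word (0-based here; the slide's indexing is 1-based)
w = np.zeros(V)
w[v] = 1.0             # w^v = 1 and w^u = 0 for u != v
print(w)               # [0. 0. 1. 0. 0.]
```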
Latent Dirichlet allocation
• Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus.
• Generative process for each document w in a corpus D (see the sketch below):
– 1. Choose N ~ Poisson(ξ)
– 2. Choose θ ~ Dir(α)
– 3. For each of the N words wn:
(a) Choose a topic zn ~ Multinomial(θ)
(b) Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn
• β is a k×V matrix of word probabilities with βij = p(w^j = 1 | z^i = 1)
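A minimal sketch of this generative process in Python/NumPy; the topic matrix β, the values of α and ξ, and the vocabulary below are made-up toy values, not taken from the paper:

```python
import numpy as np

def generate_document(alpha, beta, xi, rng=np.random.default_rng(0)):
    """Sketch of the LDA generative process for one document.
    alpha: length-k Dirichlet parameter; beta: k x V topic-word matrix; xi: Poisson mean."""
    k, V = beta.shape
    N = rng.poisson(xi)                 # 1. document length
    theta = rng.dirichlet(alpha)        # 2. topic proportions for this document
    words, topics = [], []
    for _ in range(N):                  # 3. for each word position
        z = rng.choice(k, p=theta)      #    (a) topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])    #    (b) word w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return words, topics

# toy run with k = 2 topics and V = 4 vocabulary words (illustrative values)
beta = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])
print(generate_document(alpha=np.array([1.0, 1.0]), beta=beta, xi=10))
```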
Latent Dirichlet allocation (cont.)
• Representation of document generation (diagram omitted): N ~ Poisson(ξ) gives the document length, θ ~ Dir(α) generates the topic assignments {z1, z2, …, zN}, and each word wn in {w1, w2, …, wN} is drawn from β conditioned on zn.
Latent Dirichlet allocation (cont.)
• Several simplifying assumptions:
– 1. The dimensionality k of the Dirichlet distribution is known and fixed
– 2. The word probabilities β form a k×V matrix that is a fixed quantity to be estimated
– 3. The document length N is independent of all the other data-generating variables θ and z
• A k-dimensional Dirichlet random variable θ can take values in the (k−1)-simplex, with density
p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}
http://www.answers.com/topic/dirichlet-distribution
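As a quick numerical check, one can draw samples from the Dirichlet with NumPy and confirm that each draw lies on the simplex (the α used below is an arbitrary toy choice):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])      # illustrative k = 3 Dirichlet parameter

samples = rng.dirichlet(alpha, size=5)  # each row is a point theta in the 2-simplex
print(samples)
print(samples.sum(axis=1))              # every row sums to 1: theta lies on the simplex
```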
Latent Dirichlet allocation (cont.)
• Simplex (figures omitted): graphs of the n-simplexes for n = 2 to 7 (from MathWorld, http://mathworld.wolfram.com/Simplex.html)
Latent Dirichlet allocation (cont.)
• The joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w:
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
• Marginal distribution of a document:
p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta
• Probability of a corpus:
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
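The document marginal above has no closed form, but for a toy model it can be approximated by Monte Carlo: draw θ ~ Dir(α) many times and average the resulting document likelihoods. A sketch under made-up α, β, and document (illustrative values only):

```python
import numpy as np

def mc_marginal_likelihood(words, alpha, beta, n_samples=100_000, rng=np.random.default_rng(0)):
    """Estimate p(w | alpha, beta) = E_theta[ prod_n (theta @ beta)[w_n] ] by sampling theta."""
    thetas = rng.dirichlet(alpha, size=n_samples)        # (n_samples, k)
    word_probs = thetas @ beta                            # (n_samples, V): p(word | theta, beta)
    doc_likelihoods = word_probs[:, words].prod(axis=1)   # product over the words of the document
    return doc_likelihoods.mean()

alpha = np.array([1.0, 1.0])
beta = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])
print(mc_marginal_likelihood([0, 2, 2, 3], alpha, beta))
```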
Latent Dirichlet allocation (cont.)
• There are three levels to the LDA representation:
– α and β are corpus-level parameters
– θd are document-level variables
– zdn and wdn are word-level variables
(Plate diagram omitted: word-level variables are nested within the document plate, and documents within the corpus plate.)
Such models are often referred to as hierarchical models, conditionally independent hierarchical models, or parametric empirical Bayes models.
Latent Dirichlet allocation (cont.)
• LDA and exchangeability
– A finite set of random variables {z1,…,zN} is said to be exchangeable if the joint distribution is invariant to permutation (π is a permutation):
p(z_1, \ldots, z_N) = p(z_{\pi(1)}, \ldots, z_{\pi(N)})
– An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable
– De Finetti's representation theorem states that the joint distribution of an infinitely exchangeable sequence of random variables is as if a random parameter were drawn from some distribution and then the random variables in question were independent and identically distributed, conditioned on that parameter
– http://en.wikipedia.org/wiki/De_Finetti's_theorem
Latent Dirichlet allocation (cont.)
• In LDA, we assume that words are generated by topics (by fixed conditional distributions) and that those topics are infinitely exchangeable within a document:
p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n) \right) d\theta
Latent Dirichlet allocation (cont.)
• A continuous mixture of unigrams
– By marginalizing over the hidden topic variable z, we can understand LDA as a two-level model:
p(w \mid \theta, \beta) = \sum_{z} p(w \mid z, \beta)\, p(z \mid \theta)
• Generative process for a document w:
– 1. Choose θ ~ Dir(α)
– 2. For each of the N words wn:
(a) Choose a word wn from p(wn | θ, β)
• Marginal distribution of a document:
p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} p(w_n \mid \theta, \beta) \right) d\theta
Latent Dirichlet allocation (cont.)
• The distribution on the (V-1)-simplex is attained with only
k+kV parameters.
Relationship with other latent variable models
• Unigram model:
p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)
• Mixture of unigrams (compared numerically in the sketch below)
– Each document is generated by first choosing a topic z and then generating N words independently from the conditional multinomial
– k − 1 free parameters for the topic distribution p(z)
p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)
Relationship with other latent variable models (cont.)
• Probabilistic latent semantic indexing (pLSI)
– Attempts to relax the simplifying assumption made in the mixture of unigrams model
– In a sense, it does capture the possibility that a document may contain multiple topics
– kV + kM parameters, hence linear growth in M
p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)
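A toy illustration of the pLSI parameterization; note that p(z|d) is a table with one row per training document, which is why the parameter count grows with M and why an unseen document has no row (all values below are invented):

```python
import numpy as np

# Illustrative pLSI quantities: M = 3 training documents, k = 2 topics, V = 4 words.
p_d = np.full(3, 1 / 3)                         # p(d): distribution over training documents
p_w_given_z = np.array([[0.5, 0.3, 0.1, 0.1],   # k x V topic-word distributions
                        [0.1, 0.1, 0.4, 0.4]])
p_z_given_d = np.array([[0.9, 0.1],             # M x k: one row of parameters per training document
                        [0.5, 0.5],
                        [0.2, 0.8]])

d, w = 1, 2
p_joint = p_d[d] * (p_z_given_d[d] @ p_w_given_z[:, w])   # p(d, w_n) = p(d) sum_z p(w_n|z) p(z|d)
print(p_joint)
```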
Relationship with other latent variable models (cont.)
• Problems with pLSI
– There is no natural way to use it to assign probability to a previously unseen document
– The linear growth in parameters suggests that the model is prone to overfitting, and empirically, overfitting is indeed a serious problem
• LDA overcomes both of these problems by treating the topic mixture weights as a k-parameter hidden random variable
• The k + kV parameters in a k-topic LDA model do not grow with the size of the training corpus (for example, with k = 100 and V = 10,000 that is about 10^6 parameters regardless of M, whereas pLSI adds k more parameters for each training document).
Relationship with other latent variable models (cont.)
• A geometric interpretation: three topics and three words (figure omitted: the topic simplex embedded in the word simplex)
Relationship with other latent variable models (cont.)
• The unigram model finds a single point on the word simplex and posits that all words in the corpus come from the corresponding distribution.
• The mixture of unigrams model posits that for each document, one of the k points on the word simplex is chosen randomly and all the words of the document are drawn from the corresponding distribution.
• The pLSI model posits that each word of a training document comes from a randomly chosen topic. The topics are themselves drawn from a document-specific distribution over topics.
• LDA posits that each word of both the observed and unseen documents is generated by a randomly chosen topic which is drawn from a distribution with a randomly chosen parameter.
Inference and parameter estimation
• The key inferential problem is that of computing the posterior distribution of the hidden variables given a document:
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}
• Normalizing this requires the marginal likelihood:
p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{i} \alpha_i\right)}{\prod_{i} \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta
• Unfortunately, this distribution is intractable to compute in general: the function is intractable due to the coupling between θ and β in the summation over latent topics.
Inference and parameter estimation (cont.)
• The basic idea of convexity-based variational inference is
to make use of Jensen’s inequality to obtain an
adjustable lower bound on the log likelihood.
• Essentially, one considers a family of lower bounds,
indexed by a set of variational parameters.
• A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed.
Inference and parameter estimation (cont.)
• Drop some edges and the w nodes: the true posterior
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}
is approximated by the simpler variational distribution
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)
Inference and parameter estimation (cont.)
• Variational distribution:
– Lower bound on the log likelihood (Jensen's inequality):
\log p(\mathbf{w} \mid \alpha, \beta) = \log \int \sum_{\mathbf{z}} p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\, d\theta
= \log \int \sum_{\mathbf{z}} \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\, q(\theta, \mathbf{z} \mid \gamma, \phi)}{q(\theta, \mathbf{z} \mid \gamma, \phi)}\, d\theta
\geq \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{q(\theta, \mathbf{z} \mid \gamma, \phi)}\, d\theta
= E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)]
– KL divergence between the variational posterior and the true posterior:
D(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta))
= \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log q(\theta, \mathbf{z} \mid \gamma, \phi)\, d\theta - \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}\, d\theta
= E_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)] - E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] + \log p(\mathbf{w} \mid \alpha, \beta)
Inference and parameter estimation (cont.)
• Finding a tight lower bound on the log likelihood:
\log p(\mathbf{w} \mid \alpha, \beta) = E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)] + D(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta))
• Maximizing the lower bound with respect to γ and φ is equivalent to minimizing the KL divergence between the variational posterior probability and the true posterior probability:
(\gamma^*, \phi^*) = \arg\min_{\gamma, \phi} D(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta))
Inference and parameter estimation (cont.)
• Expand the lower bound:
L(\gamma, \phi; \alpha, \beta) = E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)]
= E_q[\log p(\theta \mid \alpha)] + E_q[\log p(\mathbf{z} \mid \theta)] + E_q[\log p(\mathbf{w} \mid \mathbf{z}, \beta)] - E_q[\log q(\theta \mid \gamma)] - E_q[\log q(\mathbf{z} \mid \phi)]
Inference and parameter estimation (cont.)
• Then:
L(\gamma, \phi; \alpha, \beta) = \log \Gamma\left(\sum_{j=1}^{k} \alpha_j\right) - \sum_{i=1}^{k} \log \Gamma(\alpha_i) + \sum_{i=1}^{k} (\alpha_i - 1)\left(\Psi(\gamma_i) - \Psi\left(\sum_{j=1}^{k} \gamma_j\right)\right)
+ \sum_{n=1}^{N} \sum_{i=1}^{k} \phi_{ni} \left(\Psi(\gamma_i) - \Psi\left(\sum_{j=1}^{k} \gamma_j\right)\right)
+ \sum_{n=1}^{N} \sum_{i=1}^{k} \sum_{j=1}^{V} \phi_{ni} w_n^j \log \beta_{ij}
- \log \Gamma\left(\sum_{j=1}^{k} \gamma_j\right) + \sum_{i=1}^{k} \log \Gamma(\gamma_i) - \sum_{i=1}^{k} (\gamma_i - 1)\left(\Psi(\gamma_i) - \Psi\left(\sum_{j=1}^{k} \gamma_j\right)\right)
- \sum_{n=1}^{N} \sum_{i=1}^{k} \phi_{ni} \log \phi_{ni}
(where Ψ is the digamma function, the first derivative of log Γ)
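A hedged sketch that evaluates this expanded bound for a single document, with SciPy's gammaln and digamma standing in for log Γ and Ψ (the function name and argument layout are my own):

```python
import numpy as np
from scipy.special import digamma, gammaln, xlogy

def lower_bound(words, alpha, beta, gamma, phi):
    """Evaluate L(gamma, phi; alpha, beta) for one document, term by term as expanded above.
    words: list of word indices; alpha (k,), beta (k, V), gamma (k,), phi (N, k)."""
    e_log_theta = digamma(gamma) - digamma(gamma.sum())   # E_q[log theta_i] = Psi(gamma_i) - Psi(sum_j gamma_j)
    L = gammaln(alpha.sum()) - gammaln(alpha).sum() + ((alpha - 1) * e_log_theta).sum()   # E_q[log p(theta|alpha)]
    L += (phi * e_log_theta).sum()                         # E_q[log p(z|theta)]
    L += (phi * np.log(beta[:, words].T)).sum()            # E_q[log p(w|z,beta)]
    L -= gammaln(gamma.sum()) - gammaln(gamma).sum() + ((gamma - 1) * e_log_theta).sum()  # -E_q[log q(theta|gamma)]
    L -= xlogy(phi, phi).sum()                             # -E_q[log q(z|phi)], with 0 log 0 = 0
    return L
```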
Inference and parameter estimation (cont.)
• We can obtain the variational parameters by adding Lagrange multipliers and setting the derivatives to zero:
\phi_{ni} \propto \beta_{i w_n} \exp\left(\Psi(\gamma_i) - \Psi\left(\sum_{j=1}^{k} \gamma_j\right)\right)
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}
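A minimal sketch of these coordinate-ascent updates for a single document, assuming fixed α and β; the initialization follows the paper's style, but the names and the convergence tolerance here are illustrative:

```python
import numpy as np
from scipy.special import digamma

def variational_inference(words, alpha, beta, n_iters=100, tol=1e-6):
    """Update (gamma, phi) for one document, with alpha (k,), beta (k, V), words a list of indices."""
    k = alpha.shape[0]
    N = len(words)
    phi = np.full((N, k), 1.0 / k)            # initialize phi_ni = 1/k
    gamma = alpha + N / k                      # initialize gamma_i = alpha_i + N/k
    for _ in range(n_iters):
        old_gamma = gamma.copy()
        # phi_ni proportional to beta_{i, w_n} * exp(digamma(gamma_i) - digamma(sum_j gamma_j))
        log_phi = np.log(beta[:, words].T) + (digamma(gamma) - digamma(gamma.sum()))
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)  # normalize each row over topics
        gamma = alpha + phi.sum(axis=0)        # gamma_i = alpha_i + sum_n phi_ni
        if np.abs(gamma - old_gamma).max() < tol:
            break
    return gamma, phi
```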
Inference and parameter estimation (cont.)
• Parameter estimation
– Maximize the log likelihood of the data:
\ell(\alpha, \beta) = \sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \alpha, \beta)
– Variational inference provides us with a tractable lower bound on the log likelihood, a bound which we can maximize with respect to α and β
• Variational EM procedure (sketched in code below)
– 1. (E-step) For each document, find the optimizing values of the variational parameters {γd, φd}
– 2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β
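A rough sketch of the outer variational EM loop, reusing variational_inference from the sketch above. The β update follows the standard expected-count form β_ij ∝ Σ_d Σ_n φ_dni w_dn^j; the α update (a Newton-Raphson step in the paper) is omitted here, so α is held fixed, which is an assumption of this sketch:

```python
import numpy as np

def variational_em(docs, k, V, alpha, n_em_iters=20, seed=0):
    """docs: list of documents, each a list of word indices. Returns the topic-word matrix beta.
    alpha is held fixed in this sketch; the paper also updates it via Newton-Raphson."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.ones(V), size=k)          # random initial k x V topic-word matrix
    for _ in range(n_em_iters):
        beta_stats = np.zeros((k, V))
        for words in docs:
            # E-step: fit (gamma_d, phi_d) for this document under the current beta
            gamma, phi = variational_inference(words, alpha, beta)
            # accumulate expected counts: beta_ij collects sum_n phi_ni for words w_n = j
            for n, w in enumerate(words):
                beta_stats[:, w] += phi[n]
        # M-step: renormalize each topic's expected counts (small constant keeps entries positive)
        beta = (beta_stats + 1e-10) / (beta_stats + 1e-10).sum(axis=1, keepdims=True)
    return beta
```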
Inference and parameter estimation (cont.)
• Smoothed LDA model: place an exchangeable Dirichlet prior on each row of β and treat β as a random variable (graphical model figure omitted)
Discussion
• LDA is a flexible generative probabilistic model for collections of discrete data.
• Exact inference is intractable for LDA, but any of a large suite of approximate inference algorithms can be used for inference and parameter estimation within the LDA framework.
• LDA is a simple model and is readily extended to continuous data or other non-multinomial data.