CS 479, section 1: Natural Language Processing
Lecture #32: Text Clustering with Expectation Maximization
Thanks to Dan Klein of UC Berkeley for some of the materials used in this lecture.
Announcements
 Homework 0.3
   Key has been posted
   Requires authentication
   If you haven’t submitted this assignment, but you’re doing project #4, perhaps it’s time to submit what you have and move on.
 Project #4
   Note updates to the requirements involving horizontal markovization
 Reading Report #13: M&S ch. 13 on alignment and MT
   Due Monday – online again
   See directions sent to the announcements list
 Project #5
   Help session: Tuesday
Final Project
 Proposal pre-requisites:
   Discuss your idea with me before submitting the proposal
   Identify and possess data
 Proposal Due: Today
 Project Report:
   Early: Wednesday after Thanksgiving
   Due: Friday after Thanksgiving
Where are we going?
 Earlier in the semester: Classification
 “Supervised” learning paradigm
 Joint / Generative model: Naïve Bayes
 Moving into “Unsupervised” territory …
Objectives
 Understand the text clustering problem
 Understand the Expectation Maximization (EM) algorithm for one model
 See unsupervised learning at work on a real application
Clustering Task
 Given a set of documents, put them into groups so that:
   Documents in the same group are alike
   Documents in different groups are different
 “Unsupervised” learning: no class labels
 Goal: discover hidden patterns in the data
 Useful in a range of NLP tasks:
   IR
   Smoothing
   Data mining and Text mining
   Exploratory data analysis
Clustering Task: Example
[Figure: a collection of small example documents to be clustered, each containing a few words such as “dog,” “cat,” “canine,” “feline,” “puppy,” “kitten,” “woof,” “meow,” “collar,” “kennel,” “litter,” “raining cats and dogs,” and “year of the dog.”]
k-Means Clustering
 The simplest model-based technique
 Procedure:
 Failure Cases:
[Figure: example documents plotted by Count(dog) vs. Count(cat), illustrating k-means cluster assignments.]
Demo
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
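
The procedure and failure cases were shown on the slide itself; as a rough, hedged illustration of the basic procedure, here is a minimal k-means sketch over word-count vectors (the data and function below are illustrative, not from the lecture).

```python
# Minimal k-means sketch over word-count vectors (illustrative only).
import random

def kmeans(docs, k, iters=20):
    """docs: list of equal-length count vectors; returns (centroids, assignments)."""
    centroids = [list(d) for d in random.sample(docs, k)]
    assignments = [0] * len(docs)
    for _ in range(iters):
        # Assignment step: attach each document to its nearest centroid.
        for i, d in enumerate(docs):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(d, centroids[c])),
            )
        # Update step: move each centroid to the mean of its documents.
        for c in range(k):
            members = [docs[i] for i in range(len(docs)) if assignments[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

# Documents represented as [count(dog), count(cat)], matching the slide's axes.
docs = [[3.0, 1.0], [4.0, 0.0], [0.0, 5.0], [1.0, 4.0], [2.0, 2.0]]
print(kmeans(docs, k=2))
```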
High-Level: Expectation-Maximization (EM)
 Iterative procedure:
   1. Guess some initial parameters for the model
   2. Use the model to estimate partial label counts for all docs (E-step)
   3. Use the new “complete” data to learn a better model (M-step)
   4. Repeat steps 2 and 3 until convergence
 k-Means is “hard” EM
 Or: EM is “soft” k-Means
EM
𝐷: Posterior distribution over clusters 𝑝(𝑐|𝑑) for each document
𝐶: Total partial label counts for each document, based on 𝐷
𝜃: Model parameter estimates, computed from 𝐶
When to stop?
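
As a high-level sketch of this loop, and one common answer to “When to stop?”, the iteration can end once the data log likelihood stops improving by more than a small tolerance. The function below is a generic skeleton with placeholder arguments, not code from the course.

```python
# Generic EM skeleton: e_step, m_step, and log_likelihood are placeholders for the
# model-specific computations developed on the following slides.
def run_em(docs, theta, e_step, m_step, log_likelihood, tol=1e-4, max_iters=100):
    prev_ll = float("-inf")
    for _ in range(max_iters):
        D = e_step(docs, theta)            # posteriors p(c | d) and partial counts C
        theta = m_step(D)                  # re-estimate parameters from the counts
        ll = log_likelihood(docs, theta)
        if ll - prev_ll < tol:             # stop once the likelihood stops improving
            break
        prev_ll = ll
    return theta
```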
Model
Word-Token Perspective:
[Graphical model: cluster variable c_i generates word tokens t_{i,1}, t_{i,2}, …, t_{i,M_i}.]
This is what we called “Naïve Bayes” when we discussed classification.
Multinomial Mixture Model
Word-Type Perspective:
[Graphical model: cluster variable c_i generates word-type counts x_{i,1}, x_{i,2}, …, x_{i,V} (collected in the count vector x_i).]
Word-Token Perspective:
[Graphical model: cluster variable c_i generates word tokens t_{i,1}, t_{i,2}, …, t_{i,M_i}.]
Multinomial Mixture Model
$c_i \sim \mathrm{Multinomial}(\lambda)$ (more precisely: $c_i \sim \mathrm{Categorical}(\lambda)$, or $c_i \sim \mathrm{Multinomial}(\lambda, 1)$)
$\lambda \in \mathbb{R}^{K}$, $\sum_{k=1}^{K} \lambda_k = 1$, $\lambda_k \ge 0$, so $P(c_i) = \lambda_{c_i}$
Word-Type Perspective: $x_i \mid c_i \sim \mathrm{Multinomial}(\beta_{c_i}, M_i)$
Word-Token Perspective: $t_{i,j} \mid c_i \sim \mathrm{Categorical}(\beta_{c_i})$
$\beta \in \mathbb{R}^{K \times V}$, $\sum_{j=1}^{V} \beta_{k,j} = 1$, so $P(w_j \mid c_i) = \beta_{c_i, j}$
Computing Data Likelihood
$$
\begin{aligned}
P(x_i) &= \sum_{c_i=1}^{K} P(c_i, x_i) && \text{Marginalization} \\
       &= \sum_{c_i=1}^{K} P(c_i)\, P(x_i \mid c_i) && \text{Factorization} \\
       &= \sum_{c_i=1}^{K} P(c_i) \prod_{j=1}^{V} P(x_{i,j} \mid c_i) && \text{Conditional independence} \\
       &= \sum_{c_i=1}^{K} P(c_i) \prod_{j=1}^{V} P(w_j \mid c_i)^{x_{i,j}} && \text{Make word type explicit + conditional indep.} \\
       &= \sum_{c_i=1}^{K} \lambda_{c_i} \prod_{j=1}^{V} \beta_{c_i,j}^{\,x_{i,j}} && \text{Categorical parameters}
\end{aligned}
$$
Log Likelihood of the Data
 We want the probability of the unlabeled data according to a model with parameters $\theta$:
$$
P(D) = \prod_{i=1}^{N} P(x_i) = \prod_{i=1}^{N} \sum_{c_i=1}^{K} \lambda_{c_i} \prod_{j=1}^{V} \beta_{c_i,j}^{\,x_{i,j}} \qquad \text{(Independence of data)}
$$
$$
\begin{aligned}
\log P(D) &= \log \prod_{i=1}^{N} \sum_{c_i=1}^{K} \lambda_{c_i} \prod_{j=1}^{V} \beta_{c_i,j}^{\,x_{i,j}} && \text{Logarithm} \\
          &= \sum_{i=1}^{N} \log \sum_{c_i=1}^{K} \lambda_{c_i} \prod_{j=1}^{V} \beta_{c_i,j}^{\,x_{i,j}} && \text{Log of product} \\
          &= \sum_{i=1}^{N} \operatorname*{logsum}_{1 \le c_i \le K} \left[ \log \left( \lambda_{c_i} \prod_{j=1}^{V} \beta_{c_i,j}^{\,x_{i,j}} \right) \right] && \text{Log of sum} \\
          &= \sum_{i=1}^{N} \operatorname*{logsum}_{1 \le c_i \le K} \left[ \log \lambda_{c_i} + \sum_{j=1}^{V} x_{i,j} \log \beta_{c_i,j} \right] && \text{Log of product}
\end{aligned}
$$
logsum
 For the computation involved in the logsum and its justification, see:
https://facwiki.cs.byu.edu/nlp/index.php/Log_Domain_Computations
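
For instance, a minimal log-domain computation of log P(x_i) for one document might look like the sketch below; the log-sum-exp trick is the standard one, and the variable names and toy numbers are mine rather than the linked page's.

```python
import math

def log_sum_exp(log_terms):
    """log(sum(exp(t) for t in log_terms)), computed without underflow."""
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def log_p_x(x, log_lam, log_beta):
    """log P(x_i) = logsum over clusters of [log lambda_c + sum_j x_{i,j} log beta_{c,j}]."""
    per_cluster = [
        log_lam[c] + sum(x_j * log_beta[c][j] for j, x_j in enumerate(x))
        for c in range(len(log_lam))
    ]
    return log_sum_exp(per_cluster)

# Tiny example: x = word-type counts over a three-word vocabulary, two clusters.
log_lam = [math.log(0.5), math.log(0.5)]
log_beta = [[math.log(p) for p in row] for row in ([0.7, 0.2, 0.1], [0.1, 0.7, 0.2])]
print(log_p_x([1, 4, 0], log_lam, log_beta))
```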
EM Properties
[Figure: likelihood surface showing data likelihood as a function of model parameterizations.]
 Each step of EM is guaranteed to increase data likelihood - a hill climbing procedure
 Not guaranteed to find global maximum of data likelihood
Data Likelihood
 Data likelihood typically has many local maxima for a general model class and rich feature set
 Many “patterns” in the data that we can fit our model to…
Idea: Random restarts!
[Figure: likelihood surface with multiple local maxima over model parameterizations.]
EM for Naïve Bayes: E Step
[Figure: the same collection of small “dog”/“cat” example documents, now to be given fractional cluster labels.]
EM for Naïve Bayes: E Step
1. Compute posteriors for each document $x_j$. For example, for document $j$ = “Dog cat cat cat cat”:
   $p_\theta(c_j = k_1 \mid x_j) = 0.2$
   $p_\theta(c_j = k_2 \mid x_j) = 0.8$
2. Use the posteriors as partial counts, and total them up:
   C(k1) += 0.2
   C(k2) += 0.8
   C(k1, dog) += 1 × 0.2
   C(k2, dog) += 1 × 0.8
   C(k1, cat) += 4 × 0.2
   C(k2, cat) += 4 × 0.8
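
A small sketch of this E step in code, using the example document above (counts dog: 1, cat: 4) and taking the 0.2 / 0.8 posteriors directly from the slide rather than computing them from real parameters:

```python
from collections import defaultdict

# Fractional counts accumulated across documents during the E step.
C_cluster = defaultdict(float)          # C(k)
C_word = defaultdict(float)             # C(k, w)

doc_counts = {"dog": 1, "cat": 4}       # the document "Dog cat cat cat cat"
posteriors = {"k1": 0.2, "k2": 0.8}     # p(c = k | x), values taken from the slide

for k, p in posteriors.items():
    C_cluster[k] += p                   # C(k) += p(c = k | x)
    for w, n in doc_counts.items():
        C_word[(k, w)] += n * p         # C(k, w) += count(w) * p(c = k | x)

print(dict(C_cluster))                  # {'k1': 0.2, 'k2': 0.8}
print(dict(C_word))                     # e.g., C[('k2', 'cat')] == 3.2
```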
EM for Naïve Bayes: M Step
 Once you have your partial counts, re-estimate the parameters as you would with regular counts:
$$
\begin{aligned}
\lambda_1 &= p(k_1) = \frac{C(k_1)}{N} &
\lambda_2 &= p(k_2) = \frac{C(k_2)}{N} \\[4pt]
\beta_{1,\mathrm{dog}} &= p(\mathrm{dog} \mid k_1) = \frac{C(k_1, \mathrm{dog}) + 1}{\sum_{v=1}^{V} C(k_1, v) + V} &
\beta_{2,\mathrm{dog}} &= p(\mathrm{dog} \mid k_2) = \frac{C(k_2, \mathrm{dog}) + 1}{\sum_{v=1}^{V} C(k_2, v) + V} \\[4pt]
\beta_{1,\mathrm{cat}} &= p(\mathrm{cat} \mid k_1) = \frac{C(k_1, \mathrm{cat}) + 1}{\sum_{v=1}^{V} C(k_1, v) + V} &
\beta_{2,\mathrm{cat}} &= p(\mathrm{cat} \mid k_2) = \frac{C(k_2, \mathrm{cat}) + 1}{\sum_{v=1}^{V} C(k_2, v) + V}
\end{aligned}
$$
Etc.
 In the next E step, the partial counts will be different, because the parameters have changed. E.g.,
$$
p(c_j = k_1 \mid x_j) \propto p(k_1) \cdot \prod_{v=1}^{V} p(w_v \mid k_1)^{x_{j,v}}
$$
 And the likelihood will increase!
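
Continuing the E-step sketch above, the M step just normalizes those fractional counts, with add-one smoothing on the word distributions; the value of N and the two-word vocabulary here are assumptions for the toy example.

```python
# Re-estimate parameters from the fractional counts (toy continuation of the E-step sketch).
vocab = ["dog", "cat"]                  # assume a two-word vocabulary, so V = 2
N = 1                                   # assume only one document has been counted so far

lam = {k: C_cluster[k] / N for k in C_cluster}

beta = {}
for k in C_cluster:
    denom = sum(C_word[(k, v)] for v in vocab) + len(vocab)   # sum_v C(k, v) + V
    for w in vocab:
        beta[(k, w)] = (C_word[(k, w)] + 1) / denom           # (C(k, w) + 1) / denom

print(lam)      # re-estimated cluster priors
print(beta)     # re-estimated, smoothed word distributions
```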
EM for Multinomial Mixture Model
 Initialize the model parameters, e.g., randomly:
$$
P(c) = \lambda_c \quad \text{and} \quad P(w_j \mid c) = \beta_{c,j}
$$
 E-step:
1. Calculate posterior distributions over clusters for each document $x_i$:
$$
P(c \mid x_i) = \frac{P(c, x_i)}{P(x_i)}
= \frac{P(c) \prod_{j=1}^{V} P(x_{i,j} \mid c)}{\sum_{k=1}^{K} P(c_k) \prod_{j=1}^{V} P(x_{i,j} \mid c_k)}
= \frac{\lambda_c \prod_{j=1}^{V} \beta_{c,j}^{\,x_{i,j}}}{\sum_{k=1}^{K} \lambda_k \prod_{j=1}^{V} \beta_{k,j}^{\,x_{i,j}}}
$$
2. Use the posteriors as fractional labels; total up your new counts:
$$
\forall j, c: \; C(w_j, c) = \sum_{i=1}^{N} P(c \mid x_i)\, x_{i,j}
\qquad
\forall c: \; C(c) = \sum_{i=1}^{N} P(c \mid x_i)
$$
M-Step
1. Re-estimate $\lambda$:
$$
\hat{\lambda}_c = \hat{P}(c) = \frac{C(c)}{N}, \quad \text{where } N = \text{number of documents}
$$
2. Re-estimate $\beta$ from the fractionally labeled data with smoothing:
$$
\hat{\beta}_{c,j} = \hat{P}(w_j \mid c) = \frac{1 + C(w_j, c)}{\sum_{w \in V} \bigl(1 + C(w, c)\bigr)} = \frac{1 + C(w_j, c)}{|V| + \sum_{w \in V} C(w, c)}
$$
EM in General
 EM is a technique for learning anytime we have incomplete data (𝑥, 𝑦)
 Induction Scenario (ours): we actually want to know 𝑦 (e.g., clustering)
 Convenience Scenario: we want the marginal, 𝑃(𝑥), and including 𝑦 just makes the model simpler (e.g., mixing weights)
General Approach
 Learn 𝑦 and 𝜃
 E-step: make a guess at posteriors 𝑃(𝑦 | 𝑥, 𝜃)
   This means scoring all completions with the current parameters
   Treat the completions as (fractional) complete data, and count.
 M-step: re-estimate 𝜃 to maximize log 𝑃(𝑥, 𝑦 | 𝜃)
   Then compute (smoothed) ML estimates of model parameters
Expectation Maximization (EM) for Mixing Parameters
 How to estimate mixing parameters $\lambda_1$ and $\lambda_2$?
$$
P_{\mathrm{LIN}(\lambda_1, \lambda_2)}(w \mid w_{-1}) = \lambda_1 \hat{P}(w \mid w_{-1}) + \lambda_2 \hat{P}(w)
$$
 Sometimes you can just do line search, as we have discussed.
   … or the “try a few orders of magnitude” approach
 Alternative: Use EM
 Think of mixing as a hidden choice between histories:
$$
P_{\mathrm{LIN}(P_H)}(w \mid w_{-1}) = P_H(1)\, \hat{P}(w \mid w_{-1}) + P_H(0)\, \hat{P}(w)
$$
 Given a guess at $P_H$, we can calculate expectations of which generation route a given token took (over held-out data):
$$
P(h = 1 \mid w, w_{-1}) = \frac{P_H(1)\, \hat{P}(w \mid w_{-1})}{P_H(1)\, \hat{P}(w \mid w_{-1}) + P_H(0)\, \hat{P}(w)}
$$
 Use these expectations to update $P_H$; rinse and repeat
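
A hedged sketch of that procedure for a two-way interpolation of a bigram and a unigram model follows; `p_bigram` and `p_unigram` stand in for the pre-estimated P̂(w | w_{-1}) and P̂(w), the held-out data is a list of (previous word, word) pairs, and all names and numbers are illustrative.

```python
def em_mixing_weight(heldout, p_bigram, p_unigram, iters=20, lam=0.5):
    """Estimate P_H(1), the weight on the bigram model, via EM over held-out pairs."""
    for _ in range(iters):
        expected_bigram = 0.0
        for w_prev, w in heldout:
            pb = lam * p_bigram(w, w_prev)           # mass routed through the bigram history
            pu = (1.0 - lam) * p_unigram(w)          # mass routed through the unigram "history"
            expected_bigram += pb / (pb + pu)        # E step: P(h = 1 | w, w_prev)
        lam = expected_bigram / len(heldout)         # M step: new mixing weight
    return lam

# Toy usage with made-up probability functions:
p_bigram = lambda w, w_prev: 0.4 if (w_prev, w) == ("the", "dog") else 0.05
p_unigram = lambda w: 0.1
heldout = [("the", "dog"), ("the", "cat"), ("a", "dog")]
print(em_mixing_weight(heldout, p_bigram, p_unigram))
```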
Advertisement
 Learn more about text mining, including text classification, text clustering, topic models, text summarization, and visualization for text mining:
Register for CS 679 – Fall 2012
Next
 Next Topics:
   Machine Translation
   If time permits, Co-reference Resolution
 We can use the EM algorithm to attack both of these problems!