Transcript: Lecture 3

LING 696B: Mixture model
and its applications in
category learning
1
Recap from last time


G&G model: a self-organizing map (neural net) that does unsupervised learning
Non-parametric approach: encoding the stimulus distribution with a large number of connection weights
2
Question from last time

Scaling up to the speech that infants hear: higher dimensions?
Extending the G&G network in Problem 3 to Maye & Gerken’s data?
Speech segmentation: going beyond static vowels?
Model behavior: parameter tuning, starting points, degree of model fitting, when to stop…
3
Today’s agenda





Learning categories from distributions (Maye & Gerken)
Basic ideas of statistical estimation, consistency, maximum likelihood
Mixture model, learning with the Expectation-Maximization algorithm
Refinement of the mixture model
Application in the speech domain
4
Learning categories after
minimal pairs

Idea going back as early as Jakobson, 41:
knowing /bin/~/pin/ implies [voice] as a distinctive feature
[voice] differentiates /b/ and /p/ as two categories of English
Moreover, this predicts the order in which categories are learned
Completely falsified? (small project)
Obvious objection: early words don’t include many minimal pairs
5
Maye & Gerken, 00


Categories can be learned from statistics, just like learning statistics from sequences
Choice of artificial contrast: English d and (s)t
Small difference in voicing and F0
Main difference: F1, F2 onset
6
Detecting d~(s)t contrast in
Pegg and Werker, 97


Most adults can do this, but not as well as with a native contrast
6-8-month-olds do much better than 10-12-month-olds
(Need more than distributional learning?)
7
Maye & Gerken, 00

Training on monomodal vs. bimodal distributions
Both groups heard the same number of stimuli
8
Maye & Gerken, 00

Results from Maye’s thesis:
9
Maye, Werker & Gerken, 02

Similar experiment done more carefully on infants
Preferential looking time
Alternating and non-alternating trials
10
Maye, Werker & Gerken, 02

Bimodal-trained infants look longer at alternating trials than at non-alternating trials
(Figure: looking-time difference not significant for the monomodal group, significant for the bimodal group)
11
Reflections

The dimension along which the bimodal condition differs from the monomodal one is abstract
The shape of the distribution is also hard to characterize
Adults/infants are not told what categories there are to learn
Nor do they know how many categories to learn
Machine learning does not have satisfying answers to all these questions
12
Statistical estimation

Basic setup:




The world: distributions p(x; θ), where θ is a set of free parameters
“All models are wrong, but some are useful”
Given a parameter θ, p(x; θ) tells us how to calculate the probability of x (also referred to as the “likelihood” p(x|θ))
Observations: X = {x1, x2, …, xN} generated from some p(x; θ). N is the number of observations
Model fitting: based on some examples X, make guesses (learning, inference) about θ
13
Statistical estimation

Example:




Assuming people’s height follows a normal distribution with parameters (mean, var)
p(x; θ) = the probability density function of the normal distribution
Observations: measurements of people’s height
Goal: estimate the parameters θ of the normal distribution
14
Statistical estimation:
Hypothesis space matters

Example: curve fitting with polynomials
15
Criterion of consistency

Many model-fitting criteria:
Least squares
Minimal classification errors
Measures of divergence, etc.
Consistency: as you get more and more data x1, x2, …, xN (N → ∞), your model-fitting procedure should produce an estimate that gets closer and closer to the true θ that generated X.
16
Maximum likelihood estimate
(MLE)


Likelihood function: the examples xi are independent of one another, so
L(θ) = p(x1; θ) p(x2; θ) ··· p(xN; θ)
Among all the possible values of θ, choose the θ̂ for which L(θ) is biggest:
θ̂ = argmax L(θ)
(Figure: the likelihood curve L(θ) with its maximum at θ̂)
Consistent!

17
MLE for Gaussian distributions

Parameters: mean μ and variance σ²
Distribution function:
p(x; μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
MLE for the mean and variance:
μ̂ = (1/N) Σi xi,   σ̂² = (1/N) Σi (xi − μ̂)²
Exercise: derive this result in 2 dimensions
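A quick numerical check of these closed-form estimates (a minimal sketch; the heights below are made-up values for illustration):

```python
import numpy as np

# Hypothetical height measurements (cm), for illustration only
x = np.array([158.0, 162.5, 171.0, 175.5, 168.0, 180.0])
N = len(x)

mu_hat = x.sum() / N                      # MLE of the mean
var_hat = ((x - mu_hat) ** 2).sum() / N   # MLE of the variance (1/N, not 1/(N-1))

print(mu_hat, var_hat)
print(np.mean(x), np.var(x))              # np.var uses the same 1/N convention by default
```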
18
Mixture of Gaussians



An extension of Gaussian distributions to handle data containing categories
Example: a mixture of 2 Gaussian distributions
More concrete example: the heights of males and females follow two different distributions, but we don’t know the gender from which each measurement was taken
19
Mixture of Gaussians

More parameters:
Parameters of the two Gaussians: (μ1, σ1) and (μ2, σ2) -- two categories
The “mixing” proportion ω: 0 ≤ ω ≤ 1
How are data generated? (sketched in code below)
Throw a coin that comes up heads with probability ω
If it comes up heads, generate an example from the first Gaussian; otherwise generate one from the second
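A minimal sketch of this generative process in Python (the means, standard deviations, and mixing proportion are made-up values for illustration; ω is written as omega):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters, for illustration only
omega = 0.6                   # mixing proportion: probability of drawing from Gaussian 1
mu = np.array([0.0, 4.0])     # means of the two Gaussians
sigma = np.array([1.0, 1.5])  # standard deviations of the two Gaussians

def sample_mixture(n):
    """Draw n examples from the 2-component Gaussian mixture."""
    heads = rng.random(n) < omega      # the coin flips
    z = np.where(heads, 0, 1)          # hidden category labels (0 = first Gaussian)
    x = rng.normal(mu[z], sigma[z])    # sample each example from its category's Gaussian
    return x, z

x, z = sample_mixture(1000)
```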
20
Maximum likelihood:
Supervised learning


Seeing data x1, x2, …, xN (heights) as well as their category membership y1, y2, …, yN (male or female)
MLE:
For each Gaussian, estimate (μ, σ) from the members of that category
ω̂ = (number of category-1 examples) / N
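A minimal sketch of this supervised case (labels z in {0, 1} play the role of y; the names are choices made here, not from the slides):

```python
import numpy as np

def supervised_mle(x, z):
    """Closed-form MLE for a 2-component Gaussian mixture with known labels z in {0, 1}."""
    omega_hat = np.mean(z == 0)                    # proportion of category-1 examples
    mu_hat = [x[z == k].mean() for k in (0, 1)]    # per-category means
    sigma_hat = [x[z == k].std() for k in (0, 1)]  # per-category standard deviations (1/N)
    return omega_hat, mu_hat, sigma_hat
```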
21
Maximum likelihood:
Unsupervised learning



Only seeing data x1, x2, …, xN, with no idea about category membership or ω
Must estimate (μ1, σ1), (μ2, σ2), and ω based on X only
Key idea: relate this problem to the supervised learning case
22
The K-means algorithm



Clustering algorithm for designing “codebooks” (vector quantization)
Goal: dividing data into K clusters and representing each cluster by its center
First: random guesses about cluster membership (among 1, …, K)
23
The K-means algorithm

Then iterate:
Update the center of each cluster to the mean of the data belonging to that cluster
Re-assign each datum to a cluster based on the shortest distance to the cluster centers
After some iterations, the assignments will not change any more (see the sketch below)
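A minimal K-means sketch in plain NumPy (K, the iteration cap, and the random-label initialization are choices made here for illustration; the empty-cluster corner case is ignored):

```python
import numpy as np

def kmeans(X, K=2, n_iter=50, seed=0):
    """Plain K-means: alternate center updates and re-assignment until nothing changes."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).reshape(len(X), -1)   # (N, d)
    labels = rng.integers(K, size=len(X))                # random initial memberships
    for _ in range(n_iter):
        # Update each center to the mean of the data currently assigned to it
        centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Re-assign each datum to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):           # assignments stopped changing
            break
        labels = new_labels
    return centers, labels
```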
24
K-means demo

Data generated from a mixture of 2 Gaussians with mixing proportion 0.5
25
Why does K-means work?




In the beginning, the centers are poorly chosen, so the clusters overlap a lot
But if the centers move away from each other, then the clusters tend to separate better
Vice versa, if the clusters are well separated, then the centers will stay away from each other
Intuitively, these two steps “help each other”
26
Expectation-Maximization
algorithm

Replacing the “hard” assignments in K-means with “soft” assignments
Hard: (0, 1) or (1, 0)
Soft: (p( /t/ | x), p( /d/ | x)), e.g. (0.5, 0.5)
(Figure: two candidate Gaussians /t/? and /d/?, an unknown mixing proportion ω = ?, and three unlabeled examples [?])
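For a 2-component mixture, the soft label of an example is its posterior probability of having come from each component. A sketch of the formula, using ω for the mixing proportion as above (a standard result, spelled out here for reference):

```latex
w_i = p(\text{/t/} \mid x_i)
    = \frac{\omega\, p(x_i; \mu_1, \sigma_1)}
           {\omega\, p(x_i; \mu_1, \sigma_1) + (1-\omega)\, p(x_i; \mu_2, \sigma_2)},
\qquad
p(\text{/d/} \mid x_i) = 1 - w_i .
```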
27
Expectation-Maximization
algorithm

Initial guesses: Gaussians /t/0 and /d/0, mixing proportion ω0 = 0.5
(Figure: the two initial Gaussians and three unlabeled examples [?])
28
Expectation-Maximization
algorithm

Expectation step: sticking in “soft” labels -- a pair (wi, 1 - wi)
(Figure: with /t/0, /d/0, and ω0 = 0.5, one example gets the soft label [0.5 t, 0.5 d]; the other two are still unlabeled)
29
Expectation-Maximization
algorithm

Expectation step: label each example
(Figure: a second example now has the soft label [0.3 t, 0.7 d]; soft labels so far: [?], [0.5 t, 0.5 d], [0.3 t, 0.7 d])
30
Expectation-Maximization
algorithm

Expectation step: label each example
(Figure: all three examples now have soft labels: [0.1 t, 0.9 d], [0.5 t, 0.5 d], [0.3 t, 0.7 d])
31
Expectation-Maximization
algorithm

Maximization step: going back to update the model with Maximum Likelihood, weighted by the soft labels
(Figure: the /t/ Gaussian is updated to /t/1, /d/ is still /d/0; soft labels: [0.1 t, 0.9 d], [0.5 t, 0.5 d], [0.3 t, 0.7 d])
32
Expectation-Maximization
algorithm

Maximization step: going back to update the model with Maximum Likelihood, weighted by the soft labels
(Figure: the /d/ Gaussian is now also updated, to /d/1; soft labels: [0.1 t, 0.9 d], [0.5 t, 0.5 d], [0.3 t, 0.7 d])
33
Expectation-Maximization
algorithm

Maximization step: going back to update the model with Maximum Likelihood
The mixing proportion is updated too: ω1 = (0.5 + 0.3 + 0.1)/3 = 0.3
(Figure: updated Gaussians /t/1 and /d/1; soft labels: [0.1 t, 0.9 d], [0.5 t, 0.5 d], [0.3 t, 0.7 d])
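A sketch of the weighted Maximum-Likelihood updates behind this step, for the /t/ component with soft weights wi (the /d/ component uses 1 − wi analogously); these are the standard EM updates for a Gaussian mixture:

```latex
\hat{\omega} = \frac{1}{N}\sum_{i=1}^{N} w_i, \qquad
\hat{\mu}_1 = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad
\hat{\sigma}_1^2 = \frac{\sum_i w_i (x_i - \hat{\mu}_1)^2}{\sum_i w_i}.
```

With the soft labels above, the first formula gives exactly ω1 = (0.5 + 0.3 + 0.1)/3.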
34
Common intuition behind K-means and EM


The labels are important, yet not observable -- “hidden variables” / “missing data”
Strategy: make probability-based guesses, and iteratively guess and update until convergence (a full sketch follows below)
K-means: hard guess among 1, …, K
EM: soft guess (w1, …, wK), with w1 + … + wK = 1
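Putting the E- and M-steps together, a minimal EM sketch for a 2-component 1-D Gaussian mixture (plain NumPy/SciPy; the initialization and fixed iteration count are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import norm

def em_mixture(x, n_iter=100, seed=0):
    """EM for a 2-component 1-D Gaussian mixture; returns (omega, mu, sigma)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Crude initial guesses
    omega = 0.5
    mu = rng.choice(x, size=2, replace=False)
    sigma = np.array([x.std(), x.std()])
    for _ in range(n_iter):
        # E-step: soft labels w_i = p(component 1 | x_i)
        p1 = omega * norm.pdf(x, mu[0], sigma[0])
        p2 = (1 - omega) * norm.pdf(x, mu[1], sigma[1])
        w = p1 / (p1 + p2)
        # M-step: weighted Maximum-Likelihood updates
        omega = w.mean()
        mu = np.array([np.average(x, weights=w), np.average(x, weights=1 - w)])
        sigma = np.sqrt(np.array([np.average((x - mu[0]) ** 2, weights=w),
                                  np.average((x - mu[1]) ** 2, weights=1 - w)]))
    return omega, mu, sigma
```

Run on data from the sampler sketched earlier, this should recover parameters close to the ones used to generate the data (up to swapping the two components).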
35
Thinking of this as an
exemplar-based model

Johnson (1997)'s exemplar model of categories:
When a new stimulus comes in, its membership is jointly determined by all pre-memorized exemplars. -- This is the E-step
After a new stimulus is memorized, the “weight” of each exemplar is updated. -- This is the M-step
36
Convergence guarantee of EM

E-step: finding a lower bound of L(θ)
(Figure: the likelihood curve L(θ), with the E-step choosing a lower bound of it)

37
Convergence guarantee of EM

M-step: finding the maximum of this lower bound
(Figure: the M-step moves to the maximum of the lower bound, which is always <= L(θ))
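The lower bound comes from Jensen's inequality; a sketch, writing q(z) for the soft labels (the E-step chooses q(z) = p(z | x; θ), which makes the bound touch L(θ) at the current θ):

```latex
\log L(\theta)
  = \sum_i \log \sum_{z_i} q(z_i)\, \frac{p(x_i, z_i; \theta)}{q(z_i)}
  \;\ge\; \sum_i \sum_{z_i} q(z_i)\, \log \frac{p(x_i, z_i; \theta)}{q(z_i)} .
```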

38
Convergence guarantee of EM

E-step again
(Figure: a new lower bound of L(θ) is chosen at the updated θ)

39
Local maxima
What if you start here?
(Figure: a likelihood curve L(θ) with a local maximum)
40
Overcoming local maxima:
Multiple starting points
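In practice this can be as simple as running EM from several random initializations and keeping the fit with the highest likelihood. A minimal sketch with scikit-learn's GaussianMixture, whose n_init parameter does exactly this (the data X here are a placeholder):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 1))   # placeholder data

# n_init=10: run EM from 10 different starting points and keep the best fit
gmm = GaussianMixture(n_components=2, n_init=10, random_state=0).fit(X)
print(gmm.weights_, gmm.means_.ravel())
```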
41
Overcoming local maxima:
Model refinement




Guessing 6 categories at once is hard, but 2 is easy;
Hill-climbing strategy: start with 2, then 3, 4, ... (sketched below);
Implementation: split the cluster with the maximum gain in likelihood;
Intuition: discriminate within the biggest pile.
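One way to sketch this split-and-refit strategy, using scikit-learn's GaussianMixture (the perturbation size and k_max are arbitrary choices; this illustrates the idea rather than reproducing any particular implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def split_refine(X, k_max=6, seed=0):
    """Grow a mixture from 2 components by repeatedly splitting one component."""
    rng = np.random.default_rng(seed)
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(X)
    while gmm.n_components < k_max:
        best = None
        for j in range(gmm.n_components):
            # Candidate: add a copy of component j with a slightly perturbed mean
            means = np.vstack([gmm.means_,
                               gmm.means_[j] + 0.1 * rng.standard_normal(X.shape[1])])
            cand = GaussianMixture(n_components=gmm.n_components + 1,
                                   means_init=means,
                                   random_state=seed).fit(X)
            # Keep the split that yields the largest gain in likelihood
            if best is None or cand.score(X) > best.score(X):
                best = cand
        gmm = best
    return gmm
```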
42