Transcript: Lecture 3
LING 696B: Mixture model and its applications in category learning
1
Recap from last time
G&G model: a self-organizing map (neural net) that does unsupervised learning
Non-parametric approach: encoding the stimulus distribution with a large number of connection weights
2
Question from last time
Scaling up to the speech that infants hear:
higher dimensions?
Extending the G&G network in Problem 3 to Maye & Gerken’s data?
Speech segmentation: going beyond static
vowels?
Model behavior: parameter tuning, starting
points, degree of model fitting, when to stop
…
3
Today’s agenda
Learning categories from distributions
(Maye & Gerken)
Basic ideas of statistical estimation,
consistency, maximum likelihood
Mixture model, learning with the
Expectation-Maximization algorithm
Refinement of mixture model
Application in the speech domain
4
Learning categories after
minimal pairs
Idea going back as early as Jakobson (1941):
knowing /bin/ ~ /pin/ implies [voice] as a distinctive feature
[voice] differentiates /b/ and /p/ as two categories of English
Moreover, this predicts the order in which categories are learned
Completely falsified? (small project)
Obvious objection: early words don’t include
many minimal pairs
5
Maye & Gerken, 00
Categories can be learned from statistics, just as statistics are learned from sequences
Choice of artificial contrast: English [d] and [(s)t] (the unaspirated [t] that occurs after [s])
Small difference in
voicing and F0
Main difference:
F1, F2 onset
6
Detecting the d~(s)t contrast in Pegg and Werker, 97
Most adults can do this, but not as well as with a native contrast
6-8-month-olds do much better than 10-12-month-olds
(Need more than distributional learning?)
7
Maye & Gerken, 00
Training on monomodal vs. bimodal distributions
Both groups heard the
same number of stimuli
8
Maye & Gerken, 00
Results from Maye’s thesis:
9
Maye, Gerken & Werker, 02
Similar experiment done more carefully
on infants
Preferential looking time
Alternating and non-alternating trials
10
Maye, Gerken & Werker, 02
Bimodal-trained infants look longer at
alternating trials than non-alternating
(Difference not significant for the monomodal group; difference significant for the bimodal group)
11
Reflections
The dimension along which the bimodal distribution differs from the monomodal one is abstract
The shape of the distribution is also hard to characterize
Adults/infants are not told which categories are there to learn
Neither do they know how many categories to learn
Machine learning does not have satisfying
answers to all these questions
12
Statistical estimation
Basic setup:
The world: distributions p(x; θ), where θ is a set of free parameters
“all models may be wrong, but some are useful”
Given a parameter θ, p(x; θ) tells us how to calculate the probability of x (also referred to as the “likelihood” p(x | θ))
Observations: X = {x1, x2, …, xN} generated from some p(x; θ); N is the number of observations
Model-fitting: based on some examples X, make guesses (learning, inference) about θ
13
Statistical estimation
Example:
Assuming people’s height follows a normal distribution with parameters θ = (mean, variance)
p(x; θ) = the probability density function of the normal distribution
Observation: measurements of people’s height
Goal: estimate the parameters θ of the normal distribution
14
Statistical estimation:
Hypothesis space matters
Example: curve fitting with polynomials (see the sketch below)
15
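The following is a minimal sketch of the point, written by me rather than taken from the lecture: data are generated from a sine curve plus noise, and polynomials of increasing degree are fit by least squares. All names and numbers are illustrative.

```python
# Minimal sketch (not from the lecture): why the hypothesis space matters.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy data

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                 # least-squares polynomial fit
    err = np.sum((np.polyval(coeffs, x) - y) ** 2)    # training error
    print(f"degree {degree}: training error {err:.3f}")
# Training error always shrinks as the degree grows, but the degree-9 curve
# mostly fits the noise: a richer hypothesis space is not automatically better.
```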
Criterion of consistency
Many model fitting criteria
Least squares
Minimal classification errors
Measures of divergence, etc.
Consistency: as you get more and more data x1, x2, …, xN (N → ∞), your model-fitting procedure should produce an estimate that gets closer and closer to the true θ that generated X.
16
Maximum likelihood estimate
(MLE)
Likelihood function: the examples xi are independent of one another, so
L(θ) = p(x1, …, xN; θ) = p(x1; θ) · p(x2; θ) · … · p(xN; θ)
Among all the possible values of θ, choose the estimate θ̂ so that L(θ) is the biggest:
θ̂ = argmaxθ L(θ)
Consistent! (A small numerical sketch follows.)
17
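As a concrete illustration (mine, not the lecture’s), here is MLE by brute force for the simplest possible model, a biased coin: the likelihood of the observed flips is computed over a grid of candidate parameter values and the argmax is taken. All numbers are arbitrary.

```python
# Minimal MLE sketch (not from the lecture): pick the theta that maximizes
# the (log-)likelihood of observed coin flips.
import numpy as np

rng = np.random.default_rng(1)
true_theta = 0.7
flips = rng.random(200) < true_theta                 # observations x1..xN (True = heads)

candidates = np.linspace(0.01, 0.99, 99)             # possible values of theta
log_L = [np.sum(np.log(np.where(flips, th, 1 - th))) for th in candidates]
theta_hat = candidates[np.argmax(log_L)]             # argmax_theta L(theta)
print(theta_hat)  # close to 0.7; consistency: it gets closer as N grows
```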
MLE for Gaussian distributions
Parameters: mean μ and variance σ²
Distribution function: p(x; μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
MLE for mean and variance: μ̂ = (1/N) Σi xi,  σ̂² = (1/N) Σi (xi − μ̂)²
Exercise: derive this result in 2 dimensions (a quick numerical check of the 1-D case follows)
18
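A quick numerical check of the closed-form MLE above (my own sketch; the parameter values are arbitrary):

```python
# Minimal check (not from the lecture): sample from a known Gaussian and
# compare the MLE formulas with the true parameters.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=170.0, scale=8.0, size=5000)      # e.g. heights in cm

mu_hat = x.mean()                                     # (1/N) sum x_i
var_hat = ((x - mu_hat) ** 2).mean()                  # (1/N) sum (x_i - mu_hat)^2
print(mu_hat, var_hat)                                # close to 170 and 64
```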
Mixture of Gaussians
An extension of Gaussian distributions
to handle data containing categories
Example: mixture of 2 Gaussian
distributions
More concrete example: the heights of males and females follow two different distributions, but we don’t know which gender each measurement comes from
19
Mixture of Gaussians
More parameters
Parameters of the two Gaussians: (μ1, σ1) and (μ2, σ2) -- two categories
The “mixing” proportion: 0 ≤ α ≤ 1
How are data generated?
Toss a coin that comes up heads with probability α
If it comes up heads, generate an example from the first Gaussian; otherwise generate from the second (see the sampling sketch below)
20
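The generative story above can be written down directly. The following is a minimal sketch (mine, not the lecture’s); α is simply my choice of symbol for the mixing proportion that was lost from the slide, and the parameter values are arbitrary.

```python
# Minimal sketch (not from the lecture): sampling from a mixture of 2 Gaussians.
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.6                                   # mixing proportion, P(category 1)
mu1, sigma1 = 0.0, 1.0                        # first Gaussian
mu2, sigma2 = 4.0, 1.5                        # second Gaussian

N = 1000
heads = rng.random(N) < alpha                 # coin toss with heads probability alpha
x = np.where(heads,
             rng.normal(mu1, sigma1, N),      # heads: draw from the first Gaussian
             rng.normal(mu2, sigma2, N))      # tails: draw from the second
# The learner sees only x; the labels (heads) are hidden.
```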
Maximum likelihood:
Supervised learning
Seeing the data x1, x2, …, xN (heights) as well as their category membership y1, y2, …, yN (male or female)
MLE:
For each Gaussian, estimate (μ, σ) from the members of that category, e.g. μ̂1 = mean of the xi with yi = 1
α̂ = (number of i with yi = 1) / N
(a sketch follows)
21
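Continuing the sketch from slide 20 (still my own illustration): when the labels are observed, the MLE is just the per-category version of slide 18 plus the proportion estimate.

```python
# Minimal supervised-MLE sketch (not from the lecture); reuses x and heads
# from the sampling sketch above.
y = heads.astype(int)                         # observed category labels

alpha_hat = y.mean()                          # (number of y_i = 1) / N
mu1_hat, var1_hat = x[y == 1].mean(), x[y == 1].var()   # MLE within category 1
mu2_hat, var2_hat = x[y == 0].mean(), x[y == 0].var()   # MLE within category 2
print(alpha_hat, mu1_hat, mu2_hat)
```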
Maximum likelihood:
Unsupervised learning
Only seeing the data x1, x2, …, xN, with no idea about the category membership yi or the mixing proportion α
Must estimate θ = (μ1, σ1, μ2, σ2, α) based on X only
Key idea: relate this problem to supervised learning
22
The K-means algorithm
Clustering algorithm for designing
“codebooks” (vector quantization)
Goal: dividing data into K clusters and
representing each cluster by its center
First: random guesses about cluster
membership (among 1,…,K)
23
The K-means algorithm
Then iterate:
Update the center of each cluster to the mean of the data belonging to that cluster
Re-assign each datum to the cluster whose center is closest
After some iterations, the assignments no longer change (a sketch follows)
24
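A minimal K-means sketch for 1-D data (my own code, not the lecture’s demo); it follows the two steps above and stops when the assignments no longer change. It assumes every cluster stays non-empty.

```python
# Minimal K-means sketch (not from the lecture), 1-D data.
import numpy as np

def kmeans(x, K=2, iters=100, seed=0):
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(x))                # random initial memberships
    for _ in range(iters):
        # update each center to the mean of its cluster (assumes non-empty clusters)
        centers = np.array([x[labels == k].mean() for k in range(K)])
        # re-assign each datum to the nearest center
        new_labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        if np.array_equal(new_labels, labels):           # assignments stopped changing
            break
        labels = new_labels
    return centers, labels

# e.g. kmeans(x) on the mixture data from slide 20 finds centers near 0 and 4
```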
K-means demo
Data generated from a mixture of 2 Gaussians with mixing proportion 0.5
25
Why does K-means work?
In the beginning, the centers are poorly chosen, so the clusters overlap a lot
But if the centers move away from each other, then the clusters tend to separate better
Vice versa: if the clusters are well separated, then the centers will stay away from each other
Intuitively, these two steps “help each other”
26
Expectation-Maximization
algorithm
Replacing the “hard” assignments in K-means
with “soft” assignments
Hard: (0, 1) or (1, 0)
Soft: (p( /t/ | x), p( /d/ | x)), e.g. (0.5, 0.5)
(Figure: unlabeled tokens [?], with unknown mixing proportion α = ? and unknown category parameters for /t/ and /d/)
27
Expectation-Maximization
algorithm
Initial guesses
(Figure: initial parameter guesses θ/t/⁰ and θ/d/⁰ for the two categories, with α⁰ = 0.5; the tokens are still unlabeled)
28
Expectation-Maximization
algorithm
Expectation step: sticking in “soft” labels -- a pair (wi, 1 − wi) for each example (see the formula after this slide)
(Figure: under θ/t/⁰, θ/d/⁰ and α⁰ = 0.5, one token receives the soft label [0.5 t, 0.5 d]; the others are not yet labeled)
29
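For reference, the soft label used here is the posterior probability of each category under the current parameter guesses; this is the standard EM E-step, written in the notation reconstructed above (α for the mixing proportion, θ/t/⁰ and θ/d/⁰ for the current category parameters):

```latex
w_i = p(\text{/t/} \mid x_i)
    = \frac{\alpha^{0}\, p(x_i;\, \theta^{0}_{/t/})}
           {\alpha^{0}\, p(x_i;\, \theta^{0}_{/t/}) + (1 - \alpha^{0})\, p(x_i;\, \theta^{0}_{/d/})},
\qquad
1 - w_i = p(\text{/d/} \mid x_i)
```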
Expectation-Maximization
algorithm
Expectation step: label each example
(Figure: a second token now carries the soft label [0.3 t, 0.7 d])
30
Expectation-Maximization
algorithm
Expectation step: label each example
(Figure: all three tokens now carry soft labels: [0.1 t, 0.9 d], [0.5 t, 0.5 d], [0.3 t, 0.7 d])
31
Expectation-Maximization
algorithm
Maximization step: going back to update the model with Maximum Likelihood, weighted by the soft labels
(Figure: the /t/ parameters are updated to θ/t/¹; the /d/ parameters are still θ/d/⁰; the soft labels [0.1 t, 0.9 d], [0.5 t, 0.5 d], [0.3 t, 0.7 d] are unchanged)
32
Expectation-Maximization
algorithm
Maximization step: going back to update the model with Maximum Likelihood, weighted by the soft labels
(Figure: both sets of parameters have now been updated, to θ/t/¹ and θ/d/¹)
33
Expectation-Maximization
algorithm
Maximization step: going back to update the model with Maximum Likelihood
(Figure: updated parameters θ/t/¹, θ/d/¹ and mixing proportion α¹ = (0.5 + 0.3 + 0.1)/3 = 0.3, the average of the soft /t/ weights; the general update formulas are given below)
34
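In general, the M-step is the weighted version of the supervised MLE from slide 21; these are the standard EM updates, written in the same notation (shown for the /t/ category, with weights wi; the /d/ updates use 1 − wi):

```latex
\alpha^{1} = \frac{1}{N}\sum_{i=1}^{N} w_i,
\qquad
\mu^{1}_{/t/} = \frac{\sum_i w_i\, x_i}{\sum_i w_i},
\qquad
\big(\sigma^{1}_{/t/}\big)^{2} = \frac{\sum_i w_i\,\big(x_i - \mu^{1}_{/t/}\big)^{2}}{\sum_i w_i}
```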
Common intuition behind K-means and EM
The labels are important, yet not
observable – “hidden variables” /
“missing data”
Strategy: make probability-based guesses, and iterate guess -- update until convergence (a complete EM sketch follows)
K-means: hard guess among 1,…,K
EM: soft guess (w1,…,wK), with w1 + … + wK = 1
35
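Putting the two steps together, here is a minimal EM sketch for a two-component, 1-D Gaussian mixture (my own code, not the lecture’s; variable names follow the reconstructed notation, and the crude initialization is just for illustration):

```python
# Minimal EM sketch (not from the lecture): 2-component 1-D Gaussian mixture.
import numpy as np
from scipy.stats import norm

def em_gmm2(x, iters=100):
    x = np.asarray(x, dtype=float)
    # crude initial guesses (cf. slide 28)
    alpha, mu1, mu2 = 0.5, x.min(), x.max()
    s1 = s2 = x.std()
    for _ in range(iters):
        # E-step: soft labels w_i = p(category 1 | x_i)
        p1 = alpha * norm.pdf(x, mu1, s1)
        p2 = (1 - alpha) * norm.pdf(x, mu2, s2)
        w = p1 / (p1 + p2)
        # M-step: weighted maximum likelihood
        alpha = w.mean()
        mu1 = np.sum(w * x) / np.sum(w)
        mu2 = np.sum((1 - w) * x) / np.sum(1 - w)
        s1 = np.sqrt(np.sum(w * (x - mu1) ** 2) / np.sum(w))
        s2 = np.sqrt(np.sum((1 - w) * (x - mu2) ** 2) / np.sum(1 - w))
    return alpha, (mu1, s1), (mu2, s2)

# On data from the sampling sketch of slide 20, em_gmm2(x) recovers parameters
# close to the true ones (up to swapping the two components).
```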
Thinking of this as an
exemplar-based model
Johnson’s (1997) exemplar model of categories:
When a new stimulus comes in, its membership is jointly determined by all pre-memorized exemplars.
-- This is the E-step
After a new stimulus is memorized, the “weight” of each exemplar is updated.
-- This is the M-step
36
Convergence guarantee of EM
E-step: finding a lower bound of L(θ)
(Figure: the likelihood curve L(θ) and the lower bound chosen at the E-step)
37
Convergence guarantee of EM
M-step: finding the maximum of this lower bound
(Figure: the lower bound, which is always ≤ L(θ), and the maximum found at the M-step; the bound is spelled out below)
38
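The lower bound in these figures is not spelled out on the slides; it is the standard EM bound obtained from Jensen’s inequality. For the two-category mixture, with soft labels (wi, 1 − wi):

```latex
\log L(\theta)
  = \sum_{i=1}^{N} \log\!\Big[\alpha\, p(x_i; \theta_{/t/}) + (1-\alpha)\, p(x_i; \theta_{/d/})\Big]
  \;\ge\; \sum_{i=1}^{N} \Big[\, w_i \log \frac{\alpha\, p(x_i; \theta_{/t/})}{w_i}
        + (1 - w_i) \log \frac{(1-\alpha)\, p(x_i; \theta_{/d/})}{1 - w_i} \Big]
```

The bound holds with equality exactly when wi is the posterior p(/t/ | xi), which is what the E-step chooses; the M-step then maximizes the right-hand side over the parameters.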
Convergence guarantee of EM
E-step again
(Figure: at the updated θ, the E-step chooses a new lower bound that touches L(θ))
39
Local maxima
What if you start here?
40
Overcoming local maxima:
Multiple starting points
41
Overcoming local maxima:
Model refinement
Guessing 6 categories at once is hard, but guessing 2 is easy
Hill-climbing strategy: start with 2, then 3, 4, ...
Implementation: split the cluster that yields the maximum gain in likelihood (see the sketch below)
Intuition: discriminate within the biggest pile
42
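One way to implement this refinement is sketched below using scikit-learn’s GaussianMixture; this is my own illustration rather than the lecture’s code, and the ±0.5·std perturbation is just one simple way to propose a split of a component’s mean.

```python
# Minimal model-refinement sketch (not from the lecture): grow the number of
# mixture components by splitting whichever component gives the largest
# likelihood gain.
import numpy as np
from sklearn.mixture import GaussianMixture

def refine(x, max_k=6):
    X = np.asarray(x, dtype=float).reshape(-1, 1)
    best = GaussianMixture(n_components=1).fit(X)
    for k in range(2, max_k + 1):
        candidates = []
        for j in range(k - 1):                             # propose splitting component j
            means = best.means_.copy()
            delta = 0.5 * X.std()
            means = np.vstack([means, means[j] + delta])   # new component
            means[j] = means[j] - delta                    # shifted old component
            gm = GaussianMixture(n_components=k, means_init=means).fit(X)
            candidates.append(gm)
        best = max(candidates, key=lambda g: g.score(X))   # biggest likelihood gain
    return best

# In practice one also needs a stopping criterion (e.g. a penalized likelihood
# such as BIC) to decide how many categories to keep.
```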