
Hierarchical Bayesian Nonparametrics
with Applications
Michael I. Jordan
University of California, Berkeley
June 20, 2009
Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux
Hierarchy
• Standard nonparametric mixture:
  $$G \sim \mathrm{DP}(\alpha_0, G_0), \qquad \theta_i \mid G \sim G, \qquad x_i \mid \theta_i \sim F(\theta_i)$$
• Hierarchical nonparametric mixtures:
  $$G_0 \sim \mathrm{DP}(\gamma, H), \qquad G_j \mid G_0 \sim \mathrm{DP}(\alpha_0, G_0), \qquad \theta_{ji} \mid G_j \sim G_j, \qquad x_{ji} \mid \theta_{ji} \sim F(\theta_{ji})$$
• Cf. nonhierarchical alternatives for sharing atoms
De Finetti
• Exchangeability: invariance of the joint probability distribution of an infinite sequence of random variables under permutation
• Theorem (De Finetti, 1935). If $x_1, x_2, \ldots$ are infinitely exchangeable, then the joint probability distribution $p(x_1, x_2, \ldots, x_N)$ has a representation as a conditional independence mixture:
  $$p(x_1, x_2, \ldots, x_N) = \int \prod_{i=1}^{N} p(x_i \mid G) \, P(dG)$$
Application: Speaker Diarization
[Figure: audio waveform over time and its segmentation into speakers (Bob, Jane, John)]
Application: Protein Modeling
• A protein is a folded chain of amino acids
• The backbone of the chain has two degrees of
freedom per amino acid (phi and psi angles)
• Empirical plots of phi and psi angles are called
Ramachandran diagrams
Application: Protein Modeling
• We want to model the density in the
Ramachandran diagram to provide an energy
term for protein folding algorithms
• We actually have a linked set of Ramachandran
diagrams, one for each amino acid neighborhood
• We thus have a linked set of density estimation
problems
Haplotype Modeling
• Consider $M$ binary markers in a genomic region
• There are $2^M$ possible haplotypes---i.e., states of a single chromosome
– but in fact, far fewer are seen in human populations
• A genotype is a set of unordered pairs of markers
from one individual
• Given a set of genotypes (multiple individuals),
estimate the underlying haplotypes
Haplotype Modeling
• This is a mixture modeling problem:
– where the key problem is inference for the number of clusters (the cardinality of the set of haplotypes)
• Consider now the case of multiple groups of
genotype data (e.g., ethnic groups)
• Geneticists would like to find clusters within each
group but they would also like to share clusters
between the groups
Stick-Breaking
• A general way to obtain distributions on countably infinite spaces
• The classical example: define an infinite sequence of beta random variables:
  $$\beta_k \sim \mathrm{Beta}(1, \alpha_0), \qquad k = 1, 2, \ldots$$
• And then define an infinite random sequence as follows:
  $$\pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l), \qquad k = 1, 2, \ldots$$
• This can be viewed as breaking off portions of a unit-length stick
Constructing Random Measures
• It's not hard to see that $\sum_{k=1}^{\infty} \pi_k = 1$ (wp1)
• Now define the following object:
  $$G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\phi_k}$$
• where the atoms $\phi_k$ are independent draws from a distribution $G_0$ on some space
• Because $\sum_k \pi_k = 1$, $G$ is a probability measure---it is a random measure
• The distribution of $G$ is known as a Dirichlet process: $G \sim \mathrm{DP}(\alpha_0, G_0)$
• What exchangeable marginal distribution does this yield when integrated against in the De Finetti setup?
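The construction above translates almost directly into a sampler. Here is a minimal sketch (not from the slides) that draws an approximate $G \sim \mathrm{DP}(\alpha_0, G_0)$ by truncating the stick-breaking sequence at a finite level K; the function name and the standard normal base measure are illustrative assumptions.

```python
import numpy as np

def sample_dp_truncated(alpha0, base_sampler, K=500, rng=None):
    """Truncated stick-breaking draw of G ~ DP(alpha0, G0):
    beta_k ~ Beta(1, alpha0), pi_k = beta_k * prod_{l<k} (1 - beta_l),
    atoms phi_k drawn iid from the base measure G0 (base_sampler).
    K is a truncation level, not part of the exact construction."""
    rng = rng or np.random.default_rng()
    betas = rng.beta(1.0, alpha0, size=K)                       # stick-breaking proportions
    pis = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    phis = np.array([base_sampler(rng) for _ in range(K)])      # atom locations, iid from G0
    return pis, phis                                            # G ~= sum_k pis[k] * delta_{phis[k]}

# Example with G0 = N(0, 1): the weights decay rapidly, so a single draw of G
# puts most of its mass on a handful of atoms.
pis, phis = sample_dp_truncated(alpha0=2.0, base_sampler=lambda rng: rng.normal())
```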
Chinese Restaurant Process (CRP)
• A random process in which customers sit down in a Chinese restaurant with an infinite number of tables
– first customer sits at the first table
– the $n$th subsequent customer sits at a table drawn from the following distribution:
  $$p(\text{occupied table } k \mid \mathcal{F}_{n-1}) \propto n_k, \qquad p(\text{next unoccupied table} \mid \mathcal{F}_{n-1}) \propto \alpha_0$$
– where $n_k$ is the number of customers currently at table $k$ and where $\mathcal{F}_{n-1}$ denotes the state of the restaurant after $n-1$ customers have been seated
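As a quick illustration (not part of the slides), a minimal Python sketch of sampling table assignments from the CRP; the function name is an assumption.

```python
import numpy as np

def sample_crp(N, alpha0, rng=None):
    """Sample a CRP partition of N customers: customer n joins occupied table k
    with probability proportional to n_k, or opens a new table with probability
    proportional to alpha0."""
    rng = rng or np.random.default_rng()
    counts = []                                   # counts[k] = customers at table k
    assignments = []
    for n in range(N):
        probs = np.array(counts + [alpha0], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                      # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

print(sample_crp(N=20, alpha0=1.0))               # e.g. [0, 0, 1, 0, 2, ...]
```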
The CRP and Mixture Modeling
• Data points are customers; tables are mixture
components
– the CRP defines a prior distribution on the partitioning of the
data and on the number of tables
• This prior can be completed with:
– a likelihood---e.g., associate a parameterized probability distribution with each table
– a prior for the parameters---the first customer to sit at table $k$ chooses the parameter vector $\phi_k$ for that table from a prior $G_0$
• So we now have defined a full Bayesian posterior for a mixture model of unbounded cardinality
CRP Prior, Gaussian Likelihood,
Conjugate Prior
Exchangeability
• As a prior on the partition of the data, the CRP is exchangeable
• The prior on the parameter vectors associated with the tables is also exchangeable
• The latter probability model is generally called the Polya urn model. Letting $\theta_i$ denote the parameter vector associated with the $i$th data point, we have:
  $$\theta_i \mid \theta_1, \ldots, \theta_{i-1} \sim \frac{\alpha_0}{i - 1 + \alpha_0}\, G_0 + \sum_{j=1}^{i-1} \frac{1}{i - 1 + \alpha_0}\, \delta_{\theta_j}$$
• From these conditionals, a short calculation shows that the joint distribution for $(\theta_1, \ldots, \theta_N)$ is invariant to order (this is the exchangeability proof)
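A minimal sketch of the Polya urn conditionals above (an illustration, not slide material): each new parameter either copies a previous one chosen uniformly at random or is a fresh draw from $G_0$. The function name and the normal base measure are assumptions.

```python
import numpy as np

def sample_polya_urn(N, alpha0, base_sampler, rng=None):
    """Draw theta_1, ..., theta_N from the Polya urn: the i-th value is a fresh
    draw from G0 with probability alpha0 / (i - 1 + alpha0) (i counted from 1),
    otherwise it copies one of the previous theta_j uniformly at random."""
    rng = rng or np.random.default_rng()
    thetas = []
    for i in range(N):                                 # i previous values exist here
        if rng.random() < alpha0 / (i + alpha0):
            thetas.append(base_sampler(rng))           # new draw from G0
        else:
            thetas.append(thetas[rng.integers(i)])     # copy a previous value
    return thetas

thetas = sample_polya_urn(N=20, alpha0=1.0, base_sampler=lambda rng: rng.normal())
```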
Dirichlet Process Mixture Models
Marginal Probabilities
• To obtain the marginal probability of the parameters $\theta_1, \theta_2, \ldots$ we need to integrate out $G$
• This marginal distribution turns out to be the Chinese restaurant process (more precisely, it's the Polya urn model)
Multiple estimation problems
• We often face multiple, related estimation problems
• E.g., multiple Gaussian means:
  $$x_{ij} \sim N(\theta_i, \sigma^2), \qquad j = 1, \ldots, n_i, \quad i = 1, \ldots, m$$
• Maximum likelihood:
  $$\hat{\theta}_i = \bar{x}_{i\cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}$$
• Maximum likelihood often doesn't work very well
– want to “share statistical strength”
Hierarchical Bayesian Approach
• The Bayesian or empirical Bayesian solution is to view the parameters $\theta_i$ as random variables, related via an underlying variable (a shared hyperparameter)
• Given this overall model, posterior inference yields shrinkage---the posterior mean for each $\theta_i$ combines data from all of the groups
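To make the shrinkage concrete, here is a worked formula for the conjugate Gaussian case with known variances (my assumption; the slide does not pin down the exact model): writing $\mu$ for the shared prior mean and $\tau^2$ for the prior variance of the $\theta_i$,

```latex
% Shrinkage sketch under assumed model:
%   theta_i ~ N(mu, tau^2),   xbar_i | theta_i ~ N(theta_i, sigma^2 / n_i)
\mathbb{E}[\theta_i \mid \text{data}]
  \;=\; \lambda_i\, \bar{x}_{i\cdot} \;+\; (1 - \lambda_i)\, \mu,
  \qquad
  \lambda_i \;=\; \frac{\tau^2}{\tau^2 + \sigma^2 / n_i}.
```

Groups with little data (small $n_i$) are pulled strongly toward the shared mean; groups with plenty of data are left nearly at their own averages.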
Hierarchical Modeling
• Recall the plate notation:
• Equivalent to:
Hierarchical Dirichlet Process Mixtures
(Teh, Jordan, Beal, & Blei, JASA 2006)
Marginal Probabilities
• First integrate out the group-level measures $G_j$, then integrate out the shared base measure $G_0$
Chinese Restaurant Franchise (CRF)
Protein Folding (cont.)
• We have a linked set of Ramachandran
diagrams, one for each amino acid neighborhood
Protein Folding (cont.)
Natural Language Parsing
• Key idea: lexicalization of probabilistic context-free
grammars
– the grammatical rules (e.g., S → NP VP) are conditioned on the specific lexical items (words) that they derive
– this leads to huge numbers of potential rules, and (ad hoc) clustering methods are used to control the number of rules
– note that this is a set of clustering problems---each
grammatical context defines a clustering problem
HDP-PCFG
(Liang, Jordan & Klein, 2009)
• Use the HDP to link these multiple clustering
problems and to let the number of grammatical
rules be chosen adaptively
Nonparametric Hidden Markov models
• Essentially a dynamic mixture model in which the
mixing proportion is a transition probability
• Use Bayesian nonparametric tools to allow the
cardinality of the state space to be random
– urn model point of view (Beal et al., “infinite HMM”)
– Dirichlet process point of view (Teh et al., “HDP-HMM”)
Hidden Markov Models
[Figure: graphical model of hidden states (modes) and observations over time]
HDP-HMM
• Dirichlet process:
– state space of unbounded cardinality
• Hierarchical Bayes:
– ties state transition distributions
[Figure: state sequence over time]
HDP-HMM
• Average transition distribution:
  $$\beta \sim \mathrm{GEM}(\gamma)$$
• State-specific transition distributions:
  $$\pi_j \sim \mathrm{DP}(\alpha, \beta)$$
– sparsity of $\beta$ is shared across the $\pi_j$
State Splitting
• HDP-HMM inadequately models temporal persistence of states
• DP bias insufficient to prevent unrealistically rapid dynamics
• Reduces predictive performance
“Sticky” HDP-HMM
• Original state-specific base measure: $\pi_j \sim \mathrm{DP}(\alpha, \beta)$
• Sticky variant adds mass to the self-transition:
  $$\pi_j \sim \mathrm{DP}\!\left(\alpha + \kappa,\; \frac{\alpha \beta + \kappa\, \delta_j}{\alpha + \kappa}\right)$$
– increased probability of self-transition
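A minimal sketch of the (sticky) HDP-HMM transition prior in the finite weak-limit approximation used later in the talk; the truncation level $L$, the function name, and the hyperparameter values below are illustrative assumptions, and $\kappa = 0$ recovers the plain HDP-HMM.

```python
import numpy as np

def sample_hdp_hmm_transitions(L, gamma, alpha, kappa=0.0, rng=None):
    """Weak-limit (L-state) approximation of the (sticky) HDP-HMM prior:
    beta ~ Dirichlet(gamma/L, ..., gamma/L) approximates GEM(gamma);
    pi_j ~ Dirichlet(alpha * beta + kappa * e_j), so kappa > 0 puts extra prior
    mass on the self-transition (the "sticky" variant)."""
    rng = rng or np.random.default_rng()
    beta = rng.dirichlet(np.full(L, gamma / L))        # shared average transition distribution
    pi = np.empty((L, L))
    for j in range(L):
        pi[j] = rng.dirichlet(alpha * beta + kappa * np.eye(L)[j])
    return beta, pi

beta, pi = sample_hdp_hmm_transitions(L=20, gamma=5.0, alpha=10.0, kappa=50.0)
```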
Direct Assignment Sampler
(Rao-Blackwellized Gibbs Sampler)
• Marginalize:
– transition densities
– emission parameters
• Sequentially sample each state assignment from its conditional (Chinese restaurant prior × likelihood)
• Conjugate base measure → closed form
• Splits true states; hard to merge
Blocked Resampling
• HDP-HMM weak limit approximation: truncate to $L$ states
• Sample the transition distributions and emission parameters
• Compute backwards messages $m_{t+1,t}(z_t) \propto p(y_{t+1:T} \mid z_t, \pi, \theta)$
• Block sample the full state sequence $z_{1:T}$ forwards, using the messages
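A minimal sketch of the backwards-message / block-sampling step for a truncated model with $L$ states (an illustration under the assumptions noted in the comments, not the talk's exact sampler):

```python
import numpy as np

def block_sample_states(pi, likelihoods, rng=None):
    """Backward messages then forward sampling for an L-state HMM.
    pi: (L, L) transition matrix; likelihoods: (T, L) table of p(y_t | z_t = k).
    Assumes a uniform initial state distribution for simplicity."""
    rng = rng or np.random.default_rng()
    T, L = likelihoods.shape
    # Backward messages: m[t, k] proportional to p(y_{t+1:T} | z_t = k)
    m = np.ones((T, L))
    for t in range(T - 2, -1, -1):
        m[t] = pi @ (likelihoods[t + 1] * m[t + 1])
        m[t] /= m[t].sum()                              # normalize for numerical stability
    # Forward block sampling of z_{1:T}
    z = np.empty(T, dtype=int)
    p0 = likelihoods[0] * m[0]
    z[0] = rng.choice(L, p=p0 / p0.sum())
    for t in range(1, T):
        p = pi[z[t - 1]] * likelihoods[t] * m[t]
        z[t] = rng.choice(L, p=p / p.sum())
    return z
```

Because the whole sequence is resampled jointly, redundant states can die out rather than being locked in by one-at-a-time updates.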
Results: Gaussian Emissions
[Figure: sticky HDP-HMM vs. HDP-HMM, blocked sampler vs. sequential sampler]
Results: Fast Switching
[Figure: observations, true state sequence, and sticky HDP-HMM vs. HDP-HMM estimates]
Results: Multinomial Emissions
[Figure: sticky HDP-HMM vs. HDP-HMM]
• Observations embedded in a discrete topological space
• No sense of closeness to help with grouping
• Sticky parameter improves predictive probability of test sequences
• Histogram of log-likelihood of test sequences, using parameters sampled at thinned iterations in the 10,000 to 30,000 range
Multimodal Emissions
[Figure: states, mixture components, observations]
• Approximate multimodal emissions with a DP mixture
• Temporal state persistence disambiguates the model
Results: Multimodal Emissions
[Figure: sticky HDP-HMM vs. HDP-HMM]
• 5-state HMM
• # emission components
• Equal mixture weights
Speaker Diarization
[Figure: audio waveform over time and its segmentation into speakers (Bob, Jane, John)]
NIST Evaluations
• NIST Rich Transcription 2004-2007 meeting recognition evaluations
• 21 meetings
• ICSI results have been the current state of the art
[Figure: meeting-by-meeting comparison]
Results: 21 meetings

                        Overall DER    Best DER    Worst DER
  Sticky HDP-HMM            17.84%       1.26%       34.29%
  Non-Sticky HDP-HMM        23.91%       6.26%       46.95%
  ICSI                      18.37%       4.39%       32.23%
Results: Meeting 1
(AMI_20041210-1052)
Sticky DER = 1.26%
ICSI DER = 7.56%
Results: Meeting 18
(VT_20050304-1300)
Sticky DER = 4.81%
ICSI DER = 22.00%
Results: Meeting 16
(NIST_20051102-1323)
Nonparametric Hidden Markov Trees
(Kivinen, Sudderth & Jordan, 2009)
• A tree in which each parent-child edge is a
Markov transition
• We need to tie the parent-child transitions across
the parent states; this is again done with the HDP
Nonparametric Hidden Markov Trees
(Kivinen, Sudderth & Jordan, 2009)
• Local Gaussian Scale Mixture (31.84 dB)
• HDP-HMT (32.10 dB)
The Beta Process
• The Dirichlet process naturally yields a
multinomial random variable (which table is the
customer sitting at?)
• Problem: in many problem domains we have a
very large (combinatorial) number of possible
tables
– using the Dirichlet process means having a large
number of parameters, which may overfit
– perhaps instead want to characterize objects as
collections of attributes (“sparse features”)?
– i.e., binary matrices with more than one 1 in each row
Indian Buffet Process (IBP)
(Griffiths & Ghahramani, 2002)
• Indian restaurant with infinitely many dishes in a buffet line
• Customers $1$ through $N$ enter the restaurant
– the first customer samples $\mathrm{Poisson}(\alpha)$ dishes
– the $n$th customer samples a previously sampled dish $k$ with probability $m_k / n$ (where $m_k$ is the number of previous customers who sampled dish $k$), then samples $\mathrm{Poisson}(\alpha / n)$ new dishes
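A minimal sketch of the IBP generative process just described (illustration only; the function name is an assumption), producing a binary customers-by-dishes matrix:

```python
import numpy as np

def sample_ibp(N, alpha, rng=None):
    """Sample an N x K binary matrix from the Indian buffet process:
    customer 1 takes Poisson(alpha) dishes; customer n takes each previously
    sampled dish k independently with probability m_k / n, then takes
    Poisson(alpha / n) new dishes."""
    rng = rng or np.random.default_rng()
    Z = np.zeros((0, 0), dtype=int)
    for n in range(1, N + 1):
        m = Z.sum(axis=0)                                   # dish popularity counts
        old = (rng.random(Z.shape[1]) < m / n).astype(int)  # revisit old dishes
        new = np.ones(rng.poisson(alpha / n), dtype=int)    # sample new dishes
        Z = np.pad(Z, ((0, 0), (0, len(new))))              # add columns for the new dishes
        Z = np.vstack([Z, np.concatenate([old, new])])
    return Z

Z = sample_ibp(N=10, alpha=2.0)                             # rows: customers, columns: dishes
```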
Indian Buffet Process (IBP)
• The IBP is usually derived by taking a finite limit
of a process on a finite matrix
• But this leaves some issues somewhat obscured:
– Is the IBP exchangeable?
– Why the Poisson number of dishes in each row?
– Is the IBP conjugate to some stochastic process?
– Is there an underlying stochastic process with the IBP as its marginal?
Completely Random Processes
(Kingman, 1968)
• Completely random measures are measures on a set $\Omega$ that assign independent mass to nonintersecting subsets of $\Omega$
– e.g., Brownian motion, gamma processes, beta processes,
compound Poisson processes and limits thereof
• (The Dirichlet process is not a completely random process
– but it's a normalized gamma process)
• Completely random processes are discrete wp1 (up to a
possible deterministic continuous component)
• Completely random processes are random measures, not
necessarily random probability measures
Completely Random Processes
(Kingman, 1968)
[Figure: a discrete random measure with atoms $(\omega_i, p_i)$ shown as weighted spikes over $\Omega$]
• Assigns independent mass to nonintersecting subsets of $\Omega$
Completely Random Processes
(Kingman, 1968)
• Consider a nonhomogeneous Poisson process on $\Omega \otimes \mathbb{R}$ with rate function obtained from some product measure
• Sample from this Poisson process and connect the samples vertically to their coordinates in $\Omega$
[Figure: Poisson process points $(\omega_i, p_i)$ in $\Omega \times \mathbb{R}$, dropped down to atoms on $\Omega$]
Beta Processes
(Hjort, Kim, et al.)
• The product measure is called a Lévy measure
• For the beta process, this measure lives on $\Omega \otimes [0,1]$ and is given as follows:
  $$\nu(d\omega, dp) = c\, p^{-1} (1 - p)^{c - 1}\, dp\; B_0(d\omega)$$
– the weight factor is a degenerate $\mathrm{Beta}(0, c)$ distribution; $B_0$ is the base measure
• And the resulting random measure can be written simply as:
  $$B = \sum_{i} p_i\, \delta_{\omega_i}$$
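A minimal sketch of drawing $B$ via this Poisson-process construction (an illustration, not slide code), specializing to concentration $c = 1$ and truncating the Lévy measure to atom weights $p \ge \varepsilon$ so that the number of atoms is finite; the function name and normal base distribution are assumptions.

```python
import numpy as np

def sample_beta_process_truncated(gamma, base_sampler, eps=1e-4, rng=None):
    """Draw atoms (p_i, omega_i) of B = sum_i p_i * delta_{omega_i} from a beta
    process with concentration c = 1, base mass gamma, base distribution H
    (base_sampler), keeping only atoms with weight p >= eps.
    Truncated Levy measure: nu(d_omega, dp) = gamma * p^{-1} dp H(d_omega)."""
    rng = rng or np.random.default_rng()
    total_mass = gamma * np.log(1.0 / eps)            # nu(Omega x [eps, 1])
    n_atoms = rng.poisson(total_mass)                 # Poisson number of atoms above eps
    u = rng.random(n_atoms)
    weights = eps ** u                                # inverse-CDF draw from density ~ 1/p on [eps, 1]
    locations = np.array([base_sampler(rng) for _ in range(n_atoms)])
    return weights, locations

weights, locations = sample_beta_process_truncated(gamma=3.0,
                                                   base_sampler=lambda rng: rng.normal())
```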
Beta Processes
[Figure: a draw from a beta process; atoms at locations $\omega_i$ with weights $p_i \in [0,1]$]
Beta Process -> IBP
(Thibaux & Jordan, 2007)
• Theorem: The beta process is the De Finetti
mixing measure underlying the IBP
• Thus:
– the IBP is obviously exchangeable
– it’s clear why the Poisson assumption arises
– this leads to new algorithms for the IBP based on
augmentation
– we’re motivated to consider hierarchies of beta
processes
Beta Process and Bernoulli Process
BP and BeP Sample Paths
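Given a discrete draw $B$ with atom weights $p_i$ (for instance from the truncated sketch above), a Bernoulli process draw turns each atom on independently with probability $p_i$; stacking such draws gives an IBP-style binary feature matrix. A tiny self-contained illustration with hypothetical weights:

```python
import numpy as np

rng = np.random.default_rng()
weights = np.array([0.9, 0.5, 0.2, 0.05])   # hypothetical atom weights p_i of a beta process draw B
N = 10                                      # number of Bernoulli process draws X_n ~ BeP(B)
Z = (rng.random((N, len(weights))) < weights).astype(int)   # Z[n, i] = 1 if atom i is on for item n
```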
Hierarchical Beta Processes
• A hierarchical beta process is a beta process whose base measure
is itself random and drawn from a beta process
Text Classification
[Figure: documents-by-words binary matrices, one per class (SPORTS, SCIENCE, POLITICS)]
Prediction
[Figure: predicting class 1 based on 10 samples per class; average log likelihood (about -34 to -27) vs. smoothing parameter (0.0001 to 100) for a hierarchy of beta processes and for Laplace smoothing]
Classification Accuracy
• 20-newsgroup dataset (unbalanced dataset, with categories of size 2 to 100)

                      Accuracy
  Hierarchies of BP   58%
  Best Naïve Bayes    50%
Stick-Breaking Representations
• Thibaux & Jordan size-biased ordering:
• Teh et al. inverse Lévy measure approach:
Conclusions
• Hierarchical modeling has at least as important a role to play in Bayesian nonparametrics as it plays in classical Bayesian parametric modeling
• In particular, infinite-dimensional parameters can
be controlled recursively via Bayesian
nonparametric strategies
• For papers and more details:
www.cs.berkeley.edu/~jordan/publications.html