Completely Random Measures for
Bayesian Nonparametrics
Michael I. Jordan
University of California, Berkeley
June 14, 2010
Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux
Information Theory and Statistics
• Many classical and neo-classical connections
– hypothesis testing, large deviations, empirical processes, consistency of density estimation, compressed sensing, MDL priors, reference priors, divergences and asymptotics
• Some of these connections are frequentist and some are Bayesian
• Some of the connections involve nonparametrics and some involve parametric modeling
– it appears that most of the nonparametric work is being conducted within a frequentist framework
• I want to introduce you to Bayesian nonparametrics
Bayesian Nonparametrics
• At the core of Bayesian inference is Bayes theorem:
  p(θ | x) = p(x | θ) p(θ) / ∫ p(x | θ') p(θ') dθ'
• For parametric models, we let θ denote a Euclidean parameter and write:
  p(θ | x) ∝ p(x | θ) p(θ)
• For Bayesian nonparametric models, we let G be a general stochastic process (an “infinite-dimensional random variable”) and write (non-rigorously):
  p(G | x) ∝ p(x | G) p(G)
• This frees us to work with flexible data structures
Bayesian Nonparametrics (cont)
• Examples of stochastic processes we'll mention
today include distributions on:
– directed trees of unbounded depth and unbounded fanout
– partitions
– grammars
– sparse binary infinite-dimensional matrices
– copulae
– distributions
• General mathematical tool: completely random
processes
Bayesian Nonparametrics
• Replace distributions on finite-dimensional objects with
distributions on infinite-dimensional objects such as
function spaces, partitions and measure spaces
– mathematically this simply means that we work with
stochastic processes
• A key construction: random measures
• These can be used to provide flexibility at the higher levels
of Bayesian hierarchies:
Stick-Breaking
• A general way to obtain distributions on countably infinite spaces
• The classical example: Define an infinite sequence of beta random variables:
  β_k ~ Beta(1, α₀),  k = 1, 2, …
• And then define an infinite random sequence as follows:
  π₁ = β₁,   π_k = β_k ∏_{j<k} (1 − β_j),  k = 2, 3, …
• This can be viewed as breaking off portions of a stick
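A minimal simulation sketch of this stick-breaking construction in Python; the concentration parameter alpha and the truncation level K are illustrative choices (the true sequence is infinite):

import numpy as np

def stick_breaking_weights(alpha, K, rng=np.random.default_rng(0)):
    """Return the first K stick-breaking weights pi_k for concentration alpha."""
    betas = rng.beta(1.0, alpha, size=K)              # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining                          # pi_k = beta_k * prod_{j<k}(1 - beta_j)

pi = stick_breaking_weights(alpha=2.0, K=20)
print(pi.sum())   # approaches 1 as K grows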
Constructing Random Measures
• It's not hard to see that ∑_k π_k = 1 (wp1)
• Now define the following object:
  G = ∑_k π_k δ_{φ_k}
– where the φ_k are independent draws from a distribution G₀ on some space
• Because ∑_k π_k = 1, G is a probability measure---it is a random probability measure
• The distribution of G is known as a Dirichlet process: G ~ DP(α₀, G₀)
• There are other random measures that can be obtained this way, but there are other approaches as well…
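Continuing the sketch above, a truncated draw from a Dirichlet process pairs the stick-breaking weights with atoms drawn i.i.d. from the base distribution; here a standard normal is used purely as a stand-in for G₀:

import numpy as np

rng = np.random.default_rng(1)
alpha, K = 2.0, 100                                 # truncation level K is illustrative
betas = rng.beta(1.0, alpha, size=K)
pi = betas * np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
atoms = rng.normal(size=K)                          # phi_k ~ G0 (standard normal stand-in)
G = list(zip(pi, atoms))                            # G = sum_k pi_k * delta_{phi_k}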
Random Measures from Stick-Breaking
[Figure: a draw from a random measure constructed by stick-breaking --- atoms (marked x) at locations drawn from the base distribution, weighted by the stick-breaking proportions]
Completely Random Measures
(Kingman, 1968)
• Completely random measures are measures on a set Ω that assign independent mass to nonintersecting subsets of Ω
– e.g., Brownian motion, gamma processes, beta processes, compound Poisson processes and limits thereof
• (The Dirichlet process is not a completely random measure---but it's a normalized gamma process)
• Completely random measures are discrete wp1 (up to a possible deterministic continuous component)
• Completely random measures are random measures, not necessarily random probability measures
Completely Random Measures
(Kingman, 1968)
• Consider a non-homogeneous Poisson process on Ω ⊗ ℝ with rate function obtained from some product measure
• Sample from this Poisson process and connect the samples vertically to their coordinates in Ω
[Figure: points (marked x) of the Poisson process plotted above Ω, each connected vertically to its location in Ω]
Beta Processes
• The product measure is called a Lévy measure
• For the beta process, this measure lives on Ω ⊗ [0, 1] and is given as follows:
  ν(dω, dp) = c p⁻¹ (1 − p)^(c−1) dp B₀(dω)
– the first factor is a degenerate Beta(0, c) density; B₀ is the base measure
• And the resulting random measure can be written simply as:
  B = ∑_k p_k δ_{ω_k}
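A rough simulation sketch of the beta process, following the Poisson-process picture above. For simplicity it assumes c = 1 and a uniform base measure on [0, 1] with total mass gamma, and it truncates the improper Lévy density at p ≥ eps; all of these choices are illustrative:

import numpy as np

def beta_process_atoms(gamma=5.0, eps=1e-4, rng=np.random.default_rng(2)):
    """Approximate draw from a beta process with c = 1 and base measure gamma * Uniform[0, 1].

    The Levy density gamma / p is truncated to p in [eps, 1]; the atoms then form
    a Poisson process with that intensity.
    """
    total_mass = gamma * np.log(1.0 / eps)            # integral of gamma/p over [eps, 1]
    n_atoms = rng.poisson(total_mass)
    u = rng.uniform(size=n_atoms)
    weights = eps ** (1.0 - u)                        # inverse-CDF sample of density proportional to 1/p
    locations = rng.uniform(size=n_atoms)             # omega_k drawn from the (normalized) base measure
    return locations, weights                         # B = sum_k weights[k] * delta_{locations[k]}

omegas, p = beta_process_atoms()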
Beta Processes
Gamma Processes, Dirichlet Processes and Bernoulli Processes
• The gamma process uses an improper gamma density in the Lévy measure
• Once again, the resulting random measure is an infinite weighted sum of atoms
• To obtain the Dirichlet process, simply normalize the gamma process
• The Bernoulli process is obtained by treating the beta process as an infinite collection of coins and tossing those coins:
  X = ∑_k z_k δ_{ω_k},  where z_k ~ Bernoulli(p_k)
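Given beta-process atoms like those above, a Bernoulli process draw simply tosses one coin per atom; a minimal sketch (the Beta(0.5, 2) weights below are only a stand-in for beta-process atom weights):

import numpy as np

rng = np.random.default_rng(3)
weights = rng.beta(0.5, 2.0, size=30)         # stand-in for beta-process atom weights p_k
z = rng.uniform(size=weights.size) < weights  # z_k ~ Bernoulli(p_k): one coin toss per atom
# X = sum_k z_k * delta_{omega_k} --- a finite random subset of the atoms is selected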
Beta Process and Bernoulli Process
BP and BeP Sample Paths
Beta Processes vs. Dirichlet Processes
• The Dirichlet process yields a classification of each data
point into a single class (a “cluster”)
• The beta process allows each data point to belong to
multiple classes (a “feature vector”)
• Essentially, the Dirichlet process yields a single coin with
an infinite number of sides
• Essentially, the beta process yields an infinite collection
of coins with mostly small probabilities, and the Bernoulli
process tosses those coins to yield a binary feature
vector
The De Finetti Theorem
• An infinite sequence of random variables (X₁, X₂, …) is called infinitely exchangeable if the distribution of any finite subsequence is invariant to permutation
• Theorem: infinite exchangeability if and only if
  p(x₁, …, x_N) = ∫ ∏_{i=1}^{N} p(x_i | G) P(dG)  for all N,
  for some random measure G
Random Measures and Their Marginals
• Taking G to be a completely random or stick-breaking measure, we obtain various interesting combinatorial stochastic processes:
– Dirichlet process => Chinese restaurant process (Polya urn)
– Beta process => Indian buffet process
– Hierarchical Dirichlet process => Chinese restaurant franchise
– HDP-HMM => infinite HMM
– Nested Dirichlet process => nested Chinese restaurant process
Dirichlet Process Mixture Models
Chinese Restaurant Process (CRP)
• A random process in which customers sit down in a Chinese restaurant with an infinite number of tables
– first customer sits at the first table
– the nth subsequent customer sits at a table drawn from the following distribution:
  p(previously occupied table i | F_{n−1}) ∝ n_i
  p(next unoccupied table | F_{n−1}) ∝ α₀
– where n_i is the number of customers currently at table i and where F_{n−1} denotes the state of the restaurant after n − 1 customers have been seated
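A minimal Python simulation of the CRP seating rule above (the concentration parameter alpha and the number of customers are illustrative):

import numpy as np

def crp(n_customers, alpha, rng=np.random.default_rng(4)):
    """Return a table assignment for each customer under a CRP(alpha) prior."""
    counts = []                                   # counts[i] = customers currently at table i
    assignments = []
    for n in range(n_customers):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()                      # p(table i) proportional to n_i, p(new table) proportional to alpha
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)                      # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

print(crp(20, alpha=1.0))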
The CRP and Clustering
• Data points are customers; tables are mixture
components
– the CRP defines a prior distribution on the partitioning of the
data and on the number of tables
• This prior can be completed with:
– a likelihood---e.g., associate a parameterized probability distribution with each table
– a prior for the parameters---the first customer to sit at table k chooses the parameter vector, θ_k, for that table from a prior
• So we now have defined a full Bayesian posterior for a mixture model of unbounded cardinality
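A sketch of the corresponding generative model for a CRP (Dirichlet process) mixture; the Gaussian likelihood and Gaussian prior on the table parameters are used purely for illustration:

import numpy as np

def crp_mixture_sample(n, alpha=1.0, prior_std=5.0, obs_std=1.0,
                       rng=np.random.default_rng(5)):
    """Generate n observations from a CRP mixture with Gaussian components."""
    counts, params, data = [], [], []
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        table = rng.choice(len(probs), p=probs / probs.sum())
        if table == len(counts):                       # first customer at a new table
            counts.append(1)
            params.append(rng.normal(0.0, prior_std))  # theta_k drawn from the prior
        else:
            counts[table] += 1
        data.append(rng.normal(params[table], obs_std))
    return np.array(data)

x = crp_mixture_sample(100)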
CRP Prior, Gaussian Likelihood,
Conjugate Prior
Application: Protein Modeling
• A protein is a folded chain of amino acids
• The backbone of the chain has two degrees of
freedom per amino acid (phi and psi angles)
• Empirical plots of phi and psi angles are called
Ramachandran diagrams
Application: Protein Modeling
• We want to model the density in the
Ramachandran diagram to provide an energy
term for protein folding algorithms
• We actually have a linked set of Ramachandran
diagrams, one for each amino acid neighborhood
• We thus have a linked set of density estimation
problems
Multiple Estimation Problems
• We often face multiple, related estimation problems
• E.g., multiple Gaussian means:
  X_{ij} ~ N(θ_i, σ²),  i = 1, …, m,  j = 1, …, n_i
• Maximum likelihood:
  θ̂_i = (1/n_i) ∑_j x_{ij}
• Maximum likelihood often doesn't work very well
– want to “share statistical strength”
Hierarchical Bayesian Approach
• The Bayesian or empirical Bayesian solution is to view the parameters θ_i as random variables, related via an underlying variable θ₀
• Given this overall model, posterior inference yields shrinkage---the posterior mean for each θ_i combines data from all of the groups
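A small illustration of this shrinkage for the Gaussian case, assuming (for the sketch only) that the within-group variance sigma², the across-group variance tau², and the global mean theta0 are known; in the full hierarchical or empirical Bayes treatment those quantities are themselves inferred:

import numpy as np

def posterior_means(groups, theta0, sigma2, tau2):
    """Posterior mean of each group mean theta_i under x_ij ~ N(theta_i, sigma2),
    theta_i ~ N(theta0, tau2), with theta0, sigma2, tau2 assumed known."""
    means = []
    for x in groups:
        n = len(x)
        precision_data, precision_prior = n / sigma2, 1.0 / tau2
        w = precision_data / (precision_data + precision_prior)
        means.append(w * np.mean(x) + (1.0 - w) * theta0)   # shrink the sample mean toward theta0
    return np.array(means)

rng = np.random.default_rng(6)
groups = [rng.normal(loc, 1.0, size=5) for loc in (0.0, 1.0, 3.0)]
print(posterior_means(groups, theta0=0.0, sigma2=1.0, tau2=1.0))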
Hierarchical Modeling
• The plate notation:
• Equivalent to:
Hierarchical Dirichlet Process Mixtures
Chinese Restaurant Franchise (CRF)
Global menu:
Restaurant 1:
Restaurant 2:
Restaurant 3:
Protein Folding (cont.)
• We have a linked set of Ramachandran
diagrams, one for each amino acid neighborhood
Protein Folding (cont.)
Speaker Diarization
[Figure: audio signal over time together with the speaker labels (Bob, Jane, John) --- the task is to segment the recording according to who is speaking when]
Motion Capture Analysis
• Goal: Find coherent “behaviors” in the time series that transfer to other time series (e.g., jumping, reaching)
Hidden Markov Models
[Figure: HMM graphical model --- a sequence of hidden states (modes) evolving over time, each emitting an observation]
Issues with HMMs
• How many states should we use?
– we don’t know the number of speakers a priori
– we don’t know the number of behaviors a priori
• How can we structure the state space?
– how to encode the notion that a particular time series makes
use of a particular subset of the states?
– how to share states among time series?
• We’ll develop a Bayesian nonparametric approach to
HMMs that solves these problems in a simple and
general way
HDP-HMM
• Dirichlet process:
– state space of unbounded cardinality
• Hierarchical DP:
– ties state transition distributions
[Figure: HDP-HMM graphical model --- state sequence over time with observations]
HDP-HMM
• Average transition distribution:
  β | γ ~ GEM(γ)
• State-specific transition distributions:
  π_j | α, β ~ DP(α, β)
– the sparsity of β is shared across the state-specific transition distributions
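A truncated sketch of this transition prior: a global stick-breaking draw β shared across rows, and each row's transition distribution drawn from a finite Dirichlet approximation to DP(α, β). The truncation level K and the hyperparameter values are illustrative:

import numpy as np

def hdp_hmm_transitions(K=10, gamma=3.0, alpha=5.0, rng=np.random.default_rng(7)):
    """Truncated HDP-HMM prior: beta ~ GEM(gamma) truncated to K components,
    pi_j ~ Dirichlet(alpha * beta) as a finite approximation to DP(alpha, beta)."""
    sticks = rng.beta(1.0, gamma, size=K)
    beta = sticks * np.concatenate([[1.0], np.cumprod(1.0 - sticks)[:-1]])
    beta /= beta.sum()                               # renormalize after truncation
    pi = rng.dirichlet(alpha * beta, size=K)         # one transition distribution per state
    return beta, pi

beta, pi = hdp_hmm_transitions()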
State Splitting
• HDP-HMM inadequately models temporal persistence of states
• DP bias insufficient to prevent unrealistically rapid dynamics
• Reduces predictive performance
“Sticky” HDP-HMM
• Original (state-specific transition distributions):
  π_j | α, β ~ DP(α, β)
• Sticky (state-specific base measure gives increased probability of self-transition):
  π_j | α, κ, β ~ DP(α + κ, (α β + κ δ_j) / (α + κ))
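Under the same truncated approximation as above, the sticky variant simply adds extra prior mass κ to the self-transition component of each row's base measure; a sketch (the value of kappa is illustrative):

import numpy as np

def sticky_transitions(beta, alpha=5.0, kappa=10.0, rng=np.random.default_rng(8)):
    """pi_j ~ Dirichlet(alpha * beta + kappa * e_j): extra prior mass on self-transitions."""
    K = len(beta)
    return np.vstack([rng.dirichlet(alpha * beta + kappa * np.eye(K)[j])
                      for j in range(K)])

# Example, reusing a truncated global beta:
rng = np.random.default_rng(8)
sticks = rng.beta(1.0, 3.0, size=10)
beta = sticks * np.concatenate([[1.0], np.cumprod(1.0 - sticks)[:-1]])
pi_sticky = sticky_transitions(beta / beta.sum())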
Speaker Diarization
[Figure: diarization results --- audio signal over time and the estimated speaker segmentation (Bob, Jane, John)]
NIST Evaluations
Meeting by Meeting Comparison
• NIST Rich Transcription 2004-2007 meeting recognition evaluations
• 21 meetings
• ICSI results have been the current state-of-the-art
Results: 21 meetings
                        Overall DER   Best DER   Worst DER
  Sticky HDP-HMM            17.84%      1.26%      34.29%
  Non-Sticky HDP-HMM        23.91%      6.26%      46.95%
  ICSI                      18.37%      4.39%      32.23%
HDP-PCFG
• The most successful parsers of natural language are based on lexicalized probabilistic context-free grammars (PCFGs)
• We want to learn PCFGs from data without hand-tuning of the number or kind of lexical categories
HDP-PCFG
The Beta Process
• The Dirichlet process naturally yields a
multinomial random variable (which table is the
customer sitting at?)
• Problem: in many problem domains we have a
very large (combinatorial) number of possible
tables
– using the Dirichlet process means having a large
number of parameters, which may overfit
– perhaps instead want to characterize objects as
collections of attributes (“sparse features”)?
– i.e., binary matrices with more than one 1 in each row
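A short simulation of the Indian buffet process --- the marginal of the beta/Bernoulli construction --- which generates exactly such sparse binary feature matrices (alpha is the IBP concentration parameter; the values are illustrative):

import numpy as np

def indian_buffet(n_customers, alpha=2.0, rng=np.random.default_rng(9)):
    """Return an n x K binary feature matrix drawn from an IBP(alpha) prior."""
    dish_counts = []                                   # number of customers who took each dish
    rows = []
    for i in range(1, n_customers + 1):
        # existing dishes taken with probability m_k / i
        row = [rng.uniform() < m / i for m in dish_counts]
        for k, taken in enumerate(row):
            dish_counts[k] += int(taken)
        # Poisson(alpha / i) brand-new dishes
        new = rng.poisson(alpha / i)
        row += [True] * new
        dish_counts += [1] * new
        rows.append(row)
    K = len(dish_counts)
    Z = np.zeros((n_customers, K), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

print(indian_buffet(10))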
Beta Processes
Beta Process and Bernoulli Process
BP and BeP Sample Paths
Multiple Time Series
• Goals:
– transfer knowledge among related time series in the form of
a library of “behaviors”
– allow each time series model to make use of an arbitrary
subset of the behaviors
• Method:
– represent behaviors as states in a nonparametric HMM
– use the beta/Bernoulli process to pick out subsets of states
BP-AR-HMM
• Bernoulli process
determines which
states are used
• Beta process prior:
– encourages sharing
– allows variability
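A toy sketch of the state-selection step: the time series share a library of states with beta-process-style weights, and each series draws one Bernoulli indicator per state, so different series use different but overlapping subsets (the weight distribution and sizes are illustrative):

import numpy as np

rng = np.random.default_rng(13)
n_states, n_series = 20, 4
shared_weights = rng.beta(0.5, 2.0, size=n_states)        # stand-in for shared beta-process weights
# One Bernoulli-process draw per time series: which library states it uses
uses_state = rng.uniform(size=(n_series, n_states)) < shared_weights
print(uses_state.astype(int))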
Motion Capture Results
Document Corpora
• “Bag-of-words” models of document corpora have a
long history in the field of information retrieval
• The bag-of-words assumption is an exchangeability
assumption for the words in a document
– it is generally a very rough reflection of reality
– but it is useful, for both computational and statistical
reasons
• Examples
– text (bags of words)
– images (bags of patches)
– genotypes (bags of polymorphisms)
• Motivated by De Finetti, we wish to find useful latent
representations for these “documents”
Short History of Document Modeling
• Finite mixture models
• Finite admixture models (latent Dirichlet allocation)
• Bayesian nonparametric admixture models (the
hierarchical Dirichlet process)
• Abstraction hierarchy mixture models (nested
Chinese restaurant process)
• Abstraction hierarchy admixture models (nested
beta process)
Finite Mixture Models
• The mixture components are distributions on
individual words in some vocabulary (e.g., for text
documents, a multinomial over lexical items)
– often referred to as “topics”
• The generative model of a document:
– select a mixture component
– repeatedly draw words from this mixture component
• The mixing proportions are corpus-level, not document-specific
• Major drawback: each document can express only a
single topic
Finite Admixture Models
• The mixture components are distributions on
individual words in some vocabulary (e.g., for text
documents, a multinomial over lexical items)
– often referred to as “topics”
• The generative model of a document:
– repeatedly select a mixture component
– draw a word from this mixture component
• The mixing proportions are document-specific
• Now each document can express multiple topics
Latent Dirichlet Allocation
(Blei, Ng, and Jordan, 2003)
• A word is represented as a multinomial random variable w
• A topic allocation is represented as a multinomial random variable z
• A document is modeled as a Dirichlet random variable θ
• The variables α and β are hyperparameters
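A sketch of the LDA generative process just described, with a small illustrative vocabulary; the symmetric hyperparameter values and the way the topic-word distributions are generated are assumptions made only for the example:

import numpy as np

def generate_lda_document(n_words, topics, alpha, rng=np.random.default_rng(10)):
    """LDA generative process for one document.

    topics: K x V matrix (each row a distribution over the vocabulary, i.e. beta)
    alpha:  Dirichlet hyperparameter for the document's topic proportions theta
    """
    K, V = topics.shape
    theta = rng.dirichlet(np.full(K, alpha))          # theta ~ Dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                    # topic allocation z ~ Mult(theta)
        words.append(rng.choice(V, p=topics[z]))      # word w ~ Mult(beta_z)
    return words

rng = np.random.default_rng(10)
K, V = 3, 20
topics = rng.dirichlet(np.full(V, 0.1), size=K)       # illustrative topic-word distributions
print(generate_lda_document(15, topics, alpha=0.5))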
Abstraction Hierarchies
• Words in documents are organized not only by
topics but also by level of abstraction
• Models such as LDA don’t capture this notion;
common words often appear repeatedly across
many topics
• Idea: Let documents be represented as paths
down a tree of topics, placing topics focused on
common words near the top of the tree
Nesting
• In the hierarchical models, all of the atoms are
available to all of the “groups”
• In nested models, the atoms are subdivided into
smaller collections, and the groups select among
the collections
• E.g., the nested CRP of Blei, Griffiths and Jordan
- each table in the Chinese restaurant indexes another
Chinese restaurant
• E.g., the nested DP of Rodriguez, Dunson and
Gelfand
- each atom in a high-level DP is itself a DP
Nested Chinese Restaurant Process
(Blei, Griffiths and Jordan, JACM, 2010)
Hierarchical Latent Dirichlet Allocation
• The generative model for a document is as follows:
- use the nested CRP to select a path down the infinite tree
- draw a stick-breaking (GEM) distribution on levels along the path
- repeatedly draw a level from this distribution and draw a word from the topic distribution at that level
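A compact sketch of this generative model, using a truncated tree of fixed depth, a CRP draw at each node to pick the child along the path, and a simple stick-breaking draw over levels; the depth, vocabulary size, and hyperparameters are illustrative:

import numpy as np

rng = np.random.default_rng(11)
DEPTH, GAMMA, VOCAB = 3, 1.0, 25
child_counts = {}                         # node (path tuple) -> list of child visit counts
node_topics = {}                          # node -> distribution over the vocabulary

def crp_child(node):
    """Pick a child of `node` via a CRP with concentration GAMMA."""
    counts = child_counts.setdefault(node, [])
    probs = np.array(counts + [GAMMA], dtype=float)
    c = rng.choice(len(probs), p=probs / probs.sum())
    if c == len(counts):
        counts.append(0)
    counts[c] += 1
    return node + (c,)

def topic(node):
    if node not in node_topics:
        node_topics[node] = rng.dirichlet(np.full(VOCAB, 0.1))
    return node_topics[node]

def generate_hlda_document(n_words=20, alpha=2.0):
    path = [()]                                        # start at the root
    for _ in range(DEPTH - 1):
        path.append(crp_child(path[-1]))               # nested CRP: one branch per level
    sticks = rng.beta(1.0, alpha, size=DEPTH)          # stick-breaking over levels
    levels = sticks * np.concatenate([[1.0], np.cumprod(1.0 - sticks)[:-1]])
    levels /= levels.sum()
    return [rng.choice(VOCAB, p=topic(path[rng.choice(DEPTH, p=levels)]))
            for _ in range(n_words)]

docs = [generate_hlda_document() for _ in range(5)]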
Psychology Today Abstracts
Pros and Cons of Hierarchical LDA
• The hierarchical LDA model yields more
interpretable topics than flat LDA
• The maximum a posteriori tree can be interesting
• But the model is out of the spirit of classical topic
models because it only allows one “theme” per
document
• We would like a model that can both mix over
levels in a tree and mix over paths within a single
document
Nested Beta Processes
• In the nested CRP, at each level of the tree the
procedure selects only a single outgoing branch
• Instead of using the CRP, use the beta process
(or the gamma process) together with the
Bernoulli process (or the Poisson process) to
allow multiple branches to be selected:
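A toy sketch of this nested selection idea: at each node of a truncated tree the child branches carry beta-process-style weights, and a Bernoulli draw per branch lets a document follow several branches at once rather than a single path (depth, fan-out, and the weight distribution are illustrative):

import numpy as np

rng = np.random.default_rng(12)

def select_branches(depth=3, fanout=4, node=()):
    """Recursively select a set of nodes: each child kept with its own small probability."""
    if depth == 0:
        return [node]
    selected = [node]
    child_probs = rng.beta(0.5, 2.0, size=fanout)      # stand-in for beta-process branch weights
    for k, p in enumerate(child_probs):
        if rng.uniform() < p:                           # Bernoulli draw: possibly several children kept
            selected += select_branches(depth - 1, fanout, node + (k,))
    return selected

print(select_branches())

In the actual nested construction the branch weights form a single random measure shared by all documents; here they are redrawn on each call purely to keep the sketch short.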
Conclusions
• Hierarchical modeling and nesting have at least as important a role to play in Bayesian nonparametrics as they play in classical Bayesian parametric modeling
• In particular, infinite-dimensional parameters can be controlled recursively via Bayesian nonparametric strategies
• Completely random measures are an important tool in constructing nonparametric priors
• For papers, software, tutorials and more details:
  www.cs.berkeley.edu/~jordan/publications.html