Memory Bounded Inference on Topic Models

Paper by R. Gomes, M. Welling, and P. Perona
Included in Proceedings of ICML 2008
Presentation by Eric Wang
1/9/2009
Outline
• Motivation/Background: what is “memory bounded
lifelong learning?”
• Review of the Variational Topic Model
– Note: the title “Topic Model” is misleading; this paper actually applies an LDA-based model to image analysis.
• Introduce the Memory Bounded Variational Topic
Model (2 step process)
– Defining “Clumping” and “Grouping”
– Model Building Step
– Compression Step
• Empirical Results
• Conclusion
Background/Motivation
• The goal of this paper is to create a class of
algorithms which can
– Expand model complexity as more data arrives,
– Do so in a speedy and memory efficient manner.
• Traditional non-parametric algorithms, including the Dirichlet process (DP), can address the first issue but not the second.
• The DP and its derivatives are batch algorithms. They require that the entire dataset be cached in memory, and each time the dataset is expanded they infer a new model from ALL of the data.
• This paper proposes a new class of models which
produces results similar to those of batch
algorithms while meeting both initial requirements.
Background/Motivation
• Memory Bounded Models compress the data and
the internal representation of the model after each
dataset expansion according to a set memory
bound.
• The net effect of this is that the memory
requirements of the model scale much more
gradually over time than a similar batch processing
model.
• This paper demonstrates results with a memory bounded approximation to LDA which validate the authors’ claims.
Review: Variational Topic Model
• Consider the following Bayesian topic model where, unlike Blei’s form of LDA, we assume that the topic-specific distribution over observations is in the exponential family and is given a conjugate prior.
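• As a point of reference, the generative process of such an exponential-family topic model can be sketched as follows (the notation below is generic and not copied from the paper):

    \eta_k \sim p(\eta \mid \lambda) \propto \exp\{\lambda_1^\top \eta - \lambda_2\, a(\eta)\}     % conjugate prior on topic k
    \theta_j \sim \mathrm{Dirichlet}(\alpha)                                                       % topic proportions of document (image) j
    z_{ji} \mid \theta_j \sim \mathrm{Multinomial}(\theta_j)                                       % topic indicator of word (patch) i
    x_{ji} \mid z_{ji} \sim h(x_{ji})\,\exp\{\eta_{z_{ji}}^\top t(x_{ji}) - a(\eta_{z_{ji}})\}     % exponential-family emission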
Review: Variational Topic Model
• The variational posteriors are then
– a Dirichlet over each document’s topic proportions (as in LDA), and
– a multinomial distribution over the topic indicators.
• The updates of the variational parameters follow the standard mean-field form.
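• In generic LDA notation (γ and φ below are the usual mean-field symbols, not taken from the slides), the updates read:

    q(\theta_j) = \mathrm{Dirichlet}(\gamma_j), \qquad q(z_{ji} = k) = \phi_{jik},
    \phi_{jik} \propto \exp\{\mathbb{E}_q[\log \theta_{jk}] + \mathbb{E}_q[\log p(x_{ji} \mid \eta_k)]\},
    \gamma_{jk} = \alpha_k + \textstyle\sum_i \phi_{jik}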
Review: Variational Topic Model
• As in LDA, we assume Gamma priors on the hyperparameters of the Dirichlet.
• The corresponding hyperparameter update equations follow (omitted here).
“Clumping” and “Grouping”
• Tie some of the word-level variational distributions, within and across different documents (images), to a shared “clump distribution”.
• Further, tie some of the document-specific topic distributions to a “topic group” specific distribution.
• Equivalently, all words in a clump share one posterior over topics, and all documents in a group share one posterior over topic proportions.
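• In symbols (notation mine), the tying constraints read:

    q(z_{ji}) = q_c \quad \text{for all words } (j, i) \text{ assigned to clump } c,
    q(\theta_j) = q_s \quad \text{for all documents } j \text{ assigned to group } s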
“Clumping” and “Grouping”
• Definitions
– Let … be the number of documents in a group.
– Let … be the number of words* in a word clump.
– Let … be the number of words in a clump c and group s.
• The update equations can be derived similarly to
the original topic model
* “Words” is used loosely; it may mean, as in this case, an image patch.
“Clumping” and “Grouping”
• Updates of the hyperparameters are also similar to those of the original topic model.
• Since a clump may be used to represent more than one “word”, the effect of a clump on the parameters must be scaled appropriately (see the sketch below).
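• One way to write this scaling explicitly (my notation, continuing the generic mean-field form above): in the Dirichlet update, each clump contributes its shared responsibility once for every word it represents,

    \gamma_{sk} = \alpha_k + \textstyle\sum_c N_{cs}\, q_c(k),

where N_{cs} is the number of words of group s that fall in clump c and q_c(k) is the clump’s shared posterior probability of topic k.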
Memory Bounded Variational
Topic Model
• Suppose the data arrives in time-sequential sets, denoted “epochs”.
• The proposed family of models will estimate a new
model with the current epoch of data using the
previous epoch’s model (model building) as a
starting point, then compress the model and data
by assigning words and documents to clumps and
groups (model compression). Historical data will be
represented as clumps and groups.
• The data-level distributions and the topic-level distributions are then discarded, while the clump sufficient statistics and the estimated model are retained as the starting point for the next epoch (as sketched below).
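• A minimal pseudocode sketch of this loop (the helper names are illustrative placeholders, not the authors’ code):

    # Illustrative sketch of the memory-bounded epoch loop; build_model and
    # compress stand in for the paper's model-building and compression steps.
    def build_model(model, epoch_data, clumps, groups):
        """Variational inference over new data plus clumps, with split/merge moves."""
        return model  # placeholder

    def compress(model, epoch_data, clumps, groups, memory_bound):
        """Summarize the epoch into clump sufficient statistics and document groups."""
        return clumps, groups  # placeholder

    data_stream = []          # one entry per epoch of newly arrived documents
    memory_bound = 10000      # illustrative bound
    model, clumps, groups = {}, [], []
    for epoch_data in data_stream:
        model = build_model(model, epoch_data, clumps, groups)
        clumps, groups = compress(model, epoch_data, clumps, groups, memory_bound)
        # raw data from this epoch is now discarded; only clumps/groups persist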
Model Building
• Variational parameter updates (equations omitted here).
• Hyperparameter updates (equations omitted here).
Model Building
• In this step, the goal is to
build the model exactly as
in the non-memory bounded
case, based on the model
from the previous epoch.
• The purpose of this step is
to assign new data to
specific topics, which is
necessary for the following
steps.
Model Building
• This step ranks which components to split or merge based on the Split/Merge EM (SMEM) algorithm.
• SMEM was originally
proposed by Ueda et al. to
help mixture models avoid
local maxima by increasing
the number of clusters in
under-populated regions,
and decreasing the number
of clusters in overpopulated
regions.
Model Building
• The idea behind SMEM is to
evaluate whether splitting a
cluster or merging two
clusters will increase the
expected log-likelihood
function.
• This paper uses only the
split/merge criterion from
Ueda et al., and not the
SMEM algorithm.
Model Building
• Merge criterion (equation omitted here): components are compared through their vectors of posterior responsibilities.
• Here the parameters are those estimated from standard EM, and the responsibility vector of component i collects the posterior probability of each data point under that component.
• Merge the pair of components whose responsibility vectors are most similar.
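• For reference, the merge criterion of Ueda et al. takes (up to normalization) the form

    J_{\mathrm{merge}}(i, j; \Theta^{*}) = P_i(\Theta^{*})^{\top} P_j(\Theta^{*}),

where P_i(\Theta^{*}) = \big(p(i \mid x_1; \Theta^{*}), \ldots, p(i \mid x_N; \Theta^{*})\big)^{\top} stacks the posterior responsibilities of component i over the data; the pair with the largest score is the strongest merge candidate.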
Model Building
• Split criterion (equation omitted here).
• Define the KL divergence between an estimated mixture component k and the empirical density of the data belonging to component k.
• Split the component which most disagrees with its empirical density.
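• In the notation of Ueda et al., this split criterion can be written

    J_{\mathrm{split}}(k; \Theta^{*}) = \int f_k(x; \Theta^{*}) \log \frac{f_k(x; \Theta^{*})}{p(x \mid \theta_k^{*})}\, dx,

where f_k is the local empirical density around component k (the data weighted by their responsibilities to k); the component with the largest divergence is the strongest split candidate.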
Model Building
• Each document in the current epoch is assumed to be in its own group, and each data point in its own clump. Notice that the update equations then collapse to the original equations.
Model Building
• Each topic is split into two daughter components, and the data from the current epoch with high posterior probability under that topic is then hard-assigned to one of the two daughter topics.
Model Building
• To summarize: the clumps
and new data are
considered together during
model inference, but the
effect of a clump’s
assignment to a topic must
be scaled to represent all
the data contained in the
clump. In this way, the
historical data points in
each clump don’t need to be
considered.
Model Building
• Merges are done by
choosing two topics, taking
the data with the highest
posterior probability for
those two topics and
combining them into a new
meta-topic.
Model Building
• In this step, a specific action, either splitting a certain topic or merging two topics, is chosen such that it maximizes the improvement in the lower bound.
Model Building
• After taking the action of
either split or merge, the
algorithm is run again
considering all the data and
clumps, as well as all the
mixture components.
Model Compression
• This is done so that
later, for each cluster,
only the data with high
posterior probability for
that cluster is
considered.
Model Compression
• Define the artificially scaled data sample size.
• Define the artificially scaled number of documents, where the targets of the scaling are the expected lifetime data size and the expected lifetime number of documents.
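• One plausible way to write this scaling (the symbols here are mine, not the paper’s): if n data points and m documents have been seen so far, and N_life and M_life are the expected lifetime totals, then the accumulated sufficient statistics are rescaled as

    \hat{S} = \frac{N_{\mathrm{life}}}{n}\, S_n, \qquad \hat{D} = \frac{M_{\mathrm{life}}}{m}\, D_m,

so that the hyperparameter updates and the lower bound behave as if the full lifetime dataset had already been observed.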
Model Compression
• Variational parameter updates (equations omitted here).
• Hyperparameter updates (equations omitted here).
Model Compression
• Due to the scaled
sample size and
document count, the
lower bound equation is
also modified
Model Compression
• A key point here is that the splitting procedure is used only to extract new data clumps, not to refine the model as in the model-building step.
• The splits from this
recursive procedure are
cached temporarily,
and extracted at the
end as clumps when
the memory bound has
been reached.
Model Compression
• The data is not cached at the end of each epoch. Instead, each clump stores sufficient statistics for full-covariance Gaussian mixture components (in this application).
• The total memory needed (equation omitted here) grows with the feature dimensionality d and the number of document groups |S|.
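• As a rough accounting (not the paper’s exact formula): the sufficient statistics of a full-covariance Gaussian clump in d dimensions are a count, a sum of feature vectors, and a symmetric scatter matrix, i.e. about

    1 + d + \frac{d(d+1)}{2}

numbers per clump, plus on the order of |S| additional values per clump if each document group maintains its own topic responsibilities, which is why the bound involves both d and |S|.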
Model Compression
• Another way to control memory usage is to merge document groups through the use of a heuristic (omitted here).
• Initially each document is in its own “group”, but after the first iteration we are in a position to merge multiple documents into shared groups.
Experimental Results
• Two different tests were conducted: an image segmentation task and an object recognition/retrieval task.
• The goal is not to show state-of-the-art
performance, but rather to compare accuracy and
memory usage between a conventional algorithm
and its memory bounded equivalent.
Experimental Results
• The first experiment is to segment images.
• Dataset: 435 facial images from the “Caltech-Easy” dataset.
• Pixels are represented as a 5 dimensional feature
vector: {x,y,color1,color2,color3}.
• Each image was scaled to 200 × 160 pixels; 435 images with a 5-dimensional feature per pixel means 69,600,000 real numbers to store (see the arithmetic below).
• The baseline non-memory bounded algorithm ran
out of memory after storing 30 images.
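• The storage count above follows directly from the image size and feature dimension:

    435 \times 200 \times 160 \times 5 = 69{,}600{,}000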
Experimental Results
• (a) Original image
• (b) Result from baseline method.
• (c) Result from memory bounded model using 30% of the
dataset, 125 clumps, and image grouping.
• (d) Result from memory bounded model using 30% of the
dataset, 125 clumps, and no image grouping.
Experimental Results
• The top row shows sample clumps, with the colors corresponding to different clumps. The bottom row shows the corresponding segmentations based on those clumps.
Experimental Results
• Results when trained on the entire 435 image
dataset. No document merges were used.
• The total memory requirement was 257 MB, compared to 9.3 GB for the baseline batch-processing model.
• Each epoch consisted of 10 images, resulting in 10
learning rounds. Total runtime was 15 hours.
Experimental Results
• The second test was object recognition/retrieval.
• The training set consisted of 3,000 images; the test set consisted of 1,000 images.
• Features: 128-dimensional SIFT features from 500 randomly chosen locations in each image.
• The batch model was not compared here because it
could not handle the data size.
• Epochs consisted of 60 images at a time.
Experimental Results
• A single topic model was built for all training images, and a score for each test image was computed (equation omitted here).
• A nearest-neighbor (1-NN) algorithm was then used for classification.
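• To illustrate the classification step only (the random “score” vectors below are stand-ins, not the paper’s per-image scores):

    import numpy as np

    def one_nn_classify(train_feats, train_labels, test_feats):
        """1-nearest-neighbor classification with Euclidean distance."""
        preds = []
        for x in test_feats:
            dists = np.linalg.norm(train_feats - x, axis=1)  # distance to every training image
            preds.append(train_labels[np.argmin(dists)])     # label of the closest training image
        return np.array(preds)

    rng = np.random.default_rng(0)
    train_feats = rng.random((3000, 20))            # stand-in per-image representations
    train_labels = rng.integers(0, 10, size=3000)   # stand-in class labels
    test_feats = rng.random((1000, 20))
    predictions = one_nn_classify(train_feats, train_labels, test_feats)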
Experimental Results
• When the 1-NN accuracy is plotted against the
memory bound, the accuracy seems to saturate for
M > 10000.
• When M is fixed at 10000, the performance drops
significantly only when a small number of
documents is permitted.
Conclusion
• The authors introduced a new family of variational Bayesian models which do not have limited capacity and do not require that the entire historical dataset be cached.
• This method speeds up inference and allows for the processing of much larger datasets than could previously be considered, with a negligible (claimed) loss in accuracy.
• A drawback of this algorithm is that at each epoch,
inference (either complete or restricted) must be
performed several times.