Scaling Up Graphical Model Inference
Graphical Models
• View observed data and unobserved properties as random variables
• Graphical Models: compact graph-based encoding of probability distributions (high dimensional, with complex dependencies)
• Generative/discriminative/hybrid; unsupervised, semi-supervised, and supervised learning
– Bayesian Networks (directed), Markov Random Fields (undirected), hybrids, extensions, etc.: HMM, CRF, RBM, M3N, HMRF, etc.
• Enormous research area with a number of excellent tutorials
– [J98], [M01], [M04], [W08], [KF10], [S11]
Graphical Model Inference
• Key issues:
– Representation: syntax and semantics (directed/undirected, variables/factors, ...)
– Inference: computing probabilities and most likely assignments/explanations
– Learning of model parameters based on observed data. Relies on inference!
• Inference is NP-hard (numerous results, incl. approximation hardness)
• Exact inference: works for a very limited subset of models/structures
– E.g., chains or other low-treewidth graphs
• Approximate inference: highly computationally intensive
– Deterministic: variational methods, loopy belief propagation, expectation propagation
– Numerical sampling (Monte Carlo): Gibbs sampling (a minimal sketch follows this list)
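To make the sampling route concrete, here is a minimal Gibbs-sampling sketch in Python/NumPy for a small binary pairwise MRF (Ising-style grid). The grid size, coupling strength, and iteration counts are illustrative assumptions, not values from the talk.

```python
import numpy as np

# Gibbs sampling on a binary pairwise MRF with potentials exp(coupling * x_i * x_j).
# Grid size, coupling, and sweep counts are illustrative choices.
rng = np.random.default_rng(0)
H, W, coupling, n_sweeps, burn_in = 16, 16, 0.6, 200, 50
x = rng.choice([-1, 1], size=(H, W))          # current joint state

def neighbors_sum(x, i, j):
    """Sum of the four grid neighbors (missing ones contribute 0)."""
    total = 0
    if i > 0:     total += x[i - 1, j]
    if i < H - 1: total += x[i + 1, j]
    if j > 0:     total += x[i, j - 1]
    if j < W - 1: total += x[i, j + 1]
    return total

samples = []
for sweep in range(n_sweeps):
    for i in range(H):
        for j in range(W):
            # Conditional p(x_ij = +1 | neighbors) = sigmoid(2 * coupling * neighbor_sum)
            field = coupling * neighbors_sum(x, i, j)
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
            x[i, j] = 1 if rng.random() < p_plus else -1
    samples.append(x.copy())

# Marginals are estimated by averaging samples after a burn-in period.
marginal_plus = (np.mean(samples[burn_in:], axis=0) + 1) / 2
print(marginal_plus.round(2))
```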
Inference in Undirected Graphical Models
• Factor graph representation:
  P(x_1, ..., x_n) = (1/Z) ∏_{(i,j)} ψ_{ij}(x_i, x_j), the product running over neighboring pairs x_j ∈ N(x_i)
• Potentials capture compatibility of related observations
– e.g., ψ(x_i, x_j) = exp(−c |x_i − x_j|)
• Loopy belief propagation = message passing
– iterate (read, update, send); a minimal sketch follows
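Below is a minimal sketch of synchronous loopy BP in Python/NumPy for a tiny pairwise model, using the exp(−c|x_i − x_j|) compatibility above. The 3-node cycle, number of states, uniform unary potentials, and iteration count are assumptions made purely for illustration.

```python
import numpy as np

# Synchronous loopy BP on a small loopy pairwise MRF (a triangle of discrete variables).
S, c = 4, 1.0                                   # states per variable, smoothing strength
edges = [(0, 1), (1, 2), (2, 0)]
vals = np.arange(S)
psi = np.exp(-c * np.abs(vals[:, None] - vals[None, :]))  # edge potential, S x S
phi = {i: np.ones(S) for i in range(3)}         # uniform unary potentials (assumption)

# Messages m[(i, j)][x_j]: what node i tells node j about x_j.
msgs = {(i, j): np.ones(S) / S for (a, b) in edges for (i, j) in [(a, b), (b, a)]}

for it in range(50):
    new_msgs = {}
    for (i, j) in msgs:                         # "read, update" for every directed edge
        # Product of unary potential and messages into i from all neighbors except j.
        incoming = phi[i].copy()
        for (k, l) in msgs:
            if l == i and k != j:
                incoming *= msgs[(k, l)]
        m = psi.T @ incoming                    # sum over x_i of psi(x_i, x_j) * incoming(x_i)
        new_msgs[(i, j)] = m / m.sum()          # normalize for numerical stability
    msgs = new_msgs                             # "send": all messages updated simultaneously

# Beliefs: unary potential times all incoming messages, normalized.
for i in range(3):
    b = phi[i].copy()
    for (k, l) in msgs:
        if l == i:
            b *= msgs[(k, l)]
    print(i, (b / b.sum()).round(3))
```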
Synchronous Loopy BP
• Natural parallelization: associate a processor with every node
– Simultaneous receive, update, send
• Inefficient – e.g., for a linear chain:
– 2n/p time per iteration
– n iterations to converge (≈ 2n²/p total, versus 2n messages for a single sequential forward-backward pass)
[SUML-Ch10]
Optimal Parallel Scheduling
• Partition the chain, run local forward-backward passes for the center of each block, then send cross-boundary messages
[Diagram: synchronous vs. optimal schedules across Processors 1–3, contrasting the parallel component, the sequential component, and the gap between them]
Splash: Generalizing Optimal Chains
1) Select a root, grow a fixed-size BFS spanning tree
2) Forward pass: compute all messages at each vertex
3) Backward pass: compute all messages at each vertex
• Parallelization:
– Partition graph
• Maximize computation, minimize communication
• Over-partition and randomly assign
– Schedule multiple Splashes
• Priority queue for selecting roots
• Belief residual: cumulative change from inbound messages
• Dynamic tree pruning (see the scheduling sketch after this list)
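The Python sketch below shows one way the Splash scheduling loop described above could be organized: a residual-keyed priority queue picks roots, a bounded BFS tree is grown, and messages are recomputed in a forward (leaves to root) then backward (root to leaves) sweep. The update_vertex hook, tolerance, and tree size are hypothetical placeholders, and dynamic tree pruning is omitted.

```python
import heapq
from collections import deque

def grow_bfs_tree(root, neighbors, max_size):
    """Fixed-size BFS spanning tree rooted at `root`, returned in BFS visit order."""
    order, seen, frontier = [], {root}, deque([root])
    while frontier and len(order) < max_size:
        v = frontier.popleft()
        order.append(v)
        for u in neighbors[v]:
            if u not in seen:
                seen.add(u)
                frontier.append(u)
    return order

def splash_schedule(neighbors, update_vertex, residuals, max_size=10, tol=1e-3):
    """Residual-prioritized Splash scheduling.

    `update_vertex(v)` is a hypothetical hook: it recomputes v's messages/belief
    and returns a dict {neighbor: new belief residual} induced by those messages.
    """
    pq = [(-r, v) for v, r in residuals.items()]    # max-queue via negated residuals
    heapq.heapify(pq)
    while pq:
        neg_r, root = heapq.heappop(pq)
        if -neg_r < tol or -neg_r != residuals[root]:
            continue                                # converged or stale queue entry
        residuals[root] = 0.0
        order = grow_bfs_tree(root, neighbors, max_size)
        for v in order[::-1] + order:               # forward pass, then backward pass
            for u, new_r in update_vertex(v).items():
                residuals[u] = new_r                # accumulate belief residuals
                heapq.heappush(pq, (-new_r, u))
```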
DBRSplash: MLN Inference Experiments
• Experiment 1: 8K variables, 406K factors
– Single-CPU runtime: 1 hour
– Cache efficiency critical
• Experiment 2: 1K variables, 27K factors
– Single-CPU runtime: 1.5 minutes
– Network costs limit speedups
[Plots: speedup vs. number of CPUs (0–120) for both experiments, comparing No Over-Part against 5x Over-Part partitioning]
Topic Models
• Goal: unsupervised detection of topics in corpora
– Desired result: topic mixtures, per-word and per-document topic assignments
[B+03]
Directed Graphical Models:
Latent Dirichlet Allocation [B+03, SUML-Ch11]
• Generative model for document collections
– K topics; topic k: Multinomial(φ_k) over words
– D documents; document j:
• Topic distribution θ_j ∼ Dirichlet(α)
• N_j words; word x_ij:
– Sample topic z_ij ∼ Multinomial(θ_j)
– Sample word x_ij ∼ Multinomial(φ_{z_ij})
[Plate diagram: prior on topic distributions α → document's topic distribution θ_j → word's topic z_ij → word x_ij ← topic's word distribution φ_k ← prior on word distributions β, with plates over the N_j words, D documents, and K topics]
• Goal: infer posterior distributions (a generative-process sketch in code follows this slide)
– Topic word mixtures {φ_k}
– Document mixtures {θ_j}
– Word-topic assignments {z_ij}
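A minimal NumPy sketch of the generative process described above; the number of topics, documents, vocabulary size, document lengths, and hyperparameter values are illustrative assumptions.

```python
import numpy as np

# LDA generative process: topics phi_k, per-document mixtures theta_j,
# per-word topic z_ij, and word x_ij. All sizes/priors below are made up.
rng = np.random.default_rng(0)
K, D, V, alpha, beta = 3, 5, 20, 0.5, 0.1       # topics, documents, vocabulary, priors
doc_lengths = rng.integers(10, 20, size=D)       # N_j for each document

phi = rng.dirichlet(np.full(V, beta), size=K)    # topic k: Multinomial(phi_k) over words
docs = []
for j in range(D):
    theta_j = rng.dirichlet(np.full(K, alpha))   # document's topic distribution
    words = []
    for i in range(doc_lengths[j]):
        z_ij = rng.choice(K, p=theta_j)          # sample topic z_ij ~ Multinomial(theta_j)
        x_ij = rng.choice(V, p=phi[z_ij])        # sample word  x_ij ~ Multinomial(phi_{z_ij})
        words.append((x_ij, z_ij))
    docs.append(words)

print(docs[0])                                   # (word id, latent topic) pairs for document 0
```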
Gibbs Sampling
• Full joint probability:
  p(θ, z, φ, x | α, β) = ∏_{k=1..K} p(φ_k | β) · ∏_{j=1..D} p(θ_j | α) ∏_{i=1..N_j} p(z_ij | θ_j) p(x_ij | φ_{z_ij})
• Gibbs sampling: alternately sample θ, φ, z
• Problem: slow convergence (a.k.a. slow mixing)
• Collapsed Gibbs sampling
– Integrate out θ and φ analytically; writing n' for counts that exclude the word currently being resampled:
  p(z | x, d, α, β) ∝ (n'_{xz} + β)(n'_{dz} + α) / [ ∑_x (n'_{xz} + β) · ∑_z (n'_{dz} + α) ]
– Until convergence:
• resample p(z_ij | x_ij, α, β)
• update counts n_z, n_{dz}, n_{xz}
(a minimal sampler sketch follows this list)
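A minimal Python sketch of this collapsed sampler: θ and φ are never represented explicitly; each z_ij is resampled after removing its own contribution from the count tables (the primed counts), and the counts are then updated in place. The corpus format (a list of word-id lists) and hyperparameter defaults are illustrative assumptions; the count arrays n_xz, n_jz, n_z mirror the counts above.

```python
import numpy as np

def collapsed_gibbs(docs, V, K, alpha=0.5, beta=0.1, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. `docs` is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    n_xz = np.zeros((V, K))                 # word-topic counts
    n_jz = np.zeros((len(docs), K))         # document-topic counts
    n_z = np.zeros(K)                       # topic counts
    z = []                                  # current topic assignment per token
    for j, doc in enumerate(docs):          # random initialization
        z_j = rng.integers(K, size=len(doc))
        z.append(z_j)
        for x, k in zip(doc, z_j):
            n_xz[x, k] += 1; n_jz[j, k] += 1; n_z[k] += 1

    for _ in range(n_iters):
        for j, doc in enumerate(docs):
            for i, x in enumerate(doc):
                k_old = z[j][i]             # remove this token's contribution (primed counts)
                n_xz[x, k_old] -= 1; n_jz[j, k_old] -= 1; n_z[k_old] -= 1
                # p(z | x, j) ∝ (n'_xz + beta) / (n'_z + V*beta) * (n'_jz + alpha)
                p = (n_xz[x] + beta) / (n_z + V * beta) * (n_jz[j] + alpha)
                k_new = rng.choice(K, p=p / p.sum())
                z[j][i] = k_new             # update counts with the new assignment
                n_xz[x, k_new] += 1; n_jz[j, k_new] += 1; n_z[k_new] += 1
    return z, n_xz, n_jz
```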
Parallel Collapsed Gibbs Sampling [SUML-Ch11]
• Synchronous version (MPI-based):
– Distribute documents among P machines
– Global topic and word-topic counts n_z, n_{wz}
– Local document-topic counts n_{jz}
– After each local iteration, AllReduce n_z, n_{wz} (an mpi4py sketch follows this slide)
• Asynchronous version: gossip (P2P)
– Random pairs of processors exchange statistics upon pass completion
– Approximate global posterior distribution (experimentally not a problem)
– Additional estimation to properly account for previous counts from the neighbor
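A sketch of how the per-sweep synchronization could look with mpi4py: each machine runs a local collapsed-Gibbs sweep over its documents (the local_gibbs_sweep hook is a hypothetical placeholder, e.g. the sampler sketched earlier), then all machines sum their count deltas with AllReduce and fold them back into identical replicated copies of n_z and n_wz; document-topic counts stay local.

```python
import numpy as np
from mpi4py import MPI

def sync_counts(comm, n_wz, n_z, n_wz_ref, n_z_ref):
    """AllReduce the local changes since the last sync and rebuild consistent copies."""
    delta_wz, delta_z = n_wz - n_wz_ref, n_z - n_z_ref
    total_wz, total_z = np.empty_like(delta_wz), np.empty_like(delta_z)
    comm.Allreduce(delta_wz, total_wz, op=MPI.SUM)   # sum of every machine's updates
    comm.Allreduce(delta_z, total_z, op=MPI.SUM)
    n_wz[:] = n_wz_ref + total_wz                    # identical global counts everywhere
    n_z[:] = n_z_ref + total_z
    return n_wz.copy(), n_z.copy()                   # reference point for the next sweep

def run(local_docs, V, K, n_sweeps, local_gibbs_sweep):
    comm = MPI.COMM_WORLD
    n_wz, n_z = np.zeros((V, K)), np.zeros(K)        # replicated word-topic / topic counts
    n_jz = np.zeros((len(local_docs), K))            # document-topic counts stay local
    n_wz_ref, n_z_ref = n_wz.copy(), n_z.copy()
    for _ in range(n_sweeps):
        local_gibbs_sweep(local_docs, n_wz, n_jz, n_z)   # mutates counts in place
        n_wz_ref, n_z_ref = sync_counts(comm, n_wz, n_z, n_wz_ref, n_z_ref)
```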
Parallel Collapsed Gibbs Sampling [SN10,S11]
• Multithreading to maximize concurrency
– Parallelize both local and global updates of n_{xz} counts
– Key trick: n_z and n_{xz} are effectively constant for a given document
• No need to update continuously: update once per document in a separate thread
• Enables multithreading the samplers
– Global updates are asynchronous -> no blocking (a threading sketch follows)
[S11]
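A small Python threading sketch of this idea: sampler threads read the shared count tables as if they were constant while processing a single document, and hand per-document count deltas to one updater thread that applies them asynchronously, so sampling never blocks on global updates. The sample_document hook and the dict-based count table are hypothetical placeholders.

```python
import queue
import threading

def run_threaded(docs, global_counts, sample_document, n_samplers=4):
    """Sampler threads produce per-document count deltas; one updater thread applies them."""
    doc_queue, delta_queue = queue.Queue(), queue.Queue()
    for d in docs:
        doc_queue.put(d)

    def sampler():
        while True:
            try:
                d = doc_queue.get_nowait()
            except queue.Empty:
                return
            # Global counts are only read here (treated as constant for this document).
            delta_queue.put(sample_document(d, global_counts))

    def updater():
        while True:
            delta = delta_queue.get()
            if delta is None:                     # shutdown sentinel
                return
            for key, change in delta.items():     # asynchronous global count update
                global_counts[key] = global_counts.get(key, 0) + change

    threads = [threading.Thread(target=sampler) for _ in range(n_samplers)]
    upd = threading.Thread(target=updater)
    upd.start()
    for t in threads: t.start()
    for t in threads: t.join()
    delta_queue.put(None)                         # stop the updater after all deltas
    upd.join()
```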
Scaling Up Graphical Models: Conclusions
• Extremely high parallelism is achievable, but variance is high
– Strongly data dependent
• Network and synchronization costs can be explicitly accounted for in algorithms
• Approximations are essential to removing barriers
• Multi-level parallelism allows maximizing utilization
• Multiple caches allow super-linear speedups
References
[SUML-Ch11] A. Asuncion, P. Smyth, M. Welling, D. Newman, I. Porteous, and S. Triglia. Distributed Gibbs Sampling for Latent Variable Models. In "Scaling Up Machine Learning", Cambridge University Press, 2011.
[B+03] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[B11] D. Blei. Introduction to Probabilistic Topic Models. Communications of the ACM, 2011.
[SUML-Ch10] J. Gonzalez, Y. Low, and C. Guestrin. Parallel Belief Propagation in Factor Graphs. In "Scaling Up Machine Learning", Cambridge University Press, 2011.
[KF10] D. Koller and N. Friedman. Probabilistic Graphical Models. MIT Press, 2010.
[M01] K. Murphy. An Introduction to Graphical Models, 2001.
[M04] K. Murphy. Approximate Inference in Graphical Models. AAAI Tutorial, 2004.
[S11] A. J. Smola. Graphical Models for the Internet. MLSS Tutorial, 2011.
[SN10] A. J. Smola and S. Narayanamurthy. An Architecture for Parallel Topic Models. VLDB, 2010.
[W08] M. Wainwright. Graphical Models and Variational Methods. ICML Tutorial, 2008.