Transcript Slide 1
Seq-ing the Epigenetic Code with Exact Bayesian Network Structure Learning
Brandon M. Malone 1,2, Changhe Yuan 1, Eric Hansen 1 and Susan M. Bridges 1,2
1 Department of Computer Science & Engineering, Mississippi State University
2 Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University
Abstract
The epigenetic code hypothesis [Jaenisch] proposes that patterns of post-translational modifications to the histone core proteins, the presence of transcription factor binding sites and other genomic features influence expression of the associated DNA. Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-Seq) is frequently used to characterize these features at a genome-wide scale. Previous studies [Yu] have used approximation techniques to learn relationships among them. In this work, we apply a novel exact Bayesian network learning algorithm to learn a network structure which identifies regulatory relationships among a set of epigenetic features in human CD4 cells [Barski]. Comparison to networks learned using greedy methods reveals that our network identifies more biologically relevant relationships. Because we apply an exact, optimal learning algorithm instead of an approximate, greedy one, the relationships we learn are unaffected by uncertainty introduced by the structure learning algorithm itself.
The Epigenetic Code

The central dogma of molecular biology (roughly) states that DNA is transcribed into RNA, which is translated into proteins. Proteins perform many of the functions in the body. We have the same DNA in most of our cells, yet our cells perform quite different functions. One reason for this differentiation lies in the epigenetic code.

When DNA forms chromosomes, it packs together very tightly into a structure called chromatin. The DNA coils around a group of eight proteins called histones. Figure 1 summarizes chromatin packaging.
Figure 1. Chromatin packaging and histones. (http://themedicalbiochemistrypage.org/)
The histone proteins include a tail domain which is highly susceptible to a large number of post-translational modifications that affect the attraction between histones. The attraction between histones can increase, tightening the surrounding chromatin and suppressing expression; the chromatin can also loosen, increasing expression.

The combination of modifications present determines the effect on the chromatin structure, and some histone modifications affect the likelihood of other modifications. The epigenetic code hypothesis [Jaenisch] proposes that the combination of histone modifications, as well as other features such as the presence of transcription factor binding sites, serves as a type of message to present and future generations of cells about regulation.
ChIP-Seq

We can measure the presence of a particular histone modification in cells using chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq). The ChIP-Seq process [Illumina]:

1. Start with raw DNA.
2. The DNA is sheared into pieces around 200 bp in length.
3. The pieces are immunoprecipitated against an antibody to extract the desired pieces.
4. The remaining pieces of DNA are sequenced.
5. The sequenced DNA is mapped back to the genome.

Bayesian Networks

Representation. Joint probability distribution over a set of variables
Structure. Directed acyclic graph (DAG) storing conditional dependencies
• Vertices correspond to variables.
• Edges indicate relationships among variables.
Parameters. Conditional probability tables quantifying the relationships
Scoring. Minimum Description Length (MDL) [Schwarz], Equation 1

Equations

The MDL score of a network is the sum of the scores of its variables given their parents in the network, where

\[
\mathrm{MDL}(X \mid U) = H(X \mid U) + \frac{\log N}{2} K(X \mid U),
\quad
H(X \mid U) = -\sum_{x,u} N_{x,u} \log \frac{N_{x,u}}{N_u},
\quad
K(X \mid U) = (r_X - 1) \prod_{Y \in U} r_Y.
\tag{1}
\]

Here N is the number of records, N_u and N_{x,u} count the records with U = u (and X = x), and r_X is the number of states of X. The best subnetwork over a variable set U satisfies

\[
\mathrm{Score}(U) = \min_{X \in U} \left\{ \mathrm{Score}(U \setminus \{X\}) + \mathrm{BestScore}(U \setminus \{X\}, X) \right\},
\tag{2}
\]

where BestScore(U, X) is the best score of X using any subset of U as its parents; the optimal network over all variables V has score Score(V).

Optimal Learning with Dynamic Programming

In the case of a ChIP-Seq dataset, we do not know the relationships among the variables, so we must learn them. Singh and Moore [2005] proposed a dynamic programming algorithm to learn an optimal Bayesian network which minimizes the MDL score; Equation 2 expresses the idea recursively. Silander and Myllymaki [2006] refined the algorithm by reversing the process. The intuition:

1. The optimal Bayesian network structure is a DAG, so it has a leaf variable with no children.
2. Remove that leaf and its edges from the network.
3. The remaining subnetwork is also a DAG, so it has a leaf.
4. Recursively find optimal leaves until an empty subnetwork remains.
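As a concrete illustration of the recurrence in Equation 2, here is a minimal Python sketch of the dynamic programming formulation. The table of local scores and the three-variable setup are hypothetical stand-ins; a real implementation would compute MDL(X | U) from the data as in Equation 1.

    from functools import lru_cache
    from itertools import combinations

    # Hypothetical local scores MDL(X | parents) for three variables 0, 1 and 2.
    MDL = {
        (0, ()): 5.0, (0, (1,)): 4.0, (0, (2,)): 6.0, (0, (1, 2)): 3.5,
        (1, ()): 7.0, (1, (0,)): 6.5, (1, (2,)): 5.5, (1, (0, 2)): 5.0,
        (2, ()): 4.0, (2, (0,)): 3.0, (2, (1,)): 4.5, (2, (0, 1)): 2.5,
    }

    def best_score(candidates: frozenset, x: int) -> float:
        """BestScore(U, X): the best score of X using any subset of U as parents."""
        return min(
            MDL[(x, pa)]
            for k in range(len(candidates) + 1)
            for pa in combinations(sorted(candidates), k)
        )

    @lru_cache(maxsize=None)
    def score(u: frozenset) -> float:
        """Score(U) from Equation 2: the best subnetwork over the variables in U."""
        if not u:
            return 0.0
        # Some variable in U must be a leaf of the optimal subnetwork over U.
        return min(score(u - {x}) + best_score(u - {x}, x) for x in u)

    print(score(frozenset({0, 1, 2})))  # MDL score of the optimal 3-variable network

Memoizing Score(U) over subsets is exactly what makes the approach exact yet tractable: each of the 2^n subsets is solved once rather than once per elimination order.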
Our Optimal Search Formulation

As suggested by Equation 2, learning an optimal Bayesian network consists of three phases, which we formulate as search problems.

Calculating Scores
Goal. Calculate MDL(X | U), the score of X using U as its parents
Representation. AD-tree [Moore]
Search Strategy. Depth-first
AD Node. Records with U = u
Vary Node. Records with U = u, X = x
Successor. Instantiate a new X
Storage. Written to disk

Identifying Optimal Parent Sets
Goal. Calculate BestScore(U, X), which selects the best parents of X from U
Representation. Sorted score arrays and bit arrays
Search Strategy. On demand
Successor. Use bit operators to find scores consistent with U \ {Y}
Score. scores[firstBit(usable(X))]
Storage. Arrays and bit sets

Learning Optimal Subnetworks
Goal. Calculate Score(U), the best subnetwork for the variables in U
Representation. Order graph [Yuan]
Search Strategy. Breadth-first
Node. Score(U) for some U
Successor. Use X as a leaf of U
Score. Score(U) + BestScore(U, X)
Storage. Hash table or written to disk
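To make the first phase concrete, here is a minimal Python sketch that gathers the counts N_u and N_{x,u} and evaluates Equation 1. The records and arities are hypothetical, and a plain pass over the data stands in for the depth-first AD-tree traversal, which gathers the same counts more efficiently.

    from collections import Counter
    from math import log

    # Hypothetical discretized records (one tuple of variable values per gene).
    records = [(0, 1, 1), (0, 0, 1), (1, 1, 0), (1, 1, 1), (0, 1, 0)]
    arity = [2, 2, 2]  # number of states r_X of each variable

    def mdl(x: int, parent_vars: tuple) -> float:
        """MDL(X | U) from Equation 1, computed from the counts N_u and N_{x,u}."""
        n = len(records)
        n_u, n_xu = Counter(), Counter()
        for r in records:
            u = tuple(r[p] for p in parent_vars)
            n_u[u] += 1           # AD node: records with U = u
            n_xu[(u, r[x])] += 1  # vary node: records with U = u, X = x
        h = -sum(c * log(c / n_u[u]) for (u, _), c in n_xu.items())
        k = arity[x] - 1
        for p in parent_vars:
            k *= arity[p]
        return h + (log(n) / 2) * k

    print(mdl(2, (0,)))  # score of variable 2 with variable 0 as its only parent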
A small worked example illustrates the parent score data structures for a single variable X whose potential parents are numbered 1, 2 and 3.

Calculate and sort all of the scores for the variable:

parents:  {1,2}   {2}   {1}   {1,3}   {3}   {}   {2,3}
scores:       8    10    11      12    13   15      20

Mark which scores use each variable (n - 1 of these for each variable); uses[1], for example, marks the scores whose parent sets contain variable 1 ({1,2}, {1} and {1,3} above).

Initially, a variable can use all scores, and the first is optimal. When X is used as a leaf, find the usable parent scores with (usable & ~uses[X]); the first set bit is optimal.

For the subnetwork search, the order graph over four variables contains one node for each subset, from ϕ at the top to {1,2,3,4} at the bottom, and the successors of a node U are the supersets U + {X}. Expanding a node considers each remaining variable as the next leaf:

Expand(U)
    for each X not in U
        newScore = U.score + BestScore(U, X)
        succ = get(U + {X})
        if newScore < succ.score
            put(U + {X}, newScore)
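The following is a minimal Python sketch of the sorted-score and bit-array lookup, using arbitrary-precision integers as bit sets. The scores mirror the worked example above; the helper names are our own, not the poster's.

    # Sorted score list for one variable X; parent sets use variables 1..3.
    parents = [{1, 2}, {2}, {1}, {1, 3}, {3}, set(), {2, 3}]
    scores  = [     8,   10,   11,     12,   13,    15,     20]

    ALL = (1 << len(scores)) - 1  # bit i set <=> scores[i] is still usable

    # uses[y]: bit i set <=> parent set i contains variable y.
    uses = {
        y: sum(1 << i for i, pa in enumerate(parents) if y in pa)
        for y in (1, 2, 3)
    }

    def first_bit(mask: int) -> int:
        """Index of the lowest set bit; bit 0 corresponds to the best score."""
        return (mask & -mask).bit_length() - 1

    def best_score(candidate_parents: set) -> int:
        """BestScore: the best score of X whose parent set fits the candidates."""
        usable = ALL
        for y in (1, 2, 3):
            if y not in candidate_parents:
                usable &= ~uses[y]  # drop scores that use the excluded variable
        return scores[first_bit(usable)]

    print(best_score({1, 2, 3}))  # 8: all scores usable, the first is optimal
    print(best_score({2, 3}))     # 10: scores using variable 1 are masked out

Because the scores are sorted, the lowest set bit of the usable mask always identifies the optimal parent set, so each lookup is a handful of bit operations.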
Frontier Breadth-first Branch and Bound Search
The order graph has a very regular structure. The successors for a node in layer l always appear in layer l+1. This observation allows
us to keep only two layers in memory rather than all n. Furthermore, we can calculate how good a particular node can possibly be. If
this is worse than a known bound, we safely disregard it. If optimality is not needed, we disregard many nodes to reduce running time.
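The Python sketch below illustrates the layer-by-layer frontier search with a simple branch-and-bound check. The functions best_score and heuristic are assumed to be supplied (the optimal-parent lookup above and an optimistic completion estimate); the poster does not specify the bound computation, so both are hypothetical stand-ins.

    def frontier_search(variables: frozenset, best_score, heuristic, upper_bound):
        """Frontier breadth-first branch-and-bound search over the order graph.

        Only two layers of the order graph are in memory at a time.
        """
        current = {frozenset(): 0.0}  # layer l holds Score(U) for each |U| = l

        for _ in range(len(variables)):
            next_layer = {}
            for u, g in current.items():
                for x in variables - u:
                    succ = u | {x}
                    new_score = g + best_score(u, x)
                    # Branch and bound: a node that cannot beat the known
                    # bound is safely disregarded.
                    if new_score + heuristic(succ) > upper_bound:
                        continue
                    if new_score < next_layer.get(succ, float("inf")):
                        next_layer[succ] = new_score
            current = next_layer  # the previous layer is discarded

        return current.get(frozenset(variables))  # Score(V); None if pruned

With an optimistic (admissible) heuristic and a valid upper bound, the result is still optimal; a tighter, unsafe bound discards more nodes at the cost of the optimality guarantee, matching the trade-off described above.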
Experiments
We used our optimal frontier breadth-first search algorithm to learn an optimal Bayesian network over the 23-variable data set and compared it to the greedy search used previously [Yu]. Figures 2 and 3 show the learned networks.
Data Set and Preprocessing

Raw Data. 30 human ChIP-Seq experiments [Barski]
Cellular Environment. CD4 cells (specialized white blood cells)
Normalization. Linear regression against an IgG control data set
Discretization. Genes clustered using MDL, separately for each experiment
Processed Data Set. A numeric array of length 30 for each gene
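As one plausible reading of the normalization step, the sketch below removes the linear trend explained by the IgG control by keeping least-squares residuals. The data here are synthetic, the exact procedure in the study may differ, and the MDL-based discretization step is not shown.

    import numpy as np

    def normalize_against_control(counts: np.ndarray, igg: np.ndarray) -> np.ndarray:
        """Remove the linear trend explained by the IgG control.

        counts: per-gene tag counts for one ChIP-Seq experiment.
        igg:    per-gene tag counts for the IgG control experiment.
        Returns the residuals of a least-squares linear fit.
        """
        slope, intercept = np.polyfit(igg, counts, deg=1)
        return counts - (slope * igg + intercept)

    # Synthetic stand-in data: 1000 genes for one of the 30 experiments.
    rng = np.random.default_rng(0)
    igg = rng.poisson(5.0, size=1000).astype(float)
    experiment = rng.poisson(8.0, size=1000).astype(float)
    normalized = normalize_against_control(experiment, igg)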
Results and Discussion

Figure 2. Learned structure with our optimal algorithm.
Figure 3. Learned structure with a standard greedy algorithm.

We focused on the transcription factor binding site for CTCF, which is known to play a role in the regulation of many elements. We therefore expect CTCF to be an ancestor of important regulatory elements. In our network, CTCF is a parent of the five most highly connected regulatory elements. The approximate algorithm identified four parents and three children of intermediate degree for CTCF.

Conclusions

We presented a frontier breadth-first search algorithm for learning optimal Bayesian networks that improves the memory complexity from O(2^n) to O(C(n, n/2)). Provably optimal solutions allow us to focus on interpreting the results. We learned the optimal structure of a network of epigenetic features; it included more biologically meaningful relationships than structures learned with greedy search.

Selected References

Jaenisch, R. and Bird, A. (2003). "Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals." Nature Genetics 33: 245-254.
Schwarz, G. (1978). "Estimating the dimension of a model." The Annals of Statistics 6(2): 461-464.
Barski, A., Cuddapah, S., et al. (2007). "High-resolution profiling of histone methylations in the human genome." Cell 129(4): 823-837.
Singh, A. P. and Moore, A. W. (2005). "Finding optimal Bayesian networks by dynamic programming." Technical Report 05-106, Carnegie Mellon University.
Silander, T. and Myllymaki, P. (2006). "A simple approach for finding the globally optimal Bayesian network structure." Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06). AUAI Press.
Yu, H., Zhu, S., et al. (2008). "Inferring causal relationships among different histone modifications and gene expression." Genome Research 18(8): 1314-1324.
Yuan, C., Malone, B. and Wu, X. (2011). "Learning optimal Bayesian networks using A* search." Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI-11).

Acknowledgments

This material is based on work supported by the National Science Foundation under Grants No. NSF EPS-0903787 and NSF IIS-0953723.