SysCODE_040909_97format

Initial Steps Toward Computational Discovery of Genetic Regulatory Networks in Pancreatic Islet Development
Georg Gerber, PhD
Gifford Laboratory, MIT CSAIL
SysCODE Meeting 4/9/09

Outline
• Goals
• Expression data overview
• TF-TF interaction networks
  ◦ pair-wise mutual information
  ◦ Bayesian networks
• Gene expression programs
• ChIP-seq data
• Directions for future work

Biological goals of building a transcriptional regulatory network of pancreatic specification
• Knowledge of distinct signaling/transcriptional steps involved in pancreatic specification
  ◦ Optimize ES differentiation by determining signaling event(s) directly inducing each sequential TF
• What is the network structure? Linear or cross-regulatory, parallel or all interrelated?
  ◦ Direct reprogramming using TFs would benefit from knowing hierarchy of each network
  ◦ Are TFs that play a role in specification of pancreas necessary for later function of pancreas, or are they merely required to properly induce other necessary TFs?
• Can knowledge of the pancreatic specification network teach us about lineage diversification within the pancreas (endocrine, exocrine, duct)?

Immediate computational goals
• Determine set of transcription factors active at different developmental stages
• Discover network “wiring”
• Determine how network changes/evolves throughout development
• Compare in vivo and ESC networks

Expression data overview
[Figure: sample sources for expression profiling]
In vivo tissues:
• E8.25: definitive endoderm (E7.75 and E8.75 as well), embryonic mesoderm, embryonic ectoderm/notochord
• E11.5: esophageal endoderm, lung endoderm, pancreatic endoderm (E10.5 as well), liver endoderm, stomach endoderm, intestinal endoderm
mESC differentiation:
• ES cells, 50 ng/mL ActA, 6 days
• FACS sort Sox17 GFP+ cells (Sox17GFP+Dpp4- definitive endoderm)
• DMSO or 2 uM RA, 6h/24h, and perform microarray
[Image panel labels: Tcf2, Foxa2; DMSO, 2 uM RA]
RA bead experiments:
1. Implant bead coated with DMSO/RA into foregut of E8.25 (4-6 somite) embryo
2. Explant embryo anterior to 1st somite
3. Culture for 6/24 hours
4. Dissociate, sort for EpCAM+ endoderm
5. Amplify RNA and profile on Illumina Mouse Ref8 v2 chips

Expression data overview (cont.)
• 120 Illumina arrays (18118 genes/array)
• 72 distinct experiments (41 in mESCs)
• Standardized mESC/in vivo experiments separately
• 2758 genes w/ ≥ 2-fold change in ≥ 5 experiments
• 154 TFs w/ ≥ 2-fold change in ≥ 5 experiments (out of 946 “definite” or “candidate” TFs from TFCat, Fulton et al, Genome Biology 2009)
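
The fold-change filter above can be sketched as follows. This is a minimal illustration, not the pipeline used for the talk: it assumes expression is stored as log2 ratios against some reference, and the gene names and values are made-up toy data.

```python
import math

def count_changed(log2_ratios, min_fold=2.0):
    """Count experiments in which a gene changes by at least min_fold (up or down)."""
    threshold = math.log2(min_fold)
    return sum(1 for r in log2_ratios if abs(r) >= threshold)

def filter_genes(expr, min_fold=2.0, min_experiments=5):
    """Keep genes with >= min_fold change in >= min_experiments experiments.

    expr maps a gene name to a list of log2 ratios, one per experiment
    (the log2-ratio representation is an assumption of this sketch).
    """
    return [gene for gene, ratios in expr.items()
            if count_changed(ratios, min_fold) >= min_experiments]

# Hypothetical toy data: 6 experiments per gene.
toy_expr = {
    "geneA": [1.5, 1.5, 1.5, 1.5, 1.5, 0.0],     # 5 experiments at >= 2-fold: kept
    "geneB": [-2.0, -2.0, -2.0, -2.0, 0.0, 0.0],  # only 4 experiments: dropped
    "geneC": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],      # never reaches 2-fold: dropped
}
kept = filter_genes(toy_expr)
```

Restricting the same filter to a list of known TFs (for example, the TFCat set) would give the TF subset.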




Limitations of expression data for genetic network reconstruction
• Need hundreds of varied experiments for finding relevant/significant networks
• Association ≠ causation
• High false positive rates (high dimensional, noisy, dependent data)
• High false negative rates (low TF transcript abundance, post-transcriptional regulation, etc.)

Pair-wise mutual information networks (CLR)
• Context Likelihood of Relatedness method: Faith et al., PLoS Biology 2007
• Computes MI between all pairs of genes
• Innovation: considers MI distribution for both target and source to compute p-values/estimate FDR
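
The core of the CLR idea can be sketched in a few lines: compute MI for every gene pair, then score each pair by how unusual its MI is relative to the MI distributions of both genes. The sketch below is illustrative only. It assumes simple equal-width binning to estimate MI (the published method uses a smoother estimator), it stops at the combined z-score without computing p-values or FDR, and the gene names and data are toy values.

```python
import math
from collections import Counter

def discretize(values, n_bins=3):
    """Equal-width binning (a simplifying assumption of this sketch)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant profiles
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(x, y):
    """Plug-in MI estimate (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def clr_scores(expr, n_bins=3):
    """CLR-style score for each gene pair; expr needs at least two genes."""
    genes = list(expr)
    binned = {g: discretize(expr[g], n_bins) for g in genes}
    mi = {g: {} for g in genes}
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            mi[a][b] = mi[b][a] = mutual_information(binned[a], binned[b])

    # Mean and standard deviation of each gene's MI values against all others.
    stats = {}
    for g in genes:
        vals = list(mi[g].values())
        mean = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        stats[g] = (mean, sd)

    def z(g, other):
        mean, sd = stats[g]
        return max(0.0, (mi[g][other] - mean) / sd) if sd > 0 else 0.0

    # Combine the z-scores from both genes' background distributions.
    return {(a, b): math.sqrt(z(a, b) ** 2 + z(b, a) ** 2)
            for i, a in enumerate(genes) for b in genes[i + 1:]}

# Hypothetical toy profiles: B tracks A; C and D are unrelated to everything.
toy_expr = {
    "A": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "B": [0, 0, 0, 2, 2, 2, 4, 4, 4],
    "C": [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "D": [0, 1, 2, 1, 2, 0, 2, 0, 1],
}
scores = clr_scores(toy_expr)
top_pair = max(scores, key=scores.get)
```

In this toy case only the A-B pair stands out against both genes' backgrounds, which is the behavior the "considers MI distribution for both target and source" bullet describes.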
CLR (cont.)
[Figures: TF-TF networks (MI), one per condition]
• E8.25 4-6s definitive endoderm
• E8.75 13-15s definitive endoderm
• E9.5 definitive endoderm
• E10.5 pancreatic endoderm
• E11.5 pancreatic endoderm
• E11.5 intestinal endoderm
• 6h 83 uM RA bead; mES 2 uM RA 6h
• 24h 83 uM RA bead; mES 2 uM RA 24h

Bayesian networks
• Directed networks, allow for multiple parents
• Encode conditional independence
• Penalize complexity automatically
• Software: Banjo (Alexander Hartemink, Duke University)
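
To illustrate how a Bayesian network score can penalize complexity, here is a minimal sketch that scores one node given a candidate parent set on discrete data. It uses the BIC score as a stand-in, which is an assumption of this example and not necessarily the metric Banjo applies; the variable names X, Y, Z and the data are hypothetical.

```python
import math
from collections import Counter

def bic_family_score(data, child, parents, n_states=2):
    """BIC score of `child` given `parents` (higher is better).

    data: list of dicts mapping variable name -> discrete state.
    Score = maximized log-likelihood - 0.5 * log(N) * (number of free parameters).
    """
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    log_lik = sum(c * math.log(c / parent_counts[pa])
                  for (pa, _), c in joint.items())
    n_params = (n_states - 1) * n_states ** len(parents)
    return log_lik - 0.5 * math.log(n) * n_params

# Hypothetical toy data: Y copies X exactly, Z is unrelated to Y.
rows = [{"X": x, "Z": z, "Y": x}
        for x in (0, 1) for z in (0, 1) for _ in range(4)]

score_none = bic_family_score(rows, "Y", [])
score_x = bic_family_score(rows, "Y", ["X"])
score_xz = bic_family_score(rows, "Y", ["X", "Z"])
```

Adding the informative parent X improves the score, while adding the uninformative parent Z on top of X lowers it again, because the extra parameters are not paid for by a better fit. A structure search would compare such scores across many candidate parent sets.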

[Figures: TF-TF networks (Bayes Net), one per condition]
• E8.25 4-6s definitive endoderm
• E8.75 13-15s definitive endoderm
• E9.5 definitive endoderm
• E10.5 pancreatic endoderm
• E11.5 pancreatic endoderm
• 6h 83 uM RA bead; mES 2 uM RA 6h
• 24h 83 uM RA bead; mES 2 uM RA 24h

Advantages to methods that discover groups of genes
• Infer more robust relationships by considering many genes together
• Allow for enrichment analysis
  ◦ Functional categories
  ◦ Signaling pathways
  ◦ TF DNA binding motifs
GeneProgram
Gerber et al, PLoS Comp Bio 2007
• Discovers sets of genes co-expressed across subsets of conditions
• Simultaneously models probabilistic structure of experiments (tissues) and genes
• Uses Hierarchical Dirichlet Processes, a fully Bayesian method for automatically determining the number of expression programs and tissue groups
• Outperforms state-of-the-art biclustering methods


[Figure: methods compared: hierarchical clustering, Singular Value Decomposition (SVD), Non-negative Matrix Factorization (NMF), GeneProgram w/o tissue groups, full GeneProgram model]
GeneProgram produced a map of 12 tissue groups and 62 expression programs
[Figure panels: tissue groups; tissue; expression programs (sorted by generality score); expression program use by tissue]
Expression program enrichment analysis
• GO categories
  ◦ FDR controlled to 5%
• TRANSFAC motifs
  ◦ Software: SAMBA
  ◦ Scans +3000 to -200 bp for each motif
  ◦ Uses PWM to score region, background to calculate p-value (Bonferroni corrected)
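
The two multiple-testing corrections named above behave differently, and a small sketch makes the contrast concrete. The slides do not say which FDR procedure was used for the GO analysis; the Benjamini-Hochberg step-up procedure below is an assumption chosen for illustration, and the p-values are made-up.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a list of booleans, True where the test is rejected at the given FDR.

    Step-up rule: find the largest rank k with p_(k) <= fdr * k / m and
    reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= fdr * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected

# Hypothetical p-values for four enrichment tests.
toy_pvalues = [0.001, 0.012, 0.025, 0.6]
bonf_significant = [q < 0.05 for q in bonferroni(toy_pvalues)]
bh_significant = benjamini_hochberg(toy_pvalues, fdr=0.05)
```

On these toy values Bonferroni keeps two tests and the FDR procedure keeps three, reflecting that controlling the family-wise error rate is the stricter criterion.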
[Figures: expression programs (GO and motif enrichment), one per stage]
• E8.25 4-6s definitive endoderm
• E8.75 13-15s definitive endoderm
• E9.5 definitive endoderm
• E10.5 pancreatic endoderm
[Figures: expression programs showing TFs in programs and motif enrichment, one per stage]
• E8.25 4-6s definitive endoderm
• E8.75 13-15s definitive endoderm
• E9.5 definitive endoderm
• E10.5 pancreatic endoderm
• E11.5 pancreatic endoderm

Retinoic acid receptor ChIP-seq data
• Generated in the Wichterle lab at Columbia (unpublished data, Motor Neuron Development Project)
• mESCs grown to embryoid body stage, profiled after 8h of RA exposure

ChIP-seq RAR binding: Cyp26a1
ChIP-seq RAR binding: Rarb
Overlap of expression and binding data
Condition           # upreg genes   # bound genes   % bound genes   p-value
6h RA 83 uM bead    104             29              28%             0
1d RA 83 uM bead    369             29              8%              0.069
mESC 6h 2 uM RA     165             33              20%             0
mESC 1d 2 uM RA     220             38              17%             0

Binding events determined with modified MACS method (Zhang et al, Genome Biology 2008); called if significant peak found w/in 50 kb of gene start site
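
The 50 kb rule and the "% bound" column can be sketched as follows. This is only an illustration of the stated rule: the gene and peak coordinates are hypothetical, peak calling itself is not shown, and the p-value calculation is omitted because the slides do not state how it was done.

```python
def is_bound(gene_chrom, gene_start, peaks, window=50_000):
    """True if any peak on the same chromosome lies within `window` bp of the gene start.

    peaks maps a chromosome name to a list of peak positions (bp).
    """
    return any(abs(pos - gene_start) <= window
               for pos in peaks.get(gene_chrom, []))

def percent_bound(n_upregulated, n_bound):
    """Percentage of upregulated genes that are bound, rounded to a whole percent."""
    return round(100 * n_bound / n_upregulated)

# Hypothetical coordinates, for illustration only.
toy_peaks = {"chr1": [1_000_000, 5_000_000], "chr2": [250_000]}
near = is_bound("chr1", 1_040_000, toy_peaks)   # 40 kb from a peak
far = is_bound("chr1", 1_060_000, toy_peaks)    # 60 kb from the nearest peak
other = is_bound("chr3", 250_000, toy_peaks)    # no peaks on this chromosome

# Reproduce the percentage column from the counts in the table.
table_counts = [(104, 29), (369, 29), (165, 33), (220, 38)]
percentages = [percent_bound(up, bound) for up, bound in table_counts]
```

The percentages computed from the table's counts agree with the values shown on the slide.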
Future computational directions
• Add publicly available ES expression data
• Apply more sophisticated TF binding motif methods (phylogeny, spatial arrangements, coregulation)
• Extend GeneProgram framework for additional data types (TF expression, binding motifs, ChIP-seq, knockdown/overexpression, possibly protein-protein interactions, etc.) → causal/predictive models
• Infer dynamic rewiring networks over inferred developmental tree
• Develop novel probabilistic methods for ChIP-seq data


Acknowledgements
• Rich Sherwood (Melton lab) - all the expression data!
• Arvind Jammalamadaka (Gifford lab) - initial data analysis/normalization methods
• Shaun Mahony (Gifford lab) - RA ChIP-seq data analysis
• Esteban Mazzoni (Wichterle lab) - RA ChIP-seq data

