Kellis Lab modENCODE update
Integration methods and analysis
Manolis Kellis
Broad Institute of MIT and Harvard
MIT Computer Science & Artificial Intelligence Laboratory
The good news: ever-expanding dimensions
Environment
Genotype
Disease
Gender
Stage
Age
Chromatin marks
Cell types
• Now: Cell-type and chromatin-mark dimensions
• Next: References for each background
• All clearly needed, and increasingly available
Difficulty of interpreting increasing # marks
Challenge: simplify
– Learn combinations
– Interpret function
– Prioritize marks
– Study dynamics
Overview
1. Learning chromatin states
– ChromHMM captures combinatorics / spatial info
2. Interpreting chromatin states
– Distinct functions, power for genome annotation
3. Selecting number of marks / prioritizing
– Greedy ordering, tunable to states of interest
4. Chromatin dynamics in human cell lines
– Activity profiles, linking enhancers, activators/repressors
5. Selecting number of states
– Interpreting genome at increasing resolution
1. Learn combinations
Challenge of data integration in many marks/cells
• Epigenetic modifications
• Dozens of marks
• Encode epigenetic state
• Histone code hypothesis
• Distinct function for distinct combinations of marks?
• Hundreds of histone marks
• Astronomical number of histone mark combinations
• How do we find biologically relevant ones?
• Unsupervised approach
• Probabilistic model
• Explicit combinatorics
Ernst et al, In preparation
Chromatin states for genome annotation
Promoter states
Transcribed states
Active Intergenic
Repressed
• Learn de novo significant combinations of chromatin marks
• Reveal functional elements, even without looking at sequence
• Use for genome annotation
• Use for studying regulation dynamics in different cell types
ChromHMM: learning ‘hidden’ chromatin states
[Figure: ChromHMM schematic. Observed chromatin marks (K4me1, K4me3, K27ac, K36me3), called in 200bp intervals along the DNA based on a Poisson distribution, over a transcription start site, an enhancer, and a transcribed region. Below them, the most likely hidden state (1-6) for each interval, and the high-probability chromatin marks in each state.]
All probabilities are learned de novo from chromatin data alone (Baum-Welch, a.k.a. EM)
Each state: vector of emissions, vector of transitions
Ernst et al, in preparation
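A minimal sketch of the two steps named on this slide, under stated assumptions. Marks are called present in each 200bp bin with a Poisson tail test against the average count per bin (the 1e-4 threshold is an illustrative choice), and one way to obtain the most likely hidden state per bin is Viterbi decoding in a model where marks are independent given the state. Parameters would normally come from Baum-Welch training; the function names here are hypothetical and any values passed in are toy inputs.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def binarize(counts, p_threshold=1e-4):
    """Call a mark present in a bin if its read count is unlikely under a
    Poisson background whose mean is the average count per bin (assumption)."""
    lam = sum(counts) / len(counts)
    return [1 if poisson_sf(c, lam) < p_threshold else 0 for c in counts]

def viterbi(obs, init, trans, emit):
    """Most likely state path.
    obs[t] is a tuple of 0/1 mark calls for bin t;
    emit[s][m] is P(mark m present | state s), strictly between 0 and 1;
    trans[r][s] is P(next state s | current state r)."""
    def log_emission(s, o):
        return sum(math.log(p if x else 1.0 - p) for p, x in zip(emit[s], o))

    n = len(init)
    score = [math.log(init[s]) + log_emission(s, obs[0]) for s in range(n)]
    back = []
    for o in obs[1:]:
        prev = score
        # Best predecessor for each state, then the updated score.
        ptr = [max(range(n), key=lambda r: prev[r] + math.log(trans[r][s]))
               for s in range(n)]
        score = [prev[ptr[s]] + math.log(trans[ptr[s]][s]) + log_emission(s, o)
                 for s in range(n)]
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(range(n), key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Each state is fully described by one row of `emit` and one row of `trans`, which matches the "vector of emissions, vector of transitions" description above.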
Repetitive Repressed Active intergenic
Transcribed
Promoter
Application of ChromHMM to 41 chromatin marks in CD4+ T-cells (Barski’07, Wang’08)
Chromatin marks from (Barski et al, Cell 2007; Wang et al, Nature Genetics 2008); DNaseI hypersensitivity from (Boyle et al, Cell 2008); expression data from (Su et al, PNAS 2004); lamina data from (Guelen et al, Nature 2008)
2. Interpreting chromatin states
As learned in the IMR90 REMC datasets
IMR90: 24 marks + DNAse + CpG methylation
Emission Parameters
Transition Parameters
• Interpreting a 40-state model as a basis for analysis
Promoter associated states 1-5
State 1 Bivalent Repressed Promoter
State 3 TSS Specific State
States 6-19: transcribed regions
States 15-19: ends of genes and exons
State 19 is 60-fold enriched for ZNF genes
States 20-30 associated with active intergenic regions
20-24 candidate strong enhancers
Increased accessibility; lower methylation; greater conservation
States 30-40: large scale repressive states
State 34 - Strong H3K27me3 silenced state
39-40: H3K9me3 repressive domains / experimental nuclear lamina
Repetitive
Repressed
Transcribed states
Active Intergenic
Promoter states
Specific functional annotations for each of 51 chromatin states
Example applications for genome annotation
New protein-coding genes
Long intergenic non-coding RNAs (lincRNAs)
[Figure labels: lincRNAs; known coding; evolutionary CSF score; in promoter (short) / low-expr states]
New developmental enhancer regions
Chromatin signature: promoter / transcribed
Evolutionary signature: not protein-coding
Assign candidate functions to intergenic SNPs from genome-wide association studies
Examples of distinct properties of chromatin states
Promoter state gene GO function
[Table: GO category enrichment for States 3-8, each cell given as fold enrichment with p-value in parentheses. Rows: Cell Cycle Phase; Embryonic Development; Chromatin; Response to DNA Damage Stimulus; RNA Processing; T Cell Activation.]
TF binding
Motif enrichment
Transcription End State (State 27)
ZNF repressed state recovery
State 28: 112-fold ZNF enrichment
“The achievement of the repressed state by wild-type KAP1 involves decreased recruitment of RNA polymerase II, reduced levels of histone H3 K9 acetylation and H3K4 methylation, an increase in histone occupancy, enrichment of trimethyl histone H3K9, H3K36, and histone H4K20 …” MCB 2006.
[Figure labels: States 29, 30, 34, 35, 42; promoters; enhancers]
Promoter vs. enhancer regulation
– State 10kb away predictive of expr.
Distinct types of repression
– Chrom bands / HDAC resp
– Repeat family / composition
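Fold enrichments with p-values of the kind reported for GO categories per state can be computed in several ways. The sketch below is one common choice, not necessarily the one used for these slides: a one-sided hypergeometric test, with a Bonferroni correction capped at 1 (an assumption that would be consistent with the many reported p-values of exactly 1). All function and argument names are hypothetical.

```python
from math import comb

def fold_enrichment(k, n, K, N):
    """k of the n genes in a state carry the annotation;
    K of the N background genes carry it."""
    return (k / n) / (K / N)

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return hits / total

def bonferroni(p, n_tests):
    """Multiply by the number of tests and cap at 1."""
    return min(1.0, p * n_tests)
```

For example, `fold_enrichment(10, 50, 100, 1000)` compares a 20% annotation rate in the state against a 10% background rate.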
Quantifying discovery power for promoters, transcripts
• Significantly outperform individual chromatin marks
• For transcripts, no single mark is sufficient signature
• CAGE/EST experiments give possible upper bound
3. Prioritize marks
Select marks based on state recovery
Select appropriate number of states
Recovery of 40 chromatin states with 6 marks
• Increasing marks show increasing resolution in state recovery
Extending IMR90 set beyond initial 22 marks
22 Marks common with CD4T data
H2AK5ac
H3K27ac
H3K27me3
H3K9me3
H2BK120ac
H3K4ac
H3K36me3
H4K20me1
H2BK12ac
H3K9ac
H3K4me1
H2BK20ac
H4K5ac
H3K4me2
H3K14ac
H4K8ac
H3K4me3
H3K18ac
H4K91ac
H3K79me1
H3K23ac
H3K79me2
19 Marks only in CD4T data
H2AK9ac
H2BK5me1
H3K9me2
CTCF
H2BK5ac
H3K27me1
H3R2me1
H2AZ
H3K36ac
H3K27me2
H3R2me2
PolII
H4K12ac
H3K36me1
H4K20me3
H4K16ac
H3K79me3
H4R3me2
Selecting marks based on specific states of interest
Greedy ordering of marks
[Figure: state confusion matrix with 11 ENCODE marks, comparing the state inferred with all 41 marks against the state inferred with the subset of marks]
[Figure: recovery of states with increasing number of marks]
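A greedy ordering of marks can be sketched as forward selection: at each step, add the mark that best recovers the state assignments obtained with the full set. The version below makes simplifying assumptions that are not from the slides: each bin is assigned independently to its maximum-likelihood state (no transitions, uniform prior), and `weights` is a hypothetical per-state weight that lets the ordering be tuned to states of interest.

```python
import math

def assign(obs, emit, marks):
    """Most likely state per bin using only the given mark indices.
    emit[s][m] is P(mark m present | state s), strictly between 0 and 1."""
    def loglik(s, o):
        return sum(math.log(emit[s][m] if o[m] else 1.0 - emit[s][m])
                   for m in marks)
    return [max(range(len(emit)), key=lambda s: loglik(s, o)) for o in obs]

def greedy_order(obs, emit, weights=None):
    """Return [(mark_index, recovery)] in the order marks are selected.
    Recovery is the weighted fraction of bins whose subset-based state
    matches the state assigned with all marks."""
    n_marks = len(emit[0])
    full = assign(obs, emit, range(n_marks))
    weights = weights or [1.0] * len(emit)
    denom = sum(weights[f] for f in full)
    chosen, remaining, history = [], set(range(n_marks)), []
    while remaining:
        def recovery(m):
            sub = assign(obs, emit, chosen + [m])
            return sum(weights[f] for f, s in zip(full, sub) if f == s) / denom
        best = max(sorted(remaining), key=recovery)
        history.append((best, recovery(best)))
        chosen.append(best)
        remaining.remove(best)
    return history
```

Setting a state's weight to zero removes it from the recovery score, so the ordering then favors marks that distinguish the remaining states.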
4. Study dynamics
Initial methods: ENCODE
Emerging large-scale genomic/epigenomic datasets
Multiple cell types
Diverse experiments
Developmental time-course
Reference Epigenome Mapping Centers
Used to study many disease epigenomes
ENCODE Chromatin Group (PI: Bernstein)
9 human cell types x 9 chromatin marks + WCE + RNA
Cell types:
• HUVEC: umbilical vein endothelial
• NHEK: keratinocytes
• GM12878: lymphoblastoid
• K562: myelogenous leukemia
• HepG2: liver carcinoma
• NHLF: normal human lung fibroblast
• HMEC: mammary epithelial cell
• HSMM: skeletal muscle myoblasts
• H1: embryonic
Chromatin marks:
• H3K4me1
• H3K4me2
• H3K4me3
• H3K27ac
• H3K9ac
• H3K27me3
• H4K20me1
• H3K36me3
• CTCF
15-state model learned jointly
Promoter
Enhancer
Insulator
Transcribed
Repressed
Repetitive
Cell type concatenation approach (HUVEC, NHEK, …, H1)
– Ensures common emission parameters
– Verified with independent learning
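The concatenation idea can be illustrated with a small sketch: the binarized tracks of all cell types are pooled into one training set so that a single set of emission parameters is estimated, and the decoded states are split back per cell type afterwards. This is an illustration under assumptions, with hypothetical function names; a real learner would also treat each cell type as a separate sequence so that no transition is counted across a cell-type boundary.

```python
def concatenate(tracks):
    """tracks: dict cell_type -> list of per-bin tuples of 0/1 mark calls.
    Returns the pooled sequence and (cell_type, start, end) offsets."""
    seq, offsets = [], []
    for cell in sorted(tracks):
        start = len(seq)
        seq.extend(tracks[cell])
        offsets.append((cell, start, len(seq)))
    return seq, offsets

def split_states(states, offsets):
    """Map a pooled state path back to one path per cell type."""
    return {cell: states[a:b] for cell, a, b in offsets}

def shared_emissions(seq, states, n_states):
    """P(mark present | state), estimated once from all cell types pooled."""
    n_marks = len(seq[0])
    counts = [[0] * n_marks for _ in range(n_states)]
    totals = [0] * n_states
    for o, s in zip(seq, states):
        totals[s] += 1
        for m, x in enumerate(o):
            counts[s][m] += x
    return [[c / t if t else 0.0 for c in row]
            for row, t in zip(counts, totals)]
```

Because the emission table is shared, a given state number means the same mark combination in every cell type, while its genomic locations can differ.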
Chromatin states consistent across cell types
[Figure: clustering of independently learned 15-state models; state groups: promoter, candidate enhancer, insulator, transcribed, repressive, repetitive]
• State definitions are cell type invariant
• State locations are cell type specific
• Study dynamic changes in state assignments
• Reveal logic of chromatin regulation
Chromatin state changes across pairs of cell types
[Figure: pairwise comparisons among HUVEC, NHEK, and K562; panels: proportion of genome, pairwise state fold enrichments]
CTCF island state (State 9) highly stable across cell types
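One plausible way to compute pairwise state fold enrichments between two cell types is the ratio of the observed frequency of each state pair to the frequency expected if the two cell types were independent. The sketch below assumes both state paths cover the same bins in the same order; the function name is hypothetical and the exact definition used for the figure is not given in the slides.

```python
from collections import Counter

def pairwise_fold_enrichment(states_a, states_b):
    """Observed / expected frequency for each (state in A, state in B) pair.
    Expected frequency is the product of the two marginal frequencies."""
    n = len(states_a)
    joint = Counter(zip(states_a, states_b))
    freq_a, freq_b = Counter(states_a), Counter(states_b)
    return {(a, b): (c / n) / ((freq_a[a] / n) * (freq_b[b] / n))
            for (a, b), c in joint.items()}
```

A stable state shows a high value on the diagonal entry `(s, s)`; values near 1 indicate that the state in one cell type says little about the state in the other.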
Chromatin state changes across pairs of cell types
GO enrichment for TSS in active promoter (st1) in HUVEC and unmodified (st7) in NHEK
GO term                        P-value
blood vessel development       2.60E-05
vasculature development        3.00E-05
angiogenesis                   3.50E-05
blood vessel morphogenesis     1.20E-04
GO enrichment for TSS in active promoter (st1) in NHEK and unmodified (st7) in HUVEC
GO term                        P-value
ectoderm development           2.90E-09
epidermis development          1.80E-08
keratinocyte differentiation   3.00E-06
tissue development             3.20E-06
cell adhesion                  1.90E-05
Correlations between multi-cell activity profiles
[Figure: activity profiles across nine cell types (HUVEC, NHEK, GM12878, K562, HepG2, NHLF, HMEC, HSMM, H1)]
• Gene expression: ON / OFF
• Chromatin states: active enhancer / repressed
• Active TF motif enrichment: motif enrichment / motif depletion
• TF regulator expression: TF on / TF off
• Dip-aligned motif biases: motif aligned / flat profile
(1) Linking enhancer states to correlated target genes
[Figure labels: 10kb; candidate enhancer; TM4SF1]
Search for coherent changes between:
• gene expression
• chromatin marks at distant loci (10kb)
Combine two vectors:
1. Expression vector for each gene
2. Vector of mark intensities at distal locus (combine marks based on enhancer emissions)
3. High correlation: enhancer/target link
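The linking procedure can be sketched as a correlation across cell types between a gene's expression vector and an enhancer activity vector at a nearby locus. The sketch assumes Pearson correlation, a hypothetical cutoff `min_r`, and that candidate loci have already been restricted to those near each gene; none of these choices are stated in the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def enhancer_activity(mark_signal, emission_weights):
    """Combine mark intensities at one locus in one cell type into a single
    value, weighting each mark (here by a stand-in for its enhancer-state
    emission probability)."""
    return sum(w * mark_signal[m] for m, w in emission_weights.items())

def link_enhancers(expression, enhancer_profiles, min_r=0.8):
    """expression: gene -> per-cell-type expression vector.
    enhancer_profiles: locus -> per-cell-type activity vector.
    Returns (locus, gene, r) for pairs with correlation >= min_r."""
    links = []
    for gene, expr in expression.items():
        for locus, act in enhancer_profiles.items():
            r = pearson(expr, act)
            if r >= min_r:
                links.append((locus, gene, r))
    return links
```

Both vectors must list the cell types in the same order, for example the nine ENCODE cell types.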
(3) Signatures of activators and repressors from activity profiles
• STAT1: activator of GM12878
• STAT5: activator for K562
• CREB: repressor of GM12878
[Figure: STAT1, STAT5, and CREB motif enrichment in “Off”, “Enhancer”, and “On” states, alongside TF expression; color scales: motif enrichment 2^-4 to 2^4, TF expression 2^-2 to 2^2]
Summary
1. Learning chromatin states
– ChromHMM captures combinatorics / spatial info
2. Interpreting chromatin states
– Distinct functions, power for genome annotation
3. Selecting number of marks / prioritizing
– Greedy ordering, tunable to states of interest
4. Chromatin dynamics in human cell lines
– Activity profiles, linking enhancers, activators/repressors
5. Selecting number of states
– Interpreting genome at increasing resolution
Selecting number of states
Comparison of BIC score vs. number of states for random and nested initialization
Step 1: Learn a larger model that captures ‘all’ relevant states
Step 2: Prune down the model, greedily eliminating the least informative states
Step 3: Select arbitrary cutoff based on biological interpretation
Result: a 51-state model that captures most biology in least complexity
Recovery of 79-state model in random vs. nested initialization
Random initialization (states appear & disappear)
Selected 51-state model
Nested initialization (states consistently recovered)
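For reference, a BIC score for a model of this kind can be computed as below. The parameter count is an assumption: one emission probability per state and mark, a full transition matrix with rows summing to 1, and an initial-state distribution. The sign convention (lower is better) is also a choice, and the slides may plot a different one.

```python
import math

def n_free_parameters(n_states, n_marks):
    """Free parameters of an HMM with independent binary emissions."""
    emissions = n_states * n_marks
    transitions = n_states * (n_states - 1)
    initial = n_states - 1
    return emissions + transitions + initial

def bic(log_likelihood, n_states, n_marks, n_observations):
    """BIC = k * ln(n) - 2 * ln(L); lower is better under this convention."""
    k = n_free_parameters(n_states, n_marks)
    return k * math.log(n_observations) - 2.0 * log_likelihood
```

Under this count, a 51-state model over 41 marks has `n_free_parameters(51, 41)` = 4691 free parameters, which is why the penalty term grows quickly with the number of states.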
States capture mark dependencies: Expected vs. Observed Mark Co-Occurrence
[Figure: expected vs. observed mark co-occurrence, panels numbered 1-51]
Increasing numbers of states lead to increasing mark independence
[Figure panels: models with 5, 10, 20, 30, 40, and 51 states; label: State 6]
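The expected vs. observed comparison can be sketched as follows, assuming marks are independent given the state. The model then predicts a co-occurrence frequency for any pair of marks, which can be compared with the frequency counted in the data; with enough states the two should agree, meaning the states have absorbed the dependencies between marks. Function names are hypothetical.

```python
def observed_cooccurrence(obs, i, j):
    """Fraction of bins in which marks i and j are both present."""
    return sum(1 for o in obs if o[i] and o[j]) / len(obs)

def expected_cooccurrence(state_freq, emit, i, j):
    """Model prediction: sum over states of P(state) * P(i|state) * P(j|state).
    state_freq[s] is the fraction of bins in state s;
    emit[s][m] is P(mark m present | state s)."""
    return sum(p * e[i] * e[j] for p, e in zip(state_freq, emit))
```

As a toy case, two perfectly co-occurring marks seen in half of the bins give an observed value of 0.5. A one-state model predicts 0.25, while a two-state model (marks on in one state, off in the other) predicts 0.5.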
Other desirable features of the resulting model
• Power appropriately spent
• Chromatin states show distinct mark combinations
Selecting number of states
State Assigned in CD4T using all marks
State Assigned in CD4T using 22-common marks
A value in a cell indicates the percentage of locations assigned to the state of the row with the full set of marks that would be assigned to the state of the column using the subset of marks.
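The table described here is a row-normalized confusion matrix. A minimal sketch, assuming states are numbered from 0 and that both assignments cover the same locations in the same order (the function name is hypothetical):

```python
def confusion_percentages(full_states, subset_states, n_states):
    """table[r][c] = percentage of locations assigned to state r with the
    full set of marks that are assigned to state c with the subset."""
    counts = [[0] * n_states for _ in range(n_states)]
    for f, s in zip(full_states, subset_states):
        counts[f][s] += 1
    table = []
    for row in counts:
        total = sum(row)
        table.append([100.0 * c / total if total else 0.0 for c in row])
    return table
```

A state is well recovered by the subset when its diagonal entry is close to 100; a large off-diagonal entry shows which state absorbs its locations instead.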
State Assigned in CD4T using all marks
State Assigned in CD4T using 22-common marks
Many locations assigned to a satellite repeat state with the full set of marks are assigned to a large H3K9me3 heterochromatin state using the set of 22 marks.
State Assigned in CD4T using all marks
State Assigned in CD4T using 22-common marks + H4K20me3
With just data on the location of H4K20me3 almost all these locations are assigned to a satellite repeat state.
State Assigned in CD4T using all marks
State Assigned in CD4T using 22-common marks
State 38 is primarily associated with H2AZ in distal locations
State Assigned in CD4T using all marks
State Assigned in CD4T using 22-common marks + H2AZ
Adding H2AZ substantially improves the recovery of this state.
State Assigned in CD4T using all marks
State Assigned in CD4T using 22-common marks
Various expressed transcribed states
State 46 is strongly associated with simple repeats (maybe an artifact)
State Assigned in CD4T using all marks
State Assigned in CD4T using 22-common marks + H2BK5me1
Various expressed transcribed states
State 46 is strongly associated with simple repeats (maybe an artifact)