Yeast whole-genome analysis of conserved regulatory motifs

Disease epigenomics:
Interpreting non-coding variants using
chromatin and activity signatures
Jason Ernst
Broad Institute of MIT and Harvard
MIT Computer Science & Artificial Intelligence Laboratory
Challenge: interpreting disease-associated variants
Gene annotation (Coding, 5’/3’UTR, RNAs)
→ Evolutionary signatures
Roles in gene/chromatin regulation
→ Activator/repressor signatures
Non-coding annotation
→ Chromatin signatures
Other evidence of function
→ Signatures of selection (sp/pop)
Disease-associated variant (SNP/CNV/…)
CATGACTG
CATGCCTG
• GWAS, case-control,… reveal disease-associated variants
→ Molecular mechanism, cell-type specificity, drug targets
• Challenges towards interpreting disease variants
– Find ‘true’ causative SNP among many candidates in LD
– Use ‘causal’ variant: predict function, pathway, drug targets
– Non-coding variant: type of function, cell type of activity
– Regulatory variant: upstream regulators, downstream targets
• This talk: genomics tools for addressing these challenges
The good news: ever-expanding dimensions
[Figure: each point represents a genome-wide dataset; labeled dimensions: Chromatin marks, Cell types]
Additional dimensions: Environment, Genotype, Disease, Gender, Stage, Age
• Now: Cell-type and chromatin-mark dimensions
• Next: References for each background
• All clearly needed, and increasingly available
Difficulty of interpreting increasing # tracks
Challenge: simplify
– Learn combinations
– Interpret function
– Prioritize marks
– Study dynamics
Challenge of data integration in many marks/cells
Epigenomic information retains genome ‘state’ in differentiation and development
Two types: DNA methylation, histone marks
DNA packaged into chromatin around histone proteins
Genome-wide modification maps
Hundreds of histone tail modifications already known
• Epigenetic modifications
• DNA/histone/nucleosome
• Encode epigenetic state
• Histone code hypothesis
• Distinct function for distinct combinations of marks?
• Hundreds of histone marks
• Astronomical number of histone mark combinations
• How do we find biologically relevant ones?
• Unsupervised approach
• Probabilistic model
• Explicit combinatorics
Genomic tools for disease SNP interpretation
• Chromatin states → regulatory region annotation
– Combinatorial patterns of marks → chromatin states
– Distinct classes of prom/enh/transcr/repres’d/repetitive
– Reveal new genes, lincRNAs, enhancers, GWAS/SNP
• Activity signatures → linking enhancer networks
– Correlated changes in expression, chromatin, motifs
– Link TFs to enhancers and enhancers to targets
– Predict causal cell-type specific activators/repressors
• Interpreting disease variants
– Predicting SNP chromatin states and cell-type specificity
– Specific mechanistic predictions for disease SNPs
– Measuring selective pressures within human populations
ChromHMM: learning ‘hidden’ chromatin states
[Figure: DNA divided into 200bp intervals spanning a transcription start site, an enhancer, and a transcribed region. Observed chromatin marks (K4me3, K4me1, K27ac, K36me3) are called based on a Poisson distribution, and each interval is labeled with its most likely hidden state. A second panel, "High Probability Chromatin Marks in State", lists the marks with high emission probability (0.7 to 0.9) for each numbered state.]
All probabilities are learned de novo from chromatin data alone (Baum-Welch, a.k.a. EM)
Each state: vector of emissions, vector of transitions
Ernst and Kellis, Nature Biotech 2010
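The "most likely hidden state" step can be sketched as Viterbi decoding. This is a minimal sketch, assuming each state emits each binarized mark independently (a product of Bernoulli emissions). The two-state model and all its probabilities below are made up for illustration; in practice they would be learned by Baum-Welch.

```python
import math

def viterbi(obs, init, trans, emit):
    """Most likely state path for binarized mark observations.

    obs:   list of tuples, one per 200bp bin (1 = mark called present)
    init:  init[s]     = P(first state is s)
    trans: trans[s][t] = P(next state is t | current state is s)
    emit:  emit[s][m]  = P(mark m present | state s)
    """
    n_states = len(init)

    def log_emit(s, o):
        # Independent Bernoulli emission for each mark
        return sum(math.log(p if x else 1.0 - p) for p, x in zip(emit[s], o))

    score = [math.log(init[s]) + log_emit(s, obs[0]) for s in range(n_states)]
    back = []
    for o in obs[1:]:
        prev = score
        ptr, score = [], []
        for t in range(n_states):
            best = max(range(n_states),
                       key=lambda s: prev[s] + math.log(trans[s][t]))
            ptr.append(best)
            score.append(prev[best] + math.log(trans[best][t]) + log_emit(t, o))
        back.append(ptr)

    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: state 0 is promoter-like, state 1 is transcribed-like
MARKS = ("K4me3", "K4me1", "K36me3")
INIT = [0.5, 0.5]
TRANS = [[0.8, 0.2], [0.1, 0.9]]
EMIT = [[0.9, 0.8, 0.05], [0.05, 0.1, 0.9]]

bins = [(1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 0, 1)]
path = viterbi(bins, INIT, TRANS, EMIT)
```

Working in log space avoids numerical underflow on genome-length sequences, which matters once the number of 200bp bins reaches the millions.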
Chromatin states for genome annotation
State groups: Promoter states, Transcribed states, Active Intergenic, Repressed
• Learn de novo significant combinations of chromatin marks
• Reveal functional elements, even without looking at sequence
• Use for genome annotation
• Use for studying regulation dynamics in different cell types
Emerging large-scale genomic/epigenomic datasets
Multiple cell types
Diverse experiments
Developmental time-course
Reference Epigenome Mapping Centers
Used to study many disease epigenomes
ENCODE Chromatin Group (PI: Bernstein)
9 human cell types x 9 chromatin marks + WCE

Cell types:
HUVEC: Umbilical vein endothelial
NHEK: Keratinocytes
GM12878: Lymphoblastoid
K562: Myelogenous leukemia
HepG2: Liver carcinoma
NHLF: Normal human lung fibroblast
HMEC: Mammary epithelial cell
HSMM: Skeletal muscle myoblasts
H1: Embryonic

Chromatin marks: H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac, H3K27me3, H4K20me1, H3K36me3, CTCF
Also profiled: WCE (whole-cell extract), RNA
15-state model learned jointly
State classes: Promoter, Enhancer, Insulator, Transcribed, Repressed, Repetitive
Cell types: HUVEC, NHEK, …, H1
Cell type concatenation approach
– Ensures common emission parameters
– Verified with independent learning
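Why concatenation gives common emission parameters can be sketched as follows. This is a simplified illustration, not the actual training procedure: it assumes state assignments are already given (a hard-assignment stand-in for one Baum-Welch update) and pools mark counts over all cell types, so every cell type contributes to a single emission table. The tracks and assignments are hypothetical.

```python
def pooled_emissions(tracks, assignments, n_states, n_marks):
    """Estimate one shared emission table from all cell types.

    tracks:      {cell_type: [tuple of 0/1 mark calls per bin]}
    assignments: {cell_type: [state index per bin]} (assumed given here)
    Returns emit[s][m] = fraction of state-s bins, over all cell types,
    in which mark m is present.
    """
    present = [[0] * n_marks for _ in range(n_states)]
    total = [0] * n_states
    for cell, bins in tracks.items():
        for obs, state in zip(bins, assignments[cell]):
            total[state] += 1
            for m, x in enumerate(obs):
                present[state][m] += x
    return [
        [present[s][m] / total[s] if total[s] else 0.0 for m in range(n_marks)]
        for s in range(n_states)
    ]

# Hypothetical data: two marks, two states, two cell types
tracks = {
    "HUVEC": [(1, 0), (1, 0), (0, 1)],
    "H1":    [(1, 1), (0, 1), (0, 1)],
}
assignments = {"HUVEC": [0, 0, 1], "H1": [0, 1, 1]}
emissions = pooled_emissions(tracks, assignments, n_states=2, n_marks=2)
```

Because the table is shared, a state means the same mark combination in every cell type, while the per-bin assignments remain free to differ between cell types.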
Chromatin states capture coordinated mark changes
• State definitions are cell-type invariant
– Same combinations consistently found
• State locations are cell-type specific
– Can study pair-wise or multi-way changes
Chromatin states correlation with gene expression
[Figure: chromatin states from -50kb to +50kb around the TSS, shown for lower-expression and higher-expression genes]
Pair-wise changes reveal cell-type specific functions
• Gene functional enrichments match cell function
• Distinguish On, Off, and Poised promoter states
Introducing multi-cell activity profiles
Profiles across HUVEC, NHEK, GM12878, K562, HepG2, NHLF, HMEC, HSMM, H1:
• Gene expression: ON / OFF
• Chromatin states: Active enhancer / Repressed
• Active TF motif enrichment: Motif enrichment / Motif depletion
• TF regulator expression: TF On / TF Off
• Dip-aligned motif biases: Motif aligned / Flat profile
Enhancer vs. promoter dynamics
Promoters typically active in many cells
Enhancers exquisitely cell-type specific
Linking candidate enhancers to correlated target genes
[Figure: candidate enhancer near TM4SF1, 10kb scale; mark intensity correlation w/ expr]
Search for coherent changes between:
• gene expression
• chromatin marks at distant loci (10kb)
Combine two vectors:
1. Expression vector for each gene
2. Vector of mark intensities at distal locus (combine marks based on enhancer emissions)
3. High correlation → enhancer/target link
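The linking steps above can be sketched with a Pearson correlation across cell types. This is a minimal sketch under stated assumptions: the mark weights stand in for the enhancer-state emissions, the 0.8 cutoff is an illustrative choice, and all values across the 9 cell types are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def enhancer_signal(mark_intensity, weights):
    """Weighted sum of mark intensities per cell type at one distal locus."""
    n_cells = len(next(iter(mark_intensity.values())))
    return [
        sum(weights[m] * mark_intensity[m][c] for m in weights)
        for c in range(n_cells)
    ]

def link(expression, signal, threshold=0.8):
    """Return (correlation, linked?) for one gene/locus pair."""
    r = pearson(expression, signal)
    return r, r >= threshold

# Hypothetical values, one entry per cell type (9 cell types)
expression = [9.0, 1.0, 1.2, 0.8, 1.1, 7.5, 1.0, 0.9, 1.3]
marks = {
    "H3K4me1": [5, 1, 1, 0, 1, 4, 1, 1, 0],
    "H3K27ac": [6, 0, 1, 1, 0, 5, 1, 0, 1],
}
weights = {"H3K4me1": 0.5, "H3K27ac": 0.5}  # assumed, not learned emissions

signal = enhancer_signal(marks, weights)
r, linked = link(expression, signal)
```

With only 9 cell types, high correlations can arise by chance, which is presumably why matched control regions are used for comparison.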
Predictive power of distal enhancer regions
[Figure: correlation of individual regions (sorted by rank) for 10kb upstream, 100kb upstream, and 10kb/100kb controls]
• At least 100 regions with >80% correlation
Coordinated activity reveals enhancer links
[Figure panels: Enhancer activity, Gene activity, Predicted regulators; activity signatures for each TF]
• Distal enhancers are hard to integrate into regulatory models
• Linked to target genes based on coordinated activity
• Linked to upstream regulators using TF expression and motifs
Nucleosome Positioning Footprints Support Transcription Factor Cell Type Predictions
[Figure: tag enrichment for H3K27ac]
Enhancer annotation revisits disease SNPs
→ Previously unlinked phenotypes enriched for cell-type specific enhancers
Application 1: Pinpoint disease SNPs in enhancers
• Much smaller fraction of genome considered
• Strong enhancers 1.9%, weak 2.8%, promoter 1.4%
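Pinpointing SNPs in annotated states amounts to an interval lookup. The sketch below assumes a sorted, non-overlapping list of state segments on one chromosome; the coordinates, state names, and SNP identifiers are hypothetical. Only the three genome fractions come from the slide.

```python
from bisect import bisect_right

def state_at(segments, pos):
    """Chromatin state covering a position, or None if unannotated.

    segments: sorted, non-overlapping (start, end, state) tuples,
              half-open intervals [start, end).
    """
    starts = [start for start, _, _ in segments]
    i = bisect_right(starts, pos) - 1
    if i >= 0 and segments[i][0] <= pos < segments[i][1]:
        return segments[i][2]
    return None

# Hypothetical segmentation of a short region
SEGMENTS = [
    (0, 400, "promoter"),
    (400, 1000, "transcribed"),
    (1000, 1200, "strong_enhancer"),
    (1600, 1800, "weak_enhancer"),
]
SNPS = {"snpA": 150, "snpB": 1100, "snpC": 1400, "snpD": 1700}

snp_states = {name: state_at(SEGMENTS, pos) for name, pos in SNPS.items()}

# Fraction of the genome in the three candidate state classes (from the slide)
candidate_fraction = 1.9 + 2.8 + 1.4  # percent of genome
```

Together the three classes cover about 6.1% of the genome, so restricting attention to them narrows the search space considerably.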
Application 2: Make much more precise predictions
Use: * Cell-type specificity of chromatin states
* Predicted activators/repressors of these states
* Predicted motif instances across the genome
Ex1: Systemic lupus erythematosus intergenic SNP
• SNP in lymphoblastoid GM-specific enhancer state
• Disrupts Ets1 motif instance, predicted GM regulator
→ Model: Disease SNP abolishes GM-specific enhancer
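Motif disruption by a SNP can be checked by scoring the reference and alternate alleles against a position weight matrix. This is a minimal sketch: the 4-column matrix below is a made-up toy with a GGAA-like core, not a real Ets1 matrix, and the sequence and SNP position are hypothetical. Only the forward strand is scanned.

```python
import math

def log_odds(pwm, seq, background=0.25):
    """Log2-odds score of one window against a uniform background."""
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, seq))

def best_score(pwm, seq):
    """Best forward-strand window score anywhere in seq."""
    w = len(pwm)
    return max(log_odds(pwm, seq[i:i + w]) for i in range(len(seq) - w + 1))

def snp_effect(pwm, ref_seq, offset, alt_base):
    """Best motif score for the reference and the alternate allele."""
    alt_seq = ref_seq[:offset] + alt_base + ref_seq[offset + 1:]
    return best_score(pwm, ref_seq), best_score(pwm, alt_seq)

def column(preferred):
    """Toy PWM column: 0.85 for the preferred base, 0.05 for the others."""
    return {b: (0.85 if b == preferred else 0.05) for b in "ACGT"}

TOY_PWM = [column(b) for b in "GGAA"]  # illustrative only

ref_seq = "TCGGAAGT"
ref_score, alt_score = snp_effect(TOY_PWM, ref_seq, offset=3, alt_base="C")
motif_disrupted = alt_score < ref_score
```

A real analysis would also scan the reverse complement and compare the score drop against a threshold calibrated on background sequence.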
Ets-1 is a predicted activator of GM/HUVEC enhancers
[Figure panels: Enhancer activity, Gene activity, Predicted regulators; activity signatures for each TF]
• Enhancer class specific to GM and HUVEC cell types
• Ets expression → Ets-1 motif enrichment in enhancers
→ Model: Ets-1 disruption would abolish enhancer state
Ex2: Erythrocyte phenotype study intronic SNP
K562: erythroleukaemia cell type
• Disease SNP creates motif instance for Gfi-1 repressor
• Gfi-1 predicted repressor for K562-specific enhancers
→ Creation of repressive motif abolishes K562 enhancer
Gfi-1 is a predicted repressor of non-K562 enhancers
[Figure panels: Enhancer activity, Gene activity, Predicted regulators; activity signatures for each TF]
• Gfi expression → Gfi-1 motif depletion in enhancers
• Prediction: Gfi-1 large-scale repression of non-K562
→ Motif created → Gfi-1 recruited → enhancer repressed
More generally: eQTLs in specific chromatin states
Dixon 2007: All eQTLs, Lymphoblasts, 400 ind.
Schadt 2008: Trans eQTLs, liver cells, 427 ind.
• Nucleotide-resolution genome-wide expr. predictors
• Strong enrichment for promoter and enhancer states
• Trans-eQTLs select for cell-type specific enhancers
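Enrichment of eQTLs in a chromatin state can be quantified as observed over expected overlap. This is a sketch with hypothetical counts (200 eQTL SNPs, 19 in strong enhancers); the 1.9% genome fraction for strong enhancers is the figure quoted earlier in the talk. The binomial tail assumes SNPs fall independently and uniformly across the genome, which is a simplification.

```python
from math import comb

def fold_enrichment(hits_in_state, total_hits, genome_fraction):
    """Observed fraction of hits in a state divided by the expected fraction."""
    return (hits_in_state / total_hits) / genome_fraction

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical counts; genome fraction taken from the talk
n_eqtls = 200
n_in_strong_enhancers = 19
strong_enhancer_fraction = 0.019

enrichment = fold_enrichment(n_in_strong_enhancers, n_eqtls, strong_enhancer_fraction)
p_value = binomial_tail(n_in_strong_enhancers, n_eqtls, strong_enhancer_fraction)
```

In practice the background should be the set of SNPs actually tested rather than the whole genome, since genotyping arrays do not sample the genome uniformly.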