Yeast whole-genome analysis of conserved regulatory motifs
Fly ModENCODE data integration update
Manolis Kellis, MIT
Broad Institute of MIT and Harvard
MIT Computer Science & Artificial Intelligence Laboratory
modENCODE integration goals
• Annotate all functional elements
– Enhancers, promoters, insulators, silencers
– Protein-coding genes, RNA genes, alternative splice forms
• Understand their dynamics
– Tissue- and stage-specific activity of each type of element
• Mechanisms
– Relative roles of histones, chromatin, specific/general TFs
– Sequence specificity, regulatory motifs and grammars
• Community involvement will be key
– Seeking both computational and experimental partners
– Large-scale: Complementary datasets / computation
– Small-scale: Directed follow-up studies / genes, pathways
• Drosophila 2009 modENCODE workshop discussion
Each dataset is supported by all others
Data Integration efforts
[Figure: modENCODE fly groups and their data types: Henikoff (nucleosomes), Celniker (transcripts), Karpen (chromatin), White (TFs/chromatin), MacAlpine (replication), Lai (small RNAs); legend: already presented / underway]
• Each type of element requires multiple data types
– Protein genes
– RNA genes
– Promoters
– Enhancers
– Transcripts
– Heterochromatin
– Initiation sites
modENCODE is not alone
• Community data types
– Boundaries
– Small RNAs, etc.
– Evolutionary properties (correlations with conserved/nonconserved properties)
– Dam mapping
– DNase HS sites, low buoyant density (protein binding)
• Techniques and functional genomics
– Gene Disruption projects
– RNAi collection
– Recombineering
– Computational analyses
[Figure: the modENCODE data types (nucleosomes, transcripts, chromatin, TFs/chromatin, replication, small RNAs) alongside community data: DNase HS, 12 flies (+8 flies), Dam mapping, boundaries]
Comparative resources for Drosophila genomes
New Species      Dist
D. ficusphila    0.80
D. biarmipes     0.70
D. elegans       0.72
D. kikkawai      0.89
D. eugracilis    0.76
D. takahashii    0.65
D. rhopaloa      0.66
D. bipectinata   0.99
Status legend: done / priority1 / priority2
• Identify functional elements by their evolutionary
signatures: complement experimental studies
Evolutionary signatures for diverse functions
Protein-coding genes
- Codon Substitution Frequencies
- Reading Frame Conservation
RNA structures
- Compensatory changes
- Silent G-U substitutions
microRNAs
- Shape of conservation profile
- Structural features: loops, pairs
- Relationship with 3’UTR motifs
Stark et al, Nature 2007; Clark et al, Nature 2007
Regulatory motifs
- Mutations preserve consensus
- Increased Branch Length Score
- Genome-wide conservation
Functional annotation of Novel Transcripts using evo. sigs
[Figure: two histograms of CSF Score (best 30 aa window), x-axis from -20 to 60; y-axes: frequency and fraction]
CSF = Heuristic metric for codon substitution frequency
73 Putative protein coding
57 Putative non-coding
Mike Lin, Jane Landolin, Sue Celniker
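The "best 30 aa window" part of the score can be illustrated with a minimal sketch. It assumes that per-codon scores are already available (positive when the observed substitutions look coding-like, negative otherwise); how those per-codon scores are derived is not described on the slide, so `codon_scores` and `best_window_score` are hypothetical names used only for illustration.

```python
def best_window_score(codon_scores, window=30):
    """Return the highest sum of `window` consecutive per-codon scores.

    codon_scores: hypothetical per-codon values (assumed input, not the
    actual CSF computation). If the transcript has fewer codons than the
    window, the whole transcript is scored.
    """
    if not codon_scores:
        raise ValueError("need at least one codon score")
    if len(codon_scores) <= window:
        return sum(codon_scores)
    # Sliding-window sum: add the codon entering, drop the codon leaving.
    current = sum(codon_scores[:window])
    best = current
    for i in range(window, len(codon_scores)):
        current += codon_scores[i] - codon_scores[i - window]
        best = max(best, current)
    return best


# Toy transcript: a weakly scoring start, a poorly scoring middle,
# and a 30-codon stretch that looks coding-like.
toy_scores = [1] * 10 + [-1] * 40 + [2] * 30
print(best_window_score(toy_scores))
```

Taking the best window rather than the whole-transcript sum lets a short conserved coding region stand out even inside a long, mostly non-coding transcript.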
Discover motifs associated with binding
Rank  Consensus     MCS    Promoters
1     CTAATTAAA     65.6   25.4
2     TTKCAATTAA    57.3   5.8
3     WATTRATTK     54.9   11.7
4     AAATTTATGCK   54.4   4.5
5     GCAATAAA      51     13.2
6     DTAATTTRYNR   46.7   16
7     TGATTAAT      45.7   7.1
8     YMATTAAAA     43.1   7
9     AAACNNGTT     41.2   20.1
10    RATTKAATT     40     3.9
11    GCACGTGT      39.5   17.9
12    AACASCTG      38.8   10.7
13    AATTRMATTA    38.2   19.5
14    TATGCWAAT     37.8   5.8
15    TAATTATG      37.5   14.1
16    CATNAATCA     36.9   1.8
17    TTACATAA      36.9   5.4
18    RTAAATCAA     36.3   3.2
19    AATKNMATTT    36     3.6
20    ATGTCAAHT     35.6   2.4
21    ATAAAYAAA     35.5   57.2
22    YYAATCAAA     33.9   5.3
23    WTTTTATG      33.8   6.3
24    TTTYMATTA     33.6   6.7
25    TGTMAATA      33.2   8.9
26    TAAYGAG       33.1   4.7
27    AAAKTGA       32.9   7.6
28    AAANNAAA      32.9   449.7
29    RTAAWTTAT     32.9   11
30    TTATTTAYR     32.9   30.7
Matches to known (shown for a subset of motifs): engrailed (en), reversed-polarity (repo), araucan (ara), paired (prd), ventral veins lacking (vvl), Ultrabithorax (Ubx), apterous (ap), abdominal A (abd-A), fushi tarazu (ftz), broad-Z3 (br-Z3), Antennapedia (Antp), Abdominal B (Abd-B), extradenticle (exd), gooseberry-neuro (gsb-n), Deformed (Dfd)
Other columns on the slide: Tissue specific target expression
Enhancers (values for a subset of motifs, in slide order): 2, 4.2, 2.6, 16.5, 0.3, 3.3, 1.7, 2.2, 4.3, 0.7, 1.2, 2, 5.4, 1.7, 2.8, 0, 4.6, -0.5, 0.6, 6, 1.7, 1.6, 2.7, 0.3, 0.8, 0.8
Ability to discover full dictionary of regulatory motifs de novo
Stark et al, Nature, 2007
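The consensus sequences in this motif dictionary use IUPAC degenerate nucleotide codes (for example K = G or T, W = A or T, N = any base). As a rough sketch of how such a consensus can be matched against a sequence, the snippet below expands each code into a regular-expression character class. The function names and the toy sequence are assumptions for illustration; it scans the forward strand only and does not reproduce the conservation-based scoring behind the MCS column.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def consensus_to_pattern(consensus):
    """Translate an IUPAC consensus such as 'TTKCAATTAA' into a regex string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_instances(consensus, sequence):
    """Return (start, matched_text) for every forward-strand match.

    A lookahead is used so that overlapping instances are also reported.
    """
    lookahead = re.compile("(?=(" + consensus_to_pattern(consensus) + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence.upper())]


# Toy sequence (made up) containing both variants allowed by K.
toy_sequence = "GGTTGCAATTAACCTTTCAATTAAGG"
print(find_instances("TTKCAATTAA", toy_sequence))
```

A fuller scan would also search the reverse complement and then ask whether each instance is conserved across the aligned genomes.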
Initial regulatory network for an animal genome
• ChIP-grade quality
– Similar functional enrichment
– High sens. High spec.
• Systems-level
– 81% of Transc. Factors
– 86% of microRNAs
– 8k + 2k targets
– 46k connections
• Lessons learned
– Pre- and post- are correlated (hihi/lolo)
– Regulators are heavily targeted, feedback loop
Sushmita Roy
Kheradpour et al, Genome Research, 2007
Temporal latencies in regulatory networks
• TF-specific latencies, coherent with TF function
• Latencies associated with network motifs
• Extensions to tissue-specific networks
Rogerio Candeias
Incorporating ENCODE functional datasets
Pouya Kheradpour, Jason Ernst,
Chris Bristow, Rachel Sealfon
modENCODE and gene regulation
Goal: Understand the DNA elements responsible for gene regulation:
• The regulators: TFs, GFs, miRNAs, their specificities
• The regions: enhancers, promoters, insulators
• The targets: individual regulatory motif instances
• The grammars: combinations predictive of tissue-specific activity
Building blocks of gene regulation
Our tools: Comparative genomics & large-scale experimental datasets.
• Evolutionary signatures for promoter/enhancer/3’UTR motif annotation
• Chromatin signatures for integrating histone modification datasets
• TFs, GFs, motifs, instances associated with tissue-specific activity
• Infer regulatory networks, their temporal and spatial dynamics
Integrate diverse datasets
Sequence motifs predictive of insulators
• Understand specificity of each factor
• How predictive are these motifs of binding?
• Motif combinations and grammars
Motifs specific to each insulator
– GAF, check
– Su(Hw), check
– CTCF, check
– BEAF-32, variant
– CP190, novel
– Mod(mdg4), novel
Pouya Kheradpour
Motif instances correlate with ChIP peaks
[Figure: fraction of peaks overlapping CTCF motif instances versus narrow peak interval rank (×10^4), for SPP with a 40bp window; performance (higher is better) measured as recovery of CTCF instances at 90% confidence, versus peak size]
• CTCF motif instances correlate strongly with narrow peak calls from multiple peak callers, even at 40bp window
• Correlation extends down rank list (to all 50,000 peaks)
• Implications for peak calling and for motif discovery
Pouya Kheradpour, Ben Brown
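The rank-versus-overlap comparison on this slide can be sketched as a running fraction of peaks whose narrow window contains a motif instance. This is a minimal illustration under stated assumptions: a single chromosome, peaks given as centers already sorted by rank, and motif instances given as sorted start coordinates. All names and the toy coordinates are hypothetical, and the peak callers' actual outputs are not modeled.

```python
from bisect import bisect_left


def overlaps_motif(center, motif_starts, motif_len, window):
    """True if a motif [start, start + motif_len) intersects the
    window [center - window/2, center + window/2)."""
    lo = center - window // 2
    hi = center + window // 2
    # First motif that ends after the window begins.
    i = bisect_left(motif_starts, lo - motif_len + 1)
    return i < len(motif_starts) and motif_starts[i] < hi


def cumulative_overlap_fraction(ranked_centers, motif_starts, motif_len, window=40):
    """For each rank k, the fraction of the top-k peaks overlapping a motif."""
    fractions = []
    hits = 0
    for k, center in enumerate(ranked_centers, start=1):
        if overlaps_motif(center, motif_starts, motif_len, window):
            hits += 1
        fractions.append(hits / k)
    return fractions


# Toy data: three ranked peak centers and two motif instances of length 10.
ranked_centers = [100, 500, 900]
motif_starts = [95, 880]  # must be sorted
print(cumulative_overlap_fraction(ranked_centers, motif_starts, motif_len=10))
```

Plotting these fractions against rank gives the kind of curve summarized in the bullets: if motif instances track true binding, the fraction stays well above background far down the ranked list.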
Motifs and tissue-specific chromatin marks
• The NF-κB motif is enriched in H3K4me2 regions found uniquely in GM12878 cells
• It is likewise enriched in the uniquely bound regions for other active marks
• Conversely, it is enriched in the uniquely unbound regions for the repressive mark H3K27me3
• We find that NF-κB is also overexpressed in GM12878, suggesting a causative explanation
[Figure: fold enrichment of the NF-κB motif, or overexpression, shown for the active marks and for the repressive mark]
Pouya Kheradpour
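One simple way to read "fold enrichment" of a motif in a set of regions is as a ratio of motif densities: instances per base pair inside the regions, divided by instances per base pair in a background. The sketch below assumes that definition; the slide does not state the exact statistic used, and the function name and the counts are made up for illustration.

```python
def fold_enrichment(hits_in_regions, regions_bp, hits_in_background, background_bp):
    """Ratio of motif density in the regions to motif density in the background.

    Assumed definition for illustration only; values above 1 indicate
    enrichment, values below 1 indicate depletion.
    """
    if regions_bp <= 0 or background_bp <= 0:
        raise ValueError("region and background sizes must be positive")
    if hits_in_background == 0:
        raise ValueError("background has no motif instances; ratio undefined")
    region_density = hits_in_regions / regions_bp
    background_density = hits_in_background / background_bp
    return region_density / background_density


# Made-up counts: 30 instances in 1 Mb of cell-type-specific regions,
# versus 1000 instances in a 100 Mb background.
print(round(fold_enrichment(30, 1_000_000, 1000, 100_000_000), 2))
```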
Motifs and stage-specific chromatin marks
• abd-A motif is enriched in new H3K27me3 regions at L2
– Coincides with a drop in the expression of abd-A
– Model: sites gain H3K27me3 as abd-A binding lost
• Additional intriguing stories found, to be explored
[Figure: H3K27me3; fold enrichment or overexpression]
What about combinations of chromatin marks?
Jason Ernst
A hidden Markov model for chromatin state
[Figure: a stretch of DNA containing a transcription start site, an enhancer, and a transcribed region; the observed histone modifications along it; and the most likely hidden state (1 to 6) at each position. Each state has a set of highly likely modifications, with probabilities of about 0.7 to 0.9.]
Even where a modification was not observed, the correct state can still be inferred from neighboring locations, because a position is likely to be in the same state as its neighbors.
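The point about inferring a state despite a missing modification can be illustrated with a toy Viterbi decoder. This is a minimal sketch, not the model presented here: it assumes two hidden states, a single binary mark, and made-up start, transition, and emission probabilities. All names and numbers are hypothetical.

```python
import math

# Hypothetical two-state model; every number below is illustrative.
STATES = ["enhancer", "transcribed"]
START = [0.5, 0.5]
TRANS = [[0.9, 0.1],   # states tend to persist between neighboring bins
         [0.1, 0.9]]
P_MARK = [0.8, 0.1]    # probability of observing the mark in each state


def emission(state, observed):
    p = P_MARK[state]
    return p if observed else 1.0 - p


def viterbi(observations):
    """Return the most likely state index for each observation."""
    n = len(STATES)
    score = [math.log(START[s]) + math.log(emission(s, observations[0]))
             for s in range(n)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n):
            # Best previous state to arrive from, in log space.
            prev = max(range(n), key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score.append(score[prev] + math.log(TRANS[prev][s])
                             + math.log(emission(s, obs)))
            pointers.append(prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# The mark is missing in the middle bin, yet its neighbors carry it.
marks = [1, 1, 0, 1, 1]
print([STATES[s] for s in viterbi(marks)])
```

With these toy parameters the middle bin is still assigned to the same state as its neighbors, because switching state twice costs more than one unlikely emission.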
20 distinct chromatin states, combinations of marks
• Combinations of chromatin marks
– More informative than individual marks (A&B ≠ A&C)
– Small number of states (20 instead of all 2^21 ≈ 2 million)
– Allow study of co-occurrence patterns, independence…
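The "2 million" figure is the number of presence/absence combinations of 21 binary marks. A short sketch of that arithmetic follows, together with a helper that counts how many distinct combinations actually occur in a set of bins. The toy bins (with only 3 marks) and the function names are made up for illustration.

```python
def possible_combinations(n_marks):
    """Number of presence/absence patterns for n binary marks."""
    return 2 ** n_marks


def observed_combinations(bins):
    """Number of distinct mark patterns seen across genomic bins."""
    return len({tuple(b) for b in bins})


print(possible_combinations(21))  # about 2 million

# Toy example with 3 marks: only a few of the 8 possible patterns occur.
toy_bins = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (0, 0, 0), (0, 0, 1)]
print(observed_combinations(toy_bins), "of", possible_combinations(3))
```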
Each chromatin state associated w/ distinct function
Tentative annotations
• Reveals active/repressed promoters & enhancers
• Distinct enrichments for 5’UTR/3’UTR/transcripts
• Distinct chromatin properties of exons / introns
[Figure panels: transcriptional unit enrichment; transcription start site (TSS) enrichment; transcription termination site (TTS) enrichment]
Chromatin signatures as context for TF analysis
• TF role in establishing chromatin states
• Chromatin role in modulating TF function
Specific enrichment for DV and AP factors
Functions of 20 distinct chromatin states in fly
[Figure columns: chromatin marks, DV enhancers, AP enhancers, general TFs, insulators, replication, motifs]
The grand challenge ahead
Binding sites of every developmental regulator
Annotations & images for all expression patterns
Sequence motifs for every regulator
Expression domain primitives reveal underlying logic
[Figure panels: Dorsal-Ventral; Anterior-Posterior; motifs: CTCF, check; GAF, check; Su(Hw), check; BEAF-32, variant; CP190, novel; Mod(mdg4), novel]
Understand regulatory logic specifying development
Summary of our lab’s experience in (mod)ENCODE
• Protein-coding genes (Mike Lin)
– Hubbard: Predict new genes, evaluate novel genes
– Celniker: Distinguish coding/non-coding transcripts
• Chromatin domains (Jason Ernst)
– Karpen: Chromatin states in Drosophila
– Bernstein: Chromatin states in Human
• Motif and grammar discovery (Pouya Kheradpour)
– White: Motifs associated with insulator proteins
– Bernstein: Tissue-specific chromatin states
– White: Expression and Binding Time-course
• Tissue-specific gene expression (Chris Bristow)
– Celniker: Embryo expression domains
– All: Predictive models of gene expression
Acknowledgements
Pouya Kheradpour, Alex Stark, Mike Lin, Jason Ernst, Chris Bristow
Funding
ENCODE, modENCODE, NHGRI, NSF, Sloan Foundation
TFs/Insul.: Kevin White, Bing Ren, Nicolas Negre, Par Shah, Jim Posakony
12+8-flies: Andy Clark, Mike Eisen, Bill Gelbart, Doug Smith, Peter Cherbas
Chromatin: Gary Karpen, Aki Minoda, Nicole Riddle, Peter Park + Kharchenko
Prot.Genes: BDGP: Sue Celniker, Jane Landolin, FlyBase: Bill Gelbart