Ensembl: Going beyond A, T, G and C


ENCODE: understanding our genome
Ewan Birney
The ENCODE Project Consortium
Biosapiens Network of Excellence
ENCODE experiments
Area                     Assay                      Groups
Proteins                 Manual annotation, RT-PCR  Guigo, Harrow+Hubbard, Reymond
Transcripts              Tiling Arrays              Gingeras, Snyder
Transcripts              Tag seq.                   Yijun, Riken
General Chromatin Marks  Tiling Arrays, ChIP        Dunham, Reng
Sequence sp. Factors     Tiling Arrays, ChIP        Snyder, Gingeras, Farnham, Dunham
DNaseI sens.             PCR, Tiling arrays         Stam., Crawford
Replication              Tiling arrays              Dutta
Conservation             Comparative sequence       Green, Sidow, Miller
DNA structure            Hydroxyl radical           Leib
Promoter                 Reporter assays            Myers
ENCODE Pilot
• Started in 2004: deciding on winning technologies up front was considered too expensive and too risky
• 1% of the genome (30 Mb) chosen; all experiments run on the same 1%
• Pilot phase ended
– Analysis and publication
– Scale-up to genome-wide now funded
A lot of ChIP-chip
Nowadays, a lot of ChIP-seq
Transcription
• Lots of it
– And not all of it genes
– And even when it is inside a gene, not all of it with open reading frames
– And even when it has an open reading frame, not all of it making sense (evolutionarily or structurally)!
• Not technical false positives
Protein coding loci are far more complex than we think
• On average 5 transcripts per locus
• Many do not encode proteins (as far as we can see)
• Even among the ones which do encode proteins, many of the proteins look “weird”
Implausible structures
Many effects on potential function
Signal peptides, TM Helices
• 1097 protein transcripts from 487 loci
– 219 have signal peptides (107 loci)
– 12 loci have an isoform without the signal peptide
– 41 transcripts have a gain or loss of a transmembrane helix (sometimes up to 8!)
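The counts on this slide can be turned into proportions with a few lines of arithmetic. The sketch below uses only the numbers quoted above. The variable names are illustrative, and reading the 12 loci as a subset of the 107 signal-peptide loci is an assumption.

```python
# Counts quoted on the slide.
protein_transcripts = 1097
loci = 487
sp_transcripts = 219             # transcripts with a signal peptide
sp_loci = 107                    # loci those transcripts come from
loci_with_sp_free_isoform = 12   # assumed to be a subset of sp_loci
tm_change_transcripts = 41       # gain or loss of a transmembrane helix

# Derived proportions.
transcripts_per_locus = protein_transcripts / loci
frac_sp_transcripts = sp_transcripts / protein_transcripts
frac_sp_loci = sp_loci / loci
frac_sp_loci_mixed = loci_with_sp_free_isoform / sp_loci
frac_tm_change = tm_change_transcripts / protein_transcripts

print(f"Protein transcripts per locus: {transcripts_per_locus:.2f}")
print(f"Transcripts with signal peptide: {frac_sp_transcripts:.1%}")
print(f"Loci with a signal-peptide transcript: {frac_sp_loci:.1%}")
print(f"Of those loci, with an isoform lacking it: {frac_sp_loci_mixed:.1%}")
print(f"Transcripts with a TM helix gain/loss: {frac_tm_change:.1%}")
```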
The Clade B Serpins
[Figure: serpin structures. Panel labels: (a) inactive, "stressed"; (b) active (beta inserted); (c), (d), (e), (f). Annotation: Potential missing fragments]
Transcription Start Sites
Technologies on TSS
• Gencode (manual annotation)
• TxFrag (unbiased)
• DiTag data
• CAGE data
• Histone modifications
• DNaseI sensitivity
• Sequence-specific factors (e.g. Myc)
Integration Strategy
Anchor on 5’ ends: GenCode 5’ and CAGE/DiTag
Categorise and assess using transcript-based evidence: exons, TxFrags, CpG islands
Assess categories with histone and TF data
16,051 unique TSS
8,587 TSS “tight clusters”
5 different classes; first 4 have low P-values
First 4 categories have biological signals: 4,491 TSS
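The slides do not say how unique TSS were grouped into “tight clusters”. As a rough illustration of anchoring on 5’ ends, the sketch below merges 5’ end positions on the same chromosome and strand whenever neighbouring positions fall within a fixed window. The function name `cluster_tss`, the 20 bp window and the toy coordinates are all assumptions made for illustration, not the consortium's method.

```python
from collections import defaultdict

def cluster_tss(five_prime_ends, max_gap=20):
    """Group 5' end positions into clusters by distance to the previous tag.

    five_prime_ends: iterable of (chrom, strand, position) tuples.
    max_gap: hypothetical window in bp; consecutive sorted positions no
             further apart than this end up in the same cluster.
    Returns a list of (chrom, strand, start, end, n_tags) tuples.
    """
    by_key = defaultdict(list)
    for chrom, strand, pos in five_prime_ends:
        by_key[(chrom, strand)].append(pos)

    clusters = []
    for (chrom, strand), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= max_gap:
                count += 1              # extend the current cluster
            else:
                clusters.append((chrom, strand, start, prev, count))
                start = pos             # open a new cluster
                count = 1
            prev = pos
        clusters.append((chrom, strand, start, prev, count))
    return clusters

# Toy tags (made-up coordinates).
tags = [
    ("chr1", "+", 100), ("chr1", "+", 105), ("chr1", "+", 118),
    ("chr1", "+", 400), ("chr1", "-", 110), ("chr2", "+", 50),
]
for cluster in cluster_tss(tags):
    print(cluster)
```

Chaining on the gap to the previous tag keeps the sketch short; a real pipeline would also need to decide how to weight tag counts and how to treat strand-ambiguous data.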
TSS Categories
Category      Number (non-redundant)  P-value of overlap
GenCode 5’    1730                    2e-70
Exon (sense)  1437                    6e-39
Exon (anti)   521                     3e-8
TxFrag        639                     7e-63
CpG           164                     4e-90
No support    2666
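As a quick consistency check, the supported categories can be summed and the GenCode share computed. The sketch below uses only the counts listed for the TSS categories; the dictionary layout and variable names are illustrative.

```python
# Non-redundant TSS counts per category, as listed on the slide.
tss_counts = {
    "GenCode 5'": 1730,
    "Exon (sense)": 1437,
    "Exon (anti)": 521,
    "TxFrag": 639,
    "CpG": 164,
    "No support": 2666,
}

# Everything except the "No support" row has some supporting evidence.
supported = {name: n for name, n in tss_counts.items() if name != "No support"}
supported_total = sum(supported.values())
gencode_share = tss_counts["GenCode 5'"] / supported_total
all_categories_total = sum(tss_counts.values())

print(f"Supported TSS: {supported_total}")
print(f"GenCode 5' share of supported TSS: {gencode_share:.1%}")
print(f"All categories: {all_categories_total}")
```

The supported total comes to 4,491, the figure quoted on the Integration Strategy slide, and the GenCode 5’ share of that total is a little over 38%.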
GenCode 5’ ends
Unsupported tags
Novel TSSs
Conclusion
• There are 4,491 TSS with multiple lines of evidence supporting them
• This is ~10-fold more than the number of genes
• Only 38% would be traditionally classified as TSS (less if one took Ensembl or RefSeq)
Implications of many more TSSs
• Consistent with considerable diversity of transcripts
• Independently integrating ChIP-chip data suggested ~1,000 “Regulatory Clusters”
– 25% proximal considering Ensembl/RefSeq
– 65% when this TSS catalog is considered
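How many regulatory clusters count as proximal depends on which TSS catalog they are compared against. A minimal sketch of that comparison follows. The 2.5 kb cutoff, the function names and the toy coordinates are assumptions for illustration, since the slides do not state the distance threshold used.

```python
from bisect import bisect_left

def distance_to_nearest(position, sorted_tss):
    """Distance in bp from position to the closest entry of sorted_tss.

    sorted_tss must be a non-empty, ascending list of TSS coordinates.
    """
    i = bisect_left(sorted_tss, position)
    candidates = []
    if i < len(sorted_tss):
        candidates.append(sorted_tss[i] - position)      # nearest TSS at or after
    if i > 0:
        candidates.append(position - sorted_tss[i - 1])  # nearest TSS before
    return min(candidates)

def proximal_fraction(cluster_midpoints, tss_positions, cutoff=2500):
    """Fraction of clusters lying within cutoff bp of any TSS (hypothetical cutoff)."""
    sorted_tss = sorted(tss_positions)
    proximal = sum(
        1 for mid in cluster_midpoints
        if distance_to_nearest(mid, sorted_tss) <= cutoff
    )
    return proximal / len(cluster_midpoints)

# Toy data on a single chromosome (made-up coordinates).
clusters = [1_000, 12_000, 30_500, 52_000]
small_catalog = [500, 80_000]
larger_catalog = small_catalog + [11_000, 31_000]

frac_small = proximal_fraction(clusters, small_catalog)
frac_larger = proximal_fraction(clusters, larger_catalog)
print(frac_small, frac_larger)
```

With the same clusters, adding TSSs to the catalog can only raise the proximal fraction, which is the direction of the shift reported on the slide.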
More subtle conclusions
• Sequence-specific factors are distributed symmetrically around the TSS
– Should we only be taking upstream regions for reporter genes?
• Histone information is highly correlated with gene on/off status
– Generalising many locus-specific studies
Gene On/Off
Gene status prediction
Distal sites
Finding distal sites
• ChIP-chip not “great”
– Most look close to one of these new TSSs
– Factor bias?
• DNaseI Hypersensitive Sites (DHSs)
– All factors give a DHS signal
– 55% of DHSs are distal to any TSS
Distal DHS
Most surveyed factors are proximal
Replication
H3K27me3 is correlated
Evolutionary conservation and ENCODE
Evolutionary conservation
…but not everything is constrained
Why is there a discrepancy?
• False positives in the experiments
– But experiments validate at >80% and cross-validate each other
• False negatives in the constraint detection
– But can detect elements as short as 8 bp, and within the “neutral” zone of alignability
• Neutral turnover model
Neutral biochemical events
[Figure labels: Time; Lineage specific; “Functional” conservation; Mouse; Human]
Special case: Transcription
[Figure labels: Constrained sequence; Gene; Regulatory Information; Pre-miRNAs]
What should we learn from ENCODE
• “Whacky” transcription is real (but god knows what it does)
– Unconventional Transcript
• Lots more TSSs than we understand
– Many “distal” regions are actually close to promoters
• Broad-specificity marks are more useful
– DNaseI sites, histone marks
Neutral model for biochemical events on the genome
• Just because things happen reproducibly in multiple tissues does not imply selection
• (this is not the same as experimental variance)
• Could imply “functional” conservation outside of orthologous bases
– Comparative genomics sequencing not enough (but a great starting point!)
– Comparative functional investigation
Consortia work
• ENCODE
– Experimentally led consortium
– Needs a lot of computational collaboration
• Biosapiens
– Computationally led consortium
– Needs experimental collaboration (!)
• DNA: ENCODE
• Protein: Biosapiens
What happens next?
Ensembl Regulatory Build
[Figure: Chr 14, 5677077-567896. Labels: elements; GM06990 cells, Myc bound; Status]
Initial Regulatory Build
• DNaseI Hypersensitive sites, 6 histone modifications, CTCF binding
• ~110,000 elements, ~2 Mb of DNA
• 6,000 “promoter associated” by inherent pattern (DNaseI + H3K36me3)
• Available now
• This year: Mouse, more classification
Regulatory build
Ensembl - at your service
• Web browser www.ensembl.org
• MySQL DB access
• BioMart
• “Geek for a week”
– You send someone to us for a week
• Xose for a day
– We send someone to you for a day
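For the MySQL route, a minimal sketch of building a query for the standard `mysql` command-line client is shown below. The host `ensembldb.ensembl.org`, the `anonymous` user and port 3306 are assumptions taken from Ensembl's public documentation rather than from these slides, and should be checked against the current documentation before use. The helper name `mysql_command` is hypothetical.

```python
import shlex

# Assumed connection details for Ensembl's public MySQL server; they are
# not given in the slides and may change between Ensembl releases.
ENSEMBL_HOST = "ensembldb.ensembl.org"
ENSEMBL_USER = "anonymous"
ENSEMBL_PORT = 3306

def mysql_command(query, host=ENSEMBL_HOST, user=ENSEMBL_USER, port=ENSEMBL_PORT):
    """Build the argument list for the standard `mysql` command-line client."""
    return ["mysql", "-h", host, "-u", user, "-P", str(port), "-e", query]

# List the human core databases available on the server.
command = mysql_command("SHOW DATABASES LIKE 'homo_sapiens_core%'")
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list passed to `subprocess.run`, on a machine with the `mysql` client installed and network access.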
The ENCODE Project Consortium
Damian Keefe, Yutao Fu, Zhiping Weng, Mike Snyder, Elliott Marguilles, John Stam., Manolis Dermitzakis, Tom Gingeras, Roderic Guigo, Ian Dunham, Christophe Koch, Anindya Dutta, Paul Flicek and 293 others…
The Biosapiens Network of Excellence
Michael Tress, Alfonso Valencia, Janet Thornton, Roderic Guigo, Soren Brunak, David Jones, Martin Vingron, Anna Tramontano, Jacques van Helden and 57 others…