lecture25_DarkMatter..


proposed redefinition of “gene”
requires it to have a biological role
Gerstein MB, …, Snyder M. 2007. Genome Res 17: 669-681
example of complexities observed by ENCODE
(A) annotated exons (black rectangles), novel
transcriptionally active regions or TARs (hollow
rectangles); conventional annotation identifies
only 4 genes or just a fraction of the transcripts
reported (dashed lines are introns)
(B) observed transcripts are shown alongside
the sequences that regulate them (gray circles);
note that some of the enhancers are actually
promoters for novel splice isoforms
a redefinition of the “gene”
1. a gene is a genomic sequence directly encoding functional
product molecules, either RNAs or proteins
2. when there are several functional products that share
overlapping regions, take the union of all overlapping genomic
sequences encoding them
3. this union must be coherent, done separately for protein
and RNA products, but it does not require that all the products
necessarily share a common subsequence
concisely summarized as
a union of genomic sequences encoding a coherent set of
potentially overlapping functional products
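
The three rules above can be sketched as a grouping procedure. This is only an illustrative sketch, not the authors' implementation: the product names, coordinates, and the `kind`/`segments` fields are hypothetical, and "coherent" is interpreted here as "connected through chains of overlap among products of the same kind", which is an assumption.

```python
from collections import defaultdict

def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def products_overlap(p, q):
    """True if any genomic segment of product p overlaps any segment of q."""
    return any(overlaps(s, t) for s in p["segments"] for t in q["segments"])

def merge(segments):
    """Union of intervals, returned as sorted non-overlapping intervals."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def group_into_genes(products):
    """Group products into genes: same kind (rule 3) and linked by overlap (rule 2)."""
    parent = list(range(len(products)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(products)):
        for j in range(i + 1, len(products)):
            same_kind = products[i]["kind"] == products[j]["kind"]
            if same_kind and products_overlap(products[i], products[j]):
                parent[find(i)] = find(j)

    names, segments = defaultdict(list), defaultdict(list)
    for i, p in enumerate(products):
        names[find(i)].append(p["name"])
        segments[find(i)].extend(p["segments"])
    return sorted((sorted(names[r]), merge(segments[r])) for r in names)

# Hypothetical products; coordinates are made up for illustration only.
products = [
    {"name": "P1", "kind": "protein", "segments": [(100, 200)]},
    {"name": "P2", "kind": "protein", "segments": [(150, 300)]},
    {"name": "P3", "kind": "protein", "segments": [(250, 400)]},
    {"name": "P4", "kind": "protein", "segments": [(500, 600)]},
    {"name": "R1", "kind": "rna", "segments": [(120, 180)]},
]
genes = group_into_genes(products)
for gene_names, union in genes:
    print(gene_names, union)
```

In this toy input P1 and P3 share no subsequence, yet they land in one gene because each overlaps P2; R1 overlaps P1 and P2 on the genome but stays a separate gene because protein and RNA products are grouped separately.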
4 genes defined in this one locus
there are three primary transcripts, two of which encode five proteins, while the third
encodes a noncoding RNA; two primary transcripts share a 5’ untranslated region,
but they are considered different genes because the translated regions (D and E) do
not overlap; there is a noncoding RNA, but the fact that it shares its genomic sequence
(X and Y) with the protein-coding genomic segments A and E does not make it a co-product of these genes; there are four genes in this one locus by the new definition
gene number estimates as a
function of time and methodology
[figure labels: genome is sequenced; genes; observed transcripts; dark matter; sequence annotation; axis: time]
dark matter is reproducible, but it’s poorly transcribed, poorly conserved,
non-protein-coding, and outnumbers validated microRNAs by ~1000-fold
cDNA sequencing reveals an
abundance of non-coding genes
FANTOM categories for mouse cDNAs

                     coding1       coding2       non-coding1   non-coding2
number of cDNAs      14,317        3,277         11,526        4,280
size of transcript   2146 (1061)   2174 (1091)   1939 (1019)   1790 (996)
size of best ORFs    1107 (742)    550 (578)     206 (91)      194 (80)
% as single exon     13.4%         35.4%         68.7%         73.1%

[figure: number of cDNAs for coding1, coding2, non-coding1, non-coding2]
mouse cDNAs by Okazaki Y, …, Hayashizaki Y. 2002. Nature 420: 563 or
human cDNAs by Imanishi T, …, Sugano S. 2004. PLoS Biol 2: e162
neutral evolution of non-coding
cDNAs from mouse transcriptome
[figure: two panels, "BlastZ to HUMAN at 25% threshold" and "BlastZ to RAT at 25% threshold"; x axis: sequence identity [%]; series: coding1, coding1-CDS, coding2, non-coding1, non-coding2, ncRNAs, intron1, intergenic]
ncRNAs are known RNA genes; intron1 and intergenic are negative controls
communications arising Wang J, …, Wong GK. 2004. Nature 431: after p757
tiling array data are riddled with
unexplained signal anomalies too
do not assume that non-coding cDNAs are the same thing as tiling-array exons
[figure label: mystery BURST]
human thymus polyA+ cDNAs profiled at locus of Ewing sarcoma breakpoint
region 1 gene; from Johnson JM, …, Schadt EE. 2005. Trends Genet 21: 93
indications of biological relevance: transcription, conservation,
both lines of evidence, or neither?
[figure: two-by-two grid of transcription (poorly vs. highly transcribed) against conservation (highly vs. poorly conserved), with regions labeled "most biology" and "dark matter"]
possible dark matter explanations:
1. biological noise, i.e. real transcripts with no biological roles
2. RNA genes unique to a species
3. long RNAs are precursors for short (and conserved) RNAs
NB: dark matter based on tiling arrays with 150 bp exons is not equivalent to
cDNA sequences with 1800 bp exons
hypothesis is unannotated long
RNAs are precursors for short RNAs
Kapranov P, …, Gingeras TR. 2007. Science 316: 1484-1488
nuclear and cytosolic polyadenylated RNAs longer than 200 nt (long RNAs, lRNAs)
and whole-cell RNAs shorter than 200 nt (short RNAs, sRNAs) were profiled across the
non-repetitive portion of the human genome; 64% of poly(A)+ transcription (nucleus and
cytosol) does not align with annotated exons, yet some 80% of the 265,237 annotated exons are detected
lRNAs that overlap with sRNAs
are more PhastCons conserved (i)
PhastCons identifies evolutionarily conserved elements from a multi-species
sequence alignment, given their phylogenetic tree, and based on a statistical
model of evolution called a phylogenetic hidden Markov model (phylo-HMM)
lRNAs that overlap with sRNAs
are more PhastCons conserved (ii)
quantile-quantile plot of PhastCons scores for long RNAs that do (x axis) and
do not (y axis) overlap with short RNAs; conservatively, 3.1% of HepG2 and
2.4% of HeLa nuclear lRNA transfrags might be parts of precursors of sRNAs
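
How the points of such a quantile-quantile plot are obtained can be sketched as follows. This is a minimal illustration using Python's standard library; the two score lists are made-up values, not PhastCons data from the paper, and `qq_points` is a hypothetical helper name.

```python
import statistics

def qq_points(x_scores, y_scores, n=10):
    """Pair up matching quantiles of two samples.

    Returns n-1 (x_quantile, y_quantile) pairs; plotting them against the
    diagonal shows whether one distribution is shifted relative to the other.
    """
    qx = statistics.quantiles(x_scores, n=n)
    qy = statistics.quantiles(y_scores, n=n)
    return list(zip(qx, qy))

# Hypothetical conservation scores, for illustration only.
no_overlap = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.40]
with_overlap = [s + 0.1 for s in no_overlap]  # uniformly shifted upward

points = qq_points(with_overlap, no_overlap)
for x_q, y_q in points:
    print(f"{x_q:.3f}  {y_q:.3f}")
```

If the two groups had the same score distribution the points would fall on the diagonal; points below the diagonal (x larger than y at each quantile) indicate that the x-axis group has systematically higher scores.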
sRNAs associate with 5’ and 3’
boundaries of annotated transcripts
enrichment over random expectation is plotted as function of distance from 5’ and 3’
termini for sRNAs on same (sense) or opposite (antisense) strand as the annotated
transcripts; comparison is made against random regions with matched G+C content
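
An enrichment-versus-distance curve of this kind can be sketched as a ratio of binned counts. This is a simplified illustration, not the procedure used in the paper: the distances are invented, the bin size is arbitrary, the control distances are assumed to come from already G+C-matched random regions, and `enrichment_by_distance` is a hypothetical helper.

```python
from collections import Counter

def enrichment_by_distance(observed, control, bin_size=100):
    """Observed/expected ratio of sRNA counts per distance bin.

    observed: signed distances (nt) of sRNAs from a transcript terminus
    control:  the same measurement for random control regions
    Returns {bin_start: ratio}; ratio is None where the control bin is empty.
    """
    obs = Counter(d // bin_size for d in observed)
    ctl = Counter(d // bin_size for d in control)
    # Rescale control counts so both sets have the same total.
    scale = len(control) / len(observed)
    result = {}
    for b in sorted(set(obs) | set(ctl)):
        expected = ctl[b] / scale
        result[b * bin_size] = obs[b] / expected if expected else None
    return result

# Hypothetical distances, for illustration only.
observed = [10, 20, 30, 150, 250, -40]
control = [10, 150, 250, 350, -40, -140]

enrichment = enrichment_by_distance(observed, control)
for bin_start, ratio in enrichment.items():
    print(bin_start, ratio)
```

A ratio above 1 in the bins nearest the terminus would correspond to the enrichment peak described on the slide; running the same calculation separately for sense and antisense sRNAs would give the two curves.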