ppt - University of Connecticut
RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes
Serghei Mangul*, Adrian Caciula*, Ion Mandoiu** and Alexander Zelikovsky*
*Georgia State University, **University of Connecticut
Genes, Exons, Introns, and Splicing
INITIALIZATION: uniform transcript frequencies f(j)
E-STEP: compute the expected number n(j) of reads sampled from transcript j (assuming the current transcript frequencies f(j))
M-STEP: for each transcript j, set f(j) = portion of reads emitted by transcript j among all reads in the sample
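The EM iteration described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the dictionary layout, and the toy panel are hypothetical, and the weight of a transcript-read pair is assumed to be usable directly as the probability that the transcript emits that read.

```python
from typing import Dict, List, Tuple

def em_transcript_frequencies(
    weights: Dict[Tuple[str, str], float],
    read_counts: Dict[str, float],
    transcripts: List[str],
    iterations: int = 200,
) -> Dict[str, float]:
    """Estimate transcript frequencies f(j) by EM.

    weights[(j, i)] is the (assumed) probability that transcript j emits
    read i; pairs that are absent count as 0.
    """
    # INITIALIZATION: uniform transcript frequencies f(j)
    f = {j: 1.0 / len(transcripts) for j in transcripts}
    for _ in range(iterations):
        # E-step: expected number n(j) of reads sampled from transcript j
        n = {j: 0.0 for j in transcripts}
        for i, count in read_counts.items():
            share = {j: f[j] * weights.get((j, i), 0.0) for j in transcripts}
            total = sum(share.values())
            if total == 0.0:
                continue  # no transcript in the panel can emit read i
            for j in transcripts:
                n[j] += count * share[j] / total
        # M-step: f(j) = portion of reads emitted by transcript j
        assigned = sum(n.values())
        if assigned == 0.0:
            break
        f = {j: n[j] / assigned for j in transcripts}
    return f

# Hypothetical toy panel: r1 matches only T1, r2 matches both, r3 only T2.
toy_weights = {("T1", "r1"): 0.5, ("T1", "r2"): 0.5,
               ("T2", "r2"): 0.5, ("T2", "r3"): 0.5}
toy_counts = {"r1": 60, "r2": 30, "r3": 10}
toy_f = em_transcript_frequencies(toy_weights, toy_counts, ["T1", "T2"])
```

On this toy panel the shared read r2 is split between T1 and T2 in proportion to the current frequencies, so the estimate for T1 settles above its share of uniquely mapped reads.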
Gene - a segment of DNA or RNA
that carries genetic information.
Exon - a region of a gene which is
translated into protein
Intron - a region of a gene which is not
translated into protein
Splicing – a process in which the
introns are removed and exons are joined
to be translated into a single protein
Alternative splicing – the process in which exons can
be spliced out in different
combinations, named transcripts,
to generate the mature RNA.
Expectation Maximization (EM)
Discovery and Reconstruction of Unannotated Transcripts
DRUT (Discovery and Reconstruction of Unannotated Transcripts):
GIVEN: A set of transcripts and frequencies for the reads.
FIND: Transcripts missing from the set.
Quality of ML Model
Fig. 1. Chromosome with its DNA
Alternative splicing is a common
mode of gene regulation within
cells, being used by 90–95% of
It can drastically alter the function of a gene in different tissue types or environmental
conditions, or even inactivate the gene completely.
Fig. 2. Alternative Splicing Process
The possible gaps in the ML model include:
- erroneous reads caused by genotyping errors
- missing and/or chimeric candidate transcripts
- inaccurate read-to-transcript matches (caused by genotyping errors)
- non-uniform emission of reads by transcripts
a) Map reads to annotated transcripts (using Bowtie)
Measure the quality of the ML model by the deviation D of the observed read frequencies (o_i) from the expected read frequencies (e_i):
$D = \frac{\sum_i |o_i - e_i|}{|R|}$
where |R| is the number of reads.
Expected read frequencies (e_i) are calculated based on:
- the weighted match between reads and transcripts (h_{T_j,i})
- the maximum likelihood frequency estimates of the transcripts (f_j^{ML})
$e_i = \sum_{j : h_{T_j,i} > 0} \frac{h_{T_j,i}}{\sum_l h_{T_j,l}} f_j^{ML}$
Fig. 4 shows the relation between transcripts, exons and reads.
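The deviation measure can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the function names and the toy panel are hypothetical, the expected frequency of a read is taken to be the sum over transcripts of the row-normalized weight times the ML transcript frequency, and |R| is taken to be the number of distinct reads in the panel.

```python
from typing import Dict

def expected_read_frequencies(
    h: Dict[str, Dict[str, float]],   # h[j][i]: weight of read i on transcript j
    f_ml: Dict[str, float],           # ML frequency estimate of each transcript
) -> Dict[str, float]:
    """Expected frequency of each read: sum over j of (h[j][i] / row sum) * f_ml[j]."""
    expected: Dict[str, float] = {}
    for j, row in h.items():
        row_total = sum(row.values())
        if row_total == 0.0:
            continue  # transcript j emits no reads
        for i, weight in row.items():
            expected[i] = expected.get(i, 0.0) + (weight / row_total) * f_ml[j]
    return expected

def deviation(observed: Dict[str, float], expected: Dict[str, float]) -> float:
    """D = (sum of |observed - expected| over reads) / number of reads (assumed |R|)."""
    total = sum(abs(o - expected.get(i, 0.0)) for i, o in observed.items())
    return total / len(observed)

# Hypothetical panel: two transcripts, three reads, equal ML frequencies.
h = {"T1": {"r1": 1.0, "r2": 1.0}, "T2": {"r2": 1.0, "r3": 1.0}}
f_ml = {"T1": 0.5, "T2": 0.5}
e = expected_read_frequencies(h, f_ml)   # r1: 0.25, r2: 0.5, r3: 0.25
d_fit = deviation({"r1": 0.25, "r2": 0.5, "r3": 0.25}, e)   # observed matches expected
d_gap = deviation({"r1": 0.4, "r2": 0.4, "r3": 0.2}, e)     # observed deviates
```

A deviation of zero means the panel explains the reads exactly; a larger value hints that the panel may be incomplete.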
b) VTEM: Identify “overexpressed” exons (possibly from unannotated transcripts)
c) Assemble transcripts (e.g., Cufflinks) using reads from “overexpressed” exons and unmapped reads
d) Output: annotated transcripts + novel transcripts
Alternative splicing is implicated in many diseases.
Fig. 4. Transcripts – Exons – Reads Relation.
Virtual Transcript Expectation Maximization (VTEM)
- RIGHT: reads -> observed frequencies
- EDGES: weights ~ probability of the read being emitted by the transcript
Fig. 9(a) shows that it is more difficult to correctly reconstruct all transcripts in genes
with more transcripts. As a result, Cufflinks performs better on genes with few transcripts,
since it does not use annotations in its standard settings.
DRUT has higher sensitivity on genes with 2 and 3 transcripts, but RABT is
better on genes with 4 transcripts.
For genes with more than 4 transcripts, the performance of annotation-guided
methods is equal to the “existing annotations ratio”, which means that these
methods are unable to reconstruct unannotated transcripts.
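For reference, sensitivity and PPV over sets of transcripts can be computed as in the sketch below. It assumes, as an illustration only, that a reconstructed transcript counts as correct when its exon coordinates match a reference transcript exactly; the poster does not state the matching criterion, and the function name and coordinates are hypothetical.

```python
from typing import Iterable, Tuple

# A transcript is represented by its exons as (start, end) pairs.
Transcript = Tuple[Tuple[int, int], ...]

def sensitivity_and_ppv(
    reference: Iterable[Transcript],
    reconstructed: Iterable[Transcript],
) -> Tuple[float, float]:
    """Sensitivity = TP / |reference|, PPV = TP / |reconstructed|."""
    ref, rec = set(reference), set(reconstructed)
    true_positives = len(ref & rec)  # exact-match criterion (assumption)
    sensitivity = true_positives / len(ref) if ref else 0.0
    ppv = true_positives / len(rec) if rec else 0.0
    return sensitivity, ppv

# Hypothetical gene with 4 reference transcripts; 3 are reported, 2 correctly.
reference = [((1, 100), (200, 300)), ((1, 100), (400, 500)),
             ((1, 100), (200, 300), (400, 500)), ((200, 300), (400, 500))]
reconstructed = [((1, 100), (200, 300)), ((1, 100), (400, 500)),
                 ((1, 100), (600, 700))]
sens, ppv = sensitivity_and_ppv(reference, reconstructed)
```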
GIVEN: Annotations (transcripts) and
frequencies of the reads.
FIND: ML estimate of transcript frequencies
Fig. 3. Panel: bipartite graph consisting of transcripts with unknown frequencies
and reads with observed frequencies (o_i)
Decide if the panel is likely to be incomplete
Estimate total frequency of missing transcripts
Identify read spectrum emitted by missing transcripts
Assemble missing transcripts from read spectrum emitted by missing transcripts
Input data of EM is a panel, a bipartite graph consisting of:
- a set of candidate transcripts that are believed to emit the set of reads
- weighted matches based on the mapping of read i to transcript j (h_{T_j,i})
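One way to hold such a panel in memory is sketched below. The layout (a dictionary of observed read frequencies plus a nested dictionary of weights) and the name `build_panel` are assumptions made for illustration; the poster does not prescribe a data structure.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def build_panel(
    read_ids: List[str],
    matches: Iterable[Tuple[str, str, float]],
) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """Build the bipartite panel.

    read_ids: one entry per sequenced read (identical reads repeat).
    matches:  (read, transcript, weight) triples from the read mapping.
    Returns (observed, h) where observed[i] is the observed frequency of
    read i and h[j][i] is the weight of the edge between transcript j and read i.
    """
    counts = Counter(read_ids)
    total = sum(counts.values())
    observed = {i: c / total for i, c in counts.items()}
    h: Dict[str, Dict[str, float]] = {}
    for read, transcript, weight in matches:
        # Keep only edges to reads that were actually observed.
        if read in observed and weight > 0:
            h.setdefault(transcript, {})[read] = weight
    return observed, h

# Hypothetical input: r1 seen twice; r9 is mapped but never observed.
observed, h = build_panel(
    ["r1", "r1", "r2", "r3"],
    [("r1", "T1", 1.0), ("r2", "T1", 0.5), ("r2", "T2", 0.5),
     ("r3", "T2", 1.0), ("r9", "T2", 1.0)],
)
```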
ML Estimates of Transcript Frequencies
The probability that a read is sampled from transcript j is proportional to f(j)
f(j) is the (unknown) frequency of transcript j
The ML estimate of f(j) is given by n(j)/(n(1) + . . . + n(N))
n(j) denotes the number of reads sampled from transcript j
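Because n(j) is not observed directly, EM alternates between estimating it and re-estimating f(j). Written out in the standard form for mixture proportions, one iteration looks as follows. This is a sketch, not a formula taken from the poster: it assumes that read i, with observed count o_i, is emitted by transcript j with probability proportional to f(j) h_{T_j,i}.

```latex
% E-step: expected number of reads sampled from transcript j
n(j) = \sum_{i} o_i \, \frac{f(j)\, h_{T_j,i}}{\sum_{k=1}^{N} f(k)\, h_{T_k,i}}

% M-step: updated frequency of transcript j
f(j) \leftarrow \frac{n(j)}{n(1) + \dots + n(N)}
```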
Simulation Setup: human genome data (UCSC hg18)
UCSC database: 66,803 isoforms, 19,372 genes
Single error-free reads: 60M reads of length 100 bp
Partially annotated genome: exactly one isoform is removed from every gene
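The partially annotated genome can be simulated along these lines. The sketch assumes the annotation is available as a mapping from gene to a list of isoform identifiers, and it picks the removed isoform at random; the poster does not say how the removed isoform was chosen, and the function name and gene names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

def remove_one_isoform_per_gene(
    annotation: Dict[str, List[str]],
    seed: int = 0,
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Return (partial annotation, removed isoform per gene)."""
    rng = random.Random(seed)  # fixed seed keeps the simulation reproducible
    kept: Dict[str, List[str]] = {}
    removed: Dict[str, str] = {}
    for gene, isoforms in annotation.items():
        if not isoforms:
            kept[gene] = []
            continue  # nothing to remove
        victim = rng.choice(isoforms)  # random choice is an assumption
        removed[gene] = victim
        kept[gene] = [t for t in isoforms if t != victim]
    return kept, removed

# Hypothetical annotation with three genes.
annotation = {"geneA": ["A.1", "A.2", "A.3"], "geneB": ["B.1", "B.2"], "geneC": ["C.1"]}
partial, removed = remove_one_isoform_per_gene(annotation)
```

Note that a gene with a single isoform ends up with no annotated transcripts at all.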
- LEFT: transcripts -> unknown frequencies
- RIGHT: reads
Fig. 7. VTEM (panels: 1st, 2nd and 3rd ML runs)
Legend:
E – Expected exon frequencies
VT – Virtual Transcripts with h_{T_j,i} = 0
ML – Estimated transcript frequency
Panel annotations:
- weights (to 0): VT frequency stays 0, no false positives
- VT frequency increases; deviation of expected from observed decreases
Fig. 8. An example of VTEM estimation
VT frequency (.2) ≈ T3 frequency (.25)
VT’s exons (E3, E4) = T3’s exons (E3, E4)
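The virtual transcript idea can be illustrated with a small sketch. It is not the authors' VTEM implementation: the update rule (add the observed-minus-expected deviation to the virtual transcript's weights, clipped at 0), the tolerance, the function names, and the exon composition of T1 and T2 are all assumptions chosen for the example.

```python
from typing import List, Tuple

def normalise(h: List[List[float]]) -> List[List[float]]:
    """Row-normalise weights into emission probabilities; zero rows stay zero."""
    out = []
    for row in h:
        s = sum(row)
        out.append([w / s if s > 0 else 0.0 for w in row])
    return out

def run_em(p: List[List[float]], observed: List[float], iterations: int = 500) -> List[float]:
    """ML frequencies for emission probabilities p[j][i] (transcript j, exon i)."""
    f = [1.0 / len(p)] * len(p)
    for _ in range(iterations):
        n = [0.0] * len(p)
        for i, o in enumerate(observed):
            total = sum(f[j] * p[j][i] for j in range(len(p)))
            if total > 0:
                for j in range(len(p)):
                    n[j] += o * f[j] * p[j][i] / total
        assigned = sum(n)
        if assigned == 0:
            break
        f = [x / assigned for x in n]
    return f

def vtem(h: List[List[float]], observed: List[float],
         runs: int = 3, tol: float = 1e-9) -> Tuple[float, List[float]]:
    """Return (virtual transcript frequency, virtual transcript weights)."""
    vt = [0.0] * len(observed)      # virtual transcript weights start at 0
    delta = [0.0] * len(observed)
    f = [0.0]
    for _ in range(runs):
        # Assumed update rule: shift VT weights by the deviation, clipped at 0.
        vt = [max(0.0, w + d) for w, d in zip(vt, delta)]
        p = normalise(h + [vt])
        f = run_em(p, observed)
        expected = [sum(f[j] * p[j][i] for j in range(len(p)))
                    for i in range(len(observed))]
        delta = [o - e if abs(o - e) > tol else 0.0
                 for o, e in zip(observed, expected)]
        if not any(delta):
            break                   # observed = expected: nothing to update
    return f[-1], vt

# Exons E1..E4; hypothetical T1 = (E1, E2), T2 = (E1, E3), T3 = (E3, E4).
t1, t2, t3 = [1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]
observed = [0.375, 0.25, 0.25, 0.125]   # true frequencies 0.5, 0.25, 0.25
vt_complete, _ = vtem([t1, t2, t3], observed)
vt_incomplete, vt_weights = vtem([t1, t2], observed)
```

With the complete panel the virtual transcript frequency stays at 0; once T3 is left out, the unexplained reads on E4 give the virtual transcript a positive frequency.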
Observed = Expected: nothing to update
Maximum Likelihood (ML) Model
Fig. 9, panels a) and b): the x-axis is the number of transcripts per gene.
Fig. 9. a) Sensitivity and PPV of the methods grouped by the number of transcripts per gene. Here, 60M
single reads of length 100 bp are simulated.
* Cufflinks is a well-known tool for transcriptome reconstruction.
1. S. Mangul, I. Astrovskaya, M. Nicolae, B. Tork, I. Mandoiu, and A. Zelikovsky, “Maximum likelihood estimation of incomplete genomic spectrum from HTS data,” in Proc. 11th Workshop on Algorithms in Bioinformatics, 2011.
2. C. Trapnell, B. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. van Baren, S. Salzberg, B. Wold, and L. Pachter, “Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation,” Nature Biotechnology, vol. 28, no. 5, pp. 511–515, 2010.