Transcript Discovery through RNA-Seq

RNA-Seq as a Discovery Tool
Julia Salzman
Deciphering the Genome
Power of RNA-Seq: Quantification and Discovery
• RNA Isoform specific gene expression
• Gene fusions
• Overlooked RNA structural variants
Salzman, Gawad, Wang, Lacayo, Brown, 2012
Paired-end RNA-Seq
Matched sequences are obtained for each library molecule
CTTC…..GAAG
GGAC…..GCCT
Data: millions of 70-150+ bp A/C/G/T sequences
• Part 1: Isoform-Specific Expression
Example: Paired-end Data Aligned
Some reads are informative about isoform-specific expression
Paired-end RNA-Seq for RNA Isoform-Specific Gene Expression
[Figure: Rnpep gene model, with Exon 1 and Exon 4 labeled]
Goal: estimate the expression of each isoform
Nontrivial: we only observe fragments of sequences
• Since the size distribution of library molecules is known, inferred insert lengths can be used to increase statistical power and improve inference
Insert Length Distributions
Insert lengths of the entire library (pooled) can be calculated and used to precisely estimate the distribution of sizes of cDNA in the library:
[Figure: distribution of sequenced molecule length; x-axis in base pairs, ticks at 100, 200, 300]
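A minimal sketch of how the pooled insert lengths could be turned into an empirical length distribution. It assumes the insert length of each read pair has already been inferred from its alignment; the function name and the toy input values are hypothetical and not taken from the talk.

```python
from collections import Counter


def insert_length_distribution(insert_lengths):
    """Return {length: probability} estimated from pooled insert lengths."""
    counts = Counter(insert_lengths)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no insert lengths supplied")
    # Relative frequency of each observed length across the whole library.
    return {length: n / total for length, n in sorted(counts.items())}


# Toy, made-up insert lengths (in base pairs) pooled over a library.
pooled = [150, 200, 200, 210, 200, 150, 300, 200]
dist = insert_length_distribution(pooled)
```

In practice the histogram would be built from millions of pairs and might be smoothed; this sketch keeps the raw relative frequencies.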
Paired-end RNA-Seq Model
• Compute genome-wide insert length distribution
[Figure: distribution of sequenced molecule length; x-axis in base pairs, ticks at 100, 200, 300]
• Mapped to Isoform 1: length 150
• Mapped to Isoform 2: length 90
Salzman, Jiang, Wong 2011
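The slide's example (the same read pair implies length 150 on Isoform 1 and length 90 on Isoform 2) can be sketched as a weighting by the insert length distribution. This is a simplified illustration, not the published model: it assumes equal prior abundance unless one is given, and the probabilities in `length_prob` are made-up values.

```python
def isoform_posteriors(implied_lengths, length_prob, abundance=None):
    """Weight each compatible isoform by how plausible its implied insert length is.

    implied_lengths: {isoform: insert length implied if the pair came from it}
    length_prob:     {length: probability} from the library-wide distribution
    abundance:       optional {isoform: prior relative abundance}
    """
    isoforms = list(implied_lengths)
    if abundance is None:
        # Assumption: equal prior abundance for every compatible isoform.
        abundance = {iso: 1.0 / len(isoforms) for iso in isoforms}
    weights = {
        iso: abundance[iso] * length_prob.get(implied_lengths[iso], 0.0)
        for iso in isoforms
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("implied lengths have zero probability under the distribution")
    return {iso: w / total for iso, w in weights.items()}


# Hypothetical probabilities for the two lengths shown on the slide.
length_prob = {90: 0.02, 150: 0.18}
post = isoform_posteriors({"isoform1": 150, "isoform2": 90}, length_prob)
```

With these made-up numbers the pair is assigned mostly to Isoform 1, because a 150 bp molecule is far more typical of the library than a 90 bp one.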
Using PE for quantification is statistically more powerful
• The PE model is a statistical improvement over naïve models and has optimal information reduction
• “Information” gain using PE sequencing
• Overall, using “mate pair” information gives more power, but experimental artifacts can sometimes affect results
Paired-end Size Distributions are the Foundation for TopHat and other PE RNA-Seq Algorithms
Summary and Problems:
• rely on a reference
• assume uniformity of size distributions in library
• overlook biases
[Figure: panels labeled Rep. 1 and Rep. 2]
• Part 2: Gene Fusions
Recurrent Gene Fusions in Cancer
A handful of recurrent fusions in solid tumors
• PAX8-PPARγ fusion (thyroid cancer)
• EML4-ALK fusion (non-small cell lung cancer)
• TMPRSS2-ERG family fusion (prostate cancer)
Not genome-wide
More to be learned by unbiased study of RNA
Fusion Discovery
• 2 flavors
  – Totally “de novo” discovery
    • Search for any RNA fragments out of order with respect to the reference genome, not necessarily coinciding with exon boundaries
    • Noisy
  – Discovery with a reference database
    • Discover fusions at annotated exon boundaries (protein coding), with better statistical checks
    • Misses some fusions
Reference Approach
• Search for gene fusions with exon A in gene 1 spliced to exon B of gene 2
[Figure: Exon A spliced to Exon B]
Algorithm (with respect to reference)
• Remove all PE reads consistent with the reference
• Identify gene pairs supported by PE reads where (read1, read2) map to (gene1, gene2)
• Find PE reads of the form: (gene A, gene A-B junction)
[Figure: Exon A spliced to Exon B]
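The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each mate is assumed to be already aligned and summarized either as a gene name (it maps within that gene) or as a `(gene_a, gene_b)` tuple (it maps across a candidate exon-exon junction). The function name, the input encoding, and the gene names are hypothetical.

```python
from collections import Counter


def find_fusion_candidates(read_pairs):
    """Tally evidence for gene fusions from summarized paired-end alignments."""
    spanning = Counter()  # mates map to two different genes
    junction = Counter()  # one mate in a gene, the other on a junction involving it
    for mate1, mate2 in read_pairs:
        m1_junction = isinstance(mate1, tuple)
        m2_junction = isinstance(mate2, tuple)
        if not m1_junction and not m2_junction:
            if mate1 == mate2:
                continue  # step 1: consistent with the reference, discard
            # Step 2: (read1, read2) map to (gene1, gene2); order is ignored here.
            spanning[tuple(sorted((mate1, mate2)))] += 1
        elif m1_junction != m2_junction:
            gene, junc = (mate2, mate1) if m1_junction else (mate1, mate2)
            # Step 3: pair of the form (gene A, gene A-B junction).
            # Assumption: the non-junction mate may lie in either partner gene.
            if gene in junc:
                junction[junc] += 1
        # Pairs with both mates on junctions are ignored in this sketch.
    return spanning, junction


# Toy input with placeholder gene names.
pairs = [
    ("GENE1", "GENE1"),
    ("GENE1", "GENE2"),
    ("GENE1", ("GENE1", "GENE2")),
    (("GENE1", "GENE2"), "GENE2"),
    ("GENE3", ("GENE1", "GENE2")),
]
spanning, junction = find_fusion_candidates(pairs)
# Keep junctions that are also supported by gene-pair spanning reads.
supported = [j for j in junction if spanning[tuple(sorted(j))] > 0]
```

Requiring both kinds of evidence is one possible way to reduce noise; the statistical checks mentioned in the talk are not reproduced here.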
Paired-End RNA-Seq for Gene Fusions in Ovarian Tumors
• Paired-end sequencing of poly-A selected RNA from 12 late-stage tumors: genome-wide search
• Top hit of our algorithm: ESRRA-C11orf20
• Isoform-specific estimation: ESRRA and the fusion are expressed at roughly equal magnitude (Salzman, Jiang, Wong)
[Figure: ESRRA, the fusion, and C11orf20]
Salzman et al., 2011
• Part 3: Exploratory Analysis of RNA Rearrangements
Bioinformatic Analysis
• Thousands of exon scrambling events in RNA from human leukocytes and cancer samples
Inconsistent with the reference genome!
[Figure: wildtype genome DNA and canonical transcript]
Potential Biological Mechanisms for RNA Rearrangements
• DNA rearrangement
• RNA rearrangement
• Trans-splicing
• Template switching
• PCR artifact
Analysis of Leukocyte Data
• Exons in ‘scrambled’ (non-increasing) order with respect to canonical exon order
• Thousands of genes with evidence of exon scrambling
• Naïve estimate of fractional abundance: ratio of scrambled read rate to all read rate (per transcript)
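A small sketch of the two ideas above: flagging a junction as scrambled when its exons appear in non-increasing canonical order, and the naïve per-transcript fraction. It assumes each junction-spanning read has been reduced to a pair of canonical exon numbers, and it uses junction-spanning reads as the denominator; both choices are assumptions for illustration.

```python
def is_scrambled(junction):
    """True if the downstream exon does not come after the upstream exon.

    junction: (upstream_exon, downstream_exon) in canonical exon numbering.
    """
    upstream, downstream = junction
    return downstream <= upstream


def scrambled_fraction(junction_reads):
    """Naive fraction of a transcript's junction reads that are scrambled."""
    if not junction_reads:
        return None  # no junction evidence for this transcript
    scrambled = sum(1 for j in junction_reads if is_scrambled(j))
    return scrambled / len(junction_reads)


# Toy reads for one transcript: exon 1->2 and 2->3 are canonical,
# exon 3->2 and 4->2 are scrambled.
reads = [(1, 2), (2, 3), (3, 2), (4, 2)]
fraction = scrambled_fraction(reads)
```

The estimate is naïve in the sense the slide indicates: it ignores differences in junction mappability and sequencing bias.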
100s of Transcripts with High Fractions of Scrambled Isoforms
[Figure: 100s of genes with canonical isoform < 25% and scrambled isoform > 75%]
100s of transcripts from B cells, stem cells and neutrophils have >50% of copies from the scrambled isoform
What Models Can Explain Exon Scrambling in RNA?
Model 1 to Explain RNA Exon Scrambling
Model 1 Prediction
Can be made statistically precise
Model 1 is statistically inconsistent with the vast majority of the data
Alternative Model
Model and data are consistent
Mining RNA-Seq Data for Evidence Consistent with Circular RNA?
• In poly-A depleted samples, expect to see strong evidence of scrambled exons (circular RNA)
• In poly-A selected samples, expect to see little evidence of scrambled exons (circular RNA)
Poly-A Depleted Samples Enriched for Scrambled Exons
Align all reads to a custom database
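The slide does not say what the custom database contains. One plausible construction, shown here purely as an assumption, is a set of exon-exon junction sequences for every ordered pair of exons in a gene, including scrambled pairs, so that reads spanning a non-canonical junction have something to align to. The function name, the `flank` parameter, and the toy exon sequences are hypothetical.

```python
def junction_database(exons, flank=4):
    """Build {(i, j): sequence} for every ordered exon pair of one gene.

    exons: exon sequences in canonical order; keys use 1-based exon numbers.
    Each entry joins the last `flank` bases of exon i to the first `flank`
    bases of exon j, so pairs with j <= i represent scrambled junctions.
    """
    db = {}
    for i, donor in enumerate(exons, start=1):
        for j, acceptor in enumerate(exons, start=1):
            db[(i, j)] = donor[-flank:] + acceptor[:flank]
    return db


# Toy exon sequences, not real annotation.
exons = ["AAAACCCC", "GGGGTTTT", "ACGTACGT"]
db = junction_database(exons)
scrambled_keys = [key for key in db if key[1] <= key[0]]
```

In a real analysis the flank would be chosen relative to read length so that a read can only align to a junction entry if it actually crosses the junction.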
Summary of RNA-Seq for NGS
• RNA-Seq can be used for discovery
• TopHat and other fusion/splicing algorithms give a broad picture
• May have significant noise
• Miss important features of RNA expression
Currently, all published/downloadable algorithms will miss identifying circular RNA!
(feel free to contact me for the algorithm to identify circular RNA!)