RNA-seq introduction

RNA-seq library prep introduction
NESCent Academy
Outline
• Methodologies and history
• RNA-seq challenges
• Library preparation methods
• Common queries
• Validation
• Spike-in and future-proofing your work
Gene expression
RNA sequencing
Samples of interest: Condition 1 (normal colon), Condition 2 (colon tumor)
• Isolate RNAs
• Generate cDNA, fragment, size select, add linkers
• Sequence ends
• Map to genome, transcriptome, and predicted exon junctions
• Downstream analysis
100s of millions of paired reads
10s of billions of bases of sequence
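The read and base counts quoted here are linked by simple arithmetic. A minimal sketch, assuming a hypothetical run of 200 million read pairs at 2 x 100 bp (these figures are illustrative, not taken from the slides):

```python
def total_bases(read_pairs: int, read_length: int) -> int:
    """Bases sequenced in a paired-end run: two reads per pair."""
    return read_pairs * 2 * read_length


# Hypothetical run: 200 million read pairs, 100 bp per read (assumed values).
bases = total_bases(200_000_000, 100)
print(f"{bases:,} bases")  # 40,000,000,000 bases
```

So hundreds of millions of paired reads at typical read lengths give tens of billions of bases.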
Methodologies for RNA-Seq studies
• Mapping transcription start sites
• Strand-specific RNA-Seq
• Characterization of alternative splicing patterns
• Gene fusion detection
• Targeted approaches using RNA-Seq
• Small RNA profiling
• Direct RNA sequencing
• Profiling low-quantity RNA samples
Pre-NGS transcriptomics
• Hybridization-based approaches
– Genomic tiling microarrays
– Fluorescently labelled cDNA with microarrays
• Sequence-based approaches
– Sanger sequencing of cDNA or EST libraries
– Serial analysis of gene expression (SAGE)
– Cap analysis of gene expression (CAGE)
– Massively parallel signature sequencing (MPSS)
RNA-seq challenges
• RNAs consist of small exons that may be separated by large
introns
– Mapping reads to genome is challenging
• The relative abundance of RNAs varies wildly
– Over a range of 10^5 to 10^7 (5 to 7 orders of magnitude)
– Since RNA sequencing works by random sampling, a small fraction of
highly expressed genes may consume the majority of reads
– Ribosomal and mitochondrial genes
• RNAs come in a wide range of sizes
– Small RNAs must be captured separately
– PolyA selection of large RNAs may result in 3’ end bias
• RNA is fragile compared to DNA (easily degraded)
• Bacterial samples may need to be depleted of rRNA
Rubbish in = Rubbish out
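The random-sampling point can be made concrete with a small calculation. A minimal sketch, assuming reads are drawn in proportion to transcript copy number (transcript length is ignored) and using made-up abundances:

```python
def read_share(abundances, top_n):
    """Expected fraction of reads taken by the top_n most abundant transcripts,
    assuming reads are sampled in proportion to abundance."""
    ranked = sorted(abundances, reverse=True)
    return sum(ranked[:top_n]) / sum(ranked)


# Hypothetical sample: 10 very abundant transcripts (rRNA-like) at 100,000
# copies each, plus 9,990 transcripts at 10 copies each.
abundances = [100_000] * 10 + [10] * 9_990
share = read_share(abundances, top_n=10)
print(f"Top 10 transcripts take {share:.1%} of reads")
```

In this toy sample, 0.1% of the transcripts take roughly nine reads in ten, which is why rRNA depletion or polyA selection matters.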
RNA-seq library prep methodologies
• Two main routes for mRNA-seq preparation
– Illumina TruSeq prep
– Script-seq
• Generally Script-seq is our favourite
Illumina TruSeq RNA library prep
Adaptor ligation and
standard library
preparation
2 days for 8 samples
Size selection step
5 µg of total RNA
~$100 per sample
Not strand-specific
Script-seq method
2 hours for 12 samples
< 1 µg of RNA
~$150 per sample
Strand-specific
DNA library preparation: RNA fragmentation and DNA fragmentation compared
a | Fragmentation of oligo-dT primed cDNA (blue line) is more biased towards the 3' end of the transcript. RNA fragmentation (red line) provides more even coverage along the gene body, but is relatively depleted for both the 5' and 3' ends. Note that the ratio between the maximum and minimum expression level (or the dynamic range) for microarrays is 44; for RNA-Seq it is 9,560. The tag count is the average sequencing coverage for 5,000 yeast ORFs.
b | A specific yeast gene, SES1 (seryl-tRNA synthetase), is shown.
Common questions: How much library depth is needed for RNA-seq?
• My advice: don’t ask this question if you want a simple answer…
• Depends on a number of factors:
– Question being asked of the data. Gene expression? Alternative
expression? Mutation calling?
– Tissue type, RNA preparation, quality of input RNA, library
construction method, etc.
– Sequencing type: read length, paired vs. unpaired, etc.
– Computational approach and resources
• Identify publications with similar goals
• Pilot experiment
• Good news: 1/8 to 1 lane of recent Illumina HiSeq data should be enough for most purposes
Coverage versus depth
Common questions: What mapping strategy should I use for RNA-seq?
• Depends on read length
• < 50 bp reads
– Use aligner like BWA and a genome + junction database
– Junction database needs to be tailored to read length
• Or you can use a standard junction database for all read lengths
and an aligner that allows substring alignments for the junctions
only (e.g. BLAST … slow).
– Assembly strategy may also work (e.g. Trans-ABySS)
• > 50 bp reads
– Spliced aligner such as TopHat, or de novo assembly with Trinity
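One way to tailor a junction database to read length is to trim the flanking exon sequence so that any read aligning within a junction entry must cross the splice site. A minimal sketch of that idea; the function name, the `min_overhang` default of 4 and the toy sequences are assumptions for illustration, not part of any specific tool:

```python
def junction_sequence(upstream_exon: str, downstream_exon: str,
                      read_length: int, min_overhang: int = 4) -> str:
    """Build one junction database entry for a given read length.

    Each side keeps (read_length - min_overhang) bases, so a full-length
    read placed inside the entry covers at least min_overhang bases on
    both sides of the junction.
    """
    flank = read_length - min_overhang
    if flank <= 0:
        raise ValueError("read_length must exceed min_overhang")
    # Slicing past the exon boundary simply returns the whole exon.
    return upstream_exon[-flank:] + downstream_exon[:flank]


# Toy exons (assumed): 50 bp each, 36 bp reads.
entry = junction_sequence("A" * 50, "G" * 50, read_length=36)
print(len(entry))  # 64
```

Longer flanks would let reads that sit entirely inside one exon align to the junction entry, which is why the database depends on read length.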
Common questions: How reliable are expression predictions from RNA-seq?
• Are novel exon-exon junctions real?
– What proportion validate by RT-PCR and Sanger
sequencing?
• Are differential/alternative expression changes
observed between tissues accurate?
– How well do differential expression values correlate with
qPCR?
• 384 validations
– qPCR, RT-PCR, Sanger sequencing
• See ALEXA-Seq publication for details:
– Also includes comparison to microarrays
– Griffith et al. Alternative expression analysis by RNA
sequencing. Nature Methods. 2010 Oct;7(10):843-847.
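Agreement between RNA-seq and qPCR is commonly summarised as a correlation between log2 fold changes. A minimal sketch using a hand-written Pearson correlation; the fold-change values are made up for illustration and are not results from the ALEXA-Seq paper:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log2 fold changes for five genes (tumor vs normal).
rnaseq_log2fc = [2.0, -1.0, 0.5, 3.0, -2.0]
qpcr_log2fc = [1.8, -0.7, 0.4, 2.5, -1.9]
r = pearson(rnaseq_log2fc, qpcr_log2fc)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the two platforms rank and scale the changes similarly. It does not show that either one is unbiased.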
Common questions: How many replicates?
• As many as you can afford
• TopHat/Cufflinks statistics work best with three or more biological replicates
Validation (qualitative)
33 of 192 assays shown. Overall validation rate = 85%
RNA-seq vs Microarray
Spike-in controls
• How can you identify limits of detection and ensure
your data can be compared to future platforms or new
library prep methods? (e.g. How does Oxford
Nanopore compare to Illumina sequencing?)
• Spike RNA of known concentration into your total RNA
• http://tools.invitrogen.com/content/sfs/manuals/4455352C.pdf
• Cost - $20 per sample
RNA-seq spike-in protocol
Assessing lower limit of detection
Assessing fold change response
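Both assessments reduce to comparing observed read counts with the known spike-in concentrations. A minimal sketch, assuming a log-log least-squares fit for the response and a simple read-count threshold for the detection limit; the concentrations, counts and the `min_reads` threshold are made-up values:

```python
import math


def loglog_fit(concentrations, counts):
    """Least-squares slope and intercept of log10(count) against
    log10(concentration), using only spike-ins with count > 0."""
    pts = [(math.log10(c), math.log10(n))
           for c, n in zip(concentrations, counts) if n > 0]
    if len(pts) < 2:
        raise ValueError("need at least two detected spike-ins")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("detected spike-ins share one concentration")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pts) / sxx
    return slope, mean_y - slope * mean_x


def detection_limit(concentrations, counts, min_reads=1):
    """Lowest spiked concentration with at least min_reads reads, or None."""
    detected = [c for c, n in zip(concentrations, counts) if n >= min_reads]
    return min(detected) if detected else None


# Hypothetical spike-in series (arbitrary concentration units) and read counts.
conc = [0.01, 0.1, 1.0, 10.0, 100.0]
counts = [0, 2, 20, 200, 2000]
slope, intercept = loglog_fit(conc, counts)
limit = detection_limit(conc, counts)
print(f"slope = {slope:.2f}, detection limit = {limit}")
```

A slope close to 1 suggests read counts scale proportionally with input, so fold changes are preserved. The detection limit depends on sequencing depth and on the threshold chosen.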
Take home
• Good quality total RNA of 1-10 µg
• Have 3 or more biological replicates
• Unless you have good reason, use a Script-seq
type protocol
• Use a standard spike-in as an internal control
and to ensure samples can be compared
across platforms
• Don’t forget to validate key findings with
qPCR!