Transcriptome assembly
How to design arrays with next-generation sequencing (NGS) data
Lecture 2
Christopher Wheat
Outline
Transcriptome sequencing
Assembly
Assessing assembly
Annotation
Calling SNPs
Designing probes
Making decisions about experimental design
≈10,000 genes in triplicate
Gene annotation
48k contigs
Vera and Wheat et al. 2008
Getting the genes
Cheapest method is to directly sequence them
Sequence the transcriptome
Challenges
Getting right tissue, timing, induction, etc.
Getting the population variation (SNPs, indels, etc.)
Getting the high quality RNA
Choosing a sequencing method
Assembling the data and assessing it
Annotating the data
Pool? Yes!
Normalize? Maybe ….
8 day old aerial tissue, A. thaliana seedlings
Run 1 touched 17,449 gene models (60% of genes)
Run 2 only touched 10% more
Microarray studies indicate 55-67% of genes are expressed in this tissue
They estimate they captured 90% of the transcriptome in the tissue
Weber et al. 2007
Roche 454 (long but shallow) vs. Illumina (short but deep)
Fundamental tradeoffs in reads: length vs. depth vs. cost
Length: 400 vs. 2 x 100 bp
Depth: 1.2 E6 vs. 300 E6 reads
Costs: 10,000 Euros vs. 2500 Euros
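The tradeoff above can be put on one scale as sequenced bases per euro. The sketch below is only a rough illustration using the per-run figures on this slide; it assumes each of the 300 E6 Illumina reads is a 2 x 100 bp pair, and the function and variable names are made up for the example.

```python
# Per-run figures copied from the slide; prices are rough and date quickly.
platforms = {
    "Roche 454": {"read_bp": 400, "reads": 1.2e6, "cost_eur": 10_000},
    # Assumption: each of the 300 E6 reads is a 2 x 100 bp pair.
    "Illumina PE": {"read_bp": 2 * 100, "reads": 300e6, "cost_eur": 2_500},
}


def bases_per_euro(read_bp, reads, cost_eur):
    """Total bases sequenced in one run divided by the cost of the run."""
    return read_bp * reads / cost_eur


yield_per_euro = {name: bases_per_euro(**p) for name, p in platforms.items()}
for name, value in yield_per_euro.items():
    print(f"{name}: {value:,.0f} bp per euro")

# How many times more bases per euro Illumina PE gives under these figures.
ratio = yield_per_euro["Illumina PE"] / yield_per_euro["Roche 454"]
print(f"Illumina PE / Roche 454: {ratio:,.0f}x")
```

Swap in your own quotes before drawing conclusions; only the ratio matters, not the absolute numbers.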
Roche 454
Stats per run:
350 - 450 bp
1.2 E6 reads
500 MBp
0.5 days
10,000 euro?
[Figure: 454 flow diagram, with example sequences GGGG and TCAGCGTAAGG. Huse et al. 2007]
Illumina
Illumina, Inc.
Illumina
Stats:
2 x 100 bp
300-500 E6 reads
60-200 GBp
9.5 days
3,000 euros?
Dephasing limits read length
No homopolymer-run issues
Per-read error rate
differs because of the sequencing-by-synthesis method
current estimate is very low
Correction methods
quality scores and bioinformatics
http://seq.molbiol.ru/
Which to use?
Illumina PE
because so much more data is generated per euro, giving good transcriptome coverage and thus assembly of even lowly expressed genes or rare isoforms
(do your own price comparisons)
Challenge: Bioinformatics
Assembly
Transcriptome (all the above issues)
SNPs, indels, CNV, repeated elements, error
Fragmented assembly is the norm
Alternative splicing
Software
Trinity, Oases, Trans-ABySS, SeqMan, CAP3, MIRA2, Newbler, CLC, etc.
Settings
Many methods, few studies comparing their performance
But see Kumar and Blaxter, and Trinity paper.
Computational power (beyond HD space):
CPU vs. RAM: assembly tends to be RAM intensive
Learn bioinformatics, hire a bioinformatician, buy expensive software ….
All comes down to time and money ….
But there is also no “perfect” way to do
something, as each species appears to be a bit
different, so comparing different methods is the
best route
CLC is a very nice, accessible commercial package,
but like all things, it requires a fast computer.
Blast against what?
Important to determine a genomic reference
species
Predicted gene models for comparison
Need species with predicted gene set ideally <
100 million years divergent
Many genes should be shared
Even divergent species are useful for assessing assembly runs (method × parameters)
Compare results
Predicted genes:
D. melanogaster = 13,379
B. mori = 18,510
Estimated coverage:
70% D. mel estimate
50% B. mori estimate
But how much of each
gene does each contig
assemble?
How much fragmentation?
But what do these numbers
mean?
45,000 contigs had BLAST hits to 9,000 gene models in another species
What are these gene models? Are isoforms included?
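One way to read these numbers is to turn the BLAST results into two summaries: the share of the reference gene set that is hit, and the average number of contigs per hit gene (a rough fragmentation signal). The sketch below is only an illustration; it assumes one best hit per contig has already been extracted as (contig, gene) pairs, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def summarize_hits(best_hits, n_predicted_genes):
    """Summarize contig-to-gene best hits against a reference gene set.

    best_hits: iterable of (contig_id, gene_id), one best hit per contig.
    n_predicted_genes: size of the reference predicted gene set.
    """
    contigs_by_gene = defaultdict(set)
    for contig_id, gene_id in best_hits:
        contigs_by_gene[gene_id].add(contig_id)
    genes_hit = len(contigs_by_gene)
    contigs_with_hit = sum(len(c) for c in contigs_by_gene.values())
    return {
        "genes_hit": genes_hit,
        "fraction_of_gene_set": genes_hit / n_predicted_genes,
        # Values well above 1 suggest several fragments per gene.
        "contigs_per_gene": contigs_with_hit / genes_hit if genes_hit else 0.0,
    }


# Toy, made-up hits: five contigs hitting two of four reference genes.
toy_hits = [("c1", "geneA"), ("c2", "geneA"), ("c3", "geneA"),
            ("c4", "geneB"), ("c5", "geneB")]
summary = summarize_hits(toy_hits, n_predicted_genes=4)
print(summary)
```

With the slide's figures, 45,000 contigs over 9,000 gene models is 5 contigs per gene on average, which is why the isoform question matters.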
Filtering the predicted gene set to remove
isoforms and recent duplicates helps greatly
RBB90 dataset is useful.
Cd-hit: a fast program for clustering and
comparing large sets of protein or nucleotide sequences
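A simple form of isoform filtering is to keep only the longest sequence per gene before BLASTing. The sketch below shows that idea only; it is not CD-HIT (which clusters by sequence similarity), it assumes the gene-to-isoform mapping is already known, and the names and toy records are hypothetical.

```python
def longest_isoform_per_gene(records):
    """Keep the longest isoform for each gene.

    records: iterable of (gene_id, isoform_id, sequence).
    Returns a dict mapping gene_id to (isoform_id, sequence).
    """
    best = {}
    for gene_id, isoform_id, seq in records:
        current = best.get(gene_id)
        # Replace only when strictly longer, so the first of equal-length ties wins.
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (isoform_id, seq)
    return best


# Toy, made-up gene set: two isoforms of geneA, one of geneB.
toy_records = [
    ("geneA", "geneA-RA", "MKTLLV"),
    ("geneA", "geneA-RB", "MKTLLVAGGS"),
    ("geneB", "geneB-RA", "MSSP"),
]
filtered = longest_isoform_per_gene(toy_records)
print({gene: iso for gene, (iso, _) in filtered.items()})
```

Recent duplicates have different gene identifiers, so they still need similarity-based clustering with a tool such as CD-HIT.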
Potential Blast bias source
Wheat, C. W. (2012) SNP Discovery in Non-model Organisms Using 454
Next Generation Sequencing.
Metabolic Map Comparison
Bombyx mori with WGS
M. cinxia with 454 seq.
Upper estimate of 70% = 13,142 genes
[Figure: bar chart comparing M. cinxia and B. mori (y-axis 0 to 140) for glycolysis & related pathways, citric acid cycle, fatty acid metabolism, purine metabolism, pyrimidine metabolism, oxidative phosphorylation, ribosomal proteins, and the average]
Wheat 2008
Assessing De novo transcriptome assembly
[Figure: focal species (454) vs. nearest WGS species, with ratio classes >1, =1, <1]
Vera & Wheat et al. 2008 Mol. Ecol.
Hornett & Wheat et al. 2012
Relative ortholog coverage
Ex. 6 species assemblies with
blast result insights
454 EST libraries
22 genes assessed for sequence coverage
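Sequence coverage of a gene can be estimated as the fraction of the reference gene spanned by the union of its contig hits. The sketch below is one simple way to compute that and is not necessarily the metric used in the cited papers; it assumes 1-based inclusive hit coordinates on the reference gene, and the function name and toy coordinates are hypothetical.

```python
def covered_fraction(hit_intervals, gene_length):
    """Fraction of a reference gene covered by the union of hit intervals.

    hit_intervals: list of (start, end), 1-based inclusive, on the reference.
    gene_length: length of the reference gene in the same units.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(hit_intervals):
        # Count only the part of this hit beyond what is already covered.
        start = max(start, last_end + 1)
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / gene_length


# Toy, made-up example: three contigs hitting a 500-residue gene.
hits = [(1, 100), (50, 150), (300, 400)]
fraction = covered_fraction(hits, gene_length=500)
print(f"{fraction:.3f} of the gene is covered")
```

Overlapping contigs are counted once, so a gene hit by many redundant fragments does not look better covered than it is.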
Alternative splicing
> 80% in humans
> 40 % in fruit flies
Most assemblers
Designed for genomic data
Don’t know how to handle splicing
But Trinity can!
What effects will this have on a microarray?
Transcriptome assembly: alternative splicing example
Vera and Wheat et al.
Uses Illumina PE data
Incorporates alternative
splicing into its assembly
Does a great job assembling full-length transcripts
Successfully predicts many
isoforms as well
Grabherr et al. 2011
Downside:
Generates potentially incorrect isoforms
Different contigs for each
haplotype
SNP by splicing event
Can cluster these results,
possibly using CAP3
software for consensus and
SNP calling
Grabherr et al. 2011
Calling SNPs
Many programs do this now
Each sequencing method has its own specific errors
Best to require > 2 reads supporting the minor allele before accepting a SNP call
Generate consensus sequences with SNP calls
as template for probe design
Know the sensitive region of probes to SNP/indel
variation … Agilent probes are robust!
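The "> 2 reads for the minor allele" rule can be sketched as a filter on the bases observed at one alignment position. This is a minimal illustration, not a replacement for a real SNP caller; it ignores base qualities and strand, and the function name and threshold parameter are assumptions made for the example.

```python
from collections import Counter


def call_snp(bases, min_minor_reads=3):
    """Return (major, minor) alleles at one position, or None.

    bases: string of base calls from the reads covering the position.
    min_minor_reads: reads required for the minor allele (3 means "> 2").
    """
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    ranked = counts.most_common(2)
    if len(ranked) < 2:
        return None  # only one allele observed
    (major, _), (minor, minor_n) = ranked
    return (major, minor) if minor_n >= min_minor_reads else None


print(call_snp("AAAAAAGGG"))  # minor allele G seen in 3 reads
print(call_snp("AAAAAAGG"))   # minor allele G seen in only 2 reads
```

Raising `min_minor_reads` trades sensitivity for fewer false calls from sequencing error, which matters most at low coverage.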
SNP calling
Many different methods, criteria.
Just because it’s published doesn’t make it ideal for you
Wheat, C. W. (2012) SNP Discovery in Non-model Organisms Using 454
Next Generation Sequencing.
Choosing Probes
Binding performance
SNPs, indels, alternative splicing
Avoid them?
Use them, via tiling probes?
All genes or just annotated ones?
3’ UTR end or tiling across whole gene
Combination of the above?
Recommend
Technical replicates within array
Run a test array to assess design
Potential example
Only genes / contigs with annotations
Probes in triplicate
Tiled across entire gene
Covering SNPs, indels, alt. splicing sites
Initial array designed, printed, and tested with
several different RNA pools to look at probe
hybridization performance
Full experimental set of arrays ordered + 20%
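Tiling probes across a gene in triplicate can be sketched as below. This is an illustration only: the 60 bp probe length and 20 bp step are assumed values rather than figures from these slides, the function names are hypothetical, and real probe design would also score binding performance.

```python
def tile_probes(sequence, probe_len=60, step=20):
    """Return (start, probe) tuples tiled across the sequence (0-based start)."""
    if len(sequence) < probe_len:
        return []
    last = len(sequence) - probe_len
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        starts.append(last)  # make sure the 3' end is covered
    return [(s, sequence[s:s + probe_len]) for s in starts]


def replicate(probes, n=3):
    """Repeat each probe n times, for technical replicates within the array."""
    return [(start, rep, probe)
            for start, probe in probes
            for rep in range(1, n + 1)]


# Toy, made-up 130 bp contig.
contig = "ACGT" * 32 + "AC"
probes = tile_probes(contig)
features = replicate(probes)
print(len(probes), "tiled probes,", len(features), "array features")
```

A smaller step gives denser coverage of SNPs, indels, and splice sites at the cost of array space, which is limited if about 10,000 genes go on in triplicate.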
Challenge: Bioinformatics
Annotation of fragmented data
Multiple contigs may belong to same gene
Unannotated sequences (novel coding, UTR, junk?)
How to conduct statistical analysis of the fragmented data?
Combine results, pick best probes, etc.?
Are outliers biological or technical?
If biological, separate loci or splicing?
Unannotated probes with significant results
Where to go?
What will change tomorrow
Read lengths and quality
Read lengths per DNA strand
Paired end fragment sizes
Parallelization
Number of samples per run
Amount of starting material needed
Bioinformatic tools
RNA-Seq more common ……
What won’t change tomorrow
Need for good experimental questions &
design
Biological realities
Complications of finding the genes
Expression
Patterns of genetic variation
Need for validation (indep. & higher)
Limited annotation insights
Conclusion
Many methods, and rationales for using some over others
Arrays work great, but will they take you where
you want to go?
Analysis is the most challenging part, so work
with datasets that will be similar to yours.
You need to decide what you want
Can you get answers from those that you want?
What software/program skills do you need?
Collaboration helps for many things
Some references
Feldmeyer, B., C. W. Wheat, N. Krezdorn, B. Rotter and M. Pfenninger (2011) Short
read Illumina data for the de novo assembly of a non-model snail species
transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of
assembler performance. BMC Genomics, 12:317.
Wheat, C. W., and H. Vogel (2011) Transcriptome sequencing goals, assembly and
assessment in V. Orgogozo, and M. V. Rockman, eds. Molecular methods for
evolutionary genetics. Humana Press, New York.
Wheat, C. W. (2008) Rapidly developing functional genomics in ecological model
systems via 454 transcriptome sequencing. Genetica 138: 433-451.
Hornett, E. A. and C. W. Wheat (2012) Quantitative RNA-Seq analysis in non-model species: assessing transcriptome assemblies as a scaffold and the utility of
evolutionary divergent genomic reference species. BMC Genomics. 13:361.
Grabherr, M. G. et al. (2011) Full-length transcriptome assembly from RNA-Seq
data without a reference genome. Nat Biotechnol. 29, 644–652
Kumar, S. and Blaxter, M. L. (2010) Comparing de novo assemblers for 454
transcriptome data. BMC Genomics. 11, 571
Many available on my website: www.christopherwheat.net
Some recommendations
Illumina sequencing, paired end, variable
fragment size from 200-500 bp, unnormalized (but
normalized is better).
Many individuals X tissues X treatment, etc., to
reflect the experimental material
Assemble with Trinity, join isoforms and
haplotypes into contigs using CAP3
Assess via BLAST to relevant species
Annotate dataset
Design probes for annotated genes, tiling when
possible for SNPs, indels, and splicing
Consider running test set of probes to assess.
Thanks