Transcriptome assembly


How to design arrays with next-generation sequencing (NGS) data
Lecture 2
Christopher Wheat
Outline

Transcriptome sequencing
Assembly
Assessing assembly
Annotation
Calling SNPs
Designing probes

Making decisions about experimental design





[Figure: ≈10,000 genes in triplicate; gene annotation; 48k contigs. Vera and Wheat et al. 2008]
Getting the genes

Cheapest method is to directly sequence them


Sequence the transcriptome
Challenges
 Getting the right tissue, timing, induction, etc.
 Getting the population variation (SNPs, indels, etc.)
 Getting high-quality RNA
 Choosing a sequencing method
 Assembling the data and assessing it
 Annotating the data

Pool? Yes!
Normalize? Maybe ….





8-day-old aerial tissue, A. thaliana seedlings
Run 1 touched 17,449 gene models (60% of genes)
Run 2 touched only 10% more
Microarray studies indicate 55-67% of genes are expressed in this tissue
The authors estimate they have 90% of the transcriptome in this tissue
Weber et al. 2007
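The diminishing return from a second run can be tracked by counting how many gene models each run touches for the first time. The following is a minimal sketch, assuming each run has already been reduced to a list of gene-model IDs; the IDs and the function name `new_genes_per_run` are hypothetical and not from Weber et al.

```python
def new_genes_per_run(runs):
    """For each run, count gene models not touched by any earlier run."""
    seen = set()
    newly_touched = []
    for touched in runs:
        fresh = set(touched) - seen   # gene models first seen in this run
        newly_touched.append(len(fresh))
        seen |= fresh
    return newly_touched, len(seen)

# Hypothetical gene-model IDs touched by two sequencing runs.
run1 = ["g1", "g2", "g3", "g4", "g5", "g6"]
run2 = ["g2", "g3", "g5", "g6", "g7"]

per_run, total = new_genes_per_run([run1, run2])
print(per_run, total)
```

A steep drop in newly touched gene models from one run to the next suggests that most of the expressed genes in that tissue have already been sampled.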
Roche 454 (long but shallow) vs. Illumina (short but deep)

Fundamental tradeoffs in reads: length vs. depth vs. cost (Roche 454 vs. Illumina)
Length: 400 bp vs. 2 x 100 bp
Depth: 1.2 E6 vs. 300 E6 reads
Cost: 10,000 euros vs. 2,500 euros
Roche 454

Stats per run:
 350 - 450 bp
 1.2 E6 reads
 500 MBp
 0.5 days

10,000 euro?
[Figure: flow diagram, with example sequences GGGG and TCAGCGTAAGG. Huse et al. 2007]
Illumina
Illumina, Inc.

Stats:
 2 x 100 bp
 300-500 E6 reads
 60-200 GBp
 9.5 days

3,000 euros?


Dephasing limits read length
No homopolymer-run issues
Per-read error rate
 due to differences in the sequencing-by-synthesis method
 current estimate is very low
Correction methods
 quality scores and bioinformatics
http://seq.molbiol.ru/
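Quality scores are usually reported on the Phred scale, where Q = -10 log10(p) and p is the estimated probability that the base call is wrong. The sketch below converts a FASTQ quality string into error probabilities. It assumes Phred+33 encoding; older Illumina pipelines used an offset of 64, so check your own data. The function names are illustrative only.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)

def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string; offset=33 is an assumption (Phred+33)."""
    return [ord(ch) - offset for ch in qual]

def mean_error_prob(qual, offset=33):
    """Average per-base error probability over one read."""
    scores = decode_quality_string(qual, offset)
    return sum(phred_to_error_prob(q) for q in scores) / len(scores)

# Hypothetical quality string for a 4-base read.
qual = "II5+"
print(decode_quality_string(qual))
print(mean_error_prob(qual))
```

Averaging the probabilities, rather than the Q scores themselves, keeps a few very poor bases from being hidden by many good ones.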
Which to use?
Illumina PE, because so much more data is generated per euro, giving good transcriptome coverage and thus assembly of even lowly expressed genes and rare isoforms
(do your own price comparisons)
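A price comparison can be reduced to euros per megabase of sequence. The sketch below uses the approximate per-run figures quoted earlier in this lecture (read counts, read lengths and costs); treat them as placeholders and substitute current quotes from your own sequencing provider.

```python
# Approximate per-run figures from the slides; replace with your own quotes.
platforms = {
    "Roche 454": {"reads": 1.2e6, "bp_per_read": 400, "cost_eur": 10_000},
    "Illumina PE": {"reads": 300e6, "bp_per_read": 2 * 100, "cost_eur": 2_500},
}

def euros_per_mbp(reads, bp_per_read, cost_eur):
    """Cost of one megabase (1e6 bp) of raw sequence."""
    total_mbp = reads * bp_per_read / 1e6
    return cost_eur / total_mbp

for name, p in platforms.items():
    print(f"{name}: {euros_per_mbp(**p):.3f} euros per Mbp")
```

Cost per base ignores read length, which matters for assembly, so this is only one side of the length vs. depth vs. cost tradeoff.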
Challenge: Bioinformatics

Assembly

Transcriptome (all the above issues)
 SNPs, indels, CNV, repeated elements, error
 Fragmented assembly is the norm
 Alternative splicing

Software
 Trinity, Oases, Trans-ABySS, SeqMan, CAP3, Mira2, Newbler, CLC, etc.
 Settings
 Many methods, few studies comparing their performance
 But see Kumar and Blaxter (2010) and the Trinity paper (Grabherr et al. 2011).

Computational power (beyond HD space):
 CPU vs. RAM: assembly tends to be RAM intensive
Learn bioinformatics, hire a bioinformatician, or buy expensive software ...
It all comes down to time and money ...
But there is also no “perfect” way to do it: each species appears to be a bit different, so comparing different methods is the best route
CLC is a very nice, accessible commercial package,
but like all things, it requires a fast computer.
BLAST against what?
Important to determine a genomic reference species
 Predicted gene models for comparison
 Need a species with a predicted gene set, ideally < 100 million years divergent
 Many genes should be shared
 Even divergent species are useful for assessing assembly method x parameter combinations

Compare results
Predicted genes:
 D. melanogaster = 13,379
 B. mori = 18,510
Estimated coverage:
 70% D. mel estimate
 50% B. mori estimate
But how much of each gene does each contig assemble?
How much fragmentation?
But what do these numbers mean?

45,000 contigs had BLAST hits to 9,000 gene models in another species
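On the numbers above, 45,000 contigs hitting 9,000 gene models is an average of 5 contigs per gene model, which says nothing yet about how much of each gene is covered. The sketch below estimates both fragmentation (contigs per gene) and the fraction of each gene covered, by merging the aligned intervals on the reference gene. It assumes the BLAST hits have already been parsed into (contig, gene, subject start, subject end) tuples; all IDs, coordinates and gene lengths are hypothetical.

```python
from collections import defaultdict

def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def summarize_hits(hits, gene_lengths):
    """Return {gene: (number of contigs, fraction of gene covered)}."""
    contigs = defaultdict(set)
    spans = defaultdict(list)
    for contig, gene, s_start, s_end in hits:
        contigs[gene].add(contig)
        # Reverse-strand hits report start > end, so normalise the order.
        spans[gene].append((min(s_start, s_end), max(s_start, s_end)))
    summary = {}
    for gene in contigs:
        covered = sum(e - s + 1 for s, e in merge_intervals(spans[gene]))
        summary[gene] = (len(contigs[gene]), covered / gene_lengths[gene])
    return summary

# Hypothetical hits: (contig, gene model, subject start, subject end).
hits = [
    ("contig1", "geneA", 1, 100),
    ("contig2", "geneA", 81, 200),
    ("contig3", "geneA", 301, 400),
    ("contig4", "geneB", 1, 250),
]
gene_lengths = {"geneA": 500, "geneB": 250}

summary = summarize_hits(hits, gene_lengths)
for gene, (n_contigs, fraction) in sorted(summary.items()):
    print(gene, n_contigs, round(fraction, 2))
```

A gene hit by several contigs that together cover only part of its length is both fragmented and incomplete, which a simple count of gene models hit would hide.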


What are these gene models? Are isoforms included?
Filtering the predicted gene set to remove isoforms and recent duplicates helps greatly
 RBB90 dataset is useful.
 CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences
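One simple part of this filtering, collapsing isoforms, can be sketched as keeping only the longest protein per gene. This is a simplified stand-in and not what CD-HIT does: removing recent duplicates requires sequence-similarity clustering, which this sketch does not attempt. The input layout (isoform ID mapped to gene ID and sequence) and all identifiers are hypothetical.

```python
def longest_isoform_per_gene(proteins):
    """Keep one sequence per gene: the longest isoform.

    proteins maps isoform_id -> (gene_id, sequence).
    Returns isoform_id -> sequence for the retained isoforms.
    """
    best = {}
    for isoform_id, (gene_id, seq) in proteins.items():
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (isoform_id, seq)
    return {isoform_id: seq for isoform_id, seq in best.values()}

# Hypothetical predicted proteins with two isoforms for geneA.
proteins = {
    "geneA-RA": ("geneA", "MKT"),
    "geneA-RB": ("geneA", "MKTLLV"),
    "geneB-RA": ("geneB", "MSS"),
}

filtered = longest_isoform_per_gene(proteins)
print(filtered)
```

With isoforms collapsed, each contig's best hit points to a single entry per gene, so counts of gene models hit are not inflated by splice variants.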
Potential BLAST bias source
Wheat, C. W. (2012) SNP Discovery in Non-model Organisms Using 454 Next Generation Sequencing.
Metabolic Map Comparison
Bombyx mori with WGS vs. M. cinxia with 454 seq.
Upper estimate of 70% = 13,142 genes
[Figure: bar chart comparing M. cinxia and B. mori (y-axis 0 to 140) for glycolysis & related pathways, citric acid cycle, fatty acid metabolism, purine metabolism, pyrimidine metabolism, oxidative phosphorylation, ribosomal proteins, and the average. Wheat 2008]
Assessing de novo transcriptome assembly
[Figure: relative ortholog coverage (>1, =1, <1), focal species 454 vs. nearest WGS. Vera & Wheat et al. 2008 Mol. Ecol.; Hornett & Wheat et al. 2012]
Ex. 6 species assemblies with BLAST result insights
 454 EST libraries
 22 genes assessed for sequence coverage
Alternative splicing

> 80% in humans
> 40% in fruit flies
Most assemblers
 Designed for genomic data
 Don’t know how to handle splicing
 But Trinity can!
Transcriptome assembly: alternative splicing example
What effects will this have on a microarray?
Vera and Wheat et al.




Trinity
 Uses Illumina PE data
 Incorporates alternative splicing into its assembly
 Does a great job assembling full-length transcripts
 Successfully predicts many isoforms as well
Grabherr et al. 2011

Downside:
 Generates potentially incorrect isoforms
 Different contigs for each haplotype
  SNP by splicing event
Can cluster these results, possibly using CAP3 software for consensus and SNP calling
Grabherr et al. 2011
Calling SNPs




Many programs do this now
Each sequencing method has its own characteristic errors
Best to use SNP calls with > 2 reads for the minor allele to ensure validity
Generate consensus sequences with SNP calls as template for probe design
Know which region of a probe is sensitive to SNP/indel variation ... Agilent probes are robust!
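The minor-allele rule and the consensus step can be sketched together: call a SNP only when the second most common base is supported by more than 2 reads, and write it into the consensus as an IUPAC ambiguity code. This is a minimal illustration, not any particular SNP caller. It assumes a pileup given as one string of observed bases per position, handles only biallelic A/C/G/T sites, and ignores indels and base qualities; the function names and the example pileup are hypothetical.

```python
from collections import Counter

# Standard IUPAC codes for two-base ambiguities.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def call_site(bases, min_minor_reads=3):
    """Return the major base, or an IUPAC code if the minor allele
    is seen in at least min_minor_reads reads (i.e. > 2 by default)."""
    counts = Counter(bases).most_common()
    major = counts[0][0]
    if len(counts) > 1 and counts[1][1] >= min_minor_reads:
        return IUPAC[frozenset((major, counts[1][0]))]
    return major

def consensus(pileup, min_minor_reads=3):
    """Build a consensus string from per-position base observations."""
    return "".join(call_site(column, min_minor_reads) for column in pileup)

# Hypothetical pileup: one string of read bases per reference position.
pileup = ["AAAAAAAA", "CCCCCTTT", "GGGGGGGA", "TTTT"]
print(consensus(pileup))
```

In this example the single A at the third position is treated as a likely sequencing error, while the C/T site with three supporting reads is kept as a candidate SNP for probe design.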
SNP calling
Many different methods and criteria.
Just because it's published doesn’t make it ideal for you
Wheat, C. W. (2012) SNP Discovery in Non-model Organisms Using 454 Next Generation Sequencing.
Choosing Probes


Binding performance
SNPs, indels, alternative splicing
 Avoid them?
 Use them, via tiling probes?
All genes or just annotated ones?
3’ UTR end or tiling across whole gene
Recommend
 Technical replicates within array
 Run a test array to assess design
 Combination of the above?
Potential example



Only genes / contigs with annotations
Probes in triplicate
Tiled across entire gene
 Covering SNPs, indels, alt. splicing sites
Initial array designed, printed, and tested with several different RNA pools to look at probe hybridization performance
Full experimental set of arrays ordered + 20%
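Tiling probes across a gene and noting which probes overlap known variants can be sketched as follows. The 60-base probe length and 20-base step are assumptions chosen for illustration, not values from this lecture, and the sequence and SNP positions are hypothetical. Real probe design also needs checks such as melting temperature and cross-hybridization, which are not attempted here.

```python
def tile_probes(seq, probe_len=60, step=20):
    """Return (start, probe sequence) pairs tiled across seq (0-based starts)."""
    last_start = len(seq) - probe_len
    return [(s, seq[s:s + probe_len]) for s in range(0, last_start + 1, step)]

def variants_per_probe(probes, variant_positions, probe_len=60):
    """Map each probe start to the variant positions falling inside that probe."""
    return {
        start: [p for p in variant_positions if start <= p < start + probe_len]
        for start, _ in probes
    }

# Hypothetical 100 bp consensus sequence with SNPs at 0-based positions 10 and 55.
gene_seq = "ACGT" * 25
snp_positions = [10, 55]

probes = tile_probes(gene_seq)
overlaps = variants_per_probe(probes, snp_positions)
print([start for start, _ in probes])
print(overlaps)
```

Knowing which probes cover a variant lets you either avoid those probes or keep them deliberately and interpret their signal with the variant in mind.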
Challenge: Bioinformatics

Annotation of fragmented data
 Multiple contigs may belong to the same gene
 Unannotated sequences (novel coding, UTR, junk?)
How to conduct statistical analysis of the fragmented data?
 Combine results, pick best probes, etc.?
 Are outliers biological or technical?
  If biological, separate loci or splicing?
Unannotated probes with significant results
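One way to combine results from several probes or contigs assigned to the same gene is to take a per-gene median and flag probes that deviate strongly from it for closer inspection. The sketch below is only one of many possible approaches. The probe values are hypothetical log2 ratios, and the deviation threshold of 1.0 is an arbitrary assumption.

```python
from statistics import median

def gene_level_signal(probe_values, probe_to_gene):
    """Median probe value per gene."""
    by_gene = {}
    for probe, value in probe_values.items():
        by_gene.setdefault(probe_to_gene[probe], []).append(value)
    return {gene: median(values) for gene, values in by_gene.items()}

def outlier_probes(probe_values, probe_to_gene, max_dev=1.0):
    """Probes whose value differs from their gene's median by more than max_dev."""
    signal = gene_level_signal(probe_values, probe_to_gene)
    return [probe for probe, value in probe_values.items()
            if abs(value - signal[probe_to_gene[probe]]) > max_dev]

# Hypothetical log2 expression ratios for five probes on two genes.
probe_values = {"p1": 1.0, "p2": 1.2, "p3": 3.5, "p4": -0.5, "p5": -0.4}
probe_to_gene = {"p1": "geneA", "p2": "geneA", "p3": "geneA",
                 "p4": "geneB", "p5": "geneB"}

print(gene_level_signal(probe_values, probe_to_gene))
print(outlier_probes(probe_values, probe_to_gene))
```

A flagged probe is a prompt to ask whether the deviation is technical or biological, for example a separate locus or a splice variant, rather than a reason to discard it automatically.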

Where to go?
What will change tomorrow







Read lengths and quality
Read lengths per DNA strand
Paired end fragment sizes
Parallelization
 Number of samples per run
Amount of starting material needed
Bioinformatic tools
RNA-Seq more common ……
What won’t change tomorrow




Need for good experimental questions & design
Biological realities
 Complications of finding the genes
 Expression
 Patterns of genetic variation
Need for validation (indep. & higher)
Limited annotation insights
Conclusion

Many methods and rationales for using some over others
Arrays work great, but will they take you where you want to go?
Analysis is the most challenging part, so work with datasets that will be similar to yours.
You need to decide what you want
Can you get answers from those that you want?
What software/programming skills do you need?
Collaboration helps for many things
Some references






Feldmeyer, B., C. W. Wheat, N. Krezdorn, B. Rotter and M. Pfenninger (2011) Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance. BMC Genomics 12:317.
Wheat, C. W. and H. Vogel (2011) Transcriptome sequencing goals, assembly and assessment. In V. Orgogozo and M. V. Rockman, eds. Molecular methods for evolutionary genetics. Humana Press, New York.
Wheat, C. W. (2008) Rapidly developing functional genomics in ecological model systems via 454 transcriptome sequencing. Genetica 138:433-451.
Hornett, E. A. and C. W. Wheat (2012) Quantitative RNA-Seq analysis in non-model species: assessing transcriptome assemblies as a scaffold and the utility of evolutionary divergent genomic reference species. BMC Genomics 13:361.
Grabherr, M. G. et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 29:644-652.
Kumar, S. and M. L. Blaxter (2010) Comparing de novo assemblers for 454 transcriptome data. BMC Genomics 11:571.
Many available on my website: www.christopherwheat.net
Some recommendations

Illumina sequencing, paired end, variable fragment size from 200-500 bp, unnormalized (but normalized is better).
Many individuals x tissues x treatments, etc., to reflect the experimental material
Assemble with Trinity, join isoforms and haplotypes into contigs using CAP3
Assess via BLAST to relevant species
Annotate dataset
Design probes for annotated genes, tiling when possible for SNPs, indels, and splicing
Consider running a test set of probes to assess.
Thanks
