Transcript Sequencing
Large-scale genome projects
• Sequencing DNA molecules in the Mb
size range
• All strategies employ the same
underlying principles:
Random Shotgun sequencing
Strategy
Libraries
Sequencing
Assembly
Closure
Annotation
Release
Genomic DNA
Shearing/Sonication
Subclone and Sequence
Shotgun reads
Assembly
Contigs
Finishing read
Finishing
Complete sequence
Nucleotide Database Growth
EMBL breakdown by organism
EMBL Release 65
Progress on Large Sequencing Projects
Strategies for sequencing
• How big can you go??
• Large-insert clones
• cosmids 30-40 kb
• BACs/PACs 50 - 100 kb
• Whole chromosomes
• Whole genomes
Genome size and sequencing strategies
[Figure: genomes placed on a log10 genome size (Mb) axis from 0 to 4, alongside the sequencing strategies]
• Genomes: H. sapiens (3000 Mb), D. melanogaster (170 Mb), C. elegans (100 Mb), P. falciparum (30 Mb), S. cerevisiae (14 Mb), E. coli (4 Mb)
• Strategies: Whole Genome Shotgun (WGS), Clone-by-clone, Whole Chromosome Shotgun (WCS), Whole Genome Shotgun (WGS) with clone 'skims'
Strategies for sequencing
• Size and GC composition of genome
• Volume of data
• Ease of cloning
• Ease of sequencing
• Genome complexity
• dispersed repetitive sequence
• telomeres & centromeres
• Politics/Funding
Strategies: Clone by Clone
• Simple (0.5 - 2 K reads)
• Few problems with repeats
• Relatively simple informatics
• Scalability
• Quality of physical map
• Fingerprint / STS maps
• End sequencing
Strategies: Whole Chromosome Shotgun (WCS)
• Requires chromosome isolation
• Moderate complexity (tens of thousands of reads)
• Problems with repeats
• Complex informatics
• Inefficient in isolation
• Quality of physical map
• Skims of mapped clones
Strategies: Whole Genome Shotgun (WGS)
• Moderate to high complexity (tens to hundreds of thousands of reads)
• Problems with repeats
• Complex informatics
• Quality of physical map
• Fingerprint map
• STS markers
• End-sequences
• Skims of mapped clones
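The read counts quoted for the three strategies follow from simple coverage arithmetic. The sketch below assumes 500-base reads and 8-fold redundancy; both figures are illustrative assumptions, not values from the slides, and the 1 Mb chromosome is a hypothetical target. It also uses the Lander-Waterman estimate that a fraction e^-c of bases is left unsampled at average coverage c.

```python
import math

def reads_needed(target_bp, coverage=8.0, read_len=500):
    """Number of reads giving the requested average coverage of a target."""
    return math.ceil(target_bp * coverage / read_len)

def fraction_uncovered(coverage):
    """Lander-Waterman estimate of the fraction of bases hit by no read."""
    return math.exp(-coverage)

# Target sizes: the cosmid and E. coli figures come from the slides,
# the 1 Mb chromosome is a made-up example.
targets = {
    "cosmid (40 kb)": 40_000,
    "chromosome (1 Mb, hypothetical)": 1_000_000,
    "E. coli genome (4 Mb)": 4_000_000,
}
for name, size in targets.items():
    print(f"{name}: {reads_needed(size)} reads")
print(f"fraction uncovered at 8x: {fraction_uncovered(8.0):.5f}")
```

Under these assumptions a 40 kb cosmid needs 640 reads, a 1 Mb chromosome 16,000 and a 4 Mb genome 64,000, which is consistent with the ranges quoted for clone-by-clone, WCS and WGS projects.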
Sequencing my genome
Politics
Production
Finishing
Annotation
TIME MONEY
What do you get?
DATA!!, DATA!!, and more DATA!!
• Sequence
• incomplete vs. complete
• First-pass annotation
• Gene discovery
• Full annotation
• A starting point for research
Genome annotation is central to functional genomics
ORFeome based functional genomics
RNAi phenotypes
Gene Knockout
Expression Microarray
Sequencing
• Library construction
• Colony picking
• DNA preparation
• Sequencing reactions
• Electrophoresis
• Tracking/Base calling
Libraries
• Essentially Sub-cloning
• Generation of small insert libraries in a well
characterised vector.
• Ease of propagation
• Ease of DNA purification
• e.g. pUC18, M13
Libraries - testing
• Simple concepts
• Insert/Vector ratio
• Real data
• Insert size
• Sequence ….
• Simple analysis
Sequence generation
• Pick colonies
• Template preparation
• Sequence reactions
• Standard terminator chemistry
• pUC libraries sequenced with forward and
reverse primers
Sequence generation
• Electrophoresis of products
• Old style - slab gels, 32 > 64 > 96 lanes
• New style - capillary gels, 96 lanes
• Transfer of gel image to UNIX
• Sequencing machines use a slave Mac/PC
• Move data to centralised storage area for
processing
Gel image processing
• Light-to-Dye estimation
• Lane tracking
• Lane editing
• Trace extraction
• Trace standardisation
• Mobility correction
• Background subtraction
Pre-processing
• Base calling using Phred
• modifies SCF file
• Quality clipping
• Vector clipping
• Sequencing vector
• Cloning vector
• Screen for contaminants
• Feature mark up (repeats/transposons)
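Quality clipping can be sketched as finding the best-scoring stretch of a read. The version below is an illustrative assumption, not Phred's own procedure: each base scores (cutoff minus error probability), where the error probability for a Phred quality Q is 10^(-Q/10), and the maximal-scoring contiguous segment is kept. The function names and the 0.05 cutoff are hypothetical.

```python
def error_prob(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)

def quality_clip(quals, cutoff=0.05):
    """Return (start, end) of the best-scoring segment, end exclusive.

    Each base scores cutoff - P(error); the maximal-sum contiguous run
    is found in a single pass (Kadane's algorithm).
    Returns (0, 0) if no base beats the cutoff.
    """
    best_sum, best = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(quals):
        run_sum += cutoff - error_prob(q)
        if run_sum <= 0:
            # the run so far is not worth keeping; restart after this base
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best

# Made-up quality values: poor at both ends, good in the middle
quals = [4, 6, 8, 30, 35, 40, 40, 38, 32, 9, 5, 4]
start, end = quality_clip(quals)
print(start, end)   # bases quals[start:end] are retained
```

Vector clipping would then trim the retained interval further wherever it overlaps a match to the sequencing or cloning vector.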
Finishing
• Assembly: Process of assembling raw single-pass
reads into contiguous consensus sequence
• Closure: Process of ordering and merging
consensus sequences into a single contiguous
sequence
• Finished is defined as sequenced on both strands
using multiple clones. In the absence of multiple
clones the clone must be sequenced with multiple
chemistries. The overall error rate is estimated at
less than 1 error per 10 kb
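The error-rate part of this definition can be checked against consensus quality values. The sketch below assumes Phred-scaled consensus qualities, where Q = -10 log10(P), so that one error per 10 kb corresponds to an average error probability of 10^-4, i.e. Q40. The helper names and the example data are hypothetical.

```python
import math

def expected_errors(quals):
    """Expected number of wrong bases: sum of per-base error probabilities."""
    return sum(10 ** (-q / 10) for q in quals)

def errors_per_10kb(quals):
    """Scale the expected error count to a 10 kb window."""
    return expected_errors(quals) * 10_000 / len(quals)

def meets_finished_standard(quals, max_per_10kb=1.0):
    """True if the estimated rate is below 1 error per 10 kb."""
    return errors_per_10kb(quals) < max_per_10kb

# Quality value equivalent to 1 error in 10 kb
print(-10 * math.log10(1 / 10_000))

# Made-up consensus: mostly Q50 with ten weak Q20 bases
consensus = [50] * 9_990 + [20] * 10
print(errors_per_10kb(consensus), meets_finished_standard(consensus))
```

The example shows why finishing targets low-quality regions: the ten Q20 bases contribute about as much expected error as the other 9,990 bases combined.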
Genome Assembly
• Pre-assembly
• Assembly
• Automated appraisal
• Manual review
Pre-Assembly
• Convert to CAF format
• flatfile text format
• choice of assembler
• choice of post-assembly modules
• choice of assembly editor
www.sanger.ac.uk/Software/CAF
Assembly
• Assemble using Phrap
• Read fasta & quality scores from CAF file
• Merge existing Phrap .ace file as necessary
• Adjust clipping
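To make the idea of assembly concrete, here is a toy greedy overlap-merge assembler. It is not how Phrap works: it ignores quality scores, assumes error-free reads from a single strand, and only accepts exact suffix-prefix overlaps. All names and the example reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))          # drop exact duplicates
    # drop reads wholly contained in another read
    contigs = [r for r in contigs
               if not any(r != s and r in s for s in contigs)]
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs                        # no overlaps left
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)

reads = ["ATGGCGTAC", "GTACGTTAG", "TTAGCCA", "CCCGGG"]
print(greedy_assemble(reads))
```

Three of the reads merge into one contig and the fourth stays separate, which is the situation closure then has to resolve. A greedy approach like this mis-joins at repeats, one reason real assemblers weigh quality values and alignment scores.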
Assembly appraisal
• auto-edit
• removes 70% of read discrepancies
• Remove cloning vector
• Mark up sequence features
• finish
• Identify low-quality regions
• Cover using ‘re-runs’ and ‘long-runs’
• Compare with current databases
• plate contamination
Manual Assembly appraisal
• Use a sequence editor (GAP/consed)
• Tools to identify Internal joins
• Tools to identify and import data from
overlapping projects
• Tools to check failed or mis-assembled
reads for inclusion in project
Manual editing
• Sanger uses a 100% edit strategy
• Where additional data is required:
• Check clipping
• Additional sequencing
• Template / Primer / Chemistry
• Assemble new data into project
• GAP4 Auto-assemble
• Repeat whole process
Manual Quality Checks
• Force annotation tag consistency
• All unedited data is re-assembled using Phrap
• All high-quality discrepancies are reviewed
• Confirm restriction digest (clones)
• Check for inverted repeats
• Manually check:
• Areas of high-density edits
• Areas with no supporting unedited data
• Areas of low read coverage
Gap closure
• Read pairs
• PCR reactions (long-range / combinatorial)
• Small-insert libraries
• Transposon-insertion libraries
Gap closure - contig ordering
• Read pair consistency
• STS mapping
• Physical mapping
• Genetic mapping
• Optical mapping
• Large-insert clone
• skims
• end-sequencing
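Read-pair consistency can be illustrated with a small contig-ordering sketch. It assumes each read has already been placed in a contig, counts clone mates that land in different contigs, and chains contigs along the best-supported links. The data layout, the names and the minimum of two supporting pairs are all assumptions; orientation and insert-size checks are left out.

```python
from collections import Counter

def contig_links(read_to_contig, pairs):
    """Count read pairs whose two mates sit in different contigs."""
    links = Counter()
    for fwd, rev in pairs:
        a, b = read_to_contig.get(fwd), read_to_contig.get(rev)
        if a is not None and b is not None and a != b:
            links[frozenset((a, b))] += 1
    return links

def order_contigs(links, min_support=2):
    """Chain contigs along well-supported links.

    Returns one ordered chain, starting from the alphabetically
    first chain end, or [] if no chain can be built.
    """
    neighbours = {}
    ranked = sorted(links.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    for pair, n in ranked:
        if n < min_support:
            continue
        a, b = sorted(pair)
        # in a linear order a contig has at most two neighbours
        if len(neighbours.get(a, [])) < 2 and len(neighbours.get(b, [])) < 2:
            neighbours.setdefault(a, []).append(b)
            neighbours.setdefault(b, []).append(a)
    ends = sorted(c for c, nb in neighbours.items() if len(nb) == 1)
    if not ends:
        return []
    order, prev, cur = [], None, ends[0]
    while cur is not None:
        order.append(cur)
        nxt = [n for n in neighbours[cur] if n != prev]
        prev, cur = cur, (nxt[0] if nxt else None)
    return order

# Hypothetical placements: read name -> contig
read_to_contig = {
    "r1.f": "c1", "r1.r": "c2", "r2.f": "c1", "r2.r": "c2",
    "r3.f": "c2", "r3.r": "c3", "r4.f": "c2", "r4.r": "c3",
    "r5.f": "c1", "r5.r": "c3", "r6.f": "c1", "r6.r": "c1",
}
pairs = [(f"r{i}.f", f"r{i}.r") for i in range(1, 7)]
links = contig_links(read_to_contig, pairs)
print(order_contigs(links))
```

The single pair linking c1 to c3 falls below the support threshold and is ignored, which is the kind of inconsistent read pair that manual review would examine.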
Annotation
• DNA features (repeats/similarities)
• Gene finding
• Peptide features
• Initial role assignment
• Others - regulatory regions
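Annotation of eukaryotic genomes
[Figure: the path from genomic DNA to function, with the annotation approach applied at each stage]
• Genomic DNA > transcription > Unprocessed RNA > RNA processing > Mature mRNA (5' cap, poly-A tail) > translation > Nascent polypeptide > folding > Active enzyme > Function (Reactant A > Product B)
• Annotation approaches: ab initio gene prediction, comparative gene prediction, functional identification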
Genome analysis overview: C.elegans
DNA features
• Similarity features
• mapping repeats
• simple tandem and inverted
• repeat families
• mapping DNA similarities
• EST/mRNAs in eukaryotes
• Duplications
• RNAs
• mapping peptide similarities
• protein similarities
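Simple tandem repeats are the easiest of these DNA features to map. The sketch below uses a regular expression with a back-reference to find perfect tandem repeats only. The unit-length range and minimum copy number are assumptions, and dedicated repeat-finding tools also handle imperfect copies.

```python
import re

def tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Find perfect tandem repeats; returns (start, end, unit, copies).

    The lazy quantifier makes the regex prefer the shortest repeat unit,
    and the back-reference \\1 requires further exact copies of it.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        copies = (m.end() - m.start()) // len(unit)
        hits.append((m.start(), m.end(), unit, copies))
    return hits

# Made-up sequence containing (AC)5 and (GAT)3
print(tandem_repeats("GGCTACACACACACCCGATGATGATGCC"))
```

Coordinates are 0-based with an exclusive end. Because a tandem repeat can be read in any rotation of its unit, the reported unit depends on where the scan first finds a match.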
Gene finding
• ORF finding (simple but messy)
• ab initio prediction
• Measures of codon bias
• Simple statistical frequencies
• Comparative prediction
• Using similarity data
• Using cross-species similarities
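ORF finding is the simplest of these approaches and is easy to sketch. The example below scans all six reading frames for stretches running from an ATG to the next in-frame stop codon. The standard stop codons are used; the function names and the minimum-length cutoff are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse-complement a DNA string of A, C, G and T."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_codons):
    """Yield (start, end) of ATG..stop ORFs on one strand, end exclusive.

    min_codons counts the codons before the stop, including the ATG.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None

def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, sequence) for ORFs on both strands.

    Coordinates are relative to the strand that was scanned.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for start, end in orfs_in_strand(s, min_codons):
            found.append((strand, start, end, s[start:end]))
    return found

dna = "CCATGAAATTTGGGTAACC"   # made-up sequence with one short ORF
for orf in find_orfs(dna):
    print(orf)
```

The messiness shows up on real data: with a low length cutoff many spurious ORFs are reported, and in eukaryotes introns break the assumption of one uninterrupted frame, which is why ab initio and comparative prediction are needed.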
Peptide features
• Peptide features
• low-complexity regions
• trans-membrane regions
• structural information (coiled-coil)
• Similarities and alignments
• Protein families (InterPro/COGS)
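As one example of a peptide feature, low-complexity regions can be flagged with a sliding-window Shannon entropy. This is a simplified stand-in for dedicated tools; the window size, the entropy threshold, the function names and the example peptide are all arbitrary assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def low_complexity_windows(peptide, window=12, max_entropy=2.0):
    """Start positions of windows whose entropy is below the threshold."""
    return [i for i in range(len(peptide) - window + 1)
            if shannon_entropy(peptide[i:i + window]) < max_entropy]

def merge_windows(starts, window=12):
    """Merge overlapping windows into (start, end) regions, end exclusive."""
    regions = []
    for s in starts:
        if regions and s <= regions[-1][1]:
            regions[-1] = (regions[-1][0], s + window)
        else:
            regions.append((s, s + window))
    return regions

# Made-up peptide with a glutamine-rich stretch in the middle
peptide = ("MKTAYIAKQRQISFVKSHFSRQ"
           + "QQQQQQQQQQQQSQQQ"
           + "LEERLGLIEVQAPILSRVGDGT")
print(merge_windows(low_complexity_windows(peptide)))
```

Masking such regions before similarity searches avoids matches driven by composition alone rather than by shared ancestry.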
Initial role assignment
• Simple attempt to describe the
functional identity of a peptide
• Uses data from:
• peptide similarities
• protein families
• Vital for data mining
• Large number of predicted genes remain
hypothetical or unknown
Other regulatory features
• Ribosomal binding sites
• Promoter regions
Data Release
• DNA release
• Unfinished
• Finished
• Nucleotide databases
• GENBANK/EMBL/DDBJ
• Peptide databases
• SWISSPROT/TREMBL/GENPEPT
• Others