Gene-Boosted Assembly of a Novel Bacterial Genome from
Harry Presman
Overview
Motivation
Assembly
Results
Advantages/Limitations
Motivation
Next-generation sequencers produce short
read lengths
Useful for polymorphism discovery
Difficult to assemble whole genomes
Current assembly algorithms produce
highly fragmented results
Sequencing P. aeruginosa (PAb1)
Source of common in-hospital infections
Chosen due to available comparators,
PAO1 and PA14
8,627,900 shotgun reads (Solexa)
Assembly
Step 1: AMOScmp
Comparative assembler
Uses MUMmer
○ Alignment system based on suffix trees
Referenced in “Comparative Genome
Assembly”
PA14 – 2053 contigs
PAO1 – 2797 contigs
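A minimal sketch of the anchoring idea behind comparative assembly: find substrings that occur exactly once in both the reference and the query, then use them as alignment anchors. MUMmer does this with suffix trees and maximal unique matches; the fixed-length k-mer lookup below is a simplified stand-in, and the function name and parameters are hypothetical.

```python
from collections import Counter


def unique_kmer_anchors(reference, query, k):
    """Return sorted (ref_pos, query_pos, kmer) for k-mers unique in both sequences.

    Simplified stand-in for maximal unique matches: fixed length k,
    no extension, no suffix tree.
    """
    def unique_positions(seq):
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        counts = Counter(kmers)
        # Keep only k-mers that occur exactly once in this sequence.
        return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}

    ref_pos = unique_positions(reference)
    qry_pos = unique_positions(query)
    shared = ref_pos.keys() & qry_pos.keys()
    return sorted((ref_pos[m], qry_pos[m], m) for m in shared)


if __name__ == "__main__":
    # Toy sequences, not real Pseudomonas data.
    print(unique_kmer_anchors("ACGTTGCA", "TTGC", 4))
```

Repeated k-mers are dropped on purpose: a repeat cannot place a read unambiguously on the reference.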
Assembly
Step 2: multiple sequence alignment
Align PAO1 and PA14 assemblies
Use Minimus to fill gaps with contigs
○ AMOS component for small data sets
Re-map reads using AMOScmp to clean
assembly
Closed 203 gaps
Assembly
Step 3: gene-boosted assembly
University of Maryland annotation pipeline
○ Based on BLAST and Glimmer
Protein-coding genes used to fill gaps
○ Identify genes at contig edges and gaps
○ Extract AA sequences
○ tblastn identifies potential filler reads
○ ABBA assembled reads into gaps
Closed 185 gaps
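The read-recruitment step above can be sketched as follows: translate each read in all six frames and keep reads whose translation matches part of the amino-acid sequence spanning a gap. Exact substring matching is used here as a crude stand-in for tblastn, which tolerates mismatches and scores alignments. All names and the min_len threshold are assumptions for illustration.

```python
from itertools import product

# Standard genetic code, built in TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(read):
    """Three frames on each strand."""
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


def candidate_filler_reads(protein, reads, min_len=5):
    """Reads with at least one frame that matches the protein exactly (hypothetical filter)."""
    hits = []
    for read in reads:
        for peptide in six_frame_translations(read):
            if len(peptide) >= min_len and peptide in protein:
                hits.append(read)
                break
    return hits


if __name__ == "__main__":
    protein = "MKTAYIAKQR"                   # toy protein, not from PA14
    forward = "ACTGCTTATATTGCTAAA"           # encodes TAYIAK
    reads = [forward, reverse_complement(forward), "GGGGGGGGGGGGGGGGGG"]
    print(candidate_filler_reads(protein, reads))
```

Reads recruited this way would then go to a local assembler for the gap; that step is not shown.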
Aside
Tested gene-boosted analysis alone
PAb1 assembled using PA14 proteins
96% of PAb1 proteins assembled using
only this method
Lacks global genome structure
information
Assembly
Step 4: clean up
SSAKE
“Short Sequence Assembly by K-mer search
and 3’ read Extension”
Edena
“Exact DE Novo Assembler”
Velvet
Closed 46 gaps
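A toy illustration of the 3' read-extension idea named in the SSAKE expansion above: starting from a seed, repeatedly append the unused read whose prefix has the longest overlap with the contig's 3' end. This is a greedy sketch under simplifying assumptions (exact overlaps, forward strand only, no consensus voting). It is not SSAKE's actual implementation, and the names are hypothetical.

```python
def extend_3prime(seed, reads, min_overlap):
    """Greedily extend seed at its 3' end using exact prefix/suffix overlaps."""
    contig = seed
    unused = set(reads)
    unused.discard(seed)
    while True:
        best = None  # (overlap_length, read)
        for read in sorted(unused):  # sorted for deterministic tie-breaking
            max_olen = min(len(read) - 1, len(contig))
            for olen in range(max_olen, min_overlap - 1, -1):
                if contig.endswith(read[:olen]):
                    if best is None or olen > best[0]:
                        best = (olen, read)
                    break  # longest overlap for this read found
        if best is None:
            return contig  # no read extends the contig any further
        olen, read = best
        contig += read[olen:]  # append only the non-overlapping tail
        unused.discard(read)


if __name__ == "__main__":
    reads = ["GTACGG", "CGGTTA", "TTTTTT"]  # toy reads
    print(extend_3prime("ACGTAC", reads, min_overlap=3))
```

Each read is used at most once, so the loop always terminates.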
Results
76 contigs containing 6,290,005 bp
94% of bases in single scaffold
5602 protein-coding genes identified
Error rate per read = 1.04%
Error rate at coverage > 20X is zero
Slight bias toward high gene coverage
regions
Results
SNP analysis
Aligned PA14 and PAb1
5,537,508/5,568,550 bp agreed
1157/5,568,550 possible sequence errors
187/1104 indels in error
Accuracy of assembly: > 99.97%
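The figures above can be checked with a few lines of arithmetic. The slide does not say how the accuracy was derived; the sketch below assumes it is one minus the fraction of aligned bases flagged as possible sequence errors.

```python
aligned_bases = 5_568_550
agreeing_bases = 5_537_508
possible_errors = 1_157

# Positions where PA14 and PAb1 differ (true polymorphisms plus any errors).
differing_bases = aligned_bases - agreeing_bases

# Fraction of aligned bases on which the two genomes agree.
identity = agreeing_bases / aligned_bases

# Assumed definition: accuracy = 1 - possible errors / aligned bases.
accuracy = 1 - possible_errors / aligned_bases

print(f"differing bases: {differing_bases}")
print(f"identity: {identity:.4%}")
print(f"accuracy: {accuracy:.4%}")
```

Under that assumption the accuracy comes out just above the 99.97% quoted on the slide.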
Advantages/Limitations
Requires related genomes and protein
sequences
GenBank contains > 650 microbial genomes
Genome size should not matter
High speed and low cost
¼ of a single Solexa sequencing run in this
case
Thank You
Questions?