Gene-Boosted Assembly of a Novel Bacterial Genome from


Harry Presman
Overview

Motivation

Assembly

Results

Advantages/Limitations
Motivation

Next-generation sequencers produce short
read lengths

Useful for polymorphism discovery

Difficult to assemble whole genomes

Current assembly algorithms produce
highly fragmented results
Sequencing P. aeruginosa (PAb1)

Source of common in-hospital infections

Chosen due to available comparators,
PAO1 and PA14

8,627,900 shotgun reads (Solexa)
Assembly

Step 1: AMOScmp
• Comparative assembler
• Uses MUMmer
○ Alignment system based on suffix trees
• Described in “Comparative Genome
Assembly”
• PA14 as reference – 2053 contigs
• PAO1 as reference – 2797 contigs
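The idea behind a comparative assembler can be sketched in a few lines: place each read on a related reference genome, then merge overlapping placements into contigs. Wherever no read maps, a gap separates two contigs, which is why thousands of contigs can result. This is only an illustration, not AMOScmp itself: exact matching with str.find stands in for the MUMmer alignment, and the names place_reads and merge_into_contigs are hypothetical.

```python
def place_reads(reference, reads):
    """Return sorted (start, end) placements of reads on the reference.

    Assumption: exact matching replaces a real aligner such as MUMmer,
    and only the first occurrence of each read is used.
    """
    placements = []
    for read in reads:
        pos = reference.find(read)
        if pos != -1:  # reads that do not match are left unplaced
            placements.append((pos, pos + len(read)))
    return sorted(placements)


def merge_into_contigs(placements):
    """Merge overlapping placements into contig intervals.

    Expects placements sorted by start position.
    """
    contigs = []
    for start, end in placements:
        if contigs and start < contigs[-1][1]:
            # Overlaps the current contig: extend it.
            contigs[-1][1] = max(contigs[-1][1], end)
        else:
            # No overlap: a gap separates this read from the last contig.
            contigs.append([start, end])
    return [tuple(c) for c in contigs]


if __name__ == "__main__":
    reference = "ACGTTGCAAGCTTAGGCATCGATCCGTAAG"
    reads = [reference[0:8], reference[5:13], reference[20:28], "TTTTTTTT"]
    print(merge_into_contigs(place_reads(reference, reads)))
```

In the toy run, the first two reads overlap and form one contig, the third forms a second contig, and the uncovered stretch between them is the kind of gap the later steps try to close.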
Assembly

Step 2: multiple sequence alignment
• Align PAO1 and PA14 assemblies
• Use Minimus to fill gaps with contigs
○ AMOS component for small data sets
• Re-map reads using AMOScmp to clean
assembly
• Closed 203 gaps
Assembly

Step 3: gene-boosted assembly
• University of Maryland annotation pipeline
○ Based on BLAST and Glimmer
• Protein-coding genes used to fill gaps
○ Identify genes at contig edges and gaps
○ Extract AA sequences
○ tblastn identifies potential filler reads
○ ABBA assembles reads into gaps
• Closed 185 gaps
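The read-recruitment step above can be illustrated with a toy translated search: translate each read in all six frames and keep the reads that share a short peptide with the protein spanning a gap. This is a rough stand-in for tblastn, not its algorithm: it uses exact peptide matching with no scoring, and the function names and the min_aa threshold are hypothetical. The standard genetic code is used, which assigns the same amino acids as the bacterial table.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(read):
    """Yield the six translations of a read (3 offsets on each strand)."""
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            yield translate(strand[offset:])


def shares_peptide(peptide, protein, min_aa):
    """True if any window of min_aa residues occurs exactly in protein."""
    return any(
        peptide[i:i + min_aa] in protein
        for i in range(len(peptide) - min_aa + 1)
    )


def recruit_reads(protein, reads, min_aa=5):
    """Return reads with at least one frame matching the protein."""
    return [
        read for read in reads
        if any(shares_peptide(p, protein, min_aa) for p in six_frames(read))
    ]


if __name__ == "__main__":
    protein = "MKTAYIAKQR"
    gene = "ATGAAAACCGCGTATATTGCGAAACAGCGT"
    reads = [gene[3:24], reverse_complement(gene[7:28]), "G" * 21]
    print(recruit_reads(protein, reads))
```

Matching on a short window rather than the whole translated read lets reads that overhang the end of a gene still be recruited, which is where gap-filling reads would be expected to fall.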
Aside

Tested gene-boosted analysis alone

PAb1 assembled using PA14 proteins

96% of PAb1 proteins assembled using
only this method

Lacks global genome structure
information
Assembly
Step 4: Clean up
• SSAKE
○ “Short Sequence Assembly by K-mer search
and 3’ read Extension”
• Edena
○ “Exact DE Novo Assembler”
• Velvet
• Closed 46 gaps
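To give a feel for what "3’ read extension" means, here is a minimal greedy sketch: starting from a seed, repeatedly append the unused read whose prefix has the longest exact overlap with the current 3’ end. It is an illustration in the spirit of the SSAKE name only, not that tool's implementation. The function name extend_3prime and the min_overlap parameter are hypothetical, and sequencing errors and repeats are ignored.

```python
def extend_3prime(seed, reads, min_overlap=5):
    """Greedily extend seed at its 3' end using exact prefix overlaps."""
    contig = seed
    unused = sorted(set(reads) - {seed})  # sorted for deterministic ties
    while True:
        best_ov, best_read = 0, None
        for read in unused:
            # Longest usable overlap leaves at least one new base to add.
            longest = min(len(read) - 1, len(contig))
            for ov in range(longest, min_overlap - 1, -1):
                if contig.endswith(read[:ov]):
                    if ov > best_ov:
                        best_ov, best_read = ov, read
                    break  # the longest overlap for this read was found
        if best_read is None:
            return contig  # nothing overlaps the 3' end any more
        contig += best_read[best_ov:]
        unused.remove(best_read)


if __name__ == "__main__":
    genome = "ATGCGTACGTTAGCCGATAGGCTA"
    reads = [genome[i:i + 10] for i in (0, 4, 8, 12, 14)]
    print(extend_3prime(reads[0], reads) == genome)
```

Each round consumes one read and adds at least one base, so the loop always terminates. With real short reads, repeats would make the longest overlap ambiguous, which is one reason de novo results are fragmented.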

Results

76 contigs containing 6,290,005 bp
• 94% of bases in single scaffold
• 5602 protein-coding genes identified
• Error rate per read = 1.04%
• Error rate at coverage > 20X is zero
• Slight bias toward high gene coverage
regions
Results

SNP analysis
• Aligned PA14 and PAb1
• 5,537,508/5,568,550 bp agreed
• 1,157/5,568,550 possible sequence errors
• 187/1,104 indels in error
• Accuracy of assembly: > 99.97%
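The figures above can be checked with a few lines of arithmetic. The slide does not state how the accuracy was computed; this sketch assumes it is one minus the fraction of aligned bases flagged as possible sequence errors, and the variable names are illustrative.

```python
aligned_bp = 5_568_550       # bases aligned between PA14 and PAb1
agreeing_bp = 5_537_508      # aligned bases that agree
possible_errors = 1_157      # positions flagged as possible sequence errors
indels_total = 1_104
indels_in_error = 187

# Fraction of aligned bases that are identical between the two strains.
identity = agreeing_bp / aligned_bp

# Assumption: accuracy = 1 - (possible errors / aligned bases).
accuracy = 1 - possible_errors / aligned_bp

# Share of indels attributed to assembly error rather than real variation.
indel_error_fraction = indels_in_error / indels_total

print(f"identity: {identity:.4%}")
print(f"accuracy: {accuracy:.4%}")
print(f"indels in error: {indel_error_fraction:.1%}")
```

Under that assumption the accuracy comes out just above 99.97%, matching the slide. The remaining disagreements between the two strains (31,042 positions in total) are mostly real differences rather than errors.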
Advantages/Limitations

Requires related genomes and protein
sequences
• GenBank contains > 650 microbial genomes
Genome size should not matter
• High speed and low cost
• ¼ of a single Solexa sequencing run in this
case
Thank You
Questions?