
CompostBin : A DNA composition
based metagenomic binning algorithm
Sourav Chatterji*, Ichitaro Yamazaki,
Zhaojun Bai and Jonathan Eisen
UC Davis
[email protected]
Overview of Talk
 Metagenomics and the binning problem.
 CompostBin
The Microbial World
Exploring the Microbial World
 Culturing
 Majority of microbes currently unculturable.
 No ecological context.
 Molecular Surveys (e.g. 16S rRNA)
 “who is out there?”
 “what are they doing?”
Metagenomics
Interpreting Metagenomic Data
 Nature of Metagenomic Data
 Mosaic
 Intraspecies polymorphism
 Fragmentary
 New Sequencing Technologies
 Enormous amount of data
 Short Reads
Metagenomic Binning
Classification of sequences by taxa
Binning in Action
Glassy Winged Sharpshooter (Homalodisca coagulata).
Feeds on plant xylem (poor in organic nutrients).
Microbial Endosymbionts
Current Binning Methods
 Assembly
 Align with Reference Genome
 Database Search [MEGAN, BLAST]
 Phylogenetic Analysis
 DNA Composition [TETRA,Phylopythia]
Current Binning Methods
 Need closely related reference genomes.
 Poor performance on short fragments.
 Sanger sequence reads 500-1000 bp long.
 Current assembly methods unreliable
 Complex Communities Hard to Bin.
Overview of Talk
 Metagenomics and the binning problem.
 CompostBin
Genome Signatures
 Does genomic sequence from an organism have a
unique “signature” that distinguishes it from
genomic sequence of other organisms?
 Yes [Karlin et al. 1990s]
 What is the minimum length sequence that is
required to distinguish genomic sequence of one
organism from the genomic sequence of another
organism?
Imperfect World
 Horizontal Gene Transfer
 Recent Estimates [Ge et al. 2005]
 Varies between 0-6% of genes.
 Typically ~2%.
 But…
 Amelioration
DNA-composition metrics
The K-mer Frequency Metric
CompostBin uses hexamers
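A minimal sketch of how a hexamer frequency vector could be computed for one read. The function name `kmer_frequencies` and the choice to skip windows containing ambiguous bases are assumptions for illustration, not details taken from the talk. With K = 6 the vector has 4^6 = 4096 entries.

```python
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    Windows containing any other character (e.g. N) are skipped (assumption).
    """
    alphabet = "ACGT"
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    total = 0
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:
            counts[pos] += 1
            total += 1
    if total == 0:
        return [0.0] * len(index)
    return [c / total for c in counts]


hexamer_vector = kmer_frequencies("ACGT" * 200)  # a toy 800 bp "read"
```

Each read becomes one point in a 4096-dimensional space; this sketch does not merge reverse-complement k-mers, which a real pipeline might choose to do.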
DNA-composition metrics
 Working with K-mers for Binning.
 Curse of Dimensionality : O(4^K) independent
dimensions.
 Statistical noise increases with decreasing
fragment lengths.
 Project data into a lower dimensional space to
decrease noise.
 Principal Component Analysis.
PCA separates species
Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]
Effect of Skewed Relative Abundance
Abundance 1:1
Abundance 20:1
B. anthracis and L. monocytogenes
A Weighting Scheme
For each read, find overlap with other sequences
A Weighting Scheme
[Figure: example redundancy values 4, 5, 5, 3]
Calculate the redundancy of each position.
Weight is inverse of average redundancy.
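The weighting rule can be sketched as follows. The helper names and the convention that a read counts toward its own redundancy (so redundancy is never zero) are assumptions; the values 4, 5, 5, 3 from the slide are used only as sample input.

```python
def redundancy_from_overlaps(read_length, overlaps):
    """Per-position redundancy of one read.

    Each position starts at 1 (the read itself, an assumption) and gains 1
    for every overlapping read that covers it. `overlaps` holds half-open
    (start, end) intervals in the read's own coordinates.
    """
    redundancy = [1] * read_length
    for start, end in overlaps:
        for pos in range(max(0, start), min(read_length, end)):
            redundancy[pos] += 1
    return redundancy


def read_weight(redundancy):
    """Weight of a read: the inverse of its average per-position redundancy."""
    if not redundancy:
        raise ValueError("redundancy must contain at least one position")
    return 1.0 / (sum(redundancy) / len(redundancy))


weight = read_weight([4, 5, 5, 3])  # average redundancy 4.25
```

Reads from an abundant species overlap many other reads, so they receive small weights and no longer dominate the projection.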
Weighted PCA
 Calculate weighted mean µw :
\[ \mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i \]
 Calculate weighted covariance matrix Mw :
\[ M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T \]
 PCs are eigenvectors of Mw.
 Use first three PCs for further analysis.
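A minimal NumPy sketch of the weighted PCA steps listed above. It follows the slide's formulas as written (the weighted mean divides by N); the function name `weighted_pca` and its return values are assumptions for illustration. If the weights are not scaled to sum to N, dividing by the sum of the weights may be more appropriate.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the top eigenvectors of the weighted covariance.

    X: (N, d) array, one k-mer frequency vector per read.
    w: (N,) array of per-read weights.
    Returns (projected, components, mu_w).
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    # Weighted mean, dividing by N as on the slide.
    mu_w = (w[:, None] * X).sum(axis=0) / n
    centered = X - mu_w
    # M_w = sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    M_w = (w[:, None] * centered).T @ centered
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(M_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components, mu_w


X_demo = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
projected, components, mu_w = weighted_pca(X_demo, [1, 1, 1, 1], n_components=1)
```

With all weights equal to 1 this reduces to ordinary PCA on the unnormalized scatter matrix.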
Weighted PCA
separates species
PCA
Weighted PCA
B. anthracis and L. monocytogenes : 20:1
Unsupervised Classification?
Semi-Supervised Classification
 31 Marker Genes [courtesy Martin Wu]
 Omnipresent
 Relatively Immune to Lateral Gene Transfer
 Reads containing these marker genes can
be classified with high reliability.
Semi-supervised Classification
Use a semi-supervised version of the normalized cut algorithm
The Semi-supervised
Normalized Cut Algorithm
1. Calculate the K-nearest neighbor graph
from the point set.
2. Update graph with marker information.
o If two nodes are from the same species, add
an edge between them.
o If two nodes are from different species,
remove any edge between them.
3. Bisect the graph using the normalized-cut
algorithm.
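The three steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the graph is unweighted, the bisection uses the standard spectral relaxation of normalized cut (sign of the second-smallest eigenvector of the normalized Laplacian), and all function names are hypothetical.

```python
import numpy as np


def knn_graph(points, k):
    """Step 1: symmetric 0/1 adjacency matrix of the k-nearest-neighbor graph."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    adj = np.zeros((n, n))
    for i in range(n):
        adj[i, np.argsort(dists[i])[:k]] = 1.0
    return np.maximum(adj, adj.T)


def apply_markers(adj, labels):
    """Step 2: labels[i] is a species name for marker-bearing reads, else None."""
    adj = adj.copy()
    marked = [i for i, lab in enumerate(labels) if lab is not None]
    for a in marked:
        for b in marked:
            if a != b:
                # Same species: add an edge. Different species: remove it.
                adj[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    return adj


def normalized_cut_bisect(adj):
    """Step 3: boolean mask splitting the nodes into two groups."""
    degree = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap)
    # Second-smallest eigenvector, mapped back to the generalized problem.
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    return fiedler >= 0


def semi_supervised_bisect(points, labels, k):
    return normalized_cut_bisect(apply_markers(knn_graph(points, k), labels))


pts = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
mask = semi_supervised_bisect(pts, ["a", None, None, None, None, "b"], k=2)
```

Thresholding the eigenvector at zero is one common choice; searching for the threshold that minimizes the normalized-cut value is another. The dense distance matrix is only practical for small point sets.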
Generalization to multiple bins
Apply algorithm
recursively
Gluconobacter oxydans [0.61], Granulibacter
bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]
Generalization to multiple bins
Gluconobacter oxydans [0.61], Granulibacter
bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]
Testing
 Simulate Metagenomic Sequencing
 Sanger Reads
 Variables
 Number of species
 Relative abundance
 GC content
 Phylogenetic Diversity
 Test on a “real” dataset where answer is
well-established.
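A toy sketch of how such a simulation could be set up, varying the number of species, relative abundance, and GC content. Everything here is an assumption for illustration: the random genomes are stand-ins for real ones, reads are error-free, and the source genome of each read is chosen in proportion to abundance alone (ignoring genome length). The 500-1000 bp read length follows the Sanger range quoted earlier in the talk.

```python
import random


def random_genome(length, gc_content, rng):
    """Hypothetical stand-in for a real genome: i.i.d. bases with a target GC fraction."""
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def simulate_reads(genomes, abundances, n_reads, rng, min_len=500, max_len=1000):
    """Sample reads and return (species_name, read) pairs with known origin."""
    names = list(genomes)
    weights = [abundances[name] for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        genome = genomes[name]
        if len(genome) < min_len:
            raise ValueError("genome shorter than the minimum read length")
        length = rng.randint(min_len, min(max_len, len(genome)))
        start = rng.randint(0, len(genome) - length)
        reads.append((name, genome[start:start + length]))
    return reads


rng = random.Random(0)
genomes = {
    "species_A": random_genome(5000, 0.61, rng),
    "species_B": random_genome(5000, 0.35, rng),
}
reads = simulate_reads(genomes, {"species_A": 20, "species_B": 1}, 200, rng)
```

Because every read carries its true species label, the output of a binning method can be scored directly against it.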
Results
Conclusions/Future Directions
 Satisfactory performance
 No Training on Existing Genomes 
 Sanger Reads 
 Low number of Species 
 Future Work
 Holy Grail : Complex Communities
 Semi-supervised projection?
 Hybrid Assembly/Binning
Acknowledgements
UC Davis
 Jonathan Eisen
 Martin Wu
 Dongying Wu
 Ichitaro Yamazaki
 Amber Hartman
 Marcel Huntemann
UC Berkeley
 Lior Pachter
 Richard Karp
 Ambuj Tewari
 Narayanan Manikandan
Princeton University
 Simon Levin
 Josh Weitz
 Jonathan Dushoff