CompostBin: A DNA composition-based metagenomic binning algorithm
Sourav Chatterji*, Ichitaro Yamazaki,
Zhaojun Bai and Jonathan Eisen
UC Davis
[email protected]
Overview of Talk
Metagenomics and the binning problem.
CompostBin
The Microbial World
Exploring the Microbial World
Culturing
Majority of microbes currently unculturable.
No ecological context.
Molecular Surveys (e.g. 16S rRNA)
“who is out there?”
“what are they doing?”
Metagenomics
Interpreting Metagenomic Data
Nature of Metagenomic Data
Mosaic
Intraspecies polymorphism
Fragmentary
New Sequencing Technologies
Enormous amount of data
Short Reads
Metagenomic Binning
Classification of sequences by taxa
Binning in Action
Glassy Winged Sharpshooter (Homalodisca coagulata).
Feeds on plant xylem (poor in organic nutrients).
Microbial Endosymbionts
Current Binning Methods
Assembly
Align with Reference Genome
Database Search [MEGAN, BLAST]
Phylogenetic Analysis
DNA Composition [TETRA,Phylopythia]
Current Binning Methods
Need closely related reference genomes.
Poor performance on short fragments.
Sanger sequence reads 500-1000 bp long.
Current assembly methods unreliable
Complex Communities Hard to Bin.
Overview of Talk
Metagenomics and the binning problem.
CompostBin
Genome Signatures
Does the genomic sequence of an organism have a unique “signature” that distinguishes it from the genomic sequence of other organisms?
Yes [Karlin et al. 1990s]
What is the minimum sequence length required to distinguish the genomic sequence of one organism from that of another?
Imperfect World
Horizontal Gene Transfer
Recent Estimates [Ge et al. 2005]
Varies between 0-6% of genes.
Typically ~2%.
But…
Amelioration
DNA-composition metrics
The K-mer Frequency Metric
CompostBin uses hexamers
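A minimal sketch of the hexamer frequency vector for a single read. The function name `kmer_frequencies` is hypothetical. Skipping windows that contain ambiguous bases and normalizing by the total window count are assumptions, not details taken from the slides.

```python
from itertools import product

def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        # Assumption: windows containing N or other symbols are skipped.
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]
```

With k = 6 each read becomes a point in a 4096-dimensional space, which is the starting point for the dimensionality reduction discussed next.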
DNA-composition metrics
Working with K-mers for Binning.
Curse of Dimensionality: O(4^K) independent dimensions.
Statistical noise increases with decreasing
fragment lengths.
Project data into a lower dimensional space to
decrease noise.
Principal Component Analysis.
PCA separates species
Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]
Effect of Skewed Relative Abundance
Abundance 1:1
Abundance 20:1
B. anthracis and L. monocytogenes
A Weighting Scheme
For each read, find overlap with other sequences
A Weighting Scheme
[Figure: read overlap diagram with example values 4, 5, 5, 3]
Calculate the redundancy of each position.
Weight is inverse of average redundancy.
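The weighting rule can be sketched as follows, assuming the overlaps of a read with other reads are already known as intervals in read coordinates (finding them is a separate alignment step not shown here). The function name `read_weight` and the convention that a read counts itself once at every position are assumptions for illustration.

```python
def read_weight(read_length, overlaps):
    """Weight of a read = 1 / (average per-position redundancy).

    overlaps: list of (start, end) half-open intervals in read coordinates,
    one interval per overlapping read.
    """
    # Assumption: the read itself contributes 1 at every position,
    # so a read with no overlaps gets weight 1.
    redundancy = [1] * read_length
    for start, end in overlaps:
        for pos in range(max(0, start), min(read_length, end)):
            redundancy[pos] += 1
    average = sum(redundancy) / read_length
    return 1.0 / average
```

Reads from an abundant species overlap many other reads and so receive small weights, which counteracts the skew in relative abundance.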
Weighted PCA
Calculate the weighted mean μ_w:
$$\mu_w = \frac{\sum_{i=1}^{N} w_i X_i}{N}$$
Calculate the weighted covariance matrix M_w:
$$M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$$
PCs are eigenvectors of M_w.
Use first three PCs for further analysis.
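A minimal NumPy sketch of the weighted PCA steps above. The function name `weighted_pca` is hypothetical. Because the weighted mean divides by N, the sketch rescales the weights to sum to N; that rescaling is an assumption, not something stated on the slide.

```python
import numpy as np

def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the leading eigenvectors of the weighted covariance."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    N = X.shape[0]
    # Assumption: rescale weights so they sum to N, making mu_w a true weighted mean.
    w = w * N / w.sum()
    mu_w = (w[:, None] * X).sum(axis=0) / N
    centered = X - mu_w
    # M_w = sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    M_w = (w[:, None] * centered).T @ centered
    eigvals, eigvecs = np.linalg.eigh(M_w)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components, mu_w
```

Setting all weights to 1 recovers ordinary PCA, so the same function can be used to compare the two projections.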
Weighted PCA separates species
[Figure panels: PCA; Weighted PCA]
B. anthracis and L. monocytogenes: 20:1
Unsupervised Classification?
Semi-Supervised Classification
31 Marker Genes [courtesy Martin Wu]
Omnipresent
Relatively Immune to Lateral Gene Transfer
Reads containing these marker genes can
be classified with high reliability.
Semi-supervised Classification
Use a semi-supervised version of the normalized cut algorithm
The Semi-supervised Normalized Cut Algorithm
1. Calculate the K-nearest neighbor graph from the point set.
2. Update graph with marker information.
   o If two nodes are from the same species, add an edge between them.
   o If two nodes are from different species, remove any edge between them.
3. Bisect the graph using the normalized-cut algorithm.
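The three steps can be sketched in NumPy as below. This is an illustration under stated assumptions, not the CompostBin implementation: edges have unit weight, the bisection uses the standard spectral relaxation of normalized cut (second-smallest generalized eigenvector, thresholded at zero), and the name `semi_supervised_ncut` and its parameters are hypothetical.

```python
import numpy as np

def semi_supervised_ncut(points, labels, k=5):
    """Bisect points; labels[i] is a species id for marker reads, else None.

    Returns a boolean array giving the side of each point.
    """
    P = np.asarray(points, dtype=float)
    n = len(P)
    # Step 1: symmetric k-nearest-neighbor graph (assumption: unit edge weights).
    d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(d2, axis=1)[:, :k]
    A = np.zeros((n, n))
    for i in range(n):
        A[i, nearest[i]] = 1.0
    A = np.maximum(A, A.T)
    # Step 2: add edges within a species, remove edges between species.
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] is not None and labels[j] is not None:
                A[i, j] = A[j, i] = 1.0 if labels[i] == labels[j] else 0.0
    # Step 3: spectral relaxation of the normalized cut.
    deg = A.sum(axis=1)
    deg[deg == 0] = 1e-12  # guard against nodes isolated by edge removal
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)
    fiedler = vecs[:, 1] * d_inv_sqrt  # generalized eigenvector of (D - A) x = lambda D x
    return fiedler > 0
```

The sketch builds a dense n-by-n distance matrix, so it only suits small point sets. A graph that falls into several connected components would need separate handling, which is omitted here.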
Generalization to multiple bins
Apply algorithm recursively
Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]
Generalization to multiple bins
Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]
Testing
Simulate Metagenomic Sequencing
Sanger Reads
Variables
Number of species
Relative abundance
GC content
Phylogenetic Diversity
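A rough sketch of how such a simulation could be set up, varying relative abundance and GC content. Everything here is an illustrative assumption: random sequences stand in for real genomes, reads are error-free and unpaired, read lengths are uniform over the 500-1000 bp Sanger range, and the names `random_genome` and `simulate_reads` are hypothetical.

```python
import random

def random_genome(length, gc, rng):
    """Random stand-in genome with the requested GC fraction."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))

def simulate_reads(genomes, abundances, n_reads, rng, min_len=500, max_len=1000):
    """Sample (species, read) pairs from genomes at the given relative abundances.

    genomes: dict name -> sequence; abundances: dict name -> relative cell count.
    """
    names = list(genomes)
    # A read comes from a species with probability proportional to
    # abundance times genome length.
    probs = [abundances[name] * len(genomes[name]) for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=probs, k=1)[0]
        genome = genomes[name]
        length = min(rng.randint(min_len, max_len), len(genome))
        start = rng.randint(0, len(genome) - length)
        reads.append((name, genome[start:start + length]))
    return reads
```

Because each read carries its true species label, the binning output can be scored directly against the known answer.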
Test on a “real” dataset where answer is
well-established.
Results
Conclusions/Future Directions
Satisfactory performance
No Training on Existing Genomes
Sanger Reads
Low number of Species
Future Work
Holy Grail : Complex Communities
Semi-supervised projection?
Hybrid Assembly/Binning
Acknowledgements
UC Davis
Jonathan Eisen
Martin Wu
Dongying Wu
Ichitaro Yamazaki
Amber Hartman
Marcel Huntemann
UC Berkeley
Lior Pachter
Richard Karp
Ambuj Tewari
Narayanan Manikandan
Princeton University
Simon Levin
Josh Weitz
Jonathan Dushoff