MEGAN analysis of metagenomic data

Daniel H. Huson, Alexander F. Auch, Ji Qi, et al.
Genome Res. 2007
Early metagenomic studies

Known phylogenetic markers and subsequent sequencing of clones


Analysis of paired-end reads
Complete sequences of environmental fosmid and BAC clones


Environmental assemblies


Rough annotation of the metabolic capacity
Distinguish between discrete species and populations of closely related biotypes
Problem of using proven phylogenetic markers (ribosomal genes, coding sequences)

Slow-evolving genes: distinguish between species at large evolutionary distances
What is MEGAN?




Metagenome Analyzer (MEGAN)
Free software
Deviates from the analytical pattern of previous studies
Built on the statistical analysis of comparing random sequence intervals with unspecified phylogenetic properties against databases



Provides filters so the level of stringency can be adjusted later to an appropriate level
Laptop analysis


Depends on the related sequences in the databases
Comparison result (BLAST) -> laptop (MEGAN)
Graphical and statistical output
Pipeline




Compare against databases: BLAST
Compute and explore taxonomical content: NCBI taxonomy
Lowest common ancestor (LCA) algorithm
Data sets (Sargasso Sea, mammoth bone, short reads from E. coli K12 & B. bacteriovorus HD100)
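
The LCA step of the pipeline can be sketched as follows. This is only a minimal illustration, not MEGAN's actual code: the taxonomy is a hand-written toy child-to-parent map (PARENT, a hypothetical stand-in for the NCBI taxonomy), and the function names are made up for this example. A read whose hits fall on several taxa is placed on the deepest node shared by all of their lineages.

```python
def lineage(taxon, parent):
    """Return the path from the root down to `taxon`."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all `taxa` (None if `taxa` is empty)."""
    lineages = [lineage(t, parent) for t in taxa]
    lca = None
    # Walk all lineages in parallel from the root until they diverge.
    for nodes in zip(*lineages):
        if len(set(nodes)) != 1:
            break
        lca = nodes[0]
    return lca


# Toy taxonomy (assumption: heavily simplified, most ranks omitted).
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Escherichia coli": "Gammaproteobacteria",
    "Deltaproteobacteria": "Proteobacteria",
    "Bdellovibrio bacteriovorus": "Deltaproteobacteria",
}

hits = ["Escherichia coli", "Bdellovibrio bacteriovorus"]
print(lowest_common_ancestor(hits, PARENT))  # Proteobacteria
```

A read that hits only one taxon stays on that taxon, which is how species-specific genes lead to species-level assignments.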
What we can do with MEGAN




Species and strain identification through species-specific genes
Searching for species or taxa with the Find tool
Distribution of strains of a species
Underlying sequence alignments
Experiment 1

Sargasso Sea

Data set


Sanger sequencing
Samples 1-4 from DDBJ/EMBL/GenBank

BLASTX against NCBI-NR

10,000 reads from Sample 1
A pooled set of 10,000 reads randomly selected from Samples 2-4
1% no hits from Sample 1, <3% no hits from Samples 2-4
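
Drawing a pooled random subset, as done for Samples 2-4, might look like the sketch below. The slides do not say how the reads were drawn, so uniform sampling without replacement and the fixed seed are assumptions, and pooled_subsample is a hypothetical helper name.

```python
import random


def pooled_subsample(samples, n, seed=0):
    """Pool the reads of several samples and draw n of them without replacement."""
    pool = [read for reads in samples for read in reads]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(pool, n)


# Toy stand-ins for the reads of Samples 2-4 (read IDs only).
sample2 = [f"s2_read{i}" for i in range(50)]
sample3 = [f"s3_read{i}" for i in range(50)]
sample4 = [f"s4_read{i}" for i in range(50)]

subset = pooled_subsample([sample2, sample3, sample4], n=20)
print(len(subset))  # 20
```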
Filters



Min-score: bit-score threshold of 100
Top-percent: keep hits whose bit scores lie within 5% of the best score
Min-support: isolated assignments (hit by only one read) discarded
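
The three filters can be sketched as below. This is an illustration under assumptions, not MEGAN's implementation: hits are (taxon, bit score) pairs for one read, assignments map a read ID to its taxon, and min_support=2 is one reading of "hit by only one read". The function names are hypothetical.

```python
from collections import Counter


def filter_hits(hits, min_score=100.0, top_percent=5.0):
    """Apply min-score, then keep hits within top_percent of the best bit score."""
    kept = [(taxon, score) for taxon, score in hits if score >= min_score]
    if not kept:
        return []
    best = max(score for _, score in kept)
    cutoff = best * (1 - top_percent / 100.0)
    return [(taxon, score) for taxon, score in kept if score >= cutoff]


def apply_min_support(assignments, min_support=2):
    """Discard assignments to taxa that received fewer than min_support reads."""
    counts = Counter(assignments.values())
    return {read: taxon for read, taxon in assignments.items()
            if counts[taxon] >= min_support}


hits = [("A", 200.0), ("B", 195.0), ("C", 150.0), ("D", 90.0)]
print(filter_hits(hits))  # [('A', 200.0), ('B', 195.0)]

assignments = {"r1": "X", "r2": "X", "r3": "Y"}
print(apply_min_support(assignments))  # {'r1': 'X', 'r2': 'X'}
```

Because the filters act on stored BLAST results, the thresholds can be changed afterwards without rerunning the comparison.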
Analysis-Sargasso Sea data

1.66M reads, average length 818 bp, by Sanger sequencing

Species profile of 16 taxonomical groups

Environmental assemblies

By analyzing six specific phylogenetic markers

rRNA, RecA/RadA, HSP70, RpoB, EF-Tu, and EF-G
Result
• Sample 1
• ~83% of reads were assigned to taxa more specific than the kingdom level
• The majority (8298) were assigned to the bacterial group
• Samples 2-4
• ~59% of reads were assigned to taxa more specific than the kingdom level
• The majority (5709) were assigned to the bacterial group
• Alphaproteobacteria and Gammaproteobacteria dominate by a factor of 2-4 over the remaining 14 taxonomic groups
• Eukaryotes & Viruses: size filtering
• Archaea: maybe because there is 10 times as much bacterial sequence information in the public databases
• MEGAN vs. previous analysis (Venter et al. 2004)
• Specific assignment information: LCA
Result-cont.
• Averaged weighted percentage of the six phylogenetic markers for each of the 16 taxonomic groups
• Easily detect sampling bias between Sample 1 and pooled Samples 2-4
Experiment 2

Mammoth bone

Data set






Roche GS20 sequencing (sequencing-by-synthesis)
Sample from 1 g of mammoth bone, 28,000 years old
~300,000 reads, 95 bp
BLASTZ against genome sequences (elephant, human, dog)
45.4% of the reads are mammoth DNA; the others come from environmental organisms (bacteria, fungi, amoeba, nematodes)
BLASTX against NCBI-NR for environmental sequences

Filters: bit-score threshold 30, discard isolated assignments (2086 reads filtered out)
Result


19841 reads assigned to Eukaryota, of which 7969 to Gnathostomata
16972: Bacteria, 761: Archaea, 152: Viruses
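
To put the four counts in proportion, a quick calculation is sketched below. It assumes the four groups are disjoint and uses their sum as the denominator, which is not the total number of reads in the data set.

```python
# Read counts per top-level group, as listed on the slide.
COUNTS = {"Eukaryota": 19841, "Bacteria": 16972, "Archaea": 761, "Viruses": 152}


def shares(counts):
    """Percentage of the summed counts that falls into each group."""
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


for taxon, pct in shares(COUNTS).items():
    print(f"{taxon}: {pct:.1f}%")
```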
Experiment 3

Identifying species from reads of various lengths

Short-read simulation with E. coli K12 & B. bacteriovorus HD100



5000 random shotgun reads
BLASTX against NCBI-NR
Filters




Bit-score threshold 35
Top-percent: within 20% of the best hit
Discarded isolated assignments
Result: no false-positive assignments; short reads can be used for metagenomic analysis, albeit at the cost of a high rate of underprediction
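
Random shotgun reads for such a simulation could be generated roughly as follows. This is a sketch under assumptions: fixed read length, uniform start positions, a single strand, and no sequencing errors, none of which the slides specify. simulate_reads is a hypothetical helper and the genome is a random toy string.

```python
import random


def simulate_reads(genome, n_reads, read_len, seed=0):
    """Cut n_reads substrings of length read_len from uniformly random positions."""
    last_start = len(genome) - read_len
    if last_start < 0:
        raise ValueError("genome is shorter than the read length")
    rng = random.Random(seed)  # fixed seed so the simulation is reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # both endpoints inclusive
        reads.append((start, genome[start:start + read_len]))
    return reads


# Toy genome; a real run would load the genome sequence from a FASTA file.
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(1000))

reads = simulate_reads(toy_genome, n_reads=50, read_len=100)
print(len(reads), len(reads[0][1]))  # 50 100
```

Keeping the start position with each read makes it easy to check later which genomic region an assignment came from.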
Experiment 3-cont.

Roche GS20 sequencing

Data set




2000 reads from random positions in E. coli K12
~100 bp
BLASTX against NCBI-NR
Filters




Bit-score threshold 35
Top-percent: within 20% of the best hit
Discarded isolated assignments
Result
Experiment 3-cont.

Roche GS20 sequencing

Data set





2000 reads from random positions in B. bacteriovorus HD100
~100 bp
BLASTX against NCBI-NR: A in figure
BLASTX against NCBI-NR without B. bacteriovorus HD100: B in figure
Filters




Bit-score threshold 35
Top-percent: within 20% of the best hit
Discarded isolated assignments
Result
MEGAN 3 (June 2009)

Suitable for very large datasets


Interests changed


Advances in the throughput and cost-efficiency of sequencing technology
From ‘Which species are present?’ to ‘What’s different?’
Features


Visualization technique for multiple datasets
New statistical method for highlighting the differences in a pairwise comparison
MEGAN 3-cont.


Comparing 6 mouse gut metagenomes with a human gut metagenome
Clickable, collapsible.