MEGAN analysis of metagenomic data
Daniel H. Huson, Alexander F. Auch, Ji Qi, et al.
Genome Res. 2007
Early metagenomic
Known phylogenetic markers and subsequent sequencing of
clones
Analysis of paired-end reads
Complete sequences of environmental fosmid and BAC clones
Environmental assemblies
Rough annotation of the metabolic capacity
Distinguish between discrete species and population of closely related
biotypes
Problem of using proven phylogenetic markers (ribosomal
genes, coding sequences)
Slow-evolving genes: only distinguish between species at large
evolutionary distances
What is MEGAN?
Metagenome Analyzer (MEGAN)
Free software.
Deviates from the analytical pattern of previous approaches
Built on the statistical analysis of comparing random sequence
intervals with unspecified phylogenetic properties against
databases
Provides filters so the level of stringency can be adjusted later
to an appropriate level
Laptop analysis
Depends on the related sequences in the databases
Comparison results (BLAST) -> laptop (MEGAN)
Graphical and statistical output
Pipeline
Compare against databases : BLAST
Compute, explore taxonomical content : NCBI taxonomy
Lowest common ancestor (LCA) algorithm
Data sets (Sargasso Sea, mammoth bone, short E. coli K12 & B.
bacteriovorus HD100)
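The LCA step in the pipeline above can be illustrated with a minimal sketch. It assumes a tiny hand-made taxonomy (the `PARENT` map is hypothetical, not the NCBI taxonomy) and is not MEGAN's actual implementation: a read is placed on the deepest node that is an ancestor of every taxon it hits.

```python
from typing import Dict, Iterable, List

# Hypothetical toy taxonomy (child -> parent); the real NCBI taxonomy is far larger.
PARENT: Dict[str, str] = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Deltaproteobacteria": "Proteobacteria",
    "Escherichia coli": "Gammaproteobacteria",
    "Escherichia coli K12": "Escherichia coli",
    "Bdellovibrio bacteriovorus": "Deltaproteobacteria",
}


def path_to_root(taxon: str) -> List[str]:
    """Return the lineage of a taxon, ordered from the root down to the taxon."""
    path = [taxon]
    while taxon in PARENT:
        taxon = PARENT[taxon]
        path.append(taxon)
    return path[::-1]


def lowest_common_ancestor(taxa: Iterable[str]) -> str:
    """Return the deepest node shared by the lineages of all given taxa."""
    paths = [path_to_root(t) for t in taxa]
    if not paths:
        raise ValueError("a read with no hits cannot be assigned")
    lca = "root"
    # Walk the lineages level by level until they disagree.
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


# A read hitting only E. coli K12 stays at strain level; a read hitting both
# organisms can only be placed at their shared phylum.
specific = lowest_common_ancestor(["Escherichia coli K12"])
shared = lowest_common_ancestor(
    ["Escherichia coli K12", "Bdellovibrio bacteriovorus"]
)
```

Reads that hit species-specific genes end up on specific nodes, while reads that hit widely conserved genes move up toward the root.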
What we can do with
MEGAN
Species and strain identification
through species-specific genes
Searching for species or taxa with the Find
tool
Distribution of strains of a species
Underlying sequence alignments
Experiments-1
Sargasso Sea
data set
Sanger sequencing
Samples 1-4 from DDBJ/EMBL/GenBank
BLASTX->NCBI-NR
10000 reads from sample 1
Randomly selected a pooled set of 10000 reads from samples 2-4
1% no hits from sample 1, <3% no hits from samples 2-4
Filters
Min-score: bit-score threshold of 100
Top-percent: keep hits whose bit scores lie within 5% of the best score
Min-support: isolated assignments (taxa hit by only one read) discarded
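The three filters above could be sketched as follows. This is an illustrative sketch only: the function names, the `(taxon, bit score)` tuple layout and the choice to drop reads on under-supported taxa outright are assumptions, not MEGAN's code.

```python
from collections import Counter
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (taxon name, bit score); layout is an assumption


def filter_hits(hits: List[Hit], min_score: float = 100.0,
                top_percent: float = 5.0) -> List[Hit]:
    """Apply the min-score and top-percent filters to the hits of one read."""
    # Min-score: ignore hits below the bit-score threshold.
    strong = [h for h in hits if h[1] >= min_score]
    if not strong:
        return []
    # Top-percent: keep hits scoring within top_percent of the best hit.
    best = max(score for _, score in strong)
    cutoff = best * (1.0 - top_percent / 100.0)
    return [h for h in strong if h[1] >= cutoff]


def apply_min_support(assignments: Dict[str, str],
                      min_support: int = 2) -> Dict[str, str]:
    """Drop reads whose taxon received fewer than min_support reads."""
    counts = Counter(assignments.values())
    return {read: taxon for read, taxon in assignments.items()
            if counts[taxon] >= min_support}


# Hypothetical hits for a single read.
hits = [("A", 200.0), ("B", 195.0), ("C", 150.0), ("D", 80.0)]
kept = filter_hits(hits)

# Hypothetical read -> taxon assignments; taxon "B" is hit by one read only.
supported = apply_min_support({"r1": "A", "r2": "A", "r3": "B"})
```

The first two filters act on the hits of a single read before the LCA is computed; min-support acts afterwards, on the read counts per taxon.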
Analysis-Sargasso Sea data
1.66M reads, avg. 818 bp, by Sanger sequencing
Species profile of 16 taxonomical groups
Environmental assemblies
By analyzing six specific phylogenetic markers
rRNA, RecA/RadA, HSP70, RpoB, EF-Tu, and EF-G
Result
• Sample 1
•~83% of reads were assigned to taxa that
were more specific than the kingdom level
•Majority of reads (8298) were assigned to
bacterial groups
•Samples 2-4
•~59% of reads were assigned to taxa that
were more specific than the kingdom level
•Majority of reads (5709) were assigned to
bacterial groups
•Alphaproteobacteria and Gammaproteobacteria dominate
by a factor of 2-4 over the remaining 14
taxonomic groups
•Eukaryotes & Viruses: size filtering
•Archaea: maybe because there is 10 times as much
bacterial sequence information in the public
databases
•MEGAN vs. previous (Venter et al. 2004)
•Specific assignment information : LCA
Result-cont.
•Averaged weighted percentage of the six phylogenetic markers for each
of the 16 taxonomic groups
•Easily detect sampling bias between sample 1 and pooled samples 2-4
Experiments-2
Mammoth bone
Data set
Roche GS20 sequencing (Sequencing-by-synthesis)
Sample from 1 g of mammoth bone, 28,000 years old
~300,000 reads, 95bp
BLASTZ-genome sequences (elephant, human, dog)
45.4% of the reads mammoth DNA, others are environmental
organisms (bacteria, fungi, amoeba, nematodes)
BLASTX–NCBI-NR for environmental sequences
Filters : bit-score threshold 30, discard isolated assignment (filtered
2086 reads)
Result
19841 reads to Eukaryota, of
which 7969 to
Gnathostomata
16972: Bacteria, 761: Archaea,
152: Viruses
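As a small worked example, the read counts on this slide can be turned into shares. The percentages are relative to these four groups only, which is an assumption here: the slide does not say whether other categories (for example unassigned reads) exist.

```python
from typing import Dict

# Read counts as given on the slide.
counts: Dict[str, int] = {
    "Eukaryota": 19841,
    "Bacteria": 16972,
    "Archaea": 761,
    "Viruses": 152,
}
gnathostomata_reads = 7969  # subset of the Eukaryota reads


def shares(read_counts: Dict[str, int]) -> Dict[str, float]:
    """Return each group's percentage of the summed read counts."""
    total = sum(read_counts.values())
    return {taxon: 100.0 * n / total for taxon, n in read_counts.items()}


domain_shares = shares(counts)
gnathostomata_share = 100.0 * gnathostomata_reads / counts["Eukaryota"]

for taxon, pct in domain_shares.items():
    print(f"{taxon}: {pct:.1f}%")
print(f"Gnathostomata within Eukaryota: {gnathostomata_share:.1f}%")
```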
Experiment 3
Identifying species from various read lengths
Short E. coli K12 & B. bacteriovorus HD100 simulation
5000 random shotgun reads
BLASTX-NCBI-NR
Filters
Bit-score threshold 35
20% of the best hit
Discarded isolated assignments
Result: no false-positive assignments; short reads can be used for
metagenomic analysis, albeit at the cost of a high rate of underprediction
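The simulated data sets in this experiment are reads drawn from random genome positions. A minimal sketch of such sampling is given below; it treats the genome as a linear string on one strand and uses a fixed read length, which are simplifications, and `sample_reads` is a hypothetical helper rather than the tool used in the study.

```python
import random
from typing import List


def sample_reads(genome: str, n_reads: int, read_len: int,
                 seed: int = 0) -> List[str]:
    """Draw n_reads substrings of length read_len from random start positions."""
    if read_len > len(genome):
        raise ValueError("read length exceeds genome length")
    rng = random.Random(seed)  # seeded so the sample is reproducible
    last_start = len(genome) - read_len
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


# Hypothetical stand-in for a genome sequence.
toy_rng = random.Random(42)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(10_000))

reads = sample_reads(toy_genome, n_reads=2000, read_len=100)
```

Each sampled read would then be compared against NCBI-NR with BLASTX and filtered as listed on the slide.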
Experiment 3-cont.
Roche GS20 sequencing
Data set
2000 reads from random positions in the E.coli K12
~100 bp
BLASTX – NCBI-NR
Filters
Bit-score threshold 35
20% of the best hit
Discarded isolated assignments
Result
Experiment 3-cont.
Roche GS20 sequencing
Data set
2000 reads from random positions in the B. bacteriovorus HD100
~100 bp
BLASTX – NCBI-NR : A in figure
BLASTX – NCBI-NR without B. bacteriovorus HD100 : B in figure
Filters
Bit-score threshold 35
20% of the best hit
Discarded isolated assignments
Result
MEGAN 3 (June 2009)
Suitable for very large datasets
Interests changed
Advances in the throughput and cost-efficiency of sequencing
technology
From ‘Which species are present?’ to ‘What’s different?’
Features
Visualization technique for multiple datasets
New statistical method for highlighting the difference in a
pairwise comparison
MEGAN3-cont.
Comparing six mouse gut samples with human gut
Clickable, collapsible.