Advancing Science with DNA Sequence
IMG/M and metagenome analysis
Natalia Ivanova
MGM Workshop
February 5, 2009
Outline
1. Problems of metagenomic data
2. IMG/M features
3. Analysing metagenomic data: flowcharts
1. Problems of metagenomic data
(metagenomic data is the problem)
(see IMG/M -> Using IMG/M -> About IMG/M -> Background for definitions)
Metagenomic data are noisy
• Definition of a high-quality genome sequence: for example, in “finished” JGI genomes each base is covered by at least two Sanger reads in each direction with a quality of at least Q20
• Definition of a “high-quality” metagenome?
Too many variables:
    species composition/abundance
    amount of DNA available
    average GC content of each species (applies to 454 Titanium as well)
    “clonability” of the DNA of each species (or biases of 454 libraries)
    amount of sequence allocated
    no clear sequencing goal
    …
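The Q20 threshold in the “finished” definition can be made concrete with the standard Phred relation Q = -10·log10(p), where p is the probability of a wrong base call. The sketch below is illustrative only: the function names and the per-base data layout are assumptions, and the coverage check encodes one reading of the criterion (at least two reads per direction, each at Q20 or better), not JGI's actual finishing pipeline.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return math.pow(10, -q / 10)

def base_is_finished(observations, min_q=20, min_reads_per_strand=2):
    """Check one base position against the assumed 'finished' criterion.

    observations: list of (strand, quality) tuples, strand is '+' or '-'.
    This data layout is hypothetical, chosen only for the example.
    """
    good_fwd = sum(1 for strand, q in observations if strand == "+" and q >= min_q)
    good_rev = sum(1 for strand, q in observations if strand == "-" and q >= min_q)
    return good_fwd >= min_reads_per_strand and good_rev >= min_reads_per_strand
```

Under this relation Q20 corresponds to an expected error rate of about 1 in 100 base calls. A metagenome has no comparable per-base guarantee, which is why the variables listed above matter.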
Metagenomic data are noisy
• Sequence coverage of metagenomes is low

US Sludge, Phrap assembly       # of scaffolds   % total scaffolds
Scaffolds, coverage > 2.0                 2954                 9.3
Scaffolds, coverage 1.03-2.0              8158                25.7
Unassembled reads                        20630                65

• Rate of sequencing artifacts is high
• Frameshifts are the most troublesome artifacts: they lead to errors in gene prediction
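To see why a frameshift is so damaging to gene prediction, the following minimal sketch translates a short made-up coding sequence before and after a single-base insertion. The sequence is a toy example, not taken from any dataset, and `translate` is a hypothetical helper that uses the standard genetic code and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate frame 0 of a DNA string until the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

intact = "ATGGCTAAAGGTGAACTGTAA"          # toy ORF
shifted = intact[:6] + "C" + intact[6:]   # one spurious base inserted

print(translate(intact))   # MAKGEL
print(translate(shifted))  # MAQR
```

After the inserted base every downstream codon is read in the wrong frame, so the predicted protein changes and, in this example, ends at a premature stop codon.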
Metagenomic data are highly fragmented
• Median scaffold length in 56 GEBA genomes: 28,179 bp
• Median scaffold length in US Sludge, Phrap assembly: 1,157 bp
• Many more gene fragments in metagenomes
(median protein size in GEBA genomes: 252 aa; median protein size in US Sludge, Phrap: 195 aa)
• Problems with assignment to protein families and functional annotation
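The medians quoted above are simple summary statistics over scaffold or protein lengths. A minimal sketch of how such a comparison could be computed follows; the length lists are hypothetical toy values and the 100 aa cutoff is an arbitrary illustration, neither taken from GEBA nor from US Sludge.

```python
from statistics import median

def fraction_shorter_than(lengths, cutoff):
    """Fraction of sequences shorter than the cutoff (a crude fragment indicator)."""
    return sum(1 for n in lengths if n < cutoff) / len(lengths)

# Hypothetical protein lengths in amino acids, for illustration only.
isolate_proteins = [120, 250, 310, 400, 95]
metagenome_proteins = [60, 110, 150, 210, 90]

print(median(isolate_proteins), median(metagenome_proteins))
print(fraction_shorter_than(isolate_proteins, 100),
      fraction_shorter_than(metagenome_proteins, 100))
```

A lower median and a larger share of short proteins are the signature of genes truncated at scaffold ends.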
Metagenomic datasets are large (or huge)

# of CDSs   GEBA genomes   Samples in IMG          Projects in IMG
minimal     1,375          2,331 (mouse gut ob2)   2,386 (AMO community)
maximal     9,433          185,274 (soil)          333,301 (Lake Washington sediment)
median      3,562          16,053                  83,662

• No manual annotation (functional annotations in metagenomes should be taken with a grain of salt)
• “Divide and conquer” approach
2. IMG/M features
(see also IMG/M -> Using IMG/M -> Using IMG/M -> IMG User Guide and IMG/M Addendum)
IMG/M User Interface Map
Dividing the genes phylogenetically
• Bins
Microbiome Details -> Microbiome Information -> Bins (of scaffolds)
• Phylogenetic Distribution of Genes
Microbiome Details -> Phylogenetic Distribution of Genes
Components (from the slide diagram):
    histogram (phylum/class): gene counts, gene lists, summary statistics
    histogram (family): counts, lists, summary statistics tables
    histogram (species): counts, lists of genes, recruitment plots
    Protein Recruitment Plots
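The histograms in Phylogenetic Distribution of Genes are, in essence, counts of best hits grouped by taxon and by percent identity. The sketch below shows that kind of tally. It is an illustration only: the input layout (gene id, phylum, percent identity of the best hit) and the 30/60/90% cutoffs are assumptions, not a description of how IMG/M computes its histograms.

```python
from collections import Counter

IDENTITY_CUTOFFS = (90, 60, 30)  # assumed bins: 90%+, 60-89%, 30-59%

def identity_bin(percent_identity):
    """Return the highest cutoff the hit reaches, or None if below all cutoffs."""
    for cutoff in IDENTITY_CUTOFFS:
        if percent_identity >= cutoff:
            return cutoff
    return None

def phylum_histogram(best_hits):
    """Count best hits per (phylum, identity bin).

    best_hits: iterable of (gene_id, phylum, percent_identity) tuples.
    """
    counts = Counter()
    for _gene_id, phylum, percent_identity in best_hits:
        bin_label = identity_bin(percent_identity)
        if bin_label is not None:
            counts[(phylum, bin_label)] += 1
    return counts

# Hypothetical best hits, for illustration only.
hits = [
    ("g1", "Proteobacteria", 95.2),
    ("g2", "Proteobacteria", 64.0),
    ("g3", "Firmicutes", 41.5),
    ("g4", "Proteobacteria", 91.0),
    ("g5", "Firmicutes", 22.0),   # below every cutoff, not counted
]
print(phylum_histogram(hits))
```

Counts in the highest identity bin suggest close relatives among the reference genomes, while hits that fall only in the lowest bin indicate more distant relatives.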
Dividing the genes by abundance/by function
• Abundance Profiles
Compare Genomes -> Abundance Profiles Tools
Components:
Abundance Profile Overview
Abundance Profile Search
Function Comparisons
Function Category Comparisons
Common parameters:
Normalization (none/scale for size)
Type of count (raw counts/estimated gene copies)
Type of protein family (COG, Pfam, Enzyme, TIGRfam)
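The common parameters above can be pictured with a small sketch. It rests on two assumptions that should be checked against the IMG/M documentation: that “estimated gene copies” weights each gene by the read depth of its scaffold, and that “scale for size” rescales counts to a common dataset size. Function names and the family labels are hypothetical.

```python
from collections import defaultdict

def family_counts(genes, count_type="raw"):
    """Count genes per protein family.

    genes: iterable of (family, read_depth) tuples.
    count_type: "raw" counts each gene once; "estimated" adds its read depth
    (assumed meaning of 'estimated gene copies').
    """
    counts = defaultdict(float)
    for family, read_depth in genes:
        counts[family] += read_depth if count_type == "estimated" else 1.0
    return dict(counts)

def scale_for_size(counts, dataset_size, scale_to=10000):
    """Rescale counts as if the dataset contained `scale_to` genes (assumed normalization)."""
    return {family: n * scale_to / dataset_size for family, n in counts.items()}

# Hypothetical genes: (protein family, read depth of the scaffold carrying the gene).
genes = [("famA", 1.0), ("famA", 3.0), ("famB", 2.0)]
print(family_counts(genes, "raw"))
print(family_counts(genes, "estimated"))
```

Raw counts treat a gene on a deeply covered scaffold the same as a singleton read, so the two count types can rank families differently in unevenly covered communities.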
3. Analysing metagenomic data: flowcharts
Sanger metagenomes
Sanger library
16S sequences
10 plate QC
raw read QC: GC content, insert-less clones, contamination
taxonomic analysis (MEGAN)
Full sequence
vector and quality trimming
assembly
annotation
binning
loading to IMG/M-ER (upon request)
manual analysis (protein families, etc.)
loading to IMG/M-ER
454 Titanium metagenomes
Titanium library
16S pyrotags
¼ run QC (100 Mb)
Full sequence (1 run, ~500 Mb)
raw read QC; initial assembly?
loading to IMG/M-ER (upon request)
taxonomic analysis (MEGAN)
dereplication
quality trimming?
assembly?
manual analysis (protein families, etc.)
annotation?
binning?
loading to IMG/M-ER
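The dereplication step in the 454 flowchart removes artificially duplicated reads. As a rough illustration of the idea, and not the procedure actually used for these datasets, the sketch below treats reads that share an identical prefix as replicates and keeps the longest one. The prefix length and the function name are assumptions.

```python
def dereplicate(reads, prefix_len=20):
    """Keep the longest read among those sharing the same first `prefix_len` bases."""
    longest_by_prefix = {}
    for read in reads:
        key = read[:prefix_len]
        kept = longest_by_prefix.get(key)
        if kept is None or len(read) > len(kept):
            longest_by_prefix[key] = read
    return list(longest_by_prefix.values())

# Toy reads, for illustration only.
reads = ["ACGTACGTAA", "ACGTACGTAAGG", "TTTTACGTAA"]
print(dereplicate(reads, prefix_len=8))
```

Production tools usually tolerate mismatches between replicates. Exact prefix matching is the simplest possible stand-in, and it will also collapse genuinely independent reads that happen to start at the same position.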
Sanger/Titanium metagenomes: unassembled data
unassembled metagenomes
taxonomic analysis using Phylogenetic Distribution of genes
    gross counts of hits to taxa
    hits to housekeeping genes at different % identity
    compare to 16S and MEGAN results
abundance analysis using Function Comparisons and Function Category Comparisons
    compare to relevant metagenomes (ecology/taxonomy)
    compare to relevant genomes (ecology/taxonomy)
    check “Genes in internal clusters”
abundance analysis of custom function categories using Function Profiles
    find the relevant genes and reference sequences in the literature
    identify relevant protein families
    add them to Function Cart, run Function Profiles, compare sums of counts
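The last step, comparing sums of counts for a custom function category, amounts to adding up the per-family counts of the selected protein families in each metagenome. A minimal sketch, assuming the Function Profile results have been exported into a dictionary of family counts per sample; the sample names, family labels and counts are hypothetical.

```python
def category_sum(profile, families):
    """Sum the counts of the selected protein families in one sample."""
    return sum(profile.get(family, 0) for family in families)

def compare_category(profiles, families):
    """Return the summed count of the custom category for every sample."""
    return {sample: category_sum(profile, families)
            for sample, profile in profiles.items()}

# Hypothetical Function Profile output: counts per protein family per sample.
profiles = {
    "sample_1": {"famA": 12, "famB": 3, "famC": 40},
    "sample_2": {"famA": 2, "famC": 35},
}
custom_category = ["famA", "famB"]   # families chosen from the literature
print(compare_category(profiles, custom_category))
```

Raw sums are only comparable between samples of similar size, so for datasets of very different sizes the counts would first need to be normalized.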
Sanger/Titanium metagenomes: assembled data
assembled metagenomes
taxonomic analysis using Phylogenetic Distribution of genes
    look for reference genomes
    try to select a training set for binning
    binning
abundance analysis using Function Comparisons and Function Category Comparisons
    compare to relevant metagenomes (ecology/taxonomy)
    compare to relevant genomes (ecology/taxonomy)
    check “Genes in internal clusters”
abundance analysis of custom function categories using Function Profiles
    find the relevant genes and reference sequences in the literature
    identify relevant protein families
    add them to Function Cart, run Function Profiles, compare sums of counts
Sanger/Titanium metagenomes: assembled and binned data
assembled and binned metagenomes
QC analysis of bins
    check the genes on the scaffolds with lowest confidence
    analysis of bin coverage: check the presence of COGs in biosynthetic pathways, ribosomal proteins, etc.
metabolic reconstruction on bins
    COG Pathways and Functional Categories
    KEGG maps
    custom pathways
    keep in mind bin coverage
compare bin content using Phylogenetic Profiles
    analyze gene presence/absence in pathway context
    be careful with unique proteins: they may be errors of gene prediction
analyze recombination within populations using SNP VISTA
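The bin coverage check described above can be approximated by asking what fraction of an expected set of protein families, such as ribosomal proteins, is present in a bin. The sketch below assumes the bin's families are available as a list of identifiers. The marker names are hypothetical placeholders and should be replaced with a vetted list of COGs.

```python
def marker_coverage(bin_families, marker_families):
    """Return (fraction of markers present, sorted list of missing markers)."""
    markers = set(marker_families)
    present = markers & set(bin_families)
    missing = sorted(markers - present)
    return len(present) / len(markers), missing

# Placeholder marker set and bin content, for illustration only.
markers = ["marker1", "marker2", "marker3", "marker4"]
bin_content = ["marker1", "marker3", "other_family", "marker4"]

fraction, missing = marker_coverage(bin_content, markers)
print(fraction, missing)
```

A low fraction is a warning that a pathway absent from the bin may simply be unsequenced, which is the point of keeping bin coverage in mind during metabolic reconstruction.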