Advancing Science with DNA Sequence

IMG/M and metagenome analysis
Natalia Ivanova
MGM Workshop
February 5, 2009

Outline
1. Problems of metagenomic data
2. IMG/M features
3. Analysing metagenomic data: flowcharts

1. Problems of metagenomic data
(metagenomic data is the problem)
(see IMG/M -> Using IMG/M -> About IMG/M -> Background for definitions)

Metagenomic data are noisy
• Definition of a high-quality genome sequence: for example, in “finished” JGI genomes each base is covered by at least two Sanger reads in each direction with a quality of at least Q20
• Definition of a “high-quality” metagenome? Too many variables:
  - species composition/abundance
  - amount of DNA available
  - average GC content of each species (applies to 454 Titanium as well)
  - “clonability” of the DNA of each species (or biases of 454 libraries)
  - amount of sequence allocated
  - no clear sequencing goal
  - …

Metagenomic data are noisy
• Sequence coverage of metagenomes is low
  US Sludge, Phrap assembly:

  Category                       # of scaffolds   % total scaffolds
  Scaffolds, coverage > 2.0      2954             9.3
  Scaffolds, coverage 1.03-2.0   8158             25.7
  Unassembled reads              20630            65

• Rate of sequencing artifacts is high
• Frameshifts are the most unpleasant artifacts; they lead to errors in gene prediction
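
To illustrate why frameshifts are so damaging to gene prediction, here is a minimal sketch that translates a short, made-up coding sequence with and without a single-base deletion. The sequence and the `translate` helper are hypothetical; the codon table is the standard genetic code built from its usual compact TCAG ordering.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate complete codons in frame 0; a trailing partial codon is ignored."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

reference = "ATGGCTAAAGGTCTGGAACGTTAA"   # hypothetical 8-codon gene
read = reference[:7] + reference[8:]      # same gene with one base deleted

print(translate(reference))  # MAKGLER*
print(translate(read))       # MAKVWNV
```

After the deletion every downstream codon changes and the stop codon is lost, so a gene caller sees a different, wrongly terminated protein.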

Metagenomic data are highly fragmented
• Median scaffold length in 56 GEBA genomes – 28,179 bp
• Median scaffold length in US Sludge, Phrap assembly – 1,157 bp
• Many more gene fragments in metagenomes (median protein size in GEBA genomes – 252 aa, median protein size in US Sludge, Phrap – 195 aa)
• Problems with assignment to protein families and functional annotation
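
The medians quoted above are simple summary statistics over scaffold and protein lengths. A minimal sketch of how one might compute them for an assembly, using toy numbers rather than the GEBA or US Sludge data; `fragmentation_summary` is a hypothetical helper.

```python
from statistics import median

def fragmentation_summary(scaffold_lengths_bp, protein_lengths_aa):
    """Median scaffold and protein lengths as rough indicators of fragmentation."""
    return {
        "median_scaffold_bp": median(scaffold_lengths_bp),
        "median_protein_aa": median(protein_lengths_aa),
    }

# Toy input, not real data.
summary = fragmentation_summary([900, 1200, 5000, 700], [150, 200, 90])
print(summary)
```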

Metagenomic datasets are large (or huge)

  # of CDSs   GEBA genomes   Samples in IMG          Projects in IMG
  minimal     1,375          2,331 (mouse gut ob2)   2,386 (AMO community)
  maximal     9,433          185,274 (soil)          333,301 (Lake Washington sediment)
  median      3,562          16,053                  83,662

• No manual annotation (functional annotations in metagenomes should be taken with a grain of salt)
• “Divide and conquer” approach

2. IMG/M features
(see also IMG/M -> Using IMG/M -> Using IMG/M -> IMG User Guide and IMG/M Addendum)

IMG/M User Interface Map

Dividing the genes phylogenetically
• Bins
  Microbiome Details -> Microbiome Information -> Bins (of scaffolds)
• Phylogenetic Distribution of Genes
  Microbiome Details -> Phylogenetic Distribution of Genes
  Components:
  - histograms
  - Protein Recruitment Plots
  - summary statistics tables
  - lists of genes
[Diagram: histogram (phylum/class) -> gene lists, gene counts, summary statistics; histogram (family) -> counts, lists, statistics; histogram (species) -> counts, lists, recruitment plots]

Dividing the genes by abundance/by function
• Abundance Profiles
  Compare Genomes -> Abundance Profiles Tools
  Components:
  - Abundance Profile Overview
  - Abundance Profile Search
  - Function Comparisons
  - Function Category Comparisons
  Common parameters:
  - Normalization (none/scale for size)
  - Type of count (raw counts/estimated gene copies)
  - Type of protein family (COG, Pfam, Enzyme, TIGRfam)
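
The normalization parameter matters because metagenomes differ widely in size. The slide does not say how IMG/M implements "scale for size", so the sketch below is only one common approach, stated as an assumption: express each protein family count per fixed number of genes in the sample. The family names and counts are placeholders.

```python
def scale_for_size(family_counts, total_genes, scale=10_000):
    """Express raw protein-family counts per `scale` genes in the sample.

    Assumption: this is an illustrative normalization, not IMG/M's documented one.
    """
    return {fam: n * scale / total_genes for fam, n in family_counts.items()}

# Placeholder samples: a small and a large metagenome.
small = scale_for_size({"FAM_1": 12}, total_genes=2_000)
large = scale_for_size({"FAM_1": 120}, total_genes=50_000)
print(small, large)
```

With raw counts the large sample appears to have ten times more FAM_1 genes; after scaling, the small sample has the higher relative abundance (60 versus 24 per 10,000 genes).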

3. Analysing metagenomic data: flowcharts

Sanger metagenomes
Flowchart steps:
• Sanger library
• 16S sequences
• 10 plate QC
• raw read QC: GC content, insert-less clones, contamination
• taxonomic analysis (MEGAN)
• Full sequence
• vector and quality trimming
• assembly
• annotation
• binning
• loading to IMG/M-ER (upon request)
• manual analysis (protein families, etc.)
• loading to IMG/M-ER
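
GC content is one of the raw read QC checks listed in the flowchart. A minimal sketch of how per-read GC content could be computed and used to flag outliers; the thresholds and the `flag_gc_outliers` helper are assumptions for illustration, not the JGI QC procedure.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T) in a read."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def flag_gc_outliers(reads, low=0.2, high=0.8):
    """Names of reads whose GC content falls outside [low, high] (hypothetical cutoffs)."""
    return [name for name, seq in reads.items()
            if not low <= gc_content(seq) <= high]

reads = {"r1": "ATGCATGC", "r2": "AAAAAAAAAT"}   # toy reads
print(flag_gc_outliers(reads))
```

In practice the cutoffs would be chosen from the GC distribution expected for the community, since an unexpected GC peak can point to contamination.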

454 Titanium metagenomes
Flowchart steps:
• Titanium library
• 16S pyrotags
• ¼ run QC (100 Mb)
• Full sequence (1 run, ~500 Mb)
• raw read QC; initial assembly ?
• loading to IMG/M-ER (upon request)
• taxonomic analysis (MEGAN)
• dereplication
• quality trimming ?
• assembly ?
• manual analysis (protein families, etc.)
• annotation ?
• binning ?
• loading to IMG/M-ER
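
The flowchart lists dereplication without describing the method. As a rough illustration only, the sketch below removes reads that are exact duplicates, or that share an identical prefix of a chosen length; `dereplicate` and `prefix_len` are hypothetical and this is not the pipeline used for IMG/M.

```python
def dereplicate(reads, prefix_len=None):
    """Keep the first read seen for each distinct sequence (or sequence prefix).

    reads: list of (name, sequence) tuples.
    prefix_len: if given, reads sharing the same first prefix_len bases
    are treated as replicates.
    """
    seen = set()
    kept = []
    for name, seq in reads:
        key = seq if prefix_len is None else seq[:prefix_len]
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept

reads = [("a", "ACGTAC"), ("b", "ACGTAC"), ("c", "ACGTTT")]   # toy reads
print(dereplicate(reads))
print(dereplicate(reads, prefix_len=4))
```

Removing replicates before counting keeps artificially duplicated reads from inflating gene abundance estimates.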

Sanger/Titanium metagenomes: unassembled data
• taxonomic analysis using Phylogenetic Distribution of Genes
  - gross counts of hits to taxa
  - hits to housekeeping genes at different % identity
  - compare to 16S and MEGAN results
• abundance analysis using Function Comparisons and Function Category Comparisons
  - compare to relevant metagenomes (ecology/taxonomy)
  - compare to relevant genomes (ecology/taxonomy)
  - check “Genes in internal clusters”
• abundance analysis of custom function categories using Function Profiles
  - find the relevant genes and reference sequences in the literature
  - identify relevant protein families
  - add them to Function Cart, run Function Profiles, compare sums of counts
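
For the taxonomic step above (counting hits to taxa at different % identity), the tallying can be sketched as follows. The input format, the `count_hits_by_identity` helper, and the 30/60/90% cutoffs are assumptions chosen for illustration.

```python
from collections import Counter

def count_hits_by_identity(hits, thresholds=(90, 60, 30)):
    """Tally best hits per taxon into the highest identity bin they reach.

    hits: iterable of (taxon, percent_identity) pairs.
    Hits below the lowest threshold are not counted.
    """
    ordered = sorted(thresholds, reverse=True)
    counts = Counter()
    for taxon, pid in hits:
        for t in ordered:
            if pid >= t:
                counts[(taxon, f">={t}%")] += 1
                break
    return counts

hits = [("Proteobacteria", 95.0), ("Proteobacteria", 72.5),
        ("Firmicutes", 41.0), ("Firmicutes", 12.0)]   # toy hits
print(count_hits_by_identity(hits))
```

Hits in the highest bin suggest a close relative with a sequenced genome, while a taxon seen only in the lowest bin is a much weaker assignment and worth checking against the 16S and MEGAN results.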

Sanger/Titanium metagenomes: assembled data
• taxonomic analysis using Phylogenetic Distribution of Genes
  - look for reference genomes
  - try to select a training set for binning
  - binning
• abundance analysis using Function Comparisons and Function Category Comparisons
  - compare to relevant metagenomes (ecology/taxonomy)
  - compare to relevant genomes (ecology/taxonomy)
  - check “Genes in internal clusters”
• abundance analysis of custom function categories using Function Profiles
  - find the relevant genes and reference sequences in the literature
  - identify relevant protein families
  - add them to Function Cart, run Function Profiles, compare sums of counts
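
The last step, comparing sums of counts for a custom function category, amounts to adding up the counts of the chosen protein families in each sample. A minimal sketch, assuming the Function Profiles result has been exported as a nested mapping; the data layout, family names, and `category_totals` helper are hypothetical.

```python
def category_totals(profile, category_families):
    """Sum counts of the families in a custom category for each sample.

    profile: {sample_name: {family_id: count}}
    category_families: family IDs that make up the custom category.
    """
    return {
        sample: sum(counts.get(fam, 0) for fam in category_families)
        for sample, counts in profile.items()
    }

# Placeholder profile for two samples and three families.
profile = {
    "sample_1": {"FAM_A": 4, "FAM_B": 1, "FAM_Z": 30},
    "sample_2": {"FAM_A": 0, "FAM_B": 2},
}
totals = category_totals(profile, ["FAM_A", "FAM_B"])
print(totals)
```

The sums are only comparable across samples if the counts were normalized in the same way for each one.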

Sanger/Titanium metagenomes: assembled and binned data
• QC analysis of bins
  - check the genes on the scaffolds with lowest confidence
  - analysis of bin coverage: check the presence of COGs in biosynthetic pathways, ribosomal proteins, etc.
• metabolic reconstruction on bins
  - COG Pathways and Functional Categories
  - KEGG maps
  - custom pathways
  - keep in mind bin coverage
• compare bin content using Phylogenetic Profiles
  - analyze gene presence/absence in pathway context
  - be careful with unique proteins – they may be errors of gene prediction
• analyze recombination within populations using SNP VISTA
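
The bin coverage check described under QC analysis of bins can be approximated by asking what fraction of an expected marker set (for example ribosomal protein COGs) is present in a bin. A minimal sketch; the marker IDs are placeholders to be replaced by a curated list, and `marker_completeness` is a hypothetical helper.

```python
def marker_completeness(bin_cogs, marker_cogs):
    """Fraction of expected marker COGs found in a bin, plus the missing ones."""
    markers = set(marker_cogs)
    found = markers & set(bin_cogs)
    return len(found) / len(markers), sorted(markers - found)

# Placeholder IDs, not a real marker list.
markers = ["COG_A", "COG_B", "COG_C", "COG_D"]
bin_cogs = {"COG_A", "COG_B", "COG_X"}

fraction, missing = marker_completeness(bin_cogs, markers)
print(fraction, missing)
```

A bin with a low fraction is probably incomplete, so a gene that appears to be absent from a pathway may simply not have been sequenced or binned.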