29. A Metagenome analysis test case

Advancing Science with DNA Sequence
Metagenome analysis: use case
Natalia Ivanova
MGM Workshop
September 29, 2011
Minoan eruption and
metagenomics
…it seemed as though the sea was being sucked
backwards, as if it were being pushed back by
the shaking of the land…Behind us were
frightening dark clouds, rent by lightning
twisted and hurled, opening to reveal huge
figures of flame. These were like lightning,
but bigger.
From Pliny the Younger’s Letter
Apart from Minoan eruption…
Diagram by Gary Massoth/PMEL
from Chernicoff & Stanley,
Geology, 2007
Sampling sites
white mat
red mat
Key gradients white vs red:
Temperature 60 vs 18 °C
CO2 tension >99% vs <1%
This is what it looks like
Chimney material may be of
biological origin
Standard JGI metagenome pipeline
DNA sample => DNA QC, then two branches:
• SSU pyrotags => http://pyrotagger.jgi-psf.org => Community composition (semi-quantitative: OTU abundance)
• Shotgun libraries (454 standard, 454 long mate pair, Illumina standard, Illumina long mate pair) => Assembly => contigs + unassembled reads => Metagenome (IMG/M-ER) => Analysis: community composition, functional analysis
Pyrotag results – BLASTn against Greengenes database
[Figure: Pyrotags, phylum level, filtered at 0.1% of all clusters. Bar chart per phylum for Kolumbo_volcano_white and Kolumbo_volcano_red; axis: % pyrotag clusters (0 to 40).]
Phyla shown: EM3, Thermosulfidobacterium, Thermotogae, BRC1, C2, SM2F11, WS6, pMC2A15, pMC2A384, Cyanobacteria, DHVE3, TM7, Chlamydiae, MAT-CR-M3-H11, TM6, OP5, Lentisphaerae, pMC1, Spirochaetes, VHS-B5-50, NKB19, Firmicutes, Thermoplasmata_Eury, ABY1_OD1, pMC2A209, Nitrospirae, MBMPE71, Gemmatimonadetes, Chlorobi, Verrucomicrobia, WS3, OP8, OP3, Caldithrix_KSB1, Actinobacteria, Acidobacteria, OP11, Unknown, Thaumarchaeota, Marine_group_A, Chloroflexi, Planctomycetes, Bacteroidetes, Proteobacteria
PhyloDistribution results – BLASTp of metagenome CDSs against isolates in IMG
[Figure: PhyloDistribution of CDSs, phylum level, filtered at 0.1% abundance. Bar chart per phylum for Kolumbo_volcano_white_grey and Kolumbo_volcano_red; axis: % CDS hits (0% to 50%).]
Phyla shown: Lentisphaerae, Thermosulfidobacterium, Thermotogae, Cyanobacteria, Spirochaetes, Firmicutes, Thermoplasmata_Eury, Chlorobi, Verrucomicrobia, Caldithrix_KSB1, Actinobacteria, Acidobacteria, Thaumarchaeota, Chloroflexi, Planctomycetes, Bacteroidetes, Proteobacteria
Pyrotags vs PhyloDistribution –
white mat
[Figure: bar chart per phylum comparing Kolumbo_volcano_white_grey_PhyloDist and Kolumbo_volcano_white_grey_Pyro; axis: 0% to 50%.]
Phyla shown: Marine_group_A, Lentisphaerae, Caldithrix_KSB1, Actinobacteria, Acidobacteria, Unknown, Chlorobi, Verrucomicrobia, Thermoplasmata_Eury, Thaumarchaeota, EM3, Thermosulfidobacteri, Planctomycetes, OP5, Thermotogae, Proteobacteria, Bacteroidetes
Big differences in abundance (an order of magnitude or more) of
Bacteroidetes and Thermotogae
Possible explanations
• Primer bias in pyrotags (against
Proteobacteria)?
• Amplification artifacts in pyrotags – well known
for metagenome data
• Sequencing GC bias in the metagenome – low- and
high-GC sequences (<30% and >65% GC) are
underrepresented in Illumina data
• K-mer assembler problems: abundant
populations may be underrepresented in the
assembly if incorrect k-mer/coverage
parameters are selected
PCR artifacts in metagenome data
454 technology includes
an emulsion PCR step,
which may lead to
artificial
overrepresentation of
certain sequences
Reason: presence of free
beads during the library prep
step; escaped emPCR products
bind to free beads and are
disproportionately amplified
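One common way to screen 454 reads for such artificial replicates is to group reads that begin with an identical prefix, on the assumption that in a random shotgun library few genuine reads start at exactly the same position unless coverage is very high. The sketch below is a minimal illustration of that idea, not the JGI procedure; the function names and the 20-base prefix length are hypothetical choices.

```python
from collections import defaultdict

def group_prefix_replicates(reads, prefix_len=20):
    """Group reads sharing an identical prefix (putative emPCR replicates).

    prefix_len is an illustrative assumption, not a recommended setting.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[read[:prefix_len].upper()].append(read)
    # Keep only prefixes that occur more than once.
    return {prefix: rs for prefix, rs in groups.items() if len(rs) > 1}

def dereplicate(reads, prefix_len=20):
    """Keep the longest read from each prefix group."""
    best = {}
    for read in reads:
        key = read[:prefix_len].upper()
        if key not in best or len(read) > len(best[key]):
            best[key] = read
    return list(best.values())
```

Comparing cluster abundances before and after such a filter gives a rough idea of how much an abundant population owes to replicates rather than to biology.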
What about GC bias?
[Figure panels: Low GC (Brachyspira), Medium GC (Arcanobacterium), High GC (Cellulomonas)]
Question: how do you find average/max/min GC content
for a clade?
Answer: IMG=>Genome Browser=>View Phylogenetically=>click on green + to select the
clade, then “Add selected to Genome Cart”=>Compare Genomes=>Genome Statistics
Result (GC percent):
Clade           Average   Max   Min
Thermotogae     41        47    31
Bacteroidetes   42.5      66    31
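The same check can be scripted when per-genome sequences or GC values are available outside IMG. The sketch below computes GC percent and tests a clade's GC range against the <30% / >65% limits quoted earlier for Illumina data; the function names are illustrative, and treating those two limits as hard cutoffs is a simplifying assumption.

```python
def gc_percent(seq):
    """GC content of a nucleotide sequence in percent (N and gaps ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence has no A/C/G/T bases")
    return 100.0 * gc / (gc + at)

def clade_gc_summary(gc_values):
    """Average, max and min GC percent over the genomes of a clade."""
    return {
        "average": sum(gc_values) / len(gc_values),
        "max": max(gc_values),
        "min": min(gc_values),
    }

def outside_illumina_window(summary, low=30.0, high=65.0):
    """True if the clade's GC range extends beyond the low/high limits."""
    return summary["min"] < low or summary["max"] > high
```

With the values reported above, Thermotogae (31 to 47) stays inside the window, while Bacteroidetes (31 to 66) exceeds it only marginally at the upper end.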
Are there any abundant populations
that could be filtered out in assembly?
Cluster     Kolumbo_volcano_white_grey   % identity   Taxonomy
Cluster1    32.4                         100          Bacteria; Bacteroidetes; Bacteroidales
Cluster2    12.84                        94           Bacteria; Thermotogae
Cluster3    4.93                         98.51        Bacteria; Proteobacteria; Zetaproteobacteria
Cluster6    3.44                         94.03        Bacteria; Proteobacteria; Gammaproteobacteria
Cluster12   3.42                         97.5         Bacteria; OP5; SRI-280
Cluster15   2.83                         98.5         Bacteria; Thermosulfidobacterium
Cluster13   2.82                         91.13        Bacteria; EM3
Cluster8    2.81                         97           Bacteria; Proteobacteria; Gammaproteobacteria
Cluster9    1.86                         89.22        Bacteria; Proteobacteria; Alphaproteobacteria
Cluster17   1.53                         92.12        Bacteria; Proteobacteria; Deltaproteobacteria
Cluster19   1.44                         97           Bacteria; Proteobacteria; Deltaproteobacteria
Cluster25   1.27                         93.56        Bacteria; Proteobacteria; Desulfurellales
Cluster53   1.03                         100          Archaea; Thaumarchaeota; Cenarchaeales
Lower-rank labels in the output: VC21_Bac22; Methylococcales, Methylococcaceae, Clonothrix; Thiomicrospira, Thiomicrospira_frisia, Thiomicros; Caulobacterales, Caulobacteraceae, Caulobacte; Desulfobacterium_catecholicum; Desulfobulbus_rhabdofor; Desulfurellaceae; Cenarchaeum
Typical Pyrotagger output
There are 2 highly abundant populations – just 2 clusters account
for nearly all Bacteroidetes and Thermotogae in the sample
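A quick way to spot such dominant populations is to filter the cluster abundances by a cutoff. The sketch below uses the abundance values from the Pyrotagger output above; the 10% threshold and the function name are illustrative assumptions.

```python
# Abundance (% of pyrotags) per cluster in Kolumbo_volcano_white_grey,
# copied from the Pyrotagger output.
abundance = {
    "Cluster1": 32.4, "Cluster2": 12.84, "Cluster3": 4.93, "Cluster6": 3.44,
    "Cluster12": 3.42, "Cluster15": 2.83, "Cluster13": 2.82, "Cluster8": 2.81,
    "Cluster9": 1.86, "Cluster17": 1.53, "Cluster19": 1.44, "Cluster25": 1.27,
    "Cluster53": 1.03,
}

def dominant_clusters(abundance, threshold=10.0):
    """Clusters whose abundance exceeds threshold (an illustrative cutoff)."""
    return {c: a for c, a in abundance.items() if a > threshold}

top = dominant_clusters(abundance)
combined = round(sum(top.values()), 2)
print(top, combined)
```

Populations flagged this way are the ones most likely to be affected by the k-mer/coverage settings of the assembler.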
Let’s take a closer look at the
assemblies and unassembled reads
                                             White mat     Red mat
454 reads total                              1,429,091     299,975
Illumina reads total                         49,227,146    45,337,178
Assembled contigs                            195,590       88,776
N50, bp                                      659           869
Longest contig, bp                           28,145        75,483
Illumina reads mapped to assembly, % total   42.3          12.5
454 reads mapped to assembly, % total        62.1          15.3
[Figure: bar chart by phylum (Verrucomicrobia, Thermotogae, Thaumarchaeota, Spirochaetes, Proteobacteria, Planctomycetes, Firmicutes, Euryarchaeota, Cyanobacteria, Chloroflexi, Chlorobi, Bacteroidetes, Aquificae, Actinobacteria); axis: 0 to 70.]
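For reference, the N50 values quoted for the two assemblies follow the usual definition: the contig length at which the contigs of that length or longer contain at least half of all assembled bases. A minimal sketch, assuming contig lengths are available as a list of integers:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs")
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
```

A low N50 next to a long longest contig, as in both mats, points to a few well-assembled populations in an otherwise fragmented assembly.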
Functional analysis: metagenome
as a bag of functions
• Red mat is taxonomically more diverse
• Is it more diverse functionally?
                White mat   Red mat
COG clusters    3631        3402
Pfam clusters   3847        3505
Question: where do you find this information?
Answer: IMG=>Taxon Details=>Metagenome Statistics; Genes with
Pfam=>Display as a list =>Export
[Figure: rarefaction curves; x axis: Specimens (0 to 80000); y axis: Taxa (95% confidence) (0 to 3600).]
Rarefaction curves: white
mat is expected to have
~4000 different Pfams; red
mat ~3600
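A rarefaction curve of this kind can be approximated from an exported list of per-family gene counts. The sketch below uses the standard hypergeometric expectation for sampling without replacement; it only interpolates within the observed data and does not produce extrapolated richness estimates such as the ~4000 and ~3600 figures. The function name is illustrative, and which tool produced the slide's curves is not stated.

```python
from math import comb

def rarefy(counts, n):
    """Expected number of distinct families in a random subsample of n genes.

    counts: genes observed per family (for example per Pfam).
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total count")
    all_draws = comb(total, n)
    # A family is missed only if all n draws avoid its genes.
    return sum(1 - comb(total - c, n) / all_draws for c in counts)
```

Evaluating `rarefy` on both samples at the same n puts the two mats on an equal footing despite their different numbers of annotated genes.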
Abundance Comparisons
Motility and chemotaxis genes are overrepresented
in white mat (detected by both Pfams and COG
Categories)
white mat
red mat
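To illustrate what "overrepresented" can mean numerically, the sketch below compares the fraction of genes assigned to a category in two samples with a two-proportion z-test. This is a generic stand-in chosen for illustration, not necessarily the statistic IMG's Abundance Comparisons tool uses, and any counts passed to it here are hypothetical.

```python
from math import erfc, sqrt

def two_proportion_z(hits_a, total_a, hits_b, total_b):
    """z statistic and two-sided p-value for a difference in proportions."""
    p_a, p_b = hits_a / total_a, hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError("proportions are degenerate (all hits or no hits)")
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail
    return z, p_value
```

For example, `two_proportion_z(300, 10000, 100, 10000)` compares 3% against 1% of genes; a positive z means the category is more frequent in the first sample.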
Is motility/chemotaxis common to
all organisms in white mat?
• Scenario 1: the function/pathway is overrepresented
because it is present in all members of the community,
possibly at higher copy number
• Scenario 2: the function/pathway is overrepresented
because it is present in one clade, which is absent from
the second sample
Question: can we distinguish between the two scenarios?
Answer: click on the gene count for protein family/functional category, add all genes to Gene
Cart=>add scaffolds to Scaffold Cart=>PhyloDistribution of all scaffolds in the Scaffold Cart
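Once the phylum of each scaffold carrying a gene of interest is known, the two scenarios can be told apart by how the genes are spread across clades. A minimal sketch, assuming one phylum label per gene has been exported; the function names and the 0.8 cutoff are hypothetical.

```python
from collections import Counter

def clade_shares(gene_phyla):
    """Fraction of a family's genes assigned to each phylum."""
    counts = Counter(gene_phyla)
    total = sum(counts.values())
    return {phylum: n / total for phylum, n in counts.most_common()}

def looks_clade_specific(gene_phyla, cutoff=0.8):
    """True if a single phylum holds at least `cutoff` of the genes (scenario 2)."""
    shares = clade_shares(gene_phyla)
    return max(shares.values()) >= cutoff
```

A family dominated by one phylum supports scenario 2 only if that phylum is also missing or rare in the other sample, which still has to be checked separately.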
Are Sulfurimonas-like bacteria
present in both samples?
The total number of sequences in all clusters assigned to Epsilonproteobacteria is 50 in white mat and 66 in red mat
Largest cluster in white mat includes 125K+ sequences
Largest cluster in red mat includes 14K+ sequences
Pyrotag clusters assigned to Epsilonproteobacteria (sequence counts; columns: red mat, white mat):
Cluster1730     13, 33   Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter
Cluster5877     28       Epsilonproteobacteria; Campylobacterales; Campylobacteraceae; Sulfurospirillum
Cluster8886     5        Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter
Cluster13550    8        Epsilonproteobacteria; Campylobacterales; Arcobacteraceae; PL-7C7
Cluster14168    4        Epsilonproteobacteria; Campylobacterales; Campylobacteraceae; Sulfurospirillum_arc
Cluster17937    4        Epsilonproteobacteria
Cluster20681    5        Epsilonproteobacteria; Campylobacterales; Campylobacteraceae; Sulfurospirillum_arc
Cluster22836    2        Epsilonproteobacteria
Cluster35524    5        Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Sulfurimonas
Cluster38665    2        Epsilonproteobacteria; Sulfurovumales
Cluster44900    1        Epsilonproteobacteria; Sulfurovumales; Sulfurovumaceae; Rimicaris_exoculata
Cluster57912    1        Epsilonproteobacteria
Cluster60712    1        Epsilonproteobacteria
Cluster76930    1, 1     Epsilonproteobacteria; unclassified; unclassified; Nitratiruptor
Cluster87523    1        Epsilonproteobacteria; unclassified; unclassified; Nitratiruptor
Cluster160974   1        Epsilonproteobacteria; Campylobacterales; Campylobacteraceae; Sulfurospirillum
Question: what about the presence of Sulfurimonas-like bacteria in the metagenomes?
Answer: go to Compare Genomes=>PhyloDistribution=>Genome vs Metagenomes, select the genome; the histogram shows the number of BLASTp hits from CDSs in all metagenomes to this genome
Are there any methylotrophs in
the white mat?
Conclusions
• The two communities have different composition; the white mat, sampled next to the hydrothermal vent, has lower complexity
• Community composition as sampled by pyrotags and by the metagenome may be quite different due to a number of biases
• Some protein families/functional categories are more abundant in one sample than in the other because of different community composition, and not necessarily because they are more important in this environment