Argonne, 2008 - Edwards @ SDSU

Challenges for metagenomic data analysis and
lessons from viral metagenomes
[What would you do if sequencing were free?]
Rob Edwards
San Diego State University
Fellowship for Interpretation of Genomes
The Burnham Institute for Medical Research
Outline
• How and why we sequence environments
• Viral metagenomics
– Marine stories
– Human stories
• Pyrosequencing
– Mine story
• Is there a Future?
Why Metagenomics?
• What is there?
• How many are there?
• What are they doing?
How do you sequence the environment?
• Extract DNA
CsCl step gradient (1.1, 1.35, 1.5, 1.7 g ml^-1)
How do you sequence the environment?
• Extract DNA
• Create library
Linker-Amplified Shotgun Libraries (LASLs)
Soil Extraction Kit -> Hydroshear -> Blunt-ending -> Addition of Linkers -> Amplification of Fragments
This method produces high coverage libraries of over 1 million clones from as little as 1 ng DNA
David Mead -
Breitbart (2002) PNAS
How do you sequence the environment?
• Extract DNA
• Create library
• Sequence fragments
Why Phages?
• Phages are viruses that infect bacteria
– 10:1 ratio of phages:bacteria
– 10^31 phages on the planet
• Specific interactions (probably)
– one virus : one host
• Small genome size
– Higher coverage
• Horizontal gene transfer
– 10^25-10^28 bp DNA per year in the oceans
Uncultured Viruses
200 liters water
5-500 g fresh fecal matter
Concentrate and purify viruses
Epifluorescent
Microscopy
Extract nucleic acids
DNA/RNA LASL
Sequence
Bioinformatics
• BLAST against NR
– blastx, tblastn, tblastx
• BLAST against boutique databases
– Complete phage genomes, ACLAME, Other libraries, 16S
• Parsing to present data in a useful format
BLAST and Parsing
• http://phage.sdsu.edu/blast
• Submit BLAST to local and remote databases
– Local (as fast as possible)
– NCBI (one search every 3 seconds)
• Many concurrent searches
– One search versus 1,000 searches
• Parse data into tables
– Access to taxonomy etc
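The "one search every 3 seconds" limit for remote searches can be sketched as a small throttle. This is only an illustration, not the code behind the site above: `Throttle`, `submit_all`, and the `submit` callable are hypothetical names, and no real NCBI interface is called here.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive remote submissions."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock    # injectable so the throttle can be tested without waiting
        self.sleep = sleep
        self.last = None      # time of the previous submission, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


def submit_all(queries, submit, throttle):
    """Send each query through `submit` (a hypothetical callable), one per interval."""
    results = []
    for query in queries:
        throttle.wait()
        results.append(submit(query))
    return results
```

For local databases the throttle would simply be left out, since those searches can run as fast as the hardware allows.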
Most Viral Genes are Unknown
Known: 22%; Unknown: 78% (TBLAST, E<0.001; 3,093 sequences)
Breitbart (2002) PNAS
Rohwer (2003) Cell
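A known/unknown split like the one above can be computed from parsed BLAST output. The sketch below assumes the hits are in the standard 12-column tab-separated BLAST format and that a read counts as "known" if it has at least one hit below the E-value cutoff; the function names are illustrative and this is not the pipeline used for the published numbers.

```python
import csv

# Standard 12-column BLAST tabular layout (assumed input format).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_hits(lines):
    """Yield (query id, subject id, E-value) from tab-separated BLAST lines."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(COLUMNS, row))
        yield record["qseqid"], record["sseqid"], float(record["evalue"])


def known_fraction(all_read_ids, lines, max_evalue=1e-3):
    """Fraction of reads with at least one hit whose E-value is below the cutoff."""
    reads = set(all_read_ids)
    known = {query for query, _, evalue in read_hits(lines) if evalue < max_evalue}
    return len(known & reads) / len(reads)
```

Reads with no hits at all never appear in the BLAST table, which is why the full list of read identifiers is passed in separately.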
GenBank has more than doubled since 2001 …
60 billion base pairs
60 million sequences
… but the fraction of unknowns remains constant
Edwards (2005) Nature Rev. Microbiol.
All of the new genes in the databases are
coming from environmental sequences
Human-associated viruses
• More bacteria than somatic cells by at
least an order of magnitude
• More phages than bacteria by an order
of magnitude
• Sample the bacteria in the intestine by
sampling their phage
Most Viral DNA Sequences in Adult Human Feces are Unknown Phages
Known: 40%; Unknown: 60% (TBLAST, E<0.001; 532 sequences)
Phages: 94%; Eukaryotic Viruses: 6%
Breitbart (2003) J. Bacteriol.
Adults Versus Babies
No bacteria or viruses in 1st fecal sample
Abundant bacterial and viral communities by 1 week of age
>10^8 VLP/g feces
Baby Feces Viruses
• Most sequences are unknown (≈70%)
• Similarities to phages from Lactococcus,
Lactobacillus, Listeria, Streptococcus, and other
Gram positive hosts
• From microarray studies, sequences are stable in
the baby over a 3 month period
• Same types of phage as present in adult feces
– one identical sequence to an unrelated adult!
DNA viruses in feces are phages.
Feces ≠ intestines.
RNA viruses?
Most Human RNA Viruses are Known
Known: 92%; Unknown: 8% (TBLAST, E<0.001; ≈36,000 sequences)
Pepper Mild Mottle Virus: 65%; Other Plant Viruses: 9%; Other: 26%
Zhang (2006) PLoS Biology
Pepper Mild Mottle Virus (PMMV)
• ssRNA virus; ≈6 kb genome
• Related to Tobacco Mosaic Virus
• Infects members of Capsicum family
• Widely distributed – spread through seeds
• Fruits are small, malformed, mottled
• Rod-shaped virions
Viral particles in fecal sample
TOBACCO MOSAIC VIRUS
http://www.rothamsted.bbsrc.ac.uk/ppi/links/pplinks/virusems/
PMMV is common in Human Feces
Fecal samples -> Extract total RNA -> RT-PCR for PMMV
[Gel: lanes S1-S9, PMMV band]
San Diego: 78% of people are positive
Singapore: 67% of people are positive
10-50 fold increase in feces compared to food
10^6-10^9 PMMV copies per gram dry weight of feces
Which Foods Contain PMMV?
Chili powder
Chili sauces
NOT FOUND IN FRESH PEPPERS
Koch’s Postulates
Thesunmachine.net
http://www.sweatnspice.com
Human microbial metagenome is more
important than human genome
How do you sequence the environment?
• Extract DNA
• Create library
• Sequence fragments
How do you sequence the environment?
• Extract DNA
• Pyrosequence
454 Pyrosequencing
• DNA extraction from environment
• Whole genome amplification
• Emulsion-based PCR
• Luciferase-based sequencing
(Steps split between SDSU and 454 Inc.)
Margulies (2005) Nature
454 Sequence Data
(Only from Rohwer Lab)
• 21 libraries
– 10 microbial, 11 phage
• 597,340,328 bp total
– 20% of the human genome
– 50% of all complete and partial microbial genomes
• 5,769,035 sequences
– Average 274,716 per library
• Average read length 103.5 bp
– Av. read length has not increased in 7 months
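The averages on this slide follow directly from the totals. A quick check, with the human genome size of 3.0 billion bp as an assumption that is not stated on the slide:

```python
total_bp = 597_340_328
total_reads = 5_769_035
libraries = 21
human_genome_bp = 3.0e9  # assumed round figure, not from the slide

reads_per_library = total_reads / libraries        # average reads per library
mean_read_length = total_bp / total_reads          # average read length in bp
fraction_of_human = total_bp / human_genome_bp     # share of one human genome

print(round(reads_per_library), round(mean_read_length, 1), f"{fraction_of_human:.0%}")
```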
Growth of sequence data
600 million bp
6 million reads
Cost of sequencing
• One reaction = $10,000
• One reaction = 250,000 reads
• 250 reads = $10
• 1 read = 4¢
• 1 read = 100 bp
• 1 bp = 0.04¢
($400 per 1x 1,000,000 bp)
454 sequencing does not require cloning, arraying etc.
• Sanger sequencing ca. $1/rxn, 0.2¢/bp
– real cost ca. $5/rxn, 1¢/bp
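The per-read and per-base costs can be derived step by step from the price of one reaction. The sketch uses exact fractions to avoid rounding noise; the variable names are illustrative.

```python
from fractions import Fraction

run_cost_cents = 10_000 * 100   # one 454 reaction = $10,000
reads_per_run = 250_000
read_length_bp = 100

cents_per_read = Fraction(run_cost_cents, reads_per_run)   # cost of one read
cents_per_bp = cents_per_read / read_length_bp             # cost of one base
dollars_per_mbp = cents_per_bp * 1_000_000 / 100           # 1x coverage of 1,000,000 bp

sanger_cents_per_bp = Fraction(2, 10)                      # quoted Sanger price per base
sanger_to_454_ratio = sanger_cents_per_bp / cents_per_bp   # how many times more per base

print(float(cents_per_read), float(cents_per_bp), float(dollars_per_mbp))
```

At the quoted prices a Sanger base costs five times a 454 base; at the "real cost" of 1¢/bp the ratio is 25.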
Bioinformatics
• 597,340,328 bp total
• 5,769,035 sequences
• 7 months
• Existing tools are not sufficient
Current Pipeline
http://phage.sdsu.edu/~rob/Pyrosequencing/
• Dereplicate
• BLAST against
– 16S
– Complete phage
– nr (SEED)
– subsystems
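The dereplication step can be sketched as follows. The slide does not say how duplicates are defined, so this version assumes the simplest rule: reads with identical sequences (ignoring case) are duplicates and only the first one is kept. The function name is illustrative.

```python
def dereplicate(reads):
    """Keep the first read for each distinct sequence.

    `reads` is an iterable of (read_id, sequence) pairs. Exact, case-insensitive
    sequence identity is an assumed criterion, not necessarily the pipeline's.
    """
    first_seen = {}
    for read_id, sequence in reads:
        # setdefault stores the id only the first time a sequence appears
        first_seen.setdefault(sequence.upper(), read_id)
    return [(read_id, sequence) for sequence, read_id in first_seen.items()]
```

Dereplicating before the BLAST searches reduces the number of queries without changing which sequences are represented.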
Sequencing is cheap and easy.
Bioinformatics is neither.
The Soudan Mine, Minnesota
Red Stuff
Black Stuff
Oxidized
Reduced
Red and Black Samples Are Different
Cloned and 454 sequenced 16S are indistinguishable
[Figure: 16S comparison; panel labels include Black stuff, Red, and Cloned]
Annotation of metagenomes by subsystems
A subsystem is a group of genes that
work together
– Metabolism
– Pathway
– Cellular structures
– Anything an annotator thinks is interesting
There are different amounts of
metabolism in each environment
There are different amounts of
substrates in each environment
Red Stuff
Black Stuff
But are the differences significant?
• Sample 10,000 proteins from site 1
• Count frequency of each subsystem
• Repeat 20,000 times
• Repeat for sample 2
• Combine both samples
• Sample 10,000 proteins 20,000 times
• Build 95% CI
• Compare medians from sites 1 and 2 with 95% CI
Rodriguez-Brito (2006). In Review
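The resampling test above can be sketched in a few lines. This is one reading of the steps on the slide, not the authors' implementation: each site is a list of subsystem labels (one per protein), the pooled data give a null distribution for the difference between two random samples, and a subsystem is flagged when the difference between the two site medians falls outside the central 95% of that distribution. Function names and the use of sampling with replacement are assumptions.

```python
import random
from collections import Counter
from statistics import median


def resample_counts(proteins, sample_size, repeats, rng):
    """Draw `repeats` samples (with replacement) and record each subsystem's count."""
    names = sorted(set(proteins))
    counts = {name: [] for name in names}
    for _ in range(repeats):
        tally = Counter(rng.choices(proteins, k=sample_size))
        for name in names:
            counts[name].append(tally[name])
    return counts


def significant_subsystems(site1, site2, sample_size=10_000, repeats=20_000, seed=0):
    """Return {subsystem: (median difference, ci_low, ci_high, is_significant)}."""
    rng = random.Random(seed)
    counts1 = resample_counts(site1, sample_size, repeats, rng)
    counts2 = resample_counts(site2, sample_size, repeats, rng)

    # Null distribution: differences between two samples drawn from the pooled data.
    pooled = list(site1) + list(site2)
    null_a = resample_counts(pooled, sample_size, repeats, rng)
    null_b = resample_counts(pooled, sample_size, repeats, rng)

    results = {}
    for name in null_a:
        diffs = sorted(a - b for a, b in zip(null_a[name], null_b[name]))
        ci_low = diffs[int(0.025 * (repeats - 1))]
        ci_high = diffs[int(0.975 * (repeats - 1))]
        # A subsystem absent from one site has a median count of zero there.
        observed = median(counts1.get(name, [0])) - median(counts2.get(name, [0]))
        results[name] = (observed, ci_low, ci_high, observed < ci_low or observed > ci_high)
    return results
```

The defaults mirror the slide (10,000 proteins, 20,000 repeats); smaller values are enough for a quick trial on toy data.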
Examples of significantly different
subsystems
Red Stuff
Arg, Trp, His
Ubiquinone
FA oxidation
Chemotaxis, Flagella
Methylglyoxal metabolism
Black Stuff
Ile, Leu, Val
Siderophores
Glycerolipids
NiFe hydrogenase
Phenylpropionate degradation
Subsystem differences & metabolism
Iron acquisition
Black Stuff
Siderophore enterobactin biosynthesis
ferric enterobactin transport
ABC transporter ferrichrome
ABC transporter heme
Black stuff: ferrous iron
(Fe2+, ferroan [(Mg,Fe)6(Si,Al)4O10(OH)8])
Red stuff: ferric iron
(goethite [FeO(OH)])
Nitrification differentiates the samples
Edwards (2006)
In review
Not all biochemistry happens in a single
organism
Anaerobic methane oxidation
Boetius et al. Nature, 2000.
Net: CH4 + SO4^2- -> HCO3- + HS- + H2O
Archaea: CH4 + H2O -> HCO3- + OH + H2 -> CO2 + H2O
Bacteria: SO4^2- + H2O -> HS- + OH + 2O2
The challenge is explaining the
differences between samples
Red Sample
Arg, Trp, His
Ubiquinone
FA oxidation
Chemotaxis, Flagella
Methylglyoxal metabolism
Black Sample
Ile, Leu, Val
Siderophores
Glycerolipids
NiFe hydrogenase
Phenylpropionate degradation
We are moving away from one organism one reaction and
towards studying the biochemistry of whole environments
Bacteria don’t live alone
Summary
From 454 sequence:
– Identify microbial composition
– Identify metabolic function
– Identify statistically significant differences
in metabolism
– Who, what, why of microbial ecology
Sampling Sites
• Marine
– Near-shore water (~100 samples)
– Off-shore water (~50 samples)
– Near- and off-shore sediments
• Metazoan associated
– Corals
– Fish
– Human blood
– Human stool
• Freshwater
– Aquifer
– Glacial lake
• Terrestrial/Soil
– Amazon rainforest
– Konza prairie
– Joshua Tree desert
• Singapore Air
• Extreme
– Hot springs (84°C; 78°C)
– Soda lake (pH 13)
– Solar saltern (>35% salt)
SDSU
FIG
Forest Rohwer
Mya Breitbart
Beltran Rodriguez-Brito
Rohwer Lab:
Linda Wegley
Florent Angly
Matt Haynes
Also at SDSU
Anca Segall
Willow Segall
Stanley Maloy
Genome Institute of Singapore:
Zhang Tao
Charlie Lee
Chia Lin Wei
Yijun Ruan
MIT:
Ed DeLong
Veronika Vonstein
Ross Overbeek
Annotators
Math Guys@SDSU
Peter Salamon
Joe Mahaffy
James Nulton
Ben Felts
David Bangor
Steve Rayhawk
Jennifer Mueller
NSF - Biotic Surveys and Inventories
- Biological Oceanography
- Biocomplexity
Viral Community Structure
• Contigs assembled from fragments with >= 98%
identity over 20 bp are a resampling of a single phage
genome
• Contig spectrum is the number of contigs that have
one sequence, the number that have two sequences,
and so on
• Use both analytical and Monte-Carlo simulations to
predict community structure from contig spectrum
The Math Guys (2006) In preparation
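Given the number of sequences in each assembled contig, the contig spectrum defined above is a simple tally. A minimal sketch, with an illustrative function name and singletons counted as contigs of one sequence:

```python
from collections import Counter


def contig_spectrum(sequences_per_contig):
    """Return [contigs with 1 sequence, contigs with 2 sequences, ...]."""
    tally = Counter(sequences_per_contig)
    largest = max(tally, default=0)
    return [tally.get(size, 0) for size in range(1, largest + 1)]
```

For example, three singletons, two 2-sequence contigs, and one 3-sequence contig give the spectrum [3, 2, 1].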
[Figure: rank-abundance curve; y-axis Abundance of the species (%), 0.00-2.00; x-axis Species Rank, 0-50]
• Determine the actual contig spectrum of the sample
• Predict a contig spectrum using a species abundance model
• Compute the error between the actual and predicted
• Adjust the parameters in the species abundance model to minimize errors
• Continue this procedure until we obtain the smallest error
[Figure: Error plotted against Model parameters]
Find the smallest error, a global minimum
Viral Communities are Extremely Diverse
Fecal
Seawater
Marine Sediments
Lots of rare viral genotypes
[Chart: Shannon-Wiener Index (axis 0-10) for Sediment Viruses, Seawater Viruses, Seawater Viruses, Fecal Viruses, Bacteria on Corals, Agriculture Soil Bacteria, Soil Nematodes, Cropland Earthworms, Rainforest Spiders, Amazon Fish, Rainforest Birds, Forest Mammals, Temperate Forest Beetles, River Bacteria, Forest Amphibians, Fossil Corals]
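The Shannon-Wiener index used to compare these communities is H' = -sum(p_i * ln p_i), where p_i is the relative abundance of genotype or species i. A minimal sketch, assuming natural logarithms (the slide does not state the base) and an illustrative function name:

```python
import math


def shannon_index(abundances):
    """Shannon-Wiener index from raw abundances (counts or proportions)."""
    total = sum(abundances)
    # Entries with zero abundance contribute nothing and are skipped.
    return -sum((n / total) * math.log(n / total) for n in abundances if n > 0)
```

A community of S equally abundant genotypes has H' = ln(S), so an index near 9 corresponds to several thousand equally common genotypes.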