Transcript InvTalk1
Phylogenetic diversity and proposed
ecological roles of rare members of the
soil biosphere
Mostafa S. Elshahed
Oklahoma State University
Attempts to isolate a stabilized toluate degrading
Syntrophic consortium
Mostafa Elshahed, January 23rd, 1997
A yet-unexplored rare biosphere
• Microbial communities often exhibit a species distribution pattern in which the majority of microbial species are present in low abundance.
• Sampling effort in highly diverse environments usually covers a small fraction of the estimated number of species.
• Low abundance, rarely sampled species have been called the rare biosphere.
[Figure: rarefaction curve, number of OTUs (0-700) vs. number of clones (0-1200). Curve constructed from the Schloss and Handelsman dataset of Alaskan soil (PLoS Comput. Biol. 2006, 2:e92)]
What’s in the rare biosphere?
• How to define, access the rare biosphere?
• What is the level of phylogenetic diversity within members
of the rare biosphere?
• What are the evolutionary relationships between rare and
abundant members of the community?
• What are the community dynamics between rare and
abundant members of the community?
• What is the ecological role (if any) of members of the rare
biosphere?
• What is the effect of the rare biosphere on species richness
estimates?
The rare soil biosphere
• Extremely valuable ecosystem for economic sustainability
as well as for elemental global cycling.
• Highly diverse, species richness estimates range between
2,000 and 55,000.
• One of the most intensively sampled ecosystems: 77,103 sequences in
RDP, almost all of which originated from small clone
libraries.
• Current collection of 16S sequences in the databases could
be regarded as a global survey of abundant species in soils.
Kessler farm soil (KFS)
• Undisturbed tallgrass prairie soil in McClain County,
Oklahoma.
• A 16S clone library was constructed using primer pair 27F-1391R,
yielding 13,001 near full-length, non-chimeric clones.
• Sequences grouped into 3,747 operational taxonomic units at a
3% sequence divergence cutoff (OTU0.03).
• Dataset is 8 times the size of the largest near full-length soil clone
library available, and 1/4 the size of the largest pyrosequencing dataset
generated from soil.
KFS community composition
• 13,001 clones, 3,747 OTUs
• 34 different bacterial phyla
• Major phyla fairly typical of soil.
[Figure: two bar-chart panels of % abundance per lineage, "Groups with abundances > 1%" (axis 0-30) and "Groups with abundances < 1%" (axis 0-0.8). Lineages labeled: Actinobacteria, Acidobacteria, Alpha-, Beta-, Gamma-, and Delta-Proteobacteria, Chloroflexi, Verrucomicrobia, Planctomycetes, Bacteroidetes, Gemmatimonadetes, Firmicutes, Cyanobacteria, Chloroplasts, Chlorobia, Chlamydia, Fibrobacter, Elusimicrobia, Caldithrix, OP3, OP10, OP11, OD2, TM6, TM7, WS3, SC3, SC4, AD3, BRC-1, SPAM, Gal-15, Elev-1860, KFS-1 to KFS-5, Novel Unclassified]
Defining “rare”
[Figure: KFS rarefaction curves at various taxonomic cutoffs (3%, 6%, 8%, 10%, 15%, 20%); number of OTUs (0-4000) vs. number of clones (0-15000)]
• Subjective process.
• Sampling effort did not reach saturation.
• OTUs labeled as rare in the KFS dataset represent OTUs with a low
probability of being encountered in average-sized clone libraries.
How to define the rare biosphere?
OTU         % of        Prob. of occurrence in   Gene copy #/   Cells/g
occurrence  occurrence  100-clone library (%)*   g of soil**    of soil
1           18.1        0.77                     4,230          282-4,230
2           8.8         1.53                     8,460          564-8,460
3           4.4         2.28                     12,690         846-12,690
4           3.57        3.03                     16,920         1,128-16,920
5           2.27        3.77                     21,150         1,410-21,150
6           1.59        4.51                     25,380         1,692-25,380
7           2           5.24                     29,610         1,974-29,610
30          0.45        20.63                    126,900        8,460-126,900
65          0.49        39.42                    274,950        18,330-274,950
204         1.54        79.43                    862,920        57,528-862,920
* Calculated using the formula p = 1 - (1 - x)^y, where p is the probability of detecting, in a small
dataset of size y, a species with relative abundance x in the large dataset
** Determined using qPCR
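As a quick check of the footnote formula, the short Python sketch below recomputes the detection probability for a few rows of the table. The function name and the use of 13,001 as the size of the large dataset are choices made here for illustration.

```python
def detection_probability(x: float, y: int) -> float:
    """Probability of seeing at least one clone of a species with
    relative abundance x in a library of y clones: p = 1 - (1 - x)^y."""
    return 1.0 - (1.0 - x) ** y


TOTAL_CLONES = 13_001  # size of the KFS clone library

for occurrence in (1, 2, 5, 30, 204):
    p = detection_probability(occurrence / TOTAL_CLONES, 100)
    print(f"{occurrence:>4} clones in KFS -> {100 * p:.2f}% chance in a 100-clone library")
```

An OTU seen once in 13,001 clones has well under a 1% chance of turning up in a 100-clone library, which is the sense in which it is labeled rare.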
Rare members of KFS community
14 phyla are represented by ≤ 5 clones, 4 phyla represented
by 1 clone
Novel phylum level diversity in
Kessler Farm Soil
• 5 Novel candidate phyla
(KFS1-KFS5).
• Future availability of sequences
could add three new phyla.
• All novel phyla were
represented by less than five
clones.
Novel subphylum level lineages in KFS rare
biosphere
• Novel subphylum level lineages detected in all major soil phyla.
• With the exception of two lineages in -Proteobacteria, all clones
belonging to these novel lineages were present in low abundance.
Rare biosphere harbors lineages not commonly
associated with soil
• Within the rare KFS biosphere, multiple lineages belonged to known phyla not
commonly associated with soil.
• Examples include:
– Phyla Chlorobia, Caldithrix, Elusimicrobia, and candidate phylum BRC-1
– Clones affiliated with the genus Salinibacter within the Bacteroidetes
– Clostridiales-affiliated clones
– Clones belonging to the Sup-05 lineage within the γ-Proteobacteria
• These lineages obligately require specific environmental conditions (strict
anaerobic conditions, high salt, high temperature) that are not usually prevalent in
soil ecosystems.
Novelty of rare vs. abundant species in KFS dataset
[Figure: two panels. Percentage difference from the closest relative (0-40) vs. number of clones (0-200); average sequence divergence from closest relative (0-16) vs. number of clones/OTU (0-200)]
• Rare species (n≤5) are more than 7.5% different from their closest
relatives in the database
• Abundant species (n≥50) are 0.85-5.9% different from their closest
relatives in the database
• Some exceptions
Uniqueness of rare members of the soil biosphere
• What is the phylogenetic relationship between rare and abundant species in our
dataset?
• To answer this question, we determined the percentage of rare taxa at different
taxonomic cutoffs
• Species 3%
• Genus 6%
• Family 8%
• Order 10%
• Class 15%
• Phylum 20%-25%
– If the rare species are unique, this percentage should not decrease as the
cutoff increases
– If they are closely related to other more abundant taxa, the percentage
should drop sharply as the cutoff increases
– The magnitude of the drop in the % of rare taxa is indicative of the
relative contribution of the above 2 scenarios to the total number of rare
taxa in our dataset.
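The test described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis pipeline used for KFS: it assumes each clone already has an OTU label at every cutoff, and the toy labels below are hypothetical.

```python
from collections import Counter


def rare_clone_percentage(otu_labels, max_size):
    """Percentage of clones that fall in OTUs containing at most max_size clones."""
    sizes = Counter(otu_labels)
    rare = sum(1 for label in otu_labels if sizes[label] <= max_size)
    return 100.0 * rare / len(otu_labels)


# Hypothetical toy data: one OTU label per clone, at two taxonomic cutoffs.
# At the 6% cutoff, clones "b" and "c" merge into the abundant OTU "A",
# while clone "d" stays on its own (a unique lineage).
assignments = {
    0.03: ["a", "a", "a", "b", "c", "d", "e", "e"],
    0.06: ["A", "A", "A", "A", "A", "D", "E", "E"],
}

for cutoff, labels in assignments.items():
    pct = rare_clone_percentage(labels, max_size=1)
    print(f"cutoff {cutoff:.0%}: {pct:.1f}% of clones are singletons")
```

A sharp drop between cutoffs means most rare clones were absorbed into abundant OTUs; whatever remains rare at high cutoffs corresponds to the unique lineages.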
Proportion of unique clones within rare members
of the KFS bacterial community
[Figure: percentage of rare clones (0-100) vs. taxonomic cutoff (0-25), plotted for n=1 and n≤5]
• While a fraction of the rare species have close relatives within the more
abundant members of the community, a fraction represents unique,
evolutionarily distinct lineages.
• Rare clones at putative genus cutoff represent 50.1-66.1% of the rare clones
at putative species level.
• Rare clones at putative class cutoff represent 7.9-16.3% of the rare clones
at OTU0.03 (Figure 4b).
• Members of the rare biosphere represent 18.1-37.1% of the
KFS dataset.
• Members of the rare biosphere are on average more novel
than more abundant members of the community
• Members of the rare biosphere either:
– Have close relatives within the more abundant members of the KFS
community
– Belong to phyla not commonly associated with soil
– Belong to unique, phylogenetically distinct lineages with no close
sequence similarity to more abundant members of KFS.
We reason that recognizing these novelty and
uniqueness patterns is key to understanding the
origins, dynamics, and ecological roles of various
members of the soil’s rare biosphere
Proposed origins, dynamics, and ecological roles
of rare members of the soil biosphere
• non-unique, non-novel members of the rare biosphere act
as a back-up system and readily respond to seasonal
variations encountered in soil temperature, pH, light
exposure, and nutrient levels.
• Unique clones belonging to well described lineages that
are not prevalent in soil respond to more drastic
disturbances that could occur in the ecosystem.
• Unique clones belonging to novel lineages have an old,
evolutionarily distinct origin. The ecological role of this group is
not clear:
- Remnants of evolution with exceptional survival ability.
- Perform a yet-unknown ecological role in the ecosystem.
Species richness in KFS
Estimation methods
Parametric (ML models)
• Advantages
– Unbiased
– Assessment of the model fit
– Use maximum amount of frequency data
• Disadvantages
– Several models, which one to choose
– Different models do not produce similar estimates
– Computationally difficult
Rarefaction curves (fitting the curve to estimate the asymptote)
• Advantages
– Not sensitive to sample size
• Disadvantages
– Precision problems
Non-parametric (no assumption of distribution; Chao, ACE)
• Advantages
– Computationally easy
• Disadvantages
– Limited diagnostic criteria
– Arbitrary cutoff point
– Bias
Parametric models
• Approximate frequency distribution of captured species then project
the given distribution to estimate the number of unobserved species.
• Problems with previous application of parametric models:
– Assume a-priori distribution (Lognormal).
– Did not use maximum likelihood estimation of model parameters.
– Did not test the goodness of fit or provide a standard error of the
estimate.
Parametric models, cont.
• As Hong S.-H. et al. (2005) and Jeon et al. (2006) suggested, since there is no
reason to assume a priori that a certain model will provide the best fit to the
observed frequency data, several models were tested and compared. The model
of choice is the one that
– Provides the best fit (using χ² goodness of fit)
– Has the least standard error
– Includes the maximum number of frequency data (highest truncation point)
• Models tested have an underlying Poisson (random) sampling distribution and
differ in the mixing distribution function
– Negative binomial (gamma-mixed Poisson)
– Inverse Gaussian-mixed Poisson
– Lognormal-mixed Poisson
– Pareto-mixed Poisson
– Mixture of 2 exponentials-mixed Poisson
– Poisson
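To make the idea of a parametric estimate concrete, here is a minimal Python sketch for the simplest model in the list, the plain (unmixed) Poisson. It is an illustration only: the mixed models need numerical integration and are not shown, and the function name and toy counts are assumptions made here. The input is the number of clones observed for each OTU; OTUs with zero clones are by definition unobserved, so the Poisson is zero-truncated.

```python
import math


def fit_truncated_poisson(counts):
    """Maximum likelihood fit of a zero-truncated Poisson to per-OTU clone counts.

    Returns (lam, s_hat): the Poisson mean and the estimated total number
    of species, observed plus unobserved.
    """
    s_obs = len(counts)
    mean = sum(counts) / s_obs
    if mean <= 1.0:
        raise ValueError("all OTUs are singletons; the estimate is unbounded")

    # The ML estimate solves lam / (1 - exp(-lam)) = mean. The left side
    # increases with lam, tends to 1 as lam -> 0, and exceeds mean at
    # lam = mean, so bisection on (0, mean] is enough.
    lo, hi = 1e-9, mean
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2

    # 1 - exp(-lam) is the probability that a species is seen at least once.
    s_hat = s_obs / -math.expm1(-lam)
    return lam, s_hat


# Hypothetical toy frequency data: clones observed per OTU.
lam, s_hat = fit_truncated_poisson([1, 1, 2, 2, 3, 3])
print(f"lambda = {lam:.3f}, estimated richness = {s_hat:.1f}")
```

The mixed-Poisson models follow the same logic but let the mean vary between species according to a mixing distribution, which is why they fit long-tailed frequency data better.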
Our dataset
• Using our 13,001 clones, the species richness was
estimated using parametric models as discussed above.
Model                                     Estimate   SE    TP
Inverse Gaussian-mixed Poisson            15,896     NA    8
Lognormal-mixed Poisson                   11,002     NA    9
Pareto-mixed Poisson                      7,379      NA    55
Mixture of 2 exponentials-mixed Poisson   15,009     722   17
(SE: standard error; TP: truncation point)
• The species richness for our dataset is estimated to be 15,009 species.
• The mixture of 2 exponentials-mixed Poisson seems to be the model
that best describes our frequency distribution.
• Different models gave different estimates of species richness.
Rarefaction curve fitting
• The species richness is estimated by fitting an equation to the curve and
estimating the asymptote
[Figure: rarefaction curve (data) with MM fit; number of OTUs vs. number of clones]
Rarefaction curve fitting
• Equations used to fit the curve
– Michaelis Menten
– Exponential
• Both the Michaelis Menten and the exponential curves are
forced through the origin. This affects their fit. To improve the
fit, an intercept is added.
– MM-with intercept
– Exponential with intercept
• With bigger datasets, the curvatures at the beginning and the
end of the rarefaction curve are not the same. For this reason
the MM equation is not a good fit. A double MM equation with
one for the beginning and one for the end of the rarefaction
curve should solve this problem
– Double MM equation
Rarefaction curve fitting
• Materials and methods
– Analytic Rarefaction software was used to construct the rarefaction curves
– Once the rarefaction curve is available, the data are fitted using a nonlinear
least squares method. Software available online.
– 5 different equations are used to fit the curve.
• MM and exponential equations have 2 parameters
• The intercept equations have 3 parameters
• The double MM equation has 4 parameters
• For each equation, one of the parameters is the asymptote, i.e. the estimated
species richness
– The curve fitter software gives the parameter and its SE as well as the
residuals (difference between the observed and fitted data)
– The best model has the least SE and residuals
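The fitting step can be sketched in plain Python for the two-parameter MM equation, S(n) = s_max · n / (k + n), where s_max is the asymptote. This is a stand-in for the curve-fitting software mentioned above, not a description of it: the function name, the search range for k, and the toy data are all assumptions made here.

```python
import math


def fit_michaelis_menten(n, s, k_lo=1.0, k_hi=1e7):
    """Least-squares fit of S(n) = s_max * n / (k + n).

    For a fixed k the best s_max has a closed form, so only k is searched:
    a coarse scan over log(k), then golden-section refinement.
    Returns (s_max, k, sse), where sse is the sum of squared residuals.
    """
    def solve(log_k):
        k = math.exp(log_k)
        g = [ni / (k + ni) for ni in n]
        s_max = sum(si * gi for si, gi in zip(s, g)) / sum(gi * gi for gi in g)
        sse = sum((si - s_max * gi) ** 2 for si, gi in zip(s, g))
        return sse, s_max, k

    # Coarse scan to bracket the minimum.
    lo, hi = math.log(k_lo), math.log(k_hi)
    grid = [lo + i * (hi - lo) / 200 for i in range(201)]
    best = min(range(201), key=lambda i: solve(grid[i])[0])
    a, b = grid[max(best - 1, 0)], grid[min(best + 1, 200)]

    # Golden-section refinement inside the bracket.
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(100):
        c, d = b - phi * (b - a), a + phi * (b - a)
        if solve(c)[0] < solve(d)[0]:
            b = d
        else:
            a = c
    sse, s_max, k = solve((a + b) / 2)
    return s_max, k, sse


# Hypothetical rarefaction points generated from a known MM curve.
sizes = [100, 500, 1000, 3000, 13001]
otus = [8000 * x / (4000 + x) for x in sizes]
s_max, k, sse = fit_michaelis_menten(sizes, otus)
print(f"asymptote = {s_max:.0f}, k = {k:.0f}, SSE = {sse:.3g}")
```

The intercept and double MM variants add parameters to the same least-squares problem; comparing their residuals is what selects the best model.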
Rarefaction curve fitting
• Double MM was the model that gave the best fit (least residuals)
Model            Estimate   SE
MM               7,537      18
Exponential      4,866      67
MM intercept     8,943      121
Expl intercept   5,666      74
Double MM        11,913     61
Non parametric estimators
2 estimators are the most common:
– Chao. Uses the number of species observed in the dataset as well as number
of singletons (OTUs that occurred once) and doubletons.
– Abundance-coverage estimator (ACE). Divides the data into abundant and
rare species usually at a cutoff of 10.
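Both estimators can be written down compactly. The Python sketch below uses the standard published forms of Chao1 (classic and bias-corrected) and ACE with a rare cutoff of 10; the function names and toy counts are illustrative, and the exact variants used for KFS are not specified on this slide.

```python
from collections import Counter


def chao1(counts, bias_corrected=False):
    """Chao1 richness estimate from per-OTU clone counts."""
    s_obs = len(counts)
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    if bias_corrected or f2 == 0:
        return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
    return s_obs + f1 * f1 / (2 * f2)


def ace(counts, rare_cutoff=10):
    """Abundance-based coverage estimator (ACE)."""
    freq = Counter(counts)  # freq[i] = number of OTUs seen exactly i times
    s_abund = sum(1 for c in counts if c > rare_cutoff)
    s_rare = len(counts) - s_abund
    if s_rare == 0:
        return float(s_abund)
    n_rare = sum(c for c in counts if c <= rare_cutoff)
    f1 = freq[1]
    if f1 == n_rare:
        raise ValueError("all rare OTUs are singletons; ACE is undefined")
    coverage = 1 - f1 / n_rare
    spread = sum(i * (i - 1) * freq[i] for i in range(1, rare_cutoff + 1))
    gamma2 = max(s_rare / coverage * spread / (n_rare * (n_rare - 1)) - 1, 0.0)
    return s_abund + s_rare / coverage + f1 / coverage * gamma2


# Hypothetical toy data: clones observed per OTU.
toy = [1, 1, 1, 2, 2, 5, 20]
print(chao1(toy), chao1(toy, bias_corrected=True), ace(toy))
```

Both estimators depend heavily on the singleton count, which is why they keep rising as deeper sampling uncovers more rare OTUs.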
Species richness estimates in KFS
Method              Estimate   SE
Parametric models   15,009     722
dMM fits            11,913     61
Chao                8,654      210
ACE                 10,159     85
• Different methods predict different estimates of species richness
• Chao estimate is the lower bound
• Highest estimate found with parametric model
• Since the parametric models had the most controls, we expect the parametric
estimate to be the most accurate
Using a 13,001-clone library, we estimate 15,009 ± 722
species to be present in soil at the time of sampling.
Is this the true species richness?
• As the sample size increases, the species richness estimate also
increases.
• Is our sample size (13,001 clones) enough to predict the true
richness?
• We randomly sampled our dataset to construct 100-, 500-, 1000-,
and 3000-clone libraries. We treated each of them as a separate
dataset and estimated species richness for each dataset by all the
above methods.
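The subsampling step can be sketched as follows. This is a minimal Python illustration with hypothetical data: it assumes the full library is available as one OTU label per clone, and that clones are drawn without replacement.

```python
import random
from collections import Counter


def subsample_otu_counts(clone_otus, size, seed=None):
    """Draw `size` clones without replacement and return the clone count
    of each OTU present in the subsample, largest first."""
    rng = random.Random(seed)
    picked = rng.sample(clone_otus, size)
    return sorted(Counter(picked).values(), reverse=True)


# Hypothetical library: 1,000 clones spread evenly over 50 OTUs.
library = [f"otu{i % 50}" for i in range(1000)]

for size in (100, 500, 1000):
    counts = subsample_otu_counts(library, size, seed=1)
    print(f"{size:>5} clones -> {len(counts)} OTUs observed")
```

Each list of counts can then be passed to any of the estimators, treating the subsample as if it were an independent clone library.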
Species richness with different sample size
datasets
[Figure: species richness estimate (0-16000) vs. sample size (0-15000) for the chao1-bc, ACE, parametric, and dMM estimators]
• Regardless of the method used, the estimate increased with sample size
Towards a sample size-unbiased estimate of species richness
• A plot of species richness estimate (SRest) at different clone
library sizes (CLact) could be used.
• However, asymptote of SRest will occur at clone library sizes
that are orders of magnitude higher than the actual clone library
sizes used for plotting the curve, making asymptote
determination (and hence true SR determination) grossly
inaccurate.
• The theoretical clone library size required to encounter the
absolute majority of species richness (CLth), being much larger, is
better suited for SR determination, using a CLth vs. SRest plot.
CLth required to observe the absolute majority of SRest at different CLact
[Figure: theoretical clone library size (CLth-99, 0 to 4×10^6) vs. actual clone library size (CLact, 0-14000); data and Michaelis Menten fit]
• When CLth = CLact, the effort required to observe the absolute majority of species is met.
• An increase in CLact will not increase the CLth required to observe the absolute majority of SR.
• SRest will not increase upon further sampling, and so represents a sample size-unbiased estimate of
species richness.
• This CLth value was determined to be 6.3 × 10^6.
Sample size unbiased estimate of species richness
[Figure: species richness estimate (SRest, 0-16000) vs. theoretical clone library size (CLth-99, 0 to 4×10^6); data and Michaelis Menten fit]
• CLth-SRest plot: curve fitting suggested 17,230 as a sample size-unbiased
estimate of species richness.
• This value is 15% higher than the SRest determined using the 13,001-clone dataset.
Species richness estimates, conclusions
• Species richness estimates increase with dataset size.
Reported estimates are a fraction of the “true” richness.
• We propose an approach that provides a sample size-unbiased
estimate of species richness
• The approach suggested a species richness value of 17,230,
compared to 345-15,009 in libraries of 100-13,001 clones
Comparative diversity between different
phyla in soil
• All previous studies treated soil as a
single dataset.
• Soil has a fairly stable composition.
• We compared the diversity between
phyla that made up more than
3% of our dataset: Actinobacteria,
Acidobacteria, α-proteobacteria, δ-proteobacteria, Chloroflexi,
Verrucomicrobia, Bacteroidetes,
Planctomycetes, β-proteobacteria.
Comparative diversity indices
1. Single indices
– Shannon index is the most common
– Has been used for both macro- and micro-communities.
– Disadvantage: highly sensitive to sample size
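For reference, Shannon's index is H = -Σ p_i ln(p_i), where p_i is the fraction of clones belonging to OTU i. The short Python sketch below assumes natural logarithms (the log base is not stated on the slides) and uses hypothetical counts.

```python
import math


def shannon_index(counts):
    """Shannon's index H = -sum(p_i * ln(p_i)) from per-OTU clone counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Two hypothetical communities with the same number of OTUs:
even = [25, 25, 25, 25]   # evenly distributed
skewed = [97, 1, 1, 1]    # dominated by one OTU
print(shannon_index(even), shannon_index(skewed))
```

The index grows with both the number of OTUs and the evenness of their abundances, so a small library that misses rare OTUs gives a lower value than a deep one drawn from the same community.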
Shannon’s index for the major phyla in
soil
Group                  % Abundance   Shannon’s index   Ranking
Actinobacteria         24.5%         5.52              9
Planctomycetes         4.1%          5.27              8
Acidobacteria          20.3%         5.1               7
Chloroflexi            7.9%          4.97              6
Alpha proteobacteria   10.3%         4.8               5
Delta proteobacteria   9.6%          4.44              4
Bacteroidetes          4.6%          4                 3
Beta proteobacteria    3.9%          3.96              2
Verrucomicrobia        5.1%          3.91              1
Comparative diversity indices, cont.
2. Rarefaction curves
• Can be used to rank communities. The community whose rarefaction curve
lies above is the community with higher diversity.
[Figure: rarefaction curves of two communities, A and B, vs. number of clones sampled]
• Advantage: less sensitive to sample size than single indices
• Disadvantages
– If the 2 rarefaction curves cross, the communities cannot be ranked with regard
to diversity.
– Even if they do not cross with the current sample size, there is no guarantee they
will not cross as the sample size increases.
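A rarefaction curve can be computed analytically from per-OTU clone counts using the standard hypergeometric expectation, E[S_n] = Σ (1 - C(N - N_i, n) / C(N, n)). The Python sketch below shows that calculation and a simple crossing check; the function names and the two toy communities are hypothetical.

```python
from math import comb


def expected_otus(counts, n):
    """Expected number of OTUs in a random subsample of n clones."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must be between 0 and the library size")
    denom = comb(total, n)
    # comb(total - c, n) / denom is the chance that an OTU with c clones is missed.
    return sum(1 - comb(total - c, n) / denom for c in counts)


def curves_cross(counts_a, counts_b, sizes):
    """True if A lies above B at some sample sizes and below it at others."""
    diffs = [expected_otus(counts_a, n) - expected_otus(counts_b, n) for n in sizes]
    return any(d > 1e-9 for d in diffs) and any(d < -1e-9 for d in diffs)


# Hypothetical communities of 20 clones each:
a = [5, 5, 5, 5]              # few OTUs, evenly distributed
b = [14, 1, 1, 1, 1, 1, 1]    # more OTUs, one dominant
print(curves_cross(a, b, range(1, 21)))
```

Here A leads at small sample sizes because it is more even, but B overtakes it once its rare OTUs are sampled, which is the crossing case that leaves the ranking undecided.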
Rarefaction curves of the 9 major phyla
[Figure: rarefaction curves (# of OTUs vs. sample size) for acido, actino, alpha, bacter, beta, chloro, delta, plancto, and verruco, shown at full scale (sample size 0-3000) and zoomed in (sample size 0-600)]
Rarefaction ordering (9 = most diverse)
1. Verrucomicrobia
2. Beta proteobacteria
3. Bacteroidetes
4. Delta proteobacteria
5. Acidobacteria
6. Alpha proteobacteria
7. Chloroflexi
8. Actinobacteria
9. Planctomycetes
Shannon vs. rarefaction ordering (most to least diverse)
Shannon:     Actino, Plancto, Acido, Chloro, Alpha, Delta, Bacter, Beta, Verr
Rarefaction: Plancto, Actino, Chloro, Alpha, Acido, Delta, Bacter, Beta, Verr
Comparative diversity indices, cont.
3. Diversity profiling
– A potential solution to the problems with single diversity indices is offered
by the use of parametric families of diversity indices.
– There exist 3 different groups for diversity orderings. Each one of them
can be represented by more than one method.
– We compared the 9 major phyla using each and every method of diversity
profiling ( a total of 12 methods in 3 major groups) to come up with a
ranking of diversity.
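As an illustration of diversity profiling, the Python sketch below uses the Rényi entropy family, H_a = ln(Σ p_i^a) / (1 - a), which is one common information-based parametric family. The slides do not name the 12 methods used, so this choice, the function names, the set of a values, and the toy data are all assumptions.

```python
import math


def renyi_entropy(counts, alpha):
    """Rényi entropy of order alpha; alpha = 1 gives Shannon's index."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if abs(alpha - 1.0) < 1e-12:
        return -sum(pi * math.log(pi) for pi in p)
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)


def compare_profiles(counts_a, counts_b, alphas=(0, 0.5, 1, 2, 4, 8)):
    """Return 'A' or 'B' if one profile is never below the other,
    '=' if they coincide, or 'x' if the profiles cross (inconclusive)."""
    diffs = [renyi_entropy(counts_a, a) - renyi_entropy(counts_b, a) for a in alphas]
    a_above = any(d > 1e-12 for d in diffs)
    b_above = any(d < -1e-12 for d in diffs)
    if a_above and b_above:
        return "x"
    if a_above:
        return "A"
    if b_above:
        return "B"
    return "="


# Hypothetical communities:
print(compare_profiles([5, 5, 5, 5], [17, 1, 1, 1]))           # same richness, A more even
print(compare_profiles([5, 5, 5, 5], [14, 1, 1, 1, 1, 1, 1]))  # B richer, A more even
```

Low orders of a weight rare OTUs heavily and high orders weight dominant ones, so a profile that stays on top across the whole range is more diverse by every index in the family.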
Comparative diversity indices, cont.
[Table: pairwise comparisons of adjacent phyla (Plancto > Actino, Actino > Acido, Acido > Chloro, Chloro > α, α > δ, δ > β, β > Bacter, Bacter > Verr) under the three groups of diversity-ordering methods: Information (6 methods), Expected # of species (2 methods), and Intrinsic diversity (4 methods). x: inconclusive (profiles crossed)]
In diversity profiling we make a decision about any 2 phyla only if they are
comparable (their profiles do not cross) by at least 2 groups of methods
Diversity profile ranking:
Planctomycetes > Actinobacteria > Acidobacteria > Chloroflexi > α-proteobacteria > δ-proteobacteria > β-proteobacteria > Bacteroidetes > Verrucomicrobia
Summary of diversity rankings
Phyla                  Rarefaction ranking   Shannon ranking   Diversity profiling
Planctomycetes         9                     8                 9
Actinobacteria         8                     9                 8
Acidobacteria          5                     7                 7
Chloroflexi            7                     6                 6
Alpha-proteobacteria   6                     5                 5
Delta-proteobacteria   4                     4                 4
Beta-proteobacteria    2                     2                 3
Bacteroidetes          3                     3                 2
Verrucomicrobia        1                     1                 1
Diversity groupings shown: High, Moderate, Low
Ecological implications of
differential diversity
• More diverse phyla have a high OTU/clone ratio at different taxonomic cutoffs.
• More diverse phyla have more basal branches; less diverse phyla have more
peripheral branches.
• Evolutionary sweeps purge branches with similar niche/role in ecosystem
functioning. Basal branches that survive are essential for ecosystem functioning.
• Members of microdiverse clusters arise from neutral mutation, occupy the same
niche, and fulfill similar services to the ecosystem.
• More diverse phyla, with more basal branches, are more important to ecosystem
functioning than phyla with lower diversity.
Just a theory
[Figure: % of clones (0-100) vs. cutoff (80-100) for Planctomycetes and Verrucomicrobia]
S. Giovannoni. Nature 430:515-516 (2004)
Summary
• A near-complete 16S rRNA gene clone library was constructed from an
undisturbed tallgrass prairie soil (13,001 clones).
• To our knowledge this is the largest full length 16S rRNA gene clone
library from a single PCR reaction.
• Phylogenetic analysis identified 34 phyla and 3,747 species within the
dataset.
• The large sample size allowed the discovery of 5 new candidate phyla.
• The rare biosphere in Kessler farm soil is phylogenetically diverse,
harbors novel lineages at all taxonomic levels, and is more novel than
abundant clones in KFS.
• Rare biosphere is a mixture of both unique species and species closely
related to abundant soil microorganisms.
Summary, cont.
• The distribution of species in KFS soil follows a mixture of 2
exponentials-mixed Poisson.
• Parametric species estimates suggested a species richness of 15,009 at
the time of sampling. A sample size-unbiased approach suggested
17,230 species.
• Differential diversity studies were conducted within the community as
opposed to between communities. Some of the methods used for
differential diversity are new to microbial ecology (diversity profiling).
• Of the nine major phyla in Kessler Farm Soil, Planctomycetes had the
highest diversity and the highest percentage of rare species;
Verrucomicrobia had the lowest diversity.
Acknowledgments
Noha Youssef, James Davis
Lee Krumholz, Anne Spain, Cody Sheik
Bruce Roe, Fares Najar, Leonid Sukharnikov
David Bruce, Kerrie Barry
Vanessa Bailey
OSU administration for keeping us homeless
for 8 months
Current and Future plans
I. Explore the diversity, dynamics, and ecological roles
of the rare biosphere in multiple anaerobic habitats.
• Combine pyrosequencing and capillary sequencing to identify extremely
rare microorganisms.
• Double, triple, or quadruple the number of known bacterial phyla.
• More accurate estimates of species richness.
• Global patterns of differential diversity between various bacterial phyla.
II. Quantification, visualization, and metagenomics of
rare (and abundant) candidate phyla.
Capillary sequencing Vs Pyrosequencing in culture
independent community analysis
• Short fragments (100 bp) Vs long fragments.
• Pyrosequencing generates a much larger dataset in a
single batch, more cost effective.
• However,
– Cannot satisfactorily document the presence of novel lineages
– Unable to classify sequences with low similarity to database
sequences
– OTU assignment does not coincide with long sequencing
fragments.