Taxonomic distribution of Large DNA viruses in the sea


Adam Monier, Jean-Michel Claverie & Hiroyuki Ogata
Genome Biology 2008, 9:R106
Virus
 A small infectious agent that can replicate only inside the living cells of other organisms.
 Infects all types of organisms: animals, plants, bacteria and archaea.
 Found in almost every ecosystem on Earth.
 The most abundant type of biological entity.
 Consists of two or three parts:
 DNA or RNA (genetic information)
 A protein capsid (protects the genetic material)
 In some cases, an envelope
Viruses in marine systems
 Abundant in marine systems: 10^6 to 10^9 virus-like particles per milliliter of sea water
 Infect marine organisms from oxygen-producing phytoplankton to whales
 Regulate the population of many sea organisms and are important effectors of global biogeochemical fluxes
 Hold great genetic diversity
 May significantly contribute to the evolution of microorganisms in marine ecosystems.
 A quantitative description of the marine virosphere
 The determination of the relative abundance of virus
families
 The assessment of the level of their genetic diversity.
Data set
 The first phase of the Sorcerer II Global Ocean Sampling (GOS) Expedition
 The GOS data comprise a large environmental shotgun sequence collection, with 7.7 million sequencing reads assembled into 4.9 billion bp contigs
 At least 3% of the predicted proteins contained within the GOS data are of viral origin
 Most DNA samples were extracted from the 0.1-0.8 μm size fraction
Methods for determining taxonomic distribution
 ‘Binning’ is the first step in analyzing microbial populations in metagenomic sequences
 Drawbacks of using homology search programs for binning:
 BLAST scores are highly sensitive to alignment sizes and to insertions/deletions
 It is difficult to infer evolutionary distances among high-scoring hits from the BLAST scores alone.
Phylogenetic analysis
 Phylogenetic analysis is the process used to determine the evolutionary relationships between organisms.
 The results of an analysis can be drawn as a hierarchical diagram called a phylogenetic tree.
 Branches are based on the hypothesized evolutionary relationships between organisms. All members of a branch are assumed to be descended from a common ancestor.
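As a small illustration of the tree idea, the sketch below stores a tree as nested tuples and lists the group of leaves under each internal node, i.e. the members assumed to share a common ancestor. The four taxon names and the topology are hypothetical toy values chosen for readability; they are not the tree from the paper.

```python
# Hypothetical toy tree: tuples are internal nodes, strings are leaves.
# The topology is illustrative only, not a result from the paper.
TOY_TREE = (("Mimivirus", "Chlorovirus"), ("Cyanophage", "Bacterium"))


def leaves(node):
    """Return all leaf names below a node, left to right."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


def clades(node):
    """Return the leaf set under every internal node of the tree."""
    if isinstance(node, str):
        return []
    groups = [frozenset(leaves(node))]
    for child in node:
        groups.extend(clades(child))
    return groups


def to_newick(node):
    """Serialize the nested-tuple tree in Newick notation (no branch lengths)."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


if __name__ == "__main__":
    print(to_newick(TOY_TREE) + ";")
    for group in clades(TOY_TREE):
        print(sorted(group))
```

Newick is the usual text format for exchanging trees between phylogeny programs, which is why it is used here for printing.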
B-family DNA polymerase (PolB)
 A DNA polymerase is an enzyme that catalyzes the polymerization of deoxyribonucleotides into a DNA strand during replication.
 PolB sequences are conserved in all known members of the nucleocytoplasmic large DNA viruses (NCLDVs).
 The presence of PolB homologs in bacteria is limited.
 PolB genes show strong sequence conservation and an apparently low frequency of recent horizontal transfer.
 PolB is therefore a useful marker for examining the taxonomic distribution of large DNA viruses in a metagenomic sequence collection.
Limitations of standard phylogenetic methods
 Environmental shotgun sequences are short.
 They vary widely in size and correspond to different parts of a selected marker gene.
 Standard phylogenetic analysis therefore does not provide an appropriate alignment.
Phylogenetic mapping
 A new phylogeny-based method developed by the authors
 Analyzes individual metagenomic sequences one by one
 Determines their phylogenetic positions using a reference multiple sequence alignment (MSA) and a reference tree
This paper…
 Examines the taxonomic richness and the relative abundance of different large DNA viruses in marine environments
 Analyzes the GOS data set by phylogenetic mapping
 Uses PolB sequences as the reference marker
Results
 Phylogenetic mapping
 Validation of the mapping results using long PolB
fragments
 Comparison of the abundance of viral PolB genes with
the bacterial ones
 Geographic distributions of viral PolBs
 Examination of additional ORFs
1. Phylogenetic mapping
 Step 1: identification of PolB fragments
 Step 2: generation of a reference MSA and a maximum likelihood tree
 Step 3: examination of the PolB fragments’ phylogenetic positions
Step 1: Identification of PolB fragments
 Searched the GOS data set for PolB-like sequences using the Pfam hidden Markov profile (PF00136).
 This yielded a set of 1,947 sequences, referred to as ‘PolB fragments’.
Step 2: Reference MSA and maximum likelihood tree
 Collected PolB homologs from known organisms
 Built a reference MSA corresponding to the polymerase domains of the PolB homologs (contains 101 sequences)
 Generated a maximum likelihood tree
Step 3: Examination of PolB fragments’ phylogenetic position
 Reduce the reference MSA (51 representatives) and the reference tree (99 branches).
 Conserve the original topology of the full reference tree.
 Align each of the PolB fragments on the reference MSA using the T-Coffee profile method.
 Compute the likelihoods for all 99 possible branching positions with ProtML.
 Assess the statistical significance of the best tree by the RELL bootstrap method.
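The two numeric ingredients of Step 3 can be sketched in a few lines. An unrooted, fully bifurcating tree with n leaves has 2n - 3 branches, which is consistent with 51 representatives giving 99 candidate positions. RELL (resampling of estimated log-likelihoods) then resamples per-site log-likelihoods instead of re-estimating each tree. The sketch below is a minimal illustration, not the authors' pipeline: the function names are made up here, and the per-site log-likelihood matrix is assumed to come from an external program such as ProtML.

```python
import random


def n_branches_unrooted(n_leaves):
    """Branches in an unrooted, fully bifurcating tree with n_leaves tips."""
    return 2 * n_leaves - 3


def rell_support(site_loglik, n_replicates=1000, seed=0):
    """RELL bootstrap proportions for a set of candidate trees.

    site_loglik[t][s] is the log-likelihood of alignment site s under
    candidate tree t (assumed to be precomputed elsewhere).
    Returns, for each tree, the fraction of replicates in which it has
    the highest resampled total log-likelihood.
    """
    rng = random.Random(seed)
    n_trees = len(site_loglik)
    n_sites = len(site_loglik[0])
    wins = [0] * n_trees
    for _ in range(n_replicates):
        # Resample alignment sites with replacement.
        sample = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = [sum(row[s] for s in sample) for row in site_loglik]
        best = max(range(n_trees), key=totals.__getitem__)
        wins[best] += 1
    return [w / n_replicates for w in wins]


if __name__ == "__main__":
    print(n_branches_unrooted(51))
    # Hypothetical toy matrix: 3 candidate positions, 20 sites.
    toy = [[-1.0] * 20, [-2.0] * 20, [-3.0] * 20]
    support = rell_support(toy)
    # Keep a placement only if its support reaches the 75% threshold.
    print([s >= 0.75 for s in support])
```

In the mapping setting, each candidate tree corresponds to the reference tree with the fragment attached to one of the 99 branches.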
Taxonomic distribution of the GOS PolB fragments
 Assign the best branching position for 1,423 PolB fragments
 1,224 (86%) were mapped on viral branches
 869 were supported by RELL (bootstrap value ≥ 75%)
 811 were on viral branches
[Figure: reference tree with mapped PolB fragments; labeled groups: Chloroviruses, Mimiviruses, Phages]
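The proportions behind these counts can be checked with simple arithmetic. The sketch below uses only the counts quoted in these slides (1,947 PolB fragments, 1,423 mapped, 1,224 of those viral, 869 RELL-supported, 811 of those viral); the helper name `percent` is introduced here for illustration.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Counts quoted in the slides.
ALL_FRAGMENTS = 1947
MAPPED = 1423
MAPPED_VIRAL = 1224
SUPPORTED = 869
SUPPORTED_VIRAL = 811

if __name__ == "__main__":
    print(f"mapped fragments:           {percent(MAPPED, ALL_FRAGMENTS):.1f}%")
    print(f"viral among mapped:         {percent(MAPPED_VIRAL, MAPPED):.1f}%")
    print(f"RELL-supported among mapped: {percent(SUPPORTED, MAPPED):.1f}%")
    print(f"viral among RELL-supported: {percent(SUPPORTED_VIRAL, SUPPORTED):.1f}%")
```

The viral share stays high, and rises slightly, when only the RELL-supported placements are kept.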
2. Validation of the mapping results using long PolB fragments
 Examined the phylogenetic mapping result and the sequence diversity of the PolB fragments classified in large eukaryotic virus groups (NCLDVs).
 Built a single alignment of the selected long PolB fragments together with the reference PolB sequences from large eukaryotic virus groups.
3. Comparison of the abundance of
viral PolB genes with the bacterial
ones
 Read coverage was used to measure the abundance of
the cognate DNA molecules.
 Compute the read coverage of each contig harboring a
PolB fragment
 Obtain the median of the read coverage values for each
branch
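The two steps above can be sketched as follows, assuming a read-to-contig assignment and a contig-to-branch mapping are already available. All identifiers and data values below are hypothetical toy inputs, not GOS data.

```python
from collections import Counter, defaultdict
from statistics import median


def read_coverage(read_to_contig):
    """Count how many reads contribute to each contig."""
    return Counter(read_to_contig.values())


def median_coverage_per_branch(coverage, contig_to_branch):
    """Median read coverage of the contigs mapped on each branch."""
    per_branch = defaultdict(list)
    for contig, branch in contig_to_branch.items():
        per_branch[branch].append(coverage[contig])
    return {branch: median(values) for branch, values in per_branch.items()}


# Hypothetical toy inputs.
READ_TO_CONTIG = {
    "r1": "c1",
    "r2": "c2", "r3": "c2",
    "r4": "c3",
    "r5": "c4", "r6": "c4", "r7": "c4", "r8": "c4", "r9": "c4",
}
CONTIG_TO_BRANCH = {"c1": "viral", "c2": "viral", "c3": "viral", "c4": "bacterial"}

if __name__ == "__main__":
    cov = read_coverage(READ_TO_CONTIG)
    print(median_coverage_per_branch(cov, CONTIG_TO_BRANCH))
```

The median is used rather than the mean so that a single deeply covered contig does not dominate a branch's value.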
Viral PolBs are more diverse than bacterial PolBs
 Viral branches: a large number of mapped contigs exhibiting a low coverage.
 Bacterial branches: a lower number of mapped contigs with a larger read coverage.
 Virus populations are numerous and very diverse.
4. Geographic distributions of viral PolBs
 Compare the relative abundance of the predicted viral PolB fragments with the associated metadata across different GOS sampling sites
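One simple way to make sites comparable is to normalize the counts within each site. The sketch below computes, per site, the fraction of PolB fragments assigned to each virus group. The site names, group labels and counts are hypothetical, and the paper's actual normalization may differ.

```python
from collections import Counter


def relative_abundance(assignments):
    """Per-site fraction of fragments assigned to each group.

    assignments: iterable of (site, group) pairs, one per PolB fragment.
    Returns {site: {group: fraction}}.
    """
    counts = {}
    for site, group in assignments:
        counts.setdefault(site, Counter())[group] += 1
    return {
        site: {group: n / sum(c.values()) for group, n in c.items()}
        for site, c in counts.items()
    }


# Hypothetical toy assignments.
TOY_ASSIGNMENTS = [
    ("site_A", "Mimiviridae"), ("site_A", "Mimiviridae"),
    ("site_A", "Mimiviridae"), ("site_A", "phage"),
    ("site_B", "phage"), ("site_B", "phage"),
]

if __name__ == "__main__":
    for site, fractions in relative_abundance(TOY_ASSIGNMENTS).items():
        print(site, fractions)
```

The resulting per-site fractions can then be set against site metadata such as temperature or depth.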
Geographic localization
5. Examination of additional ORFs
 Searched the putative viral contigs against NRDB by BLASTX
 Found ‘virus-specific’ genes next to the PolB homologs
 OtV5 putative major capsid gene [chlorovirus group branch]
 regA (translation repressor of early genes) or uvsX (recA-like recombination and DNA repair protein) genes [cyanophage P-SSM4 branch]
Prediction of ‘new’ viral genes
 An ORF similar to RimK (a protein involved in post-translational modification of the ribosomal protein S6) was found on the cyanophage P-SSM4 branch.
 No rimK homolog has previously been found in a viral genome.
 Used this viral RimK homolog as a TBLASTN query to screen the entire GOS data set.
GOS contigs with putative RimK
sequences
 Identify more than 100 contigs harboring RimK
homologs with higher similarities than those exhibited
by cellular homologs in NRDB.
 Many of these contigs have additional ORFs usually
specific to phages.
Maximum likelihood tree of RimK sequences
 The RimK homologs are closely related to each other and distantly related to bacterial RimK.
 This suggests the existence of phages carrying rimK homologs in marine environments: a ‘new’ viral gene.
Conclusion
 The phylogenetic mapping approach provided a comprehensive picture of the taxonomic distribution of large viruses contained in the GOS metagenomic data.
 The highest genetic richness corresponded to phages.
 The Mimiviridae represent a major and ubiquitous component of large eukaryotic DNA viruses in diverse marine environments.
 Prediction of ‘new’ viral genes
Pfam
 Pfam is a large collection of protein families,
represented by multiple sequence alignments and
hidden Markov models (HMMs)
T-Coffee
 A multiple sequence alignment program.
 Compares all the sequences two by two, producing a global alignment and a series of local alignments.
 Then combines all these alignments into a multiple alignment.
 Can also combine results obtained with several alignment methods.
 T-Coffee combines all that information and produces a new multiple sequence alignment having the best agreement with all these methods.
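To make the "two by two" comparison concrete, the sketch below scores every pair of sequences with a basic global alignment (Needleman-Wunsch dynamic programming). This is a generic illustration of pairwise global alignment, not T-Coffee's implementation: the scoring values (match +1, mismatch -1, gap -1) and the toy sequences are assumptions, and T-Coffee additionally uses local alignments and a consistency-based weighting step that is not shown.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    # prev[j] holds the best score for a[:i-1] aligned against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def all_pairwise_scores(sequences):
    """Score every unordered pair of named sequences."""
    return {
        (name_a, name_b): global_alignment_score(seq_a, seq_b)
        for (name_a, seq_a), (name_b, seq_b) in combinations(sequences.items(), 2)
    }


if __name__ == "__main__":
    # Hypothetical toy sequences.
    toy = {"s1": "GATTACA", "s2": "GATTACA", "s3": "GATACA"}
    for pair, score in all_pairwise_scores(toy).items():
        print(pair, score)
```

A progressive aligner would use such pairwise information to decide which sequences to align first and how to merge them into one multiple alignment.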
ProtML
 Maximum likelihood inference of protein phylogeny
 Developed by Felsenstein
 Implements the maximum likelihood method for protein amino acid sequences. It uses either the Jones-Taylor-Thornton or the Dayhoff probability model of change between amino acids.
 Uses a hidden Markov model (HMM) method of inferring different rates of evolution at different amino acid positions.
Read coverage
 Read coverage of a contig is the number of reads that
contribute to the contig consensus.