CAMERA -- A Metagenomics Resource for Marine - C-MORE

CAMERA
A Metagenomics Resource for Marine
Microbial Ecology
July 27, 2007
Paul Gilna
UCSD/Calit2
Saul A. Kravitz
J. Craig Venter Institute
Acknowledgements
• UCSD/Calit2
- Larry Smarr, PI; Paul Gilna, Executive Director
- Phil Papadopoulos, Technical Lead
- Weizhong Li
• JCVI
- Marv Frazier, co-PI
- Leonid Kagan, Architect; Jennifer Wortman, Bioinformatics
- Rekha Seshadri, Outreach and Training
- Doug Rusch, Shibu Yooseph, Aaron Halpern, Granger Sutton
• UC Davis
- Jonathan Eisen, co-investigator
• Gordon and Betty Moore Foundation
- David Kingsbury and Mary Maxon
Outline
• New Discipline of Metagenomics
• Global Ocean Sampling Expedition
• Challenges of Metagenomic Data
• CAMERA Features
• CAMERA Usage to Date
• Cyberinfrastructure
Genomics vs Metagenomics
• Genomics – ‘Old School’
- Study of an organism's genome
- Genome sequence determined using shotgun
sequencing and assembly
- ~1300 microbes sequenced, first in 1995
- DNA usually obtained from pure cultures
• Metagenomics
- Application of genome sequencing methods to
environmental samples (no culturing)
- Environmental shotgun sequencing is the most widely
used approach
Metagenomic Questions
• Within an environment
- What biological functions are present (absent)?
- What organisms are present (absent)?
• Compare data from (dis)similar environments
- What are the fundamental rules of microbial ecology?
• Search for novel proteins and protein families
Metagenomics Applications
• Marine Ecology and Microbiology
• Alternative Energy and Industrial
- Hypersaline ponds, Oceans
- Termite Metabolism
• Medical Applications
- Microbial Ecology of Human body cavities and fluids
• Agricultural
- Disease Vector Metabolism (Glassy-winged Sharpshooter)
- Soil Ecology
• Environmental Remediation
- DOE: Acid Mine Drainage, Chemical and Radioactive Waste
Metadata
• Metagenomics
- Genomics + Metadata
• Environmental Metadata
- Time and location (lat, long, depth)
of sample collection
- Correlate w/remote sensing data
- Physico-chemical properties (e.g.
temperature, salinity)
MODIS-Aqua satellite image of ocean
chlorophyll in the Sargasso Sea grid about
the BATS site from 22 February 2003
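The environmental metadata listed on this slide can be pictured as one record per sample. The following is only a minimal sketch: the class name, field names, units, and all sample values are illustrative assumptions, not CAMERA's actual schema or GOS data.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SampleMetadata:
    """Hypothetical per-sample record; not CAMERA's real schema."""
    sample_id: str
    collected_at: datetime
    latitude: float       # decimal degrees, north positive
    longitude: float      # decimal degrees, east positive
    depth_m: float        # metres below the surface
    temperature_c: float  # degrees Celsius
    salinity_psu: float   # practical salinity units


def in_bounding_box(sample, lat_min, lat_max, lon_min, lon_max):
    """True if the sample was collected inside the given lat/long box."""
    return (lat_min <= sample.latitude <= lat_max
            and lon_min <= sample.longitude <= lon_max)


# Made-up values, for illustration only.
samples = [
    SampleMetadata("demo-01", datetime(2003, 2, 22, 12, 0), 31.5, -64.0, 5.0, 20.0, 36.5),
    SampleMetadata("demo-02", datetime(2003, 2, 23, 12, 0), 10.0, -80.0, 2.0, 28.0, 34.0),
]

# Select samples inside a box, e.g. to line them up with a satellite image tile.
in_box = [s.sample_id for s in samples if in_bounding_box(s, 30.0, 33.0, -66.0, -62.0)]
print(in_box)
```

Keeping time and position as typed fields is what makes it possible to correlate a sample with remote sensing data for the same place and date.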
JCVI Global Ocean Sampling Expedition
Largest Metagenomic Study to Date
Global Ocean Sampling (GOS)
178 Total Sampling Locations
Phase 1: 41 samples, 7.7M reads, >6M proteins
Diverse Environments
Open ocean, estuary, embayment, upwelling, fringing reef, atoll, warm seep,
mangrove, fresh water, biofilms, sediments, soils
GOS Protein Analysis
Yooseph et al (PLoS 2007)
• Novel clustering process
- Sequence similarity based
- Predict proteins and group into related clusters
- Include GOS and all known proteins
• Findings
- GOS proteins cover ~all existing prokaryotic families
- GOS expands diversity of known protein families
- 1700 large novel clusters with no homology to known protein families
- Higher than expected proportion of novel clusters are viral
- No saturation in the rate of novel protein family discovery
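The idea of grouping proteins into clusters by sequence similarity can be illustrated with a toy sketch. This is not the clustering process of Yooseph et al: it swaps in a simple greedy scheme based on shared 3-mers (Jaccard similarity), and the function names, threshold, and sequences are all assumptions chosen for illustration.

```python
def kmers(seq, k=3):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(seqs, threshold=0.5, k=3):
    """Assign each sequence to the first cluster whose representative
    (its first member) is similar enough; otherwise start a new cluster."""
    reps = []  # list of (representative k-mer set, member names)
    for name, seq in seqs.items():
        ks = kmers(seq, k)
        for rep_ks, members in reps:
            if jaccard(ks, rep_ks) >= threshold:
                members.append(name)
                break
        else:
            reps.append((ks, [name]))
    return [members for _, members in reps]


# Made-up peptide strings: p1 and p2 differ in one residue, p3 is unrelated.
toy = {
    "p1": "MKTAYIAKQRQISFVK",
    "p2": "MKTAYIAKQRQISFVR",
    "p3": "GSHMLEDPVVAAWWNN",
}
clusters = greedy_cluster(toy)
print(clusters)
```

A cluster that ends up containing only new environmental sequences and no known protein corresponds, in this toy picture, to what the slide calls a novel cluster.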
Added Diversity
[Figure: two panels, Rubisco homologs and UVDE homologs. Organism labels: H. marismortui, D. radiodurans, D. psychrophila, T. thermophilus, B. halodurans, B. anthracis. Legend: GOS eukaryotes, GOS prokaryotes, GOS viral, Known eukaryotes, Known prokaryotes, Known viral]
Rate of Protein Discovery
[Figure: Rate of discovery. x-axis: Number of sequences (millions), 0 to 7. y-axis: Number of clusters (thousands), 0 to 250. One curve per minimum cluster size: >=3, >=5, >=10, >=20]
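A discovery curve of this kind counts, as more sequences are added, how many clusters have reached a given minimum size. The sketch below shows one way such a curve could be computed. It assumes each sequence already carries a cluster label and that sequences arrive in a fixed order; the function name and toy labels are hypothetical and this is not the procedure used for the published plot.

```python
from collections import Counter


def discovery_curve(cluster_ids, min_size, step):
    """Return (sequences seen, clusters with >= min_size members) every `step` sequences."""
    sizes = Counter()
    qualifying = 0
    curve = []
    for n, cid in enumerate(cluster_ids, start=1):
        sizes[cid] += 1
        if sizes[cid] == min_size:  # cluster has just reached the size cutoff
            qualifying += 1
        if n % step == 0:
            curve.append((n, qualifying))
    return curve


# Toy stream of cluster labels, one per sequence, in sampling order.
labels = ["a", "a", "b", "a", "b", "c", "c", "d"]
curve = discovery_curve(labels, min_size=2, step=2)
print(curve)
```

If the curve keeps climbing as sequences are added, discovery has not saturated, which is the behaviour the slide reports for the GOS data.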
Fragment Recruitment Viewer
Rusch et al, PLoS 3/2007
[Figure: fragment recruitment plot. x-axis: Reference Genome Coordinates. y-axis: Percent Identity (tick labels 50%, 55%, 100%). Annotations: sequence absent from most strains (phage/other lateral transfer?); "core" genome, ~75% identical; ribosomal operon]
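A fragment recruitment plot places each metagenomic read at the reference coordinates it aligns to, at a height given by its percent identity. The sketch below assumes alignments have already been computed by some external tool. The `Hit` fields, the 50% cutoff, and the window size are illustrative assumptions, not the viewer's actual data model.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical alignment of one read to the reference genome."""
    read_id: str
    ref_start: int  # 0-based, inclusive
    ref_end: int    # exclusive, must be greater than ref_start
    matches: int    # identical positions in the alignment
    aln_len: int    # total alignment length


def recruitment_points(hits, min_identity=50.0):
    """(ref_start, ref_end, percent identity) for hits at or above the cutoff."""
    points = []
    for h in hits:
        pct = 100.0 * h.matches / h.aln_len
        if pct >= min_identity:
            points.append((h.ref_start, h.ref_end, pct))
    return points


def uncovered_windows(points, ref_len, window):
    """Reference windows touched by no recruited read."""
    covered = [False] * ((ref_len + window - 1) // window)
    for start, end, _ in points:
        for w in range(start // window, (end - 1) // window + 1):
            covered[w] = True
    return [(i * window, min((i + 1) * window, ref_len))
            for i, hit in enumerate(covered) if not hit]


# Made-up hits against a 4000-base reference.
hits = [
    Hit("r1", 0, 800, 760, 800),      # 95% identical
    Hit("r2", 900, 1700, 600, 800),   # 75% identical
    Hit("r3", 3100, 3900, 300, 800),  # 37.5% identical, below the cutoff
]
points = recruitment_points(hits)
gaps = uncovered_windows(points, ref_len=4000, window=1000)
print(points, gaps)
```

Windows that recruit no reads are the kind of gap the slide annotates as sequence absent from most strains.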
Why CAMERA?
• Public repositories not focused on
environmental metagenomics
- Sargasso Sea data underutilized by community
• Millions of dollars invested in sequencing and analysis, but
the data are accessible only to the bioinformatics elite
• Release of GOS dataset in March 2007
• Comply with Convention on Biological Diversity
CAMERA – http://camera.calit2.net
• “Convenient acronym for cumbersome name…”
- Henry Nicholls, PLoS Biology
• Mission
- Enable Research in Marine Microbiology
• CAMERA Partners:
Challenges
• Enormous datasets with high gene density
- large compute resources required
- 2 orders of magnitude jump
• Fragmentary data
- inadequate bioinformatics tools for assembly,
annotation, analysis, visualization
• Metadata standards non-existent
- metadata absent from databases
- Lack of standards impedes collection of datasets
• Diversity of User Sophistication and Needs
CAMERA Services
• Maintain searchable sequence collections
- ALL metagenomic sequence reads, assemblies
- Non-identical amino acid collection (extended NRAA)
- Viral, Fungal, pico-Eukaryotes, Microbial
- CAMERA protein clusters
• Metagenomics data easily downloadable
• Interactive and Batch Search Facility
- Scalable parallel implementations of BLAST
- Integrated with associated metadata
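Two of the services above lend themselves to a small illustration: building a non-identical sequence collection, and splitting a query set into batches that independent search jobs could process in parallel. The sketch below is a simplified assumption of how that might look. It does not call BLAST, and the function names and FASTA content are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def non_identical(records):
    """Keep the first record for each distinct sequence string."""
    seen = set()
    unique = []
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq))
    return unique


def batches(records, size):
    """Split records into consecutive batches of at most `size` entries."""
    return [records[i:i + size] for i in range(0, len(records), size)]


# Made-up FASTA input: records "a" and "c" carry the same sequence.
fasta = ">a\nMKT\nAYI\n>b\nGSH\n>c\nMKTAYI\n>d\nPLV\n"
unique = non_identical(parse_fasta(fasta))
jobs = batches(unique, size=2)
print([[h for h, _ in job] for job in jobs])
```

Each batch could then be handed to a separate search process, which is one common way to scale a sequence search across many compute nodes.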
Distinctive Features Set in Progress
• Graphical Tools for Visualizing Diversity
- Based on Rusch et al
- Fragment recruitment viewer
• CAMERA Protein Clusters
- Based on Yooseph et al
- Incremental version implemented in 2007
• Annotation
- Break through quadratic complexity via clusters
- Phyletic Classification
• Overviews of sequence collections
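The note about breaking through quadratic complexity via clusters can be read as follows (an interpretation, not something the slide spells out): comparing every protein with every other grows with the square of the collection size, whereas comparing each protein only with one representative per cluster grows with the number of clusters. The arithmetic sketch below uses 6 million proteins, matching the ">6M proteins" figure for GOS Phase 1, and a purely hypothetical count of 100,000 clusters.

```python
def all_vs_all_pairs(n):
    """Number of unordered pairs in an all-against-all comparison."""
    return n * (n - 1) // 2


def vs_representatives(n, n_clusters):
    """Comparisons when each sequence is checked against one representative per cluster."""
    return n * n_clusters


n_proteins = 6_000_000  # order of magnitude from the GOS Phase 1 slide
n_clusters = 100_000    # hypothetical value, for illustration only

pairwise = all_vs_all_pairs(n_proteins)
clustered = vs_representatives(n_proteins, n_clusters)
print(pairwise, clustered, pairwise / clustered)
```

Under these assumed numbers the representative-based scheme needs roughly thirty times fewer comparisons, and the gap widens as the collection grows faster than the number of clusters.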
Fragment Recruitment Viewer
Metagenomic Sequence
vs
Reference Sequence
• Highlight and Select with
Associated Metadata
• View large datasets
• AJAX interface
Based on Doug Rusch’s Viewer