
Efforts to Link Ecological Metadata with Bacterial Gene Sequences at the Sapelo Island Microbial Observatory
Wade M. Sheldon
Mary Ann Moran
James T. Hollibaugh
Genetic Sequence Databases
Major informatics success story
Large repositories for nucleotide sequences (e.g. GenBank/EMBL/DDBJ, ~16M)
Automated and web-based data submission required as part of publication process
Standardized alignment/search tools support use for classification
Numerous ‘environmental sequences’: ecologists now using them to study biogeography, community structure, eco-physiology
Problems with GenBank
Metadata voluntary and limited in scope: title (definition), authors, keywords, comments, literature citation
Many sequences unpublished, undescribed
Quality control standards poorly enforced
No direct way to provide links to ancillary data (URLs not officially supported, often removed)
Very inefficient and often impossible for investigators to obtain ecological context information, even from journals
Comparisons of matched taxa by traits not possible
Consequence:
  Tremendous amount of bacterial sequence data relevant to microbial ecologists
  No established interface
Example – Insufficient Metadata
Sapelo Island Microbial Observatory (http://simo.marsci.uga.edu)
MObs: NSF-funded network of sites or "microbial observatories" established to discover novel microorganisms, microbial consortia, communities, activities and other novel properties, and to study their roles in diverse environments
Projects supported are expected to establish or participate in an established, Internet-accessible knowledge network to disseminate the information resulting from these activities
SIMO: investigating the diversity of prokaryotes, their physiological and genetic characteristics, and their biogeochemical activities in a salt marsh/estuarine ecosystem in the southeastern U.S.
Knowledge networks:
  GenBank
  GCE-LTER IS
  SIMO 16S rRNA Database
SIMO 16S rRNA Database
Purpose: LIMS, research tool, data dissemination
Designed to store sequence data and all supporting SIMO research information
Hierarchical structure modeled after research workflow
Metadata on site geography, sample collection, all methodology, personnel, ancillary measurements
Extensive content control, error checking
Links to information in external databases (RDP II, GenBank, GCE-LTER)
Queries by phylogenetic and/or ecological characteristics
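A combined phylogenetic/ecological query of the kind listed above might look like the following minimal sketch. It assumes a single simplified table; the table name, column names, and all values are hypothetical placeholders and do not reflect the actual SIMO schema or data.

```python
import sqlite3

# Hypothetical, simplified table: one row per sequence, with a phylogenetic
# term and an ecological term drawn from controlled vocabularies.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequences (
    seq_id      INTEGER PRIMARY KEY,
    phylo_group TEXT,
    habitat     TEXT
);
INSERT INTO sequences VALUES
    (1, 'group A', 'marsh sediment'),
    (2, 'group A', 'tidal creek water'),
    (3, 'group B', 'marsh sediment');
""")


def find_sequences(conn, phylo_group=None, habitat=None):
    """Return sequence IDs matching any combination of the two filters."""
    clauses, params = [], []
    if phylo_group is not None:
        clauses.append("phylo_group = ?")
        params.append(phylo_group)
    if habitat is not None:
        clauses.append("habitat = ?")
        params.append(habitat)
    sql = "SELECT seq_id FROM sequences"
    if clauses:
        # Filters are combined with AND; values are bound as parameters.
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY seq_id"
    return [row[0] for row in conn.execute(sql, params)]


by_group = find_sequences(conn, phylo_group="group A")
by_both = find_sequences(conn, phylo_group="group A", habitat="marsh sediment")
```

Because the filter values come from fixed lists rather than free text, an exact-match comparison is sufficient in this sketch.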
Conceptual Diagram of the SIMO Database
[Diagram not reproduced. Labels recovered: Metadata, Primary Data, Secondary/External Data; Environment, Study Site, Samples, Ancillary Data, Other Analyses, Organisms, Sequences, Phylogenetic Groups, Phylogenetic Comparisons, Methodology (repeated); GCE-LTER, RDP II, GenBank.]
List-based data entry linked to metadata tables
Controlled vocabulary supports finely-targeted queries
Automatic hyperlinks provide links to tasks
List-based queries also simplify public interface
Phylogenetic and ecological characteristics combined dynamically to create overview and query interface
SIMO Metadata
Metadata primarily stored in managed lists, linked to records by foreign key fields
Scalable design: details can be added independently without altering data records
Complete metadata for sequences generated by relational joins
Links to external metadata in GCE-LTER database add site geography, research history, long-term environmental characteristics
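The managed-list design described above can be illustrated with a minimal sketch using an in-memory SQLite database. All table names, column names, and values here are hypothetical placeholders chosen for illustration; the actual SIMO schema is not shown in this presentation.

```python
import sqlite3

# Hypothetical schema: metadata live in list tables, and data records
# point at them through foreign key fields.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sites   (site_id INTEGER PRIMARY KEY, name TEXT, habitat TEXT);
CREATE TABLE methods (method_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE samples (
    sample_id INTEGER PRIMARY KEY,
    site_id   INTEGER REFERENCES sites(site_id),
    collected TEXT
);
CREATE TABLE sequences (
    seq_id    INTEGER PRIMARY KEY,
    sample_id INTEGER REFERENCES samples(sample_id),
    method_id INTEGER REFERENCES methods(method_id),
    accession TEXT
);
""")
# Placeholder rows, not real SIMO data.
conn.execute("INSERT INTO sites VALUES (1, 'Example Marsh', 'salt marsh')")
conn.execute("INSERT INTO methods VALUES (1, 'example cloning protocol')")
conn.execute("INSERT INTO samples VALUES (1, 1, '2001-07-15')")
conn.execute("INSERT INTO sequences VALUES (1, 1, 1, 'EXAMPLE0001')")


def sequence_metadata(conn, seq_id):
    """Assemble the complete metadata for one sequence by relational joins."""
    row = conn.execute("""
        SELECT q.accession, s.collected, t.name, t.habitat, m.description
        FROM sequences q
        JOIN samples s ON s.sample_id = q.sample_id
        JOIN sites   t ON t.site_id   = s.site_id
        JOIN methods m ON m.method_id = q.method_id
        WHERE q.seq_id = ?
    """, (seq_id,)).fetchone()
    if row is None:
        return None
    keys = ("accession", "collected", "site", "habitat", "method")
    return dict(zip(keys, row))


record = sequence_metadata(conn, 1)

# Revising a list entry changes the joined metadata without touching
# the sequence record itself.
conn.execute("UPDATE methods SET description = ? WHERE method_id = 1",
             ("revised protocol text",))
updated = sequence_metadata(conn, 1)
```

In this sketch the sequence row stores only keys, so a correction to a method or site description is made once in the list table and appears in every record that joins to it.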
Metadata Standards
No existing standard for environmental sequence metadata
Sequence formats (FASTA, BIOML, BSML) designed for data parsing, sequence annotation
SIMO metadata currently displayed in summary form on sequence detail pages
Exploring adoption of emerging standards such as EML
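To illustrate why a sequence format such as FASTA offers little room for ecological metadata, the following minimal sketch parses a FASTA record. The sample record is a made-up placeholder, and the parser is a simplified illustration rather than a full implementation of the format.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder record, not a real database entry.
sample = ">EXAMPLE0001 uncultured bacterium clone (placeholder)\nACGTACGT\nTTGA\n"
records = parse_fasta(sample)
```

Each record yields only a single free-text header line and the sequence itself; there are no structured fields for site, sample, or methodology.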
Sequence Details
Future Directions
Incorporating batch upload features for library submissions
Integrating database with ‘RDP SeqMatch Agent’ programs for automatic phylogenetic analysis, sequence annotation
Providing full metadata in formatted/printable and parsable ASCII formats (XML)
Participating in Entrez Link-Out to provide links to SIMO sequence entries from GenBank
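As one possible shape for the parsable XML output mentioned above, the following minimal sketch serializes a metadata record with Python's standard library. The element names and the example record are hypothetical; no actual SIMO export format or EML structure is implied.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(record):
    """Serialize a flat metadata dict as a simple XML string."""
    # Hypothetical layout: one <sequence> root with one child per field.
    root = ET.Element("sequence")
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder record, not real SIMO data.
example = {
    "accession": "EXAMPLE0001",
    "site": "Example Marsh",
    "habitat": "salt marsh",
}
xml_text = metadata_to_xml(example)
```

The same record could be rendered for printing from the same dictionary, so the formatted and parsable outputs stay consistent with each other.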