CIPRES.2006.database_sd

CIPRES
Database Focus Group
NSF Site Visit
June 28, 2006
San Diego
Senior Personnel
• Susan Davidson, University of Pennsylvania
• Michael Donoghue, Yale University
• Mark Miller, San Diego Supercomputer Center
• Dan Miranker, UT Austin
• Brent Mishler, UC Berkeley
• William H. Piel, Yale University (TreeBASE II lead)
• Val Tannen, University of Pennsylvania (database focus
lead)
Other (Partially) Funded Personnel
• Lucie Chan, Senior Software Developer, San Diego
Supercomputer Center
• Shirley Cohen, Database Developer, then PhD
Student, UT Austin, then University of Pennsylvania
• Sarah Cohen-Boulakia, Post-Doc, University of
Pennsylvania (not funded by CIPRES)
• Jin Ruan, Senior Software Developer, San Diego
Supercomputer Center (TreeBASE II Software
Lead)
• Yifeng Zheng, PhD student, University of
Pennsylvania.
Goals of the Database Focus
• The major objective is the development of TreeBASE II
• In addition, this focus has supported related research on
– storage/querying of the large phylogenetic trees constructed in
• the Simulation Focus (Davidson, Kim, Zheng)
• the Algorithms Focus of the project (Moret, Hunt, Warnow)
– data provenance in phyloinformatics workflows
(Davidson, Cohen, Cohen-Boulakia)
– phylogenetic database extensions using a metric ordering to
support molecular data (Miranker)
– genome-scale phylogenetics (Piel)
– searching large collections of trees for topological patterns (Piel)
The current TreeBASE (I)
• A major data resource, more than 10 years old, for biological and
biomedical research
– submissions needed to be published in a peer-reviewed scientific
journal before being published in TreeBASE.
• Has been searched from over 60,000 distinct IP addresses
• Has accepted over 1,300 submissions that map to over
– 3,700 trees and
– 60,000 distinct taxa.
• But the capabilities of the current database are being
overtaken by demands.
• CIPRES is developing TreeBASE II as a robust, scalable,
and versatile re-design and re-engineering of TreeBASE I.
TreeBASE I Audience
Researchers from
– traditional systematics backgrounds and
– molecular biology backgrounds
who are concentrating on a series of focused experiments
in the lab.
These users include those who periodically seek online
representations of individual phylogenies for research and
educational purposes.
Additional TreeBASE II Audiences (1)
Researchers who want to run meta-analyses on large
collections of trees. Examples:
• identifying patterns in trees that result from one type of
analysis over another
• visualizing large collections of trees
• studying collaborative networks among phylogeneticists
Additional TreeBASE II Audiences (2)
Phyloinformaticians who seek to make large-scale inference
using synthetic methods applied to large collections of
trees. Examples:
• assemble a supertree for a large branch of the Tree of Life
• mine data in search of conflicting phylogenetic signals
• examine the evolution of genes and genomes in a
comparative context
Additional TreeBASE II Audiences (3)
Bioinformaticians who conduct simulation studies.
Frequently, simulation studies use simple models, such as
the Kimura 2-parameter and Jukes-Cantor models, that are not
believed to be biologically realistic.
Finding realistic evolutionary models, using real data, and
carrying out simulation studies are some of the main goals
of this group.
Value Added by TreeBASE II
• A phylogenetic query language to allow "power users" to
run complex phyloinformatic queries, including on tree
topology.
• A robust service layer and LSIDs to allow external tools
and services to interface with the database.
• Storage of LSIDs and foreign handles to better integrate
with external data services (morphological characters,
gene names, taxon names, and museum specimen IDs).
• Taxonomic intelligence for leaf and node labels.
• Ability to store geographic coordinates to support
phylogeographic data visualization and analysis.
Collected Use Cases:
Query Examples
• Given a set of taxa and a character matrix, find the
characters for which the taxa have the same state.
• Given a set of taxa and a set of trees, find all trees for
which the subtree determined by the taxa (as leaves) is
the same.
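The two use cases above can be sketched in a few lines. This is only an illustration under stated assumptions, not the TreeBASE II query language: the matrix layout (taxon name mapped to a string of states), the nested-tuple tree encoding, and all function names are hypothetical. Leaf labels are assumed to be unique within a tree.

```python
from collections import defaultdict

def shared_state_characters(matrix, taxa):
    """Return indices of characters on which all given taxa share one state."""
    rows = [matrix[t] for t in taxa]
    # zip(*rows) walks the matrix column by column (one column per character).
    return [i for i, states in enumerate(zip(*rows)) if len(set(states)) == 1]

def induced_subtree(tree, taxa):
    """Restrict a rooted tree to the given leaves.

    Nodes left with a single child are suppressed. The result is a canonical,
    order-independent form: a leaf label or a frozenset of child subtrees.
    """
    if isinstance(tree, str):  # a leaf
        return tree if tree in taxa else None
    kept = [s for s in (induced_subtree(c, taxa) for c in tree) if s is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return frozenset(kept)

def group_trees_by_subtree(trees, taxa):
    """Group tree names by the subtree that the given taxa induce."""
    taxa = set(taxa)
    groups = defaultdict(list)
    for name, tree in trees.items():
        groups[induced_subtree(tree, taxa)].append(name)
    return dict(groups)

# Toy data (hypothetical): three taxa scored for four binary characters.
matrix = {"A": "0110", "B": "0100", "C": "0111"}
same = shared_state_characters(matrix, ["A", "B", "C"])

# Three rooted trees on leaves A-D, written as nested tuples.
trees = {
    "t1": (("A", "B"), ("C", "D")),
    "t2": ((("A", "B"), "C"), "D"),
    "t3": (("A", "C"), ("B", "D")),
}
groups = group_trees_by_subtree(trees, {"A", "B", "C"})
```

Here t1 and t2 fall into the same group because both place A and B together relative to C, while t3 does not.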
TreeBASE II Capabilities:
Submission
• Friendlier interface, with more features semi-automated
• Support for entering additional (currently non-NEXUS)
data such as specimen IDs
• Automated annotations (e.g., communicating with other
sources to retrieve the sequence for a GenBank accession
number)
• Better error checking (e.g., matching taxon labels between
trees and character matrices)
• Assistance features will be opt-in and can be turned off by
the user
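As an illustration of the error checking mentioned above, the following minimal sketch compares the leaf labels of a tree with the row labels of a character matrix. It assumes a simple Newick string with unquoted labels; the function names and the sample data are hypothetical and do not describe the TreeBASE II implementation.

```python
import re

def newick_leaf_labels(newick):
    """Extract leaf labels from a simple Newick string (unquoted labels only).

    A leaf label directly follows '(' or ','. Internal node labels follow ')'
    and branch lengths follow ':', so neither is captured.
    """
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))

def label_mismatches(newick, matrix_taxa):
    """Report labels that appear in the tree or the matrix, but not both."""
    tree_taxa = newick_leaf_labels(newick)
    matrix_taxa = set(matrix_taxa)
    return {
        "only_in_tree": sorted(tree_taxa - matrix_taxa),
        "only_in_matrix": sorted(matrix_taxa - tree_taxa),
    }

# Hypothetical submission with a misspelled label in the matrix.
tree = "((Homo_sapiens:0.1,Pan_troglodytes:0.1):0.2,Gorilla_gorilla:0.3);"
rows = ["Homo_sapiens", "Pan_troglodytes", "Gorilla_gorila"]
report = label_mismatches(tree, rows)
```

A submission interface could show such a report to the submitter before the data is accepted.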
TreeBASE II Capabilities: Curation
• Support for interaction with the publication process:
– In conjunction with journal submission, study data is submitted to
TreeBASE
– It is not made visible to search/query users, but reviewers or journal editors
can examine it (anonymous access)
– If and when the journal submission is accepted, the study data is made
visible to search/query users
• Support for TreeBASE II editors, examples:
– to correct author, citation, or other metadata
– to correct the taxon names (alignment between trees and character
matrices or with taxonomic services)
– to remove orphan data
An interface with access to taxonomic services such as uBio (www.ubio.org) or
the Glasgow Name Server (taxonomy.zoology.gla.ac.uk/rod/rod.html) will be
provided to facilitate both submission support and curation capability.
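The visibility rule for the publication process described on this slide can be sketched as a small access check. The status values, role names, and class below are assumptions made for illustration only; the actual TreeBASE II design is not shown in these slides.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    IN_REVIEW = "in_review"    # submitted together with the journal manuscript
    PUBLISHED = "published"    # journal submission accepted

class Role(Enum):
    SEARCH_USER = "search_user"
    REVIEWER = "reviewer"      # reviewers and journal editors (anonymous access)
    EDITOR = "editor"          # TreeBASE II editors

@dataclass
class Submission:
    study_id: str
    status: Status = Status.IN_REVIEW

    def accept(self):
        """Called if and when the journal accepts the manuscript."""
        self.status = Status.PUBLISHED

def can_view(submission, role):
    """Search users see only published data; reviewers and editors see all."""
    if submission.status is Status.PUBLISHED:
        return True
    return role in (Role.REVIEWER, Role.EDITOR)

# Hypothetical study identifier.
sub = Submission("S-0001")
hidden_before = can_view(sub, Role.SEARCH_USER)
reviewer_before = can_view(sub, Role.REVIEWER)
sub.accept()
visible_after = can_view(sub, Role.SEARCH_USER)
```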
TreeBASE II Capabilities:
Search (1)
2-step configurable GUI retrieving sets of studies, matrices, or trees.
– Step 1: choose search criteria
– Step 2: choose search
• Study Search By:
– Disjunction of conjunctions of author last names
– Citation title matches given keyword(s)
– Name matches keyword
– Contains analysis/analysis step such that:
• Name matches given keyword(s)
• Uses given algorithm
• Uses given software package
• Input and/or output data contains given set of taxa
• Input and/or output data contains tree that matches given tree pattern
• Input and/or output data contains matrices satisfying given search criteria
(same as below)
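The first study criterion, a disjunction of conjunctions of author last names, amounts to a query in disjunctive normal form. A minimal sketch follows, assuming the query is given as a list of name lists and that matching is case-insensitive; the function name, study identifiers, and author names are hypothetical.

```python
def matches_author_query(study_authors, dnf):
    """Return True if the study satisfies an author query in DNF.

    dnf is a list of conjunctions; each conjunction is a list of last names.
    The study matches when every name of at least one conjunction appears
    among its authors.
    """
    authors = {a.lower() for a in study_authors}
    return any(all(name.lower() in authors for name in conj) for conj in dnf)

# (Smith AND Jones) OR (Lee)
query = [["Smith", "Jones"], ["Lee"]]

studies = {
    "S1": ["Smith", "Jones", "Garcia"],
    "S2": ["Jones"],
    "S3": ["Lee", "Garcia"],
}
hits = [sid for sid, authors in studies.items()
        if matches_author_query(authors, query)]
```

In a database-backed implementation the same condition would more likely be translated into a query than evaluated in memory.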
TreeBASE II Capabilities:
Search (2)
• Tree Search By:
– Tree id number
– Appears in a study satisfying given search criteria (same as above)
– Appears in an analysis/analysis step satisfying given search criteria (same
as above)
– Contains given set of taxa
– Matches given tree pattern
• Matrix Search By:
– Uses given set of taxa
– Uses given set of character names
– Is a sequence matrix that uses a certain kind of biomolecular information
– Contains given specimen(s)
TreeBASE II Capabilities:
Bulk Queries
XML-based query interface for tools that interoperate with TreeBASE II
• Input: domain-specific query language
– based on the TreeBASE Domain Model
– related semantically to a simple subset of SQL or ODMG/OQL
– XML-based syntax
• TreeBASE XML format for query output
– Nexus data
– additional data in TreeBASE II
• For the CIPRES tool, which is CORBA-based, we will use an IDL-to-XML bridge
• Sophisticated interactive users can also submit prepared queries
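To give a feel for what an XML-based bulk query might look like, the sketch below assembles one with Python's standard library. The slides do not define the query syntax, so every element and attribute name here (query, select, where, containsTaxa, taxon, usesAlgorithm) is invented for illustration and should not be read as the TreeBASE II format.

```python
import xml.etree.ElementTree as ET

def build_tree_query(taxa, algorithm=None):
    """Build a hypothetical XML query: trees containing all the given taxa."""
    query = ET.Element("query", {"select": "tree"})
    where = ET.SubElement(query, "where")
    contains = ET.SubElement(where, "containsTaxa")
    for name in taxa:
        ET.SubElement(contains, "taxon").text = name
    if algorithm is not None:
        # Optional extra criterion on the analysis that produced the tree.
        ET.SubElement(where, "usesAlgorithm").text = algorithm
    return ET.tostring(query, encoding="unicode")

xml_text = build_tree_query(["Homo sapiens", "Pan troglodytes"],
                            algorithm="parsimony")

# A client tool can parse the same text back into a tree of elements.
parsed = ET.fromstring(xml_text)
taxa_in_query = [t.text for t in parsed.findall("where/containsTaxa/taxon")]
```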
TreeBASE II Domain Model
A detailed object-oriented Domain Model was designed for TreeBASE II
(EER diagrams were manually derived from the Domain Model)
A very partial and simplified view:
[Diagram of classes: Study, Data, Matrix, Tree, Taxon, MatrixRow, RowSegment, Specimen; association ends labeled with multiplicity 1]
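The class names in this simplified view can be written down as plain data classes. The sketch below is a guess at the associations, since the connecting lines of the diagram do not survive in this transcript: it assumes that Matrix and Tree are kinds of Data, that a Study owns its Data, that each MatrixRow belongs to one Taxon, and that each RowSegment may be linked to one Specimen. All field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Taxon:
    name: str

@dataclass
class Specimen:
    identifier: str

@dataclass
class RowSegment:
    states: str                           # a run of character states
    specimen: Optional[Specimen] = None   # assumed link to a voucher specimen

@dataclass
class MatrixRow:
    taxon: Taxon                          # assumed: one taxon per row
    segments: List[RowSegment] = field(default_factory=list)

@dataclass
class Data:
    label: str

@dataclass
class Matrix(Data):                       # assumed: Matrix is a kind of Data
    rows: List[MatrixRow] = field(default_factory=list)

@dataclass
class Tree(Data):                         # assumed: Tree is a kind of Data
    newick: str = ""

@dataclass
class Study:
    citation: str
    data: List[Data] = field(default_factory=list)

# Hypothetical example objects.
homo = Taxon("Homo sapiens")
row = MatrixRow(homo, [RowSegment("ACGT", Specimen("MUSEUM-123"))])
study = Study("Example citation", [Matrix("matrix 1", [row]),
                                   Tree("tree 1", "(A,B);")])
kinds = [type(d).__name__ for d in study.data]
```

In the real system, such classes would be mapped to the SQL schema through Hibernate, as the technologies slide indicates.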
Technologies used in TreeBASE II development
• Open source
• Proven technologies and best practices
• Hibernate to generate the SQL schema from the Domain
Model
• Hibernate, based on the Domain Model, to program all
database access
• Tomcat Web container and one of SDSC's Web farms
• Spring framework as an application container to manage
transactions
Status and Future Plans
• Requirements and use case collection is complete
• The architectural design is complete
• Currently working on detailed design and coding, including GUI work
and loading data from TreeBASE I (some is ready)
• A demo will be performed during the site visit
• TreeBASE I data will be loaded by August 2006
• Elements of the interactive user interface will be beta released and
end-user tested throughout Fall 2006
• New submissions accepted starting February 2007
• Links to taxonomic services developed in Spring 2007
• Bulk query API, including CIPRES tool interface, developed in 2007
• Available as a Web service at the end of 2007