Transcript SysMO-DB

SysMO-DB: A Community-Based Approach to Data Sharing
Dr Katy Wolstencroft
University of Manchester
SysMO-DB
A data access, model handling and data integration platform for Systems Biology
A web-based resource



That promotes shared understanding
Using a common platform and common technologies
Started July 2008
SysMO-DB Dev Team
Sergejs Aleksejevs
Wolfgang Müller
Heidelberg Institute for Theoretical Studies, Germany
Carole Goble
Olga Krebs
Katy Wolstencroft
University of Manchester, UK
Stuart Owen
Jacky Snoep
University of Stellenbosch, South Africa
University of Manchester, UK
Franco B du Preez
Finn Bacall
Systems Biology of Microorganisms
http://www.sysmo.net


Pan European collaboration
Eleven individual projects, 89 institutes
Different research outcomes
A cross-section of microorganisms, incl. bacteria, archaea and yeast

Record and describe the dynamic molecular processes occurring in microorganisms in a comprehensive way

Present these processes in the form of computerized mathematical models

Pool research capacities and know-how

Already running since April 2007
Runs for 3-5 years
This year, 2 new projects join and 6 leave


Challenges

Heterogeneous data and models
Distributed groups of researchers
Modellers and experimentalists have different skills, training, experience
Scientists want to remain in control

Social and technical challenges



Social Challenge: Focus Group
DB team: show what is there, suggest what is possible, ask for requirements
Focus Group: give requirements, tell priorities, rate outcomes, suggest improvements
Focus Group and Projects: double check, transmit, disseminate, collect answers

SysMO-DB PALS






Audits and Sharing.



21 Postdocs and PhD students
Modellers, experimentalists and bioinformaticians
Design and technical collaboration team
Intense collaboration
UK and Continental PALS Chapters
Methods, data, models, standards, software, schemas, spreadsheets, SOPs…
20 questions
Deployment into Projects
Technical Challenge







Rapid and incremental development
Just enough and just in time, not just in case
No reinvention
Driven by the PALs
Sustainable and extensible
Migrate to standards
Fitting in with normal lab practices
What do we share
Nature Protocols
Protocol Title
Authors
Keywords
Abstract
Materials
Reagents
Reagent Set Up
Equipment
Time Taken
Procedure
Troubleshooting
Critical Steps
Anticipated Results
References
All SysMO Assets: Methods + Data + Results
What do we share
Protocols for Models
Protocol Title
Authors
Keywords
Description
Assumptions
Equations
Numerical Methods/Algorithms
Computational Tools
Parameter Estimation Techniques
Limitations
References
All SysMO Assets: Methods + Models + Data + Results
A Tree View of Assets
Investigation
Studies
SOP
Assay
SOP
ISA infrastructure provides a directory structure for experiments
http://isatab.sourceforge.net/
SOP
Construction
Validation
Expertise, tools
Coordinates, data
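The Investigation / Studies / Assay / SOP hierarchy on this slide could be sketched as a small nested data structure. This is only an illustration: the class names, fields and example titles below are assumptions made for the sketch, not the SEEK or ISA-Tab schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    title: str
    sops: List[str] = field(default_factory=list)  # SOPs attached to this assay


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)


def render_tree(investigation: Investigation) -> List[str]:
    """Return the hierarchy as indented text lines, one per asset."""
    lines = [f"Investigation: {investigation.title}"]
    for study in investigation.studies:
        lines.append(f"  Study: {study.title}")
        for assay in study.assays:
            lines.append(f"    Assay: {assay.title}")
            for sop in assay.sops:
                lines.append(f"      SOP: {sop}")
    return lines


# Hypothetical example content, invented for illustration only.
example = Investigation(
    title="Example investigation",
    studies=[
        Study(
            title="Example study",
            assays=[Assay(title="Example assay", sops=["Example SOP"])],
        )
    ],
)

for line in render_tree(example):
    print(line)
```

Keeping SOPs attached to assays in this sketch is a design guess taken from the order of the labels on the slide; the real platform may link SOPs at other levels as well.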
How do we share
“Just Enough Results Model”

What type of data is it
Microarray, growth curve, enzyme activity…

What was measured
Gene expression, OD, metabolite concentration…

What do the values in the datasets mean
Units, time series, repeats…
Based on:

Minimum information models
e.g. MIAME, MIAPE, MIRIAM

Biological ontologies
e.g. Gene Ontology, MGED, SBO

BioPortal web service used in SysMO-SEEK for:
Concept lookup and visualisation
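The three JERM questions (what type of data is it, what was measured, what do the values mean) can be read as a minimal metadata checklist. The sketch below assumes hypothetical field names (`data_type`, `measured`, `units`); it is not the actual JERM template, only one way such a "just enough" check might look.

```python
# Hypothetical minimal field set, one field per JERM question on the slide.
REQUIRED_FIELDS = ("data_type", "measured", "units")


def missing_jerm_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Example record: a growth curve where the units were left blank.
example_record = {
    "data_type": "growth curve",
    "measured": "OD",
    "units": "",
    "time_series": True,
    "repeats": 3,
}

print(missing_jerm_fields(example_record))
```

A check like this only reports gaps; in keeping with the "recommend but do not enforce" approach, it would not need to block an upload.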
How do we share

Share JERM templates developed by SysMO-DB, PALs and consortium



Spreadsheet templates
Database Schemas
Encourage uptake throughout SysMO



transcriptomics
metabolomics
proteomics etc….
Tools to help manage data:
Annotation standards by stealth
Controlled vocabulary plug in
BioPortal
JERM Model

SysMO JERM a ‘MIBBI’ for the SysMO-SEEK

What do we need to help you find stuff?




Title, person, filename, class
What is experiment specific?
What is experiment specific, but helps us map between them?
Common biological elements

chemicals, genes, proteins, organisms, strains
Identifying Biological Objects

What do you have in your data?
Proteins/enzymes, genes/expression levels, metabolites

Where/how do these objects interact?
Pathways, flux, experimental conditions

What models describe these interactions?

Possible when using common frameworks, naming schemes and controlled vocabularies
Following Standards

We recommend formats but we do not enforce them

Protocols and SOPs – Nature Protocols
Data – JERM models and community minimum information models
Models – SBML and related standards
Publications – PubMed and DOI

If you follow the prescribed formats, you get more out, but if you don’t, you can still participate
Lowering the adoption barrier
Just Enough Sharing
Access
Permissions
...we don’t talk about security
Just Enough Sharing
SysMOLab
Wiki
COSMIC
Fetch on Request
Alfresco
MOSES
Wiki
ANOTHER
Direct Upload
A DATA STORE
SOP
When do People Share
Data Collection: your own group and maybe your project
Pre-publication: project + maybe consortium
Post-publication: consortium and wider community
Collaboration
Advertising
Discussion and criticism
SysMO aims: sharing sooner
• Suspicion and fear of scooping
• Reputation
Incentives for sharing




Safe haven for data
Credit and attribution
Help with exporting to public repositories (e.g. one-click export to ArrayExpress, PRIDE etc.)
A repository for “supplementary materials” in publications


Linking publications and data
Access other resources through a SEEK gateway
SEEK as a Gateway
JWS Online Plugin
• online simulator, runs in SysMO-SEEK
• upload models in SBML format
• SBGN schemas, with annotations and external links
Incentives for sharing

Credit and attribution


SEEK records who owns what. If data, models, or protocols are reused, scientists get recognition
Accountability

SEEK records who owns what. If you take credit for others’ work, they will see
Data citation – formal credit for data published in SEEK
Data Citation




Persistent identifiers and URLs for the data
Linking people to the data
Safe haven for the data
Guarantees of sustainability


Data MUST be uploaded and archived
If cited, it must be public
SEEK as a Safe Haven



HITS can archive SysMO data for 10 years
All SysMO software is open source and available
Distinction between sustaining the service and the software
Governance and Policy

What is required by SysMO members?




When should they share during their projects?
How long after the project can they keep data private to finish publications?
If their data is stored locally, what is the archive process?
Policy from DMG and funding agencies and NOT SysMO-DB
Governance and Policy

Proposals under discussion:


All data registered in SEEK should be uploaded and archived at the end of a SysMO project
All data from finished projects should be shared
How long after the end? 1 day, 6 months, 1 year?
Scientists can invoke “creator’s privilege” on SysMO assets produced near the end of the project
Extra time to write up and publish before release to the general public – respecting publication cycles
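The embargo options under discussion (1 day, 6 months, 1 year after the end of a project) can be compared with a small date calculation. This is a sketch only: the function names and the project end date are hypothetical, and nothing here reflects an agreed SysMO policy.

```python
import calendar
from datetime import date, timedelta


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(project_end: date, embargo_months: int = 0, embargo_days: int = 0) -> date:
    """Date on which data would become public under a given embargo."""
    return add_months(project_end, embargo_months) + timedelta(days=embargo_days)


# Hypothetical project end date, used only to compare the three options.
project_end = date(2010, 3, 31)
for label, kwargs in [
    ("1 day", {"embargo_days": 1}),
    ("6 months", {"embargo_months": 6}),
    ("1 year", {"embargo_months": 12}),
]:
    print(label, release_date(project_end, **kwargs).isoformat())
```

A “creator’s privilege” extension could be modelled as an extra embargo period applied per asset rather than per project, though how that would work is left open on the slide.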
SysMO So Far…


People ARE sharing
Over 300 assets in SEEK




SOPs: 102, Models: 17, DataFiles: 95, Investigations: 13, Studies: 26, Assays: 53
PALs – a network of young SysBio researchers
Training and education in data and metadata management spreading through the consortium
Modellers and experimentalists communicating
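As a quick consistency check, the per-type counts on this slide can be summed and compared against the "over 300 assets" headline. The dictionary simply restates the slide's figures; the variable names are illustrative.

```python
# Asset counts as listed on the slide.
asset_counts = {
    "SOPs": 102,
    "Models": 17,
    "DataFiles": 95,
    "Investigations": 13,
    "Studies": 26,
    "Assays": 53,
}

total_assets = sum(asset_counts.values())
print(total_assets)  # 306, consistent with "over 300 assets"
```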
SysMO Methods Spreading

Virtual Liver








Mueller, via HITS
Lungsys
SBCancer
EraSysBio+
Eukaryotic organisms
Interactions between host and pathogen
Human disease
Multi scale modelling
Why it works for us



A solution that fits in with current practices
Start simple, show benefits, add more
Engage with the people actually doing the work

PhD students, Post-docs

Build to the PALs requirements
Respect publication cycles
Respect cultural differences

Scientists stay in control


Acknowledgements




SysMO-DB Team
SysMO-PALS
myGrid, HITS and JWS Online
EMBL-EBI, MCISB
http://www.sysmo-db.org