Transcript OBO

Systems Biology
Data exchange standards and ontologies
Ulf Schmitz
[email protected]
Systems Biology and Bioinformatics Group
www.sbi.informatik.uni-rostock.de
Ulf Schmitz, Data exchange standards and ontologies
Outline
1. The need for data exchange formats
2. Standards and de facto standards in SB
3. Why XML as framework?
4. Ontologies
5. OWL, RDF, OBO, Protégé
6. Minimum Information Required – suggestions
7. Standards for graphical representation
8. Outlook
The need for data exchange formats
• Rapid increase in experimental data (high throughput)
• Quick comparison, analysis and integration of that data is required
• This results in a need for standardized formats for representing those results
[Figure: number of entries in EMBL (current: 63,713,453)]
The need for data exchange formats
• setup
• protocol
• results
[Brazma, 2006]
The need for data exchange formats
There is a demand for standardized formats for:
• experimental data
– for reproducibility of experiments
– annotation in DBs and use in data analysis tools
• standardized names for metabolites, reactions and enzymes
• mathematical models
– description, accessibility and exchange
• standardized graphical representation of networks
– like with electronic circuits
[Klipp, Survey]
Standards and de facto standards in SB
In total, there are more than 80 standards within systems biology.
Name | Ver. | Purpose | Tools | Data
SBML (Systems Biology Markup Language) | 2.2 | Format for representing models of biochemical reaction networks | Supported by over 100 software systems | Data available from many databases (e.g. KEGG, Reactome, JWS, Biomodels)
PSI MI (Proteomics Standards Initiative - Molecular Interaction) | 2.5 | Standard for data representation of protein-protein interactions | Tools for viewing and analysis | Datasets available from many sources, for instance IntAct, DIP, BIND
BioPAX (Biological Pathways Exchange) | 2 | Format for biological pathway data | Existing tools for OWL such as Protégé | Datasets available from Reactome
CellML | 1.1 | Supports the definition of models of cellular and subcellular processes | Tools for publication, visualization, creation and simulation | CellML Model Repository (~240 models) www.cellml.org
CML (Chemical Markup Language) | 2.2 | Interchange of chemical information (atomic, molecular and crystallographic information, compounds, structure, publications) | Molecular browsers, editors | BioCYC www.biocyc.org
EMBLxml | 1.0 | Nucleotide sequence information | API support in BioJavaX | EMBL www.ebi.ac.uk/embl
MathML | 2.0 | Representation of mathematical formulas | Browsers | http://www.w3.org/Math
Standards and de facto standards in SB
Name | Ver. | Purpose | Tools | Data
INSD-seq (International Nucleotide Sequence Database) | 1.4 | Representation for sequence records | API support in BioJavaX | EMBL and GenBank
Seq-entry | n/a | NCBI uses ASN.1 for the storage and retrieval of data such as nucleotide and protein sequences | SRI’s BioWarehouse and Protein Structure Factory’s ORFer | Entrez
BSML (Bioinformatic Sequence Markup Language) | 3.1 | Facilitate the interchange of data for more efficient communication within the life sciences community | LabBook’s Genomic Browser and Sequence Viewer; converters | Previously provided by EMBL
HUP-ML (Human Proteome Markup Language) | 0.8 | Markup language for proteome data | HUP-ML Editor | -
MAGE-ML (MicroArray and Gene Expression) | 1.1 | Microarray gene expression data | Converters | ArrayExpress www.ebi.ac.uk/arrayexpress
MzXML (mass spectrometric) | 2.1 | Common file format for MassSpec data | Converters, viewers | PeptideAtlas, Sashimi, Open Proteomics Database
AGML (Annotated Gel Markup Language) | 2.0 | To model the concept of annotated gel (AG) for 2DE results | Visualizer | AGML Collaboration
XML based standards
In biology, HTML is used for data publishing, database browsing, data gathering, data submission and analysis
• It’s very user friendly:
  – browsing the web is almost instinctive and requires minimal training time
• The language is very easy to learn:
  – knowledge of a few self-explanatory tag names and understanding of a very simple syntax is enough to write good web pages
• But HTML has limits, because it is basically dedicated to human browsing:
  – it has a static structure and doesn’t provide semantic features; it does not differentiate between different data types
• Conclusion:
  – XML is an extensible and easy to use format for information representation in biological applications
Why XML as framework?
• The eXtensible Markup Language (XML) is derived from SGML (Standard Generalized Markup Language)
  – the international standard for defining descriptions of the structure and contents of different types of electronic documents
• XML is an emerging standard for structuring documents, notably for the World Wide Web
  – XML allows the definition of a set of tags to be applied to one or many documents
  – these tags define elements in the document
• XML-based standards have been found to be most useful as a data language for bioinformatics
  – for data interchange between databases and other sources of data
• This goes hand in hand with the development of ontologies
[Achard et al., 2000]
XML
XML documents consist of elements, which are textual data structured by tags.
An element consists of a start/end tag pair, optional or mandatory attributes defined as key/value pairs, and the data between those tags.
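The terms used here (tag pair, attributes, data) can be illustrated with a minimal sketch that uses Python's standard-library xml.etree.ElementTree module. The element name "species" and its attributes are made up for illustration and are not taken from any particular standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one element with a start/end tag pair,
# two attributes (key/value pairs) and text data between the tags.
xml_text = '<species id="s1" compartment="cytosol">glucose</species>'

element = ET.fromstring(xml_text)

tag = element.tag            # name of the start/end tag pair
attributes = element.attrib  # attributes as a dict of key/value pairs
data = element.text          # the data between the tags

print(tag, attributes, data)
```

The same three parts recur in every XML-based exchange format; the formats differ only in which tag and attribute names they define.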
XML Pros and Cons
• Pros:
  – XML is highly flexible
  – human readable
  – internet oriented, has rich capabilities of linking data, useful for interconnecting databases
  – provides an open framework for defining standard specifications
• Cons:
  – overhead of text-based data formats in data parsing, storage and transmission
  – source can be read and edited with any editor
  – expressiveness of the XML data model would probably not be sufficient for molecular biology
Alternative formats for the management and exchange of bioinformatics data
• Flat files (e.g. flat file libraries from EMBL, GenBank, DDBJ or Swiss-Prot)
• ASN.1: Abstract Syntax Notation One (used at the NCBI for exporting GenBank data)
• CORBA: the Common Object Request Broker Architecture
• Java RMI: Remote Method Invocation
• OODBMS: object-oriented database management system
XML based ontology languages
• Ontology
  – A system for describing knowledge: a conceptualization of a domain of interest, usually made up of any or all of the following: concepts (classes), relations, attributes, constraints, objects, values.
• RDF
  – Resource Description Framework, a W3C standard, allows description of basic relationships between objects (subject-predicate-object semantics).
• OWL
  – Web Ontology Language, a W3C standard, is an extension of RDF to support ontologies. It provides semantics for classes and subclasses, instances, and relationships.
• OBO
  – The Open Biomedical Ontologies (OBO) Foundry is a collaborative experiment to produce well-structured vocabularies; it introduces a new paradigm for biomedical ontology development.
• Protégé
  – Protégé ontology and knowledge base editor: a software tool to build an ontology and manage instances of classes defined in that ontology.
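The subject-predicate-object idea behind RDF, and the class/subclass semantics that OWL adds, can be sketched in plain Python without any RDF library. All identifiers with the "ex:" prefix are hypothetical, and the traversal below is only a simplified illustration of subclass reasoning, not an implementation of either standard.

```python
# Each statement is a (subject, predicate, object) triple.
# All "ex:" identifiers are made up for illustration.
triples = [
    ("ex:hexokinase", "rdf:type", "ex:Enzyme"),
    ("ex:Enzyme", "rdfs:subClassOf", "ex:Protein"),
    ("ex:hexokinase", "ex:catalyzes", "ex:glucose_phosphorylation"),
]


def objects(subject, predicate):
    """Return all objects of triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def types_of(resource):
    """Direct types plus all superclasses reached via rdfs:subClassOf."""
    found = []
    todo = objects(resource, "rdf:type")
    while todo:
        cls = todo.pop()
        if cls not in found:
            found.append(cls)
            todo.extend(objects(cls, "rdfs:subClassOf"))
    return found


hexokinase_types = types_of("ex:hexokinase")
print(hexokinase_types)
```

The point of the sketch is that a class hierarchy lets a tool infer that the resource is also a protein, although no triple states this directly.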
Ontologies
Domain | Prefix | Files | Format
Biological imaging methods | FBbi | image.obo | OBO
Biological process | GO | gene ontology.obo | OBO
Cell type | CL | cell.obo | OBO
Cellular component | GO | gene ontology.obo | OBO
Drosophila development | FBdv | fly development.obo | OBO
Event (INOH pathway ontology) | IEV | event.obo | OBO
Evidence codes | ECO | evidence code.obo | OBO
eVOC (Expressed Sequence Annotation for Humans) | EV | evoc.obo.tar (v2.7) | OBO
FlyBase Controlled Vocabulary | FBcv | flybase controlled vocabulary.obo | OBO
Human disease | DOID | human disease.obo | OBO
Ontologies
Domain | Prefix | Files | Format
Mammalian phenotype | MP | mammalian phenotype.obo | OBO
MESH | MESH | mesh.obo | OBO
Microarray experimental conditions | MO | MGEDOntology.owl | OWL
Molecular function | GO | gene ontology.obo | OBO
Multiple alignment | RO | mao.obo | OBO
NCBI organismal classification | taxon | taxonomy.dat | plain text
OBO relationship types | OBO_REL | ro.obo | OBO
Pathway ontology | PW | pathway.obo | OBO
Protein domain | IPR | InterPro FTP directory | XML (http://www.w3.org/XML/)
Protein-protein interaction | MI | psi-mi.obo | OBO
Proteomics data and process provenance | ProPreO | ProPreO.owl | OWL
Sequence types and features | SO | so.obo | OBO
Systems Biology | SBO | SBO_OWL.owl | OWL
UniProt taxonomy | - | Organism identification code list | plain text
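Most of the ontologies listed above are distributed as OBO flat files, in which each term is a "[Term]" stanza made of "tag: value" lines. The following is a minimal sketch of reading such stanzas in Python. It is not a complete OBO parser (escaped characters and many tags are ignored), and the sample terms with the "EX:" prefix are hypothetical.

```python
def parse_obo_terms(text):
    """Collect [Term] stanzas as dicts mapping each tag to a list of values."""
    terms = []
    current = None
    for raw in text.splitlines():
        # Simplification: everything after "!" is treated as a comment.
        line = raw.split("!", 1)[0].strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif line and current is not None and ":" in line:
            tag, value = line.split(":", 1)
            current.setdefault(tag.strip(), []).append(value.strip())
    return terms


# Hypothetical sample; real files contain many more tags per term.
sample = """format-version: 1.2

[Term]
id: EX:0000001
name: example process

[Term]
id: EX:0000002
name: example subprocess
is_a: EX:0000001 ! example process

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo_terms(sample)
print(terms)
```

Header lines before the first stanza and non-term stanzas such as "[Typedef]" are skipped on purpose, so the result contains only the two terms.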
New standards defining the minimal required contents
• MIAME – Minimum Information About a Microarray Experiment
• MIAPE – Minimum Information About a Proteomics Experiment
• MIRIAM – Minimum Information Requested In the Annotation of biochemical Models
One common suggestion among these requirements is to store metadata using controlled vocabularies (defined in ontologies) instead of free text
Other requirements are:
• Information about participating substances
• Organisms
• Literature references
MIRIAM - Minimum Information Requested In the Annotation of biochemical Models
• many of the published models in biology are lost for the community because they are either not made available or they are insufficiently characterized to allow them to be reused
• the lack of a standard description format, lack of stringent reviewing and authors’ carelessness are the main causes of incomplete model descriptions
• quantitative models will be useful only if their access and reuse is made easy for all scientists
• rules for creating quantitative models of biological systems:
  – use standardized, structured formats for encoding biological models (SBML, CellML)
  – annotate models on public repositories (Biomodels Database, Sigpath, EcoCyc, CellML repository, JWS Online, RegulonDB, DOQCS)
  – the model, when instantiated within a suitable simulation environment, must be able to reproduce all relevant results given in the reference description
  – annotations to be included in the model (use CellML metadata or SBML simple annotation scheme):
    • preferred name of the model
    • citation of the reference description
    • name and contact information for the model creators
    • date and time of creation
    • a precise statement about the terms of distribution (‘public domain’, ‘copyrighted’, ‘freely distributable’, ‘confidential’)
SBML validator
[Novere 2005]
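The list of annotations above lends itself to an automatic completeness check. The sketch below assumes the model metadata has already been read into a Python dictionary; the key names are hypothetical and do not correspond to the official MIRIAM, SBML or CellML encodings.

```python
# Hypothetical key names, one per annotation listed on the slide.
REQUIRED_ANNOTATIONS = [
    "name",                   # preferred name of the model
    "citation",               # citation of the reference description
    "creators",               # name and contact information of the creators
    "created",                # date and time of creation
    "terms_of_distribution",  # e.g. "public domain", "copyrighted"
]


def missing_annotations(metadata):
    """Return the required keys that are absent or empty in metadata."""
    return [key for key in REQUIRED_ANNOTATIONS if not metadata.get(key)]


# Made-up example metadata with one empty and one absent annotation.
model_metadata = {
    "name": "Example glycolysis model",
    "citation": "",
    "creators": [{"name": "A. Modeller", "email": "modeller@example.org"}],
    "created": "2007-01-01T12:00:00",
}

missing = missing_annotations(model_metadata)
print(missing)
```

A repository could run such a check at submission time and reject or flag models for which the returned list is not empty.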
MIRIAM example model
Through standardization of the model curation process, it will be possible to create resources that are as significant to systems biology as resources like Ensembl are to genomics.
Survey about standards in SB
[Klipp 2005]
Standards for graphical representation
There is a need for a graphical formalism that covers fundamental biochemical processes and that can be uniquely mapped
1. to mathematical objects such as ordinary differential equations (ODE) or stochastic simulation schemes, and
2. to a textual description.
• CellDesigner
– often used tool to visualize biochemical reaction networks
• SBGN – Systems Biology Graphical Notation
– Attempt to develop standards for graphical representation
• Molecular Interaction Map (Kohn maps)
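To illustrate what mapping a diagram to ordinary differential equations can mean, consider a single reversible binding step A + B <-> A:B. Assuming mass-action kinetics (an assumption made here, not something a graphical notation prescribes), the complex changes as d[A:B]/dt = k_on [A][B] - k_off [A:B]. The sketch below integrates this with a simple explicit Euler scheme; the rate constants and initial concentrations are arbitrary illustrative values.

```python
def simulate_binding(a0, b0, k_on, k_off, dt, steps):
    """Integrate A + B <-> AB with mass-action kinetics (explicit Euler)."""
    a, b, ab = a0, b0, 0.0
    for _ in range(steps):
        rate = k_on * a * b - k_off * ab  # net rate of complex formation
        a -= rate * dt
        b -= rate * dt
        ab += rate * dt
    return a, b, ab


# Arbitrary illustrative parameters, not taken from any published model.
a, b, ab = simulate_binding(a0=1.0, b0=1.0, k_on=1.0, k_off=0.5,
                            dt=0.001, steps=20000)
print(a, b, ab)
```

With these values the system settles where k_on [A][B] equals k_off [A:B]. A unique mapping from diagram to equations means that every tool reading the same diagram would derive the same rate expression.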
A. Funahashi and H. Kitano modification of Kohn maps
Notation of the process diagram:
• State transition – changes the state of modification rather than activation
• Activation
• Inhibition
• Translocation of molecule
• Dashed line indicates the active state of a molecule
• Specific state of a molecular species
Molecular Interaction Maps (MIM)
• Characteristics:
– Each molecule shown only in one location
• All interactions and modifications can be traced from one point
• Molecules can be located from an index of map coordinates
– In “Cell Cycle eMIMs” (interactive MIMs) molecules serve
as links to additional sources of information (PubMed,
Gene Cards, MedMiner)
Symbols and conventions used in eMIMs
Reactions:
• Proteins A and B can bind to each other. The node represents the A:B complex.
• Multimolecular complex: x is A:B; y is (A:B):C. Endlessly extendable.
• Covalent modification of protein A. A can exist in a phosphorylated state.
• Cleavage of a covalent bond: dephosphorylation of A by a phosphatase.
• Stoichiometric conversion of A to B.
Symbols and conventions used in eMIMs
Reactions:
• Transport of A from cytosol to nucleus. The dot represents A after transport to the nucleus.
• Formation of a homodimer. The dot on the right represents a copy of A. The dot on the line represents the homodimer A:A.
Contingencies:
• Enzymatic stimulation of a reaction
• Enzymatic stimulation of a reaction in trans
• Stimulation of a process. Bar indicates necessity.
• Inhibition
• Transcriptional activation
• Transcriptional inhibition
Molecular Interaction Map (eMIM)
Take home message
• When developing or using tools, consider which data exchange formats they should be able to handle (import/export)
• When doing experiments, consider annotating them using an appropriate data exchange format to make them reusable and reproducible for others
• When defining a new data exchange format, keep existing ontologies in mind; they help to find a common vocabulary within the community
• When communicating your scientific procedures and results, check whether you and your collaborators share a common language or whether a common vocabulary, defined in an ontology, is needed (don’t hesitate to create one with the help of Protégé)
Literature
• Strömbäck, L., Hall, D. and Lambrix, P.: A review of standards for data exchange within systems biology. Proteomics, 2007, 7, 857–867
• Achard, F., Vaysseix, G. and Barillot, E.: XML, bioinformatics and data integration. Bioinformatics, 2000, 17, 2, 115-125
• Brazma, A., Krestyaninova, M. and Sarkans, U.: Standards for systems biology. Nature Reviews Genetics, 2006, 7, 593-605
• Novere, N. et al.: Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology, 2005, 23, 12, 1509-1515
• Klipp, E., Liebermeister, W., Helbig, A., Kowald, A. and Schaber, J.: Standards in Computational Systems Biology. 2005
Thanks for your attention!
Appendix
[Figure: BioPAX]
[Figure: PSI MI]