Pathway - Internet Database Lab.

RDF-based Integration of Pathway Database and Gene Ontology
SNU OOPSLA LAB.
2005
DongHyuk Im
Contents
• Introduction
• Pathway Database
• Enzyme Database
• Gene Ontology
• Related Works
• Our Approach
  ◦ Supporting Function
  ◦ Data Transformation
  ◦ Integration of KEGG, Enzyme, Gene Ontology
  ◦ Querying using SeRQL
Pathway?
• In most chemical reaction mechanisms, a compound (substrate) is converted into a compound (product) by the action of an enzyme
• Importance
  ◦ Comparing and analyzing pathways helps to understand how compounds are created and the evolutionary relationships between organisms
  ◦ Drug discovery
Pathway
Map : Glycolysis / Gluconeogenesis
Map : Aquifex aeolicus
Enzyme Database
• EC number
• Recommended name
• Alternative names (if any)
• Catalytic activity
• Cofactors (if any)
• Pointers to the SWISS-PROT entry(ies) that correspond to the enzyme (if any)
• Pointers to disease(s) associated with a deficiency of the enzyme (if any)
Enzyme Hierarchy
• Four levels
[*]
  [1]
  [2]
    [2.1]
    [2.2]
      [2.2.1]
      [2.2.2]
        [2.2.2.1]
        [2.2.2.2]
        [2.2.2.3]
      [2.2.3]
    [2.3]
  [3]
• EC number
  ◦ Ex.) 1.1.1.1 is a member of the top-level group [1]
  ◦ The leftmost number identifies the highest level
  ◦ [2.4.2.3] and [2.4.2.4] are siblings: they catalyze similar reactions in a pathway
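The four-level EC numbering can be handled as a dotted (Dewey-style) path. The following is a minimal Python sketch of that idea, not the system's actual code; the function names are hypothetical and chosen only for illustration.

```python
def level(ec):
    """Depth of an EC identifier: '2' is level 1, '2.4.2.3' is level 4."""
    return len(ec.split("."))


def parent(ec):
    """Parent group of an EC identifier, or None for a top-level group."""
    head, sep, _ = ec.rpartition(".")
    return head if sep else None


def ancestors(ec):
    """All enclosing groups, from the top-level group downwards."""
    parts = ec.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def are_siblings(a, b):
    """Two distinct identifiers are siblings when they share the same parent."""
    return a != b and level(a) == level(b) and parent(a) == parent(b)
```

For example, `ancestors("1.1.1.1")` yields the groups `1`, `1.1`, and `1.1.1`, and `are_siblings("2.4.2.3", "2.4.2.4")` is true because both share the parent `2.4.2`.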
Gene Ontology
KEGG
• To computerize all aspects of cellular functions in terms of the pathway of interacting molecules or genes
• To maintain gene catalogs for all organisms and link each gene product to a pathway component
• To organize a database of all chemical compounds in the cell and link each compound to a pathway component
• To develop computational technologies for pathway comparison, reconstruction, and analysis
Why RDF Integration?
• Pathway data model: DAG
  ◦ RDF is a good model for representing pathways
  ◦ RDF data model: DAG
• Multiple knowledge sources available on the Internet need to be integrated: one of the major problems facing biologists
  ◦ RDF is a good model for bringing them to a common standard
• Enzyme, GO: hierarchical structure
  ◦ RDF is a good model for representing hierarchical structures
• GO annotation is important
  ◦ Enzymes (proteins) in a given pathway need GO annotation
Related Works
• KEGG: Kyoto Encyclopedia of Genes and Genomes, 1999, Nucleic Acids Res.
• YeastHub: a semantic web use case for integrating data in the life sciences domain, 2005, Bioinformatics
• LIGAND: database of chemical compounds and reactions in biological pathways, 2002, Nucleic Acids Res.
• Gene Ontology: tool for the unification of biology, the Gene Ontology Consortium, 2000, Nature Genetics
Functions Supported by Our System
• KEGG
  ◦ Search compound
  ◦ Path prediction
  ◦ Search enzyme
• Functions added by our system
  ◦ Integration query (pathway + enzyme + GO)
  ◦ Relaxation query using GO hierarchy
  ◦ Searching pathways using enzyme information
Search Compounds
[Figure: compound search; target compound: C00668]
Pathway Prediction Tool
[Figure: compound input; relaxation query using enzyme hierarchy]
Search Enzyme
[Figure: enzyme search; enzyme: 5.3.1.9]
From Pathway to Gene Ontology
[Figure: select enzyme]
Data Translation for Integration
[Figure: data flow into GENOS Storage]
• KGML Data → XSLT (http://www.w3.org/2005/02/13-KEGG/) → KEGG RDF Data
• Enzyme RDF Data (adding GO ID)
• GO RDF Data
• The KEGG, Enzyme, and GO RDF data are stored in GENOS Storage
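The slides perform the KGML-to-RDF step with XSLT. As a stand-in, the sketch below shows the same kind of mapping in Python using only the standard library; it is not the stylesheet itself. The KGML fragment is a hand-written sample, the function names are hypothetical, and the triple layout only mirrors the `k:name`, `k:reaction`, and `k:link` properties seen in the KEGG RDF excerpts.

```python
import xml.etree.ElementTree as ET

# Base URI taken from the resource URIs shown on the slides.
BASE = "http://www.w3.org/2005/02/13-KEGG/"

# Hand-written sample in the style of a KGML entry (assumed for illustration).
KGML = """<pathway name="path:aae00010" org="aae" number="00010">
  <entry id="1" name="aae:aq_186" type="gene" reaction="rn:R00710"
         link="http://www.genome.jp/dbget-bin/www_bget?aae+aq_186"/>
</pathway>"""


def to_uri(kegg_id):
    """Turn a 'db:local' identifier into '<BASE>db#local'."""
    db, _, local = kegg_id.partition(":")
    return f"{BASE}{db}#{local}"


def entry_triples(kgml):
    """Emit (subject, predicate, object) tuples for each KGML entry.

    Simplification: assumes one identifier per 'name' attribute.
    """
    triples = []
    for entry in ET.fromstring(kgml).iter("entry"):
        node = "_" + entry.get("id")  # blank node label, as in rdf:nodeID
        triples.append((node, "rdf:type", entry.get("type").capitalize()))
        triples.append((node, "k:name", to_uri(entry.get("name"))))
        if entry.get("reaction"):
            triples.append((node, "k:reaction", to_uri(entry.get("reaction"))))
        if entry.get("link"):
            triples.append((node, "k:link", entry.get("link")))
    return triples
```

Calling `entry_triples(KGML)` produces a `Gene` node `_1` whose name and reaction URIs have the same shape as those in the Gene entry excerpt.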
KEGG RDF Data (1/2)
Gene entry
<k:entry>
  <Gene rdf:nodeID="_1">
    <k:name rdf:resource="http://www.w3.org/2005/02/13-KEGG/aae#aq_186"/>
    <k:reaction rdf:resource="http://www.w3.org/2005/02/13-KEGG/rn#R00710"/>
    <k:link rdf:resource="http://www.genome.jp/dbget-bin/www_bget?aae+aq_186"/>
    <k:graphics>
      <Rectangle k:name="aldH1" k:fgcolor="#000000"
                 k:bgcolor="#BFFFBF" k:x="170" k:y="1018" k:width="45" k:height="17"/>
    </k:graphics>
  </Gene>
</k:entry>
Enzyme entry: no information
<k:entry>
  <Enzyme rdf:nodeID="_3">
    <k:name rdf:resource="http://www.w3.org/2005/02/13-KEGG/ec#1.2.1.5"/>
    <k:graphics>
      <Rectangle k:name="1.2.1.5" k:fgcolor="#000000"
                 k:bgcolor="#FFFFFF" k:x="170" k:y="1039" k:width="45" k:height="17"/>
    </k:graphics>
  </Enzyme>
</k:entry>
Compound entry
<k:entry>
  <Compound rdf:nodeID="_4">
    <k:name rdf:resource="http://www.w3.org/2005/02/13-KEGG/cpd#C00033"/>
    <k:link rdf:resource="http://www.genome.jp/dbget-bin/www_bget?compound+C00033"/>
    <k:graphics>
      <Circle k:name="C00033" k:fgcolor="#000000"
              k:bgcolor="#FFFFFF" k:x="102" k:y="971" k:width="8" k:height="8"/>
    </k:graphics>
  </Compound>
</k:entry>
KEGG RDF Data (2/2)
Relation
<k:relation>
  <ECrel>
    <k:entry1 rdf:resource="_42"/>
    <k:entry2 rdf:resource="_48"/>
    <compound rdf:resource="_88"/>
  </ECrel>
</k:relation>
Reaction
<k:reaction reversible="" rdf:about="http://www.w3.org/2005/02/13-KEGG/rn#R00710">
  <k:substrate rdf:resource="http://www.w3.org/2005/02/13-KEGG/cpd#C00084"/>
  <k:product rdf:resource="http://www.w3.org/2005/02/13-KEGG/cpd#C00033"/>
</k:reaction>
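To illustrate how a reaction element of this form can be read as a substrate-to-product edge, here is a minimal Python sketch using only the standard library. The slides do not show the namespace URI bound to the `k:` prefix, so `K_NS` below is an assumption, and `reaction_edges` and `local_name` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
K_NS = "http://www.w3.org/2005/02/13-KEGG/ns#"  # assumed; not given on the slides

# The reaction excerpt, wrapped in an rdf:RDF root so it parses on its own.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:k="{K_NS}">
  <k:reaction rdf:about="http://www.w3.org/2005/02/13-KEGG/rn#R00710">
    <k:substrate rdf:resource="http://www.w3.org/2005/02/13-KEGG/cpd#C00084"/>
    <k:product rdf:resource="http://www.w3.org/2005/02/13-KEGG/cpd#C00033"/>
  </k:reaction>
</rdf:RDF>"""


def local_name(uri):
    """Part of a URI after the last '#'."""
    return uri.rsplit("#", 1)[-1]


def reaction_edges(rdf_xml):
    """Return (substrate, reaction, product) tuples for each described reaction."""
    edges = []
    for rxn in ET.fromstring(rdf_xml).iter(f"{{{K_NS}}}reaction"):
        about = rxn.get(f"{{{RDF_NS}}}about")
        if about is None:
            continue  # a k:reaction property inside an entry, not a description
        subs = [s.get(f"{{{RDF_NS}}}resource") for s in rxn.findall(f"{{{K_NS}}}substrate")]
        prods = [p.get(f"{{{RDF_NS}}}resource") for p in rxn.findall(f"{{{K_NS}}}product")]
        for s in subs:
            for p in prods:
                edges.append((local_name(s), local_name(about), local_name(p)))
    return edges
```

For the excerpt above this gives a single edge from C00084 to C00033 through reaction R00710.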
How to Process KEGG Pathway
• Problem
  ◦ GENOS (Sesame) does not support multiple graphs
  ◦ KEGG data consists of multiple documents
  ◦ Ex.) map00010.rdf, aae00010.rdf …
• Solution
  ◦ Using namespaces, we can distinguish maps
  ◦ When storing pathway data, the pathway's map name is added as a namespace in the resources table of GENOS
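The namespace idea can be sketched as a resources table keyed by (namespace, local name), so that the blank node `_1` from different map documents receives different IDs. This is an illustrative Python sketch, not the GENOS schema; `ResourceTable` and `map_namespace` are hypothetical names, and the `org#number` namespace form follows the table shown on the slides.

```python
class ResourceTable:
    """Assigns one integer ID per distinct (namespace, localname) pair."""

    def __init__(self):
        self._ids = {}
        self._next_id = 1

    def intern(self, namespace, localname):
        """Return the ID for the pair, allocating a new one on first use."""
        key = (namespace, localname)
        if key not in self._ids:
            self._ids[key] = self._next_id
            self._next_id += 1
        return self._ids[key]


def map_namespace(org, number):
    """Namespace derived from the pathway map name, e.g. 'aae#00010'."""
    return f"{org}#{number}"


table = ResourceTable()
id_aae_10 = table.intern(map_namespace("aae", "00010"), "_1")
id_aae_20 = table.intern(map_namespace("aae", "00020"), "_1")
id_map_10 = table.intern(map_namespace("map", "00010"), "_1")
id_again = table.intern(map_namespace("aae", "00010"), "_1")
```

The three `_1` nodes get three different IDs, while asking again for `_1` in `aae#00010` returns the ID already assigned.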
Processing Pathway Data
<k:Pathway k:org="aae" k:number="00010" k:title="Glycolysis / Gluconeogenesis">
….
….
  <k:entry>
    <Gene rdf:nodeID="_1">
      <k:name rdf:resource="http://www.w3.org/2005/02/13-KEGG/aae#aq_186"/>
      <k:reaction rdf:resource="http://www.w3.org/2005/02/13-KEGG/rn#R00710"/>
      <k:link rdf:resource="http://www.genome.jp/dbget-bin/www_bget?aae+aq_186"/>
      <k:graphics>
        <Rectangle k:name="aldH1" k:fgcolor="#000000"
                   k:bgcolor="#BFFFBF" k:x="170" k:y="1018" k:width="45" k:height="17"/>
      </k:graphics>
    </Gene>
  </k:entry>
resources table of GENOS
ID   NameSpace   Localname
1    …           …
2    …           Glycolysis/…
3    aae#00010   _1
4    …           aq_186
5    …           …
6    aae#00020   _1
7    …           …
8    map#00010   _1
9    …           …
conflict: the local name _1 occurs in several maps (IDs 3, 6, 8) and is distinguished only by its namespace

triples table of GENOS
Subject   Predicate   Object
…         …           …
3         …           …
6         …           …
8         …           …
…         …           …
Integrating Databases
[Figure: the databases are linked through the enzyme number and the GO ID]
Relaxation Querying using SeRQL
[Figure: compound C1 is converted to compound C2 by enzyme E1; the enzymes E1.* are related to E1 by subclassof]
SeRQL
SELECT C1, C2
FROM Path_EXP
WHERE E1 LIKE "1.*"
• Uses prefix matching
• Dewey order
• Ex.) 1.1 and 1.2 are children of 1
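The prefix-based relaxation can be sketched outside SeRQL as follows. This is a minimal Python illustration under stated assumptions: the path steps, compound labels, and EC numbers in `STEPS` are made up for the example, and `ec_has_prefix` and `relaxed_pairs` are hypothetical names. The comparison is done per dotted component, since a plain string prefix such as `1.2.1` would also match `1.2.10.1`.

```python
def ec_has_prefix(ec, prefix):
    """True when 'prefix' is an ancestor of (or equal to) 'ec' in Dewey order."""
    ec_parts = ec.split(".")
    prefix_parts = prefix.split(".")
    return ec_parts[:len(prefix_parts)] == prefix_parts


def relaxed_pairs(steps, ec, levels_up=1):
    """Compound pairs whose enzyme falls under an ancestor group of 'ec'.

    steps: iterable of (substrate, enzyme, product) tuples.
    levels_up: how many hierarchy levels to relax (at least the top level is kept).
    """
    parts = ec.split(".")
    keep = max(1, len(parts) - levels_up)
    prefix = ".".join(parts[:keep])
    return [(c1, c2) for c1, enzyme, c2 in steps if ec_has_prefix(enzyme, prefix)]


# Made-up path steps for illustration only.
STEPS = [
    ("Ca", "1.2.1.3", "Cb"),
    ("Cb", "1.2.1.5", "Cc"),
    ("Cc", "1.2.10.1", "Cd"),
    ("Cd", "2.7.1.1", "Ce"),
]
```

With these steps, `relaxed_pairs(STEPS, "1.2.1.3")` relaxes the enzyme to the group `1.2.1` and returns the first two pairs; `levels_up=3` relaxes to the top-level group `1` and also returns the third.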
Considering Performance
KEGG: Pathway List
Genes        Map
aae:aq_018   path:aae03010
aae:aq_020   path:aae03010
aae:aq_021   path:aae00400
….           ….
eco:b1236    path:eco00052
eco:b1236    path:eco00500
eco:b1236    path:eco00520
….           ….
• Using genes_index
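One way to read the performance point is that a gene-to-pathway index avoids scanning every map document to find where a gene occurs. The sketch below builds such an index in Python from tab-separated gene/pathway lines. The pairing of genes and maps is assumed from the column order of the list on the slide, and `build_genes_index` is a hypothetical name, not part of KEGG or GENOS.

```python
from collections import defaultdict

# Gene / pathway pairs, assumed from the column order on the slide.
PATHWAY_LIST = """\
aae:aq_018\tpath:aae03010
aae:aq_020\tpath:aae03010
aae:aq_021\tpath:aae00400
eco:b1236\tpath:eco00052
eco:b1236\tpath:eco00500
eco:b1236\tpath:eco00520
"""


def build_genes_index(text):
    """Map each gene ID to the list of pathway maps it appears in."""
    index = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        gene, pathway = line.split("\t")
        index[gene].append(pathway)
    return dict(index)


genes_index = build_genes_index(PATHWAY_LIST)
```

A lookup such as `genes_index["eco:b1236"]` then returns the maps for that gene directly, so only those maps need to be queried.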
Schedule
• Implementation (~11/30)
  ◦ Integrated databases
  ◦ Query processor for pathways
  ◦ Simple UI (Web: JSP)
• Complete paper (~12/10)