Discovering Disease Associations using a Biomedical Semantic Web: Integration and Ranking
Ranga Chandra Gudivada1,2, Xiaoyan A. Qu1,2, Anil G. Jegga2,3,4, Eric K. Neumann5, Bruce J. Aronow1,2,3,4
Departments of Biomedical Engineering1 and Pediatrics2, University of Cincinnati; Center for Computational Medicine3 and Division of Biomedical Informatics4,
Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA; and Teranode Corporation5, Seattle, WA 98104
Abstract
One of the principal goals of biomedical research is to elucidate the
complex network of gene interactions underlying common human
diseases. Although integrative genomics approaches have been shown
to be successful in understanding the underlying pathways and
biological processes in normal and disease states, most current
biomedical knowledge is spread across different databases in
different formats. Semantic Web principles, standards and
technologies provide an ideal platform to integrate such
heterogeneous information and to bring forth implicit relations
hitherto embedded in these large integrated biomedical and genomic
datasets. Semantic Web query languages such as SPARQL can be used
effectively to mine the biological entities underlying complex
diseases through richer and more complex queries on this integrated
data. However, the end results are frequently large and unmanageable.
Thus, there is a great need for techniques that rank resources on the
Semantic Web, so that query results can be retrieved in ranked order
and information overload is prevented. Such ranking can be used to
prioritize the novel disease–gene, disease–pathway or disease–process
relationships that are discovered. We implemented an existing
Semantic Web based knowledge mining technique which not only
discovers the underlying genes, processes and pathways of diseases
but also determines the importance of the resources, so that the
results of a search can be ranked while the semantic associations are
determined.

Biological Problem
- Disease genes discovered to date likely represent the easy ones.
  Discovering the genetic basis of the remaining Mendelian and complex
  gene-X-gene-X-environment disorders will be challenging and will
  require consideration of many more features and causal relationships.
- Identifying modifier genes, i.e. the gene networks underlying
  diseases (pathways, biological processes and functions), is
  challenging.
- No gene operates in a vacuum; all gene, protein and pathway
  interactions can lead to modifier gene effects.

Computational Problem
- Data integration: biological feature complexity is deep,
  heterogeneous, and extensive.
- Data complexity poses a formidable challenge to efforts to integrate,
  formally model, and simulate biological systems behaviors.
- Likelihood ranking requires mining and prioritization of entities and
  events that function in the context of biological networks.

Benefits of Semantic Web
- Semantic Web standards such as the Resource Description Framework
  (RDF) and the Web Ontology Language (OWL) facilitate semantic
  integration of heterogeneous multi-source data.
- SPARQL, a Semantic Web query language capable of querying
  higher-order relationships in multidimensional data, can be used to
  mine Bio-RDF graphs.
- Prioritization of biological entities on the Semantic Web can be
  accomplished by extending [2] and applying existing graph
  algorithms, such as the Kleinberg algorithm [1].

Data Integration - RDF Model
[Figure: RDF model linking disease, gene/protein, pathway, Gene Ontology, phenotype, interaction and anatomy resources. Recoverable labels:]
  Disease: OMIM; Disease CUI; Disease Name (rdfs:label)
  Gene / Protein Annotations: Entrez Gene; SwissProt; Gene Symbol
  Pathways: BIOCARTA, KEGG, BIOCYC, REACTOME, Nature Pathway Interaction Database, others; Pathway Id; Pathway Description
  Gene Ontology: Biol.Process, Mol.Function, Cell.Component (each with GO ID and Description)
  Mammalian Phenotype: Mouse Phenotype ID; Mouse Phenotype Description
  Molecular Interactions: BIND, others
  Anatomy: CUI; Name
  Properties: hasAssociatedGene, hasAssociatedAnatomy, inBiologicalProcess, associatedPathway, interacts

Ranking on Semantic Web
Kleinberg Algorithm [1]
- Hub nodes: a node that points to many authoritative sites gains a
  higher hub score.
- Authoritative nodes: a node that is pointed to by good hubs gains a
  higher authoritative score.
[Figure: hub nodes (high hub score) pointing to an authoritative node (high authoritative score)]

Extending the Kleinberg Algorithm [2] for the Semantic Web
Each property carries a subjectivity weight and an objectivity weight.
- Non-symmetric property, gene -> associatedPathway -> Pathway:
  subjectivity weight > objectivity weight. A single gene
  participating in multiple biological pathways is considered more
  sensitive to perturbation than a single pathway having a large
  number of nodes (different weights for non-symmetric properties).
- Corollary, symmetric property, geneA -> interacts -> geneB:
  subjectivity weight = objectivity weight. GeneA interacting with
  various genes has the same significance as GeneB interacting with
  various genes (equal weights for symmetric properties).

Case Study - Prioritizing Modifier Genes, Pathways and Biological Processes for CARDIOMYOPATHY, DILATED
Step 1
[Figure: OMIM disease resource with rdfs:label "CARDIOMYOPATHY, DILATED, X-LINKED", linked by hasAssociatedGene to its primary gene]
  Primary gene (1): DMD
  Interacting partners of the primary gene: (16)
  Biological processes of the primary gene (4):
    GO_0006936  muscle contraction
    GO_0007016  cytoskeletal anchoring
    GO_0043043  peptide biosynthesis
    GO_0007517  muscle development
  Pathway of the primary gene (1):
    h_agrPathway  Agrin in Postsynaptic Differentiation

Step 2
[Figure: primary gene + interacting partners (1+16) are passed to a SPARQL query, which returns Biological Processes (27), Pathways (28) and Modifier Genes (16)]

SPARQL QUERY
  PREFIX CCHMC: <http://www.cchmc.com/test.owl#>
  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  SELECT DISTINCT ?pathway
  WHERE {
    ?pathway rdf:type CCHMC:Pathway .
    ?resource ?PROPERTY ?pathway .
  }

QUERY RESULT WITH PRIORITIZATION

Modifier Genes (16)
  Rank  GeneSymbol  Score        PubMed Evidence
  1     UTRN        21.89344952  12868498 10423348 11186993
  2     FASLG       17.42028994  16168288 16080838
  3     ACTA1       12.36025539  16945537 10508519
  4     DTNA        8.888475658  16644324 16427346
  5     DAG1        5.893112758  15117830 14564412
  6     KCNJ12      4.838225059  Novel Gene
  7     SNTA1       4.623228312  16427346

Pathways (28)
  Rank  Pathway Id           Description                                                              Score
  1     h_agrPathway         Agrin in Postsynaptic Differentiation                                    1.134984242
  2     h_hsp27Pathway       Stress Induction of HSP Regulation                                       0.139887918
  3     h_actinYPathway      Y branching of actin filaments                                           0.093908976
  3     h_no1Pathway         Actions of Nitric Oxide in the Heart                                     0.093908976
  3     h_nfatPathway        NFAT and Hypertrophy of the heart (Transcription in the broken heart)    0.093908976
  3     h_metPathway         Signaling of Hepatocyte Growth Factor Receptor                           0.093908976
  3     h_salmonellaPathway  How does salmonella hijack a cell                                        0.093908976
  3     h_mCalpainPathway    mCalpain and friends in Cell motility                                    0.093908976
  3     h_PDZsPathway        Synaptic Proteins at the Synaptic Junction                               0.093908976
  3     h_rabPathway         Rab GTPases Mark Targets In The Endocytotic Machinery                    0.093908976

Biological Processes (27)
  Rank  GO ID       Description                        Score
  1     GO_0006936  muscle contraction                 1.5385859
  2     GO_0007517  muscle development                 0.3562762
  3     GO_0007165  signal transduction                0.1139403
  4     GO_0048741  skeletal muscle fiber development  0.1102909
  4     GO_0030240  muscle thin filament assembly      0.1102909
  4     GO_0043043  peptide biosynthesis               0.1027902
  4     GO_0007016  cytoskeletal anchoring             0.1027902

Conclusion
We have shown that related yet heterogeneous information can be
integrated using RDF-OWL and that this approach can support
mechanistic analyses of diseases. Specifically, we have uncovered
additional genes and pathways that could play a role in the onset and
treatment of Cardiomyopathy.
We intend to expand our analyses into additional modalities such as
anatomy, cellular type, and symptoms/phenotypes.
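
The ranking idea described under "Extending Kleinberg Algorithm for Semantic Web" can be illustrated with a minimal sketch. It is not the authors' implementation: the function name weighted_hits, the toy graph (geneA, geneB, geneC, pathway1, pathway2) and the weight values are all hypothetical, chosen only to show how a per-property subjectivity weight (hub-like) and objectivity weight (authority-like) might enter the iterative score update.

```python
import math


def normalise(scores):
    """Scale a score dict to unit L2 norm (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {k: v / norm for k, v in scores.items()}


def weighted_hits(edges, weights, iterations=50):
    """Property-weighted variant of hub/authority scoring.

    edges:   list of (subject, property, object) triples.
    weights: property -> (subjectivity_weight, objectivity_weight);
             these values are assumptions supplied by the caller.
    Returns (subjectivity_scores, objectivity_scores).
    """
    nodes = {n for s, _, o in edges for n in (s, o)}
    subj = dict.fromkeys(nodes, 1.0)
    obj = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Objectivity (authority-like): raised by strong subjects pointing in.
        new_obj = dict.fromkeys(nodes, 0.0)
        for s, p, o in edges:
            new_obj[o] += weights[p][1] * subj[s]
        # Subjectivity (hub-like): raised by strong objects pointed to.
        new_subj = dict.fromkeys(nodes, 0.0)
        for s, p, o in edges:
            new_subj[s] += weights[p][0] * new_obj[o]
        obj = normalise(new_obj)
        subj = normalise(new_subj)
    return subj, obj


# Hypothetical toy graph: geneA participates in two pathways.
edges = [
    ("geneA", "associatedPathway", "pathway1"),
    ("geneA", "associatedPathway", "pathway2"),
    ("geneB", "associatedPathway", "pathway1"),
    ("geneC", "associatedPathway", "pathway1"),
]
# Non-symmetric property: subjectivity weight > objectivity weight (assumed values).
weights = {"associatedPathway": (1.0, 0.5)}

subj, obj = weighted_hits(edges, weights)
genes_ranked = sorted(("geneA", "geneB", "geneC"), key=subj.get, reverse=True)
pathways_ranked = sorted(("pathway1", "pathway2"), key=obj.get, reverse=True)
print(genes_ranked, pathways_ranked)
```

In this toy graph the gene that participates in more pathways receives the highest subjectivity score, and the pathway linked to more genes receives the highest objectivity score. A symmetric property such as interacts could be handled under the same assumptions by supplying each edge in both directions with equal weights.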
References
1. Kleinberg, J. M. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46, 5 (Sep. 1999).
2. Bamba, B., Mukherjea, S. 2004. Utilizing Resource Importance for Ranking Semantic Web Query Results. SWDB 2004: 185-198.