Knowledge-Based Integration of Neuroscience Data Sources
Amarnath Gupta
Bertram Ludäscher
Maryann Martone
University of California San Diego
A Standard Information Mediation Framework
[Figure: mediation architecture. Labels, top to bottom: Client Query; Integrated XML View with its View Definition; Mediator; XML Views; Wrappers; Data Source, XML Data Source, Data Source.]
A Neuroscience Question
Cerebellar distribution of rat proteins with more than 70%
homology with human NCS-1? Any structure specificity?
How about other rodents?
[Figure: Integrated View with its View Definition at the Mediator, over Wrappers for WWW sources. Source labels: protein localization, morphometry, neurotransmission, CaBP, Expasy.]
Integration Issues
• Structural Heterogeneity
– Resolved by converting to a common semistructured data model
• Heterogeneity in Query Capabilities
– Resolved by writing wrappers with binding patterns and other capability-definition languages
• Semantic Heterogeneity
– Schema conflicts
• Partially resolved by mapping rules in the mediator
– Hidden Semantics?
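As an illustration of the binding-pattern idea, here is a minimal Python sketch of how a mediator might check whether a wrapped source can answer a call. The relation names, their argument lists, and the `b`/`f` adornment strings are assumptions made for this example; they are not taken from the slides.

```python
# Hypothetical capability declarations for two wrapped source relations.
# Each pattern has one letter per argument:
#   'b' = the caller must supply (bind) this argument
#   'f' = the argument may be left free
BINDING_PATTERNS = {
    # homologues(sequence, protein, score): a sequence must be supplied
    "homologues": ["bff"],
    # localization(protein, region, amount): callable by protein or by region
    "localization": ["bff", "fbf"],
}


def is_answerable(relation, bound):
    """Return True if a call with this bound/free adornment satisfies
    at least one binding pattern declared for the relation."""
    for pattern in BINDING_PATTERNS.get(relation, []):
        if len(pattern) != len(bound):
            continue
        # Every argument the pattern requires to be bound must be bound.
        if all(p == "f" or b == "b" for p, b in zip(pattern, bound)):
            return True
    return False
```

A mediator could use such a check during query decomposition to order sub-queries so that each source call receives the bindings it requires.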
Hidden Semantics: Protein Localization
[Figure labels: Purkinje Cell layer of Cerebellar Cortex; Molecular layer of Cerebellar Cortex]
XML fragment:
<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</name>
    ….
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</name>
        <amount name="RyR">0</amount>
      </structure>
      <structure fraction="0.2">
        <name>branchlet</name>
        <amount name="RyR">30</amount>
      </structure>
Slide callout beside the branchlet entry: dendrite
Hidden Semantics: Morphometry
XML fragment:
<neuron name="purkinje cell">
  <branch level="10">
    <shaft>
      …
    </shaft>
    <spine number="1">
      <attachment x="5.3" y="-3.2" z="8.7" />
      <length>12.348</length>
      <min_section>1.93</min_section>
      <max_section>4.47</max_section>
      <surface_area>9.884</surface_area>
      <volume>7.930</volume>
      <head>
        <width>4.47</width>
        <length>1.79</length>
      </head>
    </spine>
    …
Slide callouts:
– Branch level beyond 4 is a branchlet
– Must be dendritic because Purkinje cells don’t have somatic spines
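The two pieces of hidden knowledge on this slide (a branch level beyond 4 is a branchlet; a Purkinje cell spine must be dendritic) can be made explicit in code. The following is a minimal Python sketch, assuming a well-formed version of the morphometry XML. The second branch in the sample data and the output field names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Sample morphometry data. The first branch follows the slide; the second
# branch and its values are made up to show the contrasting case.
SAMPLE = """
<neuron name="purkinje cell">
  <branch level="10">
    <spine number="1"><length>12.348</length></spine>
  </branch>
  <branch level="2">
    <spine number="2"><length>3.1</length></spine>
  </branch>
</neuron>
"""

BRANCHLET_MIN_LEVEL = 5  # "branch level beyond 4 is a branchlet"


def annotate_spines(xml_text):
    """Attach the implicit anatomical context to every spine."""
    neuron = ET.fromstring(xml_text)
    # Purkinje cells have no somatic spines, so their spines are dendritic.
    is_purkinje = neuron.get("name") == "purkinje cell"
    annotated = []
    for branch in neuron.findall("branch"):
        level = int(branch.get("level"))
        structure = "branchlet" if level >= BRANCHLET_MIN_LEVEL else "branch"
        for spine in branch.findall("spine"):
            annotated.append({
                "spine": spine.get("number"),
                "on": structure,
                "compartment": "dendrite" if is_purkinje else "unknown",
            })
    return annotated
```

With this context attached, a spine record from the morphometry source becomes comparable with the spine and branchlet entries of the protein localization source.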
The Problem
• Multiple Worlds Integration
– compatible terms not directly joinable
– complex, indirect associations among schema elements
– unstated integrity constraints
• Why not use ontologies?
– typical ontologies associate terms along a limited number of dimensions
• What’s needed
– a “theory” under which non-identical terms can be “semantically” joined
Our Approach
• Modify the standard Mediation Architecture
– Wrapper
• Extend to encode an object-version of the structure schema
– Mediator
• Redesign to incorporate auxiliary knowledge sources to
– Correlate object schema of sources
– Define additional objects not specified but derivable from sources
• At the Mediator
– Use a logic engine to
• Encode the mapping rules between sources
• Define integrated views using a combination of exported objects from source and the auxiliary knowledge sources
• Perform query decomposition
• We still use Global-as-View form of mediation
The KIND Architecture
[Figure: KIND architecture. Labels: Integrated User View; View Definition Rules; Logic Engine; Integration Logic; Schema of Registered Sources; Materialized Views; Auxiliary Knowledge Source 1 and Auxiliary Knowledge Source 2; two Object Wrappers and two Structure Wrappers over Src 1 and Src 2.]
The Knowledge-Base
• Situate every data object in its anatomical context
– An illustration
– New data is registered with the knowledge-base
– Insertion of new data reconciles the current knowledge-base with the new information by:
• Indexing the data with the source as part of registration
• Extending the knowledge-base
• Creating new views with complex rules to encode additional domain knowledge
F-Logic for the Mediation Engine
• Why F-Logic?
– Provides the power of Datalog (with negation) and
object creation through Skolem IDs
– Correct amount of “notational sugar” and rules to
provide object-oriented abstraction
– Schema-level reasoning
– Expressing variable arity
• F-Logic in KIND
– Source schema wrapped into F-Logic schema
– Knowledge-sources programmed in F-Logic
– Definition of Integrated Views
Wrapping into Logic Objects
• Automated Part
DTD of the source:
<!ELEMENT Studies (Study)*>
<!ELEMENT Study (study_id, … animal, experiments, experimenters)>
<!ELEMENT experiments (experiment)*>
<!ELEMENT experiment (description, instrument, parameters)>
Generated F-Logic schema:
studyDB[studies =>> study].
study[study_id => string; …
animal => animal;
experiments =>> experiment;
experimenters =>> string].
…
• Non-automated Part
– Subclasses: mushroom_spine::spine
– Rules: S:mushroom_spine IF S:spine[head -> _; neck -> _].
– Integrity Constraints: ic1(S):alert[type -> "invalid spine"; object -> S] IF S:spine[undef ->> {head, neck}].
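To make the non-automated part concrete, here is a minimal Python sketch of the subclass rule and the integrity constraint. It assumes spines are plain dictionaries and that the constraint fires when both head and neck are undefined. The dictionary layout and function names are illustrative and not part of KIND.

```python
def classify(spine):
    """Return the classes a spine object belongs to.

    Mirrors the rule: a spine with both a head and a neck is also a
    mushroom_spine (mushroom_spine is a subclass of spine).
    """
    classes = {"spine"}
    if spine.get("head") is not None and spine.get("neck") is not None:
        classes.add("mushroom_spine")
    return classes


def check_ic1(spine_id, spine):
    """Return an alert object if head and neck are both undefined,
    otherwise None. This reading of the constraint is an assumption."""
    if spine.get("head") is None and spine.get("neck") is None:
        return {"type": "invalid spine", "object": spine_id}
    return None
```

For example, `classify({"head": {"width": 4.47}, "neck": {"length": 1.2}})` includes `mushroom_spine`, while `check_ic1("s7", {})` yields an alert for object `s7` (the neck value and the identifier `s7` are made up).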
Computing with Auxiliary Sources
• Creating Mediated Classes
– Union view:
animal[M => R] IF S:source, S.animal[M => R].
animal[taxon => 'TAXON'.taxon].
– Association rule:
X[taxon -> T] IF X:'PROLAB'.animal[name -> N],
words(N,[W1,W2|_]),
T:'TAXON'.taxon[genus -> W1; species -> W2].
• Reasoning with Schema
– Schema:
taxon[subspecies => string; species => string; genus => string; … phylum => string; kingdom => string; superkingdom => string].
– At Mediator:
subspecies::species::genus:: … kingdom::superkingdom
– Class creation by schema reasoning:
T:TR, TR::TR1 IF
T:'TAXON'.taxon[Taxon_Rank -> TR, Taxon_Rank1 -> TR1],
Taxon_Rank::Taxon_Rank1.
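The association rule and the class creation by schema reasoning can be sketched in Python as follows. This is an illustration only: the taxon table holds a single hand-written entry, the rank list is a short assumed fragment of the chain shown on the slide, and all function names are hypothetical.

```python
# Assumed fragment of the rank chain, lowest rank first.
RANKS = ["species", "genus", "family", "order"]

# Hand-written stand-in for the TAXON knowledge source.
TAXA = [
    {"species": "norvegicus", "genus": "Rattus",
     "family": "Muridae", "order": "Rodentia"},
]


def link_taxon(animal_name, taxa):
    """Association rule: the first two words of the animal name are
    matched against genus and species of a taxon record."""
    words = animal_name.split()
    if len(words) < 2:
        return None
    genus, species = words[0], words[1]
    for taxon in taxa:
        if taxon["genus"] == genus and taxon["species"] == species:
            return taxon
    return None


def subclass_edges(taxon):
    """Schema reasoning: because the ranks are ordered, the value at one
    rank becomes a subclass of the value at the next higher rank.
    Only adjacent pairs are returned; the rest follow by transitivity."""
    present = [taxon[rank] for rank in RANKS if rank in taxon]
    return list(zip(present, present[1:]))
```

Once an animal from a source is linked to its taxon, a question about "other rodents" reduces to finding animals whose taxon lies below `Rodentia` in the derived class hierarchy.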
Integrated View Definition
• Views are defined between sources and knowledge base
• Example: protein_distribution
– given: organism, protein, brain_region
– KB Anatom:
• recursively traverse the has_a paths under brain_region and collect all anatomical_entities
– Source PROLAB:
• join with anatomical structures and collect the value of attribute “image.segments.features.feature.protein_amount” where “image.segments.features.feature.protein_name” = protein and “study_db.study.animal.name” = organism
– Mediator:
• aggregate over all parents up to brain_region
• report distribution
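The three steps of this view (traverse the knowledge base, join with the source, aggregate upward) can be sketched in Python. The `has_a` fragment, the records, and the choice of summation as the aggregate are assumptions for illustration. The sketch also assumes the `has_a` hierarchy is a tree, so that no amount is counted twice.

```python
# Tiny stand-in for the anatomical knowledge base (has_a edges).
HAS_A = {
    "cerebellum": ["cerebellar cortex"],
    "cerebellar cortex": ["purkinje cell layer", "molecular layer"],
}

# Made-up source records: (organism, protein, anatomical entity, amount).
RECORDS = [
    ("rat", "RyR", "purkinje cell layer", 30.0),
    ("rat", "RyR", "molecular layer", 10.0),
    ("mouse", "RyR", "molecular layer", 5.0),
]


def descendants(region):
    """Recursively follow has_a edges; the region itself comes first."""
    found = [region]
    for child in HAS_A.get(region, []):
        found.extend(descendants(child))
    return found


def protein_distribution(organism, protein, brain_region):
    """Join source records with the entities under brain_region and
    aggregate amounts from each entity's subtree up to that entity."""
    direct = {}
    for org, prot, entity, amount in RECORDS:
        if org == organism and prot == protein:
            direct[entity] = direct.get(entity, 0.0) + amount
    return {
        entity: sum(direct.get(d, 0.0) for d in descendants(entity))
        for entity in descendants(brain_region)
    }
```

Calling `protein_distribution("rat", "RyR", "cerebellum")` on this toy data reports one total per anatomical entity, with each parent carrying the sum of its parts.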
Query Evaluation Example
• protein distribution of Human NCS-1 homologue (a second integrated view)
– from wrapped CaBP website:
• get the amino acid sequence for human NCS-1
– from wrapped Expasy website:
• submit amino acid sequence, get ranked homologues
– at Mediator:
• select homologues H found in rat with homology > 0.70
– at Mediator:
• for each h in H
– from previous view:
» protein_distribution(rat, h, cerebellum, distribution)
• Construct result
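The evaluation plan above can be sketched as a small Python function that composes the wrapped sources and the earlier view. The three callables are stand-ins passed as parameters. The stub implementations, protein names, scores, and the placeholder sequence are all invented, and nothing here reflects the real CaBP or Expasy interfaces.

```python
def homologue_distribution(query_protein, organism, region, threshold,
                           get_sequence, ranked_homologues,
                           protein_distribution):
    """Follow the plan: fetch the sequence, fetch ranked homologues,
    filter at the mediator, then call the previous view for each hit."""
    sequence = get_sequence(query_protein)
    selected = [
        protein
        for protein, found_in, score in ranked_homologues(sequence)
        if found_in == organism and score > threshold
    ]
    return {
        protein: protein_distribution(organism, protein, region)
        for protein in selected
    }


# Stubs standing in for the wrapped sources and the previous view.
def fake_get_sequence(protein):
    return "PLACEHOLDER-SEQUENCE"


def fake_ranked_homologues(sequence):
    # (protein, organism, homology score in [0, 1])
    return [
        ("protein_A", "rat", 0.92),
        ("protein_B", "rat", 0.55),
        ("protein_C", "mouse", 0.88),
    ]


def fake_protein_distribution(organism, protein, region):
    return {region: 1.0}


result = homologue_distribution(
    "NCS-1", "rat", "cerebellum", 0.70,
    fake_get_sequence, fake_ranked_homologues, fake_protein_distribution,
)
```

Passing the sources in as functions keeps the plan itself independent of how each wrapper is implemented, which is the role the mediator plays in the slides.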
Implementation
• System
– Flora as F-Logic Engine
– Communicate with ODBC databases through
underlying XSB Prolog
– XML wrapping and Web querying through XMAS, our
XML query language and custom-built wrappers
• Data
– Human Brain Project sites
– NPACI Neuroscience Thrust sites
Work in Progress
• Architecture
– plug-in architecture for
• domain knowledge sources
• conceptual models from data sources
• Functionality
– better handling of large data
– operations
• expressive query language
• operators for domain knowledge manipulation
– query evaluation
• query optimization using domain knowledge
• Demonstration
– at VLDB 2000