From Data Integration to Semantic Mediation


From Data Integration To Semantic
Mediation:
Addressing Heterogeneities in Data
Bertram Ludäscher
Knowledge-Based Information Systems Lab
San Diego Supercomputer Center
and
Department of Computer Science & Engineering
University of California, San Diego
Outline
1. Information Integration from a Database Perspective
2. XML-Based Data Integration
3. Model-Based / Semantic Mediation
4. Discussion
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of
Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
[Figure: the question is posed to an Information Integration Mediator (virtual DB, vs. Datawarehouse); sites shown: addall.com, amazon.com, barnes&noble.com, half.com, A1books.com]
“One-World” Scenario: XML-based mediator
A Home Buyer’s Information Integration Problem
Which houses for sale under $500k have at least 2 bathrooms, 2 bedrooms,
a nearby school ranking in the upper third, in a neighborhood
with below-average crime rate and diverse population?
[Figure: the question is posed to an Information Integration layer over the sources Realtor, Crime Stats, School Rankings, Demographics]
“Multiple-Worlds” Scenario: XML-based mediator
A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70%
homology with human NCS-1? Any structure specificity?
How about other rodents?
[Figure: the question is posed to an Information Integration layer over the sources protein localization (NCMIR), sequence info (CaPROT), morphometry (SYNAPSE), neurotransmission (SENSELAB)]
“Complex Multiple-Worlds” Scenario: Model-based mediator
A Geoscientist’s Information Integration Problem
What are the distribution and U/Pb zircon ages of A-type plutons in VA?
How about their 3-D geometry?
How does it relate to host rock structures?
[Figure: the question is posed to an Information Integration layer over the sources Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)]
“Complex Multiple-Worlds” Scenario: Model-based mediator
Information Integration Challenges:
Heterogeneities = S4...
• System Aspects
– platforms, devices, distribution, APIs, protocols, …
• Syntaxes
– heterogeneous data formats (one for each tool ...)
• Structures
– heterogeneous schemas (one for each DB ...)
– heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs,
flat files, …)
• Semantics
– unclear & “hidden” semantics : e.g., incoherent terminology,
multiple / informal taxonomies, implicit assumptions, ...
Information Integration Challenges
Layers: Semantics / Structure / Syntax / System aspects
=> reconciling S4 heterogeneities
=> “gluing” together multiple data sources
=> bridging information and knowledge gaps computationally
• System aspects: “Grid” middleware
– distributed data & computing
– Web services, WSDL/SOAP, …
– sources = functions, files, databases, …
• Syntax & Structure:
(XML-Based) Mediators
– wrapping, restructuring
– (XML) queries and views
– sources = (XML) databases
• Semantics:
Model-Based/Semantic Mediators
– conceptual models and declarative views
– Semantic Web: ontologies, description
logics, RDF(S), DAML+OIL, OWL, ...
– sources = knowledge bases (DB+CMs+ICs)
Information Integration from a DB Perspective
• Information Integration Problem
– Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user
questions Q1,..., Qn that can be answered using the Si
– Find: the answers to Q1, ..., Qn
• The Database Perspective: source = “database”
=> Si has a schema (relational, XML, OO, ...)
=> Si can be queried
=> define virtual (or materialized) integrated views V over
S1, ..., Sk using database query languages (SQL, XQuery, ...)
=> questions become queries Qi against V(S1, ..., Sk)
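The database perspective above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two in-memory sources `S1` and `S2`, their field names, and the helper `Q_cheapest` are hypothetical and do not come from the talk. The point is only that the user query is written against the virtual view `V`, never against the sources directly.

```python
from typing import Dict, Iterator, List

# Hypothetical sources with heterogeneous schemas (illustration only).
S1: List[Dict] = [  # exports (isbn, title, price_usd)
    {"isbn": "111", "title": "Tractatus", "price_usd": 12.0},
    {"isbn": "222", "title": "SemWeb Tractat", "price_usd": 30.0},
]
S2: List[Dict] = [  # exports (id, name, cents)
    {"id": "111", "name": "Tractatus", "cents": 995},
]

def V() -> Iterator[Dict]:
    """Virtual integrated view V(S1, S2): one common schema, computed on demand."""
    for r in S1:
        yield {"isbn": r["isbn"], "title": r["title"],
               "price": r["price_usd"], "source": "S1"}
    for r in S2:
        yield {"isbn": r["id"], "title": r["name"],
               "price": r["cents"] / 100, "source": "S2"}

def Q_cheapest(isbn: str) -> Dict:
    """A user question expressed as a query against V, not against S1 or S2."""
    return min((r for r in V() if r["isbn"] == isbn), key=lambda r: r["price"])

cheapest = Q_cheapest("111")
print(cheapest)
```

Because `V` is a generator, nothing is materialized; a data warehouse would instead store the output of `V` and query the stored copy.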
Extensible Markup Language (XML)
... in their wonderful book called SemWeb Tractat by B. Schatz and T.B. Lee, the authors show how ...
... in their wonderful book called <title>SemWeb Tractat</title> by <author>B. Schatz</author> and <author>T.B. Lee</author>, the authors show how ...
<book>
<title>SemWeb Tractat</title>
<author>B. Schatz</author>
<author>T.B. Lee</author>
</book>
book
  title: “SemWeb Tractat”
  author: “B. Schatz”
  author: “T.B. Lee”
• (meta)language for marking up text & data
with user-definable tags
– (X)HTML, XSLT, XML Schema, ...
– MathML, BioML, GeoML, NeuroML, ...
– XML-RPC, SOAP, WSDL, OWL, ...
• semistructured tree data model
• container model:
– flexible: marked-up text, web-pages,
databases, ...
– “boxes within boxes”
XML-Based Mediator Architecture
USER/Client
Query Q ( G (S1,..., Sk) )
MEDIATOR
– Integrated Global XML View G
– Integrated View Definition: G(..) S1(..)…Sk(..)
XML Queries & Results
XML View (one per source)
Wrapper (one per source)
S1, S2, ..., Sk
Some Challenges in XML-Based Integration ...
• XML Query/Transformation Languages
– DB community: QLs for semistructured data, e.g.,
TSIMMIS/MSL, Lorel, Yatl, ..., Florid/F-logic [InfSystems98]
– CSE/SDSC: XMAS [SSD99,SIGMOD99,WebDB99,EDBT00]
– W3C: XPath, XSLT, XQuery (Working Draft, June 2001)
• XML Schema Languages
– DTDs, RELAX NG, XML Schema, ... [XMLDM02]
• DB Theoreticians:
– Expressiveness/Complexity Trade-Off
• querying: FO, (WF/S-)Datalog, FO(LFP), FO(PFP), ... , all
• reasoning: query satisfiability, containment, equivalence
• ...
XMAS: XML Matching And Structuring language
CONSTRUCT <books>
            <book>
              $a1
              $t
              <pubs>
                $p { $p }
              </pubs>
            </book> { $a1, $t }
          </books>
WHERE <books.book>
        $a1 : <author />
        $t : <title />
      </> IN "amazon.com"
  AND <authors.author>
        $a2 : <author />
        <pubs> $p : <pub/> </>
      </> IN "www...DBLP… "
  AND value( $a1 ) = value( $a2 )
Integrated View Definition:
“Find books from amazon.com
and DBLP, join on author,
group by authors and title”
XMAS [QL98,SIGMOD99]
XMAS Algebra [EDBT00]
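To make the semantics of the view definition above more concrete, here is a rough Python rendering of the same join-and-group logic using the standard `xml.etree.ElementTree` module. It is a sketch, not an XMAS implementation: the two inline documents `AMAZON` and `DBLP` are hypothetical stand-ins for the real sources, and the function name `integrated_view` is invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical stand-ins for the two sources named in the view definition.
AMAZON = ET.fromstring(
    "<books>"
    "<book><author>B. Schatz</author><title>SemWeb Tractat</title></book>"
    "<book><author>T.B. Lee</author><title>SemWeb Tractat</title></book>"
    "</books>"
)
DBLP = ET.fromstring(
    "<authors>"
    "<author><author>B. Schatz</author><pubs><pub>P1</pub><pub>P2</pub></pubs></author>"
    "<author><author>T.B. Lee</author><pubs><pub>P3</pub></pubs></author>"
    "</authors>"
)

def integrated_view(books_src, authors_src):
    # WHERE, second pattern: bind ($a2, $p) from the author source.
    pubs_by_author = defaultdict(list)
    for entry in authors_src.findall("author"):
        a2 = entry.findtext("author")
        for pub in entry.findall("pubs/pub"):
            pubs_by_author[a2].append(pub.text)

    # WHERE, first pattern plus join condition value($a1) = value($a2).
    groups = defaultdict(list)  # group key: (author, title)
    for book in books_src.findall("book"):
        a1, t = book.findtext("author"), book.findtext("title")
        if a1 in pubs_by_author:
            groups[(a1, t)].extend(pubs_by_author[a1])

    # CONSTRUCT: one <book> per (author, title) group with its collected <pub>s.
    result = ET.Element("books")
    for (a1, t), pubs in groups.items():
        book = ET.SubElement(result, "book")
        ET.SubElement(book, "author").text = a1
        ET.SubElement(book, "title").text = t
        pubs_el = ET.SubElement(book, "pubs")
        for p in pubs:
            ET.SubElement(pubs_el, "pub").text = p
    return result

view = integrated_view(AMAZON, DBLP)
print(ET.tostring(view, encoding="unicode"))
```

Unlike a mediator, this sketch materializes the whole view eagerly; it only shows what the declarative definition asks for.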
XML (XMAS) Query Processing
Inputs: XML Query Q; XML Global View Definition G(S)
Compile-time:
– Translator => algebraic plans
– Composition Q(G) => composed plan
– Rewriter/Optimizer: Q’(S) => optimized plan
Run-time: query evaluation
– Plan Execution
…New Challenges in (XML-Based) Mediation
• Global-As-View (GAV)
– user query Q over global relations G: Q(G)
– global relations G defined as views over source relations S: G(S)
– challenge: compute answers Q(G(V(S))) without computing all of V and G
=> query rewriting (with limited source capabilities): Q’(S) = Q(G)
• Local-As-View (LAV)
– user query Q over global relations G: Q(G)
– source relations S defined as views over global relations G: S(G)
– challenge: “reverse/rewrite rules” from S(G) to some G’(S)
=> answering queries using views: equivalent rewritings may not exist
=> find maximally contained ones: Q’(G’(S)) ⊆ Q(G)
• Inter(CS)disciplinary research needed: DB, FP, LP
– GAV/LAV ~ view (un)folding ~ Clark’s completion, resolution, factoring
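The GAV case can be sketched as plain view unfolding over conjunctive queries. The snippet below is a simplified illustration under stated assumptions: the relation names (`book`, `amazon_title`, `amazon_price`) are hypothetical, view bodies contain only variables, and no source capability restrictions are modeled.

```python
from itertools import count

# Atoms are (predicate, (arg, ...)). Hypothetical GAV definition:
#   book(T, P) :- amazon_title(I, T), amazon_price(I, P).
GAV = {
    "book": (("T", "P"),
             [("amazon_title", ("I", "T")), ("amazon_price", ("I", "P"))]),
}

def unfold(query_body):
    """Replace each global atom by the body of its view definition."""
    fresh = count()
    result = []
    for pred, args in query_body:
        if pred not in GAV:
            result.append((pred, args))     # source or built-in atom: keep as is
            continue
        head_vars, body = GAV[pred]
        subst = dict(zip(head_vars, args))  # bind head variables to query arguments
        n = next(fresh)
        for bpred, bargs in body:
            # Variables that occur only in the view body get a fresh name.
            new_args = tuple(subst.get(a, f"{a}_{n}") for a in bargs)
            result.append((bpred, new_args))
    return result

# Q(G):  answer(T) :- book(T, P), P < 20
q = [("book", ("T", "P")), ("<", ("P", "20"))]
q_unfolded = unfold(q)  # Q'(S): mentions source relations only
print(q_unfolded)
```

LAV has no such direct substitution, which is why it leads to the "answering queries using views" problem named above.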
Querying XML Streams: A New Frontier
• New applications for stream-based XML processing:
– Continuous, real-time data streams (wireless sensor networks, …)
– Data / message transformation in Web services (SOAP, RMI, processing …)
– Extract-transform-load applications (Tera/Peta-byte archival migration, …)
• … leading to a new XML querying & transformation paradigm:
– how to execute (some) XML queries & transformations on very large (infinite)
data streams using only limited memory
– XML stream machine (XSM): extended XML transducers with buffers
[Figure: XQuery => XSM network]
XSMs clearly outperform tree-based approaches
on streamable queries (100x over Xalan)
[A Transducer-Based XML Query Processor, Ludäscher,
Mukhopadhyay, Papakonstantinou, VLDB’02]
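The idea of evaluating a query over a large XML stream with limited memory can be illustrated with Python's standard `xml.etree.ElementTree.iterparse`. This is not an XSM and says nothing about the performance figures above; it is a swapped-in, much simpler technique, and the element names (`book`, `title`, `price`) and the function `stream_titles` are assumptions for the example.

```python
import io
import xml.etree.ElementTree as ET

def stream_titles(xml_stream, min_price):
    """Yield titles of <book> elements with price >= min_price, one book at a time."""
    context = ET.iterparse(xml_stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "book":
            price = float(elem.findtext("price", default="0"))
            if price >= min_price:
                yield elem.findtext("title")
            root.clear()  # discard already-processed books to bound memory use

doc = io.BytesIO(
    b"<books>"
    b"<book><title>A</title><price>5</price></book>"
    b"<book><title>B</title><price>50</price></book>"
    b"</books>"
)
titles = list(stream_titles(doc, 10))
print(titles)
```

Only queries that can be answered from one `book` subtree at a time work this way; that restriction is a rough analogue of the "streamable queries" mentioned above.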
A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70%
homology with human NCS-1? Any structure specificity?
How about other rodents?
[Figure: the question is posed to an Information Integration layer over the sources protein localization (NCMIR), sequence info (CaPROT), morphometry (SYNAPSE), neurotransmission (SENSELAB)]
“Complex Multiple-Worlds” Mediation
A Geoscientist’s Information Integration Problem
What are the distribution and U/Pb zircon ages of A-type plutons in VA?
How about their 3-D geometry?
How does it relate to host rock structures?
[Figure: the question is posed to an Information Integration layer over the sources Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)]
“Complex Multiple-Worlds” Mediation
What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
– ... for labeled ordered trees
– ... all semantics lies outside of XML
• XML DTDs => tags + nesting
• XML Schema => DTDs + data modeling
• need anything else? => write comments!
• Domain Semantics is Complex:
– implicit assumptions, hidden semantics
=> sources seem unrelated to the non-expert
• Need Structure and Semantics beyond trees!
=> employ richer OO models
=> make domain semantics and “glue knowledge” explicit
=> use ontologies to fix terminology and conceptualization
=> avoid ambiguities by using KR and formal semantics
Information Integration Landscape
[Figure: application scenarios plotted by conceptual complexity/depth (low to high) against conceptual distance (one-world vs. multiple-worlds)]
Regions: DB mediation techniques; Ontologies, KR formalisms; Model-Based Mediation
Labels: GO, EcoCyc, RiboWeb, UMLS, Tambis, BLAST, MIA, Entrez, Cyc, WordNet, Bioinformatics, Geo-, Ecoinformatics, home-buyer, 24x7 consumer, addall, book-buyer
XML-Based vs. Model-Based Mediation
XML-Based Mediation:
– Integrated-DTD := XML-QL(Src1-DTD,...)
– No Domain Constraints
– Structural Constraints (DTDs): Parent, Child, Sibling, ...
  A = (B*|C),D
  B = ...
– XML Models: XML Elements over Raw Data
Model-Based Mediation:
– CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
– CM-QL ~ {F-Logic, DAML+OIL, …}
– Integrated-CM := CM-QL(Src1-CM,...)
– “Glue Maps” = Domain & Process Maps (ontologies)
– Logical Domain Constraints: IF … THEN … rules
– Classes (C1, C2, C3, ...), Relations (R, ...), is-a, has-a, ...
– Conceptual Models: (XML) Objects over Raw Data
What’s the Glue? What’s in a Link?
• Syntactic Joins (equality)
– (X,Y) := X.SSN = Y.SSN
– (X,Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins (similarity)
– (X,Y,Score) := BLAST(X,Y,Score)
• Semantic/Rule-Based Joins
– (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8 (homology, lub)
– (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y. (rule-based)
e.g., X = β-secretase, B = beta amyloid, Y = Alzheimer’s disease
• CS Challenge:
– compile semantic joins into efficient syntactic ones
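A small Python sketch may help to see what "compiling a semantic join into a syntactic one" can mean in the simplest case. The facts below reuse the β-secretase example from the slide; the record layout and the function names `syntactic_join` and `rule_based_join` are assumptions made for illustration only.

```python
# Facts for the rule (X, Y, [produces, B, increased_in]) := X produces B, B increased_in Y.
produces = [("beta-secretase", "beta amyloid")]
increased_in = [("beta amyloid", "Alzheimer's disease")]

def syntactic_join(xs, ys, key_x, key_y):
    """Equality (hash) join: pair records whose key attributes are equal."""
    index = {}
    for y in ys:
        index.setdefault(y[key_y], []).append(y)
    return [(x, y) for x in xs for y in index.get(x[key_x], [])]

def rule_based_join(produces, increased_in):
    """Evaluate the rule by reducing it to an equality join on the shared variable B."""
    xs = [{"x": x, "b": b} for x, b in produces]
    ys = [{"b": b, "y": y} for b, y in increased_in]
    return [(p["x"], q["y"], ["produces", p["b"], "increased_in"])
            for p, q in syntactic_join(xs, ys, "b", "b")]

links = rule_based_join(produces, increased_in)
print(links)
```

Joins that need `isa` reasoning or a BLAST score threshold do not reduce this easily, which is the challenge the slide points at.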
Semantic Mediation Methodology @ SOURCES
• Lift Sources to export CMs:
CM(S) = OM(S) + KB(S) + CON(S)
• Object Model OM(S):
– complex objects (frames), class hierarchy, OO constraints
• Knowledge Base KB(S):
– explicit representation of (“hidden”) source semantics
– logic rules over OM(S)
• Contextualization CON(S):
– situate OM(S) data using “glue maps” (ontologies):
=> domain maps DMs = terminological knowledge: concepts + roles
=> process maps PMs = “procedural knowledge”: states + transitions
Semantic Mediation Methodology @ MEDIATOR
• Integrated View Definition (IVD)
– declarative (logic) rules with object-oriented features
– defined over CM(S), domain maps, process maps
– needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
– mediator composes the user query Q with the IVD
... rewrites (Q o IVD), sends subqueries to sources
... post-processes returned results (e.g., situate in context)
Model-Based Mediator Architecture
USER/Client
CM (Integrated View)
“Glue” Maps (GMs): Domain Maps (DMs), Process Maps (PMs)
Mediator Engine:
– Integrated View Definition IVD
– LP rule proc., XSB Engine, FL rule proc., Graph proc.
– GCM
semantic context CON(S)
CM Queries & Results (exchanged in XML)
CM S1, CM S2, CM S3, where CM(S) = OM(S)+KB(S)+CON(S)
CM-Wrapper (XML-Wrapper), one per source
S1, S2, S3
First results & Demos: KIND prototype, formal DM semantics, PMs
[SSDBM00] [VLDB00] [ICDE01] [NIH-HB01] [BNCOD02] [ER02] [EDBT02] [BioInf02]
Formalizing Glue Knowledge:
Domain Map for SYNAPSE and NCMIR
Domain Map
= labeled graph with
concepts ("classes") and
roles ("associations")
• additional semantics: expressed
as logic rules (F-logic)
Domain Expert Knowledge:
Purkinje cells and Pyramidal cells have dendrites
that have higher-order branches that contain spines.
Dendritic spines are ion (calcium) regulating components.
Spines have ion binding proteins. Neurotransmission
involves ionic activity (release). Ion-binding proteins
control ion activity (propagation) in a cell. Ion-regulating
components of cells affect ionic activity (release).
[Figure panels: Domain Map (DM); DM in Description Logic]
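A domain map as a labeled graph can be sketched in a few lines of Python. The triples below paraphrase the expert statements above; the role names (`has`, `contains`, `isa`, `controls`, `affects`, `involves`) and the helper `reachable` are simplifications chosen for this illustration, not the formal DM semantics or the F-logic rules used in the actual system.

```python
# Edges are (concept, role, concept).
DOMAIN_MAP = [
    ("Purkinje cell", "has", "dendrite"),
    ("Pyramidal cell", "has", "dendrite"),
    ("dendrite", "has", "branch"),
    ("branch", "contains", "spine"),
    ("spine", "isa", "ion-regulating component"),
    ("spine", "has", "ion-binding protein"),
    ("neurotransmission", "involves", "ion activity"),
    ("ion-binding protein", "controls", "ion activity"),
    ("ion-regulating component", "affects", "ion activity"),
]

def reachable(start, roles):
    """Concepts reachable from `start` via edges whose role is in `roles`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, role, dst in DOMAIN_MAP:
            if src == node and role in roles and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

# Which concepts are (transitively) parts of a Purkinje cell?
parts = reachable("Purkinje cell", {"has", "contains"})
print(sorted(parts))
```

A graph query like this is what lets data registered at "spine" be found when a user asks about Purkinje cells.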
Source Contextualization & DM Refinement
In addition to registering
(“hanging off”) data relative to
existing concepts, a source
may also refine the mediator’s
domain map...
=> sources can register new
concepts at the mediator ...
Example:
ANATOM Domain Map
Browsing Registered Data with Domain Maps
Query Processing Demo
Mediator View Definition
• provided by the domain expert and mediation engineer
• deductive OO language (here: F-logic)
DERIVE
protein_distribution(Protein, Organism, Brain_region,
Feature_name, Anatom, Value)
WHERE
I:protein_label_image[ proteins ->> {Protein}; organism -> Organism; anatomical_structures ->>
{AS:anatomical_structure[name->Anatom]}] , % from PROLAB
NAE:neuro_anatomic_entity[name->Anatom; located_in->>{Brain_region}], % from ANATOM
AS..segments..features[name->Feature_name; value->Value].
Contextualization: CON(Result) wrt. ANATOM.
Query results in context
Example: Inside Query Evaluation
“How does the parallel fiber output (Yale/SENSELAB) relate to the distribution
of Ryanodine Receptors (UCSD/NCMIR)?”
push selection
@SENSELAB: X1 := select targets of “output from parallel fiber”;
determine source context
@MEDIATOR: X2 := “find and situate” X1 in ANATOM Domain Map;
compute region of interest (here: downward closure)
@MEDIATOR: X3 := subregion-closure(X2);
push selection
@NCMIR: X4 := select PROT-data(X3, Ryanodine Receptors);
compute protein distribution
@MEDIATOR: X5 := compute aggregate(X4);
display in context
@MEDIATOR/GUI: display X5 in context (ANATOM)
=> DEMONSTRATION
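The evaluation plan above can be mimicked end to end on toy data. In the sketch below every data value is hypothetical: the miniature `SUBREGIONS` map stands in for the ANATOM domain map, and `SENSELAB_TARGETS` and `NCMIR_PROT` stand in for source answers. Only the step structure X1 to X5 follows the slide.

```python
# Hypothetical miniature of a domain map: region -> direct subregions.
SUBREGIONS = {
    "cerebellum": ["cerebellar cortex"],
    "cerebellar cortex": ["molecular layer", "Purkinje cell layer"],
}
KNOWN_REGIONS = set(SUBREGIONS) | {s for subs in SUBREGIONS.values() for s in subs}

# Hypothetical source contents.
SENSELAB_TARGETS = {"output from parallel fiber": ["cerebellar cortex"]}
NCMIR_PROT = [  # (protein, region, measured amount)
    ("Ryanodine Receptor", "molecular layer", 3.0),
    ("Ryanodine Receptor", "Purkinje cell layer", 5.0),
    ("Calbindin", "molecular layer", 1.0),
]

def subregion_closure(regions):
    """Downward closure: the given regions plus all their transitive subregions."""
    closed, stack = set(regions), list(regions)
    while stack:
        for sub in SUBREGIONS.get(stack.pop(), []):
            if sub not in closed:
                closed.add(sub)
                stack.append(sub)
    return closed

x1 = SENSELAB_TARGETS["output from parallel fiber"]   # @SENSELAB: push selection
x2 = [r for r in x1 if r in KNOWN_REGIONS]            # @MEDIATOR: situate in domain map
x3 = subregion_closure(x2)                            # @MEDIATOR: region of interest
x4 = [row for row in NCMIR_PROT                       # @NCMIR: push selection
      if row[0] == "Ryanodine Receptor" and row[1] in x3]
x5 = {}                                               # @MEDIATOR: aggregate per region
for _, region, value in x4:
    x5[region] = x5.get(region, 0.0) + value
print(x5)
```

The closure step is what connects the two sources: neither source mentions the other's terms, but both can be situated in the same domain map.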
Open Database & Knowledge Representation Issues
• Mix of Query Processing and Reasoning
– GAV & LAV with semantic query optimization (NIH BIRN, NSF GEON)
– description logic reasoner for DMs (FaCT) ?
– reconciliation of conflicting DMs via argumentation-frameworks (“games”)
using well-founded and stable models of logic programs [ICDT97, PODS97,
TCS00, TODS02]
• Modeling “Process Knowledge” => Process Maps
– formal semantics? (dynamic/temporal/Kripke models/Petri nets?)
– executable semantics? (Statelog?)
• Graph Queries over DMs and PMs
– expressible in F-logic [InfSystems98]
– scalability? (UMLS Domain Map has millions of entries)
• How to incorporate “procedural features”?
– Bioinformatics, Ecoinformatics, … => sources = DBs + analytical tools + …
=> scientific workflow planning and management (“promoter identification
workflow” for DOE SciDAC, NSF/ITR SEEK)
Process Maps with Abstractions and Elaborations:
From Terminological to Procedural Glue
• nodes ~ states
• edges ~ processes, transitions
• blue/red edges: processes in Src1/Src2
• general form of edges:
related formalisms
A Scientific Workflow: Promoter Identification (Matthew Coleman, LLNL, 2002)
[Figure: workflow diagram. Tools and steps: clusfavor, blast (human, other species), GRAIL, TRANSFAC, CLUSTAL, Data Consolidation. Data items: gi#’s from clusfavor; cDNA gi#; Gene name; Genomic gi#; Chr #; Gene location; GC Island location; Exon/intron location; Repeats location; Promoter location; Validates polII promoter location; TAF’s; Location on Genomic gi#’s; Probabilities of match; Probabilities of random match; Consensus sequences; Shared TAF’s across cluster; Common consensus sequence]
Questions:
– Are chr#’s in common?
– Are chr#’s locations in common?
– Are there conserved upstream sequences?
– Are gene locations conserved across species?
Questions:
– RNA POLII promoter?
– GpC Island present?
– Are there common TAF’s across genomic gi#?
Questions:
– Are there other common genes?
SDM Demo & Architecture
Translation Approach:
Abstract Workflow (AWF) => Executable Workflow (EWF)
Analytical Pipelines: An Open Source Tool
A Commercial Tool for Analytical Pipelines
Summary: Mediation Scenarios & Techniques
        | Federated Databases    | XML-Based Mediation   | Model-Based Mediation
        | One-World              | One-/Multiple-Worlds  | Complex Multiple-Worlds
Glue?   | Common Schema          | Mediated Schema       | Common Glue Maps
        | SQL, rules             | XML query languages   | DOOD query languages
        | Schema Transformations | Syntax-Aware Mappings | Semantics-Aware Mappings
        | Syntactic Joins        | Syntactic Joins       | “Semantic” Joins via Glue Maps
        | DB expert              | DB expert             | KRDB + domain experts
GEON vs. SEEK
Thank you!
Questions? Queries?
Some References
• Model-Based Mediation:
– A Model-Based Mediator System for Scientific Data Management, B. Ludäscher, A. Gupta, M.
Martone, Bioinformatics: Managing Scientific Data, Lacroix, Critchlow (eds), Morgan
Kaufmann, to appear, 2003
– Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl.
Conference on Data Engineering (ICDE’01), Heidelberg, Germany, IEEE Computer Society,
2001.
– Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective, B.
Ludäscher, R. Himmeröder, G. Lausen, W. May, C. Schlepphorst, Information Systems, 23(8),
Special Issue on Semistructured Data, 1998.
• XML-Based Mediation:
– VXD/Lazy Mediators: Navigation-Driven Evaluation of Virtual Mediated Views, B. Ludäscher,
Y. Papakonstantinou, P. Velikhov, Intl. Conference on Extending Database Technology
(EDBT’00), Konstanz, Germany, LNCS 1777, Springer, 2000.
– XML Streams: A Transducer-Based XML Query Processor, B. Ludäscher, P. Mukhopadhyay, Y.
Papakonstantinou, Intl. Conference on Very Large Databases (VLDB’02), Hong Kong, 2002
Knowledge Representation:
Relating Theory to the World via Formal Models
John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations
“All models are wrong, but some are useful!”