Towards a Generic Framework for Semantic Data
Registration and Integration in Geosciences
Kai Lin, Chaitan Baru
San Diego Supercomputer Center
University of California, San Diego
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES
www.geongrid.org 1
Data Integration Goal
• Query heterogeneous data sources as a single
resource
– Query: no need to write a program (“ad hoc, non-procedural
query languages”)
– Heterogeneous: local resource controls definition of the
data
– Single resource: remove the burden of individually
accessing each data source
Data Integration Challenges: Heterogeneities
• Syntactical Heterogeneity
heterogeneous data format
e.g. 02-04-2004 vs. 02/04/04
• Structural Heterogeneity
heterogeneous data models and schemas
e.g. 02-04-2004 is stored as three columns or as one column
• Semantic Heterogeneity
fuzzy metadata, terminology, “hidden” semantics, implicit
assumptions
GEON Solution:
• data should be semantically registered to GEON first
• heterogeneities are resolved by registration
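As an illustration of the syntactic heterogeneity example above (02-04-2004 vs. 02/04/04), here is a minimal sketch that maps both spellings to one value. It assumes both strings are month-first (MM-dd-yyyy and MM/dd/yy), which the slide does not state; the class and method names are hypothetical and not part of GEON.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of GEON: maps two date spellings to one LocalDate.
final class DateNormalizer {
    // Assumption: both source formats are month-first.
    private static final DateTimeFormatter[] FORMATS = {
        DateTimeFormatter.ofPattern("MM-dd-yyyy"),
        DateTimeFormatter.ofPattern("MM/dd/yy") // two-digit years resolve to 2000-2099
    };

    static LocalDate normalize(String raw) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return LocalDate.parse(raw.trim(), format);
            } catch (DateTimeParseException e) {
                // Not this format; try the next known one.
            }
        }
        throw new IllegalArgumentException("Unrecognized date format: " + raw);
    }

    public static void main(String[] args) {
        System.out.println(normalize("02-04-2004"));
        System.out.println(normalize("02/04/04"));
    }
}
```

Under the month-first assumption, both calls in main yield the same LocalDate, so the two sources can be compared on a single representation.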
Levels of Registration
• Metadata-level registration
– Register metadata associated with a resource
– Submit the required metadata; the semantics are predefined.
• “Item” level registration
– Register the “schema” of a resource, e.g. relational
database, shapefiles, …
– Record semantics of schema elements, e.g. table name,
column name
• “Item-Detail” level registration
– Register individual values in a dataset
– Record semantics of each item in a record/column
Registering Structured Data
• Relational databases
• Shapefiles → database tables
• Excel spreadsheets → database tables
• Delimited ASCII files → database tables
• Headers of scientific data files, e.g. netCDF
Item Level Database Registration and Access
[Figure: an application connects through the GEON JDBC Driver to the GEON Mediator. From the tables of the original database, the provider selects tables and views to register; the published database exposes the corresponding table and view definitions.]
How to Connect to GEON Databases
• Download GEON JDBC Driver
• Use the following code to create a connection
import java.sql.Connection;
import java.sql.DriverManager;
// load the GEON JDBC driver
Class.forName("org.geongrid.jdbc.driver.Driver");
// set the mediator URL
String url = "jdbc:geon://geon01.sdsc.edu:2532/GEON-63cb404c-6038-11d9-a69f";
// open the connection
Connection conn = DriverManager.getConnection(url, "geonuser", "geongrid");
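The URL has three parts: the GEON JDBC protocol (jdbc:geon), the host name and port number of the GEON Mediator (geon01.sdsc.edu:2532), and the GEON ID (GEON-63cb404c-6038-11d9-a69f).
Note: the original account information is not accessible to end users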
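To make the URL structure concrete, the following sketch splits a GEON JDBC URL into host, port, and GEON ID using only java.net.URI. The class GeonUrlParts is hypothetical and is not part of the GEON JDBC driver; it only illustrates how the three parts fit together.

```java
import java.net.URI;

// Hypothetical helper, not part of the GEON JDBC driver:
// splits a GEON JDBC URL into mediator host, mediator port, and GEON ID.
final class GeonUrlParts {
    final String host;
    final int port;
    final String geonId;

    private GeonUrlParts(String host, int port, String geonId) {
        this.host = host;
        this.port = port;
        this.geonId = geonId;
    }

    static GeonUrlParts parse(String jdbcUrl) {
        if (!jdbcUrl.startsWith("jdbc:geon://")) {
            throw new IllegalArgumentException("Not a GEON JDBC URL: " + jdbcUrl);
        }
        // Drop the leading "jdbc:" so java.net.URI sees "geon://host:port/id".
        URI uri = URI.create(jdbcUrl.substring("jdbc:".length()));
        String path = uri.getPath();
        if (path == null || path.length() < 2) {
            throw new IllegalArgumentException("Missing GEON ID: " + jdbcUrl);
        }
        // The path starts with "/", which is not part of the ID.
        return new GeonUrlParts(uri.getHost(), uri.getPort(), path.substring(1));
    }

    public static void main(String[] args) {
        GeonUrlParts p = parse("jdbc:geon://geon01.sdsc.edu:2532/GEON-63cb404c-6038-11d9-a69f");
        System.out.println(p.host + " " + p.port + " " + p.geonId);
    }
}
```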
GEON Mediator Enables Write Protection
[Figure: an UPDATE on table B is sent to the mediator, which sits in front of the database tables A, B, and C.]
• Only accepts SELECT statements
• Rejects any requests other than SELECT
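The SELECT-only rule can be sketched as a simple filter. This is a deliberately naive illustration with a hypothetical class name, not the mediator's actual check: a real mediator would presumably parse the statement, whereas this sketch only inspects the leading keyword and conservatively rejects anything that contains a second statement.

```java
import java.util.Locale;

// Hypothetical sketch, not GEON code: a naive read-only filter.
final class ReadOnlyGuard {
    static boolean isAllowed(String sql) {
        if (sql == null) {
            return false;
        }
        String s = sql.trim();
        // Tolerate one trailing semicolon.
        if (s.endsWith(";")) {
            s = s.substring(0, s.length() - 1).trim();
        }
        // Conservative: any remaining ';' could separate a second statement.
        // This also rejects harmless semicolons inside string literals.
        if (s.indexOf(';') >= 0) {
            return false;
        }
        // Accept only statements whose first keyword is SELECT.
        return s.toLowerCase(Locale.ROOT).matches("(?s)select\\s.*");
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("SELECT * FROM B"));
        System.out.println(isAllowed("UPDATE B SET x = 1"));
    }
}
```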
Read Protection for Unregistered Tables and Views
[Figure: a SELECT * FROM A query is sent to the mediator, where A is a table in the database that has not been registered.]
An unregistered table or view is invisible to an end user
• The data in the table can’t be viewed by SELECT statement
• The schema can’t be fetched
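One way to picture this rule is a catalog that only knows registered names. The sketch below is hypothetical (PublishedCatalog is not a GEON class) and assumes case-insensitive table names; it shows both halves of the rule: schema listings contain only registered names, and a read of an unregistered name fails the same way a read of a nonexistent name would.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not GEON code: only registered tables and views are visible.
final class PublishedCatalog {
    private final Set<String> registered = new TreeSet<>();

    PublishedCatalog(String... registeredNames) {
        for (String name : registeredNames) {
            // Assumption: names are compared case-insensitively.
            registered.add(name.toLowerCase(Locale.ROOT));
        }
    }

    boolean isVisible(String tableOrView) {
        return registered.contains(tableOrView.toLowerCase(Locale.ROOT));
    }

    // Schema listings never include unregistered names.
    Set<String> listTables() {
        return Collections.unmodifiableSet(registered);
    }

    // Same error for unregistered and nonexistent names, so existence is not revealed.
    void checkReadable(String tableOrView) {
        if (!isVisible(tableOrView)) {
            throw new IllegalArgumentException("Unknown table or view: " + tableOrView);
        }
    }

    public static void main(String[] args) {
        PublishedCatalog catalog = new PublishedCatalog("B", "C");
        System.out.println(catalog.listTables());
        System.out.println(catalog.isVisible("A"));
    }
}
```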
GEON Database Integration
GEON Mediator supports integration at three levels
Level 1: Federation-Based Integration
• End users need to be knowledgeable about each database
Level 2: View-Based Integration
• End users see “integrated views”. An intermediary designs these
views.
Level 3: Ontology-Based Integration
• End users can query using familiar concepts
• Requires middleware and formal representation of domain
knowledge
Level 1: Federation-Based Integration
• Use SQL to query the federated database
• Structural and semantic heterogeneity must be
resolved by the users themselves
[Figure: the GEON Mediator exposes the tables of two backends (A, B, C, D and E, F, G) as one federated database.]
SELECT * FROM A, E WHERE ……
Level 2: View-Based Integration
• Allow defining views on top of the federated databases
• Allow hiding the original backend schemas
• Integration results can be shared and reused
[Figure: the GEON Mediator defines the views V and W on top of the tables of two backends (A, B, C, D and E, F, G).]
SELECT * FROM V, W WHERE ……
Level 3: Ontology-Based Integration
• Requires ontology annotations for backend databases
• Use a simple ontology query language to query the integrated database
• End users do not need to know the backend schemas and local
semantics
[Figure: an ontology-based query is posed to the GEON Mediator, which sits on top of the tables of two backends (A, B, C, D and E, F, G).]
GEON Ontology Based Data Integration
• Ontology Enabled Semantic Integration
[Figure: dataset1, dataset2, dataset3, and dataset4 registered to Ontology1, Ontology2, and Ontology3.]
Challenges for Computer Scientists and Domain Scientists
– Computer Scientists: build an integration system based on the
ontological registration of datasets
– Domain Scientists: create domain ontologies
– Data Providers: register datasets to ontologies
Ontological Data Registration for Data Integration
• Registering a dataset to an ontology for data integration
is a procedure to generate a partial model of the ontology
from the dataset itself
[Figure: registration generates individuals from the dataset; these individuals form a partial model of the ontology.]
Note: not all the constraints in the ontology are satisfied by the generated individuals.
Registering Relational Tables to Ontology Classes
• Associate one or more columns under an optional
SQL condition to a selected class in the ontology
[Figure: a table with the columns Latitude and Longitude (example row: 23.5, 47.9) is associated with the ontology class Location.]
(23.5, 47.9) is the name of an individual of the class Location. The same name indicates the same location.
• Provide a mapping method if no explicit names of
individuals should be generated
[Figure: geologic age values in a RockSample table (e.g. Jurassic/Triassic, Precambrian) are mapped to ontology individuals (Precambrian, Cenozoic, Paleozoic). Labels in the figure: RockSample, GeologicAge, GeologicalAge.]
Registering Relational Tables to Ontology Object Properties
• Associate two entities which are already registered to the
domain class and the range class of a selected object
property in the ontology
[Figure: the object property hasAge links the class Rock to the class GeologicAge; the table columns RockSampleID and PERIOD supply the two ends of the property.]
ODAL and SOQL
• ODAL (Ontological Database Annotation Language): register item/item-detail to ontology
• SOQL (Simple Ontology Query Language): user query
ODAL (Ontological Database Annotation Language)
• Create a partial model of ontologies from databases
• Independent of end interface
• Independent of specific database implementations
• The ODAL mapping is itself a “first-class” object
[Figure: a GUI generates the declaration below and passes it to the ODAL processor.]
<odal:NamedIndividuals odal:id="RockSample"
    odal:database="VTDatabase">
  <odal:Class odal:resource="http://geon.vt.edu#RockSample" />
  <odal:Table>Samples</odal:Table>
  <odal:Table>RockTexture</odal:Table>
  <odal:Table>RockGeoChemistry</odal:Table>
  <odal:Table>ModalData</odal:Table>
  <odal:Table>MineralChemistry</odal:Table>
  <odal:Table>Images</odal:Table>
  <odal:Column>ssID</odal:Column>
</odal:NamedIndividuals>
The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry,
ModalData, MineralChemistry, and Images represent instances of RockSample.
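To show how such a declaration could be consumed, the sketch below reads an abbreviated copy of it with the standard Java DOM parser and derives one query per table that would list the individual names. The class OdalSketch and the generated SQL shape are assumptions for illustration; the slides do not describe how the ODAL processor works internally. The namespace URI is the one used in the ODAL examples of this deck.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not the GEON ODAL processor.
final class OdalSketch {
    static final String ODAL_NS = "http://www.sdsc.edu/odal#";

    // Abbreviated copy of the declaration (two of the six tables).
    static final String SAMPLE =
        "<odal:NamedIndividuals xmlns:odal=\"" + ODAL_NS + "\""
        + " odal:id=\"RockSample\" odal:database=\"VTDatabase\">"
        + "<odal:Class odal:resource=\"http://geon.vt.edu#RockSample\"/>"
        + "<odal:Table>Samples</odal:Table>"
        + "<odal:Table>RockTexture</odal:Table>"
        + "<odal:Column>ssID</odal:Column>"
        + "</odal:NamedIndividuals>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the odal: prefix
            return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid ODAL document", e);
        }
    }

    // Text content of every odal:<localName> element, in document order.
    static List<String> textOf(Document doc, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(ODAL_NS, localName);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    // Assumed query shape: one DISTINCT projection of the name columns per table.
    static List<String> queries(Document doc) {
        String columns = String.join(", ", textOf(doc, "Column"));
        List<String> result = new ArrayList<>();
        for (String table : textOf(doc, "Table")) {
            result.add("SELECT DISTINCT " + columns + " FROM " + table);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String sql : queries(parse(SAMPLE))) {
            System.out.println(sql);
        }
    }
}
```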
ODAL: Import Ontologies
The ontologies used for annotating a database can be
imported as follows:
<?xml version="1.0"?>
<odal:ODAL xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:odal="http://www.sdsc.edu/odal#">
  <odal:Ontology>
    <odal:Imports rdf:resource="http://www.library.org/Book.owl"/>
    <odal:Imports rdf:resource="http://www.writer.org/Writer.owl"/>
  </odal:Ontology>
  ……
</odal:ODAL>
ODAL: Database Connection Declaration
The target database to annotate is declared as
follows:
<?xml version="1.0"?>
<odal:ODAL xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:odal="http://www.sdsc.edu/odal#">
  ……
  <odal:Database odal:id="PublicationDatabase">
    <odal:DatabaseProductName>Oracle</odal:DatabaseProductName>
    <odal:DatabaseProductVersion>9.1.21</odal:DatabaseProductVersion>
    <odal:Host>oracle.sdsc.edu</odal:Host>
    <odal:Port>3456</odal:Port>
    <odal:DatabaseName>Publications</odal:DatabaseName>
  </odal:Database>
  ……
</odal:ODAL>
ODAL: Simple Named Individuals
Suppose the Book ontology contains a class Book, and the schema
Collections contains a table book-price with a column ISBN.
<odal:NamedIndividuals odal:id="BookInTableBookPrice"
odal:database="PublicationDatabase" >
<odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
<odal:Schema>Collections</odal:Schema>
<odal:Table>book-price</odal:Table>
<odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>
The statement says that each value in the column ISBN represents a book
individual.
odal:id gives a name to the declaration, and represents the set of the
individuals generated by the statement.
ODAL: Named Individuals from Multiple Columns
Suppose an ontology contains a class Location, and a database table
Rock-Sample has two columns, Latitude and Longitude.
<odal:NamedIndividuals odal:id="LocationInTableRockSample" >
<odal:Class odal:resource="http://www.usgs.org/Space.owl#Location"/>
<odal:Schema>California</odal:Schema>
<odal:Table>Rock-Sample</odal:Table>
<odal:Column>Latitude</odal:Column>
<odal:Column>Longitude</odal:Column>
</odal:NamedIndividuals>
The statement says that each (latitude, longitude) pair identifies a location.
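The naming rule for multi-column individuals can be sketched as follows: the column values of a row are combined into one name, and rows with equal values yield the same individual. The class and method names are hypothetical, and the "(lat, lon)" name format simply mirrors the (23.5, 47.9) example from the registration slide.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not GEON code: names individuals by their column values.
final class LocationIndividuals {
    // Builds a name such as "(23.5, 47.9)" from the registered column values.
    static String nameOf(Object... columnValues) {
        StringBuilder name = new StringBuilder("(");
        for (int i = 0; i < columnValues.length; i++) {
            if (i > 0) {
                name.append(", ");
            }
            name.append(columnValues[i]);
        }
        return name.append(")").toString();
    }

    // Each row is {latitude, longitude}; equal names collapse into one individual.
    static Set<String> individuals(List<double[]> rows) {
        Set<String> names = new LinkedHashSet<>();
        for (double[] row : rows) {
            names.add(nameOf(row[0], row[1]));
        }
        return names;
    }

    public static void main(String[] args) {
        List<double[]> rows = Arrays.asList(
            new double[] {23.5, 47.9},
            new double[] {23.5, 47.9},
            new double[] {10.0, 20.0});
        System.out.println(individuals(rows));
    }
}
```

In main, three rows produce two individuals, because the first two rows carry the same latitude and longitude.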
ODAL: Named Individuals with Conditions
<odal:NamedIndividuals odal:id="MaleEmployeeInTableEmployee" >
  <odal:Class odal:resource="http://www.abc.com/Employee.owl#MaleEmployee"/>
  <odal:Table>employee</odal:Table>
  <odal:Column>EmployeeId</odal:Column>
  <odal:Condition><![CDATA[ Gender='M' ]]></odal:Condition>
</odal:NamedIndividuals>
<odal:NamedIndividuals odal:id="FemaleEmployeeInTableEmployee" >
  <odal:Class odal:resource="http://www.abc.com/Employee.owl#FemaleEmployee"/>
  <odal:Table>employee</odal:Table>
  <odal:Column>EmployeeId</odal:Column>
  <odal:Condition><![CDATA[ Gender='F' ]]></odal:Condition>
</odal:NamedIndividuals>
A condition in an odal:Condition element must be a Boolean expression that is
valid in the WHERE clause of a SQL query.
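Because the condition must be a valid WHERE-clause expression, one plausible reading is that it is appended to the query that lists the individuals. The sketch below illustrates that reading only; the class name and the generated SQL shape are assumptions, not the documented behavior of the ODAL processor. The condition is spliced in verbatim, which is acceptable here only because it comes from the data provider's own declaration rather than from end users.

```java
// Hypothetical sketch, not the GEON ODAL processor:
// turns a NamedIndividuals declaration into the SQL that lists its individuals.
final class NamedIndividualsQuery {
    static String toSql(String table, String column, String condition) {
        StringBuilder sql = new StringBuilder("SELECT DISTINCT ")
            .append(column)
            .append(" FROM ")
            .append(table);
        // The condition is optional; when present it becomes the WHERE clause as written.
        if (condition != null && !condition.trim().isEmpty()) {
            sql.append(" WHERE ").append(condition.trim());
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSql("employee", "EmployeeId", " Gender='M' "));
        System.out.println(toSql("employee", "EmployeeId", null));
    }
}
```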
ODAL: Data Type Property Declaration
[Figure: a table with the columns SSN and age (example row: 1234-56-7890, 8); the datatype property hasAge has the domain Person and the range double.]
<odal:NamedIndividuals odal:id="PersonInTablePerson" >
<odal:Class odal:resource="http://www.foo.org/Person.owl#Person"/>
<odal:Table>Person</odal:Table>
<odal:Column>ssn</odal:Column>
</odal:NamedIndividuals>
<odal:OntologyProperty>
<odal:DatatypeProperty odal:resource="http://www.foo.org/Person.owl#hasAge"/>
<odal:Table>person</odal:Table>
<odal:Domain odal:resource="PersonInTablePerson" />
<odal:Range odal:resource="age" />
</odal:OntologyProperty>
Conditions for Joining Individuals from Different Resources
• To join data across independent resources we need to know
the correspondence between entities.
• For example, does “10001” represent the same rock in the two
resources? By default, we assume it does not.
[Figure: the class Rock is populated from two resources, one with a column RockSampleID and one with a column RockID; both contain the value 10001.]
• A set of datatype properties can be declared as a key for a class in the
ontology. Joins across multiple resources are based on keys.
e.g. { hasLatitude, hasLongitude } can be declared as a key of Location.
Two locations from different resources are the same if they have the same
latitude and longitude.
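The key rule can be sketched as grouping individuals by their key values rather than by their local identifiers. Everything below is a hypothetical illustration (class names, resource names, and values are made up); it uses exact equality on the key values, whereas a real system might need a tolerance for floating-point coordinates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not GEON code: matches individuals across resources by key.
final class KeyJoin {
    // One individual: where it came from, its local id, and its key property values.
    static final class Individual {
        final String resource;
        final String localId;
        final List<Object> key;

        Individual(String resource, String localId, Object... keyValues) {
            this.resource = resource;
            this.localId = localId;
            this.key = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(keyValues)));
        }
    }

    // Individuals that share all key values are treated as the same entity.
    static Map<List<Object>, List<Individual>> groupByKey(List<Individual> all) {
        Map<List<Object>, List<Individual>> groups = new LinkedHashMap<>();
        for (Individual individual : all) {
            groups.computeIfAbsent(individual.key, k -> new ArrayList<>()).add(individual);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Key assumed to be { hasLatitude, hasLongitude }.
        List<Individual> all = Arrays.asList(
            new Individual("resource1", "10001", 23.5, 47.9),
            new Individual("resource2", "10001", 60.0, 100.0),
            new Individual("resource2", "20002", 23.5, 47.9));
        for (Map.Entry<List<Object>, List<Individual>> e : groupByKey(all).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " record(s)");
        }
    }
}
```

In main, the two records with local id 10001 stay separate because their keys differ, while the records 10001 and 20002 from different resources are grouped because they share the same latitude and longitude.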
SOQL (Simple Ontology Query Language)
Query single or integrated resources
• via ontologies (i.e., high level logical views)
• independent of schema-level representation
[Figure: ontology fragment. RockSample has the properties location (a Location with lat and long of type float) and hasSiO2 (a ValueWithUnit with value and unit, where unit is a string).]
[Figure: a GUI generates the query below and passes it to the SOQL processor.]
SELECT X.location.*
FROM RockSample X
WHERE X.location.lat > 60
AND X.location.long > 100
AND X.hasSiO2.value < 30
AND X.hasSiO2.unit = 'weightPercentage'
The Architecture of GEON Semantic Mediator
[Figure: architecture diagram. Components shown:]
• Clients: Portal or Application; SOQL GUI
• SOQL Processor: SOQL Parser, Semantic Query Rewriter, Ontology Reasoner, ODAL Processor (inputs: OWL, ODAL)
• Mediator JDBC Driver: spatial SQL against federated schemas
• Mediator: SQL Parser, Query Planning, Query Optimization, Query Execution, Internal Database
• Backends: Oracle, DB2, SQL Server, MySQL, PostgreSQL, PostGIS
Question: Find all seismic stations within 1 mile of a railroad
SOQL query entered through the GEON SOQL GUI:
SELECT X.code, X.location.*
FROM SeismicStation X, Railroad Y
WHERE distance(X.location, Y.geometry) < 1
SQL produced by the SOQL Processor:
SELECT X2.stationcode, X2.lat, X2.lon
FROM railroads_of_the_united_states X1,
stationdatatable X2
WHERE distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
The Schema Mediator sends one query to each source and applies the join condition
distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
to the results:
Railroad shapefile:
SELECT X1.the_geom
FROM railroads X1
Seismic Stations:
SELECT X2.stationcode, X2.lat, X2.lon
FROM stationdatatable X2
WHERE bounding box condition
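The step from ontology paths such as X.code and X.location.* to columns such as X2.stationcode, X2.lat, and X2.lon can be pictured as a lookup in the registered mappings. The sketch below is a hypothetical illustration of that lookup only; PathRewriter is not a GEON class, and the path-to-column entries are read off the two queries in this example rather than taken from a real registration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the GEON SOQL processor:
// rewrites ontology paths into table columns using registered mappings.
final class PathRewriter {
    // Insertion order is kept so that wildcard expansion is deterministic.
    private final Map<String, String> columnByPath = new LinkedHashMap<>();

    void register(String ontologyPath, String column) {
        columnByPath.put(ontologyPath, column);
    }

    // Returns the SQL column references for one path; "...*" expands to all sub-paths.
    List<String> rewrite(String sqlAlias, String ontologyPath) {
        List<String> columns = new ArrayList<>();
        if (ontologyPath.endsWith(".*")) {
            // Keep the trailing dot so "location." does not match "locationName".
            String prefix = ontologyPath.substring(0, ontologyPath.length() - 1);
            for (Map.Entry<String, String> entry : columnByPath.entrySet()) {
                if (entry.getKey().startsWith(prefix)) {
                    columns.add(sqlAlias + "." + entry.getValue());
                }
            }
        } else if (columnByPath.containsKey(ontologyPath)) {
            columns.add(sqlAlias + "." + columnByPath.get(ontologyPath));
        }
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("No registration for " + ontologyPath);
        }
        return columns;
    }

    public static void main(String[] args) {
        PathRewriter rewriter = new PathRewriter();
        rewriter.register("SeismicStation.code", "stationcode");
        rewriter.register("SeismicStation.location.lat", "lat");
        rewriter.register("SeismicStation.location.lon", "lon");
        System.out.println(rewriter.rewrite("X2", "SeismicStation.code"));
        System.out.println(rewriter.rewrite("X2", "SeismicStation.location.*"));
    }
}
```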