Unity Demonstration - People | UBC's Okanagan campus


Unity Demonstration
Dr. Ramon Lawrence
University of Iowa
Outline
Motivation and Background
Two basic integration approaches:
 global as view (GAV)
 local as view (LAV)
What is the open problem?
How Unity is different
Using Unity example
Benefits and Contributions
Future Work
Motivation
There are many integration environments:
 Operational systems within an organization
 System integration during company merger
 Data warehouses, Intranets, and the WWW
Users require information from many data sources
which often do not work together.
What is Integration?
Two levels of integration:
 Schema integration - the description of the data
 Data integration - the individual data instances
Integration handles the different mechanisms for
storing data (structural conflicts), for referencing
data (naming conflicts), and for attributing meaning
to the data (semantic conflicts).
Two Current Approaches
Current state-of-the-art integration systems can all
be reduced to a logical basis.
 For this demo, assume the data is physically stored in the
relational model and queried using Datalog.
There are two basic "database" approaches to
integration:
 global as view approach - the extraction and integration of
data is defined simultaneously with the global view
definition
 TSIMMIS using Object Exchange Model (OEM)
 local as view approach - pre-defines the global view and
then defines what portion of the global view each local
source provides
 Information Manifold using description logic
BodyWorks Systems
[Figure: BodyWorks systems - Customer Web Server, Order Database, Invoice Database, Shipment Database, Custom Accounting Package, Shipment Tracking Software]
Question: Who has a complete picture of a customer's order,
or the entire customer relationship?
Answer: No one, but
management wants to know...
Data Warehouse Approach
Features:
- static, materialized view
- performs data cleansing and aggregation
- historical more than operational
[Figure: the Invoice, Order, and Shipment databases each feed a Gather, Refine, Aggregate, Store pipeline into the Warehouse]
Query-Driven Dynamic Approach
Features:
- view dynamically built
- data is extracted at query-time
- still typically read-only
[Figure: a mediator queries the Invoice, Order, and Shipment databases, each through its own wrapper]
Invoice Database:
Cust(id,name,addr,city,state,cty)
Invoice(invId,custId,shipId,iDate)
InvProd(invId,prodId,amt,pr)
Prod(id,name,pr,desc)
Order Database:
Cust(id,name,addr,city,state,cty)
Order(oid,cid,odate)
OrdProd(oid,pid,amt,pr)
Prod(id,name,pr,desc)
Shipment Database:
Cust(id,name,addr,city,state,cty)
Shipment(shipid,oid,cid,shipdate)
ShipProd(shipid,prodid,amt)
Prod(id,name,pr,desc, inv)
Global as View Approach
Define global objects by specifying how to extract
their information from the local sources.
Requires that the administrator defining the global
view understand the semantics of every local data
source.
Further, if the local views or global views must be
changed for whatever reason (such as adding a new
data source), the global view must be re-compiled.
Global as View Example
TSIMMIS MSL example extracting customer info:
<f(I) customer {<id I> <name N> <addr A>}>@med :- <customer {<id I> <name N> <addr A>}>@invoiceDB
<f(I) customer {<id I> <name N> <addr A>}>@med :- <customer {<id I> <name N> <addr A>}>@orderDB
<f(I) customer {<id I> <name N> <addr A>}>@med :- <customer {<id I> <name N> <addr A>}>@shipmentDB
Equivalent SQL:
Union the results of the following 3 queries:
(matching ids if possible)
orderDB: SELECT * FROM customer
invoiceDB: SELECT * FROM customer
shipmentDB: SELECT * FROM customer
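A minimal Python sketch of the union described above, under stated assumptions: each source is modeled as an in-memory list of dictionaries, the sample rows are invented, and `union_customers` is a hypothetical helper rather than anything provided by TSIMMIS. Records that share an id are merged into one global customer, which is roughly the role the f(I) object id plays in the rules.

```python
from typing import Dict, List, Optional

Customer = Dict[str, Optional[str]]


def union_customers(sources: Dict[str, List[Customer]]) -> Dict[str, Customer]:
    """Union customer records from all sources, merging records with the same id."""
    merged: Dict[str, Customer] = {}
    for source_name in sorted(sources):
        for row in sources[source_name]:
            target = merged.setdefault(
                row["id"], {"id": row["id"], "name": None, "addr": None}
            )
            # Fill in any attribute the global record is still missing.
            for attr in ("name", "addr"):
                if target[attr] is None and row.get(attr) is not None:
                    target[attr] = row[attr]
    return merged


# Invented sample data: only shipmentDB happens to know an address.
sources = {
    "orderDB": [{"id": "c1", "name": "Ann", "addr": None}],
    "invoiceDB": [
        {"id": "c1", "name": "Ann", "addr": None},
        {"id": "c2", "name": "Bob", "addr": None},
    ],
    "shipmentDB": [{"id": "c1", "name": "Ann", "addr": "12 Elm St"}],
}
customers = union_customers(sources)
```

Matching on id only works if the sources share a customer key, which is the "if possible" caveat on the slide.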
Global as View Example (2)
Extract all orders with invoices and shipments:
<shipInvOrd {<shipment S> <invoice I> <order O>}>@med :-
<shipment {<shipid S> <oid O>}>@shipmentDB AND
<order {<oid O>}>@orderDB AND
<invoice {<invId I> <shipId S>}>@invoiceDB
Equivalent SQL: (if possible to query multiple databases)
SELECT shipment.shipid, invoice.invId, order.oid
FROM shipment, invoice, order
WHERE shipment.shipid = invoice.shipId AND
shipment.oid = order.oid
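The cross-database join can be tried out with a small Python sketch. It assumes SQLite as a stand-in for the three systems: three in-memory databases are attached to one connection so a single statement can span them. The sample rows are invented, and the table name `order` is quoted because it is a reserved word.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Attach one separate in-memory database per source system.
for name in ("shipmentDB", "orderDB", "invoiceDB"):
    con.execute(f"ATTACH DATABASE ':memory:' AS {name}")

con.execute("CREATE TABLE shipmentDB.shipment (shipid INTEGER, oid INTEGER)")
con.execute('CREATE TABLE orderDB."order" (oid INTEGER)')
con.execute("CREATE TABLE invoiceDB.invoice (invId INTEGER, shipId INTEGER)")

# Invented sample rows: only shipment 1 has an invoice.
con.executemany("INSERT INTO shipmentDB.shipment VALUES (?, ?)", [(1, 10), (2, 20)])
con.executemany('INSERT INTO orderDB."order" VALUES (?)', [(10,), (20,), (30,)])
con.executemany("INSERT INTO invoiceDB.invoice VALUES (?, ?)", [(100, 1)])

rows = con.execute(
    "SELECT s.shipid, i.invId, o.oid "
    'FROM shipmentDB.shipment AS s, invoiceDB.invoice AS i, orderDB."order" AS o '
    "WHERE s.shipid = i.shipId AND s.oid = o.oid"
).fetchall()
```

Here `rows` holds one (shipid, invId, oid) tuple per order that has both a shipment and an invoice. A real mediator would have to fetch from each source separately when the databases cannot be queried together.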
Local as View Approach
Pre-define an integrated global view that
encompasses the information present in all sources.
For each local source, specify the local view as a
subset of the information available in the GV.
Building the GV is typically not discussed. However, the
LAV approach makes it easier to add or remove sources
because the GV does not have to be updated.
Query processing using the LAV approach is more
difficult than with the GAV approach because the system must
determine what information can be extracted from the views.
Local as View Example
Consider this global customer relation in the GV:
customer(id, name, addr)
 Assume that the order, invoice, and shipment databases each
contain a customer record only if the customer had an
order, invoice, or shipment, respectively. Further, assume
that only shipmentDB contains a customer address.
Local views of each source:
orderView(C,N) :- customer(C,N,A)
invoiceView(C,N) :- customer(C,N,A)
shipView(C,N,A) :- customer(C,N,A)
Local as View Example (2)
Let the user pose the following query:
q(N) :- customer(I, N, A)
 Query asks for all customer names.
 Query processor must determine which views are relevant
(in this case all of them).
Local queries on each source:
q(N) :- orderView(C,N)
q(N) :- invoiceView(C,N)
q(N) :- shipView(C,N,A)
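The rewriting above can be sketched in Python, assuming each view is simply an in-memory list of tuples with invented contents. The query over the global customer relation is answered as the union of the three local queries; `answer_names` is a hypothetical name used only for this illustration.

```python
from typing import List, Set, Tuple

# Invented view contents. Each source exposes only a projection of the
# global relation customer(id, name, addr) for the customers it knows.
order_view: List[Tuple[str, str]] = [("c1", "Ann")]
invoice_view: List[Tuple[str, str]] = [("c1", "Ann"), ("c2", "Bob")]
ship_view: List[Tuple[str, str, str]] = [("c3", "Cy", "9 Oak Ave")]


def answer_names() -> Set[str]:
    """q(N) :- customer(I, N, A), rewritten as a union over the relevant views."""
    names: Set[str] = set()
    names |= {n for (_c, n) in order_view}       # q(N) :- orderView(C,N)
    names |= {n for (_c, n) in invoice_view}     # q(N) :- invoiceView(C,N)
    names |= {n for (_c, n, _a) in ship_view}    # q(N) :- shipView(C,N,A)
    return names


names = answer_names()
```

A query asking for addresses would only be able to use `ship_view`, which is the kind of relevance decision the query processor has to make.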
What is the open problem?
The two approaches are both viable methods for
solving data integration.
However, the open problem is that neither approach
performs schema integration - the construction of
the global view itself.
 GAV - GV constructed (schema integration performed) by
global designer when specifying extraction rules
 LAV - GV is pre-defined using some previous integration
process (most likely manual in nature)
 Both methods rely on the concept of a global user to
create the global schema.
How Unity is Different
Our integration architecture, called Unity, is different
because it approaches the integration problem from a
different perspective:
How can we automate, or semi-automate, the
construction of the global view by extracting
information from the local data sources?
Thus, the integration problem is tackled from a
different set of starting assumptions:
 Do not assume a pre-existing or manually created GV.
 However, assume we have a dictionary and a language for
describing schema and data element semantics.
 Attempt to automatically build a GV from source
descriptions of each data source.
The Unity Approach
Given a set of data sources and a dictionary and
language to describe data semantics:
 1) Semi-automatically extract and represent data source
semantics in the language using the dictionary.
 2) Automatically match concepts across data sources by
using the dictionary to determine related concepts.
 This process effectively builds the global level relations or objects
initially assumed or created in other approaches.
 However, since there is no manual intervention, the precision of
global view construction is affected by inconsistencies in the
descriptions of the data sources and matching concepts.
 3) Automatically generate queries specified by the user
using dictionary terms (not structures) and map the user's
query to appropriate data elements in the local sources.
Unity Overview
Unity is a software package that implements the
integration architecture with a GUI.
Developed using Microsoft Visual C++ 6 and
Microsoft Foundation Classes (MFC).
Unity allows the user to:
 Construct and modify standard dictionaries
 Build X-Specs to describe data sources
 Integrate X-Specs into an integrated view
 Transparently query integrated systems using ODBC and
automatically generate SQL transactions
Unity Example
Step #1 - Standard Dictionary
A standard dictionary (SD) provides standardized
terms to capture data semantics.
 Hierarchy of terms related by IS-A or HAS-A links
 Contains base set of common database concepts, but new
concepts can be added
An SD term is a single, unambiguous semantic
definition.
 Several SD entries for a single English word are required if
the word has multiple definitions.
The top-level dictionary terms are those proposed by
Sowa.
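One way to picture such a dictionary is the Python sketch below. It is only an illustration under stated assumptions: the class names, the placeholder root term "Root", and the sample terms and definitions are all invented here and are not taken from Unity's actual dictionary or from Sowa's top-level terms.

```python
from typing import Dict, List, Optional


class Term:
    """One dictionary term with a single, unambiguous definition."""

    def __init__(self, name: str, definition: str,
                 parent: Optional["Term"] = None, link: Optional[str] = None):
        self.name = name
        self.definition = definition
        self.parent = parent
        self.link = link  # "IS-A" or "HAS-A" link to the parent term


class StandardDictionary:
    def __init__(self, root_name: str):
        self.terms: Dict[str, Term] = {root_name: Term(root_name, "top-level term")}

    def add(self, name: str, definition: str, parent: str, link: str) -> Term:
        if link not in ("IS-A", "HAS-A"):
            raise ValueError(f"unknown link type: {link}")
        if name in self.terms:
            # A word with several meanings needs several distinct entries.
            raise ValueError(f"term already defined: {name}")
        term = Term(name, definition, self.terms[parent], link)
        self.terms[name] = term
        return term

    def path(self, name: str) -> List[str]:
        """Return the chain of terms from the root down to the given term."""
        names = []
        term = self.terms[name]
        while term is not None:
            names.append(term.name)
            term = term.parent
        return names[::-1]


sd = StandardDictionary("Root")
sd.add("Transaction", "an exchange between parties", "Root", "IS-A")
sd.add("Order", "a request to purchase goods", "Transaction", "IS-A")
sd.add("Date", "a calendar date", "Order", "HAS-A")
sd.add("Order (sequence)", "an arrangement of items", "Root", "IS-A")
order_path = sd.path("Order")
```

The two entries for the English word "order" show how multiple definitions become separate terms.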
Unity Example
Step #2 - Data Extraction
For each data source, an X-Spec document is
constructed that consists of:
 field, table, key, and join information extracted from the
ODBC source
 assignment of semantic names for each field and table
Semantic names combine dictionary terms to
describe the semantics of schema elements.
 semantic name := [CT_Type] | [CT_Type] PN
 CT_Type := CT | CT {; CT} | CT {,CT}
 CT := context term, PN := property name
 each CT and PN is a single term from the dictionary
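The grammar above can be exercised with a small parser. This is a sketch in Python, not Unity's implementation: `parse_semantic_name` is a hypothetical helper, and the example names such as "[Order;Product] Quantity" are invented to fit the grammar. The sketch only checks the shape of a name and does not verify the terms against a dictionary.

```python
import re
from typing import List, NamedTuple, Optional


class SemanticName(NamedTuple):
    contexts: List[str]           # one or more context terms (CT)
    property_name: Optional[str]  # property name (PN), if present


_PATTERN = re.compile(r"^\[([^\[\]]+)\]\s*(.*)$")


def parse_semantic_name(text: str) -> SemanticName:
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    body = match.group(1)
    # The grammar allows CT {; CT} or CT {,CT}, but not a mix of separators.
    if ";" in body and "," in body:
        raise ValueError(f"mixed context separators: {text!r}")
    contexts = [ct.strip() for ct in re.split(r"[;,]", body)]
    if any(not ct for ct in contexts):
        raise ValueError(f"empty context term: {text!r}")
    property_name = match.group(2).strip() or None
    return SemanticName(contexts, property_name)


table_name = parse_semantic_name("[Customer]")
field_name = parse_semantic_name("[Order;Product] Quantity")
```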
Unity Example
Step #2 - Data Extraction (2)
Semantic names are initially assigned using an
automatic algorithm which attempts to find the best
matches.
 The integrator can then refine initial semantic name
assignments.
Semantic names have two major purposes:
 used as a means for describing, documenting, and
comparing concepts across systems
 allow information in the database (and later in the
integrated view) to be organized by semantic concept
instead of using structures or relations
 This simplifies querying the database and integrated view because
the information is not divided in normalized relations.
Unity Example
Step #3 - Schema Building
Unlike previous approaches, the global view (or
schema) is constructed automatically by combining
source specifications (X-Specs).
This is possible because semantic naming of
concepts allows matching across systems:
 The same semantic name in two databases is assumed to
represent the same concept.
 Hierarchical nature of semantic names (consisting of
multiple terms) allows a schema to be built-up from pieces
of relations or objects from each data source.
Effectively, the global view is synthesized by the
union of concepts in the underlying systems.
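A minimal Python sketch of this matching step, assuming an X-Spec can be reduced to a mapping from semantic name to (table, column). The real X-Spec format is richer, and the semantic name assignments below are invented for the BodyWorks schemas. Identical semantic names from different sources collapse into one global concept.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

XSpec = Dict[str, Tuple[str, str]]   # semantic name -> (table, column)
Mapping = Tuple[str, str, str]       # (source, table, column)


def integrate(xspecs: Dict[str, XSpec]) -> Dict[str, List[Mapping]]:
    """Union the concepts of all sources, matching on identical semantic names."""
    view: Dict[str, List[Mapping]] = defaultdict(list)
    for source in sorted(xspecs):
        for semantic_name, (table, column) in xspecs[source].items():
            view[semantic_name].append((source, table, column))
    return dict(view)


def group_by_context(view: Dict[str, List[Mapping]]) -> Dict[str, Dict[str, List[Mapping]]]:
    """Arrange the view as context -> property -> mappings."""
    tree: Dict[str, Dict[str, List[Mapping]]] = defaultdict(dict)
    for semantic_name, mappings in view.items():
        close = semantic_name.index("]")
        context = semantic_name[1:close]
        prop = semantic_name[close + 1:].strip()
        tree[context][prop] = mappings
    return dict(tree)


xspecs = {
    "orderDB": {
        "[Customer] Name": ("Cust", "name"),
        "[Order] Id": ("Order", "oid"),
    },
    "shipmentDB": {
        "[Customer] Name": ("Cust", "name"),
        "[Order] Id": ("Shipment", "oid"),
        "[Shipment] Date": ("Shipment", "shipdate"),
    },
}
view = integrate(xspecs)
tree = group_by_context(view)
```

Because matching is purely by name, two sources that describe the same concept with different semantic names would not be merged, which is the precision issue noted earlier.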
Unity Example
Step #4 - Query Processing
The query processor:
 Allows the user to formulate queries on the view.
 Translates from semantic names in the context view to
structural queries (SQL) on databases.
 Involves determining correct field and table mappings and
discovery of join conditions and join paths
 Retrieves query results and formats them for display to the
user.
Client-side query processing:
 Perform joins between databases using common keys.
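The two stages can be sketched in Python under stated assumptions: two separate SQLite connections stand in for the ODBC sources, the semantic-name-to-column mappings are supplied by hand rather than discovered, and all helper names and sample rows are invented. Each source gets its own generated SQL query, and the results are joined on the client using the shared semantic name as the key.

```python
import sqlite3
from typing import Dict, List

Row = Dict[str, object]


def fetch_by_semantic_name(con: sqlite3.Connection, table: str,
                           fields: Dict[str, str]) -> List[Row]:
    """Generate a SELECT for one source and label results by semantic name."""
    sql = f'SELECT {", ".join(fields.values())} FROM "{table}"'
    return [dict(zip(fields.keys(), row)) for row in con.execute(sql)]


def hash_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Client-side equi-join of two result sets on a common key."""
    index: Dict[object, List[Row]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**match, **row} for row in left for match in index.get(row[key], [])]


# Two independent databases with invented rows.
order_db = sqlite3.connect(":memory:")
order_db.execute('CREATE TABLE "Order" (oid INTEGER, cid INTEGER, odate TEXT)')
order_db.executemany('INSERT INTO "Order" VALUES (?, ?, ?)',
                     [(10, 1, "2001-01-05"), (20, 2, "2001-01-07")])

shipment_db = sqlite3.connect(":memory:")
shipment_db.execute(
    "CREATE TABLE Shipment (shipid INTEGER, oid INTEGER, cid INTEGER, shipdate TEXT)")
shipment_db.execute("INSERT INTO Shipment VALUES (1, 10, 1, '2001-01-09')")

orders = fetch_by_semantic_name(
    order_db, "Order", {"[Order] Id": "oid", "[Order] Date": "odate"})
shipments = fetch_by_semantic_name(
    shipment_db, "Shipment", {"[Order] Id": "oid", "[Shipment] Date": "shipdate"})
joined = hash_join(orders, shipments, "[Order] Id")
```

The user never names a table or a join condition; the join key falls out of the fact that both sources map a column to the same semantic name.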
Benefits and Contributions
The architecture automatically integrates relational
schemas into a global view for querying.
Unique contributions:
 Synthesizing a global view from the bottom-up instead of
top-down. This should improve integration scalability.
 Organizing the global view as a hierarchy of concepts
instead of relations or predicates simplifies querying
similar to the Universal Relation as the user does not have
to specify specific predicates/relations or join conditions.
 Query processing is achieved by dynamically discovering
extraction rules.
 The discovered rules are similar to extraction rules of GAV
systems.
Future Work
Unity performs schema integration by extracting
data source information and performing global joins.
 However, the global query processor needs to be extended
to handle more diverse queries involving:
 aggregation and grouping, recursive queries, queries with
selection conditions that span data sources
 support for typical data integration problems of scaling, data type
conversions, and translation of units
Synthesizing the global view by combining concepts
can be improved by exploiting dictionary knowledge:
 Use IS-A relationships in dictionary to improve matching.
 Determine when to create new global level attributes and
contexts that are discovered based on interschema
relationships.
References
Publications:
 Unity - A Database Integration Tool, R. Lawrence and K. Barker, TRLabs Emerging Technology Bulletin, Jan. 2000.
 Multidatabase Querying by Context, R. Lawrence and K. Barker, DataSem2000, pages 127-136, Oct. 2000.
 Integrating Relational Database Schemas using a Standardized Dictionary, SAC'2001 - ACM Symposium on Applied Computing, pages 225-230, March 2001.
 Querying Relational Databases without Explicit Joins, DASWIS 2001 - International Workshop on Data Semantics in Web Information Systems (with ER'2001), Nov. 2001.
Further Information:
 http://www.cs.uiowa.edu/~rlawrenc/
Extra Slides
Data Warehouse Approach
[Figure: the Invoice, Order, and Shipment databases each feed a Gather, Refine, Aggregate, Store pipeline into the Warehouse]
Invoice Database:
Cust(id,name,addr,city,state,cty)
Invoice(invId,custId,invDate)
InvProd(invId,prodId,amt,pr)
Prod(id,name,pr,desc)
Order Database:
Cust(id,name,addr,city,state,cty)
Order(oid,cid,odate)
OrdProd(oid,pid,amt,pr)
Prod(id,name,pr,desc)
Shipment Database:
Cust(id,name,addr,city,state,cty)
Shipment(shipid,oid,cid,shipdate)
ShipProd(shipid,prodid,amt)
Prod(id,name,pr,desc, inv)
Integration Architecture
Architecture Components:
1) Integrated Context View
• user’s view of integration
2) X-Spec Editor
• stores schema & metadata
• uses XML
3) Standard Dictionary
• terms to express semantics
4) Integration Algorithm
• combines X-Specs into integrated context view
5) Query Processor
• accepts query on view
• determines data source mappings and joins
• executes queries and formats results
[Figure: Clients connect to the Multidatabase Layer, which contains the Integrated Context View, X-Spec Editor, Standard Dictionary, Integration Algorithm, and the Query Processor and ODBC Manager; subtransactions pass through the X-Specs to the databases, where they run as local transactions]
Architecture Components
The architecture consists of four components:
 A standard dictionary (SD) to capture data semantics
 SD terms are used to build semantic names describing semantics
of schema elements.
 X-Specs for storing data semantics
 Database metadata and semantic names stored using XML
 Integration Algorithm
 Matches concepts in different databases by semantic names.
 Produces an integrated view of all database concepts.
 Query Processor
 Allows the user to formulate queries on the view.
 Translates from semantic names in integrated view to SQL queries
and integrates and formats results.
 Involves determining correct field and table mappings and discovery
of join conditions and join paths
Integration Processes
The integration architecture consists of three
separate processes:
 Capture process: independently extracts database schema
information and metadata into an XML document called an X-Spec.
 Integration process: combines X-Specs into a structurally-neutral
hierarchy of database concepts called an integrated context view.
 Query process: allows the user to formulate queries on the
integrated view that are mapped by the query processor to
structural queries (SQL); the results are then integrated and
formatted.
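The capture process can be illustrated with a short Python sketch. It rests on several assumptions: SQLite's catalog is used in place of an ODBC source, and the XML element and attribute names (`xspec`, `table`, `field`, `semanticName`) are hypothetical, since the actual X-Spec format is not shown in these slides. Semantic names are left empty for the integrator or a matching algorithm to fill in.

```python
import sqlite3
import xml.etree.ElementTree as ET


def capture_xspec(con: sqlite3.Connection, source_name: str) -> ET.Element:
    """Extract table, field, and key metadata from one source into XML."""
    root = ET.Element("xspec", {"source": source_name})
    tables = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        table_el = ET.SubElement(root, "table", {"name": table, "semanticName": ""})
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, col, col_type, _notnull, _default, pk in con.execute(
            f'PRAGMA table_info("{table}")'
        ):
            ET.SubElement(table_el, "field", {
                "name": col,
                "type": col_type,
                "key": "yes" if pk else "no",
                "semanticName": "",
            })
    return root


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Cust (id INTEGER PRIMARY KEY, name TEXT)")
xspec = capture_xspec(con, "orderDB")
fields = xspec.findall("./table/field")
xml_text = ET.tostring(xspec, encoding="unicode")
```

Each source is captured independently, so a new database can be described without looking at any of the others.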
Architecture Components:
Dictionary vs. Knowledge Base
The standard dictionary differs from a knowledge
base such as Cyc because:
 Not intended to be a general English dictionary or contain
knowledge facts about the world
 Dictionary is evolved as new terms are required
 Not all English words are used
 Dictionary provides the system with no “knowledge”
 Since no facts are stored, system cannot deduce new facts
 Dictionary terms are just semantic place holders, integrators
determine the semantics of the database, not the system
 Simplified organization
 Dictionary is organized as a tree for efficiency and simplicity in
determining related concepts
 Re-use of terms
 Terms are re-used in semantic names