Data Integration Approaches
Approaches to the Integration of
Distributed and Heterogeneous
Data Resources
Ahmet Sayar
Indiana University
Computer Science Department
1
Motivation
• Integrating data from multiple data sources
• Distributed querying of, and transactions on, data.
• Definition and adoption of data and metadata models and their storage.
• Accessing the data seamlessly.
• Transparency, support for heterogeneity, extensibility
and scalability.
2
Outline
• Data Integration Approaches
– Application Specific Solutions
– Application-Integration Framework
• ASIS (Application Specific Information System)
– Database Federation
• Ogsa-DAI (Ogsa-Data Access and Integration)
• Compare ASIS with Ogsa-DAI
– Digital Libraries
• SRB (Storage Resource Broker)
• Sompel’s Digital Library Approach
• Compare ASIS with SRB and Sompel’s DL
3
Application Specific Solutions
• The most common means of data integration
• Expensive in terms of time and skills
• Development and use require deep system knowledge
• Better results for special-purpose applications
• Fragile
– Changes to the underlying sources may easily break
the application
• Hard to extend
– A new data source requires new code to be written
4
Application-Integration Framework
• Can also be called a component-based framework
– Such as CORBA or Filters with common interfaces
• Does not necessarily address data integration issues
• Based on a common data model (such as CML or GML)
– With adaptors, if a source changes the adaptor may have to change, but the application need never see it.
• Adding a new source is easy
– A new adaptor may need to be written.
– The adaptor may already exist online.
• No need for detailed system knowledge
• Ex. ASIS - OGC GIS Application Integration Framework
6
ASIS (1)
• Enables inter-service communication through well-defined service interfaces, message formats and capabilities metadata.
• Data model is ASL (Application Specific Language)
• Metadata model is the capability document
• Data and metadata have common predefined schemas
• Components are Filter Services
– Web Services with common service interfaces defined in WSDL
– Information/data services enabling distributed access,
querying and transformation through their predictable
input/output interfaces.
– Chainable, located, and capable of updating their
metadata manually or dynamically
7
ASIS (2)
• Data and data storage model
– Any data can be integrated into the system after
transforming to ASL.
– Heterogeneity is handled at the end-Filters with adaptors.
– ASL is a community-accepted, application-specific language
• GML (Geography Markup Language) in GIS applications
• CML (Chemical Markup Language) in Chemistry applications
– Filter’s common service interfaces
• getCapabilities, getData, getFeatureInfo.
– Requests to Filter’s interfaces
• getCapabilitiesReq, getDataReq, getFeatureInfoReq
– Expected return types are defined in Filters’ capability
metadata
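The three common service interfaces can be sketched as methods on a single class. This is a minimal in-process illustration, not the ASIS implementation: the class name `Filter`, the dictionary-based layer store and the sample fault record are all assumptions made for the example, and a real Filter would expose these operations as Web Service calls returning ASL (for example GML) documents.

```python
from dataclasses import dataclass, field


@dataclass
class Filter:
    """Toy stand-in for an ASIS Filter service (hypothetical, in-process)."""
    name: str
    # layer name -> {feature id -> attributes}; stands in for data held in ASL
    layers: dict = field(default_factory=dict)

    def get_capabilities(self) -> dict:
        # Capability metadata: which service this is, which requests it
        # supports and which layers it can provide.
        return {
            "service": self.name,
            "requests": ["getCapabilities", "getData", "getFeatureInfo"],
            "layers": sorted(self.layers),
        }

    def get_data(self, layer: str) -> dict:
        # Return every feature of the requested layer.
        return self.layers[layer]

    def get_feature_info(self, layer: str, feature_id: str) -> dict:
        # Return the attributes of a single feature in a layer.
        return self.layers[layer][feature_id]


# Illustrative data only; the feature below is made up for the example.
faults = Filter("F4", {"Fault": {"f1": {"name": "Example Fault", "geometry": "LineString"}}})
caps = faults.get_capabilities()
info = faults.get_feature_info("Fault", "f1")
```

A client would typically call `get_capabilities()` first and build its `get_data` or `get_feature_info` request from the layers advertised there.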
8
ASIS (3)
• Metadata and Metadata storage model:
– Data integration is done through Filters’ capability metadata
– Metadata is stored as a flat file in each Filter’s local file system.
– Capability:
• Inspired by the OGC WMS capabilities specification.
• Resembles the Dublin Core format.
• A capability-like structure is also used in Gannon’s approach (XPOLA) for Grid service security.
• Describes dynamic Web/Grid resources.
• Updated manually or dynamically.
• Consists of descriptor, service and provider metadata
• Inter-service communication is achieved without a third party, which enables chains of Filters.
9
ASIS (4)
Data Access and Filter Chaining
• Each Filter is capable of acting as both a server and a client
• Capability integration is done through the “getCapabilities” service interface
• Requests for common service interfaces are created in accordance with a predefined XML schema
[Figure: chain of Filters F1 to F4, with links labeled State Boundary, Earth and Fault]
Filter Name   Initial Data Provided     After Chaining Data Provided
F1            None                      Earth, Fault and State Boundary
F2            Earth (raster)            Earth and Fault
F3            State Boundary (vector)   State Boundary
F4            Fault (vector)            Fault
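The capability aggregation summarized in the table can be sketched as a recursive call through the chain. The topology used here (F1 chained to F2 and F3, F2 chained to F4) is inferred from the table rather than stated in the slides, and the `Filter` class is a hypothetical in-process stand-in for what would really be Web Service calls.

```python
class Filter:
    """Hypothetical Filter that acts as both a server and a client."""

    def __init__(self, name, own_layers=(), upstream=()):
        self.name = name
        self.own_layers = set(own_layers)  # data this Filter holds itself
        self.upstream = list(upstream)     # Filters it is chained to

    def get_capabilities(self):
        # Server role: advertise own layers plus everything reachable upstream.
        layers = set(self.own_layers)
        for other in self.upstream:
            # Client role: ask the chained Filter for its capabilities.
            layers |= other.get_capabilities()
        return layers


f4 = Filter("F4", own_layers=["Fault"])
f3 = Filter("F3", own_layers=["State Boundary"])
f2 = Filter("F2", own_layers=["Earth"], upstream=[f4])
f1 = Filter("F1", upstream=[f2, f3])

after_chaining = {f.name: sorted(f.get_capabilities()) for f in (f1, f2, f3, f4)}
```

Under these assumptions `after_chaining` reproduces the "After Chaining" column: F1 holds no data of its own yet advertises all three layers.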
10
Outline
• Data Integration Approaches
– Application Specific Solutions
– Application-Integration Framework
• ASIS
– Database Federation
• Ogsa-DAI
• Compare ASIS with Ogsa-DAI
– Digital Libraries
• SRB
• Sompel’s DL
• Compare ASIS with SRB and Sompel’s DL
11
Database Federation
• Middleware consisting of a database management system
• Uniform access to a number of heterogeneous data sources
• Provides a query language used to combine, contrast, analyze and manipulate the data
• Data integration is done through database integration.
• Combines data from multiple sources in a single SQL statement – query recreation.
• Ex. Ogsa-DAI (Open Grid Services Architecture – Data Access and Integration)
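The idea of combining several sources in a single SQL statement can be illustrated with SQLite's `ATTACH DATABASE`. This is only a sketch: both sources here are in-memory SQLite databases, whereas a real federation layer wraps heterogeneous systems, and the table names and rows are made up for the example.

```python
import sqlite3

# Two independent in-memory SQLite databases stand in for separate sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS seismic")

conn.execute("CREATE TABLE main.states (code TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE seismic.faults (name TEXT, state_code TEXT)")
conn.executemany(
    "INSERT INTO main.states VALUES (?, ?)",
    [("CA", "California"), ("NV", "Nevada")],
)
conn.executemany(
    "INSERT INTO seismic.faults VALUES (?, ?)",
    [("Fault A", "CA"), ("Fault B", "CA"), ("Fault C", "NV")],
)

# One SQL statement that combines rows from both sources.
rows = conn.execute(
    """
    SELECT s.name, COUNT(*) AS n_faults
    FROM main.states AS s
    JOIN seismic.faults AS f ON f.state_code = s.code
    GROUP BY s.name
    ORDER BY s.name
    """
).fetchall()
conn.close()
```

The client sees one schema and one query language; where each table physically lives is hidden behind the `main.` and `seismic.` prefixes.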
12
Ogsa-DAI (1)
• Provides a common Java API for accessing and integrating data resources (such as relational and XML databases, and files) in a Grid environment
• Specifically designed for the OGSA architecture
• SQL queries on relational resources and XPath statements on XML collections
• Provides data pipelining (similar to Filter chaining) via an XML document called the “perform” document.
• Allows developers to easily add or extend functionality within Ogsa-DAI via the “activity” document.
13
Ogsa-DAI (2)
• Data and storage model:
– Any data stored in XML or relational databases, or in files
– No common data model
– Data is provided through GDS (Grid Data Services)
– Uses Ogsa-DQP (Distributed Query Processor) to coordinate access to multiple data services
– The enactment engine is the core of Ogsa-DAI; it orchestrates the running of the perform document
– Information in the perform document includes:
• The list of activities and their XML schemas and implementation classes
• The list of role mappers and details
• The information about the data resource
14
Ogsa-DAI (3)
• Metadata storage model:
– Metadata is kept in a Metadata Catalog Service (MCS)
– MCS enables attribute-based querying
– Metadata describes the datasets; the data itself can be anything (binary, text, ...)
– Data integration is done through an XML-based activity file that mixes activities (in SQL queries) and metadata
• Simple data access scenario
– A client first contacts a DAISGR (DAI Service Group Registry) to locate the GDSFs (Grid Data Service Factories).
– It accesses suitable GDSFs directly to find out more about their properties and the data resources they represent.
– It asks a GDSF to instantiate a GDS.
– It accesses the resource by sending the GDS a GDS-Perform document.
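The four steps of this scenario can be walked through with a toy, in-process model. None of the class or method names below come from the Ogsa-DAI API; they are hypothetical stand-ins for the registry, factory and data service roles, and the "perform document" is reduced to a plain dictionary holding one SQL query against an in-memory SQLite table.

```python
import sqlite3


class GridDataService:
    """Hypothetical stand-in for a GDS bound to one data resource."""

    def __init__(self, connection):
        self._conn = connection

    def perform(self, request):
        # 'request' is a simplified placeholder for a perform document.
        return self._conn.execute(request["sql"]).fetchall()


class GridDataServiceFactory:
    """Hypothetical stand-in for a GDSF describing one data resource."""

    def __init__(self, description, connection):
        self.description = description
        self._conn = connection

    def create_service(self):
        return GridDataService(self._conn)


class Registry:
    """Hypothetical stand-in for the DAISGR."""

    def __init__(self):
        self._factories = []

    def register(self, factory):
        self._factories.append(factory)

    def find(self, keyword):
        return [f for f in self._factories if keyword in f.description]


# Set up one toy data resource and publish its factory in the registry.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE faults (name TEXT)")
db.execute("INSERT INTO faults VALUES ('Fault A')")
registry = Registry()
registry.register(GridDataServiceFactory("relational: faults", db))

factory = registry.find("faults")[0]   # 1. locate a GDSF via the registry
about = factory.description            # 2. inspect its properties
service = factory.create_service()     # 3. ask the GDSF to instantiate a GDS
result = service.perform({"sql": "SELECT name FROM faults"})  # 4. send the request
```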
15
Ogsa-DAI (4)
• Metadata model:
– No common metadata schema comparable to the capability document
– Defines metadata for the datasets
• No schema in XML
• Stored in database tables as attributes
– Defines metadata for the database system to enable querying and defining activities
• Schema in XML (mcsActivity.xsd schema file)
• Kept as an XML file in the file system (mcsActivity.xml)
16
ASIS vs. Ogsa-DAI
• Ogsa-DAI does not define metadata and data in an XML schema; metadata is mixed with the database schema. ASIS has predefined data and metadata models.
• Ogsa-DAI accepts any data and relies on a predefined database schema to enable querying and accessing data.
• ASIS’s data integration is on demand and based on capability federation. Ogsa-DAI’s data integration is instead coded in XML-structured perform and activity documents.
• Ogsa-DAI has a central metadata approach (MCS); ASIS has a distributed one.
• Both systems are based on Web Services.
• Ogsa-DAI uses GridFTP, and ASIS uses NaradaBrokering, to address performance issues in data transfers.
17
Digital Libraries
• Main focus is the publishing and discovery of digital objects.
• Digital objects: a file, a URL, an SQL command string or any string of bits.
• Collects data from multiple different data sources.
• Differs somewhat from the other data integration approaches
– Data curation services, such as publishing data to and removing data from the data sources.
• Ex. SRB (Storage Resource Broker) and Sompel’s
Digital Library Approach
19
SRB (1)
• A federated client-server system
• Each server manages/brokers a set of resources
• An implementation architecture for
– Data grids
– Digital Libraries.
• Storage resources include digital libraries, MSS, UniTree
and file systems
• SRB consists of three components
– MCAT services,
– SRB servers to access storage repositories and
– SRB clients
• Mediates access to distributed heterogeneous resources
• Uses MCAT (Metadata Catalog Service) to facilitate
brokering and attribute based querying.
• Integrates data and metadata
20
SRB (2)
• Data and storage model:
– Uniform storage interface
– Resource-specific drivers map each defined storage resource to the interface
– Storage resources are registered within SRB as physical resources
– Logical resources (LSR) enable replication.
– LSR = one or more physical resources
– Client API refers to LSR. Collections are created by LSR
• Metadata storage model (MCAT):
– Serves both core metadata and domain-dependent metadata
– Core metadata follows a standardized schema like Dublin Core
– Stores metadata about data, collections, users, resources and methods
– Attribute-based access and querying; updating of the metadata catalog
– Implemented as a relational database (Oracle, DB2 or Sybase)
– Abstraction and replica information for data
– “Global user” name space and authentication
– Authorization through ACLs and tickets
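Attribute-based querying and the mapping from a logical name to its physical replicas can be sketched with a small in-memory catalog. This is not the MCAT schema or the SRB client API: the record layout, the resource names and the helper functions `find_by_attributes` and `resolve` are assumptions made purely for illustration.

```python
# Hypothetical, in-memory stand-in for a metadata catalog such as MCAT.
catalog = [
    {
        "logical_name": "myFile",
        "attributes": {"author": "bob"},
        # One logical entry may map to several physical replicas.
        "replicas": ["resourceA:/data/myFile", "resourceB:/backup/myFile"],
    },
    {
        "logical_name": "otherFile",
        "attributes": {"author": "arthur"},
        "replicas": ["resourceA:/data/otherFile"],
    },
]


def find_by_attributes(entries, **wanted):
    """Return logical names whose attributes match every requested value."""
    return [
        entry["logical_name"]
        for entry in entries
        if all(entry["attributes"].get(k) == v for k, v in wanted.items())
    ]


def resolve(entries, logical_name):
    """Map a logical name to the physical replicas registered for it."""
    for entry in entries:
        if entry["logical_name"] == logical_name:
            return entry["replicas"]
    raise KeyError(logical_name)


matches = find_by_attributes(catalog, author="arthur")
locations = resolve(catalog, "myFile")
```

The client only ever names `myFile`; which replica is actually read is a decision the broker can make from the catalog.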
21
SRB (3)
• Metadata and Metadata Exchange Model:
– MAPS (Metadata Attribute Presentation Structure)
– Independent of the internal representation of the attributes inside the catalog.
– Provides a uniform interface specification that can be used between user applications and the MCAT catalog, and vice versa.
– Structures that form the MAPS:
• MAPS_Query_Struct
• MAPS_Result_Struct
• MAPS_Update_Struct
• MAPS_Definition_Struct
– Mapping from MAPS to other models and exchange formats. Dublin Core format is under implementation.
22
SRB (4)
• Simple data access scenario:
– The SRB server spawns an SRB agent, which authenticates the user/application against information stored in MCAT.
– The agent finds the data location in MCAT.
– The agent checks the user request against permissions stored in MCAT.
– The SRB agent returns the result of the request to the user.
– The SRB agent communicates with the user through a port specific to this client session.
• SRB server chaining scenario (integrated SRBs):
– The first 3 steps are the same as in the simple data access case.
– The SRB agent contacts a remote SRB agent via the remote SRB server.
– The second SRB agent returns a pointer to the data item to the first SRB agent, which passes it on to the user.
– The SRB client interacts with the data item directly. In the federated SRB scheme, one SRB server acts as a client to another.
23
ASIS vs. SRB
• SRB doesn’t define metadata in an XML structure (as ASIS does)
• SRB accepts any data, whereas ASIS uses ASL
• SRB keeps the metadata in a catalog service (MCAT); ASIS uses XML-structured capability metadata
• SRB has a central metadata handling approach; ASIS has a distributed one
• ASIS’s data integration is based on metadata federation; SRB’s is based on SRB server federation.
• Instead of Filters, SRB uses SRB servers and agents for accessing data resources.
24
Sompel’s DL (1)
• Scholarly communication as a network-based workflow
• Instead of Filters and ASL in ASIS, Sompel defines
“repositories” and “digital objects”, respectively.
• Repository is a networked system that provides services
pertaining to a collection of Digital Objects
• Repositories have common service interfaces.
– “Obtain”, “Harvest” and “Put”.
• Two classes of participants.
– Data providers (DP) and Service providers (SP)
• SPs collect metadata from DPs (via the three service interfaces), then normalize and cluster it to deal with duplicates.
• DPs offer some type of search mechanism for their own repositories.
25
Sompel’s DL (2)
• Data and storage model:
– Data is the abstraction of the Digital Objects
– Digital Objects = Digital data + key metadata.
– Serialization of Digital Objects = Surrogates
– Surrogates
• Information for the value chains and service information used at repository service interfaces.
• In the XML/RDF format
• Composed of “dataStream” and/or “Entity” tag elements.
• Chained object is defined by keymetadataID or “providerInfo”.
– Different storage types: book repositories, teaching object
repositories, dataset repositories etc.
– Repositories are active nodes. Repositories enable the
use and re-use of materials in many contexts.
26
Sompel’s DL (3)
• Metadata model:
– Surrogates are essentially metadata records for objects
– Based on the Dublin Core format with domain-specific extensions
– Dublin Core has 15 standard elements to define resources
– For more details see http://dublincore.org
• Chaining for integrating data:
– As in ASIS, the application/user doesn’t need a workflow engine or script to create or run the chain.
– The chain (called a “value chain”) is hidden in the surrogates.
– Surrogates are updated through the common interfaces (“put”, “obtain” and “harvest”) of the resources.
– The chain is defined in the “Entity” element of the surrogate document, in the “Lineage” sub-element.
• Sample chaining scenario:
– A paper might reference other papers, and those papers might in turn reference others.
– The value chain does not stop.
– Papers acquire different (value-added) metadata along the value chain
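Following a value chain through the lineage information in the surrogates can be sketched as a graph traversal. The identifiers, the dictionary layout and the `lineage` key below are hypothetical simplifications of the XML/RDF surrogate format; the point is only that the chain is discovered from the surrogates themselves, with no external workflow script.

```python
from collections import deque

# Hypothetical surrogates: each record lists the objects it builds on.
surrogates = {
    "paper:A": {"title": "Paper A", "lineage": ["paper:B", "paper:C"]},
    "paper:B": {"title": "Paper B", "lineage": ["paper:C"]},
    "paper:C": {"title": "Paper C", "lineage": []},
}


def value_chain(start, records):
    """Breadth-first walk over lineage links, visiting each object once."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        order.append(current)
        for ref in records[current]["lineage"]:
            if ref not in seen:  # guards against cycles and repeated references
                seen.add(ref)
                queue.append(ref)
    return order


chain = value_chain("paper:A", surrogates)
```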
27
ASIS vs. Sompel’s Approach
• Instead of Filters and ASL in ASIS, Sompel defines “repositories”
and “digital objects” respectively
• DPs correspond to End-Filters, and SPs correspond to Filters in ASIS
• ASIS does not have publishing or putting service interfaces
– “Obtain” corresponds to “getData” in ASIS
– “Harvest” corresponds to “getCapabilities” in ASIS
• Both have distributed metadata approaches for data integration
– ASIS: direct communication between Filters using the “getCapabilities” interface
– Sompel’s DL: direct communication between repositories and services using the “Harvest” interface
• Sompel’s DL uses Dublin Core for the representation of the resources; ASIS uses its own schema.
• ASIS uses ASL for the representation of the data; Sompel’s approach doesn’t have a common data model.
28
Summary
• Application-Integration Framework (ASIS)
– Easy to add new sources
– Using online Filters providing required adaptors
– Peer-to-peer chain of Filters
– No central metadata catalog server: distributed capability exchange and aggregation
– SOA
• Re-usable components (Filters) for different
applications in predefined domain
• Implications of Filter services
– Scalable and Fault-tolerant
• Load-balancing and caching
– Dynamically updating capability metadata
29
THANKS !
30
APPENDIX
31
Capability in Grid Services Security
• XPOLA
– The infrastructure is built on a peer-to-peer chain-of-trust model, with no central administrators
– WS-Security compliant
– Extensible: PKI and SAML based
– Dynamic and reusable (manually or automatically generated)
– Composed of two parts:
• Policy document (SAML, lifetime info, binding info etc.)
• Provider’s signature
• Existing Grid security solutions for fine-grained authorization did not address general Web/Grid services in compliance with the Web Services security specifications.
• Other approaches rely on central administrators and do not address dynamic services
32
Sample Capabilities File (too simplified) – GIS Domain
<?xml version='1.0' encoding="UTF-8" standalone="no" ?>
<!DOCTYPE WMT_MS_Capabilities SYSTEM "http://toro.ucs.indiana.edu:8086/xml/capabilities.dtd">
<Capabilities version="1.1.1" updateSequence="0">
  <Service>
    <Name>CGL_Mapping</Name>
    <Title>CGL_Mapping WMS</Title>
    <OnlineResource xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
      xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />
    <ContactInformation>
      …..
    </ContactInformation>
  </Service>
  <Capability>
    <Request>
      <GetCapabilities>
        <Format>WMS_XML</Format>
        <DCPType><HTTP><Get>
          <OnlineResource xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
            xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />
        </Get></HTTP></DCPType>
      </GetCapabilities>
      <GetMap>
        <Format>image/GIF</Format>
        <Format>image/PNG</Format>
        <DCPType><HTTP><Get>
          <OnlineResource xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
            xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />
        </Get></HTTP></DCPType>
      </GetMap>
    </Request>
    <Layer>
      <Name>California:Faults</Name>
      <Title>California:Faults</Title>
      <SRS>EPSG:4326</SRS>
      <LatLonBoundingBox minx="-180" miny="-82" maxx="180" maxy="82" />
    </Layer>
  </Capability>
</Capabilities>
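A client could read such a capabilities document with a standard XML parser to find out what a service offers. The sketch below uses Python's `xml.etree.ElementTree` on a trimmed copy of the sample (DOCTYPE, contact details and OnlineResource elements left out for brevity); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample capabilities file, embedded for self-containment.
CAPABILITIES_XML = """\
<Capabilities version="1.1.1" updateSequence="0">
  <Service>
    <Name>CGL_Mapping</Name>
    <Title>CGL_Mapping WMS</Title>
  </Service>
  <Capability>
    <Request>
      <GetCapabilities><Format>WMS_XML</Format></GetCapabilities>
      <GetMap>
        <Format>image/GIF</Format>
        <Format>image/PNG</Format>
      </GetMap>
    </Request>
    <Layer>
      <Name>California:Faults</Name>
      <SRS>EPSG:4326</SRS>
      <LatLonBoundingBox minx="-180" miny="-82" maxx="180" maxy="82" />
    </Layer>
  </Capability>
</Capabilities>
"""

root = ET.fromstring(CAPABILITIES_XML)

# Service metadata: who is answering.
service_name = root.findtext("Service/Name")

# Layers the service can provide, and output formats for GetMap.
layer_names = [layer.findtext("Name") for layer in root.iter("Layer")]
map_formats = [fmt.text for fmt in root.findall("Capability/Request/GetMap/Format")]

# Bounding box attributes are returned as strings.
bbox = root.find("Capability/Layer/LatLonBoundingBox").attrib
```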
33
Dublin Core
• Addresses the challenge of resource description and discovery
• A language for making a particular class of statements about resources
• There are two namespaces: the Dublin Core element set (dc) and Dublin Core qualifiers (dcq, e.g. dcq:iso8601).
• Some elements of the Dublin Core metadata element set:
– Title (dc:title), subject, description, creator, publisher, type, format, source, language, rights
• Using DC in RDF: specifications for DC in RDF (work in progress)
• Statement form: Resource has (verb) property (dc:creator) X (dc:Ahmet)
Sample Dublin Core
35
http://www.ils.unc.edu/mrc/jcdl2006/slides/kunze.pdf
Open Archives Initiative
OAI
36
OAI
• Deals with the e-print server world
• Need for services that permit searching across papers housed at multiple repositories
• Repositories also need capabilities to automatically identify and copy papers that have been deposited in them.
• Definition of an interface that permits e-print servers to expose the metadata for the papers they hold.
• Service providers with similar metadata standards need to harvest this metadata
• Service providers act as a federation of repositories by indexing documents, so that multiple collections can be searched as though they form a single collection
37
OAI-PMH
• Intended for the variety of communities engaged in publishing content on the Web
• Any networked server can employ the protocol to enable service providers to collect its metadata
• HTTP-based request-response transactions
• Service Providers
– Harvest metadata from Data Providers using the OAI protocol and use the returned metadata as a basis for building value-added services.
• Data Providers (repositories)
– Adopt the OAI technical framework as a means of exposing metadata about their content.
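Because OAI-PMH requests are ordinary HTTP requests whose parameters include a `verb`, a harvest request can be sketched as URL construction. The base URL below is a placeholder and the helper `oai_request_url` is hypothetical; `ListRecords` and the `oai_dc` metadata prefix (unqualified Dublin Core) are part of the protocol. No network call is made.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET request URL from a verb and its arguments."""
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


# A service provider asking a data provider for all records in Dublin Core.
url = oai_request_url(
    "http://example.org/oai",  # placeholder repository base URL
    "ListRecords",
    metadataPrefix="oai_dc",
)
```

A service provider would issue this request, parse the XML response and feed the harvested records into its own index.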
38
Comments on OAI
• OAI-PMH is ultimately only as useful as the
metadata it transports.
• The tendency of implementers to apply, almost exclusively, the lowest common denominator of unqualified Dublin Core makes it difficult to implement more advanced search interface features.
• Content providers should prefer more expressive metadata schemas such as MARC or qualified DC, and find ways to augment human-generated descriptive metadata.
39
Sompel’s Digital Library
Approach
40
Sompel’s Approach
Hierarchy steps
41
http://msc.mellon.org/Meetings/Interop/lagoze_data_model.pdf
Sompel’s DL
Data Model
42
msc.mellon.org/Meetings/Interop/lagoze_data_model.pdf
Ogsa-DAI
43
Ogsa-DAI Figure
http://www.globus.org/grid_software/data/dai.php
44
Perform Document
http://www.ogsadai.org.uk/documentation/ogsadai-wsi-2.2/doc/interaction/Perform.html
45
MCS
• MCS is a design for a Metadata Catalog Service that provides a mechanism for storing and accessing descriptive metadata attributes
• Requirements: store domain-independent attributes and user-defined attributes; query with a set of attributes; query with a logical name; authentication, authorization and auditing
• Allows users to discover data sets based on the values of descriptive attributes, rather than requiring them to know specific names or physical locations of data items
46
MCAT vs. MCS
• MCAT can be used only with SRB
• MCS can be used only in the OGSA architecture
• MCAT stores both physical and logical addresses
• MCS stores logical metadata attributes and handles that can be resolved by a data location or data access service.
• Both can be extended to serve application-specific metadata, but neither has a generalized way of doing so.
47
SRB
48
SRB
49
CLIENT
• Example interaction with SRB using Scommands:
– Sinit
• Start interaction with SRB
– Spwd
• Display current position within SRB repository
– Smeta -i -I "UDSMD0='author'" -I "UDSMD1='bob'" myFile
• Add metadata describing the author of the file
– Smeta -i -I "UDSMD0='author'" -I "UDSMD1='arthur'"
• Search for files with author metadata set to arthur
– Sget myFile
• Copy myFile from SRB to local storage
– Sreplicate -S anotherResource myFile
• Create a replica of myFile on anotherResource
– Srm myFile
• Remove myFile (and all replicas) from SRB
– Sexit
• End interaction with SRB
50