Transcript OGSA-DAI

Sessions 43 & 44
Accessing data using a common
interface: OGSA-DAI as an example
Elias Theocharopoulos and Tilaye Alemu
ISSGC ‘09 – Sophia Antipolis – Tuesday, 14th July 2009
web: www.omii.ac.uk
email: [email protected]
Overview
• The problem: Sharing data in a grid
• What is OGSA-DAI?
• Data-centric workflows
• Key OGSA-DAI terms
• The OGSA-DAI client toolkit
• Use cases and extensibility points
• Pros and cons
The problem: Sharing and accessing data in a grid
Distributed data resources
How about a central server?
[Diagram: the Client sends an FR query to the central server and receives FR data]
Central server pros and cons
• Access to up-to-date data
• Single point of access
• Data in common format
• Database can handle joins
• Initial overhead in terms of time, effort and cost
• Keeping data up to date
• Loss of control by data providers
o Assuming they even let go
• Security and trust
How about providing direct access?
[Diagram: the Client sends separate UK, ES and IA queries directly to each data resource, receives UK, ES and IA data, and must translate and join the results itself]
Direct access pros and cons
• Access to up-to-date data
• Fast access
• Data providers retain control
• Fat clients
• Heterogeneity and inconsistency
o Data
o Databases
o Connection
o Security
• Security overheads for data providers
o Manage firewalls and usernames/passwords for multiple clients
• Hard to use in grid/web service workflows
How about providing a ZIP on the web?
[Diagram: each provider publishes a ZIP of its data (UK data, ES data, IA data) on the web. The Client fetches each ZIP with HTTP GET, then must unzip, translate and join the data itself]
ZIP on the web pros and cons
• Fast access
• Data providers retain control
• Very large downloads even if the client only needs a subset
• Providers have to select and ZIP their data
• Client has to install data into a local database
• Static snapshot
Sharing distributed heterogeneous resources with OGSA-DAI
[Diagram: the Client sends one FR query to OGSA-DAI and receives FR data. OGSA-DAI sends UK, ES and IA queries to the respective data resources, receives UK, ES and IA data, and translates and joins it]
Motivation
• Grid is about sharing resources
• Need to share structured data resources
[Diagram: example structured data resources: relational database, XML database, indexed file]
What is OGSA-DAI?
• Open Grid Services Architecture Data Access and Integration
• A framework that executes workflows
• Workflows are data-centric
• Workflow components are designed for data access, integration, transformation and delivery
• Can access heterogeneous data resources
• Web service interface
• Intended as a toolkit for building higher-level application-specific data services
OGSA-DAI’s vision
• Sharing data resources to enable collaboration
• Data access
o Structured data in distributed heterogeneous data resources
• Data integration
o e.g. expose multiple databases to users as a single virtual database
• Data transformation
o e.g. expose data in schema X to users as data in schema Y
• Data delivery
o To where it’s needed by the most appropriate means
o e.g. web service, e-mail, HTTP, FTP, GridFTP
OGSA-DAI and data-centric workflows
OGSA-DAI workflow
• Executes workflows
• Workflows contain activities
o Well-defined functional units
o Data goes in, something is done, data comes out
o Equivalent to programming language methods
• Workflows are submitted by clients
o To an OGSA-DAI web service
An OGSA-DAI workflow - a simple analogy
[Diagram: the client's French query is translated and run against two databases, and the translated results are joined]
Client query (French):
SELECT Pays, Capital FROM Pays
Branch 1:
• Convert query from French to English: SELECT Country, Capital FROM Countries
• Run SQL query
Country  Capital
UK       London
France   Paris
• Convert data from English to French
Pays             Capital
Grande-Bretagne  Londres
France           Paris
Branch 2:
• Convert query from French to Spanish: SELECT País, Capital FROM Países
• Run SQL query
País    Capital
España  Madrid
Italia  Roma
• Convert data from Spanish to French
Pays       Capital
l'Espagne  Madrid
l'Italie   Rome
Join the data:
Pays             Capital
Grande-Bretagne  Londres
France           Paris
l'Espagne        Madrid
l'Italie         Rome
How it appears to the client
[Diagram: the Client submits workflow(SELECT Pays, Capital FROM Pays) to OGSA-DAI and receives the following data]
Pays             Capital
Grande-Bretagne  Londres
France           Paris
l'Espagne        Madrid
l'Italie         Rome
Data integration with OGSA-DAI workflows
• Across OGSA-DAI services
[Diagram: Workflow 1 and Workflow 2 run on two OGSA-DAI services fronting DB1 and DB2. Activities shown: SQLQuery (DB1) → Deliver to OGSA-DAI; Receive from OGSA-DAI and SQLQuery (DB2) → JOIN → Deliver]
Key OGSA-DAI terms: activities, resources, workflows
OGSA-DAI: Key Term Activity
• An activity is a named unit of functionality
o A well-defined workflow unit
o Pluggable
o Composable
• An activity can have
o 0 or more named inputs
o 0 or more named outputs
• Blocks of data flow from an activity’s output into another activity’s input
OGSA-DAI: Key Term Activity (cont.)
• Example activities include
o Execute an SQL query
o ZIP a batch of data
o List the files in a directory
o Execute an XSL transform on an XML document
o Deliver data to an FTP server
OGSA-DAI: Key Term Activity (cont.)
• Activity Connections
o All required inputs must be connected
o All outputs must be connected
o Optional inputs
• Inputs
o Literal
o Streamed
o Types
Connecting activities - examples
Data grouping: Lists
• Special blocks are used to mark the beginning and the end of a list.
• A list groups related data as one unit.
• For example, ReadFromFileActivity can dynamically take any number of filenames as input.
o Without a way to group the output byte arrays we would have no way to differentiate between the binary data of files f1 and f2.
o Streaming is preserved, since for each file a number of byte arrays is produced and forwarded to the following activities.
[Diagram: f1, f2 → ReadFromFileActivity → [byte[]…], [byte[]…]]
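The list idea can be sketched in a few lines of Java. This is only an illustration, not OGSA-DAI code: the class name, the `LIST_BEGIN` / `LIST_END` marker objects and both methods are hypothetical, and the "files" are in-memory byte arrays. It shows how begin/end markers let a consumer tell the chunks of f1 apart from the chunks of f2 while the data is still passed on chunk by chunk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListMarkerSketch {
    // Hypothetical marker objects standing in for the special list blocks.
    static final Object LIST_BEGIN = new Object();
    static final Object LIST_END = new Object();

    // Emits one list per file: LIST_BEGIN, the file's bytes in chunks, LIST_END.
    // chunkSize must be greater than zero.
    static List<Object> readFiles(List<byte[]> files, int chunkSize) {
        List<Object> blocks = new ArrayList<>();
        for (byte[] content : files) {
            blocks.add(LIST_BEGIN);
            for (int off = 0; off < content.length; off += chunkSize) {
                int end = Math.min(content.length, off + chunkSize);
                blocks.add(Arrays.copyOfRange(content, off, end));
            }
            blocks.add(LIST_END);
        }
        return blocks;
    }

    // Recovers the number of bytes per file from the flat block stream.
    static List<Integer> bytesPerList(List<Object> blocks) {
        List<Integer> sizes = new ArrayList<>();
        int current = 0;
        for (Object block : blocks) {
            if (block == LIST_BEGIN) {
                current = 0;
            } else if (block == LIST_END) {
                sizes.add(current);
            } else {
                current += ((byte[]) block).length;
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Two stand-in files of 5 and 3 bytes, read in 2-byte chunks.
        List<Object> blocks = readFiles(List.of(new byte[5], new byte[3]), 2);
        System.out.println(bytesPerList(blocks)); // prints [5, 3]
    }
}
```

Without the two markers the consumer would see five anonymous byte arrays and could not tell where one file ends and the next begins.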
Passing data internally: OGSA-DAI Tuple
• A special type of data passed between activities
• A Tuple is a data representation similar to a row of relational data. Each element of a Tuple represents a column.
• Tuples are normally grouped in lists and are preceded by a metadata block.
[Diagram: SqlQuery runs SELECT city, temp FROM weather; and outputs the tuples (Athens, 20), (Madrid, 22), (Rome, 25)]
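As a rough sketch of why the metadata block matters, the Java below models the weather example as a metadata block followed by one tuple per row. The `Metadata` class, the use of `Object[]` for tuples and the method names are assumptions made for illustration; they are not the OGSA-DAI tuple classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TupleListSketch {
    // Hypothetical stand-in for the metadata block: only the column names.
    static final class Metadata {
        final List<String> columns;
        Metadata(String... columns) { this.columns = Arrays.asList(columns); }
    }

    // Block sequence for the weather example: metadata first, then one Object[] per tuple.
    static List<Object> weatherBlocks() {
        List<Object> blocks = new ArrayList<>();
        blocks.add(new Metadata("city", "temp"));
        blocks.add(new Object[] {"Athens", 20});
        blocks.add(new Object[] {"Madrid", 22});
        blocks.add(new Object[] {"Rome", 25});
        return blocks;
    }

    // Uses the metadata block to find a column by name, then reads it from every tuple.
    static List<Object> column(List<Object> blocks, String name) {
        Metadata metadata = (Metadata) blocks.get(0);
        int index = metadata.columns.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("No such column: " + name);
        }
        List<Object> values = new ArrayList<>();
        for (Object block : blocks.subList(1, blocks.size())) {
            values.add(((Object[]) block)[index]);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(column(weatherBlocks(), "temp")); // prints [20, 22, 25]
    }
}
```

A downstream activity that receives only the tuples would have positions but no column names, which is why the metadata block travels ahead of them.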
An interesting activity: Tee
• There are activities that operate at the level of blocks and are not concerned with the type and values of the data they are handling, e.g. TeeActivity:
[Diagram: [A,B,C,D] → TeeActivity (number of outputs: 2) → [A,B,C,D] and [A,B,C,D]]
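A tee is easy to sketch because it never inspects the blocks. The generic Java method below is an illustration only (the class and method are hypothetical, not the server-side TeeActivity): it copies every incoming block, whatever its type, to each of the requested outputs.

```java
import java.util.ArrayList;
import java.util.List;

class TeeSketch {
    // Copies each input block, unchanged and in order, to every output.
    static <T> List<List<T>> tee(Iterable<T> input, int outputs) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < outputs; i++) {
            result.add(new ArrayList<>());
        }
        for (T block : input) {
            for (List<T> output : result) {
                output.add(block); // the block's type and value are never examined
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tee(List.of("A", "B", "C", "D"), 2));
        // prints [[A, B, C, D], [A, B, C, D]]
    }
}
```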
OGSA-DAI: Key Term Resource
• Data request execution resource (DRER)
• Data resources
• Data sources
• Data sinks
• Sessions
o A state container associated with a set of workflows
o One workflow can lodge state
o A subsequent workflow can retrieve it
• Requests
o One per workflow submitted to a DRER
o Access request status
OGSA-DAI: Key Term Workflow
• A workflow can contain:
o Activities
• Resource-based: SQLQuery
• Non-Resource: Transformation and Delivery
o Resources
• Targeted by Activities
o Other Workflows
• Sub workflows
• Other types of workflow
OGSA-DAI: Key Term Workflow (cont’)
• OGSA-DAI can be used as a workflow
processing system that is designed to stream
data through a set of activities in a pipelined
manner.
• In the Query->Transform->Deliver workflow, if
the activities are well defined all three will be
processing concurrently with different portions
of the data stream.
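The pipelined behaviour can be imitated with plain Java threads and bounded queues. This is a sketch under stated assumptions, not the OGSA-DAI engine: the three stages are stand-ins (the "query" just emits given rows, the "transform" upper-cases them, the "deliver" collects them), and `END` is a made-up end-of-stream marker. Because the queues hold at most two blocks, the stages have to work on different portions of the stream at the same time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class PipelineSketch {
    // Hypothetical end-of-stream marker, compared by identity.
    private static final String END = new String("END");

    static List<String> run(List<String> rows) {
        BlockingQueue<String> queried = new ArrayBlockingQueue<>(2);
        BlockingQueue<String> transformed = new ArrayBlockingQueue<>(2);
        List<String> delivered = new ArrayList<>();

        // Stage 1 ("Query"): emits rows one block at a time.
        Thread query = new Thread(() -> {
            try {
                for (String row : rows) queried.put(row);
                queried.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Stage 2 ("Transform"): starts as soon as the first row arrives.
        Thread transform = new Thread(() -> {
            try {
                for (String row = queried.take(); row != END; row = queried.take()) {
                    transformed.put(row.toUpperCase(Locale.ROOT));
                }
                transformed.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Stage 3 ("Deliver"): collects transformed rows in arrival order.
        Thread deliver = new Thread(() -> {
            try {
                for (String row = transformed.take(); row != END; row = transformed.take()) {
                    delivered.add(row);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        query.start();
        transform.start();
        deliver.start();
        try {
            query.join();
            transform.join();
            deliver.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("athens", "madrid", "rome")));
        // prints [ATHENS, MADRID, ROME]
    }
}
```

The bounded queues also hint at why streaming keeps memory use low: no stage ever holds the whole result set.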
OGSA-DAI: Key Term Workflow (cont’)
• Pipeline workflow: a set of chained activities that are executed in parallel, with data flowing between the activities.
• Sequence workflow: all the sub-workflows added to this workflow are executed in sequence.
o For example, the 1st sub-workflow in a sequence creates a table and the 2nd bulk loads transformed data into this table.
• Parallel workflow: all the sub-workflows added to this workflow are executed in parallel.
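The difference between a sequence and a parallel workflow can be sketched with ordinary Java runnables. The names below are hypothetical and each sub-workflow is reduced to a `Runnable`; the point is only the ordering guarantee, not how OGSA-DAI schedules work internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class WorkflowKindsSketch {
    // Sequence: each sub-workflow finishes before the next one starts.
    static void runSequence(List<Runnable> subWorkflows) {
        for (Runnable subWorkflow : subWorkflows) {
            subWorkflow.run();
        }
    }

    // Parallel: all sub-workflows are submitted together; returns once all are done.
    static void runParallel(List<Runnable> subWorkflows) {
        if (subWorkflows.isEmpty()) return;
        ExecutorService pool = Executors.newFixedThreadPool(subWorkflows.size());
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Runnable subWorkflow : subWorkflows) {
                futures.add(pool.submit(subWorkflow));
            }
            for (Future<?> future : futures) {
                future.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // The slide's sequence example: create the table first, then bulk load into it.
    static List<String> demo() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        Runnable createTable = () -> log.add("create table");
        Runnable bulkLoad = () -> log.add("bulk load");
        runSequence(List.of(createTable, bulkLoad));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [create table, bulk load]
    }
}
```

Running the same two steps through `runParallel` would give no ordering guarantee, which is why the create-then-load case is a sequence.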
Getting to the first practical: The OGSA-DAI client toolkit
OGSA-DAI client toolkit
• OGSA-DAI client toolkit
o Construct and submit requests in Java not XML
• Toolkit manages interaction with web services via SOAP over HTTP; it handles SOAP request construction and response parsing.
o Provides Java abstractions of
• Services
• OGSA-DAI resources and properties
• Requests
• Activities
The client toolkit
• The workflow description is sent to the OGSA-DAI
server as an XML document.
• Application developer does not need to worry about
creating this document.
• The client toolkit provides ways of assembling activity
workflows programmatically.
• We will see how to use the client toolkit during the
hands-on session.
Service/resource model
[Diagram labels: Client; Data Request Execution Service; Data Request Execution Resource "MyDRER"; Data Resources "One", "Two" and "Three", each in front of its data; Request Management Service; two Sessions; Request "MyRequest123456"]
Client Toolkit Activities
• One client activity per server activity
• Same input and output names
• Plus some convenience methods
For example:
• Retrieve results as a JDBC ResultSet from a
TupleToWebRowSet activity.
• Retrieve update count as an Integer from a
SQLUpdate activity
Step by Step Guide for Writing Clients
• Create activities
o There’s a corresponding client toolkit activity for each server-side activity
DeliverToFTP deliver = new DeliverToFTP();
ReadFromFile readFile = new ReadFromFile();
Connecting activities
• Set inputs for each activity (e.g. parameters)
• Every input parameter can either be literal input
or streamed from another activity
o Literal inputs, e.g. for constant parameters:
deliver.addFilename("results1.txt");
deliver.addHost("[email protected]:21");
o Connect input to the output of another activity to stream data
deliver.connectDataInput(readFile.getDataOutput());
Gaining access to the results
• If the output of an activity can be provided in a
user-friendly type, then there are methods to
access the results:
o Check whether there are more results to be retrieved
boolean hasNext = sqlUpdate.hasNextResult();
o Get the next result in a convenient type
int count = sqlUpdate.getNextResult();
Build and execute the Workflow Request
• Create a workflow and add activities to it
• A data service executes the workflow and
returns a response (or an error!)
• The response may contain data (depending on
the activities)
• Each client toolkit activity provides utility
methods for retrieving its response data
First hands-on session
Go to:
http://homepages.nesc.ac.uk/~elias/issgc09/html/practical.html
Extensibility points & components
Extending OGSA-DAI: What
• OGSA-DAI
o A Framework
o Extensible
• Out of the box you get the basics
o Different applications have different needs
o New Sources of Data
o New Functionality
Extending OGSA-DAI: Overview
[Diagram: OGSA-DAI architecture layers and their extensibility points]
Presentation Layer: message frameworks such as Axis, GT, OMII, UNICORE, gLite, WS-DAI, embedded, or a new one (?)
OGSA-DAI Core: Workflow Execution Engine, Sessions, Request, Persistence and Configuration
Activity Framework: SQLQuery, XPathQuery, XSLTransform, DeliverToURL, MyOwnActivity (new functionality)
Data Resources, Data Source, Data Sink (new types of data)
Extending OGSA-DAI: Activities
• Activities do some unit of work
• Specific transformation
o Data Format: SWISS-PROT to format X
• Delivery
o Deliver to a target service
• Data analysis and Integration
o Combine data from different sources
Extending OGSA-DAI: Resources
• New resources – why?
o New Products
o New Applications
o Specialised Access
• Required:
o DataResource
o DataResourceState
o ResourceAccessor
Extending OGSA-DAI: Remote Resource
• Accessing Resources on Remote OGSA-DAI
• Avoid replication of resources
• Security Issues
o Devolved to Local OGSA-DAI
o Security between OGSA-DAI Deployments
SQL views
• Define a drPatient view
o SELECT patient.id, patient.name, age, sex, doctor.name AS drName FROM patient, doctor WHERE patient.DrID = doctor.ID;
patient table:
ID  Name   Age  Sex  ZIP        DrID
1   Ken    42   M    IL1478305  456
2   Josie  25   F    BN1 7QP    789
doctor table:
ID   Name      DN
123  Greene    US-Chicago-G
456  Ross      US-Chicago-R
789  Fairhead  UK-Holby-F
• Client runs SELECT * FROM drPatient;
• Shorthand for complex query results
• Data access control e.g. users of drPatient
o Cannot access a patient’s ZIP
o Are unaware of the doctor or patient tables
OGSA-DAI SQL views
• OGSA-DAI SQL views data resource
o Represents a view across a database exposed by an OGSA-DAI relational resource
• SQLQuery activity
o Parses query
o Splices in view definition
o Submits transformed query to database
• Can define views for read-only databases
• Schema transformation
o Map a logical schema to a physical schema
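The "splice in the view definition" step can be pictured as rewriting the client's query so that the view name becomes a sub-select. The Java below is a deliberately naive sketch: it uses a regular expression instead of a real SQL parser, and the class, method and the shortened view definition are all assumptions for illustration rather than how the SQLQuery activity is implemented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ViewSpliceSketch {
    // Replaces "FROM <viewName>" with "FROM (<viewDefinition>) AS <viewName>".
    // Naive: only handles a view that directly follows FROM.
    static String splice(String query, String viewName, String viewDefinition) {
        Pattern fromView = Pattern.compile(
                "(?i)\\bFROM\\s+" + Pattern.quote(viewName) + "\\b");
        String replacement = "FROM (" + viewDefinition + ") AS " + viewName;
        return fromView.matcher(query)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String rewritten = splice(
                "SELECT * FROM drPatient;",
                "drPatient",
                "SELECT id FROM patient");
        System.out.println(rewritten);
        // prints SELECT * FROM (SELECT id FROM patient) AS drPatient;
    }
}
```

Because the rewrite happens before the query reaches the database, nothing has to be created in the database itself, which is what makes views over read-only databases possible.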
Distributed query processing
• OGSA-DQP
o Developed by Universities of Manchester and Newcastle
o Refactored for OGSA-DAI 3.0 by EPCC as part of the NextGrid project
o OGSA-DAI DQP package
• Multiple tables on multiple databases are exposed to clients as multiple tables in one “virtual database”
• Clients are unaware of the multiple databases
• Databases can be exposed
o EITHER within one OGSA-DAI server
o OR via multiple remote OGSA-DAI servers
OGSA-DAI DQP
[Diagram: how OGSA-DAI DQP evaluates a query across two OGSA-DAI servers]
1: Client submits the query to OGSA-DAI (core + DQP coordinator):
SELECT Archeo_Finds.ID, Archeo_Finds.Provenance, Annotations_Ratings.Confidence FROM Annotations_Ratings, Archeo_Finds WHERE Annotations_Ratings.Confidence > 0.99 AND Annotations_Ratings.ID = Archeo_Finds.ID;
2: Parse query and form query plan
3: Execute sub-queries, one per OGSA-DAI server
3a: SELECT Archeo_Finds.ID, Archeo_Finds.Provenance FROM Archeo_Finds;
3b: SELECT Annotations_Ratings.ID, Annotations_Ratings.Confidence FROM Annotations_Ratings WHERE Annotations_Ratings.Confidence > 0.99
4: Push results to OGSA-DAI (DQP query evaluator)
5: Combine and postprocess (do the JOIN), then return the results to the client
OGSA-DAI workflows – a de-facto standard
• OGSA-DAI workflows are a de-facto standard
o Of use to many projects as we’ll see
• For some applications workflows are too powerful
o Too expressive
o Infer semantics from names of activities available on server
• Must interrogate the server
o Problems using OGSA-DAI services in workflow engines e.g. Taverna
o Not compatible with existing data analysis tools
Facades
• Define facades on top of OGSA-DAI
• Why?
o Provide interfaces with more tightly-defined semantics
o Comply with standards
o Exploit existing data analysis tools
• Continue to exploit the power of workflows under the hood
o “Canned workflows”
o Templates selected and populated, executed and parsed
o Map service operations to “template” OGSA-DAI workflows
Grid-enabling existing data-related products
[Diagram: an OGSA-DAI mediator sits between a data analysis tool and OGSA-DAI]
OGSA-DAI in action
VOTES – data with different schema distributed across multiple
databases within a group of strategic partners
• Virtual Organisations for Trials and Epidemiological
Studies (VOTES)
o http://labserv.nesc.gla.ac.uk/projects/votes/index.html
o UK Medical Research Council project
• Data access and integration in the clinical domain
o Relational databases – Microsoft SQL Server, Access, …
o Distributed database joins
• Patient information
• Clinical trials records
o Linking key is Scotland’s CHI number
VOTES – cross-database join activity
[Diagram: one OGSA-DAI workflow over DB1 and DB2]
SQLQuery (DB1): SELECT CHI, Sex, DOB FROM Patients ORDER BY CHI → ordered stream of (CHI, Sex, DOB)
SQLQuery (DB2): SELECT CHI, Diagnosis FROM TrialX ORDER BY CHI → ordered stream of (CHI, Diagnosis)
Merge Join of the two ordered data streams → (CHI, Sex, DOB, Diagnosis) → Deliver
• This is equivalent to running:
SELECT patients.chi, sex, DOB, diagnosis FROM patients, trialX WHERE patients.chi = trialX.chi;
• patients and trialX are in two different databases
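The reason both sub-queries use ORDER BY CHI is that two ordered streams can be joined in a single forward pass. The Java below is a minimal merge join sketch, not the VOTES activity: rows are plain `String[]` arrays, the CHI values in the example are made up, CHI is compared as a string, and it assumes each CHI appears at most once on the patients side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeJoinSketch {
    // patients rows: {CHI, Sex, DOB}; trial rows: {CHI, Diagnosis}.
    // Both inputs must already be sorted ascending by CHI.
    static List<String[]> mergeJoin(List<String[]> patients, List<String[]> trial) {
        List<String[]> joined = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < patients.size() && j < trial.size()) {
            String[] p = patients.get(i);
            String[] t = trial.get(j);
            int cmp = p[0].compareTo(t[0]);
            if (cmp < 0) {
                i++; // patient has no (more) trial rows
            } else if (cmp > 0) {
                j++; // trial row has no matching patient
            } else {
                joined.add(new String[] {p[0], p[1], p[2], t[1]});
                j++; // stay on the same patient in case of further trial rows
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> patients = List.of(
                new String[] {"0101", "F", "1950-01-01"},
                new String[] {"0202", "M", "1960-02-02"},
                new String[] {"0303", "F", "1970-03-03"});
        List<String[]> trial = List.of(
                new String[] {"0202", "Asthma"},
                new String[] {"0303", "Flu"},
                new String[] {"0404", "Asthma"});
        for (String[] row : mergeJoin(patients, trial)) {
            System.out.println(Arrays.toString(row));
        }
        // prints [0202, M, 1960-02-02, Asthma]
        //        [0303, F, 1970-03-03, Flu]
    }
}
```

Each input is read once and never buffered in full, so the join fits the streaming style of the rest of the workflow.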
Public Health Grid – data with different schema distributed across multiple
databases within a group of strategic partners
• US Public Health Grid
o US Centers for Disease Control
o University of Pittsburgh
o Tarrant County Public Health Department
o Dallas County Public Health Department
• Real-time Outbreak and Disease Surveillance
o Health query system
o Look for incidences of some disease on the rise over an area
o Historical and live data
• Health centres maintain their own databases
o Distributed databases
o Different products and schemas
• e.g. PatientID, Id, PatientIdentifier, PatientNumber
o Security and privacy are important
Public Health Grid – workflows, DQP and views
[Diagram: a workflow over databases DB1 to DB6, exposed through OGSA-DAI services, OGSA-DAI views and OGSA-DQP]
SQLQuery (DB6):
SELECT zip, count(*) AS total FROM Cases WHERE Reason = 'Flu' GROUP BY zip ORDER BY zip
Result: (15112, 3), (15144, 1)
Cases is defined as:
SELECT * FROM DB1.Cases UNION DB2.Cases UNION DB4.Cases
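What the client sees as one Cases table is really the rows of several databases put together and then counted. The Java below sketches that effect in memory. It is an illustration under assumptions: each case row is a made-up `{zip, reason}` pair, the databases are plain lists, and rows are simply concatenated without removing duplicates, which a strict SQL UNION would do.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FluCountSketch {
    // Counts flu cases per zip across all databases; TreeMap keeps zips ordered.
    static Map<String, Long> fluCountsByZip(List<List<String[]>> databases) {
        Map<String, Long> counts = new TreeMap<>();
        for (List<String[]> cases : databases) {      // union of the Cases tables
            for (String[] row : cases) {              // row = {zip, reason}
                if ("Flu".equals(row[1])) {           // WHERE Reason = 'Flu'
                    counts.merge(row[0], 1L, Long::sum); // GROUP BY zip, count(*)
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> db1 = List.of(
                new String[] {"15112", "Flu"},
                new String[] {"15112", "Flu"});
        List<String[]> db2 = List.of(
                new String[] {"15112", "Flu"},
                new String[] {"15144", "Cold"});
        List<String[]> db4 = List.of(
                new String[] {"15144", "Flu"},
                new String[] {"15201", "Cold"});
        System.out.println(fluCountsByZip(List.of(db1, db2, db4)));
        // prints {15112=3, 15144=1}
    }
}
```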
SEE-GEO – working with private and public data
• SEcurE access to GEOspatial services
o http://edina.ac.uk/projects/seesaw/seegeo/index.html
o EDINA, MIMAS, NeSC, NCeSS
o UK JISC project
• Geographical information systems
• Virtual integration of and access control to
o Census data – geo-data access service
o Borders data – web feature service
o Data hosted by other organisations and exposed as services
SEE-GEO – geo-linking service portal
[Diagram: geo-linking service (GLS) portal backed by an OGSA-DAI workflow]
Components shown: GLS Portal, Maps, MIMAS Census, UK BORDERS, Image Creation Service
OGSA-DAI activities shown: Get, Get, Join, Transform, Deliver
1: GLSQuery submitted via portal, e.g. “Leeds population distribution by census output area”
2: Workflow is populated with query parameters and run
3: Image is placed on a map server
4: URL of image is returned to portal, which avoids costly SOAP/HTTP transfer of image
5: Portal gets image using URL
Why OGSA-DAI?
Workflows
• A workflow can represent a complex data
management scenario, involving:
o Data access
o Transformation
o Filtering
o Updating
o Numerous distributed, heterogeneous databases
Workflows and performance
• OGSA-DAI is one more layer between clients
and data
• Therefore, OGSA-DAI is not as fast as a direct
connection to a database
o OGSA-DAI uses JDBC so will never be as fast as a direct JDBC connection
• But this is not what OGSA-DAI is designed to
do
Workflows and performance
• Having a server execute workflows yields
o Thinner clients with less memory and CPU requirements
o Minimised client-server communication overheads
• Activities process data on the server
o Minimises data movement
o As opposed to BPEL or Taverna or web service-based workflow engines which pass data to and fro via web services
• Data streaming
o Activities work on different parts of the data stream in parallel
o Reduces memory footprint on server
o Reduces execution time
Why another layer can be good
• Data providers retain control of their data
• A place to hide database heterogeneities
o Yields thinner clients
• A place to enforce additional security
o Hide the actual location of the data
o Filter the data according to the rights of clients
o Manage access to federations, databases, tables, documents, files, rows, lines
• A place to define views on read-only databases
Developing applications
• OGSA-DAI is highly extensible
o Data resources, activities, security, presentation layers
• An enabling framework
o Save development time
o Focus on application-specific features
o Get standard functionalities out-of-the-box
• Queries, updates, transformations, deliveries
Portability
• OGSA-DAI is 100% Java
o Runs under Windows, UNIX, Linux
• OGSA-DAI uses web services
o Clients can be written in any language and on any platform that supports web services
Second and third hands-on sessions
Go to:
http://homepages.nesc.ac.uk/~elias/issgc09/html/practical.html#ScenarioTwoDataIntegration
Further information
• WWW site: http://www.ogsadai.org.uk
• Info: [email protected]
• Users e-mail list: [email protected]