The Challenge of Data Integration
Data + Grid = Discovery?
Prof. Malcolm Atkinson
Director
www.nesc.ac.uk
22nd January 2003
1
Overview
Essentials of e-Science
Collaboration
Resource Sharing
Data Sharing
Mutual Dependence
Essentials of the Grid
Distributed Virtual Machine?
Essentials of Data Sharing
Database Research did it?
New Challenges
Data Access & Integration Building Bricks
Band Wagon v Research Opportunity
Thresholds, Visions and Questions
2
3
UK e-Science
e-Science and the Grid
‘e-Science is about global collaboration in key
areas of science, and the next generation of
infrastructure that will enable it.’
‘e-Science will change the dynamic of the
way science is undertaken.’
John Taylor
Director General of Research Councils
Office of Science and Technology
5
From presentation by Tony Hey
UK e-Science Investment
National e-Science Centre
HPC(x)
Edinburgh
Glasgow
Newcastle
Belfast
Projects
> 60 started
> 30 proposed
+
EU Projects
Daresbury Lab
Manchester
Cambridge
Hinxton
Oxford
Cardiff
RAL
London
Southampton
6
UK e-Science Programme (2)
2003 - 2005
DG Research Councils
E-Science
Steering Committee
Director’s
Awareness and Co-ordination Role
Grid TAG
Director
Director’s
Management Role
Generic Challenges
EPSRC (£15m), DTI (£15m)
Academic Application Support
Programme
Research Councils (£74m), DTI (£5m)
PPARC (£26m)
BBSRC (£8m)
MRC (£8m)
NERC (£7m)
£80m Collaborative projects
ESRC (£3m)
EPSRC (£17m)
CLRC (£5m)
Industrial Collaboration (£40m)
7
8
Collaboration Growing
Hard Problems, Multi-disciplinary, Expense
Sharing
- Ideas
- Thought processes and Stimuli
- Effort
- Resources
Requires
- Communication
- Common understanding & Framework
- Mechanisms for sharing fairly
- Organisation and Infrastructure
Scientists have done this for Centuries
9
Interdependence
Science has relied on experiment and theory
Simulation, Data Mining, Analysis
Theory: Greece, 400 BC
Experiment: Italy, 1,500 AD
Simulation: Europe, 1,980 AD
For problems which are:
- too large/small
- too fast/slow
- too complex
- too expensive, unethical, ...
- Testing Understanding
12
Interdependence
Models
Theory
Data
Computing
Data
Experiment
13
Database Growth
PDB protein structures
14
15
Globus Toolkit® History
[Chart: Downloads per Month from ftp.globus.org, 1997 to 2002; vertical axis 0 to 30,000]
Does not include downloads from: NMI, UK eScience, EU Datagrid, IBM, Platform, etc.
Annotated events:
- DARPA, NSF, and DOE begin funding Grid work
- NASA begins funding Grid work, DOE adds support
- The Grid: Blueprint for a New Computing Infrastructure published
- GT 1.0.0 Released
- Early Application Successes Reported
- NSF & European Commission Initiate Many New Grid Projects
- Anatomy of the Grid Paper Released
- Significant Commercial Interest in Grids
- GT 2.0 Released
- Physiology of the Grid Paper Released
16
Encompassing Vision
software
computers
sensor nets
instruments
colleagues
data archives
17
People & Industry
Global Grid Forum
[Chart: Global Grid Forum attendance, GGF1 to GGF7; vertical axis 0 to 900; date labels Jul 01, Oct 01, Feb 02, Jul 02, Oct 02, Mar 03; values shown 260, 220, 400, 900, 450, >1000]
450
Targets



Sep 02
Jan 03
Financial, Life Sciences
Automotive & Aerospace
Governments
Partners

GlobusWorld
1
“IBM DRIVES GRID COMPUTING
FOR COMMERCIAL BUSINESS WITH
TEN NEW GRID OFFERINGS”

UK All Hands
AHM’02 350
IBM This week

Platform, DataSynapse
Avaki, Entropia
United Devices
IBM last 20 months
Leaders of OGSI
Development teams
Grid Jamboree
GGF
18
19
High-Altitude Views
A Rallying Cry
Meeting a Hard Challenge requires Many Minds
Operating & Maintaining Infrastructure requires Many
Hands & Many Companies
Another Stab at Distributed Computing
Hard Challenge: Intellectually and Practically Important
Dependable Ubiquity over Heterogeneity & Fallibility
An Ambitious Virtual Machine
Consistent large scale computational environments
A Global Operating System
Collective Resources, Common Management
20
An Architectural View
Application Users
Application
Application
Common Application Platform for Group of Applications
& Platform
Developers
Monitoring
Diagnosis
Logging
Scheduling
Accounting
Authorisation
Grid Plumbing & Security Infrastructure
Data & Compute Resources
Providers
Distributed
Operations
Teams
21
Open Grid Services Infrastructure
Confluence of Web Services & Grid
Consistent Interface Description
Based on WSDL 1.2 proposal
Extend Properties
Separate Binding from Interface
Function Composition & Inheritance
Exploit WS* Investment
Grid Features
Security
Life-Time Management
Service (state) Information via Data Elements
Discovery
Grouping
Notification
OGSI Version 1 Proposal at GGF7 (March 03)
22
Open Grid Services Architecture
Ubiquitous Building Blocks
Using OGSI Platform
Open & Extensible
Encourage Refactoring Experiments
Initially
The Globus 2 model

Except State Information now distributed
Example New Features
Global Name Mapping Service
Replication and Caching Service
Data Access & Integration
Metering, Logging, Authorisation, Charging, …
23
Grid Challenge
Balancing “Direct” Access to the “Platforms” with Abstraction & Virtualisation
Developers often have exploitable application knowledge
Automation necessary & helpful
Interface matching, operation validation, …
Optimisation at many scales
There isn’t enough effort to develop Languages & Abstractions
24
25
Data Integration
Scientist with Idea
1) Find Data
2) Extract Data
3) Transform Data
4) Combine Data
5) Interpret Data
Data Resource 1
Data Resource 2
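The five steps on this slide can be read as a small pipeline. The following is only a minimal sketch under stated assumptions: the class DataResource, the function names, the field names and the sample rows are all hypothetical and invented for illustration, and the resources are in-memory lists rather than remote databases. Step 5, interpretation, is left to the scientist.

```python
from dataclasses import dataclass


@dataclass
class DataResource:
    """Hypothetical in-memory stand-in for a remote data resource."""
    name: str
    topic: str
    rows: list  # list of dicts, one per record


def find_data(resources, topic):
    """1) Find Data: select the resources that cover a topic."""
    return [r for r in resources if r.topic == topic]


def extract_data(resource, wanted_keys):
    """2) Extract Data: keep only the fields of interest."""
    return [{k: row[k] for k in wanted_keys if k in row} for row in resource.rows]


def transform_data(rows, rename):
    """3) Transform Data: map source-specific field names to a common schema."""
    return [{rename.get(k, k): v for k, v in row.items()} for row in rows]


def combine_data(*row_sets, key):
    """4) Combine Data: merge rows from several sources on a shared key."""
    merged = {}
    for rows in row_sets:
        for row in rows:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


# Made-up sample data: two resources describe the same genes with different field names.
resources = [
    DataResource("Data Resource 1", "genes",
                 [{"gene": "g1", "expr": 2.5}, {"gene": "g2", "expr": 0.7}]),
    DataResource("Data Resource 2", "genes", [{"id": "g1", "pathway": "p53"}]),
    DataResource("Unrelated", "weather", [{"t": 3}]),
]

found = find_data(resources, "genes")
rows1 = transform_data(extract_data(found[0], ["gene", "expr"]), {})
rows2 = transform_data(extract_data(found[1], ["id", "pathway"]), {"id": "gene"})
combined = combine_data(rows1, rows2, key="gene")
```

The rename mapping in step 3 is where most of the real difficulty hides: it assumes someone already knows that "id" in one source means the same thing as "gene" in the other.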
26
Wellcome Trust: Cardiovascular
Functional Genomics
Glasgow
Shared data
Edinburgh
Public curated
data
Leicester
Oxford
London
Netherlands
27
OGSA-DAI Partners
Partners: EPCC & NeSC, IBM UK, IBM Hursley, IBM USA, Manchester e-SC, Newcastle e-SC, Oracle
£3 million, 18 months, started February 2002
[Map labels: Glasgow, Newcastle, Belfast, Daresbury Lab, Manchester, Oxford, Cambridge, Hinxton, RAL, Cardiff, London, Southampton]
28
DAI Key Services
GridDataService (GDS): Access to data & DB operations
GridDataServiceFactory (GDSF): Makes GDS & GDSF
GridDataServiceRegistry (GDSR): Discovery of GDS(F) & Data
GridDataTranslationService (GDTS): Translates or Transforms Data
GridDataTransportDepot (GDTD): Data transport with persistence
Integrated Structured Data Transport
Relational & XML models supported
Role-based Authorisation
Binary structured files (later)
30
DAI Architecture
Data Intensive X Scientists
Data Intensive Applications for Science X
Simulation, Analysis & Integration Technology for Science X
Generic Virtual Data Access and Integration Technology
Monitoring
Diagnosis
Scheduling
Accounting
GridFTP
Naming
Authorisation
Caching
Data Integration Services
Data Access Services
Grid Infrastructure
Compute, Data & Storage Resources
Structured Data
Distributed
Data Integration Architecture
31
Components: Client, Registry, Factory, Grid Data Service, XML / Relational database
Interaction types: SOAP/HTTP, service creation, API interactions
1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML
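The registry, factory and service interaction in steps 1a to 3c can be sketched in a few lines. This is an illustration only, not the OGSA-DAI API: the method names (register, find_factory, create_service, query) are hypothetical, everything runs in one process instead of over SOAP/HTTP, and an in-memory SQLite database with made-up contents stands in for the XML / Relational database.

```python
import sqlite3
from xml.etree import ElementTree as ET


class GridDataService:
    """Manages access to one database on behalf of a client."""

    def __init__(self, connection):
        self._conn = connection

    def query(self, sql, params=()):
        cursor = self._conn.execute(sql, params)       # 3b. GDS interacts with database
        columns = [d[0] for d in cursor.description]
        root = ET.Element("results")
        for row in cursor:
            row_el = ET.SubElement(root, "row")
            for col, value in zip(columns, row):
                ET.SubElement(row_el, col).text = str(value)
        return ET.tostring(root, encoding="unicode")   # 3c. results returned as XML


class GridDataServiceFactory:
    """Creates a GridDataService on request."""

    def __init__(self, connect):
        self._connect = connect                        # callable that opens the database

    def create_service(self):
        return GridDataService(self._connect())        # 2b. create GDS, 2c. return handle


class GridDataServiceRegistry:
    """Maps a topic to the factory for a data source about that topic."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find_factory(self, topic):
        return self._factories[topic]                  # 1b. respond with factory handle


def make_demo_db():
    # Made-up table and row, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE proteins (name TEXT, length INTEGER)")
    conn.execute("INSERT INTO proteins VALUES ('demo', 42)")
    return conn


registry = GridDataServiceRegistry()
registry.register("x", GridDataServiceFactory(make_demo_db))

factory = registry.find_factory("x")                          # 1a, 1b
gds = factory.create_service()                                # 2a, 2b, 2c
xml_result = gds.query("SELECT name, length FROM proteins")   # 3a, 3b, 3c
```

The client holds only handles: it never opens the database itself, so access control and query translation can live inside the service.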
32
Components: Client, Consumer, Registry, Factory, network of GDS, XML / Relational databases
Interaction types: SOAP/HTTP, service creation, API interactions
1a. Request to Registry for sources of data about “x” & “y”
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration to databases
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits set of queries to GDS with XPath, SQL, etc
3b. Tell consumer
3c. Results of queries returned to consumer as XML or binary
33
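What distinguishes this scenario from the single-database case is that a network of services is created and the results go to a consumer that need not be the client. A minimal sketch of that delivery pattern follows, under loud assumptions: Consumer, create_gds_network and submit are hypothetical names, the two in-memory SQLite databases and their contents are made up, and results are passed as Python tuples where the slide specifies XML or binary.

```python
import sqlite3


class GridDataService:
    """Runs queries against one database."""

    def __init__(self, conn):
        self._conn = conn

    def run(self, sql):
        return self._conn.execute(sql).fetchall()


class Consumer:
    """Receives results; distinct from the client that submitted the queries."""

    def __init__(self):
        self.received = {}

    def deliver(self, source, rows):
        self.received[source] = rows


def create_gds_network(databases):
    """2b. Hypothetical factory step: one GDS per database."""
    return {name: GridDataService(conn) for name, conn in databases.items()}


def submit(network, queries, consumer):
    """3a to 3c: run each query on its GDS and send the results to the consumer."""
    for source, sql in queries.items():
        consumer.deliver(source, network[source].run(sql))


def demo_db(table, rows):
    # Made-up table contents, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} (k TEXT, v INTEGER)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return conn


network = create_gds_network({
    "x": demo_db("x", [("a", 1)]),
    "y": demo_db("y", [("a", 2)]),
})
consumer = Consumer()
submit(network, {"x": "SELECT k, v FROM x", "y": "SELECT k, v FROM y"}, consumer)
```

Sending results straight to the consumer avoids routing bulk data through the client, which matters when the client is a laptop and the consumer is an analysis service near the data.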
Biomedical (or ANY) Data
Opportunities
Global Production of
Published Data
Volume Diversity
Combination → Analysis → Discovery
Opportunities
Specialised Indexing
Structurally varied
replication
Consistent Structured
Universe of Discourse
Data & Computation
Integration
Challenges
Data Huggers
Meagre metadata
Ease of Use
Automated, optimised
integration
Traceability, Dependability
Challenges
Approximate Matching
Multi-scale optimisation
Bad habits / industrial
structures
Safety and Multi-scale
optimisation
34
Data Integration Challenges
High-Level Languages
Describing the Data Extraction Recipes
Describing the Sources & Components

Metadata that drives automation & validation
Mobility
Code & Data
Integrating Existing DB technology
Moving the DBMS to the Grid context
New Optimisation Challenges
Data & Computation & Storage & Movement
Shared Distributed Annotation Systems
How to Reference
Provenance & Acknowledgement
35
36
Challenges
A Programming & Development Model
Dependability at this Scale
Foundations for Trust
Raising the Level of Automation
Supporting New Forms of
Collaboration
Data
37
38