UK e-Science Future Infrastructure for Scientific Data Mining
UK e-Science
Future Infrastructure for Scientific Data
Mining, Integration and Visualisation
Malcolm Atkinson
Director of National e-Science Centre
www.nesc.ac.uk
25th October 2002
SDMIV workshop, e-Science Institute
Edinburgh
Overview
UK e-Science
Reminder of Investment and Infrastructure
International e-Science
Examples and Collaboration
Data Access and Integration
Lego Bricks for Scientific Application
Developers
Tailored: Application and Computing Scientists
A Computer Scientist’s Christmas List
Diversity and Opportunity
The Way Ahead
e-Science
Fundamentally about Collaboration
Sharing
Ideas
Thought processes and Stimuli
Effort
Resources
Requires
Communication
Common understanding & Framework
Mechanisms for sharing fairly
Organisation and Infrastructure
Scientists (Biologists) have done this for Centuries
e-Science (take 2)
Fundamentally about Collaboration
Sharing
Ideas
Thought processes and Stimuli
Effort
Resources
Requires
Communication
Common understanding & Framework
Mechanisms for sharing fairly
Organisation and Infrastructure
Text, digital media, structured, organised & curated data, computable models, visualisation, shared instruments, shared systems, shared administration, …
Nationally & Internationally Distributed, …
Routine, Daily, Automated, …
That Requires very Significant Investment in Digital Systems and their Support
e-Science (take 3)
Fundamentally about Collaboration
Sharing
Ideas
Thought processes and Stimuli
Effort
Resources
Requires
Communication
Common understanding & Framework
Mechanisms for sharing fairly
Organisation and Infrastructure
Digital networks, digital workplaces, digital instruments, …
Metadata, ontologies, standards, shared curated data, shared codes, …
Common platforms, shared software, shared training, …
Citation, Authentication, Authorisation, Accounting, Provenance, Policies, …
Shared Provision of Platform,
The Grid SHOULD make this much easier by providing a common, supported high-level of Software and Organisational infrastructure
Grid Expectations
Persistence
Always there, Always Working, Always Supported
Stability
You can build on foundations that don’t move
Trustworthy & Predictable
Honours commitments
Digital policies, digital contracts, security, …
Data integrity, longevity and accessibility
Performance
High-level & Extensible
The capabilities you need are already there
Ubiquitous
Your collaborators use it
Grid Reality
Persistence
Political, Economic & Technical issues to solve
Always there, Always Working, Always Supported
Stability
Early days, but Open Grid Services link with Web Services + GGF standardisation
You can build on foundations that don’t move
Trustworthy & Predictable
Not yet, but very substantial global effort to achieve this
Honours commitments
Digital policies, digital contracts, security, …
Data integrity, longevity and accessibility
Performance
High-level & Extensible
Good basis for extension
Commitment to basic functionality
WS + Community effort
The capabilities you need are already there
Ubiquitous
Global & Industrial Rallying Cry
Must work with Web Services
Your collaborators use it
UK Grid Network
National e-Science Centre
HPC(x)
Access Grid always-on video walls
Sites: Edinburgh, Glasgow, Newcastle, Belfast, Daresbury Lab, Manchester, Cambridge, Hinxton, Oxford, Cardiff, RAL, London, Southampton
SuperJanet4, June 2002
Link speeds: 20Gbps, 10Gbps, 2.5Gbps, 622Mbps, 155Mbps
WorldCom nodes: Glasgow, Edinburgh, Manchester, Leeds, Reading, London, Bristol, Portsmouth
Regional networks: Scotland via Glasgow, Scotland via Edinburgh, NNW, NorMAN, YHMAN, Northern Ireland, MidMAN, EMMAN, EastNet, TVN, South Wales MAN, LMN, SWAN & BWEMAN, LeNSE, Kentish MAN
External Links
Tony Hey July 2001
National e-Science Centre
Events
Workshops
Research Meetings
International Meetings
History of Events
GGF5
HPDC11
Summer school
> 50 workshops held
> 1000 people in total
Many return often
Planned Events
25 workshops
Conferences to 2005
Visitors
3 arrived
4 arranged
International collaboration,
visits & visitors
China
Argonne National Lab
SDSC
NCSA
…
Centre Projects
Pilot Projects
Regional Support
Research Projects
EPSRC, MRC, WT, SHEFC
UCSF
UIUC
From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign
DataGrid Testbed
Testbed Sites (>40): HEP sites and ESA sites
Sites shown: Dubna, Lund, Moscow, RAL, Estec, KNMI, Berlin, IPSL, Paris, Santander, Lisboa, CERN, Prague, Brno, Lyon, Grenoble, Milano, PD-LNL, Torino, Madrid, Marseille, Pisa, BO-CNAF, Barcelona, ESRIN, Roma, Valencia, Catania
A Simplified Grid Anatomy
Scientific Users
Scientific Application
Monitoring
Diagnosis
Logging
Scheduling
Accounting
Authorisation
Application Developers
Grid Plumbing & Security Infrastructure
Operations Team
Owners
Data & Compute Resources
Distributed
The Crux
Scientific Users
Scientific Application
Monitoring
Diagnosis
Logging
Application Developers
Authorisation
Scheduling
Accounting
Keep all the (pink) groups HAPPY
Grid Plumbing & Security Infrastructure
Operations Team
Owners
Data & Compute Resources
Distributed
A SDMIV Grid Anatomy
SDMIV Users
Scientific Application
Monitoring
Diagnosis
Scheduling
Accounting
Logging
Data Integration
Authorisation
Data Access
Grid Plumbing & Security Infrastructure
Data & Compute Resources
Distributed
Structured Data
Data Providers
Data Curators
Database Growth
PDB protein structures
Data Mining: Science vs Commerce
Science
Data in files
FTP a local copy / subset.
ASCII or Binary.
Each scientist builds own analysis toolkit
Analysis is tcl script of toolkit on local data.
Some simple visualization tools: x vs y
Commerce
Data in a database
Standard reports for standard things.
Report writers for non-standard things
GUI tools to explore data.
Decision trees
Clustering
Anomaly finders
Jim Gray UCSC April 2002
But…some science is hitting a wall
FTP and GREP are not adequate
You can GREP 1 MB in a second
You can GREP 1 GB in a minute
You can GREP 1 TB in 2 days
You can GREP 1 PB in 3 years.
Oh!, and 1PB ~10,000 disks
You can FTP 1 MB in 1 sec
You can FTP 1 GB / min (= 1 $/GB)
… 2 days and 1K$
… 3 years and 1M$
50,000 Kg
250 KW
60 Racks = 120 m²
At some point you need
indices to limit search
parallel data search and analysis
This is where databases can help
Jim Gray UCSC April 2002
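The scan times on this slide follow from dividing data volume by sustained throughput. The sketch below reproduces that arithmetic and shows the effect of searching in parallel. The 10 MB/s rate and the 10,000-way split are assumptions chosen for illustration, not figures taken from the slide, and the function names are hypothetical.

```python
# Rough scan-time estimate: time = bytes / throughput.
# ASSUMED_RATE is an illustrative assumption, not a number from the slide.

SECONDS_PER_DAY = 86_400


def scan_seconds(size_bytes: float, bytes_per_second: float) -> float:
    """Time to read size_bytes sequentially at a sustained rate."""
    return size_bytes / bytes_per_second


def parallel_scan_seconds(size_bytes: float, bytes_per_second: float,
                          workers: int) -> float:
    """Idealised time when the data is split evenly over independent workers."""
    return scan_seconds(size_bytes, bytes_per_second) / workers


ASSUMED_RATE = 10e6  # 10 MB/s per reader (hypothetical)
SIZES = {"1 MB": 1e6, "1 GB": 1e9, "1 TB": 1e12, "1 PB": 1e15}

for label, size in SIZES.items():
    serial_days = scan_seconds(size, ASSUMED_RATE) / SECONDS_PER_DAY
    spread_seconds = parallel_scan_seconds(size, ASSUMED_RATE, workers=10_000)
    print(f"{label}: serial {serial_days:.4g} days, "
          f"10,000-way parallel {spread_seconds:.4g} s")
```

At the assumed rate a serial pass over 1 PB takes a little over three years, the same order of magnitude as the slide quotes, while an ideal 10,000-way split brings it down to hours. An index avoids most of the scan altogether, which is the case the slide makes for databases.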
OGSA & OGSI
Web Services
Grid Technology
Grid Services
www.gridforum.org/ogsi-wg
www.gridforum.org/ogsa-wg
www.gridforum.org/
Web Services
Rapid Integration
Dynamic binding
Commercial Power
Financial & Political
Independence
Client from Service
Service from Client
Separation
Function from Delivery
Description
WSDL, WSC, WSEF, …
Tools & Platforms
Java ONE, Visual .NET
WebSphere, Oracle, …
www.w3c.org/TR/SOAP or TR/wsdl
Grid Technology
Virtual Organisations
Sharing & Collaboration
Security
Single Sign in, delegation
Distribution & fast FTP
But Various Protocols
Resource Management
Discovery
Process Creation
Scheduling
Monitoring
Portability
Ubiquitous APIs & Modules
Government Agency Buy-in
Industrial Buy-in
Foster, I., Kesselman, C. and Tuecke, S., The Anatomy of the Grid: Enabling Scalable Virtual Organizations, Intl. J. Supercomputer Applications, 15(3), 2001
http://www.gridforum.org/ogsi-wg
Open Grid Services Architecture
Applications
Using operations
Virtual Grid Services
Implemented by
Multiple implementations of
Grid Services
OGS infrastructure
Foster, I., Kesselman, C., Nick, J. and Tuecke, S., The Physiology of the Grid:
An Open Grid Services Architecture for Distributed Systems Integration
Scientific Data
Deluge of Data
Exponential growth
Doubling times
Astronomy: 12 months
Bio-Sequences: 9 months
Functional Genomics: 6 months
Bytes/dollar: 12 to 18 months
Not How big it is but
What you do with it
Sharing
Curation
Metadata
Automated movement, access & integration
Computational Access
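The doubling times quoted above can be turned into growth factors with 2^(t/T), where T is the doubling time. The sketch below compares data growth with the improvement in bytes per dollar. It assumes the figures pair with the disciplines in the order listed on the slide, and the 36-month horizon is a hypothetical choice.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """How many times larger a quantity is after elapsed_months."""
    return 2.0 ** (elapsed_months / doubling_months)


# Assumed pairing of labels and figures, taken in slide order.
DATA_DOUBLING_MONTHS = {
    "Astronomy": 12,
    "Bio-Sequences": 9,
    "Functional Genomics": 6,
}
STORAGE_DOUBLING_MONTHS = (12, 18)  # bytes/dollar: fastest and slowest case

HORIZON_MONTHS = 36  # hypothetical planning horizon

best_storage = growth_factor(HORIZON_MONTHS, STORAGE_DOUBLING_MONTHS[0])
worst_storage = growth_factor(HORIZON_MONTHS, STORAGE_DOUBLING_MONTHS[1])

for discipline, months in DATA_DOUBLING_MONTHS.items():
    data = growth_factor(HORIZON_MONTHS, months)
    # Relative cost of keeping everything = data growth / price improvement.
    print(f"{discipline}: data x{data:.0f}, storage cost "
          f"x{data / best_storage:.1f} to x{data / worst_storage:.1f}")
```

Under these assumptions, Functional Genomics data grows 64-fold in three years while bytes per dollar improves only 4 to 8 times, so the cost of simply keeping the data rises even before any curation or integration work.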
Not How big it is but
How you Embrace & Manage Change
The Database is a Knowledge chest
The Database is a Communication Hub
Autonomously Managed (Curated) change
An Essential part of e-BioMedical,
Astronomical, …, Science & Engineering
Wellcome Trust: Cardiovascular Functional Genomics
Sites: Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands
Shared data
Public curated data
Data Access & Integration
Central to e-Science
Astronomy, Earth Sciences, Ecology,
Biology, Medicine, …
Collaboration
Shared Databases
Curated Knowledge
Accumulated Observations
Accumulated Simulations
Computation
Data mining
Input to models
Calibration of models
Presentation
Publication of results
Visualisation
GGF DAIS WG
Chairs
Norman Paton (Manchester Uni.)
Leanne Guy (CERN)
Dave Pearson (Oracle UK)
Activity
BoF GGF4 Toronto
WG Meeting GGF5 Edinburgh
Papers for GGF6
Workshops & Mail lists
Norman Paton, Inderpal Narang, Leanne Guy, Susan Malaika, Greg Riccardi, …
Goals
Agree Standards for Database Access & Integration
Freely available reference implementations
OGSA-DAI one source & focus for discussions
http://www.cs.man.ac.uk/grid-db/
OGSA-DAI project
Lego kit for Data Access & Integration
Components for e-Science Applications
Accelerated Application Development
Multiple Data Models
Distributed Data
Access via Grid & Proxies
Integration, Translation & Transformation
Open Source Reference Implementation
For DAIS-WG standard
Trigger for Component Construction
Start a community
OGSA-DAI Partners
EPCC & NeSC
IBM UK
IBM Hursley
IBM USA
Oracle
Manchester e-SC
Newcastle e-SC
Map sites: Glasgow, Newcastle, Belfast, Daresbury Lab, Manchester, Oxford, Cambridge, Hinxton, RAL, Cardiff, London, Southampton
£3 million, 18 months, started February 2002
Primary Components
GDSF
Client
GDS
DB
Consumer
GDSR
Advanced Components
Translation
Client
GDS:PerformScript
GDS
DB
Translation
GDT
Consumer
Composed Components
GDS:performScript
Translation
GDS:performScript
GDS
Client
GDS:performScript
GDT
Translation
GDS:performScript
GDT
GDT
Consumer
Composing Components
Data Transport
OGSA-DAI
Component
Data Transport
OGSA-DAI
Component
Data Transport
OGSA-DAI
Component
Data Transport
DAI Key Components
GridDataService (GDS): Access to data & DB operations
GridDataServiceFactory (GDSF): Makes GDS & GDSF
GridDataServiceRegistry (GDSR): Discovery of GDS(F) & Data
GridDataTranslationService (GDTS): Translates or Transforms Data
GridDataTransportDepot (GDTD): Data transport with persistence
Relational & XML models supported
Role-based Authorisation
Binary structured files
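The division of labour between registry, factory and data service can be illustrated with a small sketch. The Python classes below are hypothetical stand-ins named after the components on the slide. They are not the OGSA-DAI interfaces, and an in-memory SQLite database stands in for the real data resource.

```python
import sqlite3


class GridDataService:
    """Stand-in for a GDS: access to one database and its operations."""

    def __init__(self, connection):
        self._connection = connection

    def perform(self, sql, params=()):
        # Run one statement and return any rows it produces.
        return self._connection.execute(sql, params).fetchall()


class GridDataServiceFactory:
    """Stand-in for a GDSF: makes GDS instances for one data resource."""

    def __init__(self, database_path):
        self._database_path = database_path

    def create_service(self):
        return GridDataService(sqlite3.connect(self._database_path))


class GridDataServiceRegistry:
    """Stand-in for a GDSR: discovery of factories by resource name."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def discover(self, name):
        return self._factories[name]


# Client view: discover a factory, ask it for a service, then use the service.
registry = GridDataServiceRegistry()
registry.register("demo-catalogue", GridDataServiceFactory(":memory:"))

service = registry.discover("demo-catalogue").create_service()
service.perform("CREATE TABLE star (name TEXT, magnitude REAL)")
service.perform("INSERT INTO star VALUES (?, ?)", ("Vega", 0.03))
rows = service.perform("SELECT name FROM star WHERE magnitude < ?", (1.0,))
print(rows)
```

The client never opens the database itself. It only needs a resource name, which is what lets the provider change where and how the data is held.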
OGSA Relationship
Class
GridService
GDS
Registry
NotificationConsumer
NotificationProducer
Mandatory
Optional
Normal
GDSF
Mandatory
Optional
Normal
GDSR
Mandatory
GDTS
Mandatory
GDTD
Mandatory
Mandatory
Normal
Optional
Normal
DAI portType Usage
Class
GridDataService
DataTransport
GDS
Mandatory
Normal
GDSF
Optional
Normal
GDSR
Optional
GDTS
Optional
Mandatory
GDTD
Optional
Mandatory
Factory
Mandatory
Distributed Query
Diagram components: Client, Registry, Factory, GDS, DB, DQP, Evaluator, QPM, Consumer, NS, linked by GDT, GDTV, T, Q and PNM; interactions numbered 1 to 8
DQP : Distributed Query Processor
GDT : Grid Data Transport
T : Translation
Q : Query
GDTV : Grid Data Transport Vehicle
F : Factory
QPM : Query Progress Monitor
PNM : Progress Notification Message
AM : Application Metadata
CRM : Computational Resource Metadata
NS : Notification Sink
OGSA-DAI Time Line
WS + GSI UK support ( > 100 downloads)
XML + OGSA Prototypes for Early Adopters
Design Documents & Demos for DAIS WG @ GGF5
XML + OGSA Prototype Available
RDB + GT2 / OGSA Prototypes Available
GGF6 WG Papers & Prototypes
Ship Alpha Release for GT3 Integration
Presentation & Beta @ GGF7
Productisation, RAMPS & Extension
Feb ’02
May ’02
Phase 1 Starts
Jul ’02
Sep ’02
Dec ’02
Phase 2 Starts
Feb ’03
May ’03
Sep ’03
OGSA-DAI Summary
On Schedule & Going Well
Contributions via DAIS-WG @ GGF5 & 6
Releases with GT3 Releases scheduled
Status: Early Days
Released prototypes
Tested Architectural Design
Using OGSA
Working with Early Adopter Pilot Projects
AstroGrid & MyGrid
First PRODUCT release Dec ’02
Influence OGSA-DAI direction
Via DAIS-WG & Direct messages to us
Data Processing
Diagram: Instrument, In Silico, Raw Data, Reference Data, Multi-stage Processing, Processed Data, Archive
Processing Characteristics
- Well defined work flow
- Correction, calibration, transformation, filtering, merging
- Relatively static reference data
- Stable processing functions (audited changes)
- Periodic reprocessing from archive
Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
Analysis and Interpretation
Diagram: Archive, Processed Data, Summarisation, Summarised Data
Analysis Characteristics
- Variable workflow
- Standard functions
- Standard and personal filtering and summarisation
- Retain drill down capability
Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
Analysis and Interpretation
Diagram: Processed Data, Summarised Data, Result data, Retrieval & Update, Personalised Database
Conclusions/Inferences
- Descriptions
- Trends
- Correlations
- Relationships
Analysis and Interpretation Characteristics
- Highly dynamic work flow
- Multiple data types
- Volatile data
- Annotations, inferences, conclusions
- Evidential reasoning
- Shared multiple versions of truth
- Periodic version consolidation
Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
Metadata Requirements
Technical Metadata
Direct referencing - Physical location and data schema/structure
Data currency/status – version, time stamping
Accreditation/Access permissions - Ownership (Dublin Core)
Query time/Governance - data volume, no. of records, access paths
Contextual Metadata
Logical referencing physical data – semantic/syntactic ontologies
Lexical translation – Thesaurus, ontological mapping
Named derivations (summarisations)
Scope of Requirements
All science communities
Related to provenance
Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
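One way to picture the split between technical and contextual metadata is as two records attached to each dataset. The sketch below is a minimal illustration under that assumption. All field names, the `archive://` identifier and the sample values are hypothetical and come from no standard or from the workshop material.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class TechnicalMetadata:
    location: str        # direct referencing: physical location
    schema: str          # data schema/structure
    version: int         # data currency/status
    timestamp: datetime  # time stamping
    owner: str           # accreditation/ownership
    record_count: int    # input to query time estimates


@dataclass
class ContextualMetadata:
    ontology_terms: List[str] = field(default_factory=list)
    thesaurus: Dict[str, str] = field(default_factory=dict)
    derivation_name: Optional[str] = None  # named derivation (summarisation)

    def translate(self, term: str) -> str:
        """Lexical translation: map a local term to its preferred form."""
        return self.thesaurus.get(term.lower(), term)


@dataclass
class DatasetDescription:
    technical: TechnicalMetadata
    contextual: ContextualMetadata


description = DatasetDescription(
    technical=TechnicalMetadata(
        location="archive://example/survey-2002",  # hypothetical identifier
        schema="star(name TEXT, magnitude REAL)",
        version=3,
        timestamp=datetime(2002, 10, 25, tzinfo=timezone.utc),
        owner="Example Observatory",
        record_count=120_000,
    ),
    contextual=ContextualMetadata(
        ontology_terms=["star", "apparent magnitude"],
        thesaurus={"brightness": "apparent magnitude"},
        derivation_name="monthly-summary",
    ),
)
print(description.contextual.translate("Brightness"))
```

Keeping the two records separate mirrors the slide: the technical part can be generated by the data provider's systems, while the contextual part depends on community agreement about terms.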
Metadata Requirements
Data Versioning
Distinguish latest/agreed version of data
Maintain history record of change
Synchronise and mirror replicated data
Distinguish shared personal interpretations and/or annotations
Provenance
Record of data processing – calibration, filtering, transformation
Record of workflow – methods, standards and protocols
Reasoning – evidential justification for inferences & conclusions
Scope of Requirements
All science communities
Includes Technical and Contextual Metadata
Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
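The versioning and provenance requirements above can be sketched as an append-only history in which every derived version carries the chain of steps that produced it. This is an illustrative design only. The class names, the sample operations and the "agreed" flag are assumptions, not part of any system described in the talk.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class ProvenanceStep:
    operation: str           # e.g. calibration, filtering, transformation
    method: str              # method, standard or protocol applied
    justification: str = ""  # evidential reasoning, if any


@dataclass(frozen=True)
class DataVersion:
    number: int
    values: Tuple[float, ...]
    provenance: Tuple[ProvenanceStep, ...]
    agreed: bool = False


class VersionedDataset:
    """Append-only history of versions; earlier versions are never overwritten."""

    def __init__(self, raw_values):
        self._history: List[DataVersion] = [DataVersion(1, tuple(raw_values), ())]

    def derive(self, operation: str, method: str,
               transform: Callable[[float], float],
               justification: str = "") -> DataVersion:
        previous = self._history[-1]
        step = ProvenanceStep(operation, method, justification)
        version = DataVersion(
            number=previous.number + 1,
            values=tuple(transform(v) for v in previous.values),
            provenance=previous.provenance + (step,),
        )
        self._history.append(version)
        return version

    def agree(self, number: int) -> None:
        """Mark one version as the agreed one and clear the flag elsewhere."""
        self._history = [replace(v, agreed=(v.number == number))
                         for v in self._history]

    def agreed_version(self) -> Optional[DataVersion]:
        return next((v for v in self._history if v.agreed), None)

    def history(self) -> List[DataVersion]:
        return list(self._history)


dataset = VersionedDataset([1.0, 2.0, 3.0])
calibrated = dataset.derive("calibration", "gain x2 (hypothetical)",
                            lambda v: v * 2)
filtered = dataset.derive("filtering", "clip at 5 (hypothetical)",
                          lambda v: min(v, 5.0))
dataset.agree(calibrated.number)
print(dataset.agreed_version().values)
print([step.operation for step in filtered.provenance])
```

Because versions are immutable and the history only grows, the latest version, the agreed version and the full record of change can all be read from the same structure.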
Provenance Issues
Schema evolution
Granularity of record
Processed v Derived
Inheritance
Lack of structured annotations, ontologies
Interactive analysis = dynamic workflow
Multiple derived data sources
Context of usage
Best practice can change
Multiple versions of the truth
Evidential reasoning
Existing data & applications
Where is the provenance record stored
Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
Collaborative Annotation
See DAS
Distributed Annotation Service
Challenges
Autonomy
Selective viewing
Identification
Provenance
Derivation
Biomedical e-Scientists
Is this one species?
Understanding bird energy
Understanding a river / ocean interaction
Understanding a biochemical pathway
Understanding a cell
Understanding a Heart or Brain
Understanding Rhododendra
Understanding Evolution
…
No One-Size fits all solutions
But sharable re-usable components
Opportunities
Many, many …
More than we can address
Compute needs
Data management needs
Data integration needs
…
Must choose some pioneers
To meet a range of common requirements
To provoke rich & high-level platform
To generate re-usable components
A Long-Term Commitment Needed
Advancing SDMIV Grid
SDMIV Users
Scientific Application
SDMIV (Grid) Application Component Library
Monitoring
Diagnosis
Scheduling
Accounting
Logging
Data Integration
Authorisation
Data Access
Grid Plumbing & Security Infrastructure
Data & Compute Resources
Distributed
Structured Data
Summary
e-Science
Data as well as Compute Challenges
Needed to be put together
Need ubiquitous supported consistent platforms
Grid
A (potentially) invaluable platform
Only show in town
Data Integration
Hard: Develop & Use Standard kit of parts
Started to build the kit
No ready made general integration
Combines application & computing science
Opportunities
No one-size fits all, but re-usable subsystems
Invest in wider range of Problem driven pioneering
Strategic choices needed