Bryan Lawrence, BADC
David Boyd Deputy Director CLRC e-Science centre
Kerstin Kleese DL: Climate Database Expert
Roy Lowry BODC: Marine Database Expert
Dean Williams PCMDI: ESG Principal Investigator
Bob Drach PCMDI: ESG Metadata Architecture
Mike Fiorino PCMDI: Meteorologist
The NERC DataGrid
Acronym Summary:
PCMDI: Program for Climate Model Diagnosis and Intercomparison
(US Department of Energy, Lawrence Livermore National Laboratory)
ESG: Earth System Grid
(US Grid Project: NCAR, Argonne, PCMDI, USC …)
Outline
• Motivation
• The Earth System Grid
– definitions of “portals” and applications
– ontologies
• Relations with other NERC e-science programmes.
• Architecture
– querying
– software stack
• Initial steps and Project Management
• Connectivity with other grid projects
• Success and Failure
• Summary of what we are doing and the road to the future
The BADC – part of NCAS!
The Role: Key words: Curation and Facilitation!
http://www.badc.rl.ac.uk
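Just under half of BADC users are NOT atmospheric scientists.
[Chart: BADC users by discipline. Categories: Earth Observation, Earth Science, Engineering, Geography, Marine Sciences, Mathematics, Biological/Medical, Terrestrial/Fresh Water. Values shown: 160, 126, 42, 132, 56, 104, 132, 152; the pairing of values with categories is not recoverable from the transcript.]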
Motivation – Town meeting 2001
E-science should be involved with:
• delivering an enhanced meta-data record of archived
data.
• 'dictionary' building.
• building systems to translate data and link databases.
• integrating computer and natural science communities.
• the ability to generate a single query across multiple
datasets (in different catalogues) returning both
metadata and data.
• the ability to acquire large datasets in near real time
(NRT).
• the automatic production of metadata, both by models,
and where possible, by observing systems.
Summary from two of the four
working groups!
Relevant to many stakeholders
[Figure: stakeholder sectors: Energy, Water Management, Food Chain, Health, Weather Risk]
(Slide from Julia Slingo’s introduction to
CGAM as part of NCAS)
Motivation
Page 22:
NERC will …... ensure that Earth
system science is underpinned by
e-science investments to enable
access, manipulation … of data
from diverse sources.
The Data Use Chain
Discovery
Authentication
Authorisation
Extraction
SubSampling
Regridding
Formatting
Processing
Delivery
Display
Time-line
NERC Metadata Gateway - SST
• Geospatial coordinates
forgotten. Time reference
forgotten. Need to get entire
field(s), and find correct time!
• And if I want to compare data
from different locations?
- multiple logins
- multiple formats
- discovery?
Searching: need comprehensive metadata!
A priori would any user know to look
in the COAPEC data set?
Earth system-science means we have
to remove these boundaries!
• detailed file-level metadata isn’t
visible, and so data mining
applications are impossible.
- need ontologies to help queries
match actual data descriptions.
NB: Dynamic catalogues!
What is an Ontology?
An ontology defines the terms used to describe and represent an
area of knowledge by specifying the following kinds of
concepts:
• Classes (general things) in the many domains of interest
• The relationships that can exist among things
• The properties (or attributes) those things may have
Ontologies are usually expressed in a logic-based language, so
that detailed, accurate, consistent, sound, and meaningful
distinctions can be made among the classes, properties, and
relations.
Ontology Example:
An example of part of an ontology defined using OIL (e.g. see
“OIL in a Nutshell”, D. Fensel et al.)
Slide annotations: Relationships, Classes, Properties
ontology-definitions
  slot-def eats
    inverse is-eaten-by
  slot-def has-part
    inverse is-part-of
    properties transitive
  class-def animal
  class-def plant
    subclass-of NOT animal
  class-def tree
    subclass-of plant
  class-def branch
    slot-constraint is-part-of
      has-value tree
  class-def leaf
    slot-constraint is-part-of
      has-value branch
  class-def defined carnivore
    subclass-of animal
    slot-constraint eats
      value-type animal
  class-def defined herbivore
    subclass-of animal
    slot-constraint eats
      value-type plant OR
        (slot-constraint is-part-of has-value plant)
  class-def giraffe
    subclass-of animal
    slot-constraint eats
      value-type leaf
  class-def lion
    subclass-of animal
    slot-constraint eats
      value-type herbivore
With current funding, the NDG does not aim to build a formal
ontology, but we do aim to begin to build a thesaurus that can
form the basis of one, and we do hope to spin off a project to
build one and integrate it in the NDG.
(OIL: Ontology Inference Layer)
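To make the OIL example above more concrete, here is a minimal Python sketch of how a few of its facts could be held and queried. It is a toy illustration only, not an OIL reasoner and not NDG software: the dictionaries and function names are hypothetical, each class records a single "eats" target, and the transitive is-part-of slot is followed by simple chaining.

```python
# Toy encoding of a few facts from the OIL example (class names follow the slide).
# All data structures and function names here are illustrative assumptions.
SUBCLASS_OF = {"tree": "plant", "giraffe": "animal", "lion": "animal"}
IS_PART_OF = {"leaf": "branch", "branch": "tree"}  # slot declared transitive
EATS = {"giraffe": "leaf", "lion": "herbivore"}


def ancestors(cls):
    """Return every superclass of cls by following subclass-of links."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        found.append(cls)
    return found


def wholes(part):
    """Return everything that part is (transitively) a part of."""
    found = []
    while part in IS_PART_OF:
        part = IS_PART_OF[part]
        found.append(part)
    return found


def is_plant_or_plant_part(cls):
    """True if cls is a plant, or is part of something that is a plant."""
    candidates = [cls] + wholes(cls)
    return any(c == "plant" or "plant" in ancestors(c) for c in candidates)


def eats_only_plants(animal):
    """Check the herbivore-style constraint for the one food class recorded."""
    return is_plant_or_plant_part(EATS[animal])


if __name__ == "__main__":
    print(wholes("leaf"))
    print(eats_only_plants("giraffe"), eats_only_plants("lion"))
```

Because is-part-of is transitive, a leaf counts as part of a tree, and a tree is a plant, so the giraffe meets the herbivore constraint without that fact ever being stated directly. This is the kind of inference a thesaurus alone cannot provide.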
ESG: Example of a Web-based Data Portal
ESG will provide support for:
• large but simple data sets,
• limited metadata, but not
searchable.
NDG will provide support for
• Small-but-complex datasets.
• Data-mining (searchable
metadata).
NDG is complementary to ESG!
Live Access Server (1)
… we will keep the basic structure, but
gradually replace components.
Live Access Server (2)
Data Request Structure:
ESG: Example of a Client Application
We will:
• Provide Python-based
classes for our
observational data to
complement the access to
3D gridded data.
• Provide a web services
wrapper so that other grid
applications can access
NDG data.
Applications and Portals
[Diagram. Labelled elements: BADC NDG Wrapper, BODC NDG Wrapper, Group NDG Wrapper; Online Data, tape robot and XML database behind the wrappers; Research Group Data Sources; Satellite; Supercomputer; NERC Grid; Software Agent; Grid User; ESG (& other) Applications; NDG Web Portal; Internet User; Internet Link; Wider Internet.]
Relationship to GODIVA (Haines et al.)
(Grid for Ocean Diagnostics, Interactive Visualisation and Analysis)
Architecture of the GODIVA Grid:
NDG will:
• improve data discovery tools
for GODIVA (even for their own
datasets).
• provide metadata creation
tools for GODIVA participants.
• provide access to data held
outside GODIVA participants.
The GODIVA team have already discovered issues with the XML database
interface they are going to use.
ClimatePrediction.com
CP.COM will need the NDG to make best use of observational data in
evaluating their parameter space.
[Diagram. Labelled elements: Scientific investigators; Participants & policy-makers; Obs; Summary statistics; Datamining; Live Access Server; ESG-II/NERC DataGrid; Peer-to-peer visualisation; HTTP; HTTP (DODS URL); Conventional FTP/HTTP; GridFTP; 100Tb of key output at 10-20 sites; 1Pb total output on 1M participants’ PCs.]
Mining on the Grid
[Diagram. Labelled elements: Satellite Data in Archive X and Archive Y; Grid Mining Agents; Grid Processors.]
From Hinke’s NASA IPG presentation at CEOS, Rome, May 2002
Data mining: Grid Miner Architecture
The devil is in the detail: how does the data mining agent get at the data?
Need data mining clients – objects which can read specific datatypes and
present themselves to agents!
[Diagram. Labelled elements: Satellite Data in Archive X and Archive Y; IPG Mining Agents; IPG Processors; Mining Operations Repository; Mining Config Info; Mining Daemon; Control Database.]
From Hinke’s NASA IPG presentation at CEOS, Rome, May 2002
Finding data: Querying!
• Requires databases of metadata & querying those databases.
• Each part of the NDG will have an internal metadata catalogue
(&/or database), and data (either in flat files or the database).
– so the querying strategy must support centralised querying on partially
indexed data, followed (if necessary) by distributed querying, which may
or may not need mapping into a local database schema.
– In the grid environment the indexes themselves will be replicated, and
some data may also be replicated.
• Major NDG design issue: developing appropriate data models,
database schema and indexing strategies!
– This is not a generic problem, it will be specific to our datatypes.
– Technology needs to be public domain (i.e. free) for uptake!
– NDG approach to database technology will be developed in conjunction
with DBTF.
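
The querying strategy described above (a centralised query on partially indexed data, followed by distributed queries that may need mapping into a local schema) can be sketched roughly as follows. This is a hedged illustration only: the index contents, catalogue records, field names, file names and function names are all invented for the example and do not describe the actual NDG design.

```python
# Hypothetical central index: it only records which site holds which parameter.
CENTRAL_INDEX = {
    "sea_surface_temperature": ["BADC", "BODC"],
    "salinity": ["BODC"],
}

# Hypothetical per-site catalogues, each using its own local field names.
SITE_CATALOGUES = {
    "BADC": [
        {"param": "sea_surface_temperature", "year": 1999, "file": "badc_sst_1999.nc"},
    ],
    "BODC": [
        {"variable": "sea_surface_temperature", "yr": 2001, "path": "bodc_sst_2001.dat"},
        {"variable": "salinity", "yr": 2001, "path": "bodc_sal_2001.dat"},
    ],
}

# Mapping from the common query vocabulary into each local schema.
SCHEMA_MAP = {
    "BADC": {"parameter": "param", "year": "year", "location": "file"},
    "BODC": {"parameter": "variable", "year": "yr", "location": "path"},
}


def query_site(site, parameter, year=None):
    """Run the query against one site's catalogue, translating field names."""
    fields = SCHEMA_MAP[site]
    hits = []
    for record in SITE_CATALOGUES[site]:
        if record[fields["parameter"]] != parameter:
            continue
        if year is not None and record[fields["year"]] != year:
            continue
        hits.append({
            "site": site,
            "year": record[fields["year"]],
            "location": record[fields["location"]],
        })
    return hits


def find_data(parameter, year=None):
    """Step 1: consult the central index. Step 2: query only the listed sites."""
    results = []
    for site in CENTRAL_INDEX.get(parameter, []):
        results.extend(query_site(site, parameter, year))
    return results


if __name__ == "__main__":
    print(find_data("sea_surface_temperature"))
    print(find_data("salinity", year=2001))
```

The central index stays small because it holds only enough to route a query; the detailed matching happens at each site in its own schema, which is where the data-model and indexing decisions flagged above as the major design issue would matter.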
Query Pathway; software components
[Flowchart, BNL V1.01 - 12/01. Box labels recovered from the slide; the connecting arrows are not recoverable.
Layers: Application Level; NERC DataGrid Interfaces (NERC, international, generic); Existing Data and Services; New Data Interfaces and Services (Existing and Required); Grid Middleware.
Legend: Information Structure; Joint Interfaces; PCMDI Components; NDG Components; Existing Components.
Discovery and Extraction Path: User Query; Query Distributor (Check Authentication); "Dataset" Catalogue Search (Check Authorisation against "Looking"); Query Distributor (Check Authorisation against "Locating"); Parallel Queries; Query Handler; Granule Catalogue Search; Return Satisfactory Granule Metadata; Reformat Metadata; Collate Multiple Returns; Response: DataSet Metadata; User Assessment (Potentially Interesting / inadequate); Generate Expansion Query (e.g. time and space); Data Exists? (Yes / No); Continue to Extraction?; Exit or return to previous step at this level.
Data Extraction Path for Known Datasets: Define Requirements for SubSampling and Reformatting; Check Authorisation for "Extraction" (OK for extraction / Not OK); Network Path and Cache Identification; Extract Data File; Sub-Sample and Reformat; Deliver Data to Processor(s) (and cache); User Processing, Display and/or Visualisation.
Data Path into Archive: New Model and Data Ingestion and Metadata Creation Interfaces; Data and Metadata Archives.]
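
The query pathway names three separate authorisation checks: against "Looking" (dataset catalogue search), "Locating" (distributed queries) and "Extraction" (data delivery). A minimal sketch of staged checks of that kind is given below. It assumes, purely for illustration, that the three levels are ordered and that a higher grant implies the lower ones; the slide does not say this, and the user names, grant table and function names are hypothetical.

```python
# Stage names are taken from the flowchart; treating them as an ordered
# hierarchy is an assumption made for this sketch.
LEVELS = ["Looking", "Locating", "Extraction"]

# Hypothetical users mapped to the highest level each has been granted.
GRANTS = {"alice": "Extraction", "bob": "Looking"}


def is_authorised(user, stage):
    """True if the user's grant is at or above the requested stage."""
    granted = GRANTS.get(user)
    if granted is None:
        return False
    return LEVELS.index(granted) >= LEVELS.index(stage)


def stages_passed(user):
    """Walk the pathway in order and stop at the first failed check."""
    passed = []
    for stage in LEVELS:
        if not is_authorised(user, stage):
            break
        passed.append(stage)
    return passed


if __name__ == "__main__":
    for name in ("alice", "bob", "carol"):
        print(name, stages_passed(name))
```

Checking at each stage rather than once at login means a user can discover that a dataset exists without being able to extract it, which matches the separation of discovery and extraction paths in the flowchart.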
Simplified Software Stack
Key point:
make use of existing
technology, allow
component replacement
with time!
Achievable by:
interface definition and
integration.
Note: Any application will be
able to access our data
services via the OGSA
wrapper in the middleware.
Software stack
NDG: Ingestion Tasks
Draft Project Schedule
Phase One Delivery
Metadata Gateway
Replace with Globus Giggle?
Next steps include:
• Replacing the transport layers in the metadata gateway with SOAP
• Replacing the SGML in the metadata gateway with XML
…etc
Connectivity? Innovation? Evolution!
[Diagram. NERC DataGrid at the centre, linked to: ClimatePrediction.com; GODIVA; BADC; BODC; NEODC; ESG II; UK DataBase Task Force; QinetiQ; CEOS; BNSC; CLRC e-science Data Portal; PARADISE; DDC-CEH; Other Programmes; U.S. Thredds/NOMADS; EU DataGrid WP9; Ontologies (NeSC, MyGrid); Digital Libraries (Zoom); ? Future ?]
Plagiarism: Copying from one person
Research: Copying from many people
… we can’t afford to be too innovative!
Indicators of Success
Finding and making use of data:
• Possible to find, reformat, and visualise disparate datasets from
disparate organisations within one application.
• No longer necessary to rely on personal contacts to locate and acquire
data of interest if it’s held in the BADC/BODC.
• Key requirement for interdisciplinarity: the ability to test data
comparison ideas without learning foreign formats and
establishing personal relationships every time.
• Other NERC designated data centres implementing NDG.
Take up by community:
• NDG software (but not necessarily graphics tools) in use in GODIVA
project and in wider UK university community (including data
repositories in research groups).
• Earth System Grid uses NDG components.
Risks Of Failure
• Someone else does it first – unlikely!
• Performance too slow for users!
– More cache and replication
– Improve database performance (UK DBTF!)
– Data-compression layer for XML
– Reduce scope and search depth (don’t want to do this!)
• Globus 3 (OGSA) delivery heavily delayed
– Web services implementation + Globus2 + datagrid service registry
• Availability of people with appropriate skills
– re-deploy existing staff where possible
– Schedule begins with three months training.
• ESG-II architecture delayed or incompatible with UK
architecture
– Close relationship with PCMDI means we will be able to proceed
effectively anyway.
NDG expected evolution
[Diagram with numbered steps 1 to 6. Labelled elements: Data Repositories (Satellite; NERC DDC; Other: e.g. PML/ESSC; At USER Institution); Data File; Computation; Catalogue Ingestor; Local Catalogue; Catalogue Client; XML Catalogue Server; Python API; Docs; Graphics; Based on LAS; Evolving to OGSA.]
Beyond the next three years:
The NDG and earth systems science
Extension to the other NERC data centres requires:
– online (or near-line) data.
– appropriate ingestion tools, and appropriate mappings between
discipline-specific metadata and generic metadata.
– Grid-enabling data centres.
– Decisions about policy and access.
Bryan Lawrence, BADC
David Boyd, CLRC E-science
Kerstin Kleese, CLRC E-science
Roy Lowry, BODC
Dean Williams, PCMDI
Bob Drach, PCMDI
Mike Fiorino, PCMDI
Project Management
• Weekly workgroup meetings (teleconference and physical).
• Milestoning code and documentation reviews at quarterly
intervals.
• Quarterly liaison with both US colleagues and other NERC
projects (GODIVA, ClimatePrediction.com etc).
• Bi-Annual target-reprofiling.
• Professional project management at the code level:
– Both RAL SSTD and RAL e-Science have considerable experience
managing and delivering large software projects.
• Two key tenets of management philosophy:
– Build early, build often.
– Evolve from a working system.
The NDG: What will we do?
Key components: BADC/BODC
• Project Management.
• Ingestion tools for station data, Oracle database data, and other (e.g.
PP - includes tools based on ESML and Marine XML).
• Format conversion tools within CDAT.
• Ingestion! Migrate NERC Metadata gateway to WSDL/SOAP
(Zoom?).
Key components: CLRC e-science
• Globus Installation at all sites.
• Functional decomposition and interface definitions.
• Search database schema; search software Python API, wrappers.
• Database Population. Logical to Physical File Manager.
• Amalgamating search API into
– LAS (or successor), VCDAT, metadata gateway.
• Add data retrieval interfaces into metadata gateway.