Discovery Net :
A UK e-Science Pilot Project
for Grid-based Knowledge Discovery Services
Patrick Wendel
Imperial College, London
Data Mining and Exploration Middleware for Distributed and Grid Computing,
September 18-19, 2003
Why Discovery Net?
Data Challenge:
Distributed, heterogeneous & large scale data sets
Novel and real-time data sources
Resource Challenge:
Novel specialised data analysis components/services continually
being published/made available
Computational resources provided
Information Challenge:
Data cleaning, normalisation & calibration
New data needs to be related to existing data
Knowledge Challenge:
Collaborative, interactive & people-intensive
Result interpretation & validation in relation to existing knowledge
Knowledge sharing is key
What is Discovery Net
Goal:
Construct an infrastructure for worldwide knowledge discovery services
Key Technologies:
• Grid and Distributed Computing
• Workflow and service composition
• Data Mining & Visualisation
• Data Access & Information Structuring
• High Throughput Screening Devices: real-time
Discovery Net: Unifying the
World’s Knowledge
Data Integration:
Dynamic Real Time Construction of “Data Grids”
Application Integration:
Component and Service-based Integration
People Integration:
Worldwide Discovery Groupware
Knowledge Integration:
Multi-subject and multi-modality integrative analysis to cross-validate and annotate related discovery work
What is Discovery Net
[Diagram labels: Scientific Information (Literature, Databases, Operational Data, Images, Instrument Data); Real Time Integration; Workflow Construction; Dynamic Application Integration; Interactive Visual Analysis; Using Distributed Resources; Scientific Discovery]
Discovery Net Layer Model
(Life Science Application)
D-Net Clients:
End-user applications and user interface allowing scientists to construct and drive knowledge discovery activities
Deployment:
Web/Grid Services, OGSA
D-Net Middleware:
Provides execution logic for distributed knowledge discovery and access to distributed resources
High Performance and Grid-enabled Transfer Protocol (GSI-FTP, DSTP..)
Computation & Data Resources:
Distributed databases, compute servers and scientific devices
Grid-enabled Infrastructure (GSI)
A Knowledge Grid based on
D-Net Servers
Goal: Plug & Play
Data Sources, Analysis Components and Knowledge Discovery Processes
[Diagram labels: DNet Server (DNet API; Deployment; Computation Components; Data access & Storage; InfoGrid; Knowledge discovery services); DNet Client; DNet participating client; Web client (WWW); XML / DPML; Internet; RDBMS; Data sources; Computational services]
Several types of clients for different usage (from thin web client to participating client)
Current implementation based on Java distributed objects (EJB), moving towards Web/Grid services
But deployment and API access are through standard Web/Grid services
Discovery Process
Management
Workflow based service composition
Data-flow approach fits Knowledge Discovery
process
Allows scientists to develop processes.
Towards a Standard Workflow Representation
for Discovery Informatics: Discovery Process
Markup Language (DPML):
Contains component data-flow graphs, but also
Records collaboration information (user,
changes)
Records execution constraints (location,
parameterisation)
Becomes a key intellectual property: Discovery
Processes can be stored, reused, audited,
refined and deployed in various forms
D-Net Workflow for Genome Annotation :
16 services executing across Internet
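As a rough illustration of what the slide says a DPML document records (a component data-flow graph, collaboration information and execution constraints), the following Python sketch models the same three things in memory. The slides do not show the DPML schema, so the class and field names here (`Node`, `Workflow`, `location`, `parameters`, `history`) and the sample users and hosts are assumptions made for illustration only, not DPML itself.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """One component in the data-flow graph (hypothetical structure)."""
    name: str
    inputs: List[str] = field(default_factory=list)            # upstream node names
    location: Optional[str] = None                             # execution constraint: where to run
    parameters: Dict[str, str] = field(default_factory=dict)   # execution constraint: settings


@dataclass
class Workflow:
    """Data-flow graph plus a record of who changed what."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    history: List[Tuple[str, str]] = field(default_factory=list)  # (user, change)

    def add(self, user: str, node: Node) -> None:
        self.nodes[node.name] = node
        self.history.append((user, f"added {node.name}"))

    def execution_order(self) -> List[str]:
        # Each node maps to the set of nodes that must run before it.
        graph = {n.name: set(n.inputs) for n in self.nodes.values()}
        return list(TopologicalSorter(graph).static_order())


# Made-up users, host name and parameter value, for illustration only.
wf = Workflow()
wf.add("alice", Node("load_sequence"))
wf.add("bob", Node("blast", inputs=["load_sequence"], location="cluster-a",
                   parameters={"evalue": "1e-5"}))
wf.add("alice", Node("annotate", inputs=["blast"]))
print(wf.execution_order())
print(wf.history)
```

Keeping the change history and the execution constraints next to the graph is what lets a stored process be audited and re-deployed later, which is the point the slide makes about DPML.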
InfoGrid: Dynamic Data
Integration
Dynamic Data Integration = On-demand
access to heterogeneous data sources +
information structuring
Towards a Dynamic Information
Integration Methodology:
Specialised Information Source Access: InfoGrid allows users to register, locate and connect to various specialised information sources.
On-the-fly Integration: InfoGrid allows users to build their own integration structure on the fly (worst case: proprietary protocol/format; best case: JDBC/HTTP-XML-XPath/Web Service).
Easy Maintenance: Wrappers/Drivers to new data sources can be added through a clean API
[Diagram labels: Integrative Analysis; Clinical; Trials; Patients…; Project; Journals; Reports; Patents…; Biological; Activity; Screening; Protocols; Toxicology; Metabolic; Pathways…; Protein; Structures; Targets; Sequence; Structure; Function…; Gene; Sequence; Location; Expression; Function…; Chemistry; Libraries; Synthetic pathways…; Catalogues]
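The wrapper idea on this slide can be sketched as follows. This is only an illustration in Python under stated assumptions: `SourceWrapper`, `DictSource` and `InfoRegistry` are hypothetical names, not the InfoGrid API, the records are made up, and a real wrapper would speak JDBC, HTTP/XML or a proprietary protocol instead of reading a dictionary.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SourceWrapper(ABC):
    """Uniform interface that every data-source wrapper implements (hypothetical)."""

    @abstractmethod
    def fetch(self, key: str) -> Dict[str, str]:
        """Return the record for `key`, or an empty dict if unknown."""


class DictSource(SourceWrapper):
    """Stand-in wrapper backed by an in-memory dictionary."""

    def __init__(self, records: Dict[str, Dict[str, str]]) -> None:
        self._records = records

    def fetch(self, key: str) -> Dict[str, str]:
        return dict(self._records.get(key, {}))


class InfoRegistry:
    """Lets users register, locate and connect to sources by topic."""

    def __init__(self) -> None:
        self._sources: Dict[str, SourceWrapper] = {}
        self._topics: Dict[str, str] = {}

    def register(self, name: str, topic: str, wrapper: SourceWrapper) -> None:
        self._sources[name] = wrapper
        self._topics[name] = topic

    def locate(self, topic: str) -> List[str]:
        return sorted(n for n, t in self._topics.items() if t == topic)

    def connect(self, name: str) -> SourceWrapper:
        return self._sources[name]

    def integrate(self, key: str, names: List[str]) -> Dict[str, str]:
        """Merge the records for `key` from the named sources into one view.

        Fields are prefixed with the source name so nothing is overwritten.
        """
        merged: Dict[str, str] = {}
        for name in names:
            for fld, value in self.connect(name).fetch(key).items():
                merged[f"{name}.{fld}"] = value
        return merged


# Made-up sources and records, for illustration only.
registry = InfoRegistry()
registry.register("sequences", "gene", DictSource({"geneX": {"length": "1200"}}))
registry.register("pathways", "gene", DictSource({"geneX": {"pathway": "example-pathway"}}))
print(registry.integrate("geneX", registry.locate("gene")))
```

Adding a new kind of source then means writing one more `SourceWrapper` subclass; the registry and the integration step stay unchanged.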
Dynamic Application
Integration Services
Dynamic Application Integration = On-demand access and composition of remote analysis components
Towards a Dynamic Component Integration:
Component service: allows users to register, locate and remotely execute components (Java component interface or Web Service port type).
Execution service: allows users to control the execution of components in distributed environments
Easy Maintenance: New components can be added through a clean API
[Diagram labels: D-NET API; Clustering; Classification; Regression; Gene function prediction; Promoter Prediction; Homology Search]
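A minimal sketch of the component and execution services, written in Python rather than the Java component interface the slide mentions. `ComponentService`, `ExecutionService` and the two toy components are hypothetical, and remote execution is simulated by a local function call.

```python
from typing import Any, Callable, Dict, List

Component = Callable[[Any], Any]  # a component maps an input to an output


class ComponentService:
    """Register and locate analysis components by name (hypothetical API)."""

    def __init__(self) -> None:
        self._components: Dict[str, Component] = {}

    def register(self, name: str, component: Component) -> None:
        self._components[name] = component

    def locate(self, name: str) -> Component:
        if name not in self._components:
            raise KeyError(f"no component registered as {name!r}")
        return self._components[name]


class ExecutionService:
    """Runs a chain of registered components, feeding each output forward."""

    def __init__(self, components: ComponentService) -> None:
        self._components = components
        self.log: List[str] = []  # names of the components that ran, in order

    def run_chain(self, names: List[str], data: Any) -> Any:
        for name in names:
            component = self._components.locate(name)
            # A real execution service would dispatch this call to a remote host.
            data = component(data)
            self.log.append(name)
        return data


service = ComponentService()
service.register("clean", lambda seq: seq.strip().upper())
service.register("gc_content", lambda seq: (seq.count("G") + seq.count("C")) / len(seq))

executor = ExecutionService(service)
print(executor.run_chain(["clean", "gc_content"], " acgt\n"))
```

New components are added with one `register` call, which mirrors the "clean API" point on the slide.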
Discovery Deployment
Discovery Deployment = On-demand
rapid application construction and
publishing
Towards a Dynamic Deployment of
Knowledge Discovery Procedures:
Deployment Engine: allows users to build and publish applications based on DPML code coordinating remotely executed components, as a Web page, Web/Grid Service or command line tool.
Easy Maintenance: New discovery procedures are described in DPML, a standardised representation of “composed” discovery procedures
Storage & Reporting Servers: allow users to share DPML procedures and to generate workflow audit reports.
[Diagram labels: Discovery Process in DPML; Discovery Component; Discovery Service; Report; Batch processing]
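One of the deployment targets listed is a command line tool. The sketch below shows the general idea in Python: the parameters a workflow exposes are turned into command line options with the standard `argparse` module. The parameter description format, the `build_cli` helper and the sample parameters are assumptions for illustration; the slides do not describe how the Deployment Engine generates its tools.

```python
import argparse
from typing import Dict, Tuple

# Exposed workflow parameters: name -> (type, default, help text).
# This description format is hypothetical.
ParamSpec = Dict[str, Tuple[type, object, str]]


def build_cli(workflow_name: str, params: ParamSpec) -> argparse.ArgumentParser:
    """Create a command line parser with one option per exposed parameter."""
    parser = argparse.ArgumentParser(prog=workflow_name)
    for name, (kind, default, help_text) in params.items():
        parser.add_argument(f"--{name}", type=kind, default=default, help=help_text)
    return parser


# Made-up parameters for an annotation workflow.
params: ParamSpec = {
    "input": (str, "sequence.fasta", "sequence file to annotate"),
    "evalue": (float, 1e-5, "similarity search cut-off"),
}
cli = build_cli("annotate", params)
args = cli.parse_args(["--evalue", "0.001"])
print(args.input, args.evalue)
```

The same parameter description could equally drive a generated Web form or service interface, which is why a single stored process can be published in several forms.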
Knowledge Integration &
Interpretation
Dynamic Knowledge Interpretation =
cross-reference and verify analysis results
against background knowledge
Towards a Knowledge Integration Framework: Multi-subject data analysis
Specialised Client Interfaces: Interactive analysis and dynamic component interaction
Result Annotation, Structuring and Storage: Information source query, result browsing, sharing and markup
[Diagram labels: Text Mining; Genetic Analysis; Sequence Analysis; Pathway Analysis]
Life science example application
Workflow execution
Component execution location resolution
User list of known resources
A component can explicitly require execution on a particular resource
A component can choose from a set of proposed resources (and could use Grid resource information systems and network weather information to decide where to run)
For unconstrained components, a simple “near the data” execution policy applies:
If there is a single input data location, execute there
Otherwise fall back to the original execution location
Allows usual DPKD workflows to be designed
Handles data management and transfer (serialisation,
Java based, FTP based)
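The resolution rules on this slide can be sketched as a single function. This is a minimal Python illustration, not the Discovery Net implementation: the `Component` fields, the `score` callback (standing in for a Grid information or network weather lookup) and the resource names are hypothetical, and "single input data location" is read here as "all inputs live on the same resource".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Component:
    name: str
    required: Optional[str] = None                        # explicit resource requirement
    candidates: List[str] = field(default_factory=list)   # acceptable resources


def resolve_location(component: Component,
                     input_locations: List[str],
                     origin: str,
                     known: Set[str],
                     score: Optional[Callable[[str], float]] = None) -> str:
    """Pick where a component runs; lower `score` means a better resource."""
    # 1. An explicit requirement wins, but must be in the user's known resources.
    if component.required is not None:
        if component.required not in known:
            raise ValueError(f"unknown resource: {component.required}")
        return component.required

    # 2. Otherwise choose among the proposed resources the user knows about.
    usable = [r for r in component.candidates if r in known]
    if usable:
        return min(usable, key=score) if score else usable[0]

    # 3. Unconstrained: run near the data if all inputs share one location.
    distinct = set(input_locations)
    if len(distinct) == 1:
        return distinct.pop()

    # 4. Otherwise fall back to where the workflow was started.
    return origin


known = {"hpc", "db-host", "lab"}
load = {"hpc": 0.9, "lab": 0.2}  # made-up load figures
print(resolve_location(Component("blast", required="hpc"), ["db-host"], "client", known))
print(resolve_location(Component("cluster", candidates=["hpc", "lab"]), [], "client", known, load.get))
print(resolve_location(Component("filter"), ["db-host", "db-host"], "client", known))
print(resolve_location(Component("join"), ["db-host", "lab"], "client", known))
```

Running near the data avoids moving large inputs across the network, while the fallback keeps the policy predictable when inputs are spread over several hosts.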
Discovery Net and Grid
technologies
Cluster/Campus Grid level:
Partial or complete workflow execution on Condor
/ SGE
Task farming on subset of the workflow
Global Grid:
GSI integration (Java CoG Kit)
GSI-FTP transfer functionality (Java CoG Kit)
OGSA Grid Service access to functionalities (GT3)
Potential use of GRIS or NWS in component
implementation
Globus scheduler ? Unicore ? SRB ?
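Task farming means applying the same step of a workflow to many independent inputs in parallel. The sketch below illustrates the pattern in Python with a local thread pool standing in for a Condor or SGE pool; it does not use any Condor, SGE or Globus API, and the `farm` helper and the sample task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def farm(task: Callable[[T], R], items: Iterable[T], workers: int = 4) -> List[R]:
    """Apply `task` to every item in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, items))


def count_gc(sequence: str) -> int:
    """Toy per-item task: count G and C bases in one sequence."""
    return sum(1 for base in sequence if base in "GC")


print(farm(count_gc, ["ACGT", "GGCC", "ATAT"]))
```

Because each item is processed independently, only the farmed subset of the workflow needs to be shipped to the pool; the rest of the workflow can keep running where it was started.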
Discovery Net Application
Testbeds
GUSTO UNITS with wireless connectivity
Life Science Testbed:
Gene sequencing, Protein Chips
High Throughput real-time genome annotation
testbed: analyse and interpret new sequences using
existing distributed bioinformatics tools and
databases
Environmental Modelling
Pollution Sensors (GUSTO): SO2, Benzene, ..
High Throughput real-time pollution monitoring
testbed: analyse, interpret time-resolved correlations
among remote stations, and with other environmental
data sets
Geo-hazard Prediction
Multi-spectral, multi-temporal, Satellite imagery
Real-time geo-hazard prediction testbed: analyse,
interpret satellite images with other data sets to
generate thematic knowledge
Case Study:
SC2002 HPC Challenge
D-Net based Global Collaborative Real-Time Genome Annotation
[Diagram labels, grouped by annotation level:
Input: DNA from High Throughput Sequencers
Nucleotide-level Annotation: Identify Organism; Chromosomes; Identify Genes; Gene markers; Regulatory Regions; Non-translated RNAs (tRNAs, rRNAs); Repetitive Elements; Segmental Duplication; SNP Variations; Literature References. Resources: EMBL, NCBI, TIGR, SNP, blast, RepeatMasker, genscan, grail, E-PCR, …
Protein-level Annotation: Identify Proteins; Functional Characterisation; Classify into Protein Families; Domain; Homologues; 3-D Structure; Secondary structure; Fold Prediction; Literature References. Resources: InterPro, SWISS-PROT, SMART, PFAM, blast, 3D-PSSM, Motif Search, predator, DSC, …
Process-level Annotation: Relate Pathway; Metabolism; Cell Cycle; Cell death; Embryogenesis; Drugs; Biological Process; Ontologies; Literature References. Resources: GO, CSNDB, KEGG, GK, GeneMaps, GenNav, AmiGO, virtual chip, …
Other labels: Organism’s; Genome; Maps
Totals: 15 DBs, 21 Applications]
How It Works
Interactive Editor & Visualisation
Nucleotide Annotation Workflows
Download sequence from Reference Server
Execute distributed annotation workflow
Save to Distributed Annotation Server
Resources used: InterPro, SMART, EMBL, KEGG, SWISS-PROT, NCBI, TIGR, SNP, GO
1800 clicks, 500 Web accesses, 200 copy/paste operations, 3 weeks of work: in 1 workflow and a few seconds of execution
Conclusion and Future Work
Towards an open integration platform that enables
scientists to conduct their KD activities
Several levels of integration required
Enable use of available resources
Evolution towards cost model integration (performance,
value, QoS)
Semantic based service retrieval and composition
Other useful standards ? (OGSA-DAI ?)