Distributed Data Mining in Discovery Net
Dr. Moustafa Ghanem
Department of Computing
Imperial College London
1. What is Discovery Net
2. Distributed Data Mining for Compute Intensive Tasks
3. Distributed Data Mining for Sensor Grids
4. Knowledge Discovery from Naturally Distributed Data Sources
5. What Do Scientists Really Want?
1. What is Discovery Net
What is Discovery Net?
Funding: One of the eight UK national e-Science Pilot Projects, funded by EPSRC (£2.2M)
Start Oct 2001, End March 2005
Goal: Construct the world's first infrastructure for global knowledge discovery services
Key Technologies:
Open Service Computing
High Throughput Devices and Real Time Data Mining
Real Time Data Integration & Information Structuring
Cross Domain Knowledge Discovery and Management
Discovery Workflow and Discovery Planning
Discovery Net Applications
Life Sciences
High throughput genomics and proteomics
Distributed Databases and Applications
Environmental Modelling
High throughput dispersed air sensing technology
Sensor Grids
Real time geo-hazard modelling
Earthquake modelling through satellite imagery
High performance Distributed Computation
Discovery Net Architecture
DPML
Web/Grid Services
OGSA
D-Net Clients: End-user applications and user interface allowing scientists to construct and drive knowledge discovery activities
D-Net Middleware: Provides services and execution logic for distributed knowledge discovery and access to distributed resources and services
High Performance Communication Protocol (GridFTP, DSTP..)
Grid Infrastructure (GSI)
Goal: Plug & Play
• Data Sources
• Analysis Components
• Knowledge Discovery Processes
Computation & Data Resources: Distributed databases, compute servers and scientific devices.
Discovery Net Data Mining Components
Generic Data Mining
Classification, Clustering, Associations, ..
Unstructured-Data Mining
Text Mining, Image Mining
Domain-specific Mining
Bioinformatics, Cheminformatics, ..
2. Distribution of Compute Intensive Tasks
a. Distributed Data Mining for Geo-hazard Prediction
Grid-based Geo-hazard Data Mining
Grid-based HPC Computation
Automatically co-register a stack of imagery layers at high precision and speed.
Data Warehousing & Modelling
Co-registration & geo-rectification
Grid-based Data Access and Integration
Workflow to Coordinate Grid Computation
Image features extraction
Cluster & classification
Normalised cross-correlation (NCC) template algorithm
Inputs: Image “before”, Image “after”
Flowchart steps: Reading data set (once per image), Setting search window, Setting comparing window (once per image), Significant correlation coefficient? (N / Y)
Outputs: Delta X, Delta X, Correlation coefficient
Operates on a remotely accessed MPI UNIX parallel computer through a fast network with the DNet interface.
Slow but highly accurate: 10 hours on 24 processors for one scene of Landsat-7 ETM+ Pan imagery data. The algorithm also runs on the GRID.
2. Distribution of Compute Intensive Tasks
b. Distributed Clustering
Workflows for Distributed Data Clustering
3. Distributed Mining over Sensor Grid Data
Distributed Spatial Data Mining for Air Pollution Modelling
Sensor Specification
The GUSTO Project - Update
(Generic UV Sensors Technologies & Observations)
• High throughput open path spectrometer system
• Robust algorithm for pollutant concentration retrievals
• Measures SO2, NO, NO2, O3 & Benzene to ppb levels every few seconds
• Geared for networking of multiple GUSTO units within a GRID Infrastructure
• Can support Remote Sensing data for (contour) mapping of pollutants
www.gusto-systems.com
Networking of Multiple GUSTO Units
GUSTO unit 1
GUSTO unit 2
GUSTO unit 3
GUSTO unit 4
Wireless connectivity
Monitoring and control software
Sensor registry & control service
SensorML
HTTP, SOAP, GSI
Data upload service
HTTP, SOAP, GSI
Warehouse
Data access service
Archived weather data
Archived health data
GRID Infrastructure
Public access
Web visualizer
Visualisation and Data Mining
Pollution analysis
4. Knowledge Discovery from Naturally Distributed Data Sources
Distributed Data Mining in Life Sciences
Distributed Data Mining for Life Sciences
secondary structure
tertiary structure
polymorphism
patient records
epidemiology
expression patterns
physiology
sequences
alignments
ATGCAAGTCCCT
AAGATTGCATAA
GCTCGCTCAGTT
receptors
signals
pathways
linkage maps
cytogenetic maps
physical maps
Information Integration
Gene
Expression
Warehouse
ExPASy
SwissProt
PDB
ExPASy
Enzyme
OMIM
Disease
Enzyme
Protein
Affy Fragment
LocusLink
Known Gene
MGD
Sequence
Metabolite
SNP
SPAD
Sequence
Cluster
Genbank
NCBI
dbSNP
NMR
Pathway
KEGG
UniGene
Given a collection of microarray-generated gene expression data, what kinds of questions do users wish to pose?
How should an integration schema be designed?
From Data Integration to Knowledge Unification
In Silico Experiment
D-World
I-World
K-World
Life Science Application: SC2002 HPC Challenge
D-Net based Global Collaborative Real-Time Genome Annotation
Identify
High Throughput
Sequencers
Organism
Chromosomes
Identify
Genes
Gene markers
Nucleotide-level
Annotation
Organism’s
DNA
Regulatory
Regions
Segmental
Duplication
Literature
References
tRNAs, rRNAs
Non-translated
RNAs
Repetitive
Elements
SNP
Variations
EMBL
NCBI
TIGR
SNP
genscan
blast
grail
RepeatMasker
E-PCR
genscan
…..
Identify
Protein-level
Annotation
Proteins
Classify into
Protein Families
Functional Characterisation
Homologues
Domain
3-D Structure
Fold Prediction
Secondary
structure
Literature
References
…..
InterPro
InterPro
SMART
SWISS-PROT
blast
PFAM
Motif
Search
predator
DSC
Relate
Process-level
Annotation
Pathway
Maps
Cell
Cycle
Metabolism
Drugs
Biological
Process…..
Cell death
Literature
References
3D-PSSM
Embryogenesis
…..
GO
CSNDB
KEGG
GK
Ontologies
AmiGO
GeneMaps
virtual
chip
GenNav
15 DBs
21 Applications
Genome
Annotation
HPC Challenge SC2002
Nucleotide Annotation Workflows
Interactive Editor & Visualisation
Download sequence from Reference Server
Real-time sequencing in London
InterPro
SMART
KEGG
EMBL
NCBI
SWISS-PROT
TIGR
SNP
GO
Save to Distributed Annotation Server
Distributed data and computation
1800 clicks, 500 Web accesses, 200 copy/paste operations: 3 weeks of work in 1 workflow and a few seconds of execution
Execute
distributed
annotation
workflow
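The idea of replacing many manual steps with one workflow can be sketched as a chain of annotation steps applied to a record. This is only an illustration of the composition pattern, not Discovery Net's workflow language or API. Every function name here is hypothetical, and the two annotation steps (GC content and a start-codon check) are simple local stand-ins for remote tools and databases such as those named on the slides.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[Record], Record]

def load_sequence(record: Record) -> Record:
    # Hypothetical stand-in for "download sequence from reference server".
    return {**record, "sequence": "ATGCAAGTCCCT"}

def annotate_gc_content(record: Record) -> Record:
    # Toy annotation: fraction of G and C bases in the sequence.
    seq = str(record["sequence"])
    gc = sum(1 for base in seq if base in "GC") / len(seq)
    return {**record, "gc_content": gc}

def annotate_start_codon(record: Record) -> Record:
    # Toy annotation: does the sequence begin with the ATG start codon?
    seq = str(record["sequence"])
    return {**record, "has_start_codon": seq.startswith("ATG")}

def run_workflow(steps: List[Step], record: Record) -> Record:
    # Apply each step in order; each step returns an enriched copy.
    for step in steps:
        record = step(record)
    return record
```

For example, `run_workflow([load_sequence, annotate_gc_content, annotate_start_codon], {"id": "demo"})` returns a single record carrying the sequence and both annotations. In a distributed setting each step would wrap a call to a remote service rather than a local function.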
Discovery Net in Action:
China SARS Virtual Lab
Genbank
Homology search against
viral genome DB
Homology search
against protein DB
Annotation using
Artemis and GenSense
Annotation using
Artemis and
GenSense
Predicted
genes
Gene prediction
Exon prediction
Key word
search
Splice site prediction
GeneSense
Ontology
Multiple sequence
alignment
D-Net:
Integration,
interpretation,
and discovery
Relationship
between SARS
and other virus
Phylogenetic analysis
Immunogenetics
Mutual regions
identification
Microarray analysis
Epidemiological analysis
SARS patients
diagnosis
Homology search
against motif DB
Protein localization
site prediction
Protein interaction
prediction
Relationship between
SARS virus and human
receptors prediction
Classification and
secondary structure
prediction
Bibliographic databases
Bibliographic databases
Discovery Net in Action:
SARS Virus Mutation Analysis
5. What Do Scientists Really Want?
Does it really work?
Towards Compositional Grid Services
Native MPI
Condor-G
Web Service
Resource
Mapping
Web Wrapper
Sun Grid
Engine
Service
Browsing
OGSA-service
Oracle 10g
Unicore
Workflow Execution
A compositional GRID
Workflow
Warehousing
Workflow Authoring
Composing services
Workflow Management
Collaborative Knowledge Management
Service
Abstraction
Workflow Deployment: Grid Service and Portal
Discovery Net Service Composition
Full Workflow
Executing Protein Annotation Workflow
Deployment of Node
Deploying Protein Annotation Workflow
Executing Deployed Service
Locating & Executing Deployed Service from Discovery Net
Workflow Provenance
Workflow Warehousing
Discovery Net Snapshot
Scientific
Information
Scientific
Discovery
In Real Time
Real Time Data Integration
Discovery Services
Literature
Service Workflow
Databases
Operational
Data
Dynamic Application
Integration
Integrative Knowledge Management
Using Distributed Resources
Images
Instrument
Data