Life Sciences and Grids

Virginia Center for Grid Research
The Global Bio Grid
Andrew Grimshaw
University of Virginia
January, 2006
• Why Bio Grids?
• Grid Basics
• The Global Bio Grid
In ten years the world will be
very different.
Think back ten years.
• No web
• Wide-spread internet was new
• Human Genome Project still far from
completion
• Science (biology) done primarily in
individual labs
Today
• Billions a year in e-commerce
• Internet everywhere
• Broadband to your home
• Wireless becoming pervasive
• Pervasive devices are proliferating –
motes
• Sequencing of organisms is a daily event.
Bioinformatics is hitting the mainstream
Tomorrow
• $1000/sequence for humans – becomes standard
clinical practice
• “Biology is becoming an information science”
(Large Scale Biomedical Science: Exploring Strategies for Future Research, Institute of
Medicine, National Research Council, 2003)
• Global interconnected networks – grids
• Provide transparent, secure access to data, applications,
and on-demand compute.
• Research using not just your data, but all trusted
data, not just your applications, but any trusted
application.
• Implications for progress are significant.
There are a number of
“catches”
• So much data!
• So many organizations with
so little trust!
• So much complexity!
An IT guy's view
• Data is all over, of all different forms, with lots
of different policies
• Need to get the right data in the right place at the
right time
• Ontology problem – how do we compare and
integrate the databases?
• Need to understand semantics, automatically
transform
• Semantics
• Knowledge Discovery – “mining”
This is where grids
enter the picture
(we do the plumbing)
Some lessons learned
• 10+ years in academic and commercial grids
• All/most problems are not technical
• Users don’t want change!
• Too many grids are technology centric
• Must keep “activation energy low”
• Need a user-centric approach
• There are at least four classes of users
• Wide variance in computational savvy
What is a Grid?
A grid is all about gathering together
resources and making them accessible to
users and applications.
A grid enables users to collaborate securely
by sharing processing, applications, work
flows and processes, and data across
heterogeneous systems and administrative
domains for collaboration, faster application
execution, and easier access to data.
The emphasis is on secure access to a wide
variety of resources
Characteristics of Grid systems
• Numerous Resources
• Ownership by Mutually Distrustful Organizations & Individuals
• Different Security Requirements & Policies Required
• Potentially Faulty Resources
• Connected by Heterogeneous, Multi-Level Networks
• Resources are Heterogeneous
• Different Resource Management Policies
• Geographically Separated
What grids are not
• The solution to all problems
• Clusters of machines
• SETI@home
• Any one particular technology
Users' view
[Figure: users access data, run programs, provide shared services, and collaborate through a single grid that spans Site 0, Site 1, Site 2, and Site 3, whose resources include HPSS and clusters.]
Grid Computing Scenarios
Partner Grids
• Multiple owners, sites, domains
• Multiple file systems
• Internet connectivity
Campus/Enterprise Grids
• Multiple owners, domains
• Multiple file systems
• WAN connection
Cluster Grids
• Single owner, department, project
• Single domain, file system
• LAN connection
Desktop Cycle Aggregation
• Limited acceptance in
commercial enterprises
Standards
• Global Grid Forum – ggf.org
• OGSA – Open Grid Services Architecture
• Web-Services based IPC
• WSRF and possibly other
• OGSA-BES – Basic Execution Service
• OGSA-ByteIO – file IO
• WS-Naming – abstract name to EPR
• RNS-lite – Resource Name Space
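The two naming items above can be pictured as a pair of lookups: a human-readable path leads to an abstract name, and the abstract name leads to the current endpoint. The following is only a minimal sketch of that idea, not the WS-Naming or RNS interface. The class `NameSpace`, the path, the abstract name, and the endpoint URLs are all hypothetical, and a real EPR is an XML structure rather than a plain string.

```python
from dataclasses import dataclass, field


@dataclass
class NameSpace:
    """Toy two-level naming: path -> abstract name -> endpoint address."""
    paths: dict = field(default_factory=dict)     # human-readable path -> abstract name
    bindings: dict = field(default_factory=dict)  # abstract name -> current endpoint

    def register(self, path, abstract_name, endpoint):
        self.paths[path] = abstract_name
        self.bindings[abstract_name] = endpoint

    def rebind(self, abstract_name, endpoint):
        # The resource moved; its path and abstract name stay the same.
        self.bindings[abstract_name] = endpoint

    def resolve(self, path):
        return self.bindings[self.paths[path]]


# All names and URLs below are invented for illustration.
ns = NameSpace()
ns.register("/gbg/data/uniprot", "urn:example:res-001",
            "https://site0.example.org/byteio")
before = ns.resolve("/gbg/data/uniprot")

ns.rebind("urn:example:res-001", "https://site2.example.org/byteio")
after = ns.resolve("/gbg/data/uniprot")

print(before, "->", after)
```

The point of the extra level is that users and applications keep using the same path while the resource behind it is moved or replicated.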
The Global Bio Grid
GBG concept
• Federated access to multiple
• Data sources
• Public databases
• Commercial databases
• In-house databases, annotations, etc.
• Application suites (including processes and
workflows)
• Compute resources
• Shared among collaborative research teams
• Multiple research locations
• Virtual organizations
• Built on evolving computing standards
(GGF, I3C, WS-*)
Global Bio Grid
• Datagrid using Avaki DG technology
• Working on ADG available free for “.edu”
• UVA, NCBIO, U-Texas, Texas Tech
• Already operational
• Flat file and relational
• Working on an OGSA-compliant implementation
• Compute grid at UVA on-line
• 64 dual processor Opterons available
• Sunfires
• Hundreds of Windows machines
• Legion 1.8 based – moving towards OGSA-compliant services
• Applications
• Biomarker
• Searching PubMed
• Hospital info integration
Three resource classes
illustrate the Grid-effect
• Data
• Processing
• Applications
Data
• Suppose you have collaborators with critical
databases (clinical, protein, other) that you need to
use.
• You use a number of databases that change on a
regular basis.
• You want to “mine” heterogeneous data sets
(relational, flat-file, XML, …) in different locations –
say in a hospital
• Want to produce, consume, or share derivative data
products, e.g., the result of a set of joins and data
transformation steps.
• This applies to business data (BI/EII) as well as life
science data
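To make the "derivative data product" idea concrete, here is a small sketch that joins a relational table, a flat file, and an XML document on a shared key. It is an illustration under stated assumptions, not how the data grid does it: all three sources are in-memory stand-ins, and every identifier, row, and field name is invented.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Relational source: a sample table (invented rows).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE samples (sample_id TEXT, protein_id TEXT)")
db.executemany("INSERT INTO samples VALUES (?, ?)",
               [("S1", "P001"), ("S2", "P002"), ("S3", "P001")])

# Flat-file source: protein names as CSV text (invented).
flat_file = io.StringIO("protein_id,name\nP001,kinase A\nP002,transporter B\n")
names = {row["protein_id"]: row["name"] for row in csv.DictReader(flat_file)}

# XML source: pathway annotations (invented).
xml_doc = ET.fromstring(
    "<annotations>"
    "<protein id='P001' pathway='signaling'/>"
    "<protein id='P002' pathway='transport'/>"
    "</annotations>"
)
pathways = {p.get("id"): p.get("pathway") for p in xml_doc.iter("protein")}

# Derivative data product: one joined record per sample.
query = "SELECT sample_id, protein_id FROM samples ORDER BY sample_id"
joined = [
    {"sample": s, "protein": names[p], "pathway": pathways[p]}
    for s, p in db.execute(query)
]

for record in joined:
    print(record)
```

In a grid setting the three sources would sit at different sites under different policies; the join logic stays the same once access to them is uniform.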
DataGrid: Unifying fabric for data access
• Transparent access to multiple DBs
• Multiple domains
• Highly-secure, flexible access control
• Automatic cache management and
coherence
[Figure: public databases (PDB, NCBI, EMBL) and sequence data (SEQ_1, SEQ_2, SEQ_3) shared through the data grid among a research institution and two partner institutions (Biology with APP 1, Biochemistry with APP 2).]
Three Concrete Examples
• KDS – “data mining” on widely
separated data sets such as PubMed.
• “Map” UniProt datasets into data grid
• Researchers no longer need to spend time
downloading latest
• Extended Hospital
Extended Hospital
[Figure: a hospital whose data sits in separate department domains, connected to non-related hospitals, authorized family, a data warehouse, clinics / large practices, research, emergency vehicles, and insurance companies.]
Processing
• Classic high-throughput computing
• Suppose you have thousands of
computationally intensive jobs to run
• SW, CHARMm, Sequest, a.out
• Your usage is bursty – need a lot over
short period of time, but often have idle
resources
• You wish you had more!
Compute Grid: Shared access to processing
• Flexible, location-independent access to
virtually unlimited processing, on-demand
• Scheduling, usage, management policies
• System detects, recovers from job failures
• Heterogeneous platform support
• Usage accounting, as required
[Figure: the data grid picture (public databases PDB, NCBI, EMBL; sequence data SEQ_1 to SEQ_3; research and partner institutions with APP 1 and APP 2) extended with a processing layer of Cluster 1, Cluster 2, ... Cluster N.]
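The high-throughput pattern with failure recovery can be sketched in a few lines. This is a local stand-in, assuming a thread pool in place of grid clusters; `run_job`, its artificial failure rule, and `MAX_ATTEMPTS` are hypothetical and are not taken from Legion or any grid scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_ATTEMPTS = 3
attempts = {}  # job id -> number of times the job has been tried


def run_job(job_id):
    """Hypothetical job: every fifth job fails on its first attempt."""
    attempts[job_id] = attempts.get(job_id, 0) + 1
    if job_id % 5 == 0 and attempts[job_id] == 1:
        raise RuntimeError(f"job {job_id} lost its node")
    return job_id * job_id


def run_all(job_ids, workers=8):
    """Run independent jobs, resubmitting the ones that failed."""
    results, pending = {}, list(job_ids)
    for _ in range(MAX_ATTEMPTS):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {job: pool.submit(run_job, job) for job in pending}
        # Leaving the with-block waits for every submitted job to finish.
        pending = []
        for job, future in futures.items():
            if future.exception() is None:
                results[job] = future.result()
            else:
                pending.append(job)  # try again in the next round
    return results, pending


results, failed = run_all(range(1000))
print(len(results), "jobs finished,", len(failed), "gave up")
```

The user submits a bag of independent jobs and gets results back; detecting failures and resubmitting is the system's job, not the user's.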
Concrete Examples
• Biomarkers project wants to run
Sequest-2 using public databases
• Charmm/Amber
• Gnomad (Altman et al)
• BLAST, FASTA, ….
• Autodock
Applications
• Suppose you want to use applications
or workflows developed, maintained,
and supported by others – without the
hassle of installing all of them on your
gear.
• Suppose you want to couple multiple
applications developed at different
institutions together.
Grid users share applications, employing
multiple data & processing resources
• Flexible binary management
• No need to recompile applications
• Securely share applications
• Restrict who gains access
• Restrict where apps run
[Figure: three layers (Data, Processing, Applications). Public databases (PDB, NCBI, EMBL) and sequence data SEQ_1 ... SEQ_N; Cluster 1, Cluster 2, ... Cluster N; APP 1, APP 2, ... APP N; shared among a research institution and partner institutions (Biology, Biochemistry).]
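The two restrictions on shared applications (who gains access, where apps run) amount to a policy check before a run is allowed. A minimal sketch follows, assuming a simple allow-list model; the `AppPolicy` class, the user identities, and the way sites are named are hypothetical and do not describe the actual grid security mechanism.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AppPolicy:
    """Hypothetical sharing policy attached to one published application."""
    app: str
    allowed_users: frozenset = field(default_factory=frozenset)
    allowed_sites: frozenset = field(default_factory=frozenset)

    def may_run(self, user, site):
        # Both restrictions must hold: who gains access, and where the app runs.
        return user in self.allowed_users and site in self.allowed_sites


# Invented users and sites, for illustration only.
policy = AppPolicy(
    app="APP 1",
    allowed_users=frozenset({"alice@research.example", "bob@partner.example"}),
    allowed_sites=frozenset({"Cluster 1", "Cluster 2"}),
)

print(policy.may_run("alice@research.example", "Cluster 1"))  # allowed user and site
print(policy.may_run("alice@research.example", "Cluster N"))  # site not allowed
print(policy.may_run("eve@elsewhere.example", "Cluster 1"))   # user not allowed
```

The owner of the application keeps control of the policy; the application itself does not need to be copied to, or installed by, each collaborator.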
Better Research, Faster
• Secure, wide-area access to global
breadth of consistent, current data
• Access to vast processing power
• Ability to securely share proprietary
data and applications, as needed
[Figure: the combined Data, Processing, and Applications picture: public databases (PDB, NCBI, EMBL), sequence data SEQ_1 to SEQ_3, Cluster 1 ... Cluster N, and APP 1 ... APP N, shared among a research institution and partner institutions (Biology, Biochemistry).]
Summary
Evolution in action
[Figure: timeline of how computing has been used. Period labels: 50’s; 60’s to 80’s; Today; Now & Future! Stage labels: Bare Metal Programming; Batch OS; Multi-User Timeshare; Low Level Network Programming; Grid & WS.]
Summary
• Grids will have a huge impact on the life
sciences
• Prototype GBG operational
• Applications are underway
• We’re always looking for new
applications