Transcript eScience
eScience and Grid
Tools and techniques for the next
generation scientist
Professor Brian Vinter
Head of the Copenhagen eScience Center
eScience
«The next 10 to 20 years will see
computational science firmly
embedded in the fabric of science
– the most profound development
in the scientific method in over
three centuries.»
US Department of Energy 2003.
Mega-Science
The next scientific period will be dominated by
Mega-Science projects
• 10^4 researchers on a single project
• Extreme data production
• Highly integrated collaboration between different
groups of scientists
Examples
• CERN LHC
• ALMA
• Mars project
Data Production
1997: Total data worldwide approx. 12 exabytes (incl. documents, film, TV, pictures, …) [1]
1999: 2-3 exabytes of data produced [2]
2002: Approx. 5 exabytes of data produced [2]
1 Exabyte = 1000 Petabytes
1 Petabyte = 1000 Terabytes
1 Terabyte = 1000 Gigabytes
1 Gigabyte = 1000 Megabytes
Global data availability
doubles every 4-5 years.
[1] http://www.lesk.com/mlesk/ksg97/ksg.html
[2] http://www.sims.berkeley.edu/research/projects/how-much-info-2003/
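As a rough illustration of what "doubles every 4-5 years" implies, the following Python sketch converts a doubling time into a yearly growth rate and a growth factor over a longer horizon. The function names and the 20-year horizon are assumptions chosen for illustration; they do not come from the slides.

```python
def annual_growth_rate(doubling_years: float) -> float:
    """Yearly growth rate of a quantity that doubles every `doubling_years` years."""
    return 2.0 ** (1.0 / doubling_years) - 1.0


def growth_factor(years: float, doubling_years: float) -> float:
    """Factor by which the quantity grows over `years` years."""
    return 2.0 ** (years / doubling_years)


if __name__ == "__main__":
    # 4 and 5 years are the two ends of the range quoted on the slide.
    for doubling in (4.0, 5.0):
        rate = annual_growth_rate(doubling)
        factor = growth_factor(20.0, doubling)  # 20 years is an illustrative horizon
        print(f"doubling every {doubling:.0f} years: "
              f"{rate:.1%} per year, x{factor:.0f} in 20 years")
```

Under these assumptions, a 4-year doubling time corresponds to a factor of 32 over 20 years and a 5-year doubling time to a factor of 16.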
eScience Components
Modeling and simulation
Data acquisition and handling
Visualization
HPC and Grid
Why is it getting more difficult?
[Figure: simulated systems of 54, 442 and 1372 molecules]
System sizes and time scales
[Figure: processes and systems plotted by system size (number of atoms, 1 to 10^6) and time scale (seconds, 10^-15 to 10^3). Processes: photoionization, proton transfer, protein folding. Systems: H2, O2, single peptide, biomimetic compound, biopolymers, proteins, ribosomes. The region covered by this seminar is marked.]
Nano-modeling
Extremely CPU- and
data-intensive
algorithms
Complex structure calculations
Multiple days of
execution even on a
supercomputer
Runs on both PCs and
supercomputers
eScience and Bio/Med
We expect very good results from eScience in
biology and medicine
The foremost advantages will come from
introducing a mathematical causal
understanding of biological systems
• Bioinformatics is already doing this
An emerging field: Systems Biology
• Systems Medicine is also starting internationally
Calculations in treatment
Computational methods
are already important in
medical planning
• Radiation planning
• Bypass flow
modeling
• Robotic surgery
• …
Personalized medicine
Every human is unique
Also at the genetic level
In our genome, which is written with the alphabet
ACGT, we have a number of micro mutations –
called single nucleotide polymorphisms, SNP
These SNPs are often without consequence but
• Some make us sick
• Some are indicators of a faulty gene
• Others influence our reception of a drug
The last complication makes it very hard to make
drugs for the general population
We want to move from commodity medicine to custom
tailored drugs
An example
Approx. 60% of today's medicines are
metabolized by cytochrome P450
enzymes
• Some have highly efficient P450 while
others have very slow and inefficient P450
• Knowledge of a patient's P450 level will
allow us to dose medicine to the individual
much more efficiently
This is already in early use
And this is eScience how?
Developing a drug is not a linear process
The human genome is written with
billions of letters
• Any person has millions of SNP mutations
• Finding the SNP that has an effect is a
highly complex computational task
eScience and geology
Geology and hydrology have also been using
computational methods for a long time
There are very interesting aspects in combining
different methods
• e.g. include biological systems in the models
• Inverse mapping of seismic data
It turns out that we use the same techniques in
medicine
• And soon in industry
Grid
Minimum intrusion Grid
[Figure: users and resources connected through the Grid]
Processing plants
Like the power grid the computing Grid has many
types of power producers
• High yield power plants (fossil fuel, nuclear,…)
• Supercomputers and large farms
• Low yield producers (windmills, etc)
• Individual PCs and games-consoles
• Very low yield producers (solar panels, etc.)
• Web browsers
One Click
Interactive Applications
VGrids
Best thing since sliced bread
VGrids are Virtual Organizations in MiG
They are a dead easy way to create collaborations
• Share files
• Share resources
• Private entry page
• Public Web-page
Portals
VOs can generate their own private entry
pages including application portals
Files in VGrids
A user must keep her personal home directory,
independent of which VGrid she works in
But VGrids have a common directory
where only members of the VGrid are
allowed
• These are represented as directories in the
user's home directory
VGrid owners can create sub-VGrids
Examples
eScience on Grid
GeneRecon
GeneRecon seeks to identify genetic factors behind hereditary
diseases
The overall idea is to compare two sets of genomes
• One where the disease is observed
• One where the disease is not observed
Approx. 1000 individuals in each set
GeneRecon is developed at the Bioinformatics Research
Center, Århus University
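To make the idea of comparing an affected and an unaffected set concrete, here is a minimal Python sketch that measures, per position, how much the letter frequencies differ between two sets of sequences. It is not GeneRecon's method; it is a deliberately simple stand-in, and the function names and toy sequences are assumptions made for illustration.

```python
from collections import Counter

LETTERS = "ACGT"


def letter_frequencies(sequences, position):
    """Relative frequency of each letter at one position across a set of sequences."""
    counts = Counter(seq[position] for seq in sequences)
    total = len(sequences)
    return {letter: counts[letter] / total for letter in LETTERS}


def site_differences(cases, controls):
    """For every position, the largest frequency gap of any letter between the two sets."""
    differences = []
    for position in range(len(cases[0])):
        f_case = letter_frequencies(cases, position)
        f_ctrl = letter_frequencies(controls, position)
        differences.append(max(abs(f_case[l] - f_ctrl[l]) for l in LETTERS))
    return differences


if __name__ == "__main__":
    # Toy data (hypothetical): four short sequences per set instead of approx. 1000 genomes.
    cases = ["ACGT", "ACGT", "ACTT", "ACTT"]
    controls = ["ACGT", "ACGT", "ACGT", "ACTT"]
    print(site_differences(cases, controls))
```

In this toy data only the third position differs between the sets, so it is the only candidate site a real analysis would go on to test statistically.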
GeneRecon
The algorithm is a Markov-chain Monte Carlo method
A test run consists of approx. 30,000 individual tests
• One test runs from 1 to 10 days on a PC
• In total no less than 82 CPU years
MiG hosted the execution on Grid and got the execution down
below a month
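The "82 CPU years" figure follows from the number of tests and the shortest run time per test. The Python sketch below reproduces that arithmetic and estimates how many CPUs must be busy on average to finish within a given wall-clock time. The 30-day target and the function names are assumptions for illustration, and queueing and scheduling overhead are ignored.

```python
DAYS_PER_YEAR = 365.0


def cpu_years(tests: int, days_per_test: float) -> float:
    """Total sequential compute time in CPU years."""
    return tests * days_per_test / DAYS_PER_YEAR


def average_cpus_needed(tests: int, days_per_test: float, wall_clock_days: float) -> float:
    """Average number of busy CPUs needed to finish within `wall_clock_days`."""
    return tests * days_per_test / wall_clock_days


if __name__ == "__main__":
    tests = 30_000
    # 1 and 10 days per test are the bounds quoted on the slide.
    print(f"lower bound: {cpu_years(tests, 1):.1f} CPU years")
    print(f"upper bound: {cpu_years(tests, 10):.1f} CPU years")
    # 30 days is an assumed target, based on "below a month".
    print(f"CPUs for a 30-day run at 1 day per test: "
          f"{average_cpus_needed(tests, 1, 30):.0f}")
```

With one day per test the total is just over 82 CPU years, which matches the lower bound on the slide.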
Statistics
1315 jobs were submitted to Grid
at the same time
0 jobs were lost
Total time
• First result: 2:04:44
• Last result: 28 days, 5:42:54
[Table: Min, Avg and Max of execution time and queue time. Values as extracted, row and column order not recoverable: 678, 101, 2.08, 0.01, 46, 392, 55, 505]
Groundwater modeling on Funen
Calibration of the Assens
model:
1 model evaluation = 30 min
920 model evaluations = 19
days
[Figure: aggregated objective function (11.0 to 18.0) versus number of model evaluations (0 to 1000)]
Days to hours
[Figure: objective function (1.5 to 5.0) versus time [h] (0 to 100) for AUTOCAL (1 PC) and AUTOCAL OfficeGRID (10 PCs)]
[Figure: AUTOCAL OfficeGRID with one master and four clients]
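The calibration cost quoted for the Assens model can be checked with a few lines of Python. The sketch below computes the serial run time and an idealized run time on several PCs. The idealized figure assumes all evaluations are independent and equally long, which a real calibration algorithm will not fully satisfy, so it should be read as a lower bound. The function names are assumptions for illustration.

```python
import math


def serial_hours(evaluations: int, minutes_each: float) -> float:
    """Run time in hours when evaluations execute one after another on one PC."""
    return evaluations * minutes_each / 60.0


def ideal_parallel_hours(evaluations: int, minutes_each: float, workers: int) -> float:
    """Idealized run time in hours when evaluations are spread evenly over `workers` PCs."""
    rounds = math.ceil(evaluations / workers)  # each worker runs this many evaluations
    return rounds * minutes_each / 60.0


if __name__ == "__main__":
    hours = serial_hours(920, 30)
    print(f"1 PC: {hours:.0f} h = {hours / 24:.1f} days")
    print(f"10 PCs (idealized): {ideal_parallel_hours(920, 30, 10):.0f} h")
```

On one PC this gives 460 hours, or roughly 19 days, in line with the slide.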
Drug Design
Molecular docking is a time-consuming calculation
process, which this project performs in two steps
The first step is a coarse calculation that can eliminate
molecules that won't dock
• This process can run on PCs and PS3s, and a lot of work
is being done towards efficient utilization of the CELL
CPU for molecular docking
The molecules that survive the first step are then
modeled more precisely at the quantum level on classic
supercomputers and clusters
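The two-step structure described above can be sketched as a cheap filter followed by an expensive ranking. The Python sketch below shows only that control flow. The scoring functions, the cutoff and the "lower score is better" convention are hypothetical placeholders, not the project's actual docking or quantum-level calculations.

```python
from typing import Callable, Iterable, List, Tuple


def two_stage_screen(
    molecules: Iterable[str],
    coarse_score: Callable[[str], float],
    fine_score: Callable[[str], float],
    coarse_cutoff: float,
) -> List[Tuple[float, str]]:
    """Filter with a cheap score, then rank the survivors with an expensive one.

    Assumed convention: lower scores are better in both stages.
    """
    # Step 1: the coarse calculation eliminates molecules that will not dock.
    survivors = [m for m in molecules if coarse_score(m) <= coarse_cutoff]
    # Step 2: only the survivors get the expensive, precise evaluation.
    return sorted((fine_score(m), m) for m in survivors)


if __name__ == "__main__":
    # Toy scores (hypothetical): name -> (coarse score, fine score).
    toy = {"mol_a": (0.9, -3.1), "mol_b": (0.2, -7.4), "mol_c": (0.4, -5.0)}
    ranked = two_stage_screen(
        toy,
        coarse_score=lambda m: toy[m][0],
        fine_score=lambda m: toy[m][1],
        coarse_cutoff=0.5,
    )
    print(ranked)
```

The point of the split is that the expensive second stage runs only on the molecules that pass the first, so the coarse stage can be farmed out to many small machines while the supercomputers handle a much shorter list.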
SeGrid
Still a proposal
The idea is to share sensitive data through Grid and use the
Grid technology to manage access control and automatic
anonymization
More information
www.eScience.dk
Portal for KU's eScience activities
www.migrid.org
Portal for the Minimum intrusion Grid
www.rcuk.ac.uk/escience/
The very ambitious UK eScience program