Technical Computing Initiative
e-Science and Cyberinfrastructure
Tony Hey
Corporate VP for Technical Computing
Microsoft Corporation
Licklider’s Vision
“Lick had this concept – all of the stuff
linked together throughout the world, that
you can use a remote computer, get data
from a remote computer, or use lots of
computers in your job”
Larry Roberts – Principal Architect of the
ARPANET
What is e-Science?
‘e-Science is about global collaboration
in key areas of science, and the next
generation of infrastructure that will
enable it’
John Taylor
Former Director General of Research Councils
Office of Science and Technology, UK
e-Science
e-Science is about data-driven, multidisciplinary
science and the technologies to support such
distributed, collaborative scientific research
Many areas of science are now being overwhelmed
by a ‘data deluge’ from new high-throughput devices,
sensor networks, satellite surveys …
Areas such as bioinformatics, genomics, drug design,
engineering and healthcare require collaboration
between different domain experts
‘e-Science’ is a shorthand for a set of
technologies to support collaborative networked
science
HPC and Information Management are key
technologies to support this e-Science revolution
The UK e-Science Initiative
DTI and OST Investment of over £250M over 5
years aimed at enabling the next generation of
multi-disciplinary collaborative science and
engineering
Major collaborative industrial program with
participation from the engineering,
pharmaceutical, petrochemical, media and
financial sectors
Similar national e-Science programs now
initiated around the world
e.g. China, South Korea, Australia, Germany,
Spain, Netherlands …
… and now Chile
A New Science Paradigm
A thousand years ago:
Experimental Science
- description of natural phenomena
Last few hundred years:
Theoretical Science
- Newton’s Laws, Maxwell’s Equations …
Last few decades:
Computational Science
- simulation of complex phenomena
Today:
e-Science or Data-centric Science
- unify theory, experiment, and simulation
- using data exploration and data mining
Data captured by instruments
Data generated by simulations
Data generated by sensor networks
Scientist analyzes databases/files
(With thanks to Jim Gray)
$\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$
http://www.neptune.washington.edu/
Undersea Sensor Network
Connected & Controllable Over the Internet
Persistent Distributed Storage
Visual Programming
Distributed Computation
Interoperability & Legacy Support via Web Services
Searching & Visualization
Live Documents
Reputation & Influence
Two examples of e-Science
An Astronomy DataGrid – AstroGrid and the
International Virtual Observatory
Chemistry – The Comb-e-Chem Project
Powering the Virtual Universe
www.astrogrid.ac.uk
Multi-wavelength views showing the jet in M87, from top to bottom:
X-ray, optical, infrared and radio
The Multiwavelength Crab Nebula
Crab star, 1054 AD
X-ray, optical, infrared, and radio views of the nearby
Crab Nebula, which is now in a state of chaotic expansion
after a supernova explosion first sighted in 1054 A.D. by
Chinese astronomers.
Slide courtesy of Robert Brunner @ Caltech.
International Virtual Observatory
Data has no commercial value
No privacy concerns
Can freely share results with others
Great for experimenting with algorithms
Data is real and well documented
High-dimensional data
Spatial data
Temporal data
Data from many different instruments, places and times
Federation is a key goal
There is a lot of data (petabytes)
[Image panels: IRAS 25 µm, 2MASS 2 µm, DSS Optical, IRAS 100 µm, WENSS 92 cm, NVSS 20 cm, GB 6 cm, ROSAT ~keV]
With thanks to Jim Gray
IVO: An Astronomy Data Grid
Working to build world-wide telescope
Built SkyServer.SDSS.org
Built Analysis system
All astronomy data and literature
online and cross indexed
Tools to analyze it
MyDB
CasJobs (batch job)
OpenSkyQuery
Federation of ~20 observatories.
Results:
It works and is used every day
Spatial extensions in SQL 2005
A good example of Data Grid
A good example of Web Services
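As a rough illustration of the kind of spatial query a federated sky archive answers, the sketch below runs a cone search (all objects within a given angular radius of a sky position) over a tiny in-memory catalogue. It is not the SkyServer or OpenSkyQuery implementation; the function names, the catalogue rows and their approximate coordinates are assumptions made for this example.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(catalog, ra, dec, radius_deg):
    """Return the catalogue rows lying within radius_deg of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]

# Illustrative rows only: coordinates are approximate, and the third row is invented.
CATALOG = [
    {"name": "Crab Nebula (M1)", "ra": 83.63, "dec": 22.01},
    {"name": "M87", "ra": 187.71, "dec": 12.39},
    {"name": "hypothetical nearby source", "ra": 83.70, "dec": 22.05},
]

if __name__ == "__main__":
    for row in cone_search(CATALOG, ra=83.63, dec=22.01, radius_deg=0.5):
        print(row["name"])
```

A production archive would push this test into the database with a spatial index rather than scanning every row; the linear scan here only shows the geometry.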
The Comb-e-Chem Project
Diffractometer
Video Data Stream
Automatic Annotation
HPC Simulation
Data Mining and Analysis
Structures Database
Combinatorial Chemistry Wet Lab
National X-Ray Service
Middleware
National Crystallographic Service
Send sample material to NCS service
Collaborate in e-Lab experiment and obtain structure
X-Ray e-Laboratory
Search materials database and predict properties using Grid computations
Structures Database
Download full data on materials of interest
Computation Service
A digital lab book replacement that chemists were able to use, and liked
Monitoring laboratory experiments using a broker delivered over GPRS on a PDA
Crystallographic e-Prints
Direct Access to Raw Data from scientific papers
Raw data sets can be very large - stored at UK National Datastore using SRB software
Cyberinfrastructure
Cyberinfrastructure and e-Infrastructure
In the US, Europe and Asia there is a common
vision for the ‘cyberinfrastructure’ required to
support the e-Science revolution
Set of Middleware Services supported on top of high
bandwidth academic research networks
Software, hardware and organizations that support
e-Science
Similar to vision of the Grid as a set of services
that allows scientists – and industry – to
routinely set up ‘Virtual Organizations’ for their
research – or business
The ‘Microsoft Grid’ vision is as much about
integrating and managing data and information as
about compute cycles
Grids for Virtual Organizations
Protocols:
- Resource Discovery
- Job Scheduling & Management
- Data Transfer
- Audit
[Diagram labels: Code, Fileshare, Computation Server, Federated Trust, Compute Cluster, Credential Srv, Data, Fileshare, SQL DB, Directory, Administrative Domain]
Service-Orientation for building Distributed Systems
[Diagram: services in separate administrative domains exchange messages over the network, across domain boundaries]
Web Services and Interoperability
Company A (J2EE)
Web Services
Company C (.NET)
Open Source (OMII)
Progress in Grid Standards?
The GGF/EGA merger gives great opportunity for
the new Open Grid Forum (OGF) to standardize
a small set of basic Grid services based on
generally accepted Web Services
Harness the power of the world-wide Grid
community to develop robust open source
reference implementations
Grid research community needs to propose and
explore new features in real experiments
OGF can reassure industry about progress in
Grid standards and grow the market for all
Key Data Issues for e-Science
Networks
- Lambda technology
The Data Life Cycle
- From Acquisition to Preservation
Scholarly Communication
- Open Access to Data and Publications
An International e-Infrastructure
[Map labels: Starlight (Chicago), Netherlight (Amsterdam), UKLight; US TeraGrid: SDSC, NCSA, PSC; UK NGS: Leeds, Manchester, Oxford, RAL; UCL; AHM 2004; local laptops and Manchester vncserver]
[Legend: Computation, Steering clients, Network PoP, Service Registry]
All sites connected by production network (not all shown)
The Problem for the e-Scientist
[Diagram: facts flow in from Experiments & Instruments, Other Archives, Literature and Simulations; questions go in and answers come out]
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it?
How to coexist & cooperate with others?
Data Query and Visualization tools
Support/training
Performance
Execute queries in a minute
Batch (big) query scheduling
The e-Science Data Life Cycle
Data Acquisition
Data Ingest
Metadata
Annotation
Provenance
Data Storage
Data Cleansing
Data Mining
Curation
Preservation
Scholarly Communication
Global Movement towards permitting ‘Open
Access’ to scholarly publications
Libraries can no longer afford publisher
subscriptions
Principle that results of publicly funded
research should be available to all
Mandates for Open Access
US Proposal – Cornyn-Lieberman Bill
Supported by most top US research
universities
EU Proposals
UK, French and German initiatives
NSF ‘Atkins’ Report on
Cyberinfrastructure
‘the primary access to the latest findings
in a growing number of fields is through
the Web, then through classic preprints
and conferences, and lastly through
refereed archival papers’
‘archives containing hundreds or
thousands of terabytes of data will be
affordable and necessary for archiving
scientific and engineering information’
Interoperable Repositories?
Paul Ginsparg’s arXiv at Cornell has
demonstrated a new model of scientific publishing
David Lipman of the NIH National Library of
Medicine has developed PubMed Central as a
repository for NIH-funded research papers
Electronic version of ‘preprints’ hosted on the Web
Microsoft funded development of ‘portable PMC’ now
being deployed in UK and other countries
Stevan Harnad’s ‘self-archiving’ EPrints project in
Southampton provides a basis for OAI-compliant
‘Institutional Repositories’
JISC-funded TARDis Project at Southampton is a hybrid
of full-text open access and links to publisher sites
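To make "OAI-compliant" a little more concrete, the sketch below builds request URLs for OAI-PMH, the harvesting protocol that such repositories expose. The six verb names and the argument names (metadataPrefix, from, until, set) come from the OAI-PMH specification; the base URL and the helper name oai_request_url are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}

def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET URL; a trailing underscore lets callers pass 'from_'."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for key, value in arguments.items():
        if value is not None:
            # 'from' is a Python keyword, so it is passed as 'from_'.
            params[key.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"

if __name__ == "__main__":
    # Hypothetical repository endpoint, used purely for illustration.
    base = "https://repository.example.org/oai"
    print(oai_request_url(base, "Identify"))
    print(oai_request_url(base, "ListRecords",
                          metadataPrefix="oai_dc", from_="2006-01-01"))
```

Because every compliant repository answers the same small set of requests, a harvester written once against this interface can in principle collect metadata from arXiv-style archives and institutional repositories alike.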
The Service Revolution
Web 2.0
Social networks, tagging for sharing,
e.g. Flickr, Del.icio.us, MySpace, CiteULike,
Connotea …
Wikis, Blogs, RSS, folksonomies …
Software delivered as a service
Microsoft Live services
Office Live
Xbox Live
Windows Live Academic
Mashups
Craigslist + Google Maps
http://mashupcamp.com
e-Science Mashups?
Combine services to give added value
Technical Computing at Microsoft
Advanced Computing for Science and Engineering
- Application of new algorithms, tools and
technologies to scientific and engineering problems
High Performance Computing
- Application of high performance clusters and
database technologies to industrial and
scientific applications
Radical Computing
- Research in potential breakthrough technologies
Fighting HIV with Computer Science
Nebojsa Jojic and David Heckerman
A major problem: Over 40 million infected
Vaccine needed for third world countries
Drug treatments are effective but are an
expensive life commitment
Effective vaccine could eradicate disease
Methods from computer science are
helping with the design of vaccine
Machine learning: Finding biological
patterns that may stimulate the immune
system to fight the HIV virus
Optimization methods: Compressing these
patterns into a small, effective vaccine
Developed Set of Specialist Tools
Chromatogram deconvolution
Pathway analysis/association/causal models
Clustering/Trees (phylo, haplotypes etc.)
Protein binding and folding
Sequence diversity models (epitomes)
Image analysis/classification
Evolution modeling and inference
Epitope prediction
Kyril Faenov, Director of HPC
Microsoft Corporation
http://www.microsoft.com/hpc
Top 500 Architectures / Systems
[Chart: number of Top 500 systems (0 to 500) by architecture, 1993 to 2006; series: SIMD, Single Proc., SMP, Const., Cluster, MPP]
HPC: Market Trends
Segment                  Price band     2004 Systems
Capability, Enterprise   $1M+                  1,167
Divisional               $250K-$1M             3,915
Departmental             $50K-$250K           22,712
Workgroup                <$50K               127,802
Systems under $250K: 97% of systems, 52% of revenue
In 2004, clusters grew 96%, reaching 37% of the market by revenue
Average cluster size: 10-16 nodes
Source: IDC, 2005
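The 97% figure can be checked directly from the four system counts on this slide. The short sketch below does the arithmetic; the dictionary keys and variable names are just labels chosen for this example, and the revenue share cannot be checked the same way because the slide gives no revenue per segment.

```python
# 2004 system counts per segment, as given on the slide (source: IDC, 2005).
systems_2004 = {
    "Capability, Enterprise ($1M+)": 1_167,
    "Divisional ($250K-$1M)": 3_915,
    "Departmental ($50K-$250K)": 22_712,
    "Workgroup (<$50K)": 127_802,
}

total_systems = sum(systems_2004.values())

# Systems priced under $250K are the Departmental and Workgroup segments.
under_250k = (systems_2004["Departmental ($50K-$250K)"]
              + systems_2004["Workgroup (<$50K)"])

share_under_250k = under_250k / total_systems

if __name__ == "__main__":
    print(f"total systems: {total_systems:,}")
    print(f"under $250K:   {under_250k:,} ({share_under_250k:.1%})")
```

The two smallest segments account for 150,514 of 155,596 systems, which rounds to the 97% quoted.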
Windows Compute Cluster Server 2003
Faster time-to-insight through simplified cluster
deployment, job submission and status monitoring
Better integration with existing Windows infrastructure
allowing customers to leverage existing technology and
skill-sets
Familiar development environment allows developers to
write parallel applications from within the powerful
Visual Studio IDE
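Power Density (W/cm2)
[Chart: power density (W/cm2, log scale from 1 to 10,000) of Intel processors by year, ’70 to ’10; processors plotted: 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®; reference levels: Hot Plate, Nuclear Reactor, Rocket Nozzle, Sun’s Surface]
Source: Intel Developer Forum, Spring 2004 - Pat Gelsinger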
Radical Computing
Future of silicon chips
Challenge for IT industry and Computer
Science community
“100’s of cores on a chip in 2015”
(Justin Rattner, Intel)
Can we make parallel computing on a chip
easier than message-passing?
Challenge for the Scientific Community
How will the Multi-Core transition affect
scientific computing?
Even more Radical Computing?
Quantum Computing
Multidisciplinary Institute at UCSB
Director is Michael Freedman of MSR
Looking at novel material physics and the 2-dimensional Quantum Hall effect
Exploring non-Abelian ‘Anyon’ excitations to
protect coherence of qubits and quantum
gates
See Scientific American April 2006 for
description of ‘Project Q’
Multiplication versus Factoring
3490529510847650949147849619903898133417764638493387843990820577
X
32769132993266709549961988190834461413177642967992942539798288533
=
114381625757888867669235779976146612010218296721242362562561842935706935245733897830597123563958705058989075147599290026879543541
Prime Factors of the 129-digit number known as RSA-129
Multiplication is classically computationally ‘easy’ but
factorization is computationally ‘hard’
Peter Shor developed new factorization
algorithm for a quantum computer
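The "easy" direction is simple to demonstrate: multiplying the two primes back together is a single arbitrary-precision operation. The sketch below uses the published RSA-129 modulus and factors and relies only on Python's built-in big integers; the constant and function names are chosen for this example. It says nothing about how hard the reverse direction is, which is the point of the slide.

```python
# Published RSA-129 prime factors (64 and 65 decimal digits).
P = 3490529510847650949147849619903898133417764638493387843990820577
Q = 32769132993266709549961988190834461413177642967992942539798288533

# Published RSA-129 modulus (129 decimal digits).
N = int(
    "11438162575788886766"
    "92357799761466120102"
    "18296721242362562561"
    "84293570693524573389"
    "78305971235639587050"
    "58989075147599290026"
    "879543541"
)

def check_rsa129():
    """Return True if multiplying the two factors reproduces the modulus."""
    return P * Q == N

if __name__ == "__main__":
    print("digits:", len(str(P)), len(str(Q)), len(str(N)))
    print("P * Q == N:", check_rsa129())
```

Going the other way, from N alone to P and Q, has no known efficient classical algorithm, which is why Shor's quantum algorithm attracted so much attention.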
Summary
Microsoft wishes to work with the university
research and library communities to:
• develop interoperable high-level services,
workflows, tools and data services
• accelerate progress in a small number of societally
important scientific applications
• assist in the development of interoperable
repositories and new models of scholarly publishing
• explore radical new directions in computing and
ways and applications to exploit on-chip parallelism
How can Microsoft best collaborate with the
scientific community?
© 2005 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.