Transcript: PPT - RealityGrid

The TeraGyroid Project - Aims and Achievements
Richard Blake
Computational Science and Engineering Department
CCLRC Daresbury Laboratory
This ambitious project was the result of an international collaboration linking the USA's TeraGrid and the UK's e-Science Grid, jointly funded by NSF and EPSRC. Transatlantic optical bandwidth is supported by British Telecommunications.
Royal Society - June 2004
Overview
• Project Objectives
• The TeraGyroid scientific experiment
• Testbed and Partners
• Applications Porting and RealityGrid Environment
• Grid Software Infrastructure
• Visualisation
• Networking
• What was done
• Project Objectives - How well did we do?
• Lessons Learned
UK-TeraGrid HPC Project Objectives
Joint experiment combining high-end computational facilities in the UK e-Science Grid (HPCx and CSAR) and the TeraGrid sites:
– world-class computational science experiment
– enhanced expertise/experience to benefit the UK and USA
– inform construction/operation of national/international grids
– stimulate long-term strategic technical collaboration
– support long-term scientific collaborations
– experiments with clear scientific deliverables
– choice of applications to be based on community codes
– inform future programme of complementary experiments
The TeraGyroid Scientific Experiment
High-density isosurface of the late-time configuration in a ternary amphiphilic fluid as simulated on a 64³ lattice by LB3D. Gyroid ordering coexists with defect-rich, sponge-like regions.
The dynamical behaviour of such defect-rich systems can only be studied with very large-scale simulations, in conjunction with high-performance visualisation and computational steering.
The RealityGrid project
Mission: “Using Grid technology to closely couple high performance computing, high throughput experiment and visualization, RealityGrid will move the bottleneck out of the hardware and back into the human mind.”
• to predict the realistic behaviour of matter using diverse simulation methods
• LB3D - a highly scalable grid-based code to model the dynamics and hydrodynamics of complex multiphase fluids
• mesoscale simulation enables access to larger physical scales and longer timescales
• the RealityGrid environment enables multiple steered and spawned simulations, with the visualised output streamed to a distributed set of collaborators located at Access Grid nodes across the USA and UK
Testbed and Project Partners
RealityGrid partners:
– University College London (Application, Visualisation, Networking)
– University of Manchester (Application, Visualisation, Networking)
– Edinburgh Parallel Computing Centre (Application)
– Tufts University (Application)
TeraGrid sites at:
– Argonne National Laboratory (Visualization, Networking)
– National Center for Supercomputing Applications (Compute)
– Pittsburgh Supercomputing Center (Compute, Visualisation)
– San Diego Supercomputer Center (Compute)
UK High-End Computing Services:
– HPCx, run by the University of Edinburgh and CCLRC Daresbury Laboratory (Compute, Networking, Coordination)
– CSAR, run by the University of Manchester and CSC (Compute and Visualisation)
Computer Servers
The TeraGyroid project has access to a substantial fraction of the world's
largest supercomputing resources, including the whole of the UK's
supercomputing facilities and the USA's TeraGrid machines. The largest
simulations are in excess of one billion lattice sites.
Site                                                      System                  Procs   TF (peak)   Memory (TB)
HPCx (Daresbury Laboratory)                               IBM Power4 Regatta       1024   6.6         1.024
Computer Services for Academic Research (CSAR)            SGI Origin 3800           512   0.8         0.512 (shared)
Pittsburgh Supercomputing Centre (PSC)                    HP-Compaq Alpha EV68     3000   6           3.0
National Centre for Supercomputing Applications (NCSA)    Itanium 2                 256   1.3         0.512
                                                          Itanium 2                 256   1.3         1.536
San Diego Supercomputing Centre (SDSC)                    Itanium 2                 256   1.3         0.512
~7 TB memory, ~5K processors in an integrated resource
Networking
[Map: UK e-Science Grid sites (Edinburgh, Glasgow, Belfast, Newcastle, DL, Manchester, Cambridge, Oxford, RAL, Cardiff, London, Southampton) linked to the TeraGrid and ANL via Netherlight (Amsterdam), with transatlantic bandwidth provisioned by BT.]
Applications Porting
• LB3D is written in Fortran90
• Of order 128 variables per grid point, so 1 Gpoint ≈ 1 TB (see the sketch after this list)
• Various compiler issues to be overcome at different sites
• Site configuration issues important, e.g. I/O access to high-speed global file systems for checkpoint files
• Connectivity of high-speed file systems to the network
• Multi-homing required on several systems to separate the control network from the data network
• Port forwarding required for compute nodes on private networks
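
A quick back-of-envelope check of the memory figure above (the 128-variables-per-point count is from the slide; the 8-byte double-precision assumption is mine), sketched in Python:

```python
# Memory footprint of an LB3D-style lattice with ~128 variables per site.
# Assumes 8-byte double precision (an assumption, not stated on the slide).
BYTES_PER_SITE = 128 * 8            # = 1024 bytes = 1 KiB per lattice site

for n in (64, 128, 256, 512, 1024):
    sites = n ** 3
    tib = sites * BYTES_PER_SITE / 2 ** 40
    print(f"{n}^3 lattice: {sites:>13,} sites, {tib:8.3f} TiB")
# 1024^3 is ~1 Gpoint and comes out at 1 TiB, matching the rule above.
```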
Exploring Parameter Space through Computational Steering
[Figure: steering trajectory. Initial condition: random water/surfactant mixture; self-assembly starts. The run is rewound and restarted from a checkpoint, branching into a lamellar phase (surfactant bilayers between water layers) and cubic micellar phases at high and low surfactant density gradients.]
A schematic of this checkpoint-and-rewind loop is sketched below.
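
The workflow in the figure, reduced to a toy Python sketch. This is a generic illustration of checkpoint/rewind steering, not the RealityGrid steering API; the real experiment used the RealityGrid steering library with LB3D, and all names here are hypothetical:

```python
import copy

class ToySim:
    """Stand-in for an LB3D run: state is a step counter plus parameters."""
    def __init__(self):
        self.step = 0
        self.params = {"surf_density_gradient": "high"}
    def advance(self, n):
        self.step += n                     # a real code would evolve the lattice
    def checkpoint(self):
        return copy.deepcopy((self.step, self.params))
    def restore(self, chk):
        self.step, self.params = copy.deepcopy(chk)

sim = ToySim()
chks = []
for block in range(3):                     # run in steerable blocks
    chks.append(sim.checkpoint())
    sim.advance(1000)

# A scientist watching the visualisation rewinds to an earlier checkpoint,
# changes a fluid parameter and resumes down a different trajectory:
sim.restore(chks[1])
sim.params["surf_density_gradient"] = "low"
sim.advance(2000)
print(sim.step, sim.params)                # 3000 {'surf_density_gradient': 'low'}
```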
RealityGrid - Environment
• Computations run at HPCx, CSAR, SDSC, PSC and NCSA
• Visualisation run at Manchester, UCL, Argonne, NCSA and Phoenix
• Scientists steering calculations from UCL and Boston over the Access Grid
• Visualisation output and collaborations multicast to Phoenix and visualised on the show floor in the University of Manchester booth
Visualisation servers
• Amphiphilic fluids produce exotic mesophases with a range of complex morphologies - need visualisation
• The complexity of these data sets (128 variables) makes visualisation a challenge
• Using the VTK library, with patches refreshed each time new data becomes available (a minimal pipeline of this kind is sketched below)
• Video stream multicast to the Access Grid using the FLXmitter library
• SGI OpenGL Vizserver used to allow remote control of visualisation
• Visualisation of billion-node models requires 64-bit hardware and multiple rendering units
• Achieved visualisation of the 1024³ lattice using a ray-tracing algorithm developed at the University of Utah, on a 100-processor Altix on the show floor at SC'03
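
For concreteness, a minimal sketch of the kind of VTK isosurfacing pipeline mentioned above, in Python. The toy scalar field and all names are mine, not the project's actual code; it assumes a modern VTK build with Python bindings (SetInputData, VTK 6+):

```python
import numpy as np
import vtk
from vtk.util import numpy_support

n = 64                                        # toy 64^3 lattice
x, y, z = np.mgrid[0:n, 0:n, 0:n]
field = np.sin(x / 8.0) * np.cos(y / 8.0) * np.sin(z / 8.0)   # stand-in density

image = vtk.vtkImageData()                    # a regular lattice maps to vtkImageData
image.SetDimensions(n, n, n)
arr = numpy_support.numpy_to_vtk(field.ravel(order="F"), deep=True)
arr.SetName("density")
image.GetPointData().SetScalars(arr)

contour = vtk.vtkContourFilter()              # marching-cubes style isosurface
contour.SetInputData(image)
contour.SetValue(0, 0.5)                      # iso-level; steerable in practice
contour.Update()                              # re-run whenever new data arrives

print("isosurface cells:", contour.GetOutput().GetNumberOfCells())
```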
Grid Software Infrastructure
• Various versions of the Globus Toolkit: 2.2.3, 2.2.4, 2.4.3 and 3.1 (including GT2 compatibility bundles)
• Used GRAM, GridFTP and Globus-I/O - no incompatibilities (a typical GridFTP checkpoint-staging call is sketched below)
• Did not use MDS - concerns over the robustness/utility of its data
• 64-bit version of GT2 required for the AIX (HPCx) system - some grief due to its tendency to require custom-patched versions of third-party libraries
• A lot of system-management effort required to work with/around the toolkit
• Need a more scalable CA system that avoids every system administrator having to study everyone else's certificates
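
A hedged sketch of the kind of GridFTP transfer used for staging checkpoint files between sites. The host names and paths are hypothetical; it assumes globus-url-copy (Globus Toolkit 2.x) is on PATH and a valid grid proxy exists:

```python
import subprocess

# Hypothetical endpoints: stage an LB3D checkpoint from one site to another
# as a third-party transfer between two GridFTP servers.
src = "gsiftp://hpcx.example.ac.uk/scratch/lb3d/checkpoint_0100.dat"
dst = "gsiftp://lemieux.example.edu/scratch/lb3d/checkpoint_0100.dat"

subprocess.run(
    ["globus-url-copy",
     "-vb",           # print transfer progress
     "-p", "4",       # 4 parallel TCP streams to help fill a long fat pipe
     src, dst],
    check=True,       # raise if the transfer fails
)
```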
TeraGyroid Network
[Diagram: computation and visualisation sites (ANL, PSC, NCSA, SDSC, Caltech, Manchester, Daresbury, UCL, Phoenix) linked via Starlight (Chicago) and Netherlight (Amsterdam) at 10 Gbps, with BT-provisioned 2 x 1 Gbps links, the SJ4 production network and MB-NG. Legend: visualization, computation, Access Grid node, network PoP, service registry, dual-homed system.]
Networking
[Diagram: traffic classes (near-real-time TCP, non-real-time TCP, real-time UDP) carrying steering data, checkpoint files and visualisation data between simulation engines SimEng1 (PSC) and SimEng2 (UK), visualisation engine VizEng2 (Phoenix), and storage Disk1 (UK).]
Networking
• On-line visualisation requires O(1 Gbps) bandwidth for larger problem sizes
• Steering requires 100% reliable, near-real-time data transport across the Grid to visualisation engines
• Reliable transfer is achieved using TCP/IP: every transferred packet must be acknowledged (to detect and repair loss). This slows transport, limits data-transfer rates and so limits LB3D steering of larger systems (see the sketch after this list)
• Point-to-n-point transport for visualisation, storage and job migration uses n times more bandwidth, since unicast is used
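
Why acknowledgement-bound TCP caps steering throughput: a sender can have at most one window of unacknowledged data in flight, so throughput is bounded by window/RTT. A sketch of the arithmetic with illustrative numbers (the RTT and window values are assumptions, not measurements from the experiment):

```python
rtt = 0.120                      # assumed transatlantic round-trip time: 120 ms
window = 64 * 1024               # classic default TCP window: 64 KB

rate = window * 8 / rtt          # max throughput = window / RTT
print(f"64 KB window -> {rate / 1e6:.1f} Mbit/s")          # ~4.4 Mbit/s

target = 1e9                     # the O(1 Gbps) needed for on-line visualisation
needed = target * rtt / 8        # bytes that must be in flight
print(f"1 Gbit/s needs ~{needed / 1e6:.0f} MB in flight")  # ~15 MB window
```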
What Was Done?
The TeraGyroid experiment represents the first use of collaborative, steerable, spawned and migrated processes based on capability computing.
– generated 2 TB of data
– exploration of the multi-dimensional fluid coupling parameter space with 64³ simulations, accelerated through steering
– study of finite-size periodic boundary condition effects, exploring the stability of the density of defects in the 64³ simulations as they are scaled up to 128³, 256³, 512³ and 1024³
– 100K to 1,000K time steps
– exploring the stability of the crystalline phases to perturbations and variations in effective surfactant temperature
• 128³ and 256³ simulations - clear of finite-size effects
• Perfect crystal not formed in 128³ systems after 600K steps
• Statistics of defect numbers, velocities and lifetimes require large systems, as only these have sufficient defects
World’s Largest Lattice Boltzmann Simulation?
• 1024³ lattice sites
• scaled up from 128³ simulations using periodic tiling, with perturbations in the initial state
• finite-size-effect-free dynamics
• 2048 processors
• 1.5 TB of memory
• 1 minute per time step on 2048 processors
• 3000 time steps
• 1.2 TB of visualisation data
Run on LeMieux at Pittsburgh Supercomputing Center (a back-of-envelope check of these figures follows)
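
These figures are mutually consistent; a quick Python check (the 128-variable, 8-byte-per-variable footprint is carried over from the porting slide and is partly my assumption):

```python
sites = 1024 ** 3                         # lattice sites
state = sites * 128 * 8 / 2 ** 40         # ~128 double-precision variables per site
print(f"field data: {state:.1f} TiB")     # 1.0 TiB; working copies push this
                                          # towards the 1.5 TB quoted above
print(f"per processor: {1.5e12 / 2048 / 1e9:.2f} GB")    # ~0.73 GB/proc
print(f"wallclock: {3000 * 1 / 60:.0f} h")               # 3000 steps at 1 min/step
```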
Access Grid Screen at SC'03 during the SC Global Session on Application Steering
[Screenshot]
Measured Transatlantic Bandwidths during SC’03
Demonstrations / Presentations
Demonstrations of the TeraGyroid experiment at SC'03:
TeraGyroid on the PSC Booth
Tue 18, 10:00-11:00
Thu 20, 10:00-11:00
RealityGrid and TeraGyroid on UK e-Science Booth
Tue 18, 16:00-16:30
Wed 19, 15:30-16:00
RealityGrid during the SC'03 poster session:
Tue 18, 17:00-19:00
HPC-Challenge presentations:
Wed 19 10:30-12:00
SC Global session on steering:
Thu 20, 10:30-12:00
Demonstrations and real-time output at the
University of Manchester and HPCx booths.
Most Innovative Data-Intensive Application - SC'03
Project Objectives - How Well Did We Do? - 1
• world-class computational science experiment
– science analysis is ongoing, leading to new insights into the properties of complex fluids at unprecedented scales
– SC'03 award: 'Most Innovative Data-Intensive Application'
• enhanced expertise/experience to benefit the UK and USA
– first transatlantic federation of major HEC facilities
– applications need to be adaptable to different architectures
• inform construction/operation of national/international grids
– most insight gained into end-to-end network integration, performance and dual-homed systems
– remote visualisation, steering and checkpointing require high bandwidth that is dedicated and reservable
– results fed directly into the ESLEA proposal to exploit the UKLight optical switched network infrastructure
• stimulate long-term strategic technical collaboration
– strengthened relationships between Globus, networking and visualisation groups
Project Objectives - How Well Did We Do? - 2
• support long-term scientific collaborations
– built on strong and fruitful existing scientific collaborations between researchers in the UK and USA
• experiments with clear scientific deliverables
– an explicit science plan was published, approved and then executed; data analysis is ongoing
• choice of applications to be based on community codes
– experiences will benefit other grid-based applications, in particular in the computational engineering community
• inform future programme of complementary experiments
– report to be made available on the RealityGrid website
– EPSRC initiating another call for proposals - not targeting SC'04
Lessons Learned
• How to support such projects - full peer review?
• Timescales were very tight - September to November
• Resource estimates need to be flexible
• Need complementary experiments for the US and UK to reciprocate benefits
• HPC centres, e-Science and networking groups can work very effectively together on challenging common goals
• Site configuration issues very important - network access
• Visualisation capabilities in the UK need upgrading
• Scalable CA, dual-homed systems
• Network QoS very important for checkpointing, remote steering and visualisation
• Do it again?