
GRACE at UCL
When one size can't fit all: Scalable HPC for Research Delivery
ISD/RITS/RCPS – Owain Kenway
Grace / Legion / Software Stack / Legion DI
www.ucl.ac.uk/research-it-services
State of Research Computing Services: Legion
• Legion has been UCL's primary local compute resource since 2007.
• Almost none of the original hardware is still in service.
• Gradual upgrade over time.
• Absorbing other services.
• 7-year-old core network technology: 1G Ethernet.
State of Research Computing Services: Legion
Gradual upgrade over time means the service is fragmented:
• 8 different node types!
• Some have InfiniBand, some don't!
• PIs buy the hardware they need.
Parallel vs Serial
In general:
• Iridis 3 → parallel
• Legion → high throughput

Parallel
• Single job spans multiple nodes.
• Tightly coupled parallelisation, usually in MPI.
• Sensitive to network performance.
• Currently primarily chemistry, physics and engineering.

High throughput
• Lots (tens of thousands) of independent jobs on different data.
• High I/O.
• Currently primarily biosciences and physics.
• In the future, digital humanities.
Parallel
Many processes on many processors work simultaneously and communicate with each other.
[Diagram: input data is distributed across many communicating processes, which together produce the output data]
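In miniature, that coupling can be sketched with two shell processes connected by a named pipe: the receiving process cannot make progress until the sending one delivers its message. This is only an illustration of tight coupling, not how MPI is actually invoked; the file and "rank" names are made up.

```shell
# Tightly-coupled sketch: two processes joined by a named pipe. The receiver
# blocks until the sender's message arrives, so the two must run together
# (a loose stand-in for MPI message passing; names are illustrative).
workdir=$(mktemp -d)
mkfifo "$workdir/chan"
( echo 21 > "$workdir/chan" ) &     # "rank 1": computes a value and sends it
received=$(cat "$workdir/chan")     # "rank 0": blocks until the message arrives
wait                                # make sure the sender has exited
result=$((received * 2))
echo "$result"                      # prints 42
rm -r "$workdir"
```

In a real parallel job the same dependency exists between processes on different nodes, which is why these workloads are so sensitive to network latency and bandwidth.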
High Throughput
Many processes operate independently of each other, and in any order.
[Diagram: each independent process takes its own input data and produces its own output data]
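By contrast, a high-throughput workload can be sketched as a handful of shell tasks that each read their own input and write their own output, never talking to one another. The squaring task and file names are purely illustrative.

```shell
# High-throughput sketch: independent tasks over different inputs, free to
# run in any order with no communication between them.
outdir=$(mktemp -d)
for i in 1 2 3 4 5; do
  ( echo $((i * i)) > "$outdir/result_$i" ) &   # each "job" stands alone
done
wait                                            # wait for every task to finish
total=0
for f in "$outdir"/result_*; do                 # combine the outputs afterwards
  total=$((total + $(cat "$f")))
done
echo "$total"                                   # 1+4+9+16+25 = 55
```

Because the tasks are independent, a scheduler is free to run them in any order, on any node, whenever capacity appears.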
Iridis Retirement
In summer 2015, Southampton were due to retire Iridis 3:
• this meant we would lose ~71 TeraFlops of compute capacity,
• and the ability to run large parallel jobs!
We also wanted to retire the original Legion hardware, which was 7 years old:
• losing another 20 TeraFlops.
Luckily, we had £1.5 million to spend!
State of Research Computing Services: Grace
• Grace went “into service” on the 2nd of December 2015: a completely new service for parallel compute.
• All nodes are connected to storage by 40-gigabit InfiniBand.
• InfiniBand is the primary network in the cluster (IP over IB: it looks like a “normal” network).
• Designed with the network capacity to double in size over time.
• To replace UCL's Iridis 3 service and the retired Legion nodes, we required ~90 TeraFlops sustained.
• Grace was benchmarked at ~180 TeraFlops.
[Images: Legion and Grace]
• Legion and Grace have a common software stack:
• Red Hat Enterprise Linux + Son of Grid Engine + environment modules.
• A common set of:
  – compilers (so you can compile your own code),
  – libraries,
  – applications: it's likely the application you use is already available, or we can install it for you.
• Scripted builds of applications (so we can easily install new versions for you).
• xCAT management software (which allows us to manage the cluster).
• It is easy to move between the services (you have the same environment on both machines).
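On a Son of Grid Engine stack, work is submitted as an ordinary shell script carrying `#$` directives. The sketch below uses standard Grid Engine directive syntax, but the specific resource values, module name and program are assumptions for illustration, not the clusters' actual configuration.

```shell
#!/bin/bash -l
# Hypothetical Son of Grid Engine job script. The directive syntax is
# standard Grid Engine; the resource values, module name and program
# below are illustrative assumptions.
#$ -l h_rt=1:00:00      # request one hour of wallclock time
#$ -l mem=2G            # request 2 GB of memory per slot
#$ -pe mpi 24           # request 24 slots in an MPI parallel environment
#$ -cwd                 # run the job from the current working directory

module load my-application   # hypothetical module name
mpirun -n 24 ./my_program    # hypothetical MPI program
```

A script like this would be submitted with `qsub job.sh` and monitored with `qstat`; because the environment is common, the same script style works on both Legion and Grace.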
Wherever possible, the UCL Research Computing Platform Services team's work is open source and on GitHub:
• https://github.com/UCL-RITS/rcps-buildscripts
• https://github.com/UCL-RITS/rcps-modulefiles
• You can deploy it on your own resources or desktop (application licenses permitting).
The Future – Legion “Data Intensive”
Although Legion now does only high-throughput computing, it is not designed for it:
• there are some issues with I/O,
• and we need to retire some old hardware.
So the next major upgrade is redesigning Legion for HTC:
• Replace the old “Nehalem” nodes.
• Replace/upgrade the 1G Ethernet I/O subsystem.
• Local mirroring of common datasets.
• The then-current iteration of the software stack.
Coming ~summer 2017!
None of this would have been possible without colleagues at UCL and at OCF/Lenovo/DDN: Dr Ian Kirker, Heather Kelly, Brian Alston, Georgina Ellis, Arif Ali, Jagjit Reehal, Thomas Jones, Luke Sudbery, William Hay, Jim Roche, Richard Mansfield, Colin Byelong, Prof. Dario Alfe, Dr Javier Herrero, Dr Jörg Saßmannshausen, Mike Atkins, Greg Dyer, and certainly many, many others.
THANKS!
Grace has effectively doubled the capacity for parallel compute available to researchers at UCL.
Visit www.ucl.ac.uk/research-it-services/grace to download these slides after the event.
What did you think? Join the conversation on
Twitter with #GraceAtUCL.
Don’t forget to follow us for access to the event
video and today’s polling results.