
Evolution of High Performance Cluster Architectures
David E. Culler
[email protected]
http://millennium.berkeley.edu/
NPACI 2001 All Hands Meeting
Much has changed since “NOW”
[Photos: NOW0 (HP + Medusa FDDI), NOW1 (SS + ATM/Myrinet), NOW (110 UltraSparc + Myrinet; inktomi.berkeley.edu), and the Millennium Cluster Editions]
The Basic Argument
• performance cost of engineering lag
  – miss the 2x per 18 months (see the sketch after this list)
  – => rapid assembly of leading-edge HW and SW building blocks
  – => availability through fault masking, not inherent reliability
• emergence of the “killer switch”
• opportunities for innovation
  – move data between machines as fast as within a machine
  – protected user-level communication
  – large-scale management
  – fault isolation
  – novel applications
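The cost of that lag can be made concrete with a back-of-the-envelope calculation: if leading-edge performance doubles every 18 months, a system that ships L months behind the edge delivers only 1/2^(L/18) of what is currently possible. A minimal sketch of that arithmetic (the doubling period and the lag values are assumptions for illustration):

```python
# Back-of-the-envelope cost of engineering lag, assuming performance
# doubles every 18 months (the lag values below are illustrative).

def relative_performance(lag_months: float, doubling_months: float = 18.0) -> float:
    """Fraction of leading-edge performance delivered after a given lag."""
    return 1.0 / 2.0 ** (lag_months / doubling_months)

if __name__ == "__main__":
    for lag in (6, 12, 18, 36):
        print(f"{lag:2d}-month lag -> {relative_performance(lag):.0%} of the leading edge")
```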
Clusters Took Off
• scalable internet services
– only way to match growth rate
• changing supercomputer market
• web hosting
Engineering the Building Block
• argument came full circle in ~98
• wide array of 3U, 2U, 1U rack-mounted servers
  – thermals and mechanicals
  – processing per square foot
  – 110 AC routing a mixed blessing
  – component OS & drivers
• became the early entry to the market
Emergence of the Killer Switch
• ATM, Fibre Channel, FDDI “died”
• ServerNet bumps along
• IBM, SGI do the proprietary thing
• little Myrinet just keeps going
  – quite nice at this stage
• SAN standards shootout
  – NGIO + FutureIO => InfiniBand
  – specs entire stack from phy to API
    » nod to IPv6
  – big, complex, deeply integrated, DBC
• Gigabit Ethernet steamroller...
  – limited by TCP/IP stack, NIC, and cost
Opportunities for Innovation
Unexpected Breakthrough: layer-7 switches
• fell out of modern switch design
  – process packets in chunks
• vast # of simultaneous connections
• many line-speed packet filters per port
• can be made redundant
• => multi-gigabit cluster “front end” (sketched below)
  – virtualize the IP address of services
  – move a service within the cluster
  – replicate it, distribute it
[Diagram: a layer-7 switch sits in front of the cluster network, providing high-level transforms, fail-over, and load management; callout for e-Science: any useful app should be a service]
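A minimal sketch of the front-end idea on this slide: the layer-7 switch looks inside the request (here, just a URL prefix), maps the virtualized service address onto whichever replicas currently live in the cluster, and skips replicas that have failed. The service names, addresses, and policy below are illustrative assumptions, not any vendor's API.

```python
import random

# Hypothetical layer-7 routing table: each virtual service is identified by
# a URL prefix and backed by several replicas inside the cluster.
SERVICES = {
    "/search": ["10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"],
    "/mail":   ["10.0.0.21:8080", "10.0.0.22:8080"],
}
ALIVE = {addr for addrs in SERVICES.values() for addr in addrs}

def route(url_path: str) -> str:
    """Pick a live replica for the longest matching service prefix."""
    matches = [p for p in SERVICES if url_path.startswith(p)]
    if not matches:
        raise LookupError(f"no service registered for {url_path}")
    prefix = max(matches, key=len)
    live = [a for a in SERVICES[prefix] if a in ALIVE]
    if not live:
        raise RuntimeError(f"all replicas of {prefix} are down")
    return random.choice(live)      # crude load spreading

# Fail-over: mark one replica dead and requests flow to the survivors,
# while the service keeps its single virtual address.
ALIVE.discard("10.0.0.11:8080")
print(route("/search?q=clusters"))
```

Because the switch filters at line speed on every port and can itself be made redundant, this mapping is what lets a service be moved or replicated inside the cluster without clients noticing.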
Protected User-Level Messaging
• Virtual Interface Architecture (VIA) emerged
  – primitive & complex relative to academic prototypes
  – industrial compromise
  – went dormant
• incorporated in InfiniBand
  – big one to watch
• potential breakthrough
  – user-level TCP, UDP with IP NIC
  – storage over IP
Management
• workstation -> PC transition a step back
  – boot image distribution, OS distribution
  – network troubleshoot and service
• multicast proved a powerful tool (see the sketch below)
• emerging health monitoring and control
  – HW level
  – service level
  – OS level still a problem
Rootstock
[Diagram: a Rootstock server at UC Berkeley feeds local Rootstock servers at each site over the Internet]
Ganglia and REXEC
[Diagram: nodes A-D each run a rexecd daemon on a cluster IP multicast channel; vexecd servers apply placement policies (Policy A, Policy B), e.g. choosing “Nodes A,B” at minimum $; jobs are launched with: %rexec –n 2 –r 3 indexer]
• Also: bWatch
• BPROC: Beowulf Distributed Process Space
• VA Linux Systems: VACM, VA Cluster Manager
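A rough sketch of why multicast is such a powerful management tool, in the spirit of the Ganglia channel above: every node periodically multicasts a small heartbeat, and any listener on the channel sees the health of the whole cluster with no central poller and no per-node configuration. This is a conceptual Python sketch, not Ganglia's actual wire format; the group address, port, and record fields are assumptions.

```python
import json, socket, struct, time

GROUP, PORT = "239.2.11.71", 8649   # illustrative cluster multicast channel

def announce(node_name: str, load: float) -> None:
    """Multicast one heartbeat record to the cluster channel."""
    msg = json.dumps({"node": node_name, "load": load, "ts": time.time()})
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(msg.encode(), (GROUP, PORT))

def listen() -> None:
    """Join the channel and print every node's heartbeats."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    while True:
        data, addr = sock.recvfrom(4096)
        print(addr[0], json.loads(data))

if __name__ == "__main__":
    announce("node-a", load=0.42)   # a real daemon would loop with a sleep
```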
Network Storage
• state-of-practice still NFS + local copies
• local disk replica management lacking
• NFS doesn’t scale
– major source of naive user frustration
• limited structured parallel access
• SAN movement only changing the device interface
• Need cluster content distribution, caching, parallel access and network striping (striping sketched below)
  – see: GPFS, CFS, PVFS, HPSS, GFS, PPFS, CXFS, HAMFS, Petal, NASD...
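One concrete form of the striping called for above, as a minimal sketch: split each file into fixed-size stripe units and spread them round-robin across storage bricks, so one large transfer becomes many smaller transfers served in parallel. This illustrates only the placement idea, not the design of any of the file systems listed; the bricks here are in-memory dicts standing in for servers.

```python
STRIPE_UNIT = 64 * 1024          # bytes per stripe unit (illustrative)

class StripedFile:
    """Round-robin striping of file data across storage bricks."""

    def __init__(self, bricks):
        self.bricks = bricks     # each brick is a dict standing in for a server

    def write(self, name: str, data: bytes) -> None:
        for offset in range(0, len(data), STRIPE_UNIT):
            unit = offset // STRIPE_UNIT
            brick = self.bricks[unit % len(self.bricks)]
            brick[(name, unit)] = data[offset:offset + STRIPE_UNIT]

    def read(self, name: str) -> bytes:
        out, unit = [], 0
        while True:
            chunk = self.bricks[unit % len(self.bricks)].get((name, unit))
            if chunk is None:
                return b"".join(out)
            out.append(chunk)
            unit += 1

fs = StripedFile([{} for _ in range(4)])            # four stand-in bricks
fs.write("results.dat", b"x" * 300_000)
assert fs.read("results.dat") == b"x" * 300_000     # five units, four bricks
```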
Distributed Persistent Data Structure Alternative
[Diagram: instances of a clustered service link a DDS library presenting a distributed hash table API; requests cross a redundant, low-latency, high-throughput system area network to storage “bricks”, each a single-node durable hash table]
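A minimal sketch of what sits behind that hash table API: the DDS library hashes each key to a primary brick and writes a replica to a neighbor, so the service sees one logical table while operations spread across the bricks and survive the loss of any single brick. This illustrates the partitioning and replication idea only; it is not the actual DDS implementation, and the bricks are plain dicts rather than durable single-node stores.

```python
import hashlib

class DistributedHashTable:
    """Keys are hashed to a primary brick; a neighbor brick holds a replica."""

    def __init__(self, num_bricks: int = 6):
        self.bricks = [{} for _ in range(num_bricks)]

    def _home(self, key: str) -> int:
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % len(self.bricks)

    def _replicas(self, key: str):
        home = self._home(key)
        return home, (home + 1) % len(self.bricks)

    def put(self, key: str, value: bytes) -> None:
        for b in self._replicas(key):
            self.bricks[b][key] = value

    def get(self, key: str) -> bytes:
        for b in self._replicas(key):
            if key in self.bricks[b]:
                return self.bricks[b][key]
        raise KeyError(key)

dds = DistributedHashTable()
dds.put("user:42", b"profile bytes")
dds.bricks[dds._home("user:42")].clear()        # lose the primary brick
assert dds.get("user:42") == b"profile bytes"   # the replica still answers
```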
Scalable Throughput
[Plot: max throughput (ops/s) vs. # of DDS bricks on log-log axes; throughput scales with the number of bricks, reaching (128, 61432) for reads and (128, 13582) for writes]
“Performance Available” Storage
[Diagrams: static parallel aggregation binds each aggregator (A) to fixed data nodes (D); adaptive parallel aggregation has aggregators pull work from a distributed queue]
[Plots: % of peak I/O rate vs. nodes perturbed (0-15) for static vs. adaptive parallel aggregation]
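The difference between the two schemes can be sketched in a few lines: with static aggregation each reader owns a fixed share of the blocks, so one perturbed (slow) node sets the finish time for the whole transfer, while adaptive aggregation has readers pull blocks from a shared queue so the unperturbed nodes absorb the slack. The sketch below uses threads and a single in-process queue purely to show the shape of the effect; block counts and delays are made-up numbers.

```python
import queue, threading, time

BLOCKS, NODES, SLOW_NODE = 64, 4, 0

def read_block(node: int) -> None:
    """One block read; the perturbed node is 10x slower than the others."""
    time.sleep(0.02 if node == SLOW_NODE else 0.002)

def static_run() -> float:
    """Static aggregation: each node is pre-assigned an equal share of blocks."""
    start = time.time()
    threads = [threading.Thread(
                   target=lambda n=n: [read_block(n) for _ in range(BLOCKS // NODES)])
               for n in range(NODES)]
    for t in threads: t.start()
    for t in threads: t.join()
    return time.time() - start

def adaptive_run() -> float:
    """Adaptive aggregation: nodes pull blocks from a shared (distributed) queue."""
    work = queue.Queue()
    for _ in range(BLOCKS):
        work.put(1)
    def worker(n: int) -> None:
        while True:
            try:
                work.get_nowait()
            except queue.Empty:
                return
            read_block(n)
    start = time.time()
    threads = [threading.Thread(target=worker, args=(n,)) for n in range(NODES)]
    for t in threads: t.start()
    for t in threads: t.join()
    return time.time() - start

print(f"static aggregation:   {static_run():.2f}s")
print(f"adaptive aggregation: {adaptive_run():.2f}s")
```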
Application Software
• very little movement towards harnessing architectural potential
• application as service (sketched below)
  – process stream of requests (not shell or batch)
  – grow & shrink on demand
  – replication for availability
    » data and functionality
  – tremendous internal bandwidth
• outer-level optimizations, not algorithmic
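A toy sketch of “application as service”: the application sits behind a request queue and a controller grows the worker pool as the backlog builds, instead of running as a one-shot shell or batch job. The handler, thresholds, and scaling rule are assumptions for illustration; shrinking the pool and replicating it across nodes are left out.

```python
import queue, threading

requests = queue.Queue()   # the stream of incoming requests
workers = []               # current service replicas (threads, for the sketch)

def handle(req: str) -> None:
    print("served", req)   # the application's real work would go here

def worker_loop() -> None:
    while True:
        req = requests.get()
        handle(req)
        requests.task_done()

def scale_up() -> None:
    """Grow the pool until it covers the current backlog (never below 2)."""
    while len(workers) < 2 or requests.qsize() > 8 * len(workers):
        t = threading.Thread(target=worker_loop, daemon=True)
        t.start()
        workers.append(t)

scale_up()                          # start with a minimal pool
for i in range(20):
    requests.put(f"request-{i}")
scale_up()                          # the backlog grew, so the pool may too
requests.join()                     # every request in the stream was served
```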
Time is NOW
• finish the system area network
• tackle the cluster I/O problem
• come together around management tools
• get serious about application services