
The Computational Plant
9th ORAP Forum
Paris (CNRS)
Rolf Riesen
Sandia National Laboratories
Scalable Computing Systems Department
March 21, 2000
Tech Report SAND98-2221
Distributed and Parallel Systems
Distributed systems (heterogeneous)
• Gather (unused) resources
• Steal cycles
• System SW manages resources
• System SW adds value
• 10% - 20% overhead is OK
• Resources drive applications
• Time to completion is not critical
• Time-shared

Massively parallel systems (homogeneous)
• Bounded set of resources
• Apps grow to consume all cycles
• Application manages resources
• System SW gets in the way
• 5% overhead is maximum
• Apps drive purchase of equipment
• Real-time constraints
• Space-shared
Massively Parallel Processors
Intel Paragon
• 1,890 compute nodes
• 3,680 i860 processors
• 143/184 GFLOPS
• 175 MB/sec network
• SUNMOS lightweight kernel

Intel TeraFLOPS
• 4,576 compute nodes
• 9,472 Pentium II processors
• 2.38/3.21 TFLOPS
• 400 MB/sec network
• Puma/Cougar lightweight kernel
Cplant Goals
• Production system
• Multiple users
• Scalable (easy to use buzzword)
• Large scale (proof of the above)
• General purpose for scientific applications (not a Beowulf dedicated to a single user)
• 1st step: Tflops look and feel for users
Cplant Strategy
• Hybrid approach combining commodity cluster technology with MPP technology
• Build on the design of the TFLOPS:
– large systems should be built from independent building blocks
– large systems should be partitioned to provide specialized functionality
– large systems should have significant resources dedicated to system maintenance
Why Cplant?
• Modeling and simulation, essential to stockpile stewardship, require significant computing power
• Commercial supercomputers are a dying breed
• Pooling of SMPs is expensive and more complex
• Commodity PC market is closing the performance gap
• Web services and e-commerce are driving high-performance interconnect technology
Cplant Approach
• Emulate the ASCI Red environment
– Partition model (functional decomposition)
– Space sharing (reduce turnaround time)
– Scalable services (allocator, loader, launcher)
– Ephemeral user environment
– Complete resource dedication
• Use Existing Software when possible
– Red Hat distribution, Linux/Alpha
– Software developed for ASCI Red
Conceptual Partition View
[Diagram: partitions for Compute, Service, File I/O (/home), Net I/O, and System Support; users connect to the Service partition, sys admins to System Support.]
User View
[Diagram: a user runs "rlogin alaska"; a load-balancing daemon directs the login to one of the service partition nodes alaska0 through alaska4, which see /home via the File I/O partition and the outside world via the Net I/O partition.]
System Support Hierarchy
[Diagram: sss1 at the top holds the master copy of the system software and provides admin access. Below it, each scalable unit has an sss0 station holding the in-use copy of the system software, and the unit's nodes NFS-mount their root from that sss0.]
Scalable Unit
[Diagram: one scalable unit. Compute nodes and two service nodes hang off two 16-port Myrinet switches; 8 Myrinet LAN cables leave the unit. An sss0 system support station, two 100BaseT hubs, terminal servers (serial consoles), and power controllers provide Ethernet, serial, and power control, and connect to the system support network.]
“Virtual Machines”
[Diagram: sss1 uses rdist to push system software down to the sss0 stations, driven by an SU configuration database. Scalable units are grouped into "virtual machines" such as Production, Alpha, and Beta; within each unit the nodes NFS-mount their root from the in-use copy of the system software on sss0.]
Runtime Environment
• yod - Service node parallel job launcher
• bebopd - Compute node allocator
• PCT - Process control thread, compute node daemon
• pingd - Compute node status tool
• fyod - Independent parallel I/O
Phase I - Prototype (Hawaii)
• 128 Digital PWS 433a (Miata)
• 433 MHz 21164 Alpha CPU
• 2 MB L3 Cache
• 128 MB ECC SDRAM
• 24 Myrinet dual 8-port SAN switches
• 32-bit, 33 MHz LANai-4 NIC
• Two 8-port serial cards per SSS0 for console access
• I/O - Six 9 GB disks
• Compile server - 1 DEC PWS 433a
• Integrated by SNL
Phase II Production (Alaska)
• 400 Digital PWS 500a (Miata)
• 500 MHz Alpha 21164 CPU
• 2 MB L3 Cache, 192 MB RAM
• 16-port Myrinet switch
• 32-bit, 33 MHz LANai-4 NIC
• 6 DEC AS1200, 12 RAID (0.75 TB) parallel file server
• 1 DEC AS4100 compile & user file server
• Integrated by Compaq
• 125.2 GFLOPS on MPLINPACK (350 nodes)
– would place 53rd on June 1999 Top 500
Phase III Production (Siberia)
• 624 Compaq XP1000 (Monet)
• 500 MHz Alpha 21264 CPU
• 4 MB L3 Cache
• 256 MB ECC SDRAM
• 16-port Myrinet switch
• 64-bit, 33 MHz LANai-7 NIC
• 1.73 TB disk I/O
• Integrated by Compaq and Abba Technologies
• 247.6 GFLOPS on MPLINPACK (572 nodes)
– would place 40th on Nov 1999 Top 500
Phase IV (Antarctica?)
• ~1350 DS10 Slates (NM+CA)
• 466 MHz EV6, 256 MB RAM
• Myrinet 33 MHz 64-bit LANai 7.x
• Will be combined with Siberia for a ~1600-node system
• Red, black, green switchable
Myrinet Switch
• Based on 64-port Clos switch
• 8x2 16-port switches in a 12U rack-mount case
• 64 LAN cables to nodes
• 64 SAN cables (96 links) to mesh
[Diagram: the Clos network built from 16-port switches, with 4 nodes attached per 16-port switch.]
One Switch Rack = One Plane
• 4 Clos switches in one rack
• 256 nodes per plane (8 racks)
• Wrap-around in x and y direction
• 128+128 links in z direction
[Diagram: a plane shown in x, y, and z; the 4 nodes per switch are not shown.]
Cplant 2000: “Antarctica”
[Diagram: the Antarctica system; wrap-around links, z links, and nodes not shown. Sections connect to the classified, unclassified, and open networks, and compute nodes swing between red, black, or green.]
Cplant 2000: “Antarctica” cont.
• 1056 + 256 + 256 nodes ≈ 1600 nodes ≈ 1.5 TFLOPS
• 320 "64-port" switches + 144 16-port switches from Siberia
• 40 + 16 system support stations
Portals
• Data movement layer from SUNMOS and PUMA
• Flexible building blocks for supporting many protocols
• Elementary constructs that support MPI semantics well
• Low-level message-passing layer (not a wire protocol)
• API intended for library writers, not application programmers
• Tech report SAND99-2959
Interface Concepts
• One-sided operations
– Put and Get
• Zero copy message passing
– Increased bandwidth
• OS Bypass
– Reduced latency
• Application Bypass
– No polling, no threads
– Reduced processor utilization
– Reduced software complexity
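
To make these concepts concrete, the following C sketch shows the target side of a one-sided Put. The names (alloc_event_queue, attach_match_entry, attach_md, RESULT_PORTAL, the match bits) are hypothetical stand-ins rather than the actual Portals 3.0 calls documented in SAND99-2959; the point is only the flow: expose a memory region once, then keep computing while remote Puts land in it without a receive call, a polling loop, or a progress thread.

    /* Hypothetical Portals-style calls; the real Portals 3.0 API differs. */
    #include <stddef.h>
    #include <stdint.h>

    typedef int handle_t;

    handle_t alloc_event_queue(size_t num_events);
    handle_t attach_match_entry(unsigned portal_index,
                                uint64_t match_bits, uint64_t ignore_bits);
    handle_t attach_md(handle_t match_entry, void *start, size_t length,
                       handle_t event_queue);
    int      poll_event_queue(handle_t eq);  /* non-blocking completion check */

    #define RESULT_PORTAL 7          /* assumed portal index agreed on by peers */

    static double results[1024];     /* region that remote Puts write into */

    void expose_results(void)
    {
        handle_t eq = alloc_event_queue(64);
        handle_t me = attach_match_entry(RESULT_PORTAL,
                                         0x1234 /* match bits */, 0 /* ignore bits */);
        attach_md(me, results, sizeof results, eq);

        /* Application bypass: from here on, incoming Puts are deposited
         * straight into 'results' by the NIC/kernel (zero copy; OS bypass
         * where the hardware allows it).  No receive call, no polling loop,
         * no progress thread; the event queue is checked only when the
         * application cares (e.g. via poll_event_queue(eq)). */
    }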
MPP Network: Paragon and Tflops
[Diagram: the network interface sits directly on the memory bus; two processors share that bus, the second acting as a message-passing or computational coprocessor.]
Commodity: Myrinet
[Diagram: the network is far from the memory. Processor and memory sit on the memory bus; a bridge leads to the PCI bus, where the NIC attaches to the network. An OS-bypass path lets user-level code reach the NIC directly.]
“Must” Requirements
• Common protocols (MPI, system protocols)
• Portability
• Scalability to 1000's of nodes
• High performance
• Multiple process access
• Heterogeneous processes (binaries)
• Runtime independence
• Memory protection
• Reliable message delivery
• Pairwise message ordering
“Will” Requirements
• Operational API
• Zero-copy MPI
• Myrinet
• Sockets implementation
• Unrestricted message size
• OS Bypass, Application Bypass
• Put/Get
“Will” Requirements (cont.)
• Packetized implementations
• Receive uses start and length
• Receiver managed
• Sender managed
• Gateways
• Asynchronous operations
• Threads
“Should” Requirements
• No message alignment restrictions
• Striping over multiple channels
• Socket API
• Implement on ST
• Implement on VIA
• No consistency/coherency
• Ease of use
• Topology information
Portal Addressing
[Diagram: the portal addressing structures. On the portal API side of the operational boundary, a portal table points to match lists; match entries point to memory descriptors and an event queue. The memory descriptors describe memory regions in application space.]
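
The relationships in the diagram can be sketched as C data structures. The type and field names below are illustrative only and differ from the real Portals 3.0 definitions in SAND99-2959; they simply mirror the picture: portal table entries point to match lists, match entries carry memory descriptors, and memory descriptors describe application memory and may reference an event queue.

    /* Hypothetical sketch of the Portals addressing hierarchy; names and
     * fields are illustrative, not the actual Portals 3.0 definitions. */
    #include <stddef.h>
    #include <stdint.h>

    struct event_queue;                        /* records completion events */

    struct memory_descriptor {                 /* describes a region in application space */
        void                     *start;       /* memory region base */
        size_t                    length;
        unsigned                  options;     /* e.g. truncate, no ACK, unlink when used */
        struct event_queue       *eq;          /* optional event queue */
        struct memory_descriptor *next;        /* MDs attached to the same match entry */
    };

    struct match_entry {                       /* one entry in a portal's match list */
        uint32_t                  src_nid, src_pid;   /* who may deposit here */
        uint64_t                  match_bits, ignore_bits;
        struct memory_descriptor *md_list;
        struct match_entry       *next;
    };

    struct portal_table_entry {                /* indexed by the portal number in the header */
        struct match_entry *match_list;
        unsigned long       drop_count;        /* messages that matched nothing */
    };

    struct portal_table {
        struct portal_table_entry entries[64]; /* table size is illustrative */
    };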
Portal Address Translation
[Flowchart: processing an incoming message. Walk the match list: get the next match entry; if it does not match and more entries remain, try the next one; if no entries remain, discard the message and increment the drop count. On a match, check whether the first memory descriptor accepts the message; if not, keep searching. If it accepts, perform the operation, unlink the memory descriptor if requested, unlink the match entry if it is now empty and marked for unlinking, record an event if an event queue is attached, and exit.]
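
Using the hypothetical structures from the sketch above, the flowchart can be written as a short C routine. The helper functions (md_accepts, perform_operation, unlink_md, and so on) are assumed, not taken from any real Portals implementation; the routine only encodes the decision boxes.

    /* Sketch of the translation flow, reusing the hypothetical structures
     * from the addressing sketch above.  The helpers below are assumed. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    bool md_accepts(const struct memory_descriptor *md, size_t msg_len);
    void perform_operation(struct memory_descriptor *md, const void *data, size_t len);
    bool md_should_unlink(const struct memory_descriptor *md);
    bool me_should_unlink(const struct match_entry *me);
    struct memory_descriptor *unlink_md(struct memory_descriptor *md);  /* returns next MD */
    struct match_entry       *unlink_me(struct match_entry *me);        /* returns next ME */
    void record_event(struct event_queue *eq, size_t len);

    /* Deliver one incoming message to the portal table entry it addresses. */
    void deliver(struct portal_table_entry *pte,
                 uint64_t match_bits, const void *data, size_t len)
    {
        for (struct match_entry **mep = &pte->match_list; *mep; mep = &(*mep)->next) {
            struct match_entry *me = *mep;
            struct memory_descriptor *md = me->md_list;

            /* "Match?" and "First MD accepts?": otherwise try the next entry. */
            if (((me->match_bits ^ match_bits) & ~me->ignore_bits) != 0 ||
                md == NULL || !md_accepts(md, len))
                continue;

            struct event_queue *eq = md->eq;          /* save before a possible unlink */
            perform_operation(md, data, len);         /* deposit/fetch the data */

            if (md_should_unlink(md))                 /* "Unlink MD?" */
                me->md_list = unlink_md(md);
            if (me->md_list == NULL && me_should_unlink(me))  /* "Empty & unlink ME?" */
                *mep = unlink_me(me);

            if (eq)                                   /* "Event queue?" */
                record_event(eq, len);
            return;                                   /* exit: message delivered */
        }
        pte->drop_count++;   /* no entry accepted it: discard and count the drop */
    }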
Implementing MPI
• Short message protocol
– Send message (expect receipt)
– Unexpected messages
• Long message protocol
– Post receive
– Send message
– On ACK or Get, release message
• Event includes the memory descriptor
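
A minimal C sketch of the send side of the two protocols follows. Every name in it (expose_md, portals_put, wait_event, MPI_PT_INDEX) is a hypothetical stand-in that paraphrases the bullets above; it is not the MPICH-over-Portals code.

    /* Hypothetical send-side sketch; these are not the MPICH-over-Portals symbols. */
    #include <stddef.h>
    #include <stdint.h>

    typedef int md_handle_t;
    enum { MPI_PT_INDEX = 4 };          /* assumed portal index for MPI traffic */

    md_handle_t expose_md(void *buf, size_t len, uint64_t match_bits);
    void portals_put(md_handle_t md, int dest_rank, int pt_index,
                     uint64_t match_bits, int want_ack);
    void wait_event(md_handle_t md);    /* blocks until a completion event arrives */

    /* Short protocol: put the data eagerly.  A matching pre-posted receive
     * absorbs it; otherwise it lands in an "unexpected message" buffer. */
    void send_short(void *buf, size_t len, int dest, uint64_t tag)
    {
        md_handle_t md = expose_md(buf, len, tag);
        portals_put(md, dest, MPI_PT_INDEX, tag, 0);   /* no ACK required */
        wait_event(md);                                /* local send completion */
    }

    /* Long protocol: expose the buffer, send, and hold the buffer until the
     * receiver either ACKs (a posted receive consumed the data) or issues a
     * Get to pull it after posting the receive.  The completion event carries
     * the memory descriptor, so the sender knows which buffer to release. */
    void send_long(void *buf, size_t len, int dest, uint64_t tag)
    {
        md_handle_t md = expose_md(buf, len, tag);
        portals_put(md, dest, MPI_PT_INDEX, tag, 1);   /* request an ACK */
        wait_event(md);                                /* ACK or Get: buffer may be reused */
    }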
Implementing MPI
[Diagram: the MPI receive match list. Pre-posted receives come first, followed by a "match none" marker entry, then "match any" entries with short, unlink buffers for unexpected messages, and a final "match any" entry with a zero-length, truncating, no-ACK descriptor. Completions are recorded in an event queue.]
Flow Control
• Basic flow control
– Drop messages that receiver is not prepared for
• Long messages might waste network resources
– Good performance for well-behaved MPI apps
• Managing the network – packets
– Packet size big enough to hold a short message
– First packet is an implicit RTS
– Flow control ACK can indicate that message will be dropped
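
The packet-level handshake can be sketched as follows. The names are hypothetical and the sketch only paraphrases the bullets above; the actual RTS/CTS module in the Cplant release is more involved.

    /* Hypothetical sketch of receiver-side packet handling. */
    #include <stdbool.h>
    #include <stddef.h>

    struct packet { unsigned msg_id; size_t msg_len; /* + payload */ };

    bool receive_is_posted(const struct packet *first);   /* does a buffer accept it? */
    void send_flow_ack(unsigned msg_id, bool will_drop);  /* flow-control ACK */
    void accept_packet(const struct packet *p);
    void discard_packet(const struct packet *p);

    /* The first packet of a message doubles as an implicit RTS: it is big
     * enough to hold an entire short message, so short messages need nothing
     * further.  For longer messages, the flow-control ACK tells the sender
     * whether the rest is wanted or whether the message will be dropped (to
     * be fetched later), so an unprepared receiver does not tie up network
     * resources. */
    void on_first_packet(const struct packet *p)
    {
        bool drop = !receive_is_posted(p);

        send_flow_ack(p->msg_id, drop);
        if (drop)
            discard_packet(p);
        else
            accept_packet(p);
    }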
Portals 3.0 Status
• Currently testing Cplant Release 0.5
– Portals 3.0 kernel module using the RTS/CTS module over Myrinet
– Port of MPICH 1.2.0 over Portals 3.0
• TCP/IP reference implementation ready
• Port to LANai begun
http://www.cs.sandia.gov/cplant