CS160 – Lecture 3
Clusters. Introduction to PVM and MPI
Introduction to PC Clusters
• What are PC Clusters?
• How are they put together?
• Examining the lowest-level messaging pipeline
• Relative application performance
• Starting with PVM and MPI
Clusters, Beowulfs, and more
• How do you put a “Pile-of-PCs” into a room and make them do real work?
– Interconnection technologies
– Programming them
– Monitoring
– Starting and running applications
– Running at scale
Beowulf Cluster
• Current working definition: a collection of commodity PCs running an open-source operating system with a commodity interconnection network
– Dual Intel PIIIs with fast ethernet, Linux
– Single Alpha PCs running Linux
• Program with PVM, MPI, …
Beowulf Clusters cont’d
• Interconnection network is usually fast ethernet running TCP/IP
– (Relatively) slow network
– Programming model is message passing
• Most people now associate the name “Beowulf” with any cluster of PCs
– Beowulfs are differentiated from high-performance clusters by the network
• www.beowulf.org has lots of information
High-Performance Clusters
Gigabit networks: Myrinet, SCI, FC-AL, Giganet, GigE, ATM
• Killer micros: low-cost Gigaflop processors here for a few kilo-$$’s/processor
• Killer networks: Gigabit network hardware, high-performance software (e.g. Fast Messages), soon at 100’s of $$’s/connection
• Leverage HW, commodity SW (*nix/Windows NT), build key technologies
=> high-performance computing in a RICH software environment
Cluster Research Groups
• Many other cluster groups that have had impact
– Active Messages/Network of Workstations (NOW), UCB
– Basic Interface for Parallelism (BIP), Univ. of Lyon
– Fast Messages (FM)/High Performance Virtual Machines (HPVM), UIUC/UCSD
– Real World Computing Partnership (Japan)
– Scalable High-performance Really Inexpensive Multi-Processor (SHRIMP), Princeton
Clusters are Different

• A pile of PC’s is not a large-scale SMP server.
– Why? Performance and programming model
• A cluster’s closest cousin is an MPP
– What’s the major difference? Clusters run N copies of the OS; MPPs usually run one.
Ideal Model: HPVM’s
[Diagram: the application program sits on a uniform “Virtual Machine Interface” that hides the actual system configuration]
• HPVM = High Performance Virtual Machine
• Provides a simple uniform programming model,
abstracts and encapsulates underlying resource
complexity
• Simplifies use of complex resources
Virtualization of Machines
• Want the illusion that a collection of machines is a single machine
– Start, stop, monitor distributed programs
– Programming and debugging should work seamlessly
– PVM (Parallel Virtual Machine) was the first widely adopted virtualization for parallel computing
• This illusion is only partially complete in any software system. Some issues:
– Node heterogeneity
– Real network topology can lead to contention
• Unrelated – What is a Java Virtual Machine?
High-Performance Communication
[Diagram: switched 100 Mbit networks with OS-mediated access vs. switched multi-gigabit networks with user-level access]
• Level of network interface support + NIC/network router latency
– Overhead and latency of communication ⇒ deliverable bandwidth
• High-performance communication ⇒ programmability!
– Low-latency, low-overhead, high-bandwidth cluster communication
– … much more is needed …
• Usability issues, I/O, reliability, availability
• Remote process debugging/monitoring
Putting a cluster together
• (16, 32, 64, … X) individual nodes
– E.g. dual-processor Pentium III/733, 1 GB memory, ethernet
• Scalable High-speed network
– Myrinet, Giganet, Servernet, Gigabit Ethernet
• Message-passing libraries
– TCP, MPI, PVM, VIA
• Multiprocessor job launch
– Portable Batch System
– Load Sharing Facility
– PVM spawn, mpirun, rsh
• Techniques for system management
– VA Linux Cluster Manager (VACM)
– High Performance Technologies Inc (HPTI)
Communication style is message passing
[Diagram: Machine A sends a packetized message (fragments 1–4) across the network; Machine B reassembles the fragments as they arrive]
• How do we efficiently get a message from Machine A to Machine B?
• How do we efficiently break a large message into packets and reassemble at the receiver? (See the sketch after this list.)
• How does a receiver differentiate among message fragments (packets) from different senders?
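A purely illustrative sketch of the fragmentation/reassembly and demultiplexing questions above (not any particular library's packet format): each fragment carries the sender's ID, a per-sender message sequence number, and the fragment's byte offset, which is enough for the receiver to keep one reassembly buffer per (sender, message) pair. The network hop is simulated by a direct function call.

#include <stdio.h>
#include <string.h>

/* Illustrative fragmentation/reassembly, simulated in one process.
 * Each fragment header names the sender, the message, and where the
 * fragment's bytes land, which is enough for the receiver to
 * demultiplex fragments arriving from different senders. */

#define MTU 8                      /* tiny payload per fragment, for demo */

struct fragment {
    int sender_id;
    int msg_seq;                   /* which message from this sender      */
    int offset;                    /* byte offset of this piece           */
    int len;
    int total_len;                 /* full message size, for completion   */
    char data[MTU];
};

static char reassembly_buf[256];   /* one (sender, message) buffer shown  */
static int  bytes_seen;

/* Receiver side: copy the piece into place, report when complete. */
static void receive_fragment(const struct fragment *f) {
    memcpy(reassembly_buf + f->offset, f->data, f->len);
    bytes_seen += f->len;
    if (bytes_seen == f->total_len)
        printf("sender %d, msg %d reassembled: %s\n",
               f->sender_id, f->msg_seq, reassembly_buf);
}

/* Sender side: chop a message into MTU-sized fragments. */
static void send_message(int sender, int seq, const char *msg) {
    int total = (int)strlen(msg) + 1;
    for (int off = 0; off < total; off += MTU) {
        struct fragment f = { sender, seq, off, 0, total, {0} };
        f.len = (total - off < MTU) ? (total - off) : MTU;
        memcpy(f.data, msg + off, f.len);
        receive_fragment(&f);      /* stands in for the network hop       */
    }
}

int main(void) {
    send_message(/*sender=*/1, /*seq=*/0, "a message split into packets");
    return 0;
}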
Will use the details of FM to illustrate some communication engineering
FM on Commodity PC’s
[Diagram: FM Host Library and FM Device Driver run on the Pentium II/III host (~450 MIPS); FM NIC Firmware runs on the NIC (~33 MIPS); host and NIC are connected via the P6 bus and PCI, with a 1280 Mbps network link]
• Host Library: API presentation, flow control, segmentation/reassembly, multithreading
• Device driver: protection, memory mapping, scheduling monitors
• NIC Firmware: link management, incoming buffer management, routing, multiplexing/demultiplexing
Fast Messages 2.x Performance
[Plot: FM 2.x bandwidth (MB/s) vs. message size, 4 bytes to 65,536 bytes, annotated with the 100+ MB/s peak and the n1/2 point]
• Latency 8.8 µs, bandwidth 100+ MB/s, N1/2 ~250 bytes
• Fast in absolute terms (compares to MPP’s, internal memory BW)
• Delivers a large fraction of hardware performance for short messages
• Technology transferred in emerging cluster standards
– Intel/Compaq/Microsoft’s Virtual Interface Architecture
Comments about Performance
• Latency and bandwidth are the most basic measurements of message-passing machines
– Will discuss performance models in detail, because latency and bandwidth do not tell the entire story (a simple model sketch follows this slide)
• High-performance clusters exhibit
– 10X improvement in deliverable bandwidth over ethernet
– 20X – 30X improvement in latency
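As a hedged illustration of the kind of performance model meant here, the sketch below uses the common linear cost model T(n) = alpha + n/beta together with the n1/2 metric (the message size at which half of the peak bandwidth is delivered) that appears on the FM performance slides. The numbers plugged in are only illustrative; real systems, FM included, can beat this simple model for short messages because per-message overhead is smaller than end-to-end latency.

#include <stdio.h>

/* Simple linear cost model for an n-byte message:
 *   T(n) = alpha + n / beta
 * alpha = per-message latency/overhead (seconds),
 * beta  = asymptotic bandwidth (bytes/second).
 * Under this model the half-power point is n1/2 = alpha * beta:
 * the message size at which delivered bandwidth is half the peak. */
static double transfer_time(double alpha, double beta, double n) {
    return alpha + n / beta;
}

int main(void) {
    double alpha = 10.0e-6;   /* illustrative: 10 microseconds */
    double beta  = 100.0e6;   /* illustrative: 100 MB/s peak   */

    printf("n1/2 = %.0f bytes\n", alpha * beta);

    int sizes[] = { 16, 256, 4096, 65536 };
    for (int i = 0; i < 4; i++) {
        double bw = sizes[i] / transfer_time(alpha, beta, sizes[i]);
        printf("%6d bytes -> %.1f MB/s delivered\n", sizes[i], bw / 1e6);
    }
    return 0;
}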
How does FM really get Speed?
• Protected user-level access to the network (OS-bypass)
• Efficient credit-based flow control (see the sketch after this list)
– assumes a reliable hardware network [only OK for System Area Networks]
– No buffer overruns (stalls sender if no receive space)
• Early demultiplexing of incoming packets
– multithreading, use of NT user-schedulable threads
• Careful implementation with many tuning cycles
– Overlapping DMAs (recv), programmed I/O send
– No interrupts! Polling only.
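The slides do not spell out the credit protocol itself, so the following is a minimal, purely illustrative simulation (one process, hypothetical names) of sender-side credit-based flow control: the sender holds one credit per receive buffer reserved for it on the remote node, spends a credit per packet, stalls at zero, and resumes when the receiver drains buffers and returns credits.

#include <stdio.h>
#include <stdbool.h>

/* Purely illustrative simulation of credit-based flow control.
 * Each credit corresponds to one pinned receive buffer reserved for
 * this sender on the remote node; the sender may only inject a packet
 * when it holds a credit, so the receive region can never overflow. */

#define TOTAL_CREDITS 4   /* hypothetical per-sender receive buffers */

static int credits = TOTAL_CREDITS;

/* Stand-in for handing a packet to the NIC via programmed I/O. */
static bool try_send_packet(int seq) {
    if (credits == 0) {
        printf("packet %d: sender stalls (no credits)\n", seq);
        return false;
    }
    credits--;
    printf("packet %d sent, %d credits left\n", seq, credits);
    return true;
}

/* Stand-in for the receiver draining buffers and piggybacking
 * credit returns on traffic going the other way. */
static void receiver_returns_credits(int n) {
    credits += n;
    printf("receiver returned %d credits, now %d\n", n, credits);
}

int main(void) {
    /* Send a burst larger than the credit pool to show the stall. */
    for (int seq = 0; seq < 6; seq++) {
        if (!try_send_packet(seq)) {
            receiver_returns_credits(2);  /* credits arrive later */
            try_send_packet(seq);
        }
    }
    return 0;
}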
OS-Bypass Background
• Suppose you perform a sendto on a standard IP socket (see the example after this list)
– The operating system mediates access to the network device
• Must trap into the kernel to ensure authorization on each and every message (very time consuming)
• Message is copied from the user program to kernel packet buffers
• Protocol information about each packet is generated by the OS and attached to a packet buffer
• Message is finally sent out onto the physical device (ethernet)
• Receiving does the inverse with a recvfrom
– Packet to kernel buffer, OS strips the header, reassembles the data, OS mediates authorization, copy into the user program
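For concreteness, here is a minimal UDP sender using the standard sockets API described above; every sendto below is a system call, so the kernel is involved (authorization check, copy into kernel buffers, header generation) on each message. The destination address and port are placeholders.

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>

/* Minimal UDP sender: every sendto() traps into the kernel, which
 * checks the descriptor, copies the payload into kernel buffers,
 * builds the UDP/IP headers, and hands the packet to the device. */
int main(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    struct sockaddr_in dest;
    memset(&dest, 0, sizeof(dest));
    dest.sin_family = AF_INET;
    dest.sin_port = htons(5000);                        /* placeholder port */
    inet_pton(AF_INET, "192.168.1.2", &dest.sin_addr);  /* placeholder host */

    const char *msg = "hello";
    for (int i = 0; i < 4; i++) {
        /* One kernel trap + one user-to-kernel copy per message. */
        if (sendto(sock, msg, strlen(msg), 0,
                   (struct sockaddr *)&dest, sizeof(dest)) < 0) {
            perror("sendto");
        }
    }
    close(sock);
    return 0;
}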
OS-Bypass
• A user program is given a protected slice of the network interface
– Authorization is done once (not per message)
• Outgoing packets get directly copied or DMAed to the network interface (see the toy sketch after this list)
– Protocol headers added by the user-level library
• Incoming packets get routed by the network interface card (NIC) into user-defined receive buffers
– NIC must know how to differentiate incoming packets. This is called early demultiplexing.
• Outgoing and incoming message copies are eliminated
• Traps to the OS kernel are eliminated
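This is not FM's actual implementation, only a toy sketch of the OS-bypass idea under stated assumptions: the pinned, memory-mapped NIC send queue is stood in for by ordinary malloc'd memory, and the header layout is invented for illustration. The point is that per-message work is just user-level stores (header construction plus payload copy), with no system call per message.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy sketch of an OS-bypass send path.  In a real system the "send
 * queue" below would be a pinned region of NIC memory mapped once into
 * the process address space (one mmap at startup); here plain malloc'd
 * memory stands in for it. */

#define SLOT_SIZE 2048
#define NUM_SLOTS 8

struct pkt_header {          /* hypothetical user-level protocol header */
    uint16_t handler_id;     /* which remote handler should run */
    uint16_t length;         /* payload bytes */
    uint32_t seq;            /* sequence number for reassembly */
};

static uint8_t *send_queue;  /* stands in for the mapped NIC send region */
static uint32_t next_slot;

static void userlevel_send(uint16_t handler, const void *buf, uint16_t len) {
    uint8_t *slot = send_queue + (next_slot % NUM_SLOTS) * SLOT_SIZE;
    struct pkt_header h = { handler, len, next_slot };

    memcpy(slot, &h, sizeof(h));          /* write header  */
    memcpy(slot + sizeof(h), buf, len);   /* write payload */
    /* A real implementation would now "ring a doorbell" (another
     * user-level store to a mapped NIC register) to notify the NIC. */
    next_slot++;
}

int main(void) {
    send_queue = malloc((size_t)NUM_SLOTS * SLOT_SIZE);
    const char *msg = "no syscall needed per message";
    userlevel_send(/*handler=*/1, msg, (uint16_t)strlen(msg));
    printf("queued %u message(s) entirely at user level\n", next_slot);
    free(send_queue);
    return 0;
}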
Packet Pathway
[Diagram: on the send side, packets go from the user message buffer to the NIC via programmed I/O; packets are DMAed to/from the network; on the receive side the NIC DMAs packets into a pinned DMA receive region, from which user-level handlers move data into user buffers]
• Concurrency of I/O busses
• Sender specifies receiver handler ID
• Flow control keeps DMA region from being overflowed
Fast Messages 1.x – An example message passing API and library
Sender:
FM_send(NodeID, Handler, Buffer, size);  // handlers are remote procedures
Receiver:
FM_extract()
• API: Berkeley Active Messages
– Key distinctions: guarantees (reliable, in-order, flow control), network-processor decoupling (DMA region)
• Focus on short-packet performance:
– Programmed I/O (PIO) instead of DMA
– Simple buffering and flow control
– Map I/O device to user space (OS bypass)
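A hedged sketch of how a program might use this API, using only the two calls named on the slide. The header name fm.h, the handler-ID constant, and the way the handler is bound to that ID are assumptions (the binding mechanism is library-specific and not shown here), so treat this as pseudocode-shaped C rather than a faithful FM program.

#include <string.h>
#include "fm.h"   /* assumed header name for the FM 1.x library */

#define PING_HANDLER 7   /* hypothetical handler ID known to both sides */

/* Active-message handler: runs on the receiving node when a PING_HANDLER
 * message is extracted.  How it is associated with the ID above is
 * library-specific and not shown on the slide. */
void ping_handler(void *buf, int size) {
    (void)buf; (void)size;   /* consume the payload, update application state */
}

/* One-sided send: name the destination node and the handler to invoke
 * there; no matching receive call is posted on that node. */
void sender(int dest_node) {
    char payload[64];
    memset(payload, 0, sizeof(payload));
    FM_send(dest_node, PING_HANDLER, payload, sizeof(payload));
}

/* Receive side: drain the network by polling; FM_extract processes
 * pending packets and invokes the corresponding handlers. */
void receiver_poll_loop(void) {
    for (;;)
        FM_extract();
}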
What is an active message?
• Usually, message passing has a send with a corresponding explicit receive at the destination.
• Active messages specify a function to invoke (activate) when the message arrives
– The function is usually called a message handler
– The handler gets called when the message arrives, not by the destination doing an explicit receive (a self-contained dispatch sketch follows this slide).
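As a self-contained illustration of the dispatch idea (deliberately not FM's code), the sketch below keeps a table of handler functions indexed by the handler ID carried in each packet; the receive side simply drains packets and invokes whatever handler each one names, with no per-message receive posted by the application.

#include <stdio.h>
#include <string.h>

/* Self-contained illustration of active-message dispatch.
 * Each packet carries a handler ID; the receive loop looks the ID up in
 * a table and calls the function, instead of matching an explicit
 * application-level receive. */

typedef void (*am_handler)(const void *payload, int size);

#define MAX_HANDLERS 16
static am_handler handler_table[MAX_HANDLERS];

struct packet {
    int handler_id;
    int size;
    char payload[64];
};

static void register_handler(int id, am_handler fn) {
    handler_table[id] = fn;
}

/* Called for every arriving packet: dispatch, don't match. */
static void dispatch(const struct packet *p) {
    if (p->handler_id >= 0 && p->handler_id < MAX_HANDLERS &&
        handler_table[p->handler_id] != NULL)
        handler_table[p->handler_id](p->payload, p->size);
}

static void print_handler(const void *payload, int size) {
    printf("handler ran on arrival: %.*s\n", size, (const char *)payload);
}

int main(void) {
    register_handler(3, print_handler);

    /* Simulate an arriving packet that names handler 3. */
    struct packet p = { 3, 5, "hello" };
    dispatch(&p);
    return 0;
}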
FM 1.x Performance (6/95)
[Plot: bandwidth (MB/s) vs. message size (16 to 2048 bytes) for FM and 1Gb Ethernet]
• Latency 14 µs, peak BW 21.4 MB/s [Pakin, Lauria et al., Supercomputing ’95]
• Hardware limits PIO performance, but N1/2 = 54 bytes
• Delivers 17.5 MB/s @ 128 byte messages (140 Mbps, greater than OC-3 ATM deliverable)
The FM Layering Efficiency Issue
• How good is the FM 1.1 API?
• Test: build a user-level library on top of it and measure the available performance
– MPI chosen as a representative user-level library
– Porting of MPICH 1.0 (ANL/MSU) to FM
• Purpose: to study what services are important in layering communication libraries
– Integration issues: what kind of inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]
MPI on FM 1.x - Inefficient Layering of Protocols
[Plot: bandwidth (MB/s) vs. message size for FM and MPI-FM; MPI-FM reaches only a fraction of FM’s bandwidth]
• First implementation of MPI on FM was ready in Fall 1995
• Disappointing performance: only a fraction of FM bandwidth is available to MPI applications
MPI-FM Efficiency
[Plot: MPI-FM efficiency (% of FM bandwidth delivered to MPI) vs. message size, 16 to 2048 bytes]
• Result: FM is fast, but its interface is not efficient
MPI-FM Layering Inefficiencies
[Diagram: on both the send and receive sides, MPI must copy data between the source/destination buffers and a separate header-plus-payload buffer before handing it to (or taking it from) FM]
• Too many copies due to header attachment/removal and lack of coordination between the transport and application layers
Redesign API - FM 2.x
• Sending
– FM_begin_message(NodeID, Handler, size)
– FM_send_piece(stream, buffer, size)  // gather
– FM_end_message()
• Receiving
– FM_receive(buffer, size)  // scatter
– FM_extract(total_bytes)  // rcvr flow control
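A hedged sketch of how an upper layer such as MPI-FM might use this streaming interface, again using only the calls named on the slide. The stream handle type, the exact argument and return conventions (e.g. what FM_begin_message returns and what FM_end_message takes), and the handler signature are assumptions, so this shows the gather/scatter shape of the API rather than its precise types.

#include <string.h>
#include "fm.h"   /* assumed header name for the FM 2.x library */

#define MPI_MSG_HANDLER 2   /* hypothetical handler ID for MPI-level messages */

struct mpi_header {          /* illustrative upper-layer (MPI-FM style) header */
    int tag;
    int payload_bytes;
};

/* Sender: gather the MPI header and the user payload directly into one FM
 * message, with no intermediate staging copy.  The slide does not show
 * FM_begin_message's return value; a stream handle is assumed here. */
void send_mpi_message(int dest_node, int tag, const void *payload, int n) {
    struct mpi_header h = { tag, n };
    void *stream = FM_begin_message(dest_node, MPI_MSG_HANDLER,
                                    (int)sizeof(h) + n);
    FM_send_piece(stream, &h, (int)sizeof(h));   /* piece 1: header  */
    FM_send_piece(stream, (void *)payload, n);   /* piece 2: payload */
    FM_end_message(stream);                      /* argument assumed; the slide elides it */
}

/* Receiver-side handler: runs as the message is extracted and scatters the
 * pieces straight into their final destinations (no extra copy).  The real
 * handler signature is library-specific; this is only a sketch. */
void mpi_msg_handler(void) {
    struct mpi_header h;
    static char user_buffer[65536];   /* stand-in for the posted MPI buffer */

    FM_receive(&h, (int)sizeof(h));            /* pull the header   */
    FM_receive(user_buffer, h.payload_bytes);  /* payload, in place */
}

void progress_loop(void) {
    /* Poll the network: the byte-count argument bounds the work done per
     * call and acts as receiver-driven flow control (per the slide). */
    for (;;)
        FM_extract(4096);
}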
MPI-FM 2.x Improved Layering
[Diagram: with the gather/scatter interface, the MPI header and the source/destination buffers are handed to FM as separate pieces, so no intermediate copies are needed on either side]
• Gather-scatter interface + handler multithreading enables efficient layering, data manipulation without copies
MPI on FM 2.x
[Plot: bandwidth (MB/s) vs. message size (4 bytes to 64 KB) for FM and MPI-FM; MPI-FM closely tracks FM]
• MPI-FM: 91 MB/s, 13 µs latency, ~4 µs overhead
– Short messages much better than IBM SP2, PCI limited
– Latency ~ SGI O2K
MPI-FM 2.x Efficiency
[Plot: MPI-FM 2.x efficiency (% of FM bandwidth delivered to MPI) vs. message size]
• High transfer efficiency, approaches 100% [Lauria, Pakin et al., HPDC7 ‘98]
• Other systems much lower even at 1KB (100Mbit: 40%, 1Gbit: 5%)
HPVM III (“NT Supercluster”)
[Photos: 77 GF system, April 1998; 280 GF system, early 2000]
• 256 × Pentium II, April 1998, 77 Gflops
– 3-level fat tree (large switches), scalable bandwidth, modular extensibility
• => 512 × Pentium III (550 MHz), early 2000, 280 GFlops
– Both with the National Center for Supercomputing Applications
Supercomputer Performance Characteristics

System                    Mflops/Proc   Flops/Byte   Flops/NetworkRT
Cray T3E                  1200          ~2           ~2,500
SGI Origin2000            500           ~0.5         ~1,000
HPVM NT Supercluster      300           ~3.2         ~6,000
Berkeley NOW II           100           ~3.2         ~2,000
IBM SP2                   550           ~3.7         ~38,000
Beowulf (100Mbit)         300           ~25          ~200,000

• Compute/communicate and compute/latency ratios
• Clusters can provide programmable characteristics at a dramatically lower system cost
Solving 2D Navier-Stokes Kernel – Performance of Scalable Systems
Preconditioned Conjugate Gradient Method with Multi-level Additive Schwarz Richardson Pre-conditioner (2D 1024x1024)
[Plot: Gigaflops (0–7) vs. number of processors (0–60) for Origin-DSM, Origin-MPI, NT-MPI, SP2-MPI, T3E-MPI, and SPP2000-DSM]
Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)
Is the detail important? Is there something easier?
• Detail of a particular high-performance interface illustrates some of the complexity of these systems
– Performance and scaling are very important. Sometimes the underlying structure needs to be understood to reason about applications.
• Class will focus on distributed computing algorithms and interfaces at a higher level (message passing)
How do we program/run such machines?
• PVM (Parallel Virtual Machine) provides
– Simple message passing API (see the PVM sketch after this slide)
– Construction of a virtual machine with a software console
– Ability to spawn (start), kill (stop), monitor jobs
• XPVM is a graphical console, performance monitor
• MPI (Message Passing Interface)
– Complex and complete message passing API (see the MPI sketch after this slide)
– De facto, community-defined standard
– No defined method for job management
• mpirun provided as a tool with the MPICH distribution
– Commercial and non-commercial tools for monitoring and debugging
• Jumpshot, VaMPIr, …
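Two minimal, hedged sketches of what programs for these libraries look like follow. First PVM: a parent task spawns copies of a worker binary and exchanges one packed integer with each; the executable name "worker" and the message tag are placeholders, and error handling is omitted.

/* pvm_parent.c - minimal PVM sketch: spawn workers, exchange one int each.
 * "worker" is a placeholder executable name; tag 1 is arbitrary. */
#include <stdio.h>
#include <pvm3.h>

#define NWORKERS 4
#define TAG      1

int main(void) {
    int mytid = pvm_mytid();              /* enroll in the virtual machine */
    int tids[NWORKERS];

    int started = pvm_spawn("worker", NULL, PvmTaskDefault, "",
                            NWORKERS, tids);

    for (int i = 0; i < started; i++) {   /* send each worker its index */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&i, 1, 1);
        pvm_send(tids[i], TAG);
    }
    for (int i = 0; i < started; i++) {   /* collect one int from each worker */
        int result;
        pvm_recv(-1, TAG);
        pvm_upkint(&result, 1, 1);
        printf("parent %x received %d\n", mytid, result);
    }
    pvm_exit();                           /* leave the virtual machine */
    return 0;
}

Second, a standard MPI hello-world; each rank reports its rank and the total number of processes.

/* mpi_hello.c - minimal MPI program: each rank reports who it is. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */

    printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down cleanly */
    return 0;
}

Under MPICH this would typically be launched with something like: mpirun -np 4 ./mpi_hello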
Next Time …
• Parallel Programming Paradigms
– Shared memory
– Message passing