Distributed Shared Memory


Multiprocessor Systems
Conventional Wisdom (CW) in Computer Architecture
• Old CW: Power is free, Transistors expensive
• New CW: “Power wall” Power expensive, Xtors free
(Can put more on chip than can afford to turn on)
• Old CW: Multiplies are slow, Memory access is fast
• New CW: “Memory wall” Memory slow, multiplies fast
(200 clocks to DRAM memory, 4 clocks for FP multiply)
• Old CW: Increasing Instruction Level Parallelism via compilers and hardware
innovation (out-of-order, speculation, VLIW, …)
• New CW: “ILP wall” diminishing returns on more ILP
• New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Uniprocessor performance only 2X / 5 yrs?
Uniprocessor Performance (SPECint)
[Figure: SPECint performance relative to the VAX-11/780, 1978-2006, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. Growth of 25%/year, then 52%/year, then ??%/year. Annotation: sea change in chip design: multiple “cores” or processors per chip.]
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present
Evolution from Single Core to Multi-Core
“… today’s processors … are nearing an impasse as technologies approach the speed of light.”
David Mitchell, The Transputer: The Time Is Now (1989)
⇒ Procrastination rewarded: 2X sequential perf. / 1.5 years
“We are dedicating all of our future product development to multicore designs.
… This is a sea change in computing”
Paul Otellini, President, Intel (2005)
• All microprocessor companies switch to MP (2X CPUs / 2 yrs)
⇒ Procrastination penalized: 2X sequential perf. / 5 yrs
Manufacturer/Year    AMD/’05   Intel/’06   IBM/’04   Sun/’05
Processors/chip         2          2          2         8
Threads/Processor       1          2          2         4
Threads/chip            2          4          4        32

Procrastination: to keep delaying something that must be done
Flynn’s Taxonomy
• Flynn classified architectures by their data and control streams in 1966
  • Single Instruction Single Data (SISD): uniprocessor
  • Single Instruction Multiple Data (SIMD): single PC (Vector, CM-2)
  • Multiple Instruction Single Data (MISD): (????)
  • Multiple Instruction Multiple Data (MIMD): clusters, SMP servers
• SIMD ⇒ Data Level Parallelism
• MIMD ⇒ Thread Level Parallelism
• MIMD popular because
• Flexible: N pgms and 1 multithreaded pgm
• Cost-effective: same MPU in desktop & MIMD
M. J. Flynn, "Very High-Speed Computing Systems," Proceedings of the IEEE, vol. 54, no. 12, pp. 1901-1909, Dec. 1966.
Flynn’s Taxonomy
• SISD (Single Instruction Single Data)
• Uniprocessors
• MISD (Multiple Instruction Single Data)
• A single data stream is operated on by successive functional units
• SIMD (Single Instruction Multiple Data)
• A single instruction stream is executed by multiple processors on different data
• Simple programming model, low overhead
• Examples: Connection Machine and Vector Processors
• MIMD (Multiple Instruction Multiple Data) is the most general
• Flexibility for parallel applications and multi-programmed systems
• Processors can run independent processes or applications
• Processors can also run threads belonging to one parallel application
• Uses off-the-shelf microprocessors and components
Major MIMD Styles
• Symmetric Multiprocessors (SMP)
  • Main memory is shared and equally accessible by all processors
  • Also called Uniform Memory Access (UMA)
  • Bus based or interconnection network based
• Distributed memory multiprocessors
  • Distributed Shared Memory (DSM) multiprocessors
    • Distributed memories are shared and can be accessed by all processors
    • Non-uniform memory access (NUMA)
    • Latency varies between local and remote memory access
  • Message-Passing multiprocessors, multi-computers, and clusters
    • Distributed memories are NOT shared
    • Each processor can access only its own local memory
    • Processors communicate by sending and receiving messages
Shared Memory Architecture
• Any processor can directly reference any physical
memory
• Any I/O controller to any physical memory
• Operating system can run on any processor
  • OS uses shared memory to coordinate
• Communication occurs implicitly as a result of loads and stores
• Wide range of scale
  • Few to hundreds of processors
  • Memory may be physically distributed among processors
• History dates to early 1960s
[Figure: several processors and I/O controllers all connected to a shared physical memory]
Shared Memory Organizations
[Figure: four shared-memory organizations. (a) Shared cache: processors P1 … Pn reach an interleaved shared cache and interleaved main memory through a switch. (b) Bus-based shared memory: processors with private caches share main memory and I/O devices over a bus. (c) Dance Hall (UMA): processors with private caches reach memory modules through an interconnection network. (d) Distributed Shared Memory (NUMA): each node pairs a processor and cache with a local memory, and nodes are linked by an interconnection network.]
Bus-Based Symmetric Multiprocessors
• Symmetric access to main memory from any processor
• Dominate the server market
• Building blocks for larger systems
• Attractive as throughput servers and for parallel programs
⇒ Uniform access via loads/stores
⇒ Automatic data movement and coherent replication in caches
⇒ Cheap and powerful extension to uniprocessors
⇒ Key is extension of memory hierarchy to support multiple processors
[Figure: processors P1 … Pn, each with a multilevel cache, connected by a bus to main memory and the I/O system]
Shared Address Space Programming Model
• A process is a virtual address space
• With one or more threads of control
• Part of the virtual address space is shared by processes
• Multiple threads share the address space of a single process
• All communication is through shared memory
  • Achieved by loads and stores
  • Writes by one process/thread are visible to others
• Special atomic operations for synchronization (see the sketch after the figure below)
• OS uses shared memory to coordinate processes
[Figure: virtual address spaces for a collection of processes communicating via shared addresses. The shared portion of each process's virtual address space (P0 … Pn) maps to common physical addresses in the machine's physical address space, so a store by one process can be seen by a load in another; each process also keeps a private portion of its address space (P0 private, P1 private, …).]
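As a minimal sketch of this programming model (not from the slides; it assumes POSIX threads and C11 atomics, and the names shared_data, ready, producer, and consumer are illustrative), one thread communicates a value to another purely through loads and stores to shared memory, with an atomic flag providing the synchronization mentioned above:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int shared_data;                 /* shared: visible to all threads       */
static atomic_int ready = 0;            /* atomic flag used for synchronization */

static void *producer(void *arg) {
    shared_data = 42;                                        /* ordinary store  */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish         */
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                    /* wait for flag   */
    printf("consumer read %d\n", shared_data);               /* ordinary load   */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

No explicit message is ever sent: the producer's write becomes visible to the consumer simply because both threads address the same shared memory.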
Medium and Large Scale Multiprocessors
[Figure: two organizations. Centralized memory: processors with caches connect through an interconnection network to a bank of memory modules. Distributed shared memory: each node contains a processor, cache, and local memory, and the nodes are linked by an interconnection network.]
• Problem is the interconnect: high cost (crossbar) or limited bandwidth (bus)
• Centralized memory or uniform memory access (UMA)
• Latencies to memory uniform, but uniformly large
• Interconnection network: crossbar or multi-stage
• Distributed shared memory or non-uniform memory access (NUMA)
• Distributed memory forms one global shared physical address space
• Access to local memory is faster than access to remote memory (see the worked example below)
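As an illustrative calculation of this NUMA effect (the numbers are hypothetical, not from the slides): if a local access takes 100 ns and a remote access takes 400 ns, a program whose references are 80% local sees an average memory access time of 0.8 × 100 ns + 0.2 × 400 ns = 160 ns, i.e. 60% slower than the all-local case even though most accesses stay local.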
Message Passing Architectures
• Complete computer as a building block
• Includes processor, memory, and I/O system
• Easier to build and scale than
shared memory architectures
• Communication via explicit I/O operations
[Figure: complete computers, each with a processor, cache, and memory, attached to an interconnection network]
• Communication integrated at I/O level, not into the memory system
• Much in common with networks of workstations or clusters
• However, tight integration between processor and network
  • Network is of higher capability than a local area network
• Programming model
  • Direct access only to local memory (private address space)
  • Communication via explicit send/receive messages (library or system calls)
Message-Passing Abstraction
[Figure: process P executes Send Q, X, t to transmit the buffer at address X in its local address space; process Q executes Receive P, Y, t to receive the message into address Y in its own local address space; the tag t is matched between the two.]
• Send specifies receiving process and buffer to be transmitted
• Receive specifies sending process and buffer to receive into
• Optional tag on send and matching rule on receive
• Matching rule: match a specific tag t or any tag or any process
• Combination of send and matching receive achieves …
• Pairwise synchronization event
• Memory to memory copy of message
• Overheads: copying, buffer management, protection
• Example: MPI and PVM message passing libraries
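A minimal sketch of this abstraction using MPI (one of the libraries named above); the tag value and buffer names are illustrative, and exactly two processes are assumed:

/* Sketch: process 0 sends a buffer to process 1; the tag must match on both sides. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x = 0, y = 0, tag = 7;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                      /* "process P": send X to process 1    */
        x = 42;
        MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {               /* "process Q": receive from 0 into Y  */
        MPI_Recv(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", y);
    }
    MPI_Finalize();
    return 0;
}

The matching send/receive pair gives both the pairwise synchronization event and the memory-to-memory copy listed above.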
Variants of Send and Receive
• Parallel programs using send and receive are quite structured
• Most often, all nodes execute identical copies of a program
• Processes can name each other using a simple linear ordering
• Blocking send:
• Sender sends a request and waits until the reply is returned
• Non-blocking send:
• Sender sends a message and continues without waiting for a reply
• Blocking receive:
• Receiver blocks if it tries to receive a message that has not arrived
• Non-blocking receive:
• Receiver simply posts a receive without waiting for sender
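For instance, MPI offers both flavors: MPI_Send/MPI_Recv block in the sense above, while MPI_Isend/MPI_Irecv return immediately and are completed later with a wait call. A sketch assuming exactly two processes, with illustrative buffer names:

/* Sketch: each of two processes posts a non-blocking send and receive to the other,
 * continues with other work, then waits for both operations to complete. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, other, out, in = -1;
    MPI_Request reqs[2];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                          /* assumes exactly two processes       */
    out = 100 + rank;
    MPI_Isend(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);  /* returns now   */
    MPI_Irecv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);   /* posted early  */
    /* ... useful computation could overlap with communication here ... */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); /* block until both operations finish  */
    printf("rank %d received %d\n", rank, in);
    MPI_Finalize();
    return 0;
}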
Evolution of Message-Passing Machines
• Early machines: FIFO on each link
• Hardware close to programming model
• Synchronous send/receive operations
• Topology central (hypercube algorithms)
CalTech Cosmic Cube (Seitz)
[Figure: 3-dimensional hypercube with nodes labeled 000 through 111]
Example Intel Paragon
[Figure: Intel Paragon node: two i860 processors with L1 caches, a memory controller, a DMA engine, a driver, and a network interface (NI) chip on a 64-bit, 50 MHz memory bus backed by 4-way interleaved DRAM; 8-bit, 175 MHz bidirectional links form a 2D grid network with a processing node attached to every switch. Photo: Sandia's Intel Paragon XP/S-based supercomputer.]
Each card is an SMP with two or more i860 processors and a network interface chip connected to the cache-coherent memory bus.
Each node has a DMA engine to transfer contiguous chunks of data to and from the network at a high rate.
Example 1: Matrix Addition (1/3)
Assume an MIMD Shared-Memory Multiprocessor (SMP) with P processors. The objective is to write a parallel program for matrix addition C = A + B, where A, B, and C are all NxN matrices.
(1) How should the array of results (C) be partitioned for parallel processing among P processes?
A few approaches: (1) block-row partitioning, where each process is assigned N/P rows and N columns, or (2) block-column partitioning, where each process is assigned N rows and N/P columns. The first is preferred due to row-major storage.
(2) Find the number of results assigned to each process for load balancing.
There are N² results to compute in array C, so each process should get N²/P of them for load balancing. Using block-row partitioning, each process is assigned a block-row of (N/P)*N results.
(3) Determine the range of results that must be computed by each process as a function of the process ID (Pid).
Using block-row partitioning, each process is assigned N/P rows and N columns. This is achieved by partitioning the rows into block-rows such that each process gets the range
Range = (Imin, Imax) = [Pid*N/P, Imin + N/P - 1],
where each process plugs its own Pid into the formula to work on its range, i.e. a parallel SPMD program. A concrete instance is given below.
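For illustration (the sizes are hypothetical): with N = 1000 and P = 4, each process owns N/P = 250 rows, so process Pid = 2 gets Imin = 2*250 = 500 and Imax = 500 + 250 - 1 = 749, i.e. rows 500 through 749 of C.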
Example 1: Matrix Addition (2/3)
(4) Using a shared-memory multiprocessor, write an SPMD program to compute the matrix addition with P processes (threads), assuming that all arrays are declared as shared data.
The SPMD program that runs on each process is:
Imin = Pid*N/P;
Imax = Imin + N/P - 1;
for (i = Imin; i <= Imax; i++)
    for (j = 0; j < N; j++)
        c[i][j] = a[i][j] + b[i][j];
Note: the range of computed data for each process depends on the Pid.
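A minimal sketch of how this SPMD kernel can be launched on a shared-memory machine (not from the slides; it assumes POSIX threads, and N, P, and the array names are illustrative):

#include <pthread.h>

#define N 1024              /* illustrative matrix dimension             */
#define P 4                 /* illustrative number of processes/threads  */

static double a[N][N], b[N][N], c[N][N];   /* shared arrays */

static void *add_block_rows(void *arg) {
    long Pid = (long)arg;                  /* process id passed at creation       */
    int Imin = Pid * N / P;
    int Imax = Imin + N / P - 1;
    for (int i = Imin; i <= Imax; i++)     /* each thread adds only its block-row */
        for (int j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];
    return NULL;
}

int main(void) {
    pthread_t tid[P];
    for (long pid = 0; pid < P; pid++)
        pthread_create(&tid[pid], NULL, add_block_rows, (void *)pid);
    for (int pid = 0; pid < P; pid++)
        pthread_join(tid[pid], NULL);
    return 0;
}

Every thread runs the same function; only the Pid it receives, and therefore the block-row range it computes, differs.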
Example 1: Matrix Addition (3/3)
(5) Using a distributed-memory multiprocessor, write an SPMD program using the MPI library to parallelize C = A + B, where each matrix is NxN, using P processes. Assume block-row partitioning, each process works only on its own range of data, and the results are gathered back to the master process (process 0) after all iterations complete.
{
    MPI_Init(&argc, &argv);                                   /* initialize MPI                     */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);                 /* number of processes P              */
    MPI_Comm_rank(MPI_COMM_WORLD, &mypid);                    /* get process id as mypid            */
    my_range = N / numprocs;                                  /* number of rows per process         */

    /* Process 0 scatters block-rows of A and B over all the processes;
       my_a, my_b, my_c are each process's local (my_range x N) blocks */
    MPI_Scatter(a, my_range*N, MPI_DOUBLE, my_a, my_range*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, my_range*N, MPI_DOUBLE, my_b, my_range*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (i = 0; i < my_range; i++)                            /* each process adds its own rows     */
        for (j = 0; j < N; j++)
            my_c[i][j] = my_a[i][j] + my_b[i][j];

    /* Gather the partial results back into C on process 0 */
    MPI_Gather(my_c, my_range*N, MPI_DOUBLE, c, my_range*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Finalize();
}
Example 2: Summing 100,000 Numbers on 100 Processors
• Processors start by running a loop that sums their subset of vector A numbers (vectors A and sum are shared variables, Pid is the processor's id, i is a private variable)

    sum[Pid] = 0;
    for (i = 1000*Pid; i < 1000*(Pid+1); i = i + 1)
        sum[Pid] = sum[Pid] + A[i];
• The processors then coordinate in adding together the partial sums (P is a private variable initialized to 100, the number of processors)

    repeat
        synch();                                /* synchronize first   */
        if (P % 2 != 0 && Pid == 0)
            sum[0] = sum[0] + sum[P-1];
        P = P/2;
        if (Pid < P)
            sum[Pid] = sum[Pid] + sum[Pid+P];
    until (P == 1);                             /* final sum in sum[0] */
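A hedged sketch of the same reduction in C with POSIX threads (not from the slides): pthread_barrier_wait stands in for synch(), and the thread count and data values are illustrative:

#include <pthread.h>
#include <stdio.h>

#define NPROC 100                       /* number of threads ("processors")  */
#define NITEMS 100000                   /* 1000 numbers per thread           */

static double A[NITEMS];                /* shared input vector               */
static double sum[NPROC];               /* shared partial sums               */
static pthread_barrier_t barrier;       /* plays the role of synch()         */

static void *reduce(void *arg) {
    long Pid = (long)arg;
    int half = NPROC;                   /* private copy of P                 */
    sum[Pid] = 0;
    for (int i = 1000 * Pid; i < 1000 * (Pid + 1); i++)
        sum[Pid] += A[i];
    do {
        pthread_barrier_wait(&barrier);           /* synchronize first              */
        if (half % 2 != 0 && Pid == 0)            /* odd case: fold the last element */
            sum[0] += sum[half - 1];
        half = half / 2;
        if (Pid < half)
            sum[Pid] += sum[Pid + half];
    } while (half != 1);
    return NULL;
}

int main(void) {
    pthread_t tid[NPROC];
    for (int i = 0; i < NITEMS; i++) A[i] = 1.0;  /* illustrative data */
    pthread_barrier_init(&barrier, NULL, NPROC);
    for (long p = 0; p < NPROC; p++)
        pthread_create(&tid[p], NULL, reduce, (void *)p);
    for (int p = 0; p < NPROC; p++)
        pthread_join(tid[p], NULL);
    printf("total = %f (expected %d)\n", sum[0], NITEMS);
    pthread_barrier_destroy(&barrier);
    return 0;
}

Each thread keeps its own private copy of P (here named half), exactly as the pseudocode above requires.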
An Example with 10 Processors
[Figure: reduction tree for 10 processors. Starting from the partial sums sum[P0] … sum[P9], the number of active processors is halved each step (P = 10, then 5, then 2, then 1), with P0 folding in the leftover partial sum whenever P is odd, until the final total is left in sum[0].]
Message Passing Multiprocessors
• Each processor has its own private address space
• Q1 – Processors share data by explicitly sending and
receiving information (messages)
• Q2 – Coordination is built into message passing primitives
(send and receive)