Distributed Shared Memory
Multiprocessor Systems
Conventional Wisdom (CW) in Computer Architecture
• Old CW: Power is free, transistors are expensive
• New CW: “Power wall”: power is expensive, transistors are free
  (can put more on a chip than we can afford to turn on)
• Old CW: Multiplies are slow, memory access is fast
• New CW: “Memory wall”: memory is slow, multiplies are fast
  (200 clocks to DRAM memory, 4 clocks for an FP multiply)
• Old CW: Increasing instruction-level parallelism via compilers and innovation (out-of-order execution, speculation, VLIW, …)
• New CW: “ILP wall”: diminishing returns on more ILP
• New CW: Power wall + Memory wall + ILP wall = Brick wall
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Uniprocessor performance only 2X / 5 yrs?
Uniprocessor Performance (SPECint)
[Figure: Uniprocessor performance (SPECint) relative to the VAX-11/780, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. Growth of about 25%/year until 1986 and 52%/year from 1986 to 2002, then a slowdown (??%/year) that opens a roughly 3X gap. Sea change in chip design: multiple “cores” or processors per chip.]
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present
Evolution from Single Core to Multi-Core
“… today’s processors … are nearing an impasse as technologies approach the speed of light.”
David Mitchell, The Transputer: The Time Is Now (1989)
Procrastination rewarded: 2X seq. perf. / 1.5 years
“We are dedicating all of our future product development to multicore designs.
… This is a sea change in computing”
Paul Otellini, President, Intel (2005)
• All microprocessor companies switch to MP (2X CPUs / 2 yrs)
Procrastination penalized: 2X sequential perf. / 5 yrs
Manufacturer/Year    AMD/’05   Intel/’06   IBM/’04   Sun/’05
Processors/chip         2          2          2         8
Threads/Processor       1          2          2         4
Threads/chip            2          4          4        32
Procrastination: to keep delaying something that must be done
Flynn’s Taxonomy
• Flynn classified computers by their data and control streams in 1966
  • Single Instruction Single Data (SISD): uniprocessor
  • Single Instruction Multiple Data (SIMD): single PC (vector machines, CM-2)
  • Multiple Instruction Single Data (MISD): (????)
  • Multiple Instruction Multiple Data (MIMD): clusters, SMP servers
• SIMD: Data-Level Parallelism
• MIMD: Thread-Level Parallelism
• MIMD is popular because it is
  • Flexible: N programs or 1 multithreaded program
  • Cost-effective: same MPU in desktops and in MIMD machines
M.J. Flynn, “Very High-Speed Computing Systems,” Proc. of the IEEE, vol. 54, pp. 1901-1909, Dec. 1966.
Flynn’s Taxonomy
• SISD (Single Instruction Single Data)
• Uniprocessors
• MISD (Multiple Instruction Single Data)
  • A single data stream operated on by successive functional units
• SIMD (Single Instruction Multiple Data)
  • A single instruction stream executed by multiple processors on different data
  • Simple programming model, low overhead
  • Examples: the Connection Machine and vector processors
• MIMD (Multiple Instruction Multiple Data) is the most general
• Flexibility for parallel applications and multi-programmed systems
• Processors can run independent processes or applications
• Processors can also run threads belonging to one parallel application
• Uses off-the-shelf microprocessors and components
Major MIMD Styles
• Symmetric Multiprocessors (SMP)
  • Main memory is shared and equally accessible by all processors
  • Also called Uniform Memory Access (UMA)
  • Bus based or interconnection-network based
• Distributed-memory multiprocessors
  • Distributed Shared Memory (DSM) multiprocessors
    • Distributed memories are shared and can be accessed by all processors
    • Non-Uniform Memory Access (NUMA)
    • Latency varies between local and remote memory access
  • Message-passing multiprocessors, multicomputers, and clusters
    • Distributed memories are NOT shared
    • Each processor can access only its own local memory
    • Processors communicate by sending and receiving messages
Shared Memory Architecture
• Any processor can directly reference any physical memory location
• Any I/O controller can access any physical memory
• The operating system can run on any processor
  • The OS uses shared memory to coordinate
• Communication occurs implicitly as a result of loads and stores
• Wide range of scale
  • A few to hundreds of processors
  • Memory may be physically distributed among the processors
• History dates to the early 1960s
[Figure: several processors and I/O controllers all connected to a shared physical memory.]
Shared Memory Organizations
[Figure: four shared-memory organizations. Shared cache: processors connect through a switch to an interleaved shared cache backed by interleaved main memory. Bus-based shared memory: processors with private caches share a bus to main memory and I/O devices. Dance Hall (UMA): processors with private caches on one side of an interconnection network, memory modules on the other. Distributed Shared Memory (NUMA): each node pairs a processor and cache with a local memory, and the nodes are connected by an interconnection network.]
Bus-Based Symmetric Multiprocessors
• Symmetric access to main memory from any processor
• Dominate the server market
• Building blocks for larger systems
• Attractive as throughput servers and for parallel programs
  • Uniform access via loads/stores
  • Automatic data movement and coherent replication in caches
  • Cheap and powerful extension to uniprocessors
  • Key is the extension of the memory hierarchy to support multiple processors
[Figure: processors P1..Pn, each with a multilevel cache, share a bus to main memory and the I/O system.]
Shared Address Space Programming Model
• A process is a virtual address space with one or more threads of control
• Part of the virtual address space is shared by processes
  • Multiple threads share the address space of a single process
• All communication is through shared memory
  • Achieved by loads and stores (see the sketch after the figure below)
  • Writes by one process/thread are visible to others
  • Special atomic operations are provided for synchronization
  • The OS uses shared memory to coordinate processes
[Figure: virtual address spaces for a collection of processes communicating via shared addresses. Each process P0..Pn has a private portion and a shared portion of its address space; the shared portions map to common physical addresses in the machine’s physical address space, so a store by one process can be observed by a load in another.]
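As a rough illustration of this model (not from the slides), the sketch below uses POSIX threads in C: several threads communicate through an ordinary shared variable via loads and stores, and a mutex stands in for the “special atomic operations” used for synchronization. The names shared_counter, worker, and NTHREADS are assumptions made for the example.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                      /* illustrative thread count */

long shared_counter = 0;                /* shared: visible to all threads via loads/stores */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* atomic operation for synchronization */

void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);      /* synchronize conflicting writes */
        shared_counter++;               /* communication is just a load and a store */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter);   /* writes by one thread are visible to others */
    return 0;
}

Note that no explicit send or receive appears anywhere; all sharing happens through the common address space.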
Medium and Large Scale Multiprocessors
[Figure: two organizations. Centralized memory: processors P, each with a cache ($), connect through an interconnection network to shared memory modules M. Distributed shared memory: each node pairs a processor and cache with a local memory M, and the nodes are connected by an interconnection network.]
• The problem is the interconnect: high cost (crossbar) or limited bandwidth (bus)
• Centralized memory or uniform memory access (UMA)
• Latencies to memory are uniform, but uniformly large
• Interconnection network: crossbar or multi-stage
• Distributed shared memory or non-uniform memory access (NUMA)
• Distributed memory forms one global shared physical address space
• Access to local memory is faster than access to remote memory
Message Passing Architectures
• Complete computer as a building block
  • Includes processor, memory, and I/O system
  • Easier to build and scale than shared-memory architectures
• Communication via explicit I/O operations
  • Communication is integrated at the I/O level, not into the memory system
• Much in common with networks of workstations or clusters
  • However, tight integration between processor and network
  • The network is of higher capability than a local area network
• Programming model
  • Direct access only to local memory (private address space)
  • Communication via explicit send/receive messages (library or system calls)
[Figure: nodes, each containing a processor P, cache ($), and memory M, connected by an interconnection network.]
Message-Passing Abstraction
[Figure: process P executes “Send Q, X, t” and process Q executes “Receive P, Y, t”. The data at address X in P’s local address space is matched on the tag t and copied to address Y in Q’s local address space.]
• Send specifies receiving process and buffer to be transmitted
• Receive specifies sending process and buffer to receive into
• Optional tag on send and matching rule on receive
  • Matching rule: match a specific tag t, any tag, or any sending process
• Combination of send and matching receive achieves:
  • A pairwise synchronization event
  • A memory-to-memory copy of the message
• Overheads: copying, buffer management, protection
• Examples: the MPI and PVM message-passing libraries (see the sketch below)
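As a concrete (hedged) illustration of this abstraction, the sketch below uses MPI, where the tag argument plays the role of t and MPI_ANY_SOURCE / MPI_ANY_TAG express the “any process / any tag” matching rules. The buffer names X and Y and the tag value are assumptions for the example.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, tag = 7;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                    /* process P: Send Q, X, t */
        double X = 3.14;
        MPI_Send(&X, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {             /* process Q: Receive P, Y, t */
        double Y;
        /* matching rule: source 0 and tag 7; MPI_ANY_SOURCE / MPI_ANY_TAG would match any */
        MPI_Recv(&Y, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Q received %f\n", Y);   /* the memory-to-memory copy has completed */
    }
    MPI_Finalize();
    return 0;
}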
Variants of Send and Receive
• Parallel programs using send and receive are quite structured
• Most often, all nodes execute identical copies of a program
• Processes can name each other using a simple linear ordering
• Blocking send:
• Sender sends a request and waits until the reply is returned
• Non-blocking send:
• Sender sends a message and continues without waiting for a reply
• Blocking receive:
• Receiver blocks if it tries to receive a message that has not arrived
• Non-blocking receive:
  • Receiver simply posts a receive and continues without waiting for the sender (see the sketch below)
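To make the variants concrete, the sketch below contrasts them in MPI terms (an assumption consistent with the MPI example later in these notes): MPI_Send/MPI_Recv block, while MPI_Isend/MPI_Irecv return immediately and are completed later with MPI_Wait. Buffer names and sizes are illustrative.

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    double buf[1024];
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Blocking send: returns only when buf may safely be reused */
        MPI_Send(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        /* Non-blocking send: returns immediately; completion is checked later */
        MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, &req);
        /* ... computation can overlap the communication here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);      /* now safe to reuse buf */
    } else if (rank == 1) {
        /* Blocking receive: waits until the matching message has arrived */
        MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Non-blocking receive: post the receive now, wait for it later */
        MPI_Irecv(buf, 1024, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}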
Evolution of Message-Passing Machines
• Early machines: FIFO on each link
• Hardware close to programming model
• Synchronous send/receive operations
• Topology central (hypercube algorithms)
[Figure: a 3-dimensional hypercube topology with nodes labeled 000 through 111; example: the Caltech Cosmic Cube (Seitz).]
Example: Intel Paragon
Each card is an SMP with two or more i860 processors and a network interface chip connected to the cache-coherent memory bus.
Each node has a DMA engine to transfer contiguous chunks of data to and from the network at a high rate.
[Figure: Intel Paragon node: two i860 processors with L1 caches, a memory controller, a DMA engine, a driver, and a network interface (NI) on a 64-bit, 50 MHz memory bus with 4-way interleaved DRAM; 8-bit, 175 MHz bidirectional links connect the nodes in a 2D grid network with a processing node attached to every switch. Photo: Sandia’s Intel Paragon XP/S-based supercomputer.]
Example 1: Matrix Addition (1/3)
Assume an MIMD shared-memory multiprocessor (SMP) with P processors. The objective is to write a parallel program for matrix addition C = A + B, where A, B, and C are all NxN matrices, using P processors.
(1) How do we partition the array of results (C) for parallel processing among P processes?
A few approaches: (1) block-row partitioning, where each process is assigned N/P rows and N columns, or (2) block-column partitioning, where each process is assigned N rows and N/P columns. The first is preferred due to row-major storage.
(2) Find the number of results assigned to each process for load balancing.
There are N^2 results to compute in array C. Each process should compute N^2/P of them for load balancing. Using block-row partitioning, each process is assigned a block-row of (N/P)*N results.
(3) Determine the range of results that must be computed by each process as a function of the process ID (Pid).
Using block-row partitioning, each process is assigned N/P rows and N columns. This can be achieved by partitioning the arrays into block-rows such that each process gets the range:
Range = (Imin, Imax) = (Pid*N/P, Imin + N/P - 1), where each process plugs its own Pid into this formula to work on its range, i.e., a parallel SPMD program.
Example 1: Matrix Addition (2/3)
(4) Using a shared-memory multiprocessor, write an SPMD program to compute the matrix addition with P processes (threads), assuming that all arrays are declared as shared data.
The SPMD code that will run on each processor is:
Imin = Pid*N/P;
Imax = Imin + N/P - 1;
for (i = Imin; i <= Imax; i++)
    for (j = 0; j < N; j++)
        c[i][j] = a[i][j] + b[i][j];
Note: the range of computed data for each process depends on the Pid.
Example 1: Matrix Addition (3/3)
(5) Using a distributed-memory multiprocessor, write an SPMD program using the MPI library to parallelize C = A + B, where each of A, B, and C is an NxN matrix, using P processes. Assume block-row partitioning, that each process works only on its own range of data, and that the results are gathered back to the master process (process 0) after all the iterations complete.
#include <mpi.h>
#include <stdlib.h>
#define N 1024                                     /* matrix dimension, assumed divisible by P */

double a[N][N], b[N][N], c[N][N];                  /* full matrices, initialized on process 0 */

int main(int argc, char *argv[]) {
    int numprocs, mypid, my_rows;
    MPI_Init(&argc, &argv);                        /* initialize MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);      /* number of processes P */
    MPI_Comm_rank(MPI_COMM_WORLD, &mypid);         /* get process id as mypid */
    my_rows = N / numprocs;                        /* block-row size per process */

    double *la = malloc(my_rows * N * sizeof(double));   /* local block-rows of A, B, C */
    double *lb = malloc(my_rows * N * sizeof(double));
    double *lc = malloc(my_rows * N * sizeof(double));

    /* Process 0 scatters block-rows of A and B over all the processes */
    MPI_Scatter(a, my_rows*N, MPI_DOUBLE, la, my_rows*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, my_rows*N, MPI_DOUBLE, lb, my_rows*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < my_rows; i++)              /* each process adds its own block-row */
        for (int j = 0; j < N; j++)
            lc[i*N + j] = la[i*N + j] + lb[i*N + j];

    /* Process 0 gathers the block-rows of C */
    MPI_Gather(lc, my_rows*N, MPI_DOUBLE, c, my_rows*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
Example 2: Summing 100,000 Numbers on 100 Processors
Processors start by running a loop that sums their subset of the numbers in vector A (vector A and the sum array are shared variables, Pid is the processor’s id, and i is a private variable):

sum[Pid] = 0;
for (i = 1000*Pid; i < 1000*(Pid+1); i = i + 1)
    sum[Pid] = sum[Pid] + A[i];

The processors then coordinate in adding together the partial sums (P is a private variable initialized to 100, the number of processors):

repeat
    synch();                                   /* synchronize first */
    if (P % 2 != 0 && Pid == 0)
        sum[0] = sum[0] + sum[P-1];            /* if P is odd, P0 absorbs the last partial sum */
    P = P/2;                                   /* halve the number of active processors */
    if (Pid < P) sum[Pid] = sum[Pid] + sum[Pid+P];
until (P == 1);                                /* the final sum is in sum[0] */
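The pseudocode above leaves synch() and thread creation abstract. As one possible realization (an assumption, not part of the slides), the sketch below maps it onto POSIX threads, with pthread_barrier_wait standing in for synch(); the data initialization and names are illustrative.

#include <pthread.h>
#include <stdio.h>

#define NPROCS 100                      /* number of "processors" (threads) */
#define NNUMS  100000                   /* numbers to sum */

double A[NNUMS];                        /* shared input vector */
double sum[NPROCS];                     /* shared partial sums */
pthread_barrier_t barrier;              /* plays the role of synch() */

void *worker(void *arg) {
    int Pid = (int)(long)arg;           /* this thread's id */
    int P = NPROCS;                     /* private copy, as in the pseudocode */

    sum[Pid] = 0;                       /* phase 1: sum this thread's 1000 elements */
    for (int i = 1000 * Pid; i < 1000 * (Pid + 1); i = i + 1)
        sum[Pid] = sum[Pid] + A[i];

    do {                                /* phase 2: tree reduction of the partial sums */
        pthread_barrier_wait(&barrier);             /* synchronize first */
        if (P % 2 != 0 && Pid == 0)
            sum[0] = sum[0] + sum[P - 1];           /* P0 absorbs the odd element */
        P = P / 2;
        if (Pid < P)
            sum[Pid] = sum[Pid] + sum[Pid + P];
    } while (P != 1);
    return NULL;                        /* the final sum ends up in sum[0] */
}

int main(void) {
    pthread_t t[NPROCS];
    for (int i = 0; i < NNUMS; i++) A[i] = 1.0;     /* example data: expect a total of 100000 */
    pthread_barrier_init(&barrier, NULL, NPROCS);
    for (long i = 0; i < NPROCS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NPROCS; i++) pthread_join(t[i], NULL);
    printf("total = %f\n", sum[0]);
    return 0;
}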
An Example with 10 Processors
[Figure: reduction tree for 10 processors. Step 1 (P = 10): P0..P4 add sum[P5]..sum[P9] into sum[P0]..sum[P4]. Step 2 (P = 5, odd): P0 first absorbs sum[P4], then P0 and P1 add sum[P2] and sum[P3]. Step 3 (P = 2): P0 adds sum[P1]. When P = 1, the final sum is in sum[P0].]
Message Passing Multiprocessors
• Each processor has its own private address space
• Q1 – Processors share data by explicitly sending and
receiving information (messages)
• Q2 – Coordination is built into message passing primitives
(send and receive)