Transcript: EECC722 - Shaaban

Parallel Computer Architecture
• A parallel computer is a collection of processing elements
that cooperate to solve large problems fast
• Broad issues involved:
– Resource Allocation:
• Number of processing elements (PEs).
• Computing power of each element.
• Amount of physical memory used.
– Data access, Communication and Synchronization
• How the elements cooperate and communicate.
• How data is transmitted between processors.
• Abstractions and primitives for cooperation.
– Performance and Scalability
• Performance enhancement of parallelism: Speedup.
• Scalability of performance to larger systems/problems.
EECC722 - Shaaban
#1 lec # 3
Fall 2000 9-18-2000
Exploiting Program Parallelism
[Figure: levels of program parallelism (instruction, loop, thread, process) plotted against grain size in instructions, on a scale from 1 to 1M.]
EECC722 - Shaaban
#2 lec # 3
Fall 2000 9-18-2000
The Need And Feasibility of Parallel Computing
• Application demands: More computing cycles:
– Scientific computing: CFD, Biology, Chemistry, Physics, ...
– General-purpose computing: Video, Graphics, CAD, Databases,
Transaction Processing, Gaming…
– Mainstream multithreaded programs are similar to parallel programs
• Technology Trends
– Number of transistors on chip growing rapidly
– Clock rates expected to go up but only slowly
• Architecture Trends
– Instruction-level parallelism is valuable but limited
– Coarser-level parallelism, as in MPs, the most viable approach
• Economics:
– Today’s microprocessors have multiprocessor support, eliminating the
need to design expensive custom PEs.
– Lower parallel system cost.
– Multiprocessor systems offer a cost-effective replacement for
uniprocessor systems in mainstream computing.
EECC722 - Shaaban
#3 lec # 3
Fall 2000 9-18-2000
Scientific Computing Demand
EECC722 - Shaaban
#4 lec # 3
Fall 2000 9-18-2000
Scientific Supercomputing Trends
• Proving ground and driver for innovative architecture
and advanced techniques:
– Market is much smaller relative to commercial segment
– Dominated by vector machines starting in 70s
– Meanwhile, microprocessors have made huge gains in
floating-point performance
• High clock rates.
• Pipelined floating point units.
• Instruction-level parallelism.
• Effective use of caches.
• Large-scale multiprocessors replace vector supercomputers
– Well under way already
EECC722 - Shaaban
#5 lec # 3
Fall 2000 9-18-2000
Raw Uniprocessor Performance:
LINPACK
[Figure: LINPACK uniprocessor performance in MFLOPS (n = 100 and n = 1,000), 1975-2000, for CRAY vector machines (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) versus microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha AXP, HP9000/735, DEC Alpha, MIPS R4400, IBM Power2/990, DEC 8200).]
EECC722 - Shaaban
#6 lec # 3
Fall 2000 9-18-2000
Raw Parallel Performance:
LINPACK
[Figure: LINPACK parallel performance in GFLOPS, 1985-1996, MPP peak versus CRAY peak: Xmp/416(4), Ymp/832(8), nCUBE/2(1024), iPSC/860, CM-2, CM-200, Delta, C90(16), CM-5, Paragon XP/S, T3D, T932(32), Paragon XP/S MP (1024), Paragon XP/S MP (6768), ASCI Red.]
EECC722 - Shaaban
#7 lec # 3
Fall 2000 9-18-2000
Parallelism in Microprocessor VLSI Generations
[Figure: transistors per microprocessor, 1970-2005 (i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000), spanning the bit-level, instruction-level, and thread-level (?) parallelism generations.]
EECC722 - Shaaban
#8 lec # 3
Fall 2000 9-18-2000
The Goal of Parallel Computing
• Goal of applications in using parallel machines: Speedup
Speedup (p processors) = Performance (p processors) / Performance (1 processor)
• For a fixed problem size (input data set), performance = 1/time
Speedup fixed problem (p processors) = Time (1 processor) / Time (p processors)
EECC722 - Shaaban
#9 lec # 3
Fall 2000 9-18-2000
Elements of Modern Computers
[Figure: elements of modern computers. Computing problems drive algorithms and data structures, which are mapped onto the hardware architecture; programming in high-level languages, binding (compile, load), the operating system, and applications software connect the layers, with performance evaluation closing the loop.]
EECC722 - Shaaban
#10 lec # 3
Fall 2000 9-18-2000
Elements of Modern Computers
1 Computing Problems:
– Numerical Computing: Science and technology numerical
problems demand intensive integer and floating point
computations.
– Logical Reasoning: Artificial intelligence (AI) demands logic
inference, symbolic manipulation, and large space searches.
2 Algorithms and Data Structures
– Special algorithms and data structures are needed to specify the
computations and communication present in computing
problems.
– Most numerical algorithms are deterministic using regular data
structures.
– Symbolic processing may use heuristics or non-deterministic
searches.
– Parallel algorithm development requires interdisciplinary
interaction.
EECC722 - Shaaban
#11 lec # 3
Fall 2000 9-18-2000
Elements of Modern Computers
3 Hardware Resources
– Processors, memory, and peripheral devices form the
hardware core of a computer system.
– Processor instruction set, processor connectivity, and memory
organization influence the system architecture.
4 Operating Systems
– Manages the allocation of resources to running processes.
– Mapping to match algorithmic structures with hardware
architecture and vice versa: processor scheduling, memory
mapping, interprocessor communication.
– Parallelism exploitation at: algorithm design, program
writing, compilation, and run time.
EECC722 - Shaaban
#12 lec # 3
Fall 2000 9-18-2000
Elements of Modern Computers
5 System Software Support
– Needed for the development of efficient programs in high-level languages (HLLs).
– Assemblers, loaders.
– Portable parallel programming languages
– User interfaces and tools.
6 Compiler Support
– Preprocessor compiler: Sequential compiler and low-level
library of the target parallel computer.
– Precompiler: Some program flow analysis, dependence
checking, limited optimizations for parallelism detection.
– Parallelizing compiler: Can automatically detect
parallelism in source code and transform sequential code
into parallel constructs.
EECC722 - Shaaban
#13 lec # 3
Fall 2000 9-18-2000
Approaches to Parallel Programming
[Figure: two approaches.
(a) Implicit parallelism: the programmer writes source code in sequential languages (C, C++, FORTRAN, LISP, ...); a parallelizing compiler produces parallel object code, executed by the runtime system.
(b) Explicit parallelism: the programmer writes source code in concurrent dialects of C, C++, FORTRAN, LISP, ...; a concurrency-preserving compiler produces concurrent object code, executed by the runtime system.]
EECC722 - Shaaban
#14 lec # 3
Fall 2000 9-18-2000
Evolution of Computer
Architecture
[Figure: evolution from scalar sequential processors through lookahead, functional parallelism (I/E overlap, multiple functional units), and pipelining, to implicit vector (memory-to-memory) and explicit vector (register-to-register) machines, then SIMD (associative processor, processor array) and MIMD (multicomputer, multiprocessor), culminating in massively parallel processors (MPPs).
I/E: Instruction Fetch and Execute.
SIMD: Single Instruction stream over Multiple Data streams.
MIMD: Multiple Instruction streams over Multiple Data streams.]
EECC722 - Shaaban
Parallel Architectures History
Historically, parallel architectures tied to programming models
• Divergent architectures, with no predictable pattern of growth.
[Figure: application software and system software layered over divergent architectures: systolic arrays, SIMD, dataflow, message passing, and shared memory.]
EECC722 - Shaaban
#16 lec # 3
Fall 2000 9-18-2000
Programming Models
• Programming methodology used in coding applications
• Specifies communication and synchronization
• Examples:
– Multiprogramming:
No communication or synchronization at program level
– Shared memory address space:
– Message passing:
Explicit point to point communication
– Data parallel:
More regimented, global actions on data
• Implemented with shared address space or message passing
EECC722 - Shaaban
#17 lec # 3
Fall 2000 9-18-2000
Flynn’s 1972 Classification of
Computer Architecture
• Single Instruction stream over a Single Data stream
(SISD): Conventional sequential machines.
• Single Instruction stream over Multiple Data streams
(SIMD): Vector computers, array of synchronized
processing
elements.
• Multiple Instruction streams and a Single Data stream
(MISD): Systolic arrays for pipelined execution.
• Multiple Instruction streams over Multiple Data streams
(MIMD): Parallel computers:
• Shared memory multiprocessors.
• Multicomputers: Unshared distributed memory,
message-passing used instead.
EECC722 - Shaaban
#18 lec # 3
Fall 2000 9-18-2000
Flynn’s Classification of Computer Architecture
Fig. 1.3 page 12 in
Advanced Computer Architecture: Parallelism,
Scalability, Programmability, Hwang, 1993.
EECC722 - Shaaban
#19 lec # 3
Fall 2000 9-18-2000
Current Trends In Parallel Architectures
• The extension of “computer architecture” to support
communication and cooperation:
– OLD: Instruction Set Architecture
– NEW: Communication Architecture
• Defines:
– Critical abstractions, boundaries, and primitives
(interfaces)
– Organizational structures that implement interfaces
(hardware or software)
• Compilers, libraries and OS are important bridges today
EECC722 - Shaaban
#20 lec # 3
Fall 2000 9-18-2000
Modern Parallel Architecture
Layered Framework
[Figure: layered framework. Parallel applications (CAD, database, scientific modeling, multiprogramming) sit on programming models (multiprogramming, shared address, message passing, data parallel); at the user/system boundary lies the communication abstraction, implemented by compilation or library and operating systems support; below the hardware/software boundary lie the communication hardware and the physical communication medium.]
EECC722 - Shaaban
#21 lec # 3
Fall 2000 9-18-2000
Shared Address Space Parallel Architectures
• Any processor can directly reference any memory location
– Communication occurs implicitly as result of loads and stores
• Convenient:
– Location transparency
– Similar programming model to time-sharing in uniprocessors
• Except processes run on different processors
• Good throughput on multiprogrammed workloads
• Naturally provided on a wide range of platforms
– Wide range of scale: few to hundreds of processors
• Popularly known as shared memory machines or model
– Ambiguous: Memory may be physically distributed among
processors
EECC722 - Shaaban
#22 lec # 3
Fall 2000 9-18-2000
Shared Address Space (SAS) Model
• Process: virtual address space plus one or more threads of control
• Portions of address spaces of processes are shared
[Figure: virtual address spaces for a collection of processes communicating via shared addresses. Each process Pi has a private portion and a shared portion of its address space; the shared portions map to common physical addresses, so ordinary loads and stores (e.g., a store by P1 and a load by P2) communicate through the machine physical address space.]
• Writes to shared address visible to other threads (in other processes too)
• Natural extension of the uniprocessor model:
• Conventional memory operations used for communication
• Special atomic operations needed for synchronization
• OS uses shared memory to coordinate processes
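A minimal sketch of this model (not from the original slides), using Python's multiprocessing module: the processes share a counter placed in OS-provided shared memory, ordinary reads and writes communicate through it, and a lock plays the role of the special atomic operation needed for synchronization.

# Illustrative sketch of the shared-address-space model (assumed example).
from multiprocessing import Process, Value, Lock

def worker(counter, lock, n_increments):
    for _ in range(n_increments):
        with lock:                 # special atomic operation for synchronization
            counter.value += 1     # ordinary load/store on shared data

if __name__ == "__main__":
    counter = Value("i", 0)        # integer living in shared memory
    lock = Lock()
    procs = [Process(target=worker, args=(counter, lock, 10000)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("final counter:", counter.value)   # expected: 40000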
EECC722 - Shaaban
#23 lec # 3
Fall 2000 9-18-2000
Models of Shared-Memory Multiprocessors
• The Uniform Memory Access (UMA) Model:
– The physical memory is shared by all processors.
– All processors have equal access to all memory addresses.
• Distributed memory or Nonuniform Memory Access
(NUMA) Model:
– Shared memory is physically distributed locally among
processors.
• The Cache-Only Memory Architecture (COMA) Model:
– A special case of a NUMA machine where all distributed
main memory is converted to caches.
– No memory hierarchy at each processor.
EECC722 - Shaaban
#24 lec # 3
Fall 2000 9-18-2000
Models of Shared-Memory Multiprocessors
Uniform Memory Access (UMA) Model
[Figure: the three models.
UMA: processors and shared memory modules (with I/O controllers and devices) connected through a single interconnect (bus, crossbar, or multistage network).
NUMA (distributed memory): each node contains a processor P, cache $, and local memory M; nodes are connected by a network.
COMA: each node contains a processor P, cache C, and cache directory D; nodes are connected by a network, and all distributed main memory acts as cache.]
EECC722 - Shaaban
#25 lec # 3
Fall 2000 9-18-2000
Uniform Memory Access Example:
Intel Pentium Pro Quad
[Figure: Intel Pentium Pro Quad. Four P-Pro modules (CPU, 256-KB L2 cache, interrupt controller, bus interface) share the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller and MIU drive 1-, 2-, or 4-way interleaved DRAM, and PCI bridges connect PCI buses with PCI I/O cards.]
• All coherence and multiprocessing glue in the processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth
EECC722 - Shaaban
#26 lec # 3
Fall 2000 9-18-2000
Uniform Memory Access Example:
SUN Enterprise
[Figure: Sun Enterprise. Up to 16 cards on the Gigaplane bus (256-bit data, 41-bit address, 83 MHz): CPU/memory cards (two processors, each with first- and second-level caches, plus a memory controller) and I/O cards (bus interface, SBUS slots, 2 FibreChannel, 100bT, SCSI).]
– 16 cards of either type: processors + memory, or I/O
– All memory accessed over the bus, so symmetric
– Higher bandwidth, higher latency bus
EECC722 - Shaaban
#27 lec # 3
Fall 2000 9-18-2000
Distributed Shared-Memory
Multiprocessor System Example:
Cray T3E
[Figure: Cray T3E node. Processor P with cache $, local memory, and a combined memory controller and network interface (NI), connected to external I/O and to the X, Y, and Z switches of the 3-D torus interconnect.]
• Scale up to 1024 processors, 480MB/s links
• Memory controller generates communication requests for nonlocal
references
• No hardware mechanism for coherence (SGI Origin etc. provide this)
EECC722 - Shaaban
#28 lec # 3
Fall 2000 9-18-2000
Message-Passing Multicomputers
• Comprised of multiple autonomous computers (nodes).
• Each node consists of a processor, local memory, attached
storage and I/O peripherals.
• Programming model more removed from basic hardware
operations
• Local memory is only accessible by local processors.
• A message passing network provides point-to-point static
connections among the nodes.
• Inter-node communication is carried out by message passing
through the static connection network
• Process communication achieved using a message-passing
programming environment.
EECC722 - Shaaban
#29 lec # 3
Fall 2000 9-18-2000
Message-Passing Abstraction
[Figure: message-passing abstraction. Process P executes Send X, Q, t (local address X); process Q executes Receive Y, P, t (local address Y); the send and receive are matched on process and tag, copying data from P's local address space into Q's.]
• Send specifies buffer to be transmitted and receiving process
• Recv specifies sending process and application storage to receive into
• Memory to memory copy, but need to name processes
• Optional tag on send and matching rule on receive
• User process names local data and entities in process/tag space too
• In simplest form, the send/recv match achieves pairwise synch event
• Many overheads: copying, buffer management, protection
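A rough sketch of the send/receive matching idea in Python (illustrative only; the process names and the tag are made up), using a pipe between two processes in place of a real message-passing network:

# Illustrative sketch of the message-passing abstraction (assumed example).
from multiprocessing import Process, Pipe

def process_p(conn):
    x = [1, 2, 3]                    # local address X in P's address space
    conn.send(("Q", "tag42", x))     # Send X, Q, t

def process_q(conn):
    sender, tag, y = conn.recv()     # Receive Y, P, t (blocks until data arrives)
    print("Q received", y, "with tag", tag)

if __name__ == "__main__":
    p_end, q_end = Pipe()
    p = Process(target=process_p, args=(p_end,))
    q = Process(target=process_q, args=(q_end,))
    p.start(); q.start()
    p.join(); q.join()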
EECC722 - Shaaban
#30 lec # 3
Fall 2000 9-18-2000
Message-Passing Example: IBM SP-2
[Figure: IBM SP-2 node. Essentially a complete RS6000 workstation: Power 2 CPU, L2 cache, memory bus, memory controller, and 4-way interleaved DRAM; the network interface card (NI with its own i860, DRAM, and DMA) sits on the MicroChannel I/O bus. Nodes are connected by a general interconnection network formed from 8-port switches.]
• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bandwidth limited by I/O bus)
EECC722 - Shaaban
#31 lec # 3
Fall 2000 9-18-2000
Message-Passing Example:
Intel Paragon
[Figure: Intel Paragon node (as in Sandia's Intel Paragon XP/S-based supercomputer). Two i860 processors with L1 caches, a memory controller, 4-way interleaved DRAM, and a DMA-driven network interface (NI) on a 64-bit, 50 MHz memory bus; a processing node is attached to every switch of the 2-D grid network, whose links are 8 bits wide, 175 MHz, and bidirectional.]
EECC722 - Shaaban
#32 lec # 3
Fall 2000 9-18-2000
Message-Passing Programming Tools
• Message-passing programming libraries include:
– Message Passing Interface (MPI):
• Provides a standard for writing concurrent message-passing
programs.
• MPI implementations include parallel libraries used by
existing programming languages.
– Parallel Virtual Machine (PVM):
• Enables a collection of heterogeneous computers to be used as a
coherent and flexible concurrent computational resource.
• PVM support software executes on each machine in a
user-configurable pool, and provides a computational
environment of concurrent applications.
• User programs written for example in C, Fortran or Java are
provided access to PVM through the use of calls to PVM
library routines.
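As a rough illustration of the MPI style mentioned above, a minimal sketch assuming the mpi4py Python binding and an MPI runtime are installed (run, e.g., with mpiexec -n 4 python sum_example.py; the file name is hypothetical):

# Illustrative MPI sketch (assumes mpi4py and an MPI implementation).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()        # this process's number
size = comm.Get_size()        # total number of processes

local = rank + 1              # stand-in for some local computation
total = comm.reduce(local, op=MPI.SUM, root=0)   # message-passing combine

if rank == 0:
    print("sum over", size, "processes =", total)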
EECC722 - Shaaban
#33 lec # 3
Fall 2000 9-18-2000
Data Parallel Systems SIMD in Flynn taxonomy
• Programming model
– Operations performed in parallel on each element of data structure
– Logically single thread of control, performs sequential or parallel steps
– Conceptually, a processor is associated with each data element
• Architectural model
– Array of many simple, cheap processors each with little memory
• Processors don’t sequence through instructions
– Attached to a control processor that issues instructions
– Specialized and general communication, cheap global synchronization
[Figure: a control processor broadcasting instructions to a 2-D grid of processing elements (PEs).]
• Some recent machines:
– Thinking Machines CM-1, CM-2 (and CM-5)
– Maspar MP-1 and MP-2
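The data-parallel programming model can be sketched with NumPy (an illustrative analogy, not how the CM-2 or MP-1 were actually programmed): one logical operation is applied to every element of the data structure, as if one conceptual PE were attached to each element.

# Illustrative data-parallel sketch using NumPy (assumed example).
import numpy as np

a = np.arange(16, dtype=np.float64)   # one value per (conceptual) PE
b = np.ones(16)

c = a * 2.0 + b                       # elementwise: all "PEs" act in lockstep
total = c.sum()                       # global combine/synchronization
print(c)
print("global sum:", total)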
EECC722 - Shaaban
#34 lec # 3
Fall 2000 9-18-2000
Dataflow Architectures
• Represent computation as a graph of essential dependences
– Logical processor at each node, activated by availability of operands
– Message (tokens) carrying tag of next instruction sent to next processor
– Tag compared with others in matching store; match fires execution
Example: a = (b + 1) × (b − c);  d = c × e;  f = a × d
[Figure: the dataflow graph for this example, and a dataflow processor: a token store and program store feed a waiting/matching stage, then instruction fetch, execute, and form-token stages, with tokens circulating through the network and a token queue.]
EECC722 - Shaaban
#35 lec # 3
Fall 2000 9-18-2000
Systolic Architectures
• Replace single processor with an array of regular processing elements
• Orchestrate data flow for high throughput with less memory access
[Figure: memory modules feeding a linear array of processing elements (PEs).]
• Different from pipelining
– Nonlinear array structure, multidirection data flow, each PE
may have (small) local instruction and data memory
• Different from SIMD: each PE may do something different
• Initial motivation: VLSI enables inexpensive special-purpose chips
• Represent algorithms directly by chips connected in regular pattern
EECC722 - Shaaban
#36 lec # 3
Fall 2000 9-18-2000
•
•
•
•
•
•
•
•
•
•
•
•
Parallel Programs
Conditions of Parallelism:
– Data Dependence
– Control Dependence
– Resource Dependence
– Bernstein’s Conditions
Asymptotic Notations for Algorithm Analysis
Parallel Random-Access Machine (PRAM)
– Example: sum algorithm on P processor PRAM
Network Model of Message-Passing Multicomputers
– Example: Asynchronous Matrix Vector Product on a Ring
Levels of Parallelism in Program Execution
Hardware Vs. Software Parallelism
Parallel Task Grain Size
Example Motivating Problems With high levels of concurrency
Limited Concurrency: Amdahl’s Law
Parallel Performance Metrics: Degree of Parallelism (DOP)
Concurrency Profile
Steps in Creating a Parallel Program:
– Decomposition, Assignment, Orchestration, Mapping
– Program Partitioning Example
– Static Multiprocessor Scheduling Example
EECC722 - Shaaban
#37 lec # 3
Fall 2000 9-18-2000
Conditions of Parallelism:
Data Dependence
1 True Data or Flow Dependence: A statement S2 is data
dependent on statement S1 if an execution path exists
from S1 to S2 and if at least one output variable of S1
feeds in as an input operand used by S2
(denoted by S1 → S2)
2 Antidependence: Statement S2 is antidependent on S1
if S2 follows S1 in program order and if the output of
S2 overlaps the input of S1
3 Output dependence: Two statements are output
dependent if they produce the same output variable
EECC722 - Shaaban
#38 lec # 3
Fall 2000 9-18-2000
Conditions of Parallelism: Data Dependence
4 I/O dependence: Read and write are I/O statements.
I/O dependence occurs not because the same variable is
involved but because the same file is referenced by both
I/O statements.
5 Unknown dependence:
• The subscript of a variable is itself subscripted (indirect
addressing).
• The subscript does not contain the loop index.
• A variable appears more than once with subscripts
having different coefficients of the loop variable.
• The subscript is nonlinear in the loop index
variable.
EECC722 - Shaaban
#39 lec # 3
Fall 2000 9-18-2000
Data and I/O Dependence: Examples
A-
S1: Load R1, A
S2: Add R2, R1
S3: Move R1, R3
S4: Store B, R1
[Dependence graph among S1, S2, S3, S4]
B-
S1: Read (4), A(I)    /Read array A from tape unit 4/
S2: Rewind (4)        /Rewind tape unit 4/
S3: Write (4), B(I)   /Write array B into tape unit 4/
S4: Rewind (4)        /Rewind tape unit 4/
S1 and S3 are I/O dependent: the I/O dependence is caused by accessing the
same file (tape unit 4) by the read and write statements.
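The same dependence types can be illustrated with a short Python fragment (an assumed example, not from the slides):

# Illustrative fragment labeling the dependence types defined above.
b = 5
a = b + 1      # S1
c = a * 2      # S2: flow (true) dependent on S1 (reads the a written by S1)
a = 7          # S3: antidependent on S2 (writes the a that S2 read) and
               #     output dependent on S1 (both write a)
print(a, c)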
EECC722 - Shaaban
#40 lec # 3
Fall 2000 9-18-2000
Conditions of Parallelism
• Control Dependence:
– Order of execution cannot be determined before runtime
due to conditional statements.
• Resource Dependence:
– Concerned with conflicts in using shared resources
including functional units (integer, floating point), memory
areas, among parallel tasks.
• Bernstein’s Conditions:
Two processes P1, P2 with input sets I1, I2 and output sets
O1, O2 can execute in parallel (denoted by P1 || P2) if:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
EECC722 - Shaaban
#41 lec # 3
Fall 2000 9-18-2000
Bernstein’s Conditions: An Example
• For the following five statements P1, P2, P3, P4, P5 in program order, assume:
– Each instruction requires one step to execute
– Two adders are available
P1: C = D × E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G ÷ E
Using Bernstein’s conditions after checking statement pairs:
P1 || P5 , P2 || P3 , P2 || P5 , P5 || P3 , P4 || P5
[Figure: dependence graph (data dependences as solid lines, resource dependences as dashed lines), sequential execution of P1-P5 in five steps, and parallel execution in three steps assuming two adders are available per step.]
EECC722 - Shaaban
#42 lec # 3
Fall 2000 9-18-2000
Asymptotic Notations for Algorithm Analysis
Asymptotic analysis of computing time of an algorithm f(n)
ignores constant execution factors and concentrates on
determining the order of magnitude of algorithm performance.
 Upper bound:
Used in worst case analysis of algorithm performance.
f(n) = O(g(n))
iff there exist two positive constants c and n0 such that
| f(n) | ≤ c | g(n) | for all n > n0
i.e. g(n) is an upper bound on f(n)
O(1) < O(log n) < O(n) < O(n log n) < O(n^2) < O(n^3) < O(2^n)
EECC722 - Shaaban
#43 lec # 3
Fall 2000 9-18-2000
Asymptotic Notations for Algorithm Analysis
 Lower bound:
Used in the analysis of the lower limit of algorithm performance
f(n) = Ω(g(n))
if there exist positive constants c, n0 such that
| f(n) | ≥ c | g(n) | for all n > n0
i.e. g(n) is a lower bound on f(n)
 Tight bound:
Used in finding a tight limit on algorithm performance
f(n) = Θ(g(n))
if there exist positive constants c1, c2, and n0 such that
c1 | g(n) | ≤ | f(n) | ≤ c2 | g(n) | for all n > n0
 i.e. g(n) is both an upper and lower bound on f(n)
EECC722 - Shaaban
#44 lec # 3
Fall 2000 9-18-2000
The Growth Rate of Common Computing Functions
n      log n   n log n   n^2     n^3      2^n
1      0       0         1       1        2
2      1       2         4       8        4
4      2       8         16      64       16
8      3       24        64      512      256
16     4       64        256     4096     65536
32     5       160       1024    32768    4294967296
EECC722 - Shaaban
#45 lec # 3
Fall 2000 9-18-2000
Theoretical Models of Parallel Computers
• Parallel Random-Access Machine (PRAM):
– n processor, global shared memory model.
– Models idealized parallel computers with zero synchronization or
memory access overhead.
– Utilized in parallel algorithm development and in scalability and
complexity analysis.
• PRAM variants: More realistic models than pure PRAM
– EREW-PRAM: Simultaneous memory reads or writes to/from
the same memory location are not allowed.
– CREW-PRAM: Simultaneous memory writes to the same
location are not allowed.
– ERCW-PRAM: Simultaneous reads from the same memory
location are not allowed.
– CRCW-PRAM: Concurrent reads or writes to/from the same
memory location are allowed.
EECC722 - Shaaban
#46 lec # 3
Fall 2000 9-18-2000
Example: sum algorithm on P processor PRAM
• Input: Array A of size n = 2^k in shared memory
• Initialized local variables:
– the order n
– the number of processors p = 2^q ≤ n
– the processor number s
• Output: The sum of the elements of A, stored in shared memory
begin
1. for j = 1 to l (= n/p) do
Set B(l(s - 1) + j) := A(l(s - 1) + j)
2. for h = 1 to log n do
2.1 if (k - h - q ≥ 0) then
for j = 2^(k-h-q)(s - 1) + 1 to 2^(k-h-q)s do
Set B(j) := B(2j - 1) + B(2j)
2.2 else {if (s ≤ 2^(k-h)) then
Set B(s) := B(2s - 1) + B(2s)}
3. if (s = 1) then set S := B(1)
end
Running time analysis:
• Step 1 takes O(n/p): each processor executes n/p operations
• The h-th iteration of step 2 takes O(n/(2^h p)), since each processor has
to perform ⌈n/(2^h p)⌉ operations
• Step 3 takes O(1)
• Total running time:
T_p = O(n/p) + O( Σ (h = 1 to log n) ⌈n/(2^h p)⌉ ) + O(log n) = O(n/p + log n)
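A sequential Python sketch of this logarithmic-tree summation (illustrative; it mimics what the p PRAM processors would do in each round rather than truly running them in parallel, and tallies the simulated parallel time steps):

# Illustrative sequential simulation of the tree-based PRAM sum above.
import math

def pram_sum(A, p):
    n = len(A)                            # assumes n = 2**k and p = 2**q <= n
    B = list(A)                           # step 1: copy A into B (n/p ops per processor)
    time_steps = n // p
    length = n
    for _ in range(int(math.log2(n))):    # step 2: log n combining rounds
        half = length // 2
        # each round pairs up B(2j-1) and B(2j) (1-indexed); with p processors
        # it costs about ceil(half / p) parallel time steps
        B[:half] = [B[2 * j] + B[2 * j + 1] for j in range(half)]
        time_steps += math.ceil(half / p)
        length = half
    return B[0], time_steps + 1           # step 3 (S := B(1)) takes one more step

total, t = pram_sum(list(range(1, 9)), p=4)
print(total, t)                           # 36, in roughly O(n/p + log n) steps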
EECC722 - Shaaban
#47 lec # 3
Fall 2000 9-18-2000
Example: Sum Algorithm on P Processor PRAM
For n = 8 and p = 4: processor allocation for computing the sum of 8 elements on a 4-processor PRAM.
[Figure: a binary summation tree. Time unit 1: B(1)=A(1), ..., B(8)=A(8) copied by processors P1, P1, P2, P2, P3, P3, P4, P4. Time unit 2: B(1)..B(4) computed by P1, P2, P3, P4. Time unit 3: B(1), B(2) computed by P1, P2. Time unit 4: B(1) computed by P1. Time unit 5: S = B(1) by P1. Each node of the tree is an operation executed by the processor indicated below it.]
EECC722 - Shaaban
#48 lec # 3
Fall 2000 9-18-2000
The Power of The PRAM Model
• Well-developed techniques and algorithms to handle
many computational problems exist for the PRAM model
• Removes algorithmic details regarding synchronization
and communication, concentrating on the structural
properties of the problem.
• Captures several important parameters of parallel
computations. Operations performed in unit time, as
well as processor allocation.
• The PRAM design paradigms are robust and many
network algorithms can be directly derived from PRAM
algorithms.
• It is possible to incorporate synchronization and
communication into the shared-memory PRAM model.
EECC722 - Shaaban
#49 lec # 3
Fall 2000 9-18-2000
Performance of Parallel Algorithms
• Performance of a parallel algorithm is typically measured
in terms of worst-case analysis.
• For problem Q with a PRAM algorithm that runs in time
T(n) using P(n) processors, for an instance size of n:
– The time-processor product C(n) = T(n) . P(n) represents the
cost of the parallel algorithm.
– For P < P(n), each of the of the T(n) parallel steps is
simulated in O(P(n)/p) substeps. Total simulation takes
O(T(n)P(n)/p)
– The following four measures of performance are
asymptotically equivalent:
•
•
•
•
P(n) processors and T(n) time
C(n) = P(n)T(n) cost and T(n) time
O(T(n)P(n)/p) time for any number of processors p < P(n)
O(C(n)/p + T(n)) time for any number of processors.
EECC722 - Shaaban
#50 lec # 3
Fall 2000 9-18-2000
Network Model of Message-Passing Multicomputers
• A network of processors can be viewed as a graph G(N, E)
– Each node i  N represents a processor
– Each edge (i,j)  E represents a two-way communication
link between processors i and j.
– Each processor is assumed to have its own local memory.
– No shared memory is available.
– Operation is synchronous or asynchronous(message
passing).
– Typical message-passing communication constructs:
• send(X,i) a copy of X is sent to processor Pi, execution
continues.
• receive(Y, j) execution suspended until the data from
processor Pj is received and stored in Y then execution
resumes.
EECC722 - Shaaban
#51 lec # 3
Fall 2000 9-18-2000
Network Model of Multicomputers
• Routing is concerned with delivering each message from
source to destination over the network.
• Additional important network topology parameters:
– The network diameter is the maximum distance between
any pair of nodes.
– The maximum degree of any node in G
• Example:
– Linear array: p processors P1, …, Pp are connected in a
linear array where:
• Processor Pi is connected to Pi-1 and Pi+1 if they exist.
• Diameter is p-1; maximum degree is 2
– A ring is a linear array of processors where processors P1
and Pp are directly connected.
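A small sketch of these two topologies in Python (illustrative, assuming p >= 3):

# Illustrative diameter and maximum degree of a linear array versus a ring.
def linear_array_stats(p):
    # node i connects to nodes i-1 and i+1 if they exist
    return {"diameter": p - 1, "max_degree": 2}

def ring_stats(p):
    # a ring adds a direct link between node 1 and node p, halving the diameter
    return {"diameter": p // 2, "max_degree": 2}

print(linear_array_stats(8))   # {'diameter': 7, 'max_degree': 2}
print(ring_stats(8))           # {'diameter': 4, 'max_degree': 2}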
EECC722 - Shaaban
#52 lec # 3
Fall 2000 9-18-2000
A Four-Dimensional Hypercube
• Two processors are connected if their binary indices
differ in one bit position.
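A sketch of the neighbor rule in Python (illustrative): flipping each of the d bits of a node's index enumerates its d neighbors.

# Illustrative hypercube neighbor computation.
def hypercube_neighbors(node, d):
    return [node ^ (1 << bit) for bit in range(d)]   # indices differing in one bit

print([format(x, "04b") for x in hypercube_neighbors(0b0101, d=4)])
# ['0100', '0111', '0001', '1101']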
EECC722 - Shaaban
#53 lec # 3
Fall 2000 9-18-2000
Example: Asynchronous Matrix Vector Product on a Ring
• Input:
– n x n matrix A; vector x of order n
– The processor number i; the number of processors p
– The ith submatrix B = A(1:n, (i-1)r + 1 : ir) of size n x r, where r = n/p
– The ith subvector w = x((i-1)r + 1 : ir) of size r
• Output:
– Processor Pi computes the vector y = A1x1 + … + Aixi and passes the result
to the right
– Upon completion P1 will hold the product Ax
Begin
1. Compute the matrix vector product z = Bw
2. If i = 1 then set y := 0
else receive(y, left)
3. Set y := y + z
4. send(y, right)
5. If i = 1 then receive(y, left)
End
Tcomp = k(n^2/p)
Tcomm = p(l + mn)
T = Tcomp + Tcomm = k(n^2/p) + p(l + mn)
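A sequential NumPy sketch of the same computation (illustrative; the ring communication is simulated by carrying y through the loop rather than by real send/receive calls):

# Illustrative simulation of the asynchronous matrix-vector product on a ring.
import numpy as np

def ring_matvec(A, x, p):
    n = len(x)
    r = n // p                          # assumes p divides n
    y = np.zeros(n)                     # the partial result circulating on the ring
    for i in range(p):                  # one ring hop per iteration
        B = A[:, i * r:(i + 1) * r]     # i-th column block of A (n x r)
        w = x[i * r:(i + 1) * r]        # i-th subvector of x (length r)
        z = B @ w                       # local computation z = B w on processor i
        y = y + z                       # "receive y from the left, add z, send right"
    return y                            # after the last hop, P1 holds A x

A = np.arange(16, dtype=float).reshape(4, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.allclose(ring_matvec(A, x, p=2), A @ x))   # True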
EECC722 - Shaaban
#54 lec # 3
Fall 2000 9-18-2000
Creating a Parallel Program
• Assumption: Sequential algorithm to solve problem is given
– Or a different algorithm with more inherent parallelism is devised.
– Most programming problems have several parallel solutions. The best
solution may differ from that suggested by existing sequential
algorithms.
One must:
– Identify work that can be done in parallel
– Partition work and perhaps data among processes
– Manage data access, communication and synchronization
– Note: work includes computation, data access and I/O
Main goal: Speedup (plus low programming effort and resource needs)
Speedup(p) = Performance(p) / Performance(1)
For a fixed problem:
Speedup(p) = Time(1) / Time(p)
EECC722 - Shaaban
#55 lec # 3
Fall 2000 9-18-2000
Some Important Concepts
Task:
– Arbitrary piece of undecomposed work in parallel computation
– Executed sequentially on a single processor; concurrency is only
across tasks
– E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
– Fine-grained versus coarse-grained tasks
Process (thread):
– Abstract entity that performs the tasks assigned to processes
– Processes communicate and synchronize to perform their tasks
Processor:
– Physical engine on which process executes
– Processes virtualize machine to programmer
• first write program in terms of processes, then map to processors
EECC722 - Shaaban
#56 lec # 3
Fall 2000 9-18-2000
Levels of Parallelism in Program Execution
[Figure: levels of parallelism in program execution, grouped by grain size, annotated with arrows for increasing communications demand and mapping/scheduling overhead and for higher degree of parallelism.]
Level 5: Jobs or programs (Multiprogramming) — coarse grain
Level 4: Subprograms, job steps or related parts of a program — coarse grain
Level 3: Procedures, subroutines, or co-routines — medium grain
Level 2: Non-recursive loops or unfolded iterations — fine grain
Level 1: Instructions or statements — fine grain
EECC722 - Shaaban
#57 lec # 3
Fall 2000 9-18-2000
Hardware and Software Parallelism
• Hardware parallelism:
– Defined by machine architecture, hardware multiplicity
(number of processors available) and connectivity.
– Often a function of cost/performance tradeoffs.
– Characterized in a single processor by the number of
instructions k issued in a single cycle (k-issue processor).
– A multiprocessor system with n k-issue processors can
handle a maximum of nk threads.
• Software parallelism:
– Defined by the control and data dependence of programs.
– Revealed in program profiling or program flow graph.
– A function of algorithm, programming style and compiler
optimization.
EECC722 - Shaaban
#58 lec # 3
Fall 2000 9-18-2000
Computational Parallelism and Grain Size
• Grain size (granularity) is a measure of the amount of
computation involved in a task in parallel computation :
– Instruction Level:
• At instruction or statement level.
• Grain size of 20 instructions or less.
• For scientific applications, parallelism at this level ranges from
500 to 3000 concurrent statements
• Manual parallelism detection is difficult but assisted by
parallelizing compilers.
– Loop level:
• Iterative loop operations.
• Typically, 500 instructions or less per iteration.
• Optimized on vector parallel computers.
• Independent successive loop operations can be vectorized or
run in SIMD mode.
EECC722 - Shaaban
#59 lec # 3
Fall 2000 9-18-2000
Computational Parallelism and Grain Size
– Procedure level:
• Medium-size grain; task, procedure, subroutine levels.
• Less than 2000 instructions.
• More difficult to detect parallelism than at finer-grain levels.
• Less communication requirements than fine-grain
parallelism.
• Relies heavily on effective operating system support.
– Subprogram level:
• Job and subprogram level.
• Thousands of instructions per grain.
• Often scheduled on message-passing multicomputers.
– Job (program) level, or Multiprogramming:
• Independent programs executed on a parallel computer.
• Grain size in tens of thousands of instructions.
EECC722 - Shaaban
#60 lec # 3
Fall 2000 9-18-2000
Example Motivating Problems:
Simulating Ocean Currents
(a) Cross sections
(b) Spatial discretization of a cross section
– Model as two-dimensional grids
– Discretize in space and time
• finer spatial and temporal resolution => greater accuracy
– Many different computations per time step
• set up and solve equations
– Concurrency across and within grid computations
EECC722 - Shaaban
#61 lec # 3
Fall 2000 9-18-2000
Example Motivating Problems:
Simulating Galaxy Evolution
– Simulate the interactions of many stars evolving over time
– Computing forces is expensive
– O(n2) brute force approach
– Hierarchical methods take advantage of the force law: F = G·m1·m2 / r^2
[Figure: the star on which forces are being computed; a star too close to approximate; a small group far enough away to approximate to its center of mass; a large group far enough away to approximate.]
• Many time-steps, plenty of concurrency across stars within one time-step
EECC722 - Shaaban
#62 lec # 3
Fall 2000 9-18-2000
Example Motivating Problems:
Rendering Scenes by Ray Tracing
– Shoot rays into scene through pixels in image plane
– Follow their paths
• They bounce around as they strike objects
• They generate new rays: ray tree per input ray
– Result is color and opacity for that pixel
– Parallelism across rays
• All above case studies have abundant concurrency
EECC722 - Shaaban
#63 lec # 3
Fall 2000 9-18-2000
Limited Concurrency: Amdahl’s Law
– Most fundamental limitation on parallel speedup.
– If fraction s of sequential execution is inherently serial,
speedup <= 1/s
– Example: 2-phase calculation
• sweep over n-by-n grid and do some independent computation
• sweep again and add each value to global sum
– Time for first phase = n^2/p
– Second phase serialized at global variable, so time = n^2
– Speedup <= 2n^2 / (n^2/p + n^2), or at most 2
– Possible Trick: divide second phase into two
• Accumulate into private sum during sweep
• Add per-process private sum into global sum
– Parallel time is n^2/p + n^2/p + p, and speedup at best
2n^2·p / (2n^2 + p^2)
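A small Python sketch of these bounds (illustrative; the numbers are made up to show the trend):

# Illustrative speedup bounds for the two-phase grid example and Amdahl's law.
def amdahl_bound(s):
    return 1.0 / s                                   # limit as p grows, serial fraction s

def speedup_serial_phase2(n, p):
    return 2 * n**2 / (n**2 / p + n**2)              # second phase fully serialized

def speedup_private_sums(n, p):
    return 2 * n**2 / (2 * n**2 / p + p)             # = 2*n**2*p / (2*n**2 + p**2)

n, p = 1000, 64
print(round(speedup_serial_phase2(n, p), 2))   # ~1.97: stuck near 2
print(round(speedup_private_sums(n, p), 2))    # ~63.9: close to p while p << n
print(amdahl_bound(0.5))                       # 2.0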
EECC722 - Shaaban
#64 lec # 3
Fall 2000 9-18-2000
Amdahl’s Law Example:
A Pictorial Depiction
[Figure: work done concurrently versus time for the three cases: (a) both phases executed serially (time 2n^2); (b) first phase parallel across p processors (time n^2/p), second phase serialized (time n^2); (c) both phases parallel (time n^2/p each) followed by serial accumulation of the p private sums (time p).]
EECC722 - Shaaban
#65 lec # 3
Fall 2000 9-18-2000
Parallel Performance Metrics
Degree of Parallelism (DOP)
• For a given time period, DOP reflects the number of processors in
a specific parallel computer actually executing a particular parallel
program.
• Average Parallelism:
– given maximum parallelism = m
– n homogeneous processors
– computing capacity of a single processor Δ
– Total amount of work W (instructions, computations):
W = Δ ∫ from t1 to t2 of DOP(t) dt, or as a discrete summation:
W = Δ Σ (i = 1 to m) i·t_i
where t_i is the total time that DOP = i, and Σ (i = 1 to m) t_i = t2 − t1
The average parallelism A:
A = (1/(t2 − t1)) ∫ from t1 to t2 of DOP(t) dt
In discrete form:
A = [Σ (i = 1 to m) i·t_i] / [Σ (i = 1 to m) t_i]
EECC722 - Shaaban
#66 lec # 3
Fall 2000 9-18-2000
Example: Concurrency Profile of
A Divide-and-Conquer Algorithm
A = [Σ (i = 1 to m) i·t_i] / [Σ (i = 1 to m) t_i]
Execution observed from t1 = 2 to t2 = 27
Peak parallelism m = 8
A = (1x5 + 2x3 + 3x4 + 4x6 + 5x2 + 6x2 + 8x3) / (5 + 3 + 4 + 6 + 2 + 2 + 3)
= 93/25 = 3.72
[Figure: concurrency profile of the divide-and-conquer algorithm: degree of parallelism (DOP) plotted against time from t1 = 2 to t2 = 27.]
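A one-function Python sketch reproducing the 93/25 calculation above (illustrative):

# Illustrative average-parallelism computation from a concurrency profile.
def average_parallelism(profile):
    # profile maps each DOP value i to the total time t_i spent at that DOP
    work = sum(i * t for i, t in profile.items())    # sum of i * t_i
    time = sum(profile.values())                     # sum of t_i = t2 - t1
    return work / time

profile = {1: 5, 2: 3, 3: 4, 4: 6, 5: 2, 6: 2, 8: 3}
print(average_parallelism(profile))                  # 3.72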
EECC722 - Shaaban
#67 lec # 3
Fall 2000 9-18-2000
Parallel Performance Example
•
The execution time T for three parallel programs is given in terms of processor
count P and problem size N
• In each case, we assume that the total computation work performed by
an optimal sequential algorithm scales as N + N^2.
1 For the first parallel algorithm: T = N + N^2/P
This algorithm partitions the computationally demanding O(N^2) component of
the algorithm but replicates the O(N) component on every processor. There are
no other sources of overhead.
2 For the second parallel algorithm: T = (N + N^2)/P + 100
This algorithm optimally divides all the computation among all processors but
introduces an additional cost of 100.
3 For the third parallel algorithm: T = (N + N^2)/P + 0.6P^2
This algorithm also partitions all the computation optimally but introduces
an additional cost of 0.6P^2.
• All three algorithms achieve a speedup of about 10.8 when P = 12 and N = 100. However,
they behave differently in other situations as shown next.
• With N = 100, all three algorithms perform poorly for larger P, although Algorithm (3)
does noticeably worse than the other two.
• When N = 1000, Algorithm (2) is much better than Algorithm (1) for larger P.
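The behavior described above can be checked with a short Python sketch (illustrative):

# Illustrative speedups of the three model algorithms, baseline work = N + N^2.
def speedups(N, P):
    seq = N + N**2
    t1 = N + N**2 / P                  # Algorithm 1
    t2 = (N + N**2) / P + 100          # Algorithm 2
    t3 = (N + N**2) / P + 0.6 * P**2   # Algorithm 3
    return [round(seq / t, 1) for t in (t1, t2, t3)]

print(speedups(N=100, P=12))     # all about 10.8
print(speedups(N=100, P=128))    # all well below P; Algorithm 3 far worse
print(speedups(N=1000, P=128))   # Algorithm 2 clearly ahead of Algorithm 1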
EECC722 - Shaaban
#68 lec # 3
Fall 2000 9-18-2000
Parallel
Performance
Example
(continued)
All algorithms achieve Speedup ≈ 10.8 when P = 12 and N = 100.
With N = 1000, Algorithm (2) performs much better than Algorithm (1) for larger P.
Algorithm 1: T = N + N^2/P
Algorithm 2: T = (N + N^2)/P + 100
Algorithm 3: T = (N + N^2)/P + 0.6P^2
EECC722 - Shaaban
#69 lec # 3
Fall 2000 9-18-2000
Steps in Creating a Parallel Program
[Figure: the sequential computation is Decomposed into tasks, the tasks are Assigned to processes p0-p3, Orchestration produces the parallel program, and Mapping places the processes on processors P0-P3. Decomposition and assignment together constitute partitioning.]
• 4 steps:
Decomposition, Assignment, Orchestration, Mapping
– Done by programmer or system software (compiler,
runtime, ...)
– Issues are the same, so assume programmer does it all
explicitly
EECC722 - Shaaban
#70 lec # 3
Fall 2000 9-18-2000
Decomposition
• Break up computation into concurrent tasks to be divided among
processes:
– Tasks may become available dynamically.
– No. of available tasks may vary with time.
– Together with assignment, also called partitioning.
i.e. identify concurrency and decide level at which to exploit it.
• Grain-size problem:
– To determine the number and size of grains or tasks in a parallel
program.
– Problem and machine-dependent.
– Solutions involve tradeoffs between parallelism, communication and
scheduling/synchronization overhead.
• Grain packing:
– To combine multiple fine-grain nodes into a coarse grain node (task)
to reduce communication delays and overall scheduling overhead.
Goal: Enough tasks to keep processes busy, but not too many
– No. of tasks available at a time is upper bound on achievable speedup
EECC722 - Shaaban
#71 lec # 3
Fall 2000 9-18-2000
Assignment
• Specifying mechanisms to divide work up among processes:
– Together with decomposition, also called partitioning.
– Balance workload, reduce communication and management cost
• Partitioning problem:
– To partition a program into parallel branches, modules to give
the shortest possible execution on a specific parallel architecture.
• Structured approaches usually work well:
– Code inspection (parallel loops) or understanding of application.
– Well-known heuristics.
– Static versus dynamic assignment.
• As programmers, we worry about partitioning first:
– Usually independent of architecture or programming model.
– But cost and complexity of using primitives may affect decisions.
EECC722 - Shaaban
#72 lec # 3
Fall 2000 9-18-2000
Orchestration
– Naming data.
– Structuring communication.
– Synchronization.
– Organizing data structures and scheduling tasks temporally.
• Goals
–
–
–
–
Reduce cost of communication and synch. as seen by processors
– Preserve locality of data reference (incl. data structure organization)
Schedule tasks to satisfy dependences early
Reduce overhead of parallelism management
• Closest to architecture (and programming model &
language).
– Choices depend a lot on comm. abstraction, efficiency of primitives.
– Architects should provide appropriate primitives efficiently.
EECC722 - Shaaban
#73 lec # 3
Fall 2000 9-18-2000
Mapping
• Each task is assigned to a processor in a manner that attempts to
satisfy the competing goals of maximizing processor utilization and
minimizing communication costs.
• Mapping can be specified statically or determined at runtime by
load-balancing algorithms (dynamic scheduling).
• Two aspects of mapping:
– Which processes will run on the same processor, if necessary
– Which process runs on which particular processor
• mapping to a network topology
• One extreme: space-sharing
– Machine divided into subsets, only one app at a time in a subset
– Processes can be pinned to processors, or left to OS.
• Another extreme: complete resource management control to OS
– OS uses the performance techniques we will discuss later.
• Real world is between the two.
– User specifies desires in some aspects, system may ignore
EECC722 - Shaaban
#74 lec # 3
Fall 2000 9-18-2000
Program Partitioning Example
Example 2.4 page 64
Fig 2.6 page 65
Fig 2.7 page 66
In Advanced Computer
Architecture, Hwang
EECC722 - Shaaban
#75 lec # 3
Fall 2000 9-18-2000
Static Multiprocessor Scheduling
Dynamic multiprocessor scheduling is an NP-hard problem.
Node Duplication: to eliminate idle time and communication delays, some
nodes may be duplicated in more than one processor.
Fig. 2.8 page 67
Example: 2.5 page 68
In Advanced Computer
Architecture, Hwang
EECC722 - Shaaban
#76 lec # 3
Fall 2000 9-18-2000
Table 2.1 Steps in the Parallelization Process and Their Goals
Step            Architecture-Dependent?   Major Performance Goals
Decomposition   Mostly no                 Expose enough concurrency but not too much
Assignment      Mostly no                 Balance workload; reduce communication volume
Orchestration   Yes                       Reduce noninherent communication via data locality;
                                          reduce communication and synchronization cost as seen
                                          by the processor; reduce serialization at shared
                                          resources; schedule tasks to satisfy dependences early
Mapping         Yes                       Put related processes on the same processor if
                                          necessary; exploit locality in network topology
EECC722 - Shaaban
#77 lec # 3
Fall 2000 9-18-2000
Successive Refinement
Partitioning is often independent of architecture, and may be
done first:
– View machine as a collection of communicating processors
• Balancing the workload.
• Reducing the amount of inherent communication
• Reducing extra work.
– Above three issues are conflicting.
Then deal with interactions with architecture:
– View machine as an extended memory hierarchy
• Extra communication due to architectural interactions.
• Cost of communication depends on how it is structured
– This may inspire changes in partitioning.
EECC722 - Shaaban
#78 lec # 3
Fall 2000 9-18-2000
Partitioning for Performance
• Balancing the workload and reducing wait time at synch
points
• Reducing inherent communication.
• Reducing extra work.
These algorithmic issues have extreme trade-offs:
– Minimize communication => run on 1 processor.
=> extreme load imbalance.
– Maximize load balance => random assignment of tiny tasks.
=> no control over communication.
– Good partition may imply extra work to compute or manage it
• The goal is to compromise between the above extremes
– Fortunately, often not difficult in practice.
EECC722 - Shaaban
#79 lec # 3
Fall 2000 9-18-2000
Load Balancing and Synch Wait
Time Reduction
Limit on speedup:
Speedupproblem(p) 
Sequential Work
Max Work on any Processor
– Work includes data access and other costs.
– Not just equal work, but must be busy at same time.
Four parts to load balancing and reducing synch wait time:
1. Identify enough concurrency.
2. Decide how to manage it.
3. Determine the granularity at which to exploit it
4. Reduce serialization and cost of synchronization
EECC722 - Shaaban
#80 lec # 3
Fall 2000 9-18-2000
Managing Concurrency
Static versus Dynamic techniques
Static:
– Algorithmic assignment based on input; won’t change
– Low runtime overhead
– Computation must be predictable
– Preferable when applicable (except in
multiprogrammed/heterogeneous environment)
Dynamic:
– Adapt at runtime to balance load
– Can increase communication and reduce locality
– Can increase task management overheads
EECC722 - Shaaban
#81 lec # 3
Fall 2000 9-18-2000
Dynamic Load Balancing
• To achieve best performance of a parallel computing system running
a parallel problem, it’s essential to maximize processor utilization by
distributing the computation load evenly or balancing the load
among the available processors.
• Optimal static load balancing, optimal mapping or scheduling, is an
intractable NP-complete problem, except for specific problems on
specific networks.
• Hence heuristics are usually used to select processors for processes.
• Even the best static mapping may not offer the best execution time due to
changing conditions at runtime, and the process may need to be done
dynamically.
• The methods used for balancing the computational load dynamically
among processors can be broadly classified as:
1. Centralized dynamic load balancing.
2. Decentralized dynamic load balancing.
EECC722 - Shaaban
#82 lec # 3
Fall 2000 9-18-2000
Processor Load Balance
& Performance
EECC722 - Shaaban
#83 lec # 3
Fall 2000 9-18-2000
Dynamic Tasking with Task Queues
Centralized versus distributed queues.
Task stealing with distributed queues.
– Can compromise communication and locality, and increase
synchronization.
– Whom to steal from, how many tasks to steal, ...
– Termination detection
– Maximum imbalance related to size of task
[Figure: (a) a centralized task queue Q, into which all processes insert tasks and from which all remove tasks; (b) distributed task queues Q0-Q3, one per process: each process Pi inserts into and removes from its own queue, and others may steal from it.]
EECC722 - Shaaban
#84 lec # 3
Fall 2000 9-18-2000
Implications of Load Balancing
Extends speedup limit expression to:
Speedup_problem(p) ≤ Sequential Work / Max (Work + Synch Wait Time)
Generally, responsibility of software
Architecture can support task stealing and synch efficiently
– Fine-grained communication, low-overhead access to queues
• Efficient support allows smaller tasks, better load balancing
– Naming logically shared data in the presence of task stealing
• Need to access data of stolen tasks, esp. multiply-stolen tasks
=> Hardware shared address space advantageous
– Efficient support for point-to-point communication
EECC722 - Shaaban
#85 lec # 3
Fall 2000 9-18-2000
Reducing Inherent Communication
Measure: communication to computation ratio
Focus here is on inherent communication
– Determined by assignment of tasks to processes
– Actual communication can be greater
• Assign tasks that access same data to same process
• Finding an optimal solution that both reduces communication and
achieves an optimal load balance is NP-hard in the
general case
• Simple heuristic solutions work well in practice:
– Due to specific structure of applications.
EECC722 - Shaaban
#86 lec # 3
Fall 2000 9-18-2000
Implications of Communication-to-Computation Ratio
• Architects must examine application needs
• If denominator is execution time, ratio gives average BW needs
• If operation count, gives extremes in impact of latency and
bandwidth
– Latency: assume no latency hiding
– Bandwidth: assume all latency hidden
– Reality is somewhere in between
• Actual impact of communication depends on structure and
cost as well:
Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost)
– Need to keep communication balanced across processors as
well.
EECC722 - Shaaban
#87 lec # 3
Fall 2000 9-18-2000
Reducing Extra Work (Overheads)
• Common sources of extra work:
– Computing a good partition
e.g. partitioning in Barnes-Hut or sparse matrix
– Using redundant computation to avoid communication
– Task, data and process management overhead
• Applications, languages, runtime systems, OS
– Imposing structure on communication
• Coalescing messages, allowing effective naming
• Architectural Implications:
– Reduce need by making communication and orchestration
efficient
Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
EECC722 - Shaaban
#88 lec # 3
Fall 2000 9-18-2000
Extended Memory-Hierarchy View of
Multiprocessors
• Levels in extended hierarchy:
– Registers, caches, local memory, remote memory
(topology)
– Glued together by communication architecture
– Levels communicate at a certain granularity of data
transfer
• Need to exploit spatial and temporal locality in hierarchy
– Otherwise extra communication may also be caused
– Especially important since communication is expensive
EECC722 - Shaaban
#89 lec # 3
Fall 2000 9-18-2000
Extended Hierarchy
• Idealized view: local cache hierarchy + single main memory
• But reality is more complex:
– Centralized Memory: caches of other processors
– Distributed Memory: some local, some remote; + network
topology
– Management of levels:
• Caches managed by hardware
• Main memory depends on programming model
– SAS: data movement between local and remote transparent
– Message passing: explicit
– Improve performance through architecture or program
locality
– Tradeoff with parallelism; need good node performance and
parallelism
EECC722 - Shaaban
#90 lec # 3
Fall 2000 9-18-2000
Artifactual Communication in Extended Hierarchy
Accesses not satisfied in local portion cause communication
– Inherent communication, implicit or explicit, causes
transfers
• Determined by program
– Artifactual communication:
• Determined by program implementation and architecture interactions
• Poor allocation of data across distributed memories
• Unnecessary data in a transfer
• Unnecessary transfers due to system granularities
• Redundant communication of data
• Finite replication capacity (in cache or main memory)
– Inherent communication assumes unlimited capacity, small
transfers, perfect knowledge of what is needed.
– More on artifactual communication later; first consider
replication-induced communication further
EECC722 - Shaaban
#91 lec # 3
Fall 2000 9-18-2000
Structuring Communication
Given amount of comm (inherent or artifactual), goal is to reduce cost
• Cost of communication as seen by process:
C = f * ( o + l + (n_c / m) / B + t_c − overlap )
where:
• f = frequency of messages
• o = overhead per message (at both ends)
• l = network delay per message
• n_c = total data sent
• m = number of messages
• B = bandwidth along path (determined by network, NI, assist)
• t_c = cost induced by contention per message
• overlap = amount of latency hidden by overlap with comp. or comm.
– Portion in parentheses is cost of a message (as seen by processor)
– That portion, ignoring overlap, is latency of a message
– Goal: reduce terms in latency and increase overlap
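A direct transcription of this cost expression into Python (illustrative; the example parameter values are made up):

# Illustrative evaluation of C = f * (o + l + (n_c/m)/B + t_c - overlap).
def comm_cost(f, o, l, n_c, m, B, t_c, overlap):
    per_message = o + l + (n_c / m) / B + t_c - overlap   # latency of one message
    return f * per_message

# e.g. 1000 messages, 5 us overhead, 2 us delay, 8 MB total over 1000 messages,
# 1 GB/s bandwidth, 1 us contention, no overlap:
print(comm_cost(f=1000, o=5e-6, l=2e-6, n_c=8e6, m=1000, B=1e9,
                t_c=1e-6, overlap=0.0))                   # 0.016 (seconds)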
EECC722 - Shaaban
#92 lec # 3
Fall 2000 9-18-2000
Reducing Overhead
• Can reduce no. of messages m or overhead per message o
• o is usually determined by hardware or system software
– Program should try to reduce m by coalescing messages
– More control when communication is explicit
• Coalescing data into larger messages:
– Easy for regular, coarse-grained communication
– Can be difficult for irregular, naturally fine-grained
communication
• May require changes to algorithm and extra work
– coalescing data and determining what and to whom to send
• Will discuss more in implications for programming models
later
EECC722 - Shaaban
#93 lec # 3
Fall 2000 9-18-2000
Reducing Network Delay
• Network delay component = f*h*th
• h = number of hops traversed in network
• th = link+switch latency per hop
• Reducing f: Communicate less, or make messages larger
• Reducing h:
– Map communication patterns to network topology
e.g. nearest-neighbor on mesh and ring; all-to-all
– How important is this?
• Used to be a major focus of parallel algorithms
• Depends on no. of processors and how th compares with other
components
• Less important on modern machines
– Overheads, processor count, multiprogramming
EECC722 - Shaaban
#94 lec # 3
Fall 2000 9-18-2000
Overlapping Communication
• Cannot afford to stall for high latencies
• Overlap with computation or communication to hide
latency
• Requires extra concurrency (slackness), higher
bandwidth
• Techniques:
– Prefetching
– Block data transfer
– Proceeding past communication
– Multithreading
EECC722 - Shaaban
#95 lec # 3
Fall 2000 9-18-2000
Summary of Tradeoffs
• Different goals often have conflicting demands
– Load Balance
• Fine-grain tasks
• Random or dynamic assignment
– Communication
• Usually coarse grain tasks
• Decompose to obtain locality: not random/dynamic
– Extra Work
• Coarse grain tasks
• Simple assignment
– Communication Cost:
• Big transfers: amortize overhead and latency
• Small transfers: reduce contention
EECC722 - Shaaban
#96 lec # 3
Fall 2000 9-18-2000
Relationship Between Perspectives
Processor time component   Parallelization step(s)                  Performance issue
Synch wait                 Decomposition/assignment/orchestration   Load imbalance and synchronization
Busy-overhead              Decomposition/assignment                 Extra work
Data-remote                Decomposition/assignment                 Inherent communication volume
Data-local                 Orchestration                            Artifactual communication and data locality
                           Orchestration/mapping                    Communication structure
EECC722 - Shaaban
#97 lec # 3
Fall 2000 9-18-2000
Summary
Speedup_prob(p) = [Busy(1) + Data(1)] /
[Busy_useful(p) + Data_local(p) + Synch(p) + Data_remote(p) + Busy_overhead(p)]
– Goal is to reduce denominator components
– Both programmer and system have role to play
– Architecture cannot do much about load imbalance or too
much communication
– But it can:
• reduce incentive for creating ill-behaved programs (efficient
naming, communication and synchronization)
• reduce artifactual communication
• provide efficient naming for flexible assignment
• allow effective overlapping of communication
EECC722 - Shaaban
#98 lec # 3
Fall 2000 9-18-2000
Generic Distributed Memory
Organization
[Figure: a generic distributed-memory organization. A scalable network of switches connects nodes; each node is an O(10)-processor bus-based SMP containing processors P with caches $, memory M, and a communication assist (CA) that issues message transactions into the network.]
Open design questions annotated on the figure:
• OS supported? Network protocols? Multi-stage interconnection network (MIN)? Custom-designed?
• Communication assist: extent of functionality? Global virtual shared address space? DMA?
• Node: custom-designed CPU? Node/system integration level? How far: Cray-on-a-Chip? SMP-on-a-Chip?
• Network bandwidth? Bandwidth demand for independent versus communicating processes?
• Latency: O(log2 P) increase? Cost scalability of the system?
EECC722 - Shaaban
#99 lec # 3
Fall 2000 9-18-2000