Fundamental Issues in Parallel and Distributed Computing
Assaf Schuster, Computer Science, Technion
Serial vs. parallel program
Serial: one instruction at a time
Parallel: multiple instructions in parallel
(Task = process)
Levels of parallelism
• Transistors
• Hardware modules
• Architecture, pipeline
• Superscalar
• Multicore
• Symmetric multi-processors
• Hardware connected multi-processors
• Tightly-coupled machine clusters
• Distributed systems
• Geo-distributed systems
• Internet-scale systems
• Interstellar systems
Flynn's HARDWARE Taxonomy
SIMD: lock-step execution; example: vector operations
MIMD: example: multicores
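As a rough illustration (a sketch, not from the slides): a loop like the one below applies one operation to many data elements, and a vectorizing compiler can map it onto SIMD vector instructions that process several elements per instruction.

/* One operation, many data elements: a vectorizing compiler can turn
   this loop into SIMD instructions that handle several i's at once. */
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}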
Why parallel programming?
Most software will have to be written as parallel software.
Why?
The free lunch ... wait 18 months and get a new CPU with 2x the speed.
Moore's law: #transistors per chip ≈ 2^(years/1.5)
[Figure: Moore's law transistor-count plot. Source: Wikipedia]
... is over:
more transistors = more execution units,
BUT performance per unit does not improve.
Bad news: serial programs will not run faster.
Good news: parallelization has high performance potential.
May get faster on newer architectures.
Parallel programming is hard
Need to optimize for performance
Understand management of resources
Identify bottlenecks
No one technology fits all needs
Zoo of programming models, languages, run-times
Hardware architecture is a moving target
Parallel thinking is NOT intuitive
Parallel debugging is not fun
But there is no detour
Parallelizing Game of Life
(Cellular automaton of pixels)
Given a 2D grid
v_t(i,j) = F(v_{t-1} of all of (i,j)'s neighbors)
[Figure: cell (i,j) and its neighbors (i-1,j), (i+1,j), (i,j-1), (i,j+1)]
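A minimal serial sketch of one time step under the 4-neighbor stencil drawn above; F here is a placeholder rule (the real Game of Life rule also looks at the cell itself and its diagonal neighbors), and the grid size N is an arbitrary choice.

#define N 512

static int F(int up, int down, int left, int right) {
    return (up + down + left + right) >= 2;          /* placeholder rule, not Life's */
}

/* One generation: read grid "cur" (time t-1), write grid "next" (time t). */
void step(int cur[N][N], int next[N][N]) {
    for (int i = 1; i < N - 1; i++)                  /* interior cells only */
        for (int j = 1; j < N - 1; j++)
            next[i][j] = F(cur[i-1][j], cur[i+1][j],
                           cur[i][j-1], cur[i][j+1]);
}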
Problem partitioning
• Domain decomposition (SPMD): input domain, output domain, or both
• Functional decomposition (MPMD): independent tasks, pipelining
We choose: domain decomposition
The field is split between processors
[Figure: the grid split between CPU 0 and CPU 1; the stencil of cell (i,j) crosses the boundary between them]
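A minimal sketch of how such a split can be computed, assuming a row-wise decomposition of an N-row grid among P workers (the names and the row-wise choice are assumptions, not from the slides); every worker runs the same code on its own range, in SPMD style.

/* Worker p of P owns rows [*first, *last) of an N-row grid. */
void my_rows(int N, int P, int p, int *first, int *last) {
    int base  = N / P;                              /* rows every worker gets */
    int extra = N % P;                              /* leftover rows go to the first workers */
    *first = p * base + (p < extra ? p : extra);
    *last  = *first + base + (p < extra ? 1 : 0);
}

For example, with N = 10 and P = 3 the workers get rows [0,4), [4,7), [7,10).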
Issue 1. Memory
Can we access v(i+1,j) from CPU 0 as in the serial program?
It depends...
YES: Shared memory space architecture: same memory space
[Figure: CPU 0 and CPU 1 both connected to one shared Memory]
NO: Distributed memory space architecture: disjoint memory spaces
[Figure: CPU 0 and CPU 1, each with its own Memory, connected by a Network]
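On the distributed-memory architecture, the value of v(i+1,j) has to be sent explicitly over the network. A hedged MPI-style sketch, assuming CPU 1 owns row i+1 and CPU 0 owns row i (COLS and the function name are placeholders):

#include <mpi.h>

#define COLS 1024

/* CPU 1 sends its boundary row; CPU 0 receives it into a local "ghost"
   row so the stencil at the boundary can then be computed locally. */
void exchange_boundary(int rank, int my_boundary_row[COLS], int ghost_row[COLS]) {
    if (rank == 1)
        MPI_Send(my_boundary_row, COLS, MPI_INT, 0, 0, MPI_COMM_WORLD);
    else if (rank == 0)
        MPI_Recv(ghost_row, COLS, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
}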
Someone has to pay
Shared memory: easier to program, harder to build the hardware
Distributed memory: harder to program, easier to build the hardware
Tradeoff: Programmability vs. Scalability
“Scalability” – the holy grail
Improving efficiency on larger problems when more
processors are available.
• Single machine, multicore - vertical scalability
• Multiple machines – horizontal scalability, scale out
Memory architecture – latency in accessing data
CPU 0: Is the time to access v(i+1,j) the same as the time to access v(i-1,j)?
Hardware shared memory, flavors 1
Uniform memory access: UMA
Same cost of accessing any data by all processors
SMP = Symmetric Multi-Processor
Hardware shared memory, flavors 2
NON-Uniform memory access: NUMA
Tradeoff: Scalability vs. Latency
Software Distributed Shared Memory (SDSM)
[Figure: Process 1 on CPU 0 and Process 2 on CPU 1, each with its own Memory, connected by a Network, with DSM daemons on both sides]
Software DSM: emulation of NUMA in software over a distributed memory space
Memory-optimized programming
Most modern systems are NUMA or distributed
Architecture is easier to scale up by scaling out
Access time difference, local vs. remote data: x100-10000
Locality: the most important optimization parameter for program speed, serial or parallel
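A small illustration of locality as an optimization parameter (sizes and names are assumptions): C stores grid[i][j] row by row, so the first loop below walks consecutive addresses and uses every fetched cache line fully, while the second strides across rows and misses the cache far more often, whether the code is serial or parallel.

#define N 2048
static double grid[N][N];

double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)   /* j varies fastest: contiguous accesses */
            s += grid[i][j];
    return s;
}

double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)   /* i varies fastest: stride of N doubles */
            s += grid[i][j];
    return s;
}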
Issue 2: Control
Can we assign one pixel per CPU?
Can we assign one pixel per process/logical task?
Task management overhead
Each task has a state that should be managed
More tasks – more state to manage
Who manages tasks?
How many tasks should be run by a CPU?
… depends on the complexity of F
v(i,j) = F(all of (i,j)'s neighbors)
Question
Every process reads the data from its neighbors
Will it produce correct results?
Issue 3: Synchronization
The order of reads and writes made in different
tasks is non-deterministic
Synchronization is required to enforce the order
Locks, semaphores, barriers, condition variables
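For the Game of Life decomposition, a barrier is the natural fit: no task may start computing generation t+1 until every task has finished writing generation t. A minimal pthreads sketch, where compute_my_rows(), swap_my_buffers(), STEPS, and the barrier initialization are placeholders assumed to be provided elsewhere:

#include <pthread.h>

#define STEPS 100

pthread_barrier_t bar;              /* initialized once, e.g.
                                       pthread_barrier_init(&bar, NULL, NUM_TASKS); */

void compute_my_rows(void);         /* placeholder: update this task's rows of the next generation */
void swap_my_buffers(void);         /* placeholder: make the new generation the current one */

void *worker(void *arg) {
    for (int t = 0; t < STEPS; t++) {
        compute_my_rows();          /* reads generation t, writes generation t+1 */
        pthread_barrier_wait(&bar); /* no task proceeds until all writes of t+1 are done */
        swap_my_buffers();
        pthread_barrier_wait(&bar); /* no task starts t+2 before everyone has swapped */
    }
    return arg;
}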
Checkpoint
Fundamental hardware-related issues:
Memory accesses
Control
Optimizing locality of accesses
Overhead
Synchronization
Goals and tradeoffs:
Ease of Programming, Correctness, Scalability
Parallel programming issues
We decide to split this 3x3 grid like this:
[Figure: the 3x3 neighborhood of (i,j) split among CPU 0, CPU 1, CPU 2, and CPU 4]
• OK?
Issue 1: Load balancing
[Figure: the same uneven split of the 3x3 neighborhood among CPU 0, CPU 1, CPU 2, and CPU 4]
Always waiting for the slowest task
Solutions?
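One common solution (a hedged sketch, names are illustrative): replace the fixed static split with a shared pool of work. Each CPU atomically claims the next unprocessed block when it finishes its current one, so faster CPUs simply end up doing more blocks.

#include <stdatomic.h>

#define N_BLOCKS 1024                /* total work units, e.g. blocks of rows */

atomic_int next_block;               /* shared: index of the next unclaimed block */

void process_block(int b);           /* placeholder: the real computation */

void *worker(void *arg) {
    for (;;) {
        int b = atomic_fetch_add(&next_block, 1);   /* claim one block atomically */
        if (b >= N_BLOCKS)
            break;                                  /* no work left */
        process_block(b);
    }
    return arg;
}

The block size here is exactly the granularity knob discussed next.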
Issue 2: Granularity
G=Computation/Communication
Fine-grain parallelism (small G)
Good load balancing
Potentially high overhead
Coarse-grain parallelism (large G)
Potentially bad load balancing
Low overhead
What granularity works for you?
Must balance overhead and CPU utilization
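A worked example with assumed, illustrative costs: if updating one cell takes 10 ns and communicating one cell's value takes 1 µs, then one-cell tasks give G ≈ 0.01 (fine grain, overhead dominates), while a 1000x1000 block computes for about 10 ms and communicates only its ~4000 boundary cells (~4 ms), giving G ≈ 2.5 (coarser grain, far lower overhead per unit of work).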
Issue 3: Memory Models
Initially a=0, b=0
Task 1: a=1; Print(b)
Task 2: b=1; Print(a)
Which writes by a task are seen by which reads of the other tasks?
Printed: 0,0? Printed: 1,0? Printed: 1,1?
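A hedged sketch of the same question in code, using POSIX threads and C11 atomics (names are illustrative, not from the slides). With relaxed atomics the language-level memory model itself allows 0,0 to be printed; under a sequentially consistent model it would not be.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int a = 0, b = 0;

void *task1(void *arg) {
    atomic_store_explicit(&a, 1, memory_order_relaxed);            /* a = 1 */
    printf("%d", atomic_load_explicit(&b, memory_order_relaxed));  /* Print(b) */
    return arg;
}

void *task2(void *arg) {
    atomic_store_explicit(&b, 1, memory_order_relaxed);            /* b = 1 */
    printf("%d", atomic_load_explicit(&a, memory_order_relaxed));  /* Print(a) */
    return arg;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task1, NULL);
    pthread_create(&t2, NULL, task2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("\n");
    return 0;
}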
Memory Consistency Models
Example program:
Pi: R V; W V,7; R V; R V
Pj: R V; W V,13; R V; R V
A consistency/memory model is an “agreement” between the execution environment (H/W, OS, middleware) and the processes: the runtime guarantees to the application certain properties about the way values written to shared variables become visible to reads. This determines the memory model: what is valid and what is not.
Example execution:
Pi: R V,0; W V,7; R V,7; R V,13
Pj: R V,0; W V,13; R V,13; R V,7
Order of writes to V as seen by Pi: (1) W V,7; (2) W V,13
Order of writes to V as seen by Pj: (1) W V,13; (2) W V,7
Memory Model: Coherence
Coherence is the memory model in which (the system
guarantees to the program that) writes performed by the
processes for every specific variable are viewed by all processes
in the same order.
Example program:
Pi: W V,7; R V; R V
Pj: W V,13; R V; R V
All valid executions under Coherence:
(1) Pi: W V,7; R V,7; R V,13    Pj: W V,13; R V,13; R V,13
(2) Pi: W V,7; R V,7; R V,7     Pj: W V,13; R V,13; R V,7
(3) Pi: W V,7; R V,13; R V,13   Pj: W V,13; R V,13; R V,13
(4) Pi: W V,7; R V,7; R V,7     Pj: W V,13; R V,7; R V,7
(5) Pi: W V,7; R V,7; R V,7     Pj: W V,13; R V,13; R V,13
The Register Property: the view of a process consists of the values it “sees” in its reads and the writes it performs. If an R V in P that comes later than a W V,x in P sees a value different from x, then a later R V cannot see x.
Tradeoff: Memory Model/Programmability vs. Scalability
Summary
Parallelism and scalability do not come for free
Overhead
Memory-aware access
Synchronization
Load balancing
Granularity
Memory Models
Is it worth the trouble?
Depends on how hard the above are vs. the potential gain
How would you know?
Some tasks cannot be parallelized
“Speedup” – yet another holy grail
Speedup = (time of best available serial program) / (time of parallel program)
An upper bound on the speedup
Amdahl's law
Sequential component limits the speedup
Split the program's serial time T_serial = 1 into:
• Ideally parallelizable: A
• Cannot be parallelized: 1 - A
Ideal parallel time: T_parallel = A/#CPUs + (1 - A)
Ideal speedup(#CPUs) = T_serial / T_parallel <= 1 / (A/#CPUs + (1 - A))
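A worked instance of the bound, assuming A = 0.95: with 8 CPUs the ideal speedup is 1/(0.95/8 + 0.05) ≈ 5.9, and even with infinitely many CPUs it can never exceed 1/(1-A) = 20. A minimal sketch that tabulates the bound:

#include <stdio.h>

/* Amdahl's ideal speedup bound for parallelizable fraction A on 'cpus' CPUs. */
double amdahl_speedup(double A, int cpus) {
    return 1.0 / (A / cpus + (1.0 - A));     /* T_serial / T_parallel */
}

int main(void) {
    double A = 0.95;                          /* assumed: 95% parallelizable */
    for (int p = 1; p <= 1024; p *= 4)
        printf("%5d CPUs -> ideal speedup <= %.2f\n", p, amdahl_speedup(A, p));
    /* even with infinitely many CPUs the speedup never exceeds 1/(1-A) = 20 */
    return 0;
}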
Bad news
So why do we need machines with 1000x CPUs?
[Figure: Amdahl's law speedup curves. Source: Wikipedia]
The larger the problem, the smaller the serial part,
and the closer the simulation gets to the best serial program.
Super-linear speedups
• Do they exist?
True Super-Linear Speedups
• Suppose the data does not fit in a single CPU's cache
• But after domain decomposition, each part does
Debugging Parallel Programs
General-Purpose Computing on Graphics Processing Units (GPU)
NVIDIA's Streaming MultiProcessor
CUDA = Compute Unified Device Architecture
NVIDIA's Kepler = 2 Tera-instructions/sec (real)