Fundamental Issues in Parallel and Distributed Computing

Assaf Schuster, Computer Science, Technion
Serial vs. parallel program

Serial: one instruction at a time (task = process)

Parallel: multiple instructions executed in parallel
Levels of parallelism
• Transistors
• Hardware modules
• Architecture, pipeline
• Superscalar
• Multicore
• Symmetric multi-processors
• Hardware-connected multi-processors
• Tightly-coupled machine clusters
• Distributed systems
• Geo-distributed systems
• Internet-scale systems
• Interstellar systems
Flynn's HARDWARE Taxonomy

SIMD (Single Instruction, Multiple Data)
Lock-step execution
Example: vector operations

MIMD (Multiple Instruction, Multiple Data)
Example: multicores
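As a small illustration of the SIMD idea (my sketch, not part of the slides): the loop below applies the same operation to many data elements, which is exactly the pattern a vectorizing compiler (e.g. gcc -O3) maps to SIMD vector instructions.

```c
#include <stdio.h>

#define N 8

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* One logical "add" applied to many data elements in lock-step;
     * compilers typically emit vector (SIMD) instructions for this loop. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++) printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```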
Why parallel programming?

Most software will have to be written as parallel software.
Why?

The free lunch...
Wait 18 months and get a new CPU with 2x the speed.
Moore's law: the number of transistors per chip doubles roughly every 18-24 months.
[Figure: transistor count per chip over time. Source: Wikipedia]

... is over

More transistors = more execution units
BUT performance per unit does not improve

Bad news: serial programs will not run faster

Good news: parallelization has high performance potential
May get faster on newer architectures
Parallel programming is hard


Need to optimize for performance

Understand management of resources

Identify bottlenecks
No one technology fits all needs

Zoo of programming models, languages, run-times

Hardware architecture is a moving target

Parallel thinking is NOT intuitive

Parallel debugging is not fun
But there is no way around it
Parallelizing Game of Life
(Cellular automaton of pixels)

Given a 2D grid:

v_t(i,j) = F(v_{t-1} of all of (i,j)'s neighbors) = F(v_{t-1}(i-1,j), v_{t-1}(i+1,j), v_{t-1}(i,j-1), v_{t-1}(i,j+1))

[Figure: cell (i,j) and its four neighbors (i-1,j), (i+1,j), (i,j-1), (i,j+1)]
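A minimal serial sketch of one time step (mine, not the course's code; the rule F below is a placeholder, not the real Game of Life rule):

```c
#define N 512                      /* grid size, arbitrary for the sketch */

/* F computes the new value of a cell from its four neighbors.
 * Placeholder rule: the cell becomes alive if at least two neighbors are alive. */
static int F(int up, int down, int left, int right) {
    return (up + down + left + right) >= 2;
}

/* One serial time step: every interior cell of `next` is a function of its
 * neighbors in `prev` (the previous generation). */
void step(int prev[N][N], int next[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            next[i][j] = F(prev[i-1][j], prev[i+1][j],
                           prev[i][j-1], prev[i][j+1]);
}
```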
Problem partitioning

Domain decomposition (SPMD)
• Input domain
• Output domain
• Both

Functional decomposition (MPMD)
• Independent tasks
• Pipelining
We choose: domain decomposition

The field is split between processors.

[Figure: the grid split between CPU 0 and CPU 1, with cell (i,j) and its neighbors near the boundary]
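For the shared-memory case, a minimal sketch (mine, not the course's code) of domain decomposition: the rows of the grid are split among threads with OpenMP, so each core updates its own block of rows.

```c
#include <omp.h>

#define N 512

/* Domain decomposition on shared memory: the parallel-for splits the rows
 * among the available threads/cores. Compile with e.g. gcc -fopenmp. */
void step_parallel(int prev[N][N], int next[N][N]) {
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            next[i][j] = (prev[i-1][j] + prev[i+1][j] +
                          prev[i][j-1] + prev[i][j+1]) >= 2;   /* placeholder F */
}
```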
Issue 1. Memory

Can we access v(i+1,j) from CPU 0 as in a serial program?

[Figure: the grid split between CPU 0 and CPU 1; v(i+1,j) lies on CPU 1's side]
It depends...

YES: shared memory space architecture: same memory space
[Figure: CPU 0 and CPU 1 connected to a single shared Memory]

NO: distributed memory space architecture: disjoint memory spaces
[Figure: CPU 0 and CPU 1, each with its own Memory, connected by a Network]
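In the distributed-memory ("NO") case, the boundary values have to be communicated explicitly. A minimal halo-exchange sketch between two processes, assuming MPI as the message-passing layer (MPI is my choice of illustration, not prescribed by the slides):

```c
#include <mpi.h>

#define N 512

/* CPU 0 cannot read v(i+1,j) directly if that row lives on the other node:
 * the processes swap their boundary rows over the network instead.
 * `other` is the rank of the neighboring process. */
void exchange_boundary(int my_boundary_row[N], int neighbor_row[N], int other) {
    MPI_Sendrecv(my_boundary_row, N, MPI_INT, other, 0,
                 neighbor_row,    N, MPI_INT, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```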
Someone has to pay

Shared memory: easier to program, harder to build the hardware
[Figure: CPU 0 and CPU 1 sharing one Memory]

Distributed memory: harder to program, easier to build the hardware
[Figure: CPU 0 and CPU 1, each with its own Memory, connected by a Network]

Tradeoff: Programmability vs. Scalability
“Scalability” – the holy grail
Improving efficiency on larger problems when more processors are available.
• Single machine, multicore – vertical scalability (scale up)
• Multiple machines – horizontal scalability (scale out)
Memory architecture – latency in accessing data

For CPU 0: is the time to access v(i+1,j) equal to the time to access v(i-1,j)?

[Figure: the grid split between CPU 0 and CPU 1, with cell (i,j) and its neighbors near the boundary]
Hardware shared memory, flavor 1

Uniform memory access (UMA)
Same cost of accessing any data by all processors
SMP = Symmetric Multi-Processor

Hardware shared memory, flavor 2

Non-uniform memory access (NUMA)
Each processor accesses its local memory faster than memory attached to other processors.

Tradeoff: Scalability vs. Latency
Software Distributed Shared Memory (SDSM)

[Figure: Process 1 on CPU 0 and Process 2 on CPU 1, each node with its own Memory, connected by a Network; DSM daemons run on both nodes]

Software DSM: emulation of NUMA in software over a distributed memory space
Memory-optimized programming

Most modern systems are NUMA or distributed
• Architecture is easier to scale up by scaling out
• Access time difference between local and remote data: 100x-10000x

Locality is the most important optimization parameter for program speed, serial or parallel. (See the sketch below.)
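A small illustration of locality (mine, not from the slides): the two loop nests below do the same work, but the row-major traversal walks memory sequentially while the column-major one jumps across rows, which typically makes it several times slower.

```c
#include <stdio.h>
#include <time.h>

#define N 4096
static double a[N][N];   /* C stores this row by row (row-major) */

int main(void) {
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)          /* good locality: sequential access */
        for (int j = 0; j < N; j++)
            a[i][j] += 1.0;
    clock_t t1 = clock();
    for (int j = 0; j < N; j++)          /* poor locality: strided access */
        for (int i = 0; i < N; i++)
            a[i][j] += 1.0;
    clock_t t2 = clock();
    printf("row-major: %.2fs, column-major: %.2fs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```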
Issue 2: Control

Can we assign one pixel per CPU?
Can we assign one pixel per process/logical task?

[Figure: cell (i,j) and its four neighbors]
Task management overhead

Each task has state that must be managed
More tasks – more state to manage
Who manages the tasks?

How many tasks should be run by a CPU?
... depends on the complexity of F
v(i,j) = F(all of v(i,j)'s neighbors)
Question

Every process reads the data from its neighbors.
Will it produce correct results?

[Figure: cell (i,j) and its four neighbors]
Issue 3: Synchronization

The order of reads and writes made by different tasks is non-deterministic.
Synchronization is required to enforce the order.
• Locks, semaphores, barriers, condition variables
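A sketch of how a barrier enforces the order in the grid example (mine, not the course's code): no thread may start reading generation t+1's inputs before every thread has finished writing generation t. The implicit barriers of OpenMP's worksharing constructs provide exactly that.

```c
#include <omp.h>

#define N 512
#define STEPS 100

void run(int (*prev)[N], int (*next)[N]) {
    #pragma omp parallel
    for (int t = 0; t < STEPS; t++) {
        #pragma omp for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                next[i][j] = (prev[i-1][j] + prev[i+1][j] +
                              prev[i][j-1] + prev[i][j+1]) >= 2;  /* placeholder F */
        /* implicit barrier: all writes of this step finish before anyone reads them */
        #pragma omp single
        { int (*tmp)[N] = prev; prev = next; next = tmp; }  /* swap buffers */
        /* implicit barrier after "single": all threads see the swapped pointers */
    }
}
```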
Check point
Fundamental hardware-related issues:

Memory accesses
• Optimizing locality of accesses

Control
• Overhead

Synchronization

Goals and tradeoffs:
Ease of Programming, Correctness, Scalability
Parallel programming issues

We decide to split this 3x3 grid like this:

[Figure: the 3x3 neighborhood of (i,j) split among CPU 0, CPU 1, CPU 2, and CPU 4]

• OK?
Issue 1: Load balancing

[Figure: the grid split among CPU 0, CPU 1, CPU 2, and CPU 4, with pieces of different sizes]

Always waiting for the slowest task.

Solutions? (One common one is sketched below.)
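One common solution (my sketch, not prescribed by the slides) is dynamic work distribution: hand out the work in small chunks on demand, so faster or less-loaded CPUs simply take more chunks. With OpenMP this is schedule(dynamic).

```c
#include <omp.h>

#define N 512

/* Rows are handed out in chunks of 8 at run time instead of being fixed
 * up front, which evens out the load when some rows or CPUs are slower. */
void step_balanced(int prev[N][N], int next[N][N]) {
    #pragma omp parallel for schedule(dynamic, 8)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            next[i][j] = (prev[i-1][j] + prev[i+1][j] +
                          prev[i][j-1] + prev[i][j+1]) >= 2;   /* placeholder F */
}
```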
Issue 2: Granularity

G = computation / communication

Fine-grain parallelism (small G)
• Good load balancing
• Potentially high overhead

Coarse-grain parallelism (large G)
• Potentially bad load balancing
• Low overhead

What granularity works for you?
Must balance overhead and CPU utilization
Issue 3: Memory Models

Initially: a=0, b=0

Task 1:      Task 2:
a=1          b=1
Print(b)     Print(a)

Which writes by a task are seen by which reads of the other tasks?

Printed: 0,0?
Printed: 1,0?
Printed: 1,1?
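The same litmus test as a runnable C program (my sketch, not from the slides), written with C11 relaxed atomics so the behavior is well-defined: under a relaxed memory model any of 0,0 / 0,1 / 1,0 / 1,1 may be printed; only a stronger model such as sequential consistency rules out 0,0.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

atomic_int a = 0, b = 0;
int read_b, read_a;                      /* results, read only after join */

void *task1(void *arg) {                 /* a=1; Print(b) */
    (void)arg;
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    read_b = atomic_load_explicit(&b, memory_order_relaxed);
    return NULL;
}

void *task2(void *arg) {                 /* b=1; Print(a) */
    (void)arg;
    atomic_store_explicit(&b, 1, memory_order_relaxed);
    read_a = atomic_load_explicit(&a, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task1, NULL);
    pthread_create(&t2, NULL, task2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Printed: %d,%d\n", read_b, read_a);
    return 0;
}
```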
Memory Consistency Models

Example program:
Pi: R V; W V,7; R V; R V
Pj: R V; W V,13; R V; R V

A consistency/memory model is an "agreement" between the execution environment (H/W, OS, middleware) and the processes: the runtime guarantees to the application certain properties on the way values written to shared variables become visible to reads. This determines the memory model – what is valid and what is not.

Example execution:
Pi: R V,0; W V,7; R V,7; R V,13
Pj: R V,0; W V,13; R V,13; R V,7

Order of writes to V as seen by Pi: (1) W V,7; (2) W V,13
Order of writes to V as seen by Pj: (1) W V,13; (2) W V,7
Memory Model: Coherence

Coherence is the memory model in which the system guarantees to the program that the writes performed by the processes to every specific variable are viewed by all processes in the same order.
Example program:
Pi: W V,7; R V; R V
Pj: W V,13; R V; R V

All valid executions under Coherence:

1. Pi: W V,7; R V,7; R V,13     Pj: W V,13; R V,13; R V,13
2. Pi: W V,7; R V,7; R V,7      Pj: W V,13; R V,13; R V,7
3. Pi: W V,7; R V,13; R V,13    Pj: W V,13; R V,13; R V,13
4. Pi: W V,7; R V,7; R V,7      Pj: W V,13; R V,7; R V,7
5. Pi: W V,7; R V,7; R V,7      Pj: W V,13; R V,13; R V,13
The Register Property: the view of a process consists of the values it "sees" in its reads and the writes it performs. If a read of V in P that is later than P's write W V,x sees a value different from x, then a later read of V cannot see x.

Tradeoff: Memory Model / Programmability vs. Scalability
Summary

Parallelism and scalability do not come for free:
• Overhead
• Memory-aware access
• Synchronization
• Load balancing
• Granularity
• Memory models
Is it worth the trouble?

It depends on how hard the issues above are vs. the potential gain.
How would you know?
Some tasks cannot be parallelized.
“Speedup” – yet another holy grail

Speedup = (time of best available serial program) / (time of parallel program)
An upper bound on the speedup
Amdahl's law

The sequential component limits the speedup.

Split the program's serial time T_serial = 1 into:
• Ideally parallelizable fraction: A
• Fraction that cannot be parallelized: 1 - A

Ideal parallel time: T_parallel = A / #CPUs + (1 - A)
Ideal speedup(#CPUs) = T_serial / T_parallel <= 1 / (A / #CPUs + (1 - A))
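A small numeric illustration (my example, not from the slides): with A = 0.95 and 1000 CPUs the bound is 1 / (0.95/1000 + 0.05) ≈ 19.6, so a program that is 95% parallelizable cannot run more than about 20x faster, no matter how many CPUs are added. The sketch below tabulates the bound for a few values.

```c
#include <stdio.h>

/* Amdahl's bound: speedup(P) <= 1 / (A/P + (1 - A)),
 * where A is the parallelizable fraction and P the number of CPUs. */
static double amdahl(double A, int P) {
    return 1.0 / (A / P + (1.0 - A));
}

int main(void) {
    const double A[] = {0.50, 0.90, 0.95, 0.99};
    const int    P[] = {4, 16, 256, 1000};
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++)
            printf("A=%.2f P=%4d -> <= %6.1f   ", A[i], P[j], amdahl(A[i], P[j]));
        printf("\n");
    }
    return 0;
}
```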
Bad news

So why do we need machines with 1000s of CPUs?

[Figure: Amdahl's law speedup curves for different parallel fractions. Source: Wikipedia]

The larger the problem, the smaller the serial part, and the closer the simulation gets to the best serial program.
Super-linear speedups
• Do they exist?

True super-linear speedups
• Suppose the data does not fit in a single CPU's cache
• But after domain decomposition, each part does
Debugging Parallel Programs
General-Purpose Computing on Graphics Processing Units (GPU)

CUDA = Compute Unified Device Architecture

[Figure: NVIDIA's Streaming Multiprocessor]

NVIDIA's Kepler = 2 Tera-instructions/sec (real)