
CIT 668: System Architecture
Parallel Computing
Topics
1. What is Parallel Computing?
2. Why use Parallel Computing?
3. Types of Parallelism
4. Amdahl's Law
5. Flynn's Taxonomy of Parallel Computers
6. Parallel Memory Architectures
7. Parallel Programming Models
Images from LLNL Parallel Computing
Tutorial, Wikipedia, or Majd F. Sakr’s
parallel computation lectures unless
otherwise noted.
Serial and Parallel Computation
Serial
Parallel
History of Parallel Computation
Parallel Computation
• Breaking a problem into pieces, using multiple computing resources to solve each piece, and reassembling the partial solutions into the final answer.
• Parallelism is limited by data dependencies.
Data Dependencies
Parallel Terminology
Task: A logically discrete section of computational work. A
task is typically a thread or process at the OS level.
Communications: Parallel tasks typically need to exchange data through a shared memory bus or over a network.
Synchronization: Coordination of parallel tasks in real time, often implemented by establishing a synchronization point within an application where a task cannot proceed further until other tasks reach that point.
Scalability: Ability of a parallel system to demonstrate a proportionate increase in speed with the addition of more processors.
Embarrassingly Parallel: Solving many similar but
independent tasks in parallel with little or no need for
coordination between tasks.
Parallel Granularity
Fine grain:
• Relatively small amounts of computation done between communication events.
• Low computation to communication ratio.
• Facilitates load balancing.
Coarse grain:
• Relatively large amounts of computation done between communication events.
• High computation to communication ratio.
• Difficult to load balance.
Why Parallel Computing?
The Real World is Parallel
Modeling Science & Engineering Problems
Reasons for Parallel Computation
Limits to serial computing
– CPU clock speeds have increased only slowly since 2003
Solve problems faster
– Reduce time by using more resources
Solve larger problems
– Scientific problems
– Web-scale applications
Types of Parallelism
• Bit-level Parallelism
• Instruction-level Parallelism
• Data-level Parallelism
• Task-level Parallelism
The Processor
• The Brain: a functional unit that interprets and carries out
instructions (mathematical operations)
• Also called a CPU (actually includes a control unit + ALU)
• Consists of hundreds of millions of transistors.
Processor Components: Control
Control Unit
– Processor’s supervisor
– Fetch/execute cycle
Fetch → Decode → Execute → Store
Program Counter (PC): stores the address of the next instruction to be fetched.
Instruction Register (IR): holds the instruction most recently fetched.
Processor Components: Datapath
Register File: General purpose
storage locations inside processor
that hold addresses or values.
Arithmetic Logic Unit: Set of
functional units that perform
arithmetic and logic operations.
A 32-bit processor has registers that are typically 32 bits in size.
Bit-level Parallelism
• Increase the processor word size to operate on more bits at once.
• Task: add two 64-bit numbers.
• 32-bit CPU: must complete two 32-bit add operations plus handle the carry between them.
• 64-bit CPU: adds two 64-bit integers in a single instruction (see the sketch below).
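A minimal C sketch of the idea (illustrative only; the helper name is made up): the function below emulates what a 32-bit ALU must do to add two 64-bit integers, two 32-bit additions plus a carry, while a 64-bit ALU does the same work with one add instruction.

#include <stdint.h>
#include <stdio.h>

/* Emulate a 64-bit addition using only 32-bit operations, the way a
   32-bit ALU must do it. */
uint64_t add64_with_32bit_ops(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint32_t sum_lo = a_lo + b_lo;              /* first 32-bit add  */
    uint32_t carry  = (sum_lo < a_lo) ? 1 : 0;  /* detect carry out  */
    uint32_t sum_hi = a_hi + b_hi + carry;      /* second 32-bit add */
    return ((uint64_t)sum_hi << 32) | sum_lo;
}

int main(void) {
    uint64_t a = 0xFFFFFFFFULL, b = 1;
    printf("%llu == %llu\n",
           (unsigned long long)add64_with_32bit_ops(a, b),
           (unsigned long long)(a + b));  /* one instruction on a 64-bit CPU */
    return 0;
}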
Evolution of Processor Word Size
Instruction Level Parallelism (ILP)
Running independent instructions on separate
execution units simultaneously.
x=a+b
y=c+d
z=x+y
Serial Execution:
– If each instruction takes one cycle, it takes 3 clock cycles to run the program.
Parallel Execution:
– The first two instructions are independent, so they can be executed simultaneously.
– The third instruction depends on the first two, so it must be executed afterwards.
– Two clock cycles to run the program.
Instruction Pipelining
• Improve ILP by splitting the processing of a single machine language instruction into a series of independent steps.
• The CPU can issue instructions at the processing rate of the slowest step, increasing clock speed.
IF = Instruction Fetch
ID = Instruction Decode
EX = Execute
MEM = Memory access
WB = Register write back
Sequential Laundry
[Figure: four laundry loads (A–D), done one after another in 30-minute steps along a time axis from 6 PM to 2 AM.]
Sequential laundry = 8 hours for 4 loads
http://inst.eecs.Berkeley.edu/~cs61c/fa10
Pipelined Laundry
[Figure: the same four loads (A–D), overlapped pipeline-fashion in 30-minute steps, finishing by about 9:30 PM.]
Pipelined laundry = 3.5 hours for 4 loads!
http://inst.eecs.Berkeley.edu/~cs61c/fa10
Pipelining Lessons
[Figure: the pipelined laundry schedule for loads A–D, 6 PM to 9 PM in 30-minute steps.]
• Pipelining doesn't decrease the latency of a single task; it increases the throughput of the entire workload.
• Multiple tasks operate simultaneously using different resources.
• Potential speedup = number of pipeline stages.
• Time to fill the pipeline and time to drain it reduces speedup: 2.3X vs. 4X in this example.
• Speedup is limited by the slowest pipeline stage.
http://inst.eecs.Berkeley.edu/~cs61c/fa10
Processor Pipeline Execution
[Figure: eight instructions flowing through a five-stage pipeline (IF, ID, ALU, MEM, WR) across clock periods P1–P12; a new instruction enters the pipeline each cycle.]
http://inst.eecs.Berkeley.edu/~cs61c/fa10
Hazards
Problems that prevent pipeline execution.
Data hazards
– Must wait for previous instruction to compute data
that current instruction needs for input.
Structural hazards
– A processor component (execution unit or memory
unit) is needed by two simultaneous instructions.
Control hazards
– Conditional branch statements offer two
alternatives for the next instruction to fetch.
Working Around Hazards
Instruction re-ordering
– Re-order instructions to extract more ILP
– Data dependencies limit re-ordering
Branch prediction
– Predict which branch is likely to be taken, then execute those instructions
– Loops usually keep looping; they exit only once
Loop unrolling
– Human or compiler replicates the body of a loop to expose more parallelism (see the sketch below)
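A minimal C sketch of loop unrolling (illustrative; the function names are made up): the unrolled loop body contains four independent additions per iteration, giving a superscalar CPU more instruction-level parallelism to schedule at once.

/* straightforward loop: one addition per iteration */
void sum_rolled(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* unrolled by four: four independent additions per iteration */
void sum_unrolled(const float *a, const float *b, float *c, int n) {
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)          /* handle any leftover elements */
        c[i] = a[i] + b[i];
}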
Superscalar
Pipelined Execution
Superscalar Execution
http://arstechnica.com/paedia/c/cpu/part-2/
AMD Athlon Architecture
Superscalar Pipeline
Pipeline Depth and Issue Width
CPU                  Year   Clock Speed   Pipeline Stages   Issue Width   Cores   Power
80486                1989   25 MHz        5                 1             1       5W
Pentium              1993   66 MHz        5                 2             1       10W
Pentium Pro          1997   150 MHz       10                3             1       29W
Pentium 4            2001   1500 MHz      22                3             1       75W
Pentium 4 Prescott   2004   3600 MHz      31                3             1       103W
Core 2 Conroe        2006   2930 MHz      14                4             2       75W
Core 2 Yorkfield     2008   2930 MHz      16                4             4       95W
Core i7 Gulftown     2010   3460 MHz      16                4             6       130W
Pipeline Depth and Issue Width
[Chart: log-scale plot of clock speed, power, pipeline stages, issue width, and core count for the CPUs above, 1989–2010.]
http://inst.eecs.Berkeley.edu/~cs61c/fa10
Data Parallelism
• Distribute data across different computing nodes, so the same operation is performed on different parts of the same data structure
• AKA loop-level parallelism
• If each loop iteration depends on results from the previous iteration, the loop cannot be parallelized (see the sketch below)
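A minimal data-parallel sketch in C, assuming OpenMP is available (compile with -fopenmp; the function name is made up): the same operation runs on different parts of the same array in different threads. A loop whose iteration i needed the result of iteration i-1 could not be split up this way.

#include <omp.h>

void scale(double *a, int n, double x) {
    /* every iteration is independent, so iterations are divided among threads */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = a[i] * x;        /* same operation, different data elements */
}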
Task Parallelism
• Different operations on different data sets
• Each processor performs a different task
• Each processor communicates with other processors to get inputs and pass on results (see the sketch below)
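A minimal task-parallel sketch in C using POSIX threads (compile with -pthread; function and variable names are made up for illustration): unlike the data-parallel loop above, the two threads run different operations on different data sets at the same time.

#include <pthread.h>
#include <stdio.h>

static double data_a[3] = {1.0, 2.0, 3.0};
static double data_b[3] = {4.0, 5.0, 6.0};
static double sum_a, product_b;

void *do_sum(void *arg) {              /* task 1: sum one data set       */
    (void)arg;
    for (int i = 0; i < 3; i++) sum_a += data_a[i];
    return NULL;
}

void *do_product(void *arg) {          /* task 2: multiply the other set */
    (void)arg;
    product_b = 1.0;
    for (int i = 0; i < 3; i++) product_b *= data_b[i];
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, do_sum, NULL);
    pthread_create(&t2, NULL, do_product, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("sum = %.1f, product = %.1f\n", sum_a, product_b);
    return 0;
}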
Multiple CPUs
Multicore
• Multicore CPU chips contain
multiple complete processors
• Individual L1 and shared L2 caches
• OS and applications see each core
as an independent processor
• Each core can run a separate task
• A single application must be
divided into multiple tasks to
improve performance
Multicore Organization Alternatives
Core 2 Duo and Core i7 Architectures
High Core CPUs Use On-Chip Network
• CPUs with low core counts use ~1000 wires per core to connect to the L3 cache and to each other.
• To scale, the Core i7 uses a ring bus to communicate between cores and L3 cache slices.
Core i7 2600k Ring Bus Architecture
Multicore vs. Multiprocessor
                                 Multicore   Multiprocessor
Inter-processor communication    Faster      Slower
Memory bandwidth per CPU         Lower       Higher
Power consumption per CPU        Lower       Higher
Cost per CPU                     Lower       Higher
Simultaneous Multi-Threading
CPU presents virtual cores to OS
– CPU duplicates PC, IR, and thread state registers
– But keeps same number of execution units
OS feeds two threads at once to CPU
– Improves ILP by having multiple instruction streams that are unlikely to have cross-stream dependencies
CPU Parallelism
Architectures Compared
• Parallelism types are combined to increase parallel capacity.
• x86 cores have been superscalar since the Pentium in 1993.
• Server x86 CPUs have used simultaneous multithreading since the Pentium 4 Xeon in 2000.
Flynn’s Taxonomy
                       Single Data           Multiple Data
Single Instruction     SISD: Pentium III     SIMD: SSE instruction set
Multiple Instruction   MISD: None today      MIMD: Xeon e5345 (Clovertown)
Single Instruction Single Data
Serial computation
Single instruction = only one
instruction stream acted on by
CPU during one clock cycle
Single data = only one data
stream used as input during
any clock cycle
Single Instruction Multiple Data
Parallel computation
Single instruction = All
processing units given
same instruction at any
clock cycle.
Multiple data = Each
processing unit can
operate on a different
data element
Applications: graphics
Radeon R770 GPU
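A small SIMD sketch in C using the SSE intrinsics mentioned in the taxonomy table (illustrative; requires an x86 CPU with SSE): _mm_add_ps performs four single-precision additions with one instruction, i.e. one instruction stream acting on multiple data elements.

#include <xmmintrin.h>   /* SSE intrinsics */

void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);     /* load 4 floats from a            */
    __m128 vb = _mm_loadu_ps(b);     /* load 4 floats from b            */
    __m128 vc = _mm_add_ps(va, vb);  /* one instruction, four additions */
    _mm_storeu_ps(out, vc);          /* store the 4 results             */
}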
Multiple Instruction Single Data
Parallel computation
Single data = A single
data stream is fed into
multiple processing units
Multiple instruction =
Each processing unit
operates on data
independently via its
own instruction stream
Multiple Instruction Multiple Data
Parallel computation
Multiple instruction =
each processing unit
may execute a different
instruction stream
Multiple data = each
processing unit may
work with a different
data stream
Examples: multicore,
grid computing
Taxonomy of Parallel Architectures
Amdahl’s Law
Scaling Parallel Computing
Just add more processors?
Hardware limitations
– Memory-CPU bus bandwidth on local machine
– Network bandwidth and latency
Software limitations
– Most algorithms have limits to scalability
– Supporting libraries may have their own limits
Amdahl’s Law
Speedup due to enhancement E is:

Speedup = (Execution time without E) / (Execution time with E)

Suppose E accelerates a fraction P (P < 1) of the task by a factor S (S > 1) and leaves the remainder unaffected. Then:

Execution time with E = Execution time without E × [(1 − P) + P/S]

Speedup = 1 / [(1 − P) + P/S]
Amdahl’s Law: Example
Consider an application whose work is divided into the following four components:

Work Load   Memory Access   Computation   Disk Access   Network Access
Time        10%             70%           10%           10%

What is the expected percent improvement if:
– Memory access speed is doubled?  5%
– Computation speed is doubled?    35%
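A quick C check of this example (illustrative, not from the slides): Amdahl's Law as a function, where the percent improvement quoted above is the reduction in total running time.

#include <stdio.h>

double amdahl_speedup(double p, double s) {
    /* fraction p of the work is accelerated by factor s */
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void) {
    double sp_mem = amdahl_speedup(0.10, 2.0);   /* memory access doubled */
    double sp_cpu = amdahl_speedup(0.70, 2.0);   /* computation doubled   */
    printf("memory:      speedup %.3f, time saved %.0f%%\n",
           sp_mem, (1.0 - 1.0 / sp_mem) * 100.0);   /* ~5%  */
    printf("computation: speedup %.3f, time saved %.0f%%\n",
           sp_cpu, (1.0 - 1.0 / sp_cpu) * 100.0);   /* 35%  */
    return 0;
}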
Amdahl's Law for Parallelization

Speedup = 1 / [(1 − P) + P/S]

• Let P be the parallelizable portion of the code.
• As the number of processors increases, the time to do the parallel portion of the program, P/S, tends towards zero, reducing the equation to:

Speedup = 1 / (1 − P)

• If P = 0, then speedup = 1 (no improvement).
• If P = 1, then speedup grows without limit.
• If P = 0.5, then the maximum speedup is 2.
Amdahl’s Law: Parallel Example
Consider an application whose work is divided into the following four functions:

Function    f1    f2    f3    f4
Time        4%    10%   80%   6%

Assume f1, f3, and f4 can be parallelized, but f2 must be computed serially.

Parallelizing which part would best improve performance?  f3 (it accounts for 80% of the running time)
What is the best performance speedup that could be reached by parallelizing all three parallelizable functions?  10X (the serial part, f2, takes 10% of the time, so speedup ≤ 1/0.10 = 10)
Amdahl’s Law: Time Example
Consider an application whose work is divided into the following four functions:

Function    f1     f2     f3      f4
Time        2ms    5ms    40ms    3ms

Assume f1, f3, and f4 can be parallelized, but f2 must be computed serially. Assume that running the whole program takes 50ms.

What is the best running time that can be achieved by parallelizing f1, f3, and f4?  5ms
Why can't parallelizing the program decrease the total running time below that time?  5ms is the time required for the serial part. Even if the parallel part takes 0ms, f2 still takes 5ms to run.
Amdahl’s Law
Parallel Memory Architectures
• Uniform Memory Architecture
• Non-Uniform Memory Architecture
• Distributed Memory
• Hybrid Distributed-Shared Memory
Uniform Memory Architecture (UMA)
• Global shared address space for all memory
• Symmetric Multi-Processing (SMP)
• Does not scale to much greater than 8 CPUs
due to memory contention
Non-Uniform Memory Architecture (NUMA)
• Global shared address space where:
  – Processors have fast access to nearby memory
  – Memory access across links is slower
• Better scaling than UMA, but still limited by memory contention
Distributed Memory
• Each CPU has its own local address space; changes made by one CPU are not visible to other CPUs
• CPUs must use network to exchange data
• Highly scalable: CPUs have fast local RAM
• Data communication is programmer’s responsibility
Hybrid Distributed-Shared Memory
• Shared memory components are SMP nodes
• Distributed memory is network of SMP nodes
• Used by most large supercomputers + clouds
Hybrid Distributed-Shared in Cloud
Parallel Programming Models
• Shared Memory Model
• Threads Model
• Data Parallel Model
• Message Passing Model
• An abstraction above the hardware level for
programmers to use.
• Any programming model can be used with any
parallel architecture.
• Model performance may depend on
architecture.
• Model choice depends on problem being
solved and programmer preference.
Shared Memory
• Tasks share a common global address space, which they read and write to asynchronously.
• Software mechanisms such as locks and semaphores are used to control access to the shared memory.
Shared Memory Model
Advantages:
• Program development is simplified since the process owns the data stored in memory.
• Referencing data in shared memory is similar to traditional serial programming.
Disadvantages:
• Difficult to understand and manage data locality.
• Keeping data local to the CPU working on it is faster, but bus traffic results when other CPUs try to access that data.
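A minimal shared-memory sketch in C using POSIX threads (illustrative; compile with -pthread): both threads read and write the same global counter asynchronously, so a mutex, one of the locking mechanisms mentioned above, controls access to it.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                           /* shared global data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                 /* enter critical section */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);            /* always 200000 with the lock */
    return 0;
}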
Threads
• Divide single program into multiple concurrent
execution paths called threads.
• Threads are distributed across CPUs and can be
executed on multiple CPUs simultaneously.
Threads
• Each thread has local data structures specific to
this thread.
• However, all threads share common process global
memory space.
• Threads are associated with shared memory
architectures.
• Threads can be scheduled by OS or by middleware
such as the language VM (green threads).
– Green threads start and synchronize faster
– But most implementations cannot use multiple CPUs
Data Parallel
• Data set is divided into chunks and operations are
performed on each chunk concurrently
• Tasks work collectively on the same data structure, but each task performs the same operation on its own partition of the data (ex: multiply each array element by X)
Message Passing
• Tasks use only their own local memory.
• Tasks can be on multiple machines.
• Data is exchanged between tasks by sending and receiving messages.
• Data transfer requires cooperation between tasks.
Parallel Code Example
Example: Array Processing
if MASTER
    initialize array
    send each WORKER
        info on chunk it owns
        chunk of initial array
    recv results from each WORKER
elsif WORKER
    recv info on my chunk
    recv my chunk of array data
    do j = first_col .. last_col
        do i = 1, n
            a(i,j) = f(i,j)
        end
    end
    send results to MASTER
end
Array is divided into chunks; each
CPU owns a chunk and executes the
portion of loop corresponding to it.
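A compact C sketch of the same master/worker pattern, assuming MPI (an illustration, not the slide's actual code): rank 0 plays MASTER, and the collective MPI_Scatter/MPI_Gather calls stand in for the explicit send/receive pairs in the pseudocode.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1000 * nprocs;              /* total array size   */
    const int chunk_n = n / nprocs;           /* elements per task  */
    float *full = NULL;
    if (rank == 0) {                          /* MASTER initializes the array */
        full = malloc(n * sizeof(float));
        for (int i = 0; i < n; i++) full[i] = (float)i;
    }

    float *chunk = malloc(chunk_n * sizeof(float));
    MPI_Scatter(full, chunk_n, MPI_FLOAT,     /* send each task its chunk */
                chunk, chunk_n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk_n; i++)         /* each task works on its chunk */
        chunk[i] = chunk[i] * 2.0f;

    MPI_Gather(chunk, chunk_n, MPI_FLOAT,     /* master collects results */
               full, chunk_n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("done: full[10] = %f\n", full[10]);
    free(chunk);
    free(full);
    MPI_Finalize();
    return 0;
}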
Example: Heat Equation
• The heat equation describes the change in temperature in a region over time, given an initial temperature distribution and boundary conditions.
• Divide the region into chunks and iterate, allowing heat from nearby chunks to change the temperature in each chunk for the next iteration.
Example: Heat Equation
To compute U(x,y) for the next time step, each element is updated from its own current value and the values of its four neighbors. The slide shows the serial program as an image; a sketch follows below.
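A serial C sketch of one time step, assuming the usual explicit finite-difference update (the slide's actual code is an image, and the grid sizes here are made up): each interior element's new temperature combines its old value with those of its four neighbors.

#define NX 100
#define NY 100

void step(double u_new[NX][NY], double u_old[NX][NY], double cx, double cy) {
    /* update interior points; boundary rows and columns hold the fixed
       boundary conditions and are left unchanged */
    for (int ix = 1; ix < NX - 1; ix++)
        for (int iy = 1; iy < NY - 1; iy++)
            u_new[ix][iy] = u_old[ix][iy]
                + cx * (u_old[ix+1][iy] + u_old[ix-1][iy] - 2.0 * u_old[ix][iy])
                + cy * (u_old[ix][iy+1] + u_old[ix][iy-1] - 2.0 * u_old[ix][iy]);
}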
Example: Parallel Heat Equation Solver
• Divide array into chunks.
• Data dependencies:
– Interior elements are
independent of other tasks.
– Border elements are
dependent on other tasks, so
tasks must communicate.
• Master sends initial state to
worker tasks and collects
results from workers.
• Workers compute heat
equation, communicating
state of border elements with
adjacent worker tasks.
Parallel Code
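A hedged C sketch of the worker side, assuming MPI and a row-wise decomposition (illustrative only; names and sizes are made up, and these functions would be called from an MPI main loop each iteration): every task updates its own chunk and swaps border rows with its neighbors, matching the communication pattern described above.

#include <mpi.h>

#define CHUNK_ROWS 16   /* rows owned by this worker (made-up size) */
#define NY 64           /* columns                                  */

/* exchange border rows with neighboring ranks; rows 0 and CHUNK_ROWS+1
   are "ghost" rows holding copies of the neighbors' border data */
void exchange_borders(double u[CHUNK_ROWS + 2][NY], int rank, int nprocs) {
    MPI_Status st;
    if (rank > 0)                      /* swap with the rank above */
        MPI_Sendrecv(u[1], NY, MPI_DOUBLE, rank - 1, 0,
                     u[0], NY, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, &st);
    if (rank < nprocs - 1)             /* swap with the rank below */
        MPI_Sendrecv(u[CHUNK_ROWS], NY, MPI_DOUBLE, rank + 1, 0,
                     u[CHUNK_ROWS + 1], NY, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, &st);
}

/* one local time step: interior elements depend only on this chunk;
   border elements also read the ghost rows filled in above */
void update_chunk(double u_new[CHUNK_ROWS + 2][NY],
                  double u_old[CHUNK_ROWS + 2][NY], double cx, double cy) {
    for (int i = 1; i <= CHUNK_ROWS; i++)
        for (int j = 1; j < NY - 1; j++)
            u_new[i][j] = u_old[i][j]
                + cx * (u_old[i+1][j] + u_old[i-1][j] - 2.0 * u_old[i][j])
                + cy * (u_old[i][j+1] + u_old[i][j-1] - 2.0 * u_old[i][j]);
}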
Key Points
Granularity refers to the ratio of computation to communication.
Levels of Parallelism
– Bit-level parallelism
– Data-level parallelism (loop-level)
– Instruction-level parallelism
• Pipelining
• Superscalar
– Task-level parallelism
• Multi-CPU
• Multi-core
• Hyperthreading
Flynn’s Taxonomy: SIMD, MIMD
Key Points
Amdahl’s Law
– No matter how many processors, speed-up is limited by
sequential portion of program
Parallel Memory Architectures
– Shared memory for SMP: UMA, NUMA
– Distributed memory for clusters or large-scale MP
– Most clouds use hybrid shared-distributed arch
Parallel Programming Models
– Shared Memory
– Threads
– Data Parallel
– Message Passing