Parallel Processing, Part 1
Download
Report
Transcript Parallel Processing, Part 1
Part I
Fundamental Concepts
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 1
About This Presentation
This presentation is intended to support the use of the textbook
Introduction to Parallel Processing: Algorithms and Architectures
(Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by
the author in connection with teaching the graduate-level course
ECE 254B: Advanced Computer Architecture: Parallel Processing,
at the University of California, Santa Barbara. Instructors can use
these slides in classroom teaching and for other educational
purposes. Any other use is strictly prohibited. © Behrooz Parhami
Edition
First
Winter 2014
Released
Revised
Revised
Revised
Spring 2005
Spring 2006
Fall 2008
Fall 2010
Winter 2013
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 2
I Fundamental Concepts
Topics in This Part
Chapter 1 Introduction to Parallelism
Chapter 2 A Taste of Parallel Algorithms
Chapter 3 Parallel Algorithm Complexity
Chapter 4 Models of Parallel Processing
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 3
1 Introduction to Parallelism
Topics in This Chapter
1.1 Why Parallel Processing?
1.2 A Motivating Example
1.3 Parallel Processing Ups and Downs
1.4 Types of Parallelism: A Taxonomy
1.5 Roadblocks to Parallel Processing
1.6 Effectiveness of Parallel Processing
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 4
Some Resources
1
Our textbook; followed closely in lectures
Parhami, B., Introduction to Parallel Processing:
Algorithms and Architectures, Plenum Press, 1999
2
Recommended book; complementary software topics
Herlihy, M. and M. Shavit, The Art of Multiprocessor
Programming, Morgan Kaufmann, revised 1st ed., 2012
3
Free on-line book (Creative Commons License)
Matloff, N., Programming on Parallel Machines: GPU,
Multicore, Clusters and More, 341 pp., PDF file
http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf
4
Useful free on-line course, sponsored by NVIDIA
“Introduction to Parallel Programming,” CPU/GPU-CUDA
https://www.udacity.com/course/cs344
Winter 2014
Parallel Processing, Fundamental Concepts
Complete Unified
Device Architecture
Slide 5
1.1 Why Parallel Processing?
The request for higher-performance digital computers seems
unending.
In the past two decades, the performance of microprocessors
has enjoyed an exponential growth.
The growth of microprocessor speed/performance by a factor
of 2 every 18 months is known as Moore’s law
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 6
1.1 Why Parallel Processing?
This growth is the result of a combination of two factors:
1) Increase in complexity(related both to higher device
density and to larger size) of VLSI chips, projected to rise
to around 10 M transistor per chip for microprocessors.
2) Introduction of, and improvements in, architectural
features such as on-chip cache memories, large instruction
buffers, multiple instruction issue per cycle,
multithreading
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 7
1.1 Why Parallel Processing?
Moore’s law was originally formulated in 1965 in terms of the
doubling of chip complexity every year (later revised to every 18
months) [Scha97].
Moore’s law seems to hold regardless of how one measures
processor performance:
counting the number of executed instructions per second (IPS)
counting the number of floating-point operations per second
(FLOPS)
using sophisticated benchmark suites that attempt to measure the
processor's performance on real applications
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 8
1.1 Why Parallel Processing?
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 9
1.1 Why Parallel Processing?
Even though it is expected that Moore's law will continue to hold
for the near future, there is a limit that will eventually be reached.
That some previous predictions about when the limit will be
reached have proven wrong does not alter the fact that a limit,
dictated by physical laws, does exist.
The most easily understood physical limit is that imposed by the
finite speed of signal propagation along a wire.
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 10
Why High-Performance Computing?
1
Higher speed (solve problems faster)
Important when there are “hard” or “soft” deadlines;
e.g., 24-hour weather forecast
2
Higher throughput (solve more problems)
Important when we have many similar tasks to perform;
e.g., transaction processing
3
Higher computational power (solve larger problems)
e.g., weather forecast for a week rather than 24 hours,
or with a finer mesh for greater accuracy
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 11
Init.
1.2 A Motivating
Example
Fig. 1.3 The sieve of
Eratosthenes yielding a
list of 10 primes for n = 30.
Marked elements have
been distinguished by
erasure from the list.
Any composite number
has a prime factor
that is no greater than
its square root.
Winter 2014
2m
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Pass 1
Pass 2
Pass 3
2
3m
2
3
2
3
5
5m
5
7
7
7 m
9
11
11
11
13
13
13
17
17
17
19
19
19
23
23
23
25
25
15
21
27
29
Parallel Processing, Fundamental Concepts
29
29
Slide 12
Single-Processor Implementation of the Sieve
Current Prime
P
Bit-vector
1
Index
2
n
Fig. 1.4 Schematic representation of single-processor
solution for the sieve of Eratosthenes.
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 13
Control-Parallel Implementation of the Sieve
Index
P2
P1
Shared
Memory
Index
...
Index
Pp
Current Prime
I/O Device
1
2
n
(b)
Fig. 1.5 Schematic representation of a control-parallel
solution for the sieve of Eratosthenes.
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 14
Running Time of the Sequential/Parallel Sieve
Time
0
100
200
300
400
500
600
700
800
900 1000 1100 1200 1300 1400 1500
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
2
|
3
|
5
|
7
| 11 |13|17
19 29
p = 1, t = 1411
23 31
23 29 31
2
|
7
|17
3
5
| 11 |13|
19
p = 2, t =
706
2
|
3
5
|
7
p = 3, t =
11 |
13|17
|
19 29 31
23
499
Fig. 1.6
Control-parallel realization of the sieve of
Eratosthenes with n = 1000 and 1 p 3.
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 15
Data-Parallel Implementation of the Sieve
P1
Current Prime
1
Assume at most n processors,
so that all prime factors dealt with
are in P1 (which broadcasts them)
Index
2
n/p
P2
Communication
Pp
Current Prime
Current Prime
n/p+1
Index
2n/p
Index
n <n/p
n–n/p+1
Fig. 1.7
Winter 2014
n
Data-parallel realization of the sieve of Eratosthenes.
Parallel Processing, Fundamental Concepts
Slide 16
One Reason for Sublinear Speedup:
Communication Overhead
Ideal speedup
Solution time
Actual speedup
Computation
Communication
Number of processors
Number of processors
Fig. 1.8
Trade-off between communication time and computation
time in the data-parallel realization of the sieve of Eratosthenes.
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 17
Another Reason for Sublinear Speedup:
Input/Output Overhead
Ideal speedup
Solution time
Actual speedup
Computation
I/O time
Number of processors
Number of processors
Fig. 1.9
Effect of a constant I/O time on the data-parallel
realization of the sieve of Eratosthenes.
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 18
1.3 Parallel Processing Ups and Downs
Parallel processing, in the literal sense of the term, is used in
virtually every modern computer.
(For example, overlapping I/O with computation is a form of parallel
processing, as is the overlap between instruction preparation and execution
in a pipelined processor.)
Other forms of parallelism or concurrency that are widely
used include the use of multiple functional units (e.g., separate
integer and floating-point ALUs or two floating-point
multipliers in one ALU) and multitasking (which allows
overlap between computation and memory load necessitated
by a page fault).
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 19
1.3 Parallel Processing Ups and Downs
However, in this book:
The term parallel processing is used in a restricted sense of
having multiple (usually identical) processors for the main
computation and not for the I/O or other peripheral activities.
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 20
1.3 Parallel Processing Ups and Downs
The history of parallel processing has had its ups and downs
(read company formations and bankruptcies!) with what
appears to be a 20-year cycle. Serious interest in parallel
processing started in the 1960s.
Commercial interest in parallel processing resurfaced in the
1980s. Driven primarily by contracts from the defense
establishment and other federal agencies in the United States,
numerous companies were formed to develop parallel
systems.
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 21
1.3 Parallel Processing Ups and Downs
However, three factors led to another recess:
Government funding in the United States and other countries
dried up, in part related to the end of the cold war between the
NATO allies and the Soviet bloc.
Commercial users in banking and other data-intensive
industries were either saturated or disappointed by application
difficulties.
Microprocessors developed so fast in terms of
performance/cost ratio that custom designed parallel machines
always lagged in cost-effectiveness.
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 22
SISD
SIMD
MISD
MIMD
Uniprocessors Array or vector
processors
Rarely used
Multiproc’s or
multicomputers
Shared
variables
Global
memory
Multiple data
streams
Distributed
memory
Single data
stream
Johnson’ s expansion
Multiple instr
streams
Single instr
stream
1.4 Types of Parallelism: A Taxonomy
GMSV
Message
passing
GMMP
Shared-memory
multiprocessors
Rarely used
DMSV
DMMP
Distributed
Distrib-memory
shared memory multicomputers
Flynn’s categories
Fig. 1.11
Winter 2014
The Flynn-Johnson classification of computer systems.
Parallel Processing, Fundamental Concepts
Slide 23
1.5 Roadblocks to Parallel Processing
Grosch’s law: Economy of scale applies, or power = cost2
No longer valid; in fact we can buy more
MFLOPS computing power on micros rather than supers
Minsky’s conjecture: Speedup tends to be proportional to log p
Has roots in analysis of memory bank conflicts; can be overcome
Tyranny of IC technology: Uniprocessors suffice (x10 faster/5 yrs)
Faster ICs make parallel machines faster too; what about x1000?
Tyranny of vector supercomputers: Familiar programming model
Not all computations involve vectors; parallel vector machines
Software inertia: Billions of dollars investment in software
New programs; even uniprocessors benefit from parallelism spec
Amdahl’s law: Unparallelizable code severely limits the speedup
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 24
Amdahl’s Law
50
f = fraction
f = 0
Spee du p ( s )
40
unaffected
f = 0 .01
p = speedup
30
of the rest
f = 0 .02
20
f = 0 .05
s=
10
f = 0 .1
min(p, 1/f)
0
0
10
20
30
40
1
f + (1 – f)/p
50
E nha nc em en t f ac tor ( p )
Fig. 1.12 Limit on speed-up according to Amdahl’s law.
Winter 2014
Parallel Processing, Fundamental Concepts
Slide 25
1.6 Effectiveness of Parallel Processing
Fig. 1.13
Task graph
exhibiting
limited
inherent
parallelism.
1
2
3
4
p
Number of processors
W(p) Work performed by p processors
T(p) Execution time with p processors
T(1) = W(1); T(p) W(p)
S(p) Speedup = T(1) / T(p)
5
8
7
10
6
E(p) Efficiency = T(1) / [p T(p)]
9
R(p) Redundancy = W(p) / W(1)
11
12
13
Winter 2014
W(1) = 13
T(1) = 13
T() = 8
U(p) Utilization = W(p) / [p T(p)]
Q(p) Quality = T3(1) / [p T2(p) W(p)]
Parallel Processing, Fundamental Concepts
Slide 26
Reduction or Fan-in Computation
Example: Adding 16 numbers, 8 processors, unit-time additions
-----------
16 numbers to be added
-----------
Zero-time communication
+
+
+
+
+
+
+
+
+
+
+
+
E(8) = 15 / (8 4) = 47%
S(8) = 15 / 4 = 3.75
R(8) = 15 / 15 = 1
Q(8) = 1.76
Unit-time communication
+
+
+
E(8) = 15 / (8 7) = 27%
S(8) = 15 / 7 = 2.14
R(8) = 22 / 15 = 1.47
Q(8) = 0.39
Sum
Fig. 1.14
Winter 2014
Computation graph for finding the sum of 16 numbers .
Parallel Processing, Fundamental Concepts
Slide 27