PowerPoint Slides - Dissertation Chapter 1
Parallel Processing
A Perspective
Hardware and Software
1
Introduction to PP
From “Efficient Linked List Ranking
Algorithms and Parentheses Matching
as a New Strategy for Parallel
Algorithm Design”, R. Halverson
Chapter 1 – Introduction to Parallel
Processing
2
Parallel Processing Research
1980s – Great deal of research &
publications
1990s – Hardware not commercially successful,
so the research area “dies” – Why?
Early 2000s – Begins a resurgence – Why?
Will it be successful this time?
2010 – Multicore CPUs, graphics processing
units
3
Goal of PP
Why bother with Parallel Processing?
Goal: Solve problems faster!
In reality: faster, but also efficient!
Work-Optimal: parallel algorithm runs
faster than sequential algorithm in
proportion to the number of processors
used.
Sometimes work-optimal is not
possible
4
Performance Metrics
Running time
Speedup = T1 / Tp
Work = p * Tp
Work Optimal: Work = O(T1)
Linear speedup
Scalable: if algorithm performance
increases linearly with the number of
processors utilized
5
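These metrics can be sketched in code; the run times below are hypothetical values chosen only to illustrate the formulas:

```python
# Hypothetical run times for some problem (seconds)
T1 = 8.0    # sequential run time
p = 4       # number of processors
Tp = 2.2    # parallel run time with p processors

speedup = T1 / Tp   # S = T1 / Tp
work = p * Tp       # W = p * Tp

print(f"Speedup: {speedup:.2f}")  # close to p means near-linear speedup
print(f"Work:    {work:.2f}")     # W close to T1 means near work-optimal
```

If W grows no faster than T1, the algorithm is work-optimal and the speedup is linear in p.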
PP Issues
Processors: number, connectivity,
communication
Memory: shared vs. local
Data structures
Data Distribution
Problem Solving Strategies
6
Parallel Problems
One Approach: Try to develop a
parallel solution to a problem without
consideration of the hardware.
Then apply the solution to the
specific hardware and determine
the extra cost, if any
If not acceptably efficient, try again!
7
Parallel Problems
Another Approach: Armed with the
knowledge of strategies, data
structures, etc. that work well for a
particular hardware, develop a
solution with a specific hardware in
mind.
Third Approach: Modify a solution for
one hardware configuration for
another
8
Real World Problems
Inherently Parallel – nature or
structure of the problem lends itself to
parallelism
Examples
Mowing a lawn
Cleaning a house
Grading papers
Problems are easily divided into sub-problems; very little overhead
9
Real World Problems
Not Inherently Parallel – parallelism is
possible but more complex to define
or with (excessive) overhead cost
Examples
Balancing a checkbook
Giving a haircut
Wallpapering a room
Prefix sums of an array
10
Some Computer Problems
Are these “inherently parallel” or not?
Processing customers’ monthly bills
Payroll checks
Building student grade reports from class
grade sheets
Searching for an item in a linked list
A video game program
Searching a state driver’s license database
Does the answer depend on hardware, software, or data? What assumptions are made?
11
General Observations
What characteristics make a problem
inherently parallel?
What characteristics make a problem
difficult to parallelize?
* Consider hw, sw, data structures.
12
Payroll Problem
Consider 10 PUs (processing units) with
each employee’s information stored in
a row of an array, A.
Label P0, P1,…P9
A[100] – 0 to 99
For i = 0 to 9
Pi process A[i*10] to A[((i+1)*10)-1]
13
Code for Payroll
For i = 0 to 9
  Pi processes A[i*10] to A[((i+1)*10)-1]
Each PU runs a process in parallel:
For each Pi, i = 0 to 9 do  // separate process
  For j = 0 to 9
    Process A[i*10 + j]
14
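A minimal sketch of this block partitioning, using a Python thread pool to stand in for the 10 PUs (the per-record computation here is a hypothetical placeholder):

```python
from multiprocessing.dummy import Pool  # thread pool standing in for 10 PUs

N, P = 100, 10              # 100 employee records, 10 processing units
A = list(range(N))          # hypothetical payroll data, one row per employee

def process_block(i):
    # PU i processes A[i*10] .. A[((i+1)*10)-1], as in the pseudocode above
    block = A[i * (N // P):(i + 1) * (N // P)]
    return [record * 2 for record in block]  # "* 2" stands in for the payroll work

with Pool(P) as pool:
    results = pool.map(process_block, range(P))  # the P blocks run concurrently

processed = [x for block in results for x in block]
```

Each of the 10 calls handles an independent block of 10 records, so no coordination between PUs is needed.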
Time Complexity of Payroll
Algorithm
Consider P processors
Consider N data items
Each PU has N/P data items
Assume data is accessible &
writeable by each PU
Time for each PU: O(N/P)
Work = P * O(N/P) = O(N)
Overhead??
15
Payroll Questions??
Now that we have a solution, it must
be applied to hardware. Which
hardware?
Main question: Where is the array
and how is it accessed by each
processor?
One shared memory or many local
memories?
Where are the results placed?
16
What about I/O??
Generally, in parallel algorithms, I/O
is disregarded.
Assumption: Data is stored in the
available memory.
Assumption: Results are written back
to memory.
Data input and output are generally
independent of the processing
algorithm.
17
Balancing a Checkbook
Consider same hardware & data array
Can still distribute and process in the
same manner as the payroll
Each block computes deposits as additions &
checks as subtractions; totals its block (10
subtotals)
BUT then must combine the 10 totals to
the final total
This is the overhead!
18
Complexity of Checkbook
Consider P processors
Consider N data items
Each PU has N/P data items
Assume data is accessible & writeable
Time for each section: O(N/P)
Combination of P subtotals
Time for combining: O(P) to O(log P)
Depends on strategy used
Total: O(N/P + P) to O(N/P + log P)
19
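The O(log P) combining step can be sketched as a pairwise tree reduction; the subtotal values below are hypothetical:

```python
# Combine P = 8 block subtotals in O(log P) pairwise rounds, the way a
# tree (or hypercube) reduction would; each round halves the count.
subtotals = [5, -3, 7, 2, -1, 4, 0, 6]  # one hypothetical subtotal per PU

rounds = 0
while len(subtotals) > 1:
    # In round k, pairs of PUs combine in parallel; simulated sequentially here.
    subtotals = [subtotals[i] + subtotals[i + 1]
                 for i in range(0, len(subtotals), 2)]
    rounds += 1

total = subtotals[0]
print(total, rounds)  # 20 after log2(8) = 3 rounds
```

A naive one-at-a-time combine would instead take O(P) additions, which is the slower end of the range on this slide.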
Performance
Complexity - Perfect Parallel Algorithm
If the best sequential algorithm for a
problem is O(f(x)) then the parallel
algorithm would be O(f(x)/P)
This happens with little or no overhead
Actual Run Time
Typically, it takes 4 processors to cut
the actual run time in half
20
Performance Measures
Run Time: not a practical measurement
Assume T1 & Tp are run times using 1 & p
processors, respectively
Speedup: S = T1/Tp
Work: W = p * Tp (aka Cost)
If W = O(T1) then it is Work (Cost)
Optimal & achieves Linear Speedup
21
Scalability
An algorithm is said to be Scalable
if performance increases linearly
with the number of processors
i.e. double the processors, cut time in
half
Implication: Algorithm sustains
good performance over a wide
range of processors.
22
Scalability
What about continuing to add processors?
At what point does adding more processors
stop improving the run time?
Does adding processors ever cause the
algorithm to take more time?
What is the optimal number of processors?
Consider W = p * Tp = O(T1)
Solve for p
Want W to be optimal for different numbers
of processors
23
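The question of when adding processors stops helping can be illustrated with the checkbook-style cost T(p) = N/p + p from the earlier complexity slide (using the O(P) combine); N is a hypothetical problem size:

```python
# With T(p) = N/p + p, more processors shrink the block time N/p but
# grow the combining time p, so T(p) has a minimum near p = sqrt(N).
N = 10_000

def T(p):
    return N / p + p  # block time + subtotal-combining time

times = {p: T(p) for p in (10, 50, 100, 200, 1000)}
best = min(times, key=times.get)
print(best)  # minimized near sqrt(10_000) = 100 processors
```

Past that point, adding processors actually increases the run time, because the combining overhead dominates.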
Models of Computation
Two major categories
Shared memory
e.g. PRAM
Fixed connection
e.g. Hypercube
There are numerous versions of
each
Not all are totally realizable in hw
24
Sidenote: Models
Distributed Computing
Use of 2 or more separate
computers used to solve a single
problem
A version of a network
Clusters
Not really a topic for this course
25
Shared Memory Model
PRAM – parallel random access
machine
A category with 4 variants
EREW-CREW-ERCW-CRCW
All communication through a
shared global memory
Each PU has a small local memory
26
Variants of PRAM
EREW-CREW-ERCW-CRCW
Concurrent read: 2 or more
processors may read the same (or
different) memory location
simultaneously
Exclusive read: 2 or more processors
may read global memory locations
only if each is accessing a unique
address
Similarly defined for write
27
Shared Memory Model
[Diagram: four processors P0–P3, each with its own small local memory M, all connected to a Shared Global Memory]
28
Shared Memory
What are some implications of
the variants in memory access of
the PRAM model?
What is the strongest model?
29
Fixed Connection Models
Each PU contains a Local Memory
Distributed memory
No shared memory
PUs are connected through some type
of Interconnection Network
Interconnection network defines the model
Communication is via Message Passing
Can be synchronous or asynchronous
30
Interconnection Networks
Bus Network (Linear)
Ring
Mesh
Torus
Hypercube
31
Hypercube Model
Distributed memory, message passing,
fixed connection, parallel computer
N = 2^r number of nodes
E = r * 2^(r-1) number of edges
Nodes are numbered 0 to N-1 in binary
such that any 2 nodes differing in exactly
one bit are connected by an edge
Dimension is r
32
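The edge rule above ("connected iff the binary labels differ in one bit") can be checked directly in a short sketch:

```python
from itertools import combinations

r = 3           # dimension
N = 2 ** r      # number of nodes, labeled 0 .. N-1
# Two nodes are adjacent iff their binary labels differ in exactly one bit,
# i.e. their XOR has exactly one 1-bit.
edges = [(a, b) for a, b in combinations(range(N), 2)
         if bin(a ^ b).count("1") == 1]

print(len(edges))  # r * 2**(r-1) = 12 edges for r = 3
```

Each of the 2^r nodes has exactly r neighbors (one per bit position), which gives the E = r * 2^(r-1) edge count on the previous slide.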
Hypercube Examples
N = 2, 4
[Diagram: hypercubes for N = 2 (dimension 1, nodes 0 and 1) and N = 4 (dimension 2, nodes 00, 01, 10, 11)]
33
Hypercube Example
N = 8
[Diagram: hypercube for N = 8 (dimension 3, nodes 000 through 111)]
34
Hypercube Considerations
Message Passing
Communication
Possible Delays
Load Balancing
Each PU has the same workload
Data Distribution
Must follow connections
35
Consider Checkbook Problem
How about distribution of data?
Often initial distribution is
disregarded
What about the combination of the
subtotals?
Reduction proceeds dimension by dimension
O(log P) = r steps
36
Design Strategies
Paradigm: a general strategy
used to aid in the development of
the solution to a problem
37
Paradigms
Extended from Sequential Use
Divide-and-Conquer
Always used in parallel algorithms
Divide data vs. Divide code
Branch-and-Bound
Dynamic Programming
38
Paradigms
Developed for Parallel Use
Deterministic coin tossing
Symmetry breaking
Accelerating cascades
Tree contraction
Euler Tours
Linked List Ranking
All Nearest Smaller Values (ANSV)
Parentheses Matching
39
Divide-and-Conquer
Most basic parallel strategy
Used in virtually every parallel
algorithm
Problem is divided into several sub-problems
that can be solved independently; results of the
sub-problems are combined into the final solution
Example: Checkbook Problem
40
Dynamic Programming
Divide & Conquer technique used when
sub-problems are not independent;
share common sub-problems
Sub-problem solutions are stored in
table for use by other processes
Often used for optimization problems
Minimum or Maximum
Fibonacci Numbers
41
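The Fibonacci example on this slide shows the idea: the sub-problems overlap, so each solution is stored in a table instead of being recomputed. A minimal sketch:

```python
# Dynamic programming: fib(i) depends on the overlapping sub-problems
# fib(i-1) and fib(i-2), so each result is stored in a table for reuse.
def fib(n):
    table = [0, 1] + [0] * (n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]  # reuse stored sub-problems
    return table[n]

print(fib(10))  # 55
```

Without the table, plain divide-and-conquer recursion would recompute the same sub-problems exponentially many times.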
Branch-and-Bound
Breadth-first tree processing
technique
Uses a bounding function that
allows some branches of the tree to
be pruned (i.e. eliminated)
Example: Game programming
42
Symmetry Breaking
Strategy that breaks a linked
structure (e.g. linked list) into disjoint
pieces for processing
Deterministic Coin Tossing
Using a binary representation of index,
nonadjacent elements are selected for
processing
Often used in Linked List Ranking
Algorithms
43
Accelerated Cascades
Applying 2 or more algorithms to a
single problem
Change from one to another based on
the ratio of the problem size to the
number of processors – Threshold
This “fine tuning” sometimes allows
for better performance
44
Tree Contraction
(aka Contraction)
Nodes of a tree are removed;
information removed is combined with
remaining nodes
Multiple processors are assigned to
independent nodes
Tree is reduced to a single node which
contains the solution
e.g. Arithmetic Expression computation;
Addition of a series of numbers
45
Euler Tour
Create duplicate nodes in a tree or
graph with edge in opposite direction
to create a circuit
Allows tree or graph to be processed
as a linked list
46
Linked List Ranking
Halverson’s area of dissertation
research
Problem Definition:
Given a linked list of elements, number
the elements in order (or in reverse
order)
For list of length 20+
47
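List ranking is classically solved in parallel by pointer jumping. Below is a sketch of that technique, with each "parallel" step simulated by a sequential pass; the list contents are hypothetical:

```python
import math

# List ranking by pointer jumping. succ[i] is element i's successor index,
# or i itself at the tail; rank[i] will become i's distance from the tail.
succ = [3, 0, 4, 2, 4]  # hypothetical list: 1 -> 0 -> 3 -> 2 -> 4 (tail)
rank = [0 if succ[i] == i else 1 for i in range(len(succ))]

for _ in range(math.ceil(math.log2(len(succ)))):
    # All elements jump simultaneously: each adds its successor's rank,
    # then points past it, halving the remaining distance each round.
    rank = [rank[i] + rank[succ[i]] for i in range(len(succ))]
    succ = [succ[succ[i]] for i in range(len(succ))]

print(rank)  # [3, 4, 1, 2, 0]: each element's distance from the tail
```

With one processor per element, the O(log n) rounds give the ranking in O(log n) parallel time; this is the baseline against which the dissertation's ranking algorithms are compared.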
Linked List Ranking
Applied to a wide range of problems
Euler Tours
Tree Searches
List Packing
Connectivity
Tree Traversals
Spanning Trees & Forests
Connected Components
Graph Decomposition
48
All Nearest Smaller Values
Given a sequence of values, for each
value x, find the nearest preceding
element that is smaller than x
Successfully applied to
Depth first search of interval graph
Parentheses matching
Line Packing
Triangulating a monotone polygon
49
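A sequential stack-based sketch of the ANSV problem as defined above (the input sequence is hypothetical):

```python
# All Nearest Smaller Values, sequential version: for each element,
# report the nearest preceding value that is smaller (None if no
# predecessor is smaller).
def ansv(values):
    result, stack = [], []
    for x in values:
        while stack and stack[-1] >= x:
            stack.pop()  # discard predecessors that are not smaller than x
        result.append(stack[-1] if stack else None)
        stack.append(x)
    return result

print(ansv([3, 1, 4, 1, 5, 9, 2, 6]))
```

Each element is pushed and popped at most once, so the sequential version runs in O(n) time; parallel ANSV algorithms aim to match that work bound.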
Parentheses Matching
In a properly formed string of
parentheses, find the index of each
parenthesis’s mate
Applied to solve
Heights of all nodes in a tree
Extreme values in a tree
Lowest common ancestor
Balancing binary trees
50
Parentheses Matching
[Diagram: an example parenthesis string with positions numbered; lines connect each parenthesis to the index of its mate]
51
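A sequential stack-based sketch of the matching problem as defined above (the example string is hypothetical):

```python
# Parentheses matching: for a properly formed string, mate[i] gives the
# index of the parenthesis that matches the one at index i.
def match_parens(s):
    mate, stack = [None] * len(s), []
    for i, c in enumerate(s):
        if c == "(":
            stack.append(i)       # remember the open parenthesis's index
        else:
            j = stack.pop()       # most recent unmatched open parenthesis
            mate[i], mate[j] = j, i
    return mate

print(match_parens("(())()"))  # [3, 2, 1, 0, 5, 4]
```

The dissertation's interest is in computing this same mate array in parallel, where the last-in-first-out stack order is not directly available.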
Parallel Algorithm Design
Identify problems and/or classes
of problems for which a particular
strategy will work
Apply to the appropriate hardware
Most of the paradigms have been
optimized for a variety of parallel
architectures
52
Broadcast Operation
Not a paradigm, but an operation used
in many parallel algorithms
Provide one or more items of data to all
the processors (individual memories)
Let P be the number of processors.
For most models, the broadcast operation
takes O(log P) time
53
Broadcast
Shared Memory (EREW)
P0 writes for P1; P0 & P1 write for P2 &
P3; P0 – P3 write for P4 – P7
Then each PU has a copy to be read in one
time unit
Hypercube
P0 sends to P1; P0 & P1 send to P2 & P3,
etc.
Both are O(log P)
54
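The doubling pattern on this slide (P0 writes for P1; P0 & P1 write for P2 & P3; and so on) can be simulated directly; the datum and P are hypothetical:

```python
# Doubling broadcast: in each round, every PU that already holds the
# value writes it for one PU that does not, so the number of copies
# doubles and all P PUs have it after ceil(log2(P)) rounds.
P = 8
value = 42                 # hypothetical datum, initially held only by P0
memory = [None] * P        # one local memory slot per PU
memory[0] = value

rounds = 0
while memory.count(value) < P:
    have = memory.count(value)               # PUs that already hold the value
    for i in range(have, min(2 * have, P)):  # they write for the next 'have' PUs
        memory[i] = value
    rounds += 1

print(rounds)  # ceil(log2(8)) = 3 rounds
```

This matches both the EREW shared-memory scheme and the hypercube scheme on this slide: each is O(log P).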
Remainder of this Course
Principles & Techniques of parallel
processing & parallel algorithm
development on a variety of models
Application using CUDA
NVIDIA graphics card processors
Overview of new models & languages
Student presentations
55