ECE 669
Parallel Computer Architecture
Lecture 15
Mid-term Review
March 25, 2004
Is Parallel Computing Inevitable?
° Application demands: our insatiable need for computing cycles
° Technology Trends
° Architecture Trends
° Economics
° Current trends:
• Today’s microprocessors have multiprocessor support
• Servers and workstations becoming MP: Sun, SGI, DEC, HP!...
• Tomorrow’s microprocessors are multiprocessors
Application Trends
° Application demand for performance fuels advances in hardware, which enables new applications, which...
• Cycle drives exponential increase in microprocessor performance
• Drives parallel architecture harder
- most demanding applications
[Figure: cycle between New Applications and More Performance]
° Range of performance demands
• Need range of system performance with progressively increasing cost
Architectural Trends
° Architecture translates technology’s gifts into performance and capability
° Resolves the tradeoff between parallelism and locality
• Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
• Tradeoffs may change with scale and technology advances
° Understanding microprocessor architectural trends
=> Helps build intuition about design issues of parallel machines
=> Shows fundamental role of parallelism even in “sequential” computers
Phases in “VLSI” Generation
[Figure: transistor count per microprocessor, 1970–2005, log scale from 1,000 to 100,000,000: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000; successive phases exploit bit-level parallelism, then instruction-level, then thread-level (?)]
Programming Model
° Look at major programming models
• Where did they come from?
• What do they provide?
• How have they converged?
° Extract general structure and fundamental issues
° Reexamine traditional camps from new perspective
[Figure: Systolic Arrays, Dataflow, SIMD, Message Passing, and Shared Memory all converging toward a Generic Architecture]
Programming Model
° Conceptualization of the machine that programmer uses in coding applications
• How parts cooperate and coordinate their activities
• Specifies communication and synchronization operations
° Multiprogramming
• no communication or synch. at program level
° Shared address space
• like bulletin board
° Message passing
• like letters or phone calls, explicit point to point
° Data parallel:
• more regimented, global actions on data
• Implemented with shared address space or message passing
Shared Physical Memory
° Any processor can directly reference any memory location
° Any I/O controller can reference any memory
° Operating system can run on any processor, or all
• OS uses shared memory to coordinate
° Communication occurs implicitly as result of loads and stores
° What about application processes?
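As a concrete illustration of communication through ordinary loads and stores, here is a minimal pthreads sketch (not from the original slides; the variable names and the flag/condition protocol are illustrative):

#include <pthread.h>
#include <stdio.h>

/* Shared address space: both threads reference the same locations. */
static int data;                  /* value being communicated */
static int flag = 0;              /* synchronization state    */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    pthread_mutex_lock(&lock);
    data = 42;                    /* an ordinary store ...             */
    flag = 1;
    pthread_cond_signal(&ready);  /* ... plus explicit synchronization */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg) {
    pthread_mutex_lock(&lock);
    while (!flag)
        pthread_cond_wait(&ready, &lock);
    printf("received %d\n", data); /* an ordinary load */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

The data transfer itself is just a store followed by a load; only the coordination (mutex and condition variable) is explicit.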
Message Passing Architectures
° Complete computer as building block, including I/O
• Communication via explicit I/O operations
° Programming model
• direct access only to private address space (local memory),
• communication via explicit messages (send/receive)
° High-level block diagram
• Communication integration?
- Mem, I/O, LAN, Cluster
• Easier to build and scale than SAS
[Figure: multiple nodes, each with processor (P), cache ($), and memory (M), connected by a network]
° Programming model more removed from basic hardware operations
• Library or OS intervention
Message-Passing Abstraction
[Figure: Process P executes "Send X, Q, t" and Process Q executes "Receive Y, P, t"; the two match on process and tag t, copying address X in P's local address space to address Y in Q's]
• Send specifies buffer to be transmitted and receiving process
• Recv specifies sending process and application storage to receive into
• Memory to memory copy, but need to name processes
• Optional tag on send and matching rule on receive
• User process names local data and entities in process/tag space too
• In simplest form, the send/recv match achieves pairwise synch event
- Other variants too
• Many overheads: copying, buffer management, protection
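A minimal sketch of this abstraction using MPI point-to-point calls (MPI is an assumption here, not named on the slide; buffer names X and Y and the tag value follow the figure):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double X[4] = {1.0, 2.0, 3.0, 4.0}; /* sender's local buffer   */
    double Y[4];                        /* receiver's local buffer */
    int t = 7;                          /* tag used by the match   */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {        /* process P: Send X, Q, t    */
        MPI_Send(X, 4, MPI_DOUBLE, 1, t, MPI_COMM_WORLD);
    } else if (rank == 1) { /* process Q: Receive Y, P, t */
        MPI_Recv(Y, 4, MPI_DOUBLE, 0, t, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %g ... %g\n", Y[0], Y[3]);
    }
    MPI_Finalize();
    return 0;
}

The blocking receive completes only after a matching send, which is the pairwise synchronization event noted above.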
Simulating Ocean Currents
[Figure: (a) cross sections of the ocean basin; (b) spatial discretization of a cross section]
° Model as two-dimensional grids
• Discretize in space and time
• finer spatial and temporal resolution => greater accuracy
° Many different computations per time step
- set up and solve equations
• Concurrency across and within grid computations
° Static and regular
4 Steps in Creating a Parallel Program
[Figure: the sequential computation is partitioned: Decomposition yields Tasks; Assignment groups them into Processes p0–p3; Orchestration organizes the parallel program; Mapping places processes onto Processors P0–P3]
° Decomposition of computation in tasks
° Assignment of tasks to processes
° Orchestration of data access, comm, synch.
° Mapping processes to processors
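The four steps can be made concrete with a small pthreads sketch of a nearest-neighbor grid sweep (illustrative only; the grid size, thread count, and averaging kernel are assumptions):

#include <pthread.h>

#define NROWS 1024
#define NCOLS 1024
#define NTHREADS 4

static double A[NROWS][NCOLS], Anew[NROWS][NCOLS];
static pthread_barrier_t bar;

/* Decomposition: one task per interior grid row.
 * Assignment: a contiguous block of rows per process (thread). */
static void *worker(void *arg) {
    long id = (long)arg;
    int lo = 1 + (int)(id * (NROWS - 2) / NTHREADS);
    int hi = 1 + (int)((id + 1) * (NROWS - 2) / NTHREADS);

    /* Orchestration: data accesses plus a barrier per sweep. */
    for (int i = lo; i < hi; i++)
        for (int j = 1; j < NCOLS - 1; j++)
            Anew[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] +
                                 A[i][j-1] + A[i][j+1]);
    pthread_barrier_wait(&bar);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&bar, NULL, NTHREADS);
    /* Mapping: placing threads on processors is left to the OS here. */
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&tid[id], NULL, worker, (void *)id);
    for (int id = 0; id < NTHREADS; id++)
        pthread_join(tid[id], NULL);
    return 0;
}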
Discretize
° Time
• Forward difference: $\frac{\partial A}{\partial t} \approx \frac{A^{n+1} - A^{n}}{\Delta t}$, where $\Delta t = \frac{1}{T\ \text{steps}}$
° Space
• $\frac{\partial A}{\partial x} \approx \frac{A_{i+1} - A_{i}}{\Delta x}$, where $\Delta x = \frac{1}{X\ \text{grid points}}$
° 1st derivative: $\frac{\partial A}{\partial x} \approx \frac{A_{i+1} - A_{i}}{\Delta x}$
° 2nd derivative: $\frac{\partial^2 A}{\partial x^2} \approx \frac{A_{i+1} - 2A_{i} + A_{i-1}}{\Delta x^2}$
[Figure: space-time grid showing time steps n−2, n−1, n and boundary conditions A_{11}, A_{12}, … along the edges]
• Can use other discretizations (see below)
- Backward
- Leap frog
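For reference, the two alternatives named above are (standard definitions, not spelled out on the slide):

$\text{Backward difference:}\quad \frac{\partial A}{\partial t} \approx \frac{A^{n} - A^{n-1}}{\Delta t}$

$\text{Leapfrog:}\quad \frac{\partial A}{\partial t} \approx \frac{A^{n+1} - A^{n-1}}{2\,\Delta t}$

Backward differencing gives an implicit scheme (the unknown appears on both sides); leapfrog is centered in time and second-order accurate.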
1D Case
$\frac{\partial A}{\partial t} = \frac{\partial^2 A}{\partial x^2} + B$

Discretizing:

$\frac{A_i^{n+1} - A_i^{n}}{\Delta t} = \frac{A_{i+1}^{n} - 2A_i^{n} + A_{i-1}^{n}}{\Delta x^2} + B_i$

$A_i^{n+1} = \frac{\Delta t}{\Delta x^2}\left(A_{i+1}^{n} - 2A_i^{n} + A_{i-1}^{n}\right) + B_i\,\Delta t + A_i^{n}$

° Or, in matrix form, $A^{n+1} = M A^{n} + \Delta t\,B$:

$\begin{pmatrix} A_1^{n+1} \\ A_2^{n+1} \\ \vdots \\ A_i^{n+1} \\ \vdots \end{pmatrix} = \begin{pmatrix} 1 - \frac{2\Delta t}{\Delta x^2} & \frac{\Delta t}{\Delta x^2} & 0 & \cdots \\ \frac{\Delta t}{\Delta x^2} & 1 - \frac{2\Delta t}{\Delta x^2} & \frac{\Delta t}{\Delta x^2} & \\ & \ddots & \ddots & \ddots \end{pmatrix} \begin{pmatrix} A_1^{n} \\ A_2^{n} \\ \vdots \\ A_i^{n} \\ \vdots \end{pmatrix} + \Delta t \begin{pmatrix} B_1 \\ B_2 \\ \vdots \\ B_i \\ \vdots \end{pmatrix}$
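A direct transcription of this explicit update into C (a sketch, not from the slides; the grid size, step count, and source term are illustrative, and the time step is kept within the explicit-scheme stability limit):

#include <stdio.h>

#define NX 100                       /* interior grid points */

int main(void) {
    double A[NX + 2] = {0}, Anew[NX + 2] = {0}, B[NX + 2] = {0};
    double dx = 1.0 / (NX + 1);
    double dt = 0.4 * dx * dx;       /* stability: dt <= dx^2 / 2 */
    double r  = dt / (dx * dx);

    for (int i = 1; i <= NX; i++) B[i] = 1.0;   /* source term */

    for (int n = 0; n < 1000; n++) {            /* time steps  */
        /* A[0] and A[NX+1] hold the boundary conditions.      */
        for (int i = 1; i <= NX; i++)
            Anew[i] = A[i] + r * (A[i+1] - 2.0 * A[i] + A[i-1])
                           + dt * B[i];
        for (int i = 1; i <= NX; i++) A[i] = Anew[i];
    }
    printf("A at midpoint after 1000 steps: %g\n", A[NX / 2]);
    return 0;
}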
Multigrid
° Basic idea ---> Solve on coarse grid
---> then on fine grid
[Figure: 8×8 coarse grid with corners (1,1), (1,8), (8,1), (8,8); coarse-grid update of X^{k+1}]
Multigrid
° Basic idea ---> Solve on coarse grid
---> then on fine grid
[Figure: the same domain at finer resolution; fine-grid update of X^{k+1}_{i,j}]
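A two-grid sketch of this idea in C (illustrative; the Jacobi-style smoother, grid sizes, and linear interpolation are assumptions, not the slide's exact algorithm):

#include <stdio.h>

#define NC 5                  /* coarse grid points            */
#define NF 9                  /* fine grid points, NF = 2*NC-1 */

/* A few relaxation sweeps on a 1-D grid of n points,
 * solving the h^2-scaled system x[i-1] - 2x[i] + x[i+1] = -b[i]. */
static void smooth(double *x, const double *b, int n, int sweeps) {
    for (int s = 0; s < sweeps; s++)
        for (int i = 1; i < n - 1; i++)
            x[i] = 0.5 * (x[i-1] + x[i+1] + b[i]);
}

/* Solve cheaply on the coarse grid, interpolate, then refine. */
static void two_grid(double xf[NF], const double bf[NF]) {
    double xc[NC] = {0}, bc[NC] = {0};
    for (int i = 1; i < NC - 1; i++)
        bc[i] = 4.0 * bf[2 * i];    /* h doubles, h^2-scaled b gets 4x */
    smooth(xc, bc, NC, 50);         /* coarse solve: cheap iterations  */
    for (int i = 0; i < NC; i++)    /* inject coarse values            */
        xf[2 * i] = xc[i];
    for (int i = 1; i < NF; i += 2) /* interpolate points in between   */
        xf[i] = 0.5 * (xf[i-1] + xf[i+1]);
    smooth(xf, bf, NF, 5);          /* a few fine-grid sweeps finish   */
}

int main(void) {
    double b[NF], x[NF] = {0};
    for (int i = 0; i < NF; i++) b[i] = 1.0 / ((NF + 1) * (NF + 1));
    two_grid(x, b);
    printf("x at midpoint: %g\n", x[NF / 2]);
    return 0;
}

The coarse solve removes the smooth, long-wavelength error cheaply; the fine grid then only has to clean up local error.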
Domain Decomposition
° Works well for scientific, engineering, graphics, ... applications
° Exploits local-biased nature of physical problems
• Information requirements often short-range
• Or long-range but fall off with distance
° Simple example: nearest-neighbor grid computation
[Figure: n×n grid partitioned into 16 blocks of (n/√p)×(n/√p) points, one per processor P0–P15]
° Perimeter to Area comm-to-comp ratio (area to volume in 3-d)
• Depends on n, p: decreases with n, increases with p
Domain Decomposition
° Best domain decomposition depends on information requirements
° Nearest neighbor example: block versus strip decomposition
[Figure: block decomposition, (n/√p)×(n/√p) squares, versus strip decomposition, (n/p)×n strips, of an n×n grid over processors P0–P15]
° Comm to comp: $\frac{4\,p^{0.5}}{n}$ for block, $\frac{2p}{n}$ for strip
° Application dependent: strip may be better in other cases
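Spelling out the arithmetic behind these ratios, with an illustrative numeric case:

$\text{block:}\ \frac{4\,(n/\sqrt{p})}{n^2/p} = \frac{4\sqrt{p}}{n}, \qquad \text{strip:}\ \frac{2n}{n^2/p} = \frac{2p}{n}$

For example, with n = 1024 and p = 16, block gives 16/1024 ≈ 0.016 while strip gives 32/1024 ≈ 0.031, so the block decomposition communicates about half as much per unit of computation.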
Exploiting Temporal Locality
• Structure algorithm so working sets map well to hierarchy
- often techniques to reduce inherent communication do well here
- schedule tasks for data reuse once assigned
• Solver example: blocking
[Figure: (a) unblocked access pattern in a sweep; (b) blocked access pattern with B = 4]
1-D Array of nodes for Jacobi
[Figure: 1-D array of P nodes, each holding N/P grid points; exchanges between nodes provide the Comm term]
° Model: 1 op = 1 cycle; 1 comm/hop = 1 cycle
° $T = \frac{N}{P} + \text{Comm}$, with $\text{Comm} = \sqrt{\frac{N}{P}}\,P$
Scalability
° $S_I(N) = N$; $S_R(N) = ?$
° Ideal speedup on any number of procs.
° Find best P:
$T_{par} = \frac{N}{P} + \sqrt{\frac{N}{P}}\,P$
$\frac{dT_{par}}{dP} = -\frac{N}{P^2} + \frac{1}{2}\sqrt{\frac{N}{P}} = 0 \;\Rightarrow\; P = q\,N^{1/3}$ (q a constant)
° So,
$T_{par} = \frac{N}{q\,N^{1/3}} + \sqrt{q}\,N^{2/3} = \gamma\,N^{2/3}$
$S_R(N) = \frac{N}{T_{par}} = \frac{N^{1/3}}{\gamma} = \frac{S_I(N)^{1/3}}{\gamma}$
° So, 1-D array is scalable for Jacobi
Detailed Example
° Given: p = 10×10⁶, c = 0.1×10⁶, m = 0.1×10⁶, P = 10
° Balance processing against communication: $p = \frac{N}{P}\,c$
or 10×10⁶ = (N/10) × 0.1×10⁶, so N = 1000 for balance
° Also, $R_M = m$: N/P = 1000/10 = 100 = m
Memory size of m = 100 yields a balanced machine.
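Restating the slide's arithmetic as one derivation:

$p = \frac{N}{P}\,c \;\Rightarrow\; N = \frac{pP}{c} = \frac{(10\times10^6)(10)}{0.1\times10^6} = 1000, \qquad m = \frac{N}{P} = \frac{1000}{10} = 100$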
Better Algorithm, but basically Branch and Bound
° Little et al.
° Basic Ideas:
[Figure: example tour as a directed graph with edge costs, and its cost matrix]
Better Algorithm, but basically Branch and Bound
Little et al.
[Figure: cost matrix before and after reduction, with the reduced costs shown on the graph]
° Notion of reduction:
• Subtract same value from each row or column
• The total subtracted bounds the cost of any tour from below: at least 4 here
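A sketch of the reduction step in C (the matrix contents here are illustrative, not the slide's; INF marks the forbidden diagonal entries):

#include <stdio.h>

#define N   5
#define INF 1000000

/* Subtract each row's minimum, then each column's minimum.
 * Every tour uses exactly one entry per row and per column,
 * so the total subtracted is a lower bound on any tour cost. */
static int reduce(int c[N][N]) {
    int bound = 0;
    for (int i = 0; i < N; i++) {
        int mn = INF;
        for (int j = 0; j < N; j++) if (c[i][j] < mn) mn = c[i][j];
        if (mn > 0 && mn < INF) {
            for (int j = 0; j < N; j++)
                if (c[i][j] < INF) c[i][j] -= mn;
            bound += mn;
        }
    }
    for (int j = 0; j < N; j++) {
        int mn = INF;
        for (int i = 0; i < N; i++) if (c[i][j] < mn) mn = c[i][j];
        if (mn > 0 && mn < INF) {
            for (int i = 0; i < N; i++)
                if (c[i][j] < INF) c[i][j] -= mn;
            bound += mn;
        }
    }
    return bound;   /* e.g. the "at least 4" bound above */
}

int main(void) {
    int c[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = (i == j) ? INF : (i + 2 * j) % 7 + 1;
    printf("lower bound = %d\n", reduce(c));
    return 0;
}

Branch and bound uses this bound to prune: a subproblem whose reduced bound already exceeds the best complete tour found so far need not be explored.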
Better Algorithm, but basically Branch and Bound
Little et al.
[Figure: reduced cost matrices and graph for the next branch-and-bound step, with annotations "At least 1" and "At least 4" on the resulting bounds]
Communication Finite State Machine
• Each node has a processing part and a communications part
• Interface to local processor is a FIFO
• Communication to near-neighbors is pipelined
Statically Programmed Communication
• Data transferred one node in one cycle
• Inter-processor path may require multiple cycles
• Heavy arrows represent local transfers
• Grey arrows represent non-local transfers