Chapter 1: Perspectives
Copyright © 2005-2008 Yan Solihin
Copyright notice:
No part of this publication may be reproduced, stored in a retrieval system, or
transmitted by any means (electronic, mechanical, photocopying, recording, or
otherwise) without the prior written permission of the author.
An exception is granted for academic lectures at universities and colleges, provided
that the following text is included in such copy: “Source: Yan Solihin, Fundamentals
of Parallel Computer Architecture, 2008”.
Evolution in Microprocessors
Key Points
• More and more components can be integrated on a single chip
• Speed of integration tracks Moore’s law: doubling every 18-24 months
• Performance tracked speed of integration up until recently
• At the architecture level, there are two techniques:
  - Instruction Level Parallelism
  - Cache Memory
• Performance gains from uniprocessor systems were so significant that multiprocessor systems were not profitable
Illustration
• 100-processor system with perfect speedup
• Compared to a single processor system:
  - Year 1: 100x faster
  - Year 2: 62.5x faster
  - Year 3: 39x faster
  - …
  - Year 10: 0.9x faster
• Single processor performance catches up in just a few years!
• Even worse:
  - It takes longer to develop a multiprocessor system
  - Low volume means prices must be very high
  - High prices delay adoption
  - Perfect speedup is unattainable
Why did uniproc performance grow so fast?
• ~ half from circuit improvement (smaller transistors, faster clock, etc.)
• ~ half from architecture/organization:
  - Instruction Level Parallelism (ILP)
    - Pipelining: RISC, CISC with RISC backend
    - Superscalar
    - Out-of-order execution
  - Memory hierarchy (caches)
    - Exploiting spatial and temporal locality
    - Multiple cache levels
But the uniproc perf growth is stalling
• Source of uniprocessor performance growth: instruction level parallelism (ILP)
  - Parallel execution of independent instructions from a single thread
• ILP growth has slowed abruptly
  - Memory wall: processor speed grows at 55%/year, memory speed grows at 7%/year
  - ILP wall: achieving higher ILP requires quadratically increasing complexity (and power)
• Power efficiency
  - Thermal packaging limit vs. cost
Types of parallelism
• Instruction level (ECE 521)
  - Pipelining:

    Cycle       1    2    3    4    5    6    7
    A (a load)  IF   ID   EX   MEM  WB
    B                IF   ID   EX   MEM  WB
    C                     IF   ID   EX   MEM  WB
• Superscalar / VLIW
  - Original:
      LD    F0, 34(R2)
      ADDD  F4, F0, F2
      LD    F7, 45(R3)
      ADDD  F8, F7, F6
  - Schedule as:
      LD    F0, 34(R2)  |  LD    F7, 45(R3)
      ADDD  F4, F0, F2  |  ADDD  F8, F7, F6
+ Moderate degree of parallelism (sometimes 50)
- Requires fast communication (register level)
Why ILP is slowing
• Branch prediction accuracy is already > 90%
• Number of pipeline stages is already deep (~20-30 stages)
  - But critical dependence loops do not change
  - Memory latency requires more clock cycles to satisfy
• Processor width is already high
  - Hard to improve it even more
  - Quadratically increasing complexity to increase the width
• Cache size
  - Effective, but also shows diminishing returns
  - In general, the size must be doubled to reduce the miss rate by half
Current Trend: Multicore and Manycore

Aspect       | Intel Clovertown  | AMD Barcelona                          | IBM Cell
# cores      | 4                 | 4                                      | 8+1
Clock freq   | 2.66 GHz          | 2.3 GHz                                | 3.2 GHz
Core type    | OOO superscalar   | OOO superscalar                        | 2-issue SIMD
Caches       | 2x4 MB L2         | 512 KB L2 (private), 2 MB L3 (shared)  | 256 KB local store
Chip power   | 120 Watts         | 95 Watts                               | 100 Watts
Historical Perspectives
• 80s – early 90s: prime time for parallel architecture research
  - A microprocessor could not fit on a single chip, so multiple chips (and processors) were naturally needed
  - J-machine, M-machine, Alewife, Tera, HEP, etc.
• 90s: at the low end, uniprocessor systems’ speed grows much faster than parallel systems’ speed
  - A microprocessor fits on a chip. So do a branch predictor, multiple functional units, large caches, etc.!
  - The microprocessor also exploits parallelism (pipelining, multiple issue, VLIW) – parallelism originally invented for multiprocessors
  - Many parallel computer vendors went bankrupt
  - Prestigious but small high-performance computing market
“If the automobile industry advanced as rapidly as
the semiconductor industry, a Rolls Royce would
get ½ million miles per gallon and it would be
cheaper to throw it away than to park it.”
Gordon Moore,
Intel Corporation
• 90s: emergence of distributed (vs. parallel) machines
  - Progress in network technologies:
    - Network bandwidth grows faster than Moore’s law
    - Fast interconnection networks getting cheap
  - Connects cheap uniprocessor systems into a large distributed machine
  - Network of Workstations, Clusters, GRID
• 00s: parallel architectures are back
  - Transistors per chip >> what a single microprocessor needs
  - Harder to get more performance from a uniprocessor
  - SMT (Simultaneous Multithreading), CMP (Chip MultiProcessor), ultimately massive CMP
  - E.g. Intel Pentium D, Core Duo, AMD Dual Core, IBM Power5, Sun Niagara, etc.
What is a Parallel Architecture?
“A parallel computer is a collection of processing
elements that can communicate and cooperate to
solve a large problem fast.”
-- Almasi & Gottlieb
Parallel computers
• A parallel computer is a collection of processing elements that can communicate and cooperate to solve a large problem fast. [Almasi & Gottlieb]
• “collection of processing elements”
  - How many? How powerful each? Scalability?
  - Few very powerful (e.g., Altix) vs. many small ones (BlueGene)
• “that can communicate”
  - How do PEs communicate? (shared memory vs. msg passing)
  - Interconnection network (bus, multistage, crossbar, …)
  - Evaluation criteria: cost, latency, throughput, scalability, and fault tolerance
• “and cooperate”
  - Issues: granularity, synchronization, and autonomy
  - Synchronization allows sequencing of operations to ensure correctness
  - Granularity up => parallelism down, communication down, overhead down
    - Statement/instruction level: 2-10 instructions (ECE 521)
    - Loop level: 10-1K instructions
    - Task level: 1K-1M instructions
    - Program level: > 1M instructions
  - Autonomy
    - SIMD (single instruction stream) vs. MIMD (multiple instruction streams)
• “solve a large problem fast”
  - General vs. special-purpose machine?
  - Any machine can solve certain problems well
• What domains?
  - Highly (embarrassingly) parallel apps
    - Many scientific codes
  - Medium parallel apps
    - Many engineering apps (finite-elements, VLSI-CAD)
  - Not parallel apps
    - Compilers, editors (do we care?)
Why parallel computers?
• Absolute performance: can we afford to wait?
  - Folding of a single protein takes years to simulate on the most advanced microprocessor; it takes only days on a parallel computer
  - Weather forecast: timeliness is crucial
• Cost/performance
  - Harder to improve performance on a single processor
  - Bigger monolithic processor vs. many, simple processors
• Power/performance
• Reliability and availability
• Key enabling technologies:
  - Advances in microprocessor and interconnect technology
  - Advances in software technology
Scope of CSC/ECE 506
• Parallelism
  - Loop Level and Task Level Parallelism
• Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - Shared memory machines (SMP and DSM)
    - Clusters
• Programming Model:
  - Shared Memory
  - Message passing
  - Hybrid
Loop level parallelism
• Each iteration can be computed independently:
    for (i = 0; i < 8; i++)
      a[i] = b[i] + c[i];
• Each iteration cannot be computed independently, thus does not have loop level parallelism:
    for (i = 0; i < 8; i++)
      a[i] = b[i] + a[i-1];
+ Very high parallelism (> 1K)
+ Often easy to achieve load balance
- Some loops are not parallel
- Some apps do not have many loops
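As a concrete illustration (a sketch, not from the original slides), the independent loop above can be parallelized with an OpenMP directive, assuming an OpenMP-capable C compiler; the second loop cannot, because iteration i reads a[i-1] written by the previous iteration.

    #include <omp.h>

    /* Loop-level parallelism: every iteration is independent, so the
       iterations can be divided among threads. */
    void add_arrays(double *a, const double *b, const double *c, int n)
    {
        #pragma omp parallel for   /* iterations distributed across threads */
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
        /* The recurrence a[i] = b[i] + a[i-1] could NOT be parallelized this
           way: each iteration depends on the result of the previous one. */
    }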
Task level parallelism
• Arbitrary code segments in a single program
• Across loops:
    …
    for (i = 0; i < n; i++)
      sum = sum + a[i];
    for (i = 0; i < n; i++)
      prod = prod * a[i];
    …
• Subroutines:
    Cost = getCost();
    A = computeSum();
    B = A + Cost;
• Threads: e.g. editor: GUI, printing, parsing
+ Larger granularity => low overheads, communication
- Low degree of parallelism
- Hard to balance
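A hedged sketch (not in the slides) of how the two loops above could run as two coarse-grained tasks using OpenMP sections: the sum loop and the product loop write disjoint results, so they can execute on different processors without synchronization.

    #include <omp.h>

    void sum_and_product(const double *a, int n, double *sum_out, double *prod_out)
    {
        double sum = 0.0, prod = 1.0;
        #pragma omp parallel sections
        {
            #pragma omp section         /* task 1: the sum loop */
            for (int i = 0; i < n; i++)
                sum = sum + a[i];
            #pragma omp section         /* task 2: the product loop */
            for (int i = 0; i < n; i++)
                prod = prod * a[i];
        }
        *sum_out = sum;                 /* each task wrote a different variable */
        *prod_out = prod;
    }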
Program level parallelism
• Various independent programs execute together
• gmake:
    gcc -c code1.c              // assign to proc1
    gcc -c code2.c              // assign to proc2
    gcc -c main.c               // assign to proc3
    gcc main.o code1.o code2.o
+ No communication
- Hard to balance
- Few opportunities
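Program-level parallelism needs no special hardware or language support; this is what gmake does when asked to run jobs concurrently. A minimal POSIX C sketch (the file names are illustrative, not from the slides):

    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        const char *files[] = { "code1.c", "code2.c", "main.c" };
        for (int i = 0; i < 3; i++) {
            if (fork() == 0) {                        /* child process: compile one file */
                execlp("gcc", "gcc", "-c", files[i], (char *)0);
                _exit(1);                             /* reached only if exec fails */
            }
        }
        while (wait(NULL) > 0)                        /* parent: wait for all compilations */
            ;
        /* The link step (gcc main.o code1.o code2.o) must run after all three finish. */
        return 0;
    }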
Scope of CSC/ECE 506
• Parallelism
  - Loop Level and Task Level Parallelism
• Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - *Shared memory machines (SMP and DSM)
    - Clusters
• Programming Model:
  - Shared Memory
  - Message passing
  - Hybrid
Taxonomy of Parallel Computers
The Flynn taxonomy:
• Single or multiple instruction streams.
• Single or multiple data streams.

1. SISD machine (most desktops, laptops)
  - Only one instruction fetch stream
  - Most of today’s workstations or desktops
  [Diagram: a control unit issues a single instruction stream to one ALU, which operates on a single data stream]
SIMD
• Examples: vector processors, SIMD extensions (MMX)
• A single instruction operates on multiple data items:
    SISD:
    for (i = 0; i < 8; i++)
      a[i] = b[i] + c[i];

    SIMD:
    a = b + c;   // vector addition
  [Diagram: one control unit issues a single instruction stream to ALU 1 … ALU n, each operating on its own data stream]
• Pseudo-SIMD popular for multimedia extensions
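A minimal sketch of the pseudo-SIMD (multimedia extension) case in C, assuming an x86 processor with SSE: one vector add instruction operates on four floats at a time, so the 8-element loop becomes two vector additions.

    #include <xmmintrin.h>   /* SSE intrinsics */

    void vector_add8(float *a, const float *b, const float *c)
    {
        for (int i = 0; i < 8; i += 4) {                 /* 4 floats per 128-bit register */
            __m128 vb = _mm_loadu_ps(&b[i]);             /* load b[i..i+3] */
            __m128 vc = _mm_loadu_ps(&c[i]);             /* load c[i..i+3] */
            _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));    /* a[i..i+3] = b + c in one instruction */
        }
    }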
MISD machine
• Example: CMU Warp
• Systolic arrays
  [Diagram: control units 1 … n each issue their own instruction stream to ALU 1 … ALU n, which operate on a single data stream passing through them]
MIMD machine
• Independent processors connected together to form a multiprocessor system.
• Physical organization:
  - Determines which memory hierarchy level is shared
• Programming abstraction:
  - Shared Memory:
    - On a chip: Chip Multiprocessor (CMP)
    - Interconnected by a bus: Symmetric Multiprocessor (SMP)
    - Point-to-point interconnection: Distributed Shared Memory (DSM)
  - Distributed Memory:
    - Clusters, Grid
MIMD Physical Organization
• Shared Cache Architecture:
  - CMP (or Simultaneous Multi-Threading)
  - e.g.: Pentium 4 chip, IBM Power4 chip, Sun Niagara, Pentium D, etc.
  - Implies shared memory hardware
  [Diagram: processors P … P sharing caches and a memory M]
• UMA (Uniform Memory Access) Shared Memory:
  - Pentium Pro Quad, Sun Enterprise, etc.
  - What interconnection network?
    - Bus
    - Multistage
    - Crossbar
    - etc.
  - Implies shared memory hardware
  [Diagram: processors P … P, each with private caches, connected through a network to a shared memory M]
MIMD Physical Organization (2)
• NUMA (Non-Uniform Memory Access) Shared Memory:
  - SGI Origin, Altix, IBM p690, AMD Hammer-based systems
  - What interconnection network?
    - Crossbar
    - Mesh
    - Hypercube
    - etc.
  - Also referred to as Distributed Shared Memory
  [Diagram: nodes, each with a processor P, caches, and a local memory M, connected by a network]
MIMD Physical Organization (3)
• Distributed System/Memory:
  - Also called clusters, grid
  - Don’t confuse it with distributed shared memory
  [Diagram: nodes, each with a processor P, caches, memory M, and I/O, connected by a network]
Parallel vs. Distributed Computers
[Plots: cost vs. system size and performance vs. system size, for parallel and distributed computers]
• Small scale machines: parallel system cheaper
• Large scale machines: distributed system cheaper
• Performance: parallel system better (but more expensive)
• System size: parallel system limited, and cost grows fast
• However, must also consider software cost
Scope of CSC/ECE 506
• Parallelism
  - Loop Level and Task Level Parallelism
• Flynn taxonomy:
  - MIMD
    - Shared memory machines (SMP and DSM)
• Programming Model:
  - Shared Memory
  - Message passing
  - Hybrid (e.g., UPC)
  - Data parallel
Programming Models
• Shared Memory / Shared Address Space:
  - Each processor can see the entire memory
  [Diagram: processors P … P all connected to one shared memory M]
  - Programming model = thread programming in uniprocessor systems
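A minimal sketch (not from the slides) of the shared-memory model using POSIX threads: both threads see the same variable, so communication is implicit through memory, but the update must be guarded by a lock.

    #include <pthread.h>
    #include <stdio.h>

    long counter = 0;                                   /* lives in the shared address space */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *work(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                  /* guard the shared variable */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);             /* 200000: both threads shared one memory */
        return 0;
    }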

• Distributed Memory / Message Passing / Multiple Address Space:
  - A processor can only directly access its own local memory. All communication happens by explicit messages.
  [Diagram: nodes, each a processor P with its own memory M, connected by a network]
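A hedged sketch of the message-passing model using MPI (assuming at least two processes): each process owns its local memory, and the only way data moves between processes is an explicit send/receive pair.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;                                              /* exists only in process 0's memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* explicit message to process 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                             /* explicit receive */
            printf("process 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }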
Shared Mem compared to Msg Passing
+ Can easily be automated (parallelizing compiler, OpenMP)
+ Shared vars are not communicated, but must be guarded
- How to provide shared memory? Complex hardware
- Synchronization overhead grows fast with more processors
+/- Difficult to debug, not intuitive for users
Data Parallel Prog Paradigm & Systems
• Programming model
  - Operations performed in parallel on each element of a data structure
  - Logically single thread of control, performs sequential or parallel steps
  - Conceptually, a processor associated with each data element
• Architectural model
  - Array of many simple, cheap processors with little memory each
    - Processors don’t sequence through instructions
  - Attached to a control processor that issues instructions
  - Specialized and general communication, cheap global synchronization
• Original motivations
  - Matches simple differential equation solvers
  - Centralize high cost of instruction fetch/sequencing
Application of Data Parallelism
• Each PE contains an employee record with his/her salary
    If salary > 100K then
      salary = salary * 1.05
    else
      salary = salary * 1.10
• Logically, the whole operation is a single step
• Some processors enabled for the arithmetic operation, others disabled
• Other examples:
  - Finite differences, linear algebra, …
  - Document searching, graphics, image processing, …
• Some recent machines:
  - Thinking Machines CM-1, CM-2 (and CM-5)
  - Maspar MP-1 and MP-2
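Expressed in C (a sketch, not from the slides), the salary update is one logical step over all elements; the if/else acts as a mask that enables some processing elements for one arithmetic operation and the rest for the other. The array and length are assumptions for illustration.

    void update_salaries(double *salary, int n)   /* n records, conceptually one per PE */
    {
        for (int i = 0; i < n; i++) {             /* logically a single data-parallel step */
            if (salary[i] > 100000.0)
                salary[i] *= 1.05;                /* PEs "enabled" for this operation */
            else
                salary[i] *= 1.10;                /* remaining PEs take the other path */
        }
    }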
Common Today
• Systolic arrays:
  - Idea adopted in graphics and network processors
• Dataflow:
  - Idea adopted in superscalar processors
• Shared memory: most small scale servers (up to 128 processors)
  - Now in workstations/desktops/laptops, too
• Message passing: most large scale systems
  - Clusters, grid (hundreds to thousands of processors)
• Data parallel/SIMD:
  - Small scale: SIMD multimedia extensions (MMX, VIS)
  - Large scale: vector processors
Top 500 Supercomputers
• http://www.top500.org
• Let’s look at the Earth Simulator
  - Was #1 in 2004, now #10 in 2006
• Hardware:
  - 5,120 NEC CPUs (640 8-way nodes) at 500 MHz
  - 8 GFLOPS per CPU (41 TFLOPS total)
  - ~30 TFLOPS sustained performance!
  - 2 GB (4 x 512 MB FPLRAM modules) per CPU (10 TB total)
  - Shared memory inside a node
  - 10 TB total memory
  - 640 × 640 crossbar switch between the nodes
  - 16 GB/s inter-node bandwidth
  - 20 kVA power consumption per node
Programming Model
• In a CPU: data parallel, using automatic vectorization
  - Instruction level
• In a node (8 CPUs): shared memory using OpenMP
  - Loop level
• Across nodes: message passing using MPI-2 or HPF
  - Algorithm level
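A hedged skeleton of that hybrid style in C: MPI for message passing across nodes, OpenMP for shared memory within a node, with vectorization of the inner work left to the compiler. The work split and problem size are illustrative only, not from the slides.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);           /* one MPI process per node */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const long N = 1000000;
        long lo = rank * (N / nprocs);                  /* this node's slice of the work */
        long hi = (rank == nprocs - 1) ? N : lo + N / nprocs;

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)     /* threads share the node's memory */
        for (long i = lo; i < hi; i++)
            local += (double)i;                         /* vectorizable inner work */

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %.0f\n", global);
        MPI_Finalize();
        return 0;
    }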

“The machine room sits at approximately 4th floor
level. The 3rd floor level is taken by hundreds of
kilometers of copper cabling, and the lower floors
house the air conditioning and electrical equipment.
The structure is enclosed in a cooling shell, with the air
pumped from underneath through the cabinets, and
collected to the two long sides of the building. The
aeroshell gives the building its "pumped-up"
appearance. The machine room is electromagnetically
shielded to prevent interference from nearby
expressway and rail. Even the halogen light sources
are outside the shield, and the light is distributed by a
grid of scattering pipes under the ceiling. The entire
structure is mechanically isolated from the
surroundings, suspended in order to make it less prone
to earthquake damage. All attachments (power,
cooling, access walkways) are flexible.”
• Linpack performance: 40 TFlops (80% of peak)
• Real-world performance: 33-66% of peak (vs. less than 15% for clusters)
• Cost? Hint: it starts with 4
• Maintenance: $15M per year
• Failure rate: about one processor per week
• A distributed-memory parallel computing system in which 640 processor nodes are interconnected by a single-stage crossbar network
Fastest (#1 as of Aug 2006)
• BlueGene
• 65,536 processors
• Each processor: PowerPC 440 at 700 MHz (2.8 GFlops)
• Rpeak: 183 TFLOPS
• Rmax: 136 TFLOPS