Distributed Shared Memory

Chapter 1: Perspectives
Copyright © 2005-2008 Yan Solihin
Copyright notice:
No part of this publication may be reproduced, stored in a retrieval system, or
transmitted by any means (electronic, mechanical, photocopying, recording, or
otherwise) without the prior written permission of the author.
An exception is granted for academic lectures at universities and colleges, provided
that the following text is included in such copy: “Source: Yan Solihin, Fundamentals
of Parallel Computer Architecture, 2008”.
Outline for Lecture 1

- Introduction
- Types of parallelism
- Architectural trends
- Why parallel computers?
- Scope of CSC/ECE 506

Evolution in Microprocessors

[Figure: evolution in microprocessors (image not preserved in this transcript)]

Key Points

- More and more components can be integrated on a single chip
  - Speed of integration tracks Moore's law, doubling every 18–24 months.
  - Exercise: Look up how the number of transistors per chip has changed, especially since 2006. Submit here.
- Until recently, performance tracked speed of integration
  - At the architectural level, two techniques facilitated this:
    - Instruction-level parallelism
    - Cache memory
- Performance gain from uniprocessor systems was high enough that multiprocessor systems were not viable for most uses.

Illustration

- Consider a 100-processor system with perfect speedup, compared to a single-processor system whose performance improves by roughly 60% per year:
  - Year 1: 100x faster
  - Year 2: 62.5x faster
  - Year 3: 39x faster
  - ...
  - Year 10: 0.9x faster
- Single-processor performance catches up in just a few years!
- Even worse:
  - It takes longer to develop a multiprocessor system
  - Low volume means prices must be very high
  - High prices delay adoption
  - Perfect speedup is unattainable

Why did uniprocessor performance grow so fast?

- ≈ half from circuit improvement (smaller transistors, faster clocks, etc.)
- ≈ half from architecture/organization:
  - Instruction-level parallelism (ILP)
    - Pipelining: RISC, and CISC with a RISC back-end
    - Superscalar execution
    - Out-of-order execution
  - Memory hierarchy (caches)
    - Exploit spatial and temporal locality (see the sketch after this list)
    - Multiple cache levels

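To make the locality point concrete, here is a minimal sketch (my own illustration, not from the slides) of how a cache-friendly traversal exploits spatial and temporal locality:

    /* Minimal locality sketch (illustrative only). */
    #include <stdio.h>

    #define N 1024
    static double m[N][N];

    int main(void) {
        double sum = 0.0;                /* reused every iteration: temporal locality */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];          /* consecutive addresses: spatial locality,
                                            so each fetched cache line is fully used */
        /* Swapping the loops (summing m[j][i]) would stride N*sizeof(double)
           bytes per access and miss in the cache far more often. */
        printf("%f\n", sum);
        return 0;
    }
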
But uniprocessor performance growth is stalling

- Source of uniprocessor performance growth: ILP
  - Parallel execution of independent instructions from a single thread
- ILP growth has slowed abruptly
  - Memory wall: processor speed grows at 55% per year, while memory speed grows at only 7% per year
  - ILP wall: achieving higher ILP requires quadratically increasing complexity (and power)
- Power efficiency
  - Thermal packaging limit vs. cost

Types of parallelism

- Instruction level (cf. ECE 521)
  - Pipelining: successive instructions overlap in the pipeline stages, e.g.

    Cycle:       1    2    3    4    5    6    7
    A (a load)   IF   ID   EX   MEM  WB
    B                 IF   ID   EX   MEM  WB
    C                      IF   ID   EX   MEM  WB

Types of parallelism, cont.

- Superscalar / VLIW
- Original:

    LD   F0, 34(R2)
    ADDD F4, F0, F2
    LD   F7, 45(R3)
    ADDD F8, F7, F6

- Schedule as:

    LD   F0, 34(R2)   |   LD   F7, 45(R3)
    ADDD F4, F0, F2   |   ADDD F8, F7, F6

+ Moderate degree of parallelism (sometimes 50)
– Requires fast communication (register level)

Why ILP is slowing

- Branch-prediction accuracy is already > 90%
  - Hard to improve it even more
- Number of pipeline stages is already deep (≈ 20–30 stages)
  - But critical dependence loops do not change
  - Memory latency requires more clock cycles to satisfy
- Processor width is already high
  - Quadratically increasing complexity to increase the width
- Cache size
  - Effective, but also shows diminishing returns
  - In general, the size must be doubled to reduce the miss rate by half

Current trends: multicore and manycore

Aspect             Intel Clovertown     AMD Barcelona              IBM Cell
# cores            4                    4                          8+1
Clock frequency    2.66 GHz             2.3 GHz                    3.2 GHz
Core type          OOO superscalar      OOO superscalar            2-issue SIMD
Caches             2 x 4 MB L2          512 KB L2 (private),       256 KB local store
                                        2 MB L3 (shared)
Chip power         120 watts            95 watts                   100 watts

Exercise: Browse the Web for information on more recent processors, and for each processor, fill out this form. (You can view the submissions here.)

Historical perspectives

- 80s – early 90s: Prime time for parallel architecture research
  - A microprocessor could not fit on a chip, so multiple chips (and processors) were naturally needed
  - J-machine, M-machine, Alewife, Tera, HEP, DASH, etc.
  - Exercise: Pick one of these machines, and identify a major innovation that it introduced. Submit here.
- 90s: At the low end, uniprocessor systems' speed grows much faster than parallel systems' speed
  - A microprocessor fits on a chip. So do branch predictors, multiple functional units, large caches, etc.!
  - Microprocessors also exploit parallelism (pipelining, multiple issue, VLIW) – forms of parallelism originally invented for multiprocessors
  - Many parallel computer vendors went bankrupt
  - Prestigious but small high-performance computing market

“If the automobile industry advanced as rapidly as the
semiconductor industry, a Rolls Royce would get a half
million miles per gallon and it would be cheaper to
throw it away than to park it.”
Gordon Moore,
Intel Corporation, 1998
Historical perspectives, cont.

- 90s: Emergence of distributed (vs. parallel) machines
  - Progress in network technologies:
    - Network bandwidth grows faster than Moore's law
    - Fast interconnection networks are getting cheap
  - Connects cheap uniprocessor systems into a large distributed machine
  - Networks of Workstations, Clusters, Grid
- 00s: Parallel architectures are back!
  - Transistors available per chip >> transistors needed by a single microprocessor
  - Harder to get more performance from a uniprocessor
  - SMT (simultaneous multithreading), CMP (chip multiprocessor), and ultimately massive CMP
  - E.g., Intel Pentium D, Core Duo, AMD Dual Core, IBM Power5, Sun Niagara.

Parallel computers

- A parallel computer is a collection of processing elements that can communicate and cooperate to solve a large problem fast. [Almasi & Gottlieb]
- "collection of processing elements"
  - How many? How powerful is each? Scalability?
  - A few very powerful ones (e.g., Altix) vs. many small ones (BlueGene)
- "that can communicate"
  - How do PEs communicate? (shared memory vs. message passing)
  - Interconnection network (bus, multistage, crossbar, ...)
  - Metrics: cost, latency, throughput, scalability, fault tolerance

Parallel computers, cont.

- "and cooperate"
  - Issues: granularity, synchronization, and autonomy
  - Synchronization allows sequencing of operations to ensure correctness
  - Granularity up → parallelism down, communication down, overhead down
    - Statement/instruction level: 2–10 instructions (ECE 521)
    - Loop level: 10 – 1K instructions
    - Task level: 1K – 1M instructions
    - Program level: > 1M instructions
  - Autonomy
    - SIMD (single instruction stream) vs. MIMD (multiple instruction streams)

Parallel computers, cont.

- "solve a large problem fast"
  - General-purpose vs. special-purpose machine?
  - Any machine can solve certain problems well
- What domains?
  - Highly (embarrassingly) parallel apps
    - Many scientific codes
  - Medium parallel apps
    - Many engineering apps (finite-element analysis, VLSI CAD)
  - Not parallel apps
    - Compilers, editors (do we care?)

Why parallel computers?

- Absolute performance: can we afford to wait?
  - Folding of a single protein takes years to simulate on the most advanced microprocessor; it takes only days on a parallel computer
  - Weather forecasting: timeliness is crucial
- Cost/performance
  - It is harder to keep improving performance on a single processor
  - A bigger monolithic processor vs. many simple processors
- Power/performance
- Reliability and availability
- Key enabling technologies
  - Advances in microprocessor and interconnect technology
  - Advances in software technology

Scope of CSC/ECE 506

- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - Shared memory machines (SMP and DSM)
    - Clusters
- Programming model:
  - Shared memory
  - Message-passing
  - Hybrid

Loop-level parallelism

- Sometimes each iteration can be performed independently (a parallel version is sketched after this slide):

    for (i=0; i<8; i++)
        a[i] = b[i] + c[i];

- Sometimes iterations cannot be performed independently → no loop-level parallelism:

    for (i=0; i<8; i++)
        a[i] = b[i] + a[i-1];

+ Very high parallelism (> 1K)
+ Often easy to achieve load balance
– Some loops are not parallel
– Some apps do not have many loops

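As promised above, here is a minimal sketch (not from the slides) of the independent loop parallelized with OpenMP; the pragma is one common way to express loop-level parallelism, assuming a compiler with OpenMP support (e.g., gcc -fopenmp):

    /* Loop-level parallelism sketch: each iteration is independent, so the
       iterations can be divided among threads with no communication. */
    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

        #pragma omp parallel for         /* distribute iterations across threads */
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        /* The dependent loop from the slide (a[i] = b[i] + a[i-1]) could not
           be split up this way: each iteration needs the previous result. */
        for (int i = 0; i < N; i++)
            printf("a[%d] = %.1f\n", i, a[i]);
        return 0;
    }
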
Task-level parallelism

- Arbitrary code segments in a single program
- Across loops (a threaded sketch follows this slide):

    ...
    for (i=0; i<n; i++)
        sum = sum + a[i];
    for (i=0; i<n; i++)
        prod = prod * a[i];
    ...

- Subroutines:

    Cost = getCost();
    A = computeSum();
    B = A + Cost;

- Threads: e.g., in an editor: GUI, printing, parsing
+ Larger granularity => lower overheads and communication
– Low degree of parallelism
– Hard to balance

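A hedged sketch of the "across loops" case using POSIX threads (the thread library and helper names are my choice; the slide does not prescribe an API). The two loops update different variables, so they can run concurrently without synchronization:

    /* Task-level parallelism sketch: the sum loop and the product loop run
       in separate threads. Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static double a[N];
    static double sum = 0.0, prod = 1.0;

    static void *do_sum(void *arg)  { for (int i = 0; i < N; i++) sum  += a[i]; return NULL; }
    static void *do_prod(void *arg) { for (int i = 0; i < N; i++) prod *= a[i]; return NULL; }

    int main(void) {
        pthread_t t1, t2;
        for (int i = 0; i < N; i++) a[i] = i + 1;

        pthread_create(&t1, NULL, do_sum, NULL);   /* task 1 */
        pthread_create(&t2, NULL, do_prod, NULL);  /* task 2 */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("sum = %.1f, prod = %.1f\n", sum, prod);
        return 0;
    }
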
Program-level parallelism

- Various independent programs execute together
- gmake:

    gcc -c code1.c             // assign to proc1
    gcc -c code2.c             // assign to proc2
    gcc -c main.c              // assign to proc3
    gcc main.o code1.o code2.o

+ No communication
– Hard to balance
– Few opportunities

Scope of CSC/ECE 506

- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - *Shared memory machines (SMP and DSM)
    - Clusters
- Programming model:
  - Shared memory
  - Message-passing
  - Hybrid

Taxonomy of parallel computers

- The Flynn taxonomy classifies machines by:
  - Single or multiple instruction streams
  - Single or multiple data streams
- SISD machine (most desktops, laptops)
  - Only one instruction fetch stream
  - Most of yesterday's workstations or desktops

[Diagram: one control unit sends a single instruction stream to one ALU, which processes a single data stream]

SIMD

- Examples: vector processors, SIMD extensions (MMX)
- A single instruction operates on multiple data items (an intrinsics sketch follows this slide):

    SISD:
    for (i=0; i<8; i++)
        a[i] = b[i] + c[i];

    SIMD:
    a = b + c;    // vector addition

[Diagram: one control unit broadcasts a single instruction stream to ALU 1, ALU 2, ..., ALU n, each processing its own data stream (data stream 1 ... data stream n)]

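As a concrete, hedged illustration of the SIMD idea, the eight-element addition can be written with x86 SSE intrinsics, four floats per instruction; this assumes an SSE-capable CPU and is only one of many SIMD flavors:

    /* SIMD sketch: each _mm_add_ps adds four data elements, i.e., a single
       instruction stream drives multiple data streams. */
    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[8];
        float b[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float c[8] = {10, 20, 30, 40, 50, 60, 70, 80};

        for (int i = 0; i < 8; i += 4) {
            __m128 vb = _mm_loadu_ps(&b[i]);          /* load 4 floats */
            __m128 vc = _mm_loadu_ps(&c[i]);
            _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc)); /* 4 adds at once */
        }

        for (int i = 0; i < 8; i++) printf("%.0f ", a[i]);
        printf("\n");
        return 0;
    }
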
MISD

- Example: CMU Warp
- Systolic arrays

[Diagram: control unit 1 ... control unit n each issue their own instruction stream to ALU 1 ... ALU n, which operate on a single data stream]

MIMD

- Independent processors connected together to form a multiprocessor system
- Physical organization:
  - Determines which memory hierarchy level is shared
- Programming abstraction:
  - Shared memory:
    - On a chip: chip multiprocessor (CMP)
    - Interconnected by a bus: symmetric multiprocessor (SMP)
    - Point-to-point interconnection: distributed shared memory (DSM)
  - Distributed memory:
    - Clusters, Grid

MIMD Physical Organization

[Diagram: processors P ... P share one set of caches in front of memory M]

- Shared-cache architecture:
  - CMP (or Simultaneous Multi-Threading)
  - E.g., Pentium 4 chip, IBM Power4 chip, Sun Niagara, Pentium D, etc.
  - Implies shared-memory hardware

[Diagram: processors P, each with private caches, connected through a network to a single memory M]

- UMA (Uniform Memory Access) Shared Memory:
  - Pentium Pro Quad, Sun Enterprise, etc.
  - What interconnection network?
    - Bus
    - Multistage
    - Crossbar
    - etc.
  - Implies shared-memory hardware

MIMD Physical Organization (2)

[Diagram: processors P, each with private caches and its own local memory M, connected by a network]

- NUMA (Non-Uniform Memory Access) Shared Memory:
  - SGI Origin, Altix, IBM p690, AMD Hammer-based systems
  - What interconnection network?
    - Crossbar
    - Mesh
    - Hypercube
    - etc.
  - Also referred to as Distributed Shared Memory

MIMD Physical Organization (3)

[Diagram: complete nodes (processor, caches, memory M, I/O) connected only by a network]

- Distributed System/Memory:
  - Also called clusters, grid
  - Don't confuse it with distributed shared memory

Scope of CSC/ECE 506

- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - MIMD
    - Shared memory machines (SMP and DSM)
- Programming model:
  - Shared memory
  - Message-passing
  - Hybrid (e.g., UPC)
  - Data parallel

Programming models: shared memory

- Shared Memory / Shared Address Space:
  - Each processor can see the entire memory

[Diagram: processors P ... P all connected to a single shared memory M]

- Programming model = thread programming, as in uniprocessor systems (a minimal example follows)

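A minimal sketch of the shared-memory model (my example, using POSIX threads): every thread sees the same address space, so threads communicate through shared variables and synchronize with a lock for correctness:

    /* Shared-memory programming sketch: threads share one address space.
       Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    static long counter = 0;                          /* visible to all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                /* synchronization */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);           /* prints 400000 */
        return 0;
    }
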
Programming models: message-passing

- Distributed Memory / Message Passing / Multiple Address Spaces:
  - A processor can directly access only its own local memory; all communication happens through explicit messages (a minimal MPI sketch follows)

[Diagram: processor-memory pairs (P, M), each with a private address space, connected only by the network]

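A minimal message-passing sketch using MPI (my choice of library; any message-passing layer would illustrate the same point): each process owns its memory, and data moves only through explicit send/receive calls. Compile with mpicc and run with, e.g., mpirun -np 2:

    /* Message-passing sketch: process 0 sends an integer to process 1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                               /* exists only in process 0's memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", value); /* copied via an explicit message */
        }

        MPI_Finalize();
        return 0;
    }
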
Programming models: data parallel

- Programming model
  - Operations are performed in parallel on each element of a data structure
  - Logically a single thread of control, performing sequential or parallel steps
  - Conceptually, a processor is associated with each data element
- Architectural model
  - Array of many simple, cheap processing elements (PEs), each with little memory
  - Processing elements don't sequence through instructions
  - PEs are attached to a control processor that issues instructions
  - Specialized and general communication, cheap global synchronization
- Original motivation
  - Matches simple differential equation solvers
  - Centralize the high cost of instruction fetch/sequencing

Top 500 supercomputers

- http://www.top500.org
- Let's look at the Earth Simulator
  - Was #1 in 2004, now #10 in 2006
  - Hardware:
    - 5,120 (640 8-way nodes) 500 MHz NEC CPUs
    - 8 GFLOPS per CPU (41 TFLOPS total)
    - Sustained performance in the 30s of TFLOPS!
    - 2 GB (4 x 512 MB FPLRAM modules) per CPU (10 TB total)
    - Shared memory inside each node
    - 10 TB total memory
    - 640 × 640 crossbar switch between the nodes
    - 16 GB/s inter-node bandwidth
    - 20 kVA power consumption per node

Exercise

- Go to http://www.top500.org and look at the Statistics menu near the right-hand side. Click on one of the statistics, e.g., Vendors or Processor Architecture, and examine what kinds of systems are prevalent. Then do the same for earlier lists, and report on the trend. You may find interesting results by clicking on the "Development" tab.
- For example, if you choose "Processor Architecture," the current list, http://www.top500.org/stats/list/34/procarch/, will tell you how many vector and scalar architectures there are. Change the "34" to "32" to get last year's list. Change it to a lower number to get an earlier year's list. You can go all the way back to the first list from 1993.
- Submit your results here.