Supercomputing in Plain English
Part VII: Multicore Madness
Henry Neeman, Director
OU Supercomputing Center for Education & Research
University of Oklahoma
Wednesday October 17 2007
Outline
* The March of Progress
* Multicore/Many-core Basics
* Software Strategies for Multicore/Many-core
* A Concrete Example: Weather Forecasting
The March of Progress
OU’s TeraFLOP Cluster, 2002
10 racks @ 1000 lbs per rack
270 Pentium4 Xeon CPUs,
2.0 GHz, 512 KB L2 cache
270 GB RAM, 400 MHz FSB
8 TB disk
Myrinet2000 Interconnect
100 Mbps Ethernet Interconnect
OS: Red Hat Linux
Peak speed: 1.08 TFLOP/s
(1.08 trillion calculations per second)
One of the first Pentium4 clusters!
boomer.oscer.ou.edu
TeraFLOP, Prototype 2006, Sale 2011
9 years from room to chip!
http://news.com.com/2300-1006_3-6119652.html
Moore's Law
In 1965, Gordon Moore was an engineer at Fairchild Semiconductor.
He noticed that the number of transistors that could be squeezed onto a chip was doubling about every 18 months.
It turns out that computer speed is roughly proportional to the number of transistors per unit area.
Moore wrote a paper about this concept, which became known as "Moore's Law."
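Stated as a formula (an illustration of the claim, not something from the slides): if the transistor count doubles every 18 months, then after t years

\[
N(t) \approx N(0)\cdot 2^{\,t/1.5},
\]

and since speed is taken to be roughly proportional to transistor density, speed grows at about the same exponential rate.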
Moore's Law in Practice
[A sequence of charts plotting log(Speed) against Year, each frame adding another exponentially rising trend line.]
Fastest Supercomputer vs. Moore
[Chart: "Fastest Supercomputer in the World." Speed in GFLOP/s on a log scale (1 to 1,000,000) versus Year (1992-2008), with two curves: the fastest machine from www.top500.org and the Moore's Law trend.]
The Tyranny of the Storage Hierarchy

The Storage Hierarchy
Fast, expensive, few:
* Registers
* Cache memory
* Main memory (RAM)
* Hard disk
* Removable media (e.g., DVD)
* Internet
Slow, cheap, a lot
RAM is Slow
The speed of data transfer between main memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out.
CPU: 351 GB/sec [7]
RAM (the bottleneck): 10.66 GB/sec [9] (3%)
Why Have Cache?
Cache is nearly the same speed as the CPU, so the CPU doesn't have to wait nearly as long for stuff that's already in cache: it can do more operations per second!
CPU: 351 GB/sec [7]
Cache: 253 GB/sec [8] (72%)
RAM: 10.66 GB/sec [9] (3%)
Storage Use Strategies
* Register reuse: Do a lot of work on the same data before working on new data.
* Cache reuse: The program is much more efficient if all of the data and instructions fit in cache; if not, try to use what's in cache a lot before using anything that isn't in cache.
* Data locality: Try to access data that are near each other in memory before data that are far (see the sketch after this list).
* I/O efficiency: Do a bunch of I/O all at once rather than a little bit at a time; don't mix calculations and I/O.
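As a small illustration of the data locality idea (a sketch, not from the slides; the routine name is made up): Fortran stores arrays in column-major order, so making the row index the innermost loop walks through memory with stride 1.

SUBROUTINE sum_matrix_good_locality (a, nr, nc, total)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc
  REAL,DIMENSION(nr,nc),INTENT(IN) :: a
  REAL,INTENT(OUT) :: total
  INTEGER :: r, c
  total = 0.0
  DO c = 1, nc      ! outer loop over columns
    DO r = 1, nr    ! inner loop runs down a column: consecutive memory
      total = total + a(r,c)
    END DO
  END DO
  ! Swapping the two loops (r outer, c inner) would touch memory with a
  ! stride of nr REALs per step, so once the array is bigger than cache
  ! it typically runs noticeably slower.
END SUBROUTINE sum_matrix_good_locality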
A Concrete Example
* OSCER's big cluster, topdawg, has Irwindale CPUs: single core, 3.2 GHz, 800 MHz Front Side Bus.
* The theoretical peak CPU speed is 6.4 GFLOPs (double precision) per CPU, and in practice we've gotten as high as 94% of that.
* So, in theory each CPU could consume 143 GB/sec.
* The theoretical peak RAM bandwidth is 6.4 GB/sec, but in practice we get about half that.
* So, any code that does less than 45 calculations per byte transferred between RAM and cache has speed limited by RAM bandwidth.
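As a rough check of the last two bullets (my arithmetic, using only the numbers quoted above): with delivered RAM bandwidth of about half the 6.4 GB/sec peak, the ratio of the CPU's potential data appetite to what RAM can actually supply is

\[
\frac{143\ \text{GB/sec}}{0.5 \times 6.4\ \text{GB/sec}} = \frac{143}{3.2} \approx 45,
\]

which is where the figure of about 45 calculations per byte comes from: each byte brought in from RAM has to be reused roughly that many times before RAM stops being the limiting factor.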
Good Cache Reuse Example

A Sample Application: Matrix-Matrix Multiply
Let A, B and C be matrices of sizes nr × nc, nr × nk and nk × nc, respectively:

\[
A = \begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,nc} \\
a_{2,1} & a_{2,2} & a_{2,3} & \cdots & a_{2,nc} \\
a_{3,1} & a_{3,2} & a_{3,3} & \cdots & a_{3,nc} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
a_{nr,1} & a_{nr,2} & a_{nr,3} & \cdots & a_{nr,nc}
\end{pmatrix}
\qquad
B = \begin{pmatrix}
b_{1,1} & b_{1,2} & b_{1,3} & \cdots & b_{1,nk} \\
b_{2,1} & b_{2,2} & b_{2,3} & \cdots & b_{2,nk} \\
b_{3,1} & b_{3,2} & b_{3,3} & \cdots & b_{3,nk} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
b_{nr,1} & b_{nr,2} & b_{nr,3} & \cdots & b_{nr,nk}
\end{pmatrix}
\qquad
C = \begin{pmatrix}
c_{1,1} & c_{1,2} & c_{1,3} & \cdots & c_{1,nc} \\
c_{2,1} & c_{2,2} & c_{2,3} & \cdots & c_{2,nc} \\
c_{3,1} & c_{3,2} & c_{3,3} & \cdots & c_{3,nc} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
c_{nk,1} & c_{nk,2} & c_{nk,3} & \cdots & c_{nk,nc}
\end{pmatrix}
\]

The definition of A = B • C is

\[
a_{r,c} \;=\; \sum_{k=1}^{nk} b_{r,k}\, c_{k,c}
\;=\; b_{r,1}\, c_{1,c} + b_{r,2}\, c_{2,c} + b_{r,3}\, c_{3,c} + \cdots + b_{r,nk}\, c_{nk,c}
\]

for r ∈ {1, …, nr}, c ∈ {1, …, nc}.
Matrix Multiply: Naïve Version
SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
 &                                   nr, nc, nq)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER :: r, c, q
  DO c = 1, nc
    DO r = 1, nr
      dst(r,c) = 0.0
      DO q = 1, nq
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_naive
Performance of Matrix Multiply
[Chart: "Matrix-Matrix Multiply." CPU sec (0-800, lower is better) versus total problem size in bytes (nr*nc + nr*nq + nq*nc, 0 to 60,000,000), with an "Init" series shown for comparison.]
Tiling
* Tile: A small rectangular subdomain of a problem domain. Sometimes called a block or a chunk.
* Tiling: Breaking the domain into tiles.
* Tiling strategy: Operate on each tile to completion, then move to the next tile.
* Tile size can be set at runtime, according to what's best for the machine that you're running on.
Tiling Code
SUBROUTINE matrix_matrix_mult_by_tiling (dst, src1, src2, nr, nc, nq, &
 &                                       rtilesize, ctilesize, qtilesize)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER,INTENT(IN) :: rtilesize, ctilesize, qtilesize
  INTEGER :: rstart, rend, cstart, cend, qstart, qend
  DO cstart = 1, nc, ctilesize
    cend = cstart + ctilesize - 1
    IF (cend > nc) cend = nc
    DO rstart = 1, nr, rtilesize
      rend = rstart + rtilesize - 1
      IF (rend > nr) rend = nr
      DO qstart = 1, nq, qtilesize
        qend = qstart + qtilesize - 1
        IF (qend > nq) qend = nq
        CALL matrix_matrix_mult_tile(dst, src1, src2, nr, nc, nq, &
 &                                   rstart, rend, cstart, cend, qstart, qend)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_by_tiling
Multiplying Within a Tile
SUBROUTINE matrix_matrix_mult_tile (dst, src1, src2, nr, nc, nq, &
 &                                  rstart, rend, cstart, cend, qstart, qend)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER,INTENT(IN) :: rstart, rend, cstart, cend, qstart, qend
  INTEGER :: r, c, q
  DO c = cstart, cend
    DO r = rstart, rend
      IF (qstart == 1) dst(r,c) = 0.0
      DO q = qstart, qend
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_tile
Reminder: Naïve Version, Again
SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
 &                                   nr, nc, nq)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER :: r, c, q
  DO c = 1, nc
    DO r = 1, nr
      dst(r,c) = 0.0
      DO q = 1, nq
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_naive
Performance with Tiling
[Charts: "Matrix-Matrix Multiply Via Tiling" (linear and log-log). CPU sec (lower is better) versus tile size in bytes (from about 100,000,000 down to 10), with one curve per problem size: 512x256, 512x512, 1024x512, 1024x1024, 2048x1024.]
The Advantages of Tiling
* It allows your code to exploit data locality better, to get much more cache reuse: your code runs faster!
* It's a relatively modest amount of extra coding (typically a few wrapper functions and some changes to loop bounds).
* If you don't need tiling – because of the hardware, the compiler or the problem size – then you can turn it off by simply setting the tile size equal to the problem size (as sketched below).
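For instance, here is a minimal driver (a sketch, not from the slides; the program name and the tile size of 208 are made up for illustration, and it assumes the two subroutines above are compiled alongside it) showing both a tiled call and the "tiling turned off" call where the tile sizes equal the problem sizes:

PROGRAM tiling_driver
  IMPLICIT NONE
  INTEGER,PARAMETER :: nr = 512, nc = 512, nq = 512
  REAL,DIMENSION(nr,nc) :: dst
  REAL,DIMENSION(nr,nq) :: src1
  REAL,DIMENSION(nq,nc) :: src2
  CALL RANDOM_NUMBER(src1)     ! fill the inputs with something
  CALL RANDOM_NUMBER(src2)
  ! Tiled multiply with an illustrative (machine-dependent) tile size.
  CALL matrix_matrix_mult_by_tiling(dst, src1, src2, nr, nc, nq, &
 &                                  208, 208, 208)
  ! Tiling effectively turned off: each tile is the whole problem,
  ! so this degenerates to the naive triple loop.
  CALL matrix_matrix_mult_by_tiling(dst, src1, src2, nr, nc, nq, &
 &                                  nr, nc, nq)
  PRINT *, 'dst(1,1) = ', dst(1,1)
END PROGRAM tiling_driver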
Why Does Tiling Work Here?
Cache optimization works best when the number of calculations per byte is large.
For example, with matrix-matrix multiply on an n × n matrix, there are O(n³) calculations (on the order of n³), but only O(n²) bytes of data.
So, for large n, there are a huge number of calculations per byte transferred between RAM and cache.
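Concretely (a back-of-the-envelope count, not from the slides, assuming 4-byte REALs): the triple loop does one multiply and one add per innermost iteration, so

\[
\frac{\text{calculations}}{\text{bytes}} \approx \frac{2n^3}{3 \times 4n^2} = \frac{n}{6},
\]

which grows linearly with n, so for large n the calculations-per-byte ratio becomes as large as you like.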
Multicore/Many-core Basics

What is Multicore?
* In the olden days (i.e., the first half of 2005), each CPU chip had one "brain" in it.
* More recently, each CPU chip has 2 cores (brains), and, starting in late 2006, 4 cores.
* Jargon: Each CPU chip plugs into a socket, so these days, to avoid confusion, people refer to sockets and cores, rather than CPUs or processors.
* Each core is just like a full blown CPU, except that it shares its socket with one or more other cores – and therefore shares its bandwidth to RAM.
Dual Core
[Diagram: one socket containing 2 cores.]

Quad Core
[Diagram: one socket containing 4 cores.]

Oct Core
[Diagram: one socket containing 8 cores.]
The Challenge of Multicore: RAM
* Each socket has access to a certain amount of RAM, at a fixed RAM bandwidth per SOCKET.
* As the number of cores per socket increases, the contention for RAM bandwidth increases too (see the arithmetic sketched after this list).
* At 2 cores in a socket, this problem isn't too bad. But at 16 or 32 or 80 cores, it's a huge problem.
* So, applications that are cache optimized will get big speedups.
* But, applications whose performance is limited by RAM bandwidth are going to speed up only as fast as RAM bandwidth speeds up.
* RAM bandwidth speeds up much slower than CPU speeds up.
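To make the contention concrete (illustrative arithmetic only, reusing the 10.66 GB/sec RAM bandwidth figure quoted earlier and assuming the socket's bandwidth is split evenly among cores that are all streaming from RAM):

\[
\frac{10.66\ \text{GB/sec}}{2\ \text{cores}} \approx 5.3\ \text{GB/sec per core},
\qquad
\frac{10.66\ \text{GB/sec}}{16\ \text{cores}} \approx 0.67\ \text{GB/sec per core},
\]

so every doubling of the core count halves what a bandwidth-bound core can stream from RAM.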
The Challenge of Multicore: Network
* Each node has access to a certain number of network ports, at a fixed number of network ports per NODE.
* As the number of cores per node increases, the contention for network ports increases too.
* At 2 cores in a socket, this problem isn't too bad. But at 16 or 32 or 80 cores, it's a huge problem.
* So, applications that do minimal communication will get big speedups.
* But, applications whose performance is limited by the number of MPI messages are going to speed up very very little – and may even crash the node.
Multicore/Many-core Problem
* Most multicore chip families have relatively small cache per core (e.g., 2 MB) – and this problem seems likely to remain.
* Small TLBs make the problem worse: 512 KB per core rather than 2 MB.
* So, to get good cache reuse, you need to partition the algorithm so that each subproblem needs no more than 512 KB (a worked example follows).
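For the tiled matrix multiply above, for instance (my arithmetic, assuming 4-byte REALs and square t × t tiles): a subproblem touches one tile of each of the three arrays, so it fits in 512 KB when

\[
3 \times 4\ \text{bytes} \times t^2 \le 512\ \text{KB} = 524288\ \text{bytes}
\quad\Longrightarrow\quad
t \le \sqrt{524288 / 12} \approx 209,
\]

i.e. tiles of roughly 200 × 200 REALs or smaller.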
The T.L.B. on a Current Chip
On Intel Core Duo ("Yonah"):
* Cache size is 2 MB per core.
* Page size is 4 KB.
* A core's data TLB size is 128 page table entries.
* Therefore, D-TLB only covers 512 KB of cache.
* The cost of a TLB miss is 49 cycles, equivalent to as many as 196 calculations! (4 FLOPs per cycle)
http://www.digit-life.com/articles2/cpu/rmma-via-c7.html
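Both figures on this slide follow from simple arithmetic on the quoted numbers:

\[
128\ \text{TLB entries} \times 4\ \text{KB per page} = 512\ \text{KB covered},
\qquad
49\ \text{cycles} \times 4\ \text{FLOPs per cycle} = 196\ \text{calculations forgone per miss}.
\]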
What Do We Need?
* We need much bigger caches!
* TLB must be big enough to cover the entire cache.
* It'd be nice to have RAM speed increase as fast as core counts increase, but let's not kid ourselves.
To Learn More Supercomputing
http://www.oscer.ou.edu/education.php
Computer Architecture, Fall 2007
Lecture 29: CMPs & SMTs
Adapted from Mary Jane Irwin (www.cse.psu.edu/~mji)
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]
Multithreading on a Chip
* Find a way to "hide" true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions.
* Multithreading – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor.
    - The processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer for each thread.
    - The caches, TLBs, BHT, BTB can be shared (although the miss rates may increase if they are not sized accordingly).
    - The memory can be shared through virtual memory mechanisms.
    - Hardware must support efficient thread context switching.
Types of Multithreading on a Chip
* Fine-grain – switch threads on every instruction issue
    - Round-robin thread interleaving (skipping stalled threads)
    - Processor must be able to switch threads on every clock cycle
    - Advantage – can hide throughput losses that come from both short and long stalls
    - Disadvantage – slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads
* Coarse-grain – switch threads only on costly stalls (e.g., L2 cache misses)
    - Advantages – thread switching doesn't have to be essentially free, and it is much less likely to slow down the execution of an individual thread
    - Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss; the pipeline must be flushed and refilled on thread switches
Multithreaded Example: Sun's Niagara (UltraSparc T1)
* Eight fine-grain multithreaded, single-issue, in-order cores (no speculation, no dynamic branch prediction)

                    Ultra III              Niagara
Data width          64-b                   64-b
Clock rate          1.2 GHz                1.0 GHz
Cache (I/D/L2)      32K/64K/(8M external)  16K/8K/3M
Issue rate          4 issue                1 issue
Pipe stages         14 stages              6 stages
BHT entries         16K x 2-b              None
TLB entries         128I/512D              64I/64D
Memory BW           2.4 GB/s               ~20 GB/s
Transistors         29 million             200 million
Power (max)         53 W                   <60 W

[Diagram: eight 4-way MT SPARC pipes connected through a crossbar to a 4-way banked L2$, the memory controllers, shared functional units, and I/O.]
Niagara Integer Pipeline
* Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient.
[Diagram: the six-stage pipeline – Fetch (I$, ITLB, PC logic x4), Thread Select (instruction buffers x4, thread select logic and muxes, driven by instruction type, cache misses, traps & interrupts, and resource conflicts), Decode (register file x4), Execute (ALU, Mul, Shft, Div), Memory (D$, DTLB, store buffers x4), and Writeback, with a crossbar interface.]
From MPR, Vol. 18, #9, Sept. 2004
Simultaneous Multithreading (SMT)
* A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP)
    - Most SS processors have more machine level parallelism than most programs can effectively use (i.e., than have ILP)
    - With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
        - Need separate rename tables (ROBs) for each thread
        - Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle
* Intel's Pentium 4 SMT is called hyperthreading
    - Supports just two threads (doubles the architecture state)
Threading on a 4-way SS Processor Example
[Diagram: issue slots over time for coarse MT, fine MT, and SMT on a 4-issue superscalar, with the slots filled by instructions from threads A, B, C and D.]
Multicore Xbox360 – "Xenon" processor
* To provide game developers with a balanced and powerful platform
    - Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache
    - 165M transistors total
    - 3.2 GHz near-POWER ISA
    - 2-issue, 21 stage pipeline, with 128 128-bit registers
    - Weak branch prediction – supported by software hinting
    - In-order instructions
    - Narrow cores – 2 INT units, 2 128-bit VMX units, 1 of anything else
* An ATI-designed 500 MHz GPU w/ 512MB of DDR3 DRAM
    - 337M transistors, 10MB framebuffer
    - 48 pixel shader cores, each with 4 ALUs
Xenon Diagram
[Block diagram: Cores 0-2, each with L1D and L1I, share a 1MB UL2 cache and an XMA decoder; two memory controllers (MC0, MC1) connect to 512MB of DRAM; a BIU/IO interface links to the SMC and to the GPU (3D core plus 10MB EDRAM), which drives video out through an analog chip; peripherals include DVD, HDD port, front USBs (2), rear USB (1), wireless, MU ports (2 USBs), Ethernet, IR, audio out, flash, and systems control.]
The PS3 "Cell" Processor Architecture
* Composed of a non-SMP architecture
    - 234M transistors @ 4 GHz
    - 1 Power Processing Element, 8 "Synergistic" (SIMD) PEs
    - 512KB L2 $ – a massively high bandwidth (200GB/s) bus connects it to everything else
    - The PPE is strangely similar to one of the Xenon cores
        - Almost identical, really. Slight ISA differences, and fine-grained MT instead of real SMT
    - The real differences lie in the SPEs (21M transistors each)
        - An attempt to 'fix' the memory latency problem by giving each processor complete control over its own 256KB "scratchpad" – 14M transistors
            - Direct mapped for low latency
        - 4 vector units per SPE, 1 of everything else – 7M transistors
The PS3 "Cell" Processor Architecture
[Figure]

How to make use of the SPEs
[Figure]
What about the Software?
* Makes use of a special IBM "Hypervisor"
    - Like an OS for OSs
    - Runs both a real-time OS (for sound) and a non-real-time OS (for things like AI)
* Software must be specially coded to run well
    - The single PPE will be quickly bogged down
    - Must make use of SPEs wherever possible
    - This isn't easy, by any standard
* What about Microsoft?
    - Development suite identifies which 6 threads you're expected to run
    - Four of them are DirectX based, and handled by the OS
    - Only need to write two threads, functionally
http://ps3forums.com/showthread.php?t=22858
Next Lecture and Reminders
* Reminders
    - Final is Thursday, December 13 from 10-11:50 AM in ITT 322