Transcript Slide 1

Embedded Computer Architecture
5KK73
MPSoC Platforms
Part 2: Cell
Bart Mesman and Henk Corporaal
The Complexity Crisis
I have always wished that my computer
would be as easy to use as my telephone.
My wish has come true. I no longer know
how to use my telephone.
--Bjarne Stroustrup
7/21/2015
2
The Software Crisis
3
The first SW crisis
Time Frame: ’60s and ’70s
• Problem: Assembly Language Programming
– Computers could handle larger more complex programs
• Needed to get Abstraction and Portability without losing Performance
• Solution:
– High-level languages for von-Neumann machines: FORTRAN and C
4
The second SW crisis
Time Frame: ’80s and ’90s
• Problem: Inability to build and maintain complex
and robust applications requiring multi-million
lines of code developed by hundreds of
programmers
– Computers could handle larger more complex
programs
• Needed to get Composability and Maintainability
– High-performance was not an issue: left for Moore’s
Law
5
Solution
• Object Oriented Programming
– C++, C# and Java
• Also…
– Better tools
• Component libraries, Purify
– Better software engineering methodology
• Design patterns, specification, testing, code
reviews
6
Today: Programmers are Oblivious
to Processors
• Solid boundary between Hardware and Software
• Programmers don’t have to know anything about the
processor
– High level languages abstract away the processors
• Ex: Java bytecode is machine independent
– Moore’s law does not require the programmers to know anything
about the processors to get good speedups
• Programs are oblivious of the processor -> work on all
processors
– A program written in the ’70s using C still works and is much faster
today
• This abstraction provides a lot of freedom for the
programmers
7
The third crisis: Powered by
PlayStation
8
Contents
• Hammer your head against 4 walls
– Or: Why Multi-Processor
• Cell Architecture
• Programming and porting
– plus case-study
9
Moore’s Law
10
Single Processor
SPECint Performance
11
What’s stopping them?
• General-purpose uni-cores have stopped
historic performance scaling
– Power consumption
– Wire delays
– DRAM access latency
– Diminishing returns of more instruction-level
parallelism
12
Power density
13
Power Efficiency (Watts/Spec)
14
1 clock cycle wire range
15
Global wiring delay becomes dominant over gate delay
[Figure: Gate delay vs. wire delay. Wire delay (ps/mm) and gate delay (ps) plotted against technology (micron), from 0.5 down to 0.1; vertical axis in ps, 0 to 400.]
16
Memory
[Figure: Processor-Memory Performance Gap [Patterson]. Performance (log scale, 1 to 1000) vs. Time (1980 to 2005). µProc (CPU, “Moore’s Law”): 55%/year; DRAM: 7%/year; the gap grows 50% / year.]
17
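The growth rates on the preceding slide compound year over year. As a rough sketch (the helper name `relative_gap` is made up for illustration, and the inputs are simply the 55%/year and 7%/year figures read off the plot), the ratio between processor and DRAM performance after a number of years could be computed like this:

```c
#include <assert.h>

/* Hypothetical helper: CPU/DRAM performance ratio after `years`,
 * with both curves normalized to 1.0 in year 0. */
double relative_gap(double cpu_rate, double dram_rate, int years)
{
    assert(years >= 0);
    double cpu = 1.0, dram = 1.0;
    for (int y = 0; y < years; y++) {
        cpu  *= 1.0 + cpu_rate;   /* e.g. 0.55 for 55%/year */
        dram *= 1.0 + dram_rate;  /* e.g. 0.07 for 7%/year  */
    }
    return cpu / dram;
}
```

With the plotted rates, `relative_gap(0.55, 0.07, 1)` is about 1.45, so the gap widens by a little under 50% each year, which matches the rounded figure on the slide.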
Now what?
• Latest research drained
• Tried every trick in the book
So: We’re fresh out of ideas
Multi-processor is all that’s left!
18
Low power through parallelism
• Sequential Processor
– Switching capacitance C
– Frequency f
– Voltage V
– P = f·C·V^2
• Parallel Processor (two times the number of units)
– Switching capacitance 2C
– Frequency f/2
– Voltage V’ < V
– P = (f/2)·2C·V’^2 = f·C·V’^2
19
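The power argument on the preceding slide can be checked numerically. The sketch below is only an illustration: the function names are invented here, and the 0.7 voltage scaling in the usage note is an assumed example value, not a number from the slides.

```c
#include <assert.h>

/* Dynamic power P = f * C * V^2, in whatever consistent units the caller uses. */
double dynamic_power(double f, double c, double v)
{
    assert(f >= 0.0 && c >= 0.0);
    return f * c * v * v;
}

/* Power of the two-unit parallel design relative to the sequential one:
 * ((f/2) * 2C * V'^2) / (f * C * V^2) = (V'/V)^2. */
double parallel_power_ratio(double f, double c, double v, double v_reduced)
{
    assert(f > 0.0 && c > 0.0 && v > 0.0);
    return dynamic_power(f / 2.0, 2.0 * c, v_reduced) / dynamic_power(f, c, v);
}
```

For example, if halving the frequency allowed the supply voltage to drop to 0.7 V of an original 1.0 V (an assumption for illustration), `parallel_power_ratio(1.0e9, 1.0e-9, 1.0, 0.7)` gives roughly 0.49: the same throughput at about half the power.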
Architecture methods
Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
Vector instruction: c = a + 5*b
for (i = 0; i < 64; i++)
    c[i] = a[i] + 5*b[i];
Assembly:
set    vl,64
ldv    v1,0(r2)
mulvi  v2,v1,5
ldv    v1,0(r1)
addv   v3,v1,v2
stv    v3,0(r3)
20
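To make the mapping from the loop to the vector sequence on the preceding slide concrete, here is a minimal C emulation. It assumes that r1, r2 and r3 hold the addresses of a, b and c (inferred from the computation, not stated on the slide), and the function names are hypothetical. Each inner loop stands in for a single vector instruction.

```c
#include <assert.h>
#include <stddef.h>

#define VL 64  /* maximum vector length, as in "set vl,64" */

/* Scalar reference: one multiply and one add per loop iteration. */
void axpb_scalar(const int *a, const int *b, int *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + 5 * b[i];
}

/* Emulation of the vector sequence, strip-mined in chunks of VL. */
void axpb_vector_style(const int *a, const int *b, int *c, size_t n)
{
    assert(a && b && c);
    int v1[VL], v2[VL], v3[VL];  /* stand-ins for vector registers */
    for (size_t base = 0; base < n; base += VL) {
        size_t vl = (n - base < VL) ? (n - base) : VL;          /* set   vl       */
        for (size_t i = 0; i < vl; i++) v1[i] = b[base + i];    /* ldv   v1,0(r2) */
        for (size_t i = 0; i < vl; i++) v2[i] = v1[i] * 5;      /* mulvi v2,v1,5  */
        for (size_t i = 0; i < vl; i++) v1[i] = a[base + i];    /* ldv   v1,0(r1) */
        for (size_t i = 0; i < vl; i++) v3[i] = v1[i] + v2[i];  /* addv  v3,v1,v2 */
        for (size_t i = 0; i < vl; i++) c[base + i] = v3[i];    /* stv   v3,0(r3) */
    }
}
```

For 64 elements the scalar version executes 64 loop iterations, while the vector form issues six instructions in total; the emulation only demonstrates that both produce the same result.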
Architecture methods
Powerful Instructions (1)
• Sub-word parallelism
– SIMD on restricted scale:
– Used for Multi-media instructions
– Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs
• Examples
– MMX, SUN-VIS, HP MAX-2, AMD K7/Athlon 3DNow!, Trimedia II
– Example: Σ(i=1..4) |a_i − b_i|
21
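The sum-of-absolute-differences example on the preceding slide can be emulated in portable C. This is a scalar sketch of what a single sub-word instruction would compute on a 64-bit word split into four 16-bit lanes; the function names are invented here, and the lanes are assumed to be unsigned.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 16-bit lanes into one 64-bit word, lane 0 in the low bits. */
uint64_t pack4x16(uint16_t l0, uint16_t l1, uint16_t l2, uint16_t l3)
{
    return (uint64_t)l0 | ((uint64_t)l1 << 16) |
           ((uint64_t)l2 << 32) | ((uint64_t)l3 << 48);
}

/* Sum of |a_i - b_i| over the first `lanes` 16-bit sub-words (1..4). */
uint32_t sad_subwords(uint64_t a, uint64_t b, int lanes)
{
    assert(lanes >= 1 && lanes <= 4);
    uint32_t sum = 0;
    for (int i = 0; i < lanes; i++) {
        uint32_t ai = (uint32_t)((a >> (16 * i)) & 0xFFFFu);
        uint32_t bi = (uint32_t)((b >> (16 * i)) & 0xFFFFu);
        sum += (ai > bi) ? (ai - bi) : (bi - ai);
    }
    return sum;
}
```

On hardware with multimedia extensions the four lane operations happen in one instruction; the loop here only spells out what each lane contributes.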
MPSoC Issues
• Homogeneous vs Heterogeneous
• Shared memory vs local memory
• Topology
• Communication (Bus vs. Network)
• Granularity (many small vs few large)
• Mapping
– Automatic vs manual parallelization
– TLP vs DLP
– Parallel vs Pipelined
22
Multi-core
23
Cell
24
What can it do?
25
Cell/B.E. - the history
• Sony/Toshiba/IBM consortium
– Austin, TX – March 2001
– Initial investment: $400,000,000
• Official name: STI Cell Broadband
Engine
– Also goes by Cell BE, STI Cell, Cell
• In production for:
– PlayStation 3 from Sony
– Mercury’s blades
26
Cell blade
27
Cell/B.E. – the architecture
• 1 x PPE 64-bit PowerPC
– L1: 32 KB I$ + 32 KB D$
– L2: 512 KB
• 8 x SPE cores:
– Local store: 256 KB
– 128 x 128 bit vector registers
• Hybrid memory model:
– PPE: Rd/Wr
– SPEs: Asynchronous DMA
• EIB: 205 GB/s sustained aggregate bandwidth
• Processor-to-memory bandwidth: 25.6 GB/s
• Processor-to-processor: 20 GB/s in each direction
28
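The bandwidth figures on the preceding slide give a feel for how long data movement takes. As a back-of-the-envelope sketch (the helper name is hypothetical, 1 GB/s is taken as 1e9 bytes/s, and latency and bus contention are ignored), the minimum time to move a block is:

```c
#include <assert.h>

/* Lower bound in microseconds for moving `bytes` at `gb_per_s`,
 * assuming 1 GB/s = 1e9 bytes/s and no latency or contention. */
double min_transfer_us(double bytes, double gb_per_s)
{
    assert(bytes >= 0.0 && gb_per_s > 0.0);
    return bytes / (gb_per_s * 1.0e9) * 1.0e6;
}
```

Filling one 256 KB local store from main memory at 25.6 GB/s, `min_transfer_us(256.0 * 1024.0, 25.6)`, comes to about 10.24 microseconds under these assumptions.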
Cell chip
29
SPE
30
SPE
31
SPE pipeline
32
Communication
33
8 parallel transactions
34
C++ on Cell
1. Send the code of the function to be run on SPE
2. Send address to fetch the data
3. DMA data in LS from the main memory
4. Run the code on the SPE
5. DMA data out of LS to the main memory
6. Signal the PPE that the SPE has finished the function
35
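The six steps on the preceding slide can be walked through with a host-only simulation. This is not Cell SDK code: `fake_spe`, `spe_kernel`, `offload` and `scale_by_two` are all hypothetical names, `memcpy` stands in for the DMA transfers, and a function pointer stands in for shipping code to the SPE. It is only meant to show the order of operations.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LS_FLOATS (256 * 1024 / sizeof(float))  /* 256 KB local store */

typedef void (*spe_kernel)(float *ls_data, size_t n);  /* hypothetical */

typedef struct {
    float      ls[LS_FLOATS];  /* stand-in for the SPE local store   */
    spe_kernel code;           /* stand-in for the uploaded function */
    float     *ea;             /* main-memory address of the data    */
    int        done;           /* completion flag read by the "PPE"  */
} fake_spe;

/* Example kernel: works only on data already in the local store. */
void scale_by_two(float *ls_data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        ls_data[i] *= 2.0f;
}

void offload(fake_spe *spe, spe_kernel fn, float *main_mem, size_t n)
{
    assert(spe && fn && main_mem && n <= LS_FLOATS);
    spe->done = 0;
    spe->code = fn;                                 /* 1. send the code      */
    spe->ea   = main_mem;                           /* 2. send the address   */
    memcpy(spe->ls, spe->ea, n * sizeof(float));    /* 3. "DMA" data into LS */
    spe->code(spe->ls, n);                          /* 4. run on the "SPE"   */
    memcpy(spe->ea, spe->ls, n * sizeof(float));    /* 5. "DMA" data out     */
    spe->done = 1;                                  /* 6. signal the "PPE"   */
}
```

On real hardware the local store has to hold the code as well as the data, and the DMA transfers are asynchronous, so steps 3 to 5 are usually overlapped across several buffers rather than run strictly in sequence as in this sketch.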
Conclusions
• Multi-processors inevitable
• Huge performance increase, but…
• Hell to program
– Got to be an architecture expert
– Portability?
36