OVERVIEW OF THE ARCHITECTURE, CIRCUIT
DESIGN, AND
PHYSICAL IMPLEMENTATION OF A
FIRST-GENERATION CELL PROCESSOR
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1,
JANUARY 2006
First Consumer Product

PlayStation 3!
Introduction

Developed through a partnership of
 SONY Computer Entertainment.
 Toshiba.
 IBM.

Aim
 Highly tuned for media processing.
 Expected demands for complex and larger data handling.
What is Cell?



Cell is an architecture for high-performance distributed computing.
It comprises hardware and software cells.
It admits a wide range of single- or multiple-processor and memory configurations.
“Supercomputer” in daily life






Parallelism at high frequency.
Real-time response.
Supports multiple operating systems.
10 simultaneous threads.
128 outstanding memory requests.
Optimally addresses many different system and application requirements.
Architecture Overview








8 SPEs, each with Local Storage (LS).
PPE with its L2 cache.
Element Interconnect Bus (EIB).
Memory Interface Controller (MIC).
Bus Interface Controller (BIC).
Power Management Unit (PMU).
Thermal Management Unit (TMU).
Pervasive Unit.
High Level Diagram
Die Photograph
Synergistic Processing Elements (SPE)
(1/2)


SPEs share system memory with the PPE through DMA.
Data and instructions live in a private real address space backed by a 256-KB LS.
According to IBM, a single SPE can perform as well as a top-end (single-core) desktop CPU given the right task.
Synergistic Processing Elements (SPE)
(2/2)



SPEs access main storage by issuing DMA commands to the associated MFC block (asynchronous transfer).
Fully pipelined, 128-bit-wide, dual-issue SIMD.
SPEs in a Cell can be chained together to act as a stream processor.
Power Processor Element (PPE) (1/2)


32-kB instruction and data caches.
64-bit Power Architecture core with a 512-kB L2 cache.
Power Processor Element (PPE) (2/2)

Through MMIO control registers, the PPE can initiate DMA transfers for the SPEs.

Hypervisor extension.

Moderate pipeline length.
Element Interconnect Bus (EIB)

Can transfer up to 96 bytes per cycle.

Four 16-byte-wide rings
 Two rings running clockwise.
 Two rings running counterclockwise.

Separate address and command network.

12 on/off ramps.
Memory Interface Controller (MIC)

Two 36-bit-wide XDR memory banks.

Can also operate with just a single bank.

Speed-matching SRAM and two clocks.
Power Reduction

Power Management Unit.

The PMU gives software controls to reduce chip power.

The OS can throttle, pause, or stop single or multiple units.
Thermal Monitoring



Thermal sensors and the Thermal Management Unit (TMU).
One sensor sits at a relatively constant-temperature location, for external cooling.
10 digital thermal sensors (DTS) at various critical locations.
Optimum Point (1/3)

Triple constraint: power, performance, area.

Gate oxide thickness
 Thinner oxide:
  Higher performance.
  Higher gate tunneling too.
  Reliability concerns.
Optimum Point (2/3)

Channel length
 Shorter channel:
  Improved performance.
  Increased leakage current too.

Supply voltage
 Higher voltage:
  Improved performance.
  Higher AC/DC power.
Optimum Point (3/3)

Wire levels
 Few levels: increased chip area.
 Many levels: more cost.
Final Technology Parameters
Chip Integration

241M transistors.

8912 discrete floorplanned blocks.

Custom-tailored nets.

20 separate power domains.
POWER-CONSCIOUS DESIGN
OF
THE CELL PROCESSOR’S
SPE
Osamu Takahashi, IBM Systems and Technology Group
Scott Cottier, Sang H. Dhong, Brian Flachs, Joel Silberman, IBM T.J. Watson Research Center
The CELL Processor - Properties



Mostly CMOS static gates.
Dynamic gates used for time-critical paths.
Tight coupling of
 ISA,
 microarchitecture, and
 physical implementation
achieves a compact and power-efficient design.
APPLICATIONS

To name a few (the list goes on):
 Image processing for high-definition TV.
 Image processing for medical use.
 High-performance computing.
 Gaming.

Flexible enough to serve as a general-purpose microprocessor that supports HLL programming.
Cell processor - Architecture


64-bit Power core.
Eight Synergistic Processor Elements (SPEs).

L2 cache.

Interconnection bus.

I/O controller.

Rambus FlexIO.
Architecture contd.

The SPE has two clock domains:
 one with an 11-FO4 cycle time.
 the other with a 22-FO4 cycle time.
The high-frequency domain is implemented using custom design.
The SPE contains
 256 KB of dedicated local store memory.
 A 128-bit, 128-entry general-purpose register file with six read ports and two write ports.
SPE



The SMF operates at half the SPE's frequency.
The SPE runs at up to 5.6 GHz at a 1.4-V supply and 56 °C.
The SPE's measured power consumption is in the range of 1 W to 11 W, depending on
 Operating clock frequency.
 Temperature.
 Workload.
Triple design constraints




Cell contains eight copies of the SPE.
Optimization of the SPE's power and area is critical to the overall chip design.
A conscious effort was made to reduce SPE area and power while meeting the 11-FO4 cycle-time performance objective.
The design was optimized to balance the three constraints of
 Power.
 Area.
 Performance.
with tradeoffs to achieve the overall best results.
Some techniques used





Latch selection.
Fine-grained clock gating.
Multiclock-domain design.
Use of dual-threshold voltages.
Selective use of dynamic circuits.
Latch selection



Logic gets 8-9 FO4 of the cycle time.
The rest of the time is consumed by latches.
Several latches with various insertion delays are used.
Transmission Gate Latch


The SPE's main workhorse latch.
Comes in two varieties:
 Scannable.
 Non-scannable.
Each has several power levels.
Used almost throughout the SPE.
Pulsed Clock Latch





Non-scannable.
Small insertion delay.
Small area.
Relatively low power consumption.
Used in most timing- and power-critical areas.
Dynamic multiplexer latch




Scannable.
Multiplexing widths from 4 to 10.
Small insertion delay.
Used in
 Time-critical areas.
 Areas requiring multiplexing.

Typical use: dataflow operand latches.
Dynamic PLA Latch





A scannable latch.
Used to generate control signals (clock-gating signals).
The last two latch types use slightly higher power to complete complex tasks in critical time: an example of a tradeoff among the triple constraints.
Fine-grained clock gating


An effective method of reducing power, used extensively in the Cell.
Uses a local clock buffer (LCB):
 Supplies the clock to a bank of latches.
 When the enable signal fires, the LCB buffers the global clock and sends it to the bank of latches.
 The SPE activates only the necessary pipeline stages.
 Registers are normally turned off.
 Functional blocks were simulated and verified.
 This design process yields a 50% active-power reduction.
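As a conceptual illustration (a toy model, not hardware description language), the gating decision can be sketched in a few lines of Python: a latch bank receives clock edges, and therefore spends clock power, only while its enable is asserted. The `LatchBank` class and the edge counter used as a power proxy are inventions for this sketch.

```python
class LatchBank:
    """Toy model of a latch bank fed by a local clock buffer (LCB)."""
    def __init__(self, width):
        self.state = [0] * width
        self.clock_edges = 0               # proxy for clock power spent

    def tick(self, enable, new_state):
        if enable:                         # LCB forwards the global clock
            self.clock_edges += 1
            self.state = list(new_state)
        # else: clock gated off -> state held, no clock power spent

bank = LatchBank(4)
for cycle in range(10):
    # pretend this pipeline stage has useful work only every 4th cycle
    bank.tick(enable=(cycle % 4 == 0), new_state=[cycle] * 4)

print(bank.clock_edges)  # → 3  (only cycles 0, 4 and 8 clocked the bank)
```

With the enable off, the bank holds its state and no edges are counted; that held-state, no-power behavior is what the fine-grained gating buys.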
Multiple clock frequency domains


High frequency increases performance, but it carries penalties:
 Higher clock power.
 A higher percentage of the cycle spent on clock insertion delay.
 A shorter distance that a signal can travel per cycle.

The SPE has some units whose performance does not depend solely on frequency.
The SMF operates at half the frequency.
Multiple clock frequency domains

11-FO4 blocks
 Register file.
 Fixed-point unit.
 Floating-point unit.
 Data forwarding.
 Load/store.

22-FO4 blocks
 Direct memory access unit.
 Bus control.

One clock is distributed to both domains; the SMF is activated every second clock cycle.
Multiple clock frequency domains


Avoids physical implementation difficulties.
Helps escape
 Latch insertion delay.
 Travel-distance penalties.

Advantages
 A large percentage of the cycle is dedicated to logic.
 Most SMF paths become non-critical.
 Smaller transistors can be used.

The SMF is optimized for both area and power without sacrificing performance.
Dual-threshold-voltage devices






Leakage is a significant portion of power consumption in deep-submicron technology.
It cannot be reduced by clock gating or the two clock domains.
Use high-threshold-voltage transistors instead.
Penalty: slower switching time.
Used in paths with enough timing slack.
SMF paths made non-critical by the two clock domains were converted to these devices.
Selective use of dynamic circuits


Advantages of static circuits over dynamic:
 Design ease.
 Low switching factor.
 Tool compatibility.
 Technology independence.
Advantages of dynamic circuits over their static counterparts:
 Faster speed due to low capacitance at dynamic nodes.
 Larger gains because of the inverters after the logic.
 Microarchitectural efficiency (fewer stages).
 Smaller area.
Selective use of dynamic circuits




Dynamic logic requires a clock, hence higher power consumption.
It requires both true and complementary signals.
Static implementations tend to hit the speed wall earlier.
Approach for design:
 Implement logic circuits in static CMOS as much as possible.
 Use alternatives where static did not meet the speed requirements.
Selective use of dynamic circuits



Dynamic circuits have static interfaces.
They make up 19 percent of the non-SRAM area.
They include the following macros:
 Dataflow forwarding.
 Multiport register file.
 Floating-point unit.
 Dynamic PLA.
 Multiplexer latch.
 Instruction line buffer.
SPE hardware measurements








Tested with a complicated 3D picture-rendering workload.
The fastest operation ran at 5.6 GHz with a 1.4 V supply at 56 °C.
The global clock mesh's measured power is 1.3 W per SPE at a 1.2 V supply and a 2.0-GHz clock frequency.
The Cell architecture is compatible with the 64-bit Power Architecture, so applications can build on existing Power investments.
It can be considered a non-homogeneous coherent chip multiprocessor.
The high design frequency has been achieved through a highly optimized implementation.
Its streaming DMA architecture helps enhance the memory effectiveness of the processor.
Refer to the shmoo plot for the power analysis.
SPE shmoo plot
Applications of the CELL Processor
and Its Potential for Scientific Computing
THE POWER!
FOLDING@HOME broke the Guinness world record for the "world's most powerful distributed network," with computing power of > 1 PF (a thousand trillion floating-point operations per second).
Blue Gene is 500 TF.

WHY THE POWER?
Cell combines the considerable floating-point resources required for demanding numerical algorithms with a power-efficient, software-controlled memory hierarchy.
It contains a powerful 64-bit dual-threaded IBM PowerPC core and eight proprietary 'Synergistic Processing Elements' (SPEs): eight more highly specialized mini-computers on the same die.
Cell's peak double-precision performance is very impressive relative to its commodity peers (14.6 Gflop/s at 3.2 GHz).
OVERVIEW
Quantitative performance comparison of the Cell to the AMD Opteron (superscalar), Intel Itanium 2 (VLIW), and Cray X1E (vector).
Minor architectural changes (Cell+) to improve DP performance.
Complexity of mapping scientific algorithms onto the Cell.
A few interesting applications.

ARCHITECTURE
Each SPE contains four SP 6-cycle pipelined FMA (fused multiply-add) datapaths and one DP 9-cycle pipelined FMA datapath, plus 4 cycles for data movement.
7-cycle in-order execution pipeline and forwarding network.
A 6-cycle stall is inserted after each DP instruction, so one DP instruction issues every 7 cycles.
DP performance is therefore 1/14 of peak SP performance.
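The 1/14 ratio and the headline Gflop/s figures follow directly from these issue rules; a quick sanity check in Python, assuming the shipped part's 3.2 GHz clock (a number from the Cell literature, not this slide):

```python
CLOCK_GHZ = 3.2   # shipped Cell clock frequency (assumption for the sketch)
N_SPES = 8

# SP: four 32-bit lanes, one FMA (2 flops per lane) can issue every cycle.
sp_peak = N_SPES * 4 * 2 * CLOCK_GHZ            # Gflop/s

# DP: one 2-wide DP FMA (4 flops) issues only once every 7 cycles.
dp_peak = N_SPES * (2 * 2 / 7) * CLOCK_GHZ      # Gflop/s

print(round(sp_peak, 1), round(dp_peak, 2), round(sp_peak / dp_peak))
# → 204.8 14.63 14
```

The 204.8 and 14.6 Gflop/s values match the GEMM peak numbers quoted later in these slides.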

Programming

Modified SPMD (Single Program Multiple Data): in effect, dual-program multiple data.
Each SPE has its own local memory from which it fetches code and reads/writes data.
All loads and stores are local.
Explicit DMA operations move data between main memory and local memory.
Software-controlled memory.
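The pattern this model enables is double buffering: issue the DMA for the next chunk, then compute on the chunk already in local store. A minimal Python sketch, in which a plain list copy stands in for the asynchronous `mfc_get` DMA command and the names `dma_get`/`process` are made up for illustration:

```python
CHUNK = 4  # words per "DMA" transfer (illustrative size)

def dma_get(main_memory, offset):
    """Stand-in for an asynchronous DMA read into local store."""
    return main_memory[offset:offset + CHUNK]

def process(main_memory):
    out = []
    buf = [None, None]                    # two local-store buffers
    buf[0] = dma_get(main_memory, 0)      # prefetch the first chunk
    for i in range(0, len(main_memory), CHUNK):
        cur = (i // CHUNK) % 2
        nxt = 1 - cur
        if i + CHUNK < len(main_memory):
            buf[nxt] = dma_get(main_memory, i + CHUNK)  # overlap "DMA"...
        out.extend(x * 2 for x in buf[cur])             # ...with compute
    return out

print(process(list(range(8))))  # → [0, 2, 4, 6, 8, 10, 12, 14]
```

On real hardware the fetch into `buf[nxt]` proceeds in the background while the loop body computes, which is what hides main-memory latency.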

Programming Models
Very challenging to program.
Explicit parallelism between the SPEs and the PPC.
Quadword ISA.
Unlike MPI, the communication intrinsics are low-level, hence faster.
Three basic models:
 Task parallel: separate tasks assigned to each SPE.
 Pipeline parallel: large blocks of data transferred between SPEs.
 Data parallel: same code, distinct data (the paper uses this).

Benchmark Kernels
Stencil Computations on Structured Grids
Sparse Matrix-Vector Multiplication
Matrix-Matrix Multiplication
1D FFTs
2D FFTs

CELL +
The authors of this paper proposed minor architectural changes to the Cell processor.
DP wasn't a major focus for the gaming world, and a full redesign would increase complexity and power consumption.
Cell+ fetches DP instructions every 2 cycles, keeping everything else the same.

The Processors Used
Benchmark 1 –GEMM
Dense matrix-matrix multiplication: high computational intensity and regular memory access.
Expect to reach close to peak on most platforms.
Two blocking formats were explored: column-major and block data layout.
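Whatever the storage layout, the kernel itself is a standard blocked multiply; a pure-Python sketch (the block edge `B` is a stand-in — on Cell it would be sized so the working blocks fit in the 256-KB local store):

```python
B = 2  # block edge; illustrative, not the paper's actual blocking factor

def gemm_blocked(A, M, n):
    """C = A * M for n x n matrices, processed block by block."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, B):
        for j0 in range(0, n, B):
            for k0 in range(0, n, B):          # stream blocks of A and M
                for i in range(i0, min(i0 + B, n)):
                    for j in range(j0, min(j0 + B, n)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + B, n)):
                            s += A[i][k] * M[k][j]   # one FMA per step
                        C[i][j] = s
    return C
```

Blocking keeps each operand tile resident in fast local memory while it is reused, which is why GEMM reaches close to peak.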

Benchmark 1 – GEMM

Gflop/s  | Cell+ (PM) | Cell (PM) | X1E  | AMD64 | IA64
DP       | 51.1       | 14.6      | 16.9 | 4.0   | 5.4
SP       | —          | 204.7     | 29.5 | 7.8   | 3.0

Mflop/W  | Cell+ (PM) | Cell (PM) | X1E  | AMD64 | IA64
DP       | 1277       | 365       | 141  | 45    | 42
SP       | —          | 5117      | 245  | 88    | 23
BENCHMARK 2 – Sparse Matrix-Vector Multiply

Seems like a poor choice at first glance due to its low computational intensity and irregular data accesses.
But the low local-store latency, task parallelism, 8 SPE load/store units, and DMA prove otherwise.
Most of the matrix entries are zero; the sparsely distributed nonzeros can be streamed in via DMA.
Like DGEMM, SpMV can exploit an FMA well.
Very low computational intensity (1 FMA for every 12+ bytes).
Non-FP instructions can dominate.
Row lengths can be unique and in multiples of 4.
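The streaming pattern described above maps naturally onto compressed sparse row (CSR) storage, where the nonzero values and their column indices are contiguous arrays that can be streamed via DMA, one FMA per nonzero. A minimal Python sketch:

```python
def spmv_csr(vals, cols, rowptr, x):
    """y = A @ x with A stored in compressed sparse row (CSR) form."""
    y = []
    for r in range(len(rowptr) - 1):
        s = 0.0
        for k in range(rowptr[r], rowptr[r + 1]):
            s += vals[k] * x[cols[k]]   # the one-FMA-per-nonzero the text mentions
        y.append(s)
    return y

# 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]] in CSR form:
vals   = [2.0, 1.0, 3.0, 4.0, 5.0]
cols   = [0, 2, 1, 0, 2]
rowptr = [0, 2, 3, 5]
print(spmv_csr(vals, cols, rowptr, [1.0, 1.0, 1.0]))  # → [3.0, 3.0, 9.0]
```

The irregular part is the gathered `x[cols[k]]` access, which is exactly what makes SpMV hard on cache-based machines.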
SpMV – Results

Gflop/s        | Cell (FSS) | Cell+ (PM) | Cell (PM) | X1E  | AMD64 | IA64
unsymmetric DP | 3.04       | 2.46       | 2.34      | 1.14 | 0.36  | 0.36
unsymmetric SP | -          | -          | 4.08      | -    | 0.53  | 0.41
symmetric DP   | 3.38*      | 4.35       | 4.00      | 2.64 | 0.60  | 0.67
symmetric SP   | -          | -          | 7.68      | -    | 0.80  | 0.83

Mflop/W        | Cell (FSS) | Cell+ (PM) | Cell (PM) | X1E  | AMD64 | IA64
unsymmetric DP | 76.0       | 61.5       | 58.5      | 9.50 | 4.04  | 2.77
unsymmetric SP | -          | -          | 102       | -    | 5.96  | 3.15
symmetric DP   | 84.5*      | 109        | 100       | 22.0 | 6.74  | 5.15
symmetric SP   | -          | -          | 192       | -    | 8.99  | 6.38
Stencil Based Computations
Stencil computation codes represent a wide array of scientific applications.

 Each point in a multidimensional grid is updated from a subset of its neighbours.
 Finite-difference operations are used to solve complex numerical systems.
 Here, simple heat equations and a 3D hyperbolic PDE are examined.

Relatively low computational intensity results in a low percentage of peak on superscalars.

 Memory-bandwidth bound due to the low computational intensity.
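For concreteness, one Jacobi-style sweep of a 2-D heat-equation stencil looks like this in Python (`alpha` is an arbitrary diffusion coefficient chosen for the sketch, not a value from the paper):

```python
def heat_step(grid, alpha=0.1):
    """One explicit (Jacobi) update of the 2-D heat equation."""
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]          # boundary values stay fixed
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = grid[i][j] + alpha * (
                grid[i - 1][j] + grid[i + 1][j] +
                grid[i][j - 1] + grid[i][j + 1] - 4 * grid[i][j])
    return new

g = [[0.0] * 3 for _ in range(3)]
g[1][1] = 1.0                               # a single hot point
print(heat_step(g)[1][1])  # → 0.6
```

Each update reads five grid values to perform a handful of flops, which is the low computational intensity that makes the kernel bandwidth-bound.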
Stencils – Results

Gflop/s | Cell (FSS) | Cell+ (PM) | Cell (PM) | X1E  | AMD64 | IA64
DP      | 7.25       | 21.1       | 8.2       | 3.91 | 0.57  | 1.19
SP      | 65.8       | -          | 21.2      | 3.26 | 1.07  | 1.97

Mflop/W | Cell (FSS) | Cell+ (PM) | Cell (PM) | X1E  | AMD64 | IA64
DP      | 181        | 528        | 205       | 32.6 | 6.4   | 9.15
SP      | 1645       | -          | 530       | 27.2 | 12    | 15.2
1D Fast Fourier Transforms
The fast Fourier transform (FFT) is of great importance to a wide variety of applications.
 One of the main techniques for solving PDEs.
 Relatively low computational intensity with a non-trivial volume of data movement.
1D FFT: naïve algorithm, cooperatively executed across the SPEs.
 Load roots of unity, load data (cyclic).
 3 stages: local work, on-chip transpose, local work.
 No double buffering (i.e., no overlap of communication and computation).
2D FFT: the 1D FFTs are each run on a single SPE.
 Each SPE performs 2 * (N/8) FFTs.
 Double buffered (2 incoming and 2 outgoing).
 Straightforward algorithm (N² 2D FFT): N simultaneous FFTs, transpose, N simultaneous FFTs.
 Transposes represent about 50% of SP execution time, but only 20% of DP.
Cell performance is compared with the highly optimized FFTW and vendor libraries.
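The "local work" stages above are ordinary radix-2 FFTs; for reference, the textbook recursive Cooley-Tukey form (power-of-two length) is:

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # root of unity (twiddle)
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

print(fft([1, 1, 1, 1]))  # → [(4+0j), 0j, 0j, 0j]
```

The precomputed "roots of unity" loaded into local store in the slide's algorithm are exactly the `cmath.exp(...)` twiddle factors here.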

1D FFT Results

averaged Gflop/s | Cell+ (PM) | Cell (PM) | X1E  | AMD64 | IA64
1D DP            | 13.4       | 5.85      | 4.53 | 1.61  | 2.70
1D SP            | -          | 33.7      | 5.30 | 3.24  | 1.72
2D DP            | 16.2       | 6.65      | 7.05 | 0.69  | 0.31
2D SP            | -          | 38.2      | 7.93 | 1.32  | 0.42

averaged Mflop/W | Cell+ (PM) | Cell (PM) | X1E  | AMD64 | IA64
1D DP            | 335        | 146       | 37.8 | 18.1  | 20.8
1D SP            | -          | 843       | 44.2 | 36.4  | 13.2
2D DP            | 405        | 166       | 58.8 | 7.75  | 2.38
2D SP            | -          | 955       | 66.1 | 14.8  | 3.23
A Few Conclusions
Far more predictable than conventional machines.
Even in double precision, it obtains much better performance on a surprising variety of codes.
Cell can eliminate unneeded memory traffic and hide memory latency, and thus achieves a much higher percentage of memory bandwidth.
The instruction set can be very inefficient for poorly SIMD-izable or misaligned codes.
Loop overheads can heavily dominate performance.
The programming model is clunky.

Real World Applications
FOLDING@HOME
Folding@home is a distributed computing project at Stanford University.
Connects more than 1 million CPUs.
Mainly used to study protein folding and misfolding.
The PS3's Cell Broadband Engine increased the total computation power dramatically, up to 1 PF.
One work unit takes 8 hours: run the PS3 overnight, then it sends the results back.
250 K CPUs active in 2008.

Other Real-Life Scientific Applications

Ray tracing.

Modeling of the human brain.

Solving complex equations to predict the gravity waves generated by super-sized black holes.

Assisting an autonomous vehicle.
Axion Racing Entry into the DARPA Urban Challenge

A series of events designed to test autonomous vehicles, developing technology that keeps people off the battlefield.
Axion Racing used a PS3 running Yellow Dog Linux as part of its on-board image recognition system.
'Spirit', the name of Axion Racing's vehicle, was the first of its kind to drive itself to the 14,110-foot summit of Colorado's Pikes Peak.
Spirit uses stereo vision (two cameras) to determine object distance: running the two images through the software produces something called a disparity map. The further away an object is, the smaller its disparity; the opposite holds for near objects.
Spirit uses the Cell to park and reverse.
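As an illustration of the idea (a toy sketch, not Axion Racing's actual code), here is a one-scanline disparity search in Python using sum-of-absolute-differences block matching; `disparity_1d` and its parameters are invented for this example:

```python
def disparity_1d(left, right, max_d=4, win=1):
    """For each pixel, find the shift that minimizes SAD over a small window."""
    disp = []
    for x in range(win, len(left) - win):
        best_d, best_sad = 0, float("inf")
        for d in range(min(max_d, x - win) + 1):
            sad = sum(abs(left[x + k] - right[x - d + k])
                      for k in range(-win, win + 1))
            if sad < best_sad:
                best_d, best_sad = d, sad
        disp.append(best_d)
    return disp

left  = [0, 0, 0, 7, 3, 8, 0, 0, 0, 0]   # left-camera scanline
right = [0, 7, 3, 8, 0, 0, 0, 0, 0, 0]   # same scene shifted 2 px
print(disparity_1d(left, right))  # → [0, 0, 2, 2, 2, 2, 0, 0]
```

The textured object matches at a shift of 2 while the flat background stays at 0: larger disparities mean nearer objects, exactly the relationship the slide describes.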

SPIRIT
Along with the stereo cameras, Spirit uses a laser range finder, an infrared camera, two NAVCOM Starfire GPS units, and an inertial navigation system (to correct for GPS errors and signal losses).
Ray Tracing
A very computationally intense algorithm that models the paths taken by light rays as they interact with optical surfaces.
Also used in modeling radio waves, radiation effects, and other engineering areas.
The algorithm needs to be heavily modified to run on the Cell.
Ray Tracing
This video shows a progression of ray-traced shaders executing on a cluster of IBM QS20 Cell blades:
over 300,000 triangles rendered at over 60 frames per second (depending on the shader) at 1080p resolution using 14 Cell processors.
Because of the ray tracer's scalable nature, it can also render interactive frames on a single Linux PlayStation 3 using only 6 SPEs.

Conclusion
Overall, a single PS3 performs better than the highest-end desktops available and compares to as many as 25 nodes of an IBM Blue Gene supercomputer, and there is still tremendous scope for extracting more performance through further optimization.
It's a commodity processor, hence cheap and usable in large quantities.
The most difficult part is writing and compiling the code!
QUESTIONS????