Power - ICR(ISAC CPU Research)


100mW 1TFLOPS
Massive Processors
February 24, 2011
Sung Bae Park
SAIT
Outline
• Trend
• Direction
• Future
Trend
Revolution
• SW & Data surpass HW → Massive Processors on Smart SoC
[Figure: 1st IT Revolution [PC & Internet] and 2nd IT Revolution [Mobile & Cloud] around an Inflection Point, '10 to '20. Labels: Massive Processors, Smart SoC, HW SoC, Others; Modem, Channel, 3DG, Image, A/V, CPU; Low Price, Low Power, Fast TAT, OTA* SW Update. Source: IBM]
OTA*: Over-The-Air
Matured Si Scaling
• Si scaling will mature enough to integrate 8M 32-bit ALUs by 2020
→ Massive Processors on Smart SoC
[Figure: number of 32-bit ALUs on a 100mm2 SoC, with ticks at 1M, 2M, 4M, and 8M, at 1.2K gates per 32-bit ALU]
Processor
• PC CPU: Highly Programmable & High Performance, but costly in Power & Price
• HW SoC: Low Power & Price, Medium Performance, but lacks Programmability
• Smart SoC: Highly Programmable & High Performance, Low Power & Price
PC CPU $50B, HW SoC $100B, Smart SoC $250B
[Figure: timeline with ticks at 1980, 1990, 2000, 2010, 2017 and an Inflection Point]
IBM PC on Intel x86 CPU
Drivers
• x86 Binary Compatible Mass Infra for IHV/ISV
• High Performance 3-4GHz 6-24 Core
Obstacles
• Power ~100W
• Price ~$100
• Memory Bottleneck
Nokia Phone on ARM CPU
Drivers
• Data: Low Power, Low Price Dedicated HW IPs
• CPU: Low Power, Low Price ARM Mass Infra for IHV/ISV
Obstacles
• HW IP: No Programmability
• CPU/DSP: x10 Power, Price & Memory Bottleneck than HW IPs
Smart Consumer on Massive Processor
Drivers
• P4 Innovation in Massive Processors: P4 [Price, Power, Performance & Programmability] enabled by Matured Si Scaling → 4K/4M/4G,.. ALUs
Obstacles
• Programming Model
• Dynamic Compiler & Debugger
• Simulator, Profiler & Runtimes
Power
• Si technology continues to scale toward sub-10nm along Moore’s Law
• No power efficiency improvements with Si technology scaling
• Power Constraints for Mobile Device & Cloud Computing Data Center
[Source: Office of Science, U.S. DEPARTMENT OF ENERGY ]
Waste
“Years of research in low-power embedded computing have shown
only one design technique to reduce power: reduce waste.”
- Mark Horowitz, Stanford University & Rambus Inc.
[Figure: FHD H.264 Decoder: Power, Area, and FoM*]
*FoM: Figure-of-Merit = 1/(Power*Area)
Reduce Power: Reduce Waste
Wasted Transistors
Wasted Computation
Wasted Bandwidth
Wasted Voltage
[Figure: Office of Science, U.S. DEPARTMENT OF ENERGY ]
Direction
Architecture: Make HW IP Programmable
• Reprogrammable FSM with Microcode + Domain Specific HW FU with ISA
• Extreme RISC in Horizontal Control, and Extreme CISC in Vertical Data
[Figure: advanced DSP design driven by workload analysis, programmed through a smart compiler (C/C++). Labels: Control, Data, Latency, Throughput; I$, Instruction Fetch (F0-F2), Instruction Decode (D1-D2); VLIW execute stages E1-E7 with Tag Match, AGU Access, LS Pipeline, ALU1-ALU3, MUL1/MUL2/SHFT, and WB on FU0-FU2; Central RF (Register file); Coarse Grain Array (CGA) of FUs with local RFs]
Domain-specific ISA FUs:
• Radio ISA FU: Cellular, Channel/Wireless
• Media ISA FU: AV/Image, 3D/Ray-Tracing
• Intelligence ISA FU: Recognition, Mining, Synthesis
Architecture > x3
• Reduce Waste in Computation: Maximize Essential Algorithm
• Reduce Waste in Bandwidth: Minimize Memory Access
FHD H.264 Decoder: Algorithm to Architecture Innovation
Reduce Waste Cycles & Memory
• 32b 1x16 FU → 64b SIMD (8x8)x16 FU
• SIMD Pack/Unpack → Implicit Shuffling Mux
• Serial CABAC → Parallel Programmable CABAC
• Limited Interconnection → Fully Programmable
• 1-Thread → Micro-Multi Thread: VLIW, CGA
• 64-bit DMA → 1024-Bit Adaptive Programmable DMA
• Banked Memory → X-Y Simultaneous Access Stack
[Figure: Power, Area, and FoM*]
*FoM: Figure-of-Merit = 1/(Power*Area)
Case Study I: Memory
• No Need to Calculate Address: Implicit/Local/Distributed 32x32x64-Bit RF
• X-Y Bi-Directional Random Access for Extreme Spatial Locality in Bit-Pixel Stream Applications
[Figure: H.264 FHD decoding, luma inter-prediction at ¼ pel, with horizontal (X) and vertical (Y) filtering on the X-Y Stack. Two instruction schedules are shown, with total cycle counts of 170 and ~18. One schedule interleaves Load64b/add (data load & address generation), CAT_WIN (data shuffling), FirFilter64b/RoundSat/Avg64b (data computation), and Store64b/add (data store & address generation); the other is labeled LOAD from stack, data computation, and STORE to stack. Loop example: II=2, loop count=16]
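The FirFilter64b, RoundSat, and Avg64b labels in the figure line up with the luma interpolation steps of H.264. The following is a minimal Python sketch of that arithmetic, assuming the standard H.264 six-tap filter (1, -5, 20, 20, -5, 1) with rounding and 8-bit saturation. The function names are illustrative and are not the slide's ISA; the sketch does not model the 64-bit SIMD width or the X-Y stack hardware, and it omits the standard's special handling of the centre half-sample position. It only shows that the same kernel has to run along rows (X) and along columns (Y), which is the access pattern the X-Y stack is meant to serve without explicit address calculation.

```python
H264_LUMA_TAPS = (1, -5, 20, 20, -5, 1)  # six-tap half-sample filter


def clip8(value):
    """Saturate to the unsigned 8-bit pixel range (the 'Sat' in RoundSat)."""
    return max(0, min(255, value))


def half_pel(samples):
    """Half-sample value from six neighbouring integer samples."""
    acc = sum(tap * s for tap, s in zip(H264_LUMA_TAPS, samples))
    return clip8((acc + 16) >> 5)  # round, divide by 32, saturate


def quarter_pel(a, b):
    """Quarter-sample value as the rounded average of two neighbours."""
    return (a + b + 1) >> 1


def filter_line(line):
    """Slide the six-tap filter along one line of samples."""
    return [half_pel(line[i:i + 6]) for i in range(len(line) - 5)]


def filter_horizontal(block):
    """X direction: each row is filtered independently."""
    return [filter_line(row) for row in block]


def filter_vertical(block):
    """Y direction: the same kernel applied to columns.

    In software this needs a transpose (or strided loads); the slide's
    claim is that an X-Y addressable stack avoids that extra work.
    """
    columns = [list(col) for col in zip(*block)]
    filtered = [filter_line(col) for col in columns]
    return [list(row) for row in zip(*filtered)]


if __name__ == "__main__":
    flat = [[100] * 6 for _ in range(6)]
    print(filter_horizontal(flat))  # six rows, one sample each
    print(filter_vertical(flat))    # one row, six samples
```

A flat block stays flat under both passes because the taps sum to 32, which is a quick sanity check on the rounding shift.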
Case Study II: Computation
• Adaptive Algorithm for Wide Execution Unit (64b x 8 FU = 512b)
• 3x3 Median Filter in a Single Cycle without any data exchange
[Figure: 3x3 median filter dataflow across FU0-FU8 on an example window, built from subtract (-), sign test (<=0 / >0), add (+), and rank comparison (>=4, <=4) stages]
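The exact dataflow of the figure is not recoverable from the transcript, but its operators (subtract, sign test, add, compare against 4) suggest a rank-counting median: each of the nine pixels is compared against all the others, the comparison results are summed into a rank, and the pixel whose rank covers position 4 is the median. The sketch below is one such formulation under that assumption; `median9_by_rank` is a hypothetical name, and the Python loop stands in for work that the slide describes as parallel across FUs. No pixel is ever swapped or moved, which is one reading of "without any data exchange".

```python
def median9_by_rank(window):
    """Median of a 3x3 window by rank counting, with no swaps.

    For each candidate pixel, count how many pixels are strictly smaller
    and how many are smaller or equal (both via the sign of a difference).
    The median is the candidate whose sorted positions include index 4.
    """
    if len(window) != 9:
        raise ValueError("expected the 9 pixels of a 3x3 window")
    for candidate in window:
        below = sum(1 for other in window if other - candidate < 0)
        not_above = sum(1 for other in window if other - candidate <= 0)
        if below <= 4 < not_above:
            return candidate
    raise AssertionError("unreachable: a 9-element window always has a median")


if __name__ == "__main__":
    print(median9_by_rank([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

Each candidate's two counts are independent of the others, so in hardware the nine rank computations could run side by side, one per FU; that mapping is an assumption here, not something the transcript confirms.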
Case Study III: eDRAM
• Multi-GHz Multi-GB eDRAM for Performance and Cost
[Figure: Computation and Memory. Source: IBM]
Implementation > x3
• Reduce Waste in Circuits: Extreme Custom Circuits > x2-3
• Reduce Waste in Device: Circuits Design Assisted Device > x1.5
FHD H.264 Decoder: Architecture to Si Implementation Innovation
Reduce Voltage & Speed Up by Custom Design & Device
1V 1GHz → 0.6V 1GHz
- 1mA Ion 1nA Ioff @ 1V → 1mA Ion 1nA Ioff @ 0.6V
  Improve vxo (Carrier Virtual Source Velocity) with Strained-Si Adjusted by Sweet Spot ABB
- DLV (Deep Low Voltage) Custom Circuits & Device
- Extensive Clock Gating & Power Gating
[Figure: Power, Area, and FoM*]
*FoM: Figure-of-Merit = 1/(Power*Area)
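The step from 1V 1GHz to 0.6V 1GHz can be checked against the usual first-order model for switching power, P = a·C·V²·f. The sketch below assumes that model, with activity factor and capacitance unchanged and leakage ignored; the function name is illustrative.

```python
def dynamic_power_ratio(v_old, v_new, f_old=1.0, f_new=1.0):
    """P_new / P_old for P = a * C * V^2 * f, with a and C held constant."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


ratio = dynamic_power_ratio(v_old=1.0, v_new=0.6)  # same 1 GHz clock
print(f"power ratio: {ratio:.2f}, reduction: x{1 / ratio:.2f}")
# prints "power ratio: 0.36, reduction: x2.78"
```

Under these assumptions the supply change alone gives roughly x2.8, which sits inside the "> x2-3" range quoted for custom circuits on this slide.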
Custom Circuits & Device
• Designers’ High: 1mA 1nA @ 0.6V Device with Design Assist
• Custom Style CMOS Circuits with Knee-Slew RLC Impedance Matching
[Figure: clock frequency (1 to 4 GHz) versus Idsat (300 to 1800 uA), with the 45n, 32n, and 28n nodes marked]
• 4.7GHz IBM Power6: 3.3MHz/uA, 340mm2, 13 Stage 13 FO4, 20ps/FO4 @1400uA
• 4GHz Intel Penryn: 2.3MHz/uA, 150mm2, 14 Stage 15 FO4, 17ps/FO4 @1700uA
• 3GHz Custom MP: 2.7MHz/uA, 10-12 Stage 13 FO4 Custom, 24ps/FO4 @1200uA
• 2GHz Custom MP: 2.7MHz/uA, 2mm2, 10-12 Stage 13 FO4 Custom, 38ps/FO4 @740uA
• 1GHz MP: 1.35MHz/uA, 3/4mm2, 10 Stage 26 FO4 P&R, 38ps/FO4 @740uA
x10 Paradigm Shift Semicustom Design Technology:
Technique                                        Speed      Power      Area
Creative NMOS-Dominated Full Custom              > x2 ↑     ≒          ≒
Coarse Grain CLK Gating, DVFS, ABB & LAGS        > x1.2 ↑   > x1.3 ↓   ≒
fknee (40GHz) UHF and DLS (Deep Low Swing)       > x1.3 ↑   > x1.2 ↓   > x1.2 ↓
0.7V DLV (Deep Low Voltage) Circuits and Device  ≒          > x2 ↓     ≒
Advanced Pipelining                              > x2 ↑     ≒          ≒
Correlation Error between Model and Silicon      > x1.1 ↑   > x1.1 ↓   > x1.2 ↓
High Performance SRAM/Library                    > x1.3 ↑   > x1.3 ↓   > x1.3 ↓
[Source: IEDM & ISSCC]
Special Custom MUX Circuits
Case Study I: MUX, Mission-Critical Issue for Processor
A: 3:1, 4:1, 4:1
• Full Swing
• 260ps
• Slow due to too high Cj
• 20uW
B: 3:1 16:1
• Full Swing
• 260ps
• Slow due to too high Cj
• 20uW
C: 3:1 16:1
• Low Swing
• Static Power
• 140ps (x1.8)
• 25uW (x1.25)
D: 3:1 16:1 (?)
• Low Swing
• No Static Power
• 160ps (x1.6)
• 15uW (x0.75)
Special Custom MUX Circuits - Options
Case Study I: Candidates
[Figure: candidate circuits I, II, III, IV, V, VI]
P1, P2 are short pulses to initiate charge and discharge of the output
Special Custom MUX Circuits - Solutions
Case Study I: Solution
?
• Low Swing
• No Static Power
• 160ps (x1.6)
• 15uW (x0.75)
Case Study II: RLC Power
• Resistor: 1 ohm
• Inductor: 1 nH
• Capacitor: 1 nF
• Power source period: 1 us
• Power maximum voltage: 1 V
• Power minimum voltage: 0 V
• Power source slope: 100 ns, 10 ns, 1 ns, 100 ps, 10 ps, 1 ps
[Figure: VDD source driving R, L, and C, with current i(t) and voltage v(t)]
Slope         RC          RLC          Ratio (RLC/RC)
100.000 ns    19.8 uW     20.0 uW      101%
10.000 ns     180.0 uW    199.9 uW     111%
1.000 ns      735.9 uW    933.0 uW     127%
0.100 ns      968.4 uW    999.2 uW     103%
0.010 ns      996.6 uW    1000.0 uW    100%
0.001 ns      998.8 uW    1000.0 uW    100%
[Figure annotations]
• Exponential: CV^2
• Minimum due to Steady Change
• The # of charges is the same as with the 1 ps slope, but the instantaneous peak current, and hence the power, is quite different
• Due to oscillation: I^2 increase
• Exponential Damping: CV^2
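The RC column of the table can be reproduced with a closed-form expression rather than a circuit simulation. The sketch below is a derivation under stated assumptions, not the method used for the slide: the source is an ideal linear ramp of duration T, the inductor is ignored (RC case only), the capacitor settles fully between edges, and each period contains one rising and one falling edge. Under those assumptions the energy burned in R per edge is C·V²·(τ/T)·[1 − (τ/T)(1 − e^(−T/τ))] with τ = RC, which tends to ½CV² for an instantaneous edge and to (τ/T)·CV² for a slow one.

```python
import math

R = 1.0        # ohm, from the slide
C = 1e-9       # farad (1 nF), from the slide
VDD = 1.0      # volt swing, 0 V to 1 V
PERIOD = 1e-6  # second (1 us), from the slide


def rc_edge_energy(t_ramp, r=R, c=C, v=VDD):
    """Energy dissipated in R for one linear-ramp edge of duration t_ramp."""
    tau = r * c
    x = tau / t_ramp
    # 1 - x * (1 - exp(-T/tau)), written with expm1 to limit cancellation
    return c * v * v * x * (1.0 + x * math.expm1(-t_ramp / tau))


def rc_average_power(t_ramp, period=PERIOD):
    """Average power with one rising and one falling edge per period."""
    return 2.0 * rc_edge_energy(t_ramp) / period


if __name__ == "__main__":
    for t_ramp in (100e-9, 10e-9, 1e-9, 100e-12, 10e-12, 1e-12):
        print(f"{t_ramp * 1e9:8.3f} ns  {rc_average_power(t_ramp) * 1e6:7.1f} uW")
```

The results land within a fraction of a percent of the RC column, including the CV²f ceiling of 1000 uW for the fastest edges. The RLC column would need a second-order model and is not covered here.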
Case Study II: Impedance Matching
• Frequency analysis: Slope and Impedance Matching reduce the Power
[Figure: Total Power for RC and RLC, and Power Ratio RLC/RC; marked points: 134MHz, 7.5ns period, 15ns slope]
Case Study III: Tolerable Circuits Design
Design
1. Loading Tolerance: data dependent Cg/Cj/Cw
2. Coupling Tolerance: temporal/spatial (d-d, d-s, s-d, s-s)
3. Clock Tolerance: jitter, duty, slew, skew
4. Temperature Tolerance: spatial, temporal
5. Supply Tolerance: spatial, temporal, Low VDD margin
6. Device Tolerance: N / P, Vt, W/Leff Tracking, Gate RC
7. Leakage Tolerance: spatial, temporal
8. Quality Tolerance: NBTI/PBTI, HCI, Antenna
9. State Element Tolerance: R/W/S Stability, Soft Error
10. Signal Tolerance: edge rate, frequency response
11. Noise Tolerance: logic, delay, supply, signal NM
12. Dynamic Circuits Tolerance: charge sharing/injection/leakage
13. Logical Tolerance: min-max path
14. Back-End: EM/RC, Special DRC, Latch-Up
15. Modeling Tolerance: transistor, contact, via & wire RLC
16. Tool Tolerance: extraction and simulation
[Source: ISSCC]
Case Study IV: Low Temperature Inversion
• Low Temperature Inversion getting worse in sub-45nm Timing Closure
• Adaptive Body Biasing w/ Temperature Sensing
- Vth Increase by VFB Adjust @High Temp
- Mobility Increase by Coulomb/Surface Scattering Adjust @Low Temp
[Source: Nano-CMOS Circuit and Physical Design]
Future
Massive Processors
• 2-10GHz Massive Array Processors with Innovative Memory & NW
• Seamless Open OS Platform with Rich SDKs
Design Methodology
• Custom to SoC
• PM: PG/CG w/ DVFS
General Purpose CPU
• x86 / ARM
Tool Chains
• Integrated Compiler
• System Simulator
Specific Processors
• RMS Accelerations
※ Recognition, Mining & Synthesis
Seamless Platform
• Open OS to Std. Drivers
• OpenCL, MPI, GCD
• Total Solutions
Array Processors
• Scalable Domain FU
X-Y Stack RF
• # of port, entry, bus
Device
• 0.6V 1mA 1nA @32nm (0.7V 0.9mA FBB)
Cross Bar Shuffling NW
• # of channel, queues, slice
Analog IPs
• Low Swing Bus Drv/Rec
• High-Q PLLs
Local Queue Memory
• # of port, size, bus
FPGA for Special IP & IO
• HDMI, Serdes,..
[Base Figure: MIT Prof. Arvind]
Package
• 3D Integration
Virtual Prototyping
• 100mW 100GFLOPS Virtual Massive Processor for 32nm Node
Metric                    CPU       GPU       Mobile    DSP       MP        Virtual MP  HW IP
Area (mm2)                200       400       10        10        10        100         2
Speed (MHz)               5000      1000      1000      500       1000      3000        200
# of ALU                  8         1024      2         8         32        6000        200
GFLOPS                    40        1024      2         4         32        18000       40
Control Efficiency        1         0.5       1         0.5       0.5       1           1
Practical GFLOPS          40        512       2         2         16        18000       40
Si Watt/Hz/mm2            2.00E-08  2.00E-08  2.00E-08  2.00E-08  2.00E-08  2.00E-08    2.00E-08
Area Efficiency           0.3       0.2       0.3       0.1       0.1       0.3         0.1
Effective Area            60        80        3         1         1         30          0.2
Area Watt/Hz/mm2          1.20E-06  1.60E-06  6.00E-08  2.00E-08  2.00E-08  6.00E-07    4.00E-09
Speed Power (Watt/Hz)     6000      1600      60        10        20        1800        0.8
Activation Rate           0.01      0.05      0.01      0.05      0.05      0.01        0.15
Practical Power (Watt)    60        80        0.6       0.5       1         18          0.12
# of FHD Codec            1         12.8      0.05      0.05      0.4       450         1
mm2/FHD Codec             200       31.25     200       200       25        0.22        2
Watt/FHD Codec            60        6.25      12        10        2.5       0.040       0.12
Normalized Power/HW IP    500       52.1      100       83.3      20.8      0.333       1
TFLOPS @100mW             0.00007   0.00128   0.00033   0.00080   0.00320   0.10000     0.03333
Practical TFLOPS@100mW    0.00007   0.00064   0.00033   0.00040   0.00160   0.10000     0.03333
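Most rows of this table follow from a handful of inputs per design (area, clock, ALU count, and three efficiency factors). The sketch below reconstructs those relations by matching the numbers; the formulas are inferred from the values, not stated on the slide, and the constant of 40 GFLOPS per FHD codec is an assumption read off the HW IP column. The class and field names are illustrative.

```python
from dataclasses import dataclass

SI_WATT_PER_HZ_PER_MM2 = 2.0e-8  # "Si Watt/Hz/mm2" row, same in every column
GFLOPS_PER_FHD_CODEC = 40.0      # assumption inferred from the HW IP column


@dataclass
class Design:
    name: str
    area_mm2: float
    speed_mhz: float
    alus: int
    control_efficiency: float
    area_efficiency: float
    activation_rate: float


def evaluate(d):
    """Derive the dependent rows of the table from the inputs of one column."""
    gflops = d.alus * d.speed_mhz / 1000.0          # one FLOP per ALU per cycle
    practical_gflops = gflops * d.control_efficiency
    effective_area = d.area_mm2 * d.area_efficiency
    watt_per_hz = SI_WATT_PER_HZ_PER_MM2 * effective_area
    full_power_w = watt_per_hz * d.speed_mhz * 1e6  # "Speed Power" row
    power_w = full_power_w * d.activation_rate      # "Practical Power" row
    codecs = practical_gflops / GFLOPS_PER_FHD_CODEC
    return {
        "gflops": gflops,
        "practical_gflops": practical_gflops,
        "power_w": power_w,
        "fhd_codecs": codecs,
        "mm2_per_codec": d.area_mm2 / codecs,
        "watt_per_codec": power_w / codecs,
        "tflops_at_100mw": gflops / 1000.0 / power_w * 0.1,
        "practical_tflops_at_100mw": practical_gflops / 1000.0 / power_w * 0.1,
    }


if __name__ == "__main__":
    virtual_mp = Design("Virtual MP", 100, 3000, 6000, 1.0, 0.3, 0.01)
    for key, value in evaluate(virtual_mp).items():
        print(f"{key:28s} {value:g}")
```

For the Virtual MP inputs this gives 18000 GFLOPS at 18 W, that is 0.1 TFLOPS per 100 mW, which matches the slide's "100mW 100GFLOPS" headline.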
100mW 1TFLOPS Massive Processors
[Figure: FLOPS (1G to 100T) versus power (0.1 to 1000 Watts) for 2005, 2010, 2015, and 2020, with Mobile CPU, PC CPU, GP GPU, and HW ASIC, against Moore’s Law at x2 / 18-months (x10 / 5-years)]
Peta Flop @2.3MW
Exa Flop @70MW: 1/30 Power Efficiency in 5-years
• 1/3 from Computing, 1/10 from Scaling
Exa Flop @230KW: 1/10,000 Power Revolution in 10-years
• 1/100 from Scaling
• Additional 1/100 from Innovation in
  - HW like Essential Computing
  - Si Technology for DLV Device & Circuits
  - Multi-GHz Multi-GB eDRAM
  - Exa-byte/sec 3D Integration
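The three power figures on this slide are linked by simple scaling. The sketch below works through that arithmetic, assuming the starting point is the quoted Peta Flop @2.3MW scaled linearly to an exaflop; the variable names are illustrative.

```python
PETA = 1e15
EXA = 1e18

petaflop_power_w = 2.3e6                                # Peta Flop @2.3MW
exa_at_same_efficiency = petaflop_power_w * EXA / PETA  # 2.3 GW, no gains

# 5-year target: 1/3 from computing, 1/10 from scaling
exa_5yr_w = exa_at_same_efficiency * (1 / 3) * (1 / 10)

# 10-year target: 1/100 from scaling, another 1/100 from innovation
exa_10yr_w = exa_at_same_efficiency * (1 / 100) * (1 / 100)

# Power per TFLOPS at the 10-year efficiency
watt_per_tflops = exa_10yr_w / (EXA / 1e12)

print(f"exaflop, no gains : {exa_at_same_efficiency / 1e9:.2f} GW")
print(f"exaflop, 5 years  : {exa_5yr_w / 1e6:.1f} MW")
print(f"exaflop, 10 years : {exa_10yr_w / 1e3:.0f} kW")
print(f"per TFLOPS        : {watt_per_tflops * 1e3:.0f} mW")
```

The 5-year result comes out near 77 MW, which the slide rounds to 70MW, and the 10-year result is 230 kW. Per TFLOPS that is 230 mW, the same order of magnitude as the 100mW 1TFLOPS target in the title.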
Massive Processors Applications
[Source: IBM, etc.]
Yet Primitive,…
• For 100 µs 32K Atoms Protein Folding Simulation
  - Time Step: 10^-15 second
  - Simulation Steps: 10^11 Steps
  - # of Atoms in protein and water: 32,000
  - # of Force Computations per Step: 10^9
  - # of Instructions per Force Computation: 1,000
  - Total # of Instructions: 10^23 Instructions
• To Execute 10^23 Instructions
  - 3GHz PC: 300,000 Years
  - 8,192 Processor Nodes IBM ASCI White: 160 Years
  - 65,536 Processor Nodes IBM Blue Gene: 9 Years
  - Peta Flop IBM Roadrunner: 3 Years
  - Exa Flop in 2018: 27 Hrs
[Source: IBM]
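The instruction count and the last two run times can be checked with a few lines of arithmetic. The sketch below assumes one instruction retires per FLOP of machine throughput and that the machine is fully utilized, which is a simplification introduced here; the helper name is illustrative. The per-machine figures for the PC, ASCI White, and Blue Gene depend on sustained rates the slide does not give, so they are not recomputed.

```python
TIME_STEP_S = 1e-15
SIMULATION_STEPS = 10**11
FORCE_COMPUTATIONS_PER_STEP = 10**9    # close to 32,000 squared (about 1.02e9)
INSTRUCTIONS_PER_FORCE = 1_000

simulated_time_s = SIMULATION_STEPS * TIME_STEP_S
total_instructions = (SIMULATION_STEPS
                      * FORCE_COMPUTATIONS_PER_STEP
                      * INSTRUCTIONS_PER_FORCE)


def runtime_seconds(instructions, ops_per_second):
    """Wall-clock time at a sustained rate, assuming perfect utilization."""
    return instructions / ops_per_second


SECONDS_PER_YEAR = 365.25 * 24 * 3600

peta_years = runtime_seconds(total_instructions, 1e15) / SECONDS_PER_YEAR
exa_hours = runtime_seconds(total_instructions, 1e18) / 3600

print(f"simulated time : {simulated_time_s * 1e6:.0f} us")
print(f"instructions   : {total_instructions:.0e}")
print(f"petaflop       : {peta_years:.1f} years")
print(f"exaflop        : {exa_hours:.1f} hours")
```

The simulated time is 100 µs, the total is 10^23 instructions, and the petaflop and exaflop run times come out at about 3.2 years and 27.8 hours, in line with the slide.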
,.. to Understand the Nature
[Source: Nature]
Appendix
Where the Power Goes and How to Save it?
• It’s not the LC, but the R that matters!
• Thermal Phonons induced by Electron Scattering & Lattice Vibrations
• Getting Worse in Thicker & Narrower Wires than Smaller Transistors
Pumping-Out Electrons
Pumping-In Electrons
Maxwell Power Save in Interconnections
- Minimize Over/Under Damping
by Precise Impedance Matching
- Maximize Regularity for Hierarchical P&R
- Maximize Local Clustering w/ PG & CG
Poisson Power Save in Si
- Maximize Conductance @ DLV
- Minimize Area w/ Custom
- Minimize Switching Activities w/ PG & CG
Thermal Phonons
[Metal Stack: Intel 32nm]
FQHE: Zero Resistance → Zero Power
• Fractional Quantum Hall Effect @ Certain Magnetic Field
• “Sharp Resonance” as Impedance Matching and/or Superconductor
In 1980, Klaus von Klitzing [103] found that at temperatures of only a
few Kelvin and high magnetic field (3-10 Tesla), the Hall resistance did
not vary linearly with the field. Instead, he found that it varied in a
stepwise fashion. It was also found that where the Hall resistance was flat,
the longitudinal resistance disappeared. This dissipation free transport
looked very similar to superconductivity. The field at which the
plateaus appeared, or where the longitudinal resistance vanished,
quite surprisingly, was independent of the material, temperature, or
other variables of the experiment, but only depended on a
combination of fundamental constants - ħ/e^2. The quantization of
resistivity seen in these early experiments came as a grand surprise and
would lead to a new international standard of resistivity, the Klitzing,
defined as the Hall resistance measured at the fourth step.
By 1982, semiconductor technology had greatly advanced and it became
possible to produce interfaces of much higher quality than were
available only a few years before. That same year, Horst Stormer and
Dan Tsui [105] repeated Klitzing's earlier experiments with much cleaner
samples and higher magnetic fields. What they found was the same
stepwise behavior as seen previously, but to everyone's surprise, steps
also appeared at fractional filling factors ν = 1/3, 1/5, 2/5, ... Strongly
correlated systems are notoriously difficult to understand, but in 1983,
Robert Laughlin [106] proposed his now celebrated ansatz for a
variational wavefunction which contained no free parameters:
[Cooper Pairs to Molecules: J. N. Milstein]