Literature Review
A 280mV-to-1.1V 256b Reconfigurable SIMD Vector Permutation Engine with 2-D Shuffle in 22nm CMOS [ISSCC ’12]
Fang-Li Yuan
Advisor: Prof. Dejan Marković
03/23/2012
IC Design Challenges: 1980s – Present
Session 1.4: Sustainability in Silicon & System Development
1980s: Design productivity
1990s: Power dissipation
2000s: Leakage power
2010s: Energy efficiency
Moore’s Law continues to provide more transistors
Power budgets limit our ability to use them
Intel’s Solutions – From Transistors to Circuits
2012 ISSCC
2007 ISSCC
Near-Vth Computing: Great for Energy Efficiency
IA-32: 1st NTV Processor in 32nm CMOS
NTV Circuits Gain 7x Efficiency in VPFP Mult-Add
1st NTV SIMD Engine in 22nm Tri-Gate Technology
System-Level Overview
32 32×8b 3R1W RF: 4~32-way, 8/16/32/64b vertical permutation
256b, byte-wise, any-to-any crossbar: horizontal permutation
Goal:
(1) Provide flexibility
(2) Improve Vmin
(3) Reduce power
(4) Lower PVT var.
Results:
585 GOPS/W @280mV
(9x higher than 1.1V)
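The two permutation dimensions named on this slide can be modeled in software. The following is a minimal behavioral sketch, not the circuit or its control encoding: the function names `horizontal_permute` and `vertical_permute` and the select-list format are assumptions made for illustration. It assumes a 32-entry register file of 256b (32-byte) words, lanes of 1, 2, 4, or 8 bytes (which gives the 32-, 16-, 8-, or 4-way cases), and a crossbar in which every output byte may select any input byte.

```python
VECTOR_BYTES = 32  # 256b vector


def horizontal_permute(vec, select):
    """Byte-wise any-to-any crossbar model: output byte i = input byte select[i]."""
    if len(vec) != VECTOR_BYTES or len(select) != VECTOR_BYTES:
        raise ValueError("expected a 32-byte vector and 32 select indices")
    return bytes(vec[s] for s in select)


def vertical_permute(entries, entry_select, lane_bytes):
    """Register-file model: lane j is read from entry entry_select[j].

    Data never changes lane here; only the source entry differs per lane.
    lane_bytes of 1/2/4/8 corresponds to 8/16/32/64b lanes.
    """
    if lane_bytes not in (1, 2, 4, 8):
        raise ValueError("lane width must be 8, 16, 32, or 64 bits")
    lanes = VECTOR_BYTES // lane_bytes
    if len(entry_select) != lanes:
        raise ValueError("need one entry index per lane")
    return b"".join(
        entries[e][j * lane_bytes:(j + 1) * lane_bytes]
        for j, e in enumerate(entry_select)
    )
```

For instance, `horizontal_permute(vec, [5] * 32)` broadcasts byte 5 to every position, and `vertical_permute(entries, [3, 2, 1, 0], 8)` builds one 256b word whose four 64b lanes come from entries 3, 2, 1, and 0.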
Example: 64b 4x4 Matrix Transpose
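The figure for this slide did not survive in the transcript, so the sketch below is only one way a 4x4 transpose of 64b elements can be built from horizontal (crossbar) and vertical (register-file) steps. It is not necessarily the sequence shown on the slide; the three-step rotate, diagonal-gather, reorder scheme and all function names are assumptions chosen for illustration.

```python
LANES = 4        # four 64b lanes in a 256b row
LANE_BYTES = 8   # 64b element


def horizontal(row, lane_select):
    """Crossbar step: output lane j takes input lane lane_select[j]."""
    return b"".join(
        row[s * LANE_BYTES:(s + 1) * LANE_BYTES] for s in lane_select
    )


def vertical(rows, entry_select):
    """Register-file step: lane j is read from row entry_select[j], same lane."""
    return b"".join(
        rows[e][j * LANE_BYTES:(j + 1) * LANE_BYTES]
        for j, e in enumerate(entry_select)
    )


def transpose_4x4(rows):
    """Transpose a 4x4 matrix of 64b elements stored as four 32-byte rows."""
    n = LANES
    # Step 1 (horizontal): rotate row r left by r lanes, B[r][j] = A[r][(j+r)%n],
    # so the elements of each column land in distinct lanes.
    rotated = [
        horizontal(rows[r], [(j + r) % n for j in range(n)]) for r in range(n)
    ]
    # Step 2 (vertical): diagonal gather, C[i][j] = B[(i-j)%n][j].
    gathered = [
        vertical(rotated, [(i - j) % n for j in range(n)]) for i in range(n)
    ]
    # Step 3 (horizontal): reorder lanes, T[i][k] = C[i][(i-k)%n] = A[k][i].
    return [
        horizontal(gathered[i], [(i - k) % n for k in range(n)])
        for i in range(n)
    ]


def make_matrix():
    """Helper for the example: element (r, c) holds the value 4*r + c."""
    return [
        b"".join((4 * r + c).to_bytes(LANE_BYTES, "little") for c in range(LANES))
        for r in range(LANES)
    ]
```

The vertical step alone cannot transpose, because it never moves data between lanes; the horizontal steps before and after it supply the lane movement.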
RF with PVT-tolerant Techniques & Vector FFs
Clockless static reads eliminate keeper contention in dynamic BLs
Byte-wise enable-signal gating reduces switching power by 49%
Shared P/N on virtual supplies limits strength of cross-coupled INVs
Vector flip-flops w/ shared local min-sized clock INVs average the variation
250mV Vmin Reduction Across PVT Variations
Vector FFs Reduce Hold-Time Violations @ Low V
ULVS LS & Interleaved Folded Crossbar Layout
Vector mux averages variation effect of min-sized devices by sharing transistors across gates
Folded layout: 50% reduction of wiring
Interleaved layout: 50% lower coupling
ULVS LS decouples CVSL stage from o/p driver & contention devices: 20~32% lower power, 125mV improved Vmin
ULVS Improves Vmin by 125mV
RF and Logic Co-optimization: Iso-Vmin
Measured Performance
RF: 227mW, 2.5GHz @1.1V
RF: 106mW, 1.8GHz @0.9V
RF: 109μW, 16.8MHz @0.28V
Xbar: 69mW, 2.9GHz @1.1V
Xbar: 36mW, 2.3GHz @0.9V
Xbar: 19μW, 10MHz @0.24V
585 GOPS/W @0.26V (9x higher than 0.9V)
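To compare these operating points on a common basis, the measured power and frequency pairs can be converted to energy per clock cycle (power divided by frequency). The sketch below uses only the numbers listed on this slide; the dictionary layout and function name are assumptions for illustration. It gives energy per cycle, not GOPS/W, because the transcript does not state how many operations are counted per cycle.

```python
# (block, supply voltage) -> (power in watts, clock frequency in hertz),
# copied from the measured values on this slide.
MEASURED = {
    ("RF", "1.1V"): (227e-3, 2.5e9),
    ("RF", "0.9V"): (106e-3, 1.8e9),
    ("RF", "0.28V"): (109e-6, 16.8e6),
    ("Xbar", "1.1V"): (69e-3, 2.9e9),
    ("Xbar", "0.9V"): (36e-3, 2.3e9),
    ("Xbar", "0.24V"): (19e-6, 10e6),
}


def energy_per_cycle_pj(power_w, freq_hz):
    """Energy per clock cycle in picojoules: E = P / f."""
    return power_w / freq_hz * 1e12


if __name__ == "__main__":
    for (block, vdd), (power_w, freq_hz) in MEASURED.items():
        pj = energy_per_cycle_pj(power_w, freq_hz)
        print(f"{block:4s} @ {vdd:5s}: {pj:6.2f} pJ/cycle")
```

Under this metric the crossbar goes from about 23.8 pJ/cycle at 1.1V to 1.9 pJ/cycle at 0.24V, and the register file from 90.8 pJ/cycle at 1.1V to about 6.5 pJ/cycle at 0.28V.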
Conclusions
NTV computing is energy efficient but sensitive to PVT variation
Static ckts (e.g. RF read): better than dynamic ckts @ NTV
Shared P/N DETG writes improve Vmin across PVT variations
Vector FF/Mux share transistors across gates, averaging variation
ULVS LS interrupts contention devices, improving Vmin & power
Byte-wise enable-signal gating reduces power
Folded layout has 50% reduction in critical wiring length
Interleaved, opposite-direction data wires achieve 50% lower line-to-line coupling, improving SI & delay