Tortola: Addressing Tomorrow’s
Computing Challenges through
Hardware/Software Symbiosis
Kim Hazelwood
September 29, 2006
Modern Computing Challenges
• Performance
• Power
– Energy consumption, max instantaneous power, di/dt
• Temperature
– Total heat output, “hot spots”
• Reliability
– Neutron strikes, alpha particles, MTBF, design flaws
• Approaches: Circuit, microarchitecture, compiler
• Constraint: Fixed HW-SW interface (e.g., x86)
2 of 26
Typical Approaches
• Optimize using SW or HW techniques in isolation
• Performance
– SW: Compile-time optimizations
– HW: Architectural improvements, VLSI technology
• Reliability: Code/data duplication (HW or SW)
• Power & Temperature
– HW: Hardware control mechanisms
– SW: Profile + recompile cycle
3 of 26
Modern Design Constraints
Compilers – “Compile once, run anywhere”
– Cannot ship “MS Office for 1Q05 batch of Pentium-4
3GHz, > 1GB RAM, BrandX power supply, located in
high altitudes…”
Microarchitecture – Limited window of application
knowledge (past must predict the future)
VLSI – Guaranteed correctness, reliability
We currently must optimize for the common case
(but must design for the worst case)
4 of 26
The Power of Virtualization
• A HW-SW interface layer
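[Diagram: a Binary Modifier layer sits between SW Applications and HW. Initially, both of its interfaces are x86; eventually, they become separate interfaces (labeled SWI and HWI).]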
5 of 26
Dynamic Binary Modification
• Creates a modified code image at run time
[Diagram labels: EXE, Transform, Code Cache, Execute, Profile]
Examples:
• Dynamo (HP)
• DAISY/BOA (IBM)
• CMS (Transmeta)
• Mojo (Microsoft)
• Strata (UVa)
• Pin (Intel)
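The transform, cache, and execute cycle in the diagram can be sketched as a toy dispatcher. This is only an illustration of the general idea, not how Pin or any of the listed systems is implemented; `CodeCache`, `add_nop_padding`, and the list-of-strings block model are hypothetical names introduced here.

```python
from typing import Callable, Dict, List

Block = List[str]  # a basic block, modeled as a list of instruction strings


class CodeCache:
    """Toy code cache: transforms each block once, then reuses the result."""

    def __init__(self, transform: Callable[[Block], Block]) -> None:
        self.transform = transform
        self.cache: Dict[int, Block] = {}      # block address -> modified block
        self.exec_counts: Dict[int, int] = {}  # simple execution profile

    def fetch(self, addr: int, original: Block) -> Block:
        # Profile: count how often each block is requested.
        self.exec_counts[addr] = self.exec_counts.get(addr, 0) + 1
        # Transform on first use only; later requests hit the cache.
        if addr not in self.cache:
            self.cache[addr] = self.transform(original)
        return self.cache[addr]


def add_nop_padding(block: Block) -> Block:
    """Hypothetical transformation: insert a nop after every instruction."""
    padded: Block = []
    for insn in block:
        padded.extend([insn, "nop"])
    return padded


# Made-up program: two blocks, with the first one executed twice.
program = {0x100: ["load", "add"], 0x200: ["store"]}
cc = CodeCache(add_nop_padding)
trace = [0x100, 0x200, 0x100]
executed = [cc.fetch(addr, program[addr]) for addr in trace]
```

The second request for block 0x100 returns the cached, already-modified copy, which is why the cost of transformation is paid once per block rather than once per execution.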
6 of 26
Dynamic Instrumentation Demo
• Pin
– Four architectures – IA32, EM64T, IPF, XScale
– Four OSes – Linux, FreeBSD, MacOS, Windows
– http://rogue.colorado.edu/pin/
7 of 26
Dynamic Optimization Demo
• DynamoRIO
– Windows and Linux for IA32
– http://www.cag.lcs.mit.edu/dynamorio/
8 of 26
Dynamic Binary Modification
• Creates a modified code image at run time
[Diagram labels: EXE, Transform, Code Cache, Execute, Profile]
Examples:
• Dynamo (HP)
• DAISY/BOA (IBM)
• CMS (Transmeta)
• Mojo (Microsoft)
• Strata (UVa)
• Pin (Intel)
• Always triggered by software events … until now
9 of 26
Tortola: Symbiotic Optimization
• Enable HW/SW Communication
[Diagram: SW Applications / Binary Modifier / HW stack]
10 of 26
Simulation Methodology
• SimpleScalar 4.0 for x86
• Wattch 1.02 power extensions
• Pin dynamic instrumentation system (x86/Linux version)
[Diagram: SW Application = benchmarks; Binary Modifier = Pin; HW = Wattch & SimpleScalar/x86]
11 of 26
Tortola Applications
• Combine global program information with runtime feedback
– System-specific power usage
– Application-specific heat anomalies
– Workload/input specific performance optimization
• Reduce hardware complexity
– No more backwards compatibility warts
– Fix bugs after shipment
– Reduce time to market for new architectures
• One such application: The di/dt problem
12 of 26
The Di/dt Problem
• Voltage stability is important for reliability and performance
• Low-power techniques have a negative side effect: current variation
• Dips (undershoots) in supply voltage can cause incorrect values to be calculated or stored
• Spikes (overshoots) in supply voltage can cause reliability problems
13 of 26
The Di/dt Problem
• ITRS cites noise management as a Grand
Challenge for 5-10 year time frame
• Several trends are aggravating the issue:
– Voltage is scaling down with technology
– Current draw is increasing
– Package impedance is not scaling as quickly
– Aggressive clock gating causes large swings in processor current draw (di/dt)
14 of 26
Di/dt Solutions
Software: Compiler optimizations
Software + MicroArch: Co-designed microarchitecture & SW binary modifier
MicroArch: Sensor/actuator mechanisms
Circuit-Level: Decoupling capacitors; more Vdd/Gnd pins on package
15 of 26
Sensor-Actuator Mechanisms
• On-chip voltage sensors detect abnormally
high/low voltage levels
• On-chip actuator then attempts to quickly
raise/lower the processor’s current draw
– Phantom firing
• increases current (at the expense of power)
– Resource throttling
• reduces current (at the expense of performance)
16 of 26
Detecting Imminent Emergencies
[Figure: supply-voltage scale marked at 1.05V, 1.03V, 1V, 0.97V, and 0.95V, with the labels Operating Voltage Range, Control Threshold, Soft Emergency, and Hard Emergency]
17 of 26
Targeting Mid-Frequency Di/dt
• Problematic: wide current spike
• Worst case: pulse at the resonant frequency
[Figure, from Joseph et al., HPCA-9: two panels labeled 20 cycles and 60 cycles, each plotting Processor Current (A) and Supply Voltage (V) against Time (Cycles), with minimum and maximum voltage levels marked]
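Why a pulse at the resonant frequency is the worst case can be illustrated with a toy second-order model of the power supply: a series R-L supply path feeding an on-die capacitance, with the processor modeled as a square-wave current sink. This is a sketch under stated assumptions, not the model used in the talk; all units are normalized, and the 40-cycle resonant period and quality factor of 10 are made-up values.

```python
import math


def voltage_swing(pulse_period: int, resonant_period: float = 40.0,
                  q_factor: float = 10.0, cycles: int = 4000,
                  substeps: int = 10) -> float:
    """Peak-to-peak supply-voltage deviation for a square-wave load current.

    Model (normalized, hypothetical units):
        L * di/dt = -v - R * i        (supply path)
        C * dv/dt = i - i_load        (on-die capacitance)
    where v is the deviation of the die voltage from nominal.
    """
    root_lc = resonant_period / (2.0 * math.pi)  # sqrt(L*C), in cycles
    ind = cap = root_lc                          # choose L = C, so sqrt(L/C) = 1
    res = 1.0 / q_factor                         # Q = sqrt(L/C) / R
    dt = 1.0 / substeps
    v = 0.0          # voltage deviation from nominal
    i_supply = 0.0   # current through the supply inductance
    v_min, v_max = float("inf"), float("-inf")
    for cycle in range(cycles):
        # Square wave: high current during the first half of each period.
        i_load = 1.0 if (cycle % pulse_period) < pulse_period / 2 else 0.0
        for _ in range(substeps):
            # Semi-implicit Euler: update the current first, then the voltage.
            i_supply += dt * (-v - res * i_supply) / ind
            v += dt * (i_supply - i_load) / cap
        if cycle >= cycles // 2:  # ignore the start-up transient
            v_min, v_max = min(v_min, v), max(v_max, v)
    return v_max - v_min


swing_at_resonance = voltage_swing(pulse_period=40)   # pulse period = resonant period
swing_off_resonance = voltage_swing(pulse_period=10)  # same pulse height, 4x faster
```

Both runs draw the same average current with the same pulse height; only the pulse period differs. In this model the swing is much larger when the pulse period matches the resonant period.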
18 of 26
A Di/dt Stressmark
[Annotations: Parallel = High Current; Sequential = Low Current]
BEGIN_LOOP:
    …
    ldt     $f1, ($4)
    divt    $f1, $f2, $f3
    divt    $f3, $f2, $f3
    stt     $f3, 8($4)
    ldq     $7, 8($4)
    cmovne  $31, $7, $3
    stq     $3, ($4)
    stq     $3, ($4)
    stq     $3, ($4)
    …
    stq     $3, ($4)
    …
    JUMP BEGIN_LOOP
But… the actuator engages on every loop iteration, degrading performance
Why not correct the problem in the code?
19 of 26
Proposed Solution
• Leverage our additional software layer to supplement
existing solutions
• Microarchitecture provides feedback to our software-based virtual layer
[Diagram: in SW, the Binary Modifier (VL) turns the Executable into an Altered Executable; in HW, the Microprocessor has Sensor+Actuator extensions]
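One way to picture the feedback path is as a callback from the hardware into the virtual layer. The sketch below is purely illustrative: the source leaves open what information the hardware should pass, so the choice of the offending program counter, the `VirtualLayer` class, the `on_emergency` hook, and the re-optimization threshold are all assumptions.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set


class VirtualLayer:
    """Toy software layer that reacts to hardware-reported voltage emergencies."""

    def __init__(self, reoptimize: Callable[[int], None], threshold: int = 3) -> None:
        self.reoptimize = reoptimize      # hook that would rewrite the code at a PC
        self.threshold = threshold        # emergencies tolerated before rewriting
        self.hits: DefaultDict[int, int] = defaultdict(int)
        self.fixed: Set[int] = set()

    def on_emergency(self, pc: int) -> None:
        """Hypothetically invoked by the sensor hardware with the offending PC."""
        if pc in self.fixed:
            return  # this code was already rewritten
        self.hits[pc] += 1
        if self.hits[pc] >= self.threshold:
            self.reoptimize(pc)
            self.fixed.add(pc)


# Made-up event stream: 0x100 is a recurring emergency site, 0x200 is a one-off.
rewritten: List[int] = []
vl = VirtualLayer(reoptimize=rewritten.append, threshold=3)
for pc in [0x100, 0x100, 0x200, 0x100, 0x100]:
    vl.on_emergency(pc)
```

The threshold keeps one-off emergencies from triggering a rewrite, so only recurring sites pay the modification cost.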
20 of 26
Required Investigations
• Characterizing emergencies
– How often do we see di/dt emergency loops?
• Communication between the microarchitecture and
the virtual layer
– What information should be passed to virtual layer during
an emergency?
• Fixing di/dt via binary modification
– Will existing techniques help?
– New algorithms?
21 of 26
Static vs. Dynamic Instances
Data suggests modifying a few code sequences will eliminate
many voltage emergencies
[Figure: bar chart of Emergencies (Distinct vs. Total) per benchmark, on a log scale from 1 to 1,000,000. Benchmarks: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, wupwise, swim, art, mgrid, applu, mesa, galgel, equake, facerec, sixtrack, ammp, apsi]
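The distinct-versus-total comparison amounts to counting how many static code locations account for the dynamic emergency events. A minimal sketch, assuming each emergency is logged with the program counter where it occurred; the log contents and the name `summarize_emergencies` are made up for illustration and are not data from the study.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize_emergencies(emergency_pcs: Iterable[int]) -> Tuple[int, int, Counter]:
    """Return (distinct sites, total events, per-site counts)."""
    counts = Counter(emergency_pcs)
    total = sum(counts.values())   # dynamic instances
    distinct = len(counts)         # static instances
    return distinct, total, counts


# Hypothetical log: one hot loop causes most of the emergencies.
log = [0x4004F0] * 500 + [0x400A10] * 40 + [0x401234]
distinct, total, counts = summarize_emergencies(log)
hottest_pc, hottest_count = counts.most_common(1)[0]
coverage = hottest_count / total  # share of events removed by fixing one site
```

When total is far larger than distinct, as in this made-up log, fixing the few hottest sites removes most of the events.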
22 of 26
Possible Compiler Optimizations
• Our goal is to
– Smooth out current profile, or
– Knock pulses off of the resonant frequency
• Some existing options
– Software pipelining, code motion, instruction padding
[Diagram: the Binary Modifier applies optimizations to the Executable, producing an Altered Executable; the Microprocessor has Sensor+Actuator extensions]
23 of 26
Loop Unrolling & SW Pipelining
Loop unrolling disrupts resonance pulse
Software pipelining smooths profile
[Figure: instruction blocks A and B and the resulting Current profiles for the problematic loop, the unrolled loop, and the software-pipelined loop (Iteration=1, 2, 3)]
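The effect of unrolling on the current pulse can be shown with a toy trace in which block A draws high current and block B draws low current. This is a sketch under assumptions: the two-level current model is invented for illustration, and the unrolled body is assumed to be rescheduled so that like blocks run back to back.

```python
from typing import List

HIGH, LOW = 1.0, 0.0  # toy current levels for blocks A and B


def current_trace(body: str, iterations: int) -> List[float]:
    """Map each block in the repeated loop body to a current level."""
    return [HIGH if op == "A" else LOW for op in body * iterations]


def pulse_period(trace: List[float]) -> int:
    """Smallest shift at which the trace repeats itself exactly."""
    n = len(trace)
    for shift in range(1, n):
        if all(trace[k] == trace[k + shift] for k in range(n - shift)):
            return shift
    return n


original = current_trace("AABB", iterations=8)
# Unrolled by two and regrouped (assumption): A A A A B B B B.
unrolled = current_trace("AAAABBBB", iterations=4)

period_before = pulse_period(original)
period_after = pulse_period(unrolled)
```

Both traces contain the same work, but the unrolled one repeats half as often. If the original period matched the supply's resonant period, the unrolled loop no longer does.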
24 of 26
Unrolling the Di/dt Stressmark
[Figure: supply voltage, on an axis from 0.97V to 1.02V, before loop unrolling (phases H, L) and after loop unrolling (phases H1, H2, L1, L2)]
25 of 26
Summary
• Symbiotic program optimization is a powerful approach
• The di/dt problem is well suited for a symbiotic solution
• The Tortola design can also target power reduction, temperature reduction, reliability, etc.
http://www.tortolaproject.com/
26 of 26