Audio Visual Hints - University of Virginia, Department of Computer
Power-Aware and
Temperature-Aware
Architecture
© 2006, Kevin Skadron
Kevin Skadron
LAVA/HotSpot Lab
Dept. of Computer Science
University of Virginia
Charlottesville, VA
[email protected]
“Cooking-Aware” Computing?
2
Thermal Packaging is Expensive
• Nvidia GeForce 5900 (Source: Tech-Report.com)
• Nvidia GeForce 7900 (Source: http://www.ixbt.com/video2/images/g71/7900gtx-front.jpg)
Source: Gordon Bell, "A Seymour Cray perspective," http://www.research.microsoft.com/users/gbell/craytalk/
3
“Moore’s Law” for Power
[Chart: Max Power (Watts, log scale 1-100) vs. process node (1.5µm down to 0.13µm) for Intel processors from i386 and i486 through Pentium, Pentium w/MMX tech., Pentium Pro, Pentium II, Pentium III, and Pentium 4. Source: Intel]
• Reasons: higher frequencies, more "stuff"
4
Leakage – A Growing Problem
Source: N. S. Kim et al., "Leakage Current: Moore's Law Meets Static Power," IEEE Computer, Dec. 2003.
• The fraction of leakage power is increasing exponentially
• Also exponentially dependent on temperature
• This is bad for designs with idle logic, e.g., multi-core processors, specialized functional units, lots of storage, etc.
Inter-Related Design Objectives
[Diagram: Vdd, Vth, frequency, and area jointly determine throughput/performance, dynamic power, and leakage power; leakage depends exponentially ("exp") on temperature and Vth, and everything feeds into cost and reliability.]
• Performance gains increasingly require gains in
  • Cooling efficiency
  • Power efficiency
6
ITRS Projections
ITRS 2005:

Year                           2003   2006   2010   2013   2016
Tech node (nm)                  100     70     45     32     22
Vdd (high perf) (V)             1.2    1.1    1.0    0.9    0.8
Vdd (low power) (V)             1.0    0.9    0.7    0.6    0.5   (2001 edition – was 0.4)
Frequency (high perf) (GHz)     3.0    6.8   15.1   23.0   39.7

Max power (W):
High-perf w/ heatsink           149    180    198    198    198   (2001 edition – was 288)
Cost-performance                 80     98    119    137    151
Hand-held                       2.1    3.0    3.0    3.0    3.0

• These are targets; it is doubtful that they are feasible
• Growth in power density means cooling costs continue to grow
7
Hitting the Power Wall
• Intel canceled the Pentium 4 microarchitecture in part due to power limits
  • Couldn't keep raising clock frequency
  • Non-ideal power scaling
  • Vdd scaling limited due to leakage (Vth)
• General-purpose CPU community shifting to replicating cores
  • Slow growth in frequency
  • Reduces growth in power density – but not total heat flux
  • Programming model an open question
  • In-order or out-of-order cores?
    • Our early results suggest OO is often superior
  • How many threads per core?
    • Sun, for example, puts 4 threads per core on its 8-core T2000 to hide memory latency
    • This comes at the expense of single-thread latency
8
Multi-Core Isn’t Enough
• High degrees of integration still max out heat removal
• Core type and core count must be selected to maximize power efficiency
• Simply replicating cores and then trying to scale Vdd and frequency will not work
9
Talk Outline
• Different philosophies of power-aware design
  • Energy-efficient vs. low-power vs. temperature-aware
• Power management techniques
  • Dynamic
  • Static
• Thermal issues
  • Factors to consider
  • DTM techniques
  • Architectural modeling
• Summary of important challenges
10
Metrics
• Power – design for power delivery
  • Average power, instantaneous power, peak power
• Energy – low-power design
  • Energy (MIPS/W)
• Power-aware/energy-efficient design
  • Energy-delay product (MIPS²/W)
  • Energy-delay² product (MIPS³/W) – voltage independent!
• Temperature – temperature-aware design
  • On-chip temperature: correlated with localized power density
  • Enclosure/rack/data-center cooling
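To make the metric distinctions concrete, here is a small sketch with hypothetical numbers. The second design point assumes the classic scaling argument that frequency tracks voltage, so halving V halves MIPS and cuts power by roughly 8x (P ∝ V²f ∝ V³); the ED² metric (MIPS³/W) then comes out the same for both points, which is what "voltage independent" means on this slide.

```python
def energy_metrics(mips: float, watts: float) -> dict:
    """The slide's three efficiency metrics.

    MIPS/W   ~ 1 / (energy per instruction)
    MIPS^2/W ~ 1 / (energy * delay)
    MIPS^3/W ~ 1 / (energy * delay^2), roughly voltage-independent
    """
    return {
        "MIPS/W": mips / watts,
        "MIPS^2/W": mips**2 / watts,
        "MIPS^3/W": mips**3 / watts,
    }

# Hypothetical design points: full voltage vs. half voltage
fast = energy_metrics(mips=1000, watts=100)    # V, f
slow = energy_metrics(mips=500, watts=12.5)    # V/2, f/2 -> ~P/8

# MIPS/W rewards the slow, low-voltage point...
assert slow["MIPS/W"] > fast["MIPS/W"]
# ...but MIPS^3/W cancels the voltage scaling entirely:
assert fast["MIPS^3/W"] == slow["MIPS^3/W"]
```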
11
Circuit Techniques
• Transistor sizing
• Dynamic vs. static logic
• Signal and clock gating
• Circuit restructuring
• Low-power caches, register files, queues

• These typically reduce the capacitance being switched
13
Clock Gating, Signal Gating
“Disabling a functional block when it is not required for an extended
period”
• Implementation
  • Simple gate that replaces one buffer in the clock tree
  • Signal gating is similar, helps avoid glitches
• Delay is generally not a concern except at fine granularities
• Choice of circuit design and clock gating style can have a dramatic effect on temperature distribution
[Diagram: a ctrl signal gates the clock/signal input to a functional unit]
14
Circuit Restructuring
• Parallelize (can reduce frequency)
• Pipeline (tolerate smaller, longer-latency circuitry)
• Reorder inputs so that most active input is closest
to output (reduces switched capacitance)
• Restructure gates (equivalent functions are not
equivalent in switched capacitance)
Example: Parallelizing (maintain throughput)

  Original: one logic block at Vdd, Freq = 1
    Throughput = 1, Power = 1, Area = 1, Power Density = 1
  Parallelized: two logic blocks at Vdd/2, Freq = 0.5
    Throughput = 1, Power = 0.25, Area = 2, Power Density = 0.125
Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
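The arithmetic behind Borkar's example can be sketched directly from P ∝ n·C·V²·f (relative to a one-block baseline); the function below just evaluates that scaling, reproducing the slide's numbers:

```python
def parallelize(n_blocks: float, v_scale: float, f_scale: float) -> dict:
    """Relative metrics vs. a baseline of 1 block at Vdd, f, using P ~ C*V^2*f."""
    throughput = n_blocks * f_scale           # each block does f_scale of the work
    power = n_blocks * v_scale**2 * f_scale   # dynamic power of all blocks combined
    area = n_blocks
    return {"throughput": throughput, "power": power,
            "area": area, "power_density": power / area}

# Two blocks at half voltage and half frequency (the slide's example):
m = parallelize(n_blocks=2, v_scale=0.5, f_scale=0.5)
assert m == {"throughput": 1.0, "power": 0.25, "area": 2, "power_density": 0.125}
```

The same throughput is delivered at a quarter of the power, at the cost of doubled area; power density drops by 8x.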
15
Cache Design
[Diagram: cache array with row decoder (R rows), column decoder, C columns, bitlines, wordlines, and sense amps; accompanying bar chart of read/write energy (10-80 pJ) broken down by decoder, wordlines, tag bitlines & sense amps (TBLSA), data bitlines & sense amps (DBLSA), buses, I/O, and other. Cache parameters: 16 KB cache, 0.25 µm. Villa et al, MICRO 2000]
• Caccess = R · C · Ccell
• Reducing power targets:
  • Switched capacitance
  • Voltage swing
  • Activity factor
  • Frequency
16
Cache Design
• Banked organization
  • Targets switched capacitance
  • Caccess = R · C · Ccell / B
• Dividing bit lines
  • Same effect for wordlines
• Reducing voltage swings
  • Sense amplifiers used to detect Vdiff across bitlines
  • Read operation can complete as soon as Vdiff is detected
  • Limiting voltage swing saves a fraction of power
• Pulse word lines
  • Enabling the word line only for the time needed to discharge the bitcell voltage
  • Designer needs to estimate access time and implement a pulse generator
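The banking effect on the slide's capacitance model can be sketched as a one-liner; the array dimensions and per-cell capacitance below are purely illustrative:

```python
def cache_access_cap(rows: int, cols: int, c_cell_fF: float, banks: int = 1) -> float:
    """Switched capacitance per access, C_access = R * C * C_cell / B.

    Banking splits the array into B independent banks, so only 1/B of the
    bitline/wordline capacitance is switched on each access.
    """
    return rows * cols * c_cell_fF / banks

# Hypothetical array: 256 rows x 512 columns, 1 fF per cell
base = cache_access_cap(rows=256, cols=512, c_cell_fF=1.0)
banked = cache_access_cap(rows=256, cols=512, c_cell_fF=1.0, banks=4)
assert banked == base / 4   # 4 banks -> quarter the switched capacitance
```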
17
Architectural-Level Techniques
Prevalent:
• Sleep modes
• Pipeline depth
• Energy-efficient front end
  • Branch prediction accuracy is a major determinant of pipeline activity → spending more power in the branch predictor can be worthwhile if it improves accuracy
• Integration (e.g., multiple cores)
• Multi-threading
• Dynamic voltage/frequency scaling

Growing or Imminent:
• Multi clock domain architectures (similar to GALS)
• Power islands
• Encoding/compression
  • Can reduce both switched capacitance and cross talk
• Application-specific hardware
  • Co-processors, functional units, etc.
• Compiler techniques
18
Optimal Pipeline Depth
• Increased power and diminishing returns vs. increased throughput
• Optimum: 5-10 stages, 15-30 FO4
[Charts: performance vs. pipeline stages for 4-wide issue and single issue. Srinivasan et al, MICRO-35; Hartstein and Puzak, ACM TACO, Dec. 2004]
19
Multi-threading
• Do more useful work per unit time
  • Amortize overhead and leakage
• Switch-on-event MT
  • Switch on cache misses, etc. (Ex: Sun T2000 "throughput computing")
  • Can even rotate among threads every instruction (Tera/Cray)
• Simultaneous Multithreading/HyperThreading
  • For superscalar – eliminate wasted slots
  • Intel Pentium 4, IBM POWER5, Alpha 21464
20
Architectural-Level Techniques
Prevalent:
• Sleep modes
• Pipeline depth
• Energy-efficient front end
  • Branch prediction accuracy is a major determinant of pipeline activity → spending more power in the branch predictor can be worthwhile if it improves accuracy
• Integration (e.g., multiple cores)
• Multi-threading
• Dynamic voltage/frequency scaling
  • Limits

Growing or Imminent:
• Multi clock domain architectures (similar to GALS)
• Power islands
• Encoding/compression
  • Can reduce both switched capacitance and cross talk
• Application-specific hardware
  • Co-processors, functional units, etc.
• Compiler techniques
21
Multi Clock Domain Architecture
• Multiple voltage/clock domains inside the processor
• Globally-asynchronous, locally synchronous (GALS) clock style
• Independent voltage/frequency scaling in each domain
• Synchronizers to ensure inter-domain communication
• Good for domains that are loosely coupled anyway
  • Integer/FP units in CPUs
  • Multiple cores
22
Multi Clock Domain Architecture
• Advantages
  • Local clock trees need not manage global skew
  • Each domain is limited only by its local critical path, allowing higher frequencies
  • Separate voltage regulators allow finer-grain energy control
  • Frequency/voltage of each domain can be tailored to its dynamic requirements
  • Clock power is reduced
• Drawbacks
  • Complexity and penalty of synchronizers
  • Feasibility of multiple voltage regulators
23
Simple Example of MCD in GPUs
• T is performance
• ED² and E are energy efficiency metrics
• All normalized to the default case with no MCD
• The higher the leakage, the more DVS pays off
24
Static Power Dissipation
• Static power: dissipation due to leakage
current
• Exponentially dependent on T, Vdd, Vth
• Most important sources of static power:
subthreshold leakage and gate leakage
• We will focus on subthreshold
• Gate leakage has essentially been ignored
  – New gate insulation materials may solve the problem
26
Thermal Runaway
• The leakage-temperature feedback can lead to a positive feedback loop
  • Temperature increases → leakage increases → temperature increases → leakage increases → …
[Image source: www.usswisconsin.org]
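The feedback loop above can be illustrated with a toy fixed-point iteration: leakage grows exponentially with temperature, and temperature rises linearly with total power through the package's thermal resistance. All constants here are hypothetical, chosen only to show that the same loop either settles or runs away depending on how strong the coupling is:

```python
import math

def settle_temperature(p_dyn=50.0, r_th=0.5, t_amb=45.0,
                       leak0=10.0, t0=45.0, t_coeff=0.02, steps=50):
    """Toy leakage-temperature feedback (all constants hypothetical).

    P_leak = leak0 * exp(t_coeff * (T - t0))   # leakage grows exp. with T
    T      = (P_dyn + P_leak) * R_th + T_amb   # steady-state package model
    Iterating either converges to a fixed point or diverges (thermal runaway).
    """
    t = t_amb
    for _ in range(steps):
        p_leak = leak0 * math.exp(t_coeff * (t - t0))
        t = (p_dyn + p_leak) * r_th + t_amb
        if t > 200.0:          # declare runaway past an arbitrary cutoff
            return None
    return t

stable = settle_temperature()              # settles near 80 C
runaway = settle_temperature(r_th=1.5)     # poorer cooling -> runaway (None)
```

With the better package the loop converges; tripling the thermal resistance makes each leakage increase raise temperature faster than the package can shed it, and the iteration diverges.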
27
A Smorgasbord
• Transistor sizing
• Multi Vth
• Dynamic threshold voltage – reverse body bias –
Transmeta Efficeon
• Transmeta uses runtime compilation and load monitoring to
select thresholds
• Stack effect
• Sleep transistors
• DVS
• Coarse or fine grained
• Low leakage caches, register files, queues
• Hurry up and wait
• Low leakage: maintain min possible V, f
• High leakage: use high V/f to finish work quickly, then go to
sleep
28
Leakage Control
[Diagram: three leakage-control circuit techniques]
• Body bias (Vbp/Vbn applied to the logic block): 2-10X reduction
• Stack effect (equal loading, +Ve/-Ve): 5-10X reduction
• Sleep transistor gating the logic block's Vdd: 2-1000X reduction
Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
29
Sleep Transistors
• Recent work suggests that a properly sized, low-Vth
footer transistor can preserve enough leakage to
keep the cell active (Li et al, PACT’02; Agarwal et al,
DAC’02)
• Great care must be taken when switching back to full voltage: noise can flip bits
• Extra latency may be necessary when re-activating
• Similar to principles in sub-threshold computing
  • Ex: sensor motes for wireless sensor networks
  • Concerns about susceptibility to SEU
30
Low-Leakage Caches
• Gated-Vdd/Vss (Powell et al, ISLPED’00; Kaxiras et al,
ISCA-28)
• Uses a sleep transistor on Vdd/ground for each cache line
• Typically considered non-state-preserving, but recent work (Agarwal et al, DAC'02) suggests that gated-Vss may preserve state
• Many algorithms for determining when to gate
  – May want to make the decay interval temperature-dependent
  • Simplest (Kaxiras et al, ISCA-28): two-bit access counter and decay interval
  • Workload-adaptive decay intervals – hard
• Drowsy cache (Flautner et al, ISCA-29)
  • Uses dual supply voltages: normal Vdd and a low Vdd close to the threshold voltage
  • State-preserving, but requires an extra cycle to wake up – two extra cycles if tags are decayed
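The simplest decay policy above can be sketched as per-line bookkeeping: a two-bit counter ticks on every global decay interval, resets on access, and gates the line's sleep transistor when it saturates. This is a simplified illustration of the idea, not the paper's exact mechanism:

```python
class DecayLine:
    """Cache-decay bookkeeping per line, in the spirit of Kaxiras et al,
    ISCA-28: a small saturating counter is bumped every global decay tick
    and reset on access; when it saturates, the line's sleep transistor is
    turned off (state is lost, so the next access is a miss)."""

    SATURATE = 3  # two-bit counter

    def __init__(self):
        self.counter = 0
        self.powered = True

    def access(self) -> bool:
        """Touch the line; returns True if this access misses because
        the line had already decayed."""
        miss = not self.powered
        self.counter = 0
        self.powered = True
        return miss

    def decay_tick(self):
        """Called once per global decay interval."""
        if self.powered:
            self.counter = min(self.counter + 1, self.SATURATE)
            if self.counter == self.SATURATE:
                self.powered = False  # gate Vdd/Vss for this line

line = DecayLine()
for _ in range(3):
    line.decay_tick()
assert not line.powered       # idle line decayed away
assert line.access() is True  # decay-induced miss on the next access
```

The energy tradeoff is exactly the one on the slide: a short decay interval saves more leakage but turns more accesses into decay-induced misses, which is why a temperature-dependent (or workload-adaptive) interval is attractive.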
31
Thermal Issues - Outline
• Arguments for dynamic thermal management
  • Factors to consider, such as reliability
• Brief discussion of DTM techniques
• Architectural modeling
• Sensing
34
Worst-Case Leads to Over-Design
• Average case temperature lower than worst-case
• Aggressive clock gating
• Application variations
• Underutilized resources, e.g. FP units during integer code,
vertex units during fill-bound region
• Currently a 20-40% difference
[Chart: reducing the target power density below worst-case TDP reduces cooling cost. Source: Gunther et al, ITJ 2001]
35
Temporal, Spatial Variations
[Figures: temperature variation of SPEC applu over time; thermal map showing that localized hot spots dictate the cooling solution]
36
Application Variations
• Wide variation across applications
• Architectural and technology trends are making it worse, e.g., simultaneous multithreading (SMT)
[Charts: peak temperature (370-420 K) per SPEC benchmark (gzip, mcf, swim, mgrid, applu, eon, mesa) for single-threaded (ST) vs. SMT execution]
37
Temperature-Aware Design
• Worst-case design is wasteful
• Power management is not sufficient for chip-level thermal management
  • Must target blocks with high power density
  • When they are hot
  • Spreading heat helps
    – Even if energy is not affected
    – Even if average temperature goes up
  • This also helps reduce leakage
38
Dynamic Thermal Management
[Chart: temperature over time. Designing the package for the cooling capacity needed without DTM costs more; with DTM, a cheaper package suffices and the response engages only when temperature crosses the DTM trigger level. The gap between the two cooling capacities is the system cost savings. Source: David Brooks 2002]
39
DTM
• Worst case design for the external cooling
solution is wasteful
• Yet safe temperatures must be maintained when
worst case happens
• Thermal monitors allow a tradeoff between cost and performance
  • Cheaper package – more triggers, less performance
  • Expensive package – no triggers, full performance
40
Role of Architecture?
Dynamic thermal management (DTM)
• Automatic hardware response when temp. exceeds cooling
• Cut power density at runtime, on demand
• Trade reduced costs for occasional performance loss
• Architecture is a natural granularity for thermal management
  • Activity and temperature are correlated within architectural units
  • DTM response can target the hottest unit: permits a fine-tuned response compared to the OS or package
• Modern architectures offer rich opportunities for remapping computation
  – e.g., CMPs/SoCs, graphics processors, tiled architectures
  – e.g., register file
• DTM will intermittently affect performance
41
© 2006, Kevin Skadron
Existing DTM Implementations
• Intel Pentium 4: global clock gating with shut-down fail-safe
• Intel Pentium M: dynamic voltage scaling
• Transmeta Crusoe: dynamic voltage scaling
• IBM POWER5: probably fetch gating
• ACPI: OS-configurable combination of passive & active cooling
• These solutions sacrifice time (slower or stalled execution) to reduce power density
• Better: a solution in "space"
  • Tradeoff between exacerbating leakage (more idle logic) or reducing leakage (lower temperatures)
42
© 2006, Kevin Skadron
Alternative: Migrating Computation
This is only a
simplistic
illustrative example
43
Space vs. Time
• Moving the hotspot, rather than throttling it, reduces performance overhead by almost 60%
[Chart: slowdown factor for the "time" techniques DVS (1.359), FG (1.270), and Hyb (1.231) vs. the "space" technique MC (1.112)]
The greater the replication and spread, the greater the opportunities
44
Granularity of DTM
• Subunit (single queue entry, register, etc.)
  • Lots of replication, low migration cost, but not spread out
• Structure (queue, register file, ALU, etc.)
  • Yuck: copy stalls required, hard to avoid throttling
• Core
  • Lots of replication, good spread, but high migration cost, and local hotspots remain
  – But, if threads are short, scheduling can achieve thermal load balancing without migration
• The greater the replication and spread, the greater the opportunities
• The shorter the threads, the more flexibility
45
Thermal Consequences
© 2006, Kevin Skadron
Temperature affects:
• Circuit performance
• Circuit power (leakage)
• IC reliability
• IC and system packaging cost
• Environment
46
Performance and Leakage
Temperature affects:
• Transistor threshold and mobility
• Subthreshold leakage, gate leakage
• Ion, Ioff, Igate, delay
• ITRS: 85°C for high-performance, 110°C for embedded!
[Figure: NMOS Ion and Ioff vs. temperature]
47
Reliability
The Arrhenius Equation: MTF = A·exp(Ea / (k·T))

MTF: mean time to failure at temperature T
A: empirical constant
Ea: activation energy
k: Boltzmann's constant
T: absolute temperature
Failure mechanisms:
• Electromigration
• Dielectric breakdown
• Mechanical stress
• Negative bias temperature instability (NBTI)
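Since the empirical constant A cancels when comparing two temperatures, the Arrhenius equation directly gives the lifetime ratio between operating points. A short sketch (the 0.7 eV activation energy is an assumed value, typical of what is quoted for electromigration):

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann's constant in eV/K

def mtf_ratio(t1_c: float, t2_c: float, ea_ev: float = 0.7) -> float:
    """MTF(t1) / MTF(t2) from the Arrhenius equation; A cancels in the ratio.

    Temperatures are in Celsius; ea_ev is an assumed activation energy.
    """
    t1_k, t2_k = t1_c + 273.15, t2_c + 273.15
    return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t1_k - 1.0 / t2_k))

# Running 10 C cooler (75 C vs. 85 C) roughly doubles the expected lifetime:
ratio = mtf_ratio(75.0, 85.0)
```

This steep sensitivity near operating temperatures is why even a few degrees of DTM headroom translate into meaningful reliability margin.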
48
Reliability as f(T)
• Reliability criteria (e.g., DTM thresholds) are typically based on worst-case assumptions
• But actual behavior is often not worst case
• So aging occurs more slowly
• This means the DTM design is over-engineered!
• We can exploit this, e.g., for DTM or frequency
[Figure: "bank" reliability credit during cool periods, "spend" it during hot ones]
49
Reliability-Aware DTM
[Chart: average slowdown (0.00-0.16) of DTM_controller vs. DTM_reliability across package/cooling configurations (labels partly illegible: Base_Config..., High_Convection..., Thick_..., Spread_Mate...); the reliability-aware controller consistently reduces slowdown]
50
Thermal Issues - Outline
• Arguments for dynamic thermal management
  • Factors to consider, such as reliability
• Brief discussion of DTM techniques
• Architectural modeling
• Sensing
51
Heat Mechanisms
• Conduction is the main mechanism in a single chip
  • Conduction is proportional to the temperature difference and surface area
• Convection is the main mechanism in racks, data centers, etc.
52
Simplistic steady-state model
All thermal transfer: Rth = t / (k·A)
Power density matters!

Ohm's law for thermals (steady-state):
V = I · R  →  T = P · R
T_hot = P · Rth + T_amb

Ways to reduce T_hot:
- reduce P (power-aware)
- reduce Rth (packaging, spread heat)
- reduce T_amb (Alaska?)
- maybe also take advantage of transients (Cth)
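The steady-state formula on this slide is a one-line calculation. The sketch below plugs in hypothetical numbers for a 1-D conduction path through a silicon die (thermal conductivity of Si is roughly 148 W/m·K):

```python
def steady_state_temp(power_w: float, t_amb_c: float,
                      thickness_m: float, k_w_mk: float, area_m2: float) -> float:
    """T_hot = P * Rth + T_amb, with the slide's 1-D model Rth = t / (k * A)."""
    r_th = thickness_m / (k_w_mk * area_m2)   # K/W
    return power_w * r_th + t_amb_c

# Hypothetical example: 100 W conducted through 0.5 mm of silicon
# (k ~ 148 W/m-K) over a 1 cm^2 area, 45 C on the cool side.
t_hot = steady_state_temp(100.0, 45.0, 0.5e-3, 148.0, 1e-4)   # ~48.4 C
```

Note how small the die's own temperature drop is; most of the overall Rth (and hence T_hot) in a real system comes from the interface material, spreader, and heat sink.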
53
Simplistic dynamic thermal model
© 2006, Kevin Skadron
Electrical-thermal duality:
V ↔ temperature (T)
I ↔ power (P)
R ↔ thermal resistance (Rth)
C ↔ thermal capacitance (Cth)
RC ↔ time constant

KCL differential eq.:  I = C · dV/dt + V/R
difference eq.:        ΔV = (I/C) · Δt − (V/(R·C)) · Δt
thermal domain:        ΔT = (P/Cth) · Δt − (T/(Rth·Cth)) · Δt
                       (T = T_hot − T_amb)

One can compute stepwise changes in temperature for any granularity at which one can get P, T, R, C
54
Thermal resistance
© 2006, Kevin Skadron
• Θ = r·t / A = t / (k·A)
  (r = thermal resistivity = 1/k, t = thickness, A = cross-sectional area, k = thermal conductivity)
55
Thermal capacitance
© 2006, Kevin Skadron
• Cth = V · Cp · ρ
ρ(aluminum) = 2,710 kg/m³
Cp(aluminum) = 875 J/(kg·°C)
V = t · A = 0.000025 m³
Cbulk = V · Cp · ρ = 59.28 J/°C
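The slide's aluminum example checks out numerically:

```python
def thermal_capacitance(volume_m3: float, cp_j_per_kg_c: float,
                        density_kg_m3: float) -> float:
    """Cth = V * Cp * rho, the slide's bulk thermal-capacitance formula."""
    return volume_m3 * cp_j_per_kg_c * density_kg_m3

# Aluminum block from the slide: V = 25 cm^3, Cp = 875 J/(kg-C), rho = 2710 kg/m^3
c_bulk = thermal_capacitance(0.000025, 875.0, 2710.0)   # ~59.28 J/C
```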
56
Thermal issues summary
• Temperature affects performance, power, and reliability
• Architecture-level: conduction only
  • Very crude approximation of convection as an equivalent resistance
  • Convection: too complicated – need CFD!
  • Radiation: can be ignored
• Use compact models for the package
• Power density is key
• Temporal and spatial variation are key
• Hot spots drive thermal design
57
Thermal modeling
• Want a fine-grained, dynamic model of temperature
  • At a granularity architects can reason about
  • That accounts for adjacency and package
  • That does not require detailed designs
  • That is fast enough for practical use
• HotSpot – a compact model based on thermal R, C
  • Parameterized to automatically derive a model based on various
    – Architectures
    – Power models
    – Floorplans
    – Thermal packages
58
Dynamic compact thermal model
Electrical-thermal duality:
V ↔ temperature (T)
I ↔ power (P)
R ↔ thermal resistance (Rth)
C ↔ thermal capacitance (Cth)
RC ↔ time constant (Rth·Cth)

Kirchhoff Current Law:
differential eq.:  I = C · dV/dt + V/R
thermal domain:    P = Cth · dT/dt + T/Rth,  where T = T_hot − T_amb

At higher granularities of P, Rth, Cth:
P, T are vectors and Rth, Cth are circuit matrices
59
Example System
[Figure: example system cross-section – die with interface material, heat spreader, and heat sink on top; IC package and pins on the PCB below]
60
Surface-to-surface contacts
© 2006, Kevin Skadron
• Not negligible; heat crowding
• Thermal greases/epoxy (can "pump out")
• Phase-change films (undergo a transition from solid to semi-solid with the application of heat)
Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
61
© 2006, Kevin Skadron
Our Model (lateral and vertical)
Interface material
(not shown)
62
HotSpot
• Time evolution of temperature is driven by unit activities and power dissipations averaged over 10K cycles
• Power dissipations can come from any power simulator, and act as "current sources" in the RC circuit (the 'P' vector in the equations)
• Simulation overhead in Wattch/SimpleScalar: < 1%
• Requires models of
  • Floorplan: important for adjacency
  • Package: important for spreading and time constants
• R and C matrices are derived from the above
63
Validation
• Validated and calibrated using FEM simulations, FPGA measurements, and MICRED test chips
  • 9x9 array of power dissipators and sensors
  • Compared to HotSpot configured with the same grid and package
  • Within 7% for both steady-state and transient step-response
• Interface material (chip/spreader) matters
64
Sensors
Caveat emptor:
We are not well-versed on sensor design;
the following is a digest of information we
have been able to collect from industry
sources and the research literature.
65
Desirable Sensor Characteristics
• Small area
• Low power
• High accuracy + linearity
• Easy access and low access time
• Fast response time (slew rate)
• Easy calibration
• Low sensitivity to process and supply noise
66
Types of Sensors
(In approx. order of increasing ease to build)
• Thermocouples – voltage output
  • Junction between wires of different materials; voltage at terminals is ∝ Tref – Tjunction
  • Often used for external measurements
• Thermal diodes – voltage output
  • Biased p-n junction; voltage drop for a known current is temperature-dependent
• Biased resistors (thermistors) – voltage output
  • Voltage drop for a known current is temperature-dependent
  – You can also think of this as varying R
  • Example: 1 kΩ metal "snake"
• BiCMOS, CMOS – voltage or current output
  • Rely on a reference voltage or current generated from a reference band-gap circuit; current-based designs often depend on temp-dependence of threshold
• 4T RAM cell – decay time is temp-dependent
  • [Kaxiras et al, ISLPED'04]
67
Sensors: Problem Issues
• Poor control of CMOS transistor
parameters
• Noisy environment
• Cross talk
• Ground noise
• Power supply noise
© 2006, Kevin Skadron
• These can be reduced by making the
sensor larger
• This increases power dissipation
• But we may want many sensors
68
“Reasonable” Values
• Based on conversations with engineers at
Sun, Intel, and HP (Alpha)
• Linearity: not a problem for range of
temperatures of interest
• Slew rate: < 1 μs
• This is the time it takes for the physical sensing
process (e.g., current) to reach equilibrium
• Sensor bandwidth: << 1 MHz, probably 100-200 kHz
  • This is the sampling rate; 100 kHz = 10 μs
  • Limited by slew rate but also A/D
    – Consider digitization using a counter
69
“Reasonable” Values: Precision
• Mid 1980s: < 0.1° was possible
• Precision
  • ± 3° is very reasonable
  • ± 2° is reasonable
  • ± 1° is feasible but expensive
  • < ± 1° is really hard
• Power: 10s of mW
• The limited precision of the G3 sensor seems to have been a design choice involving the digitization
70
Calibration
• Accuracy vs. precision
  • Analogous to mean vs. stdev
• Calibration deals with accuracy
  • The main issue is to reduce inter-die variations in offset
• Typically requires per-part testing and configuration
• Basic idea: measure the offset, store it, then subtract it from dynamic measurements
71
Dynamic Offset Cancellation
• Rich area of research
• Build a circuit to continuously, dynamically detect the offset and cancel it
• Typically uses an op-amp
• Has the advantage that it adapts to changing offsets
• Has the disadvantage of more complex circuitry
72
Role of Precision
• Suppose:
  • Junction temperature is J
  • Max variation in sensor is S, offset is O
  • Thermal emergency is T
• T = J – S – O
• Spatial gradients
  • If sensors cannot be located exactly at hotspots, measured temperature may be G° lower than the true hotspot
• T = J – S – O – G
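The budget on this slide is simple arithmetic, but it is worth seeing how quickly the margins stack up; the numbers below are hypothetical:

```python
def dtm_trigger(j_max: float, sensor_var: float,
                offset: float, gradient: float = 0.0) -> float:
    """DTM trigger threshold T = J - S - O - G: every degree of sensor
    imprecision, residual offset, or sensor-placement gradient must be
    subtracted from the true junction limit J."""
    return j_max - sensor_var - offset - gradient

# Hypothetical 85 C junction limit, +/-2 C sensor precision, 1 C residual
# offset after calibration, 3 C hotspot-to-sensor gradient:
trigger = dtm_trigger(85.0, 2.0, 1.0, 3.0)   # -> 79.0 C
```

Six degrees of margin lost to sensing alone; this is why the next slides argue that sensor precision cannot be ignored.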
73
Rate of Change of Temperature
• Our FEM simulations suggest a maximum of 0.1° in about 25-100 μs
• This is for power density < 1 W/mm², die thickness between 0.2 and 0.7 mm, and contemporary packaging
• This means slew rate is not an issue
• But sampling rate is!
74
Sensors Summary
• Sensor precision cannot be ignored
  • Reducing the operating threshold by 1-2 degrees will affect performance
• Precision of 1° is conceivable but expensive
  • Maybe reasonable for a single sensor or a few
• Precision of 2-3° is reasonable even for a moderate number of sensors
• Power and area are probably negligible from the architecture standpoint
• Sampling period <= 10-20 μs
75
Massive Multi-Core Design Space
• # cores
• Pipeline depth
• Pipeline width
• In-order vs. out-of-order
• Cache per core
• Core-to-core interconnect fabric
• All dependent on temperature constraints!
77
Wither Core Type?
© 2006, Kevin Skadron
vs.
Source: Christopher Reeve Homepage, http://www.chrisreevehomepage.com/
Hot spot?
Cores may also be heterogeneous, with a few powerful cores
and very many small cores
78
Impact of Thermal Constraints
Thermal limits change the optimal pipeline width as core count increases
[Chart: BIPS (0-12) vs. core count (2-20) for 2-wide (2MB/18FO4/2) and 4-wide (2MB/18FO4/4) configurations]
79
Impact of Thermal Constraints
Pipeline depth, which is often fixed early in the design, can impact multi-core performance dramatically. Thermal limits favor shallower pipelines.
[Charts: BIPS vs. core count (2-20) for 2MB/4MB caches and 12/18/24/30 FO4 pipeline depths, 4-wide issue; left panel without thermal constraints (BIPS up to ~45), right panel with thermal constraints (BIPS up to ~12)]
80
Workload Sensitivity
CPU- and memory-bound applications desire different resources
26-53% performance loss if you switch the best configurations!
[Charts: BIPS vs. core count (2-20) on a 400 mm² die with a cheap thermal package; left panel CPU-bound, right panel memory-bound, for configurations varying cache (2-16 MB), pipeline depth (12-30 FO4), and issue width (2/4)]
81
Summary
• Reviewed current techniques for managing dynamic power, leakage power, and temperature
  • A major obstacle with architectural techniques is the difficulty of predicting performance impact
  • Spread heat in space, not time
• Continuing integration makes power and thermal constraints even more important
• Optimal multi-core design is dependent on thermal considerations
• Security challenges
82
More Info
http://www.cs.virginia.edu/~skadron
LAVA Lab
...or email me: [email protected]
84
Backup Slides
© 2006, Kevin Skadron
•
These slides are an assortment that
wouldn’t fit in the talk but I kept to answer
questions or provide more info
85
Hot Chips are No Longer Cool!
[Chart: power density (Watts/cm², log scale 1-1000) vs. process node (1.5µm down to 0.07µm). Intel processors from i386 and i486 through Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4 trend upward past a hot plate (~10 W/cm², today's laptops) toward nuclear-reactor and rocket-nozzle power densities; SIA projection shown.]
* "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies" – Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.
86
© 2006, Kevin Skadron
ITRS quotes – thermal challenges
•
For small dies with high pad count, high power
density, or high frequency, “operating
temperature, etc for these devices exceed the
capabilities of current assembly and packaging
technology.”
•
“Thermal envelopes imposed by affordable
packaging discourage very deep pipelining.”
• Intel recently canceled its NetBurst
microarchitecture
– Press reports suggest thermal envelopes were
a factor
87
Dynamic Power Consumption
• Power dissipated due to switching activity
• A capacitance CL is charged and discharged at frequency f
  Ec = ½·CL·V²  (energy stored on charge)
  Ed = ½·CL·V²  (energy dissipated on discharge)
• With activity factor a:
  P = a · CL · V² · f
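The formula plugs in directly; the chip-level parameters below are hypothetical but of a plausible magnitude:

```python
def dynamic_power(activity: float, c_load_f: float,
                  vdd: float, freq_hz: float) -> float:
    """Dynamic switching power, P = a * C_L * V^2 * f."""
    return activity * c_load_f * vdd**2 * freq_hz

# Hypothetical chip: 10 nF effective switched capacitance, 1.2 V supply,
# 3 GHz clock, 15% average activity factor.
p = dynamic_power(activity=0.15, c_load_f=10e-9, vdd=1.2, freq_hz=3e9)  # watts
```

The quadratic dependence on V is why voltage scaling dominates all other dynamic-power knobs: dropping Vdd by 20% cuts this term by 36% before any frequency reduction is even counted.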
88
Transistor Sizing
• Transistor sizing plays an important role in reducing power
• Buffer chain C0 → C1 → … → CN-1 → CN with stage ratio K = Ci/Ci-1
• Delay ~ a (k / ln K)
• Power ~ K / (K-1)
• An optimum K for both power and delay must be pursued
89
Signal Gating
“techniques to mask unwanted switching activities from propagating
forward, causing unnecessary power dissipation”
• Implementation
  • Simple gate
  • Tristate buffer
  • ...
[Diagram: a ctrl signal gates the signal path to the output]
• Control signal needed
  • Generation requires additional logic
• Especially helps to prevent power dissipation due to glitches
90
Different Implementation and Corresponding
Clock Gating Choices
[Figures: latch-mux design and SRAM design, with the corresponding clock-gating insertion points for each]
91
DVS “Critical Power Slope”
• It may be more efficient not to use DVS, and
to run at the highest possible frequency, then
go into a sleep mode!
• Depends on power dissipation in sleep mode vs.
power dissipation at lowest voltage
© 2006, Kevin Skadron
• This has been formalized as the critical
power slope (Miyoshi et al, ICS’02):
• mcritical = (Pfmin – Pidle) / fmin
• If the actual slope m = (Pf - Pfmin) / (f – fmin) < mcritical
then it is more energy efficient to run at the
highest frequency, then go to sleep
• Switching overheads must be taken into account
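The critical-power-slope test is a direct comparison of two slopes; a sketch (the power/frequency points are hypothetical and ignore the switching overheads the slide warns about):

```python
def prefer_race_to_sleep(p_f: float, p_fmin: float, p_idle: float,
                         f: float, f_min: float) -> bool:
    """Critical power slope test (Miyoshi et al, ICS'02).

    m_critical = (P_fmin - P_idle) / f_min
    m_actual   = (P_f - P_fmin) / (f - f_min)
    If m_actual < m_critical, it is more energy efficient to run at the
    highest frequency and then sleep than to scale down with DVS.
    """
    m_critical = (p_fmin - p_idle) / f_min
    m_actual = (p_f - p_fmin) / (f - f_min)
    return m_actual < m_critical

# High-leakage part (hypothetical): idling is cheap but running at low V
# is still expensive, so race-to-sleep wins:
assert prefer_race_to_sleep(p_f=30.0, p_fmin=20.0, p_idle=0.5, f=2.0, f_min=1.0)
# Low-leakage part with a costly idle state: DVS still wins:
assert not prefer_race_to_sleep(p_f=30.0, p_fmin=8.0, p_idle=7.0, f=2.0, f_min=1.0)
```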
92
Application-Specific Hardware
• Specialized logic is usually much lower power
• Co-processors
  • Ex: TCP/IP offload, codecs, etc.
• Functional units
  • Ex: Intel SSE, specialized arithmetic (e.g., graphics), etc.
  • Ex: Custom instructions in configurable cores (e.g., Tensilica)
• Specific example: Zoran ER4525 – cell phone
  • ARM microcontroller, no DSP!
  • Video capture & pre/post processing
  • Video codec
  • 2D/3D rendering
  • Video display
  • Security
93
Gate Leakage
• Not clear if new oxide materials will arrive in time
• Any technique that reduces Vdd helps
• Otherwise it seems difficult to develop architecture
techniques that directly attack gate leakage
• In fact, very little work has been done in this area
• One example: domino gates (Hamzaoglu & Stan,
ISLPED’02)
© 2006, Kevin Skadron
• Replace traditional NMOS pull-down network with a PMOS
pull-up network
• Gate leakage is greater in NMOS than PMOS
• But PMOS domino gate is slower
• Note: Gate oxide so thin - especially prone to
manufacturing variations
94
Static Power - Modeling
• Modeling Leakage
  • Butts and Sohi (MICRO-33)
    – Pstatic = Vcc · N · kdesign · Îleak
    – Îleak determined by circuit simulation, kdesign empirically
    – Key contribution: separate technology from design
  • HotLeakage (UVA TR CS-2003-05, DATE'04)
    – Extension of the Butts & Sohi approach: scalable with Vdd, Vth, Temp, and technology node; adds gate leakage
    – Îleak determined by the BSIM3 subthreshold equation and BSIM4 gate-leakage equations, giving an analytical expression that accounts for dependence on factors that may change at runtime, namely Vdd, Vth, and Temp
    – kdesign replaced by separate factors for N- and P-type transistors
    – kdesign also exponentially dependent on Vdd and Tox, linearly dependent on Temp
    – Currently integrated with SimpleScalar/Wattch for caches
95
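The Butts & Sohi form, with a HotLeakage-style temperature-dependent leakage current, can be sketched as follows. The constants (i0, the subthreshold slope factor n, and all example values) are illustrative assumptions, not fitted parameters from either paper.

```python
import math

def static_power(vdd, n_transistors, k_design, i_leak_per_device):
    """Butts & Sohi (MICRO-33) form: Pstatic = Vdd * N * kdesign * Ileak."""
    return vdd * n_transistors * k_design * i_leak_per_device

def subthreshold_leak(i0, vth, temp_k, n=1.5):
    """Rough BSIM-style subthreshold current: exponential in -Vth/(n*vT),
    where vT = kT/q is the thermal voltage. i0 and n are fitted constants
    (values here are made up)."""
    v_t = 8.617e-5 * temp_k  # thermal voltage kT/q in volts (k/q = 8.617e-5 V/K)
    return i0 * math.exp(-vth / (n * v_t))

# Leakage grows quickly with temperature because vT rises with T
cool = subthreshold_leak(i0=1e-6, vth=0.3, temp_k=320.0)
hot = subthreshold_leak(i0=1e-6, vth=0.3, temp_k=380.0)
print(hot / cool)  # roughly 3x more leakage for a 60 K rise
```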
Static Power – Modeling
• Modeling Leakage (cont.)
  • Su et al., IBM (ISLPED’03)
    – Similar approach to HotLeakage, but they observe that modeling the change in leakage allows linearization of the equations
  • Many other papers model various aspects of leakage
    – Most focus on subthreshold leakage
    – Few suggest how to model leakage in microarchitecture simulations
96
Performance Comparison
• TT-DFS is best but can’t prevent excess temperature
  • Suitable for use with aggressive clock rates at low temperatures
• The hybrid technique reduces DTM cost by 25% vs. DVS (DVS overhead important)
• A substantial portion of MC’s benefit comes from the altered floorplan, which separates hot units

[Bar chart: slowdown factor for TT-DFS, DVS, FG, Hyb, and MC; values range from 1.045 to 1.359]
© 2006, Kevin Skadron
97
EM Model
The unit fails at the time t_failure when the accumulated life consumption reaches a constant threshold τ_th:

∫₀^(t_failure) R(t) dt = τ_th, τ_th const

Life Consumption Rate (Arrhenius): R(t) ∝ e^(−Ea / (k·T(t)))

Apply in a “lumped” fashion at the granularity of microarchitecture units, just like RAMP [Srinivasan et al.]
98
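The lumped lifetime bookkeeping above can be sketched numerically; this is a hedged illustration, with an assumed activation energy and made-up temperature traces rather than RAMP's calibrated parameters.

```python
import math

K_BOLTZ = 8.617e-5  # Boltzmann constant in eV/K

def life_consumption_rate(temp_k, ea_ev=0.9):
    """Arrhenius life-consumption rate R(T) ~ exp(-Ea / kT).
    ea_ev is an assumed activation energy for electromigration."""
    return math.exp(-ea_ev / (K_BOLTZ * temp_k))

def consumed_life(temp_trace, dt):
    """Integrate R(T(t)) over a per-unit temperature trace, as in the
    lumped model above: the unit fails when this integral reaches the
    (technology-dependent) threshold tau_th."""
    return sum(life_consumption_rate(t) for t in temp_trace) * dt

# A unit that runs 10 K hotter consumes its lifetime faster
base = consumed_life([350.0] * 100, dt=1.0)
hot = consumed_life([360.0] * 100, dt=1.0)
print(hot / base)  # > 1
```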
Carnot efficiency
• Note that in all cases, heat transfer is
proportional to ΔT
• This is also one of the reasons energy
“harvesting” in computers is probably not
cost-effective
• ΔT w.r.t. ambient is << 100°
• For example, with a 25 W processor, the thermoelectric effect yields only ~50 mW
  • Solbrekken et al., ITHERM’04
• This is also why Peltier coolers are not energy efficient
  • ~10% efficiency, vs. ~30% for a refrigerator
99
Thermal Modeling
• Want a fine-grained, dynamic model of temperature
  • At a granularity architects can reason about
  • That accounts for adjacency and package
  • That does not require detailed designs
  • That is fast enough for practical use
• HotSpot – a compact model based on thermal R, C
  • Parameterized to automatically derive a model based on various…
    – Architectures
    – Power models
    – Floorplans
    – Thermal packages
© 2006, Kevin Skadron
100
Temperature equations
• Fundamental RC differential equation
  • P = C · dT/dt + T / R
• Steady state
  • dT/dt = 0
  • P = T / R
• When R and C are network matrices
  • Steady state: T = R × P
  • Modified transient equation: dT/dt + (RC)⁻¹ × T = C⁻¹ × P
• HotSpot software mainly solves these two equations
101
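For intuition, the scalar (single lumped node) form of these equations can be solved directly. This is a sketch, not HotSpot code: HotSpot works with full R and C network matrices, and the package values below (0.5 K/W, 50 W, 45 °C ambient) are made up.

```python
def steady_state_temp(power_w, r_th, t_ambient):
    """Steady state of P = C*dT/dt + T/R with dT/dt = 0:
    the temperature rise above ambient is simply P * R."""
    return t_ambient + power_w * r_th

def transient_step(t_now, power_w, r_th, c_th, dt, t_ambient):
    """One explicit Euler step of C*dT/dt = P - (T - Tamb)/R,
    with the power source acting as a 'current source' into the node."""
    dTdt = (power_w - (t_now - t_ambient) / r_th) / c_th
    return t_now + dt * dTdt

# Hypothetical package: 0.5 K/W to ambient, 50 W chip, 45 C ambient
print(steady_state_temp(50.0, 0.5, 45.0))  # 70.0
```

Iterating `transient_step` from ambient converges to the same 70 °C steady state, with a time constant of R·C.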
Our Model (Lateral and Vertical)
[Figure: lateral and vertical thermal RC network, derived from material and geometric properties; interface material not shown]
102
Transient solution
• Solves differential equations of the form dT/dt + A·T = B, where A and B are constants
  • In HotSpot, A is constant (RC) but B depends on the power dissipation
• Solution: assume constant average power dissipation within an interval (10 K cycles) and call RK4 at the end of each interval
• In RK4, the current temperature (at t) is advanced in very small steps (t+h, t+2h, ...) until the next interval (10 K cycles)
• RK “4” because the error term is 4th order, i.e., O(h^4)
103
Transient solution contd...
• The 4th-order error has to be within the required precision
• The step size (h) has to be small enough even for the maximum slope of the temperature evolution curve
• The transient solution of the differential equation is of the form A·e^(−Bt), where A and B depend on the RC network
• Thus, the maximum value of the slope (A×B) and the step size are computed accordingly
104
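The interval-based RK4 stepping described above can be sketched for a single lumped node with dT/dt = P/C − T/(RC). This is an illustrative simplification, not HotSpot's solver: HotSpot integrates the full matrix system and chooses h adaptively from the maximum slope, while here h and all values are fixed and made up.

```python
def rk4_step(temp, t, h, deriv):
    """One classical 4th-order Runge-Kutta step for dT/dt = deriv(t, T);
    local error is O(h^5), global error O(h^4)."""
    k1 = deriv(t, temp)
    k2 = deriv(t + h / 2, temp + h / 2 * k1)
    k3 = deriv(t + h / 2, temp + h / 2 * k2)
    k4 = deriv(t + h, temp + h * k3)
    return temp + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def advance_interval(temp, rc_inv, p_over_c, interval, h):
    """Advance temperature across one interval, assuming the average power
    (hence p_over_c = P/C) is constant within it, mirroring the 10 K-cycle
    intervals described above. rc_inv = 1/(R*C)."""
    deriv = lambda t, T: p_over_c - rc_inv * T
    t = 0.0
    while t < interval:
        step = min(h, interval - t)       # don't overshoot the interval end
        temp = rk4_step(temp, t, step, deriv)
        t += step
    return temp

# Starting from T=0, the solution decays toward the steady state P*R = 2.0
print(advance_interval(0.0, 1.0, 2.0, 20.0, 0.01))
```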
HotSpot
• Time evolution of temperature is driven by
unit activities and power dissipations
averaged over 10K cycles
• Power dissipations can come from any power
simulator, act as “current sources” in RC
circuit
• Simulation overhead in Wattch/SimpleScalar:
< 1%
• Requires models of
• Floorplan: important for adjacency
• Package: important for spreading and time
constants
105
Notes
• Note that HotSpot currently measures temperatures in the silicon
  • But that’s also where most sensors measure
• Temperature continues to rise through each layer of the die
  • Temperature in upper-level metal is considerably higher
  • Interconnect model released soon!
106
HotSpot Summary
• HotSpot is a simple, accurate, and fast architecture-level thermal model for microprocessors
• Over 850 downloads since June ’03
• Ongoing active development – architecture-level floorplanning will be available soon
• Download site
  • http://lava.cs.virginia.edu/HotSpot
• Mailing list
  • www.cs.virginia.edu/mailman/listinfo/hotspot
107
Hybrid DTM
• DVS is attractive because of its cubic advantage
  • P ∝ V²f
  • This factor dominates when DTM must be aggressive
  • But changing the DVS setting can be costly
    – Resynchronize the PLL
    – Sensitive to sensor noise → spurious changes
• Fetch gating is attractive because it can use instruction-level parallelism to reduce the impact of DTM
  • Only effective when DTM is mild
• So use both!
108
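The cubic advantage can be made concrete with a toy calculation. This sketch assumes, for illustration only, that voltage scales in proportion to frequency under DVS and that fetch gating simply scales activity; the 20% figures are made up.

```python
def dynamic_power(v, f, c_eff=1.0):
    """Dynamic power P = Ceff * V^2 * f (the P ∝ V²f relation above)."""
    return c_eff * v * v * f

# Under DVS, voltage scales down roughly with frequency, so power falls
# roughly cubically; fetch gating only scales activity, which is ~linear.
full = dynamic_power(1.0, 1.0)
dvs = dynamic_power(0.8, 0.8)        # 20% slower -> ~49% power savings
gated = dynamic_power(1.0, 1.0) * 0.8  # 20% fewer fetches, same V and f
print(dvs, gated)
```

For the same nominal 20% slowdown, DVS cuts power to about 0.51 of full while pure fetch gating only reaches 0.8, which is why DVS dominates when DTM must be aggressive.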
Migrating Computation
• When one unit overheats, migrate its functionality to a distant, spare unit (MC)
  • Spare register file (Skadron et al. 2003)
  • Separate core (CMP) (Heo et al. 2003)
  • Microarchitectural clusters
  • etc.
• Raises many interesting issues
  • Cost-benefit tradeoff for that area
  • Use both resources (scheduling)
  • Extra power for long-distance communication
  • Floorplanning
© 2006, Kevin Skadron
109
Hybrid DTM, cont.
• Combine fetch gating with DVS
  • When DVS is better, use it
  • Otherwise use fetch gating
  • Determined by magnitude of the temperature overshoot
  • Crossover at an FG duty cycle of 3
  • FG has low overhead: helps reduce the cost of sensor noise

[Two plots: slowdown vs. duty cycle for DVS, FG, and Hyb, showing the crossover between DVS and fetch gating]
© 2006, Kevin Skadron
110
Hybrid DTM, cont.
• DVS doesn’t need more than two settings for thermal control
  • Lower voltage cools the chip faster
• FG by itself does need multiple duty cycles and hence requires PI control
• But in a hybrid configuration, FG does not require PI control
  • FG is only used at mild DTM settings
  • Can pick one fixed duty cycle
• This is beneficial because feedback control is vulnerable to noise
111
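The two-level policy above can be sketched as a tiny decision function. This is a toy in the spirit of the hybrid scheme, not the actual controller: the thresholds, the fixed duty cycle of 0.5, and the action names are all illustrative assumptions.

```python
def hybrid_dtm_action(temp_c, trigger_c, emergency_c):
    """Toy hybrid DTM policy:
    mild overshoot  -> fetch gating at one fixed duty cycle (no PI control),
    large overshoot -> drop to the single low DVS setting,
    otherwise       -> run normally."""
    if temp_c >= emergency_c:
        return ("dvs_low", None)     # aggressive response: low-V/f setting
    if temp_c >= trigger_c:
        return ("fetch_gate", 0.5)   # mild response: one fixed FG duty cycle
    return ("run_normal", None)

# Hypothetical thresholds: trigger at 82 C, emergency at 86 C
print(hybrid_dtm_action(84.0, 82.0, 86.0))  # ('fetch_gate', 0.5)
```

Because fetch gating here is only ever engaged at one fixed setting, a single noisy sensor reading can at worst trigger a cheap, mild response rather than a costly spurious DVS transition.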
Sensors
• Almost half of DTM overhead is due to
  • Guard banding due to offset errors and lack of co-located sensors
  • Spurious sensor readings due to noise
• Need localized, fine-grained sensing
• Need new sensor designs that are cheap and can be used liberally – co-locate with hotspots
  • But these may be imprecise
  • Many sensor designs look promising
• Need new data-fusion techniques to reduce imprecision, possibly combining heterogeneous sensors
© 2006, Kevin Skadron
112
Impact of Physical Constraints
• Thermal constraints shift the optimum toward fewer and simpler cores
  • Mem-bound programs want narrow cores and lots of L2
• CPU-bound programs still want aggressive superscalar cores despite throttling, but not deeply pipelined
  • They will be severely throttled (e.g., up to 45% voltage reduction and 75% frequency reduction)
• You can still have lots of cores
  • You still win by adding cores until throttling outweighs the benefit of an additional core
• Preliminary results suggest that OO cores are always preferable: they are more efficient in terms of BIPS/area
© 2006, Kevin Skadron
113