CS 7960-4
Lecture 16
Cache Decay: Exploiting Generational Behavior
to Reduce Cache Leakage Power
S. Kaxiras, Z. Hu, M. Martonosi
Proceedings of ISCA-28
July 2001
Leakage Power Trends
• Leakage ∝ number of transistors (increasing),
supply voltage (decreasing),
low threshold voltage (increasing)
• L1 and L2 caches are the biggest
contributors (high transistor budgets)
Leakage Power
Vdd-Gating
• Leakage can be reduced by gating off the
supply voltage to the circuit
• When applied to a cache, the contents of the
SRAM cell are lost
• Cache decay: apply Vdd-gating when you do not
care about cache contents
Lifetime of a Cache Line
Overheads
• Hardware to determine when to decay
• Introduces additional cache misses
• Normalized cache leakage power =
Active ratio (fraction of cache that is powered on) +
(Counter overhead : Leak) × activity +
(L2 access energy : Leak) × num-misses
• Increased execution time (< 0.7%)
• L2 access/leakage ratio is ~9
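The normalized-power expression above can be evaluated directly. The sketch below is only an illustration: the function and parameter names are hypothetical, it assumes `activity` and `num_misses` are already normalized so that every term is dimensionless, and the sample inputs are made up except for the L2 access/leakage ratio of about 9 quoted on the slide.

```python
def normalized_leakage_power(active_ratio, counter_to_leak, activity,
                             l2_to_leak, num_misses):
    """Evaluate the slide's expression term by term.

    active_ratio    -- fraction of the cache that is powered on
    counter_to_leak -- (counter overhead : leak) ratio
    activity        -- counter activity (assumed normalized)
    l2_to_leak      -- (L2 access energy : leak) ratio, ~9 on the slide
    num_misses      -- extra misses caused by decay (assumed normalized)
    """
    return (active_ratio
            + counter_to_leak * activity
            + l2_to_leak * num_misses)


# Made-up inputs, purely to show how the terms combine.
example = normalized_leakage_power(
    active_ratio=0.3, counter_to_leak=0.01, activity=0.5,
    l2_to_leak=9, num_misses=0.01)
print(f"normalized leakage power: {example:.3f}")
```

With these inputs the extra-miss term contributes more than the counter term, which is why the number of induced misses matters so much in the trade-off.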
Skier’s Dilemma
New skis: $400
Ski rentals: $20
Heuristic: Buy skis after rental cost = purchase price
Ski trips:   5     10    15    20    25    50
Optimal:     $100  $200  $300  $400  $400  $400
Heuristic:   $100  $200  $300  $800  $800  $800
Likewise, decay a cache line when the cost of an
additional miss equals leakage dissipated so far
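The skier's dilemma figures can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical function names. It assumes the $20 rental is per trip and that the heuristic buys skis at the moment the accumulated rental cost reaches the purchase price, which is the reading that matches the $800 entries on the slide.

```python
def optimal_cost(trips, rent=20, buy=400):
    """Cost with perfect foresight: rent every time or buy up front."""
    return min(trips * rent, buy)


def heuristic_cost(trips, rent=20, buy=400):
    """Rent until the rental total reaches the purchase price, then buy."""
    rentals_before_buying = -(-buy // rent)  # ceiling division
    if trips < rentals_before_buying:
        return trips * rent
    return rentals_before_buying * rent + buy


for n in (5, 10, 15, 20, 25, 50):
    print(n, optimal_cost(n), heuristic_cost(n))
```

In this example the heuristic never pays more than twice the optimal cost, without knowing the number of trips in advance. Cache decay applies the same idea, with leakage playing the role of the rental fee and the extra miss playing the role of the purchase.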
Tracking Dead Time
• Each line has a 2-bit counter that gets reset on
every access and gets incremented every 2500
cycles through a global signal (negligible overhead)
• After 10,000 clock cycles, the counter reaches
the max value and triggers a decay
• Adaptive decay: Start with a short decay period;
if you have a quick miss, double the period; if there
is no miss, halve the period
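The counter scheme and the adaptive rule above can be sketched in software as follows. This is an illustrative model, not the paper's hardware: the class and function names are hypothetical, decay is assumed to fire on the fourth global tick after the last access (one reading of the 10,000-cycle figure with a 2,500-cycle tick), and the minimum and maximum bounds on the adaptive period are assumptions added to keep the value in a sensible range.

```python
TICK_CYCLES = 2500   # global tick period from the slide
COUNTER_MAX = 3      # largest value a 2-bit counter can hold


class CacheLine:
    """One cache line with a 2-bit decay counter."""

    def __init__(self):
        self.counter = 0
        self.powered = True

    def access(self):
        """Reset the counter on any access.

        Returns True if the line had decayed, i.e. this access is an
        extra miss caused by decay and the line must be refetched.
        """
        was_decayed = not self.powered
        self.counter = 0
        self.powered = True
        return was_decayed

    def tick(self):
        """Called once every TICK_CYCLES by the global signal."""
        if not self.powered:
            return
        if self.counter == COUNTER_MAX:
            self.powered = False      # assumed: fourth tick gates Vdd
        else:
            self.counter += 1


def adapt_decay_period(period, quick_miss,
                       min_period=1_000, max_period=1_000_000):
    """Double the period after a quick miss, otherwise halve it.

    min_period and max_period are hypothetical bounds.
    """
    if quick_miss:
        return min(period * 2, max_period)
    return max(period // 2, min_period)
```

A line that is touched at least once every few ticks never decays, while a line left idle for four consecutive ticks is switched off and costs one extra miss if it is referenced again.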
Results
Overheads
Adaptive Technique
Other Results
• L2 cache is equally well suited to decay techniques
-- lifetimes are scaled by a factor of 10, and an extra
miss also costs a lot more
• For their experiments, there is little interference
from multiprogramming
• Some instructions can easily be identified as
last touches to a cache block – potential for early
cache decay
The GALS Approach
• Dynamic voltage (and freq) scaling (DVS) has
favorable power-performance characteristics –
3% power savings for ~1% performance loss
• Distributing a single clock is going to be much
harder in the future – will naturally result in multiple
clock domains
• DVS can be applied to each individual domain –
identifying critical regions will allow better IPC
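The 3%-for-1% figure above is consistent with the usual first-order model of dynamic power, P ∝ C·V²·f. The sketch below assumes that supply voltage can be lowered in proportion to frequency, so power scales roughly with the cube of frequency. That linear voltage-frequency relation is an idealization not stated on the slide, and the function name is hypothetical.

```python
def dvs_power_savings(freq_reduction):
    """Fractional dynamic-power savings for a fractional frequency cut.

    Assumes P ~ C * V^2 * f with V scaled linearly with f, so P ~ f^3.
    """
    scale = 1.0 - freq_reduction
    return 1.0 - scale ** 3


# A 1% frequency (and so roughly 1% performance) reduction.
print(f"{dvs_power_savings(0.01):.2%} power savings")
```

Under this model a small slowdown buys about three times as much power reduction, which is what makes per-domain scaling attractive for domains that are off the critical path.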
Multi-Clock Domain Processor
Interfacing Domains
• There are queues between each pair of domains
• Producers place data in the queues and consumers
pull data out, in an asynchronous fashion
• Synchronization delays make the IPC slightly
lower than the base case
• Occupancy in the queue means the consumer can
slow down; an empty queue implies the producer
can slow down
Next Week’s Paper
• “Reducing Power with Dynamic Critical Path
Information”, J.S. Seng, E.S. Tune, D.M. Tullsen,
Proceedings of MICRO-34, Dec 2001