Presentation kit - University of Wisconsin–Madison

Optimizing Total Power of Many-core Processors Considering Voltage Scaling Limit and Process Variations
Jungseob Lee and Nam Sung Kim
October 9, 2009
Department of Electrical and Computer Engineering
University of Wisconsin - Madison
Outline
- Introduction
- Supply Voltage and Power Scaling
  - Supply Voltage Scaling of Many-Core Processors
  - Power Scaling of Many-Core Processors
- Impacts of Within-Die (WID) Spatial Process Variations
  - Global Clocking
  - Frequency-Island Clocking
- Conclusions
Multicore processors
- Parallel processing
  - Improved throughput of computing systems w/ more cores
  - Throughput is limited by power and thermal constraints w/ all cores running
- Challenges: How do we
  - Determine the # of cores for the best performance-power efficiency?
  - Exploit process variations for multicore processors?
[Figures: serial processing vs. parallel processing [1]; a GPU, which has many cores [2]]
[1] Source: http://www.interactivesupercomputing.com/starpexpress/042007/3_Task_Parallel.html
[2] Source: NVIDIA
Process variations
- Types of process variations
  - Die-to-Die (D2D) variations (wafer scale)
  - Within-Die (WID) variations
- Core-to-core (C2C) frequency and leakage power variations due to spatially correlated WID variations become considerable.
[Figures: a systematic Vth variation map for a 16-core processor; the corresponding normalized Fmax and Pleak map]
Courtesy: K. Bowman from Intel
Supply Voltage Scaling (1)
- Supply voltage scaling of many-core processors
  - Throughput w/ a certain # of cores at max VDD (thus at Fmax)
    = Throughput w/ more cores at a lower VDD (thus at a lower Fmax)
  - The potential throughput increase from more cores can be traded for a lower VDD, which reduces power.
[Figure: two configurations with equal throughput]
  # of cores: 4, operating freq: at VDD
  # of cores: 8, operating freq: at a lower V than VDD
Supply Voltage Scaling (2)
- Supply voltage scaling of many-core processors

  $M \cdot T_{cycle}(V_{DD}) = M \cdot \left((1-F) + \frac{F}{N}\right) \cdot T_{cycle}(V)$

  M        Number of operations
  Tcycle   Cycle time of a processor at a given supply voltage
  VDD      Nominal supply voltage of the base core processor
  F        Fraction of operations parallelizable w/o overhead
  N        Relative number of cores
  V        Scaled supply voltage of N x more cores

[Figures: plots for PTM 32nm HP and PTM 32nm LP. Annotations: "> 40% ↓"; "Require higher VDD due to high Vth"]
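The equal-throughput condition fixes how slowly each core may run: dividing both sides by M∙Tcycle(V) gives Fmax(V)/Fmax(VDD) = (1−F) + F/N. The short Python sketch below evaluates that ratio. The function name is illustrative and does not come from the slides; mapping the ratio back to a voltage V still needs a measured Fmax(V) curve, which is not reproduced here.

```python
def required_frequency_scale(F: float, N: int) -> float:
    """Return Fmax(V)/Fmax(VDD) needed so that N x more cores match
    the throughput of the base processor.

    F is the parallelizable fraction of operations, N the relative
    number of cores.
    """
    if not 0.0 <= F <= 1.0:
        raise ValueError("F must be in [0, 1]")
    if N < 1:
        raise ValueError("N must be at least 1")
    # Sequential part runs on one core, parallel part is split over N cores.
    return (1.0 - F) + F / N


if __name__ == "__main__":
    print(required_frequency_scale(1.0, 2))  # 0.5
    print(required_frequency_scale(0.5, 2))  # 0.75
```

A fully parallel workload (F = 1.0) on twice the cores needs only half the base frequency, which is what leaves room for lowering the supply voltage.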
Dynamic Power Analysis (1)
- Dynamic power scaling
  - Dynamic power of a base many-core processor

    $P_{dyn,base} = C_{eff} \cdot V_{DD}^{2} \cdot F_{max}(V_{DD})$

  - Dynamic power of N x more cores than the base processor

    $P_{dyn,N} = \left((1-F) \cdot (1+(N-1) \cdot K) + F \cdot N\right) \cdot C_{eff} \cdot V^{2} \cdot F_{max}(V)$
    $\quad\quad\;\; = k(F,K,N) \cdot f(V) \cdot (V/V_{DD})^{2} \cdot P_{dyn,base}$

  Pdyn,base   Dynamic power of a base core processor
  Ceff        Effective total switching capacitance
  VDD         Nominal voltage of the base core
  Fmax        Maximum operating frequency of the base core processor
  Pdyn,N      Dynamic power of N x more cores
  K           Fraction of dynamic power of idle cores
  k(F,K,N)    ((1−F) ∙(1+(N−1) ∙K) + F ∙N)
  f(V)        Frequency scaling factor at V; Fmax(V)/Fmax(VDD)
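As a worked illustration of the scaling relation Pdyn,N = k(F,K,N) ∙f(V) ∙(V/VDD)² ∙Pdyn,base, the Python sketch below computes the activity factor k and the normalized dynamic power. The function names and the numbers in the usage lines are hypothetical; in the slides f(V) comes from circuit simulation, so here it is simply passed in as an argument.

```python
def k_factor(F: float, K: float, N: int) -> float:
    """k(F, K, N): during the sequential fraction one core is active and
    N-1 idle cores each draw a fraction K of active dynamic power;
    during the parallel fraction all N cores are active."""
    return (1.0 - F) * (1.0 + (N - 1) * K) + F * N


def normalized_pdyn(F: float, K: float, N: int,
                    f_v: float, v_ratio: float) -> float:
    """Pdyn,N / Pdyn,base for frequency scaling factor f_v = f(V)
    and voltage ratio v_ratio = V/VDD."""
    return k_factor(F, K, N) * f_v * v_ratio ** 2


if __name__ == "__main__":
    # Hypothetical operating point: fully parallel workload, 2x cores,
    # half the frequency, 80% of the nominal supply voltage.
    print(k_factor(1.0, 0.1, 2))  # 2.0
    print(normalized_pdyn(1.0, 0.1, 2, f_v=0.5, v_ratio=0.8))
```

With f(V) = 0.5 and N = 2 the product k ∙f(V) is 1, so in this example the whole dynamic power saving comes from the (V/VDD)² term.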
Dynamic Power Analysis (2)
- Dynamic power scaling
[Figures: normalized Pdyn for PTM 32nm HP and PTM 32nm LP, with VDD,min = 0.7V. Dotted lines show the projected power consumption when there is no supply voltage limit.]
- Less VDD scaling → less Pdyn reduction

Optimal normalized Pdyn / relative # of cores

  Model    VDD,min (V)   F=0.6    F=0.7    F=0.8    F=0.9    F=1.0
  PTM HP   0.7           0.75/2   0.66/3   0.60/2   0.52/2   0.45/2
  PTM HP   0.6           0.75/2   0.66/3   0.54/3   0.41/4   0.34/3
  PTM HP   No limit      0.75/2   0.66/3   0.54/3   0.41/5   0.20/8
  PTM LP   0.7           0.75/2   0.70/2   0.65/2   0.56/3   0.46/3
  PTM LP   0.6           0.75/2   0.70/2   0.65/2   0.55/4   0.35/8
  PTM LP   No limit      0.75/2   0.70/2   0.65/2   0.55/4   0.35/8

Pdyn reduction: HP: 25~55%, LP: 25~54%
Leakage Power Analysis (1)
- Leakage power scaling
  - In nanoscale technology, leakage power is a significant fraction of total power consumption.
  - Leakage power of a base many-core processor

    $P_{leak,base} = I_{leak}(V_{DD}) \cdot V_{DD}$

  - Leakage power of N x more cores than the base processor

    $P_{leak,N} = N \cdot I_{leak}(V) \cdot V = N \cdot l(V) \cdot (V/V_{DD}) \cdot P_{leak,base}$

  Pleak,base  Leakage power of a base core
  Ileak       Total leakage current of the base processor
  VDD         Nominal voltage of the base core
  Pleak,N     Leakage power of N x more cores
  l(V)        Leakage scaling factor at V
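Because Pleak,N grows linearly with N, adding cores lowers leakage only when l(V) ∙(V/VDD) drops below 1/N. The Python sketch below evaluates the normalized leakage and that break-even condition. The function names and sample values are hypothetical, and l(V) is assumed to be supplied from measurement or simulation.

```python
def normalized_pleak(N: int, l_v: float, v_ratio: float) -> float:
    """Pleak,N / Pleak,base for leakage scaling factor l_v = l(V)
    and voltage ratio v_ratio = V/VDD."""
    return N * l_v * v_ratio


def leakage_is_reduced(N: int, l_v: float, v_ratio: float) -> bool:
    """True when N x more cores at voltage V leak less than the base
    processor at VDD, i.e. l(V) * (V/VDD) < 1/N."""
    return normalized_pleak(N, l_v, v_ratio) < 1.0


if __name__ == "__main__":
    # Hypothetical point: leakage current halves at half the voltage.
    print(normalized_pleak(2, 0.5, 0.5))    # 0.5
    print(leakage_is_reduced(4, 0.5, 0.5))  # False
```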
Leakage Power Analysis (2)
- Leakage power scaling
[Figures: normalized Pleak for PTM 32nm HP and PTM 32nm LP]
- But the absolute Pleak of LP is much less than that of HP.

Optimal normalized Pleak / relative # of cores

  Model    VDD,min (V)   F=0.6    F=0.7    F=0.8    F=0.9    F=1.0
  PTM HP   0.7           0.46/3   0.35/3   0.31/2   0.25/2   0.20/2
  PTM HP   0.6           0.46/3   0.35/3   0.27/3   0.21/4   0.16/3
  PTM HP   No limit      0.46/3   0.35/3   0.27/3   0.21/4   0.15/5
  PTM LP   0.7           0.67/2   0.62/2   0.58/2   0.54/2   0.50/2
  PTM LP   0.6           0.67/2   0.62/2   0.58/2   0.54/2   0.50/2
  PTM LP   No limit      0.67/2   0.62/2   0.58/2   0.54/2   0.50/2

Pleak reduction: HP: 54~80%, LP: 33~50%
Total Power Analysis (1)
- Total power scaling
  - The total power of a base many-core processor is the sum of dynamic and leakage power.

    $P_{tot,base} = P_{dyn,base} + P_{leak,base} = P_{dyn,base} \cdot (1 + LF)$

  - The total power of N x more cores than the base processor is the sum of dynamic and leakage power.

    $P_{tot,N} = P_{dyn,N} + P_{leak,N}$
    $\quad\quad\;\; = P_{dyn,base} \cdot \left\{ k(F,K,N) \cdot f(V) \cdot (V/V_{DD})^{2} + N \cdot l(V) \cdot (V/V_{DD}) \cdot LF \right\}$

  Ptot,base   Total power of a base core
  LF          Ratio between Pleak and Pdyn of the base processor; (Pleak,base/Pdyn,base)
  Ptot,N      Total power of N x more cores
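Dividing Ptot,N by Ptot,base gives a normalized total power of {k ∙f(V) ∙(V/VDD)² + N ∙l(V) ∙(V/VDD) ∙LF} / (1 + LF), which can be searched over N. The Python sketch below does that search under loudly stated assumptions: f(V) follows an alpha-power law and l(V) an exponential, with made-up parameters, whereas the slides obtain both from HSPICE. It also assumes that cores clamped at VDD,min run at the required frequency rather than at Fmax(VDD,min). All names and constants are hypothetical.

```python
import math
from typing import Tuple

# Hypothetical device parameters, for illustration only (not from the slides).
VDD_NOM = 0.9     # nominal supply voltage (V)
VTH = 0.3         # threshold voltage (V)
ALPHA = 1.3       # alpha-power-law exponent
LEAK_SLOPE = 3.0  # sensitivity of leakage current to V/VDD


def freq_scale(v_ratio: float) -> float:
    """Assumed f(V) = Fmax(V)/Fmax(VDD), alpha-power law."""
    v = v_ratio * VDD_NOM
    if v <= VTH:
        raise ValueError("supply voltage must exceed the threshold voltage")
    fmax = lambda x: (x - VTH) ** ALPHA / x
    return fmax(v) / fmax(VDD_NOM)


def leak_scale(v_ratio: float) -> float:
    """Assumed l(V) = Ileak(V)/Ileak(VDD), exponential in V/VDD."""
    return math.exp(LEAK_SLOPE * (v_ratio - 1.0))


def lowest_voltage_ratio(f_req: float, v_min_ratio: float) -> float:
    """Smallest V/VDD in [v_min_ratio, 1] that reaches f_req (bisection)."""
    if freq_scale(v_min_ratio) >= f_req:
        return v_min_ratio  # clamped at the voltage scaling limit
    lo, hi = v_min_ratio, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if freq_scale(mid) >= f_req:
            hi = mid
        else:
            lo = mid
    return hi


def normalized_ptot(F: float, K: float, N: int, LF: float,
                    v_min_ratio: float) -> float:
    """Ptot,N / Ptot,base at equal throughput."""
    f_req = (1.0 - F) + F / N
    v = lowest_voltage_ratio(f_req, v_min_ratio)
    k = (1.0 - F) * (1.0 + (N - 1) * K) + F * N
    dyn = k * f_req * v ** 2
    leak = N * leak_scale(v) * v * LF
    return (dyn + leak) / (1.0 + LF)


def best_core_count(F: float, K: float, LF: float, v_min_ratio: float,
                    max_n: int = 8) -> Tuple[int, float]:
    """Return (N, normalized Ptot) with the lowest total power."""
    candidates = ((n, normalized_ptot(F, K, n, LF, v_min_ratio))
                  for n in range(1, max_n + 1))
    return min(candidates, key=lambda item: item[1])


if __name__ == "__main__":
    # LF = 0.4/0.6 as in the HP case; VDD,min = 0.6 V on a 0.9 V nominal.
    print(best_core_count(F=1.0, K=0.1, LF=0.4 / 0.6, v_min_ratio=0.6 / 0.9))
```

The numbers this prints depend entirely on the assumed models and will not reproduce the tabulated results; the point is the structure of the search, in which the leakage term grows with N while the dynamic term stops shrinking once V reaches VDD,min.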
Total Power Analysis (2)
- Total power scaling
[Figures: normalized Ptot for PTM 32nm HP (LF 0.4/0.6) and PTM 32nm LP (LF 0.2/0.8)]

Optimal normalized Ptot / relative # of cores

  Model    LF        VDD,min (V)   F=0.6    F=0.7    F=0.8    F=0.9    F=1.0
  PTM HP   0.4/0.6   0.7           0.64/2   0.53/3   0.48/2   0.41/2   0.35/2
  PTM HP   0.4/0.6   0.6           0.64/2   0.53/3   0.43/3   0.33/4   0.27/3
  PTM HP   0.4/0.6   No limit      0.64/2   0.53/3   0.43/3   0.33/5   0.18/8
  PTM LP   0.2/0.8   0.7           0.74/2   0.69/2   0.63/2   0.57/3   0.48/3
  PTM LP   0.2/0.8   0.6           0.74/2   0.69/2   0.63/2   0.57/3   0.46/5
  PTM LP   0.2/0.8   No limit      0.74/2   0.69/2   0.63/2   0.57/3   0.46/5

Ptot reduction: HP: 36~65%, LP: 26~52%
- More VDD scaling → only 17% more Ptot reduction, but it requires more on-die memory area.
Impacts of WID Variations − GC
- Global Clocking
  - Limits the Fmax of a many-core processor to that of the slowest core.
  - The previous Pdyn,N equation can still be used to estimate Pdyn,N.
  - Estimation of Pleak,N has to account for each core's leakage variations as follows.

    $P_{leak,N} = \sum_{i=1}^{N} l_i(V) \cdot (V/V_{DD}) \cdot P_{leak,base}$

  li(V)   Leakage scaling factor of the i-th core; normalized to Ileak(VDD)

[Figures: a systematic Vth variation map for a 16-core processor; the corresponding normalized Fmax and Pleak per core ID]
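The two global-clocking rules can be sketched in a few lines of Python: the shared clock is limited by the slowest active core, and leakage is summed per core as in the Pleak,N equation. The function names, the choice of active cores, and the per-core values below are hypothetical and are not taken from the variation map.

```python
from typing import Sequence


def gc_chip_fmax(core_fmax: Sequence[float],
                 active: Sequence[int]) -> float:
    """Under global clocking, the shared clock is limited by the
    slowest of the active cores (normalized Fmax values)."""
    return min(core_fmax[i] for i in active)


def gc_normalized_pleak(core_leak_scale: Sequence[float],
                        active: Sequence[int],
                        v_ratio: float) -> float:
    """Pleak,N / Pleak,base = sum_i l_i(V) * (V/VDD) over active cores."""
    return sum(core_leak_scale[i] for i in active) * v_ratio


if __name__ == "__main__":
    # Hypothetical per-core values for a 4-core die.
    fmax = [1.00, 0.90, 1.10, 0.95]
    leak = [1.00, 0.80, 1.50, 0.90]
    active = [0, 2]
    print(gc_chip_fmax(fmax, active))               # 1.0
    print(gc_normalized_pleak(leak, active, 0.8))
```

In this example core 2 could run 10% faster than core 0, but under GC that headroom is unused while its higher leakage is still paid.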
Impacts of WID Variations − GC
- Global Clocking
[Figures: total power for the HP slowest base core and the HP fastest base core]
- Much more relative total power reduction with the fastest base core, because the fastest base core is not power efficient.

Average total power of 100 die samples / relative # of cores (N)

  Base   VDD,min (V)   F=0.6    F=0.7    F=0.8    F=0.9    F=1.0
  Slow   0.7           0.77/2   0.67/2   0.59/2   0.52/2   0.46/2
  Slow   0.6           0.77/2   0.67/2   0.57/3   0.46/3   0.37/2
  Slow   No limit      0.77/2   0.67/2   0.57/3   0.46/4   0.29/8
  Fast   0.7           0.23/3   0.18/3   0.14/4   0.12/2   0.10/2
  Fast   0.6           0.23/3   0.18/3   0.14/4   0.10/4   0.07/3
  Fast   No limit      0.23/3   0.18/3   0.14/4   0.10/4   0.06/8

Total power reduction: Slow: 23~54%, Fast: 77~90%
Impacts of WID Variations − FI
- Frequency−Island Clocking
  - FI clocking is more performance- and power-efficient than GC because each core can run at its own fastest frequency.
  - The previous GC Pleak,N equation can be used to estimate Pleak,N.
  - The equation for supply voltage scaling has to be modified as follows.

    $M \cdot T_{cycle,base}(V_{DD}) = M \cdot \left(\frac{1-F}{f_j} + \frac{F}{\sum_{i=1}^{N} f_i}\right) \cdot T_{cycle}(V)$

  - Estimation of Pdyn,N also has to account for an independent clock frequency per core.

    $P_{dyn,N} = \left((1-F) \cdot \left(f_j + \sum_{i=1,\, i \neq j}^{N} f_i \cdot K\right) + F \cdot \sum_{i=1}^{N} f_i\right) \cdot (V/V_{DD})^{2} \cdot P_{dyn,base}$

  - The fastest one among the chosen active cores always offers the optimal total power for processing the totally sequential portion of the workload.
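The frequency-island relations can be checked numerically with the Python sketch below. It reads f_i as the frequency scaling factor of the i-th active core and j as the index of the core that runs the sequential portion; that reading, the function names, and the sample values are assumptions made for illustration.

```python
from typing import Sequence


def fi_time_factor(F: float, f: Sequence[float], j: int) -> float:
    """Bracketed term of the FI voltage-scaling equation:
    (1-F)/f_j + F / sum_i f_i."""
    return (1.0 - F) / f[j] + F / sum(f)


def fi_normalized_pdyn(F: float, K: float, f: Sequence[float],
                       j: int, v_ratio: float) -> float:
    """Pdyn,N / Pdyn,base under FI clocking: core j runs the sequential
    part while the other cores idle at a fraction K of their dynamic
    power; all cores run the parallel part."""
    idle = sum(fi for i, fi in enumerate(f) if i != j)
    return ((1.0 - F) * (f[j] + K * idle) + F * sum(f)) * v_ratio ** 2


def fastest_core(f: Sequence[float]) -> int:
    """Index of the fastest active core, used for the sequential part."""
    return max(range(len(f)), key=lambda i: f[i])


if __name__ == "__main__":
    # With identical cores the FI expressions reduce to the GC ones.
    print(fi_time_factor(0.5, [1.0, 1.0], 0))                # 0.75
    print(fi_normalized_pdyn(0.5, 0.0, [1.0, 1.0], 0, 1.0))  # 1.5
    print(fastest_core([0.9, 1.1, 1.0]))                     # 1
```

When every f_i equals 1, the time factor becomes (1−F) + F/N and the power term becomes k(F,K,N), which is a quick consistency check against the global-clocking equations.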
Impacts of WID Variations − FI
- Frequency−Island Clocking
[Figures: total power for the HP fastest base core and the HP slowest base core]
- FI clocking is more power-efficient than global clocking (GC), which often wastes the Fmax of faster cores.
- On average, FI clocking offers 7% lower total power consumption than GC.

Average total power of 100 die samples / relative # of cores (N)

  Base   VDD,min (V)   F=0.6    F=0.7    F=0.8    F=0.9    F=1.0
  Slow   0.7           0.70/2   0.63/2   0.56/2   0.50/2   0.42/2
  Slow   0.6           0.70/2   0.62/3   0.53/3   0.44/3   0.36/2
  Slow   No limit      0.70/2   0.62/3   0.52/3   0.43/4   0.27/8
  Fast   0.7           0.19/3   0.15/4   0.12/4   0.10/3   0.10/2
  Fast   0.6           0.19/3   0.15/4   0.12/4   0.09/5   0.07/3
  Fast   No limit      0.19/3   0.15/4   0.12/4   0.09/5   0.06/8

Total power reduction: Slow: 30~58%, Fast: 81~90%
Experimental Methodology
- HSPICE simulations
  - 32nm PTM HP and LP models
- Frequency / leakage scaling factors
  - VDD range: 0.55 ~ 1.05 V
  - Complex gates for measuring l(VDD)
  - 24 FO4 inverter chain for measuring f(VDD)
- Vth and Leff WID spatial and D2D variation map [3]

  Variation   Parameter                              Value
  WID         Correlation distance coefficient (Φ)   0.5
  WID         σsys (Vth)                             6.4%
  D2D         σD2D (Vth)                             5.0%

[3] Smruti R. Sarangi et al., "VARIUS: A Model of Process Variation and Resulting Timing Errors for Microarchitects", IEEE Transactions on Semiconductor Manufacturing (IEEE TSM), February 2008.
Conclusions
- Optimal number of active cores to minimize the total power consumption of many-core processors
  - 2x more active cores at a lower voltage offer more than 50% total power reduction at the same throughput as a base core.
- Extended power analysis considering WID C2C frequency and leakage variations
  - 2x more active cores at a lower voltage is the optimal choice.
  - FI clocking provides lower power consumption than GC since it can exploit C2C variations. Also, using the fastest of the active cores for the sequential portion of an application leads to the lowest power consumption.
Backup
Introduction
- Process variations
  - Manufactured dies exhibit a large spread of transistor delay and leakage power across dies and within each die.
  - Die-to-die (D2D) variations affect all transistors on a die equally. Within-die (WID) variations induce different characteristics across each die.
  - As individual core size becomes smaller, core-to-core (C2C) frequency and leakage power variations due to spatially correlated WID variations will become considerable.
[Figures: die-to-die variations; spatial within-die variations. Source: Synopsys]
Supply Voltage and Power Scaling (2)
- Supply voltage scaling of many-core processors
  - Throughput w/ a certain # of cores at max VDD (thus at Fmax)
    = Throughput w/ more cores at a lower VDD (thus at a lower Fmax)
  - The potential throughput increase from more cores can be traded for a lower VDD, which reduces power.
[Figure: many-core processor with active and idle cores [1]]
  # of active cores: 1, operating freq: at VDD
  # of active cores: 8, operating freq: at a lower V than VDD