Slide - School of Computing - National University of Singapore

Download Report

Transcript Slide - School of Computing - National University of Singapore

Estimating the Worst-Case
Energy Consumption of
Embedded Software
Ramkumar Jayaseelan Tulika Mitra Xianfeng Li
School of Computing
National University of Singapore
1
Motivation

Conventional scheduling techniques give
timing guarantees



Processor cycles is the critical resource
WCET of the tasks are required input
Battery life is equally important for mobile
devices


Scheduling technique have to give energy
guarantees
Worst-Case Energy Consumption (WCEC) of the
tasks are required input
2
Remotely Deployed Systems
Local Station
Sensor Network


Available energy unevenly distributed among nodes
Spatio-temporal scheduling benefits from WCEC
3
Energy-Based Guarantees



Scheduling critical and non-critical tasks in a
battery-operated system
Non-critical tasks can be run only if energy
constraints for critical tasks are satisfied
Worst-case energy estimation is crucial
4
Reward-Based Scheduling




Energy consumption  Voltage
Delay  (1 / Voltage)
Reward-based scheduling attempts to satisfy
constraints on energy and timing
Energy guarantee only if worst-case energy
consumption of tasks are known
5
Outline






Background
Relation between WCET and Worst-case
energy consumption
Estimation technique: Simplified model
Instruction cache and speculation
Experimental results
Conclusion
6
Background




Power and energy are often used
interchangeably
Power is energy consumed per unit time
Energy consumed during program execution
E=P×t
Approximation as P is also a function of time
7
E=P×T is an approximation

In reality when a program executes
Power
Time

Energy is the area under the curve
E = ∫P(t)dt
8
WCEC versus WCET
21000
Total Energy
Execution Time
5000
Energy(nano Joules)
19000
4900
18000
17000
4800
16000
4700
Execution Time(cycles)
20000
5100
15000
4600
14000
13000
4500
Program Inputs
Full Input Space Expansion for a 5-element Insertion Sort program
9
Cannot Estimate WCEC from WCET
Benchmark
WCET×avg_power
µJ
Observed
µJ
isort
489.92
525.88
fft
12106.49
10260.86
fdct
138.20
105.57
ludcmp
131.76
119.33
matsum
972.03
1154.31
minver
93.61
80.80
bsearch
3.84
3.07
des
724.05
643.75
matmult
178.12
166.88
qsort
54.79
43.73
qurt
23.80
17.65
Possible underestimation using WCEC=WCET × power
10
WCEC versus WCET


WCEC path need not be the same as the
WCET path
WCEC cannot be directly estimated from the
WCET value
11
A closer look at Power



Dynamic Power : Power Consumption due to
switching of transistors
Leakage Power: Power consumed
independent of switching activity
Dynamic power forms the bulk of power
consumption in today’s processors
12
Dynamic Power

Dynamic Power
P=(1/2) × A × V2 × C × f
V is supply voltage
C is the capacitance of the circuit
f is the frequency
A is the activity factor


V, C, f are independent of program execution
Variation in P is due to the variation in A
13
Variation in Activity Factor (A)

Not all parts of the processor are used in
every cycle




e.g., data-cache is used only for loads/stores
Clock gating disables unused components
Activity factor (A) varies during the execution
of the program
Model variation in A through static analysis
14
Switch-off Energy

An inactive component cannot be fully
switched off


A certain portion of the peak energy is consumed
even in idle cycles
Switch-off energy is proportional to the
number of idle cycles
15
Clock Energy and Leakage Energy



Clock power: power consumed in clock
distribution network
Leakage power: power consumed due to
leakage in transistors
Clock energy and leakage energy are directly
proportional to the execution time
16
Energy Components Summary

Dynamic Energy



Switch-off Energy



Switching of transistors during execution
Independent of execution time
Energy consumed in unused components
Depends on idle cycles
Clock and Leakage energy

Directly proportional to execution time
17
WCEC versus WCET
21000
Total Energy
Execution Time
5000
Energy(nano Joules)
19000
4900
18000
17000
4800
16000
4700
Execution Time(cycles)
20000
5100
15000
4600
14000
13000
4500
Program Inputs
Full Input Space Expansion for a 5-element Insertion Sort program
18
Our Analysis: Overview



Operate on the control flow graph
Estimate worst-case energy of basic blocks
Formulate estimation for whole program as
an integer linear programming (ILP) problem
19
ILP Formulation


Input: Control flow graph of the program
Objective function:
Worst Case Energy =  WCECB  countB

Need to estimate Worst-Case Energy
Consumption( WCECB) for each basic block
20
Flow Constraints
Inflow = Basic Block Execution Count = Outflow
Bounds on maximum loop iterations
B0
B1
B2
E0,1 = B0 = 1
E2,3 + E1,3 = B3 = 1
E0,1 + E2,1 = E1,2 + E1,3 = B1
E1,2 = E2,3 + E2,1= B2
Loop bound: E2,1 <= 100
B3
21
Worst-Case Energy of a Basic Block


Processor Model
Energy Components


Instruction Specific Energy
Pipeline Specific Energy
22
Processor Model
I+1
I
IF
IBUF
ID
ISSUE
I-1
I-4
ROB
EX
I-2
I-3
WB
ALU
CM
MULT
FPU
23
Pipelined Execution of Instructions
ADD R1,R2,R3
MUL R4,R5,R6
SUB R7,R8,R9
CC
ADD
MUL
SUB
1
IF
2
3
4
5
6
7
ID
IS
EX
WB
CM
IF
ID
IS
EX
WB
CM
IF
ID
IS
EX
WB
8
CM
Difficult to statically predict the energy consumption in each cycle
24
Pipelined Execution of Instructions
ADD R1,R2,R3
MUL R4,R5,R6
SUB R7,R8,R9
CC
ADD
MUL
SUB
1
IF
2
3
4
5
6
7
8
ID
IS
EX
WB
CM
Stall
Stall
IF
ID
IS
EX
WB
IF
ID
IS
EX
Difficult to statically predict the energy consumption in each cycle
25
Our Approach


Determine the maximum energy consumed
on a component by component basis
Static analysis to determine the maximum
energy consumed by a component in a
specified interval
26
Execution of Instruction
IF
ID
ISSUE
EX
WB
CM
27
Instruction Specific Energy

Energy consumed due to the sub-tasks
associated with execution of an instruction



e.g., register file access, ALU usage, etc.
Depends on the type of executed instruction
No correlation with execution time
28
Pipeline Specific Energy

During program execution energy is
consumed due to





Switch-off power
(idle cycles)
Leakage power
(every cycle)
Clock network power (every cycle)
Cannot be attributed to any instruction
Energy consumed even in idle cycles
29
Energy Components

Observation: Energy consumed can be
separated out as

Instruction Specific energy



Energy associated with the execution of a particular
instruction
Independent of execution time
Pipeline Specific energy


Energy consumed in other components such as clock
network, leakage etc.
Related to execution time
30
Worst-case Energy of a Basic block
energyBB  dynamicBB  switchoffBB  leakageBB  clock BB


dynamicBB : Instruction-Specific Energy for BB
switchoffBB , leakageBB and clockBB are energy
consumed in unused components, leakage and clock
network during WCETBB
31
Instruction Specific Energy


Energy consumed due to switching activity generated by
the instructions in BB
Sum of energy consumed by individual instructions in BB
dynamicBB  instrBB dynamicinstr
32
Switch-off Energy


Unused units consume 10% of peak energy
Switch-off energy for a specific component (C)
switchoff BB (C )  Idle _ cycles (C )  energy (C )  0.1
switchoff BB (C )  (WCETBB  uses BB (C ))  energy (C )  0.1

Switch-off energy for basic block BB
switchoff BB 
 switchoff
BB
(C )
Ccomponents
33
Clock Energy and Leakage Energy

Clock Energy
clockenergy BB  clockenergycycle  WCETBB

Leakage Energy
leakage _ energy BB  leakage _ energycycle  WCETBB
34
Overlap among basic blocks
Time
t1
B1
B2
B1
t2
t3
BB
WCETBB
t4
t5
B3
B3
35
Switch-off Energy


Unused units consume 10% of peak energy
Switch-off energy for a specific component (C)
switchoffBB(C )  Idle _ cycles (C )  energy (C )  0.1
switchoffBB(C )  (WCETBB  usesBB(C ))  energy (C )  0.1

Switch-off energy for basic block BB
switchoffBB 
 switchoff
BB
(C )
Ccomponents
36
Instruction Cache Modeling




Context based ILP formulation used in WCET
analysis [Li et al RTSS 2004]
Basic block divided into memory blocks
A context comprises of mapping each of
these memory blocks to hit/miss
Estimate the worst-case energy of each
context taking into account main memory
access energy
37
Modeling Branch miss-prediction
Time
t1
BB’
t2
BB’
BX
t3
BX
BB
BB
38
Objective function
Energy  i 1  j i C (i ) energyj  i (c,  )  countj  i (c,  )
N
 energyj  i (m,  )  countj  i (m,  )




count(c,ω) is the number of times the basic block Bi is executed
with path from Bj and the branch is predicted correctly
count(m,ω) is similarly defined where the branch is misspredicted
In a similar manner energy(c,ω) and energy(m,ω) are defined
The ILP problem is solved to generate values for count using
constraints similar to WCET analysis
39
Results




Platform: Simplescalar toolset
Modified WCET analysis tool [Li et al RTSS
2004] to estimate worst-case energy
Energy values for processor components
derived from parameterized models in Wattch
ILP problem is solved using CPLEX
40
Results



Compare estimated WCEC against the
observed values for eleven benchmarks
Observed values are obtained using Wattch
power simulator
Actual inputs producing WCEC is unknown

Manually select inputs that might produce WCEC
41
Styles of Clock Gating



Simple: Peak power is consumed even if
there is one access to a specific component
Ideal : Power consumed is proportional to the
number of ports accessed
Realistic: Same as ideal but unused
components consume switch-off power
42
Results
Simple Clock Gating

Ideal Clock Gating
Benchmarks
Est(µJ)
Obs(µJ)
Ratio
Est(µJ)
Obs(µJ)
Ratio
isort
524.95
455.94
1.15
468.85
422.76
1.11
fft
11057.50
9185.39
1.20
9600.99
8586.49
1.12
fdct
99.31
88.79
1.11
89.92
83.63
1.08
ludcmp
115.39
100.32
1.15
98.75
92.77
1.06
matsum
1227.37
994.11
1.23
1012.83
929.94
1.09
minver
74.91
64.15
1.17
63.66
59.61
1.07
bsearch
3.51
3.07
1.14
2.54
2.40
1.06
des
613.16
553.74
1.10
546.41
518.22
1.05
matmult
172.39
136.93
1.26
149.70
132.08
1.13
qsort
39.50
33.84
1.17
34.90
31.16
1.12
qurt
16.36
12.97
1.26
13.98
11.91
1.17
Results for ideal clock gating more accurate than
simple because of distribution of accesses
43
Results
Realistic Clock Gating

Ideal Clock Gating
Benchmarks
Est(µJ)
Obs(µJ)
Ratio
Est(µJ)
Obs(µJ)
Ratio
isort
596.93
525.88
1.14
468.85
422.76
1.11
fft
13631.21
10260.86
1.33
9600.99
8586.49
1.12
fdct
121.65
105.57
1.15
89.92
83.63
1.08
ludcmp
139.75
119.33
1.17
98.75
92.77
1.06
matsum
1397.72
1154.31
1.21
1012.83
929.94
1.09
minver
90.95
80.80
1.13
63.66
59.61
1.07
bsearch
3.81
3.07
1.24
2.54
2.40
1.06
des
715.58
643.75
1.11
546.41
518.22
1.05
matmult
212.94
166.88
1.28
149.70
132.08
1.13
qsort
49.84
43.73
1.14
34.90
31.16
1.12
qurt
21.95
17.65
1.24
13.98
11.91
1.17
Results for ideal clock gating more accurate than realistic
because of conservative WCET estimation
44
Conclusion


Static worst-case energy estimation
technique that takes into account pipelining,
instruction cache and branch prediction
Future work


Validation using commercial processors
Explore the possibility of providing thermal
guarantees
45
Execution of an Add Instruction
ADD
IF
I-Cache Access
ADD
ID
Instruction Decode + Rename Logic
ADD
ADD
ADD
ADD
ISSUE
EX
WB
CM
Wakeup + Selection logic
Register File Read + Add unit access
Result Bus
ROB-retire + Register file Update
46
Instruction Specific Energy



Each Component Accessed once
Selection logic maybe accessed multiple times
Instruction Specific Energy is
dynamicBB  selection _ powercycle  wcetBB  instrBB dynamicinstr
47