Power Management in Large Server Clusters
High-Performance
Power-Aware Computing
Vincent W. Freeh
Computer Science
NCSU
[email protected]
Acknowledgements
NCSU
Tyler K. Bletsch
Mark E. Femal
Nandini Kappiah
Feng Pan
Daniel M. Smith
U of Georgia
Robert Springer
Barry Rountree
Prof. David K. Lowenthal
The case for power management
Eric Schmidt, Google CEO:
“it’s not speed but power—low power, because data
centers can consume as much electricity as a small
city.”
Power/energy consumption becoming key issue
Power limitations
Energy = Heat; Heat dissipation is costly
Non-trivial amount of money
Consequence
Excessive power consumption limits performance
Fewer nodes can operate concurrently
Goal
Increase power/energy efficiency
More performance per unit power/energy
How: CPU scaling
Reduce frequency & voltage
Reduce power & performance (power ∝ frequency × voltage²)
Energy/power gears
Frequency-voltage pair
Power-performance setting
Energy-time tradeoff
Why CPU scaling?
Large power consumer
Mechanism exists
[Figure: power and application throughput both fall as CPU frequency/voltage is scaled down]
Is CPU scaling a win?
[Figure: power vs. time at the full gear over runtime T; system power Psystem splits into the CPU's share PCPU and the rest Pother, so total energy splits into ECPU and Eother]
Is CPU scaling a win?
[Figure: the same run at a reduced gear now takes time T+ΔT; PCPU drops, so the CPU energy shrinks (the benefit), while Pother is drawn for ΔT longer (the cost)]
Scaling wins when the CPU energy saved exceeds the extra energy drawn by the rest of the system during the added time.
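To make the cost/benefit picture concrete, here is a minimal sketch of that break-even test; all wattages and times below are invented for illustration, not measurements from this work.

```python
# Minimal sketch of the break-even test from these two slides. All numbers
# are illustrative assumptions.

def scaling_wins(p_cpu_full, p_cpu_reduced, p_other, t_full, t_reduced):
    """True if the reduced gear uses less total energy than the full gear.

    Benefit: CPU power drops from p_cpu_full to p_cpu_reduced.
    Cost: the rest of the system (p_other) draws power for the extra time.
    """
    e_full = (p_cpu_full + p_other) * t_full
    e_reduced = (p_cpu_reduced + p_other) * t_reduced
    return e_reduced < e_full

# Hypothetical node: CPU 60 W -> 30 W, other components 50 W,
# runtime stretches from 100 s to 110 s.
print(scaling_wins(60.0, 30.0, 50.0, 100.0, 110.0))  # True: 8800 J < 11000 J
```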
Our work
Exploit bottlenecks
Application waiting on bottleneck resource
Reduce power consumption (non-critical resource)
Generally CPU not on critical path
Bottlenecks we exploit
Intra-node (memory)
Inter-node (load imbalance)
Contributions
Impact studies [HPPAC ’05] [IPDPS ’05]
Varying gears/nodes [PPoPP ’05] [PPoPP ’06 (submitted)]
Leveraging load imbalance [SC ’05]
Methodology
Cluster used: 10 nodes, AMD Athlon-64
Processor supports 7 frequency-voltage settings (gears):

Frequency (MHz)  2000  1800  1600  1400  1200  1000  800
Voltage (V)       1.5   1.4  1.35   1.3   1.2   1.1  1.0
Measure
Wall clock time (gettimeofday system call)
Energy (external power meter)
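As a back-of-the-envelope check, the power ∝ frequency × voltage² relation from the CPU-scaling slide can be combined with this gear table to estimate each gear's dynamic power relative to the top gear. This is only a sketch: it ignores static/leakage power and the rest of the system.

```python
# Relative CPU dynamic power per gear, using power ∝ frequency × voltage²
# and the Athlon-64 gear table above. Ratios are rough estimates only.

GEARS = [  # (frequency in MHz, voltage in V)
    (2000, 1.5), (1800, 1.4), (1600, 1.35), (1400, 1.3),
    (1200, 1.2), (1000, 1.1), (800, 1.0),
]

def relative_power(freq_mhz, volts, base=GEARS[0]):
    """Dynamic power relative to the top gear: (f/f0) * (V/V0)**2."""
    f0, v0 = base
    return (freq_mhz / f0) * (volts / v0) ** 2

for f, v in GEARS:
    print(f"{f:4d} MHz @ {v:.2f} V -> {relative_power(f, v):.2f}x top-gear power")
# The 800 MHz / 1.0 V gear draws roughly 0.18x the top gear's dynamic power,
# which helps explain why memory-bound codes can save energy with little slowdown.
```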
NAS benchmarks
CG – 1 node
[Figure: CG energy-time results at 2000 MHz vs. 800 MHz: +1% time, -17% energy]
Not CPU bound:
• Little time penalty
• Large energy savings
EP – 1 node
[Figure: EP energy-time results: +11% time, -3% energy]
CPU bound:
• Big time penalty
• Little or no energy savings
Operations per miss (OPM)
CG: 8.60
BT: 79.6
SP: 49.5
EP: 844
Low OPM indicates a memory-bound code (CG); high OPM indicates a CPU-bound code (EP).
Multiple nodes – EP
S2 = 2.0, S4 = 4.0, S8 = 7.9
E = 1.02
Perfect speedup: E stays constant as N increases
(SN is speedup on N nodes; E is energy normalized to the one-node run)
Multiple nodes – LU
S2 = 1.9, E2 = 1.03
S4 = 3.3, E4 = 1.15
S8 = 5.3, E8 = 1.16
Gear 2: S8 = 5.8, E8 = 1.28
Good speedup: energy-time tradeoff as N increases
Phases
Phases: LU
Phase detection
First, divide the program into blocks
All code in a block executes in the same gear
Block boundaries:
MPI operation
Expected OPM change
Then, merge adjacent blocks into phases (see the sketch below)
Merge if similar memory pressure, using OPM:
|OPMi - OPMi+1| small
Also merge blocks that are small (short execution time)
Note, in future:
Leverage large body of phase detection research
[Kennedy & Kremer 1998] [Sherwood, et al 2002]
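A minimal sketch of that merging pass, assuming a simple Block record holding a profiled OPM value and an execution time; the tolerance and minimum-time thresholds are invented, since the slides give no concrete values.

```python
# Greedy left-to-right merge of profiled blocks into phases: merge when the
# neighbors have similar OPM (memory pressure) or when a block is tiny.
from dataclasses import dataclass

@dataclass
class Block:
    opm: float      # operations per L2 miss, from profiling
    seconds: float  # measured execution time of the block

def merge_blocks(blocks, opm_tol=10.0, min_seconds=0.05):
    phases = []
    for b in blocks:
        if phases:
            last = phases[-1]
            similar = abs(last.opm - b.opm) < opm_tol
            tiny = b.seconds < min_seconds
            if similar or tiny:
                # time-weighted OPM for the merged phase
                total = last.seconds + b.seconds
                opm = (last.opm * last.seconds + b.opm * b.seconds) / total
                phases[-1] = Block(opm, total)
                continue
        phases.append(Block(b.opm, b.seconds))
    return phases

blocks = [Block(8.6, 1.2), Block(9.1, 0.9), Block(80.0, 2.5), Block(79.0, 0.01)]
print(len(merge_blocks(blocks)))  # 2 phases: memory-bound, then CPU-bound
```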
Data collection
Use MPI-jack
Pre and post hooks
For example: program tracing, gear shifting
[Figure: MPI-jack code interposed between the MPI application and the MPI library]
Gather profile data during execution
Define MPI-jack hook for every MPI operation
Insert pseudo MPI call at end of loops
Information collected:
Type of call and location (PC)
Status (gear, time, etc)
Statistics (uops and L2 misses for OPM calculation)
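To illustrate the pre/post-hook pattern, here is a hypothetical Python sketch. It is not MPI-jack's actual interface (MPI-jack is a C-level MPI interposition tool); the recorder below is an assumed stand-in.

```python
# Illustrative pre/post-hook interposition pattern, in the style the slide
# describes. NOT MPI-jack's real API.
import functools
import time

profile = []  # (operation name, elapsed seconds) gathered during execution

def jacked(fn):
    """Wrap an 'MPI operation' with a pre hook and a post hook."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # pre hook: timestamp entry
        result = fn(*args, **kwargs)
        profile.append((fn.__name__, time.perf_counter() - start))  # post hook
        return result
    return wrapper

@jacked
def mpi_barrier():
    time.sleep(0.01)  # stand-in for a real MPI_Barrier call

mpi_barrier()
print(profile)  # e.g. [('mpi_barrier', 0.0101...)]
```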
Example: bt
Comparing two schedules
What is the “best” schedule?
Depends on user
User supplies “better” function
bool better(i, j)
Several metrics can be used
Energy-delay
Energy-delay squared [Cameron et al. SC2004]
Slope metric
[Figure: candidate schedules i and j plotted in the energy (E) vs. time (T) plane against a limit line]
This project uses slope: the energy-time tradeoff between two schedules
Slope = -1: energy savings equals time delay (the energy-delay product)
User defines the limit:
Limit = 0: minimize energy
Limit = -∞: minimize time
If slope < limit, then better
We do not advocate this metric over others
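A small sketch of the slope test. Forming the slope from relative changes in E and T (so that slope = -1 means equal percentage energy savings and time delay) is my reading of the slide; only the decision rule "slope < limit" is given. The -1.5 limit matches the bt example that follows.

```python
# Slope-based 'better' comparison between two candidate schedules i and j.

def slope(e_i, t_i, e_j, t_j):
    """Relative energy change per relative time change, going from i to j."""
    return ((e_j - e_i) / e_i) / ((t_j - t_i) / t_i)

def better(e_i, t_i, e_j, t_j, limit=-1.5):
    """True if schedule j beats schedule i under the user-supplied limit."""
    return slope(e_i, t_i, e_j, t_j) < limit

# Invented numbers: 6% energy saved for 2% extra time -> slope = -3.0
print(better(100.0, 50.0, 94.0, 51.0))  # True with the -1.5 limit
```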
Example: bt
Solutions compared   Slope   < -1.5?
1. 00 vs 01          -11.7   true
2. 01 vs 02          -1.78   true
3. 02 vs 03          -1.19   false
4. 02 vs 12          -1.44   false
02 is the best
(Each two-digit schedule appears to name the gear chosen per phase, e.g. 02 = gear 0 in one phase, gear 2 in the other.)
Benefit of multiple gears: mg
Current work: no. of nodes, gear/phase
Load imbalance
Node bottleneck
Best course is to keep the load balanced
But load balancing is hard
Alternative: slow down when not the critical node
How to tell if a node is not critical?
Suppose a barrier:
All nodes must arrive before any leave
No benefit to arriving early
Measure blocking time
Assume it is (mostly) the same between iterations
Assumptions:
Iterative application
Past predicts future
Example
[Figure: two iterations with synch points; in iteration k the node runs at performance 1 and arrives early, leaving slack before the synch point; in iteration k+1 it uses the predicted slack to run at performance (t - slack)/t and arrive just in time]
Reduced performance & power
Energy savings
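The (t - slack)/t rule from the figure as a one-line sketch; the timing numbers are invented.

```python
# Slowdown rule: with predicted slack s in an iteration of length t,
# target relative performance (t - s) / t for the next iteration.
def target_performance(t, slack):
    return (t - slack) / t

print(target_performance(1.0, 0.25))  # 0.75: run at ~75% speed, arrive just in time
```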
Measuring slack
Blocking operations:
Receive
Wait
Barrier
Measure with MPI-jack
Too frequent:
Can be hundreds or thousands per second
So aggregate slack over one or more iterations
Computing slack, S:
Measure the times of the computing and blocking phases:
T = C1 + B1 + C2 + B2 + … + Cn + Bn
Compute aggregate slack:
S = (B1 + B2 + … + Bn) / T
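That formula as a runnable sketch; the per-phase compute times (C) and blocking times (B) below are placeholders, not measurements.

```python
# Aggregate slack S = (B1 + ... + Bn) / T over one or more iterations.

def aggregate_slack(compute_times, blocking_times):
    """Fraction of total time spent blocked: T is the sum of all phases."""
    total = sum(compute_times) + sum(blocking_times)
    return sum(blocking_times) / total

C = [0.40, 0.35, 0.45]  # seconds spent computing in each phase
B = [0.05, 0.10, 0.05]  # seconds blocked in Receive / Wait / Barrier
print(f"S = {aggregate_slack(C, B):.2f}")  # S = 0.14
```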
Communication slack
[Figure: measured slack for Aztec, Sweep3d, and CG]
Slack varies between nodes: each node individually determines its slack
Slack varies between applications
Use net slack: a reduction finds the minimum slack across nodes
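A sketch of that min-reduction using mpi4py (a convenience choice here; the original work is C/MPI). The per-node slack value is a placeholder.

```python
# Each node computes its own slack fraction, then an allreduce finds the
# minimum across nodes, which bounds how far any node can safely slow down.
from mpi4py import MPI

comm = MPI.COMM_WORLD
my_slack = 0.10 + 0.01 * comm.Get_rank()  # placeholder per-node slack
net_slack = comm.allreduce(my_slack, op=MPI.MIN)  # net (min) slack
if comm.Get_rank() == 0:
    print(f"net slack across {comm.Get_size()} nodes: {net_slack:.2f}")
```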
Shifting
When to reduce performance? When there is enough slack
When to increase performance? When application performance suffers
Create high and low limits for slack; damping is needed
Dynamically learn the limits:
Not the same for all applications
Range starts small, increases if necessary
[Figure: slack over time T, with bands for reduce gear / same gear / increase gear]
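A minimal sketch of that banded controller. The 5%/15% band and the fixed (unlearned) limits are simplifying assumptions; the slides learn the limits dynamically per application.

```python
# Banded gear controller: shift down on high slack, up on low slack,
# hold in between.

NUM_GEARS = 7  # gear 0 = fastest (2000 MHz), gear 6 = slowest (800 MHz)

def next_gear(gear, slack, low=0.05, high=0.15):
    """One damped control step driven by the latest aggregate slack."""
    if slack > high and gear < NUM_GEARS - 1:
        return gear + 1  # plenty of waiting: shift to a slower gear
    if slack < low and gear > 0:
        return gear - 1  # nearly critical: shift to a faster gear
    return gear          # inside the band: hold

gear = 0
for s in [0.20, 0.18, 0.10, 0.03]:  # slack from successive iterations
    gear = next_gear(gear, s)
print(gear)  # 1: shifted down twice, held, then shifted back up once
```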
Aztec gears
Performance
[Figures: Sweep3d and Aztec]
Synthetic benchmark
Summary
Contributions
Improved energy efficiency of HPC applications
Found simple metric for phase boundary location
Developed simple, effective linear-time algorithm for determining proper gears
Leveraged load imbalance
Future work:
Reduce sampling interval to a handful of iterations
Reduce algorithm time with modeling and prediction
Develop AMPERE, a message passing environment for reducing energy
http://fortknox.csc.ncsu.edu:osr/
[email protected] [email protected]
End