slides - Computer Science

Zehan Cui, Yan Zhu,
Yungang Bao, Mingyu Chen
Institute of Computing Technology, Chinese Academy of Sciences
July 28, 2011

Motivation

Design & Implementation

Experiments

Conclusion & Work in Progress

Motivation

Watts/Server 
[source: The Problem of Power Consumption in Servers,Intel,2009]

The CPU no longer dominates system power.
[source: Barroso et al., The Datacenter as a Computer, 2009]

Measurement is the basis.
[Diagram: measurement underlies low-power work on hardware, models, and software.]

Component-Level: ATX-based method
accuracy
Directly powered through ATX wires.
Modern motherboards mostly have dedicated ATX wires for the processor.
VRM (Voltage Regulation Module) loss
Usually deduced from multiple ATX wires.
Platform dependent.

Design & Implementation

Disk & CPU
◦ Similar to other ATX-based methods

Memory & Add-in Card Devices
◦ Wrapper-based methods

Advantages
◦ Accurate: direct measurement
◦ Easy-to-use: no deduction needed
◦ Portable: multi-platform
[Diagram labels: Power Supply, Current Sensor]

Prototype
◦ Disk power
◦ CPU power
◦ Memory power

Component (count): description
Wrapper Card (1): Memory power measurement.
• Supports DDR2-400 DIMMs.
Intermediate Card (1): 8 channels.
• Each channel converts one current into a voltage.
DMM (2): Agilent 34411A.
• One channel each.
• Max speed: 50K samples per second.
• LAN interface.
Collector (1): PC.
• Collects data from the DMMs.

Experiments

Component: detail
CPU: Intel Core2 Duo E4500
• # of cores: 2
• Clock speed: 2.2GHz
• L2 cache: 2MB
• FSB speed: 800MHz
Memory: DDR2-400 2GB UDIMM
• Frequency: 200MHz
• Max bandwidth: 3.2GB/s
Disk: 640GB SATA
401.bzip2 from SPEC CPU2006
[Figure: power of components (CPU, Memory, Disk) in watts, 0 to 50 W, versus time from beginning in seconds, 0 to 70 s.]

The more frequently we measure the power, the more detail we can get.
Observation:
5,000 samples/s is an appropriate sampling frequency at the component level.
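
The effect of the sampling rate can be illustrated with a small sketch. The trace below is synthetic (a made-up 2 ms burst on a constant baseline, not data from the slides), and lower sampling rates are emulated by averaging consecutive samples, which is an assumption about how the instrument behaves.

```python
def make_trace(rate_hz=50_000, seconds=1.0, base_w=10.0, burst_w=30.0,
               burst_start_s=0.5, burst_len_s=0.002):
    """Build a synthetic power trace: constant baseline plus one short burst."""
    n = round(rate_hz * seconds)
    start = round(burst_start_s * rate_hz)
    stop = start + round(burst_len_s * rate_hz)
    return [burst_w if start <= i < stop else base_w for i in range(n)]


def downsample_mean(trace, factor):
    """Emulate a lower sampling rate by averaging each group of `factor` samples."""
    return [sum(trace[i:i + factor]) / factor
            for i in range(0, len(trace) - factor + 1, factor)]


trace = make_trace()                          # 50,000 samples/s
peak_50k = max(trace)
peak_5k = max(downsample_mean(trace, 10))     # 5,000 samples/s
peak_10 = max(downsample_mean(trace, 5000))   # 10 samples/s

print(peak_50k, peak_5k, peak_10)
```

In this made-up case the burst is still fully visible at 5,000 samples/s but is almost averaged away at 10 samples/s. How short the real power events are depends on the workload and platform.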
Higher BW, but lower power
Lower BW, but higher power

Malloc 512MB; access in different strides
Two causes
◦ Row conflict
◦ Lots of TLB misses
Time: 6.5 times longer
Power: slightly lower
Energy: 5.9 times higher
→ Increase the row buffer hit rate
→ Large pages may be more efficient
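
The three ratios above are tied together by energy = average power × time. A short check, using only the two ratios stated on the slide, shows what "slightly lower" power works out to:

```python
# Ratios quoted on the slide (slow case relative to fast case).
time_ratio = 6.5
energy_ratio = 5.9

# Energy = average power * time, so the implied power ratio is:
power_ratio = energy_ratio / time_ratio

print(f"implied average power ratio: {power_ratio:.3f}")
print(f"implied power reduction: {(1 - power_ratio) * 100:.1f}%")
```

The implied average power is roughly 9% lower, which is consistent with the slide's wording, while the much longer run time drives the energy up.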
What is the relationship between
performance and power?

64MB memory
◦ Random vs. Sequential
▪ Jump at least 64B
→ eliminate cache hits
▪ Large page (2MB)
→ eliminate TLB misses
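
A minimal sketch of the two access patterns follows. It assumes the random pattern visits every 64B line of the buffer exactly once in shuffled order; the slides do not say how the authors generated their pattern, and all function names here are illustrative. It also counts how many pages each pattern touches for 4KB and 2MB page sizes.

```python
import random

BUF_BYTES = 64 * 1024 * 1024   # 64MB buffer, as on the slide
LINE_BYTES = 64                # minimum jump, one cache line (assumed 64B)


def sequential_offsets(buf_bytes=BUF_BYTES, step=LINE_BYTES):
    """Byte offsets visited in address order, one per cache line."""
    return range(0, buf_bytes, step)


def random_offsets(buf_bytes=BUF_BYTES, step=LINE_BYTES, seed=0):
    """The same offsets in a seeded random order (each line visited once)."""
    offsets = list(range(0, buf_bytes, step))
    random.Random(seed).shuffle(offsets)
    return offsets


def pages_touched(offsets, page_bytes):
    """Number of distinct pages covered by the given offsets."""
    return len({off // page_bytes for off in offsets})


small_pages = pages_touched(sequential_offsets(), 4 * 1024)
large_pages = pages_touched(random_offsets(seed=1), 2 * 1024 * 1024)
print(small_pages, large_pages)
```

With 2MB pages the whole buffer spans only 32 pages instead of 16,384, which is why large pages remove most TLB pressure from the comparison.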

Load/Store_Unit % = LSU_stall_time / CPU_Cycle
Observation:
It seems that DRAM power is already proportional to bandwidth.
But the fact is that …


Use different SEEDs to generate different random
access patterns;
Power varies less than 1.1%.
Observation:
DRAM power is highly correlated with two factors
• Load/Store Unit utilization
• Access pattern (sequential / random)
We can build memory power models based on these two factors rather than on bandwidth.
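
One way such a model could look is a separate straight-line fit of power against Load/Store Unit utilization for each access pattern. This is only a sketch: the linear form is an assumption, and the sample points are placeholders chosen to make the arithmetic easy, not measurements from the slides.

```python
def lsu_utilization(lsu_stall_time, cpu_cycles):
    """Load/Store Unit utilization as defined on the earlier slide."""
    return lsu_stall_time / cpu_cycles


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# PLACEHOLDER (utilization, watts) pairs; replace with measured data.
samples = {
    "sequential": [(0.1, 2.2), (0.5, 3.0), (0.9, 3.8)],
    "random": [(0.1, 2.5), (0.5, 4.5), (0.9, 6.5)],
}

# One (slope, intercept) pair per access pattern.
models = {
    mode: fit_line([u for u, _ in pts], [p for _, p in pts])
    for mode, pts in samples.items()
}


def predict(mode, utilization):
    """Predicted DRAM power in watts for an access pattern and LSU utilization."""
    slope, intercept = models[mode]
    return intercept + slope * utilization


print(predict("sequential", lsu_utilization(50, 200)))
```

Keeping one fit per access pattern reflects the observation that the pattern is a second factor alongside utilization; whether a linear form is adequate would have to be checked against real traces.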

Conclusion & Work in Progress

We use a hybrid approach
◦ ATX-based → CPU/Disk
◦ Wrapper card → DRAM/…

5 kHz is an appropriate sampling frequency to reveal fine-grained power behavior.
DRAM power is highly correlated with Load/Store Unit utilization, rather than bandwidth.

Upgrade the current system
◦ Support DDR3
◦ Support large memory capacity
◦ Support 40 simultaneous measuring channels
→ Use an FPGA to collect the measured data

Correlate the measured power data with high-level semantic information
Thanks!
&
Questions?
Backup

The wrapper card already existed.

We made only a few small modifications.
[Diagram labels: Current Sensor, Power Supply, Signals]

Normal
DIMM: Dual-Inline Memory Module
DIMM slot
Motherboard

With our initial wrapper card
Wrapper Card
DIMM
DIMM slot
Motherboard
[Source: H. David et al., Memory Power Management via Dynamic Voltage/Frequency Scaling, ICAC, 2011]
I/O Circuitry
• Runs at bus speed
• Clock sync/distribution
• Bus drivers and receivers
• Buffering/queueing
Banks
• Independent arrays
• Asynchronous: independent of memory bus speed
On-Die Termination
• Required by bus electrical characteristics for reliable operation
• Resistive element that dissipates power when the bus is active
[Diagram labels: Row Decoder, Column Decoder, Sense Amps, Bank 0, Drivers, Receivers, Write FIFO, Registers, ODT]

DRAM power can be approximately divided into
◦ Background power
→ considered to be stable
◦ Bank power
▪ Activate/precharge
→ related to the frequency of row operations
◦ I/O power
▪ Burst
→ proportional to bandwidth
◦ Termination power
▪ Termination resistors
→ proportional to bandwidth
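
The four-part breakdown can be written as a simple additive model. The sketch below assumes each part scales linearly with its driver; the class name, field names, and all coefficient values are hypothetical and chosen only to make the example easy to follow.

```python
from dataclasses import dataclass


@dataclass
class DramPowerModel:
    background_w: float        # stable background power (W)
    bank_j_per_row_op: float   # energy per activate/precharge pair (J)
    io_w_per_gb_s: float       # I/O power per GB/s of bandwidth
    term_w_per_gb_s: float     # termination power per GB/s of bandwidth

    def total_w(self, row_ops_per_s, bandwidth_gb_s):
        """Sum of background, bank, I/O, and termination power in watts."""
        bank = self.bank_j_per_row_op * row_ops_per_s
        io = self.io_w_per_gb_s * bandwidth_gb_s
        term = self.term_w_per_gb_s * bandwidth_gb_s
        return self.background_w + bank + io + term


# Hypothetical coefficients, for illustration only.
model = DramPowerModel(background_w=1.0, bank_j_per_row_op=1e-8,
                       io_w_per_gb_s=0.2, term_w_per_gb_s=0.1)

# Same bandwidth, different row-operation rates.
few_row_ops = model.total_w(row_ops_per_s=1e6, bandwidth_gb_s=2.0)
many_row_ops = model.total_w(row_ops_per_s=1e7, bandwidth_gb_s=2.0)
print(few_row_ops, many_row_ops)
```

Because the bank term depends on row operations rather than on bandwidth, two workloads with the same bandwidth can draw different power under this model.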
P = U × I
• U (DC voltage): doesn't fluctuate much, less than 2% on our platform.
• I (DC current): DC current → CSA (Current-Sense Amplifier) → DC voltage → ADC or DMM → Data Collector (PC)
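
On the collector side, turning the recorded sense voltages into power and energy could look like the sketch below. It assumes the current-sense amplifier has a linear transfer ratio (`volts_per_amp`, a hypothetical parameter), treats the supply voltage as constant, and integrates with a simple rectangle rule; the sample values are made up.

```python
def current_from_sense_voltage(v_sense, volts_per_amp):
    """Recover the DC current from the amplifier's output voltage."""
    return v_sense / volts_per_amp


def power_trace(v_sense_samples, supply_v, volts_per_amp):
    """P = U * I for each sample, with U assumed constant."""
    return [supply_v * current_from_sense_voltage(v, volts_per_amp)
            for v in v_sense_samples]


def energy_joules(power_w, sample_rate_hz):
    """Rectangle-rule integration of power over time."""
    return sum(power_w) / sample_rate_hz


# Made-up readings: three samples taken at 5,000 samples/s on a 12 V rail.
power = power_trace([0.5, 1.0, 1.5], supply_v=12.0, volts_per_amp=0.5)
energy = energy_joules(power, sample_rate_hz=5000)
print(power, energy)
```

Treating U as a constant follows the note that the voltage fluctuates little; if higher accuracy is needed, the voltage could be sampled as well and multiplied per sample.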

Possible reason for the non-proportional random-access power on slide 17:
◦ When bandwidth is low, auto-precharge (caused by refresh) means every access needs an ACTIVE command, so bank power is proportional to bandwidth.
◦ When bandwidth is high, some accesses may hit in the row buffer and need fewer ACTIVE commands, so bank power rises with a lower slope than before.