Tools and Techniques for Power and Reliability Optimizations
Designing Tomorrow’s
Computing Platforms
Challenges, Solutions, and Tools
Sudhanva Gurumurthi
e-mail: [email protected]
Talk Outline
• Modern Computer Architecture
– The Good
– The Bad
– The Ugly
• My Previous Work
• Current and Future Research
The Good
Source: http://www.intel.com/technology/silicon/mooreslaw/
Microprocessor Technology Advancement
• Plentiful Transistors
– Superscalar, SMT, CMP
– Larger caches, deeper memory hierarchies
– High-bandwidth access to memory
• Simultaneously, clock frequencies have grown tremendously
Storage Has Become Ubiquitous
[Chart: growth in drive performance, showing density and speed trends over time]
Source: Hitachi GST Technology Overview Charts, http://www.hitachigst.com/hdd/technolo/overview/storagetechchart.html
The Bad
Power Dissipation
[Chart: power dissipation, 0 to 90 W, rising across processor generations from the 8086, 286, 386, and 486 through the Pentium, Pentium III, and Pentium 4]
Particle Induced Soft-Errors
[Figure: a particle strike flips a stored bit from 0 to 1]
Source: FACT Group, Intel
Are you kidding me?
• No!!
– In 2000, Sun Microsystems reported random crashes in one of its server products due to a lack of parity protection in the caches.
– Eugene Normand's study of the error logs of large systems indicated several such errors.
– There are conference sessions and even full conferences/workshops devoted to this problem.
– I have personal experience collecting and analyzing soft-error data.
Where Do These Particles Come From?
• Neutrons
– Terrestrial cosmic rays
• Alpha particles
– Packaging
Should we worry?
• Yes!!
– Thanks to Moore's Law
• Lower operating voltages
• Exponential increase in transistor integration density
• Power management (voltage-scaling)
– Larger systems
• Impractical to shield against cosmic rays
– Need several feet of concrete
– Radiation-hardening hurts performance, area, and cost
Redundant Multi-Threading
[Diagram: an input replicator feeds identical inputs to redundant threads; an output comparator checks their results before they reach the rest of the system]
Source: Mukherjee et al, “Detailed Design and Evaluation of Redundant Multithreading Alternatives”, ISCA’02
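To make the picture concrete, here is a minimal Python sketch of the detection idea: a replicator feeds the same inputs to two redundant worker threads, and a comparator flags any mismatch before a result escapes. The queue-based structure, the names, and the stand-in computation are illustrative assumptions, not the design from Mukherjee et al.

```python
# A minimal sketch of redundant multi-threading (RMT) detection, under
# the assumptions stated above. The squaring function is a stand-in for
# real work; a particle-induced flip in either thread would surface as
# a mismatch at the comparator.
import queue
import threading

def worker(inputs, outputs):
    # One redundant thread: consume replicated inputs, produce outputs.
    while True:
        x = inputs.get()
        if x is None:        # sentinel: no more work
            break
        outputs.put(x * x)   # stand-in for the protected computation

def run_redundant(values):
    in_a, in_b, out_a, out_b = (queue.Queue() for _ in range(4))
    threads = [threading.Thread(target=worker, args=(in_a, out_a)),
               threading.Thread(target=worker, args=(in_b, out_b))]
    for t in threads:
        t.start()
    for v in values:         # input replicator: same value to both copies
        in_a.put(v)
        in_b.put(v)
    in_a.put(None)
    in_b.put(None)
    for t in threads:
        t.join()
    results = []
    while not out_a.empty(): # output comparator: the copies must agree
        a, b = out_a.get(), out_b.get()
        if a != b:
            raise RuntimeError("soft error detected: outputs differ")
        results.append(a)
    return results

print(run_redundant([1, 2, 3]))   # [1, 4, 9]
```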
Performance of Redundant Multi-Threading
[Chart: percentage of IPC lost to redundant multi-threading, on a 0-45% scale, for the SPEC benchmarks gzip, swim, vpr, gcc, mesa, art, mcf, equake, parser, vortex, and bzip2]
Temperature Affects Disk Drive Reliability
• Heat-Related Problems
– Data corruption
– Higher off-track errors
– Head crashes
• Disk drive design is constrained by the thermal envelope, which puts a limit on drive performance
Source: D. Anderson et al, “More than an Interface – SCSI vs. ATA”, FAST 2003.
Thermal-Constrained Design
Data Rate ≈ (Linear Density) × (RPM) × (Diameter)
Power ≈ (# Platters) × (RPM)^2.8 × (Diameter)^4.6
[Diagram: to sustain 40% annual internal data rate (IDR) growth against a fixed power budget, a single-platter drive can increase RPM, raising temperature through the (RPM)^2.8 term, or shrink the platter, exploiting the (Diameter)^4.6 term at the cost of lower capacity and lower data rate]
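Since both relations are simple power laws, the trade-off can be checked with a few lines of arithmetic. The sketch below, using made-up baseline numbers (a hypothetical single-platter, 10,000 RPM, 2.6" drive), shows why RPM increases are thermally expensive while platter shrinks are so effective:

```python
# Back-of-the-envelope check of the two scaling relations on this slide.
# All baseline numbers are invented for illustration.

def power(platters, rpm, diameter):
    # Power ~ (# Platters) * RPM^2.8 * Diameter^4.6 (proportionality only)
    return platters * (rpm ** 2.8) * (diameter ** 4.6)

base_rpm, base_dia = 10_000, 2.6   # hypothetical baseline drive

# Doubling RPM doubles the data rate but multiplies power by 2^2.8:
print(power(1, 2 * base_rpm, base_dia) / power(1, base_rpm, base_dia))  # ~7.0

# Shrinking the platter to 1.6" cuts power by (1.6/2.6)^4.6, at the cost
# of capacity and the diameter term of the data rate:
print(power(1, base_rpm, 1.6) / power(1, base_rpm, base_dia))           # ~0.11
```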
The Bad
Drive Temperature
[Chart: projected drive temperature (°C, log scale from 10 to 1000) versus year, 2002 to 2012, for 2.6", 2.1", and 1.6" platters, plotted against the thermal envelope]
The Bad
Data Rate
[Chart: projected data rate by year; annotation: 30-60% performance boost for a 10,000 RPM increase]
Search-Engine Thermal Behavior
[Chart: drive thermal behavior under a search-engine workload; thermal envelope = 45.22 °C]
The Ugly
Design Tools
• Designing complex systems requires extensive simulation
• Need to model all aspects of the system
– Software layers
– Power
– Temperature
– Effect of faults
Simulation Problems
• Painfully slow
– Speed vs. accuracy trade-off
• No good support available for modeling effects like temperature and reliability
• Simulators can themselves be hard to write
• Buggy
My Previous Work
Thesis Work:
Power Management of
Enterprise Storage Systems
Enterprise Storage Market Growth
• Storage demand is growing at an annual rate of 60%
– By 2008, a company would manage 10 times the storage it has today.
Sources:
1. “Enterprise Storage: A Look into the Future”, TNM Seminar Series, Oct. 31, 2000
2. “More Power Needed”, Energy User News, Nov. 2002
Power Demands of Data Centers
"What matters most to the computer designers at Google is not speed but power – low-power – because data centers can consume as much electricity as a city" – Eric Schmidt, CEO, Google
• Data centers consume several megawatts of power
• Electricity bill
– $4 billion/year
– Disks account for 27% of computing-load costs
• Difficult to cool at high power densities
Sources:
1. "Intel's Huge Bet Turns Iffy", New York Times article, September 29, 2002
2. "Power, Heat, and Sledgehammer", Apr. 2002.
3. "Heat Density Trends in Data Processing, Computer Systems, and Telecommunications Equipment", 2000.
Data Center Cooling Costs
[Pie chart: servers 51%, air-conditioning 42%, other 7% of data center power]
• Data center of a large financial institution in New York City
– Power consumption ~ 4.8 MW
Source: "Energy Benchmarking and Case Study – NY Data Center No. 2", Lawrence Berkeley National Lab, July 2003.
Where Does Power Go?
[Diagram: drive power by operating mode, split between the spindle motor (SPM) and the voice-coil motor (VCM)]
• Active = 11 W
• Idle = 9 W
• Seek = 13 W (the VCM contributes ~4 W)
• Standby = 1 W
Traditional Power Management (TPM)
[Timeline: the disk is active; when idleness is detected it spins down to standby mode; the next disk request triggers a spinup and the disk becomes active again]
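As a rough illustration, the timeout policy in the timeline can be costed out by reusing the mode powers from the "Where Does Power Go?" slide and the Ultrastar's 15-second spindown. The 30-second timeout and the simplifications (spindown charged at idle power, spinup energy ignored) are assumptions for this sketch:

```python
# Energy of traditional power management (TPM) over a single idle gap,
# under the simplifying assumptions stated above.
IDLE_W, STANDBY_W = 9.0, 1.0       # from the drive power slide
SPINDOWN_S = 15.0                  # Ultrastar-class spindown latency
IDLE_THRESHOLD_S = 30.0            # hypothetical timeout

def tpm_energy(gap_s):
    """Joules spent by TPM across one idle gap of gap_s seconds."""
    if gap_s <= IDLE_THRESHOLD_S + SPINDOWN_S:
        return gap_s * IDLE_W                 # gap too short to reach standby
    standby_s = gap_s - IDLE_THRESHOLD_S - SPINDOWN_S
    return (IDLE_THRESHOLD_S * IDLE_W         # waiting out the timeout
            + SPINDOWN_S * IDLE_W             # spindown, charged at idle power
            + standby_s * STANDBY_W)          # parked in standby

# Server-style gaps of a few ms never amortize the transition, which is
# why the feasibility study below answers "No".
for gap in (0.005, 5.0, 300.0):
    print(f"{gap:7.3f} s gap: TPM {tpm_energy(gap):7.1f} J, "
          f"always-idle {gap * IDLE_W:7.1f} J")
```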
I/O Characteristics of Server Systems
• Large number of disks
– RAID arrays
• Heavier I/O loads sustained over long periods.
• Stringent performance requirements.
• Server disks are physically different
– Not made to use spindowns
– Longer spindown/spinup latencies
• Server disk (Hitachi Ultrastar): 15 seconds / 26 seconds
• Laptop disk (Hitachi Travelstar): 4.5 seconds
Feasibility of Applying TPM
• No prior study on how to tackle this problem systematically.
• Questions
1. Is there idleness?
2. Can we do TPM?
• Answers
1. Yes
2. No! Why??
• A large number of very short (few-ms) idle periods
The Solution
• Traditional Power Management
– Not effective for server workloads
• Power ≈ (# Platters) × (RPM)^2.8 × (Diameter)^4.6
– All three can be varied at design-time to meet the power budget
• Laptop vs. server disk
– RPM could be varied dynamically
• Dynamic RPM (DRPM)
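For flavor, here is a hedged sketch of what a DRPM-style controller could look like: the drive steps among several speeds, spinning down when the request queue stays short and back up when it grows, with spindle power scaling as RPM^2.8. The speed levels, watermarks, and queue-based trigger are simplified illustrations, not the exact control policy of the ISCA'03 paper.

```python
# A simplified DRPM-style controller, under the assumptions stated above.
RPM_LEVELS = [3600, 6000, 9000, 12000, 15000]   # hypothetical speed steps

def next_level(level, queue_len, low_wm=1, high_wm=8):
    """Pick the next RPM level from the observed request-queue length."""
    if queue_len >= high_wm and level < len(RPM_LEVELS) - 1:
        return level + 1        # load building up: spin faster
    if queue_len <= low_wm and level > 0:
        return level - 1        # nearly idle: spin slower, save power
    return level

def spindle_power(rpm, full_rpm=15000, full_w=11.0):
    """Scale full-speed active power by (rpm / full_rpm)^2.8."""
    return full_w * (rpm / full_rpm) ** 2.8

level = len(RPM_LEVELS) - 1     # start at full speed
for qlen in [9, 0, 0, 0, 12, 3]:                # a toy queue-length trace
    level = next_level(level, qlen)
    rpm = RPM_LEVELS[level]
    print(f"queue={qlen:2d} -> {rpm:5d} RPM, ~{spindle_power(rpm):4.1f} W")
```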
Potential Benefits of DRPM
[Chart: % savings in idle energy (E_idle), 0 to 80%, versus mean inter-arrival time from 10 ms to 100,000 ms, for TPM_perf, DRPM_perf, and the combined scheme]
Control-Policy Performance
Research Impact
• The feasibility study [ISPASS'03] spurred new research in server disk power management
– Active groups: UIUC, Rutgers, UMass, UArizona, Rochester
• The DRPM paper [ISCA'03] is widely cited at architecture and systems conferences such as ISCA, HPCA, ASPLOS, SOSP, and OSDI
• Multi-speed drives are starting to appear in the market
– Hitachi Deskstar 7K400
My Other Work
• Microarchitectural Techniques to Enhance Redundant Multi-Threading Performance
– Instruction Reuse [ISCA'04]
• Soft-Error Data Collection and Analysis from Actual Systems (Intel)
• Soft-Error Tolerant Cache-Coherence Protocols (Intel)
• Simulator Design
– SoftWatt [HPCA'02]
– MEMSIM (IBM Research)
More Details About My Work
• Papers:
– S. Gurumurthi et al., "Disk Drive Roadmap from the Thermal Perspective: A Case for Dynamic Thermal Management", ISCA 2005.
– A. Parashar et al., "A Complexity-Effective Approach to ALU Bandwidth Enhancement for Instruction-Level Temporal Redundancy", ISCA 2004.
– S. Gurumurthi et al., "DRPM: Dynamic Speed Control for Power Management in Server Class Disks", ISCA 2003.
– S. Gurumurthi et al., "Using Complete Machine Simulation for Software Power Estimation: The SoftWatt Approach", HPCA 2002.
• Available via my CS Department homepage.
Some Research Directions
• Temperature-Aware Storage Systems
– Devices
– Systems issues
• Multi-Dimensional Approach to Fault Tolerance
– Tradeoffs between performance, power, reliability
– Dynamic adaptation
• Microarchitectural Support for Security
• Design of accurate and fast simulation tools
Research Directions in Storage
• Storage architecture is still quite a nascent field
• Plenty of research opportunities:
– Emerging technologies
• MEMS, holographic, molecular storage
– New research avenues
• Security
• Application/content-awareness
• Active disks
• Long-term and survivable storage
Looking for Students!
• I shall be offering a research course in Spring 2006.
– Many project opportunities
• Contact Information:
– E-mail: gurumurthi@cs
– Office: 236B, Olsson Hall
Approach 1: Seek Throttling
[Diagram: temperature versus time; the VCM is switched off when the temperature nears the thermal envelope and switched back on once the drive cools]
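A minimal sketch of the seek-throttling idea follows: track a first-order thermal model of the drive and switch the voice-coil motor (VCM) off when the temperature nears the thermal envelope, resuming seeks once the drive has cooled. The heating and cooling constants and the thresholds are invented for illustration; only the 45.22 °C envelope and the on/off control idea come from the talk.

```python
# Seek throttling as a simple hysteresis controller, under the
# illustrative thermal constants stated above.
ENVELOPE_C = 45.22        # thermal envelope from the earlier slide
AMBIENT_C = 30.0
HEAT_PER_STEP_C = 2.0     # heating per time step of seeking (assumed)
COOL_FRACTION = 0.05      # fraction of excess heat shed per step (assumed)
THROTTLE_AT_C = ENVELOPE_C - 2.0   # stop seeking above this (guard band)
RESUME_AT_C = ENVELOPE_C - 5.0     # resume seeking below this (hysteresis)

temp_c, vcm_on = AMBIENT_C, True
for step in range(20):
    if vcm_on:
        temp_c += HEAT_PER_STEP_C                   # seeks heat the drive
    temp_c -= COOL_FRACTION * (temp_c - AMBIENT_C)  # Newtonian cooling
    if temp_c >= THROTTLE_AT_C:
        vcm_on = False                              # throttle: VCM off
    elif temp_c <= RESUME_AT_C:
        vcm_on = True                               # cooled down: VCM on
    print(f"step {step:2d}: {temp_c:5.2f} C, VCM {'on' if vcm_on else 'off'}")
```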
Results
2-42% reduction in IPC gap (avg. 23%)
[Chart: percentage of the IPC gap (SIE-DIE) recovered, 0% to 100%, by DIE-IRB-1K-sat, DIE-2xALU, and DIE-IRB-ideal across the benchmarks 164.gzip, 171.swim, 175.vpr, 176.gcc, 177.mesa, 179.art, 181.mcf, 183.equake, 188.ammp, 197.parser, 255.vortex, 256.bzip2, and their average]