slides - Duke People


+
A Dynamic Compilation Framework for Controlling
Microprocessor Energy and Performance
Gift Nyikayaramba
30 September 2014
+
Overview
- Present framework for RDO with dynamic compilation
    Key design issues:
    1. Code region selection
    2. DVFS decision
    3. Code insertion/transformation
- Implement prototype on an industrial-strength dynamic compilation environment
    Platform: Intel PIN variant
- Deploy prototype for test
    Deployment environment: Intel Pentium M processor on an Intel development
    board for direct current and voltage measurement
- Benchmark using standard benchmarks
    Benchmarks used: SPEC (95 and 2K), Olden
- Present results
    Results:
    1. 3X-5X better than hardware DVFS
    2. 2X better than static compilation scheme
+
Motivation

DVFS is an effective means to control performance and power
consumption

Current implementation approaches (hardware, OS time-interrupt based,
and static compilation) limit the achievable gains

Dynamic-compilation-based DVFS is fine-grained, code-aware, and
adaptive to different microarchitectures, which gives it an advantage
over traditional methods and leads to greater gains

Framework can be adapted to other areas (di/dt and thermal
control)
+
Motivation
Hardware-driven DVFS relies on
counters
that trigger decisions at fixed time
intervals, in a manner that is:
1. Not program adaptable
2. Not phase aware
3. Agnostic of recurrent nature of
phases
Compiler-driven DVFS schemes
(static or dynamic) can apply DVFS
to fine-grained code regions so as to
adapt naturally to program phase
changes.
+
Motivation
Static compiler DVFS depends on
profiling and offline analysis for
decisions.
Differences in runtime environments
for profiler and program can result in
mismatch at runtime because:
1. Microarchitecture impacts CPU
slack
2. Memory usage depends on
runtime inputs and program
patterns
Dynamic compiler DVFS can
utilize run-time system
information and make input-adaptive and architecture-adaptive decisions.
+
How it works

Traditional DVFS power and energy management is OS-controlled:
transitions happen between applications and on switches between
processor states

Paper presents fine-grained intra-task dynamic DVFS that can
make power-performance tradeoffs between phases of the same
application

A dynamic compiler is a run-time software system that compiles,
modifies, and optimizes a program’s instruction sequence as it runs.

Dynamic compilation DVFS works by optimizing the application binary
code and inserting DVFS control instructions at program execution time.

Most DVFS implementations allow direct software control via mode-set
instructions (which access special mode-set registers), so a dynamic
compiler can be used to insert DVFS mode-set instructions into
application binary code at run time.
+
Disadvantages/Challenges
Dynamic optimization incurs a run-time penalty unless the optimizer
runs on the side, on a spare core of a chip multiprocessor
The challenge becomes designing simple and inexpensive optimization
algorithms to minimize runtime optimization cost
+
Design Framework and
DVFS Decision Algorithms
+
Key design decisions
I. Candidate Code Region Selection
- Cost effectiveness requires optimization of the “hot” code regions
- DVFS transitions are slow, hence only long-running code regions
should be optimized
=> Loops and functions are candidate code regions
=> Extend existing profiling infrastructure to monitor and identify hot
functions or loops.
II. DVFS decisions
- Deciding whether or not it is appropriate to apply DVFS and, if so,
what the appropriate DVFS setting is
- Analysis/decision algorithm needs to be simple and fast to
minimize overhead
=> Fast DVFS decision algorithm, based on an analytical decision
model
+
Key design decisions
III. DVFS Code Insertion and Transformation
- Number of code regions to optimize
i) One code region is simple to implement and manage
ii) Multiple regions enable greater gains but they require more
complex management
=> Two solutions, parent only policy and a stacked DVFS policy
- Code insertion to maximize energy gains without impacting
performance and correctness
=> Interaction between DVFS optimizer and the conventional
performance optimizer.
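
The slide names the two policies without defining them. The sketch below is one plausible reading, not the paper's implementation: under a parent-only policy only the outermost active region sets the frequency, while under a stacked policy each nested region saves the current setting on entry and restores it on exit. The class names, the `enter`/`leave` methods, and the MHz values are all hypothetical.

```python
class ParentOnlyDvfs:
    """Assumed parent-only policy: nested regions never change the setting."""

    def __init__(self, default_mhz):
        self.default_mhz = default_mhz
        self.current_mhz = default_mhz
        self._depth = 0  # how many candidate regions are currently active

    def enter(self, region_mhz):
        if self._depth == 0:          # only the outermost region scales
            self.current_mhz = region_mhz
        self._depth += 1

    def leave(self):
        self._depth -= 1
        if self._depth == 0:          # back outside all regions
            self.current_mhz = self.default_mhz


class StackedDvfs:
    """Assumed stacked policy: save the setting on entry, restore on exit."""

    def __init__(self, default_mhz):
        self.current_mhz = default_mhz
        self._saved = []              # settings of the enclosing regions

    def enter(self, region_mhz):
        self._saved.append(self.current_mhz)
        self.current_mhz = region_mhz

    def leave(self):
        self.current_mhz = self._saved.pop()
```

With a parent region at 1000 MHz and a nested child at 800 MHz, the parent-only sketch stays at 1000 MHz inside the child, whereas the stacked sketch drops to 800 MHz and returns to 1000 MHz when the child exits. This shows why the stacked variant needs more bookkeeping.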
+
Operational Block Diagram
+
DVFS decision algorithms
RDO inserts testing and decision code at entry and exit
points of candidate regions
Testing and decision code collects run-time information
- cache misses
- memory bus transactions, etc.
Once enough information has been collected RDO decides DVFS
setting for code region based on collected info + RDO setup
After decision is made, testing and decision code is removed and
DVFS code insertion and transformation is carried out
Decision is upheld until the end of program execution or the next
reoptimization point
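
The lifecycle above (collect, decide once, then uphold the decision) can be sketched as a small state machine. This is an illustrative model only: `RegionProbe`, `samples_needed`, the choice of counters, and the example decision rule are assumptions, since the slides do not say how much information counts as "enough".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RegionProbe:
    """Hypothetical testing-and-decision code attached to one candidate region."""

    decide: Callable[[float], int]     # memory intensity -> frequency in MHz
    samples_needed: int = 3            # assumed; no number is given on the slide
    bus_transactions: int = 0
    instructions: int = 0
    samples: int = 0
    setting_mhz: Optional[int] = None  # None while still in the testing phase

    def on_region_exit(self, bus_transactions: int, instructions: int) -> None:
        """Record counter deltas measured between region entry and exit."""
        if self.setting_mhz is not None:
            return  # testing code already removed: the decision is upheld
        self.bus_transactions += bus_transactions
        self.instructions += instructions
        self.samples += 1
        if self.samples >= self.samples_needed:
            intensity = self.bus_transactions / max(self.instructions, 1)
            self.setting_mhz = self.decide(intensity)


# Example decision rule (made up): slow down regions with many bus transactions.
probe = RegionProbe(decide=lambda intensity: 800 if intensity > 0.01 else 1600)
for _ in range(3):
    probe.on_region_exit(bus_transactions=500, instructions=10_000)
```

After the third sample the probe fixes its setting, and later calls are ignored. That mirrors the removal of the testing code once the decision has been made.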
+
DVFS Analytic Decision Model
• Simple and fast because β is directly computed from runtime hardware
information
• Ad-hoc and rather imprecise, especially in its handling of Ploss
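
The model's equation did not survive in this transcript, so the following is a stand-in rather than the paper's formula. It assumes a common first-order model in which β is the fraction of run time (at maximum frequency) that does not scale with CPU frequency, so that T(f) = (1 - β)·T0·(f_max/f) + β·T0. Keeping the slowdown within Ploss then requires f/f_max ≥ 1 / (1 + Ploss / (1 - β)). The function names and the frequency levels are hypothetical.

```python
def required_frequency_ratio(beta: float, p_loss: float) -> float:
    """Lowest f/f_max that keeps the slowdown within p_loss (assumed model)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    if beta == 1.0:
        return 0.0  # fully memory-bound: run time does not depend on f
    return 1.0 / (1.0 + p_loss / (1.0 - beta))


def choose_setting(beta: float, p_loss: float, levels_mhz: list) -> int:
    """Pick the slowest available level that still meets the required ratio."""
    f_max = max(levels_mhz)
    needed = required_frequency_ratio(beta, p_loss) * f_max
    return min(f for f in levels_mhz if f >= needed)


# Hypothetical frequency levels; not taken from the slides.
LEVELS_MHZ = [600, 800, 1000, 1200, 1400, 1600]
```

Under these assumptions, a compute-bound region (β = 0) with Ploss = 5% needs about 95% of f_max and stays at the top level, while a region with β = 0.9 can drop to roughly two thirds of f_max.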
+
Implementation and
Deployment
+
Implementation
Intel PIN variant (O-PIN) used for
dynamic instrumentation and
compilation
O-PIN has more dynamic
functionality and lower operating
overhead than PIN
+
Deployment
+
Experimental Results
+
Experimental Setup

Ploss = 5%

Hot threshold = 4 loop executions (results not sensitive in 3-20
range)

Long running threshold = 1.5ms (at least 3X bigger than
voltage transition time)

SPEC2K FP and SPEC 2K INT benchmarks used. Also used
SPEC95 FP and Olden. Why?
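
The two thresholds above can be read as a simple candidate filter. The sketch below assumes the long-running threshold applies to the average time per execution of a region; the slides do not say how it is measured, and the function name is hypothetical.

```python
HOT_THRESHOLD = 4        # executions before a region counts as hot (from the slide)
LONG_RUNNING_MS = 1.5    # long-running threshold in milliseconds (from the slide)


def is_candidate(executions: int, total_time_ms: float) -> bool:
    """Return True if a loop or function is both hot and long-running.

    Assumption: "long running" is judged on average time per execution.
    """
    if executions < HOT_THRESHOLD:
        return False
    return total_time_ms / executions >= LONG_RUNNING_MS
```

For example, a loop executed 4 times for 8 ms in total (2 ms per execution) qualifies. A loop executed 10 times for 5 ms in total (0.5 ms per execution) does not, because each execution would be shorter than the threshold that keeps voltage transition time small relative to region run time.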
+
RDO in Action
35% energy savings with only 5%
performance loss
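
The results summary reports EDP (energy-delay product), so it helps to see how an energy saving and a performance loss combine. The sketch assumes EDP = energy × execution time and that a 5% performance loss means a 5% longer execution time. The resulting figure is arithmetic on the two numbers above, not a value reported on the slides.

```python
def edp_improvement(energy_savings: float, perf_loss: float) -> float:
    """Fractional EDP improvement, assuming EDP = energy * delay."""
    energy_ratio = 1.0 - energy_savings   # new energy / baseline energy
    delay_ratio = 1.0 + perf_loss         # new run time / baseline run time
    return 1.0 - energy_ratio * delay_ratio


# 0.65 * 1.05 = 0.6825, so EDP improves by about 31.75%.
print(f"{edp_improvement(0.35, 0.05):.4f}")
```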
+
SPEC2K INT & Olden Results
+
SPEC2K FP & SPEC95 FP Results
+
Results Summary
Benchmark suite    EDP improvement after accounting for O-PIN overhead
SPEC95 FP          16.7%
SPEC2K FP          17.9%
SPEC2K INT         -1.4%
Olden              20.9%
+
Suggestions + future work
Suggestions
1. Establish an upper bound for DVFS savings and compare results
against that standard
2. Incorporate more micro-architectural support for the dynamic
compilation scheme, such as slack identification and prediction logic
3. Use finer-granularity DVFS settings for higher gains
Future Work
1. Address specific design issues like code transformation and periodic
DVFS re-optimization
2. Finer breakdown and analysis of results
3. Implement performance optimizations and study interactions between
performance optimizations and energy optimizations