gem5, GPGPUSim, McPAT, GPUWattch,

For a number of years we have been familiar with
the observation that the quality of [programmers]
architecture researchers is a decreasing function of
[the density of go to statements] reliance on
quantitative architecture simulators in the
[programs] architecture papers they produce. More
recently we discovered why the use of [the go to
statement] architecture simulators has such
disastrous effects, and we became convinced that
the [go to statement] architecture simulator should
be abolished from all ["higher level" programming
languages] architecture research.
gem5, GPGPUSim, McPAT, GPUWattch,
“Your favorite simulator here”
Considered Harmful
Tony Nowatzki, Jaikrishnan Menon, Chen-Han Ho,
Karthikeyan Sankaralingam
[email protected]
University of Wisconsin - Madison
6/15/2014
(hate mail
here please)
How do I best evaluate my idea?
• Cycle-Accurate Simulation
• Trace-Based Analysis
• Program Analysis
• Reasoned Arguments
• Custom First-Order Models
• Mathematical Proofs
How do I best get it published?
• “Cycle-Accurate Simulation”
2
3
The Good,
the bad,
and the Ugly
4
Outline
5
Pitfall 1: Simulator Errors Inaccessible to Users
[Figure: McPAT detailed power breakdown]
6
Pitfall 2: Simulator Tool Misuse
7
This is how GPUWattch Actually Works
[Diagram labels: GPGPUSim activity counts; McPAT power distribution; GPU; scaling factors (10-50×)]
• Scaling factors are fitted to the GPU being measured.
• Changing GPGPUSim parameters would not yield a valid power model.
8
GPUWattch
Pitfall 2: Simulator Tool Misuse
9
Pitfall 3: Mixed Abstractions with Unknown Consequences
GPGPUSIM 2000
• Register File Microarchitecture
• Warp Scheduler
• Branch Stack
10
Pitfall 4: Emerging Simulators
(The Entire Core Microarchitecture)
12
Pitfall 5: The Trends Myth
13
Pitfall 5 Example
Simulator bug: X86 Gem5 always predicts branches inside macro-ops as not taken.
[Figure: two program traces, each split into an Unaccelerated Region and an Accelerated Region (100× faster)]
• Accelerator Benefit: 30%
• Benefit after error fix: 300%
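The jump from 30% to 300% can be reproduced with Amdahl's law. The sketch below is only an illustration under stated assumptions: it reads "benefit" as overall speedup minus one, takes the 100× factor from the slide, and back-calculates the fraction of baseline time spent in the accelerated region. The function names are hypothetical and nothing here comes from gem5 itself.

```python
def speedup(accel_fraction, accel_factor):
    """Amdahl's law: overall speedup when `accel_fraction` of the
    baseline time runs `accel_factor` times faster."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_factor)


def fraction_for_benefit(benefit_percent, accel_factor):
    """Invert Amdahl's law: fraction of baseline time that must sit in the
    accelerated region to obtain the stated benefit (assumed to mean
    speedup minus one, expressed in percent)."""
    target_speedup = 1.0 + benefit_percent / 100.0
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / accel_factor)


ACCEL_FACTOR = 100.0  # "100× faster", taken from the slide

frac_with_bug = fraction_for_benefit(30.0, ACCEL_FACTOR)
frac_after_fix = fraction_for_benefit(300.0, ACCEL_FACTOR)

print(f"accelerated fraction implied by a 30% benefit:  {frac_with_bug:.3f}")
print(f"accelerated fraction implied by a 300% benefit: {frac_after_fix:.3f}")
```

Under these assumptions, the accelerated region accounts for roughly 23% of baseline time with the bug and roughly 76% after the fix. A single simulator error that shifts where time is spent is therefore enough to change the conclusion, not just the magnitude.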
14
Pitfall 6: Poor qualitative findings from data inundation.
[Figure: collage of plots. Recoverable labels: Power, Performance, Energy, ED^4, CPI Stack, Random Heat Map; components Ex, Issue, Clock, Cache, MMU, NoC; workloads Spec, TPC, Media, Cloud]
15
Pitfall 7: Amplified Reviewer Noise
16
“I have a trace-based, empirical, regression-fitted, cycle-accurate model.”
“I’m using cycle-accurate simulation to evaluate my single-cycle per instruction architecture.”
“Will someone please publish my mathematical model?”
17
Outline
18
“Footprint”
Stack Layers: Algorithm, Application, Compiler, OS, IO, Mem. Controller, Caches, Core µarch, Circuits, Gates, Transistors, Physics
Footprint sizes: Small Footprint, Medium Footprint, Large Footprint
How do I evaluate my idea?
• Cycle-Accurate Simulation
• Trace-Based Analysis
• Mathematical Proofs
• Custom First-Order Models
• Program Analysis
• Reasoned Arguments
19
The Footprint in Context
Stack Layers: Algorithm, Application, Compiler, OS, IO, Mem. Controller, Caches, Core µarch, Circuits, Gates, Transistors, Physics
Examples: Stay Away From the Valley; Conservation Cores; Temporal Memory Streaming; Sampling, DMR
PC-Required Approach: “Cycle-Accurate Simulation”
Appropriate Approach: Cycle-accurate Simulation; Mathematical Models; Custom First-Order Models
What the Authors Did: Cycle-accurate Simulation; Above, but not; Originally not a Conf. Publication; More Than PC-Required; Mathematical Proof; PC-Required Approach
20
A few humble requests…
• Tool Developers
• Authors
• Reviewers
21
Opinions?
http://sim-harmful.blogspot.com/
22
Thank you!
http://sim-harmful.blogspot.com/
23
Backup Slides
24
Simulator Errors
25
Gem5 – Writeback Buffers
• Issue: A default setting causes inappropriate slowdowns.
• Details:
– Each instruction holds a WB buffer slot during execute.
– # Entries = WB_depth × WB_width
– Default WB_depth = 1
• Implications:
– 5× Perf Loss in micro-benchmarks.
– Up to 25% performance loss on Parsec workloads [Vamsi Krishna]
[Diagram labels: Inst. Queue, Issue, FU, FU, FU, …, Execute, “Writeback Buffer”]
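To see why a small writeback buffer throttles throughput, a first-order bound is enough. The sketch below is a toy model, not gem5 code: it assumes each executing instruction holds exactly one slot for its whole execute latency and applies Little's law. The function names, the width of 8, and the 4-cycle average latency are hypothetical values chosen for illustration.

```python
def wb_entries(wb_depth, wb_width):
    """Number of writeback buffer slots: # Entries = WB_depth × WB_width."""
    return wb_depth * wb_width


def ipc_upper_bound(wb_depth, wb_width, avg_execute_cycles):
    """Little's law bound on throughput: at most `entries` instructions can
    be in execute at once, and each one holds its slot for
    `avg_execute_cycles` cycles (a toy assumption)."""
    return wb_entries(wb_depth, wb_width) / avg_execute_cycles


# Hypothetical 8-wide machine with a 4-cycle average execute latency.
for depth in (1, 2, 4):
    bound = ipc_upper_bound(depth, 8, 4.0)
    print(f"WB_depth={depth}: IPC bound from WB slots = {bound:.1f}")
```

In this toy setting a depth of 1 caps throughput well below the machine width, while deeper buffers move the bound out of the way. The real limit also depends on issue width and the actual latency mix, which this sketch ignores.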
26
Gem5 – Pipeline Replay
• Issue: Gem5 Pipeline Replay mechanism is
unrealistic (both conservative and optimistic)
• Details:
– Deep OOO pipelines speculatively schedule
instructions for back-to-back execution; variable
latency disrupts the schedule.
– Conservative: Entire OOO pipeline (not just
dependent insts.) repeatedly flushed on cache block.
– Optimistic: Simple block on cache miss.
• Implications:
– 5× Perf Loss in micro-benchmarks.
27
Gem5 – µop Inefficiencies
• Issue: Flaws and inefficiencies in gem5 µops.
• Details (examples):
– Instructions which read destination register, but don’t
have to (mov)
– Same as above, but with flags register (xor, and, or…)
– Instructions which are marked as requiring the FP unit,
but don’t need it (ldfp, mov2fp)
– SIMD FP instructions aren’t counted as FP. (mulps …)
• Implications:
– Incorrect performance and energy estimates (especially
for FP code)
28
McPAT – Fitted Values
• Issue: McPAT relies on some design-specific
constants; users should be cautious about generalizing.
• Details (examples):
– Dynamic component of energy added for FU if the
design is OOO.
– Per-access FU energy divided by 2 if processor type is
embedded.
– Percentage of chip using long-channel devices set by
values from Niagara vs Xeon Tulsa.
• Implications:
– Users can easily use the tool outside of its validated range.
29
McPAT – Pipeline & Clock Power
• Issue: McPAT drops pipeline and clock power for OOO designs.
• Details (examples):
– McPAT models pipeline/clock power considering average switching factors.
– This power is reported by distributing it among stages.
– A factor of cycles is lost for OOO processors in this calculation.
• Implications:
– Incorrect power estimates for experiments longer than a few cycles.
[Figure: McPAT Pipeline Power for a 65nm Idle Processor. y-axis: Pipeline Power (watts), 0 to 1.6; x-axis: Total Experiment Cycles (1, 10, 100); series: Inorder, OOO]
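One plausible reading of "a factor of cycles is lost" is that an energy term is not multiplied by the cycle count before being divided by elapsed time. The sketch below illustrates only that reading; it is not McPAT's code, and the function name, the 0.5 nJ per cycle, and the 2 GHz clock are hypothetical.

```python
def average_power(energy_per_cycle_j, cycles, clock_hz, drop_cycle_factor=False):
    """Average power = total energy / elapsed time.

    With drop_cycle_factor=True the per-cycle energy is NOT multiplied by
    the cycle count, mimicking a lost factor of cycles (an assumed
    interpretation of the issue)."""
    total_energy_j = energy_per_cycle_j * (1 if drop_cycle_factor else cycles)
    elapsed_s = cycles / clock_hz
    return total_energy_j / elapsed_s


# Hypothetical idle pipeline: 0.5 nJ per cycle at 2 GHz, i.e. 1 W.
for cycles in (1, 10, 100):
    correct = average_power(0.5e-9, cycles, 2e9)
    buggy = average_power(0.5e-9, cycles, 2e9, drop_cycle_factor=True)
    print(f"cycles={cycles:>3}: correct={correct:.3f} W, with lost factor={buggy:.3f} W")
```

Under this reading the two values agree for a one-cycle experiment, and the erroneous value shrinks in proportion to experiment length, which is consistent with estimates being wrong for anything longer than a few cycles.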
30
GPGPUSim V2.x
• Issue: Certain features in GPGPUSim V2.x are modeled functionally
and aren’t appropriate for evaluating micro-architectural
enhancements to those features.
• Details:
– Register File microarchitecture: The operand collector modeled
assuming fixed latency accesses to the SRAM with some additional
queuing latency. (lacks contention)
– Thread/warp/wavefront scheduling and dispatch: Different warp
scheduling schemes implemented, but schedules are generated
functionally.
– Branch divergence structures and Branch Unit: Tracking structures are
functionally emulated as part of the abstract hardware model, and the
branch unit microarchitecture is not modeled at the cycle-level.
• Implications:
– It is possible to do research with this tool while remaining detached
from micro-architecture.
[These issues largely fixed in GPGPUSim V3.x]
31
GPUWattch
1. In the methodology as presented, the McPAT modeling is mathematically irrelevant.
2. Implementation bounds the McPAT scaling factors (10-50×).
– Not mathematically irrelevant.
– Enables plausible distribution.
3. Potential for inappropriate use:
– Modifying GPU configuration without re-training would yield invalid models.
– Scaling factors are too high to trust that power will behave as it does for CPU components.
[Diagram labels: Training; Usage; GPGPUSim; Scaled Activity Factors; Modified McPAT; Train with measured GPU; Component-wise Power]
32
GPUWattch
• Issue: Following the methodology as presented
will lead to “Irrelevant modeling” Errors.
• Details:
– Power eq: $P_{bench} = \sum_{\forall comp} \alpha_{bench,comp} \times PMAX_{comp} \times x_{comp}$
($P_{bench}$: power estimate; $\alpha_{bench,comp}$: activity factors; $PMAX_{comp}$: max power of component; $x_{comp}$: scaling factor)
– Authors determine scaling factors by minimizing
squared error for measured values. [linear regression]
– Scaling by constant factor is meaningless, this is
equivalent: $P_{bench} = \sum_{\forall comp} \alpha_{bench,comp} \times x'_{comp}$
– Implication: Users of GPUWattch methodology will
waste effort in detailed modeling!
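The equivalence claimed above can be checked numerically. The sketch below is an illustration with made-up data, not the GPUWattch implementation: it fits unbounded least-squares scaling factors for a hypothetical two-component model under two very different PMAX guesses and compares the resulting predictions. All names, activity factors, and power values are assumptions.

```python
def fit_two_components(rows, measured):
    """Ordinary least squares for measured ≈ r[0]*w0 + r[1]*w1,
    solved from the 2x2 normal equations with Cramer's rule."""
    s00 = sum(r[0] * r[0] for r in rows)
    s01 = sum(r[0] * r[1] for r in rows)
    s11 = sum(r[1] * r[1] for r in rows)
    b0 = sum(r[0] * y for r, y in zip(rows, measured))
    b1 = sum(r[1] * y for r, y in zip(rows, measured))
    det = s00 * s11 - s01 * s01
    return ((b0 * s11 - b1 * s01) / det, (b1 * s00 - b0 * s01) / det)


# Hypothetical activity factors (per benchmark, per component) and measured power.
ALPHA = [(0.2, 0.9), (0.5, 0.4), (0.8, 0.1), (0.6, 0.7)]
MEASURED_W = [40.0, 35.0, 30.0, 50.0]


def predict_with_pmax(pmax):
    """Fit unbounded scaling factors x for a given PMAX guess, then predict."""
    rows = [(a0 * pmax[0], a1 * pmax[1]) for a0, a1 in ALPHA]
    x = fit_two_components(rows, MEASURED_W)
    return [r[0] * x[0] + r[1] * x[1] for r in rows]


pred_a = predict_with_pmax((5.0, 2.0))    # one hypothetical PMAX guess
pred_b = predict_with_pmax((50.0, 0.3))   # a very different guess
print([round(p, 6) for p in pred_a])
print([round(p, 6) for p in pred_b])
```

Both PMAX guesses give the same predictions up to floating-point error, because the fitted factors simply absorb whatever constant multiplies each activity column. This holds only for the unbounded fit.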
33
GPUWattch – A clarification
• GPUWattch authors actually bound the $x_{comp}$ scaling factors:
$P_{bench} = \sum_{\forall comp} \alpha_{bench,comp} \times PMAX_{comp} \times x_{comp}$, with $x_{min} \le x_{comp} \le x_{max}$
• Factors are ≤10× for core, and ≤50× for uncore
• Therefore, this is not a pure linear regression, and
$PMAX_{comp}$ is not mathematically irrelevant.
• In practice, these bounds keep the scaling factors
closer to reality (not negative or extremely high)
• However, this leads to a potential “user-error” …
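A one-component toy case shows how the bounds make the PMAX value matter. The sketch below is not the GPUWattch fitting code: it uses the fact that, for a single scaling factor, the bounded least-squares optimum is the unbounded optimum clipped to the bounds. The data, the lower bound of 1, and the function names are hypothetical; the upper bound of 10 mirrors the core limit quoted above.

```python
def fit_bounded_scale(alpha, pmax, measured, x_min, x_max):
    """Least-squares scaling factor for measured ≈ alpha * pmax * x,
    clipped to [x_min, x_max]. For one variable the clipped value is
    the bounded optimum."""
    num = sum(a * pmax * y for a, y in zip(alpha, measured))
    den = sum((a * pmax) ** 2 for a in alpha)
    return min(max(num / den, x_min), x_max)


def predict(alpha, pmax, x):
    return [a * pmax * x for a in alpha]


ALPHA_CORE = [0.2, 0.5, 0.8]          # hypothetical activity factors
MEASURED_CORE_W = [10.0, 25.0, 40.0]  # hypothetical measurements (50 W × alpha)

x_good = fit_bounded_scale(ALPHA_CORE, 5.0, MEASURED_CORE_W, 1.0, 10.0)
x_poor = fit_bounded_scale(ALPHA_CORE, 1.0, MEASURED_CORE_W, 1.0, 10.0)

print(predict(ALPHA_CORE, 5.0, x_good))  # PMAX guess within reach of the bound
print(predict(ALPHA_CORE, 1.0, x_poor))  # factor hits the bound, fit degrades
```

With the second PMAX guess the factor saturates at the upper bound and the predictions fall well short of the measurements, so the quality of the underlying PMAX estimate is no longer absorbed by the fit.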
34
GPUWattch
• Issue: GPUWattch methodology restricts itself to
GPUs with physically measurable artifacts.
• Details:
– Scaling factors are unique to a particular platform.
– GPUWattch relies on McPAT’s scaling for capturing effects
between different GPU configurations.
– Claim: Relying on McPAT for this is highly questionable
without evidence.
– Can McPAT get the absolute value so wrong (8× and 22×
average scaling factors for the validated platforms), but
somehow get the relative scaling correct?
• Implication: GPUWattch restricted to small-footprint
research evaluation.
35