Preliminary Exam
On Power-Proportional Processors
Yasuko Watanabe
[email protected]
Advisor: Dr. David A. Wood
Oral Defense
October 11, 2011
Executive Summary (1/2)
Power-constrained chips
– Power-performance trade-offs mainly via DVFS
Limited utility of DVFS in future technology nodes
Concept: Power-proportional processors
– Alternative to DVFS
– Consume power in proportion to performance
Contribution: Mechanisms for power proportionality
– Dynamic resource management with 3 proposals
[Figure: ideal power proportionality — power vs. performance; aggregate resources to scale up the curve, disable resources to scale down]
2
Executive Summary (2/2)
Proposal 1: WiDGET framework
– Goal: Scalability at higher performance
– Mechanism: Distributed in-order Execution Units (EUs)
[Figure: power vs. performance — Proposals 1 & 2 cover scale-up, Proposal 3 covers scale-down]
Proposal 2: Deconstructing scalable cores
– Goal: Energy-efficient scalability at higher performance
– Mechanism: Trade-offs between wire delay and more resources
Proposal 3: Power gliding concept
– Goal: Power scale-down at lower performance
– Mechanism: Dynamically disable performance optimizations
3
Outline
Motivation
Proposal 1: WiDGET framework
Proposal 2: Deconstructing scalable cores
Proposal 3: Power gliding
Related work
Conclusions
4
Intel Power Trend
Thermal Design Power (W)
100
10
1
0.1
1970
1975
1980
1985
1990
1995
2000
2005
2010
Pre Pentium
Pentium
Pentium MMX
Pentium II
Pentium III
Pentium 4
Celeron
Core 2 Duo
Core 2 Quad
Core i3
Core i5
Core i7
Itanium
Itanium 2
Pentium D
Xeon
Year
Exponential power increase until ~100W
Source: Stanford CPU database
5
A Temporary Solution for Power
Multi-cores with dynamic voltage/frequency scaling (DVFS)
– Dynamic power-performance trade-offs with static cores
• 3% reduction in power with 1% performance loss
– Meeting Amdahl’s Law for diverse workloads
– Reduced complexity of cores
But the utility of DVFS is limited in future nodes
[Figure: nominal Vdd and Vth vs. feature size, 1200 nm down to a projected 8 nm; the gap between nominal Vdd and Vth narrows sharply at future nodes]
Source: Stanford CPU database
6
Proposed Approach
Goal: Power-proportional cores
– Consume power in proportion to performance
– Single-thread context
How?: Dynamic resource scaling
– Guided by 3:1 power-to-performance ratio of DVFS
– Aggregate resources to scale up
• 1% performance increase with at most 3% power increase
– Disable resources to scale down
• 3% power savings at 1% performance loss
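As an inequality (my paraphrase of the rule above, not notation from the talk), a resource change is accepted only if its marginal power-to-performance ratio beats DVFS's historical 3:1 trade-off:

$$ \frac{\Delta P / P}{\Delta \mathrm{Perf} / \mathrm{Perf}} \le 3 \;\;\text{to scale up}, \qquad \frac{\Delta P / P}{\Delta \mathrm{Perf} / \mathrm{Perf}} \ge 3 \;\;\text{to scale down} $$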
[Figure: ideal power proportionality — power vs. performance; Proposals 1 & 2 aggregate resources to scale up, Proposal 3 disables resources to scale down]
7
Outline
Motivation
Proposal 1: WiDGET framework
Proposal 2: Deconstructing scalable cores
Proposal 3: Power gliding
Related work
Conclusions
8
Sea of Resources
[Diagram: a sea of resources — instruction engines (frontend + backend) with per-engine L1I/L1D, surrounded by in-order Execution Units (EUs) grouped into clusters of 4 around a shared L2; maximum allocation is 8 EUs]
9
In-Order EU
[Diagram: one in-order EU — FIFO instruction buffers plus an operand buffer, attached to routers on the L1D/L2 fabric]
• Executes 1 instruction/cycle
• EU aggregation for OoO-like performance
– Increases both issue BW & buffering
– Latency tolerance with distributed buffering
– Extracts MLP & ILP
10
Power Proportionality
[Figure: normalized chip power vs. normalized performance for WiDGET configurations from 1 EU + 1 IB up to 8 EUs + 4 IBs, plotted against Neon (~Xeon) and Mite (~Atom) reference points]
21% power savings to match Neon’s performance
8% power savings for 26% better performance than Neon
Power scaling of 54% to approximate Mite
Covers both Neon and Mite on a single chip
11
Outline
Motivation
Proposal 1: WiDGET framework
Proposal 2: Deconstructing scalable cores
Proposal 3: Power gliding
Related work
Conclusions
12
Other Scaling Opportunities?
Questions WiDGET leaves unanswered
– Scaling other resources?
– Better power scale-down?
– Impact of wire delay on performance?
Deconstruct prior scalable cores
– Understand trade-offs in achieving scalability
• Resource acquisition vs. wire delay
– 2 categories
1. Resource borrowing
2. Resource overprovisioning
13
Deconstruction: Scaling Taxonomy
Prior scalable cores, by scaled component:

| Scaled Component | Core Fusion | CLP | Forwardflow |
|---|---|---|---|
| L1-I | Borrow | Borrow | No scaling |
| Frontend | Borrow | Borrow | Fixed |
| Scheduling | Borrow | Borrow | Overprovision |
| Execution resources | Borrow | Borrow | Overprovision |
| Instruction window | Borrow | Borrow | Overprovision |
| L1-D | Borrow | Borrow | No scaling |

Resource acquisition philosophy:
– Core Fusion & CLP aggregate neighboring cores → BAR model (Borrowing All Resources)
– Forwardflow overprovisions a few core-private resources → COR model (Cheap Overprovisioned Resources)
Both models scale up to 4x
14
Scaling: BAR vs. COR
[Diagram: BAR1/BAR2/BAR4 vs. COR1/COR2/COR4 at scaling points 1, 2, and 4 — BAR links neighboring cores' I$, frontends (FE), IQs, and D$ banks through inter-core crossbars; COR keeps a single I$, FE, and D$ and overprovisions IQs]
BAR: + pipeline balance, − latency overheads
COR: − imbalance, + minimized latency
15
Assumption & Parameters @ 3 GHz
| Component | BAR (scaling 1 / 2 / 4) | COR (scaling 1 / 2 / 4) |
|---|---|---|
| L1-I (KB) | 32 / 64 / 128 | 32 (fixed) |
| FE/BE width | 2 / 4 / 8 | 4 (fixed) |
| FE depth | 7 / 11 / 11 | 7 (fixed) |
| Scheduling | 1 / 2 / 4 | 1 / 2 / 4 |
| Exec resources | 1 / 2 / 4 | 1 / 2 / 4 |
| Instr window | 64 / 128 / 256 | 64 / 128 / 256 |
| L1-D (KB) | 32 / 64 / 128 | 32 (fixed) |
| Resource advantage | 1 / 2 / 4 | 1 / 2 / 4 |

Optimistic assumptions:
• Communication: 0 cycles intra-core, 2 cycles inter-core
• Steering heuristics — BAR: cache-bank predictor; COR: dependence-based
• Power-gating of extra resources
Baseline:
• Non-scalable 4-wide OoO, roughly equivalent to COR2
16
Performance: BAR vs. COR
[Figure: left — IPC normalized to OoO for BAR and COR at scaling points 1, 2, and 4 across CINT, CFP, and commercial workloads; right — percentage of instructions delayed by remote operand transfers for BAR2 and BAR4]
COR yields 9% higher IPC than BAR
– COR: No additional wire delays
– BAR: Inter-core communication when scaled up
• E.g., Operand crossbar across 4 cores
Maintaining balance is unnecessary when scaled up
17
Chip Power: BAR vs. COR
[Figure: left — chip power normalized to OoO for BAR and COR at scaling points 1, 2, and 4 across CINT, CFP, and commercial workloads; right — power breakdown by component (L1-I, fetch/decode/rename, scheduling/steering, execution, backend, L1-D, L2/L3, total chip)]
COR: 9% more power at scaling point 1
BAR: Up to 36% more power when scaled up
Large power differentials:
– Cache aggregation (Largest)
– Frontend (Width scaling and crossbars in BAR)
– Centralized ROB in COR
18
Deconstructing Power-Hungry Components
Frontend/backend width
– Scaling down: Energy efficient
– Scaling up: 29% power increase with negligible
performance increase
Cache aggregation
– Energy inefficient
– L1-I: Not a bottleneck
• < 0.5% average miss rate
– L1-D: More harm than good
• Large working sets
• Longer effective access latency
19
Improving Scalable Cores: COBRA
One scaling philosophy does not fit all situations
Hybrid of BAR and COR
– Performance scalability features from COR
• Overprovisioned window/execution resources
– Low-power features from BAR
• Interleaved ROB (core private)
• Pipeline width scale-down (core private)
– Borrow only execution resources
• Scaling of up to 8x
20
Two Execution Styles of COBRA
[Diagram: two COBRA pipelines sharing an I$, frontend, steering stage, and D$ — COBRo issues from out-of-order IQs; COBRi issues from in-order instruction buffers, with borrowed execution resources one cycle away across the L2 fabric]
COBRo: Out-of-order execution
COBRi: In-order execution with WiDGET’s EUs
– Single-issue per execution resource
21
Power & Performance
[Figure: normalized chip power vs. normalized IPC for OoO, BAR, COR, COBRo, and COBRi at scaling points 1, 2, 4, and 8]
COBRo: 5% less power than COR for the same performance
COBRi: Lowest power
– Low performance when scaled down
– 13% better performance than COR with less power at scaling 4
COBRA (COBRo & COBRi)
– Further, latency-effective scaling
22
Energy Efficiency
[Figure: ED² normalized to OoO (lower is better) for BAR, COR, COBRo, and COBRi at scaling points 1, 2, 4, and 8]
COBRA: Energy-efficient foundation
– Except for COBRi1
• Not enough power savings for the lower performance
– COBRo: Up to 48% improvement
– COBRi: Up to 68% improvement
COBRi: Up to 50% more efficient than COBRo
– Eliminates expensive OoO issuing
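For reference, the ED² metric on this slide is the standard energy-delay-squared product (lower is better); for a fixed instruction count it is the reciprocal of the BIPS³/W metric that appears on slide 83:

$$ ED^2 = E \cdot D^2 = P \cdot D^3 \;\propto\; \frac{1}{\mathrm{BIPS}^3/\mathrm{W}} $$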
23
Summary
Whole-core scaling is energy inefficient
– Wire delays outweigh the benefits
Scaled-down cores should maintain pipe balance
COBRA: Energy-efficient foundation
– Overprovisions window resources
– Only borrows small, latency-effective resources
– Scales down frontend/backend with the window
– Aggregating in-order execution is more energy efficient than using OoO execution
24
Outline
Motivation
Proposal 1: WiDGET framework
Proposal 2: Deconstructing scalable cores
Proposal 3: Power gliding
Related work
Conclusions
25
Motivation
[Figure: power vs. performance, past vs. future — past: DVFS spans a wide range from (Vmax, Fmax) down to (Vmin, Fmin), with frequency scaling (FS) below; future: the shrunken voltage range leaves a shorter DVFS segment, and power gliding extends the curve below Vmin]
Implications of smaller voltage scaling range
– Increasing reliance on frequency scaling
– Reduced power range
Power gliding goal
– Extend the DVFS curve
26
Power Gliding Approach
Disable or constrain performance optimizations
– Optimization rule:
• 1% performance improvement with no more than 3% power increase
• Otherwise, DVFS can do better (Pentium M)
– Use the rule in reverse
[Figure: power increase vs. performance improvement, with the 3:1 optimization line — optimizations above the line are power-inefficient: turn them off and do better than DVFS; optimizations below it are power-efficient: leave them on]
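As a sketch, the reverse rule can be phrased as a decision procedure (the function name and the measurement interface are illustrative, not from the dissertation):

```python
def should_disable(opt_perf_gain: float, opt_power_cost: float,
                   ratio: float = 3.0) -> bool:
    """Decide whether to power-glide an optimization off.

    opt_perf_gain:  fractional performance the optimization adds (e.g., 0.01)
    opt_power_cost: fractional power it consumes (e.g., 0.05)
    An optimization is power-inefficient -- and worth disabling -- when it
    costs more than `ratio` percent power per percent of performance,
    i.e., when DVFS could make the same trade more cheaply.
    """
    if opt_perf_gain <= 0:          # pure overhead: always disable
        return True
    return opt_power_cost / opt_perf_gain > ratio

# Example: 1% performance for 5% power fails the 3:1 test, so power
# gliding would turn that optimization off.
assert should_disable(0.01, 0.05)
```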
27
Two Case Studies
2 different targets
1. Core frontend
2. L2 cache
Approach: Use existing low-power techniques
– Chosen based on intuition
• Associated performance loss not always appropriate for
high-performance processors
• But viable options under the 3:1 ratio
– Use without complex policies
28
Methodology
Baseline
– Non-scalable 4-wide OoO
Comparison
– Frequency scaling (Simulated)
• Modeled after POWER7
• Min frequency = 50% of Nominal frequency
– DVFS (Analytical)
• 22% operating voltage range based on Pentium M
Goal
– Power-performance curve closer to DVFS than
frequency scaling
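One way to reproduce such an analytical DVFS curve (the 22% voltage range and 50% frequency floor come from this slide; the linear V-f relation and the dynamic-power-only model are my simplifying assumptions):

```python
def dvfs_point(f_scale: float, v_range: float = 0.22) -> tuple[float, float]:
    """Normalized (performance, dynamic power) at a scaled frequency.

    f_scale: target frequency as a fraction of nominal (1.0 down to 0.5).
    Voltage tracks frequency linearly down to its floor at (1 - v_range)
    of nominal; below that only frequency scales.
    Dynamic power ~ V^2 * f; performance ~ f (compute-bound assumption).
    """
    v_scale = max(f_scale, 1.0 - v_range)   # voltage floor, e.g., 0.78
    power = v_scale ** 2 * f_scale
    return f_scale, power

for f in (1.0, 0.9, 0.78, 0.64, 0.5):
    perf, pwr = dvfs_point(f)
    print(f"f={f:.2f}  perf={perf:.2f}  power={pwr:.2f}")
```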
29
Case Study 1:
Frontend Power Gliding
30
Power-Dominant Optimizations
Renamer checkpointing
+ Fast branch misprediction recovery
– Only 0.05% of checkpoints useful due to highly
accurate branch predictor
Aggressive fetching
+ Fast window re-fills
– Underutilized once the scheduler is full
– Prone to fetch wrong-path instructions
31
Implementation
5 different power gliding configurations, Stall-8 through Stall-1:

| Config | Max In-Flight Unresolved Branches | Checkpoint Count | Fetch Buffer Size | Physical Registers |
|---|---|---|---|---|
| Base | Unconstrained | 16 | 16 | 128 |
| Stall-8 | 8 | 0 | 4 | 64 |
| Stall-4 | 4 | 0 | 4 | 64 |
| Stall-3 | 3 | 0 | 4 | 64 |
| Stall-2 | 2 | 0 | 4 | 64 |
| Stall-1 | 1 | 0 | 4 | 64 |

Notes: the branch limit gives simplified speculation control without a confidence estimator; zero checkpoints convert branches to commit-time recovery; the small fetch buffer helps reduce speculation; speculation control reduces pressure on the registers.
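The same configurations as data, for reference (a hedged transcription of the table above; the field names are mine):

```python
# All Stall-N configs drop checkpoints to 0, shrink the fetch buffer from
# 16 to 4 entries, and halve the physical registers from 128 to 64; only
# the unresolved-branch limit varies.
BASE = dict(max_unresolved=None, checkpoints=16, fetch_buffer=16, regs=128)
STALL = {n: dict(max_unresolved=n, checkpoints=0, fetch_buffer=4, regs=64)
         for n in (8, 4, 3, 2, 1)}
```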
32
Case Study 2:
L2 Power Gliding
33
Observations on L2 Cache
Static power dominated
Not all workloads need full capacity or
optimized latency
Memory-intensive workloads
– Tolerant of smaller L2 sizes
Compute-bound workloads
– Sensitive to smaller L2 sizes
– But low L2 miss rate (0.3%)
34
Implementation
Level-[12]: Reduce static power with drowsy mode
Level-[345]: Reduce L2 associativity (8-way 1MB → direct-mapped 128KB)

| Config | Drowsy L2 Data | Drowsy L2 Tags | L2 Associativity | L2 Access Cycles |
|---|---|---|---|---|
| Base | N | N | 8 | 12 |
| Level-1 | Y | N | 8 | 13 |
| Level-2 | Y | Y | 8 | 13 |
| Level-3 | Y | Y | 4 | 14 |
| Level-4 | Y | Y | 2 | 14 |
| Level-5 | Y | Y | 1 | 14 |
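A toy model of the drowsy mechanism behind Level-1/2 (the state names and single wake-up cycle are illustrative assumptions, chosen to be consistent with the access-cycle column above):

```python
class DrowsyLine:
    """Cache line that drops to a low-voltage 'drowsy' state when idle.

    A drowsy line retains its data at much lower leakage, but must be
    woken (one extra cycle here) before it can be read.
    """
    def __init__(self) -> None:
        self.drowsy = True

    def access(self, base_latency: int = 12) -> int:
        wakeup = 1 if self.drowsy else 0
        self.drowsy = False            # stays awake while in active use
        return base_latency + wakeup

    def decay(self) -> None:
        self.drowsy = True             # periodic policy re-drowses idle lines
```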
35
Power-Performance Curves
[Figure: normalized power vs. normalized performance for libquantum, gobmk, and HMean — series: Freq Scaling, Analytical DVFS, Frontend PG, L2 PG]
Better scaling than frequency scaling
Some even better than DVFS
Power gliding
– Reduces power-dominant optimizations and wasteful work
Frequency scaling
– Uniformly slows down execution
36
Summary
Frequency scaling
– Limited to linear dynamic power reduction
– Less effective on memory-intensive workloads
Power gliding
– Disables/constrains optimizations whose removal satisfies the 3:1 power-to-performance ratio
– Addresses leakage power
– More efficient scaling than frequency scaling
– Exceeds power savings by DVFS in some cases
37
Outline
Motivation
Proposal 1: WiDGET framework
Proposal 2: Deconstructing scalable cores
Proposal 3: Power gliding
Related work
Conclusions
38
Putting Everything Together
[Figure: normalized chip power vs. normalized performance — COBRi8 down through COBRi1, then L2 power gliding to Level-0, tracking the ideal power-proportionality line]
Approximates ideal power proportionality
– A processor that scales down power by 85%
OoO to in-order conversion not efficient
Level-0:
Single-issue in-order COBRi
(COBRi with 1 FIFO buffer)
40
Conclusions
Limited DVFS-driven power management
Power-proportional cores for future
technology nodes
– Dynamic resource allocation
– Aggregate resources to scale up
• WiDGET & COBRA
– Disable resources to scale down
• WiDGET, COBRA, & Power Gliding
– One processor, many different operating points
41
Acknowledgement
Committee
Special thanks to:
David Wood
John Davis
UW architecture students
Dan Gibson
Derek Hower
AMD Research
Joe Eckert
42
Backup Slides
43
Orthogonal Work
Circuit-level techniques
– Supply-voltage reduction
• Near-threshold operation [Dreslinski10]
• Subthreshold operation [Chandrakasan10]
– Globally-asynchronous locally-synchronous designs
[Kalla10]
– Transistor optimization [Azizi10]
• Multi- / variable-threshold CMOS
• Sleep transistors
System-level techniques
– PowerNap [Meisner09]
– Thread Motion [Rangan09]
44
Related Power Management Work
Energy-proportional computing [Barroso07]
Dynamically adaptive cores [Albonesi03]
– Localized changes
– Limited scalability
– Wasteful power reduction with minimum
performance impact
Heterogeneous CMP [Kumar03]
– Bound to static design choices
– Less effective for non-targeted apps
– More verification (and design) time
45
Related Low-Complexity uArch Work
Clustered Architectures
– Goal: Superscalar ILP without impacting cycle time
– Usually OoO-execution clusters
– Performance-centric steering policies
• Load balancing over locality
Approximating OoO execution
– Braid architecture [Tseng08]
– Instruction Level Distributed Processing [Kim02]
– Both require ISA changes or binary translation
46
Prior Scalable Cores
Similar vision, different scaling mechanisms
Core Fusion [Ipek07]
– Whole-core aggregation
– Centralized rename
Composable Lightweight Processors (CLP) [Kim07]
– Whole-core aggregation
– EDGE ISA assisted scheduling
Forwardflow [Gibson10]
– Only scales the window & execution
– Dataflow architecture
47
Dynamic Voltage/Freq Scaling (DVFS)
Dynamically trade off power for performance
– Change voltage and freq at runtime
– Often regulated by OS
• Slow response time
Linear reduction of V & F
– Cubic in dynamic power
– Linear in performance
– Quadratic in dynamic energy
Effective for thermal management
Challenges
– Controlling DVFS
– Diminishing returns of DVFS
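Written out, with dynamic power $P_{dyn} = \alpha C V^2 f$ and both $V$ and $f$ reduced by the same factor $k < 1$ (run time stretches to $t/k$):

$$ P_{dyn}' = \alpha C (kV)^2 (kf) = k^3 P_{dyn}, \qquad \mathrm{Perf}' = k\,\mathrm{Perf}, \qquad E_{dyn}' = P_{dyn}' \cdot \frac{t}{k} = k^2 E_{dyn} $$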
48
Intel Technology Scaling Trends
[Figure: normalized TDP, frequency, Vdd, and feature size vs. year (log scale), 1970–2010, relative to the 4004 (0.5 W, 0.1 MHz, 5 V, 10 µm); TDP grows ~2 orders of magnitude beyond what frequency alone explains]
Main reason behind power increase:
Increasingly power-inefficient transistors
Source: Stanford CPU database
49
Core Wars
Backup Slides
50
Core Scaling Taxonomy
| Component | Definition | Scaling Alternatives |
|---|---|---|
| L1-I | Mechanism to aggregate L1-Is | No scaling; sub-banked L1-Is |
| Frontend | Mechanism to scale frontend width | Static overprovisioning; aggregated frontend |
| Scheduling | Mechanism to scale the instruction scheduler | Steering based on architectural register dependency; steering with an L1-D bank predictor |
| Execution resources | Mechanism to scale the number of functional pipelines | Static overprovisioning; scaled with scheduler |
| Instruction window | Mechanism to scale the size of the instruction window | Static overprovisioning; scaled with scheduler |
| L1-D | Mechanism to aggregate L1-Ds | No scaling; bank-interleaved L1-Ds; ad hoc coherent L1-Ds |
| Resource acquisition philosophy | Means by which cores are provided with additional resources when scaled up | Resource borrowing; resource overprovisioning |
51
Area and Wire Delays
Smaller technologies / higher frequencies → smaller signaling distance
– @ 3 GHz & 45 nm: ~2 mm of distance per cycle
[Figure: max signaling distance (microns) vs. clock frequency (MHz) for 45 nm and 32 nm]
• Restriction on size and placement of shared resources
– E.g., for 2-issue OoO cores at 45 nm and 3 GHz, a 32KB L1-D shared by 2 cores sits at a 1-cycle distance; shared by 4 cores, at a 2-cycle distance
• More cores sharing → tighter constraints
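A back-of-the-envelope version of the distance argument (the 6000 µm/ns wire velocity is an assumed constant chosen to match the figure's axes, not a number from the talk):

```python
def max_distance_um(freq_ghz: float, wire_um_per_ns: float = 6000.0) -> float:
    """Distance a signal can cover in one clock cycle, in microns."""
    cycle_ns = 1.0 / freq_ghz
    return wire_um_per_ns * cycle_ns

print(max_distance_um(3.0))   # ~2000 um = ~2 mm per cycle at 3 GHz
```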
52
Sensitivity to Communication Overheads
[Figure: normalized cycles (cache / communication / ideal breakdown) for COR and BAR with 0-, 1-, and 2-cycle communication latencies (COR-0C…COR-2C, BAR-0C…BAR-2C) at scaling points 1, 2, and 4; * marks the default configurations]
Not sensitive to frontend depth
Cache-bank misprediction penalties dominate the
overheads
BAR outperforms COR only at scaling point 4 with no
wire delay
53
Frontend/Backend Width
[Figure: IPC and chip power normalized to OoO for COR1 (4-wide vs. 2-wide) and COR4 (4-wide vs. 8-wide)]
COR variants with different frontend/backend width
– COR1’: 2-wide (Narrower)
– COR4’: 8-wide (Wider)
Narrower COR1’ compared to default COR1
– Negligible performance impact
– 8% less power
Wider COR4’ compared to default COR4
– Negligible performance impact
– 29% more power
Scaling down the width is energy efficient
54
L1-I Aggregation
[Figure: normalized IPC for BAR2, COR2, BAR4, and COR4 with a single L1-I vs. 2- and 4-sub-banked L1-Is; BAR4' and COR4' use a single L1-I at a 2-cycle distance from remote cores]
Little impact on overall performance
– Avg miss rate < 0.5%
– Highest miss rate: 6% (OLTP)
L1-I aggregation not necessary
55
L1-D Aggregation
[Figure: normalized IPC for BAR2, COR2, BAR4, and COR4 with a single L1-D vs. 2- and 4-banked L1-Ds; BAR4' and COR4' use a single L1-D at a 2-cycle distance from remote cores]
More harm than good
– Especially for fully scaled-up cores
– 2% for BAR4’ and 4% for COR4’
Why the counter-intuitive results?
– Large working sets
– Longer effective access latency
• BAR: Cache-core mispredictions
• COR’: Remote cache access latencies
L1-D aggregation not necessary
56
Cache-Bank Predictor
For core-interleaved L1-D banks
– Each core’s D$ has associated
address range
– Pro: Maximize total cache size
– Con: Address not known @ steer
Cache-bank predictor @ steer
– Predict core with mapped address
• History-based prediction
– If mispredicted, re-route memory ops to the correct core
• Longer load-to-use latency
[Diagram: four cores (I$/FE/IQ/D$ each) behind a shared steering stage; each D$ bank owns an address range, e.g., 0x0–0xFF]
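A minimal sketch of a history-based bank predictor (the table size, PC hashing, and last-bank scheme are my assumptions; the slide specifies only history-based prediction with re-routing on a miss):

```python
class CacheBankPredictor:
    """Predict, at steer time, which core's D$ bank a memory op will hit."""
    def __init__(self, entries: int = 1024, banks: int = 4) -> None:
        self.banks = banks
        self.table = [0] * entries          # last bank seen per PC hash

    def predict(self, pc: int) -> int:
        return self.table[pc % len(self.table)]

    def update(self, pc: int, addr: int, line_bytes: int = 64) -> int:
        """Learn the true bank once the address resolves; return it so a
        mispredicted op can be re-routed to the correct core."""
        bank = (addr // line_bytes) % self.banks   # interleaved banks
        self.table[pc % len(self.table)] = bank
        return bank
```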
57
Instruction Buffer Limit Study
[Figure: normalized IPC and normalized chip power vs. number of EUs (1–8), for 2 to 16 instruction buffers per EU]
Diminishing returns after 8 buffers/EU
Energy efficiency
– Little sensitivity beyond 4 buffers/EU
58
Power Gliding
Backup Slides
59
Core Frequency Scaling Model
[Figure: run-time slowdown and normalized chip power at relative frequencies 1.00, 0.88, 0.76, 0.64, and 0.50]
Power-inefficient scaling
– Especially at lower frequencies
– Only indirect impact on leakage power
Memory-intensive workloads
– Tolerant to frequency scaling
60
Baseline Power Breakdown
[Figure: baseline chip power broken down into L1I, frontend, execution, backend, L1D, L2, and L3 for libquantum, bwaves, oltp, gobmk, and namd]
Memory-intensive workloads
– Higher L2 & L3 power fractions
Device type assumption
– L2: High performance
– L3: Low standby power
61
Frontend Power Gliding Implementation
Checkpoint removal
– Convert to commit-time recovery
Speculation control
– Stall if (unresolved branches == threshold)
– No use of confidence estimation
Fetch buffer resizing
– Help regulate speculation
– Reduced to match issue width
Register file resizing
– Power-saving opportunity created by speculation control
– Fewer in-flight instructions → less pressure on registers
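The stall condition, as a sketch (the fetch-stage framing and counter maintenance are illustrative):

```python
def fetch_allowed(unresolved_branches: int, threshold: int) -> bool:
    """Speculation control: stall fetch past the Nth unresolved branch.

    With checkpoints removed, every in-flight branch must recover at
    commit, so bounding unresolved branches bounds wrong-path work
    without any confidence estimator.
    """
    return unresolved_branches < threshold

# Stall-1 fetches past at most one unresolved branch at a time.
assert not fetch_allowed(unresolved_branches=1, threshold=1)
```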
62
Power-Performance Curves
[Figure: normalized power vs. normalized performance for libquantum, bwaves, oltp, gobmk, namd, and HMean, with workloads ordered by L3 misses per 1K instructions (namd lowest, bwaves highest); series: Freq Scaling, Analytical DVFS, Frontend PG, L2 PG]
Better scaling than frequency scaling
Some even better than DVFS
Power gliding
– Reduces power-dominant
optimizations and wasteful work
Frequency scaling
– Uniformly slows down execution
63
IPC Impact of Each Power Gliding Technique
[Figure: normalized IPC as power-gliding techniques are applied cumulatively — checkpoint removal, speculation control, fetch-buffer resizing, and register reduction — down to Stall-1]
Small performance implications from no checkpoint
Speculation control affects IPC the most
– Low branch miss rate → more sensitive
– Exception: bwaves due to very low branches per 1K
instructions
• Suffers from reduced registers instead
64
Power Breakdown of Frontend Power Gliding
[Figure: power breakdown (L1I, frontend, execution, backend, L1D, L2, L3) for Base and Stall-8 through Stall-1, per workload (libquantum, bwaves, oltp, gobmk, namd)]
Frontend power
– Smaller with more aggressive stalling levels
Reduced power throughout the pipe
– E.g., gobmk with Stall-1
– Almost half execution, backend, and L1-D power
65
L2 Miss Rate Sensitivity
[Figure: L2 miss rate for 8-way (default), 4-way, 2-way, and 1-way L2 across libquantum, bwaves, oltp, gobmk, and namd]
66
Power Breakdown of L2 Power Gliding
[Figure: power breakdown (L1I, frontend, execution, backend, L1D, L2, L3) for Base and Level-1 through Level-5, per workload (libquantum, bwaves, oltp, gobmk, namd)]
Smaller L2 → more L3 utilization
– Power shifting from L2 to L3
Longer core idle periods → less core power
– E.g., namd with Level-5
– ~20% less power in execution and backend
67
OLTP: COBRi & L2 Power Gliding
OLTP
[Figure: normalized chip power vs. normalized performance on OLTP — COBRi8 down to COBRi1, then L2 power gliding to Level-0, against the ideal power-proportionality line]
68
L2 Power Gliding on COBRi1
[Figure: normalized chip power vs. normalized performance — COBRi8 down to COBRi1, with L2 power gliding applied on COBRi1, against the ideal power-proportionality line]
69
WiDGET
Backup Slides
70
WiDGET Vision
[Diagram: five example WiDGET configurations partitioning resources among TLP, ILP, and power]
Just 5 examples. Much more can be done.
71
EU Cluster
[Diagram: two instruction engines (IE 0 with L1I 0/L1D 0, IE 1 with L1I 1/L1D 1) feeding clusters of in-order Execution Units (EUs) over the L2 fabric — full 1-cycle bypass within a cluster, inter-cluster link between clusters]
– Thread context management or Instruction Engine (front-end + back-end)
72
Instruction Engine (IE)
[Diagram: an IE pipeline — front-end: fetch (with branch predictor), decode, rename, steering; back-end: ROB, register file, commit]
• Thread-specific structures
• Front-end + back-end, similar to a conventional OoO pipe
• Steering logic for distributed EUs
– Achieves OoO performance with in-order EUs
– Exposes independent instruction chains
73
Instruction Steering Mechanism
Goals
– Expose independent instruction chains
– Achieve OoO performance with multiple in-order EUs
– Keep dependent instrs nearby
3 things to keep track of:
– Producer's location
– Whether the producer has another consumer
– Empty buffers
Tracked with a Last Producer Table plus full and empty bit vectors
74
Steering Heuristic
Based on dependence-based steering [Palacharla97]
– Expose independent instr chains
– Consumer directly behind the producer
– Stall steering when no empty buffer is found
WiDGET: Power-performance goal
– Emphasize locality & scalability
[Flowchart: steering decision by the number of outstanding producer ops]
– 0 producers → any empty buffer
– 1 producer → the slot behind the producer if available; else an empty buffer within the producer's cluster
– 2 producers → the slot behind either producer if available; else an empty buffer in either producer's cluster
• Consumer-push operand transfers
– Send steered EU ID to the producer EU
– Multi-cast result to all consumers
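A condensed sketch of this heuristic (buffer bookkeeping is reduced to sets, and the decision order follows the flowchart above; all names are mine):

```python
def steer(producers: list[int], tails: set[int], empty: set[int],
          cluster_of: dict[int, int]) -> int | None:
    """Pick an instruction buffer for a newly renamed instruction.

    producers:  buffer IDs of outstanding producer ops (0, 1, or 2)
    tails:      buffers whose tail slot is free right behind a producer
    empty:      completely empty buffers
    cluster_of: buffer ID -> cluster ID
    Returns a buffer ID, or None to stall steering.
    """
    # Prefer the slot directly behind a producer (locality).
    for p in producers:
        if p in tails:
            return p
    if producers:
        # Fall back to an empty buffer in a producer's cluster.
        near = {cluster_of[p] for p in producers}
        for b in empty:
            if cluster_of[b] in near:
                return b
        return None                      # stall: no suitable buffer
    # No outstanding producers: any empty buffer will do.
    return next(iter(empty), None)
```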
75
Coarse-Grain OoO Execution
[Diagram: OoO issue reorders instructions 1–8 freely within one window; WiDGET steers each dependence chain to an in-order buffer, so chains issue in order internally while executing out of order with respect to each other]
76
Steering Example
[Worked example: steering instructions 1–8 — the Last Producer Table maps each register to its producer's buffer ID plus a has-a-consumer bit, while empty/full bit vectors track buffer occupancy]
77
Instruction Buffers
Small FIFO buffer
– Config: 16 entries
– 1 straight instruction chain per buffer
Entry fields: instruction, Op 1, Op 2, consumer EU bit vector
– Consumer EU bit vector
• Set if a consumer is steered to a different EU
• Read after computation → multicast the result to consumers
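The entry layout as a sketch (field names and types are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class BufferEntry:
    """One slot of a 16-entry FIFO instruction buffer."""
    instr: int                      # decoded instruction
    op1: int | None = None          # source operands, filled as they arrive
    op2: int | None = None
    consumer_eus: set[int] = field(default_factory=set)
    # consumer_eus: set when a consumer is steered to a different EU;
    # read after computation to multicast the result.
```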
78
Memory Disambiguation on WiDGET
Challenges arising from modularity
– Less communication between modules
– Less centralized structures
Benefits of NoSQ [Sha06]
– Mem dependency → register dependency
– Reduced communication
– No centralized structure
– Only register dependency relation b/w EUs
• Faster execution of loads
79
Memory Instructions?
No LSQ thanks to NoSQ [Sha06]
Instead,
– Exploit in-window ST-LD forwarding
– LDs: Predict if dependent ST is in-flight @ Rename
• If so, read from ST’s source register, not from cache
• Else, read from cache
• @ Commit, re-execute if necessary
– STs: Write @ Commit
– Prediction
• Dynamic distance in stores
• Path-sensitive
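A toy version of the distance predictor NoSQ relies on (indexing and training are simplified guesses here; see Sha et al. [Sha06] for the actual design):

```python
class StoreDistancePredictor:
    """Predict whether a load forwards from the k-th most recent store."""
    def __init__(self, entries: int = 1024) -> None:
        self.distance = [0] * entries    # 0 = read from the cache

    def predict(self, load_pc: int) -> int:
        return self.distance[load_pc % len(self.distance)]

    def train(self, load_pc: int, actual_distance: int) -> None:
        """actual_distance: stores between the load and its producer
        store, observed at commit (0 if the value came from the cache)."""
        self.distance[load_pc % len(self.distance)] = actual_distance
```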
80
Area Model (45nm)
Assumptions
– Single-threaded uniprocessor
– On-chip 1MB L2
– Atom chip ≈ WiDGET (2 EUs, 1 buffer per EU)
WiDGET area: 10% larger than Mite, 19% smaller than Neon
[Figure: area (mm²) at 45 nm for Mite, WiDGET, and Neon]
81
Power Breakdown
[Figure: power breakdown — L3, L2, L1D, L1I, fetch/decode/rename, backend, ALU, execution — for Neon, Mite, and WiDGET with 1–8 EUs and 1–4 instruction buffers per EU]
Less than ⅓ of Neon’s execution power
– Due to no OoO scheduler and limited bypass
Increase in WiDGET’s power caused by:
– Increased EUs and instruction buffers
– Higher utilization of other resources
82
Geometric Mean Power Efficiency (BIPS³/W)
[Figure: geometric-mean BIPS³/W for WiDGET configurations vs. Neon and Mite]
Best-case: 2x of Neon, 21x of Mite
1.5x the efficiency of Xeon for the same performance
83
Related Work
Backup Slides
84
Comparison to Related Work
| | Scale Up & Down? | Symmetric? | Decoupled Exec? | In-Order? | Wire Delays? | Data Driven? | ISA Compatibility? |
|---|---|---|---|---|---|---|---|
| WiDGET | √ | √ | √ | √ | √ | √ | √ |
| Adaptive Cores | X | - | √/X | √/X | - | - | √ |
| Heterogeneous CMPs | X | X | X | √/X | - | - | √ |
| Core Fusion | √ | √ | X | X | √ | - | √ |
| CLP | √ | √ | √ | √ | √ | √ | X |
| TLS | X | √ | X | √/X | - | X | √ |
| Multiscalar | X | √ | X | X | - | X | X |
| Complexity-Effective | X | √ | √ | √ | X | √ | √ |
| Salverda & Zilles | √ | √ | X | √ | X | √ | √ |
| ILDP & Braid | X | √ | √ | √ | - | √ | X |
| Quad-Cluster | X | √ | √ | X | √ | √/X | √ |
| Access/Execute | X | X | X | √ | - | √ | X |
Design EX
85
EX: Steering for Locality
[Diagram: clustered architectures (e.g., Advanced RMBS, Modulo steering) maintain load balance, spreading a dependence chain across Cluster 0 and Cluster 1 and paying inter-cluster delay; WiDGET exploits locality, keeping the chain within one cluster]
86
Vs. Complexity-Effective Superscalars
Palacharla et al.
– Goal: Performance
– Consider all buffers for steering and issuing
• More buffers → more options
WiDGET
– Goal: Power-performance
– Requirements
• Shorter wires, power gating, core scaling
– Differences: Localization & scalability
• Cluster-affined steering
• Keep dependent chains nearby
• Issuing selection only from a subset of buffers
– New question: Which empty buffer to steer to?
87
Steering Cost Model [Salverda08]
Steer to distributed IQs
Steering policy determines issue time
– Constrained by dependency, structural hazards, issue
policy
Ideal steering will issue an instr:
– As soon as it becomes ready (horizon)
– Without blocking others (frontier) (constraints of in-order)
Steering Cost = horizon – frontier
– Good steering minimizes |Steering Cost|
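Written as code (each IQ is reduced to a single frontier timestamp; the helper names are hypothetical):

```python
def steering_cost(ready_time: int, frontier_time: int) -> int:
    """Cost = horizon - frontier for one candidate in-order IQ.

    ready_time (horizon): earliest cycle the instr's operands are ready
    frontier_time:        cycle the IQ could issue its next instruction
    Positive cost: the instr waits behind the IQ's tail;
    negative cost: the instr blocks the IQ while waiting on operands.
    """
    return ready_time - frontier_time

def best_iq(ready_time: int, frontiers: list[int]) -> int:
    # Ideal steering minimizes |horizon - frontier| across all IQs.
    return min(range(len(frontiers)),
               key=lambda i: abs(steering_cost(ready_time, frontiers[i])))
```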
88
EX: Application of Cost Model
Steer instr 3 to in-order IQs
[Worked example: four in-order IQs holding instrs 1, 2, and 4 at various depths; with frontier F and horizon marked, the candidate costs C = H − F across IQ0–IQ3 are −1, −1, 0, and 1]
Challenges
– Check all IQs to find an optimal steering
– Actual exec time is unknown in advance
Argument of Salverda & Zilles
– Too complex to build, or
– Too many execution resources needed to match OoO
89
Impact of Communication Delays
[Worked example: if a 1-cycle communication latency is added between IQs, the spread-out schedule stretches from 3 to 5 cycles; steering the dependent instructions into fewer IQs instead finishes in 4 cycles]
Trade off parallelism for communication
90
Observation Under Communication Delays
Not beneficial to spread instructions → reduced pressure for more execution resources
No need to consider distant IQs → reduced problem space → simplified steering
91
Energy-Proportional Computing for Servers
[Barroso07]
Servers
– 10-50% utilization most of the time
• Yet, availability is crucial
– Common energy-saving techniques inapplicable
– 50% of full power even during low utilization
Solution: Energy proportionality
– Energy consumption in proportion to work done
Key features
– Wide dynamic power range
– Active low-power modes
• Better than sleep states with wake-up penalties
92
PowerNap [Meisner09]
Goals
– Reduction of server idle power
– Exploitation of frequent idle periods
Mechanisms
– System level
– Reduce transition time into & out of nap state
– Ease power-performance trade-offs
– Modify hardware subsystems with high idle power
• e.g., DRAM (self-refresh), fans (variable speed)
93
Thread Motion [Rangan09]
Goals
– Fine-grained power management for CMPs
– Alternative to per-core DVFS
– High system throughput within power budget
Mechanisms
– Migrate threads rather than adjusting voltage
– Homogeneous cores in multiple, static voltage/freq
domains
– 2 migration policies
• Time-driven & miss-driven
94
Vs. Thread-Level Speculation
Their way
– SW: Divides into contiguous segments
– HW: Runs speculative threads in parallel
[Diagram: CMP with speculation support over a shared L2]
Shortcomings
– Only successful for regular program structures
– Load imbalance
– Squash propagation
My Way
– No SW reliance
– Support a wider range of programs
95
Vs. Braid Architecture [Tseng08]
Their way
– ISA extension
– SW: Re-orders instrs based on dependency
– HW: Sends a group of instrs to FIFO issue queues
Shortcomings
– Re-ordering limited to basic blocks
My Way
– No SW reliance
– Exploit dynamic dependency
96
Vs. Instruction Level Distributed Processing
(ILDP) [Kim02]
Their way
– New ISA or binary translation
– SW: Identifies instr dependency
– HW: Sends a group of instrs to FIFO issue queues
Shortcomings
– Lose binary compatibility
My Way
– No SW reliance
– Exploit dynamic dependency
97