Leveraging Dynamically Scalable Cores in CMPs
Thesis Proposal
5 May 2009
Dan Gibson
1
Scalability
defn. SCALABLE
Pronunciation: \'skā-lə-bəl\
Function: adjective
1. capable of being scaled (expanded/upgraded OR reduced in size)
2. capable of being easily expanded or upgraded on demand
   <a scalable computer network>
[Merriam-Webster2009]
2
Executive Summary (1/2)
CMPs target a wide userbase
Future CMPs should deliver:
  TLP, when many threads are available
  ILP, when few threads are available
  Reasonable power envelopes, all the time
Future Scalable CMPs should be able to:
  Scale UP for performance
  Scale DOWN for energy conservation
3
Executive Summary (2/2)
Scalable CMP = Cores that can scale (mechanism) + Policies for scaling
Forwardflow: Scalable Uniprocessors
Proposed: Dynamically Scalable Forwardflow Cores
Proposed: Hierarchical Operand Networks for Scalable Cores
Proposed: Control Independence for Scalable Cores
4
Outline
CMP 2009 → 2019: Trends toward Scalable Chips
Work So Far: Forwardflow
Proposed Work
  Methodology
  Dynamic Scaling
  Operand Networks
  Control Independence
  Miscellanea (Maybe)
Schedule/Closing Remarks
5
CMP 2009 → 2019: Moore's Law Endures
[Figure: density scaling from Rock, 65nm [JSSC2009], to a projected "Rock16" at 16nm [ITRS2007].]
"In 1965, Gordon Moore sketched out his prediction of the pace of silicon technology. Decades later, Moore's Law remains true, driven largely by Intel's unparalleled silicon expertise."
Copyright © 2005 Intel Corporation.
More Transistors => More Threads? (~512)
6
CMP 2009 → 2019: Amdahl's Law Endures
"Everyone knows Amdahl's law, but quickly forgets it." - Thomas Puzak, IBM, 2007

Speedup = [ (1 - f) + f/N ]^(-1)

Parallel speedup is limited by the parallel fraction f.
[Chart: speedup vs. N (1 to 512 cores) for f = 90%, 95%, 99%.]
i.e. Only ~10x speedup at N=512, f=90%
No TLP = No Speedup
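A quick numeric check of the speedup claim above (a minimal sketch, not from the talk; plain Amdahl's-law arithmetic):

```python
# Amdahl's-law speedup, used to check the claim that f = 90% caps speedup
# near ~10x even at N = 512 cores.
def amdahl_speedup(f: float, n: int) -> float:
    """Speedup = 1 / ((1 - f) + f / N)."""
    return 1.0 / ((1.0 - f) + f / n)

if __name__ == "__main__":
    for f in (0.90, 0.95, 0.99):
        print(f"f={f:.2f}  N=512  speedup={amdahl_speedup(f, 512):.1f}x")
    # f=0.90 -> ~9.8x;  f=0.95 -> ~19.3x;  f=0.99 -> ~83.8x
```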
7
CMP 2009 → 2019: SAF [Chakraborty2008]
Simultaneously Active Fraction (SAF): the fraction of devices that can be active at the same time while still remaining within the chip's power budget.
[Chart: Dynamic SAF (0 to 1) vs. technology node (90nm, 65nm, 45nm, 32nm) for HP and LP devices.]
More Transistors => Lots of them have to be off [Hill08]
(Flavors of "Off" -- see backup slide)
8
CMP 2009 → 2019: Leakage: Tumultuous Times for CMOS
[Charts: Leakage Power by Circuit Variant [ITRS2007] (normalized power); 1MB Cache Dynamic & Leakage Power [HP2008, ITRS2007] (power, mW).]
Leakage starts to dominate
SOI & DG technology helps (ca. 2010/2013)
Tradeoffs possible:
  Low-leak devices (slower access time)
  DG devices
  LSP devices
9
CMP 2009 → 2019: Trends Summary

Trend           Implication
Moore's Law     Abundant transistors
Falling SAF     Cannot operate all transistors all the time
Amdahl's Law    Serial bottlenecks matter
Rising Leakage  Cost of "doing nothing" increases

These trends are in conflict: adding cores is not enough.
10
CMP 2019: A Scalable Chip
Scale UP for Performance
  Sometimes big cores, sometimes many cores
  When possible, use more resources for more performance
Scale DOWN for Energy Conservation
  Exploit TLP with many small cores
  Manage SAF by shutting down portions of cores
11
CMP 2019: Scaling (Assume SAF = 0.5)
[Figure: 16-core CMP with per-core caches. Homogeneous Scale Down: shut off cores. Heterogeneous Scale Down: scale one core up for single-thread performance while others scale down or turn off.]
12
CMP 2019: Scalable Cores Requirements
Add Resources => More Performance
Remove Resources => Less Power
Small/Fixed Per-Thread Overhead (e.g. Thread State)
Scaling UP should be useful for a variety of workloads
  i.e. Aggressive single-thread performance
13
Scalable Cores: Open Questions
Scaling UP
  Where do added resources come from?
    From statically-provisioned private pools?
    From other cores?
Scaling DOWN
  What happens to unused resources?
  Turn off? How "off" is off?
When to Scale?
  In HW/SW?
  How to communicate?
14
Outline
CMP 2009 → 2019: Trends toward Scalable Chips
Work So Far: Forwardflow
Proposed Work
  Methodology
  Dynamic Scaling
  Operand Networks
  Control Independence
  Miscellanea (Maybe)
Schedule/Closing Remarks
15
Forwardflow: A Scalable Core
Conventional OoO does not scale well:
  Structures must be scaled together (ROB, PRF, RAT, LSQ, …)
  Wire delay is not friendly to scaling
Forwardflow has ONE logical backend structure
  Scale one structure, scale it well
  Tolerates (some) wire delay
16
Forwardflow Overview
Design Philosophy:
  Avoid 'broadcast' accesses (e.g., no CAMs)
  Avoid 'search' operations (via pointers)
  Prefer short wires, tolerate long wires
  Decouple frontend from backend details
  Abstract backend as a pipeline
17
Forwardflow – Scalable Core Design
Use Pointers to Explicitly Define Data Movement
Every operand has a Next Use Pointer
Pointers specify where data moves (in log(N) space)
Pointers are agnostic of:
  Implementation
  Structure sizes
  Distance
[Diagram: example sequence (ld, add, sub, st, breq) in which each value's next-use pointer links its producer to its successor.]
No search operation
18
Forwardflow – Dataflow Queue
Table of in-flight instructions
  Combination Scheduler, ROB, and PRF
  Manages OoO dependencies
  Performs scheduling
  Holds data values for all operands
Each operand maintains a next-use pointer (hence the log(N))
Implemented as banked RAMs => Scalable

Dataflow Queue ("Bird's Eye" and "Detailed" views):
  #  Op    Op1  Op2  Dest
  1  ld    R4   4    R1
  2  add   R1   R3   R3
  3  sub   R4   16   R4
  4  st    R3   R8   --
  5  breq  R4   R5   --
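A minimal Python sketch of the structure above (field names are assumptions, not the thesis's RTL): each DQ entry holds two sources and a destination, and every operand slot carries one next-use pointer.

```python
# Sketch of a Dataflow Queue entry: each operand slot holds a value, a ready
# bit, and a single next-use pointer naming the DQ slot where the value is
# used next ((entry index, field) in log(N) bits; None = end of chain).
from dataclasses import dataclass, field
from typing import Optional, Tuple

Ptr = Optional[Tuple[int, str]]   # (DQ entry index, "s1"/"s2"/"dest") or None

@dataclass
class Operand:
    value: Optional[int] = None   # operand/result value, once known
    ready: bool = False
    next_use: Ptr = None          # where this value is needed next

@dataclass
class DQEntry:
    op: str = ""                  # e.g. 'ld', 'add', 'sub', 'st', 'breq'
    s1: Operand = field(default_factory=Operand)
    s2: Operand = field(default_factory=Operand)
    dest: Operand = field(default_factory=Operand)

# The slide's example window: 1:ld, 2:add, 3:sub, 4:st, 5:breq
dq = [DQEntry("ld"), DQEntry("add"), DQEntry("sub"), DQEntry("st"), DQEntry("breq")]
```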
19
Forwardflow – DQ +/-’s
+ Explicit, persistent dependencies
+ No searching of any kind
- Multi-cycle wakeup per value *
[Dataflow Queue example as on the previous slide: entries 1-5 (ld, add, sub, st, breq).]
*Average Number of Successors is Small
[Ramirez04,Sassone07]
20
DQ: Banks, Groups, and ALUs
Logical Organization
DQ Bank Group – Fundamental Unit of Scaling
Physical Organization
21
Forwardflow Single-Thread Performance
[Chart: normalized performance (0-4) for OoO-128, FF-128, OoO-512, FF-512, and BigSched.]
FF-128: 9% Lower Performance, 20% Lower Power
FF-512: 49% Higher Performance, 5.7% Higher Power
(Links: More FF Details / Move On)
22
Forwardflow: Pipeline Tour
[Pipeline diagram: PRED/FETCH (I$) -> DECODE (RCT) -> DISPATCH -> EXECUTE (DQ, D$) -> COMMIT (ARF); the DQ forms a scalable, decoupled backend.]
RCT: Identifies Successors
ARF: Provides Architected Values
DQ: Chases Pointers
23
RCT: Summarizing Pointers
Want to dispatch: breq R4 R5
Need to know:
  Where to get R4? (the result of DQ Entry 3)
  Where to get R5? (from the ARF)
The Register Consumer Table summarizes where the most-recent version of each register can be found.
Dataflow Queue:
  #  Op   Op1  Op2  Dest
  1  ld   R4   4    R1
  2  add  R1   R3   R3
  3  sub  R4   16   R4
  4  st   R3   R8   --
  5  (empty)
24
RCT: Summarizing Pointers
Register Consumer Table (RCT): one REF (last reference) and one WR (last write) pointer per architectural register.
For breq R4 R5:
  R4 comes from DQ Entry 3 (the RCT's R4 entry points at 3-D)
  R5 comes from the ARF
[Diagram: RCT entries (e.g. R1: 2-S1 / 1-D; R3: 4-S1; R4: 5-S1 / 3-D) alongside the Dataflow Queue, with breq dispatched into entry 5.]
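A sketch of the dispatch-time lookup just described, building on the earlier DQEntry/Operand sketch (a simplified single-pointer consumer table, as assumed here; not the exact REF/WR structure):

```python
# Sketch of dispatch using a register consumer table: for each source register,
# the table says whether the most-recent version lives in the ARF or in an
# in-flight DQ field; dispatch links the new instruction into that chain and
# records itself as the newest reference/writer.
def dispatch(dq, rct, arf, entry_idx, op, src_regs, dest_reg=None):
    dq[entry_idx].op = op
    for slot, reg in zip(("s1", "s2"), src_regs):
        producer = rct.get(reg)                    # e.g. (3, "dest") or None
        operand = getattr(dq[entry_idx], slot)
        if producer is None:                       # value already architected
            operand.value, operand.ready = arf[reg], True
        else:                                      # append to the pointer chain
            prod_entry, prod_field = producer
            getattr(dq[prod_entry], prod_field).next_use = (entry_idx, slot)
        rct[reg] = (entry_idx, slot)               # newest reference to reg
    if dest_reg is not None:
        rct[dest_reg] = (entry_idx, "dest")        # newest writer of dest_reg
```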
25
Wakeup/Issue: Walking Pointers
[Dataflow Queue example: entries 1-5 (ld, add, sub, st, breq) linked by next-use pointers.]
Follow the Dest pointer when a new result is produced
Continue following pointers to subsequent successors
At each successor, read the 'other' value & try to issue
NULL pointer = last successor
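A minimal sketch of that pointer walk (reusing the earlier DQEntry/Operand sketch; the issue callback is a stand-in for the real select logic):

```python
# Wakeup by walking next-use pointers: deliver the new value to each successor
# in the chain, try to issue any successor whose other operand is ready, and
# stop at a NULL (None) pointer, which marks the last successor.
def wakeup(dq, producer_idx, value, issue):
    dq[producer_idx].dest.value = value
    ptr = dq[producer_idx].dest.next_use            # follow the Dest pointer first
    while ptr is not None:                          # NULL pointer = last successor
        entry_idx, slot = ptr                       # successors are source slots here
        operand = getattr(dq[entry_idx], slot)
        operand.value, operand.ready = value, True
        other = dq[entry_idx].s2 if slot == "s1" else dq[entry_idx].s1
        if other.ready:                             # read the 'other' value & try to issue
            issue(entry_idx)
        ptr = operand.next_use                      # continue to the next successor
```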
26
DQ: Fields and Banks
Independent Fields => Independent RAMs
  i.e. accessed independently, independent ports, etc.
Multi-Issue ≠ Multi-Port; Multi-Issue => Multi-Bank
  Dispatch, Commit access contiguous DQ regions
  => Bank on low-order bits for dispatch/commit BW
Port Contention + Wire Delay = More Banks
  Dispatch, Commit share a port
  => Bank on a high-order bit to reduce contention
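One possible bank-index mapping consistent with the two rules above (an assumed illustration only; the thesis does not give this exact function):

```python
# Assumed mapping: low-order bits spread consecutive dispatches/commits across
# banks (bandwidth), and one high-order bit separates the dispatch and commit
# regions so they rarely contend for the shared port.
def dq_bank(entry: int, n_banks: int, dq_size: int) -> int:
    assert n_banks % 2 == 0 and n_banks >= 2
    low = entry % (n_banks // 2)            # low-order bits: dispatch/commit BW
    high = int(entry >= dq_size // 2)       # one high-order bit: contention reduction
    return high * (n_banks // 2) + low
```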
27
DQ: Banks, Groups, and ALUs
Logical Organization
DQ Bank Group – Fundamental Unit of Scaling
Physical Organization
28
Evaluation Methodology
Full-System Trace-Based Simulation
  Simics + GEMS/Ruby Memory Simulator + Homegrown OOO/Forwardflow Simulator
  SPARCv9
SPEC CPU 2k6 + Wisconsin Commercials
WATTCH/CACTI 4.1 for Power Modeling
29
Experimental Setup
               CAM-128    FFlow-128   CAM-512    FFlow-512
Width          4/2/2      4/2/2       8/4/4      8/4/4
INT/FP Units   2/2        2/2         4/4        4/4
Br. Pred.      64Kb TAGE (13 tagged tables), perfect BTB & RAS
Mem. Units     4          4           4          4
Sched. Size    32/128     128         64/512     512
Phys. Regs.    256        N/A         1024       N/A
Mem. Dis.      Ideal dependence predictor
L1-I           32KB 4-way, 64-byte line, 3 cycle, pipelined
L1-D           32KB 4-way, 64-byte line, 2 cycle load-to-use
L2             2MB 4-way, 64-byte line, 7 cycle
L3             8MB 8-way, 64-byte line, 75 cycle, 16 banks
Memory         8 GB, 500-cycle, 12.8 GB/s
30
Forwardflow Single-Thread Performance
[Chart: normalized performance (0-4) for OoO-128, FF-128, OoO-512, FF-512, and BigSched.]
FF-128: 9% Lower Performance, 20% Lower Power
FF-512: 49% Higher Performance, 5.7% Higher Power
31
Related Work
Scalable Schedulers
Direct Instruction Wakeup [Ramirez04]:
Scheduler has a pointer to the first successor
Secondary table for matrix of successors
Hybrid Wakeup [Huang02]:
Scheduler has a pointer to the first successor
Each entry has a broadcast bit for multiple successors
Half Price [Kim02]:
Slice the scheduler in half
Second operand often unneeded
32
Related Work
Dataflow & Distributed Machines
Tagged-Token [Arvind90]
  Values (tokens) flow to successors
TRIPS [Sankaralingam03]:
  Discrete execution tiles: X, RF, $, etc.
  EDGE ISA
Clustered Designs [e.g. Palacharla97]
  Independent execution queues
33
Outline
CMP 2009 → 2019: Trends toward Scalable Chips
Work So Far: Forwardflow
Proposed Work
  Methodology
  Dynamic Scaling
  Operand Networks
  Control Independence
  Miscellanea (Maybe)
Schedule/Closing Remarks
34
Proposed Methods
                  Old Method        New Method
Simulator         Trace-Based       Execution-Driven
#Cores, Threads   1                 4-16, 16-32
Disambiguation    Perf. Prediction  NoSQ
Technology        70nm              32nm
Reportables       IPC, Power        Runtime, Energy-Efficiency
(Links: Methods Summary / All Parameters)
35
Proposed Methods – Details 1
Component           Configuration
Mem. Cons. Mod.     Sequential Consistency
Coherence Prot.     MOESI Directory (single chip)
Store Issue Policy  Permissions Prefetch at X
Freq. Range         2.5 – 4.0 GHz
Technology          32nm
Window Size         Varied by experiment
Disambiguation      NoSQ
Branch Prediction   TAGE + 16-entry RAS + 256-entry BTB
Frontend            7 cyc. predict-to-dispatch
36
Proposed Methods – Details 2
Component            Configuration
L1-I Caches          32KB 4-way, 64B line, 4-cycle, 2 proc. ports
L1-D Caches          32KB 4-way, 64B line, 4-cycle LTU, 4 proc. ports, WI/WT, included by L2
L2 Caches            1MB 8-way, 64B line, 11-cycle, WB/WA, private
L3 Cache             8MB 16-way, 64B line, 24-cycle, shared
Main Memory          4-8GB, 2 DDR2-like controllers (64 GB/s peak BW), 450-cycle latency
Inter-proc. Network  2D mesh, 16B links
37
Proposed Methods: Memory Timing
Caches: CACTI 5.3
  Indep. vars: size, assoc., banks, ports
  Output vars: latency, area, power (next slides)
  Sizes reflect contemporary designs (e.g. Nehalem)
Memory:
  Debate: on-chip or off-chip controllers?
    On-chip: lower latency, less bandwidth
    Off-chip: more controllers, higher aggregate bandwidth, higher latency (e.g. SERDES)
  64 GB/s = 2 'cutting edge' QPI links' worth of BW
  450 cycles = latency on a dual Intel Core 2 Quad system
38
Proposed Methods: Area
1. Unit Area/AR estimates (CACTI, WATTCH, or literature)
2. Floorplanning (manual and automated)
3. Repeat 1 & 2 hierarchically for the entire design
4. Latencies determined by floorplanned distance (heuristic-guided optimistic repeater placement)
Area of I/O pads, clock gen, etc. not included
39
Proposed Methods: Power
WATTCH: count events – events have fixed energy cost
  New custom models from CACTI for all memory-like structures
  Structures have semi-constant leakage
Assume:
  ITRS-HP devices for perf.-critical structures
  ITRS-LSP or Drowsy for caches
CACTI for caches
Orion for networks
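A minimal sketch of this style of accounting (structure names and energy numbers are made-up placeholders, not values from the proposal):

```python
# WATTCH-style accounting: each event type has a fixed dynamic energy cost,
# and each active structure contributes a (semi-)constant leakage power.
EVENT_ENERGY_NJ = {"dq_read": 0.02, "dq_write": 0.03, "alu_op": 0.05}   # hypothetical
LEAKAGE_MW = {"dq_bank": 1.5, "l1d": 4.0}                               # hypothetical

def total_energy_nj(event_counts: dict, active_structures: dict, seconds: float) -> float:
    dynamic = sum(EVENT_ENERGY_NJ[e] * n for e, n in event_counts.items())
    # mW -> W -> J over 'seconds' -> nJ
    leakage = sum(LEAKAGE_MW[s] * n for s, n in active_structures.items()) * 1e-3 * seconds * 1e9
    return dynamic + leakage
```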
40
Proposed Methods: Chip Floorplans
2D Tiled Topology
  Tile = Processor, L1(s), L2, L3 bank, Router
  Link latency = F(Tile Area)
Assume 1P/2P parts are 'binned' 4P designs

            4P       8P       16P
Area, 32nm  48 mm²   72 mm²   126 mm²
Area, 65nm  180 mm²  270 mm²  480 mm²
(Assumes FF-128 cores)
41
Outline
CMP 2009 → 2019: Trends toward Scalable Chips
Work So Far: Forwardflow
Proposed Work
  Methodology
  Dynamic Scaling (Executive Summary)
  Operand Networks
  Control Independence
  Miscellanea (Maybe)
Schedule/Closing Remarks
42
Proposed Work: Executive Summary
Dynamic Scaling in CMPs
  CMPs will have many cores, caches, etc., all burning power
  SW requirements will be varied
Hierarchical Operand Networks
  Scalable cores need scalable, generalized communication
  Exploit regular communication
  Adapt to changing conditions/topologies (e.g., from dynamic scaling)
Control Independence
  Increase performance from scaling up
  Trades off window space for window utilization
Miscellanea
43
Dynamic Scaling in CMPs
Observation 1: Power Constraints
Many design-time options to maintain a power envelope:
  Many small cores vs. few, aggressive cores
  How much cache per core?
Fewer run-time options:
  DVS/DFS
  Shut cores off, e.g. OPMS
44
Dynamic Scaling in CMPs
Observation 2: Unpredictable Software
GP chips run a variety of SW:
  Highly-threaded servers
  Single threads
  Serial bottlenecks
Optimal HW for App A is non-optimal for App B
  Nehalem for bzip2
  Niagara for OLTP
45
Dynamic Scaling in CMPs
Opportunity: Adapt the CMP to Power & SW Demands
Migrate design-time options to run-time
Share core & cache resources dynamically to fit SW & power requirements
  Need scalable cores (e.g., Forwardflow)
  Many open questions (the purpose of this research)
  Need a suitable intra-core interconnect (coming up)
46
Dynamic Scaling in CMPs
(Assume SAF = 0.5)
[Figure: 16-core CMP with per-core caches, shown scaled differently for each workload.]
Threaded Server Workload:
  • Leverage TLP
  • Scale cores down
  • Scale caches down?
Single-Thread Workload:
  • Leverage ILP
  • Scale one core up
  • Turn unused cores off
47
Dynamic Scaling: Leveraging Forwardflow
Recall: the DQ is organized into Bank Groups (with ALUs)
Scale Up: build a larger DQ with more groups
Scale Down: use fewer groups
48
Where do extra resources come from?
Option 1: Provision N private groups per core
  + Simple
  - Cache sharing is difficult
  - Constrains maximum core size to N
  - May waste area
Option 2: Provision N*P shareable groups per chip
  + Core size less constrained
  - Complicates sharing policy, interconnect
[Diagrams: per-core private bank groups vs. a chip-wide pool of shareable bank groups.]
49
Dynamic Scaling: More Open Questions
Scale in HW or SW?
  SW has scheduling visibility
  HW can react at fine granularity
What about frontends?
What about caches?
  Multiple caches for one thread
50
Dynamic Scaling:
“Choose your own adventure” prelim
Preliminary Uniprocessor Scaling Results
Related Work
Move on to Operand Networks
51
Dynamic Scaling: Preliminary Uniprocessor Results
Simple Scaling Heuristic ("Dynamic"): Hill-Climbing
Periodically sample 1M instrs.:
  BIPS³/W for the current DQ size
  BIPS³/W for ½ × the current DQ size
  BIPS³/W for 2 × the current DQ size
Pick the highest observed BIPS³/W
As it turns out… Dynamic has flaws, but it does illustrate proof of concept
(Why BIPS³/W? -- see backup slide)
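A minimal sketch of that hill-climbing loop (sampling mechanics and the 32/2048 size bounds are assumptions, not the proposal's exact policy):

```python
# Every 1M instructions, estimate BIPS^3/W at the current DQ size and at
# half/double that size, then keep whichever configuration scored best.
def bips3_per_watt(bips: float, watts: float) -> float:
    return bips ** 3 / watts

def next_dq_size(current: int, sample) -> int:
    """sample(dq_size) -> (bips, watts) over a 1M-instruction window."""
    candidates = [max(current // 2, 32), current, min(current * 2, 2048)]  # assumed bounds
    scores = {size: bips3_per_watt(*sample(size)) for size in candidates}
    return max(scores, key=scores.get)     # scale toward the most efficient observed size
```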
52
Dynamic Scaling: Preliminary Results: Performance
[Chart: normalized IPC (0-4) for S-128, S-512, S-2048, and Dynamic on bzip2, zeusmp, and the harmonic mean.]
53
Dynamic Scaling: Preliminary Results: Efficiency
[Chart: normalized BIPS³/W (0-7; one bar reaches 15.6) for S-128, S-512, S-2048, and Dynamic on bzip2, zeusmp, and in aggregate.]
54
Dynamic Scaling: Related Work
CoreFusion [Ipek07]
  Fuse individual core structures into bigger cores
Power-aware microarchitecture resource scaling [Iyer01]
  Varies RUU & width
Positional Adaptation [Huang03]
  Adaptively applies low-power techniques:
    instruction filtering, sequential cache, reduced ALUs
55
Outline
CMP 2009 → 2019: Trends toward Scalable Chips
Work So Far: Forwardflow
Proposed Work
  Methodology
  Dynamic Scaling
  Operand Networks
  Control Independence
  Miscellanea (Maybe)
Schedule/Closing Remarks
56
Operand Networks
Scalable Cores need Generalized Intra-Core Interconnect
  All-to-all bypassing & forwarding is O(N²)
  Intra-core interconnect fits nicely with the pointer abstraction of dependence
Intra-core communication follows exploitable patterns
  Locality: hierarchical interconnects
  Directionality: e.g. rings instead of meshes
57
Operand Networks
Traffic Classification by Destination:
  Same Bank Group: Inter-Bank (IB)
  Next Bank Group: Inter-Bank-Group Neighbor (IBG-Neighbor)
  Non-Next Bank Group: Inter-Bank-Group Distant (IBG-Distant)
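A small sketch of this classification (the modular "next group" encoding is an assumption for illustration):

```python
# Classify the operand-network traffic generated when a producer in one DQ
# bank group forwards a value to the successor named by its pointer.
def classify(src_group: int, dst_group: int, n_groups: int) -> str:
    if dst_group == src_group:
        return "IB"             # Inter-Bank: stays within the bank group
    if dst_group == (src_group + 1) % n_groups:
        return "IBG-Neighbor"   # next bank group
    return "IBG-Distant"        # any other bank group
```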
58
Operand Networks
[Chart: CDF of pointer span for astar, sjeng, and jbb, with markers at SPAN=5 and SPAN=16.]
Observation: ~85% of pointers designate near successors (see the SPAN=5 marker).
  Intuition: Most of these pointers yield IB traffic, some IBG-N, none IBG-D.
Observation: Nearly all pointers (>95%) designate successors 16 or fewer entries away.
  Intuition: There will be very little IBG-D traffic.
59
Operand Networks
Opportunity 1: Exploit Directional Bias
Forward pointers point… forward
  Optimize likely forward paths (e.g. next bank group)
  Eliminate unlikely forward paths (e.g. backward links)
First Steps: Quantify directional bias & sensitivity
  Verify: IB > IBG-N > IBG-D
  Evaluate delay on IBG-D
[Diagram: Mesh -- no directional preference; Ring -- prefers clockwise (e.g., Multiscalar).]
60
Operand Networks
There is more to a core than window and ALU:
  Register File(s)
  Frontend(s)
  Cache(s)
Skewed demand => skewed network
TRIPS [Gratz2007]: heterogeneous nodes, homogeneous topology
[Diagram: operand network with RF, FE, and L1-D nodes.]
61
Operand Networks
Opportunity 2: Exploit Heterogeneity
[Diagram: operand network with heterogeneous endpoints -- L1-D, FE, ARF.]
Pointers are not the only means of generating traffic
Assumption: wires are cheap, routers are not
First Steps: Profile heterogeneous node traffic
  Does the RF need to communicate with the L1-D?
  Can the frontend indirect through the ARF node?
  Can DQ-to-cache traffic indirect through another DQ group?
62
Operand Networks: Related Work
(Skip: Move on to Control Independence)
TRIPS [Gratz07]
  2D mesh, distributed single thread, heterogeneous nodes
RAW [Taylor04]
  2D mesh, distributed multiple threads, mostly heterogeneous nodes
ILDP [Kim02]
  Point-to-point, homogeneous endpoints
63
Control Independence
Scaling UP only helps if resources can be gainfully used
Bad news:
  Cost of control misprediction grows with window size & pipeline depth
  Even 'good' predictors aren't that good
  Some branches are really hard (e.g. forward conditional, jmpl)
Good news:
  Some code is control independent, i.e. need not be squashed on a mispredict
64
Control Independence
Sources of CI:
  Coarse-Grain (CGCI): e.g. in Multiscalar [Jacobson1997]
  Fine-Grain (FGCI): forward branches, e.g. branch hammocks [Klauser1998]
    (77% of conditional branch mispredictions)

/* Code */
doFunc1();
doFunc2();
if( x==0 ) {
  a = 33;
} else {
  a = 22;
}
/* CI Code */
65
Control Independence
There's a catch: data dependencies
  CD   – Control Dependent
  CIDI – Control & Data Independent
  CIDD – Control Independent, Data Dependent

/* CD */
if( x==0 ) {
  a = 33;
} else {
  a = 22;
}
/* CIDI */
c = 77;
/* CIDD */
c = a + 1;
66
Control Independence
Why not use a prior approach?
  Communication between backend and frontend is expensive in a distributed (scalable) core.
  We want fire-and-forget control independence.

Mechanism     Approach           Suitable?
Skipper       Defer CIDD fetch   Maybe: yields back ptrs, still squashes frontend
Ginger        Re-write reg tags  No: needs un-dispatch operation
TCI           Chckpt. & recover  No: value checkpoints, memory ordering
Tasks/Traces  CGCI               Yes
67
Control Independence: Dynamic Predication
Processors handle data dependencies well, control dependencies less well
  Convert control dependencies to data dependencies (dynamic predication) [e.g., Al-Zawawi07, Cher01, Hilton07]
Primitive: select micro-op
  Inputs: two possible versions of register Ri
  Output: the CD value of Ri for subsequent CIDD consumers
Primitive: skip micro-op
  Inputs: Immediate, N
  Ignored by execute logic
  Commit logic will ignore N instructions following a skip
68
Control Independence
if( x==0 ) {
  a = 33;
} else {
  /* CD */
  a = 22;
}
/* CIDD */
c = a + 1;

[Diagram: the hammock as dispatched micro-ops -- brz R24, +8; mov 22 -> R11; the reconvergent br +4 converted to a skip; mov 33 -> R12; select R1 <- (R11, R12); and the CIDD consumer add R1, 1 -> R3.]
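A minimal sketch of the same transformation (register names and encodings are assumptions for illustration, not the slide's exact micro-ops):

```python
# Dynamically predicated hammock "if (x==0) a=33; else a=22; c=a+1": both
# paths stay in the window, each writes its own renamed copy of 'a', the
# reconvergent branch becomes a skip, and a select micro-op turns the control
# dependence into a data dependence for the CIDD add.
predicated_uops = [
    ("brz",    "x", "+taken"),                   # the hard-to-predict hammock branch
    ("mov",    "a_else", 22),                    # fall-through path: renamed copy of 'a'
    ("skip",   1),                               # was the reconvergent 'br'; commit may
                                                 #   ignore the next 1 instruction
    ("mov",    "a_then", 33),                    # taken path: another renamed copy of 'a'
    ("select", "a", "a_else", "a_then", "brz"),  # data-depends on the branch outcome
    ("add",    "c", "a", 1),                     # CIDD code: never squashed by this branch
]

def resolve_select(branch_taken: bool, a_else: int, a_then: int) -> int:
    """select micro-op semantics: pick the control-dependent version of 'a'."""
    return a_then if branch_taken else a_else
```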
69
Control Independence: Implementing Dynamic Predication
Requirements -> Proposed Solutions:
  Identify reconvergent control flow -> prediction mechanisms are known [Klauser98, Cher01, Al-Zawawi07]
  Establish branch-to-select dependence -> branch Dest ptr becomes the select's S1 ptr
  Cancel false-path stores -> convert the reconvergent branch to a skip micro-op; repurpose the initiating branch's Dest. Val. field
  Ability to rename R1 to R11, R12 -> RCT Stacks
70
“Choose your own adventure”
prelim
Move on to…
Forward Slice Replay
ARF Write Elision
Proposed Work Summary
71
Forward Slice Replay
FF register dependencies are persistent
  Can replay them if a value "changes"
Reasons to replay:
  NoSQ misprediction
  Memory race
Caveats:
  Commit race
  Control dependence

Dataflow Queue:
  #  Op    Op1  Op2  Dest
  1  ld    R4   4    R1
  2  add   R1   R3   R3
  3  sub   R4   16   R4
  4  st    R3   R8   --
  5  breq  R4   R5   --
  6  ld    R4   4    R1
  7  add   R1   R3   R3
72
ARF Write Elision
ARF writes are expensive:
  The ARF may not be 'nearby' in a distributed design
  ARF ports are expensive
Write Elision: with atomic commit, only the last value of each register needs to be written
Caveats:
  Must be correct: the value has to come from somewhere
  May hurt performance: reading from the ARF is convenient

Dataflow Queue:
  #  Op    Op1  Op2  Dest
  1  ld    R4   4    R1
  2  add   R1   R3   R3
  3  sub   R4   16   R4
  4  st    R3   R8   --
  5  breq  R4   R5   --
  6  ld    R4   4    R1
  7  add   R1   R3   R3
73
Contributions
Forwardflow Cores
  Scalable OoO
Proposed: Dynamic Scaling in CMPs
  Resource sharing mechanisms and policies
Proposed: Operand Network Hierarchies
  Exploitable communication
Proposed: Control Independence
  Fire-and-forget, for scalable cores
74
Proposed Schedule
Date          Goal
Jun-Aug 2009  Google
Aug-Dec 2009  Dynamic Scaling
Dec-Jun 2010  Operand Networks
Jul-Dec 2010  Control Independence
Jan-Apr 2011  Write Thesis / Interview
May 2011      Defend
75
Thank you, committee, slide reviewers and “practice committee” for
constructive criticism.
76
Backup Slides
77
Successor CDF
78
Why BIPS³/W?
Hartstein & Puzak's work on optimum power/performance pipeline depth
  BIPS³/W is equivalent to ED² in uniprocessors
  For BIPS/W and BIPS²/W, the optimum depth = 0
Why not BIPS⁴/W?
  BIPS^n/W is possible and plausible
  ED, ED² are more intuitive, esp. for CMPs
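A one-line check of that equivalence (standard algebra, not from the slides), for a fixed instruction count I, delay D, and energy E:

```latex
\mathrm{BIPS} = \frac{I}{D}, \qquad W = \frac{E}{D}
\quad\Longrightarrow\quad
\frac{\mathrm{BIPS}^3}{W} = \frac{(I/D)^3}{E/D} = \frac{I^3}{E\,D^2} \;\propto\; \frac{1}{ED^2}
```

So, with I held fixed, maximizing BIPS³/W is the same as minimizing ED².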
79
FF v. Multiscalar
Forwardflow
  Fine-grained HW identification of dependencies
  Need not use a compiler (could benefit from compiler support)
  Scales one structure up or down (discretely) to suit the workload … for one core
Multiscalar
  Compiler-assisted task assignment
  Distributes the workload among multiple processors … for many cores
80
Is It Correct?
Impossible to tell
  Experiments do not prove, they support or refute
What support has been observed of the hypothesis "This is correct"?
  Reasonable agreement with published observations (e.g. consumer fanouts)
  Few timing-first functional violations
  Predictable uBenchmark behavior
    Linked list: no parallelism
    Streaming: much parallelism
81
CoreFusion
On the right track
  Merges multiple discrete elements in multiple discrete cores into larger components
  Troublesome for N > 2
[Diagram: two cores' BPRED, I$, Decode, Sched., and PRF being fused.]
82
Control Independence: RCT Stacks
[Animated diagram: dispatching the predicated hammock (brz, mov, br converted to skip, mov, select) while managing a stack of RCTs with REF/WR fields -- clone RCT0 to RCT1 when the hammock and its reconvergence point are predicted, clone RCT0 to RCT2 when the branch is converted to a skip, and merge RCT1/RCT2 back into RCT0 (emitting select uOps) when reconvergence at PC+16 is reached.]
83
Control Independence: RCT Stacks
+ No changes to pointer structure at branch-resolution time (unlike Skipper/Ginger/TCI)
- Changes META X-port from R/O to R/W
- RCT 'Diff' operation
- Fetches both paths
84
“Vanilla” CMOS
[Diagram: planar CMOS transistor cross-section (N+ source/drain in a P- body).]
85
Double-Gate, Tri-Gate, Multigate
Back To Talk
86
ITRS-HP vs. ITRS-LSP Device
[Diagram: device cross-section (N+ source/drain, P- body) annotated with LSP vs. HP differences:]
LSP: ~2x thicker gate oxides
LSP: ~2x longer gates
LSP: ~4x Vth
Back To Talk
87
Back To Talk
OoO Scaling
Decode Width = 2: [diagram of a two-way, fully bypassed datapath -- each op's dest is compared against every other op's src1/src2].
Decode Width = 4: [diagram with four ops; "four-way fully bypassed is beyond my PowerPoint skill"].
Number of comparators ~ O(N²); bypassing complexity ~ O(N²).
88
OoO Scaling
ROB Complexity: O(N), O(I^~3/2)
PRF Complexity: O(ROB), O(I^~3/2)
Scheduler Complexity:
  CAM: O(N*log(N)) (the size of a register tag grows with log N)
  Matrix: O(N²) (in fairness, the constant in front is small)
Back To Talk
89
‘CFP’ Scaling
Issue/Bypass Logic: See OoO, O(N²)
PRF Complexity: ?
Size of deferred queue: O(N)
Back To Talk
90
Flavors of Off
                     Dynamic Power  Static Power  Response Lag Time
Active (Not Off)     U%             100%          0 cycles
Freq. Scaled         F%             100%          ~0 cycles
Clock-Gated          1-5%           100%          ~0 cycles
Drowsy (Vdd Scaled)  1-5%           40%           1-2 cycles
Vdd-Gated            <1%            <1%           100s of cycles
91
DVS in sub-65nm Technology
• Leakage is much more linear in 45nm/32nm
• Devices have smaller operating ranges (Vdd = 0.9V)
[Chart: leakage power vs. Vdd.]
Back To Talk
92
Forwardflow – Resolving Branches
On Branch Prediction:
  Checkpoint the RCT
  Checkpoint pointer valid bits
Checkpoint Restore:
  Restores the RCT
  Invalidates bad pointers

Dataflow Queue:
  #  Op    Op1  Op2  Dest
  1  ld    R4   4    R1
  2  add   R1   R3   R3
  3  sub   R4   16   R4
  4  st    R3   R8   --
  5  breq  R4   R5   --
  6  ld    R4   4    R1
  7  add   R1   R3   R3
93
A Day in the Life of a Forwardflow Instruction: Decode
Decoding "add R1, R3 -> R3" (DQ entry 8).
[Diagram: Register Consumer History lookup -- R1's most recent version is R1@7-D; R3 = 0 is already available; the RCH entries are updated to R1 -> 8-S1 and R3 -> 8-D.]
94
A Day in the Life of a Forwardflow Instruction: Dispatch
The add (operands R1@7-D and R3 = 0) is written into DQ entry 8:
  #  Op   Op1  Op2  Dest
  7  ld   R4   4    R1
  8  add  R1   0    R3
(Some fields are implicit -- not actually written.)
95
A Day in the Life of a Forwardflow Instruction: Wakeup
DQ entry 7's result is 0!
DestVal.Write(7, 0) stores the value; DestPtr.Read(7) returns 8-S1 (R1's next use), so the update hardware carries (next = 8-S1, value = 0) forward.
[Dataflow Queue: 7 ld R4, 4 -> R1; 8 add R1, 0 -> R3; 9 sub R4, 16 -> R4; 10 st R3, R8.]
96
A Day in the Life of a Forwardflow Instruction: Issue (…and Execute)
The walk reaches 8-S1: S1Val.Write(8, 0) delivers the value, while Meta.Read(8), S2Val.Read(8), and S1Ptr.Read(8) supply the opcode, the other operand, and the next pointer.
Both operands are ready, so entry 8 issues: add 0 + 0 -> DQ8.
[Dataflow Queue as on the previous slide, entries 7-10.]
97
A Day in the Life of a Forwardflow Instruction: Writeback
DestVal.Write(8, 0) records the add's result (R3 = 0); DestPtr.Read(8) returns 10-S1, so the update hardware carries (next = 10-S1, value = 0) on to the st in entry 10.
[Dataflow Queue as before, entries 7-10.]
98
A Day in the Life of a Forwardflow Instruction: Commit
Commit logic performs Meta.Read(8) and DestVal.Read(8), then writes the architected state: ARF.Write(R3, 0).
[Dataflow Queue as before; entry 8 (add) commits R3 = 0.]
99
DQ Q&A
[Diagram: Register Consumer History entries for R1-R4, each with REF/WR pointers into the DQ.]
Dataflow Queue:
  #  Op    Op1  Op2  Dest
  1  ld    R4   4    R1
  2  add   R1   R3   R3
  3  sub   R4   16   R4
  4  st    R3   R8   --
  5  breq  R4   R5   --
  6  ld    R4   4    R1
  7  add   R1   R3   R3
  8  sub   R4   16   R4
  9  st    R3   R8   --
100
Forwardflow – Wakeup
DQ entry 1's result is 7!
DestVal.Write(1, 7) stores the value; DestPtr.Read(1) returns 2-S1 (R1's next use), so the update hardware carries (next = 2-S1, value = 7) forward.
[Dataflow Queue: 1 ld R4, 4 -> R1; 2 add R1, R3 -> R3; 3 sub R4, 16 -> R4; 4 st R3, R8; 5 breq R4, …]
101
Forwardflow – Selection
The walk reaches 2-S1: S1Val.Write(2, 7) delivers the value, while Meta.Read(2), S2Val.Read(2), and S1Ptr.Read(2) are read alongside.
Both operands of entry 2 are now ready -> DQ2 issues.
[Dataflow Queue as above, entries 1-5.]
102
Forwardflow – Building Pointer Chains: Decode
Decode must determine, for each operand, where the operand's value will originate
  Vanilla-OOO: register renaming
  Forwardflow-OOO: Register Consumer History
The RCH records the last instruction to reference a particular architectural register
  RAM-based table, analogous to a renamer
103
Decode Example
Register Consumer History: R1 -> 7-D, R4 -> 7-S1 (R2, R3: empty).
Dataflow Queue:
  #  Op   Op1  Op2  Dest
  5  ld   R4   4    R4
  6  add  R4   R1   R4
  7  ld   R4   16   R1
104
Decode Example
Decoding "8: add R1, R3 -> R3":
  R1's value will come from DQ entry 7 (R1@7-D); R3 = 0 is already available.
  The RCH is updated: R1 -> 8-S1, R3 -> 8-D.
105
Forwardflow –Dispatch
Dispatch into DQ:
  Writes metadata and available operands
  Appends the instruction to forward pointer chains
(The add -> R3, with operands R1@7-D and R3 = 0, lands in entry 8.)
Dataflow Queue:
  #  Op   Op1  Op2  Dest
  5  ld   R4   4    R4
  6  add  R4   R1   R4
  7  ld   R4   16   R1
  8  add  R1   0    R3
106