
Leveraging Dynamically Scalable Cores in CMPs
Thesis Proposal
5 May 2009
Dan Gibson
1
Scalability

defn. SCALABLE
Pronunciation: \'skā-lə-bəl\
Function: adjective
1. capable of being scaled: expanded/upgraded OR reduced in size
2. capable of being easily expanded or upgraded on demand
   <a scalable computer network>
[Merriam-Webster2009]
2
Executive Summary (1/2)

- CMPs target a wide userbase
- Future CMPs should deliver:
  - TLP, when many threads are available
  - ILP, when few threads are available
  - Reasonable power envelopes, all the time
- Future Scalable CMPs should be able to:
  - Scale UP for performance
  - Scale DOWN for energy conservation
3
Executive Summary (2/2)

Scalable CMP = Cores that can scale (mechanism) + Policies for scaling

- Forwardflow: Scalable Uniprocessors
- Proposed: Dynamically Scalable Forwardflow Cores
- Proposed: Hierarchical Operand Networks for Scalable Cores
- Proposed: Control Independence for Scalable Cores
4
Outline

- CMP 2009 → 2019: Trends toward Scalable Chips
- Work So Far: Forwardflow
- Proposed Work
  - Methodology
  - Dynamic Scaling
  - Operand Networks
  - Control Independence
  - Miscellanea (Maybe)
- Schedule/Closing Remarks
5
CMP 2009 → 2019: Moore's Law Endures

- For Rock, 65nm density [JSSC2009]
- Rock16, 16nm [ITRS2007]

[Figure: Intel Moore's Law graphic. Caption: "In 1965, Gordon Moore sketched out his prediction of the pace of silicon technology. Decades later, Moore's Law remains true, driven largely by Intel's unparalleled silicon expertise." Copyright © 2005 Intel Corporation.]

More Transistors => More Threads? (~512)
6
CMP 2009 → 2019: Amdahl's Law Endures

"Everyone knows Amdahl's law, but quickly forgets it."
  - Thomas Puzak, IBM, 2007

Parallel Speedup = [ (1 - f) + f/N ]^(-1)

- Parallel Speedup Limited by Parallel Fraction f
- i.e. Only ~10x speedup at N=512, f=90%
- No TLP = No Speedup

[Chart: speedup (1 to 100, log scale) vs. N (1 to 512 cores) for f = 99%, 95%, 90%.]
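A quick numeric check of the speedup formula above (illustrative only; the helper name is mine, not from the talk):

# Illustrative check of the Amdahl's Law numbers quoted above (not from the talk).
def amdahl_speedup(f, n):
    """Speedup = 1 / ((1 - f) + f/n) for parallel fraction f on n cores."""
    return 1.0 / ((1.0 - f) + f / n)

if __name__ == "__main__":
    for f in (0.90, 0.95, 0.99):
        print(f"f={f:.2f}: speedup at N=512 is {amdahl_speedup(f, 512):.1f}x")
    # f=0.90 prints ~9.8x, the "only ~10x speedup at N=512, f=90%" point on the slide.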
7
CMP 2009 → 2019: SAF [Chakraborty2008]

Simultaneously Active Fraction (SAF): Fraction of devices that can be active at the same time, while still remaining within the chip's power budget.

[Chart: Dynamic SAF (0 to 1) vs. technology node (90nm, 65nm, 45nm, 32nm) for HP Devices and LP Devices.]

More Transistors → Lots of them have to be off [Hill08]

Flavors of "Off"
8
CMP 2009 → 2019: Leakage: Tumultuous Times for CMOS

[Charts: Leakage Power (mW) by Circuit Variant [ITRS2007]; Normalized Dynamic & Leakage Power for a 1MB Cache [HP2008, ITRS2007].]

- Leakage Starts to Dominate
- SOI & DG Technology Helps (ca. 2010/2013)
- Tradeoffs Possible:
  - Low-Leak Devices (slower access time)
  - DG Devices
  - LSP Devices
9
CMP 2009 → 2019: Trends Summary

Trend           Implication
Moore's Law     Abundant Transistors
Falling SAF     Cannot operate all transistors all the time
Amdahl's Law    Serial Bottlenecks Matter
Rising Leakage  Cost of "doing nothing" increases

These trends are in conflict: adding cores is not enough.
10
CMP 2019: A Scalable Chip

- Scale UP for Performance
  - Sometimes big cores, sometimes many cores
  - When possible, use more resources for more performance
- Scale DOWN for Energy Conservation
  - Exploit TLP with many small cores
  - Manage SAF by shutting down portions of cores
11
CMP 2019: Scaling (Assume SAF=0.5)

[Figure: a 16-core tiled CMP with per-core caches, shown two ways.
 Left: Homogeneous Scale Down: shut off cores.
 Right: Heterogeneous Scale Down: scale one core up for single-thread performance.]
12
CMP 2019: Scalable Cores

- Requirements:
  - Add Resources → More Performance
  - Remove Resources → Less Power
  - Small/Fixed Per-Thread Overhead (e.g. Thread State)
  - Scaling UP should be useful for a variety of workloads
    (i.e. Aggressive single-thread performance)
13
Scalable Cores: Open Questions

- Scaling UP
  - Where do added resources come from?
    - From statically-provisioned private pools?
    - From other cores?
- Scaling DOWN
  - What happens to unused resources?
  - Turn off? How "off" is off?
- When to Scale?
  - In HW/SW?
  - How to communicate?
14
Outline

- CMP 2009 → 2019: Trends toward Scalable Chips
- Work So Far: Forwardflow
- Proposed Work
  - Methodology
  - Dynamic Scaling
  - Operand Networks
  - Control Independence
  - Miscellanea (Maybe)
- Schedule/Closing Remarks
15
Forwardflow: A Scalable Core

- Conventional OoO does not scale well:
  - Structures must be scaled together (ROB, PRF, RAT, LSQ, …)
  - Wire delay is not friendly to scaling
- Forwardflow has ONE logical backend structure
  - Scale one structure, scale it well
  - Tolerates (some) wire delay
16
Forwardflow Overview

- Design Philosophy:
  - Avoid 'broadcast' accesses (e.g., no CAMs)
  - Avoid 'search' operations (via pointers)
  - Prefer short wires, tolerate long wires
  - Decouple frontend from backend details
    - Abstract backend as a pipeline
17
Forwardflow – Scalable Core Design

- Use Pointers to Explicitly Define Data Movement
  - Every Operand has a Next Use Pointer
  - Pointers specify where data moves (in log(N) space)
  - Pointers are agnostic of:
    - Implementation
    - Structure sizes
    - Distance

[Figure: example sequence ld R4, 4 → R1; add R1, R3 → R3; sub R4, 16 → R4; st R3, R8; breq R4, … with next-use pointers linking each operand to its successor. No search operation.]
18
Forwardflow – Dataflow Queue

- Table of in-flight instructions
- Combination Scheduler, ROB, and PRF
  - Manages OOO Dependencies
  - Performs Scheduling
  - Holds Data Values for All Operands
- Each operand maintains a next-use pointer (hence the log(N))
- Implemented as Banked RAMs → Scalable

Bird's Eye View of FF | Detailed View of FF

Dataflow Queue
  #  Op    Op1  Op2  Dest
  1  ld    R4   4    R1
  2  add   R1   R3   R3
  3  sub   R4   16   R4
  4  st    R3   R8
  5  breq  R4   R5
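A minimal software sketch of the DQ idea under simplified assumptions (one value and one next-use pointer per operand slot); field and class names are mine, not the hardware's:

# Minimal sketch of a Dataflow Queue entry: each operand slot holds an optional
# value plus a next-use pointer (entry index, slot name), forming a singly linked
# chain of successors. Names are illustrative, not taken from the design.
from dataclasses import dataclass, field
from typing import Optional, Tuple

Ptr = Optional[Tuple[int, str]]  # (DQ entry index, slot: "s1" | "s2" | "dest")

@dataclass
class Slot:
    value: Optional[int] = None   # operand/result value, once known
    next_use: Ptr = None          # pointer to the next reader of this value

@dataclass
class DQEntry:
    op: str
    s1: Slot = field(default_factory=Slot)
    s2: Slot = field(default_factory=Slot)
    dest: Slot = field(default_factory=Slot)

# The 5-instruction example above, with pointer chains for R1, R3, R4:
dq = [
    DQEntry("ld"),    # 0: ld  R4, 4  -> R1
    DQEntry("add"),   # 1: add R1, R3 -> R3
    DQEntry("sub"),   # 2: sub R4, 16 -> R4
    DQEntry("st"),    # 3: st  R3, R8
    DQEntry("breq"),  # 4: breq R4, R5
]
dq[0].dest.next_use = (1, "s1")   # R1: the ld's result feeds the add
dq[1].dest.next_use = (3, "s1")   # R3: the add's result feeds the st
dq[2].dest.next_use = (4, "s1")   # R4: the sub's result feeds the breq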
19
Forwardflow – DQ +/-'s

+ Explicit, Persistent Dependencies
+ No searching of any kind
- Multi-cycle Wakeup per value*

(Dataflow Queue example as on the previous slide.)

*Average Number of Successors is Small [Ramirez04, Sassone07]
20
DQ: Banks, Groups, and ALUs

[Figure: Logical Organization vs. Physical Organization. The DQ Bank Group is the Fundamental Unit of Scaling.]
21
Forwardflow Single-Thread Performance

[Chart: Normalized Performance (0 to 4) for OoO-128, FF-128, OoO-512, FF-512, and BigSched.]

- FF-128: 9% Lower Performance, 20% Lower Power
- FF-512: 49% Higher Performance, 5.7% Higher Power

More FF Details | Move On
22
Forwardflow: Pipeline Tour

[Figure: pipeline stages PRED/FETCH, DECODE, DISPATCH, EXECUTE, COMMIT, with structures I$, RCT, DQ, D$, and ARF. Scalable, Decoupled Backend.]

- RCT: Identifies Successors
- ARF: Provides Architected Values
- DQ: Chases Pointers
23
RCT: Summarizing Pointers

- Want to dispatch: breq R4 R5
- Need to know:
  - Where to get R4? → Result of DQ Entry 3
  - Where to get R5? → From the ARF
- Register Consumer Table summarizes where the most-recent version of each register can be found

Dataflow Queue
  #  Op   Op1  Op2  Dest
  1  ld   R4   4    R1
  2  add  R1   R3   R3
  3  sub  R4   16   R4
  4  st   R3   R8
24
RCT: Summarizing Pointers

[Figure: the Register Consumer Table (RCT), with REF and WR entries per architectural register (e.g. R1: 2-S1 / 1-D; R4: 5-S1 / 3-D), alongside the Dataflow Queue, now including entry 5: breq R4, R5. Callouts: R4 comes from DQ Entry 3 (3-D); R5 comes from the ARF.]
25
Wakeup/Issue: Walking Pointers

(Dataflow Queue example as before.)

- Follow Dest Ptr When New Result Produced
  - Continue following pointers to subsequent successors
  - At each successor, read 'other' value & try to issue
- NULL Ptr → Last Successor
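A sketch of the pointer-chasing wakeup just described; the dictionary-based representation and function names are mine (illustrative only, not RTL):

# Sketch of pointer-chasing wakeup: when an entry's result is produced, walk its
# destination pointer chain, depositing the value at each successor slot and
# issuing any successor whose other operand is already present.
# Representation: dq[i] = {"op": str, "s1": [value, next_ptr], "s2": [...], "dest": [...]}
# where next_ptr is (entry_index, slot_name) or None.

def wakeup(dq, producer_idx, result, issue):
    dq[producer_idx]["dest"][0] = result
    ptr = dq[producer_idx]["dest"][1]          # follow the Dest pointer first
    while ptr is not None:                     # NULL pointer => last successor
        idx, slot = ptr
        entry = dq[idx]
        entry[slot][0] = result                # deliver the value
        other = "s2" if slot == "s1" else "s1"
        if entry[other][0] is not None:        # 'other' operand ready?
            issue(idx)                         # try to issue this successor
        ptr = entry[slot][1]                   # continue down the chain

# Example: the ld (entry 0) produces R1 = 7; its chain reaches the add's S1 slot.
dq = [
    {"op": "ld",  "s1": [None, None], "s2": [4, None], "dest": [None, (1, "s1")]},
    {"op": "add", "s1": [None, None], "s2": [0, None], "dest": [None, None]},
]
wakeup(dq, 0, 7, issue=lambda i: print(f"issue DQ entry {i}"))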
26
DQ: Fields and Banks

- Independent Fields → Independent RAMs
  - i.e. accessed independently, independent ports, etc.
- Multi-Issue ≠ Multi-Port → Multi-Bank
  - Dispatch, Commit access contiguous DQ regions
    - Bank on low-order bits for dispatch/commit BW
  - Multi-Issue: Port Contention + Wire Delay = More Banks
  - Dispatch, Commit Share a Port
    - Bank on a high-order bit to reduce contention
27
DQ: Banks, Groups, and ALUs

[Figure: Logical Organization vs. Physical Organization. The DQ Bank Group is the Fundamental Unit of Scaling.]
28
Evaluation Methodology

- Full-System Trace-Based Simulation
  - Simics + GEMS/Ruby Memory Simulator + Homegrown OOO/Forwardflow Simulator (SPARCv9)
  - SPEC CPU 2k6 + Wisconsin Commercials
  - WATTCH/CACTI 4.1 for Power Modeling
29
Experimental Setup

              CAM-128   FFlow-128   CAM-512   FFlow-512
Width         4/2/2     4/2/2       8/4/4     8/4/4
INT/FP Units  2/2       2/2         4/4       4/4
Br. Pred.     64Kb TAGE (13 Tagged Tables), Perfect BTB & RAS
Mem. Units    4         4           4         4
Sched. Size   32/128    128         64/512    512
Phys. Regs.   256       N/A         1024      N/A
Mem. Dis.     Ideal Dependence Predictor
L1-I          32KB 4-way, 64-byte line, 3 cycle pipelined
L1-D          32KB 4-way, 64-byte line, 2 cycle load-to-use
L2            2MB 4-way, 64-byte lines, 7 cycle
L3            8MB 8-way, 64-byte line, 75 cycle, 16 banks
Memory        8 GB, 500-cycle, 12.8 GB/s
30
Forwardflow Single-Thread Performance

[Chart: Normalized Performance (0 to 4) for OoO-128, FF-128, OoO-512, FF-512, and BigSched.]

- FF-128: 9% Lower Performance, 20% Lower Power
- FF-512: 49% Higher Performance, 5.7% Higher Power
31
Related Work

- Scalable Schedulers
  - Direct Instruction Wakeup [Ramirez04]:
    - Scheduler has a pointer to the first successor
    - Secondary table for matrix of successors
  - Hybrid Wakeup [Huang02]:
    - Scheduler has a pointer to the first successor
    - Each entry has a broadcast bit for multiple successors
  - Half Price [Kim02]:
    - Slice the scheduler in half
    - Second operand often unneeded
32
Related Work

- Dataflow & Distributed Machines
  - Tagged-Token [Arvind90]:
    - Values (tokens) flow to successors
  - TRIPS [Sankaralingam03]:
    - EDGE ISA
    - Discrete Execution Tiles: X, RF, $, etc.
  - Clustered Designs [e.g. Palacharla97]:
    - Independent execution queues
33
Outline

- CMP 2009 → 2019: Trends toward Scalable Chips
- Work So Far: Forwardflow
- Proposed Work
  - Methodology
  - Dynamic Scaling
  - Operand Networks
  - Control Independence
  - Miscellanea (Maybe)
- Schedule/Closing Remarks
34
Proposed Methods

                 Old Method         New Method
Simulator        Trace-Based        Execution-Driven
#Cores,Threads   1                  4-16, 16-32
Disambiguation   Perf. Prediction   NoSQ
Technology       70nm               32nm
Reportables      IPC, Power         Runtime, Energy-Efficiency

Methods Summary | All Parameters
35
Proposed Methods – Details 1

Component           Configuration
Mem. Cons. Mod.     Sequential Consistency
Coherence Prot.     MOESI Directory (single chip)
Store Issue Policy  Permissions Prefetch at X
Freq. Range         2.5 – 4.0 GHz
Technology          32nm
Window Size         Varied by experiment
Disambiguation      NoSQ
Branch Prediction   TAGE + 16-entry RAS + 256-entry BTB
Frontend            7 Cyc. Pred-to-dispatch
36
Proposed Methods – Details 2

Component            Configuration
L1-I Caches          32KB 4-way, 64B line, 4 cycle, 2 proc. ports
L1-D Caches          32KB 4-way, 64B line, 4 cycle LTU, 4 proc. ports, WI/WT, included by L2
L2 Caches            1MB 8-way, 64B line, 11 cycle, WB/WA, Private
L3 Cache             8MB 16-way, 64B line, 24 cycle, Shared
Main Memory          4-8GB, 2 DDR2-like controllers (64 GB/s peak BW), 450 cycle latency
Inter-proc. network  2D Mesh, 16B links
37
Proposed Methods: Memory Timing

- Caches: CACTI 5.3
  - Indep. Vars: Size, Assoc, Banks, Ports
  - Output Vars: Latency, Area, Power (next slides)
  - Sizes: Reflect contemporary designs (e.g. Nehalem)
- Memory:
  - Debate: On-Chip or Off-Chip controllers?
    - On-chip: Lower latency, less bandwidth
    - Off-chip: More controllers, higher aggregate bandwidth, higher latency (e.g. SERDES)
  - 64 GB/s = 2 'cutting edge' QPI links' worth of BW
  - 450 cycles = latency on Dual Intel Core 2 Quad
38
Proposed Methods: Area

1. Unit Area/AR Estimates
   - CACTI, WATTCH, or literature
2. Floorplanning
   - Manual and automated
3. Repeat 1 & 2 hierarchically for entire design
4. Latencies determined by floorplanned distance
   - Heuristic-guided optimistic repeater placement

Area of I/O pads, clock gen, etc. not included
39
Proposed Methods: Power

- WATTCH
  - Count events: events have fixed energy cost
  - New custom models from CACTI for all memory-like structures
  - Structures have semi-constant leakage
- Assume:
  - ITRS-HP devices for perf.-critical
  - ITRS-LSP or Drowsy for caches
  - CACTI for caches
  - Orion for networks
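A toy sketch of the event-counting power model described above; the per-event energies and leakage numbers below are placeholders, not CACTI/WATTCH outputs:

# Toy sketch of an event-count power model in the spirit of WATTCH: each structure
# has a fixed per-access energy plus roughly constant leakage power. The numbers
# are placeholders, NOT CACTI/WATTCH outputs.
PER_EVENT_ENERGY_NJ = {"dq_read": 0.05, "dq_write": 0.07, "l1d_access": 0.10}
LEAKAGE_MW = {"dq": 20.0, "l1d": 35.0}

def total_energy_nj(event_counts, runtime_s):
    dynamic = sum(PER_EVENT_ENERGY_NJ[e] * n for e, n in event_counts.items())
    leakage = sum(LEAKAGE_MW.values()) * 1e-3 * runtime_s * 1e9  # mW * s -> nJ
    return dynamic + leakage

events = {"dq_read": 4_000_000, "dq_write": 1_500_000, "l1d_access": 900_000}
print(f"{total_energy_nj(events, runtime_s=0.001):.0f} nJ")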
40
Proposed Methods: Chip Floorplans

- 2D Tiled Topology
  - Tile = Processor, L1(s), L2, L3 bank, Router
  - Link latency = F(Tile_Area)
- Assume 1P/2P parts are 'binned' 4P designs

            4P       8P       16P
Area, 32nm  48mm²    72mm²    126mm²
Area, 65nm  180mm²   270mm²   480mm²
(Assumes FF-128 Cores)

41
Outline

- CMP 2009 → 2019: Trends toward Scalable Chips
- Work So Far: Forwardflow
- Proposed Work
  - Methodology
  - Dynamic Scaling (Executive Summary)
  - Operand Networks
  - Control Independence
  - Miscellanea (Maybe)
- Schedule/Closing Remarks
42
Proposed Work Executive Summary

- Dynamic Scaling in CMPs
  - CMPs will have many cores, caches, etc., all burning power
  - SW Requirements will be varied
- Hierarchical Operand Networks
  - Scalable Cores need Scalable, Generalized Communication
  - Exploit Regular Communication
  - Adapt to Changing Conditions/Topologies (e.g., from Dynamic Scaling)
- Control Independence
  - Increase performance from scaling up
  - Trades off window space for window utilization
- Miscellanea
43
Dynamic Scaling in CMPs

- Observation 1: Power Constraints
  - Many design-time options to maintain a power envelope:
    - Many Small Cores vs. Few, Aggressive Cores
    - How much Cache per Core?
  - Fewer run-time options
    - DVS/DFS
    - Shut Cores Off, e.g. OPMS
44
Dynamic Scaling in CMPs

- Observation 2: Unpredictable Software
  - GP chips run a variety of SW
    - Highly-Threaded Servers
    - Single Threads
    - Serial Bottlenecks
  - Optimal HW for App A is non-optimal for App B
    - Nehalem for bzip2
    - Niagara for OLTP
45
Dynamic Scaling in CMPs

- Opportunity: Adapt CMP to Power & SW Demands
  - Migrate design-time options to run-time
  - Share core & cache resources dynamically to fit SW & Power requirements
  - Need Scalable Cores (e.g., Forwardflow)
  - Many open questions (purpose of research)
  - Need Suitable Intra-Core Interconnect (Coming Up)
46
Dynamic Scaling in CMPs (Assume SAF=0.5)

[Figure: a 16-core tiled CMP with per-core caches, configured two ways.]

- Threaded Server Workload:
  - Leverage TLP
  - Scale cores down
  - Scale caches down?
- Single-Thread Workload:
  - Leverage ILP
  - Scale one core up
  - Turn unused cores off
47
Dynamic Scaling: Leveraging Forwardflow

- Recall: DQ is organized into Bank Groups (with ALUs)
- Scale Up: Build a Larger DQ with More Groups
- Scale Down: Use Fewer Groups
48
Where do extra resources come from?

Option 1: Provision N private groups per core
  + Simple
  - Constrains maximum core size to N
  - May waste area

Option 2: Provision N*P shareable groups per chip
  + Core Size Less Constrained
  - Cache Sharing is Difficult
  - Complicates Sharing Policy, Interconnect
49
Dynamic Scaling: More Open Questions

- Scale in HW or SW?
  - SW has scheduling visibility
  - HW can react at fine granularity
- What about Frontends?
- What about Caches?
  - Multiple caches for one thread
50
Dynamic Scaling: "Choose your own adventure" prelim

- Preliminary Uniprocessor Scaling Results
- Related Work
- Move on to Operand Networks
51
Dynamic Scaling: Preliminary Uniprocessor Results

- Simple Scaling Heuristic, "Dynamic":
  - Hill-Climbing Heuristic
  - Periodically sample 1M instrs.:
    - BIPS³/W for Current DQ Size
    - BIPS³/W for ½ * Current DQ Size
    - BIPS³/W for 2 * Current DQ Size
  - Picks highest observed BIPS³/W
- As it turns out… Dynamic has flaws
  - But does illustrate proof of concept

Why BIPS³/W?
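A sketch of the sampling hill-climb, assuming a measure_bips3_per_watt callback that runs a ~1M-instruction sample at a given DQ size; the callback, function name, and the MIN_DQ/MAX_DQ bounds are hypothetical, not the talk's implementation:

# Sketch of the hill-climbing heuristic described above. measure_bips3_per_watt(size)
# is a hypothetical hook that samples execution at the given DQ size and returns
# BIPS^3/W; the size bounds are illustrative.
MIN_DQ, MAX_DQ = 128, 2048

def next_dq_size(current, measure_bips3_per_watt):
    candidates = {current}
    if current // 2 >= MIN_DQ:
        candidates.add(current // 2)
    if current * 2 <= MAX_DQ:
        candidates.add(current * 2)
    # Sample each candidate size, keep the most efficient one.
    return max(candidates, key=measure_bips3_per_watt)

# Usage: size = next_dq_size(size, measure) at the end of each sampling period.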
52
Dynamic Scaling: Preliminary Results: Performance

[Chart: Normalized IPC (0 to 4) for S-128, S-512, S-2048, and Dynamic on bzip2, zeusmp, and Hmean.]
53
Dynamic Scaling: Preliminary Results: Efficiency

[Chart: Normalized BIPS³/W (0 to 7; one bar reaches 15.6) for S-128, S-512, S-2048, and Dynamic on bzip2, zeusmp, and Aggregate.]
54
Dynamic Scaling: Related Work

- CoreFusion [Ipek07]
  - Fuse individual core structures into bigger cores
- Power-aware microarchitecture resource scaling [Iyer01]
  - Varies RUU & Width
- Positional Adaptation [Huang03]
  - Adaptively Applies Low-Power Techniques:
    - Instruction Filtering, Sequential Cache, Reduced ALUs
55
Outline

- CMP 2009 → 2019: Trends toward Scalable Chips
- Work So Far: Forwardflow
- Proposed Work
  - Methodology
  - Dynamic Scaling
  - Operand Networks
  - Control Independence
  - Miscellanea (Maybe)
- Schedule/Closing Remarks
56
Operand Networks

- Scalable Cores need Generalized Intra-Core Interconnect
  - All-to-All Bypassing & Forwarding is O(N²)
  - Intra-core interconnect fits nicely with pointer abstraction of dependence
- Intra-core communication follows exploitable patterns
  - Locality: Hierarchical interconnects
  - Directionality: e.g. Rings instead of meshes
57
Operand Networks

- Traffic Classification by Destination:
  - Same Bank Group: Inter-Bank (IB)
  - Next Bank Group: Inter-Bank-Group Neighbor (IBG-Neighbor)
  - Non-Next Bank Group: Inter-Bank-Group Distant (IBG-Distant)
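A sketch of classifying an operand message by the producer's and consumer's bank-group IDs, following the three classes above; the function name and the entry-to-group mapping are assumptions of mine:

# Sketch of the traffic classification above: compare producer and consumer
# bank-group IDs. The mapping from DQ entry to bank group is assumed to be
# (entry_index // entries_per_group) % num_groups.
def classify(producer_entry, consumer_entry, entries_per_group, num_groups):
    src = (producer_entry // entries_per_group) % num_groups
    dst = (consumer_entry // entries_per_group) % num_groups
    if src == dst:
        return "IB"              # same bank group
    if dst == (src + 1) % num_groups:
        return "IBG-Neighbor"    # next bank group
    return "IBG-Distant"         # any other bank group

print(classify(3, 5, entries_per_group=8, num_groups=4))    # IB
print(classify(7, 9, entries_per_group=8, num_groups=4))    # IBG-Neighbor
print(classify(2, 20, entries_per_group=8, num_groups=4))   # IBG-Distant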
58
Operand Networks

[Chart: CDF of pointer span for astar, sjeng, and jbb, with markers at SPAN=5 and SPAN=16.]

- Observation: ~85% of pointers designate near successors
  - Intuition: Most of these pointers yield IB traffic, some IBG-N, none IBG-D.
- Observation: Nearly all pointers (>95%) designate successors 16 or fewer entries away
  - Intuition: There will be very little IBG-D traffic.
59
Operand Networks

- Opportunity 1: Exploit Directional Bias
  - Forward pointers point… forward
  - Optimize likely forward paths (e.g. next bank group)
  - Eliminate unlikely forward paths (e.g. backward links)
- First Steps: Quantify Directional Bias & Sensitivity
  - Verify: IB > IBG-N > IBG-D
  - Evaluate Delay on IBG-D

[Figure: Mesh: No Directional Preference vs. Ring: Prefers Clockwise (e.g., Multiscalar).]
60
Operand Networks

- There is more to a core than window and ALU
  - Register File(s)
  - Frontend(s)
  - Cache(s)
- Skewed demand → skewed network
  - TRIPS [Gratz2007]: Heterogeneous nodes, homogeneous topology

[Figure: operand network with RF, FE, and L1-D nodes attached.]
61
Operand Networks

- Opportunity 2: Exploit Heterogeneity
  - Pointers are not the only means of generating traffic
  - Assumption: Wires are cheap, routers are not
- First Steps: Profile heterogeneous node traffic
  - Does RF need to communicate with L1-D?
  - Can frontend indirect through ARF node?
  - Can DQ-to-Cache traffic indirect through another DQ group?

[Figure: network nodes L1-D, L1-D, FE, ARF.]
62
Operand Networks: Related Work

(Skip: Move on to Control Independence)

- TRIPS [Gratz07]
  - 2D Mesh, distributed single thread, heterogeneous nodes
- RAW [Taylor04]
  - 2D Mesh, distributed multiple threads, mostly heterogeneous nodes
- ILDP [Kim02]
  - Point-to-point, homogeneous endpoints
63
Control Independence

- Scaling UP only helps if resources can be gainfully used
- Bad news:
  - Cost of control misprediction grows with window size & pipeline depth
  - Even 'good' predictors aren't that good
  - Some branches are really hard (e.g. forward conditional, jmpl)
- Good news:
  - Some code is control independent, i.e. need not be squashed on a mispredict
64
Control Independence

- Sources of CI:
  - Coarse-Grain (CGCI): e.g. in Multiscalar [Jacobson1997]
  - Fine-Grain (FGCI): Forward Branches, e.g. Branch Hammocks [Klauser1998]
    - 77% of conditional branch mispredictions

  doFunc1();
  /* Code */
  doFunc2();

  if( x==0 ) {
    a = 33;
  } else {
    a = 22;
  }
  /* CI Code */
65
Control Independence

- There's a catch: Data Dependencies
  - CD – Control Dependent
  - CIDI – Control & Data Indep.
  - CIDD – Control Indep., Data Dependent

  if( x==0 ) {
    a = 33;
  } else {       /* CD */
    a = 22;
  }
  c = 77;        /* CIDI */
  c = a + 1;     /* CIDD */
66
Control Independence

- Why not use a prior approach?
  - Communication between backend/frontend is expensive in a distributed (scalable) core.
  - We want fire-and-forget control independence

              Mechanism          Suitable?
Skipper       Defer CIDD Fetch   Maybe: Yields back ptrs, still squashes frontend
Ginger        Re-Write Reg Tags  No: Needs un-dispatch operation
TCI           Chckpt. & Recover  No: Value checkpoints, memory ordering
Tasks/Traces  CGCI               Yes
67
Control Independence: Dynamic Predication

- Processors handle data dependencies well, control dependencies less well
  - Convert control dependencies to data dependencies (dynamic predication) [e.g., Al-Zawawi07, Cher01, Hilton07]
- Primitive: select micro-op
  - Inputs: Two possible versions of Register Ri
  - Output: CD value of Ri for subsequent CIDD consumers
- Primitive: skip micro-op
  - Inputs: Immediate, N
  - Ignored by execute logic
  - Commit logic will ignore N instructions following a skip
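A behavioral sketch of the two primitives as defined above; this is a software stand-in with my own function names and simplified encodings, not the hardware design:

# Behavioral sketch of the two micro-op primitives (simplified; not a hardware spec).

def select(taken, value_if_taken, value_if_not_taken):
    """select: picks the control-dependent version of a register once the
    hammock branch resolves, so CIDD consumers see one unambiguous value."""
    return value_if_taken if taken else value_if_not_taken

def commit_with_skip(instructions):
    """skip(N): ignored by execute, but commit discards the N instructions that
    follow it, per the definition above. Each instruction is (kind, payload)."""
    committed, i = [], 0
    while i < len(instructions):
        kind, payload = instructions[i]
        if kind == "skip":
            i += 1 + payload          # drop the N instructions that follow
            continue
        committed.append((kind, payload))
        i += 1
    return committed

print(commit_with_skip([("add", 1), ("skip", 2), ("mov", 22), ("mov", 33), ("st", 9)]))
# -> [('add', 1), ('st', 9)]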
68
Control Independence

  if( x==0 ) {
    a = 33;
  } else {       /* CD */
    a = 22;
  }
  /* CIDD */
  c = a + 1;

[Figure: the hammock dynamically predicated as micro-ops:
   brz R24, +8;  mov R11 <- 22;  br +4 (converted to skip 0);  mov R12 <- 33;
   select R1 <- R11, R12;  add R3 <- R1, 1   (the CIDD c = a + 1)]
69
Control Independence: Implementing Dynamic Predication

Requirements → Proposed Solutions:
- Identify reconvergent control flow → Prediction mechanisms known [Klauser98, Cher01, Al-Zawawi07]
- Establish branch-to-select dependence → Branch Dest Ptr, select S1 Ptr
- Cancel false-path stores → Convert reconvergent branch to skip micro-op, repurpose initiating branch's Dest. Val. field
- Ability to rename R1 to R11, R12 → RCT Stacks
70
"Choose your own adventure" prelim

Move on to…
- Forward Slice Replay
- ARF Write Elision
- Proposed Work Summary
71
Forward Slice Replay

- FF Register Dependencies are Persistent
  - Can replay them if a value "changes"
- Reasons to Replay:
  - NoSQ Misprediction
  - Memory Race
- Caveats:
  - Commit Race
  - Control Dependence

Dataflow Queue
  #  Op    Op1  Op2  Dest
  1  ld    R4   4    R1
  2  add   R1   R3   R3
  3  sub   R4   16   R4
  4  st    R3   R8
  5  breq  R4   R5
  6  ld    R4   4    R1
  7  add   R1   R3   R3
72
ARF Write Elision

- ARF Writes Are Expensive
  - ARF may not be 'nearby' in a distributed design
  - ARF Ports are expensive
- Write Elision:
  - Atomic Commit → Only need to write last value
- Caveats:
  - Must be correct: value has to come from somewhere
  - May hurt performance: Reading from ARF is convenient

(Dataflow Queue example as on the previous slide.)
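A sketch of the elision idea: within a group that commits atomically, only the last write to each architectural register needs to reach the ARF. The function name and the example values are illustrative, not from the proposal:

# Sketch of ARF write elision at commit. Input is a list of (dest_register, value)
# pairs in program order for one atomically committed group; values are made up.
def elide_arf_writes(commit_group):
    last_write = {}
    for dest, value in commit_group:
        if dest is not None:          # stores/branches have no register dest
            last_write[dest] = value  # later writes overwrite earlier ones
    return last_write                 # at most one ARF write per register

# The example DQ above writes R1, R3, R4, then R1 and R3 again:
group = [("R1", 7), ("R3", 9), ("R4", 16), (None, None), (None, None),
         ("R1", 8), ("R3", 10)]
print(elide_arf_writes(group))        # {'R1': 8, 'R3': 10, 'R4': 16}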
73
Contributions

- Forwardflow Cores
  - Scalable OoO
- Proposed: Dynamic Scaling in CMPs
  - Resource Sharing Mechanisms and Policies
- Proposed: Operand Network Hierarchies
  - Exploitable communication
- Proposed: Control Independence
  - Fire-and-forget for scalable cores
74
Proposed Schedule

Date           Goal
Jun-Aug 2009   Google
Aug-Dec 2009   Dynamic Scaling
Dec-Jun 2010   Operand Networks
Jul-Dec 2010   Control Independence
Jan-Apr 2011   Write Thesis / Interview
May 2011       Defend
75
Thank you, committee, slide reviewers and “practice committee” for
constructive criticism.
76
Backup Slides
77
Successor CDF
78
Why BIPS³/W?

- Hartstein & Puzak's work on Optimum Power/Performance Pipeline Depth
  - BIPS³/W is equivalent to ED² in uniprocessors
  - BIPS/W and BIPS²/W → Optimum Depth = 0
- Why not BIPS⁴/W?
  - BIPSⁿ/W is possible and plausible
  - ED, ED² are more intuitive, esp. for CMPs
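A numeric sanity check of the ED² equivalence claimed above: for a fixed instruction count I, BIPS³/W = (I/D)³ / (E/D) is proportional to 1/(E·D²), so ranking by BIPS³/W matches ranking by ED². The configurations and numbers below are made up:

# Numeric sanity check (made-up numbers): maximizing BIPS^3/W is the same as
# minimizing E*D^2 when the instruction count I is fixed across configurations.
I = 2.0e9                                   # instructions (fixed across configs)
configs = {"small": (1.2, 0.9), "medium": (0.9, 1.1), "big": (0.7, 1.6)}  # (D sec, E J)

def bips3_per_watt(d, e):
    bips = (I / d) / 1e9
    watts = e / d
    return bips ** 3 / watts

ranked_by_bips3w = sorted(configs, key=lambda c: bips3_per_watt(*configs[c]), reverse=True)
ranked_by_ed2 = sorted(configs, key=lambda c: configs[c][1] * configs[c][0] ** 2)
assert ranked_by_bips3w == ranked_by_ed2
print(ranked_by_bips3w)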
79
FF v. Multiscalar

Forwardflow:
- Fine-grained HW identification of dependencies
- Need not use compiler
- Could benefit from compiler support
- Scale one structure up or down (discretely) to suit workload … for one core

Multiscalar:
- Compiler-assisted task assignment
- Distribute workload among multiple processors … for many cores
80
Is It Correct?

- Impossible to tell
  - Experiments do not prove, they support or refute
- What support has been observed of the hypothesis "This is correct"?
  - Reasonable agreement with published observations (e.g. consumer fanouts)
  - Few timing-first functional violations
  - Predictable uBenchmark behavior
    - Linked list: No parallelism
    - Streaming: Much parallelism
81
CoreFusion

- On the right track
  - Merges multiple discrete elements in multiple discrete cores into larger components
  - Troublesome for N>2

[Figure: two cores' BPRED, I$, Decode, Sched., and PRF structures being fused.]
82
Control Independence: RCT Stacks

[Figure: RCT stack example for the predicated hammock (micro-ops 0-4: brz R25, +8; mov R11 <- 22; br +4, converted to skip; mov R12 <- 33; select R1 <- R11, R12), with per-register REF entries. Annotations: Predict Hammock, Predict Reconvergence at PC+8: Convert Branch to skip, Clone RCT0 → RCT1, Clone RCT0 → RCT2; Reconvergence Reached at PC+16: Merge RCT1, RCT2 → RCT0, emit select uOps.]
83
Control Independence: RCT Stacks

+ No changes to pointer structure at branch-resolution time (Unlike Skipper/Ginger/TCI)
- Changes META X-port from R/O to R/W
- RCT 'Diff' Operation
- Fetches both paths
84
"Vanilla" CMOS

[Figure: planar CMOS transistor cross-section (labels: N+, P-, N+, N).]
85
Double-Gate, Tri-Gate, Multigate
Back To Talk
86
ITRS-HP vs. ITRS-LSP Device

[Figure: transistor cross-section (labels: N+, N+, N, P-).]

- LSP: ~2x Thicker Gate Oxides
- LSP: ~2x Longer Gates
- LSP: ~4x Vth

Back To Talk
87
Back To Talk

OoO Scaling

[Figure: bypass network sketches. Decode Width = 2: two ops (dest, src1, src2), two-way fully bypassed. Decode Width = 4: four ops; "four-way fully bypassed is beyond my powerpoint skill".]

- Number of Comparators ~ O(N²)
- Bypassing Complexity ~ O(N²)
88
OoO Scaling

- ROB Complexity: O(N), O(I^~3/2)
- PRF Complexity: O(ROB), O(I^~3/2)
- Scheduler Complexity:
  - CAM: O(N·log(N)) (size of reg tag increases as log(N))
  - Matrix: O(N²) (in fairness, the constant in front is small)

Back To Talk
89
'CFP' Scaling

- Issue/Bypass Logic: See OoO, O(N²)
- PRF Complexity: ?
- Size of deferred queue: O(N)

Back To Talk
90
Flavors of Off

Flavor               Dynamic Power  Static Power  Response Lag Time
Active (Not Off)     U%             100%          0 cycles
Drowsy (Vdd Scaled)  1-5%           40%           1-2 cycles
Clock-Gated          1-5%           100%          ~0 cycles
Vdd-Gated            <1%            <1%           100s cycles
Freq. Scaled         F%             100%          ~0 cycles
91
DVS in sub-65nm Technology

[Chart: Leakage Power vs. Vdd (nominal Vdd = 0.9V).]

- Leakage is much more linear in 45nm/32nm
- Devices have smaller operating ranges

Back To Talk
92
Forwardflow – Resolving Branches

- On Branch Pred.:
  - Checkpoint RCT
  - Checkpoint Pointer Valid Bits
- Checkpoint Restore:
  - Restores RCT
  - Invalidates Bad Pointers

Dataflow Queue
  #  Op    Op1  Op2  Dest
  1  ld    R4   4    R1
  2  add   R1   R3   R3
  3  sub   R4   16   R4
  4  st    R3   R8
  5  breq  R4   R5
  6  ld    R4   4    R1
  7  add   R1   R3   R3
93
A Day in the Life of a Forwardflow Instruction: Decode

[Figure: decoding "add R1, R3 → R3" as instruction 8. The Register Consumer History shows R1 was last defined by DQ entry 7 (R1@7-D) and R3's architected value is 0 (R3=0); the RCH entries are updated to 8-S1 (R1) and 8-D (R3).]
94
A Day in the Life of a Forwardflow Instruction: Dispatch

[Figure: the add dispatches into DQ entry 8 carrying R1@7-D and R3=0.]

Dataflow Queue
  #  Op   Op1  Op2  Dest
  7  ld   R4   4    R1
  8  add  R1   0    R3
  9

(Register name fields are implicit: not actually written.)
95
A Day in the Life of a Forwardflow Instruction: Wakeup

DQ7 Result is 0!

[Figure: entry 7's result (0) is produced. DestPtr.Read(7) returns the next-use pointer 8-S1; DestVal.Write(7, 0) records the value. Update HW: next = 8-S1, value = 0. DQ entries 7-10: ld, add, sub, st.]
96
A Day in the Life of a Forwardflow Instruction: Issue (…and Execute)

[Figure: the value 0 arrives at entry 8's S1 slot (S1Val.Write(8, 0)); Meta.Read(8), S2Val.Read(8), and S1Ptr.Read(8) fetch the opcode, the other operand, and the next successor. With both operands ready, the add issues and executes: add 0 + 0 → DQ8. Update HW: next = 8-S1, value = 0.]
97
A Day in the Life of a Forwardflow Instruction: Writeback

[Figure: the add's result (R3 = 0) writes back. DestPtr.Read(8) returns the next-use pointer 10-S1; DestVal.Write(8, 0) records the value. Update HW: next = 10-S1 (8-D), value = 0.]
98
A Day in the Life of a Forwardflow Instruction: Commit

[Figure: commit logic reads the add's metadata and result (Meta.Read(8), DestVal.Read(8)) and writes the architected value: ARF.Write(R3, 0).]
99
DQ Q&A

Register Consumer History: each register's entry is overwritten by its newest reference (e.g. R1: 2-S1 → 7-S1; R3: 4-S1 → 9-S1).

Dataflow Queue
  #  Op    Op1  Op2  Dest
  1  ld    R4   4    R1
  2  add   R1   R3   R3
  3  sub   R4   16   R4
  4  st    R3   R8
  5  breq  R4   R5
  6  ld    R4   4    R1
  7  add   R1   R3   R3
  8  sub   R4   16   R4
  9  st    R3   R8
100
Forwardflow – Wakeup

DQ1 Result is 7!

[Figure: DestPtr.Read(1) returns the next-use pointer 2-S1; DestVal.Write(1, 7) records the value. Update HW: next = 2-S1, value = 7. (DQ example as before.)]
101
Forwardflow – Selection

[Figure: the value 7 arrives at entry 2's S1 slot (S1Val.Write(2, 7)); Meta.Read(2), S2Val.Read(2), and S1Ptr.Read(2) fetch the add's opcode, other operand, and next pointer. With both operands present, DQ2 issues. Update HW: next = 2-S1, value = 7.]
102
Forwardflow – Building Pointer Chains: Decode

- Decode must determine, for each operand, where the operand's value will originate
  - Vanilla-OOO: Register Renaming
  - Forwardflow-OOO: Register Consumer History
- RCH records last instruction to reference a particular architectural register
  - RAM-based table, analogous to renamer
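A sketch of the RCH bookkeeping at decode, assuming each entry records the last referencing (DQ index, field); the function name and encoding are mine, not the design's:

# Sketch of Register Consumer History bookkeeping at decode (names are mine).
# For each architectural register, the RCH records the last DQ entry/field that
# referenced it; a source operand links to that reference, and the decoding
# instruction then becomes the new "last reference".
rch = {}   # arch reg -> (dq_index, field), e.g. "R1" -> (7, "D")

def decode(dq_index, srcs, dest):
    links = {}
    for slot, reg in zip(("S1", "S2"), srcs):
        if reg is None:
            continue
        links[reg] = rch.get(reg)          # None => take value from the ARF
        rch[reg] = (dq_index, slot)        # this instruction is now the last reader
    if dest is not None:
        rch[dest] = (dq_index, "D")        # ...and the last writer of dest
    return links

# Decode "add R1, R3 -> R3" as DQ entry 8, with R1 last written by entry 7:
rch["R1"] = (7, "D")
print(decode(8, srcs=("R1", "R3"), dest="R3"))   # {'R1': (7, 'D'), 'R3': None}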
103
Decode Example

Register Consumer History: R1 → 7-D, R2 → (none), R3 → (none), R4 → 7-S1

Dataflow Queue
  #  Op   Op1  Op2  Dest
  5  ld   R4   4    R4
  6  add  R4   R1   R4
  7  ld   R4   16   R1
  8
  9
104
Decode Example

[Figure: decoding instruction 8: add R1, R3 → R3. The RCH lookup finds R1 at 7-D (operand becomes R1@7-D) and R3 unreferenced, so its architected value is used (R3=0). The RCH is updated: R1 → 8-S1, R3 → 8-D.]
105
Forwardflow – Dispatch

- Dispatch into DQ:
  - Writes metadata and available operands
  - Appends instruction to forward pointer chains

[Figure: "add → R3" dispatches as entry 8 with R1@7-D and R3=0.]

Dataflow Queue
  #  Op   Op1  Op2  Dest
  5  ld   R4   4    R4
  6  add  R4   R1   R4
  7  ld   R4   16   R1
  8  add  R1   0    R3
  9
106