
Forwardflow
A Scalable Core for Power-Constrained CMPs
Dan Gibson and David A. Wood
ISCA 2010, Saint-Malo, France
UW-Madison Computer Sciences Multifacet Group
© 2010
Executive Summary [1/2]
• Future CMPs will need Scalable Cores
– Scale UP for single-thread performance
• Exploit ILP
– Scale DOWN for multiple threads
• Save power
• Exploit TLP
• Hard with traditional μArch
Executive Summary [2/2]
• Our Contribution: Forwardflow
– New Scalable Core μArch
• Uses pointers to eliminate associative search
• Distributes values, no PRF
• Scales to large instruction window sizes
• Full-window scheduler → No IQ clog
– Scales dynamically
• Variable-sized instruction window
– ~20% power/performance range
Ancient History: The Memory Wall
[Wulf94] (16 years earlier)
• 1994: Processors get faster faster than DRAM gets faster
• Solutions: More Caches, Superscalar, OoO, etc.
• 2010: Processors are a lot faster than DRAM
  – 1 DRAM access = ~100s of cycles
[FIGURE: processor clock rates over time: 386 at 20 MHz, 486 at 50 MHz, P6 at 166 MHz, Pentium IV at 4000 MHz]
[IMAGE: Prise de la Bastille (Storming of the Bastille), by Jean-Pierre-Louis-Laurent Houël]
Moore’s Law Endures (Obligatory Slide)
• Device counts continue to grow
  – Rock, 65nm [JSSC2009]
  – Rock16, 16nm [ITRS2007]
• More transistors → more threads? (~512), more cache?
[IMAGE: Gordon Moore’s 1965 sketch. Caption: “In 1965, Gordon Moore sketched out his prediction of the pace of silicon technology. Decades later, Moore’s Law remains true, driven largely by Intel’s unparalleled silicon expertise.” Copyright © 2005 Intel Corporation.]
Amdahl’s Law Endures
Everyone knows Amdahl's law, but quickly forgets it.
-Thomas Puzak

Speedup = [ (1 - f) + f/N ]^(-1)

• Parallel speedup limited by parallel fraction f
  – i.e., only ~10x speedup at N=512, f=90%
Takeaway: No TLP = No Speedup [Hill08]
[PLOT: speedup (1 to 100, log scale) vs. core count N (1 to 512) for f = 90%, 95%, and 99%]
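As a quick check of the “only ~10x at N=512, f=90%” point, here is a minimal sketch (plain Python, not from the talk) that evaluates the speedup formula above for the three parallel fractions in the plot.

# Amdahl's law: speedup(N, f) = 1 / ((1 - f) + f / N)
def amdahl_speedup(n_cores: int, parallel_fraction: float) -> float:
    """Upper bound on speedup with N cores and parallel fraction f."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

for f in (0.90, 0.95, 0.99):
    print(f"f = {f:.2f}: speedup at N = 512 is {amdahl_speedup(512, f):.1f}x")
# f = 0.90 -> ~9.8x, f = 0.95 -> ~19.3x, f = 0.99 -> ~83.8x

Even at f = 99%, 512 cores deliver well under 100x, which is the slide's point: without ample TLP, more cores do not help.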
Utilization Wall (aka SAF)
[Venkatesh2009] [Chakraborty2008]
• Simultaneously Active Fraction (SAF): fraction of devices in a fixed-area design that can be active at the same time, while still remaining within a fixed power budget
Takeaway: More Transistors → Lots of them have to be off
[PLOT: dynamic SAF (0 to 1) vs. process node (90nm, 65nm, 45nm, 32nm) for HP devices and LP devices]
Walls, Laws, and Threads
• Power prevents all of the chip from operating
all of the time
• Many applications are single-threaded
– Need ILP
• Some applications are multi-threaded
– Need TLP
• Emerging Solution: Scalable Cores
Scalable Cores
If you were plowing a field, which would you rather use: Two strong oxen or
1024 chickens?
-Attributed to Seymour Cray
• Scale UP for Performance
– Use more resources for more
performance
– (e.g., 2 Strong Oxen)
• Scale DOWN for Energy
Conservation
– Exploit TLP with many small cores
– (e.g., 1024 Chickens)
Core Scaling
Assume SAF = 50%
[DIAGRAM: baseline 8-core CMP with per-core caches. Scale Down -> many small cores running many threads for TLP. Scale Up -> one large core running one thread for ILP.]
• Hard to do with a traditional core design (not impossible)
Microarchitecture for Scalable Cores
• Conventional OoO:
– Interdependent structures scaled together
• Some structures easy to scale, some hard
– Scaling up means scaling to large sizes
• Hard to tolerate search operations in large structures
• This Work:
– Single, integrated structure
– Wire-delay tolerant design
– Avoid associative search
Key idea: a RAM-based, disaggregated instruction window, using pointers instead of associative search.
Forwardflow – Forward Pointers
• Use Pointers to explicitly define data movement
• Every Operand has a Next Use Pointer
• Register names not needed
  + No search operation
• No associative search (ever)
  ― Serialized Wakeup
    • Usually OK: Most ops have few successors [Ramirez04, Sassone07]
[FIGURE: example window with forward pointers (columns S1, S2, Dst):
  ld   R4, 4   -> R1
  add  R1, R3  -> R3
  sub  R4, 16  -> R4
  st   R3, R8
  breq R4, R3 ]
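To make the pointer scheme concrete, here is a minimal software sketch in Python (my own illustration with hypothetical names; the real structure is a hardware RAM, not the authors' code) of a window entry whose operand slots carry next-use pointers, and of wakeup walking that singly linked chain one successor per step, the serialized wakeup noted above.

from dataclasses import dataclass
from typing import Optional, Tuple

# A forward pointer names a DQ entry and the operand slot it feeds.
Ptr = Optional[Tuple[int, str]]        # e.g. (3, "s1"), or None for "no successor"

@dataclass
class DQEntry:
    op: str
    s1_value: Optional[int] = None     # first source value, once known
    s2_value: Optional[int] = None     # second source value, once known
    s1_next: Ptr = None                # next use of the value arriving in s1
    s2_next: Ptr = None                # next use of the value arriving in s2
    dest_next: Ptr = None              # first use of this entry's result
    result: Optional[int] = None       # produced value (filled at writeback)

def wakeup(dq: list, producer_idx: int, value: int) -> None:
    """Deliver a produced value by walking the forward-pointer chain,
    one successor operand per step (serialized wakeup)."""
    ptr = dq[producer_idx].dest_next
    while ptr is not None:
        idx, slot = ptr
        entry = dq[idx]
        if slot == "s1":
            entry.s1_value, ptr = value, entry.s1_next
        else:
            entry.s2_value, ptr = value, entry.s2_next
        # an entry whose s1_value and s2_value are both present is ready to issue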
Forwardflow – Dataflow Queue (DQ)
• Combination Scheduler, ROB, and Register File
  – Schedules instructions
  – Holds data values for all operands
[FIGURE: Register Consumer Table (RCT): R1 -> 2-S1, R2 -> (none), R3 -> 4-S1, R4 -> 5-S1.
 Dataflow Queue (Op1, Op2, Dest):
   0: ld   R4, 4   -> R1
   1: add  R1, R3  -> R3
   2: sub  R4, 16  -> R4
   3: st   R3, R8
 Next instruction to dispatch: breq R4, R3]
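Continuing the sketch above (still illustrative Python with hypothetical names, not the paper's hardware), dispatch could use the Register Consumer Table like this: a source register either reads its committed value or is appended to the producer's pointer chain, and the destination register's RCT entry is redirected to the new entry so later consumers chain off it.

def dispatch(dq, rct, arf, op, srcs, dest):
    """Append one instruction to the DQ, linking register operands via the RCT.
    rct maps a register to the DQ slot that last touched it, or None if the
    committed value in the ARF should be used."""
    idx = len(dq)
    entry = DQEntry(op=op)
    dq.append(entry)
    for slot, src in zip(("s1", "s2"), srcs):
        if src is None:
            continue
        if isinstance(src, int):                       # immediate operand
            setattr(entry, f"{slot}_value", src)
        elif rct.get(src) is None:                     # no in-flight producer:
            setattr(entry, f"{slot}_value", arf[src])  # read the ARF at dispatch
        else:                                          # extend the pointer chain
            prev_idx, prev_slot = rct[src]
            setattr(dq[prev_idx], f"{prev_slot}_next", (idx, slot))
            rct[src] = (idx, slot)                     # this use is the new chain tail
    if dest is not None:
        rct[dest] = (idx, "dest")                      # future readers chain off the result
    return idx

For instance, dispatching the figure's breq R4, R3 appends its operands to the existing R4 and R3 chains instead of broadcasting a register tag.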
Physical Organization
[FIGURE: logical DQ organization vs. physical DQ organization; the DQ Bank Group is the fundamental unit of scaling]
Scaling a Forwardflow Core
• Fully-provisioned Forwardflow core: frontend (BP, L1-I), RCTs, ARF, L1-D, and a backend of 4 DQ Bank Groups
  – Each Bank Group: 128-entry DQ, 2 IALU, 2 FPALU
• Scale the core by scaling the DQ
  – Bank Groups (BGs) power on/off independently

Configuration   BGs   DQ entries   IALU   FPALU   DMEM
F-1             1     128          2      2       2
F-2             2     256          4      4       2
F-4             4     512          8      8       2
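The F-1/F-2/F-4 points above differ only in how many bank groups are powered; a tiny sketch (hypothetical Python, mirroring the table rather than any real configuration interface) of that linear scaling rule:

DQ_ENTRIES_PER_BG, IALU_PER_BG, FPALU_PER_BG = 128, 2, 2   # per the slide
DMEM_PORTS = 2                                             # not scaled with BGs

def forwardflow_config(bank_groups: int) -> dict:
    """Resources of an F-<bank_groups> operating point."""
    return {"BGs": bank_groups,
            "DQ": bank_groups * DQ_ENTRIES_PER_BG,
            "IALU": bank_groups * IALU_PER_BG,
            "FPALU": bank_groups * FPALU_PER_BG,
            "DMEM": DMEM_PORTS}

for bgs in (1, 2, 4):          # F-1, F-2, F-4
    print(f"F-{bgs}:", forwardflow_config(bgs))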
Evaluation: Questions
1. Is single-thread Forwardflow core performance comparable to a similarly-sized OoO?
2. Does FF DQ scaling effectively scale performance for single threads?
3. How does DQ scaling affect power consumption?
Evaluation: Target Machine
• 8-Core CMP
  – 32KB L1s, 1MB L2s, 8MB shared L3
  – NoSQ [Sha06]
  – OoO Baseline
  – SPARCv9 “+”
• Running One Thread
  – 7 Cores Off, 1 On
  – specCPU + Com (1)
[DIAGRAM: CMP floorplan: Core0-Core7, each with private L1-I, L1-D, and L2; eight shared L3 banks (L3B0-L3B7); two memory channels (DRAM0, DRAM1)]
Results: OoO-like Performance
• Overall, Forwardflow (F-1) performance is close to that of a same-size OoO
• Some bad cases (e.g., bzip2): not enough misses to cover serialized wakeup
• Some good cases (e.g., libquantum): OoO suffers from IQ clog
[CHART: normalized runtime (0 to 1.4) for F-1 across SPEC INT 2006, SPEC FP 2006, commercial workloads, and GMean]
Results: Performance Scaling
• Runtime reduction compared to F-1:
  – F-2: 12%
  – F-4: 21%
• Some great cases, some non-great cases
Takeaway: Forwardflow’s backend scaling scales core performance
[CHART: normalized runtime (0 to 1.4) for F-1, F-2, and F-4 across SPEC INT 2006, SPEC FP 2006, commercial workloads, and GMean]
Results: Power Scaling
• F-1 consumes 10% less power than OoO
  – Most of the difference comes from the fine-grained DQ accesses and smaller RF
• Scaling up increases power consumption in unscaled components
  – Larger windows better utilize caches and frontend
  – Backend consumption scales reasonably (30%)
[CHART: normalized power for OoO, F-1, F-2, and F-4, broken down into backend, frontend, static, caches, and other; annotated increases of roughly +11% to +16% when scaling up]
Concluding Remarks
• Future CMPs will need Scalable Cores
– Scale UP for single-thread performance
– Scale DOWN to run multiple threads
• Forwardflow Core:
– New μArch for scaling the instruction
window
– ~20% power/performance
This looks familiar…
Didn’t I just see a talk on this topic from the same institution?
-75% of the audience (the waking portion)

             WiDGET [Watanabe10]      Forwardflow [Gibson10]
Vision       Scalable Cores           Scalable Cores
Mechanism    Steering, In-Order       Pointers, DQ, OoO
             Approximates In-Order    Full-Window Scheduling
Acknowledgments / Q&A
NSF CCR-0324878, CNS-0551401, and CNS-0720565 for financial support (e.g.,
keeping me alive in graduate school,
buying cluster nodes, etc.) Multifacet and
Multiscalar groups for years of guidance
and advice. Yasuko Watanabe for
simulator contributions. UW Computer
Architecture Affiliates for many
discussions, suggestions, and
encouraging remarks.
ACM/SIGARCH+IEEE/TCCA for part of a
trip to France. Megan Gibson for the rest.
Anonymous reviewers are also swell
people and their advice made this work
better.
INDEX OF BACKUP SLIDES
• Multithreaded Workloads
• Using DVFS to Scale
• Mispredictions
• ARF
• More vs. WiDGET
• A Day In the Life of a Forwardflow Op
  – Decode
  – Dispatch
  – Wakeup
  – Issue
  – Writeback
  – Commit
Related Work
• Scalable Schedulers
– Direct Instruction Wakeup [Ramirez04]:
• Scheduler has a pointer to the first successor
• Secondary table for matrix of successors
– Hybrid Wakeup [Huang02]:
• Scheduler has a pointer to the first successor
• Each entry has a broadcast bit for multiple
successors
– Half Price [Kim02]:
• Slice the scheduler in half
• Second operand often unneeded
Related Work
• Dataflow & Distributed Machines
– Tagged-Token [Arvind90]
• Values (tokens) flow to successors
– TRIPS [Sankaralingam03]:
• Discrete Execution Tiles: X, RF, $, etc.
• EDGE ISA
– Clustered Designs [e.g. Palacharla97]
• Independent execution queues
Results: Multiple Threads
• specOMP power/performance
  – Most benchmarks trade off power/performance with different Forwardflow configurations; some do not
• Feasible operating points depend on available power, e.g.:
  – At Y=8, 9 of 14 can run without DVFS
  – At Y=15, nearly all can run F-4
[CHART: normalized power (4 to 16) vs. speedup (0.95 to 1.4) for specOMP benchmarks including applu, apsi, equake, swim, and wupwise]
Back to Index
Shutting Off Cores (or DVFS)
Assume SAF = 50%
[DIAGRAM: baseline 8-core CMP; shutting off 50% leaves 4 equally-powerful cores]
• Shutting off cores is too much
  – Limits TLP: not enough cores
  – Limits ILP: cores aren’t aggressive enough
Back to Index
Forwardflow – Resolving Branches
• On Branch Pred.:
  – Checkpoint RCT
  – Checkpoint Pointer Valid Bits
• Checkpoint Restore
  – Restores RCT
  – Invalidates Bad Pointers
[FIGURE: Dataflow Queue (Op1, Op2, Dest):
  1: ld   R4, 4   -> R1
  2: add  R1, R3  -> R3
  3: sub  R4, 16  -> R4
  4: st   R3, R8
  5: breq R4, R5
  6: ld   R4, 4   -> R1
  7: add  R1, R3  -> R3 ]
Back to Index
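In the same illustrative Python style, operating on the DQEntry sketch from earlier (names mine, not the paper's), the checkpoint/restore idea is roughly: a predicted branch snapshots the RCT and the forward pointers that exist so far; a misprediction discards younger entries, restores the RCT, and invalidates any pointers created on the wrong path.

def checkpoint(rct: dict, dq: list) -> dict:
    """Taken when a branch is predicted: snapshot the RCT and current pointers."""
    return {"rct": dict(rct),
            "dq_len": len(dq),
            "ptrs": [(e.s1_next, e.s2_next, e.dest_next) for e in dq]}

def restore(ckpt: dict, rct: dict, dq: list) -> None:
    """Taken on a misprediction: squash wrong-path entries, restore the RCT,
    and invalidate pointers added after the checkpoint."""
    del dq[ckpt["dq_len"]:]                    # drop wrong-path DQ entries
    rct.clear()
    rct.update(ckpt["rct"])                    # registers point where they used to
    for entry, (s1, s2, d) in zip(dq, ckpt["ptrs"]):
        entry.s1_next, entry.s2_next, entry.dest_next = s1, s2, d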
Forwardflow – ARF [1/2]
• Architectural Register File (ARF)
  – Read at Dispatch
  – Written at Commit
[FIGURE: dispatching “mov R2 -> R5”: R2’s RCT entry is empty, so R2 is read from the ARF. RCT: R1 -> 2-S1, R2 -> (none), R3 -> 4-S1, R4 -> 5-S1. DQ (Op1, Op2, Dest): ld R4, 4 -> R1; add R1, R3 -> R3; sub R4, 16 -> R4; st R3, R8; mov R2 -> R5]
Forwardflow – ARF [2/2]
• Architectural Register File (ARF)
  – Read at Dispatch
  – Written at Commit
[FIGURE: next commit is “ld [R4+4] -> R1”, which writes R1 = 44 to the ARF. DQ (Op1, Op2, Dest): ld R4, 4 -> R1 (= 44); add R1, R3 -> R3; sub R4, 16 -> R4; st R3, R8; mov R2 -> R5]
Back to Index
WiDGET vs. FF [1/3]
Forwardflow:
– Full-window scheduling, clog-free
  • Good for Lookahead
  • Good for MLP
– Pointer-based dependences
  • Serialized Wakeup
  • Bad for serial uses of same value
WiDGET:
– Steering-based scheduling
  • Simple steering, simple execution logic
  • Can clog
– Some centralization
  • PRF
– Scales down to in-order
Back to Index
WiDGET vs. FF [2/3]
[FIGURE: a loop with ample MLP and many forward slices; each iteration is “ld [R1+R2] -> R3; <independent computation of R2>; br -8”. WiDGET steers each slice into its instruction buffers and execution units (IB 0, IB 1, EU 0, EU 1) and eventually cannot steer: stall. Forwardflow keeps the entire window in flight.]
WiDGET vs. FF [3/3]
[FIGURE: serial uses of R3: “ld [R1+R2] -> R3; mul R5, R3 -> R2; add R5, R3 -> R4; shr R3, 4 -> R9”. WiDGET spreads the consumers across IB 0, IB 1, EU 0, and EU 1. In Forwardflow the consumers form one pointer chain and wake up one at a time: artificial serialization.]
Back to Index
A Day in the Life of a Forwardflow Instruction: Decode
[FIGURE: decoding “add R1, R3 -> R3” as DQ entry 8. RCT lookups: R1 is in flight (R1@7-D), R3 = 0 is available. RCT updates: R1 -> 8-S1, R3 -> 8-D.]
Back to Index
A Day in the Life of a Forwardflow Instruction: Dispatch
[FIGURE: the add is written into DQ entry 8 with Op1 = R1@7-D, Op2 = 0, Dest = R3 (entry 7 is “ld R4, 4 -> R1”). The register names shown are implicit and not actually written.]
Back to Index
A Day in the Life of a Forwardflow Instruction: Wakeup
[FIGURE: DQ entry 7’s result is 0. DestVal.Write(7, 0) records the value and DestPtr.Read(7) returns its first use, 8-S1; the wakeup hardware now holds (next = 8-S1, value = 0) and follows the chain. DQ: 7: ld R4, 4 -> R1; 8: add R1, 0 -> R3; 9: sub R4, 16 -> R4; 10: st R3, R8.]
Back to Index
A Day in the Life of a Forwardflow Instruction: Issue (…and Execute)
[FIGURE: the chained value arrives at 8-S1 and is stored with S1Val.Write(8, 0); Meta.Read(8) and S2Val.Read(8) supply the opcode and the other operand; S1Ptr.Read(8) continues the pointer chain for that value. The add executes: 0 + 0 -> DQ entry 8.]
Back to Index
A Day in the Life of a Forwardflow Instruction: Writeback
[FIGURE: the add’s result (R3 = 0) is written with DestVal.Write(8, 0); DestPtr.Read(8) yields the result’s next use, 10-S1, so the wakeup hardware continues with (next = 10-S1, value = 0).]
Back to Index
A Day in the Life of a Forwardflow Instruction: Commit
[FIGURE: commit logic uses Meta.Read(8) and DestVal.Read(8) to retire the add, then updates architectural state with ARF.Write(R3, 0).]
Back to Index
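Putting the backup slides together, here is a compact end-to-end walk in the same illustrative Python style (the Read/Write names in comments echo the slides' port labels, but the model itself is mine and builds on the DQEntry, dispatch, and wakeup sketches earlier; the operand values are chosen to echo the ARF example): issue reads the operands, writeback stores the result and walks the dest-pointer chain, and commit writes the ARF.

# Builds on the DQEntry, dispatch(), and wakeup() sketches from earlier slides.
ALU = {"ld":  lambda a, b: a + b,   # load modeled as returning its address, for illustration only
       "add": lambda a, b: a + b,
       "sub": lambda a, b: a - b}

def issue_and_execute(dq, idx):
    e = dq[idx]                                  # Meta/S1Val/S2Val.Read(idx)
    return ALU[e.op](e.s1_value, e.s2_value)

def writeback(dq, idx, value):
    dq[idx].result = value                       # DestVal.Write(idx, value)
    wakeup(dq, idx, value)                       # walk the DestPtr / SxPtr chain

def commit(dq, idx, arf, dest_reg):
    arf[dest_reg] = dq[idx].result               # ARF.Write(dest_reg, value)

# Example: the slides' pair "ld [R4+4] -> R1 ; add R1, R3 -> R3".
dq, rct, arf = [], {}, {"R3": 0, "R4": 40}
i_ld  = dispatch(dq, rct, arf, "ld",  ("R4", 4),    "R1")
i_add = dispatch(dq, rct, arf, "add", ("R1", "R3"), "R3")
writeback(dq, i_ld,  issue_and_execute(dq, i_ld))   # wakes the add's R1 operand
writeback(dq, i_add, issue_and_execute(dq, i_add))
commit(dq, i_ld,  arf, "R1")                        # R1 = 44, as in the ARF backup slide
commit(dq, i_add, arf, "R3")
print(arf)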
DQ Q&A
[FIGURE: a larger example DQ with its Register Consumer Table. DQ (Op1, Op2, Dest): 1: ld R4, 4 -> R1; 2: add R1, R3 -> R3; 3: sub R4, 16 -> R4; 4: st R3, R8; 5: breq R4, R5; 6: ld R4, 4 -> R1; 7: add R1, R3 -> R3; 8: sub R4, 16 -> R4; 9: st R3, R8]