Transcript lec11-2

CS 152
Computer Architecture and Engineering
Lecture 19 -- Dynamic Scheduling II
2014-4-3
John Lazzaro
(not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
Case studies of dynamic execution
DEC Alpha 21264: High performance from a relatively simple implementation of a modern instruction set.
Short Break
Simultaneous Multi-threading: Adapting multi-threading to dynamic scheduling.
IBM Power: Evolving dynamic designs over many generations.
DEC Alpha
21164: 4-issue in-order design.
21264: 4-issue out-of-order design.
21264 was 50% to 200% faster in real-world applications.
500 MHz 0.5µ parts for both the in-order 21164 and the out-of-order 21264.
21264 has 55% more transistors than the 21164. The die is 44% larger.
21264 has a 1.7x advantage on integer code, and a 2.7x advantage on floating-point code.
21264 consumes 46% more power than the 21164.
Similarly-sized on-chip caches (116K vs 128K).
In-order 21164 has the larger off-chip cache.
The Real Difference: Speculation
If the ability to recover from mis-speculation is built into an implementation ... it offers the option to add speculative features to all parts of the design.
21264 die
Separate OoO control for integer and floating point. RISC decode happens in the OoO blocks. Unlabeled areas are devoted to memory system control.
[Die photo labels: OoO control (x2), FP pipe, Int pipe (x2), Fetch and predict, I-Cache, Data Cache.]
21264 pipeline diagram
Rename and Issue stages are the primary locations of dynamic scheduling logic. Load/store disambiguation support resides in the Memory stage.
Slot: absorbs the delay of the long path on the last slide.
Fetch stage close-up:
Each cache line stores predictions of the next line and the cache way to be fetched (speculative). If the predictions are correct, the fetcher maintains the required 4 instructions/cycle pace.
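To make the line-and-way prediction idea concrete, here is a minimal Python sketch; the class and field names are invented for illustration and are not the 21264's actual fetch structures.

# Minimal sketch of next-line/way prediction (illustrative only). Each I-cache
# line carries a guess of which line (and which cache way) to fetch next, so
# fetch never waits on the branch unit as long as the guesses are correct.

class ICacheLine:
    def __init__(self, instructions):
        self.instructions = instructions   # the 4 instructions stored in this line
        self.next_line_guess = None        # predicted index of the next line to fetch
        self.next_way_guess = 0            # predicted cache way (not exercised in this toy model)

def fetch_stream(icache, start_index, n_cycles):
    """Follow the per-line predictions; default to sequential fetch otherwise."""
    index, fetched = start_index, []
    for _ in range(n_cycles):
        line = icache[index]
        fetched.extend(line.instructions)      # 4 instructions/cycle if guesses hold
        if line.next_line_guess is not None:
            index = line.next_line_guess       # speculative: verified downstream
        else:
            index = (index + 1) % len(icache)  # fall through to the next line
    return fetched

# Usage: line 0 predicts a taken branch into line 1.
icache = [ICacheLine(["i0", "i1", "i2", "i3"]), ICacheLine(["j0", "j1", "j2", "j3"])]
icache[0].next_line_guess = 1
print(fetch_stream(icache, 0, 2))   # 8 instructions fetched in 2 cycles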
Rename stage close-up:
(1) Allocates new physical registers for destinations,
(2) Looks up physical register numbers for sources,
(3) Handles rename dependences within the 4 issuing instructions, all in one clock cycle!
Timestamped, for mis-speculation recovery.
Input: 4 instructions specifying architected registers.
Output: 12 physical register numbers: 1 destination and 2 sources for each of the 4 instructions to be issued.
Recall: malloc() -- free() in hardware
The record-keeping shown in this diagram occurs in the rename stage.
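A minimal Python sketch of that map-table-plus-free-list record keeping; the register counts and structure names are invented for illustration, not the 21264's exact tables.

# Illustrative rename-stage sketch (invented register counts and table names):
# a map table from architected to physical registers, plus a free list -- the
# "malloc()/free()" analogy from the earlier lecture.

from collections import deque

NUM_ARCH, NUM_PHYS = 32, 80
map_table = {r: r for r in range(NUM_ARCH)}       # architected reg -> physical reg
free_list = deque(range(NUM_ARCH, NUM_PHYS))      # physical regs not yet allocated

def rename(group):
    """Rename up to 4 instructions, honoring dependences within the group."""
    out = []
    for dest, src1, src2 in group:
        # Sources read the current mapping, which may already have been updated
        # by an earlier instruction in this same group (the intra-group case).
        p1, p2 = map_table[src1], map_table[src2]
        pd = free_list.popleft()                  # "malloc" a fresh physical register
        map_table[dest] = pd
        out.append((pd, p1, p2))
    return out

# Usage: r1 = r2 + r3, then r4 = r1 + r1. The second instruction must see the
# *new* physical register just allocated for r1.
print(rename([(1, 2, 3), (4, 1, 1)]))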
Issue stage close-up:
(1) Newly issued instructions placed in top of queue.
(2) Instructions check scoreboard: are 2 sources ready?
(3) Arbiter selects 4 oldest “ready” instructions.
(4) Update removes these 4 from queue.
Input: 4 just-issued instructions, renamed to use physical registers.
Scoreboard: Tracks writes to physical registers.
Output: The 4 oldest instructions whose 2 source registers are ready for use.
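The scoreboard-and-arbiter step can be sketched as below; the entry format and register counts are invented for illustration, and result latency is ignored.

# Illustrative issue-queue sketch. A scoreboard bit per physical register
# records whether it has been written; the arbiter picks the 4 oldest queue
# entries whose two sources are both ready.

NUM_PHYS = 80
scoreboard = [False] * NUM_PHYS
for r in range(32):                 # assume the initial architected values are ready
    scoreboard[r] = True

# Queue entries: (age, dest, src1, src2); smaller age = older instruction.
issue_queue = [(0, 40, 1, 2), (1, 41, 40, 3), (2, 42, 4, 5), (3, 43, 6, 7)]

def select_ready(queue, width=4):
    ready = [e for e in queue if scoreboard[e[2]] and scoreboard[e[3]]]
    picked = sorted(ready)[:width]          # (3) oldest ready entries win arbitration
    for age, dest, s1, s2 in picked:
        queue.remove((age, dest, s1, s2))   # (4) update removes them from the queue
        scoreboard[dest] = True             # simplification: result ready immediately
    return picked

print(select_ready(issue_queue))  # the entry waiting on p40 is skipped this cycle
print(select_ready(issue_queue))  # ... and issues once p40 has been marked ready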
Execution close-up:
(1) Two copies of register files, to reduce port pressure.
(2) Forwarding buses are low-latency paths through CPU.
Relies on speculation.
Latencies, from issue to retirement. Short latencies keep buffers to a reasonable size.
Retirement is managed here. 8 retirements per cycle can be sustained over short time periods. The peak rate is 11 retirements in a single cycle.
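A small sketch of in-order retirement from an in-flight buffer with a per-cycle limit; the structure is invented for illustration and far smaller than the real window.

# Illustrative in-order retirement sketch: instructions sit in program order and
# retire oldest-first, only after completing, and only up to a per-cycle limit.

from collections import deque

inflight = deque()                  # entries: [sequence number, completed?]

def dispatch(seq):
    inflight.append([seq, False])

def complete(seq):
    for entry in inflight:
        if entry[0] == seq:
            entry[1] = True

def retire(max_per_cycle=8):
    retired = []
    while inflight and inflight[0][1] and len(retired) < max_per_cycle:
        retired.append(inflight.popleft()[0])    # oldest completed instruction
    return retired                               # stops at the first incomplete one

for s in range(10):
    dispatch(s)
for s in (0, 1, 2, 4):              # instruction 3 has not completed yet
    complete(s)
print(retire())                     # [0, 1, 2]: retirement waits for instruction 3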
Execution unit close-up:
(1) Two arbiters: one for top pipes, one for bottom pipes.
(2) Instructions statically assigned to top or bottom.
(3) Arbiter dynamically selects left or right.
Thus, 2 dual-issue dynamic machines, not a 4-issue machine.
Why? Simplifies arbiter. Performance penalty? A few %.
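The "two dual-issue machines" point can be illustrated with a toy model (names and entry format invented): each half gets its own narrow oldest-first arbiter instead of one 4-wide arbiter across all four pipes.

# Toy model of the clustered organization: instructions are statically tagged
# top or bottom, and each half runs its own 2-wide oldest-first arbiter.

def pick(ready, width=2):
    return sorted(ready)[:width]           # oldest-first within one half

def issue_cycle(ready):
    top    = [i for i in ready if i[1] == "top"]
    bottom = [i for i in ready if i[1] == "bottom"]
    return pick(top) + pick(bottom)        # two small arbiters, run independently

# Entries: (age, statically assigned half). Three "top" instructions are ready,
# but at most 2 can issue from the top half this cycle -- the few-% penalty.
ready = [(0, "top"), (1, "bottom"), (2, "top"), (3, "top"), (4, "bottom")]
print(issue_cycle(ready))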
Memory stages close-up:
1st stop: TLB, to convert virtual memory addresses.
Loads and stores from the execution unit appear as the "Cluster 0/1 memory unit" in the diagram below.
2nd stop: The Load Queue (LDQ) and Store Queue (STQ) each hold 32 instructions until retirement ... so we can roll back!
3rd stop: Flush the STQ to the data cache ... on a miss, place the access in the Miss Address File. (MAF == MSHR)
The data cache is "double pumped" at 1 GHz.
LDQ/STQ close-up:
Hazards we are trying to prevent:
To do so, the LDQ and STQ keep lists of up to 32 loads and stores, in issued order. When a new load or store arrives, addresses are compared to detect/fix hazards:
LDQ/STQ speculation
On a first execution, the hazard is detected and repaired. The machine also marks the load instruction in a predictor, so that subsequent executions are not speculatively executed.
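A toy Python model of the disambiguation check and the load-wait predictor described above; all structure names, sequence numbers, and addresses are invented for illustration.

# Toy model of LDQ/STQ disambiguation and the load-wait predictor. When an older
# store's address resolves, it is compared against younger loads that already
# executed; a match means the load read stale data, so work is rolled back and
# the load's PC is marked so later invocations wait instead of speculating.

load_queue = []        # (sequence number, PC, address) of issued, unretired loads
store_wait = set()     # PCs of loads caught speculating past a conflicting store

def load_may_speculate(pc):
    """Subsequent executions of a marked load must wait for older stores."""
    return pc not in store_wait

def execute_load(seq, pc, addr):
    load_queue.append((seq, pc, addr))   # record in the LDQ for later checks

def resolve_store(store_seq, store_addr):
    """An older store's address arrives: check younger, already-executed loads."""
    squashed = []
    for seq, pc, addr in load_queue:
        if seq > store_seq and addr == store_addr:
            store_wait.add(pc)           # train the predictor on this first execution
            squashed.append(pc)          # this load (and younger work) rolls back
    return squashed

# First execution: the load at PC 0x40 runs ahead of an older store to 0x1000.
execute_load(seq=12, pc=0x40, addr=0x1000)
print(resolve_store(store_seq=10, store_addr=0x1000))  # the load is squashed
print(load_may_speculate(0x40))                        # False: next time it waits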
Designing a microprocessor is a team sport. Below are the author and acknowledgement lists for the papers whose figures I use, grouped by role: micro-architects, architect, circuits.
There is no "i" in T-E-A-M ...
Break
Multi-Threading
(Dynamic Scheduling)
Power 4 (predates Power 5 shown earlier)
Single-threaded predecessor to Power 5. 8 execution units in the out-of-order engine; each may issue an instruction each cycle.
For most apps, most execution units lie idle
Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?
For an 8-way superscalar.
From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995.
Simultaneous Multi-threading ...
[Figure: issue-slot utilization over cycles 1 through 9, for one thread on 8 units vs. two threads on 8 units. Columns: M M FX FX FP FP BR CC.]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
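The intuition in the figure can be mimicked with a toy slot-filling model; the per-cycle instruction counts below are invented random workloads, not measurements.

# Toy model of the SMT slot-filling intuition: each cycle offers 8 issue slots;
# one thread alone leaves many slots empty, while two threads sharing the same
# slots fill more of them.

import random
random.seed(0)

def ready_instructions():
    return random.randint(0, 5)    # pretend a thread has 0-5 ready instructions

def utilization(n_threads, cycles=10000, slots=8):
    used = 0
    for _ in range(cycles):
        available = sum(ready_instructions() for _ in range(n_threads))
        used += min(available, slots)     # all threads share the same 8 slots
    return used / (cycles * slots)

print(f"1 thread:  {utilization(1):.0%} of issue slots used")
print(f"2 threads: {utilization(2):.0%} of issue slots used")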
Power 4 vs. Power 5
Power 5 adds: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets).
Power 5 data flow ...
Why only 2 threads? With 4, one of the shared
resources (physical registers, cache, memory
bandwidth) would be prone to bottleneck.
Power 5 thread performance ...
Relative priority of each thread controllable in hardware.
For balanced operation, both threads run slower than if they “owned” the machine.
Multi-Core
Recall: Superscalar utilization by a thread
For an 8-way
superscalar.
Observation: In many cases, the on-chip cache and DRAM I/O bandwidth are also underutilized by one CPU. So, let 2 cores share them.
Most of the Power 5 die is shared hardware
[Die photo labels: Core #1, Core #2; shared components: L2 Cache, L3 Cache Control, DRAM Controller.]
Core-to-core interactions stay on chip
(1) Threads on two cores that use shared libraries conserve L2 memory.
(2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips.
Sun Niagara
The case for Sun’s Niagara ...
For an 8-way
superscalar.
Observation: Some apps struggle to reach a CPI == 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars.
Niagara (original): 32 threads on one chip
8 cores:
Single-issue, 1.2 GHz
6-stage pipeline
4-way multi-threaded
Fast crypto support
Shared resources:
3MB on-chip cache
4 DDR2 interfaces
32G DRAM, 20 Gb/s
1 shared FP unit
GB Ethernet ports
Die size: 340 mm² in 90 nm.
Power: 50-60 W
Sources: Hot Chips, via EE Times, Infoworld; J. Schwartz weblog (Sun COO).
The board that booted Niagara first-silicon
Source: J Schwartz weblog (then Sun COO, now CEO)
Used in Sun Fire T2000: “Coolthreads”
Claim: server uses 1/3 the power of competing servers
Web server benchmarks used to position the T2000 in the market.
IBM RISC chips, since Power 4 (2001) ...
[Timeline figure of IBM RISC chips, 2001 through 2014.]
Recap: Dynamic Scheduling
Three big ideas: register renaming, data-driven detection of RAW hazard resolution, bus-based architecture.
Very complex, but enables many things: out-of-order execution, multiple issue, loop unrolling, etc.
Has saved architectures that have a small number of registers: IBM 360 floating-point ISA, Intel x86 ISA.
On Tuesday
Epilogue ...
Have a good weekend!