
Current and Future Trends in Processor Architecture
Theo Ungerer
Dept. of Computer Design and Fault Tolerance
University of Karlsruhe
D-76128 Karlsruhe, Germany
[email protected]
http://goethe.ira.uka.de/people/ungerer/
Tutorial Background Material
• J. Silc, B. Robic, Th. Ungerer: Processor Architecture: From Dataflow to Superscalar and Beyond. Springer-Verlag, Berlin, Heidelberg, New York 1999, 406 pages.
• Book homepage: http://goethe.ira.uka.de/people/ungerer/proc-arch/
• Slide collection of tutorial slides: http://goethe.ira.uka.de/people/ungerer/
• Slide collection of book contents (in 15 lectures): http://goethe.ira.uka.de/people/ungerer/prozarch/prslides99-00.html
2
Outline of the Tutorial
• State-of-the-art multiple-issue processors
  - Superscalar – Overview
  - Superscalar in more detail: Instruction Fetch and Branch Prediction, Decode, Rename, Issue, Dispatch, Execution Units, Completion, Retirement
  - VLIW/EPIC
• Solutions for future high-performance processors
  - Technology prognosis
  - Speed-up of a single-threaded application: Advanced superscalar, Trace Cache, Superspeculative, Multiscalar processors
  - Speed-up of multi-threaded applications: Chip multiprocessors (CMPs) and Simultaneous multithreading
3
Multiple-issue Processors
• Today's microprocessors utilize instruction-level parallelism by a multi-stage instruction pipeline and by the superscalar or the VLIW/EPIC technique.
• Most of today's general-purpose microprocessors are four- or six-issue superscalars.
• VLIW (very long instruction word) is the choice for most signal processors.
• VLIW is enhanced to EPIC (explicitly parallel instruction computing) by HP/Intel for its IA-64 ISA.
4
Instruction Pipelining
[Figure: stages of an instruction pipeline]
5
Superscalar Pipeline
[Figure: IF → ID and Rename → Instruction Window → Issue → parallel EX units → Retire and Write Back]
• Instructions in the instruction window are free from control dependencies due to branch prediction, and free from name dependences due to register renaming.
• So, only (true) data dependences and structural conflicts remain to be solved.
6
Superscalar vs. VLIW
• Superscalar and VLIW: more than a single instruction can be issued to the execution units per cycle.
• Superscalar machines are able to dynamically issue multiple instructions each clock cycle from a conventional linear instruction stream.
• VLIW processors use a long instruction word that contains a usually fixed number of instructions that are fetched, decoded, issued, and executed synchronously.
• Superscalar: dynamic issue; VLIW: static issue.
7
Sections of a Superscalar Pipeline
• The ability to issue and execute instructions out of order partitions a superscalar pipeline into three distinct sections:
  - an in-order section with the instruction fetch, decode, and rename stages - the issue is also part of the in-order section in case of an in-order issue,
  - an out-of-order section starting with the issue in case of an out-of-order issue processor, the execution stage, and usually the completion stage, and again
  - an in-order section that comprises the retirement and write-back stages.
8
Components of a Superscalar Processor
[Figure: block diagram of a superscalar processor - I-cache with MMU, instruction fetch unit with branch unit (BHT, BTAC), instruction decode and register rename unit, instruction buffer, instruction issue unit with reorder buffer, load/store unit with MMU and D-cache, floating-point unit(s), integer unit(s), retire unit, floating-point/general-purpose/rename registers, and a bus interface unit connecting 32 (64)-bit data, address, and control buses]
9
Branch-Target Buffer or
Branch-Target Address Cache
• The Branch Target Buffer (BTB) or Branch-Target Address Cache (BTAC) stores branch and jump addresses, their target addresses, and optionally prediction information.
• The BTB is accessed during the IF stage.
[Figure: BTB organized as a table with columns: branch address, target address, prediction bits]
10
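As a concrete illustration of this lookup, here is a minimal C sketch of a direct-mapped BTB; the entry count, the indexing by word address, and all field names are illustrative assumptions, not details from the tutorial.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical direct-mapped BTB with 1024 entries. */
    #define BTB_ENTRIES 1024

    typedef struct {
        bool     valid;
        uint32_t branch_pc;  /* address of the branch or jump */
        uint32_t target;     /* its target address */
        uint8_t  pred;       /* optional prediction bits (2-bit counter) */
    } BTBEntry;

    static BTBEntry btb[BTB_ENTRIES];

    /* Accessed during the IF stage: on a hit for a branch currently
     * predicted taken, fetching continues at the stored target address. */
    bool btb_lookup(uint32_t pc, uint32_t *next_fetch)
    {
        BTBEntry *e = &btb[(pc >> 2) % BTB_ENTRIES]; /* index by word address */
        if (e->valid && e->branch_pc == pc && e->pred >= 2) { /* 2,3 = taken */
            *next_fetch = e->target;
            return true;
        }
        return false;
    }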
Branch Prediction
• Branch prediction foretells the outcome of conditional branch instructions.
• Excellent branch handling techniques are essential for today's and for future microprocessors.
• Requirements of high-performance branch handling:
  - an early determination of the branch outcome (the so-called branch resolution),
  - buffering of the branch target address in a BTAC,
  - an excellent branch predictor (i.e., branch prediction technique) and speculative execution mechanism,
  - often another branch is predicted while a previous branch is still unresolved, so the processor must be able to pursue two or more speculation levels,
  - and an efficient rerolling mechanism when a branch is mispredicted (minimizing the branch misprediction penalty).
11
Misprediction Penalty
• The performance of branch prediction depends on the prediction accuracy and the cost of misprediction.
• The misprediction penalty depends on many organizational features:
  - the pipeline length (favoring shorter pipelines over longer pipelines),
  - the overall organization of the pipeline,
  - whether misspeculated instructions can be removed from internal buffers, or have to be executed and can only be removed in the retire stage,
  - the number of speculative instructions in the instruction window or the reorder buffer; typically only a limited number of instructions can be removed each cycle.
• Misprediction is expensive:
  - 4 to 9 cycles in the Alpha 21264,
  - 11 or more cycles in the Pentium II.
12
Static Branch Prediction
• Static branch prediction always predicts the same direction for the same branch during the whole program execution.
• It comprises hardware-fixed prediction and compiler-directed prediction.
• Simple hardware-fixed direction mechanisms can be:
  - predict always not taken,
  - predict always taken,
  - backward branch predict taken, forward branch predict not taken.
• Sometimes a bit in the branch opcode allows the compiler to decide the prediction direction.
13
Dynamic Branch Prediction
• Dynamic branch prediction: the hardware adjusts the prediction while execution proceeds.
• The prediction is decided on the computation history of the program.
• During the start-up phase of the program execution, where a static branch prediction might be effective, the history information is gathered, and dynamic branch prediction becomes effective.
• In general, dynamic branch prediction gives better results than static branch prediction, but at the cost of increased hardware complexity.
14
One-bit Predictor
[Figure: two-state diagram - in state "Predict Taken" a taken branch (T) keeps the state and a not-taken branch (NT) switches to "Predict Not Taken"; vice versa in state "Predict Not Taken"]
15
One-bit vs. Two-bit Predictors
• A one-bit predictor correctly predicts a branch at the end of a loop iteration, as long as the loop does not exit.
• In nested loops, a one-bit prediction scheme will cause two mispredictions for the inner loop:
  - one at the end of the loop, when the iteration exits the loop instead of looping again, and
  - one when executing the first loop iteration, when it predicts exit instead of looping.
• Such a double misprediction in nested loops is avoided by a two-bit predictor scheme.
• Two-bit prediction: a prediction must miss twice before it is changed when a two-bit prediction scheme is applied.
16
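Before the state diagrams on the next two slides, a minimal C sketch of the two-bit saturation counter may be helpful; predicting taken for the two upper states and saturating at both ends follows the scheme described above.

    /* Two-bit saturating counter: states 00..11.
     * Predict taken for values 2 and 3, not taken for 0 and 1. */
    typedef unsigned char Counter2;   /* holds 0..3 */

    int predict_taken(Counter2 c)
    {
        return c >= 2;   /* 10 = weakly taken, 11 = strongly taken */
    }

    void update(Counter2 *c, int taken)
    {
        if (taken) {
            if (*c < 3) (*c)++;   /* saturate at strongly taken */
        } else {
            if (*c > 0) (*c)--;   /* saturate at strongly not taken */
        }
    }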
Two-bit Predictors (Saturation Counter Scheme)
[Figure: four-state diagram - (11) Predict Strongly Taken, (10) Predict Weakly Taken, (01) Predict Weakly Not Taken, (00) Predict Strongly Not Taken; each taken branch (T) moves one state towards (11), each not-taken branch (NT) one state towards (00)]
17
Two-bit Predictors (Hysteresis Scheme)
[Figure: four-state diagram with the same four states - (11) Predict Strongly Taken, (10) Predict Weakly Taken, (01) Predict Weakly Not Taken, (00) Predict Strongly Not Taken; in the hysteresis scheme a misprediction in a weak state switches directly to the opposite strong state]
18
Two-bit Predictors
• Two-bit predictors can be implemented in the Branch Target Buffer (BTB) by assigning two state bits to each entry in the BTB.
• Another solution is to use a BTB for target addresses and a separate Branch History Table (BHT) as prediction buffer.
• A mispredict in the BHT occurs for two reasons:
  - either a wrong guess for that branch,
  - or the branch history of a wrong branch is used because the table is indexed.
• In an indexed table lookup, part of the instruction address is used as index to identify a table entry.
19
Two-bit Predictors and Correlation-based Prediction
• Two-bit predictors work well for programs which contain many frequently executed loop-control branches (floating-point intensive programs).
• Shortcomings arise from dependent (correlated) branches, which are frequent in integer-dominated programs.
• Example:

    if (d==0)    /* branch b1 */
        d=1;
    if (d==1)    /* branch b2 */
        ...
20
Predictor Behavior in Example
• A one-bit predictor initialized to "predict taken" for branches b1 and b2 → every branch is mispredicted.
• A two-bit predictor of the saturation counter scheme starting from the state "predict weakly taken" → every branch is mispredicted.
• The two-bit predictor of the hysteresis scheme mispredicts every second branch execution of b1 and b2.
• A (1,1) correlating predictor takes advantage of the correlation of the two branches; it mispredicts only in the first iteration when d = 2.
• See http://goethe.ira.uka.de/people/ungerer/prozarch/prslides99-00.html, 9th lecture, for animated demonstrations.
21
Correlation-based Predictor
• The two-bit predictor scheme uses only the recent behavior of a single branch to predict the future of that branch.
• Correlations between different branch instructions are not taken into account.
• Correlation-based predictors or correlating predictors additionally use the behavior of other branches to make a prediction.
• While two-bit predictors use self-history only, the correlating predictor additionally uses neighbor history.
• Notation: an (m,n)-correlation-based predictor or (m,n)-predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
• Branch history register (BHR): the global history of the most recent m branches can be recorded in an m-bit shift register where each bit records whether the branch was taken or not taken.
22
Correlation-based Prediction: (2,2)-predictor
[Figure: the branch address indexes a row in four pattern history tables (PHTs) of 2-bit predictors; the 2-bit branch history register (BHR, a 2-bit shift register) selects which of the four PHTs provides the prediction]
23
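A minimal C sketch of such an (m,n) predictor with m = 2 and n = 2; the PHT size of 1024 entries per table and the indexing are illustrative assumptions.

    #include <stdint.h>

    #define PHT_ENTRIES 1024

    static uint8_t pht[1u << 2][PHT_ENTRIES]; /* 2^m PHTs of 2-bit counters */
    static uint8_t bhr;                       /* m-bit global history (2 bits) */

    int predict(uint32_t pc)
    {
        /* The BHR selects one of the 2^m PHTs, the branch address indexes it. */
        return pht[bhr & 3][(pc >> 2) % PHT_ENTRIES] >= 2;
    }

    void train(uint32_t pc, int taken)
    {
        uint8_t *c = &pht[bhr & 3][(pc >> 2) % PHT_ENTRIES];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        bhr = ((bhr << 1) | (taken ? 1 : 0)) & 3;  /* shift in latest outcome */
    }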
Two-level Adaptive Predictors
• Developed by Yeh and Patt at the same time (1992) as the correlation-based prediction scheme.
• The basic two-level predictor uses a single global branch history register (BHR) of k bits to index into a pattern history table (PHT) of 2-bit counters.
• Global history schemes correspond to correlation-based predictor schemes.
• Example for the notation GAg:
  - a single global BHR (denoted G) and
  - a single global PHT (denoted g),
  - A stands for adaptive.
• All PHT implementations of Yeh and Patt use 2-bit predictors.
• A GAg predictor with a 4-bit BHR length is denoted as GAg(4).
24
Implementation of a GAg(4)-predictor
[Figure: a 4-bit branch history register (BHR, holding 1100) indexes entry 1100 of the branch pattern history table (PHT); the entry's 2-bit counter value 11 yields the prediction "taken"]
• In the GAg predictor schemes the PHT lookup depends entirely on the bit pattern in the BHR and is completely independent of the branch address.
25
Variations of Two-level Adaptive Predictors
Mispredictions can be reduced by additionally using:
• the full branch address to distinguish multiple PHTs (called per-address PHTs),
• a subset of branches (e.g. n bits of the branch address) to distinguish multiple PHTs (called per-set PHTs),
• the full branch address to distinguish multiple BHRs (called per-address BHRs),
• a subset of branches to distinguish multiple BHRs (called per-set BHRs),
• or a combination scheme.
26
Two-level Adaptive Predictors
                     single global PHT   per-set PHTs   per-address PHTs
single global BHR    GAg                 GAs            GAp
per-address BHRs     PAg                 PAs            PAp
per-set BHRs         SAg                 SAs            SAp
27
gselect and gshare Predictors
• gselect predictor: concatenates some lower-order bits of the branch address and the global history.
• gshare predictor: uses the bitwise exclusive OR of part of the branch address and the global history as hash function.
• McFarling: gshare slightly better than gselect.

Branch address   BHR        gselect4/4   gshare8/8
00000000         00000001   00000001     00000001
00000000         00000000   00000000     00000000
11111111         00000000   11110000     11111111
11111111         10000000   11110000     01111111
28
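The two hash functions are small enough to show directly; this C sketch assumes the bit widths of the table above (4+4 bits for gselect, 8 bits each for gshare).

    #include <stdint.h>

    uint32_t gselect_index(uint32_t pc, uint32_t bhr)
    {
        return ((pc & 0xF) << 4) | (bhr & 0xF);   /* concatenate 4+4 bits */
    }

    uint32_t gshare_index(uint32_t pc, uint32_t bhr)
    {
        return (pc ^ bhr) & 0xFF;                 /* XOR of 8 bits each */
    }

For pc = 11111111 and bhr = 10000000 these yield 11110000 and 01111111, matching the last table row.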
Hybrid Predictors
• The second strategy of McFarling is to combine multiple separate branch predictors, each tuned to a different class of branches.
• Two or more predictors and a predictor selection mechanism are necessary in a combining or hybrid predictor:
  - McFarling: combination of a two-bit predictor and a gshare two-level adaptive predictor,
  - Young and Smith: a compiler-based static branch prediction with a two-level adaptive type,
  - and many more combinations!
• Hybrid predictors are often better than single-type predictors.
29
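One plausible shape of such a selection mechanism is a table of 2-bit "chooser" counters in the style of McFarling's combining scheme; the component predictors, table size, and indexing below are assumptions for illustration.

    #include <stdint.h>

    #define CHOOSER_ENTRIES 1024
    static uint8_t chooser[CHOOSER_ENTRIES];   /* 0,1: prefer P1; 2,3: prefer P2 */

    /* Hypothetical component predictors, assumed to exist elsewhere. */
    extern int predict_bimodal(uint32_t pc);
    extern int predict_gshare(uint32_t pc);

    int predict_combined(uint32_t pc)
    {
        uint32_t i = (pc >> 2) % CHOOSER_ENTRIES;
        return (chooser[i] >= 2) ? predict_gshare(pc) : predict_bimodal(pc);
    }

    /* After the branch resolves, the chooser moves towards the component
     * that was right when exactly one of the two predictions was correct. */
    void train_chooser(uint32_t pc, int taken, int p1, int p2)
    {
        uint32_t i = (pc >> 2) % CHOOSER_ENTRIES;
        if (p2 == taken && p1 != taken && chooser[i] < 3) chooser[i]++;
        if (p1 == taken && p2 != taken && chooser[i] > 0) chooser[i]--;
    }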
Simulations of Grunwald 1998
Application   committed      conditional    taken      misprediction rate (%)
              instructions   branches       branches   SAg    gshare   combining
              (in millions)  (in millions)  (%)
compress       80.4           14.4          54.6       10.1   10.1      9.9
gcc           250.9           50.4          49.0       12.8   23.9     12.2
perl          228.2           43.8          52.6        9.2   25.9     11.4
go            548.1           80.3          54.5       25.6   34.4     24.1
m88ksim       416.5           89.8          71.7        4.7    8.6      4.7
xlisp         183.3           41.8          39.5       10.3   10.2      6.8
vortex        180.9           29.1          50.1        2.0    8.3      1.7
jpeg          252.0           20.0          70.0       10.3   12.5     10.4
mean          267.6           46.2          54.3        8.6   14.5      8.1

SAg, gshare, and McFarling's combining predictor for some SPECmarks
30
Results
• Simulation of Keeton et al. 1998 using an OLTP (online transaction processing) workload on a PentiumPro multiprocessor reported a misprediction rate of 14% with a branch instruction frequency of about 21%.
• Two different conclusions may be drawn from these simulation results:
  - branch predictors should be further improved,
  - and/or branch prediction is only effective if the branch is predictable.
• If a branch outcome is dependent on irregular data inputs, the branch often shows an irregular behavior.
  - Question: confidence of a branch prediction?
31
Confidence Estimation
• Confidence estimation is a technique for assessing the quality of a particular prediction.
• Applied to branch prediction, a confidence estimator attempts to assess the prediction made by a branch predictor.
• A low-confidence branch is a branch which frequently changes its branch direction in an irregular way, making its outcome hard to predict or even unpredictable.
• Four classes possible:
  - correctly predicted with high confidence C(HC),
  - correctly predicted with low confidence C(LC),
  - incorrectly predicted with high confidence I(HC), and
  - incorrectly predicted with low confidence I(LC).
32
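The slides do not prescribe an implementation; one common scheme from the literature is a table of resetting counters (Jacobson, Rotenberg, and Smith, 1996), sketched here in C with assumed table size and threshold.

    #include <stdint.h>
    #include <stdbool.h>

    #define CONF_ENTRIES 1024
    #define CONF_MAX     15     /* 4-bit counters */
    #define CONF_THRESH  12     /* counter at or above this: high confidence */

    static uint8_t conf[CONF_ENTRIES];

    bool high_confidence(uint32_t pc)
    {
        return conf[(pc >> 2) % CONF_ENTRIES] >= CONF_THRESH;
    }

    void conf_update(uint32_t pc, bool prediction_correct)
    {
        uint8_t *c = &conf[(pc >> 2) % CONF_ENTRIES];
        if (prediction_correct) {
            if (*c < CONF_MAX) (*c)++;  /* count correct predictions */
        } else {
            *c = 0;                     /* reset on a misprediction */
        }
    }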
Predicated Instructions
• Method to "remove" branches.
• Predicated or conditional instructions and one or more predicate registers use a predicate register as additional input operand.
• The Boolean result of a condition test is recorded in a (one-bit) predicate register.
• Predicated instructions are fetched, decoded, and placed in the instruction window like non-predicated instructions.
• It depends on the processor architecture how far a predicated instruction proceeds speculatively in the pipeline before its predication is resolved:
  - a predicated instruction executes only if its predicate is true, otherwise the instruction is discarded;
  - alternatively the predicated instruction may be executed, but commits only if the predicate is true, otherwise the result is discarded.
33
Predication Example
    if (x == 0) {          /* branch b1 */
        a = b + c;
        d = e - f;
    }
    g = h * i;             /* instruction independent of branch b1 */

becomes, with predication:

    Pred = (x == 0)        /* branch b1: Pred is set to true if x equals 0 */
    if Pred then a = b + c /* the operations are only performed */
    if Pred then d = e - f /* if Pred is set to true */
    g = h * i
34
Predication
+ Able to eliminate a branch and therefore the associated branch prediction → increasing the distance between mispredictions.
+ The run length of a code block is increased → better compiler scheduling.
- Predication affects the instruction set, adds a port to the register file, and complicates instruction execution.
- Predicated instructions that are discarded still consume processor resources, especially the fetch bandwidth.
• Predication is most effective when control dependences can be completely eliminated, such as in an if-then with a small then body.
• The use of predicated instructions is limited when the control flow involves more than a simple alternative sequence.
35
Eager (Multipath) Execution
• Execution proceeds down both paths of a branch, and no prediction is made.
• When a branch resolves, all operations on the non-taken path are discarded.
• With limited resources, the eager execution strategy must be employed carefully.
• A mechanism is required that decides when to employ prediction and when eager execution: e.g. a confidence estimator.
• Rarely implemented (IBM mainframes), but some research projects:
  - Dansoft processor, Polypath architecture, selective dual path execution, simultaneous speculation scheduling, disjoint eager execution.
36
Branch handling techniques and implementations
Technique                               Implementation examples
No branch prediction                    Intel 8086
Static prediction
  always not taken                      Intel i486
  always taken                          Sun SuperSPARC
  backward taken, forward not taken     HP PA-7x00
  semistatic with profiling             early PowerPCs
Dynamic prediction:
  1-bit                                 DEC Alpha 21064, AMD K5
  2-bit                                 PowerPC 604, MIPS R10000, Cyrix 6x86 and M2, NexGen 586
  two-level adaptive                    Intel PentiumPro, Pentium II, AMD K6
Hybrid prediction                       DEC Alpha 21264
Predication                             Intel/HP Itanium and most signal processors, e.g. ARM processors, TI TMS320C6201, and many others
Eager execution (limited)               IBM mainframes: IBM 360/91, IBM 3090
37
High-Bandwidth Branch Prediction
• Future microprocessors will require more than one prediction per cycle, starting speculation over multiple branches in a single cycle.
• When multiple branches are predicted per cycle, then instructions must be fetched from multiple target addresses per cycle, complicating I-cache access.
• Solution: a trace cache in combination with next-trace prediction.
38
Back to the Superscalar Pipeline
[Figure: IF → ID and Rename → Instruction Window → Issue → parallel EX units → Retire and Write Back]
• In-order delivery of instructions to the out-of-order execution kernel!
39
Decode Stage
• Delivery task: keep the instruction window full
  - → the deeper instruction look-ahead makes it possible to find more instructions to issue to the execution units.
• Fetch and decode instructions at a higher bandwidth than they are executed.
• Today's processors fetch and decode about 1.4 to 2 times as many instructions as they commit (because of mispredicted branch paths).
• Typically the decode bandwidth is the same as the instruction fetch bandwidth.
• Multiple instruction fetch and decode is supported by a fixed instruction length.
40
Decoding variable-length instructions
• Variable instruction length: often the case for legacy CISC instruction sets such as the Intel IA-32 ISA.
  - A multistage decode is necessary:
  - the first stage determines the instruction limits within the instruction stream,
  - the second stage decodes the instructions, generating one or several micro-ops from each instruction.
• Complex CISC instructions are split into micro-ops which resemble ordinary RISC instructions.
41
Two principal techniques to implement renaming:
• Separate sets of architectural registers and rename (physical) registers are provided.
  - The physical registers contain values (of completed but not yet retired instructions),
  - the architectural registers store the committed values.
  - After commitment of an instruction, copying its result from the rename register to the architectural register is required.
• Only a single set of registers is provided and architectural registers are dynamically mapped to physical registers.
  - The physical registers contain committed values and temporary results.
  - After commitment of an instruction, the physical register is made permanent and no copying is necessary.
• An alternative to dynamic renaming is static renaming in combination with a large register file, as defined for the Intel Itanium.
42
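A minimal C sketch of the second technique (a single physical register file with a dynamic map); register-file sizes and the free-list handling are illustrative assumptions.

    #include <stdint.h>

    #define NUM_ARCH 32
    #define NUM_PHYS 80

    static uint8_t  rmap[NUM_ARCH];       /* architectural -> physical */
    static uint64_t phys[NUM_PHYS];       /* committed values and temporaries */
    static uint8_t  free_list[NUM_PHYS];
    static int      free_top;

    void rename_init(void)
    {
        for (int i = 0; i < NUM_ARCH; i++) rmap[i] = (uint8_t)i;
        free_top = 0;                      /* remaining registers are free */
        for (int i = NUM_ARCH; i < NUM_PHYS; i++)
            free_list[free_top++] = (uint8_t)i;
    }

    /* Rename one instruction "rd = rs1 op rs2": read the current mapping
     * of the sources, then map the destination to a fresh physical register. */
    void rename(int rd, int rs1, int rs2, int *pd, int *ps1, int *ps2)
    {
        *ps1 = rmap[rs1];
        *ps2 = rmap[rs2];
        *pd  = free_list[--free_top];      /* issue stalls if none is free */
        rmap[rd] = (uint8_t)*pd;           /* later readers see the new value */
    }
    /* On commitment the mapping simply becomes permanent - no value in phys[]
     * is copied; the old physical register of rd returns to the free list. */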
Issue and Dispatch
• The notion of the instruction window comprises all the waiting stations between the decode (rename) and execute stages.
• The instruction window isolates the decode/rename stages from the execution stages of the pipeline.
• Instruction issue is the process of initiating instruction execution in the processor's functional units:
  - issue to an FU or a reservation station;
  - dispatch, if a second issue stage exists, denotes when an instruction starts to execute in the functional unit.
• The instruction-issue policy is the protocol used to issue instructions.
• The processor's lookahead capability is the ability to examine instructions beyond the current point of execution in hope of finding independent instructions to execute.
43
Issue
• The issue logic examines the waiting instructions in the instruction window and simultaneously assigns (issues) a number of instructions to the FUs up to a maximum issue bandwidth.
• The program order of the issued instructions is stored in the reorder buffer.
• Instruction issue from the instruction window can be:
  - in-order (only in program order) or out-of-order,
  - subject to simultaneous checking of data dependences and resource constraints,
  - or divided into two (or more) stages:
    - checking structural conflicts in the first and data dependences in the next stage (or vice versa);
    - in the case of structural conflicts first, the instructions are issued to reservation stations (buffers) in front of the FUs, where the issued instructions await missing operands.
44
Reservation Station(s)
• Two definitions in the literature:
  - a reservation station is a buffer for a single instruction with its operands (original Tomasulo paper, Flynn's book, Hennessy/Patterson book);
  - a reservation station is a buffer (in front of one or more FUs) with one or more entries, and each entry can buffer an instruction with its operands (PowerPC literature).
• Depending on the specific processor, reservation stations can be central to a number of FUs, or each FU has one or more own reservation stations.
• Instructions await their operands in the reservation stations, as in the Tomasulo algorithm.
45
Dispatch
• An instruction is said to be dispatched from a reservation station to the FU when all operands are available, and execution starts.
• If all its operands are available during issue and the FU is not busy, an instruction is immediately dispatched, starting execution in the next cycle after the issue.
• So, the dispatch is usually not a pipeline stage.
• An issued instruction may stay in the reservation station for zero to several cycles.
• Dispatch and execution are performed out of program order.
• Other authors interchange the meaning of issue and dispatch, or use different semantics.
46
The following issue schemes are commonly used:
• Single-level, central issue: single-level issue out of a central window, as in the Pentium II processor.
[Figure: Decode and Rename → central instruction window → Issue and Dispatch → Functional Units]
47
Single-level, two-window issue
• Single-level, two-window issue: single-level issue with instruction window decoupling using two separate windows;
  - most common: separate floating-point and integer windows, as in the HP 8000 processor.
[Figure: Decode and Rename → two separate windows → Issue and Dispatch → two groups of Functional Units]
48
Two-level issue with multiple windows
• Two-level issue with multiple windows: a centralized window in the first stage and separate windows (reservation stations) in the second stage (PowerPC 604 and 620 processors).
[Figure: Decode and Rename → central window (Issue) → reservation stations (Dispatch) → Functional Units]
49
Execution Stages
• Various types of FUs, classified as:
  - single-cycle (latency of one) or
  - multiple-cycle (latency of more than one) units.
• Single-cycle units produce a result one cycle after an instruction started execution. Usually they are also able to accept a new instruction each cycle (throughput of one).
• Multi-cycle units perform more complex operations that cannot be implemented within a single cycle.
• Multi-cycle units
  - can be pipelined to accept a new operation each cycle or each other cycle,
  - or they are non-pipelined.
• Another class of units exists that performs operations with variable cycle times.
50
Types of FUs
• single-cycle (single latency) units:
  - (simple) integer and (integer-based) multimedia units;
• multicycle units that are pipelined (throughput of one):
  - complex integer, floating-point, and (floating-point-based) multimedia units (also called multimedia vector units);
• multicycle units that are pipelined but do not accept a new operation each cycle (throughput of 1/2 or less):
  - often the 64-bit floating-point operations in a floating-point unit;
• multicycle units that are often not pipelined:
  - division unit, square root units, complex multimedia units;
• variable cycle time units:
  - load/store unit (depending on cache misses) and special implementations of e.g. floating-point units.
51
Multimedia Units
• Utilization of subword parallelism (data-parallel instructions, SIMD):
[Figure: packed multiply - register R1 holds subwords x1..x4, R2 holds y1..y4; four parallel multiplications produce R3 = (x1*y1, x2*y2, x3*y3, x4*y4)]
• Saturation arithmetic
• Additional arithmetic instructions, e.g. pavgusb (average instruction), masking and selection instructions, reordering and conversion
• MM streams and/or 3D graphics supported
52
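A C sketch of the packed multiply from the figure and of a saturating add; the subword width of 16 bits and keeping only the low half of each product are illustrative choices, since real ISAs offer several variants (high/low parts, saturating forms).

    #include <stdint.h>

    /* Four 16-bit subwords per 64-bit register, multiplied element-wise. */
    uint64_t packed_mul16(uint64_t r1, uint64_t r2)
    {
        uint64_t r3 = 0;
        for (int i = 0; i < 4; i++) {
            uint16_t x = (uint16_t)(r1 >> (16 * i));
            uint16_t y = (uint16_t)(r2 >> (16 * i));
            r3 |= (uint64_t)(uint16_t)((uint32_t)x * y) << (16 * i);
        }
        return r3;
    }

    /* Saturation arithmetic on one 16-bit subword: the result clamps
     * to the maximum instead of wrapping around. */
    uint16_t sat_add16(uint16_t a, uint16_t b)
    {
        uint32_t s = (uint32_t)a + b;
        return (s > 0xFFFF) ? 0xFFFF : (uint16_t)s;
    }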
Finalizing Pipelined Execution - Completion, Commitment, Retirement and Write-Back
• An instruction is completed when the FU has finished the execution of the instruction and the result is made available for forwarding and buffering.
  - Instruction completion is out of program order.
• Committing an operation means that the results of the operation have been made permanent and the operation is retired from the scheduler.
• Retiring means removal from the scheduler with or without the commitment of operation results, whichever is appropriate.
  - Retiring an operation does not imply the results of the operation are either permanent or not permanent.
• A result is made permanent:
  - either by making the mapping of architectural to physical register permanent (if no separate physical registers exist), or
  - by copying the result value from the rename register to the architectural register (in case of separate physical and architectural registers) in a separate write-back stage after the commitment!
53
Reorder Buffers
• The reorder buffer keeps the original program order of the instructions after instruction issue and allows result serialization during the retire stage.
• State bits record whether an instruction is on a speculative path and, when the branch is resolved, whether the instruction is on a correct path or must be discarded.
• When an instruction completes, the state is marked in its entry.
• Exceptions are marked in the reorder buffer entry of the triggering instruction.
• The reorder buffer is implemented as a circular FIFO buffer.
• Reorder buffer entries are allocated in the (first) issue stage and deallocated serially when the instruction retires.
54
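A minimal C sketch of such a circular FIFO; the size and the entry fields are illustrative assumptions.

    #include <stdbool.h>

    #define ROB_SIZE 64

    typedef struct {
        bool completed;    /* FU has produced the result */
        bool speculative;  /* still on an unresolved speculative path */
        bool exception;    /* exception raised by this instruction */
    } ROBEntry;

    static ROBEntry rob[ROB_SIZE];
    static int head, tail, count;

    int rob_allocate(void)                 /* called in the issue stage */
    {
        if (count == ROB_SIZE) return -1;  /* ROB full: issue stalls */
        int idx = tail;
        rob[idx] = (ROBEntry){false, true, false};
        tail = (tail + 1) % ROB_SIZE;
        count++;
        return idx;
    }

    /* Retire stage: commit completed, non-speculative instructions
     * strictly in program order from the head of the FIFO. */
    void rob_retire(void)
    {
        while (count > 0 && rob[head].completed && !rob[head].speculative) {
            /* ... make result permanent, check rob[head].exception ... */
            head = (head + 1) % ROB_SIZE;
            count--;
        }
    }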
Precise Interrupt (Precise Exception)
• An interrupt or exception is called precise if the saved processor state corresponds with the sequential model of program execution, where one instruction execution ends before the next begins.
• Precise exception means that all instructions before the faulting instruction are committed and those after it can be restarted from scratch.
• If an interrupt occurred, all instructions that are in program order before the interrupt-signaling instruction are committed, and all later instructions are removed.
• Depending on the architecture and the type of exception, the faulting instruction should be committed or removed without any lasting effect.
55
VLIW and EPIC
• VLIW (very long instruction word) and EPIC (explicit parallel instruction computing):
• The compiler packs a fixed number of instructions into a single VLIW/EPIC instruction.
• The instructions within a VLIW instruction are issued and executed in parallel; EPIC is more flexible.
• Examples:
  - VLIW: high-end signal processors (TMS320C6201)
  - EPIC: Intel Merced/Itanium
56
Intel's IA-64 EPIC Format
[Figure: 128-bit IA-64 instruction word - instruction 2 (41 bits), instruction 1 (41 bits), instruction 0 (41 bits), template (5 bits)]
• IA-64 instructions are packed by the compiler into bundles.
• A bundle is a 128-bit long instruction word (LIW) containing three IA-64 instructions along with a so-called template that contains instruction grouping information.
• IA-64 does not insert no-op instructions to fill slots in the bundles.
• The template explicitly indicates parallelism, that is,
  - whether the instructions in the bundle can be executed in parallel,
  - or if one or more must be executed serially,
  - and whether the bundle can be executed in parallel with the neighbor bundles.
57
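The bundle layout shown above can be decoded with plain shifts and masks; this C sketch assumes the template sits in the five least significant bits, with the three 41-bit slots above it, matching the figure.

    #include <stdint.h>

    /* The 128-bit word is modeled as two 64-bit halves. */
    typedef struct { uint64_t lo, hi; } Bundle128;

    /* Extract len bits starting at bit position pos (0 = LSB). */
    static uint64_t bits(Bundle128 b, int pos, int len)
    {
        uint64_t v;
        if (pos >= 64)
            v = b.hi >> (pos - 64);
        else if (pos + len <= 64)
            v = b.lo >> pos;
        else
            v = (b.lo >> pos) | (b.hi << (64 - pos));
        return v & ((1ULL << len) - 1);
    }

    void decode_bundle(Bundle128 b, uint8_t *template_field, uint64_t slot[3])
    {
        *template_field = (uint8_t)bits(b, 0, 5);  /* 5-bit template */
        slot[0] = bits(b,  5, 41);                 /* instruction 0 */
        slot[1] = bits(b, 46, 41);                 /* instruction 1 */
        slot[2] = bits(b, 87, 41);                 /* instruction 2 */
    }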
Outline of the Tutorial
• State-of-the-art multiple-issue processors
  - Superscalar – Overview
  - Superscalar in more detail: Instruction Fetch and Branch Prediction, Decode, Rename, Issue, Dispatch, Execution Units, Completion, Retirement
  - VLIW/EPIC
• Microarchitectural solutions for future microprocessors
  - Technology prognosis
  - Speed-up of a single-threaded application: Advanced superscalar, Trace Cache, Superspeculative, Multiscalar processors
  - Speed-up of multi-threaded applications: Chip multiprocessors (CMPs) and Simultaneous multithreading
58
Technological Forecasts
• Moore's Law: the number of transistors per chip doubles every two years.
• SIA (Semiconductor Industry Association) prognosis 1998:

Year of 1st shipment    1997   1999   2002   2005   2008   2011   2014
Local clock (GHz)       0.75   1.25   2.1    3.5    6      10     16.9
Across chip (GHz)       0.75   1.2    1.6    2      2.5    3      3.674
Chip size (mm²)         300    340    430    520    620    750    901
Dense lines (nm)        250    180    130    100    70     50     35
Number of chip I/O      1515   1867   2553   3492   4776   6532   8935
Transistors per chip    11M    21M    76M    200M   520M   1.4B   3.62B
59
Design Challenges
• Increasing clock speed,
• the amount of work that can be performed per cycle,
• and the number of instructions needed to perform a task.
• Today's general trend toward more complex designs is opposed by the wiring delay within the processor chip as the main technological problem:
  - with higher clock rates and subquarter-micron designs, on-chip interconnecting wires cause a significant portion of the delay time in circuits.
• Functional partitioning becomes more important!
60
Architectural Challenges and Implications
• Preserve object-code compatibility (may be avoided by a virtual machine that targets run-time ISAs).
• Find ways of expressing and exposing more parallelism to the processor. It is doubtful whether enough ILP is available. Harness thread-level parallelism (TLP) additionally.
• Memory bottleneck.
• Power consumption for mobile computers and appliances.
• Soft errors by cosmic rays or gamma radiation may be faced with fault-tolerant design throughout the chip.
61
Future Processor Architecture Principles
• Speed-up of a single-threaded application:
  - Advanced superscalar
  - Trace Cache
  - Superspeculative
  - Multiscalar processors
• Speed-up of multi-threaded applications:
  - Chip multiprocessors (CMPs)
  - Simultaneous multithreading
62
Processor Techniques to Speed-up a Single-threaded Application
• Advanced superscalar processors scale current designs up to issue 16 or 32 instructions per cycle.
• A trace cache facilitates instruction fetch and branch prediction.
• Superspeculative processors enhance wide-issue superscalar performance by speculating aggressively at every point.
• Multiscalar processors divide a program into a collection of tasks that are distributed to a number of parallel processing units under control of a single hardware sequencer.
63
Advanced Superscalar Processors for Billion-Transistor Chips in Year 2005 - Characteristics
• Aggressive speculation, such as a very aggressive dynamic branch predictor,
• a large trace cache,
• very-wide-issue superscalar processing (an issue width of 16 or 32 instructions per cycle),
• a large number of reservation stations to accommodate 2,000 instructions,
• 24 to 48 highly optimized, pipelined functional units,
• sufficient on-chip data cache, and
• sufficient resolution and forwarding logic.
See: Yale N. Patt, Sanjay J. Patel, Marius Evers, Daniel H. Friendly, Jared Stark: One Billion Transistors, One Uniprocessor, One Chip. IEEE Computer, September 1997, pp. 51-57.
64
The Trace Cache
• The trace cache is a special I-cache that captures dynamic instruction sequences, in contrast to the I-cache, which contains static instruction sequences.
• Like the I-cache, the trace cache is accessed using the starting address of the next block of instructions.
• Unlike the I-cache, it stores logically contiguous instructions in physically contiguous storage.
• A trace cache line stores a segment of the dynamic instruction trace across multiple, potentially taken branches.
• Each line stores a snapshot, or trace, of the dynamic instruction stream.
• The trace construction is off the critical path.
65
I-cache and Trace Cache
[Figure: the I-cache stores the static instruction sequence, so a dynamic sequence across taken branches spans multiple lines; the trace cache stores the same instructions as one physically contiguous dynamic trace]
66
Superspeculative Processors
• Idea: instructions generate many highly predictable result values in real programs → speculate on source operand values and begin execution without waiting for the result from the previous instruction.
• Speculate about true data dependences!
• Reasons for the existence of value locality:
  - register spill code;
  - input sets often contain data with little variation;
  - a compiler often generates run-time constants due to error-checking, switch statement evaluation, and virtual function calls;
  - the compiler also often loads program constants from memory rather than using immediate operands.
See: M. H. Lipasti, J. P. Shen: Superspeculative Microarchitecture for Beyond AD 2000. IEEE Computer, Sept. 1997, pp. 59-66.
67
Strong- vs. Weak-dependence Model
• Strong-dependence model for program execution: a total instruction ordering of a sequential program.
  - Two instructions are identified as either dependent or independent, and when in doubt, dependences are pessimistically assumed to exist.
  - Dependences are never allowed to be violated and are enforced during instruction processing.
• Weak-dependence model:
  - specifies that dependences can be temporarily violated during instruction execution as long as recovery can be performed prior to affecting the permanent machine state.
  - Advantage: the machine can speculate aggressively and temporarily violate the dependences. The machine can exceed the performance limit imposed by the strong-dependence model.
68
Implementation of a Weak-dependence Model
• The front-end engine assumes the weak-dependence model and is highly speculative, predicting instructions to aggressively speculate past them.
• The back-end engine still uses the strong-dependence model to validate the speculations, recover from misspeculation, and provide history and guidance information to the speculative engine.
69
Superflow Processor (Lipasti and Shen 1997)
• The Superflow processor speculates on:
  - instruction flow: a two-phase branch predictor combined with a trace cache;
  - register data flow: dependence prediction - predict the register value dependence between instructions:
    - source operand value prediction,
    - constant value prediction,
    - value stride prediction: speculate on constant, incremental increases in operand values,
    - dependence prediction predicts inter-instruction dependences;
  - memory data flow: prediction of load values, of load addresses, and alias prediction.
70
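As an illustration of the value stride prediction mentioned above, here is a minimal C sketch of a per-instruction stride predictor; the table size and the simple confidence rule are assumptions for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define VPT_ENTRIES 1024

    typedef struct {
        uint64_t last;       /* last result produced by this instruction */
        int64_t  stride;     /* last observed difference between results */
        bool     confident;  /* stride was confirmed at least once */
    } StrideEntry;

    static StrideEntry vpt[VPT_ENTRIES];

    bool predict_value(uint32_t pc, uint64_t *value)
    {
        StrideEntry *e = &vpt[(pc >> 2) % VPT_ENTRIES];
        if (!e->confident) return false;      /* no speculation */
        *value = e->last + e->stride;         /* speculate on the next value */
        return true;
    }

    void train_value(uint32_t pc, uint64_t result)
    {
        StrideEntry *e = &vpt[(pc >> 2) % VPT_ENTRIES];
        int64_t new_stride = (int64_t)(result - e->last);
        e->confident = (new_stride == e->stride);  /* same stride twice in a row */
        e->stride = new_stride;
        e->last = result;
    }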
Superflow Processor Proposal
[Figure: block diagram - instruction flow: fetch from I-cache and trace cache guided by the branch predictor into an instruction buffer and decode; register dataflow: reservation stations (RSs) in front of integer, floating-point, and multimedia units; memory dataflow: memory unit with store queue and D-cache; reorder buffer and retire stage]
71
Multiscalar Processors
• A program is represented as a control flow graph (CFG), where basic blocks are nodes, and arcs represent flow of control.
• A multiscalar processor walks through the CFG speculatively, taking task-sized steps, without pausing to inspect any of the instructions within a task.
• The tasks are distributed to a number of parallel PEs within a processor.
• Each PE fetches and executes instructions belonging to its assigned task.
• The primary constraint: it must preserve the sequential program semantics.
72
Multiscalar mode of execution
[Figure: a CFG with basic blocks A-E; tasks A, B, D, and E are assigned to PE 0-PE 3, and data values are forwarded between the PEs]
73
Multiscalar processor
[Figure: a sequencer with I-cache assigns tasks (between head and tail pointers) to processing elements 1..n, which are connected by unidirectional rings; each processing unit has a register file; memory accesses go through an ARB (address resolution buffer) and an interconnect to D-cache data banks]
74
Multiscalar, Trace and Speculative Multithreaded Processors
• Multiscalar: a program is statically partitioned into tasks which are marked by annotations of the CFG.
• Trace processor: tasks are generated from traces of the trace cache.
• Speculative multithreading: tasks are otherwise dynamically constructed.
• Common target: increase of single-thread program performance by dynamically utilizing thread-level speculation additionally to instruction-level parallelism.
• A "thread" here means a "HW thread".
75
Additional utilization of more coarse-grained parallelism
• Chip multiprocessors (CMPs) or multiprocessor chips:
  - integrate two or more complete processors on a single chip,
  - every functional unit of a processor is duplicated.
• Simultaneous multithreaded (SMT) processors:
  - store multiple contexts in different register sets on the chip,
  - the functional units are multiplexed between the threads,
  - instructions of different contexts are simultaneously executed.
76
Shared memory candidates for CMPs
[Figure: four processors share one primary cache, backed by a secondary cache and global memory - shared primary cache]
77
Shared memory candidates for CMPs
[Figure: two alternatives - shared secondary cache: four processors with private primary caches share one secondary cache in front of the global memory; shared caches and memory: four processors with private primary and secondary caches share the global memory]
78
Hydra: A Single-Chip Multiprocessor
[Figure: a single chip with four CPUs, each with private primary I-cache and D-cache and its own memory controller, connected by centralized bus arbitration mechanisms to an on-chip secondary cache, an off-chip L3 interface (to a cache SRAM array), a Rambus memory interface (to DRAM main memory), and a DMA/I/O bus interface (to I/O devices)]
79
Shared memory candidates for CMPs
[Figure: four processors connected directly to the global memory - shared global memory, no caches]
80
Motivation for Processor-in-Memory
• Technological trends have produced a large and growing gap between processor speed and DRAM access latency.
• Today, it takes dozens of cycles for data to travel between the CPU and main memory.
• The CPU-centric design philosophy has led to very complex superscalar processors with deep pipelines.
• Much of this complexity is devoted to hiding memory access latency.
• Memory wall: the phenomenon that access times are increasingly limiting system performance.
• Memory-centric design is envisioned for the future!
81
PIM or Intelligent RAM (IRAM)
• PIM (processor-in-memory) or IRAM (intelligent RAM) approaches couple processor execution with large, high-bandwidth, on-chip DRAM banks.
• PIM or IRAM merge processor and memory into a single chip.
• Advantages:
  - The processor-DRAM gap in access speed will increase in the future. PIM provides higher bandwidth and lower latency for (on-chip) memory accesses.
  - DRAM can accommodate 30 to 50 times more data than the same chip area devoted to caches.
  - On-chip memory may be treated as main memory - in contrast to a cache, which is just a redundant memory copy.
  - PIM decreases energy consumption in the memory system due to the reduction of off-chip accesses.
82
PIM Challenges
• Scaling a system beyond a single PIM.
• The DRAM technology today does not allow on-chip coupling of high-performance processors with DRAM memory, since the clock rate of DRAM memory is too low.
  - Logic and DRAM manufacturing processes are fundamentally different.
• The PIM approach can be combined with most processor organizations:
  - the processor(s) itself may be a simple or moderately superscalar standard processor,
  - it may also include a vector unit as in the vector IRAM type,
  - or be designed around a smart memory system.
  - In the future: potentially memory-centric architectures.
83
Conclusions on CMP
• Usually, a CMP will feature:
  - separate L1 I-cache and D-cache per on-chip CPU
  - and an optional unified L2 cache.
• If the CPUs always execute threads of the same process, the L2 cache organization will be simplified, because different processes do not have to be distinguished.
• Recently announced commercial processors with CMP hardware:
  - IBM Power4 processor with two processors on a single die,
  - Sun MAJC5200 with two processors on a die (each processor a 4-threaded block-interleaving VLIW).
84
Motivation for Multithreaded Processors
• Aim: latency tolerance.
• What is the problem? Load access latencies measured on an AlphaServer 4100 SMP with four 300 MHz Alpha 21164 processors are:
  - 7 cycles for a primary cache miss which hits in the on-chip L2 cache of the 21164 processor,
  - 21 cycles for an L2 cache miss which hits in the L3 (board-level) cache,
  - 80 cycles for a miss that is served by the memory, and
  - 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory.
85
Multithreading
• Multithreading:
  - the ability to pursue two or more threads of control in parallel within a processor pipeline.
  - Advantage: the latencies that arise in the computation of a single instruction stream are filled by computations of another thread.
• Multithreaded processors are able to bridge latencies by switching to another thread of control - in contrast to chip multiprocessors.
86
Multithreaded Processors
• Multithreading:
  - provide several program counters (and usually several register sets) on chip,
  - fast context switching by switching to another thread of control.
[Figure: four threads, each with its own register set, program counter (PC), and processor status register (PSR 1-4), sharing the remaining resources such as the FP unit]
87
Approaches of Multithreaded Processors
• Cycle-by-cycle interleaving:
  - an instruction of another thread is fetched and fed into the execution pipeline at each processor cycle.
• Block interleaving:
  - the instructions of a thread are executed successively until an event occurs that may cause latency; this event induces a context switch.
• Simultaneous multithreading:
  - instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor;
  - combines a wide-issue superscalar instruction issue with multithreading.
88
Comparison of Multithreading with Non-Multithreading Approaches
[Figure: pipeline utilization over time (processor cycles) for (a) single-threaded scalar, (b) cycle-by-cycle interleaving multithreaded scalar, and (c) block-interleaving multithreaded scalar with explicit context switches]
89
Simultaneous Multithreading (SMT) and Chip Multiprocessors (CMP)
[Figure: issue slots over time (processor cycles) - (a) SMT fills the issue slots of one wide superscalar from several threads; (b) CMP distributes the threads over several narrower processors]
90
Simultaneous Multithreading
• State of research:
  - SMT has been simulated and evaluated with SPEC92, SPEC95, and with database transaction and decision support workloads;
  - mostly unrelated programs are loaded in the thread slots!
  - Typical result: an 8-threaded SMT reaches a two- to threefold IPC increase over a single-threaded superscalar.
• State of industrial development:
  - DEC/Compaq announced the Alpha EV8 (21464) as a 4-threaded 8-wide superscalar SMT processor at Microprocessor Forum, October 1999.
91
Combining SMT and Multimedia
• Start with a wide-issue superscalar general-purpose processor.
• Enhance by simultaneous multithreading.
• Enhance by multimedia unit(s).
• Enhance by on-chip RAM memory for constants and local variables.
92
The SMT Multimedia Processor Model
[Figure: block diagram - IF and ID stages with thread control and a BTAC, rename stage, and register file; simple and complex integer units, branch unit, and global/local load/store units; I-cache, D-cache, on-chip local memory, memory interface (to memory) and I/O; RI, RT, WB, and completion stages]
93
IPC of Maximum Processor Models
[Figure: bar chart of IPC over the number of threads (1 to 8) and the issue bandwidth (1 to 8); IPC grows from about 1 for single-issue models to about 6.3 for the 8-threaded, 8-issue maximum processor model]
94
CMP or SMT?
• The performance race between SMT and CMP is not yet decided.
• CMP is easier to implement, but only SMT has the ability to hide latencies.
• A functional partitioning is not easily reached within an SMT processor due to the centralized instruction issue.
  - A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.
  - A combination of simultaneous multithreading with the CMP may be superior.
• Research: combine the SMT or CMP organization with the ability to create threads with compiler support or fully dynamically out of a single thread:
  - thread-level speculation,
  - close to multiscalar.
95
This is the End!
• Several alternative processor design principles were introduced:
  - fine-grain techniques (increasing performance of a single thread of control),
  - coarse-grain techniques to speed up a multiprogramming mix.
• Nothing is as hard to predict as the future.
96