CS252
Graduate Computer Architecture
Lecture 9
Instruction Level Parallelism:
Reality Sets In
September 29th, 2003
Prof. John Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/courses/cs252-F03
9/29/03
CS252/Kubiatowicz
Lec 9.1
Review:
Explicit Register Renaming
• Make use of a physical register file that is larger
than number of registers specified by ISA
• Keep a translation table:
– ISA register => physical register mapping
– When register is written, replace table entry with new register from
freelist.
– Physical register becomes free when not being used by any
instructions in progress.
[Figure: datapath with Fetch => Decode/Rename => Execute; the Rename Table sits in the Decode/Rename stage.]
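The rename-table idea above can be sketched in a few lines of Python. This is an illustrative model only: the `Renamer` class, the dict-based table, and the FIFO free list are assumptions for the sketch, not real hardware structure, and freeing a physical register when its last reader finishes is omitted.

```python
# Sketch: explicit register renaming with a translation table and free list.
class Renamer:
    def __init__(self, num_isa_regs, num_phys_regs):
        # Initially, ISA register r_i maps to physical register p_i.
        self.table = {f"r{i}": f"p{i}" for i in range(num_isa_regs)}
        # Remaining physical registers start on the free list.
        self.free_list = [f"p{i}" for i in range(num_isa_regs, num_phys_regs)]

    def rename(self, dest, srcs):
        """Rename one instruction: read sources under the current mapping,
        then give the destination a fresh physical register."""
        phys_srcs = [self.table[s] for s in srcs]  # sources see current mapping
        phys_dest = self.free_list.pop(0)          # new register from free list
        self.table[dest] = phys_dest               # later readers see new name
        return phys_dest, phys_srcs
```

Because every write gets a fresh physical register, WAW and WAR hazards on ISA register names disappear; only true (RAW) dependences remain.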
Review: Reorder Buffer (ROB)
• Use of reorder buffer
– In-order issue, Out-of-order execution, In-order commit
– Holds results until they can be committed in order
» Serves as source of values until instructions committed
– Provides support for precise exceptions/speculation: simply throw out
instructions later than the excepting instruction.
– Commits user-visible state in instruction order
– Stores sent to memory system only when they reach head of buffer
• In-Order-Commit is important because:
– Allows the generation of precise exceptions
– Allows speculation across branches
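The commit discipline above can be modeled as a tiny Python loop. This is an illustrative sketch only: the `done`/`exception` fields and the `commit` helper are assumptions for the model, not the actual hardware interface.

```python
# Sketch: in-order commit from the head of a reorder buffer.
from collections import deque

def commit(rob):
    """Commit completed entries from the head, in program order.
    On an exception, squash the excepting instruction and all younger ones."""
    committed = []
    while rob and rob[0]["done"]:
        entry = rob.popleft()
        if entry.get("exception"):
            rob.clear()              # throw out instructions later than this one
            return committed, entry  # state is precise: only older insts committed
        committed.append(entry["dest"])  # user-visible state updated in order
    return committed, None
```

Since nothing younger than an excepting instruction ever reaches architectural state, the exception is precise by construction.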
Review: Memory Disambiguation
• Question: Given a load that follows a store in program
order, are the two related?
– Trying to detect RAW hazards through memory
– Stores commit in order (ROB), so no WAR/WAW memory hazards.
• Implementation
– Keep queue of stores, in program order
– Watch for position of new loads relative to existing stores
– Typically, this is a different buffer than ROB!
» Could be ROB (has right properties), but too expensive
• When have address for load, check store queue:
– If any store prior to load is waiting for its address, stall load.
– If load address matches earlier store address (associative lookup),
then we have a memory-induced RAW hazard:
» store value available => return value
» store value not available => return ROB number of source
– Otherwise, send out request to memory
• Will relax exact dependency checking in later lecture
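The store-queue check above can be sketched directly in Python. This is an illustrative model only: the entry fields (`addr`, `value`, `rob`) and the action labels returned are assumptions for the sketch, not a real implementation.

```python
# Sketch: checking a new load against an in-order store queue.
def check_load(store_queue, load_addr):
    """store_queue holds older stores in program order; 'addr' is None if
    the store's address is not yet computed, 'value' is None if its data
    is not yet produced ('rob' is the producing instruction's ROB number)."""
    action = ("memory", None)             # default: no conflict, go to memory
    for store in store_queue:
        if store["addr"] is None:
            return ("stall", None)        # earlier store address unknown: stall
        if store["addr"] == load_addr:    # memory-induced RAW hazard
            if store["value"] is not None:
                action = ("forward", store["value"])
            else:
                action = ("forward-tag", store["rob"])
    return action                         # youngest matching store wins
```

Note the asymmetry: an unknown earlier address forces a conservative stall, while a known matching address lets the load bypass memory entirely.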
Does it make sense to rename
memory addresses?
• Renaming is about transforming the names of
storage locations:
– I.e. Register names: F0 => P10, F0 => P2, ….
– Data addresses…?
• Consider:
loop: <inst1>
      st  r2, 0(r3)
      <inst2>
      ld  r6, 0(r3)
      <really slow instruction>
      beq loop
• In order to overlap iterations, need to:
– Recognize that there are names being reused
– Bypass the store/load sequence internal to pipeline
• Often done by memory disambiguation unit
Review: Instruction Level Parallelism
• High speed execution based on instruction
level parallelism (ilp): potential of short
instruction sequences to execute in parallel
• High-speed microprocessors exploit ILP by:
1) pipelined execution: overlap instructions
2) Out-of-order execution (commit in-order)
3) Multiple issue: issue and execute multiple
instructions per clock cycle
4) Vector instructions: many independent ops specified
with a single instruction
• Memory accesses for high-speed
microprocessor?
– Data Cache: possibly multiported, multiple levels
Getting CPI < 1: Issuing
Multiple Instructions/Cycle
• Superscalar: varying no. instructions/cycle (1 to
8), scheduled by compiler or by HW (Tomasulo)
– IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
• (Very) Long Instruction Words (V)LIW:
fixed number of instructions (4-16) scheduled
by the compiler; put ops into wide templates
– Joint HP/Intel agreement in 1999/2000?
– Intel Architecture-64 (IA-64) 64-bit address
» Style: “Explicitly Parallel Instruction Computer (EPIC)”
– New SUN MAJIC Architecture: VLIW for Java
• Vector Processing:
Explicit coding of independent loops as
operations on large vectors of numbers
– Multimedia instructions being added to many processors
• Anticipated success led to use of
Instructions Per Clock cycle (IPC) vs. CPI
Review: Multiple Issue Challenges
• If more instructions issue at same time, greater
difficulty of decode and issue:
– Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide
if 1 or 2 instructions can issue
– Register file: need 2x reads and 1x writes/cycle
– Rename logic: must be able to rename same register multiple times
in one cycle! For instance, consider 4-way issue:
add r1, r2, r3   =>  add p11, p4, p7
sub r4, r1, r2   =>  sub p22, p11, p4
lw  r1, 4(r4)    =>  lw  p23, 4(p22)
add r5, r1, r2   =>  add p12, p23, p4
Imagine doing this transformation in a single cycle!
– Result buses: Need to complete multiple instructions/cycle
» So, need multiple buses with associated matching logic at every
reservation station.
» Or, need multiple forwarding paths
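The 4-way rename example above can be reproduced with a small table walk. This is an illustrative sketch, with the initial state chosen to match the slide (r2 => p4, r3 => p7, and a free list assumed to hand out p11, p22, p23, p12 in that order). The sequential loop makes the intra-group dependences explicit; real rename hardware must resolve the same chain combinationally within one cycle.

```python
# Sketch: renaming the slide's 4-instruction group, one table walk.
table = {"r2": "p4", "r3": "p7"}
free_list = ["p11", "p22", "p23", "p12"]   # assumed allocation order

group = [("add", "r1", ["r2", "r3"]),
         ("sub", "r4", ["r1", "r2"]),
         ("lw",  "r1", ["r4"]),
         ("add", "r5", ["r1", "r2"])]

renamed = []
for op, dest, srcs in group:
    phys_srcs = [table[s] for s in srcs]   # must see earlier renames in the group
    phys_dest = free_list.pop(0)
    table[dest] = phys_dest                # note: r1 is renamed twice in this group
    renamed.append((op, phys_dest, phys_srcs))

# renamed == [('add', 'p11', ['p4', 'p7']), ('sub', 'p22', ['p11', 'p4']),
#             ('lw',  'p23', ['p22']),      ('add', 'p12', ['p23', 'p4'])]
```

The second and fourth instructions read r1 under different mappings (p11 vs. p23), which is exactly why single-cycle group rename needs cross-checks between the group's own destinations and sources.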
First Pentium-4:
Willamette
Die Photo
[Die photo, with heat sink shown.]
Pentium-4 Pipeline
[Pipeline diagrams compared: Pentium (original 586), Pentium-II and III (original 686), and the Pentium-4.]
• Microprocessor Report: August 2000
– 20 Pipeline Stages!
– Drive stages => wire delay!
– Trace-Cache: caching paths through the code for quick decoding.
– Renaming: similar to Tomasulo architecture
– Branch and DATA prediction!
Where is the P4 “Decode”?
• On Hit (no decode!):
– Trace Cache holds µops
– Renamed/Scheduled
• Hit on complex ops:
– Trace Cache only holds pointer to decode ROM
• On Miss (decode):
– Must decode x86 instructions into µops
– Potentially slow!
Today: VLSI Microprocessors
• Pentium® III: 28M transistors / 733MHz-1GHz / 13-26W; L=0.25µm shrunk to L=0.18µm
• PowerPC 7400 (G4): 6.5M transistors / 450MHz / 8-10W; L=0.15µm
Today: VLSI Microprocessors, Process Shrinks
• Pentium® 4: 42M transistors / 1.3-1.8GHz / 49-55W; L=180nm
• Pentium® 4 “Northwood”: 55M transistors / 2-2.5GHz / 55W; L=130nm, Area=131mm2
• Pentium® 4 “Prescott”: 125M transistors / 2.8-3.4GHz / 115W; L=90nm, Area=112mm2
Today: VLSI Microprocessors
• PowerPC® 970 (G5): 58M transistors / 2GHz / 97W; L=130nm, Area=118mm2
(Image courtesy International Business Machines. All Rights Reserved.)
• Intel Itanium® 2: 410M transistors / 1.3GHz / 130W; L=130nm, Area=374mm2
(Image source: Intel Corporation, www.intel.com)
Dynamic Pipelining
[Figure (Fig. 6.49, old 6.61): an instruction fetch and decode unit issues in order to reservation stations in front of the functional units (Integer, Integer, Floating point, Load/Store), which execute out of order; a commit unit retires results in order.]
Dynamic Pipelining in the Pentium 4
[Figure (Fig. 6.50, mod old 6.62): the Pentium 4 out-of-order datapath.]
Source: “The Microarchitecture of the Pentium® 4 Processor”, Intel Technology Journal, First Quarter 2001
http://developer.intel.com/technology/itj/q12001/articles/art_2.htm
Pentium 3 & 4 Pipeline Stages
Drive stages - waiting for signal propagation
Example: Caches in the Pentium 4
• L1 Trace: decoded instructions (µops)
• L1 Data: 64-byte block size, write through
• L2 Cache: 128-byte block size, write back
Techniques to Increase ILP
• Forwarding
• Branch Prediction
• Superpipelining
• Multiple Issue: Superscalar, VLIW/EPIC
• Software manipulation
• Dynamic Pipeline Scheduling
• Speculative Execution
• Simultaneous Multithreading (SMT)
Simultaneous Multithreading (SMT)
• Key idea: extend processor to multiple threads of execution that execute concurrently
– Each thread has its own PC and register state
– Scheduling hardware shares functional units
– Appears to software as two “separate” processors
– Advantage: when one thread stalls, another may be ready
– Proposed for servers, where multiple threads are common
[Figure: per-thread state (Thread A, Thread B) feeding shared functional units; over time, issue slots are filled from whichever thread is ready.]
VLIW: Very Long Instruction Word
• Each “instruction” has explicit coding for multiple
operations
– In EPIC, grouping called a “packet”
– In Transmeta, grouping called a “molecule” (with “atoms” as ops)
• Tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long
instruction word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
» 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168
bits wide
– Need compiling technique that schedules across several branches
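The packing idea above can be sketched as a greedy Python scheduler. This is a hedged illustration only: the slot template (2 integer, 2 FP, 2 memory, 1 branch, per the example above), the `pack` function, and the in-order greedy policy are assumptions for the sketch; real VLIW compilers reorder code and use techniques such as trace scheduling.

```python
# Sketch: greedily packing independent ops into fixed-template long words.
def pack(ops):
    """ops: list of (unit, dest, srcs). Returns a list of long words,
    each a list of ops that are mutually independent and fit the template."""
    slots = {"int": 2, "fp": 2, "mem": 2, "br": 1}   # fields per long word
    words, word, written = [], [], set()
    avail = dict(slots)
    for unit, dest, srcs in ops:
        dependent = any(s in written for s in srcs)  # RAW on an op in this word?
        if word and (dependent or avail[unit] == 0):
            words.append(word)                       # close the current word
            word, written, avail = [], set(), dict(slots)
        word.append((unit, dest, srcs))
        avail[unit] -= 1
        written.add(dest)
    if word:
        words.append(word)
    return words
```

A dependent op (or a full field) forces a new long word, and any unfilled fields in the closed word are wasted, which is exactly the code-size cost discussed later.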
Intel/HP “Explicitly Parallel
Instruction Computer (EPIC)”
• 3 Instructions in 128 bit “groups”; field determines if
instructions dependent or independent
– Smaller code size than old VLIW, larger than x86/RISC
– Groups can be linked to show independence > 3 instr
• 128 integer registers + 128 floating point registers
– Not separate register files per functional unit as in old VLIW
• Hardware checks dependencies
(interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags)
=> 40% fewer mispredictions?
• IA-64: instruction set architecture; EPIC is type
– VLIW = EPIC?
• Itanium™ is name of first implementation (2000/2001?)
– Highly parallel and deeply pipelined hardware at 800 MHz
– 6-wide, 10-stage pipeline at 800 MHz on 0.18µ process
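The predicated-execution point above can be illustrated with a toy Python function. This is a hedged sketch, not IA-64 syntax: the comments model predicated moves, where both operations issue and each commits only if its guarding 1-bit predicate is true, eliminating the branch.

```python
# Sketch: if-conversion of "r = max(a, b)" via predication.
def predicated_max(a, b):
    p = a > b                   # cmp writes predicate p (and implicitly !p)
    # Both "moves" issue every time; hardware nullifies the one whose
    # predicate is false, modeled here as a select.
    r_if_p = a                  # (p)  mov r = a
    r_if_not_p = b              # (!p) mov r = b
    return r_if_p if p else r_if_not_p
```

With no branch to predict, there is nothing to mispredict, which is the source of the claimed reduction in mispredictions.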
Itanium™ Processor Silicon
(Copyright: Intel at Hotchips ’00)
[Die photo: core processor die with IA-32 control, FPU, IA-64 control, integer units, instruction fetch & decode, caches, TLB, and bus interface, plus 4 x 1MB L3 cache.]
Itanium™ Machine Characteristics
(Copyright: Intel at Hotchips ’00)
Frequency: 800 MHz
Transistor Count: 25.4M CPU; 295M L3
Process: 0.18u CMOS, 6 metal layers
Package: Organic Land Grid Array
Machine Width: 6 insts/clock (4 ALU/MM, 2 Ld/St, 2 FP, 3 Br)
Registers: 14-ported 128 GR & 128 FR; 64 Predicates
Speculation: 32-entry ALAT, Exception Deferral
Branch Prediction: Multilevel 4-stage Prediction Hierarchy
FP Compute Bandwidth: 3.2 GFlops (DP/EP); 6.4 GFlops (SP)
Memory -> FP Bandwidth: 4 DP (8 SP) operands/clock
Virtual Memory Support: 64-entry ITLB, 32/96 2-level DTLB, VHPT
L2/L1 Cache: Dual-ported 96K Unified & 16KD; 16KI
L2/L1 Latency: 6 / 2 clocks
L3 Cache: 4MB, 4-way s.a., BW of 12.8 GB/sec
System Bus: 2.1 GB/sec; 4-way Glueless MP; scalable to large (512+ proc) systems
Itanium™ EPIC Design Maximizes SW-HW Synergy
(Copyright: Intel at Hotchips ’00)
Architecture features programmed by the compiler:
– Branch Hints
– Explicit Parallelism
– Register Stack & Rotation
– Predication
– Data & Control Speculation
– Memory Hints
Micro-architecture features in hardware:
– Fetch: fast, simple 6-issue; instruction cache & branch predictors
– Issue: register handling; bypasses & dependencies
– Register files: 128 GR & 128 FR; register remap & stack engine
– Parallel resources: 4 integer + 4 MMX units, 2 FMACs (4 for SSE), 2 LD/ST units
– Memory subsystem: three levels of cache (L1, L2, L3)
– Control: 32-entry ALAT, speculation deferral management
10 Stage In-Order Core Pipeline
(Copyright: Intel at Hotchips ’00)
Stages: IPG (instruction pointer generation), FET (fetch), ROT (rotate), EXP (expand), REN (rename), WLD (word-line decode), REG (register read), EXE (execute), DET (exception detect), WRB (write-back)
• Front End (IPG, FET, ROT)
– Pre-fetch/fetch of up to 6 instructions/cycle
– Hierarchy of branch predictors
– Decoupling buffer
• Instruction Delivery (EXP, REN)
– Dispersal of up to 6 instructions on 9 ports
– Register remapping
– Register stack engine
• Operand Delivery (WLD, REG)
– Register read + bypasses
– Register scoreboard
– Predicated dependencies
• Execution (EXE, DET, WRB)
– 4 single-cycle ALUs, 2 ld/str
– Advanced load control
– Predicate delivery & branch
– Nat/Exception/Retirement
CS 252 Administrivia
• Exam:
Monday 10/13
Location: 306 Soda Hall
TIME: 5:30 - 8:30
• This info is on the Lecture page (has been)
• Meet at LaVal’s afterwards for Pizza and Beverages
• Assignment up tomorrow.
– Done in pairs. Put both names on papers.
– Make sure you have partners! Feel free to use mailing list for this.
Limits to Multi-Issue Machines
• Inherent limitations of ILP
– 1 branch in 5: How to keep a 5-way VLIW busy?
– Latencies of units: many operations must be scheduled
– Need about Pipeline Depth x No. Functional Units of independent
operations to keep all pipelines busy.
• Difficulties in building HW
– Easy: More instruction bandwidth
– Easy: Duplicate FUs to get parallel execution
– Hard: Increase ports to Register File (bandwidth)
» VLIW example needs 7 read and 3 write for Int. Reg.
& 5 read and 3 write for FP reg
– Harder: Increase ports to memory (bandwidth)
– Decoding Superscalar and impact on clock rate, pipeline depth?
Limits to Multi-Issue Machines
• Limitations specific to either Superscalar or VLIW
implementation
– Decode issue in Superscalar: how wide practical?
– VLIW code size: unroll loops + wasted fields in VLIW
» IA-64 compresses dependent instructions, but still larger
– VLIW lock step => 1 hazard & all instructions stall
» IA-64 not lock step? Dynamic pipeline?
– VLIW & binary compatibility => IA-64 promises binary compatibility
Limits to ILP
• Conflicting studies of amount
– Benchmarks (vectorized Fortran FP vs. integer C programs)
– Hardware sophistication
– Compiler sophistication
• How much ILP is available using existing
mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to
keep on processor performance curve?
– Intel MMX
– Motorola AltiVec
– Supersparc Multimedia ops, etc.
Limits to ILP
Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming–infinite virtual registers and
all WAW & WAR hazards are avoided
2. Branch prediction–perfect; no mispredictions
3. Jump prediction–all jumps perfectly predicted =>
machine with perfect speculation & an unbounded
buffer of instructions available
4. Memory-address alias analysis–addresses are
known & a store can be moved before a load
provided addresses not equal
1 cycle latency for all instructions; unlimited number
of instructions issued per clock cycle
Upper Limit to ILP: Ideal Machine
(Figure 4.38, page 319)
[Bar chart of instruction issues per cycle (IPC) per program on the ideal machine. Integer programs: 18-60; FP programs: 75-150.]
gcc: 54.8   espresso: 62.6   li: 17.9   fpppp: 75.2   doducd: 118.7   tomcatv: 150.1
More Realistic HW: Branch Impact
(Figure 4.40, Page 323)
Change from infinite window to a 2000-instruction window and maximum issue of 64 instructions per clock cycle.
[Bar chart of IPC per program (gcc, espresso, li, fpppp, doducd, tomcatv) under branch predictors: Perfect; Selective predictor (pick correlating or BHT); Standard 2-bit BHT (512 entries); Static (profile); None (no prediction). Integer programs: 6-12; FP programs: 15-45.]
More Realistic HW: Register Impact
(Figure 4.44, Page 328)
Change: 2000-instruction window, 64-instruction issue, 8K 2-level prediction.
[Bar chart of IPC per program (gcc, espresso, li, fpppp, doducd, tomcatv) with renaming registers varied: Infinite, 256, 128, 64, 32, None. Integer programs: 5-15; FP programs: 11-45.]
More Realistic HW: Alias Impact
(Figure 4.46, Page 330)
Change: 2000-instruction window, 64-instruction issue, 8K 2-level prediction, 256 renaming registers.
[Bar chart of IPC per program (gcc, espresso, li, fpppp, doducd, tomcatv) under alias-analysis models: Perfect; Global/stack perfect with inspection of heap conflicts; Inspection (by assembler); None. Integer programs: 4-9; FP programs: 4-45 (Fortran, no heap).]
Realistic HW for ‘9X: Window Impact
(Figure 4.48, Page 332)
Assumptions: perfect disambiguation (HW), 1K selective prediction, 16-entry return stack, 64 registers, issue as many instructions as the window allows.
[Bar chart of IPC per program (gcc, expresso, li, fpppp, doducd, tomcatv) with window sizes: Infinite, 256, 128, 64, 32, 16, 8, 4. Integer programs: 6-12; FP programs: 8-45.]
Brainiac vs. Speed Demon (1993)
• 8-scalar IBM Power-2 @ 71.5 MHz (5-stage pipe) vs. 2-scalar Alpha @ 200 MHz (7-stage pipe)
[Bar chart of SPECMarks (0-900) per benchmark: fpppp, nasa, hydro2d, su2cor, swm256, mdljsp2, ear, alvinn, ora, tomcatv, wave5, mdljdp2, doduc, spice, gcc, sc, compress, eqntott, li, espresso.]
Problems with scalar approach to
ILP extraction
• Limits to conventional exploitation of ILP:
– pipelined clock rate: at some point, each increase in clock rate has
corresponding CPI increase (branches, other hazards)
– branch prediction: branches get in the way of wide issue. They
are too unpredictable.
– instruction fetch and decode: at some point, it’s hard to fetch and
decode more instructions per clock cycle
– register renaming: rename logic gets really complicated for many
instructions
– cache hit rate: some long-running (scientific) programs have very
large data sets accessed with poor locality; others have continuous
data streams (multimedia) and hence poor locality
Cost-performance of simple vs. OOO
MIPS MPUs               R5000         R10000        10k/5k
Clock Rate              200 MHz       195 MHz       1.0x
On-Chip Caches          32K/32K       32K/32K       1.0x
Instructions/Cycle      1 (+ FP)      4             4.0x
Pipe stages             5             5-7           1.2x
Model                   In-order      Out-of-order  ---
Die Size (mm2)          84            298           3.5x
 - without cache, TLB   32            205           6.3x
Development (man yr.)   60            300           5.0x
SPECint_base95          5.7           8.8           1.6x
Alternative Model: Vector Processing
• Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
[Figure: SCALAR (1 operation): add r3, r1, r2 adds registers r1 and r2 into r3. VECTOR (N operations): add.vv v3, v1, v2 adds vector registers v1 and v2 elementwise into v3, with the element count set by the vector length.]
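The scalar-vs-vector contrast can be sketched in plain Python (illustrative only; the two functions model instruction streams, not hardware). The point is that the scalar loop issues one add instruction per element, while a single vector instruction specifies all N element operations at once.

```python
# Sketch: scalar loop vs. one vector instruction for c = a + b.
def scalar_add(a, b):
    c = []
    for i in range(len(a)):        # one "add" fetched/decoded per element
        c.append(a[i] + b[i])
    return c

def vector_add(a, b):
    # Models add.vv v3, v1, v2: a single instruction describing N adds.
    return [x + y for x, y in zip(a, b)]
```

Both produce the same result; the vector form simply amortizes fetch, decode, and branch overhead over the whole array.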
Properties of Vector Processors
• Each result independent of previous result
=> long pipeline, compiler ensures no dependencies
=> high clock rate
• Vector instructions access memory with known pattern
=> highly interleaved memory
=> amortize memory latency over ~64 elements
=> no (data) caches required! (Do use instruction cache)
• Reduces branches and branch problems in pipelines
• Single vector instruction implies lots of work (~ a loop)
=> fewer instruction fetches
Operation & Instruction Count: RISC v. Vector Processor
(from F. Quintana, U. Barcelona)
Spec92fp     Operations (Millions)       Instructions (M)
Program      RISC   Vector   R/V         RISC   Vector   R/V
swim256      115    95       1.1x        115    0.8      142x
hydro2d      58     40       1.4x        58     0.8      71x
nasa7        69     41       1.7x        69     2.2      31x
su2cor       51     35       1.4x        51     1.8      29x
tomcatv      15     10       1.4x        15     1.3      11x
wave5        27     25       1.1x        27     7.2      4x
mdljdp2      32     52       0.6x        32     15.8     2x
Vector reduces ops by 1.2X, instructions by 20X
Styles of Vector Architectures
• memory-memory vector processors: all vector operations are memory to memory
• vector-register processors: all vector operations between vector registers (except load and store)
– Vector equivalent of load-store architectures
– Includes all vector machines since late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC
– We assume vector-register for rest of lectures
Summary
• Basic techniques we have discussed for ILP apply to a
number of modern processors
• Pentium-IV
– On-the-fly compilation of x86 => µops during cache miss
– Trace Cache (TC) keeps trace of compiled code
– Rest of pipeline not so unfamiliar
• IA64
– VLIW with a bit of dynamic instruction grouping
• Studies of parallelism:
– Lots of parallelism available when infinite resources
– Much less when restrict: window size, rename registers, branch
prediction accuracy, etc…