slides in ppt

Download Report

Transcript slides in ppt

CS 161Computer Architecture
Introduction to Advanced Architecturs
Lecture 13
Instructor: L.N. Bhuyan
www.cs.ucr.edu/~bhuyan
Adapted from notes by Dave Patterson
(http.cs.berkeley.edu/~patterson)
1
1999 ©UCB
Stages of Execution in Pipelined MIPS
5 stage instruction pipeline
1) I-fetch: Fetch Instruction, Increment PC
2) Decode: Instruction, Read Registers
3) Execute:
Mem-reference: Calculate Address
R-format: Perform ALU Operation
4) Memory:
Load: Read Data from Data Memory
Store: Write Data to Data Memory
5) Write Back: Write Data to Register
2
1999 ©UCB
Pipelined Execution Representation
Time
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
Program Flow
°To simplify pipeline, every instruction
takes same number of steps, called
stages
3
°One clock cycle per stage
1999 ©UCB
Review: Single-cycle Datapath for MIPS
Stage 5
PC
Instruction
Memory
(Imem)
Stage 1
Registers
Stage 2
ALU
Stage 3
Data
Memory
(Dmem)
Stage 4
°Use datapath figure to represent pipeline
IFtch Dcd Exec Mem WB
4
Reg
ALU
IM
DM
Reg
1999 ©UCB
Graphical Pipeline Representation
Time (clock cycles)
ALU
I
n
IM
DM
Reg
Reg
s Load
IM
DM
Reg
Reg
t Add
r.
IM
DM
Reg
Reg
Store
O
IM
DM
Reg
Reg
Sub
r
IM
DM
Reg
Reg
d Or
e
r
(right half highlighted means read, left half write)
ALU
ALU
ALU
ALU
5
1999 ©UCB
Required Changes to Datapath
° Introduce registers to separate 5 stages by putting
IF/ID, ID/EX, EX/MEM, and MEM/WB registers in the
datapath.
° Next PC value is computed in the 3rd step, but we
need to bring in next instn in the next cycle – Move
PCSrc Mux to 1st stage
° Branch address is computed in 3rd stage. With
pipeline, the PC value has changed! Must carry the
PC value along with instn. Width of IF/ID register =
(IR)+(PC) = 64 bits.
° For lw instn, we need write register address at stage
5. But the IR is now occupied by another instn! So,
we must carry the IR destination field as we move
along the stages. See connection in fig. Length od
ID/EX register = (Reg1)+(Reg2)+(offset)+(PC)+ destn
= 133 bits
6
1999 ©UCB
Pipelined Datapath (with Pipeline Regs)(6.2)
Fetch
Decode
Execute
Memory
Write
Back
0
M
u
x
1
IF/ID
EX/MEM
ID/EX
MEM/WB
Add
4
Add
Add
result
PC
Ins truction
Shift
left 2
Address
Read
register 1
Read
data 1
Read
register 2
Read
data 2
Write
register
Imem
Write
data
0
M
u
x
1
Regs
Zero
ALU ALU
result
Address
Write
data
16
Sign
extend
32
Read
data
1
M
u
x
0
Dmem
5
7
64 bits
133 bits
102 bits
69 bits
1999 ©UCB
Pipelined Control (6.3)
• Start with single-cycle controller
• Group control lines by pipeline stage needed
• Extend pipeline registers with control bits
WB
Instruction
Control
Mem
WB
EX
Mem
RegDst
ALUop
ALUSrc
IF/ID
8
ID/EX
WB
Branch
MemRead
MemWrite
EX/MEM
MemToReg
RegWrite
MEM/WB
1999 ©UCB
Problems for Pipelining
°Hazards prevent next instruction from
executing during its designated clock
cycle, limiting speedup
• Structural hazards: HW cannot support
this combination of instructions (single
person to fold and put clothes away)
• Control hazards: conditional branches &
other instructions may stall the pipeline
delaying later instructions (must check
detergent level before washing next load)
• Data hazards: Instruction depends on
result of prior instruction still in the
pipeline (matching socks in later load)
9
1999 ©UCB
MIPS R4000 pipeline
10
1999 ©UCB
Advanced Architectural Concepts
°Can we achieve CPI < 1?
(i.e., can we have IPC > 1?)
State-of-the-Art Microprocessor
°“Superscalar” execution or Instruction
Level Parallelism (ILP)
“Deeper Pipeline => Dynamic Branch
Prediction => Speculation => Recovery
° “Out-of-order” Execution => Instruction
Window and Prefetch => Reorder Buffers
°“VLIW” Ex: Intel/HP Titanium
11
1999 ©UCB
Instruction Level Parallelism (ILP) IPC > 1
Time
IFtch Dcd Exec Mem WB
IFetchDcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
Program Flow
ILP = 2
EX: Pentium, SPARC, MIPS 10000, IBM Power PC
12
1999 ©UCB
HW Schemes: Instruction Parallelism
°Key idea: Allow instructions behind
stall to proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
• Enables out-of-order execution => out-of-order
completion
• ID stage checks for hazards. If no hazards, issue
the instn for execution. Scoreboard dates to
CDC 6600 in 1963
13
1999 ©UCB
How ILP Works
° Issuing multiple instructions per cycle would
require fetching multiple instructions from
memory per cycle => called Superscalar degree
or Issue width
° To find independent instructions, we must have
a big pool of instructions to choose from, called
instruction buffer (IB). As IB length increases,
complexity of decoder (control) increases that
increases the datapath cycle time
° Prefetch instructions sequentially by an IFU
that operates independently from datapath
control. Fetch instruction (PC)+L, where L is the
IB size or as directed by the branch predictor.
14
1999 ©UCB
Microarchitecture of an ILP-based CPU (Power PC)
15
1999 ©UCB
16
1999 ©UCB
Very Large Instruction Word (VLIW) IPC > 1
Time
IFtch Dcd Exec Mem WB
Exec
IFtch Dcd Exec Mem WB
Exec
IFtch Dcd Exec Mem WB
Exec
Program Flow
17
EX: Itanium
1999 ©UCB
TriMedia TM32 Architecture
64-bit
memory
bus
32-bit
peripheral
bus
multi-port 128 words x 32 bits register file
bypass network
data
cache
16KB
instruction
cache
32 KB
FU
FU
FU
FU
FU
PC
VLIW instruction decode and launch
Compressed code in the Instruction Cache
18
1999 ©UCB
What is Multiprocessing
°Parallelism at the Instruction Level is
limited because of data dependency
=> Speed up is limited!!
°Abundant availability of program level
parallelism, like Do I = 1000, Loop
Level Parallelism. How about
employing multiple processors to
execute the loops => Parallel
processing or Multiprocessing
°With billion transistors on a chip, we
can put a few CPUs in one chip =>
Chip multiprocessor
19
1999 ©UCB
Hardware Multithreading
° We need to develop a hardware multithreading
technique because switching between threads in
software is very time-consuming (Why?), so not
suitable for main memory (instead of I/O) access,
Ex: Multitasking
° Develop multiple PCs and register sets on the CPU
so that thread switching can occur without having
to store the register contents in main memory
(stack, like it is done for context switching).
° Several threads reside in the CPU simultaneously,
and execution switches between the threads on
main memory access.
° How about both multiprocessors and
multithreading on a chip? => Network Processor
20
1999 ©UCB
Hardware Multithreading
° How can we guarantee no dependencies between instructions in a
pipeline?
• One way is to interleave execution of instructions from
different program threads on same pipeline
Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
T1: LW r5, 12(r1)
21
1999 ©UCB
Architectural Comparisons (cont.)
Fine-Grained Coarse-Grained
Multiprocessing
Time (processor cycle)
Superscalar
Simultaneous
Multithreading
22
Thread 1
Thread 3
Thread 5
Thread 2
Thread 4
Idle slot
1999 ©UCB
Intel IXP2400 Network Processor

XScale core
replaces
StrongARM

1.4 GHz target in
0.13-micron

Nearest neighbor
routes added
between
microengines

Hardware to
accelerate CRC
operations and
Random number
generation

16 entry CAM
23
1999 ©UCB
IBM Cell Processor
SPU: Synergetic Processor Unit
24
1999 ©UCB
Chip Multiprocessors
25
1999 ©UCB