CS 162 Computer Architecture
Lecture 2: Introduction & Pipelining
Instructor: L.N. Bhuyan
www.cs.ucr.edu/~bhuyan/cs162
1999 ©UCB
Review of Last Class
°MIPS Datapath
°Introduction to Pipelining
°Introduction to Instruction Level Parallelism (ILP)
°Introduction to VLIW
What is Multiprocessing?
°Parallelism at the instruction level is limited by data dependences => speedup is limited!
°Program-level parallelism is abundant, e.g., loop-level parallelism in a loop such as DO I = 1, 1000. How about employing multiple processors to execute the loop iterations? => Parallel processing, or multiprocessing
°With a billion transistors on a chip, we can put several CPUs in one chip => chip multiprocessor
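As a sketch of the loop-level parallelism idea above (a hypothetical Python example, not from the slides): the iterations of a DO I = 1, 1000-style loop carry no dependences, so the iteration space can be split across workers and the partial results combined.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    # Each worker executes one chunk of the loop body (here: sum of squares).
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def parallel_loop(n=1000, workers=4):
    # Split the iteration space 1..n into one contiguous chunk per worker,
    # mimicking a multiprocessor dividing up a parallel loop.
    step = n // workers
    chunks = [(1 + k * step, n + 1 if k == workers - 1 else 1 + (k + 1) * step)
              for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```

On a chip multiprocessor each chunk would run on its own core; with CPython threads this only illustrates the decomposition, not a real speedup.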
Memory Latency Problem
Even if we increase CPU power, memory is the real bottleneck. Techniques to alleviate the memory latency problem:
1. Memory hierarchy – program locality, cache memory, multiple levels, pages and context switching
2. Prefetching – get the instruction/data before the CPU needs it. Good for instructions because of sequential locality, so all modern processors use prefetch buffers for instructions. What to do with data?
3. Multithreading – can the CPU jump to another program while accessing memory? It's like multiprogramming!
Hardware Multithreading
° We need a hardware multithreading technique because switching between threads in software is very time-consuming (why?), so it is not suitable for hiding main memory (as opposed to I/O) access latency. Ex: multitasking
° Provide multiple PCs and register sets on the CPU so that a thread switch can occur without having to store the register contents in main memory (on the stack, as is done for context switching).
° Several threads reside in the CPU simultaneously, and execution switches between the threads on main memory accesses.
° How about both multiprocessing and multithreading on a chip? => Network processor
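A toy model of the idea above (my own sketch, not the slides' design): each hardware context keeps a private PC and register set, so a switch on a memory access costs nothing in save/restore traffic.

```python
# Assumed toy model: a core holds several hardware contexts and switches to
# the next ready thread whenever the current one issues a memory access.

class Context:
    def __init__(self, tid):
        self.tid = tid
        self.pc = 0            # private program counter
        self.regs = [0] * 32   # private register set: no save/restore needed

def run(contexts, program, steps):
    """program[pc] is 'alu' or 'mem'; a 'mem' op triggers a thread switch."""
    trace = []
    cur = 0
    for _ in range(steps):
        ctx = contexts[cur]
        op = program[ctx.pc % len(program)]
        trace.append((ctx.tid, op))
        ctx.pc += 1
        if op == "mem":        # long-latency access: switch instead of stalling
            cur = (cur + 1) % len(contexts)
    return trace

trace = run([Context(0), Context(1)], ["alu", "mem", "alu"], 6)
```

The trace interleaves the two threads at every memory access, which is exactly the coarse-grained switching the slide describes.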
Architectural Comparisons (cont.)
[Figure: issue-slot diagrams over time (one processor cycle per row) comparing superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; slots are colored by Thread 1–5 or marked as idle slots.]
Intel IXP1200 Network Processor
° Initial component of the Intel Internet Exchange Architecture (IXA)
° Chip multiprocessing: 6 microengines and a StrongARM core
° Each microengine is a 5-stage pipeline – no ILP, 4-way multithreaded
° 166 MHz fundamental clock rate
° Intel claims 2.5 Mpps IP routing for 64-byte packets
° Already the most widely used NPU – or, more accurately, the most widely admitted use
IXP1200 Chip Layout
° StrongARM processing core
° Microengines introduce a new ISA
° I/O: PCI, SDRAM, SRAM
° IX bus: a PCI-like packet bus
° On-chip FIFOs: 16 entries, 64 B each
IXP1200 Microengine
° 4 hardware contexts
° Single-issue processor
° Explicit optional context switch on SRAM access
° Registers: all single-ported; separate GPRs; 1536 registers total
° 32-bit ALU: can access GPR or XFER registers
° Standard 5-stage pipe
° 4 KB SRAM instruction store – not a cache!
Intel IXP2400 Microengine (New)
° XScale core replaces StrongARM
° 1.4 GHz target in 0.13-micron
° Nearest-neighbor routes added between microengines
° Hardware to accelerate CRC operations and random-number generation
° 16-entry CAM
MIPS Pipeline
Chapter 6 CS 161 Text
Review: Single-cycle Datapath for MIPS
[Figure: single-cycle datapath – Stage 1: PC and Instruction Memory (Imem); Stage 2: Registers; Stage 3: ALU; Stage 4: Data Memory (Dmem); Stage 5: write back to Registers.]
°Use the datapath figure to represent the pipeline: IFtch Dcd Exec Mem WB (IM – Reg – ALU – DM – Reg)
Stages of Execution in Pipelined MIPS
5-stage instruction pipeline:
1) I-fetch: Fetch Instruction, Increment PC
2) Decode: Decode Instruction, Read Registers
3) Execute:
   Mem-reference: Calculate Address
   R-format: Perform ALU Operation
4) Memory:
   Load: Read Data from Data Memory
   Store: Write Data to Data Memory
5) Write Back: Write Data to Register
Pipelined Execution Representation
Time →
IFtch Dcd Exec Mem WB
      IFtch Dcd Exec Mem WB
            IFtch Dcd Exec Mem WB
                  IFtch Dcd Exec Mem WB
                        IFtch Dcd Exec Mem WB
Program Flow ↓
°To simplify the pipeline, every instruction takes the same number of steps, called stages
°One clock cycle per stage
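The staircase above can be generated mechanically; a small sketch (the 6-character column width is an arbitrary choice of mine):

```python
STAGES = ["IFtch", "Dcd", "Exec", "Mem", "WB"]

def pipeline_diagram(n_instructions):
    """One row per instruction; each later instruction starts one cycle later."""
    rows = []
    for i in range(n_instructions):
        pad = " " * (6 * i)  # shift each instruction right by one cycle column
        rows.append(pad + " ".join(f"{s:<5}" for s in STAGES).rstrip())
    return "\n".join(rows)

print(pipeline_diagram(5))
```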
Datapath Timing: Single-cycle vs. Pipelined
°Assume the following delays for major functional units:
• 2 ns for a memory access or ALU operation
• 1 ns for a register file read or write
°Total datapath delay for single-cycle:

Insn Type | Insn Fetch | Reg Read | ALU Oper | Data Access | Reg Write | Total Time
beq       |    2ns     |   1ns    |   2ns    |      –      |     –     |    5ns
R-form    |    2ns     |   1ns    |   2ns    |      –      |    1ns    |    6ns
sw        |    2ns     |   1ns    |   2ns    |     2ns     |     –     |    7ns
lw        |    2ns     |   1ns    |   2ns    |     2ns     |    1ns    |    8ns

°In the pipelined machine, each stage = length of longest delay = 2 ns; 5 stages = 10 ns
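The table above can be recomputed mechanically; a sketch using the slide's delays (the stage names are my own labels):

```python
# Delays from the slide: 2 ns for a memory access or ALU op, 1 ns for a
# register-file read or write.
DELAYS = {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 2, "reg_write": 1}

# Which functional units each instruction class actually uses.
STAGES_USED = {
    "beq":    ["fetch", "reg_read", "alu"],
    "R-form": ["fetch", "reg_read", "alu", "reg_write"],
    "sw":     ["fetch", "reg_read", "alu", "mem"],
    "lw":     ["fetch", "reg_read", "alu", "mem", "reg_write"],
}

# Single-cycle: total delay is the sum of the units the instruction touches.
single_cycle = {insn: sum(DELAYS[s] for s in stages)
                for insn, stages in STAGES_USED.items()}

# Pipelined: every stage is stretched to the slowest unit's delay.
stage_time = max(DELAYS.values())   # 2 ns
pipelined_latency = stage_time * 5  # 10 ns per instruction, one finishing per 2 ns
```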
Pipelining Lessons
° Pipelining doesn't help the latency (execution time) of a single task; it helps the throughput of the entire workload
° Multiple tasks operate simultaneously using different resources
° Potential speedup = number of pipe stages
° Time to "fill" the pipeline and time to "drain" it reduce the speedup
° Pipeline rate is limited by the slowest pipeline stage
° Unbalanced lengths of pipe stages also reduce the speedup
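The fill/drain point can be made quantitative; a sketch under an ideal-pipeline assumption (k balanced stages, no hazards):

```python
def speedup(n, k=5):
    # Unpipelined time: n instructions x k cycles each.
    # Pipelined time: k cycles to fill, then one completion per cycle,
    # i.e. k + (n - 1) cycles total.
    return (n * k) / (k + n - 1)
```

For one instruction there is no gain at all, and the speedup only approaches the stage count k as n grows large, which is why fill and drain "reduce speedup".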
Single Cycle Datapath (From Ch 5)
[Figure: single-cycle datapath – the PC feeds the Imem read address and instruction bits 31:0 are read out; fields 25:21 and 20:16 select the read registers, and the RegDst mux picks 20:16 or 15:11 as the write register; bits 15:0 are sign-extended, and the ALUsrc mux chooses between Read data2 and the extended offset as the second ALU input; the ALU (controlled by ALUOp/ALUcon) produces Zero and the Dmem address; Dmem is controlled by MemRead/MemWrite, and the MemToReg mux selects the register write data; one adder computes PC+4, a second adder computes the branch target (offset << 2), and the PCSrc mux selects between them. Control signals: RegDst, RegWrite, ALUsrc, ALUOp, MemRead, MemWrite, MemToReg, PCSrc.]
Required Changes to Datapath
°Introduce registers to separate the 5 stages by putting IF/ID, ID/EX, EX/MEM, and MEM/WB registers in the datapath.
°The next PC value is computed in the 3rd stage, but we need to bring in the next instruction in the very next cycle – move the PCSrc mux to the 1st stage. The PC is incremented unless there is a new branch address.
°The branch address is computed in the 3rd stage. With the pipeline, the PC value has changed by then! We must carry the PC value along with the instruction. Width of the IF/ID register = (IR) + (PC) = 64 bits.
Changes to Datapath Contd.
°For the lw instruction, we need the write-register address at stage 5. But the IR is by then occupied by another instruction! So we must carry the IR destination field along as we move through the stages. See the connection in the figure.
Length of the ID/EX register =
(Reg1: 32) + (Reg2: 32) + (offset: 32) + (PC: 32) + (destination register: 5)
= 133 bits
Assignment: What are the lengths of the EX/MEM and MEM/WB registers?
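As a cross-check on the width arithmetic (a sketch: the field breakdowns for EX/MEM and MEM/WB are one plausible accounting that matches the totals given on the pipelined-datapath slide, not an official answer key):

```python
# (field name, width in bits) for each pipeline register.
WIDTHS = {
    "IF/ID":  [("IR", 32), ("PC", 32)],
    "ID/EX":  [("Reg1", 32), ("Reg2", 32), ("offset", 32),
               ("PC", 32), ("dest reg", 5)],
    # Assumed breakdowns below; only the totals appear in the slides.
    "EX/MEM": [("branch target", 32), ("Zero", 1), ("ALU result", 32),
               ("Reg2", 32), ("dest reg", 5)],
    "MEM/WB": [("mem data", 32), ("ALU result", 32), ("dest reg", 5)],
}

totals = {name: sum(bits for _, bits in fields)
          for name, fields in WIDTHS.items()}
```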
Pipelined Datapath (with Pipeline Regs) (6.2)
[Figure: the datapath split into Fetch, Decode, Execute, Memory, and Write Back stages by the IF/ID (64 bits), ID/EX (133 bits), EX/MEM (102 bits), and MEM/WB (69 bits) pipeline registers. The PC mux, PC+4 adder, and Imem sit in Fetch; the register file and sign extend (16 → 32) in Decode; the shift-left-2 branch-target adder, ALU, and ALUsrc mux in Execute; Dmem in Memory; and the MemToReg mux in Write Back. The 5-bit write-register number is carried forward through the stages to MEM/WB.]