Pipelining

1 November
• Exam Postmortem
• 11 classes to go!
• Read Sections 7.1 and 7.2
• You read 6.1 for this time, right?
• Pipelining, then on to the memory hierarchy
11/1/2005
Comp 120 Fall 2005
1
[6] Give at least 3 different formulas for execution time using the
following variables. Each equation should be minimal; that is, it
should not contain any variable that is not needed. CPI, R = clock
rate, T = cycle time, M = MIPS, I = number of instructions in the
program, C = number of cycles in the program.
CPI*I*T, CPI*I/R, C*T, C/R, I/(MIPS*10^6)
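Since C = CPI*I, T = 1/R, and MIPS = I/(execution time * 10^6), all five formulas are algebraically identical; a quick numerical check (the numbers below are made up purely for illustration):

```python
# All five minimal execution-time formulas evaluated on one example.
CPI = 2.0            # average cycles per instruction (illustrative)
I = 1_000_000        # instructions in the program
R = 1e9              # clock rate in Hz
T = 1 / R            # cycle time in seconds
C = CPI * I          # total cycles in the program
time = CPI * I * T   # reference execution time: 2 ms
MIPS = I / (time * 1e6)

candidates = [CPI * I * T, CPI * I / R, C * T, C / R, I / (MIPS * 1e6)]
print(candidates)    # five equal values
```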
[12] Soon we will all have computers with multiple processors. This will
allow us to run some kinds of programs faster. Suppose your
computer has N processors and that a program you want to make
faster has some fraction F of its computation that is inherently
sequential (either all the processors have to do the same work thus
duplicating effort, or the work has to be done by a single processor
while the other processors are idle). What speedup can you expect
to achieve on your new computer?
1/(F + (1-F)/N)
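The expression above is Amdahl's law. A minimal sketch (the F and N values here are illustrative, not from the exam):

```python
def speedup(F, N):
    """Amdahl's law: F = inherently sequential fraction, N = processors."""
    return 1 / (F + (1 - F) / N)

# Even a small sequential fraction caps the gain: with F = 0.1 the
# speedup can never exceed 10, no matter how large N gets.
print(speedup(0.1, 4))     # ~3.08x
print(speedup(0.1, 100))   # ~9.17x
```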
[10] Show the minimum sequence of instructions required to sum up
an array of 5 integers, stored as consecutive words in memory,
given the address of the first word in $a0, leaving the sum in
register $v0.
5 loads + 4 adds = 9 instructions
[10] Which of the following cannot be EXACTLY represented by an
IEEE double-precision floating-point number? (a) 0, (b) 10.2, (c)
1.625, (d) 11.5
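The answer is (b): 10.2 = 51/5 has a denominator that is not a power of two, so its binary expansion never terminates. One way to check, using exact rational arithmetic:

```python
from fractions import Fraction

# A decimal value is exactly representable as an IEEE double iff rounding
# it to the nearest double and converting back reproduces the same
# rational number.
for s in ["0", "10.2", "1.625", "11.5"]:
    is_exact = Fraction(s) == Fraction(float(Fraction(s)))
    print(s, "exact" if is_exact else "NOT exact")
# 0, 1.625 (13/8), and 11.5 (23/2) are exact; 10.2 (51/5) is not.
```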
[12] All of the following equations are true for the idealized numbers
you studied in algebra. Which ones are true for IEEE floating-point
numbers? Assume that all of the numbers are well within the range
of the largest and smallest possible numbers (that is, underflow and
overflow are not a problem).
– A+B = A if and only if B = 0
– (A+B)+C = A+(B+C)
– A+B = B+A
– A*(B+C) = A*B + A*C
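Only commutativity (A+B = B+A) always holds for IEEE doubles; the other three can fail once rounding is involved. A demonstration with values chosen to force rounding:

```python
A, B, C = 1e16, -1e16, 1.0

print((A + B) + C)     # 1.0
print(A + (B + C))     # 0.0: addition is not associative
print(A + C == A)      # True even though C != 0, so A+B = A does not imply B = 0
print(A + B == B + A)  # True: commutativity does hold
# Distributivity fails for similar reasons: rounding happens at
# different points on the two sides of A*(B+C) = A*B + A*C.
```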
[10] Consider the characteristics of two machines M1 and M2. M1 has a clock rate of 1GHz. M2
has a clock rate of 2GHz. There are 4 classes of instructions (A-D) in the instruction set. In a set
of benchmark programs, the frequency of each class of instructions is shown in the table.
Instruction Class   Frequency   M1 CPI   M2 CPI
A                   40%         2        6
B                   25%         3        6
C                   20%         3        6
D                   15%         5        8
What is the average CPI for each machine?
CPI1 = 2*.4+3*.25+3*.2+5*.15 = 2.9;
CPI2 = 6*.4+6*.25+6*.2+8*.15 = 6.3.
Which machine is faster? M1
By what factor is it faster? (1000*6.3)/(2000*2.9) = 1.086.
What is the cycle time of each machine? M1=1ns, M2=0.5ns.
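The arithmetic can be checked directly from the table:

```python
# Frequencies and CPIs for instruction classes A-D, from the table.
freq   = [0.40, 0.25, 0.20, 0.15]
cpi_m1 = [2, 3, 3, 5]
cpi_m2 = [6, 6, 6, 8]

avg1 = sum(f * c for f, c in zip(freq, cpi_m1))   # 2.9
avg2 = sum(f * c for f, c in zip(freq, cpi_m2))   # 6.3

# Average time per instruction = CPI / clock rate; the ratio of M2's
# time to M1's time is the factor by which M1 is faster.
factor = (avg2 / 2e9) / (avg1 / 1e9)              # ~1.086
print(avg1, avg2, factor)
```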
[8] Draw a block diagram showing how to implement 2’s complement
subtraction of 3 bit numbers (A minus B) using 1-bit full-adder
blocks with inputs A, B, and Cin and outputs Sum and Cout and any
other AND, OR, or INVERT blocks you may need.
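The question asks for a diagram, which can't be drawn here; as a stand-in, here is a behavioral sketch of the same circuit (the function names are hypothetical): invert each bit of B with INVERT blocks and feed a 1 into the first carry-in, which computes A + ~B + 1 = A - B.

```python
def full_adder(a, b, cin):
    """One 1-bit full-adder block: returns (Sum, Cout)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def subtract3(a_bits, b_bits):
    """A - B for 3-bit two's-complement values, bits given LSB first."""
    cin = 1                                  # the +1 of the two's complement
    out = []
    for a, b in zip(a_bits, b_bits):
        s, cin = full_adder(a, 1 - b, cin)   # 1 - b models the INVERT block
        out.append(s)
    return tuple(out)

print(subtract3((1, 1, 0), (1, 0, 0)))  # 3 - 1 = 2 -> (0, 1, 0), LSB first
```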
[10] Assume that multiply instructions take 12 cycles and account for
10% of the instructions in a typical program and that the other 90%
of the instructions require an average of 4 cycles for each
instruction. What percentage of time does the CPU spend doing
multiplication?
12*0.1/(12*0.1+4*0.9)=25%
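The same calculation in code, with the weights from the problem statement:

```python
mul_cycles   = 12 * 0.10   # cycles contributed by multiplies, per instruction
other_cycles = 4 * 0.90    # cycles contributed by all other instructions
fraction = mul_cycles / (mul_cycles + other_cycles)
print(fraction)            # ~0.25, i.e. 25% of the time
```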
[10] In a certain set of benchmark programs about every 4th instruction
is a load instruction that fetches data from main memory. The time
required for a load is 50ns. The CPI for all other instructions is 4.
Assuming the ISAs are the same, how much faster will the
benchmarks run with a 1GHz clock than with a 500MHz clock?
For 1GHz: 50 + 12 = 62 cycles and 62 nanoseconds
For 500MHz: 25 + 12 = 37 cycles and 74 nanoseconds
Speedup is 74 / 62 = 1.19
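The same reasoning as a small function (grouping by 4 instructions, 1 load plus 3 others, is just a convenient unit):

```python
def ns_per_4_instructions(clock_hz):
    """1 load (fixed 50 ns) + 3 other instructions at CPI 4."""
    cycle_ns = 1e9 / clock_hz
    load_cycles = 50 / cycle_ns     # the 50 ns load, expressed in cycles
    other_cycles = 3 * 4
    return (load_cycles + other_cycles) * cycle_ns

fast = ns_per_4_instructions(1e9)    # 62 ns
slow = ns_per_4_instructions(500e6)  # 74 ns
print(slow / fast)                   # ~1.19
```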
[12] Explain why floating point addition and subtraction may result in
loss of precision but multiplication and division do not.
Before we can add we have to adjust the binary point so that the
exponents are equal. This can cause many bits to be lost if the
exponents are very different.
Multiplication doesn’t have this problem. All the bits play equally.
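A two-line demonstration of the alignment loss: once the exponents differ by more than the 53-bit significand width, the smaller operand is shifted out entirely.

```python
big = 2.0 ** 53                # ulp(big) == 2, so 1.0 lies below the last bit

print(big + 1.0 == big)        # True: the 1.0 vanished during exponent alignment
print(big * 2.0 == 2.0 ** 54)  # True: multiplying by 2 is exact
```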
Results
[Histogram of exam scores over bins <60, 61-70, 71-80, 81-90, 91-100;
counts range from 0 to 7]
Average: 72.5
Median: 81
Chapter 6 Pipelining
Doing Laundry
[Figure: two timelines from 6 PM to 2 AM for four laundry loads A-D, done
one load at a time versus with the stages overlapped (pipelined).]
Pipelining
• Improve performance by increasing instruction throughput
Program execution order (in instructions):
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
[Figure: without pipelining, each lw takes 8 ns to pass through the
Instruction fetch, Reg, ALU, Data access, and Reg stages, and the next lw
cannot start until the previous one finishes. Pipelined, the stages
overlap and a new lw starts every 2 ns.]
Ideal speedup is number of stages in the pipeline. Do we achieve this?
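No. Even with no hazards, a rough model (assuming the 5 stages and 2 ns per stage shown in the figure) shows that pipeline fill time keeps the speedup below the stage count; hazards push it lower still.

```python
def pipeline_speedup(n_instructions, n_stages=5, stage_ns=2.0):
    """Ideal, hazard-free model: every stage takes stage_ns."""
    unpipelined = n_instructions * n_stages * stage_ns
    pipelined = (n_stages - 1 + n_instructions) * stage_ns  # fill + 1/cycle
    return unpipelined / pipelined

print(pipeline_speedup(3))      # ~2.14x for only 3 instructions
print(pipeline_speedup(1000))   # ~4.98x, approaching the ideal 5x
```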
Pipeline control
• We have 5 stages. What needs to be controlled in each stage?
– Instruction Fetch and PC Increment
– Instruction Decode / Register Fetch
– Execution
– Memory Stage
– Register Write Back
• How would control be handled in an automobile plant?
– a fancy control center telling everyone what to do?
– should we use a finite state machine?
Pipelining
• What makes it easy?
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
• What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
• Individual instructions still take the same number of cycles
• But we’ve improved the throughput by increasing the number of
simultaneously executing instructions
Structural Hazards
[Figure: four instructions, each entering the pipeline one cycle apart and
passing through Inst Fetch, Reg Read, ALU, Data Access, and Reg Write.
With a single memory, a later instruction's Inst Fetch and an earlier
instruction's Data Access fall in the same cycle and conflict.]
Data Hazards
• Problem with starting next instruction before first is finished
– dependencies that “go backward in time” are data hazards
Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
[Figure: pipeline diagram over clock cycles CC 1-CC 9. The value of
register $2 is 10 through CC 4 and changes to -20 during CC 5, so the
instructions that read $2 before CC 5 would see the stale value; these are
the dependencies that go backward in time.]
Software Solution
• Have compiler guarantee no hazards
• Where do we insert the “nops” ?
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
• Problem: this really slows us down!
Forwarding
• Use temporary results; don’t wait for them to be written
– register file forwarding to handle read/write to same register
– ALU forwarding
[Figure: the same five-instruction sequence (sub, and, or, add, sw) with
forwarding. The result of the sub (-20) is available in the EX/MEM and
MEM/WB pipeline registers one and two cycles after it is computed and is
forwarded from there to the ALU inputs of the dependent and, or, and add
instructions.]
Can't always forward
• Load word can still cause a hazard:
– an instruction tries to read a register following a load instruction
that writes to the same register.
lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: pipeline diagram over CC 1-CC 9. The lw produces $2 at the end of
its DM stage, one cycle too late to forward to the EX stage of the
immediately following and.]
Thus, we need a hazard detection unit to “stall” the instruction
Stalling
• We can stall the pipeline by keeping an instruction in the same
stage
Program execution order (in instructions):
lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: the same sequence with a one-cycle bubble inserted after the lw;
the dependent and is held back one stage, after which forwarding from the
lw's DM stage can supply $2.]
Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
Program execution order (in instructions):
40 beq $1, $3, 7
44 and $12, $2, $5
48 or  $13, $6, $2
52 add $14, $2, $2
72 lw  $4, 50($7)
[Figure: pipeline diagram over CC 1-CC 9. By the time the beq resolves,
the and, or, and add at addresses 44-52 have already entered the pipeline,
even though execution continues at address 72.]
• We are predicting “branch not taken”
– need to add hardware for flushing instructions if we are wrong
Improving Performance
• Try to avoid stalls! E.g., reorder these instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
• Add a “branch delay slot”
– the next instruction after a branch is always executed
– rely on compiler to “fill” the slot with something useful
• Superscalar: start more than one instruction in the same cycle
Dynamic Scheduling
• The hardware performs the “scheduling”
– hardware tries to find instructions to execute
– out-of-order execution is possible
– speculative execution and dynamic branch prediction
• All modern processors are very complicated
– Pentium 4: 20 stage pipeline, 6 simultaneous instructions
– PowerPC and Pentium: branch history table
– Compiler technology important
Chapter 7 Preview
Memory Hierarchy
Memory Hierarchy
• Memory devices come in several different flavors
– SRAM – Static RAM
• fast (1 to 10ns)
• expensive (>10 times DRAM)
• small capacity (< ¼ DRAM)
– DRAM – Dynamic RAM
• 16 times slower than SRAM (50ns – 100ns)
• access time varies with address
• affordable ($160 / gigabyte)
• 1 GB considered big
– DISK
• slow! (10ms access time)
• cheap! (< $1 / gigabyte)
• big! (1 TB is no problem)
Memory Hierarchy
• Users want large and fast memories! Try to give it to them:
– build a memory hierarchy
[Figure: the CPU at the top, with memory Levels 1 through n below it;
distance from the CPU in access time increases going down, and the size of
the memory at each level grows.]
Locality
• A principle that makes having a memory hierarchy a good idea
• If an item is referenced,
– temporal locality: it will tend to be referenced again soon
– spatial locality: nearby items will tend to be referenced soon
• Why does code have locality?