CS61C: Machine Structures

Download Report

Transcript CS61C: Machine Structures

CS61C
Introduction to Pipelining
Lecture 25
April 28, 1999
Dave Patterson
(http.cs.berkeley.edu/~patterson)
www-inst.eecs.berkeley.edu/~cs61c/schedule.html
cs 61C L25 pipeline.1
Patterson Spring 99 ©UCB
Outline
°Review Parameter Passing on Stacks
°Pipelining Analogy
°Pipelining Instruction Execution
°Administrivia, “What’s this Stuff Bad for?”
°Hazards to Pipelining
°Solutions to Hazards
°Advanced Pipelining Concepts by Analogy
°Conclusion
cs 61C L25 pipeline.2
Patterson Spring 99 ©UCB
Review 1/1
°Every machine has a convention for how
arguments are passed.
°In MIPS, where do the arguments go if you
are passing more than 4 words? Stack!
°It is sometimes useful to have a variable
number of arguments.
• The C convention is to use “...”
• *fmt is used to determine the number of
variables and their types.
cs 61C L25 pipeline.3
Patterson Spring 99 ©UCB
Pipelining is Natural! Laundry Example
° Ann, Brian, Cathy, Dave
each have one load of
A B C D
clothes to wash, dry,
fold, and put away
° Washer takes 30
minutes
° Dryer takes 30 minutes
° “Folder” takes 30
minutes
° “Stasher” takes 30
minutes to put clothes
into drawers
cs 61C L25 pipeline.4
Patterson Spring 99 ©UCB
Sequential Laundry
6 PM 7
T
a
s
k
O
r
d
e
r
A
8
9
10
11
12
1
2 AM
30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
Time
B
C
D
°Sequential laundry takes
8 hours for 4 loads
cs 61C L25 pipeline.5
Patterson Spring 99 ©UCB
Pipelined Laundry: Start work ASAP
6 PM 7
T
a
s
k
8
9
3030 30 30 30 30 30
10
11
12
1
2 AM
Time
A
B
C
O
D
r
d
e °Pipelined
r 3.5 hours
cs 61C L25 pipeline.6
laundry takes
for 4 loads!
Patterson Spring 99 ©UCB
Pipelining Lessons
6 PM
T
a
s
k
8
30 30 30 30 30 30 30
A
B
O
r
d
e
r
7
° Pipelining doesn’t help
9 latency of single task, it
helps throughput of
Time entire workload
C
D
cs 61C L25 pipeline.7
° Multiple tasks
operating
simultaneously using
different resources
° Potential speedup =
Number pipe stages
° Time to “fill” pipeline
and time to “drain” it
reduces speedup:
2.3X v. 4X in this
example
Patterson Spring 99 ©UCB
Pipelining Lessons
T
a
s
k
° Suppose new
Washer takes 20
6 PM
7
8
9
minutes, new
Time
Stasher takes 20
minutes. How much
30 30 30 30 30 30 30
faster is pipeline?
A
B
O
r
d
e
r
C
D
cs 61C L25 pipeline.8
° Pipeline rate limited
by slowest pipeline
stage
° Unbalanced lengths
of pipe stages also
reduces speedup
Patterson Spring 99 ©UCB
Review: Steps in Executing MIPS (Lec. 20)
1) Ifetch: Fetch Instruction, Increment PC
2) Decode Instruction, Read Registers
3) Execute:
Mem-ref: Calculate Address
Arith-log: Perform Operation
Branch: Compare if operands ==
4) Memory:
Load:
Read Data from Memory
Store:
Write Data to Memory
5) Write Back: Write Data to Register
Branch:
if operands ==, Change PC
cs 61C L25 pipeline.9
Patterson Spring 99 ©UCB
Pipelined Execution Representation
Time
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
Program Flow
IFtch Dcd Exec Mem WB
°To simplify pipeline, every instruction
takes same number of steps
°Steps also called pipeline “stages”
cs 61C L25 pipeline.10
Patterson Spring 99 ©UCB
Review: A Datapath for MIPS (Lec. 20)
Stage 5
PC
Instruction
Cache
Stage 1
Registers
Stage 2
ALU
Data
Cache
Stage 3 (Stage 4)
°Use data path figure to represent pipeline
IFtch Dcd Exec Mem WB
cs 61C L25 pipeline.11
Reg
ALU
I$
D$
Reg
Patterson Spring 99 ©UCB
Graphical Pipeline Representation
Time (clock cycles)
ALU
I
n
I$
D$
Reg
Reg
s Load
I$
D$
Reg
Reg
t Add
r.
I$
D$
Reg
Reg
Store
O
I$
D$
Reg
Reg
Sub
r
I$
D$
Reg
Reg
d Or
e
r
(right half highlight means read, left half write)
ALU
ALU
ALU
ALU
cs 61C L25 pipeline.12
Patterson Spring 99 ©UCB
Example
°Suppose 2 ns for memory access, 2 ns
for ALU operation, and 1 ns for register
file read or write
°Nonpipelined Execution:
• lw : IF + Read Reg + ALU + Memory + Write
Reg = 2 + 1 + 2 + 2 + 1 = 8 ns
• add: IF + Read Reg + ALU + Write Reg
= 2 + 1 + 2 + 1 = 6 ns
°Pipelined Execution:
• Max(IF,Read Reg,ALU, Memory,Write Reg) =
2 ns
cs 61C L25 pipeline.13
Patterson Spring 99 ©UCB
Administrivia
°Project 6 (last): Due Today
°Next Readings: 7.5
°11th homework (last): Due Friday 4/30 7PM
• Exercises 2.6, 2.13, 6.1, 6.3, 6.4
cs 61C L25 pipeline.14
Patterson Spring 99 ©UCB
Administrivia: Rest of 61C
F 4/30 Review: Caches/TLB/VM; Section 7.5
M 5/3 Deadline to correct your grade record
W 5/5 Review: Interrupts / Polling; A.7
F 5/7 61C Summary / Your Cal heritage /
HKN Course Evalution
(Due: Final 61C Survey in lab; Return)
Sun 5/9 Final Review starting 2PM (1 Pimintel)
W 5/12 Final (5PM 1 Pimintel)
• Need Alternative Final? Contact mds@cory
cs 61C L25 pipeline.15
Patterson Spring 99 ©UCB
“What’s This Stuff (Potentially) Bad For?”
Linking Entertainment to Violence 100s of studies in recent
decades have revealed a direct correlation between exposure to media
violence--including video games--and increased aggression.
•"We are reaching that stage of desensitization at which the inflicting
of pain and suffering has become a source of entertainment; vicarious
pleasure rather than revulsion. We are learning to kill, and we are
learning to like it." Like the tobacco industry, “the evidence is there."
• The 14-year-old boy who opened fire on a prayer group in a Ky.
school foyer in 1997 was a video-game expert. He had never fired a
pistol before, but in the ensuing melee, he fired 8 shots, hit 8 people,
and killed 3. The average law enforcement officer in the United
States, at a distance of 7 yards, hits fewer than 1 in 5 shots.
• Because of freedom of speech is a value that we don't want to
compromise, “it really comes down to the people creating these
games. That's where the responsibility lies."
N.Y. Times, 4/26/99
cs 61C L25 pipeline.16
Patterson Spring 99 ©UCB
Pipeline Hazard: Matching socks in later load
6 PM 7
T
a
s
k
8
9
3030 30 30 30 30 30
A
10
11
12
1
2 AM
Time
bubble
B
C
O
D
r
d E
e
r F
A depends on D; stall since folder tied up
cs 61C L25 pipeline.17
Patterson Spring 99 ©UCB
Problems for Computers
°Limits to pipelining: Hazards prevent
next instruction from executing during
its designated clock cycle
• Structural hazards: HW cannot support
this combination of instructions (single
person to fold and put clothes away)
• Control hazards: Pipelining of branches &
other instructions stall the pipeline until
the hazard “bubbles” in the pipeline
• Data hazards: Instruction depends on
result of prior instruction still in the
pipeline (missing sock)
cs 61C L25 pipeline.18
Patterson Spring 99 ©UCB
Single Cache is a Structural Hazard
Time (clock cycles)
ALU
I
n
$
$
Reg
Reg
s Load
$
$
Reg
Reg
t Instr 1
r.
$
$
Reg
Reg
Instr 2
O
$
$
Reg
Reg
Instr 3
r
$
$
Reg
Reg
d Instr 4
e
r Read same memory twice in same clock cycle
ALU
ALU
ALU
ALU
cs 61C L25 pipeline.19
Patterson Spring 99 ©UCB
Structural Hazards limit performance
°Example: if 1.3 memory accesses per
instruction (30% of instructions
executed loads and stores)
and only one memory access per
cycle then
• Average CPI ≥ 1.3
• Otherwise resource is more than 100%
utilized
cs 61C L25 pipeline.20
Patterson Spring 99 ©UCB
Control Hazard Solutions
°Stall: wait until decision is clear
ALU
•
Move
up
decision
to
2nd
stage
by
adding
I
hardware to check registers as being read
n
Time (clock cycles)
s
I$
D$
Reg
Reg
t
Add
r.
I$
D$
Reg
Reg
Beq
ALU
ALU
O Load
bub
D$
Reg
Reg
I$
ble
r
d
e
r °Impact: 2 clock cycles per branch
instruction  slow
cs 61C L25 pipeline.21
Patterson Spring 99 ©UCB
• For example, Predict not taken
Time (clock cycles)
Beq
Reg
I$
D$
Reg
Reg
ALU
Add
I$
ALU
I
n
s
t
r.
Control Hazard Solutions
°Predict: guess one direction, then back
up if wrong
D$
Reg
ALU
O Load
I$
D$
Reg
Reg
r
d
°Impact: 1 clock per branch instruction if
e right, 2 if wrong (right ~ 50% of time)
r
°More dynamic scheme: history of 1
branch (~ 90%)
cs 61C L25 pipeline.22
Patterson Spring 99 ©UCB
Control Hazard Solutions
°Redefine branch behavior (takes place
I after next instruction) “delayed branch”
Time (clock cycles)
Beq
Reg
I$
D$
Reg
Reg
ALU
Add
I$
ALU
n
s
t
r.
D$
Reg
ALU
ALU
I$
D$
Reg
Reg
O Misc
r
I$
D$
Reg
Reg
Load
d
e °Impact: 1 clock cycles per branch
r instruction if can find instruction to put
in “slot” (≈ 50% of time)
cs 61C L25 pipeline.23
Patterson Spring 99 ©UCB
Example Nondelayed vs. Delayed Branch
Nondelayed Branch
or
$8, $9 ,$10
Delayed Branch
add $1 ,$2,$3
add $1 ,$2,$3
sub $4, $5,$6
sub $4, $5,$6
beq $1, $4, Exit
beq $1, $4, Exit
or
xor $10, $1,$11
xor $10, $1,$11
Exit:
cs 61C L25 pipeline.24
$8, $9 ,$10
Exit:
Patterson Spring 99 ©UCB
Data Hazard on Register $1
add $1 ,$2,$3
sub $4, $1 ,$3
and $6, $1 ,$7
or
$8, $1 ,$9
xor $10, $1 ,$11
cs 61C L25 pipeline.25
Patterson Spring 99 ©UCB
Data Hazard on $1:
Dependencies backwards in time are hazards
Time (clock cycles)
I$
sub $4,$1,$3
and $6,$1,$7
O
r
d
e
r
or $8,$1,$9
xor $10,r1,$11
cs 61C L25 pipeline.26
WB
D$
Reg
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
ALU
Reg
MEM
ALU
I$
EX
ALU
ID/RF
ALU
add $1,$2,$3
IF
ALU
I
n
s
t
r.
D$
Reg
Patterson Spring 99 ©UCB
Data Hazard Solution:
• “Forward” result from one stage to another
Time (clock cycles)
and $6,$1,$7
I$
Reg
I$
EX
MEM
WB
D$
Reg
Reg
D$
Reg
I$
Reg
ALU
sub $4,$1,$3
ID/RF
ALU
add $1,$2,$3
IF
ALU
I
n
s
t
r.
D$
cs 61C L25 pipeline.27
D$
Reg
ALU
ALU
O
I$
Reg
or
$8,$1,$9
r
d
I$
Reg
xor $10,r1,$11
e
r
• “or” OK if define read/write properly
Reg
D$
Reg
Patterson Spring 99 ©UCB
Forwarding (or Bypassing): What about Loads
• Dependencies backwards in time are hazards
I$
Reg
I$
EX
MEM
WB
D$
Reg
Reg
ALU
sub $4,$1,$3
ID/RF
ALU
lw $1,0($2)
IF
D$
Reg
• Can’t solve with forwarding
• Must stall instruction dependent on loads
cs 61C L25 pipeline.28
Patterson Spring 99 ©UCB
Data Hazard Even with Forwarding
• Must stall (insert bubble in) pipeline
Time (clock cycles)
cs 61C L25 pipeline.29
D$
I$
Reg
bub
ble
I$
bub
ble
Reg
D$
bub
ble
I$
Reg
ALU
or $8,$1,$9
Reg
ALU
and $6,$1,$7
I$
ALU
sub $4,$1,$6
ID/RF
ALU
lw $1, 0($2)
IF
EX
MEM
WB
Reg
D$
Reg
Reg
D$
Patterson Spring 99 ©UCB
Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e – f;
a, b, c, d ,e, and f in memory
Slow code:
Fast code:
lw $2,b
lw $2,b
lw $3,c
lw $3,c
add $1,$2,$3
lw $5,e
sw $1,a
add $1,$2,$3
lw $5,e
lw $6,f
lw $6,f
sw $1,a
sub $4,$5,$6
sub $4,$5,$6
sw $4,d
sw $4,d
cs 61C L25 pipeline.30
Patterson Spring 99 ©UCB
Advanced Pipelining Concepts (if time)
°“Out-of-order” Execution
°“Superscalar” execution
°State-of-the-Art Microprocessor
cs 61C L25 pipeline.31
Patterson Spring 99 ©UCB
Review Pipeline Hazard: Stall is dependency
6 PM 7
T
a
s
k
8
9
3030 30 30 30 30 30
A
10
11
12
1
2 AM
Time
bubble
B
C
O
D
r
d E
e
r F
A depends on D; stall since folder tied up
cs 61C L25 pipeline.32
Patterson Spring 99 ©UCB
Out-of-Order Laundry: Don’t Wait
6 PM 7
T
a
s
k
8
9
3030 30 30 30 30 30
A
10
11
12
1
2 AM
Time
bubble
B
C
O
D
r
d E
e
r F
A depends on D; rest continue; need
more resources to allow out-of-order
cs 61C L25 pipeline.33
Patterson Spring 99 ©UCB
Superscalar Laundry: Parallel per stage
6 PM 7
T
a
s
k
8
9
3030 30 30 30
A
B
C
O
D
r
d E
e
r F
10
11
12
2 AM
Time
(light clothing)
(dark clothing)
(very dirty clothing)
(light clothing)
(dark clothing)
(very dirty clothing)
More resources, HW to match mix of
parallel tasks?
cs 61C L25 pipeline.34
1
Patterson Spring 99 ©UCB
Superscalar Laundry: Mismatch Mix
6 PM 7
T
a
s
k
8
9
3030 30 30 30 30 30
A
O
r B
d C
e
r
D
10
11
12
1
2 AM
Time
(light clothing)
(light clothing)
(dark clothing)
(light clothing)
Task mix underutilizes extra resources
cs 61C L25 pipeline.35
Patterson Spring 99 ©UCB
State of the Art: Alpha 21264
° 15 Million transistors
° 2 64KB caches on chip;
16MB L2 cache off chip
° Clock cycle time <1.7 nsec,
or Clock Rate >600 MHz
(Fastest Cray Supercomputer: T90 2.2 nsec)
• 90 watts per chip!
° Superscalar: fetch up to 6 instructions/clock
cycle, retires up to 4 instruction/clock cycle
° Execution out-of-order
cs 61C L25 pipeline.36
Patterson Spring 99 ©UCB
Summary 1/2: Pipelining Introduction
°Pipelining is a fundamental concept
• Multiple steps using distinct resources
• Exploiting parallelism in instructions
°What makes it easy? (MIPS vs. 80x86)
• All instructions are the same length
 simple instruction fetch
• Just a few instruction formats
 read registers before decode instruction
• Memory operands only in loads and stores
 fewer pipeline stages
• Data aligned  1 memory access / load, store
cs 61C L25 pipeline.37
Patterson Spring 99 ©UCB
Summary 2/2: Pipelining Introduction
°What makes it hard?
°Structural hazards: suppose we had only
one cache?
 Need more HW resources
°Control hazards: need to worry about
branch instructions?
 Branch prediction, delayed branch
°Data hazards: an instruction depends on
a previous instruction?
 need forwarding, compiler scheduling
cs 61C L25 pipeline.38
Patterson Spring 99 ©UCB