Lectures for 2nd Edition

Lecture 7: Pipelining
2004 Morgan Kaufmann Publishers
1
Notes
• Homework #2 Due Today
• Homework #3 Delivered this evening
• Pre-Lab #2: due next Tuesday
– 3 paper readings, short summarization
• Lab #1: Emailed out tonight
– Due next Tuesday
– Done in groups of 4
– C code to design an interconnection network
• Where we are going:
– Week #4: Pipelining (Chapter 6) (Oct. 16, 18)
– Week #5: Cache (Chapter 7) (Oct. 23, 25)
– Week #6: Midterm (Oct. 30 (review) or Nov 1)
• 4 Weeks to do the project
Pipelining
• Improve performance by increasing instruction throughput
[Figure: execution over time of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through instruction fetch, Reg, ALU, data access, Reg. Nonpipelined: successive instructions start 800 ps apart. Pipelined: five 200 ps stages, successive instructions start 200 ps apart.]
Note: timing assumptions changed for this example
Ideal speedup is number of stages in the pipeline. Do we achieve this?
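A minimal sketch of the timing in the figure above. The per-stage latencies below are an assumption (the slide only gives 800 ps per instruction nonpipelined and 200 ps per pipeline stage); the function name `total_time_ps` is hypothetical.

```python
# Assumed stage latencies in ps: IF, Reg read, ALU, data access, Reg write.
# They sum to the 800 ps per instruction shown for the nonpipelined case.
STAGE_TIMES_PS = [200, 100, 200, 200, 100]

def total_time_ps(n_instructions, stage_times_ps, pipelined):
    """Total time to run n instructions back to back."""
    if pipelined:
        # Every stage is stretched to the slowest stage (the clock period).
        cycle = max(stage_times_ps)
        return (len(stage_times_ps) + n_instructions - 1) * cycle
    # Nonpipelined: each instruction finishes before the next one starts.
    return n_instructions * sum(stage_times_ps)

three_lw_nonpipelined = total_time_ps(3, STAGE_TIMES_PS, pipelined=False)  # 2400 ps
three_lw_pipelined = total_time_ps(3, STAGE_TIMES_PS, pipelined=True)      # 1400 ps
```

Under these assumptions the steady-state speedup approaches 800 / 200 = 4 rather than 5, because the stages are not perfectly balanced.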
Multicycle Approach
• Break up the instructions into steps, each step takes a cycle
– balance the amount of work to be done
– restrict each cycle to use only one major functional unit
• At the end of a cycle
– store values for use in later cycles (easiest thing to do)
– introduce additional “internal” registers
[Figure: multicycle datapath with PC, a single memory, instruction register, memory data register, register file, A and B registers, sign extend, shift left 2, ALU, and ALUOut]
Idea behind multicycle approach
• We define each instruction from the ISA perspective (do this!)
• Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps)
• Introduce new registers as needed (e.g., A, B, ALUOut, MDR, etc.)
• Finally, try to pack as much work as possible into each step (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify solution)
• Result: Our book’s multicycle implementation!
Five Execution Steps
•
Instruction Fetch
•
Instruction Decode and Register Fetch
•
Execution, Memory Address Computation, or Branch Completion
•
Memory Access or R-type instruction completion
•
Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
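A small sketch of how the five steps translate into cycle counts. The per-class counts are read off the step names above (branches and jumps finish in step 3, stores and R-type instructions in step 4, loads in step 5); the names `CYCLES` and `total_cycles` are hypothetical.

```python
# Cycles per instruction class in the multicycle design, derived from
# the step in which each class completes.
CYCLES = {"lw": 5, "sw": 4, "R-type": 4, "beq": 3, "j": 3}

def total_cycles(instruction_classes):
    """Sum the cycle counts of a straight-line instruction sequence."""
    return sum(CYCLES[c] for c in instruction_classes)

# Example: two loads, a branch (not taken), an add, and a store.
example = ["lw", "lw", "beq", "R-type", "sw"]
example_cycles = total_cycles(example)  # 5 + 5 + 3 + 4 + 4 = 21
```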
Step 1: Instruction Fetch
• Use PC to get instruction and put it in the Instruction Register.
• Increment the PC by 4 and put the result back in the PC.
• Can be described succinctly using RTL "Register-Transfer Language"
IR <= Memory[PC];
PC <= PC + 4;
Can we figure out the values of the control signals?
What is the advantage of updating the PC now?
Step 2: Instruction Decode and Register Fetch
• Read registers rs and rt in case we need them
• Compute the branch address in case the instruction is a branch
• RTL:
A <= Reg[IR[25:21]];
B <= Reg[IR[20:16]];
ALUOut <= PC + (sign-extend(IR[15:0]) << 2);
• We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
Step 3 (instruction dependent)
•
ALU is performing one of three functions, based on instruction type
•
Memory Reference:
ALUOut <= A + sign-extend(IR[15:0]);
•
R-type:
ALUOut <= A op B;
•
Branch:
if (A==B) PC <= ALUOut;
Step 4 (R-type or memory-access)
•
Loads and stores access memory
MDR <= Memory[ALUOut];
or
Memory[ALUOut] <= B;
•
R-type instructions finish
Reg[IR[15:11]] <= ALUOut;
The write actually takes place at the end of the cycle on the edge
Write-back step
• Reg[IR[20:16]] <= MDR;
Which instruction needs this?
[Figure: multicycle datapath with PC, memory, instruction register, memory data register, register file, A and B registers, sign extend, shift left 2, ALU, and ALUOut]
Simple Questions
• How many cycles will it take to execute this code?
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label #assume not
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...
• What is going on during the 8th cycle of execution?
• In what cycle does the actual addition of $t2 and $t3 take place?
[Figure: complete multicycle datapath with the control unit. Control input: Op [5–0] from Instruction [31–26]. Control outputs: PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst. The jump address [31–0] is formed from PC [31–28] and Instruction [25–0] shifted left 2.]
Review: finite state machines
• Finite state machines:
– a set of states and
– next state function (determined by current state and the input)
– output function (determined by current state and possibly input)
[Figure: FSM block diagram. The inputs and the current state feed the next-state function; the clocked state register holds the current state; the output function produces the outputs.]
– We’ll use a Moore machine (output based only on current state)
Implementing the Control
•
Value of control signals is dependent upon:
– what instruction is being executed
– which step is being performed
•
Use the information we’ve accumulated to specify a finite state machine
– specify the finite state machine graphically, or
– use microprogramming
•
Implementation can be derived from specification
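One way to see how an implementation follows from the specification is to write the next-state function as a table. This is a sketch only: the state numbers and transitions follow the ten-state machine used in this lecture (they agree with the dispatch ROM values given later), while the names `NEXT_STATE` and `trace` are hypothetical.

```python
# Next-state function of the multicycle control FSM as a lookup table.
# A key is (current state, opcode class); None means "any opcode".
NEXT_STATE = {
    (0, None): 1,                       # instruction fetch -> decode
    (1, "lw"): 2, (1, "sw"): 2,         # memory address computation
    (1, "R-type"): 6,                   # execution
    (1, "beq"): 8,                      # branch completion
    (1, "j"): 9,                        # jump completion
    (2, "lw"): 3, (2, "sw"): 5,         # memory access (read / write)
    (3, None): 4,                       # memory read completion
    (6, None): 7,                       # R-type completion
    (4, None): 0, (5, None): 0, (7, None): 0, (8, None): 0, (9, None): 0,
}

def next_state(state, op):
    """Moore-style next-state lookup: opcode-specific entry first."""
    if (state, op) in NEXT_STATE:
        return NEXT_STATE[(state, op)]
    return NEXT_STATE[(state, None)]

def trace(op):
    """States visited while executing one instruction of class op."""
    states = [0]
    while True:
        nxt = next_state(states[-1], op)
        if nxt == 0:
            return states
        states.append(nxt)
```

For example, `trace("lw")` visits five states and `trace("beq")` three, matching the 3 to 5 cycle range of the multicycle design.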
Graphical Specification of FSM
• Note:
– don’t care if not mentioned
– asserted if name only
– otherwise exact value
• How many state bits will we need?
State 0 (Start), Instruction fetch: MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
State 1, Instruction decode/register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
State 2, Memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
State 3, Memory access (read): MemRead, IorD = 1
State 4, Memory read completion step: RegDst = 0, RegWrite, MemtoReg = 1
State 5, Memory access (write): MemWrite, IorD = 1
State 6, Execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
State 7, R-type completion: RegDst = 1, RegWrite, MemtoReg = 0
State 8, Branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
State 9, Jump completion: PCWrite, PCSource = 10
Finite State Machine for Control
Implementation:
[Figure: control logic block. Inputs: the instruction register opcode field (Op5–Op0) and the state register (S3–S0). Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and the next-state bits NS3–NS0, which feed the state register.]
PLA Implementation
• If I picked a horizontal or vertical line could you explain it?
[Figure: PLA. Inputs: Op5–Op0 and S3–S0. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst, NS3–NS0.]
ROM Implementation
• ROM = "Read Only Memory"
– values of memory locations are fixed ahead of time
• A ROM can be used to implement a truth table
– if the address is m bits, we can address 2^m entries in the ROM.
– our outputs are the bits of data that the address points to.
Address (m = 3)   Data (n = 4)
0 0 0             0 0 1 1
0 0 1             1 1 0 0
0 1 0             1 1 0 0
0 1 1             1 0 0 0
1 0 0             0 0 0 0
1 0 1             0 0 0 1
1 1 0             0 1 1 0
1 1 1             0 1 1 1
m is the "height", and n is the "width"
ROM Implementation
• How many inputs are there?
6 bits for opcode, 4 bits for state = 10 address lines
(i.e., 2^10 = 1024 different addresses)
• How many outputs are there?
16 datapath-control outputs, 4 state bits = 20 outputs
• ROM is 2^10 x 20 = 20K bits
•
Rather wasteful, since for lots of the entries, the outputs are the
same
— i.e., opcode is often ignored
(and a rather unusual size)
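The ROM size arithmetic above, as a quick check. The helper name `rom_bits` is hypothetical.

```python
def rom_bits(address_bits, output_bits):
    """A ROM with m address lines holds 2**m words of output_bits each."""
    return (2 ** address_bits) * output_bits

# 6 opcode bits + 4 state bits = 10 address lines;
# 16 datapath-control outputs + 4 next-state bits = 20 outputs.
control_rom = rom_bits(10, 20)          # 20480 bits
control_rom_kbits = control_rom / 1024  # 20.0, i.e. "20K bits"
```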
Another Implementation Style
Complex instructions: the "next state" is often current state + 1
[Figure: control unit built from a PLA or ROM plus an explicit sequencer. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, BWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and AddrCtl. An adder computes State + 1, and the address select logic picks the next state using AddrCtl and Op[5–0] from the instruction register opcode field.]
Details
Dispatch ROM 1
Op      Opcode name  Value
000000  R-format     0110
000010  jmp          1001
000100  beq          1000
100011  lw           0010
101011  sw           0010

Dispatch ROM 2
Op      Opcode name  Value
100011  lw           0011
101011  sw           0101

[Figure: address select logic. A mux controlled by AddrCtl selects the next state from input 3 (State + 1 from the adder), input 2 (dispatch ROM 2), input 1 (dispatch ROM 1), or input 0 (the constant 0); both dispatch ROMs are indexed by the instruction register opcode field.]

State number  Address-control action     Value of AddrCtl
0             Use incremented state      3
1             Use dispatch ROM 1         1
2             Use dispatch ROM 2         2
3             Use incremented state      3
4             Replace state number by 0  0
5             Replace state number by 0  0
6             Use incremented state      3
7             Replace state number by 0  0
8             Replace state number by 0  0
9             Replace state number by 0  0
Microprogramming
[Figure: microprogrammed control unit. The microcode memory drives the datapath control outputs (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, BWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst) and AddrCtl. A microprogram counter with an adder (+1) and address select logic, fed by the instruction register opcode field, chooses the next microinstruction.]
• What are the “microinstructions”?
Microprogramming
• A specification methodology
– appropriate if hundreds of opcodes, modes, cycles, etc.
– signals specified symbolically using microinstructions
Label     ALU control  SRC1  SRC2     Register control  Memory     PCWrite control  Sequencing
Fetch     Add          PC    4        -                 Read PC    ALU              Seq
-         Add          PC    Extshft  Read              -          -                Dispatch 1
Mem1      Add          A     Extend   -                 -          -                Dispatch 2
LW2       -            -     -        -                 Read ALU   -                Seq
-         -            -     -        Write MDR         -          -                Fetch
SW2       -            -     -        -                 Write ALU  -                Fetch
Rformat1  Func code    A     B        -                 -          -                Seq
-         -            -     -        Write ALU         -          -                Fetch
BEQ1      Subt         A     B        -                 -          ALUOut-cond      Fetch
JUMP1     -            -     -        -                 -          Jump address     Fetch
• Will two implementations of the same architecture have the same microcode?
• What would a microassembler do?
Microinstruction format
Each entry below gives a field value, the signals active, and a comment.
ALU control
– Add: ALUOp = 00. Cause the ALU to add.
– Subt: ALUOp = 01. Cause the ALU to subtract; this implements the compare for branches.
– Func code: ALUOp = 10. Use the instruction's function code to determine ALU control.
SRC1
– PC: ALUSrcA = 0. Use the PC as the first ALU input.
– A: ALUSrcA = 1. Register A is the first ALU input.
SRC2
– B: ALUSrcB = 00. Register B is the second ALU input.
– 4: ALUSrcB = 01. Use 4 as the second ALU input.
– Extend: ALUSrcB = 10. Use output of the sign extension unit as the second ALU input.
– Extshft: ALUSrcB = 11. Use the output of the shift-by-two unit as the second ALU input.
Register control
– Read: Read two registers using the rs and rt fields of the IR as the register numbers and putting the data into registers A and B.
– Write ALU: RegWrite, RegDst = 1, MemtoReg = 0. Write a register using the rd field of the IR as the register number and the contents of the ALUOut as the data.
– Write MDR: RegWrite, RegDst = 0, MemtoReg = 1. Write a register using the rt field of the IR as the register number and the contents of the MDR as the data.
Memory
– Read PC: MemRead, IorD = 0. Read memory using the PC as address; write result into IR (and the MDR).
– Read ALU: MemRead, IorD = 1. Read memory using the ALUOut as address; write result into MDR.
– Write ALU: MemWrite, IorD = 1. Write memory using the ALUOut as address, contents of B as the data.
PC write control
– ALU: PCSource = 00, PCWrite. Write the output of the ALU into the PC.
– ALUOut-cond: PCSource = 01, PCWriteCond. If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut.
– jump address: PCSource = 10, PCWrite. Write the PC with the jump address from the instruction.
Sequencing
– Seq: AddrCtl = 11. Choose the next microinstruction sequentially.
– Fetch: AddrCtl = 00. Go to the first microinstruction to begin a new instruction.
– Dispatch 1: AddrCtl = 01. Dispatch using the ROM 1.
– Dispatch 2: AddrCtl = 10. Dispatch using the ROM 2.
Maximally vs. Minimally Encoded
•
No encoding:
– 1 bit for each datapath operation
– faster, requires more memory (logic)
– used for Vax 780 — an astonishing 400K of memory!
•
Lots of encoding:
– send the microinstructions through logic to get control signals
– uses less memory, slower
•
Historical context of CISC:
– Too much logic to put on a single chip with everything else
– Use a ROM (or even RAM) to hold the microcode
– It’s easy to add new instructions
Microcode: Trade-offs
•
Distinction between specification and implementation is sometimes blurred
•
Specification Advantages:
– Easy to design and write
– Design architecture and microcode in parallel
•
Implementation (off-chip ROM) Advantages
– Easy to change since values are in memory
– Can emulate other architectures
– Can make use of internal registers
•
Implementation Disadvantages, SLOWER now that:
– Control is implemented on same chip as processor
– ROM is no longer faster than RAM
– No need to go back and make changes
Historical Perspective
• In the ‘60s and ‘70s microprogramming was very important for implementing machines
• This led to more sophisticated ISAs and the VAX
• In the ‘80s RISC processors based on pipelining became popular
• Pipelining the microinstructions is also possible!
• Implementations of IA-32 architecture processors since 486 use:
– “hardwired control” for simpler instructions (few cycles, FSM control implemented using PLA or random logic)
– “microcoded control” for more complex instructions (large numbers of cycles, central control store)
• The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store
Pentium 4
• Pipelining is important (last IA-32 without it was 80386 in 1985)
[Figure: Pentium 4 chip layout with labeled regions: control (several areas), I/O interface, instruction cache, data cache, enhanced floating point and multimedia, integer datapath, advanced pipelining and hyperthreading support, secondary cache and memory interface; callouts point to Chapter 6 and Chapter 7]
• Pipelining is used for the simple instructions favored by compilers
“Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions”
Pentium 4
• Somewhere in all that “control” we must handle complex instructions
[Figure: Pentium 4 chip layout with labeled regions: control (several areas), I/O interface, instruction cache, data cache, enhanced floating point and multimedia, integer datapath, advanced pipelining and hyperthreading support, secondary cache and memory interface]
• Processor executes simple microinstructions, 70 bits wide (hardwired)
• 120 control lines for integer datapath (400 for floating point)
• If an instruction requires more than 4 microinstructions to implement, control from microcode ROM (8000 microinstructions)
• It’s complicated!
Chapter 5 Summary
•
If we understand the instructions…
We can build a simple processor!
•
If instructions take different amounts of time, multi-cycle is better
•
Datapath implemented using:
– Combinational logic for arithmetic
– State holding elements to remember bits
•
Control implemented using:
– Combinational logic for single-cycle implementation
– Finite state machine for multi-cycle implementation
DONE
Chapter Six -- Pipelining
Pipelining
• Improve performance by increasing instruction throughput
[Figure: execution over time of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through instruction fetch, Reg, ALU, data access, Reg. Nonpipelined: successive instructions start 800 ps apart. Pipelined: five 200 ps stages, successive instructions start 200 ps apart.]
Note: timing assumptions changed for this example
Ideal speedup is number of stages in the pipeline. Do we achieve this?
Pipelining
•
What makes it easy
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
•
What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
•
We’ll build a simple pipeline and look at these issues
•
We’ll talk about modern processors and what really makes it hard:
– exception handling
– trying to improve performance with out-of-order execution, etc.
Basic Idea
IF: Instruction fetch
ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access
WB: Write back
[Figure: single-cycle datapath divided into the five stages: PC and instruction memory; register file and sign extend; ALU and branch-target adder (shift left 2); data memory; write back to the register file]
• What do we need to add to actually split the datapath into stages?
Pipelined Datapath
[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB placed between the stages]
Can you find a problem even if there are no dependencies?
What instructions can we execute to manifest the problem?
Corrected Datapath
[Figure: corrected pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]
Graphically Representing Pipelines
[Figure: multiple-clock-cycle pipeline diagram; time in clock cycles runs left to right, program execution order (in instructions) runs top to bottom]
Instruction     CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7
lw $1, 100($0)  IM    Reg   ALU   DM    Reg
lw $2, 200($0)        IM    Reg   ALU   DM    Reg
lw $3, 300($0)              IM    Reg   ALU   DM    Reg
•
Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
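The questions above can be answered mechanically from the diagram. A minimal sketch, assuming an ideal five-stage pipeline with no stalls; the helper names are hypothetical.

```python
# Stage order as drawn in the pipeline diagram.
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]

def total_cycles(n_instructions):
    """Cycles to finish n instructions when one enters per cycle."""
    return n_instructions + len(STAGES) - 1

def instruction_in_stage(stage_index, cycle, n_instructions):
    """0-based index of the instruction in a stage during a 1-based cycle.

    Instruction i occupies stage s during cycle i + s + 1; returns None
    when the stage is idle in that cycle.
    """
    i = cycle - 1 - stage_index
    return i if 0 <= i < n_instructions else None

cycles_for_three_loads = total_cycles(3)       # 7 (CC 1 to CC 7)
alu_in_cycle_4 = instruction_in_stage(2, 4, 3)  # 1: the second lw
```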
Pipeline Control
[Figure: pipelined datapath with the control signals identified: PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg. Instruction [15–0] feeds the sign extend unit and the ALU control (low 6 bits); Instruction [20–16] and Instruction [15–11] feed the RegDst mux.]
Pipeline control
•
We have 5 stages. What needs to be controlled in each stage?
– Instruction Fetch and PC Increment
– Instruction Decode / Register Fetch
– Execution
– Memory Stage
– Write Back
•
How would control be handled in an automobile plant?
– a fancy control center telling everyone what to do?
– should we use a finite state machine?
Pipeline Control
• Pass control signals along just like the data
Execution/address calculation stage control lines: RegDst, ALUOp1, ALUOp0, ALUSrc
Memory access stage control lines: Branch, MemRead, MemWrite
Write-back stage control lines: RegWrite, MemtoReg
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0       0       0        0         1         0
lw           0       0       0       1       0       1        0         1         1
sw           X       0       0       1       0       0        1         0         X
beq          X       0       1       0       1       0        0         0         X
[Figure: the control unit decodes the instruction from IF/ID; the EX, M, and WB control fields are stored in ID/EX, the M and WB fields in EX/MEM, and the WB field in MEM/WB]
Datapath with Control
[Figure: pipelined datapath with control. The control unit's EX, M, and WB fields travel through ID/EX, EX/MEM, and MEM/WB and drive PCSrc, Branch, ALUSrc, ALUOp, RegDst, MemRead, and the write-back signals; Instruction [15–0], [20–16], and [15–11] feed the sign extend unit, ALU control, and RegDst mux.]
Dependencies
• Problem with starting next instruction before first is finished
– dependencies that “go backward in time” are data hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) over CC 1 to CC 9]
Value of register $2: 10 in CC 1 to CC 4, 10/–20 in CC 5, –20 in CC 6 to CC 9
Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Software Solution
• Have compiler guarantee no hazards
• Where do we insert the “nops”?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
•
Problem: this really slows us down!
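A sketch of what such a compiler pass might do. It assumes no forwarding and a register file that is written in the first half of a cycle and read in the second half, so a consumer must sit at least three instruction slots after its producer. The function name `insert_nops` and the tuple encoding of instructions are hypothetical.

```python
def insert_nops(program, min_distance=3):
    """Insert nops so every register read is min_distance slots after
    the instruction that wrote the register.

    program: list of (text, dest_register_or_None, source_registers).
    """
    out = []
    last_write = {}  # register -> slot index of its latest producer
    for text, dest, sources in program:
        needed = 0
        for reg in sources:
            if reg in last_write:
                gap = len(out) - last_write[reg]
                needed = max(needed, min_distance - gap)
        out.extend(["nop"] * needed)
        if dest is not None:
            last_write[dest] = len(out)
        out.append(text)
    return out

program = [
    ("sub $2, $1, $3", "$2", ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2", "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None, ["$15", "$2"]),
]
scheduled = insert_nops(program)
```

Under these assumptions two nops land between the sub and the and; the later readers of $2 are already far enough away.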
Forwarding
•
Use temporary results, don’t wait for them to be written
– register file forwarding to handle read/write to same register
– ALU forwarding
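A sketch of the decision ALU forwarding has to make for each ALU operand. The conditions and the 2-bit select encoding (10 = take the EX/MEM value, 01 = take the MEM/WB value, 00 = use the register file value) follow the usual textbook formulation and are an assumption here, as are the function name and the dictionary layout.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Return the assumed mux select for one ALU source register.

    ex_mem / mem_wb: dicts with "RegWrite" (bool) and "Rd" (register
    number) describing the instructions currently in MEM and WB.
    """
    # Most recent result wins: check the EX/MEM pipeline register first.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src_reg:
        return "10"
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == src_reg:
        return "01"
    return "00"

sub_writes_2 = {"RegWrite": True, "Rd": 2}    # sub $2, $1, $3
and_writes_12 = {"RegWrite": True, "Rd": 12}  # and $12, $2, $5
idle = {"RegWrite": False, "Rd": 0}

# "and" in EX while "sub" is in MEM: first operand $2 comes from EX/MEM.
forward_a_for_and = forward_select(2, sub_writes_2, idle)
# "or $13, $6, $2" in EX: "and" is in MEM, "sub" is in WB.
forward_b_for_or = forward_select(2, and_writes_12, sub_writes_2)
```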
[Figure: pipeline diagram with forwarding (IM, Reg, ALU, DM, Reg per instruction) over CC 1 to CC 9]
Value of register $2: 10 in CC 1 to CC 4, 10/–20 in CC 5, –20 in CC 6 to CC 9
Value of EX/MEM: –20 in CC 4, X in all other cycles
Value of MEM/WB: –20 in CC 5, X in all other cycles
Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
what if this $2 was $13?
Forwarding
•
The main idea (some details not shown)
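[Figure: forwarding hardware between ID/EX, EX/MEM, and MEM/WB. Muxes controlled by ForwardA and ForwardB select each ALU input from the register file value, the EX/MEM result, or the MEM/WB result. The forwarding unit compares Rs and Rt from ID/EX with EX/MEM.RegisterRd and MEM/WB.RegisterRd.]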
Can't always forward
•
Load word can still cause a hazard:
– an instruction tries to read a register following a load instruction
that writes to the same register.
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) over CC 1 to CC 9]
Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
•
Thus, we need a hazard detection unit to “stall” the load instruction
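A sketch of the check such a hazard detection unit could perform. The signal names mirror the pipeline register fields used in this lecture (ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt); the exact condition is the standard textbook one and is an assumption here, and `must_stall` is a hypothetical name.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID reads the register that the
    load currently in EX is going to write."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))

# lw $2, 20($1) is in EX, and $4, $2, $5 is in ID: stall one cycle.
load_use = must_stall(True, 2, 2, 5)
# Same registers, but the instruction in EX is not a load: no stall.
no_load = must_stall(False, 2, 2, 5)
```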
Stalling
•
We can stall the pipeline by keeping an instruction in the same stage
[Figure: pipeline diagram with a stall (bubble) over CC 1 to CC 10; the instruction after the load repeats its Reg stage and the one after that repeats its IM stage]
Program execution order (in instructions):
lw $2, 20($1)
and becomes nop (bubble)
add $4, $2, $5
or $8, $2, $6
add $9, $4, $2
Hazard Detection Unit
•
Stall by letting an instruction that won’t write anything go forward
[Figure: pipelined datapath with a hazard detection unit and the forwarding unit. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt, and can select 0 for the control values through a mux in front of ID/EX. IF/ID.RegisterRs, IF/ID.RegisterRt, and IF/ID.RegisterRd are passed down the pipeline as Rs, Rt, and Rd.]
Branch Hazards
•
When we decide to branch, other instructions are in the pipeline!
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) over CC 1 to CC 9]
Program execution order (in instructions):
40 beq $1, $3, 28
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
•
We are predicting “branch not taken”
– need to add hardware for flushing instructions if we are wrong
Flushing Instructions
[Figure: pipelined datapath with an IF.Flush signal into IF/ID, a register comparator (=) and the branch-target adder (sign extend, shift left 2) in the ID stage, the hazard detection unit, and the forwarding unit]
Note: we’ve also moved branch decision to ID stage
Branches
• If the branch is taken, we have a penalty of one cycle
• For our simple design, this is reasonable
• With deeper pipelines, penalty increases and static branch prediction drastically hurts performance
• Solution: dynamic branch prediction
[Figure: state diagram with two “Predict taken” states and two “Predict not taken” states; transitions are labeled Taken and Not taken]
A 2-bit prediction scheme
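One common way to realize a 2-bit scheme is a saturating counter, sketched below. This is an illustration under that assumption; the exact transitions in the slide's state diagram may differ. The class name `TwoBitPredictor` is hypothetical.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken,
    states 2-3 predict taken. A prediction has to be wrong twice
    in a row before it flips."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def count_mispredictions(outcomes, predictor):
    """Run the predictor over a list of actual branch outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong

# A loop branch taken 9 times and then falling through, run twice.
loop_twice = ([True] * 9 + [False]) * 2
mispredicted = count_mispredictions(loop_twice, TwoBitPredictor())
```

In this run only the two loop exits are mispredicted; a 1-bit predictor would also miss the first iteration of the second loop.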
Branch Prediction
•
Sophisticated Techniques:
– A “branch target buffer” to help us look up the destination
– Correlating predictors that base prediction on global behavior
and recently executed branches (e.g., prediction for a specific
branch instruction based on what happened in previous branches)
– Tournament predictors that use different types of prediction
strategies and keep track of which one is performing best.
– A “branch delay slot” which the compiler tries to fill with a useful
instruction (make the one cycle delay part of the ISA)
•
Branch prediction is especially important because it enables other
more advanced pipelining techniques to be effective!
•
Modern processors predict correctly 95% of the time!
Improving Performance
• Try and avoid stalls! E.g., reorder these instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
•
Dynamic Pipeline Scheduling
– Hardware chooses which instructions to execute next
– Will execute instructions out of order (e.g., doesn’t wait for a
dependency to be resolved, but rather keeps going!)
– Speculates on branches and keeps the pipeline full
(may need to rollback if prediction incorrect)
•
Trying to exploit instruction-level parallelism
Advanced Pipelining
• Increase the depth of the pipeline
• Start more than one instruction each cycle (multiple issue)
• Loop unrolling to expose more ILP (better scheduling)
• “Superscalar” processors
– DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
• All modern processors are superscalar and issue multiple instructions usually with some limitations (e.g., different “pipes”)
• VLIW: very long instruction word, static multiple issue (relies more on compiler technology)
• This class has given you the background you need to learn more!
Chapter 6 Summary
• Pipelining does not improve latency, but does improve throughput
[Figure: two charts placing the designs single-cycle (Section 5.4), multicycle (Section 5.5), pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), and multiple issue with deep pipeline (Section 6.10). Axis labels: Slower to Faster; Instructions per clock (IPC = 1/CPI); 1 to Several; Use latency in instructions.]
Last Slide
Chapter Seven
Memories: Review
•
SRAM:
– value is stored on a pair of inverting gates
– very fast but takes up more space than DRAM (4 to 6 transistors)
•
DRAM:
– value is stored as a charge on capacitor (must be refreshed)
– very small but slower than SRAM (factor of 5 to 10)
[Figure: memory cell schematics; labels include word line, pass transistor, capacitor, and bit line]
Exploiting Memory Hierarchy
• Users want large and fast memories! (2004 figures)
SRAM access times are 0.5 to 5 ns at cost of $4000 to $10,000 per GB.
DRAM access times are 50 to 70 ns at cost of $100 to $200 per GB.
Disk access times are 5 to 20 million ns at cost of $0.50 to $2 per GB.
• Try and give it to them anyway
– build a memory hierarchy
[Figure: levels in the memory hierarchy. The CPU sits above Level 1, Level 2, ..., Level n; distance from the CPU in access time increases down the levels, and the size of the memory grows at each level.]
Locality
•
A principle that makes having a memory hierarchy a good idea
•
If an item is referenced,
temporal locality: it will tend to be referenced again soon
spatial locality: nearby items will tend to be referenced soon.
Why does code have locality?
•
Our initial focus: two levels (upper, lower)
– block: minimum unit of data
– hit: data requested is in the upper level
– miss: data requested is not in the upper level
Cache
• Two issues:
– How do we know if a data item is in the cache?
– If it is, how do we find it?
• Our first example:
– block size is one word of data
– "direct mapped"
For each item of data at the lower level, there is exactly one location in the cache where it might be.
e.g., lots of items at the lower level share locations in the upper level
Direct Mapped Cache
Mapping: address is modulo the number of blocks in the cache
[Figure: an 8-block cache (indices 000 to 111) and a memory. Memory addresses 00001, 01001, 10001, and 11001 map to cache block 001; addresses 00101, 01101, 10101, and 11101 map to cache block 101.]
Direct Mapped Cache
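• For MIPS:
[Figure: direct-mapped cache with 1024 one-word entries (index 0 to 1023). The 32-bit address (showing bit positions) is split into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset (bits 1–0). Each entry holds a valid bit, a 20-bit tag, and 32 bits of data; the stored tag is compared (=) with the address tag to produce Hit.]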
What kind of locality are we taking advantage of?
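A sketch of the lookup in this organization, assuming the 20/10/2 bit split of the address shown for the MIPS example (tag, index, byte offset). The names `split_address` and `DirectMappedCache` are hypothetical, and the model tracks only hits and misses, not the data.

```python
def split_address(addr, index_bits=10, offset_bits=2):
    """Split a 32-bit byte address into (tag, index, byte offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

class DirectMappedCache:
    """One-word blocks; each entry is a valid bit plus a tag."""

    def __init__(self, index_bits=10):
        self.index_bits = index_bits
        self.valid = [False] * (1 << index_bits)
        self.tags = [0] * (1 << index_bits)

    def access(self, addr):
        """Return True on a hit; on a miss, install the block."""
        tag, index, _ = split_address(addr, self.index_bits)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit

cache = DirectMappedCache()
results = [cache.access(a) for a in (0x1000, 0x1000, 0x2000, 0x1000)]
```

The last two accesses miss because 0x1000 and 0x2000 share index 0 but have different tags, so they evict each other.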
Direct Mapped Cache
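• Taking advantage of spatial locality:
[Figure: direct-mapped cache with 256 entries, each holding a valid bit, an 18-bit tag, and 512 bits of data (16 words of 32 bits). The address (showing bit positions) is split into an 18-bit tag (bits 31–14), an 8-bit index (bits 13–6), a 4-bit block offset (bits 5–2), and a 2-bit byte offset (bits 1–0). A comparator (=) produces Hit and a mux selects one 32-bit word using the block offset.]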
Hits vs. Misses
•
Read hits
– this is what we want!
•
Read misses
– stall the CPU, fetch block from memory, deliver to cache, restart
•
Write hits:
– can replace data in cache and memory (write-through)
– write the data only into the cache (write-back the cache later)
•
Write misses:
– read the entire block into the cache, then write the word
Hardware Issues
• Make reading multiple words easier by using banks of memory
[Figure: three memory organizations. a. One-word-wide memory organization: CPU, cache, bus, memory. b. Wide memory organization: CPU, multiplexor, cache, bus, memory. c. Interleaved memory organization: CPU, cache, bus, memory banks 0 to 3.]
• It can get a lot more complicated...
Performance
• Increasing the block size tends to decrease miss rate:
[Figure: miss rate (0% to 40%) versus block size (4, 16, 64, 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB]
• Use split caches because there is more spatial locality in code:
Program  Block size in words  Instruction miss rate  Data miss rate  Effective combined miss rate
gcc      1                    6.1%                   2.1%            5.4%
gcc      4                    2.0%                   1.7%            1.9%
spice    1                    1.2%                   1.3%            1.2%
spice    4                    0.3%                   0.6%            0.4%
Performance
•
Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss ratio × miss penalty
•
Two ways of improving performance:
– decreasing the miss ratio
– decreasing the miss penalty
What happens if we increase block size?
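The simplified model can be written out directly. The numbers in the example are made up for illustration (they are not from the lecture), and the function names are hypothetical.

```python
def stall_cycles(instructions, miss_ratio, miss_penalty):
    """stall cycles = # of instructions x miss ratio x miss penalty"""
    return instructions * miss_ratio * miss_penalty

def execution_time(execution_cycles, stalls, cycle_time):
    """execution time = (execution cycles + stall cycles) x cycle time"""
    return (execution_cycles + stalls) * cycle_time

# Hypothetical workload: 1,000,000 instructions at a base CPI of 1,
# a 2% miss ratio, a 100-cycle miss penalty, and a 1 ns cycle time.
stalls = stall_cycles(1_000_000, 0.02, 100)          # about 2,000,000 cycles
time_ns = execution_time(1_000_000, stalls, 1.0)     # about 3,000,000 ns

# Halving either the miss ratio or the miss penalty removes the same
# number of stall cycles in this model.
half_ratio = stall_cycles(1_000_000, 0.01, 100)
half_penalty = stall_cycles(1_000_000, 0.02, 50)
```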
Decreasing miss ratio with associativity
[Figure: an eight-block cache organized four ways. One-way set associative (direct mapped): blocks 0 to 7, one tag/data pair each. Two-way set associative: sets 0 to 3, two tag/data pairs each. Four-way set associative: sets 0 and 1, four tag/data pairs each. Eight-way set associative (fully associative): a single set of eight tag/data pairs.]
Compared to direct mapped, give a series of references that:
– results in a lower miss ratio using a 2-way set associative cache
– results in a higher miss ratio using a 2-way set associative cache
assuming we use the “least recently used” replacement strategy
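One way to explore this exercise is a small miss counter. The sketch below assumes a four-block cache, block-number references, and true LRU replacement within each set; `count_misses` is a hypothetical helper, and the two reference strings are just one possible answer.

```python
def count_misses(refs, num_blocks, ways):
    """Count misses for a set-associative cache with LRU replacement.

    refs: sequence of block numbers; ways=1 models a direct-mapped cache.
    """
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each list: least recent first
    misses = 0
    for block in refs:
        lru = sets[block % num_sets]
        if block in lru:
            lru.remove(block)              # hit: refresh its position
        else:
            misses += 1
            if len(lru) == ways:
                lru.pop(0)                 # evict the least recently used
        lru.append(block)
    return misses

better = [0, 4, 0, 4]        # blocks 0 and 4 conflict when direct mapped
worse = [0, 2, 4, 0, 2, 4]   # three blocks cycle through one 2-way set
print(count_misses(better, 4, 1), count_misses(better, 4, 2))  # 4 2
print(count_misses(worse, 4, 1), count_misses(worse, 4, 2))    # 5 6
```

In the second string, direct mapping keeps block 2 in its own entry, while the two-way cache funnels all three even-numbered blocks into one set and LRU always evicts the block needed next.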
An implementation
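[Figure: four-way set-associative cache with 256 sets (index 0 to 255), each way holding a valid bit, a 22-bit tag, and 32 bits of data. The address supplies a 22-bit tag (bits 31 to 10) and an 8-bit index (bits 9 to 2). Four tag comparators drive a 4-to-1 multiplexor that produces Hit and Data.]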
Performance
[Figure: miss rate (0 to 15%) versus associativity (one-way, two-way, four-way, eight-way), one curve per cache size: 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB]
Decreasing miss penalty with multilevel caches
•
Add a second level cache:
– often primary cache is on the same chip as the processor
– use SRAMs to add another cache above primary memory (DRAM)
– miss penalty goes down if data is in 2nd level cache
•
Example:
– CPI of 1.0 on a 5 GHz machine with a 5% miss rate, 100 ns DRAM access
– Adding 2nd level cache with 5 ns access time decreases miss rate to 0.5%
•
Using multilevel caches:
– try to optimize the hit time on the 1st level cache
– try to optimize the miss rate on the 2nd level cache
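The example on this slide can be worked through numerically. The sketch below assumes both miss rates are per instruction and that 0.5% is the fraction of instructions that still go all the way to DRAM once the second-level cache is added; those interpretations are assumptions, not statements from the slide.

```python
CLOCK_GHZ = 5.0            # 5 GHz, i.e. 5 cycles per nanosecond
BASE_CPI = 1.0
L1_MISS_RATE = 0.05        # assumed to be misses per instruction
DRAM_NS = 100.0
L2_NS = 5.0
DRAM_MISS_RATE = 0.005     # assumed: instructions that miss in both levels

def cycles(ns):
    """Convert an access time in nanoseconds to clock cycles."""
    return ns * CLOCK_GHZ

# One level: every primary miss pays the full DRAM penalty (500 cycles).
cpi_one_level = BASE_CPI + L1_MISS_RATE * cycles(DRAM_NS)

# Two levels: every primary miss pays the L2 access (25 cycles), and the
# remaining misses pay the DRAM penalty as well.
cpi_two_level = (BASE_CPI
                 + L1_MISS_RATE * cycles(L2_NS)
                 + DRAM_MISS_RATE * cycles(DRAM_NS))

print(cpi_one_level, cpi_two_level, cpi_one_level / cpi_two_level)
```

Under these assumptions the CPI drops from about 26 to about 4.75, a speedup of roughly 5.5.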
Cache Complexities
•
Not always easy to understand implications of caches:
[Figure: two plots of Radix sort vs. Quicksort against size (K items to sort, 4 to 4096): theoretical behavior and observed behavior]
Cache Complexities
•
Here is why:
[Figure: Radix sort vs. Quicksort plotted against size (K items to sort, 4 to 4096), vertical axis from 0 to 5]
•
Memory system performance is often the critical factor
– multilevel caches and pipelined processors make it harder to predict outcomes
– compiler optimizations to increase locality sometimes hurt ILP
•
Difficult to predict the best algorithm: need experimental data
Virtual Memory
•
Main memory can act as a cache for the secondary storage (disk)
[Figure: address translation maps virtual addresses to physical addresses or to disk addresses]
•
Advantages:
– illusion of having more physical memory
– program relocation
– protection
Pages: virtual memory blocks
•
Page faults: the data is not in memory, retrieve it from disk
– huge miss penalty, thus pages should be fairly large (e.g., 4KB)
– reducing page faults is important (LRU is worth the price)
– can handle the faults in software instead of hardware
– write-through is too expensive, so we use write-back
[Figure: translation of a 32-bit virtual address (virtual page number in bits 31 to 12, page offset in bits 11 to 0) into a 30-bit physical address (physical page number in bits 29 to 12, the same page offset in bits 11 to 0)]
Page Tables
[Figure: a page table indexed by virtual page number; each entry holds a valid bit and a physical page or disk address, pointing either into physical memory or into disk storage]
Page Tables
[Figure: the page table register points to the page table. The 20-bit virtual page number (bits 31 to 12 of the virtual address) indexes the table; each entry holds a valid bit and an 18-bit physical page number. If the valid bit is 0, the page is not present in memory. The physical page number (bits 29 to 12) and the unchanged 12-bit page offset (bits 11 to 0) form the physical address.]
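The page table lookup can be sketched as follows. This is a minimal model, assuming 4 KB pages (a 12-bit page offset, as on the slide) and a page table represented as a Python dictionary from virtual page number to a (valid, physical page number) pair; `translate` and `PageFault` are hypothetical names.

```python
PAGE_OFFSET_BITS = 12  # 4 KB pages, as on the slide

class PageFault(Exception):
    """Raised when the page is not present in memory (valid bit is 0)."""

def translate(vaddr, page_table):
    """Translate a virtual address using {vpn: (valid, ppn)} entries."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    valid, ppn = page_table.get(vpn, (False, None))
    if not valid:
        raise PageFault(f"virtual page {vpn:#x} is not in memory")
    # The page offset passes through unchanged.
    return (ppn << PAGE_OFFSET_BITS) | offset

page_table = {0x2: (True, 0x15), 0x5: (False, None)}
print(hex(translate(0x2ABC, page_table)))  # 0x15abc
```

In a real system the operating system handles the fault by fetching the page from disk and updating the entry; that step is left out here.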
Making Address Translation Fast
•
A cache for address translations: translation lookaside buffer
[Figure: the TLB holds a subset of the page table. Each TLB entry has valid, dirty, and reference bits, a tag (the virtual page number), and a physical page address; each page table entry has valid, dirty, and reference bits and a physical page or disk address, pointing into physical memory or disk storage.]
Typical values:
– 16 to 512 entries
– miss rate: 0.01% to 1%
– miss penalty: 10 to 100 cycles
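A TLB is a small cache in front of the page table, and a toy model makes that relationship visible. The sketch below assumes LRU replacement and that every referenced page is resident in memory; real TLBs often use cheaper replacement schemes, and the class name `TLB` and its methods are hypothetical.

```python
from collections import OrderedDict

class TLB:
    """Toy translation lookaside buffer with LRU replacement (an assumption)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.map = OrderedDict()   # vpn -> ppn, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.map:
            self.hits += 1
            self.map.move_to_end(vpn)      # mark as most recently used
            return self.map[vpn]
        self.misses += 1
        ppn = page_table[vpn]              # assumes the page is resident
        if len(self.map) == self.entries:
            self.map.popitem(last=False)   # evict the least recently used
        self.map[vpn] = ppn
        return ppn

tlb = TLB(entries=2)
page_table = {0: 10, 1: 11, 2: 12}
print([tlb.lookup(v, page_table) for v in [0, 1, 0, 2, 1]])
print(tlb.hits, tlb.misses)  # 1 4
```

With only two entries the reference to page 2 evicts page 1, so the final access misses again; a larger TLB would turn that into a hit.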
TLBs and caches
[Flowchart]
Virtual address → TLB access → TLB hit?
– No: TLB miss exception
– Yes: physical address → Write?
   – No: try to read data from cache → Cache hit?
      – No: cache miss stall while read block
      – Yes: deliver data to the CPU
   – Yes: write access bit on?
      – No: write protection exception
      – Yes: try to write data to cache → Cache hit?
         – No: cache miss stall while read block
         – Yes: write data into cache, update the dirty bit, and put the data and the address into the write buffer
TLBs and Caches
[Figure: a TLB and a cache in series. The 32-bit virtual address splits into a 20-bit virtual page number (bits 31 to 12) and a 12-bit page offset (bits 11 to 0). The TLB (valid, dirty, tag, physical page number) compares tags and produces TLB hit and a 20-bit physical page number. The physical address then splits into an 18-bit physical address tag, an 8-bit cache index, a 4-bit block offset, and a 2-bit byte offset; the cache (valid, tag, data) produces Cache hit and 32 bits of Data.]
Modern Systems
•
Things are getting complicated!
Some Issues
•
Processor speeds continue to increase very fast, much faster than either DRAM or disk access times improve
[Figure: performance (log scale, 1 to 100,000) of CPU and memory by year]
•
Design challenge: dealing with this growing disparity
– Prefetching? 3rd level caches and more? Memory design?
Chapters 8 & 9
(partial coverage)
Interfacing Processors and Peripherals
•
I/O design is affected by many factors (expandability, resilience)
•
Performance:
– access latency
– throughput
– connection between devices and the system
– the memory hierarchy
– the operating system
•
A variety of different users (e.g., banks, supercomputers, engineers)
[Figure: a processor with a cache connected over a memory-I/O bus to main memory and to I/O controllers for two disks, graphics output, and a network; interrupts go to the processor]
I/O
•
Important but neglected
“The difficulties in assessing and designing I/O systems have
often relegated I/O to second class status”
“courses in every aspect of computing, from programming to
computer architecture often ignore I/O or give it scanty coverage”
“textbooks leave the subject to near the end, making it easier
for students and instructors to skip it!”
•
GUILTY!
– we won't be looking at I/O in much detail
– be sure to read Chapter 8 in its entirety
– you should probably take a networking class!
I/O Devices
•
Very diverse devices
— behavior (i.e., input vs. output)
— partner (who is at the other end?)
— data rate
I/O Example: Disk Drives
[Figure: a disk drive made of platters; each platter is divided into tracks, and each track into sectors]
•
To access data:
– seek: position the head over the proper track (3 to 14 ms on average)
– rotational latency: wait for the desired sector (on average half a rotation, 0.5 / RPM minutes)
– transfer: read the data (one or more sectors) at 30 to 80 MB/sec
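The three components add up to the time for one disk access. The sketch below uses assumed values that fall inside the ranges above (6 ms seek, 10,000 RPM, 50 MB/sec, one 512-byte sector), takes 1 MB as 10^6 bytes, and ignores controller overhead and queuing; `disk_access_ms` is a hypothetical helper.

```python
def disk_access_ms(seek_ms, rpm, transfer_mb_per_s, nbytes):
    """Average time in ms for one disk access: seek + rotation + transfer."""
    # Half a rotation on average: 0.5 / RPM minutes, converted to ms.
    rotational_ms = 0.5 / rpm * 60_000
    # Assumes 1 MB = 10**6 bytes.
    transfer_ms = nbytes / (transfer_mb_per_s * 1e6) * 1000
    return seek_ms + rotational_ms + transfer_ms

# Assumed example: 6 ms seek, 10,000 RPM, 50 MB/sec, one 512-byte sector.
print(disk_access_ms(6, 10_000, 50, 512))  # roughly 9.01 ms
```

For a single sector the transfer time is about 0.01 ms, so seek and rotational latency dominate.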
I/O Example: Buses
•
Shared communication link (one or more wires)
•
Difficult design:
– may be a bottleneck
– length of the bus
– number of devices
– tradeoffs (buffers for higher bandwidth increase latency)
– support for many different devices
– cost
•
Types of buses:
– processor-memory (short, high speed, custom design)
– backplane (high speed, often standardized, e.g., PCI)
– I/O (lengthy, different devices, e.g., USB, Firewire)
•
Synchronous vs. asynchronous:
– synchronous: use a clock and a synchronous protocol; fast and small, but every device must operate at the same rate, and clock skew requires the bus to be short
– asynchronous: don't use a clock and instead use handshaking
I/O Bus Standards
•
Today we have two dominant bus standards:
Other important issues
•
Bus Arbitration:
— daisy chain arbitration (not very fair)
— centralized arbitration (requires an arbiter), e.g., PCI
— collision detection, e.g., Ethernet
•
Operating system:
— polling
— interrupts
— direct memory access (DMA)
•
Performance Analysis techniques:
— queuing theory
— simulation
— analysis, i.e., find the weakest link (see “I/O System
Design”)
•
Many new developments
Pentium 4
•
I/O Options
[Figure: Pentium 4 I/O organization. The Pentium 4 processor connects over the system bus (800 MHz, 6.4 GB/sec) to the memory controller hub (north bridge, 82875P). The north bridge connects to main memory DIMMs over two DDR 400 channels (3.2 GB/sec each), to graphics output over AGP 8X (2.1 GB/sec), to 1 Gbit Ethernet over CSA (0.266 GB/sec), and to the I/O controller hub (south bridge, 82801EB) at 266 MB/sec. The south bridge connects to disks over Serial ATA (150 MB/sec), to CD/DVD and tape over Parallel ATA (100 MB/sec), to stereo (surround sound) over AC/97 (1 MB/sec), to USB 2.0 (60 MB/sec), to 10/100 Mbit Ethernet (20 MB/sec), and to the PCI bus (132 MB/sec).]
Fallacies and Pitfalls
•
Fallacy: the rated mean time to failure of disks is 1,200,000 hours,
so disks practically never fail.
•
Fallacy: magnetic disk storage is on its last legs and will soon be replaced.
•
Fallacy: A 100 MB/sec bus can transfer 100 MB/sec.
•
Pitfall: Moving functions from the CPU to the I/O processor,
expecting to improve performance without analysis.
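The first fallacy is easier to see with a back-of-the-envelope calculation. The sketch below assumes a constant failure rate and continuously powered disks, and the fleet size of 1,000 disks is a made-up example; the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def annual_failure_rate(mttf_hours):
    """Approximate fraction of disks failing per year (constant-rate assumption)."""
    return HOURS_PER_YEAR / mttf_hours

def expected_failures_per_year(num_disks, mttf_hours):
    return num_disks * annual_failure_rate(mttf_hours)

print(annual_failure_rate(1_200_000))               # about 0.0073, i.e. 0.73%
print(expected_failures_per_year(1000, 1_200_000))  # about 7.3 disks per year
```

A rated MTTF of 1,200,000 hours describes a population of disks, not the lifetime of one disk: under these assumptions a site with 1,000 disks should still expect several failures every year.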
Multiprocessors
•
Idea: create powerful computers by connecting many smaller ones
– good news: works for timesharing (better than a supercomputer)
– bad news: it's really hard to write good concurrent programs
– many commercial failures
[Figure: two multiprocessor organizations: processors with caches sharing memory and I/O over a single bus, and processors with caches and their own memories connected by a network]
Questions
•
How do parallel processors share data?
— single address space (SMP vs. NUMA)
— message passing
•
How do parallel processors coordinate?
— synchronization (locks, semaphores)
— built into send / receive primitives
— operating system protocols
•
How are they implemented?
— connected by a single bus
— connected by a network
Supercomputers
Plot of top 500 supercomputer sites over a decade:
[Figure: number of systems (0 to 500) from 1993 to 2000 by architecture: uniprocessors, shared-memory multiprocessors (SMPs), massively parallel processors (MPPs), clusters (network of SMPs), clusters (network of workstations), and single instruction multiple data (SIMD)]
Using multiple processors an old idea
•
Some SIMD designs:
•
Costs for the Illiac IV escalated from $8 million in 1966 to $32 million in
1972 despite completion of only ¼ of the machine. It took three more years
before it was operational!
“For better or worse, computer architects are not easily discouraged”
Lots of interesting designs and ideas, lots of failures, few successes
Topologies
[Figures: network topologies. First figure: a. 2-D grid or mesh of 16 nodes; b. n-cube tree of 8 nodes (8 = 2^3, so n = 3). Second figure: a. crossbar; b. Omega network, each connecting processors P0 to P7.]
Clusters
•
Constructed from whole computers
•
Independent, scalable networks
•
Strengths:
– Many applications amenable to loosely coupled machines
– Exploit local area networks
– Cost effective / Easy to expand
•
Weaknesses:
– Administration costs not necessarily lower
– Connected using I/O bus
•
Highly available due to separation of memories
•
In theory, we should be able to do better
Google
•
Serves an average of 1000 queries per second
•
Google uses 6,000 processors and 12,000 disks
•
Two sites in Silicon Valley, two in Virginia
•
Each site connected to the Internet using OC48 (2488 Mbit/sec)
•
Reliability:
– On an average day, 20 machines need to be rebooted (software error)
– 2% of the machines are replaced each year
In some sense, simple ideas well executed. Better (and cheaper)
than other approaches involving increased complexity
Concluding Remarks
•
Evolution vs. Revolution
“More often the expense of innovation comes from being too disruptive
to computer users”
“Acceptance of hardware ideas requires acceptance by software
people; therefore hardware people should learn about software. And if
software people want good machines, they must learn more about hardware
to be able to communicate with and thereby influence hardware engineers.”