Transcript Lecture

Computer Systems
The processor architecture
University of Amsterdam
Arnoud Visser
Computer Systems – the processor architecture
1
Basic Knowledge
• Relative timing of the elements is important
Programmer-visible state
• Program registers: %eax, %ecx, %edx, %ebx, %esi, %edi, %esp, %ebp
• Condition codes (CC)
• Program counter (PC)
• Memory
Von Neumann architecture: both instructions and data are stored in memory
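Program counter
Virtual address space of a process (high addresses first):
0xffffffff
    Kernel virtual memory (memory invisible to user code)
0xc0000000
    User stack (created at runtime)
    Memory mapped region for shared libraries (starts at 0x40000000; holds e.g. the printf() function)
    Run-time heap (created at runtime by malloc)
    Read/write data (loaded from the hello executable file)
    Read-only code and data (loaded from the hello executable file, starts at 0x08048000)
    Unused
0
The PC points into the shared library region (e.g. printf()) or into the read-only code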
• The program counter holds the address of the instruction
currently executed
• The next instruction has to be fetched from memory (slow!)
Processing a single instruction
• Fetch
– Read the instruction (1-6 bytes) from memory
• Decode
– Read the operand values from the registers
• Execute
– Perform an arithmetic/logic operation OR test the jump conditions
• Memory
– Read/Write to memory
• Write back
– Update the registers
• PC update
– Set the address of the next instruction
Seq. architecture
• Hardware connected with named wires (word & bytes, byte & bits, bit)
[Figure: SEQ hardware, stages from bottom to top]
– Fetch: PC, instruction memory (Byte 0 goes to Split, Bytes 1-5 go to Align), PC increment; signals icode, ifun, rA, rB, valC, valP; control signals Instr valid, Need regids, Need valC
– Decode: register file with ports A, B, M, E
– Execute: ALU, CC
– Memory: data memory
– Write back
– PC
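The PC increment block in the fetch stage can be sketched in a few lines of Python. This is only an illustration: the instruction names follow the HCL constants used later in these slides (IOPL, ICALL, ...), and the two sets are an assumption about which instructions carry a register byte and a 4-byte constant.

```python
# Instructions that carry a register byte (rA:rB) after the first byte.
NEED_REGIDS = {"IRRMOVL", "IOPL", "IPUSHL", "IPOPL",
               "IIRMOVL", "IRMMOVL", "IMRMOVL"}
# Instructions that carry a 4-byte constant word (valC).
NEED_VALC = {"IIRMOVL", "IRMMOVL", "IMRMOVL", "IJXX", "ICALL"}


def next_pc(pc, icode):
    """Return valP, the address of the instruction that follows in memory."""
    length = 1                      # byte 0: icode:ifun
    if icode in NEED_REGIDS:
        length += 1                 # register byte
    if icode in NEED_VALC:
        length += 4                 # constant word
    return pc + length
```

With these sets, an OPl instruction gives valP = PC+2 and a call or jXX gives valP = PC+5, matching the stage computations on the following slides.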
Stage Computation: ALU Operation
OPl rA, rB
Stage        Computation                Comment
Fetch        icode:ifun ← M1[PC]        Read instruction byte
             rA:rB ← M1[PC+1]           Read register byte
             valP ← PC+2                Compute next PC
Decode       valA ← R[rA]               Read operand A
             valB ← R[rB]               Read operand B
Execute      valE ← valB OP valA        Perform ALU operation (OP selected by ifun)
             Set CC                     Set condition code register
Memory
Write back   R[rB] ← valE               Write back result
PC update    PC ← valP                  Update PC
– Formulate instruction execution as a sequence of simple steps
– Use the same general form for all instructions
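The six steps for OPl can be traced in software. The sketch below is an assumption-laden illustration, not the simulator from the book: the state layout (a dict with "PC", "R" and "CC") and the ifun numbering (0 = addl, 1 = subl, 2 = andl, 3 = xorl) are chosen here for the example.

```python
MASK = 0xFFFFFFFF  # registers are 32 bits wide


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement number."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


# Assumed ifun numbering: 0 addl, 1 subl, 2 andl, 3 xorl.
ALU = {
    0: lambda b, a: b + a,
    1: lambda b, a: b - a,
    2: lambda b, a: b & a,
    3: lambda b, a: b ^ a,
}


def exec_opl(state, ifun, rA, rB):
    """Run 'OPl rA, rB' on state = {'PC': int, 'R': dict, 'CC': dict}."""
    # Fetch: the instruction is 2 bytes long.
    valP = state["PC"] + 2
    # Decode: read both operands.
    valA = state["R"][rA]
    valB = state["R"][rB]
    # Execute: valE <- valB OP valA, and set the condition codes.
    exact = ALU[ifun](to_signed(valB), to_signed(valA))
    valE = exact & MASK
    state["CC"] = {
        "ZF": valE == 0,
        "SF": bool(valE & 0x80000000),
        # Only add/sub can overflow the signed 32-bit range.
        "OF": ifun in (0, 1) and not (-2**31 <= exact < 2**31),
    }
    # Memory: nothing to do for OPl.
    # Write back: the result goes to rB.
    state["R"][rB] = valE
    # PC update.
    state["PC"] = valP
    return state
```

For example, starting with %eax = 3 and %ebx = 5, `exec_opl(state, 1, "%eax", "%ebx")` models `subl %eax, %ebx` and leaves 2 in %ebx.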
Stage Computation: procedure call
call Dest
Stage        Computation                Comment
Fetch        icode:ifun ← M1[PC]        Read instruction byte
             valC ← M4[PC+1]            Read destination address
             valP ← PC+5                Compute return point
Decode       valB ← R[%esp]             Read stack pointer
Execute      valE ← valB + –4           Decrement stack pointer
Memory       M4[valE] ← valP            Write return value on stack
Write back   R[%esp] ← valE             Update stack pointer
PC update    PC ← valC                  Set PC to destination
– Use ALU to decrement stack pointer
– Store incremented PC
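The call sequence can be checked with a small sketch. The state layout is hypothetical: "M" is a dict that maps an address to the 4-byte word stored there, which is a simplification of byte-addressed memory.

```python
MASK = 0xFFFFFFFF  # 32-bit addresses and registers


def exec_call(state, dest):
    """Run 'call dest' on state = {'PC': int, 'R': dict, 'M': dict}."""
    # Fetch: 1 instruction byte + 4-byte destination.
    valC = dest
    valP = state["PC"] + 5          # return point
    # Decode: read the stack pointer.
    valB = state["R"]["%esp"]
    # Execute: the ALU computes valB + (-4).
    valE = (valB - 4) & MASK
    # Memory: push the return address.
    state["M"][valE] = valP
    # Write back: update the stack pointer.
    state["R"]["%esp"] = valE
    # PC update: continue at the destination.
    state["PC"] = valC
    return state
```

Note that the value stored on the stack is the incremented PC (valP), not the address of the call instruction itself.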
Stage Computation: jump
jXX Dest
Stage        Computation                Comment
Fetch        icode:ifun ← M1[PC]        Read instruction byte
             valC ← M4[PC+1]            Read destination address
             valP ← PC+5                Fall through address
Decode
Execute      Bch ← Cond(CC, ifun)       Take branch?
Memory
Write back
PC update    PC ← Bch ? valC : valP     Update PC
– Compute both addresses
– Choose based on setting of condition codes
and branch condition XX/ifun
Branch conditions
jXX    icode  ifun  Condition codes     Description
jmp    7      0     1                   Direct jump
jle    7      1     (SF^OF) | ZF        Less or equal <=
jl     7      2     SF^OF               Less <
je     7      3     ZF                  Equal ==
jne    7      4     ~ZF                 Not equal !=
jge    7      5     ~(SF^OF)            Greater or equal >=
jg     7      6     ~(SF^OF) & ~ZF      Greater >
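The branch conditions translate directly into boolean expressions. The following sketch assumes the condition codes are passed as a dict with the keys "ZF", "SF" and "OF"; the function names are made up for this example.

```python
def cond(cc, ifun):
    """Evaluate Cond(CC, ifun) for jXX; cc holds booleans ZF, SF, OF."""
    zf, sf, of = cc["ZF"], cc["SF"], cc["OF"]
    less = sf != of                 # SF ^ OF
    table = {
        0: True,                    # jmp
        1: less or zf,              # jle
        2: less,                    # jl
        3: zf,                      # je
        4: not zf,                  # jne
        5: not less,                # jge
        6: (not less) and not zf,   # jg
    }
    return table[ifun]


def jump_new_pc(pc, cc, ifun, dest):
    """PC update for jXX: PC <- Bch ? valC : valP."""
    valP = pc + 5                   # fall through address
    return dest if cond(cc, ifun) else valP
```

As an example, after a subtraction that yields -1 (ZF = 0, SF = 1, OF = 0), jl and jle are taken while jge and jg fall through.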
Execute Logic
Datapaths & Control Logic
– ALU fun: select function
– ALU A: select input A
– ALU B: select input B
– Set CC: should the condition code register be loaded?
[Figure: execute stage with the units ALU, CC and bcond and the control blocks Set CC, ALU fun., ALU A and ALU B; inputs icode, ifun, valC, valA, valB; outputs valE and Bch]
Control logic: ALU A
Instruction         Execute stage              Comment
OPl rA, rB          valE ← valB OP valA        Perform ALU operation
rmmovl rA, D(rB)    valE ← valB + valC         Compute effective address
popl rA             valE ← valB + 4            Increment stack pointer
jXX Dest                                       No operation
call Dest           valE ← valB + –4           Decrement stack pointer
ret                 valE ← valB + 4            Increment stack pointer
int aluA = [
icode in { IRRMOVL, IOPL } : valA;
icode in { IIRMOVL, IRMMOVL, IMRMOVL } : valC;
icode in { ICALL, IPUSHL } : -4;
icode in { IRET, IPOPL } : 4;
# Other instructions don't need ALU
];
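For readers who do not know HCL: the case expression above picks the first alternative whose condition holds. A rough Python equivalent, with the instruction codes written as strings purely for illustration, could look like this:

```python
def alu_a(icode, valA, valC):
    """Select the A input of the ALU, mirroring the HCL case expression."""
    if icode in {"IRRMOVL", "IOPL"}:
        return valA
    if icode in {"IIRMOVL", "IRMMOVL", "IMRMOVL"}:
        return valC
    if icode in {"ICALL", "IPUSHL"}:
        return -4
    if icode in {"IRET", "IPOPL"}:
        return 4
    # Other instructions don't need the ALU; None stands for "don't care".
    return None
```

In hardware this is a multiplexer: icode drives the select lines, and valA, valC, -4 and 4 are the data inputs.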
Hardware structure
• This can be translated into silicon
[Figure: SEQ hardware structure, stages from bottom to top]
– Fetch: PC, instruction memory, PC increment; outputs icode, ifun, rA, rB, valC, valP
– Decode: register file with ports A, B, M, E; signals srcA, srcB, dstE, dstM, valA, valB
– Execute: ALU, CC, ALU fun., ALU A, ALU B; outputs valE, Bch
– Memory: data memory with Mem. control (read, write), Addr, Data; output valM
– Write back
– New PC: newPC
Sequential is too slow
• The clock has to be slow enough to let the signals
propagate through all wires and transistors
[Figure: clock signal Clk]
• Critical path: the slowest path between any
two storage devices
Pipelining
• Divide each operation into stages and start the next
operation as soon as the previous operation has finished
its first stage
[Figure: three-stage pipeline driven by one Clock: Comb. logic A (100 ps) → Reg (20 ps) → Comb. logic B (100 ps) → Reg (20 ps) → Comb. logic C (100 ps) → Reg (20 ps)]
• Increases the throughput, but also increases the latency
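The trade-off can be made concrete with the numbers from the figure (100 ps per logic block, 20 ps per register). The helper below is a hypothetical sketch; it assumes one instruction enters the pipeline every cycle and that the clock period is set by the slowest stage plus one register delay.

```python
def pipeline_timing(stage_delays_ps, reg_delay_ps):
    """Return (cycle time in ps, latency in ps, throughput in GIPS)."""
    # The clock must accommodate the slowest stage plus its register.
    cycle = max(stage_delays_ps) + reg_delay_ps
    # One instruction passes through every stage, one cycle each.
    latency = cycle * len(stage_delays_ps)
    # One instruction completes per cycle; 1000 / ps gives 10^9 per second.
    throughput_gips = 1000.0 / cycle
    return cycle, latency, throughput_gips


unpipelined = pipeline_timing([300], 20)            # all logic in one stage
pipelined = pipeline_timing([100, 100, 100], 20)    # the three-stage version
```

Under these assumptions the unpipelined circuit needs a 320 ps cycle with 320 ps latency, while the three-stage pipeline runs at a 120 ps cycle with 360 ps latency: the throughput rises by a factor of about 2.67 and the latency grows by 40 ps.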
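Insert registers between stages
• Pipeline registers mean extra silicon and delay
Cycle   1  2  3  4  5  6  7  8  9
I1      F  D  E  M  W
I2         F  D  E  M  W
I3            F  D  E  M  W
I4               F  D  E  M  W
I5                  F  D  E  M  W
In cycle 5: W holds I1, M holds I2, E holds I3, D holds I4, F holds I5
[Figure: PIPE hardware with the pipeline registers F, D, E, M, W inserted between the stages]
– F register and Fetch: predPC, f_PC, PC increment, instruction memory; outputs icode, ifun, rA, rB, valC, valP
– D register and Decode: register file with ports A, B, M, E; signals d_srcA, d_srcB, valA, valB
– E register and Execute: aluA, aluB, ALU, CC; outputs valE, Bch
– M register and Memory: M_icode, M_Bch, M_valA; data memory with Addr, Data; output valM
– W register and Write back: W_icode, W_valE, W_valM, W_dstE, W_dstM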
Data hazards
Additional pipeline control is needed to prevent
unintended interactions between instructions
• Stalling (hold an instruction back for a few cycles until the hazard is gone)
• Data forwarding (pass a value to the execute stage before it has gone through memory/write back)
Pipeline architecture already used for i386
http://www.pcmech.com/show/processors/35/
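How many cycles stalling costs can be estimated with a small model. The sketch below makes several assumptions that are not on the slide: a five-stage F D E M W pipeline without forwarding, registers read in the decode stage, and a result written in the write-back stage that only becomes readable in the following cycle. Each instruction is described by a hypothetical (destination, sources) pair.

```python
def count_stall_bubbles(program):
    """Count bubbles needed when hazards are handled by stalling only.

    program is a list of (dst, srcs): dst is a register name or None,
    srcs is a list of register names read in the decode stage.
    """
    ready = {}          # register -> first cycle in which decode may read it
    decode_cycle = 0    # cycle in which the previous instruction was decoded
    bubbles = 0
    for dst, srcs in program:
        earliest = decode_cycle + 1
        needed = max([ready.get(r, 0) for r in srcs], default=0)
        actual = max(earliest, needed)
        bubbles += actual - earliest
        decode_cycle = actual
        if dst is not None:
            # Write back is 3 stages after decode; readable one cycle later.
            ready[dst] = decode_cycle + 4
    return bubbles


# irmovl $10,%edx ; irmovl $3,%eax ; addl %edx,%eax
prog = [("%edx", []), ("%eax", []), ("%eax", ["%edx", "%eax"])]
```

In this model the addl has to wait three cycles for %eax; every independent instruction placed between the producer and the consumer removes one bubble. Data forwarding avoids most of these stalls by taking the value straight from a later pipeline stage.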
Pipeline efficiency
Pipeline control can prevent many, but not all
interactions between instructions → bubbles
For the model described in the book:
• Load / Use hazards
(20% of load instr. → 1 bubble)
• Mispredicted branches
(40% of jmp instr. → 2 bubbles)
• Return from procedure calls
(100% of ret instr. → 3 bubbles)
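The effect of these bubbles on performance can be expressed as cycles per instruction (CPI). The penalties and hazard rates below come from the list above; the instruction mix passed in the example (25% loads, 20% conditional jumps, 2% returns) is an assumed, illustrative workload and not a measurement.

```python
def cpi(load_frac, jump_frac, ret_frac):
    """Average cycles per instruction: 1 plus the expected bubbles."""
    load_penalty = load_frac * 0.20 * 1   # 20% of loads cause 1 bubble
    jump_penalty = jump_frac * 0.40 * 2   # 40% of jumps cause 2 bubbles
    ret_penalty = ret_frac * 1.00 * 3     # every ret causes 3 bubbles
    return 1.0 + load_penalty + jump_penalty + ret_penalty


example_cpi = cpi(load_frac=0.25, jump_frac=0.20, ret_frac=0.02)
```

For this assumed mix the penalties are 0.05, 0.16 and 0.06 cycles, so the CPI is about 1.27; mispredicted branches dominate, which suggests that better branch prediction pays off most.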
Today’s architectures
• Superscalar (Pentium)
(often two instructions/cycle)
• Dynamic execution (P6)
(three instructions out-of-order/cycle)
• Explicit parallelism (Itanium)
(six execution units)
Hyper-Threading
http://or1cedar.intel.com/media/training/detect_ht_dt_v1/tutorial/ch6/topic04.htm
Metrics of performance
Answers per month
Application
Scaling of algorithms
Programming
Language
Compiler
(Millions of) instructions per second – MIPS
(Millions of) floating-point operations per second – MFLOP/s
ISA
Datapath
Control
Megabytes per second
Function Units
Transistors Wires Pins
Cycles per second (clock rate)
Each metric has a place and a purpose, and each can be optimized
Summary
• Shown that an instruction set architecture
can be translated onto multiple processor
architectures
– Complicated control logic on datapaths
– Compilers have to optimize the control logic for
multiple machines/targets
– A programmer can aid or frustrate the compiler
Assignment
• Practice Problem 4.26 (page 430)
Block   A      B      C      D      E      F      Reg
Delay   80 ps  30 ps  60 ps  50 ps  70 ps  10 ps  20 ps
Calculate the throughput and latency of an n-stage pipeline for the given 6 blocks