Introduction to Processor Architecture
Contents
• Introduction
• Processor architecture overview
• ISA (Instruction Set Architecture)
  – RISC example (SMIPS)
  – CISC example (Y86)
• Processor architecture
  – Single-cycle processor example (SMIPS)
• Pipelining
  – Control hazard
  – Branch predictor
  – Data hazard
• Cache memory
Introduction
Processors
• What is a processor?
• What are the differences among them?
Processor architecture and program
• By understanding the architecture, you have more opportunities to optimize your program.
• Let's see some examples.
Example1
1 • for(i=0 ; i<size ; i++) {
        for(j=0 ; j<size ; j++) {
            sum += array[i][j];
        }
    }
2 • for(j=0 ; j<size ; j++) {
        for(i=0 ; i<size ; i++) {
            sum += array[i][j];
        }
    }
Keyword : Cache
Example2 (1/2)
1 • for(i=0 ; i<size ; i++) {
        if(i%2 == 0) {
            action_even();
        }
        else {
            action_odd();
        }
    }
Example2 (2/2)
2 • for(i=0 ; i<size ; i += 2) {
        action_even();
    }
    for(i=1 ; i<size ; i += 2) {
        action_odd();
    }
Keyword : Branch predictor and pipeline
Processor architecture overview
Von Neumann Architecture
• Input -> process -> output model
• Integrated Instruction Memory and Data Memory
Basic components of x86 CPU
[Diagram] The main components of an x86 CPU:
• Register file: %eax, %ecx, %edx, %ebx, %esp, %ebp, %esi, %edi
• Program counter: %eip
• Status registers: zero flag, sign flag, overflow flag, carry flag
• CPU pipeline: Fetch -> Decode -> Execution Units -> Commit
• Cache memory ($)
• Memory (external)
Register file
• What is a register?
A simple memory element (such as an edge-triggered flip-flop)
Register file
• A collection of registers
  – 8 registers are visible to the programmer
• In fact, there are many more registers hidden for other uses.
  ex) There are 168 registers in Intel's Haswell
Program counter
• Points to the address of the instruction that the processor should execute in the next cycle
• %eip is the name of the program counter register in x86
• The naming convention differs across ISAs (Instruction Set Architectures)
Status registers
• The zero, sign, overflow, and carry flags
• Also a collection of registers
• Boolean registers that represent the processor's status
• Used to evaluate conditions
Memory
• Main memory, usually DRAM
• In the Von Neumann architecture, instructions (code) and data are on the same memory module
CPU pipeline
Fetch -> Decode -> Execution Units -> Commit
• Where the actual operations occur
• Details will be explained later
Instruction Set Architecture
Instruction Set Architecture (ISA)
• How you actually talk to a Processor
Instruction Set Architecture (ISA)
• Mapping between assembly code and machine code
– Which assembly instructions are included?
– How assembly instructions are represented as byte codes
Instruction
• A command that makes the processor perform a specific task
  – Ex1) mov 4(%eax), %esp (x86)
    -> move the data at address (%eax) + 4 into %esp
  – Ex2) irmovl %eax, $256 (Y86)
    -> store the value 256 into register %eax
Representation of instructions
[Encoding diagram] Byte layouts of some Y86 instructions (one hex digit per field):
  pushl rA     : A 0 | rA F
  popl  rA     : B 0 | rA F
  irmovl V, rB : 3 0 | F rB | Immediate Value (4 bytes)
• Instructions are represented in byte codes
– Pushl %ebx => 0xa01f
– Irmovl %eax, $256 => 0x30f000010000
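As a rough illustration (a sketch of my own, not from the lecture), the leading bytes of a Y86 instruction that has register operands can be split into nibbles like this:

#include <stdint.h>

/* Split the leading bytes of a Y86 instruction into 4-bit fields.
 * Only meaningful for instructions of 2 or more bytes (e.g. pushl, OPl, irmovl). */
typedef struct { uint8_t icd, ifun, ra, rb; } y86_fields;

static y86_fields y86_decode_head(const uint8_t *bytes) {
    y86_fields f;
    f.icd  = bytes[0] >> 4;      /* instruction code, e.g. 0xA for pushl        */
    f.ifun = bytes[0] & 0x0F;    /* function code                               */
    f.ra   = bytes[1] >> 4;      /* register A index (0xF means "no register")  */
    f.rb   = bytes[1] & 0x0F;    /* register B index                            */
    return f;
}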
CISC vs RISC
CISC example: Y86
RISC example: SMIPS
CISC
• Basic idea: give programmers powerful instructions; fewer instructions are needed to complete a task
• One instruction does multiple pieces of work
• A lot of instructions! (over 300 in x86)
• Many instructions can access memory
• Variable instruction length
RISC
• Basic idea: build complex programs out of simple instructions
• Each instruction does only one task
• Small instruction set (about 100 in MIPS)
• Only load and store instructions can access memory
• Fixed instruction length
RISC example: SMIPS ISA
Instruction formats
[Format diagram]
  R-type: opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | func (6)
  I-type: opcode (6) | rs (5) | rt (5) | immediate (16)
  J-type: opcode (6) | target (26)
• Only three formats, but the fields are used differently by different types of instructions
Instruction formats
• Computational Instructions
  Register-register: 0 (6) | rs (5) | rt (5) | rd (5) | 0 (5) | func (6)      rd <- (rs) func (rt)
  Register-immediate: opcode (6) | rs (5) | rt (5) | immediate (16)           rt <- (rs) op immediate
• Load/Store Instructions
  opcode [31:26] | rs [25:21] | rt [20:16] | displacement [15:0]
  addressing mode: (rs) + displacement
  rs is the base register
  rt is the destination of a Load or the source for a Store
Control instructions
• Conditional (on GPR) PC-relative branch: BEQZ, BNEZ
  opcode (6) | rs (5) | unused (5) | offset (16)
  – target address = (offset in words) × 4 + (PC + 4)
  – range: 128 KB
• Unconditional register-indirect jumps: JR, JALR
  opcode (6) | rs (5) | ...
• Unconditional absolute jumps: J, JAL
  opcode (6) | target (26)
  – target address = {PC<31:28>, target × 4}
  – range: 256 MB
  – jump-&-link stores PC+4 into the link register (R31)
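A small sketch of these target calculations in C (my own illustration; word-aligned PCs assumed):

#include <stdint.h>

/* PC-relative branch (BEQZ/BNEZ): the 16-bit offset is counted in words. */
static inline uint32_t branch_target(uint32_t pc, int16_t offset) {
    return (pc + 4) + (uint32_t)((int32_t)offset * 4);
}

/* Absolute jump (J/JAL): keep PC<31:28>, replace the rest with target * 4. */
static inline uint32_t jump_target(uint32_t pc, uint32_t target26) {
    return (pc & 0xF0000000u) | ((target26 & 0x03FFFFFFu) << 2);
}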
CISC example: Y86 ISA
Instruction formats
[Format diagram] Y86 instructions are variable length:
  1 byte : iCd iFun
  2 bytes: iCd iFun | rA rB
  5 bytes: iCd iFun | Destination (4 bytes)
  6 bytes: iCd iFun | rA rB | Immediate/Offset (4 bytes)
iCd : Instruction code
iFun : Function code
rA, rB : Register index
1 byte instructions – halt, nop
  halt : 0 0
  nop  : 1 0
halt: Used as a sign of program termination
  - Changes the processor state to halted (HLT)
nop: No operation. Used as a bubble.
2 byte instruction – opl
  OPl : 6 Op | rA rB
OPl: Performs the 4 basic ALU operations: add, sub, and, xor
  - R[rB] <- R[rB] Op R[rA]
  - Condition flags are set depending on the result.
5 byte instruction – call
  call Dest : 8 0 | Destination (4 bytes)
call:
  - R[esp] <- R[esp] - 4      (Update the stack pointer; move the stack top)
  - M[R[esp]] <- PC + 5       (Store the return address on the stack top)
  - PC <- Destination         (Jump to the Destination address)
6 byte instructions – rmmov, mrmov
  rmmovl rA, Offset(rB) : 4 0 | rA rB | Offset (4 bytes)
  mrmovl Offset(rB), rA : 5 0 | rA rB | Offset (4 bytes)
rmmovl: Store
  - target address = R[rB] + Offset
  - M[target address] <- R[rA]
mrmovl: Load
  - source address = R[rB] + Offset
  - R[rA] <- M[source address]
Processor Architecture
Simple processor architecture
Simplified version (greatly simplified)
[Diagram] A large block of sequential logic, driven by a clock, loads program code from memory, stores data to memory, and outputs register values.
Sequential design
[Diagram] %eip -> Fetch -> Decode -> Execution Units -> Commit, with the Register File and Memory connected to the stages.
Fetch unit
[Diagram] The Fetch unit:
  1) Get the PC (%eip)
  2) Request the next instruction from memory
  3) Get the next instruction
  4) Update the PC
  5) Pass the next instruction (byte code) to Decode
Decode unit(1/2)
[Diagram] The Decode unit (combinational logic):
  1) Split the input instruction into its fields (iCd, fCd, rA, rB, imm)
  2) Fill an information structure for execution: Inst Type, Target Register A, Target Register B, Immediate value, Register value A, Register value B, ... (the fields depend on the ISA)
Decode unit(2/2)
[Diagram] The Decode unit, continued:
  3) Read the register values from the Register File and fill in Register value A and Register value B of the decoded instruction structure.
Execution unit(1/2)
[Diagram] The Execute unit (combinational logic with an ALU):
  1) Select the inputs for the ALU
  2) Perform the appropriate ALU operation
  3) Using the ALU result, fill an information structure for the memory and register update: Inst Type, Target Register, Memory Addr, Register Data, Memory Data
Execution unit(2/2)
[Diagram] The Execute unit, continued:
  4) Perform memory operations (Ld, St) against Memory
  5) Update the Memory Data field of the executed instruction (if a load instruction was executed)
Commit unit
[Diagram] The Commit unit (register update logic): using the executed instruction's fields (Inst Type, Target Register, Memory Addr, Register Data, Memory Data), write the result back to the Register File.
Single-cycle processor example: SMIPS
Single-Cycle SMIPS
[Diagram] PC -> Inst Memory -> Decode -> Execute -> Data Memory, with a Register File (2 read & 1 write ports) and a +4 incrementer on the PC.
  - SMIPS instructions are all 4 bytes long
  - Separate Instruction & Data memories
Single-Cycle SMIPS
module mkProc(Proc);
  Reg#(Addr) pc <- mkRegU;
  RFile      rf <- mkRFile;
  IMemory  iMem <- mkIMemory;
  DMemory  dMem <- mkDMemory;

  rule doProc;
    let inst = iMem.req(pc);
    let dInst = decode(inst);
    let rVal1 = rf.rd1(validRegValue(dInst.src1));
    let rVal2 = rf.rd2(validRegValue(dInst.src2));
    let eInst = exec(dInst, rVal1, rVal2, pc);
    if(eInst.iType == Ld)
      eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?});
    else if(eInst.iType == St)
      let dummy <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data});
    if(isValid(eInst.dst))
      rf.wr(validRegValue(eInst.dst), eInst.data);
    pc <= eInst.brTaken ? eInst.addr : pc + 4;
  endrule
endmodule
Single-Cycle SMIPS
module mkProc(Proc);
  Reg#(Addr) pc <- mkRegU;
  RFile      rf <- mkRFile;
  IMemory  iMem <- mkIMemory;
  DMemory  dMem <- mkDMemory;
• Declaration of components
Single-Cycle SMIPS
rule doProc;
let inst = iMem.req(pc);
Single-Cycle SMIPS
let dInst = decode(inst);
Single-Cycle SMIPS
let rVal1 = rf.rd1(validRegValue(dInst.src1));
let rVal2 = rf.rd2(validRegValue(dInst.src2));
Single-Cycle SMIPS
let eInst = exec(dInst, rVal1, rVal2, pc);
Single-Cycle SMIPS
if(eInst.iType == Ld)
eInst.data <- dMem.req(MemReq{op: Ld,addr:eInst.addr,data: ?});
else if(eInst.iType == St)
let dummy <- dMem.req(MemReq{op: St,addr: eInst.addr,data: eInst.data});
Single-Cycle SMIPS
if (isValid(eInst.dst))
rf.wr(validRegValue(eInst.dst), eInst.data);
Single-Cycle SMIPS
pc <= eInst.brTaken ? eInst.addr : pc + 4;
Improve processor performance: Pipelining
Pipelining
• Introduces the idea of a conveyor-belt (assembly-line) process
Pipelining
• Introduces the idea of a conveyor-belt process
[Diagram] Fetch -> Decode -> Execution Units -> Commit, with a FIFO or register between adjacent stages; the Register File and Memory are shared by the stages.
Pipelining
[Diagram] The 4-stage pipeline (Fetch, Decode, Execution Units, Commit) with all stages occupied.
• In this case, 4 instructions are executing at the same time
Pipelining
• Is the throughput the same?
  – Sequential: 1 instruction / 1 cycle
  – Pipelined: 4 instructions / 4 cycles
• No, the pipelined design lets us use a faster clock
• Because the amount of work per cycle is smaller, we can use a shorter clock period
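As a rough illustration with assumed numbers: if the combinational work for one instruction takes about 4 ns, a sequential design needs a 4 ns clock period and finishes 0.25 instructions per ns; splitting the work into 4 roughly equal stages allows a ~1 ns clock, so once the pipeline is full it finishes about 1 instruction per ns, roughly a 4x throughput gain (ignoring stage-register overhead and hazards).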
Pipeline hazard - control hazard
Control hazard (1/5)
• Which path will this assembly code take?
mov %ecx, %ebx
subl %eax, (%ebx)
je B
A:
addl %ecx, %edi
B:
leave
ret
• We don't know the branch condition before running the code
Control hazard (2/5)
[Diagram] Instructions currently in the pipeline (execution flow): mov %ecx, %ebx; subl %eax, (%ebx); je B
• What's next? addl? leave?
Control hazard (3/5)
• Which path will this assembly code take?
mov %ecx, %ebx
subl %eax, (%ebx)
je B
A:
addl %ecx, %edi
mov %edi, %eax
B:
leave
ret
• Since we can't know the future, we simply fetch the instruction at the next sequential position
Control hazard (4/5)
[Diagram] Instructions currently in the pipeline (execution flow): subl %eax, (%ebx); je B; addl %ecx, %edi
• What if the jump occurs?
Control hazard (5/5)
[Diagram] Instructions currently in the pipeline (execution flow): je B; addl %ecx, %edi; mov %edi, %eax
• The wrong instructions were executing
Control hazard - analysis
• We must discard some instructions when we mispredict the branch direction
• The longer the pipeline, the more instructions must be discarded on a branch mispredict
Epoch method
• Add an epoch register in the processor state
• The Execute stage changes the epoch whenever the
pc prediction is wrong and sets the pc to the correct
value
• The Fetch stage associates the current epoch with
every instruction when it is fetched
The epoch of the instruction is examined when it is ready to execute. If the processor epoch has changed, the instruction is thrown away.
[Diagram] Fetch (with a predicted next PC and the iMem port) and Execute stages, an Epoch register, the f2d queue carrying fetched instructions, and a targetPC path from Execute back to Fetch.
Epoch method - Summary
• Add a 'tag' (epoch) to each instruction
• There are two tag keepers: one in Fetch and one in Execute
• If they recognize that something went wrong (a misprediction), they change the tag they test against
Epoch method example (SMIPS)
[Diagram] PC -> Inst Memory -> f2d FIFO -> Decode / Register File / Execute -> Data Memory, with fEpoch in Fetch, eEpoch in Execute, a +4 path for the predicted PC, and a redirect FIFO from Execute back to Fetch.
Execute sends information about the target pc to Fetch, which updates fEpoch and pc whenever it looks at the redirect PC FIFO.
Epoch method example (SMIPS)
rule doFetch;
  let inst = iMem.req(pc);
  let ppc = nextAddr(pc);
  pc <= ppc;
  f2d.enq(Fetch2Decode{pc: pc, ppc: ppc, epoch: epoch, inst: inst});
endrule

rule doExecute;
  let x = f2d.first;
  let inpc = x.pc; let inEp = x.epoch;
  let ppc = x.ppc; let inst = x.inst;
  if(inEp == epoch) begin
    let dInst = decode(inst); ... register fetch ...;
    let eInst = exec(dInst, rVal1, rVal2, inpc, ppc);
    ... memory operation ...
    ... rf update ...
    if(eInst.mispredict) begin
      pc <= eInst.addr; epoch <= inEp + 1;
    end
  end
  f2d.deq;
endrule
Branch Prediction
Need to predict next PC
• We want to fetch an instruction every cycle
• But as we saw in the control hazard part, we can't know exactly which instruction comes next
• So we must predict the next instruction
How to predict next PC?
• We can rely on history
  – Record the branch history and make use of it
2-bit counter branch predictor
[State diagram] Four states: Strongly taken, Weakly taken, Weakly not taken, Strongly not taken
2-bit counter branch predictor
• Assume 2 BP bits per instruction
• Use a saturating counter
    11  Strongly taken
    10  Weakly taken
    01  Weakly ¬taken
    00  Strongly ¬taken
  (move up on taken, move down on ¬taken)
Direction prediction changes only after two successive bad predictions
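A minimal sketch of such a counter in C (my own illustration, not from the lecture), with values 0..3 standing for strongly ¬taken up to strongly taken:

#include <stdbool.h>
#include <stdint.h>

static inline bool predict_taken(uint8_t ctr) {
    return ctr >= 2;                             /* 2 and 3 are the "taken" states */
}

static inline uint8_t update_counter(uint8_t ctr, bool taken) {
    if (taken) return ctr < 3 ? ctr + 1 : 3;     /* saturate at strongly taken  */
    else       return ctr > 0 ? ctr - 1 : 0;     /* saturate at strongly ¬taken */
}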
Branch History Table (BHT)
[Diagram] k bits of the Fetch PC (the low two "00" byte-offset bits are ignored) index a 2^k-entry BHT with 2 bits per entry; the entry gives the taken/¬taken prediction, and the instruction's opcode and offset give the target PC.
After decoding the instruction, if it turns out to be a branch, we can consult the BHT using the pc; if this prediction is different from the incoming predicted pc, we can redirect Fetch.
A 4K-entry BHT with 2 bits/entry gives ~80-90% correct direction predictions.
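A small sketch of how such a table could be indexed (sizes and names are assumptions for illustration):

#include <stdbool.h>
#include <stdint.h>

#define BHT_BITS 12                      /* 2^12 = 4K entries, 2 bits each */
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t bht[BHT_SIZE];            /* each entry is a 2-bit saturating counter (0..3) */

static inline uint32_t bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_SIZE - 1);   /* drop the "00" byte-offset bits, keep k index bits */
}

static inline bool bht_predict(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;      /* taken if the counter is in a taken state */
}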
Let’s see program example again
• for(i=0 ; i<size ; i++) {
      if(i%2 == 0) {
          action_even();
      }
      else {
          action_odd();
      }
  }
Pipeline hazard - data hazard
Data hazard by flow dependence
• Sometimes an instruction uses the result of a former instruction
I1 addl %eax, %ebx
I2 subl %ebx, %ecx
I2 must wait until I1 updates the register file,
so that I2 can see the result.
Dealing with data hazards
• We can wait until the desired value is updated
  – Stall method
• Or, we can forward the value directly
  – Bypass method
Pipeline stalling example
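A rough timeline for the I1/I2 pair above (assumptions: the 4-stage Fetch/Decode/Execute/Commit pipeline from earlier, registers read in Decode and written in Commit):

cycle                 1   2   3   4   5   6   7
I1 addl %eax, %ebx    F   D   E   C
I2 subl %ebx, %ecx        F   -   -   D   E   C    (I2 stalls until I1 has committed %ebx)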
Data bypass example
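A corresponding sketch with bypassing (assumption: I1's ALU result is forwarded from Execute directly into I2's Execute, so no stall is needed):

cycle                 1   2   3   4   5
I1 addl %eax, %ebx    F   D   E   C
I2 subl %ebx, %ecx        F   D   E   C            (%ebx forwarded from I1's Execute to I2's Execute)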
Improve processor performance: Cache Memory
Memory operations are the bottleneck!
• Memory transfer rate of DDR3 RAM
  – Peak transfer rate: 6400 MB/s
• Assume we have a single-core processor with a 3.0 GHz clock, and we process one word (32 bits) every cycle.
  – That is approximately 12 GB/s
    • 3.0 GHz * 32 bits (word size) / 8 bits per byte
Memory hierarchy
Locality of reference
• Temporal locality
– If a value is used, it is likely to be used again soon.
    mov $100, %ecx        // %ecx = 100 (loop counter)
    mov Array, %ebx       // %ebx = &Array
    xorl %eax, %eax       // %eax = 0
Loop:
    mov (%ebx), %esi      // %esi = *(%ebx)
    addl %esi, %eax       // %eax += %esi
    addl $4, %ebx         // %ebx += 4
    subl $1, %ecx         // %ecx = %ecx - 1
    jne Loop
Fin:
    leave
    ret
Locality of reference
• Spatial locality
  – If a value is used, nearby values are likely to be used
for(i=0; i<size; i++) {
for(j=0; j<size; j++) {
sum += array[i][j];
}
}
Cache memory
• A buffer between processor and memory
– Often several levels of caches
• Small but fast
– Old values will be removed from cache to make
space for new values
• Capitalizes on spatial locality and temporal
locality
• Parameters vary by system – unknown to
programmer
Cache memory
• Cache memories are small, fast SRAM-based memories
managed automatically in hardware.
– Hold frequently accessed blocks of main memory
• CPU looks first for data in L1, then in L2, then in main memory.
• Typical bus structure:
[Diagram] On the CPU chip: register file, ALU, L1 cache, and bus interface; a cache bus connects to the L2 cache; the system bus connects the bus interface to an I/O bridge, which connects over the memory bus to main memory.
Structure of cache memory
• Basically, an array of memory elements
[Diagram] Each cache entry holds an Address Tag and a Data Block of several data bytes; for example, the entry with tag 100 holds a copy of main memory locations 100, 101, ... (other tags shown: 304, 6848, 416).
How many bits are needed for the tag? Enough to uniquely identify the block.
Read behavior
Search the cache tags to find a match for the processor-generated address.
• Found in cache (a.k.a. HIT)
  – Return a copy of the data from the cache
• Not in cache (a.k.a. MISS)
  – Read the block of data from main memory (may require writing back a cache line)
  – Return the data to the processor and update the cache
  – Which line do we replace?
Write behavior
• On a write hit
  – Write-through: write to both the cache and the next-level memory
  – Write-back: write only to the cache, and update the next-level memory when the line is evicted
• On a write miss
  – Allocate: because of multi-word lines we first fetch the line, and then update a word in it
  – No-allocate: the word is modified only in memory
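A compact sketch of these policy choices on a toy one-line cache (all names and the one-word-line model are illustrative assumptions, not from the lecture):

#include <stdbool.h>
#include <stdint.h>

typedef enum { WRITE_THROUGH, WRITE_BACK } hit_policy;
typedef enum { WRITE_ALLOCATE, NO_ALLOCATE } miss_policy;

typedef struct { bool valid, dirty; uint32_t addr, data; } line;   /* one cached word */

void cache_write(line *l, uint32_t mem[], uint32_t addr, uint32_t value,
                 hit_policy hp, miss_policy mp) {
    if (!(l->valid && l->addr == addr)) {                          /* write miss */
        if (mp == NO_ALLOCATE) { mem[addr] = value; return; }      /* modify memory only */
        if (l->valid && l->dirty) mem[l->addr] = l->data;          /* write back the victim */
        *l = (line){ .valid = true, .dirty = false, .addr = addr, .data = mem[addr] }; /* fetch line */
    }
    /* the line is now present: handle as a write hit */
    l->data = value;
    if (hp == WRITE_THROUGH) mem[addr] = value;                    /* also update the next level now  */
    else                     l->dirty = true;                      /* write-back: defer until eviction */
}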
Direct mapped cache
Cache line size
• A cache line usually holds more than one word (32 bits)
  – Reduces the number of tags and the tag size needed to identify memory locations
  – Exploits spatial locality
Types of misses
• Compulsory misses (cold start) - a cold fact of life!
  – Occur the first time data is referenced
  – Over billions of instructions they become insignificant
• Capacity misses
  – The working set is larger than the cache size
  – Solution: increase the cache size
• Conflict misses
  – Usually multiple memory locations are mapped to the same cache location to simplify implementations
  – Thus it is possible that the designated cache location is full while there are empty locations in the cache
  – Solution: set-associative caches
Direct-mapped cache
[Diagram] The requested address is split into Tag (t bits), Index (k bits), and Block offset (b bits). The Index selects one of 2^k lines, each holding a Valid bit, a Tag, and a Data Block; the stored tag is compared with the address tag, and on a match (HIT) the selected Data Word or Byte is returned.
What is a bad reference pattern? Accesses strided by the size of the cache.
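A sketch of that bad pattern (cache assumed direct-mapped and 32 KB; not from the lecture): if two arrays are placed exactly one cache-size apart, a[i] and b[i] always fall on the same cache line index and evict each other on every iteration:

#include <stddef.h>

/* With a direct-mapped cache, if a and b are 32 KB apart (the assumed cache size),
 * every pair a[i], b[i] maps to the same line, so each access is a conflict miss. */
float dot(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}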
Addressing example
Address fields of the requested address: Tag = 524, Index = 270, Block offset = 1
• The fields are obtained by bitwise truncation of the address
• Go to the 270th cache entry and compare the Tag
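A small sketch of this bitwise truncation in C (the field widths are assumptions for illustration):

#include <stdint.h>

enum { OFFSET_BITS = 2, INDEX_BITS = 9 };      /* assumed widths: b = 2, k = 9 */

static inline uint32_t block_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);
}
static inline uint32_t cache_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}
static inline uint32_t cache_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}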
Address selection
[Diagram] An alternative split of the address: Index (k bits) taken from the high-order bits, then Tag (t bits), then Offset (b bits), with the same 2^k-line array, tag comparison, HIT signal, and Data Word or Byte selection.
Why might this be undesirable? Spatially local blocks conflict.
Reduce conflict misses
• Memory time = Hit time + Prob(miss) * Miss penalty
• Associativity: allow a block to go into any of several frames in the cache
  – 2-way set-associative: each block maps to either of 2 cache frames (a set of 2)
  – Fully associative: each block maps to any cache frame
2-way Set-Associative cache
[Diagram] The address is split into Tag (t bits), Index (k bits), and Block offset (b bits). The index selects one set with two ways, each holding a Valid bit, a Tag, and a Data Block; both stored tags are compared with the address tag in parallel, and a match in either way signals HIT and selects the Data Word or Byte.
Replacement policy
• In order to bring in a new cache line, usually another cache line has to be thrown out. Which one?
  – No choice in replacement if the cache is direct-mapped
• Replacement policy for set-associative caches
  – One that is not dirty, i.e., has not been modified
    • In an I-cache all lines are clean
    • In a D-cache, if a dirty line has to be thrown out it must be written back first
  – Least recently used?
  – Most recently used?
  – Random?
How much is performance affected by the choice? Difficult to know without quantitative measurements.
Implementing LRU..
• We need timestamps for all lines
• And we also require timestamp comparisons!
  – Log-scale overhead
Pseudo LRU example
• So we use pseudo-LRU instead of true LRU
• We'll use an 8-way set-associative cache
• Use three history bits
• When a line is referenced, record the following code in the history bits
  – Line 0 : 000
  – Line 1 : 001
  – Line 2 : 010
  – Line 3 : 011
  …
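A rough sketch of how this could work in C (my reading of the slide; the victim choice below is an assumption made only to guarantee we never evict the most recently used line):

#include <stdint.h>

typedef struct {
    uint8_t history;                 /* 3 bits: code of the most recently referenced line (0..7) */
} set_state;

static inline void on_reference(set_state *s, uint8_t way /* 0..7 */) {
    s->history = way & 0x7;          /* memo the referenced line's code */
}

static inline uint8_t choose_victim(const set_state *s) {
    return (uint8_t)(~s->history) & 0x7;   /* pick a line other than the most recent one */
}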