CSE502: Computer Architecture
CSE 502: Computer Architecture
Review
CSE502: Computer Architecture
Course Overview (1/2)
• Caveat 1: I’m (kind of) new here.
• Caveat 2: This is a (somewhat) new course.
• Computer Architecture is
… the science and art of selecting and
interconnecting hardware and software
components to create computers …
CSE502: Computer Architecture
Course Overview (2/2)
• This course is hard, roughly like CSE 506
– In CSE 506, you learn what’s inside an OS
– In CSE 502, you learn what’s inside a CPU
• This is a project course
– Learn why things are the way they are, first hand
– We will “build” emulators of CPU components
CSE502: Computer Architecture
Policy and Projects
• Probably different from other classes
– Much more open, but much more strict
• Most people followed the policy
• Some did not
– Resembles the “real world”
• You’re here because you want to learn and to be here
• If you managed to get your partner(s) to do the work
– You’re probably good enough to do it at your job too
» The good: You might make a good manager
» The bad: You didn’t learn much
• Time mgmt. often more important than tech. skill
– If you started early, you probably have an A already
CSE502: Computer Architecture
Amdahl’s Law
Speedup = time(without enhancement) / time(with enhancement)
An enhancement speeds up fraction f of a task by factor S
time_new = time_orig · ( (1-f) + f/S )
Speedup_overall = 1 / ( (1-f) + f/S )
[Figure: time_orig split into segments (1-f) and f; time_new split into (1-f) and f/S]
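As a quick sanity check (not from the slides; the numbers are illustrative), the overall speedup can be evaluated directly in C:

#include <stdio.h>

/* Amdahl's Law: overall speedup when fraction f of a task is sped up by factor S. */
static double amdahl(double f, double S) {
    return 1.0 / ((1.0 - f) + f / S);
}

int main(void) {
    /* Hypothetical example: 80% of the task sped up 4x gives 1/(0.2 + 0.2) = 2.5x. */
    printf("%.2f\n", amdahl(0.8, 4.0));
    return 0;
}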
CSE502: Computer Architecture
The Iron Law of Processor Performance
Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
• Instructions/Program: total work in program (Algorithms, Compilers, ISA Extensions)
• Cycles/Instruction: CPI, or 1/IPC (Microarchitecture)
• Time/Cycle: 1/f, the clock frequency (Microarchitecture, Process Tech)
Architects target CPI, but must understand the others
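For instance (an illustrative calculation, not from the slides): a program of 10^9 instructions at CPI 1.25 on a 2GHz clock (0.5ns/cycle) takes 10^9 × 1.25 × 0.5ns = 0.625s.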
CSE502: Computer Architecture
Averaging Performance Numbers (2/2)
• Arithmetic: times
– proportional to time
– e.g., latency
– Arithmetic mean = (1/n) · Σ_{i=1..n} Time_i
• Harmonic: rates
– inversely proportional to time
– e.g., throughput
– Harmonic mean = n / Σ_{i=1..n} (1/Rate_i)
• Geometric: ratios
– unit-less quantities
– e.g., speedups
– Geometric mean = ( Π_{i=1..n} Ratio_i )^(1/n)
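A minimal C sketch of the three means (assumes n > 0 and positive inputs; not from the slides):

#include <math.h>

double arith_mean(const double *time, int n) {      /* for latencies */
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += time[i];
    return sum / n;
}

double harm_mean(const double *rate, int n) {       /* for throughputs */
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += 1.0 / rate[i];
    return n / sum;
}

double geo_mean(const double *ratio, int n) {       /* for speedups */
    double logsum = 0.0;
    for (int i = 0; i < n; i++) logsum += log(ratio[i]);
    return exp(logsum / n);
}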
CSE502: Computer Architecture
Power vs. Energy
What uses power in a chip?
• Power: instantaneous rate of energy transfer
– Expressed in Watts
– In Architecture, implies conversion of electricity to heat
– Power(Comp1+Comp2)=Power(Comp1)+Power(Comp2)
• Energy: measure of using power for some time
– Expressed in Joules
– power * time (joules = watts * seconds)
– Energy(OP1+OP2)=Energy(OP1)+Energy(OP2)
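For example (illustrative numbers, not from the slides): a component drawing 50 Watts for 2 seconds dissipates 100 Joules; halving runtime at constant power halves the energy.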
CSE502: Computer Architecture
ISA: A contract between HW and SW
• ISA: Instruction Set Architecture
– A well-defined hardware/software interface
• The “contract” between software and hardware
– Functional definition of operations supported by hardware
– Precise description of how to invoke all features
• No guarantees regarding
– How operations are implemented
– Which operations are fast and which are slow (and when)
– Which operations take more energy (and which take less)
CSE502: Computer Architecture
Components of an ISA
• Programmer-visible state
– Program counter, general purpose registers, memory, control registers
• Programmer-visible behaviors
– What to do, when to do it
• A binary encoding
Example “register-transfer-level” description of an instruction:
if imem[pc]==“add rd, rs, rt”
then pc ← pc+1
     gpr[rd] ← gpr[rs]+gpr[rt]
ISAs last forever, don’t add stuff you don’t need
CSE502: Computer Architecture
Locality Principle
• Recent past is a good indication of near future
Spatial Locality: If you looked something up, it is very likely
you will look up something nearby soon
Temporal Locality: If you looked something up, it is very likely
that you will look it up again soon
CSE502: Computer Architecture
Caches
• An automatically managed hierarchy
• Break memory into blocks (several bytes)
and transfer data to/from cache in blocks
– spatial locality
• Keep recently accessed blocks
– temporal locality
[Figure: Core ↔ $ ↔ Memory]
CSE502: Computer Architecture
Fully-Associative Cache
address: tag[63:6] | block offset[5:0]
• Keep blocks in cache frames
– data
– state (e.g., valid)
– address tag
[Figure: every frame holds state, tag, and data; the address tag is compared (=) against all frame tags in parallel, and a multiplexor selects the hitting frame’s data (hit?)]
What happens when the cache runs out of space?
CSE502: Computer Architecture
The 3 C’s of Cache Misses
• Compulsory: Never accessed before
• Capacity: Accessed long ago and already replaced
• Conflict: Neither compulsory nor capacity
• Coherence: (In multi-cores, become owner to write)
CSE502: Computer Architecture
Cache Size
• Cache size is data capacity (don’t count tag and state)
– Bigger can exploit temporal locality better
– Not always better
• Too large a cache
– Smaller is faster → bigger is slower
– Access time may hurt critical path
• Too small a cache
– Limited temporal locality
– Useful data constantly replaced
[Figure: hit rate rises with capacity and flattens once capacity exceeds the working set size]
CSE502: Computer Architecture
Block Size
• Block size is the data that is
– Associated with an address tag
– Not necessarily the unit of transfer between hierarchies
• Too small a block
– Don’t exploit spatial locality well
– Excessive tag overhead
• Too large a block
– Useless data transferred
– Too few total blocks
• Useful data frequently replaced
[Figure: hit rate vs. block size rises, peaks, then falls]
CSE502: Computer Architecture
Direct-Mapped Cache
• Use middle bits as index
• Only one tag comparison
address: tag[63:16] | index[15:6] | block offset[5:0]
[Figure: a decoder selects one frame by index; its state, tag, and data are read; a single tag comparison (=) yields tag match (hit?), and a multiplexor selects the data]
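To make the address split concrete, here is a small C sketch (field widths taken from the slide: 6 offset bits and 10 index bits; the function names are mine):

#include <stdint.h>

#define OFFSET_BITS 6   /* block offset[5:0]  -> 64B blocks  */
#define INDEX_BITS 10   /* index[15:6]        -> 1024 frames */

static uint64_t block_offset(uint64_t addr) {
    return addr & ((1ull << OFFSET_BITS) - 1);
}
static uint64_t cache_index(uint64_t addr) {
    return (addr >> OFFSET_BITS) & ((1ull << INDEX_BITS) - 1);
}
static uint64_t cache_tag(uint64_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);  /* tag[63:16] */
}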
CSE502: Computer Architecture
N-Way Set-Associative Cache
address: tag[63:15] | index[14:6] | block offset[5:0]
[Figure: the cache is organized as sets (rows) × ways (columns); the decoder selects one set, all ways’ tags are compared (=) in parallel, and per-way multiplexors plus the tag matches produce the data and hit?]
Note the additional bit(s) moved from index to tag
CSE502: Computer Architecture
Associativity
• Larger associativity
– lower miss rate (fewer conflicts)
– higher power consumption
• Smaller associativity
– lower cost
– faster hit time
[Figure: hit rate vs. associativity saturates, flattening around ~5 ways for an L1-D]
CSE502: Computer Architecture
Parallel vs Serial Caches
• Tag and Data usually separate (tag is smaller & faster)
– State bits stored along with tags
• Valid bit, “LRU” bit(s), …
Parallel access to Tag and Data
reduces latency (good for L1)
Serial access to Tag and Data
reduces power (good for L2+)
[Figure: parallel access reads all ways’ tags (=) and data at once, with valid? gating hit? and the data mux; serial access uses the tag match as an enable so only the hitting way’s data is read]
CSE502: Computer Architecture
Physically-Indexed Caches
• Core requests are VAs
• Cache index is PA[15:6]
– VA passes through TLB
– D-TLB on critical path
• Cache tag is PA[63:16]
• If index size < page size
– Can use VA for index
[Figure: Virtual Address = virtual page[63:13] | page offset[12:0]; index bits [6:0] come straight from the page offset, the remaining physical index bit(s) come out of the D-TLB, and the physical tag feeds the per-way = comparisons]
CSE502: Computer Architecture
Virtually-Indexed Caches
• Core requests are VAs
• Cache index is VA[15:6]
• Cache tag is PA[63:16]
• Why not tag with VA?
– Cache flush on ctx switch
• Virtual aliases
– Ensure they don’t exist
– … or check all on miss
[Figure: Virtual Address = tag[63:14] | index[13:6] | block offset[5:0]; the virtual index[7:0] is used directly (one bit overlaps the virtual page number), and the D-TLB produces the physical tag[51:0] for the per-way = comparisons]
CSE502: Computer Architecture
Inclusion
• Core often accesses blocks not present on chip
– Should block be allocated in L3, L2, and L1?
• Called Inclusive caches
• Waste of space
• Requires forced evict (e.g., force evict from L1 on evict from L2+)
– Only allocate blocks in L1
• Called Non-inclusive caches (why not “exclusive”?)
• Must write back clean lines
• Some processors combine both
– L3 is inclusive of L1 and L2
– L2 is non-inclusive of L1 (like a large victim cache)
CSE502: Computer Architecture
Parity & ECC
• Cosmic radiation can strike at any time
– Especially at high altitude
– Or during solar flares
• What can be done?
[Figure: a single bit in a stored word flips]
– Parity
• 1 bit to indicate if sum is odd/even (detects single-bit errors)
– Error Correcting Codes (ECC)
• 8-bit code per 64-bit word
• Generally SECDED (Single-Error-Correct, Double-Error-Detect)
• Detecting errors on clean cache lines is harmless
– Pretend it’s a cache miss
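As a small illustration (not from the slides), even parity over a 64-bit word can be computed by XOR-folding:

#include <stdint.h>

/* Returns 1 if the number of 1 bits in w is odd. A single-bit flip changes
   the parity, which is how parity detects (but cannot correct) one-bit errors. */
static int parity64(uint64_t w) {
    w ^= w >> 32;
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return (int)(w & 1);
}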
CSE502: Computer Architecture
SRAM vs. DRAM
• SRAM = Static RAM
– As long as power is present, data is retained
• DRAM = Dynamic RAM
– If you don’t do anything, you lose the data
• SRAM: 6T per bit
– built with normal high-speed CMOS technology
• DRAM: 1T per bit (+1 capacitor)
– built with special DRAM process optimized for density
CSE502: Computer Architecture
DRAM Chip Organization
• Low-Level organization is very similar to SRAM
• Cells are only single-ended
– Reads destructive: contents are erased by reading
• Row buffer holds read data
– Data in row buffer is called a DRAM row
• Often called “page” - not necessarily same as OS page
– Read gets entire row into the buffer
– Block reads always performed out of the row buffer
• Reading a whole row, but accessing one block
• Similar to reading a cache line, but accessing one word
CSE502: Computer Architecture
DRAM Organization
[Figure: a dual-rank x8 (2Rx8) DIMM; each rank is nine x8 DRAM chips, and each chip contains multiple banks]
• All banks within the rank share all address and control pins
• All banks are independent, but can only talk to one bank at a time
• x8 means each DRAM outputs 8 bits; need 8 chips for DDRx (64-bit)
• Why 9 chips per rank? 64 bits data, 8 bits ECC
CSE502: Computer Architecture
AMAT with MLP
• If …
– cache hit is 10 cycles (core to L1 and back)
– memory access is 100 cycles (core to mem and back)
• Then …
– at 50% miss ratio, avg. access: 0.5×10 + 0.5×100 = 55
• Unless MLP is > 1.0, then …
– at 50% mr, 1.5 MLP, avg. access: (0.5×10 + 0.5×100)/1.5 ≈ 37
– at 50% mr, 4.0 MLP, avg. access: (0.5×10 + 0.5×100)/4.0 ≈ 14
In many cases, MLP dictates performance
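The same arithmetic as a tiny C helper (the parameter and function names are mine; the slide’s numbers appear in the comment):

/* Average memory access time, discounted by memory-level parallelism (MLP). */
static double amat_with_mlp(double hit_time, double miss_time,
                            double miss_ratio, double mlp) {
    double amat = (1.0 - miss_ratio) * hit_time + miss_ratio * miss_time;
    return amat / mlp;   /* overlapping misses divide the effective latency */
}
/* amat_with_mlp(10, 100, 0.5, 1.5) ≈ 36.7; amat_with_mlp(10, 100, 0.5, 4.0) ≈ 13.75 */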
CSE502: Computer Architecture
Memory Controller (1/2)
[Figure: the Memory Controller sits between the CPU and DRAM; read, write, and response queues face the CPU, and a scheduler plus buffer drive commands and data on Channel 0 and Channel 1]
CSE502: Computer Architecture
Memory Controller (2/2)
• Memory controller connects CPU and DRAM
• Receives requests after cache misses in LLC
– Possibly originating from multiple cores
• Complicated piece of hardware, handles:
– DRAM Refresh
– Row-Buffer Management Policies
– Address Mapping Schemes
– Request Scheduling
CSE502: Computer Architecture
Address Mapping Schemes
• Example Open-page Mapping Scheme:
– High Parallelism: [row rank bank column channel offset]
– Easy Expandability: [channel rank row bank column offset]
• Example Close-page Mapping Scheme:
– High Parallelism: [row column rank bank channel offset]
– Easy Expandability: [channel rank row column bank offset]
CSE502: Computer Architecture
Memory Request Scheduling
• Write buffering
– Writes can wait until reads are done
• Queue DRAM commands
– Usually into per-bank queues
– Allows easily reordering ops. meant for same bank
• Common policies:
– First-Come-First-Served (FCFS)
– First-Ready—First-Come-First-Served (FR-FCFS)
CSE502: Computer Architecture
Prefetching (1/2)
• Fetch block ahead of demand
• Target compulsory, capacity, (& coherence) misses
– Not conflict: prefetched block would conflict
• Big challenges:
– Knowing “what” to fetch
• Fetching useless blocks wastes resources
– Knowing “when” to fetch
• Too early → clutters storage (or gets thrown out before use)
• Fetching too late → defeats purpose of “pre”-fetching
CSE502: Computer Architecture
Prefetching (2/2)
• Without prefetching: the Load misses in L1 and L2 and goes to DRAM; Data returns after the full miss path (Total Load-to-Use Latency)
• With prefetching: the Prefetch is issued well before the Load, so Data is already back (much improved Load-to-Use Latency)
• Or: the Prefetch is issued late, so the Load still waits, but for less time (somewhat improved latency)
Prefetching must be accurate and timely
CSE502: Computer Architecture
Next-Line (or Adjacent-Line) Prefetching
• On request for line X, prefetch X+1 (or X^0x1)
– Assumes spatial locality
• Often a good assumption
– Should stop at physical (OS) page boundaries
• Can often be done efficiently
– Adjacent-line is convenient when next-level block is bigger
– Prefetch from DRAM can use bursts and row-buffer hits
• Works for I$ and D$
– Instructions execute sequentially
– Large data structures often span multiple blocks
Simple, but usually not timely
CSE502: Computer Architecture
Next-N-Line Prefetching
• On request for line X, prefetch X+1, X+2, …, X+N
– N is called “prefetch depth” or “prefetch degree”
• Must carefully tune depth N. Large N is …
– More likely to be useful (correct and timely)
– More aggressive → more likely to make a mistake
• Might evict something useful
– More expensive → need storage for prefetched lines
• Might delay useful request on interconnect or port
Still simple, but more timely than Next-Line
CSE502: Computer Architecture
Stride Prefetching
[Figure: strided patterns; elements in an array of structs, a column in a matrix]
• Access patterns often follow a stride
– Accessing column of elements in a matrix
– Accessing elements in array of structs
• Detect stride S, prefetch depth N
– Prefetch X+1·S, X+2·S, …, X+N·S
CSE502: Computer Architecture
“Localized” Stride Prefetchers
• Store PC, last address, last stride, and count in RPT
• On access, check RPT (Reference Prediction Table)
– Same stride? → count++ if yes, count-- or count=0 if no
– If count is high, prefetch (last address + stride×N)
[Figure: loads/stores at PCa 0x409A34, PCb 0x409A38, PCc 0x409A40 map to RPT entries tagged 0x409 with (Last Addr, Stride, Count) of (A+3N, N, 2), (X+3N, N, 2), and (Y+2N, N, 1); if confident about the stride (count > Cmin), prefetch (A+4N)]
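A minimal C sketch of the RPT update (the table size, hash, CMIN, and prefetch depth are illustrative assumptions, not values from the slides):

#include <stdint.h>

#define RPT_ENTRIES 256
#define CMIN 2           /* confidence threshold (assumed) */
#define DEPTH 1          /* prefetch degree N (assumed)    */

typedef struct {
    uint64_t tag;        /* PC tag                  */
    uint64_t last_addr;  /* last address referenced */
    int64_t  stride;     /* last observed stride    */
    int      count;      /* confidence counter      */
} rpt_entry_t;

static rpt_entry_t rpt[RPT_ENTRIES];

/* Called on each load/store; returns an address to prefetch, or 0. */
uint64_t rpt_access(uint64_t pc, uint64_t addr) {
    rpt_entry_t *e = &rpt[(pc >> 2) % RPT_ENTRIES];
    uint64_t prefetch = 0;
    if (e->tag == pc) {
        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride == e->stride) {                 /* same stride: count++ */
            if (e->count < 3) e->count++;
        } else {                                   /* new stride: reset    */
            e->stride = stride;
            e->count = 0;
        }
        if (e->count >= CMIN)                      /* confident: prefetch  */
            prefetch = addr + (uint64_t)(e->stride * DEPTH);
    } else {                                       /* allocate a new entry */
        e->tag = pc;
        e->stride = 0;
        e->count = 0;
    }
    e->last_addr = addr;
    return prefetch;
}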
CSE502: Computer Architecture
Evaluating Prefetchers
• Compare against larger caches
– Complex prefetcher vs. simple prefetcher with larger cache
• Primary metrics
– Coverage: prefetched hits / base misses
– Accuracy: prefetched hits / total prefetches
– Timeliness: latency of prefetched blocks / hit latency
• Secondary metrics
– Pollution: misses / (prefetched hits + base misses)
– Bandwidth: (total prefetches + misses) / base misses
– Power, Energy, Area...
CSE502: Computer Architecture
Before there was pipelining…
[Figure: timeline. Single-cycle: insn0.(fetch,decode,exec) | insn1.(fetch,decode,exec). Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec]
• Single-cycle control: hardwired
– Low CPI (1)
– Long clock period (to accommodate slowest instruction)
• Multi-cycle control: micro-programmed
– Short clock period
– High CPI
• Can we have both low CPI and short clock period?
CSE502: Computer Architecture
Pipelining
[Figure: timeline. Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec. Pipelined: insn1.fetch starts as insn0 enters decode, insn2.fetch as insn0 executes, so the fetch/dec/exec stages of consecutive insns overlap]
• Start with multi-cycle design
• When insn0 goes from stage 1 to stage 2
… insn1 starts stage 1
• Each instruction passes through all stages
… but instructions enter and leave at faster rate
Can have as many insns in flight as there are stages
CSE502: Computer Architecture
Instruction Dependencies
• Data Dependence
– Read-After-Write (RAW) (only true dependence)
• Read must wait until earlier write finishes
– Anti-Dependence (WAR)
• Write must wait until earlier read finishes (avoid clobbering)
– Output Dependence (WAW)
• Earlier write can’t overwrite later write
• Control Dependence (a.k.a. Procedural Dependence)
– Branch condition must execute before branch target
– Instructions after branch cannot run before branch
CSE502: Computer Architecture
Pipeline Terminology
• Pipeline Hazards
– Potential violations of program dependencies
– Must ensure program dependencies are not violated
• Hazard Resolution
– Static method: performed at compile time in software
– Dynamic method: performed at runtime using hardware
– Two options: Stall (costs perf.) or Forward (costs hw.)
• Pipeline Interlock
– Hardware mechanism for dynamic hazard resolution
– Must detect and enforce dependencies at runtime
CSE502: Computer Architecture
Simple 5-stage Pipeline
[Figure: classic 5-stage datapath. The PC (with +1 and a branch-target MUX) feeds the Inst Cache; the IF/ID latch carries PC+1 and the instruction; the register file (R0–R7) reads regA/regB; the ID/EX latch carries valA, valB, offset, dest, and op; the ALU (with operand MUXes) computes the result, branch target, and eq?; the EX/Mem latch feeds the Data Cache (mdata); the Mem/WB latch writes data back to the register file]
CSE502: Computer Architecture
Balancing Pipeline Stages
Coarser-Grained Machine Cycle: 4 machine cyc / instruction
– Stage times: T_IF&ID = 8 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units
– # stages = 4, Tcyc = 9 units (set by the slowest stage)
Finer-Grained Machine Cycle: 11 machine cyc / instruction
– Same work split into 3-unit sub-stages: IF, IF, ID, OF, OF, OF, EX, EX, WB, WB, WB
– # stages = 11, Tcyc = 3 units
CSE502: Computer Architecture
IPC vs. Frequency
• Losing 10-15% IPC is not bad if frequency can double
– one 1000ps stage: 2.0 IPC at 1GHz = 2 BIPS
– two 500ps stages: 1.7 IPC at 2GHz = 3.4 BIPS
• Frequency doesn’t double
– Latch/pipeline overhead
– Stage imbalance
[Figure: e.g., 900ps of logic split unevenly into 350ps and 550ps stages instead of 450ps each, so the clock only reaches about 1.5GHz]
CSE502: Computer Architecture
Architectures for Instruction Parallelism
• Scalar pipeline (baseline)
– Instruction/overlap parallelism = D
– Operation Latency = 1
– Peak IPC = 1.0
[Figure: D successive instructions overlapped in a D-deep pipeline over cycles 1–12]
CSE502: Computer Architecture
Superscalar Machine
• Superscalar (pipelined) Execution
– Instruction parallelism = D x N
– Operation Latency = 1
– Peak IPC = N per cycle
[Figure: N instructions enter per cycle, so D×N different instructions are overlapped over cycles 1–12]
CSE502: Computer Architecture
RISC ISA Format
• Fixed-length
– MIPS: all insts are 32 bits (4 bytes)
• Few formats
– MIPS has 3: R (reg, reg, reg), I (reg, reg, imm), J (addr)
– Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP
• Regularity across formats (when possible/practical)
– MIPS & Alpha opcode in same bit-position for all formats
– MIPS rs & rt fields in same bit-position for R and I formats
– Alpha ra/fa field in same bit-position for all 5 formats
CSE502: Computer Architecture
Superscalar Decode for RISC ISAs
• Decode X insns. per cycle (e.g., 4-wide)
– Just duplicate the hardware
– Instructions aligned at 32-bit boundaries
[Figure: scalar fetch delivers one 32-bit inst to one Decoder; 4-wide superscalar fetch delivers four aligned 32-bit insts to four parallel Decoders, each producing a decoded inst]
CSE502: Computer Architecture
CISC ISA
• RISC focus on fast access to information
– Easy decode, I$, large RF’s, D$
• CISC focus on max expressiveness per min space
– Designed in era with fewer transistors, chips
– Each memory access very expensive
• Pack as much work into as few bytes as possible
• More “expressive” instructions
– Better potential code generation in theory
– More complex code generation in practice
CSE502: Computer Architecture
ADD in RISC ISA
Mode      | Example          | Meaning
Register  | ADD R4, R3, R2   | R4 = R3 + R2
CSE502: Computer Architecture
ADD in CISC ISA
Mode              | Example            | Meaning
Register          | ADD R4, R3         | R4 = R4 + R3
Immediate         | ADD R4, #3         | R4 = R4 + 3
Displacement      | ADD R4, 100(R1)    | R4 = R4 + Mem[100+R1]
Register Indirect | ADD R4, (R1)       | R4 = R4 + Mem[R1]
Indexed/Base      | ADD R3, (R1+R2)    | R3 = R3 + Mem[R1+R2]
Direct/Absolute   | ADD R1, (1234)     | R1 = R1 + Mem[1234]
Memory Indirect   | ADD R1, @(R3)      | R1 = R1 + Mem[Mem[R3]]
Auto-Increment    | ADD R1, (R2)+      | R1 = R1 + Mem[R2]; R2++
Auto-Decrement    | ADD R1, -(R2)      | R2--; R1 = R1 + Mem[R2]
CSE502: Computer Architecture
RISC (MIPS) vs CISC (x86)
lui R1, Disp[31:16]
ori R1, R1, Disp[15:0]
add R1, R1, R2
shli R3, R3, 3
add R3, R3, R1
lui R1, Imm[31:16]
ori R1, R1, Imm[15:0]
st [R3], R1
MOV [EBX+EAX*8+Disp], Imm
8 insns. at 32 bits each vs 1 insn. at 88 bits: 2.9x!
CSE502: Computer Architecture
x86 Encoding
• Basic x86 Instruction:
Prefixes (0-4 bytes) | Opcode (1-2 bytes) | Mod R/M (0-1 bytes) | SIB (0-1 bytes) | Displacement (0/1/2/4 bytes) | Immediate (0/1/2/4 bytes)
– Shortest inst: 1 byte; longest inst: 15 bytes
• Opcode has flag indicating Mod R/M is present
– Most instructions use the Mod R/M byte
– Mod R/M specifies if optional SIB byte is used
– Mod R/M and SIB may specify additional constants
Instruction length not known until after decode
CSE502: Computer Architecture
Instruction Cache Organization
• To fetch N instructions per cycle...
– L1-I line must be wide enough for N instructions
• PC register selects L1-I line
• A fetch group is the set of insns. starting at PC
– For N-wide machine, [PC,PC+N-1]
[Figure: the PC indexes the L1-I through a decoder; each cache line holds a tag plus several insts, and the fetch group is read out of one line]
CSE502: Computer Architecture
Fetch Misalignment
• Now takes two cycles to fetch N instructions
[Figure: PC xxx01001 starts mid-line, so cycle 1 fetches only the tail of its cache line; cycle 2 (PC xxx01100) fetches the remaining insts from the next line]
CSE502: Computer Architecture
Fragmentation due to Branches
• Fetch group is aligned, cache line size > fetch group
– Taken branches still limit fetch width
[Figure: the fetch group contains a taken Branch; the insts after it in the line are killed (X), so the full fetch width goes unused]
CSE502: Computer Architecture
Types of Branches
• Direction:
– Conditional vs. Unconditional
• Target:
– PC-encoded
• PC-relative
• Absolute offset
– Computed (target derived from register)
Need direction and target to find next fetch group
CSE502: Computer Architecture
Branch Prediction Overview
• Use two hardware predictors
– Direction predictor guesses if branch is taken or not-taken
– Target predictor guesses the destination PC
• Predictions are based on history
– Use previous behavior as indication of future behavior
– Use historical context to disambiguate predictions
CSE502: Computer Architecture
Direction vs. Target Prediction
• Direction: 0 or 1
• Target: 32- or 64-bit value
• Turns out targets are generally easier to predict
– Don’t need to predict N-t target
– T target doesn’t usually change
• Only need to predict taken-branch targets
• Prediction is really just a “cache”
– Branch Target Buffer (BTB)
[Figure: the PC feeds the Target Pred; the fall-through target is PC + sizeof(inst)]
CSE502: Computer Architecture
Branch Target Buffer (BTB)
[Figure: the Branch PC indexes the BTB; each entry holds a Valid bit (V), a Branch Instruction Address (BIA) tag, and a Branch Target Address (BTA); on a tag match (Hit?), the BTA becomes the Next Fetch PC]
CSE502: Computer Architecture
BTB w/Partial Tags
[Figure: full tags 00000000cfff981, 00000000cfff982, 00000000cfff984 shrink to partial tags f981, f982, f984; lookups for 00000000cfff9810, 00000000cfff9824, and 00000000cfff984c still hit, but an unrelated PC such as 00001111beef9810 now aliases]
Fewer bits to compare, but prediction may alias
CSE502: Computer Architecture
BTB w/PC-offset Encoding
[Figure: instead of full targets 00000000cfff9704, 00000000cfff9830, 00000000cfff9900, entries store only the low offsets ff9704, ff9830, ff9900; the predicted target concatenates the upper PC bits 00000000cf with the stored offset ff9900]
If target too far or PC rolls over, will mispredict
CSE502: Computer Architecture
Branches Have Locality
• If a branch was previously taken…
– There’s a good chance it’ll be taken again
for(i=0; i < 100000; i++)
{
/* do stuff */
}
This branch will be taken
99,999 times in a row.
CSE502: Computer Architecture
Last Outcome Predictor
• Do what you did last time
0xDC08:
0xDC44:
for(i=0; i < 100000; i++)
{
if( ( i % 100) == 0 )
tick( );
0xDC50:
if( (i & 1) == 1)
odd( );
}
T
N
CSE502: Computer Architecture
Saturating Two-Bit Counter
[Figure: FSM for Last-Outcome Prediction: two states, Predict N-t (0) and Predict T (1); a T outcome moves toward Predict T, an N-t outcome moves toward Predict N-t. FSM for 2bC (2-bit Counter): four states 0–3, where 0/1 predict N-t and 2/3 predict T, moving one state per outcome]
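To pin down the 2bC transitions, here is a minimal C sketch (state encoding 0–3 as in the figure; the names are mine):

#include <stdint.h>

/* 2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken. */
typedef uint8_t ctr2_t;

static int predict(ctr2_t c) { return c >= 2; }   /* 1 = predict taken */

static ctr2_t update(ctr2_t c, int taken) {
    if (taken) return c < 3 ? c + 1 : 3;          /* saturate at 3 */
    else       return c > 0 ? c - 1 : 0;          /* saturate at 0 */
}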
CSE502: Computer Architecture
Typical Organization of 2bC Predictor
[Figure: the PC (32 or 64 bits) is hashed down to log2(n) bits to index a table of n 2-bit counters; the counter FSM produces the Prediction, and Update Logic applies the actual outcome back to the table]
CSE502: Computer Architecture
Track the History of Branches
[Figure: each entry holds the Previous Outcome plus two counters, one consulted when prev=0 and one when prev=1. The example walks the table: with prev=1 and counters (3,3), prediction = T; after updates the counters pass through values like (3,2) while the predictor keeps predicting T]
CSE502: Computer Architecture
Deeper History Covers More Patterns
• Counters learn “pattern” of prediction
[Figure: the previous 3 outcomes select one of eight counters (prev=000 … prev=111)]
The learned mapping 001 → 1, 011 → 0, 110 → 0, 100 → 1 perfectly predicts the repeating sequence 00110011001… = (0011)*
CSE502: Computer Architecture
Predictor Training Time
• Ex: prediction equals opposite for 2nd most recent outcome
• Hist Len = 2 → 4 states to train:
NN → T, NT → T, TN → N, TT → N
• Hist Len = 3 → 8 states to train:
NNN → T, NNT → T, NTN → N, NTT → N, TNN → T, TNT → T, TTN → N, TTT → N
CSE502: Computer Architecture
Predictor Organizations
[Figure: three organizations; a PC hash selecting a private pattern table per branch (different pattern for each branch PC), a PC hash into one shared table (shared set of patterns), or a mix of both]
CSE502: Computer Architecture
Two-Level Predictor Organization
• Branch History Table (BHT)
– 2^a entries
– h-bit history per entry
• Pattern History Table (PHT)
– 2^b sets
– 2^h counters per set (each entry is a 2-bit counter)
• Total Size in bits
– h·2^a + 2·2^(b+h)
[Figure: a bits of PC hash index the BHT; the entry’s h-bit history plus b bits of PC hash select the PHT counter]
CSE502: Computer Architecture
Combined Indexing
• “gshare” (S. McFarling)
[Figure: k bits of PC hash are XORed with the k-bit branch history to index the counter table; k = log2(#counters)]
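A minimal gshare sketch in C (the table size K, the hash, and the history width are illustrative assumptions, not McFarling’s exact design):

#include <stdint.h>

#define K 12
#define ENTRIES (1u << K)

static uint8_t  pht[ENTRIES];   /* 2-bit counters               */
static uint32_t ghr;            /* global history of outcomes   */

static uint32_t gshare_index(uint64_t pc) {
    return ((uint32_t)(pc >> 2) ^ ghr) & (ENTRIES - 1);  /* PC XOR history */
}

int gshare_predict(uint64_t pc) {
    return pht[gshare_index(pc)] >= 2;                   /* taken if 2 or 3 */
}

void gshare_update(uint64_t pc, int taken) {
    uint8_t *c = &pht[gshare_index(pc)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    ghr = (ghr << 1) | (uint32_t)(taken & 1);            /* shift in outcome */
}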
CSE502: Computer Architecture
OoO Execution
• Out-of-Order execution (OoO)
– Totally in the hardware
– Also called Dynamic scheduling
• Fetch many instructions into instruction window
– Use branch prediction to speculate past branches
• Rename regs. to avoid false deps. (WAW and WAR)
• Execute insns. as soon as possible
– As soon as deps. (regs and memory) are known
• Today’s machines: 100+ insns. scheduling window
CSE502: Computer Architecture
Superscalar != Out-of-Order
A: R1 = Load 16[R2]   (cache miss)
B: R3 = R1 + R4
C: R6 = Load 8[R9]
D: R5 = R2 – 4
E: R7 = Load 20[R5]
F: R4 = R4 – 1
G: BEQ R4, #0
[Figure: the same sequence takes 10 cycles on a 1-wide in-order machine, 8 cycles 2-wide in-order, 7 cycles 1-wide out-of-order, and 5 cycles 2-wide out-of-order; the cache miss on A stalls the in-order pipelines, while the out-of-order machines keep executing independent insns under the miss]
CSE502: Computer Architecture
Review of Register Dependencies
Read-After-Write          Write-After-Read          Write-After-Write
A: R1 = R2 + R3           A: R1 = R3 / R4           A: R1 = R2 + R3
B: R4 = R1 * R4           B: R3 = R2 * R4           B: R1 = R3 * R4
[Figure: register-file snapshots (R1–R4) for each dependence type, showing the correct values when A executes before B and the wrong values that result when B runs first]
CSE502: Computer Architecture
Register Renaming
• Register renaming (in hardware)
– “Change” register names to eliminate WAR/WAW hazards
– Arch. registers (r1, f0, …) are names, not storage locations
– Can have more locations than names
– Can have multiple active versions of same name
• How does it work?
– Map-table: maps names to most recent locations
– On a write: allocate new location, note in map-table
– On a read: find location of most recent write via map-table
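A minimal sketch of the map-table mechanics in C (the sizes and the free-list discipline are illustrative assumptions, not a real design):

#include <stdint.h>

#define NUM_ARCH 32
#define NUM_PHYS 128

static int map_table[NUM_ARCH];            /* arch reg -> most recent phys reg */
static int free_list[NUM_PHYS], free_top;  /* stack of free physical registers */

/* Read: find the location of the most recent write via the map-table. */
int rename_src(int arch) { return map_table[arch]; }

/* Write: allocate a new location and note it in the map-table, so later
   readers see this version; WAR/WAW hazards on the old name disappear. */
int rename_dst(int arch) {
    int phys = free_list[--free_top];
    map_table[arch] = phys;
    return phys;
}

void rename_init(void) {
    for (int p = 0; p < NUM_PHYS; p++) free_list[p] = p;
    free_top = NUM_PHYS;
    for (int a = 0; a < NUM_ARCH; a++) map_table[a] = rename_dst(a);
}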
CSE502: Computer Architecture
Tomasulo’s Algorithm
• Reservation Stations (RS): instruction buffer
• Common data bus (CDB): broadcasts results to RS
• Register renaming: removes WAR/WAW hazards
• Bypassing (not shown here to make example simpler)
CSE502: Computer Architecture
Tomasulo Data Structures
[Figure: fetched insns enter Reservation Stations (fields: R, op, T, T1, T2, V1, V2); the Map Table translates register names to tags; operands already in the Regfile are copied as values; comparators (==) on T1/T2 watch CDB.T and capture CDB.V, and the FU broadcasts its result tag and value on the CDB]
CSE502: Computer Architecture
Where is the “register rename”?
[Figure: the same Tomasulo datapath as on the previous slide]
• Value copies in RS (V1, V2)
• Insn. stores correct input values in its own RS entry
• “Free list” is implicit (allocate/deallocate as part of RS)
CSE502: Computer Architecture
Precise State
• Speculative execution requires
– (Ability to) abort & restart at every branch
– Abort & restart at every load
• Synchronous (exception and trap) events require
– Abort & restart at every load, store, divide, …
• Asynchronous (hardware) interrupts require
– Abort & restart at every ??
• Real world: bite the bullet
– Implement abort & restart at every insn.
– Called precise state
CSE502: Computer Architecture
Complete and Retire
[Figure: the Re-Order Buffer (ROB) sits between the execution core (I$, L1-D, branch predictor) and the regfile, with Complete (C) and Retire (R) stages]
• Complete (C): insns. write results into ROB
– Out-of-order: don’t block younger insns.
• Retire (R): a.k.a. commit, graduate
– ROB writes results to register file
– In-order: stall back-propagates to younger insns.
CSE502: Computer Architecture
P6 Data Structures
[Figure: P6-style datapath. The Map Table holds tags (T+) pointing into the ROB; Dispatch allocates a ROB entry (between the Head/Retire and Tail/Dispatch pointers) and an RS entry (op, T, T1, T2, V1, V2, with == comparators on CDB.T); the FU broadcasts CDB.T/CDB.V; at Retire the ROB writes values into the Regfile]
CSE502: Computer Architecture
MIPS R10K: Alternative Implementation
[Figure: MIPS R10K-style datapath. The Map Table holds tags (T+); the ROB records T and Told per insn; a Free List supplies fresh physical registers at Dispatch; RS comparators (== on T1+/T2+) watch CDB.T; at Retire, Told is freed and the Arch. Map is updated; a single physical Regfile feeds the FU]
• One big physical register file holds all data - no copies
+ Register file close to FUs → small and fast data path
– ROB and RS “on the side” used only for control and tags
CSE502: Computer Architecture
Executing Memory Instructions
Load R3 = 0[R6]      (issues; cache miss!)
Add R7 = R3 + R9     (issues once the miss is serviced)
Store R4 → 0[R7]     (issues once R7 is ready; but there was a later load…)
Sub R1 = R1 – R2     (issues)
Load R8 = 0[R1]      (issues early; cache hit!)
• If R1 != R7
– Then Load R8 gets correct value from cache
• If R1 == R7
– Then Load R8 should get value from the Store
– But it didn’t!
CSE502: Computer Architecture
Memory Disambiguation Problem
• Ordering problem is a data-dependence violation
• Imprecise memory worse than imprecise registers
• Why can’t this happen with non-memory insts?
– Operand specifiers in non-memory insns. are absolute
• “R1” refers to one specific location
– Operand specifiers in memory insns. are ambiguous
• “R1” refers to a memory location specified by the value of R1.
• When pointers (e.g., R1) change, so does this location
CSE502: Computer Architecture
Two Problems
• Memory disambiguation on loads
– Do earlier unexecuted stores to the same address exist?
• Binary question: answer is yes or no
• Store-to-load forwarding problem
– I’m a load: Which earlier store do I get my value from?
– I’m a store: Which later load(s) do I forward my value to?
• Non-binary question: answer is one or more insn. identifiers
CSE502: Computer Architecture
Load/Store Queue (1/2)
• Load/store queue (LSQ)
– Completed stores write to LSQ
– When store retires, head of LSQ written to L1-D
• (or write buffer)
– When loads execute, access LSQ and L1-D in parallel
• Forward from LSQ if older store with matching address
CSE502: Computer Architecture
Load/Store Queue (2/2)
[Figure: ROB plus Load/Store Queue (LSQ); loads and stores send their addr to the LSQ and L1-D, store data waits in the LSQ, and load data returns from the LSQ or L1-D. Almost a “real” processor diagram]
CSE502: Computer Architecture
Loads Execute When …
• Most aggressive approach
• Relies on fact that store→load forwarding is rare
• Greatest potential IPC – loads never stall
• Potential for incorrect execution
– Need to be able to “undo” bad loads
CSE502: Computer Architecture
Detecting Ordering Violations
• Case 1: Older store execs before younger load
– No problem; if same address, st→ld forwarding happens
• Case 2: Older store execs after younger load
– Store scans all younger loads
– Address match → ordering violation
CSE502: Computer Architecture
Loads Checking for Earlier Stores
• On Load dispatch, find data from earlier Store
[Figure: the LSQ address bank acts as a CAM; a load (LD 0x4000) compares (=) its address against all earlier valid stores; among the matches (two ST 0x4000 entries; ST 0x4120 does not match), the youngest earlier store supplies the value from the data bank. The logic must be adjusted so the load need not be at the bottom and the LSQ can wrap around; if |LSQ| is large, the priority logic can be adapted to have log delay]
CSE502: Computer Architecture
Data Forwarding
• On execute Store (STA+STD), check for later Loads
[Figure: an executing store (STA+STD) compares its address against all later loads, using similar logic to the previous slide; a matching later load (LD 0x4000) captures the value, unless an intervening younger store (ST 0x4000) has overwritten the match]
This is ugly, complicated, slow, and power hungry
CSE502: Computer Architecture
Data-Capture Scheduler
[Figure: Fetch & Dispatch feeds the Data-Capture Scheduler; results from the Functional Units return via bypass, and the PRF/ROB (or ARF) receives the physical register update at commit]
• Dispatch: read available operands from ARF/ROB, store in scheduler
• Missing operands filled in from bypass
• Issue: when ready, operands sent directly from scheduler to functional units
CSE502: Computer Architecture
Scheduling Loop or Wakeup-Select Loop
• Wake-Up Part:
– Executing insn notifies dependents
– Waiting insns. check if all deps are satisfied
• If yes, “wake up” instruction
• Select Part:
– Choose which instructions get to execute
• More than one insn. can be ready
• Number of functional units and memory ports are limited
CSE502: Computer Architecture
Interaction with Execution
[Figure: Select Logic picks a ready entry; the grant reads that entry’s opcode and operand values (ValL, ValR) out of the Payload RAM and sends them to execution]
CSE502: Computer Architecture
Simple Scheduler Pipeline
[Figure: A is selected, reads its payload, executes, and broadcasts its tag and result; B wakes up on the tag broadcast, enables capture on the tag match, captures the result, and goes through select/payload/execute in cycle i+1; select, payload, execute, and capture all fit in one very long clock cycle]
CSE502: Computer Architecture
Deeper Scheduler Pipeline
[Figure: select, payload, and execute are now separate pipeline stages over cycles i to i+3; the tag broadcast overlaps the next select, so B and C still issue back-to-back. Faster, but Capture & Payload land on the same cycle]
CSE502: Computer Architecture
Very Deep Scheduler Pipeline
[Figure: select/payload/execute spread over cycles i to i+6; A and B are both ready but only A is selected, so B bids again; values A→C and C→D must be bypassed, while B→D is OK without bypass]
Dependent instructions can’t execute back-to-back
CSE502: Computer Architecture
Non-Data-Capture Scheduler
[Figure: Fetch & Dispatch feeds a scheduler that holds only tags; operands are read from a unified PRF (or from a PRF plus ARF) between issue and execute, and the Functional Units write the physical register update back]
CSE502: Computer Architecture
Pipeline Timing
Data-Capture (S X E): Select | Payload | Execute, with the dependent’s Wakeup/Select overlapping Execute (a “skip” cycle)
Non-Data-Capture (S X X X E): Select | Payload | Read Operands from PRF | Execute
Substantial increase in schedule-to-execute latency
CSE502: Computer Architecture
Handling Multi-Cycle Instructions
Add R1 = R2 + R3    Sched | PayLd | Exec
Xor R4 = R1 ^ R5    WU | Sched | PayLd | Exec   (single-cycle Add: dependent issues back-to-back)
Mul R1 = R2 × R3    Sched | PayLd | Exec | Exec | Exec
Add R4 = R1 + R5    … WU | Sched | PayLd | Exec   (wakeup delayed until the Mul finishes)
Instructions can’t execute too early
CSE502: Computer Architecture
Non-Deterministic Latencies
• Real situations have unknown latency
– Load instructions
• Latency ∈ {L1_lat, L2_lat, L3_lat, DRAM_lat}
• DRAM_lat is not a constant either, queuing delays
– Architecture specific cases
• PowerPC 603 has “early out” for multiplication
• Intel Core 2 has an early-out divider also
CSE502: Computer Architecture
Load-Hit Speculation
• Caches work pretty well
– Hit rates are high (otherwise we wouldn’t use caches)
– Assume all loads hit in the cache
R1 = 16[$sp]    Sched | PayLd | Exec | Exec | Exec   (cache hit, data forwarded)
R2 = R1 + #4    broadcast delayed by DL1 latency → Sched | PayLd | Exec
What to do on a cache miss?
CSE502: Computer Architecture
Simple Select Logic
Grant0 = 1
Grant1 = !Bid0
Grant2 = !Bid0 & !Bid1
Grant3 = !Bid0 & !Bid1 & !Bid2
…
Grantn-1 = !Bid0 & … & !Bidn-2
[Figure: a priority chain across the S scheduler entries (x_i = Bid_i in, grant_i out); the naive chain has O(S) gate delay, but the logic can be restructured into O(log S) gates of delay]
CSE502: Computer Architecture
Implementing Oldest First Select
[Figure: entries G, A, F, D, B, H, C, E carry age tags 6, 0, 5, 3, 1, 7, 2, 4; age-aware select logic compares ages (non-ready entries treated as age ∞) and grants the oldest ready entry]
Must broadcast grant age to instructions
CSE502: Computer Architecture
Problems in N-of-M Select
[Figure: N-of-M select built from N cascaded age-aware 1-of-M arbiters over entries G, A, F, D, B, H, C, E (ages 6, 0, 5, 3, 1, 7, 2, 4); granted entries become age ∞ for the next layer. Each 1-of-M select costs O(log M) gate delay, so N layers → O(N log M) delay]
CSE502: Computer Architecture
Select Binding
[Figure: each RS entry is bound at dispatch to one ALU’s select logic.
Not-Quite-Oldest-First: ready insns are aged 2, 3, and 4, but the issued insns are 2 and 4, because the age-3 insn is bound to the other ALU’s select logic.
Wasted Resources: 3 instructions are ready, but only 1 gets to issue, because all are bound to ALU1 while ALU2 sits idle]
CSE502: Computer Architecture
Execution Ports
• Divide functional units into P groups
– Called “ports”
• Area only O(P²·M·log M), where P << F
• Logic for tracking bids and grants less complex (deals with P sets)
[Figure: example ports. Port 0: ALU1/Shift; Port 1: ALU2/Store; Port 2: ALU3/M/D; Port 3: FAdd/FM/D/SIMD; Port 4: Load; queued insns (ADD age 3, LOAD age 5, ADD age 2, MUL age 8) bid only on their assigned ports]
CSE502: Computer Architecture
Decentralized RS
• Natural split: INT vs. FP
[Figure: an Int cluster (ALU1, ALU2 plus Load/Store ports and the INT RF, with Int-only wakeup) and an FP cluster (FAdd, FM/D, FP-Ld, FP-St and the FP RF, with FP-only wakeup) share the L1 Data Cache via Ports 0-3]
• Often implies non-ROB-based physical register file:
– One “unified” integer PRF and one “unified” FP PRF, each managed separately with their own free lists
CSE502: Computer Architecture
Higher Complexity not Worth Effort
[Figure: performance vs. “effort” curve rising through Scalar In-Order and Moderate-Pipe Superscalar/OOO, then flattening toward Very-Deep-Pipe Aggressive Superscalar/OOO; it made sense to go Superscalar/OOO (good ROI), but beyond that there is very little gain for substantial effort]
CSE502: Computer Architecture
SMP Machines
• SMP = Symmetric Multi-Processing
– Symmetric = All CPUs have “equal” access to memory
• OS sees multiple CPUs
– Runs one process (or thread) on each CPU
[Figure: CPU0, CPU1, CPU2, and CPU3 share memory]
CSE502: Computer Architecture
MP Workload Benefits
[Figure: runtime comparison; on a single 3-wide OOO CPU, Task A and Task B run back-to-back; a 4-wide OOO CPU shortens each only a little; pairs of 3-wide or even 2-wide OOO CPUs run Task A and Task B in parallel and finish sooner (Benefit)]
CSE502: Computer Architecture
… If Only One Task Available
[Figure: with only Task A, a single 3-wide OOO CPU sets the baseline and a 4-wide OOO CPU gives a small Benefit; on a dual 3-wide CMP the second core sits Idle (no benefit over 1 CPU), and on a dual 2-wide CMP Task A actually runs slower (performance degradation!)]
CSE502: Computer Architecture
Chip-Multiprocessing (CMP)
• Simple SMP on the same chip
– CPUs now called “cores” by hardware designers
– OS designers still call these “CPUs”
[Figures: Intel “Smithfield” block diagram; AMD dual-core Athlon FX]
CSE502: Computer Architecture
On-chip Interconnects (1/4)
• Today, (Core+L1+L2) = “core”
– (L3+I/O+Memory) = “uncore”
• How to interconnect multiple “core”s to “uncore”?
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
[Figure: bus example; four Cores, each with a private $, share one bus to the LLC $ and the Memory Controller]
CSE502: Computer Architecture
On-chip Interconnects (2/4)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
[Figure: crossbar example, as in the Oracle UltraSPARC T5 (3.6GHz, 16 cores, 8 threads per core); Cores with private $ cross-connect to $ Banks 0-3 and the Memory Controller]
CSE502: Computer Architecture
On-chip Interconnects (3/4)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
[Figure: ring example, as in Intel Sandy Bridge (3.5GHz, 6 cores, 2 threads per core); Cores, $ Banks 0-3, and the Memory Controller sit on one ring]
• 3 ports per switch
• Simple and cheap
• Can be bi-directional to reduce latency
CSE502: Computer Architecture
On-chip Interconnects (4/4)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
[Figure: mesh example, as in the Tilera Tile64 (866MHz, 64 cores); a tiled organization combines a Core and a $ bank at each mesh switch, with up to 5 ports per switch and the Memory Controller at the edge]
CSE502: Computer Architecture
Multi-Threading
• Uni-Processor: 4-6 wide, lucky if you get 1-2 IPC
– Poor utilization of transistors
• SMP: 2-4 CPUs, but need independent threads
– Poor utilization as well (if limited tasks)
• {Coarse-Grained,Fine-Grained,Simultaneous}-MT
– Use single large uni-processor as a multi-processor
• Core provides multiple hardware contexts (threads)
– Per-thread PC
– Per-thread ARF (or map table)
– Each core appears as multiple CPUs
• OS designers still call these “CPUs”
CSE502: Computer Architecture
Scalar Pipeline
Dependencies limit functional unit utilization
CSE502: Computer Architecture
Superscalar Pipeline
Higher performance than scalar, but lower utilization
CSE502: Computer Architecture
Chip Multiprocessing (CMP)
Limited utilization when running one thread
CSE502: Computer Architecture
Coarse-Grained Multithreading
[Figure: one thread runs until a long-latency stall triggers a Hardware Context Switch to another thread]
Only good for long latency ops (i.e., cache misses)
CSE502: Computer Architecture
Fine-Grained Multithreading
Saturated workload -> Lots of threads
Unsaturated workload -> Lots of stalls
Intra-thread dependencies still limit performance
CSE502: Computer Architecture
Simultaneous Multithreading
Max utilization of functional units
CSE502: Computer Architecture
Paired vs. Separate Processor/Memory?
• Separate CPU/memory
– Uniform memory access (UMA)
• Equal latency to memory
– Low peak performance
• Paired CPU/memory
– Non-uniform memory access (NUMA)
• Faster local memory
• Data placement matters
– High peak performance
[Figure: UMA: four CPU($) share the Mem modules over one interconnect; NUMA: each CPU($) pairs with its local Mem and a router (R)]
CSE502: Computer Architecture
Issues for Shared Memory Systems
• Two big ones
– Cache coherence
– Memory consistency model
• Closely related
• Often confused
CSE502: Computer Architecture
Cache Coherence: The Problem
• Variable A initially has value 0
• P1 stores value 1 into A
• P2 loads A from memory and sees old value 0
[Figure: P1 (t1: Store A=1) updates its L1 copy of A from 0 to 1; P2 (t2: Load A?) still has A: 0 in its L1; Main Memory on the shared Bus also still holds A: 0]
Need to do something to keep P2’s cache coherent
CSE502: Computer Architecture
Simple MSI Protocol
[Figure: MSI state machine.
Invalid → Shared on Load / BusRd; Invalid → Modified on Store / BusRdX.
Shared: Load / --; Shared → Modified on Store / BusRdX; Shared → Invalid on Evict / -- or on BusRdX, BusInv / [BusReply]; BusRd / [BusReply] stays Shared.
Modified: Load, Store / --; Modified → Shared on BusRd / BusReply; Modified → Invalid on BusRdX / BusReply; Evict / BusWB.
Cache Actions: Load, Store, Evict. Bus Actions: BusRd, BusRdX, BusInv, BusWB, BusReply]
Usable coherence protocol
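A minimal sketch of the CPU-side transitions from the diagram, in C (bus-side reactions and the [BusReply] cases are omitted; an illustration, not a full protocol engine):

/* MSI next-state function for local cache events. */
typedef enum { INVALID, SHARED, MODIFIED } msi_t;
typedef enum { LOAD, STORE, EVICT } op_t;

msi_t msi_cpu_event(msi_t s, op_t op) {
    switch (s) {
    case INVALID:
        if (op == LOAD)  return SHARED;    /* Load / BusRd     */
        if (op == STORE) return MODIFIED;  /* Store / BusRdX   */
        return INVALID;
    case SHARED:
        if (op == STORE) return MODIFIED;  /* Store / BusRdX   */
        if (op == EVICT) return INVALID;   /* Evict / --       */
        return SHARED;                     /* Load / --        */
    case MODIFIED:
        if (op == EVICT) return INVALID;   /* Evict / BusWB    */
        return MODIFIED;                   /* Load, Store / -- */
    }
    return s;
}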
CSE502: Computer Architecture
Coherence vs. Consistency
• Coherence concerns only one memory location
• Consistency concerns ordering for all locations
• A Memory System is Coherent if
– Can serialize all operations to that location
• Operations performed by any core appear in program order
– Read returns value written by last store to that location
• A Memory System is Consistent if
– It follows the rules of its Memory Model
• Operations on memory locations appear in some defined order
CSE502: Computer Architecture
Sequential Consistency (SC)
[Figure: processors P1, P2, P3 issue memory ops in program order; a switch, randomly set after each memory op, selects which processor’s next op reaches Memory]
Defines Single Sequential Order Among All Ops.
CSE502: Computer Architecture
Mutex Example w/ Store Buffer
[Figure: P1 and P2 sit on a Shared Bus with memory holding A: 0, B: 0; each core’s write (Write A at t3, Write B at t4) sits in its store buffer while its read (Read B at t1, Read A at t2) goes to memory first]
P1:
lockA: A = 1;
if (B != 0)
{ A = 0; goto lockA; }
/* critical section */
A = 0;
P2:
lockB: B = 1;
if (A != 0)
{ B = 0; goto lockB; }
/* critical section */
B = 0;
Does not work: with store buffers, both reads can complete before either buffered write becomes visible, so P1 and P2 both see 0 and both enter the critical section.
CSE502: Computer Architecture
Relaxed Consistency Models
• Sequential Consistency (SC):
– R → W, R → R, W → R, W → W   (X → Y: X must complete before Y)
• Total Store Ordering (TSO) relaxes W → R
– R → W, R → R, W → W
• Partial Store Ordering (PSO) relaxes W → W (coalescing WB)
– R → W, R → R
• Weak Ordering or Release Consistency (RC)
– All ordering explicitly declared
• Use fences to define boundaries
• Use acquire and release to force flushing of values
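As an illustration of where a fence would go (a sketch using C11 atomics, assuming the store-buffer reordering from the mutex example; backoff/liveness details omitted):

#include <stdatomic.h>

atomic_int A, B;

void lockA_fixed(void) {
    for (;;) {
        atomic_store(&A, 1);
        /* Full fence restores the W → R ordering the lock relies on:
           the write to A must be visible before B is read. */
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load(&B) == 0) return;   /* got the lock */
        atomic_store(&A, 0);                /* back off and retry */
    }
}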
CSE502: Computer Architecture
Good Luck!