Detailed Design and Evaluation of Redundant Multithreading

Download Report

Transcript Detailed Design and Evaluation of Redundant Multithreading

Redundant Multithreading Techniques
for
Transient Fault Detection
Shubu Mukherjee Intel
Michael Kontz HP (current)
Steve Reinhardt Intel Consultant, U. of Michigan
Versions of this work have been presented at ISCA 2000 and ISCA 2002
Transient Faults from
Cosmic Rays & Alpha particles
+ decreasing feature size
-
decreasing voltage (exponential dependence?)
increasing number of transistors (Moore’s Law)
increasing system size (number of processors)
no practical absorbent for cosmic rays
2
Fault Detection via Lockstepping
(HP Himalaya)
R1  (R2)
R1  (R2)
microprocessor
microprocessor
Output
Comparison
Input
Replication
Memory covered by ECC
RAID array covered by parity
Servernet covered by CRC
Replicated Microprocessors + Cycle-by-Cycle Lockstepping
3
Fault Detection via Simultaneous Multithreading
R1  (R2)
R1  (R2)
THREAD
THREAD
Output
Comparison
Input
Replication
Memory covered by ECC
RAID array covered by parity
Servernet covered by CRC
Threads
?
Replicated Microprocessors + Cycle-by-Cycle Lockstepping
4
Simultaneous Multithreading (SMT)
Thread1
Thread2
Instruction
Scheduler
Functional
Units
Example: Alpha 21464, Intel Northwood
5
Redundant Multithreading (RMT)
RMT = Multithreading + Fault Detection
Multithreading
(MT)
Redundant
Multithreading
(RMT)
Multithreaded
Simultaneous
Uniprocessor
Multithreading
(SMT)
Simultaneous &
Redundant
Threading (SRT)
Chip
Multiprocessor
(CMP)
Multiple Threads
running on CMP
6
Chip-Level
Redundant
Threading (CRT)
Outline
SRT concepts & design
 Preferential Space Redundancy
 SRT Performance Analysis



Single- & multi-threaded workloads
Chip-level Redundant Threading (CRT)
Concept
 Performance analysis

Summary
 Current & Future Work

7
Overview
SRT = SMT + Fault Detection
 Advantages



Piggyback on an SMT processor with little extra hardware

Better performance than complete replication

Lower cost due to market volume of SMT & SRT
Challenges


Lockstepping very difficult with SRT
Must carefully fetch/schedule instructions from redundant
threads
8
Sphere of Replication
Sphere of Replication
Leading
Thread
Trailing
Thread
Input
Replication
Output
Comparison
Memory System (incl. L1 caches)

Two copies of each architecturally visible thread


Co-scheduled on SMT core
Compare results: signal fault if different
9
Basic Pipeline
Fetch
Decode
Dispatch
Execute
Data Cache
10
Commit
Load Value Queue (LVQ)
Fetch
Decode
Dispatch
Execute
Commit
LVQ
Data Cache

Load Value Queue (LVQ)


Keep threads on same path despite I/O or MP writes
Out-of-order load issue possible
11
Store Queue Comparator (STQ)
Fetch
Decode
Dispatch
Execute
Commit
STQ
Data Cache

Store Queue Comparator


Compares outputs to data cache
Catch faults before propagating to rest of system
12
Store Queue Comparator (cont’d)
Store
Queue

Compare
address & data
to data
cache
Extends residence time of leading-thread stores




st ...
st 5  [0x120]
st ...
st ...
st ...
st 5  [0x120]
Size constrained by cycle time goal
Base CPU statically partitions single queue among threads
Potential solution: per-thread store queues
Deadlock if matching trailing store cannot commit

Several small but crucial changes to avoid this
13
Branch Outcome Queue (BOQ)
BOQ
Fetch
Decode
Dispatch
Execute
Commit
Data Cache

Branch Outcome Queue


Forward leading-thread branch targets to trailing fetch
100% prediction accuracy in absence of faults
14
Line Prediction Queue (LPQ)
LPQ
Fetch
Decode
Dispatch
Execute
Commit
Data Cache

Line Prediction Queue


Alpha 21464 fetches chunks using line predictions
Chunk = contiguous block of 8 instructions
15
Line Prediction Queue (cont’d)

Generate stream of “chunked” line predictions


Every leading-thread instruction carries its
I-cache coordinates
Commit logic merges into fetch chunks for LPQ
–
Independent of leading-thread fetch chunks
–
Commit-to-fetch dependence raised deadlock issues
1F8:
Chunk 1: end of cache line
1FC:
200:
Chunk 2: taken branch 204:
208:
200:
16
add
load R1(R2)
beq 280
and
bne 200
add
Line Prediction Queue (cont’d)

Read-out on trailing-thread fetch also complex



Base CPU “thread chooser” gets multiple line
predictions, ignores all but one
Fetches must be retried on I-cache miss
Tricky to keep queue in sync with thread progress


Add handshake to advance queue head
Roll back head on I-cache miss
–
Track both last attempted & last successful chunks
17
Outline
SRT concepts & design
 Preferential Space Redundancy
 SRT Performance Analysis



Single- & multi-threaded workloads
Chip-level Redundant Threading (CRT)
Concept
 Performance analysis

Summary
 Current & Future Work

18
Preferential Space Redundancy

SRT combines two types of redundancy



Space redundancy preferable


Time: same physical resource, different time
Space: different physical resource
Better coverage of permanent/long-duration faults
Bias towards space redundancy where possible
19
PSR Example: Clustered Execution
LPQ
add r1,r2,r3
add r1,r2,r3 add r1,r2,r3
Fetch
Decode
add r1,r2,r3
IQ 0
Dispatch
Commit
IQ 1

Exec 0
Exec 1
Base CPU has two execution clusters

Separate instruction queues, function units

Steered in dispatch stage
20
PSR Example: Clustered Execution
LPQ
0 000
add r1,r2,r3 [0] add r1,r2,r3 [0]
IQ 0
Fetch
Decode
Dispatch
add r1,r2,r3 [0]
0
Commit
IQ 1

Exec 0
Exec 1
Leading thread instructions record their cluster



Bit carried with fetch chunk through LPQ
Attached to trailing-thread instruction
Dispatch sends to opposite cluster if possible
21
PSR Example: Clustered Execution
LPQ
IQ 0
Fetch
Decode
Exec 0
Dispatch
Commit
add r1,r2,r3 [0] add r1,r2,r3 [0] add r1,r2,r3 [0]
IQ 1
Exec 1
add r1,r2,r3 add r1,r2,r3

99.94% of instruction pairs use different clusters


Full spatial redundancy for execution
No performance impact (occasional slight gain)
22
Outline
SRT concepts & design
 Preferential Space Redundancy
 SRT Performance Analysis



Single- & multi-threaded workloads
Chip-level Redundant Threading (CRT)
Concept
 Performance analysis

Summary
 Current & Future Work

23
SRT Evaluation

Used SPEC CPU95, 15M instrs/thread



Constrained by simulation environment
 120M instrs for 4 redundant thread pairs
Eight-issue, four-context SMT CPU


128-entry instruction queue
64-entry load and store queues
–



Default: statically partitioned among active threads
22-stage pipeline
64KB 2-way assoc. L1 caches
3 MB 8-way assoc L2
24
Relative Performance
SRT Performance: One Thread
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
SRT
SRT + ptSQ
s
es
m
co
pr
f
pp
p
p
c
gc
go
dr
y
h
d
o2
eg
p
ij
li
m
m
si
k
88
rl
e
p
t
x
tv
e
a
t
c
or
m
v
o

One logical thread  two hardware contexts

Performance degradation = 30%

Per-thread store queue buys extra 4%
25
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
im
+f
pp
pp
sw
im
go
+s
w
go
+f
pp
pp
wi
m
gc
c+
s
go
SRT
SRT + ptSQ
gc
c+
gc
c+
fp
pp
p
Relative Performance
SRT Performance: Two Threads
Two logical threads  four hardware contexts
 Average slowdown increases to 40%
 Only 32% with per-thread store queues

26
Outline
SRT concepts & design
 Preferential Space Redundancy
 SRT Performance Analysis



Single- & multi-threaded workloads
Chip-level Redundant Threading (CRT)
Concept
 Performance analysis

Summary
 Current & Future Work

27
Chip-Level Redundant Threading
SRT typically more efficient than splitting one
processor into two half-size CPUs
 What if you already have two CPUs?



IBM Power4, HP PA-8800 (Mako)
Conceptually easy to run these in lock-step


Benefit: full physical redundancy
Costs:
–
–

Latency through centralized checker logic
Overheads (misspeculation etc.) incurred twice
CRT combines best of SRT & lockstepping

requires multithreaded CMP cores
28
Chip-Level Redundant Threading
CPU A
Leading
Thread A
LVQ
LPQ
Stores
CPU B
Trailing
Thread A
LVQ
Trailing
Thread B
LPQ
Stores
29
Leading
Thread B
CRT Performance
Lock8
CRT
Relative Performance
1.2
CRT + ptSQ
1
0.8
0.6
0.4
0.2
0
1

2
3
4
5
6
8
7
9
10
11
12
13
14
15
With per-thread store queues, ~13% improvement
over lockstepping with 8-cycle checker latency
30
Summary & Conclusions

SRT is applicable in a real-world SMT design


~30% slowdown, slightly worse with two threads
Store queue capacity can limit performance
Preferential space redundancy improves coverage
 Chip-level Redundant Threading = SRT for CMPs



Looser synchronization than lockstepping
Free up resources for other application threads
31
More Information

Publications




S.K. Reinhardt & S.S.Mukherjee, “Transient Fault Detection via
Simultaneous Multithreading,” International Symposium on
Computer Architecture (ISCA), 2000
S.S.Mukherjee, M.Kontz, & S.K.Reinhardt, “Detailed Design
and Evaluation of Redundant Multithreading Alternatives,”
International Symposium on Computer Architecture (ISCA),
2002
Papers available from:
– http://www.cs.wisc.edu/~shubu
– http://www.eecs.umich.edu/~stever
Patents

Compaq/HP filed eight patent applications on SRT
32