Part I - Greg Bronevetsky

CS717
Hardware Fault Tolerance Through
Simultaneous Multithreading
Jonathan Winter
Motivation
• Microprocessor hardware is becoming more and
more susceptible to transient faults due to:
– Increasing number of transistors on a chip
– Decreasing feature sizes
– Reduced chip voltages and noise margins
– Widespread use of dynamic logic
• Extra available transistors can be used to provide
architecture-based fault tolerance
• Benefits:
– Hide all fault tolerance from software
– Program independent
SMT-based Fault Tolerance
• Complete redundancy without complete
replication
• Leverages idle hardware already on chip
• Uses inter-thread “communication” to
decrease execution time
• Potential for dual function chip
– switches between high performance and fault
tolerance modes
3 SMT + Fault Tolerance Papers
• Eric Rotenberg, "AR-SMT - A Microarchitectural
Approach to Fault Tolerance in Microprocessors",
Symposium on Fault-Tolerant Computing, 1999.
• Steven K. Reinhardt and Shubhendu S. Mukherjee,
"Transient Fault Detection via Simultaneous
Multithreading", ISCA 2000.
• Shubhendu S. Mukherjee, Michael Kontz and Steven
K. Reinhardt, "Detailed Design and Evaluation of
Redundant Multithreading Alternatives", ISCA 2002.
Outline
1. Background
   • SMT
   • Hardware fault tolerance
2. AR-SMT
   • Basic mechanisms
   • Implementation issues
   • Simulation and Results
3. Transient Fault Detection via SMT
   • Sphere of replication
   • Basic mechanisms
   • Comparison to AR-SMT
   • Simulation and Results
4. Redundant Multithreading Alternatives
   • Realistic processor implementation
   • CRT
   • Simulation and Results
5. Fault Recovery
6. Next Lecture
Simultaneous Multithreading
• Multiple threads from the same or different processes
execute simultaneously through the pipeline
• Dynamic partitioning of resources reduces waste
• SMT transparent to stages in between register
renaming and write-back/commit
• Some hardware must be duplicated for each thread
Hardware Fault Tolerance
• Fault tolerance provided by redundancy
– Time (Execute program twice on same hardware)
– Space (Run on duplicate hardware)
– Information (Use parity, ECC, etc.)
• Previous approaches
– Complete hardware replication only for mission-critical systems (lockstepping)
– Parity and ECC for caches, memories, and wires
– Self checking circuits (expensive)
– Recomputing with Shifted Operands
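The time and information forms of redundancy listed above can be illustrated with a minimal software sketch. This is only an analogy for what the hardware does; the function names (`parity_bit`, `store`, `check`, `run_twice`) are hypothetical and do not come from the slides or the papers.

```python
from typing import Callable, Tuple


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word contains an odd number of 1 bits."""
    return bin(word).count("1") % 2


def store(word: int) -> Tuple[int, int]:
    """Information redundancy: keep one extra parity bit next to the data."""
    return word, parity_bit(word)


def check(stored: Tuple[int, int]) -> bool:
    """Return True if the stored word still matches its parity bit."""
    word, parity = stored
    return parity_bit(word) == parity


def run_twice(fn: Callable, *args):
    """Time redundancy: execute the same computation twice and compare."""
    first = fn(*args)
    second = fn(*args)
    if first != second:
        raise RuntimeError("results differ: transient fault suspected")
    return first
```

A single flipped bit changes the parity and is detected by `check`; two flipped bits in the same word are not, which is one reason ECC is used where stronger coverage is needed.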
Hardware Fault Tolerance (part 2)
• Transient Faults
– Alpha & Beta particles and cosmic rays (neutrons)
– Flip bit stored in memory or dynamic latch
– Cause erroneous output of a logic gate
• Permanent Faults
– Design errors
– Manufacturing defects
– Electromigration
Origins of SMT + Fault Tolerance
• First proposed by Nirmal R. Saxena and Edward J.
McCluskey in “Dependable Adaptive Computing
Systems – The ROAR Project”, IEEE Conference on
Systems, Man, and Cybernetics, 1998
• Fault detection performed at checkpoints
• OS inter-process communication used for voting
• Redundant threads could also be explicitly coded
• No SMT simulation performed
• Reconfigurable hardware for HP/FT modes
AR-SMT
• First paper to describe implementation details
• Fault tolerance through time/space
redundancy
• Transient faults are primary focus
• Two copies of the program run as separate
threads sharing hardware resources
• Dynamic instruction scheduling enables
efficient resource utilization
Basic Mechanism
• Two copies called active stream (A-stream)
and redundant instruction stream (R-stream)
• A-stream
– Executes normally
– Pushes results (new PC, register updates,
memory modifications) on Delay Buffer
• R-stream
– Compares execution results with delay buffer
– Committed state used as checkpoint if comparison
fails (fault detected)
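The A-stream/R-stream interaction described above can be sketched as a toy software model. It is a simplification under stated assumptions: the instruction format (a three-register add), the class and function names, and the `flip_at` fault-injection parameter are all hypothetical, and recovery from the committed checkpoint is not modeled (a mismatch simply raises an exception).

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when the R-stream result disagrees with the delay buffer."""


class DelayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def is_full(self):
        return len(self.entries) >= self.capacity

    def push(self, result):
        """A-stream side: record a completed result."""
        self.entries.append(result)

    def check(self, result):
        """R-stream side: compare against the oldest A-stream result."""
        expected = self.entries.popleft()
        if expected != result:
            raise FaultDetected(f"A-stream {expected} != R-stream {result}")


def execute(instr, regs):
    """Toy instruction (dest, src1, src2): dest = src1 + src2."""
    dest, a, b = instr
    regs[dest] = regs[a] + regs[b]
    return dest, regs[dest]


def run_redundantly(program, initial_regs, capacity=4, flip_at=None):
    a_regs, r_regs = dict(initial_regs), dict(initial_regs)
    buf = DelayBuffer(capacity)
    a_pc = r_pc = 0
    while r_pc < len(program):
        if a_pc < len(program) and not buf.is_full():
            dest, value = execute(program[a_pc], a_regs)
            if flip_at == a_pc:
                value ^= 1  # injected single-bit fault (demo assumption)
                a_regs[dest] = value
            buf.push((a_pc, dest, value))
            a_pc += 1
        else:
            dest, value = execute(program[r_pc], r_regs)
            buf.check((r_pc, dest, value))
            r_pc += 1
    return r_regs  # state that has been checked by the R-stream
```

Calling `run_redundantly(program, regs)` returns the verified register state, while passing `flip_at=1` corrupts the A-stream's second result and causes `FaultDetected` to be raised when the R-stream reaches that instruction.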
A/R Streams and Delay Buffer
• Delay Buffer ensures time redundancy
• Transient error could be in either stream or
both
• Undetected if error causes same wrong
execution
Control and Data Dependencies
• Delay buffer contains all branch outcomes
and load values
• R-stream can use delay buffer as “predictor”
• Delay buffer is wrong only if an error occurred in
the A-stream
• Hardware for validating prediction doubles as
fault-detection mechanism
• R-stream executes faster because it does not
misspeculate branches or load values
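The idea that prediction validation doubles as fault detection can be sketched as follows. This is a minimal illustration, not the paper's hardware: the branch condition (`value > threshold`), the function names, and the `corrupt_index` fault-injection parameter are assumptions made for the example.

```python
from collections import deque


def a_stream(values, threshold, delay_buffer, corrupt_index=None):
    """A-stream: resolve each branch and record its outcome in the delay buffer."""
    for i, v in enumerate(values):
        taken = v > threshold
        if i == corrupt_index:
            taken = not taken  # injected transient fault (demo assumption)
        delay_buffer.append(taken)


def r_stream(values, threshold, delay_buffer):
    """R-stream: treat the buffered outcome as the branch prediction and
    validate it when the branch resolves. Returns the index of the first
    mismatch, or None if every prediction was confirmed."""
    for i, v in enumerate(values):
        predicted = delay_buffer.popleft()  # never a cold or stale predictor
        actual = v > threshold              # ordinary branch resolution
        if predicted != actual:
            return i                        # a "misprediction" here means a fault
    return None
```

In a fault-free run the R-stream never sees a mismatch, so it pays no misspeculation penalty; any mismatch it does see indicates an error in one of the two streams rather than a poor prediction.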
Hierarchical Architecture
• Novel microarchitecture employed to provide space
redundancy and reduce complexity
• Based on earlier proposed Trace Processor
• Eric Rotenberg, Quinn Jacobson, Yiannakis Sazeides, and Jim
Smith, “Trace Processors”, MICRO 1997
• Instruction stream partitioned into larger units called
traces
• Trace passes through front end as one unit
• Backend consists of processing elements (PE)
– Each PE contains an issue queue, registers, and functional units
– Each PE executes one trace at a time
Trace Processors
• Trace is 16 instructions, including branches
• Detect permanent faults by running streams through different PEs
• PE with permanent fault is removed from pool
Implementation Issues
• Register renaming allows processor to hold
two copies of the same logical register
• Virtual memory allows two streams to load
and store from the “same” address without
interference
• Branch and load value predictors only used
by A-stream
• Trace processor design permits only PEs to
be space shared (more later)
Implementation Issues (part 2)
• Complex fetch policies not beneficial in AR-SMT
– Full delay buffer → fetch R-stream
– Delay buffer not full → retire A-stream
• OS allocates two contiguous physical pages, one for
each stream
• No page table for R-stream – use delay buffer
• Stall A-stream on exception, trap, or context switch
• Multiprocessor support/coherence not discussed
• AR-SMT synchronization delays must be included in
real time system guarantees
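The simple buffer-driven policy from the first bullet above can be sketched in a few lines. This is a rough model under assumptions: it treats each cycle as serving exactly one stream and counts buffer occupancy in whole entries, and the names `choose_stream` and `simulate` are hypothetical.

```python
def choose_stream(occupancy, capacity):
    """Give the R-stream priority only when the delay buffer is full."""
    return "R" if occupancy >= capacity else "A"


def simulate(cycles, capacity):
    """Trace which stream is served each cycle, starting from an empty buffer."""
    occupancy = 0
    schedule = []
    for _ in range(cycles):
        stream = choose_stream(occupancy, capacity)
        # The A-stream adds one entry; the R-stream consumes one.
        occupancy += 1 if stream == "A" else -1
        schedule.append(stream)
    return schedule
```

With a capacity of 2, `simulate(6, 2)` yields `['A', 'A', 'R', 'A', 'R', 'A']`: the A-stream runs ahead until the buffer fills, after which the two streams alternate.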
Simulation Environment
• Simulator based on SimpleScalar platform
• No fault coverage evaluated
• Performance of AR-SMT on trace processor
• Benchmarks from SPEC95 integer apps.
• Traces terminate at indirect jumps and returns
• Trace processors with 4 and 8 PEs used
• PEs issue 4 instructions per cycle
• A-stream and R-stream are the only threads running
Results
• Execution time normalized to a single thread
• AR-SMT takes only 12% to 29% longer with 4 PEs
• Overhead reduced to 5% to 27% with 8 PEs
Results (part 2)
• R-stream successfully uses idle bandwidth
• R-stream needs less computational resources because of delay buffer
• AR-SMT overhead higher for programs with high ILP
Conclusions
• SMT-based fault tolerance is superior to re-executing the program on the same hardware
• Control and data flow information can be
forwarded to speed up the lagging thread
• Trace processor architecture can provide
space redundancy to detect permanent faults
• Hardware can be used more efficiently by AR-SMT to provide redundancy with only 10% to 30% overhead.