A Mechanism for Online Diagnosis
of Hard Faults in Microprocessors
Motivation
Current Techniques
Proposed Mechanism for Online Fault Diagnosis
Results
Challenges
Conclusion
Transient Faults
Hard Faults
Electromigration
Gate Oxide Breakdown
Single Event Upset
Process Scaling
Redundancy
DIVA
DIVA error detection and correction
[Flowchart: DIVA error → track units → error_count++ → if (error_count > threshold): YES → deconfigure unit and utilize redundancy; NO → no action]
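The counter-and-threshold flow above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware described in the slides: the class name `FaultDiagnoser`, the unit labels such as `"alu0"`, and the threshold value are all assumptions made for the example.

```python
from collections import defaultdict


class FaultDiagnoser:
    """Hypothetical model: one error counter per tracked unit."""

    def __init__(self, threshold):
        self.threshold = threshold            # assumed value, chosen by the caller
        self.error_count = defaultdict(int)   # unit name -> errors attributed so far
        self.deconfigured = set()             # units already turned off

    def on_diva_error(self, units_used):
        """Handle one checker-detected error.

        units_used lists the units the faulting instruction passed through.
        Returns the units that crossed the threshold on this error.
        """
        newly_deconfigured = []
        for unit in units_used:
            if unit in self.deconfigured:
                continue
            self.error_count[unit] += 1                      # error_count++
            if self.error_count[unit] > self.threshold:      # YES branch
                self.deconfigured.add(unit)
                newly_deconfigured.append(unit)
            # NO branch: no action
        return newly_deconfigured


diag = FaultDiagnoser(threshold=2)
first = diag.on_diva_error(["alu0", "rob3"])
second = diag.on_diva_error(["alu0", "rob7"])
third = diag.on_diva_error(["alu0", "rob1"])
```

In this run every faulting instruction used `alu0` but a different reorder buffer entry, so only the `alu0` counter climbs past the threshold; the third error is the one that deconfigures it.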
Reorder Buffer
Units that can be turned off in case of a fault
ALU
Reservation Station
DIVA CHECKER
Deconfigure entries in circular buffer
Deconfigure entries in tabular structure
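One way to picture deconfiguring entries in a circular buffer is a per-entry fault mask that the allocation pointer skips over. The sketch below is a simplified software analogy under that assumption; the slides do not give the actual circuit, and the `CircularBuffer` class and its methods are hypothetical names. It models only pointer advancement, not occupancy or retirement.

```python
class CircularBuffer:
    """Hypothetical circular buffer whose faulty entries are skipped."""

    def __init__(self, size):
        self.size = size
        self.faulty = [False] * size   # assumed per-entry fault mask
        self.tail = 0                  # next slot to allocate

    def deconfigure(self, index):
        """Mark one entry as unusable."""
        self.faulty[index] = True

    def usable_entries(self):
        return self.size - sum(self.faulty)

    def _advance(self, ptr):
        """Return the next usable index after ptr, wrapping around."""
        for step in range(1, self.size + 1):
            candidate = (ptr + step) % self.size
            if not self.faulty[candidate]:
                return candidate
        raise RuntimeError("no usable entries left")

    def allocate(self):
        """Return the slot for the next entry, then move tail past faulty slots."""
        if self.faulty[self.tail]:
            self.tail = self._advance(self.tail)
        slot = self.tail
        self.tail = self._advance(self.tail)
        return slot


buf = CircularBuffer(4)
buf.deconfigure(2)
slots = [buf.allocate() for _ in range(5)]
```

With entry 2 deconfigured, allocation cycles through slots 0, 1, and 3 only, so the buffer keeps working with three usable entries instead of four.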
• DIVA: 6% of an Alpha 21264 core
• Error counters (~1227 bits total)
• Instruction resource usage (19 wires in total)
• Deconfiguration logic
• Can be reduced using coarse granularity
Hard fault diagnosis latency
Performance impact of losing a component to a hard fault
Error count threshold
• Related to resource usage
• Heavily used resources have higher counters
• Pipeline flushes before threshold is reached
Transient faults
Independent resource usage: desired vs. observed
[Figures: DIVA checker receiving errors from hard and transient faults; diagram labels A to F]
• Certain structures cannot be protected
• Register File
• Issue logic
• Common Data Bus (CDB)
• Transient fault False Deconfiguration
• Possibly masked by error counter
• Faults in the error counter or deconfiguration logic
• Periodically test counters
• Permanently configure or deconfigure FDU upon error
• Window of vulnerability
• DIVA produces errors until counter saturates
• As transistors shrink, hard fault rate increases
• Current reliability mechanisms
• Redundancy (TMR)
• Thread-level redundancy
• Pre-shipment testing and deconfiguration
• Low-cost solutions such as DIVA
• Online diagnosis
• Low cost and hardware overhead
• Use FDUs along with DIVA to diagnose faults dynamically
• Increase yield → binned to a lower performance bin
What are the advantages of this hybrid scheme over using just a DIVA checker?
As process technology shrinks, can this mechanism significantly extend the lifetime of the processor?
As transistors shrink and core counts increase, can this mechanism still be used instead of turning off a faulty core?
How can we extend this mechanism to cover the issue logic, singleton resources, and the CDB?
Electron Migration. Digital image. Wikimedia.org. Wikimedia, 6 Mar. 2007. Web.
<http://upload.wikimedia.org/wikipedia/commons/thumb/8/8b/Leiterbahn_ausfallort_elektromigration.jpg/220pxLeiterbahn_ausfallort_elektromigration.jpg>.
Gate Oxide Breakdown. Digital image. Attopsemi Technology. Attopsemi Technology, n.d. Web.
<http://www.attopsemi.com/tec3.htm>.
Sawant, Minal. Single Event Upset. Digital image. COTS. Microsemi, Jan. 2012. Web.
<http://www.cotsjournalonline.com/articles/view/102279>.
Sawant, Minal. Soft Error Rate. Digital image. CCCP. University of Michigan, 11 May 2012. Web.
<http://cccp.eecs.umich.edu/research/reliability.php>.
Carr, Robert. Simultaneous Multithreading. Digital image. Prezi. Prezi, 31 Oct. 2013. Web.
<http://prezi.com/tegbbfk34l57/question-2/>.
Wong, William. Out of Order Pipeline. Digital image. Electronic Design. Electronic Design, 19 Oct. 2011. Web.
<http://electronicdesign.com/microcontrollers/little-core-shares-big-core-architecture>.
Mark Brehob, EECS 470 Lecture Slides
Fred A. Bower, Daniel J. Sorin, and Sule Ozev. A Mechanism for Online Diagnosis of Hard Faults in Microprocessors. In Proc. of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), 2005.
T.M. Austin. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. In Proc. of the 32nd Annual IEEE/ACM Int'l Symposium on Microarchitecture, pages 196-207, Nov. 1999.