Soft Errors: Impact and Causes


Spring 2008 CSE 591
Compilers for Embedded Systems
Aviral Shrivastava
Department of Computer Science and Engineering
Arizona State University
Lecture 2: Soft Errors
Beginnings
□ 1954-57 nuclear tests
□ Electronic monitoring equipment failure
□ could not identify the reason!!
□ Worked fine after rebooting
□ no hardware fault, no permanent fault
□ 1962 - Wallmark and Marcus
□ Surmised that cosmic rays can cause failures in electronic systems
□ Minimum Size and Maximum Packing Density of Non-Redundant Semiconductor Devices
□ 1978 – May and Woods of Intel
□ Reported alpha particle induced soft errors in the 2107-series 16 Kbit DRAMs.
□ 1979 – Ziegler and Lanford of IBM
□ Presented solid evidence that the sensitivity of electronics to radiation-induced soft errors could become a nightmare for future technologies.
First Space Casualty
□ Telstar
□ First communication satellite
□ AT&T Bell Telephone, NASA, British GPO, and French PTT
□ Launched July 10, 1962
□ July 23 - live transatlantic television signal
□ Supposed to telecast a speech from President John F. Kennedy
□ Instead telecast major league baseball
□ Telstar ushered in a new age of the benevolent use of technology
□ July 9, 1962
□ United States tested a high-altitude nuclear device (called Starfish Prime), which super-energized the Earth's Van Allen Belt where Telstar took orbit
□ 100X increase in radiation
□ Out of service in December, repaired, but unusable after February
Saving Galileo
□ 1978 – Galileo commissioned for Jupiter exploration
□ 1980 – Design and Architecture decided
□ Use of AT 2901 for attitude control
□ 1982 – Voyager reaches Jupiter
□ Intermittent Resets
□ Sulfur ions from Jupiter’s volcanic moon, Io, were being whipped up to high energy by the Jovian gravity.
□ After extensive testing of Galileo, chief engineer decided “not worth flying if soft error problem not solved”
□ Overheads
□ 5 years, 5 million dollars
□ Sandia National Laboratories was subcontracted to custom-make radiation hardened 2901
Recent – Hubble Space Telescope
□ Intermittent resets after 1996 upgrade of software on Hubble Space Telescope
□ South Atlantic Anomaly
Sun-Earth Interactions
□ 11-year solar cycle of sunspots
□ 10^9 kg/s of material lost by the Sun as ejected solar wind
□ Protons (~70%), electrons, ionized helium, less than 0.5% minor ions
□ 2 x 10^10 protons/cm^2
□ Loss of satellites
Impact on Earth-bound Electronics
□ Documented strikes in large servers found in error logs
□ Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.
□ Sun Microsystems, 2000 (R. Baumann, 2002 IRPS Workshop talk)
□ Cosmic ray strikes on L2 cache with defective error protection caused Sun’s flagship servers to suddenly and mysteriously crash!
□ Companies affected
□ Baby Bell (Atlanta), America Online, eBay, & dozens of other corporations
□ Verisign moved to IBM Unix servers (for the most part)
□ Cisco line cards may reset after single event upset (SEU) failures
Copyright 2005, M. Tahoori
Reactions from Companies
□ Fujitsu SPARC in 130 nm technology
□ 80% of 200k latches protected with parity
□ Compare with very few latches protected in McKinley
□ ISSCC, 2003
□ IBM declared a 1000-year system MTBF as its product goal
□ for Power4 line
□ very hard to achieve this goal in a cost-effective way
□ Bossen, 2002 IRPS Workshop Talk
Evolution of a Product Team’s Psyche
□ Shock
□ “SER is the crabgrass in the lawn of computer design”
□ Denial
□ “We will do the SER work two months before tapeout”
□ Anger
□ “Our reliability target is too ambitious”
□ Acceptance
□ “You can deny physics only for so long”
Growing Problem
□ Is going to become an everyday problem (omnipresent) for every device (ubiquitous)
□ Soft Errors in Embedded Systems
□ Not only a space phenomenon anymore!
Phenomenon of Soft Error
□ Transient Faults
□ Random and spontaneous bit-changes in system
□ Can be caused by
□ Circuit noise
□ Cross-talk
□ More than 50% due to radiation strike
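A transient fault of this kind can be modeled as a single inverted bit in a stored word. The sketch below is only an illustration: the helper names `flip_bit` and `parity` are hypothetical, and the single parity bit is an assumed detection scheme, not something this slide prescribes.

```python
def flip_bit(word: int, bit: int) -> int:
    """Model a soft error: invert one bit of a stored word."""
    return word ^ (1 << bit)


def parity(word: int) -> int:
    """Even-parity bit: 1 if the word holds an odd number of 1 bits."""
    return bin(word).count("1") % 2


stored = 0b10110010
stored_parity = parity(stored)        # computed when the word is written

corrupted = flip_bit(stored, 3)       # transient fault inverts bit 3
detected = parity(corrupted) != stored_parity

# The fault is not permanent: flipping the bit back (or rewriting the
# word) restores the original contents.
restored = flip_bit(corrupted, 3)

print(f"stored={stored:08b} corrupted={corrupted:08b} detected={detected}")
```

A single parity bit only detects an odd number of flipped bits and cannot say which bit changed; it is used here because it is the simplest check that makes the bit change visible.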
Causes of Soft Errors
□ Alpha particles emitted by traces of uranium, thorium, or lead
impurities in packaging materials
□ Alpha particles emitted by decaying radioactive impurities in packaging and interconnect materials (plastic packages are the worst; also ceramic, HyperBGA, flip-chip PBGA)
□ High-energy (> 1 MeV) neutrons from cosmic radiation can induce soft errors in semiconductor devices via secondary ions produced by the neutron reaction with silicon nuclei
□ Less than 1% of the primary flux reaches ground level
□ Secondary radiation induced from the interaction of low-energy neutrons and boron
□ Boron-10 in BPSG (borophosphosilicate glass)
□ New process technologies use highly refined packaging and no boron
□ 2nd effect is the most important
□ Shielding is effective only for low-energy neutrons
□ High-energy neutrons can pass through 6 feet of concrete
LET Spectrum
□ Linear energy transfer
□ Measure of energy deposition
□ MeV per mg/cm^2, MeV/μm or pC/μm
Metrics
□ FIT: Failure in Time
□ No. of failures in 1 billion hours of operation
□ MTTF: Mean Time To Failure
□ 1000 FITs => MTTF of 114 years
□ 1 GByte of RAM @ 500 FIT/Mbit can expect an error every two weeks
□ ECC reduces failure rate by 2 orders of magnitude
□ Hypothetical Terabyte system would experience a soft error every few minutes
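The FIT and MTTF figures above follow from a single conversion: MTTF in hours = 10^9 / FIT. The sketch below reworks the slide's numbers under stated assumptions: a constant failure rate, no ECC, 1 GByte taken as 8 x 1024 Mbit, and 1 Terabyte as 1024 GByte. The function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours


def mttf_hours(fit: float) -> float:
    """MTTF in hours for a failure rate in FIT (failures per 10^9 hours)."""
    return 1e9 / fit


def memory_fit(capacity_mbit: float, fit_per_mbit: float) -> float:
    """Total FIT of a memory, assuming every Mbit fails independently."""
    return capacity_mbit * fit_per_mbit


# 1000 FIT expressed in years
years = mttf_hours(1000) / HOURS_PER_YEAR

# 1 GByte of RAM at 500 FIT/Mbit (assumption: 1 GByte = 8 * 1024 Mbit)
gbyte_mbit = 8 * 1024
gb_days = mttf_hours(memory_fit(gbyte_mbit, 500)) / 24

# Hypothetical 1 Terabyte system at the same rate (assumption: 1024 GByte)
tb_minutes = mttf_hours(memory_fit(gbyte_mbit * 1024, 500)) * 60

print(f"1000 FIT   -> MTTF of about {years:.0f} years")
print(f"1 GByte    -> one error about every {gb_days:.1f} days")
print(f"1 Terabyte -> one error about every {tb_minutes:.1f} minutes")
```

Under these assumptions the results are about 114 years, roughly 10 days, and roughly 14 minutes, which is in line with the rounded figures on the slide. Applying the quoted ECC benefit would multiply each interval by about 100.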
Trends
□ DRAM
□ System error rate of DRAMs is fairly constant
□ SRAM
□ Increasing exponentially
□ Logic
□ Increasing exponentially
Masking Effects
□ Logic Masking
□ Electrical Masking
□ Latching Window Masking
□ Microarchitectural Masking
□ Software Masking