slides - Arizona State University


CSE 591: Advances in Reliable Computing
Aviral Shrivastava
Compiler Microarchitecture Lab
Arizona State University
Therac-25 1985-1987





The Therac-25 was a machine for
administering radiation therapy,
generally for treating cancer patients.
An ‘arithmetic overflow’ sometimes
occurred during automatic safety
checks.
If, at this precise moment, the operator
was configuring the machine, the
safety checks would fail and the metal
target would not be moved into place.
The result was that beams 100 times
higher than the intended dose would
be fired into a patient, giving them
radiation poisoning.
This happened on 6 known occasions,
causing the later death of 4 patients.
Web page: aviral.lab.asu.edu
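The overflow described on this slide can be illustrated with a minimal sketch. It assumes a one-byte flag that is incremented on every pass and read as "run the safety check" whenever it is nonzero, which is how the accident analyses commonly describe the bug; the function names here are hypothetical and this is not the original machine code.

```python
def increment_8bit(flag):
    """Increment a flag stored in one byte; 255 + 1 wraps around to 0."""
    return (flag + 1) & 0xFF


def check_is_skipped(flag):
    """In this sketch, zero means 'nothing to verify', so the check is skipped."""
    return flag == 0


flag = 0
skipped_passes = []
for pass_number in range(1, 513):
    # Intended meaning of the increment: "make the flag nonzero, so run the check".
    flag = increment_8bit(flag)
    if check_is_skipped(flag):
        skipped_passes.append(pass_number)

print(skipped_passes)  # passes on which the wrapped flag hid the pending check
```

Setting the flag to a fixed nonzero value instead of incrementing it would remove the wrap-around in this toy model.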
Patriot Missile Bug - February 25th, 1991



During Operation Desert Storm,
a US Patriot missile battery failed
to intercept an incoming Scud missile,
which hit a US base where it killed 28
soldiers and injured a further 98.
The internal clock would ‘drift’
further and further from accurate
time, because its 0.1 second tick
cannot be represented exactly in
binary. It was
left running for 100 hours, by which
point the internal clock had drifted
out by 0.34 of a second.
So it calculated the target to be
over half a kilometer away from the
missile’s true location.
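The 0.34 second figure can be reproduced with a short calculation. This is a sketch under assumptions taken from the commonly cited analysis of the incident, not from the slide: the constant 1/10 is chopped to 23 binary fractional digits, the clock ticks ten times per second, and the incoming missile travels at roughly 1676 m/s. All names are hypothetical.

```python
from fractions import Fraction

FRACTION_BITS = 23         # assumed fractional bits kept for the constant 1/10
TICKS_PER_SECOND = 10      # assumed: the clock counted tenths of a second
SCUD_SPEED_M_PER_S = 1676  # assumed speed of the incoming missile


def truncated_tenth(bits=FRACTION_BITS):
    """Return 1/10 chopped (not rounded) to `bits` binary fractional digits."""
    scale = 2 ** bits
    return Fraction(int(Fraction(1, 10) * scale), scale)


def clock_drift_seconds(hours):
    """Accumulated timing error after `hours` of continuous operation."""
    ticks = hours * 3600 * TICKS_PER_SECOND
    error_per_tick = Fraction(1, 10) - truncated_tenth()
    return float(ticks * error_per_tick)


drift = clock_drift_seconds(100)
miss_distance_m = drift * SCUD_SPEED_M_PER_S
print(f"drift after 100 h: {drift:.4f} s, tracking error: {miss_distance_m:.0f} m")
```

The error per tick is tiny (under one ten-millionth of a second); it only matters because it accumulates over 3.6 million ticks.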
Skynet Brings Judgement Day (1997)

Cost: 6 billion dead, near-total destruction of
human civilization and animal ecosystems (fictional)

Disaster: Human operators attempt to shut off the
Skynet global computer network. Skynet responds
by firing U.S. nuclear missiles at Russia, initiating
global nuclear war on what became known as
Judgement Day (August 29, 1997).

Cause: Cyberdyne, the leading weapons
manufacturer, installed Skynet technology in all
military hardware including stealth bombers and
missile defense systems. The Skynet technology
formed a seamless network and effectively
removed humans from strategic
defense. Eventually Skynet became sentient, was
threatened when the humans tried to take it
offline, sought to survive, and retaliated with
nuclear war.
Cold War Missile Crisis September 26, 1983




Soviet military officer Stanislav Petrov received an alert that the US had
launched five Minuteman intercontinental ballistic missiles.
Petrov found it strange that the US would attack with just a handful of
warheads.
Considering that the early warning system was known to have flaws and had
been rushed into service, Petrov decided to rule the alert as a false alarm.
It was later determined that the early detection software had picked up the
sun’s reflection from the top of clouds and misinterpreted it as missile
launches.
Michigan Dept. of Corrections Grants Prisoners Early Release



In October 2005, The Register
reported on the early release of
23 prisoners due to a computer
programming glitch with the
Michigan Department of
Corrections.
The accidental early release dates
came around 39 to 161 days early
while an undisclosed number of
inmates were kept in jail past
their release dates.
State assembly representative
Rick Jones was concerned about
the matter, but noted that he
was “glad it’s not murderers.”
North American Blackout August 14, 2003



Affecting around 55 million people,
mainly in the Northeastern United
States, but also Ontario, Canada, this
was one of the biggest power
blackouts in history.
While the causes of this blackout
had nothing to do with a software
bug, it could have been averted were
it not for a software bug in the
control centre alarm system.
The centre alarm system had a ‘race
condition’, which caused the alarm
system to freeze and stop processing
alerts. The alarm system failed
‘silently’, and didn’t notify anybody.
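A race condition of the general kind mentioned here can be shown with a deterministic toy model. This is not the alarm system's actual bug: it is a sketch of a lost update, in which two hypothetical workers each perform a read-modify-write on a shared counter and the outcome depends on how their steps interleave.

```python
def increment(shared):
    """One read-modify-write, split in two so a schedule can interleave workers."""
    value = shared["count"]      # step 1: read the shared counter
    yield                        # another worker may run here
    shared["count"] = value + 1  # step 2: write back a possibly stale value


def run(schedule):
    """Run workers A and B one step at a time, in the order given by `schedule`."""
    shared = {"count": 0}
    workers = {"A": increment(shared), "B": increment(shared)}
    for name in schedule:
        next(workers[name], None)  # advance that worker by one step
    return shared["count"]


serial = run(["A", "A", "B", "B"])  # A finishes before B starts
racy = run(["A", "B", "A", "B"])    # both read before either writes

print(serial, racy)
```

In the interleaved schedule one increment is lost and nothing signals the problem, which mirrors the "silent" nature of the failure described above.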
Blue screen of death
Source of Errors

Specification errors
    Functionality in footnotes
Programming errors
    Incorrect implementation (Michigan prison error)
    Algorithm error (Cold war missile crisis)
    Floating point errors (Patriot missile)
    Assuming systems are mechanically and physically protected!
    Race conditions (Blackout)
Manufacturing errors
    Process variations
    Silicon failures
Runtime errors
    Negative Bias Temperature Instability (NBTI)
    Noise effects
    Voltage emergencies
    Environmental
        Soft errors
Fault Tolerant Computing is not new!

1940s: ENIAC, with 17.5K vacuum tubes and
1000s of other electrical elements, failed once every
2 days

1950s: Early ideas by von Neumann (multichannel,
with voting) and Moore-Shannon (“crummy” relays)
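The "multichannel, with voting" idea is commonly realized as a majority vote over three redundant copies. The sketch below is one illustrative form, a bitwise two-out-of-three voter; the function name and the example words are hypothetical, and it is not taken from von Neumann's original construction.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 vote: each output bit is the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


correct = 0b1011
faulty = 0b0011  # one channel has its top bit flipped

voted = majority_vote(correct, faulty, correct)
print(bin(voted))
```

The vote hides a fault in any single channel; it cannot help if two channels fail in the same bit position.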
Need is changing: Automation



Space age
Age of Automation
Proliferation of robots
Need is changing: Proximity

Near body computing


In-body computing


Google Glass
Accurate drug delivery
Robotic surgery
Need is changing: Technology



Transistors are smaller
Even low-energy particles can
cause soft errors.
Exponentially more low-energy
particles
Saving Galileo


1978 – Galileo commissioned for Jupiter exploration
1980 – Design and Architecture decided
    Use of AT 2901 for attitude control
1982 – Voyager reaches Jupiter
    Intermittent Resets
    Sulfur ions from Jupiter’s volcanic moon were being whipped up to high energy by the Jovian gravity.
After extensive testing of Galileo, chief engineer decided “not worth flying if soft error problem not solved”
Overheads
    5 years, 5 million dollars
    Sandia National Laboratories was subcontracted to custom-make radiation hardened AT 2901
Fault, Error and Failure
FAULT --activation (fault latency)--> ERROR --propagation (error latency)--> FAILURE

FAULT (HW defect, SW bug)
    a physical defect that occurs within hw or sw components
    Physical Universe: physical entities making up a system
ERROR (manifestation of a fault)
    a deviation from accuracy or correctness
    Informational Universe: units of information (e.g., data words)
FAILURE (malfunction)
    nonperformance of some action that is due or expected
    External Universe: the user of a system ultimately sees the effects
[Geffroy, 02] Jean-Claude Geffroy and Gilles Motet, “Design of Dependable Computing Systems”, Kluwer Academic Publishers, 2002, ISBN 1-4020-0437-0
Fumble, Safe, and Loss in Baseball
Fumble a ball: Fault
Safe: Error
Out: not an error
Loss of Game: Failure
Bit Flips, Transient Faults, Soft Errors, etc.
Bit Flips = Transient Faults = Soft Errors
FAULT --activation (fault latency)--> ERROR --propagation (error latency)--> FAILURE
Transient Faults: Logical Device (e.g., ALU)
    Circuit Masking
Soft Errors: Sequential Device (e.g., FF), Storage Device (e.g., Memory, Cache, Registers)
    MA/SW Masking
System Failures: System (e.g., Crash)
Masking Effects
Electrical Masking
Pulse attenuated by
electrical resistance in
the circuit
Pulse still strong enough
to be latched at output
Logical Masking
Value
unchanged at
the gate
Logical Masking
Error
propagated to
the output
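The two Logical Masking cases (value unchanged at the gate, error propagated to the output) can be sketched with a single AND gate. This is an illustrative model only: the helper names are hypothetical and a transient fault is represented as one flipped input bit.

```python
def and_gate(a, b):
    """Two-input AND gate on single-bit values."""
    return a & b


def is_logically_masked(gate, inputs, faulty_index):
    """True if flipping one input bit leaves the gate output unchanged."""
    fault_free_output = gate(*inputs)
    flipped = list(inputs)
    flipped[faulty_index] ^= 1  # model the transient as a single bit flip
    return gate(*flipped) == fault_free_output


# The other input is 0, so the AND output stays 0: the flip is masked.
masked_case = is_logically_masked(and_gate, (1, 0), faulty_index=0)
# The other input is 1, so the flip reaches the output: the error propagates.
propagated_case = is_logically_masked(and_gate, (1, 1), faulty_index=0)

print(masked_case, propagated_case)
```

Whether a flip is masked depends on the values of the other inputs at that moment, not on the gate alone.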
Temporal Masking
Transient Fault
Soft Error
A transient pulse at the latching window:
1) Before t_setup → masked (not latched)
2) After t_setup, Before t_hold → race condition
3) At the latching window → not masked (latched)
[Firouzi ROCS 2010]
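The three cases can be sketched as a comparison between the pulse interval and the latching window. This is a simplified model under stated assumptions, not the cited paper's method: the window is taken to run from `clock_edge - t_setup` to `clock_edge + t_hold`, a pulse covering the whole window is treated as latched, and a partial overlap is treated as the uncertain "race" case. All names are hypothetical.

```python
def classify_pulse(pulse_start, pulse_end, clock_edge, t_setup, t_hold):
    """Classify a transient pulse against the latching window of a flip-flop."""
    window_start = clock_edge - t_setup
    window_end = clock_edge + t_hold
    if pulse_end <= window_start or pulse_start >= window_end:
        return "masked"   # pulse never overlaps the window
    if pulse_start <= window_start and pulse_end >= window_end:
        return "latched"  # pulse is stable across the whole window
    return "race"         # partial overlap: outcome is uncertain


# Clock edge at t = 10 with 1 time unit of setup and hold: window is [9, 11].
early = classify_pulse(2, 5, clock_edge=10, t_setup=1, t_hold=1)
covering = classify_pulse(8, 12, clock_edge=10, t_setup=1, t_hold=1)
partial = classify_pulse(9.5, 10.5, clock_edge=10, t_setup=1, t_hold=1)

print(early, covering, partial)
```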
Why are soft errors increasing?

Smaller devices and hotter chips
    Q_crit decreases with size of device and temperature
    More susceptible to process variations and mis-connections
    Temperature exacerbates wear-out
More devices per processor
    More probability of a bug
    More chances of cross-talk
    Tri-core processor from AMD, 7 SPUs on Cell, 48-cores on SCC
More complicated designs
    More design bugs
Error models




Stuck-at faults
Bit-flip
Delay error
Fail-stop
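Two of these models are easy to express on a data word. The sketch below assumes a stuck-at fault forces one bit to a fixed value on every read, while a bit-flip inverts one bit once; the helper names are hypothetical, and delay errors and fail-stop behaviour are not modelled here.

```python
def stuck_at(value, bit, stuck_value):
    """Stuck-at fault: the given bit always reads as stuck_value (0 or 1)."""
    mask = 1 << bit
    return (value | mask) if stuck_value else (value & ~mask)


def bit_flip(value, bit):
    """Bit-flip: the given bit is inverted once."""
    return value ^ (1 << bit)


word = 0b1010
print(bin(stuck_at(word, 0, 1)), bin(stuck_at(word, 1, 0)), bin(bit_flip(word, 3)))
```

Applying `stuck_at` repeatedly gives the same result each time, whereas applying `bit_flip` twice restores the original word, which reflects the permanent versus transient nature of the two models.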
Reliability and Availability

Availability of a system at time t is the probability that it is
working correctly at time t.
    Measured in number of 9s
    A system with 99.999% availability has availability of 5 9s.

Reliability of a system is the probability that the system has
been operating correctly from time 0 until time t.

Mean-time-to-failure (MTTF)
    Only a mean, also consider the standard deviation
Mean time between failures (MTBF)
    Also considers the recovery time
Failures in time (FIT)
    # of failures over 1 billion hours of operation
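These metrics can be related with a few standard textbook formulas. The sketch below is not from the slides and rests on explicit assumptions: steady-state availability is MTTF / (MTTF + MTTR), MTBF is MTTF plus the mean time to repair (MTTR), and reliability follows an exponential law with a constant failure rate. The function names are hypothetical.

```python
import math


def availability(mttf_hours, mttr_hours):
    """Steady-state availability, assuming MTBF = MTTF + MTTR."""
    return mttf_hours / (mttf_hours + mttr_hours)


def nines(avail):
    """Number of 9s in an availability figure, e.g. 0.99999 gives about 5."""
    return -math.log10(1.0 - avail)


def fit_from_mttf(mttf_hours):
    """Failures in 1 billion device-hours, assuming a constant failure rate."""
    return 1e9 / mttf_hours


def reliability(t_hours, mttf_hours):
    """Probability of no failure in [0, t], assuming exponential lifetimes."""
    return math.exp(-t_hours / mttf_hours)


print(availability(999, 1), round(nines(0.99999), 3), fit_from_mttf(1e9))
```

Under these assumptions a system that runs 999 hours between one-hour repairs has three 9s of availability, and a part with an MTTF of one billion hours has a rate of 1 FIT.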
Welcome



To the course on designing reliable computing systems
Focus of the course will be on “soft errors”
Class webpage

http://www.public.asu.edu/~ashriva6/teaching/ARC/