CSE 520: Advanced Computer Architecture: Reliability

Aviral Shrivastava
Therac-25 1985-1987





The Therac-25 was a machine for administering radiation therapy, generally for treating cancer patients.
An arithmetic overflow sometimes occurred during the automatic safety checks.
If the operator was configuring the machine at that precise moment, the safety checks were bypassed and the metal target was not moved into place.
The result was a beam roughly 100 times stronger than the intended dose being fired into the patient, causing radiation poisoning.
This happened on 6 known occasions and later caused the deaths of 4 patients.
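
The overflow can be illustrated with a small sketch. It assumes, as the failure is commonly described, that a safety flag was kept in a single byte and incremented on every pass, with the check running only when the flag was nonzero. The names `next_flag`, `check_runs` and `skipped_passes` are hypothetical; this is not the Therac-25 code.

```python
# Simplified illustration only; not the Therac-25 source code.
COUNTER_BITS = 8  # assumption: the flag is stored in a single byte


def next_flag(flag):
    """Increment the flag the way an 8-bit register would, wrapping at 256."""
    return (flag + 1) % (1 << COUNTER_BITS)


def check_runs(flag):
    """In this sketch the safety check only runs when the flag is nonzero."""
    return flag != 0


def skipped_passes(total_passes):
    """Return the pass numbers on which the safety check is silently skipped."""
    flag = 0
    skipped = []
    for n in range(1, total_passes + 1):
        flag = next_flag(flag)
        if not check_runs(flag):
            skipped.append(n)
    return skipped
```

Under these assumptions, `skipped_passes(600)` returns `[256, 512]`: the check is skipped on every 256th pass, which is why the failure appeared only occasionally. Setting the flag to a fixed nonzero value instead of incrementing it would avoid the wrap-around.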
Web page: aviral.lab.asu.edu
Patriot Missile Bug - February 25th, 1991



During Operation Desert Storm, a US Patriot missile battery failed to intercept an incoming Iraqi Scud missile, which struck a US Army barracks, killing 28 soldiers and injuring a further 98.
The system's internal clock counted time in tenths of a second, and converting that count to seconds introduced a small error that grew the longer the system ran. The battery had been left running for 100 hours, by which point the computed time was off by 0.34 of a second.
As a result, the system looked for the incoming missile over half a kilometer away from the missile’s true location.
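
The 0.34 second figure can be reproduced with a short calculation. This sketch assumes, following the commonly cited analysis of the failure, that the clock counted tenths of a second and that the constant 0.1 was stored in fixed point with 23 bits after the binary point. The Scud speed of 1676 m/s is also an assumed value, not taken from the slides, and all function names are hypothetical.

```python
from fractions import Fraction

FRACTION_BITS = 23         # assumption: bits kept after the binary point
TICKS_PER_SECOND = 10      # the clock counts tenths of a second
SCUD_SPEED_M_PER_S = 1676  # assumption: approximate speed of the incoming missile


def stored_tenth(bits=FRACTION_BITS):
    """0.1 chopped (not rounded) to the given number of fractional bits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_error_seconds(hours_running, bits=FRACTION_BITS):
    """Accumulated timing error after the system has run for the given hours."""
    ticks = hours_running * 3600 * TICKS_PER_SECOND
    per_tick_error = Fraction(1, 10) - stored_tenth(bits)
    return float(ticks * per_tick_error)


def tracking_error_meters(hours_running):
    """Distance the target travels during the accumulated timing error."""
    return clock_error_seconds(hours_running) * SCUD_SPEED_M_PER_S
```

With these assumptions, `clock_error_seconds(100)` is about 0.34 s and `tracking_error_meters(100)` is roughly 575 m, consistent with the figures above. The per-tick error is under one ten-millionth of a second; it matters only because it is multiplied by 3.6 million ticks.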
Skynet Brings Judgement Day (1997)

Cost: 6 billion dead, near-total destruction of
human civilization and animal ecosystems (fictional)

Disaster: Human operators attempt to shut off the
Skynet global computer network. Skynet responds
by firing U.S. nuclear missiles at Russia, initiating
global nuclear war on what became known as
Judgement Day (August 29, 1997).

Cause: Cyberdyne, the leading weapons
manufacturer, installed Skynet technology in all
military hardware including stealth bombers and
missile defense systems. The Skynet technology
formed a seamless network and effectively
removed humans from strategic
defense. Eventually Skynet became sentient, was
threatened when the humans tried to take it
offline, sought to survive, and retaliated with
nuclear war.
Cold War Missile Crisis September 26, 1983




Soviet military officer Stanislav Petrov received an alert that the US had launched five Minuteman intercontinental ballistic missiles.
Petrov found it strange that the US would attack with just a handful of warheads.
Considering that the early warning system was known to have flaws and had been rushed into service, Petrov decided to treat the alert as a false alarm.
It was later determined that the early detection software had picked up the sun’s reflection from the tops of clouds and misinterpreted it as missile launches.
Michigan Dept. of Corrections Grants Prisoners Early Release



In October 2005, The Register reported the early release of 23 prisoners due to a computer programming glitch at the Michigan Department of Corrections.
The prisoners were released between 39 and 161 days early, while an undisclosed number of inmates were kept in jail past their release dates.
State assembly representative Rick Jones was concerned about the matter, but noted that he was “glad it’s not murderers.”
North American Blackout August 14, 2003



Affecting around 55 million people, mainly in the northeastern United States but also in Ontario, Canada, this was one of the biggest power blackouts in history.
Although the blackout itself was not caused by a software bug, it could have been averted were it not for a software bug in the control centre alarm system.
The alarm system had a ‘race condition’ that caused it to freeze and stop processing alerts. It failed ‘silently’, notifying nobody.
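
A race condition arises when two activities update shared state without coordination and the result depends on their timing. The sketch below is a generic illustration, not the alarm system's code: `lost_update` replays one unlucky interleaving step by step so the outcome is deterministic, and `locked_updates` shows the same kind of update protected by a lock. All names are hypothetical.

```python
import threading


def lost_update():
    """Replay, step by step, an unlucky interleaving of two unsynchronized writers."""
    pending_alarms = 0
    seen_by_a = pending_alarms      # writer A reads 0
    seen_by_b = pending_alarms      # writer B reads 0 before A writes back
    pending_alarms = seen_by_a + 1  # A stores 1
    pending_alarms = seen_by_b + 1  # B also stores 1, overwriting A's update
    return pending_alarms


def locked_updates(writers=4, per_writer=1000):
    """Run real threads that increment a shared counter under a lock."""
    count = 0
    lock = threading.Lock()

    def work():
        nonlocal count
        for _ in range(per_writer):
            with lock:  # only one thread at a time may read-modify-write
                count += 1

    threads = [threading.Thread(target=work) for _ in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count
```

Here `lost_update()` returns 1 even though two updates were made, while `locked_updates()` returns the full 4000. Bugs of this kind are hard to find in testing because the bad interleaving may occur only rarely.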
Blue screen of death
Source of Errors

Specification errors
  Functionality in footnotes
  Assuming systems are mechanically and physically protected!
Programming errors
  Incorrect implementation (Michigan prison error)
  Algorithm error (Cold war missile crisis)
  Floating point errors (Patriot missile)
  Race conditions (Blackout)
Manufacturing errors
  Process variations
  Silicon failures
Runtime errors
  Negative Bias Temperature Instability (NBTI)
  Noise effects
  Voltage emergencies
Environmental
  Soft errors
Fault Tolerant Computing is not new!

1940s: ENIAC, with 17.5K vacuum tubes and
1000s of other electrical elements, failed once every
2 days

1950s: Early ideas by von Neumann (multichannel,
with voting) and Moore-Shannon (“crummy” relays)
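
The multichannel-with-voting idea can be sketched as running the same computation on several redundant channels and accepting the answer that a strict majority agrees on. This is a minimal illustration in the spirit of that scheme, not von Neumann's construction; `majority_vote` and `run_redundant` are hypothetical names.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of channels, else raise."""
    if not outputs:
        raise ValueError("no channel outputs to vote on")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise ValueError("no majority among channels")
    return value


def run_redundant(channels, x):
    """Feed the same input to every channel and vote on the results."""
    return majority_vote([channel(x) for channel in channels])
```

For example, with two correct channels computing `x * x` and one faulty channel computing `x * x + 1`, `run_redundant(channels, 7)` still yields 49. The scheme masks a single faulty channel out of three, at the cost of tripling the hardware and adding a voter that must itself be reliable.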
Need is changing: Automation



Space age
Age of Automation
Proliferation of robots
Need is changing: Proximity

Near body computing
  Google Glass
In-body computing
  Accurate drug delivery
  Robotic surgery
Need is changing: Technology



Transistors are smaller
Even low-energy particles can cause soft errors.
Low-energy particles are exponentially more abundant than high-energy ones.
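
A soft error shows up as a stored bit changing value with no permanent damage to the hardware. The sketch below models one as a single bit flip in a word and shows how one stored parity bit can detect it. Parity is used here only as a simple, well-known detection scheme chosen for illustration, and the function names are hypothetical.

```python
def parity(word):
    """Even-parity bit: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def flip_bit(word, position):
    """Model a soft error by inverting one bit of the word."""
    return word ^ (1 << position)


def is_corrupted(word, stored_parity):
    """Compare the word's current parity with the parity saved at write time."""
    return parity(word) != stored_parity
```

If `stored = parity(0b10110010)` is saved when the word is written, then `is_corrupted(flip_bit(0b10110010, 3), stored)` is true. Two flips in the same word cancel out and go undetected, which is one reason stronger codes are used where multi-bit upsets are likely.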
Welcome



To the course on designing reliable computing systems
The focus of the course will be on “soft errors”
Class webpage

http://www.public.asu.edu/~ashriva6/teaching/ARC/