FAILURES IN COMPLEX SYSTEMS: case studies, causes, and potential remedies from artificial intelligence research
Abstract: Computers are the pervasive technology of our time, and it is clear that we allow them to be critically tied to more and more of our routine daily activities. This talk will survey technological mishaps of the past two decades, most of them involving the use of computers. The consequences and causes of these “accidents” will be discussed, as well as how catastrophes may be avoided in the future through lessons and practices based on the speaker’s artificial intelligence research.
1. Overview
2. Background: FAST Report
3. Normal Accidents
4. Chaos in Computer Networks
5. Case Studies -- Catastrophes
6. The Human Window and the Potential Role of AI
THE FAST STUDY
(Forecasting and Assessment in Science and Technology)
(1982, CEC, Brussels, Belgium)
1. Three Mile Island (TMI)
2. Air Traffic Control (ATC)
3. North American Air Defense System (NORAD)
4. Royal Dutch Steel
“The mismatch between a technological system and the humans who operate it can be at either a “syntactic” or a “semantic” level. A knowledge representation can be syntactically correctable, but if its semantic structure is wrong, then no amount of human technology can correct it. The nature of the underlying causes of mismatch between man and machine, whether essentially syntactic or semantic …”
It is no longer just laymen who cannot comprehend computers and the advanced information systems under which they operate.
“Mismatch Between Machine Representations and Human Concepts: dangers and remedies,” Report to CEC, Subprogram FAST (D. Kopec and D. Michie, 1982)
Three Mile Island (1980)
• In conclusion, while the major factor that turned this incident into a serious accident was inappropriate operator action, many factors contributed to the action of the operators, such as:
– lack of clarity in their operating procedures, deficiencies in their training,
– failure of organizations to learn the proper lessons from previous incidents, and
– deficiencies in the design of the control room.
• (John Kemeny, Report of the President’s Commission on the Accident at Three Mile Island, p. 11, 1980)
“The control panel is huge, with hundreds of alarms, and there are some key indicators placed in locations where the operators cannot see them. There is little evidence of the impact of modern information technology within the control room.” (John Kemeny, Chairman of the President’s Commission on the Accident at Three Mile Island, 1980)
AIR TRAFFIC CONTROL
Concerns Regarding ATC 2000: /Cont.
The F.A.A. plans call for increased automation of the controller function, with the human role changing from controller of every aircraft to A.T.C. manager who handles EXCEPTIONS while the computer takes care of routine A.T.C. commands.
(Andres Zellweger, Advanced Concepts Staff, F.A.A., 1977, New Scientist)
AIR TRAFFIC CONTROL /Cont.
• Will the controller who has to intervene in an exceptional case be properly placed to do so?
• Over long periods of time, the sheer lack of verbal communication between A.T.C. and individual aircraft may lead to mistakes in instructions, or
• A generally increasing reluctance of humans to intervene in the system at all.
“Normal Accidents”
Charles Perrow, 1984
…The odd term “normal accident” is meant to signal that, given the system characteristics, multiple and unexpected interactions of failures are inevitable. (p. 5)
Normal Accidents /Cont.
In complex industrial, space, and military systems, the normal accident generally (not always) means that the interactions are not only unexpected, but are incomprehensible for some critical period of time. (p. 9)
– 60-80% attributed to operator error
– mysterious interactions
– bizarre errors
Normal Accidents typically entail:
Multiple Failures
– in design, equipment, procedures, operators, and environment
Tightly Coupled Events
– events which are dependent upon each other, one transcending the other, often following quickly in time (p. 8)
“Chaos in Computer Networks”
The whole is not the sum of its parts
New discipline: “Computational Ecology”
Concurrency control problems
Data staleness (see the sketch below)
Insufficient models
– Real-world models impossible, e.g. SDI
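The following is a minimal, hypothetical C sketch (not from the talk) of the data-staleness problem named above: every agent chooses between two servers using load figures that are a few steps old, and instead of balancing, the whole population swings back and forth between the servers. All names and parameters (AGENTS, DELAY, the threshold rule) are illustrative.

#include <stdio.h>

#define AGENTS 100   /* number of agents choosing between servers A and B */
#define STEPS  20    /* simulation length */
#define DELAY  3     /* how many steps old the published load figures are */

int main(void)
{
    int load_a_history[STEPS + 1];   /* load on server A after each step */

    /* Start roughly balanced, and seed the stale history. */
    for (int t = 0; t <= DELAY; t++)
        load_a_history[t] = AGENTS / 2;

    for (int t = DELAY; t < STEPS; t++) {
        /* Every agent sees the load that was published DELAY steps ago. */
        int stale_load_a = load_a_history[t - DELAY];

        /* Each agent independently picks the server that *looked* less
         * loaded; because they all act on the same stale figure, they
         * all move together and the load oscillates instead of settling. */
        int load_a = (stale_load_a > AGENTS / 2) ? 0 : AGENTS;

        load_a_history[t + 1] = load_a;
        printf("step %2d: server A load = %3d (stale reading: %3d)\n",
               t, load_a, stale_load_a);
    }
    return 0;
}

With fresh load figures (DELAY of 0) the same rule settles down; the delay alone is enough to turn a sensible local decision rule into network-wide oscillation.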
NORAD
On November 9, 1979, false indications of a mass raid were caused by inadvertent introduction of simulated data in NCS.
On June 3, 1980, false attack indications were caused by a faulty component in a communications processor computer.
On June 6, 1980, false attack indications were again caused by the faulty component during operational testing.
On October 3, 1979, a Submarine-Launched Ballistic Missile radar (Mt. Hebo) picked up a low-orbit rocket body that was close to decay and generated a false launch and impact report.
NORAD CONCERNS
– The key precaution built into the NORAD alert system is an 8-stage process whereby a critical human decision must be made at each stage. This prevents machinery alone from ordering a nuclear strike. Defense officials at Cheyenne Mountain gave the following assurance:
• “If there are eight different stages, then we were only at step one …”
However, the real concern of U.S. officials is two circumstances which could directly result from a false alarm:
– 1) the “Shrinking Time Factor”, and
– 2) the possibility of “Escalating Responses”
CASE STUDIES
• Computers have become small, yet they control nearly every aspect of life, e.g.
– cars, microwaves, traffic lights, banking, buildings, telephone networks, ATC, etc.
• Programs have become huge and inscrutable
– Millions of lines of code
– Impossible to test fully
• Software is the “Achilles’ heel” of the computer revolution
(1) AT&T Breakdown, Monday, Jan. 22, 1990
• An unknown combination of events (calls) caused malfunctions across 114 switching centers nationwide
• 65 million calls went unconnected nationwide
• Millions of dollars had been spent on “reliability claims” of a “virtually foolproof system”
• CAUSE: a unique combination of events revealing a software error (see the sketch below)
• The system had run flawlessly for months
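The widely reported trigger for the 1990 collapse was a misplaced break statement in the switches' recovery code, which only mattered when two messages from a recovering neighbor arrived within milliseconds of each other. The fragment below is a hypothetical reconstruction of that class of bug, not AT&T's actual code; the message names and the timing flag are illustrative.

#include <stdio.h>

enum msg { MSG_RECOVERY, MSG_CALL };

static int pending;          /* work queued while a neighbour recovers   */
static int state_consistent; /* flag the rest of the system relies on    */

void handle(enum msg m, int second_msg_arrived_quickly)
{
    switch (m) {
    case MSG_RECOVERY:
        if (second_msg_arrived_quickly) {
            pending++;
            break;           /* BUG: exits the enclosing switch, not the if; */
        }                    /* the bookkeeping below is silently skipped    */
        state_consistent = 1;
        break;
    case MSG_CALL:
        /* normal call processing */
        break;
    }
}

int main(void)
{
    handle(MSG_RECOVERY, 0);
    printf("slow path: consistent = %d\n", state_consistent);  /* prints 1 */

    state_consistent = 0;
    handle(MSG_RECOVERY, 1);   /* the rare back-to-back message case */
    printf("fast path: consistent = %d\n", state_consistent);  /* prints 0 */
    return 0;
}

The code is correct on every path exercised in months of testing; only the rare fast path leaves the switch in an inconsistent state, which is exactly the "unique combination of events" pattern the slide describes.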
(2) Bank of New York (1985)
• Forced to borrow $24 billion to cover assets temporarily, due to a software error.
• Result: $5 million lost in interest.
(3) ATMs: Numerous Cases of “unlimited” or “doubled” Cash
• Banks and customers have become dependent upon them
• Per transaction, they have nonetheless proven to be quite reliable
(4) Canadian Cancer Therapy Machine (Therac-25, 1986)
Designed by Atomic Energy of Canada, Ltd. (AECL)
A software-controlled radiation therapy machine used to treat people with cancer
Between 1985 and 1987, Therac-25 machines in four medical centers gave massive overdoses of radiation to six patients.
In some instances, operators repeated overdoses because the machine display indicated no dose had been given.
Some patients received between 13,000 and 25,000 rads when 100-200 were needed.
Result: severe injuries and deaths
Therac-25: Causes of Malfunction
Lapses in Good Safety Design
– Hardware safety interlock devices present in the earlier Therac-6 and Therac-20 were missing
Insufficient Testing
– Software replacing hardware safety was not adequately tested
– Error messages were unexplained (e.g. Malfunction-54 or H-tilt)
– One-key resumption was possible despite an error message
Therac-25 /Cont.
Bugs in the Software controlling the machines
– The set-up test used a one-byte flag variable whose value was incremented on each run
– When the routine was called for the 256th time, the flag overflowed and a huge electron beam was erroneously turned on (see the sketch after this slide)
Inadequate system of reporting and investigating the accident
– During real-time operation, only certain parts of operator input/editing were recorded by the software
– It required careful reconstruction by a physicist at one of the cancer centers to determine what had gone wrong.
– (Leveson, Turner, and Jacky, 1993)
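A minimal sketch, assuming the mechanism described above (it is not AECL's code): a one-byte counter that is incremented instead of being set to a fixed value wraps back to zero on every 256th call, so the check that a non-zero value was supposed to trigger is silently skipped. Variable names and the loop are illustrative.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* One-byte flag: a non-zero value is meant to mean
     * "set-up not yet verified -- run the safety check". */
    uint8_t setup_unverified = 0;

    for (int call = 1; call <= 260; call++) {
        /* BUG: incrementing instead of assigning a constant means the
         * flag wraps from 255 back to 0 on every 256th call.          */
        setup_unverified++;

        if (setup_unverified == 0)
            printf("call %3d: flag wrapped to 0 -> safety check skipped\n",
                   call);
        /* otherwise the safety check runs as intended */
    }
    return 0;
}

The failure is invisible 255 times out of 256, which is why it survived testing and only surfaced in clinical use.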
Lessons to be Learned from Catastrophes
• Bhopal Chemical Plant (1984)
• The Challenger Accident (1986)
• Chernobyl (1986)
• AIR DISASTERS, including:
– The Case of KAL FL/007 (1983) (269 dead): misprogrammed INS
– IRAN AIR FL 655 (290 dead): mishap at the human interface; the $1 billion AEGIS system
EPITAPH
Inscrutable Software
Who is writing the software?
Should they be certified?
Who is Responsible?
Well-designed Human Interfaces Needed
Human Factors Critical
Redundancy and Self-Checking
Testing
HOW CAN Artificial Intelligence HELP?
The Notion of a “HUMAN WINDOW”
Sources: M. Clarke (1982); D. Michie (1982)
(Diagram: a scale running from MEMORY at one end to Calculation at the other, with the points L (Lighthill) and R (Reti) marked.)
The Human Window: FACTORS
1. Correctness
2. Grain Size
3. Comprehensibility
4. Executability
RISKY SYSTEMS
Deemed Essential
Dangers are covered up or denied
Will not be abandoned or made safe
Have numerous potential causes of failure, including:
– Poor Design
– Poorly trained Operators
– Faulty Equipment or Uncooperative Environment
(Perrow, 1984)
ARE WE PUTTING TOO MUCH FAITH IN TECHNOLOGY?
Technology: Optimistic and Pessimistic Organization Views (Zuboff, 1988; Fisher, 1996)
– Pessimistic Scenario
Automation of existing processes
Tool for monitoring and surveillance
– Optimistic Scenario
Information and social exchange
Computer technology increases the information content of tasks
Match of Organization Structure and Technology is Necessary
– Formal, strict structures cannot accommodate complex technology
– Structure to enable coordination, control, and communication
(Perrow, 1967; Zuboff, 1988)
Feedback:
– Coaching vs. Monitoring
Intelligent Tutoring Systems
SmartBooks
– Interactive
– Theory of Education
Expert Module
– Model Tracing
Student Modeller
– Assessment and Analysis
Tutoring Module
– Tutoring Strategy; selection
Curriculum Module
– Rich with Multimedia
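As a closing illustration, here is a minimal C sketch (not from the talk; every name is hypothetical) of how the four modules listed above might divide the work in an intelligent tutoring system: the Expert Module supplies the next correct step, the Student Modeller tracks mastery, the Tutoring Module chooses coaching versus monitoring, and the Curriculum Module picks the next (multimedia) unit.

#include <stdio.h>

/* Hypothetical sketch of the four ITS modules named above. */

typedef struct {
    const char *topic;
    int         mastery;   /* 0-100: the Student Modeller's estimate */
} StudentModel;

/* Expert Module: the "correct" next step for a topic (model tracing). */
static const char *expert_next_step(const char *topic) {
    (void)topic;                            /* placeholder domain knowledge */
    return "apply the quadratic formula";
}

/* Student Modeller: assessment and analysis of the learner's answer. */
static void student_update(StudentModel *s, int answer_correct) {
    s->mastery += answer_correct ? 10 : -5;
}

/* Tutoring Module: choose a strategy (coach vs. monitor) from the model. */
static const char *tutor_strategy(const StudentModel *s) {
    return (s->mastery < 50) ? "coach: show a worked example"
                             : "monitor: let the student continue";
}

/* Curriculum Module: select the next (multimedia) unit. */
static const char *curriculum_next(const StudentModel *s) {
    return (s->mastery < 50) ? "video: factoring basics"
                             : "exercise set: mixed problems";
}

int main(void)
{
    StudentModel s = { "quadratic equations", 40 };

    student_update(&s, 0);                  /* learner answered incorrectly */
    printf("expert step : %s\n", expert_next_step(s.topic));
    printf("tutoring    : %s\n", tutor_strategy(&s));
    printf("curriculum  : %s\n", curriculum_next(&s));
    return 0;
}

The point of the decomposition is that each module can be improved independently: a richer student model changes the coaching decisions without touching the domain knowledge, which is the kind of comprehensible, human-window-sized structure the talk argues for.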