FAILURES IN COMPLEX SYSTEMS:
case studies, causes, and potential remedies
from artificial intelligence research

Abstract: Computers are the pervasive technology of our time, and we allow them to be critically tied to more and more of our routine daily activities. This talk will survey technological mishaps of the past two decades, most of them involving the use of computers. The consequences and causes of these “accidents” will be discussed, as well as how catastrophes may be avoided in the future through lessons and practices based on the speaker’s artificial intelligence research.
FAILURES IN COMPLEX SYSTEMS:
case studies, causes, and potential remedies
from artificial intelligence research

1. Overview
2. Background: FAST Report
3. Normal Accidents
4. Chaos in Computer Networks
5. Case Studies -- Catastrophes
6. The Human Window and the Potential Role of AI
THE FAST STUDY
(Forecasting and Assessment in Science and Technology)
(1982, CEC, Brussels, Belgium)
1. Three Mile Island (TMI)
2. Air Traffic Control (ATC)
3. North American Air Defense System (NORAD)
4. Royal Dutch Steel
“The mismatch between a technological system and the humans who operate it can be at either a ‘syntactic’ or ‘semantic’ level. A knowledge representation can be syntactically correctable, but if its semantic structure is wrong, then no amount of human technology can correct it. The nature of the underlying causes of mismatch between man and machine, whether essentially syntactic or semantic …”
• It is no longer just laymen who cannot comprehend computers and the advanced information systems under which they operate.

“Mismatch Between Machine Representations and Human Concepts: dangers and remedies,” Report to CEC, Subprogram FAST (D. Kopec and D. Michie, 1982)
Three Mile Island (1980)
• In conclusion, while the major factor that turned this incident into a serious accident was inappropriate operator action, many factors contributed to the action of the operators, such as:
– lack of clarity in their operating procedures,
– deficiencies in their training,
– failure of organizations to learn the proper lessons from previous incidents, and
– deficiencies in the design of the control room.
• (John Kemeny, Report of the President’s Commission on the Accident at Three Mile Island, p. 11, 1980)
“The control panel is huge, with hundreds of alarms, and there are some key indicators placed in locations where the operators cannot see them. There is little evidence of the impact of modern information technology within the control room.” (John Kemeny, Chairman of the President’s Commission on the Accident at Three Mile Island, 1980)
AIR TRAFFIC CONTROL
Concerns Regarding ATC 2000: /Cont..
The F.A.A. plans call for increased automation of controller functions, with the human role changing from controller of every aircraft to A.T.C. manager who handles EXCEPTIONS while the computer takes care of routine A.T.C. commands.
(Andres Zellweger, Advanced Concepts Staff, F.A.A., 1977, New Scientist)
AIR TRAFFIC CONTROL /Cont..
• Will the controller who has to intervene in an exceptional case be properly placed to do so?
• Over long periods of time, the sheer lack of verbal communication between A.T.C. and individual aircraft may lead to mistakes in instructions, or
• A generally increasing reluctance of humans to intervene in the system at all.
“Normal Accidents”
Charles Perrow, 1984
… The odd term “normal accident” is meant to signal that, given the system characteristics, multiple and unexpected interactions of failures are inevitable. (p. 5)
Normal Accidents /Cont.
• In complex industrial, space, and military systems, the normal accident generally (not always) means that the interactions are not only unexpected, but are incomprehensible for some critical period of time. (p. 9)
– 60-80% attributed to operator error
– mysterious interactions
– bizarre errors
Normal Accidents typically entail:
• Multiple Failures
– in design, equipment, procedures, operators, and environment
• Tightly Coupled Events
– events which are dependent upon each other, one transcending the other, often following quickly in time (p. 8)
“Chaos in Computer Networks”
• The whole is not the sum of its parts
• New Discipline: “Computational Ecology”
• Concurrency control problems
• Data Staleness
• Insufficient Models
– Real World Models Impossible, e.g., SDI
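A minimal Python sketch (illustrative only, not from the talk) of the concurrency-control and data-staleness problems listed above: two unsynchronized updates to shared state interleave so that one of them reads a stale value and silently overwrites the other's work.

```python
import threading
import time

balance = 100   # shared state, e.g. a replicated account balance

def unsafe_deposit(amount):
    """Read-modify-write with no lock: a stale read overwrites a newer value."""
    global balance
    snapshot = balance             # read a (possibly already stale) copy
    time.sleep(0.01)               # another update can slip in here
    balance = snapshot + amount    # write back, discarding that other update

threads = [threading.Thread(target=unsafe_deposit, args=(50,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Serialized updates would give 200; this typically prints 150 (one deposit lost).
print("final balance:", balance)
```

Wrapping the read-modify-write in a lock, or otherwise serializing the updates, removes the lost update; the point is that the whole system misbehaves even though each part looks correct in isolation.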
NORAD
• On November 9, 1979, false indications of a mass raid were caused by inadvertent introduction of simulated data in NCS.
• On June 3, 1980, false attack indications were caused by a faulty component in a communications processor computer.
• On June 6, 1980, false attack indications were again caused by the faulty component during operational testing.
• On October 3, 1979, a Submarine-Launched Ballistic Missile radar (Mt. Hebo) picked up a low-orbit rocket body that was close to decay and generated a false launch and impact report.
NORAD CONCERNS
– The key precaution built into the NORAD alert system is an 8-stage process whereby a critical human decision must be made at each stage. This prevents machinery alone from ordering a nuclear strike. Defense officials at Cheyenne Mountain gave the following assurance:
• “If there are eight different stages, then we were only at step one …”
However, the real concerns of U.S. officials are two circumstances which could directly result from a false alarm:
– 1) the “Shrinking Time Factor” and
– 2) the possibility of “Escalating Responses”
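A purely illustrative Python sketch of such a gate (the stage names below are hypothetical, not taken from NORAD procedures): an alert escalates only if a human explicitly confirms it at every stage, so a machine-generated false alarm cannot propagate on its own.

```python
# Hypothetical stage names for illustration; not taken from NORAD procedures.
STAGES = [
    "initial sensor assessment",
    "cross-check against independent sensors",
    "threat assessment conference",
    # ... further stages up to the eighth
    "final launch decision",
]

def escalate_alert(alert, confirm):
    """Advance an alert only while a human confirms it at each stage.

    `confirm(stage, alert)` stands in for a human decision (True/False)."""
    for stage in STAGES:
        if not confirm(stage, alert):
            print(f"'{alert}' dismissed at stage: {stage}")
            return False
    print(f"'{alert}' confirmed through every stage")
    return True

# A faulty component raises a false alarm; the first human check rejects it.
escalate_alert("mass raid indication", lambda stage, alert: False)
```

This is the sense of the assurance quoted above: the false indications never got past the earliest human checks.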
CASE STUDIES
• Computers are small, but have come to control nearly every aspect of life, e.g.
– cars, microwaves, traffic lights, banking, buildings, telephone networks, ATC, etc.
• Programs have become huge and inscrutable
– Millions of lines of code
– Impossible to test fully
• Software is the “Achilles Heel” of the computer revolution
(1) AT&T Breakdown, Monday, Jan. 22, 1990
• An unknown combination of events (calls) caused malfunctions across 114 switching centers nationwide
• 65 million calls went unconnected
• Millions of dollars had been spent on “reliability claims” of a “virtually foolproof system”
• CAUSE: a unique combination of events revealing a software error
• The software had run flawlessly for months
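Published accounts attributed the collapse to a latent fault in switch-recovery code that surfaced only when status messages arrived in an unusually rapid sequence. The Python sketch below is not AT&T's code; it is a hypothetical illustration of how such a defect can stay dormant for months of normal traffic and then be triggered by a rare timing pattern.

```python
import random

class Switch:
    """Hypothetical switching node (not AT&T's 4ESS code): it mishandles a
    status message that arrives while it is still processing the previous one."""

    def __init__(self, name):
        self.name = name
        self.busy_until = 0.0
        self.crashed = False

    def handle_status(self, t):
        if self.crashed:
            return
        if t < self.busy_until:
            # Latent defect: the "still busy" branch corrupts state instead of
            # queueing the message, so the node takes itself out of service.
            self.crashed = True
            print(f"{self.name}: failed at t={t:.2f}s (two messages within 10 ms)")
        else:
            self.busy_until = t + 0.010   # normal case: ~10 ms of processing

random.seed(1)
node, t = Switch("node-1"), 0.0
while not node.crashed:
    t += random.expovariate(5.0)          # status messages at random intervals
    node.handle_status(t)
print(f"ran cleanly for {t:.1f}s of simulated traffic before the rare timing hit")
```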
(2) Bank of New York (1989)
• Forced to borrow $24 billion to cover assets temporarily due to a software error.
• Result: $5 million lost in interest.
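A rough sanity check of that figure (my arithmetic, with an assumed overnight borrowing rate of about 7.5% per year, which is not stated in the talk):

```python
principal = 24e9          # overnight borrowing to cover the shortfall
annual_rate = 0.075       # assumed overnight rate of roughly 7.5% per year
one_day_interest = principal * annual_rate / 365
print(f"interest for one day: ~${one_day_interest / 1e6:.1f} million")  # about $4.9 million
```

which is consistent with the roughly $5 million cited above for a one-day borrowing.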
(3) ATMs: Numerous Cases of “unlimited” or “doubled” Cash
• Banks and customers have become dependent upon them
• Per transaction, they have nonetheless proven to be quite reliable
(4) Canadian Cancer Therapy Machine (Therac-25, 1986)
Designed by Atomic Energy of Canada, Ltd. (AECL)
• A software-controlled radiation therapy machine used to treat people with cancer
• Between 1985 and 1987, Therac-25 machines in four medical centers gave massive overdoses of radiation to six patients.
• In some instances, operators repeated overdoses because the machine display indicated that no dose had been given.
• Some patients received between 13,000 and 25,000 rads when 100-200 were needed.
• Result: severe injuries and deaths.
Therac-25: Causes of Malfunction
• Lapses in Good Safety Design
– Hardware safety interlock devices present in the earlier Therac-6 and Therac-20 were missing
• Insufficient Testing
– Software replacing hardware safety was not adequately tested
– Error messages were unexplained (e.g., Malfunction-54 or H-tilt)
– One-key resumption was possible despite the error message
Therac-25 /Cont..
• Bugs in the software controlling the machines
– The set-up test used a one-byte flag variable whose value was incremented on each run
– When the routine was called for the 256th time, the flag overflowed to zero and the huge electron beam was erroneously turned on (see the sketch below)
• Inadequate system of reporting and investigating the accidents
– During real-time operation, only certain parts of operator input/editing were recorded by the software
– It required careful reconstruction by a physicist at one of the cancer centers to determine what had gone wrong.
– (Leveson, Turner, and Jacky, 1993)
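A minimal Python sketch of that failure mode (illustrative only; the actual Therac-25 software was low-level assembly code, and the names below are invented): incrementing a one-byte flag instead of setting it to a fixed non-zero value means the safety check is silently skipped on every 256th call.

```python
class SetupCheck:
    """Illustrative only (invented names, not the Therac-25 code):
    a one-byte flag that is incremented instead of set to a constant."""

    def __init__(self):
        self.flag = 0

    def run(self):
        self.flag = (self.flag + 1) % 256   # one-byte counter: 255 + 1 wraps to 0
        if self.flag == 0:
            # A zero flag reads as "no verification needed", so the safety
            # check on the beam configuration is silently skipped.
            return "UNSAFE: setup verification bypassed"
        return "ok: setup verified"

check = SetupCheck()
results = [check.run() for _ in range(256)]
print(results[254], "|", results[255])      # the 255th call is fine, the 256th is not
```

Setting the flag to a constant non-zero value, or widening the counter, removes the wrap-around; the deeper lesson from the report is that no single software fix substitutes for the missing hardware interlocks.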
Lessons to be Learned from Catastrophes
• Bhopal Chemical Plant (1984)
• The Challenger Accident (1986)
• Chernobyl (1986)
• AIR DISASTERS including
– The Case of KAL FL/007 (1983) (269 dead): misprogrammed INS
– IRAN AIR FL 655 (290 dead): mishap at the human interface; the $1 billion AEGIS system
EPITAPH
• Inscrutable Software
• Who is writing the software?
• Should they be certified?
• Who is Responsible?
• Well-designed Human Interfaces Needed
• Human Factors Critical
• Redundancy and Self-Checking
• Testing
HOW CAN Artificial Intelligence HELP?
The Notion of a “HUMAN WINDOW”
Sources: M. Clarke (1982); D. Michie (1982)
[Diagram: points labeled L (Lighthill) and R (Reti)]
The Human Window FACTORS
1. Correctness
2. Grain Size
3. Comprehensibility
4. Executability
[Diagram: a scale from MEMORY to Calculation]
RISKY SYSTEMS
• Deemed Essential
• Dangers are covered up or denied
• Will not be abandoned or made safe
• Have numerous potential causes of failure, including
– Poor Design
– Poorly trained Operators
– Faulty Equipment or Uncooperative Environment
(Perrow, 1984)
ARE WE PUTTING TOO MUCH FAITH IN TECHNOLOGY?
• Technology: Optimistic and Pessimistic Organizational Views (Zuboff, 1988; Fisher, 1996)
– Pessimistic Scenario
• Automation of Existing Processes
• Tool for monitoring and surveillance
– Optimistic Scenario
• Information and Social Exchange
• Computer technology increases the information content of tasks
Match of Organization Structure and Technology is Necessary
• Formal, strict structures cannot accommodate complex technology
• Structure must enable coordination, control, and communication (Perrow, 1967; Zuboff, 1988)
• Feedback:
– Coaching vs. Monitoring
Intelligent Tutoring Systems
• SmartBooks
– Interactive
– Theory of Education
• Expert Module
– Model Tracing
• Student Modeller
– Assessment and Analysis
• Tutoring Module
– Tutoring Strategy; selection
• Curriculum Module
– Rich with Multimedia
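As an illustrative sketch only (the class and method names below are my own, not from the talk), the four ITS modules listed above are commonly wired together along these lines:

```python
class ExpertModule:
    """Domain knowledge: traces a model solution step by step (model tracing)."""
    def model_steps(self, problem):
        return [f"step {i} of solving {problem}" for i in range(1, 4)]

class StudentModeller:
    """Assessment and analysis: tracks what the student appears to know."""
    def __init__(self):
        self.mastery = {}
    def update(self, topic, correct):
        self.mastery[topic] = self.mastery.get(topic, 0) + (1 if correct else -1)

class TutoringModule:
    """Chooses a tutoring strategy based on the current student model."""
    def next_action(self, student, topic):
        return "give hint" if student.mastery.get(topic, 0) < 1 else "move on"

class CurriculumModule:
    """Orders the material; multimedia content would hang off these topics."""
    def __init__(self, topics):
        self.topics = list(topics)
    def next_topic(self):
        return self.topics.pop(0) if self.topics else None

# Minimal interaction loop tying the four modules together.
expert, student, tutor = ExpertModule(), StudentModeller(), TutoringModule()
curriculum = CurriculumModule(["fractions", "ratios"])
topic = curriculum.next_topic()
for step in expert.model_steps(topic):
    student.update(topic, correct=True)     # pretend the student got each step right
print(tutor.next_action(student, topic))    # -> "move on"
```

The point of the separation is that the domain expertise, the student model, the tutoring strategy, and the curriculum can each be revised independently.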