
Characterization of Pathological Behavior
http://www.ices.cmu.edu/ballista
Philip Koopman
[email protected] - (412) 268-5225
Dan Siewiorek
[email protected] - (412) 268-2570
(and more than a dozen other contributors)
Goals

• Detect pathological patterns for fault prognosis
• Develop fault propagation models
• Develop statistical identification and stochastic characterization of pathological phenomena

2
Outline

• Definitions
• Digital Hardware Prediction
• Digital Software Characterization
• Research Challenges

3
Definitions: Cause-Effect Sequence and Duration

• FAULT - incorrect state of hardware/software caused by component failure, environment, operator errors, or incorrect design
• ERROR - manifestation of a fault within a program or data structure
• FAILURE - service deviates from specified service due to an error

DURATION
• Permanent - continuous and stable, due to hardware failure; repair by replacement
• Intermittent - occasionally present, due to unstable hardware or varying hardware/software state; repair by replacement
• Transient - resulting from design errors or temporary environmental conditions; not repairable by replacement

4
CMU Andrew File Server Study

• Configuration
  – 13 SUN II Workstations with 68010 processor
  – 4 Fujitsu Eagle Disk Drives
• Observations
  – 21 Workstation Years
• Frequency of events and mean time to each:

  Event                  Count   Mean Time To
  Permanent Failures        29   6552 hours
  Intermittent Faults      610     58 hours
  Transient Faults         446    354 hours
  System Crashes           298    689 hours

5
Some Interesting Numbers

• Permanent Outages/Total Crashes = 0.1
• Intermittent Faults/Permanent Failures = 21
  – Thus the first symptom appears over 1200 hours prior to repair (21 intermittent faults at a 58-hour mean inter-arrival time is about 1218 hours)
• (Crashes - Permanent)/Total Faults = 0.255
• 14/29 failures had three or fewer error log entries
  – 8/29 had no error log entries

6
Harbinger Detection of Anomalies
7
Digital Hardware Prediction
8
Measurement and Prediction Module

• History Collection -- calculation and reporting of system availability
• Future Prediction -- failure prediction of system devices

[Diagram: the Measurement & Prediction Module, containing the History Collection and Future Prediction components, sits beside the operating system beneath the user's application programs]

9
History Collection

• This module consists of:
  – Crash Monitor - monitors system state and writes files of system state info (uptime, downtime)
  – Calculator - computes the average uptime fraction, uptime / (uptime + downtime) => Availability

[Diagram: the Crash Monitor watches the operating system and records system state to files; the Calculator reads them and produces files of uptime (fraction) information]

10
Average Uptime

The Crash Monitor periodically samples system state (up or down) and logs state changes.

[Figure: a time line with the system up from t1, a crash at t2, and a reboot at t3; uptime' = t2 - t1 = 600 min, downtime' = t3 - t2 = 13 min, sampling interval = 5 min]

Because the monitor only samples every 5 minutes, the real uptime is known only to within one interval: 600 min ± 5 min. (A minimal sketch of this bookkeeping follows.)

11
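A minimal sketch of the Crash Monitor/Calculator bookkeeping described on the last two slides, not the project's actual code: the 5-minute interval matches the slide, the replayed up/down pattern is the slide's example, and at this sampling resolution the 13-minute outage is observed as 15 minutes.

/* availability.c - sampling-based availability bookkeeping (sketch) */
#include <stdio.h>

#define INTERVAL_MIN 5.0                /* sampling interval from the slide */

struct avail_stats {
    double uptime;                      /* minutes observed up */
    double downtime;                    /* minutes observed down */
};

/* Crash Monitor side: called once per sampling interval with the probed state. */
static void record_sample(struct avail_stats *s, int up) {
    if (up) s->uptime   += INTERVAL_MIN;
    else    s->downtime += INTERVAL_MIN;
}

/* Calculator side: availability = uptime / (uptime + downtime). */
static double availability(const struct avail_stats *s) {
    double total = s->uptime + s->downtime;
    return total > 0.0 ? s->uptime / total : 1.0;
}

int main(void) {
    struct avail_stats s = { 0.0, 0.0 };
    for (int i = 0; i < 120; i++) record_sample(&s, 1);   /* 600 min up */
    for (int i = 0; i < 3; i++)   record_sample(&s, 0);   /* 13 min down, seen as 15 */
    printf("availability = %.4f\n", availability(&s));    /* 600/615 = 0.9756 */
    return 0;
}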
Preliminary Experiment Data (cont.)

An NT system cumulative availability daily report over a 5-month period.

[Chart: availability number (y-axis, 0 to 120) plotted by time/date; the surviving tick labels run from October 1999 through May 2000]

12
Future Prediction

• This module generates device failure warning information:
  – Sys-log Monitor: monitors new entries by checking the system event log periodically
  – DFT Engine: applies the Dispersion Frame Technique (DFT) heuristic and issues a device failure warning when the rules are satisfied

[Diagram: the Sys-log Monitor reads the operating system's error log; the DFT Engine applies the Dispersion Frame Technique and writes files of device failure warnings]

13
Principle from Observation

• Devices exhibit periods of increasingly unreliable behavior prior to catastrophic failure.

Error entry example:
DISK:9/180445/563692570/829000:errmsg:xylg:syc:cmd6:reset failed (drive not ready) blk 0 type

[Figure: CPU, memory board, and disk errors plotted over time, each stream ending in a repair; filtering by event type separates the mem and disk error streams (a filtering sketch follows)]

• Based on this observation, the DFT heuristic was derived to detect non-monotonically decreasing error inter-arrival times.

14
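A minimal sketch of the "filter by event type" step, assuming only what the example entry above shows: the device type is the token before the first ':' and the rest of the entry can be treated as opaque. The MEM entry below is hypothetical, added for contrast.

/* logfilter.c - filter error-log entries by event type (sketch) */
#include <stdio.h>
#include <string.h>

/* Returns 1 if the entry's leading token matches type (e.g., "DISK"). */
static int entry_has_type(const char *entry, const char *type) {
    size_t n = strlen(type);
    return strncmp(entry, type, n) == 0 && entry[n] == ':';
}

int main(void) {
    const char *log[] = {
        "DISK:9/180445/563692570/829000:errmsg:xylg:syc:cmd6:reset failed",
        "MEM:2/180501/563692600/100000:errmsg:parity error",   /* hypothetical */
        "DISK:9/180512/563692700/829000:errmsg:xylg:syc:cmd6:reset failed",
    };
    for (int i = 0; i < 3; i++)
        if (entry_has_type(log[i], "DISK"))
            printf("disk event: %s\n", log[i]);
    return 0;
}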
How DFT Works, via an Example

Rule: if a sliding window of 1/2 of the current error interval successively twice covers 3 errors in the future, issue a warning. (A sketch of this rule follows.)

[Figure: the last 5 errors of the same type (disk), labeled i-4 through i, on a time line, with the window placements that trigger the warning]

15
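The sketch below implements one literal reading of the slide's rule; the published Dispersion Frame Technique has several rules and tuning details not shown here, and the timestamps, half-open window convention, and forward scan are illustrative assumptions.

/* dft.c - one literal reading of the slide's DFT warning rule (sketch) */
#include <stdio.h>

/* Count errors in t[from..n-1] falling in the window [start, start+len). */
static int errors_in_window(const double *t, int n, int from,
                            double start, double len) {
    int count = 0;
    for (int i = from; i < n; i++)
        if (t[i] >= start && t[i] < start + len)
            count++;
    return count;
}

/* Apply the rule at error i: slide a window of half the current
 * inter-arrival time forward from each later error; if it covers
 * 3 errors twice in succession, issue a warning. */
static int dft_warning(const double *t, int n, int i) {
    double window = (t[i] - t[i - 1]) / 2.0;   /* 1/2 current interval */
    int hits = 0;
    for (int j = i; j < n; j++) {
        hits = (errors_in_window(t, n, j, t[j], window) >= 3) ? hits + 1 : 0;
        if (hits == 2)
            return 1;                          /* covered 3 errors twice */
    }
    return 0;
}

int main(void) {
    /* Hypothetical timestamps (minutes) for errors of one device type;
     * the inter-arrival times shrink as the device degrades. */
    double t[] = { 0, 400, 500, 560, 590, 610, 625 };
    int n = (int)(sizeof t / sizeof t[0]);
    for (int i = 1; i < n; i++)
        if (dft_warning(t, n, i)) {
            printf("DFT warning at error %d\n", i);
            break;
        }
    return 0;
}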
Digital Software Characterization
16
Where We Started: Component Wrapping

• Improve Commercial Off-The-Shelf (COTS) software robustness

17
Exception Handling: The Basis for Error Detection

• Exception handling is an important part of dependable systems
  – Responding to unexpected operating conditions
  – Tolerating activation of latent design defects
• Robustness testing can help evaluate software dependability
  – Reaction to exceptional situations (current results)
  – Reaction to overloads and software “aging” (future results)
  – First big objective: measure exception handling robustness
    - Apply to operating systems
    - Apply to other applications
• It’s difficult to improve something you can’t measure …
  so let’s figure out how to measure robustness!

18
Measurement Part 1: Software Testing

• SW testing requires:
  – Test case
  – Module under test
  – Oracle (a “specification”)
• Ballista uses:
  – “Bad” value combinations as test cases
  – Module under test
  – Watchdog timer/core dumps as the oracle

[Figure: the input space divides into valid inputs (specified behavior: SHOULD WORK) and invalid or undefined inputs (SHOULD RETURN ERROR); the response space of the module under test divides into robust operation, reproducible failure, and unreproducible failure]

19
Ballista: Scalable Test Generation

API: write(int filedes, const void *buffer, size_t nbytes)

Each parameter position has a testing object that supplies test values; Ballista combines test values to generate test cases (a sketch of this combination step follows).

FILE DESCRIPTOR TEST OBJECT:
FD_CLOSED, FD_OPEN_READ, FD_OPEN_WRITE, FD_DELETED, FD_NOEXIST, FD_EMPTY_FILE, FD_PAST_END, FD_BEFORE_BEG, FD_PIPE_IN, FD_PIPE_OUT, FD_PIPE_IN_BLOCK, FD_PIPE_OUT_BLOCK, FD_TERM, FD_SHM_READ, FD_SHM_RW, FD_MAXINT, FD_NEG_ONE

MEMORY BUFFER TEST OBJECT:
BUF_SMALL_1, BUF_MED_PAGESIZE, BUF_LARGE_512MB, BUF_XLARGE_1GB, BUF_HUGE_2GB, BUF_MAXULONG_SIZE, BUF_64K, BUF_END_MED, BUF_FAR_PAST, BUF_ODD_ADDR, BUF_FREED, BUF_CODE, BUF_16, BUF_NULL, BUF_NEG_ONE

SIZE TEST OBJECT:
SIZE_1, SIZE_16, SIZE_PAGE, SIZE_PAGEx16, SIZE_PAGEx16plus1, SIZE_MAXINT, SIZE_MININT, SIZE_ZERO, SIZE_NEG

Example test case: write(FD_OPEN_READ, BUF_NULL, SIZE_16)
20
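A minimal sketch of the combination step: each parameter position draws from its test object's pool of values, and test cases come from the cross product. Only three values per pool are shown for brevity (the full pools above would give 17 x 15 x 9 = 2,295 combinations for write()), and this is an illustration rather than Ballista's actual generator.

/* testgen.c - combine per-parameter test values into test cases (sketch) */
#include <stdio.h>

static const char *fd_vals[]   = { "FD_CLOSED", "FD_OPEN_READ", "FD_NEG_ONE" };
static const char *buf_vals[]  = { "BUF_SMALL_1", "BUF_NULL", "BUF_FREED" };
static const char *size_vals[] = { "SIZE_16", "SIZE_ZERO", "SIZE_NEG" };

#define COUNT(a) ((int)(sizeof(a) / sizeof((a)[0])))

int main(void) {
    /* Enumerate the full cross product of dial settings; with three
     * values per parameter this prints 27 test cases for write(). */
    for (int i = 0; i < COUNT(fd_vals); i++)
        for (int j = 0; j < COUNT(buf_vals); j++)
            for (int k = 0; k < COUNT(size_vals); k++)
                printf("write(%s, %s, %s)\n",
                       fd_vals[i], buf_vals[j], size_vals[k]);
    return 0;
}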
Ballista: “High Level” + “Repeatable”

• High-level testing is done using the API to perform fault injection
  – Send exceptional values into a system through the API
    - Requires no modification to code -- only linkable object files needed
    - Can be used with any function that takes a parameter list
  – Direct testing instead of middleware injection simplifies usage
• Each test is a specific function call with a specific set of parameters (see the setup/teardown sketch below)
  – System state initialized & cleaned up for each single-call test
  – Combinations of valid and invalid parameters tried in turn
  – A “simplistic” model, but it does in fact work...
• Early results were encouraging:
  – Found a significant percentage of functions with robustness failures
  – Crashed systems from user mode
• The testing object-based approach scales!

21
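A minimal sketch of the single-call test pattern, assuming an illustrative FD_OPEN_READ-style setup rather than Ballista's actual testing-object code: state is created before the call under test and destroyed afterward so each test starts clean.

/* singlecall.c - per-test create/run/destroy pattern (sketch) */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* create: build the test value, here a descriptor open for reading only */
    int fd = open("ballista_test.tmp", O_CREAT | O_RDONLY, 0600);

    /* the single call under test, one exceptional parameter combination */
    ssize_t r = write(fd, "x", 1);      /* write to a read-only descriptor */
    printf("write returned %zd\n", r);  /* expect -1 (EBADF) */

    /* destroy: release all state so the next test starts clean */
    close(fd);
    unlink("ballista_test.tmp");
    return 0;
}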
CRASH Robustness Testing Result Categories

• Catastrophic
  – Computer crashes/panics, requiring a reboot
  – e.g., Irix 6.2: munmap(malloc((1<<30)+1), ((1<<31)-1));
  – e.g., DUNIX 4.0D: mprotect(malloc((1<<29)+1), 65537, 0);
• Restart
  – Benchmark process hangs, requiring restart
• Abort
  – Benchmark process aborts (e.g., “core dump”)
• Silent
  – No error code generated when one should have been (e.g., de-referencing a null pointer produces no error)
• Hindering
  – Incorrect error code generated

(A sketch of how a harness can distinguish these outcomes follows.)

22
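A minimal sketch of how a harness might sort one single-call test into these categories, under stated assumptions: a stand-in run_test_case() (here, write() on an invalid descriptor), a 10-second hang threshold, and POSIX fork/wait. Catastrophic failures take down the whole machine and must be detected from outside this process; Hindering would additionally require checking which error code came back.

/* classify.c - classify one single-call test outcome (sketch) */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the call under test: write() to an invalid descriptor.
 * Exits 0 if the expected error code came back, 1 if it was silent. */
static int run_test_case(void) {
    errno = 0;
    ssize_t r = write(-1, "x", 1);
    return (r == -1 && errno == EBADF) ? 0 : 1;
}

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        alarm(10);                      /* convert a hang into SIGALRM */
        _exit(run_test_case());
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        puts(WTERMSIG(status) == SIGALRM
                 ? "Restart: process hung"
                 : "Abort: process died on a signal (e.g., SIGSEGV)");
    else if (WEXITSTATUS(status) == 1)
        puts("Silent: no error code when one was expected");
    else
        puts("No robustness failure observed");
    return 0;
}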
Digital Unix 4.0 Results
23
Comparing Fifteen POSIX Operating Systems

Ballista robustness tests for 233 POSIX function calls.

[Bar chart: normalized failure rate (0% to 25%), split into abort failures and restart failures, for AIX 4.1, FreeBSD 2.2.5, HP-UX 9.05, HP-UX 10.20 (1 catastrophic), Irix 5.3, Irix 6.2 (1 catastrophic), Linux 2.0.18, LynxOS 2.4.0 (1 catastrophic), NetBSD 1.3, OSF-1 3.2 (1 catastrophic), OSF-1 4.0, QNX 4.22 (2 catastrophics), QNX 4.24, SunOS 4.1.3, SunOS 5.5]

24
Failure Rates By POSIX Fn/Call Category
25
C Library Is A Potential Robustness Bottleneck

Portions of failure rates due to system calls vs. the C library.

[Bar chart: normalized failure rate (0% to 25%), split into system calls and C library, for the same fifteen operating systems; catastrophic annotations as on the previous chart (HP-UX 10.20, Irix 6.2, LynxOS 2.4.0, OSF-1 3.2: 1 each; QNX 4.22: 2)]

26
Failure Rates by Function Group
27
Technology Transfer

• Original project sponsor: DARPA
  – Sponsored technology transfer projects for:
    - Trident Submarine navigation system (U.S. Navy)
    - Defense Modeling & Simulation Office HLA system
• Industrial sponsors are continuing the work
  – Cisco – Network switching infrastructure
  – ABB – Industrial automation framework
  – Emerson – Windows CE testing
  – AT&T – CORBA testing
  – ADtranz – (defining project)
  – Microsoft – Windows 2000 testing
• Other users include
  – Rockwell, Motorola, and, potentially, some POSIX OS developers

28
Specifying A Test (web/demo interface)

• Simple demo interface; real interface has a few more steps...

29
Viewing Results

• Each robustness failure is one test case (one set of parameters)

30
“Bug Report” Program Creation

• Reproduces the failure in isolation (>99% effective in practice)

/* Ballista single test case Sun Jun 13 14:11:06 1999
 * fopen(FNAME_NEG, STR_EMPTY) */
...
const char *str_empty = "";
...
param0 = (char *) -1;                            /* FNAME_NEG: invalid filename pointer */
str_ptr = (char *) malloc (strlen (str_empty) + 1);
strcpy (str_ptr, str_empty);
param1 = str_ptr;                                /* STR_EMPTY: heap copy of "" */
...
fopen (param0, param1);                          /* the single call under test */

31
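For reference, one plausible self-contained completion of the fragment above, with the elided declarations filled in as assumptions; the original "..." sections are Ballista's own boilerplate and are not reproduced here.

/* repro.c - hypothetical completion of the generated test case */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *param0, *param1, *str_ptr;
    const char *str_empty = "";          /* STR_EMPTY */

    param0 = (char *) -1;                /* FNAME_NEG: invalid filename pointer */
    str_ptr = (char *) malloc (strlen (str_empty) + 1);
    strcpy (str_ptr, str_empty);
    param1 = str_ptr;

    fopen (param0, param1);              /* may abort on a non-robust libc */
    return 0;
}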
Research Challenges
33
Research Challenges

• Ballista provides a small, discrete state-space for software components
• The challenge is to create models of inter-module relations and workload statistics that generate predictions
• Create discrete simulations using the model and probabilities as input parameters
• Validate the model at a high level of abstraction through experimentation on a testbed
• Optimize cost/performance

34
Contributors

• What does it take to do this sort of research?
  – A legacy of 15 years of previous Carnegie Mellon work to build upon
    - But, sometimes it takes that long just to understand the real problems!
  – Ballista: 3.5 years and about $1.6 million spent to date

Students:
• Meredith Beveridge
• John Devale
• Kim Fernsler
• David Guttendorf
• Geoff Hendrey
• Nathan Kropp
• Jiantao Pan
• Charles Shelton
• Ying Shi
• Asad Zaidi

Faculty & Staff:
• Kobey DeVale
• Phil Koopman
• Roy Maxion
• Dan Siewiorek

35