Diploma de estudios Avanzados

Reliability study of an embedded
operating system for industrial
applications
Pardo, J., Campelo, J.C, Serrano, J.J.
Juan Pardo
Fault Tolerant Systems Group
Polytechnic University of Valencia
Spain
Research Objectives

Critical industrial applications and fault-tolerant
applications need operating systems (OS) that
guarantee correct and safe behaviour despite the
occurrence of errors.

Software fault injection techniques can be used to
validate the behaviour of an operating system in the
presence of errors.

These techniques corrupt the information passed in
some of the operating system calls to see how the
system reacts to invalid or corrupted values in the
kernel calls.
SEPT’04
WSRS '04
Research Objectives

This work presents the development of, and the results from, software
fault injection in an embedded system composed of a Real-Time
Operating System (RTOS) and a microcontroller.

A software fault injection tool has been developed. The proposed
methodology treats the operating system as a black box whose
source code is not available.

To this end, a layer between the operating system and the
application to be executed has been developed.

OS error detection coverage has been measured, and the critical OS
data structures that should be improved to increase the final
robustness of the operating system are discussed.
Introduction

Computer software is involved in many aspects of our lives.
Despite its enormous expansion, it is still far from perfect.

Measuring the quality of software requires testing.

Fault tolerance deals with software’s ability to hide problems,
specifically the effects of faults [Voas98].

Robustness is the degree to which a system operates correctly in the
presence of exceptional inputs or stressful environmental conditions.

Robustness can thus be viewed as an indication of the OS's capacity to
resist or react to faults induced by the applications running on top of it, or
originating from the hardware layer or from device drivers [DBench02].
Introduction

Fault Tolerant System

• Fault tolerance is intended to preserve the delivery of correct
service in the presence of active faults. It is generally implemented
by error detection and subsequent system recovery.
• A system able to continue working despite the occurrence of
errors
• Safe behaviour → known state which does not produce any risk to
the system

Dependability

• To avoid the loss of human lives or large economic losses
• Final product quality → validation before going to market
Introduction
Dependability: the dependability of a computing system is the ability to
deliver service that can justifiably be trusted
(A. Avizienis, J.C. Laprie, B. Randell)

Dependability
• Attributes: Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability
• Means: Fault prevention, Fault tolerance, Fault removal, Fault forecasting
• Threats: Faults, Errors, Failures
State of the art
Fault Injection Techniques

Fault Injection
• FI on simulated models
   • VHDL simulation models
   • Other languages
• FI on prototypes
   • Hardware injection (HWIFI)
      • External: HWIFI at pin level, electromagnetic perturbations
      • Internal: heavy ion radiation, laser radiation, scan chain
   • Software injection (SWIFI)
      • Time: static, dynamic
      • Level: high level, machine language

Injection objectives:
• Prediction
• Elimination
Advantages & drawbacks (SWIFI)

Total control over when and where to inject → controllability

Simulation of higher-level faults

Reduced cost

Higher reachability

Higher portability → flexibility

Low risk of damaging the circuit under test

Easy automation of the injection campaigns

Good observability → processors increasingly include internal tools for
debugging
Advantages & drawbacks (SWIFI)

There are zones which software cannot reach.

Less precision in timing measurements → interference with the
system, overload, etc.

Injection and activation agents overload the system

Runtime injection → little intrusion

Objective: minimize the overload

• Drawback for RTOS
• Easy automation of injection campaigns
• Pre-runtime → less intrusion
SW Fault Injection

SW Fault Injection tools:

• FIAT: Fault Injection Based Automated Testing Environment, Carnegie
Mellon University.
• EFI, PROFI: Processor Fault Injector, Dortmund University.
• FERRARI: Fault and ERRor Automatic Real-time Injector, University of
Texas.
• SFI, DOCTOR: integrateD sOftware implemented fault injeCTiOn
enviRonment, University of Michigan.
• FINE: Fault Injection and moNitoring Environment, University of
Illinois.
• FTAPE: Fault Tolerance and Performance Evaluator, University of Illinois.
• XCEPTION: University of Coimbra.
• MAFALDA, MAFALDA-RT: Microkernel Assessment by Fault injection
AnaLysis and Design Aid, LAAS-CNRS, Toulouse.
• BALLISTA: Carnegie Mellon University.
Tools



MicroC/OS-II → RTOS
Infineon C166 → Microcontroller
Tasking → Compiler, Debugger, ...

Infineon microcontroller characteristics:
• 16-bit, high performance
• On-chip CMOS
• 16.5 MIPS, 25/33 MHz
• Advantages from CISC & RISC
• High functionality for peripherals
• Typical for automotive

[Figure: C166 block diagram showing the core, ROM, RAM (1 KByte), XRAM, bus control, interrupt and PEC unit, CAN, PWM, SSC, WDT, CAPCOM, ADC, GPT 1+2 and USART 1+2]

Applications for the C166 Family

[Figure: C166 family architecture (C161, C163, C164, C165, C166, C167). Processor system: CPU, ROM/Flash, RAM, interrupt system, PEC, oscillator, external bus control, ports, X-Bus peripherals. Peripheral system: WDT, GPTs, USART, CAPCOM, ADC, synchronous communication, PWM]

• Automotive: engine management, transmission control, ABS/ASK, active suspension
• Industrial control: robotics, PLCs, servo drives, motor control, power inverters, machine-tool control (CNC)
• Consumer: DVD / CD-ROM, TV / monitor, VCR / satellite receiver, set-top box, games, video surveillance
• Telecom/Datacom: communication boards (LAN), modems, PBX, mobile communication
• EDP: hard disk drives, tape drives, printers, scanners, digital copiers, fax machines
COTS components

The main motivation for using Commercial Off-The-Shelf
(COTS) components in a system design is the notable
reduction in the development cost of the final product.

COTS components are a cost-effective way to rapidly
prototype complex software systems.

On the other hand, COTS software components pose
serious certification problems because their design
process is unknown.
COTS components

COTS software is composed of general-purpose
components which have poor dependability
specifications.

Usually, COTS components are black boxes: the
source code is not available and their internal
architecture (structure and data flow) is not
adequately documented.
µC/OS-II Operating System

It was selected because it has been widely used for many years
(first version of MicroC/OS: 1992).

Industrial robots, motor control, medical instruments, etc.

It is 99% compliant with the Motor Industry Software Reliability
Association (MISRA) C Coding Standards.

All Modified Condition Decision Coverage (MCDC) code in
MicroC/OS-II has been removed, improving code quality for RTCA /
EUROCAE DO-178B Level A-certified environments for avionics
applications.
Validated Software Comp.
µC/OS-II: Characteristics

Portable: uC/OS-II is written in highly portable ANSI C, with target
microprocessor-specific code written in assembly language.

ROMable: was designed for embedded applications. This means that if you
have the proper tool chain (i.e., C compiler, assembler, and linker/locator),
you can embed uC/OS-II as part of a product.

Scalable: the application can include only the services it needs, which
reduces the amount of memory (both RAM and ROM) required.
Scalability is accomplished with the use of conditional compilation.

Preemptive: uC/OS-II is a fully preemptive real-time kernel. This means that
uC/OS-II always runs the highest priority task that is ready.

Multitasking: uC/OS-II can manage up to 64 tasks; however, the current
version of the software reserves eight of these tasks for system use. This
leaves your application up to 56 tasks. Each task has a unique priority
assigned to it, which means that uC/OS-II cannot do round-robin scheduling.
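
The preemptive, unique-priority rule described above can be sketched in a few lines. This is only an illustrative model, not µC/OS-II code: the names `NUM_PRIORITIES`, `highest_priority_ready` and `make_ready` are hypothetical, and the sketch assumes that a lower number means a higher priority.

```python
NUM_PRIORITIES = 64  # one task per priority level, as stated on the slide


def highest_priority_ready(ready):
    """Return the priority that should run next (assumption: lowest number wins)."""
    if not ready:
        raise ValueError("no task is ready")
    return min(ready)


def make_ready(ready, prio):
    """Mark a task ready and return the priority that now owns the CPU."""
    if not 0 <= prio < NUM_PRIORITIES:
        raise ValueError("priority out of range")
    # A set models the unique-priority rule: a priority identifies one task,
    # so there is never a group of equal-priority tasks to rotate through.
    ready.add(prio)
    return highest_priority_ready(ready)
```

Because every priority maps to at most one task, the scheduler never has to choose between equals, which is why round-robin scheduling does not arise in this model.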
Jean J. Labrosse
µC/OS-II: Characteristics

Deterministic: The execution time of all uC/OS-II functions and services is
deterministic. You can always know how much time uC/OS-II will take to execute a
function or a service. Furthermore, the execution time of uC/OS-II services does not
depend on the number of tasks running in your application.

Task Stacks: Each task requires its own stack; uC/OS-II allows each task to have a
different stack size. This allows you to reduce the amount of RAM needed in your
application.

Services: system services such as mailboxes, queues, semaphores, fixed-sized
memory partitions, time-related functions, etc.

Interrupt Management: Interrupts can suspend the execution of a task. If a higher
priority task is awakened as a result of the interrupt, the highest priority task will run
as soon as all nested interrupts complete. Interrupts can be nested up to 255 levels
deep.

Robust and Reliable: uC/OS-II is based on uC/OS, which has been used in
hundreds of commercial applications since 1992.
Jean J. Labrosse
Black-box approach

The aim was to study the OS, a COTS software component, using a
black-box approach.

The OS source code was therefore not modified, to keep intrusion
into the OS behaviour to a minimum.

To this end, a layer named Meta-Kernel was developed between the
OS and the application to be executed.

Through this layer, faults were injected into any of the
parameters of the system calls to measure the OS robustness.

In black-box testing, input is fed into a program and the output is
checked. What goes on inside the program (the black-box) is
unimportant. [Voas98]
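
The interposition idea can be illustrated with a minimal sketch. It is not the authors' Meta-Kernel: the class `MetaKernel`, its methods and the stand-in kernel call `fake_task_create` are hypothetical names, and the return value 42 merely mimics an invalid-priority error code.

```python
class MetaKernel:
    """Hypothetical interposition layer between the application and the OS."""

    def __init__(self, os_calls):
        self.os_calls = os_calls   # call name -> callable standing in for the kernel
        self.armed = None          # (call name, parameter index, bit) or None
        self.log = []              # record of the injections actually performed

    def arm(self, name, param, bit):
        """Schedule one bit-flip for the next invocation of the named call."""
        self.armed = (name, param, bit)

    def call(self, name, *args):
        """Forward a system call, corrupting one parameter if an injection is armed."""
        args = list(args)
        if self.armed is not None and self.armed[0] == name:
            _, param, bit = self.armed
            original = args[param]
            args[param] = original ^ (1 << bit)
            self.log.append((name, param, bit, original, args[param]))
            self.armed = None      # transient fault: one injection per experiment
        return self.os_calls[name](*args)


def fake_task_create(prio):
    """Stand-in kernel call (assumption): 0 on success, 42 for an invalid priority."""
    return 0 if 0 <= prio < 64 else 42
```

The application only ever calls `MetaKernel.call`, so the kernel itself stays untouched and the injector sees nothing but the inputs and the returned error code, which is what the black-box approach requires.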
System Design

MicroC/OS-II OS → black box

OS source code not modified

Injector → layer between the OS and the application

Injection on the parameters of system calls
Injector Attributes
Injector attributes:
• Prediction, elimination
• Pre-runtime & runtime
• High level
• Transient faults
• Changing one bit in the system calls (bit-flip)
• One fault injected in each experiment
• Workload for tool testing

SOFTWARE FAULT INJECTION ATTRIBUTES
• Objectives: fault prediction, fault removal
• Time: pre-runtime, runtime
• Faults: level, localization, persistence, multiplicity (number of faults injected simultaneously in each experiment)
• Workload
   • Type: real applications, benchmarks, synthetic programs
   • Duration
Workload Design
Characteristics:
• Uses as many system calls as possible
• System calls for synchronization, semaphores, memory, queues,
messages, task handling, timing management, etc.
• Open module to include computations
• Workload for testing the injection tool and the OS
Workload Design
The system workload ran continuously and consisted of a series of
tasks executing the application.

In addition, an injection agent was in charge of injecting faults
and invalid values into the kernel calls in order to monitor the
system robustness.
Errors Classification
Events after fault injection:

• Detected errors
   • OS error code
   • C167 error code
   • Application error
• No error (correct result)
   • System call not used
   • System call used but the injection has no effect
• Others → Not Safe Faults (NFS)
   • Errors which could affect the system

Classification related to the detection mechanisms

Measures of error detection coverage and latency times
Injection Model

The faultload is the most critical dimension of an OS benchmark
and, more generally, of any dependability benchmark.

Two techniques can be used to corrupt system call parameters: the
‘bit-flip technique’, which systematically flips bits of the
target parameters,

and the ‘selective substitution technique’, in which invalid data values
are introduced into the system call parameters.

Studies have demonstrated the equivalence of the errors provoked
by the two techniques [DBench02].
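
The two corruption techniques can be contrasted with a short sketch. The function names, the 16-bit default width and the particular set of substitute values (boundary values of the parameter range) are assumptions chosen for illustration; the slides do not specify them.

```python
def flip_bit(value, bit, width=16):
    """Bit-flip technique: invert one bit of a parameter of the given width."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside the parameter width")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def all_bit_flips(value, width=16):
    """Systematic variant: every single-bit corruption of one parameter."""
    return [flip_bit(value, bit, width) for bit in range(width)]


def selective_substitutions(width=16):
    """Selective substitution: replace the parameter with chosen invalid values.

    The boundary values below are an illustrative assumption, not a list
    taken from the presentation.
    """
    top = (1 << width) - 1
    return [0, 1, top, top >> 1, (top >> 1) + 1]
```

A bit-flip always produces a value at Hamming distance one from the original, whereas selective substitution ignores the original value and targets specific suspicious inputs.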
Injection Model

BIT-FLIP technique

Chosen randomly at runtime:
1. System call
2. Parameter
3. Bit

Consequence of physical faults:
• EMI interference
• Noise
• Hardware faults
• ...
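
The three random choices listed on this slide (system call, parameter, bit) can be sketched as follows. The call catalogue `CALLS`, its call names and its parameter widths are hypothetical placeholders, not the real µC/OS-II API; a seeded generator is used so that a campaign can be replayed.

```python
import random

# Hypothetical catalogue: call name -> bit width of each injectable parameter.
CALLS = {
    "sem_pend": [16, 16],
    "task_create": [8],
    "time_delay": [16],
}


def pick_target(rng, calls):
    """Randomly choose (1) a system call, (2) one of its parameters, (3) a bit."""
    name = rng.choice(sorted(calls))
    param = rng.randrange(len(calls[name]))
    bit = rng.randrange(calls[name][param])
    return name, param, bit


def campaign(seed, experiments, calls=CALLS):
    """Plan one injection per experiment; the same seed gives the same plan."""
    rng = random.Random(seed)
    return [pick_target(rng, calls) for _ in range(experiments)]
```

Seeding the generator is a design choice of this sketch: it keeps the selection random while making any experiment that hangs the system reproducible.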
Analysis of the obtained results
• Coding of the different outcomes:
• D0: No error, correct output (the fault
injection did not affect the system).
• D1: Error detected by the operating
system (µC/OS-II error code).
• D2: Error detected by the application
(the application result was not correct).
• D3: Error that caused the system to
hang (system failure).
• D4: Error detected by the
microcontroller.
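
A classifier for these five outcomes might look like the sketch below. The observation flags and, in particular, the order in which they are checked are assumptions of this sketch (the slides do not state a precedence); it also assumes that an OS error code of 0 means "no error".

```python
from enum import Enum


class Outcome(Enum):
    D0 = "no error, correct output"
    D1 = "detected by the operating system"
    D2 = "detected by the application"
    D3 = "system hang"
    D4 = "detected by the microcontroller"


def classify(hung, mcu_trap, os_error_code, result_correct):
    """Map the observations of one experiment to D0..D4 (assumed precedence)."""
    if hung:
        return Outcome.D3
    if mcu_trap:
        return Outcome.D4
    if os_error_code != 0:      # assumption: 0 is the "no error" code
        return Outcome.D1
    if not result_correct:
        return Outcome.D2
    return Outcome.D0
```

For example, an experiment in which the kernel returned a non-zero code without hanging or trapping would be counted as D1.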
Analysis of the obtained results
Coverage [Powell95, Constantinescu95]

DETECC (valid cases)

Outcome   Frequency   Percentage   Valid percentage   Cumulative percentage
D0        756         65.7         65.7               65.7
D1        241         21.0         21.0               86.7
D2        23          2.0          2.0                88.7
D3        101         8.8          8.8                97.5
D4        29          2.5          2.5                100.0
Total     1150        100.0        100.0

Complete system (µC/OS-II + microcontroller):
C_cs = D0 + D1 + D2 + D4 = 65.7 + 21.0 + 2.0 + 2.5 = 91.2 %

Operating system (µC/OS-II):
C_OS = D0 + D1 = 86.7 %

[Figure: bar chart of the percentage of experiments per DETECC outcome, D0 to D4]
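
The two coverage figures can be recomputed directly from the outcome frequencies reported on this slide. The helper name `share` is hypothetical; the counts are the ones given in the frequency table.

```python
# Outcome frequencies from the DETECC table (1150 experiments in total).
counts = {"D0": 756, "D1": 241, "D2": 23, "D3": 101, "D4": 29}
total = sum(counts.values())


def share(*outcomes):
    """Percentage of experiments that ended in any of the given outcomes."""
    return 100.0 * sum(counts[o] for o in outcomes) / total


# Complete system: everything except the hangs (D3) counts as covered.
c_cs = share("D0", "D1", "D2", "D4")

# Operating system alone: unaffected runs plus errors it reported itself.
c_os = share("D0", "D1")
```

Rounded to one decimal, these values match the 91.2 % and 86.7 % quoted on the slide; the remaining 8.8 % of the experiments are the D3 hangs.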
Analysis of the obtained results

Error detection latencies

• Time between the injection and detection by the OS
• Mean value obtained: 304 μs
• One built-in timer of the microcontroller was used to measure latencies → high precision

Descriptive statistics for LATENC (values in ms, N = 241)

Statistic                          Value        Std. error
Mean                               0.30422573   1.97E-02
95% confidence interval, lower     0.26533773
95% confidence interval, upper     0.34311372
5% trimmed mean                    0.27924537
Median                             0.12800000
Variance                           9.392E-02
Std. deviation                     0.30646466
Minimum                            0.102400
Maximum                            0.972800
Range                              0.870400
Interquartile range                0.49920000
Skewness                           1.213        0.157
Kurtosis                           -0.287       0.312

[Figure: box plot of LATENC, N = 241]
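
The standard error and the 95% confidence interval of the mean latency can be approximately reproduced from the reported mean, standard deviation and sample size. This sketch uses a normal approximation, which is an assumption: the statistics package that produced the slide most likely used Student's t, so the bounds agree only to about three decimal places. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(mean, std_dev, n, level=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    std_err = std_dev / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * std_err, mean + z * std_err, std_err


# Values reported for LATENC (in ms): mean, standard deviation, N.
low, high, std_err = mean_confidence_interval(0.30422573, 0.30646466, 241)
```

With 241 samples the difference between the normal and the t quantile is small, which is why the approximation lands close to the reported bounds of roughly 0.265 ms and 0.343 ms.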
Other Results
‘E1’ (‘OS_ERR_EVENT_TYPE’) was the most frequent error code. It was
produced when the fault was injected into a semaphore, message queue or mailbox
call. The system reacted by hanging.

The second most frequent code, ‘E42’ (‘OS_PRIO_INVALID’), was obtained when the
injection targeted task management system calls.

Frequency table of the most typical error codes given by the OS (valid data)

Code    Frequency   Percentage   Cumulative percentage   Error code
E1      111         41.1         41.1                    OS_ERR_EVENT_TYPE
E11     14          5.2          46.3                    OS_MEM_INVALID_PART
E40     8           3.0          49.3                    OS_TASK_DEL_ERR
E41     3           1.1          50.4                    OS_PRIO_ERR
E42     69          25.6         75.9                    OS_PRIO_INVALID
E60     13          4.8          80.7                    OS_TASK_DEL_ERR
E81     11          4.1          84.8                    OS_TIME_INVALID_MINUTES
E82     2           0.7          85.6                    OS_TIME_INVALID_SECONDS
E83     10          3.7          89.3                    OS_TIME_INVALID_MILLI
Ex      29          10.7         100.0                   NO CODE
Total   270         100.0
Other Results
Moreover, the injection campaigns showed how errors propagated through the system.
For each experiment, the corrupted system call and the system call that finally detected
the error were recorded, together with the time the system took to detect the error.
Error Propagation

[Table: contingency table LLAMAD * PROPAG (counts), injected system call versus system call that detected the error, 270 cases in total. Rows (LLAMAD): LL1, LL10, LL13, LL15, LL16, LL17, LL18, LL19, LL20, LL21, LL22, LL23, LL24, LL28, LL3, LL30, LL4, LL5, LL50, LL6, LL8, LL9. Columns (PROPAG): LL10, LL13, LL15, LL16, LL17, LL18, LL20, LL22, LL23, LL24, LL28, LL3, LL30, LL4, LL5, LL50, LL9, Total.]
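
A contingency table of this kind can be built from per-experiment records, as in the sketch below. The function name and the record format are hypothetical, and any records passed to it in an example are illustrative only; they are not the campaign data.

```python
from collections import Counter


def contingency(records):
    """Cross-tabulate (injected call, detecting call) pairs.

    Returns the cell counts, the per-row totals, and the number of errors
    that were detected by a call other than the one that was corrupted.
    """
    cells = Counter(records)
    row_totals = Counter()
    propagated = 0
    for (injected, detected), n in cells.items():
        row_totals[injected] += n
        if injected != detected:
            propagated += n
    return cells, row_totals, propagated
```

For instance, calling `contingency([("LL1", "LL10"), ("LL1", "LL10"), ("LL13", "LL13")])` with these made-up records yields a cell count of 2 for the pair (LL1, LL10) and reports two propagated errors.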
Other Results

Finally, the most critical system calls were identified, with the aim
of improving their robustness and, ultimately, the dependability of
the OS.

For example, there are some data structures, related to the event
control block, in which injection produced many failures, and most of
the time the system hung.

This is because these structures store the list of tasks waiting for
an event: if the injection corrupts that information, the system
loses track of the next actions and enters an unsafe state
in which it does not know how to react (the system hangs).

This shows where special attention is needed, since an error in
those data structures could provoke critical failures in the
system.
Conclusions

The experiments yielded the error detection coverage, error detection
latency times, error propagation, typical OS error codes, etc.

Fault injection into the code and data memory segments of the
microkernel will be implemented too.

Possible improvements to MicroC/OS-II to increase its dependability
should take into account that errors in certain data structures
could provoke critical failures in the system.

The data structures identified should implement some mechanism
to protect the information they hold.
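
One possible protection mechanism, shown purely as an illustration, is to store a checksum next to the critical data and verify it before every use. The presentation does not say which mechanism should be used; the class `GuardedWaitList` and the choice of CRC-32 are assumptions of this sketch.

```python
import zlib


class GuardedWaitList:
    """Hypothetical wait list of task priorities guarded by a CRC-32 checksum."""

    def __init__(self, priorities):
        # Priorities 0..63 fit in one byte each.
        self._data = bytearray(sorted(priorities))
        self._crc = zlib.crc32(self._data)

    def read(self):
        """Return the waiting priorities, refusing to use corrupted contents."""
        if zlib.crc32(self._data) != self._crc:
            raise RuntimeError("wait list corrupted")
        return list(self._data)
```

With such a guard, a bit-flip in the list is turned into a detected error that the system can handle, instead of silently leading to a hang.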
Future Research

In future work, these data have to be compared with
those of other COTS RTOSs working under the same
conditions.

RT-fault injector to minimize intrusion
(without internal debug support, intrusion > 0)

Nexus-implemented fault injection

• Other architecture: Motorola MPC565
• Intrusion → null
• Preliminary results
• Better controllability and observability
• Best option to validate RTOS and applications
Contact Data
Juan Pardo
Fault Tolerant Systems Group
Polytechnic University of Valencia
Spain
Email: [email protected]
Web: http://www.disca.upv.es/gstf/