Reliable System Design


7. Markov Models
Reliable System Design 2011
by: Amir M. Rahmani
Markov Models

The primary difficulty with the combinatorial models is
that many complex systems cannot be modeled easily in
a combinatorial fashion.

The fault coverage is sometimes difficult to incorporate
into the reliability expression in a combinatorial model.

The process of repair is very difficult to model in a
combinatorial model.

Alternative: Markov models
matlab1.ir
Markov Process

In 1907 A.A. Markov published a paper in which he
defined and investigated the properties of what are now
known as Markov processes.

A Markov process with a discrete state space is referred
to as a Markov Chain.

A set of random variables forms a Markov chain if the
probability that the next state is S_{n+1} depends only on the
current state S_n, and not on any previous states.
Markov Process


A stochastic process is a function whose
values are random variables.
The classification of a random process
depends on three quantities:
– state space
– index (time) parameter
– statistical dependencies among the random
variables X(t) for different values of the index
parameter t.
Markov Process

Categories of Markov state-space models:
1. Discrete space and discrete time
2. Discrete space and continuous time
3. Continuous space and discrete time
4. Continuous space and continuous time

The first two categories involve a discrete space; that is,
the states of the system can be numbered with an integer.
In the first and the third categories, the system changes
by discrete time steps.
The second category is the one most useful for
modeling fault-tolerant systems.
Markov Process

States must be
– mutually exclusive
– collectively exhaustive

Let P_i(t) = probability of being in state S_i at
time t.

Markov properties
– The future state probability depends only on the current state:
• independent of the time spent in the state
• independent of the path taken to reach the state
State Transition Diagrams

A Markov state transition diagram can graphically
represent all:
1- System states and their initial conditions.
2- Transitions between system states and the corresponding
transition rates.

When the state transition time is very small (Δt), the
transition rates are replaced with equivalent transition
probabilities. This leads to:
1- A situation where the system remains in the current state
after the interval Δt with some probability.
2- A situation where the system goes to the next state(s)
after the interval Δt with some probability (transition rate × Δt).
Construction of State Transition Diagram

The basic steps in constructing state
transition diagrams are:
1- Define the failure criteria of the system.
2- Enumerate all of the possible states of the
system and classify them into good or failed
states.
3- Determine the transition rates between the various
states and draw the state transition diagram.
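
The three steps above can be sketched in code for the simplest case, a single component that fails and is never repaired. This is an illustration only: the helper name build_rate_matrix, the state labels, and the numeric failure rate are assumptions made for the example, and the matrix is laid out so that entry (i, j) holds the rate from state i to state j.

```python
def build_rate_matrix(states, transitions):
    """Return the transition rate matrix Q as a list of rows.

    states: ordered list of state labels.
    transitions: dict mapping (from_state, to_state) -> rate.
    Q[i][j] is the rate from states[i] to states[j].
    """
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    q = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in transitions.items():
        q[index[src]][index[dst]] = rate
    # Each diagonal entry is minus the total rate out of that state,
    # so every row of Q sums to zero.
    for i in range(n):
        q[i][i] = -sum(q[i][j] for j in range(n) if j != i)
    return q


# Step 1: failure criterion (assumed): the system fails when the component fails.
# Step 2: enumerate the states and classify them into good or failed.
states = ["working", "failed"]
good_states = {"working"}
# Step 3: transition rates (hypothetical value, failures per hour).
lam = 1e-3
Q = build_rate_matrix(states, {("working", "failed"): lam})
```

The failed state has no outgoing transitions, so its row is all zeros; such a state is absorbing.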
Example



State diagram for one component
Let X denote the lifetime of a component.
The Markov property is defined as follows:
$$ P(X \le t + \Delta t \mid X > t) = \lambda \Delta t, \quad \Delta t \to 0 $$

The probability that a component fails in the
small interval Δt is proportional to the length of
the interval.
λ is the constant of proportionality.
The probability above does not depend on the
time t.
Markov Process


Assume an exponential failure law with failure rate λ.
The probability that the system has failed at t+Δt, given that it was
working at time t, is given by
$$ 1 - e^{-\lambda \Delta t} \approx \lambda \Delta t \quad \text{for small } \Delta t $$
Reliability for one component
The probability that the component works at time t+Δt is
$$ P_1(t + \Delta t) = (1 - \lambda \Delta t)\, P_1(t) $$
We rearrange and divide by Δt
$$ \frac{P_1(t + \Delta t) - P_1(t)}{\Delta t} = -\lambda P_1(t) $$
Let Δt → 0, and we get
$$ P_1'(t) = -\lambda P_1(t) $$
Reliability for one component

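The solution to this differential equation is
$$ P_1(t) = C e^{-\lambda t} $$
Assuming that the component works at time t = 0, so
$$ P_1(0) = 1 \;\Rightarrow\; C = 1 $$
The reliability of the component is:
$$ R(t) = P_1(t) = e^{-\lambda t} $$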
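
As a numerical cross-check of the derivation for one component, the difference equation P1(t+Δt) = (1 - λΔt) P1(t) can be iterated with a small Δt and compared with the exponential reliability e^(-λt). The sketch below is only an illustration; the function name and the numeric values of λ, the time horizon, and Δt are hypothetical.

```python
import math


def euler_reliability(lam, t_end, dt):
    """Iterate P1(t + dt) = (1 - lam*dt) * P1(t), starting from P1(0) = 1."""
    steps = int(round(t_end / dt))
    p1 = 1.0
    for _ in range(steps):
        p1 = (1.0 - lam * dt) * p1
    return p1


lam = 0.01      # hypothetical failure rate, failures per hour
t_end = 100.0   # hours
approx = euler_reliability(lam, t_end, dt=0.01)
exact = math.exp(-lam * t_end)
```

Shrinking dt brings approx closer to exact, which mirrors the limit Δt → 0 taken in the derivation.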
Failure probability for one component
The probability that the component does not work at time t+Δt is
$$ P_0(t + \Delta t) = \lambda \Delta t\, P_1(t) + P_0(t) $$
We rearrange and divide by Δt
$$ \frac{P_0(t + \Delta t) - P_0(t)}{\Delta t} = \lambda P_1(t) $$
Let Δt → 0, and we get
$$ P_0'(t) = \lambda P_1(t) $$
Failure probability for one component
With P_1(t) = e^{-\lambda t} and P_0(0) = 0, solving the differential equation yields
$$ P_0(t) = 1 - e^{-\lambda t} $$
Markov chain model
The equation system can be written using matrices
where
and
Q is called the transition rate matrix.
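
The matrices themselves are missing from this transcript. One form consistent with the two differential equations P1'(t) = -λ P1(t) and P0'(t) = λ P1(t) is sketched below. It assumes a row-vector convention, which is a choice made here; a column-vector convention would use the transpose of Q.

```latex
\frac{d}{dt}\mathbf{P}(t) = \mathbf{P}(t)\,Q,
\qquad
\mathbf{P}(t) = \begin{bmatrix} P_1(t) & P_0(t) \end{bmatrix},
\qquad
Q = \begin{bmatrix} -\lambda & \lambda \\ 0 & 0 \end{bmatrix}
```

Multiplying out gives [-λ P1(t), λ P1(t)], which matches the two equations term by term. Each row of Q sums to zero.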
Cold stand-by system with one spare

State diagram

State labeling
2: Primary module works
1: Spare module works (primary module does not work)
0: No module works, system failure

Assumption: The failure rate for the spare is zero while it is in stand-by.
Cold stand-by system with one spare
We calculate the reliability of the system by solving the equation system
Where
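
The equation system and its definitions are missing from this transcript. As a sketch only, it can be written out under three assumptions that are not stated explicitly here: a row-vector convention P'(t) = P(t)Q, a system that starts in state 2, and a spare that fails at the same rate λ as the primary once it is switched in.

```latex
\frac{d}{dt}\mathbf{P}(t) = \mathbf{P}(t)\,Q,
\qquad
\mathbf{P}(t) = \begin{bmatrix} P_2(t) & P_1(t) & P_0(t) \end{bmatrix},
\qquad
\mathbf{P}(0) = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}

Q = \begin{bmatrix}
-\lambda & \lambda & 0 \\
0 & -\lambda & \lambda \\
0 & 0 & 0
\end{bmatrix}

% Written out row by row:
P_2'(t) = -\lambda P_2(t), \qquad
P_1'(t) = \lambda P_2(t) - \lambda P_1(t), \qquad
P_0'(t) = \lambda P_1(t)
```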
The Equation System
We solve this by Laplace transform using the following relation
$$ \mathcal{L}\{P'(t)\} = s\,\tilde{P}(s) - P(0) $$

Laplace transforms:
Time function        Laplace transform
1                    1/s
t                    1/s^2
e^{-\lambda t}       1/(s + \lambda)
t e^{-\lambda t}     1/(s + \lambda)^2
Solving the Equation System
Applying the Laplace transform, we get
where
which gives us
Solving the Equation System
1- We compute
which gives the following time function
2- We compute
The reliability of the system can be written as:
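
The intermediate expressions are missing from this transcript. The following is a reconstruction under stated assumptions rather than the original derivation: the system starts in state 2, both modules fail at rate λ when active, and the equations are P2' = -λ P2 and P1' = λ P2 - λ P1.

```latex
% Step 1: state 2, with P_2(0) = 1
s\tilde{P}_2(s) - 1 = -\lambda \tilde{P}_2(s)
\;\Rightarrow\;
\tilde{P}_2(s) = \frac{1}{s + \lambda}
\;\Rightarrow\;
P_2(t) = e^{-\lambda t}

% Step 2: state 1, with P_1(0) = 0
s\tilde{P}_1(s) = \lambda \tilde{P}_2(s) - \lambda \tilde{P}_1(s)
\;\Rightarrow\;
\tilde{P}_1(s) = \frac{\lambda}{(s + \lambda)^2}
\;\Rightarrow\;
P_1(t) = \lambda t\, e^{-\lambda t}

% The system works in states 2 and 1
R(t) = P_2(t) + P_1(t) = (1 + \lambda t)\, e^{-\lambda t}
```

Both inverse transforms use only the pairs for e^{-λt} and t e^{-λt}.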
Calculating MTTF
Let X1 and X2 denote the time spent in state 2 and state 1, respectively.

MTTF for the system can then be written as

Alternatively, the MTTF can be computed as
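
Both MTTF expressions are missing from this transcript. The sketch below assumes the two routes are (a) summing the expected times spent in state 2 and state 1, each 1/λ for an exponential sojourn, and (b) integrating the reliability from zero to infinity. It also assumes the cold stand-by reliability R(t) = (1 + λt) e^(-λt), which holds for perfect switching and equal failure rates. The function names and numeric values are hypothetical.

```python
import math


def reliability_cold_standby(t, lam):
    """R(t) = (1 + lam*t) * exp(-lam*t), assuming perfect switching and equal rates."""
    return (1.0 + lam * t) * math.exp(-lam * t)


def mttf_by_integration(reliability, lam, t_max, steps):
    """Approximate the integral of R(t) over [0, t_max] with the trapezoidal rule."""
    h = t_max / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(t_max, lam))
    for k in range(1, steps):
        total += reliability(k * h, lam)
    return total * h


lam = 0.5                          # hypothetical failure rate
mttf_sum = 1.0 / lam + 1.0 / lam   # E[X1] + E[X2] for two exponential sojourn times
mttf_int = mttf_by_integration(reliability_cold_standby, lam, t_max=100.0, steps=200_000)
```

The upper limit t_max stands in for infinity; it only needs to be large enough that R(t_max) is negligible.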
Reliability
Coverage




Designing a fault-tolerant system that will correctly
detect, mask or recover from every conceivable fault, or
error, is not possible in practice.
Even if a system can be designed to tolerate a very large
number of faults, or errors, there is for most systems a
non-zero probability that a single fault will not be
handled correctly. Such faults are known as “non-covered” faults.
The probability that a fault is covered (i.e., correctly
handled by the fault-tolerance mechanisms) is known as
the coverage factor, denoted c.
The probability that a fault is non-covered can then be
written as 1 - c.
Cold Stand-by system with Coverage factor
State diagram
We can write down the Q-matrix directly by inspecting the state diagram.
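
The state diagram and the Q-matrix are missing from this transcript. A plausible form is sketched below under assumptions that should be checked against the original diagram: states are ordered 2, 1, 0 as in the cold stand-by system without coverage; a covered failure of the primary (probability c) leads to state 1; a non-covered failure (probability 1 - c) leads directly to state 0; and the spare fails at rate λ once active.

```latex
Q = \begin{bmatrix}
-\lambda & c\lambda & (1 - c)\lambda \\
0 & -\lambda & \lambda \\
0 & 0 & 0
\end{bmatrix}
```

Each row sums to zero, and setting c = 1 recovers the matrix for perfect coverage.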
Solving the Equation System
We have the following equation system
After applying the Laplace transform, we get
We then compute
Solving the Equation System
The first state probability can be computed directly from the first equation
We then compute
Reliability for the system is
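
The expressions are missing from this transcript. A reconstruction is given below, assuming the system starts in state 2 and the equations are P2' = -λ P2 and P1' = cλ P2 - λ P1. That is the cold stand-by model with covered failures entering state 1 at rate cλ.

```latex
\tilde{P}_2(s) = \frac{1}{s + \lambda}
\;\Rightarrow\;
P_2(t) = e^{-\lambda t}

s\tilde{P}_1(s) = c\lambda \tilde{P}_2(s) - \lambda \tilde{P}_1(s)
\;\Rightarrow\;
\tilde{P}_1(s) = \frac{c\lambda}{(s + \lambda)^2}
\;\Rightarrow\;
P_1(t) = c\lambda t\, e^{-\lambda t}

R(t) = P_2(t) + P_1(t) = (1 + c\lambda t)\, e^{-\lambda t}
```

With c = 1 this reduces to (1 + λt) e^{-λt}, and with c = 0 to the single-component reliability e^{-λt}.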
The Reliability with Coverage factor
Calculating MTTF
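
The MTTF expression is missing from this transcript. Assuming the reliability with coverage is R(t) = (1 + cλt) e^{-λt} and that the MTTF is obtained by integrating R(t) from zero to infinity, the calculation would go as follows:

```latex
\mathrm{MTTF} = \int_0^\infty R(t)\, dt
= \int_0^\infty e^{-\lambda t}\, dt + c\lambda \int_0^\infty t\, e^{-\lambda t}\, dt
= \frac{1}{\lambda} + c\lambda \cdot \frac{1}{\lambda^2}
= \frac{1 + c}{\lambda}
```

Under these assumptions the spare contributes c/λ to the MTTF, so imperfect coverage reduces the benefit of the spare proportionally.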
Availability

Definition: the probability that a system is functioning
properly at a given time t.

When calculating the availability we consider both
failures and repairs. We must make assumptions about
the function time (up time) and the repair time (down
time).

The repair time consists of the time between the system
failure and the start of the repair, the time it takes to
perform the repair, and the time it takes to restart the
system after the repair is completed.
Steady-state Availability
E [X0] = MTTFF (Mean Time To First Failure)
E [Xi] = MTTF (Mean Time To Failure)
E [Yi] = MTTR (Mean Time To Repair)
MTTR + MTTF = MTBF (Mean Time Between Failures)
Design Tradeoffs

How to make availability approach 100%?
$$ \text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} $$

MTTF → infinity (high reliability)
MTTR → zero (fast recovery)
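
The two levers can be compared numerically. The sketch below evaluates Availability = MTTF / (MTTF + MTTR) for a baseline and for a tenfold improvement in either quantity. The function name and all numbers are hypothetical and serve only to illustrate the tradeoff.

```python
def steady_state_availability(mttf, mttr):
    """Availability = MTTF / (MTTF + MTTR); both arguments in the same time unit."""
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf / (mttf + mttr)


# Hypothetical figures, in hours.
baseline = steady_state_availability(mttf=1000.0, mttr=10.0)
more_reliable = steady_state_availability(mttf=10000.0, mttr=10.0)
faster_repair = steady_state_availability(mttf=1000.0, mttr=1.0)
```

Availability depends only on the ratio MTTF/MTTR, so a tenfold longer MTTF and a tenfold shorter MTTR give the same result here.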
Availability vs. Reliability
– Reliability is measured by mean time to
failure (MTTF)
- There is no repair from the state of system
failure when modeling reliability.
– Availability is a function of MTTF and
mean time to repair (MTTR):
MTTF/(MTTF+MTTR)
– A system may have a high MTBF, but low
availability
Markov chain model for a simplex system
State
0: System OK
1: System failure
Failure rate: λ
Repair rate: μ
Availability: $A(t) = P_0(t)$
Reliability: $R(t) = e^{-\lambda t}$
Maintainability: $M(t) = 1 - e^{-\mu t}$
The availability for a simplex system
Steady-state Availability
Assuming exponentially distributed function times and repair
times, we get
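
The result is missing from this transcript. For the two-state simplex model (state 0: OK, state 1: failed, failure rate λ, repair rate μ), assuming the system starts in state 0, the standard solution would be:

```latex
% From P_0'(t) = -\lambda P_0(t) + \mu P_1(t) and P_0(t) + P_1(t) = 1, with P_0(0) = 1
A(t) = P_0(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu) t}

% Steady state, using MTTF = 1/\lambda and MTTR = 1/\mu for exponential times
A(\infty) = \lim_{t \to \infty} A(t) = \frac{\mu}{\lambda + \mu}
= \frac{1/\lambda}{1/\lambda + 1/\mu}
= \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
```

At t = 0 the expression gives A(0) = 1, consistent with the assumed initial state.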
Markov chain for a hot stand-by system
State
0,1: System OK
2: System failure
Failure rate: λ
Repair rate: μ
Availability: A(t) = P0 (t) + P1 (t)

Assumption: Only one repair-person works with the
system when a failure has occurred.
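
The state diagram for this system is missing from this transcript, so the rates in the following sketch are assumptions: state 0 (both modules working) goes to state 1 at rate 2λ, state 1 goes to state 2 at rate λ, and the single repair-person restores one module at a time at rate μ (2 → 1 and 1 → 0). Under those assumptions the chain is a birth-death chain, and its steady-state probabilities follow from the balance equations. The function name and numeric rates are hypothetical.

```python
def hot_standby_steady_state(lam, mu):
    """Steady-state probabilities (pi0, pi1, pi2) for an assumed 3-state chain.

    Assumed rates (not given in the text): 0 -> 1 at 2*lam, 1 -> 2 at lam,
    1 -> 0 at mu, 2 -> 1 at mu (single repair-person).
    """
    # Balance across each cut of this birth-death chain:
    #   2*lam * pi0 = mu * pi1   and   lam * pi1 = mu * pi2
    r1 = 2.0 * lam / mu          # pi1 / pi0
    r2 = r1 * lam / mu           # pi2 / pi0
    pi0 = 1.0 / (1.0 + r1 + r2)  # normalisation: probabilities sum to 1
    return pi0, r1 * pi0, r2 * pi0


lam, mu = 0.001, 0.1             # hypothetical rates per hour
pi0, pi1, pi2 = hot_standby_steady_state(lam, mu)
availability = pi0 + pi1         # the system is OK in states 0 and 1
```

With more than one repair-person the rate from state 2 to state 1 would change, which is why the single-repairer assumption matters for the result.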
Safety

Definition: The probability that a system is either
functioning properly, or is in a safe failed state.

Calculating safety is similar to calculating reliability.

In a reliability model there is usually only one absorbing
state, while in a safety model there are at least two
absorbing states.

Among the absorbing states in a safety model, at least
one represents the system being in a safe shut-down state,
and at least one represents a catastrophic failure having
occurred.
Safety for a simplex system with
coverage factor
We obtain the following Markov chain model
and the corresponding transition-rate matrix
Safety for a simplex system with
coverage factor
The solutions of the differential equations are:
The safety of the system is:
The steady-state safety is:
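
The model and its solutions are missing from this transcript. The following sketch assumes a three-state chain whose labels are chosen here for illustration: state 0 is system OK, state 1 is the safe failed state (reached by a covered failure at rate cλ), and state 2 is the unsafe failed state (reached by a non-covered failure at rate (1 - c)λ). The system is assumed to start in state 0.

```latex
Q = \begin{bmatrix}
-\lambda & c\lambda & (1 - c)\lambda \\
0 & 0 & 0 \\
0 & 0 & 0
\end{bmatrix}

P_0(t) = e^{-\lambda t}, \qquad
P_1(t) = c\,(1 - e^{-\lambda t}), \qquad
P_2(t) = (1 - c)\,(1 - e^{-\lambda t})

S(t) = P_0(t) + P_1(t) = c + (1 - c)\, e^{-\lambda t}

S(\infty) = \lim_{t \to \infty} S(t) = c
```

Under these assumptions the steady-state safety equals the coverage factor, since every failure eventually occurs and only covered failures end in the safe state.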