Transcript Document

Probability and Statistics with
Reliability, Queuing and Computer
Science Applications: Chapter 8 on
Continuous-Time Markov Chains
Kishor Trivedi
Non-State Space Models
• Recall that non-state-space models like RBDs
and FTs can easily be formulated and
(assuming statistical independence) solved for
system reliability, system availability and
system MTTF
• Each component can have attached to it:
– A probability of failure
– A failure rate
– A distribution of time to failure
– Steady-state and instantaneous unavailability
Markov chain
• To model complex interactions between components,
use other kinds of models like Markov chains or
more generally state space models.
• Many examples of dependencies among system
components have been observed in practice and
captured by Markov models.
MARKOV CHAINS
• State-space based model
• States represent various conditions of the system
• Transitions between states indicate occurrences
of events
State-Space-Based Models
• States and labeled state transitions
• State can keep track of:
– Number of functioning resources of each type
– States of recovery for each failed resource
– Number of tasks of each type waiting at each
resource
– Allocation of resources to tasks
• A transition:
– Can occur from any state to any other state
– Can represent a simple or a compound event
State-Space-Based Model (Continued)
• Drawn as a directed graph
• Transition label:
– Probability: homogeneous discrete-time Markov chain
(DTMC)
– Rate: homogeneous continuous-time Markov chain (CTMC)
– Time-dependent rate: non-homogeneous CTMC
– Distribution function: semi-Markov process (SMP)
– Two distribution functions: Markov regenerative process (MRGP)
MARKOV CHAINS
(Continued)
• For continuous-time Markov chains (CTMCs)
the time variable associated with the system
evolution is continuous
• Whenever we speak of a Markov model (chain), we will mean a CTMC
Chapter 8
Continuous Time Markov Chains
Formal Definition
• A discrete-state, continuous-time stochastic process {X(t), t ≥ 0} is called a Markov chain if, for t0 < t1 < t2 < … < tn < t, the conditional pmf satisfies the following Markov property:
P[X(t) = x | X(tn) = xn, X(tn−1) = xn−1, …, X(t0) = x0] = P[X(t) = x | X(tn) = xn]
• A CTMC is characterized by state changes that can occur at any arbitrary time
• The index space is continuous.
• The state space is discrete-valued.
Continuous Time Markov Chain
(CTMC)
• A CTMC can be completely described by:
– The initial state probability vector for X(t0): p_j(t0) = P[X(t0) = j]
– The transition probability functions (over an interval): p_ij(v, t) = P[X(t) = j | X(v) = i]
pmf of X(t)
• Using the theorem of total probability,
p_j(t) = Σ_i p_i(v) p_ij(v, t)
If v = 0 in the above equation, we get p_j(t) = Σ_i p_i(0) p_ij(0, t)
Homogeneous CTMCs
• {X(t), t ≥ 0} is a (time-)homogeneous CTMC iff p_ij(v, t) depends on v and t only through the difference t − v:
p_ij(v, t) = p_ij(0, t − v) = p_ij(t − v)
• Or, the conditional pmf satisfies:
P[X(t) = j | X(v) = i] = P[X(t − v) = j | X(0) = i]
• A CTMC is said to be irreducible if every state can be reached from every other state with a non-zero probability.
• A state is said to be absorbing if no other state can be reached from it with non-zero probability.
• The notions of transient, recurrent non-null and recurrent null states are the same as in a DTMC. There is, however, no notion of periodicity in a CTMC.
CTMC Dynamics
Chapman-Kolmogorov Equation
p_ij(v, t) = Σ_k p_ik(v, u) p_kj(u, t),   v ≤ u ≤ t
• Note that these transition probabilities are functions of the elapsed time and not of the number of elapsed steps
• The direct use of this equation is difficult, unlike the case of a DTMC where we could anchor on the one-step transition probabilities
• Hence the notion of rates of transitions, which follows next
Transition Rates
• Define the rates (probabilities per unit time):
– net rate out of state j at time t: q_j(t) = lim_{h→0} [1 − p_jj(t, t + h)] / h
– rate from state i to state j at time t: q_ij(t) = lim_{h→0} p_ij(t, t + h) / h
Kolmogorov Differential
Equation
• The transition probabilities and transition rates are related by
p_ij(v, t + h) = Σ_k p_ik(v, t) p_kj(t, t + h)
• Dividing both sides by h and taking the limit h → 0 gives Kolmogorov's forward equation:
∂p_ij(v, t)/∂t = Σ_{k≠j} p_ik(v, t) q_kj(t) − p_ij(v, t) q_j(t)
Kolmogorov Differential Equation
(contd.)
• Kolmogorov's backward equation:
∂p_ij(v, t)/∂v = p_ij(v, t) q_i(v) − Σ_{k≠i} q_ik(v) p_kj(v, t)
• Writing these equations in matrix form:
∂P(v, t)/∂t = P(v, t) Q(t),   ∂P(v, t)/∂v = −Q(v) P(v, t)
Homogeneous CTMC
Specialize to a HCTMC (Kolmogorov forward equation): the rates are time-independent and p_ij(v, t) = p_ij(t − v), so
dp_ij(t)/dt = Σ_{k≠j} p_ik(t) q_kj − p_ij(t) q_j
• In matrix form, dP(t)/dt = P(t) Q, where Q = [q_ij] (with q_jj = −q_j) is called the infinitesimal generator matrix (or simply the generator matrix).
CTMC Steady-state Solution
• The steady-state solution of a CTMC is obtained by solving the following balance equations:
π Q = 0,   Σ_j π_j = 1
• Irreducible CTMCs with all states recurrent non-null have positive steady-state probabilities {π_j} that are unique and independent of the initial probability vector. All states of a finite irreducible CTMC are recurrent non-null.
• Measures of interest may be computed by assigning reward rates r_j to states and computing the expected steady-state reward rate:
E[R] = Σ_j r_j π_j
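The balance equations π Q = 0 with Σ_j π_j = 1 can be solved numerically by replacing one (redundant) balance equation with the normalization condition. A minimal sketch, assuming NumPy; the 3-state generator, its rates and the reward vector are illustrative only.

```python
# Sketch: steady-state solution of an irreducible CTMC, pi Q = 0 with sum(pi) = 1.
import numpy as np

lam, mu = 0.001, 0.1
# states ordered (2, 1, 0) = number of working components, shared repair
Q = np.array([[-2*lam,      2*lam,  0.0],
              [    mu, -(lam+mu),  lam],
              [   0.0,        mu,  -mu]])

M = np.vstack([Q.T[:-1], np.ones(len(Q))])   # drop one balance eq., add normalization
b = np.zeros(len(Q)); b[-1] = 1.0
pi = np.linalg.solve(M, b)

r = np.array([1.0, 1.0, 0.0])                # reward rate 1 in up states, 0 when down
print(pi, r @ pi)                            # steady-state pmf and expected reward rate
```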
CTMC Measures
• Measures of interest may be computed by assigning reward rates r_j to states and computing the expected reward rate at time t:
E[R(t)] = Σ_j r_j π_j(t)
• Expected accumulated reward (over an interval of time):
E[Y(t)] = Σ_j r_j L_j(t)
• L_j(t) = ∫_0^t π_j(u) du is the expected time spent in state j during (0, t); the vector L(t) satisfies a linear ODE (the LTODE, given later).
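A sketch (assuming NumPy/SciPy) of the two reward measures just defined, computed from π(t) = π(0) e^{Qt}; the 2-state chain, its rates and the reward rates are illustrative placeholders.

```python
# Sketch: expected reward rate at time t and expected accumulated reward over (0, t].
import numpy as np
from scipy.linalg import expm
from scipy.integrate import trapezoid

lam, mu = 0.001, 0.1
Q = np.array([[-lam,  lam],
              [  mu,  -mu]])
pi0 = np.array([1.0, 0.0])      # start in the up state
r = np.array([1.0, 0.0])        # reward rate 1 when up, 0 when down

t_end = 100.0
ts = np.linspace(0.0, t_end, 401)
pis = np.array([pi0 @ expm(Q * u) for u in ts])   # pi(t) on the grid
print(pis[-1] @ r)              # expected reward rate at t_end (here: availability A(t))
print(trapezoid(pis @ r, ts))   # expected accumulated reward over (0, t_end] (uptime)
```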
Markov Availability Model
2-State Markov Availability Model
[State diagram] Two states: 1 = UP and 0 = DOWN; failure rate λ (1 → 0) and repair rate μ (0 → 1), with λ = 1/MTTF and μ = 1/MTTR.
1) Steady-state balance equations for each state:
– Rate of flow IN = rate of flow OUT
• State 1:  μ π_0 = λ π_1
• State 0:  λ π_1 = μ π_0
2 unknowns, 2 equations, but there is only one independent equation.
2-State Markov Availability Model
(Continued)
Need an additional equation: π_0 + π_1 = 1
Solving,
π_1 = A_ss = μ / (λ + μ) = MTTF / (MTTF + MTTR)
π_0 = 1 − A_ss = λ / (λ + μ) = MTTR / (MTTF + MTTR)
Downtime in minutes per year = [MTTR / (MTTF + MTTR)] * 8760 * 60
A_ss = 0.99999  ⇒  1 − A_ss = 10⁻⁵  ⇒  DTMY ≈ 5.256 min
2-State Markov Availability Model
(Continued)
2) Transient availability for each state:
– Rate of buildup = rate of flow IN − rate of flow OUT
dπ_1(t)/dt = μ π_0(t) − λ π_1(t)
Since π_0(t) + π_1(t) = 1, we have
dπ_1(t)/dt = μ (1 − π_1(t)) − λ π_1(t)
dπ_1(t)/dt + (λ + μ) π_1(t) = μ
This equation can be solved, assuming π_1(0) = 1, to obtain
π_1(t) = A(t) = μ/(λ + μ) + [λ/(λ + μ)] e^{−(λ + μ)t}
2-State Markov Availability Model
(Continued)
3) Reliability: R(t) = e^{−λt}
4) Steady-state availability:
lim_{t→∞} A(t) = A_ss = μ / (λ + μ)
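A small sketch (assuming NumPy/SciPy, with illustrative MTTF and MTTR values) comparing the closed-form A(t) above with a direct numerical solution of the transient balance equation, and evaluating R(t) and A_ss:

```python
# Sketch: transient availability of the 2-state model.
import numpy as np
from scipy.integrate import solve_ivp

lam, mu = 1/1000.0, 1/10.0            # illustrative: MTTF = 1000 h, MTTR = 10 h
t_end = 200.0
ts = np.linspace(0.0, t_end, 201)

A_closed = mu/(lam+mu) + lam/(lam+mu) * np.exp(-(lam+mu)*ts)          # A(t)
sol = solve_ivp(lambda t, y: [mu - (lam+mu)*y[0]], (0.0, t_end), [1.0],
                t_eval=ts, rtol=1e-8)
assert np.allclose(A_closed, sol.y[0])       # ODE solution matches the closed form

print(A_closed[-1], mu/(lam+mu))             # A(t) approaches A_ss = mu/(lam+mu)
print(np.exp(-lam*t_end))                    # reliability R(t) = exp(-lam*t) at t_end
```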
Markov availability model
• Assume we have a two-component parallel redundant system with repair rate μ.
• Assume that the failure rate of each component is λ.
• When both the components have failed, the system
is considered to have failed.
Markov availability model
(Continued)
• Let the number of properly functioning
components be the state of the system.
The state space is {0,1,2} where 0 is the
system down state.
• We wish to examine effects of shared vs.
non-shared repair.
Markov availability model
(Continued)
[State diagrams] States 2, 1, 0 (number of working components), with failure transitions 2 → 1 at rate 2λ and 1 → 0 at rate λ in both models. Non-shared (independent) repair: 1 → 2 at rate μ and 0 → 1 at rate 2μ. Shared repair: 1 → 2 at rate μ and 0 → 1 at rate μ.
Markov availability model
(Continued)
• Note: the non-shared case can be modeled and solved using an RBD or a fault tree, but the shared case needs the use of Markov chains.
Steady-state balance equations
• For any state:
Rate of flow in = Rate of flow out
Consider the shared case:
2λ π_2 = μ π_1
(λ + μ) π_1 = 2λ π_2 + μ π_0
μ π_0 = λ π_1
π_i: steady-state probability that the system is in state i
Steady-state balance equations
(Continued)
• Hence
π_1 = (2λ/μ) π_2,   π_0 = (λ/μ) π_1 = (2λ²/μ²) π_2
Since π_0 + π_1 + π_2 = 1,
we have π_2 [1 + 2λ/μ + 2λ²/μ²] = 1
or π_0 = 2λ² / (μ² + 2λμ + 2λ²)
Steady-state balance equations
(Continued)
• Steady-state unavailability = π_0 = 1 − A_shared
Similarly, for the non-shared case,
steady-state unavailability = 1 − A_non-shared = λ² / (λ + μ)²
• Downtime in minutes per year = (1 - A)* 8760*60
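A short numerical sketch (plain Python, illustrative rates) of the two unavailability expressions above and the corresponding downtime:

```python
# Sketch: steady-state unavailability of the 2-component parallel system,
# shared repair vs. non-shared (independent) repair, plus downtime per year.
lam, mu = 1/1000.0, 1.0               # illustrative failure and repair rates

U_shared = 2*lam**2 / (mu**2 + 2*lam*mu + 2*lam**2)
U_nonshared = (lam / (lam + mu))**2   # two independent components both down

for name, U in [("shared", U_shared), ("non-shared", U_nonshared)]:
    print(name, U, U * 8760 * 60, "min/yr")
```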
A larger example
• Return to the 2 control and 3 voice channels example and
assume that the control channel failure rate is λ_c and the voice channel
failure rate is λ_v.
• Repair rates are μ_c and μ_v, respectively. Assuming a single
shared repair facility and control channel having preemptive
repair priority over voice channels, draw the state diagram of a
Markov availability model. Using SHARPE GUI, solve the
Markov chain for steady-state and instantaneous availability.
WFS Example
A Workstations-Fileserver Example
• Computing system consisting of:
– A file-server
– Two workstations
– Computing network connecting them
• System operational as long as:
– One of the Workstations
and
– The file-server are operational
• Computer network is assumed to be fault-free
The WFS Example
Markov Chain for WFS Example
• Assuming exponentially distributed times to
failure
– λ_w : failure rate of a workstation
– λ_f : failure rate of the file-server
• Assume that components are repairable
– μ_w : repair rate of a workstation
– μ_f : repair rate of the file-server
• File-server has (preemptive) priority for repair
over workstations (such repair priority cannot
be captured by non-state-space models)
Markov Availability Model for
WFS
[State diagram] Workstation failures: (2,1) → (1,1) and (2,0) → (1,0) at rate 2λ_w; (1,1) → (0,1) and (1,0) → (0,0) at rate λ_w. File-server failures: (i,1) → (i,0) at rate λ_f. Repairs: (i,0) → (i,1) at rate μ_f (file-server repair has priority); (1,1) → (2,1) and (0,1) → (1,1) at rate μ_w.
Since every state is reachable from every other state, the CTMC is irreducible. Furthermore, all states are positive recurrent.
Markov Availability Model for
WFS (Continued)
In this figure, the label (i,j) of each state is
interpreted as follows: i represents the number of
workstations that are still functioning and j is 1
or 0 depending on whether the file-server is up
or down respectively.
Markov Model
• Let {X(t), t ≥ 0} represent a continuous-time Markov chain (CTMC) with a finite state space.
• Infinitesimal generator matrix Q = [q_ij]:
– q_ij (i ≠ j): transition rate from state i to state j
– q_ii = −q_i = −Σ_{j≠i} q_ij, the diagonal element
Markov Availability Model for
WFS (Continued)
• For the example problem, with the states ordered
as (2,1), (2,0), (1,1), (1,0), (0,1), (0,0) the Q
matrix is given by:
Q =
          (2,1)            (2,0)            (1,1)               (1,0)            (0,1)           (0,0)
(2,1)   −(λ_f+2λ_w)        λ_f              2λ_w                0                0               0
(2,0)    μ_f             −(μ_f+2λ_w)        0                   2λ_w             0               0
(1,1)    μ_w               0              −(μ_w+λ_f+λ_w)        λ_f              λ_w             0
(1,0)    0                 0                μ_f               −(μ_f+λ_w)         0               λ_w
(0,1)    0                 0                μ_w                 0              −(μ_w+λ_f)        λ_f
(0,0)    0                 0                0                   0                μ_f            −μ_f
Markov Model (steady-state)
π : steady-state probability vector
π Q = 0,   Σ_i π_i = 1
π = (π_(2,1), π_(2,0), π_(1,1), π_(1,0), π_(0,1), π_(0,0))
These are called the steady-state balance equations:
rate of flow in = rate of flow out
After solving for π, obtain the steady-state availability
A_ss = π_(2,1) + π_(1,1)
Markov Model (transient)
π(t): transient state probability vector
• π(0): initial probability vector of the CTMC
• Transient behavior is described by the Kolmogorov differential equation (KDE):
dπ(t)/dt = π(t) Q,   given π(0)
Markov Availability Model
We compute the availability of the system: the system is available as long as it is in state (2,1) or (1,1).
• Instantaneous availability of the system:
A(t) = π_(2,1)(t) + π_(1,1)(t)
lim_{t→∞} A(t) = A_ss
Availability (Continued)
• Interval availability:
A_I(t) = (1/t) ∫_0^t A(x) dx = Expected uptime in (0, t] / t
• Steady-state availability:
A_ss = lim_{t→∞} A(t) = lim_{t→∞} A_I(t)
• There are three kinds of availability:
– Instantaneous, interval and steady-state
Markov Availability Model
(Continued)
Define L(t) = ∫_0^t π(u) du
L_(i,j)(t): expected total time spent in state (i,j) during (0, t)
Integrating the KDE, we get the LTODE:
dL(t)/dt = L(t) Q + π(0),   L(0) = 0
• Interval availability:
A_I(t) = [L_(2,1)(t) + L_(1,1)(t)] / t
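A sketch (assuming NumPy/SciPy) of integrating the LTODE dL(t)/dt = L(t) Q + π(0) numerically and forming the interval availability; the generator and the up-state set below are placeholders, not the WFS values.

```python
# Sketch of the LTODE: L(t) accumulates expected time in each state over (0, t].
import numpy as np
from scipy.integrate import solve_ivp

Q = np.array([[-0.1,  0.1],
              [ 1.0, -1.0]])
pi0 = np.array([1.0, 0.0])
up = [0]                                   # indices of the up states

t_end = 100.0
sol = solve_ivp(lambda t, L: L @ Q + pi0, (0.0, t_end),
                np.zeros_like(pi0), rtol=1e-8)
L_end = sol.y[:, -1]                       # expected time in each state over (0, t_end]
print(L_end[up].sum() / t_end)             # interval availability A_I(t_end)
```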
Markov Availability Model Results
λ_w = 0.0001 hr^-1, λ_f = 0.00005 hr^-1, μ_w = 1.0 hr^-1, μ_f = 0.5 hr^-1
A_ss = 0.9999
2-component Availability model with
finite Detection delay
• 2-component availability model
– Steady-state availability A_ss = 1 − π_0
• The failure detection stage takes a random time, EXP(δ)
– Down states are '0' and '1D' ⇒ A_ss = 1 − π_0 − π_1D
Therefore, the steady-state unavailability U(δ) is given by U(δ) = π_0 + π_1D
Redundant System with Finite
Detection Switchover Time
• After solving the Markov model, we obtain the steady-state probabilities π_2, π_1D, π_1, π_0
A_sys = π_2 + π_1 (or + π_1D)
• Can solve in closed-form or using SHARPE
Closed-form
• Writing the balance equations for the four states (2, 1D, 1, 0) and normalizing gives closed-form expressions for π_2, π_1D, π_1 and π_0 in terms of λ, δ and μ; the availability is then A = π_2 + π_1 (+ π_1D if state 1D is counted as up).
2-component availability model
with imperfect coverage
• Coverage factor = c (conditional probability
that the fault is correctly handled)
• ‘1C’ state is a reboot (down) state.
2-component availability model:
delay + imperfect coverage
• Model has detection delay + imperfect
coverage
• Down states are ‘0’, ‘1C’ and ‘1D’.
Modeling Software Faults
Operating System Failure
Availability model with hardware and software (OS)
redundancy; operational phase; Heisenbugs
Probability & Statistics with Reliability, Queuing and
Computer Science Applications
(2nd ed.)
K. S. Trivedi
John Wiley, 2001.
Assumptions
• Hardware failures are permanent and require a repair or replacement action, while OS failures are cleared by a reboot
• Repair and reboot take place at their respective rates for the hardware and the OS
Webserver Availability Model
with warm Replication
• Two nodes for hardware redundancy
• Each node has a copy of the webserver (software
redundancy– replication)
• Primary node can fail
• Secondary node can fail
• Primary process can fail
• Secondary process can fail
• Failures may have imperfect coverage
• Time delay for fault detection
• Model of a real system developed at Avaya Labs
Modeling Software Faults
Application Failure
Availability model with passive redundancy (warm replication) of the application; operational phase; Heisenbugs or hardware transients
Assumptions
• A web server (software) that fails at rate λ_p runs on a machine that fails at rate λ_m
• Server-process failures and machine failures are each detected after a finite mean detection time
• The machine and the server process each have a finite mean restart time
Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault-Tolerance
S. Garg, Y. Huang, C. Kintala, K. S. Trivedi and S. Yagnik
Proc. of the 29th Intl. Symp. On Fault-Tolerant Computing, FTCS-29, June 1999.
Parameters
• Process MTTF = 10 days
• Node MTTF = 20 days
• Process polling interval = 2 seconds
• Mean process restart time = 30 seconds
• Mean process failover time = 2 minutes
• Switching time with a given mean
• Coverage factor c = 0.95
Solution for warm replication
Modeling an N+1 Protection
System
Outline
• Description of the system
• Using a rate approximation
• Using a 3-stage Erlang approximation to a
uniform distribution
• Using a Semi-Markov model - approximation
method using a 3-stage Erlang distribution
• Using equations of the underlying Semi-Markov Process
• Solutions for the models
Description of the system
• N = Number of protected units (we use N=1)
• λ = Unit failure rate
• μ = Unit restoration rate
• T = deterministic time between routine
diagnostics
• c = Probability that a protection switch
successfully restores service
• d = Probability that a failure in the standby
unit is detected
Outline
Description of the system
• Using a rate approximation
• Using a 3-stage Erlang approximation to a
uniform distribution
• Using a Semi-Markov model - approximation
method using a 3-stage Erlang distribution
• Using equations of the underlying Semi-Markov Process
• Solutions for the models
Hot Standby with Different Coverages (N = 1)
[State diagram] States: Normal (1+1) = 1, Protection Switch Failure = 2, Simplex (1) = 3, Failure to Detect Protection Fault = 4, Failed (0) = 5. The transition labels include λ(c+d), λ(1−c), λ(1−d), λ, 2λ and the restoration rate μ.
Diagnostics: Using a Rate Approximation (N = 1)
[State diagram] Same five states as above (Normal = 1, Protection Switch Failure = 2, Simplex = 3, Failure to Detect Protection Fault = 4, Failed = 5). The time to the next routine diagnostic is approximated as exponentially distributed with mean T/2, i.e., by a transition at rate 2/T out of the Failure to Detect Protection Fault state.
Outline
Description of the system
Using a rate approximation
• Using a 3-stage Erlang approximation to a
uniform distribution
• Using a Semi-Markov model approximation method using a 3-stage
Erlang distribution
• Using equations of the underlying Semi-Markov Process
• Solutions for the models
[Figure] Comparison of probability density functions (pdf): 3-stage Erlang pdf vs. U(0,1) pdf, for T = 1.
[Figure] Comparison of cumulative distribution functions (cdf): 3-stage Erlang cdf vs. U(0,1) cdf, for T = 1.
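A short sketch (assuming SciPy) of the comparison shown in the two figures: the cdf of a 3-stage Erlang with mean T/2 (stage rate 6/T) against the U(0, T) cdf, for T = 1.

```python
# Sketch: 3-stage Erlang approximation (mean T/2) vs. the uniform U(0, T) cdf.
from scipy.stats import erlang, uniform

T = 1.0
erl = erlang(3, scale=T/6.0)          # 3 stages, each Exp(6/T); mean = 3*(T/6) = T/2
uni = uniform(loc=0.0, scale=T)

for t in (0.1, 0.25, 0.5, 0.75, 1.0):
    print(t, round(erl.cdf(t), 3), round(uni.cdf(t), 3))
```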
Using a 3-stage Erlang Approximation to a Uniform Distribution
[State diagram] Same model as before, but the time to the next diagnostic, uniformly distributed over (0, T), is approximated by a 3-stage Erlang with mean T/2: the Failure to Detect Protection Fault state is expanded into three stages (with intermediate states s1 and s2), each left at rate 6/T.
Outline
Description of the system
Using a rate approximation
Using a 3-stage Erlang approximation to a
uniform distribution
• Using a Semi-Markov model approximation method using a 3-stage
Erlang distribution
• Using equations of the underlying Semi-Markov Process
• Solutions for the models
Using a Semi-Markov Model: Approximation Using an Erlang Distribution (N = 1)
E(t) is the 3-stage Erlang distribution (mean T/2, stage rate 6/T) given by
E(t) = 1 − e^{−6t/T} [1 + (6t/T) + (6t/T)²/2]
[State diagram] Same five states as before; the time to the next diagnostic, uniformly distributed over (0, T), is approximated by the distribution E(t) attached to the transition out of the Failure to Detect Protection Fault state.
Outline
Description of the system
Using a rate approximation
Using a 3-stage Erlang approximation to a
uniform distribution
Using a Semi-Markov model approximation method using a 3-stage
Erlang distribution
• Using equations of the underlying Semi-Markov Process
• Solutions for the models
Using Equations of the underlying
Semi-Markov Process
• Steady-state solution
– One-step transition probability matrix P of the embedded DTMC
Using Equations of the underlying
Semi-Markov Process (Continued)
Using Equations of the underlying
Semi-Markov Process (Continued)
• Time to the next diagnostic is uniformly distributed over (0, T)
Using Equations of the underlying
Semi-Markov Process (Continued)
Outline
Description of the system
Using a rate approximation
Using a 3-stage Erlang approximation to a
uniform distribution
Using a Semi-Markov model approximation method using a 3-stage
Erlang distribution
Using equations of the underlying Semi-Markov Process
• Solutions for the models
Solutions for the models
Parameter values assumed:
• N=1
• c = 0.9
• d = 0.9
• λ = 0.0001 / hour
• μ = 1 / hour
• T = 1 hour
Results obtained
• Steady-state availability
– Probability of being in states "Normal", "Simplex", or "Failure to Detect Protection Fault"
• Steady-state unavailability
– Probability of being in states "Protection Switch Failure" or "Failed (0)"
• Average downtime in steady state
– Steady-state unavailability * number of minutes in a year
• Average number of units available
– 2*P_Normal + 1*P_Simplex + 1*P_FailureToDetectProtectionFault
Markov Reliability Model
Markov reliability model
with repair
• Consider the 2-component parallel system (no delay +
perfect cov) but disallow repair from system down state
• Note that state 0 is now an absorbing state. The state
diagram is given in the following figure.
• This reliability model with repair cannot be modeled using
a reliability block diagram or a fault tree. We need to
resort to Markov chains. (This is a form of dependency
since in order to repair a component you need to know the
status of the other component).
Markov reliability model
with repair (Continued)
Absorbing state
• Markov chain has an absorbing state. In the
steady-state, system will be in state 0 with
probability 1. Hence transient analysis is of
interest. States 1 and 2 are transient states.
Markov reliability model
with repair (Continued)
Assume that the initial state of the Markov chain
is 2, that is, 2(0) = 1, k (0) = 0 for k = 0, 1.
Then the system of differential Equations is written
based on:
rate of buildup = rate of flow in - rate of flow out
for each state
Markov reliability model
with repair (Continued)
dπ_2(t)/dt = −2λ π_2(t) + μ π_1(t)
dπ_1(t)/dt = 2λ π_2(t) − (λ + μ) π_1(t)
dπ_0(t)/dt = λ π_1(t)
Markov reliability model
with repair (Continued)
After solving these equations, we get
R(t) = π_2(t) + π_1(t)
Recalling that MTTF = ∫_0^∞ R(t) dt, we get:
MTTF = 3/(2λ) + μ/(2λ²)
Markov reliability model
with repair (Continued)
Note that the MTTF of the two-component parallel redundant system, in the absence of a repair facility (i.e., μ = 0), would have been equal to the first term, 3/(2λ), in the above expression. Therefore, the effect of a repair facility is to increase the mean life by μ/(2λ²), or by a factor of 1 + μ/(3λ).
Markov Reliability Model with
Repair ( WFS Example)
• Assume that the computer system does not recover if
both workstations fail, or if the file-server fails
Markov Reliability Model with Repair
States (0,1), (1,0) and (2,0) become absorbing states while (2,1) and (1,1)
are transient states.
Note: we have made a simplification that, once the CTMC reaches a system
failure state, we do not allow any more transitions.
Markov Model with Absorbing
States
• If we solve for π_(2,1)(t) and π_(1,1)(t), then
R(t) = π_(2,1)(t) + π_(1,1)(t)
• For a Markov chain with absorbing states:
A: the set of absorbing states
B: the set of remaining (non-absorbing) states
τ_(i,j): mean time spent in state (i,j) until absorption, (i,j) ∈ B
τ_(i,j) = ∫_0^∞ π_(i,j)(x) dx
τ_B Q_B = −π_B(0)
Markov Model with Absorbing
States (Continued)
Q_B is derived from Q by restricting it to only the states in B
The mean time to absorption, MTTA, is given by:
MTTA = Σ_{(i,j)∈B} τ_(i,j)
Markov Reliability Model with
Repair (Continued)
Q_B =
        (2,1)              (1,1)
(2,1)  −(λ_f + 2λ_w)       2λ_w
(1,1)   μ_w              −(μ_w + λ_f + λ_w)
First solve τ_B Q_B = −π_B(0).
Markov Reliability Model with
Repair (Continued)
Then, with τ_(2,1) and τ_(1,1) obtained from these equations,
MTTA = τ_(2,1) + τ_(1,1).
The mean time to failure is 19992 hours.
Markov Reliability Model
without Repair
• Assume that neither the workstations nor the file-server is repairable
Markov Reliability Model
without Repair (Continued)
States (0,1), (1,0) and (2,0)
become absorbing states
Markov Reliability Model
without Repair (Continued)
Q_B =
        (2,1)            (1,1)
(2,1)  −(λ_f + 2λ_w)     2λ_w
(1,1)   0              −(λ_f + λ_w)
Mean time to failure is 9333 hours.
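A sketch (assuming NumPy) verifying both WFS mean-time-to-failure figures by solving τ_B Q_B = −π_B(0) over the transient states (2,1) and (1,1), using the rates from the availability model:

```python
# Sketch: WFS mean time to failure, with and without repair.
import numpy as np

lw, lf, mw = 1e-4, 5e-5, 1.0             # lambda_w, lambda_f, mu_w (per hour)

QB_repair = np.array([[-(lf+2*lw),        2*lw],
                      [        mw, -(mw+lf+lw)]])
QB_norepair = np.array([[-(lf+2*lw),     2*lw],
                        [         0, -(lf+lw)]])

pi0 = np.array([1.0, 0.0])               # start in state (2,1)
for name, QB in [("with repair", QB_repair), ("without repair", QB_norepair)]:
    tau = np.linalg.solve(QB.T, -pi0)
    print(name, round(tau.sum()))        # ~19992 h and ~9333 h
```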
Markov Reliability Model
with Imperfect Coverage
Markov model
with imperfect coverage
Next consider a modification of the above
example proposed by Arnold as a model of
duplex processors of an electronic
switching system. We assume that not all
faults are recoverable and that c is the
coverage factor which denotes the
conditional probability that the system
recovers given that a fault has occurred.
The state diagram is now given by the
following picture:
Now allow for Imperfect Coverage
[State diagram] A covered fault in state 2 (rate 2λc) leads to state 1; an uncovered fault (rate 2λ(1−c)) leads directly to the down state 0.
Markov model
with imperfect coverage
(Continued)
Assume that the initial state is 2, so that:
π_2(0) = 1,  π_0(0) = π_1(0) = 0
Then the system of differential equations is:
dπ_2(t)/dt = −2λc π_2(t) − 2λ(1−c) π_2(t) + μ π_1(t)
dπ_1(t)/dt = 2λc π_2(t) − (λ + μ) π_1(t)
dπ_0(t)/dt = 2λ(1−c) π_2(t) + λ π_1(t)
Markov model
with imperfect coverage
(Continued)
After solving the differential equations, we obtain:
R(t) = π_2(t) + π_1(t)
From R(t), we can compute the system MTTF:
MTTF = [λ(1 + 2c) + μ] / [2λ(λ + μ(1 − c))]
It should be clear that the system MTTF and the system reliability are critically dependent on the coverage factor.
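A sketch (assuming NumPy, with illustrative λ and μ) checking the coverage-dependent MTTF formula against the absorbing-chain solution for a few values of c:

```python
# Sketch: MTTF vs. coverage c, closed form vs. tau_B Q_B = -pi_B(0).
import numpy as np

lam, mu = 1e-3, 1.0
for c in (1.0, 0.99, 0.9):
    QB = np.array([[-2*lam,   2*lam*c],      # transient states 2 and 1
                   [    mu, -(lam+mu)]])
    tau = np.linalg.solve(QB.T, -np.array([1.0, 0.0]))
    closed = (lam*(1 + 2*c) + mu) / (2*lam*(lam + mu*(1 - c)))
    print(c, round(tau.sum()), round(closed))   # the two agree for each c
```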