Transcript EE 681 Fall 2000 Lecture 5

```E E 681 Module 21
Markov Method of Availability
Analysis, APS systems, and
Availability Simulation Methods
W. D. Grover
TRLabs & University of Alberta
© Wayne D. Grover 2002, 2003
Markov Techniques for Availability Analysis
Markov chain:
A system with discrete states in statistical equilibrium
and with constant memoryless probabilities of making
transition between states...
Example: two-state working-failed system
p12 = 
p11 = 1-
p22 = 1-
2
1
p21 = 
E E 681 - Module 21
© Wayne D. Grover 2002, 2003
State 1: working
State 2: failed
 = 1/MTTF = failure rate
 = 1/MTTR = repair rate
‹#›
Markov Techniques for Availability Analysis
1 = prob. of being in state 1
p12 = 
p22 = 1-
p11 = 1-
2
1
p21 = 
State 1: working
State 2: failed
 = 1/MTTF = failure rate
 = 1/MTTR = repair rate
n = time-step
pij = state transition
probability
Consider that:
 p11 p12 
[1  2]  
[1  2]
1  p21 p22
0
This means that:
2
 p11 p12 
 p11 p12 
[1  2]  
[1  2]  
[1  2]
2  p21 p22
1  p21 p22
0
  p11
[1  2]  lim  
n 
  p 21
   0 lim[ P ]
Etc.
p12  n

[

1

2]
0
p 22 

n
Thus ....
n 
“spinning” the transition matrix
Once a vector is known, availability is known or computable over non-operating states
E E 681 - Module 21
© Wayne D. Grover 2002, 2003
‹#›
Markov Techniques for Availability Analysis
A separate computational method is based on observing that the system must reach
steady-state probabilities wherein, as every time epoch passes, the state occupancy
probabilities remain unchanged, implying that:
 p11 p12 
[1  2]n  
[1  2]n  1

 p21 p22
...or more
generally that :
 P 
is reached in the limit.
That is, that the state probability vector doesn’t change under repeated multiplication by
matrix P.
Which can be solved numerically for .
E E 681 - Module 21
© Wayne D. Grover 2002, 2003
‹#›
Markov Techniques: Model of a 1+1 APS System
A
Functional
view of
1+1 APS
B
1- 2 
Markov chain
model of
1 + 1 APS
2
1) Both elements
working

1-2 
2) One element
down, One element working

3) Both elements
down,
2
Failure state
1- - 
State transition
probability
matrix
E E 681 - Module 21
P1  1
2
0 
1  2
 
1   
 


2
1  2 
 0
© Wayne D. Grover 2002, 2003
‹#›
Markov Techniques: Model of 1:N APS Systems
Functional view of 1:N APS
K1-K2 byte
signalling:
K1-K2 bytes
on spare span
control
end
bridge
request
(channel
number)
control
. . .Working 1. .
...
...
bridge
confirm
(channel
number)
. . .Spare. . .
. . .Working N. .
x = normally open
_ = normally closed
E E 681 - Module 21
© Wayne D. Grover 2002, 2003
Tail-end transfer
‹#›
Markov Techniques: Models of APS Systems
Availability model view of 1:N APS
1
Ub 1
Uw
Ut1
Us= spare system unavailability
N working
N
Uw
Ub 2
Ut2
spare
E E 681 - Module 21
Uw= working system unavailability
Us
© Wayne D. Grover 2002, 2003
failure mode 1
failure mode 2
Ut1= tail end transfer unavailability,
failure mode 1
Ut2= tail end transfer unavailability,
failure mode 2
‹#›
“Cutset” (algebraic) model of APS availability
Method:
List outage-causing failure
combinations from functional inspection
/ system knowledge:
Usys 
( N  1) 2
Uw  Uw (Ub2  Ut 2  Us )  (Ub1  Ut1)
2
Outage if:
(1) two or more working systems
fail.
(2) If HEB or TET fail at any time in
mode 2, or the spare span fails,
and there is one or more working
system failure.
(3) If HEB or TET fail at any time in
mode 1 there is outage.
E E 681 - Module 21
© Wayne D. Grover 2002, 2003
2 or more
working systems
1 working system
and spare or
HEB / TET active-function
HEB / TET through-function
by itself: “series element”
‹#›
Markov Techniques: Models of APS Systems
Outage-causing states Asys  1 
Markov chain model of 1:N APS
1) All working
and spare systems up

(N-1) 

self-loops not
drawn for clarity
2) One working system
down, spare
ok


1a) spare
span failed

E E 681 - Module 21
N


(N-1) 
2a) One working down,
standby also
down
(N-2) 
3a) Two working systems
and spare all
down
2
© Wayne D. Grover 2002, 2003
N+1) N working
systems down,
one protected by
spare
 
3


i  outage states
i

3) Two working
systems down,
one protected
by spare
2


(N-2) 


N+1a) N working
systems down,
spare down
 
3

N
‹#›
An engineering issue: “exercising the spare”
• It is usual to provide redundancy in
the form of a “standby” system to
enhance availability.
Ideal view: (spare assumed always to be ready)
1- l
l
l
1- m
1- l - m
2) “working system” down, spare
takes over
1) Both working
and spare systems up
• It is also easy to overlook the impact
of “silent failure” of the protection
system.
m
3) Both working
and spare down
2m
Actual system (spare can have “silent failure”)
1- 2 



1- 
1- 
2) “working system” down, spare
takes over
1) Both working
and spare systems up


1-- 
4) Spare fails on
its own (Silent failure), working is ok
E E 681 - Module 21
3) Both working
and spare down
1- 

- see
note *
5) working system repaired,
Silent failure of
spare now known

This can be approximated by the
above iff the spare unit is
routinely “exercised”
the “surprise” transition directly into outage due to silent failure
of the spare
© Wayne D. Grover 2002, 2003
‹#›
Simulation techniques inspired by Markov modelling
Basic idea:
- Represent system in Markov chain form
- Simulate random transitions and observe state probabilities (instead of attempting
symbolic analysis of the model)
- can obtain experimental distributions of frequency / duration data
- can model transition rates that are state dependent
- can drive system through transient episodes on non-stationary statistical behaviour
Limitation:
- strictly only valid model of real systems for negative exponential (i.e., memoryless)
failure and repair models
- reason is that transitions are generated into / out-of states, they are not generated
based on individual equipment items
E E 681 - Module 21
© Wayne D. Grover 2002, 2003
‹#›
Simulation inspired by Markov modelling(2)
Basic simulation engine:
repeat
generate random variable xi for each transition type exiting current state;
t := t+ min{x1, x2, ....Xn}, k <- i : xi is min. (i.e. k indicates transition selected) ;
next_state := case of {current_state, transition k}
record / update any relevant statistical data
-examples: time since last entering a failure state, ....
Until {forever}
• General method for generating r.v.s with
any desired probability density function:
F(T)
t* 
 ln(1  u*)

u* is uniform r.v. on (0,1)
t* is then distributed as:
F (t*)  1  e  t
E E 681 - Module 21
uniform r.v.
u*
For “memoryless” processes:
1.0
1
0
T
f(t)
t*
© Wayne D. Grover 2002, 2003
‹#›
Even more general simulation method....
1
1
1
(f
repair
2
2, r 2 )
0
2
(f 3, r 3 )
element 2 time line
2
repair
2
2
failure
1
1
element 1 time line
element 2
0
1
(f 3 , r 3 )
. ..
element 1
failure
1
(f 2 , r 2 )
(f 1 , r 1 )
1
0
merged event time line
1 0
1 21 0
1
0
current system states
1
...
0
...
simulated time
Method:
- Identify each independent subsystem or component that is subject to failure and repair.
- Specify (in CDF form) the distribution of times between failure and repair times for that element.
- For each element individually simulate a time-line history of its failure and repair.
- Merged the event-sequences for all elements into one time line
- Use the merged time-line to drive the walk through the system state model.
- Accumulate all desired statistics such as:
total outage time, times between system outage, outage times duration, etc.
E E 681 - Module 21
© Wayne D. Grover 2002, 2003
‹#›
Issue (and related project topic)...
Issue is:
• virtually all analytically obtainable results provide only the mean (not distribution)
of outage times duration or times between failure
and...
• simulation based on memorylessness, always implies times between failure
and repair times duration are both negative exponentially distributed
prob
Believable for times between failures
but certainly not accurate pdf shape
for distribution of repair times.
Repair times more like
Time between failure or time to repair
E E 681 - Module 21
© Wayne D. Grover 2002, 2003
Q. (project):
How different is the accurately
simulated distribution of
outage-time durations from the
analytical model ?
Q. How much might this affect SLA
policies ? (Service Level Agreements)
‹#›
```