#### Transcript EE 681 Fall 2000 Lecture 5

E E 681 Module 21 Markov Method of Availability Analysis, APS systems, and Availability Simulation Methods W. D. Grover TRLabs & University of Alberta © Wayne D. Grover 2002, 2003 Markov Techniques for Availability Analysis Markov chain: A system with discrete states in statistical equilibrium and with constant memoryless probabilities of making transition between states... Example: two-state working-failed system p12 = p11 = 1- p22 = 1- 2 1 p21 = E E 681 - Module 21 © Wayne D. Grover 2002, 2003 State 1: working State 2: failed = 1/MTTF = failure rate = 1/MTTR = repair rate ‹#› Markov Techniques for Availability Analysis 1 = prob. of being in state 1 p12 = p22 = 1- p11 = 1- 2 1 p21 = State 1: working State 2: failed = 1/MTTF = failure rate = 1/MTTR = repair rate n = time-step pij = state transition probability Consider that: p11 p12 [1 2] [1 2] 1 p21 p22 0 This means that: 2 p11 p12 p11 p12 [1 2] [1 2] [1 2] 2 p21 p22 1 p21 p22 0 p11 [1 2] lim n p 21 0 lim[ P ] Etc. p12 n [ 1 2] 0 p 22 n Thus .... n “spinning” the transition matrix Once a vector is known, availability is known or computable over non-operating states E E 681 - Module 21 © Wayne D. Grover 2002, 2003 ‹#› Markov Techniques for Availability Analysis A separate computational method is based on observing that the system must reach steady-state probabilities wherein, as every time epoch passes, the state occupancy probabilities remain unchanged, implying that: p11 p12 [1 2]n [1 2]n 1 p21 p22 ...or more generally that : P is reached in the limit. That is, that the state probability vector doesn’t change under repeated multiplication by matrix P. Which can be solved numerically for . E E 681 - Module 21 © Wayne D. Grover 2002, 2003 ‹#› Markov Techniques: Model of a 1+1 APS System A Functional view of 1+1 APS B 1- 2 Markov chain model of 1 + 1 APS 2 1) Both elements working 1-2 2) One element down, One element working 3) Both elements down, 2 Failure state 1- - State transition probability matrix E E 681 - Module 21 P1 1 2 0 1 2 1 2 1 2 0 © Wayne D. Grover 2002, 2003 ‹#› Markov Techniques: Model of 1:N APS Systems Functional view of 1:N APS K1-K2 byte signalling: K1-K2 bytes on spare span control K1 - Head end bridge request (channel number) control . . .Working 1. . ... ... K2- Head end bridge confirm (channel number) . . .Spare. . . . . .Working N. . Head-end bridge: x = normally open _ = normally closed E E 681 - Module 21 © Wayne D. Grover 2002, 2003 Tail-end transfer ‹#› Markov Techniques: Models of APS Systems Availability model view of 1:N APS 1 Ub 1 Uw Ut1 Us= spare system unavailability N working N Uw Ub 2 Ut2 spare E E 681 - Module 21 Uw= working system unavailability Us © Wayne D. Grover 2002, 2003 Ub1= head end bridge unavailability, failure mode 1 Ub2= head end bridge unavailability, failure mode 2 Ut1= tail end transfer unavailability, failure mode 1 Ut2= tail end transfer unavailability, failure mode 2 ‹#› “Cutset” (algebraic) model of APS availability Method: List outage-causing failure combinations from functional inspection / system knowledge: Usys ( N 1) 2 Uw Uw (Ub2 Ut 2 Us ) (Ub1 Ut1) 2 Outage if: (1) two or more working systems fail. (2) If HEB or TET fail at any time in mode 2, or the spare span fails, and there is one or more working system failure. (3) If HEB or TET fail at any time in mode 1 there is outage. E E 681 - Module 21 © Wayne D. Grover 2002, 2003 2 or more working systems 1 working system and spare or HEB / TET active-function HEB / TET through-function by itself: “series element” ‹#› Markov Techniques: Models of APS Systems Outage-causing states Asys 1 Markov chain model of 1:N APS 1) All working and spare systems up (N-1) self-loops not drawn for clarity 2) One working system down, spare ok 1a) spare span failed E E 681 - Module 21 N (N-1) 2a) One working down, standby also down (N-2) 3a) Two working systems and spare all down 2 © Wayne D. Grover 2002, 2003 N+1) N working systems down, one protected by spare 3 i outage states i 3) Two working systems down, one protected by spare 2 (N-2) N+1a) N working systems down, spare down 3 N ‹#› An engineering issue: “exercising the spare” • It is usual to provide redundancy in the form of a “standby” system to enhance availability. Ideal view: (spare assumed always to be ready) 1- l l l 1- m 1- l - m 2) “working system” down, spare takes over 1) Both working and spare systems up • It is also easy to overlook the impact of “silent failure” of the protection system. m 3) Both working and spare down 2m Actual system (spare can have “silent failure”) 1- 2 1- 1- 2) “working system” down, spare takes over 1) Both working and spare systems up 1-- 4) Spare fails on its own (Silent failure), working is ok E E 681 - Module 21 3) Both working and spare down 1- - see note * 5) working system repaired, Silent failure of spare now known This can be approximated by the above iff the spare unit is routinely “exercised” the “surprise” transition directly into outage due to silent failure of the spare © Wayne D. Grover 2002, 2003 ‹#› Simulation techniques inspired by Markov modelling Basic idea: - Represent system in Markov chain form - Simulate random transitions and observe state probabilities (instead of attempting symbolic analysis of the model) Some Advantages: - can obtain experimental distributions of frequency / duration data - can model transition rates that are state dependent - can drive system through transient episodes on non-stationary statistical behaviour Limitation: - strictly only valid model of real systems for negative exponential (i.e., memoryless) failure and repair models - reason is that transitions are generated into / out-of states, they are not generated based on individual equipment items E E 681 - Module 21 © Wayne D. Grover 2002, 2003 ‹#› Simulation inspired by Markov modelling(2) Basic simulation engine: t= 0; assign initial state S(t=0); repeat generate random variable xi for each transition type exiting current state; t := t+ min{x1, x2, ....Xn}, k <- i : xi is min. (i.e. k indicates transition selected) ; next_state := case of {current_state, transition k} record / update any relevant statistical data -examples: time since last entering a failure state, .... Until {forever} • General method for generating r.v.s with any desired probability density function: F(T) t* ln(1 u*) u* is uniform r.v. on (0,1) t* is then distributed as: F (t*) 1 e t E E 681 - Module 21 uniform r.v. u* For “memoryless” processes: 1.0 1 0 T f(t) t* © Wayne D. Grover 2002, 2003 ‹#› Even more general simulation method.... 1 1 1 (f repair 2 2, r 2 ) 0 2 (f 3, r 3 ) element 2 time line 2 repair 2 2 failure 1 1 element 1 time line element 2 0 1 (f 3 , r 3 ) . .. element 1 failure 1 (f 2 , r 2 ) (f 1 , r 1 ) 1 0 merged event time line 1 0 1 21 0 1 0 current system states 1 ... 0 ... simulated time Method: - Identify each independent subsystem or component that is subject to failure and repair. - Specify (in CDF form) the distribution of times between failure and repair times for that element. - For each element individually simulate a time-line history of its failure and repair. - Merged the event-sequences for all elements into one time line - Use the merged time-line to drive the walk through the system state model. - Accumulate all desired statistics such as: total outage time, times between system outage, outage times duration, etc. E E 681 - Module 21 © Wayne D. Grover 2002, 2003 ‹#› Issue (and related project topic)... Issue is: • virtually all analytically obtainable results provide only the mean (not distribution) of outage times duration or times between failure and... • simulation based on memorylessness, always implies times between failure and repair times duration are both negative exponentially distributed prob Believable for times between failures but certainly not accurate pdf shape for distribution of repair times. Repair times more like Time between failure or time to repair E E 681 - Module 21 © Wayne D. Grover 2002, 2003 Q. (project): How different is the accurately simulated distribution of outage-time durations from the analytical model ? Q. How much might this affect SLA policies ? (Service Level Agreements) ‹#›