Lecture 4: State-Based Methods

CS 7040
Trustworthy System Design,
Implementation, and Analysis
Spring 2015, Dr. Rozier
Adapted from slides by WHS at UIUC
Introduction
Availability: A motivation for State-Based Methods
• Recall that availability quantifies the
alternation between proper and improper
service.
– A(t) is 1 if service is proper, 0 otherwise.
– E[A(t)] is the probability that the service is proper
at time t.
– A(0,t) is the fraction of time the system delivers
proper service during [0,t]
Availability
• For many systems, availability is a more “user-oriented” measure than reliability.
– It is often more difficult to compute, since it must
account for repair and/or replacement.
Availability: Example
• A radio transmitter has an expected time to failure of 500
days. Replacement takes an average of 48 hours.
• A cheaper, and less reliable, transmitter has an expected time
to failure of 150 days. Due to cost, a replacement is readily
available and replacement takes 8 hours on average.
• Reliability,
– for the first transmitter MTTF = 500 days
– for the second transmitter MTTF = 150 days
First transmitter is 3.33x more reliable.
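• For the steady-state availability $A = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$ (with 48 hours = 2 days and 8 hours = 1/3 day):
$A = \frac{500}{500 + 2} \approx 0.9960$ for the more reliable transmitter, and
$A = \frac{150}{150 + 1/3} \approx 0.9978$ for the less reliable transmitter.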
Higher reliability doesn’t necessarily mean higher availability!
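The comparison can be checked with a short calculation. This is a minimal sketch that assumes steady-state availability is MTTF / (MTTF + MTTR); the function and variable names are illustrative, not from the lecture.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run fraction of time in proper service: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


HOURS_PER_DAY = 24

# Figures from the transmitter example, converted to hours.
reliable = steady_state_availability(500 * HOURS_PER_DAY, 48)
cheap = steady_state_availability(150 * HOURS_PER_DAY, 8)

print(f"more reliable transmitter: {reliable:.5f}")
print(f"less reliable transmitter: {cheap:.5f}")
print("cheaper transmitter is more available:", cheap > reliable)
```

The shorter replacement time more than compensates for the shorter time to failure.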
Availability Modeling using
Combinatorial Methods
• Availability modeling can be done with combinatorial methods, but only under independent-repair assumptions and exponential lifetime distributions.
• This uses the theory of ON/OFF processes.
– The “On” time distribution is the reliability distribution, obtained using combinatorial methods. We represent its mean as E[On].
– The “Off” time distribution is the repair-time distribution. Its mean is E[Off].
Availability Modeling using
Combinatorial Methods
Availability is the fraction of the time in the On state.
• Asymptotically, if instances of On periods are independent and identically distributed (i.i.d.) and instances of Off periods are i.i.d., then:
$A = \lim_{t \to \infty} A(0,t) = \frac{E[\text{On}]}{E[\text{On}] + E[\text{Off}]}$
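The limiting fraction of On time, E[On] / (E[On] + E[Off]), can be illustrated by simulating an ON/OFF process. This is a rough sketch only: the exponential On and Off durations and the mean values are assumptions chosen for illustration, and the names are hypothetical.

```python
import random


def simulate_on_fraction(mean_on, mean_off, cycles, seed=0):
    """Fraction of time spent On over `cycles` alternating On/Off periods."""
    rng = random.Random(seed)
    total_on = 0.0
    total_off = 0.0
    for _ in range(cycles):
        # expovariate takes a rate, i.e. 1 / mean.
        total_on += rng.expovariate(1.0 / mean_on)
        total_off += rng.expovariate(1.0 / mean_off)
    return total_on / (total_on + total_off)


mean_on, mean_off = 10.0, 2.0          # hypothetical means, same time unit
estimate = simulate_on_fraction(mean_on, mean_off, cycles=200_000)
limit = mean_on / (mean_on + mean_off)

print(f"simulated fraction of On time: {estimate:.4f}")
print(f"E[On] / (E[On] + E[Off]):      {limit:.4f}")
```

With many cycles the simulated fraction settles close to the formula's value, which is what the i.i.d. assumption buys.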
Availability Modeling using
Combinatorial Methods
• Lots of assumptions that may not hold!
State-Based Methods
• More accurate modeling can be achieved with state-based
methods.
– State-based methods relax the independence assumptions needed
for combinatorial modeling!
– Failures need not be independent. Failure of one component may
make another component more or less likely to fail!
– Repairs need not be independent. Repair and replacement strategies
are an important component that must be modeled in high-availability
systems.
– High availability systems may operate in a degraded mode. In a
degraded mode, the system may deliver only a fraction of its services.
The repair process may start only when a system is sufficiently
degraded.
Repairs and Independence
NCSA’s Blue Waters
• 288 cabinets
• 26.4 PB of usable storage
• Let’s say the MTTF of a disk is 3 years.
• 2 TB per disk
• > 27,000 TB of usable storage
• ~ 34,000 TB of actual storage (RAID)
• … 17,000 disks
• Rate of failure for ANY disk: $17{,}000 \times \frac{1}{3 \times 365\ \text{days}} \approx 15.5$ failures per day
NCSA’s Blue Waters
• We are losing 15+ drives every day.
• But loss of a drive isn’t necessarily loss of data, right?
– 17,000 drives in 8+2 means 1,700 RAID groups.
– Chance of RAID failure on day 1 is only 6%.
– 24% on day 2
• Rather than replace every drive immediately when it fails, let’s replace many drives all at once.
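The disk count and the "15+ drives every day" figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes binary prefixes (1 PB = 1024 TB), 8+2 RAID groups, and a constant per-disk failure rate of 1/MTTF; these are assumptions made to match the slide's numbers, not a statement about the real system.

```python
USABLE_PB = 26.4
TB_PER_PB = 1024                  # assumption: binary prefixes
DATA_DISKS, PARITY_DISKS = 8, 2   # 8+2 RAID groups, as on the slide
TB_PER_DISK = 2
DISK_MTTF_DAYS = 3 * 365          # "MTTF of a disk is 3 years"

usable_tb = USABLE_PB * TB_PER_PB
raw_tb = usable_tb * (DATA_DISKS + PARITY_DISKS) / DATA_DISKS
disks = raw_tb / TB_PER_DISK
raid_groups = disks / (DATA_DISKS + PARITY_DISKS)

# With a constant failure rate 1/MTTF per disk, the pooled rate for
# "any disk fails" is the number of disks divided by the MTTF.
failures_per_day = disks / DISK_MTTF_DAYS

print(f"usable storage: {usable_tb:,.0f} TB")
print(f"raw storage:    {raw_tb:,.0f} TB")
print(f"disks:          {disks:,.0f}")
print(f"RAID groups:    {raid_groups:,.0f}")
print(f"disk failures per day: {failures_per_day:.1f}")
```

The exact values come out slightly below the slide's rounded figures (about 17,000 disks and 1,700 groups), but the failure rate stays above 15 per day.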
State-Based Methods
• We use random processes to model these
systems.
• We use “state” to “remember” the conditions
leading to dependencies.
Random Processes
• A random process is a collection of random
variables indexed by time.
– Useful for characterizing the behavior of real
systems.
• Example: X(t) is a random process. Let X(1) be the result of tossing a die. Let X(2) be the result of tossing a die plus X(1), and so on. Notice that the time index set is T = {1, 2, 3, …}.
P[X(2) = 12] = 1/36
P[X(3) = 14|X(1) = 2] = 1/36
E[X(n)] = 3.5n
Etc.
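These probabilities can be verified exactly by building the distribution of the running sum one roll at a time. This is a small sketch assuming a fair six-sided die; the helper names are illustrative.

```python
from fractions import Fraction

# One fair die roll: each face has probability 1/6.
ROLL = {face: Fraction(1, 6) for face in range(1, 7)}


def sum_distribution(n):
    """Return {value: probability} for X(n), the sum of n fair die rolls."""
    dist = {0: Fraction(1)}
    for _ in range(n):
        nxt = {}
        for total, p in dist.items():
            for face, q in ROLL.items():
                nxt[total + face] = nxt.get(total + face, Fraction(0)) + p * q
        dist = nxt
    return dist


def expectation(dist):
    return sum(value * p for value, p in dist.items())


p_x2_is_12 = sum_distribution(2)[12]
# Given X(1) = 2, X(3) = 14 requires the next two rolls to sum to 12.
p_x3_is_14_given_x1_is_2 = sum_distribution(2)[14 - 2]
mean_x5 = expectation(sum_distribution(5))

print(p_x2_is_12, p_x3_is_14_given_x1_is_2, mean_x5)
```

Using `Fraction` keeps the results exact, so they can be compared directly with 1/36 and 3.5n.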
Random Processes
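• If X is a random process, then for each fixed t, X(t) is a random variable.
• A random variable Y is a function that maps elements of the sample space $\Omega$ to real values, $Y: \Omega \to \mathbb{R}$.
• A random process X maps elements in the two-dimensional space $\Omega \times T$ to elements in $\mathbb{R}$.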
Random Processes
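• A sample path of X is the history of sample space values X adopts as a function of time.
• When we fix t, X becomes a function of the outcome alone, $X(\cdot, t): \Omega \to \mathbb{R}$, i.e., a random variable.
• When we fix an outcome $\omega$ in the sample space $\Omega$, X becomes a function of time alone, $X(\omega, \cdot): T \to \mathbb{R}$, i.e., a sample path.
• By fixing a condition (e.g., the system is available) and observing X as a function of T, we see a trajectory of whether the process satisfies that condition or not.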
Describing a Random Process
• Recall: for a random variable X, we can use the cumulative distribution function $F_X(x) = P[X \le x]$ to describe the random variable.
• In general, no such simple description exists
for a random process.
Describing a Random Process
• A random process can often be described succinctly in various ways. For example, say Y is a random variable representing the roll of a die, and X(t) is the sum after t rolls. We can describe X(t) as:
$X(t) - X(t-1) = Y$,
or
$P[X(t) = i \mid X(t-1) = j] = P[Y = i - j]$,
or
$X(t) = \sum_{k=1}^{t} Y_k$, where each $Y_k$ is an independent roll of the die.
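The increment description translates directly into a way of generating one sample path of X(t). A minimal sketch, assuming a fair die and using a seeded generator only so the run is repeatable; the function name is hypothetical.

```python
import random
from itertools import accumulate


def sample_path(steps, seed=None):
    """One realization of X(1), ..., X(steps): running sum of die rolls."""
    rng = random.Random(seed)
    rolls = [rng.randint(1, 6) for _ in range(steps)]   # independent Y values
    return list(accumulate(rolls))                      # X(t) = X(t-1) + Y


path = sample_path(10, seed=1)

# Recover the increments X(t) - X(t-1); each must be a valid die face.
increments = [path[0]] + [b - a for a, b in zip(path, path[1:])]

print("sample path:", path)
print("increments: ", increments)
```

Each call with a different seed gives a different trajectory of the same process.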
Classifying a Random Process:
Characteristics of T
• If the number of time points at which a random process is defined, i.e. |T|, is finite or countable, then the random process is said to be a discrete-time random process.
• If |T| is uncountable, then the random process is said to be a continuous-time random process.
Let X(t) be the number of fault arrivals in a system up to time t. What type of process is X(t)?
Classifying a Random Process:
State Space Type
• Let X be a random process. The state space of a random process is the set of all possible values that the process can take on, i.e. $S = \{y : X(t) = y \text{ for some } t \in T\}$.
If X is a random process that models a system, then the state space of X can represent the set of all possible configurations that the system could be in.
Classifying a Random Process:
State Space Type
• If the state space, S, of a random process, X, is
finite or countable, then X is said to be a
discrete-state random process.
• If the state space S of a random process X is uncountable, then X is said to be a continuous-state random process.
Classifying a Random Process:
State Space Type
• Let X be a random process that represents the
number of bad packets received over a
network.
– Classify the state space.
Classifying a Random Process:
State Space Type
• Let X be a random process that represents the
voltage on a telephone line.
– Classify the state space.
Classifying a Random Process:
State Space Type
We will concern ourselves primarily with
discrete-state processes.
Random Process Classification
Examples
Markov Process
• A special type of random process that we will
examine is called a Markov process. A Markov
process can be defined, informally, as follows:
Given the state of a Markov process X at time t,
the future behavior of X can be described
completely in terms of X(t).
Markov processes have the very useful property that, given the present state, their future behavior is independent of past values.
Markov Chains
• A Markov chain is a Markov process with a discrete state space.
We will always make the assumption that a Markov chain has a state space in {1, 2, …} and that it is time-homogeneous.
A Markov chain is time-homogeneous if its future behavior does not depend on what time it is, only on the current state.
We formalize this property by looking at a discrete-time Markov chain (DTMC). A DTMC X has the following property:
$P[X(t+k) = j \mid X(t) = i, X(t-1) = n_{t-1}, \ldots, X(0) = n_0] = P[X(t+k) = j \mid X(t) = i] = P_{ij}^{(k)}$
DTMCs
• Given i, j, and k, $P_{ij}^{(k)}$ is a number.
• $P_{ij}^{(k)}$ can be interpreted as the probability that, if X has value i, then after k time-steps X will have value j.
• Frequently we write $P_{ij}$ to mean $P_{ij}^{(1)}$.
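For a time-homogeneous chain, the k-step probabilities follow from the one-step probabilities by conditioning on the state after k-1 steps: the k-step probability from i to j is the sum over m of the (k-1)-step probability from i to m times the one-step probability from m to j. The sketch below applies that recursion to a hypothetical two-state processor chain; the numeric probabilities are made up for illustration, and states are indexed from 0 to match Python lists.

```python
# Hypothetical one-step probabilities: state 0 = working, state 1 = failed.
ONE_STEP = [
    [0.9, 0.1],   # from working: stay working, fail
    [0.5, 0.5],   # from failed: get repaired, stay failed
]


def k_step(one_step, k):
    """Return the table of k-step transition probabilities."""
    n = len(one_step)
    # k = 0: the chain has not moved, so each state maps to itself.
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        # Condition on the intermediate state m reached one step earlier.
        result = [
            [sum(result[i][m] * one_step[m][j] for m in range(n))
             for j in range(n)]
            for i in range(n)
        ]
    return result


two_step = k_step(ONE_STEP, 2)
print("P(working -> failed in 2 steps):", two_step[0][1])
```

Each row of the result still sums to 1, since from any state the chain must be somewhere after k steps.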
DTMC
• Suppose we have a processor; it can be working or failed at some time T.
[Figure: state-transition diagram with states 0 and 1]
DTMC
• Suppose we have two processors, which can both be working or failed, independently, at time T.
[Figure: state-transition diagram with states 0, 1, and 2]
State Occupancy Probability Vector
• Let $\pi$ be a row vector. We denote $\pi_i$ to be the i-th element of the vector. If $\pi(k)$ is a state occupancy probability vector, then $\pi_i(k)$ is the probability that a DTMC is in state i at time-step k.
• Assume that a DTMC X has a state space of size n, i.e. S = {1, 2, …, n}. We say formally:
$\pi_i(k) = P[X(k) = i]$
Note that $\sum_{i=1}^{n} \pi_i(k) = 1$ for all k.
State Occupancy Probability Vector
Computing a single step forward in time.
Given an initial $\pi(0)$, which is the initial probability vector (where we are likely to be at t=0), and $P_{ij}$ for all i, j, how do we compute $\pi(1)$?
Recall the definition of $P_{ij}$: $P_{ij} = P[X(k+1) = j \mid X(k) = i]$
State Occupancy Probability Vector
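Since $P_{ij} = P[X(1) = j \mid X(0) = i]$, the law of total probability gives
$\pi_j(1) = P[X(1) = j] = \sum_{i=1}^{n} P[X(1) = j \mid X(0) = i]\, P[X(0) = i] = \sum_{i=1}^{n} \pi_i(0) P_{ij}$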
State Occupancy Probability Vector
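• We have $\pi_j(1) = \sum_{i=1}^{n} \pi_i(0) P_{ij}$, which holds for all j.
• This resembles vector-matrix multiplication. If we arrange the $P_{ij}$ into the matrix $P = [P_{ij}]$ (row i, column j), then $\pi(1) = \pi(0) P$.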
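The single-step update is a row vector times a matrix, with element j of the new vector equal to the sum over i of pi_i times P_ij. A minimal sketch in plain Python follows; the two-state transition matrix and the initial vector are hypothetical values, and states are indexed from 0.

```python
def step(pi, P):
    """One time-step forward: new_pi[j] = sum_i pi[i] * P[i][j]."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical chain: state 0 = working, state 1 = failed.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
pi0 = [1.0, 0.0]        # assumed start: working with probability 1

pi1 = step(pi0, P)      # occupancy probabilities after one step
pi2 = step(pi1, P)      # repeating the update moves one more step forward

print("pi(1) =", pi1)
print("pi(2) =", pi2)
```

Applying `step` repeatedly gives the occupancy vector at any later time-step, and each result still sums to 1.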
State Occupancy Probability Vector
• The important consequence of this is that we can easily specify a DTMC in terms of an initial state occupancy probability vector $\pi(0)$ and a transition probability matrix P.
Transient Behavior of DTMCs
A Simple Example
• Suppose the weather in Cincinnati can be
modeled the following way:
Simple Example, cont.
Simple Example, Solution
Solution, cont.
Graphical Representation
Simple Computer Example
Limiting Behavior of DTMCs
For next time
• Apply for a Mobius account.
• Use your uc.edu e-mail address, and specifically mention this class and instructor.
