Product Reliability
Chris Nabavi BSc SMIEEE
© 2006 PCE Systems Ltd
Reliability
Reliability is the probability that a piece of equipment
will operate for some determined period of time
under the working conditions for which it was designed
The “Bathtub” Curve
[Figure: the bathtub curve of failures per hour against time, showing the infant mortality, operational life, and end-of-life phases]
Operational Strategy
1. Run the equipment without traffic until the
infant mortality period has passed (the burn-in period)
2. Use the equipment during the operational life period
3. Retire or replace the equipment before the
end-of-life period
Failure Rate
This is a statistical measure, applicable to a
large number of samples
The failure rate, λ, is the number of failures per
unit time, divided by the number of items in the test
λ is constant during the operational phase
λ is often expressed in % / 1000 hours
or FITs (failures in 10^9 hours)
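As an illustration, the conversions between these units can be sketched in Python; the rate used here is just an example figure, not from the slides:

```python
# Convert a failure rate between the common units:
# failures per hour, % per 1000 hours, and FITs (failures per 10^9 hours).

def per_hour_to_percent_per_1000h(lam):
    """lam: failure rate in failures per hour."""
    return lam * 1000 * 100          # fraction/hour -> % per 1000 hours

def per_hour_to_fits(lam):
    return lam * 1e9                 # fraction/hour -> failures per 10^9 hours

lam = 5.56e-6                        # example rate: 5.56e-6 failures per hour
print(per_hour_to_percent_per_1000h(lam))  # 0.556 (% per 1000 hours)
print(per_hour_to_fits(lam))               # 5560.0 FITs
```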
Mean Time Between Failures
(MTBF)
This is a statistical measure, applicable to a
large number of samples
The MTBF is the average time between
failures, multiplied by the number of items in the test
MTBF = 1 / failure rate, i.e. MTBF = 1/λ
Measured MTBF and Failure Rate
A manufacturer tests 3000 light bulbs for 300
hours and observes 5 failures
Note: we don’t know the average time between failures
from this test, because they have not all failed! But
approximately:
MTBF = 3000 x 300 / 5 = 180,000 hours
Failure rate = 0.556 % per 1000 hours
This measured MTBF is an under-estimate
of the true MTBF
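The same arithmetic as a short Python sketch:

```python
# Reproduce the light bulb test arithmetic from the slide.
n_items = 3000        # bulbs on test
t_hours = 300         # test duration in hours
failures = 5

mtbf = n_items * t_hours / failures            # total device-hours / failures
rate_per_hour = failures / (n_items * t_hours)
rate_pct_per_1000h = rate_per_hour * 1000 * 100

print(f"MTBF = {mtbf:,.0f} hours")                                  # 180,000
print(f"Failure rate = {rate_pct_per_1000h:.3f} % per 1000 hours")  # 0.556
```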
MTBF and End of Life
MTBF is a measure of quality and has nothing to
do with the expected lifetime
To visualise this, think of a candle. After three
hours, the wax will all be used up and it will have
reached its end of life. This is its expected lifetime.
However, a quality candle (higher MTBF) will be
less likely to fizzle out halfway down. If we light a
new candle just as each old one runs out of wax,
the mean time between being unexpectedly
plunged into darkness is the MTBF
Failure Rate (Graphical Representation)
[Figure: failures per hour against time; the failure rate is the size of the gap indicated on the curve]
Example: Typical Hard Disc
Rated or expected life = 5 years
Guaranteed life = 3 years
MTBF = 1,000,000 hours (approx. 114 years)
Modern hard discs are fairly reliable, but being
mechanical, they wear out after a few years
Disc Replacement Strategy
Observation: The expected life is much less
than the MTBF and discs are the “weak link” in
the system
Conclusion: Replace the discs just before they
wear out under a preventative maintenance
program
Example MTBF Figures
ITEM                            MTBF
N. American Power Utility       2 months!
Router                          10 years
Uninterruptible Power Supply    11 years
File Server                     14 years
Ethernet Hub                    120 years
Transistor                      30,000 years
Resistor                        100,000 years
Operational Life Phase
Reliability theory only works in the operational life
phase, where the failure rates are constant
With this proviso, the maths is well established
and closely related to statistics
There is a large amount of statistical theory
concerned with sampling procedures, aimed at
estimating the MTBF of components
From now on, we are only concerned with the
operational phase
Probability of Survival
The probability of survival to time t is p = e^(-λt)
[Figure: exponential survival curve; the probability of survival falls from 1 at t = 0 to 0.37 (= 1/e) at t = MTBF = 1/λ]
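A minimal sketch of the survival formula, reusing the light bulb MTBF from earlier as the example figure:

```python
import math

def survival(t_hours, mtbf_hours):
    """Probability of survival p = e^(-λt), with λ = 1/MTBF."""
    return math.exp(-t_hours / mtbf_hours)

mtbf = 180_000                 # the light bulb MTBF from the earlier slide
print(survival(mtbf, mtbf))    # 0.3678...: 37% survive to t = MTBF
print(survival(1000, mtbf))    # ~0.9945: survival to 1000 hours
```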
Non-Redundant System Reliability
For a system S, made up of components A, B,
C, etc.
1 / MTBF_S = 1 / MTBF_A + 1 / MTBF_B + 1 / MTBF_C + etc.
or λ_S = λ_A + λ_B + λ_C + etc.
These formulae are used to calculate the MTBF or
failure rate of equipment, using published tables
covering everything from a soldered joint to a disc
sub-system
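A short sketch of the series-system formula; the component MTBFs here are illustrative values, not figures from the slides:

```python
# Series (non-redundant) system: failure rates add, so the system MTBF
# is the reciprocal of the sum of the component failure rates.

def system_mtbf(component_mtbfs):
    return 1.0 / sum(1.0 / m for m in component_mtbfs)

# Hypothetical component MTBFs in hours:
print(system_mtbf([1e6, 5e5, 2e6]))   # ~285,714 hours
```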
Mean Time To Repair (MTTR)
The formulae discussed earlier assume zero
maintenance, i.e. if a device breaks down, it is not
fixed.
Often, it is important to know the probability of fixing a
broken system within a given time T
For this we need to know the MTTR, which is worked
out by examining all the steps involved in a repair and
the failure modes
The probability of fixing the broken system within time
T can then be predicted using exponential formulae
similar to those already seen
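As an illustration, assuming exponentially distributed repair times (the slide only says the formulae are similar exponentials to those already seen), the repair probability can be sketched like this:

```python
import math

# Assumed model: exponentially distributed repair times, so the
# probability of completing a repair within time T is 1 - e^(-T/MTTR).

def p_repaired_within(T_hours, mttr_hours):
    return 1.0 - math.exp(-T_hours / mttr_hours)

print(p_repaired_within(3, 3))    # ~0.63: repair within one MTTR
print(p_repaired_within(9, 3))    # ~0.95: repair within three MTTRs
```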
Operational Readiness
Operational readiness is the probability that a
system will be ready to fulfil its function when
called upon
E.g. The probability that an email sent at a
random time will get through
Operational readiness = MTBF / (MTBF + MTTR)
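A small sketch of the readiness formula, using the hard disc MTBF quoted earlier and an assumed 3-hour MTTR:

```python
# Operational readiness (availability) = MTBF / (MTBF + MTTR).
def readiness(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative figures: 1,000,000 hour MTBF, assumed 3 hour MTTR.
r = readiness(1_000_000, 3)
print(f"{r:.6f}")                                    # 0.999997
print(f"downtime: {(1 - r) * 8766:.3f} hours/year")  # ~0.026 hours/year
```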
Active Redundancy
Either device can do the job
[Diagram: Device 1 and Device 2 connected in parallel]
Active Redundancy Calculations
MTBF_S = 1/λ_1 + 1/λ_2 - 1/(λ_1 + λ_2)
Probability of survival = e^(-λ_1 t) + e^(-λ_2 t) - e^(-λ_1 t) . e^(-λ_2 t)
For an active redundancy system S, made from two identical sub-systems A:
MTBF_S = 1.5 x MTBF_A
Note: The failure rate is no longer constant with time
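A Monte Carlo sketch of the 1.5 x MTBF_A result: the system survives until the later of the two failures, so averaging max(t1, t2) over many trials should approach 1.5 x MTBF_A:

```python
import random

# Monte Carlo check of MTBF_S = 1.5 x MTBF_A for two identical devices
# in active redundancy: the system fails when the LAST device fails.
MTBF_A = 1000.0
N = 200_000

total = 0.0
for _ in range(N):
    t1 = random.expovariate(1 / MTBF_A)   # lifetime of device 1
    t2 = random.expovariate(1 / MTBF_A)   # lifetime of device 2
    total += max(t1, t2)                  # system lives until both fail

print(total / N)   # ~1500, i.e. 1.5 x MTBF_A
```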
Passive Redundancy
When device 1 fails, switch over to device 2
[Diagram: Device 1 with a switch-over to Device 2; Device 2 is not normally powered]
Passive Redundancy Calculations
For a passive redundancy system S, made from two
identical sub-systems A, and ignoring the reliability of the
switch-over system:
Probability of survival = e^(-λ_A t) x (1 + λ_A t)
MTBF_S = 2 x MTBF_A
Note: The failure rate is no longer constant with time
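Since the MTBF is the area under the survival curve, the 2 x MTBF_A result can be checked numerically; a sketch with an assumed MTBF_A of 1000 hours:

```python
import math

# Numeric check that integrating the survival curve e^(-λt)(1 + λt)
# gives MTBF_S = 2 x MTBF_A.
lam = 1.0 / 1000.0                 # λ_A for an assumed MTBF_A of 1000 hours
dt = 1.0
mtbf_s = sum(math.exp(-lam * t) * (1 + lam * t) * dt
             for t in range(0, 20_000))   # integrate out to 20 x MTBF_A

print(mtbf_s)   # ~2000, i.e. 2 x MTBF_A
```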
Error Detection and Correction
There are two trivially simple ways to guard
against errors
Send the information twice: Then at the
receiver, if they are different, we have detected
an error
Send the information three times: Then at the
receiver, accept the majority verdict to correct
an error
But we can do better than this...
The Hamming (7,4) code
0 1 0 0 1 0 1
0 0 1 0 1 1 0
1 0 0 0 0 1 1
0 0 0 1 1 1 1
Pink (on the original slide) are parity check bits; green
are information bits. In this matrix, the first four columns
carry the information bits and the last three the parity checks
16 codes can be obtained
by adding any rows mod 2
All 16 codes have Hamming distance of 3 or more, so
the code can correct a single error
The Golay (23,12) code has 12 information bits and 11
parity check bits and can correct 3 errors
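A sketch that checks the generator-row construction above: build all 16 codewords by adding subsets of the four rows mod 2, and confirm the minimum Hamming distance is 3:

```python
from itertools import combinations, product

# The four generator rows of the Hamming (7,4) code from the slide.
rows = [
    (0, 1, 0, 0, 1, 0, 1),
    (0, 0, 1, 0, 1, 1, 0),
    (1, 0, 0, 0, 0, 1, 1),
    (0, 0, 0, 1, 1, 1, 1),
]

def add_mod2(a, b):
    return tuple((x + y) % 2 for x, y in zip(a, b))

codewords = []
for bits in product([0, 1], repeat=4):    # every subset of the rows
    cw = (0,) * 7
    for bit, row in zip(bits, rows):
        if bit:
            cw = add_mod2(cw, row)
    codewords.append(cw)

min_dist = min(sum(x != y for x, y in zip(a, b))
               for a, b in combinations(codewords, 2))
print(len(codewords), "codewords, minimum Hamming distance", min_dist)  # 16, 3
```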
Redundant Array of Independent Discs
There are 5 RAID levels:
1. Mirrored discs
2. Hamming code error correction
3. Single check disc per group
4. Independent read and write
5. Spread data and parity over all discs
In each case, disc errors are corrected; the differences
are largely in the system performance
Effect of Non-Maintenance
Consider a file server with 6 discs in an array
The probability of getting a disc failure in a 6-disc
RAID array in one year is about 5%, which is fairly
high. Assume that this happens and the problem is
left unfixed for a further 3 weeks
The probability of another disc failing in this time is
about 0.25%. If this happens too, you lose the server!
The odds of this happening in any year are 5% of
0.25%, or 1 in 8000, divided by the number of RAIDs
Effect of Improving the MTTR
If the MTTR of the RAID had been 3 hours instead of
3 weeks, the odds are somewhat different:
The probability of the first failure is still 5%. But now
the probability of getting a second failure in the
ensuing 3 hours is 0.0003% instead of 0.25%
So the odds of losing the RAID in any year improve
to 1 in 6,666,667 from the previous 1 in 8000
Moral of the story: Fix the First Fault Fast
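The odds arithmetic from the last two slides as a sketch, with the per-incident probabilities taken straight from the slides:

```python
# Probability of losing the RAID in a year = p(first disc failure this
# year) x p(second disc failure before the first is repaired).
p_first_per_year = 0.05       # a disc fails in the array this year (~5%)
p_second_3weeks = 0.0025      # another disc fails in a 3-week repair window
p_second_3hours = 0.000003    # another disc fails in a 3-hour repair window

for p_second, label in [(p_second_3weeks, "3 weeks"),
                        (p_second_3hours, "3 hours")]:
    p_loss = p_first_per_year * p_second
    print(f"MTTR {label}: 1 in {1 / p_loss:,.0f} per year")
# MTTR 3 weeks: 1 in 8,000 per year
# MTTR 3 hours: 1 in 6,666,667 per year
```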