Statistics

Transcript Statistics

AN INTRODUCTION TO
STATISTICAL ANALYSIS
OF SIMULATION OUTPUTS
Sample Statistics

If x1, x2, …, xn are n observations of the value
of an unknown quantity X, they constitute a
sample of size n for the population on which
X is defined.
n

Sample mean
1
x   xi
n i 1
n

Sample variance
1
2
s 
( xi  x )

n  1 i 1
2
Central Limit Theorem

If the n mutually independent random
variables x1, x2, …, xn have the same
distribution, and if their mean m and their
variance s2 exist then …
Central Limit Theorem

The random variable
n
1
xi  m

n i 1
s
n
is distributed according to the standard
normal distribution (zero mean and unit
variance).
Estimating a mean


Assume that we have a sample x1, x2, …, xn
consisting of n independent * observations
of a given population
The sample mean xbar is an unbiased
estimator of the mean m of the population
* This is the critical assumption
Without it, we cannot apply the formula
Large populations

For large values of n, the (1-)% confidence
interval for m is given by

z / 2s
z / 2s 
,x 
x 

n
n 


with F ( z / 2 )  1 

2
for the standard normal distribution
Explanations


1 –  expressed in percent is the level of the
confidence interval
 90% means  = 0.10
 95% means  = 0.05
 99% means  = 0.01
 is the error probability
 Probability that the true mean falls outside
the confidence interval
95% Confidence Intervals


For =.05, z/2 =1.96
Example:
= 35, s = 4 and n = 100

If

The 95% confidence interval for
x
m is
35 ± 1.96x4/10 = 35 ± 7.84
When s is not known

We can replace s in the preceding formula by
the standard-deviation s of the sample


When n < 30, we must read the value of
z/2 from a table of Student's t-distribution
with n - 1 degrees of freedom
When n  30, we can use the standard
values
Confidence Intervals in CSIM

CSIM can automatically compute confidence
intervals for the mean value of any table,
qtable and so on.
 For everything but boxes
 xyx->confidence();
 For the elapsed times in a box
 bd->time_confidence();
 For the population of a box
 bd->_number_confidence();
Confidence Intervals in CSIM

We get 90, 95 and 98 percent confidence
intervals
 Computed using batch means method
See next section
But only for the mean


Estimating a proportion

A proportion p represents the probability
P(X) for some fixed threshold 
 97% of our customers have to wait less
than one minute

Confidence intervals for proportions are much
easier to compute than confidence intervals
for quantiles
Assumptions



Assume that we have n independent
observations x1, x2, …, xn of a given
population variable X and that this variable
has a continuous distribution
Let p represent the proportion we want to
estimate, say P(X)
Let k represent the number of observations
that are .
Basic property

If p represent the proportion we want to
estimate, say P(X), and k represent the
number of observations that are 
 The rv k is distributed according to a
binomial distribution with mean np and
variance p(1 – p)
 The rv k/n is distributed according to a
binomial distribution with mean p and
variance p(1 – p)/n
The formula

When n > 29, we can use the Wilson’s
interval

z2 / 2
z2 / 2
qˆ (1  qˆ ) z2 / 2
qˆ (1  qˆ ) z2 / 2 
 qˆ 

 z / 2

qˆ 
 z / 2

2
2
2n
n
2n
n
4n  q 
4n   1  
P


z2 / 2
z2 / 2
1
1


n
n


where z/2 = 1.96 for a 95% C.I.
Advantages


Produces tighter confidence intervals then the
Central Limit Theorem
Works when q̂ is equal to zero
2

z / 2 
P 0  q 
1
2 
n  z / 2 

Example

Assume we have 400,000 independent
observations without noting any failure
 The 95% confidence interval for the
probability that the system could fail is
2
1.96
5
(0,
)  (0,10 )
2
400,000  1.96
AUTOCORRELATION
The problem



All statistical analysis techniques we have
discussed assume that sample values are
mutually independent
Generally false for quantities such as
 Waiting times, response times, …
Tend instead to be autocorrelated

When the waiting lines are long, everybody
wait a long time
Traditional solution


Keep the measurements sufficiently apart
 Sample them every T minutes apart
 Standard solution for collecting
observations on a running system
Not practical
 Would require very long simulations
Three good solutions



Batch means
Regenerative method
Time series analysis
Batch means



We group consecutive observations into
“batches”
We compute the means of these batches
We observe that autocorrelation among batch
means decreases with size of batches
 When size increases, each batch includes
more observations that are far apart
Example



We collected the following values:
 4, 3, 3, 4, 5, 5, 3, 2, 2, 3
We group them into two batches of five
observations:
 4, 3, 3, 4, 5 and 5, 3, 2, 2, 3
The batch means are:
 3.8 and 3
Batch means in CSIM

CSIM uses fixed-size batches
 To compute confidence intervals
 To control the duration of a simulation
(run-length control)
Regenerative method


Most systems with queues go through states
that return it to an state identical to its
original state
 The system regenerates itself
Examples:
 Whenever a disk array is brought back to
its original state
 Whenever a camper rental agency has all
its campers available
Key idea

Define a regeneration interval as an
interval between two consecutive
regeneration points:
 Observations collected during the same
regeneration interval can be correlated
 Observations collected during different
regeneration intervals are
independent
Application



We group together observations that occur
within the same regeneration interval
We compute the means of these groups of
observations
These group means are independent from
each other
Limitations of the approach

Not general:

System must go through regeneration
points


System must be idle
Leads to complex computations

We rarely have exactly the same number of
observations in two different regenerations
intervals
Time series analysis



Treats consecutive observations as elements
of a time series
Estimates autocorrelation among the
elements of a time series
Includes this autocorrelation in the
computation of all confidence intervals
RUN LENGTH CONTROL
Objective


Accuracy of confidence intervals increases
with duration of simulation
 The 1/n factor
We would like to be able to stop the
simulation once a given accuracy level has
been reached for the confidence interval of a
specific measurement
The CSIM solution (I)

Specify
 Quantity of interest
 Mean value from a table, a qtable, …
 A relative accuracy
 Maximum relative error
 A confidence level (say, 0.90 to 0.99)
 A maximum simulation duration
 In seconds of CPU time
The CSIM solution (II)





void table::run_length(double accuracy,
double conf_level, double max_time)
void qtable::run_length(double accuracy,
double conf_level, double max_time)
void meter::run_length(double accuracy,
double conf_level, double max_time)
void box::time_run_length(double accuracy,
double conf_level, double max_time)
void box::number_run_length(double accuracy,
double conf_level, double max_time)
The CSIM solution (III)

Example:
 thistable->run_length(.01, .95, 500)
 Specifies than we want an error less than 1
percent for 95% confidence interval of mean
of thistable
 Stops simulation after 500 seconds

thisbox->:time_run_length(.01, .99, 500)

Same for time average of a box
The CSIM solution (IV)

Replace termination test by
 converged.wait();
Example (I)



The campers
We want
 A maximum error of 1 percent (0.01)
 For the 95 percent confidence interval
 Of average number of rented campers
 A maximum simulation time of 100 s
Will use number aspect of agency box
Example (II)

Add
 agency->number_confidence()
after activation of box agency in csim
process
Example (III)


Put main loop
 while(simtime() < DURATION) {
hold(exponential(MIART);
customer();
}
in a separate arrivals process
Make it an infinite loop
Example (IV)


Add
 converged.wait();
after call to arrivals process
 arrivals():
converged.wait();
Best way to let sim process generate
customers and wait for termination
in parallel
Example (V)

The new arrivals process
void arrivals() {
process(“arrivals”); // REQUIRED
for(;;) { // forever loop
hold(exponential(MIART));
customer();
} // forever
} // arrivals
Warnings

Confidence intervals do not take into account
model inaccuracies


While the batch means method eliminates
most effects of measurement autocorrelation,
it is not always 100% effective
The max_time parameter of the
run_length() will not necessarily stop the
simulation just after the specified CPU time

Like the emergency brake of a train
Objective


Partition processes into different classes
 Low priority
 High priority
Obtain separate statistics for each process
priority class
Declaring and initializing
process classes


To declare a dynamic process class:
 process_class *c;
To initialize a process class before it can be
used in any other statement.
 c = new process_class("low priority")
To assign a priority class


Add inside the process
 c->set_process_class();
Processes that have not been assigned a
process class belong to the “default”
process class
Reporting

Can use
 report()
 report_classes()
Other options




Can change the name of a process class:
 c->set_name("high priority");
Can reset statistics associated with a process
 c ->reset();
Can do the same for all process classes:
 reset_process_classes();
Can delete a dynamic process class:
 delete c;
RANDOM NUMBERS
Background

We need to distinguish between
 Truly random numbers
 Obtained through observations of a
physical random process
 Rolling dices, roulette
 Atomic decay
 Pseudo-random numbers
 Obtained through arithmetic operations
Example

Linear Congruential Generators (LCG)
 Easy to implement and fast
 Defined by the recurrence relation:
rn+1 = (a rn + c ) mod m
where
r1, r2, … are the random values
 m is the "modulus“
 r0 is the seed

Two realizations


GCC family of compilers
32
 m = 2
a = 69069
c=5
Microsoft Visual/Quick C/C++
32
 m = 2
a = 214013 c = 2531011
Problems with pseudorandom
number generators




Much shorter periods for some seed states
Lack of uniformity of distribution
Correlation of successive values
…
Better RNGs


Use the Mersenne twister
19937 - 1
 Period is 2
Blum-Blum-Schub
A quote

"Any one who considers arithmetical
methods of producing random digits is,
of course, in a state of sin. For, as has
been pointed out several times, there is no
such thing as a random number– there are
only methods to produce random numbers,
and a strict arithmetic procedure of course is
not such a method.“
John von Neumann
CSIM RNGs


BY default, CSIM uses a single stream of
random numbers
Can reset the seed using
 void reseed(stream *s, long n)
as in
 reseed(NIL, 13579)
Continuous distributions
supported by CSIM (I)






double uniform(double min, double max)
double triangular(double min, double max,
double mode)
double beta(double min, double max, double
shape1, double shape2)
double exponential(double mean)
double gamma(double mean, double stddev)
double erlang(double mean, double var)
Continuous distributions
supported by CSIM (II)






double hyperx(double mean, double var)
double weibull(double shape, double scale)
double normal(double mean, double stddev)
double lognormal(double mean, double
stddev)
double cauchy(double alpha, double beta)
double hypoexponential(double mn, double
var)
Continuous distributions
supported by CSIM (III)



double pareto(double a)
double zipf(long n)
double zipf_sum(long n, double *sum)
Discrete distributions
supported by CSIM






long uniform_int(long min, long max)
long bernoulli(double prob_success)
long binomial(double prob_success, long
num_trials)
long geometric(double prob_success)
long negative_binomial(long success_num,
double prob_success)
long poisson(double mean)
Empirical distributions
supported by CSIM

???
Using multiple streams


In campers example, sequence of RNs used
to generate arrivals is affected by the
numbers of campers
 If agency has less campers
 More customers will be lost
 Lost customers do no generate any RN
Better to have separate random number
streams for arrivals and service times
Declaring and initialing
random number streams


Can use:
 stream *s;
s = new stream();
By default, streams are created with seeds
that are spaced 100,000 values apart
Reseeding a stream

Use:
 s->reseed(24680);
where the new seed is a long integer
Other functions


Inspect the current state of a stream
 i = s->state();
Delete a stream
 delete s;
Using a specific seed

Prefix RNG function with name of seed:
 s->uniform (3.0, 7.0)
A CASE STUDY
RAID array revisited


Reliability of a RAID array
Reliability R(t) of a system is the probability
that will remain operational over a time
interval [0. t ] given that it was operational at
time t = 0
 Not the same as availability
 Our focus is evaluating the risk of a data
loss during array lifetime
Executing multiple runs


Multiple runs provide statistically independent
repetitions of original simulation
 Useful for
 Collecting more accurate results
 Constructing confidence intervals
Use rerun() function within a loop
 create("sim") call must be inside that loop
Overview of sim function
extern "C" void sim() { // sim process
runcount = 0;
while(runcount < NRUNS) {
create("sim"); // make it a process
// usual contents of sim process
rerun();
runcount++;
} // while
report_hdr(); // produce statistics report
} // sim
Global includes and defines
#include <cpp.h> // CSIM C++ header file
#define NDISKS 5 // number of disks in array
#define NYEARS 5 // useful lifetime of array
#define MTTF 300000.0 // disk MTTF
#define MTTR 24.0 // disk MTTR
#define NRUNS 100000 // no of runs
void disk(int i);
Global declarations
int nfailed; // number of failed disks
int runcount = 0; // number of runs
int ndatalosses = 0; //counter
double lifetime = NYEARS * 365 * 24;
// simulation duration
Sim function (I)
extern "C" void sim() { // sim process
int i;
runcount = 0;
while(runcount < NRUNS) {
create("sim"); // make this a process
dataloss = new event("dataloss");
dataloss->clear();
nfailed = 0;
Sim function (II)
// create NDISKS disk processes
for (i=0; i < NDISKS; i++){
disk(i);
} // for
dataloss->timed_wait(lifetime);
if (simtime() < lifetime) {
ndatalosses++;
} // if
rerun();
runcount++;
} // while
Sim function (III)
report_hdr(); // produce statistics report
printf("Array lifetime %fd years",
NYEARS);
printf(“or %f hours\n", lifetime);
printf("Sim time %f\n", simtime());
printf("Disk MTTF %f hours\n", MTTF);
printf("Disk MTTR %ff hours\n", MTTR);
printf(“Completed runs %d\n", runcount);
printf(“Data lossese%d\n", ndatalosses);
} // sim
Disk process (I)
void disk(int i) {
create("disk");
while(simtime() < lifetime || dataloss) {
hold(exponential(MTTF));
// disk failed
nfailed++;
Disk process (II)
if (nfailed == 2) {
dataloss = 1;
failtime = simtime();
finish->set();
terminate();
} // if
hold(MTTR); // repair process
nfailed--; // disk is replaced
} // while
} // disk
Simulation Outcome
C++/CSIM Simulation Report (Version 19.0 for Linux x86)
Tue Apr 22 10:13:14 2008
Ending simulation time:
Elapsed simulation time:
CPU time used (seconds):
Array lifetime 5 years
which corresponds to
43800 hours
Simulated time
0
Disk MTTF
300000 hours
Disk MTTR
24 hours
Number of runs completed 100000
Number in runs ending in data loss 17
0.000
0.000
0.000
Discussion



Whole program took less than 98 seconds on
linux03 server
Simulated time should be equal to
43,800 hours, not zero.
 rerun() artifact?
We observe 17 failures out of 100,000 runs
 Data survival rate is 99.983 percent
 Three nines
Confidence intervals



Data loss rate and survival rate are
distributed according to a binomial law
Since n = 100,000, the distributions of both
proportions are approximately normal
Will use

z2 / 2
z2 / 2
qˆ (1  qˆ ) z2 / 2
qˆ (1  qˆ ) z2 / 2 
 qˆ 

 z / 2

qˆ 
 z / 2

2
2
2n
n
2n
n
4n  q 
4n   1  
P


z2 / 2
z2 / 2
1
1


n
n


CI for data survival rate s


ŝ =99.983% or 0.99983
95% CI is
0.999849 ± 0.000083
[0.999766, 0.999932]
between three and four nines
What are nines?



A measure of system reliability/availability
 99% is two nines
 99.9% is three nines
More formally, we compute
-log10 (1.0 – x)
Our confidence interval could then be
expressed as
[3.63 nines, 4.17 nines]
CI for data loss rate
0.0000189 ± 0.000088
[0.0000106, 0.000273]
Analytical approach

Will use a Markov model:
 Works since failure rates and repair rates
are distributed according to exponential
laws
 Obviously not true for repair rates
The model
5
1
0
m


4
Data
Loss
State label indicates number of failed drives
Failure state is an absorbing state
Using ergodic hypothesis
5
0
1
m
4


Assume that array is returned to original state
after each data loss
Failure rate will be 4p1
Model equations



5λp0=(μ+ 4λ)p1
po + p1 = 1
Solution is
4  m
5
p0 
, p1 
9  m
9  m

Failure rate is
20
L  4p1 
9  m
2
Computing five year survival


Decay is
e  Lt
202t
 exp( 
)
9  m
For MTTF = 300,000 hours,
MTTR = 24 hours and t = 5 years
 Data survival rate is 0.999767 or
3.632 nines
 Barely inside the C.I.
Why?



Markov model assumed repair times were
exponentially distributed
 Required by the technique
Simulation model assumed deterministic
repair times
 Somewhat more realistic
Difference illustrates a major limitation of
stochastic approach
Checking the explanation (I)

We repeat the simulation using repair times
that are exponentially distributed
 We observe 19 data losses out of 100,000
runs
 qhat is 0.99981
 Confidence interval is
(3.588 nines, 4.080 nines)
Checking the explanation (II)


We repeat a second time the simulation
using repair times that are exponentially
distributed and collecting more data
 We observe 95 data losses out of 400,000
runs
 qhat is 0.999763
 Confidence interval is
(3.552 nines, 3.734 nines)
Markov model predicted 3.632 nines
Comparing both approaches

Which is the best approach in terms of
 Generality and flexibility?


Provided results?
Discrete simulation and stochastic modeling
complement each other
Extensions (I)

Other repair time distributions
 Exponential ( to compare with results of
stochastic analysis)
 Ad hoc (80% of repairs within one day,
20% within two days)
 Taking accounts of differences between
day and night, weekdays and weekends
 Will map simtime() into a calendar
Extensions (II)

Other disk arrays
 Mirrored disks: NDISKS = 2
 RAID level 6: tolerate two disk failures
 Triplicate disks: tolerate failure of two out
of three disks
 SSPiRAL arrays
 See next slide
 Will require more work
SSPiRAL arrays
ABC
BCD
B
A
DAB
D
C
CDA
Resists all triple failures
and most quadruple failures
CONCLUSION
CONCLUSION



Statistical analysis of outputs is an important
aspect of simulation
Standard statistical techniques require
independent observations w/o
autocorrelation
 Simplest solution is to use batch means
CSIM tools simplify constructions of
confidence intervals for all sorts of means

Statistics

Transcript Statistics

Directory