Simulation Input Analysis


Graduate Program in Engineering and Technology Management
Simulation-4: INPUT MODELING
Aslı Sencer
STEPS OF INPUT MODELING
1) Collect data from the real system of interest
   • Requires substantial time and effort
   • Use expert opinion when sufficient data are not available
2) Identify a probability distribution to represent the input process
   • Draw frequency distributions and histograms
   • Choose a family of theoretical distributions
3) Estimate the parameters of the selected distribution
4) Apply goodness-of-fit tests to evaluate the chosen distribution and its parameters
   • Chi-square tests
   • Kolmogorov-Smirnov tests
5) If these tests reject the fit, choose a new theoretical distribution and go to step 3! If all theoretical distributions fail, then either use an empirical distribution or recollect data.
STEP 1: DATA COLLECTION INVOLVES MANY DIFFICULTIES
• Interarrival-time distributions may be nonhomogeneous; the distribution changes with the time of day, the day of the week, etc. You cannot merge all these data for distribution fitting!
• Two arrival processes might be dependent, like demand for washing machines and dryers. You should not treat them separately!
• The start and end of service durations might not be clear; you should split the service into well-defined processes!
• Machines may break down randomly; you should collect data for both up and down times!
STEP 2.1: IDENTIFY THE PROBABILITY DISTRIBUTION
Raw Data (100 observations of arrivals per period):
10  2  5  2  2  8  0  8  2  3
 8  3  1  6 10  2  2  6  4 10
 5  5  8  3  0  3  4  3  2  0
 1  9  9  1  3  7  1  4  4  7
 6  2  1  4  6  0  2  6  1  3
 0  0  9  5  0  2  5 11  3  5
 4  2  3  0  6  2  1  3  1  3
 6  4  7  3  5  1  5  2  2  7
 2  2  4  3  7  0  3  8  1  3
 3  3  0  2  0  4  2  0  2  4

Histogram with Discrete Data:
Arrivals per period   Frequency
0                     12
1                     10
2                     19
3                     17
4                     10
5                      8
6                      7
7                      5
8                      5
9                      3
10                     3
11                     1

[Figure: Histogram of Arrivals per Period, frequency versus arrivals per period (0 through 11).]
STEP 2.1: IDENTIFY THE PROBABILITY DISTRIBUTION
Raw Data (50 observations of component life, in days):
79.919   3.027   6.769  18.387 144.695
 0.941   0.624   0.590   7.004   3.217
 3.081   6.505  59.899   0.141   2.663
 0.878   5.380   1.928  31.764  14.382
 0.062   0.021   1.192  43.565  17.967
 3.148   3.371   0.300   1.005   1.008
 1.961   0.013  34.760  24.420   0.091
 2.157   7.078   0.002   1.147   2.336
 5.845   0.123   5.009   0.433   9.003
 7.579  23.960   0.543   0.219   4.562

Histogram with Continuous Data:
Component Life (days)   Frequency
[0-3)                   23
[3-6)                   10
[6-9)                    5
[9-12)                   1
[12-15)                  1
[15-18)                  2
[18-21)                  0
[21-24)                  1
[24-27)                  1
[27-30)                  0
[30-33)                  1
[33-36)                  1
...                     ...
[42-45)                  1
...                     ...
[57-60)                  1
...                     ...
[78-81)                  1
...                     ...
[144-147)                1

[Figure: Histogram of Component Life, frequency versus component life in days (intervals from 0 to 36).]
STEP 2.2: SELECTING THE FAMILY OF DISTRIBUTIONS
• The purpose of preparing a histogram is to infer a known pdf or pmf.
• This theoretical distribution is used to generate random variables, such as interarrival times and service times, during simulation runs.
• Exponential, normal, and Poisson distributions are frequently encountered and are not difficult to analyze.
• Beyond these, the beta, gamma, and Weibull families provide a wide variety of shapes.
Applications of Exponential Distribution
• Used to model the time between independent events, like arrivals or breakdowns.
• Inappropriate for modeling process delay times.
Applications of Poisson Distribution
• Discrete distribution, used to model the number of independent events occurring per unit time, e.g., batch sizes of customers and items.
• If the time between successive events is exponential, then the number of events in a fixed time interval is Poisson (see the sketch below).
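A minimal sketch of this relationship, assuming Python with NumPy; the rate of 3.64 arrivals per period and the horizon are illustrative choices, not values taken from the slides. It generates exponential interarrival times, counts arrivals in unit-length periods, and checks the Poisson property that the mean and variance of the counts are roughly equal.

```python
import numpy as np

rng = np.random.default_rng(42)
rate = 3.64          # assumed arrival rate (arrivals per period), illustrative only
horizon = 10_000     # number of unit-length periods to simulate

# Exponential interarrival times, accumulated into arrival instants.
interarrivals = rng.exponential(scale=1 / rate, size=int(rate * horizon * 1.5))
arrival_times = np.cumsum(interarrivals)
arrival_times = arrival_times[arrival_times < horizon]

# Number of arrivals falling in each unit-length period.
counts = np.bincount(arrival_times.astype(int), minlength=horizon)

print("mean arrivals per period:", counts.mean())   # close to rate
print("variance per period:     ", counts.var())    # also close to rate, as expected for Poisson
```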
Applications of Beta Distribution
• Often used as a rough model in the absence of data.
• Represents random proportions.
• A standard beta sample X on (0, 1) can be transformed into a scaled beta sample on (a, b) by Y = a + (b − a)X (see the sketch below).
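A minimal sketch of this transformation, assuming NumPy; the shape parameters and the bounds a, b are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta_shape = 2.0, 5.0   # assumed beta shape parameters
a, b = 10.0, 40.0              # assumed lower and upper bounds of the quantity

x = rng.beta(alpha, beta_shape, size=5)   # standard beta sample on (0, 1)
y = a + (b - a) * x                       # scaled beta sample on (a, b)
print(y)
```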
Applications of Erlang Distribution
• Used to represent the time required to complete a task that can be represented as the sum of k exponentially distributed durations (see the sketch below).
• For large k, the Erlang distribution approaches the normal distribution.
• For k = 1, the Erlang distribution is the exponential distribution with rate = 1/β.
• Special case of the gamma distribution in which α, the shape parameter of the gamma distribution, equals k.
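A minimal sketch of the sum-of-exponentials view, assuming NumPy; k and β are illustrative values. It also confirms the gamma connection by drawing from a gamma distribution with integer shape k.

```python
import numpy as np

rng = np.random.default_rng(1)
k, beta_mean = 4, 2.5     # assumed number of phases and mean of each exponential phase
n = 100_000

# Erlang(k, beta) as the sum of k independent exponential durations...
erlang_as_sum = rng.exponential(scale=beta_mean, size=(n, k)).sum(axis=1)
# ...which matches a gamma draw with integer shape parameter alpha = k.
erlang_as_gamma = rng.gamma(shape=k, scale=beta_mean, size=n)

print(erlang_as_sum.mean(), erlang_as_gamma.mean())   # both close to k * beta
```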
Applications of Gamma Distribution
• Used to represent the time required to complete a task.
• Same as the Erlang distribution when the shape parameter α is an integer.
Applications of Johnson Distribution
• A flexible domain, bounded or unbounded, allows it to fit many data sets.
• If δ > 0, the domain is bounded.
• If δ < 0, the domain is unbounded.
Applications of Lognormal Distribution
• Used to represent quantities that are the product of a large number of random quantities.
• Used to represent task times that are skewed to the right.
• If X ~ LOGN(μₗ, σₗ), then ln X ~ NORM(μ, σ).
Applications of Weibull Distribution
• Widely used in reliability models to represent lifetimes.
• If the system consists of a large number of parts that fail independently, the time between successive failures can be Weibull.
• Used to model nonnegative task times that are skewed to the left.
• It reduces to the exponential distribution when the shape parameter α = 1.
Applications of Continuous Empirical Distribution
• Used to incorporate empirical data as an alternative to a theoretical distribution, when there are multiple modes, significant outliers, etc.

Applications of Discrete Empirical Distribution
• Used for discrete assignments such as job type, visitation sequence, or batch size (see the sampling sketch below, which covers both cases).
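A minimal sketch of sampling from both kinds of empirical distribution, assuming NumPy; the job types, their probabilities, and the small continuous sample are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete empirical distribution: assign job types with observed relative frequencies.
job_types = np.array(["A", "B", "C"])       # hypothetical job types
probs = np.array([0.5, 0.3, 0.2])           # hypothetical observed proportions
jobs = rng.choice(job_types, size=10, p=probs)

# Continuous empirical distribution: invert the empirical cdf by linear
# interpolation between the sorted observations.
data = np.sort(np.array([1.2, 2.2, 3.4, 4.0, 5.1]))   # hypothetical observations
cdf = np.arange(1, data.size + 1) / data.size
samples = np.interp(rng.random(10), cdf, data)

print(jobs)
print(samples)
```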
STEP 3: ESTIMATE THE PARAMETERS OF THE SELECTED DISTRIBUTION
• A theoretical distribution is specified by its parameters, which are obtained from the whole population data.
  Ex: Let V, W, X, Y, Z be random variables; then
  V ~ N(μ, σ²), where μ is the mean and σ² is the variance
  W ~ Poisson(λ), where λ is the mean
  X ~ Exponential(β), where β is the mean
  Y ~ Triangular(a, m, b), where a, m, b are the minimum, mode, and maximum of the data
  Z ~ Uniform(a, b), where a and b are the minimum and maximum of the data
• These parameters are estimated by using the point estimators defined on the sample data.
STEP 3: ESTIMATE THE PARAMETERS OF THE SELECTED DISTRIBUTION
• The sample mean and the sample variance are the point estimators for the population mean and the population variance.
• Let X_i, i = 1, 2, ..., n be iid random variables (the raw data are known); then the sample mean X̄ and the sample variance s² are calculated as
  X̄ = (1/n) Σ X_i,   s² = (Σ X_i² − n X̄²) / (n − 1),   with sums over i = 1, ..., n.
• Discrete Raw Data: the 100 arrivals-per-period observations shown in Step 2.1.
• Continuous Raw Data: the 50 component-life observations (days) shown in Step 2.1.
(A computational sketch follows below.)
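A minimal sketch of these point estimators, assuming NumPy; for brevity it uses only ten of the 100 discrete observations, so the numbers are illustrative.

```python
import numpy as np

# Ten of the 100 arrivals-per-period observations from Step 2.1 (illustrative subset).
x = np.array([10, 2, 5, 2, 2, 8, 0, 8, 2, 3], dtype=float)
n = x.size

x_bar = x.sum() / n                                # sample mean
s2 = (np.sum(x**2) - n * x_bar**2) / (n - 1)       # sample variance

print(x_bar, s2)
print(np.isclose(s2, x.var(ddof=1)))               # agrees with NumPy's unbiased variance
```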
STEP 3: ESTIMATE THE PARAMETERS OF THE SELECTED DISTRIBUTION
• If the data are discrete and have been grouped in a frequency distribution, i.e., the raw data are not known, then
  X̄ = (1/n) Σ f_j X_j,   s² = (Σ f_j X_j² − n X̄²) / (n − 1),   with sums over j = 1, ..., k,
  where k is the number of distinct values of X and f_j, j = 1, 2, ..., k, is the observed frequency of the value X_j of X (see the sketch below).

Arrivals per period   Frequency     Arrivals per period   Frequency
0                     12            6                     7
1                     10            7                     5
2                     19            8                     5
3                     17            9                     3
4                     10            10                    3
5                      8            11                    1
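A minimal sketch of the grouped-data estimators, assuming NumPy, using the frequency table above (n = 100).

```python
import numpy as np

values = np.arange(12)                                        # X_j: arrivals per period, 0..11
freq = np.array([12, 10, 19, 17, 10, 8, 7, 5, 5, 3, 3, 1])    # f_j from the table
n = freq.sum()                                                # 100 observations

x_bar = np.sum(freq * values) / n                             # grouped sample mean
s2 = (np.sum(freq * values**2) - n * x_bar**2) / (n - 1)      # grouped sample variance

print(n, x_bar, round(s2, 2))                                 # mean = 3.64
```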
STEP 3: ESTIMATE THE PARAMETERS OF THE SELECTED DISTRIBUTION
• If the data are discrete or continuous and have been grouped in class intervals, i.e., the raw data are not known, then
  X̄ = (1/n) Σ f_j m_j,   s² = (Σ f_j m_j² − n X̄²) / (n − 1),   with sums over j = 1, ..., c,
  where f_j, j = 1, 2, ..., c, is the observed frequency of the jth class interval and m_j is the midpoint of the jth interval (see the sketch below).

Component Life (days)   Frequency     Component Life (days)   Frequency     Component Life (days)   Frequency
[0-3)                   23            [21-24)                 1             ...                     ...
[3-6)                   10            [24-27)                 1             [57-60)                 1
[6-9)                    5            [27-30)                 0             ...                     ...
[9-12)                   1            [30-33)                 1             [78-81)                 1
[12-15)                  1            [33-36)                 1             ...                     ...
[15-18)                  2            ...                     ...           [144-147)               1
[18-21)                  0            [42-45)                 1
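The same idea with class midpoints; a minimal sketch assuming NumPy. For brevity it uses only the first six intervals of the component-life table, so the estimates are illustrative rather than the full-sample values.

```python
import numpy as np

# First six class intervals of the component-life table (days) and their frequencies.
edges = np.array([0, 3, 6, 9, 12, 15, 18], dtype=float)
freq = np.array([23, 10, 5, 1, 1, 2], dtype=float)

midpoints = (edges[:-1] + edges[1:]) / 2                      # m_j
n = freq.sum()

x_bar = np.sum(freq * midpoints) / n
s2 = (np.sum(freq * midpoints**2) - n * x_bar**2) / (n - 1)
print(round(x_bar, 2), round(s2, 2))
```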
STEP 3: ESTIMATE THE PARAMETERS OF THE SELECTED DISTRIBUTION
• The minimum, mode (i.e., the data value with the highest frequency), and maximum of the population data are estimated from the sample data as
  â = min X_i,   m̂ = X_t,   b̂ = max X_i,
  where X_t is the data value that has the highest frequency (see the sketch below).
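A minimal sketch of these estimates, assuming NumPy; the small sample is an illustrative subset of the discrete data.

```python
import numpy as np

x = np.array([10, 2, 5, 2, 2, 8, 0, 8, 2, 3])    # illustrative sample

a_hat = x.min()                                   # estimate of the minimum a
b_hat = x.max()                                   # estimate of the maximum b
values, counts = np.unique(x, return_counts=True)
m_hat = values[np.argmax(counts)]                 # X_t: the value with the highest frequency

print(a_hat, m_hat, b_hat)                        # Triangular(a, m, b) estimated as (0, 2, 10)
```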
STEP 4: GOODNESS-OF-FIT TESTS
• Goodness-of-fit tests (GFTs) provide helpful guidance for evaluating the suitability of the selected input model as a simulation input.
• GFTs check the discrepancy between the empirical and the selected theoretical distribution to decide whether the sample is taken from that theoretical distribution or not.
• The role of the sample size, n:
  • If n is small, GFTs are unlikely to reject any theoretical distribution, since the discrepancy is attributed to sampling error!
  • If n is large, then GFTs are likely to reject almost all distributions.
STEP 4: GOODNESS-OF-FIT TESTS
CHI-SQUARE TEST
• The chi-square test is valid for large sample sizes, for both discrete and continuous distributional assumptions, when parameters are estimated by maximum likelihood.
• Hypothesis test:
  Ho: The random variable X conforms to the theoretical distribution with the estimated parameters
  Ha: The random variable does NOT conform to the theoretical distribution with the estimated parameters
• We need a test statistic to either reject or fail to reject Ho. This test statistic should measure the discrepancy between the theoretical and the empirical distribution.
• If this test statistic is high, then Ho is rejected; otherwise we fail to reject Ho (hence we accept Ho).
STEP 4: GOODNESS-OF-FIT TESTS
CHI-SQUARE TEST
• Test statistic: arrange the n observations into a set of k class intervals or cells. The test statistic is given by
  χ₀² = Σ (O_i − E_i)² / E_i,   summed over i = 1, ..., k,
  where O_i is the observed frequency in the ith class interval and E_i is the expected frequency in the ith class interval, with
  E_i = n p_i,
  where p_i is the theoretical probability associated with the ith class, i.e., p_i = P(the random variable X belongs to the ith class).
STEP 4: GOODNESS-OF-FIT TESTS
CHI-SQUARE TEST
• Recommendations for the number of class intervals for continuous data:

Sample Size, n   Number of Class Intervals, k
20               Do not use the chi-square test
50               5 to 10
100              10 to 20
>100             √n to n/5

• It is suggested that E_i ≥ 5. If an expected frequency is smaller, that class should be combined with adjacent classes; the corresponding O_i values should also be combined, and k should be reduced by one for every combined cell.
STEP 4: GOODNESS-OF-FIT TESTS
CHI-SQUARE TEST
• Evaluation: let α = P(rejecting Ho when it is true); e.g., the significance level is 5%.
• The test statistic χ₀² approximately follows the chi-square distribution with k − s − 1 degrees of freedom, where s is the number of estimated parameters.
• If χ₀² falls in the rejection region, i.e., χ₀² > χ²_{α, k−s−1} (equivalently, if the p-value of the test statistic is less than α), reject Ho and the distribution; otherwise, fail to reject Ho.
CHI-SQUARE DISTRIBUTION TABLE
[Table of critical values χ²_{α, k−s−1}, indexed by the degrees of freedom k − s − 1 (rows) and the significance level α (columns).]
STEP 4: GFT - CHI-SQUARE TEST
EX: POISSON DISTRIBUTION
• Consider the discrete data we analyzed in Step 2.
  Ho: # arrivals, X ~ Poisson(λ = 3.64)
  Ha: otherwise
• λ is the mean arrival rate, estimated by the sample mean: λ̂ = X̄ = 3.64.
• The following probabilities are found by using the pmf:
  P(0) = 0.026    P(6) = 0.085
  P(1) = 0.096    P(7) = 0.044
  P(2) = 0.174    P(8) = 0.020
  P(3) = 0.211    P(9) = 0.008
  P(4) = 0.192    P(10) = 0.003
  P(5) = 0.140    P(≥11) = 0.001
STEP 4: GFT - CHI-SQUARE TEST
EX: POISSON DISTRIBUTION
• Calculation of the chi-square test statistic, with k − s − 1 = 7 − 1 − 1 = 5 degrees of freedom and α = 0.05: the computed statistic exceeds the critical value χ²_{0.05,5} = 11.07, so Ho is rejected! (A computational sketch follows below.)
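A minimal computational sketch of this test, assuming NumPy. It rebuilds the expected frequencies from the Poisson(3.64) pmf, merges adjacent cells until every expected count is at least 5 (which yields k = 7 cells for these data), and compares the statistic with the tabled critical value; the exact cell grouping used on the slide is not shown, so this grouping is an assumption.

```python
import math
import numpy as np

lam, n = 3.64, 100
observed = np.array([12, 10, 19, 17, 10, 8, 7, 5, 5, 3, 3, 1], dtype=float)  # arrivals 0..11

# Poisson(3.64) probabilities for 0..10 plus the tail P(X >= 11).
p = np.array([math.exp(-lam) * lam**k / math.factorial(k) for k in range(11)])
p = np.append(p, 1.0 - p.sum())
expected = n * p

def merge_small_cells(obs, exp, min_exp=5.0):
    """Combine adjacent cells, left to right, until each expected count is >= min_exp."""
    m_obs, m_exp, acc_o, acc_e = [], [], 0.0, 0.0
    for o, e in zip(obs, exp):
        acc_o, acc_e = acc_o + o, acc_e + e
        if acc_e >= min_exp:
            m_obs.append(acc_o)
            m_exp.append(acc_e)
            acc_o, acc_e = 0.0, 0.0
    if acc_e > 0:                         # fold any leftover tail into the last cell
        m_obs[-1] += acc_o
        m_exp[-1] += acc_e
    return np.array(m_obs), np.array(m_exp)

obs, exp = merge_small_cells(observed, expected)   # k = 7 cells for these data
chi2 = np.sum((obs - exp) ** 2 / exp)
df = len(obs) - 1 - 1                              # k - s - 1, with s = 1 (lambda estimated)

print(len(obs), df, round(chi2, 2))   # chi2 well above the critical value 11.07: reject Ho
```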
STEP 4: GFT - CHI-SQUARE TEST
EX: ARENA INPUT ANALYZER
Distribution Summary
  Distribution: Normal
  Expression: NORM(225, 89)
  Square Error: 0.037778
  → Reject the Normal distribution at the 5% significance level!
Chi Square Test
  Number of intervals = 12
  Degrees of freedom = 9
  Test Statistic = 1.22e+004
  Corresponding p-value < 0.005
Data Summary
  Number of Data Points = 27009
  Min Data Value = 1
  Max Data Value = 1.88e+003
  Sample Mean = 225
  Sample Std Dev = 89
Histogram Summary
  Histogram Range = 0.999 to 1.88e+003
  Number of Intervals = 40
Fit All Summary
  Function      Sq Error
  Normal        0.0506
  Gamma         0.0625
  Beta          0.0639
  Erlang        0.0673
  Weibull       0.079
  Lognormal     0.0926
  Exponential   0.286
  Triangular    0.311
  Uniform       0.36
STEP 4: GFT - CHI-SQUARE TEST
EX: ARENA INPUT ANALYZER
Distribution Summary
  Distribution: Lognormal
  Expression: 2 + LOGN(145, 67.9)
  Square Error: 0.000271
  → Reject the Lognormal distribution at the 5% significance level!
Chi Square Test
  Number of intervals = 4
  Degrees of freedom = 1
  Test Statistic = 207
  Corresponding p-value < 0.005
Data Summary
  Number of Data Points = 21547
  Min Data Value = 2
  Max Data Value = 6.01e+003
  Sample Mean = 146
  Sample Std Dev = 79.5
Histogram Summary
  Histogram Range = 2 to 6.01e+003
  Number of Intervals = 40
STEP 4: GFT - CHI-SQUARE TEST
EX: ARENA INPUT ANALYZER
Distribution Summary
  Distribution: Weibull
  Expression: 0.999 + WEIB(94.7, 0.928)
  Square Error: 0.002688
  → Reject the Weibull distribution at the 5% significance level!
Chi Square Test
  Number of intervals = 20
  Degrees of freedom = 17
  Test Statistic = 838
  Corresponding p-value < 0.005
Data Summary
  Number of Data Points = 12418
  Min Data Value = 1
  Max Data Value = 1.47e+003
  Sample Mean = 108
  Sample Std Dev = 135
Histogram Summary
  Histogram Range = 0.999 to 1.47e+003
  Number of Intervals = 40
STEP 4: GOODNESS-OF-FIT TESTS
DRAWBACKS OF THE CHI-SQUARE GFT
• The chi-square test uses estimates of the parameters obtained from the sample, which decreases the degrees of freedom.
• The chi-square test requires the data to be placed in class intervals for continuous distributions; these classes are arbitrary and affect the value of the chi-square test statistic.
• The distribution of the chi-square test statistic is known only approximately, and the power of the test (the probability of rejecting an incorrect theoretical distribution) is sometimes low.
• Hence other GFTs are also needed!
STEP 4: GOODNESS-OF-FIT TESTS
KOLMOGOROV-SMIRNOV TEST
• Useful when the sample sizes are small and when no parameters have been estimated from the sample data.
• Compares the cdf of the theoretical distribution, F(x), with the empirical cdf, S_N(x), of the sample of N observations.
• Hypothesis test:
  Ho: The data follow the selected distribution
  Ha: The data do NOT follow the selected distribution
• Test statistic: the largest deviation, D, between F(x) and S_N(x).
STEP 4: GOODNESS-OF-FIT TESTS
KOLMOGOROV-SMIRNOV TEST
Steps of the K-S test:
1. Rank the data so that X_(1) ≤ X_(2) ≤ ... ≤ X_(N).
2. Calculate the maximum discrepancy D between F and S_N, where
   F(X_(i)) = P(X ≤ X_(i))
   S_N(X_(i)) = (# of sampled values ≤ X_(i)) / N = i / N
STEP 4: GOODNESS-OF-FIT TESTS
KOLMOGOROV-SMIRNOV TEST
• If F is discrete, D = max(D⁺, D⁻), where
  D⁺ = max_{0≤i≤N} [ S_N(X_(i)) − F(X_(i)) ] = max_{0≤i≤N} [ i/N − F(X_(i)) ]
  D⁻ = max_{0≤i≤N} [ F(X_(i)) − S_N(X_(i−1)) ] = max_{0≤i≤N} [ F(X_(i)) − (i − 1)/N ]
• If F is continuous,
  D = max_{0≤i≤N} | F(X_(i)) − S_N(X_(i)) |
STEP 4: GOODNESS-OF-FIT TESTS
KOLMOGOROV-SMIRNOV TEST
3. Evaluation:
   If D > D_{α,N}, then reject Ho.
   If D ≤ D_{α,N}, then fail to reject Ho.
STEP 4: GOODNESS-OF-FIT TESTS
EXAMPLE: KOLMOGOROV-SMIRNOV TEST
• Consider the data: 0.44, 0.81, 0.14, 0.05, 0.93
  Ho: The data are uniform between (0, 1)
  Ha: otherwise

i                       1       2       3       4       5
X_(i)                   0.05    0.14    0.44    0.81    0.93
F(X_(i)) = X_(i)        0.05    0.14    0.44    0.81    0.93
S_N(X_(i)) = i/N        0.20    0.40    0.60    0.80    1.00
i/N − F(X_(i))          0.15    0.26    0.16    -0.01   0.07
F(X_(i)) − (i − 1)/N    0.05    -0.06   0.04    0.21    0.13

• Since D = 0.26 < D_{0.05,5} = 0.565, Ho is not rejected: the data are consistent with Uniform(0, 1). (A computational sketch follows below.)
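A minimal sketch reproducing this table with NumPy; the critical value D_{0.05,5} = 0.565 comes from the K-S table, as on the slide.

```python
import numpy as np

data = np.sort(np.array([0.44, 0.81, 0.14, 0.05, 0.93]))
N = data.size
i = np.arange(1, N + 1)

F = data                                   # F(x) = x for the Uniform(0, 1) cdf
d_plus = np.max(i / N - F)                 # max_i [ i/N - F(X_(i)) ]
d_minus = np.max(F - (i - 1) / N)          # max_i [ F(X_(i)) - (i-1)/N ]
D = max(d_plus, d_minus)

print(round(D, 2))                         # 0.26 < 0.565 = D_(0.05,5): fail to reject Ho
```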