Experimental Design - People


CS5014
Research Methods in CS
Dr. Ayman Abdel-Hamid
Computer Science Department
Virginia Tech
Performance Evaluation
Outline
Performance Evaluation
•Introduction
•Common Mistakes
Some of the material is based on Dr. Cliff Shaffer’s Notes for
CS5014. Department of Computer Science, Virginia Tech
Examples 1/2
• Evaluate design alternatives
• Compare two or more computers, programs, algorithms
Speed, memory, usability
• Determine optimum value of a parameter (tuning,
optimization)
• Locate bottlenecks
• Characterize load
• Prediction of performance on future loads
• Determine number and size of components required
Examples 2/2
•Which is the best sorting algorithm?
•What factors affect data structure visualizations?
•Code-tune a program
•Which interface design is better?
•What are the best parameter values for a biological
model?
The Art of Performance Evaluation
•Throughput in transactions per second:

System   Workload 1   Workload 2
A        20           10
B        10           20

How does system A compare to system B?
Evaluation Issues
•System
• Hardware, software, network: clear bounding for the "system" under study
•Techniques
• Measurement, simulation, analytical modeling
•Metrics
• Response time, transactions per second
•Workload: The requests a user gives to the system
•Statistical techniques
•Experimental design
• Maximize information, minimize number of experiments
Common Mistakes 1/5
•No goals
Each model is special purpose
Performance problems are vague when first presented
•Biased goals → setting out to show that OUR system is better than THEIRS
•Unsystematic approach
•Analysis without understanding
People want guidance, not models
•Incorrect performance metrics
Want correct metrics, not easy ones
•Unrepresentative workload
Common Mistakes 2/5
•Wrong evaluation technique
Easy to become married to one approach
• Overlooking important parameters
• Ignoring significant factors
Parameters that are varied in the study are called factors.
There's no use comparing what can't be changed
• Inappropriate experimental design
• Inappropriate level of detail
Common Mistakes 3/5
•No analysis
• Erroneous analysis
• No sensitivity analysis
Measure the effect of changing a parameter
• Ignoring errors in input
• Improper treatment of outliers
Outliers are values that are too high or too low compared to a
majority of values
Some should be ignored (can't happen)
Some should be retained (key special cases)
Common Mistakes 4/5
•Assuming no change in the future
• Ignoring variability
Mean is of low significance in the face of high variability
• Too complex analysis
Complex models are "interesting" and so get published and studied
Real world use is simpler
Decision makers prefer simpler models
Common Mistakes 5/5
•Improper presentation of results
•The proper metric to measure the performance of an analyst is
the number of analyses that helped decision makers (not the
number of analyses performed)
• Ignoring social aspects
• Omitting assumptions and limitations
The Error of One-Sided Hypothesis
•Consider the hypothesis "X performs better than Y".
•The danger is the following chain of reasoning:
 Could this hypothesis be true?
 I have evidence that the hypothesis might be true.
 Therefore it is true.
 What got ignored is any evidence that the hypothesis might
not be (or is not) true.
A Systematic Approach
1. State goals and define the system boundaries
2. List services and outcomes
3. Select metrics
4. List parameters (System and Workload)
5. Select factors to study
6. Select evaluation technique
7. Select workload
8. Design experiments
9. Analyze and interpret data
10. Present results. Start over, if necessary!
Technique Selection
•Choices: Analytical Modeling, Simulation, and Measurement
"Until validated, all evaluation results are suspect."
•Validate one of these approaches by comparing against another.
Measurement results are just as susceptible to experimental
errors and biases as the other two techniques.
•Criteria for technique selection
 Stage of analysis
 Time required
 Tools
 Accuracy
 Trade-off evaluation
 Cost
 Saleability
Common Performance Metrics 1/3
•Response Time
Interval between a user's request and the system response
This simplistic definition assumes requests and responses are instantaneous; two refinements:
Definition 1 → time between the user finishing the request and the system starting the response
Definition 2 → time between the user finishing the request and the system completing the response
Common Performance Metrics 2/3
•Throughput: The rate (requests per unit of time) at which requests
can be serviced by the system.
 Throughput generally increases as the load initially
increases.
 Eventually it stops increasing, and might then decrease.
 Nominal capacity is maximum achievable throughput under
ideal workload conditions (bandwidth for computer networks)
 Usable capacity is maximum throughput achievable without
violating a limit on response time.
 Efficiency is ratio of usable to nominal capacity.
Common Performance Metrics 3/3
•Utilization
Fraction of time the resource is busy servicing requests
Ratio of busy time to total elapsed time over a given period
•Bottleneck: Component with highest utilization
Improving this component often gives highest payoff
•Reliability
Probability of errors
Mean time between errors
•Availability
Uptime and downtime
Mean uptime (Mean time to Failure)
•Cost/performance ratio
Workloads
•A workload is the requests made by users of the system under
study.
 A test workload is any workload used in performance studies
 A real workload is one observed on a real system. It cannot
be repeated.
 A synthetic workload is a reproduction of a real workload to
be applied to the tested system
•Examples (for CPU performance)
 Addition instruction
 Instruction mixes
 Kernels
 Synthetic programs
 Application benchmarks
Selecting Workloads
1. Determine the services for the SUT (System Under Test)
• View the system as a service provider
• Component Under Study (CUS)?
2. Select the desired level of detail
3. Confirm that the workload is representative
4. Is the workload still valid?
• A real-world workload is not repeatable
• Most workloads are models of real service requests
Selecting Workloads-Level of Detail
•Select the desired level of detail
 Most frequent request
 Frequency of request types
 Time-stamped sequence of requests (trace)
 Average resource demand
 Distribution of resource demands
 (it might be necessary to specify the complete probability distribution)
Selecting Workloads-Representativeness
•A test workload should be representative of the real application
•The test workload and the real application should match in the following:
 Arrival rate
 Resource demands
 Resource usage profile
Selecting Workloads-Other Factors
•Loading level
 A workload might exercise a system to its full capacity (best case),
 beyond its capacity (worst case), or
 at the load level observed in the real workload (typical case)
•Impact of external components
•Repeatability
Some Workload Characterization Techniques 1/3
•Averaging
 Uses the arithmetic mean
 Alternatives?
•Specifying dispersion
 Variance
•Markov models
•Clustering
Some Workload Characterization Techniques 2/3
Markov Models
 Assume the next request depends only on the last request
 The next system state depends only on the current system state
 Transition matrices
 Ex: the typical pattern for some system is about 4 small packets followed by a single large packet
 Random chance: probability of small is always 0.8, large is always 0.2
 Markov model: small follows small with probability 0.75, large follows small with probability 0.25; in contrast, small always follows large (see the sketch below)
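To make the contrast concrete, here is a minimal simulation sketch (an illustration, not from the slides): both generators emit 80% small packets overall, but only the Markov model forces every run of large packets to have length one.

```python
import random

def random_chance(n):
    # Each packet is small with probability 0.8, independent of history.
    return ["S" if random.random() < 0.8 else "L" for _ in range(n)]

def markov(n):
    # Next packet depends only on the last one (two-state Markov chain).
    seq, state = [], "S"
    for _ in range(n):
        seq.append(state)
        if state == "S":
            state = "S" if random.random() < 0.75 else "L"
        else:
            state = "S"   # a small packet always follows a large one
    return seq

random.seed(1)
for name, gen in [("random chance", random_chance), ("markov", markov)]:
    seq = "".join(gen(100_000))
    frac_small = seq.count("S") / len(seq)
    longest_large_run = max(len(run) for run in seq.split("S") if run)
    print(f"{name}: P(small) = {frac_small:.2f}, "
          f"longest run of large packets = {longest_large_run}")
```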
Some Workload Characterization Techniques 3/3
•Clustering
 Separate a population into groups with similar characteristics
 Minimize the within-group variance while maximizing the between-group variance
 Select representatives to simplify further processing (a sketch follows)
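One concrete way to do this is k-means; the algorithm choice and the (CPU seconds, disk I/Os) measurements below are illustrative assumptions, not from the slides.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Plain k-means: alternate nearest-center assignment and center update.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels, centers

# Hypothetical workload measurements: (CPU seconds, disk I/Os) per job.
X = np.array([[1.0, 10], [1.2, 12], [0.9, 11],      # light jobs
              [8.0, 90], [7.5, 85], [8.2, 95]])     # heavy jobs
labels, centers = kmeans(X, k=2)
for j, c in enumerate(centers):
    members = X[labels == j]
    # Representative = the member closest to the cluster centroid.
    print("representative:", members[((members - c) ** 2).sum(-1).argmin()])
```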
Statistics: Basic Concepts 1/2
•Independent events: knowing that one event has occurred does not in any way change our estimate of the probability of the other event
•Random variable: takes one of a specified set of values with a specified probability
Statistics: Basic Concepts 2/2
For independent variables, covariance is zero. Why? Because Cov(X, Y) = E[XY] − E[X]E[Y], and independence gives E[XY] = E[X]E[Y].
Mean, Median, and Mode
The mode is the value xi with the highest probability
The Probability Density Function (pdf) is the derivative of the CDF, so P(x1 < x ≤ x2) = F(x2) − F(x1)
Summarizing Data by a Single Number 1/2
•A single number → gives a key characteristic of the data set
•Indices of central tendency: mean, median, and mode
 Sample mean
 Sample median
 Sort observations in increasing order and take the observation in the middle of the series (if the number is even → take the mean of the middle two values)
 Sample mode
 Plot a histogram and specify the midpoint where the histogram peaks, or the category occurring most frequently
 Mean is affected more by outliers than median or mode
Summarizing Data by a Single Number 2/2
•Mean and median always exist and are unique
•Mode may not exist, or might not be unique
•Examples
 Unimodal, symmetrical pdf
 Bimodal, symmetrical pdf
 Uniform density function
 Skewed pdf
Choosing Mean, Median, or Mode
•You can't take the median or mean of categorical data
 Use the mode
•Is the total of interest?
 Use the mean
 (Ex: total CPU time for five database queries vs. the number of windows open on the screen for each query: probably use the mean for the times and the median for the windows)
•Are the data skewed? Use the median
Common Misuses of Means
•Using means of significantly different values
 Mean of 10 and 1000 msec is 505 msec
•Using mean without regard to skewness of distribution

Observations         Sum   Mean   Typical
10, 9, 11, 10, 10    50    10     10
5, 5, 5, 4, 31       50    10     5

•Multiplying means to get the mean of a product
 The mean of a product of two random variables equals the product of the means only IF the two random variables are independent
Summarizing Variability
•Mean, median, and mode attempt to provide a single "characteristic" value for the population
 But a single value might not be meaningful
 People generally prefer systems with low variability
•There are different ways to measure variability (indices of dispersion)
 Range (max − min)
 Poor: tends to be unbounded, unstable over a range of observations, and susceptible to outliers
 Variance or standard deviation
 10th and 90th percentiles
 Quartiles (box plots)
Variance and Standard Deviation
•Why division by n−1?
 Only n−1 of the n differences are independent
•Variance is expressed in units that are the square of the unit of the observations
•Standard deviation is in the same units as the mean
•Better to use the coefficient of variation (C.O.V. = standard deviation / mean), since it takes the scale of the measurement unit out of the variability consideration (a sketch follows)
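A tiny sketch of that point with made-up response times: rescaling the data changes the standard deviation but leaves the C.O.V. unchanged.

```python
import numpy as np

x_ms = np.array([10.0, 12.0, 9.0, 11.0, 10.0])   # response times in ms
for data, unit in [(x_ms, "ms"), (x_ms / 1000.0, "s")]:
    s = data.std(ddof=1)                         # ddof=1 -> divide by n-1
    print(f"std = {s:.4f} {unit}, C.O.V. = {s / data.mean():.4f}")
```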
Percentiles
•Can be computed for any variable, even one without bounds
•The 0.9-quantile is the 90th percentile
•Percentiles at multiples of 10% are deciles
•Quartiles divide the data into 4 parts at 25%, 50%, and 75%
 25% of observations are less than or equal to the first quartile
 50% of observations are less than or equal to the second quartile
•The α-quantile can be estimated by sorting the observations and taking the [(n−1)α + 1]th element in the ordered set, rounding to the nearest integer (a sketch follows)
Selecting Index of Dispersion
•Is the variable bounded?
 Specify the range
•If there are no bounds, is the distribution unimodal and symmetric?
 Use the coefficient of variation
•Is the distribution nonsymmetric?
 Use percentiles
•Using average traffic for network design?
 Networks are usually designed to carry the 95th to 99th percentile of the observed load (range or percentile)
•Finding a percentile requires several passes through the data
Comparing Systems Using Sample Data
•Most CS research work involves samples from some "population"
 Performance experiments on workloads for programs, systems, and networks
•A fundamental goal of CS experimentation is to determine the mean (μ) and variance of some population characteristic (a parameter)
•We can only measure the mean characteristic for the sample, not the population (a statistic)
 The parameter value (population mean) is fixed
 The statistic (sample mean) is a random variable
 The values for the statistic come from some sampling distribution
Hypothesis Testing
•One fundamental experimental question → what is the parameter value?
•Another fundamental experimental question → are two populations different?
 Does a subpopulation perform better?
 Is one algorithm better than another?
•Null hypothesis: the two populations have the same mean
•Rejection of the null hypothesis: do so if you are "confident" that the means are different
•Significant difference: the two means are different with a certain reference confidence (probability)
Confidence Intervals
Central Limit Theorem
Normal Distribution
The sum of a large number of independent observations from any distribution has a normal distribution
•A normal variate is denoted N(μ, σ), where μ is the mean and σ is the standard deviation
•A normal distribution with zero mean and unit variance is called a unit normal or standard normal distribution → denoted N(0,1)
•An α-quantile of a unit normal variate z ~ N(0,1) is denoted by z_α
•If a random variable x has a N(μ, σ) distribution, then (x − μ)/σ has a N(0,1) distribution
Confidence Interval Example
Confidence Intervals for Small Samples
•The sampling distribution only approximates a normal curve when n gets big enough, over about 30
•Before that, the "tails" of the distribution are too big, so the estimated confidence interval is too small
•You can calculate the confidence intervals more precisely, but you need a separate table for every value of n
•Usually there is a table to look it up in, or better yet, the statistical software will calculate it for you
•The 100(1−α)% confidence interval is given by
x̄ ± t[1−α/2; n−1] · s/√n
Small Sample CI Example
•8 observations
−0.04, −0.19, 0.14, −0.09, −0.14, 0.19, 0.04, and 0.09
•Mean = 0
•Sample standard deviation = 0.138
•t[0.95; 7] = 1.895
•90% confidence interval is
0 ± 1.895 × 0.138/√8 = (−0.0926, 0.0926)
Testing Against Zero
•Are the differences in processor times for two algorithms significant?
 Testing if the mean of the differences is zero: check whether zero is within the confidence interval at the desired confidence
 Note that the number of samples is likely to be small → take the correct t value from the table
Testing Against Zero Example
The difference in processor times of two different implementations of the same algorithm was measured on seven similar workloads: {1.5, 2.6, −1.8, 1.3, −0.5, 1.7, and 2.4}. Can we say with 99% confidence that one implementation is superior to the other?
•Sample size = 7
•Mean = 7.2/7 = 1.03
•Sample variance = 2.57
•Sample standard deviation = 1.6
•100(1−α) = 99, α = 0.01, 1−α/2 = 0.995
•t[0.995; 6] = 3.707
•Confidence interval is
1.03 ± 3.707 × 1.6/√7 = (−1.21, 3.27)
•The interval includes zero, so we cannot say
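The same conclusion can be reached with SciPy's one-sample t test; a sketch:

```python
import numpy as np
from scipy import stats

d = np.array([1.5, 2.6, -1.8, 1.3, -0.5, 1.7, 2.4])
lo, hi = stats.t.interval(0.99, df=len(d) - 1, loc=d.mean(), scale=stats.sem(d))
t_stat, p = stats.ttest_1samp(d, popmean=0.0)
print(f"99% CI = ({lo:.2f}, {hi:.2f}), p = {p:.3f}")  # interval straddles zero
```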
Paired Observations
•Need to compare two systems on very similar workloads
If there are more than 2 systems, or the workloads are significantly different, experimental design and analysis techniques should be used
•Conduct n experiments on each of the two systems, with a one-to-one correspondence between the ith tests on the 2 systems → the observations are paired
•Calculate the difference in performance
•Construct a confidence interval for the difference
If it includes zero, the systems are not significantly different
Paired Observations Example
Six similar workloads used on 2 systems
{(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)}
Performance differences: {−13.7, 13.1, −2.8, −1.1, −3.0, 5.6}
•Sample size = 6
•Mean = −0.32
•Sample variance = 81.62
•Sample standard deviation = 9.03
•100(1−α) = 90, α = 0.1, 1−α/2 = 0.95
•t[0.95; 5] = 2.015
•Confidence interval is −0.32 ± 2.015 × 9.03/√6 = (−7.75, 7.11)
•The interval includes zero, so the two systems are not significantly different
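A sketch of the same paired analysis with SciPy:

```python
import numpy as np
from scipy import stats

a = np.array([5.4, 16.6, 0.6, 1.4, 0.6, 7.3])
b = np.array([19.1, 3.5, 3.4, 2.5, 3.6, 1.7])
d = a - b                                       # paired differences
lo, hi = stats.t.interval(0.90, df=len(d) - 1, loc=d.mean(), scale=stats.sem(d))
print(f"90% CI for mean difference = ({lo:.2f}, {hi:.2f})")
print(stats.ttest_rel(a, b))                    # equivalent built-in paired test
```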
Unpaired Observations
•Two samples of different sizes
•No correspondence between ith observations in two samples
•Make an estimate of the variance and the effective number of degrees of freedom → the classical t-test (a sketch follows the steps below)
Compute sample means
Compute sample standard deviations
Compute mean difference
Compute standard deviation of mean difference
Compute effective number of degrees of freedom
Compute confidence interval for mean difference
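SciPy's unpaired test with equal_var=False automates these steps (it is the classical Welch-style t test outlined above); the two response-time samples below are hypothetical.

```python
import numpy as np
from scipy import stats

sys_a = np.array([5.2, 5.9, 6.1, 5.4, 5.7, 6.3, 5.8])
sys_b = np.array([6.9, 7.2, 6.5, 7.8, 7.1])
# equal_var=False estimates the effective degrees of freedom from the
# two sample variances instead of pooling them.
t_stat, p = stats.ttest_ind(sys_a, sys_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p:.4f}")   # small p -> the means differ
```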
What Confidence Interval to Use?
•Depends on the loss if wrong
•Sometimes 50% confidence is enough, or 99% confidence is not
enough
Regression Models
•A regression model predicts a random variable as a function of
other variables.
 Response variable: the estimated variable
 Predictor variables, predictors, factors: variables used to
predict the response
 All must be quantitative variables to do calculations
•What is a good model?
 Minimize the distance between predicted value and observed
values
Linear Regression Model 1/2
The model predicts the response as ŷ = b0 + b1·x, where x is the predictor and b0, b1 are the regression parameters
Linear Regression Model 2/2
e_i = y_i − ŷ_i, where ŷ_i = b0 + b1·x_i

SSE = Σ_{i=1..n} e_i² = Σ_{i=1..n} (y_i − b0 − b1·x_i)²

Σ_{i=1..n} e_i = Σ_{i=1..n} (y_i − b0 − b1·x_i) = 0
Estimation of Model Parameters 1/2
The least-squares estimates are
b1 = (Σx·y − n·x̄·ȳ) / (Σx² − n·x̄²)
b0 = ȳ − b1·x̄
(a sketch follows)
Estimation of Model Parameters 2/2
[Scatter plot: CPU time in msec versus number of disk I/Os, two series (the observations and the fitted line)]
Allocation of Variation 1/2
•The purpose of a model is to predict the response with
minimum variability.
 No parameters: Use mean of response variable
 One parameter linear regression model: How good?
Error: e_i = y_i − ȳ

Variance of errors without regression:
(1/(n−1)) Σ_{i=1..n} e_i² = (1/(n−1)) Σ_{i=1..n} (y_i − ȳ)² = variance of y

SSE without regression = Σ_{i=1..n} (y_i − ȳ)²
Allocation of Variation 2/2
SST (total variation) = SSR (variation explained by the regression) + SSE (unexplained error), so the regression's quality can be judged by R² = SSR/SST = (SST − SSE)/SST
Confidence Intervals for Regression Parameters
•b0 and b1 are estimates from a single sample of size n
•b0 and b1 are "statistics"
•The computed values of b0 and b1 are the means of their sampling distributions
•Obtain standard deviations for b0 and b1:

s_b0 = s_e · [1/n + x̄² / (Σx² − n·x̄²)]^(1/2)
s_b1 = s_e / (Σx² − n·x̄²)^(1/2)

•Confidence intervals are
b0 ± t·s_b0 and b1 ± t·s_b1, where t = t[1−α/2; n−2]
CI for Regression Parameters Example
Verifying Regression Assumptions 1/2
•Assumption: Linear relationship between x and y
If you plot your two variables, you might find many possible
results:
1. No visible relationship
2. Linear trend line
3. Two or more lines (piecewise)
4. Outlier(s)
5. Non-linear trend curve
Only the second case can use a linear model
The third case could use multiple linear models
Verifying Regression Assumptions 2/2
•Assumption: independent errors
Make a scatter plot of errors versus response values
If you see a trend, you have a problem → dependence of errors on the predictor value
•Assumption: the predictor variable x is nonstochastic and is measured without any error
Limits to Simple Linear Regression
•Only one predictor is used
•The predictor variable must be quantitative (not categorical)
•The relationship between response and predictor must be linear
•The errors must be normally distributed
•Visual tests:
 Plot x vs. y: Anything odd?
 Plot residuals: Anything odd?
 Residual plot especially helpful if there is a meaningful order
to the measurements
Other Regression Models
•Multiple Linear Regression
More than one predictor variable is used
•Categorical Predictors
Predictor variables not quantitative but represent categories
CPU type or disk type
•Curvilinear Regression
Nonlinear relationship between response and predictors
•Transformations
Errors are not normally distributed or variance is not
homogeneous
Multiple Linear Regression 1/2
The model in matrix form is y = X·b + e, i.e.

y_i = b0 + b1·x_1i + b2·x_2i + ... + bk·x_ki + e_i

where y is the n×1 vector of observed responses, X is the n×(k+1) matrix whose first column is all 1s and whose remaining columns hold the k predictor values for each observation, b = (b0, b1, ..., bk)ᵀ is the parameter vector, and e is the n×1 error vector.
Multiple Linear Regression 2/2
Parameter estimation: b = (XᵀX)⁻¹ Xᵀy

SSE = yᵀy − bᵀXᵀy
SST = SSY − SS0, where SSY = Σ y_i² and SS0 = n·ȳ²
R² = (SST − SSE) / SST

Standard deviation of errors: s_e = √(SSE / (n − k − 1))
Standard deviation of b_j: s_bj = s_e·√(c_jj), where c_jj is the jth diagonal term of (XᵀX)⁻¹
SSE degrees of freedom = n − k − 1
Confidence intervals are computed using t[1−α/2; n−k−1]

Prediction: for the mean of m future observations,
standard deviation of the prediction s = s_e·√(1/m + x_pᵀ(XᵀX)⁻¹x_p)
Multiple Linear Regression Example 1/3
Observations of disk I/Os, memory sizes, and CPU times for 7 programs:

CPU Time   Disk I/O   Memory Size
2          14         70
5          16         75
7          27         144
9          42         190
10         39         210
13         50         235
20         83         400
Multiple Linear Regression Example 2/3
•Find a linear equation to estimate CPU time
CPU time = b0+b1(Disk I/O)+b2(Memory Size)
•b = (-0.1614,0.1182,0.0265)T
•We can compute the error terms from this (difference between the
regression formula and the actual observations)
•Compute R2 = SSR/SST=0.97 (regression explains 97% of
variation of y)
•Coefficient of Multiple Correlation = R = 0.97^0.5 = 0.99
•Standard deviation of errors = √(SSE/(n−3)) = 1.2
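A sketch checking these numbers via the normal equations:

```python
import numpy as np

disk = np.array([14.0, 16, 27, 42, 39, 50, 83])
mem = np.array([70.0, 75, 144, 190, 210, 235, 400])
cpu = np.array([2.0, 5, 7, 9, 10, 13, 20])

X = np.column_stack([np.ones_like(disk), disk, mem])
b = np.linalg.solve(X.T @ X, X.T @ cpu)              # (b0, b1, b2)
sse = np.sum((cpu - X @ b) ** 2)
sst = np.sum((cpu - cpu.mean()) ** 2)
se = np.sqrt(sse / (len(cpu) - 3))                   # n - k - 1 = 4
print("b =", np.round(b, 4))                         # ~ (-0.1614, 0.1182, 0.0265)
print("R^2 =", round(1 - sse / sst, 2), " s_e =", round(se, 1))  # ~ 0.97, 1.2
```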
Multiple Linear Regression Example 3/3
•The 90% confidence intervals for the parameters are (-2.11, 1.79),
(-0.29, 0.53), and (-0.06, 0.11), respectively.
 What does this mean?
•What is the predicted CPU time for 100 disk I/Os and a memory
size of 550?
y = −0.1614 + 0.1182 × 100 + 0.0265 × 550 = 26.2375
Standard deviation of the predicted observation = 3.3435
90% confidence interval, using a t value of 2.132: (19.1096, 33.3363)
Standard deviation for the mean of a large number of observations = 3.1385 → 90% confidence interval = (19.5467, 32.9292)
Analysis of Variance (ANOVA) 1/5
Analysis of Variance (ANOVA) 2/5
Analysis of Variance (ANOVA) 3/5
•Assume errors are independent and normally distributed, and all are identically distributed (same mean and variance)
•The y's are normally distributed since the x's are nonstochastic (measured without errors)
•A sum of squares of normal variates has a chi-square distribution
•Given two sums of squares SSi and SSj with vi and vj degrees of freedom, the ratio (SSi/vi) / (SSj/vj) has an F distribution
•The hypothesis that SSi is less than or equal to SSj is rejected at the α significance level if the ratio is greater than the 1−α quantile of the F-variate
•Compare the ratio with F[1−α; vi, vj]
Analysis of Variance (ANOVA) 4/5
•MSR = SSR/k = mean square of the regression
Any sum of squares divided by its degrees of freedom gives
the corresponding mean square
•MSE = SSE/(n-k-1)
•MSR/MSE has an F[k, n−k−1] distribution
An F distribution with k numerator degrees of freedom and n−k−1 denominator degrees of freedom
•If MSR/MSE is greater than the value read from the F-table, the predictor variables are assumed to explain a significant fraction of the response variation (reject the null hypothesis)
•This procedure is known as the F-test (a sketch follows)
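A sketch of the decision rule; the SSR and SSE values are back-computed from the example's R² ≈ 0.97, so they are approximations rather than exact slide values.

```python
from scipy import stats

def f_test(ssr, sse, k, n, alpha=0.05):
    msr, mse = ssr / k, sse / (n - k - 1)         # mean squares
    ratio = msr / mse
    crit = stats.f.ppf(1 - alpha, k, n - k - 1)   # F[1-alpha; k, n-k-1]
    return ratio, crit

ratio, crit = f_test(ssr=199.5, sse=6.2, k=2, n=7)
print(f"MSR/MSE = {ratio:.1f} vs F[0.95; 2, 4] = {crit:.2f} "
      f"-> {'reject' if ratio > crit else 'keep'} the null hypothesis")
```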
Analysis of Variance (ANOVA) 5/5
•In simple linear regression there is one predictor variable, and the F-test reduces to testing b1 = 0
•If the confidence interval for b1 does not include zero, the parameter is nonzero and the regression explains a significant part of the response variation
•In that case the F-test is not required
ANOVA Example 1/2
ANOVA calculations for disk-memory-CPU example
Computations performed column by column from the left
ANOVA Example 2/2
ANOVA calculations for disk-memory-CPU example
•The regression passes the F-test
The computed ratio > the value from the F-table
Yet none of the regression parameters are significantly different from zero?
•We have very high confidence that the regression model has predictive ability
Problem of Multicollinearity
•Dilemma: None of our parameters are significant, yet the
model is!!
•The problem is that the two predictors (memory and disk I/O) are
correlated (R = .9947).
•Next we test if the two parameters each give significant regression
on their own.
We already did this for the Disk I/O regression model, and
found that it alone accounted for about 97% of the variance.
We get the same result for memory size.
•Conclusion: Each predictor alone gives as much predictive power
as the two together!
•Moral: Adding more predictors is not necessarily better in terms of
predictive ability (aside from cost considerations).
Curvilinear Regression 1/3
•Verify the validity of the linear regression model
Do a scatter plot of response vs. predictor to see if it is linear
•Often you can convert to a linear model and use standard linear regression
Take the log when the curve looks like y = b·x^a → ln y = ln b + a·ln x
Other transformable forms include y = a + b/x and y = x/(a + b·x)
Curvilinear Regression 2/3
•Example: Amdahl's law says the I/O rate is proportional to the processor speed
 I/O rate = c·(MIPS rate)^b1
 Taking logs, we get log(I/O rate) = log c + b1·log(MIPS rate) → b0 = log c
 Using standard linear regression, we find that the regression explains 84% of the variation
No   MIPS    I/O Rate
1    19.63   288.60
2    5.45    117.30
3    2.63    64.60
4    8.24    356.40
5    14.00   373.20
6    9.87    281.10
7    11.27   149.60
8    10.13   120.60
9    1.01    31.10
10   1.26    23.70
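A sketch of the log-log fit on this data (base-10 logs); it reproduces the parameter values on the next slide, b0 ≈ 1.423 and b1 ≈ 0.888, with R² ≈ 0.84.

```python
import numpy as np

mips = np.array([19.63, 5.45, 2.63, 8.24, 14.00, 9.87, 11.27, 10.13, 1.01, 1.26])
io = np.array([288.60, 117.30, 64.60, 356.40, 373.20, 281.10, 149.60, 120.60, 31.10, 23.70])

# Fit log10(I/O rate) = b0 + b1 * log10(MIPS) by least squares.
b1, b0 = np.polyfit(np.log10(mips), np.log10(io), deg=1)
pred = b0 + b1 * np.log10(mips)
resid = np.log10(io) - pred
r2 = 1 - np.sum(resid ** 2) / np.sum((np.log10(io) - np.log10(io).mean()) ** 2)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, R^2 = {r2:.2f}")
```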
Curvilinear Regression 3/3
Parameter   Mean    Standard Deviation   Confidence Interval
b0          1.423   0.119                (1.20, 1.64)
b1          0.888   0.135                (0.64, 1.14)
Common Mistakes with Regression
•Verify that the relationship is linear. Look at a scatter diagram
•Don't worry about absolute value of parameters. They are totally
dependent on an arbitrary decision regarding what dimensions to
use
CPU time (sec) = 0.01 (disk I/Os)+0.001(Memory in KB)
•Always specify confidence intervals for parameters and coefficient
of determination
•Test for correlation between predictor variables, and eliminate redundancy. Test to see which subset of the possible predictors is "best" depending on cost vs. performance
•Don't make predictions too far outside the measured range
Experimental Design
"The goal of a proper experimental design is to obtain the maximum information with the minimum number of experiments"
•Determine the effects of various factors on performance
•Does a factor have a significant effect?
•An experimental design consists of specifying
 The number of experiments
 The factor/level combinations of each experiment
 The number of replications of each experiment
Potential Pitfalls in Experimentation
•All measured values are random variables. Must account for
experimental error
•Control for important parameters
Example: User experience in a user interface study
•Must be able to allocate variation to the different factors
•There is a limit to the number of experiments you can perform
Some designs give more information per experiment.
•One-factor-at-a-time studies do not capture factor interactions.
Types of Experimental Design 1/3
•Simple designs
One factor is varied at a time
Inefficient → does not capture interactions → not recommended
Given k factors, with the ith factor having n_i levels, this needs
n = 1 + Σ_{i=1..k} (n_i − 1) experiments
Types of Experimental Design 2/3
•Full factorial design
 Perform experiment(s) for every combination of factors at every level
Many experiments required
Captures interactions
If too expensive, reduce the number of levels (ex: a 2^k design → k factors used at 2 levels each), reduce the number of factors, or take another approach
Given k factors, with the ith factor having n_i levels, this needs
n = Π_{i=1..k} n_i experiments
Types of Experimental Design 3/3
•Fractional Factorial Design
Only run experiments for a carefully selected subset of the
full factorial combinations
2^k Factorial Designs
•A 2^k experimental design is used to determine the effect of k factors, each of which has two alternatives or levels
Used to prune out the less important factors for further study
 Pick the highest and lowest levels for each factor
 Assumes that a factor's effect is unidirectional
Performance either continuously decreases or continuously increases as the factor is increased from its minimum to its maximum
2² Factorial Designs 1/2
2² Factorial Designs 2/2
Allocation of Variation
General 2^k Factorial Designs 1/2
A 2^k design measures:
k main effects
k! / ((k−2)!·2!) two-factor interactions
k! / ((k−3)!·3!) three-factor interactions
General 2^k Factorial Designs 2/2
2^k·r Factorial Designs
•Experiments are observations of random variables
•With only one observation, we can't estimate the error
•If we repeat each experiment r times, we get 2^k·r observations
•With a 2-factor model, we now have
y = q0 + qA·xA + qB·xB + qAB·xA·xB + e
where e is the experimental error
Use the sign table, and put in the y column the sample means of the r measurements at the given factor levels
2^k·r Factorial Designs Example
e_ij = y_ij − ŷ_i

SSE = Σ_{i=1..2²} Σ_{j=1..r} e_ij²

SST = 2²·r·qA² + 2²·r·qB² + 2²·r·qAB² + Σ_{i,j} e_ij²
SST = SSA + SSB + SSAB + SSE

Factor A explains 79%, B explains 15.4%, AB explains 4%, and error explains about 1.5%.
Confidence Intervals for Effects
The variance of the errors is computed from SSE:
s_e² = SSE / (2²·(r − 1))

Degrees of freedom = 2^k·(r − 1), since the r error terms at each factor-level combination must add up to zero

Standard deviation of each effect:
s_q0 = s_qA = s_qB = s_qAB = s_e / √(2²·r)

Confidence intervals for the effects: q_i ± t[1−α/2; 2²(r−1)] · s_qi

Confidence intervals for the previous example (a sketch follows):
q0 → (39.08, 42.91)
qA → (19.58, 23.41)
qB → (7.58, 11.41)
qAB → (3.08, 6.91)
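A sketch of the whole 2²·r computation; the replication data below is an assumption, chosen to be consistent with the effects, percentages, and intervals quoted above.

```python
import numpy as np
from scipy import stats

# Rows follow the sign-table order: (A, B) = (-1,-1), (+1,-1), (-1,+1), (+1,+1)
y = np.array([[15.0, 18, 12], [45, 48, 51], [25, 28, 19], [75, 75, 81]])
r = y.shape[1]
ybar = y.mean(axis=1)                      # sample mean per factor combination

signs = {"A": [-1, 1, -1, 1], "B": [-1, -1, 1, 1], "AB": [1, -1, -1, 1]}
q0 = ybar.mean()
q = {k: np.dot(s, ybar) / 4 for k, s in signs.items()}

sse = np.sum((y - ybar[:, None]) ** 2)
sst = sse + sum(4 * r * qi ** 2 for qi in q.values())
se2 = sse / (4 * (r - 1))                  # variance of errors
sq = np.sqrt(se2 / (4 * r))                # std dev of each effect
t = stats.t.ppf(0.95, 4 * (r - 1))         # for 90% confidence intervals

print(f"q0 = {q0:.2f}, CI ({q0 - t*sq:.2f}, {q0 + t*sq:.2f})")
for k, qi in q.items():
    print(f"q{k} = {qi:.2f}, explains {100 * 4 * r * qi**2 / sst:.1f}%, "
          f"CI ({qi - t*sq:.2f}, {qi + t*sq:.2f})")
print(f"errors explain {100 * sse / sst:.1f}%")
```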
2^(k−p) Fractional Factorial Designs
Selecting The Experiments
•The sum of the squares of the entries in each sign-table column is 2^(k−p)
qA = (−y1 + y2 − y3 + y4 − y5 + y6 − y7 + y8) / 8
2^(k−p) Fractional Factorial Designs Example
Assumptions
•Big assumption: these experiments only provide so much information
•The effects due to interaction are summed into the values calculated for the separate variables
•The experiments are masking these "confounding" interactions
The effects whose influence cannot be separated are said to be confounded
•If the interaction effects are small, this is OK
One-Factor Experiments 1/2
One-Factor Experiments 2/2
There are r observations for each of the a alternatives, a total of a·r observations arranged in an r×a matrix; the r observations belonging to the jth alternative form a column vector.

Grand mean: ȳ.. = (1/(a·r)) Σ_{i,j} y_ij
Column mean: ȳ.j
Effect of alternative j: α_j = ȳ.j − ȳ..
One-Factor Experiments Example
Effects of 3 processors, observed 5 times each
•First alternative requires 13.3 bytes less than an average processor
•Third alternative requires 37.7 bytes more than an average
processor
Errors and Variation
For each observation, the error is the difference between the observation and the sum of the grand mean and the alternative effect. The grand mean is 187.7 and R's effect is −13.3, so the first error term is 144 − (187.7 − 13.3) = −30.4.

SSY = 144² + 120² + ... = 633,639
SS0 = a·r·ȳ..² = 3 × 5 × (187.7)² = 528,281.7
SSA = r·Σ_j α_j² = 10,992.1
SST = SSY − SS0 = 105,357.3
SSE = SST − SSA = 94,365.2

Percentage of variation explained by the alternatives = 10,992.1 / 105,357.3 = 10.4%
89.6% is due to experimental errors
ANOVA
Two-Factor Full Factorial Design 1/2
•Two parameters are carefully controlled and varied to study their effect on performance
Compare several processors using several workloads

y_ij = μ + α_j + β_i + e_ij

y_ij = observation with factor A at level j and factor B at level i
μ = mean response
α_j = effect of factor A at level j
β_i = effect of factor B at level i
Σ_j α_j = 0 and Σ_i β_i = 0
Two-Factor Full Factorial Design 2/2
Factors A and B with a and b levels require a·b experiments; the a·b observations are arranged in a b×a matrix. Columns correspond to levels of A and rows correspond to levels of B.

Grand mean: ȳ.. = (1/(a·b)) Σ_{i,j} y_ij
α_j = ȳ.j − ȳ..
β_i = ȳi. − ȳ..
Two-Factor Full Factorial Example
Effects of 3 different cache choices, 5 different workloads
For each row (or column), compute mean of observations in that
row (or column)
Difference between a row (or column) mean and overall mean
gives row (or column) effect
Estimating Experimental Errors
Estimated response: ŷ_ij = μ + α_j + β_i
e_ij = y_ij − ŷ_ij

SSE = Σ_{i=1..b} Σ_{j=1..a} e_ij²

SSE = 3.5² + 0.2² + ... + (−2.4)² = 236.8
Allocation of Variance
SSY = SS0 + SSA + SSB + SSE

SSY = Σ_{i,j} y_ij² = 91,595
SS0 = a·b·μ² = 3 × 5 × (72.2)² = 78,192.59
SSA = b·Σ_j α_j² = 12,857.20
SSB = a·Σ_i β_i² = 308.40
SST = SSY − SS0 = 13,402.41
SSE = SST − SSA − SSB = 236.80

Percentage of variation explained by the caches = 12,857.20 / 13,402.41 = 95.9%
Percentage of variation due to workloads = 2.3%
Unexplained variation (due to errors) = 1.8%
ANOVA
Precautions
Quantile-Quantile Plots
•Used to determine the distribution of the data
Plot observed quantiles versus theoretical quantiles
•Assume y_i is the observed q_i-th quantile
•Using the theoretical distribution, the q_i-th quantile x_i is computed and a point is plotted at (x_i, y_i)
•To determine x_i, we need to invert the CDF: q_i = F(x_i) → x_i = F⁻¹(q_i)
For N(0,1) → x_i = 4.91·[q_i^0.14 − (1 − q_i)^0.14]
For N(μ, σ), scale to μ + σ·x_i before plotting
Quantile-Quantile Plots Example
•The modeling errors for eight predictions of a model were found to be
{−0.04, −0.19, 0.14, −0.09, −0.14, 0.19, 0.04, and 0.09}
[Q-Q plot: residual quantiles (roughly −0.4 to 0.4) versus normal quantiles (roughly −1.6 to 1.4)]
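A sketch of the plotted coordinates using the slides' N(0,1) approximation; taking q_i = (i − 0.5)/n for the ith smallest of n residuals is an assumed (though common) convention, not stated on the slides.

```python
import numpy as np

y = np.sort([-0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.04, 0.09])
n = len(y)
q = (np.arange(1, n + 1) - 0.5) / n
x = 4.91 * (q ** 0.14 - (1 - q) ** 0.14)     # approximate normal quantiles
for xi, yi in zip(x, y):
    print(f"({xi:+.2f}, {yi:+.2f})")          # points to plot on the Q-Q graph
```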
Q-Q Plot For Two-Factor Example
[Q-Q plot: residual quantiles (roughly −10 to 10) versus normal quantiles (roughly −2 to 2)]
Introduction to Simulation
•If system not available, a simulation model provides an easy way
to predict the performance or compare several alternatives
•If system is available, a simulation model may be preferred over
measurements (Why?)
•Simulation models might fail!
Produce no useful results
Produce misleading results
Common Mistakes in Simulation
•Inappropriate level of detail
•Improper implementation language
•Unverified models
•Invalid models (may not represent the real system correctly)
•Improperly handled initial conditions
•Too short simulations
•Poor random-number generators
•Improper selection of seeds
Terminology 1/3
•State variables
variables whose values define the state of the system
e.g., length of job queue in a CPU scheduling simulation
•Event
a change in the system state
•Continuous-time and discrete-time models
Continuous  system state defined at all times (e.g., CPU
scheduling)
Discrete  system state defined at particular instants of
time
Terminology 2/3
•Continuous-state and discrete-state models
Determined by the nature of the state variables
Time spent by students studying a certain subject can take an infinite number of values (continuous-state model)
Queue length in a CPU scheduling simulation can only assume integer values (discrete-state model)
Discrete-state → discrete-event model
Continuity of time does not imply continuity of state
•Deterministic and probabilistic models
Output of a model can be predicted with certainty 
deterministic
Different results on repetition for same set of input
parameters  probabilistic
Terminology 3/3
•Static and dynamic models
System state changes with time?
•Open and closed models
Input external to the model and independent of it  open
No external input  closed
•Stable and unstable models
A steady state is reached that is independent of time → stable
Programming Language Selection
•Simulation language
•General-purpose language
•Extension of a general-purpose language
Extensions as a collection of routines to handle tasks
commonly required in simulations
•Simulation package
Main drawback: inflexibility
Types of Simulation 1/3
•Emulation
A simulation using hardware or firmware
e.g., Terminal emulator, processor emulator.
•Monte Carlo Simulation
Static simulation or one without a time axis
Models probabilistic phenomena that do not change characteristics with time
Evaluate non-probabilistic expressions using probabilistic
models
Types of Simulation 2/3
•Trace-driven simulation
Use a trace as input
A trace is a time-ordered record of events on a real system
e.g., A trace of page reference patterns of key programs
can be input to compare different memory management
schemes
Advantages: (among others) credibility, easy validation,
and accurate workload
Disadvantages: (among others) complexity and being a
single point of validation
Types of Simulation 3/3
•Discrete-event simulation
Components
Event scheduler
simulation clock and a time advancing mechanism
Unit time versus event-driven approaches
System state variables
Event routines
Input routines, report routines, initialization routines,
and trace routines
Dynamic memory management
Analysis of Simulation Results
•Model Verification Techniques
•Model Validation Techniques
•Transient Removal
•Stopping Criteria
Model Verification Techniques
•Top-down modular design
•Anti-bugging
•Structured walk-through
•Deterministic models
•Run simplified cases
•Trace: time-ordered list of events and associated variables
•On-Line graphic displays
•Continuity test
•Degeneracy tests
•Consistency tests
•Seed independence
Model Validation Techniques
•Validate three key aspects of the model
Assumptions
Input parameter values and distributions
Output values and conclusions
•Each of the three aspects may be subjected to a validity test by comparing it with results obtained from these possible sources:
Expert intuition
Real system measurements
Theoretical results
Transient Removal 1/5
•Steady-state performance is of interest
•What constitutes the transient state?
•Methods for transient removal are heuristic
Long runs
Proper initialization
Truncation
Initial data deletion
Moving average of independent replications
Batch means
Transient Removal 2/5
•Truncation
Variability during the steady state is less than that during the transient state
Measure variability in terms of the range: the min and max of the observations
Observations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 9
Given a sample of n observations:
Ignore the first l observations
Calculate the min and max of the remaining n − l observations
Repeat, increasing l, until the (l+1)th observation is neither the min nor the max of the remaining observations (a sketch follows)
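A sketch of this heuristic, run on the slide's observation sequence:

```python
def transient_length(obs):
    for l in range(1, len(obs)):
        rest = obs[l:]                       # ignore the first l observations
        if min(rest) < rest[0] < max(rest):  # (l+1)th obs neither min nor max
            return l
    return len(obs)

obs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 9]
l = transient_length(obs)
print(f"transient length = {l}, steady-state observations = {obs[l:]}")
```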
Transient Removal 3/5
•Initial data deletion
Average does not change much as observations are
deleted
Randomness in observations causes averages to change
slightly even during steady state
Average across replications (complete run with no change
in input parameter values, seed value is different)
Get a mean trajectory across replications
Get overall mean
Delete the first l observations (init l to 1), and get overall mean
for n-l values
Compute relative change in overall mean
Determine the knee where the relative change graph stabilizes
Transient Removal 4/5
•Moving average of independent replications
Similar to initial data deletion but computes mean over a
moving time interval window instead of the overall mean
Get a mean trajectory across replications
Plot a trajectory of moving average of successive 2k+1
values (init k to 1)
Repeat with k= 2,3,… until plot sufficiently smooth
Find the knee of the plot
Transient Removal 5/5
•Batch means
Run a very long simulation and later divide into several
parts of equal duration (batch)
Study the variance of batch means as a function of batch
size
A long run of N observations divided into m batches of
size n each (start with n=2 for example)
For each batch, compute a batch mean
Compute overall mean
Compute variance of batch means
Increase n and repeat
Plot variance as a function of n
Stopping Criteria: Variance Estimation 1/3
•Simulation could be run until confidence interval for the mean
response narrows to a desired width
•Can obtain variance of sample mean = variance of
observations/n
Valid if observations are independent
Observations in most simulations are not independent
•To correctly compute the variance of the mean of correlated
observations
Independent replications
Batch means
Stopping Criteria: Variance Estimation 2/3
•Independent replications
Means of independent replications are independent
though observations in a single replication are correlated
Conduct m replications of size n+n0 (n0 is length of
transient phase)
Compute a mean for each replication
Compute an overall mean for all replications
Calculate the variance of replicate means
Calculate confidence interval for mean response
Will discard mn0 initial observations
Stopping Criteria: Variance Estimation 3/3
•Batch means
Run a long simulation run, discard initial transient
interval, and divide remaining observations into several
batches
Compute means for each batch
Compute overall mean
Calculate variance of batch means
Calculate confidence interval for mean response
Will discard n0 observations (less waste)
Calculate covariance of successive batch means to find
correct batch size n
Covariance should be small compared to the variance (a sketch follows)
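A sketch of the batch-means check on synthetic correlated output; the AR(1) series standing in for simulation observations is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.empty(50_000)
x[0] = 0.0
for i in range(1, len(x)):
    x[i] = 0.9 * x[i - 1] + rng.normal()   # positively correlated observations

for n in (10, 100, 1000):                  # candidate batch sizes
    m = len(x) // n
    means = x[: m * n].reshape(m, n).mean(axis=1)
    var = means.var(ddof=1)
    cov = np.cov(means[:-1], means[1:])[0, 1]   # successive batch means
    print(f"batch size {n:4d}: var(overall mean) = {var / m:.6f}, "
          f"cov/var of batch means = {cov / var:+.3f}")
```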