Linear regression


Normal Distribution
Normal distribution
We often meet situations in which the frequency of an occurrence depends on
its distance from the mean value: values close to the mean are very frequent,
values far from the mean are less frequent. Examples:
- Human height, temperature, machined products, sales, financial data
[Figure: normal curve with the lower tail and upper tail marked; horizontal axis from -3 to 3]
Normal distribution
[Figure: normal curve on an axis from -3 to 3; μ marks the position of the curve, σ its spread]
Normal distribution
[Figure: normal curves with μ = 0, μ = -1 and μ = -2 on an axis from -3 to 3]
Normal distribution
[Figure: normal curves on an axis from -3 to 3; a smaller σ gives a narrower curve, a larger σ a wider curve]
Standard normal curve
Z distribution: μ = 0, σ = 1
Area under the curve = 1
[Figure: standard normal curve on an axis from -3 to 3]
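Standard normal curve
Z distribution: μ = 0, σ = 1
Cumulative area to the left of z:
z = -3: 0.0013
z = -2: 0.0228
z = -1: 0.1587
z = 0: 0.5
z = 1: 0.8413
z = 2: 0.9772
z = 3: 0.9987
Area between neighboring z values:
-3 to -2: 0.0214
-2 to -1: 0.1359
-1 to 0: 0.3413
0 to 1: 0.3413
1 to 2: 0.1359
2 to 3: 0.0214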
Standard normal curve
Z distribution: μ = 0, σ = 1; the area on each side of the mean is 0.5.
• 68.26% of the area lies between -1σ and +1σ,
• 95.44% of the area lies between -2σ and +2σ,
• 99.74% of the area lies between -3σ and +3σ.
Apple corporation 2012 Daily Returns
Z distribution: μ = 0.11%, σ = 1.84%
1. What is the probability, for any given day, of a return greater than 0.5%?
2. What is the probability, for any given day, of a loss greater than 2%?
3. What is the probability, for any given day, of a return between 0 and 1%?
4. What is the probability, for any given day, of a gain or loss greater than 3%?
Daily return at each z value:
z        -3       -2       -1       0       1       2       3
return   -5.41%   -3.57%   -1.73%   0.11%   1.95%   3.79%   5.63%
Apple corporation 2012 Daily Returns
1. What is the probability, for any given day, of a return greater than 0.5%?
Z distribution: μ = 0.11%, σ = 1.84%
\( z = \frac{x - \mu}{\sigma} = \frac{0.5 - 0.11}{1.84} = 0.21 \)
Area below z = 0.21: = norm.dist(0.21,0,1,TRUE) = 0.58
Area above z = 0.21: 1 - 0.58 = 0.42
Apple corporation 2012 Daily Returns
2. What is the probability, for any given day, of a loss greater than 2%?
Z distribution: μ = 0.11%, σ = 1.84%
\( z = \frac{x - \mu}{\sigma} = \frac{-2 - 0.11}{1.84} = -1.15 \)
Area below z = -1.15: = norm.dist(-1.15,0,1,TRUE) = 0.125
Apple corporation 2012 Daily Returns
3. What is the probability, for any given day, of a return between 0 and 1%?
Z distribution: μ = 0.11%, σ = 1.84%
\( z = \frac{x - \mu}{\sigma} = \frac{1 - 0.11}{1.84} = 0.48 \)
\( z = \frac{x - \mu}{\sigma} = \frac{0 - 0.11}{1.84} = -0.06 \)
= norm.dist(0.48,0,1,TRUE) = 0.69
= norm.dist(-0.06,0,1,TRUE) = 0.48
Area between the two z values: 0.69 - 0.48 = 0.21
Apple corporation 2012 Daily Returns
4. What is the probability, for any given day, of a gain or loss greater than 3%?
Z distribution: μ = 0.11%, σ = 1.84%
\( z = \frac{x - \mu}{\sigma} = \frac{3 - 0.11}{1.84} = 1.57 \)
\( z = \frac{x - \mu}{\sigma} = \frac{-3 - 0.11}{1.84} = -1.69 \)
Loss greater than 3%: = norm.dist(-1.69,0,1,TRUE) = 0.046
Gain greater than 3%: = 1 - norm.dist(1.57,0,1,TRUE) = 0.058
Total: 0.046 + 0.058 = 0.104
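The four Apple questions can also be answered without converting to z-scores by hand. The following is a minimal sketch, assuming the returns are normally distributed with the μ and σ quoted on the slides; the variable names are illustrative and not part of the original material. Because it does not round z to two decimals first, the results can differ from the slide values in the third decimal place.

```python
from statistics import NormalDist

# Daily returns in percent; parameters taken from the slides.
returns = NormalDist(mu=0.11, sigma=1.84)

# 1. Return greater than 0.5%: area above x = 0.5.
p_gain_over_half = 1 - returns.cdf(0.5)
# 2. Loss greater than 2%: area below x = -2.
p_loss_over_two = returns.cdf(-2.0)
# 3. Return between 0 and 1%: difference of two cumulative areas.
p_between_0_and_1 = returns.cdf(1.0) - returns.cdf(0.0)
# 4. Gain or loss greater than 3%: both tails added together.
p_move_over_three = returns.cdf(-3.0) + (1 - returns.cdf(3.0))

for label, p in [
    ("return > 0.5%", p_gain_over_half),
    ("loss > 2%", p_loss_over_two),
    ("0% < return < 1%", p_between_0_and_1),
    ("gain or loss > 3%", p_move_over_three),
]:
    print(f"{label}: {p:.3f}")
```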
Mean and variance
Statistical quality control
High Way Paving Inc. is a company specializing in residential road
surfacing using low-noise pavement. Recycled rubber can be added to
asphalt mixtures to reduce road noise. However, resistance to flow
must be maintained within very tight limits, otherwise the mixture may be
too thick or too “watery”. The goal is a viscosity of 3200.
Over several years of production and quality measurements, HWP has
determined that the viscosity population mean and standard deviation are:
μ = 3200
σ = 150
Statistical quality control
During manufacture of each batch of asphalt, the quality control
specialist takes 15 samples of the material and tests the viscosity.
There is no way to test every kg of asphalt (the population). Therefore the
company must take samples.
From those samples HWP must then draw conclusions about the entire
batch.
Statistical quality control
Sample 1: x̄ = 3210.73
Sample 2: x̄ = 3150.13
Sample 3: x̄ = 3345.54
Sample 4: x̄ = 3190.67
Sample 5: x̄ = 3217.90
Sample 6: x̄ = 3301.45
Sample 7: x̄ = 3100.72
Sample 8: x̄ = 3413.01
Sample 9: x̄ = 3023.59
Statistical quality control
Sample   Sample mean (x̄)
1        3210.73
2        3150.13
3        3345.54
4        3190.67
5        3217.90
6        3301.45
7        3100.72
8        3413.01
9        3023.59

Range       Frequency
2950-3049   1
3050-3149   1
3150-3249   4
3250-3349   2
3350-3449   1

We call this the sampling distribution (distribution of sample means).
\( E(\bar{x}) = 3217.08 \)
Viscosity sampling distribution
[Figure: histogram of the nine sample means; horizontal axis: viscosity ranges 2950-3049 to 3350-3449; vertical axis: frequency from 0 to 4.5]
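The frequency table and the expected value of the sample means can be reproduced from the nine sample means listed above. This is a small illustrative sketch; the bin edges follow the ranges on the slide, and the constant names are assumptions made for this example.

```python
from statistics import fmean

# The nine sample means from the quality-control slides.
sample_means = [3210.73, 3150.13, 3345.54, 3190.67, 3217.90,
                3301.45, 3100.72, 3413.01, 3023.59]

# Mean of the sample means, the value the slides label E(x-bar).
expected_value = fmean(sample_means)

# Bins as on the slide: 2950-3049, 3050-3149, ..., 3350-3449.
LOWER_EDGE, BIN_WIDTH, N_BINS = 2950, 100, 5
counts = [0] * N_BINS
for m in sample_means:
    counts[int((m - LOWER_EDGE) // BIN_WIDTH)] += 1

print(f"E(x-bar) = {expected_value:.2f}")
for i, c in enumerate(counts):
    low = LOWER_EDGE + i * BIN_WIDTH
    print(f"{low}-{low + BIN_WIDTH - 1}: {c}")
```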
Viscosity sampling distribution
\( E(\bar{x}) = \mu \) as n → “large”
\( E(\bar{x}) \) = the expected value of x̄
μ = the population mean
If we take many random samples from the population, each with its own sample mean, and then create a
distribution based on all of those sample means, the mean of that sampling distribution is equal to the mean
of the population.
• The expected value of the sampling distribution of x̄ is at best going
to be an estimate of μ,
• We would have to take every sample from the population to match
the population mean perfectly (but what is the point of sampling
then?),
• The best we are going to be able to do is find an interval estimate for
the population mean μ,
• Our interval estimate will be influenced by sample size and the degree
of confidence we are satisfied with.
Standard deviation of x̄ sampling distribution
\( \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \)
σ_x̄ - standard deviation of x̄
σ - standard deviation of population
n - sample size
In this case we know the standard deviation of the population, which is rarely the case.
Standard deviation of x̄ sampling distribution
σ_x̄ - standard deviation of x̄
σ = 150
n = 15
\( \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{150}{\sqrt{15}} = 38.7 \)
With increasing sample size σ_x̄ gets smaller.
Standard deviation of x̄ sampling distribution
A larger sample size decreases the standard error.
The values of x̄ will have less variation and therefore
be closer to μ.
n = 15: \( \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{150}{\sqrt{15}} = 38.7 \)
n = 135: \( \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{150}{\sqrt{135}} = 12.9 \)
n = 500: \( \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{150}{\sqrt{500}} = 6.71 \)
Statistical quality control
Every sample of size n = 15 has the same standard error: \( \sigma_{\bar{x}} = \frac{150}{\sqrt{15}} = 38.7 \)
Sample 1: x̄ = 3210.73, σ_x̄ = 38.7
Sample 2: x̄ = 3150.13, σ_x̄ = 38.7
Sample 3: x̄ = 3345.54, σ_x̄ = 38.7
Sample 4: x̄ = 3190.67, σ_x̄ = 38.7
Sample 5: x̄ = 3217.90, σ_x̄ = 38.7
Sample 6: x̄ = 3301.45, σ_x̄ = 38.7
Sample 7: x̄ = 3100.72, σ_x̄ = 38.7
Sample 8: x̄ = 3413.01, σ_x̄ = 38.7
Sample 9: x̄ = 3023.59, σ_x̄ = 38.7
• The standard error is a standard deviation; it allows us to calculate z-scores
and therefore the area (probability) under the curve for a certain region,
• Any point estimator is an estimate and will contain error,
• This error can be minimized by selecting a large sample from the
population from which to estimate a parameter,
• The error component means we cannot determine a single value of the
parameter; we can only provide a range or interval that may cover the
parameter → confidence interval,
• Most often we do not know the standard deviation and we have to
estimate it.
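The effect of sample size on the standard error is easy to tabulate. The sketch below simply evaluates σ/√n for the three sample sizes used on the slides; the function name `standard_error` is illustrative.

```python
from math import sqrt


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean when the population sigma is known."""
    return sigma / sqrt(n)


SIGMA = 150  # population standard deviation of viscosity (from the slides)

for n in (15, 135, 500):
    print(f"n = {n:3d}: standard error = {standard_error(SIGMA, n):.2f}")
```

Going from n = 15 to n = 135 (nine times as many observations) cuts the standard error to one third, because the sample size enters under a square root.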
Hypothesis testing
\( H_0: \mu = \mu_0 \)
\( H_a: \mu \neq \mu_0 \)
μ is the true mean of the population under analysis
μ0 is the hypothesized mean of the population under analysis
Is the true mean the same as the hypothesized mean?
Example 1
A bottled water company states on the product label that each bottle
contains 355 ml of water. You work for a government agency that
protects consumers by testing product volumes. A sample of 50
bottles is tested. Establish the null and alternative hypotheses.
What is our assumption?
We assume that the 355 ml stated on the bottle is true. So:
\( H_0: \mu = 355 \text{ ml} \)
\( H_a: \mu \neq 355 \text{ ml} \)
If the data indicate the bottles are being
filled properly, then we fail to reject the
null, that is, we fail to reject our assumption. We
are not saying we have proven the null,
just that our assumption held up.
Example 2
According to the United States Department of Agriculture, in 2006 the
average farm size in the state of Texas was 2.3 km². Since the decades-long
trend has been for farm sizes to increase due to large
agribusiness, we want to analyze whether farm size in 2015 is larger than it
was in 2006. Establish the null and alternative hypotheses.
What is our assumption?
We assume that there has been no change in farm size since 2006. We
wish to see if the farm size has increased since 2006. So:
\( H_0: \mu \le 2.3 \text{ km}^2 \)
\( H_a: \mu > 2.3 \text{ km}^2 \)
Example 3
During the 2010-2011 English Premier League season Manchester
United home matches had an average attendance of 74,691. A club
marketing analyst would like to see if attendance decreased during the
most recent season. Establish the null and alternative hypotheses.
What is our assumption?
We assume that attendance remained the same. We wish to see if
attendance has decreased since 2010-2011. So:
\( H_0: \mu \ge 74{,}691 \)
\( H_a: \mu < 74{,}691 \)
Two scenarios - σ known or unknown?
1. The population standard deviation σ is given.
2. The population standard deviation σ is not given and we have to
estimate it with the sample standard deviation s.
• When σ is given (or n > 100) we use the standard normal z-distribution,
• When σ is not given we use a t-distribution with n-1 degrees of
freedom,
• It is better to check the data for normality.
Z-test for a single mean
\( z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} \)
x̄ - sample mean
μ0 - hypothesized population mean
σ - population standard deviation
n - sample size
The Standard Normal Curve
The SNC is also a sampling distribution of the mean x̄:
the distribution of many sample means of a given
sample size.
[Figure: standard normal curve of x̄ on an axis from -3 to 3]
95% Probability Interval
95% of all sample means (x̄) are inside the interval.
The area in the tails is called alpha: α = 5% or 0.05.
Therefore we have 5% left, evenly divided between both tails:
\( \frac{\alpha}{2} = \frac{5\%}{2} = 2.5\% \) or 0.025 in each tail.
95% Probability Interval
By doing so, we can assign z-scores (t-scores) to the upper and lower
boundary of the 95% interval.
If the population’s standard deviation σ is known, we treat the sampling
distribution as a standard normal curve (z-curve).
If not, we have to estimate σ and use the t-distribution (t-curve).
95% of all sample means (x̄) are between -1.96σ and +1.96σ, with α/2 in each tail:
\( \bar{x} \pm 1.96\,\sigma_{\bar{x}} \)
95% Probability Interval
What is the standard deviation of the sampling distribution?
It is the standard error of the mean σ_x̄ (not σ, the standard deviation of the
population).
The standard error of the mean σ_x̄ depends on:
• Sample size: \( \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \)
95% of all sample means (x̄) are within \( \bar{x} \pm 1.96\,\sigma_{\bar{x}} \).
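As an illustration of the 95% interval, the sketch below applies ±1.96 standard errors to the viscosity example from earlier in the deck (μ = 3200, σ = 150, n = 15). Centering the interval on the population mean is an assumption made here for the example; the critical value is obtained from `NormalDist.inv_cdf` in the standard library rather than read from a z-table.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA, N = 3200.0, 150.0, 15   # viscosity example from the slides
ALPHA = 0.05                        # total area left in the two tails

# z-score that leaves alpha/2 in the upper tail (about 1.96).
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)
std_error = SIGMA / sqrt(N)

lower = MU - z_crit * std_error
upper = MU + z_crit * std_error
print(f"z critical = {z_crit:.2f}")
print(f"95% of sample means fall between {lower:.1f} and {upper:.1f}")
```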
Two-tailed z-test rejection region
The critical value is determined by α and by whether we are
using the z- or the t-distribution.
\( H_0: \mu = \mu_0 \)
\( H_a: \mu \neq \mu_0 \)
α = 0.05
With α and σ known we would consult the z-table and
find the corresponding z-scores for a two-tailed test.
Critical values: z = -1.96 and z = +1.96
Rejection regions: below z = -1.96 and above z = +1.96, each with area α/2 = 0.025
Nonrejection region: between the two critical values, centered on μ0
The Student’s T-distribution
1. In general, the t-distribution is shorter
in the middle and fatter in the tails.
2. More probability in the tails, less near
the mean, greater chance of extreme
values.
3. There isn’t just one t-distribution.
4. There is a t-distribution for every
sample size.
5. Degrees of Freedom (n-1).
6. The smaller the sample size, the shorter
and fatter the distribution, more tail
probability.
7. However, as n becomes large, the t-distribution tends to the z-distribution.
Area (probability) under the curve = 1
When we do not know σ we have to
estimate it. This estimation, this
uncertainty, forces us to use the
Student’s T-distribution.
[Figure: t-distribution curve on an axis from -4 to 4]
The Student’s T-distribution 95% Prob. Int.
By doing so, we can assign t-scores to the
upper and lower boundary of the 95%
interval of each sample size.
Area (probability) under the curve = 1
If we do not know the population
standard deviation σ we treat the
sampling distribution as a t-distribution.
Degrees of Freedom (n-1): n = 10, df = 9
95% of all sample means (x̄) are between t = -2.26 and t = +2.26
(\( t_{\alpha/2} \) with df = 9), compared with z = -1.96 and z = +1.96.
95% Distribution comparison
Z-distribution: ±1.96
n     df    Interval
10    9     ±2.262
30    29    ±2.045
75    74    ±1.993
100   99    ±1.984
Sample standard deviation
• When we do not know the population standard deviation, we use the
sample standard deviation to approximate it,
• This approximation comes at a cost, though, in terms of our interval estimate,
• We must use the t-distribution instead of the z-distribution to account for this
estimation of σ,
• Every sample size will have its own t-distribution with degrees of freedom
df = n-1,
• Our standard error will now be:
\( s_{\bar{x}} = \frac{s}{\sqrt{n}} \) where \( s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \)
Different standard errors
• In the previous examples we knew the population standard deviation σ and it
was therefore fixed in the standard error formula,
• This meant that all samples of the same size had the same standard error,
• When σ is unknown we estimate it with the sample standard deviation, s,
• Since every sample has its own s, samples of the same size do not necessarily
have the same standard error,
• The randomness of sample selection is represented in its standard deviation and
therefore its standard error.
The Student’s T-distribution 95% Prob. Int.
The standard deviation of the sampling distribution now is the
standard error of the mean:
\( s_{\bar{x}} = \frac{s}{\sqrt{n}} \)
s_x̄ is largely dependent on sample size.
If the sample size is small, s_x̄ becomes larger
and thus the distribution becomes wider.
95% of all sample means (x̄) are inside the interval.
[Figure: t-distribution curve of x̄ on an axis from -4 to 4]
T-test for a single mean
\( t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{\bar{x} - \mu_0}{s_{\bar{x}}} \)
x̄ - sample mean
μ0 - hypothesized population mean
s - sample standard deviation
n - sample size
Question:
Is this t-test value in the nonrejection
region or the rejection region based on
df = n-1?
Two-tailed t-test rejection region, n = 20
With α given, σ not known and a sample of 20 (df = 19) we
would consult the t-table and find the corresponding
t-scores for a two-tailed test.
\( H_0: \mu = \mu_0 \)
\( H_a: \mu \neq \mu_0 \)
α = 0.05
Critical values: t = -2.093 and t = +2.093
Rejection regions: below t = -2.093 and above t = +2.093, each with area α/2 = 0.025
Nonrejection region: between the two critical values, centered on μ0
General t-distribution properties
1. A smaller sample size means more sampling error.
2. This sampling error due to small n means a higher probability of
extreme sample means.
3. More probability in the tails means the center hump of the t-distribution must come downward.
4. This process shrinks the distribution downward and outward,
thus moving the critical values.
5. Given the same α and s, a smaller n will push the critical values
outward in the tails due to the uncertainty associated with small n.
Hypothesis testing procedure
1. Start with a clear research problem.
2. Establish the hypotheses, null and alternative.
3. Determine the appropriate statistical test and sampling distribution.
4. Choose α.
5. State the decision rule.
6. Gather sample data.
7. Calculate the test statistic.
8. State the statistical conclusion.
9. Make a decision.
Business analyst salaries
A report from 6 years ago indicated that the average gross salary for a
business analyst was $69,873. Since this survey is now outdated, the
Bureau of Labor Statistics wishes to test this figure against current
salaries to see if the current salaries are statistically different from the
old ones.
For this study, the BLS will take a sample of 12 current salaries.
Based on this sample, we found s = $14,985. We do not know σ and
therefore we will estimate it using s.
Business analyst salaries
1. Establish Hypothesis
\( H_0: \mu = \$69{,}873 \)
\( H_a: \mu \neq \$69{,}873 \)
2. Determine Appropriate Statistical Test and Sampling Distribution
This will be a two-tailed test: salaries can be higher or lower.
Since σ is unknown and n is small, we will use the t-distribution:
\( t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \)
Business analyst salaries
3. Specify the error rate (significance level): α = 0.05
4. State the decision rule
For df = 11:
If t > 2.201, reject H0
If t < -2.201, reject H0
[Figure: t-curve centered on μ0 with a rejection region of area α/2 = 0.025 in each tail and the nonrejection region in between]
Business analyst salaries
5. Gather data: n = 12, x̄ = $79,180
6. Calculate the test statistic
x̄ = $79,180
μ0 = $69,873
s = $14,985
n = 12
\( t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{\$79{,}180 - \$69{,}873}{\$14{,}985 / \sqrt{12}} = 2.15 \)
Business analyst salaries
7 and 8. State statistical conclusion
\( H_0: \mu = \$69{,}873 \) OK!
\( H_a: \mu \neq \$69{,}873 \)
Since the t-statistic is in the nonrejection region we fail to reject the
null hypothesis. It is not “out of the ordinary” that this sample
came from a population with μ = $69,873 when df = 11.
[Figure: t-curve centered on μ0 with x̄ inside the nonrejection region; rejection regions of area α/2 = 0.025 in each tail]
Business analyst salaries, n = 15
3. Specify the error rate (significance level): α = 0.05
4. State the decision rule
For df = 14:
If t > 2.145, reject H0
If t < -2.145, reject H0
The nonrejection region shrinks!
[Figure: t-curve centered on μ0 with a rejection region of area α/2 = 0.025 in each tail and the nonrejection region in between]
Business analyst salaries
5. Gather data: n = 15, x̄ = $79,180
6. Calculate the test statistic
x̄ = $79,180
μ0 = $69,873
s = $14,985
n = 15
\( t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{\$79{,}180 - \$69{,}873}{\$14{,}985 / \sqrt{15}} = 2.41 \)
Business analyst salaries
7 and 8. State statistical conclusion
\( H_0: \mu = \$69{,}873 \)
\( H_a: \mu \neq \$69{,}873 \) OK!
The t-statistic is now in the rejection region, so we reject the null hypothesis.
1. The larger n decreased the standard deviation of the sampling
distribution, thus narrowing it and making x̄ stand further out
on its own; it is more likely to belong to a different population
that does not overlap much with μ0. This created separation
between x̄ and μ0.
2. The larger n led to a higher df. That shrank the nonrejection
region.
[Figure: t-curve centered on μ0 with x̄ inside the upper rejection region; rejection regions of area α/2 = 0.025 in each tail]
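The two salary calculations differ only in n, so they can be run side by side. This is a hedged sketch: the helper names `t_statistic` and `two_tailed_reject` are invented for the example, and the critical values are copied from the slides (df = 11 and df = 14 at α = 0.05) rather than computed.

```python
from math import sqrt


def t_statistic(sample_mean: float, mu0: float, s: float, n: int) -> float:
    """One-sample t statistic: (x-bar - mu0) / (s / sqrt(n))."""
    return (sample_mean - mu0) / (s / sqrt(n))


def two_tailed_reject(t: float, t_crit: float) -> bool:
    """Reject H0 when t falls in either tail beyond the critical value."""
    return abs(t) > t_crit


SAMPLE_MEAN, MU0, S = 79_180, 69_873, 14_985

# n -> two-tailed critical t at alpha = 0.05, as quoted on the slides.
critical_values = {12: 2.201, 15: 2.145}

results = {}
for n, t_crit in critical_values.items():
    t = t_statistic(SAMPLE_MEAN, MU0, S, n)
    results[n] = (t, two_tailed_reject(t, t_crit))
    verdict = "reject H0" if results[n][1] else "fail to reject H0"
    print(f"n = {n}: t = {t:.2f}, critical = {t_crit}, {verdict}")
```

The same sample mean and standard deviation lead to opposite conclusions because a larger n both shrinks the standard error and lowers the critical value.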
Starbucks customer satisfaction
Starbucks is interested in assessing customer satisfaction in
Toronto. To conduct the study, Starbucks asks 25 customers in the city:
“Compared to other coffee houses in Toronto, would you say the
customer service at Starbucks is much better than average (5), better
than average (4), average (3), worse than average (2), much worse than
average (1)?” (“Likert scale”)
The mean rating was determined to be 3.5. Based on this sample, the
standard deviation was found to be s = 1.4.
Starbucks customer satisfaction
1. Establish Hypothesis
\( H_0: \mu \le 3 \)
\( H_a: \mu > 3 \)
2. Determine Appropriate Statistical Test and Sampling Distribution
This will be a one-tailed test: we are interested in a better than average rating.
Since σ is unknown and n is small, we will use the t-distribution:
\( t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \)
Starbucks customer satisfaction, n = 25
3. Specify the error rate (significance level): α = 0.01
4. State the decision rule
For df = 24: if t > 2.495, reject H0
[Figure: t-curve centered on μ0 with the rejection region of area α = 0.01 in the upper tail and the nonrejection region below the critical value]
Starbucks customer satisfaction
5. Gather data: n = 25, x̄ = 3.5
6. Calculate the test statistic
x̄ = 3.5
μ0 = 3
s = 1.4
n = 25
\( t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{3.5 - 3}{1.4 / \sqrt{25}} = 1.79 \)
Starbucks customer satisfaction
7 and 8. State statistical conclusion
\( H_0: \mu \le 3 \) OK!
\( H_a: \mu > 3 \)
We fail to reject the null hypothesis that customer satisfaction is
average or below: t = 1.79 (x̄ = 3.5, μ0 = 3) lies in the nonrejection region for α = 0.01.
The p-value method
Based on our α = 0.01 we know that 1% of our
area (probability) is in the upper tail past our
t_crit = 2.495.
In the p-value method, we ask how much area (probability) is
above our test statistic of t = 1.79.
Using a t-table or Excel (T.DIST.RT(1.79,24)) we find that this area is 0.043.
Since this p-value is greater than α = 0.01, we fail to reject H0.
[Figure: t-curve centered on μ0 = 3 with t = 1.79 (x̄ = 3.5) inside the nonrejection region; rejection region of area α = 0.01 in the upper tail]
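Python's standard library has no t-distribution, so the sketch below approximates the upper-tail area by integrating the t density with Simpson's rule. This stands in for Excel's T.DIST.RT and is an illustration only; the function names `t_pdf` and `upper_tail_p` are made up for this example, and the step count is an arbitrary choice.

```python
from math import gamma, pi, sqrt


def t_pdf(t: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail_p(t: float, df: int, steps: int = 2000) -> float:
    """P(T > t) for t >= 0: 0.5 minus the area from 0 to t (Simpson's rule)."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


# Starbucks example: x-bar = 3.5, mu0 = 3, s = 1.4, n = 25.
t_stat = (3.5 - 3.0) / (1.4 / sqrt(25))
p_value = upper_tail_p(t_stat, df=24)
ALPHA = 0.01

print(f"t = {t_stat:.2f}, one-tailed p-value = {p_value:.3f}")
print("reject H0" if p_value < ALPHA else "fail to reject H0")
```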
We want to determine whether our sample mean (330.6) indicates that
this year's average energy cost is significantly different from last year’s
average energy cost of $260.
ANOVA
Suppose we want to compare three sample means (x̄1, x̄2, x̄3) to see if there is a
difference between them.
Question:
Do all three of these means come from a common population?
Is one mean so far away from the other two that it is not from the same
population?
The means are in different locations relative to the overall mean.
We are not asking if they are exactly equal. We are asking if each mean
likely came from the larger overall population.
Null hypothesis:
\( H_0: \mu_1 = \mu_2 = \mu_3 \)
Variability AMONG/BETWEEN the sample means.
Multiple t-test
Pairwise comparison of x̄1, x̄2 and x̄3 means three t-tests, all
with α = 0.05 (Type I error rate at 95% confidence):
\( H_0: \mu_1 = \mu_2 \); α = 0.05
\( H_0: \mu_1 = \mu_3 \); α = 0.05
\( H_0: \mu_2 = \mu_3 \); α = 0.05
The error compounds with each t-test:
\( 0.95 \times 0.95 \times 0.95 = 0.857 \)
\( \alpha = 1 - 0.857 = 0.143 \)
ANOVA: Analysis of Variance
Variability ratio:
• Variability AMONG/BETWEEN the sample means: distance of each sample mean (x̄1, x̄2, x̄3) from the overall mean,
• Variability AROUND/WITHIN the samples: internal spread of each sample.
\( \text{ratio} = \frac{\text{Variance Between}}{\text{Variance Within}} \)
ANOVA: Analysis of Variance
\( \frac{\text{Variance Between}}{\text{Variance Within}} \) → Total Variance Components
Variance Between + Variance Within = Total Variance
Partitioning: separating the total variance into its component parts.
If the variability BETWEEN the means (distance from the overall mean) in the numerator is relatively large compared to
the variance WITHIN the samples (internal spread) in the denominator, the ratio will be much larger than 1. The samples
then most likely do not come from a common population → reject the null hypothesis that the means are equal.
ANOVA: Analysis of Variance
Variance Between / Variance Within:
• LARGE / small = Reject H0: at least one mean is an outlier and each distribution is narrow; distinct from each other.
• similar / similar = Fail to Reject H0: means are close to the overall mean and/or the distributions overlap a bit; hard to distinguish.
• small / LARGE = Fail to Reject H0: the means are close to the overall mean and/or the distributions melt together.
ANOVA: Analysis of Variance
Variance Between + Variance Within = Total Variance
\( F = \frac{\text{Variance Between}}{\text{Variance Within}} \) (the F-ratio!)
ANOVA: Analysis of Variance Example
University study skills
Twenty-one students at the Autonomous University of Madrid (AUM) in Spain were selected for an
informal study about student study skills; 7 first year, 7 second year, and 7 third year undergraduates were
randomly selected.
The students were given a study-skills assessment having a maximum score of 100. As researchers we are
interested in whether or not a difference exists somewhere between the three different year levels. We
will conduct this analysis using a One-Way ANOVA.
ANOVA: Analysis of Variance Example
University study skills
Columns/Groups (random sample within each group):
Year 1 Scores   Year 2 Scores   Year 3 Scores
82              71              64
93              62              73
61              85              87
74              94              91
69              78              56
70              66              78
53              71              87
ANOVA: Analysis of Variance Example
Group means (random sample within each group):
Year 1 Scores: x̄1 = 71.71
Year 2 Scores: x̄2 = 75.29
Year 3 Scores: x̄3 = 76.57
Overall mean (the mean of all 21 scores taken together): \( \bar{\bar{x}} = 74.52 \)
Variance and the sum of squares
Sample variance: \( s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \)
It averages the squared differences between each sample value and the sample mean.
Sum of squares: \( SS = \sum (x - \bar{x})^2 \)
Partitioning the sum of squares:
SST - (total) sum of squares
SSC - (column/groups) sum of squares
SSE - (within/error) sum of squares
ANOVA: Analysis of Variance Example
SST (total) sum of squares: the squared distance of each of the 21 scores from the
overall mean \( \bar{\bar{x}} = 74.52 \), summed over all scores.
\( SST = \sum_{j=1}^{K} \sum_{i=1}^{n_j} (x_{ij} - \bar{\bar{x}})^2 \)
where K is the number of groups and n_j is the number of scores in group j.
ANOVA: Analysis of Variance Example
SSC (column/groups) sum of squares: the squared distance of each group mean
(x̄1 = 71.71, x̄2 = 75.29, x̄3 = 76.57) from the overall mean \( \bar{\bar{x}} = 74.52 \),
weighted by the number of scores in the group.
\( SSC = \sum_{j=1}^{K} n_j (\bar{x}_j - \bar{\bar{x}})^2 \)
where K is the number of groups and n_j is the number of scores in group j.
ANOVA: Analysis of Variance Example
SSE (within/error) sum of squares: the squared distance of each score from the mean
of its own group (x̄1 = 71.71, x̄2 = 75.29, x̄3 = 76.57), summed over all scores.
\( SSE = \sum_{j=1}^{K} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 \)
where K is the number of groups and n_j is the number of scores in group j.
Formulas for ANOVA
SSC - sum of squares (columns): \( df_{columns} = C - 1 \), \( MSC = \frac{SSC}{df_{columns}} \)
SSE - sum of squares (within/error): \( df_{error} = N - C \), \( MSE = \frac{SSE}{df_{error}} \)
SST - sum of squares (total): \( df_{total} = N - 1 \)
\( F = \frac{MSC}{MSE} \)
N = total observations, C = no. of columns
Formulas for ANOVA – our case
N = 21, C = 3
SSC - sum of squares (columns): \( df_{columns} = 2 \), \( MSC = \frac{88.67}{2} = 44.33 \)
SSE - sum of squares (within/error): \( df_{error} = 18 \), \( MSE = \frac{2812.57}{18} = 156.25 \)
SST - sum of squares (total): \( df_{total} = 20 \)
\( F = \frac{44.33}{156.25} = 0.28 \)
ANOVA chart
Source of Variance   df   SS        MS       F
Between (columns)    2    88.67     44.33    0.28
Within (error)       18   2812.57   156.25
Total                20   2901.24
\( F = \frac{MSC}{MSE} = 0.28 \)
Is the F-stat larger than F-crit, \( F_{\alpha,\,df_C,\,df_E} \), with df = 2 and df = 18?
\( F_{0.05,\,2,\,18} \) = F.INV.RT(0.05,2,18) in Excel → F_crit = 3.55
NO. Fail to reject H0. No significant
difference in mean test score by
Year of Student.
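The whole one-way ANOVA table can be recomputed from the 21 scores. The sketch below is a plain-Python illustration of the formulas above; the variable names are chosen for this example, and the critical value 3.55 is taken from the slide rather than computed.

```python
from statistics import fmean

# Study-skills scores by year, from the example table.
groups = {
    "Year 1": [82, 93, 61, 74, 69, 70, 53],
    "Year 2": [71, 62, 85, 94, 78, 66, 71],
    "Year 3": [64, 73, 87, 91, 56, 78, 87],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = fmean(all_scores)

# Between-groups: squared distance of each group mean from the overall
# mean, weighted by the group size.
ssc = sum(len(s) * (fmean(s) - grand_mean) ** 2 for s in groups.values())
# Within-groups: squared distance of each score from its own group mean.
sse = sum((x - fmean(s)) ** 2 for s in groups.values() for x in s)
# Total: squared distance of each score from the overall mean.
sst = sum((x - grand_mean) ** 2 for x in all_scores)

df_columns = len(groups) - 1                 # C - 1
df_error = len(all_scores) - len(groups)     # N - C
msc = ssc / df_columns
mse = sse / df_error
f_stat = msc / mse

F_CRIT = 3.55  # F(0.05, 2, 18) as quoted on the slide
print(f"SSC = {ssc:.2f}, SSE = {sse:.2f}, SST = {sst:.2f}")
print(f"MSC = {msc:.2f}, MSE = {mse:.2f}, F = {f_stat:.2f}")
print("reject H0" if f_stat > F_CRIT else "fail to reject H0")
```

The check `ssc + sse == sst` (up to rounding) is the partitioning of the total sum of squares described earlier.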
Statistical quality control
Sample 1
Sample 2
Sample 3
Sample 4
π‘₯ = 3210.73
π‘₯ = 3150.13
π‘₯ = 3345.54
π‘₯ = 3190.67
Sample 5
Sample 6
Sample 7
Sample 8
Sample 9
π‘₯ = 3217.90
π‘₯ = 3301.45
π‘₯ = 3100.72
π‘₯ = 3413.01
π‘₯ = 3023.59
Standard deviation of π‘₯ sampling distribution
𝜎
𝜎π‘₯ =
𝑛
4.5
4
3.5
3
𝜎π‘₯ - standard deviation of π‘₯
2.5
2
𝜎 - standard deviation of population
1.5
1
n – sample size
0.5
0
2950-3049
3050-3149
3150-3249
In this case we know standard deviation of the population which is rarely the case.
3250-3349
3350-3449
Apple corporation 2012 Daily Returns
1. What is the probability, for any
given day, of a return greater
than 0.5%?
2. What is the probability, for any
given day, of a loss greater than
2%?
3. What is the probability, for any
given day, of a return beween 0
and 1%?
4. What is the probability, for any
given day, of a gain or loss
greater than 3%?
Z distribution
πœ‡ = 0.11 %
𝜎 = 1.84 %
0.5 %
-3
-2
-5.41% -3.57%
-1
-1.73%
0
0.11%
1
1.95%
2
3.79%
3
5.63%
Apple corporation 2012 Daily Returns
3. What is the probability, for any
given day, of a return beween 0 and
1%?
Z distribution
πœ‡ = 0.11 %
𝜎 = 1.84 %
π‘₯βˆ’πœ‡
𝑧=
𝜎
1 βˆ’ 0.11
𝑧=
= 0.48
1.84
π‘₯βˆ’πœ‡
𝑧=
𝜎
0 βˆ’ 0.11
𝑧=
= βˆ’0.06
1.84
= norm.dist(0.48,0,1,TRUE)=0.69
= norm.dist(-0.06,0,1,TRUE)=0.48
0.21
0.21
-3
-2
-5.41% -3.57%
-1
-1.73%
0
0.11%
1
1.95%
2
3.79%
3
5.63%
Z-test for a single mean
π‘₯ βˆ’ πœ‡ 0 π‘₯ βˆ’ πœ‡0
𝑧=
=
𝜎
𝜎
π‘₯ βˆ’ π‘ π‘Žπ‘šπ‘π‘™π‘’ π‘šπ‘’π‘Žπ‘›
πœ‡0 βˆ’ β„Žπ‘¦π‘π‘œπ‘‘β„Žπ‘’π‘ π‘–π‘§π‘’π‘‘ π‘π‘œπ‘π‘’π‘™π‘Žπ‘‘π‘–π‘œπ‘› π‘šπ‘’π‘Žπ‘›
𝜎 βˆ’ π‘ π‘Žπ‘šπ‘π‘™π‘’ π‘ π‘‘π‘Žπ‘›π‘‘π‘Žπ‘Ÿπ‘‘ π‘‘π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘›
𝑛 βˆ’ π‘ π‘Žπ‘šπ‘π‘™π‘’ 𝑠𝑖𝑧𝑒
The Student’s T-distribution
1. In general, the t-distribution is shorter
in the middle and fatter in the tails.
2. More probability in the tails, less near
the mean, grater chance of extreme
values.
3. There isn’t just one t-distribution.
4. There is a t-distribution for every
sample size.
5. Degrees of Freedom (n-1).
6. Smaller the sample size, the shorter
and fatter the distribution, more tail
probability.
7. However as n becomes large, the tdistribution tends to z-distribution.
Area (probability) under the curve = 1
When we do not know 𝜎 we have to
estimate. This estimation, this
uncertainty, forces us to use the
Student’s T-dsitribution.
-4
-3
-2
-1
0
1
2
3
4
T-test for a single mean
π‘₯ βˆ’ πœ‡ 0 π‘₯ βˆ’ πœ‡0
𝑑= 𝑠 =
𝑠π‘₯
𝑛
π‘₯ βˆ’ π‘ π‘Žπ‘šπ‘π‘™π‘’ π‘šπ‘’π‘Žπ‘›
πœ‡0 βˆ’ β„Žπ‘¦π‘π‘œπ‘‘β„Žπ‘’π‘ π‘–π‘§π‘’π‘‘ π‘π‘œπ‘π‘’π‘™π‘Žπ‘‘π‘–π‘œπ‘› π‘šπ‘’π‘Žπ‘›
𝑠 βˆ’ π‘ π‘Žπ‘šπ‘π‘™π‘’ π‘ π‘‘π‘Žπ‘›π‘‘π‘Žπ‘Ÿπ‘‘ π‘‘π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘›
𝑛 βˆ’ π‘ π‘Žπ‘šπ‘π‘™π‘’ 𝑠𝑖𝑧𝑒
Question:
Is this t-test value in the nonrejection
region or the rejection region based on
df = n-1?
95% Distribution comparison
Z-distribution: ±1.96
n      df     Interval
10     9      ±2.262
30     29     ±2.045
75     74     ±1.993
100    99     ±1.984
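A minimal sketch of how the t-test question could be answered in Python. The critical values are copied from the comparison table; the ten measurements and the hypothesized mean of 20 are hypothetical and serve only to exercise the formula, and `one_sample_t` is a name chosen for this example.

```python
import math
from statistics import mean, stdev

# Two-sided 95% critical t-values copied from the comparison table (keyed by df).
T_CRIT_95 = {9: 2.262, 29: 2.045, 74: 1.993, 99: 1.984}

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    s_xbar = stdev(sample) / math.sqrt(n)  # estimated standard error of the mean
    return (mean(sample) - mu0) / s_xbar, n - 1

# Hypothetical sample of n = 10 measurements (not from the slides).
sample = [20.1, 19.4, 21.3, 20.8, 19.9, 20.5, 21.0, 19.7, 20.2, 20.6]
t_stat, df = one_sample_t(sample, mu0=20.0)
reject = abs(t_stat) > T_CRIT_95[df]
print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject}")
```

If |t| exceeds the tabulated boundary for the matching df, the value lies in the rejection region; otherwise it lies in the nonrejection region.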
How can we statistically predict data?
1. Collect a sample (data x).
2. Calculate statistical parameters (mean x̄ and standard deviation σ).
3. Fit a distribution.
4. Predict future results → confidence interval: in what range, with a specific certainty, do we expect future results?
Confidence intervals
Gumball guessing game
The object of the game is to guess how many gumballs are in the jar.
• If you guess within 5 (±5), I will give you the gum plus 50 PLN,
• If you guess within 15 (±15), I will give you the gum plus 15 PLN,
• If you guess within 30 (±30), I will give you the gum.
But before you start guessing, you have to decide: how confident are you?
Confidence interval
• Note that in the preceding examples we used the terms "confident" and "interval" in the form of ±,
• When we estimate a population parameter using a sample statistic, the estimate is never going to be perfect; there will always be an error,
• To estimate a population parameter without an error we would have to include the whole population in the sample → no practical sense,
• We can express that error, or uncertainty, using an interval estimate:
point estimate ± margin of error
C.I. and standard error
• In the previous examples we discussed the standard error of the mean,
• To find the standard error of the mean (SEM) we need to know two things: 1) the population standard deviation and 2) the sample size,
• Most often we do not know the population standard deviation (PSD) and therefore we have to estimate it,
• Also remember that for any PSD, increasing the sample size reduces the standard error,
• So we are left with the idea that the confidence interval is affected by all of these points: the standard deviation, the sample size and the level of "confidence" we are satisfied with.
The Standard Normal Curve
The standard normal curve (SNC) is also a sampling distribution of the mean x̄: the distribution of many sample means of a given sample size.
[Figure: standard normal curve centred on x̄, horizontal axis from −3 to 3]
95% Probability Interval
95% of all sample means (x̄) are in the central region.
The area in the tails is called alpha: α = 5% or 0.05.
Therefore we have 5% left, evenly divided between both tails: α/2 = 5%/2 = 2.5% or 0.025 in each tail.
[Figure: standard normal curve, axis from −3 to 3, with the central 95% region around x̄ and α/2 shaded in each tail]
95% Probability Interval
By doing so, we can assign z-scores (t-scores) to the upper and lower boundary of the 95% interval.
If we know the population standard deviation σ, we treat the sampling distribution as a standard normal curve (z-curve).
If not, we have to estimate σ and use the t-distribution (t-curve).
$\bar{x} \pm 1.96\,\sigma_{\bar{x}}$
[Figure: z-curve, axis from −3 to 3; 95% of all sample means (x̄) lie between −1.96σ and +1.96σ, with α/2 in each tail]
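The ±1.96 boundary follows directly from α. As a rough sketch (the function name `z_boundary` is an assumption of this example), the inverse cumulative distribution of the standard normal curve returns the z-score that leaves α/2 in the upper tail:

```python
from statistics import NormalDist

def z_boundary(confidence):
    """Upper z boundary that leaves alpha/2 in each tail of the standard normal curve."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} interval: z = ±{z_boundary(level):.3f}")
```

Higher confidence levels push the boundary outward, which is why a more confident interval is also a wider one.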
95% Probability Interval
What is the standard deviation of the sampling distribution? The standard error of the mean σ_x̄ (not σ, the standard deviation of the population).
The standard error of the mean σ_x̄ depends on:
• Sample size ($\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$),
$\bar{x} \pm 1.96\,\sigma_{\bar{x}}$
[Figure: z-curve, axis from −3 to 3; 95% of all sample means (x̄) lie between −1.96σ_x̄ and +1.96σ_x̄, with α/2 in each tail]
95% Probability Interval
95% of all sample means (x̄) are in the dotted region around μ.
We take many samples of the same size, with means x̄1, x̄2, x̄3, x̄4, x̄5, x̄6, and ask for each one: does this sample interval contain μ?
As soon as a sample mean steps outside the dotted region, μ is no longer in its interval.
Samples of the same size have the same standard error σ_x̄. So the 95% "width" is the same for all samples of that size.
[Figure: sampling distribution centred on μ, axis from −3 to 3, with the intervals of six sample means drawn below it]
Interpretation
• The randomness lies in the elements chosen for the sample, NOT in the population mean.
• It is the probability of obtaining a representative sample.
• It is the proportion of samples of size n for which our estimate, the sample mean x̄, is within a certain distance (±) of the true population mean μ.
• For any single sample, the sample mean is either within the ± interval of the true mean, or it is not (no probability).
• The confidence interval IS NOT the probability that the population mean lies within the interval.
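The "proportion of samples" reading can be checked with a small simulation. This is a sketch under stated assumptions: the population (μ = 50, σ = 10), the sample size and the number of trials are hypothetical, σ is treated as known, and `coverage` is a name invented for this example.

```python
import math
import random

def coverage(mu, sigma, n, trials, z=1.96, seed=0):
    """Fraction of simulated samples whose interval x_bar ± z*sigma/sqrt(n) contains mu."""
    rng = random.Random(seed)
    margin = z * sigma / math.sqrt(n)  # same width for every sample of size n
    hits = 0
    for _ in range(trials):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

# Hypothetical population (mu = 50, sigma = 10); the values are illustrative only.
rate = coverage(mu=50.0, sigma=10.0, n=30, trials=2000)
print(f"share of intervals containing mu: {rate:.3f}")
```

The population mean never changes between trials; only the samples, and therefore the intervals, vary. The share of intervals that contain μ should land close to 0.95.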
Example
To estimate the mean amount spent per customer at a TGI Friday's, data was collected for 75 customers. We are to assume the population standard deviation is $4.
1. At 95% confidence, what is the margin of error?
2. If the sample mean is $20, what is the 95% confidence interval for the
population mean (all customers)?
Example
1. At 95% confidence, what is the margin of error?
$\bar{x} \pm 1.96\,\sigma_{\bar{x}}$
n = 75, σ = 4
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4}{\sqrt{75}} = 0.46$
$\bar{x} \pm 1.96 \cdot 0.46$
$\bar{x} \pm 0.91$
Example
2. If the sample mean is $20, what is the 95% confidence interval for the population mean (all customers)?
$\bar{x} \pm 0.91$
20 ± 0.91, that is, from 19.09 to 20.91
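The two steps of this example can be sketched as a single Python function. It assumes, as the example does, that the population standard deviation is known; the name `z_confidence_interval` is chosen for illustration.

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Return (margin, lower, upper) for a mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * sigma / math.sqrt(n)  # z boundary times the standard error
    return margin, x_bar - margin, x_bar + margin

margin, lower, upper = z_confidence_interval(x_bar=20.0, sigma=4.0, n=75)
print(f"margin = {margin:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

The margin depends only on σ, n and the confidence level, so every sample of 75 customers would get the same interval width around its own mean.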
95% Probability Interval
All samples of n = 75 will have 0.91 as the margin of error, assuming σ is known to be 4.
95% of all intervals made using x̄ ± 0.91 will contain the unknown POPULATION MEAN.
If we take another 100 samples of n = 75 and make intervals x̄ ± 0.91, about 95 of them will contain μ.
[Figure: sampling distribution, axis from −3 to 3; the interval 20 ± 0.91 runs from 19.09 to 20.91 (95% confident), with 0.025 in each tail]
Example
Brick and mortar stores like Castorama have a difficult time dealing with the practice of "show-rooming", whereby customers come into the store, examine items, leave and then buy them cheaper on Allegro.
We call these "false customers". Sales associates spend time with customers who have no intention of buying anything at the store, which leads to unproductive time.
Management in one company determined that if less than 15% of a salesperson's 8-hour day (4320 seconds) is spent with "false customers", then show-rooming is not a major problem.
Example
To determine if current show-rooming is really a problem based on the company's standards, 125 sales associates are randomly selected to measure their service times with false customers.
For this problem we are going to assume a population standard deviation, σ = 1958 seconds. The results are:
Standard error of the mean
n = 125, x̄ = 3661.5, σ = 1958
$\sigma_{\bar{x}} = \frac{1958}{\sqrt{125}} = 175.13$
95% Confidence Interval
$\bar{x} \pm 1.96\,\sigma_{\bar{x}}$
3661.5 ± 1.96 · 175.13
3661.5 ± 343.25
95% Probability Interval
95% Confidence Interval
$\bar{x} \pm 1.96\,\sigma_{\bar{x}}$
3661.5 ± 1.96 · 175.13
3661.5 ± 343.25
We are 95% confident that salespersons spend between 3318.25 s and 4004.75 s interacting with false customers each 8 h day.
There is only a 5% chance of selecting a sample whose mean falls outside this range, so a sample mean as high as 4320 s is unlikely.
False customers are not a great enough drain to justify policy changes.
[Figure: sampling distribution, axis from −3 to 3; the interval 3661.5 ± 343.25 runs from 3318.25 to 4004.75 with 0.025 in each tail; the concern point of 4320 seconds lies beyond the upper boundary]
Summing up
• The randomness lies in the elements chosen for the sample, NOT in the population mean.
• It is the probability of obtaining a representative sample.
• It is the proportion of samples of size n for which our estimate, the sample mean x̄, is within a certain distance (±) of the true population mean μ.
• For any single sample, the sample mean is either within the ± interval of the true mean, or it is not (no probability).
• The confidence interval IS NOT the probability that the population mean lies within the interval.
The Student's T-distribution 95% Prob. Int.
By doing so, we can assign t-scores to the upper and lower boundary of the 95% interval for each sample size.
Area (probability) under the curve = 1
If we do not know the population standard deviation σ, we treat the sampling distribution as a t-distribution.
Degrees of freedom: df = n − 1
n = 10, df = 9: $t_{\alpha/2} = \pm 2.26$ (compared with z = ±1.96)
[Figure: t-curve for df = 9, axis from −4 to 4; 95% of all sample means (x̄) lie between t = −2.26 and t = +2.26, with the narrower z boundaries at ±1.96 marked for comparison]
Sample standard deviation
• When we do not know the population standard deviation, we use the sample standard deviation to approximate it,
• This approximation comes at a cost, though, in terms of our interval estimate,
• We must use the t-distribution instead of the z-distribution to account for this estimation of σ,
• Every sample size will have its own t-distribution with degrees of freedom df = n − 1,
• Our standard error will now be:
$s_{\bar{x}} = \frac{s}{\sqrt{n}}$
Different standard errors
• In the previous examples we knew the population standard deviation σ and it was therefore fixed in the standard error formula,
• This meant that all samples of the same size had the same standard error,
• When σ is unknown we estimate it with the sample standard deviation, s,
• Since every sample has its own s, samples of the same size do not necessarily have the same standard error,
• The randomness of sample selection is represented in its standard deviation and therefore in its standard error.
The Student's T-distribution 95% Prob. Int.
The standard deviation of the sampling distribution is now the standard error of the mean:
$s_{\bar{x}} = \frac{s}{\sqrt{n}}$
s_x̄ is largely dependent on sample size. If the sample size is small, s_x̄ becomes larger and thus the distribution becomes wider.
[Figure: t-curve centred on x̄, axis from −4 to 4; 95% of all sample means (x̄) lie in the central region]
Example
To estimate the mean amount spent per customer at a TGI Friday's, data was collected for 15 customers. The sample standard deviation is $4.
1. At 95% confidence, what is the margin of error?
2. If the sample mean is $20, what is the 95% confidence interval for the
population mean (all customers)?
Example
1. At 95% confidence, what is the margin of error?
n = 15, s = 4, df = 14
$\bar{x} \pm 2.145\,s_{\bar{x}}$
$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{4}{\sqrt{15}} = 1.03$
$\bar{x} \pm 2.145 \cdot 1.03$
$\bar{x} \pm 2.21$
Example
2. If the sample mean is $20, what is the 95% confidence interval for the population mean (all customers)?
$\bar{x} \pm 2.21$
20 ± 2.21, that is, from 17.79 to 22.21
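A short sketch of the same t-based interval in Python. The t boundary of 2.145 for df = 14 is taken from the slide rather than computed, and the function name `t_confidence_interval` is an assumption of this example. Without intermediate rounding the margin comes out near 2.22; the slide rounds the standard error to 1.03 first and therefore shows 2.21.

```python
import math

T_CRIT_95_DF14 = 2.145  # two-sided 95% t boundary for df = 14, as quoted on the slide

def t_confidence_interval(x_bar, s, n, t_crit):
    """Return (margin, lower, upper) using the estimated standard error s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return margin, x_bar - margin, x_bar + margin

margin, lower, upper = t_confidence_interval(x_bar=20.0, s=4.0, n=15, t_crit=T_CRIT_95_DF14)
print(f"margin = {margin:.3f}, interval = ({lower:.2f}, {upper:.2f})")
```

Compared with the 75-customer example, the interval is wider for two reasons: the smaller n inflates the standard error, and the t boundary is larger than 1.96.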
95% Probability Interval
n = 15, s = 4, df = 14, α = 0.05, t = 2.145
$\bar{x} \pm 2.21$
For the first sample, x̄1 ± 2.21 gives 20 ± 2.21.
We are not saying that there is a 95% probability that the population mean moves into this interval, as if the mean were random.
We are saying that there is a 95% probability that an interval built this way contains the population mean. The randomness is in the interval, not the mean.
[Figure: t-curve, axis from −4 to 4; the interval 20 ± 2.21 runs from 17.79 to 22.21]
Example
To determine if current show-rooming is really a problem based on the company's standards, 30 sales associates are randomly selected to measure their service times with false customers.
For this problem we are going to use the sample standard deviation, s = 1958 seconds. The results are:
Standard error of the mean
n = 30, x̄ = 3661.5, s = 1958
$s_{\bar{x}} = \frac{1958}{\sqrt{30}} = 357.48$
95% Confidence Interval
$\bar{x} \pm 2.045\,s_{\bar{x}}$
3661.5 ± 2.045 · 357.48
3661.5 ± 731.05
95% Probability Interval
95% Confidence Interval
$\bar{x} \pm 2.045\,s_{\bar{x}}$
3661.5 ± 2.045 · 357.48
3661.5 ± 731.05
n = 30, s = 1958, df = 29, α = 0.05, t = 2.045
We are 95% confident that salespersons spend between 2930.45 s and 4392.55 s interacting with false customers each 8 h day.
The concern point of 4320 seconds (t = 1.84) lies inside this interval, so a mean that high cannot be ruled out at 95% confidence.
False customers are in fact draining resources. Policy intervention needed.
[Figure: t-curve, axis from −4 to 4; the interval 3661.5 ± 731.05 runs from 2930.45 to 4392.55; the concern point of 4320 seconds sits at t = 1.84, inside the interval]
Increase sample size from 30 to 50
95% Confidence Interval
$\bar{x} \pm 2.011\,s_{\bar{x}}$
3661.5 ± 2.011 · 276.9
3661.5 ± 556.85
n = 50, s = 1958, df = 49, α = 0.05, t = 2.011
We are 95% confident that salespersons spend between 3104.65 s and 4218.35 s interacting with false customers each 8 h day.
The concern point of 4320 seconds (t = 2.38) now lies outside the interval.
False customers are not a great enough drain to justify policy changes.
[Figure: t-curve, axis from −4 to 4; the interval 3661.5 ± 556.85 runs from 3104.65 to 4218.35; the concern point of 4320 seconds sits at t = 2.38, outside the interval]
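The effect of the larger sample can be reproduced with a few lines of Python. This is a sketch only: the t boundaries are the ones quoted on the slides, the sample mean and standard deviation are assumed to stay the same when n grows (as the slides assume), and `t_interval` is a name chosen for this example.

```python
import math

CONCERN_POINT = 4320.0  # 15% of an 8-hour day, in seconds

def t_interval(x_bar, s, n, t_crit):
    """95% interval x_bar ± t_crit * s / sqrt(n) as a (lower, upper) tuple."""
    margin = t_crit * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Sample sizes paired with the t boundaries quoted on the slides.
for n, t_crit in ((30, 2.045), (50, 2.011)):
    lower, upper = t_interval(3661.5, 1958.0, n, t_crit)
    inside = lower <= CONCERN_POINT <= upper
    print(f"n = {n}: ({lower:.2f}, {upper:.2f}), concern point inside: {inside}")
```

With n = 30 the concern point falls inside the interval; with n = 50 the interval is narrow enough to exclude it, which is what reverses the conclusion.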
Estimating size of the sample
Margin of Error
Point estimate ± Margin of error
$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \;\longrightarrow\; E = z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
$z_{\alpha/2}$: z boundary of the interval probability
$\frac{\sigma}{\sqrt{n}}$: standard error
1. We choose E, our margin of error.
2. We choose our confidence probability boundary $z_{\alpha/2}$.
3. We are given or we estimate the population standard deviation σ.
4. Solve for n.
Margin of Error
Point estimate ± Margin of error
$E = z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
$\sqrt{n} = \frac{z_{\alpha/2}\,\sigma}{E}$
$n = \left( \frac{z_{\alpha/2}\,\sigma}{E} \right)^2$
Margin of Error
• Solving for the sample size requires the population standard deviation σ. Most often we do not know it, so we have to use an estimate or "planning value" in its place.
Options:
1. Estimate σ from previous studies using the same population of interest.
2. Conduct a pilot study to select a preliminary sample. Use the sample standard deviation from the pilot study.
3. Use judgment or a best guess for σ. A common guess is the data range (high − low) divided by 4.
Example
How large a sample should be selected to provide a 95% confidence interval with a margin of error (E) of 8, assuming the population standard deviation is σ = 36?
$n = \left( \frac{z_{\alpha/2}\,\sigma}{E} \right)^2 = \left( \frac{1.96 \cdot 36}{8} \right)^2 = 77.8$
To have 95% of our sample means fall within ±8 of μ, we need a sample size of 78 (n is always rounded up).
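The sample-size formula can be sketched as a small Python helper. The function name `required_sample_size` is an assumption of this example, and the z boundary is computed from the confidence level rather than typed in as 1.96.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, margin, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= margin, rounded up to a whole observation."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil((z * sigma / margin) ** 2)

n_needed = required_sample_size(sigma=36.0, margin=8.0)
print(n_needed)
```

Rounding up rather than to the nearest integer keeps the achieved margin of error at or below the chosen E.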
Example
The question we are asking is:
"What minimum sample size is necessary to produce 95% confidence that the sample mean is within ±8 of the true population mean?"
As we increase the sample size we reduce the standard error, and our sample most likely becomes more representative of the population.
In graphical terms, we set the upper and lower boundary. Then we increase the sample size, pulling the distribution up in the middle and inward on the sides.
Example
A larger sample size ensures that more sample means are within the given margin of error, because a large sample is more representative of the overall population.
A larger sample size will be required when:
1. A smaller margin of error is required.
2. A higher level of confidence is required.
3. Or both.