Inference on averages
Data are collected to learn about certain numerical characteristics of a
process or phenomenon that in most cases are unknown.
Example: A study was conducted to analyze women's bone health. Data
on the daily intakes of calcium (in milligrams) were collected for 36 women
between the ages of 18 and 24. What is the estimated
average calcium intake for women in this age range?
The sample average is an estimate of the average calcium intake for
women between the ages of 18 and 24.
Population = all women aged 18-24 years.
Sample = 36 women aged 18-24 years, selected at random.
Estimating the population average
To estimate the population average:
• Select a simple random sample of size n from the population of interest, so
that each unit in the population has the same probability of being selected.
• Collect data from the sample
• Compute the sample average and the standard deviation.
The sample average $\bar{x}$ is an estimate of the population average.
How accurate is such an estimate?
A measure of the accuracy is given by the standard error S.E. of the
sample average.
$\mathrm{S.E.}(\bar{x}) = \frac{s}{\sqrt{n}}$
where s is the standard deviation of the observations. The larger the sample,
the more accurate the average is as an estimate of the population average.
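As a minimal sketch of the standard error formula, the following Python snippet computes the sample average, the sample standard deviation, and the standard error. The function name `standard_error` and the intake values are hypothetical, chosen only for illustration; they are not the study data.

```python
import math
import statistics

def standard_error(observations):
    """Return s / sqrt(n), the standard error of the sample average."""
    n = len(observations)
    s = statistics.stdev(observations)  # sample standard deviation (divides by n - 1)
    return s / math.sqrt(n)

# Hypothetical daily calcium intakes in mg (illustrative values only).
intakes = [650, 900, 1200, 480, 1010, 760, 1340, 820, 590]

print("sample average:", statistics.mean(intakes))
print("standard error:", standard_error(intakes))
```

Because n appears under a square root, quadrupling the sample size only halves the standard error.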
What is the distribution of the sample average?
If the investigators take several samples of size n and compute the average of
each sample, then all the sample averages will be somewhere around the
population average.
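sample average $\bar{x}$ = population average μ + sampling error
[Figure: distribution of the sample average $\bar{x}$, centered at the population average μ, with spread given by the S.E.]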
What is the shape of the sampling distribution?
If the sample size n is large (n>50), the sample average is approximately
normal with mean equal to the population mean and standard deviation equal
to the standard error of the sample average.
$\bar{X}$ is approximately $N\left(\mu, \frac{s}{\sqrt{n}}\right)$
• The larger the sample, the more accurate the normal approximation is.
• If the distribution of the population is not symmetric, the normal
approximation is less accurate, and you need a larger sample.
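The behaviour of the sample average can be explored with a small simulation. The sketch below assumes a hypothetical, skewed population (exponential with mean 1 and standard deviation 1), draws many samples of size 50, and compares the spread of the sample averages with the population standard deviation divided by sqrt(n). All constants are illustrative assumptions, not values from the study.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

POP_MEAN = 1.0   # mean of the assumed exponential population
POP_SD = 1.0     # an exponential distribution with mean 1 also has sd 1
N = 50           # size of each sample
REPS = 2000      # number of repeated samples

def sample_average(n):
    """Draw one random sample of size n and return its average."""
    return statistics.mean(random.expovariate(1.0 / POP_MEAN) for _ in range(n))

averages = [sample_average(N) for _ in range(REPS)]

mean_of_averages = statistics.mean(averages)
sd_of_averages = statistics.stdev(averages)
theoretical_se = POP_SD / math.sqrt(N)

print("mean of sample averages:", round(mean_of_averages, 3))
print("sd of sample averages:  ", round(sd_of_averages, 3))
print("sigma / sqrt(n):        ", round(theoretical_se, 3))
```

The sample averages should cluster around the population mean, with a spread close to the standard error, even though the population itself is not symmetric.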
Confidence Intervals for averages
Problem: We want to estimate the unknown population mean μ.
Answer: We compute a confidence interval for μ, that is the set of plausible
values for μ in the light of the data.
A 95% confidence interval for μ is defined as
sample average ± margin of error
where the margin of error indicates how accurate our estimate is.
Confidence Intervals
In samples of size n, a level C confidence interval for the
population average is
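sample average ± $t_{\alpha/2}$ × S.E. = $\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}$
where $t_{\alpha/2}$ is the critical value, such that the area between $-t_{\alpha/2}$
and $t_{\alpha/2}$ under the curve of the t-distribution with n-1 degrees of
freedom is C = 1 - α.
[Figure: t-distribution curve with central area 0.95 between $-t_{\alpha/2}$ and $t_{\alpha/2}$]
The value of $t_{\alpha/2}$ is computed using the
Excel function
=TINV(alpha, df)
where df = sample size - 1.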
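For readers without Excel, the critical value can be approximated numerically. The sketch below is one possible approach using only the Python standard library: it integrates the t density with Simpson's rule and finds the critical value by bisection. The function names `t_pdf`, `central_area`, and `t_critical` are hypothetical, and the step counts are assumptions chosen for reasonable accuracy, not a reference implementation.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def central_area(t, df, steps=2000):
    """Area under the t curve between -t and t (Simpson's rule, steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3  # the curve is symmetric, so double the area on [0, t]

def t_critical(alpha, df):
    """Two-tailed critical value: the area between -t and t equals 1 - alpha."""
    target = 1.0 - alpha
    lo, hi = 0.0, 1.0
    while central_area(hi, df) < target:  # expand until the bracket contains the answer
        lo, hi = hi, hi * 2
    for _ in range(50):                   # bisection on the increasing central area
        mid = (lo + hi) / 2
        if central_area(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print("critical value for alpha = 0.05, df = 35:", round(t_critical(0.05, 35), 3))
```

Like TINV, `t_critical` takes the two-tailed alpha, so a 95% interval uses alpha = 0.05.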
Example
Data on the daily intakes of calcium (in milligrams) were collected for 36 women
between the ages of 18 and 24.
The sample average is $\bar{x} = 898.44$
The standard deviation is s=422
The sample size is n=36
The standard error is S.E.=422/sqrt(36)=70.33
The 95% confidence interval is
(898.44 - $t_{0.025}$*70.33, 898.44 + $t_{0.025}$*70.33)
The value $t_{0.025}$=2.03, thus a 95% C.I. for μ is (755.66mg, 1041.23mg)
We are 95% confident that the “true” average calcium intake is a value
between 755.66 mg and 1041.23 mg.
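Excel formulas used in the computation:
Sample size n: =COUNT(data)
Standard error, stdev/sqrt(n): =B4/SQRT(B5)
Degrees of freedom, n-1: =B5-1
Critical value, TINV(alpha, df): =TINV((1-B6), B10)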
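The same computation can be sketched in Python from the summary values quoted in the example. The function name `t_confidence_interval` is hypothetical, and the critical value 2.03 is taken from the example rather than computed, so the result agrees with the interval quoted in the example only up to rounding.

```python
import math

def t_confidence_interval(mean, sd, n, t_crit):
    """Return (lower, upper) for mean ± t_crit * sd / sqrt(n)."""
    margin = t_crit * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Summary values from the calcium example; t_crit = 2.03 is the quoted
# critical value for a 95% interval with 35 degrees of freedom.
lower, upper = t_confidence_interval(mean=898.44, sd=422, n=36, t_crit=2.03)
print(f"95% C.I.: ({lower:.2f} mg, {upper:.2f} mg)")
```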
Understanding a 95% confidence interval
For about 95 out of 100 samples, the population average μ lies in the
associated 95% confidence interval.
Suppose we take 25 samples of 36 women between 18 and 24 years of age
and for each sample we compute the sample average and the 95% C.I.
Why do the intervals move around?
[Figure: the 95% confidence intervals from the 25 samples, shown against the distribution of sample averages and the true value μ]
How many intervals contain the true value μ?
In the long run, 95% of all the
samples will produce an interval that
contains the true value μ.
Be careful though, it might happen
that the C.I. computed with the
sample collected in the study DOES
NOT contain the true average value!
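The long-run interpretation can be checked with a simulation. The sketch below assumes a hypothetical normal population with a known average (900 mg) and standard deviation (420 mg), repeatedly draws samples of 36 observations, and counts how many 95% intervals contain the known average. The population parameters are assumptions made for illustration; the critical value 2.03 is the one used in the calcium example.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

TRUE_MEAN = 900.0   # assumed population average (hypothetical)
TRUE_SD = 420.0     # assumed population standard deviation (hypothetical)
N = 36              # sample size, as in the calcium example
T_CRIT = 2.03       # critical value for 95% confidence and 35 d.f.
REPS = 2000         # number of simulated studies

def interval_covers_mean():
    """Draw one sample, build its 95% C.I., and report whether it contains TRUE_MEAN."""
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    xbar = statistics.mean(sample)
    margin = T_CRIT * statistics.stdev(sample) / math.sqrt(N)
    return xbar - margin <= TRUE_MEAN <= xbar + margin

covered = sum(interval_covers_mean() for _ in range(REPS))
coverage = covered / REPS
print(f"{covered} of {REPS} intervals contain the true average ({coverage:.1%})")
```

The fraction of intervals that contain the true average should be close to 95%, while any single interval either contains it or does not.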
What is the t-distribution?
The t-distribution with n-1 degrees of freedom is a symmetric
distribution with center at 0. For large n, the t-distribution is close to the
standard normal distribution.
Comparing the t-distribution curve and the standard normal curve
[Figure: three panels (d.f.=5, d.f.=15, d.f.=30), each showing the t-distribution curve and the standard normal curve; horizontal axis t, vertical axis relative frequency]
The t-distribution curve has "fatter" tails.
For d.f. around 30, the t-distribution
curve is very similar to the standard
normal curve.
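The comparison between the curves can also be made numerically. The sketch below evaluates the t density and the standard normal density at t = 3, a point in the tail, for 5, 15, and 30 degrees of freedom. The function name `t_pdf` is hypothetical; the density formula is the standard one for the t-distribution.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

standard_normal = NormalDist()  # mean 0, standard deviation 1

for df in (5, 15, 30):
    print(f"d.f.={df:2d}: t density at 3 = {t_pdf(3.0, df):.4f}, "
          f"normal density at 3 = {standard_normal.pdf(3.0):.4f}")
```

The t density in the tail is larger than the normal density and shrinks toward it as the degrees of freedom increase.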
A different confidence level
Suppose we want to compute a 90% confidence interval for the average
calcium intake.
We will use the same formula, with a different critical value t.
The sample average is 898.44
The standard deviation is s=422
The sample size is n=36
The standard error is S.E.=422/sqrt(36)=70.33
The confidence level is C=0.90, so alpha=1-C=0.10
The 90% confidence interval is
(898.44 - $t_{0.05}$*70.33, 898.44 + $t_{0.05}$*70.33)
The critical value for n-1=35 degrees of freedom is $t_{0.05}$=1.690
The C.I. is (898.44 - 1.690*70.33, 898.44 + 1.690*70.33)
= (779.58 mg, 1017.30 mg)
With a 90% confidence level, we state that the average calcium intake is
between 779.58 mg and 1017.30 mg.
Approximate Confidence Intervals
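$\bar{x} \pm$ margin of error
The normal approximation can be used to compute approximate confidence
intervals if the sample size is large (n>30).
[Figure: normal curve centered at μ; the area under the curve between μ - 1.96 S.E. and μ + 1.96 S.E. is 95%]
90% Confidence Interval: $\bar{x} \pm 1.64$ S.E. (margin of error 1.64 S.E.)
95% Confidence Interval: $\bar{x} \pm 1.96$ S.E. (margin of error 1.96 S.E.)
99% Confidence Interval: $\bar{x} \pm 2.57$ S.E. (margin of error 2.57 S.E.)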
Expressions for C.I.’s
$\bar{x}$ is the sample average of n observations in a simple random sample
of size n, where n is large (>30).
s is the standard deviation of the n observations.
The 90% C.I. for the population mean: $\bar{x} \pm 1.64 \frac{s}{\sqrt{n}}$
The 95% C.I. for the population mean: $\bar{x} \pm 1.96 \frac{s}{\sqrt{n}}$
The 99% C.I. for the population mean: $\bar{x} \pm 2.57 \frac{s}{\sqrt{n}}$
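The multipliers in these expressions come from the standard normal curve. As a hedged sketch, the snippet below recovers them with `statistics.NormalDist` from the Python standard library and applies them to the summary values of the calcium example. The function names `z_critical` and `approximate_ci` are hypothetical. The computed multipliers carry more decimals than the rounded values 1.64, 1.96, and 2.57 used in the expressions.

```python
import math
from statistics import NormalDist

def z_critical(confidence):
    """z such that the area between -z and z under the standard normal curve equals confidence."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def approximate_ci(mean, sd, n, confidence):
    """Approximate C.I. for the population mean using the normal curve."""
    margin = z_critical(confidence) * sd / math.sqrt(n)
    return mean - margin, mean + margin

for level in (0.90, 0.95, 0.99):
    lower, upper = approximate_ci(898.44, 422, 36, level)
    print(f"{level:.0%}: z = {z_critical(level):.3f}, C.I. = ({lower:.1f}, {upper:.1f})")
```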
General remarks on C.I.’s
The purpose of a C.I. is to estimate an unknown parameter with an indication
of how accurate the estimate is and of how confident we are that the
result is correct.
The methods used here rely on the assumption that the sample is randomly
selected.
Any confidence interval has two parts:
estimate ± margin of error
The confidence level states the probability that the method will give a correct
answer, i.e. the confidence interval contains the “true” value of the
parameter.
The margin of error of a confidence interval decreases as
1. The confidence level decreases
2. The sample size n increases
Remarks:
1. Notice the trade-off between the margin of error and the confidence
level. The greater the confidence you want to place in your estimate,
the larger the margin of error (and hence the less informative the
interval).
2. A C.I. gives the range of values for the unknown population average
that are plausible, in the light of the observed sample average. The
confidence level says how plausible.
3. A C.I. is defined for the population parameter, NOT the sample statistic.
4. To make a margin of error smaller, you can take a larger sample!
5. Use the t-distribution in small samples (n<30). For large samples, the
t-distribution is very close to the standard normal distribution.
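The trade-offs in these remarks can be tabulated. The sketch below uses the normal approximation and the standard deviation from the calcium example (s=422) to show how the margin of error changes with the sample size and the confidence level. The function name `margin_of_error` and the larger sample sizes are hypothetical, chosen only to illustrate the pattern.

```python
import math
from statistics import NormalDist

def margin_of_error(sd, n, confidence):
    """Approximate margin of error z * sd / sqrt(n) (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / math.sqrt(n)

SD = 422  # standard deviation from the calcium example

for n in (36, 144, 576):  # each step quadruples the sample size
    for level in (0.90, 0.95, 0.99):
        print(f"n={n:3d}, level={level:.0%}: "
              f"margin = {margin_of_error(SD, n, level):6.1f} mg")
```

Quadrupling the sample size halves the margin of error, while raising the confidence level widens it.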
Testing hypotheses
The recommended daily allowance (RDA) of calcium for women between
18 and 24 years of age is 1300 milligrams. A health organization claims
that, on average, women in this age range consume less calcium than the
RDA level.
Using the collected data, what can we conclude regarding the claim of the
health organization?
Testing hypotheses
Confidence intervals can be used to test conjectures or hypotheses about a
certain characteristic of interest.
A trucking firm suspects the claim that the average lifetime of certain tires
is at least 28,000 miles. To check the claim, the firm puts 80 of these tires
on its trucks and gets an average lifetime of 27,563 miles with a standard
deviation of 1,348 miles. What can you conclude from the data ?
We can construct a confidence interval and check whether it contains the
value of 28,000 miles. If it does, we can conclude that 28,000 is
plausible in the light of the data!
Testing hypotheses
A 95% C.I. for the average lifetime is $\bar{x} \pm 1.96 \frac{s}{\sqrt{n}}$
(Are we using the t-distribution or the normal curve?)
27,563 ± 1.96*1,348/sqrt(80) = 27,563 ± 295.39 miles = (27,268, 27,858).
The entire confidence interval lies below 28,000 miles, so 28,000 is
not a plausible value for the average lifetime in the light of the data.
The data suggest that, on average, the tires last less than 28,000 miles.
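The check described in the tire example can be sketched in a few lines of Python. The function name `normal_ci_95` and the variable names are hypothetical; the numbers are the ones quoted in the example, and the multiplier 1.96 assumes the normal approximation is adequate for n=80.

```python
import math

def normal_ci_95(mean, sd, n):
    """Approximate 95% C.I. for the population mean: mean ± 1.96 * sd / sqrt(n)."""
    margin = 1.96 * sd / math.sqrt(n)
    return mean - margin, mean + margin

CLAIMED_MEAN = 28_000  # average lifetime claimed for the tires, in miles

lower, upper = normal_ci_95(mean=27_563, sd=1_348, n=80)
claim_is_plausible = lower <= CLAIMED_MEAN <= upper

print(f"95% C.I.: ({lower:.0f}, {upper:.0f}) miles")
print("claimed value inside the interval:", claim_is_plausible)
```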