Transcript Chapter 5

Chapter 7
Inferences Based on a Single
Sample
Parameters and Statistics
• A parameter is a numeric characteristic of a
population or distribution, usually symbolized
by a Greek letter, such as μ, the population
mean.
• Inferential Statistics uses sample information to
estimate parameters.
• A Statistic is a number calculated from data.
• There are usually statistics that do the same
job for samples that the parameters do for
populations, such as x̄, the sample mean.
Using Samples for Estimation
[Diagram: the sample mean x̄, a known statistic computed from the sample, is used to estimate μ, an unknown parameter of the population.]
The Idea of Estimation
• We want to find a way to estimate the
population parameters.
• We only have information from a sample,
available in the form of statistics.
• The sample mean, x̄, is an estimator of
the population mean, μ.
• This is called a “point estimate” because it
is one point, or a single value.
Interval Estimation
• There is variation in x̄, since it is a random
variable calculated from data.
• A point estimate doesn’t reveal anything about
how much the estimate varies.
• An interval estimate gives a range of values that
is likely to contain the parameter.
• Intervals are often reported in polls, such as
“56% ±4% favor candidate A.” This suggests we
are not sure it is exactly 56%, but we are quite
sure that it is between 52% and 60%.
• 56% is the point estimate, whereas (52%, 60%)
is the interval estimate.
The Confidence Interval
• A confidence interval is a special interval
estimate involving a percent, called the
confidence level.
• The confidence level tells how often, if samples
were repeatedly taken, the interval estimate
would surround the true parameter.
• We can use this notation: (L,U) or (LCL,UCL).
• L and U stand for Lower and Upper endpoints.
The longer versions, LCL and UCL, stand for
“Lower Confidence Limit” and “Upper
Confidence Limit.”
• This interval is built around the point estimate.
Theory of Confidence Intervals
• Alpha (α) represents the probability that when
the sample is taken, the calculated CI will miss
the parameter.
• The confidence level is given by (1-α)×100%,
and used to name the interval, so for example,
we may have “a 90% CI for μ.”
• After sampling, we say that we are, for
example, “90% confident that we have
captured the true parameter.” (There is no
probability at this point. Either we did or we
didn’t, but we don’t know.)
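The meaning of the confidence level can be illustrated with a small simulation. The sketch below is only an illustration: the values mu=50, sigma=10, n=25 and the function name `coverage` are assumptions chosen for the demo, and it uses the σ-known interval x̄ ± z·σ/√n that these notes introduce later. It draws many samples, builds a 90% interval from each, and reports how often the interval captures μ.

```python
import random
from statistics import NormalDist

def coverage(mu=50.0, sigma=10.0, n=25, alpha=0.10, trials=2000, seed=1):
    """Fraction of simulated intervals that capture the true mean mu."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    z = NormalDist().inv_cdf(1 - alpha / 2)        # upper-tail critical value
    margin = z * sigma / n ** 0.5                  # same margin for every sample
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        if xbar - margin <= mu <= xbar + margin:   # did this CI capture mu?
            hits += 1
    return hits / trials

print(coverage())   # should be close to 1 - alpha = 0.90
```

Any single interval either captures μ or misses it; the 90% describes the long-run fraction of intervals that capture it.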
How to Calculate CI’s
• Many CI’s have the following basic structure:
• P ± TS
– Where P is the parameter estimate,
– T is a “table” value equal to the number of standard
deviations needed for the confidence level,
– and S is the standard deviation of the estimate.
• The quantity TS is also called the “Error Bound”
(B) or “Margin of Error.”
• The CI should be written as (L,U) where
L= P-TS, and U= P+TS.
• Don’t forget to convert your P ± TS expression to
confidence interval form, including parentheses!
A Confidence Interval for μ
• If σ is known, and
• the population is normally distributed,
or n>30 (so that we can say x̄ is
approximately normally distributed),
$$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
gives the endpoints for a (1- α)100% CI for μ
• Note how this corresponds to the P ± TS
formula given earlier.
Distribution Details
• What is z_{α/2}?
– α is the significance level, P(CI will miss)
– The subscript on z refers to the upper tail
probability, that is, P(Z > z_{α/2}) = α/2.
– To find this value in the table, look up the
z-value for a probability of .5-α/2.
• Examples
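As a cross-check on table lookups, the critical value can also be computed. This is a minimal sketch using Python's standard library; the helper name `z_upper` is an assumption for illustration. Because `inv_cdf` works with left-tail probabilities, an upper-tail probability of α/2 is converted to 1 - α/2.

```python
from statistics import NormalDist

def z_upper(tail_prob):
    """z-value whose upper-tail probability is tail_prob, i.e. P(Z > z) = tail_prob."""
    return NormalDist().inv_cdf(1 - tail_prob)

for level in (0.90, 0.95, 0.99):
    alpha = 1 - level                      # probability that the CI misses
    print(f"{level:.0%} CI: z = {z_upper(alpha / 2):.3f}")
# prints 1.645, 1.960 and 2.576 for the three levels
```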
Example: Estimation of µ
(σ Known)
A random sample of 25 items resulted in a
sample mean of 50. Construct a 95%
confidence interval estimate for μ if σ = 10.
$$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} = 50 \pm 1.96\cdot\frac{10}{\sqrt{25}} = 50 \pm 3.92$$
(46.08,53.92)
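The same calculation can be sketched in code. The function name `z_interval` and its parameters are assumptions for illustration; the z-value is computed rather than read from the table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, level=0.95):
    """(L, U) for mu when sigma is known: xbar +/- z * sigma / sqrt(n)."""
    alpha = 1 - level
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper-tail probability alpha/2
    margin = z * sigma / sqrt(n)              # the error bound B
    return (xbar - margin, xbar + margin)

lower, upper = z_interval(xbar=50, sigma=10, n=25)
print(f"({lower:.2f}, {upper:.2f})")   # (46.08, 53.92)
```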
Confidence Interval Estimates
[Diagram: confidence intervals are divided into intervals for a mean (σ known or σ unknown), a proportion, and a variance.]
Estimation of μ (σ unknown)
• We now turn to the situation where σ is unknown but the
sample size is large or the sampled population is normal.
• Since σ is unknown, we use s in its place.
• However, without knowing σ, we are not able to make
use of the z table in building a confidence interval.
• Instead, we will use a distribution called t (Student’s t).
• The t distribution is symmetric and bell-shaped like the
standard normal, and also has μ=0, but σ>1, so the
shape is flatter in the middle and thicker in the tails.
Student’s t-Distributions:
[Figure: density curves of the normal distribution, Student’s t with df = 15, and Student’s t with df = 5, all centered at 0; horizontal axis t.]
Degrees of Freedom, df:
A parameter that identifies each member of the family of
Student’s t-distributions. For the methods presented in
this chapter, the value of df will be the sample size
minus 1, df = n - 1.
Using t
• As the previous graph shows, the t
distribution has another parameter, called
degrees of freedom (df). So this is actually a
family of distributions, with different df values.
• The higher the df, the closer the t distribution
comes to the standard normal.
• For our purposes, df=n-1. It is actually
related to the denominator in the formula for
s².
• There is a t-table in the back of the book. It is
different from the z-table, so we have to
understand how it works.
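To see how the t family approaches the standard normal as df grows, the densities can be compared numerically. This is a sketch, not part of the course materials: `t_pdf` is a hypothetical helper that evaluates the standard Student's t density formula with the standard library's gamma function.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom, evaluated at t."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

std_normal = NormalDist()
for t in (0.0, 3.0):
    # Heights for df = 5, 15, 100, followed by the standard normal height.
    heights = [round(t_pdf(t, df), 4) for df in (5, 15, 100)]
    print(t, heights, round(std_normal.pdf(t), 4))
```

At t = 0 the t curves sit below the normal curve (flatter in the middle), and at t = 3 they sit above it (thicker tails), with the gap shrinking as df increases.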
The t table
• Refer to the table. First you will notice the left-hand column is for df.
• When df ≥100, the z-table can be used,
because the values will be very close.
• This table gives tail probabilities, similar to
z(α). However, only a selection of probabilities
is given, across the top of the table.
• The interior of the table gives the t-values, so it
is arranged almost opposite of the z-table.
• The notation used for t-values is t(df, α).
• Just like z(α), α refers to the upper tail
probability.
t-Distribution Showing t(df, α):
[Figure: t density centered at 0, with the upper-tail area α shaded to the right of t(df, α); horizontal axis t.]
Example: Find the value of t(12, 0.025).
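[Figure: t density centered at 0 with area 0.025 in each tail, cut off at -t(12, 0.025) = -2.18 and t(12, 0.025) = 2.18.]
Portion of t-table: in the row for df = 12 and the column for 0.025 as the amount of α in one tail, the entry is 2.18.
t(12, 0.025) = 2.18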
Confidence Intervals
• When we build our confidence interval,
α refers to the probability in both tails.
• This is not the same α used in looking
up the distribution! So what we have to
look up is actually α/2, because that’s
the upper tail probability.
• And so we come to the formula for a
(1-α)100% CI for μ when σ is unknown:
$$\bar{x} \pm t(\mathrm{df}, \alpha/2)\, s_{\bar{x}}, \qquad s_{\bar{x}} = \frac{s}{\sqrt{n}}$$
Example: A study is conducted to learn how long it takes the
typical taxpayer to complete his or her federal income tax
return. A random sample of 17 income tax filers showed a
mean time (in hours) of 7.8 and a standard deviation of 2.3.
Find a 95% confidence interval for the true mean time
required to complete a federal income tax return. Assume the
time to complete the return is normally distributed.
Solution:
1. Parameter of Interest: the mean time required to complete
a federal income tax return.
2. Confidence Interval Criteria:
a. Assumptions: Sampled population assumed normal, σ
unknown.
b. Distribution table value: t will be used.
c. Confidence level: 1 - α = 0.95
3. The Sample Evidence:
n = 17,
x̄ = 7.8, and
s = 2.3
4. Calculations:
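$$t(\mathrm{df}, \alpha/2) = t(16, 0.025) = 2.12$$
$$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{2.3}{\sqrt{17}} = 0.5578$$
$$\bar{x} \pm t(\mathrm{df}, \alpha/2)\, s_{\bar{x}} = 7.8 \pm (2.12)(0.5578) = 7.8 \pm 1.18$$
$$(7.8 - 1.18,\; 7.8 + 1.18) = (6.62, 8.98)$$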
5. (6.62, 8.98) is the 95% confidence interval for µ.
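The arithmetic in step 4 can be checked with a short sketch. Python's standard library has no t-table, so the table value 2.12 from the solution is passed in directly; the function name `t_interval` and its parameters are assumptions for illustration.

```python
from math import sqrt

def t_interval(xbar, s, n, t_value):
    """(L, U) for mu when sigma is unknown: xbar +/- t * s / sqrt(n).

    t_value is t(df, alpha/2) with df = n - 1, looked up by the caller.
    """
    std_error = s / sqrt(n)          # estimated standard deviation of xbar
    margin = t_value * std_error     # the error bound B
    return (xbar - margin, xbar + margin)

# Tax-return example: n = 17, so df = 16 and t(16, 0.025) = 2.12 from the table.
lower, upper = t_interval(xbar=7.8, s=2.3, n=17, t_value=2.12)
print(f"({lower:.2f}, {upper:.2f})")   # (6.62, 8.98)
```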
Confidence Interval
for a Proportion
• Assumptions
– Population Follows Binomial Distribution
– Normal Approximation Can Be Used if
• np̂ ± 3√(np̂(1 - p̂)) does not include 0 or n
• Or (older guideline)
• np̂ ≥ 5 and nq̂ ≥ 5
Confidence Interval Estimate
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Example
A random sample of 400 graduates showed
32 went to grad school. Set up a 95%
confidence interval estimate for p.
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.08 \pm 1.96\sqrt{\frac{0.08\,(1-0.08)}{400}}$$
(0.053, 0.107)
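A minimal sketch of the traditional proportion interval, using the graduate-school numbers as a check. The function name `proportion_interval` is an assumption, and the sketch does not check the np̂ and nq̂ conditions.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(x, n, level=0.95):
    """Traditional (L, U) for p: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n                                   # sample proportion
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # upper-tail probability alpha/2
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - margin, p_hat + margin)

lower, upper = proportion_interval(x=32, n=400)
print(f"({lower:.3f}, {upper:.3f})")   # (0.053, 0.107)
```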
New Method
• A new method (Agresti & Coull, 1998) can
be used to avoid the problems with
extreme p’s. There is no need to check
the np or nq values with this method.
• Define
$$p^* = \frac{x+2}{n+4}$$
where x is the number of successes in the sample.
• Then a (1-α)100% CI for p is given by
$$p^* \pm z_{\alpha/2}\sqrt{\frac{p^*(1-p^*)}{n+4}}$$
Example
• In the 2004 presidential election, Ralph Nader had about
0.34% of the vote. Suppose an exit poll was taken to
estimate Nader’s share of the vote, with a sample size of
200, and 2 people indicated they voted for Nader.
• Note that with the traditional method, np̂ = 2 < 5 so the
formula is not valid.
• Use the p* method to construct a 95% CI for p.
$$p^* = \frac{x+2}{n+4} = \frac{4}{204} = 0.0196$$
$$\sqrt{\frac{p^*(1-p^*)}{n+4}} = \sqrt{\frac{0.0196\,(0.9804)}{204}} = 0.0097$$
$$p^* \pm z_{\alpha/2}\sqrt{\frac{p^*(1-p^*)}{n+4}} = 0.0196 \pm 1.96\,(0.0097)$$
A 95% CI for Nader's vote is (.0006,.0386).
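The p* calculation can be sketched as follows. The function name `p_star_interval` is an assumption; the formula is the one given in these notes (add 2 successes and 4 trials), applied to the exit-poll numbers.

```python
from math import sqrt
from statistics import NormalDist

def p_star_interval(x, n, level=0.95):
    """(L, U) for p using p* = (x + 2) / (n + 4), as described in the notes."""
    p_star = (x + 2) / (n + 4)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # upper-tail probability alpha/2
    margin = z * sqrt(p_star * (1 - p_star) / (n + 4))
    return (p_star - margin, p_star + margin)

lower, upper = p_star_interval(x=2, n=200)
print(f"({lower:.4f}, {upper:.4f})")   # (0.0006, 0.0386)
```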
Choosing CI Formulas
Confidence Intervals
• CI for µ, σ known
– Small sample, population normal: use z with σ
– Small sample, population not normal: use nonparametric
– Large sample: use z with σ
• CI for µ, σ unknown
– Small sample, population normal: use t with s
– Small sample, population not normal: use nonparametric
– Large sample: use t or z with s
• CI for p
– np>5, nq>5: use traditional
– Otherwise: use p* method
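The decision chart can be written as two small functions. This is only a sketch of the chart's logic: the function names, parameter names, and returned strings are hypothetical labels chosen for illustration.

```python
def choose_mean_method(sigma_known, large_sample, population_normal):
    """Pick the CI method for mu according to the chart."""
    if not large_sample and not population_normal:
        return "nonparametric"            # small sample, non-normal population
    if sigma_known:
        return "z with sigma"             # sigma known, small-normal or large sample
    if large_sample:
        return "t or z with s"            # sigma unknown, large sample
    return "t with s"                     # sigma unknown, small normal sample

def choose_proportion_method(n, p_hat):
    """Pick the CI method for p: traditional if np > 5 and nq > 5, else p*."""
    q_hat = 1 - p_hat
    if n * p_hat > 5 and n * q_hat > 5:
        return "traditional"
    return "p* method"

print(choose_mean_method(sigma_known=False, large_sample=False, population_normal=True))
print(choose_proportion_method(n=200, p_hat=0.01))
```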
Sample Size Calculation
• We may wish to decide upon a sample
size so that we can get a confidence
interval with a pre-determined width.
• This is common in polls, where the margin
of error is usually decided in advance.
• All CI’s we have seen so far have the form
P±B, where B is the margin of error.
• We want to fix B in advance.
Sample Size for Estimating µ,
σ Known
• Suppose X is a random variable with σ=10 and
we want a 90% CI to have a Bound, or Margin of
Error, of 3.

• Use the formula
$$B = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
• Fill in the numbers:
$$3 = 1.645\,\frac{10}{\sqrt{n}}$$
• Solve:
$$\sqrt{n} = 1.645\cdot\frac{10}{3} = 5.483 \;\Rightarrow\; n = 30.07$$
• This is the minimum sample size, but we need a
whole number, so round up to n=31.
Sample Size for Estimating µ,
σ Unknown
• If σ is unknown, the confidence interval will be
calculated using the t distribution, unless n is
very large.
• But the degrees of freedom depend on n, which
we don’t know.
• The calculation also depends on s, which we
don’t know until after sampling.
• We must have an initial guess for s, and then
use the normal distribution to approximate the t
distribution, since it does not require knowing n.
Example (σ unknown)
• A manufacturer needs to be able to
estimate the width of a new part to within
2mm with 95% confidence. There is not
enough history to know what σ would be,
so a pilot study is run by measuring 6
parts, and finding s=3.4mm.
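$$B = z_{\alpha/2}\,\frac{s}{\sqrt{n}} \;\Rightarrow\; 2 = 1.96\,\frac{3.4}{\sqrt{n}} \;\Rightarrow\; \sqrt{n} = 1.96\cdot\frac{3.4}{2} \;\Rightarrow\; n = 11.1$$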
• Rounding up to the next whole number
gives n=12.
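Both sample-size examples for µ follow the same steps: solve B = z·sd/√n for n and round up. A minimal sketch, where the function name `sample_size_for_mean` is an assumption and `sd` is either the known σ or the pilot-study guess s:

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sd, bound, level):
    """Smallest whole n such that z * sd / sqrt(n) <= bound."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # upper-tail probability alpha/2
    return ceil((z * sd / bound) ** 2)              # always round up

print(sample_size_for_mean(sd=10, bound=3, level=0.90))    # 31 (sigma known)
print(sample_size_for_mean(sd=3.4, bound=2, level=0.95))   # 12 (pilot-study s)
```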
Sample Size for Estimating p,
a Population Proportion
• With a population proportion, we also have a
problem in getting the standard deviation part of
the Margin of Error, since it depends on p, the
thing we are trying to estimate.
• There are two possibilities:
– 1) We may have a preliminary guess about p that we
can use, or
– 2) We can use p=.5 because that maximizes the
standard deviation.
• The sample size will be calculated from the
desired margin of error, or error bound.
Example (proportion)
• A pollster wants to do a simple random sample
to estimate the proportion of the population
favoring an increase in property taxes for school
funding. He wants a margin of error of 3%, with
90% confidence. The general belief is that it will
be a close election, so an initial value of p=.5 is
reasonable.
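$$B = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \;\Rightarrow\; 0.03 = 1.645\sqrt{\frac{0.25}{n}} \;\Rightarrow\; \sqrt{n} = 1.645\cdot\frac{0.5}{0.03} \;\Rightarrow\; n = 751.7$$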
• Rounding up to the next whole number gives
n=752.
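The proportion version can be sketched the same way. The function name `sample_size_for_proportion` is an assumption; the default guess of 0.5 is the conservative choice described in these notes.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(bound, level, p_guess=0.5):
    """Smallest whole n such that z * sqrt(p(1 - p) / n) <= bound."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # upper-tail probability alpha/2
    return ceil(z ** 2 * p_guess * (1 - p_guess) / bound ** 2)

print(sample_size_for_proportion(bound=0.03, level=0.90))   # 752
```

A preliminary guess other than 0.5 gives a smaller n, since p(1 - p) is largest at p = 0.5.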
Misc. Notes
• The CI for µ formula using z is also called the
“Large Sample” CI. It is valid when σ is
known, for any sample size, but it also serves
as an approximation of the t formula (using s)
when n is large. How large? Many books say
n≥30. I recommend making use of the t table
up to n=100 since that is how far it goes.
Statistical computer programs will always
calculate t values, regardless of how large n is,
for the σ unknown case.
Misc. Notes
• The CI for µ formula using t is also called the
“Small Sample” CI, but only because the other
one is called “Large Sample.” It is valid for
any sample size when σ is unknown and the
population is normal.
• We do not cover methods for small samples
that do not come from a normal population in
this course (non-parametric methods).
Misc. Notes
• The t table is limited because it does not have a
very good selection of probabilities. It also
“jumps” in the df column. It is possible to use
the “closest” value or interpolate when you
can’t find what you need, but a better option is
to use the Excel functions, TDIST and TINV.
• However, you have to be VERY careful about
what Excel is giving you.
Excel’s TDIST function
• TDIST takes a t value and returns the tail
probability. You can choose one or two tails.
Excel’s TINV Function
• The TINV Function takes a two-tailed probability
and returns a t-value (just what we need now).
Excel Function Comparison
• The NORMSINV Function, by contrast, takes a left-tailed probability and returns a z-value. This means
you have to enter α/2 and take the negative, or else
use 1- α/2 as the argument.
Formula              Result
=NORMSINV(0.05)      -1.644853
=-NORMSINV(0.05)     1.644853
=-NORMSINV(0.025)    1.959963
=NORMSINV(0.975)     1.959963
=TINV(0.05,1000)     1.962339
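The same left-tail convention appears outside Excel. As a rough analogue, and an assumption rather than part of the course materials, Python's `statistics.NormalDist.inv_cdf` also takes a left-tailed probability, so the same two workarounds apply. The wrapper name `norm_s_inv` is hypothetical.

```python
from statistics import NormalDist

def norm_s_inv(p_left):
    """Analogue of NORMSINV: z-value with left-tail probability p_left."""
    return NormalDist().inv_cdf(p_left)

alpha = 0.05
print(norm_s_inv(0.05))            # negative: left-tail probability below 0.5
print(-norm_s_inv(alpha / 2))      # enter alpha/2 and take the negative
print(norm_s_inv(1 - alpha / 2))   # or use 1 - alpha/2 as the argument
```

The last two lines give the same z-value for a 95% CI. The standard library has no t quantile function, so a TINV analogue would need a third-party package.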