501_Lecture_06x







Inference
Statistical confidence
Confidence intervals
Confidence interval for a population mean
Choosing the sample size

Distinguish chance variations from permanent
features of a phenomenon:
◦ Give the SAT test to an SRS of 500 Indiana seniors
 sample mean = 461
 What does it say about µ (the mean SAT score of all high school seniors in Indiana)?
◦ Is 12/20 vs. 8/20 improvements in treatment vs. control
group strong enough evidence in favor of a drug?

Methods of formal inference rely on the assumption that the data come from a properly randomized experiment or sample
◦ For example, an SRS

The field of statistics provides methods that give correct results a high percentage of the time (when repeated many times)

The two most prominent methods are:
1. confidence intervals
2. tests of significance (hypothesis testing)

Example 1:
◦ Observe 15 plots of corn with yields in
bushels:
 138, 139.1, 113, 132.5, 140.7, 109.7, 118.9, 134.8, 109.6,
127.3,115.6, 130.4, 130.2, 111.7, 105.5
◦ Sample Mean = 123.8

What can be said about the mean
yield of this variety of corn for the
population?



Assume that yield is N(µ, σ) with unknown µ and σ = 10 (just assume σ is known).

Then X̄ ~ N(µ, σ/√n) = N(µ, 10/√15) = N(µ, 2.58)
68-95-99.7% rule: 95% of the time the sample mean is within 2 standard deviations of the population mean
◦ 2 × 2.58 = 5.16 from µ

Thus, 95% of the time:  µ − 5.16 ≤ X̄ ≤ µ + 5.16
Equivalently:  X̄ − 5.16 ≤ µ ≤ X̄ + 5.16



The random interval covers the unknown (but nonrandom) population parameter µ 95% of the time.
Our confidence is 95%.
We need to be extremely careful when interpreting this result.
µ is between X̄ ± 5.16, i.e., 123.8 ± 5.16 = (118.64, 128.96)


This particular confidence interval may contain µ
or not…
However, such a systematic method gives
intervals covering the population mean µ in 95%
of cases.
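The "95% of cases" claim can be checked by simulation. A minimal sketch, assuming Python with NumPy is available (the true mean below is an arbitrary value chosen only for the simulation; in practice it is unknown):

    import numpy as np

    rng = np.random.default_rng(0)
    true_mu, sigma, n = 120.0, 10.0, 15     # true_mu is arbitrary; sigma and n match the corn example
    margin = 2 * sigma / np.sqrt(n)         # 2 standard deviations of the sample mean, about 5.16

    reps = 10_000
    covered = 0
    for _ in range(reps):
        xbar = rng.normal(true_mu, sigma, size=n).mean()
        if xbar - margin <= true_mu <= xbar + margin:
            covered += 1

    print(covered / reps)    # close to 0.95 (about 0.954, since +/- 2 SD is the rule-of-thumb value)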
The Big Idea: The sampling distribution of x tells us how close to µ
the sample mean x is likely to be. All confidence intervals we construct
will have a form similar to this:
estimate ± margin of error
A level C confidence interval for a parameter has two parts:

An interval calculated from the data, which has the form:
estimate ± margin of error
A confidence level C, where C is the probability that the interval will capture the true parameter value in repeated samples. In other words, the confidence level is the success rate for the method.
 It does not give us the probability that our parameter is
inside the interval.
Confidence Interval for a Population
Mean
Confidence Interval for the Mean of a Normal Population
Choose an SRS of size n from a population having unknown mean µ and known standard deviation σ. A level C confidence interval for µ is:

x̄ ± z* σ/√n

The critical value z* is found from the standard Normal distribution.

C = 95%. Find z* from Table A.

See also last row in Table D.
We can use a table of z/t values (Table D). For a
particular confidence level, C, the appropriate z*
value is just above it.
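If software is available, z* can come from the standard Normal quantile function rather than from Table A or Table D. A minimal sketch, assuming Python with SciPy (the function name z_interval is just for illustration):

    from scipy.stats import norm

    def z_interval(xbar, sigma, n, C=0.95):
        """Level-C confidence interval for mu when sigma is known."""
        z_star = norm.ppf(1 - (1 - C) / 2)   # C = 0.95 gives z* = 1.96
        margin = z_star * sigma / n ** 0.5
        return xbar - margin, xbar + margin

    # Corn-yield example: xbar = 123.8, sigma = 10, n = 15
    print(z_interval(123.8, 10, 15))  # about (118.7, 128.9); slightly narrower than the interval above, which used z* = 2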

Tim Kelley weighs himself once a week for
several years. Last month he weighed himself 4
times with an average of 190.5. Examination of
Tim’s past data reveals that over relatively short
periods of time, his weight measurements are
approximately normal with a standard deviation
of about 3. Find a 90% confidence interval for
his mean weight for last month. Then, find a
99% confidence interval.
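A sketch of the computation, assuming Python with SciPy (z* is about 1.645 for 90% confidence and 2.576 for 99%):

    from scipy.stats import norm

    xbar, sigma, n = 190.5, 3, 4
    for C in (0.90, 0.99):
        z_star = norm.ppf(1 - (1 - C) / 2)
        m = z_star * sigma / n ** 0.5            # margin of error: z* * sigma / sqrt(n)
        print(C, round(xbar - m, 2), round(xbar + m, 2))
    # 90%: about 190.5 +/- 2.47;  99%: about 190.5 +/- 3.86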


More confidence → wider interval
Less confidence → narrower interval

Suppose Tim had only weighed himself once last
month and that his one observation was x=190.5
(the same as the mean before). Estimate µ with
90% confidence.


Larger sample size → narrower interval
Smaller sample size → wider interval
The z confidence interval for the mean of a Normal population illustrates
several important properties that are shared by all confidence intervals in
common use.
 The user chooses the confidence level and the margin of error follows.
 We would like high confidence and a small margin of error.


High confidence suggests our method almost always gives correct
answers.
A small margin of error suggests we have pinned down the parameter
precisely.
The margin of error for the z confidence interval is:
m = z* σ/√n
To decrease the margin of error we can:
 Make z* smaller (the same as a lower confidence level C).
 Get a bigger n! Since n is under the square root sign, we must take
four times as many observations to cut the margin of error in half.
 Make σ smaller. Usually not possible.
The spread in the sampling distribution of the mean is a function of the
number of individuals per sample.
 The larger the sample size, the smaller the standard deviation
(spread) of the sample mean distribution.
Standard error = σ/√n
 The spread decreases in proportion to 1/√n.
You may need a certain margin of error (e.g., in drug trials or manufacturing specs). In most cases, we have no control over the population variability (σ), but we can choose the number of measurements (n).
The confidence interval for a population mean will have a specified margin of
error m when the sample size is:
m  z*

n
z *  
n  

 m 
2

Remember, though, that sample size is not always stretchable at will. There are typically costs and constraints associated with large samples. The best approach is to use the smallest sample size that can give you useful results.
How many undergraduates should we survey?
Suppose we are planning a survey about college savings programs.
We want the margin of error of the amount contributed to be $30 with
95% confidence. Let us assume the population standard deviation, σ,
equals $1483.
How many measurements should you take?
For a 95% confidence interval, z* = 1.96.
n = (z* σ / m)² = (1.96 × 1483 / 30)² = 9387.54
Using only 9387 measurements will not be enough to ensure that m is
no more than $30. Therefore, we need at least 9388 measurements.
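The same calculation in code, as a minimal sketch assuming Python (math.ceil rounds the required sample size up):

    import math

    sigma, m, z_star = 1483, 30, 1.96
    n_required = (z_star * sigma / m) ** 2
    print(n_required, math.ceil(n_required))   # about 9387.5, so at least 9388 measurements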

Tim wants to have a margin of error of only 2
pounds with 95% confidence. How many times
must he weigh himself to achieve this goal?

The data should be an SRS from the population.

The confidence interval and sample size formulas are
not correct for other sampling methods.
Inference cannot rescue badly produced data.
Confidence intervals are not resistant to outliers.
If n is small (<15) and the population is not Normal, the
true confidence level will be different from C.




The standard deviation σ of the population must be known. We will learn what to do when σ is unknown in Chapter 7.







The reasoning of tests of significance
Stating hypotheses
Test statistics
P-values
Statistical significance
Test for a population mean
Two-sided significance tests and confidence intervals
Confidence intervals are one of the two most common types of statistical
inference. Use a confidence interval when your goal is to estimate a
population parameter. The second common type of inference, called
tests of significance, has a different goal: to assess evidence in the data
about some claim concerning a population.
A test of significance is a formal procedure for comparing observed data
with a claim (also called a hypothesis) whose truth we want to assess.
The claim is a statement about a parameter, like the population proportion p
or the population mean µ.
We express the results of a significance test in terms of a probability, called
the P-value, that measures how well the data and the claim agree.
Suppose a basketball player claimed to be an 80% free-throw shooter. To test
this claim, we have him attempt 50 free throws. He makes 32 of them. His
sample proportion of made shots is 32/50 = 0.64.
What can we conclude about the claim based on this sample data?
We can use software to simulate 400 sets of 50 shots each on
the assumption that the player is an 80% free-throw shooter.
You can say how strong the evidence
against the player’s claim is by giving the
probability that he would make as few as
32 out of 50 free throws if he really makes
80% in the long run.
Assuming that the actual parameter value
is p = 0.80, the observed statistic is so
unlikely that it gives convincing evidence
that the player’s claim is not true.
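A simulation like the one described can be sketched as follows, assuming Python with NumPy (400 sets of 50 shots by a true 80% shooter):

    import numpy as np

    rng = np.random.default_rng(1)
    made = rng.binomial(n=50, p=0.80, size=400)   # number made in each of 400 simulated sets of 50 shots
    print(np.mean(made <= 32))                    # proportion of sets with 32 or fewer makes -- a very small fraction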
1. Are Tim Kelly’s weight measurements
compatible with the claim that his true mean
weight is 187 pounds?
2. In a random sample of 100 light bulbs, 7 are found defective. Is this compatible with the manufacturer's claim that only 5% of the light bulbs produced are defective?


What are we in favor of or against?
How do we state this in terms of an appropriate hypothesis?

The hypothesis is a statement about the parameters in a
population or model – not about the data at hand.
◦ We usually have the data and can answer questions directly
about it.

The results of a test are expressed in terms of a
probability that measures how well the data and the
hypothesis agree.
◦ This probability plays a role similar to a confidence level, but it is not the same thing.

In hypothesis testing, we need to state 2 hypotheses:
◦ The null hypothesis: H0
◦ The alternative hypothesis: Ha

The null hypothesis is the claim that is initially favored or believed to be true. It is often the default or "uninteresting" situation of "no effect" or "no difference".

THEN, we usually need to determine if there is
strong enough evidence against it.

The test of significance is designed to assess
the strength of the evidence against the null
hypothesis.


The alternative hypothesis is the claim that we
“hope” or “suspect” something else is true instead
of H0.
Sometimes it is easier to begin with the alternative
hypothesis Ha and then set up H0 as the statement
that the hoped-for effect is not present.
H0: μ = 187
In words: true weight is 187 pounds.
Ha: μ > 187
In words: He weighs more than 187 pounds.
A so-called one-sided alternative Ha.
(Looking for a departure in one direction.)
 H0: μ = 187 vs. Ha: μ < 187
Suspect the weight is lower. One-sided Ha.
 H0: μ = 187 vs. Ha: μ > 187
Suspect the weight is higher. One-sided Ha.
 H0: μ = 187 vs. Ha: μ ≠ 187
Suspect the weight is different. Two-sided Ha.
Note: you must decide on the setting, based on general
knowledge, before you see the data or other
measurements.

Translate each of the following research
questions into appropriate hypotheses.
 Census Bureau data show that the mean household income in
the area served by a shopping mall is $62,500 per year. A market
research firm questions shoppers at the mall to find out whether
the mean household income of mall shoppers is higher than that
of the general population.
 Last year, your company’s service technicians took an average of
2.6 hours to respond to trouble calls from business customers
who had purchased service contracts. Do this year’s data show
a different average response time?

Tim Kelley has a driver’s license that gives his
weight as 187 pounds. Recall that last month’s
mean weight was 190.5, with a sample size of 4.
Also the population standard deviation is 3.
What is the probability of observing a sample
mean of 190.5 or larger when the true population
mean is 187?
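A sketch of this computation, assuming Python with SciPy (x̄ = 190.5, µ0 = 187, σ = 3, n = 4):

    from scipy.stats import norm

    xbar, mu0, sigma, n = 190.5, 187, 3, 4
    z = (xbar - mu0) / (sigma / n ** 0.5)   # standardize the sample mean assuming H0: mu = 187 is true
    p_value = norm.sf(z)                    # one-sided: P(Z >= z)
    print(z, p_value)                       # z is about 2.33; the one-sided P-value is about 0.01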

the probability, computed assuming that H0 is true, that the test statistic would take a value as extreme as or more extreme than the one actually observed.
◦ Example 1 (Tim Kelley): p-value =

This is the P-value of the test (or of the data,
given the testing procedure). If it is small, it
serves as evidence against H0.

We need to know the distribution of the test statistic under H0 to calculate the P-value.

When the P-Value is small, there are 2 choices:
◦ 1—The null hypothesis is true and our observed effect is
extremely rare!
OR more likely…
◦ 2—The null hypothesis is false and our data is telling
us this by the small P-value!

So…

We need a cut-off point (decisive value) that we
can compare our P-value to and draw a conclusion
or make a decision.
◦ In other words, how much evidence do we need to reject
H0 ?

This cut-off point is the significance level. It is
announced in advance and serves as a standard
on how much evidence against H0 we need to
reject H0. Usually denoted α.

Typical values of α: 0.05, 0.01.
◦ If not stated otherwise, assume α=0.05.


If the P-value is smaller than a fixed significance level α,
then we reject the null hypothesis (in favor of the
alternative).
Otherwise we don’t have enough evidence to reject the
null.
◦ If we don’t reject the null, do we accept it?

Guidelines:
◦ If the P-value < α → reject H0
◦ If the P-value ≥ α → fail to reject H0

Note: Always report a P-value with your conclusion and write the conclusion in terms of the problem.
◦ Conclusion for Example 1 (Tim Kelley):
The final step in performing a significance test is to draw a conclusion
―reject H0 or fail to reject H0.
 If our sample result is too unlikely to have happened by chance
assuming H0 is true, then we will reject H0.
 Otherwise, we will fail to reject H0.
Note: A fail-to-reject H0 decision in a significance test does not mean that
H0 is true. For that reason, you should never “accept H0” or use language
implying that you believe H0 is true.
When we use a fixed level of significance to draw a conclusion in a
significance test,
P-value < α → reject H0 → conclude Ha (in context),
P-value ≥ α → fail to reject H0 → cannot conclude Ha (in context)
If the P-value is smaller than α, we also say that the data are statistically significant at level α. The quantity α is called the significance level or the level of significance.
Tests of Significance: Four Steps
1. State the null and alternative hypotheses.
2. Calculate the value of the test statistic.
3. Find the P-value for the observed data.
4. State a conclusion.
We will learn the details of many tests of significance in the following
chapters. The proper test statistic is determined by the hypotheses
and the data collection design.
Reject H0 when the P-value is smaller than
significance level α.
Otherwise: Do not reject.
This rule is valid in other settings, too.

If, based on previous data or experience, we expect "increase", "more", "better", etc. ("decrease", "less", "worse", resp.), then we can use a one-sided test.

Otherwise, by default, we use a two-sided test. Key words: "different", "departures", "changed"…

Example 3:
A group of 72 male executives in the age group 35-44 has mean systolic blood pressure 126.07. Is this career group's mean pressure different from that of the general population of males in this age group, which is N(128, 15)?
(α not given? Assume α = 0.05.)
A test of significance is based on a statistic that estimates the
parameter that appears in the hypotheses. When H0 is true, we expect
the estimate to be near the parameter value specified in H0.
Values of the estimate far from the parameter value specified by H0
give evidence against H0.
A test statistic calculated from the sample data measures how far
the data diverge from what we would expect if the null hypothesis
H0 were true.
z = (estimate − hypothesized value) / (standard deviation of the estimate)
Large values of the statistic show that the data are not consistent
with H0.
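Applied to the executive blood pressure question above, a minimal sketch assuming Python with SciPy:

    from scipy.stats import norm

    xbar, mu0, sigma, n = 126.07, 128, 15, 72
    z = (xbar - mu0) / (sigma / n ** 0.5)   # (estimate - hypothesized value) / standard deviation of the estimate
    p_two_sided = 2 * norm.sf(abs(z))       # two-sided alternative: mu != 128
    print(z, p_two_sided)                   # z is about -1.09; P-value is about 0.27, not significant at alpha = 0.05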


A significance test can be done in a black-and-white manner: We reject H0 if P < α, and otherwise we do not reject H0.

Reporting the P-value is a better way to summarize a test than
simply stating whether or not H0 is rejected. This is because P
quantifies how strong the evidence is against H0. The smaller the
value of P, the greater the evidence.

On the other hand, P does not provide specific information about the
true population mean µ. If you desire a likely range of values for
the parameter, use a confidence interval.

A level α two-sided significance test rejects H0: µ = µ0 exactly when µ0 falls outside a level 1 − α confidence interval for µ.
◦ If µ0 is in the CI → fail to reject H0
◦ If µ0 is not in the CI → reject H0
◦ NOTE: must have "≠" in Ha!

An agro-economist examines the cellulose
content of a variety of alfalfa hay. Suppose that
the cellulose content in the population has a
standard deviation of 8 mg. A sample of 15
cuttings has a mean cellulose content of 145
mg. A previous study claimed that the mean
cellulose content was 140 mg. The 95%
confidence interval is (140.95, 149.05).
◦ Use the confidence interval to determine if the mean
cellulose content is different from 140 mg.

Now try the test using a test statistic instead of the
confidence interval, just for practice. (The result
should be the same.)
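A sketch of both approaches for the alfalfa example, assuming Python with SciPy:

    from scipy.stats import norm

    xbar, mu0, sigma, n = 145, 140, 8, 15
    se = sigma / n ** 0.5

    # 95% confidence interval (matches the interval given above)
    print(xbar - 1.96 * se, xbar + 1.96 * se)   # about (140.95, 149.05); 140 is outside, so reject H0

    # Two-sided z test of H0: mu = 140
    z = (xbar - mu0) / se
    print(z, 2 * norm.sf(abs(z)))               # z is about 2.42; P-value is about 0.015 < 0.05 -- the same conclusion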




Choosing a significance level
What statistical significance does not mean
Don’t ignore lack of significance
Beware of searching for significance




α = 0.05 is the accepted standard, but…
if the conclusion that Ha is true has "costly" implications, a smaller α may be appropriate
we do not always need to make a decision: describing the evidence by the P-value may be enough
there is no sharp border between statistically significant and insignificant

A statistically significant effect may be small:
◦ Example ("Executive" blood pressure):
 µ0 = 128
 σ = 15
 n = 1000 obs.
 sample mean = 127
◦ Z = (127 − 128) / (15/√1000) = −2.11
◦ P-value for two-sided Ha = 2 × 0.0174 = 0.0348
Significant??
Statistical significance is not necessarily practical significance.

Plot your results and confidence interval, to see if
the effect is worth your attention.

Important effects may have a large P-value if the sample size is too small. The converse is also true.

Outliers may produce or destroy statistical
significance.
Cautions About Significance Tests
Don’t ignore lack of significance
 Consider this provocative title from the British Medical Journal: “Absence
of evidence is not evidence of absence.”
 Having no proof that a particular suspect committed a murder does not
imply that the suspect did not commit the murder.
Indeed, failing to find statistical significance in results means that “the null
hypothesis is not rejected.” This is very different from actually accepting the null
hypothesis. The sample size, for instance, could be too small to overcome
large variability in the population.
When comparing two populations, lack of significance does not imply that the
two populations are the same. The populations might be different but have
similar statistical properties.
Cautions About Significance Tests
Potential importance of small effects
There is no consensus on how big an effect has to be in order to be
considered meaningful. In some cases, effects that may appear to be
trivial can be very important.
Example: Improving the format of a computerized test reduces the average
response time by about 2 seconds. Although this effect is small, it is
important because this is done millions of times a year. The cumulative time
savings of using the better format is gigantic.
Always think about the context. Try to plot your results and compare
them with a baseline or results from similar studies.

Statistical inference, no matter how well done, cannot fix basic flaws in the design
◦ Bias due to:
 sampling problems (like voluntary response, etc.)
 incorrect experimental design
 poorly worded questions
 etc.
◦ Any other problems we discussed in Chapter 3 can affect the validity of the inference.


Example: Take 100 executive-rank employees.
Measure: blood pressure, height, weight, bone density, metabolism rate, etc.
◦ Test if their blood pressure is different, using α = 0.05
◦ Test if their height is different, using α = 0.05
◦ Test if their weight is different, using α = 0.05
◦ …
If we perform 40 significance tests, how many
do we expect to be statistically significant, just
by chance?
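With 40 independent tests at α = 0.05, we expect about 40 × 0.05 = 2 "significant" results even when every null hypothesis is true. A simulation sketch, assuming Python with NumPy:

    import numpy as np

    rng = np.random.default_rng(2)
    alpha, n_tests = 0.05, 40
    p_values = rng.uniform(0, 1, size=n_tests)   # when H0 is true, P-values are (approximately) uniform on (0, 1)
    print(np.sum(p_values < alpha))              # typically around 2 tests come out "significant" by chance alone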





Remember, the significance level controls what we call
a “rare” result.
In normal practice, rare results do occur, but rarely!!
If α=0.05, then we will rarely (5% of the time) get a rare
result but this is also what we call statistical
significance!
In summary, if you are searching for significance by
running tests over and over, you will find it! But this is
terrible statistics…
We’d much rather have 1 significance test that we are
interested in at a single α=0.05!





Data: an SRS
Formulas for other randomized designs are available
Haphazard data = unreliable CI
Population need not be Normal, but outliers pose a threat to the validity of conclusions
We will learn how to estimate σ in Chapter 7