Inference for proportions
Business Statistics for Managerial
Decision
Inference for Proportions
Some statistical studies concern variables
measured in a scale of equal units such as dollars
or grams.
We have discussed inference about the mean of
variables like these in our previous lectures.
Other studies record categorical variables, such as
the race or occupation of a person, the make of a
car, or type of complaint received from a
customer.
When we record categorical variables, our data
consist of counts or of percents obtained from
counts.
Inference for Proportions
The parameters we want to do inference about in
these settings are population proportions.
Just as in the case of inference about population
means, we may be concerned with a single
population or with comparing two populations.
Inference about one or two proportions is very
similar to inference about means and it is based on
sampling distributions that are approximately
Normal.
Example: Work stress and personal life
The human resources manager of a restaurant
chain is concerned that work stress
may be affecting the chain’s employees.
She asks a random sample of 100
employees to respond Yes or No to the
question “Does work stress have a negative
impact on your personal life?” Of these, 68
say “Yes.”
Example: Work stress and personal life
The parameter of interest is the proportion
of the chain’s employees who would answer
“Yes” if asked.
This is the population proportion, which we call
P.
The statistic used to estimate the unknown
parameter is the sample proportion
$$\hat{p} = \frac{68}{100} = 0.68$$
Inference for a Single Proportion
The sample proportion p̂ is a discrete random
variable that can take the values 0, 1/100, 2/100,
…, 99/100 or 1.
The probability model for p̂ can be based on the
Binomial distributions for counts.
If the sample size n is very small, we must base
tests and confidence intervals for P on the discrete
distribution of p̂ .
We can approximate the distribution of p̂ by a
Normal distribution when the sample size is large.
Sampling Distribution of a Sample
Proportion
Choose a SRS of size n from a large population that
contains population proportion P of “successes.” Let p̂ be
the sample proportion of successes,
$$\hat{p} = \frac{\text{count of successes in the sample}}{n} = \frac{X}{n}$$
Then:
As the sample size increases, the sampling distribution of p̂
becomes approximately Normal.
The mean of the sampling distribution is P.
The standard deviation of the sampling distribution is
$$\sqrt{\frac{P(1-P)}{n}}$$
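The mean and standard deviation stated above can be checked numerically. The sketch below is only an illustration: it computes the exact mean and standard deviation of p̂ = X/n from the Binomial distribution of the count X and compares them with the formula. The values n = 100 and P = 0.68 are assumed for the illustration (in practice P is unknown), and the function name is hypothetical.

```python
from math import comb, sqrt

def exact_mean_sd(n, p):
    """Exact mean and standard deviation of p-hat = X/n when X ~ Binomial(n, p)."""
    # Binomial probabilities P(X = k) for k = 0, 1, ..., n
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum((k / n) * pr for k, pr in enumerate(probs))
    var = sum((k / n - mean) ** 2 * pr for k, pr in enumerate(probs))
    return mean, sqrt(var)

# Assumed illustrative values, not estimates from data
n, p = 100, 0.68
mean, sd = exact_mean_sd(n, p)
formula_sd = sqrt(p * (1 - p) / n)  # sqrt(P(1 - P)/n) from the slide

print(f"exact mean = {mean:.4f}, exact sd = {sd:.4f}, formula sd = {formula_sd:.4f}")
```

The exact values agree with the formula; what the Normal approximation adds is the shape of the distribution, which improves as n grows.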
Sampling Distribution of a Sample
Proportion
The sampling
distribution of the
sample proportion p̂
of successes has
approximately a
Normal distribution.
Confidence Interval for a Single
Proportion
The sample proportion p̂ = X/n is the natural estimator of the
population proportion P.
The traditional confidence interval for P is based on the
Normal approximation to the distribution of p̂ .
Unfortunately, confidence intervals based on this statistic
can be quite inaccurate, even for large samples.
We can do better by moving sample proportion p̂
slightly away from 0 and 1.
The following simple adjustment works very well in
practice.
Confidence Interval for a Single
Proportion
Wilson Estimate:
Assume we have 4 additional observations, 2 of
which are successes and 2 of which are failures.
The new sample size is n + 4 and the count of
successes is X+2.
The estimator of the population proportion is
$$\tilde{p} = \frac{X+2}{n+4}$$
Confidence Interval for a Single
Proportion
We base a confidence interval on the z
statistic obtained by standardizing the
Wilson estimate p̃.
The distribution of p̃ is close to the Normal
distribution with mean P and standard
deviation $\sqrt{P(1-P)/(n+4)}$.
Confidence Interval for a Single
Proportion
Choose a SRS of size n from a large population with
unknown proportion P of successes. The Wilson estimate
of the population proportion is
$$\tilde{p} = \frac{X+2}{n+4}$$
The standard error of p̃ is
$$SE_{\tilde{p}} = \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}}$$
An approximate level C confidence interval for P is
$$\tilde{p} \pm z^* \, SE_{\tilde{p}}$$
where z* is the value for the standard Normal density curve with
area C between −z* and z*.
Use this interval when the sample size is at least n = 5 and the
confidence level is 90% or more.
Example: estimating the effect of work
stress
The sample survey in the previous example found that
68 out of 100 employees agreed that work stress
had a negative impact on their personal lives. The
sample size is n = 100 and the count of successes is
X = 68. The Wilson estimate of the proportion of
all employees affected by work stress is
$$\tilde{p} = \frac{X+2}{n+4} = \frac{68+2}{100+4} = 0.6731$$
The standard error is
$$SE_{\tilde{p}} = \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}} = \sqrt{\frac{0.6731(1-0.6731)}{104}} = 0.0460$$
Example: estimating the effect of work
stress
The z critical value for 95% confidence is
z* = 1.96, so the confidence interval is
$$\tilde{p} \pm z^* \, SE_{\tilde{p}} = 0.6731 \pm (1.96)(0.0460) = 0.673 \pm 0.090$$
We are 95% confident that between 58.3% and
76.3% of the restaurant chain’s employees feel
that work stress is damaging their personal lives.
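The work-stress interval can be reproduced with a few lines of code. This is a minimal sketch using only the Python standard library; the function name `plus_four_ci` is hypothetical, and the critical value is taken from `statistics.NormalDist` rather than from a table, so z* is 1.95996... instead of the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist

def plus_four_ci(x, n, conf=0.95):
    """Wilson (plus-four) estimate, its standard error, and a level-conf interval."""
    z_star = NormalDist().inv_cdf((1 + conf) / 2)  # area conf between -z* and z*
    p_tilde = (x + 2) / (n + 4)                    # add 2 successes and 2 failures
    se = sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return p_tilde, se, (p_tilde - z_star * se, p_tilde + z_star * se)

# Work-stress survey: 68 "Yes" answers out of 100 employees
p_tilde, se, (lo, hi) = plus_four_ci(68, 100)
print(f"p~ = {p_tilde:.4f}, SE = {se:.4f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

To three decimals the endpoints match the 58.3% and 76.3% quoted above.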
Significance Test for a Single Proportion
The sample proportion p̂ = X/n is approximately Normal
with mean p and standard deviation $\sqrt{p(1-p)/n}$.
For confidence intervals we used the Wilson estimate and
estimated the standard deviation from the data.
When performing a significance test, the null hypothesis
specifies a value for p, which we call p0.
We assume the hypothesized value is actually true,
substitute p0 for p in the expression for the standard deviation of p̂, and then
standardize p̂:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$
Significance Test for a Single Proportion
Example: Work stress
A national survey of restaurant employees found
that 75% said that work stress had a negative
impact on their personal lives. A sample of 100
employees of a restaurant chain found that 68
answered “Yes” when asked, “Does work stress
have a negative impact on your personal life?” Is
this good reason to think that the proportion of all
employees of this chain who say “Yes” differs
from the national proportion p0 = 0.75?
Example: Work stress
To answer this question, we test
H0: p = 0.75
Ha: p ≠ 0.75
The expected numbers of “Yes” and “No”
responses are
100 × 0.75 = 75 and 100 × 0.25 = 25
Both are greater than 10, so we can use the z test.
The test statistic is
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.68 - 0.75}{\sqrt{(0.75)(0.25)/100}} = -1.62$$
Example: Work stress
From Table A we find
$$P(Z \ge 1.62) = 1 - 0.9474 = 0.0526$$
The P-value is
P = 2 × 0.0526 = 0.1052
We conclude that the
chain restaurant data
are compatible with
the survey results.
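The same test can be sketched in code. The function name `one_proportion_z_test` is hypothetical, and the Normal probabilities come from `statistics.NormalDist` instead of Table A, so the P-value is computed from the unrounded z.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(x, n, p0):
    """Two-sided z test of H0: p = p0 based on the sample proportion x/n."""
    p_hat = x / n
    se0 = sqrt(p0 * (1 - p0) / n)  # standard deviation of p-hat when H0 is true
    z = (p_hat - p0) / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    return z, p_value

# 68 "Yes" answers out of 100, tested against the national proportion 0.75
z, p_value = one_proportion_z_test(68, 100, 0.75)
print(f"z = {z:.2f}, two-sided P-value = {p_value:.4f}")
```

Because z is not rounded to two decimals before the lookup, the P-value comes out near 0.106 rather than the table-based 0.1052. The conclusion is the same.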
Choosing a Sample Size
We want to see how to choose the sample size n to
obtain a confidence interval with specified margin
of error m for a population proportion.
The margin of error for the confidence interval for
a population proportion is:
$$m = z^* \, SE_{\tilde{p}} = z^* \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}}$$
Choosing a confidence level C fixes the critical
value z*.
Choosing a Sample Size
The margin of error also depends on the value of p̃ and
the sample size n.
We don’t know the value of p̃ until we gather data,
therefore we must guess a value to use in the calculations.
Let’s call the guessed value p*. There are two ways to get p*.
Use a sample estimate from a pilot study or from similar studies
done earlier.
Use p* = 0.5. Because the margin of error is largest when p̃ = 0.5,
this choice gives a sample size that is somewhat larger than we
really need for the confidence level we choose. It is a safe choice
no matter what the data later show.
Choosing a Sample Size
The level C confidence interval for a proportion p will
have a margin of error approximately equal to a specified
value m when the sample size satisfies
$$n + 4 = \left(\frac{z^*}{m}\right)^2 p^*(1-p^*)$$
Here z* is the critical value for confidence C, and p* is a
guessed value for the proportion of successes in the future
sample.
The margin of error will be less than or equal to m if p* is
chosen to be 0.5. The sample size required is then given by
$$n + 4 = \left(\frac{z^*}{2m}\right)^2$$
Example: Planning a sample of
customers
Your company has received complaints about its customer
support service. You intend to hire a consulting company
to carry out a sample survey of customers. Before
contacting the consultant, you want some idea of the
sample size you will have to pay for. One critical question
is the degree of satisfaction with your customer service,
measured on a five-point scale. You want to estimate the
proportion P of your customers who are satisfied (that is,
who choose either “satisfied” or “very satisfied,” the two
highest levels on the five-point scale).
Example: Planning a sample of
customers
You want to estimate P with 95% confidence and a margin
of error less than or equal to 3%. For planning purposes,
you are willing to use p* = 0.5. The sample size required is:
$$n + 4 = \left(\frac{z^*}{2m}\right)^2 = \left(\frac{1.96}{(2)(0.03)}\right)^2 = 1067.1$$
Round up to get n + 4 = 1068, or n = 1064. (Always round up.
Rounding down would give a margin of error slightly
greater than 0.03.)
Similarly, for a 2.5% margin of error we have (after
rounding up)
$$n + 4 = \left(\frac{1.96}{(2)(0.025)}\right)^2 = 1537$$
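The planning calculation is easy to automate. The following is a sketch under the same assumptions as the slides (plus-four interval, guessed value p*); the function name `plus_four_sample_size` is hypothetical, and z* is computed with `statistics.NormalDist` rather than rounded to 1.96.

```python
from math import ceil
from statistics import NormalDist

def plus_four_sample_size(m, conf=0.95, p_star=0.5):
    """Smallest n whose plus-four interval has margin of error at most m,
    using the guessed proportion p_star (0.5 is the conservative choice)."""
    z_star = NormalDist().inv_cdf((1 + conf) / 2)
    # n + 4 = (z*/m)^2 * p*(1 - p*), always rounded up
    n_plus_4 = ceil((z_star / m) ** 2 * p_star * (1 - p_star))
    return n_plus_4 - 4

print(plus_four_sample_size(0.03))   # margin of error 3%
print(plus_four_sample_size(0.025))  # margin of error 2.5%
```

With p* = 0.5 this gives n = 1064 for a 3% margin and n = 1533 (that is, n + 4 = 1537) for a 2.5% margin, in line with the hand calculation.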
Comparing Two Proportions
We often want to compare the proportions of two
groups (such as men and women) that have some
characteristic.
We call the two groups being compared
population 1 and population 2.
The two population proportions of “successes” are P1
and P2.
The data consist of two independent SRSs.
The sample sizes are n1 from population 1 and n2
from population 2.
Comparing Two Proportions
The proportion of successes in each sample
estimates the corresponding population
proportion.
Here is the notation we will use:
Population   Population proportion   Sample size   Count of successes   Sample proportion
1            P1                      n1            X1                   p̂1 = X1/n1
2            P2                      n2            X2                   p̂2 = X2/n2
Sampling Distribution of p̂1 − p̂2
Choose independent SRSs of sizes n1 and n2 from
two populations with proportions P1 and P2 of
successes.
Let D = p̂1 − p̂2 be the difference between the two
sample proportions of successes.
Then as both sample sizes increase, the sampling
distribution of D becomes approximately Normal.
The mean of the sampling distribution is P1 − P2.
The standard deviation of the sampling distribution is
$$\sigma_D = \sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}}$$
Sampling Distribution of p̂1 − p̂2
The sampling distribution
of the difference of two
sample proportions is
approximately Normal.
The mean and standard
deviation are found from
the two population
proportions of successes,
P1 and P2.
Confidence Interval
Just as in the case of estimating a single
proportion, a small modification of the
sample proportions greatly improves the
accuracy of confidence intervals.
The Wilson estimates of the two population
proportions are
$$\tilde{p}_1 = \frac{X_1+1}{n_1+2}, \qquad \tilde{p}_2 = \frac{X_2+1}{n_2+2}$$
Confidence Interval
The standard deviation of the difference D̃ = p̃1 − p̃2 of the Wilson estimates is approximately
$$\sigma_{\tilde{D}} = \sqrt{\frac{P_1(1-P_1)}{n_1+2} + \frac{P_2(1-P_2)}{n_2+2}}$$
To obtain a confidence interval for P1-P2, we
replace the unknown parameters in the standard
deviation by estimates to obtain an estimated
standard deviation, or standard error.
Confidence Interval for Comparing
Two Proportions
Example: “No Sweat” Garment Labels
Following complaints about the working
conditions in some apparel factories both in the
United States and abroad, a joint government and
industry commission recommended in 1998 that
companies that monitor and enforce proper
standards be allowed to display a “No Sweat”
label on their product. A survey of U.S. residents
aged 18 or older asked a series of questions about
how likely they would be to purchase a garment
under various conditions.
Example: “No Sweat” Garment Labels
For some conditions, it was stated that the
garment had a “No Sweat” label; for others,
there was no mention of such a label. On the
basis of the responses, each person was
classified as a “label user” or a “label
nonuser.” About 16.5% of those surveyed
were label users. One purpose of the study
was to describe the demographic
characteristics of users and nonusers.
Example: “No Sweat” Garment Labels
The study suggested that there is a gender
difference in the proportion of label users.
Here is a summary of the data. Let X denote
the number of label users.
Population   n     X    p̂ = X/n   p̃ = (X+1)/(n+2)
1 (women)    296   63   0.213      0.215
2 (men)      251   27   0.108      0.111
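Example: “No Sweat” Garment Labels
First calculate the standard error of the observed
difference.
$$SE_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1-\tilde{p}_1)}{n_1+2} + \frac{\tilde{p}_2(1-\tilde{p}_2)}{n_2+2}} = \sqrt{\frac{(0.215)(0.785)}{296+2} + \frac{(0.111)(0.889)}{251+2}} = 0.0308$$
The 95% confidence interval is
$$(\tilde{p}_1 - \tilde{p}_2) \pm z^* \, SE_{\tilde{D}} = (0.215 - 0.111) \pm (1.96)(0.0308) = 0.104 \pm 0.060 = (0.04, 0.16)$$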
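The two-sample interval can be sketched the same way. The function name `two_proportion_plus_four_ci` is hypothetical, and the code carries full precision throughout, so the standard error comes out near 0.0309 rather than the 0.0308 shown in the hand calculation. The interval is unchanged to two decimals.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_plus_four_ci(x1, n1, x2, n2, conf=0.95):
    """Plus-four interval for P1 - P2: add one success and one failure per sample."""
    z_star = NormalDist().inv_cdf((1 + conf) / 2)
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    se = sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    return diff, se, (diff - z_star * se, diff + z_star * se)

# Label users: 63 of 296 women (population 1), 27 of 251 men (population 2)
diff, se, (lo, hi) = two_proportion_plus_four_ci(63, 296, 27, 251)
print(f"difference = {diff:.3f}, SE = {se:.4f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```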
Example: “No Sweat” Garment Labels
With 95% confidence we can say that the difference in the
proportions is between 0.04 and 0.16.
Alternatively, we can report that women are about 10 percentage points
more likely to be label users than men, with a 95% margin
of error of 6 percentage points.
In this example we chose women to be the first population.
Had we chosen men as the first population, the estimate of
the difference would be negative (-0.104).
Because it is easier to discuss positive numbers, we
generally choose the first population to be the one with the
higher proportion.
The choice does not affect the substance of the analysis.
Significance Tests
It is sometimes useful to test the null hypothesis
that the two population proportions are the same.
We standardize D = p̂1 − p̂2 by subtracting its mean
P1 − P2 and then dividing by its standard deviation
$$\sigma_D = \sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}}$$
If n1 and n2 are large, the standardized difference
is approximately N(0, 1).
To estimate σ_D we take into account the null
hypothesis that P1 = P2.
Significance Tests
If these two proportions are equal, we can
view all of the data as coming from a single
population.
Let P denote the common value of P1 and
P2. The standard deviation of D = p̂1 − p̂2 is then
$$\sigma_{Dp} = \sqrt{\frac{P(1-P)}{n_1} + \frac{P(1-P)}{n_2}} = \sqrt{P(1-P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
Significance Tests
We estimate the common value of P by the overall
proportion of successes in the two samples.
$$\hat{p} = \frac{\text{number of successes in both samples}}{\text{number of observations in both samples}} = \frac{X_1 + X_2}{n_1 + n_2}$$
This estimate of P is called the pooled estimate.
To estimate the standard deviation of D, substitute p̂
for P in the expression for σ_Dp.
The result is a standard error for D under the condition that
the null hypothesis H0: P1 = P2 is true.
The test statistic uses this standard error to standardize the
difference between the two sample proportions.
Significance Tests for Comparing Two
Proportions
Example: men, women, and garment labels
The previous example presented the survey data
on whether consumers are “label users” who pay
attention to label details when buying a shirt. Are
men and women equally likely to be label users?
Here is the data summary:
Population   n     X    p̂ = X/n
1 (women)    296   63   0.213
2 (men)      251   27   0.108
Example: men, women, and garment labels
We compare the proportions of label users in the
two populations (women and men) by testing the
hypotheses
H0: P1 = P2
Ha: P1 ≠ P2
The pooled estimate of the common value of P is:
$$\hat{p} = \frac{63 + 27}{296 + 251} = \frac{90}{547} = 0.1645$$
This is the proportion of label users in the entire
sample.
Example: men, women, and garment labels
The test statistic is calculated as follows:
$$SE_{Dp} = \sqrt{(0.1645)(0.8355)\left(\frac{1}{296} + \frac{1}{251}\right)} = 0.03181$$
$$z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{Dp}} = \frac{0.213 - 0.108}{0.03181} = 3.30$$
The observed difference is more than 3 standard
deviations away from zero.
Example: men, women, and garment labels
The P-value is:
$$2P(Z \ge 3.30) = 2(1 - 0.9995) = 2(0.0005) = 0.001$$
Conclusion:
21% of women are label users versus only 11%
of men; the difference is statistically
significant.
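The pooled test can also be written as a short function. This is a sketch only: the name `two_proportion_z_test` is hypothetical, and because the code does not round the sample proportions first, z comes out near 3.31 instead of the 3.30 obtained from the rounded values 0.213 and 0.108.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z test of H0: P1 = P2 using the pooled estimate of P."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # overall proportion of successes
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return pooled, se, z, p_value

# Label users: 63 of 296 women, 27 of 251 men
pooled, se, z, p_value = two_proportion_z_test(63, 296, 27, 251)
print(f"pooled = {pooled:.4f}, SE = {se:.5f}, z = {z:.2f}, P-value = {p_value:.4f}")
```

The P-value is again about 0.001, so the conclusion does not depend on the rounding.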