Inference for proportions


Business Statistics for Managerial Decision
Inference for Proportions




Some statistical studies concern variables measured on a scale of equal units, such as dollars or grams.
We discussed inference about the mean of variables like these in our previous lectures.
Other studies record categorical variables, such as the race or occupation of a person, the make of a car, or the type of complaint received from a customer.
When we record categorical variables, our data consist of counts or of percents obtained from counts.
Inference for Proportions



The parameters we want to do inference about in
these settings are population proportions.
Just as in the case of inference about population
means, we may be concerned with a single
population or with comparing two populations.
Inference about one or two proportions is very
similar to inference about means and it is based on
sampling distributions that are approximately
Normal.
Example: Work stress and personal life

The human resources manager of a chain of restaurants is concerned that work stress may be affecting the chain’s employees. She asks a random sample of 100 employees to respond Yes or No to the question “Does work stress have a negative impact on your personal life?” Of these, 68 say “Yes.”
Example: Work stress and personal life



The parameter of interest is the proportion of the chain’s employees who would answer “Yes” if asked.
This is the population proportion, which we call P.
The statistic used to estimate the unknown parameter is the sample proportion
$\hat{p} = \frac{68}{100} = 0.68$
Inference for a Single Proportion




The sample proportion p̂ is a discrete random
variable that can take the values 0, 1/100, 2/100,
…, 99/100 or 1.
The probability model for p̂ can be based on the
Binomial distributions for counts.
If the sample size n is very small, we must base
tests and confidence intervals for P on the discrete
distribution of p̂ .
We can approximate the distribution of p̂ by a
Normal distribution when the sample size is large.
Sampling Distribution of a Sample
Proportion

Choose a SRS of size n from a large population that contains population proportion P of “successes.” Let p̂ be the sample proportion of successes,
$\hat{p} = \frac{\text{count of successes in the sample}}{n} = \frac{X}{n}$
Then:

As the sample size increases, the sampling distribution of p̂ becomes approximately Normal.
The mean of the sampling distribution is P.
The standard deviation of the sampling distribution is
$\sqrt{\frac{P(1 - P)}{n}}$
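The three facts above can be checked with a small simulation. The sketch below is only an illustration: the values P = 0.68, n = 100, the 5000 repetitions, and the function name `simulate_sample_proportions` are assumptions chosen for the demonstration, not part of the lecture.

```python
import math
import random

def simulate_sample_proportions(p, n, reps, seed=0):
    """Draw `reps` samples of size n and return the sample proportion of each."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    props = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        props.append(successes / n)
    return props

# Illustrative values (assumptions): P = 0.68, n = 100, 5000 repeated samples.
P, n = 0.68, 100
props = simulate_sample_proportions(P, n, reps=5000)

sim_mean = sum(props) / len(props)
sim_sd = math.sqrt(sum((x - sim_mean) ** 2 for x in props) / (len(props) - 1))
theory_sd = math.sqrt(P * (1 - P) / n)  # sqrt(P(1 - P)/n)

print(f"simulated mean = {sim_mean:.4f}  (theory: {P})")
print(f"simulated sd   = {sim_sd:.4f}  (theory: {theory_sd:.4f})")
```

The simulated mean and standard deviation should land close to P and to the square-root formula, and a histogram of `props` would look roughly bell-shaped.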
Sampling Distribution of a Sample
Proportion

[Figure: the sampling distribution of the sample proportion p̂ of successes has approximately a Normal distribution.]
Confidence Interval for a Single
Proportion





The sample proportion p̂ = X/n is the natural estimator of the population proportion P.
The traditional confidence interval for P is based on the
Normal approximation to the distribution of p̂ .
Unfortunately, confidence intervals based on this statistic
can be quite inaccurate, even for large samples.
We can do better by moving the sample proportion p̂ slightly away from 0 and 1.
The following simple adjustment works very well in
practice.
Confidence Interval for a Single
Proportion

Wilson Estimate:



Assume we have 4 additional observations, 2 of
which are successes and 2 of which are failures.
The new sample size is n + 4 and the count of
successes is X+2.
The estimator of the population proportion is
$\tilde{p} = \frac{X + 2}{n + 4}$
Confidence Interval for a Single
Proportion


We base a confidence interval on the z statistic obtained by standardizing the Wilson estimate p̃.
The distribution of p̃ is close to the Normal distribution with mean P and standard deviation
$\sqrt{\frac{P(1 - P)}{n + 4}}$
Confidence Interval for a Single
Proportion


Choose a SRS of size n from a large population with unknown proportion P of successes. The Wilson estimate of the population proportion is
$\tilde{p} = \frac{X + 2}{n + 4}$
The standard error of p̃ is
$SE_{\tilde{p}} = \sqrt{\frac{\tilde{p}(1 - \tilde{p})}{n + 4}}$
An approximate level C confidence interval for P is
$\tilde{p} \pm z^* SE_{\tilde{p}}$
where z* is the value for the standard Normal density curve with area C between −z* and z*.
Use this interval when the sample size is at least n = 5 and the confidence level is 90% or more.
Example: estimating the effect of work
stress

The sample survey in the previous example found that 68 out of 100 employees agreed that work stress had a negative impact on their personal lives. The sample size is n = 100 and the count of successes is X = 68. The Wilson estimate of the proportion of all employees affected by work stress is
$\tilde{p} = \frac{X + 2}{n + 4} = \frac{68 + 2}{100 + 4} = 0.6731$

The standard error is
$SE_{\tilde{p}} = \sqrt{\frac{\tilde{p}(1 - \tilde{p})}{n + 4}} = \sqrt{\frac{0.6731(1 - 0.6731)}{104}} = 0.0460$
Example: estimating the effect of work
stress

The z critical value for 95% confidence is z* = 1.96, so the confidence interval is
$\tilde{p} \pm z^* SE_{\tilde{p}} = 0.6731 \pm (1.96)(0.0460) = 0.673 \pm 0.090$

We are 95% confident that between 58.3% and
76.3% of the restaurant chain’s employees feel
that work stress is damaging their personal lives.
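The interval calculation for the work stress example can be reproduced in a few lines. This is a minimal sketch rather than a reference implementation: the function name `plus_four_ci` is made up for illustration, and z* is taken from Python's standard-library `statistics.NormalDist` instead of a printed table.

```python
import math
from statistics import NormalDist

def plus_four_ci(x, n, conf=0.95):
    """Confidence interval for one proportion based on the Wilson estimate (X+2)/(n+4)."""
    # z* leaves area `conf` between -z* and z* under the standard Normal curve.
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_tilde = (x + 2) / (n + 4)                         # Wilson estimate
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))   # standard error
    margin = z_star * se
    return p_tilde, se, (p_tilde - margin, p_tilde + margin)

# Work stress example: X = 68 "Yes" answers out of n = 100.
p_tilde, se, (lo, hi) = plus_four_ci(68, 100)
print(f"estimate = {p_tilde:.4f}, SE = {se:.4f}")
print(f"95% CI   = ({lo:.3f}, {hi:.3f})")
```

With X = 68 and n = 100 the result agrees with the hand calculation to the precision shown on the slide: an estimate near 0.673 and an interval from about 0.583 to 0.763.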
Significance Test for a Single Proportion

The sample proportion p̂ = X/n is approximately Normal with mean and standard deviation
$\mu_{\hat{p}} = p, \qquad \sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$
For the confidence interval we used the Wilson estimate and estimated the standard deviation from the data.
When performing a significance test, the null hypothesis specifies a value for p, which we call p0.
We assume the hypothesized value p0 is actually true, substitute p0 for p in the expression for σ_p̂, and then standardize p̂:
$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$



Significance Test for a Single Proportion
Example: Work stress

A national survey of restaurant employees found
that 75% said that work stress had a negative
impact on their personal lives. A sample of 100
employees of a restaurant chain found that 68
answered “Yes” when asked, “Does work stress
have a negative impact on your personal life?” Is
this good reason to think that the proportion of all
employees of this chain who say “Yes” differs
from the national proportion p0 = 0.75?
Example: Work stress

To answer this question, we test
H0: p = 0.75
Ha: p ≠ 0.75

The expected numbers of “Yes” and “No” responses are
100 × 0.75 = 75 and 100 × 0.25 = 25
Both are greater than 10, so we can use the z test.
The test statistic is
$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}} = \frac{0.68 - 0.75}{\sqrt{\frac{(0.75)(0.25)}{100}}} = -1.62$
Example: Work stress

From Table A we find
$P(Z \ge 1.62) = 1 - 0.9474 = 0.0526$

The P-value is
$P = 2 \times 0.0526 = 0.1052$
We conclude that the chain restaurant data are compatible with the survey results.
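The same test can be run in code. The sketch below assumes a two-sided alternative and uses the standard library's `statistics.NormalDist` in place of Table A; the function name `one_proportion_z_test` is an illustrative choice.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(x, n, p0):
    """Two-sided z test of H0: p = p0 using the Normal approximation."""
    p_hat = x / n
    se0 = math.sqrt(p0 * (1 - p0) / n)   # standard deviation of p-hat under H0
    z = (p_hat - p0) / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Work stress example: 68 of 100 say "Yes"; national proportion p0 = 0.75.
z, p_value = one_proportion_z_test(68, 100, 0.75)
print(f"z = {z:.2f}, two-sided P-value = {p_value:.4f}")
```

Because the code keeps z unrounded (about −1.617 rather than −1.62), its P-value is close to 0.106, slightly different from the table-based 0.1052. The conclusion is the same.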
Choosing a Sample Size


We want to see how to choose the sample size n to
obtain a confidence interval with specified margin
of error m for a population proportion.
The margin of error for the confidence interval for a population proportion is:
$m = z^* SE_{\tilde{p}} = z^* \sqrt{\frac{\tilde{p}(1 - \tilde{p})}{n + 4}}$

Choosing a confidence level C fixes the critical
value z*.
Choosing a Sample Size



The margin of error also depends on the value of p̃ and the sample size n.
We don’t know the value of p̃ until we gather data, so we must guess a value to use in the calculations.
Let’s call the guessed value p*. There are two ways to get p*:

Use a sample estimate from a pilot study or from similar studies done earlier.
Use p* = 0.5. Because the margin of error is largest when p̃ = 0.5, this choice gives a sample size that is somewhat larger than we really need for the confidence level we choose. It is a safe choice no matter what the data later show.
Choosing a Sample Size

The level C confidence interval for a proportion p will have a margin of error approximately equal to a specified value m when the sample size satisfies
$n + 4 = \left(\frac{z^*}{m}\right)^2 p^*(1 - p^*)$

Here z* is the critical value for confidence C, and p* is a guessed value for the proportion of successes in the future sample.
The margin of error will be less than or equal to m if p* is chosen to be 0.5. The sample size required is then given by
$n + 4 = \left(\frac{z^*}{2m}\right)^2$
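For readers who want the intermediate algebra, the sample-size rule follows from solving the margin-of-error formula for n + 4 with the guessed value p* in place of the Wilson estimate. The steps below are a sketch of that derivation, not taken from the slides:

```latex
\begin{align*}
m &= z^* \sqrt{\frac{p^*(1 - p^*)}{n + 4}} \\
m^2 &= (z^*)^2 \, \frac{p^*(1 - p^*)}{n + 4} \\
n + 4 &= \left(\frac{z^*}{m}\right)^2 p^*(1 - p^*) \\
\text{With } p^* = 0.5:\quad p^*(1 - p^*) = \tfrac{1}{4}
  \;&\Longrightarrow\; n + 4 = \left(\frac{z^*}{2m}\right)^2
\end{align*}
```

Since p*(1 − p*) is never larger than 1/4, the choice p* = 0.5 gives the largest required sample size, which is why it is the safe choice.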
Example: Planning a sample of
customers

Your company has received complaints about its customer
support service. You intend to hire a consulting company
to carry out a sample survey of customers. Before
contacting the consultant, you want some idea of the
sample size you will have to pay for. One critical question
is the degree of satisfaction with your customer service,
measured on a five-point scale. You want to estimate the
proportion P of your customers who are satisfied (that is,
who choose either “satisfied” or “very satisfied,” the two
highest levels on the five-point scale).
Example: Planning a sample of
customers

You want to estimate P with 95% confidence and a margin of error less than or equal to 3%. For planning purposes, you are willing to use p* = 0.5. The sample size required is:
$n + 4 = \left(\frac{z^*}{2m}\right)^2 = \left(\frac{1.96}{(2)(0.03)}\right)^2 = 1067.1$

Round up to get n + 4 = 1068 or n = 1064. (Always round up. Rounding down would give a margin of error slightly greater than 0.03.)
Similarly, for a 2.5% margin of error we have (after rounding up)
$n + 4 = \left(\frac{1.96}{(2)(0.025)}\right)^2 = 1537$
Comparing Two Proportions





We often want to compare the proportions of two groups (such as men and women) that have some characteristic.
We call the two groups being compared population 1 and population 2.
The two population proportions of “successes” are P1 and P2.
The data consist of two independent SRSs.
The sample sizes are n1 from population 1 and n2 from population 2.
Comparing Two Proportions


The proportion of successes in each sample
estimates the corresponding population
proportion.
Here is the notation we will use:

Population   Population proportion   Sample size   Count of successes   Sample proportion
1            P1                      n1            X1                   p̂1 = X1/n1
2            P2                      n2            X2                   p̂2 = X2/n2
Sampling Distribution of p̂1 − p̂2

Choose independent SRS of sizes n1 and n2 from two populations with proportions P1 and P2 of successes.
Let $D = \hat{p}_1 - \hat{p}_2$ be the difference between the two sample proportions of successes.
Then as both sample sizes increase, the sampling distribution of D becomes approximately Normal.

The mean of the sampling distribution is $P_1 - P_2$.
The standard deviation of the sampling distribution is
$\sigma_D = \sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}$
Sampling Distribution of p̂1 − p̂2

[Figure: the sampling distribution of the difference of two sample proportions is approximately Normal. The mean and standard deviation are found from the two population proportions of successes, P1 and P2.]
Confidence Interval


Just as in the case of estimating a single
proportion, a small modification of the
sample proportions greatly improves the
accuracy of confidence intervals.
The Wilson estimates of the two population proportions are
$\tilde{p}_1 = \frac{X_1 + 1}{n_1 + 2}$
$\tilde{p}_2 = \frac{X_2 + 1}{n_2 + 2}$
Confidence Interval

Let $\tilde{D} = \tilde{p}_1 - \tilde{p}_2$ be the difference between the two Wilson estimates. The standard deviation of D̃ is approximately
$\sigma_{\tilde{D}} = \sqrt{\frac{P_1(1 - P_1)}{n_1 + 2} + \frac{P_2(1 - P_2)}{n_2 + 2}}$
To obtain a confidence interval for P1 − P2, we replace the unknown parameters in the standard deviation by estimates to obtain an estimated standard deviation, or standard error.
Confidence Interval for Comparing
Two Proportions
Example: “No Sweat” Garment Labels

Following complaints about the working conditions in some apparel factories both in the United States and abroad, a joint government and industry commission recommended in 1998 that companies that monitor and enforce proper standards be allowed to display a “No Sweat” label on their products. A survey of U.S. residents aged 18 or older asked a series of questions about how likely they would be to purchase a garment under various conditions.
Example: “No Sweat” Garment Labels

For some conditions, it was stated that the garment had a “No Sweat” label; for others, there was no mention of such a label. On the basis of the responses, each person was classified as a “label user” or a “label nonuser.” About 16.5% of those surveyed were label users. One purpose of the study was to describe the demographic characteristics of users and nonusers.
Example: “No Sweat” Garment Labels

The study suggested that there is a gender difference in the proportion of label users. Here is a summary of the data. Let X denote the number of label users.

Population   n     X    p̂ = X/n   p̃ = (X + 1)/(n + 2)
1 (women)    296   63   0.213     0.215
2 (men)      251   27   0.108     0.111
Example: “No Sweat” Garment Labels

First calculate the standard error of the observed difference.
$SE_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1 - \tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1 - \tilde{p}_2)}{n_2 + 2}} = \sqrt{\frac{(0.215)(0.785)}{296 + 2} + \frac{(0.111)(0.889)}{251 + 2}} = 0.0309$
The 95% confidence interval is
$(\tilde{p}_1 - \tilde{p}_2) \pm z^* SE_{\tilde{D}} = (0.215 - 0.111) \pm (1.96)(0.0309) = 0.104 \pm 0.061 \approx (0.04, 0.16)$
Example: “No Sweat” Garment Labels

With 95% confidence we can say that the difference in the proportions is between 0.04 and 0.16.
Alternatively, we can report that women are about 10 percentage points more likely to be label users than men, with a 95% margin of error of 6 percentage points.
In this example we chose women to be the first population.
Had we chosen men as the first population, the estimate of
the difference would be negative (-0.104).
Because it is easier to discuss positive numbers, we
generally choose the first population to be the one with the
higher proportion.
The choice does not affect the substance of the analysis.
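The two-sample interval can be computed directly from the counts. This is a sketch under the same assumptions as the example (independent samples, one success and one failure added to each sample); the function name `plus_four_diff_ci` is hypothetical.

```python
import math
from statistics import NormalDist

def plus_four_diff_ci(x1, n1, x2, n2, conf=0.95):
    """Confidence interval for P1 - P2 using Wilson estimates (X+1)/(n+2)."""
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    # Standard error of the difference of the two Wilson estimates.
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    margin = z_star * se
    return diff, se, (diff - margin, diff + margin)

# Garment label example: 63 of 296 women and 27 of 251 men are label users.
diff, se, (lo, hi) = plus_four_diff_ci(63, 296, 27, 251)
print(f"difference = {diff:.3f}, SE = {se:.4f}")
print(f"95% CI     = ({lo:.3f}, {hi:.3f})")
```

Working from unrounded counts, the code gives a difference near 0.104, a standard error near 0.031, and an interval of roughly (0.04, 0.16). Swapping the two samples flips the sign of the difference and of both endpoints, which matches the remark that the choice of first population does not affect the substance of the analysis.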
Significance Tests


It is sometimes useful to test the null hypothesis that the two population proportions are the same.
We standardize $D = \hat{p}_1 - \hat{p}_2$ by subtracting its mean P1 − P2 and then dividing by its standard deviation
$\sigma_D = \sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}$
If n1 and n2 are large, the standardized difference is approximately N(0, 1).
To estimate σ_D we take into account the null hypothesis that P1 = P2.
Significance Tests


If these two proportions are equal, we can view all of the data as coming from a single population.
Let P denote the common value of P1 and P2. The standard deviation of $D = \hat{p}_1 - \hat{p}_2$ is then
$\sigma_{Dp} = \sqrt{\frac{P(1 - P)}{n_1} + \frac{P(1 - P)}{n_2}} = \sqrt{P(1 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
Significance Tests

We estimate the common value of P by the overall proportion of successes in the two samples.
$\hat{p} = \frac{\text{number of successes in both samples}}{\text{number of observations in both samples}} = \frac{X_1 + X_2}{n_1 + n_2}$

This estimate of P is called the pooled estimate.
To estimate the standard deviation of D, substitute p̂ for P in the expression for σ_Dp.
The result is a standard error for D under the condition that the null hypothesis H0: P1 = P2 is true.
The test statistic uses this standard error to standardize the difference between the two sample proportions.
Significance Tests for Comparing Two
Proportions
Example: men, women, and garment labels

The previous example presented the survey data on whether consumers are “label users” who pay attention to label details when buying a shirt. Are men and women equally likely to be label users? Here is the data summary:

Population   n     X    p̂ = X/n
1 (women)    296   63   0.213
2 (men)      251   27   0.108
Example: men, women, and garment labels

We compare the proportions of label users in the two populations (women and men) by testing the hypotheses
H0: P1 = P2
Ha: P1 ≠ P2

The pooled estimate of the common value of P is:
$\hat{p} = \frac{63 + 27}{296 + 251} = \frac{90}{547} = 0.1645$
This is the proportion of label users in the entire sample.
Example: men, women, and garment labels

The test statistic is calculated as follows:
$SE_{Dp} = \sqrt{(0.1645)(0.8355)\left(\frac{1}{296} + \frac{1}{251}\right)} = 0.03181$
$z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{Dp}} = \frac{0.213 - 0.108}{0.03181} = 3.30$
The observed difference is more than 3 standard deviations away from zero.
Example: men, women, and garment labels

The P-value is:
$2 \times P(Z \ge 3.30) = 2 \times (1 - 0.9995) = 2 \times 0.0005 = 0.001$

Conclusion:

21% of women are label users versus only 11% of men; the difference is statistically significant.
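The pooled two-sample test can be checked in code as well. The sketch below assumes the two-sided alternative used in the example and takes Normal probabilities from the standard library's `statistics.NormalDist`; the function name `two_proportion_z_test` is illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z test of H0: P1 = P2 using the pooled estimate of P."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)   # overall proportion of successes
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_pooled

# Garment label example: 63 of 296 women and 27 of 251 men are label users.
z, p_value, p_pooled = two_proportion_z_test(63, 296, 27, 251)
print(f"pooled estimate = {p_pooled:.4f}")
print(f"z = {z:.2f}, two-sided P-value = {p_value:.4f}")
```

Using unrounded sample proportions, z comes out near 3.31 rather than 3.30 and the P-value is about 0.001, so the conclusion is unchanged.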