Transcript Lecture16

STAT 651
Lecture #16
Copyright (c) Bani K. Mallick
Topics in Lecture #16

Inference about two population proportions
Book Sections Covered in Lecture #16

Chapter 10.3
Lecture #15 Review: Categorical
Data



In general, we can discuss a problem where
the outcome is binary, the success probability
is p, and number of experiments is n.
X = the number of successes in the
experiment
p̂ = the fraction of successes in the
experiment
Lecture #15 Review: Categorical
Data




The number of successes X in n experiments, each with probability of success p, is called a
binomial random variable
There is a formula for this:
$\Pr(X = k) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$
0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1
= 6, 4! = 4 x 3 x 2 x 1 = 24, etc.
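The binomial formula can be checked numerically. The following is a minimal Python sketch, not part of the original lecture; the function name `binomial_pmf` and the example values of n and p are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Pr(X = k) for a binomial random variable with n trials and success probability p."""
    # comb(n, k) is n! / (k! (n-k)!)
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Illustrative values (assumed, not from the lecture): n = 8 trials, p = 0.5
n, p = 8, 0.5
probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]
for k, prob in enumerate(probabilities):
    print(f"Pr(X = {k}) = {prob:.4f}")

# The probabilities over k = 0, ..., n should add up to 1
total = sum(probabilities)
print(f"Sum of probabilities = {total:.4f}")
```

Since p̂ = X/n, the same numbers also give Pr(p̂ = k/n).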
Lecture #15 Review: Categorical
Data



The fraction of successes in n experiments, each with probability of success p, also has a
formula:
$\Pr(\hat{p} = k/n) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$
The binomial formula is used to understand
the properties of the sample fraction, e.g., its
standard deviation
Lecture #15 Review:

If you code your attribute as “0” and “1” in
SPSS, then the sample fraction p̂ is the
same as the sample mean of these “data”

For example, let the “data” be 0,1,0,0,0,1,0,1

Then n = 8, and p̂ = 3/8

What is the sample mean of these “data”?
$\bar{X} = 3/8 = \hat{p}$
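A short Python check of this identity, using the eight “data” values from the slide. This is only an illustrative sketch; the lecture itself uses SPSS, and the variable names are assumptions.

```python
from statistics import mean

# The 0/1 "data" from the slide
data = [0, 1, 0, 0, 0, 1, 0, 1]

n = len(data)
successes = data.count(1)        # number of 1s
p_hat = successes / n            # sample fraction
sample_mean = mean(data)         # ordinary sample mean of the 0/1 values

print(f"n = {n}, p_hat = {p_hat}, sample mean = {sample_mean}")
```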
Lecture #15 Review: Categorical
Data

(1 − α)100% CI for the population fraction:
$\hat{p} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{p}}$
$\hat{\sigma}_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})/n}$
$z_{\alpha/2}$ is found by looking up 1 − α/2 in Table 1
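The interval can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecture's method: the Table 1 lookup is replaced by the standard library's `NormalDist().inv_cdf`, the function name `proportion_ci` is made up for this example, and the counts (60 successes out of 100) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, conf_level=0.95):
    """Large-sample (1 - alpha)100% CI for a population fraction."""
    p_hat = successes / n
    alpha = 1.0 - conf_level
    # z_{alpha/2}: the standard normal value with area 1 - alpha/2 to its left
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = sqrt(p_hat * (1.0 - p_hat) / n)   # estimated standard error of p_hat
    return p_hat - z * se, p_hat + z * se

# Hypothetical data: 60 successes in 100 experiments
lower, upper = proportion_ci(60, 100)
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```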
Lecture #15 Review: Sample Size
Calculations

If you want a (1 − α)100% CI to be
$\hat{p} \pm E$

you should set
$n = z_{\alpha/2}^2\, p(1-p) / E^2$
Lecture #15 Review: Sample Size
Calculations
$n = z_{\alpha/2}^2\, p(1-p) / E^2$
The small problem is that you do not know p.
You have two choices:


Make a guess for p
Set p = 0.50 and calculate (most
conservative, since it results in the largest
sample size)
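Both choices can be tried out with a small Python sketch. The function name `required_n`, the margin E = 0.03 and the guess p = 0.2 are hypothetical values chosen for illustration; the result is rounded up because a sample size must be a whole number.

```python
from math import ceil
from statistics import NormalDist

def required_n(margin, p_guess=0.5, conf_level=0.95):
    """Smallest n for which the CI half-width is at most `margin`."""
    alpha = 1.0 - conf_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # n = z^2 p (1 - p) / E^2, rounded up to a whole number
    return ceil(z**2 * p_guess * (1.0 - p_guess) / margin**2)

# Hypothetical target: 95% CI of p_hat +/- 0.03
n_conservative = required_n(0.03)             # p = 0.50, the most conservative choice
n_with_guess = required_n(0.03, p_guess=0.2)  # a guessed value of p
print(n_conservative, n_with_guess)
```

Because p(1 − p) is largest at p = 0.50, any other guess gives a smaller required sample size.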
Comparison of Two Population
Proportions

In some cases, we may want to compare two
population proportions p1 and p2

The null hypothesis is H0: p1 = p2

This is the same as H0: p1 - p2 = 0

There are two ways to test this hypothesis


One is via what is called a chi-squared
statistic, which gives you only a p-value
This is bad: why?
This is bad because, if you reject, you have
no idea how different the populations
are!
Comparison of Two Population
Proportions



The null hypothesis is H0: p1 - p2 = 0
The other way is to form a CI for the
difference in population proportions p1 - p2
The estimate of this difference is simply the
difference in the sample fractions:
$\hat{p}_1 - \hat{p}_2$
Comparison of Two Population
Proportions

The standard error of the difference in the
sample fractions:
$\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
The usual way to form a CI is to replace the
unknown population fractions by the sample
fractions
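The reason the two terms add can be sketched in one step. The derivation below assumes the two samples are independent, which the slides do not state explicitly but which the formula requires.

```latex
\begin{align*}
\operatorname{Var}(\hat{p}_1 - \hat{p}_2)
  &= \operatorname{Var}(\hat{p}_1) + \operatorname{Var}(\hat{p}_2)
     && \text{(independent samples)} \\
  &= \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2},
\end{align*}
\qquad\text{so}\qquad
\sigma_{\hat{p}_1 - \hat{p}_2}
  = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} .
```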
Comparison of Two Population
Proportions

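The estimated standard error of the
difference in the sample fractions:
$\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
The (1 − α)100% CI then is
$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{p}_1 - \hat{p}_2}$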
Comparison of Two Population
Proportions: Boxers versus Briefs


Most books force you to compute this by
hand
For female preferences in men:
$n_1 = 177$, $\hat{p}_1 = 0.7345$

For male preferences:
$n_2 = 188$, $\hat{p}_2 = 0.4681$

Think the populations are different?
$\hat{p}_1 - \hat{p}_2 = 0.2664$
Comparison of Two Population
Proportions: Boxers versus Briefs

The estimated standard error of the
difference in the sample fractions is
$\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} = \sqrt{0.001102 + 0.001324} = 0.04926$
Comparison of Two Population
Proportions: Boxers versus Briefs

Putting this together we get that the 95% CI
is 0.2664 - 1.96 * 0.04926 = 0.17 up to the
value 0.2664 + 1.96 * 0.04926 = 0.36

So, the 95% CI is from 0.17 to 0.36
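The hand calculation can be reproduced with a few lines of Python. This is a sketch rather than the lecture's own procedure: the function name `two_proportion_ci` is an assumption, and the z value comes from the standard library instead of Table 1. The inputs are the sample sizes and sample fractions given above.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ci(p1_hat, n1, p2_hat, n2, conf_level=0.95):
    """Large-sample z interval for the difference p1 - p2."""
    alpha = 1.0 - conf_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Estimated standard error of p1_hat - p2_hat
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff, se, (diff - z * se, diff + z * se)

# Boxers example: females (group 1) and males (group 2)
diff, se, (lower, upper) = two_proportion_ci(0.7345, 177, 0.4681, 188)
print(f"difference = {diff:.4f}, standard error = {se:.5f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

The interval agrees with the 0.17 to 0.36 quoted on the slide after rounding to two decimals.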

What is this a CI for?

What is the conclusion?
Comparison of Two Population
Proportions: Boxers versus Briefs



95% CI is from 0.17 to 0.36
What is this a CI for? The difference p1 - p2
between the population fractions of females and
of males who prefer boxers
What is the conclusion? A larger fraction of females
than of males prefers men to wear boxers,
by between 17 and 36 percentage points
Comparison of Two Population
Proportions:



Remarkably, but perhaps not surprisingly, you
do not have to compute these confidence
intervals by hand!
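The idea: simply pretend, and I do mean
pretend, that the binary outcomes are real
numbers and run your ordinary two-sample t-test;
read the CI from the unequal variances line
The results will be slightly different from your
hand calculations (the t-test divides by n - 1 rather
than n in each variance and uses a t rather than a
z critical value), but actually a bit more accurate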
Illustration with the Boxers Problem
The value “1” indicates a preference for boxers
Note how women have a higher preference for
boxers than do men, in this sample
Group Statistics: Boxer versus Briefs Preference
Gender    N     Mean    Std. Deviation   Std. Error Mean
Female    177   .7345   .4429            3.329E-02
Male      188   .4681   .5003            3.649E-02
Illustration with the Boxers Problem
Independent Samples Test: Boxer versus Briefs Preference
Levene's Test for Equality of Variances: F = 49.523, Sig. = .000
t-test for Equality of Means:
                              t       df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed       5.373   363       .000              .2664             4.957E-02               .1689          .3639
Equal variances not assumed   5.393   361.642   .000              .2664             4.939E-02               .1692          .3635
Illustration with the Boxers Problem
Difference in sample means = 0.2664
Standard error of this difference = 0.04939
Illustration with the Boxers Problem:
hand CI is 0.17 to 0.36: note similarities!
p-value = 0.000 (SPSS reports three decimals, so this means p < 0.0005). Note that you
use the p-value from the unequal variances line
The 95% CI from SPSS is 0.1692 to 0.3635, nearly the same as the
hand calculation.
Men and women have different preferences, even at 99.9%
confidence.
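The unequal-variances line of the SPSS output can be approximately reproduced by treating the 0/1 values as ordinary numbers. The Python sketch below does this with the standard library only. It rests on an assumption: the raw data are not in the slides, so the counts of 130 of 177 females and 88 of 188 males preferring boxers are inferred from the reported means (.7345 and .4681). The function name `welch_summary` is also made up for this illustration.

```python
from math import sqrt
from statistics import mean, variance

# ASSUMPTION: counts inferred from the reported sample fractions
females = [1] * 130 + [0] * 47    # 130/177 is about 0.7345
males = [1] * 88 + [0] * 100      # 88/188 is about 0.4681

def welch_summary(x, y):
    """Difference in means, its standard error, t statistic, and Welch df."""
    a = variance(x) / len(x)      # variance() divides by n - 1
    b = variance(y) / len(y)
    diff = mean(x) - mean(y)
    se = sqrt(a + b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (a + b) ** 2 / (a**2 / (len(x) - 1) + b**2 / (len(y) - 1))
    return diff, se, diff / se, df

diff, se, t_stat, df = welch_summary(females, males)
print(f"difference = {diff:.4f}, std. error = {se:.5f}, t = {t_stat:.3f}, df = {df:.3f}")
```

The confidence limits themselves need a t critical value for about 361.6 degrees of freedom, which the standard library does not provide, so they are not computed here.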
US Availability and Rating: Are
Better Beers More Widely Available?
The “data” are coded as
0 = not widely available
1 = widely available
Group Statistics: Availability in the U.S.
Very Good versus Other   N    Mean   Std. Deviation   Std. Error Mean
Very Good                11   0.45   .52              .16
Fair or Good             24   0.75   .44              9.03E-02
With the “data” coded as 0 and 1, this means that in
the sample, 45% of the very good beers were widely
available
With the “data” coded as 0 and 1, this means that in
the sample, 75% of the fair/good beers were widely
available
US Availability and Rating: Are
Better Beers More Widely Available?
Independent Samples Test: Availability in the U.S.
Levene's Test for Equality of Variances: F = 3.169, Sig. = .084
t-test for Equality of Means:
                              t        df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed       -1.734   33       .092              -.30              .17                     -.64           5.12E-02
Equal variances not assumed   -1.628   16.864   .122              -.30              .18                     -.68           8.77E-02
The Sig. (2-tailed) value on the unequal variances line (0.122) is the p-value
for the hypothesis that the two population fractions are the same
Comparison of Two Population
Proportions:

Note that the p-value on the unequal variances line was 0.122, which is > 0.10

What does this mean?

There is no statistically significant evidence that those beers which
are very good have any more or less national
availability than those which are good or fair
Construction Example



The construction example was based on a
survey made available to me.
I will look at the percentages of males
sampled in Texas and in states outside of
Texas
If these were random samples, they would be
a measure of how different states are in their
gender distributions in the construction
industry
Construction Data: Gender
Differences by Texas or Not
(1 = male)
Group Statistics: Sex
State: Texas or Not   N     Mean   Std. Deviation   Std. Error Mean
Outside Texas         274   .86    .34              2.07E-02
Texas                 173   .26    .44              3.35E-02
Something strange:
86% of the sample outside Texas is male
26% of the sample in Texas is male
Construction Data: Gender
Differences by Texas or Not
(1 = male)
Independent Samples Test: Sex
Levene's Test for Equality of Variances: F = 43.713, Sig. = .000
t-test for Equality of Means:
                              t        df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed       16.260   445       .000              .60               3.72E-02                .53            .68
Equal variances not assumed   15.379   300.960   .000              .60               3.93E-02                .53            .68
Something strange:
86% of the sample outside Texas is male
26% of the sample in Texas is male
Not surprising: p-value = 0.000
Comparison of Two Population
Proportions:


Please study the slides for the next lecture
before coming to class
The material is somewhat difficult, and if you
do not look at the slides and try to
understand them, you will find my lecture all
but impossible to understand.