Transcript 1-a

STAT 651
Lecture #15
Copyright (c) Bani K. Mallick
Topics in Lecture #15

Some basic probability

The binomial distribution

Inference about a single population proportion
Book Sections Covered in Lecture #15

Chapters 4.7-4.8

Chapter 10.2
Lecture 14 Review: Nonparametric
Methods

Replace each observation by its rank in the
pooled data

Do the usual ANOVA F-test

Kruskal-Wallis
Lecture 14 Review: Nonparametric
Methods



Once you have decided that the populations are different in their means, there is no version of an LSD
You simply have to do each comparison in turn
This is a bit of a pain in SPSS, because you physically must do each 2-population comparison, defining the groups as you go
Categorical Data


Not all experiments are based on numerical outcomes
We will deal with categorical outcomes, i.e., outcomes where each individual falls into a category

The simplest categorical variable is binary:

Success or failure

Male or female
Categorical Data

For example, consider flipping a fair coin, and
let

X = 0 means “tails”

X = 1 means “heads”
Categorical Data



The fraction of the population who are
“successes” will be denoted by the Greek
symbol p
Note that because it is a Greek symbol, it
represents something to do with a population
For coin flipping, if you flipped all the fair
coins in the world (the population), the
fraction of the times they turn up heads
equals p
Categorical Data




The fraction of the population who are
“successes” will be denoted by the Greek
symbol p
The fraction of the sample of size n who are
“successes” is going to be denoted by p̂
We want to relate p̂ to p
Let X = number of successes in the sample.
The fraction p̂ = (# successes)/n = X / n
Categorical Data

Suppose you flip a coin 10 times, and get 6
heads.

The proportion of heads = 0.60

The percentage of heads = 60%
Categorical Data




The number of successes X in n experiments, each with probability of success p, is called a binomial random variable
There is a formula for this:
$$\Pr(X = k) = \Pr(\hat{p} = k/n) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$$
0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.
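As a minimal sketch of the binomial formula in code (the function name `binomial_pmf` is a label chosen here, not something from the lecture), Python's standard `math.comb` supplies the n!/(k!(n-k)!) term. The example reuses the fair coin flipped 10 times with 6 heads.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Pr(X = k), where X counts successes in n trials with success probability p."""
    # comb(n, k) equals n! / (k! (n-k)!)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Fair coin (p = 0.5) flipped n = 10 times: chance of exactly k = 6 heads
prob_six_heads = binomial_pmf(6, 10, 0.5)
print(round(prob_six_heads, 4))

# The probabilities over k = 0, ..., n should add up to 1
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(round(total, 6))
```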
Categorical Data
$$\Pr(X = k) = \Pr(\hat{p} = k/n) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$$
0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.
The idea is to relate the sample fraction to
the population fraction using this formula
Key Point: if we knew p, then we could
entirely characterize the fraction of
experiments that have k successes
Categorical Data



The probability that the coin lands on heads
will be denoted by the Greek symbol p
Suppose you flip a coin 2 times, and count
the number of heads.
So here, X = number of heads that arise
when you flip a coin 2 times

X takes on the values 0, 1 and 2

p̂ takes on the values 0/2, ½, 2/2
Categorical Data: What the binomial
formula does

The experiment results in 4 equally likely outcomes: each occurs ¼ of the time

                    Tails on toss #1    Heads on toss #1
Heads on toss #2           ¼                   ¼
Tails on toss #2           ¼                   ¼
Categorical Data

Heads = “success”:
$\Pr(X = 0) = \Pr(\hat{p} = 0/2) = 1/4$
$\Pr(X = 1) = \Pr(\hat{p} = 1/2) = 1/2$
$\Pr(X = 2) = \Pr(\hat{p} = 2/2) = 1/4$
The binomial formula can be used to give these results without thinking

                    Tails on toss #1    Heads on toss #1
Heads on toss #2           ¼                   ¼
Tails on toss #2           ¼                   ¼
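The two-toss table can be checked by brute force. The sketch below lists the 4 equally likely outcomes, counts heads in each, and compares the result with the binomial formula for n = 2 and p = 1/2. The variable names are illustrative only, and exact fractions are used so the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import product
from math import comb

# All equally likely outcomes of 2 tosses: 0 = tails, 1 = heads
outcomes = list(product([0, 1], repeat=2))

# Probability of k heads, found by counting outcomes
by_enumeration = {
    k: Fraction(sum(1 for o in outcomes if sum(o) == k), len(outcomes))
    for k in range(3)
}

# Same probabilities from the binomial formula with n = 2, p = 1/2
p = Fraction(1, 2)
by_formula = {k: comb(2, k) * p**k * (1 - p)**(2 - k) for k in range(3)}

print(by_enumeration)
print(by_formula)
```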
Categorical Data
$$\Pr(X = k) = \Pr(\hat{p} = k/n) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$$
0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.
n = 2, k = 1, k! = 1, n! = 2, (n-k)! = 1
$p^k = 0.5$ and $(1-p)^{n-k} = 0.5$, so $\Pr(X = 1) = \frac{2}{1 \cdot 1}(0.5)(0.5) = 1/2$

The binomial formula gives the answer ½, which we know to be correct
Categorical Data

Roll a fair die
Face: 1 2 3 4 5 6
Every outcome is equally likely, so what are the probabilities?
Categorical Data

Roll a fair die
Face:         1    2    3    4    5    6
Probability:  1/6  1/6  1/6  1/6  1/6  1/6
Every outcome is equally likely, so each has probability 1/6
Categorical Data

Roll a fair die
Face:         1    2    3    4    5    6
Probability:  1/6  1/6  1/6  1/6  1/6  1/6
Every outcome is equally likely, so each has probability 1/6
What is the chance of rolling a 1 or a 2?
Categorical Data

Roll a fair die
Face:         1    2    3    4    5    6
Probability:  1/6  1/6  1/6  1/6  1/6  1/6
Every outcome is equally likely, so each has probability 1/6
What is the chance of rolling a 1 or a 2? 2/6 = 1/3
Categorical Data

Now roll two fair dice
[Table: first die (1 to 6) across the columns, second die (1 to 6) down the rows; cells not yet filled in]
Every combination is equally likely, so what are the probabilities?
Categorical Data

Roll two fair dice
[Table: first die (1 to 6) across the columns, second die (1 to 6) down the rows; each of the 36 cells is 1/36]
Every combination is equally likely, so what are the probabilities?
Categorical Data

Roll two fair dice
[Table: first die (1 to 6) across the columns, second die (1 to 6) down the rows; each of the 36 cells is 1/36]
Define a success as rolling a 1 or a 2. What is the chance of two successes?
Categorical Data

Roll two fair dice
[Table: first die (1 to 6) across the columns, second die (1 to 6) down the rows; each of the 36 cells is 1/36]
Define a success as rolling a 1 or a 2. What is the chance of two successes? 4/36 = 1/9
Categorical Data

Roll two fair dice
[Table: first die (1 to 6) across the columns, second die (1 to 6) down the rows; each of the 36 cells is 1/36]
Define a success as rolling a 1 or a 2. What is the chance of two failures? 16/36 = 4/9
Categorical Data

So, a success occurs when you roll a 1 or a 2

Pr(success on a single die) = 2/6 = 1/3 = p

Pr(2 successes) = 1/3 x 1/3 = 1/9

Use the binomial formula: Pr(X = k) when k = 2 and n = 2

k! = 2, n! = 2, (n-k)! = 1, $p^k = 1/9$, and $(1-p)^{n-k} = 1$
$$\Pr(X = k) = \Pr(\hat{p} = k/n) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k} = 1/9$$
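The 6 x 6 table can also be counted by machine. This sketch enumerates the 36 equally likely rolls of two dice, treats a 1 or a 2 as a success, and tallies how many rolls have 0, 1, or 2 successes. The helper names `is_success` and `prob_successes` are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

# 36 equally likely (first die, second die) pairs
rolls = list(product(range(1, 7), repeat=2))

def is_success(face: int) -> bool:
    """A success is rolling a 1 or a 2."""
    return face in (1, 2)

def prob_successes(k: int) -> Fraction:
    """Fraction of the 36 rolls with exactly k successes."""
    count = sum(1 for roll in rolls if sum(is_success(f) for f in roll) == k)
    return Fraction(count, len(rolls))

two_successes = prob_successes(2)  # cells where both dice show 1 or 2
two_failures = prob_successes(0)   # cells where both dice show 3 to 6
print(two_successes, two_failures)
```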
Categorical Data



In other words, the binomial formula works in these simple cases, where we can draw nice tables
Now think of rolling 4 dice, and ask for the chance that 3 of the 4 times you get a 1 or a 2
Too big a table: need a formula
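For the four-dice question, a short sketch can apply the binomial formula with n = 4, k = 3, p = 1/3 and, as a sanity check, count the 6^4 = 1296 equally likely rolls directly. The names below are illustrative, and exact fractions keep the two answers comparable.

```python
from fractions import Fraction
from itertools import product
from math import comb

n, k = 4, 3
p = Fraction(1, 3)  # chance of a 1 or a 2 on one die

# Binomial formula: n! / (k! (n-k)!) * p^k * (1-p)^(n-k)
by_formula = comb(n, k) * p**k * (1 - p)**(n - k)

# Brute force over every roll of 4 dice, counting dice that show 1 or 2
hits = sum(
    1
    for roll in product(range(1, 7), repeat=n)
    if sum(face <= 2 for face in roll) == k
)
by_enumeration = Fraction(hits, 6**n)

print(by_formula, by_enumeration)
```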
Categorical Data





Does it matter what you call a “success” and what you call a “failure”?
No, as long as you keep track
For example, in a class experiment many years ago, men were asked whether they preferred to wear boxers or briefs
This is binary, because there are only 2 outcomes
“success” = ?????
Categorical Data


Binary experiments have sampling variability,
just like sample means, etc.
Experiment: “success” = being under 5’10” in
height

First 6 men with SSN < 5

First 6 men with SSN > 5

Note how the number of “successes” was
not the same! (I might have to do this a few
times)
Categorical Data



The sample fraction p̂ is a random
variable
This means that if I do the experiment over
and over, I will get different values.
These different values have a standard
deviation.
Categorical Data

The sample fraction p̂ has a standard error

Its standard error is
$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$
Note how if you have a bigger sample, the standard error decreases
The standard error is biggest when p = 0.50.
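Both remarks about the standard error can be seen numerically. The sketch below uses a hypothetical helper `standard_error`, evaluates it for two sample sizes at the same p, and then searches a grid of p values for the largest standard error at a fixed n. The values p = 0.30, n = 25 and n = 100 are chosen only for illustration.

```python
from math import sqrt

def standard_error(p: float, n: int) -> float:
    """Standard error of the sample fraction: sqrt(p (1 - p) / n)."""
    return sqrt(p * (1 - p) / n)

# A bigger sample gives a smaller standard error (same p)
se_small_sample = standard_error(0.30, 25)
se_large_sample = standard_error(0.30, 100)
print(round(se_small_sample, 4), round(se_large_sample, 4))

# For fixed n, find the p on a grid from 0.01 to 0.99 with the largest standard error
grid = [i / 100 for i in range(1, 100)]
p_with_largest_se = max(grid, key=lambda p: standard_error(p, 25))
print(p_with_largest_se)
```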
Categorical Data

The sample fraction p̂ has a standard error

Its standard error is
$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$
The estimated standard error based on the sample is
$$\hat{\sigma}_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Categorical Data



It is possible to make confidence intervals for
the population fraction if the number of
successes > 5, and the number of failures >
5
If this is not satisfied, consult a statistician
Under these conditions, the Central Limit
Theorem says that the sample fraction is
approximately normally distributed (in
repeated experiments)
Categorical Data

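(1 - α)100% CI for the population fraction
$$\hat{p} \pm z_{\alpha/2}\, \hat{\sigma}_{\hat{p}}, \qquad \hat{\sigma}_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
$z_{\alpha/2}$ is found by looking up $1 - \alpha/2$ in Table 1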
Categorical Data


Often, you will only know the sample
proportion/percentage and the sample size
Computing the confidence interval for the
population proportion: two ways



By hand
By SPSS (this is a pain if you do not have the data
entered already)
Because you may need to do this by hand, I
will make you do this.
Categorical Data

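(1 - α)100% CI for the population fraction: $\hat{p} \pm z_{\alpha/2}\, \hat{\sigma}_{\hat{p}}$

95% CI, $z_{\alpha/2}$ = 1.96

n = 25, p̂ = 0.30
$$\hat{\sigma}_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{.3(1-.3)}{25}} = 0.09165$$
$$\hat{p} \pm z_{\alpha/2}\, \hat{\sigma}_{\hat{p}} = 0.30 \pm 1.96 \times 0.09165$$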
Categorical Data

(1 - α)100% CI for the population fraction
$$\hat{p} \pm z_{\alpha/2}\, \hat{\sigma}_{\hat{p}} = 0.30 \pm 1.96 \times 0.09165 = 0.30 \pm 0.18 = [0.12, 0.48]$$

Interpretation?
Categorical Data

(1 - α)100% CI for the population fraction
$$\hat{p} \pm z_{\alpha/2}\, \hat{\sigma}_{\hat{p}} = 0.30 \pm 1.96 \times 0.09165 = 0.30 \pm 0.18 = [0.12, 0.48]$$

Interpretation? The proportion of successes in the population is between 0.12 and 0.48 (12% to 48%) with 95% confidence
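The by-hand calculation can be sketched in a few lines. The function name `proportion_ci` is hypothetical, and z = 1.96 is hard-coded as the 95% value rather than looked up in a table.

```python
from math import sqrt

def proportion_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation CI for a population fraction: p_hat +/- z * estimated SE."""
    se_hat = sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    margin = z * se_hat
    return p_hat - margin, p_hat + margin

# Worked example from the lecture: n = 25, p_hat = 0.30, 95% CI
lower, upper = proportion_ci(0.30, 25)
print(round(lower, 2), round(upper, 2))
```

This is only the large-sample interval; it assumes the lecture's condition that the numbers of successes and failures both exceed 5.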
Categorical Data



You can use SPSS as long as the number of
successes and the number of failures both
exceed 5
To get the confidence intervals, you first have
to define a numeric version of your variable
that classifies whether an observation is a
success or failure.
You then compute the 1-sample confidence interval from “descriptives”, then “Explore”: Demo
Categorical Data

If you set up your data in SPSS, the “mean”
will be the proportion/fraction/percentage of
1’s

Data = 0 1 1 1 0 0 0 1 0 0

n = 10

Mean = 4/10 = .40

p̂ = .40
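The same point in code: with successes coded as 1 and failures as 0, the ordinary mean of the data is the sample fraction. This sketch uses the ten data values listed above; the variable names are illustrative.

```python
# 1 = success, 0 = failure (the ten observations from the example)
data = [0, 1, 1, 1, 0, 0, 0, 1, 0, 0]

n = len(data)
successes = sum(data)    # adding 0/1 values counts the 1's
p_hat = successes / n    # the mean of 0/1 data is the sample fraction

print(n, successes, p_hat)
```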
Boxers versus briefs for males
In this output, boxers = 1 and briefs = 0
Case Processing Summary

                                            Cases
                               Valid           Missing         Total
                               N     Percent   N    Percent    N     Percent
Boxers or Briefs Preference    188   100.0%    0    .0%        188   100.0%
Boxers versus briefs for males: what
% prefer boxers? In the sample,
46.81%. In the population???
In this output, boxers = 1 and briefs = 0. The proportion
of 1’s is the mean
Descriptives: Boxers or Briefs Preference

                                             Statistic   Std. Error
Mean                                         .4681       3.649E-02
95% Confidence Interval for Mean
    Lower Bound                              .3961
    Upper Bound                              .5401
5% Trimmed Mean                              .4645
Median                                       .0000
Variance                                     .250
Std. Deviation                               .5003
Minimum                                      .00
Maximum                                      1.00
Range                                        1.00
Interquartile Range                          1.0000
Skewness                                     .129        .177
Kurtosis                                     -2.005      .353
Boxers versus briefs for males: what
% prefer boxers? Between 39.61%
and 54.01%
Descriptives: Numeric Boxers (0 = Briefs, 1 = Boxers), Gender = Male

                                             Statistic   Std. Error
Mean                                         .4681       3.649E-02
95% Confidence Interval for Mean
    Lower Bound                              .3961
    Upper Bound                              .5401
5% Trimmed Mean                              .4645
Median                                       .0000
Variance                                     .250
Std. Deviation                               .5003
Minimum                                      .00
Maximum                                      1.00
Range                                        1.00
Interquartile Range                          1.0000
Skewness                                     .129        .177
Kurtosis                                     -2.005      .353
Boxers versus briefs



In the sample, 46.81% of the men preferred boxers to briefs: 53.19% preferred briefs.
Between 39.61% and 54.01% of men prefer boxers to briefs (95% CI)
Is there enough evidence to conclude that men generally prefer briefs?
Boxers versus briefs




In the sample, 46.81% of the men preferred boxers to briefs: 53.19% preferred briefs.
Between 39.61% and 54.01% of men prefer boxers to briefs (95% CI)
Is there enough evidence to conclude that men generally prefer briefs?
No, since 50% is in the CI! This means that it is plausible (at 95% confidence) that 50% prefer boxers and 50% prefer briefs, i.e., p = 0.50.
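As a rough cross-check, the interval can be recomputed with the by-hand formula, taking n = 188 and p̂ = 0.4681 from the SPSS output. The helper name `normal_ci` is hypothetical. The result differs slightly from SPSS's .3961 to .5401, presumably because SPSS builds its interval for a mean from a t value and the sample standard deviation, but the conclusion about 50% is the same.

```python
from math import sqrt

def normal_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """By-hand 95% CI for a population fraction (normal approximation)."""
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Values read from the SPSS output: 188 men, sample fraction .4681 preferring boxers
lower, upper = normal_ci(0.4681, 188)
print(round(lower, 4), round(upper, 4))

# If 0.50 lies inside the interval, an even split cannot be ruled out
even_split_is_plausible = lower <= 0.50 <= upper
print(even_split_is_plausible)
```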
Sample Size Calculations


The standard error of the sample fraction p̂ is
$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$
If you want a (1 - α)100% CI to be p̂ ± E, you should set
$$E = z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$$
Sample Size Calculations
$$E = z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$$
This means that
$$n = z_{\alpha/2}^2\, \frac{p(1-p)}{E^2}$$
Sample Size Calculations
$$n = z_{\alpha/2}^2\, \frac{p(1-p)}{E^2}$$
The small problem is that you do not know p. You have two choices:

Make a guess for p
Set p = 0.50 and calculate (most conservative, since it results in largest sample size)

Most polling operations make the latter choice, since it is most conservative
Sample Size Calculations: Examples
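$$n = z_{\alpha/2}^2\, \frac{p(1-p)}{E^2}$$
Set E = 0.04, 95% CI, you guess that p = 0.30
$$n = 1.96^2\, \frac{.3(1-.3)}{.04^2} \approx 504.2$$
so n is about 504 (505 if you round up)
You have no good guess:
$$n = 1.96^2\, \frac{.5(1-.5)}{.04^2} = 600.25$$
so n = 601 after rounding up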
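The two sample size examples can be reproduced with a short sketch. The function name `sample_size` and its default arguments are assumptions made for this illustration; `math.ceil` rounds up, which is the safe direction if the margin E must not be exceeded.

```python
from math import ceil

def sample_size(E: float, p: float = 0.5, z: float = 1.96) -> float:
    """Unrounded n from n = z^2 * p (1 - p) / E^2; p = 0.5 is the conservative default."""
    return z**2 * p * (1 - p) / E**2

# E = 0.04, 95% CI, with a guess of p = 0.30
n_guess = sample_size(0.04, p=0.30)

# E = 0.04, 95% CI, with no good guess (p = 0.50)
n_conservative = sample_size(0.04)

print(round(n_guess, 2), ceil(n_guess))
print(round(n_conservative, 2), ceil(n_conservative))
```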