501_Lecture_05x

Binomial distributions for sample counts
Binomial distributions in statistical sampling
Finding binomial probabilities
Binomial mean and standard deviation
Sample proportions
Normal approximation for counts and proportions
Binomial formula
The law of large numbers assures us that if we measure enough
subjects, the statistic x-bar will eventually get very close to the unknown
parameter µ.
If we took every one of the possible samples of a certain size, calculated
the sample mean for each, and graphed all of those values, we’d have a
sampling distribution.
The population distribution of a variable is the distribution of
values of the variable among all individuals in the population.
The sampling distribution of a statistic is the distribution of
values taken by the statistic in all possible samples of the same
size from the same population.

Example:
◦ n = 2500 adults asked whether shopping is frustrating
  ▪ n is the number of trials
◦ X = 1650 answered “Yes”
  ▪ X is the number of “successes”
◦ p-hat = X/n = 1650/2500 = 0.66 is the sample proportion (of
successes)

Need to make sure we distinguish between the
count and the sample proportion
1. Each observation falls into one of just two categories:
◦ Success/Failure
◦ Heads/Tails
◦ Yes/No
2. All observations are independent
3. Fixed number of trials, n
4. The probability of success, p, is the same in each trial

The distribution of the (total) count of successes in this
binomial setting is:
Binomial distribution
denoted B(n,p)

Toss a fair coin 10 times and count the number X of
heads
◦ Binomial or not?
◦ What about a biased coin?

Deal 10 cards from a shuffled deck of 52. X is the
number of spades.
◦ Binomial?
◦ Suggestions?


Number of girls born among first 100 children in a (large)
hospital this year
Number of girls born in this hospital so far this year

SRS is not quite a Binomial setting
◦ Why? Check the 4 properties!

However, if the population is at least 10 times larger
than the sample size n, then the number of “successes” in
the sample is approximately Binomial.
◦ We say the count X is approximately B(n,p)
◦ Here p is the population success rate
  ▪ usually unknown
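
The card-dealing example shows why an SRS is not quite binomial: cards are dealt without replacement, so p changes from draw to draw. The sketch below is in Python rather than the SAS used in these notes, and the helper names `binom_pmf`, `hypergeom_pmf` and `max_gap` are made up for illustration. It compares the exact probabilities for the number of spades in 10 cards with B(10, 0.25), first for one deck and then for a stack of 10 decks.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hypergeom_pmf(k, n, successes, population):
    """P(k successes) when n items are drawn WITHOUT replacement."""
    return (comb(successes, k) * comb(population - successes, n - k)
            / comb(population, n))

n, p = 10, 0.25  # deal 10 cards; one quarter of every deck is spades

def max_gap(decks):
    """Largest gap between the exact and the binomial probabilities."""
    population, spades = 52 * decks, 13 * decks
    return max(abs(hypergeom_pmf(k, n, spades, population) - binom_pmf(k, n, p))
               for k in range(n + 1))

gap_one_deck = max_gap(1)    # population is only about 5 times the sample
gap_ten_decks = max_gap(10)  # population is 52 times the sample
print(f"1 deck:   largest gap = {gap_one_deck:.4f}")
print(f"10 decks: largest gap = {gap_ten_decks:.4f}")
```

The gap shrinks as the population grows relative to the sample, which is the idea behind the "10 times larger" rule.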



We will just use table C
For given n and p, table gives the probability for k
successes
Table only gives p’s of 0.5 or less
◦ If you have a p greater than 0.5, you need to switch the
roles of successes and failures: k successes with probability p
is the same event as n − k failures with probability 1 − p.

Bill is the star player on his basketball team.
Over his career, his free throw percentage is
75%. However, his three-point shot percentage
is only 20%.
◦ If he tries 5 three-point shots, what is the probability
he will make 2?
◦ If he tries 10 free throws, what is the probability he
will make 7?
◦ If he tries 10 free throws again, what is the probability
he makes at least 7 free throws?
$$P(X = k) = \frac{n!}{k!\,(n-k)!}\, p^{k} (1-p)^{n-k}$$
where $n! = n \times (n-1) \times \cdots \times 2 \times 1$
and $0! = 1$
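
The binomial formula can be applied directly to Bill's three questions. This is a minimal Python sketch (not the SAS used in these notes); it assumes his shots are independent with a constant success probability, and the variable names are illustrative only.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p), straight from the binomial formula."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 5 three-point shots, p = 0.20: exactly 2 made
three_pt_2_of_5 = binom_pmf(2, 5, 0.20)

# 10 free throws, p = 0.75: exactly 7 made
free_7_of_10 = binom_pmf(7, 10, 0.75)

# Table C route for p > 0.5: 7 makes is the same event as 3 misses, p = 0.25
free_7_via_misses = binom_pmf(3, 10, 0.25)

# 10 free throws: at least 7 made means 7, 8, 9 or 10
free_at_least_7 = sum(binom_pmf(k, 10, 0.75) for k in range(7, 11))

print(f"P(2 of 5 three-pointers)     = {three_pt_2_of_5:.4f}")
print(f"P(7 of 10 free throws)       = {free_7_of_10:.4f}")
print(f"P(at least 7 of 10 throws)   = {free_at_least_7:.4f}")
```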




In SAS, we need to create a dataset with a named variable for
each probability we want.
For example, probbnml(p,n,k) gives the probability of at most
k successes, P(X ≤ k). The result has to be stored in a
variable, so we give it a name, such as:
prob_less_than_or_equal_to_k = probbnml(p,n,k);
What if we want greater than k, P(X > k)?
prob_greater_than = 1 - probbnml(p,n,k);
What if we want exactly k, P(X = k)?
prob_equal = probbnml(p,n,k) - probbnml(p,n,k-1);

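Calculate probabilities for binomial distribution: B(n,p)
data binomial;
    p = 0.25;   /* probability of success on each trial */
    n = 10;     /* number of trials */
    k = 4;      /* number of successes of interest */
    /* probbnml(p,n,k) returns P(X <= k) */
    prob_less_than_or_equal_to_k = probbnml(p,n,k);
    prob_greater_than = 1 - probbnml(p,n,k);
    prob_equal = probbnml(p,n,k) - probbnml(p,n,k-1);
run;
proc print data=binomial;
run;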
Binomial Example

Obs   p      n    k   prob_less_than_or_equal_to_k   prob_greater_than   prob_equal
1     0.25   10   4   0.92187                        0.078127            0.14600
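
As an independent check on the SAS output, the same three quantities can be computed by summing the binomial formula. This is a sketch in Python, not part of the SAS program; `binom_cdf` is a hypothetical helper that plays the role of probbnml(p,n,k).

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p); plays the role of SAS probbnml(p,n,k)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

p, n, k = 0.25, 10, 4

prob_less_than_or_equal_to_k = binom_cdf(k, n, p)
prob_greater_than = 1 - binom_cdf(k, n, p)
prob_equal = binom_cdf(k, n, p) - binom_cdf(k - 1, n, p)

print(f"P(X <= {k}) = {prob_less_than_or_equal_to_k:.5f}")
print(f"P(X >  {k}) = {prob_greater_than:.6f}")
print(f"P(X =  {k}) = {prob_equal:.5f}")
```

The three values agree with the printed SAS results to the digits shown.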

If X has binomial distribution B(n,p) then
$$\mu_X = np$$
$$\sigma_X = \sqrt{np(1-p)}$$

For 10 tosses of a fair coin, let X = number of
heads
◦ What is the distribution of X?
◦ Mean of X =
◦ Standard Deviation of X =
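
For the coin example, X is B(10, 0.5). The sketch below (Python, illustrative names) evaluates the mean and standard deviation formulas and checks them against the definitions computed from the full list of probabilities.

```python
from math import comb, sqrt

n, p = 10, 0.5  # 10 tosses of a fair coin

# Shortcut formulas for a binomial count
mean_x = n * p
sd_x = sqrt(n * p * (1 - p))

# Same quantities from the definition, using every P(X = k)
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
mean_from_pmf = sum(k * prob for k, prob in enumerate(pmf))
var_from_pmf = sum((k - mean_from_pmf) ** 2 * prob for k, prob in enumerate(pmf))

print(f"mean = {mean_x}, sd = {sd_x:.4f}")
print(f"from the pmf: mean = {mean_from_pmf:.4f}, sd = {sqrt(var_from_pmf):.4f}")
```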

Let us take a binomial situation…
◦ We have many bags with 20 switches in each bag
◦ The probability that each individual switch is bad is 0.5

So, the number of bad switches in each bag has a
Binomial distribution with n = 20 and p = 0.5
◦ B(20,0.5)

What if we look at how many switches are bad in
many different bags?...draw a histogram!
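
If we opened a very large number of bags, the histogram of bad-switch counts would settle toward the B(20, 0.5) probabilities. The Python sketch below prints those exact probabilities as a rough text histogram; the scale of 200 characters per unit of probability is an arbitrary choice for display.

```python
from math import comb

n, p = 20, 0.5  # 20 switches per bag, each bad with probability 0.5

# Exact probability of k bad switches in a bag, k = 0..20
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    bar = "#" * round(prob * 200)  # bar length proportional to P(X = k)
    print(f"{k:2d}  {prob:.4f}  {bar}")
```

The bars form a symmetric, single-peaked shape centered at np = 10.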
As n gets larger, something interesting happens to the shape of a
binomial distribution.
Normal Approximation for Binomial Distributions
Suppose that X has the binomial distribution with n trials and success
probability p. When n is large, the distribution of X is approximately Normal
with mean and standard deviation
$$\mu_X = np$$
$$\sigma_X = \sqrt{np(1-p)}$$
As a rule of thumb, we will use the Normal approximation when n is so large
that np ≥ 10 and n(1 – p) ≥ 10.

The sample proportion relates directly to the
count X:
$$\hat{p} = \frac{X}{n}$$


Counts or X:
$$X \text{ is approximately } N\left(np,\ \sqrt{np(1-p)}\right)$$
Proportions or p-hat:
$$\hat{p} \text{ is approximately } N\left(p,\ \sqrt{\frac{p(1-p)}{n}}\right)$$

In 2001, Barry Bonds hit 73 home runs. Was
this feat as surprising as most of us thought? In
the prior two seasons, Bonds hit a home run in
10% of his times at bat. If he went to bat 476
times in 2001, what is the probability that he hits
73 or more home runs just by chance? (Solve in
terms of both X and p-hat.) Is it appropriate to
use the normal approximation for this problem?

(The real probability from the Binomial is 0.0001)
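
A sketch of the Bonds calculation in Python (not SAS), assuming independent at-bats with p = 0.10. It checks the rule of thumb, applies the normal approximation to both the count X and the proportion p-hat, and sums the exact binomial tail for comparison. No continuity correction is used, since the slides do not mention one.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, home_runs = 476, 0.10, 73

# Rule of thumb for the normal approximation: np >= 10 and n(1 - p) >= 10
approx_ok = n * p >= 10 and n * (1 - p) >= 10

# In terms of the count X: approximately N(np, sqrt(np(1 - p)))
count_dist = NormalDist(n * p, sqrt(n * p * (1 - p)))
z = (home_runs - count_dist.mean) / count_dist.stdev
prob_count = 1 - count_dist.cdf(home_runs)

# In terms of p-hat: approximately N(p, sqrt(p(1 - p)/n))
prop_dist = NormalDist(p, sqrt(p * (1 - p) / n))
prob_prop = 1 - prop_dist.cdf(home_runs / n)

# Exact binomial tail P(X >= 73)
prob_exact = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                 for k in range(home_runs, n + 1))

print(f"rule of thumb satisfied: {approx_ok}")
print(f"z = {z:.2f}")
print(f"normal approx via X:     {prob_count:.6f}")
print(f"normal approx via p-hat: {prob_prop:.6f}")
print(f"exact binomial:          {prob_exact:.6f}")
```

The X and p-hat versions give the same answer because they use the same z-score. This far out in the tail, the normal approximation and the exact binomial value can differ noticeably in relative terms even though both are tiny.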

What is the probability that the percentage of
heads in 100 tosses is between 40% and 60%?

Assume that exactly 60% of population does not
like shopping. What is the chance of obtaining
sample proportion larger than 0.65 for sample
size=2500?
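
Both of these questions are about a sample proportion, so they can use the same approximation, p-hat is approximately N(p, sqrt(p(1 − p)/n)). A minimal Python sketch, with `phat_dist` as a hypothetical helper name:

```python
from math import sqrt
from statistics import NormalDist

def phat_dist(n, p):
    """Approximate sampling distribution of p-hat for sample size n."""
    return NormalDist(p, sqrt(p * (1 - p) / n))

# Percentage of heads in 100 tosses of a fair coin between 40% and 60%
coin = phat_dist(100, 0.5)                  # standard deviation 0.05
prob_coin = coin.cdf(0.60) - coin.cdf(0.40)

# Shopping: p = 0.60, n = 2500, chance that p-hat exceeds 0.65
shop = phat_dist(2500, 0.60)
z_shop = (0.65 - shop.mean) / shop.stdev
prob_shop = 1 - shop.cdf(0.65)

print(f"P(0.40 < p-hat < 0.60) = {prob_coin:.4f}")
print(f"shopping: z = {z_shop:.2f}, P(p-hat > 0.65) = {prob_shop:.2e}")
```

With n = 2500 the standard deviation of p-hat is under 0.01, so a sample proportion of 0.65 sits about five standard deviations above p = 0.60.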




Sampling distribution of sample counts and
proportions
Evaluating the Binomial Probabilities
Using the approximate sampling distribution to
assess certain probabilities
The probabilities evaluated using the normal
distribution are not exact, but approximations




Population distribution vs. sampling distribution
The mean and standard deviation of the sample mean
Sampling distribution of a sample mean
Central limit theorem



Because portfolios usually contain many individual
stocks, when we look at the return of portfolios, we
are looking at the return of the sum (or average) of
many individual stocks
What happens to the distribution of the portfolios?
Let’s look again…


Given an SRS of size n, we observe n values X1,
X2,…, Xn, of a quantitative random variable
The sample mean of the SRS is:
$$\bar{x} = \frac{1}{n}\left(X_1 + X_2 + \cdots + X_n\right)$$


Assume the population has mean µ and standard
deviation σ.
Then if the observations are independent, the
sample mean, x-bar, has mean and standard
deviation given as follows:
$$\mu_{\bar{x}} = \mu$$
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$


The height in inches of a randomly chosen young
woman is N(64.5, 2.5)
What is the mean and standard deviation of the
average of 100 randomly chosen young women?
◦ Think in terms of stocks and portfolios
◦ What will the normal distribution above do?

If the variable X in the population is N(µ,σ) then
$$\bar{x} \text{ is } N\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right)$$
Kicker: This is often a good approximation even
if the original distribution is not normal.
This is a HUGE result, called the Central Limit
Theorem (or CLT).
It says if we start with ANY distribution that has a finite
standard deviation, the sample mean of a large enough
sample will be approximately normally distributed.

Take 100 randomly chosen young women and
measure their height. What is the chance that the
average height of these 100 women is between 64
and 65 inches?
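
A sketch of the height calculation in Python (not SAS), using the N(64.5, 2.5) population from the earlier slide. For contrast it also computes the same interval for a single woman, to show how much averaging 100 heights tightens the distribution.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 64.5, 2.5, 100

# One randomly chosen woman: N(64.5, 2.5)
single = NormalDist(mu, sigma)
prob_single = single.cdf(65) - single.cdf(64)

# Average of 100 women: N(64.5, 2.5 / sqrt(100)) = N(64.5, 0.25)
xbar = NormalDist(mu, sigma / sqrt(n))
prob_xbar = xbar.cdf(65) - xbar.cdf(64)

print(f"sd of x-bar = {xbar.stdev}")
print(f"P(64 < X < 65)     = {prob_single:.4f}")
print(f"P(64 < x-bar < 65) = {prob_xbar:.4f}")
```

The interval from 64 to 65 is only ±0.2 standard deviations for one woman but ±2 standard deviations for the average of 100.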


The mean time for maintenance of an air
conditioner is 60 minutes, with a standard
deviation of 60 minutes.
What is the probability that average maintenance
time of 70 air conditioners will exceed 50 minutes?
◦ Note, we didn’t say the time for maintenance is
normally distributed. In fact, it follows an exponential
distribution.
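
A sketch of the air conditioner problem in Python. The first part applies the CLT approximation x-bar ~ N(60, 60/sqrt(70)). The second part is a small simulation that draws exponential maintenance times with mean 60, to see how well the approximation holds for this skewed population. The seed and the number of repetitions are arbitrary choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 60, 60, 70

# CLT approximation for the average of 70 maintenance times
xbar = NormalDist(mu, sigma / sqrt(n))
z = (50 - mu) / xbar.stdev
prob_clt = 1 - xbar.cdf(50)

# Simulation: exponential times with mean 60 (rate 1/60)
random.seed(501)   # arbitrary seed so the run is repeatable
reps = 5000        # number of simulated samples of 70 units
hits = 0
for _ in range(reps):
    sample_mean = sum(random.expovariate(1 / mu) for _ in range(n)) / n
    if sample_mean > 50:
        hits += 1
prob_sim = hits / reps

print(f"z = {z:.2f}, CLT approximation: {prob_clt:.4f}")
print(f"simulated proportion of averages above 50: {prob_sim:.4f}")
```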
If you know n, then the distributions for the sum and
average are equivalent (if you know one, you know the
other).

◦ So since x-bar has a normal distribution, then sums are also
normally distributed!

A count (think binomial) is just a sum!
◦ We are just adding up individual observations, of course that is
a sum and hence approximately normal when n is large!
◦ So of course counts are approximately normal!
◦ Similarly proportions function like averages, and are also
approximately normally distributed!

The CLT is the key.
$$\bar{x} \text{ is approximately } N\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right)$$
This is our (familiar) approximate normality.
Important assumptions:
◦ SRS (Simple Random Sample)
◦ Population distribution of X has mean µ and standard deviation σ;
◦ Last but not least, n needs to be “large enough”. Remember the air
conditioning example. Generally, we say n ≥ 30 is “large enough”.

Warning: not all interesting distributions are normal
◦ But, the sample means are always roughly normal for large sample
sizes.




Approximate normal distribution of the sample
mean from a SRS.
CLT holds for ANY population distribution with a finite
standard deviation.
Also, if in fact the underlying population
distribution is exactly normal (in some cases it is), then the
result is also exact, not an approximation.
Use the CLT to evaluate probabilities regarding
averages.

How do you tell the “X-bar” problems apart from section
1.3 “X” problems?
◦ Section 1.3 “X” problems have a sample size of 1 (n = 1).
◦ Section 5.2 “X-bar” problems have a sample size bigger than 1.
Type                       Mean                      Standard Deviation
Individual, x              $\mu_x = \mu$             $\sigma_x = \sigma$
Sample mean, x-bar         $\mu_{\bar{x}} = \mu$     $\sigma_{\bar{x}} = \sigma/\sqrt{n}$
Sample proportion, p-hat   $\mu_{\hat{p}} = p$       $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$
We flip cards from a stack of cards containing
10 normal decks of cards and count each time
we flip an “Ace” as a success.
◦ What is the population proportion of success?
◦ If we only did 50 cards as a sample, and 4 aces
were flipped, what is the sample proportion of
success?
◦ If we did repeated samples of size n = 50, what is
the mean and standard deviation of the sample
proportion?
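
A sketch of the card problem in Python, assuming standard 52-card decks with 4 aces each and no jokers. The stack holds 520 cards, more than 10 times the sample of 50, so the binomial formulas for p-hat are a reasonable approximation.

```python
from math import sqrt

decks, n = 10, 50

# Population proportion of aces: 40 aces among 520 cards
p = (4 * decks) / (52 * decks)

# Sample proportion when 4 aces appear in 50 cards
p_hat = 4 / n

# Sampling distribution of p-hat for repeated samples of size 50
mean_p_hat = p
sd_p_hat = sqrt(p * (1 - p) / n)

# Rule of thumb for the normal approximation
normal_ok = n * p >= 10 and n * (1 - p) >= 10

print(f"p = {p:.4f}, p-hat = {p_hat:.2f}")
print(f"mean of p-hat = {mean_p_hat:.4f}, sd of p-hat = {sd_p_hat:.4f}")
print(f"np = {n * p:.2f}, normal approximation appropriate: {normal_ok}")
```

Here np is below 10, so the mean and standard deviation formulas still apply but the normal approximation would not be recommended by the rule of thumb.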
Bob is playing in the club golf tournament.
Bob’s scores vary as he plays the course
repeatedly and have a N(77,3) distribution.
◦ What is the probability that Bob will shoot a 74 or
lower in the first round of the club tournament?
◦ What is the probability that Bob will average 74 or
lower for the 4 rounds of the club tournament?
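
Bob's two questions show the difference between an "X" problem (n = 1) and an "X-bar" problem (n = 4). A minimal Python sketch, assuming his four rounds are independent:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 77, 3

# One round: X is N(77, 3)
one_round = NormalDist(mu, sigma)
prob_one_round = one_round.cdf(74)      # z = (74 - 77)/3 = -1

# Average of 4 rounds: x-bar is N(77, 3/sqrt(4)) = N(77, 1.5)
four_rounds = NormalDist(mu, sigma / sqrt(4))
prob_average = four_rounds.cdf(74)      # z = (74 - 77)/1.5 = -2

print(f"P(X <= 74)     = {prob_one_round:.4f}")
print(f"P(x-bar <= 74) = {prob_average:.4f}")
```

Because the population is normal, the result for the average is exact rather than an approximation, even with only 4 rounds.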