Week 7 - Massey University
Week 7
Sample Means & Proportions
Variability of Summary Statistics
Variability in shape of distn of sample
Variability in summary statistics
Mean, median, st devn, upper quartile, …
Summary statistics have distributions
Parameters and statistics
Parameter describes underlying population
Constant
Greek letter (e.g. $\mu$, $\sigma$, $\pi$, …)
Unknown value in practice
Summary statistic
Random
Roman letter (e.g. m, s, p, …)
We hope statistic will tell us about corresponding
parameter
Distn of sample vs
Sampling distn of statistic
Values in a single random sample have a
distribution
Single sample --> single value for statistic
Sample-to-sample variability of statistic is its
sampling distribution.
Means
Unknown population mean, $\mu$
Sample mean, $\bar{X}$, has a distribution: its sampling distribution.
Usually $\bar{x} \ne \mu$
A single sample mean, $\bar{x}$, gives us information about $\mu$
Sampling distribution of mean
If sample size, n, increases:
Spread of distn of sample is (approx) same.
Spread of sampling distn of mean gets
smaller.
$\bar{x}$ is likely to be closer to $\mu$
$\bar{x}$ becomes a better estimate of $\mu$
Sampling distribution of mean
Population with mean $\mu$, st devn $\sigma$
Random sample (n independent values)
Sample mean, $\bar{X}$, has sampling distn with:
Mean, $\mu_{\bar{X}} = \mu$
St devn, $\sigma_{\bar{X}} = \sigma/\sqrt{n}$
(We will deal later with the problem that $\mu$ and $\sigma$ are unknown in practice.)
Weight loss
Estimate mean weight loss for those attending
clinic for 10 weeks
Random sample of n = 25 people
Sample mean, $\bar{x}$
How accurate?
Let’s see, if the population distn of weight loss is:
X ~ normal ($\mu$ = 8 lb, $\sigma$ = 5 lb)
Some samples
Four random samples of n = 25 people:
1. Mean = 8.32 pounds, st devn = 4.74 pounds
2. Mean = 8.32 pounds, st devn = 4.74 pounds
3. Mean = 8.48 pounds, st devn = 5.27 pounds
4. Mean = 7.16 pounds, st devn = 5.93 pounds
N.B. In all samples, $\bar{x} \ne \mu$
Sampling distribution
Means from simulation of 400 samples
Theory:
mean = $\mu$ = 8 lb, s.d.($\bar{x}$) = $\sigma/\sqrt{n} = 5/\sqrt{25} = 1$ lb
(How does this compare to simulation? To popn distn?)
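The comparison between simulation and theory can be tried directly. The following is a minimal sketch, assuming the normal (8 lb, 5 lb) population stated on the slide; the function name `simulate_sample_means` and the seed are illustrative choices, not part of the lecture.

```python
import random
import statistics

def simulate_sample_means(mu=8.0, sigma=5.0, n=25, reps=400, seed=1):
    """Return `reps` sample means, each from n independent normal(mu, sigma) draws."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]

means = simulate_sample_means()
theory_sd = 5.0 / 25 ** 0.5  # sigma / sqrt(n) = 1 lb

print(f"mean of sample means: {statistics.fmean(means):.2f} (theory 8)")
print(f"s.d. of sample means: {statistics.stdev(means):.2f} (theory {theory_sd:.2f})")
```

The spread of the 400 means should be close to 1 lb, much smaller than the population st devn of 5 lb.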
Errors in estimation
Population
X ~ normal ($\mu$ = 8 lb, $\sigma$ = 5 lb)
Sampling distribution of mean
mean = $\mu$ = 8 lb, s.d.($\bar{x}$) = $\sigma/\sqrt{n} = 5/\sqrt{25} = 1$ lb
From 70-95-100 rule
$\bar{x}$ will almost certainly be within 8 ± 3 lb
$\bar{x}$ is unlikely to be more than 3 lb in error
Even if we didn’t know $\mu$
$\bar{x}$ is unlikely to be more than 3 lb in error
Increasing sample size, n
If we sample n = 100 people instead of 25:
s.d.($\bar{x}$) = $\sigma/\sqrt{n} = 5/\sqrt{100} = 0.5$ lb.
Larger samples give more accurate estimates
Central Limit Theorem
If population is normal ($\mu$, $\sigma$)
$\bar{X}$ ~ normal ($\mu$, $\sigma/\sqrt{n}$)
If popn is non-normal with ($\mu$, $\sigma$) but n is large
$\bar{X}$ approx ~ normal ($\mu$, $\sigma/\sqrt{n}$)
Guideline: n > 30 even if very non-normal
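To see the theorem at work on a clearly non-normal population, here is a small sketch. It assumes an exponential population with mean 1 and st devn 1 (chosen only for illustration) and samples of n = 40; the helper `sample_means` is hypothetical.

```python
import random
import statistics

def sample_means(draw, n, reps, seed=0):
    """Draw `reps` samples of size n using draw(rng) and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]

# Assumed population: exponential with mean 1 and st devn 1 (strongly skewed).
mu, sigma, n = 1.0, 1.0, 40
means = sample_means(lambda rng: rng.expovariate(1.0), n=n, reps=2000)

se = sigma / n ** 0.5  # theoretical st devn of the sample mean
within_2 = sum(abs(m - mu) <= 2 * se for m in means) / len(means)

print(f"mean of means  = {statistics.fmean(means):.3f} (theory {mu})")
print(f"s.d. of means  = {statistics.stdev(means):.3f} (theory {se:.3f})")
print(f"within 2 s.d.  = {within_2:.3f} (normal theory: about 0.95)")
```

If the normal approximation is reasonable, roughly 95% of the simulated means fall within two standard deviations of the population mean.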
Other summary statistics
E.g. Lower quartile, proportion, correlation
Usually not normal distns
A formula for the standard devn of the sampling distn exists only sometimes
Sampling distn usually close to normal if n is
large
Lottery problem
Pennsylvania Cash 5 lottery
5 numbers selected from 1-39
Pick birthdays of family members (none 32-39)
P(highest selected is 32 or over)?
Statistic:
H = highest of 5 random numbers (without
replacement)
Lottery simulation
Theory? Fairly hard.
Simulation: Generated 5 numbers (without
replacement) 1560 times
Highest number > 31 in about 72% of repetitions
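The simulation described above can be sketched as follows, together with an exact answer from a standard counting argument (the counting argument is not on the slides: the highest number is below the threshold only when all 5 numbers come from the values below it). The function names and the seed are illustrative.

```python
import math
import random

def prob_highest_at_least(threshold, n_numbers=39, n_drawn=5):
    """Exact P(highest of n_drawn numbers, drawn without replacement, >= threshold)."""
    all_below = math.comb(threshold - 1, n_drawn)  # every pick is below the threshold
    return 1 - all_below / math.comb(n_numbers, n_drawn)

def simulate_highest_at_least(threshold, reps=1560, seed=0):
    """Proportion of simulated draws whose highest number is >= threshold."""
    rng = random.Random(seed)
    hits = sum(
        max(rng.sample(range(1, 40), 5)) >= threshold  # 5 distinct numbers from 1-39
        for _ in range(reps)
    )
    return hits / reps

exact = prob_highest_at_least(32)
simulated = simulate_highest_at_least(32)
print(f"exact = {exact:.4f}, simulated = {simulated:.4f}")
```

The exact value is about 0.705, in line with the roughly 72% seen in the 1560 simulated repetitions.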
Normal distributions
Family of distributions (populations)
Shape depends only on parameters
$\mu$ (mean) & $\sigma$ (st devn)
All have same symmetric ‘bell shape’
$\mu$ = 65 inches, $\sigma$ = 2.7 inches
Importance of normal distn
A reasonable model for many data sets
Transformed data often approx normal
Sample means (and many other statistics)
are approx normal.
Standard normal distribution
Z ~ Normal ($\mu$ = 0, $\sigma$ = 1)
[Figure: standard normal density on an axis from -3 to 3; shaded area is Prob (Z < z*)]
Probabilities for normal (0, 1)
Check from tables:
P(Z ≤ -3.00) = 0.0013
P(Z ≤ -2.59) = 0.0048
P(Z ≤ 1.31) = 0.9049
P(Z ≤ 2.00) = 0.9772
P(Z ≤ -4.75) = 0.000001
Probability Z > 1.31
P(Z > 1.31) = 1 – P(Z ≤ 1.31)
= 1 – .9049 = .0951
Prob ( Z between –2.59 and 1.31)
P(-2.59 ≤ Z ≤ 1.31)
= P(Z ≤ 1.31) – P(Z ≤ -2.59)
= .9049 – .0048 = .9001
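The table lookups above can be checked without tables. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard normal

p_below = Z.cdf(1.31)                    # P(Z <= 1.31)
p_above = 1 - Z.cdf(1.31)                # P(Z > 1.31)
p_between = Z.cdf(1.31) - Z.cdf(-2.59)   # P(-2.59 <= Z <= 1.31)

print(f"P(Z <= 1.31)          = {p_below:.4f}")
print(f"P(Z > 1.31)           = {p_above:.4f}")
print(f"P(-2.59 <= Z <= 1.31) = {p_between:.4f}")
```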
Standard devns from mean
Normal ($\mu$, $\sigma$)
Heights of students
$\mu$ = 65 inches, $\sigma$ = 2.7 inches
Probability and area
X ~ normal ($\mu$ = 65, $\sigma$ = 2.7)
P (X ≤ 67.7) = area under the density curve to the left of 67.7
Probability and area (cont.)
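Normal ($\mu$, $\sigma$)
Exactly:
P(X within $\sigma$ of $\mu$) = 0.683
P(X within $2\sigma$ of $\mu$) = 0.954
P(X within $3\sigma$ of $\mu$) = 0.997
70-95-100 rule:
approx 70% (within $\sigma$)
approx 95% (within $2\sigma$)
approx 100% (within $3\sigma$)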
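The three "exact" values behind the rule of thumb do not depend on the mean or st devn, so they can be computed once from the standard normal. A short sketch, with the helper name `prob_within` chosen for illustration:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal; the result is the same for any normal distn

def prob_within(k):
    """P(a normal value lies within k standard deviations of its mean)."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} st devn: {prob_within(k):.3f}")
```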
Finding approx probabilities
Ht of college woman, X ~ normal ($\mu$ = 65, $\sigma$ = 2.7)
Prob (X ≤ 62 )?
1. Sketch normal density
2. Estimate area
P (X ≤ 62) = area
About 1/8
Translate question from X to Z
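X ~ Normal ($\mu$, $\sigma$)
Find P(X ≤ x*)
Translate to z-score: $Z = (X - \mu)/\sigma$, so $z^* = (x^* - \mu)/\sigma$
Z ~ Normal ($\mu$ = 0, $\sigma$ = 1)
[Figure: normal density for X with x* marked, and standard normal density with z* marked on an axis from -3 to 3]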
Finding probabilities
Prob (height of randomly selected college woman ≤ 62 )?
$P(X \le 62) = P\left(Z \le \frac{62 - 65}{2.7}\right) = P(Z \le -1.11) = .1335$
About 13%.
Prob (X > value)
Ht of college woman, X ~ normal ($\mu$ = 65, $\sigma$ = 2.7)
Prob (X > 68 inches)?
$P(X > 68) = P\left(Z > \frac{68 - 65}{2.7}\right) = P(Z > 1.11) = 1 - P(Z \le 1.11) = 1 - .8665 = .1335$
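The two height calculations can be reproduced in a few lines. This sketch assumes the normal (65, 2.7) model from the slides; `z_score` is an illustrative helper. Software uses the unrounded z-score, so the answers differ slightly from the table values based on z = ±1.11.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

heights = NormalDist(mu=65, sigma=2.7)

z_low = z_score(62, 65, 2.7)      # about -1.11
p_le_62 = heights.cdf(62)         # P(X <= 62)
p_gt_68 = 1 - heights.cdf(68)     # P(X > 68), equal to P(X <= 62) by symmetry

print(f"z = {z_low:.2f}, P(X <= 62) = {p_le_62:.4f}, P(X > 68) = {p_gt_68:.4f}")
```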
Finding upper quartile
Blood Pressures are normal with mean 120 and
standard deviation 10. What is the 75th percentile?
Step 1: Solve for z-score
Closest z* with area of 0.7500 (tables)
z* = 0.67
Step 2: Calculate $x = z^*\sigma + \mu$
x = (0.67)(10) + 120 = 126.7 or about 127.
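The same two steps can be done with an inverse normal function instead of tables. A minimal sketch using `NormalDist.inv_cdf` from the standard library; variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 120, 10

# Step 1: z-score with area 0.75 to its left.
z_star = NormalDist().inv_cdf(0.75)

# Step 2: translate back to the blood pressure scale.
x = z_star * sigma + mu

# The same percentile computed in one step, as a cross-check.
x_direct = NormalDist(mu, sigma).inv_cdf(0.75)

print(f"z* = {z_star:.2f}, 75th percentile = {x:.1f}")
```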
Probabilities about means
Blood pressure ~ normal ($\mu$ = 120, $\sigma$ = 10)
8 people given drug
If drug does not affect blood pressure,
Find P(average blood pressure > 130)
$P(\bar{X} > 130)$ ?
X ~ normal ($\mu$ = 120, $\sigma$ = 10), n = 8
$\bar{X}$ ~ normal ($\mu_{\bar{X}} = 120$, $\sigma_{\bar{X}} = 10/\sqrt{8} = 3.54$)
$z = (130 - 120)/3.54 = 2.83$
prob = 0.0023
Very little chance!
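A short sketch of the blood pressure calculation, assuming the drug has no effect so that the 8 readings are independent normal (120, 10) values; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 120, 10, 8

se = sigma / sqrt(n)                 # st devn of the sample mean, about 3.54
mean_distn = NormalDist(mu, se)      # sampling distn of the mean
z = (130 - mu) / se                  # about 2.83
p = 1 - mean_distn.cdf(130)          # P(sample mean > 130)

print(f"s.d. of mean = {se:.2f}, z = {z:.2f}, prob = {p:.4f}")
```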
Distribution of sum
X ~ distn with ($\mu$, $\sigma$)
aX ~ distn with ($a\mu$, $a\sigma$), e.g. miles to kilometers
$\bar{X}$ ~ distn with ($\mu$, $\sigma/\sqrt{n}$)
$\sum X = n\bar{X}$ ~ distn with ($n\mu$, $\sqrt{n}\,\sigma$)
Central Limit Theorem implies approx normal
Probabilities about sum
Profit in 1 day ~ normal ($\mu = \$300$, $\sigma = \$200$)
Prob(total profit in week < $1,000)?
Total = $\sum X$ ~ normal ($7 \times 300 = 2{,}100$, $\sqrt{7} \times 200 = 529$)
$z = (1000 - 2100)/529 = -2.08$
Prob = 0.0188
Assumes independence
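The weekly profit calculation can be sketched as below, assuming 7 independent daily profits that are each normal with mean $300 and st devn $200; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, days = 300, 200, 7

total_mean = days * mu             # 7 x 300 = 2100
total_sd = sqrt(days) * sigma      # sqrt(7) x 200, about 529 (needs independence)
total = NormalDist(total_mean, total_sd)

z = (1000 - total_mean) / total_sd  # about -2.08
p = total.cdf(1000)                 # P(total profit < 1000)

print(f"total ~ normal({total_mean}, {total_sd:.0f}), z = {z:.2f}, prob = {p:.4f}")
```

Note that the st devn of the total grows with the square root of the number of days, not with the number of days itself.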
Categorical data
Most important parameter is
$\pi$ = Prob (success)
Corresponding summary statistic is
p = Proportion (success)
N.B. Textbook uses p and $\hat{p}$
Number of successes
Easiest to deal with count of successes
before proportion.
If…
1. n “trials” (fixed beforehand).
2. Only “success” or “failure” possible for each trial.
3. Outcomes are independent.
4. Prob (success), $\pi$, remains same for all trials.
• Prob (failure) is 1 – $\pi$.
X = number of successes ~ binomial (n, $\pi$)
Examples
Binomial Probabilities
$P(X = k) = \frac{n!}{k!\,(n-k)!}\,\pi^k (1-\pi)^{n-k}$ for k = 0, 1, 2, …, n
You won’t need to use this!!
Prob (win game) = 0.2
Plays of game are independent.
What is Prob (wins 2 out of 3 games)?
What is P(X = 2)?
$P(X = 2) = \frac{3!}{2!\,(3-2)!}\,(.2)^2 (1-.2)^{3-2} = 3(.2)^2(.8)^1 = 0.096$
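The binomial formula is easy to evaluate by machine. A minimal sketch, with `binomial_pmf` as an illustrative function name and `pi` standing for the success probability:

```python
from math import comb

def binomial_pmf(k, n, pi):
    """P(X = k) when X ~ binomial(n, pi)."""
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

p_two_wins = binomial_pmf(2, 3, 0.2)                         # wins 2 out of 3 games
all_probs = [binomial_pmf(k, 3, 0.2) for k in range(4)]      # k = 0, 1, 2, 3

print(f"P(X = 2) = {p_two_wins:.3f}")
print("all probabilities:", [round(p, 3) for p in all_probs])
```

The four probabilities add up to 1, which is a quick check that the formula has been applied correctly.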
Mean & st devn of Binomial
For a binomial (n, $\pi$)
Mean $\mu = n\pi$
Standard deviation $\sigma = \sqrt{n\pi(1-\pi)}$
Extraterrestrial Life?
50% of large population would say “yes” if asked,
“Do you believe there is extraterrestrial life?”
Sample of n = 100
X = # “yes” ~ binomial (n = 100, $\pi$ = 0.5)
Mean $\mu = E(X) = 100(.5) = 50$
Standard deviation $\sigma = \sqrt{100(.5)(.5)} = 5$
Extraterrestrial Life?
Sample of n = 100
X = # “yes” ~ binomial (n = 100, $\pi$ = 0.5)
$\mu = E(X) = 100(.5) = 50$
$\sigma = \sqrt{100(.5)(.5)} = 5$
70-95-100 rule of thumb for # “yes”
About 95% chance of between 40 & 60
Almost certainly between 35 & 65
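The rule-of-thumb ranges can be compared with exact binomial probabilities. This is a sketch under the slide's assumption that X ~ binomial (100, 0.5); the helper names are illustrative.

```python
from math import comb, sqrt

def binomial_pmf(k, n, pi):
    """P(X = k) when X ~ binomial(n, pi)."""
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

def prob_between(lo, hi, n, pi):
    """Exact P(lo <= X <= hi) for X ~ binomial(n, pi)."""
    return sum(binomial_pmf(k, n, pi) for k in range(lo, hi + 1))

n, pi = 100, 0.5
mean = n * pi                      # 50
sd = sqrt(n * pi * (1 - pi))       # 5

p_2sd = prob_between(40, 60, n, pi)   # within 2 st devns of the mean
p_3sd = prob_between(35, 65, n, pi)   # within 3 st devns of the mean

print(f"mean = {mean}, sd = {sd}")
print(f"P(40 <= X <= 60) = {p_2sd:.4f}")
print(f"P(35 <= X <= 65) = {p_3sd:.4f}")
```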
Normal approx to binomial
If X is binomial (n, $\pi$), and n is large, then X is also approximately normal, with
Mean $\mu = E(X) = n\pi$
Standard deviation $\sigma = \sqrt{n\pi(1-\pi)}$
Conditions: Both $n\pi$ and $n(1-\pi)$ are at least 10.
(Justified by Central Limit Theorem)
Number of H in 30 Flips
X = # heads in n = 30 flips of fair coin
X ~ binomial (n = 30, $\pi$ = 0.5)
Bell-shaped & approx normal.
$\mu = E(X) = 30(.5) = 15$
$\sigma = \sqrt{30(.5)(.5)} = 2.74$
Opinion poll
n = 500 adults; 240 agreed with statement
If $\pi$ = 0.5 of all adults agree, what is P(X ≤ 240)?
X is approx normal with
$\mu = E(X) = 500(.5) = 250$
$\sigma = \sqrt{500(.5)(.5)} = 11.2$
$P(X \le 240) = P\left(Z \le \frac{240 - 250}{11.2}\right) = P(Z \le -.89) = .1867$
Not unlikely to see 48% or less, even if 50% in population agree.
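For the opinion poll, the normal approximation can be set beside the exact binomial probability. This is a sketch assuming X ~ binomial (500, 0.5); the function name is illustrative, and no continuity correction is applied, matching the calculation on the slide.

```python
from math import comb, sqrt
from statistics import NormalDist

n, pi, observed = 500, 0.5, 240

mean = n * pi                     # 250
sd = sqrt(n * pi * (1 - pi))      # sqrt(125), about 11.2

# Normal approximation to P(X <= 240), without continuity correction.
approx = NormalDist(mean, sd).cdf(observed)

def binomial_cdf(k, n, pi):
    """Exact P(X <= k) for X ~ binomial(n, pi)."""
    return sum(comb(n, j) * pi ** j * (1 - pi) ** (n - j) for j in range(k + 1))

exact = binomial_cdf(observed, n, pi)

print(f"sd = {sd:.1f}, normal approx = {approx:.4f}, exact binomial = {exact:.4f}")
```

Both answers lead to the same conclusion: a sample result of 48% or less is not unusual when half the population agrees.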
Sample Proportion
Suppose (unknown to us) 40% of a population carry the gene for a disease, ($\pi$ = 0.40).
Random sample of 25 people; X = # with gene.
X ~ binomial (n = 25, $\pi$ = 0.4)
p = proportion with gene
$p = X/n$
Distn of sample proportion
X ~ binomial (n, $\pi$)
$\mu_X = n\pi$, $\sigma_X = \sqrt{n\pi(1-\pi)}$
$p = X/n$
$\mu_p = \pi$, $\sigma_p = \sqrt{\pi(1-\pi)/n}$
Large n: p is approx normal
($n\pi \ge 10$ & $n(1-\pi) \ge 10$)
Examples
Election Polls: to estimate proportion who favor a
candidate; units = all voters.
Television Ratings: to estimate proportion of households
watching TV program; units = all households with TV.
Consumer Preferences: to estimate proportion of
consumers who prefer new recipe compared with old; units =
all consumers.
Testing ESP: to estimate probability a person can
successfully guess which of 5 symbols on a hidden card;
repeatable situation = a guess.
Public opinion poll
Suppose 40% of all voters favor Candidate A.
Pollsters sample n = 2400 voters.
Propn voting for A is approx normal
$\mu_p = \pi = 0.4$
$\sigma_p = \sqrt{\pi(1-\pi)/n} = \sqrt{(0.4)(0.6)/2400} = 0.01$
[Figure: simulation 400 times & theory.]
Probability from normal approx
If 40% of voters favor Candidate A, and n = 2400 sampled
$\mu_p = 0.4$, $\sigma_p = 0.01$
Sample proportion, p, is almost certain to be between 0.37
and 0.43
Prob 0.95 of p being between 0.38 and 0.42
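The two ranges quoted for the sample proportion follow from its approximate normal distribution. A minimal sketch, assuming 40% support in the population and n = 2400; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

pi, n = 0.4, 2400

se = sqrt(pi * (1 - pi) / n)      # st devn of the sample proportion, 0.01
prop = NormalDist(pi, se)         # approx sampling distn of the proportion

p_2se = prop.cdf(0.42) - prop.cdf(0.38)   # within 2 st devns of 0.4
p_3se = prop.cdf(0.43) - prop.cdf(0.37)   # within 3 st devns of 0.4

print(f"s.d. of proportion = {se:.4f}")
print(f"P(0.38 <= p <= 0.42) = {p_2se:.3f}")
print(f"P(0.37 <= p <= 0.43) = {p_3se:.3f}")
```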