Class Notes Number 4 - Department of Statistics and Probability

CHAPTER 6.1
SUMMARIZING POSSIBLE OUTCOMES AND THEIR
PROBABILITIES
• DEFINITION: A RANDOM VARIABLE IS A
NUMERICAL MEASUREMENT OF THE OUTCOME
OF A RANDOM PHENOMENON (EXPERIMENT).
• DEFINITION: A DISCRETE RANDOM VARIABLE X
TAKES ITS VALUES FROM A COUNTABLE SET,
FOR EXAMPLE, N = {0, 1, 2, 3, 4, 5, 6, 7, . . . }.
• DEFINITION: THE PROBABILITY DISTRIBUTION OF
A DISCRETE RANDOM VARIABLE IS A FUNCTION
SUCH THAT FOR ALL OUTCOMES
$$0 \le p(x) \le 1 \quad \text{and} \quad \sum_{x} p(x) = 1$$
MEAN OF A DISCRETE PROBABILITY
DISTRIBUTION
• THE MEAN OF A PROBABILITY DISTRIBUTION FOR
A DISCRETE RANDOM VARIABLE IS GIVEN BY
$$\mu = E(X) = \sum_{x_i} x_i \, p(x_i)$$
IN WORDS, TO GET THE MEAN OF A DISCRETE
PROBABILITY DISTRIBUTION, MULTIPLY EACH
POSSIBLE VALUE OF THE RANDOM VARIABLE BY
ITS PROBABILITY, AND THEN ADD ALL THESE
PRODUCTS.
EXAMPLE: NUMBER OF HOME RUNS IN A
GAME FOR BOSTON RED SOX
NUMBER OF HOME RUNS    PROBABILITY
0                      0.23
1                      0.38
2                      0.22
3                      0.13
4                      0.03
5                      0.01
6 OR MORE              0.00
SUM                    1.00
(1) WHAT IS THE EXPECTED (MEAN) NUMBER OF HOME RUNS
FOR A BOSTON RED SOX BASEBALL GAME?
(2) INTERPRET WHAT THIS MEAN (EXPECTED VALUE)
MEANS.
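A MINIMAL PYTHON SKETCH OF THE MEAN FORMULA APPLIED TO THE HOME RUN TABLE. IT CODES THE "6 OR MORE" ROW AS THE VALUE 6, AN ASSUMPTION THAT DOES NOT AFFECT THE RESULT BECAUSE THAT ROW HAS PROBABILITY 0.00. THE NAME `home_run_dist` IS ILLUSTRATIVE.

```python
# Probability distribution from the home run table (value -> probability).
# Assumption: the "6 OR MORE" row is coded as 6; its probability is 0.00.
home_run_dist = {0: 0.23, 1: 0.38, 2: 0.22, 3: 0.13, 4: 0.03, 5: 0.01, 6: 0.00}

# Mean of a discrete distribution: multiply each value by its probability, then add.
mean_home_runs = sum(x * p for x, p in home_run_dist.items())

print(round(mean_home_runs, 2))  # 1.38
```

ONE READING OF THE RESULT: OVER MANY GAMES, THE AVERAGE NUMBER OF HOME RUNS PER GAME WOULD BE ABOUT 1.38.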
PROBABILITY FOR CONTINUOUS RANDOM
VARIABLE
• DEFINITION: A CONTINUOUS RANDOM
VARIABLE TAKES ITS VALUES FROM AN
INTERVAL, FOR EXAMPLE, (2, 5).
• DEFINITION: THE PROBABILITY
DISTRIBUTION OF A CONTINUOUS RANDOM
VARIABLE IS SPECIFIED BY A CURVE THAT
DETERMINES THE PROBABILITY THAT THE
RANDOM VARIABLE FALLS IN ANY
PARTICULAR INTERVAL OF VALUES.
REMARKS
• EACH INTERVAL HAS PROBABILITY BETWEEN 0
AND 1. THIS IS THE AREA UNDER THE CURVE,
ABOVE THAT INTERVAL.
• THE INTERVAL CONTAINING ALL POSSIBLE
VALUES HAS PROBABILITY EQUAL TO 1, SO THE
TOTAL AREA UNDER THE CURVE EQUALS 1.
• ILLUSTRATIVE PICTURES
CHAPTER 6.2
FINDING PROBABILITIES FOR BELL-SHAPED
DISTRIBUTIONS: THE NORMAL DISTRIBUTION
• THE NORMAL DISTRIBUTION IS VERY COMMONLY USED
FOR CONTINUOUS RANDOM VARIABLES. IT IS
CHARACTERIZED BY A PARTICULAR SYMMETRIC,
BELL-SHAPED CURVE WITH TWO PARAMETERS: THE MEAN $\mu$
AND THE STANDARD DEVIATION $\sigma$.
• NOTATION: $N(\mu, \sigma)$
• ILLUSTRATIVE PICTURES
THE NORMAL DISTRIBUTION IS ALSO A
MODEL FOR A POPULATION DISTRIBUTION
• THE POPULATION DISTRIBUTION OF A RANDOM
VARIABLE X IS OFTEN MODELED BY A BELL-SHAPED
CURVE WITH THE PROPERTY THAT THE
PROPORTION OF THE POPULATION FOR WHICH X
IS BETWEEN a AND b IS THE AREA UNDER THE
CURVE BETWEEN a AND b.
• ILLUSTRATIVE PICTURE
THE EMPIRICAL OR 68-95-99.7% RULE
• THE EMPIRICAL RULE STATES THAT FOR AN
APPROXIMATELY BELL-SHAPED DISTRIBUTION,
ABOUT 68% OF OBSERVATIONS (VALUES) FALL
WITHIN ONE STANDARD DEVIATION OF THE MEAN,
ABOUT 95% FALL WITHIN TWO
STANDARD DEVIATIONS OF THE MEAN, AND ABOUT
99.7% FALL WITHIN THREE STANDARD
DEVIATIONS OF THE MEAN.
• ILLUSTRATIVE PICTURE
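THE THREE PERCENTAGES IN THE EMPIRICAL RULE CAN BE CHECKED NUMERICALLY. THE SKETCH BELOW USES `statistics.NormalDist` FROM THE PYTHON STANDARD LIBRARY AND ASSUMES AN EXACTLY NORMAL CURVE; THE NAME `within` IS ILLUSTRATIVE.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)

# Area under the curve within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} SD: {area:.4f}")
```

THE AREAS COME OUT CLOSE TO 0.68, 0.95 AND 0.997, WHICH IS WHERE THE RULE GETS ITS NAME.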
FINDING PROBABILITIES FOR CONTINUOUS RANDOM
VARIABLES USING THE STANDARD NORMAL
DISTRIBUTION TABLE
• DEFINITION: THE STANDARD NORMAL
DISTRIBUTION IS THE NORMAL DISTRIBUTION WITH
MEAN = 0 AND STANDARD DEVIATION = 1. IT IS THE
DISTRIBUTION OF NORMAL Z – SCORES.
• DEFINITION: THE Z-SCORE FOR A VALUE x OF A
RANDOM VARIABLE IS THE NUMBER OF STANDARD
DEVIATIONS THAT x FALLS FROM THE MEAN. IT IS
CALCULATED AS
$$z = \frac{x - \mu}{\sigma} \quad \text{or} \quad z = \frac{x - \bar{x}}{s}$$
CLASS EXAMPLE 1
• IN A STANDARD NORMAL MODEL, WHAT
PERCENT OF POPULATION IS IN EACH REGION?
DRAW A PICTURE IN EACH CASE.
(A) Z < 0.83
(B) Z > 0.83
(C) 0.1 < Z < 0.9
SOLUTION
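ONE WAY TO CHECK A TABLE-BASED ANSWER FOR CLASS EXAMPLE 1 IS SKETCHED BELOW. IT USES `statistics.NormalDist` IN PLACE OF THE PRINTED TABLE; THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_a = z.cdf(0.83)               # (A) P(Z < 0.83)
p_b = 1 - z.cdf(0.83)           # (B) P(Z > 0.83)
p_c = z.cdf(0.9) - z.cdf(0.1)   # (C) P(0.1 < Z < 0.9)

print(f"(A) {p_a:.4f}  (B) {p_b:.4f}  (C) {p_c:.4f}")
```

THE RESULTS ARE ROUGHLY 79.7%, 20.3% AND 27.6% OF THE POPULATION.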
CLASS EXAMPLE 2
• IN A STANDARD NORMAL MODEL, FIND THE
VALUE OF Z THAT CUTS OFF
• (A) THE LOWEST 75% OF POPULATION;
• (B) THE HIGHEST 20% OF POPULATION (= THE
LOWEST 80%)
• SOLUTION
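CLASS EXAMPLE 2 RUNS THE TABLE BACKWARDS, FROM AN AREA TO A Z VALUE. A SHORT SKETCH USING `NormalDist.inv_cdf` (THE INVERSE OF THE CUMULATIVE AREA) FOLLOWS; THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal

z_a = z.inv_cdf(0.75)  # (A) z that cuts off the lowest 75%
z_b = z.inv_cdf(0.80)  # (B) highest 20% is the same cut as the lowest 80%

print(f"(A) z = {z_a:.3f}  (B) z = {z_b:.3f}")
```

THE CUTOFFS ARE ABOUT z = 0.67 AND z = 0.84.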
CLASS EXAMPLE 3
• SUPPOSE THAT WE MODEL SAT SCORES Y, BY
N(500, 100) DISTRIBUTION.
• (A) WHAT PERCENTAGE OF SAT SCORES FALL
BETWEEN 450 AND 600?
• (B) FOR WHAT SAT VALUE b ARE 10% OF SAT SCORES
GREATER THAN b?
• SOLUTION
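A SKETCH OF CLASS EXAMPLE 3 UNDER THE STATED N(500, 100) MODEL. WORKING DIRECTLY WITH `NormalDist(500, 100)` IS EQUIVALENT TO CONVERTING TO Z-SCORES FIRST; THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)  # model for SAT scores Y

# (A) P(450 < Y < 600)
pct_between = sat.cdf(600) - sat.cdf(450)

# (B) value b with 10% of scores above it, i.e. 90% below it
b = sat.inv_cdf(0.90)

print(f"(A) {pct_between:.4f}  (B) b = {b:.1f}")
```

ABOUT 53% OF SCORES FALL BETWEEN 450 AND 600, AND b IS ABOUT 628.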
CHAPTER 6.3
PROBABILITY MODELS FOR OBSERVATIONS WITH
TWO POSSIBLE OUTCOMES
BERNOULLI TRIAL
A RANDOM EXPERIMENT WITH TWO
COMPLEMENTARY EVENTS, SUCCESS (S)
AND FAILURE (F) IS CALLED A BERNOULLI
TRIAL.
P(SUCCESS) = p
P(FAILURE) = q = 1 - p
EXAMPLES
• TOSSING A COIN 20 TIMES
SUCCESS = HEADS WITH p = 0.5 AND
FAILURE = TAILS WITH q = 1 – p = 0.5
• TAKING A MULTIPLE CHOICE EXAM
UNPREPARED.
SUCCESS = CORRECT ANSWER
FAILURE = WRONG ANSWER
p = 0.2;
q = 1 – p = 1 – 0.2 = 0.8
• PRODUCTS COMING OUT OF A PRODUCTION LINE
SUCCESS = DEFECTIVE ITEMS
FAILURE = NON-DEFECTIVE ITEMS
• ROLLING A DIE 10 TIMES
SUCCESS = GETTING A 6; p = 1/6
FAILURE = NOT GETTING A 6; q = 5/6
• AN OFFER FROM A BANK FOR A CREDIT CARD WITH
A HIGH INTEREST RATE
SUCCESS = DECLINE; FAILURE = ACCEPT
• HAVING HEALTH INSURANCE
SUCCESS = HAVE; FAILURE = NOT HAVE
• A REFERENDUM WHETHER TO RECALL AN
UNFAITHFUL GOVERNOR FROM OFFICE
SUCCESS = VOTE YES; FAILURE = VOTE NO
GEOMETRIC PROBABILITY MODEL
• QUESTION: HOW LONG WILL IT TAKE TO
ACHIEVE THE FIRST SUCCESS IN A SERIES
OF BERNOULLI TRIALS?
• THE MODEL THAT GIVES THE PROBABILITY
THAT THE FIRST SUCCESS OCCURS ON A
GIVEN TRIAL IS CALLED THE
GEOMETRIC PROBABILITY MODEL.
CONDITIONS
• THE FOLLOWING CONDITIONS MUST HOLD
BEFORE USING THE GEOMETRIC
PROBABILITY MODEL.
(1) THE TRIALS MUST BE BERNOULLI, THAT
IS, THE RANDOM EXPERIMENT MUST HAVE
TWO COMPLEMENTARY OUTCOMES –
SUCCESS AND FAILURE;
(2) THE TRIALS MUST BE INDEPENDENT OF
ONE ANOTHER;
(3) THE PROBABILITY OF SUCCESS IS THE
SAME FOR EACH TRIAL.
GEOMETRIC PROBABILITY MODEL FOR
BERNOULLI TRIALS
• LET p = PROBABILITY OF SUCCESS
AND q = 1 - p = PROBABILITY OF
FAILURE
X = NUMBER OF TRIALS UNTIL FIRST
SUCCESS OCCURS
$$P(X = x) = q^{x-1} p$$
$$E(X) = \mu = \frac{1}{p}$$
$$SD(X) = \sigma = \sqrt{\frac{q}{p^2}}$$
EXAMPLE
• ASSUME THAT 13% OF PEOPLE ARE LEFT-HANDED. IF WE
SELECT 5 PEOPLE AT RANDOM, FIND THE PROBABILITY OF
EACH OUTCOME DESCRIBED BELOW.
• (1) THE FIRST LEFTY IS THE FIFTH PERSON CHOSEN.
ANS = 0.0745
• (2) THE FIRST LEFTY IS THE SECOND OR THIRD PERSON.
ANS = 0.211
• (3) IF WE KEEP PICKING PEOPLE UNTIL WE FIND A LEFTY, HOW
LONG DO YOU EXPECT IT WILL TAKE?
ANS = 7.69 PEOPLE
EXAMPLE
• AN OLYMPIC ARCHER IS ABLE TO HIT THE BULL’S-EYE
80% OF THE TIME. ASSUME EACH SHOT IS
INDEPENDENT OF THE OTHERS. IF SHE SHOOTS 6
ARROWS, WHAT’S THE PROBABILITY THAT
• (1) HER FIRST BULL’S-EYE COMES ON THE THIRD
ARROW? ANS = 0.032
• (2) HER FIRST BULL’S-EYE COMES ON THE
FOURTH OR FIFTH ARROW? ANS = 0.00768
• (3) IF SHE KEEPS SHOOTING ARROWS UNTIL SHE
HITS THE BULL’S-EYE, HOW LONG DO YOU
EXPECT IT WILL TAKE? ANS = 1.25 SHOTS
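THE ANSWERS TO THE LEFT-HANDED AND ARCHER EXAMPLES CAN BE REPRODUCED FROM THE GEOMETRIC MODEL. THE SKETCH BELOW IS A DIRECT TRANSCRIPTION OF P(X = x) = q^(x-1) p AND E(X) = 1/p; THE FUNCTION NAME `geometric_pmf` IS AN ILLUSTRATIVE CHOICE.

```python
def geometric_pmf(x: int, p: float) -> float:
    """P(X = x): the first success occurs on trial x, with success probability p."""
    q = 1 - p
    return q ** (x - 1) * p

# Left-handed example, p = 0.13
lefty_fifth = geometric_pmf(5, 0.13)
lefty_2_or_3 = geometric_pmf(2, 0.13) + geometric_pmf(3, 0.13)
lefty_expected = 1 / 0.13

# Archer example, p = 0.80
archer_third = geometric_pmf(3, 0.80)
archer_4_or_5 = geometric_pmf(4, 0.80) + geometric_pmf(5, 0.80)
archer_expected = 1 / 0.80

print(round(lefty_fifth, 4), round(lefty_2_or_3, 3), round(lefty_expected, 2))
print(round(archer_third, 3), round(archer_4_or_5, 5), round(archer_expected, 2))
```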
BINOMIAL PROBABILITY MODEL FOR
BERNOULLI TRIALS
• QUESTION: WHAT IS THE NUMBER OF
SUCCESSES IN A SPECIFIED NUMBER OF
TRIALS?
• THE BINOMIAL PROBABILITY MODEL
ANSWERS THIS QUESTION: IT GIVES THE
PROBABILITY OF EXACTLY k SUCCESSES IN
n TRIALS.
• CONDITIONS: SAME AS THOSE FOR THE
GEOMETRIC PROBABILITY MODEL
BINOMIAL PROBABILITY MODEL
• LET n = NUMBER OF TRIALS
p = PROBABILITY OF SUCCESS
q = 1 - p = PROBABILITY OF FAILURE
X = NUMBER OF SUCCESSES IN n TRIALS
$$P(X = k) = \binom{n}{k} p^k q^{n-k}$$
where
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}, \qquad n! = n(n-1)(n-2)(n-3) \cdots 3 \cdot 2 \cdot 1$$
$$E(X) = \mu = np$$
$$SD(X) = \sigma = \sqrt{npq}$$
EXAMPLES
• COMPUTE
(1) 3!
(2) 4!
(3) 5!
(4) 6!
• COMPUTE
(1) $\binom{5}{2}$
(2) $\binom{10}{7}$
(3) $\binom{12}{0}$
EXAMPLE
• ASSUME THAT 13% OF PEOPLE ARE LEFT-HANDED.
IF WE SELECT 5 PEOPLE AT RANDOM, FIND THE
PROBABILITY OF EACH OUTCOME BELOW.
• (1) THERE ARE EXACTLY 3 LEFTIES IN THE GROUP.
ANS = 0.0166
• (2) THERE ARE AT LEAST 3 LEFTIES IN THE GROUP.
ANS = 0.0179
• (3) THERE ARE NO MORE THAN 3 LEFTIES IN THE
GROUP. ANS = 0.9987
EXAMPLE
• AN OLYMPIC ARCHER IS ABLE TO HIT THE BULL’S-EYE
80% OF THE TIME. ASSUME EACH SHOT IS
INDEPENDENT OF THE OTHERS. IF SHE SHOOTS 6
ARROWS, WHAT’S THE PROBABILITY THAT
• (1) SHE GETS EXACTLY 4 BULL’S-EYES? ANS = 0.246
• (2) SHE GETS AT LEAST 4 BULL’S-EYES? ANS = 0.901
• (3) SHE GETS AT MOST 4 BULL’S-EYES? ANS = 0.345
• (4) SHE MISSES THE BULL’S-EYE AT LEAST ONCE?
ANS = 0.738
• (5) HOW MANY BULL’S-EYES DO YOU EXPECT HER
TO GET? ANS = 4.8 BULL’S-EYES
• (6) WITH WHAT STANDARD DEVIATION? ANS = 0.98
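THE BINOMIAL ANSWERS FOR THE LEFT-HANDED AND ARCHER EXAMPLES CAN BE REPRODUCED WITH A FEW LINES OF PYTHON. THIS IS A SKETCH THAT TRANSCRIBES THE BINOMIAL FORMULA DIRECTLY; `binomial_pmf` AND THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k): exactly k successes in n independent Bernoulli trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Left-handed example: n = 5, p = 0.13
lefty_exactly_3 = binomial_pmf(3, 5, 0.13)
lefty_at_least_3 = sum(binomial_pmf(k, 5, 0.13) for k in range(3, 6))
lefty_at_most_3 = sum(binomial_pmf(k, 5, 0.13) for k in range(0, 4))

# Archer example: n = 6, p = 0.80
n, p = 6, 0.80
archer_exactly_4 = binomial_pmf(4, n, p)
archer_at_least_4 = sum(binomial_pmf(k, n, p) for k in range(4, 7))
archer_at_most_4 = sum(binomial_pmf(k, n, p) for k in range(0, 5))
archer_miss_at_least_once = 1 - binomial_pmf(6, n, p)  # complement of "all 6 hit"
archer_mean = n * p                                    # E(X) = np
archer_sd = math.sqrt(n * p * (1 - p))                 # SD(X) = sqrt(npq)

print(round(lefty_exactly_3, 4), round(lefty_at_least_3, 4), round(lefty_at_most_3, 4))
print(round(archer_exactly_4, 3), round(archer_at_least_4, 3), round(archer_at_most_4, 3))
```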
THE NORMAL MODEL TO THE RESCUE OF
BINOMIAL MODEL
• IF n, THE FIXED NUMBER OF TRIALS, IS LARGE
ENOUGH THAT
$$np \ge 10 \quad \text{and} \quad nq \ge 10,$$
THEN THE BINOMIAL CUMULATIVE PROBABILITIES
CAN BE APPROXIMATED BY NORMAL
PROBABILITIES WITH THE SAME MEAN (EXPECTED
VALUE) $\mu = np$ AND THE SAME STANDARD
DEVIATION $\sigma = \sqrt{npq}$.
EXAMPLE
• TENNESSEE RED CROSS COLLECTED BLOOD FROM 32,000
DONORS. WHAT IS THE PROBABILITY THAT THEY HAD AT
LEAST 1850 DONORS OF THE O-NEGATIVE BLOOD GROUP?
THE PROBABILITY OF SOMEONE HAVING AN O-NEGATIVE
BLOOD TYPE IS 0.06.
• SOLUTION: LET X BE THE NUMBER OF DONORS OF THE
O-NEGATIVE BLOOD GROUP AMONG THE 32,000 DONORS.
THEN THE QUESTION CAN BE FORMULATED
MATHEMATICALLY AS
$$P(X \ge 1850) = ?$$
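A SKETCH OF HOW THE NORMAL MODEL ANSWERS THE BLOOD DONOR QUESTION. IT ASSUMES THE DONORS ARE INDEPENDENT BERNOULLI TRIALS WITH p = 0.06 AND USES NO CONTINUITY CORRECTION; THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
import math
from statistics import NormalDist

n, p = 32_000, 0.06
q = 1 - p

# Success/failure check before using the normal model.
assert n * p >= 10 and n * q >= 10

mean = n * p               # 1920 expected O-negative donors
sd = math.sqrt(n * p * q)  # about 42.5

# P(X >= 1850) under the approximating normal model.
prob_at_least_1850 = 1 - NormalDist(mean, sd).cdf(1850)

print(round(prob_at_least_1850, 2))  # 0.95
```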
CHAPTER 6.4
HOW LIKELY ARE THE POSSIBLE VALUES OF A
STATISTIC?
• REMINDER: A STATISTIC IS A NUMERICAL
SUMMARY OF SAMPLE DATA. SOME
EXAMPLES ARE THE SAMPLE PROPORTION AND THE
SAMPLE MEAN.
• DEFINITION: THE SAMPLING DISTRIBUTION
OF A STATISTIC IS THE PROBABILITY
DISTRIBUTION THAT SPECIFIES
PROBABILITIES FOR THE POSSIBLE
VALUES THE STATISTIC CAN TAKE.
SAMPLING DISTRIBUTION MODELS FOR
PROPORTIONS AND MEANS
• SAMPLING DISTRIBUTION MODEL FOR A
PROPORTION
PROBLEM FORMULATION: SUPPOSE THAT p IS AN
UNKNOWN PROPORTION OF ELEMENTS OF A
CERTAIN TYPE S IN A POPULATION.
EXAMPLES
• PROPORTION OF LEFT - HANDED PEOPLE;
• PROPORTION OF HIGH SCHOOL STUDENTS WHO
ARE FAILING A READING TEST;
• PROPORTION OF VOTERS WHO WILL VOTE FOR
MR. X.
ESTIMATION OF p
• TO ESTIMATE p, WE SELECT A SIMPLE RANDOM
SAMPLE (SRS), OF SIZE SAY, n = 1000, AND
COMPUTE THE SAMPLE PROPORTION.
• SUPPOSE THE NUMBER OF THE TYPE WE ARE
INTERESTED IN, IN THIS SAMPLE OF n = 1000, IS
x = 437. THEN THE SAMPLE PROPORTION $\hat{p}$
IS COMPUTED USING THE FORMULA
$$\hat{p} = \frac{x}{n}$$
IN THE EXAMPLE ABOVE
$$\hat{p} = \frac{437}{1000} = 43.7\%$$
WHAT IS THE ERROR OF ESTIMATION?
• THAT IS, WHAT IS $\hat{p} - p$?
• WHAT MODEL CAN HELP US FIND THE
BEST ESTIMATE OF THE TRUE
PROPORTION p?
• LET’S START THE ANALYSIS BY FIRST
ANSWERING THE SECOND QUESTION.
APPROACH
• SUPPOSE THAT WE TAKE A SECOND
SAMPLE OF SIZE 1000 AND COMPUTE
P(HAT). THE NEW ESTIMATE WILL
LIKELY DIFFER FROM 0.437. NOW
TAKE A THIRD SAMPLE, A FOURTH
SAMPLE, AND SO ON UP TO THE TWO THOUSANDTH
(2000TH) SAMPLE, EACH OF SIZE 1000. WE
WILL LIKELY OBTAIN MANY
DIFFERENT P(HATS), AS
ILLUSTRATED IN THE TABLE BELOW.
TABLE OF 2000 SAMPLES, EACH OF SIZE n = 1000, AND
THEIR CORRESPONDING P(HATS)
SAMPLE OF SIZE n    P(HAT)
n_1                 $\hat{p}_{1}$
n_2                 $\hat{p}_{2}$
…                   …
n_2000              $\hat{p}_{2000}$
WHAT DO WE DO WITH THE DATA FOR
P(HATS)?
• WE CONSTRUCT A HISTOGRAM OF
THESE 2000 P(HATS).
[HISTOGRAM: NUMBER OF SAMPLES (VERTICAL AXIS) AGAINST
P(HATS) (HORIZONTAL AXIS), WITH p MARKED AT THE CENTER]
WHAT WE OBSERVE FROM THE HISTOGRAM
• THE HISTOGRAM ABOVE IS AN EXAMPLE
OF WHAT WE WOULD GET IF WE COULD
SEE ALL THE PROPORTIONS FROM ALL
POSSIBLE SAMPLES. THAT DISTRIBUTION
HAS A SPECIAL NAME. IT IS CALLED THE
SAMPLING DISTRIBUTION OF THE
PROPORTIONS.
• OBSERVE THAT THE HISTOGRAM IS
UNIMODAL, ROUGHLY SYMMETRIC, AND
CENTERED AT p, WHICH IS THE TRUE
PROPORTION.
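THE REPEATED-SAMPLING IDEA CAN BE IMITATED BY SIMULATION. THE SKETCH BELOW DRAWS 2000 SAMPLES OF SIZE 1000 FROM A POPULATION WHOSE TRUE PROPORTION IS SET TO A HYPOTHETICAL VALUE, `p_true = 0.44` (IN PRACTICE p IS UNKNOWN). THE SEED AND ALL VARIABLE NAMES ARE ILLUSTRATIVE CHOICES.

```python
import math
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

p_true = 0.44       # hypothetical true proportion (unknown in practice)
n = 1000            # size of each sample
num_samples = 2000  # number of repeated samples

# Draw the samples and record each sample proportion (p-hat).
p_hats = [
    sum(random.random() < p_true for _ in range(n)) / n
    for _ in range(num_samples)
]

center = mean(p_hats)   # should land close to p_true
spread = stdev(p_hats)  # should land close to sqrt(p*q/n)
theory_sd = math.sqrt(p_true * (1 - p_true) / n)

print(f"mean of p-hats = {center:.4f}, SD of p-hats = {spread:.4f}, "
      f"sqrt(pq/n) = {theory_sd:.4f}")
```

A HISTOGRAM OF `p_hats` WOULD SHOW THE UNIMODAL, ROUGHLY SYMMETRIC SHAPE CENTERED NEAR `p_true`.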
WHAT MODEL DOES THE SHAPE OF THE HISTOGRAM
SUGGEST FOR SAMPLE PROPORTIONS?
• ANSWER: IT IS AMAZING AND FORTUNATE
THAT A NORMAL MODEL IS JUST THE
RIGHT ONE FOR THE HISTOGRAMS OF
SAMPLE PROPORTIONS.
• HOW GOOD IS THE NORMAL MODEL?
– IT IS GOOD IF THE FOLLOWING
ASSUMPTIONS AND CONDITIONS HOLD.
ASSUMPTIONS AND CONDITIONS
• ASSUMPTIONS
• INDEPENDENCE ASSUMPTION: THE
SAMPLED VALUES MUST BE INDEPENDENT
OF EACH OTHER.
• SAMPLE SIZE ASSUMPTION: THE SAMPLE
SIZE, n, MUST BE LARGE ENOUGH
• REMARK: ASSUMPTIONS ARE HARD, OFTEN
IMPOSSIBLE, TO CHECK. THAT’S WHY WE ASSUME
THEM. FORTUNATELY, SOME CONDITIONS MAY PROVIDE
INFORMATION ABOUT THE ASSUMPTIONS.
CONDITIONS
• RANDOMIZATION CONDITION: THE DATA VALUES MUST BE
SAMPLED RANDOMLY. IF POSSIBLE, USE SIMPLE RANDOM
SAMPLING DESIGN TO SAMPLE THE POPULATION OF
INTEREST.
• 10% CONDITION: THE SAMPLE SIZE, n, MUST BE NO LARGER
THAN 10% OF THE POPULATION OF INTEREST.
• SUCCESS/FAILURE CONDITION: THE SAMPLE SIZE HAS TO
BE BIG ENOUGH SO THAT WE EXPECT AT LEAST 10
SUCCESSES AND AT LEAST 10 FAILURES. THAT IS,
$$np \ge 10 \ (\text{SUCCESS}) \quad \text{and} \quad nq \ge 10 \ (\text{FAILURE})$$
THE CENTRAL LIMIT THEOREM FOR THE
SAMPLING DISTRIBUTION OF A PROPORTION
• FOR A LARGE SAMPLE SIZE n, THE SAMPLING
DISTRIBUTION OF P(HAT) IS APPROXIMATELY
$$N\left(p, \sqrt{\frac{pq}{n}}\right)$$
THAT IS, P(HAT) IS APPROXIMATELY NORMAL WITH
$$\text{MEAN} = E(\hat{p}) = p$$
$$\text{STANDARD DEVIATION} = \sigma(\hat{p}) = \sqrt{\frac{pq}{n}}$$
EXAMPLE 1
• ASSUME THAT 30% OF STUDENTS AT A
UNIVERSITY WEAR CONTACT LENSES
• (A) WE RANDOMLY PICK 100 STUDENTS. LET
P(HAT) REPRESENT THE PROPORTION OF
STUDENTS IN THIS SAMPLE WHO WEAR
CONTACTS. WHAT’S THE APPROPRIATE MODEL
FOR THE DISTRIBUTION OF P(HAT)? SPECIFY THE
NAME OF THE DISTRIBUTION, THE MEAN, AND THE
STANDARD DEVIATION. BE SURE TO VERIFY THAT
THE CONDITIONS ARE MET.
• (B) WHAT’S THE APPROXIMATE PROBABILITY
THAT MORE THAN ONE THIRD OF THIS SAMPLE
WEAR CONTACTS?
SOLUTION TO EXAMPLE 1
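A SKETCH OF THE COMPUTATIONS BEHIND EXAMPLE 1, ASSUMING THE 100 STUDENTS ARE A RANDOM SAMPLE MAKING UP LESS THAN 10% OF THE UNIVERSITY. ONLY THE SUCCESS/FAILURE CONDITION CAN BE CHECKED IN CODE; THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
import math
from statistics import NormalDist

p, n = 0.30, 100
q = 1 - p

# Success/failure condition: expect at least 10 successes and 10 failures.
conditions_ok = n * p >= 10 and n * q >= 10

# (A) Normal model for p-hat: mean p, standard deviation sqrt(pq/n).
sd_p_hat = math.sqrt(p * q / n)
model = NormalDist(mu=p, sigma=sd_p_hat)

# (B) P(p-hat > 1/3)
prob_more_than_third = 1 - model.cdf(1 / 3)

print(conditions_ok, round(sd_p_hat, 4), round(prob_more_than_third, 2))
```

THE MODEL IS ROUGHLY N(0.30, 0.046), AND THE PROBABILITY IN PART (B) COMES OUT NEAR 0.23.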
EXAMPLE 2
• INFORMATION ON A PACKET OF SEEDS CLAIMS
THAT THE GERMINATION RATE IS 92%. WHAT’S
THE PROBABILITY THAT MORE THAN 95% OF THE
160 SEEDS IN THE PACKET WILL GERMINATE? BE
SURE TO DISCUSS YOUR ASSUMPTIONS AND
CHECK THE CONDITIONS THAT SUPPORT YOUR
MODEL.
• SOLUTION
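A SKETCH OF THE SEED PACKET CALCULATION. IT ASSUMES THE SEEDS GERMINATE INDEPENDENTLY AND THAT THE PACKET BEHAVES LIKE A RANDOM SAMPLE OF ALL SEEDS; THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
import math
from statistics import NormalDist

p, n = 0.92, 160
q = 1 - p

expected_successes = n * p  # about 147.2
expected_failures = n * q   # about 12.8, so the condition holds, but not by much

# Normal model for the sample proportion of seeds that germinate.
sd_p_hat = math.sqrt(p * q / n)
prob_over_95 = 1 - NormalDist(mu=p, sigma=sd_p_hat).cdf(0.95)

print(round(sd_p_hat, 4), round(prob_over_95, 2))
```

THE PROBABILITY COMES OUT NEAR 0.08.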
CHAPTER 6.5 – 6.6
SAMPLING DISTRIBUTION OF THE SAMPLE
MEAN $\bar{X}$
RECALL THAT
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
APPROACH FOR ESTIMATING $\bar{X}$:
SAME AS FOR SAMPLING DISTRIBUTION FOR
PROPORTIONS ILLUSTRATED ABOVE
ASSUMPTIONS AND CONDITIONS
• ASSUMPTIONS
• INDEPENDENCE ASSUMPTION: THE SAMPLED
VALUES MUST BE INDEPENDENT OF EACH OTHER
• SAMPLE SIZE ASSUMPTION: THE SAMPLE SIZE
MUST BE SUFFICIENTLY LARGE.
• REMARK: WE CANNOT CHECK THESE DIRECTLY,
BUT WE CAN THINK ABOUT WHETHER THE
INDEPENDENCE ASSUMPTION IS PLAUSIBLE.
CONDITIONS
• RANDOMIZATION CONDITION: THE DATA VALUES MUST BE
SAMPLED RANDOMLY, OR THE CONCEPT OF A SAMPLING
DISTRIBUTION MAKES NO SENSE. IF POSSIBLE, USE A SIMPLE
RANDOM SAMPLING DESIGN TO OBTAIN THE SAMPLE.
• 10% CONDITION: WHEN THE SAMPLE IS DRAWN WITHOUT
REPLACEMENT (AS IS USUALLY THE CASE), THE SAMPLE
SIZE, n, SHOULD BE NO MORE THAN 10% OF THE
POPULATION.
• LARGE ENOUGH SAMPLE CONDITION: IF THE POPULATION
IS UNIMODAL AND SYMMETRIC, EVEN A FAIRLY SMALL
SAMPLE IS OKAY. IF THE POPULATION IS STRONGLY
SKEWED, IT CAN TAKE A PRETTY LARGE SAMPLE TO
ALLOW USE OF A NORMAL MODEL TO DESCRIBE THE
DISTRIBUTION OF SAMPLE MEANS
CENTRAL LIMIT THEOREM FOR THE
SAMPLING DISTRIBUTION FOR MEANS
• FOR A LARGE ENOUGH SAMPLE SIZE, n, THE SAMPLING
DISTRIBUTION OF THE SAMPLE MEAN $\bar{X}$ IS
APPROXIMATELY
$$N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
• THAT IS, NORMAL WITH
$$\text{MEAN} = E(\bar{x}) = \mu = \text{POPULATION MEAN}$$
$$\text{STANDARD DEVIATION} = \sigma(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$
WHERE $\sigma$ = POPULATION STANDARD DEVIATION
EXAMPLE 3
• SUPPOSE THE MEAN ADULT WEIGHT, $\mu$, IS 175
POUNDS WITH STANDARD DEVIATION, $\sigma$, OF 25
POUNDS. AN ELEVATOR HAS A WEIGHT LIMIT OF
10 PERSONS OR 2000 POUNDS. WHAT IS THE
PROBABILITY THAT 10 PEOPLE WHO GET ON THE
ELEVATOR OVERLOAD ITS WEIGHT LIMIT?
• SOLUTION
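A SKETCH OF THE ELEVATOR CALCULATION. TEN PEOPLE OVERLOAD THE ELEVATOR WHEN THEIR MEAN WEIGHT EXCEEDS 2000/10 = 200 POUNDS. THE SKETCH ASSUMES THE TEN RIDERS ARE AN INDEPENDENT RANDOM SAMPLE AND THAT A NORMAL MODEL FOR THE SAMPLE MEAN IS REASONABLE AT n = 10; THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
import math
from statistics import NormalDist

mu, sigma, n = 175, 25, 10

# Overload means total weight > 2000, i.e. mean weight > 200 pounds.
mean_limit = 2000 / n

# Sampling distribution of the sample mean: N(mu, sigma / sqrt(n)).
sd_mean = sigma / math.sqrt(n)
prob_overload = 1 - NormalDist(mu, sd_mean).cdf(mean_limit)

print(round(sd_mean, 2), round(prob_overload, 4))
```

THE PROBABILITY IS SMALL, A LITTLE UNDER 0.001.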
EXAMPLE 4
• STATISTICS FROM CORNELL’S NORTHEAST REGIONAL
CLIMATE CENTER INDICATE THAT ITHACA, NY, GETS AN
AVERAGE OF 35.4 INCHES OF RAIN EACH YEAR, WITH A
STANDARD DEVIATION OF 4.2 INCHES. ASSUME THAT A
NORMAL MODEL APPLIES.
• (A) DURING WHAT PERCENTAGE OF YEARS DOES ITHACA
GET MORE THAN 40 INCHES OF RAIN?
• (B) LESS THAN HOW MUCH RAIN FALLS IN THE DRIEST 20%
OF ALL YEARS?
• (C) A CORNELL UNIVERSITY STUDENT IS IN ITHACA FOR 4
YEARS. LET y (bar) REPRESENT THE MEAN AMOUNT OF RAIN
FOR THOSE 4 YEARS. DESCRIBE THE SAMPLING
DISTRIBUTION MODEL OF THIS SAMPLE MEAN, y (bar).
• (D) WHAT’S THE PROBABILITY THAT THOSE 4 YEARS
AVERAGE LESS THAN 30 INCHES OF RAIN?
SOLUTION TO EXAMPLE 4
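A SKETCH OF THE FOUR PARTS OF EXAMPLE 4 UNDER THE STATED NORMAL MODEL. PART (C) TREATS THE 4 YEARS AS INDEPENDENT, WHICH IS AN ASSUMPTION; THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
import math
from statistics import NormalDist

yearly = NormalDist(mu=35.4, sigma=4.2)  # yearly rainfall in inches

# (A) P(rain > 40 inches in a year)
pct_over_40 = 1 - yearly.cdf(40)

# (B) rainfall amount below which the driest 20% of years fall
driest_20_cutoff = yearly.inv_cdf(0.20)

# (C) sampling model for the 4-year mean: N(35.4, 4.2 / sqrt(4)) = N(35.4, 2.1)
four_year_mean = NormalDist(mu=35.4, sigma=4.2 / math.sqrt(4))

# (D) P(4-year mean < 30 inches)
prob_mean_below_30 = four_year_mean.cdf(30)

print(round(pct_over_40, 3), round(driest_20_cutoff, 1), round(prob_mean_below_30, 3))
```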
CHAPTER 7.1 – 7.2
CONFIDENCE INTERVALS FOR PROPORTIONS
ESTIMATION
POINT ESTIMATION PRODUCES A NUMBER
(AN ESTIMATE) WHICH IS BELIEVED TO BE
CLOSE TO THE VALUE OF UNKNOWN
PARAMETER.
FOR EXAMPLE: A CONCLUSION MAY BE THAT
“THE PROPORTION p OF LEFT-HANDED
STUDENTS AT MSU IS APPROXIMATELY
0.46.”
SOME POINT ESTIMATORS
PARAMETER                       ESTIMATOR
PROPORTION $p$                  $\hat{p}$
MEAN $\mu$                      $\bar{x}$
STANDARD DEVIATION $\sigma$     $s$
INTERVAL ESTIMATION
• PRODUCES AN INTERVAL THAT CONTAINS
THE ESTIMATED PARAMETER WITH A
PRESCRIBED CONFIDENCE.
• A CONFIDENCE INTERVAL OFTEN HAS THE
FORM:
POINT ESTIMATE ± MARGIN OF ERROR (ME)
DEFINITION
• GIVEN A CONFIDENCE LEVEL C%, THE
CRITICAL VALUE $C^*$ IS THE NUMBER SUCH
THAT THE AREA UNDER THE PROPER
CURVE BETWEEN $-C^*$ AND $C^*$ IS C
(IN DECIMALS).
SOME CRITICAL VALUES FOR STANDARD
NORMAL DISTRIBUTION
C% CONFIDENCE LEVEL    CRITICAL VALUE Z*
80%                    1.282
90%                    1.645
95%                    1.960
98%                    2.326
99%                    2.576
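THE CRITICAL VALUES IN THE TABLE FOLLOW FROM THE DEFINITION: IF THE MIDDLE AREA IS C, THE AREA TO THE LEFT OF Z* IS (1 + C)/2. A SHORT SKETCH USING `NormalDist.inv_cdf`; THE FUNCTION NAME `critical_value` IS ILLUSTRATIVE.

```python
from statistics import NormalDist

def critical_value(confidence: float) -> float:
    """z* such that the area between -z* and z* equals `confidence`."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

z_stars = {c: round(critical_value(c), 3) for c in (0.80, 0.90, 0.95, 0.98, 0.99)}
print(z_stars)
```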
WHAT DOES C% CONFIDENCE REALLY
MEAN?
• FORMALLY, WHAT WE MEAN IS THAT C% OF
SAMPLES OF THIS SIZE WILL PRODUCE
CONFIDENCE INTERVALS THAT CAPTURE THE
TRUE PROPORTION.
• C% CONFIDENCE MEANS THAT ON AVERAGE, IN C
OUT OF 100 ESTIMATIONS, THE INTERVAL WILL
CONTAIN THE TRUE VALUE OF THE ESTIMATED PARAMETER.
• E.G., 95% CONFIDENCE MEANS THAT ON
AVERAGE, IN 95 OUT OF 100 ESTIMATIONS, THE
INTERVAL WILL CONTAIN THE TRUE VALUE OF THE
ESTIMATED PARAMETER.
CONFIDENCE INTERVAL FOR PROPORTION P
[ONE-PROPORTION Z-INTERVAL]
ASSUMPTIONS AND CONDITIONS
• RANDOMIZATION CONDITION
• 10% CONDITION
• SAMPLE SIZE ASSUMPTION OR
SUCCESS/FAILURE CONDITION
• INDEPENDENCE ASSUMPTION
NOTE: PROPER RANDOMIZATION CAN HELP
ENSURE INDEPENDENCE.
CONSTRUCTING CONFIDENCE
INTERVALS
ESTIMATOR: SAMPLE PROPORTION $\hat{p}$
STANDARD ERROR: $SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$
C% MARGIN OF ERROR: $ME(\hat{p}) = z^* \, SE(\hat{p})$
C% CONFIDENCE INTERVAL: $\hat{p} \pm ME(\hat{p})$
SAMPLE SIZE NEEDED TO PRODUCE A CONFIDENCE
INTERVAL WITH A GIVEN MARGIN OF ERROR, ME
$$ME(\hat{p}) = z^* \sqrt{\frac{\hat{p}\hat{q}}{n}}$$
SOLVING FOR n GIVES
$$n = \frac{(z^*)^2 \, \hat{p}\hat{q}}{(ME)^2}$$
WHERE $\hat{p}$ AND $\hat{q}$ ARE REASONABLE GUESSES. IF WE
CANNOT MAKE A GUESS, WE TAKE $\hat{p} = \hat{q} = 0.5$.
EXAMPLE 1
A MAY 2002 GALLUP POLL FOUND THAT ONLY 8% OF A
RANDOM SAMPLE OF 1012 ADULTS APPROVED OF
ATTEMPTS TO CLONE A HUMAN.
(A) FIND THE MARGIN OF ERROR FOR THIS POLL IF WE WANT
95% CONFIDENCE IN OUR ESTIMATE OF THE PERCENT OF
AMERICAN ADULTS WHO APPROVE OF CLONING HUMANS.
(B) EXPLAIN WHAT THAT MARGIN OF ERROR MEANS.
(C) IF WE ONLY NEED TO BE 90% CONFIDENT, WILL THE
MARGIN OF ERROR BE LARGER OR SMALLER? EXPLAIN.
(D) FIND THAT MARGIN OF ERROR.
(E) IN GENERAL, IF ALL OTHER ASPECTS OF THE SITUATION
REMAIN THE SAME, WOULD SMALLER SAMPLES PRODUCE
SMALLER OR LARGER MARGINS OF ERROR?
SOLUTION
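A SKETCH OF THE TWO MARGIN-OF-ERROR COMPUTATIONS FOR THE GALLUP POLL, USING ME = z* SE(P-HAT). THE FUNCTION NAME `margin_of_error` IS ILLUSTRATIVE.

```python
import math
from statistics import NormalDist

def margin_of_error(p_hat: float, n: int, confidence: float) -> float:
    """ME = z* * sqrt(p_hat * q_hat / n) for a one-proportion z-interval."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    return z_star * math.sqrt(p_hat * (1 - p_hat) / n)

me_95 = margin_of_error(0.08, 1012, 0.95)  # part (A)
me_90 = margin_of_error(0.08, 1012, 0.90)  # part (D)

print(f"95% ME = {me_95:.4f}, 90% ME = {me_90:.4f}")
```

THE MARGINS ARE ABOUT 1.7 AND 1.4 PERCENTAGE POINTS. LOWER CONFIDENCE GIVES A SMALLER MARGIN OF ERROR.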
EXAMPLE 2
DIRECT MAIL ADVERTISERS SEND SOLICITATIONS (a.k.a. “junk
mail”) TO THOUSANDS OF POTENTIAL CUSTOMERS IN THE
HOPE THAT SOME WILL BUY THE COMPANY’S PRODUCT.
THE RESPONSE RATE IS USUALLY QUITE LOW. SUPPOSE
A COMPANY WANTS TO TEST THE RESPONSE TO A NEW
FLYER, AND SENDS IT TO 1000 PEOPLE RANDOMLY
SELECTED FROM THEIR MAILING LIST OF OVER 200,000
PEOPLE. THEY GET ORDERS FROM 123 OF THE
RECIPIENTS.
(A) CREATE A 90% CONFIDENCE INTERVAL FOR THE
PERCENTAGE OF PEOPLE THE COMPANY CONTACTS WHO
MAY BUY SOMETHING.
(B) EXPLAIN WHAT THIS INTERVAL MEANS.
(C) EXPLAIN WHAT “90% CONFIDENCE” MEANS.
(D) THE COMPANY MUST DECIDE WHETHER TO NOW DO A
MASS MAILING. THE MAILING WON’T BE COST-EFFECTIVE
UNLESS IT PRODUCES AT LEAST A 5% RETURN. WHAT
DOES YOUR CONFIDENCE INTERVAL SUGGEST? EXPLAIN.
SOLUTION
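A SKETCH OF THE 90% ONE-PROPORTION Z-INTERVAL FOR THE DIRECT MAIL EXAMPLE. THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
import math
from statistics import NormalDist

n, orders = 1000, 123
p_hat = orders / n

# 90% confidence leaves 5% in each tail, so z* is the 95th percentile.
z_star = NormalDist().inv_cdf(0.95)
se = math.sqrt(p_hat * (1 - p_hat) / n)
me = z_star * se

low, high = p_hat - me, p_hat + me
print(f"90% CI: ({low:.3f}, {high:.3f})")
```

THE INTERVAL RUNS FROM ABOUT 10.6% TO ABOUT 14.0%, SO ITS LOWER END IS WELL ABOVE THE 5% RETURN NEEDED IN PART (D).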
EXAMPLE 3
IN 1998 A SAN DIEGO REPRODUCTIVE CLINIC
REPORTED 49 BIRTHS TO 207 WOMEN UNDER
THE AGE OF 40 WHO HAD PREVIOUSLY BEEN
UNABLE TO CONCEIVE.
(A) FIND A 90% CONFIDENCE INTERVAL FOR THE
SUCCESS RATE AT THIS CLINIC.
(B) INTERPRET YOUR INTERVAL IN THIS CONTEXT.
(C) EXPLAIN WHAT “90% CONFIDENCE” MEANS.
(D) WOULD IT BE MISLEADING FOR THE CLINIC TO
ADVERTISE A 25% SUCCESS RATE? EXPLAIN.
(E) THE CLINIC WANTS TO CUT THE STATED
MARGIN OF ERROR IN HALF. HOW MANY
PATIENTS’ RESULTS MUST BE USED?
(F) DO YOU HAVE ANY CONCERNS ABOUT THIS
SAMPLE? EXPLAIN.
SOLUTION
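A SKETCH OF PARTS (A) AND (E) FOR THE CLINIC EXAMPLE. PART (E) USES THE SAMPLE-SIZE FORMULA n = (z*)^2 P-HAT Q-HAT / (ME)^2 WITH THE OBSERVED P-HAT AS THE GUESS, WHICH IS AN ASSUMPTION; THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
import math
from statistics import NormalDist

n, births = 207, 49
p_hat = births / n
q_hat = 1 - p_hat

# (A) 90% confidence interval for the success rate.
z_star = NormalDist().inv_cdf(0.95)
me = z_star * math.sqrt(p_hat * q_hat / n)
low, high = p_hat - me, p_hat + me

# (E) sample size that halves the margin of error, rounded up.
target_me = me / 2
n_needed = math.ceil(z_star**2 * p_hat * q_hat / target_me**2)

print(f"90% CI: ({low:.3f}, {high:.3f}); n needed is about {n_needed}")
```

HALVING THE MARGIN OF ERROR TAKES ABOUT FOUR TIMES AS MANY PATIENTS, ROUGHLY 4 × 207 = 828.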
CHAPTER 7.3 – 7.4
CONFIDENCE INTERVALS TO ESTIMATE A
POPULATION MEAN
• NOTES TO BE TAKEN IN CLASS
CHAPTER 8
TESTING HYPOTHESES ABOUT
PROPORTIONS
• PROBLEM
• SUPPOSE WE TOSSED A COIN 100 TIMES
AND WE OBTAINED 38 HEADS AND 62
TAILS. IS THE COIN BIASED?
• THERE IS NO WAY TO SAY YES OR NO WITH
100% CERTAINTY. BUT WE MAY EVALUATE
THE STRENGTH OF SUPPORT FOR THE
HYPOTHESIS THAT “THE COIN IS BIASED.”
TESTING
• HYPOTHESES
NULL HYPOTHESIS $H_0$
– ESTABLISHED FACT;
– A STATEMENT THAT WE EXPECT DATA TO
CONTRADICT;
– NO CHANGE OF PARAMETERS.
ALTERNATIVE HYPOTHESIS $H_A$
– NEW CONJECTURE;
– YOUR CLAIM;
– A STATEMENT THAT NEEDS STRONG
SUPPORT FROM DATA BEFORE WE CAN CLAIM IT;
– CHANGE OF PARAMETERS.
IN OUR PROBLEM
$H_0$: COIN IS FAIR; $p = 0.5$
$H_A$: COIN IS BIASED; $p \ne 0.5$
WHERE p IS THE PROBABILITY THAT
THE COIN TURNS UP “HEADS.”
EXAMPLE
• WRITE THE NULL AND ALTERNATIVE HYPOTHESES
YOU WOULD USE TO TEST EACH OF THE
FOLLOWING SITUATIONS.
• (A) IN THE 1950s ONLY ABOUT 40% OF HIGH
SCHOOL GRADUATES WENT ON TO COLLEGE.
HAS THE PERCENTAGE CHANGED?
• (B) 20% OF CARS OF A CERTAIN MODEL HAVE
NEEDED COSTLY TRANSMISSION WORK AFTER
BEING DRIVEN BETWEEN 50,000 AND 100,000
MILES. THE MANUFACTURER HOPES THAT
REDESIGN OF A TRANSMISSION COMPONENT HAS
SOLVED THIS PROBLEM.
• (C) WE FIELD TEST A NEW FLAVOR SOFT DRINK,
PLANNING TO MARKET IT ONLY IF WE ARE SURE
THAT OVER 60% OF THE PEOPLE LIKE THE
FLAVOR.
ATTITUDE
• ASSUME THAT THE NULL
HYPOTHESIS $H_0$
IS TRUE AND UPHOLD IT,
UNLESS THE DATA SPEAK STRONGLY
AGAINST IT.
TEST MECHANICS
• FROM DATA, COMPUTE THE VALUE
OF A PROPER TEST STATISTIC,
THAT IS, THE Z-STATISTIC.
• IF IT IS FAR FROM WHAT IS
EXPECTED UNDER THE NULL
HYPOTHESIS, THEN WE
REJECT THE NULL HYPOTHESIS.
COMPUTATION OF THE Z-STATISTIC, THE
PROPER TEST STATISTIC
$$z = \frac{\hat{p} - p_0}{SD(\hat{p})}, \quad \text{where} \quad SD(\hat{p}) = \sqrt{\frac{p_0 q_0}{n}}$$
CONSIDERING THE EXAMPLE AT THE
BEGINNING:
$$\hat{p} = 0.38, \quad p_0 = 0.5, \quad SD(\hat{p}) = \sqrt{\frac{0.5(0.5)}{100}} = 0.05$$
$$\text{AND} \quad z_0 = \frac{0.38 - 0.50}{0.05} = -2.4$$
THE P – VALUE AND ITS COMPUTATION
• THE P-VALUE IS THE PROBABILITY THAT, IF THE NULL
HYPOTHESIS IS CORRECT, THE TEST
STATISTIC TAKES THE OBSERVED VALUE OR A
MORE EXTREME ONE.
• P – VALUE MEASURES THE STRENGTH OF
EVIDENCE AGAINST THE NULL
HYPOTHESIS. THE SMALLER THE P –
VALUE, THE STRONGER THE EVIDENCE
AGAINST THE NULL HYPOTHESIS.
THE WAY THE ALTERNATIVE HYPOTHESIS IS
WRITTEN IS HELPFUL IN COMPUTING THE P-VALUE
$H_A$                  P-VALUE
$H_A: p < p_0$         $P(z < z_0)$
$H_A: p > p_0$         $P(z > z_0)$
$H_A: p \ne p_0$       $2P(z > |z_0|)$
[NORMAL CURVE PICTURES FOR EACH CASE]
IN OUR EXAMPLE,
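• P-VALUE = P(z < -2.4) = 0.0082
• INTERPRETATION: IF THE COIN IS
FAIR, THEN THE PROBABILITY OF
OBSERVING 38 OR FEWER
HEADS IN 100 TOSSES IS ABOUT 0.0082.
• THIS IS THE ONE-SIDED P-VALUE. FOR THE
TWO-SIDED ALTERNATIVE THAT THE COIN IS BIASED
IN EITHER DIRECTION, DOUBLE IT:
2(0.0082) = 0.0164.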
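A SKETCH OF THE COIN TEST FROM START TO FINISH: THE Z-STATISTIC, THEN THE ONE-SIDED AND TWO-SIDED P-VALUES UNDER THE NORMAL MODEL. THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
import math
from statistics import NormalDist

n, heads, p0 = 100, 38, 0.5
p_hat = heads / n

# Standard deviation of p-hat computed under the null hypothesis.
sd = math.sqrt(p0 * (1 - p0) / n)
z0 = (p_hat - p0) / sd

std_normal = NormalDist()
p_one_sided = std_normal.cdf(z0)            # H_A: p < p0
p_two_sided = 2 * std_normal.cdf(-abs(z0))  # H_A: p != p0

print(round(z0, 2), round(p_one_sided, 4), round(p_two_sided, 4))
```

EITHER P-VALUE IS BELOW A 0.05 SIGNIFICANCE LEVEL.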
CONCLUSION: GIVEN SIGNIFICANCE
LEVEL = 0.05
• WE REJECT THE NULL HYPOTHESIS IF THE
P – VALUE IS LESS THAN THE
SIGNIFICANCE LEVEL OR ALPHA LEVEL.
• WE FAIL TO REJECT THE NULL
HYPOTHESIS (I.E. WE RETAIN THE NULL
HYPOTHESIS) IF THE P – VALUE IS
GREATER THAN THE SIGNIFICANCE LEVEL
OR ALPHA LEVEL.
ASSUMPTIONS AND CONDITIONS
• RANDOMIZATION
• INDEPENDENT OBSERVATIONS
• 10% CONDITION
• SUCCESS/FAILURE CONDITION
EXAMPLE 1
• THE NATIONAL CENTER FOR EDUCATION
STATISTICS MONITORS MANY ASPECTS OF
ELEMENTARY AND SECONDARY EDUCATION
NATIONWIDE. THEIR 1996 NUMBERS ARE OFTEN
USED AS A BASELINE TO ASSESS CHANGES. IN
1996, 31% OF STUDENTS REPORTED THAT THEIR
MOTHERS HAD GRADUATED FROM COLLEGE. IN
2000, RESPONSES FROM 8368 STUDENTS FOUND
THAT THIS FIGURE HAD GROWN TO 32%. IS THIS
EVIDENCE OF A CHANGE IN EDUCATION LEVEL
AMONG MOTHERS?
EXAMPLE 1 CONT’D
• (A) WRITE APPROPRIATE HYPOTHESES.
• (B) CHECK THE ASSUMPTIONS AND CONDITIONS.
• (C) PERFORM THE TEST AND FIND THE P – VALUE.
• (D) STATE YOUR CONCLUSION.
• (E) DO YOU THINK THIS DIFFERENCE IS
MEANINGFUL? EXPLAIN.
SOLUTION
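A SKETCH OF THE TEST MECHANICS FOR EXAMPLE 1, READING "A CHANGE" AS A TWO-SIDED ALTERNATIVE (p ≠ 0.31) AND TAKING THE REPORTED 32% AS P-HAT. BOTH READINGS ARE ASSUMPTIONS; THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
import math
from statistics import NormalDist

n, p0, p_hat = 8368, 0.31, 0.32

# Success/failure condition under the null hypothesis.
conditions_ok = n * p0 >= 10 and n * (1 - p0) >= 10

sd = math.sqrt(p0 * (1 - p0) / n)
z0 = (p_hat - p0) / sd

# Two-sided P-value: both tails beyond |z0|.
p_value = 2 * NormalDist().cdf(-abs(z0))

print(conditions_ok, round(z0, 2), round(p_value, 3))
```

THE P-VALUE COMES OUT JUST UNDER 0.05, EVEN THOUGH THE OBSERVED DIFFERENCE IS ONLY ONE PERCENTAGE POINT; THE LARGE SAMPLE IS WHAT MAKES SO SMALL A DIFFERENCE DETECTABLE.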
EXAMPLE 2
• IN THE 1980s IT WAS GENERALLY BELIEVED THAT
CONGENITAL ABNORMALITIES AFFECTED ABOUT
5% OF THE NATION’S CHILDREN. SOME PEOPLE
BELIEVE THAT THE INCREASE IN THE NUMBER OF
CHEMICALS IN THE ENVIRONMENT HAS LED TO AN
INCREASE IN THE INCIDENCE OF ABNORMALITIES.
A RECENT STUDY EXAMINED 384 CHILDREN AND
FOUND THAT 46 OF THEM SHOWED SIGNS OF AN
ABNORMALITY. IS THIS STRONG EVIDENCE THAT
THE RISK HAS INCREASED? (WE CONSIDER A P-VALUE
OF AROUND 5% TO REPRESENT STRONG
EVIDENCE.)
EXAMPLE 2 CONT’D
• (A) WRITE APPROPRIATE HYPOTHESES.
• (B) CHECK THE NECESSARY ASSUMPTIONS.
• (C) PERFORM THE MECHANICS OF THE TEST.
WHAT IS THE P – VALUE?
• (D) EXPLAIN CAREFULLY WHAT THE P – VALUE
MEANS IN THIS CONTEXT.
• (E) WHAT’S YOUR CONCLUSION?
• (F) DO ENVIRONMENTAL CHEMICALS CAUSE
CONGENITAL ABNORMALITIES?
SOLUTION
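A SKETCH OF THE TEST MECHANICS FOR EXAMPLE 2, USING THE ONE-SIDED ALTERNATIVE p > 0.05 BECAUSE THE QUESTION ASKS WHETHER THE RISK HAS INCREASED. THE VARIABLE NAMES ARE ILLUSTRATIVE.

```python
import math
from statistics import NormalDist

n, cases, p0 = 384, 46, 0.05
p_hat = cases / n

# Success/failure condition under the null: 19.2 expected cases, 364.8 without.
conditions_ok = n * p0 >= 10 and n * (1 - p0) >= 10

sd = math.sqrt(p0 * (1 - p0) / n)
z0 = (p_hat - p0) / sd

# Upper-tail P-value for H_A: p > p0 (by symmetry, P(z > z0) = P(z < -z0)).
p_value = NormalDist().cdf(-z0)

print(conditions_ok, round(p_hat, 3), round(z0, 2), p_value)
```

THE Z-STATISTIC IS ABOVE 6 AND THE P-VALUE IS ESSENTIALLY ZERO. A SMALL P-VALUE SPEAKS ONLY TO WHETHER THE RATE HAS INCREASED, NOT TO WHAT CAUSED THE INCREASE.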
CHAPTER 8 CONT’D
TESTING HYPOTHESES ABOUT MEANS
• NOTES TO BE TAKEN IN CLASS