random sample

Stat 155, Section 2, Last Time
• Linear Regression
– Fit a line to data
• Least Squares Prediction
• Residual Diagnostic Plot
• Producing Data
• How to Sample?
– History of Presidential Election Polls
Reading In Textbook
Approximate Reading for Today’s Material:
Pages 198-210, 218-225
Approximate Reading for Next Class:
Pages 231-240, 256-257
Common Problem
Adding lines to an Excel Plot
E.g. Textbook problem 2.17
• Plot Data
• Add line with “Add trendline”
• Add line: y = 35+.5x
• Explicitly add least squares fit line
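As a rough cross-check on the line that Excel draws, the least squares fit can also be computed by hand. The sketch below is only an illustration: the data values are made up (they are not the data of problem 2.17) and the function name `least_squares_line` is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up illustrative data, chosen to lie exactly on y = 35 + 0.5x
xs = [60, 64, 68, 72, 76]
ys = [65, 67, 69, 71, 73]
a, b = least_squares_line(xs, ys)
print(f"y = {a} + {b}x")
```

Because the made-up points lie exactly on a line, the fit recovers that line; with real data the fitted line and a hand-entered line such as y = 35+.5x will generally differ.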
Chapter 3:
Producing Data
(how this is done is critical to conclusions)
Section 3.1:
Statistical Settings
2 Main Types:
I. Observational Study
II. Designed Experiment
Producing Data
2 Main Types:
I. Observational Study
II. Experiment
(Make Changes, & Study Effect)
Apply “treatment” to individuals & measure
“responses”
e.g. Clinical trials for drugs,
(safe? effective?)
agricultural trials
(max yield?)
Producing Data
2 Main Types:
I. Observational Study
II. Experiment
(common sense)
Caution: Thinking is required for each.
Both if you do statistics & if you need to
understand somebody else’s results
Helpful Distinctions
(Critical Issue of “Good” vs. “Bad”)
I. Observational Studies:
A. Anecdotal Evidence
Idea: Study just a few cases
Problem: may not be representative
(or worse: chosen precisely because they are unusual)
e.g. Cures for hiccups
Key Question: how were data chosen?
(early medicine: this gave crazy attempts at cures)
Helpful Distinctions
I. Observational Studies:
B. Sampling
Idea: Seek sample representative of population
Challenge: How to sample?
(turns out: not easy)
How to sample?
History of Presidential Election Polls
During Campaigns, constantly hear in news
“polls say …” How good are these? Why?
1936: Landon vs. Roosevelt
Literary Digest Poll: 43% for R
Result: 62% for R
What happened?
Sample size not big enough? 2.4 million
Biggest Poll ever done (before or since)
Bias in Sampling
Bias: Systematically favoring one outcome
(need to think carefully)
Selection Bias: Addresses from L. D.
readers, phone books, club memberships
(representative of population?)
Non-Response Bias: Return-mail survey
(who had time?)
How to sample?
1936 Presidential Election (cont.)
Interesting Alternative Poll:
Gallup: 56% for R (sample size ~ 50,000)
Gallup's prediction of the L. D. poll: 44% for R ( ~ 3,000)
Predicted both correct result (62% for R),
and L. D. error (43% for R)!
(what was better?)
Improved Sampling
Gallup’s Improvements:
(i) Personal Interviews
(attacks non-response bias)
(ii) Quota Sampling
(attacks selection bias)
Quota Sampling
Idea: make “sample like population”
So surveyor chooses people to give:
i. Right % male
ii. Right % “young”
iii. Right % “blue collar”
iv. …
This worked well, until …
How to sample?
1948        Dewey    Truman    sample size
Crossley    50%      45%
Gallup      50%      44%       50,000
Roper       53%      38%       15,000
Actual      45%      50%       -
Note: Embarrassing for polls, famous photo
of Truman + Headline “Dewey Defeats Truman”
What went wrong?
Problem: Unintentional Bias
(surveyors understood bias,
but still made choices)
Lesson: Human Choice cannot give a
Representative Sample
Surprising Improvement: Random Sampling
Now called “scientific sampling”
Random = Scientific???
Random Sampling
Key Idea: “random error” is smaller than
“unintentional bias”, for large enough
sample sizes
How large?
Current sample sizes: ~1,000 - 3,000
Note: now << 50,000 used in 1948.
So surveys are much cheaper
(thus many more done now…)
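To get a feel for how "random error" shrinks as the sample size grows, here is a small simulation sketch. It assumes a very large population in which 62% favor one candidate (the 1936 result quoted earlier), treats each respondent as an independent draw, and uses a hypothetical helper `poll_spread`; none of this is the course's own calculation.

```python
import random
import statistics

def poll_spread(true_share, n, n_polls=200, seed=0):
    """Simulate n_polls random polls of size n; return the SD of the poll results."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_polls):
        # Each respondent independently favors the candidate with prob. true_share
        votes = sum(1 for _ in range(n) if rng.random() < true_share)
        results.append(votes / n)
    return statistics.stdev(results)

spread_100 = poll_spread(0.62, 100)
spread_2500 = poll_spread(0.62, 2500)
print(f"spread with n = 100:  {spread_100:.3f}")
print(f"spread with n = 2500: {spread_2500:.3f}")
```

The poll-to-poll spread for n = 2500 comes out much smaller than for n = 100, which is why samples of a few thousand can be accurate enough when they are chosen at random.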
Random Sampling
How Accurate?
• Can (& will) calculate using “probability”
• Justifies term “scientific sampling”
• 2nd improvement over quota sampling
And now for something
completely different
Recall Stat 155, Section 2, Majors
[Figure: bar chart of the distribution of majors of students in this course. Vertical axis: Frequency, 0 to 0.4. Categories: Business/Man., Biology, Public Policy, Pharm/Nursing, Env. Sci./Health, Journalism/Comm., Other, Undecided.]
And now for something
completely different
A man goes into a drugstore and asks the
pharmacist if he can give him something for
the hiccups. The pharmacist promptly
reaches out and slaps the man's
face. "What did you do that for?" the man
asks.
"Well, you don't have the hiccups anymore, do
you?"
The man says, "No, but my wife out in the car
still does!"
And now for something
completely different
An elderly woman went into the doctor's office.
When the doctor asked why she was there,
she replied, "I'd like to have some birth
control pills."
Taken aback, the doctor thought for a minute
and then said, "Excuse me, Mrs. Smith, but
you're 75 years old. What possible use
could you have for birth control pills?"
The woman responded, "They help me sleep
better."
The doctor thought some more and continued,
"How in the world do birth control pills help
you to sleep?"
The woman said, "I put them in my
granddaughter's orange juice and I sleep
better at night."
Random Sampling
What is random?
Simple Random Sampling:
Each member of population is
equally likely to be in sample
Key Idea: Different from “just choose some”
Random Sampling
An old (but still fun?) experiment:
Choose a number among 1,2,3,4
Old typical results:
about 70% choose “3”
(perhaps you have seen this before…)
Main lesson: human choice does not give
“equally likely” (i.e. random sample)
Random Sampling
How to choose a random sample?
Old Approaches:
– Random Number Table
– Roll Dice
Modern Approach:
– Computer Generated
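As one illustration of the computer-generated approach (outside of Excel), Python's standard `random` module can draw a simple random sample directly. The population of 20 labels and the seed value below are made up for the sketch.

```python
import random

# Hypothetical population: 20 labeled individuals
population = [f"person_{i}" for i in range(1, 21)]

# Fixed seed so the same sample is drawn each time the code is run
rng = random.Random(155)

# random.sample draws without replacement; every subset of size 5 is equally likely
sample = rng.sample(population, 5)
print(sample)
```

Fixing the seed plays the same role as generating a fixed list of random numbers: the choice is random, but it does not change every time the calculation is repeated.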
Random Sampling
EXCEL generation of random samples:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg16.xls
Goal 1:
Generate Random Numbers
EXCEL approaches:
• RAND function
• Tools → Data Analysis → Random Number Generation
EXCEL Random Sampling
Goal 2:
Randomly Reorder List
EXCEL approach:
• Highlight block with list & random num’s
• Sort whole thing on numbers
Goal 3:
Random Sample from List
• Choose 1st subset from random re-order
• Since each is equally likely in each spot
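The three goals above (generate random numbers, sort the list on them, take the first few) can be mirrored outside Excel. The following is a minimal Python sketch of the same idea; the names and the seed are made up for illustration.

```python
import random

names = ["Ann", "Bob", "Cal", "Dee", "Eve", "Fay", "Gus", "Hal"]  # hypothetical list
rng = random.Random(2007)  # fixed seed: the numbers do not change on re-running

# Goal 1: one uniform [0,1) random number next to each name
keys = [rng.random() for _ in names]

# Goal 2: sort the (number, name) pairs on the numbers -> random re-ordering
reordered = [name for _, name in sorted(zip(keys, names))]

# Goal 3: the first few entries of the re-ordered list are the random sample
sample = reordered[:3]
print(reordered)
print(sample)
```

Since every name is equally likely to land in each position after the sort, taking the first three positions gives a simple random sample of size 3.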
EXCEL Details
RAND:
• Not available among “Statistical” functions
• But can find on “All” menu
• Note no (explicit) inputs
• Just put in desired cell
• Drag downwards for several random #s
• Caution: these change on each re-computation
• Thus not recommended for choosing a random sample
EXCEL Details
Tools → Data Analysis → Random Number Generation:
• Set: # Variables: 1
Distribution: Uniform (over [0,1])
• Generates Fixed List
(doesn’t change with re-computation)
(note entries are “just numbers”)
• Thus stable for later interpretation
• Recommended for random sample choice
EXCEL Details
Sorting Lists:
• Highlight Block with Both:
– Names to sort
– Random numbers
• Data → Sort → Choose Column
• Result is random re-ordering of List
Random Sampling HW
HW:
C7: For the letters A – L, use EXCEL to:
(a) Put in a random order.
(b) Choose a random sample of 6.
(Hints: for (a), want each equally likely,
for (b), reorder, and choose a subset)
Random Sampling HW
Interesting Question:
What is the % of Male Students at UNC?
(Your chance of a date,
or take 100% minus this
to get your chance)
HW:
C8: Print Class Handout
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155HWC8.doc
Random Sampling HW
Notes on HW C8:
• 3 dumb ways to sample, 1 good one
• Goal is to learn about sampling,
Not “get right answer”
• Part 1, put symbol for yourself, Ms and Fs
for others
• Put both count & % (% = 100 x count / 25)
• Part 2, “tally” is:
• Part 4, student phone directory available
in Student Union?
Random Sampling HW
Notes on HW C8,
• Hints on Part 4:
– For each draw, first draw a “random page”
– Tools → Data Analysis → Random Number Generation → Uniform is one way to do this
– In “Uniform”, you need to set “Parameters” to 0 and “number of pages”.
– This gives a random decimal; to get an integer, round up, using CEILING
– In CEILING, set “significance” to 1.
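The "uniform decimal, then round up" step can be sketched in Python as follows. This is only an analogue of the Excel recipe, not the recipe itself; the page count of 300, the seed, and the function name `random_page` are assumptions made for the example.

```python
import math
import random

def random_page(num_pages, rng):
    """Pick a page number 1..num_pages, each equally likely."""
    u = rng.uniform(0, num_pages)   # random decimal between 0 and num_pages
    # Round up to the next integer (like CEILING with significance 1);
    # max(1, ...) guards the practically impossible case u == 0.0
    return max(1, math.ceil(u))

rng = random.Random(8)
pages = [random_page(300, rng) for _ in range(1000)]  # 300 pages is hypothetical
print(pages[:10])
```

Rounding up (rather than to the nearest integer) matters: each page number then corresponds to an interval of length exactly 1, so all pages are equally likely.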
Random Sampling HW
Notes on HW C8,
• Hints on Part 4 (cont.):
– Next Choose Random Column
– Next Choose Random Name
– Caution: Different numbers on each page.
– Challenge: still make equally likely
– Approach: choose larger number.
– Approach: when not there, just toss it out
– Approach: then do a “redraw”
– Also redraw if can’t tell gender
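Why the "choose larger number, toss out, redraw" approach keeps every name equally likely can be checked with a small simulation. The sketch below simplifies the directory to pages only (no columns), uses made-up one-letter names, and the helper `draw_name` is hypothetical.

```python
import random

def draw_name(pages, rng):
    """pages: list of lists of names; pages may hold different numbers of names."""
    largest = max(len(page) for page in pages)
    while True:
        page = rng.choice(pages)        # random page, each equally likely
        slot = rng.randrange(largest)   # slot number based on the LARGEST page
        if slot < len(page):
            return page[slot]
        # That slot does not exist on this page: toss it out and redraw

# Hypothetical tiny directory: pages with 4, 1 and 2 names
pages = [["A", "B", "C", "D"], ["E"], ["F", "G"]]
rng = random.Random(8)
counts = {}
for _ in range(7000):
    name = draw_name(pages, rng)
    counts[name] = counts.get(name, 0) + 1
print(counts)
```

Each of the 7 names should come up roughly 1000 times. Without the redraw step (picking a slot only among the names actually on the chosen page), the lone name "E" would be chosen about four times as often as "A".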
More On Surveys
More Common Sense:
How you ask the question
makes a big difference
HW:
3.57, 3.59
And Now for Something
Completely Different
Extreme Bicycling
Need a bicycle helmet there?
More about Sampling
The “simple random sample” (recall “each
equally likely”) can be expensive
(e.g. nationwide political poll, collected by
personal interview)
So there are many cheaper variations:
– Stratified Sampling
– Multi Stage Sampling
– See text
– And there are many others as well
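For a rough idea of what stratified sampling looks like in practice: the population is split into groups ("strata") and a simple random sample is drawn within each one. The sketch below assumes two made-up strata and a 25% sampling fraction; the function name `stratified_sample` is hypothetical, and the text should be consulted for the full treatment.

```python
import random

def stratified_sample(strata, fraction, rng):
    """strata: dict mapping stratum name -> list of members.
    Draws a simple random sample of the given fraction within each stratum."""
    sample = {}
    for name, members in strata.items():
        k = round(fraction * len(members))
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical strata of different sizes
strata = {
    "first_year": [f"fy_{i}" for i in range(40)],
    "senior": [f"sr_{i}" for i in range(20)],
}
picked = stratified_sample(strata, 0.25, random.Random(3))
print({name: len(chosen) for name, chosen in picked.items()})
```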
Sampling for Experiments
II. Experiments
(Recall I was Observational Studies,
Now take similar look at II)
Terminology:
“treatments” are applied to “individuals”
i.e. to “subjects”
i.e. to “experimental units”
Sampling for Experiments
A “treatment” is:
a combination of “levels”,
of explanatory variables (quantities),
called “factors”.
E.g. Medicine, Agriculture, …
Sampling for Experiments
Agriculture Example:
Study how plant growth depends on:
fertilizer and water
So plants = “experimental units”, i.e. “subjects”
“Factors” are fertilizer and water,
Each plant gets some “level” of each.
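Since a treatment is a combination of levels of the factors, the full list of treatments can be enumerated mechanically. In the sketch below the particular levels (three for fertilizer, two for water) are made up for illustration.

```python
from itertools import product

# Hypothetical levels for each factor
factors = {
    "fertilizer": ["none", "low", "high"],
    "water": ["little", "lots"],
}

# Every treatment = one level of each factor
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]
for t in treatments:
    print(t)
```

With 3 fertilizer levels and 2 water levels there are 3 x 2 = 6 treatments, and each plant receives exactly one of them.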
HW on Sampling Terminology
HW:
3.9
3.11
Design of Experiments
The “design” of an experiment is the
assignment of levels and treatments to
experimental units
(just as “choice of sample” was critical for
sampling, this is too. There is a huge
literature on this, including current
research)
Design of Experiments
Key Design Issues:
1. Control
Idea: Eliminate “lurking variable” effects,
by comparing treatments on groups of
similar experimental units.
Controlled Experiments
Common Type:
compare “treatment” with
“placebo”, a “sham treatment” that
controls for psychological effects
(think you are better, just because you are
treated, so you are better…)
Called a “blind” experiment
Controlled Experiments
Further Refinement:
“Double Blind” experiment means neither
patient nor doctor knows whether the treatment is real or not
Eliminates possible doctor bias
Design of Experiments
2. Randomization
Useful method for choosing groups above
(e.g. Treatment and Control)
Recall:
Different from “just choose some”,
instead means “make each equally likely”
Design of Experiments
2. Randomization
Big Plus: Eliminates biases,
i.e. effects of “lurking variables”
(same as random choice of samples,
again pay price of added variability,
but well worth it)
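Randomly assigning subjects to Treatment and Control works just like random sampling: re-order the list at random and split it. A minimal sketch, assuming 20 made-up subjects and equal group sizes:

```python
import random

subjects = [f"subject_{i}" for i in range(1, 21)]  # hypothetical subject labels
rng = random.Random(42)  # fixed seed so the assignment can be reproduced

shuffled = subjects[:]   # copy, so the original list is left alone
rng.shuffle(shuffled)    # random re-ordering, each order equally likely

treatment_group = shuffled[:10]
control_group = shuffled[10:]
print("Treatment:", treatment_group)
print("Control:  ", control_group)
```

No human decides who goes where, so lurking variables cannot systematically favor one group.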
Design of Experiments
3. Replication
Idea: Reduce chance variation by applying
same treatment to several (even many?)
experimental units.
How many replications are needed?
(depends on context: tradeoff between
cost and reduction of variation)
Will build tools to study (based on probability)
Design of Experiments
Fancier Designs
(there are many, some in text)
• Blocks
• Matched Pairs
• Balanced Designs

Example of an Experiment
(to tie above ideas together)
Gastric Freezing:
Treatment for stomach ulcers
– Anesthetize patient
– Put balloon in stomach
– Fill with freezing coolant