Transcript PowerPoint
Mar. 15 Statistics for the day:
Highest Temp ever recorded in State College:
102 degrees (July 9, 1936 and July 17, 1988)
Lowest temp ever recorded in State College:
-18 degrees (January 19-20, 1994)
Source: http://pasc.met.psu.edu
Review
Exam Friday, March 19
Chapters 10, 11, 12, 15, 16, 17
These slides were created by Tom Hettmansperger and in some cases
modified by David Hunter
[Scatter plot: Arby's, calories (200 to 700) versus serving size (3 to 11)]
Arby's
calories = -10.2 + 60.5x(serving)
Correlation = .83
[Regression plot: calories (200 to 700) versus serving (3 to 11), with the fitted line]
S = 78.5202
R-Sq = 72.2 %
R-Sq(adj) = 69.8 %
Best fitting line through the data: called the REGRESSION LINE
Strength of relationship: measured by CORRELATION
calories = -10 + 60(serving size in oz)
For example, if you have a 6 oz sandwich,
on average you expect to get about:
-10 + 60(6) = -10 + 360 = 350 calories
For a 10 oz sandwich:
-10 + 60(10) = -10 + 600 = 590 calories
calories = -10 + 60(serving size in oz)
• -10 is called the intercept
• 60 is called the slope
• One way to interpret slope: For
every extra oz of serving you get an
increase of 60 calories
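The rounded line can be checked with a few lines of Python. This is only a sketch; the function name `predicted_calories` is made up for illustration and is not part of the course materials.

```python
def predicted_calories(serving_oz, intercept=-10, slope=60):
    """Predicted calories from the rounded line: -10 + 60 * (serving size in oz)."""
    return intercept + slope * serving_oz

print(predicted_calories(6))    # 350
print(predicted_calories(10))   # 590
# Slope: one extra ounce adds 60 calories to the prediction
print(predicted_calories(7) - predicted_calories(6))   # 60
```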
Facts about correlation, measured for
two quantitative variables
+1 means perfect increasing linear
relationship
-1 means perfect decreasing linear
relationship
0 means no linear relationship
+ means one increases as the other increases
- means one increases as the other decreases
Outliers
Outliers are data that are not compatible
with the bulk of the data.
They show up in graphical displays
as detached or stray points.
Sometimes they indicate errors in data
input. Experts estimate that roughly
5% of all data entered is in error.
Sometimes they are the most important
data points.
Example
[Plot: Hours per day typically spent studying, Stat 100.2 Spring 2004; axis: Hours (0 to 35)]
How many pairs of jeans do you own?
Stat 100.2 Spring 2004
[Plots of the responses: Frequency (0 to 50) versus Jeans (0 to 30)]
A bad outlier:
Stat 100.2 Combined
height = 58.2 + 0.06 weight
Correlation = .38
[Regression plot: height (60 to 100) versus weight (100 to 300)]
S = 4.77319
R-Sq = 14.7 %
R-Sq(adj) = 13.9 %
Another bad outlier:
Regression Plot
text cost = 220.5 + 0.58 cds
Correlation = .56
[Regression plot: text cost (0 to 1500) versus cds (0 to 1500)]
S = 141.673
R-Sq = 31.2 %
R-Sq(adj) = 30.5 %
The Moral:
There can be good outliers, as in detecting election fraud or in
analyzing put options for extreme cases. We use them to identify
important parts of the data.
More often the outliers are bad. They can depress
the correlation and make you think the relationship
is weaker than it really is.
They can increase the correlation and make it appear
that the relationship is stronger than it really is.
IMPORTANT: Always look at a scatter plot as well
as compute the correlation.
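Both effects can be seen in a small sketch. The data below are hypothetical, invented only to illustrate the point (they are not the class data), and the helper name `correlation` is also made up.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfect line, then the same line plus one stray point
line_x = [1, 2, 3, 4, 5, 6, 7, 8]
line_y = [2, 4, 6, 8, 10, 12, 14, 16]
r_line = correlation(line_x, line_y)
r_depressed = correlation(line_x + [9], line_y + [0])

# Hypothetical data: no linear relationship, then one extreme point added
cloud_x = [1, 2, 3, 1, 2, 3]
cloud_y = [1, 2, 3, 3, 2, 1]
r_cloud = correlation(cloud_x, cloud_y)
r_inflated = correlation(cloud_x + [20], cloud_y + [20])

print(round(r_line, 2), round(r_depressed, 2))   # 1.0 0.4
print(round(r_cloud, 2), round(r_inflated, 2))   # 0.0 0.99
```

One stray point drops a perfect correlation to .4 in the first case and lifts a zero correlation to about .99 in the second, which is why the scatter plot matters.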
Another problem:
Sometimes we see strong relationship in
absurd examples.
Two seemingly unrelated variables have
a high correlation.
This signals the presence of a third variable
that is highly correlated with the other two.
(Confounding or interaction)
A third variable: vocabulary vs shoe size
Regression Plot
Y = -806 + 555 X
Correlation = .985
[Regression plot: Words known (0 to 2500) versus Shoe Size (2 to 6)]
S = 158.602
R-Sq = 97.1 %
R-Sq(adj) = 96.6 %
How can we have such high correlation between
shoe size and vocabulary?
Easy: Both increase with age and hence age
is a hidden variable.
Age is positively correlated with both shoe
size and with vocabulary.
Two categorical variables:
Explanatory variable: Gender
Response variable: Body Pierced or Not
Survey question:
Have you pierced any other part of your body?
(Except for ears)
Research Question:
Is there a significant difference between women
and men in terms of body pierces?
Data:
Explanatory: Gender?     Response: Pierced?

          No      Yes     Total
Women     84      51      135
Men       96      3       99
Total     180     54      234
From Stat 100.2, spring 2004 (missing responses omitted)
Percentages
Response: body pierced?

          no       yes      All
female    62.22    37.78    100.00
male      96.97    3.03     100.00
All       76.92    23.08    100.00

62.22 = 84/135 (as a percent)
96.97 = 96/99 (as a percent)
Research question: Is there a significant difference
between women and men?
(i.e., between 62.22% and 96.97%)
The Debate:
The research advocate claims that there is a
significant difference.
The skeptic claims there is no real difference.
The data differences simply happen by chance.
The strategy for determining
statistical significance:
First, figure out what you expect to see if there is
no difference between females and males
Second, figure out how far the data is from what is
expected.
Third, decide if the distance in the second step is
large.
Fourth, if large then claim there is a statistically
significant difference.
Research Advocate: OK. Suppose there is really
no difference in the population as you, the Skeptic,
claim.
We will compare what you, The Skeptic, expect to
see and what you actually do see in the data.
Skeptic: How do we figure out what we expect to see?
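          No      Yes     Total
Women                     135
Men                       99
Total     180     54      234

If there is no difference, the overall fraction answering No (180/234)
applies to the 135 women:
(180/234) x 135 = 103.85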
Rows: gender     Columns: body pierces
top lines of numbers are observed
bottom lines are expected (by skeptic)

          no        yes       All
female    84        51        135
          103.85    31.15     135.00
male      96        3         99
          76.15     22.85     99.00
All       180       54        234
          180.00    54.00     234.00
How to measure the distance between what the
research advocate observes in the table and what
the skeptic expects:
Add up the following for each cell:
(obs - exp)^2 / exp

(84 - 103.85)^2/103.85 + (51 - 31.15)^2/31.15 + (96 - 76.15)^2/76.15 + (3 - 22.85)^2/22.85 = 38.85
Now how do we decide if 38.85 is large or not? If
it is large enough the skeptic concedes to the
research advocate and agrees there is a statistically
significant difference. How large is enough?
Chi-squared distribution with 1 degree of freedom:
[Density plot: horizontal axis 0 to 6, Cutoff=3.84; 95% of the area is to the left of the cutoff and 5% is to the right]
If the chi-squared statistic
is larger than 3.84, it is
declared large and the
research advocate wins.
But our chi-squared is 38.85 so the research advocate easily wins!
There is a statistically significant difference between men and women.
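The strategy (expected counts, distance, comparison with the cutoff) can be sketched in Python. This is a minimal sketch, not the software used for the slides; the names `expected_counts` and `chi_squared` are made up for illustration, and the 3.84 cutoff applies only to tables with 1 degree of freedom.

```python
def expected_counts(observed):
    """Counts the skeptic expects if rows and columns are unrelated."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_squared(observed):
    """Sum of (obs - exp)^2 / exp over every cell of the table."""
    expected = expected_counts(observed)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(observed, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

# Rows: women, men. Columns: not pierced, pierced.
piercing = [[84, 51], [96, 3]]
statistic = chi_squared(piercing)
print([round(e, 2) for e in expected_counts(piercing)[0]])  # [103.85, 31.15]
print(round(statistic, 2))                                   # 38.85
print(statistic > 3.84)                                      # True
```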
Why 1 degree of freedom?
          No      Yes     Total
Women                     136
Men                       101
Total     26      211     237

Note that the black box (a single cell of the table) is the ONLY one
we can fill arbitrarily. Once that box is filled, all others are
determined by the margins!
How many degrees of freedom?
          Always    Sometimes   Never    Total
Women     One df    Two df               136
Men                                      101
Total     106       105         26       237
Degrees of freedom (df) always equal
(Number of rows – 1) times (Number of columns – 1)
Health studies and risk
Research question: Do strong electromagnetic fields
cause cancer?
50 dogs randomly split into two groups: no field, yes field
The response is whether they get lymphoma.
Rows: mag field     Columns: cancer

          no      yes     All
no        20      5       25
yes       10      15      25
All       30      20      50
Rows: mag field     Columns: cancer
observed above the expected

          no       yes      All
no        20       5        25
          15.00    10.00    25.00
yes       10       15       25
          15.00    10.00    25.00
All       30       20       50
          30.00    20.00    50.00
Chi-Square = 8.333 (compare to 3.84)
Research advocate wins!
Terminology and jargon:
1. Identify the ‘bad’ response category: yes cancer
2. Risk for categories of explanatory variable
• Identify treatment category
• Identify baseline (control) category
3. Treatment risk: 15/25 or .60 or 60%
4. Baseline risk: 5/25 or .20 or 20%
5. Relative risk: Treatment risk over Baseline risk = .60/.20=3
So the risk with the mag field is 3 times the baseline risk.
One more on the next page:
Increased risk (percentage change in risk):
(Treatment - Baseline)/Baseline = (.60 - .20)/.20 = .4/.2 = 2
So the percentage change is 200%
A 200% increase in treatment risk over
baseline risk for getting cancer.
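The risk statements for the dog study can be reproduced with exact fractions. This is just a sketch of the arithmetic on the slides; the variable names are made up for illustration.

```python
from fractions import Fraction

treatment_risk = Fraction(15, 25)   # cancer among the 25 dogs in the field
baseline_risk = Fraction(5, 25)     # cancer among the 25 dogs with no field

relative_risk = treatment_risk / baseline_risk
increased_risk = (treatment_risk - baseline_risk) / baseline_risk

print(float(treatment_risk), float(baseline_risk))   # 0.6 0.2
print(relative_risk)                                  # 3
print(f"{float(increased_risk):.0%}")                 # 200%
```

Fractions are used instead of decimals so that .60/.20 comes out as exactly 3.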
Final note:
When the chi-squared test is statistically significant
then it makes sense to compute the various risk
statements.
If there is no statistical significance then the skeptic
wins.
There is no evidence in the data for differences in
risk for the categories of the explanatory variable.
Research question: Is ghost sighting related
to age? Do young and old people differ in
ghost sighting?
The skeptic responds by saying he
doesn’t believe that there is any
difference between the age groups.
We need to see the data to resolve the debate. Then
we can consider assessing the risk.
Exercise 9, p219 of the text.
Expected counts are printed below observed

          yes       no        Total
young     212       1313      1525
          174.9     1350.1
old       465       3913      4378
          502.1     3875.9
Total     677       5226      5903

Chi-Sq = 7.870 + 1.020 + 2.742 + 0.355 = 11.987
The research advocate wins and skeptic loses.
There is evidence in the data that there are differences
in the population.
The percent of young who saw a ghost:
212/1525 = .139
Answer: 13.9%
The proportion of old who saw a ghost:
465/4378 = .106
Answer: .106
The risk of young seeing ghost:
Answer: 212/1525 or .139 or 13.9%
Odds ratio?
Odds
The odds of something happening are given by a ratio:
(Proportion of time it happens) / (Proportion of time it doesn't happen)
For example, if you
flip a fair coin, the
odds of heads are 1 (or
sometimes “1 to 1”).
An odds ratio is the
ratio of two odds!
The odds that a young person saw a ghost:
212/1313 = .161
The odds that an older person saw a ghost:
465/3913 = .119
The odds ratio:
Answer: .161/.119 = 1.35
Relative risk of young person seeing a
ghost compared to older person:
Answer: .139/.106 = 1.31
We would say that the risk that a younger person
sees a ghost is 1.31 times the risk that an
older person sees a ghost.
The increased risk that a young person sees a ghost over
that of an older person:
Answer: (.139 - .106)/.106 = .31
Hence we would say that young people have a 31%
higher risk of seeing a ghost than older people.
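All of the ghost-sighting answers can be recomputed from the four counts in the table. This is a sketch with made-up variable names. Note that the unrounded odds ratio comes out near 1.36, while dividing the rounded odds .161 and .119 gives about 1.35.

```python
# Counts from the table: (saw a ghost, did not)
young_yes, young_no = 212, 1313
old_yes, old_no = 465, 3913

risk_young = young_yes / (young_yes + young_no)   # proportion of young who saw a ghost
risk_old = old_yes / (old_yes + old_no)           # proportion of old who saw a ghost
odds_young = young_yes / young_no                 # happens versus doesn't happen
odds_old = old_yes / old_no

relative_risk = risk_young / risk_old
increased_risk = (risk_young - risk_old) / risk_old
odds_ratio = odds_young / odds_old

print(round(risk_young, 3), round(risk_old, 3))   # 0.139 0.106
print(round(odds_young, 3), round(odds_old, 3))   # 0.161 0.119
print(round(relative_risk, 2))                    # 1.31
print(round(increased_risk, 2))                   # 0.31
print(round(odds_ratio, 2))                       # 1.36
```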
Statistical significance
Statistical significance is related to
the size of the sample. But that makes
sense. More data, more information, more
precise inference.
So statistical significance is related to two things:
1. The size of the difference between the percentages.
Big differences are more likely to show stat. significance.
2. The size of the sample. Bigger samples are more likely
to show statistical significance irrespective of the size of
the difference in percentages.
Practical significance
Even if the difference in percentages is uninteresting
and of no practical interest, the difference may
be statistically significant because we have a large
sample.
Hence, in the interpretation of statistical significance,
we must also address the issue of practical significance.
In other words, you must answer the skeptic’s second
question: WHO CARES?
Probability
[Diagram: two interpretations of probability.
Relative Frequency: Physical World Assumptions (check by repeated sampling); Experiment (estimate probability by repeated sampling).
Personal Opinion: Experience; Non-repeatable Event.]
Rules: For combining probabilities
0 ≤ Probability ≤ 1
1. If there are only two possible outcomes, then
their probabilities must sum to 1.
2. If two events cannot happen at the same time,
they are called mutually exclusive. The probability
of at least one happening (one or the other) is the
sum of their probabilities. [1. is a special case of this.]
3. If two events do not influence each other, they
are called independent. The probability that they
happen at the same time is the product of their probabilities.
4. If the occurrence of one event forces the occurrence of
another event, then the probability of the second event is
always at least as large as the probability of the first event.
Are mutually exclusive events independent or dependent?
Remember the tests:
1. Two events are mutually exclusive if they cannot happen
at the same time.
2. Two events are independent if the occurrence of one does
not alter the probability of the other occurring.
Put another way, two events are dependent if the probability
of one event changes when we find out whether the other event
occurred or not.
New Rule:
Suppose we are considering a series of events.
The probability of at least one of the events occurring is:
Pr( at least one ) = 1 – Pr( none )
This follows directly from Rule 1 since ‘at least one’
or ‘none’ has to occur.
Long Run Behavior
We CANNOT predict individual outcomes.
BUT
We CAN predict quite accurately long run behavior.
Standard example:
We cannot predict the outcome of a single toss of
a coin very precisely: Pr(head) = .50
But in the long run we expect about 50% heads and 50% tails.
Toss a fair coin 1000 times and keep track of the
proportion of heads.
[Plot: proportion of heads (0.0 to 0.8) versus number of tosses (0 to 1000)]
Two laws (only one of them valid):
Law of large numbers: Over the long haul,
we expect about 50% heads (this is true).
“Law of small numbers”: If we’ve seen a
lot of tails in a row, we’re more likely to see
heads on the next flip (this is completely
bogus).
Remember: The law of large numbers
OVERWHELMS; it does not COMPENSATE.
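A short simulation shows the long-run behavior. This is a sketch under stated assumptions: the function name `running_proportions` and the seed are arbitrary choices made for illustration, and a different seed gives a different path.

```python
import random

def running_proportions(n_tosses, seed=0):
    """Proportion of heads after each toss of a simulated fair coin."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    heads = 0
    proportions = []
    for toss in range(1, n_tosses + 1):
        if rng.choice("HT") == "H":
            heads += 1
        proportions.append(heads / toss)
    return proportions

proportions = running_proportions(1000)
for toss in (10, 100, 1000):
    print(toss, round(proportions[toss - 1], 3))
```

Early proportions can be far from .5. Later tosses do not correct the early ones; they simply outnumber them.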
When will it happen? (p264 text)
Odd Man
Consider the odd man game. Three people toss
a coin. The odd man has to pay for the drinks.
You are the odd man if you get a head and the
other two have tails or if you get a tail and the
other two have heads.
Pr(no odd man) = Pr(HHH or TTT)
= Pr(HHH) + Pr(TTT)        (Rule 2)
= (1/2)^3 + (1/2)^3        (Rule 3)
= 1/8 + 1/8
= 1/4 = .25
Pr(odd man) = 1 - Pr(no odd man) = 1 - .25 = .75        (Rule 1)
Pr(odd man occurs on the third try)
= Pr(miss, miss, hit)
= Pr(miss)Pr(miss)Pr(hit)        (Rule 3)
= [Pr(miss)]^2 Pr(hit)
= [.25]^2 (.75)
= .047
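The odd man probabilities can be double-checked by listing all eight equally likely outcomes. This is a minimal sketch; the variable names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely results of three coin tosses
outcomes = list(product("HT", repeat=3))

# No odd man when all three coins match (HHH or TTT)
no_odd_man = [o for o in outcomes if len(set(o)) == 1]
p_no_odd_man = Fraction(len(no_odd_man), len(outcomes))
p_odd_man = 1 - p_no_odd_man

# Miss, miss, hit: rounds are independent, so multiply
p_third_try = p_no_odd_man ** 2 * p_odd_man

print(p_no_odd_man, p_odd_man, p_third_try)   # 1/4 3/4 3/64
print(round(float(p_third_try), 3))           # 0.047
```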
Expectation
Insurance
Example 14 p267 extended.
Suppose my insurance company has 10,000 policy holders
and they are all skateboarders.
I collect a $500 premium each year.
I pay off $1500 for a claim of a skateboard accident.
From past experience I know 10% (i.e., 1000) will file a claim.
How much do I expect to make per customer?
Pr(claim) = .10; loss is $1500 - $500 = $1000, recorded as -$1000
Pr(no claim) = .90; gain is $500
Expected value = .10x(-1000) + .90x(500)
= -100 + 450
= 350 dollars per customer
Expected value for the 10,000 customers
= 10,000x350
= 3,500,000 dollars per year
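The expected value is a probability-weighted sum, which can be sketched as follows. The variable names are made up for illustration; exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# (probability, gain to the company in dollars) for each outcome
outcomes = [
    (Fraction(1, 10), -1000),   # claim: pay $1500, keep the $500 premium
    (Fraction(9, 10), 500),     # no claim: keep the $500 premium
]

expected_per_customer = sum(p * gain for p, gain in outcomes)
expected_total = 10_000 * expected_per_customer

print(expected_per_customer)   # 350
print(expected_total)          # 3500000
```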
Efron Dice

Die    Side values
A      0  0  4  4  4  4
B      1  1  1  5  5  5
C      2  2  2  2  6  6
D      3  3  3  3  3  3
Pr( B beats A ) = 2/3
Pr( D beats C ) = 2/3
Pr( C beats B ) = 2/3
Pr (A beats D ) = 2/3
Hence, there is NO best die! Whichever die your opponent picks, you can always pick one that beats it 2/3 of the time.
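The four probabilities can be verified by comparing all 36 pairs of faces for each matchup. This is a sketch; the function name `prob_beats` is made up for illustration.

```python
from fractions import Fraction
from itertools import product

dice = {
    "A": [0, 0, 4, 4, 4, 4],
    "B": [1, 1, 1, 5, 5, 5],
    "C": [2, 2, 2, 2, 6, 6],
    "D": [3, 3, 3, 3, 3, 3],
}

def prob_beats(first, second):
    """Probability that die `first` shows a higher value than die `second`."""
    pairs = list(product(dice[first], dice[second]))
    wins = sum(1 for a, b in pairs if a > b)
    return Fraction(wins, len(pairs))

for winner, loser in [("B", "A"), ("D", "C"), ("C", "B"), ("A", "D")]:
    print(winner, "beats", loser, prob_beats(winner, loser))   # 2/3 each time
```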
Cancer testing: confusion of the inverse
Suppose we have a cancer test for a certain type of cancer.
Sensitivity of the test:
If you have cancer then the probability of a positive test
is .98. Pr(+ given you have C) = .98
Specificity of the test:
If you do not have cancer then the probability of a negative
test is .95. Pr(- given you do not have C) = .95
Base rate:
The percent of the population who has the cancer. This is
the probability that someone has C.
Suppose for our example it is 1%. Hence, Pr(C) = .01.
Percent table

                    + (Positive)            - (Negative)            Base Rate
C (Cancer)          .98 (Sensitivity)       .02 (false negative)    .01
no C (no Cancer)    .05 (false positive)    .95 (Specificity)       .99
Suppose you go in for a test and it comes back positive.
What is the probability that you have cancer?
Count table from a percent table

          +       -       Base rate
C         .98     .02     .01
no C      .05     .95     .99

Counts for 10,000 people:

          +       -       Total
C         98      2       100
no C      495     9405    9,900
Total     593     9407    10,000
Pr(C given a + test) = 98/593 = .165
Tree diagrams: A possible tool for
solving problems like the “rare
disease” problem
All people like you
   .01  With disease
           .98  Positive   (.01 x .98 = .0098)
           .02  Negative
   .99  Without disease
           .05  Positive   (.99 x .05 = .0495)
           .95  Negative
Pr (Positive) = .0098+.0495 = .0593
Pr (Disease given Positive) = .0098/.0593 = .165
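The same calculation works for any base rate, sensitivity, and specificity. This is a sketch of the tree arithmetic; the function name `prob_disease_given_positive` is made up for illustration.

```python
def prob_disease_given_positive(base_rate, sensitivity, specificity):
    """Pr(disease given a positive test), following the two Positive branches."""
    true_positive = base_rate * sensitivity                # with disease and positive
    false_positive = (1 - base_rate) * (1 - specificity)   # without disease and positive
    return true_positive / (true_positive + false_positive)

p = prob_disease_given_positive(base_rate=0.01, sensitivity=0.98, specificity=0.95)
print(round(p, 3))   # 0.165

# With a higher base rate (hypothetical 10%), a positive test means much more
print(round(prob_disease_given_positive(0.10, 0.98, 0.95), 3))   # 0.685
```

The low answer of .165 comes from the 1% base rate: the false positives among the many healthy people outnumber the true positives among the few sick ones.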
Recall the earlier quiz we didn’t have:
Mary likes earrings and spends time at festivals shopping
for jewelry. Her boyfriend and several of her close
girlfriends have tattoos. They have encouraged her to also
get a tattoo.
Unknown to you, Mary will be sitting next to you in the
next stat100.2 class.
Which of the following do you think is more likely and why?
A. Mary is a physics major.
B. Mary is a physics major with pierced ears.
An answer of B (Mary is a physics major with
pierced ears) cannot be right, because B can never be
more likely than A. It illustrates the
Conjunction fallacy: assigning higher probability
to a detailed scenario involving the conjunction of
events than to one of the simple events that make up the
conjunction.
A possible cause of this fallacy is the
Representativeness heuristic: leads people to assign higher
probabilities than are warranted to scenarios that are
representative of how we imagine things would happen.
Exercise 1, page 309 (sort of):
Suppose you flip four coins.
• Which is more likely, HHHH or HTTH?
• Which is more likely, four total heads or
two total heads?
Note: These questions are not the same! One of
these questions is often mistakenly answered due to
belief in the “Law of small numbers” (also known as
the Gambler’s Fallacy).
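Both questions can be settled by listing the 16 equally likely sequences of four flips. This is a sketch; the helper name `prob` is made up for illustration.

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely sequences of four fair coin flips
sequences = ["".join(flips) for flips in product("HT", repeat=4)]

def prob(event):
    """Probability of an event, given as a test applied to each sequence."""
    favorable = [s for s in sequences if event(s)]
    return Fraction(len(favorable), len(sequences))

p_hhhh = prob(lambda s: s == "HHHH")
p_htth = prob(lambda s: s == "HTTH")
p_four_heads = prob(lambda s: s.count("H") == 4)
p_two_heads = prob(lambda s: s.count("H") == 2)

print(p_hhhh, p_htth)              # 1/16 1/16
print(p_four_heads, p_two_heads)   # 1/16 3/8
```

Any one specific sequence is as likely as any other, but six different sequences contain exactly two heads and only one contains four.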
Flip a coin repeatedly. Which of the
following is more likely?
Your first seven flips are HHTHTTH
Your first six flips are all heads
(By the way, how do you calculate the exact
probability of each of these events?)
Exercise 15, page 311. What’s the
difference between these two
statements?
“I’m confident that there is at least one set
of matching birthdays in this room”
“I’m confident that there is at least one
person in this room whose birthday matches
my birthday”
Which statement is more likely to be true?
How many possible pairs of people are eligible for
matching in each case? Assume 50 people are in the
room.
With 50 people in the room…
There are 49 possible pairs with me.
There are 49+48+47+…+1 = 1225 total
possible pairs.
Pr (No match with my birthday) = (364/365)^49 = .874
Pr (No match at all) = .030 (and we can
estimate by (364/365)^1225 = .035)
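The three birthday numbers can be reproduced directly. This is a sketch that assumes 365 equally likely birthdays and ignores leap years; the function names are made up for illustration.

```python
def prob_no_match_with_me(n_people):
    """Nobody else in the room shares my birthday."""
    return (364 / 365) ** (n_people - 1)

def prob_no_match_at_all(n_people):
    """Exact probability that all birthdays in the room are different."""
    p = 1.0
    for k in range(n_people):
        p *= (365 - k) / 365
    return p

def approx_no_match_at_all(n_people):
    """Estimate that treats every pair as an independent chance to match."""
    pairs = n_people * (n_people - 1) // 2
    return (364 / 365) ** pairs

print(round(prob_no_match_with_me(50), 3))    # 0.874
print(round(prob_no_match_at_all(50), 3))     # 0.03
print(round(approx_no_match_at_all(50), 3))   # 0.035
```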
Randomized Response: A technique for
asking sensitive questions
Question 1: Have you ever smoked marijuana?
Question 2: Is your mother’s birthday in Jan
through May?
If your father’s birthday is in July through Dec,
answer question 1. Otherwise answer question 2.
Conditional Probabilities

          no        yes       Base rate
Q1        1-p       p         6/12
Q2        7/12      5/12      6/12

Unconditional Probabilities

          no        yes
Q1        .5(1-p)   .5p
Q2        .292      .208
Total yes: .208+.5p

Solve for p: .208+.5p = proportion of observed yeses
in sample
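Solving for p is one line of algebra: p = (proportion of yeses - .208)/.5. In the sketch below the function name `estimate_p` is made up, and the observed proportion of 0.40 is a hypothetical value chosen only for illustration, not class data.

```python
def estimate_p(prop_yes, prob_q1=0.5, prob_yes_q2=5 / 12):
    """Estimate p, the proportion who would answer yes to Question 1.

    Model: Pr(yes) = prob_q1 * p + (1 - prob_q1) * prob_yes_q2.
    """
    return (prop_yes - (1 - prob_q1) * prob_yes_q2) / prob_q1

# Hypothetical sample in which 40% of the answers were "yes"
print(round(estimate_p(0.40), 3))   # 0.383
```

No individual answer reveals which question was answered, yet the proportion for the sensitive question can still be estimated.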