PowerPoint version of Lecture 18

Oct. 17 Statistic for the Day:
In 1996, the percentages of 16-24 yr old high
school finishers enrolled in college were
49% for lower income families
63% for middle income families
78% for higher income families
Assignment: Review for Exam #2, Wednesday, Oct. 19
Chapters 10, 11, 12, 13, 16
Arby’s sandwiches
     Sandwich                           Weight (g)   Calories
 1   Big Montana                           309          590
 2   Giant Roast Beef                      224          450
 3   Regular Roast Beef                    154          320
 4   Beef ‘n Cheddar                       195          440
 5   Super Roast Beef                      230          440
 6   Junior Roast Beef                     125          270
 7   Chicken Breast Fillet                 233          500
 8   Chicken Bacon ‘n Swiss                209          550
 9   Roast Chicken Club                    228          470
10   Market Fresh Turkey Ranch Bacon       379          830
11   Market Fresh Ultimate BLT             293          780
12   Market Fresh Roast Beef Swiss         357          780
13   Market Fresh Roast Ham Swiss          357          700
14   Market Fresh Roast Turkey Swiss       357          720
15   Market Fresh Chicken Salad            322          770
Arby's Sandwiches
[Scatterplot: calories (vertical axis) vs. weight (horizontal axis)]
This type of plot, with two measurements per subject, is called a scatterplot (see p. 166).
Arby's Sandwiches
[Scatterplot: calories (vertical axis) vs. weight (horizontal axis)]
Correlation = 0.94
The correlation measures the strength of the linear relationship between weight and calories.
More on this in the next class.
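As a rough check on the value quoted on this slide, the correlation can be recomputed from the 15 weight and calorie pairs in the sandwich table. This is only a sketch using the standard Pearson formula; the function name `correlation` is made up for this illustration and is not something the lecture defines.

```python
from math import sqrt

# Weight (g) and calories for the 15 Arby's sandwiches, in table order.
weight = [309, 224, 154, 195, 230, 125, 233, 209, 228, 379, 293, 357, 357, 357, 322]
calories = [590, 450, 320, 440, 440, 270, 500, 550, 470, 830, 780, 780, 700, 720, 770]

def correlation(x, y):
    """Pearson correlation of two equal-length lists (hypothetical helper)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = correlation(weight, calories)
print(round(r, 2))
```

The result should agree with the slide's 0.94 after rounding to two decimals.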
Arby's Sandwiches
[Scatterplot: calories (vertical axis) vs. weight (horizontal axis), with a line drawn through the points]
The best-fitting line through the data is called the regression line.
How should we describe this line?
Arby's Sandwiches
[Scatterplot: calories (vertical axis) vs. weight (horizontal axis), with the regression line]
The intercept is 18 in this case and the slope is 2.1.
cal = 18 + (2.1)(wt)
In this class, you don’t need to know how to calculate the slope and
intercept (but see p. 195 if you like formulas).
calories = 18 + (2.1)(weight in grams)
(intercept = 18, slope = 2.1)
For example, if you have a 200g sandwich,
on the average you expect to get about:
18 + (2.1)(200) = 18 + 420 = 438 calories
For a 350g sandwich:
18 + (2.1)(350) = 18 + 735 = 753 calories
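For those who like formulas, here is a small sketch of where the intercept and slope come from and how the predictions are made. It assumes the usual least-squares formulas (slope = co-variation of weight and calories divided by the variation in weight); the helper names `least_squares` and `predict_calories` are invented for this example.

```python
# Weight (g) and calories for the 15 Arby's sandwiches, in table order.
weight = [309, 224, 154, 195, 230, 125, 233, 209, 228, 379, 293, 357, 357, 357, 322]
calories = [590, 450, 320, 440, 440, 270, 500, 550, 470, 830, 780, 780, 700, 720, 770]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line (hypothetical helper)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = least_squares(weight, calories)
print(round(intercept), round(slope, 1))

def predict_calories(grams, intercept=18, slope=2.1):
    """Predicted calories using the rounded coefficients from the slide."""
    return intercept + slope * grams

print(predict_calories(200), predict_calories(350))
```

The fitted coefficients should round to the slide's 18 and 2.1, and the two predictions reproduce the 438 and 753 calories worked out by hand.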
calories = 18 + (2.1)(weight in grams)
(intercept = 18, slope = 2.1)
For every extra gram of weight, you expect an
increase of 2.1 calories in your Arby’s sandwich.
Interpretation of slope: Expected increase in the response variable
for every unit increase (increase of one) in the explanatory variable.
Facts about Correlation:
• +1 means perfect increasing linear relationship
• -1 means perfect decreasing linear relationship
• 0 means no linear relationship
• + means increasing together
• - means one increases and the other decreases
Strength vs. statistical significance
• Even a weak relationship can be statistically significant
(if it is based on a large sample)
• Even a strong relationship can be statistically
insignificant (if it is based on a small sample)
Regression potential pitfalls:
Sometimes we see a strong relationship in absurd examples:
two seemingly unrelated variables have a high correlation.
This often signals the presence of a third variable that is highly
correlated with the other two (confounding). Remember
that correlation does not imply causation.
Also: If you use a regression for prediction, do not
extrapolate too far beyond the range of the observed data.
Vocabulary vs Shoe Size
Regression Plot
Y = -806 + 555 X
Correlation = .985
[Scatterplot: words known (vertical axis, 0 to 2500) vs. shoe size (horizontal axis, 2 to 6), with the regression line]
Outliers
Outliers are data that are not compatible
with the bulk of the data.
They show up in graphical displays
as detached or stray points.
Sometimes they indicate errors in data
input. Some experts estimate that roughly
5% of all data entered is in error.
Sometimes they are the most important
data points.
Put Options (NYTimes, September 26, 2001)
Put options on stocks give buyers the right to sell stock
at a specified price during a certain time. They rise in
value if the underlying stock falls below the strike price.
The value of puts on airline stocks soared on Sept. 17 when
U.S. stock and options markets reopened after a four-day
closure, as airline stocks slid as much as 40 percent.
American Airlines was at $32 prior to the attack. Suppose a
terrorist buys a put option (at say $5 per share) to have the
right to sell at $25. The price after the attack was at $16.
That put option is now more valuable.
R wins machine (D minus R negative for machine)
D wins absentee (D minus R positive for absentee)
From story on p. 442
Regression Plot
absentee = -182.575 + 0.295319 machine - 0.0000285 machine**2
S = 294.363
R-Sq = 62.0 %
R-Sq(adj) = 57.8 %
[Plot: absentee (vertical axis) vs. machine (horizontal axis), showing the fitted regression curve and the 95% PI]
Outliers affect regression lines and correlation (these data aren’t real):
Exercise minutes vs. GPA
[Scatterplot: GPA (vertical axis) vs. Exercise (horizontal axis), with two outlying points labeled A and B]
Black line: With A and B
Red line: Without A, with B
Green line: Without A or B
Two categorical variables:
Explanatory variable: Sex
Response variable: Body Pierced or Not
Survey question:
Have you pierced any other part of your body?
(Except for ears)
Research Question:
Is there a significant difference between women
and men at PSU in terms of body piercings?
Data:
Explanatory: Sex
Response: Body Pierced?
          No    Yes   Total
Women     86    52    138
Men       77     5     82
Total    163    57    220
From STAT 100, fall 2005 (missing responses omitted)
Percentages
Response: body pierced?
          no        yes       All
female    62.32%    37.68%    100.00%
male      93.90%     6.10%    100.00%
All       74.09%    25.91%    100.00%
62.32% = 86 / 138
93.90% = 77 / 82
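The row percentages can be checked with a few lines of code. This is just an illustrative sketch; the dictionary layout and the name `row_percentages` are choices made for this example.

```python
# Observed counts from the body-piercing data: (No, Yes) for each sex.
counts = {"Women": (86, 52), "Men": (77, 5)}

def row_percentages(row):
    """Convert one row of counts into percentages of that row's total."""
    total = sum(row)
    return tuple(round(100 * c / total, 2) for c in row)

percentages = {sex: row_percentages(row) for sex, row in counts.items()}
for sex, (no_pct, yes_pct) in percentages.items():
    print(f"{sex}: {no_pct}% no, {yes_pct}% yes")
```

Each count is divided by its own row total (138 for women, 82 for men), which is why each row sums to 100%.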
Research question: Is there a significant difference
between women and men?
(i.e., between 62.32% and 93.90%)
The Debate:
The research advocate claims that there is a
significant difference.
The skeptic claims there is no real difference.
The data differences simply happen by chance,
since we’ve selected a random sample.
The strategy for determining statistical significance:
• First, figure out what you expect to see if there is no difference
between females and males.
• Second, figure out how far the data is from what is expected.
• Third, decide if the distance in the second step is large.
• Fourth, if large then claim there is a statistically significant
difference.
Exercise: Follow the 4 steps and answer the
Research Question: Is there a statistically
significant difference between males and females in
terms of the percent who have used marijuana?
Data from STAT 100 fall 2005
Rows: Sex
Columns: Marijuana
          No    Yes   All
Female    56    76    132
Male      31    46     77
All       87   122    209
Step 1: Find expected counts if the
skeptic is correct
This step is based on the marginal totals:
          No    Yes   Total
Women     A     B     132
Men       C     D      77
Total     87    122   209
A = (132 × 87) / 209 = 54.95
(Repeat for B, C, D)
Step 1 cont’d
Repeat the process for B (and then C and D):
          No       Yes   Total
Women     54.95    B     132
Men       C        D      77
Total     87       122   209
B = (132 × 122) / 209 = 77.05
Or you can simply subtract:
132 – 54.95 = 77.05
Step 1 cont’d
Observed counts (green on the slide), with expected counts if the skeptic
is correct (red on the slide) in parentheses.
Marijuana?
          No            Yes           All
Female    56 (54.95)    76 (77.05)    132 (132.00)
Male      31 (32.05)    46 (44.95)     77 (77.00)
Total     87            122           209
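Step 1 can be sketched in code: each expected count is (row total × column total) / grand total, the same rule used for A and B. The function name `expected_counts` is invented for this illustration.

```python
# Observed counts: rows are Female, Male; columns are No, Yes.
observed = [[56, 76],
            [31, 46]]

def expected_counts(table):
    """Expected counts under 'no difference', from the marginal totals only."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for row in expected:
    print([round(value, 2) for value in row])
```

Rounded to two decimals, the four values should match the expected counts shown with the observed data.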
Step 2: How far are the data (observed
counts) from what is expected?
For each cell, compute (observed - expected)^2 / expected, where the
expected counts are those that hold if the skeptic is correct:
Female, No:   (56 - 54.95)^2 / 54.95 = .020
Female, Yes:  (76 - 77.05)^2 / 77.05 = .014
Male, No:     (31 - 32.05)^2 / 32.05 = .034
Male, Yes:    (46 - 44.95)^2 / 44.95 = .025
Chi-Sq = 0.020 + 0.034 + 0.014 + 0.025 = 0.093
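The whole of Step 2 fits in one short function. This sketch recomputes the expected counts without rounding them first, so the total may differ from the slide's 0.093 in the third decimal place. The name `chi_squared` is a hypothetical helper, not a library call.

```python
# Observed counts: rows are Female, Male; columns are No, Yes.
observed = [[56, 76],
            [31, 46]]

def chi_squared(table):
    """Sum of (observed - expected)^2 / expected over every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            total += (obs - exp) ** 2 / exp
    return total

chi_sq = chi_squared(observed)
print(round(chi_sq, 2))
```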
Step 3: Is the distance in step 2 large?
Something is large when it is in the outer 5% tail of the appropriate
distribution.
Chi-squared distribution with 1 degree of freedom:
If the chi-squared statistic is larger than 3.84, it is declared large and
the research advocate wins.
[Plot: chi-squared density with 1 degree of freedom; 95% of the area lies on one side of the cutoff and 5% on the other]
Our chi-squared value: 0.093 (from Step 2)
Cutoff = 3.84
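Step 3 amounts to a comparison against the cutoff. The sketch below also computes the tail area directly; for 1 degree of freedom that area equals erfc(sqrt(x / 2)), which is a standard identity but not something the lecture uses, so treat it as an optional extra. The names `is_large` and `p_value_1df` are made up for this example.

```python
from math import erfc, sqrt

CUTOFF_1DF = 3.84  # 5% cutoff from the lecture, 1 degree of freedom

def is_large(chi_sq, cutoff=CUTOFF_1DF):
    """True when the statistic falls in the outer 5% tail."""
    return chi_sq > cutoff

def p_value_1df(chi_sq):
    """Area to the right of chi_sq under the chi-squared curve with 1 df."""
    return erfc(sqrt(chi_sq / 2))

print(is_large(0.093))               # the value from Step 2
print(round(p_value_1df(3.84), 3))   # tail area at the cutoff itself
```

The tail area at 3.84 should come out very close to 0.05, which is exactly what "outer 5% tail" means.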
Step 4: If distance is large, claim statistically
significant difference.
Rows: Sex
Columns: marijuana
          No            Yes           All
Female    56 (42.4%)    76 (57.6%)    132 (100.0%)
Male      31 (40.3%)    46 (59.7%)     77 (100.0%)
Hence, the difference:
57.6% of women versus 59.7% of men
is not statistically significant in this case.
(Sample size has been automatically considered!)
How many degrees of freedom here?
          Too Young   No       Yes   Total
Women     One df      Two df         135
Men                                   81
Total     69          35       112   216
Degrees of freedom (df) always equal
(Number of rows – 1) × (Number of columns – 1)
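The formula is simple enough to state as a one-line function. This is only an illustration; the name `degrees_of_freedom` is chosen here, and the row and column counts exclude the totals.

```python
def degrees_of_freedom(n_rows, n_cols):
    """(rows - 1) x (columns - 1), not counting the total row or column."""
    return (n_rows - 1) * (n_cols - 1)

print(degrees_of_freedom(2, 2))  # two sexes, two response categories
print(degrees_of_freedom(2, 3))  # two sexes, three response categories
```

A table with two rows and two columns has one degree of freedom, which is why the cutoff of 3.84 applies to the marijuana example.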
Health studies and risk
Research question: Do strong electromagnetic fields
cause cancer?
50 dogs randomly split into two groups: no field, yes field
The response is whether they get lymphoma.
Rows: mag field
Columns: cancer
                 no    yes   All
no field         20     5    25
yes field        10    15    25
All              30    20    50
Terminology and jargon:
In the mag field group, 15/25 of the dogs got cancer.
Therefore, the following are all equivalent:
1. 60% of the dogs in this group got cancer.
2. The proportion of dogs in this group that got
cancer is 0.6.
3. The probability that a dog in this group got
cancer is 0.6.
4. The risk of cancer in this group is 0.6
And one more: The odds of cancer in this group are 3/2.
More terminology and jargon:
1. Identify the ‘bad’ response category: In this example, cancer
2. Treatment risk: 15 / 25 or .60 or 60%
3. Baseline risk: 5 / 25 or .20 or 20%
4. Relative risk: Treatment risk over Baseline risk = .60 / .20 = 3. That
is, the treatment risk is three times as large as the baseline risk.
5. Increased risk: By how much does the risk increase for treatment
as compared to control? (.60 - .20) / .20 = 2, or 200%. That is, the
risk is 200% higher in the treatment group.
6. Odds ratio: Ratio of treatment odds to baseline odds.
(15/10) / (5/20) turns out to be 6. That is, the treatment odds are
six times as large as the baseline odds.
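The risk statements above can be reproduced from the four counts in the dog study. The sketch uses exact fractions so the ratios come out as whole numbers; the variable and function names are made up for this example.

```python
from fractions import Fraction

# Counts from the dog study: cancer cases and group sizes.
treated_cancer, treated_total = 15, 25   # magnetic field group
control_cancer, control_total = 5, 25    # no-field group

treatment_risk = Fraction(treated_cancer, treated_total)
baseline_risk = Fraction(control_cancer, control_total)
relative_risk = treatment_risk / baseline_risk
increased_risk = (treatment_risk - baseline_risk) / baseline_risk

def odds(bad, total):
    """Odds of the 'bad' outcome: cases divided by non-cases."""
    return Fraction(bad, total - bad)

odds_ratio = odds(treated_cancer, treated_total) / odds(control_cancer, control_total)

print(float(treatment_risk), float(baseline_risk))
print(relative_risk, increased_risk, odds_ratio)
```

Note that the relative risk (3) and the odds ratio (6) are different numbers for the same data, so it matters which one a report is quoting.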
Final note:
When the chi-squared test is statistically significant
then it makes sense to compute the various risk
statements.
If there is no statistical significance then the skeptic
wins.
There is no evidence in the data for differences in
risk for the categories of the explanatory variable.
Recall marijuana example
Observed counts, with expected counts in parentheses.
Marijuana?
          No            Yes           All
Female    56 (54.95)    76 (77.05)    132 (132.00)
Male      31 (32.05)    46 (44.95)     77 (77.00)
Total     87            122           209
Chi-Sq = 0.020 + 0.034 + 0.014 + 0.025 = 0.093
SO THE SKEPTIC WINS. But what if we observed a
much larger sample? Say, 100 times larger?
Marijuana example, larger sample:
Observed counts, with expected counts in parentheses.
Marijuana?
          No             Yes            All
Female    5600 (5495)    7600 (7705)    13200 (13200)
Male      3100 (3205)    4600 (4495)     7700 (7700)
Total     8700           12200          20900
Chi-Sq = 2.0 + 3.4 + 1.4 + 2.5 = 9.3
NOW THE RESEARCH ADVOCATE WINS.
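The effect of sample size can be seen by multiplying every count by 100 and recomputing the statistic. This is a sketch with a hypothetical helper `chi_squared`; it keeps unrounded expected counts, so its values may differ slightly from the rounded slide figures.

```python
def chi_squared(table):
    """Sum of (observed - expected)^2 / expected over every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / grand_total) ** 2
        / (row_totals[i] * col_totals[j] / grand_total)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )

small = [[56, 76], [31, 46]]                              # original class data
large = [[100 * count for count in row] for row in small]  # same percentages

chi_small = chi_squared(small)
chi_large = chi_squared(large)
print(round(chi_small, 2), round(chi_large, 1))
print(chi_small > 3.84, chi_large > 3.84)
```

The percentages are identical in both tables; only the sample size changed, and the statistic grew by the same factor of 100.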
Practical significance
In the marijuana example, 58% of women and 60% of men
reported that they had tried marijuana. A difference of this
size, even if it really exists in the population, is probably
uninteresting. Yet we have seen that a large sample size
can make it statistically significant.
Hence, in the interpretation of statistical significance, we
should also address the issue of practical significance.
In other words, we should answer the skeptic’s second
question: WHO CARES?
Simpson’s paradox (for quantitative variables)
Price vs. # of pages for 15 books
Example 11.4, pp. 204-205
[Scatterplot: price (vertical axis) vs. pages (horizontal axis)]
Correlation = -.312
Simpson’s paradox (for quantitative variables)
Price vs. # of pages for 15 books
Example 11.4, pp. 204-205
[Scatterplot: price (vertical axis) vs. pages (horizontal axis), with each book plotted as H or S]
All 15 books: Correlation = -.312
H books only: Correlation = .348
S books only: Correlation = .637
Simpson’s paradox for categorical variables,
as seen in video
Overall admitted to City U.
          Number       Percent
Men       198 / 360    55%
Women      88 / 200    44%
Business (hard)
          Number       Percent
Men        18 / 120    15%
Women      24 / 120    20%
Law (easy)
          Number       Percent
Men       180 / 240    75%
Women      64 / 80     80%
Women better in each, but more men apply to easier law school!
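The reversal can be verified directly from the admission counts. This sketch uses exact fractions; the dictionary layout and the helper names `rate` and `overall_rate` are invented for this illustration.

```python
from fractions import Fraction

# (admitted, applied) for each school and sex, from the City U. tables.
admissions = {
    "Business": {"Men": (18, 120), "Women": (24, 120)},
    "Law": {"Men": (180, 240), "Women": (64, 80)},
}

def rate(school, sex):
    """Admission rate within one school."""
    admitted, applied = admissions[school][sex]
    return Fraction(admitted, applied)

def overall_rate(sex):
    """Admission rate after pooling both schools."""
    admitted = sum(admissions[school][sex][0] for school in admissions)
    applied = sum(admissions[school][sex][1] for school in admissions)
    return Fraction(admitted, applied)

for school in admissions:
    print(school, float(rate(school, "Men")), float(rate(school, "Women")))
print("Overall", float(overall_rate("Men")), float(overall_rate("Women")))
```

Women have the higher rate in each school, yet the pooled rate favors men, because 240 of the 360 male applicants applied to the easier school compared with 80 of the 200 female applicants.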
Rules: For combining probabilities
0 ≤ Probability ≤ 1
1. If there are only two possible outcomes, then their probabilities must
sum to 1.
2. If two events cannot happen at the same time, they are called mutually
exclusive. The probability of at least one happening (one or the other)
is the sum of their probabilities. [Rule 1 is a special case of this.]
3. If two events do not influence each other, they are called independent.
The probability that they happen at the same time is the product of their
probabilities.
4. If the occurrence of one event forces the occurrence of another event,
then the probability of the second event is always at least as large as
the probability of the first event.
Rule 1: If there are only two possible
outcomes, then their probabilities
must sum to 1.
According to Example 3, page 302:
P(lost luggage) = 1/176 = .0057
Thus, P(luggage not lost) = 1 – 1/176 = 175/176 = .9943
The point of rule 1 is that P(lost) + P(not lost) = 1
so if we know P(lost), then we can find P(not lost).
Sounds simple, right? It can be surprisingly powerful.
Rule 2: If two events cannot happen at
the same time, they are called
mutually exclusive.
In this case, the probability of at least one happening is
the sum of their probabilities.
[Rule 1 is a special case of this.]
Example 5, page 303:
Suppose P(A in stat) = .50 and P(B in stat) = .30.
Then P( A or B in stat) = .50 + .30 = .80
Note that the events ‘A in stat’ and ‘B in stat’ are mutually exclusive.
Do you see why?
Rule 3: If two events do not influence each
other, they are called independent.
In this case, the probability that they happen at the
same time is the product of their probabilities.
Example 8, page 303:
Suppose you believe that P(A in stat) = .5 and P(A in history) = .6.
Further, you believe that the two events are independent, so that
they do not influence each other.
Then P(A in stat and A in history) = (.5)×(.6) = .3
Is this a reasonable assumption?
Rule 4: If the occurrence of one event forces
the occurrence of another event, then the
probability of the second event is always
at least as large as the probability of the
first event.
If event A forces event B to occur, then P(A) ≤ P(B)
Special case: P(E and F) ≤ P(E)
P(E and F) ≤ P(F)
(because ‘E and F’ forces both E and F to occur).
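The four rules can be checked against the numbers used in the examples. This is a sketch with exact fractions; all variable names are chosen for this illustration, and the independence of the two grades is the lecture's assumption, not an established fact.

```python
from fractions import Fraction

# Rule 1: two possible outcomes, probabilities sum to 1.
p_lost = Fraction(1, 176)
p_not_lost = 1 - p_lost

# Rule 2: mutually exclusive events, add the probabilities.
p_a_in_stat = Fraction(1, 2)
p_b_in_stat = Fraction(3, 10)
p_a_or_b_in_stat = p_a_in_stat + p_b_in_stat

# Rule 3: independent events (assumed), multiply the probabilities.
p_a_in_history = Fraction(3, 5)
p_a_in_both = p_a_in_stat * p_a_in_history

# Rule 4: 'A in both' forces 'A in stat' and 'A in history',
# so it cannot be more likely than either one.
rule_4_holds = p_a_in_both <= p_a_in_stat and p_a_in_both <= p_a_in_history

print(float(p_not_lost), float(p_a_or_b_in_stat), float(p_a_in_both), rule_4_holds)
```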
Two laws (only one of them valid):
• Law of large numbers: Over the long haul, we expect
about 50% heads (this is true).
• “Law of small numbers”: If we’ve seen a lot of tails in a
row, we’re more likely to see heads on the next flip (this
is completely bogus).
Remember: The law of large numbers
OVERWHELMS; it does not COMPENSATE.
The game of Odd Man
Consider the “odd man” game. Three people at lunch toss a coin. The
odd man has to pay the bill.
You are the odd man if you get a head and the other two have tails or if
you get a tail and the other two have heads. Notice that there will not
always be an odd man – this occurs if flips come up HHH or TTT.
P(no odd man) = P(HHH or TTT)
= P(HHH) + P(TTT) since HHH, TTT are mutually exclusive
= (1/2)^3 + (1/2)^3 since H,H,H are independent (as are T,T,T)
= 1/8 + 1/8
= .25
Thus, P(there is an odd man) = 1 – P(no odd man) = 1 - .25 = .75
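The same answer can be found by listing all eight equally likely outcomes of three fair coin flips and counting those with an odd man. This is a brute-force sketch; the helper name `has_odd_man` is invented here.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely results of three fair coin flips.
outcomes = list(product("HT", repeat=3))

def has_odd_man(flips):
    """With three flips, an odd man exists unless all three faces match."""
    return len(set(flips)) == 2

p_odd_man = Fraction(sum(has_odd_man(o) for o in outcomes), len(outcomes))
p_no_odd_man = 1 - p_odd_man
print(p_no_odd_man, p_odd_man)
```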
Play until there is an odd man. What is the
probability this will take exactly three tries?
P(odd man occurs on the third try)
= P(miss, miss, hit) in that order! That’s the only way. (See why?)
= P(miss) P(miss) P(hit) since each try is independent of the others.
= [P(miss)]^2 P(hit)
= [.25]^2 (.75)
= .047 This is the final answer: The probability that the odd man
occurs exactly on the third try (after two unsuccessful tries).
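The miss, miss, hit argument generalizes to any number of tries: multiply one P(miss) for each failed try, then one P(hit). A minimal sketch, with a function name made up for this example:

```python
from fractions import Fraction

def p_first_hit_on(try_number, p_hit):
    """Probability that the first hit comes exactly on the given try."""
    p_miss = 1 - p_hit
    return p_miss ** (try_number - 1) * p_hit

p_third = p_first_hit_on(3, Fraction(3, 4))
print(p_third, round(float(p_third), 3))
```

This relies on the tries being independent, as stated in the derivation.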
Expectation
What if you bet $10 on a game of craps? What is your
expected profit?
(Probability of winning: 244/495, or 49.3%)
You win $10 with probability .493
You lose $10 with probability .507
Expected profit: .493($10) + .507(-$10) = - $0.14
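An expectation is a probability-weighted sum of the possible outcomes. The sketch below uses the exact winning probability 244/495 rather than the rounded .493; the function name `expected_value` is a hypothetical helper.

```python
from fractions import Fraction

p_win = Fraction(244, 495)   # probability of winning at craps
p_lose = 1 - p_win

def expected_value(outcomes):
    """Sum of value x probability over (value, probability) pairs."""
    return sum(value * prob for value, prob in outcomes)

expected_profit = expected_value([(10, p_win), (-10, p_lose)])
print(expected_profit, round(float(expected_profit), 2))
```

The negative sign means a loss: on average the bettor gives up about 14 cents per $10 game.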
Casino winnings, 10,000 games per day
[Histogram: casino winnings for 1000 days; frequency (vertical axis) vs. winnings (horizontal axis, ticks from -2000 to 4000)]
Expectation = $1400
Casino winnings, 100,000 games a day
[Histogram: casino winnings for 1000 days; frequency (vertical axis) vs. winnings (horizontal axis, ticks from 5000 to 25000)]
Expectation = $14,000
Note: Now all values are positive
Your winnings, a single game
We already calculated the expectation to be a loss of 14 cents.
But you can’t lose 14 cents in one game; you either
win 10 dollars or lose 10 dollars.
Thus, the expected value does not have to be a
possible value for any individual case.