Transcript Document
Found StatCrunch Resources
– Use StatCrunch to find correlation between two variables
• http://screencast.com/t/rAbGVY5We8
– Find a Confidence Interval for a population mean using StatCrunch
• http://www.youtube.com/watch?v=G5nw2B9g19c
– StatCrunch Cheat Sheet
• http://www3.jjc.edu/staff/msullivan/Stats/Technology%20Step%20by%20Step%20StatCrunch.pdf
– You might want to have StatCrunch open
Final Project Notes
1. Using the MM207 Student Data Set:
What is the correlation between student cumulative GPA and the number of hours spent on
school work each week? Be sure to include the computations or StatCrunch output to
support your answer.
– Stat->summary stats->correlation
What would be the predicted GPA for a student who spends 16 hours per week on school
work? Be sure to include the computations or StatCrunch output to support your prediction.
– Stat->regression->simple linear
– Choose dependent (y) and independent (x) variable
– Each of the “Next” screens has useful options: “Confidence Intervals”, “Predict Y for x=”, and “Plot fitted line”
• Hit the Next button to see the graph.
• Highlight the tables with the mouse, press Ctrl-C to copy them to the clipboard,
then Ctrl-V to paste them into your document
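For readers who want to check the StatCrunch output by hand, the correlation and the prediction at x = 16 can be sketched in Python. The (hours, GPA) pairs below are hypothetical placeholders, not the MM207 Student Data Set, and the helper names `pearson_r` and `fit_line` are illustrative only.

```python
from math import sqrt

# Hypothetical (hours per week, cumulative GPA) pairs; NOT the MM207 data.
hours = [5, 8, 10, 12, 15, 18, 20, 25]
gpa = [2.8, 3.0, 3.1, 3.2, 3.4, 3.5, 3.6, 3.9]


def sums_of_squares(x, y):
    """Return (Sxx, Syy, Sxy, mean_x, mean_y) for paired data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxx, syy, sxy, mean_x, mean_y


def pearson_r(x, y):
    """Linear correlation coefficient r."""
    sxx, syy, sxy, _, _ = sums_of_squares(x, y)
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    sxx, _, sxy, mean_x, mean_y = sums_of_squares(x, y)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


r = pearson_r(hours, gpa)
slope, intercept = fit_line(hours, gpa)
predicted_gpa = intercept + slope * 16  # same role as "Predict Y for x=" with x = 16

print(f"r = {r:.4f}")
print(f"predicted GPA at 16 hours = {predicted_gpa:.2f}")
```

With the real data, GPA would be the dependent (y) variable and hours the independent (x) variable, matching the choice made in the regression dialog.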
Final Project Notes
2. Bar graphs are great.
– Graphics->bar plot->with data
3. Jonathan is a 42-year-old male student and Mary is a 37-year-old female student
thinking about taking this class. Based on their relative position, which student
would be farther away from the average age of their gender group based on this
sample of MM207 students?
– Compute the z-values
4. If you were to randomly select a student from the set of students who have
completed the survey, what is the probability that you would select a male?
Explain your answer.
Final Project Notes
5. Using the sample of MM207 students: What is the probability of randomly
selecting a person who is conservative and then selecting from that group
someone who is a nursing major? What is the probability of randomly selecting a
liberal or a male?
– Stat->tables->Contingency->with data. Choose q9 and q13

              Business  Legal Studies   IT  Nursing  Psychology  Other  Total
Conservative         4              1    6       17           6     22     56
Liberal              2              3    2       12           8     26     53
Moderate            13              6    2       32          17     49    119
Total               19             10   10       61          31     97    228
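One way to read the first part of question 5 is as a two-step selection: pick a conservative, then pick a nursing major from within the conservatives. Under that reading (an assumption about what the question intends), the counts in the table above give the following sketch.

```python
from fractions import Fraction

# Counts copied from the q9 by q13 contingency table above.
total_students = 228
conservative_total = 56
conservative_nursing = 17

# Step 1: probability that a randomly selected student is conservative.
p_conservative = Fraction(conservative_total, total_students)

# Step 2: probability of a nursing major, selecting only among conservatives.
p_nursing_given_conservative = Fraction(conservative_nursing, conservative_total)

# Multiplication rule: P(A and B) = P(A) * P(B given A).
p_both = p_conservative * p_nursing_given_conservative

print(p_both, round(float(p_both), 4))
```

The product simplifies to the conservative-nursing cell divided by the grand total, which is a quick way to sanity-check the result against the table.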
Final Project Notes
What is the probability of randomly selecting a liberal or a male?
              Female  Male  Total
Conservative      41    15     56
Liberal           44     9     53
Moderate         101    17    118
Total            186    41    227
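The "liberal or male" question is an either/or probability for overlapping events, so the liberal males must not be counted twice. A short sketch using only the counts in the table above (which totals 227 responses, one fewer than the majors table, presumably because of a missing answer):

```python
from fractions import Fraction

# Counts copied from the political view by gender table above.
total = 227
liberal_total = 53
male_total = 41
liberal_male = 9

# Addition rule for overlapping events:
# P(A or B) = P(A) + P(B) - P(A and B)
p_liberal_or_male = Fraction(liberal_total + male_total - liberal_male, total)

# The same table also answers question 4 (probability of selecting a male).
p_male = Fraction(male_total, total)

print(p_liberal_or_male, round(float(p_liberal_or_male), 4))
print(p_male, round(float(p_male), 4))
```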
Final Project Notes
6. Compute the z-score using the standard deviation from the CLT (σ/√n), with n = 25
7. Select a random sample of 30 student responses to question 6, "How many
credit hours are you taking this term?" Using the information from this sample,
and assuming that our data set is a random sample of all Kaplan statistics students,
estimate the average number of credit hours that all Kaplan statistics students are
taking this term using a 95% level of confidence. Be sure to show the data from
your sample and the data to support your estimate.
– To get the sample of 30 students, choose Data->Sample columns, choose q6, enter a sample size of 30, and
hit “Sample Columns”. A popup tells you that a new column, Sample(q6), has been added
to the data.
– Compute the 95% CI from this column: Stat->z-statistics->one sample->with data. Choose the new
Sample(q6) column, hit Next, select Confidence Interval, and calculate.
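The two StatCrunch steps above (draw 30 values, then build a 95% interval) can be mirrored in Python. This is only a sketch: the `q6` list is a randomly generated, hypothetical stand-in for the real credit-hour column, and the interval uses the sample standard deviation in place of σ, which is an assumption about how the z-interval is computed when no σ is supplied.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(207)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for the q6 column (credit hours); NOT the real data.
q6 = [random.choice([5, 6, 10, 11, 12, 15, 18]) for _ in range(230)]

# Step 1: random sample of 30 responses (like Data->Sample columns).
sample_q6 = random.sample(q6, 30)

# Step 2: 95% z-interval for the mean.
z_star = NormalDist().inv_cdf(0.975)          # about 1.96
sample_mean = mean(sample_q6)
std_err = stdev(sample_q6) / sqrt(len(sample_q6))
lower = sample_mean - z_star * std_err
upper = sample_mean + z_star * std_err

print(sample_q6)
print(f"mean = {sample_mean:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the sample is random, every student's interval will differ, which is why the project asks for the sampled data to be shown alongside the estimate.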
Final Project Notes
8. Assume that the MM207 Student Data Set is a random sample of all Kaplan
students; estimate the proportion of all Kaplan students who are female using a
90% level of confidence.
– First get the number of females in the sample: Stat->table->frequency, choose Gender.

Gender  Frequency  Relative Frequency
Female        192          0.82403433
Male           41          0.17596567

– Stat->Proportions->one sample->with summary. Successes=192, observations=233. Hit Next, choose
Confidence Intervals, and calculate.

Proportion  Count  Total  Sample Prop.    Std. Err.   L. Limit    U. Limit
p             192    233    0.82403433  0.024946446  0.7751402  0.87292844

The limits shown equal 0.82403433 ± 1.96 × 0.024946446, so this output is at the 95% level.
Enter 0.90 as the confidence level to answer the question as asked.
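As a cross-check on the output above, the standard error and limits can be recomputed from the two summary numbers (192 successes out of 233). The sketch below uses the usual large-sample formula, p ± z·√(p(1−p)/n); the function name `proportion_ci` is illustrative. It reproduces the printed limits at the 95% level and also gives the 90% interval that question 8 asks for.

```python
from math import sqrt
from statistics import NormalDist

successes, n = 192, 233            # from the Gender frequency table
p_hat = successes / n              # sample proportion
std_err = sqrt(p_hat * (1 - p_hat) / n)


def proportion_ci(level):
    """Two-sided z-interval for a proportion at the given confidence level."""
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return p_hat - z_star * std_err, p_hat + z_star * std_err


ci_95 = proportion_ci(0.95)   # matches the L. Limit / U. Limit shown above
ci_90 = proportion_ci(0.90)   # the level requested in question 8

print(f"p_hat = {p_hat:.8f}, std err = {std_err:.9f}")
print(f"95% CI = ({ci_95[0]:.7f}, {ci_95[1]:.7f})")
print(f"90% CI = ({ci_90[0]:.7f}, {ci_90[1]:.7f})")
```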
Final Project Notes
9. Assume you want to estimate the proportion of students who commute less than 5
miles to work to within 2%. What sample size would you need?
– n = 1/E^2, where E is the margin of error (here E = 0.02)
10. A professor at Kaplan University claims that the average age of all Kaplan students is 36
years old. Use a 95% confidence interval to test the professor's claim. Is the professor's claim
reasonable or not? Explain.
– Stat->z-stat->one sample->with data. Choose q2, hit Next, select confidence intervals,
calculate

Variable               n  Sample Mean   Std. Err.   L. Limit   U. Limit
Q2 How old are you?  237    37.291138  0.67025393  35.977467  38.604813
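Both calculations on this slide are short enough to verify directly. The sketch below applies the n = 1/E^2 rule from question 9 and rebuilds the 95% interval for question 10 from the sample mean and standard error in the output above; the variable names are illustrative.

```python
from fractions import Fraction
from statistics import NormalDist

# Question 9: sample size for a margin of error of 2%, using n = 1/E^2.
margin_of_error = Fraction(2, 100)
sample_size = int(1 / margin_of_error ** 2)

# Question 10: 95% z-interval from the StatCrunch summary values.
sample_mean = 37.291138
std_err = 0.67025393
z_star = NormalDist().inv_cdf(0.975)
lower = sample_mean - z_star * std_err
upper = sample_mean + z_star * std_err

# The claim is reasonable if the claimed mean lies inside the interval.
claimed_mean = 36
claim_is_reasonable = lower <= claimed_mean <= upper

print(f"required sample size = {sample_size}")
print(f"95% CI = ({lower:.6f}, {upper:.6f})")
print(f"claim of {claimed_mean} inside interval: {claim_is_reasonable}")
```

Since 36 falls just inside the lower limit, the interval does not contradict the professor's claim, although the sample mean itself is a little above 36.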
Page 209
z-Scores
z-scores determine how far, in terms of standard
deviations, a given score is from the mean of the
distribution.
z = (value of piece of data − mean) / standard deviation = (x − µ) / σ
Figure 5.22 shows the values on the distribution of IQ scores from Example 6.
Figure 5.22 Standard scores for IQ scores of 85, 100, and 125.
Copyright © 2009 Pearson Education, Inc.
Graph from: http://www.comfsm.fm/~dleeling/statistics/fx_2001_02.html
Percentiles
• Are normally used with lots of data.
• We divide the number of data values by 100,
and that will tell us how many data values are
in each percent.
• The following example has the grocery bills for
300 families for a week.
• There will be 3 data values to each percent, or
30 values for each 10 %.
591 215 150 342 265 426 414 33 426 507 269 116 205 153 199 418 177 106 318 473
52 461 328 172 82 451 150 384 480 68 269 580 191 98 477 468 471 398 68 124
222 551 315 134 249 599 272 210 485 183 535 43 55 150 274 94 331 536 317 446
152 65 358 254 196 209 213 317 447 431 593 162 220 239 129 259 102 92 491 469
35 487 273 216 214 428 282 226 149 271 330 452 216 150 574 538 420 488 170 263
218 256 475 372 110 550 425 59 194 138 518 402 594 184 305 309 146 112 416 390
45 262 183 520 306 597 407 309 558 259 348 272 234 276 261 438 246 407 65 118
481 130 391 441 398 399 407 164 486 149 257 271 446 144 130 238 408 83 157 204
591 86 352 498 351 203 182 418 242 587 566 125 241 369 444 372 405 319 523 391
272 255 542 429 241 227 150 563 419 180 352 506 341 372 314 289 512 243 202 58
244 558 506 551 57 391 328 335 533 32 593 122 506 227 401 108 350 342 212 113
596 265 392 414 73 48 525 513 350 465 44 419 549 534 543 137 176 587 401 490
257 250 129 295 507 267 522 41 522 581 302 42 543 132 275 363 365 181 360 232
238 535 263 488 285 433 380 270 69 511 99 574 49 549 106 516 220 185 344 317
136 391 288 389 85 481 500 77 338 331 488 309 400 372 501 506 307 72 556 569
32 65 99 130 150 184 215 241 261 273 309 341 372 399 419 447 486 507 536 569
33 65 102 130 152 185 216 241 262 274 314 342 372 400 419 451 487 511 538 574
35 68 106 132 153 191 216 242 263 275 315 342 372 401 420 452 488 512 542 574
41 68 106 134 157 194 218 243 263 276 317 344 372 401 425 461 488 513 543 580
42 69 108 136 162 196 220 244 265 282 317 348 380 402 426 465 488 516 543 581
43 72 110 137 164 199 220 246 265 285 317 350 384 405 426 468 490 518 549 587
44 73 112 138 170 202 222 249 267 288 318 350 389 407 428 469 491 520 549 587
45 77 113 144 172 203 226 250 269 289 319 351 390 407 429 471 498 522 550 591
48 82 116 146 176 204 227 254 269 295 328 352 391 407 431 473 500 522 551 591
49 83 118 149 177 205 227 255 270 302 328 352 391 408 433 475 501 523 551 593
52 85 122 149 180 209 232 256 271 305 330 358 391 414 438 477 506 525 556 593
55 86 124 150 181 210 234 257 271 306 331 360 391 414 441 480 506 533 558 594
57 92 125 150 182 212 238 257 272 307 331 363 392 416 444 481 506 534 558 596
58 94 129 150 183 213 238 259 272 309 335 365 398 418 446 481 506 535 563 597
59 98 129 150 183 214 239 259 272 309 338 369 398 418 446 485 507 535 566 599
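The counting rule from the Percentiles slide (300 values, so 3 values per percent and 30 per 10%) can be sketched as follows. To keep the example short, only the 30 smallest bills are listed, read from the first two columns of the sorted table above; the function name is illustrative, and other percentile conventions exist that interpolate between values.

```python
# The 30 smallest grocery bills, from the first two columns of the sorted table.
lowest_30 = [32, 33, 35, 41, 42, 43, 44, 45, 48, 49, 52, 55, 57, 58, 59,
             65, 65, 68, 68, 69, 72, 73, 77, 82, 83, 85, 86, 92, 94, 98]

total_values = 300
values_per_percent = total_values // 100       # 3 data values in each percent
values_per_ten_percent = values_per_percent * 10  # 30 data values in each 10%


def lowest_percent(sorted_values, percent, per_percent):
    """Return the sorted values that make up the lowest `percent` percent."""
    return sorted_values[:percent * per_percent]


bottom_1 = lowest_percent(lowest_30, 1, values_per_percent)
bottom_10 = lowest_percent(lowest_30, 10, values_per_percent)

print("lowest 1%:", bottom_1)
print("largest bill in the lowest 10%:", bottom_10[-1])
```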
The Central Limit Theorem
Page 217
Suppose we take many random samples of size n for a
variable with any distribution (not necessarily a normal
distribution) and record the distribution of the means of each
sample. Then,
1. The distribution of means will be approximately a
normal distribution for large sample sizes.
2. The mean of the distribution of means approaches
the population mean, µ, for large sample sizes.
3. The standard deviation of the distribution of means
approaches σ/√n for large sample sizes, where σ is
the standard deviation of the population.
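A small simulation can make the three statements above concrete. This sketch assumes a uniform population on [0, 1) (chosen only because it is clearly not normal), draws 2,000 samples of size n = 30, and compares the spread of the sample means with σ/√n. The seed, sample count, and population are all illustrative choices.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(1)  # fixed seed so the run is repeatable

n = 30               # size of each sample
num_samples = 2000   # how many samples to draw

# Uniform population on [0, 1): mean 0.5, standard deviation 1/sqrt(12).
population_mean = 0.5
population_sd = 1 / sqrt(12)

# Record the mean of each random sample.
sample_means = [
    mean([random.random() for _ in range(n)])
    for _ in range(num_samples)
]

observed_mean = mean(sample_means)
observed_sd = pstdev(sample_means)
predicted_sd = population_sd / sqrt(n)   # sigma / sqrt(n) from the theorem

print(f"mean of sample means = {observed_mean:.4f} (population mean 0.5)")
print(f"sd of sample means   = {observed_sd:.4f} (predicted {predicted_sd:.4f})")
```

Plotting a histogram of `sample_means` would show the bell shape from statement 1, even though the underlying population is flat.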
Figure 5.26 As the sample size increases (n = 5, 10, 30), the distribution of sample
means approaches a normal distribution, regardless of the shape of the original
distribution. The larger the sample size, the smaller is the standard deviation of the
distribution of sample means.
EXAMPLE 1 Predicting Test Scores
You are a middle school principal and your 100 eighth-graders are
about to take a national standardized test. The test is designed so
that the mean score is µ = 400 with a standard deviation of σ = 70.
Assume the scores are normally distributed.
a. What is the likelihood that one of your eighth-graders, selected
at random, will score below 375 on the exam?
Solution:
a. In dealing with an individual score, we use the method of
standard scores discussed in Section 5.2. Given the mean of
400 and standard deviation of 70, a score of 375 has a standard
score of
z = (data value – mean) / standard deviation = (375 – 400) / 70 = -0.36
EXAMPLE 1 Predicting Test Scores
Solution: (cont.)
According to Table 5.1, a standard score of -0.36
corresponds to about the 36th percentile— that is, 36%
of all students can be expected to score below 375. Thus,
there is about a 0.36 chance that a randomly selected
student will score below 375.
Notice that we need to know that the scores have a
normal distribution in order to make this calculation,
because the table of standard scores applies only to
normal distributions.
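Part (a) can be checked without a printed table by using the normal cumulative distribution directly. This sketch uses Python's standard-library `NormalDist` in place of Table 5.1, so the percentile it reports may differ slightly from the table's rounded value.

```python
from statistics import NormalDist

mu, sigma = 400, 70          # test mean and standard deviation
score = 375

# Standard score for one student's result.
z_individual = (score - mu) / sigma

# Probability that a single randomly selected student scores below 375.
p_individual = NormalDist(mu, sigma).cdf(score)

print(f"z = {z_individual:.2f}")
print(f"P(score < 375) = {p_individual:.4f}")
```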
EXAMPLE 1 Predicting Test Scores
You are a middle school principal and your 100 eighth-graders are about to take
a national standardized test. The test is designed so that the mean score is µ =
400 with a standard deviation of σ = 70. Assume the scores are normally
distributed.
b. Your performance as a principal depends on how well your entire group of
eighth-graders scores on the exam. What is the likelihood that your group of
100 eighth-graders will have a mean score below 375?
Solution:
b. The question about the mean of a group of students must be
handled with the Central Limit Theorem. According to this
theorem, if we take random samples of size n = 100 students
and compute the mean test score of each group, the distribution
of means is approximately normal.
EXAMPLE 1 Predicting Test Scores
Solution: (cont.)
Moreover, the mean of this distribution is µ = 400 and its standard
deviation is σ/√n = 70/√100 = 7. With these values for the mean
and standard deviation, the standard score for a mean test score of
375 is
z = (data value – mean) / standard deviation = (375 – 400) / 7 = -3.57
Table 5.1 shows that a standard score of -3.5 corresponds to the
0.02th percentile, and the standard score in this case is even lower.
In other words, fewer than 0.02% of all random samples of 100
students will have a mean score of less than 375.
EXAMPLE 1 Predicting Test Scores
Solution: (cont.)
Therefore, the chance that a randomly selected group of 100
students will have a mean score below 375 is less than 0.0002,
or about 1 in 5,000.
Notice that this calculation regarding the group mean did not
depend on the individual scores’ having a normal distribution.
This example has an important lesson. The likelihood of an
individual scoring below 375 is more than 1 in 3 (36%), but
the likelihood of a group of 100 students having a mean score
below 375 is less than 1 in 5,000 (0.02%).
In other words, there is much more variation in the scores of
individuals than in the means of groups of individuals.
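The group calculation in part (b) differs from part (a) only in the standard deviation used: σ/√n instead of σ. A minimal sketch, again using the standard-library `NormalDist` rather than Table 5.1:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 400, 70, 100
cutoff = 375

# Standard deviation of the distribution of sample means (Central Limit Theorem).
sd_of_means = sigma / sqrt(n)

# Standard score and probability for the mean of a group of 100 students.
z_group = (cutoff - mu) / sd_of_means
p_group = NormalDist(mu, sd_of_means).cdf(cutoff)

# For comparison: the same cutoff for one individual student.
p_individual = NormalDist(mu, sigma).cdf(cutoff)

print(f"sd of means = {sd_of_means}, z = {z_group:.2f}")
print(f"P(group mean < 375) = {p_group:.6f}")
print(f"P(individual < 375) = {p_individual:.4f}")
```

The two probabilities side by side show the lesson of the example: the same cutoff is common for an individual and extremely rare for a group mean.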
Page 289
Types of Correlation
Figure 7.3 Types of correlation seen on scatter diagrams.
Linear Correlation Coefficient
Page 294
The line of best fit (regression line or the least
squares line) is the line that best fits the data, i.e.
it is closer to the data than any other line.
Regression
Data Set 1, WT is y and HT is x
Cautions in Making Predictions from Best-Fit Lines
1. Don’t expect a best-fit line to give a good prediction unless the
correlation is strong and there are many data points. If the sample
points lie very close to the best-fit line, the correlation is very strong
and the prediction is more likely to be accurate. If the sample points
lie away from the best-fit line by substantial amounts, the correlation
is weak and predictions tend to be much less accurate.
2. Don’t use a best-fit line to make predictions beyond the bounds of
the data points to which the line was fit.
3. A best-fit line based on past data is not necessarily valid now and
might not result in valid predictions of the future.
4. Don’t make predictions about a population that is different from the
population from which the sample data were drawn.
5. Remember that a best-fit line is meaningless when there is no
significant correlation or when the relationship is nonlinear.
EXAMPLE 1 Valid Predictions?
State whether the prediction (or implied prediction) should be trusted
in each of the following cases, and explain why or why not.
You’ve found a best-fit line for a correlation between the number of
hours per day that people exercise and the number of calories
they consume each day. You’ve used this correlation to predict
that a person who exercises 18 hours per day would consume
15,000 calories per day.
Solution:
No one exercises 18 hours per day on an ongoing basis, so this much
exercise must be beyond the bounds of any data collected.
Therefore, a prediction about someone who exercises 18 hours per
day should not be trusted.
EXAMPLE 1 Valid Predictions?
State whether the prediction (or implied prediction) should be trusted
in each of the following cases, and explain why or why not.
Historical data have shown a strong negative correlation between
national birth rates and affluence. That is, countries with greater
affluence tend to have lower birth rates. These data predict a high
birth rate in Russia.
Solution:
We cannot automatically assume that the historical data still apply
today. In fact, Russia currently has a very low birth rate, despite
also having a low level of affluence.
EXAMPLE 1 Valid Predictions?
State whether the prediction (or implied prediction) should be trusted
in each of the following cases, and explain why or why not.
A study in China has discovered correlations that are useful in
designing museum exhibits that Chinese children enjoy. A curator
suggests using this information to design a new museum exhibit
for Atlanta-area school children.
Solution:
The suggestion to use information from the Chinese study for an
Atlanta exhibit assumes that predictions made from correlations in
China also apply to Atlanta. However, given the cultural
differences between China and Atlanta, the curator’s suggestion
should not be considered without more information to back it up.
EXAMPLE 1 Valid Predictions?
State whether the prediction (or implied prediction) should be trusted
in each of the following cases, and explain why or why not.
Scientific studies have shown a very strong correlation between
children’s ingesting of lead and mental retardation. Based on this
correlation, paints containing lead were banned.
Solution:
Given the strength of the correlation and the severity of the
consequences, this prediction and the ban that followed seem
quite reasonable. In fact, later studies established lead as an
actual cause of mental retardation, making the rationale behind
the ban even stronger.
EXAMPLE 1 Valid Predictions?
State whether the prediction (or implied prediction) should be trusted
in each of the following cases, and explain why or why not.
Based on a large data set, you’ve made a scatter diagram for salsa
consumption (per person) versus years of education. The diagram
shows no significant correlation, but you’ve drawn a best-fit line
anyway. The line predicts that someone who consumes a pint of
salsa per week has at least 13 years of education.
Solution:
Because there is no significant correlation, the best-fit line and any
predictions made from it are meaningless.
The square of the correlation coefficient, or r², is
the proportion of the variation in a variable that is
accounted for by the best-fit line.
The use of multiple regression allows the
calculation of a best-fit equation that represents
the best fit between one variable (such as price)
and a combination of two or more other variables
(such as weight and color). The coefficient of
determination, R², tells us the proportion of the
scatter in the data accounted for by the best-fit
equation.
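The statement that r² is the proportion of variation accounted for by the best-fit line can be checked numerically. The sketch below uses a small hypothetical data set (not from the course), fits the least-squares line, and compares r² with 1 − SSE/SST, where SST is the total variation in y and SSE is the variation left over after the line is fit.

```python
from math import sqrt

# Hypothetical paired data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

# Correlation coefficient and least-squares best-fit line.
r = sxy / sqrt(sxx * syy)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Total variation in y, and variation not explained by the line.
sst = syy
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
proportion_explained = 1 - sse / sst

print(f"r^2                  = {r ** 2:.6f}")
print(f"proportion explained = {proportion_explained:.6f}")
```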
EXAMPLE 4 Voter Turnout and Unemployment
Political scientists are interested in knowing what factors affect voter
turnout in elections. One such factor is the unemployment rate. Data
collected in presidential election years since 1964 show a very weak
negative correlation between voter turnout and the unemployment
rate, with a correlation coefficient of about r = -0.1. Based on this
correlation, should we use the unemployment rate to predict voter
turnout in the next presidential election?
Note that there is a scatter diagram of the voter turnout data on page 312.
Solution: The square of the correlation coefficient is r² = (-0.1)² = 0.01,
which means that only about 1% of the variation in the data is
accounted for by the best-fit line. Nearly all of the variation in the data
must therefore be explained by other factors. We conclude that
unemployment is not a reliable predictor of voter turnout.
The Search for Causality
• A correlation may suggest causality, but by itself a
correlation never establishes causality. Much more
evidence is required to establish that one factor causes
another.
• A correlation between two variables may be the result of
either (1) coincidence, (2) a common underlying cause,
or (3) one variable actually having a direct influence on
the other.
• The process of establishing causality is essentially a
process of ruling out the first two explanations.
Determining Causality
• We can rule out coincidence by repeating the experiment many
times or using a large number of subjects in the experiment.
Because coincidences occur randomly, they should not occur
consistently in many subjects or experiments.
• If the controls rule out confounding variables, any remaining effects
must be caused by the variables being studied.
Guidelines for Establishing Causality
If you suspect that a particular variable (the suspected cause) is causing some effect:
1. Look for situations in which the effect is correlated with the suspected
cause even while other factors vary.
2. Among groups that differ only in the presence or absence of the
suspected cause, check that the effect is similarly present or absent.
3. Look for evidence that larger amounts of the suspected cause produce
larger amounts of the effect.
4. If the effect might be produced by other potential causes (besides your
suspected cause), make sure that the effect still remains after accounting
for these other potential causes.
5. If possible, test the suspected cause with an experiment. If the
experiment cannot be performed with humans for ethical reasons,
consider doing the experiment with animals, cell cultures, or computer
models.
6. Try to determine the physical mechanism by which the suspected cause
produces the effect.