
Introduction to Probability and Statistics, Thirteenth Edition
Chapter 12: Linear Regression and Correlation
Correlation & Regression
 Univariate & Bivariate Statistics
U: frequency distribution, mean, mode, range, standard
deviation
 B: correlation – two variables
 Correlation
 linear pattern of relationship between one variable
(x) and another variable (y) – an association
between two variables
 graphical representation of the relationship between two
variables
 Warning:
 No proof of causality
 Cannot assume x causes y

1. Correlation Analysis
• The correlation coefficient measures the strength of the relationship between x and y.

Sample Pearson's correlation coefficient:

    r = S_xy / √(S_xx S_yy)

where

    S_xx = Σxᵢ² − (Σxᵢ)²/n
    S_yy = Σyᵢ² − (Σyᵢ)²/n
    S_xy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n
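As a quick illustration, here is a minimal Python sketch of these formulas; the helper names (sums_of_squares, pearson_r) are illustrative, not anything from the text.

```python
# Minimal sketch of the sums of squares and Pearson's r defined above.
# Function names are illustrative.

def sums_of_squares(x, y):
    n = len(x)
    sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
    syy = sum(yi**2 for yi in y) - sum(y)**2 / n
    sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n
    return sxx, syy, sxy

def pearson_r(x, y):
    """r = S_xy / sqrt(S_xx * S_yy)."""
    sxx, syy, sxy = sums_of_squares(x, y)
    return sxy / (sxx * syy) ** 0.5
```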
Pearson's Correlation Coefficient
• "r" indicates…
  ◦ strength of relationship (strong, weak, or none)
  ◦ direction of relationship
    ◦ positive (direct) – variables move in same direction
    ◦ negative (inverse) – variables move in opposite directions
• r ranges in value from –1.0 to +1.0

    –1.0 (strong negative) ··· 0.0 (no relationship) ··· +1.0 (strong positive)
Limitations of Correlation
• Linearity:
  ◦ can't describe non-linear relationships
  ◦ e.g., relation between anxiety & performance
• No proof of causation
  ◦ Cannot assume x causes y
Some Correlation Patterns
[Scatterplot panels contrasting linear relationships with curvilinear relationships, each plotting y against x.]

Some Correlation Patterns
[Scatterplot panels contrasting strong relationships with weak relationships, each plotting y against x.]
Example
The table shows the heights and weights of n = 10 randomly selected college football players.

Player      1    2    3    4    5    6    7    8    9   10
Height, x  73   71   75   72   72   75   67   69   71   69
Weight, y 185  175  200  210  190  195  150  170  180  175

S_xy = 328, S_xx = 60.4, S_yy = 2610

    r = 328 / √((60.4)(2610)) = .8261
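A quick Python check of this arithmetic with the table's data (a sketch; the variable names are illustrative):

```python
heights = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weights = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]
n = len(heights)

sxx = sum(x**2 for x in heights) - sum(heights)**2 / n      # 60.4
syy = sum(y**2 for y in weights) - sum(weights)**2 / n      # 2610.0
sxy = (sum(x*y for x, y in zip(heights, weights))
       - sum(heights)*sum(weights) / n)                     # 328.0
r = sxy / (sxx * syy) ** 0.5
print(round(r, 4))                                          # 0.8261
```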
Example – scatter plot
[Scatterplot of Weight vs Height for the n = 10 players: weight (150–210) on the vertical axis against height (66–75) on the horizontal axis.]

r = .8261: a strong positive correlation. As the player's height increases, so does his weight.
Inference using r
• The population coefficient of correlation is called ρ ("rho"). We can test for a significant correlation between x and y using a t test:

    H0: ρ = 0
    Ha: ρ ≠ 0

    Test statistic: t = r √((n − 2)/(1 − r²))

    Reject H0 if t > t_{α/2} or t < −t_{α/2} with n − 2 df.
Example
r = .8261. Is there a significant positive correlation between weight and height in the population of all college football players?

    H0: ρ = 0
    Ha: ρ > 0

    Test statistic: t = r √((n − 2)/(1 − r²)) = .8261 √(8/(1 − .8261²)) = 4.15

Use the t-table with n − 2 = 8 df to bound the p-value as p-value < .005. There is a significant positive correlation between weight and height in the population of all college football players.
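A short Python sketch of this test, assuming scipy is available for the one-tailed p-value:

```python
from scipy import stats

r, n = 0.8261, 10
t = r * ((n - 2) / (1 - r**2)) ** 0.5    # test statistic: 4.15
p = stats.t.sf(t, df=n - 2)              # one-tailed p-value for Ha: rho > 0
print(round(t, 2), round(p, 4))          # 4.15, about .0016 (< .005)
```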
2. Linear Regression
Regression: Correlation + Prediction
• Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
  ◦ Dependent variable: denoted Y
  ◦ Independent variables: denoted X1, X2, …, Xk

Example
Let y be the monthly sales revenue for a company. This might be a function of several variables:
  ◦ x1 = advertising expenditure
  ◦ x2 = time of year
  ◦ x3 = state of economy
  ◦ x4 = size of inventory
• We want to predict y using knowledge of x1, x2, x3 and x4.
Some Questions
• Which of the independent variables are useful and which are not?
• How could we create a prediction equation to allow us to predict y using knowledge of x1, x2, x3, etc.?
• How good is this prediction?

We start with the simplest case, in which the response y is a function of a single independent variable, x.
Model Building
A statistical model separates the systematic component of a relationship from the random component:

    Data = systematic component + random errors

In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
A Simple Linear Regression Model
• Explanatory and response variables are numeric.
• The relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (a straight line).
• Model:

    Y = β₀ + β₁x + ε,   ε ~ N(0, σ)

• β₁ > 0 ⇒ positive association
• β₁ < 0 ⇒ negative association
• β₁ = 0 ⇒ no association
Picturing the Simple Linear Regression Model
[Regression plot: the line y = β₀ + β₁x plotted against x, with β₁ = slope, β₀ = intercept, and ε = the error, the vertical deviation of a point from the line.]
Simple Linear Regression Analysis

    Population model: y = α + βx + ε
    Fitted line:      ŷ = a + bx   (e = y − ŷ is the residual)

• y = actual value of a score
• ŷ = predicted value
• Variables:
  ◦ x = independent variable
  ◦ y = dependent variable
• Parameters:
  ◦ α = y-intercept
  ◦ β = slope
  ◦ ε ~ normal distribution with mean 0 and variance σ²
Simple Linear Regression Model…
[Plot of the fitted line ŷ = a + bx: a is the intercept, and the slope b = Δy/Δx.]
The Method of Least Squares
• The equation of the best-fitting line is calculated using a set of n pairs (xi, yi).
• We choose our estimates a and b to estimate α and β so that the vertical distances of the points from the line are minimized.

    Best-fitting line: ŷ = a + bx

    Choose a and b to minimize SSE = Σ(y − ŷ)² = Σ(y − a − bx)²
Least Squares Estimators
Calculate the sums of squares:

    S_xx = Σx² − (Σx)²/n
    S_yy = Σy² − (Σy)²/n
    S_xy = Σxy − (Σx)(Σy)/n

Best-fitting line: ŷ = a + bx, where

    b = S_xy / S_xx   and   a = ȳ − b x̄
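A minimal Python sketch of these estimators (the function name is illustrative):

```python
def least_squares_line(x, y):
    """Return (a, b) for the best-fitting line y-hat = a + b*x."""
    n = len(x)
    sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
    sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n
    b = sxy / sxx                    # slope: S_xy / S_xx
    a = sum(y)/n - b * sum(x)/n      # intercept: y-bar - b * x-bar
    return a, b
```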
Example
The table shows the IQ scores for a random sample of n = 10 college freshmen, along with their final calculus grades.

Student            1   2   3   4   5   6   7   8   9  10
IQ score, x       39  43  21  64  57  47  28  75  34  52
Calculus grade, y 65  78  52  82  92  89  73  98  56  75

Use your calculator to find the sums and sums of squares:

    Σx = 460      Σy = 760
    Σx² = 23634   Σy² = 59816
    Σxy = 36854
    x̄ = 46        ȳ = 76
Example

    S_xx = 23634 − (460)²/10 = 2474
    S_yy = 59816 − (760)²/10 = 2056
    S_xy = 36854 − (460)(760)/10 = 1894

    b = 1894/2474 = .76556   and   a = 76 − .76556(46) = 40.78

Best-fitting line: ŷ = 40.78 + .77x
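A quick Python check of the fit, reproducing the arithmetic above (a sketch with illustrative names):

```python
iq     = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grades = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(iq)

sxx = sum(x**2 for x in iq) - sum(iq)**2 / n                         # 2474.0
sxy = sum(x*y for x, y in zip(iq, grades)) - sum(iq)*sum(grades)/n   # 1894.0
b = sxy / sxx                                                        # 0.76556
a = sum(grades)/n - b * sum(iq)/n                                    # 40.78
print(f"y-hat = {a:.2f} + {b:.5f} x")
```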
The Analysis of Variance
• The total variation in the experiment is measured by the total sum of squares:

    Total SS = S_yy = Σ(y − ȳ)²

The Total SS is divided into two parts:
• SSR (sum of squares for regression): measures the variation explained by using x in the model.
• SSE (sum of squares for error): measures the leftover variation not explained by x.
The Analysis of Variance
We calculate

    SSR = (S_xy)²/S_xx = (1894)²/2474 = 1449.9741

    SSE = Total SS − SSR = S_yy − (S_xy)²/S_xx = 2056 − 1449.9741 = 606.0259
The ANOVA Table
• Total df = n − 1
• Regression df = 1
• Error df = n − 1 − 1 = n − 2
• Mean squares: MSR = SSR/1, MSE = SSE/(n − 2)

Source      df     SS        MS           F
Regression  1      SSR       SSR/1        MSR/MSE
Error       n − 2  SSE       SSE/(n − 2)
Total       n − 1  Total SS
The Calculus Problem

    SSR = (S_xy)²/S_xx = (1894)²/2474 = 1449.9741
    SSE = Total SS − SSR = S_yy − (S_xy)²/S_xx = 2056 − 1449.9741 = 606.0259

Source      df  SS         MS         F
Regression  1   1449.9741  1449.9741  19.14
Error       8   606.0259   75.7532
Total       9   2056.0000
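A short Python sketch reproducing this ANOVA arithmetic from the sums of squares:

```python
syy, sxy, sxx, n = 2056.0, 1894.0, 2474.0, 10

ssr = sxy**2 / sxx        # regression SS: 1449.9741
sse = syy - ssr           # error SS:      606.0259
msr = ssr / 1             # regression mean square
mse = sse / (n - 2)       # error mean square: 75.7532
f   = msr / mse           # F statistic:   19.14
print(round(ssr, 4), round(sse, 4), round(mse, 4), round(f, 2))
```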
Testing the Usefulness of the Model (The F Test)
• You can test the overall usefulness of the model using an F test. If the model is useful, MSR will be large compared to the unexplained variation, MSE.

    H0: the model is not useful in predicting y

    Test statistic: F = MSR/MSE

    Reject H0 if F > F_α with 1 and n − 2 df.

This test is exactly equivalent to the t-test, with t² = F.
Minitab Output

Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x

Predictor  Coef    SE Coef  T     P
Constant   40.784  8.507    4.79  0.001
x          0.7656  0.1750   4.38  0.002

S = 8.70363   R-Sq = 70.5%   R-Sq(adj) = 66.8%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      1   1450.0  1450.0  19.14  0.002
Residual Error  8   606.0   75.8
Total           9   2056.0

The first line gives the least squares regression line; the regression coefficients a and b appear in the Coef column, and the x row tests H0: β = 0. Note that MSE = 75.8 and t² = F (4.38² ≈ 19.14).
Testing the Usefulness of the Model
• The first question to ask is whether the independent variable x is of any use in predicting y.
• If it is not, then the value of y does not change, regardless of the value of x. This implies that the slope of the line, β, is zero.

    H0: β = 0  versus  Ha: β ≠ 0
Testing the Usefulness of the Model
The test statistic is a function of b, our best estimate of β. Using MSE as the best estimate of the random variation σ², we obtain a t statistic:

    Test statistic: t = (b − 0) / √(MSE/S_xx)

which has a t distribution with df = n − 2, or a confidence interval:

    b ± t_{α/2} √(MSE/S_xx)
The Calculus Problem
• Is there a significant relationship between the calculus grades and the IQ scores at the 5% level of significance?

    H0: β = 0  versus  Ha: β ≠ 0

    t = (b − 0) / √(MSE/S_xx) = (.7656 − 0) / √(75.7532/2474) = 4.38

Reject H0 when |t| > 2.306. Since t = 4.38 falls into the rejection region, H0 is rejected. There is a significant linear relationship between the calculus grades and the IQ scores for the population of college freshmen.
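A Python sketch of this slope test, assuming scipy is available for the critical value:

```python
from scipy import stats

b, mse, sxx, n = 0.7656, 75.7532, 2474.0, 10

t = (b - 0) / (mse / sxx) ** 0.5          # 4.38
t_crit = stats.t.ppf(0.975, df=n - 2)     # 2.306 for alpha = .05, 8 df
print(round(t, 2), abs(t) > t_crit)       # 4.38, True -> reject H0
```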
Measuring the Strength of the Relationship
• If the independent variable x is useful in predicting y, you will want to know how well the model fits.
• The strength of the relationship between x and y can be measured using:

    Correlation coefficient: r = S_xy / √(S_xx S_yy)

    Coefficient of determination: r² = (S_xy)² / (S_xx S_yy) = SSR / Total SS
Measuring the Strength of the Relationship
• Since Total SS = SSR + SSE, r² measures
  ◦ the proportion of the total variation in the responses that can be explained by using the independent variable x in the model.
  ◦ the percent reduction in the total variation achieved by using the regression equation rather than just using the sample mean ȳ to estimate y.

    r² = SSR / Total SS

For the calculus problem, r² = .705, or 70.5%, meaning that 70.5% of the variability in calculus scores can be explained by the model.
Estimation and Prediction
To estimate the average value of y when x = x₀ (confidence interval):

    ŷ ± t_{α/2} √( MSE (1/n + (x₀ − x̄)²/S_xx) )

To predict a particular value of y when x = x₀ (prediction interval):

    ŷ ± t_{α/2} √( MSE (1 + 1/n + (x₀ − x̄)²/S_xx) )
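A minimal Python sketch of both formulas; `interval` is an illustrative helper, with `predict` switching on the extra MSE term for a prediction interval:

```python
def interval(x0, x, a, b, mse, t_crit, predict=False):
    """CI for E(y) (predict=False) or PI for a single y (predict=True) at x0."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
    y_hat = a + b * x0
    extra = 1.0 if predict else 0.0          # the "1 +" term for prediction
    half = t_crit * (mse * (extra + 1/n + (x0 - xbar)**2 / sxx)) ** 0.5
    return y_hat - half, y_hat + half
```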
The Calculus Problem
• Estimate the average calculus grade for students whose IQ score is 50 with a 95% confidence interval.

    ŷ = 40.78424 + .76556(50) = 79.06

    ŷ ± 2.306 √( 75.7532 (1/10 + (50 − 46)²/2474) )

    79.06 ± 6.55, or 72.51 to 85.61
The Calculus Problem
• Predict the calculus grade for a particular student whose IQ score is 50 with a 95% prediction interval.

    ŷ = 40.78424 + .76556(50) = 79.06

    ŷ ± 2.306 √( 75.7532 (1 + 1/10 + (50 − 46)²/2474) )

    79.06 ± 21.11, or 57.95 to 100.17

Notice how much wider this interval is!
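A quick Python check of both intervals at x₀ = 50, using the values computed earlier:

```python
a, b, mse, sxx = 40.78424, 0.76556, 75.7532, 2474.0
n, xbar, x0, t_crit = 10, 46.0, 50.0, 2.306

y_hat = a + b * x0                                               # 79.06
ci = t_crit * (mse * (1/n + (x0 - xbar)**2 / sxx)) ** 0.5        # 6.55
pi = t_crit * (mse * (1 + 1/n + (x0 - xbar)**2 / sxx)) ** 0.5    # 21.11
print(f"CI: {y_hat - ci:.2f} to {y_hat + ci:.2f}")               # 72.51 to 85.61
print(f"PI: {y_hat - pi:.2f} to {y_hat + pi:.2f}")               # 57.95 to 100.17
```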
Minitab Output
Confidence and prediction intervals when x = 50:

Predicted Values for New Observations
New Obs  Fit    SE Fit  95.0% CI        95.0% PI
1        79.06  2.84    (72.51, 85.61)  (57.95, 100.17)

Values of Predictors for New Observations
New Obs  x
1        50.0

[Fitted Line Plot: y = 40.78 + 0.7656 x with 95% CI and 95% PI bands; S = 8.70363, R-Sq = 70.5%, R-Sq(adj) = 66.8%.]

Both intervals are narrowest when x = x̄, and the prediction bands are always wider than the confidence bands.
Estimation and Prediction
• Once you have
  ◦ determined that the regression line is useful, and
  ◦ used the diagnostic plots to check for violation of the regression assumptions,
• you are ready to use the regression line to
  ◦ estimate the average value of y for a given value of x, or
  ◦ predict a particular value of y for a given value of x.

Estimation and Prediction
• The best estimate of either E(y) or y for a given value x = x₀ is

    ŷ = a + bx₀

• Particular values of y are more difficult to predict, requiring a wider range of values in the prediction interval.
Regression Assumptions
• Remember that the results of a regression analysis are only valid when the necessary assumptions have been satisfied.

Assumptions:
1. The relationship between x and y is linear, given by y = α + βx + ε.
2. The random error terms ε are independent and, for any value of x, have a normal distribution with mean 0 and constant variance, σ².
Diagnostic Tools
1. Normal probability plot or histogram of residuals
2. Plot of residuals versus fits or residuals versus variables
3. Plot of residuals versus order
Residuals
• The residual error is the "leftover" variation in each data point after the variation explained by the regression model has been removed:

    Residual = yᵢ − ŷᵢ = yᵢ − (a + bxᵢ)

• If all assumptions have been met, these residuals should be normal, with mean 0 and variance σ².
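A minimal Python sketch computing these residuals for the calculus data (illustrative only):

```python
iq     = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grades = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
a, b = 40.78424, 0.76556    # fitted intercept and slope from earlier

residuals = [y - (a + b * x) for x, y in zip(iq, grades)]
print([round(e, 2) for e in residuals])   # should scatter around 0
```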
Normal Probability Plot
• If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right.
• If not, you will often see the pattern fail in the tails of the graph.

[Normal probability plot of the residuals (response is y): percent versus residual, residuals ranging from −20 to 20.]
Residuals versus Fits
• If the equal variance assumption is valid, the plot should appear as a random scatter around the zero center line.
• If not, you will see a pattern in the residuals.

[Residuals versus the fitted values (response is y): residuals from −10 to 15 plotted against fitted values from 60 to 100.]