Transcript document

MGS 3100
Business Analysis
Regression
Oct 6, 2014
Georgia State University - Confidential
MGS3100_04.ppt/Oct 6, 2014/Page 1
Agenda
Overview of the Regression
Regression Statistics & ANOVA
Statistical Significance
What is Regression Analysis?
•
The regression procedure is used when you are interested in describing the
linear relationship between the independent variables and a dependent
variable.
•
A line in a two dimensional or two-variable space is defined by the equation
Y=a+b*X
•
In full text: the Y variable can be expressed in terms of a constant (a) and a
slope (b) times the X variable.
•
The constant is also referred to as the intercept, and the slope as the
regression coefficient or B coefficient.
•
For example, GPA may best be predicted as 1+.02*IQ. Thus, knowing that
a student has an IQ of 130 would lead us to predict that her GPA would be
3.6 (since 1 + .02*130 = 3.6).
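The GPA example can be checked in a few lines of Python. This is only a minimal sketch: the function name `predict_gpa` is an illustrative assumption, and the intercept and slope are the ones quoted on the slide.

```python
def predict_gpa(iq, intercept=1.0, slope=0.02):
    """Evaluate the line Y = a + b*X for the slide's GPA example."""
    return intercept + slope * iq


# Predicted GPA for a student with an IQ of 130.
predicted = predict_gpa(130)
print(round(predicted, 2))
```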
What is Regression Analysis? (continued)
•
In the multivariate case, when there is more than one independent variable,
the regression line cannot be visualized in the two dimensional space, but
can be computed just as easily.
•
For example, if in addition to IQ we had additional predictors of
achievement (e.g., Motivation, Self-discipline) we could construct a linear
equation containing all those variables. In general then, multiple regression
procedures will estimate a linear equation of the form:
Y = a + b1*X1 + b2*X2 + ... + bp*Xp
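The multiple regression equation can be evaluated the same way as the two-variable line. The sketch below assumes hypothetical coefficients and scores (none of these numbers come from the slides); it only illustrates the arithmetic of Y = a + b1*X1 + ... + bp*Xp, not how the coefficients are estimated.

```python
def predict(intercept, coefficients, predictors):
    """Evaluate Y = a + b1*X1 + ... + bp*Xp for one observation."""
    if len(coefficients) != len(predictors):
        raise ValueError("need exactly one coefficient per predictor")
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))


# Hypothetical values, chosen only to illustrate the calculation.
a = 0.5
b = [0.02, 0.10, 0.05]   # coefficients for IQ, Motivation, Self-discipline
x = [130, 4, 6]          # one student's scores on the three predictors

gpa = predict(a, b, x)
print(round(gpa, 2))
```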
Regression Statistics & ANOVA
1) Predicted and Residual Scores
•
The regression line expresses the best prediction of the dependent
variable (Y), given the independent variables (X).
•
However, nature is rarely (if ever) perfectly predictable, and usually there is
substantial variation of the observed points around the fitted regression line
(as in the scatterplot shown earlier). The deviation of a particular point
from the regression line (its predicted value) is called the residual
value.
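Predicted and residual values can be illustrated with a small least-squares fit. This is a sketch under stated assumptions: the five data points are made up, and `fit_line` is a hypothetical helper that applies the standard least-squares formulas for one predictor.

```python
def fit_line(xs, ys):
    """Return the least-squares intercept a and slope b for Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
predicted = [a + b * x for x in xs]                   # points on the line
residuals = [y - p for y, p in zip(ys, predicted)]    # observed minus predicted

for x, y, p, r in zip(xs, ys, predicted, residuals):
    print(f"X={x}  Y={y}  predicted={p:.2f}  residual={r:+.2f}")
```

With an intercept in the model, the residuals of a least-squares fit sum to zero, which is a quick sanity check on the calculation.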
2) Residual Variance and R-square
•
The smaller the variability of the residual values around the regression line
relative to the overall variability, the better our prediction.
•
For example, if there is no relationship between the X and Y variables, then the
ratio of the residual variability of the Y variable to the original variance is equal to
1.0. If X and Y are perfectly related then there is no residual variance and
the ratio of variance would be 0.0. In most cases, the ratio would fall
somewhere between these extremes, that is, between 0.0 and 1.0.
•
1.0 minus this ratio is referred to as R-square or the coefficient of
determination. This value is immediately interpretable in the following manner. If
we have an R-square of 0.4 then we know that the variability of the Y values
around the regression line is 1-0.4 times the original variance; in other words we
have explained 40% of the original variability, and are left with 60% residual
variability.
•
Ideally, we would like to explain most if not all of the original variability. The
R-square value is an indicator of how well the model fits the data (e.g., an
R-square close to 1.0 indicates that we have accounted for almost all of the
variability with the variables specified in the model).
2) R-square
•
A measure of how much of the variation in Y is explained by X.
•
R-square = SSR / SST
•
SSR – SS (Regression)
•
SST – SS (Total)
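The two descriptions of R-square (one minus the residual ratio, and SSR / SST) give the same number for a least-squares line. The sketch below assumes a small made-up data set and the variable names `sst`, `sse` and `ssr`; SSE here denotes the residual sum of squares.

```python
# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept for one predictor.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in xs]

sst = sum((y - mean_y) ** 2 for y in ys)                    # SS (Total)
sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))      # SS (Residual)
ssr = sum((p - mean_y) ** 2 for p in predicted)             # SS (Regression)

r_square_from_ssr = ssr / sst
r_square_from_ratio = 1 - sse / sst

print(round(r_square_from_ssr, 4), round(r_square_from_ratio, 4))
```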
3) Adjusted R-square
•
The adjusted R-square is a corrected version of R-square that is always equal
to or smaller than the regular R-square. It adjusts for a bias in R-square.
•
R-square tends to overestimate the variance accounted for, compared with an
estimate that would be obtained from the population. There are two reasons
for the overestimate: a large number of predictors and a small sample size.
So, with a large sample and few predictors, the adjusted R-square should
be very similar to the R-square value. Researchers and statisticians differ on
whether to use the adjusted R-square. It is probably a good idea to look at
it to see how much your R-square might be inflated, especially with a small
sample and many predictors.
•
Adjusted R-square = 1 – [MSE / (SST/(n – 1))]
•
MSE – MS (Residual)
•
SST – SS (Total)
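The effect of sample size and number of predictors on the adjustment can be sketched as follows. The code assumes the standard form of the adjustment, 1 - (1 - R-square)(n - 1)/(n - k - 1), which corresponds to using the residual mean square, SS (Residual)/(n - k - 1), in the numerator; `adjusted_r_square`, `n` (sample size) and `k` (number of predictors) are illustrative names, and the R-square of 0.6 is a made-up value.

```python
def adjusted_r_square(r_square, n, k):
    """Adjusted R-square for n observations and k predictors."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)


r_square = 0.6  # hypothetical value

# Small sample, one predictor: noticeable shrinkage.
print(round(adjusted_r_square(r_square, n=5, k=1), 4))
# Large sample, one predictor: almost no shrinkage.
print(round(adjusted_r_square(r_square, n=500, k=1), 4))
# Small sample, many predictors: heavy shrinkage.
print(round(adjusted_r_square(r_square, n=20, k=10), 4))
```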
4) Coefficient R (Multiple R)
•
Customarily, the degree to which two or more predictors (independent or X
variables) are related to the dependent (Y) variable is expressed in the
correlation coefficient R, which is the square root of R-square. In multiple
regression, R can assume values between 0 and 1.
•
To interpret the direction of the relationship between variables, one looks at the
signs (plus or minus) of the regression or B coefficients. If a B coefficient is
positive, then the relationship of this variable with the dependent variable is
positive (e.g., the greater the IQ the better the grade point average); if the B
coefficient is negative then the relationship is negative (e.g., the lower the class
size the better the average test scores). Of course, if the B coefficient is equal to
0 then there is no relationship between the variables.
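The split between strength (R) and direction (sign of the B coefficient) can be seen in a one-predictor example. This is a sketch on made-up data; `simple_regression_summary` is a hypothetical helper, and it relies on the fact that with a single predictor R-square equals the squared correlation between X and Y.

```python
import math


def simple_regression_summary(xs, ys):
    """Return (slope, R-square, multiple R) for a one-predictor regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    r_square = sxy ** 2 / (sxx * syy)
    return slope, r_square, math.sqrt(r_square)


# Made-up data with a downward trend.
xs = [1, 2, 3, 4, 5]
ys = [5, 4, 5, 4, 2]

slope, r_square, multiple_r = simple_regression_summary(xs, ys)
# R is non-negative; the negative slope carries the direction.
print(round(slope, 2), round(r_square, 2), round(multiple_r, 4))
```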
5) ANOVA
•
In general, the purpose of analysis of variance (ANOVA) is to test for
significant differences between means.
•
At the heart of ANOVA is the fact that variances can be divided up, that is,
partitioned. Remember that the variance is computed as the sum of
squared deviations from the overall mean, divided by n-1 (sample size
minus one). Thus, given a certain n, the variance is a function of the sums
of (deviation) squares, or SS for short. Partitioning of variance works as
follows. Consider the following data set:
                          Group 1    Group 2
Observation 1                2          6
Observation 2                3          7
Observation 3                1          5
Mean                         2          6
Sums of Squares (SS)         2          2
Overall Mean                       4
Total Sums of Squares             28
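The partitioning in this data set can be reproduced directly. The sketch below uses the six observations from the slide; the names `ss_within` and `ss_between` are assumptions introduced here for the two parts of the total sum of squares.

```python
def sum_of_squares(values, center):
    """Sum of squared deviations of the values from a given center."""
    return sum((v - center) ** 2 for v in values)


groups = {"Group 1": [2, 3, 1], "Group 2": [6, 7, 5]}

all_values = [v for values in groups.values() for v in values]
overall_mean = sum(all_values) / len(all_values)

# Total variability around the overall mean.
ss_total = sum_of_squares(all_values, overall_mean)

# Variability of each observation around its own group mean.
ss_within = sum(
    sum_of_squares(values, sum(values) / len(values)) for values in groups.values()
)

# Variability of the group means around the overall mean.
ss_between = sum(
    len(values) * (sum(values) / len(values) - overall_mean) ** 2
    for values in groups.values()
)

print(overall_mean, ss_total, ss_within, ss_between)
```

The within-group part (2 + 2) and the between-group part together account for the total of 28.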
6) Degrees of Freedom (df)
•
Statisticians use the term "degrees of freedom" to describe the number of
values in the final calculation of a statistic that are free to vary. Consider, for
example, the statistic s-square.
              df
Regression    = Number of independent variables (k)
Residual      = n - k - 1
Total         = n - 1
7) S square & Sums of (deviation) squares
•
The statistic s-square is a measure on a random sample that is used to
estimate the variance of the population from which the sample is drawn.
•
Numerically, it is the sum of the squared deviations around the mean of a
random sample divided by the sample size minus one.
•
Regardless of the size of the population, and regardless of the size of the
random sample, it can be shown algebraically that if we repeatedly took
random samples of the same size from the same population and calculated
the variance estimate on each sample, these values would cluster around
the exact value of the population variance. In short, the statistic s-square
is an unbiased estimate of the variance of the population from which a
sample is drawn.
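The "repeated samples" argument can be tried out with a small simulation. This is only a sketch: the population (the integers 1 to 10), the sample size of 5, the number of repetitions and the seed are all arbitrary assumptions, and samples are drawn with replacement so that the draws are independent.

```python
import random


def s_square(sample):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((v - mean) ** 2 for v in sample) / (n - 1)


# Hypothetical population and its exact variance (divide by N).
population = list(range(1, 11))
pop_mean = sum(population) / len(population)
pop_variance = sum((v - pop_mean) ** 2 for v in population) / len(population)

rng = random.Random(42)  # fixed seed so the run is repeatable
estimates = [s_square(rng.choices(population, k=5)) for _ in range(20000)]
average_estimate = sum(estimates) / len(estimates)

print("population variance:", pop_variance)
print("average s-square over repeated samples:", round(average_estimate, 2))
```

Individual estimates vary widely from sample to sample; it is their long-run average that settles near the population variance.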
7) S square & Sums of (deviation) squares
•
When the regression model is used for prediction, the error (the amount of
uncertainty that remains) is the variability about the regression line. This
is the Residual Sum of Squares ("residual" meaning left over). It is sometimes
called the Error Sum of Squares. The Regression Sum of Squares is the
difference between the Total Sum of Squares and the Residual Sum of
Squares. Since the Total Sum of Squares is the total amount of variability in
the response, and the Residual Sum of Squares is the amount that still cannot
be accounted for after the regression model is fitted, the Regression Sum of
Squares is the amount of variability in the response that is accounted for by
the regression model.
8) Mean Square Error
•
ANOVA is a good example of why many statistical tests represent ratios of
explained to unexplained variability. The mean square error is an estimate
of the population variance based on the variability among a given set of
measures, that is, on the average of all the s-square values within the
several samples.
              df    SS           MS
Regression     1    115424.56    115424.56
Residual      14    33361.38     2382.96
Total         15    148785.94
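The MS column of this table is each SS divided by its df, which can be checked as below. The sketch uses the figures from the slide; the variable names are assumptions, and the final ratio of the two mean squares is included only to illustrate the "explained to unexplained" comparison.

```python
# Figures taken from the ANOVA table on the slide.
ss_regression = 115424.56
ss_residual = 33361.38
df_regression = 1
df_residual = 14

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual

# Mean square = sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual      # the mean square error

# Ratio of explained to unexplained variability.
ratio = ms_regression / ms_residual

print(round(ss_total, 2), df_total)
print(round(ms_regression, 2), round(ms_residual, 2))
print(round(ratio, 2))
```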
Statistical Significance
1) What is "statistical significance" (p-value)
•
The statistical significance of a result (its p-value) is the probability that
a relationship (e.g., between variables) or a difference (e.g., between means)
at least as strong as the one observed in a sample would occur by pure chance
("luck of the draw") if, in the population from which the sample was drawn,
no such relationship or difference existed. Using less technical terms, one
could say that the statistical significance of a result tells us something
about the degree to which the result is "true" (in the sense of being
"representative of the population").
•
More technically, the p-value represents a decreasing index of
the reliability of a result (see Brownlee, 1960). The higher the p-value, the
less we can believe that the observed relation between variables in the
sample is a reliable indicator of the relation between the respective
variables in the population.
•
Specifically, the p-value indicates the risk of error that is
involved in accepting our observed result as valid, that is, as
"representative of the population": it is the probability of obtaining a
result at least this strong when no relationship exists in the population.
1) What is "statistical significance" (p-value)
•
For example, a p-value of .05 (i.e., 1/20) indicates that, if there were no
relation between the variables in the population, a relation at least as strong
as the one found in our sample would arise as a "fluke" 5% of the time. In other
words, assuming that in the population there was no relation between those
variables whatsoever, and we were repeating experiments like ours one after
another, we could expect that approximately in every 20 replications of the
experiment there would be one in which the relation between the variables in
question would be equal to or stronger than in ours. (Note that this is not the
same as saying that, given that there IS a relationship between the variables,
we can expect to replicate the results 5% of the time or 95% of the time; when
there is a relationship between the variables in the population, the probability
of replicating the study and finding that relationship is related to the
statistical power of the design.)
•
In many areas of research, the p-value of .05 is customarily treated as a
"border-line acceptable" error level. It identifies a significant trend.
•
Typically, in many sciences, results that yield p ≤ .05 are considered borderline statistically
significant, but remember that this level of significance still involves a pretty high
probability of error (5%). Results that are significant at the p ≤ .01 level are commonly
considered statistically significant, and p ≤ .005 or p ≤ .001 levels are often called "highly"
significant. But remember that those classifications represent nothing else but arbitrary
conventions that are only informally based on general research experience.
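The "repeating experiments when there is no relation" idea can be imitated by shuffling. The sketch below is a permutation approach on made-up data, not the table-based calculation that regression software normally reports: shuffling Y breaks any link with X, and the estimated p-value is the share of shuffles whose correlation is at least as strong as the observed one. The helper names, data, seed and number of replications are all assumptions.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, replications=5000, seed=0):
    """Share of shuffled data sets with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(replications):
        rng.shuffle(shuffled)          # simulates "no relation in the population"
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / replications


# Made-up data with a strong upward trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8]

p_value = permutation_p_value(xs, ys)
print("estimated p-value:", p_value)
```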
2) What is "statistical significance" (F-test & t-test)
F test
•
The F test employs the statistic (F) to test various statistical hypotheses about
the mean (or means) of the distributions from which a sample or a set of
samples have been drawn. The t test is a special form of the F test.
F-value
•
The F-value is the ratio MSR/MSE. This shows the ratio of the average
variability that is explained by the regression to the average variability that
is still unexplained. Thus, the higher the F, the better the model, and the more
confidence we have that the model that we derived from sample data actually
applies to the whole population, and is not just an aberration found in the sample.
Significance of F
•
This value is obtained from standardized F tables, taking into account the
F-value and your sample size.
•
If the significance of F is lower than an alpha of 0.05, the overall regression
model is significant.
t-test
•
The t test employs the statistic (t) to test a given statistical hypothesis about the
mean of a population (or about the means of two populations).
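One place where the link between the t test and the F test shows up is a regression with a single predictor: the squared t statistic for the slope equals the F-value of the model. The sketch below checks this on made-up data; the variable names and the standard-error formula for the slope, sqrt(MSE / Sxx), are stated here as assumptions of the example rather than taken from the slides.

```python
import math

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

slope = sxy / sxx
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in xs]

ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
ss_regression = sum((p - mean_y) ** 2 for p in predicted)

msr = ss_regression / 1          # one predictor, so df = 1
mse = ss_residual / (n - 2)      # residual df = n - k - 1 with k = 1

f_value = msr / mse
t_value = slope / math.sqrt(mse / sxx)   # slope divided by its standard error

print(round(f_value, 4), round(t_value, 4), round(t_value ** 2, 4))
```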