Working with Your Data (Chapter 2 in the Little SAS Book)


Using Basic Graphical and Statistical Procedures
(Chapter 8 in the Little SAS Book)
Animal Science 500
Lecture No. 7
September 21, 2010
IOWA STATE UNIVERSITY
Department of Animal Science
SAS Graphical Capabilities
- SAS has an extensive graphical ability
- Can graph your distribution with a normal distribution overlay
- Can graph various bar graphs
- However, it may not be as intuitive to use
- Various styles of graphs can be used
SAS Graphical Capabilities
- Many other available programs are easier to use and more intuitive
- Other programs with graphical capabilities interface more easily with word processing and other software
Assumptions of the Analysis of Variance
- The analysis of variance has basic assumptions
  1. Treatments are randomly applied to experimental units
  2. Independence of residuals (ε_ij) within groups
  3. Homogeneity of residual variances among groups
  4. Treatment observations are normally distributed
Proc Univariate

Proc Univariate can be used to request a variety of statistics to summarize the data distribution of each analysis variable:
1. Sample moments
2. Basic measures of location and variability
3. Confidence intervals for the mean, standard deviation, and variance
4. Tests for location
5. Tests for normality
6. Trimmed and Winsorized means
7. Robust estimates of scale
8. Quantiles and related confidence intervals
9. Extreme observations and extreme values
10. Frequency counts for observations
11. Missing values
Proc Univariate
Using various options in the PROC UNIVARIATE statement, the user can do the following:
1. Specify the input data set to be analyzed
2. Specify a graphics catalog for saving traditional graphics output
3. Specify rounding units for variable values
4. Specify the definition used to calculate percentiles
5. Specify the divisor used to calculate variances and standard deviations
6. Request that plots be produced on line printers and define special printing characters used for features
7. Suppress tables
8. Save statistics in an output data set
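
As a minimal sketch of how several of these options fit together, the step below assumes a data set named hsb2 in the WORK library that contains the numeric variable write (the variable shown in the output that follows). The data set name, its location, the option values, and the output data set name write_stats are assumptions for illustration only.

```sas
/* Sketch only: hsb2, write_stats and the option values are illustrative assumptions */
proc univariate data=hsb2      /* input data set to be analyzed                    */
                round=1        /* rounding unit for variable values                */
                pctldef=5      /* definition used to calculate percentiles         */
                vardef=df      /* variance divisor N-1 (the default)               */
                plots          /* request the distribution plots                   */
                normal;        /* request tests for normality                      */
   var write;
   /* The OUTPUT statement is one way to save selected statistics in a data set */
   output out=write_stats n=n_write mean=mean_write std=sd_write;
run;
```

Adding NOPRINT to the PROC UNIVARIATE statement would suppress the printed tables while still creating the output data set.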

Proc Univariate Output
The UNIVARIATE Procedure
Variable: write (writing score)

                           Moments
N                        200    Sum Weights              200
Mean                  52.775    Sum Observations       10555
Std Deviation     9.47858602    Variance           89.843593
Skewness          -0.4820386    Kurtosis          -0.7502476
Uncorrected SS        574919    Corrected SS       17878.875
Coeff Variation   17.9603714    Std Error Mean    0.67023725

             Basic Statistical Measures
    Location                  Variability
Mean     52.77500     Std Deviation            9.47859
Median   54.00000     Variance                89.84359
Mode     59.00000     Range                   36.00000
                      Interquartile Range     14.50000
Proc Univariate Output meaning
a. Moments - Moments are statistical summaries of a distribution.
b. N - This is the number of valid observations for the variable. The total number of observations is
the sum of N and the number of missing values. If there are missing values for the variable, proc
univariate will output the statistics about the missing values, such as the number and the
percentage of missing values.
c. Mean - This is the arithmetic mean across the observations. It is the most widely used measure of
central tendency. It is commonly called the average. The mean is sensitive to extremely large or
small values.
d. Std Deviation - Standard deviation is the square root of the variance. It measures the spread of a
set of observations. The larger the standard deviation is, the more spread out the observations
are.
e. Skewness - Skewness measures the degree and direction of asymmetry. A symmetric distribution
such as a normal distribution has a skewness of 0, and a distribution that is skewed to the left,
e.g. when the mean is less than the median, has a negative skewness.
f. Uncorrected SS - This is the sum of squared data values. The two summations, sum of
observations and sum of squares, are related to the calculation of variance in the following way:
Variance = (sum of squares - (sum of observations)^2 / N) / (N - 1)
g. Coeff Variation - The coefficient of variation is another way of measuring variability. It is a unitless
measure. It is defined as the ratio of the standard deviation to the mean and is generally
expressed as a percentage. It is useful for comparing variation between different variables.
Proc Univariate Output meaning
h. Sum Weights - A numeric variable can be specified as a weight variable to weight the values of
the analysis variable. The default weight variable is defined to be 1 for each observation. This
field is the sum of observation values for the weight variable. In our case, since we didn't
specify a weight variable, SAS uses the default weight variable. Therefore, the sum of weight is
the same as the number of observations.
i. Sum Observations - This is the sum of observation values. If a weight variable is
specified, this field will be the weighted sum. The mean for the variable is the sum of
observations divided by the sum of weights.
j. Variance - The variance is a measure of variability. It is the sum of the squared distances of
data values from the mean divided by the variance divisor. The variance divisor is defined to be
either N-1 or N controlled by the option vardef. The default option is vardef=df, which is N-1.
The Corrected SS is the sum of squared distances of data value from the mean. Therefore, the
variance is the corrected SS divided by N-1. We don't generally use variance as an index of
spread because it is in squared units. Instead, we use standard deviation.
k. Kurtosis - Kurtosis is a measure of the heaviness of the tails of a distribution. In SAS, a normal
distribution has kurtosis 0. Extremely nonnormal distributions may have high positive or
negative kurtosis values, while nearly normal distributions will have kurtosis values close to 0.
Kurtosis is positive if the tails are "heavier" than for a normal distribution and negative if the
tails are "lighter" than for a normal distribution.
Proc Univariate Output meaning
l. Corrected SS - This is the sum of squared distances of data values from the mean. This number
divided by the number of observations minus one gives the variance.
m. Std Error Mean - This is the estimated standard deviation of the sample mean. If we drew
repeated samples of size 200, we would expect the standard deviation of the sample means to be
close to the standard error. The standard deviation of the distribution of sample means is
estimated as the standard deviation of the sample divided by the square root of the sample size.
This provides a measure of the variability of the sample mean. The Central Limit Theorem tells
us that the sample means are approximately normally distributed when the sample size is large;
30 or greater is a common rule of thumb.
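
As a check on the definitions in items a through m, the entries of the Moments table can be reproduced by hand. This is only a worked illustration that uses the numbers printed in the output (N = 200, Sum Observations = 10555, Uncorrected SS = 574919); intermediate results are rounded.

```latex
\begin{align*}
\text{Mean} &= \frac{10555}{200} = 52.775 \\
\text{Corrected SS} &= 574919 - \frac{10555^2}{200} = 574919 - 557040.125 = 17878.875 \\
\text{Variance} &= \frac{17878.875}{200 - 1} \approx 89.8436 \\
\text{Std Deviation} &= \sqrt{89.8436} \approx 9.4786 \\
\text{Coeff Variation} &= 100 \times \frac{9.4786}{52.775} \approx 17.96 \\
\text{Std Error Mean} &= \frac{9.4786}{\sqrt{200}} \approx 0.6702
\end{align*}
```

Each result agrees with the corresponding value in the Moments table to the precision shown.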
Proc Univariate Output meaning
Mean - This is the arithmetic mean across the observations. It is the most widely used measure of
central tendency. It is commonly called the average. The mean is sensitive to extremely large or
small values.
Median - The median is a measure of central tendency. It is the middle number when the values are
arranged in ascending (or descending) order. Sometimes, the median is a better measure of
central tendency than the mean. It is less sensitive than the mean to extreme observations.
Mode - The mode is another measure of central tendency. It is the value that occurs most frequently
in the variable. It is used most commonly when the variable is a categorical variable.
Std Deviation - Standard deviation is the square root of the variance. It measures the spread of a set
of observations. The larger the standard deviation is, the more spread out the observations are.
Variance - The variance is a measure of variability. It is the sum of the squared distances of data
value from the mean divided by the variance divisor. The variance divisor is defined to be either
N-1 or N controlled by the option vardef. The default option is vardef=df, which is N-1. The
Corrected SS is the sum of squared distances of data value from the mean. Therefore, the
variance is the corrected SS divided by N-1. We don't generally use variance as an index of
spread because it is in squared units. Instead, we use standard deviation.
Range - The range is a measure of the spread of a variable. It is equal to the difference between the
largest and the smallest observations. It is easy to compute and easy to understand. However, it
depends only on the two extreme values, so it says little about how the remaining observations vary.
Interquartile Range - The interquartile range is the difference between the upper and the lower
quartiles. It measures the spread of a data set. It is robust to extreme observations.
Proc Univariate Output
Tests for Location: Mu0=0
Test           -Statistic-      -----p Value------
Student's t    t   78.74077     Pr > |t|    <.0001
Sign           M        100     Pr >= |M|   <.0001
Signed Rank    S      10050     Pr >= |S|   <.0001

Quantiles (Definition 5)
Quantile       Estimate
100% Max           67.0
99%                67.0
95%                65.0
90%                65.0
75% Q3             60.0
50% Median         54.0
25% Q1             45.5
10%                39.0
5%                 35.5
1%                 31.0
0% Min             31.0
Proc Univariate Output meaning
Test - This column lists the various tests that are provided.
Statistic - This column lists the values of the test statistics.
p Value - This column lists the p-values associated with the test statistics.
Student's t - The Student t-test is used to test the null hypothesis that the population mean equals Mu0. The
default value in SAS for Mu0 is 0.
The t-statistic is defined to be the difference between the mean and the hypothesized mean divided by the
standard error of the mean.
The p-value is the two-tailed probability computed using a t distribution. If the p-value associated with the
t-test is small (usually set at p < 0.05), there is evidence to reject the null hypothesis in favor of the alternative.
In other words, the mean is statistically significantly different from the hypothesized value. If the p-value
associated with the t-test is not small (p > 0.05), the null hypothesis is not rejected. In our example, our
t-value is 78.74077 and the corresponding p-value is less than 0.0001. We conclude that there is a
statistically significant difference between the mean of the variable write and zero.
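
The t-statistic in the example can be reproduced from the Moments table. This is a worked illustration only, using the printed mean (52.775), the default Mu0 of 0, and the printed standard error of the mean (0.67023725).

```latex
\[
t = \frac{\bar{x} - \mu_0}{\text{Std Error Mean}} = \frac{52.775 - 0}{0.67023725} \approx 78.74
\]
```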
Proc Univariate Output meaning
Sign - The sign test is a simple nonparametric procedure to test the null hypothesis
regarding the population median. It does not require that the sample is drawn from a
normal distribution. It is used when we have a small sample from a nonnormal
distribution. The statistic M is defined to be M = (N+ - N-)/2, where N+ is the number of
values that are greater than Mu0 and N- is the number of values that are less than
Mu0. Values equal to Mu0 are discarded. Under the hypothesis that the population
median is equal to Mu0, the sign test calculates the p-value for M using a binomial
distribution. The interpretation of the p-value is the same as for t-test. In our example
the M-statistic is 100 and the p-value is less than 0.0001. We conclude that the median
of variable write is significantly different from zero.
Signed Rank - The signed rank test is also known as the Wilcoxon test. It is used to
test the null hypothesis that the population median equals Mu0. It assumes that the
distribution of the population is symmetric. The Wilcoxon signed rank test statistic is
computed based on the rank sum and the numbers of observations that are either
above or below the median. The interpretation of the p-value is the same as for the t-test. In our example, the S-statistic is 10050 and the p-value is less than 0.0001. We
therefore conclude that the median of the variable write is significantly different from
zero.
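
Both nonparametric statistics in the example can be checked by hand, because every one of the 200 values of write is greater than Mu0 = 0 (the minimum is 31). The sketch below assumes the usual definition of the signed rank statistic, S = (sum of the ranks of the positive differences) - n(n+1)/4; that formula is not stated on the slides and is brought in here only for illustration.

```latex
\begin{align*}
M &= \frac{N^{+} - N^{-}}{2} = \frac{200 - 0}{2} = 100 \\
S &= \sum r_i^{+} - \frac{n(n+1)}{4}
   = \frac{200 \cdot 201}{2} - \frac{200 \cdot 201}{4}
   = 20100 - 10050 = 10050
\end{align*}
```

Both values match the Tests for Location table.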
Proc Univariate Output meaning
Quantile Meanings
100% Max - This is the maximum value of the variable. One hundred percent of all
values are equal to or less than this value.
95% - Ninety-five percent of all values of the variable are equal to or less than this value.
75% Q3 - This is the third quartile. Seventy-five percent of all values are equal to or
less than this value.
50% Median - This is the median. The median splits the distribution such that half of all
values are above this value, and half are below.
25% Q1 - This is the first quartile. Twenty-five percent of all values of the variable are
equal to or less than this value.
0% Min - This is the minimum value. Zero percent of values are less than this value.
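
The quantiles also give the two spread measures reported under Basic Statistical Measures. A short worked check, using only the printed quantile estimates:

```latex
\begin{align*}
\text{Range} &= \text{Max} - \text{Min} = 67.0 - 31.0 = 36.0 \\
\text{Interquartile Range} &= Q_3 - Q_1 = 60.0 - 45.5 = 14.5
\end{align*}
```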
Proc Univariate Output
Extreme Observations
----Lowest----        ----Highest---
Value      Obs        Value      Obs
   31       89           67      118
   31       40           67      160
   31       39           67      177
   31       31           67      183
   33       70           67      185
Proc Univariate Output
Stem Leaf                          #   Boxplot
  66 0000000                       7      |
  64 0000000000000000             16      |
  62 0000000000000000000000       22      |
  60 00000000                      8   +-----+
  58 0000000000000000000000000    25   |     |
  56 000000000000                 12   |     |
  54 00000000000000000000         20   *-----*
  52 0000000000000000             16   |  +  |
  50 00                            2   |     |
  48 00000000000                  11   |     |
  46 00000000000                  11   |     |
  44 0000000000000                13   +-----+
  42 000                           3      |
  40 0000000000000                13      |
  38 000000                        6      |
  36 00000                         5      |
  34 00                            2      |
  32 0000                          4      |
  30 0000                          4      |
     ----+----+----+----+----+
Proc Univariate Meaning
Extreme Observations - This is a list of the five lowest and five highest values of the
variable.
Stem Leaf - The stem-leaf plot is used to visualize the overall distribution of a
variable. In this display, the stem is the portion of the value to the left and the leaf is
the part to the right. The number on the right is the number of leaves on each stem.
For example, on the first line, the stem is 66, and there are seven 0's to the right of
this stem, indicating that there are seven cases with a value of 66 or 67 for this
variable.
Boxplot - The box plot is a graphical representation of the 5-number summary for a
variable. It is based on the quartiles of a variable. The rectangular box corresponds
to the lower quartile and the upper quartile. The line in the middle is the median.
The plus sign in the middle is the mean. We can visually compare the lengths of the
whiskers. If one is clearly longer than the other one, the distribution may be skewed.
Proc Univariate Output Meaning (boxplot markers)
Top edge of the box (+-----+): 75% Q3 - This is the third quartile. Seventy-five percent of
all values are equal to or less than this value.
Line across the box (*-----*): 50% Median - This is the median. The median splits the
distribution such that half of all values are above this value, and half are below.
Plus sign inside the box (+): Mean - This is the arithmetic mean across the observations. It
is the most widely used measure of central tendency. It is commonly called the average.
The mean is sensitive to extremely large or small values.
Bottom edge of the box (+-----+): 25% Q1 - This is the first quartile. Twenty-five percent
of all values of the variable are equal to or less than this value.
Normal Probability Plot
[Line-printer normal probability plot of write: vertical axis labeled at 31, 49 and 67; asterisks (*) mark the data values and plus signs (+) mark the normal reference line.]
Proc Univariate Output Meaning
Normal Probability Plot - The normal probability plot is used to investigate whether
the variable is normally distributed. The plus signs in the plot indicate a normal
distribution and they form a straight line. The asterisks show the data values. If
our variable is close to normal distribution, then the asterisks will also be close to a
straight line and thus cover most of the plus signs. There are different types of
departure from normality.
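
A graphical check of normality can also be requested directly. The sketch below is an illustration rather than the code used for the slides: it assumes a data set named hsb2 in the WORK library with the numeric variable write, and it uses the HISTOGRAM and QQPLOT statements of PROC UNIVARIATE to draw a histogram with a normal curve overlaid and a normal quantile plot.

```sas
/* Sketch only: assumes a data set hsb2 in WORK with the numeric variable write */
ods graphics on;
proc univariate data=hsb2 normal;             /* NORMAL requests tests for normality        */
   var write;
   histogram write / normal;                  /* histogram with a fitted normal curve       */
   qqplot write / normal(mu=est sigma=est);   /* reference line from the estimated mean and */
                                              /* standard deviation                         */
run;
```

If the points in the Q-Q plot stay close to the reference line, the variable is close to normally distributed, which is the same reading as for the asterisks and plus signs in the line-printer plot.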
Proc Corr (Correlations)
- Is part of the base SAS software and computes correlations
- Measures the strength of relationship between two variables
- Values can range from -1 to 1
- If two variables are completely uncorrelated they would have a correlation of 0
- If two variables are perfectly correlated they would have values of either -1 or 1 depending on whether the correlation was negative or positive
Proc Corr (Correlations)
- SAS basic statement
    PROC CORR;
  - Default: will compute correlations between all numeric variables.
  - Add the word VAR (list); computes correlations between the variables you have listed
  - Add the word WITH along with the VAR list; computes correlations using the VAR list across the top and the variables in the WITH list down the side
- Computes Pearson product-moment correlation coefficients
- Add options to the PROC statement to request nonparametric correlations
Proc Corr (Correlations)
- SAS basic statement
    PROC CORR Spearman;
  - The Spearman option calculates Spearman’s rank correlations instead of Pearson’s correlations
- Other options
  - HOEFFDING for Hoeffding’s D-Statistic
  - KENDALL for Kendall’s tau-b coefficient
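
The VAR and WITH statements and the nonparametric options can be sketched as follows. The variable names are taken from the fish output shown later in these slides; the data set name fish is an assumption for illustration.

```sas
/* Sketch only: the data set name fish is an assumption */

/* Pearson correlations (the default): VAR variables across the top, */
/* WITH variables down the side                                      */
proc corr data=fish;
   var Height Width;
   with Weight3 Length3;
run;

/* Nonparametric correlations among all four variables */
proc corr data=fish spearman kendall hoeffding;
   var Weight3 Length3 Height Width;
run;
```

Leaving out both VAR and WITH would give correlations among all numeric variables in the data set.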
Proc Corr (Correlations)
- By default, PROC CORR prints a report that includes descriptive statistics and correlation statistics for each variable. The descriptive statistics are:
  - Number of observations with nonmissing values,
  - Mean,
  - Standard Deviation,
  - Minimum, and
  - Maximum.
- For each pair of variables, PROC CORR prints the correlation coefficients, the number of observations used to calculate the coefficient, and the p-value.
- If you specify the ALPHA option, PROC CORR prints Cronbach’s coefficient alpha, the correlation between the variable and the total of the remaining variables, and Cronbach’s coefficient alpha by using the remaining variables for the raw variables and the standardized variables.
Proc Corr (Correlations)

What does the P-value associated with each correlation mean?
Answer = A significant P-value means only that the
correlation is significantly different from zero

Remember that correlations do not imply cause and effect. The
correlation really just says how two variables vary with each other.
Proc Corr Output
Fish Measurement Data
The CORR Procedure
4 Variables: Weight3 Length3 Height Width

Simple Statistics
Variable    N       Mean    Std Dev         Sum    Minimum    Maximum
Weight3    34    8.44751    0.97574   287.21524    6.23168   10.00000
Length3    34   38.38529    4.21628        1305   30.00000   46.50000
Height     34   15.22057    1.98159   517.49950   11.52000   18.95700
Width      34    5.43805    0.72967   184.89370    4.02000    6.74970
Proc Corr Output
Pearson Correlation Coefficients, N=34
Prob > |r| under H0: Rho=0

            Weight3    Length3     Height      Width
Weight3      1.0000    0.96523    0.98261    0.92789
                       <0.0001    <0.0001    <0.0001
Length3     0.96523     1.0000    0.95492    0.92171
            <0.0001               <0.0001    <0.0001
Height      0.98261    0.95492     1.0000    0.92632
            <0.0001    <0.0001               <0.0001
Width       0.92789    0.92171    0.92632     1.0000
            <0.0001    <0.0001    <0.0001
Proc Corr Options
ALPHA calculates and prints Cronbach’s coefficient alpha. PROC CORR computes separate coefficients using
raw and standardized values (scaling the variables to a unit variance of 1). For each VAR statement variable,
PROC CORR computes the correlation between the variable and the total of the remaining variables. It also
computes Cronbach’s coefficient alpha by using only the remaining variables.
If a WITH statement is specified, the ALPHA option is invalid. When you specify the ALPHA option, the Pearson
correlations will also be displayed. If you specify the OUTP= option, the output data set also contains observations
with Cronbach’s coefficient alpha. If you use the PARTIAL statement, PROC CORR calculates Cronbach’s
coefficient alpha for partialled variables. See the section Partial Correlation for details.

BEST=n prints the n highest correlation coefficients for each variable. Correlations are ordered from highest to
lowest in absolute value. Otherwise, PROC CORR prints correlations in a rectangular table, using the variable
names as row and column labels.
If you specify the HOEFFDING option, PROC CORR displays the statistics in order from highest to lowest.

COV displays the variance and covariance matrix. When you specify the COV option, the Pearson correlations
will also be displayed. If you specify the OUTP= option, the output data set also contains the covariance matrix
with the corresponding _TYPE_ variable value 'COV.' If you use the PARTIAL statement, PROC CORR computes
a partial covariance matrix.

Displayed 4 of many. Examine the option that you might need or view the options and see what can be done!
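
The options described above are all placed on the PROC CORR statement. As a rough sketch, assuming a data set named fish with the four variables from the earlier output (fish and fish_corr are hypothetical names):

```sas
/* Sketch only: fish and fish_corr are assumed data set names */

/* Covariance matrix, the 2 highest correlations per variable, */
/* and an output data set holding the Pearson statistics       */
proc corr data=fish cov best=2 outp=fish_corr;
   var Weight3 Length3 Height Width;
run;

/* Cronbach's coefficient alpha (no WITH statement is allowed here) */
proc corr data=fish alpha;
   var Weight3 Length3 Height Width;
run;
```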
PROC Reg
- The Reg procedure fits linear regression models by least squares and is one of many SAS procedures that perform regression analyses
- Reg is part of the SAS/STAT software and is licensed separately from the Base SAS software
- Show linear regression
- Proc Reg is capable of analyzing models with many regressor variables using a variety of model-selection methods
Proc Reg
- Selection methods available in Proc Reg
  - Stepwise regression
  - Forward selection
  - Backward elimination
- Other procedures (Procs) for:
  - Non-linear
  - Logistic Regression
- Basic form
    PROC REG;
       MODEL dependent = independent;
Proc Reg Example
proc reg data = "d:\hsb2";
model science = math female socst read / clb;
run;
quit;
Proc Reg Output
Analysis of Variance
                              Sum of         Mean
Source              DF       Squares       Square    F Value    Pr > F
Model                4    9543.72074   2385.93019      46.69    <.0001
Error              195    9963.77926     51.09630
Corrected Total    199         19507

Root MSE           7.14817    R-Square    0.4892
Dependent Mean    51.85000    Adj R-Sq    0.4788
Coeff Var         13.78624
Proc Reg Output
Parameter Estimates
                                         Parameter    Standard
Variable     Label                  DF    Estimate       Error    t Value    Pr > |t|
Intercept    Intercept               1    12.32529     3.19356       3.86      0.0002
math         math score              1     0.38931     0.07412       5.25      <.0001
female                               1    -2.00976     1.02272      -1.97      0.0508
socst        social studies score    1     0.04984     0.06223       0.80      0.4241
read         reading score           1     0.33530     0.07278       4.61      <.0001

Parameter Estimates
Variable     Label                  DF    95% Confidence Limits
Intercept    Intercept               1     6.02694     18.62364
math         math score              1     0.24312      0.53550
female                               1    -4.02677      0.00724
socst        social studies score    1    -0.07289      0.17258
read         reading score           1     0.19177      0.47883
PROC REG OUTPUT
Ypredicted = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
The column of estimates provides the values for b0, b1, b2, b3 and b4 for this equation.
math - The coefficient is .3893102. So for every unit increase in math, a 0.38931 unit
increase in science is predicted, holding all other variables constant.
female - For every unit increase in female, we expect a 2.00976 unit decrease in the
science score, holding all other variables constant. Since female is coded 0/1 (0=male,
1=female) the interpretation is more simply: for females, the predicted science score
would be about 2 points lower than for males.
socst - The coefficient for socst is .0498443. So for every unit increase in socst, we
expect an approximately .05 point increase in the science score, holding all other
variables constant.
read - The coefficient for read is .3352998. So for every unit increase in read, we
expect a .34 point increase in the science score.
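
Putting the estimates into the prediction equation gives the fitted model. The second and third lines are a worked illustration for a hypothetical female student (female = 1) who scores 50 on math, socst and read; those input values are assumptions chosen only to show the arithmetic.

```latex
\begin{align*}
\widehat{\text{science}} &= 12.32529 + 0.38931\,\text{math} - 2.00976\,\text{female}
                            + 0.04984\,\text{socst} + 0.33530\,\text{read} \\
&= 12.32529 + 0.38931(50) - 2.00976(1) + 0.04984(50) + 0.33530(50) \\
&= 12.32529 + 19.4655 - 2.00976 + 2.4920 + 16.7650 \approx 49.04
\end{align*}
```

For a male student with the same three scores, the prediction would be 2.00976 points higher, which is the interpretation given for the female coefficient.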
PROC REG OUTPUT
Standard Error - These are the standard errors associated with the coefficients.
t Value - These are the t-statistics used in testing whether a given coefficient is
significantly different from zero.
Pr > |t|- This column shows the 2-tailed p-values used in testing the null hypothesis that
the coefficient (parameter) is 0. Using an alpha of 0.05:
The coefficient for math is significantly different from 0 because its p-value is <.0001,
which is smaller than 0.05.
The coefficient for female (-2.00976) is not statistically significant at the 0.05 level
because its p-value of 0.0508 is slightly larger than 0.05.
The coefficient for socst (.0498443) is not statistically significantly different from 0
because its p-value of 0.4241 is larger than 0.05.
The coefficient for read (.3352998) is statistically significant because its p-value of <.0001
is less than .05.
PROC REG OUTPUT
The intercept is significantly different from 0 at the 0.05 alpha level.
95% Confidence Limits - These are the 95% confidence intervals for the
coefficients. The confidence intervals are related to the p-values such that the
coefficient will not be statistically significant if the confidence interval includes 0. These
confidence intervals can help you to put the estimate from the coefficient into
perspective by seeing how much the value could vary.
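
The confidence limits can be approximated from the estimate and its standard error. The sketch below assumes the usual form estimate ± t × standard error, with a critical value of about 1.972 for the 195 error degrees of freedom; that critical value is taken from the t distribution and is not printed in the output.

```latex
\begin{align*}
\text{math:}   &\quad 0.38931 \pm 1.972 \times 0.07412 = 0.38931 \pm 0.1462 \;\Rightarrow\; (0.243,\ 0.535) \\
\text{female:} &\quad -2.00976 \pm 1.972 \times 1.02272 = -2.00976 \pm 2.0168 \;\Rightarrow\; (-4.027,\ 0.007)
\end{align*}
```

These agree with the printed limits up to rounding of the critical value. The interval for female includes 0, which matches its p-value of 0.0508 being just above 0.05.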
Creating Statistical Graphics with PROC REG
General form
ODS GRAPHICS ON;
PROC REG PLOTS (OPTIONS) = (PLOT-LIST);
Model dependent = independent;
Run;
Quit;
Creating Statistical Graphics with PROC REG
FITPLOT                  scatter plot with regression line and confidence and prediction bands
RESIDUALS                residuals plotted against independent variable
DIAGNOSTICS              diagnostics panel including all of the following plots
  COOKSD                 Cook’s D statistic by observation number
  OBSERVEDBYPREDICTED    dependent variable by predicted value
  QQPLOT                 normal quantile plot of residuals
  RESIDUALBYPREDICTED    residuals by predicted values
  RESIDUALHISTOGRAM      histogram of residuals
  RFPLOT                 residual fit plot
  RSTUDENTBYLEVERAGE     studentized residuals by leverage
  RSTUDENTBYPREDICTED    studentized residuals by predicted values
Default Options
- By default the FITPLOT, RESIDUALS and DIAGNOSTICS plots are generated
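
To request specific plots instead of the defaults, the plot names go in the PLOTS= option. The following is a sketch under assumptions: it uses a data set named hsb2 in the WORK library and a single regressor (math) so that a fit plot can be drawn, and it uses the ONLY global option to suppress the default plots.

```sas
/* Sketch only: assumes a data set hsb2 in WORK with numeric variables science and math */
ods graphics on;
proc reg data=hsb2 plots(only)=(fitplot residuals qqplot);
   model science = math;   /* one regressor, so FITPLOT applies */
run;
quit;
ods graphics off;
```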
Proc ANOVA
- One of many SAS procedures that can perform Analysis of Variance or ANOVA
- Is part of SAS/STAT, which is licensed separately from the Base SAS software
- Is designed for balanced data
  - Equal numbers of observations in each combination of the classification factors
  - Exception is the one-way ANOVA, where the data need not be balanced
Proc ANOVA
- One-way analysis of variance.
  - The null hypothesis tested by one-way ANOVA is that two or more population means are equal.
  - The question is whether (H0) the population means are equal for all groups and the observed differences in sample means are due to random sampling variation, or (Ha) the observed differences between sample means are due to actual differences in the population means.
Proc ANOVA
- Assumptions needed for the ANOVA.
  1) random, independent sampling from some larger population;
  2) normal population distributions;
  3) equal variances within the population.
- Assumption 1 is crucial for any inferential statistic.
- Assumptions 2 and 3 can be relaxed when large samples are used, and
- Assumption 3 can be relaxed when the sample sizes are roughly the same for each group even for small samples.
Proc ANOVA
- If you are not performing a one-way analysis of variance and/or your data is not balanced, you should be using the General Linear Models Procedure or GLM
PROC ANOVA

The ANOVA procedure performs analysis of variance
(ANOVA)

It is designed for use with balanced data from a wide variety of
experimental designs.

In analysis of variance, a continuous response
variable, known as a dependent variable, is measured
under experimental conditions identified by
classification variables, known as independent
variables.

The variation in the response is assumed to be due to
effects in the classification, with random error
accounting for the remaining variation.
PROC ANOVA
- General form
    PROC ANOVA;
       CLASS variable-list;
       MODEL dependent = effects;
- The two required statements are the CLASS and MODEL statements.
- The CLASS statement MUST come before the MODEL statement
- For the one-way ANOVA only one variable is listed
PROC ANOVA
- Many options available when using the ANOVA
  - MEANS: calculates means for the dependent variable for any of the main effects included in the MODEL statement
  - Several mean separation or comparison tests including
    1. Bonferroni t tests (BON)
    2. Duncan’s multiple-range test (DUNCAN)
    3. Scheffe’s multiple-comparison procedure (SCHEFFE)
    4. Pairwise t tests (T)
    5. Tukey’s studentized range test (TUKEY)
PROC ANOVA
- Many options available when using the ANOVA
  - General form: MEANS effects / options;
  - The effects can be any main effect in the MODEL statement
    - Cannot be any crossed or nested effects
  - The options can be any one of the comparison tests (DUNCAN or TUKEY, for example)
PROC ANOVA
- If ODS Graphics is turned on, PROC ANOVA will produce a grouped box plot of the effect variable for one-way ANOVA and for all effects in the MEANS statement
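
The general form, the MEANS statement and ODS Graphics can be combined in one step. The sketch below follows the basketball example whose output appears on the next slides; the data set name heights is an assumption, while Team and Height are the class and dependent variables shown in that output.

```sas
/* Sketch only: the data set name heights is an assumption */
ods graphics on;
proc anova data=heights;
   class Team;               /* CLASS must come before MODEL              */
   model Height = Team;      /* one-way ANOVA: a single classification    */
   means Team / scheffe;     /* Scheffe's multiple-comparison procedure   */
run;
quit;
```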
Proc ANOVA output
- The output from an ANOVA analysis has at least two parts
  1. Table providing information about the classification variables in the model
     1. Number of levels
     2. Values
     3. Number of observations
  2. An ANOVA table
  3. Options like means will be outputted next
Proc ANOVA output example
Girls’ Heights on Basketball Teams
The ANOVA Procedure
Class Level Information
Class    Levels    Values
Team          5    Blue gold gray pink red
Number of Observations    60
Proc ANOVA output example
Girls’ Heights on Basketball Teams
The ANOVA Procedure
Dependent Variable: Height
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4            228.00          57.00       4.14    0.0053
Error              55            758.00     13.7818182
Corrected Total    59            986.00

R-Square    Coeff Var    Root MSE    Height Mean
0.2331          7.279       3.712          51.00

Source    DF    Anova SS    Mean Square    F Value    Pr > F
Team       4     228.000          57.00       4.14    0.0053
Proc ANOVA output example
Source            source of variation
DF                degrees of freedom for the model, error, and total
Sum of Squares    sum of squares for the portion attributed to the model, error, and the total
Mean Square       mean square (sum of squares divided by the degrees of freedom)
F Value           F value (mean square for model divided by the mean square for error)
Pr > F            significance probability associated with the F statistic
R-square          R-square (how predictive your model is)
Coeff Var         coefficient of variation (standard deviation divided by the mean); here, 100 x Root MSE / Height Mean
Root MSE          root mean square error, the square root of the error mean square (the name comes from the fact that it is the square root of the mean of the squares of the values)
                  - a statistical measure of the magnitude of a varying quantity
                  - it gives a sense for the typical size of the numbers; the values are squared to account for negative numbers
                  - the RMS is always the same as or larger than the average of the unsigned values
Height Mean       mean of the dependent variable, in this case height
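
These definitions can be checked against the basketball output. The worked arithmetic below uses only the printed sums of squares, degrees of freedom and mean; results are rounded.

```latex
\begin{align*}
\text{MS}_{\text{Model}} &= \frac{228.00}{4} = 57.00 \\
\text{MS}_{\text{Error}} &= \frac{758.00}{55} \approx 13.7818 \\
F &= \frac{57.00}{13.7818} \approx 4.14 \\
\text{Root MSE} &= \sqrt{13.7818} \approx 3.712 \\
\text{Coeff Var} &= 100 \times \frac{3.712}{51.00} \approx 7.28
\end{align*}
```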
Proc ANOVA output example
Girls’ Height on Basketball Teams
The ANOVA Procedure
Scheffe’s Test for Height
NOTE: This test controls the type I experimentwise error rate.
Alpha                                0.05
Error Degrees of Freedom               55
Error Mean Square                13.78182
Critical Value of F               2.53969
Minimum Significant Difference     4.8306

Means with the same letter are not significantly different
Scheffe Grouping      Mean     N    team
        A           54.833    12    Pink
   B    A           50.500    12    gold
   B    A           50.333    12    gray
   B                49.833    12    blue
   B                49.500    12    red
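
The minimum significant difference can be reproduced from the other values in the table. The sketch assumes the usual Scheffe formula for k groups of equal size n, which is not stated on the slides; here k = 5 teams and n = 12 girls per team.

```latex
\begin{align*}
\text{MSD} &= \sqrt{(k-1)\,F_{\text{crit}}}\;\sqrt{\frac{2\,\text{MSE}}{n}}
            = \sqrt{4 \times 2.53969}\;\sqrt{\frac{2 \times 13.78182}{12}} \\
           &\approx 3.1873 \times 1.5156 \approx 4.8306
\end{align*}
```

Pink and blue differ by 54.833 - 49.833 = 5.000, which exceeds 4.8306, so they do not share a letter; pink and gray differ by 4.500, which does not, so both carry the letter A.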