SAS Lecture 5 – Some regression procedures
Aidan McDermott,
April 25, 2005
What will the output from this program look like?
How many variables will be in the dataset example, and what will be the
length and type of each variable?
What will the variable package look like?
What will the output from this program look like?
Modeling with SAS
examine relationships between
variables
estimate parameters and their standard
errors
calculate predicted values
evaluate the fit or lack of fit of a model
test hypotheses
design
outcome
The linear model
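$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
$\varepsilon \sim N(0, \sigma^2)$
Example:
$\text{Weight} = \beta_0 + \beta_1 \text{Height} + \beta_2 \text{Age} + \varepsilon$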
Note: the outcome variable must be continuous and
normally distributed given the independent variables
the linear model with proc reg
estimates parameters by least squares
produces diagnostics to test model fit
(e.g. scatter plots)
tests hypotheses
Example:
proc reg data=mydata;
model weight = height age;
run;
proc reg
Syntax:
proc reg <options>;
model response = effects </options>;
plot yvariable*xvariable = 'symbol';
by varlist;
output <OUT=SAS data set>
<output statistic list>;
run;
proc reg
proc reg statement syntax:
data = SAS data set name      input data set
outest = SAS data set name    creates data set with parameter estimates
simple                        prints simple statistics
proc reg
the model statement
model response=<effects></options>;
required
variables must be numeric
many options
can specify more than one model
statement
Example:
model weight = height age;
model weight = height age / p clm cli;
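A minimal, self-contained sketch of the second model statement above. The data set toy and its values are made up purely so that the step runs; only the variable names weight, height and age come from the slides. The options p, clm and cli request predicted values, confidence limits for the mean, and confidence limits for an individual prediction.

```sas
/* Toy data: values are invented for illustration only */
data toy;
  input weight height age;
  datalines;
62 165 31
70 172 45
81 180 38
55 158 27
90 185 52
68 170 60
74 176 41
59 162 35
;
run;

proc reg data=toy;
  /* p = predicted values, clm = CI for the mean, cli = CI for an individual */
  model weight = height age / p clm cli;
run;
quit;  /* end the interactive procedure */
```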
proc reg
the plot statement
plot yvariable*xvariable <=symbol> </options>;
produces scatter plots - yvariable on the
vertical axis and xvariable on the horizontal axis
can specify several plots
optional symbol to mark points
yvariable and xvariable can be variables
specified in model statements or statistics
available in output statement
Example:
plot weight * age / pred;
plot r. * p. / vref = 0;
proc reg
some statistics available for plotting:
P.      predicted values
R.      residuals
L95.    lower 95% CI bound for individual prediction
U95.    upper 95% CI bound for individual prediction
L95M.   lower 95% CI bound for mean of dependent variable
U95M.   upper 95% CI bound for mean of dependent variable
Example:
plot weight * age / pred;
plot r. * p. / vref = 0;
plot (weight p. l95. U95.) * age / overlay;
proc reg
the output statement
output <OUT=SAS data set> keywords=names;
creates SAS data set
all original variables included
keyword=names specifies the statistics to
include
Example:
output out=pvals
p=pred r=resid;
Example: NMES
variables of interest:
totalexp – total medical expenditure ($)
chd5 – indicator of CHD
lastage – age at last interview
male – sex of participant
proc reg example
here:
1. model: estimate parameters etc
2. plot: make three plots
3. output: make an output dataset regout
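The program on this slide is not reproduced in the transcript. A sketch of what such a step could look like, assuming the library lecture4 is already assigned and that nmes holds the NMES variables listed earlier; the particular plots and the names pred and resid are illustrative choices, not the original ones:

```sas
/* Sketch only: assumes libname lecture4 is assigned and that
   lecture4.nmes contains totalexp, chd5, lastage and male */
proc reg data=lecture4.nmes;
  /* 1. model: estimate parameters and their standard errors */
  model totalexp = chd5 lastage male;

  /* 2. plot: three plots (the choice of plots is an assumption) */
  plot totalexp * lastage;
  plot r. * p. / vref = 0;        /* residuals against predicted values */
  plot r. * lastage / vref = 0;   /* residuals against age */

  /* 3. output: original variables plus predictions and residuals */
  output out=regout p=pred r=resid;
run;
quit;
```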
The run statement
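Many people assume that the run statement ends a
procedure such as proc reg. It does not always do so.
When SAS encounters a run statement it executes any
outstanding instructions in the program buffer, but an
interactive procedure such as proc reg stays active and
accepts further statements until a quit statement (or the
start of another proc or data step) ends it.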
proc reg data=lecture4.nmes;
model totalexp = chd5 lastage male;
run;
model totalexp = chd5 lastage;
plot r.*chd5;
run;
quit; /* ends the procedure */
proc glm (the general linear model)
uses least-squares with generalized inverses
performs linear regression, analysis of
variance, analysis of covariance
accepts classification variables (discrete) and
continuous variables
estimates and performs tests for general
linear effects
proc anova is suitable for “balanced” designs;
proc glm can be used for either balanced or
unbalanced designs
suitable for random effects models
proc glm
Syntax:
proc glm data=name <options>;
class classification variables;
model response=effects /options;
means effects / options;
random effects / options;
estimate 'label' effect value / options;
contrast 'label' effect value / options;
run;
proc glm
response (dependent) variable is
continuous – same normality assumption
as in proc reg
independent variables are discrete or
continuous; discrete variables must be listed
on the class statement
interaction terms can be specified with an
asterisk (a*b), e.g.
model bmi= a b a*b;
proc glm
means effects / options;
computes arithmetic means and standard
deviations of all continuous variables in
the model (both dependent and
independent) within each group for
effects specified on the right-hand side
of the model statement
only class variables may be specified as
effects
options specify multiple comparison
methods for main effect terms in the
model
proc glm example
here:
1. solution: show estimated parameters
2. means: show means for smoke variable
3. class: treat smoke as discrete
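The program for this slide is missing from the transcript. A minimal sketch showing the three pieces named above, assuming an outcome bmi and a 0/1 variable smoke; the data set toy and its values are invented so that the step runs:

```sas
/* Toy data: values are invented for illustration only */
data toy;
  input bmi smoke;
  datalines;
24.1 0
27.3 1
22.8 0
29.0 1
25.5 0
26.2 1
23.9 0
28.4 1
;
run;

proc glm data=toy;
  class smoke;                   /* 3. treat smoke as discrete */
  model bmi = smoke / solution;  /* 1. show estimated parameters */
  means smoke;                   /* 2. show means of bmi by smoke */
run;
quit;
```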
proc glm example
here:
1. format: changes reference group
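One way a format can change the reference group, sketched under assumptions: with the solution option, proc glm sets the parameter of the last class level to zero, and levels are ordered by their formatted values by default. Attaching a format therefore changes which level sorts last. The data set toy, the format name smokef and the 0/1 coding are hypothetical.

```sas
/* Toy data: values are invented for illustration only */
data toy;
  input bmi smoke;
  datalines;
24.1 0
27.3 1
22.8 0
29.0 1
25.5 0
26.2 1
;
run;

/* Hypothetical format: 'Non-smoker' sorts after 'Current smoker' */
proc format;
  value smokef 0 = 'Non-smoker'
               1 = 'Current smoker';
run;

proc glm data=toy;
  class smoke;
  format smoke smokef.;          /* levels are now ordered by label */
  model bmi = smoke / solution;  /* last level (Non-smoker) is the reference */
run;
quit;
```

Without the format statement the levels sort as 0, 1, so smoke=1 would be the reference group instead.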
reg and glm
Both the proc reg and proc glm
procedures are suitable only when the
outcome variable is normally distributed.
proc reg has many regression diagnostic
features, while proc glm allows you to fit
more sophisticated linear models such as
random effects models, models for
unbalanced designs etc.
non-normal outcomes
In many situations we cannot assume our
response variable is normally distributed.
proc reg and proc glm are not suitable for
modeling such outcomes.
Example:
Suppose you are interested in estimating
the prevalence of disease in a population.
You have an indicator of disease (1 =
Yes, 0 = No)
non-normal outcomes
Example:
You are interested in estimating how the
incidence of infant mortality has changed
as a function of time
Example:
You are interested in estimating the
median survival time for two groups of
patients receiving either a placebo or
treatment.
proc logistic
Example:
Survey data: parent agrees to close school when certain
toxic elements are found in the environment
Variables:
close    0 = no, 1 = yes
lived    years lived in community
proc logistic
Syntax:
proc logistic <options>;
class variables;
model response = effects </options>;
by variables;
output <out=name> <keyword=name>;
run;
proc logistic
• descending option means that we are modeling the
probability that close=1 and not the probability that
close=0.
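The program for this slide is not in the transcript. A minimal sketch using the two variables described above; the data set survey and its values are made up so that the step runs:

```sas
/* Toy data: values are invented for illustration only */
data survey;
  input close lived;
  datalines;
0 1
0 3
1 4
0 5
1 6
0 8
1 10
1 12
0 15
1 20
1 25
1 30
;
run;

/* descending: model the probability that close=1 rather than close=0 */
proc logistic data=survey descending;
  model close = lived;
run;
```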
proc genmod
• implements the generalized linear model
• fits models with normal, binomial or poisson response
variable (among others)
• fits generalized estimating equations for repeated
measures data
proc genmod
Syntax:
proc genmod <options>;
by variables;
class variables;
model response = effects </options>;
output <out=name> <keyword=name>;
make 'table' out=name;
run;
proc genmod:
class statement says which variables are
classification (categorical) variables
by statement produces a separate analysis for
each level of the by variables (data must be
sorted in the order of the by variables)
response variable is the response (dependent)
variable in the regression model.
<effects> are a list of variables. These are the
independent variables in the regression model.
Any independent variables that are categorical
must be listed in the Class statement.
Example:
• smoke will be treated as a categorical variable
because of the class statement
• Same model as we produced with proc glm. The default
is a linear model.
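The program for this slide is missing. A sketch of a proc genmod step matching the two points above, assuming an outcome bmi and a 0/1 variable smoke; the data set toy and its values are invented:

```sas
/* Toy data: values are invented for illustration only */
data toy;
  input bmi smoke;
  datalines;
24.1 0
27.3 1
22.8 0
29.0 1
25.5 0
26.2 1
;
run;

proc genmod data=toy;
  class smoke;        /* smoke is treated as categorical */
  model bmi = smoke;  /* no dist= or link=: normal with identity link */
run;
```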
options for the model statement
dist = option specifies the distribution of the response
variable (default = normal)
link = option specifies the link function that relates the
mean of the response variable to the linear predictor
(default depends on dist=; identity for normal)
Examples:
logistic regression:    dist=binomial link=logit
poisson regression:     dist=poisson link=log
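Written out, the two example settings correspond to the following models for the mean of the response. The notation is an assumption: $p_i$ is the probability that observation $i$ has response 1, $\mu_i$ is the expected count, and $x_{i1}, \dots, x_{ik}$ are the independent variables.

```latex
% Logistic regression: dist=binomial link=logit
\[
  \operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i}
    = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}
\]

% Poisson regression: dist=poisson link=log
\[
  \log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}
\]
```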
options for the model statement
alpha = specifies the significance level for confidence
intervals; the confidence coefficient is 1 - alpha
(e.g. alpha=0.05 gives 95% intervals)
waldci or lrci specifies that confidence intervals are to
be computed. waldci gives approximate (Wald)
intervals and takes less time than lrci; lrci
gives intervals based on the likelihood ratio.
the output statement
• the output statement is just one of the ways to create
a new SAS dataset containing results from the
genmod procedure.
• the statement is similar to that found in proc means and
proc glm.
Example:
output out=new
predicted=fit
upper=upper lower=lower;
the make statement
• the make statement is another way to create a new
SAS dataset containing results from the genmod
procedure.
• ods is another, more general way (see later).
Example:
make 'ParameterEstimates' out=parms;
make 'ParmInfo' out=parminfo;
example: logistic regression
Perform a logistic regression analysis to
determine how the odds of CHD are
associated with age and gender in the 1987
NMES
Save the parameter estimates as a new dataset.
Save the predicted values along with the
original data.
Example:
• descending option means that we are modeling the
probability that chd5=1 and not the probability that
chd5=0.
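The program for this example is not in the transcript. A sketch of one way to do it, assuming the library lecture4 is already assigned and that nmes contains chd5, lastage and male. The output data set name chdfit and the variable name phat are hypothetical. The parameter estimates are saved here with ods output rather than the make statement shown earlier; that is a substitution, chosen because ods is the more general route.

```sas
/* Sketch only: assumes libname lecture4 is assigned */

/* Save the parameter estimates table as the dataset parms
   (ODS alternative to: make 'ParameterEstimates' out=parms;) */
ods output ParameterEstimates=parms;

/* descending: model the probability that chd5=1 */
proc genmod data=lecture4.nmes descending;
  model chd5 = lastage male / dist=binomial link=logit;
  /* Save predicted probabilities along with the original data */
  output out=chdfit predicted=phat;
run;
```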