Presentation - University of Michigan
Download
Report
Transcript Presentation - University of Michigan
“WORKING WITH S-PLUS”
Michael Elliott
Assistant Professor of Biostatistics
University of Michigan School of Public
Health
Overview
• S-Plus is a commercial statistical software package
– Plots or other graphics
– Vector and matrix computation
– Non-parametric estimation such as kernel density estimation and
cubic spline regression
– Simulations.
•
•
•
•
•
•
•
Importing and exporting data
Executing standard statistical procedures
Graphics
Computation
Robust statistics
R: an improved(?) alternative to S-Plus
Where to go for further help
2
Importing and Exporting Data
•Can import or export ASCII data plus most major database
management and statistical package data files (dBASE, Lotus,
Excel, SAS, SPSS, Stata)
•S-plus creates data objects:
–Scalar
–Vector
–Matrix
–Array
–Factor (vector of categorical variables)
–Frame (matrix of different types)
–List (general collection of different types)
3
Example dataset
•
1,000 young, first-time-licensed drivers in Michigan
Age when Time
ID licensed licensed Gender
001 17.5
3.4
0
002 16.1
6.2
0
.
.
.
.
.
.
.
.
.
.
.
.
N of
Offenses
1
1
.
.
.
Time to
1st Offense
1.0
3.3
.
.
.
4
Two ways to import and export data:
• Use menu-driven commands
• Use batch commands in the “Commands Window”
5
Menu-Driven Import/Export
• Click on File
– Import Data
• From File
• Fill in location and file type information
6
Batch Command Import/Export
• Use the read.table or scan commands
– Scan is a quick way to read in a small vector
– Read.table generally works better for importing 2dimensional tables:
splus_read.table(file="h:/splus/splus.txt",
na.strings='.',
col.names=c('id',"age","tol","male","noff","ttof"))
Generally better to manage complex hierarchical data
structures in, e.g., SAS and generate flat files for S-plus.
7
Basic operations
• The usual operators are used: also ^ for exponentiation,
%% for modulo, and _ or <- for assignment.
• c(1,2,5) creates a 3x1 vector with the elements 1, 2, and 5;
c(0:10) creates an 11x1 vector with elements 0 through 10;
rep(1,10) creates a 10x1 vector of 1.
• matrix(c(1,1,2,2),2,2) creates a 2x2 matrix with the 1st
column being 1s and the 2nd column 2s.
• What would array(rep(c(1,1,2,2),2),dim=c(2,2,2)
generate?
• To extract a portion of a vector, use name[a:b]; similarly,
to read a row or column of a matrix, use name[a,] or
name[,b] respectively.
8
• Logical vectors can be generated by conditional
statements:
>tf_(c(1,2,5)<=c(2,4,1))
>tf
[1] T T F
• Since these vectors have the property that T=1, F=0, they
are very useful. For example, they could construct a linear
spline design matrix:
yi=0+1ti+2 (ti-1)+ + 3 (ti-2)+ +i
X_cbind(t,c((t-theta1)*(t>theta1)),
c((t-theta2)*(t>theta2)))
9
• sort(object) provides the values in object in alphanumeric
order
> sort(c(3,2,1,5))
[1] 1 2 3 5
• order(object) provides the location of elements in object
in alphanumeric order
> order(c(3,2,1,5))
[1] 3 2 1 4
10
Common statistical procedures
• S-plus commands are functions of the form
function(parameter1,parameter2,…)
• Descriptive functions
– summary(object)
– hist(object)
– stem(object)
• Univariate and bivariate statistical tests
– t.test(object,mu=value)
– chisq.test(object)
11
• Regression models
– Linear regression
– Analysis of variance
– Generalized linear models
• Survival analysis
– Nonparametric
– Parametric
– Semiparametic (Cox proportional hazard model)
• Other
– Random effects models
– Factor analysis
12
• Example fitting a (Gaussian) linear regression model:
yi=0+1xi1+2xi2+i where yi is the age at time of license,
xi1 is the number of offenses, and xi2 is the total time of
license for the ith subject. The ’s are fixed but unknown
parameters, and the are (also unknown) normally
distributed error terms.
• The function lm() provides the least-square (maximum
likelihood) estimates of together with numerous related
diagnostics:
myreg_lm(age~noff+tol,data=splus)
13
• myreg is a list containing a number of items
–
–
–
–
myreg$coefficients
myreg$residuals
myreg$fitted.values
see help(lm.object) for a complete list of available items
• Use summary.lm(myreg) to obtain conventional
regression analysis output.
14
• summary.lm(object) also provides useful items:
– cov.unscaled: a p x p matrix of (unscaled) covariances of the
regression coefficents
– sigma: the square root of the estimated variance of the random
error
> summary.lm(myreg)$sigma*summary.lm(myreg)$cov.unscaled
(Intercept)
noff
tol
(Intercept) 0.01584473684 0.00005315618 -0.00250628666
noff 0.00005315618 0.00014034184 -0.00004702984
tol -0.00250628666 -0.00004702984 0.00042518954
– r.squared: the multiple R-squared statistic for the model.
– fstatistic: numeric vector of length three giving the F test for the
regression
– see help(summary.lm) for a complete list of available items
15
ANOVA:
aov(formula, data = <name of data>, projections = F,
contrasts = NULL, ...)
GLM:
glm(formula, family = gaussian, data = <data>,
weights = <weight vector>, na.action = na.fail, contrasts = NULL, ...)
Survival:
KM: kaplanMeier(Surv(time,cens), data= <name of data>,
weights=NULL, na.action=na.fail, se.fit = T, conf.interval = "log",
coverage = 0.95, …)
G-rho family: survdiff(Surv(ttof,cens)~male,rho=<rho>)
Cox PH: coxph(Surv(time,cens), method= <ties>, …)
16
Graphics
• S-plus easily produces nice graphics, with fine control
available for little effort.
• Plot age at time of license versus number of offenses:
tol_splus[,2]
ttof_splus[,6]
plot(tol,ttof,xlab=“Age of time of license”,
ylab="Time to First Offense")
17
• You can easily control a variety of features, including
–
–
–
–
–
–
Labels
Plotting symbol
Axis type
Legends
Titles
Size.
(xlab=“xxxx”, ylab=“yyyy”)
(pch=‘symbol’, pch=n)
(log=“x”, log=“y”, log=“xy”)
(legend(x,y,legend=…,pch=…,lty=…) )
(title())
(pty=“s”, mfrow=c(2,2))
• It is also relatively easy to produce
–
–
–
–
–
–
multiple graphs on a single page (par)
Overlays of graphs (par)
Barplots (barplot)
Scatterplot matricies (brush)
Contour plots (contour)
3-D plots (use menu)
18
Matrix Functions and Other Computations
• Vector and matrix computation is very simply in S-plus.
• To obtain the product of a scalar with a vector or matrix,
use ‘*’; this also will give you element-by-element
multiplication of two equal-size vectors, matricies, or
arrays:
>c(1,2,3)*c(4,5,6)
[1] 4 10 18
• Vector or matrix multiplication can be done using %*%:
• Transpose using t().
19
• solve() inverts a square matrix or solves a set of linear
equations:
– solve(A)%*%A=I
– A%*%solve(A,b)=b
20
• chol() returns the Choleski decomposition of a nonnegative symmetric matrix (note this is the transpose of the
usual lower-triangular form):
– t(chol(A))%*%chol(A)=A
21
• QR decomposition of an n x p matrix into an n x n
orthonormal matrix Q and an n x p upper triangular matrix
R is given by qr.Q(qr(A)) and qr.R(qr(A)).
22
• eigen() calculates the eigenvalues and eigenvectors of a
square matrix: for A k x k square symmetric matrix,
eigen(A)$vector and eigen(A)$values yield a k x k matrix
and k x 1 vector such that, for each element j in
eigen(A)$values and each column ej in eigen(A)$vector:
Aej=jej <-> A=jejejT , where ejTej=1 and eiTej=0 if i<>j.
23
• Numerical optimization procedures are also available:
nlmin(f, x, …)
– f is an S-PLUS function which takes as its only argument a real
vector of length p.
– x is a vector of length p used as the starting point for the optimizer.
Also nlminb for constrained optimization, nlregb for
minimizing sums of squares of nonlinear functions subject
to bound-constrained parameters, and nls for general nonlinear regression.
24
User-Defined Functions
• You may define your own functions using the function()
function:
mydet_function(A,B){
detA_Re(prod(eigen(A)$values))
detB_Re(prod(eigen(B)$values))
detAB_detA*detB
return(detA,detB,detAB)
}
25
Robust Statistics
• S-plus’ “niche” market is its extensive package of robust
statistical procedures.
26
• As a measure of central tendency, the mean is highly
sensitive to skewed or outlying observations, whereas the
median is unaffected, but less optimal if the distribution is
truly symmetric.
– A simple compromise between the median and mean is the
trimmed mean, which is the mean of the 1-2 central part of the
distribution. This is implemented in S-plus via the trim parameter
in the mean function: mean(x,trim=).
– A major subspeciality in statistics is devoted to this area and S-plus
has the function location() which can be used to implement them.
27
• Similarly, for the linear regression model
yi=xi+i
the least-square parameter estimate of that solve
minb (yi-xib)2 are not as robust to outliers as the median
(L1) estimator minb |yi-xib|.
• Fit in S-plus using l1fit(X,y) where X is the design matrix
and y the vector of dependent observations.
– note this is the reverse of the usual order: design variables first,
then outcome.
28
• Kernel density estimation provides a probability density
function estimate from the histogram of observations:
density(object,kernel=<>,bandwidth=<>)
kernel=kernel-density smoother:
"box" - a rectangular box (the default).
"triangle" - a triangle (a box convolved with itself).
"parzen" - the parzen function (a box convolved with a triangle).
"normal" - the gaussian density function.
bandwidth=the kernel bandwidth smoothing parameter.
29
• Robust and additive regression models extend the
assumption of a linear relationship between E(Y) and X to
the more general case of a smooth non-linear relationship.
loess(formula,data=<dataset>,span=<>,degree=<1,2>,…)
span=width over which tricube weights calculated
degree=X linear or quadratic
plot using plot.loess(loess(…))
smooth.spline(x, y, df = <>, spar = <>, all.knots = <T,F>,
df.offset = 0, penalty = 1)
Parameters allow knot choice and smoothing parameters to
be pre-set or data-driven.
30
Simulations
• A number of built-in procedures are available to generate
random data.
– rnorm(n,mu=0,sd=1) generates n N(0,1) random variables.
– qnorm(p,mu=0,sd=1) and pnorm(q,mu=0,sd=1) gives q such that
P(X<q)=p for X~N(0,1).
– dnorm(x,mu=0,sd=1) gives the PDF of the N(0,1) distribution at x.
– Similar procedures are available for 19 other standard
distributions.
• The for and if commands implement loops and conditional
execution of statements.
• Together these procedures allow for easy implementation
of simulations.
31
betalb_rep(0,200)
betaub_rep(0,200)
count_0
for(i in 1:200){
x_runif(30,0,10)
y_1+2*x+rnorm(30)
myreg_lm(y~x)
betalb[i]_myreg$coefficients[2]qt(.975,28)*sqrt(deviance(myreg)/28)/sqrt(29*var(x))
betaub[i]_myreg$coefficients[2]+
qt(.975,28)*sqrt(deviance(myreg)/28)/sqrt(29*var(x))
if (betalb[i]<2&&betaub[i]>2) count_count+1
}
32
Could delete the if statement and just compute
100-(sum(betalb>2)+sum(betaub<2))/2
33
Memory allocation in S-Plus
• In loops, RAM is not released until the loop is completed.
– In large simulations, later iterations will slow relative to earlier
iterations.
– This problem is made worse in nested loops (loops within loops).
• Try to avoid loops, and especially nested loops, whenever
possible
– Use vector and matrix manipulations, instead.
– Have yik, i=1,…,n, k=1,…,ni. Want yi.=Σyik.
– Could loop across i and k, or be clever and use cumsum:
cumsum(y)[c(n1,…, nn)]-c(0,cumsum(y)[c(n1,…,nn-1)])
34
R
An alternative to S-Plus is R.
• Part of the GNU ("GNU's Not Unix") constellation of free
Linux-based software.
• Runs under Linux, UNIX and Windows.
• Big difference: dynamic allocation of memory
• Can run loops without chewing up memory.
35
• Most basic code is virtually interchangeable. A few
differences:
– Generalized linear models are implemented differently in R. (The
same functionality is available but the components have different
names.)
– Missing data usually leads to failure in S unless the “na.omit” is
used; "na.omit" is the default in R.
– No-intercept models are defined by y~x+0 instead of y~x-1.
– Write out “TRUE” and “FALSE” instead of using “T” and “F”.
• More advanced functions (linear and non-linear mixed
models, smoothing splines, adaptive quadrature, SEMs)
have packages that need to be downloaded.
36
Where to go to get help
• This has been a very brief introduction.
• The on-line help menu in S-plus is very useful.
• Venables and Ripley (1997) Modern Applied Statistics
with S-Plus (2nd edition).
• Websites:
Splus:
– www.mathsoft.com
– Statlib site: lib.stat.cmu/S/
R:
– http://www.R-project.org/
– http://www.gnu.org/gnu/the-gnu-project.html
See http://cceb.med.upenn.edu/elliott/ for this presentation.
37