
Computing for Research I
Spring 2014
Introduction to R
March 17
Primary Instructor: Elizabeth Garrett-Mayer
Check out online resources
http://people.musc.edu/~elg26/teaching/methods2.2010/
R-intro.pdf
http://www.ats.ucla.edu/stat/r/
http://www.statmethods.net/about/learningcurve.html
http://www.mayin.org/ajayshah/KB/R/index.html
http://chartsgraphs.wordpress.com/2011/01/09/learnrtoolkit-to-help-excel-users-move-up-to-r/
R. Kabacoff on learning R after SPSS and SAS
(http://www.statmethods.net/about/learningcurve.html)
Why R Has a Steep Learning Curve:
A long answer to a simple question...
• I have been a hardcore SAS and SPSS programmer for more than 25 years, a Systat programmer for 15 years, and a Stata programmer for 2 years. But when I started learning R recently, I found it frustratingly difficult. Why?
I think that there are two reasons why R can be challenging to learn quickly.
• First, while there are many introductory tutorials (covering data types, basic commands, and the interface), none alone is comprehensive. In part, this is because much of the advanced functionality of R comes from hundreds of user-contributed packages. Hunting for what you want can be time-consuming, and it can be hard to get a clear overview of what procedures are available.
• The second reason is more ephemeral. As users of statistical packages, we tend to run one prescribed procedure for each type of analysis. Think of PROC GLM in SAS. We can carefully set up the run with all the parameters and options that we need. When we run the procedure, the resulting output may be a hundred pages long. We then sift through this output, pulling out what we need and discarding the rest. The paradigm in R is different.
• Rather than setting up a complete analysis at once, the process is highly interactive. You run a command (say, fit a model), take the results and process them through another command (say, a set of diagnostic plots), take those results and process them through another command (say, cross-validation), etc. The cycle may include transforming the data and looping back through the whole process again. You stop when you feel that you have fully analyzed the data. It may sound trite, but this reminds me of the paradigm shift from top-down procedural programming to object-oriented programming that we saw a few years ago. It is not an easy mental shift for many of us to make.
• In the end, however, I believe that you will feel much more intimately in touch with your data and in control of your work. And it's fun!
Installing R
• http://cran.r-project.org/
• Choose appropriate interface
– Windows
– Mac
– Linux
• Follow install instructions
• RStudio: https://www.rstudio.com/
R interface
• run a script file: File -> Open script
• run commands: Ctrl-R
• save session output: sink("[filename]") ... sink()
• Quit session: q()
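For example, a minimal sketch of capturing session output with sink() (the file name results.txt is just a placeholder):

  sink("results.txt")            # start diverting console output to a file
  print(summary(rnorm(100)))     # this output lands in results.txt
  sink()                         # stop diverting; output returns to the console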
General Syntax
• result <- function(object(s), options...)
• function(object(s), options...)
• Object-oriented programming
• Note that ‘result’ is an object
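For instance, a small sketch of this pattern (x is made-up data):

  x <- c(3, 1, 4, 1, 5)
  m <- mean(x)    # the result 'm' is itself an object
  m               # typing an object's name prints its value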
First things first:
• help([function]) or ?function
• help.search("linear model") or ??"linear model"
• help.start()
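For example, to pull up help on mean() and search for linear-model topics:

  ?mean                         # same as help(mean)
  help.search("linear model")   # same as ??"linear model"
  help.start()                  # open the HTML help system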
Choosing your default
• setwd("[pathname for directory]")
• getwd()
• need "\\" instead of "\" when giving paths
• Alternatively, you can use "/" to give path names.
• .Rdata
• .Rhistory
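A sketch with a made-up Windows path (both spellings point to the same directory):

  setwd("C:\\Users\\me\\project")   # doubled backslashes
  setwd("C:/Users/me/project")      # forward slashes also work
  getwd()                           # confirm where you are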
Start with data
• read.table
• read.csv
• scan
• dget
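A minimal sketch, assuming a comma-separated file data.csv with a header row (file and column names are hypothetical):

  data <- read.csv("data.csv")                  # header=TRUE by default
  data <- read.table("data.txt", header=TRUE)   # delimited text version
  x <- scan("values.txt")                       # read a bare vector of numbers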
Extracting variables from data
• Use $: data$AGE
• note it is case-sensitive!
• attach([data]) and detach([data])
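For example, with a small made-up data frame:

  data <- data.frame(AGE=c(34, 51, 46))
  mean(data$AGE)   # $ extracts one variable; AGE and age are different names
  attach(data)     # put the columns on the search path
  mean(AGE)        # now works without the data$ prefix
  detach(data)     # clean up when done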
Descriptive statistics
• summary
• mean, median
• var
• quantile
• range, max, min
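Applied to a made-up numeric vector:

  x <- c(23, 45, 31, 60, 52)
  summary(x)                 # min, quartiles, median, mean, max
  mean(x); median(x)
  var(x)                     # sample variance
  quantile(x, c(0.1, 0.9))   # 10th and 90th percentiles
  range(x); max(x); min(x)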
Missing values
• sometimes cause ‘error’ message
• na.rm=TRUE (e.g., for mean, var)
• na.action=na.omit (e.g., for modeling functions)
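For example, with one missing value:

  x <- c(5, 2, NA, 9)
  mean(x)               # returns NA
  mean(x, na.rm=TRUE)   # drops the NA first: about 5.33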
Data Objects
• data.frame, as.data.frame, is.data.frame
– names([data])
– row.names([data])
• matrix, as.matrix, is.matrix
– dimnames([data])
• factor, as.factor, is.factor
– levels([factor])
• arrays
• lists
• functions
• vectors
• scalars
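A quick sketch of several of these object types with made-up values:

  df <- data.frame(AGE=c(34, 51), SEX=c("M", "F"))
  names(df)            # column names
  row.names(df)        # row names
  m <- as.matrix(df)   # coerce; mixed columns become character
  is.matrix(m)         # TRUE
  f <- factor(c("low", "high", "low"))
  levels(f)            # "high" "low" (alphabetical by default)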
Creating and manipulating
• combine: c
• cbind: combine as columns
• rbind: combine as rows
• list: make a list
• rep(x,n): repeat x n times
• seq(a,b,i): create a sequence from a to b in increments of i
• seq(a,b,length=k): create a sequence of k equally spaced values from a to b
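For instance:

  x <- c(1, 2, 3)         # combine values into a vector
  cbind(x, x^2)           # bind as columns of a matrix
  rbind(x, x^2)           # bind as rows
  list(ages=x, id="a")    # a list can mix types
  rep(0, 5)               # 0 0 0 0 0
  seq(0, 1, 0.25)         # 0.00 0.25 0.50 0.75 1.00
  seq(0, 1, length=3)     # 0.0 0.5 1.0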
ifelse
• ifelse(condition, true, false)
– agelt50 <- ifelse(data$AGE<50,1,0)
– for equality you must use "=="
– "or" is indicated by "|"
  e.g., young.or.old <- ifelse(data$AGE<30 | data$AGE>65, 1, 0)
• cut(x, breaks)
– agegrp <- cut(data$AGE, breaks=c(0,50,60,130))
– agegrp <- cut(data$AGE, breaks=c(0,50,60,130), labels=c(0,1,2))
– agegrp <- cut(data$AGE, breaks=c(0,50,60,130), labels=F)
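A self-contained version of the examples above, using a made-up AGE vector:

  AGE <- c(25, 45, 55, 70)
  ifelse(AGE < 50, 1, 0)                             # 1 1 0 0
  cut(AGE, breaks=c(0, 50, 60, 130))                 # factor with 3 intervals
  cut(AGE, breaks=c(0, 50, 60, 130), labels=FALSE)   # 1 1 2 3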
Looking at objects
• dim
• length
• sort
• attributes
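For example:

  m <- matrix(1:6, nrow=2)
  dim(m)           # 2 3
  x <- c(9, 2, 5)
  length(x)        # 3
  sort(x)          # 2 5 9
  attributes(m)    # e.g., $dim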
Subsetting
• Use [ ]
• Vectors
– data$AGE[data$REGION==1]
– data$AGE[data$LOS<10]
• Matrices & Dataframes
– data[data$AGE<50, ]
– data[ , 2:5]
– data[data$AGE<50, 2:5]
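A runnable sketch of the same idea with a small made-up data frame (only 3 columns here, so the column range is 2:3):

  data <- data.frame(AGE=c(45, 62, 38), REGION=c(1, 2, 1), LOS=c(3, 12, 7))
  data$AGE[data$REGION==1]   # vector subset: ages in region 1
  data[data$AGE<50, ]        # rows with AGE < 50, all columns
  data[, 2:3]                # all rows, columns 2 through 3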
Some math
• abs(x)
• sqrt(x)
• x^k
• log(x) (natural log, by default)
• choose(n,k)
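For instance:

  abs(-3)         # 3
  sqrt(16)        # 4
  2^3             # 8
  log(exp(1))     # natural log by default, so 1
  choose(5, 2)    # 10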
Matrix Manipulation
• Matrix multiplication: A%*%B
• transpose: t(X)
• diag(X)
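A small sketch:

  A <- matrix(1:4, nrow=2)
  B <- diag(2)    # diag(n) builds an n x n identity matrix
  A %*% B         # matrix multiplication (here just A again)
  t(A)            # transpose
  diag(A)         # on a matrix, diag() extracts the diagonal: 1 4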
Table
• table(x,y)
• tabulate(x)
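For example, cross-tabulating two made-up vectors:

  sex <- c("M", "F", "F", "M", "F")
  smoker <- c(1, 0, 1, 1, 0)
  table(sex, smoker)        # 2 x 2 cross-tabulation
  tabulate(c(2, 3, 3, 5))   # counts of 1..max(x): 0 1 2 0 1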
Statistical Tests and CIs
• t.test
• fisher.test and binom.test (binom.exact is in the epitools package)
• wilcox.test
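A minimal sketch with simulated data:

  x <- rnorm(20, mean=0); y <- rnorm(20, mean=1)
  t.test(x, y)         # two-sample t test, with a 95% CI
  wilcox.test(x, y)    # nonparametric alternative
  fisher.test(matrix(c(8, 2, 3, 7), nrow=2))   # exact test on a 2x2 table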
Plots
• hist
• boxplot
• plot
– pch, type, lwd
– xlab, ylab
– xlim, ylim
– xaxt, yaxt
• axis
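For example, a scatterplot using several of these options, with simulated data:

  x <- rnorm(50); y <- x + rnorm(50)
  hist(x)
  boxplot(y)
  plot(x, y, pch=19, type="p", lwd=1,
       xlab="x value", ylab="y value",
       xlim=c(-3, 3), ylim=c(-4, 4),
       xaxt="n")               # suppress the default x axis
  axis(1, at=c(-2, 0, 2))      # then draw a custom one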
Plot Layout
• par(mfrow=c(2,1))
• par(mfrow=c(1,1))
• par(mfcol=c(2,2))
• help(par)
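For instance, two plots stacked on one page, then back to the default:

  par(mfrow=c(2, 1))    # 2 rows, 1 column of plots
  hist(rnorm(100))
  boxplot(rnorm(100))
  par(mfrow=c(1, 1))    # reset to one plot per page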
Probability Distributions
• Normal:
– rnorm(N,m,s): generate N random normal values with mean m, std dev s
– dnorm(x,m,s): density at x for normal with mean m, std dev s
– qnorm(p,m,s): quantile associated with cumulative probability p for normal with mean m, std dev s
– pnorm(q,m,s): cumulative probability at quantile q for normal with mean m, std dev s
• Binomial
– rbinom
– etc.
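For the standard normal (m=0, s=1), the four functions look like this:

  rnorm(5, 0, 1)       # 5 random draws
  dnorm(0, 0, 1)       # density at 0: about 0.399
  qnorm(0.975, 0, 1)   # 97.5th percentile: about 1.96
  pnorm(1.96, 0, 1)    # cumulative probability at 1.96: about 0.975
  rbinom(5, size=10, prob=0.5)   # binomial analogue: 5 draws of 10 coin flips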
Libraries
• Additional packages that can be loaded (next lecture)
• Example: epitools
• library
• library(help=[libname])
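For example, with the epitools package (the install step needs an internet connection and is done only once):

  install.packages("epitools")   # download from CRAN (one time)
  library(epitools)              # load it into this session
  library(help=epitools)         # list what the package provides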
Keeping things tidy
• ls() and objects()
• rm()
• rm(list=ls())
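For example:

  x <- 1:10; y <- "temp"
  ls()            # list objects in the workspace (objects() is the same)
  rm(y)           # remove one object
  rm(list=ls())   # remove everything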