Transcript Slide 1
R Lecture 5
Naomi Altman
Department of
Statistics
Example: Regression
The data are available at
http://www.stat.psu.edu/~jls/stat511/homework/body.dat
?read.table
body=read.table("body.txt",header=T)
plot(body$hips,body$weight)
plot(body$waist,body$weight)
?formula
lm.out=lm(weight~hips+waist,data=body)
attributes(lm.out)
Formulas
lm fits the regression of Y on a set of X
variables.
The variable for Y and the predictors are
denoted by a formula of the form.
You can also use formulas in other
contexts. e.g.
plot(weight~waist, data=body)
Object Oriented Programming in R
or how a bunch of smart programming
types made R easier to use and
harder to program - at least in the
eyes of a statistician
In the bad old days
If I wanted to write a function similar to something
already in R, I would edit the R code:
myFun=edit(Rfun)
myDensity=edit(density)
Sometimes the R code would call a C or C++ program,
but the code for that is also available.
But now ...
plot
boxplot
rnorm
Classes and Generic Functions
I have already mentioned that one of the
attributes a R object can have is a class.
A generic function is a function that
captures the class of an object and then
calls another function to do the actual
work. If the function is called fun and the
class is called cls, the function that does
the work is (almost always) called fun.cls.
If there is no suitable fun.cls, then
fun.default is used.
e.g.
plot(body$hips,body$weight)
plot(lm.out)
plot.default
plot.lm
Classes
Actually, a class can be a pair
c("first","second") in which the "first"
"inherits from" i.e. is a special case of
"second". In practise, this means that it
has all the components of class "first"
objects but possibly some additional ones.
If there is no fun.first, then the generic
function will search for fun.second. Only if
there is also no fun.second will fun.default
be used.
e.g. plot
uses plot.lm on an object with class "lm"
and also on an object with class ("glm","lm")
'inherits' indicates whether its first argument inherits
from any
of the classes specified in the 'what' argument
glm.out=glm(weight~hips+waist,data=body)
class(glm.out)
"glm" "lm"
inherits(lm.out,"lm")
inherits(glm.out,"lm")
inherits(lm.out,"glm")
inherits(glm.out,"glm")
plot.lm
plot.glm
plot(glm.out)
unclass
If you remove the class, most objects are just lists.
lm.out
unclass(lm.out)
For example, the "lm" objects are lists with the following components:
"coefficients" "residuals"
"effects"
"assign"
"qr"
"df.residual"
"xlevels"
"call"
"terms"
"rank"
"fitted.values"
"model"
Some of these components are obvious.
Some of them are matrix computations that can be used to compute, e.g.
the leverages and Cook's Distance (notice that these have not been
stored).
Some of them are only empty - they are used primarily when the predictor
variable is a factor (ANOVA).
Why use classes
For the user: less to think about
e.g. you can try generic functions like plot and
summary with any output
For the programmer: provides a framework
e.g. you might think about having a plot.myfun
and summary.myfun for the function you are
writing
also, you can use inheritance so that you do
not need to write your own functions
Learning to Use Objects and
other Extensions
Calling C or C++ from R:
Writing R extensions
Object oriented programming in R
(S3 protocol)
R Language Definition
(S4 protocol)
R Internals
Odds and Ends
speeding
things up (a bit)
errors in functions and loops
source and sink
attach and detach
resampling
Speeding Things Up
(I am assuming that R behaves like Splus)
Try to minimize the number of expressions
evaluated.
R is an "interpreted" language, not a "compiled"
language. This means that each time R
encounters an expression, it has to decipher it
and convert it to machine-readable microcode.
If you write a loop, then R has to decipher the
expressions within the loop again and again.
Looking at the code for apply, it seems to create
a loop, so apply does not get around this.
Speeding Things Up
Create space for new objects.
Every time you change the dimensions of an object, R
reallocates space.
So my loop:
intercepts=NULL
for (i in 1:1000){y=line+rnorm(25,0,3)
lmout=lm(y~x)
intercepts=c(intercepts,getIntercept(lmout))}
would be better as:
intercepts=rep(0,1000)
for (i in 1:1000){y=line+rnorm(25,0,3)
lmout=lm(y~x)
intercepts[i]=getIntercept(lmout))}
Errors in Functions and Loops
If an error occurs somewhere within a
function or loop, R aborts the whole
process and destroys the frame. Nothing
will change in your .Rdata directory.
There is no such thing as "partial
execution"; if even one mistake occurs
the whole process is aborted.
If you want to "trap" errors so that your
routine can recover, the "try" command
can be used to return a flag.
"source" and "sink"
source("myfile.txt") runs the commands in
the text file "myfile" until they are
completed
source("myOutput.txt") send everything
that would be written to the screen to the
text file "myOutput" until sink() is typed
in
Attach and Detach
"attach" and "detach" expand and
contract the search path for R.
You can attach either a dataframe or a
.Rdata workspace.
Any new assignments are stored in the
current workspace.
When you want to stop searching the
dataframe or workspace, you can
detach it.
Session: Attach and Detach
myData=read.table("body.txt",header=T)
lm.out=lm(weight~hips+waist,data=myData)
hips[1:5]
attach(myData)
hips[1:5]
waist[1:5]
waist=hips
waist[1:5]
detach(myData)
hips[1:5]
waist[1:5]
myData$waist[1:5]
Resampling
Many statistical methods require sampling
from the data already collected.
Examples include the bootstrap and
permutation methods.
sample(x, size, replace = FALSE, prob =
NULL)
allows you to sample with or without
replacement from x, with unequal
probabilities if you want
Resampling
sample(x,length(x),replace=F) creates a
random permutation of the data
Session: Resampling
a=c("H","T")
sample(a,10,replace=T)
sample(a,10,replace=T,prob=c(.2,.8))
sample(1:5,5,replace=T)
sample(1:5,5,replace=F)
#bootstrap distribution of the intercept
attach(myData)
intercept=rep(0,1000)
l=length(weight)
for (i in 1:1000){
samp=sample(1:l,l,replace=T)
intercept[i]=coefficients(lm(weight[samp]~
hip[samp]+waist[samp]))[1]
}
hist(intercept)