
Introduction to Data Analysis in R
Department of Statistical Sciences and Operations
Research
Computation Seminar Series
Speaker: Edward Boone
Email: [email protected]
What is R?

The R statistical programming language is a free, open-source package based on the S language developed by Bell Labs.
The language is very powerful for writing programs.
Many statistical functions are already built in.
Contributed packages expand the functionality to cutting-edge research.
Since it is a programming language, generating computer code to complete tasks is required.
Getting Started

Where to get R?
Go to www.r-project.org
Downloads: CRAN
Set your mirror: any mirror in the USA is fine.
Select Windows 95 or later.
Select base.
Select R-2.4.1-win32.exe
The other downloads are for developers who wish to change the source code.
Getting Started

The R GUI?
Getting Started


Opening a script.
This gives you a script window.
Getting Started

Submit Selection

Submitting a program:
Use the button, or
right mouse click and run the selection.
Getting Started

Basic assignment and operations.

Arithmetic Operations:
+, -, *, /, ^ are the standard arithmetic operators.

Matrix Arithmetic:
* is element-wise multiplication.
%*% is matrix multiplication.

Assignment:
To assign a value to a variable use "<-". A small example follows.
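A small sketch of these operations (the numbers and the matrices A and B below are made up for illustration):

x <- 5 + 3                  # assignment: x is now 8
y <- x^2 / 4                # standard arithmetic: 16
A <- matrix(1:4, nrow = 2)  # a 2 x 2 matrix
B <- matrix(5:8, nrow = 2)
A * B                       # element-wise multiplication
A %*% B                     # matrix multiplication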
Getting Started

How to use help in R?

R has a very good help system built in.
If you know which function you want help with, simply use ?_______ with the function name in the blank.
Ex: ?hist
If you don't know which function to use, then use help.search("_______").
Ex: help.search("histogram")
Importing Data

How do we get data into R?
Remember, we have no point and click…
First make sure your data is in an easy-to-read format such as CSV (Comma Separated Values).
Use code:

D <- read.table("path", sep = ",", header = TRUE)
Working with data.

Accessing columns.
D has our data in it… but you can't see it directly.
To select a column use D$column.
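A few quick ways to peek at D and its columns (assuming the column names used later in these slides, such as wg and Gender):

names(D)   # column names
head(D)    # first six rows
str(D)     # structure: column types and a preview
D$wg       # the wg column as a vector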
Working with data.

Subsetting data.
Use a logical operator to do this.
==, >, <, <=, >=, != are the comparison (logical) operators.
Note that the "equals" comparison is two = signs.

Example:
D[D$Gender == "M",]
This will return the rows of D where Gender is "M".
Remember R is case sensitive!
This code does nothing to the original dataset.
D.M <- D[D$Gender == "M",] gives a dataset with the appropriate rows.
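Conditions can also be combined with & (and) and | (or); a small sketch (the wg cutoff of 10 is just for illustration):

D.M   <- D[D$Gender == "M", ]                # males only
D.M10 <- D[D$Gender == "M" & D$wg > 10, ]    # males with weight gain above 10
nrow(D.M10)                                  # how many rows were kept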
Summary Statistics

Mean
mean(D$wg)
[1] NA
It doesn't know what to do with the missing values.
mean(D$wg, na.rm=TRUE)
[1] 16.62016
Summary Statistics

Median
median(D$wg, na.rm=TRUE)
[1] 15

Quantiles
quantile(D$wg, c(.25, .5, .75), na.rm=TRUE)
25% 50% 75%
  8  15  20

Summary
summary(D$wg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00    8.00   15.00   16.62   20.00   70.00  132.00
Summary Statistics
Range
range(D$wg,na.rm=TRUE)
IQR
IQR(D$wg,na.rm=TRUE)
Standard Deviation
sd(D$wg,na.rm=TRUE)
Basic Testing

One Sample T-Test
t.test(D$wg, alternative="greater", mu=10, conf.level=0.95)

Output
        One Sample t-test
data:  D$wg
t = 8.7429, df = 257, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 10
95 percent confidence interval:
 15.37016      Inf
sample estimates:
mean of x
 16.62016
Basic Testing

Comparing Men to Women on weight gain.
Two Sample T-Test
wg.M <- D[D$Gender == "M",]
wg.F <- D[D$Gender == "F",]
t.test(wg.M$wg, wg.F$wg, alternative="two.sided", mu=0, conf.level=0.95)

Output
        Welch Two Sample t-test
data:  wg.M$wg and wg.F$wg
t = 1.2645, df = 142.21, p-value = 0.2081
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.132955  5.155447
sample estimates:
mean of x mean of y
 18.08571  16.07447
Basic Testing

Paired T-Test
Pt <- read.csv("H:\\Pairedt.csv", header=TRUE)
t.test(Pt$A, Pt$B, alternative="two.sided", mu=0, paired=TRUE, conf.level=0.95)

Output
        Paired t-test
data:  Pt$A and Pt$B
t = -0.1064, df = 9, p-value = 0.9176
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.558007  1.418007
sample estimates:
mean of the differences
                  -0.07
The lm Function

R organizes many linear models into the lm function:
Regression
ANOVA
Regression

Consider the following dataset.
cherry <- read.csv("H:\\cherry.csv",header=TRUE)
plot(cherry$Height,cherry$Volume)
Regression

Fit a regression model to the data.
lm(Volume ~ Height, data = cherry)

Output:
Call:
lm(formula = Volume ~ Height, data = cherry)

Coefficients:
(Intercept)       Height
    -87.124        1.543
Regression

Maybe there is more…
cherry.lm <- lm(Volume ~ Height, data = cherry)
summary(cherry.lm)

Output:
Call:
lm(formula = Volume ~ Height, data = cherry)

Residuals:
    Min      1Q  Median      3Q     Max
-21.274  -9.894  -2.894  12.067  29.852

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.1236    29.2731  -2.976 0.005835 **
Height        1.5433     0.3839   4.021 0.000378 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.4 on 29 degrees of freedom
Multiple R-Squared: 0.3579,     Adjusted R-squared: 0.3358
F-statistic: 16.16 on 1 and 29 DF,  p-value: 0.0003784
Regression

And more…
par(mfrow=c(2,2))
plot(cherry.lm)
Regression

What all is there?
names(cherry.lm)

Output
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "terms"         "model"         "call"
Regression

What all is there? And even more…
names(summary(cherry.lm))

Output
 [1] "call"          "terms"         "residuals"     "coefficients"
 [5] "aliased"       "sigma"         "df"            "r.squared"
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
Regression

Plot the line
par(mfrow=c(1,1))
plot(cherry$Height, cherry$Volume)
abline(cherry.lm)
Regression

Fit confidence intervals to the line.
predict.lm(cherry.lm, interval="confidence")

Prediction intervals for a new data set can also be generated.
Influence measures can be generated as well.
The influence.measures() function does lots of these; a short sketch follows.
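A minimal sketch of both ideas, assuming a small set of new tree heights (the values in new.heights are made up for illustration):

new.heights <- data.frame(Height = c(70, 75, 80))
predict.lm(cherry.lm, newdata = new.heights, interval = "prediction")  # prediction intervals at new heights
influence.measures(cherry.lm)  # DFBETAS, DFFITS, Cook's distance, hat values, etc.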
Multiple Linear Regression
Matrix Plot
pairs(cherry)
Multiple Linear Regression

To fit a multiple linear regression:
cherry.lm <- lm(Volume ~ Height + Diam, data=cherry)
summary(cherry.lm)

Call:
lm(formula = Volume ~ Height + Diam, data = cherry)

Residuals:
    Min      1Q  Median      3Q     Max
-6.4065 -2.6493 -0.2876  2.2003  8.4847

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Height        0.3393     0.1302   2.607   0.0145 *
Diam          4.7082     0.2643  17.816  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-Squared: 0.948,      Adjusted R-squared: 0.9442
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
Regression

ANOVA
anova(cherry.lm)

Analysis of Variance Table

Response: Volume
          Df Sum Sq Mean Sq F value    Pr(>F)
Height     1 2901.2  2901.2  192.53 4.503e-14 ***
Diam       1 4783.0  4783.0  317.41 < 2.2e-16 ***
Residuals 28  421.9    15.1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What type of Sums of Squares are these? (They are sequential, Type I, sums of squares; note how they change when the terms are reordered on the next slide.)
Regression

ANOVA
cherry.lm <- lm(Volume ~ Diam + Height, data=cherry)
anova(cherry.lm)

Analysis of Variance Table

Response: Volume
          Df Sum Sq Mean Sq  F value  Pr(>F)
Diam       1 7581.8  7581.8 503.1503 < 2e-16 ***
Height     1  102.4   102.4   6.7943 0.01449 *
Residuals 28  421.9    15.1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Regression

Confidence Intervals
confint(cherry.lm, level = 0.97)
                   1.5 %      98.5 %
(Intercept) -77.73792775 -38.2373901
Diam          4.10395112   5.3123699
Height        0.04167615   0.6368263
Model Search

AIC
extractAIC(cherry.lm)
[1]  3.00000 86.93578
The first value is the effective degrees of freedom and the second value is the AIC.

Stepwise
step(cherry.lm)
This is based on AIC but can be told to use BIC; the default is AIC.
It can do forward, backward, and stepwise selection; the default is stepwise, as sketched below.
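A brief sketch of those options (step() uses a BIC-style penalty when its k argument is set to log(n)):

full.lm <- lm(Volume ~ Diam + Height, data = cherry)
step(full.lm, direction = "backward")                      # backward selection with AIC
step(full.lm, direction = "both", k = log(nrow(cherry)))   # stepwise selection with a BIC penalty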
ANOVA

ANOVA
do2 <- read.csv("H:\\do2.csv", header=TRUE)
do2.aov <- aov(DO2 ~ Stream, data=do2)
summary(do2.aov)

Output
            Df  Sum Sq Mean Sq F value  Pr(>F)
Stream       2  37.056  18.528  4.5595 0.01898 *
Residuals   29 117.844   4.064
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA

You can also use the lm framework.
do2.lm <- lm(DO2 ~ Stream, data=do2)
anova(do2.lm)

Output
Analysis of Variance Table

Response: DO2
          Df  Sum Sq Mean Sq F value  Pr(>F)
Stream     2  37.056  18.528  4.5595 0.01898 *
Residuals 29 117.844   4.064
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multiple Comparisons

Tukey's HSD Procedure
TukeyHSD(do2.aov)

Output
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = DO2 ~ Stream, data = do2)

$Stream
                      diff       lwr         upr     p adj
Piney-Cedar      0.4966667 -1.634953  2.62828674 0.8342003
South Run-Cedar -2.0533333 -4.184953  0.07828674 0.0607655
South Run-Piney -2.5500000 -4.776405 -0.32359544 0.0221846
Multiple Comparisons

Tukey's HSD Procedure using the plot command.
do2.HSD <- TukeyHSD(do2.aov)
plot(do2.HSD)
Multiple Comparisons

Tukey's HSD Procedure
do2.HSD <- TukeyHSD(do2.aov, conf.level = 0.97)
do2.HSD

Output
  Tukey multiple comparisons of means
    97% family-wise confidence level

Fit: aov(formula = DO2 ~ Stream, data = do2)

$Stream
                      diff       lwr        upr     p adj
Piney-Cedar      0.4966667 -1.832410  2.8257437 0.8342003
South Run-Cedar -2.0533333 -4.382410  0.2757437 0.0607655
South Run-Piney -2.5500000 -4.982642 -0.1173584 0.0221846
Multiple Comparisons

Plot it.
Kruskal Wallis Test

Kruskal-Wallis Test
kruskal.test(DO2 ~ Stream, data=do2)

Output
        Kruskal-Wallis rank sum test
data:  DO2 by Stream
Kruskal-Wallis chi-squared = 7.5097, df = 2, p-value = 0.02340
Loess Regression

Regression from a non-parametric framework.
cherry.lo <- loess(Volume ~ Diam, data=cherry)
cherry.lo$fitted

Output
 [1]  9.358813 10.471276 11.213760 17.518128 18.258269 18.629063 19.368892
 [8] 19.368892 19.742982 20.117965 20.503236 20.890123 20.890123 21.950202
[15] 22.930251 26.020681 26.020681 27.657015 29.364601 29.799478 30.701120
[22] 31.658354 33.187952 41.983894 43.878359 50.561902 51.968665 54.854374
[29] 55.590857 55.590857 76.749972
Loess Regression
cherry.plo <- predict(cherry.lo, cherry$Diam)
plot(cherry$Diam,cherry$Volume)
lines(cherry$Diam,cherry.plo)
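Note that lines() draws a clean curve only if the Diam values are already in increasing order; predicting on an evenly spaced grid is a safe alternative. A small sketch (the 100-point grid is an assumption for illustration):

d.grid   <- seq(min(cherry$Diam), max(cherry$Diam), length.out = 100)
plo.grid <- predict(cherry.lo, newdata = data.frame(Diam = d.grid))
plot(cherry$Diam, cherry$Volume)
lines(d.grid, plo.grid)   # smooth loess curve over the grid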
Summary

R is a programming environment with many standard statistical procedures already included.
It is easy to control the procedure you are interested in.
There is no formal support.
It allows the user to modify many existing procedures.
Summary

All of the R code and files can be found at:
www.people.vcu.edu/~elboone2/CSS.htm