R Environment

Download Report

Transcript R Environment

Statistics Workshop
Dave Zhao
PowerPoint by Casey Hanson
Statistics Workshop | Dave Zhao | 2016
1
Introduction
In this lab, we will do the following:
1.
Learn about the R programming language and the RStudio Integrated
Development Environment (IDE).
2.
Examine and manipulate data in R.
3.
Perform univariate regression.
4.
Perform multivariate regression.
Statistics Workshop | Dave Zhao | 2016
2
Step 0A: Local Files
For viewing and manipulating the files needed for this laboratory
exercise, insert your flash drive.
Denote the path to the flash drive as the following:
[course_directory]
We will use the files found in:
[course_directory]/01_Statistics/data/
Statistics Workshop | Dave Zhao | 2016
3
Step 0B: Data Files
data.RData – 20,000 gene expression vectors, CD4 levels, and
case/control information for 100 individuals in wide format. This file
cannot be opened by excel or notepad/wordpad. It is in an R specific
format.
data.txt – Same as data.Rdata, but in text – long format.
data.xlsx – Same as the above, but in excel – long format.
lab.R – All of the R commands we will run in today's lab. The sdir
variable needs to be changed to the directory of your R lab data for
01_Statistics.
Statistics Workshop | Dave Zhao | 2016
4
R Programming
The objectives of this section are to learn the user interface of the RStudio program as
well as essential aspects of R programming, namely:
1.
2.
3.
4.
5.
How to create variables.
What the R Environment is.
Typical math operations in R.
Typical data objects in R.
How to modify data using functions in R.
Statistics Workshop | Dave Zhao | 2016
5
Step 1A: Opening RStudio
Go to the Desktop
Double click the RStudio Icon
A program like the this should open.
Statistics Workshop | Dave Zhao | 2016
6
Step 1B: Configuring RStudio
Click Tools on the Menu Bar and select Global Options.
Statistics Workshop | Dave Zhao | 2016
7
Step 1C: Configuring RStudio
The Options window
should appear.
Click Pane Layout.
Ensure your configuration
matches the image on the
right. Both for the clicked
boxes and their order.
Click OK.
Statistics Workshop | Dave Zhao | 2016
8
Step 1D: RStudio Opened
Statistics Workshop | Dave Zhao | 2016
9
Step 2A: RScript Panel
The Scripting Panel is located in the top left.
You can input R commands here and save the
commands to a file.
You can run highlighted R commands by
pressing.
Files saved can also be run in the R Console
(bottom left).
We won’t be using this panel today.
Statistics Workshop | Dave Zhao | 2016
10
Step 2B: R Console Panel
The R Console (bottom left) is where we will
be doing most of our work.
The R Console takes commands as input and
performs operations based on the
commands.
The commands may contain references to
data variables or objects in the R
environment.
Commands may return new data variables
based on inputs.
Statistics Workshop | Dave Zhao | 2016
11
Step 2C: Plots and 4th Panel
The Plots Panel (upper right) contains any
visualizations that you might produce using R.
The 4th Panel (bottom left) contains tabs for
various views of the R configuration and
environment.
Environment – Shows all variables and objects
loaded or created. Initially empty.
Statistics Workshop | Dave Zhao | 2016
12
Step 3A: Using the R Console
Commands typed into the R console, as below, …
Will be represented in this tutorial as the following:
> x <- 3;
Statistics Workshop | Dave Zhao | 2016
13
Step 4A: Setting a Variable
R is a program that interprets commands as text input, stores data as variables or
objects, and is able to modify data using functions.
Let’s start with variables.
Suppose we want to create a variable x that is set to the value 3.
Type the following into the command prompt.
At the end of each line, press the Enter key on the keyboard for R to execute the
command.
> x <- 3;
> x
Statistics Workshop | Dave Zhao | 2016
14
Step 4B: Setting a Variable
You should see the following result:
[1] 3
The first line does two things:
1.
2.
It creates a variable called x in the current R environment.
Using the R assignment operator <-, it sets the value of x to 3.
The second line (without ;), prints the value of x to the R console.
The [1] 3 indicates that x has 1 value (shown in brackets) that is 3.
Statistics Workshop | Dave Zhao | 2016
15
Step 4C: Setting a Variable
RStudio enables the user to examine the R Environment as they type
commands into the R console.
For instance, before the command was executed, the environment
looked like the following:
After the command was executed, the environment shows the x
variable with its corresponding value 3.
Statistics Workshop | Dave Zhao | 2016
16
Step 5A: R Environment
So…what is the R Environment?
The R environment is the set of variables that the user creates while
inputting commands into R.
For instance, given the current R environment, try printing the value of
the variable y to the screen.
You should see something like below:
> y
Error: object 'y' not found
Statistics Workshop | Dave Zhao | 2016
17
Step 5B: R Environment
We get an error because the variable y has yet to be created.
If we create y , say be setting it to 5, it will be added to the
environment.
> y <- 5;
> y
> [1] 5
You can see that after creating the variable y with value 5, the
environment updates to show the newly created value.
Statistics Workshop | Dave Zhao | 2016
18
Step 6A: Data Types
Thus far, we have only worked with numeric data (real numbers).
R supports many other data types including :
> x <- 'c'
1. Characters (use quotes):
2. Factors: (categorical data with an explicit order)
R can also create a vector of data types, using the c() function.
For instance, to create a column vector z that is [1,2,3] :
> z <- c(1,2,3);
> z
[1] 1 2 3
Statistics Workshop | Dave Zhao | 2016
19
Step 6B: Data Types
Other data types are available at the following link:
http://www.statmethods.net/input/datatypes.html
The one we will use most today are the vector and data frame.
A data frame is like a matrix except that the columns don’t need to have the
same type of data.
For instance, the 1st column can have characters, the 2nd numeric, and the 3rd
column factors.
We won’t create data frames explicitly but rather load them from a file.
Statistics Workshop | Dave Zhao | 2016
20
Step 7A: Functions and Operators
Functions accept variables as input, perform operations on those
variables, and return the result as a new variable.
We have already seen what the function c() does : it takes a list of
comma separated variables or values and constructs a column vector
from them.
Lets create a variable a that is the column vector [3,5,1,2,3] using
only the variables currently in the environment.
> a <- c(x,y,z);
> a
[1] 3 5 1 2 3
Statistics Workshop | Dave Zhao | 2016
21
Step 7B: Functions and Operators
The c() function took as input the variables x, y, and z and
concatenated them all into a new column vector a.
So c() operated on the input variables and created a new output
variable a.
Did it change x, y, or z though? NO.
In principle, an R function will not change the
input data
Statistics Workshop | Dave Zhao | 2016
22
Step 7C: Functions and Operators
What is an operator then?
An operator is like a function except it’s form is different from the
function form: function_name(arg1, arg2, arg3)
For instance, we can add two numeric variables using the + operator.
> z <- x + y
> z
[1] 8
In fact, we have already seen an operator before: (remember <-).
Statistics Workshop | Dave Zhao | 2016
23
Step 7D: Functions and Operators
Notice that we assigned the variable z to a new value 8.
What does this do? It destroys the old z which was [1,2,3].
It then creates a new z with the value of 8.
Look at the R Environment.
Statistics Workshop | Dave Zhao | 2016
24
Step 7D: Functions and Operators
A list of common operators is given below:
a + b
: Adds a and b and returns the result.
a * b
: Multiplies a by b and returns the result.
a – b
: Subtracts b from a and returns the result.
a / b
: Divides a by b and returns the result.
a ** b
: Raises a to the power of b and returns the result.
For more operators visit the following link:
http://www.statmethods.net/management/operators.html
Statistics Workshop | Dave Zhao | 2016
25
Statistics Lab
The objective of this section of the lab is to be able to perform the following:
1. Load .RData into the R Environment.
2. Perform a single univariate regression using glm.
3. Perform multivariate regression using glm.
4. Find p-values of univariate and multivariate regression using anova.
Statistics Workshop | Dave Zhao | 2016
26
Step 1: Clear the current environment
For the sake of neatness, let us cleanse the current R environment of all
variables.
What do we mean by cleanse? Remove all variables.
We can accomplish this with the rm() and ls() functions.
> rm(list=ls());
> ls()
character(0)
Statistics Workshop | Dave Zhao | 2016
27
Step 2: Change directories
We want to change the working directory of R to make loading the data
easier.
> setwd("[course_directory]/01_Statistics/data");
Make sure you substitute [course_directory] with the correct
path to 01_Statistics.
For example, on my machine, the command would be :
> setwd("F:/01_Statistics/data");
Statistics Workshop | Dave Zhao | 2016
28
Step 3: Load the .RData file
In the directory, there is a file called data.RData.
This file contains ALL of the variables relevant to this lab extracted from
Dr. Zhao’s R Environment.
To load the variables into our R Environment, use the load() function.
Notice the change in the R Environment panel.
> load("data.Rdata")
There is only a single object with 100 obs. of 2003 variables.
Statistics Workshop | Dave Zhao | 2016
29
Step 4A: Examine the Data
We loaded an object called data with 100 obs. of 2003 variables.
What kind of object is it?
> class(data)
data frame
How many rows and columns does it have? (Hint: 100 rows, 2003 cols)
> dim(data)
[1] 100 2003
Statistics Workshop | Dave Zhao | 2016
30
Step 4B: Examine the Data
How can we get a peak of the data? Let’s view the first 5 rows and first 3
columns. Note that 1:3 evaluates to c(1,2,3).
Note: Leave off the semicolon so that it prints to the console.
> data[1:5,1:3]
CD4
gene1
gene2
1
2.00772551
11.759267
13.818804
2
0.01781400
10.084232
13.377747
3 131.84355762
8.178157
13.173177
4
8.82627931
7.316058
6.338184
5
0.01406711
11.690831
8.429529
Statistics Workshop | Dave Zhao | 2016
31
Step 4C: Examine the Data
It appears that this data frame contains the transposed table from the
Excel file shown in lecture and data.xls. If so, why are there 20003
columns instead of 20001 (CD4 + 20000 genes)?
Let’s look a columns 20002 and 20003.
> data[1:3, c(20002,20003)]
condition
time
1
ctrl 30min
2
ctrl 60min
3
exp 60min
Statistics Workshop | Dave Zhao | 2016
32
Step 4D: Examine the Data
So columns 20002 and 20003 contain information about whether the individual
was a control or exp subject and what time the gene expression was taken (30
or 60 min).
What data types are these? Characters or factors?
> class(data[["condition"]])
[1] "factor"
So a factor can look like a character string, but it is stored differently and has a
built in way of ordering.
This is important for the regression, because glm uses factors, not characters.
Statistics Workshop | Dave Zhao | 2016
33
Step 5: Dividing the data
Create a vector called cd4 by selecting the CD4 column of data.
Create a data frame of gene expressions called genes.
Create two vectors corresponding to control and time.
> cd4 <- data[["CD4"]];
> genes <- data[,2:20001];
> condition <- data[["condition"]];
> time <- data[["time"]];
Statistics Workshop | Dave Zhao | 2016
34
Step 6A: Univariate Regression
Let us regress the gene expression of gene 1 against log CD4 response
across individuals.
> fit <- glm(log(cd4) ~ genes[[1]], family="gaussian");
> fit
Call:
glm(formula = log(cd4) ~ genes[[1]], family = "gaussian")
Coefficients:
(Intercept)
genes[[1]]
0.4361
-0.3183
Degrees of Freedom: 99 Total (i.e. Null);
Null Deviance:
Residual Deviance: 1042
98 Residual
1128
AIC: 524.1
Statistics Workshop | Dave Zhao | 2016
35
Step 6B: Univariate Regression
How do we get the correlation r and p-value p of the univariate
regression? Is it significant at 0.05 threshold?
> sumfit <- summary(fit);
> sumfit$coefficients
Estimate Std. Error
(Intercept)
0.4360894
genes[[1]]
-0.3182643
t value
1.1711375
Pr(>|t|)
0.3723639 0.710425941
0.1119869 -2.8419784 0.005455195
> r <- sumfit$coefficients[2,1];
> p <- sumfit$coefficients[2,4];
Statistics Workshop | Dave Zhao | 2016
36
Step 7A: Does the drug effect CD4 level
As in Example 1 from lecture, this will require a multivariate regression
Specifically,
log 𝐶𝐷4 ~ 𝛽0 + 𝛽1 I exp + 𝛽2 I 30 min + 𝛽3 I(exp, 30min)
Where we test the null hypothesis:
𝐻0 : 𝛽1 = 𝛽3 = 0
Statistics Workshop | Dave Zhao | 2016
37
Step 7B: Does the drug effect CD4 level
We first need to fit the alternative hypothesis: 𝐻1 : 𝛽1 ≠ 𝛽3
> fit_h1 <- glm(log(cd4) ~ condition * time,family="gaussian");
We then need to fit the null hypothesis: 𝐻0 : 𝛽1 = 𝛽3 = 0
> fit_h0 <- glm(log(cd4) ~ time, family="gaussian");
We then need to find the p-value p of the likelihood ratio statistic:
> p <- anova(fit_h0, fit_h1, test="LRT")[2,5];
> p
0.429236
Statistics Workshop | Dave Zhao | 2016
38
Step 7C: Does the drug effect CD4 level
The p-value of the LRT statistic is 0.429236.
At a significance threshold of 0.05, we fail to reject 𝐻0 .
Why?
0.429236 > 0.05
Therefore, we fail to detect that the drug effects CD4 level.
Statistics Workshop | Dave Zhao | 2016
39
Step 8A: Which genes are affected by the drug
We will perform a very similar procedure to Step 7.
Specifically given a gene expression vector 𝑔,
𝑔 ~ 𝛽0 + 𝛽1 I exp + 𝛽2 I 30 min + 𝛽3 I(exp, 30min)
Where we test the null hypothesis:
𝐻0 : 𝛽1 = 𝛽3 = 0
Except we will do this for all 20,000 genes.
Since we are performing 20,000 statistical tests, we will have to do some p-value
adjustment.
Statistics Workshop | Dave Zhao | 2016
40
Step 8B: Which genes are affected by the drug
The body of the code is similar to before.
All the sapply does is iterate over each column in the genes data
frame and returns a vector of p-values.
This will take some time.
ps <- sapply(genes, function(g){
fit_h1 <-glm(g ~ condition * time);
fit_h0 <-glm(g ~ time);
p <- anova(fit_h0, fit_h1, test="LRT")[2,5];
return(p)}
);
Statistics Workshop | Dave Zhao | 2016
41
Step 8C: Which genes are affected by the drug
The vector ps is a 20,000 dimensional vector containing p-values for
each statistical test asking if the gene is affected by CD4.
However, since we do 20,000 different tests, by random chance, we will
have many genes with a p-value <= 0.05.
Thus, we must correct the p-values in light of the number of tests
conducted.
A popular method for doing so is False Discovery Rate (FDR) control.
Statistics Workshop | Dave Zhao | 2016
42
Step 8D: Which genes are affected by the drug
To get adjusted p_values using FDR, we have to use the p.adjust
function.
> ps_fdr <- p.adjust(ps, method="fdr");
We can then reject at our normal value of 0.05.
A way of finding and counting the number of genes with an adjusted pvalue <= 0.05 is given below:
> drug_genes <- which(ps_fdr <= 0.05);
length(drug_genes)
[1] 21
Statistics Workshop | Dave Zhao | 2016
43
Step 8E: Which genes are affected by the drug
So, which genes were affected by the drug ?
The genes 1-20 and gene11989.
How many genes were affected by the drug?
21
Statistics Workshop | Dave Zhao | 2016
44
Summary
In this lab, we did the following:
• Learned about the R programming language and the RStudio
Integrated Development Environment (IDE).
• Examined and manipulated data in R.
• Performed univariate regression of gene expression and log of CD4
levels.
• Performed multivariate regression to :
• Discover if the drug effects CD4 levels.
• Discover the identity and number of genes affected by the drug.
Statistics Workshop | Dave Zhao | 2016
45
Extra Material
Material not essential for the lab, but informative for curious students.
Statistics Workshop | Dave Zhao | 2016
46
Step 2D: 4th Panel
History – Contains a list of the history of
commands. We will not be using this pane in
this lab.
Files – Contains a list of the files in the
directory R is currently set to. Initial directory
can be changed from Tools – Preferences. We
will not be using this pane in this lab.
Statistics Workshop | Dave Zhao | 2016
47
Step 2E: 4th Panel
Packages – Contains a list of Packages (where
each Package contains commands that native
R doesn’t support) that can be loaded. We
will not be using this pane in this lab.
Help – Provides an input box to query for help
regarding R. We will not be using this pane in
this lab.
View – Provides a way to view local web
content. We will not be using this pane in this
lab.
Statistics Workshop | Dave Zhao | 2016
48
Step 5C: R Environment
What does create a variable and assigning it to a value mean.
Initially, the R environment has a large amount of free memory to store
data. Let’s conceptualize this as an empty rectangle.
Statistics Workshop | Dave Zhao | 2016
49
Step 5D: R Environment
When we create a variable, we consume a portion of the free memory
of R and dedicate it to a variable.
x
-
When created, the variable doesn’t have a value, so we must assign a
value to fill its space.
x
3
Statistics Workshop | Dave Zhao | 2016
50
Step 5E: R Environment
As we create more and more variables, the R Environment becomes
increasingly crowded, with less room to give to new variables.
So far, we have created two variables in the R Environment, named x
and y.
These are fairly tiny variables compared to the size of the R
environment.
When we load data in from files, the size of the variables will become
much larger.
Statistics Workshop | Dave Zhao | 2016
51
Testing Differential Expression
(D.E.) Over Time in Control and
Experimental Conditions.
The code for this exercise is available on the schedule page in the data directory link.
The file is called lab_final.R
The additional code from lab.R is found in Section 5.0
Statistics Workshop | Dave Zhao | 2016
52
Step 9A: Differential Expression Over Time in
Control/Experimental
Does differential expression (D.E.) over time differ between
experimental and control individuals?
An individual (row) is either in a control or experimental
condition. Thus, we should split time and gene data by the control
type.
The code below will get which individuals (rows) are "control" and
"experimental" respectively.
> indiv_ctrl
<- which(condition=="ctrl");
> indiv_exp
<- which(condition=="exp");
Statistics Workshop | Dave Zhao | 2016
53
Step 9B: Differential Expression Over Time in
Control/Experimental
Split genes and time by the different control and experimental rows.
> genes_ctrl <- genes[indiv_ctrl,];
> genes_exp
<- genes[indiv_exp,];
> time_ctrl
<- time[indiv_ctrl];
> time_exp
<- time[indiv_exp];
For example, genes[indiv_ctrl,] will get the gene expression data for all
individual (rows) in the control group for all genes (columns),
Similarly, time[time_ctrl] will get only those time points in the control
group. We don't need the , because time is vector, not a matrix.
Statistics Workshop | Dave Zhao | 2016
54
Step 9C: Differential Expression Over Time in
Control/Experimental
For each condition, we want to regress the time of each gene
expression on gene expression within each condition.
> ps_ctrl <- sapply(genes_ctrl, function(g) {
## regress gene expression of gene g against time in the control group
model <- lm(g ~ time_ctrl);
model_summary <- summary(model); ## summarize model
coef <- model_summary$coefficients; ## get coefficient matrix
p <- coef[2,4]; ## get p-value from coefficient matrix
return(p);
});
In the above code, we iterate over each column (gene) in genes_ctrl,
regress its expression against time within the control group, and return
the p-value.
Statistics Workshop | Dave Zhao | 2016
55
Step 9D: Differential Expression Over Time in
Control/Experimental
For each condition, we want to regress the time of each gene
expression on gene expression within each condition.
> ps_exp <- sapply(genes_exp, function(g) {
## regress gene expression of gene g against time in the experimental group
model <- lm(g ~ time_exp);
model_summary <- summary(model); ## summarize model
coef <- model_summary$coefficients; ## get coefficient matrix
p <- coef[2,4]; ## get p-value from coefficient matrix
return(p);
});
In the above code, we iterate over each column (gene) in genes_exp,
regress its expression against time within the experimental group, and
return the p-value.
Statistics Workshop | Dave Zhao | 2016
56
Step 9E: Differential Expression Over Time in
Control/Experimental
ps_ctrl contains the p-values of the association of each gene's
expression with its time of sampling within the control group.
What is the data type of ps_ctrl :
vector or matrix (data frame)?
How many entries does ps_ctrl have? In other words, what is the
length of ps_ctrl ?
ps_exp, likewise, contains the p-values of the association of each
gene's expression with its time of sampling within the experimental
group.
Statistics Workshop | Dave Zhao | 2016
57
Step 9E: Differential Expression Over Time in
Control/Experimental
ps_ctrl contains the p-values of the association of each gene's
expression with its time of sampling within the control group.
What is the data type of ps_ctrl :
vector
How many entries does ps_ctrl have? In other words, what is the
length of ps_ctrl ?
20,000
ps_exp, likewise, contains the p-values of the association of each
gene's expression with its time of sampling within the experimental
group.
Statistics Workshop | Dave Zhao | 2016
58
Step 9F: Differential Expression Over Time in
Control/Experimental
We should correct the p-values for multiple testing, using FDR.
> ps_ctrl_fdr <- p.adjust(ps_ctrl,"fdr");
> ps_exp_fdr
<- p.adjust(ps_exp, "fdr");
The ps_ctrl_fdr vector contains the multiple hypothesis adjusted pvalues of the control data using FDR.
The ps_exp_fdr vector, likewise, contains the adjusted p-values of the
experimental data using FDR.
Statistics Workshop | Dave Zhao | 2016
59
Step 9G: Differential Expression Over Time in
Control/Experimental
Now, let's find out which genes are D.E. at an FDR of 5% (0.05) within
each group.
> de_ctrl <- ps_ctrl_fdr <= 0.05;
> de_exp
<- ps_exp_fdr
<= 0.05;
What is the data type of de_ctrl: vector or matrix?
What are the types of entries in de_ctrl?
What does de_ctlr[i]represent for any 𝑖 between 1 and 20,000?
Statistics Workshop | Dave Zhao | 2016
60
Step 9G: Differential Expression Over Time in
Control/Experimental
Now, let's find out which genes are D.E. at an FDR of 5% (0.05) within each
group.
> de_ctrl <- ps_ctrl_fdr <= 0.05;
> de_exp
<- ps_exp_fdr
<= 0.05;
What is the data type of de_ctrl: vector or matrix? vector
What are the types of entries in de_ctrl? True/False or 1/0
What does de_ctlr[i]represent for any 𝑖 between 1 and 20,000?
Whether gene 𝒊 is differentially expressed over time in the control
group.
Statistics Workshop | Dave Zhao | 2016
61
Step 9H: Differential Expression Over Time in
Control/Experimental
Let's group genes together into the four different D.E. pairwise
outcomes:
For a given gene 𝑖:
de_ctrl[i]
de_exp[i]
Meaning
FALSE or 0
FALSE or 0
Gene 𝑖 is not D.E. in either
condition.
TRUE
FALSE or 0
Gene 𝑖 is D.E. in the control, but
not experimental condition.
FALSE or 0
TRUE
or 1
Gene 𝑖 is D.E. in the experimental,
but not control condition.
TRUE
TRUE
or 1
Gene 𝑖 is D.E. in both conditions.
or 1
or 1
Statistics Workshop | Dave Zhao | 2016
62
Step 9I: Differential Expression Over Time in
Control/Experimental
Then count the number of genes in each D.E. group.
For two binary vectors with columns in the same relative order, we can
group and count the pairwise groups in one command.
> de_counts <- table(de_ctrl,de_exp);
> de_counts ## Print out table
Do you see an independence between genes D.E in control and genes
D.E. in experimental?
Statistics Workshop | Dave Zhao | 2016
63
Step 9J: Differential Expression Over Time in
Control/Experimental
Let's test the independence of D.E. genes between control and
experimental groups using the Fisher exact test.
The code below get's the p-value of the Fisher exact test.
> de_p <- fisher.test(de_counts)$p;
What value do you get
What is the NULL hypothesis?
Can we reject the NULL hypothesis at a 0.05 threshold?
Statistics Workshop | Dave Zhao | 2016
64
Step 9J: Differential Expression Over Time in
Control/Experimental
Let's test the independence of D.E. genes between control and experimental
groups using the Fisher exact test.
The code below get's the p-value of the Fisher exact test.
> de_p <- fisher.test(de_counts)$p;
What value do you get? (it should be 3.069468e-12)
What is the NULL hypothesis? (D.E. genes in the control are independent of D.E.
genes in the experimental group)
Can we reject the NULL hypothesis at a 0.05 threshold? (Yes)
Statistics Workshop | Dave Zhao | 2016
65