Practical lecture 2: Intro into R language

Download Report

Transcript Practical lecture 2: Intro into R language

Statistical lingua franca
Introduction to R
Lecture 2
22 Sept 2015
Kyrylo Bessonov
1
Outline
1. Introduction to R language
1. basics
2. loops
3. data structures
2. Visualization
1. rich plotting options
3. Bioinformatics
1. repositories
2. annotation of biological IDs
2
Definition
• “R is a free software environment for
statistical computing and graphics”1
• R is considered to be one of the most widely
used languages amongst statisticians, data
miners, bioinformaticians and others.
• R is free implementation of S language
• Other commercial statistical packages are
SPSS, SAS, MatLab
3
1 R Core Team, R: A Language and Environment for Statistical Computing, Vienna, Austria (http://www.R-project.org/)
History
• 1993 creation
– by Ross Ihaka and Robert Gentleman at the
University of Auckland
• Community extended via
– Packages
• Facts
– initial interpreter ~ 1000 lines of C code
– GNU project
4
R language features
• Based on S language (commercial)
– Created in 1976 by Bell Labs
– Main moto: "to turn ideas into software, quickly and
faithfully" (John Chambers)
• Memory allocation
– Fixed allocation at startup
– ‘on-the-fly’ garbage collector
– memory hungry
• Variables have lexical scoping
– functions have access to the variables which were in
the effect when the function was defined
5
Lexical scoping
The value of the variable is searched in the environment it is being called. If the value
is not found, the search is continued in parent environment
make.counter <- function(x = 0){
function(){
x <<- x + 1
cat("Inside the make.counter() the x is",x,"and not 5!\n")
print(environment())
}
}
x <- 5
cat("Global value of x is", x , "\n");
print(environment())
counter <- make.counter()
counter()
print(x)
1) The value of x inside make.counter()is stored even after the function call and
belongs to the function environment (i.e. scope)
2) The global value of x is not modified
3) The “inner function” (i.e. function()) can only access the local value of x
More details are given at the functions section of the tutorial…
6
R language features
• Flexible Plot Layouts
– The layout command divides up the area into a
matrix.
nf <- layout(matrix(c(1,1,0,2), 2, 2,
byrow=TRUE), respect=TRUE)
– Calling layout.show()results in the layout
being displayed for your reference.
7
R language features
• A flexible statistical analysis toolkit
– Statistical models
• Regression, ANOVA, GLM, trees
• Free data analysis
– No subscription fees
– Transparent code
• Is a language
– Can write complex scripts
– Not limited by the GUI
• Active community
8
Why to learn R?
• R is widely used by bioinformaticians and
statisticians
• It is multiplatform
– iOS, Windows, Linux
• Large number of libraries
• Main library repositories CRAN and
BioConductor
9
In a nut-shell…
• R is a scripting language and, as such, is
much more easier to learn than other
compiled languages such as C
• R has reasonably well written
documentation (vignettes)
• Syntax in R is simple and intuitive if one
has basic statistics skills
• R scripts will be provided and explained
in-class
10
Installation of R
1. go to http://www.r-project.org/
2. select CRAN (Comprehensive R Archive
Network) from left menu
3. link to nearby geographic site
4. select your operating system
5. choose "Base" installation
6. save R-X.X.X-win32.exe (windows) or R-X.X.Xmini.dmg (Mac OS X)
7. run the installation program accepting defaults
11
Tutorial
Introduction to R language
Part 1: Basics
12
Topics covered in this tutorial
• Operators / Variables
• Main objects types
• Visualization
– plot modification functions
• Writing and reading data to/from files
• Bioinformatics applications
– Annotation of array probes
13
Variables/Operators
• Variables store one element
x <- 25
Here x variable is assigned value 25
• Check value assigned to the variable x
>x
[1] 25
• Basic mathematical operators :
– arithmetic: + , - , * , /
– power: ^
• Use parenthesis to obtain desired sequence of
mathematical operations
14
Arithmetic operators
• What is the value of small z here?
>x <- 25
> y <- 15
> z <- (x + y)*2
> Z <- z*z
> z
[1] 80
15
Logical operators
• These operators mostly work on vectors,
matrices and other data types
• Type of data is not important, the same
operators are used for numeric and
character data types
Operator
<
<=
>
>=
==
!=
!x
x|y
x&y
Description
less than
less than or equal to
greater than
greater than or equal to
exactly equal to
not equal to
Not x
x OR y
x AND y
16
Logical operators
• Can be applied to vectors in the
following way. The return value is
either True or False
> v1
[1] 48 2 3 4 5
> v1 <= 3
[1] FALSE TRUE TRUE FALSE FALSE
17
Key functions
• mathematical functions
– sqrt, log, exp, sin, cos, tan
• simple functions
– max, min, length, sum, mean, var, sort
• other useful functions
– abs(-5) #absolute value
– exp(8) # exponentiation
– log(exp(8)) # natural logarithm
– sqrt(64) # square root
18
R workspace
• Display all workplace objects (variables,
vectors, etc.) via ls():
>ls()
[1] "Z" "v1" "x" "y" "z"
• Useful tip: to save “workplace” and
restore from a file use:
>save.image(file = " workplace.rda")
>load(file = "workplace.rda")
19
How to find help info?
• Any function in R has help information
• To invoke help use ? Sign or help():
? function_name()
? mean
help(mean, try.all.packages=T)
• To search in all packages installed in your R
installation always use try.all.packages=T
in help()
• To search for a key word in R
documentation use help.search():
help.search("mean")
20
Basic data types
• Data could be of 3 basic data types:
– numeric
– character
– logical
• Numeric variable type / class:
> x <- 1
> mode(x)
[1] "numeric"
> class(x)
[1] "numeric"
21
Basic data types
• Logical variable type (True/False):
> y <- 3<4
> mode(y)
[1] "logical"
• Character variable type:
> z <- "Hello class"
> mode(z)
[1] "character"
22
Objects/Data structures
• The main data objects in R are:
– Matrices (single data type)
– Data frames (supports various data types)
– Lists (contain set of vectors)
– Other more complex objects with slots
• S3 and S4 objects
• Matrices are 2D objects (rows/columns)
> m <- matrix(0,2,3)
> m
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
23
Vectors c()
• Vectors have only 1 dimension and
represent enumerated sequence of data.
They can also store variables
> v1 <- c(1, 2, 3, 4, 5)
> mean(v1)
[1] 3
The elements of a vector are specified
/modified with braces (e.g. [number])
> v1[1] <- 48
> v1
[1] 48 2 3 4 5
24
Lists
• Lists contain various vectors. Each vector
in the list can be accessed by double
braces [[number]]. Lists could be of
different types
> x <- c(1, 2, 3, 4)
> y <- c(2, 3, 4)
> L1 <- list(x, y)
> L1
[[1]]
[1] 1 2 3 4
[[2]]
[1] 2 3 4
25
Matrices
• Contain data of the same type
– i.e. all cells are one of the 3 basic types
• Defined with matrix()
– Arguments:
• initialization value (i.e. 0)
• ncol: number of columns
• nrow: number of rows
matrix(0,ncol=2,nrow=2)
[,1] [,2]
[1,]
0
0
[2,]
0
0
26
Data frames
• Data frames are similar to matrices but
can contain various data types
> x <- c(1,5,10)
> y <- c("A", "B", "C")
> z <-data.frame(x,y)
x y
1 1 A
2 5 B
3 10 C
• To get/change column and row names use
colnames() and rownames()
27
Factors
• Factors are special
– could be vectors of integers or characters
– can identify levels
• i.e. unique variables in a vector
> letters = c("A","B","C","A","C","C")
- the vector letters has 6 chars but 3 levels
> letters = factor(letters)
[1] A B C A C C
Levels: A B C
> summary(letters)
A B C
2 1 3
- summary()allows to see
- number of variables per level
- unique variables (levels)
28
Conversion between data types
• One can convert one type of data into
another using as.xxx where xxx is a data
type
29
S4 objects
• Allow to store more complex data strutures
• Benefits
– Provide greater modularity
– Check for data type errors during assignment
– Cleaner code
• Every S4 objects belongs to a class
require(methods)
setClass("Box",
slots=c(box_name = "character",
address = "numeric",
files = "list“)
,prototype=list(box_name=NULL,
address=NULL, files=list())
)
30
S4 objects
• Each object is an instance of a class
object = new("Box")
• To access slots in the object use @ operator
object@box_name
object@address
object@files
• Assign values to slots
object@box_name="pretty_box"
31
S4 object functions
• Use functions to get and set values in objects
setMethod(f="getBoxName", signature="Box",
definition=function(object){
return(object@box_name)
})
setMethod(f="setBoxName", signature="Box",
definition=function(object,name){
object@box_name <-name
})
32
Read Input/ Write Output
• To read data into R from a text file use
read.table()
– read help(read.table) to learn more
– scan() is a more flexible alternative
raw_data <-read.table(file="data_file.txt")
• To write data into R from a text file use
write.table()
> write.table(mydata, "data_file.txt")
33
Loops
• Loops allow repetition operations
– can check given condition
– execute n number of times
• R is not specifically designed for loops
– use them sparsely or replace by apply()
– avoid nested loops
• Will look at
– for loop
– while loop
– repeat loop
34
For loop
• Executed n number of times
– safe to use
– variable i will take values found in the array (i.e. 1:10)
for(i in 1:10){
print(i)
}
35
while loop
• Will test infinitely for a particular condition
until it becomes TRUE
i=1;
while(i<=10){
print(i)
i=i+1
}
36
repeat
• similar to the while loop
• will always begin the loop
– Executed at least once
• Important to have a breaking condition
– otherwise, infinite loop
x <-1
repeat {
x <- x+1
#if(x >= 5)break;
}
print(x)
37
apply()
• A faster alternative to loops
• Apply function to 2D data structures (matrix, df)
apply(X, MARGIN, FUN, ...)
X:
is matrix or data frame (df)
MARGIN: 1-columns and 2-rows
FUN:
function to apply to rows or columns
M <- matrix(1:6, nrow=3, byrow=TRUE)
apply(M, 1, sum)
apply(M, 2, sum)
#columns
#rows
38
lapply()
• is applied on lists / vectors
• it does not work on matrices or higherdimensional arrays directly
• returns list()
M <- matrix(1:6, nrow=3, byrow=TRUE)
lapply(1:2,function(x){sum(M[,x])})
L=list(sample(1:10,10))
lapply(L,mean)
39
sapply()
• the "s" in "sapply" stands for "simplify
• simplifies output to a vector
M <- matrix(1:6, nrow=3, byrow=TRUE)
sapply(1:2,function(x){sum(M[,x])})
[1] 9 12
40
tapply()
• Apply function to subset data matrix
tapply(summary, group, Function)
Summary variable:
Group variable:
Function:
selected variable to apply function on
the filter variable used to split data
function to apply to resulting sets
medical.data <data.frame(patient = 1:100,
age = rnorm(100, mean = 60, sd = 12),
treatment = gl(2, 50,
labels = c("Treatment", "Control")))
tapply(medical.data$age, medical.data$treatment, mean)
Treatment
Control
61.10559 59.37849
41
Conditional statements
• allow to make choices
– specific parts of the code executed
• Add flexibility
• Main statements
– if … else
– switch
42
If..else
if (test expression) {
#code
} else if (test expression) {
#code
} else () {
#code
}
• else()does not have any test expression
– accepts all cases not covered by if()or else if()
• test expression returns logical value (T or F)
43
If..else
sign <- function(x){
if(x > 0){
print("x is a positive number");
} else if (x == 0){
print("x is zero");
} else{
print("x is negative")
}
}
x<- 5; sign(x);
x<- 0; sign(x);
x<- -2; sign(x);
44
swich()
switch (statement, list)
• the statement is evaluated and value returned
• based on this value, the corresponding item in the
list is returned
y <- rnorm(5)
x <- "sd"
z <- switch(x,"mean"=mean(y),"median"=median(y),"variance"=var(y),"sd"=sd(y))
print(z)
sd(y)
45
Tutorial
Introduction to R language
Part 2: Functions
46
Functions in R
• Functions encapsulate chucks of code
– Helps with debugging and code readability
– Variables passed by ‘pass by value’
add.function<-function(x){
x+5
}
> add.function(2)
[1] 7
47
Functions in R
•
•
•
•
defined with the function() directive
stored as R objects
can be nested
The return() directive is the last expression in
the function body to be evaluated
f <- function() {
# Do something interesting
return(1)
}
48
Arguments
• named arguments
– optionally could have default values
• not every function call in R makes use of all
arguments
– i.e. function arguments could be missing
• then the default values are used
mydata <- sample(1:10,20, replace=T)
sort(mydata)
sort(mydata, decreasing=T)
49
Arguments
• can be matched by
– relative sequential position: f(arg1, arg2, arg3)
rnorm(100,0,1)
– name: f(data=arg1, replace=arg2)
rnorm(n=100, mean = 0, sd = 1)
– can mix positional and by name matching
mydata=data.frame(y=rnorm(100),x=rnorm(100))
lm(data = mydata, y ~ x, model = FALSE, 1:100)
#equivalently
lm(y ~ x, mydata, 1:100, model = FALSE)
50
Lazy evaluation
• Arguments to functions are evaluated lazily
– i.e. R does not check the # of arguments supplied to
function
– i.e. can give less but not more args to a function
– args are evaluated as needed
f <- function(a,b) {
a^2
}
f(a=10) #no error even if b is missing
51
The … argument
• The ... argument indicates
– a variable # of arguments
– adds flexibility and abstraction
• Uknown number of variables to be passed to a function apriori
x=1:10; y=rnorm(10)
f <- function(x, y, ...) {
cat("value of x:",x,"\n");
cat("value of y:",y,"\n");
plot(x,y, ...)
}
f(x,y)
#adding extra argument type=‘l’
f(x,y, type="l")
52
Scoping
• Variable value is searched through a series of
environments
– the global environment
– the namespaces of loaded libraries
search()
[1] ".GlobalEnv"
"package:stats"
[4] "package:grDevices" "package:utils"
[7] "package:methods"
"Autoloads"
"package:graphics"
"package:datasets"
"package:base"
• The global environment
– is the user’s workspace
53
Scoping
• Loading a package via library()adds the
namespace of that package to position 2 of
the search list
library(igraph)
search()
[1] ".GlobalEnv"
[4] "package:graphics"
[7] "package:datasets"
[10] "package:base"
"package:igraph"
"package:stats"
"package:grDevices" "package:utils"
"package:methods"
"Autoloads"
54
Scoping
• The scoping rules for R
– different from the original S language
– uses lexical scoping
• related on how R “searches” for the variable value
– simplifies statistical calculations
f <- function(x, y) {
y=2x+z
}
– z is a ‘free variable’ not defined in f()
55
Searching for the value
• free variable value will be searched
– in the environment in which a function was called
– In the parent environment
– In the top-level environment (i.e. global)
– In the empty environment
• if value not found
– error thrown
56
Scoping
• Typically, the user defined values are found
– in the workspace (i.e. global environment)
– a function is defined in the global environment
• can have functions
– defined inside other functions
– the environment in which a function is defined could
be environment of another function
• child  parent environments
build.power <- function(n) {
p <- function(x) {
x^n
}
print(p)
}
square <- build.power(2)
cube <- build.power(3)
57
Lexical Scoping
• With lexical scoping
– the variable value is looked up in the environment in
which the function was defined
y <- 10
f <- function(x) {
y <- 2
y^2
}
f()
[1] 4
58
Other languages
• Lexical scoping is supported in
– Perl
– Python
–…
59
Dynamic scoping
• With dynamic scoping
– The variable value is looked up in the environment
from which the function was called (the calling
environment)
y <- 10
f <- function(x) {
y^2
}
f()
[1] 100
60
Regression and
Generalized Linear Models Functions
Examples
61
Regression
• Regression analysis is used to
– explain or model the relationship between
• a single variable Y, called the response, output or
dependent variable, and
• one or more predictor, input, independent or
explanatory variables, X1, …, Xp
– p=1 simple regression
– p>1 multivariate regression
62
Main uses of regression analysis
• Prediction of future observations.
• Assessment of the effect of, or relationship
between, explanatory variables on the
response.
• A general description of data structure
63
Regression in R
• the basic syntax for doing regression in R
– lm(Y~model)to fit linear models
– glm(Y~model)to fit generalized linear models.
• where model could be specified following
the general syntax rules in R
– consists of predictor terms
– provides Y~X relationship
64
Model syntax
65
Simple linear regression
• The regression of Y on X is given by
𝑦𝑖 = β𝑜 + β1 𝑥𝑖 + ε𝑖
for i = 1, … , p
• Unknown parameters
– β𝑜 intercept
• point in which the line intercepts the y-axis
– β1 slope or coefficient
• Increase in Y per unit change in X
66
Objective
• To find the equation of
the line that “best” fits
the data.
– find β𝑜 and β1 such that
the fitted values of ŷ𝒊 are
as close as possible to
observed 𝑦𝑖
– Residuals
• The difference between
– the observed value 𝑦𝑖 and
the fitted value ŷ𝒊
» 𝑟𝑖 = 𝑦𝑖 − ŷ𝒊
67
residual sum of squares(RSS)
• A usual way of calculating β𝑜 and β1
– based on the minimization of the sum of the
squared residuals, or
• residual sum of squares (RSS)
𝑅𝑆𝑆 =
𝑝
𝑖 (𝑟𝑖
)2 =
𝑝
𝑖 (𝑦𝑖
− ŷ𝒊 ) 2
68
Regression example
• thuesen dataset
– Ventricular shortening velocity
– 24 rows and 2 columns
• short.velocity (Y)
• blood.glucose (X)
– It contains ventricular shortening velocity and
blood glucose for type 1 diabetic patients.
69
Regression example
file="http://www.montefiore.ulg.ac.be/~kbessonov/present_data
/GBIO0009-1_TopInBioinf2015-16/lectures/L2/thuesen.txt"
data <- read.table(file, header=TRUE, stringsAsFactors=FALSE)
options(na.action =na.exclude)
fit.lm <- lm(short.velocity ~ blood.glucose, data=data)
data.frame(data, fitted.value=fitted(fit.lm),
residual=resid(fit.lm))
1
2
3
4
5
blood.glucose
15.3
10.8
8.1
19.5
7.2
short.velocity fitted.value
residual
1.76
1.433841 0.326158532
1.34
1.335010 0.004989882
1.27
1.275711 -0.005711308
1.47
1.526084 -0.056084062
1.27
1.255945 0.014054962
70
Measuring Goodness of Fit
• Coefficient of Determination, R2
– Need to calculate TSS, RSS and SSreg
• The ANOVA breaks the total variability observed
in the sample into two parts
– TSS = SSreg + RSS
– TSS: total sum of squares (entire sample variability)
– SSreg: regression sum of squares
• Explained by fitted model
– RSS: residual sum of squares (unexplained variability)
71
Calculating R2
anova(fit.lm)
Analysis of Variance Table
Response: short.velocity
Df Sum Sq Mean Sq F value Pr(>F)
blood.glucose 1 0.20727 0.207269
4.414 0.0479 *
Residuals
21 0.98610 0.046957
---
R2 = SSreg/TSS
R2 = 0.20727/(0.98610+0.20727)
R2 = 0.1736846
72
Getting lm() summary
summary(fit.lm)
Call:
lm(formula = short.velocity ~ blood.glucose, data = data)
Residuals:
Min
1Q
Median
3Q
Max
-0.40141 -0.14760 -0.02202 0.03001 0.43490
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
1.09781
0.11748
9.345 6.26e-09 ***
blood.glucose 0.02196
0.01045
2.101
0.0479 *
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2167 on 21 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.1737,
Adjusted R-squared: 0.1343
F-statistic: 4.414 on 1 and 21 DF, p-value: 0.0479
* - are shorthand for significance levels
Estimate - is the value of slope calculated by the regression
Std. Error - Measure of the variability in the estimate for the coefficient β
t value - measures whether or not the coefficient for this variable is meaningful for the model. It is used to calculate the
significance levels.
R-squared - Metric for evaluating the goodness of fit of your model. Higher is better with 1 being the best.
DF - The Degrees of Freedom is the difference between the number of observations (24) and the number of variables used in your
model minus 1
73
(intercept counts as a variable). DF = 24-2-1 = 23
Generalized Linear Models Functions
• The glm()
– is designed to perform generalized linear
models regression on outcome data that is
•
•
•
•
binary
count
proportion
continuous
– overcomes limitation of classical regression models
• do not need to transform the response Y to have a normal
distribution N(0, σ2)
– limitations
• assumes linear Y~X relationship
74
glm()
• glm(formula, family = "gaussian", data, …)
– formula: symbolic description of the model to fit
• e.g. y~x1+x2 or y~. or y~x1+x2+x1*x2
– family: type of distribution to apply to the response
variable
– data: data frame with the variables
• glm returns an object with slots
• summary(obj)provides compact summary
75
logistic regression example
Using data on admissions where rank of 1 represents the highest and 4 the
lowest prestige. Determine the major variable/factor impacting the admission
decision
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
head(mydata)
admit gre gpa rank
1
0 380 3.61
3
2
1 660 3.67
3
fit <- glm(admit ~ gre + gpa + rank, family=binomial(link="logit"),
data=mydata)
summary(fit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.449548
1.132846 -3.045 0.00233 **
gre
0.002294
0.001092
2.101 0.03564 *
gpa
0.777014
0.327484
2.373 0.01766 *
rank
-0.560031
0.127137 -4.405 1.06e-05 ***
AIC: 467.44 (the lower the better)
76
Exploring glm() object
attributes(fit)
$names
[1] "coefficients"
[4] "effects"
[7] "qr"
[10] "deviance"
[13] "iter"
[16] "df.residual"
[19] "converged"
[22] "call"
[25] "data"
[28] "method"
"residuals"
"R"
"family"
"aic"
"weights"
"df.null"
"boundary"
"formula"
"offset"
"contrasts"
"fitted.values"
"rank"
"linear.predictors"
"null.deviance"
"prior.weights"
"y"
"model"
"terms"
"control"
"xlevels"
$class
[1] "glm" "lm"
77
Accessing glm S3 object values
• Can look at beta coefficients of each variable
fit$coefficients
• Can check how good is my fitted model
fit$deviance
fit$aic
• Use TAB key for auto-complete
78
Regression in Genetics
• Response yj for j=1..n
• where case(1) or control(0)
• Could refer to two types of patients or phenotypes
• Genotypes gi for markers i=1..p
• To run glm() need to code data
– Additive
– Dominant
– Recesive
79
Data
• There are genotypes for
– 1018 individuals at 32 SNP markers
– 32 columns give the marker genotypes
• AA-11, AG-12 and GG-22, with NA for missing
– The genotypes
• 890kb region flanking the CYP2D6 gene
• associated with the metabolism of drugs
file="http://www.well.ox.ac.uk/rmott/LECTURES/LOGIST
IC_REGRESSION/ugeno.dat"
data <- read.table(file, header=TRUE,
stringsAsFactors=FALSE)
80
Coding
• Additive coding
Code genotypes as
– AA(11)
x=0,
– AG(12)
x=1,
– GG(22)
x=2
additive <- function( x ) {
return(as.numeric(factor(x))-1)
}
Code genotypes as
– AA
x=0,
– AG
x=0,
– GG
x=1
recessive <- function( x ) {
return ( ifelse( additive(x) > 1, 1,
0 ) )
}
Code genotypes as
– AA
x=0,
– AG
x=1,
– GG
x=1
dominant <- function( x ) {
return ( ifelse( additive(x) > 0, 1,
0 ) )
}
• Recessive coding
• Dominant coding
these functions allow to convert a genotype vector in a certain way (i.e. coding)
81
Fitting the additive model
• Additive coding
x=apply(data[,-c(1:2)],2,additive)
data_prep=as.data.frame(cbind(y=data$y,x))
fit.add <- glm(y~m1, data=data_prep, family = 'binomial' )
Call: glm(formula = y ~ m1, family = "binomial", data =
data_prep)
Coefficients:
(Intercept)
-2.8695
m1
-0.9872
Degrees of Freedom: 1017 Total (i.e. Null);
Null Deviance:
343.7
Residual Deviance: 335.1
AIC: 339.1
1016 Residual
82
additive model example
83
Fitting the dominant model
• Dominant coding
x=apply(data[,-c(1:2)],2,dominant)
data_prep=as.data.frame(cbind(y=data$y,x))
fit.dom <- glm(y~m1, data=data_prep, family = 'binomial' )
Call: glm(formula = y ~ m1, family = "binomial", data =
data_prep)
Coefficients:
(Intercept)
-2.880
m1
-1.004
Degrees of Freedom: 1017 Total (i.e. Null);
Null Deviance:
343.7
Residual Deviance: 336.2
AIC: 340.2
1016 Residual
84
Dominant model example
85
Fitting the recessive model
• Recessive coding
x=apply(data[,-c(1:2)],2,recessive)
data_prep=as.data.frame(cbind(y=data$y,x))
fit.recessive <- glm(y~m1, data=data_prep, family =
'binomial' )
Call: glm(formula = y ~ m1, family = "binomial",
data = data_prep)
Coefficients:
(Intercept)
-3.126
m1
-15.440
Degrees of Freedom: 1017 Total (i.e. Null);
Residual
Null Deviance:
343.7
Residual Deviance: 340.1
AIC: 344.1
1016
86
Recessive model example
87
In-class exercise
Part 1 of 2
88
In-class Exercises
1) Create vectors a=(5, 6, 7) and b=(10,3,1) and
obtain a total sum of their elements that are
greater than 5
2) Calculate 100
𝑖=1 𝑖 + 5
3) Create matrix A and replace the element located
in the 3rd row and 3rd column by the sum of the
1st and 2nd row
1
1
3
𝐴= 5
2
6
−2 −1 −3
89
In-class Exercises
4) Write a function which takes a single
argument which is a matrix A (see Q3). The
function should return a matrix which is the
same as the function argument but every odd
number is doubled.
𝟐
𝟐
𝟔
𝟏𝟎 2
6
−2 −𝟐 −𝟔
90
Plots generation in R
• R provides very rich set of plotting
possibilities
• The basic command is plot()
• Each library has its own version of plot()
function
• When R plots graphics it opens
“graphical device” that could be either
a window or a file
91
Plotting functions
• R offers following array of plotting
functions
Function
Description
plot(x)
plot of the values of x variable on the y axis
bi-variable plot of x and y values (both axis scaled based
on values of x and y variables)
circular pie-char
Plots a box plot showing variables via their quantiles
Plots a histogram(bar plot)
plot(x,y)
pie(y)
boxplot(x)
hist(x)
92
Plot modification functions
• Often R plots are not optimal at 1st
• R has an array of graphical parameters
Consult here is the full list
• Some of the graphical parameters can be
specified inside plot() or using other
graphical functions such as lines()
93
Plot modification functions
Function
points(x,y)
lines(x,y)
Description
add points to the plot using coordinates specified in x and y vectors
adds a line using coordinates in x and y
mtext(text,side=3)
adds text to a given margin specified by side number
boxplot(x)
arrows(x0,y0,x1,y1,
angle=30, code=1)
abline(h=y)
rect(x1, y1, x2, y2)
this a histogram that bins values of x into categories represented as bars
adds arrow to the plot specified by the x0, y0, x1, y1 coordinates. Angle provides
rotational angle and code specifies at which end arrow should be drawn
draws horizontal line at y coordinate
draws rectangle at x1, y1, x2, y2 coordinates
plots legend of the plot at the position specified by x and y vectors used to
generate a given plot
adds title to the plot
adds axis depending on the chosen one of the 4 sides; vector specifying where
tick marks are drawn
used interactively to select locations with mouse
legend(x,y)
title()
axis(side, vect)
locator()
94
Plot margins
• Outer margin
– par(oma=c(3, 2, 2,1))
• Fig. margin
– par(mar=c(5, 4, 4, 2))
• Note the num. order
– down, left, up, right
95
Visualization of data in R
cars <- c(1, 3, 6, 4, 9)
trucks <- c(2, 5, 4, 5, 12)
g_range <- range(0, cars, trucks)
plot(cars, type="o", col="blue", ylim=g_range, axes=FALSE,
ann=FALSE)
lines(trucks, type="o", pch=22, lty=2, col="red")
axis(2, las=1, at=4*0:g_range[2])
axis(1, at=1:5, lab=c("Mon","Tue","Wed","Thu","Fri"))
box()
title(xlab="Days", col.lab=rgb(0,0.5,0))
title(ylab="Total", col.lab=rgb(0,0.5,0))
96
Visualization of data in R
c1 <- c(1,3,6,4,9); c2 <- c(2,5,4,5,12); c3 <- c(4,4,6,6,16);
autos_data <- cbind(c1,c2,c3);
colnames(autos_data)<-c("cars", "trucks", "suvs");
barplot(as.matrix(autos_data), main="Autos", ylab= "Total",
beside=TRUE, col=rainbow(5))
legend("topleft", c("Mon","Tue","Wed","Thu","Fri"), cex=0.8,
bty="n", fill=rainbow(5))
97
Visualization of data in R
#Expand right side of the margin for the legend
par(xpd=T, mar=par()$mar+c(0,0,0,4))
#Graph autos using heat colors
#put 10% of the space between each bar, and make
#labels smaller with horizontal y-axis labels
barplot(t(autos_data), main="Autos", ylab="Total",
col=heat.colors(3), space=0.1, cex.axis=0.8, las=1,
names.arg=c("Mon","Tue","Wed","Thu","Fri"), cex=0.8)
legend(6, 30, colnames(autos_data), cex=0.8,
fill=heat.colors(3));
# Restore default margins
par(mar=c(5, 4, 4, 2) + 0.1);
98
Visualization of data in R
dotchart(t(autos_data),
color=c("red","blue","darkgreen"),
main="Dotchart for Autos", cex=0.8)
99
Visualization of data in R
r <- rlnorm(1000); hist(r)
100
Visualization of data in R
r <- rlnorm(1000) # Get the distribution
h <- hist(r, plot=F, breaks=c(seq(0,max(r)+1, .1)))
# Plot the distribution using log scale on both axes, and
# use blue points
plot(h$counts[h$counts > 0], log="xy", pch=20, col="blue",
main="Log-normal distribution", xlab="Value",
ylab="Frequency")
101
In-class exercises
Part 2 of 2
102
In-class exercises
1) create two perfectly correlated vectors (with
ρ=1) and plot them.
2) Change data points to a dashed line
3) Add a horizontal and vertical red lines to your
plot at coordinates: a) (1,10) and (4,4); b) (7,7)
and (1,20)
4) Add text to the plot with text() to the left and
right of the diagonal line
5) Add title to your plot
6) Add legend to the plot
103
Installation of new libraries
• There are two main R repositories
– CRAN
– BioConductor
• To install package/library from CRAN
install.packages("seqinr")
To install packages from BioConductor
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
104
The Comprehensive R Archive Network (CRAN)
• CRAN
– package repository features 7154 available
packages
– install.packages("packageName")
– Some popular packages
• ggplot2 – beautiful plots
• Matrix – sparse matrices
• igraph – graphs and analysis thereof
105
source("http://bioconductor.org/biocLite.R");
biocLite("packageName")
• Repository of biology-related libraries in R
– 178,856 packages
• Some list of libraries
– biomaRt
– IRanges
106
Installation of the new libraries
• Download and install latest R version on
your PC. Go to http://cran.r-project.org/
• Install following libraries by running
install.packages(c("seqinr", "ape", "GenABEL")
source("http://bioconductor.org/biocLite.R")
biocLite(c("limma", "muscle",
"affy","hgu133plus2.db","Biostings"))
107
Biological annotation with biomaRt
• Biological experiments require ID conversions
– e.g. microarray probe id to gene name
– e.g. mapping of SNP to gene symbol
• BioMart online service
– can be accessed via web-GUI
– programmatically via biomaRt
108
BioMart
•
•
•
•
Go to http://central.biomart.org/converter/#!/ID_converter/
Select genome assembly version
Paste or upload the ID list
E.g. covert SOX1 to GOID
109
biomaRt
• Access BioMart programmatically
– install the biomaRt library
source("http://www.bioconductor.org/biocLite.R");
biocLite("biomaRt"); require(biomaRt);
- use the listMarts() to see the different databases
listMarts()
- we will use the ensembl mart
ensMart<-useMart("ensembl")
110
biomaRt
- listDatasets() to see which data sets that are
available in the database
listDatasets(ensMart)
- Will use ʻhomo sapiensʼ dataset
ensembl_hs_mart <- useMart(biomart="ensembl",
dataset="hsapiens_gene_ensembl")
- List retrieved attributes (db fields)
listAttributes(ensembl_hs_mart)[1:100,]
- Download data
ensembl_df <- getBM( attributes=c("ensembl_gene_id", "ensembl_transcript_id",
"hgnc_symbol","chromosome_name", "entrezgene"), mart=ensembl_hs_mart )
111
biomaRt
- List of genes to annotate
my_genes = c("ENSG00000197971", "ENSG00000153165",
"ENSG00000159352", "ENSG00000146006",
"ENSG00000149809", "ENSG00000204179",
"ENSG00000213023", "ENSG00000115008",
"ENSG00000130844","ENSG00000155363")
- Annotate IDs
my_genes_ann = ensembl_df[match(my_genes,
ensembl_df$ensembl_gene_id),]
ensembl_gene_id ensembl_transcript_id hgnc_symbol chromosome_name entrezgene
66435 ENSG00000197971
ENST00000382582
MBP
18
4155
66038 ENSG00000153165
ENST00000409886
RGPD3
2
653489
190545 ENSG00000159352
ENST00000368884
PSMD4
1
5710
112
In-class exercise
1) Map SNP rs2066844 to a gene with biomaRt
library. What is its gene symbol, full gene name
and gene biological function?
113
References
[1] Team, R. Core. "R Language Definition." (2000).
[2] Durinck, Steffen, et al. "BioMart and Bioconductor: a
powerful link between biological databases and microarray data
analysis." Bioinformatics 21.16 (2005): 3439-3440.
114
115
In-class exercises (answers) s.103
x=seq(1,10)
y=2*seq(1,10)
lines(c(1,10),c(4,4))
lines(c(7,7),c(1,20))
text(5, 8, "This is my plot", adj = c(0,0))
plot(x,y, type="c")
legend(9,15, "data" , lty=1, col=c('red'), cex=0.8)
116
In-class exercises (answers). s113
library(biomaRt)
snp_db=useMart("snp", dataset="hsapiens_snp")
getBM(attributes="associated_gene", filters =
"snp_filter", values = "rs2066844", mart=snp_db)
117