R-programming

Analyzing Local Properties
• Many local properties are important for the function of your protein
– Hydrophobic regions are potential transmembrane domains
– Coiled-coil regions are potential protein-interaction domains
– Hydrophilic stretches are potential loops
• You can discover these regions
– Using sliding-window techniques (easy)
– Using prediction methods such as hidden Markov models (more sophisticated)
Sliding-window Techniques
• Ideal for identifying strong signals
• Very simple methods
– Few artifacts
– Not very sensitive
• Use ProtScale on www.expasy.org
• Make the window the same size as the feature you’re looking for
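The sliding-window idea can be sketched in a few lines of R. This is a minimal illustration, not ProtScale's implementation: it assumes the Kyte-Doolittle hydropathy scale, and the helper name `sliding_mean` and the peptide sequence are made up for the example.

```r
# Kyte-Doolittle hydropathy values, one per amino acid (one-letter code).
kd <- c(A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5,
        Q = -3.5, E = -3.5, G = -0.4, H = -3.2, I = 4.5,
        L = 3.8, K = -3.9, M = 1.9, F = 2.8, P = -1.6,
        S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2)

# Mean of every run of `window` consecutive values (hypothetical helper).
sliding_mean <- function(values, window) {
  n <- length(values)
  if (window > n) return(numeric(0))
  starts <- seq_len(n - window + 1)
  sapply(starts, function(i) mean(values[i:(i + window - 1)]))
}

# Made-up peptide: a hydrophobic stretch flanked by charged residues.
peptide  <- "KKDEELLIVAVLLIVGAVLLKRDEEK"
residues <- strsplit(peptide, "")[[1]]
profile  <- sliding_mean(unname(kd[residues]), window = 9)

# Start position of the window with the highest mean hydropathy.
which.max(profile)
```

Changing `window` shows why the window should match the feature size: a window much shorter than the hydrophobic stretch gives a noisy profile, a much longer one flattens the peak.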
[Figure: ProtScale hydrophobicity profile (scale: Hphob. / Eisenberg), www.expasy.org/cgi-bin/protscale.pl]
Using TMHMM
• TMHMM is the best method for predicting transmembrane domains
• TMHMM uses a hidden Markov model (HMM)
• Its principle is very different from that of ProtScale
• TMHMM output is a prediction
Searching for PROSITE Patterns
• Search your protein against PROSITE on ExPASy
– www.expasy.org/tools/scanprosite
• PROSITE motifs are written as patterns
– Short patterns are not very informative by themselves
– They only indicate a possibility
– Combine them with other information to draw a conclusion
• Remember: Not everything is in PROSITE !
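Because PROSITE motifs are written as patterns, a short one can be tried out locally as a regular expression. The sketch below hand-translates the N-glycosylation pattern N-{P}-[ST]-{P} and scans a made-up sequence; it is an illustration only and does not reproduce everything ScanProsite does.

```r
# PROSITE pattern N-{P}-[ST]-{P}, translated by hand into a regular expression:
#   {P}  (any residue except P) -> [^P]
#   [ST] (S or T)               -> [ST]
pattern <- "N[^P][ST][^P]"

# Made-up sequence for illustration.
protein_seq <- "MNGSANPTNVTK"

# The lookahead (?=...) reports every start position, including overlapping hits.
hits   <- gregexpr(paste0("(?=", pattern, ")"), protein_seq, perl = TRUE)[[1]]
starts <- as.integer(hits)

starts                                      # 2 9
substring(protein_seq, starts, starts + 3)  # "NGSA" "NVTK"
```

The candidate at position 6 (NPTN) is rejected because a proline follows the asparagine. A four-residue pattern like this one matches often by chance, which is why such hits only indicate a possibility.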
[Figure: ScanProsite results for entry P12259, www.expasy.org/tools/scanprosite]
Protein Domains
• Proteins are usually made of domains
• A domain is an autonomous folding unit
• Domains are more than 50 amino acids long
• It’s common to find these together:
– A regulatory domain
– A binding domain
– A catalytic domain
www.ebi.ac.uk/InterProScan
www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
Secondary Structures
• Helix
– Amino-acid chain that twists like a spring
• Beta strand or extended
– Amino-acid chain that forms a line without twisting
• Random coils
– Amino-acid chain with a structure neither helical nor extended
– Amino-acid loops are usually coils
bioinf.cs.ucl.ac.uk/psipred//?program=psipred
Servers
• www.predictprotein.org
• cubic.bioc.columbia.edu/predictprotein
• www.sdsc.edu/predicprotein
• www.cbi.pku.edu.cn/predictprotein
www.rcsb.org
ncbi.nlm.nih.gov/BLAST
zhanglab.ccmb.med.umich.edu/I-TASSER/
http://zhanglab.ccmb.med.umich.edu/I-TASSER/output/S102840/
R-programming
Introduction
• R is:
– a suite of operators for calculations on arrays, in particular matrices,
– a large, coherent, integrated collection of intermediate tools for interactive data analysis,
– graphical facilities for data analysis and display either directly at the computer or on hardcopy,
– a well-developed programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
• The core of R is an interpreted computer language.
– It allows branching and looping as well as modular programming using functions.
– Most of the user-visible functions in R are written in R, calling upon a smaller set of internal primitives.
– It is possible for the user to interface to procedures written in C, C++ or FORTRAN for efficiency, and also to write additional primitives.
R and statistics
o Packaging: a crucial infrastructure to efficiently produce, load and keep consistent software libraries from (many) different sources / authors
o Statistics: most packages deal with statistics and data analysis
o State of the art: many statistical researchers provide their methods as R packages
Data Analysis and Presentation
• The R distribution contains functionality for a large number of statistical procedures.
– linear and generalized linear models
– nonlinear regression models
– time series analysis
– classical parametric and nonparametric tests
– clustering
– smoothing
• R also has a large set of functions which provide a flexible graphical environment for creating various kinds of data presentations.
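Two of the procedures listed above, sketched on made-up numbers. `lm()` and `t.test()` are part of the standard R distribution; the data and variable names are invented for the example.

```r
# Made-up measurements: y_obs grows roughly linearly with x_obs.
x_obs <- 1:10
y_obs <- c(3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9, 21.0)

# Linear model: estimate intercept and slope.
fit <- lm(y_obs ~ x_obs)
coef(fit)

# Classical parametric test: is the mean of y_obs different from 0?
tt <- t.test(y_obs, mu = 0)
tt$p.value
```

`summary(fit)` prints standard errors and R-squared, and `plot(x_obs, y_obs); abline(fit)` draws the fitted line over the data.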
R as a calculator
> seq(0, 5, length=6)
[1] 0 1 2 3 4 5
> sqrt(2)
[1] 1.414214
> log2(32)
[1] 5
> plot(sin(seq(0, 2*pi, length=100)))
[Figure: sin(seq(0, 2 * pi, length = 100)) plotted against Index]
Object orientation
Primitive (or: atomic) data types in R are:
• numeric (integer, double, complex)
• character
• logical
• function
Out of these, vectors, arrays and lists can be built.
Object orientation
• Object: a collection of atomic variables and/or other objects that belong together
• Example: a microarray experiment
• probe intensities
• patient data (tissue location, diagnosis, follow-up)
• gene data (sequence, IDs, annotation)
Parlance:
• class: the “abstract” definition of it
• object: a concrete instance
• method: another word for ‘function’
• slot: a component of an object
Object orientation
Advantages:
Encapsulation (can use the objects and methods someone else has written without having to care about the internals)
Generic functions (e.g. plot, print)
Inheritance (hierarchical organization of complexity)
Caveat:
Overcomplicated, baroque program architecture…
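The terms class, object, method and slot map directly onto R's S4 system. The sketch below assumes S4 (R also has other object systems), and the class names, slot names and the `nProbes` generic are hypothetical, chosen to echo the microarray example.

```r
library(methods)  # S4 support; attached by default in interactive sessions

# class: the abstract definition, here with two slots
setClass("Experiment",
         representation(intensities = "matrix", patients = "data.frame"))

# object: a concrete instance of the class
e1 <- new("Experiment",
          intensities = matrix(c(120, 85, 300, 410, 95, 60), nrow = 3),
          patients    = data.frame(diagnosis = c("tumor", "normal")))

# slot: a component of the object, accessed with @
dim(e1@intensities)

# generic function plus a method for our class
setGeneric("nProbes", function(object) standardGeneric("nProbes"))
setMethod("nProbes", "Experiment",
          function(object) nrow(object@intensities))

# inheritance: the subclass keeps the parent's slots and methods
setClass("TwoColorExperiment", contains = "Experiment",
         representation(channels = "character"))
e2 <- new("TwoColorExperiment", e1, channels = c("red", "green"))
nProbes(e2)   # the method defined for Experiment is reused
```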
variables
numeric
> a = 24
> b<-25
> sqrt(a+b)
[1] 7
character string
> a = "The dog ate my homework"
> sub("dog","cat",a)
[1] "The cat ate my homework"
logical
> a = (1+1==3)
> a
[1] FALSE
variables
> paste("X", "Y")
> paste("X", "Y", sep = " + ")
> paste("Fig", 1:4)
> paste(c("X", "Y"), 1:4, sep = "", collapse = " + ")
x<-2.17
y<-as.character(x)
z<-as.numeric(y)
help(as)
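The same calls with the values they return written as comments, to show how `sep`, `collapse` and vector recycling interact. The variable name `labels` is introduced only for this example.

```r
paste("X", "Y")                 # "X Y"   (the default separator is a space)
paste("X", "Y", sep = " + ")    # "X + Y"
paste("Fig", 1:4)               # "Fig 1" "Fig 2" "Fig 3" "Fig 4"

# c("X", "Y") is recycled against 1:4; collapse then joins the four
# pieces into a single string.
labels <- paste(c("X", "Y"), 1:4, sep = "", collapse = " + ")
labels                          # "X1 + Y2 + X3 + Y4"

# Converting between types
x <- 2.17
y <- as.character(x)            # "2.17"
z <- as.numeric(y)              # 2.17
class(y)                        # "character"
class(z)                        # "numeric"
```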
vectors, matrices and arrays
• vector: an ordered collection of data of the same type
> a = c(1,2,3)
> a*2
[1] 2 4 6
• Example: the mean spot intensities of all 15488 spots on a chip: a vector of 15488 numbers
• In R, a single number is the special case of a vector with 1 element.
• Other vector types: character strings, logical
vectors, matrices and arrays
• matrix: a rectangular table of data of the same type
example: the expression values for 10000 genes for 30 tissue biopsies: a matrix with 10000 rows and 30 columns.
• array: a 3-, 4-, or higher-dimensional matrix
example: the red and green foreground and background values for 20000 spots on 120 chips: a 4 x 20000 x 120 (3D) array.
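A scaled-down version of the chip array shows how the three dimensions are indexed. The sizes (4 x 5 x 3 instead of 4 x 20000 x 120), the channel labels and the fill values are made up for illustration.

```r
# 4 measurements x 5 spots x 3 chips, filled with the numbers 1..60
vals <- array(1:60, dim = c(4, 5, 3),
              dimnames = list(c("Rf", "Gf", "Rb", "Gb"), NULL, NULL))

dim(vals)              # 4 5 3

# One number: red foreground ("Rf"), spot 2, chip 3
vals["Rf", 2, 3]       # 45

# Leaving an index empty keeps that whole dimension:
chip1 <- vals[, , 1]   # a 4 x 5 matrix with all values of chip 1
dim(chip1)
```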
Lists
• vector: an ordered collection of data of the same type.
> a = c(7,5,1)
> a[2]
[1] 5
• list: an ordered collection of data of arbitrary types.
> doe = list(name="john",age=28,married=FALSE)
> doe$name
[1] "john"
> doe$age
[1] 28
• Typically, vector elements are accessed by their index (an integer), list elements by their name (a character string). But both types support both access methods.
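Both access methods on both types, as a short sketch. The element names `first`, `second` and `third` are invented for the example.

```r
# A vector can carry names, so elements are reachable by index or by name.
a <- c(first = 7, second = 5, third = 1)
a[2]           # by index
a["second"]    # by name

# A list can be accessed by name or by position.
doe <- list(name = "john", age = 28, married = FALSE)
doe$age        # by name
doe[["age"]]   # by name, using double brackets
doe[[2]]       # by position

# Single brackets on a list return a (shorter) list, not the element itself.
doe[2]
```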
Data frames
data frame: like a spreadsheet.
It is a rectangular table with rows and columns; data within each
column has the same type (e.g. number, text, logical), but
different columns may have different types.
Example:
> a
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX234       distal       8.0     TRUE
XX987     proximal      10.0    FALSE
id<-c("xx348", "xx234", "xx987")
locallization<-c("proximal", "distal", "proximal")
progress<-c(F, T, F)
tumorsize<-c(6.3, 8.0, 10.0)
results<-data.frame(id, locallization , tumorsize,
progress)
> results
     id locallization tumorsize progress
1 xx348      proximal       6.3    FALSE
2 xx234        distal       8.0     TRUE
3 xx987      proximal      10.0    FALSE
results<-edit(results)
> summary(results)
     id     locallization   tumorsize      progress
 xx234:1   distal  :1     Min.   : 6.30   Mode :logical
 xx348:1   proximal:2     1st Qu.: 7.15   FALSE:1
 xx987:1                  Median : 8.00   TRUE :2
                          Mean   : 8.10   NA's :0
                          3rd Qu.: 9.00
                          Max.   :10.00
> x<-summary(results)
> x
     id     locallization   tumorsize      progress
 xx234:1   distal  :1     Min.   : 6.30   Mode :logical
 xx348:1   proximal:2     1st Qu.: 7.15   FALSE:1
 xx987:1                  Median : 8.00   TRUE :2
                          Mean   : 8.10   NA's :0
                          3rd Qu.: 9.00
                          Max.   :10.00
Subsetting
Individual elements of a vector, matrix, array or data frame are
accessed with “[ ]” by specifying their index, or their name
> results
      localisation tumorsize progress
XX348     proximal       6.3        0
XX234       distal       8.0        1
XX987     proximal      10.0        0
> results[3, 2]
[1] 10
> results["XX987", "tumorsize"]
[1] 10
> results["XX987",]
      localisation tumorsize progress
XX987     proximal        10        0
Subsetting
> results
      localisation tumorsize progress
XX348     proximal       6.3        0
XX234       distal       8.0        1
XX987     proximal      10.0        0
subset rows by a vector of indices:
> results[c(1,3),]
      localisation tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
subset rows by a logical vector:
> results[c(T,F,T),]
      localisation tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
subset a column:
> results$localisation
[1] "proximal" "distal" "proximal"
comparison resulting in logical vector:
> results$localisation=="proximal"
[1] TRUE FALSE TRUE
subset the selected rows:
> results[results$localisation=="proximal", ]
      localisation tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
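To try the subsetting calls, the table first has to exist with sample IDs as row names. The slides do not show how it was built, so the construction below (the `row.names` argument and 0/1 coding of `progress`) is an assumption that reproduces the printed table.

```r
# Assumed construction of the table used in the subsetting examples.
results <- data.frame(
  localisation = c("proximal", "distal", "proximal"),
  tumorsize    = c(6.3, 8.0, 10.0),
  progress     = c(0, 1, 0),
  row.names    = c("XX348", "XX234", "XX987"),
  stringsAsFactors = FALSE
)

results[3, 2]                    # by row and column index
results["XX987", "tumorsize"]    # by row and column name

# A comparison gives a logical vector, which then selects the rows.
proximal <- results[results$localisation == "proximal", ]
rownames(proximal)

# subset() expresses the same kind of row filter more compactly.
large <- subset(results, tumorsize > 7)
```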
results[2,]
results[2,2]
results[1:3,]
results[c(1,3),]
results[c(T,F,T),]
x<-summary(results)
x
x[2,2]
x = c(1, 1, 2, 3, 5, 8)
x[c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)]
x[c(TRUE, FALSE)]
x == 1
x[x == 1]
x[x%%2 == 0]
y = c(1, 2, 3)
y[]=3
y
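The vector examples with their results written as comments. The vector is renamed `fib` here only so it does not clash with the `x` that holds the summary table.

```r
fib <- c(1, 1, 2, 3, 5, 8)

fib[c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)]   # 1 1 5 8
fib[c(TRUE, FALSE)]    # logical index is recycled: keeps positions 1, 3, 5 -> 1 2 5
fib == 1               # TRUE TRUE FALSE FALSE FALSE FALSE
fib[fib == 1]          # 1 1
fib[fib %% 2 == 0]     # even values: 2 8

# An empty index means "all elements", so every element is overwritten
# and the vector keeps its length.
y <- c(1, 2, 3)
y[] <- 3
y                      # 3 3 3
```

Without the brackets, `y <- 3` would replace the whole vector by a single number.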
Matrix
• a matrix is a vector with an additional attribute (dim) that defines the number of columns and rows
• only one mode (numeric, character, complex, or logical) allowed
• can be created using matrix()
x<-matrix(data=0,nrow=2,ncol=2)
or
x<-matrix(0,2,2)
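Both statements about matrices can be checked directly: setting `dim` on a plain vector turns it into a matrix, and mixing modes forces everything into one mode. The variable names are made up for this sketch.

```r
# A plain vector becomes a matrix once it has a dim attribute.
v <- 1:6
is.matrix(v)        # FALSE
dim(v) <- c(2, 3)   # 2 rows, 3 columns, filled column by column
is.matrix(v)        # TRUE
v[2, 3]             # 6

# Only one mode is allowed: mixed input is coerced, here to character.
m <- matrix(c(1, "a", TRUE), nrow = 1)
mode(m)             # "character"
m
```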
Data Frame
• several modes allowed within a single data frame
• can be created using data.frame()
L<-LETTERS[1:4] #A B C D
x<-1:4 #1 2 3 4
data.frame(x,L) #create data frame
• attach() and detach()
– the database is attached to the R search path so that the database is searched by R when it is evaluating a variable.
– objects in the database can be accessed by simply giving their names
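A short sketch of attach() and detach(), with `with()` shown as an alternative that needs no cleanup. The data frame `patients` and its columns are hypothetical.

```r
# Hypothetical data frame for illustration.
patients <- data.frame(dose_mg   = c(10, 20, 30, 40),
                       responder = c(TRUE, FALSE, TRUE, TRUE))

# After attach(), columns can be used by their bare names.
attach(patients)
mean_dose <- mean(dose_mg)
detach(patients)    # remove the data frame from the search path again

# with() evaluates an expression inside the data frame without
# changing the search path.
mean_dose_with <- with(patients, mean(dose_mg))
```

If a variable of the same name already exists in the workspace, it takes precedence over the attached column, which is a common source of confusion with attach().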
a=matrix(1:9, ncol = 3, nrow = 3)
a
b=matrix(c(TRUE, FALSE, TRUE), ncol = 3, nrow = 3)
b
x=1:10
y=11:20
z=matrix(c(x,y))
z
z=matrix(c(x,y),nrow=2)
z
z=matrix(c(x,y),nrow=4)
z
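What the matrix() calls above produce, as a sketch with the shapes written as comments. The names `z1`, `z2`, `z3` are introduced here so the three results can be compared; `byrow = TRUE` is an extra option not used on the slide.

```r
x <- 1:10
y <- 11:20

z1 <- matrix(c(x, y))             # no dimensions given: one column, 20 x 1
z2 <- matrix(c(x, y), nrow = 2)   # 2 x 10, filled column by column
z2[, 1]                           # first column: 1 2

# byrow = TRUE fills row by row, so row 1 is x and row 2 is y.
z3 <- matrix(c(x, y), nrow = 2, byrow = TRUE)
z3[1, ]

# Data shorter than the matrix is recycled: every column is TRUE FALSE TRUE.
b <- matrix(c(TRUE, FALSE, TRUE), ncol = 3, nrow = 3)
b
```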
R code
max(z)
min(z)
length(z)
mean(z)
sd(z)
sum(z)
index=c(15,27,34,10,9)
welcome=c(13,26,30,10,7)
paper=c(2,1,3,0,1)
days=c("mon", "tues", "wed", "thurs", "fri")
filenames=c("index.html", "welcome.png", "paper.pdf")
downloads=matrix(c(index,welcome,paper), nrow=5,
dimnames=list(days,filenames))
downloads
filesizes = c(1624, 23172, 1234065)
downloads%*%filesizes
image(as.matrix(downloads))
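The matrix product `downloads %*% filesizes` multiplies each day's download counts by the file sizes and sums them, giving the volume transferred per day. The sketch below rebuilds the matrix and checks one row by hand; that the sizes are in bytes is an assumption, since the slide gives no unit.

```r
index   <- c(15, 27, 34, 10, 9)
welcome <- c(13, 26, 30, 10, 7)
paper   <- c(2, 1, 3, 0, 1)
days      <- c("mon", "tues", "wed", "thurs", "fri")
filenames <- c("index.html", "welcome.png", "paper.pdf")

# 5 days x 3 files; each column holds the counts for one file.
downloads <- matrix(c(index, welcome, paper), nrow = 5,
                    dimnames = list(days, filenames))

filesizes <- c(1624, 23172, 1234065)   # assumed to be bytes per file

# 5 x 1 result: total volume per day
volume_per_day <- downloads %*% filesizes

# Monday by hand: the same sum of count * size
15 * 1624 + 13 * 23172 + 2 * 1234065
volume_per_day["mon", 1]

# Total number of downloads per file over the week
colSums(downloads)
```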
Factors
A character string can contain arbitrary text. Sometimes it is useful to use a limited
vocabulary, with a small number of allowed words. A factor is a variable that can only
take such a limited number of values, which are called levels.
expression<-factor(c("over","under","over","unchanged","under","under"))
levels(expression)
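A sketch of what the factor looks like inside. The variable is renamed `expression_status` to avoid shadowing R's built-in `expression()` function, and the reordered factor is an addition for illustration.

```r
expression_status <- factor(c("over", "under", "over",
                              "unchanged", "under", "under"))

levels(expression_status)       # "over" "unchanged" "under" (sorted alphabetically)
table(expression_status)        # counts per level: over 2, unchanged 1, under 3

# Internally each value is stored as the index of its level.
as.integer(expression_status)   # 1 3 1 2 3 3

# The level order can be set explicitly, e.g. for plots and tables.
ordered_status <- factor(expression_status,
                         levels = c("under", "unchanged", "over"))
levels(ordered_status)
```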
protein<-list("glucose oxidase", "1CF3", 63355)
protein
protein<-list(name="glucose oxidase", accession="1CF3", weight=63355)
x<-c(16614, 50660, 6066, 6118)
protein$GOIDs<-x
protein
class(protein)
length(protein)
attributes(protein)
Working directory
setwd("D:/data")
x<-read.table("profiles.csv", sep=",", header=TRUE)
x<-read.table("http://www.bixsolutions.net/profiles.csv", sep=",",
header=TRUE)
matplot(x, type="l")
matplot(x, type="l", xlab="fraction", ylab="quantity", col=1:6, lty=1:5,
lwd=2)
lty: line style
lwd: line width
xmax<-apply(x, 2, max)
xmax
ymax<-apply(x, 1, max)
ymax
Apply the max function on columns (2) or rows (1) of matrix x
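The difference between the two margins is easiest to see on a matrix small enough to check by eye. The matrix `m` is made up for this sketch, since profiles.csv is not reproduced here.

```r
m <- matrix(c(3, 9, 1, 4, 7, 2), nrow = 2)
#      [,1] [,2] [,3]
# [1,]    3    1    7
# [2,]    9    4    2

apply(m, 2, max)   # one value per column: 9 4 7
apply(m, 1, max)   # one value per row:    7 9

# Any function of a vector can be applied, e.g. the column means,
# which agree with the built-in colMeans().
apply(m, 2, mean)
colMeans(m)
```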
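# running mean: cumulative sum divided by the number of values seen so far
cummean = function(x){
  n = length(x)
  y = cumsum(x) / seq_len(n)
  return(y)
}
n = 10000
z = rnorm(n)    # n draws from the standard normal distribution
x = seq(1,n,1)
y = cummean(z)  # running mean should settle near 0 as n grows
X11()           # open a new graphics window
plot(x,y,type= 'l',main= 'Convergence Plot')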