Microarrays 2 BMI 731 Winter 2005

Introduction to R and
Bioconductor
BMI 731 Winter 2005
Catalin Barbacioru
Department of Biomedical Informatics
Ohio State University
References
• R Project (www.r-project.org):
open-source language and environment for statistical
computing and graphics.
Comprehensive R Archive Network, CRAN (cran.r-project.org): source code and precompiled binary
distributions for Linux, Windows, MacOS; base and
contributed packages.
• Bioconductor Project (www.bioconductor.org)
open-source software for the analysis of biomedical and
genomic data, mainly R packages.
R Project
• R is a language and environment for statistical
computing and graphics. It is an open-source project
similar to the S language and environment, which was
developed at Bell Laboratories (formerly AT&T, now
Lucent Technologies) by John Chambers and colleagues.
R can be considered a different implementation of S.
• R provides a wide variety of statistical (linear and
nonlinear modeling, classical statistical tests, time-series
analysis, classification, clustering, ...) and graphical
techniques, and is highly extensible. The S language is
often the vehicle of choice for research in statistical
methodology, and R provides an Open Source route to
participation in that activity.
R Project
• R can be extended (easily) via packages.
• An R package is a structured collection of code (R, C, or
other), documentation, and/or data for performing specific
types of analyses.
• Packages only need to be installed once, but ... they must be
loaded with each new R session.
• Loading: use the R function library, e.g., library(Biobase).
• Various functions are available to obtain information on a
package.
• For example, packageDescription returns the content of the
DESCRIPTION file, and find.package (.find.package in older
versions of R) returns the directory where the package was
installed.
> packageDescription("hgu95av2")
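The same calls can be tried without any Bioconductor package installed. The sketch below uses the stats package, which ships with every R installation; choosing stats is an assumption made only so that the example runs anywhere.

```r
# Load a package for the current session. stats is attached by default,
# so this call is harmless; it stands in for library(Biobase).
library(stats)

# Contents of the DESCRIPTION file.
desc <- packageDescription("stats")
desc$Package
desc$Version

# Directory where the package is installed.
pkg_dir <- find.package("stats")
pkg_dir
```

An overview of a package's help pages is available with library(help = "stats").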
R Packages
• Analysis packages: implementation of statistical and
graphical methods. E.g. cluster, glm, graph, hexbin,
lattice, rpart.
• Data packages: Biological metadata packages consisting
of environment objects for mappings between different
gene identifiers (e.g., Affymetrix ID, GO ID, LocusLink ID,
PubMed ID), CDF and probe sequence information for
Affymetrix chips. E.g. GO, hgu95av2, humanLLMappings,
KEGG.
• Specialized/custom packages: code, data,
documentation, and exercises, for a particular project,
article, or course. E.g. EMBO03: Bioconductor course
package; golubEsets: Golub et al. (2000) ALL/AML
dataset; yeastCC: Spellman et al. (1998) yeast cell cycle
dataset.
R Packages
• Base packages (CRAN).
E.g. base, graphics, methods, stats.
• Contributed packages (CRAN).
E.g. ellipse, XML.
• Bioconductor packages.
E.g. annotate, affy, marray, multtest, hgu95av2, ALL,
EMBO03.
Bioconductor Project
• Bioconductor is an open-source and open-development
software project for the analysis of biomedical and
genomic data.
• The project was started in the Fall of 2001 and includes
25 core developers in the US, Europe, and Australia.
• Provide access to powerful statistical and graphical
methods for the analysis of biomedical and genomic
data.
• Facilitate the integration of biological metadata from
WWW in the analysis of experimental data.
E.g. GenBank, GO, LocusLink, PubMed.
• Provide training in computational and statistical methods.
Bioconductor Packages
• Statistical methods: cluster analysis, estimation and
(multiple) testing for linear and non-linear models (with
possibly censored continuous and polychotomous
outcomes), resampling, visualization, etc.
• Biological assays: cell-based assays, DNA microarrays
(transcript levels, DNA copy number from CGH),
proteomics, SAGE, SELDI-TOF, SNP, etc.
• Biological metadata from WWW: GenBank, GO, KEGG,
PubMed, etc.
• Interfaces with other languages: C, Java, Perl, Python,
XML, etc. – Omega Project (www.omegahat.org).
• Interactions with other projects: BGL, GeneSpring,
Graphviz, MAGE-ML, Resourcerer, etc.
Bioconductor Packages
• Analysis packages: e.g., annotate, affy, marray, multtest.
• Data packages:
• Biological metadata: mappings between different gene
identifiers (e.g., AffyID, GO ID, LocusID, PMID), CDF
and probe sequence information for Affymetrix chips.
E.g. hgu95av2, GO, KEGG.
• Experimental data: code, data, and documentation for
specific experiments or projects.
ALL: Chiaretti et al. (2004) ALL dataset.
golubEsets: Golub et al. (2000) ALL/AML dataset.
yeastCC: Spellman et al. (1998) yeast cell cycle dataset.
Bioconductor Packages
• General infrastructure: Biobase, Biostrings, DynDoc, reposTools,
rhdf5, ruuid, tkWidgets, widgetTools.
• Annotation: annotate, AnnBuilder + metadata packages.
• Graphics: geneplotter, hexbin.
• Pre-processing Affymetrix oligonucleotide chip data: affy,
affycomp, affydata, affylmGUI, affyPLM, annaffy, gcrma,
makecdfenv, vsn.
• Pre-processing two-color spotted DNA microarray data:
arrayMagic, arrayQuality, limma, limmaGUI, marray, vsn.
• Other assays: aCGH, DNAcopy, prada, PROcess, RSNPer,
SAGElyzer.
• Differential gene expression: EBarrays, edd, factDesign, genefilter,
limma, limmaGUI, multtest, ROC.
• Graphs and networks: graph, RBGL, Rgraphviz.
• Gene Ontology: GOstats, goTools.
Microarray data analysis
• Pre-processing of
– spotted array data with marray packages;
– Affymetrix chip data with affy packages.
• List of differentially expressed genes from genefilter,
limma, or multtest packages.
• Prediction of tumor class using randomForest package.
• Clustering of genes using cluster or hopach packages.
• Use of annotate package
– to retrieve and search PubMed abstracts;
– to generate an HTML report with links to LocusLink
and PubMed for each gene.
affy Package
• To load the necessary packages,
> library(affy)
> library(affydata)
• One of the main functions for reading in Affymetrix data
is ReadAffy. It reads in data from CEL files and creates
objects of class AffyBatch.
• In this lab we will work mainly with the Dilution dataset,
which is included in the affydata package. To load the
dataset, type
> data(Dilution)
For a description of Dilution, type
> ?Dilution
affy classes and methods
• One of the main classes in affy is the AffyBatch class.
> class(Dilution)
[1] "AffyBatch"
> slotNames(Dilution)
[1] "cdfName" "nrow" "ncol" "exprs" "se.exprs" "phenoData"
[7] "description" "annotation" "notes"
> Dilution
AffyBatch object
size of arrays=640x640 features (12805 kb)
cdf=HG_U95Av2 (12625 affyids)
number of samples=4
number of genes=12625
annotation=hgu95av2
affy classes and methods
• The exprs slot contains a matrix with columns
corresponding to chips and rows to individual probes on
the chip. To obtain the matrix of intensities for all four
chips,
> e <- exprs(Dilution)
• Probe-level PM and MM intensities can be accessed
using the pm and mm methods.
> PM <- pm(Dilution)
affy classes and methods
> PM[1:5, ]
       20A   20B   10A   10B
[1,] 468.8 282.3 433.0 198.0
[2,] 430.0 265.0 308.5 192.8
[3,] 182.3 115.0 138.0  86.3
[4,] 930.0 588.0 752.8 392.5
[5,] 171.0 128.0 152.3  97.8
affy classes and methods
To get the probe-set names (Affy IDs),
> gnames <- geneNames(Dilution)
> length(gnames)
[1] 12625
> gnames[1:5]
[1] "1000_at" "1001_at" "1002_f_at" "1003_s_at"
[5] "1004_at"
affy classes and methods
To produce boxplots of log base 2 probe intensities,
> boxplot(Dilution, col = c(2, 2, 3, 3))
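As a rough sketch of what such a boxplot summarizes, the five PM rows printed earlier can be typed into a plain matrix and compared on the log base 2 scale. Five probes are far too few to draw conclusions from, and the object names pm5, log_pm5 and chip_medians are made up for this illustration.

```r
# The five PM rows shown earlier, entered by hand (columns are chips).
pm5 <- matrix(
  c(468.8, 282.3, 433.0, 198.0,
    430.0, 265.0, 308.5, 192.8,
    182.3, 115.0, 138.0,  86.3,
    930.0, 588.0, 752.8, 392.5,
    171.0, 128.0, 152.3,  97.8),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("20A", "20B", "10A", "10B"))
)

# Work on the log base 2 scale, as the boxplot method does.
log_pm5 <- log2(pm5)

# Per-chip medians: the center line of each box.
chip_medians <- apply(log_pm5, 2, median)
round(chip_medians, 2)

# Same color scheme as above, applied to the toy matrix only.
boxplot(log_pm5, col = c(2, 2, 3, 3), ylab = "log2 PM intensity")
```

In this tiny subset, the chips labeled A have higher medians than the chips labeled B at both concentrations.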
affy classes and methods
• The boxplots show that the Dilution data needs
normalization. As described in the dataset help file and in
the phenoData slot (pData(Dilution)), two concentrations
of mRNA were used and, for each concentration, two
scanners were used. From the plots, we note that
scanner effects seem stronger than concentration effects
(different colors). In other words, chips that should be the
same are different; chips that should be different are
similar.
• Because different mRNA concentrations were used, we
perform normalization within concentration groups. The
default procedure implemented in the normalize method
is probe-level quantile normalization.
affy classes and methods
> Dil20 <- normalize(Dilution[, 1:2])
> Dil10 <- normalize(Dilution[, 3:4])
> normDil <- merge(Dil20, Dil10)
> boxplot(normDil, col = c(2, 2, 3, 3))
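The idea behind probe-level quantile normalization can be sketched in base R: on each chip, every intensity is replaced by the mean, across chips, of the intensities with the same rank, so that all chips in a group end up with the same empirical distribution. This is a simplified illustration and not the affy implementation, which may differ in details such as the handling of ties. The helper quantile_normalize and the matrix pm5 (the five PM rows printed earlier) are names made up for this example.

```r
# Simplified quantile normalization of the columns of a numeric matrix.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each chip
  sorted <- apply(x, 2, sort)                         # sort each chip
  ref    <- rowMeans(sorted)                          # reference distribution
  out <- apply(ranks, 2, function(r) ref[r])          # map ranks to reference
  dimnames(out) <- dimnames(x)
  out
}

# The five PM rows shown earlier, entered by hand (columns are chips).
pm5 <- matrix(
  c(468.8, 282.3, 433.0, 198.0,
    430.0, 265.0, 308.5, 192.8,
    182.3, 115.0, 138.0,  86.3,
    930.0, 588.0, 752.8, 392.5,
    171.0, 128.0, 152.3,  97.8),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("20A", "20B", "10A", "10B"))
)

# Normalize within each concentration group, then put the chips back together.
norm20 <- quantile_normalize(pm5[, 1:2])
norm10 <- quantile_normalize(pm5[, 3:4])
norm_pm5 <- cbind(norm20, norm10)

# Within a group, the chips now share the same quantiles.
apply(norm_pm5, 2, quantile)
```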
affy classes and methods
We view the process of going from probe-level intensities to
gene-level expression measures as a three-step procedure
consisting of: (i) background adjustment; (ii) normalization; (iii)
summarization. The affy package provides implementations for a
number of methods for each of these steps: (i) background
correction: e.g., none, MAS 5.0, convolution; (ii) normalization:
e.g., probe-level quantile, cyclic loess, contrast loess; (iii)
summarization: e.g., MAS 4.0, MAS 5.0, MBEI (Li & Wong,
2001), median polish for additive linear model (Irizarry et al.,
2003).
The Robust Multichip Average (RMA) method refers to the
sequence: convolution background adjustment, probe-level
quantile normalization, and median polish summarization for
gene-specific additive models with probe and chip effects.
> rmaDil <- rma(Dilution)
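The summarization step can be illustrated with medpolish from the stats package, applied to log2 intensities for a single probe set (rows are probes, columns are chips). This sketch covers only step (iii) and leaves out background adjustment and normalization; the toy values, including the single outlying cell, are invented for illustration.

```r
# Toy log2 intensities for one probe set: 4 probes (rows) x 3 chips (columns),
# built as overall level + probe effect + chip effect.
probe_effect <- c(-1.0, 0.0, 0.5, 1.5)
chip_effect  <- c(0.0, 0.5, -0.5)
y <- 8 + outer(probe_effect, chip_effect, "+")
dimnames(y) <- list(paste0("probe", 1:4), paste0("chip", 1:3))

# Add one aberrant probe measurement.
y["probe2", "chip3"] <- y["probe2", "chip3"] + 4

# Fit the additive model by median polish.
fit <- medpolish(y, trace.iter = FALSE)

# Chip-level expression measure: overall term plus chip (column) effect.
expr <- fit$overall + fit$col
expr

# Residuals show which cells the additive model does not explain.
round(fit$residuals, 2)
```

Because medians are used instead of means, the aberrant cell is left in the residuals and the differences between the chip-level values still match chip_effect.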
affy classes and methods
CDF data packages
Data packages providing CDF information can be
downloaded from www.bioconductor.org. These packages
contain environment objects which provide mappings
between AffyIDs and matrices of probe locations, with rows
corresponding to probe-pairs and columns to PM and MM
cells. The CDF environment for the HGU95Av2 chip is
already in the package. For information on the environment
object, type
> ?hgu95av2cdf
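The structure of a CDF environment can be mimicked with a plain R environment. In the sketch below, the environment toy_cdf, the column names, the location indices, and the intensity vector are all invented for illustration; only the layout (one two-column matrix per AffyID, rows for probe pairs, columns for PM and MM cells) follows the description above.

```r
# A toy stand-in for a CDF environment such as hgu95av2cdf.
toy_cdf <- new.env()
assign("1000_at",
       matrix(c(1, 3, 5,    # PM cell locations (hypothetical)
                2, 4, 6),   # MM cell locations (hypothetical)
              ncol = 2, dimnames = list(NULL, c("pm", "mm"))),
       envir = toy_cdf)
assign("1001_at",
       matrix(c(7, 9,       # PM cell locations (hypothetical)
                8, 10),     # MM cell locations (hypothetical)
              ncol = 2, dimnames = list(NULL, c("pm", "mm"))),
       envir = toy_cdf)

# The AffyIDs are the names stored in the environment.
ls(toy_cdf)

# Probe locations for one probe set: one row per probe pair.
loc <- get("1000_at", envir = toy_cdf)
loc

# Locations index into a chip's vector of cell intensities (made-up values).
intensity <- c(468, 120, 430, 110, 182, 95, 930, 300, 171, 80)
pm_values <- intensity[loc[, "pm"]]
mm_values <- intensity[loc[, "mm"]]
cbind(pm = pm_values, mm = mm_values)
```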
