Hierarchical Clustering in R

Quick R Tips
• How to find out which packages are installed locally
– library()
• How to find out which packages are currently loaded (attached) in your session
– (.packages())
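A short sketch, using only base R, of how these checks look from the console. The variable names are illustrative and not part of the original slides.

```r
# Packages installed in the local library paths (the set that library() lists)
installed <- rownames(installed.packages())

# Packages attached to the current session (what (.packages()) returns)
attached <- (.packages())

# Every attached package should also appear among the installed ones
all(attached %in% installed)

# Check whether a given package is installed without loading it
"stats" %in% installed
```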
Hierarchical Clustering
• A type of cluster analysis
• HC can be either “divisive” or “agglomerative”; agglomerative is the most commonly used
• Group objects that are “close” to one another
based on some distance/similarity metric
• Clusters are created and linked based on a
metric that evaluates the cluster-to-cluster
distance
• Results are displayed as a dendrogram
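To make the idea of a cluster-to-cluster metric concrete, here is a minimal sketch on made-up data using base R's dist() and hclust(). The five points and their names are assumptions chosen for illustration only.

```r
# Five objects on a line; a, b and c sit close together, as do d and e
pts <- c(a = 1, b = 2, c = 3, d = 10, e = 11)

# Pairwise Euclidean distances between objects
d <- dist(pts)

# Agglomerative clustering under two cluster-to-cluster metrics (linkages)
hc_single   <- hclust(d, method = "single")    # distance between nearest members
hc_complete <- hclust(d, method = "complete")  # distance between farthest members

# Height of the final merge, which joins {a, b, c} with {d, e}
max(hc_single$height)    # 7: gap between c (3) and d (10)
max(hc_complete$height)  # 10: gap between a (1) and e (11)
```

The same objects end up in the same two groups either way, but the linkage changes the height at which the groups join, and therefore the shape of the dendrogram.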
Step 1: Data matrix
• First you need a numeric matrix
– Typical array data set will have samples as columns and
genes as rows
– We want to be sure our data are in the form of an expression
matrix
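• Use the Biobase library/package
• See http://www.bioconductor.org/packages/2.2/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf
• header, sep, row.names and as.is are read.table() arguments, so read the tab-delimited file first (data is assumed here to hold its path), then convert to a matrix
> exprs <- as.matrix(read.table(data, header=TRUE, sep="\t", row.names=1, as.is=TRUE))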
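A minimal, self-contained sketch of Step 1 using base R only. The temporary file, gene IDs, sample names and values are all made up for illustration; a real data set would be read from its own path.

```r
# Write a tiny tab-delimited file to a temporary location (made-up values)
path <- tempfile(fileext = ".txt")
writeLines(c("gene\ts1\ts2\ts3",
             "g1\t1.0\t2.0\t3.0",
             "g2\t4.5\t5.5\t6.5"), path)

# Read it with gene IDs as row names, then convert to a numeric matrix
tab <- read.table(path, header = TRUE, sep = "\t", row.names = 1, as.is = TRUE)
m <- as.matrix(tab)

dim(m)         # genes (rows) by samples (columns)
is.numeric(m)  # TRUE only when every column parsed as a number
```

Checking is.numeric() after the conversion is a cheap way to catch a stray text column, which would silently turn the whole matrix into character data.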
Step 2: Calculate Distance Matrix
• The default dist() function in R treats rows as the vectors, but we want
the distance between samples, i.e., the columns of our matrix.
• A handy package from MD Anderson, called oompaBase, can help us
source("http://bioinformatics.mdanderson.org/OOMPA/oompaLite.R")
oompaLite()
oompainstall(groupName="all")
• Once installed, be sure to load the libraries in your R session
library(oompaBase)
library(ClassDiscovery)
library(ClassComparison)
• oompaBase also requires the mclust and cobs
packages; download these from CRAN
• Use the function distanceMatrix() to create a
distance matrix of your samples
– Uses the expression set created in Step 1 as input
– Remember that there are many different types of
distance metrics to choose from!
– See help(distanceMatrix)
x<- distanceMatrix(exprs,'pearson')
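If the OOMPA packages are not available, sample-to-sample distances can be approximated in base R. This is a sketch on simulated data, and it assumes a Pearson distance of 1 - r; distanceMatrix() may scale its Pearson metric differently, so the numbers need not match.

```r
set.seed(1)
# Simulated expression matrix: 20 genes (rows) by 4 samples (columns)
m <- matrix(rnorm(80), nrow = 20,
            dimnames = list(paste0("g", 1:20), paste0("s", 1:4)))

# dist() compares rows, so transpose to compare samples
d_euclid <- dist(t(m))

# cor() already works column-wise; turn correlation into a distance
# (assumption: distance = 1 - r)
d_pearson <- as.dist(1 - cor(m, method = "pearson"))

attr(d_euclid, "Size")  # one entry per sample
labels(d_pearson)       # sample names carried over from the columns
```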
Step 3: Cluster
• Use the hclust() function to create a hierarchical cluster
based on your distance matrix, x, created in Step 2.
> y<-hclust(x,method="complete")
> plot(y)
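Beyond plotting, the tree can be cut into a fixed number of groups with cutree(). A sketch on simulated data, again using base R and an assumed 1 - r correlation distance in place of distanceMatrix(); the sample names and noise levels are made up.

```r
set.seed(2)
# Simulated data: samples s1-s3 share one profile, s4-s6 share another
base1 <- rnorm(50)
base2 <- rnorm(50)
m <- cbind(sapply(1:3, function(i) base1 + rnorm(50, sd = 0.1)),
           sapply(1:3, function(i) base2 + rnorm(50, sd = 0.1)))
colnames(m) <- paste0("s", 1:6)

# Correlation-based distance between samples, then complete linkage
hc <- hclust(as.dist(1 - cor(m)), method = "complete")

# Cut the dendrogram into two groups of samples
groups <- cutree(hc, k = 2)
groups
```

cutree() returns one integer label per sample, which is convenient for comparing the clustering against known sample annotations.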
Testing for Differential Gene Expression with the T-test
• Get the multtest package from Bioconductor
• Package contains data from the Golub
leukemia microarray data set (ALL v AML)
– 38 arrays
• 27 from lymphoblastic
• 11 from myeloid
http://people.cryst.bbk.ac.uk/wernisch/macourse/
library(multtest)
data(golub)
golub.cl
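• Generate the t statistic
– teststat <- mt.teststat(golub, golub.cl)
• Convert into two-sided p-values
– rawp0 <- 2*pt(abs(teststat), lower.tail=FALSE, df=38-2)
– Note: df=38-2 is the equal-variance value; mt.teststat() defaults to the Welch statistic (test="t"), so these p-values are approximate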
• Correct for multiple testing and show the ten most
significant genes
– procs <- c("Bonferroni", "BH")
– res <- mt.rawp2adjp(rawp0, procs)
– res$adjp[1:10,]
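The same workflow can be sketched in base R without multtest, which may help when checking results by hand. The data here are simulated (100 genes, with the first 5 shifted in one class), and p.adjust() stands in for mt.rawp2adjp(); neither is part of the original slides.

```r
set.seed(3)
# Simulated data: 100 genes, 27 + 11 samples; the first 5 genes are shifted
cl <- rep(c(0, 1), times = c(27, 11))
x <- matrix(rnorm(100 * 38), nrow = 100)
x[1:5, cl == 1] <- x[1:5, cl == 1] + 3

# Equal-variance two-sample t statistic for each gene (row)
tstat <- apply(x, 1, function(g)
  t.test(g[cl == 1], g[cl == 0], var.equal = TRUE)$statistic)

# Two-sided raw p-values with n1 + n2 - 2 degrees of freedom
rawp <- 2 * pt(abs(tstat), df = length(cl) - 2, lower.tail = FALSE)

# Bonferroni and Benjamini-Hochberg adjustments, shown for the ten smallest p-values
adj <- cbind(rawp = rawp,
             Bonferroni = p.adjust(rawp, method = "bonferroni"),
             BH = p.adjust(rawp, method = "BH"))
adj[order(rawp)[1:10], ]
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; BH controls the false discovery rate, so its adjusted p-values are never larger than the Bonferroni ones.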