Exploratory Data Analysis (EDA)

Download Report

Transcript Exploratory Data Analysis (EDA)

Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Module #: Title of Module
2
Module 2
Exploratory Data Analysis (EDA)
Exploratory Data Analysis of Biological Data using R
†
Boris Steipe
Toronto, May 23. and 24. 2013
DEPARTMENT OF
BIOCHEMISTRY
DEPARTMENT OF
MOLECULAR GENETICS
†
Odysseus listening to the song of the sirens. Late Archaic (500–480 BCE)
This workshop includes material
originally developed by Raphael Gottardo, FHCRC
and by Sohrab Shah, UBC
Exploratory Data Analysis (EDA)
EDA is an approach to data analysis without a
particular statistical model or hypothesis.
It is therefore often the first step of data
analysis.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
Exploratory Data Analysis (EDA)
The objectives of EDA include:
• uncovering underlying structure and identifying
trends and patterns;
• extracting important variables;
• detecting outliers and anomalies;
• testing underlying assumptions;
• developing statistical models.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
Exploratory Data Analysis (EDA)
The practice of EDA emphasizes looking at data in different ways through:
• computing and tabulating basic descriptors of data properties such as
ranges, means, and variances;
• generating graphics, such as boxplots, histograms, scatter plots;
• applying transformations, such as log or rank;
• comparing observations to statistical models, such as the QQ-plot, or
regression;
• identifying underlying structure through clustering;
• simplifying data through dimension reduction ...
... all with the final goal of defining a statistical model and using the model
for hypothesis testing and prediction.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
Graphics
Good graphics are immensely valuable. Poor graphics are
worse than none.
If you want to learn more about good graphics and information
design, find a copy of Edward Tufte's The Visual Display of
Quantitative Information. You can also visit his Web site to get a sense of the
field (www.edwardtufte.com).
Fundamentally, there is one simple rule.
Use less ink.
The rule has many corollaries.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
Use less ink
Make sure that all elements on your graphics are necessary.
Make sure that all elements on your graphics are informative.
Make sure that all information in your data is displayed.
Not all of R's defaults use as little ink as possible...
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
refined graphics
... for a popular alternative to R's
defaults, check out the ggplot2
package (http://www.ggplot2.org).
> install.packages("ggplot2")
...
> library(ggplot2)
...
Example: Bubble chart.
http://ygc.name/2010/12/01/bubble-chart-by-using-ggplot2/
crime <- read.csv("crimeRatesByState2008.csv", header=TRUE, sep="\t")
p <- ggplot(crime, aes(murder,burglary,size=population, label=state))
p <- p+geom_point(colour="red") +scale_area(to=c(1,20))+geom_text(size=3)
p + xlab("Murders per 100,000 population") + ylab("Burglaries per 100,000")
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
Graphics
R Graph Gallery: http://gallery.r-enthusiasts.com/
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
exploring data
•
•
•
•
When teaching (or learning new procedures) I prefer to work with
synthetic data.
Synthetic data has the advantage that I know what the outcome of the
analysis should be.
Typically one would create values acording to a function and then add
noise.
R has several functions to create sequences of values – or you can write
your own ...
0:10
seq(0, pi, 5*pi/180)
rep(1:3, each=3, times=2)
for (i in 1:10) { print(i*i) }
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
synthetic data
Function ...
Noise ...
Noisy Function ...
Explore functions and noise.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
discrete data
TGGGACGATCAACTCGAAGCGGATGCCCCTATTGATCCTCTCCGAACACAATGAGAATGCTCTCAAAG
GCTGTAAGCTCAGTGAACGGGACTAAATCAGACGTTTGTCGTCGATACGGGTGCCCTACCTCTACCTA
ATCTGATCGAGGCCAAGTATCTGGAAGGACTGCTAATTTTGCCGTAAACCCCGTTCTTGTTCAGTATAA
TCTAAGTAGATGTGTTAGCAGTCATATGACGATAGTTTGGGGGTCCTCCTATAGGAAAAAGAAAGAGTC
GCCACACCAGAGCCTGTAGCGCTTTCTAGATAGTGCTGCATATTTATATGTCGGCCCCGAACCAGAGC
GCTCCAATGGTAGCCCCTTTAATCTTCGTATCTTACCTTTTATGAGTGCACAGGTTTCCATCGAGGGAG
AAAAACCTCAGCAACGTGGGTGGGTAGAAGAGCCCTAGTTTGAAAGCCGCACCATAACCCGCACATC
GTCAGATCAGTAACCCAAGATCGGTGGGCTGTAACTAGGCTCCGTGACACAGCGTGGTATTCCGAGT
TCCCGAAATCGTTTCACCTATAGAACGCCACCCCGGACGGGGTTGTTAGTTTTTCTACCTTTTAAGAAG
AAAAGCAAAGTGTGTGGACACGAGAACTAGTGTGAGTACGGTTTTGTATGTGGCCCTACTGTGGAAAC
TCAGTAGTACGAAGGGGATAGCGAGACTTAGCTTTGCCCCAACTGCCGTCACGCACCCGCTTGTGCC
GGTACGCAGAGCTCCGCCGGGTGTCCAAGTGCCGTTCTACGATAAGAACCTGTGTATCTAGCGCGCC
CGATATGAATAAAGCCTACTCTTATCCAGATTTTGCGGACTGGTAAGCGTGACAATTATTGCGCAGCTT
CGACTTAGTTCTCCTTGCCTTGCTTTAGGGGAGTTCTCCACTCAAAAGTCGTTGACGTACAATCGCAG
ATTTTGTAATCCCTTAAACCTCTGATTAGTCTCAGCCGTATTCACTA
Exercise: Write a function to output random nucleotides.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
random nucleotides: solution 1
Generating uniformly distributed random nucleotides using sample().
rNucUnif <- function(n) {
nuc <- c("A", "C", "G", "T")
rSeq <- sample(nuc, n, replace=TRUE)
return(rSeq)
}
Task:
Explore a function that samples according to arbitrary probabilities.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
sample()
•
•
Random samples from a vector with/without replacement.
Sampling a vector over its length without replacement is a permutation
or shuffle of the vector.
sums <- rep(0, 10)
digits <- 0:9
rsd <- vector()
for (i in 1:10000) {
perm <- sample(digits, 10)
sums <- sums + perm
if (i %% 1000 == 0) {
rsd <- c(rsd, sd(sums)/mean(sums))
}
}
Task:
Test whether the permutation is biased.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
probability distributions
• Can be either discrete or
continuous (uniform,
Bernoulli, normal, etc)
• Defined by a density
function, p(x) or f(x)
• Bernoulli distribution
Be(p): flip a coin (T=0,
H=1). Assume probability
of H is 0.1 ...
x <- 0:1
f <- dbinom(x, size=1, prob=0.1)
plot(x, f, xlab="x", ylab="density", type="h", lwd=5)
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
probability distributions
Random sampling:
Generate 100 observations
from a Be(0.1)
set.seed(100)
x <- rbinom(100, size=1, prob=0.1)
hist(x)
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
probability distributions
Normal distribution N(μ,σ2)
μ is the mean and σ2 is the
variance.
Extremely important because of
the Central Limit Theorem: if a
random variable is the sum of a
large number of small random
variables, it will be normally
distributed.
The area under the curve is the probability of observing a value between 0 and 2.
x <- seq(-4, 4, 0.1)
f <- dnorm(x, mean=0, sd=1)
plot(x, f, xlab="x", ylab="density", lwd=5, type="l")
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
probability distributions
Normal distribution N(μ,σ2)
μ is the mean and σ2 is the
variance.
Extremely important because of
the Central Limit Theorem: if a
random variable is the sum of a
large number of small random
variables, it will be normally
distributed.
Task:
Explore line
parameters
The area under the curve is the probability of observing a value between 0 and 2.
x <- seq(-4, 4, 0.1)
f <- dnorm(x, mean=0, sd=1)
plot(x, f, xlab="x", ylab="density", lwd=5, type="l")
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
probability distributions
Random sampling:
Generate 100 observations
from a N(0,1)
Histograms can be used
to estimate densities!
set.seed(100)
x <- rnorm(100, mean=0, sd=1)
hist(x)
lines(seq(-3,3,0.1),50*dnorm(seq(-3,3,0.1)), col="red")
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
quantiles
(Theoretical) Quantiles:
The p-quantile has the
property that there is a
probability p of getting a
value less than or equal to it.
The 50% quantile is called the median.
90% of the probability (area under the curve) is to the left of the red vertical line.
q90 <- qnorm(0.90, mean = 0, sd = 1)
x <- seq(-4, 4, 0.1)
f <- dnorm(x, mean=0, sd=1)
plot(x, f, xlab="x", ylab="density", type="l", lwd=5)
abline(v=q90, col=2, lwd=5)
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
descriptive statistics
Empirical Quantiles:
The p-quantile has the property that p% of the observations are
less than or equal to it.
Empirical quantiles can be easily obtained in R.
> set.seed(100)
> x <- rnorm(100, mean=0, sd=1)
> quantile(x)
0%
25%
50%
75%
100%
-2.2719255 -0.6088466 -0.0594199 0.6558911 2.5819589
> quantile(x, probs=c(0.1, 0.2, 0.9))
10%
20%
90%
-1.1744996 -0.8267067 1.3834892
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
descriptive statistics
We often need to quickly 'quantify' a data set, and this can be
done using a set of summary statistics (mean, median,
variance, standard deviation).
> mean(x)
[1] 0.002912563
> median(x)
[1] -0.0594199
> IQR(x)
[1] 1.264738
> var(x)
[1] 1.04185
> summary(x)
Min. 1st Qu. Median
Mean 3rd Qu.
Max.
-2.272000 -0.608800 -0.059420 0.002913 0.655900 2.582000
Exercise: what are the units of variance and standard deviation?
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
Box-plot
Descriptive
statistics can be
intuitively
summarized in a
Box-plot.
1.5 x IQR
75% quantile
IQR
Median
25% quantile
> boxplot(x)
1.5 x IQR
Everything above and below 1.5 x
IQR is considered an "outlier".
IQR = Inter Quantile Range = 75% quantile – 25% quantile
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
Violinplot
Internal structure of a
data-vector can be
made visible in a violin
plot. The principle is
the same as for a
boxplot, but a width is
calculated from a
smoothed histogram.
p <- ggplot(X, aes(1,x))
p + geom_violin()
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
plotting data in R
Task:
Explore types of plots.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
QQ–plot
One of the first things we may ask about data is whether it
deviates from an expectation e.g. to be normally distributed.
The quantile-quantile plot provides a way to visually verify this.
The QQ-plot shows the theoretical quantiles versus the
empirical quantiles. If the distribution assumed (theoretical one)
is indeed the correct one, we should observe a straight line.
R provides qqnorm() and qqplot().
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
QQ–plot: sample vs. Normal
Only valid for the
normal distribution!
qqnorm(x)
qqline(x, col=2)
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
Plotting: lines and legends
Example: compare the
t-distribution with the
Normal distribution.
x <- seq(-4, 4, 0.1)
f1 <- dnorm(x, mean=0, sd=1)
f2 <- dt(x, df=2)
plot(x, f1, xlab="x", ylab="density", lwd=5, type="l")
lines(x, f2, lwd=5, col=2)
legend(-4, .4, c("Normal", "t2"), col=1:2, lwd=5)
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
QQ–plot: sample vs. Normal
set.seed(100)
t <- rt(100, df=2)
qqnorm(t)
qqline(t, col=2)
Clearly the t distribution with two
degrees of freedom is not Normal.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
QQ–plot
Verify the CLT.
set.seed(101)
generateVariates <- function(n) {
Nvar <- 10000
Vout <- c()
for (i in 1:n) {
x <- runif(Nvar, -0.01, 0.01)
Vout <- c(Vout, sum(x) )
}
return(Vout)
}
x <- generateVariates(1000)
y <- rnorm(1000, mean=0, sd=1)
qqnorm(x)
qqline(x, y, col=2)
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
QQ–plot: sample vs. sample
Comparing two samples: are their
distributions the same?
... or ...
compare a sample vs. a synthetic
dataset.
set.seed(100)
x <- rt(100, df=2)
y <- rnorm(100, mean=0, sd=1)
qqplot(x, y)
Exercise: try different values of df for rt() and compare the vectors.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
scatter plots
Biological data sets often
contain several variables, they
are multivariate.
Scatter plots allow us to look
at two variables at a time.
# GvHD flow cytometry data
gvhd <- read.table("GvHD.txt", header=TRUE)
# Only extract the CD3 positive cells
gvhdCD3p <- as.data.frame(gvhd[gvhd[, 5]>280, 3:6])
plot(gvhdCD3p[, 1:2])
This can be used to assess independence and identify subgroups.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
scatter plots
Task:
Explore scatter plots.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
trellis graphics
Trellis Graphics is a family of techniques for viewing complex,
multi-variable data sets.
plot(gvhdCD3p, pch=".")
Tip:
Many more possibilities are available in the "lattice" package. See ?Lattice
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
Boxplots
The boxplot function can
be used to display
several variables at a
time.
boxplot(gvhdCD3p)
Exercise: Interpret this plot.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
Histograms
par(mfrow=c(2, 2))
hist(gvhdCD3p[, 1], 50,
main=names(gvhdCD3p)[1],
xlab="fluorescent intensity",
ylim=c(0, 120))
hist(gvhdCD3p[, 2], 50,
main=names(gvhdCD3p)[2],
xlab="fluorescent intensity",
ylim=c(0, 120))
hist(gvhdCD3p[, 3], 50,
main=names(gvhdCD3p)[3],
xlab="fluorescent intensity",
ylim=c(0, 120))
hist(gvhdCD3p[, 4], 50,
main=names(gvhdCD3p)[4],
xlab="fluorescent intensity",
ylim=c(0, 120))
We discover a mix of distinct cell subpopulations!
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA: HIV microarray data
HIV Data:
The expression levels of 7680 genes
were measured in CD4-T-cell lines at
time t = 24 hours after infection with
HIV type 1 virus.
• 12 positive controls (HIV genes).
• 4 replicates (2 with a dye swap)
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA: HIV microarray data
One of the array chips.
We get the data after the image analysis is done.
For each spot we have an estimate of the intensity in both
channels.
The data matrix therefore is of size 7680 x 8.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA: HIV microarray data
data <- read.table(file="hiv.raw.data.24h.txt", sep="\t", header=TRUE)
summary(data)
boxplot(data)
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA: HIV microarray data
opar <- par(mfrow=c(2, 2))
hist(data[, 1], 50, main=names(data)[1], xlab="fluorescent intensity")
hist(data[, 2], 50, main=names(data)[2], xlab="fluorescent intensity")
hist(data[, 5], 50, main=names(data)[5], xlab="fluorescent intensity")
hist(data[, 6], 50, main=names(data)[6], xlab="fluorescent intensity")
par(oPar)
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA: HIV microarray data
# 'apply' will apply the function to all rows of the data matrix
mean <- apply(data[, 1:4], 1, "mean")
sd <- apply(data[, 1:4], 1, "sd")
plot(mean, sd)
trend <- lowess(mean, sd)
lines(trend, col=2, lwd=5)
lowess fit
LOcally WEighted
Scatter plot
Smoother; used to
estimate the trend
in a scatter plot.
Non parametric!
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA: Transformations
Observations:
The data are highly skewed. The standard deviation is not
constant, as it increases with the mean.
Solution:
Look for a transformation that will make the data more
symmetric and the variance more constant.
With positive data the log transformation is often
appropriate.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA: Transformations
data <- log(read.table(file="hiv.raw.data.24h.txt",
sep="\t", header=TRUE))
summary(data)
boxplot(data)
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA: HIV microarray data
opar <- par(mfrow=c(2, 2))
hist(data[, 1], 50, main=names(data)[1], xlab="fluorescent intensity")
hist(data[, 2], 50, main=names(data)[2], xlab="fluorescent intensity")
hist(data[, 5], 50, main=names(data)[5], xlab="fluorescent intensity")
hist(data[, 6], 50, main=names(data)[6], xlab="fluorescent intensity")
par(oPar)
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA: Transformations
# 'apply' will apply the function to all rows of the data matrix
mean <- apply(data[, 1:4], 1, "mean")
sd <- apply(data[, 1:4], 1, "sd")
plot(mean, sd)
trend <- lowess(mean, sd)
lines(trend, col=2, lwd=5)
The standard deviation has
become almost independent of
the mean.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA for microarray data: always log
• Makes the data more symmetric, large
observations are not as influential
• The variance is (more) constant
• Turns multiplication into addition
(log(ab)=log(a)+log(b))
• In practice use log base 2,
log2(x)=log(x)/log(2)
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA for gene expression
# scatter plot
plot(data[, 1], data[, 5],
xlab=names(data)[1], ylab=names(data)[5])
Is this a good way to
look at the data?
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA for gene expression: MA plots
# MA plot
A <- (data[, 1]+data[, 5])/2
M <- (data[, 1]-data[, 5])
plot(A, M, xlab="A", ylab="M")
M (minus) is the log ratio.
A (average) is overall intensity.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
EDA for gene expression: MA plots
# MA plots per replicate
par(mfrow=c(2, 2))
A1 <- (data[, 1]+data[, 5])/2
M1 <- (data[, 1]-data[, 5])
plot(A1, M1, xlab="A", ylab="M", main="rep 1")
trend <- lowess(A1, M1)
lines(trend, col=2, lwd=5)
A2 <- (data[, 2]+data[, 6])/2
M2 <- (data[, 2]-data[, 6])
plot(A2, M2, xlab="A", ylab="M", main="rep 2")
trend <- lowess(A2, M2)
lines(trend, col=2, lwd=5)
A3 <- (data[, 3]+data[, 7])/2
M3 <- (data[, 3]-data[, 7])
plot(A3, M3, xlab="A", ylab="M", main="rep 3")
trend <- lowess(A3, M3)
lines(trend, col=2, lwd=5)
A4 <- (data[, 4]+data[, 8])/2
M4 <- (data[, 4]-data[, 8])
plot(A4, M4, xlab="A", ylab="M", main="rep 4")
trend <- lowess(A4, M4)
lines(trend, col=2, lwd=5)
par(OldPar)
The data suggests that a small group of genes are interesting. How do we define/find
such differentially expressed genes? More on this in Module 6: Hypothesis Testing.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
More plots
Task:
Explore 3D plots.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
Summary
EDA should be the first step in any
statistical analysis!
Good modeling starts and ends with
EDA.
R provides a great framework for EDA.
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca
[email protected]
Module 2: Exploratory Data Analysis (EDA)
bioinformatics.ca