Genomic Profiles of Brain Tissue in Humans and Chimpanzees II
Naomi Altman
Oct 06
SAM
Significance Analysis of Microarrays (SAM) is a popular method
of differential expression analysis, freely available
from www-stat.stanford.edu/~tibs.
It uses permutation-based tests and allows for some
common models, including paired and unpaired t-tests,
one-way ANOVA, and some simple block designs. It
also offers some other analyses, such as survival analysis.
The data must be normalized in advance. The analysis itself
does not allow missing data, so SAM includes a method to "fill in"
(impute) missing values, assuming they are sparse and missing at
random.
SAM
SAM can be run from Excel through an interface
that sends data to and from R.
samr is the corresponding package that runs in R.
I will demonstrate the Excel interface, which is
the more popular way to run SAM.
SAM
Like Limma, SAM starts by computing a test statistic for each gene.
SAM uses a regularized denominator: i.e. the test statistic is based on a
paired or two-sample t-test, or an ANOVA F-test, but a small constant s0
computed from all the data is added to the usual denominator, which is based
on the within-treatment estimate of variance for each gene. The variance of a
gene is assumed to be the same for all treatments.
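Usual and moderated test statistics:

Paired
Usual:
\[ \frac{\bar{M}}{\sqrt{s_M^2 / n}} \]
Moderated:
\[ \frac{\bar{M}}{\sqrt{s_M^2 / n} + s_0} \]

2-sample
Usual:
\[ \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_p^2 \left( \frac{1}{n} + \frac{1}{m} \right)}} \]
Moderated:
\[ \frac{\bar{y}_1 - \bar{y}_2}{\left\{ s_p^2 \left( \frac{1}{n} + \frac{1}{m} \right) \right\}^{1/2} + s_0} \]

ANOVA
Usual:
\[ \frac{\sum_{i=1}^{T} n_i (\bar{y}_i - \bar{y})^2 / (T - 1)}{\sum_{i=1}^{T} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 / (N - T)} \]
Moderated:
\[ \frac{\left\{ N \sum_{i=1}^{T} n_i (\bar{y}_i - \bar{y})^2 / \prod_{i=1}^{T} n_i \right\}^{1/2}}{\left\{ \sum_{i=1}^{T} \frac{1}{n_i} \sum_{i=1}^{T} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 / (N - T) \right\}^{1/2} + s_0} \]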
s0
s0 is computed from the values of s_i, the usual denominator
for gene i, taken over all the genes.
An ad hoc procedure based on simulations is
used.
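The effect of adding s0 to the denominator can be illustrated with a minimal sketch of the two-sample case. The function name `two_sample_statistic` is hypothetical, and s0 is simply passed in as a number here; this sketch does not reproduce SAM's own procedure for choosing s0.

```python
import numpy as np

def two_sample_statistic(y1, y2, s0=0.0):
    """Two-sample statistic for each gene (one gene per row).

    y1: genes x n array for treatment 1; y2: genes x m array for treatment 2.
    s0: constant added to the denominator (assumed given, not estimated here).
    With s0 = 0 this is the usual pooled-variance t statistic.
    """
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    n, m = y1.shape[1], y2.shape[1]
    diff = y1.mean(axis=1) - y2.mean(axis=1)
    # Pooled within-treatment variance s_p^2
    ss1 = ((y1 - y1.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    ss2 = ((y2 - y2.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    sp2 = (ss1 + ss2) / (n + m - 2)
    se = np.sqrt(sp2 * (1.0 / n + 1.0 / m))
    return diff / (se + s0)

usual = two_sample_statistic([[1, 2, 3]], [[4, 5, 6]])
moderated = two_sample_statistic([[1, 2, 3]], [[4, 5, 6]], s0=0.5)
```

Because s0 is positive, the moderated statistic is always smaller in absolute value than the usual one, and the shrinkage is strongest for genes whose estimated variance is very small.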
Selecting the Significant Genes
SAM uses a quantile-quantile plot of the data
versus the expected quantiles of the null
distribution.
Observations far enough off the identity line are considered
detections.
The FDR is estimated based on the percentage
of the randomization values that would have
been "detected".
Example for Random Normals
We sort the data into y(1) < y(2) < ... < y(n).
y(i) has a sampling distribution with
mean nz(i), the ith normal score.
We plot y(i) versus nz(i).
If the data are normally distributed,
then the data should lie on the line
y = x.
(Note that in the case of N(μ, σ²)
data, we often plot against the
normal scores for N(0,1); then the
data should lie on the line y = μ + σx.)
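The normal-scores plot can be checked numerically with a short sketch. The normal scores are approximated here with Blom's formula, which is an assumption of this example and not something the slides specify; the values of `mu` and `sigma` are likewise hypothetical.

```python
import numpy as np
from statistics import NormalDist

def normal_scores(n):
    """Approximate expected order statistics of n draws from N(0, 1).

    Uses Blom's approximation, inverse-CDF of (i - 0.375) / (n + 0.25).
    """
    std_normal = NormalDist()
    return np.array(
        [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    )

rng = np.random.default_rng(0)
mu, sigma = 10.0, 2.0                      # hypothetical mean and standard deviation
y = np.sort(rng.normal(mu, sigma, size=500))
nz = normal_scores(len(y))

# Fit a straight line to the plot of y(i) versus nz(i).
slope, intercept = np.polyfit(nz, y, 1)
```

For normally distributed data the fitted intercept should be close to the mean and the fitted slope close to the standard deviation, which is the line described in the note above.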
Selecting the Significant Genes
SAM computes a test statistic Di for the ith gene.
Then, the sample labels are permuted.
For each permutation, the ordered statistics D(1) < D(2) < ... < D(G) are saved.
These are averaged over the permutations to obtain the
X-axis of the plot (call these the DN scores).
As well, all the distances dist(i) = |D(i) - DN(i)| are
recorded.
The median, over the permutations, of the number of values
with dist(i) > K is taken as the estimate of the
expected number of false discoveries at distance K.
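The permutation steps on this slide can be sketched as follows for a two-treatment design. This follows the description given here, not the samr source code; the function names, the fixed value of `s0`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def stat(data, labels, s0):
    """Moderated two-sample statistic per gene (row); labels is a boolean array."""
    a, b = data[:, labels], data[:, ~labels]
    n, m = a.shape[1], b.shape[1]
    ss = a.var(axis=1, ddof=1) * (n - 1) + b.var(axis=1, ddof=1) * (m - 1)
    se = np.sqrt(ss / (n + m - 2) * (1.0 / n + 1.0 / m))
    return (a.mean(axis=1) - b.mean(axis=1)) / (se + s0)

def permutation_null(data, labels, s0=0.1, n_perm=200, seed=0):
    """Return sorted observed statistics, DN scores, and sorted permutation statistics."""
    rng = np.random.default_rng(seed)
    # One row per permutation: the ordered statistics D(1) <= ... <= D(G)
    perm_sorted = np.array(
        [np.sort(stat(data, rng.permutation(labels), s0)) for _ in range(n_perm)]
    )
    d_sorted = np.sort(stat(data, labels, s0))
    dn = perm_sorted.mean(axis=0)          # average over permutations: the X-axis
    return d_sorted, dn, perm_sorted

def expected_false_discoveries(dn, perm_sorted, k):
    """Median over permutations of the number of values with dist(i) > k."""
    counts = (np.abs(perm_sorted - dn) > k).sum(axis=1)
    return float(np.median(counts))

# Simulated null data: 500 genes, 4 samples per treatment (hypothetical sizes).
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 8))
labels = np.array([True] * 4 + [False] * 4)
d_sorted, dn, perm_sorted = permutation_null(data, labels)
v_hat = expected_false_discoveries(dn, perm_sorted, k=1.0)
```

Plotting `d_sorted` against `dn` gives the quantile-quantile plot, and `v_hat` is the estimated number of false discoveries at the chosen distance.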
Selecting the Significant Genes
SAM computes a test statistic Di for the ith gene.
The user selects a distance K. SAM computes the number
of genes detected at that distance, R, and estimates
the expected number of false discoveries at that
distance, V, to obtain an estimate of the FDR, V/R.
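A small worked example may make the ratio concrete. The function name `estimate_fdr` and the toy numbers are hypothetical; capping the estimate at 1 and reporting 0 when nothing is detected are conventions assumed for this sketch.

```python
import numpy as np

def estimate_fdr(d_sorted, dn, perm_sorted, k):
    """Return (R, V, estimated FDR) at distance k."""
    # R: number of genes detected in the observed data
    r = int((np.abs(d_sorted - dn) > k).sum())
    # V: median number of "detections" across the permuted data sets
    v = float(np.median((np.abs(perm_sorted - dn) > k).sum(axis=1)))
    fdr = min(v / r, 1.0) if r > 0 else 0.0
    return r, v, fdr

# Toy sorted statistics for 5 genes and 3 permutations (made-up values).
perm_sorted = np.array([
    [-1.2, -0.4,  0.1, 0.5, 0.9],
    [-2.5, -0.6,  0.0, 0.4, 1.1],
    [-0.8, -0.5, -0.1, 0.6, 2.2],
])
dn = perm_sorted.mean(axis=0)
d_sorted = np.array([-3.0, -0.5, 0.0, 0.6, 4.0])

r, v, fdr = estimate_fdr(d_sorted, dn, perm_sorted, k=0.75)
```

In this toy case two genes are detected and the median permutation yields one detection, so the estimated FDR is one half.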
Example for Random Normals
If this is the plot for the
data, the points indicated
are the discoveries.
For each permuted
data set, we also
compute the number of
discoveries, and then
obtain an estimate of V.
Running SAM
1. Write normalized data to a file compatible with
Excel (tab or comma delimited).
2. Start Excel and open the file. The first 2 columns
should be gene ids. The first row contains the numbers
1 ... T giving the treatments.
3. Select the rows and columns of the spreadsheet that
you want to analyze.
4. Click on SAM in the GUI. Select the type of analysis,
the random seed and the number of permutations.
Running SAM
5. The SAM qqplot comes up. Select a distance
or use slider to assess FDR.
6. Print genelist.
The contrasts are $\bar{y}_i - \bar{y}$, the deviation of
each treatment mean from the overall mean.
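Steps 1 and 2 above can be sketched as a short script that writes a tab-delimited file with two gene-id columns and a first row of treatment numbers. The function name, the gene ids, and the expression values are hypothetical, and the exact layout the Excel interface expects should be checked against the SAM manual.

```python
import csv
import io

def write_sam_input(handle, gene_ids, gene_names, treatments, values):
    """Write normalized data as tab-delimited text.

    First row: two empty cells, then the treatment number of each sample.
    Following rows: two gene-id columns, then the expression values.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["", ""] + list(treatments))
    for gene_id, gene_name, row in zip(gene_ids, gene_names, values):
        writer.writerow([gene_id, gene_name] + list(row))

# Written to an in-memory buffer here; pass an open file to save to disk.
buffer = io.StringIO()
write_sam_input(
    buffer,
    gene_ids=["g1", "g2"],
    gene_names=["GENE1", "GENE2"],
    treatments=[1, 1, 2, 2],
    values=[[7.1, 7.3, 8.0, 8.2], [5.0, 5.1, 4.9, 5.2]],
)
text = buffer.getvalue()
```

To save a file that Excel can open, the same function can be called with `open("data.txt", "w", newline="")` in place of the buffer.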
Limma vs. SAM
Limma
• model-based
• can handle small numbers of replicates
• handles ANOVA-type problems including 1 random effect
• handles missing data
• produces a genelist and CIs
• can determine significance of any linear contrast
• hard to use
SAM
• nonparametric
• cannot handle small numbers of replicates
• handles limited ANOVA-type problems and survival
• "imputes" missing data
• produces only a genelist
• only determines significance of deviation from mean
• easy to use