Transcript statgen10a

Gene Array Analysis
Statistical genetics - Class 10
Gene array description
Normalization
Data Analysis
Multiple measurements
What is a gene array
 Gene arrays are solid supports upon which a collection of
gene-specific nucleic acids have been placed at defined
locations, either by spotting or direct synthesis.
 In array analysis, a nucleic acid-containing sample is labeled
and then allowed to hybridize with the gene-specific targets
on the array.
 Based on the amount of probe hybridized to each target spot,
information is gained about the specific nucleic acid
composition of the sample.
 The major advantage of gene arrays is that they can provide
information on thousands of targets in a single experiment.
Nomenclature
 Many terms exist for naming gene arrays, including:
   biochip
   DNA chip
   GeneChip (a registered trademark of Affymetrix, Inc.)
   DNA array
   microarray
   macroarray
 Microarray and macroarray may be used to differentiate
between spot size or the number of spots on the support.
Experiment
 A typical gene array experiment involves:
1. Isolating RNA from the samples to be compared
2. Converting the RNA samples to labeled cDNA via reverse transcription; this step may be combined with aRNA amplification
3. Hybridizing the labeled cDNA to identical membrane or glass slide arrays
4. Removing the unhybridized cDNA
5. Detecting and quantitating the hybridized cDNA
6. Comparing the quantitative data from the various samples
General Picture
Choosing Cell Populations
 The goal of comparative cDNA hybridization is to compare
gene transcription in two or more different kinds of cells.
For example:
 Tissue-specific Genes - Cells from two different tissues (say,
cardiac muscle and prostate epithelium) are specialized for
performing different functions in an organism. Although we
can recognize cells from different tissues by their
phenotypes, it is not known just what makes one cell
function as smooth muscle, another as a neuron, and still
another as prostate.
 Ultimately, a cell's role is determined by the proteins it
produces, which in turn depend on its expressed genes.
Comparative hybridization experiments can reveal genes
which are preferentially expressed in specific tissues.
Choosing Cell Populations
 Genetic disease is often caused by genes which are
inappropriately transcribed -- either too much or too little -- or which are missing altogether.
 Such defects are especially common in cancers, which can
occur when regulatory genes are deleted, inactivated, or
become constitutively active.
 Unlike some genetic diseases (e.g. cystic fibrosis) in which a
single defective gene is always responsible, cancers which
appear clinically similar can be genetically heterogeneous.
 For example, prostate cancer (prostatic adenocarcinoma)
may be caused by several different, independent regulatory
gene defects even in a single patient.
Choosing Cell Populations
 Cell Cycle Variations
 Cells undergo DNA replication, mitosis, and eventually
death. These activities require quite different gene products,
such as DNA polymerases for genome replication or
microtubule spindle proteins for mitosis. A cell's genes
encode the "programs" for these activities, and gene
transcription is required to execute those programs.
Comparative hybridization can be used to distinguish genes
that are expressed at different times in the cell cycle. In this
way, the pathways responsible for controlling basic life
processes can be uncovered.
mRNA Extraction
 Genes which code for protein are transcribed
into messenger RNA's (mRNA's) in the cell
nucleus. The mRNA's in turn are translated
into proteins by ribosomes in the cytoplasm.
The transcription level of a gene is taken to
be the amount of its corresponding mRNA
present in the cell. Comparative
hybridization experiments compare the
amounts of many different mRNA's in two
cell populations.
mRNA Extraction
 To prepare mRNA for use in a microarray assay, it
must be purified from total cellular contents.
mRNA accounts for only about 3% of all RNA in a
cell.
 Common mRNA isolation methods take advantage
of the fact that most mRNA's have a poly-adenine
(poly(A)) tail. These poly(A)+ mRNA's can be
purified by capturing them using complementary
oligodeoxythymidine (oligo(dT)) molecules bound
to a solid support.
Reverse transcription
 Captured mRNA's are still difficult to work with
because they are prone to being destroyed.
 The environment is full of RNA-digesting enzymes,
so free RNA is quickly degraded. To prevent the
experimental samples from being lost, they are
reverse-transcribed back into more stable DNA
form. The products of this reaction are called
complementary DNA's (cDNA's) because their
sequences are the complements of the original
mRNA sequences.
Reverse transcription
 A problem with cDNA production is that not all mRNA's are
reverse-transcribed with the same efficiency. This fact leads
to reverse transcription bias, which can change the relative
amounts of different cDNA's measured by the microarray
assay.
 Reverse transcription bias is not a problem when comparing
the same mRNA across two cell populations, unless it prevents
the mRNA from being reverse-transcribed at all.
 However, the bias does prohibit quantitative comparison
between different mRNA's on one array.
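The argument can be written out as a short worked equation. This is only a sketch under a stated assumption: the symbols are introduced here for illustration (they are not in the slides), and the efficiency e_g is assumed to depend on the gene only, not on the sample.

```latex
% a_{g,s}: true amount of mRNA for gene g in sample s   (hypothetical symbol)
% e_g    : reverse transcription efficiency of gene g   (hypothetical symbol)
% c_{g,s}: amount of cDNA obtained for gene g in sample s
c_{g,s} = e_g \, a_{g,s}

% Same gene, two samples: the efficiency cancels (provided e_g \neq 0)
\frac{c_{g,1}}{c_{g,2}} = \frac{e_g \, a_{g,1}}{e_g \, a_{g,2}} = \frac{a_{g,1}}{a_{g,2}}

% Two genes g and h, one sample: the unknown factor e_g / e_h remains
\frac{c_{g,s}}{c_{h,s}} = \frac{e_g}{e_h} \cdot \frac{a_{g,s}}{a_{h,s}}
```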
Fluorescent labeling of cDNA's
 In order to detect cDNA's bound to the microarray, we must
label them with a reporter molecule that identifies their
presence. The reporters currently used in comparative
hybridization to microarrays are fluorescent dyes (fluors).
 A differently-colored fluor is used for each sample so that
we can tell the two samples apart on the array. The labeled
cDNA samples are called probes because they are used to
probe the collection of spots on the array.
 Fluors do not show their colors unless stimulated with a
specific frequency of light by a laser. Even then, the colors
are not directly observed; rather, the wavelength of the
emitted light is used to tune a detector which measures the
fluorescence.
Normalization
 The number of fluor molecules which label each
cDNA depends on its length and possibly its sequence
composition, both of which are often unknown.
 This is one more reason that fluorescent intensities for
different cDNA's cannot be quantitatively compared.
However, identical cDNA's from the two probes are
still comparable as long as the same number of label
molecules is added to the same DNA sequence in
each probe.
Normalization
 To equalize the total concentrations of the two
cDNA probes before applying them to the
array, the probe solutions are diluted to have
the same overall fluorescent intensity.
 This procedure makes two possibly unjustified
assumptions:
1. that the total amount of mRNA in each cell type being tested is identical
2. that each fluor emits the same amount of light relative to its concentration.
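The slides describe a physical dilution step. A computational analogue, shown here only as a minimal sketch, rescales one channel so that both channels have the same total intensity. The function name `equalize_total_intensity` and the sample values are hypothetical, and the sketch inherits both assumptions listed above.

```python
def equalize_total_intensity(cy3, cy5):
    """Rescale the Cy5 values so their total equals the Cy3 total.

    Assumes (as the dilution step does) that both samples contain the
    same total amount of mRNA and that both fluors are equally bright.
    """
    total_cy3 = sum(cy3)
    total_cy5 = sum(cy5)
    if total_cy5 == 0:
        raise ValueError("Cy5 channel has zero total intensity")
    scale = total_cy3 / total_cy5
    return [value * scale for value in cy5], scale


# Hypothetical spot intensities for three spots
cy3 = [100.0, 200.0, 300.0]
cy5 = [150.0, 300.0, 450.0]
cy5_scaled, scale = equalize_total_intensity(cy3, cy5)
```

After scaling, the per-spot ratios Cy5/Cy3 are centred on 1 whenever the two assumptions hold; if they do not, the scaling itself introduces a bias.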
Hybridization to a DNA Microarray
 The two cDNA probes are tested by hybridizing them to a
DNA microarray.
 The array holds hundreds or thousands of spots, each of
which contains a different DNA sequence.
 In this way, every spot on an array is an independent assay
for the presence of a different cDNA. There is enough DNA
on each spot that both probes can hybridize to it at once
without interference.
 Microarrays are made from a collection of purified DNA's.
A drop of each type of DNA in solution is placed onto a
specially-prepared glass microscope slide by an arraying
machine. The arraying machine can quickly produce a
regular grid of thousands of spots in a square about 2 cm on
a side.
Scanning the Hybridized Array
 Once the cDNA probes have been hybridized to the array and
any loose probe has been washed off, the array must be
scanned to determine how much of each probe is bound to
each spot.
 The probes are tagged with fluorescent reporter molecules
which emit detectable light when stimulated by a laser.
 The emitted light is captured by a detector, usually a charge-coupled device (CCD).
 Spots with more bound probe will have more reporters and
will therefore fluoresce more intensely.
 The scanner also records light from a few molecules that
hybridized either to the wrong spot or nonspecifically to the
glass slide. This extra light becomes the background of the
scanned array image.
Affymetrix arrays
• 10^7 copies per oligo in a 24 x 24 µm square
• Use 20 pairs of different 25-mers per gene
• Perfect match and mismatch
Data Analysis
 Normalization
 Detection of outliers
 Clustering
 Multiple measurements
False color images of spotted array
 Overlay of two scans of the slide
 Compares the two samples
 Green = less relative expression
 Red = more relative expression
 Yellow = equal expression
 Dimmer colors = lower expression levels.
Normalizing two-color arrays
The signals for the two colors are rarely
“balanced”.
[Figure: scatter plots of Cy5 signal (log2) against Cy3 signal (log2), before and after normalization]
Normalization by iterative
linear regression
fit a line (y = mx + b) to the data set
set aside outliers (residuals > 2 x s.e.)
repeat until r^2 changes by < 0.001
then apply the slope and intercept to the original data set
D Finkelstein et al.
http://www.camda.duke.edu/CAMDA00/abstracts.asp
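The iterative procedure can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: "s.e." is taken to be the residual standard error of the fit, "apply slope and intercept" is taken to mean mapping Cy5 back through the fitted line, and the names `fit_line`, `iterative_linear_normalization`, `tol` and `max_iter` are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit; returns slope, intercept, r^2, residual s.e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    m = sxy / sxx
    b = mean_y - m * mean_x
    ssr = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    return m, b, 1.0 - ssr / syy, math.sqrt(ssr / (n - 2))


def iterative_linear_normalization(cy3_log, cy5_log, tol=0.001, max_iter=50):
    """Fit Cy5 = m * Cy3 + b, dropping outliers until r^2 stabilises."""
    x, y = list(cy3_log), list(cy5_log)
    m, b, r2, se = fit_line(x, y)
    for _ in range(max_iter):
        # Set aside points whose residual exceeds 2 x s.e.
        kept = [(xi, yi) for xi, yi in zip(x, y)
                if abs(yi - (m * xi + b)) <= 2.0 * se]
        if len(kept) < 3 or len(kept) == len(x):
            break  # too few points left, or nothing more to remove
        x = [p[0] for p in kept]
        y = [p[1] for p in kept]
        m, b, new_r2, se = fit_line(x, y)
        converged = abs(new_r2 - r2) < tol
        r2 = new_r2
        if converged:
            break
    # Apply the final slope and intercept to the ORIGINAL data set
    # (assumed meaning: map Cy5 onto the Cy3 scale).
    normalized = [(yi - b) / m for yi in cy5_log]
    return normalized, m, b


# Hypothetical log2 signals: a linear dye bias plus one artificial outlier
cy3_log = [6.0 + 0.5 * i for i in range(20)]
cy5_log = [1.2 * x + 0.5 + (0.05 if i % 2 == 0 else -0.05)
           for i, x in enumerate(cy3_log)]
cy5_log[10] += 3.0
cy5_norm, slope, intercept = iterative_linear_normalization(cy3_log, cy5_log)
```

The outlier is excluded from the fit but not from the output: it is still normalized with the final line, so a genuinely changed gene stays visible after normalization.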
Normalization (Linear)
[Figure: Cy5 signal (log2) plotted against Cy3 signal (log2)]
Normalization (Curvilinear)
[Figure: ratio {log2 (Cy5 / Cy3)} plotted against average signal {log2 (Cy3 + Cy5)/2}, with a Loess function fit line; ratio axis marked at 0]
G Tseng et al., NAR 2001
LOESS function
To use LOESS, the user must specify the degree, d, of
the local polynomial to be fit to the data, and the
fraction of the data, q, to be used in each fit. In this
case, the simplest possible initial function
specification is d=1 and q=1. While it is relatively
easy to understand how the degree of the local
polynomial affects the simplicity of the initial
model, it is not as easy to determine how the
smoothing parameter affects the function.
LOESS function
The weight function gives the most weight to the data points
nearest the point of estimation and the least weight to the
data points that are furthest away. The use of the weights is
based on the idea that points near each other in the
explanatory variable space are more likely to be related to
each other in a simple way than points that are further apart.
Following this logic, points that are likely to follow the
local model best influence the local model parameter
estimates the most. Points that are less likely to actually
conform to the local model have less influence on the local
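model parameter estimates. The traditional weight function
used for LOESS is the tri-cube weight function,
$$ w(x) = \begin{cases} (1 - |x|^3)^3 & \text{for } |x| < 1 \\ 0 & \text{for } |x| \ge 1 \end{cases} $$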
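A minimal sketch of LOESS with degree d = 1 and the tri-cube weight may help make the two parameters concrete. It is an illustration only (no robustness iterations, plain Python), and the names `tricube`, `loess_point` and the sample data are hypothetical. The example applies it to intensity-dependent normalization, subtracting the fitted curve from the log ratio.

```python
import math


def tricube(u):
    """Tri-cube weight: (1 - |u|^3)^3 for |u| < 1, otherwise 0."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_point(x0, xs, ys, q=0.5):
    """Local linear (d = 1) LOESS estimate at x0 using a fraction q of the data."""
    n = len(xs)
    k = max(2, math.ceil(q * n))            # number of points in each local fit
    h = sorted(abs(x - x0) for x in xs)[k - 1]  # distance to the k-th nearest point
    if h == 0.0:
        same = [y for x, y in zip(xs, ys) if x == x0]
        return sum(same) / len(same)
    w = [tricube((x - x0) / h) for x in xs]
    sw = sum(w)
    mean_x = sum(wi * x for wi, x in zip(w, xs)) / sw
    mean_y = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, xs))
    if sxx == 0.0:
        return mean_y                        # only one distinct x carries weight
    sxy = sum(wi * (x - mean_x) * (y - mean_y) for wi, x, y in zip(w, xs, ys))
    return mean_y + (sxy / sxx) * (x0 - mean_x)


# Hypothetical data: a = average log2 signal, m = log2(Cy5 / Cy3)
a_vals = [6.0 + 0.25 * i for i in range(40)]
m_vals = [0.8 - 0.05 * a for a in a_vals]   # made-up intensity-dependent dye bias
m_norm = [m - loess_point(a, a_vals, m_vals, q=0.5)
          for a, m in zip(a_vals, m_vals)]
```

A larger q averages over more points and gives a smoother, stiffer curve; a smaller q follows local structure more closely at the cost of more noise.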
Image Analysis
 2 images per array
 Super-imposing
 Grid on image
Clone Id    Ratio
1           1.5
2           0.8
…           …
Gene Ratios
 Gene expression levels determined by intrinsic
properties of each gene
[Figure: Gene A and Gene B marked on an expression level axis running from low to high]
Statistical Analysis
 Differences in ratios due to
   random variation
   meaningful changes
 Hypothesis testing, with H0: no systematic differences between ratios
Most Basic Statistical Analysis
 Assumptions
   ‘red’ and ‘green’ intensities at a given gene ~ i.i.d. normal (independent, identically normally distributed) with common variance
   constant coefficient of variation over the whole gene set
Statistical Analysis
According to Chen et al. 1997 (J Biomedical Optics, 2(4):364), with T_k = R_k / G_k,
$$ f_{T_k}(t) = \frac{(1+t)\sqrt{1+t^2}}{c\,(1+t^2)^2\sqrt{2\pi}} \exp\left( -\frac{(t-1)^2}{2c^2(1+t^2)} \right) $$
with c: coefficient of variation, estimated from data
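The ratio density can be evaluated directly. The sketch below is illustrative only: the function names and the value c = 0.2 are hypothetical, and the density is treated as zero for t <= 0 on the assumption that intensity ratios are positive. A numerical integral is included as a sanity check that the density is properly scaled for small c.

```python
import math


def ratio_density(t, c):
    """Density of T = R / G under H0 (form quoted from Chen et al. 1997).

    c is the coefficient of variation; t <= 0 is assigned density 0
    (an assumption made here, since intensity ratios are positive).
    """
    if t <= 0.0:
        return 0.0
    one_plus_t2 = 1.0 + t * t
    prefactor = ((1.0 + t) * math.sqrt(one_plus_t2)
                 / (c * one_plus_t2 ** 2 * math.sqrt(2.0 * math.pi)))
    return prefactor * math.exp(-((t - 1.0) ** 2) / (2.0 * c * c * one_plus_t2))


def integrate(f, a, b, n=20000):
    """Trapezoidal rule with n equal steps."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h


c = 0.2  # hypothetical coefficient of variation
area = integrate(lambda t: ratio_density(t, c), 0.0, 20.0)
peak = ratio_density(1.0, c)
```

Note that the density is not symmetric around t = 1 on the linear scale: a two-fold increase (t = 2) and a two-fold decrease (t = 0.5) sit at different distances from 1.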
Statistical Analysis
 Classification with hypothesis testing
[Figure: ratio density under H0 with two tails of area α/2 each, labeled under-expressed and over-expressed]
3 classes of genes
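One way to obtain the two cut-offs is sketched below. It is offered as an illustration, not as the procedure of Chen et al.: it relies on the substitution u = (t - 1) / sqrt(1 + t^2), under which the ratio density of Chen et al. becomes a normal density in u with standard deviation c, and it ignores the truncation of u to (-1, 1), which is an assumption that only holds for small c. The names `ratio_thresholds` and `classify`, the label "unchanged", and the values c = 0.2 and alpha = 0.01 are hypothetical.

```python
import math
from statistics import NormalDist


def ratio_thresholds(c, alpha=0.01):
    """Lower and upper ratio cut-offs leaving about alpha / 2 in each tail."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    u = z * c                      # cut-off on the u scale
    if u >= 1.0:
        raise ValueError("c is too large for this alpha")
    # Invert u = (t - 1) / sqrt(1 + t^2) for t > 1
    upper = (1.0 + u * math.sqrt(2.0 - u * u)) / (1.0 - u * u)
    lower = 1.0 / upper            # the inversion for -u gives the reciprocal
    return lower, upper


def classify(ratio, lower, upper):
    """Assign a gene to one of the three classes."""
    if ratio < lower:
        return "under-expressed"
    if ratio > upper:
        return "over-expressed"
    return "unchanged"             # H0 not rejected


lower, upper = ratio_thresholds(c=0.2, alpha=0.01)
labels = [classify(r, lower, upper) for r in (0.3, 1.0, 3.0)]
```

The cut-offs come out as reciprocals of each other, so an n-fold increase and an n-fold decrease are treated alike even though the density is asymmetric on the linear scale.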
Fold Change Graphs
 How many times did the expression of this gene change in the treated tissue versus the control?
 comparison analysis
 requires experiment vs control
 does not apply to absolute analysis
 parameter value in one vs another
 Avg diff (perfect match vs mismatch)
Fold Change of Average Difference
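As a rough illustration of the two quantities named above, the sketch below computes an average difference over probe pairs and then a plain ratio of average differences. This is a simplification under assumptions: vendor software may compute fold change differently, the function names are hypothetical, and the example uses four made-up probe pairs rather than 20.

```python
def average_difference(pm, mm):
    """Mean of (perfect match - mismatch) over the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need the same, non-zero number of PM and MM values")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


def fold_change(avg_diff_experiment, avg_diff_control):
    """Plain ratio of average differences (simplified, assumed definition)."""
    if avg_diff_experiment <= 0 or avg_diff_control <= 0:
        raise ValueError("fold change is undefined for non-positive average differences")
    return avg_diff_experiment / avg_diff_control


# Hypothetical probe-pair intensities for one gene
pm_control, mm_control = [120, 150, 130, 160], [20, 50, 30, 60]
pm_treated, mm_treated = [320, 350, 330, 360], [20, 50, 30, 60]

control = average_difference(pm_control, mm_control)
treated = average_difference(pm_treated, mm_treated)
change = fold_change(treated, control)
```

The guard against non-positive values matters in practice: when mismatch signal exceeds perfect match signal, the average difference is negative and a ratio has no meaning.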
Noise and Repeats
log – log plot
 >90% 2 to 3 fold
 Multiplicative noise
 Repeat experiments
 Log scale: dist(4,2)=dist(2,1)
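The last point can be checked with a two-line calculation. The helper name `log2_distance` is hypothetical; the sketch only shows that on a log2 scale every two-fold change has the same size, which is what makes the scale suitable for multiplicative noise.

```python
import math


def log2_distance(a, b):
    """Distance between two positive values on the log2 scale."""
    return abs(math.log2(a) - math.log2(b))


linear_gap_high = 4 - 2                 # linear scale: the gaps differ
linear_gap_low = 2 - 1
log_gap_high = log2_distance(4, 2)      # log scale: both are one doubling
log_gap_low = log2_distance(2, 1)
```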