PPT - Biochemistry and Molecular Biology

Crazy
The physics and calibration of
Affymetrix microarrays
Following discussions with Caroline, Christine, Danielle, Eric, Hugh,
Ilhem, Kevin, Lucy, Martin, Martino, Michael, Mike and Paul
Bioinformatix, Mathematix, Physix, Statistix, Transcriptomix
Glass Spottix versus Affymetrix
Operation and tissue extraction
Animal Biology
mRNA preparation
Molecular Biology
Chips are run
Chip calibration
Differentially expressed genes are identified
Computational Biology
Affymetrix scanners transform message into light
The Enigma scanner has low noise:
SVDFGS is definitely SVDFGS
Would you prefer to listen to
Radio Glasgow with poor
reception or Radio Budapest with
clear reception?
Affymetrix microarrays
5’ GGTGGGAATTGGGTCAGAAGGACTGTGGCTAGGCGC 3’
       GGAATTGGGTCAGAAGGACTGTGGC   perfect match probe cells
       GGAATTGGGTCACAAGGACTGTGGC   mismatch probe cells
(probe cells are actually scattered on chip)
Probe cells of an Affymetrix Gene chip
contain millions of identical 25-mers
25-mer
Affymetrix Gene chip-Hybridization
Affymetrix Gene chip-Fluorescence
Affymetrix probe set
Probe cell (aka feature)
Perfect Match (PM)
Mismatch (MM)
Probe pair
Each gene is represented by 16
probe pairs (for chip rgu34a). Each
pair has a perfect match (the 25
base oligonucleotide binds to the
gene of interest) and a mismatch
(the central base is changed).
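The probe-pair rule above can be sketched in a few lines. This is a minimal illustration, assuming that "changed" means replaced by the complementary base, which is consistent with the example sequences earlier in the talk; the function name `make_mismatch` is hypothetical.

```python
# Sketch: derive a mismatch (MM) probe from a perfect match (PM) probe.
# Assumption: the central base is swapped for its complement.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def make_mismatch(pm_probe: str) -> str:
    """Return the PM sequence with its central base complemented."""
    centre = len(pm_probe) // 2          # index 12, i.e. base 13 of a 25-mer
    swapped = COMPLEMENT[pm_probe[centre]]
    return pm_probe[:centre] + swapped + pm_probe[centre + 1:]

pm = "GGAATTGGGTCAGAAGGACTGTGGC"         # PM sequence from the earlier slide
mm = make_mismatch(pm)
print(mm)
```

Applied to the perfect match sequence shown earlier, this reproduces the mismatch sequence on the same slide.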
Outliers?
Chip calibration
Correct Background, Normalise, Correct for
Cross Hybridisation, Expression Measure
High-level analysis,
biological interpretation
Background Fluorescence needs to be corrected
e.g. MAS and RMA algorithms
Camel distributions suggest that there are two
populations (detected and not detected?).
Chips need to be normalised against each other.
Each chip is a different colour
e.g. invariant genes, lowess, quantiles
RMA uses Quantile normalisation at the probe level
Raw intensities (probes PA to PE)
          PA     PB     PC     PD     PE
Chip 1    1      2      4      3      5
Chip 2    7      2      5      3      1
Chip 3    5      3      4      2      9

Order by ranks
          1st    2nd    3rd    4th    5th
Chip 1    1      2      3      4      5
Chip 2    1      2      3      5      7
Chip 3    2      3      4      5      9

Average the intensities at each rank
          1st    2nd    3rd    4th    5th
Chip 1    1.33   2.33   3.33   4.67   7
Chip 2    1.33   2.33   3.33   4.67   7
Chip 3    1.33   2.33   3.33   4.67   7

Reorder by probe
          PA     PB     PC     PD     PE
Chip 1    1.33   2.33   4.67   3.33   7
Chip 2    7      2.33   4.67   3.33   1.33
Chip 3    4.67   2.33   3.33   1.33   7
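The worked example above (probes PA to PE on three chips) can be reproduced with a short NumPy sketch. It assumes no tied intensities within a chip; real implementations usually treat ties more carefully, and the function name `quantile_normalise` is only illustrative.

```python
import numpy as np

def quantile_normalise(x: np.ndarray) -> np.ndarray:
    """Quantile-normalise the columns of x (rows = probes, columns = chips)."""
    order = np.argsort(x, axis=0, kind="stable")      # row indices that sort each chip
    ranks = np.argsort(order, axis=0, kind="stable")  # rank of each probe within its chip
    rank_means = np.sort(x, axis=0).mean(axis=1)      # average intensity at each rank
    return rank_means[ranks]                          # put the averages back in probe order

# Rows are probes PA..PE, columns are Chip 1..3 (values from the slide).
chips = np.array([[1, 7, 5],
                  [2, 2, 3],
                  [4, 5, 4],
                  [3, 3, 2],
                  [5, 1, 9]], dtype=float)

normalised = quantile_normalise(chips)
print(np.round(normalised, 2))
```

After normalisation every chip contains the same set of values (the rank averages), only arranged in that chip's own probe order.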
Cross Hybridisation
MAS 5.0 (Affymetrix) corrects for cross-hybridisation
by subtracting the MisMatch signal from the PerfectMatch.
RMA ignores the mismatches because they also hybridise
to the specific (Perfect Match) signal.
Expression Measure
The intensities of the multiple probes within a probeset are
combined into ONE measure of expression
MAS 5.0 (Signal) takes the Tukey biweight mean of the
log of the difference between PM and (an idealised) MM.
1-9 are different chips.
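For readers unfamiliar with the Tukey biweight, a one-step version can be sketched as follows. This is an illustration of the general estimator, not a copy of the MAS 5.0 code; the tuning constant `c=5` and the small `eps` are the values usually quoted for MAS 5.0 and should be treated as assumptions here.

```python
import numpy as np

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    x = np.asarray(values, dtype=float)
    centre = np.median(x)
    mad = np.median(np.abs(x - centre))      # median absolute deviation
    u = (x - centre) / (c * mad + eps)       # scaled distance from the median
    # Points further than c MADs from the median receive zero weight.
    w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
    return float(np.sum(w * x) / np.sum(w))

# Hypothetical log2 probe values for one probe set, with one outlying probe.
log_values = [8.0, 8.1, 7.9, 8.2, 12.0]
print(tukey_biweight(log_values), np.mean(log_values))
```

The outlying probe pulls the ordinary mean upwards but receives zero weight in the biweight, which is why a robust summary suits probe sets containing occasional outliers.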
dChip and RMA ‘model’ the systematic hybridisation patterns
when calibrating an expression measure.
Once chips have gone through the calibration process, changes in
gene expression between conditions or over time can be observed.
m=log2(Fold Change), a=log2(Average Intensity)
The change in expression between two conditions for all
the genes on an array can be viewed on an MA plot
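A minimal sketch of computing the MA coordinates for two calibrated chips follows. It assumes that `a` is taken as the average of the two log2 intensities (the usual convention, equal to the log2 of the geometric mean); the input values are made up for illustration.

```python
import numpy as np

def ma_values(condition_1, condition_2):
    """Return (m, a) for paired expression measures from two conditions."""
    x = np.asarray(condition_1, dtype=float)
    y = np.asarray(condition_2, dtype=float)
    m = np.log2(y) - np.log2(x)            # log2 fold change
    a = 0.5 * (np.log2(x) + np.log2(y))    # average log2 intensity
    return m, a

# Hypothetical expression measures for three genes.
m, a = ma_values([100.0, 200.0, 400.0], [200.0, 200.0, 100.0])
print(m)   # a doubling gives m = 1, no change gives m = 0
print(a)
```

Plotting `m` against `a` (for example as a scatter plot) gives the MA plot; unchanged genes scatter around m = 0.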
Sliding Z
Quackenbush (2002)
Z = (m - mean(m)) / standard deviation(m)
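The sliding Z score can be sketched as below: the mean and standard deviation of m are computed in a window of genes with similar intensity a, rather than over the whole chip. The window size and the rank-based window are assumptions made for illustration, not details taken from Quackenbush (2002).

```python
import numpy as np

def sliding_z(m, a, window=101):
    """Z score of m relative to genes with neighbouring values of a."""
    m = np.asarray(m, dtype=float)
    a = np.asarray(a, dtype=float)
    order = np.argsort(a)                  # genes sorted by intensity
    half = window // 2
    z = np.empty_like(m)
    for pos, idx in enumerate(order):
        lo = max(0, pos - half)
        hi = min(len(m), pos + half + 1)
        local = m[order[lo:hi]]            # genes with similar intensity
        z[idx] = (m[idx] - local.mean()) / local.std(ddof=1)
    return z

# Simulated data (an assumption): noisier fold changes at low intensity.
rng = np.random.default_rng(0)
a_sim = rng.uniform(4.0, 14.0, 1000)
m_sim = rng.normal(0.0, np.where(a_sim < 9.0, 2.0, 0.5))
z_sim = sliding_z(m_sim, a_sim)
print(z_sim[a_sim < 9.0].std(), z_sim[a_sim >= 9.0].std())
```

With a global standard deviation the noisy low-intensity genes would dominate the list of apparent changes; the sliding version puts both regimes on a comparable scale.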
At low intensities, the sd is too low.
[Figure labels: signal, bg]
Barenco 2003
Spike-in measurements show there remains
considerable signal at low concentrations.
The non-linearity means that Fold Change (Intensity)
is NOT the same as Fold Change (Transcript)
This causes complications when comparing chips
against mathematical models of changes in gene
expression
It is difficult to establish when a gene is NOT expressed
The statistical space is also non-linear
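A toy calculation illustrates why fold change in intensity differs from fold change in transcript. It assumes the simplest possible non-linearity, a constant residual signal added to a linear response; the `gain` and `background` values are hypothetical, not fitted to spike-in data.

```python
def observed_intensity(concentration, gain=100.0, background=50.0):
    """Toy model: linear response plus a constant residual signal."""
    return background + gain * concentration

# A true 4-fold change in transcript at moderate concentration...
fc_mid = observed_intensity(4.0) / observed_intensity(1.0)

# ...and the same 4-fold change at low concentration.
fc_low = observed_intensity(0.04) / observed_intensity(0.01)

print(fc_mid)   # 450 / 150 = 3.0
print(fc_low)   # 54 / 51, close to 1
```

Under this assumption the same 4-fold change in transcript appears as a 3-fold change at moderate concentration and almost disappears near the residual signal.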
Cross Hybridisation
How can you measure cross-hybridisation without
using the MisMatch signal?
There is a need for a model of the physics of hybridisation
(Naef and Magnasco 2003)
GC content is important
AT base pairs have two hydrogen bonds.
GC base pairs have three hydrogen bonds.
Van der Waals interactions
between adjacent bases
H-bond interactions
between paired (complementary) bases
Nearest-neighbour interactions predict duplex kinetics and
so sequence order is important (SantaLucia)
The binding energy of GAC is not the same as CAG
The fraction of overlap between transcript and
probe depends upon the position along the probe
(Maibaum and SantaLucia)
Imagine if all your fragments were of length 20.
Imagine dropping the fragments randomly along a line of 25
[Figure: fraction of overlap against position along the probe; axis ticks at 1, 5, 13, 20 and 25]
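The thought experiment above can be worked through numerically. This sketch assumes one particular reading of "dropping the fragments randomly": each 20-base fragment lies entirely within the 25-base line and every allowed start position is equally likely. Other readings (for example, fragments that overhang the ends) would give a different profile.

```python
import numpy as np

def coverage_fraction(probe_len=25, fragment_len=20):
    """Fraction of fragment placements that cover each probe position."""
    n_starts = probe_len - fragment_len + 1     # 6 allowed start positions
    covered = np.zeros(probe_len)
    for start in range(n_starts):
        covered[start:start + fragment_len] += 1
    return covered / n_starts

coverage = coverage_fraction()
for position in (1, 5, 13, 20, 25):
    print(position, round(coverage[position - 1], 2))
```

Under this assumption the centre of the probe is always covered while the two ends are covered only one time in six, so the fraction of overlap depends on position along the probe.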
There will also be Duplex breathing and a torque
between the duplex and the unbound fragment
Biotin labelling interferes with the hybridisation
C & T (pyrimidines) are labelled. So GC*
binds less strongly than CG, and AT*
binding is weaker than TA.
If the probe contains no C & T, it will
hybridise well but with no fluorescence. If
you have all C & T, it will have difficulty
hybridising.
C and T within your mRNA fragment but
immediately outside your probe will
fluoresce and not interfere with hybridisation
Naef and Magnasco (2003): a key paper
Size is important
e.g. perfect match #13 = A, so mismatch #13 is T, and the complementary base in mRNA is also T/U
Pyrimidines (C & T) are small: there will be no steric hindrance between the pyrimidine in the mismatch and the pyrimidine in the mRNA of interest.
Purines (G & A) are large: there will be a large steric hindrance between the purine in the mismatch and the purine in the mRNA of interest.
Naef and Magnasco (2003)
From Mei et al. (2003, PNAS)
Hybridisation with respect to A:
C is red
G is green
T is yellow
Affymetrix design their arrays using increasingly sophisticated
models of the physical chemistry of hybridisation
Zhang, Miles and Aldape (2003)
Their model is named Position Dependent Nearest Neighbour (PDNN)
PDNN has 24 weight factors for Gene Specific Binding, 24 factors for Non-Specific Binding and 16 stacking energy parameters
They fit their model with a dataset of ~5,000,000 probe measurements (~40 chips)
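The parameter counts above suggest the shape of the calculation: a 25-mer has 24 adjacent base pairs (one weight each) and there are 16 possible dinucleotides (one stacking energy each). The sketch below assumes the binding energy is a position-weighted sum of stacking energies. The weights and energies are placeholders, not the fitted PDNN values, and `pdnn_energy` is a hypothetical name.

```python
import itertools
import numpy as np

BASES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in itertools.product(BASES, repeat=2)]  # 16 pairs

# Placeholder parameters (NOT the published values).
rng = np.random.default_rng(0)
stacking = dict(zip(DINUCLEOTIDES, rng.normal(size=16)))    # 16 stacking energies
weights = np.exp(-((np.arange(24) - 11.5) / 6.0) ** 2)      # 24 position weights

def pdnn_energy(probe, weights, stacking):
    """Position-weighted sum of nearest-neighbour stacking energies."""
    pairs = [probe[k:k + 2] for k in range(len(probe) - 1)]  # 24 pairs for a 25-mer
    return float(sum(w * stacking[p] for w, p in zip(weights, pairs)))

probe = "GGAATTGGGTCAGAAGGACTGTGGC"
print(pdnn_energy(probe, weights, stacking))
```

A second set of 24 weights would be used in the same way for non-specific binding; fitting both sets and the stacking energies to probe measurements is the step reported on the slide.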
Naef and Magnasco (2003)
The model contains only position specific affinities for each base (fitted using ~80 chips)
A low order function can be fitted to the hybridisation for a
given base at a given position. The total hybridisation for the
25 base sequence is then the sum of the local hybridisations.
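The sum of local hybridisations described above can be sketched as follows. Each base gets a low-order polynomial in position, and the probe affinity is the sum over its 25 positions. The quadratic coefficients are invented for illustration (chosen so that central As lower and central Cs raise the affinity); they are not the fitted values from Naef and Magnasco (2003).

```python
# Hypothetical coefficients (c0, c1, c2) of c0 + c1*x + c2*x**2 for each base,
# where x runs from -1 at one end of the probe to +1 at the other.
COEFFS = {
    "A": (-0.3, 0.0, 0.3),
    "C": (0.3, 0.0, -0.3),
    "G": (0.1, 0.0, -0.1),
    "T": (-0.1, 0.0, 0.1),
}

def base_affinity(base, index, probe_len=25):
    """Affinity contribution of one base at a 0-based position along the probe."""
    x = (index - (probe_len - 1) / 2) / ((probe_len - 1) / 2)
    c0, c1, c2 = COEFFS[base]
    return c0 + c1 * x + c2 * x ** 2

def probe_affinity(probe):
    """Total affinity: the sum of the position-specific contributions."""
    return sum(base_affinity(b, k, len(probe)) for k, b in enumerate(probe))

a_centre = "G" * 10 + "A" * 5 + "G" * 10
c_centre = "G" * 10 + "C" * 5 + "G" * 10
print(probe_affinity(a_centre), probe_affinity(c_centre))
```

With these placeholder curves a run of As in the centre gives a lower total than a run of Cs; in the real model the curves are fitted to chip data.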
If your probe contains lots of As in the centre:
There will be lots of AT bonds, which means weak 2-hydrogen bonds
The complementary sequence will contain lots of Ts (biotin interference)
[Figure axis: position along probe]
If your probe contains lots of Cs in the centre:
There will be lots of GC bonds, which means strong 3-hydrogen bonds
The complementary sequence will contain lots of Gs (no biotin interference)
[Figure axis: position along probe]
Wu and Irizarry report spike in yeast controls on a human chip.
This measures non-specific hybridisation directly
Many unchanging genes do not express!
Theory is comparable to experiment
Not as clean as Naef
Wu and Irizarry (2004) have written GCRMA (which is
available now in Bioconductor)
As theory is comparable to experiment, it
can be used to estimate the intrinsic
stochastic uncertainty of the hybridisation
process
Lots of close sequences will hybridise to
a given probe. Wu and Irizarry model the
variation in hybridisation of these similar
sequences using a statistical model.
GCRMA determines the contribution to the
PM from Signal and from Non-Specific
Hybridisation
GCRMA suggests that many probes on the chip
do not detect signal.
GCRMA produces a good linear relationship
between intensity and concentration
Standard deviation of fold change as a function of intensity (GCRMA, RMA, MAS)
GCRMA is more noisy than RMA because each PM has a noisy cross-hybridisation subtraction
GCRMA makes the global properties of chips much
more comparable. In particular, it is much better than
RMA at removing genes with little emission over and
above the non-specific hybridisation.
GCRMA produces a linear relationship between light
and transcript to much lower concentrations.
The subtraction of cross-hybridisation adds to the
noise. However, this noise is much lower than MAS at
low-middle concentrations
Can the algorithms be improved further?
Spike-in measurements show that at large
intensities there is a non-linear relationship between
transcript concentration and fluorescent signal
Hekstra et al. (2003) show that Affymetrix chips follow
Langmuir adsorption isotherms i.e. they chemically saturate at
large concentrations in a well understood manner.
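The saturation behaviour of a Langmuir isotherm can be illustrated with a short sketch. The functional form is the standard isotherm; the `saturation` and `k_half` values are hypothetical and are not taken from Hekstra et al. (2003).

```python
def langmuir_intensity(concentration, saturation=10000.0, k_half=100.0):
    """Langmuir isotherm: signal rises linearly at first, then saturates."""
    return saturation * concentration / (concentration + k_half)

# The same 4-fold change in transcript, at low and at high concentration.
fc_low = langmuir_intensity(4.0) / langmuir_intensity(1.0)
fc_high = langmuir_intensity(4000.0) / langmuir_intensity(1000.0)

print(langmuir_intensity(100.0))   # half of saturation when concentration = k_half
print(fc_low, fc_high)
```

Far below `k_half` the response is close to linear, so the 4-fold change is nearly preserved; near saturation the same change is compressed to a few percent.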
The affinities show a slight kink,
suggesting they can be improved
by including saturation effects
The corrections are for non-specific hybridisation,
yet some probes will be prone to specific cross
hybridisation from other genes - see talk by Eric
Outliers will need to be found and removed
A more detailed physical model may reduce variance
Comparing the probes
in a biological replicate
Even after using GCRMA the
variation does not look
random at low intensities.
It looks like there is still a
systematic bias, or there
remains a background
contribution to the PM signal
There appear to be two populations of probes:
“D-detected” and “U-undetected” (ignore U?)
At present, expression
measures (GCRMA, RMA,
MAS) combine all the
probes within a probeset
Should all the probes below
the peak in variance be
ignored?
Can we do better on the image processing?
Affymetrix data
Solar system formation
Gung-Ho Conclusions
The calibration of Affymetrix chips is a very active
and quickly evolving research area. All the
references in my talk are from 2003 or later!
GCRMA seems to have all the properties you would
expect from a correct calibration protocol. It is
available NOW in Bioconductor for FREE and will
help biologists and analysts.
Affymetrix calibration requires bioinformatix, physix and
statistix to work (and live) in harmony.
Transcriptomica?
Our unification will allow
us to face the common
enemy together ….
Computer
Scientists
Dadda (yesterday)
Quantile normalisation assumes the chips have the same
underlying distribution of intensities. For some experiments, this is
not the case (and what if you wish to compare 1000 chips?)