
On the causes of
correlations seen in
Affymetrix GeneChip
data
Dr Andrew Harrison
University of Essex
[email protected]
Microarray informatics at Essex University
Departments of Mathematical Sciences and Biological Sciences
Faculty                     Degree in
Dr Andrew Harrison          Physics
Professor Graham Upton      Statistics
Dr Berthold Lausen          Statistics

Postdocs                    Degree in
Dr Olivia Sanchez           Computer Science & Bioinformatics
Dr Maria Stalteri           Inorganic Chemistry & Bioinformatics

PhD students                Degree in
Joanna Rowsell              Mathematics
Jose Arteaga-Salas          Statistics
Farhat Memon                Computer Science
Fajriyah Rohmatul           Statistics
We are developing
informatics tools to aid
the analysis of
Affymetrix GeneChips.
There are many
thousands of scientific
publications that have
resulted from
GeneChip technology.
Many laboratories have an almost identical
set-up for running GeneChips.
Probe cells of an Affymetrix
GeneChip contain millions of
25mer oligonucleotide probes,
which are grown through
photolithography.
Density of initiation sites for
photolithographic probe synthesis
is ~5×10^13 molecules/cm^2.
The photolithographic steps
have a yield of ~0.92-0.94.
There will be 0.92^25 (~10%) to
0.94^25 (~20%) full length probes.
This gives a full length probe
density of 5-10 × 10^12 cm^-2.
Thus there will be ~3 nm between adjacent full length probes (c.f. diameter
of DNA is ~2 nm).
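The arithmetic on this slide can be checked with a short sketch. It assumes 25 independent synthesis steps (one per base of a 25mer) and probes laid out on a square lattice; the function names are illustrative and not part of any Affymetrix software.

```python
import math

def full_length_fraction(step_yield: float, n_steps: int = 25) -> float:
    """Fraction of probes reaching full length if each step succeeds independently."""
    return step_yield ** n_steps

def mean_spacing_nm(density_per_cm2: float) -> float:
    """Typical spacing between probes on a square lattice, in nanometres."""
    spacing_cm = 1.0 / math.sqrt(density_per_cm2)
    return spacing_cm * 1e7  # 1 cm = 1e7 nm

if __name__ == "__main__":
    site_density = 5e13  # initiation sites per cm^2, from the slide
    for step_yield in (0.92, 0.94):
        fraction = full_length_fraction(step_yield)
        density = site_density * fraction
        print(f"yield {step_yield}: fraction {fraction:.2f}, "
              f"density {density:.1e} cm^-2, spacing {mean_spacing_nm(density):.1f} nm")
```

Under these assumptions the full length fraction comes out at roughly 0.12 to 0.21, which is in line with the rounded 10% to 20% quoted above, and the spacing is a few nanometres.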
Full length probes (with linker) are ~20 nm.
Detect
fluorescence
Remove partial hybrids
by washing in a
solution with a reduced
salt content
(phosphate backbones
of nucleic acids have
negative charge).
Labelling with a
fluorescent marker
(on the Us).
Fragmentation of
RNA to mean length
of ~100 bases.
Hybridization
Affymetrix software derives the intensity
for each probe from the 75% quantile of
the pixel values in each box.
Affymetrix microarrays
5’
3’
GTGGGAATTGGGTCAGAAGGACTGTGGCTAGG
GGAATTGGGTCAGAAGGACTGTGGC
GGAATTGGGTCACAAGGACTGTGGC
perfect match probe cells
mismatch probe cells
Probe-pairs scattered on chip
Affymetrix probe set
Probe cell (aka feature)
Perfect Match (PM)
Mismatch (MM)
Probe pair
The probes are not physically adjacent on the chip
The biggest uncertainty in GeneChip analysis is how to merge all
the probe information for one gene (Harrison, Johnston and Orengo, 2007, BMC Bioinformatics, 8: 195).
dChip, RMA and GCRMA ‘model’ the systematic hybridisation
patterns when calibrating an expression measure.
Once chips have gone through the DAT → CEL → Expression Measure
process, changes in gene expression between conditions or over time
can be observed.
m=log2(Fold Change), a=log2(Average Intensity)
The change in
expression between two
conditions for all the
genes on an array can
be viewed on a MA plot
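A minimal sketch of how the two MA-plot coordinates could be computed for one gene. It assumes the common convention in which a is the mean of the two log2 intensities (that is, log2 of the geometric mean); the function name `ma_values` is illustrative.

```python
import math

def ma_values(intensity_1: float, intensity_2: float) -> tuple[float, float]:
    """Return (m, a) for one gene measured under two conditions."""
    m = math.log2(intensity_1 / intensity_2)                      # log2 fold change
    a = 0.5 * (math.log2(intensity_1) + math.log2(intensity_2))   # mean log2 intensity
    return m, a

if __name__ == "__main__":
    # A gene measured at 8 units in one condition and 2 units in the other.
    print(ma_values(8.0, 2.0))
```

Plotting m against a for every gene on the array gives the MA plot: unchanged genes scatter around m = 0 across the whole intensity range.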
Some genes are represented by multiple probe-sets.
Probe-set A
Probe-set B
If they are measuring the same thing the signals should
be up and down regulated together!
Is that always true?
No
Stalteri and Harrison, 2007,
BMC Bioinformatics, 8:13
Probes map to different exons. Because of alternative
splicing, some of the exons may be upregulated
whereas others may be downregulated.
Genes come in pieces.
But exons do not. Multiple probes mapping to the
same exon should measure the same thing.
CONCLUSIONS I
Genes come in pieces.
Each exon needs to be considered, and classified,
separately.
Check that your assumptions don’t contradict
known biology.
The Essex approach
The data from many tens of
thousands of GeneChips are
freely available in the public
domain, in repositories such as
GEO. We are mining this data.
We are able to discover signals in GeneChip surveys which
will be invisible to analysts dealing with single experiments.
We are developing tools to enable analysts of single experiments to
utilise the signals we have discovered.
Our research is funded by the BBSRC (UK)
[Pipeline diagram: probe information, microarray data, and Ensembl 48 exon, gene and transcript information (obtained using the BioMart query tool) are combined; megaBLAST provides the sequence alignment of probes to genetic products. Sequence files and sequence mappings are held in a text files repository and a local MySQL database on a Linux OS, and output is produced with Perl programs, SQL queries and Linux scripts.]
We are studying the correlations in expression across >6,000 GeneChips
(HG-U133A), sampling RNA from many tissues and phenotypes.
The correlations in intensities
(log2) between probes in probeset
208772_at on the HG-U133A array.
The number in each square is the
correlation ×10
Blue = low correlation
Yellow = high correlation
Average intensity in GEO
Probe order along the gene
The correlation calculated for PM probes 9 and 11, the data in the earlier scatter plot, is
reported as 8 (0.76 multiplied by 10 and rounded).
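The heatmap entries described above can be sketched as follows. This assumes a Pearson correlation of log2 intensities across arrays; the slides do not state which correlation measure was used, and the function names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def heatmap_score(r: float) -> int:
    """Correlation multiplied by 10 and rounded, as shown in each square."""
    return round(r * 10)

def probe_heatmap(intensities):
    """intensities[p] holds the raw intensities of probe p across all arrays."""
    logs = [[math.log2(v) for v in probe] for probe in intensities]
    return [[heatmap_score(pearson(a, b)) for b in logs] for a in logs]

if __name__ == "__main__":
    print(heatmap_score(0.76))
    print(probe_heatmap([[1, 2, 4, 8], [2, 4, 8, 16], [8, 4, 2, 1]]))
```

Scaling to a single digit keeps a whole probeset readable in one small grid, at the cost of hiding differences smaller than about 0.1.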
This probeset shows no
coherent correlations
amongst its probes.
Some probesets
clearly have
outliers.
Probes 1-11 all map to the
same exon.
This is a different probeset mapping to the same
exon – there seems to be
one outlier.
The outliers are
correlated with
each other!
The outliers correlate well
with thousands of probes,
taken from many different
probesets.
Correlation: Red 1;
Yellow 0.75;
Green 0.5;
Blue 0
There is little sequence similarity between the probes, they are from
probe-sets picking up different biology, yet they are correlated!
TCCTGGACTGAGAAAGGGGGTTCCT
GAGACACACTGTACGTGGGGACCAC
GGTAGACTGGGGGTCATTTGCTTCC
Virtually all of the probes in the group have runs of Guanines
within their 25 bases.
Comparing probes with runs of Gs.
Number of contiguous Gs    Mean correlation
3                          0.14
4                          0.42
5                          0.49
6                          0.62
7                          0.75
We are only looking at a small fraction of the entire probe,
yet it is dominating the effects across all experiments.
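To group probes as in the table above, one needs the length of the longest run of guanines in each 25mer. A minimal sketch, using the three example probe sequences from the earlier slide; the helper name `longest_g_run` is illustrative.

```python
import re

def longest_g_run(sequence: str) -> int:
    """Length of the longest contiguous run of G in a probe sequence."""
    runs = re.findall(r"G+", sequence.upper())
    return max((len(run) for run in runs), default=0)

if __name__ == "__main__":
    for probe in ("TCCTGGACTGAGAAAGGGGGTTCCT",
                  "GAGACACACTGTACGTGGGGACCAC",
                  "GGTAGACTGGGGGTCATTTGCTTCC"):
        print(probe, longest_g_run(probe))
```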
Hybridization (forward rate $k_f$): Probe + Target → Duplex
Dissociation (reverse rate $k_r$): Duplex → Probe + Target
$K = k_f / k_r$
$\Delta G = -RT \ln K$
R is the Gas Constant,
and T is temperature.
All spontaneous physical and chemical changes take place
in the direction of a decrease in free energy, $\Delta G < 0$
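A small numerical sketch of the relation between the rate constants, the equilibrium constant and the free energy change. The rate values in the example are made up purely for illustration, and the function names are hypothetical.

```python
import math

R = 8.314  # gas constant in J / (mol K)

def equilibrium_constant(k_forward: float, k_reverse: float) -> float:
    """K as the ratio of the hybridization and dissociation rate constants."""
    return k_forward / k_reverse

def free_energy_change(K: float, temperature_kelvin: float) -> float:
    """Free energy change in J/mol from the equilibrium constant K."""
    return -R * temperature_kelvin * math.log(K)

if __name__ == "__main__":
    K = equilibrium_constant(k_forward=1e6, k_reverse=1e-2)  # illustrative values only
    print(f"K = {K:.1e}, free energy change = {free_energy_change(K, 318.15) / 1000:.1f} kJ/mol")
```

Whenever hybridization is faster than dissociation, K exceeds 1 and the free energy change is negative, so duplex formation is favoured.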
Phosphates on
chains of nucleic
acids have a
negative charge.
There is a coulomb block of hybridization on
microarrays (Vainrub and Pettitt 2002). The
environment caused by probe-probe interactions
acts to modify the hybridization of RNA.
Hagan and Chakraborty
2004, Journal of
Chemical Physics
The strength of binding
depends upon probe density
$K = k_f / k_r$
$\Delta G = -RT \ln K$
A tetrad of Guanines can bind to each other through
Hoogsteen Hydrogen bonds with the help of a central cation.
G-quadruplexes are prevalent in telomeres (single stranded
DNA at the end of chromosomes). G-quadruplexes are
thermally stable.
G-quadruplexes take a
range of topologies.
Adjacent probes within a cell on a GeneChip have the same sequence – a
run of Guanines will result in closely packed DNA with just the right
properties to form quadruplexes.
Upton et al. 2008 BMC Genomics, 9, 613
Parallel G-quadruplexes have a
left-handed helical twist.
We suggest 4 probes can efficiently form a
“Maypole”. Outside the corset of the “G-spot”,
the probes have little affinity for bases
of the same sequence and the phosphate
backbones will repel each other. Inside the
G-spot the bases are on the inside and
cannot bind target.
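[Diagram: probes bound to one another through their runs of Gs, shown alongside the hybridization equilibrium $K = k_f / k_r$, $\Delta G = -RT \ln K$.]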
Probes that are not bound in G-quadruplexes will have a reduced probe
density in the immediate environment of the runs of Guanines. This will
result in very effective nucleation, and binding, with respect to
hybridization to the rest of the probe.
The binding will efficiently occur in the G-spot. Any RNA molecule with a
run of Cs will hybridize. Thus, there will be enhanced correlations
between all the probes that are able to form G-quadruplexes.
CONCLUSIONS II
Probes containing a contiguous run of 4 or more
guanines (a G-spot) are correlated with all the
other probes which have similar runs of guanines.
These probes are not measuring expression of the
gene for which they were chosen.
Simple heuristic: Ignore the signals from probes
containing G-spots.
Single Nucleotide
Polymorphisms (SNPs)
SNPs: a single base pair is
different between one
individual and the other.

Polymorphism: if at least two
variants have frequencies >
1% in a population.
ENSE00001416163, HG_U133A (5,374 CEL files): SNP in only outlier probes

snp_id   probe_id            probe_position_heatmap  snp_position_probe  allele  sequence                   outlier (O/N)
rs13505  219768_at-2-233     8                       24                  C/A     CTGAATTTAGATCTCCAGACCCTGC  O
rs13505  219768_at-602-267   9                       4                   C/A     CCTGCCTGGCCACAATTCAAATTAA  O
ENSE0000129003, HG_U133A: SNPs in only no-outlier probes

snp_id     probe_id              probe_position_heatmap  snp_position_probe  allele  sequence                   outlier (O/N)
rs11038    221667_s_at-512-441   10                      13                  A/G     GTTTATGATCTGACCTAGGTCCCCC  N
rs6413487  221667_s_at-570-641   9                       7                   C/G     TAAGGACGCTGGGAGCCTGTCAGTT  N
Examination of SNP-Outlier Associations

               SNP(Yes)             SNP(No)               Total
Outlier(Yes)   11.4% (n=1,788)      88.6% (n=13,869)      100%
Outlier(No)    11.6% (n=17,231)     88.4% (n=131,035)     100%

Phi = -0.002
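The phi coefficient for a 2×2 table can be recomputed from the four counts on this slide. The sketch below uses the standard formula for phi; the cell labels a, b, c, d are just row-major names for the counts, and the value is the same whichever variable is taken as rows.

```python
import math

def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi for the 2x2 table [[a, b], [c, d]]."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator

if __name__ == "__main__":
    # Counts from the slide: 1,788 / 13,869 in one row, 17,231 / 131,035 in the other.
    print(round(phi_coefficient(1788, 13869, 17231, 131035), 3))
```

A value this close to zero means that, across the whole array, overlapping a SNP tells us almost nothing about whether a probe is an outlier.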
Cross-validation for HG_U133_Plus_2

Outlier SNP-probes in HG_U133_Plus_2 with
“problematic” sub sequences (PS):
G’s (>=4), CCTCC, CCACC, GGTGG

[Charts: share of probes with PS and without PS, shown separately for outlier probes and for no-outlier probes. The values shown are 40% / 60% and 11% / 89%.]
CONCLUSIONS III
Probes overlapping SNPs sometimes appear different
from other probes from within their probe-set. But there
are other examples in which there is no difference.
However, when there is a difference this may not be
due to biology. It may be due to coincidental overlaps
with other causes of outliers.
Kerkhoven et al. 2008, PLoS ONE 3(4): e1980
Probes containing GCCTCCC will hybridize to the primer spacer sequence that is
attached to all aRNA prior to hybridization.
CONCLUSIONS IV
Probes containing complementary sequences to
primer spacers may not measure gene expression.
Simple heuristic: Ignore the signals from probes
containing CCTCC.
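The two simple heuristics from CONCLUSIONS II and IV can be combined into a single filter. This is a sketch under the assumption that a probeset is held as a mapping from probe ID to 25mer sequence; the names `is_suspect` and `keep_reliable`, and the probe IDs in the example, are hypothetical.

```python
import re

# Runs of 4 or more guanines (a G-spot), or the primer-spacer motif CCTCC.
SUSPECT_PATTERN = re.compile(r"G{4,}|CCTCC")

def is_suspect(sequence: str) -> bool:
    """True if the probe contains a G-spot or CCTCC."""
    return SUSPECT_PATTERN.search(sequence.upper()) is not None

def keep_reliable(probes: dict[str, str]) -> dict[str, str]:
    """Drop probes whose signal should be ignored under the two heuristics."""
    return {pid: seq for pid, seq in probes.items() if not is_suspect(seq)}

if __name__ == "__main__":
    probes = {
        "probe_a": "TCCTGGACTGAGAAAGGGGGTTCCT",  # contains GGGGG
        "probe_b": "CTGAATTTAGATCTCCAGACCCTGC",  # neither motif
    }
    print(sorted(keep_reliable(probes)))
```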
Log(magnitude) of averaged
probe values
Colour coded by size. Note the
perimeter of bright-dark pairs.
Cell (0,0) contains a
probe which does not
measure any biology
Corner correlations
(correlations with values in cell (0,0))
Numbers are correlations times 10 (red greater than 0.8)
Negative correlations appear as blanks
Filled circles indicate probes not listed in CDF file.
Large circles indicate correlations greater than 0.8
Correlations with cell (0,0)
Being in the opposite corner has not reduced the
correlations of the interior row and column
What is in the sheep pens?
Entries are correlation with cell (0,0)
Entries are log(mean(Intensity))
Sheep!
Many thousands of probes are correlated with
each other simply because they are adjacent to
bright probes.
We believe that the focus of the scanner may be
responsible – regions adjacent to bright spots
will gain the same fraction of light.
A comparison of many images at different levels of
blurriness will appear to indicate that dark regions
adjacent to bright regions are correlated in their
intensities.
Sharply focussed arrays will have big values next to small
values with big differences between them. However, out of
focus arrays will have some of the big values falling into their
small neighbours so that the differences will be smaller.
$$T = \sum_{i=2}^{1161} \sum_{j=2}^{1161} \left[ v_{ij} - \frac{1}{4}\left( v_{i-1,j} + v_{i+1,j} + v_{i,j-1} + v_{i,j+1} \right) \right]^2$$
We work with log intensities. We also contrast T for each array
with a “master” array containing the mean intensities in GEO.
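A minimal sketch of a sharpness statistic of this kind: for each interior cell, take the squared difference between its value and the mean of its four neighbours, and sum over the array. It assumes the grid already holds log intensities and sums over all interior cells rather than a fixed index range; the name `blur_statistic` is illustrative.

```python
def blur_statistic(grid):
    """Sum of squared differences between each interior cell and its 4-neighbour mean."""
    n_rows, n_cols = len(grid), len(grid[0])
    total = 0.0
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            neighbour_mean = (grid[i - 1][j] + grid[i + 1][j]
                              + grid[i][j - 1] + grid[i][j + 1]) / 4
            total += (grid[i][j] - neighbour_mean) ** 2
    return total

if __name__ == "__main__":
    sharp = [[0, 0, 0], [0, 4, 0], [0, 0, 0]]    # one bright cell, dark neighbours
    blurred = [[0, 1, 0], [1, 2, 1], [0, 1, 0]]  # light has leaked into the neighbours
    print(blur_statistic(sharp), blur_statistic(blurred))
```

A sharply focussed array gives a large value and a blurred one a smaller value, so comparing each array against the value for a master array of mean intensities gives a relative measure of blur.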
A CEL file contains
information about
the ID of the
scanner as well as
the date on which
the image was
scanned.
CONCLUSIONS V
We have found evidence that many GeneChip images
contain blurred data. There is evidence for temporal
changes within each lab, caused by either changes in the
use of protocol, or scanner, or some mixture of the two.
GeneChip users assume that correlations result from
biology. However, a number of other mechanisms
can also cause probes to show correlated behaviour.
Bioinformatix, Genomix, Mathematix, Physix, Statistix, Transcriptomix
are needed in order to
extract reliable
information from
Affymetrix GeneChips
Thank you for
your attention.