Looking for signals in tens of thousands
of GeneChips
Dr Andrew Harrison
Departments of Mathematical
Sciences and Biological
Sciences
University of Essex
There are >10^5 GeneChip
experiments in the public
domain, which cost ~$10^9 to
produce. Extracting further
information from this resource
will be very cost effective.
Microarray informatics at Essex University
Departments of Mathematical Sciences and Biological Sciences
Faculty (degrees in …..)
Dr Andrew Harrison: Physics
Professor Graham Upton: Statistics
Dr Berthold Lausen: Statistics
+ Dr Hugh Shanahan (Royal Holloway): Physics
PhD students
Farhat Memon: Computer Science
Anne Owen: Mathematics
Fajriyah Rohmatul: Statistics
Current MSc and UG students
Aleksandra Iljina: Statistics and Data Analysis
Lina Hamadeh: Statistics and Data Analysis
Madalina Ghita: Mathematics
Alumni
Dr Jose Arteaga-Salas: Statistics
Dr Renata Camargo: Computer Science
Dr Caroline Johnston: Molecular Biology and Bioinformatics
Dr William Langdon: Computer Science and Physics
Dr Joanna Rowsell: Mathematics
Dr Olivia Sanchez-Graillet: Computer Science and Bioinformatics
Dr Maria Stalteri: Inorganic Chemistry and Bioinformatics
+ 4 former MSc students
Perfect Match (PM)
Mismatch (MM)
m=log2(Fold Change), a=log2(Average Intensity)
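As an illustration of the two axes defined above, here is a minimal sketch in Python. It assumes that "average intensity" means the geometric mean of the two intensities being compared (the usual convention for this kind of plot, but not stated on the slide). The function name ma_values and the example numbers are hypothetical.

```python
import numpy as np

def ma_values(x, y):
    """Return (m, a) for two arrays of positive intensities.

    m = log2 fold change of x relative to y
    a = log2 of the average intensity, taken here (an assumption)
        as the geometric mean of x and y
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = np.log2(x) - np.log2(y)
    a = 0.5 * (np.log2(x) + np.log2(y))
    return m, a

# Hypothetical intensities: the first probe doubles, the second is unchanged.
m, a = ma_values([200.0, 800.0], [100.0, 800.0])
```

A doubling gives m = 1 and no change gives m = 0, whatever the overall brightness a.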
The biggest uncertainty in
GeneChip analysis is how
to merge all the probe
information for one gene.
Harrison, Johnston and
Orengo, 2007, BMC
Bioinformatics, 8: 195
There is a huge multiple-testing problem.
What can be learnt from comparing different experiments?
Some genes are represented by multiple probe-sets.
Probe-set A
Probe-set B
If they are measuring the same thing the signals should
be up and down regulated together.
Is that always true?
No
Stalteri and Harrison, 2007,
BMC Bioinformatics, 8:13
Probes map to different exons. Alternative splicing may cause
some exons to be upregulated and others to be downregulated.
Genes come in pieces.
But exons do not. Multiple probes mapping to the
same exon should measure the same thing.
We are studying the correlations in expression across >6,000 GeneChips
(HG-U133A), sampling RNA from many tissues and phenotypes.
The correlations in intensities
(log2) between probes in probeset
208772_at on the HG-U133A array.
The number in each square is the
correlation ×10
Blue = low correlation
Yellow = high correlation
Average intensity in GEO
Probe order along the gene
The correlation calculated for PM probes 9 and 11 (the data in the earlier scatter plot) is
reported as 8 (0.76 multiplied by 10 and rounded).
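The grid of numbers described here can be sketched as follows. This is not the authors' code: it assumes Pearson correlation of log2 intensities, with one row per GeneChip and one column per probe. The function name probe_correlation_grid and the simulated data are hypothetical.

```python
import numpy as np

def probe_correlation_grid(intensities):
    """Correlations between probes, multiplied by 10 and rounded.

    intensities: array of shape (n_chips, n_probes), positive values.
    Returns an integer array of shape (n_probes, n_probes).
    """
    log_int = np.log2(np.asarray(intensities, dtype=float))
    corr = np.corrcoef(log_int, rowvar=False)  # columns are the variables
    return np.rint(corr * 10).astype(int)

# Simulated stand-in data: three probes sharing one signal plus noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 1))
noise = rng.normal(size=(500, 3))
demo = 2.0 ** (8 + shared + 0.5 * noise)
grid = probe_correlation_grid(demo)
```

With this rounding a correlation of 0.76 is reported as 8, and the diagonal is always 10.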
This probeset shows no
coherent correlations
amongst its probes.
Some probesets clearly
have outliers.
Probes 1-11 all map to the
same exon.
This is a different probeset mapping to the same
exon – there seems to be
one outlier.
The outliers are
correlated with
each other!
There is little sequence similarity between the probes, and they come from
probe-sets picking up different biology, yet they are correlated!
TCCTGGACTGAGAAAGGGGGTTCCT
GAGACACACTGTACGTGGGGACCAC
GGTAGACTGGGGGTCATTTGCTTCC
Virtually all of the probes in the group have runs of Guanines
within their 25 bases.
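The runs of guanines in the three probe sequences shown above can be checked with a short sketch. The helper name longest_g_run is hypothetical.

```python
import re

def longest_g_run(seq):
    """Length of the longest run of contiguous Gs in a probe sequence."""
    runs = re.findall(r"G+", seq.upper())
    return max((len(r) for r in runs), default=0)

probes = [
    "TCCTGGACTGAGAAAGGGGGTTCCT",
    "GAGACACACTGTACGTGGGGACCAC",
    "GGTAGACTGGGGGTCATTTGCTTCC",
]
runs = [longest_g_run(p) for p in probes]
# runs == [5, 4, 5]
```

Each of the three probes contains a run of at least four Gs.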
Comparing probes with runs of Gs.
Number of contiguous Gs    Mean correlation
3                          0.14
4                          0.42
5                          0.49
6                          0.62
7                          0.75
We are only looking at a small fraction of the entire probe,
yet it is dominating the effects across all experiments.
G-quadruplexes
[Figure: guanines (G) arranged in a G-quadruplex]
All the probes within a cell have the same sequence, so a run of guanines will result in
closely packed DNA with just the right properties to form G-quadruplexes.
Upton et al. 2008 BMC Genomics, 9, 613
How do we deal with known outliers such as G-quadruplexes?
What is the best way to calculate expression in the presence of outliers?
G-stacks bias which
genes are reported to be
clustered together within
published experiments.
Kerkhoven et al. 2008, PLoS ONE 3(4): e1980
Probes containing GCCTCCC will hybridize to the primer spacer sequence that is
attached to all aRNA prior to hybridization.
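Flagging such probes is a simple substring search, sketched below. The motif is the one named above. The function name and the first example sequence are hypothetical, made up only to show a positive match.

```python
PRIMER_SPACER_MOTIF = "GCCTCCC"  # motif named on the slide

def contains_primer_spacer_motif(seq, motif=PRIMER_SPACER_MOTIF):
    """True if the probe sequence contains the motif."""
    return motif in seq.upper()

# The first sequence is hypothetical; the second is a probe shown earlier.
hypothetical_probe = "ATGCCTCCCATTGACCATGACTTAG"
earlier_probe = "TCCTGGACTGAGAAAGGGGGTTCCT"
flags = [contains_primer_spacer_motif(p)
         for p in (hypothetical_probe, earlier_probe)]
# flags == [True, False]
```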
Log(magnitude) of averaged probe values
Colour coded by size. Note the
perimeter of bright-dark pairs.
Cell (0,0) contains a
probe which does not
measure any biology
Corner correlations
(correlations with values in cell (0,0))
Numbers are correlations times 10 (red greater than 0.8)
Negative correlations appear as blanks
Filled circles indicate probes not listed in CDF file.
Large circles indicate correlations greater than 0.8
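A map of corner correlations like this one could be computed along the following lines. This is a sketch, not the authors' code. It assumes a stack of cell intensities with one layer per GeneChip and uses Pearson correlation across chips. The function name corner_correlations and the simulated data are hypothetical.

```python
import numpy as np

def corner_correlations(images, corner=(0, 0)):
    """Correlation of every cell with one reference cell, across chips.

    images: array of shape (n_chips, n_rows, n_cols).
    Returns an array of shape (n_rows, n_cols).
    """
    imgs = np.asarray(images, dtype=float)
    flat = imgs.reshape(imgs.shape[0], -1)
    centred = flat - flat.mean(axis=0)
    ref = imgs[:, corner[0], corner[1]]
    ref_c = ref - ref.mean()
    cov = centred.T @ ref_c
    denom = np.sqrt((centred ** 2).sum(axis=0) * (ref_c ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = cov / denom  # cells that never vary give NaN
    return corr.reshape(imgs.shape[1:])

# Simulated stand-in data: 50 chips of 4 x 4 cells, with the opposite
# corner made proportional to cell (0,0).
rng = np.random.default_rng(1)
images = rng.lognormal(size=(50, 4, 4))
images[:, 3, 3] = 3.0 * images[:, 0, 0]
corr = corner_correlations(images)
strong = corr > 0.8  # the threshold highlighted on the slide
```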
Correlations with cell (0,0)
Being in the opposite corner has not reduced the
correlations of the interior row and column
What is
in the
sheep
pens?
Entries are correlation with cell (0,0)
Entries are log(mean(Intensity))
Sheep!
Many thousands of probes are correlated with
each other simply because they are adjacent to
bright probes.
We believe that the focus of the scanner may be
responsible – regions adjacent to bright spots
will gain the same fraction of light.
A comparison of many images at different levels of
blurriness will appear to indicate that dark regions
adjacent to bright regions are correlated in their
intensities.
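The argument can be illustrated with a toy simulation. It is not a model of a real scanner. It assumes that each scan leaks a fraction of a bright cell's light into the adjacent dark cells, and that this fraction varies from scan to scan. All numbers and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_scans = 200
bright = 10000.0  # intensity of a bright cell (hypothetical)

# Assumed blur model: the fraction of bright light leaking into a
# neighbouring cell differs from scan to scan.
leak = rng.uniform(0.0, 0.02, size=n_scans)

def dark_signal():
    """Independent true signal of one dark cell across all scans."""
    return rng.normal(100.0, 10.0, size=n_scans)

# Two dark cells, each next to a (different) bright cell.
near_a = dark_signal() + leak * bright
near_b = dark_signal() + leak * bright
# Two dark cells with no bright neighbour.
far_a = dark_signal()
far_b = dark_signal()

r_adjacent = np.corrcoef(near_a, near_b)[0, 1]
r_isolated = np.corrcoef(far_a, far_b)[0, 1]
```

The true signals of all four dark cells are independent. In this toy model the two cells beside bright neighbours nevertheless come out strongly correlated, because the shared leak dominates their variation, while the two isolated cells do not.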
A CEL file contains information
about the ID of the scanner as
well as the date on which the
image was scanned – how does
the impact of blur change over
time for each scanner?
Upton and Harrison, 2010,
Stat Appl Genet Mol Biol, 9(1),
Article 37
How best to transform a DAT image into a CEL file?
We are testing whether ideas from astronomy are applicable.
We are checking whether the temporal patterns in scanner
performance for GeneChips of human and other organisms are related.
Bioinformatix, Genomix, Mathematix, Physix, Statistix, Transcriptomix
are needed in order to
extract reliable
information from
Affymetrix GeneChips
Thank you for
your attention.