Transcript Microarrays
Microarrays
IST 444
Microarrays
• What if no test tubes were needed to
conduct an experiment?
• Hundreds, thousands or even millions of
individual experiments are conducted in
parallel, with very few reagents
• Microarrays combine genomics (study of
all the genes in the genome) with
experiments and eventually, diagnostics
Microarrays:
Universal Biochemistry Platforms
Peptides
Proteins
DNA
Lipids
Carbohydrates
Small molecules
Types of microarrays include:
• DNA microarrays, such as cDNA microarrays
and oligonucleotide microarrays, SNP arrays, ChIP-on-chip
• MMChips, for surveillance of microRNA
populations
• Protein microarrays (protein-protein interactions)
• Tissue microarrays
• Cellular microarrays (also called transfection
microarrays)
• Chemical compound microarrays
• Antibody microarrays (proteomics)
• Carbohydrate arrays (glycoarrays)
Microarrays Require
Bioinformatics
• Microarrays combine genomics, silicon
chip manufacturing, DNA and Protein
chemistry, signal and image processing,
statistics, software skills and miniaturized
versions of traditional molecular biology
experiments.
• Develop new software to analyze the
results of the many possible experiments.
Biological Samples in 2D Arrays
on Membranes or Glass Slides
DNA Microarray Technologies
• Gene expression profiling
- Monitoring expression levels for thousands of genes
simultaneously.
• Other common applications:
• Array CGH (Comparative genomic hybridization)
- Assessing genome content in different cells or closely related
organisms.
• SNP array (single nucleotide polymorphism)
- Identifying single nucleotide polymorphism among alleles within or
between populations.
• ChIP-on-chip (Chromatin immunoprecipitation)
- Determining protein binding site occupancy throughout the
genome.
• Methylation arrays (immunoprecipitate methylated DNA)
- Determining which regions of DNA are methylated, to study
epigenetic regulation
History of DNA Microarrays
• Microarrays descend from Southern and Northern
blotting. Unknown DNA is transferred to a membrane
and then probed with a known DNA sequence with a
label
• In microarrays, the known DNA sequence (or probe) is
on the membrane while the unknown labeled DNA (or
target) is hybridized; unbound target is then washed off so only specific
hybrids remain.
• Dot Blots of different genes in an array were used to
assay gene expression as early as 1987.
• Complete genome of all Saccharomyces cerevisiae
ORFs on a microarray published in 1997 by Lashkari et
al. http://www.pnas.org/content/94/24/13057.
• www.bio.davidson.edu/Courses/genomics/chip/chip.html
RNA Transcription as a Measure
of Gene Expression
Transcription factors bind to
the promoter and recruit RNA
polymerase
DNA strands separate and
transcription is initiated
RNA is synthesized in the
5'-3' direction (the template strand is read 3'-5')
until a termination signal is reached
The completed RNA strand is
released for post-processing
www.csu.edu.au/faculty/health/biomed/subjects/molbol/basic.htm
What Can be Measured using
Microarrays?
1. Amount of mRNA expressed by a gene.
gene expression array, exon array, tiling array
2. Amount of mRNA expressed by an exon.
exon array, tiling array
3. Amount of RNA expressed by a region of DNA.
tiling array
4. Which strand of DNA is expressed.
exon array, tiling array
5. Which of several similar DNA sequences is present in the
genome.
SNP array
6. How many copies of a gene are present in the genome.
gene expression array, exon array, tiling array
7. Where a known protein has bound to the DNA. (ChIP on chip)
promoter array, tiling array
Types of Microarrays
[Figure: probe designs mapped onto a gene. A cDNA probe spans the cDNA sequence (UTR, Exon 1, Exon 2, Exon 3, UTR); oligo probes target individual exons; tiles cover the chromosome sequence, including the promoter; SNP probes differ at a single base (CCGTTCACA vs. CCGTGCACA, AAGGCCGTT vs. AAGGACGTT).]
chromosome sequence
CCGTTCACATTAGGATACCAGTTCAAGGCCGTTCACATTAGGATACCAGTTCAAGGAGGCCGTTCAGTTCACATTA
A cDNA microarray can be made from an unsequenced cDNA library. All the
other types require that the sequence be available.
Spotted vs. in situ synthesized
arrays
• The DNAs can be chemically synthesized
or made by PCR and then mechanically
spotted on the array.
– The amount spotted can vary.
– Method is more flexible and less expensive.
• The DNA can be chemically synthesized
directly on the array (Affymetrix).
– This can be more consistent
– Shorter pieces are used.
Format of an Affymetrix Array
http://cnx.rice.edu/content/m12388/latest/figE.JPG
Print Technology
1. The cDNA or oligo can be printed on the slide using an
"arraying robot" which deposits a drop of liquid
containing the material at each spot. (gene expression
only)
40,000+ spots
2. Oligos (all the same length) can be synthesized on the
slide using:
i) inkjet technology
ii) photolithography
1,000,000+ spots
3. There are other technologies that give similar types of
results (e.g. "beads").
Spotted 2-Channel Array
Spotted arrays
are printed on
coated
microscope
slides.
2 RNA samples
are converted
to cDNA. Each
is labelled with
a different dye.
http://www.anst.uu.se/frgra677/bilder/micro_method_large.jpg
SNP Analysis Using the Illumina,
Inc.
GoldenGate™ Assay
• Allele Specific Extension and Ligation
• PCR Amplification
• Hybridization to the Universal Sentrix®
Array Matrix
Allele Specific Extension and
Ligation
[Figure: genomic DNA carrying a SNP ([T/C] or [T/A]). Allele-specific oligos (A and G), tagged with Universal PCR Sequence 1 and Universal PCR Sequence 2, are extended by polymerase and joined by ligase to a locus-specific oligo that carries the illumiCode address and Universal PCR Sequence 3'.]
Custom Oligo Pool All (OPA)
96-1,536 SNPs multiplexed
Total oligos in reaction – 288-4,608
PCR Amplification
[Figure: the ligated amplification template (illumiCode #561) is amplified by PCR with common primers: Cy3 Universal Primer 1, Cy5 Universal Primer 2 and Universal Primer P3.]
Hybridization to Sentrix® Array
Matrix
[Figure: labeled products hybridize to their illumiCode addresses: illumiCode #561 reports SNP #561 (G/G), illumiCode #217 reports SNP #217 (A/A), illumiCode #1024 reports SNP #1024 (C/T).]
Sentrix® Array Matrix
1.5 mm
400 mm
10 mm
The Illumina BeadStation 500G permits high throughput analysis of
thousands of SNP DNA markers in hundreds of genotypes in less than
one week.
Gene Expression
Microarrays
Two-color fluorescent scan of a yeast microarray
containing 2,479 elements (ORFs). Red and Green probes
interact with a single target. Yellow probes interact with
both targets and empty probes with neither target.
Lashkari D A et al. PNAS 1997;94:13057-13062
©1997 by The National Academy of Sciences of the USA
Tiling Array
• Genome array consisting of overlapping
probes
• Finer Resolution
• Better at finding RNA in the cell
– mRNA
• Alternative splicing
• Not Polyadenylated
– microRNA
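A minimal sketch of how overlapping tiles can be cut from a genomic sequence. The function name, the 9-base probe length and the 3-base step are illustrative assumptions, not values from the slides; real tiling arrays use longer probes.

```python
def tile_probes(sequence, probe_len, step):
    """Return (start, probe) pairs covering `sequence` with tiles.

    Tiles overlap whenever step < probe_len.
    """
    if probe_len <= 0 or step <= 0:
        raise ValueError("probe_len and step must be positive")
    last_start = len(sequence) - probe_len
    return [(i, sequence[i:i + probe_len])
            for i in range(0, last_start + 1, step)]


# First 28 bases of the chromosome sequence shown on the earlier slide.
region = "CCGTTCACATTAGGATACCAGTTCAAGG"
probes = tile_probes(region, probe_len=9, step=3)
for start, probe in probes:
    print(start, probe)
```

A smaller step gives finer resolution at the cost of more probes.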
Tiling Arrays
http://en.wikipedia.org/
Tiling Array
http://en.wikipedia.org/
SNP Array
© Affymetrix Inc.
Single Nucleotide Polymorphisms
(SNPs)
* Most common genetic variation in human genome. Occur
about every one thousand base pairs in genome
* Genome-wide SNP maps now available (millions in
database)
Affymetrix Standard Tiling
C-T-C-C-A-A-A-A-A-A-A-T-T-T-C-A-T-T-C-T
C-T-C-C-A-A-A-A-A-A-C-T-T-T-C-A-T-T-C-T
C-T-C-C-A-A-A-A-A-A-G-T-T-T-C-A-T-T-C-T
C-T-C-C-A-A-A-A-A-A-T-T-T-T-C-A-T-T-C-T
Substitution position
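The four probes above differ only at the substitution position. A small sketch that generates such a probe set from a reference sequence; the function name and the 0-based position are assumptions made for illustration.

```python
BASES = "ACGT"


def substitution_probes(reference, position):
    """Return one probe per base, identical except at `position` (0-based)."""
    return {base: reference[:position] + base + reference[position + 1:]
            for base in BASES}


# Reference taken from the first probe on the slide, dashes removed.
reference = "CTCCAAAAAAATTTCATTCT"
probes = substitution_probes(reference, 10)
for base, probe in probes.items():
    print(base, "-".join(probe))
```

The sample hybridizes best to the probe whose substituted base matches its own sequence, which is how the base at that position is read out.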
ChIP-on-chip array
(Chromatin ImmunoPrecipitation )
[Figures: ChIP-on-chip workflow; an antibody is used to immunoprecipitate the protein-bound chromatin.]
Bioinformatic approaches for
analysis
• Measuring tens of thousands of
data points
simultaneously
• High dimensional data
– 10 Exp x 50K = 500K
• How to find real
differences over the
noise
• Statistical approaches
Tumor
Normal
Bioinformatic approaches for
analysis
• Class Comparison
– Which genes are up or
down in tumors v normal,
untreated v treated
• Class Discovery
– Within the tumor
samples, are there
subgroups that have a
specific expression
profile?
• Class prediction,
pathway analysis etc
Tumor
Normal
Challenges in microarray
analysis
• Different platforms
– Illumina, Affymetrix, Agilent….
• Many file types, many data formats
• Need to learn platform dependent methods and
software required
Public databases
• Many sources for public
data – labs, consortia,
government
• Publications require that
data files including raw
files be made public
• GEO –
http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress –
http://www.ebi.ac.uk/arrayexpress/#ae-main[0]
Streamlined Analysis
Raw data
Each array column gives VALUE and ABS_CALL (P = present, M = marginal, A = absent).
ID_REF | normal | tumor | tumor | normal | normal | tumor
AFFX-BioB-5_at | 210.6 P | 234.6 P | 362.5 P | 389 P | 305.6 P | 330.5 P
AFFX-BioB-M_at | 393 P | 327.8 P | 501.4 P | 816.5 P | 542 P | 440.8 P
AFFX-BioB-3_at | 264.9 P | 164.6 P | 244.7 P | 379.7 P | 261.3 P | 303.7 P
AFFX-BioC-5_at | 738.6 P | 676.1 P | 737.6 P | 1191.2 P | 917 P | 767.9 P
AFFX-BioC-3_at | 356.3 P | 365.9 P | 423.4 P | 711.6 P | 560.3 P | 484.9 P
AFFX-BioDn-5_at | 566.3 P | 442.2 P | 649.7 P | 834.3 P | 599.1 P | 606.9 P
AFFX-BioDn-3_at | 3911.8 P | 3703.7 P | 4680.9 P | 6037.7 P | 4653.7 P | 4232 P
AFFX-CreX-5_at | 6433.3 P | 5980 P | 7734.7 P | 10591 P | 8162.1 P | 8428 P
AFFX-CreX-3_at | 11917.8 P | 9376.7 P | 11509.3 P | 16814.4 P | 13861.8 P | 13653.4 P
AFFX-DapX-5_at | 12.2 A | 44.3 M | 31.2 A | 37.7 P | 33.3 A | 12.8 A
AFFX-DapX-M_at | 57.8 M | 42.5 A | 79 M | 48.8 P | 39.5 A | 39.2 A
AFFX-DapX-3_at | 29.8 A | 6.2 A | 23.4 A | 28.4 A | 3.2 A | 7.6 A
AFFX-LysX-5_at | 15.3 A | 16.2 A | 15.6 A | 16.7 A | 3.1 A | 3.9 A
AFFX-LysX-M_at | 33.2 A | 12 A | 17.7 A | 37.3 A | 49.2 A | 9.1 A
AFFX-LysX-3_at | 40.7 M | 10.7 A | 36.2 A | 22.1 A | 22.8 A | 28.2 A
AFFX-PheX-5_at | 7.8 A | 3 A | 7.6 A | 5.6 A | 5 A | 6.4 A
AFFX-PheX-M_at | 4.2 A | 4.8 A | 6.8 A | 6.1 A | 3.7 A | 5.5 A
AFFX-PheX-3_at | 54.2 A | 39.6 A | 19.4 A | 16.1 A | 44.7 A | 31.2 A
AFFX-ThrX-5_at | 8.2 A | 11.2 A | 13.2 A | 9.5 A | 8.5 A | 7.5 A
AFFX-ThrX-M_at | 38.1 A | 30.6 A | 37.6 A | 7.2 A | 26.9 A | 36.3 A
AFFX-ThrX-3_at | 15.2 A | 5 A | 15 A | 8.3 A | 36.8 A | 11.5 A
AFFX-TrpnX-5_at | 11.2 A | 11.8 A | 22.2 A | 22.1 A | 8.9 A | 35.6 A
AFFX-TrpnX-M_at | 9 A | 8.1 A | 9.1 A | 8.7 A | 8.1 A | 12 A
AFFX-TrpnX-3_at | 19.8 A | 12.8 A | 11.8 A | 43.2 M | 17.4 A | 10 A
AFFX-HUMISGF3A/M97935_5_at | 82.7 P | 120.7 P | 92.7 P | 46.4 P | 55.9 P | 46.5 P
AFFX-HUMISGF3A/M97935_MA_at | 397.6 P | 416.7 P | 244.8 A | 181.4 A | 197.5 A | 192.3 A
AFFX-HUMISGF3A/M97935_MB_at | 206.2 P | 303 P | 300.8 P | 253.5 P | 195.3 P | 216 P
AFFX-HUMISGF3A/M97935_3_at | 663.8 P | 723.9 P | 812.1 P | 666.1 P | 629.4 P | 754.1 P
AFFX-HUMRGE/M10098_5_at | 547.6 P | 405.9 P | 6894.7 P | 3496.1 P | 1958.5 P | 5799.4 P
AFFX-HUMRGE/M10098_M_at | 239.1 P | 175.8 P | 3675 P | 1348.6 P | 695.9 P | 2428.2 P
AFFX-HUMRGE/M10098_3_at | 1236.4 P | 721.4 P | 9076.1 P | 7795.9 P | 4237.1 P | 7890 P
AFFX-HUMGAPDH/M33197_5_at | 19508 P | 19267.1 P | 22892 P | 26584 P | 29666.6 P | 25038.1 P
AFFX-HUMGAPDH/M33197_M_at | 18996.6 P | 20610.4 P | 21573.7 P | 29936 P | 30106.6 P | 22380.2 P
AFFX-HUMGAPDH/M33197_3_at | 18016.4 P | 17463.8 P | 20921.3 P | 26908.3 P | 28382.2 P | 21885 P
AFFX-HSAC07/X00351_5_at | 23294.6 P | 21783.7 P | 18423.3 P | 21858.9 P | 23517.1 P | 19450.3 P
AFFX-HSAC07/X00351_M_at | 25373.1 P | 24922.8 P | 22384.2 P | 25760.2 P | 27718.5 P | 21401.6 P
AFFX-HSAC07/X00351_3_at | 20032.8 P | 20251.1 P | 20961.7 P | 23494.6 P | 23381.2 P | 21173.3 P
Normalize
(RMA)
Filter
•Present/Absent
•Minimum value
•Fold change
Significance
•t-test
•SAM
•Rank Product
Classification
•PAM
•Machine learning
Gene lists
Function
(Gene Ontology)
Clustering
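A sketch of the Filter step (Present/Absent, minimum value, fold change) applied to three probes from the raw data slide. The assignment of values to normal and tumor arrays follows the slide's column order as best it can be read, and the thresholds (3 present calls, minimum value 50, 2-fold change) are illustrative assumptions.

```python
from statistics import mean

# (probe ID, normal values, tumor values, ABS_CALLs in slide column order)
rows = [
    ("AFFX-BioB-5_at", [210.6, 389.0, 305.6], [234.6, 362.5, 330.5], "PPPPPP"),
    ("AFFX-DapX-5_at", [12.2, 37.7, 33.3], [44.3, 31.2, 12.8], "AMAPAA"),
    ("AFFX-HUMRGE/M10098_5_at", [547.6, 3496.1, 1958.5],
     [405.9, 6894.7, 5799.4], "PPPPPP"),
]


def passes_filter(normal, tumor, calls,
                  min_present=3, min_value=50.0, min_fold=2.0):
    """Keep a probe only if it passes all three filters."""
    if calls.count("P") < min_present:      # Present/Absent filter
        return False
    if max(normal + tumor) < min_value:     # minimum value filter
        return False
    fold = mean(tumor) / mean(normal)       # fold change, up or down
    return fold >= min_fold or fold <= 1.0 / min_fold


kept = [probe for probe, normal, tumor, calls in rows
        if passes_filter(normal, tumor, calls)]
print(kept)
```

With these thresholds only the third probe survives: the first shows almost no change in mean, and the second is called present on a single array.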
Microarray experiments
Print or buy the microarray
– Obtain sequence info
(sequencing error, assembly error, contamination)
– select oligos
(unique, similar hybridization rates)
– Print microarray
Create the labeled samples
– obtain tissue sample
– extract RNA
– extract mRNA
– normalize mRNA
– label
– experimental design: number of biological replicates, technical replicates, blocks, sample pooling
hybridize
– hybridization design (multichannel)
scan
– using multiple scans
process image
– detect spots (spot detection software)
– detect background
– compute spot summary (pixel mean, median ..., background correction)
– detect bad spots (detection limit, background > foreground, badly printed spots, flaws)
– remove array specific noise (normalization)
Raw data are not mRNA
concentrations
• tissue contamination
• RNA degradation
• amplification efficiency
• reverse transcription
efficiency
• Hybridization efficiency and
specificity
• clone identification and
mapping
• PCR yield, contamination
• spotting efficiency
• DNA support binding
• other array manufacturing
related issues
• image segmentation
• signal quantification
• “background” correction
Quality control:
Noise and reliable signal
Probe level
Array level
Gene level
Arrays 1 ... n
Probe level: quality of the expression measurement of one spot
on one particular array
Array level: quality of the expression measurement on one
particular glass slide
Gene level: quality of the expression measurement of one probe
across all arrays
Probe-level quality control
• Individual spots printed on the slide
• Sources:
– faulty printing, uneven distribution, contamination with debris,
magnitude of signal relative to noise, poorly measured spots;
• Visual inspection:
– hairs, dust, scratches, air bubbles, dark regions, regions with haze
• Spot quality:
– Brightness: foreground/background ratio
– Uniformity: variation in pixel intensities and ratios of intensities within
a spot
– Morphology: area, perimeter, circularity.
– Spot Size: number of foreground pixels
• Action:
– set measurements to NA (missing values)
– local normalization procedures which account for regional
idiosyncrasies.
– use weights for measurements to indicate reliability in later analysis.
Spot Identification
Individual spots are recognized; size and shape might be
adjusted per spot (automatically, with fine adjustments by
hand).
Additional manual flagging of bad (X) or non-present (NA)
spots
NA
X
poor spot quality
good spot quality
Different Spot identification methods: Fixed circles, circles with
variable size, arbitrary spot shape (morphological opening)
Spot Identification
• The signal of the spots is quantified.
Histogram of pixel
intensities of a single spot
„Donuts“
Mean / Median / Mode / 75% quantile
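A sketch of how one spot's pixels can be reduced to a single number using the summaries named above. The pixel values are made up, with one very bright pixel standing in for a dust speck; subtracting the median background is one simple choice, not the only one.

```python
from statistics import mean, median, quantiles


def summarize_spot(foreground, background):
    """Summarize the pixel intensities of one spot."""
    q25, q50, q75 = quantiles(foreground, n=4)  # quartiles, Python 3.8+
    fg_median = median(foreground)
    return {
        "mean": mean(foreground),
        "median": fg_median,
        "q75": q75,
        "median_minus_background": fg_median - median(background),
    }


foreground = [120, 130, 125, 900, 128, 131, 127, 126]  # 900 = dust speck
background = [40, 42, 38, 41, 39]
summary = summarize_spot(foreground, background)
print(summary)
```

The single bright pixel pulls the mean far above the median, which is why robust summaries such as the median are often preferred.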
Spot Detection
Rafael A Irizarry,
Department of
Biostatistics JHU
http://www.biostat.jhsph.edu/~ririzarr
http://www.bioconductor.org
nci 2002
Adaptive segmentation Fixed circle segmentation
---- GenePix
---- QuantArray
---- ScanAnalyze
Spot uses
morphological opening
Microarray Analysis – Data
Preprocessing
• Objective
– Convert image of thousands of
signals to a signal value for
each gene or probe set
• Multiple steps
– Image analysis
– Background and noise
subtraction
– Normalization
– Expression value for a gene or
probe set
• Image analysis and
background noise usually
done by proprietary
software
Gene 1        100
Gene 2        150
Gene 3         75
.
Gene 10000    500
Array Level Quality Control
• Problems:
– array fabrication defect
– problem with RNA extraction
– failed labeling reaction
– poor hybridization conditions
– faulty scanner
• Quality measures:
– Percentage of spots with no signal (~30% excluded spots)
– Range of intensities
– (Av. Foreground)/(Av. Background) > 3 in both channels
– Distribution of spot signal area
– Amount of adjustment needed: signals have to be substantially
changed to make slides comparable.
Gene-level Quality Control
Gene g
• Poor hybridization in the
reference channel may
introduce bias on the fold change
• Some probes will not hybridize
well to the target RNA
• Printing problems, such that
all spots of a given inventory
have poor quality.
• A well may be of bad quality
– contamination
• Genes with a consistently low signal in the reference channel
are suspicious
Normalization
• Corrects for variation in
hybridization etc
• Assumes no
global change in gene
expression between samples
• Without normalization
– Intensity value for gene
will be lower on Chip B
– Many genes will appear
to be downregulated
when in reality they are
not
              Treated   Control
Gene 1          100        50
Gene 2          150        75
Gene 3           75        32
...
Gene 10000      500       250
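A sketch of the simplest kind of normalization, applied to the four genes shown: scale one chip so that its median matches the other. Global median scaling is an assumption chosen for illustration; it is not RMA or any specific vendor method.

```python
from statistics import median

treated = [100.0, 150.0, 75.0, 500.0]
control = [50.0, 75.0, 32.0, 250.0]


def median_scale(reference, other):
    """Scale `other` so its median equals the median of `reference`."""
    factor = median(reference) / median(other)
    return factor, [value * factor for value in other]


factor, control_scaled = median_scale(treated, control)
print(factor)
print(control_scaled)
```

Before scaling every gene looks about 2-fold lower on the control chip; after scaling only Gene 3 (75 vs. 64) still differs.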
Gene expression data
          mRNA Samples
Gene   sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90   ...
2       -0.10     0.49     0.24     0.06     0.46   ...
3        0.15     0.74     0.04     0.10     0.20   ...
4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
5       -0.06     1.06     1.35     1.09    -1.09   ...
Each entry: gene-expression level or ratio for gene i in mRNA sample j
M = Log2(red intensity / green intensity),
or Function (PM, MM) of MAS, dchip or RMA
A = average of log2(red intensity) and log2(green intensity),
or Function (PM, MM) of MAS, dchip or RMA
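For a two-color spot, M and A follow directly from the red and green intensities. A minimal sketch with made-up intensities; the function name is an assumption.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red / green)                       # log ratio
    a = 0.5 * (math.log2(red) + math.log2(green))    # average log intensity
    return m, a


m, a = ma_values(red=1024.0, green=256.0)
print(m, a)
```

Here red is 4 times green, so M = 2; a spot with equal red and green signal has M = 0 whatever its brightness.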
Scatterplot of Data
Data
Data (log scale)
Message: look at your data on log-scale!
Use a log
transformation of the
ratio data:
Scatter plot of all genes
in a simple comparison
of two control (A) and
two treatments (B: high
vs. low glucose)
showing changes in
expression greater than
2.2 and 3 fold (lines).
X Axis is Average spot
intensity on a log scale
and Y Axis is specific
spot intensity.
Statistical power
• t test
– Test hypothesis that the two
means are not statistically
different
– Adding “confidence” to the fold
change value
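• Mean
• Standard deviation
• Sample size
• Calculates statistic
• You choose cutoff or
threshold
– Give me gene list at a cutoff
of p < 0.05
» Often read as 95% confidence
that the mean for that gene
differs between control and
treated; strictly, a difference this
large would arise by chance less
than 5% of the time if the means
were equal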
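The ingredients listed above (mean, standard deviation, sample size) combine into the t statistic. A sketch using only the Python standard library, assuming Welch's unequal-variance form and made-up log2 expression values for one gene. Converting t into a p value needs the t distribution, which a statistics package would supply.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va = variance(group_a) / len(group_a)   # squared standard error, group A
    vb = variance(group_b) / len(group_b)   # squared standard error, group B
    t = (mean(group_a) - mean(group_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1)
                           + vb ** 2 / (len(group_b) - 1))
    return t, df


treated = [8.1, 7.9, 8.0]   # hypothetical log2 intensities
control = [7.1, 6.9, 7.0]
t, df = welch_t(treated, control)
print(round(t, 2), round(df, 2))
```

The mean difference here is 1.0 on the log2 scale (a 2-fold change); the small spread within each group is what makes t large.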
Experimental Design – Very
important!!!
• Sample size
– How many samples in test and
control
• Will depend on many factors
such as whether tissue
culture or tissue sample
• Power analysis
• Replicates
– Technical v biological
• Biological replicates are more
important for more
heterogeneous samples
• Need replicates for statistical
analysis
• To pool or not to pool
– Depends on objective
• Sample acquisition or
extraction
– Laser captured or gross
dissected
• All experimental steps from
sample acquisition to
hybridization
– Microarray experiments are
very expensive. So, plan
experiments carefully
t tests
• Results might look
like
– At a p<0.05, there are
300 genes up and 200
genes down regulated
• 95% confidence that the
means of these genes in
the two groups are
different
– At a p < 0.05, x genes
up and y genes down
with a fold change of at
least 3.0
Multiple Comparisons
• In a microarray experiment, each gene
(each probe or probe set) is really a
separate experiment
• Yet if you treat each gene as an
independent comparison, you will always
find some with significant differences
– (the tails of a normal distribution)
Multiple Comparisons
• Microarrays have a multiple comparison problem
• p <= 0.05 means that a gene with no real difference still has
a 5% chance of being called different
• 5% of 10000 is 500
– About 500 genes are picked up by chance
– Suppose t tests select 1000 genes at a p of 0.05
– 500/1000: approximately 50% of the selected genes will be
false positives
– Very high false discovery rate; need more confidence
– How to correct?
– Correction for multiple comparison
– p value and a corrected p value
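The arithmetic above as a short sketch. It assumes the worst case in which no gene is truly changed, so every test can only produce a false positive; the function names are illustrative.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of chance hits if no gene is truly different."""
    return n_tests * alpha


def naive_false_fraction(n_tests, alpha, n_selected):
    """Rough share of the selected genes expected to be false positives."""
    return min(1.0, expected_false_positives(n_tests, alpha) / n_selected)


by_chance = expected_false_positives(10000, 0.05)
fraction = naive_false_fraction(10000, 0.05, 1000)
print(round(by_chance), round(fraction, 2))
```

Raising the number of probes on the array raises the expected number of chance hits in direct proportion.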
Corrections for multiple
comparisons
• Involve corrections to the p value so that
the corrected (adjusted) p value is higher
• Bonferroni
http://en.wikipedia.org/wiki/Bonferroni_correction
• Benjamini-Hochberg
• Significance Analysis of Microarrays (SAM)
– Tusher et al. at Stanford
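A sketch of the first two corrections in plain Python, returning adjusted p values in the original gene order. The four p values are made up; SAM is not shown because it relies on permutations of the expression data rather than on p values alone.

```python
def bonferroni(pvalues):
    """Multiply each p value by the number of tests (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values (false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(pvalues)
bh = benjamini_hochberg(pvalues)
print(bonf)
print(bh)
```

Bonferroni is the stricter of the two: at a 0.05 cutoff it keeps two of these genes, while Benjamini-Hochberg keeps all four.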
Gene Expression Microarray experiments
obtain numerical summary for each gene or exon on each array
– robust methods to down weight outliers
– data imputation (filling in missing data)
differential expression analysis
– t-tests, ANOVA
– Bayesian versions of above
– Fourier analysis of time series
– False discovery and nondiscovery rates
sample classification (supervised learning)
– discriminant analysis
– support vector machines
clustering genes and samples (unsupervised learning)
– hierarchical clustering
– k-means clustering
– heatmaps
From Data to Knowledge
Once data is of high quality and systematic, non-biological effects are removed, the result is a gene
expression matrix
          mRNA Samples
Gene   sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90   ...
2       -0.10     0.49     0.24     0.06     0.46   ...
3        0.15     0.74     0.04     0.10     0.20   ...
4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
5       -0.06     1.06     1.35     1.09    -1.09   ...
This is still just data, not knowledge.
Use this data to answer a scientific question.
Supervised Analysis
Learning from examples, classification
– We have already seen groups of healthy and
sick people. Now let’s diagnose the next person
walking into the hospital.
– We know that these genes have function X (and
these others don’t). Let’s find more genes with
function X.
– We know many gene-pairs that are functionally
related (and many more that are not). Let’s
extend the number of known related gene pairs.
Known structure in the data needs to be
generalized to new data.
Unsupervised analysis
= clustering
– Are there groups of genes that behave
similarly in all conditions?
– Disease X is very heterogeneous. Can we
identify more specific sub-classes for more
targeted treatment?
No structure is known. We first need to
find it. Exploratory analysis.
Supervised analysis
[Cartoon] “Calvin, I still don’t know the difference between cats and dogs …”
“Don’t worry! I’ll show you once more: Class 1: cats. Class 2: dogs.”
“Oh, now I get it!!”
Unsupervised analysis
[Cartoon] “Calvin, I still don’t know the difference between cats and dogs …”
“I don’t know it either. Let’s try to figure it out together …”
Supervised analysis: setup
• Training set
– Data: microarrays
– Labels: for each one we know if it falls into our
class of interest or not (binary classification)
• New data (test data)
– Data for which we don’t have labels.
– These are genes without known function
• Goal: Generalization ability
– Build a classifier from the training data that is
good at predicting the right class for the new
data.
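A sketch of the setup above with a very simple classifier: average the training arrays of each class into a centroid, then assign a new array to the nearest centroid. The two-gene profiles and labels are made up, and this plain nearest-centroid rule is an illustrative choice, not the method used in the lecture.

```python
import math
from statistics import mean


def train(training):
    """Return one centroid (mean profile) per class label."""
    return {label: [mean(column) for column in zip(*profiles)]
            for label, profiles in training.items()}


def predict(centroids, profile):
    """Assign `profile` to the class with the closest centroid."""
    return min(centroids,
               key=lambda label: math.dist(profile, centroids[label]))


# Hypothetical training set: expression of two genes per array.
training = {
    "normal": [[1.0, 5.0], [1.2, 4.8]],
    "tumor": [[4.0, 1.0], [4.2, 1.3]],
}
centroids = train(training)
prediction = predict(centroids, [3.9, 1.1])   # new, unlabeled array
print(prediction)
```

Generalization is judged on arrays that were not used for training; performance on the training set alone is too optimistic.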
One microarray, one dot
Think of a space with
#genes dimensions (yes, it’s
hard for more than 3).
Each microarray
corresponds to a point in this
space.
If gene expression is similar
under some conditions, the
points will be close to each
other.
If gene expression overall is
very different, the points will
be far away.
[Figure: scatter plot with axes Expression of gene 1 and Expression of gene 2]
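A sketch of the "one microarray, one dot" picture: treat each sample as a point with one coordinate per gene and measure how far apart the points are. The values are the first three sample columns (genes 1 to 5) of the expression matrix shown earlier; Euclidean distance is one common choice among several.

```python
import math
from itertools import combinations

# One point per array; coordinates are genes 1-5.
arrays = {
    "sample1": [0.46, -0.10, 0.15, -0.45, -0.06],
    "sample2": [0.30, 0.49, 0.74, -1.03, 1.06],
    "sample3": [0.80, 0.24, 0.04, -0.79, 1.35],
}

distances = {
    (a, b): math.dist(arrays[a], arrays[b])
    for a, b in combinations(arrays, 2)
}
closest = min(distances, key=distances.get)
for pair, d in distances.items():
    print(pair, round(d, 3))
print("closest pair:", closest)
```

Clustering methods build on exactly these pairwise distances, grouping the points that lie closest together.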
A heatmap
samples of different regions of the
brain in humans and chimpanzees
sample clusters show that different regions of
the brain cluster more closely than different species
gene clusters show that some genes differentiate
among brain regions while others differentiate the
2 species.
Heat Map: 2-D Cluster Analysis
Genome Wide Association
Studies
• GWAS involves rapidly scanning markers
across the genome (≈0.5M or 1M) of many people
(≈2K) to find genetic variations associated with a
particular disease.
• A large number of subjects are needed because
– (1) associations between SNPs and causal variants are
expected to show low odds ratios, typically below
1.5
– (2) in order to obtain a reliable signal, given the very
large number of tests that are required, associations
must show a high level of significance to survive the
multiple testing correction
• Such studies are particularly useful in finding
genetic variations that contribute to common,
complex diseases like autism.
Look for Association with
Diseases and SNPs
Many issues with data, including
population diversity and the need for
t-test corrections (Bonferroni) and log transformations
because of many variables and fewer
samples
A recurrent mutation in the BMP type I receptor ACVR1 causes
inherited and sporadic fibrodysplasia ossificans progressiva
Eileen M. Shore, Meiqi Xu, George J. Feldman, David A. Fenstermacher,
The FOP International Research Consortium, Matthew A. Brown, and
Frederick S. Kaplan Nature Genetics 2006
Collect 13 individuals from five families
with FOP ectopic bone formation
Genome-wide linkage analysis with 400
microsatellite markers
Higher resolution linkage analysis
with Affymetrix 10K SNP mapping
chip (in Facility)
Candidate gene sequencing
identifies a new SNP in BMP
receptor
Predictive Value of Gene
Expression
• Lymphoma dataset
– This dataset is the gene expression in the
three most prevalent adult lymphoid
malignancies: B-CLL,FL and DLBCL.
– This study produced gene expression data
for p=4,682 genes in n=81 mRNA samples.
29 × B-CLL
9 × FL
43 × DLBCL
http://genome-www.stanford.edu/lymphoma
Correlation Matrix
Personal Genomics
• Many companies are marketing SNP chips
as useful, promising future information as
it becomes available.
– deCODEme.com
– Navigenics
– 23andMe
– Knome
– http://thepersonalgenome.com/
• Who will tell these people what the data
means?
• http://www.nytimes.com/2009/01/11/magazine/11Genome-t.html?pagewanted=all
IN-DELS and CNVs
• Insertions, Deletions and Copy Number
Variation
• Clone-based comparative genomic hybridization
(Array CGH)
– Test and reference DNA are labeled with different fluorescent
dyes and hybridized to the array.
– cons: low resolution (Cannot find small CNV region)
• SNP genotyping array
– pros: Higher resolution
– Cons: poor signal-to-noise ratio of hybridization
Hidden Markov Model designed for
high resolution CNV detection in
whole genome SNP genotyping data
CNV Analysis of Array Data
• Log R ratio (LRR): total fluorescent intensity signals
from both sets of probe/allele at each SNP
• B Allele Frequency (BAF) : relative ratio of the
intensity signals between two probes/allele at each
SNP
• Accurate model for log R ratio and B Allele
Frequency
• + Population allele frequency + distance between
adjacent SNPs + family information
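A deliberately simplified sketch of the two quantities. It assumes x and y are the normalized intensities of the A and B allele probes at one SNP and that the expected total intensity for two copies is known. Production software derives the expected values from genotype clusters, so the raw B fraction below only approximates BAF.

```python
import math


def log_r_ratio(x, y, expected_r):
    """log2 of observed total intensity over the expected total intensity."""
    return math.log2((x + y) / expected_r)


def b_allele_fraction(x, y):
    """Share of the total signal coming from the B allele probe."""
    return y / (x + y)


# Idealized SNPs with an expected total intensity of 2.0 (two copies).
normal_het = (log_r_ratio(1.0, 1.0, 2.0), b_allele_fraction(1.0, 1.0))
one_copy_b = (log_r_ratio(0.0, 1.0, 2.0), b_allele_fraction(0.0, 1.0))
print(normal_het, one_copy_b)
```

In this idealized model a normal heterozygous SNP gives LRR 0 and BAF 0.5, while a single-copy deletion gives LRR -1 and a BAF of 0 or 1. A run of adjacent SNPs with that pattern is the kind of signal the hidden Markov model looks for.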
CNV Data Analysis
Genome Wide Association
Studies
• GWAS
– http://grants.nih.gov/grants/gwas/
– http://www.nature.com/scitable/topicpage/Genetic-Variation-and-Disease-GWAS-682
• Personal Genomes
– http://www.nytimes.com/2009/01/11/magazine/11Genome-t.html?pagewanted=all
Functional Genomics
• Take a list of "interesting" genes and find
their biological relationships
– Gene lists may come from
significance/classification analysis of
microarrays, proteomics, or other high-throughput methods
• Requires a reference set of "biological
knowledge"
Gene Ontology
• 3 hierarchical sets of terminology
– Biological Process
– Cellular Component (location within cell)
– Molecular Function
• about 1000 categories of functions
Biological Pathways
Microarray Databases
• Large experiments may have hundreds of
individual array hybridizations
• Core lab at an institution or multiple investigators
using one machine - data archive and validate
across experiments
• Data-mining - look for similar patterns of gene
expression across different experiments
• Microarray experiments are complex, and a shared
database makes the data available to others
Using Public Databases
• Gene Expression data is an essential
aspect of annotating the genome
• Publication and data exchange for
microarray experiments
• Data mining/Meta-studies
• Common data format - XML
• MIAME (Minimum Information About a
Microarray Experiment)
Transcriptome:
Gene Expression Technologies
• cDNA (EST) libraries
• SAGE
• Microarray
• RT-PCR
• RNA-seq
The Cancer Genome Anatomy
Project
• CGAP has collected a large amount of
cDNA and related data online
• http://cgap.nci.nih.gov/
• cDNA libraries from various tissues
– search for genes
– compare expression levels
SAGE
• Serial Analysis of Gene Expression is a
technology that sequences very short
fragments of mRNA (10 or 17 bp) that
have been randomly ligated together
• The short ‘tags’ are assigned to genes and
then relative counts for each gene are
computed for cDNA libraries from various
tissues
SAGE Genie
• SAGE Anatomic Viewer
• SAGE Digital Gene Expression
Displayer
• Digital Northern
• SAGE Experiment Viewer
GEO
Microarray database at NCBI
• Microarray experiments
– Defined arrays
– Published results
– Also lots of inconclusive experiments
– Tools to search for specific genes
– Unreliable to search for tissue or disease in
experiment description text
Array Express at EMBL
Antibodies on Arrays
Miniature Western Blots