Clustering Techniques

Download Report

Transcript Clustering Techniques

Microarrays
•
•
•
•
•
Short Introduction to Microarray Technology
The data matrix
Distance metrics
Clustering techniques
Relevance of networks
What is a Microarray?
• “Microarray” has become a general term,
there are many types now
– DNA microarrays
– Protein microarrays
– Transfection microarrays
– Tissue microarray
–…
• We’ll be discussing cDNA microarrays
Gene Expression Measurement
• mRNA expression represents dynamic
aspects of cell
• mRNA expression can be measured with
latest technology
• mRNA is isolated and labeled with
fluorescent protein
• mRNA is hybridized to the target; level of
hybridization corresponds to light emission
which is measured with a laser
3
Gene Expression Microarrays
The main types of gene expression
microarrays:
• Short oligonucleotide arrays (Affymetrix) –
–11-20 probes per gene,
–probes for perfect match vs mismatch;
• cDNA or spotted arrays (Brown/Botstein)
–two colors – experiment vs control.
• ...
4
Affymetrix Microarrays
1.28cm
50um
~107 oligonucleotides,
some perfectly match mRNA (PM),
some have one Mismatch (MM)
Gene expression computed from
PM and MM
5
Affymetrix Microarray Raw
Image
Gene
D26528_at
D26561_cds1_at
D26561_cds2_at
D26561_cds3_at
D26579_at
D26598_at
D26599_at
D26600_at
D28114_at
Scanner
enlarged section of raw image
6
raw data
Value
193
-70
144
33
318
1764
1537
1204
707
What is a DNA Microarray
(very generally)
• A grid of DNA spots (probes) on a substrate
used to detect complementary sequences
• The DNA spots can be deposited by
– piezolectric (ink jet style)
– Pen
– Photolithography (Affymetrix)
• The substrate can be plastic, glass, silicon
(Affymetrix)
• RNA/DNA of interest is labelled & hybridizes
with the array
• Hybridization with probes is detected optically.
Types of DNA microarrays and
their uses
• What is measured depends on the chip design
and the laboratory protocol:
– Expression
• Measure mRNA expression levels (usually polyadenylated
mRNA)
– Resequencing
• Detect changes in genomic regions of interest
– Tiling
• Tiles probes over an entire genome for various applications
(novel transcripts, ChIP, epigenetic modifications)
– SNP
• Detect which known SNPs are in the tested DNA
– ?...
What do Expression Arrays really
measure?
• Gene Expression
• mRNA levels in a cell
• mRNA levels averaged over a population of cells
in a sample
• relative mRNA levels averaged over populations
of cells in multiple samples
• relative mRNA hybridization readings averaged
over populations of cells in multiple samples
• some relative mRNA hybridization readings
averaged over populations of cells in
multiple samples
Why “some” & “multiple samples”
• “some”
– In a comparison of Affymetrix vs spotted arrays, 10%
of probesets yielded very different results.
– “In the small number of cases in which
platforms yielded discrepant results, qRTPCR generally did not confirm either set of
data, suggesting that sequence-specific
effects may make expression predictions
difficult to make using any technique.”*
– It appears that some transcripts just can’t be
detected accurately by these techniques.
* Independence and reproducibility across microarray platforms.,
Quackenbush et al. Nat Methods. 2005 May;2(5):337-44
Why “multiple samples”
• “multiple samples”
– We can only really depend on betweensample fold change for Microarrays not
absolute values or within sample comparisons
(>1.3-2.0 fold change, in general)
Central “Assumption” of Gene
Expression Microarrays
• The level of a given mRNA is positively
correlated with the expression of the associated
protein.
– Higher mRNA levels mean higher protein expression,
lower mRNA means lower protein expression
• Other factors:
– Protein degradation, mRNA degradation,
polyadenylation, codon preference, translation rates,
alternative splicing, translation lag…
• This is relatively obvious, but worth emphasizing
•
How
Does
it
Work
?
DNA strands corresponding to specific genes, are arrayed on a
small glass, silicon or nylon slide (1x1 to 2x2 cm). Each contains many
copies of single stranded DNA probe.
• mRNA is extracted from the sample, converted to cDNA and attached
to a fluorescent label (usually Cy3 or Cy5).
• Single vs double labeled arrays:
 Absolute analysis:
determine whether transcripts
are present or not within
the sample.
The array
 Comparative analysis:
determine relative change in
abundance for each transcript.
probe
How Does it Work ? - (cont.)
• The chip is exposed to a solution containing extracted labeled mRNA
(hybridization, similar to Northern and Southern).
•
If the probe sequence
finds a complementary
base sequence in a DNA
sample, it will bind to it.
•
Excess unbound sample
is washed away.
RNA fragments with fluorescent tags from sample
Hybridization of labeled RNA
fragments with DNA on chip
How Does it Work ? - (cont.)
The array is scanned to measure fluorescent label.
The location of the bound sample is detected using the fluorescent
reporter, and gene expression is determined.
Detection of labeled
(hybridized) DNA
fragments
Non-hybridized
DNA
hybridized
DNA
Microarrays
The conentration of labels in the cDNA is measured!
Assumptions for interpretation of Microarray data:
•mRNA is stable over time until cDNA is transcribed
•The concentration of cDNA equals the concentraton of mRNA
•The labeled nucleotides are inserted statistically
The labels are often flourophors with a cyanine structure:
cy3
Microarrays
*
*
*
*
*
*
*
**
*
*
**
*
**
*
*
labeled cDNA
as
probe
target
cDNA Microarrays
•Immobilisation of whole cDNA
•cDNA is obtained via becteria
•Spotted by inkjet technology
•cDNA libraries are needed
oligonucleotide Microarrays
•Immobilisation of
characteristc sequences
•Sequences are designed
•Sequences are synthesized
on the chip
•Several commercial products
are available
Comparing two situations
cDNA Microarrays
*
*
*
*
*
**
*
*
Situation A – treated probe
*
*
**
*
*
**
*
Situation B – wild type
A reference is needed to compare data from different arrays.*)
Mixture of A and B is spotted
•Confocal microscope scans the
array
•Extraction of intensities with
picture analysis software
Comparing two situations
Int(treated)
Int(wild type)
ratio
Gene A
0.22
0.24
0.917
Gene B
0.67
1.21
0.598
Gene C
1.13
0.43
2.630
Gene D
2.45
2.44
1.01
0 < ratio < Inf.
-Inf. < log2(ratio) < + Inf.
where
log2(ratio) > 0: increase
log2(ratio) < 0: decrease
Absolute Intensities
•Absolute itensities without reference signal can be used if
two microarray chips are exactly equal.
•Commercial products (Affymetrix, Agilent, ...) guarantee
an interchip variation below 2%
•This technology works well.
Target Sequences and Probes
Example:
• 1415771_at:
– Description: Mus musculus nucleolin mRNA, complete cds
– LocusLink: AF318184.1 (NT sequence is 2412 bp long)
– Target Sequence is 129 bp long
11 probe pairs tiling the target sequence
gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt
gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt
gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt
gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt
gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt
gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt
gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt
gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt
gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt
gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt
gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt
Perfect Match and Mismatch
Target
ttccagacagactcctatggtgacttctctggaat
Perfect match
ctgtctgaggataccactgaagaga
ctgtctgaggattccactgaagaga
Probe pair
Mismatch
Affymetrix Chip Pseudo-image
*image created using dChip software
1415771_at on MOE430A
*image created using dChip software
1415771_at on MOE430A
PM
MM
*Note that PM, MM are always adjacent
*image created using dChip software
Data Matrix
Exp. 1
Exp. 2
Exp. 3
Exp. 4
Exp. 5
Type I
Type II
Type III
Type II
Type III
Gene 1
foo
0.78
0.22
0.32
0.87
0.50
Gene 2
bar
0.73
0.92
0.06
0.67
0.09
Gene 3
blee
0.99
0.60
0.48
0.10
0.56
Gene 4
bas
0.60
0.26
0.11
0.21
0.41
Gene 5
groo
0.44
0.84
0.42
0.84
0.86
Gene 6
gar
0.07
0.18
0.49
0.30
0.05
Gene 7
glee
0.28
0.49
0.47
0.93
0.89
Gene 8
glas
0.91
0.95
0.76
0.22
0.00
Gene 9
gree
0.72
0.51
0.35
0.54
0.81
Gene 10
goo
0.34
0.83
0.62
0.15
0.71
…
Data Normalisation
Background noise
•The background is determined around each spot
•Some spots are left empty to determine background
•There are several different procedures to substract background:
•The background around a point is subtracted for that point
•The average background is subtracted from each point
•A background profile is determined for the whole array – can
also be used to determine systematic errors of array
MAS5: Background & Noise
Background
•Divide chip into zones
•Select lowest 2% intensity values
•stdev of those values is zone variability
•Background at any location is the sum of all zones
background, weighted by 1/((distance^2) + fudge
factor)
Noise
•Using same zones as above
•Select lowest 2% background
•stedev of those values is zone noise
•Noise at any location is the sum of all zone noise as
above
•From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
MAS5 Model
• Measured Value = N + P + S
– N = Noise
– P = Probe effects (non-specific hybridization)
– S = Signal
MAS5: Adjusted Intensity
A = Intensity minus background, the final value
should be > noise.
A: adjusted intensity
I: measured intensity
b: background
NoiseFrac: default 0.5 (another fudge factor)
And the value should always be >=0.5 (log issues)
(fudge factor)
•From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
MAS5: Ideal Mismatch
Because Sometimes MM > PM
•From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
MAS5: Signal
Value for each probe:
Modified mean of probe values:
Scaling Factor
(Sc default 500)
Signal
ReportedValue(i) = nf * sf * 2 (SignalLogValuei)
(nf=1)
Tbi = Tukey Biweight (mean estimate, resistant to outliers)
TrimMean = Mean less top and bottom 2%
•From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
MAS5: p-value and calls
• First calculate discriminant for each probe pair:
R=(PM-MM)/(PM+MM)
• Wilcoxon one sided ranked test used to compare
R vs tau value and determine p-value
• Present/Marginal/Absent calls are thresholded
from p=value above and
– Present =< alpha1
– alpha1 < Marginal < alpha2
– Alpha2 <= Absent
• Default: alpha1=0.04, alpha2=0.06, tau=0.015
Data Normalisation
Correct for:
•Differences in labelling and detection efficiencies
•Differences in quantity of initial mRNA
Some widely used techniques:
•Total intensity normalisation
•Assumption: Quantity of initial mRNA is similar for both
labelled samples
•Total integrated intensities are equal for both samples
•Re-scale intensities for each gene
•Normalisation using regression techniques
•Normalisation using ratio statistics
Data Normalisation
Some widely used techniques (continued):
•Normalisation using regression techniques
•Assumption: mRNA derived from closely related samples is
used
•Scatterplot of int(Cy5) vs. int(Cy3) has a diagonal part with
slope 1
•Re-scale intensities for each gene
•Local regression techniques also available for non-linear
scatterplots
•Normalisation using ratio statistics
•Confidence limits are calculated via statistics
Data Normalisation
Data Normalisation
Undefined values
Gene expression datasets often contain undefined values.
•The background and the signal give similar intensity
•Surface of the chip is not planar (cDNA chips)
•The probe is not properly fixed on the chip
• hybridistion step didn‘t work properly
•Probe was not washed away properly
Undefined values can be
•Discarded -> leads to problems in data analysis
•Replaced with averages of rows/columns
•Replaced by zero
Similarity of Genes
Co-regulated genes show a similar behaviour
•There exists no a priori definition of similarity
•Different similarity measures are used
•Different patterns are seen with different similarity measures
Similarity of Genes
Example:
•An expression vector is defined for
each gene
•The expression values are the
dimensions (x1,x2, ..., xn) in expression
x3
space
•Each gene is represented as one
n-dimensional point in expression
space
•Two genes with similar expression
behaviour are spatially nearby
•Similarity is inversely proportional to
spatial distance
x2
x1
Similarity of Genes
Example (continued):
•Different clustering algorithms can be
applied to find clusters of genes
x3
x2
x1
Similarity of Experiments
•Instead of clustering genes it is also possible to cluster
experiments
•Each experiment is regarded as experiment vector in the
experiment space
Exp. 1
Exp. 2
Exp. 3
Exp. 4
Exp. 5
Type I
Type II
Type III
Type II
Type III
Gene 1
foo
0.78
0.22
0.32
0.87
0.50
Gene 2
bar
0.73
0.92
0.06
0.67
0.09
Gene 3
blee
0.99
0.60
0.48
0.10
0.56
Gene 4
bas
0.60
0.26
0.11
0.21
0.41
Gene 5
groo
0.44
0.84
0.42
0.84
0.86
Gene 6
gar
0.07
0.18
0.49
0.30
0.05
Gene 7
glee
0.28
0.49
0.47
0.93
0.89
Gene 8
glas
0.91
0.95
0.76
0.22
0.00
Gene 9
gree
0.72
0.51
0.35
0.54
0.81
Gene 10
goo
0.34
0.83
0.62
0.15
0.71
…
Visualisation of Data Matrix
•The data matrix can be visualised to get an
impression of gene expression behaviour.
•The order of genes in data matrix can be reorganised
for better recognition of expression patterns.
Distance Metrics
•
•
•
•
For clustering algorithms the calculation of a distance
between gene vectors or experiment vectors is a
necessary step
Distances metrics can be classified as
• Metric distances
• Semi-metric distances
Metric distances:
1. dab >= 0
2. dab = dba
3. daa = 0
4. dab <= dac + dcb
Semi-metric distances: obey 1) to 3), fail in 4)
Distance Metrics
Minkowski distance
d (i, j)  q (| x  x |q  | x  x |q ... | x  x |q )
i1
j1
i2
j2
ip
jp
If q = 1, d is Manhattan distance (semi-metric distance)
d (i, j) | x  x |  | x  x | ... | x  x |
i1 j1
i2 j 2
ip j p
If q = 2, d is Euclidean distance (metric distance)
d (i, j)  (| x  x |2  | x  x |2 ... | x  x |2 )
i1 j1
i2 j 2
ip
jp
Distance Metrics
Pearson correlation coefficient (semi-metric distance)
n
( x  x )(x  x )
i1 1 i2 2
i

1
d (i, j) 
n
n
2
 x )2
 (x  x )
 (x
i 1 i1 1
i 1 i 2 2

x2
(x  x )
1 1
(x  x )
2 2
-1 <= d(i,j) <= +1
x1
(x , x )
1 2
Distance Metrics
Rank-ordered Pearson correlation coefficient -> Spearman
(semi-metric distance)
Distance Metrics
Other variations of Pearsons correlation coefficient:
Uncentered Pearson correlation
(semi-metric distance)
n
x x
i1 i2
i

1
d (i, j) 
n
n
2
 x )2
 (x  x )
 (x
i 1 i1 1
i 1 i 2 2

Clustering Techniques
Several classification criteria of clustering algorithms exist with
regard to
•Clustering result:
hierarchical – not hierarchical
•Clustering process: divisive - agglomerative
•Clustering criterion: sequential - global
Clustering Techniques
Hierarchical clustering:
Clustering Techniques
Hierarchical clustering: phylogenetic tree
Euclidean distance
20
15
10
5
3
4 2
1
5
Clustering Techniques
Hierarchical clustering:
Various hierarchical clustering algorithms exist
•Single-linkage clustering, nearest-neighbour
•Complete-linkage, furthest-neighbour
•Average-linkage, unweighted pair-group method average (UPGMA)
•Weighted-pair group average, UPGMA weighted by cluster sizes
•Within-groups clustering
•Ward‘s method
•...
Clustering Techniques
k-means clustering:
1.
2.
3.
4.
5.
6.
The number of clusters k has to be chosen in
advance*)
x2
Initial position of cluster centers is random**)
For each data point the (euclidean) distance to
each cluster center is calculated
Each data point is assigned to its nearest
cluster.
Cluster centers are shifted to the center of
data points assigned to this cluster center
Step 3. – 5. is iterated until cluster centers are
not shifted anymore
x1
Clustering Techniques
k-means clustering (continued):
A reasonable number of cluster centers k can be estimated:
weighted squared
distance of data
points to their
cluster center
k
Clustering Techniques
k-means clustering (continued):
The initial position of cluster centers can be estimated by
the distribution of the data vectors: x
2
x1
Clustering Techniques
Principal Component
Analysis (PCA)
Singular value
decomposition (SVD)
•Reduction of effective
dimensionality of gene-expression
space
•by (linear) combination of initial
dimensions
•Mathematically complex
Clustering Techniques
•No clustering technique is better than another
•Different clustering techniques have shown to lead to
reasonable results depending on the measured data:
•Organism
•Tissue
•Experimental condition, e.g. mutants or time course
experiments
•Distance metric used
Visualisation
Visualisation in clusters:
Euclidean distance
20
15
10
5
3
4 2
1
5
What have we learned from
microarrays?
•
Evolution
– Most of gene expression differences between chimpanzees human have been
detected in their brain.
•
Development
– Associating gene expression with metamorphosis stages in Drosophila ila
•
Regulation
– Finding novel regulatory motifs by coupling motif search with co-expression
•
Behaviour
– Can predict the behavior of honey-bees workers by their brain gene expression.
•
Functional annotation
– Annotating unknown genes based on co expression (guilty by association)
•
Tissue
– Molecular signature specific to subtype of cancer tissues
Uses of Microarrays
• Sample classification
– What are the set of genes that differentiate
between two or more groups of Treatments
– What is the set of samples that have the same
expression profile in the detected cell(s)?
• Gene classification
– What is the set of genes that have the same
expression profile along a set of treatments?
Why are Microarrays Important ?
Primary goal: Generate expression information for every
gene in the array (detect global changes in whole genome
transcription, under similar set of conditions).
• Infer probable function of new genes (functional genomics;
based on similarities in expression patterns with those of known
genes). Explore gene function and regulation in relation
to structure.
• Provide knowledge of inter-relationships between genes,
revealing how multiple gene products work together.
Microarrays Uses
1.
•
Measure differential gene expression in samples:
Response to environmental factors (e.g. treatment, cell stimulation
in-vitro, response to environmental factors, effect of drugs,
time after treatment...).
• Diseased vs. normal tissues.
• Profiling tumors.
• Gene regulation (e.g. during
development).
2. Sequence identification:
• Comparative genomics.
• Genotyping: mutation detection & SNPs.
• Direct sequencing (sequencing by hybridization; SBH), determine
a sequence from its “signature” on a chip (the original objective
for DNA arrays, back at the beginning of the 90’s).
Gene Expression Matrix
When the array surface is scanned with a laser, fluorescent labels
attached to the complementary DNA reveal which probes are bound
and their quantity.
www.ebi.ac.uk/microarray/biology_intro.htm
Outline: Microarrays (“DNA Chips”)
What is functional genomics ?
Microarray landmarks.
Basic principles.
Applications of DNA
microarrays.
Working with microarrays.
Clustering analysis, example.
Conclusions.
Clustering analysis
Clustering - Algorithmic Challenge
A a powerful set of tools
which partition samples into
well-separated and homogeneous
groups, based on their behaviors
or patterns.
Example: cluster genes that have common expression patterns in certain
experimental conditions. Analyze
the vast amount of data in gene expression matrices,
and discover meaningful common biological functions, regulatory elements
and relationships among genes.
Clustering Methods
1
2
3
4
5
6
7
8
9
10
11
How will you cluster the faces ?
(“raw” state cases).
Cluster analysis is the tool
used to group these faces
in an objective manner.
The same data could be
presented
in numeric format: 0/1.
http://149.170.199.144/multivar/ca.htm
12
Clustering Methods
Challenge: to select data and apply algorithm
appropriately to get sensible partition of the data.
Choosing the “right” algorithm:
1. Hierarchical clustering Based on pair-wise distances, the result
is a tree (similar to phylogenetic tree).
2. Non-hierarchical clustering Partitions the n genes into dis-joint clusters.
E.g. k-means clustering (k is a pre-defined
parameter), where the variability within each
cluster is minimized, and variability between
clusters is maximal.
Hierarchical classification
(tree)
Non-hierarchical
classification
Simultaneous
(Traditional
global
correlation)
Timeshifted
Clustering
AlgorithmIdentifies further
(reasonable) types of
expression
relationships
Inverted
Mark Gerstein, Yale University, 2002.
Clustering Example
Advantages:
• Intuitive and quick.
• Visualization of results.
• Useful as first data
mining tool for
microarray data.
Pitfalls:
• In any type of data,
clusters could always
be found.
Future Model for Cancer Treatment
A patient arrives with any kind of cancer…
A biopsy of the tumor will be taken
and undergo micro-array analysis
for expression pattern.
“Personal” drug cocktail will be
formulated for the specific tumor type
according to individual expression pattern.
Microarrays and Drug “Tailoring”
Pharmacogenomics: The entire spectrum
of genes that interact with drugs.
•
Genes that determine drug sensitivity
and toxicity.
• “Tailoring” personal drugs relies
on patient’s genetic constitution.
How is individual genetic polymorphism linked to differences
in the efficacy and toxicity of medications ?
•
•
•
•
Drug-metabolizing enzymes.
Transporters.
Receptors.
Other drug targets.
Pharmacogenomics: Translating Functional Genomics into Rational Therapeutics. W.E. Evans
and M.V. Relling. Science vol 286: 487-491, 15 October 1999.
Micro-Array Market
<$50 million
>$500 million
Average annual rate of growth of 150%.
What Do DNA Chips Miss ?
• Cellular protein production: there are 3 major “check
points” to determine quantities of protein and function:
• Transcription from genes to mRNA.
• Translation of proteins from mRNA.
• Post-translation modifications.
DNA chips detect only the levels of transcription.
Other proteome technologies:
• Mass spectrometry.
• 2-D electrophoresis.
• Protein microarrays.
• Tissue microarrays.
Conclusion
•Microarrays measure the concentration of mRNA
•Several assumptions are made in the progress of analysis
•Biological
•Numerical
•Experimental conditions are important!
•Different distances matrics combined with different
clustering algorithms can leed to completely different results
• SUMMARY
Gene Expression Measurement
• mRNA expression represents dynamic
aspects of cell
• mRNA expression can be measured with
latest technology
• mRNA is isolated and labeled with
fluorescent protein
• mRNA is hybridized to the target; level of
hybridization corresponds to light emission
which is measured with a laser
78
Gene Expression Microarrays
The main types of gene expression
microarrays:
• Short oligonucleotide arrays (Affymetrix) –
–11-20 probes per gene,
–probes for perfect match vs mismatch;
• cDNA or spotted arrays (Brown/Botstein)
–two colors – experiment vs control.
• ...
79
Affymetrix Microarrays
1.28cm
50um
~107 oligonucleotides,
some perfectly match mRNA (PM),
some have one Mismatch (MM)
Gene expression computed from
PM and MM
80
Affymetrix Microarray Raw Image
Gene
D26528_at
D26561_cds1_at
D26561_cds2_at
D26561_cds3_at
D26579_at
D26598_at
D26599_at
D26600_at
D28114_at
Scanner
enlarged section of raw image
81
raw data
Value
193
-70
144
33
318
1764
1537
1204
707
Microarray Potential Applications
• Earlier and more accurate diagnostics
• New molecular targets for therapy
• Improved and individualized treatments
• fundamental biological discovery (e.g. finding and
refining biological pathways)
• Recent examples
– molecular diagnosis of leukemia, breast cancer, ...
– discovery that genetic signature strongly predicts
outcome
– a few new drugs, many new promising drug targets
82
Microarray Data Analysis Types
• Gene Selection
– Find genes for therapeutic targets (new drugs)
• Classification (Supervised)
– Identify disease
– Predict outcome / select best treatment
• Clustering (Unsupervised)
– Find new biological classes / refine existing
ones
– Exploration
83
Microarray Data Analysis
Challenges
• Few records (samples), usually < 100
• Many columns (genes), usually > 1,000
• This is very likely to result in false positives,
“discoveries” due to random noise
• Model needs to be explainable to biologists
• Good methodology is essential for minimizing
and controlling false positives
84
Microarray Classification
Overview
Train data
Gene
data
Data Cleaning & Preparation
Class
data
Feature and Parameter Selection
Model Building
Test data
Evaluation
85
Global Feature (Gene) Selection
“Leaks” Information
Class
Gene Data data
Train data
Gene
Selection
Model Building
Evaluation
Test data
is wrong, because the information is “leaked” via gene selection.
When #Features >> # samples, leads to overly “optimistic” results.
86
Data Preparation Issues
• Cleaning: inherent measurement noise
• Thresholding:
– min 20, max 16,000 for MAS-4
– MAS-5 does not generate negative numbers
• Filtering - remove genes with low variation
(for biological and efficiency reasons)
– e.g. MaxVal - MinVal < 500 and
MaxVal/MinVal < 5
– or Std. Dev across samples in the bottom 1/3
– or MaxVal - MinVal < 200 and MaxVal/MinVal
87