Slides - Department of Computer Science

Download Report

Transcript Slides - Department of Computer Science

CS 5263 Bioinformatics
Lecture 23
Microarray Data Analysis
What is a Microarray
Gene 1
Conceptually similar to
(reverse) Northern blot
(Many) probes, rather than
mRNAs, are fixed on some
surface, in an ordered way
Gene 300
Microarray categories
• cDNAs microarray
– Each probe is the cDNA of a gene (5005000nt)
• Oligonucleotide microarray
– Each probe is a synthesized short DNA
(uniquely corresponding to a substring of a
gene)
– Affymetrix: ~ 25mers
– Aglient: ~ 60 mers
Spotted cDNA microarray
Array Manufacturing
Each tube contains cDNAs corresponding to a unique
gene. Pre-amplified, and spotted onto a glass slide
Experiment
cy3
cy5
Data acquisition
Computer programs are used to process the image into digital signals.
• Segmentation: determine the boundary between signal and background
• Results: gene expression ratios between two samples
Affymetrix GeneChip®
Array Design
25-mer unique oligo
mismatch in the middle
nuclieotide
multiple probes (11~16) for each gene
from Affymetrix Inc.
Experiment
Each probe set combines to give an
absolute expression level.
Image segmentation is relatively easy.
But need to subtract background.
from Affymetrix Inc.
Affymetrix GeneChip
One color design
cDNA microarray
Two color design
Preprocessing
• Image processing
– Analog to digital
• Background subtraction
– Account for non-specific hybridization
• Transformation
– Convenience, normal distribution assumptions
• Normalization
– Remove systematic biases
• Filtering, averaging, etc.
– Remove random noises
Order may be different.
May be combined.
Background subtraction
• For cDNA array, relatively straightforward
• For oligo array, how to combine PM and MM?
• Does MM really measure non-specific
hybridization?
• Recent studies suggest
to ignore MM entirely or
use with caution
• Available software tools
900
800
700
– MAS 5 (by affymetrix)
– dChIP
– GCRMA
MM
600
500
400
300
200
100
0
0
500
1000
PM
1500
Transformation
• Log transformation for two-color array
3500
1800
1600
3000
1400
2500
Frequency
Frequency
1200
2000
1500
1000
800
600
1000
400
500
200
0
0
2
4
6
8
Cy5/Cy3 ratio
10
12
14
0
-4
-3
-2
-1
0
1
log (Cy5/Cy3)
2
2
3
4
Transformation
• Log transformation for one-color array
• When get a data set from someone, be careful
with the scale
4
2.5
x 10
5000
4500
2
4000
1.5
frequency
frequency
3500
1
3000
2500
2000
1500
0.5
1000
500
0
0
1000
2000
3000
raw intensity
4000
5000
0
0
2
4
6
8
log2 (raw intensity)
10
12
14
Normalization
• Purpose:
– Correct for systematic errors
– Make data from different samples comparable
• Best approach to detect problems:
visualization
An example data set
• J DeRisi, V Iyer, and P Brown, “Exploring the
Metabolic and Genetic Control of Gene
Expression on a Genomic Scale”, Science, 278:
680 – 686, 1997
– Yeast cells grow in glucose medium
– When glucose was depleted, cells change their
metabolic pathways
• cDNA microarray
–
–
–
–
–
Test: 2, 4, 6, 8, 10, 12, 14 hours after growth
Control: 0 hour
No replicates!
No normalization!
Use fold-change to get differentially expressed genes!
Histogram of log ratios
Median = -0.27
1600
1400
1200
Two possibilities:
frequency
1000
• Dye effect
800
• Sample difference
600
400
200
0
-3
-2
-1
0
log2 (cy5/cy3)
1
2
3
Total intensity normalization
Median = -0.1
1600
mean(cy3) = 3141
mean(cy5) = 2838
3141 / 2838 = 1.11
1400
1200
Other options:
• use median
• use subset of genes
frequency
1000
800
–
–
–
–
600
400
Exclude 10% extreme
House-keeping genes
Spike-in genes
Etc.
200
0
-3
-2
-1
0
1
log2 (1.11*cy5/cy3)
2
3
• Net effect: constant
factor for every gene
Intensity-intensity plot
5
5
10
10
4
4
cy3 intensity
10
cy3 intensity
10
3
3
10
10
2
10
2
10
2
3
4
10
10
cy5 intensity
5
10
10
2
10
3
4
10
10
1.1 * cy5 intensity
• Total intensity normalization worked well here
5
10
Intensity-intensity plot
5
5
10
10
4
4
cy3 intensity
1.02 * cy3 intensity
10
3
10
2
10
2
10
10
3
10
2
3
4
10
10
cy5 intensity
5
10
10
2
10
3
4
10
10
cy5 intensity
• Did not work well for this experiment
• Dye-swapping can probably help?
5
10
2.5
2.5
2
2
1.5
1.5
1
1
log2 (1.11*cy5/cy3)
log2 (cy5/cy3)
M-A plot
0.5
0
-0.5
0.5
0
-0.5
-1
-1
-1.5
-1.5
-2
-2
-2.5
16
18
20
22
24
26
log2 (cy5*cy3)
28
30
32
-2.5
16
18
20
A: log2(cy5 * cy3) = log2(cy5)+log2(cy3)
M: log2(cy5 / cy3) = = log2(cy5)-log2(cy3)
22
24
26
log2 (cy5*cy3)
28
30
32
1
1
0.5
0.5
0
0
log2 (cy5/1.02*cy3)
log2 (cy5/cy3)
M-A plot
-0.5
-1
-1.5
-0.5
-1
-1.5
-2
14
16
18
20
22
24
log2 (cy5*cy3)
26
28
Dependency of M on A
30
32
-2
14
16
18
20
22
24
log2 (cy5*cy3)
26
28
30
32
2
5
0
0
-2
15
20
25
30
35
2
-5
10
15
20
25
30
35
5
0
0
-2
-4
10
15
20
25
30
35
-5
15
5
5
0
0
-5
10
15
20
25
30
35
-5
10
20
15
25
20
30
25
30
35
35
Lowess normalization
• Lowess: Locally Weighted Regression
• Fit local polynomial functions
• M adjusted according to fitted line
M’
M
A
A
Replicate filtering
Ratio 1
Ratio 2
Log2(ratio2)
• Genes with very high
variability in replicates
are questionable
Log2(ratio1)
Preprocessing questions
• What kind of array it is?
–
–
–
–
Two-color?
One-color?
Oligo array?
cDNA array?
• How is the experiment designed?
– Time series?
– Test vs control? (what kind of control?)
• What kind of preprocessing has been done?
– What values are given: raw intensities or ratios?
– Transformation? Log scale? Linear scale?
– Normalization: within-array? Cross-array?
• What are the next steps?
– Identifying differentially expressed genes?
– Clustering?
Identify differentially expressed
genes
• Naïve approach: fold change
– Log2 (cy5 / cy3) > 1: up-regulated / induced
– Log2(cy5 / cy3) < -1: down-regulated / repressed
• Still widely used – very simple
• Main problem: genes with low expression levels
may have a large fold change by chance
– From 10 to 100: ten fold
– From 1000 to 3000: three fold
– However: low-intensity => relatively high variance
15
15
10
10
log2(test / control)
log2(test / control)
Problem with fold change
5
0
-5
-10
5
0
-5
0
1000
2000
3000
4000
sqrt(test * control)
5000
6000
-10
0
1000
2000
3000
4000
0.5 * (test + control)
5000
• The most “differentially” expressed genes are the ones
with the lowest average expression levels
6000
More robust estimation of
differentially expression
• Estimate variance as a function of average expression
• Compute a Z-score depending on location: Z(x) = (x - <x>) / (x)
– x : log2(R/G) value.
– <x> : local mean
– (x): local standard deviation
SAM (Significance Analysis of
Microarrays)
• Test: replicate 1, replicate 2, replicate 3
• Control: replicate 1, replicate 2, replicate 3
T1
T2
T3
C1
C2
C3
Ratio
Gene1
1000
2000
1500
200
300
250
6
Gene2
1000
2000
3000
1000
1500
500
2
Gene3
100
1000
100
20
80
50
8
Gene4
1800
1700
1900
1000
800
900
2
Which one is more significantly differentially expressed?
SAM (Significance Analysis of
Microarrays)
• Basic idea: Student’s t-test
+ S0
To avoid small
sample problem
• Larger t => higher significance
• P-value can be directly computed for t-test
• Permutation test can be used
T1
T2
T3
C1
C2
C3
t
Gene1
1000
2000
1500
200
300
250
4.3
P1
1000
300
300
200
2000
200
-0.4
P2
2000
1000
200
300
1000
250
0.96
P3
200
1000
1000
300
250
1000
0.60
False Discovery Rate (FDR)
• Multiple testing problem
– P-value cutoff = 0.05
– We tested 10000 genes
– Would expect 500 genes by chance at this significance level
• Bonferroni correction
– Use p-value cutoff 0.05 / 10000
– Meaning among all genes selected, only 0.05 are expected to be
false positive
– Too conservative
• False Discovery Rate (FDR)
– FDR = 0.1, meaning among all genes selected, (say 100), we
would expect 10 to be false positive
– Acceptable to biologists
– Several different approaches to estimate
Microarray data analysis
• Often perform multiple experiments under
varying conditions
– Temperature
– Time series
– Different chemical treatment
– Different tissue
– Different mutant
Microarray data analysis
gene
2.5
20
2
40
1.5
60
1
80
0.5
100
0
120
-0.5
140
-1
160
-1.5
180
-2
200
2
4
experiment
6
-2.5
For each gene we have a
vector ej = (e1j, e2j, …, edj)
What to do next?
Supervised vs unsupervised
learning
• Supervised learning (Classification)
– Associate genes with phenotypes
• E.g.: Genes A, B and C induced => cancer
repressed => not cancer
– Goal: to learn such a function from data
– Classification algorithms:
• Decision tree, SVM, neural networks, naïve bayes,
etc.
• Unsupervised learning (clustering)
AML: acute myeloid leukemia
ALL: acute lymphoblastic leukemia
• Golub et. al., Molecular Classification of Cancer: Class Discovery
and Class Prediction by Gene Expression Monitoring, Science 286:
531 – 537, 1999
Clustering microarray data
• Group genes into co-expressed sets
– Genes with similar expression patterns across
multiple experiments may be co-regulated
• Group experiments into clusters
– Experiments within the same group may have similar
“gene expression” signature
– For example, disease sub-types that can be classified
from gene expression data
Clustering microarray data
• How to tell if two expression vectors are
similar?
– Define the (dis)-similarity measure between
two vectors
• How to group multiple profiles into
meaningful subsets ?
– Describe the clustering procedure
• Are the results meaningful ?
– Evaluate biological meaning of a clustering
(Dis)-similarity measures
•
•
•
•
Euclidean distance
Pearson correlation coefficient
Cosine similarity
Etc.
Clustering algorithms
•
•
•
•
•
Hierarchical clustering
K-means clustering
Self Organizing Maps (SOMs)
Spectral clustering
Etc.
Hierarchical clustering
• Agglomerative or divisive (less popular)
• Agglomerative basic idea:
– Given n genes
– Initially every gene in a single cluster
– for each iteration
• find two most similar genes (or gene groups),
combine into one cluster
• Terminate when only one cluster is left
• (how to define similarity between two groups?)
Hierarchical clustering
a
b
c
d
e
• Exact behavior depends on how to compute the distance between
two clusters
• No need to specify number of clusters
• A distance cutoff is often chosen to break tree into clusters
f
Distance between clusters
• Single-linkage
– Not recommended
• Complete-linkage
• Average-linkage
• Centroid method
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
K-means
• Basic idea:
– Given n genes
– Estimate number of clusters: k
– (Randomly) choose k genes
as cluster centers
– Assign each gene to the
closest center
– Re-compute center for each
cluster
– Until assignment is stable
Similarity to EM. Objective function: minimize total distance to cluster centers.
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
An example
• A synthetic data set
3
50
2
100
150
1
200
250
Genes
0
300
350
-1
400
450
-2
500
550
10
20
30
Experiments
40
50
-3
Hierarchical clustering
Average linkage. Cluster genes only.
Average linkage. Cluster both genes and experiments.
K-means
50
100
150
200
K = 15
250
300
350
400
450
500
5
10
15
20
25
30
35
40
45
50
Another view of clusters
Cluster 1
Log ratio
4
2
0
-2
-4
0
5
10
15
20
25
30
35
40
45
50
30
35
40
45
50
Cluster 4
Log ratio
4
2
0
-2
-4
0
5
10
15
20
25
Experiments
Evaluating clustering
• Do genes in the same cluster share similar
functions?
– Functional enrichment analysis
• Do genes in the same cluster share similar
cis-regulatory motifs?
– Motif finding
Gene Ontology (GO)
• Gene functions were often defined using
free text
• Hard to extract, transfer, revise, predict,
annotate, comprehend, manage …
• The list of vocabularies should be predefined and commonly agreed
• Gene Ontology provides a controlled
vocabulary to describe gene and gene
product attribute
Gene ontology
• Two parts
– Ontology: list of vocabularies (terms) to use
– Annotations: characterizing genes using
ontology terms
• Three ontology categories
– Biological process
– Molecular function
– Cellular components
Part of a GO graph
Each GO category is a
directed acyclic graph
A term can have multiple
parents, and multiple
children.
A gene can be Annotated by
multiple terms.
If annotated by a child term,
automatically annotated by
all ascendant terms.
Functional enrichment analysis
• Total number of genes: 6000
• Cluster A: 100 genes
• What functions do these 100 genes have?
– Do they share some functions significantly?
Gene with function of interest
Gene with other functions
Example:
Among 6000 genes, 60 genes have
function F. In cluster A, 55 out of 100
genes also have function F.
Significance can be computed using
cumulative hyper-geometric test.
Significance of enrichment
Example:
Among 6000 genes, 60 genes have
function F. In cluster A, 55 out of 100
genes also have function F.
Significance can be computed using
cumulative hyper-geometric test.
 N  M  N 
 

min( m , N ) 
i  m  i 

cHypegeom ( M , N , m, n)  
M 
i n
 
m
M = 6000
m = 100
N = 60
n = 55
P-value = 8e-100
An application
•
•
•
Tavazoie et al, Systematic determination
of genetic network architecture, Nature
Genetics, 22, 1999
3000 yeast genes, 15 time points during
“cell cycle”
Use k-means clustering, k=30
– Clusters correlate well with known function
•
AlignACE motif finding
– 600-long upstream regions
– Many motifs known
Cell-division cycle
• A cell duplicates its
genome and
divides into two
identical cells
• Four phases
– G1 (preparation)
– S (DNA duplication)
– G2 (preparation)
– M (cell division)
Motifs in Clusters
Enriched functions in clusters
Overview of course
Main themes by biological subject
• Sequence analysis
– Alignment
– String matching
– Motif finding
– Gene prediction
– RNA secondary structure prediction
• Functional genomics
– Microarray
– Functional enrichment analysis
Main themes by algorithmic
techniques
• Dynamic programming
– Alignment, HMM, RNA structure
• Probabilistic modeling
– HMM (regular grammar)
– RNA structure (context free grammar)
– Motif finding
• Suffix trees
– Applications in bioinfo
• Clustering
Sequence alignment
• DP algorithm for global and local alignment
– Needleman-Wunsch
– Smith-Waterman
• Alignment with affine gap cost
• Linear space alignment
• Heuristic alignment
– Bounded alignment
– BLAST
• Alignment statistics
– Extreme value distribution
• Multiple sequence alignment
Probabilistic models
• HMMs
– HMMs for pair-wise sequence alignment
– HMMs for multiple alignment
– HMMs for gene prediction
• Stochastic context-free grammar for RNA
structure prediction
• Viterbi and posterior decoding
• Expectation-maximization
• Gibbs Sampling
Goals
Basis of sequence analysis and other
computational biology algorithms
Overall picture about the field
Read / criticize research articles
Think about the sub-field that best suits
your background to explore
Communicate and exchange ideas with
(computational) biologists
Good luck with your final exams
See you on Dec 11
Please remember to turn in your homework
and final project report by Dec 6