
Introduction to Microarray Data Analysis - II
BMI 730
Kun Huang
Department of Biomedical Informatics
Ohio State University
Review of Microarray
Elements of Statistics and Gene Discovery in
Expression Data
Elements of Machine Learning and Clustering
of Gene Expression Profiles
How does two-channel microarray work?
• Printing process introduces errors and
larger variance
• Comparative hybridization experiment
How does microarray work?
• Fabrication expense and the frequency of errors increase with probe length, so short probes of 25 nucleotides (25-mers) are employed.
• Problem: cross-hybridization
• Solution: introduce a mismatch probe that differs from the perfect-match probe at a single (central) position. The difference between the two readings gives a more accurate measurement.
How do we use microarray?
• Inference
• Clustering
Normalization
• Which normalization algorithm to use
• Inter-slide normalization
• Not just for Affymetrix arrays
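For illustration only, here is a minimal sketch of one common inter-slide approach, quantile normalization, written in Python/NumPy; the matrix name expr and its orientation (rows = probes, columns = arrays) are assumptions for this example, not something stated in the slides.

import numpy as np

def quantile_normalize(expr):
    # Quantile-normalize a probes-by-arrays matrix so that every
    # array (column) ends up with the same empirical distribution.
    ranks = np.argsort(np.argsort(expr, axis=0), axis=0)  # rank of each value within its array
    mean_quantiles = np.sort(expr, axis=0).mean(axis=1)   # average of each quantile across arrays
    return mean_quantiles[ranks]                          # replace each value by the mean of its rank

# Toy example: 4 probes measured on 3 arrays
expr = np.array([[5.0, 4.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 4.0, 6.0],
                 [4.0, 2.0, 8.0]])
print(quantile_normalize(expr))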
Review of Microarray
Elements of Statistics and Gene Discovery in
Expression Data
Elements of Machine Learning and Clustering
of Gene Expression Profiles
Hypothesis Testing
• Two sets of samples drawn from two distributions (N = 2)
• Hypothesis
Null hypothesis H0: m1 = m2
Alternative hypothesis H1: m1 ≠ m2
where m1 and m2 are the means of the two distributions.
Student’s t-test
The p-value can be computed from the t-value and the degrees of freedom (related to the number of samples); it bounds the probability of a type-I error (declaring a difference significant when there is none), assuming normal distributions.
Student’s t-test
• Dependent (paired) t-test
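As a concrete illustration (not part of the slides), both tests can be run with SciPy; the expression values below are made up.

from scipy import stats

# Made-up expression of one gene in two groups of samples
group1 = [5.1, 4.8, 5.6, 5.0, 4.9]   # e.g., controls
group2 = [6.2, 6.0, 5.8, 6.5, 6.1]   # e.g., treated

# Independent two-sample t-test (H0: the two means are equal)
t, p = stats.ttest_ind(group1, group2)
print("independent: t = %.3f, p = %.4f" % (t, p))

# Dependent (paired) t-test: the same subjects measured twice
before = [5.1, 4.8, 5.6, 5.0, 4.9]
after  = [5.9, 5.2, 6.1, 5.4, 5.3]
t, p = stats.ttest_rel(before, after)
print("paired: t = %.3f, p = %.4f" % (t, p))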
Permutation (t-)test
The t-test relies on a parametric distribution assumption (normality). Permutation tests do not depend on such an assumption. Examples include the permutation t-test and the Wilcoxon rank-sum test.
Perform the regular t-test to obtain the t-value t0. Then randomly permute the N1+N2 samples and designate the first N1 as group 1, with the rest as group 2. Perform the t-test again and record the t-value t. Over all possible permutations, count how many t-values are larger than t0 and denote this number K0; the permutation p-value is then K0 divided by the total number of permutations.
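A minimal sketch of the permutation t-test just described, assuming for simplicity that a fixed number of random permutations (rather than all possible ones) is used; the data are made up.

import numpy as np
from scipy import stats

def permutation_t_test(group1, group2, n_perm=10000, seed=0):
    # Estimate a p-value by permuting group labels instead of
    # assuming a parametric (normal) distribution.
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([group1, group2]).astype(float)
    n1 = len(group1)
    t0 = stats.ttest_ind(group1, group2).statistic      # observed t-value
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                             # random relabeling of the N1+N2 samples
        t = stats.ttest_ind(pooled[:n1], pooled[n1:]).statistic
        if abs(t) >= abs(t0):                           # at least as extreme as the observed value
            count += 1
    return count / n_perm

print(permutation_t_test([5.1, 4.8, 5.6, 5.0, 4.9], [6.2, 6.0, 5.8, 6.5, 6.1]))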
Multiple Classes (N>2)
F-test
• The null hypothesis is that the distribution of
gene expression is the same for all classes.
• The alternative hypothesis is that at least one
of the classes has a distribution that is
different from the other classes.
• Which class is different cannot be determined by the F-test (ANOVA) itself; it can only be identified post hoc.
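For example (made-up values), a one-way ANOVA F-test across three classes can be computed with SciPy:

from scipy import stats

# Made-up expression of one gene in three classes
class_a = [5.1, 4.8, 5.6, 5.0]
class_b = [6.2, 6.0, 5.8, 6.5]
class_c = [5.0, 5.3, 4.9, 5.2]

# H0: the expression distribution (mean) is the same in all classes
f, p = stats.f_oneway(class_a, class_b, class_c)
print("F = %.3f, p = %.4f" % (f, p))
# A small p-value only says that at least one class differs;
# which class differs must be identified post hoc.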
Example
• GEO Dataset Subgroup Effect
Gene Discovery and Multiple T-tests
Controlling False Positives
• p-value cutoff = 0.05 (probability of a false positive, i.e. a type-I error)
• 22,000 probe sets
• Expected false discoveries: 22,000 × 0.05 = 1,100
• Focus on those 1,100 genes in a second specimen: expected false discoveries 1,100 × 0.05 = 55
Gene Discovery and Multiple T-tests
Controlling False Positives
• State the set of genes explicitly before the experiments
• Problem: not always feasible; defeats the purpose of large-scale screening; could miss important discoveries
• Statistical tests to control the false positives
Gene Discovery and Multiple T-tests
Controlling False Positives
• Statistical tests to control the false positives
• Controlling for no false positives (very stringent, e.g. Bonferroni methods)
• Controlling the number of false positives
• Controlling the proportion of false positives
• Note that in the screening stage a false positive is better than a false negative, since the latter means missing a possibly important discovery.
Gene Discovery and Multiple T-tests
Controlling False Positives
• Statistical tests to control the false positives
• Controlling for no false positives (very stringent)
• Bonferroni methods and multivariate permutation
methods
Bonferroni inequality: P(E1 ∪ E2 ∪ … ∪ EK) ≤ P(E1) + P(E2) + … + P(EK)
(the area of a union is at most the sum of the areas)
Gene Discovery and Multiple T-tests
Bonferroni methods
• Bonferroni adjustment
• If Ei is the event of a false positive discovery for gene i, the union-bound estimate of the probability of at least one false positive is K × 0.05, so, conservatively speaking, a false positive is almost guaranteed once K > 19 (K × 0.05 ≥ 1).
• So change the p-value cutoff from p0 to p0/K. This is called the Bonferroni adjustment.
• If K = 20 and p0 = 0.05, gene i is called significantly differentially expressed if pi < 0.05/20 = 0.0025.
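A minimal sketch of the adjustment in Python (the p-values are made up):

p0 = 0.05
K = 20                                   # number of genes tested
cutoff = p0 / K                          # Bonferroni-adjusted cutoff: 0.0025
p_values = [0.001, 0.004, 0.012, 0.030]  # made-up p-values
significant = [i for i, p in enumerate(p_values) if p < cutoff]
print(cutoff, significant)               # 0.0025 [0]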
Gene Discovery and Multiple T-tests
Bonferroni methods
• Bonferroni adjustment
• Too conservative: excessive stringency leads to increased false negatives (type-II errors).
• Has problems with meta-analysis.
• Variation: the sequential Bonferroni test (Holm-Bonferroni test)
• Sort the K p-values from smallest to largest to get p1 ≤ p2 ≤ … ≤ pK.
• Change the p-value cutoff for the ith p-value to p0/(K-i+1) (i.e., compare p1 with p0/K, p2 with p0/(K-1), …, pK with p0).
• If pj ≤ p0/(K-j+1) for all j ≤ i but pi+1 > p0/(K-i), keep the alternative hypotheses for genes 1 to i (declared significant) and reject them for genes i+1 to K.
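A sketch of the sequential (Holm-Bonferroni) procedure, with made-up p-values:

def holm_bonferroni(p_values, p0=0.05):
    # Return the indices of genes declared significant by the Holm procedure.
    K = len(p_values)
    order = sorted(range(K), key=lambda i: p_values[i])  # indices sorted by ascending p-value
    significant = []
    for step, idx in enumerate(order):                   # step = 0, 1, ..., K-1
        if p_values[idx] <= p0 / (K - step):             # i-th smallest p-value vs p0/(K-i+1)
            significant.append(idx)
        else:
            break                                        # first failure: the remaining nulls are kept
    return significant

print(holm_bonferroni([0.030, 0.001, 0.012, 0.004, 0.200]))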
Gene Discovery and Multiple T-tests
Controlling False Positives
• Statistical tests to control the false positives
• Controlling the number of false positives
• Simple approach: choose a p-value cutoff that is lower than the usual 0.05 but higher than the Bonferroni-adjusted cutoff
• More sophisticated way: a version of multivariate permutation.
Gene Discovery and Multiple T-tests
Controlling False Positives
• Statistical tests to control the false positives
• Controlling the proportion of false positives
Let g be the proportion (percentage) of false positives among the total discovered genes:
g ≈ (expected number of false positives) / (number of total positives)
The p-value cutoff pD is chosen so that g stays below the desired level. There are other ways of estimating false positives; details can be found in Tusher et al., PNAS 98:5116-5121.
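As one concrete and commonly used way to control this proportion, the Benjamini-Hochberg procedure is sketched below; note this is a generic illustration, not the specific permutation-based estimate of Tusher et al.

def benjamini_hochberg(p_values, g=0.10):
    # Declare genes significant while keeping the expected proportion of
    # false positives among the discoveries at or below g.
    K = len(p_values)
    order = sorted(range(K), key=lambda i: p_values[i])
    last_pass = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= g * rank / K:   # the step-up criterion
            last_pass = rank
    return order[:last_pass]                # all genes up to the largest passing rank

print(benjamini_hochberg([0.030, 0.001, 0.012, 0.004, 0.200]))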
Review of Microarray
Elements of Statistics and Gene Discovery in
Expression Data
Elements of Machine Learning and Clustering
of Gene Expression Profiles
Review of Microarray and Gene Discovery
Clustering and Classification
• Preprocessing
• Distance measures
• Popular algorithms (not necessarily the best
ones)
• More sophisticated ones
• Evaluation
• Data mining
- Clustering or classification?
- Is training data available?
- What domain specific knowledge can be applied?
- What preprocessing of data is needed?
- Log / data scale and numerical stability
- Filtering / denoising
- Nonlinear kernel
- Feature selection (do I need to use all the data?)
- Is the dimensionality of the data too high?
How do we process microarray data
(clustering)?
- Feature selection – genes, transformations of
expression levels.
- Genes discovered in the class comparison (t-test). Risk: missing genes.
- Iterative approach: select genes under different p-value cutoffs, then pick the cutoff with good performance using cross-validation.
- Principal components (pros and cons).
- Discriminant analysis (e.g., LDA).
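As an illustration of the principal-component route (made-up data; scikit-learn assumed to be available):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))        # made-up matrix: 20 samples x 500 genes

# Pro: a few uncorrelated features summarizing most of the variance.
# Con: each component mixes many genes and is hard to interpret biologically.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.round(3))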
Distance Measure (Metric?)
- What do you mean by “similar”?
- Euclidean
- Uncentered correlation
- Pearson correlation
Distance Metric
- Euclidean
[Table: expression values of Lip1 (probe set 102123_at) and Ap1s1 (probe set 160552_at) across the samples]
dE(Lip1, Ap1s1) = 12883
Distance Metric
- Pearson Correlation
Ranges from 1 to -1 (r = 1: perfectly positively correlated; r = -1: perfectly negatively correlated).
Distance Metric
- Pearson Correlation
[Same Lip1 (102123_at) / Ap1s1 (160552_at) expression table, with a scatter plot of the two profiles]
dP(Lip1, Ap1s1) = 0.904
Distance Metric
- Uncentered Correlation
[Same Lip1 (102123_at) / Ap1s1 (160552_at) expression table, with a scatter plot of the two profiles]
du(Lip1, Ap1s1) = 0.835 (an angle q of about 33.4°)
Distance Metric
- Difference between Pearson correlation
and uncentered correlation
[Same Lip1 / Ap1s1 expression table, plotted as two scatter plots]
Pearson correlation: baseline expression possible
Uncentered correlation: all values are considered signal
Distance Metric
- Difference between Euclidean and
correlation
Distance Metric
- Missing: negative correlation may also mean "close" in a signaling pathway; variants such as 1-|PCC| or 1-PCC^2 treat it as similarity.
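The three measures can be written out in a few lines of Python; the two profiles below are made-up numbers, not the Lip1/Ap1s1 values from the slides.

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def pearson(x, y):
    # Profiles are centered by their means, so a constant baseline is ignored.
    xc, yc = x - x.mean(), y - y.mean()
    return np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

def uncentered_correlation(x, y):
    # Cosine of the angle between the raw vectors: no centering,
    # so everything (including any baseline) is treated as signal.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([3.2, 1.6, 4.1, 2.0, 4.0])   # made-up expression profile
y = np.array([5.4, 1.3, 3.2, 2.2, 4.1])   # made-up expression profile

print(euclidean(x, y))
print(pearson(x, y))                  # clustering tools often use 1 - r (or 1 - |r|, 1 - r**2) as the distance
print(uncentered_correlation(x, y))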
Review of Microarray and Gene Discovery
Clustering and Classification
• Preprocessing
• Distance measures
• Popular algorithms (not necessarily the best
ones)
• More sophisticated ones
• Evaluation
• Data mining
How do we process microarray data
(clustering)?
-Unsupervised Learning – Hierarchical
Clustering
How do we process microarray data
(clustering)?
-Unsupervised Learning – Hierarchical
Clustering
Single linkage: The linking distance is the minimum distance
between two clusters.
How do we process microarray data
(clustering)?
-Unsupervised Learning – Hierarchical
Clustering
Complete linkage: The linking distance is the maximum
distance between two clusters.
How do we process microarray data
(clustering)?
-Unsupervised Learning – Hierarchical
Clustering
Average linkage/UPGMA: The linking distance is the
average of all pair-wise distances between members of
the two clusters. Since all genes and samples carry equal
weight, the linkage is an Unweighted Pair Group Method
with Arithmetic Means (UPGMA).
How do we process microarray data
(clustering)?
-Unsupervised Learning – Hierarchical
Clustering
• Single linkage – Prone to chaining and sensitive to
noise
• Complete linkage – Tends to produce compact
clusters
• Average linkage – Sensitive to distance metric
-Unsupervised Learning – Hierarchical Clustering
Dendrograms
• Distance – the height of each horizontal line represents the distance between the two groups it merges.
• Order – open-source R uses the convention that the tighter cluster is placed on the left. Others have proposed ordering by expression values, loci on chromosomes, and other ranking criteria.
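A short sketch of the three linkage rules with SciPy (made-up gene-by-sample matrix); scipy.cluster.hierarchy.dendrogram would draw the tree described above.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
genes = rng.normal(size=(6, 4))            # made-up matrix: 6 genes x 4 samples

d = pdist(genes, metric='correlation')     # pairwise 1 - Pearson correlation between gene profiles

for method in ('single', 'complete', 'average'):   # 'average' is UPGMA
    Z = linkage(d, method=method)
    print(method, Z[-1, 2])                # linking distance (dendrogram height) of the final merge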
- Unsupervised Learning - K-means
- Vector quantization
- K-D trees
- Need to try different K, sensitive to initialization
- Unsupervised Learning - K-means
[cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist', 'corr', 'rep', 20);
% 4 = number of clusters K; 'dist', 'corr' = correlation distance metric; 'rep', 20 = 20 restarts
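A rough Python counterpart (not an exact reproduction of the MATLAB call above): scikit-learn's KMeans supports only Euclidean distance, so each gene profile is z-scored first, which makes Euclidean k-means behave approximately like clustering with a correlation metric; the data matrix here is a random placeholder.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 7))           # placeholder: 100 genes x 7 conditions

# z-score each gene profile (row) so Euclidean distance approximates correlation distance
z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)

km = KMeans(n_clusters=4, n_init=20, random_state=0)   # K = 4, 20 restarts (cf. 'rep', 20)
labels = km.fit_predict(z)
print(np.bincount(labels))                 # cluster sizes; km.cluster_centers_ are the centroids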
- Unsupervised Learning - K-means
- Number of classes K needs to be specified
- Does not always converge
- Sensitive to initialization
- Issues
- Lack of consistency or representative features
(5.3 TP53 + 0.8 PTEN doesn’t make sense)
- Data structure is missing
- Not robust to outliers and noise
D’Haeseleer 2005 Nat. Biotechnol 23(12):1499-501
- Model-based clustering methods
(Han) http://www.cs.umd.edu/~bhhan/research2.html
Pan et al. Genome Biology 2002 3:research0009.1
doi:10.1186/gb-2002-3-2-research0009
- Structure-based clustering methods
- Supervised Learning
- Support vector machines (SVM) and Kernels
- Only (binary) classifier, no data model
- Accuracy vs. generality
- Overfitting
- Model selection
[Figure: prediction error on training and testing samples versus model complexity, reproduced from Hastie et al.]
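As an illustration of the accuracy-versus-generality trade-off (made-up data; scikit-learn assumed), training accuracy is typically optimistic compared with the cross-validated estimate used for model selection:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 50))              # made-up data: 40 samples x 50 genes
y = np.repeat([0, 1], 20)
X[y == 1] += 0.5                           # shift class 1 so the classes are partly separable

# Larger C fits the training samples more closely (risking overfitting);
# cross-validation estimates the error on testing samples.
for C in (0.01, 1.0, 100.0):
    train_acc = SVC(kernel='linear', C=C).fit(X, y).score(X, y)
    cv_acc = cross_val_score(SVC(kernel='linear', C=C), X, y, cv=5).mean()
    print("C=%g: training accuracy %.2f, cross-validated accuracy %.2f" % (C, train_acc, cv_acc))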