
MACHINE LEARNING TECHNIQUES
IN BIO-INFORMATICS
Elena Marchiori
IBIVU
Vrije Universiteit Amsterdam
Summary
• Machine Learning
• Supervised Learning: classification
• Unsupervised Learning: clustering
Machine Learning (ML)
• Construct a computational model from a dataset
describing properties of an unknown (but
existent) system.
[Diagram: observations of the unknown system are fed to an ML algorithm, which produces a computational model of the system's properties.]
Supervised Learning
• The dataset describes examples of input-output behaviour of an unknown (but existent) system.
• The algorithm tries to find a function
‘equivalent’ to the system.
• ML techniques for classification: K-nearest
neighbour, decision trees, Naïve Bayes,
Support Vector Machines.
Supervised Learning
[Diagram: observations and the property of interest of the unknown system form the training data; the ML algorithm learns a model, which is applied to a new observation to produce a prediction.]
Example: A Classification Problem
• Categorize images of fish—
say, “Atlantic salmon” vs.
“Pacific salmon”
• Use features such as length,
width, lightness, fin shape &
number, mouth position, etc.
• Steps
1. Preprocessing (e.g.,
background subtraction)
2. Feature extraction
3. Classification
example from Duda & Hart
Classification in Bioinformatics
• Computational diagnostics: early cancer detection
• Tumor biomarker discovery
• Protein folding prediction
• Protein-protein binding site prediction
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Classification Techniques
• Naïve Bayes
• K Nearest Neighbour
• Support Vector Machines (next lesson)
• …
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Bayesian Approach
• Each observed training example can incrementally decrease or increase the probability of a hypothesis, rather than eliminating the hypothesis outright
• Prior knowledge can be combined with observed data to determine the hypothesis
• Bayesian methods can accommodate hypotheses that
make probabilistic predictions
• New instances can be classified by combining the
predictions of multiple hypotheses, weighted by their
probabilities
Kathleen McKeown’s slides
Bayesian Approach
• Assign the most probable target value, given the attribute values <a1, a2, …, an>:
  vMAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)
• Using Bayes' theorem:
  vMAP = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
       = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj)
• Bayesian (MAP) learning is optimal
• P(vj) is easy to estimate by counting in the training data
• Estimating the different P(a1, a2, …, an | vj) is not feasible
(we would need a training set of size proportional to the number of
possible instances times the number of classes)
Kathleen McKeown’s slides
Bayes’ Rules
• Product rule:
  P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
• Bayes' rule:
  P(a|b) = P(b|a) P(a) / P(b)
• In distribution form:
  P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)
Kathleen McKeown’s slides
Naïve Bayes
• Assume conditional independence of the attributes given the class:
  P(a1, a2, …, an | vj) = ∏_i P(ai | vj)
• Substitute into the vMAP formula:
  vNB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)
Kathleen McKeown’s slides
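To make the vNB formula concrete, here is a minimal sketch of a categorical Naïve Bayes classifier in Python. It is not taken from the slides: the attribute values and class labels are hypothetical stand-ins, and the plain counting estimates leave out the m-estimate smoothing discussed a few slides later.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    """Estimate P(vj) and P(ai | vj) by simple counting (no smoothing)."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {v: c / n for v, c in class_counts.items()}
    cond_counts = defaultdict(int)      # (attribute index, value, class) -> count
    for x, v in zip(examples, labels):
        for i, a in enumerate(x):
            cond_counts[(i, a, v)] += 1
    likelihoods = {k: c / class_counts[k[2]] for k, c in cond_counts.items()}
    return priors, likelihoods

def predict_naive_bayes(x, priors, likelihoods):
    """vNB = argmax_vj P(vj) * prod_i P(ai | vj)."""
    best_class, best_score = None, -1.0
    for v, prior in priors.items():
        score = prior
        for i, a in enumerate(x):
            score *= likelihoods.get((i, a, v), 0.0)  # unseen value -> 0 (motivates the m-estimate)
        if score > best_score:
            best_class, best_score = v, score
    return best_class

# Hypothetical iris-like examples: (S-length, S-width, P-length) -> class
X = [("high", "high", "high"), ("low", "high", "low"), ("low", "high", "med")]
y = ["Versicolour", "Setosa", "Verginica"]
priors, likelihoods = train_naive_bayes(X, y)
print(predict_naive_bayes(("low", "high", "low"), priors, likelihoods))  # "Setosa" here
```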
vNB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)

  #   S-length   S-width   P-length   Class
  1   high       high      high       Versicolour
  2   low        high      low        Setosa
  3   low        high      low        Verginica
  4   low        high      med        Verginica
  5   high       high      high       Versicolour
  6   high       high      med        Setosa
  7   high       high      low        Setosa
  8   high       high      high       Versicolour
  9   high       high      high       Versicolour
Kathleen McKeown’s slides
Estimating Probabilities
• What happens when the number of data elements is small?
• Suppose the true P(S-length=low | Verginica) = 0.05
• There are only 2 instances with Class = Verginica
• We estimate the probability by nc/n using the training set
• Then the count of S-length=low with Class = Verginica will most likely be 0
• So, instead of 0.05 we use an estimated probability of 0
• Two problems
  • A biased underestimate of the probability
  • This zero probability term will dominate if a future query contains S-length=low
Kathleen McKeown’s slides
Instead: use m-estimate
• Use priors as well:
  estimate = (nc + m·p) / (n + m)
  – Where p = prior estimate of P(S-length=low | Verginica)
  – m is a constant called the equivalent sample size
    » Determines how heavily to weight p relative to the observed data
    » Typical method: assume a uniform prior over the attribute values (e.g. if the values are low, med, high, then p = 1/3)
Kathleen McKeown’s slides
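A one-function sketch of the m-estimate in Python; the numbers in the usage line mirror the Verginica example above and are otherwise arbitrary.

```python
def m_estimate(nc, n, p, m):
    """Smoothed probability (nc + m*p) / (n + m).

    nc: observed count of the attribute value within the class
    n:  number of training examples in the class
    p:  prior estimate of the probability (e.g. 1/3 for values low/med/high)
    m:  equivalent sample size, i.e. how heavily to weight the prior
    """
    return (nc + m * p) / (n + m)

# 0 'low' values among 2 Verginica examples, uniform prior over three values:
print(m_estimate(nc=0, n=2, p=1/3, m=3))  # 0.2 instead of a hard 0
```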
K-Nearest Neighbour
• Memorize the training data
• Given a new example, find its k nearest
neighbours, and output the majority vote
class.
• Choices:
– How many neighbours?
– What distance measure?
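A minimal k-nearest-neighbour sketch in Python, illustrating the two choices listed above; the Euclidean distance, k = 3, and the toy feature vectors are assumptions, not part of the slides.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, train_x, train_y, k=3, distance=euclidean):
    """Memorize the training data; output the majority class among the k nearest neighbours."""
    neighbours = sorted(zip(train_x, train_y), key=lambda xy: distance(query, xy[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy usage with hypothetical 2-D feature vectors
X = [(1.0, 2.0), (1.2, 1.9), (8.0, 8.5), (7.8, 8.1)]
y = ["ALL", "ALL", "AML", "AML"]
print(knn_predict((1.1, 2.1), X, y, k=3))  # "ALL"
```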
Application in Bioinformatics
• A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data, Z. Yao and W.L. Ruzzo, BMC Bioinformatics 2006, 7
1. For each dataset k and each pair of genes p, compute the similarity fk(p) of p with respect to the k-th dataset.
2. Construct a predictor H of gene-pair similarity, e.g. logistic regression, mapping (f(p,1), …, f(p,m)) to H(f(p,1), …, f(p,m)), such that H takes a high value if the genes of p have similar functions.
Given a new gene g, find its kNN using H as the distance.
Predict the functional classes C1, …, Cn of g with confidence
Confidence(Ci) = 1 − Π (1 − Pij), with gj a neighbour of g and Ci in the set of
classes of gj (the probability that at least one prediction is correct,
that is, 1 − the probability that all predictions are wrong).
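A small sketch in Python of the confidence combination used in the last step above, assuming the per-neighbour probabilities Pij are already available; the numeric values are made up for illustration.

```python
def class_confidence(p_values):
    """Confidence(Ci) = 1 - prod_j (1 - Pij): the probability that at least one
    neighbour-based prediction of class Ci is correct."""
    prob_all_wrong = 1.0
    for p in p_values:
        prob_all_wrong *= (1.0 - p)
    return 1.0 - prob_all_wrong

# Three neighbours annotated with class Ci, individually correct with probability 0.4, 0.6, 0.5:
print(class_confidence([0.4, 0.6, 0.5]))  # 0.88
```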
Classification: CV error
• Training error
  – Empirical error
• Error on an independent test set
  – Test error
• Cross validation (CV) error
  – Leave-one-out (LOO)
  – N-fold CV
[Diagram: the N samples are repeatedly split into 1/n of the samples for testing and (n−1)/n for training; errors are counted on each test part and summarized into the CV error rate.]
Supervised learning
Two schemes of cross validation
[Diagram:
CV1: from the N samples, an LOO loop trains and tests the gene selector together with the classifier; errors are counted inside the loop.
CV2: gene selection is performed once on all N samples; an LOO loop then trains and tests only the classifier, and errors are counted.]
Supervised learning
Difference between CV1 and CV2
• CV1: gene selection within LOOCV
• CV2: gene selection before LOOCV
• CV2 can yield an optimistic estimate of the true classification error
• CV2 was used in the paper by Golub et al.:
  – 0 training errors
  – 2 CV errors (5.26%)
  – 5 test errors (14.7%)
  – CV error differs from test error!
Supervised learning
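As an illustration of CV1 (gene selection inside the LOO loop), here is a sketch using scikit-learn, which is my assumption rather than something used on the slides; wrapping the selector and the classifier in a Pipeline makes sure the gene selection is re-fit on each training fold. The data are random stand-ins.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 500))        # e.g. 38 samples x 500 genes (random placeholder)
y = np.array([0] * 27 + [1] * 11)     # e.g. 27 ALL vs 11 AML labels

# CV1: gene selection is part of the pipeline, so it is redone inside every LOO fold.
cv1 = Pipeline([("select", SelectKBest(f_classif, k=50)),
                ("knn", KNeighborsClassifier(n_neighbors=3))])
print("CV1 LOO accuracy:", cross_val_score(cv1, X, y, cv=LeaveOneOut()).mean())

# CV2 (biased): selecting the genes once on ALL samples and only then running LOOCV
# leaks information from the test samples into the selection step.
```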
Significance of classification results
• Permutation test:
  – Permute the class labels of the samples
  – Compute the LOOCV error on the data with permuted labels
  – Repeat the process a large number of times
  – Compare with the LOOCV error on the original data:
    • P-value = (# times LOOCV error on permuted data <= LOOCV error on original data) / total # of permutations considered
Supervised learning
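A sketch of this permutation test in plain Python. The `cv_error` argument stands for whatever LOOCV routine is being used (for instance the `loocv_error` sketch shown earlier); the number of permutations is an arbitrary choice.

```python
import random

def permutation_p_value(X, y, cv_error, n_permutations=1000, seed=0):
    """P-value = fraction of label permutations whose LOOCV error is <= the
    LOOCV error obtained with the original labels."""
    rng = random.Random(seed)
    observed = cv_error(X, y)
    count = 0
    for _ in range(n_permutations):
        y_perm = list(y)
        rng.shuffle(y_perm)
        if cv_error(X, y_perm) <= observed:
            count += 1
    return count / n_permutations

# Hypothetical usage, reusing the earlier LOOCV sketch:
# p = permutation_p_value(X, y, lambda X, y: loocv_error(X, y, fit_1nn, predict_1nn))
```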
Unsupervised Learning
ML for unsupervised learning attempts to discover interesting structure in the available data.
Unsupervised learning
Unsupervised Learning
• The dataset describes the structure of an unknown (but existent) system.
• The computer program tries to identify the structure of the system (clustering, data compression).
• ML techniques: hierarchical clustering, k-means, Self-Organizing Maps (SOM), fuzzy clustering (described in a future lesson).
Clustering
 Clustering is one of the most important unsupervised
learning processes for organizing objects into groups
whose members are similar in some way.
 Clustering finds structures in a collection of unlabeled
data.
 A cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
Clustering Algorithms
• Start with a collection of n objects, each represented by a p-dimensional feature vector xi, i = 1, …, n.
• The goal is to assign the n objects to k clusters so that objects within a cluster are more similar to each other than to objects in different clusters. k is usually unknown.
• Popular methods: hierarchical clustering, k-means, SOM, …
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Hierarchical Clustering
Dendrogram
Venn Diagram of
Clustered Data
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
Hierarchical Clustering (Cont.)
• Multilevel clustering: level 1 has n clusters; level n has one cluster.
• Agglomerative HC: starts with singleton clusters and merges them.
• Divisive HC: starts with a single cluster containing all samples and splits clusters.
Nearest Neighbor Algorithm
• Nearest Neighbor Algorithm is an agglomerative
approach (bottom-up).
• Starts with n nodes (n is the size of our sample),
merges the 2 most similar nodes at each step,
and stops when the desired number of clusters
is reached.
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
Nearest Neighbor, Level 2, k = 7 clusters.
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
Nearest Neighbor, Level 3, k = 6 clusters.
Nearest Neighbor, Level 4, k = 5 clusters.
Nearest Neighbor, Level 5, k = 4 clusters.
Nearest Neighbor, Level 6, k = 3 clusters.
Nearest Neighbor, Level 7, k = 2 clusters.
Nearest Neighbor, Level 8, k = 1 cluster.
Hierarchical Clustering
Keys: similarity and clustering (linkage)
1. Calculate the similarity between all possible combinations of two profiles.
2. The two most similar clusters are grouped together to form a new cluster.
3. Calculate the similarity between the new cluster and all remaining clusters; repeat.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Clustering in Bioinformatics
• Microarray data quality checking
  – Do replicates cluster together?
  – Do similar conditions, time points, and tissue types cluster together?
• Cluster genes → predict the functions of unknown genes from known ones
• Cluster samples → discover clinical characteristics (e.g. survival, marker status) shared by samples
• Promoter analysis of commonly regulated genes
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Functionally significant gene clusters
Two-way clustering: sample clusters and gene clusters
Bhattacharjee et al. (2001). Human lung carcinomas mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA, Vol. 98, 13790-13795.
Similarity Measurements
• Pearson Correlation
Two profiles (vectors) x = (x1, …, xN) and y = (y1, …, yN):

  C_pearson(x, y) = Σ_{i=1..N} (xi − mx)(yi − my) / sqrt( [Σ_{i=1..N} (xi − mx)²] · [Σ_{i=1..N} (yi − my)²] )

  where mx = (1/N) Σ_{n=1..N} xn and my = (1/N) Σ_{n=1..N} yn

  −1 ≤ Pearson Correlation ≤ +1
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Similarity Measurements
• Pearson Correlation: Trend Similarity
  b = 0.5 · a
  c = a + 0.2
  C_pearson(a, b) = 1
  C_pearson(a, c) = 1
  C_pearson(b, c) = 1
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Similarity Measurements
• Euclidean Distance
For two profiles x = (x1, …, xN) and y = (y1, …, yN):

  d(x, y) = sqrt( Σ_{n=1..N} (xn − yn)² )
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Similarity Measurements
• Euclidean Distance: Absolute difference
  b = 0.5 · a
  c = a + 0.2
  d(a, b) = 2.8025
  d(a, c) = 1.5875
  d(b, c) = 3.2211
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
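A short NumPy sketch of the two similarity measures; the example profiles follow the b = 0.5·a and c = a + 0.2 pattern above, but since the vector a here is a made-up placeholder the numeric distances differ from those on the slide.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / np.sqrt(np.dot(xc, xc) * np.dot(yc, yc)))

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

a = np.array([1.0, 2.0, 3.0, 4.0])  # placeholder profile
b = 0.5 * a                         # same trend, different scale
c = a + 0.2                         # same trend, shifted

print(pearson(a, b), pearson(a, c))      # both 1.0: correlation captures the trend
print(euclidean(a, b), euclidean(a, c))  # ~2.74 and 0.4: distance captures the absolute difference
```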
Clustering
[Diagram: three clusters C1, C2, C3. Merge which pair of clusters?]
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Clustering: Single Linkage
Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.
Tends to generate "long chains".
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Clustering: Complete Linkage
Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.
Tends to generate "clumps".
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Clustering: Average Linkage
Dissimilarity between two clusters = averaged distances of all pairs of objects (one from each cluster).
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Clustering: Average Group Linkage
Dissimilarity between two clusters = distance between the two cluster means.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
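As a small illustration of these linkage choices, here is a sketch using SciPy's agglomerative clustering; SciPy is my assumption (the slides do not name a library) and the expression matrix is a random placeholder. The "centroid" method corresponds to average group linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))   # e.g. 20 gene profiles x 50 conditions (placeholder data)

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method, metric="euclidean")   # build the agglomerative merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")     # cut the dendrogram into 3 clusters
    print(method, labels[:10])
```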
Considerations
• What genes are used
to cluster samples?
– Expression variation
– Inherent variation
– Prior knowledge
(irrelevant genes)
– Etc.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
K-means Clustering
– Initialize the K cluster representatives w_j, e.g. to randomly chosen examples.
– Assign each input example x to the cluster c(x) with the nearest corresponding weight vector:
    c(x) = argmin_j || x − w_j(n) ||
– Update the weights:
    w_j(n+1) = ( Σ_{x : c(x) = j} x ) / n_j,  with n_j the number of examples assigned to cluster j
– Increment n by 1 and repeat until no noticeable changes of the cluster representatives occur.
Unsupervised learning
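A compact NumPy sketch of the k-means loop just described; the two-blob data, K = 2, and the stopping check are placeholder choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each example to the nearest representative w_j,
    then move each w_j to the mean of the examples assigned to it."""
    rng = np.random.default_rng(seed)
    w = X[rng.choice(len(X), size=k, replace=False)]        # init from randomly chosen examples
    assignments = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - w[None, :, :], axis=2)  # distances to every w_j
        assignments = dists.argmin(axis=1)                             # c(x) = argmin_j ||x - w_j||
        new_w = np.array([X[assignments == j].mean(axis=0) if np.any(assignments == j) else w[j]
                          for j in range(k)])
        if np.allclose(new_w, w):                                      # no noticeable change -> stop
            break
        w = new_w
    return w, assignments

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),     # two synthetic blobs
               rng.normal(3.0, 0.5, size=(20, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```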
Example I
Initial Data and Seeds
Final Clustering
Unsupervised learning
Example II
Initial Data and Seeds
Final Clustering
Unsupervised learning
SOM: Brain’s self-organization
The brain maps the external multidimensional representation of the world onto a similar 1- or 2-dimensional internal representation. That is, the brain processes the external signals in a topology-preserving way.
Mimicking the way the brain learns, our clustering system should be able to do the same thing.
Unsupervised learning
Self-Organized Map: idea
Data: vectors X^T = (X1, …, Xd) from a d-dimensional space.
Grid of nodes, with local processor (called neuron) in each node.
Local processor # j has d adaptive parameters W(j).
Goal: change W(j) parameters to recover data clusters in X space.
Unsupervised learning
Training process
[Diagram: x = data points, o = positions of the neuron weights; the N-dimensional data space is shown on the left, with the weights pointing to points in N-D, and the 2-D grid of neurons on the right.]
Java demos: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
Unsupervised learning
Concept of the SOM
[Diagram: data in the input space are mapped to cluster centers (code vectors), which are placed in a reduced feature space; the cluster centers are clustered and ordered on a two-dimensional grid.]
Unsupervised learning
Concept of the SOM
We can use it for visualization, for clustering, and for classification.
Unsupervised learning
SOM: learning algorithm
• Initialization: n = 0. Choose small random values for the weight vector components.
• Sampling: select an x from the input examples.
• Similarity matching: find the winning neuron i(x) at iteration n:
  i(x) = argmin_j || x(n) − w_j(n) ||
• Updating: adjust the weight vectors of all neurons using the rule
  w_j(n+1) = w_j(n) + η(n) · h_{i(x)}(d_{i(x),j}) · [x(n) − w_j(n)]
• Continuation: n = n + 1. Go to the Sampling step until no noticeable changes in the weights are observed.
Unsupervised learning
Neighborhood Function
– Gaussian neighborhood function:
  h_i(d_ij) = exp( − d_ij² / (2σ²) )
– d_ij: lateral distance of neurons i and j
  • in a 1-dimensional lattice: | j − i |
  • in a 2-dimensional lattice: || r_j − r_i ||, where r_j is the position of neuron j in the lattice.
Unsupervised learning
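A compact NumPy sketch of the SOM training loop with the Gaussian neighbourhood above; the grid size, the linearly decaying learning-rate and neighbourhood-width schedules, and the random data are my assumptions.

```python
import numpy as np

def train_som(X, grid=(5, 5), n_iter=2000, eta0=0.5, sigma0=2.0, seed=0):
    """Online SOM: find the winning neuron i(x), then pull every neuron toward the
    sample, weighted by a Gaussian of its lattice distance to the winner."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)  # lattice positions r_j
    w = rng.normal(scale=0.1, size=(rows * cols, X.shape[1]))                        # small random weights
    for n in range(n_iter):
        x = X[rng.integers(len(X))]                              # sampling
        winner = np.argmin(np.linalg.norm(x - w, axis=1))        # i(x) = argmin_j ||x - w_j||
        eta = eta0 * (1.0 - n / n_iter)                          # decaying learning rate (assumed schedule)
        sigma = sigma0 * (1.0 - n / n_iter) + 1e-2               # shrinking neighbourhood width
        d = np.linalg.norm(pos - pos[winner], axis=1)            # lattice distance ||r_j - r_i(x)||
        h = np.exp(-d ** 2 / (2.0 * sigma ** 2))                 # Gaussian neighbourhood h_i(d)
        w += eta * h[:, None] * (x - w)                          # w_j <- w_j + eta * h * (x - w_j)
    return w, pos

X = np.random.default_rng(1).normal(size=(200, 3))   # placeholder data in a 3-D input space
weights, positions = train_som(X)
print(weights.shape)                                  # (25, 3): one code vector per grid node
```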
Initial h function (Example)
Unsupervised learning
Some examples of real-life
applications
Helsinki University of Technology web site
http://www.cis.hut.fi/research/refs/
Contains > 5000 papers on SOM and its applications:
• Brain research: modeling of formation of various
topographical maps in motor, auditory, visual and
somatotopic areas.
• Clustering of genes, protein properties, chemical compounds, speech phonemes, sounds of birds and insects, astronomical objects, economic data, business and financial data ....
• Data compression (images and audio), information filtering.
• Medical and technical diagnostics.
Unsupervised learning
Issues in Clustering
• How many clusters?
  – User parameter
  – Use model selection criteria (Bayesian Information Criterion) with a penalization term which considers model complexity. See e.g. X-means: http://www2.cs.cmu.edu/~dpelleg/kmeans.html
• What similarity measure?
  – Euclidean distance
  – Correlation coefficient
  – Ad-hoc similarity measures
Unsupervised learning
Validation of clustering results
• External measures
  – According to some external knowledge
  – Consideration of bias and subjectivity
• Internal measures
  – Quality of clusters according to the data
  – Compactness and separation
  – Stability
  – …
See e.g. J. Handl, J. Knowles, D.B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics, 21(15):3201-3212, 2005
Unsupervised learning
Bioinformatics Application
T.R. Golub et al., Science 286, 531
(1999)
Molecular Classification of Cancer:
Class Discovery and Class Prediction by Gene Expression Monitoring
Unsupervised learning
Identification of cancer types
• Why is identification of the cancer class (tumor sub-type) important?
  – Cancers of identical grade can have widely variable clinical courses (e.g. acute lymphoblastic leukemia vs. acute myeloid leukemia).
• Traditional methods:
  – Morphological appearance.
  – Enzyme-based histochemical analyses.
  – Immunophenotyping.
  – Cytogenetic analysis.
Unsupervised learning
Golub et al 1999
Class Prediction
• How could one use an initial collection of samples belonging to known classes to create a class predictor?
  – Identification of informative genes
  – Weighted vote
Unsupervised learning
Golub et al slides
Data
• Initial sample: 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis.
• Independent sample: 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).
Unsupervised learning
Golub et al slides
Validation of Gene Voting
• Initial samples: 36 of the 38 samples were predicted as either AML or ALL, and two as uncertain. All 36 predictions agree with the clinical diagnosis.
• Independent samples: 29 of the 34 samples are strongly predicted, with 100% accuracy.
Unsupervised learning
Golub et al slides
Class Discovery
• Can cancer classes be discovered
automatically based on gene expression?
– Cluster tumors by gene expression
– Determine whether the putative classes
produced are meaningful.
Unsupervised learning
Golub et al slides
Cluster tumors
 Self-Organizing Map (SOM)
 Mathematical cluster analysis for recognizing and classifying features in complex, multidimensional data (similar to the K-means approach)
 Chooses a geometry of "nodes"
 Nodes are mapped into K-dimensional space, initially at random.
 Iteratively adjusts the nodes.
Unsupervised learning
Golub et al slides
Validation of SOM
• Prediction based on cluster A1 and A2:
– 24/25 of the ALL samples from the initial dataset were clustered in group A1
– 10/13 of the AML samples from the initial dataset were clustered in group A2
Unsupervised learning
Golub et al slides
Validation of SOM
• How could one evaluate the putative cluster if the “right”
answer were not known?
– Assumption: class discovery could be tested by class
prediction.
• Testing of Assumption:
– Construct Predictors based on clusters A1 and A2.
– Construct Predictors based on random clusters
Unsupervised learning
Golub et al slides
Validation of SOM
• Predictions using the predictors based on clusters A1 and A2 yield 34 accurate predictions, one error, and three uncertains.
Unsupervised learning
Golub et al slides
Validation of SOM
Unsupervised learning
Golub et al slides
CONCLUSION
• In Machine Learning, every technique has its assumptions and
constraints, advantages and limitations
• My view:
  – First perform simple data analysis before applying fancy high-tech ML methods
  – Possibly use different ML techniques and then ensemble the results
  – Apply the correct cross-validation method!
  – Check the significance of the results (permutation test, stability of selected genes)
  – Work in collaboration with the data producer (biologist, pathologist) whenever possible!
ML in bioinformatics