Introduction to Machine Learning for Microarray Analysis


Introduction to Machine Learning for Microarray Analysis
(for the non-computer scientist)
Jennifer Listgarten
October 2001
Outline
• Introduction to Machine Learning
• Clustering (Thoroughly)
• Principal Components Analysis (Briefly)
• Self-Organizing Maps (Briefly)
Perhaps another time:
• ‘Supervised’ vs. ‘Unsupervised’ Learning
• Neural Networks and Support Vector Machines
Outline
• Introduction to Machine Learning
• Clustering (Thoroughly)
• Principal Components Analysis (Briefly)
• Self-Organizing Maps (Briefly)
What is Machine Learning?
Definition: The ability of a computer
to recognize patterns that have occurred
repeatedly and improve its performance
based on past experience.
Questions for Machine Learning
• Which genes are co-regulated?
• Which genes have similar functional
roles?
• Do certain gene profiles correlate with
diseased patients?
• (Which genes are upregulated/downregulated?)
The Data: How to think about it
In Machine Learning, each data point is a
vector.
Example:
Patient_X = (gene_1, gene_2, gene_3, …, gene_N)
(each entry, e.g. gene_3, is the expression ratio for that gene)
Each vector 'lives' in a high-dimensional space.
N is normally larger than 2, so we can’t visualize the data.
Our Goal
Tease out the structure of our data from the
high-dimensional space in which it lives.
(Figure: two groups of points in this space, breast cancer patients and healthy patients; each patient is a vector Patient_X = (gene_1, gene_2, gene_3, …, gene_N).)
Two ways to think of data for
Microarray Experiments
1. All genes for one patient make a vector:
Patient_X= (gene_1, gene_2, gene_3, …, gene_N)
2. All experiments for one gene make a
vector:
Gene_X= (experiment_1, experiment_2 …, experiment_N)
Outline
• Introduction to Machine Learning
• Clustering
– Hierarchical Clustering
– K-means Clustering
– Stability of Clusters
• Principal Components Analysis
• Self-Organizing Maps
Clustering Data
Want ‘similar’ data to group together.
Problems:
• Don’t know which definition of ‘similar’ to use in
order to extract useful information.
• Without external validation, difficult to tell if
clustering is meaningful.
• How many clusters?
Similarity
• Metric: Formal name for ‘measure of similarity’
between 2 points.
• Every clustering algorithm (hierarchical, k-means,
etc.) needs to decide on a metric.
• Can argue in favour of various metrics, but no
correct answer.
Some Metrics
• Euclidean distance:
  $d(X, Y) = \sqrt{\sum_i (X_i - Y_i)^2}$
• Correlation based:
  $s(X, Y) = \frac{1}{N} \sum_i \left(\frac{X_i - \bar{X}}{\sigma_X}\right) \left(\frac{Y_i - \bar{Y}}{\sigma_Y}\right)$
• Mutual Information based:
  $I(X, Y) = \sum p(X, Y) \log \frac{p(X, Y)}{p(X)\, p(Y)}$
X and Y here are two vectors (e.g. two patients); a sketch of all three measures follows below.
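For concreteness, here is a minimal Python/NumPy sketch of these three measures; the histogram-based mutual-information estimate (and its bin count) is an illustrative assumption, since the slides do not specify how the probabilities are estimated.

```python
import numpy as np

def euclidean(x, y):
    # sqrt of the summed squared differences between the two vectors
    return np.sqrt(np.sum((x - y) ** 2))

def correlation_similarity(x, y):
    # (1/N) * sum of z-scored products, i.e. the Pearson correlation
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return np.mean(zx * zy)

def mutual_information(x, y, bins=10):
    # Discretize both vectors, then sum p(x,y) * log( p(x,y) / (p(x) p(y)) )
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal over y
    py = pxy.sum(axis=0, keepdims=True)   # marginal over x
    nz = pxy > 0                          # only occupied bins contribute
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
```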
Outline
• Introduction to general Machine Learning
• Clustering
– Hierarchical Clustering
– K-means Clustering
– Stability of Clusters
• Principal Components Analysis
• Self-Organizing Maps
Outline of Hierarchical
Clustering
1. Start by making every data point its own
cluster.
2. Repeatedly combine ‘closest’ clusters until
only one is left.
(example to follow)
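A small sketch of this procedure using SciPy; the toy data and the choice of Euclidean distance with average linkage are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data: 7 'patients' (rows), each a vector of 5 gene-expression ratios.
rng = np.random.default_rng(0)
patients = rng.normal(size=(7, 5))

# Repeatedly merge the two closest clusters until only one remains.
Z = linkage(patients, method="average", metric="euclidean")

# Each row of Z records one merge: (cluster i, cluster j, distance, new size).
print(Z)

# Leaf order of the resulting dendrogram (no_plot returns the layout only).
print(dendrogram(Z, labels=list("ABCDEFG"), no_plot=True)["ivl"])
```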
Simple Example - Step 1
Initially each datum (a high-dimensional point) is its own cluster.
(Figure: seven data points, A through G. Legend: data point (patient); one cluster.)
Simple Example - Step 2
Combine the two closest clusters.
(Figure: E and F, the two closest points, now form one cluster; dendrogram leaves so far: E F.)
Simple Example - Step 3
Again: combine the next two closest clusters.
(Figure: B and C now form one cluster; dendrogram leaves so far: E F B C.)
Simple Example - Step 4
And again...
(Figure: dendrogram leaves so far: D E F B C.)
Simple Example - Step 5
And again...
(Figure: dendrogram leaves so far: D E F B C A.)
Simple Example - Step 6
And again...
(Figure: dendrogram leaves so far: G D E F B C A.)
Simple Example - Step 7
And again... now all the data points belong to a single cluster.
(Figure: the complete dendrogram over G D E F B C A; the dendrogram's vertical axis is the metric scale at which clusters merge.)
Digression:‘Distance’ between
clusters
3 common ways:
1. Single-Linkage
2. Complete-Linkage
3. Average-Linkage
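As a rough NumPy illustration (the function names are mine), the three rules differ only in how they combine the pairwise point-to-point distances between the two clusters:

```python
import numpy as np

def pairwise_distances(a, b):
    # All Euclidean distances between points of cluster a and cluster b.
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

def single_linkage(a, b):
    return pairwise_distances(a, b).min()    # distance of the closest pair

def complete_linkage(a, b):
    return pairwise_distances(a, b).max()    # distance of the farthest pair

def average_linkage(a, b):
    return pairwise_distances(a, b).mean()   # average over all pairs
```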
What we get out of HC
• A hierarchical set of clusters.
• A dendrogram showing which data points are most closely related, as defined by the metric.
(Figure: a dendrogram over data points A-G.)
What we get out of HC
(Figure: the same dendrogram over A-G.)
• Can we tell how data points are related by looking at their horizontal positions in the dendrogram...?
• Must be careful about interpreting the dendrogram - example to follow.
Notice that we can swap branches while maintaining the tree structure.
(Dendrogram leaves before the swap: G D E F B C A; after swapping E and F: G D F E B C A.)
Again...
(Swapping the two top-level branches: G D F E B C A becomes B C A G D F E.)
Again...
How many ways to swap branches if there are N data points?
2^(N-1)
For N=100, 2^(N-1) ≈ 6.3 x 10^29
(Dendrogram leaves: B C A G D F E.)
Two data points that were close together in one tree may be far apart in another.
1. G and A far apart (leaves: G D F E B C A)
2. G and A close (leaves: B C A G D F E)
There is a way to help overcome the arbitrariness of the branches: Self-Organizing Maps (SOMs) - discussed later.
Lesson Learned
Be careful not to overinterpret the
results of hierarchical clustering
(along the horizontal axis).
What is HC used for?
• Typically, grouping genes that are co-regulated. (Could be used for grouping patients too.)
• While useful, it is a relatively simple, unsophisticated tool.
• It is more of a visualization tool than a mathematical model.
Outline
• Introduction to Machine Learning
• Clustering
– Hierarchical Clustering
– K-means Clustering
• Principal Components Analysis
• Self-Organizing Maps
K-Means Clustering
Goal: Given a desired number of clusters,
• find the cluster centres
• find out which cluster each data point belongs to
(Figure: data points grouped around 3 cluster centres; we must specify that we want 3 clusters. Legend: data point (patient); cluster centre.)
Outline of K-Means Clustering
•Step 1 - Decide how many clusters (let’s say 3).
•Step 2 - Randomly choose 3 cluster centers.
Outline of K-Means Clustering
•Step 3 - Choose a metric.
•Step 4 - Assign each point to the cluster it is 'closest' to (according to the metric).
Outline of K-Means Clustering
•Step 5 - Recalculate cluster centres using
means of points that belong to a cluster.
•Step 6 - Repeat until convergence (or fixed
number of steps, etc.).
(Figure: newly calculated cluster centres.)
Outline of K-Means Clustering
Another step.... assign points to clusters.
(Figure: points reassigned to their closest centre.)
Outline of K-Means Clustering
And the final step.... reposition the means.
(Figure: newly calculated cluster centres.)
Outline of K-Means Clustering
And the final step.... reassign points.
(Figure: points reassigned to their closest centre.)
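Putting the steps above together, a minimal NumPy sketch of the loop; the toy default of k = 3, the fixed iteration cap, and the handling of empty clusters are illustrative assumptions.

```python
import numpy as np

def k_means(data, k=3, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose k data points as the initial cluster centres.
    centres = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 4: assign each point to its closest centre (Euclidean metric).
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centre as the mean of its assigned points
        # (keeping the old centre if a cluster happens to end up empty).
        centres = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
    return centres, labels
```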
Variations of K-Means:
•K-Median Clustering (uses median to find
new cluster positions instead of mean).
Use the median/mean/etc. to reposition the cluster centre.
Related to K-Means Clustering:
•Mixture of Gaussians (now clusters have a
width as well) - Gaussian Probability
Distribution instead of a metric. Other
differences too.
$p(X \mid \mathrm{cluster}_k) = \frac{1}{(2\pi\sigma_k^2)^{d/2}} \exp\left(-\frac{\|X - \mu_k\|^2}{2\sigma_k^2}\right)$
Soft Partition (vs. Hard)
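For illustration, a NumPy sketch of the resulting soft partition: under the spherical-Gaussian density above, every point gets a membership probability for every cluster rather than a single hard label (equal mixing weights are an assumption of this sketch).

```python
import numpy as np

def soft_assignments(data, means, sigmas):
    # data: (n, d); means: (k, d); sigmas: (k,) cluster widths.
    data, means = np.asarray(data, float), np.asarray(means, float)
    sigmas = np.asarray(sigmas, float)
    d = data.shape[1]
    # p(X | cluster_k) for a spherical Gaussian with width sigma_k.
    sq_dist = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    dens = np.exp(-sq_dist / (2 * sigmas ** 2)) / (2 * np.pi * sigmas ** 2) ** (d / 2)
    # Normalize across clusters: a soft partition instead of a hard one.
    return dens / dens.sum(axis=1, keepdims=True)
```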
Comments on K-Means
Clustering
• May not converge nicely, need multiple
random restarts.
• Results are straightforward (unlike
hierarchical clustering).
• Still a relatively simple tool - not much
mathematical modelling.
Comments on K-Means
Clustering
• Earlier: showed random initialization.
• Can run hierarchical clustering to initialize the K-Means algorithm (see the sketch below).
• This can help with convergence problems as well as speed up the algorithm.
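A sketch of one way to do this with SciPy: cut the hierarchical-clustering tree into k groups and use the group means as the starting centres. Treating those means as the initial centres is my assumption about how the initialization might be implemented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_init(data, k):
    # Build the tree, cut it into k clusters, and return the group means.
    Z = linkage(data, method="average", metric="euclidean")
    labels = fcluster(Z, t=k, criterion="maxclust")   # labels are 1..k
    return np.array([data[labels == j].mean(axis=0) for j in range(1, k + 1)])

# These centres could then replace the random initialization in k_means above.
```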
Outline
• Introduction to Machine Learning
• Clustering
– Hierarchical Clustering
– K-means Clustering
– Stability of Clusters
• Principal Components Analysis
• Self-Organizing Maps
Stability of Clusters
• Ideally, we want different (good) clustering techniques to provide similar results (one way to check this is sketched below).
• Otherwise the clustering is likely arbitrary, not modelling any true structure in the data set.
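One simple way to quantify such agreement is to compare the labels produced by two different methods, for example with the adjusted Rand index; a sketch using scikit-learn (the toy data and the choice of index are illustrative assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 10))       # toy expression matrix (60 samples)

labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
labels_hc = AgglomerativeClustering(n_clusters=3).fit_predict(data)

# Near 1: the two clusterings largely agree; near 0: agreement is no
# better than chance, suggesting the structure may be arbitrary.
print(adjusted_rand_score(labels_km, labels_hc))
```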
Outline
• Introduction to Machine Learning
• Clustering (Thoroughly)
• Principal Components Analysis (Briefly)
• Self-Organizing Maps (Briefly)
Principal Components Analysis
(PCA)
• Mathematical technique to reduce the
dimensionality of the data.
PCA - Dimension Reduction
(Figure: the same 3D object projected to 2D in two different ways, 3D → 2D.)
•Some projections are more informative than others.
•While the projections differ, the object remains unchanged in the original space.
Why?
• Instead of clustering 20,000 dimensional
data - cluster 100 dimensional data.
• Typically some dimensions are
redundant.
• Might eliminate noise, get more
meaningful clusters (?)
Why?
• Instead of clustering 20,000 dimensional
data - cluster 100 dimensional data.
• Typically some dimensions are
redundant.
• Might eliminate noise, get more
meaningful clusters (?)
 Fewer dimension means easier to
initialize ML algorithms and get good
results.
Simple 2D Example
One could cluster these 2D points in 2 dimensions.
(Figure: a 2D scatter of the points, with X and Y axes.)
But... what if...
Simple 2D Example
We squashed them into 1D ('squash' = geometric projection).
(Figure: the points projected onto the X axis.)
Simple 2D Example
•The Y dimension was redundant.
•We only needed the X variable to cluster nicely.
(Figure: the 2D scatter and its 1D projection onto the X axis.)
Another 2D Example
It is not obvious which dimension to keep now.
(Figure: a 2D scatter with X and Y axes.)
Another 2D Example
There is no single axis onto which we can project to get good separation of the data...
(Figure: projections onto the X axis and onto the Y axis; neither separates the data well.)
Another 2D Example
But if we project the data onto a combination (linear combination) of the two dimensions... it works out nicely.
(Figure: projection onto a diagonal direction separates the data.)
That was the Intuition Behind PCA
Outline of PCA:
• Step 1 - Find the direction that accounts
for the largest amount of variation in the
data set, call this E1.
(Figure: E1, the first principal component, drawn through the data in the X-Y plane.)
• Step 2 - Find the direction which is
perpendicular (orthogonal/uncorrelated)
to E1 and accounts for the next largest
amount of variation in the data set, call
this E2 .
• Step 3 - Find the 3rd best direction, orthogonal to the other 2 directions - call this E3.
  ...
• Step N - Find the Nth such direction (if there were N dimensions to begin with).
PCA - Some Linear Algebra...
Singular Value Decomposition (SVD)
$C = U \Lambda V^T$

$C = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1N} \\ \vdots & & & \vdots \\ c_{N1} & c_{N2} & \cdots & c_{NN} \end{pmatrix}$,
$U = \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1N} \\ \vdots & & & \vdots \\ u_{N1} & u_{N2} & \cdots & u_{NN} \end{pmatrix}$,
$\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_N \end{pmatrix}$,
$V = \begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1N} \\ \vdots & & & \vdots \\ v_{N1} & v_{N2} & \cdots & v_{NN} \end{pmatrix}$

• C is the covariance matrix of the original data.
• The columns of U are the principal components (e.g. the 2nd column is the 2nd principal component).
• The diagonal entries λ1, ..., λN give the variance in each component.
Principal Components Analysis
• A typical dimensionality reduction might be 1 million → 200,000, which might retain 95% of the original information.
• How much reduction is possible depends on the data set (see the sketch below).
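A minimal NumPy sketch of the decomposition above, including a check of how many directions are needed to retain a chosen fraction of the variance; the toy data and the 95% cut-off rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 200))          # 50 samples, 200 'genes'

centred = data - data.mean(axis=0)
C = np.cov(centred, rowvar=False)          # covariance matrix of the original data

# C = U diag(lambda) V^T: columns of U are the principal components,
# the lambdas give the variance along each component.
U, lam, Vt = np.linalg.svd(C)

# How many directions are needed to retain 95% of the total variance?
retained = np.cumsum(lam) / lam.sum()
n_keep = np.searchsorted(retained, 0.95) + 1
print(n_keep)

# Project the data onto the first n_keep principal components.
reduced = centred @ U[:, :n_keep]
```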
PCA - Useful for clustering?
• It turns out that PCA can lead to worse clustering than simply using the original data (though not necessarily).
• PCA is often used in conjunction with
other techniques, such as Artificial Neural
Networks, Support Vector Machines, etc.
PCA - Interesting Side Note
• PCA is the basis for
most of the Face
Recognition
systems currently in
use.
• e.g. security in airports, etc.
(Figure: the principal 'face' directions, sometimes called eigenfaces.)
Outline
• Introduction to Machine Learning
• Clustering (Thoroughly)
• Principal Components Analysis (Briefly)
• Self-Organizing Maps (Briefly)
Self-Organizing Maps (SOMs)
• Way to visualize high-dimensional data in
a low-dimensional space.
• Commonly view data in 1 or 2
dimensions with SOMs.
Self-Organizing Maps (SOMs)
Can think of SOMs as a cross between K-Means Clustering and PCA:
• K-Means: find the cluster centres.
• PCA: reduce the dimensionality of the cluster centres (i.e. impose a 'structure' on the relationship between the cluster centres).
(At the same time!)
Self-Organizing Maps
Example: Impose a 1D structure on the cluster centres.
(Figure: data in 5000 dimensions mapped onto 1 dimension.)
Self-Organizing Maps
(Figure: four cluster centres, numbered 1-4, in the 5000-dimensional space are mapped onto a 1-dimensional chain 1-2-3-4.)
•This imposes an ordering on the cluster centres.
Self-Organizing Maps
(Figure: data points numbered 1-19 scattered around the four cluster centres in the 5000-dimensional space, with the centres mapped onto the 1-dimensional chain 1-2-3-4.)
•Data points from Cluster 1 come before points in Cluster 2, etc.
•Then order based on proximity to neighbouring clusters.
•This imposes an ordering on the data points.
Self-Organizing Maps
(Figure: the same mapping of the numbered data points onto the 1D chain of cluster centres.)
What is important to know for our immediate interest: the SOM imposes a unique ordering on the data points.
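A rough sketch of a 1D SOM in plain NumPy. The learning-rate and neighbourhood schedules are illustrative assumptions, not the exact algorithm of any particular package; the point is just that the cluster centres are constrained to a 1D chain, which induces an ordering.

```python
import numpy as np

def som_1d(data, n_nodes=4, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    # Cluster centres arranged on a 1D chain: node 0, 1, ..., n_nodes-1.
    nodes = data[rng.choice(len(data), size=n_nodes, replace=False)].astype(float)
    for t in range(n_iter):
        lr = 0.5 * (1 - t / n_iter)                    # shrinking learning rate
        radius = max(1.0 * (1 - t / n_iter), 0.5)      # shrinking neighbourhood
        x = data[rng.integers(len(data))]
        winner = int(np.argmin(np.linalg.norm(nodes - x, axis=1)))
        # Pull the winning node and its chain neighbours towards the point.
        for j in range(n_nodes):
            h = np.exp(-((j - winner) ** 2) / (2 * radius ** 2))
            nodes[j] += lr * h * (x - nodes[j])
    return nodes

# The ordering: assign each data point to its closest node; the node's
# position along the chain (0..n_nodes-1) orders the data points.
```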
SOMs and Hierarchical
Clustering
(Figure: a hierarchical clustering dendrogram with leaves G D F E B C A.)
Recall the problem of arbitrariness in the order of the branches in Hierarchical Clustering? Can use SOMs to help.
SOMs, Eisen, Cluster
(Figure: a 1D SOM ordering of the data points 1-19, alongside a hierarchical clustering dendrogram with leaves G D F E B C A.)
Eisen's Cluster can use the ordering from the SOM to do hierarchical clustering:
1) Run a 1D SOM on the data set.
2) Build the dendrogram using the ordering from the SOM.
Self-Organizing Maps
• This (ordering the leaves of a hierarchical clustering) is not what SOMs are normally used for.
• Mainly used for visualization, and as a
first step before further analysis.
• Can also use 2D, 3D SOMs.
Concluding Remarks
I hope that I managed to:
1) Give some idea of what machine learning is about ('structure' in high-dimensional spaces).
2) Convey the intuition behind several techniques, and some familiarity with their names.
Main Ideas of PCA
• For N-dimensional data, find N directions.
• Can represent any one of our original data points using a linear combination of these N directions:
$Patient\_X = a_1 E_1 + a_2 E_2 + a_3 E_3 + \ldots + a_N E_N$
(where $E_N$ is the Nth direction), i.e.
$Patient\_X = (a_1, a_2, a_3, \ldots, a_N)$
Main Ideas of PCA
• Key Idea: Can represent any one of our original data points extremely well using fewer than N directions:
$Patient\_X \approx a_1 E_1 + a_2 E_2 + a_3 E_3 + \ldots + a_{N-d} E_{N-d}$
i.e.
$Patient\_X \approx (a_1, a_2, a_3, \ldots, a_{N-d})$
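A brief NumPy sketch of this truncated representation, assuming a matrix E whose columns are the principal directions (e.g. the U obtained from the SVD earlier) and keeping only the first few coefficients.

```python
import numpy as np

def truncated_representation(x, E, n_keep):
    # E: (N, N) matrix whose columns are the directions E_1, ..., E_N.
    a = E[:, :n_keep].T @ x          # coefficients a_1, ..., a_{n_keep}
    x_approx = E[:, :n_keep] @ a     # approximate reconstruction of x
    return a, x_approx
```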
Concluding Remarks
Understand the algorithms behind your
analysis. Blind application can lead to
erroneous conclusions.