u - Protein Design Group.


SOM and SOTA: Clustering methods in
the analysis of massive biological data
Joaquín Dopazo. CNIO.
Genes in the
DNA...
Between 30,000 and
100,000.
40-60% display
alternative splicing
…whose final
effect can be
different because
of the variability.
>protein kinase
acctgttgatggcgacagggactgtatgctgatct
atgctgatgcatgcatgctgactactgatgtgggg
gctattgacttgatgtctatc....
…code for the
structure of
proteins...
That undergo
posttranslational
modifications
…which accounts for
the function...
More than 3 million
SNPs have been
mapped
From genotype
to phenotype.
(only the genetic component)
…conforming complex
interaction networks...
A typical tissue
expresses between
5,000 and 10,000
genes
…provided they are expressed at
the proper time and place...
Each protein has an average
of 8 interactions
…in
cooperation
with other
proteins…
Pre-genomics scenario in the lab
>protein kinase
acctgttgatggcgacagggactgtatgctga
tctatgctgatgcatgcatgctgactactgatg
tgggggctattgacttgatgtctatc....
Bioinformatics tools for pre-genomic
sequence data analysis
Phylogenetic
tree
Information
Sequence
Molecular
databases
Motif
databases
Search results
Motif
Conserved
region
The aim:
Extracting as much
information as
possible from a
single data item
alignment
Secondary and tertiary
protein structure
Post-genomic vision
Who?
Genome
sequencing
Literature,
databases
2-hybrid systems,
Mass spectrometry for
protein complexes
What do
we know?
And who else?
SNPs
Expression
Arrays
Where, when and how much?
In what way?
Post-genomic vision
genes
Information
The new tools:
interactions
Clustering
Feature selection
Multiple correlation
Datamining
Information
Databases
polymorphisms
Gene
expression
Neural Networks
Brain and computers
The brain computes in a different way from digital computers
Structural components: neurons (Ramón y Cajal, 1911) in the brain; chips in computers
Speed: slow (10^-3 s) in the brain; fast (10^-9 s) in computers
Processing units: 10 billion neurons, massively interconnected (60 trillion synapses), in the brain; one or a few in computers
The brain is a highly complex, nonlinear, and parallel computer
Neurons are organized to perform complex computations many times faster
than the fastest computers.
What is a neural network?
A neural network is a massively parallel distributed
processor able to store experiential knowledge and to
make it available for use.
It resembles the brain in two respects:
Knowledge is acquired by the network through a
learning process
Interneuron connection strengths (synaptic
weights) are used to store the knowledge.
Neural Net classifiers
Supervised
Unsupervised
Perceptrons
Kohonen SOM
Growing cell
structures
SOTA
Supervised learning: the
perceptron
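[Figure: the perceptron. Input signals x1, x2, ..., xp are multiplied by the weights w1, w2, ..., wp and added at the summing junction (Σ) to give uk; the activation function J(.), with threshold θk, turns uk into the output.]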
Supervised learning : training
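[Figure: training the perceptron. The training set (classes "up" and "down") is shown as two binary rows, 11111110000000 (next to w1) and 00000001111111 (next to w2).]
Summing junction (Σ): u = x1*w1 + x2*w2
Weights: w1 = 1, w2 = 0
Activation function J(.):
J(u) = 1 if u ≥ 1
J(u) = 0 if u < 1
Output: up = 1, down = 0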
Supervised learning: application
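[Figure: the trained perceptron applied to a new input X, with x1 = 1, x2 = 0 and weights w1 = 1, w2 = 0.]
Summing junction (Σ): u = 1*1 + 0*0 = 1
Activation function J(.):
J(u) = 1 if u ≥ 1
J(u) = 0 if u < 1
Output: 1 (up)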
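The worked example on this slide can be replayed in a few lines. This is a minimal sketch, assuming the weights w1 = 1, w2 = 0 and the threshold of 1 shown above; the function names `activation` and `perceptron` are illustrative and do not come from the slides.

```python
def activation(u, threshold=1.0):
    """Step activation J(u): 1 if u >= threshold, otherwise 0."""
    return 1 if u >= threshold else 0


def perceptron(x, w, threshold=1.0):
    """Return (u, output) for input vector x and weight vector w."""
    # Summing junction: u = x1*w1 + x2*w2 + ...
    u = sum(xi * wi for xi, wi in zip(x, w))
    return u, activation(u, threshold)


weights = (1, 0)                       # w1 = 1, w2 = 0, as on the slide
u, out = perceptron((1, 0), weights)   # u = 1*1 + 0*0 = 1
label = "up" if out == 1 else "down"
print(u, label)
```

With the same weights, the input (0, 1) gives u = 0 and falls in the "down" class.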
Supervised vs. Unsupervised learning
Supervised:
The structure of the data is known beforehand. After a training process in
which the network learns how to distinguish among classes, you use the
network for assigning new items to the predefined classes
Unsupervised:
The structure of the data is not known beforehand.
The network learns how data are distributed among classes, based on a
function of distance
Unsupervised learning:
Kohonen self-organizing maps
The basis
Sensory pathways in the brain are organised in such a
way that their arrangement reflects some physical
characteristic of the external stimulus being sensed.
The brain of higher animals seems to contain many kinds of
“maps” in the cortex.
• In visual areas there are orientation and color maps
• In the auditory cortex there are the so-called tonotopic
maps
• The somatotopic maps represent the skin surface
Kohonen SOM
The causes of self-organisation
Kohonen SOM mimics two-dimensional arrangements of
neurons in the brain. Effects leading to spatially organized
maps are:
• Spatial concentration of the network activity on the
neuron best tuned to the present input
• Further sensitization of the best matching neuron and its
topological neighborhood.
Kohonen SOM
The topology
Two-dimensional network of
cells with a hexagonal or
rectangular (or other)
arrangement.
x1, x2..xn
input
Output
nodes
Neighborhood
The neighborhood of a cell is defined as a time-dependent function
Kohonen SOM
The algorithm
Input
Step 1.
Initialize nodes to random values.
Set the initial radius of the neighborhood.
Step 2.
Present new input: Compute distances to all nodes.
Euclidean distances are commonly used
Step 3.
Select output node j* with minimum distance dj.
Update node j* and neighbors. Nodes updated for the
neighborhood NEj*(t) as:
wij(t+1) = wij(t) + η(t)(xi(t) - wij(t)); for j ∈ NEj*(t)
η(t) is a gain term that decreases in time.
Step 4.
Repeat by going to Step 2 until convergence.
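The four steps above can be sketched in plain Python. This is an illustrative sketch, not the implementation used in the talk: it assumes a rectangular map, a square neighborhood whose radius shrinks linearly, a gain that decays linearly from the hypothetical starting value `gain0`, and a fixed number of epochs in place of a convergence test.

```python
import math
import random


def train_som(data, rows, cols, epochs=50, gain0=0.5, seed=0):
    """Train a rows x cols Kohonen map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: random initial weights and initial neighborhood radius.
    nodes = {(r, c): [rng.random() for _ in range(dim)]
             for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    total = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):   # inputs in random order
            frac = 1.0 - t / total              # decreases from 1 towards 0
            gain = gain0 * frac                 # gain term, decreasing in time
            radius = radius0 * frac             # time-dependent neighborhood
            # Steps 2 and 3: Euclidean distances, then the winning node j*.
            winner = min(nodes, key=lambda j: math.dist(x, nodes[j]))
            for j, w in nodes.items():
                inside = max(abs(j[0] - winner[0]),
                             abs(j[1] - winner[1])) <= radius
                if inside:
                    # w_ij(t+1) = w_ij(t) + gain * (x_i(t) - w_ij(t))
                    for i in range(dim):
                        w[i] += gain * (x[i] - w[i])
            t += 1
    return nodes


def best_node(x, nodes):
    """Map position of the node closest to x."""
    return min(nodes, key=lambda j: math.dist(x, nodes[j]))


if __name__ == "__main__":
    points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    som = train_som(points, rows=1, cols=2, epochs=20)
    for p in points:
        print(p, "->", best_node(p, som))
```

The number of nodes (rows x cols) has to be chosen before training, which is the first limitation listed on the next slide.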
Kohonen SOM
Limitations
Arbitrary number of clusters
The number of clusters is arbitrarily fixed from the beginning.
Some clusters can remain unoccupied.
Non proportional clustering
Clusters are made based on the number of items, so distances
among them are not proportional.
Lack of the tree structure
The use of a two-dimensional structure for the net makes
it impossible to recover a tree structure that relates the clusters
and subclusters among them.
Growing cell structures
Kohonen SOMs produce a topology-preserving
mapping. That is, the
topology of the network and the
number of clusters are fixed before
the training of the network.
Growing cell structures produce
distribution-preserving mapping. The
number of clusters and the connections
among them are dynamically assigned
during the training of the network.
Insertion and deletion of neurons
• After a fixed number of adaptations,
every neuron q with a signal counter
value hq > hc (a threshold) is used to
create a new neuron:
  - The direct neighbor f of the neuron q
  having the greatest signal counter value
  is used to insert a new neuron between
  them.
  - The new neuron is connected to
  preserve the topology of the network.
• Signal counter values are adjusted in
the neighborhood.
Similarly, neurons with signal counter
values below a threshold can be
removed.
Growing cell structures
Network dynamics
Similar to those used by the Kohonen SOM, but with
several important differences:
Adaptation strength is constant over time (εb and εn for the
best matching cell and its neighborhood).
Only the best-matching cell and its neighborhood are adapted.
Adaptation implies the increment of the signal counter for the
best-matching cell and the decrement in the remaining cells.
New cells can be inserted and existing cells can be removed in
order to adapt the output map to the distribution of the input
vectors.
Growing cell structures
Limitations
Arbitrary number of clusters
The number of clusters is arbitrarily fixed from the beginning.
Some clusters can remain unoccupied.
Non proportional clustering
Clusters are made based on the number of items, so distances
among them are not proportional.
Lack of the tree structure
The use of a two-dimensional structure for the net makes
it impossible to recover a tree structure that relates the clusters
and subclusters among them.
But sometimes behind the real
world there is some hierarchy...
Many molecular data have
different levels of structured
information.
E.g., phylogenies, molecular population
data, DNA expression data (to some
extent), etc.
[Figure: groups A, B, C and D; 20 items.]
Simulation
Mapping a hierarchical structure using a
non-hierarchical method (SOM)
[Figure: resulting map, with cells labelled A,B; C,D; E,F; G; H.]
Self Organising Tree Algorithm
(SOTA)
A new neural network designed to deal with data that are
related among them by means of a binary tree topology
Dopazo & Carazo, 1997, J. Mol. Evol. 44:226-233
Derived from the Kohonen SOM and the growing
cell structures but with several key differences:
The topology of the network is a binary tree.
Only growing of the network is allowed.
The growing mimics a speciation event, producing two new
neurons from the most heterogeneous neuron.
Only terminal neurons are directly adapted by the input data,
internal neurons are adapted through terminal neurons.
SOTA:
The algorithm
Step 1.
Initialize nodes to random values.
Step 2.
Present new input: Compute distances to all terminal
nodes.
Step 3.
Select output node j* with minimum distance dj.
Update node j* and neighbors. Nodes updated for the
neighborhood NEj*(t) as:
wij(t+1) = wij(t) + η(t)(xi(t) - wij(t)); for j ∈ NEj*(t)
η(t) is a gain term that decreases in time.
Step 4.
Repeat by going to Step 2 until convergence.
Step 5.
Reproduce the node with highest variability.
The Self Organising Tree Algorithm (SOTA) is a
hierarchical divisive method based on a neural
network
SOTA, unlike other hierarchical methods, grows
from top to bottom until an appropriate level of
variability is reached
Input
Dopazo, Carazo (1997)
Herrero, Valencia, Dopazo (2001)
SOTA algorithm (neighborhood)
[Figure: SOTA neighborhood. Panels: initial state; updating (cells labelled a, s and w); growing and the different neighborhoods.]
SOTA algorithm
Initialise system.
Cycle: repeat as many epochs as
necessary to get convergence in the
present state of the network.
Convergence: the relative error of the
network falls below a threshold.
Epoch: the winner, mother and sister cells are updated.
When a cycle finishes, the network
size increases: two new neurons are
attached to the neuron with the highest
resources. This neuron becomes a
mother neuron and does not receive
direct inputs any more.
[Flowchart: Initialise system -> Cycle (epochs) -> Cycle convergence? NO: run another epoch. YES: Network convergence? NO: add cell and start a new cycle. YES: End.]
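The cycle and growth loop can be sketched as follows. This is a simplified illustration and not the published SOTA code. Assumptions: the tree starts from the mean of the data rather than random values, the update rates in `rates` (winner, mother, sister) are arbitrary, the "resource" of a cell is taken as the mean distance of the inputs it wins, and growth stops when the largest resource drops below the hypothetical `max_resource` or when `max_cells` terminal cells exist.

```python
import math


class Cell:
    def __init__(self, weights, mother=None):
        self.w = list(weights)
        self.mother = mother
        self.children = []

    def grow(self):
        # The cell becomes a mother; two daughters inherit its weights.
        self.children = [Cell(self.w, self), Cell(self.w, self)]


def move(cell, x, rate):
    cell.w = [w + rate * (xi - w) for w, xi in zip(cell.w, x)]


def terminals(root):
    stack, leaves = [root], []
    while stack:
        cell = stack.pop()
        if cell.children:
            stack.extend(cell.children)
        else:
            leaves.append(cell)
    return leaves


def epoch(data, leaves, rates):
    """Present every input once; return the summed distance to the winners."""
    eta_w, eta_m, eta_s = rates
    error = 0.0
    for x in data:
        winner = min(leaves, key=lambda c: math.dist(x, c.w))
        error += math.dist(x, winner.w)
        move(winner, x, eta_w)
        mother = winner.mother
        sister = next(c for c in mother.children if c is not winner)
        if not sister.children:
            # Sister is still terminal: mother and sister follow the winner.
            move(mother, x, eta_m)
            move(sister, x, eta_s)
    return error


def resources(data, leaves):
    """Mean distance of the inputs assigned to each terminal cell."""
    assigned = {c: [] for c in leaves}
    for x in data:
        winner = min(leaves, key=lambda c: math.dist(x, c.w))
        assigned[winner].append(math.dist(x, winner.w))
    return {c: (sum(d) / len(d) if d else 0.0) for c, d in assigned.items()}


def sota(data, max_resource=0.5, max_cells=16, rates=(0.1, 0.05, 0.01),
         cycle_tol=1e-3, max_epochs=200):
    dim = len(data[0])
    mean = [sum(x[i] for x in data) / len(data) for i in range(dim)]
    root = Cell(mean)
    root.grow()                                   # initialise system
    while True:
        leaves = terminals(root)
        previous = None
        for _ in range(max_epochs):               # one cycle = several epochs
            error = epoch(data, leaves, rates)
            if previous is not None and \
                    abs(previous - error) <= cycle_tol * max(previous, 1e-12):
                break                             # cycle convergence
            previous = error
        res = resources(data, leaves)
        worst = max(leaves, key=res.get)
        if res[worst] <= max_resource or len(leaves) >= max_cells:
            return root                           # network convergence
        worst.grow()                              # add cell
```

Calling `terminals(sota(data))` returns the terminal cells, each holding the average pattern of the inputs below it; the mothers keep the tree structure that a flat map cannot recover.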
Applications
Sequence analysis
Microarray data analysis
Population data analysis
Sequence analysis in the genomics era
• Massive data
• Information redundancy
Codification
Ambiguity codes.
R = {A or G}; N = {A or G or C or T}
Vectors of N x 4 (nucleotides) or N x 20
(amino acids), plus one component to
represent deletions
Other possible codifications: Frequencies of dipeptides or
dinucleotides
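A nucleotide codification along these lines could look like the sketch below. It is only an illustration under two assumptions that the slide does not spell out: the deletion component is added per position (five numbers per site), and an ambiguity code spreads its weight equally over the bases it stands for. Only the codes R and N mentioned above are included; the names `AMBIGUITY` and `encode` are hypothetical.

```python
NUCLEOTIDES = "ACGT"

# Bases represented by each symbol; R and N are the ambiguity codes above.
AMBIGUITY = {"A": "A", "C": "C", "G": "G", "T": "T",
             "R": "AG", "N": "ACGT"}


def encode(sequence):
    """Encode a nucleotide sequence as a flat vector, 4 + 1 numbers per site."""
    vector = []
    for symbol in sequence.upper():
        if symbol == "-":
            # Deletion: no base, only the extra component is set.
            vector.extend([0.0, 0.0, 0.0, 0.0, 1.0])
            continue
        bases = AMBIGUITY[symbol]
        share = 1.0 / len(bases)
        vector.extend([share if b in bases else 0.0 for b in NUCLEOTIDES])
        vector.append(0.0)            # not a deletion
    return vector


print(encode("AR-"))
```

Vectors built this way have a fixed length for aligned sequences, so they can be presented directly to the neurons of a SOM or SOTA network.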
Updating the neurons
Updated
Missing
Classifying proteins with SOM
Ferrán, Pflugfelder and Ferrara
(1994) Self-organized neural
maps of human protein
sequences. Prot. Sci. 3:507-521.
Gene expression analysis using DNA microarrays
Cy5
Cy3
cDNA arrays
Oligonucleotide arrays
Research paradigm is shifting
Hypothesis driven: one PhD per gene
Ignorance driven: parallelized automated approach
sequences
Kb
DNA arrays
Gb
Mb
Tb - Pb
Expression patterns
[Figure: expression patterns across different DNA arrays (1, 2, 3, 4).]
Patterns can be:
• time series
• dosage series
• different patients
• different tissues
• etc.
The data
[Figure: the data matrix, with genes as rows and experimental conditions as columns grouped into classes A, B, C.]
Genes (thousands)
Experimental conditions (from tens up to no more than a few hundreds)
Different classes of experimental conditions, e.g. cancer types, tissues, drug treatments, survival time, etc.
Expression profile of all the genes for an experimental condition (array)
Expression profile of a gene across the experimental conditions
Characteristics of the data:
• Low signal to noise ratio
• High redundancy and intra-gene
correlations
• Most of the genes are not
informative with respect to the trait
we are studying (they account for unrelated
physiological conditions, etc.)
• Many genes have no annotation!!
Types of problems
Unsupervised, supervised and reverse engineering:
• Study of many conditions. Can we find groups of experiments with similar gene expression profiles?
• Molecular classification of samples
• Co-expressing genes... What do they have in common?
• Different phenotypes... What genes are responsible for them?
• Genes of a class. What profile(s) do they display? And... are there more genes?
• Genes interacting in a network (A, B, C...)... What is the network like?
[Figure: network of interacting genes A, B, C, D, E.]
What are we measuring?
Two channels: green and red
A (background), B (expression)
Differential expression: B/A
Problem: the ratio is asymmetrical
Solution: log-transformation

B/A      ratio   log10(ratio)
100/1    100      2
10/1     10       1
1/1      1        0
1/10     0.1     -1
1/100    0.01    -2
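The effect of the transformation is easy to check numerically. In this small sketch the base-10 logarithm is assumed because it reproduces the values in the table; `log_ratio` is an illustrative name.

```python
import math


def log_ratio(b, a):
    """log10 of the expression ratio B/A."""
    return math.log10(b / a)


pairs = [(100, 1), (10, 1), (1, 1), (1, 10), (1, 100)]
for b, a in pairs:
    print(f"{b}/{a}: ratio = {b / a:g}, log10 = {log_ratio(b, a):g}")
```

On the raw scale, induction covers the range from 1 to 100 while repression is squeezed between 0.01 and 1; after the transformation both span two units on either side of zero.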
Distance
[Figure: three expression profiles, A, B and C.]
Differences: B<=>C
Correlation: A<=>B
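A numeric version of this slide, as a sketch: the three profiles below are made up for illustration, Euclidean distance stands in for "differences", and one minus the Pearson coefficient is used as the correlation distance.

```python
import math


def euclidean(p, q):
    return math.dist(p, q)


def pearson(p, q):
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    norm_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    norm_q = math.sqrt(sum((b - mean_q) ** 2 for b in q))
    return cov / (norm_p * norm_q)


def correlation_distance(p, q):
    return 1.0 - pearson(p, q)


# Invented profiles: A and B rise together, B and C have similar levels.
A = [0, 10, 20]
B = [9, 10, 11]
C = [11, 10, 9]

for name, dist in (("differences", euclidean),
                   ("correlation", correlation_distance)):
    print(name, "A-B:", round(dist(A, B), 3), "B-C:", round(dist(B, C), 3))
```

By differences B is closest to C, while by correlation B is closest to A, so the choice of distance decides which genes end up in the same cluster.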
Clustering methods
                    Deterministic    NN       Properties
Non hierarchical    K-means, PCA     SOM
Hierarchical        UPGMA            SOTA     Provides different levels of information
Properties                           Robust
Aggregative hierarchical
clustering
Relationships among profiles are
represented by branch lengths.
Recursively links the closest pair of
profiles until the complete hierarchy
is reconstructed
CLUSTER
Allows exploring the relationships
among groups of related genes at
higher levels.
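The recursive linking can be sketched with average linkage (the UPGMA criterion named in the methods table). This is a small illustration in plain Python, assuming Euclidean distance between profiles; it is quadratic per step and meant for reading, not for thousands of genes.

```python
import math


def upgma(profiles):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    clusters = {i: [i] for i in range(len(profiles))}

    def average_distance(a, b):
        # Unweighted mean of all pairwise distances between two clusters.
        total = sum(math.dist(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    merges = []
    next_id = len(profiles)
    while len(clusters) > 1:
        ids = sorted(clusters)
        pairs = [(average_distance(clusters[p], clusters[q]), p, q)
                 for k, p in enumerate(ids) for q in ids[k + 1:]]
        # Closest pair; ties are broken by cluster id, i.e. by data order.
        distance, p, q = min(pairs)
        merges.append((clusters[p], clusters[q], distance))
        clusters[next_id] = clusters.pop(p) + clusters.pop(q)
        next_id += 1
    return merges


profiles = [[0, 0], [0, 1], [10, 10], [10, 11]]
for left, right, distance in upgma(profiles):
    print(left, "+", right, "at", round(distance, 2))
```

The merge distances play the role of branch lengths. Note that the tie between the two closest pairs is resolved by input order in this sketch.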
Aggregative
hierarchical
clustering
Problems
• lack of robustness
• the solution may not be unique
• dependent on the data order
What level would you
consider for defining
a cluster?
Subjective
cluster definition
Properties of neural networks for
molecular data classification
• Robust
• Manage real-world data sets containing noisy,
ill-defined items with irrelevant variables and
outliers
• Statistical distributions do not need to be
parametric
• Fast and scalable to big data sets
Kohonen SOM
Applied to microarray data
[Figure: an n x p data matrix (rows sample1 ... samplen, columns t1 ... tp, entries a11 ... anp) is presented to the map. Each node (e.g. node24, node34, node44) holds a p-component vector (x1 ... xp; y1 ... yp; z1 ... zp), and each resulting group (Group11, Group12, Group13, Group14, ...) lists the samples assigned to it (samplea, sampleb ...).]
Kohonen SOM
microarray patterns
           gen1   gen2   ..   genp
sample1    a11    a12    ..   a1p
sample2    a21    a22    ..   a2p
:          :      :           :
samplen    an1    an2    ..   anp
Kohonen SOM
Example
Response of human fibroblasts
to serum
Iyer et al., 1999 Science 283:83-87
The Self Organising Tree
Algorithm (SOTA)
The Self Organising Tree Algorithm
(SOTA) is a divisive hierarchical
method based on a neural network
SOTA, unlike other clustering
methods, grows from top to bottom:
growing can be stopped at the
desired level of variability
SOTA nodes are weighted averages of
every item under the node
SOTA
Advantages of SOTA
Robustness against noise
Divisive algorithm
SOTA grows from top to bottom: growing can be
stopped at any desired level of variability.
Cluster patterns
Each node of the tree has an associated pattern,
which corresponds to the cluster below it.
Distribution preserving
The number of clusters depends on the
variability of the data.
From low resolution... to high resolution.
Where to stop growing? TEST
Permutation test for cluster
size definition
[Figure: the original n x p matrix (rows gen1 ... genn, columns exp1 ... expp, entries a11 ... anp) next to a permuted copy in which the values are shuffled within each gene's row (a14 a17 .. a1q; a23 a21 .. a2r; ...; an9 an4 .. ans).]
95%
TEST
are dij > 0.4?
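One way to obtain such a threshold is sketched below. It is an interpretation rather than the exact published procedure: values are shuffled within each gene's row, as in the figure, distances between the randomized profiles are collected, and the threshold is taken as the distance below which only 5% of random pairs fall. The use of Euclidean distance and the function names are assumptions.

```python
import math
import random


def shuffle_rows(matrix, rng):
    """Permute the values of every gene independently."""
    return [rng.sample(row, len(row)) for row in matrix]


def null_distances(matrix, n_permutations=20, seed=0):
    """Sorted distances between all pairs of randomized profiles."""
    rng = random.Random(seed)
    distances = []
    for _ in range(n_permutations):
        shuffled = shuffle_rows(matrix, rng)
        for i in range(len(shuffled)):
            for j in range(i + 1, len(shuffled)):
                distances.append(math.dist(shuffled[i], shuffled[j]))
    return sorted(distances)


def threshold(distances, confidence=0.95):
    """Distance exceeded by roughly `confidence` of the random pairs."""
    index = int((1.0 - confidence) * (len(distances) - 1))
    return distances[index]


genes = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 1, 5, 5], [2, 4, 2, 4]]
null = null_distances(genes)
print("threshold:", threshold(null))
```

Profiles that sit closer together than this threshold are unlikely to do so by chance, which gives an objective point at which a tree such as SOTA's can stop growing.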
SOTA/SOM vs classical
clustering (UPGMA)
SOTA vs SOM
Accuracy: the silhouette
Is the object closer to its own cluster or to the closest other cluster?
\[ a(i) = \frac{1}{|A| - 1} \sum_{x_j \in A,\; j \neq i} d(x_i, x_j), \qquad x_i \in A \]
\[ d(x_i, C) = \frac{1}{|C|} \sum_{x_j \in C} d(x_i, x_j), \qquad C \neq A \]
\[ b(i) = \min_{C \neq A} d(x_i, C) \]
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \]
\[ x_i \in A,\; a(i) \ll b(i) \;\Rightarrow\; s(i) = \frac{b(i) - a(i)}{b(i)} = 1 - \frac{a(i)}{b(i)} \approx 1 \quad \text{(OK)} \]
\[ x_i \in B,\; a(i) \gg b(i) \;\Rightarrow\; s(i) = \frac{b(i) - a(i)}{a(i)} = \frac{b(i)}{a(i)} - 1 \approx -1 \quad \text{(Wrong)} \]
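The silhouette of a single item follows directly from these definitions. The sketch below assumes Euclidean distance, at least two clusters, and the common convention (not stated on the slide) that an item alone in its cluster gets s(i) = 0.

```python
import math


def silhouette(i, points, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for item i."""
    own = labels[i]
    same = [j for j in range(len(points)) if labels[j] == own and j != i]
    if not same:
        return 0.0                      # singleton cluster (assumed convention)
    # a(i): mean distance to the other members of the item's own cluster.
    a = sum(math.dist(points[i], points[j]) for j in same) / len(same)
    # b(i): smallest mean distance to the members of any other cluster.
    b = min(
        sum(math.dist(points[i], points[j]) for j in members) / len(members)
        for members in (
            [j for j in range(len(points)) if labels[j] == c]
            for c in set(labels) if c != own
        )
    )
    return (b - a) / max(a, b)


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
good = ["A", "A", "B", "B"]
bad = ["A", "B", "B", "B"]              # second item placed in the wrong cluster
print(round(silhouette(1, points, good), 2), round(silhouette(1, points, bad), 2))
```

Values near 1 indicate a well-placed item and values near -1 a misplaced one, so averaging s(i) over all items gives a single figure for comparing the output of SOTA, SOM and UPGMA on the same data.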