Transcript Slide 1

Project reminder
• Deadline: Monday 16. 5. 11:00
• Prepare 10 minutes long pesentation (in
Czech/Slovak), which you’ll present on
Wednesday 18. 5. 2011 during the Data
mining lecture/exercise.
Self-Organizing Map
(SOM)
• Unsupervised neural networks, equivalent to
clustering.
• Two layers – input and output
– The input layer represents the input variables.
– The output layer: neurons arranged in a single line
(one-dimensional) or a two-dimensional grid.
• Main feature – weights
• Learning means adopting the weights.
• Each output receives inputs through the weights.
 weight vector has the same dimensionality as
the input vector
• The output of each neuron is its activation –
weighted sum of inputs (i.e. linear activation
function).
u = x1w11 + x2w21
w11
w21
2
• The objective of learning: project highdimensional data onto 1D or 2D output
neurons.
• Each neuron incrementally learns to represent
a cluster of data.
• The weights are adjusted incrementally – the
weights of neurons are called codebook
vectors (codebooks).
Competitive learning
• The so-called competitive learning (winnertakes-all) .
• Competitive learning will be demonstrated on
simple 1D network with two inputs.
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
• First, number of output neurons (i.e. clusters)
must be selected.
– Not always known, do reasonable estimate, it is
better to use more, not used can be eliminated
later.
• Then initialize weights.
– e.g. small random values
– Or randomly choose some input vectors and use
their values for the weights.
• Then competitive learning can begin.
• The activation for each output neuron is
calculated as weighted sum of inputs.
• E.g. for the output neuron 1, its activation
n
u1 = w11x1 + w21x2. Generally u j   wij xi
i1
• Activation is the dot product between input
vector x and weight vector wj.
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
n
• Dot product is not only u j   wij xi , but also
i1
u j  x w j cos 
x
w
θ
• If |x| = |wj| = 1, then uj = cosθ.
• The closer these two vectors are (i.e. the
smaller θ is), the bigger the uj is (cos 0 = 1).
• Say it again, and loudly:
The closer the weight and input vectors are,
the bigger the neuron activation is. Dan na
Hrad.
• A simple measure of the closeness – Euclidean
distance between x and wj.
dj  xwj 
 x  w 
n
i 1
i
ij
2
• Scale the input vector so that its length is
equal to one. |x|=1
• An input is presented to the network.
• Scale weight vectors of individual output
neurons to the unit length. |w|=1
• Calculate, how close is input vector x to each
of weight vector wj (j is 1 … # output neurons).
• The neuron which codebook is closest to the
input vector becomes winner (BMU, Best
Matching Unit).
• Its weights will be updated.
Weight update
• The weight vector w is updated so that it
moves closer to the input x.
d
x
Δw
w
Δw j    x  w j    d j
β – learning rate
Recursive vs. batch learning
• Conceptually similar to online/batch learning
• Recursive learning:
– update weights of the winning neuron after each
presentation of input vector
• Batch learning:
– the weight update for each input vector is noted
– the average weight adjustment for each output
neuron is done after the whole epoch
• When to terminate learning?
– mean distance between neurons and inputs they
represent is at a minimum
– distance stops changing
Example
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
epoch
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
Topology is not
preserved.
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
Meet today’s hero
Teuvo Kohonen
Self-Organizing Maps
• SOM, also Self-Organizing Feature Map
(SOFM), Kohonen neural network.
• Inspired by the function of brain:
– Different brain regions correspond to specific
aspects of human activities.
– These regions are organized such that tasks of
similar nature (e.g. speech and vision) are
controlled by regions that are in spatial proximity
each to other.
– This is called topology preservation.
• In SOM learning, not only the winner, but also
the neighboring neurons adapt their weights.
• Neurons closer to the winner adjust weights
more than farther neurons.
• Thus we need
1. to define the size of neighborhood
2. to define a way how much neighboring neurons
adapt their weights
Neighborhood definition
2
2
1
1
1
2
1
2
3
neighborhood radius r
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
Training in SOM
• Follows similar manner of standard winner-takesall competitive training.
• However, new rule is used for weight changes.
• Suppose, that the BMU is at position {iwin, jwin} on
the 2D map.
• Then all codebook vectors of BMU and neighbors
are adjusted to w’j according to
w 'j  w j   NS  x  w j 
where NS is the neighbor strength varying with
the distance to the BMU. β is learning rate.
Neighbor strength
• When using neighbor features, all neighbor
codebooks are shifted towards the input
vector.
• However, BMU updates most, and the farther
away the neighbor neuron is, the less its
weights update.
• The NS function tells us how the weight
adjustment decays with distance from the
winner.
Slide by Johan Everts
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
Gaussian
NS  Exp   dij2 2 2 
Linear
Exponential
NS  Exp   kd ij 
2D Side Effects
source: http://www.cis.hut.fi/projects/somtoolbox/download/sc_shots2.shtml
Shrinking neighborhood size
• Large neighborhood – proper placement of neurons
in the initial stage to broadly represent spatial
organization of input data.
• Further refinement – subsequent shrinking of the
neighborhood.
• The size of large starting neighborhood is reduced
with iterations.
σ0 … initial neighborhood size
σt … neighborhood width at iteration t
T … total number of iterations
bringing neighborhood to zero (i.e.
only winner)
 t   0 1  t T   t   0Exp  t T 
linear decay
exponential decay
NS  Exp   dij2 2 2  
2
Exp  dij2 2  0 Exp  t T  


Learning rate decay
• The step length (learning rate β) is also reduce with
iterations.
• Two common forms: linear or exponential decay
t  0 1  t T 
t  0Exp t T 
T … constant bringing β to zero (or small value)
• Strategy: start with relatively high β, decrease
gradually, but remain above 0.01
Weight update incorporating learning rate and neighborhood decay
w j  t   w j  t  1    t  NS  d , t   x  t   w j  t  1 
Recursive/Batch Learning
• Batch mode, no neigborhood – equivalent to
K-means
• Neighbor incorporating – topology
preservation
– Regions closer in input space are represented by
neurons closer in the map.
Two Phases of SOM Training
• Two phases
1. ordering
2. convergence
• Ordering
– neighborhood and learning rate are reduced to small
values
– topological ordering
– start β high, gradually decrease, remain above 0.01
– neighborhood – cover whole output layer
• Convergence
– fine tuning with the shrunk neighborhood
– small non-zero (~0.01) learning rate, NS no more than
1st neighborghood
Example contd.
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
NS  Exp  0.1d 
2
t 
3t
 t   0 1  t T 
neighborhood drops to 0 after 3 iterations
After 3 iterations
topology preservation takes effect very quickly
Complete training
Converged after 40 epochs.
Epochs
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
Complete training
• All vectors have
found cluster
centers
• Except one
4
3
2
5
6
1
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
• Solution: add one
more neuron
4
3
2
1
Sandhya Samarasinghe, Neural Networks for Applied
Sciences and Engineering, 2006
5
6
7
2D output
• Play with
http://www.neuroinformatik.ruhr-uni-bochum.de/VDM/research/gsn/DemoGNG/GNG.html
• Self-organizing map
NS  Exp   dij2 2 2 
 t   i  f  i 
neighborhood size
 t   i  f  i 
learning rate
t tmax
t tmax
A self-organizing feature map from a square source space to a
square (grid) target space.
Duda, Hart, Stork, Pattern Classification, 2000
Some initial (random) weights and the particular sequence of
patterns (randomly chosen) lead to kinks in the map; even
extensive further training does not eliminate the kink. In such
cases, learning should be re-started with randomized weights and
possibly a wider window function and slower decay in learning.
Duda, Hart, Stork, Pattern Classification, 2000
2D maps on Multidimensional data
• Iris data set
– 150 patterns, 4 attributes, 3 classes (Set – 1, Vers
– 2, Virg – 3)
– more than 2 dimensions, so all data can not be
vizualized in a meaningful way
– SOM can be used not only to cluster input data,
but also to exlpore the relationships between
different attributes.
• SOM structure
– 8x8, hexagonal, exp decay of learning rate β (βinit =
0.5, Tmax = 20x150 = 3000), NS: Gaussian
virginica
versicolor
setosa
What can be learned?
• petal length and width have similar structure to the class panel 
low length correlates with low width and these relate to class Versicolor
• sepal width – very different pattern
• class panel – boundary between Virginica and Setosa – classes overlap
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
• Since we have class labels, we can assess the
classification accuracy of the map.
• So first we train the map using all 150
patterns.
• And then we present input patterns
aindividually again and note the winning
neuron.
– The class to which the input belongs is the class
associated with this BMU codebook vector (see
previous slide, Class panel).
– Only the winner decides classification.
Vers – 100% accuracy
Set – 86%
Virg – 88%
Overall accuracy = 91.3%
Vers – 100% accuracy
Set – 90%
Virg – 94%
Overall accuracy = 94.7%
Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering, 2006
U-matrix
• Distance between the neighboring codebook
vectors can highlight different cluster regions in
the map and can be a useful visualization tool
• Two neurons: w1 = {w11, w21, … wn1}, w2 = {w12, w22, … wn2}
• Euclidean distance between them
d12 
 w11  w12    w21  w22  
2
2
  wn1  wn 2 
2
• The average of the distance to the nearest
neighbors – unified distance, U -matrix
The larger the distance
between neurons, the
larger the U value and more
separated the clusters. The
lighter the color, the larger
the U value.
Large distance between this
cluster (Iris versicolor) and the
middle cluster (Iris setosa). Large
distances between codebook
vectors indicate a sharp
boundary between the clusters.
Surface graph
The height represents the
distance.
3rd row – large height =
separation
Other two clusters are not
separated.
Quantization error
• Measure of the distance between codebook
vectors and inputs.
• If for input vector x the winner is wc, then
distortion error e can be calculated as
e   NSci d  x, wi 
e  d  x, wc 
i
• Comput e for all input vectors and get average
– quantization error, average map distortion
error E.
1
E
N

 d x , wi
n
n

1
E
N

n
NS
d
x
 ci , w i
n
i

Iris quantization error
High distortion error indicates areas where the codebook
vector is relatively far from the inputs. Such information can
be used to refine the map to obtain a more uniform distortion
error measure if a more faithful reproduction of the input
distribution from the map is desired.