DATA-MINING
Artificial Neural Networks
Alexey Minin, Jass 2006
Learning without a teacher: introduction
The ANN forms its output by itself, according to the information presented at its input. We have to define some functional, and once we have found this functional we have to minimize it. This is the main task, and the weights of the network are changed according to this functional.
In practice, adaptive networks encode the input information in the most compact way, subject to some predefined requirements.
Learning without a teacher: redundancy of data
The length of the data description is D = d · b, where
d = dimension of the data (the number of components of the input vector x),
b = capacity of the data (the number of bits defining the possible variety of all values).
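As a small worked example (using the 8×8 blocks of 8-bit pixels from the JPEG example at the end of these slides): d = 64 and b = 8, so D = d · b = 64 · 8 = 512 bits per block.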
Two ways of coding (reducing) the information:
- reducing the dimension of the data with minimal loss → finding independent features;
- reducing the variety of the data by detecting prototypes → clustering and quantization.
Two ways to reduce the data:
- Reducing the dimension allows us to describe the data with fewer components.
- Clustering allows us to reduce the variety of the data, reducing the number of bits we need to describe it.
NB: We can combine both types of algorithms. In Kohonen maps, for example, the prototypes are arranged in a space of low dimension: the input data can be mapped onto a two-dimensional grid of prototypes in such a way that the data can be visualized.
Main idea: neuron as indicator
A neuron has one output and is trained on d-dimensional data x. Let us say that the activation function is linear. The output is therefore a linear combination of its inputs:

y = Σ_{j=1}^{d} w_j x_j = w · x

After training is finished, the output amplitude can serve as an indicator for the data, showing whether or not the data corresponds to the training patterns.
Hebb training algorithm
According to Hebb:

Δw_j = η y x_j

If we reformulate the task as an optimization task, we obtain the property of such a neuron and the rule defining the functional we have to minimize:

Δw = −η ∂E/∂w,   E(w, x) = −(1/2)(w · x)² = −(1/2) y²

NB! If we want E to reach a true minimum, the output amplitude must grow to infinity.
Oja training rule

Δw_j = η y (x_j − y w_j)

The extra term was added to stop the unlimited growth of the weights. The Oja rule maximizes the sensitivity of the neuron's output at a limited amplitude of the weights. It is easy to convince oneself of this by setting the average change of the weights to zero and then multiplying the right-hand side of the equality by w. We see that in equilibrium

⟨y²⟩ (1 − |w|²) = 0

Thus the weights of the trained neuron lie on the hypersphere |w| = 1. Under Oja training, the weight vector settles on the hypersphere, in the direction maximizing the projection of the input vectors.
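A minimal Python sketch of this rule (the learning rate, epoch count and synthetic data are illustrative assumptions, not part of the slides): a single linear neuron trained with Δw = η y (x − y w) on zero-mean data converges, up to sign, to the first principal component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic zero-mean 2-D data stretched along the first axis.
X = rng.normal(size=(1000, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
X -= X.mean(axis=0)

w = rng.normal(size=2)               # initial weight vector
eta = 0.01                           # learning rate (assumed)
for epoch in range(20):
    for x in X:
        y = w @ x                    # linear neuron: y = w . x
        w += eta * y * (x - y * w)   # Oja rule: Hebb term minus decay

print("learned direction (|w| ~ 1):", w / np.linalg.norm(w))
# For comparison: the leading eigenvector of the data covariance (up to sign).
vals, vecs = np.linalg.eigh(np.cov(X.T))
print("first principal component:  ", vecs[:, np.argmax(vals)])
```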
Oja training rule
SUMMARY: The neuron tries to reproduce the values of its inputs from its known output. In other words, it maximizes the sensitivity of its output as an indicator of the many-dimensional input information, compressing it in this way.
NB! The outputs of an Oja output layer are linear combinations of the principal components. If you want to obtain the principal components themselves, you should restrict the sum over the outputs:

Δw_ij = η y_i ( x_j − Σ_{k=1}^{i} y_k w_kj )
The analysis of principal components
Suppose we have d-dimensional data x and we train m linear neurons:

y_i = Σ_{j=1}^{d} w_ij x_j = w_i · x,   i = 1, ..., m
THE TASK IS: we want the amplitudes to be independent indicators of all output neurons, fully reflecting the information about the many-dimensional data we have.
The requirement: the neurons must interact somehow (if we train them independently, we will obtain the same result for all of them).
A simple case:
Take a perceptron with linear neurons in the hidden layer, in which the numbers of inputs and outputs are equal, and the weights with the same indices in both layers are the same. Let us train this ANN to reproduce its input at its output. The training rule is then

Δw = η y (x − x̃) = η y (x − y w)

This looks like the Oja training rule!
Self-training layer:
In our formulation, the training of a separate neuron tries to reproduce the inputs from its output. Generalizing this observation, it is logical to propose a rule in which the values of the inputs are restored from the whole output information. In this way we obtain the Oja training rule for a one-layer network:

Δw_ij = η y_i ( x_j − x̃_j ) = η y_i ( x_j − Σ_k y_k w_kj )

The hidden layer of such an ANN, like the Oja layer, performs an optimal coding of the input data and retains the maximum variety of the data under the existing restrictions.
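A matrix-form Python sketch of this one-layer rule (the synthetic data, learning rate and epoch count are assumptions made for illustration): each row of W plays the role of one neuron's weight vector, and x̃ = Wᵀ y is the reconstruction of the input from all outputs.

```python
import numpy as np

# One-layer Oja rule: delta_w[i, j] = eta * y[i] * (x[j] - sum_k y[k] * w[k, j]).

rng = np.random.default_rng(1)
d, m = 5, 2                            # input dimension and number of output neurons

# Synthetic zero-mean data living mostly in a 2-D subspace.
basis = rng.normal(size=(2, d))
X = rng.normal(size=(2000, 2)) @ basis + 0.05 * rng.normal(size=(2000, d))
X -= X.mean(axis=0)

W = 0.1 * rng.normal(size=(m, d))      # weight matrix, one row per output neuron
eta = 0.005
for epoch in range(30):
    for x in X:
        y = W @ x                              # outputs of the linear layer
        W += eta * np.outer(y, x - W.T @ y)    # x_tilde = W.T @ y
        # Restricting the inner sum to k <= i (a lower-triangular version of
        # W.T @ y) would give the ordered principal components instead of an
        # arbitrary basis of the principal subspace.

print("row norms:", np.linalg.norm(W, axis=1))   # close to 1 after training
```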
Example:
Let us replace the activation function in the training rule by a sigmoid:

Δw_i = η ( f(y_i) x − Σ_k f(y_k) w_k )

This brings a new property (Oja et al., 1991). Such an algorithm was used, in particular, for the decomposition of signals mixed in an unknown way (i.e. blind signal separation). We face this task, for example, when we want to separate a human voice from noise.
Competition of neurons: the winner takes all
Base algorithm. The outputs of the competitive layer are y_i = Σ_{j=1}^{d} w_ij x_j, and the training rule of the competitive layer stays the same:

Δw_i = η y_i ( x − Σ_k y_k w_k )

The winner is the neuron i* with the maximum response:

i* = argmax_i ( w_i · x )

(if |w_i| = 1 for all i, this is equivalent to |w_i* − x| ≤ |w_i − x| for all i ≠ i*).

Training of the winner:

Δw_i* = η ( x − w_i* ),   with   y_i* = 1,  y_i = 0 for i ≠ i*
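A minimal Python sketch of this winner-takes-all rule (the cluster data, the number of prototypes m and the learning rate are assumptions): the winner is chosen as the closest prototype, which matches the maximum-response criterion when the weight vectors are normalized.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three well-separated 2-D clusters as training data.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.normal(size=(200, 2)) for c in centers])
rng.shuffle(X)

m = 3                                                  # number of prototype neurons
W = X[rng.choice(len(X), m, replace=False)].copy()     # init prototypes from data
eta = 0.05
for epoch in range(10):
    for x in X:
        winner = np.argmin(np.linalg.norm(W - x, axis=1))  # closest prototype
        W[winner] += eta * (x - W[winner])                  # only the winner moves

print("learned prototypes:\n", W)    # should approximate the cluster centers
```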
The winner does not take all
One variant of the base training rule for a competitive layer consists in training not only the neuron-winner but also its "neighbors", though at a smaller rate. This "pulling up" of the neurons nearest to the winner is what is applied in topographic Kohonen maps:

i* = argmin_i | w_i − x |

w_i(t + 1) = w_i(t) + η Λ(|i − i*|, t) ( x(t) − w_i(t) )

The neighborhood function Λ is equal to one for the neuron-winner with index i* and gradually falls off with the distance from the winner, e.g.

Λ(a) = exp( −a² / σ² )

Training by Kohonen resembles stretching an elastic grid of prototypes over the data set from the training sample.
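A small Python sketch of this Kohonen update on a one-dimensional chain of prototypes (the grid size and the learning-rate and neighborhood schedules are assumptions; a two-dimensional grid works the same way, with the distance taken between grid indices).

```python
import numpy as np

# SOM update: w_i(t+1) = w_i(t) + eta * Lambda(|i - i*|, t) * (x(t) - w_i(t)),
# with Lambda(a) = exp(-a**2 / sigma**2).

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(2000, 2))       # data filling the unit square

m = 10                                          # prototypes on a 1-D chain
W = rng.uniform(0.4, 0.6, size=(m, 2))          # start near the middle
idx = np.arange(m)

n_steps = 20000
for t in range(n_steps):
    x = X[rng.integers(len(X))]
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    eta = 0.5 * (1.0 - t / n_steps)             # decaying learning rate
    sigma = 3.0 * (1.0 - t / n_steps) + 0.5     # shrinking neighborhood
    Lam = np.exp(-(idx - winner) ** 2 / sigma ** 2)
    W += eta * Lam[:, None] * (x - W)           # neighbors move too

print(W)   # prototypes stretch like an elastic chain through the square
```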
Methodology of self-organizing maps
Schematic representation of a self-organizing network: neurons in the target layer are ordered and correspond to the cells of a two-dimensional map, which can be colored according to the affinity of attributes.
Visualization: the topographic map induced by the i-th component x_i of the input data.
A convenient tool for visualizing the data is coloring the topographic map, similar to how it is done on ordinary geographical maps. Every attribute of the data generates its own coloring of the map cells, by the average value of this attribute over the data points that fall into a given cell. Collecting together the maps of all the attributes of interest, we obtain a topographic atlas that gives an integrated representation of the structure of the multivariate data.
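A sketch of how such a coloring can be computed (the names W, X and attr, and the nearest-cell assignment of data points, are assumptions; the slides do not give an implementation):

```python
import numpy as np

def som_coloring(W, X, attr):
    """Color each SOM cell by the mean of attribute `attr` over the data
    points whose winning cell it is. W: (n_cells, d) trained prototypes,
    X: (n_points, d) data, attr: column index of the attribute."""
    winners = np.argmin(
        np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1
    )                                    # winning cell for every data point
    colors = np.full(len(W), np.nan)     # NaN marks empty cells
    for cell in range(len(W)):
        mask = winners == cell
        if mask.any():
            colors[cell] = X[mask, attr].mean()
    return colors                        # one shade per map cell
```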
Methodology of self-organizing maps
Classified SOM for the NASDAQ-100 index for the period from 10-Nov-1997 to 27-Aug-2001.
Complexity of the algorithms
When is it better to use dimension reduction, and when quantization of the input information?

W = d·m is the number of synaptic weights of a one-layer ANN with d inputs and m output neurons; P is the number of training patterns; the training complexity is C ∝ P·W.

Reducing the dimension:
  compression coefficient K = d / m
  number of operations C₁ ≈ P d⁴ / K²

Quantization, with the same compression coefficient:
  compression coefficient K = d·b / log₂ m   (b is the bit capacity of the data)
  number of operations C₂ ≈ P d · 2^(d·b / K)

Hence

C₂ / C₁ ~ (K² / d³) · 2^(d·b / K)
JPEG example
The image is divided into 8×8-pixel blocks, which serve as the input vectors we want to compress. In our case d = 8 × 8 = 64. Suppose the image contains 2⁸ = 256 gradations of gray, i.e. the accuracy of the represented data is b = 8. Quantization therefore pays off only at sufficiently large compression coefficients, and if the blocks were d = 64 × 64, one would need K > 10³.
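As an illustrative check of these numbers, plugging the JPEG values into the ratio C₂/C₁ given above (the constants hidden in the "~" are ignored):

```python
# Evaluate C2 / C1 ~ (K**2 / d**3) * 2**(d*b / K) for the JPEG block sizes,
# to see for which compression coefficients K quantization (C2) becomes
# cheaper than dimension reduction (C1). Purely illustrative.

def ratio(K, d, b):
    return (K ** 2 / d ** 3) * 2.0 ** (d * b / K)

d, b = 8 * 8, 8                        # 8x8 blocks, 256 gray levels
for K in (16, 32, 64, 128, 256):
    print(f"d={d:5d}  K={K:5d}  C2/C1 ~ {ratio(K, d, b):.3g}")

d = 64 * 64                            # 64x64 blocks
for K in (500, 1000, 2000, 4000):
    print(f"d={d:5d}  K={K:5d}  C2/C1 ~ {ratio(K, d, b):.3g}")
```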
Any questions?