Transcript nn4-02

3. Learning
In the previous lecture, we discussed the biological foundations
of neural computation, including
single neuron models
connecting single neuron behaviour with network models
spiking neural networks
computational neuroscience
In the present one, we introduce
Statistical foundations of neural computation
= Artificial foundations of neural computation
Artificial Neural Networks
Biological foundations (Neuroscience)
Artificial foundations (Statistics, Mathematics)
Duck: can swim (but not like a fish) (Feng)
fly (but not like a bird) (all my colleagues here)
walk (in a funny way)
Topic
Pattern recognition
Cluster
Statistical Approach
Statistical learning (training from a data set, adaptation):
change the weights or interactions between neurons according to
examples and previous knowledge
The purpose of learning is to minimize
 training errors on the learning data: learning error
 prediction errors on new, unseen data: generalization error
The neuroscience basis of learning remains elusive, although
we have seen some progress (see the references in the previous
lecture)
Statistical learning:
the artificial, reasonable way of training and prediction
LEARNING: extracting principles from a data set.
• Supervised learning: has a teacher, telling you where to go
• Unsupervised learning: no teacher, learns by itself
• Reinforcement learning: has a critic, saying wrong or correct
We will concentrate on the first two. You can find reinforcement
learning in the books by Haykin or Hertz et al., or in
Sutton R.S., and Barto A.G. (1998)
Reinforcement learning: an introduction
Cambridge, MA: MIT Press
Pattern recognition (classification), a special case of learning
The simplest case: f(x) = 1 or -1 for x in X (the set of objects
we intend to separate)
Example:
X, a bunch of faces
x, a single face,
f(x) = 1 if x is male, -1 if x is female
(Figure: f(male face) = 1, f(female face) = -1)
Pattern: as opposed to chaos; it is an entity, vaguely defined, that
could be given a name
Examples:
• a fingerprint image,
• a handwritten word,
• a human face,
• a speech signal,
• an iris pattern, etc.
Given a pattern, there are two kinds of classification:
a. supervised classification (discriminant analysis), in which
the input pattern is identified as a member of a predefined class
b. unsupervised classification (e.g. clustering), in which the pattern is
assigned to a hitherto unknown class
Unsupervised classification will be introduced in later lectures
Pattern recognition is the process of assigning patterns to one of a
number of classes
(Diagram: pattern space (data) x --feature extraction--> feature space y --classification--> decision space)
feature extraction: x = a face image --> hair length y = 0, or hair length y = 30 cm
classification: hair length = 0, short hair = male; hair length = 30 cm, long hair = female
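As a rough illustration, here is a minimal Python sketch of the pattern space --> feature space --> decision space pipeline above; the hair-length feature and the 10 cm threshold are assumptions made up for the example, not values from the lecture.

```python
# Minimal sketch of the pattern -> feature -> decision pipeline from the slide.
# The feature (hair length) and the 10 cm threshold are illustrative assumptions.

def extract_feature(pattern):
    """Feature extraction: map a raw pattern (here a dict standing in for a
    face image) to a single feature y, the hair length in cm."""
    return pattern["hair_length_cm"]

def classify(y, threshold=10.0):
    """Classification: map the feature y to a decision, following the slide's
    rule 'short hair = male, long hair = female'."""
    return "male" if y < threshold else "female"

faces = [{"hair_length_cm": 0.0}, {"hair_length_cm": 30.0}]
for x in faces:
    y = extract_feature(x)    # pattern space -> feature space
    label = classify(y)       # feature space -> decision space
    print(f"hair length = {y} cm -> {label}")
```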
Feature extraction is a very fundamental issue
For example: when we recognize a face, which features do we use?
Eye pattern, geometric outline, etc.
Two approaches:
Statistical approach
Clusters: template matching
In two steps:
Find a discriminant function in terms of certain
features
Make a decision in terms of the discriminant
function
discriminant function: a function used to decide on
class membership
Cluster:
patterns of a class should be grouped or clustered together in
pattern or feature space if the decision space is to be partitioned
objects near each other must be similar
objects far apart must be dissimilar
distance measures: the choice becomes important as the basis of
classification
Once a distance is given, pattern recognition is accomplished.
(Figure: clusters plotted against hair length)
Distance metrics: different distances will be employed later
To be a valid measure of the distance between two objects
in an abstract space W, a distance metric must satisfy the following
conditions:
d(x,y) >= 0                 nonnegativity
d(x,x) = 0                  reflexivity
d(x,y) = d(y,x)             symmetry
d(x,y) <= d(x,z) + d(z,y)   triangle inequality
We will encounter different distances, for example the
relative entropy (a distance from information
theory)
Hamming distance
For x = {xi} and y = {yi}
dH(x, y) = Σi |xi - yi|
a measure of the sum of absolute differences between each element of
the two vectors x and y
most often used in comparing binary vectors (binary pixel figures, black
and white figures)
e.g. dH([1 0 0 1 1 1 0 1], [1 1 0 1 0 0 1 1]) = 4
(element-wise differences: 0 1 0 0 1 1 1 0)
Euclidean distance
For x = {xi} and y = {yi}
d(x, y) = [Σi (xi - yi)^2]^(1/2)
Most widely used distance, easy to calculate
Minkowski distance
For x = {xi} and y = {yi}
d(x, y) = [Σi |xi - yi|^r]^(1/r), r > 0
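The three distances above are straightforward to compute; the following Python sketch writes them out element-wise to mirror the formulas. Apart from the binary example from the slide, the test vectors are made up.

```python
# Sketch of the three distances from the slides (Hamming, Euclidean, Minkowski),
# written out element-wise to mirror the formulas above.

def hamming(x, y):
    """dH(x, y) = sum_i |x_i - y_i|, most often used for binary vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """d(x, y) = [sum_i (x_i - y_i)^2]^(1/2)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def minkowski(x, y, r):
    """d(x, y) = [sum_i |x_i - y_i|^r]^(1/r), r > 0; r = 2 recovers Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

# The binary example from the slide: the two vectors differ in 4 positions.
print(hamming([1, 0, 0, 1, 1, 1, 0, 1], [1, 1, 0, 1, 0, 0, 1, 1]))  # 4
print(euclidean([0.0, 0.0], [3.0, 4.0]))                             # 5.0
print(minkowski([0.0, 0.0], [3.0, 4.0], r=1))                        # 7.0 (city-block)
```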
Statistical approach:
(Figure: two distribution densities p1(x) and p2(x) plotted against hair length)
Distribution densities p1(x) and p2(x)
If p1(x) > p2(x) then x is in class one,
otherwise it is in class two
The discriminant function is given by
p1(x) = p2(x)
Now the problem of statistical pattern recognition is reduced to
estimating the probability densities for given data {x} and {y}
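As an illustration of the decision rule, here is a minimal Python sketch that assigns x to class one whenever p1(x) > p2(x). The two Gaussian densities and their parameters are assumptions chosen only to stand in for estimated hair-length densities.

```python
# Minimal sketch of the statistical decision rule: assign x to class one when
# p1(x) > p2(x), otherwise to class two. The two densities are illustrative
# 1-D Gaussians (e.g. hair length for male vs. female).
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

def p1(x):  # assumed class-one density (short hair)
    return gaussian_pdf(x, mu=2.0, sigma=3.0)

def p2(x):  # assumed class-two density (long hair)
    return gaussian_pdf(x, mu=25.0, sigma=8.0)

def classify(x):
    return "class one" if p1(x) > p2(x) else "class two"

for x in (0.0, 30.0):
    print(x, classify(x))   # the boundary p1(x) = p2(x) is the discriminant
```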
In general there are two approaches
• Parametric method
• Nonparametric method
Parametric methods
Assume knowledge of the underlying probability density
distribution p(x)
Advantages: need only adjust the parameters of the distribution to
obtain the best fit. According to the central limit
theorem, we could assume in many cases that
the distribution is Gaussian (see below)
Disadvantage: if the assumption is wrong then poor performance
in terms of misclassification. However, if a crude
classification is acceptable then this can be OK
Normal (Gaussian) probability distribution
-- common assumption that the density distribution is normal
For a single variable X
p(x) = 1/(2πσ^2)^(1/2) · exp( -(x - μ)^2 / (2σ^2) )
mean E X = μ
variance E (X - E X)^2 = σ^2
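A small sketch of the parametric method in the single-variable case: estimate μ and σ^2 from a sample and plug them into the normal density above. The data values below are invented for illustration.

```python
# Sketch of the parametric method in one dimension: estimate the mean and
# variance from a data sample and plug them into the Gaussian density above.
# The data are made-up numbers, purely for illustration.
from math import exp, pi, sqrt

data = [1.2, 0.4, -0.3, 2.1, 0.9, 1.5, 0.0, 1.1]

mu = sum(data) / len(data)                           # E X = mu
var = sum((x - mu) ** 2 for x in data) / len(data)   # E (X - E X)^2 = sigma^2

def p(x):
    """Fitted normal density p(x) = exp(-(x-mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

print(mu, var, p(mu))  # the density peaks at the estimated mean
```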
For multiple dimensions x
p(x) = 1/( (2π)^(n/2) |Σ|^(1/2) ) · exp( -(1/2) (x - μ)^T Σ^(-1) (x - μ) )
with
x = (x1, ..., xn)^T the feature vector,
μ = (μ1, ..., μn)^T the mean vector,
Σ = (σij) an n×n covariance matrix, which is symmetric, with
σij = E[ (Xi - μi)(Xj - μj) ]
the covariance between Xi and Xj
|Σ| = determinant of Σ
Σ^(-1) = inverse of Σ
Fig. here
Mahalanobis distance
d(x, μ) = (x - μ)^T Σ^(-1) (x - μ)
(Figure: the contour d(x, μ) = c is an ellipse whose axes lie along the
eigenvectors u1, u2 of Σ, with lengths set by the eigenvalues λ1, λ2)
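The multivariate density and the Mahalanobis distance can be evaluated directly with NumPy; in this sketch the 2-D mean and covariance are assumed values, purely for illustration.

```python
# Sketch of the multivariate Gaussian density and the Mahalanobis distance,
# using an assumed 2-D mean and covariance for illustration.
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])          # symmetric n x n covariance matrix

Sigma_inv = np.linalg.inv(Sigma)
Sigma_det = np.linalg.det(Sigma)
n = len(mu)

def mahalanobis(x, mu, Sigma_inv):
    """d(x, mu) = (x - mu)^T Sigma^{-1} (x - mu), as on the slide (no square root)."""
    diff = x - mu
    return float(diff @ Sigma_inv @ diff)

def gaussian_pdf(x):
    """p(x) = exp(-d(x, mu)/2) / ((2 pi)^{n/2} |Sigma|^{1/2})."""
    d = mahalanobis(x, mu, Sigma_inv)
    return np.exp(-0.5 * d) / ((2 * np.pi) ** (n / 2) * np.sqrt(Sigma_det))

x = np.array([1.0, -0.5])
print(mahalanobis(x, mu, Sigma_inv), gaussian_pdf(x))
# Points with equal distance d(x, mu) = c lie on an ellipse whose axes follow
# the eigenvectors of Sigma.
```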
Topic
Hebbian learning rule
The Hebbian learning rule is local: it involves only two neurones,
independent of other variables
We will return to the Hebbian learning rule later in the course, in PCA
learning
There are other possible ways of learning which have been demonstrated in
experiments (see Nature Neuroscience, as in the previous lecture)
Biological learning vs. statistical learning
Biological learning: Hebbian learning rule
When an axon of cell A is near enough to excite a cell B and repeatedly
or persistently takes part in firing it, some growth process or metabolic
change takes place in one or both cells such that A's efficiency, as one
of the cells firing B, is increased
(Diagram: neuron A connected to neuron B)
Cooperation between two neurons
In mathematical terms: with w(t) as the weight between the two neurons at
time t,
w(t+1) = w(t) + η rA rB
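A minimal Python sketch of the update w(t+1) = w(t) + η rA rB; the learning rate and the firing rates of A and B below are made-up numbers for illustration.

```python
# Minimal sketch of the Hebbian update w(t+1) = w(t) + eta * r_A * r_B.
# The learning rate and firing rates are made-up numbers for illustration.

eta = 0.1          # learning rate (eta)
w = 0.0            # weight between neuron A and neuron B at time t = 0

# Assumed joint firing rates of A and B over a few time steps.
rates = [(1.0, 0.8), (0.9, 1.0), (0.2, 0.1), (1.0, 0.9)]

for r_A, r_B in rates:
    w = w + eta * r_A * r_B    # weight grows when A and B are active together
    print(f"r_A = {r_A}, r_B = {r_B} -> w = {w:.3f}")
```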