
CS621 : Artificial Intelligence
Pushpak Bhattacharyya
CSE Dept.,
IIT Bombay
Lecture 21: Perceptron training; Introducing Feedforward N/W
Perceptron as a learning device
Perceptron Training Algorithm
1. Start with a random value of w, e.g., <0,0,0,…>
2. Test for w·xi > 0. If the test succeeds for all i = 1, 2, …, n, then return w.
3. Otherwise modify w: wnext = wprev + xfail, where xfail is a vector that failed the test.
4. Go to step 2.
Tracing PTA on OR-example
w = <0, 0, 0>      wx1 fails
w = <-1, 0, 1>     wx4 fails
w = <0, 0, 1>      wx2 fails
w = <-1, 1, 1>     wx1 fails
w = <0, 1, 2>      wx4 fails
w = <1, 1, 2>      wx2 fails
w = <0, 2, 2>      wx4 fails
w = <1, 2, 2>      success
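The algorithm and the trace above can be reproduced with a short program. The sketch below is an illustrative Python rendering of PTA run on the OR example; the augmentation convention (prepend -1, negate the 0-class vector) and the order in which failing vectors are picked are assumptions, so the intermediate weights may differ from the trace above, but the algorithm still converges.

```python
# A minimal sketch of the Perceptron Training Algorithm (PTA) from the slides,
# run on the OR example. The augmentation convention and the choice of which
# failing vector to add first are assumptions, not taken from the slides.
import numpy as np

def pta(X, w=None):
    """X: rows are augmented (and, for the 0-class, negated) training vectors."""
    X = np.asarray(X, dtype=float)
    w = np.zeros(X.shape[1]) if w is None else np.asarray(w, dtype=float)
    while True:
        failed = [x for x in X if np.dot(w, x) <= 0]   # test w.xi > 0
        if not failed:                                  # all tests succeed
            return w
        w = w + failed[0]                               # w_next = w_prev + x_fail

# OR function: (0,0) -> 0, the rest -> 1.
# Augmented-negated set: prepend -1; the 0-class vector is then negated.
X_or = [[ 1, 0, 0],    # from (0,0), class 0, negated
        [-1, 0, 1],    # from (0,1), class 1
        [-1, 1, 0],    # from (1,0), class 1
        [-1, 1, 1]]    # from (1,1), class 1
print("Converged weight vector:", pta(X_or))
```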
Proof of Convergence of PTA
• Perceptron Training Algorithm (PTA)
• Statement:
Whatever the initial choice of weights and whatever vector is chosen for testing, PTA converges if the vectors are from a linearly separable function.
Proof of Convergence of PTA
• Consider the expression
  G(wn) = (wn · w*) / |wn|
  where wn = weight at the nth iteration; w* exists since the vectors are from a linearly separable function
• G(wn) = (|wn| · |w*| · cos θ) / |wn|
  where θ = angle between wn and w*
• G(wn) = |w*| · cos θ
• G(wn) ≤ |w*|   (as −1 ≤ cos θ ≤ 1)
Behavior of Numerator of G
• wn · w* = (wn-1 + Xn-1fail) · w*
          = wn-1 · w* + Xn-1fail · w*
          = (wn-2 + Xn-2fail) · w* + Xn-1fail · w* …
          = w0 · w* + (X0fail + X1fail + … + Xn-1fail) · w*
• w* · Xifail is always positive: note carefully
• Suppose |Xj| ≥ δ, where δ is the minimum magnitude
• Numerator of G ≥ |w0 · w*| + n δ |w*|
• So, the numerator of G grows with n.
Behavior of Denominator of G
• |wn| = √(wn · wn)
       = √((wn-1 + Xn-1fail)²)
       = √((wn-1)² + 2 · wn-1 · Xn-1fail + (Xn-1fail)²)
       ≤ √((wn-1)² + (Xn-1fail)²)   (as wn-1 · Xn-1fail ≤ 0)
       ≤ √((w0)² + (X0fail)² + (X1fail)² + … + (Xn-1fail)²)
• |Xj| ≤ δ′ (max magnitude)
• So, Denominator of G ≤ √((w0)² + n δ′²)
Some Observations
• Numerator of G grows as n
• Denominator of G grows as √n
=> Numerator grows faster than denominator
• If PTA does not terminate, G(wn) values will
become unbounded.
Some Observations contd.
• But, as |G(wn)| ≤ |w*| which is finite, this is
impossible!
• Hence, PTA has to converge.
• Proof is due to Marvin Minsky.
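The two growth bounds can be put side by side in a single line; the following is a compact restatement of the argument above, using δ and δ′ as on the previous slides.

```latex
% Numerator grows like n, denominator like sqrt(n), yet G is bounded by |w*|,
% so the number of updates n cannot grow without bound and PTA must terminate.
\[
G(w_n) \;=\; \frac{w_n \cdot w^*}{\lvert w_n\rvert}
\;\ge\; \frac{w_0 \cdot w^* + n\,\delta\,\lvert w^*\rvert}{\sqrt{\lvert w_0\rvert^{2} + n\,\delta'^{2}}}
\;\xrightarrow[\;n \to \infty\;]{}\; \infty,
\qquad \text{while } G(w_n) \le \lvert w^*\rvert .
\]
```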
Study of Linear Separability
• W · Xj = 0 defines a hyperplane in the (n+1)-dimensional space
  => the W vector and the Xj vectors are perpendicular to each other
[Figure: a perceptron with inputs x1, x2, x3, …, xn, weights w1, w2, w3, …, wn, threshold θ and output y]
Linear Separability
• Positive set: w · Xj > 0 for j ≤ k (vectors X1+, X2+, …, Xk+)
• Negative set: w · Xj < 0 for j > k (vectors Xk+1−, Xk+2−, …, Xm−)
[Figure: the separating hyperplane between the positive and negative sets]
Test for Linear Separability (LS)
• Theorem:
A function is linearly separable iff the vectors
corresponding to the function do not have a
Positive Linear Combination (PLC)
• PLC – Both a necessary and sufficient condition.
• X1, X2, …, Xm – vectors of the function
• Y1, Y2, …, Ym – augmented negated set
• Prepend -1 to every vector Xi (augmentation); the 0-class vectors are additionally negated. This gives Yi.
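As a quick illustration, a few lines of Python can build the augmented negated set; the function name and the use of plain Python lists are illustrative choices, not from the slides.

```python
# Sketch of the "augmented negated set" construction described above:
# prepend -1 to every vector; vectors of the 0-class are additionally negated.
def augmented_negated_set(vectors, labels):
    Y = []
    for x, label in zip(vectors, labels):
        y = [-1] + list(x)              # augmentation: prepend -1
        if label == 0:                  # 0-class: negate the whole vector
            y = [-v for v in y]
        Y.append(y)
    return Y

# XNOR (2-bit even parity) example from the next slide:
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
t = [1, 0, 0, 1]
print(augmented_negated_set(X, t))
# [[-1, 0, 0], [1, 0, -1], [1, -1, 0], [-1, 1, 1]]
```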
Example (1) - XNOR
• The set {Yi} has a PLC if
  Σ(i=1..m) Pi Yi = 0,
  – where each Pi is a non-negative scalar, and
  – at least one Pi > 0
• Example: 2-bit even parity (XNOR function)

  Xi          Class    Yi
  X1 <0,0>      +      Y1 <-1, 0, 0>
  X2 <0,1>      -      Y2 <1, 0, -1>
  X3 <1,0>      -      Y3 <1, -1, 0>
  X4 <1,1>      +      Y4 <-1, 1, 1>
Example (1) - XNOR
• P1 [-1 0 0]T + P2 [1 0 -1]T + P3 [1 -1 0]T + P4 [-1 1 1]T = [0 0 0]T
• All Pi = 1 gives the result.
• For the parity function, a PLC exists => not linearly separable.
Example (2) – Majority function
• 3-bit majority function
  Xi         Class    Yi
  <0 0 0>      -      <1  0  0  0>
  <0 0 1>      -      <1  0  0 -1>
  <0 1 0>      -      <1  0 -1  0>
  <0 1 1>      +      <-1 0  1  1>
  <1 0 0>      -      <1 -1  0  0>
  <1 0 1>      +      <-1 1  0  1>
  <1 1 0>      +      <-1 1  1  0>
  <1 1 1>      +      <-1 1  1  1>
• Suppose a PLC exists. The equations obtained (one per component of the Yi) are:
  P1 + P2 + P3 - P4 + P5 - P6 - P7 - P8 = 0
  -P5 + P6 + P7 + P8 = 0
  -P3 + P4 + P7 + P8 = 0
  -P2 + P4 + P6 + P8 = 0
• On solving, all Pi are forced to 0.
• 3-bit majority function => no PLC => LS
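The PLC test can be mechanised as a linear-programming feasibility check: look for P ≥ 0 with ΣPi = 1 and ΣPi·Yi = 0. The sketch below (assuming NumPy and SciPy are available; the function name is illustrative) reproduces the two results above.

```python
# Testing for a Positive Linear Combination (PLC) as an LP feasibility problem.
# If a feasible P exists, a PLC exists and the function is NOT linearly separable.
import numpy as np
from scipy.optimize import linprog

def has_plc(Y):
    """Y: array of augmented-negated vectors, one per row."""
    m, d = Y.shape
    # Equality constraints: Y^T P = 0 plus sum(P) = 1 (a harmless normalisation
    # that enforces "at least one Pi > 0", since any PLC can be rescaled).
    A_eq = np.vstack([Y.T, np.ones((1, m))])
    b_eq = np.concatenate([np.zeros(d), [1.0]])
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * m, method="highs")
    return res.success

# XNOR (2-bit even parity): PLC exists -> not linearly separable
Y_xnor = np.array([[-1, 0, 0], [1, 0, -1], [1, -1, 0], [-1, 1, 1]])
print("XNOR has PLC:", has_plc(Y_xnor))          # True

# 3-bit majority: no PLC -> linearly separable
Y_maj = np.array([[1, 0, 0, 0], [1, 0, 0, -1], [1, 0, -1, 0], [-1, 0, 1, 1],
                  [1, -1, 0, 0], [-1, 1, 0, 1], [-1, 1, 1, 0], [-1, 1, 1, 1]])
print("Majority has PLC:", has_plc(Y_maj))       # False
```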
Limitations of perceptron
• Non-linear separability is all-pervading
• A single perceptron does not have enough computing power
• E.g., XOR cannot be computed by a perceptron
Solutions
• Tolerate error (e.g., the pocket algorithm used by connectionist expert systems)
  – Try to get the best possible hyperplane using only perceptrons
• Use higher-dimensional surfaces, e.g., degree-2 surfaces like the parabola
• Use a layered network
Example - XOR
Calculation of XOR
• XOR is expressed as x1x̄2 + x̄1x2.
• Output neuron computing XOR: θ = 0.5, w1 = 1, w2 = 1, with x1x̄2 and x̄1x2 as its two inputs.
Calculation of x̄1x2
• Truth table:
  x1  x2  x̄1x2
  0   0   0
  0   1   1
  1   0   0
  1   1   0
• Neuron: w1 = -1 (from x1), w2 = 1.5 (from x2), θ = 1.
• Required constraints: 0 < θ, w2 ≥ θ, w1 < θ, w1 + w2 < θ.
Example - XOR
[Figure: the two-layer network computing XOR. Hidden unit for x1x̄2: weight 1.5 from x1, weight -1 from x2, θ = 1. Hidden unit for x̄1x2: weight -1 from x1, weight 1.5 from x2, θ = 1. Output unit: weights w1 = 1 and w2 = 1 from the two hidden units, θ = 0.5.]
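A small sketch of this two-layer threshold network, with the weights as reconstructed above, confirms the XOR truth table:

```python
# Minimal sketch of the two-layer threshold network for XOR: hidden units compute
# x1·not(x2) and not(x1)·x2, and the output unit ORs them.
def step(net, theta):
    return 1 if net >= theta else 0

def xor_network(x1, x2):
    h1 = step(1.5 * x1 - 1.0 * x2, theta=1.0)    # x1 AND NOT x2
    h2 = step(-1.0 * x1 + 1.5 * x2, theta=1.0)   # NOT x1 AND x2
    return step(1.0 * h1 + 1.0 * h2, theta=0.5)  # OR of the two hidden outputs

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_network(x1, x2))
# Prints 0, 1, 1, 0 for inputs (0,0), (0,1), (1,0), (1,1).
```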
Multi Layer Perceptron (MLP)
• Question: How to find weights for the hidden layers when no target output is available?
• Credit assignment problem – to be solved by “Gradient Descent”
Some Terminology
• A multilayer feedforward neural network has
– Input layer
– Output layer
– Hidden layer (assists computation)
Output units and hidden units are called
computation units.
Training of the MLP
Gradient Descent Technique
• Let E be the error at the output layer:
  E = (1/2) Σ(j=1..p) Σ(i=1..n) (ti − oi)j²
• ti = target output; oi = observed output
• i is the index going over the n neurons in the outermost layer
• j is the index going over the p patterns (1 to p)
• Ex: XOR: p = 4 and n = 1
Weights in a ff NN
• wmn is the weight of the connection from the nth neuron to the mth neuron.
• The E-versus-W surface is a complex surface in the space defined by the weights wij.
• −∂E/∂wmn gives the direction in which a movement of the operating point in the wmn co-ordinate space will result in maximum decrease in error:
  ∆wmn ∝ −∂E/∂wmn
[Figure: neuron n connected to neuron m by the weight wmn]
Sigmoid neurons
• Gradient Descent needs a derivative computation – not possible in a perceptron due to the discontinuous step function used!
=> Sigmoid neurons, with easy-to-compute derivatives, are used!
  y → 1 as x → ∞
  y → 0 as x → −∞
• Computing power comes from non-linearity of sigmoid
function.
Derivative of Sigmoid function
y = 1 / (1 + e^(−x))

dy/dx = −(1 + e^(−x))^(−2) · (−e^(−x)) = e^(−x) / (1 + e^(−x))²

      = [1 / (1 + e^(−x))] · [1 − 1 / (1 + e^(−x))]

      = y (1 − y)
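A quick numerical check of this identity (illustrative code, not from the slides):

```python
# Verify numerically that the sigmoid's derivative equals y*(1 - y),
# which is what makes gradient descent convenient for sigmoid neurons.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.7
y = sigmoid(x)
h = 1e-6
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central difference
analytic = y * (1 - y)                                    # dy/dx = y(1 - y)
print(numerical, analytic)   # the two values agree up to tiny numerical error
```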
Training algorithm
• Initialize weights to random values.
• For input x = <xn, xn-1, …, x0>, modify weights as follows
  (target output = t, observed output = o):
  ∆wi = −η ∂E/∂wi,   where E = (1/2)(t − o)²
• Iterate until E < threshold.
Calculation of ∆wi
n 1
E
E net 



 where : net   wi xi 
Wi net Wi 
i 0

E o net



o net Wi
 (t  o)o(1  o) xi
E
wi  
(  learning constant, 0    1)
wi
wi   (t  o)o(1  o) xi
Observations
Does the training technique support our intuition?
• The larger the xi, the larger is ∆wi
  – The error burden is borne by the weight values corresponding to large input values
Observations contd.
• ∆wi is proportional to the departure from target
• Saturation behaviour when o is 0 or 1
• If o < t, ∆wi > 0 and if o > t, ∆wi < 0, which is consistent with Hebb’s law
Hebb’s law
[Figure: neuron nj connected to neuron ni by the weight wji]
• If nj and ni are both in excitatory state (+1)
– Then the change in weight must be such that it enhances
the excitation
– The change is proportional to both the levels of excitation
∆wji ∝ e(nj) e(ni)
• If ni and nj are in a mutual state of inhibition ( one is +1 and the
other is -1),
– Then the change in weight is such that the inhibition is
enhanced (change in weight is negative)
Saturation behavior
• The algorithm is iterative and incremental
• If the weight values or the number of inputs is very large, the net input becomes large and the output lies in the saturation region.
• The weight values hardly change in the saturation region (since o(1 − o) ≈ 0 there).
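A small numerical illustration of this saturation effect (assumed values, not from the slides):

```python
# For large net input the sigmoid output is close to 1, so the gradient factor
# o*(1 - o) is tiny and the update delta_wi = eta*(t - o)*o*(1 - o)*xi almost vanishes.
import math

for net in (0.5, 2.0, 5.0, 10.0):
    o = 1.0 / (1.0 + math.exp(-net))
    print(f"net = {net:5.1f}  o = {o:.5f}  o*(1-o) = {o * (1 - o):.6f}")
# o*(1-o) drops from about 0.235 at net = 0.5 to about 0.000045 at net = 10.
```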
Backpropagation algorithm
[Figure: a layered network with an input layer (n i/p neurons), hidden layers, and an output layer (m o/p neurons); a typical connection wji runs from neuron i to neuron j]
• Fully connected feed forward network
• Pure FF network (no jumping of connections
over layers)
Gradient Descent Equations
• ∆wji = −η ∂E/∂wji   (η = learning rate, 0 < η < 1)
• ∂E/∂wji = (∂E/∂netj) · (∂netj/∂wji)   (netj = input at the jth layer)
• Writing δj = −∂E/∂netj, the update becomes
  ∆wji = η δj (∂netj/∂wji)
Example - Character
Recognition
• Output layer – 26 neurons (one per capital letter)
• The first output neuron has the responsibility of detecting all forms of ‘A’
• Centralized representation of outputs
• In distributed representations, all output neurons
participate in output
Backpropagation – for outermost layer

δj = −∂E/∂netj = −(∂E/∂oj) · (∂oj/∂netj)   (netj = input at the jth layer)

E = (1/2) Σ(p=1..m) (tp − op)²

Hence, δj = (tj − oj) oj (1 − oj)
and ∆wji = η (tj − oj) oj (1 − oj) oi
Backpropagation for hidden layers
[Figure: layered network with an output layer (m o/p neurons), hidden layers and an input layer (n i/p neurons); neurons k, j, i lie in successive layers]
• δk is propagated backwards to find the value of δj
Backpropagation – for hidden layers

∆wji = η δj oi

δj = −∂E/∂netj = −(∂E/∂oj) · (∂oj/∂netj)
   = −(∂E/∂oj) · oj (1 − oj)
   = −[ Σ(k ∈ next layer) (∂E/∂netk) · (∂netk/∂oj) ] · oj (1 − oj)
   = [ Σ(k ∈ next layer) δk wkj ] · oj (1 − oj)

Hence, δj = [ Σ(k ∈ next layer) wkj δk ] · oj (1 − oj)
and ∆wji = η [ Σ(k ∈ next layer) wkj δk ] oj (1 − oj) oi
General Backpropagation Rule
• General weight updating rule:
  ∆wji = η δj oi
• where
  δj = (tj − oj) oj (1 − oj)                        for the outermost layer
  δj = [ Σ(k ∈ next layer) wkj δk ] oj (1 − oj)     for hidden layers
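The general rule can be turned into a compact training loop. The sketch below applies it to a 2-2-1 sigmoid network learning XOR; the architecture, learning rate, epoch count and bias handling are illustrative assumptions, not taken from the slides.

```python
# Backpropagation sketch for a 2-2-1 sigmoid network on XOR, using the rule
# delta_w_ji = eta * delta_j * o_i with the two delta formulas above.
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
W_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 2 hidden, 2 inputs + bias
W_o = [random.uniform(-1, 1) for _ in range(3)]                      # 2 hidden + bias
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
eta = 0.5

for _ in range(20000):
    for (x1, x2), t in data:
        xin = [x1, x2, 1.0]                                          # bias input = 1
        o_h = [sigmoid(sum(w * x for w, x in zip(W_h[j], xin))) for j in range(2)]
        hin = o_h + [1.0]
        o = sigmoid(sum(w * h for w, h in zip(W_o, hin)))
        d_o = (t - o) * o * (1 - o)                                  # outermost-layer delta
        d_h = [W_o[j] * d_o * o_h[j] * (1 - o_h[j]) for j in range(2)]  # hidden-layer deltas
        for i in range(3):
            W_o[i] += eta * d_o * hin[i]                             # delta_w = eta*delta*o_i
        for j in range(2):
            for i in range(3):
                W_h[j][i] += eta * d_h[j] * xin[i]

for (x1, x2), t in data:
    xin = [x1, x2, 1.0]
    o_h = [sigmoid(sum(w * x for w, x in zip(W_h[j], xin))) for j in range(2)]
    o = sigmoid(sum(w * h for w, h in zip(W_o, o_h + [1.0])))
    print((x1, x2), t, round(o, 2))
# With most initialisations the outputs approach 0, 1, 1, 0; if training gets
# stuck in a local minimum, re-initialise the weights (see the slides below).
```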
Issues in the training algorithm
• The algorithm is greedy. It always changes the weights such that E reduces.
• The algorithm may get stuck in a local minimum.
• If we observe that E is not getting reduced anymore, the following may be the reasons:
Issues in the training algorithm
contd.
1. Stuck in a local minimum.
2. Network paralysis. (A high −ve or +ve input makes neurons saturate.)
3. η (learning rate) is too small.
Diagnostics in action(1)
1) If stuck in a local minimum, try the following:
• Re-initialize the weight vector.
• Increase the learning rate.
• Introduce more neurons in the hidden layer.
Diagnostics in action (1) contd.
2) If it is network paralysis, then increase the
number of neurons in the hidden layer.
• Problem: How to configure the hidden layer?
• Known: One hidden layer seems to be sufficient. [Kolmogorov (1960s)]
Diagnostics in action(2)
Kolmogorov statement:
A feedforward network with three layers (input, output and hidden), with an appropriate I/O relation that can vary from neuron to neuron, is sufficient to compute any function.
• More hidden layers reduce the size of individual layers.
Diagnostics in action(3)
3) Observe the outputs: If they are close to 0 or 1, try the
following:
1. Scale the inputs or divide by a normalizing factor.
2. Change the shape and size of the sigmoid.
Diagnostics in action (3) contd.
1. Introduce a momentum factor:
   (∆wji)nth iteration = η δj Oi + β (∆wji)(n−1)th iteration
• Accelerates the movement out of the trough.
• Dampens oscillation inside the trough.
• Choosing β: if β is large, we may jump over the minimum.
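A tiny sketch of the momentum update (η, β and the toy error function E(w) = w² are illustrative choices, not from the slides):

```python
# Momentum update: (delta_w)_n = -eta * dE/dw + beta * (delta_w)_{n-1},
# applied to minimising the toy error E(w) = w^2 (gradient 2w).
def momentum_update(w, prev_dw, grad, eta=0.1, beta=0.9):
    dw = -eta * grad + beta * prev_dw
    return w + dw, dw

w, dw = 2.0, 0.0
for _ in range(100):
    w, dw = momentum_update(w, dw, grad=2 * w)
print(round(w, 4))   # the iterate spirals in toward the minimum at w = 0
```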
An application in Medical
Domain
Expert System for Skin Diseases
Diagnosis
• Bumpiness and scaliness of skin
• Mostly for symptom gathering and for
developing diagnosis skills
• Not replacing doctor’s diagnosis
Architecture of the FF NN
• 96-20-10
• 96 input neurons, 20 hidden layer neurons, 10 output
neurons
• Inputs: skin disease symptoms and their parameters
– Location, distribution, shape, arrangement, pattern, number of lesions, presence of an active border, amount of scale, elevation of papules, color, altered pigmentation, itching, pustules, lymphadenopathy, palmar thickening, results of microscopic examination, presence of a herald patch, result of the dermatology test called KOH
Output
• 10 neurons indicative of the diseases:
– psoriasis, pityriasis rubra pilaris, lichen planus, pityriasis rosea, tinea versicolor, dermatophytosis, cutaneous T-cell lymphoma, secondary syphilis, chronic contact dermatitis, seborrheic dermatitis
Training data
• Input specs of the 10 model diseases from 250 patients
• 0.5 if some specific symptom value is not known
• Trained using the standard error backpropagation algorithm
Testing
• Previously unused symptom and disease data of 99 patients
• Result:
• Correct diagnosis achieved for 70% of papulosquamous group skin
diseases
• Success rate above 80% for the remaining diseases except for
psoriasis
• psoriasis diagnosed correctly only in 30% of the cases
• Psoriasis resembles other diseases within the papulosquamous
group of diseases, and is somewhat difficult even for specialists to
recognise.
Explanation capability
• Rule based systems reveal the explicit path of
reasoning through the textual statements
• Connectionist expert systems reach conclusions
through complex, non linear and simultaneous
interaction of many units
• Analysing the effect of a single input or a single group of inputs would be difficult and would yield incorrect results
Explanation contd.
• The hidden layer re-represents the data
• Outputs of hidden neurons are neither symptoms nor decisions
[Figure: Explanation of a dermatophytosis diagnosis using the DESKNET expert system. Three columns are shown — symptoms & parameters (duration of lesions in weeks, minimal itching, positive KOH test, lesions located on feet, minimal increase in pigmentation, positive test for pseudohyphae and spores, bias), the internal (hidden-layer) representation, and the disease-diagnosis outputs (psoriasis, dermatophytosis and seborrheic dermatitis nodes).]
Discussion
• Symptoms and parameters contributing to the diagnosis are found from the n/w
• Standard deviation, mean and other tests of significance are used to arrive at the importance of the contributing parameters
• The n/w acts as an apprentice to the expert