CS621/CS449
Artificial Intelligence
Lecture Notes
Set 4: 24/08/2004, 25/08/2004, 27/08/2004, 08/09/2004, 10/09/2004
Instructor: Prof. Pushpak Bhattacharyya
Outline
• Proof of Theorem: PLC and ~LS equivalence
• Feedforward neural nets
• Sigmoid networks
• Training through Gradient Descent
• Backpropagation algorithm
Proof of PLC ⇔ ~LS
We have to prove PLC ⇒ ~LS and ~LS ⇒ PLC.
• One proof of PLC ⇒ ~LS:
PLC exists ⇒ $\sum_i P_i Y_i = 0$ ... (1)
with $P_i \geq 0$ and $\sum_j P_j > 0$.
Suppose LS.
Then $\exists w$ such that $w \cdot Y_i > 0, \ \forall i$ ... (2)
Taking the dot product of w with (1) gives $\sum_i P_i (w \cdot Y_i) = 0$ ... (3)
But by (2) every term $P_i (w \cdot Y_i)$ is non-negative and at least one is strictly positive, so the sum must be > 0. This is a contradiction. QED.
Proof of PLC ⇔ ~LS
• Given a set of augmented and preprocessed vectors X1, X2, ..., Xm, each of dimension n, is there a vector w such that
$w \cdot X_i > 0$, i = 1, 2, ..., m ?
Or, equivalently,
$w \cdot D > [0\ 0\ \ldots\ 0]$ (m times)
where D = [X1 X2 ... Xm] is an n x m matrix and each Xi is an n x 1 matrix.
Proof of PLC ⇔ ~LS
• Alternatively, we could ask for a vector w1 such that
$w_1 \cdot D > [1\ 1\ \ldots\ 1]$ (m times)
• Consider a Linear Programming formulation:
Minimize $w_1 \cdot O$ subject to the constraint $w_1 \cdot D > [1\ 1\ \ldots\ 1]$ ... (F1)
• Trivially, $w_1 \cdot O = 0$ for every w1 (O is the zero vector), so F1 is purely a feasibility problem (see the sketch below).
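To make F1 concrete, here is a small sketch (illustrative Python, not part of the notes) that tests linear separability by checking feasibility of w1 · D ≥ [1 1 ... 1] with scipy.optimize.linprog. The helper name, the example data, and the particular augmentation/negation convention used to build D are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(D):
    """D: n x m matrix whose columns are the augmented, preprocessed vectors
    X1..Xm.  Returns True iff some w satisfies w . Xi >= 1 for all i
    (formulation F1), i.e. the vectors are linearly separable."""
    n, m = D.shape
    c = np.zeros(n)               # objective is the zero vector: pure feasibility
    A_ub = -D.T                   # rewrite w.Xi >= 1 as  -Xi.w <= -1
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n)   # w is unconstrained in sign
    return res.success

# Example: AND is linearly separable, XOR is not.
# Augmented vectors (x1, x2, -1); class-0 vectors are negated (preprocessing).
AND = np.array([[1, 1, -1],                       # class 1: (1,1)
                [0, 0, 1], [0, -1, 1], [-1, 0, 1]]).T   # class 0, negated
XOR = np.array([[0, 1, -1], [1, 0, -1],           # class 1
                [0, 0, 1], [-1, -1, 1]]).T        # class 0, negated
print(is_linearly_separable(AND))   # True
print(is_linearly_separable(XOR))   # False
```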
Proof of PLC ⇔ ~LS
• Any such LP formulation has a Dual LP (a consequence of Farkas' Lemma).
• The Dual LP says:
Maximize $[1\ 1\ \ldots\ 1] \cdot P$ subject to the constraint
$D \cdot P = 0$, where $P = [P_1\ P_2\ \ldots\ P_m]^T$, $P_i \geq 0$ ... (F2)
• Observation: F1 has a feasible solution iff the objective function of F2 is bounded.
Proof of PLC ⇔ ~LS
• Suppose PLC exists. Then $P \neq O$.
• If P is a solution of $D \cdot P = O$, then $\lambda P$ is also a solution for any $\lambda > 0$ (as large as required).
• Thus, the objective function of F2 becomes unbounded
=> F1 does not have a feasible solution
=> w1 does not exist
=> ~LS
Proof of PLC ⇔ ~LS
• Suppose ~LS.
=> w1 does not exist
=> F1 does not have a feasible solution
=> F2 becomes unbounded
=> PLC exists
Limitations of perceptron
• Non-linear separability is all-pervading.
• A single perceptron does not have enough computing power.
• E.g., XOR cannot be computed by a perceptron.
Solutions
• Tolerate error (e.g., the pocket algorithm used by connectionist expert systems).
– Try to get the best possible hyperplane using only perceptrons.
• Use higher-order surfaces, e.g., degree-2 surfaces such as a parabola.
• Use a layered network.
Pocket Algorithm
• The algorithm evolved in 1985 – it essentially uses the Perceptron Training Algorithm (PTA).
• Basic idea:
– Always preserve the best weight vector obtained so far in the "pocket".
– Change the pocket weights only if better weights are found (i.e. the changed weights result in reduced error); a code sketch of this idea follows below.
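A minimal sketch of the pocket idea (illustrative Python, not from the notes; the function name and the labelling convention y ∈ {+1, −1} are assumptions): run ordinary PTA updates, but keep a separate "pocket" copy of whichever weight vector has misclassified the fewest training vectors so far.

```python
import numpy as np

def pocket_algorithm(X, y, epochs=100, seed=0):
    """X: m x n array of augmented input vectors, y: labels in {+1, -1}.
    Runs the perceptron training algorithm but keeps the best weight
    vector (fewest misclassifications) seen so far in the 'pocket'."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = rng.normal(size=n)                   # current PTA weights

    def n_errors(w):
        return int(np.sum(np.sign(X @ w) != y))

    pocket_w, pocket_err = w.copy(), n_errors(w)
    for _ in range(epochs):
        for i in rng.permutation(m):
            if np.sign(X[i] @ w) != y[i]:
                w = w + y[i] * X[i]          # ordinary perceptron update
                e = n_errors(w)
                if e < pocket_err:           # better than the pocket? keep it
                    pocket_w, pocket_err = w.copy(), e
    return pocket_w, pocket_err
```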
Connectionist Expert Systems
• Rules and facts form Knowledge Base (KB)
• Rules encoded in a neural network
• Main problem: “Knowledge engineering” – lots of
training/experience required.
• Challenges in KE:
– Obtaining knowledge from experts
– Encoding this into the system
• Hybrid AI area – combines symbolic (program-based) AI and neural network AI
XOR using 2 layers
$x_1 \oplus x_2 = x_1 \bar{x}_2 + \bar{x}_1 x_2 = OR(AND(x_1, NOT(x_2)),\ AND(NOT(x_1), x_2))$
• Non-LS function expressed as a linearly separable
function of individual linearly separable functions.
Example - XOR
[Figure: output threshold unit computing XOR, with θ = 0.5 and weights w1 = 1, w2 = 1 on the two hidden outputs $\bar{x}_1 x_2$ and $x_1 \bar{x}_2$]

Truth table for XOR:
x1  x2  x1 ⊕ x2
0   0   0
0   1   1
1   0   1
1   1   0

Calculation of $\bar{x}_1 x_2$: a threshold unit with w1 = -1, w2 = 1.5, θ = 1.
The weights must satisfy $w_2 \geq \theta$, $w_1 < \theta$, $w_1 + w_2 < \theta$.
Example - XOR
[Figure: complete two-layer network for XOR. Hidden unit computing $\bar{x}_1 x_2$: weights -1 (from x1) and 1.5 (from x2), θ = 1. Hidden unit computing $x_1 \bar{x}_2$: weights 1.5 (from x1) and -1 (from x2), θ = 1. Output unit: weights 1 and 1 on the two hidden outputs, θ = 0.5.]
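The weights and thresholds read off the figures can be checked directly with step (threshold) units; a minimal sketch (illustrative code, not from the notes):

```python
def step(net, theta):
    """Threshold unit: fires (1) iff the net input reaches the threshold."""
    return 1 if net >= theta else 0

def xor(x1, x2):
    h1 = step(-1.0 * x1 + 1.5 * x2, 1.0)   # computes (NOT x1) AND x2
    h2 = step(1.5 * x1 - 1.0 * x2, 1.0)    # computes x1 AND (NOT x2)
    return step(1.0 * h1 + 1.0 * h2, 0.5)  # OR of the two hidden outputs

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))   # reproduces the XOR truth table
```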
Some Terminology
• A multilayer feedforward neural network has
– Input layer
– Output layer
– Hidden layer (performs intermediate computation)
Output units and hidden units are called
computation units.
Training of the MLP
• Multilayer Perceptron (MLP)
• Question:- How to find weights for the hidden
layers when no target output is available?
• Credit assignment problem – to be solved by
“Gradient Descent”
Gradient Descent Technique
• Let E be the error at the output layer
$E = \frac{1}{2} \sum_{j=1}^{p} \sum_{i=1}^{n} (t_i - o_i)_j^2$
• ti = target output; oi = observed output
• i is the index going over n neurons in the outermost layer
• j is the index going over the p patterns (1 to p)
• E.g., XOR: p = 4 and n = 1 (see the sketch below).
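A small sketch of this error definition (illustrative code, not from the notes; the example network outputs are made-up numbers):

```python
import numpy as np

def total_error(targets, outputs):
    """targets, outputs: p x n arrays (p patterns, n output neurons).
    E = (1/2) * sum over patterns j and output neurons i of (t_i - o_i)_j^2."""
    return 0.5 * np.sum((targets - outputs) ** 2)

# XOR example from the slide: p = 4 patterns, n = 1 output neuron.
t = np.array([[0.0], [1.0], [1.0], [0.0]])
o = np.array([[0.1], [0.8], [0.7], [0.2]])   # hypothetical network outputs
print(total_error(t, o))   # 0.5 * (0.01 + 0.04 + 0.09 + 0.04) = 0.09
```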
Weights in a NN
• wmn is the weight of the connection from the nth neuron to the mth neuron.
[Figure: neuron n connected to neuron m by the weight wmn]
• The E vs. W surface is a complex surface in the space defined by the weights wij.
• $-\frac{\partial E}{\partial w_{mn}}$ gives the direction in which a movement of the operating point in the wmn co-ordinate space will result in the maximum decrease in error.
Sigmoid neurons
• Gradient descent needs a derivative computation
– not possible in the perceptron due to the discontinuous step function used!
⇒ Sigmoid neurons with easy-to-compute derivatives are required!
$y \to 1$ as $x \to \infty$
$y \to 0$ as $x \to -\infty$
• Computing power comes from the non-linearity of the sigmoid function.
Derivative of Sigmoid function
$y = \frac{1}{1 + e^{-x}}$

$\frac{dy}{dx} = \frac{1}{(1 + e^{-x})^2} \cdot e^{-x} = \frac{e^{-x}}{(1 + e^{-x})^2}$

$= \frac{1}{1 + e^{-x}} \left( 1 - \frac{1}{1 + e^{-x}} \right) = y (1 - y)$
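A quick numerical check of the identity dy/dx = y(1 − y) (illustrative code, not from the notes):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    y = sigmoid(x)
    return y * (1.0 - y)          # the closed form derived above

# Compare with a central finite-difference approximation of dy/dx.
x = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.max(np.abs(numeric - sigmoid_derivative(x))))   # ~1e-11
```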
Training algorithm
• Initialize weights to random values.
• For input $x = \langle x_n, x_{n-1}, \ldots, x_0 \rangle$, modify weights as follows
(target output = t, observed output = o):
$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$ where $E = \frac{1}{2}(t - o)^2$
• Iterate until $E < \epsilon$ (threshold)
Calculation of ∆wi
n 1
E
E net 



 where : net   wi xi 
Wi net Wi 
i 0

E o net



o net Wi
 (t  o)o(1  o) xi
E
wi  
(  learning constant, 0    1)
wi
wi   (t  o)o(1  o) xi
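Putting the rule Δwi = η(t − o)o(1 − o)xi into a loop trains a single sigmoid neuron by gradient descent. A minimal sketch (illustrative code, not from the notes; the AND data set, learning rate, threshold ε and iteration cap are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sigmoid_neuron(X, t, eta=0.5, eps=0.01, max_iters=100000):
    """X: m x (n+1) augmented inputs (last column = -1 for the threshold),
    t: m targets in [0, 1].  Applies w_i += eta*(t-o)*o*(1-o)*x_i per pattern
    until E = 0.5 * sum (t - o)^2 falls below eps."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.5, size=X.shape[1])
    E = np.inf
    for _ in range(max_iters):
        o = sigmoid(X @ w)
        E = 0.5 * np.sum((t - o) ** 2)
        if E < eps:
            break
        for x_p, t_p in zip(X, t):              # pattern-by-pattern update
            o_p = sigmoid(x_p @ w)
            w += eta * (t_p - o_p) * o_p * (1 - o_p) * x_p
    return w, E

# AND function (linearly separable, so a single neuron suffices).
X = np.array([[0, 0, -1], [0, 1, -1], [1, 0, -1], [1, 1, -1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)
w, E = train_sigmoid_neuron(X, t)
print(np.round(sigmoid(X @ w)))   # approximately [0, 0, 0, 1] once trained
```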
Observations
Does the training technique support our intuition?
• The larger the xi, the larger is ∆wi
– the error burden is borne by the weight values corresponding to large input values.
• If t = o, ∆wi = 0.
• ∆wi ∝ departure from target, (t − o).
• Saturation behaviour.
• If o < t, ∆wi > 0 and if o > t, ∆wi < 0 ⇒ agrees with Hebb's law.
Hebb’s law
[Figure: neurons nj and ni connected by the weight wji]
• If nj and ni are both in excitatory state (+1)
– Then the change in weight might be such that it enhances
the excitation
– The change is proportional to both the levels of excitation
∆wji ∝ e(nj) · e(ni)
• If ni and nj are in a mutual state of inhibition ( one is +1 and the
other is -1),
– Then the change in weight is such that the inhibition is
enhanced (change in weight is negative)
Saturation behavior
• The algorithm is iterative and incremental
• If the weight values or the number of inputs is very large, the net input becomes large and the output lies in the saturation region of the sigmoid.
• The weight values hardly change in the saturation region, since the factor o(1 − o) is close to 0 there.
Backpropagation algorithm
[Figure: layered feedforward network – input layer (n i/p neurons), hidden layers, output layer (m o/p neurons); wji is the weight on the connection from neuron i to neuron j]
• Fully connected feed forward network
• Pure FF network (no jumping of connections
over layers)
Gradient Descent Equations
$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}}$   ($\eta$ = learning rate, $0 \leq \eta \leq 1$)

$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}}$   ($net_j$ = net input to the jth neuron)

Defining $\delta_j = -\frac{\partial E}{\partial net_j}$,

$\Delta w_{ji} = \eta\, \delta_j\, \frac{\partial net_j}{\partial w_{ji}}$
Example - Character Recognition
• Output layer – 26 neurons (one for each capital letter)
• First output neuron has the responsibility of
detecting all forms of ‘A’
• Centralized representation of outputs
• In distributed representations, all output neurons
participate in output
Backpropagation – for outermost layer

$\delta_j = -\frac{\partial E}{\partial net_j} = -\frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial net_j}$   ($net_j$ = input at the jth layer)

$E = \frac{1}{2} \sum_{p=1}^{m} (t_p - o_p)^2$

Hence, $\delta_j = (t_j - o_j)\, o_j (1 - o_j)$

$\Delta w_{ji} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, o_i$
Backpropagation for hidden layers
[Figure: same layered network – δk at the layer above is propagated backwards to find the value of δj at the hidden layer]
Backpropagation – for hidden layers
$\Delta w_{ji} = \eta\, \delta_j\, o_i$

$\delta_j = -\frac{\partial E}{\partial net_j} = -\frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial net_j} = -\frac{\partial E}{\partial o_j} \cdot o_j (1 - o_j)$

$= -\left( \sum_{k \in \text{next layer}} \frac{\partial E}{\partial net_k} \cdot \frac{\partial net_k}{\partial o_j} \right) o_j (1 - o_j) = \left( \sum_{k \in \text{next layer}} \delta_k\, w_{kj} \right) o_j (1 - o_j)$

Hence, $\delta_j = \left( \sum_{k \in \text{next layer}} w_{kj}\, \delta_k \right) o_j (1 - o_j)$ and $\Delta w_{ji} = \eta \left( \sum_{k \in \text{next layer}} w_{kj}\, \delta_k \right) o_j (1 - o_j)\, o_i$
General Backpropagation Rule
• General weight updating rule:
$\Delta w_{ji} = \eta\, \delta_j\, o_i$
• where
$\delta_j = (t_j - o_j)\, o_j (1 - o_j)$   for the outermost layer
$\delta_j = \left( \sum_{k \in \text{next layer}} w_{kj}\, \delta_k \right) o_j (1 - o_j)$   for hidden layers
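To make the general rule concrete, here is a minimal sketch (illustrative Python, not from the notes) of a 2-3-1 sigmoid network trained on XOR with the update Δwji = η δj oi. The architecture, learning rate, bias handling (an additive bias instead of a −1 input), and iteration count are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

# 2 inputs -> 3 hidden sigmoid units -> 1 output sigmoid unit (with biases).
W1 = rng.normal(scale=0.5, size=(2, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 1)); b2 = np.zeros(1)
eta = 0.5

for _ in range(20000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)                  # hidden outputs o_j
    o = sigmoid(h @ W2 + b2)                  # network outputs
    # Output-layer deltas: delta_j = (t_j - o_j) * o_j * (1 - o_j).
    delta_o = (T - o) * o * (1 - o)
    # Hidden-layer deltas: delta_j = (sum_k w_kj * delta_k) * o_j * (1 - o_j).
    delta_h = (delta_o @ W2.T) * h * (1 - h)
    # Weight updates: delta_w_ji = eta * delta_j * o_i (summed over patterns).
    W2 += eta * h.T @ delta_o; b2 += eta * delta_o.sum(axis=0)
    W1 += eta * X.T @ delta_h; b1 += eta * delta_h.sum(axis=0)

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)))
# usually close to [[0],[1],[1],[0]]; a bad initialization can occasionally
# get stuck in a local minimum (see "Issues in the training algorithm" below)
```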
Issues in the training algorithm
• The algorithm is greedy. It always changes the weights such that E reduces.
• The algorithm may get stuck in a local minimum.
• If we observe that E is not getting reduced anymore, the following may be the reasons:
1. Stuck in a local minimum.
2. Network paralysis. (High -ve or +ve i/p makes neurons saturate.)
3. η (the learning rate) is too small.
Diagnostics in action(1)
1) If stuck in a local minimum, try the following:
• Re-initialize the weight vector.
• Increase the learning rate.
• Introduce more neurons in the hidden layer.
2) If it is network paralysis, then increase the number of neurons in the hidden layer.
– Problem: How to configure the hidden layer?
– Known: One hidden layer seems to be sufficient. [Kolmogorov (1960s)]
Diagnostics in action(2)
Kolmogorov's statement:
A feedforward network with three layers (input, output and hidden), with an appropriate I/O relation that can vary from neuron to neuron, is sufficient to compute any function.
– More hidden layers reduce the size of the individual layers.
Diagnostics in action(3)
3) Observe the outputs: if they are close to 0 or 1, try the following:
1. Scale the inputs or divide by a normalizing factor.
2. Change the shape and size of the sigmoid.
3. Introduce a momentum factor (a code sketch follows below):
$(\Delta w_{ji})_{n\text{-th iteration}} = \eta\, \delta_j\, o_i + \beta\, (\Delta w_{ji})_{(n-1)\text{-th iteration}}$   ($\beta$ = momentum factor)
– Accelerates the movement out of the trough.
– Dampens oscillation inside the trough.
– Choosing $\beta$: if $\beta$ is large, we may jump over the minimum.
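A minimal sketch of the momentum update in code (illustrative; the function name and the value of the momentum factor β are assumptions):

```python
def momentum_update(delta_j, o_i, prev_delta_w, eta=0.5, beta=0.9):
    """(delta_w_ji) at iteration n = eta * delta_j * o_i
                                     + beta * (delta_w_ji) at iteration n-1.
    The beta term speeds movement out of a trough and damps oscillation in it."""
    return eta * delta_j * o_i + beta * prev_delta_w

# Inside a training loop, keep the previous step and feed it back in:
#   dw = momentum_update(delta_j, o_i, dw)
#   w_ji += dw
```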