ICO Learning
Gerhard Neumann
Seminar A, SS06
Overview




Short Overview of different control methods
Correlation Based Learning
ISO Learning
Comparison to other Methods ([Wörgötter05])




TD Learning
STDP
ICO Learning ([Porr06])
Learning Receptive Fields ([Kulvicius06])
Comparison of ISO Learning to other Methods


Comparison for Classical Conditioning learning problems (open-loop control)
Relating RL to Classical Conditioning

Classical Conditioning: Pairing of two subsequent
stimuli is learned such that the presentation of the
first stimulus is taken as a predictor of the second
one.
RL: Maximization of Rewards:

v … Predictor of future reward
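In standard RL notation (the concrete form here is an assumption, using the usual definition of the value as the expected discounted future reward):

$$ v(t) \;=\; E\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r(t+k+1)\right], \qquad 0 \le \gamma \le 1 $$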

RL for Classical Conditioning

TD-Error:


Weight Change:



Derivative term:
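In common TD notation (the exact symbols are an assumption; cf. [Wörgötter05]):

$$ \delta(t) \;=\; r(t) + \gamma\, v(t) - v(t-1), \qquad \Delta w_i \;=\; \alpha\, \delta(t)\, \bar{x}_i(t) $$

where $\bar{x}_i$ is an eligibility trace of the input $x_i$, and the derivative-like term $\gamma\, v(t) - v(t-1)$ approximates $\dot{v}(t)$ for $\gamma \approx 1$.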
=> Nothing new so far…
Goal: After learning, the output v should react to the onset of the
CS xn and remain active until the reward terminates
Represent the CS internally by a chain of n + 1 delayed
pulses xi

Replace the states from traditional RL with time steps
RL for Classical Conditioning

Special kind of E-Trace


Learning Steps:


Serial compound representation
Rectangular response of v
Special treatment of the reward is not necessary

x0 can replace the reward when w0 is set to 1 at the beginning
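To make the serial-compound construction above concrete, here is a minimal Python sketch of a standard TD(λ) update applied to such a pulse chain. The parameter values, the eligibility-trace update and the explicit reward signal are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Serial-compound TD learning for classical conditioning (minimal sketch).
T = 60                              # length of one trial in time steps
cs_onset, us_onset = 10, 40         # onset of the CS and of the reward (US)
n = us_onset - cs_onset             # number of delayed CS pulses
alpha, gamma, lam = 0.1, 1.0, 0.9   # learning rate, discount, trace decay

w = np.zeros(n + 1)                 # one weight per delayed pulse x_i

def x(t):
    """Serial compound: x_n fires at CS onset, ..., x_0 at the reward time
    (same index convention as on the slides)."""
    pulses = np.zeros(n + 1)
    if cs_onset <= t <= cs_onset + n:
        pulses[n - (t - cs_onset)] = 1.0
    return pulses

for trial in range(200):
    v_prev, x_prev, trace = 0.0, np.zeros(n + 1), np.zeros(n + 1)
    for t in range(T):
        xt = x(t)
        v = float(w @ xt)                   # prediction of future reward
        r = 1.0 if t == us_onset else 0.0   # US treated as the reward
        delta = r + gamma * v - v_prev      # TD error
        trace = lam * trace + x_prev        # eligibility trace of the inputs
        w += alpha * delta * trace          # weight change
        v_prev, x_prev = v, xt

# After learning, v rises at CS onset and stays elevated until the reward,
# i.e. the roughly rectangular response mentioned above.
```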
Comparison for Classical Conditioning

Correlation Based Learning


"Reward" x0 is not an independent term as in TD learning
TD-Learning
Comparison for Classical Conditioning

TD-Learning

ISO-Learning


Uses another form of E-Traces (Band-pass filters)
Used for all input pathways

-> also for calculating the output
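Written out (notation as in [Porr03]; the weights are denoted w_j here):

$$ u_j(t) = (h_j * x_j)(t), \qquad v(t) = \sum_j w_j\, u_j(t), \qquad \frac{dw_j}{dt} = \mu\, u_j\, \frac{dv}{dt} $$

where the h_j are the band-pass filters (E-traces) mentioned above.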
Comparison for the Closed Loop

Closed loop



Actions of the agent affect future sensory input
The comparison is no longer as straightforward, because the behavior of the
algorithms is now quite different
Reward Based Architectures







Actor-Critic Architecture
Uses evaluative feedback
Reward maximization
A good reward signal is very
often hard to find
In nature: Found by evolution
Can theoretically be applied to any learning problem
Resolution in the State Space:


Only applicable for low dimensional state spaces
-> Curse of dimensionality!
Comparison for the Closed Loop

Correlation Based Architectures


Non-evaluative feedback, all signals are value free
Minimize Disturbance

Valid regions are usually much bigger than for reward maximization





Evaluations are implicitly built into the sign of the reaction behavior
Actor and Critic are the same architectural building block
Only for a restricted set of learning problems


Better Convergence!
Restricted Solutions
Hard to apply for complex tasks
Resolution in Time:


Only looks at temporal correlation of the input variables
Can be applied for high dimensional state spaces
Comparison of ISO Learning and STDP

ISO learning generically produces a bimodal weight change
curve
 Similar to the STDP (spike-timing-dependent plasticity) weight change curve

ISO learning STDP rule:




Potential from the synapse: Filtered version of a spike
Gradient Dependent Model
Much faster time scale used in STDP
Can model different kinds of synapses with different filters easily
Overview




Short Overview of different control methods
Correlation Based Learning
ISO Learning
Comparison to other Methods ([Wörgötter05])




TD Learning
STDP
ICO Learning ([Porr06])
Learning Receptive Fields ([Kulvicius06])
ICO (Input Correlation Only) Learning

Drawback of Hebbian Learning


Auto-Correlation can result in divergence even if x0 = 0
ISO learning:
 Relies on orthogonal filters of different inputs



Each filtered signal is orthogonal to its own derivative
Only holds if a steady state is assumed
 The autocorrelation no longer vanishes if the weights are
changed during the impulse response of the filters
 -> cannot be applied with large learning rates
=> Can only be used with small learning rates, otherwise the
autocorrelation causes the weights to diverge
ICO & ISO Learning

ISO Learning

ICO Learning
ICO Learning

Simple adaptation of the ISO learning rule (a sketch of both update rules follows at the end of this slide)




Correlate only inputs with each other
No correlation with the output
 -> No Auto Correlation
Define one Input as the reflex input x0
Drawback:
 Loss of Generality: Not Isotropic any more


Not all inputs are treated equally any more
Advantage:
 Can use much higher learning rates (up to 100x faster)
 Can use almost arbitrary types of filter
 No Divergence in weights any more
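As a side-by-side illustration of the two update rules, here is a minimal discrete-time Python sketch (simple Euler discretisation; function and variable names are my own, only the form of the updates follows the ISO/ICO rules): ISO correlates each filtered input with the derivative of the output v, while ICO replaces that derivative with the derivative of the filtered reflex input u_0.

```python
import numpy as np

def iso_step(w, u, v_prev, mu=1e-3):
    """ISO: correlate every filtered input u_j with the derivative of the output v."""
    v = float(w @ u)
    dv = v - v_prev                 # discrete derivative of the output
    w = w + mu * u * dv             # dw_j/dt = mu * u_j * dv/dt  (isotropic: all j)
    return w, v

def ico_step(w, u, u0_prev, mu=1e-3):
    """ICO: correlate the predictive inputs only with the derivative of the
    filtered reflex input u_0; the output no longer enters the learning rule."""
    du0 = u[0] - u0_prev            # discrete derivative of u_0
    w = w.copy()
    w[1:] += mu * u[1:] * du0       # dw_j/dt = mu * u_j * du_0/dt  (j >= 1)
    # w[0] (the reflex weight) stays fixed, typically at 1
    v = float(w @ u)
    return w, v
```

Because the output v (and with it the changing weights) never enters the ICO update, the autocorrelation term that destabilises ISO learning at high learning rates is absent.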
ICO Learning

Weight change curve (open loop, just one filter bank):

Same as for ISO learning

Weight development over time:

ISO learning shows an exponential instability,
even after setting x0 to 0 after 100000 time steps
ICO Learning: Closing the Loop

Output of learner v feeds back to its inputs xj after being modified by
the environment


Learning Goal:


Reactive pathway: fixed reactive feedback control
Learn an earlier reaction that keeps x0 (the disturbance or error signal) at 0
One can prove that, under simplified conditions,
one-shot learning is possible

With one filter bank and impulse-shaped input signals
Using the z-transform
ICO Learning: Applications

Simulated Robot Experiment:


Robot has to find food (disks in the environment)
Sensors for the unconditioned stimulus:
 2 touch sensors (left + right)
 Reflex: the robot elicits a sharp turn when it touches a disk

  The turn pulls the robot into the centre of the disk
Sensors for the predictive stimulus:
 2 sound (distance) sensors (left + right); the disks act as sound sources
 Can measure the distance to a disk
 Stimulus: difference between the left and right sound signals
 Use 5 filters (resonators) in the filter bank
Output v: Steering angle of the Robot
ICO Learning: Simulated Robot

A single experience was sufficient to produce
adapted behavior

Only possible with ICO learning
Simulated Robot

Comparison for different Learning rates
 ICO Learning
ISO Learning

Learning was counted as successful if the criterion was fulfilled for a
sequence of four contacts

Equivalent for small learning rates

Small autocorrelation term
Simulated Robot

Two Different Learning Rates

Divergent Behavior of ISO learning for high
learning rates

The robot then shows avoidance behavior towards the food disks
Applications continued

More Complex Task:


Three food disks simultaneously
No simple relationship between the reflex input and the predictive
input any more

Superimposed sound fields
This task is only learned by ICO learning, not by ISO learning
ICO: Real Robot Application

Real Robot:




Target the white disk from a distance
Reflex: pulls the robot onto the white disk just at the
moment the robot drives over the disk
 Achieved by analysing the bottom scanline of a camera image
Predictive input:
 Analysing a scanline from the top of the image
Filter bank:
 5 FIR filters with different filter lengths

All coefficients set to 1 -> smears out the signal
Narrow viewing angle of the camera
 The robot therefore has to be placed more or less in front of the disk
ICO: Real Robot Experiment

Processing the input



Calculate the deviation of the positions of all white points in a scanline from
the center of the scanline
This yields a 1D signal (see the sketch below)
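A minimal Python sketch of this preprocessing step; the threshold value and function name are illustrative assumptions.

```python
import numpy as np

def scanline_signal(scanline, white_threshold=200):
    """Reduce one camera scanline to a single scalar: the mean deviation of
    all white pixels from the centre of the scanline."""
    scanline = np.asarray(scanline, dtype=float)
    centre = (len(scanline) - 1) / 2.0
    white = np.flatnonzero(scanline > white_threshold)   # positions of white pixels
    if white.size == 0:
        return 0.0                                        # no disk visible
    return float(np.mean(white - centre))                 # signed 1D deviation signal
```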
Results:


A: before learning
B & C: after learning (14 contacts)
Weights oscillate around their best values, but do not diverge
ICO Learning: Other Applications

Mechanical Arm

The arm is always controlled by a PI controller towards a
specified set point
 Input of the PI controller: motor position
 The PI controller is used as the reactive filter
Disturbance:
 Pushing force of a second small arm mounted to the
main arm
 A fast-reacting touch sensor measures the disturbance D
Use 10 resonator filters in the filter bank
ICO Learning: Other Applications

Result:



Control is shifted backwards in time
The error signal (deviation from the set point) almost vanishes
Other example: Temperature Control

Predict temperature changes caused by another heater
Overview




Short Overview of different control methods
Correlation Based Learning
ISO Learning
Comparison to other Methods ([Wörgötter05])




TD Learning
STDP
ICO Learning ([Porr06])
Learning Receptive Fields ([Kulvicius06])
Development of Receptive Fields through Temporal Sequence Learning [Kulvicius06]

Develop receptive fields by ICO learning


Learn behavior and receptive fields simultaneously
Usually these two learning processes are considered separately

First approach in which the receptive fields and the behavior are
trained simultaneously
Shows the application of ICO learning to high-dimensional input
spaces
Line Following

System:


The robot should learn to better follow a
line painted on the ground
Reactive input:
 x0 … pixels at the bottom of the image
Predictive input:
 x1 … pixels in the middle of the image
 Use 10 different filters (resonators) in the filter bank
Reflexive output:
 Brings the robot back to the line
 Not a smooth behavior
Motor output:
 v modifies speed and steering of the robot
 S … constant speed
 Use left-right symmetry
Line Following

Simple System

Fixed sensor banks, all pixels are summed up

Input x1 predicts x0
Line Following

Three different Tracks





Steep, Shallow, Sharp
For one learning experiment the same track is always used
The robot steers much more smoothly
Usually one trial is enough for learning
Videos
 Without Learning
 Steep
 Sharp
Line Following: Receptive Fields

Receptive fields


Use 225 pixels for the far sensors
Use individual filter banks for each pixel


10 filters per pixel
Left-Right Symmetry:

The left receptive field is a mirror image of the right
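A minimal sketch of this high-dimensional setup in Python, assuming the filtered pixel responses are already available; shapes, names and the learning rate are illustrative assumptions.

```python
import numpy as np

N_PIXELS, N_FILTERS = 225, 10            # far-field pixels and filters per pixel
w = np.zeros((N_PIXELS, N_FILTERS))      # one weight per pixel and per filter

def ico_update(w, U, du0, mu=1e-5):
    """One ICO step over the whole receptive field.

    U   : (N_PIXELS, N_FILTERS) filtered predictive pixel signals u_{p,f}(t)
    du0 : scalar, derivative of the filtered reflex signal u_0(t)
    """
    w = w + mu * U * du0                 # same ICO rule as before, per pixel/filter
    v = float(np.sum(w * U))             # predictive contribution to the steering
    return w, v

# The learned receptive field can be visualised as the sum of the filter
# weights per pixel, as in the result plots on the following slide.
```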
Line Following: Receptive Fields

Results




Lower learning rates have to be used
More trials are needed (3 to 6 trials)
Different RFs are learned for different tracks
Steep and sharp track; the plots show the sum of all filter weights for each
pixel
Conclusion


Correlation Based Learning
 Tries to minimize the influence of disturbances
 Easier to learn than Reinforcement Learning
 The framework is less general
Questions:
 When to apply correlation-based learning and when
reinforcement learning?


How can these two methods be combined?



How is it done by Animals/Humans?
Correlation learning in early learning stage
RL for fine tuning
ICO Learning
 Improvement of ISO learning
 More stable; higher learning rates can be used

 One-shot learning is possible
Literature:




[Wörgötter05]: F. Wörgötter and B. Porr, Temporal Sequence Learning,
Prediction, and Control: A Review of Different Models and Their
Relation to Biological Mechanisms
[Porr03]: B. Porr, F. Wörgötter, Isotropic Sequence Order
Learning
[Porr06]: B. Porr, F. Wörgötter, Strongly improved stability and
faster convergence of temporal sequence learning by utilising
input correlations only
[Kulvicius06]: T. Kulvicius, B. Porr and F. Wörgötter,
Behaviourally Guided Development of Primary and Secondary
Receptive Fields through temporal sequence learning