
Neural Networks
Intelligent Systems – Lecture 13
Copyright 2008 STI INNSBRUCK www.sti-innsbruck.at
1) Motivation
2) Technical solution
• Definitions
• Explanations
• Procedures
• Illustrations by short examples
3) Illustration by a larger example
4) Extensions (work only sketched)
Motivation
• A main motivation behind neural networks is that symbolic rules do not reflect the reasoning processes actually performed by humans.
• Biological neural systems can perform highly parallel computations based on representations that are distributed over many neurons.
• They learn and generalize from training data, so there is no need to program everything explicitly.
• They are very noise-tolerant – more robust than symbolic systems.
• In summary: in principle, neural networks can do whatever symbolic or logic systems can do, and more; in practice, however, this is not that obvious.
Motivation
• Neural networks are strong in:
– Learning from a set of examples
– Optimizing solutions via constraints and cost functions
– Classification: grouping elements in classes
– Speech recognition, pattern matching
– Non-parametric statistical analysis and regressions
Introduction: What are Neural Networks?
• Neural networks are networks
of neurons as in the real
biological brain.
• Neurons are highly specialized
cells that transmit impulses within
animals to cause a change in a target
cell such as a muscle effector cell or glandular cell.
• The axon is the primary conduit through which the neuron transmits impulses to neurons downstream in the signal chain.
• Humans: 10^11 neurons of more than 20 types, 10^14 synapses, 1ms-10ms cycle time.
• Signals are noisy "spike trains" of electrical potential.
Introduction: What are Neural Networks?
• What we refer to as Neural Networks in this course are thus mostly Artificial Neural Networks (ANNs).
• ANNs are approximations of biological neural networks; they are built of physical devices or simulated on computers.
• ANNs are parallel computational entities that consist of multiple simple processing units, connected in specific ways in order to perform the desired tasks.
• Remember: ANNs are computationally primitive approximations of real biological brains.
Comparison & Contrast
• Neural networks vs. classical symbolic computing:
1. Sub-symbolic vs. Symbolic
2. Non-modular vs. Modular
3. Distributed Representation vs. Localist representation
4. Bottom-up vs. Top Down
5. Parallel processing vs. Sequential processing
• In reality, however, it can be observed that these distinctions become increasingly less obvious!
McCulloch-Pitts "Unit" – Artificial Neurons
• Output is a "squashed" linear function of the inputs:
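The displayed formula did not survive the transcript; in the standard notation of Russell & Norvig (which these slides appear to follow), with g the activation function and W_{j,i} the weight on the link from unit j to unit i, the output of unit i is:

a_i = g(in_i) = g\left( \sum_j W_{j,i} \, a_j \right)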
• A gross oversimplification of real neurons, but its purpose is to develop an understanding of what networks of simple units can do.
Activation Functions
(a) is a step function or threshold function
(b) is the sigmoid function 1/(1 + e^-x)
• Changing the bias weight W0,i moves the threshold location
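A minimal sketch of these two functions in Python/numpy (the code is illustrative and not part of the original slides):

    import numpy as np

    def step(x):
        # (a) step/threshold function: outputs 1 once the input reaches 0
        return np.where(x >= 0, 1.0, 0.0)

    def sigmoid(x):
        # (b) sigmoid function 1/(1 + e^-x): a smooth, differentiable "squash"
        return 1.0 / (1.0 + np.exp(-x))

    # A bias weight shifts where the threshold (or the steep region) sits:
    x = np.linspace(-5.0, 5.0, 11)
    print(step(x - 2.0))     # threshold moved from 0 to 2
    print(sigmoid(x - 2.0))  # sigmoid centered at 2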
Perceptron
• McCulloch-Pitts neurons can be connected together in any
desired way to build an artificial neural network.
• A construct of one input layer of neurons that feeds forward to one output layer of neurons is called a Perceptron.
(Figure: a simple perceptron)
Example: Logical Functions
• McCulloch and Pitts: every Boolean function can be implemented with an artificial neuron.
Example: Finding Weights for AND Operation
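The worked example itself is missing from the transcript; one common weight setting (an assumption here, not the only valid one) uses W_1 = W_2 = 1 with threshold 1.5, i.e., bias weight W_0 = -1.5. A minimal Python sketch:

    def and_neuron(x1, x2):
        in_sum = 1.0 * x1 + 1.0 * x2 - 1.5  # weighted sum including bias
        return 1 if in_sum >= 0 else 0      # step activation

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, and_neuron(x1, x2))  # reproduces the AND truth table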
Neural Network Structures
• Mathematically, artificial neural networks are represented by weighted directed graphs.
• In more practical terms, a neural network has activations flowing between processing units via one-way connections.
• Three common artificial neural network architectures are:
– Single-Layer Feed-Forward (Perceptron)
– Multi-Layer Feed-Forward
– Recurrent Neural Network
Single-Layer Feed-Forward
• A Single-Layer Feed-Forward structure is a simple perceptron and thus has:
– one input layer
– one output layer
– NO feedback connections
• Feed-forward networks implement functions and have no internal state (this of course also holds for multi-layer perceptrons).
Multi-Layer Feed-Forward
• Multi-Layer Feed-Forward Structures have:
– one input layer
– one output layer
– one or MORE hidden layers of processing units
• The hidden layers lie between the input and the output layer, and are thus hidden from the outside world: no input from the world, no output to the world.
Recurrent Network
• Recurrent networks have at least one feedback connection:
– They thus have directed cycles with delays: they have internal state (like flip-flops), can oscillate, etc.
– The response to an input depends on the initial state, which may depend on previous inputs – they can model short-term memory.
– Hopfield networks have symmetric weights (Wij = Wji)
– Boltzmann machines use stochastic activation functions,
≈ MCMC in Bayes nets
Feed-forward example
• A feed-forward network is a parameterized family of nonlinear functions:
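The formula itself is missing from the transcript; for the usual small example (inputs 1 and 2, hidden units 3 and 4, output unit 5, as in Russell & Norvig), the network computes:

a_5 = g(W_{3,5} \, a_3 + W_{4,5} \, a_4) = g\big( W_{3,5} \, g(W_{1,3} \, a_1 + W_{2,3} \, a_2) + W_{4,5} \, g(W_{1,4} \, a_1 + W_{2,4} \, a_2) \big)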
• Adjusting the weights changes the function: this is where learning occurs!
Single-layer feed-forward neural networks: perceptrons
• Output units all operate separately – no shared weights (so the study can be limited to single-output perceptrons!)
• Adjusting the weights moves the location, orientation, and steepness of the cliff
Expressiveness of perceptrons (1)
• Consider a perceptron with g = step function (Rosenblatt, 1957, 1960) – it can model a Boolean function, i.e., classification
• Can represent AND, OR, NOT, majority (weights: n/2, 1, 1, ...), etc. compactly, but not XOR
• Represents a linear separator in input space
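In symbols: a threshold perceptron outputs 1 exactly when \sum_j W_j x_j > 0, i.e., when W \cdot x > 0, so its decision boundary is the hyperplane W \cdot x = 0.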
• Threshold perceptrons can represent only linearly separable functions (i.e., functions for which such a separating hyperplane exists)
(Figure: example point sets in the x1-x2 input space; a separating line exists only for linearly separable functions)
• Limited expressivity, but there exists an algorithm that will fit a threshold perceptron to any linearly separable training set
Perceptron learning (1)
• Learn by adjusting weights to reduce the error on the training set
• The squared error for an example with input x and true output y is given below (first formula after this list)
• Perform the optimization search by gradient descent
• Simple weight update rule (last formula after this list):
– positive error ⇒ increase network output
⇒ increase weights on positive inputs, decrease on negative inputs
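The displayed equations did not survive the transcript; in the usual notation (h_W(x) the network output, Err = y - h_W(x), g the activation function, α the learning rate) they read:

E = \frac{1}{2} \, Err^2 = \frac{1}{2} \big( y - h_W(x) \big)^2

\frac{\partial E}{\partial W_j} = -Err \times g'(in) \times x_j

W_j \leftarrow W_j + \alpha \times Err \times g'(in) \times x_j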
• The algorithm runs the training examples through the net one at a time and adjusts the weights after each example; such a cycle is called an epoch
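A minimal sketch of this learning loop in Python/numpy (using a step activation, so the g'(in) factor is dropped, as in the classical perceptron rule; the code is illustrative, not from the slides):

    import numpy as np

    def train_perceptron(X, y, alpha=0.1, epochs=25):
        X = np.hstack([np.ones((len(X), 1)), X])  # prepend bias input a_0 = 1
        w = np.zeros(X.shape[1])
        for _ in range(epochs):                   # one epoch = one pass over the data
            for x_i, y_i in zip(X, y):
                pred = 1 if x_i @ w >= 0 else 0   # step activation
                w += alpha * (y_i - pred) * x_i   # W_j <- W_j + alpha * Err * x_j
        return w

    # AND is linearly separable, so the rule converges:
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 0, 0, 1])
    w = train_perceptron(X, y)
    print([1 if np.r_[1, x] @ w >= 0 else 0 for x in X])  # -> [0, 0, 0, 1]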
Perceptron learning (2)
• The perceptron learning rule converges to a consistent function for any linearly separable data set
• A perceptron learns the majority function easily, while decision tree learning (DTL) is hopeless at it
• DTL learns the restaurant function easily; a perceptron cannot represent it
Multilayer perceptrons
• Layers are usually fully connected; the number of hidden units is typically chosen by hand
• Hidden layers enlarge the space of hypotheses that the network can represent
• Learning is done by the back-propagation algorithm: errors are back-propagated from the output layer to the hidden layers
Expressiveness of MLPs (1)
• 2 layers can represent all continuous functions
• 3 layers can represent all functions
• Combine two opposite-facing threshold functions to make a ridge
• Combine two perpendicular ridges to make a bump
• Add bumps of various sizes and locations to fit any surface
• The required number of hidden units grows exponentially with the number of inputs (2^n/n for all Boolean functions)
Expressiveness of MLPs (2)
• The required number of hidden units grows exponentially with the number of inputs (2^n/n for all Boolean functions):
– more hidden units – more bumps
– a single, sufficiently large hidden layer can represent any continuous function of the inputs with arbitrary accuracy
– two layers are needed for discontinuous functions
– for any particular network structure, it is harder to characterize exactly which functions can be represented and which ones cannot
Example MLP
• Build a hidden-layer network for the restaurant problem (see lecture 7, Learning from observations):
– 10 attributes describing each example
• Restaurant problem: decide whether to wait for a table at a restaurant, based on the following attributes:
– Alternate: is there an alternative restaurant nearby?
– Bar: is there a comfortable bar area to wait in?
– Fri/Sat: is today Friday or Saturday?
– Hungry: are we hungry?
– Patrons: number of people in the restaurant (None, Some, Full)
– Price: price range ($, $$, $$$)
– Raining: is it raining outside?
– Reservation: have we made a reservation?
– Type: kind of restaurant (French, Italian, Thai, Burger)
– WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60 minutes)
• A multilayer neural network suitable for the restaurant problem has:
– 10 input units
– one hidden layer with four hidden units
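A minimal numpy sketch of that 10-4-1 shape (random weights and a simplified numeric encoding of the attributes; both are assumptions for illustration, not part of the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(10, 4))  # 10 input units -> 4 hidden units
    W2 = rng.normal(size=(4, 1))   # 4 hidden units -> 1 output unit (WillWait)

    def g(x):
        return 1.0 / (1.0 + np.exp(-x))  # sigmoid activation

    def forward(x):
        # x: the 10 restaurant attributes, encoded as numbers in [0, 1]
        return g(g(x @ W1) @ W2)

    print(forward(rng.random(10)))  # output activation, read as "wait or not"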
Learning
• Learning is one of the most powerful features of neural networks.
• The network adapts the weights of the connections between neurons (analogous to synaptic strengths in a biological neural network) so that the final output activation reflects the correct answer.
Back-propagation learning (1)
• Output layer: same as for the single-layer perceptron (first formula after this list, where Δ_i is defined)
• Hidden layer: back-propagate the error from the output layer (second formula)
• Update rule for weights in the hidden layer: third formula
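The equations themselves were lost in transcription; in the notation of Russell & Norvig (which these slides appear to follow), with Err_i the error at output unit i and Δ_i = Err_i × g'(in_i), they read:

W_{j,i} \leftarrow W_{j,i} + \alpha \times a_j \times \Delta_i (output-layer weight update)

\Delta_j = g'(in_j) \sum_i W_{j,i} \, \Delta_i (error back-propagated to hidden unit j)

W_{k,j} \leftarrow W_{k,j} + \alpha \times a_k \times \Delta_j (hidden-layer weight update)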
• (Most neuroscientists deny that back-propagation occurs in the brain)
Back-propagation learning (2)
• At each epoch, the gradient updates are summed over all examples
• The training curve for 100 restaurant examples converges to a perfect fit to the training data
• Typical problems: slow convergence, local minima
Back-propagation learning (3)
• Learning curve for an MLP with 4 hidden units (as in our restaurant example)
• MLPs are quite good for complex pattern recognition tasks, but the resulting hypotheses cannot be understood easily
Kernel machines
• A new family of learning methods: support vector machines (SVMs) or, more generally, kernel machines
• They combine the best of single-layer networks and multi-layer networks:
– they use an efficient training algorithm
– they can represent complex, nonlinear functions
• Techniques:
– kernel machines find the optimal linear separator, i.e., the one that has the largest margin between it and the positive examples on one side and the negative examples on the other
– this is a quadratic programming optimization problem
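A minimal sketch using scikit-learn (the library choice is an assumption; it is not mentioned in the slides) showing a kernel machine handling XOR, which no perceptron can represent:

    from sklearn import svm

    X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # XOR: not linearly separable
    y = [0, 1, 1, 0]

    clf = svm.SVC(kernel="rbf")  # kernel trick: linear separation in feature space
    clf.fit(X, y)                # internally solved as a quadratic program
    print(clf.predict(X))        # should recover the XOR labels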
Application Example (1)
• Many applications of MLPs: speech, driving, handwriting, fraud
detection, etc.
• Example: handwritten digit recognition
– automated sorting of mail by postal code
– automated reading of checks
– data entry for hand-held computers
Application Example (2)
• Two examples of handwritten digits:
– Top row: digits 0-9, easy to identify
– Bottom row: digits 0-9, more difficult to identify
• Different learning approaches applied to the examples:
– 3-nearest-neighbor classifier: 2.4% error
– single-hidden-layer neural network, a 400-300-10 unit MLP (i.e., 400 input units, 300 hidden units, 10 output units): 1.6% error
– specialized neural networks ("LeNet"): MLPs with 3 hidden layers of 768, 192, and 30 units: 0.9% error
– current best (kernel machines, vision algorithms): ≈ 0.6% error
• Humans are estimated to have an error rate of about 0.2% on this problem
– not extensively tested though
– experience from the United States Postal Service: human error rate of 2.5%
Summary
• Most brains have lots of neurons; each neuron approximates a linear-threshold unit
• Perceptrons (one-layer networks) are insufficiently expressive
• Multi-layer networks are sufficiently expressive; can be trained
by gradient descent, i.e., error back-propagation
• Many applications: speech, driving, handwriting, etc.
• Engineering, cognitive modeling, and neural system modeling
subfields have largely diverged