CSCE833 Machine Learning
Lecture 10
Artificial Neural Networks
Dr. Jianjun Hu
mleg.cse.sc.edu/edu/csce833
University of South Carolina
Department of Computer Science and Engineering
Outline
Midterm: moved to March 15th
Neural Network Learning
Self-Organizing Maps
  Origins
  Algorithm
  Example
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Neuron: no division, only one axon
Neural Networks
Networks of processing units (neurons) with
connections (synapses) between them
Large number of neurons: $10^{10}$
Large connectivity: $10^{5}$
Parallel processing
Distributed computation/memory
Robust to noise, failures
Understanding the Brain
Levels of analysis (Marr, 1982):
1. Computational theory
2. Representation and algorithm
3. Hardware implementation
Reverse engineering: From hardware to theory
Parallel processing: SIMD vs MIMD
Neural net: SIMD with modifiable local memory
Learning: Update by training/experience
Perceptron
$y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}$
$\mathbf{w} = [w_0, w_1, \dots, w_d]^T$
$\mathbf{x} = [1, x_1, \dots, x_d]^T$
(Rosenblatt, 1962)
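As a quick illustration of the perceptron output above (not from the slides), here is a minimal NumPy sketch in which the bias $w_0$ is supplied through the constant input $x_0 = +1$; the weight and input values are made up:

```python
# Minimal perceptron output y = w^T x, with the bias absorbed into the weights.
import numpy as np

def perceptron_output(w, x):
    x_aug = np.concatenate(([1.0], x))   # x = [1, x1, ..., xd]^T
    return w @ x_aug                     # y = sum_j w_j x_j + w_0

w = np.array([0.5, 1.0, -2.0])           # [w0, w1, w2], illustrative values
print(perceptron_output(w, np.array([3.0, 1.0])))  # 0.5 + 3.0 - 2.0 = 1.5
```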
What a Perceptron Does
Regression: $y = wx + w_0$ (a line fitted to the data; the bias $w_0$ enters through the extra input $x_0 = +1$)
Classification: $y = 1(wx + w_0 > 0)$
With a sigmoid output:
$y = \operatorname{sigmoid}(o) = \dfrac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
K Outputs
Regression:
$y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = \mathbf{w}_i^T \mathbf{x}$, i.e. $\mathbf{y} = \mathbf{W}\mathbf{x}$
Classification:
$o_i = \mathbf{w}_i^T \mathbf{x}$
$y_i = \dfrac{\exp o_i}{\sum_k \exp o_k}$
choose $C_i$ if $y_i = \max_k y_k$
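A hedged sketch of K parallel perceptrons with softmax outputs, following the classification equations above; the weight matrix and input are illustrative values only:

```python
# K outputs: o_i = w_i^T x, y_i = softmax(o_i); choose C_i with the largest y_i.
import numpy as np

def k_outputs(W, x):
    o = W @ np.concatenate(([1.0], x))   # o_i = w_i^T x, bias in column 0
    y = np.exp(o - o.max())              # softmax (shifted for numerical stability)
    y /= y.sum()
    return y

W = np.array([[0.1, 1.0, -1.0],
              [0.0, -0.5, 2.0],
              [0.2, 0.3, 0.3]])          # K = 3 rows of [w_i0, w_i1, w_i2]
y = k_outputs(W, np.array([1.0, 2.0]))
print(y, "choose class", int(np.argmax(y)))
```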
Training
Online (instances seen one by one) vs batch (whole
sample) learning:
No need to store the whole sample
Problem may change in time
Wear and degradation in system components
Stochastic gradient descent: update after a single pattern
Generic update rule (LMS rule):
$\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$
Update = LearningFactor · (DesiredOutput − ActualOutput) · Input
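A minimal sketch of online training with the LMS rule above, updating after each pattern; the toy data, learning rate, and epoch count are arbitrary choices:

```python
# Online (stochastic) LMS training: delta w_j = eta * (r - y) * x_j per pattern.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
r = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.5          # target: a noiseless linear function
w = np.zeros(4)                                  # [w0, w1, w2, w3]
eta = 0.05
for epoch in range(50):
    for x_t, r_t in zip(X, r):
        x_aug = np.concatenate(([1.0], x_t))
        y_t = w @ x_aug                          # actual output
        w += eta * (r_t - y_t) * x_aug           # LearningFactor*(Desired-Actual)*Input
print(w)  # should approach [0.5, 2.0, 0.0, -1.0]
```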
Training a Perceptron: Regression
Regression (linear output):
$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \tfrac{1}{2}\,(r^t - y^t)^2 = \tfrac{1}{2}\,\left(r^t - \mathbf{w}^T\mathbf{x}^t\right)^2$
$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$
Classification
Single sigmoid output:
$y^t = \operatorname{sigmoid}(\mathbf{w}^T\mathbf{x}^t)$
$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = -r^t \log y^t - (1 - r^t)\log(1 - y^t)$
$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$
K > 2 softmax outputs:
$y_i^t = \dfrac{\exp(\mathbf{w}_i^T\mathbf{x}^t)}{\sum_k \exp(\mathbf{w}_k^T\mathbf{x}^t)}$
$E^t(\{\mathbf{w}_i\}_i \mid \mathbf{x}^t, \mathbf{r}^t) = -\sum_i r_i^t \log y_i^t$
$\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$
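A minimal sketch of the single-sigmoid-output case: the update $\Delta w_j = \eta(r - y)x_j$ is exactly the gradient step for the cross-entropy error above. The toy linearly separable data and the learning rate are assumptions for illustration:

```python
# Online training of a sigmoid-output perceptron with the cross-entropy gradient.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
r = (X[:, 0] + X[:, 1] > 0).astype(float)        # a linearly separable labelling
w = np.zeros(3)
eta = 0.1
for epoch in range(100):
    for x_t, r_t in zip(X, r):
        x_aug = np.concatenate(([1.0], x_t))
        y_t = sigmoid(w @ x_aug)
        w += eta * (r_t - y_t) * x_aug           # delta w_j = eta*(r - y)*x_j
acc = np.mean((sigmoid(np.c_[np.ones(len(X)), X] @ w) > 0.5) == r)
print(w, acc)                                    # accuracy should be close to 1.0
```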
Learning Boolean AND
XOR
No $w_0, w_1, w_2$ satisfy:
$w_0 \le 0$
$w_2 + w_0 > 0$
$w_1 + w_0 > 0$
$w_1 + w_2 + w_0 \le 0$
(Minsky and Papert, 1969)
Multilayer Perceptrons
$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$
$z_h = \operatorname{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \dfrac{1}{1 + \exp\!\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$
(Rumelhart et al., 1986)
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
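The following sketch uses hand-picked weights (my own illustrative values, not from the slides) to show that a one-hidden-layer MLP with two sigmoid units can realise this decomposition of XOR:

```python
# One-hidden-layer MLP computing x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

W = 10.0 * np.array([[-0.5,  1.0, -1.0],    # z1 ~ x1 AND NOT x2  (rows: [w_h0, w_h1, w_h2])
                     [-0.5, -1.0,  1.0]])   # z2 ~ NOT x1 AND x2
v = 10.0 * np.array([-0.5, 1.0, 1.0])       # y  ~ z1 OR z2       ([v0, v1, v2])

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = sigmoid(W @ np.array([1.0, x1, x2]))
    y = sigmoid(v @ np.concatenate(([1.0], z)))
    print(x1, x2, round(float(y)))          # prints 0, 1, 1, 0
```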
Backpropagation
$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$
$z_h = \operatorname{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \dfrac{1}{1 + \exp\!\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$
Chain rule:
$\dfrac{\partial E}{\partial w_{hj}} = \dfrac{\partial E}{\partial y_i}\,\dfrac{\partial y_i}{\partial z_h}\,\dfrac{\partial z_h}{\partial w_{hj}}$
Regression:
$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \tfrac{1}{2}\sum_t \left(r^t - y^t\right)^2$
Forward:
$z_h^t = \operatorname{sigmoid}(\mathbf{w}_h^T\mathbf{x}^t)$
$y^t = \sum_{h=1}^{H} v_h z_h^t + v_0$
Backward:
$\Delta v_h = \eta \sum_t (r^t - y^t)\,z_h^t$
$\Delta w_{hj} = -\eta\,\dfrac{\partial E}{\partial w_{hj}} = -\eta \sum_t \dfrac{\partial E}{\partial y^t}\,\dfrac{\partial y^t}{\partial z_h^t}\,\dfrac{\partial z_h^t}{\partial w_{hj}} = \eta \sum_t (r^t - y^t)\,v_h\,z_h^t(1 - z_h^t)\,x_j^t$
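A hedged sketch of batch backpropagation for single-output regression, implementing the $\Delta v_h$ and $\Delta w_{hj}$ updates above; the network size, learning rate, toy data, and the averaging over patterns are illustrative choices:

```python
# Batch backpropagation for a one-hidden-layer regression network.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
X = np.linspace(-3, 3, 60)[:, None]
r = np.sin(X[:, 0])                                   # toy 1-d regression target
Xa = np.c_[np.ones(len(X)), X]                        # inputs with a bias column (x0 = +1)
H, eta = 10, 0.05
W = rng.normal(scale=0.5, size=(H, Xa.shape[1]))      # hidden-unit weights w_hj
v = rng.normal(scale=0.5, size=H + 1)                 # output weights [v0, v1, ..., vH]

for epoch in range(2000):
    Z = sigmoid(Xa @ W.T)                             # forward: z_h^t for all patterns
    Za = np.c_[np.ones(len(X)), Z]
    y = Za @ v                                        # y^t = sum_h v_h z_h^t + v_0
    err = r - y                                       # (r^t - y^t)
    dW = ((err[:, None] * v[1:]) * Z * (1 - Z)).T @ Xa  # sum_t (r-y) v_h z_h(1-z_h) x_j
    v += eta * Za.T @ err / len(X)                    # delta v_h, averaged over patterns
    W += eta * dW / len(X)                            # delta w_hj, averaged over patterns
print(float(np.mean(err ** 2)))                       # final mean squared error
```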
Regression with Multiple Outputs
$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \tfrac{1}{2}\sum_t \sum_i \left(r_i^t - y_i^t\right)^2$
$y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$
$\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\,z_h^t$
$\Delta w_{hj} = \eta \sum_t \left[\sum_i (r_i^t - y_i^t)\,v_{ih}\right] z_h^t(1 - z_h^t)\,x_j^t$
[Figures: evolution of the fit over training epochs, and the hidden-unit quantities $\mathbf{w}_h^T\mathbf{x} + w_{h0}$, $z_h$, and $v_h z_h$ learned by the network.]
Two-Class Discrimination
One sigmoid output $y^t$ estimates $P(C_1 \mid \mathbf{x}^t)$, and $P(C_2 \mid \mathbf{x}^t) \equiv 1 - y^t$
$y^t = \operatorname{sigmoid}\!\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$
$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t \left[r^t \log y^t + (1 - r^t)\log(1 - y^t)\right]$
$\Delta v_h = \eta \sum_t (r^t - y^t)\,z_h^t$
$\Delta w_{hj} = \eta \sum_t (r^t - y^t)\,v_h\,z_h^t(1 - z_h^t)\,x_j^t$
K>2 Classes
$y_i^t = \dfrac{\exp o_i^t}{\sum_k \exp o_k^t} \approx P(C_i \mid \mathbf{x}^t)$
$o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$
$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t$
$\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\,z_h^t$
$\Delta w_{hj} = \eta \sum_t \left[\sum_i (r_i^t - y_i^t)\,v_{ih}\right] z_h^t(1 - z_h^t)\,x_j^t$
Multiple Hidden Layers
MLP with one hidden layer is a universal
approximator (Hornik et al., 1989), but using
multiple layers may lead to simpler networks
$z_{1h} = \operatorname{sigmoid}(\mathbf{w}_{1h}^T \mathbf{x}) = \operatorname{sigmoid}\!\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right),\quad h = 1,\dots,H_1$
$z_{2l} = \operatorname{sigmoid}(\mathbf{w}_{2l}^T \mathbf{z}_1) = \operatorname{sigmoid}\!\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right),\quad l = 1,\dots,H_2$
$y = \mathbf{v}^T \mathbf{z}_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0$
Improving Convergence
Momentum:
$\Delta w_i^t = -\eta\,\dfrac{\partial E^t}{\partial w_i} + \alpha\,\Delta w_i^{t-1}$
Adaptive learning rate:
$\Delta\eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^{t} \\ -b\,\eta & \text{otherwise} \end{cases}$
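A minimal sketch of the momentum update above on a simple quadratic error surface; the error function, $\eta$, and $\alpha$ are illustrative assumptions:

```python
# Momentum update: delta_w(t) = -eta * dE/dw + alpha * delta_w(t-1).
import numpy as np

def grad(w):                      # gradient of E(w) = 0.5*(w1^2 + 10*w2^2)
    return np.array([w[0], 10.0 * w[1]])

w = np.array([5.0, 5.0])
delta = np.zeros_like(w)
eta, alpha = 0.02, 0.9
for t in range(200):
    delta = -eta * grad(w) + alpha * delta
    w = w + delta
print(w)                          # converges towards the minimum at (0, 0)
```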
Overfitting/Overtraining
Number of weights: $H(d + 1) + (H + 1)K$
Tuning the Network Size
Destructive: weight decay
$\Delta w_i = -\eta\,\dfrac{\partial E}{\partial w_i} - \lambda w_i$
$E' = E + \dfrac{\lambda}{2}\sum_i w_i^2$
Constructive: growing networks (Ash, 1989; Fahlman and Lebiere, 1989)
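A minimal sketch of one weight-decay step, i.e. a gradient step on E plus a pull of every weight towards zero; $\eta$ and $\lambda$ here are illustrative values:

```python
# Destructive pruning via weight decay: delta w_i = -eta*dE/dw_i - lambda*w_i.
import numpy as np

def weight_decay_step(w, grad_E, eta=0.01, lam=1e-3):
    return w - eta * grad_E - lam * w

# illustrative usage: weights that receive no error gradient shrink towards zero
w = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    w = weight_decay_step(w, grad_E=np.zeros_like(w))
print(w)   # each entry has decayed by a factor of (1 - lam)**100
```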
Dimensionality Reduction
Learning Time: Sequential Learning
Applications:
Sequence recognition: Speech recognition
Sequence reproduction: Time-series prediction
Sequence association
Network architectures
Time-delay networks (Waibel et al., 1989)
Recurrent networks (Rumelhart et al., 1986)
Time-Delay Neural Networks
Recurrent Networks
Unfolding in Time
Self-Organizing Maps: Origins
Ideas first introduced by C. von der Malsburg
(1973), developed and refined by T. Kohonen
(1982)
Neural network algorithm using unsupervised
competitive learning
Primarily used for organization and visualization
of complex data
Biological basis: ‘brain maps’
SOM - Architecture
Lattice of neurons ('nodes') accepts and responds to a set of input signals
Responses compared; 'winning' neuron selected from the lattice
Selected neuron activated together with 'neighbourhood' neurons
Adaptive process changes weights to more closely resemble inputs
[Figure: a 2-d array of neurons; each neuron j receives every input signal x1, x2, ..., xn through weighted synapses wj1, wj2, ..., wjn (the inputs are connected to all neurons in the lattice).]
SOM – Result Example
Classifying World Poverty (Helsinki University of Technology)
'Poverty map' based on 39 indicators from World Bank statistics (1992)
SOM – Algorithm Overview
1. Randomly initialise all weights
2. Select input vector x = [x1, x2, x3, ..., xn]
3. Compare x with the weights wj of each neuron j to determine the winner
4. Update the winner so that it becomes more like x, together with the winner's neighbours
5. Adjust parameters: learning rate & 'neighbourhood function'
6. Repeat from (2) until the map has converged (i.e. no noticeable changes in the weights) or a pre-defined number of training cycles has passed
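A minimal, self-contained sketch of steps (1) to (6) on a 2-d grid; the grid size, learning-rate schedule, and Gaussian neighbourhood are my own illustrative choices, not prescribed by the slides:

```python
# Minimal SOM training loop following the algorithm overview above.
import numpy as np

def som_train(X, grid=(10, 10), n_cycles=1000, eta0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    n_features = X.shape[1]
    # (1) randomly initialise all weights
    W = rng.random((rows, cols, n_features))
    # grid coordinates of every node, used by the neighbourhood function
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_cycles):
        frac = t / n_cycles
        eta = eta0 * (1.0 - frac)             # (5) decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5   # (5) shrinking neighbourhood width
        # (2) select an input vector
        x = X[rng.integers(len(X))]
        # (3) find the winner: smallest Euclidean distance to x
        dists = np.linalg.norm(W - x, axis=-1)
        winner = np.unravel_index(np.argmin(dists), dists.shape)
        # (4) update the winner and its neighbours
        grid_dist2 = np.sum((coords - np.array(winner)) ** 2, axis=-1)
        theta = np.exp(-grid_dist2 / (2.0 * sigma ** 2))   # Gaussian neighbourhood
        W += eta * theta[..., None] * (x - W)
    return W
```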
Initialisation
(i) Randomly initialise the weight vectors wj for all nodes j
Input vector
(ii) Choose an input vector x from the training set. In the computer, texts are represented as frequency distributions of words, as in the example below.
A Text Example:
"Self-organizing maps (SOMs) are a data visualization technique invented by Professor Teuvo Kohonen which reduce the dimensions of data through the use of self-organizing neural networks. The problem that data visualization attempts to solve is that humans simply cannot visualize high dimensional data as is so technique are created to help us understand this high dimensional data."

Word              Frequency
self-organizing   2
maps              1
data              4
visualization     2
technique         2
Professor         1
invented          1
Teuvo Kohonen     1
dimensions        1
...
Zebra             0
Finding a Winner
(iii) Find the best-matching neuron w(x), usually
the neuron whose weight vector has smallest
Euclidean distance from the input vector x
The winning node is that which is in some sense
‘closest’ to the input vector
‘Euclidean distance’ is the straight line distance
between the data points, if they were plotted on a
(multi-dimensional) graph
Euclidean distance between two vectors a and b, a = (a1, a2, ..., an), b = (b1, b2, ..., bn), is calculated as:
$d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_i (a_i - b_i)^2}$
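A small snippet of this winner-selection step by Euclidean distance; here W is assumed to hold one weight vector per lattice node:

```python
# Find the best-matching neuron: the node whose weight vector is closest to x.
import numpy as np

def find_winner(W, x):
    # d(a, b) = sqrt(sum_i (a_i - b_i)^2) for every node, then take the minimum
    dists = np.linalg.norm(W.reshape(-1, W.shape[-1]) - x, axis=1)
    return np.unravel_index(np.argmin(dists), W.shape[:-1])
```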
Weight Update
SOM Weight Update Equation
$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) + \eta(t)\,\Theta_{w(x)}(j, t)\,\left[\mathbf{x} - \mathbf{w}_j(t)\right]$
"The weights of every node are updated at each cycle by adding
Current learning rate × Degree of neighbourhood with respect to winner × Difference between current weights and input vector
to the current weights"
[Figures: an example of the learning rate $\eta(t)$ decreasing with the number of cycles, and an example of the neighbourhood function $\Theta_{w(x)}(j, t)$; the x-axis shows distance from the winning node, the y-axis shows the 'degree of neighbourhood' (max. 1).]
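A hedged sketch of one update cycle using this equation, with a Gaussian 'degree of neighbourhood' centred on the winner; the choice of Gaussian and the schedules for eta and sigma are illustrative assumptions:

```python
# One SOM weight-update cycle: w_j(t+1) = w_j(t) + eta * theta(j) * (x - w_j(t)).
import numpy as np

def update_weights(W, x, winner, coords, eta, sigma):
    # coords holds the grid position of every node; distance is measured on the grid
    grid_dist2 = np.sum((coords - np.array(winner)) ** 2, axis=-1)
    theta = np.exp(-grid_dist2 / (2.0 * sigma ** 2))   # degree of neighbourhood, max. 1 at the winner
    return W + eta * theta[..., None] * (x - W)
```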
Example: Self-Organizing Maps
The animals are to be ordered by a neural network. Each animal is described by its attributes (size, living space), e.g. Mouse = (0/0).

Size: small = 0, medium = 1, big = 2
Living space: Land = 0, Water = 1, Air = 2

Animal   Size     Living space   Vector
Mouse    small    Land           (0/0)
Lion     medium   Land           (1/0)
Horse    big      Land           (2/0)
Shark    big      Water          (2/1)
Dove     small    Air            (0/2)
Example: Self-Organizing Maps
This training is repeated many times. In the best case, the animals end up close together on the map, ordered by their most similar attributes.
[Figure: the trained map; labelled nodes include Dove (0.1875/1.25), Horse (1.5/0), Shark (1.625/1), Lion (1/0.75) and Mouse (0.75/0), with the land animals grouped together.]
Example: Self-Organizing Maps
Animal names and their attributes ('is' small/medium/big, 'has' 2 legs/4 legs/hair/hooves/mane/feathers, 'likes to' hunt/run/fly/swim):

Animal  Small Medium Big  2legs 4legs Hair Hooves Mane Feathers  Hunt Run Fly Swim
Dove      1     0     0     1     0     0     0     0     1        0    0   1   0
Hen       1     0     0     1     0     0     0     0     1        0    0   0   0
Duck      1     0     0     1     0     0     0     0     1        0    0   0   1
Goose     1     0     0     1     0     0     0     0     1        0    0   1   1
Owl       1     0     0     1     0     0     0     0     1        1    0   1   0
Hawk      1     0     0     1     0     0     0     0     1        1    0   1   0
Eagle     0     1     0     1     0     0     0     0     1        1    0   1   0
Fox       0     1     0     0     1     1     0     0     0        1    0   0   0
Dog       0     1     0     0     1     1     0     0     0        0    1   0   0
Wolf      0     1     0     0     1     1     0     1     0        1    1   0   0
Cat       1     0     0     0     1     1     0     0     0        1    0   0   0
Tiger     0     0     1     0     1     1     0     0     0        1    1   0   0
Lion      0     0     1     0     1     1     0     1     0        1    1   0   0
Horse     0     0     1     0     1     1     1     1     0        0    1   0   0
Zebra     0     0     1     0     1     1     1     1     0        0    1   0   0
Cow       0     0     1     0     1     1     1     0     0        0    0   0   0
A grouping according to similarity has emerged: on the map, the 'birds', the 'peaceful' species, and the 'hunters' form separate regions.
[Teuvo Kohonen 2001] Self-Organizing Maps; Springer;
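As a hypothetical usage example, the som_train() sketch given earlier (after the algorithm overview) can be applied to a few of the 13-dimensional attribute vectors from the table; the grid size and cycle count are arbitrary, and the animal vectors are copied from the table above:

```python
# Map a handful of the animal attribute vectors with the som_train() sketch above.
import numpy as np

animals = {
    #        sm md bg 2l 4l ha ho ma fe hu ru fl sw
    "Dove":  [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    "Hen":   [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    "Wolf":  [0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0],
    "Lion":  [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0],
    "Horse": [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0],
    "Cow":   [0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
}
X = np.array(list(animals.values()), dtype=float)
W = som_train(X, grid=(8, 8), n_cycles=2000)          # requires the earlier sketch
for name, x in zip(animals, X):
    d = np.linalg.norm(W - x, axis=-1)
    print(name, np.unravel_index(np.argmin(d), d.shape))  # map position of each animal
```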
Conclusion
Advantages
SOM is an algorithm that projects high-dimensional data onto a two-dimensional map.
The projection preserves the topology of the data, so that similar data items are mapped to nearby locations on the map.
SOMs still have many practical applications in pattern recognition, speech analysis, industrial and medical diagnostics, and data mining.
Disadvantages
A large quantity of good-quality, representative training data is required.
There is no generally accepted measure of the 'quality' of a SOM, e.g. average quantization error (how well the data are classified).
Discussion topics
What is the main purpose of the SOM?
Do you know any example systems that use the SOM algorithm?
References
[Witten and Frank (1999)] Witten, I.H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, San Francisco, CA, USA, 1999.
[Kohonen (1982)] Kohonen, T. Self-organized formation of topologically correct feature maps. Biological Cybernetics, volume 43, 59-62, 1982.
[Kohonen (1995)] Kohonen, T. Self-Organizing Maps. Springer, Berlin, Germany, 1995.
[Vesanto (1999)] Vesanto, J. SOM-Based Data Visualization Methods. Intelligent Data Analysis, 3:111-126, 1999.
[Kohonen et al (1996)] Kohonen, T., Hynninen, J., Kangas, J., and Laaksonen, J. "SOM_PAK: The Self-Organizing Map program package," Report A31, Helsinki University of Technology, Laboratory of Computer and Information Science, Jan. 1996.
[Vesanto et al (1999)] Vesanto, J., Himberg, J., Alhoniemi, E., and Parhankangas, J. Self-Organizing Map in Matlab: the SOM Toolbox. In Proceedings of the Matlab DSP Conference 1999, Espoo, Finland, pp. 35-40, 1999.
[Wong and Bergeron (1997)] Wong, P.C. and Bergeron, R.D. 30 Years of Multidimensional Multivariate Visualization. In Gregory M. Nielson, Hans Hagen, and Heinrich Muller, editors, Scientific Visualization - Overviews, Methodologies and Techniques, pages 3-33, Los Alamitos, CA, 1997. IEEE Computer Society Press.
[Honkela (1997)] Honkela, T. Self-Organizing Maps in Natural Language Processing. PhD Thesis, Helsinki University of Technology, Espoo, Finland, 1997.
[Jost Schatzmann (2003)] Schatzmann, J. Using Self-Organizing Maps to Visualize Clusters and Trends in Multidimensional Datasets. Final Year Individual Project Report, Imperial College London, 19 June 2003.
[SVG wiki] http://en.wikipedia.org/wiki/Scalable_Vector_Graphics