Transcript: Lecture 6

Algorithms of Artificial Intelligence
Lecture 6: Learning
E. Tyugu
Content
• Basic concepts
– transfer function
– classification
– stages of usage
• Perceptron
• Hopfield net
• Hamming net
• Carpenter-Grossberg’s net
• Kohonen’s feature maps
• Bayesian networks
• ID3
• AQ
Neural nets
Neural nets provide another form of massively parallel learning
functionality. They are well suited for learning pattern recognition. A simple
way to describe a neural net is to represent it as a graph. Each node of the
graph has an associated variable called state and a constant called threshold.
Each arc of the graph has an associated numeric value called weight.
Behaviour of a neural net is determined by transfer functions for nodes
which compute new values of states from previous states of neighbouring
nodes.
Node of a net
A common transfer function has the form
xj = f(Σi wij*xi - tj)
where the sum is taken over the incoming arcs with weights wij, xi are the
states of the neighbouring nodes, and tj is the threshold of the node j where the
new state is computed. Learning in neural nets means changing the
weights in the right way.
[Figure: a node j with inputs x1, …, xn on arcs with weights w1j, …, wnj, transfer function f, and output state xj]
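As an illustration (not part of the original slides), the node update can be written as a minimal Python sketch:

import numpy as np

def node_output(x, w, t, f):
    """New state of node j: f(sum_i w_ij * x_i - t_j)."""
    return f(np.dot(w, x) - t)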
Transfer functions
[Figure: three common transfer functions f(x) — hard limiter (output -1/+1), threshold logic (piecewise linear up to +1), and sigmoid]
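The three transfer functions sketched above could be written, for instance, as follows (a sketch; the exact slopes and saturation limits are my assumptions):

import numpy as np

def hard_limiter(x):
    # +1 for non-negative input, -1 otherwise
    return 1.0 if x >= 0 else -1.0

def threshold_logic(x):
    # piecewise linear: 0 below 0, then linear, saturating at +1
    return float(np.clip(x, 0.0, 1.0))

def sigmoid(x):
    # smooth S-shaped function between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))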
Forward-pass and layered nets
1. A forward-pass neural net is an acyclic graph. Its nodes can be
classified as input, output and internal nodes. Input nodes have no
neighbours on incoming arcs, output nodes have none on outgoing
arcs, and internal nodes possess both kinds of neighbours.
2. A layered (n-layered) net is a forward-pass net where each path
from an input node to an output node contains exactly n nodes. Each node
in such a graph belongs to exactly one layer. An n-layered net is strongly
connected if each node in the i-th layer is connected to all nodes of the
(i+1)-st layer, i = 1, 2, …, n-1. The states of the layered net can be interpreted
as decisions made on the basis of the states of the input nodes.
Layered neural net
[Figure: layered net with input nodes at the bottom, intermediate nodes, and output nodes at the top]
Learning in a layered net can be performed by means of back-propagation.
In this case, the states taken by the output nodes are evaluated and credit or blame
is assigned to each output node. The evaluations are then propagated back to the other nodes of the net.
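The slide only names back-propagation; the following is a minimal sketch of one back-propagation step for a two-layer sigmoid net (thresholds are omitted; the squared-error credit assignment and all names are my assumptions, not the lecture's):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, d, W1, W2, h=0.1):
    """One learning step: x - input vector, d - desired output vector,
    W1 - input->hidden weights, W2 - hidden->output weights, h - gain."""
    # forward pass
    hid = sigmoid(W1 @ x)                    # hidden states
    out = sigmoid(W2 @ hid)                  # output states
    # credit/blame at the output nodes (derivative of the squared error)
    delta_out = (d - out) * out * (1.0 - out)
    # blame propagated back to the hidden nodes
    delta_hid = (W2.T @ delta_out) * hid * (1.0 - hid)
    # weight updates
    W2 = W2 + h * np.outer(delta_out, hid)
    W1 = W1 + h * np.outer(delta_hid, x)
    return W1, W2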
Stages of usage
1. Selection of the structure (of the network type)
2. Assignment of initial weights
3. Learning/teaching
4. Application
Perceptrons
Perceptron nodes are hard limiters or sigmoids.
Examples: [Figure: single-layer, double-layer and three-layer perceptrons]
Learning in a single-layer perceptron
1. Initialize weights wi and threshold t to small random values.
2. Take an input x1, …, xn and the desired output d.
3. Calculate output x of the perceptron.
4. Adapt the weights:
wi' = wi + h*(d - x)*xi, where h < 1 is a positive gain value, and
d = +1 if the input is from one class, d = -1 if it is from the other class.
Repeat steps 2 - 4, if needed.
NB! The weights are changed only when the output x differs from the desired output d.
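A minimal Python sketch of this procedure (the hard-limiter output, the epoch loop and the threshold update are my additions; the slide itself only updates the weights):

import numpy as np

def train_perceptron(samples, epochs=10, h=0.5, rng=np.random.default_rng(0)):
    """samples - list of (x, d) pairs, x an input vector, d = +1 or -1.
    Returns the learned weights w and threshold t."""
    n = len(samples[0][0])
    # 1. initialize weights and threshold to small random values
    w = rng.uniform(-0.1, 0.1, n)
    t = rng.uniform(-0.1, 0.1)
    for _ in range(epochs):
        for x, d in samples:                               # 2. input and desired output
            x = np.asarray(x, dtype=float)
            out = 1.0 if np.dot(w, x) - t >= 0 else -1.0   # 3. hard-limiter output
            # 4. adapt the weights (changes nothing when out == d)
            w = w + h * (d - out) * x
            t = t - h * (d - out)   # threshold treated as a weight on a constant -1 input
    return w, t

For linearly separable classes this loop converges to weights that separate them.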
Regions separable by perceptrons
[Figure: regions separating classes A and B achievable with single-layered, double-layered and three-layered perceptrons]
Hopfield net
Every node is connected to all other nodes, and the weights are symmetric
(wij = wji). The net works with binary (+1, -1) input signals. The output is
also a tuple of values +1 or -1.
[Figure: fully connected Hopfield net with inputs x1, …, xn and outputs x1', …, xn'; even a sigmoid can be used as the transfer function]
Hopfield net
1. Initialize connection weights:
wij = Σs xis * xjs,   i ≠ j,
where xis is +1 or -1 as in the description x1s, …, xns of the class s.
2. Initialise the states with an unknown pattern x1, …, xn.
3. Iterate until convergence (this can even be done asynchronously):
xj' = f(Σi wij * xi),
where f is the hard limiter.
Remarks:
• A Hopfield net can be used either as a classifier or as an associative memory.
• It always converges, but possibly without matching any stored class.
• It works well when the number of classes is less than 0.15*n.
• There are several modifications of the Hopfield net architecture.
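A Python sketch of steps 1-3 (the asynchronous update order and treating f(0) as +1 are my assumptions):

import numpy as np

def hopfield_weights(exemplars):
    """w_ij = sum_s x_i^s * x_j^s, with w_ii = 0 (exemplars are +/-1 vectors)."""
    X = np.asarray(exemplars, dtype=float)
    W = X.T @ X
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, pattern, max_iters=100, rng=np.random.default_rng(0)):
    """Asynchronous iteration x_j' = hard_limiter(sum_i w_ij * x_i) until convergence."""
    x = np.asarray(pattern, dtype=float).copy()
    for _ in range(max_iters):
        changed = False
        for j in rng.permutation(len(x)):          # update one node at a time
            new = 1.0 if np.dot(W[j], x) >= 0 else -1.0
            if new != x[j]:
                x[j] = new
                changed = True
        if not changed:                            # no state changed: converged
            break
    return x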
Hamming net
The Hamming net calculates the Hamming distance to the exemplar of each
class and gives a positive output for the class with the minimal distance.
This net is widely used for restoring corrupted fixed-length binary signals.
A Hamming net works faster than a Hopfield net and has fewer connections
when the number of input signals is large.
It implements the optimum minimum-error classifier when bit errors
are random and independent.
Hamming net
[Figure: Hamming net — inputs x1, …, xn feed a lower subnet that calculates the Hamming distances (middle nodes z1, …, zm); an upper subnet selects the best match and produces the outputs y1, …, ym]
Hamming net
The value at a middle node zs is n - hds, where hds is the Hamming distance to
the exemplar pattern ps.
Therefore, in the lower subnet the weight from input xi to the middle node zs
is wis = xis/2, with offset ts = n/2 for each exemplar s.
Indeed, the node value is 0 for the completely incorrect code, and 1 = (xis - (-xis))*xis/2 is
added for each correct input signal, so that the value is n for the correct code.
Hamming net continued
1. Initialize weights and offsets:
a) lower subnet: wis = xis/2, ts = n/2 for each exemplar s;
b) upper subnet: tk = 0, wsk = (if k = s then 1 else -e), where 0 < e < 1/m.
2. Initialize the lower subnet with an (unknown) pattern x1,…, xn and calculate
yj = f(Σi wij*xi - tj).
3. Iterate in the upper subnet until convergence:
yj' = f(yj - e*Σk≠j yk).
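A Python sketch of the whole net (it uses the value n - hd at the middle nodes, as explained on the previous slide, and a simple synchronous iteration in the upper subnet; the concrete choice e = 1/(2m) is mine):

import numpy as np

def hamming_net(exemplars, x, e=None, max_iters=1000):
    """exemplars - m class patterns of +/-1 values, x - unknown +/-1 pattern.
    Returns the vector y of upper-subnet outputs (positive only for the winner)."""
    P = np.asarray(exemplars, dtype=float)
    x = np.asarray(x, dtype=float)
    m, n = P.shape
    if e is None:
        e = 1.0 / (2 * m)                       # any 0 < e < 1/m will do
    # lower subnet: weights x_is/2 plus offset n/2 give n - HammingDistance(x, p_s)
    y = P @ x / 2.0 + n / 2.0
    # upper subnet: y_j' = f(y_j - e * sum_{k != j} y_k) until convergence
    for _ in range(max_iters):
        y_new = np.maximum(y - e * (y.sum() - y), 0.0)   # threshold-logic f
        if np.allclose(y_new, y):
            break
        y = y_new
    return y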
A comparator subnet
Here is a comparator subnet that selects the maximum of two analog
inputs x0, x1. By combining several of these nets one builds comparators
for more inputs (4, 8, etc.; approximately log2 n layers are needed for n inputs).
The output z is the maximum value, y0 and y1 indicate which input is the
maximum, dark nodes are hard limiters, light nodes are threshold logic
nodes, all thresholds are 0, and the weights are shown on the arcs.
[Figure: comparator subnet with inputs x0, x1 and outputs z, y0, y1; arc weights are 0.5, 1 and -1]
Carpenter-Grossberg net
This net forms clusters without supervision. Its clustering algorithm is
similar to the simple leader clustering algorithm (see the sketch below):
select the first input as the exemplar for the first cluster;
if the next input is close enough to some cluster exemplar, it is
added to that cluster, otherwise it becomes the exemplar of a new cluster.
The net includes much feedback and is described by nonlinear differential
equations.
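A sketch of the leader clustering algorithm itself (not of the differential-equation net); representing "close enough" as a vigilance threshold on the Hamming distance is my assumption:

import numpy as np

def leader_clustering(inputs, vigilance):
    """inputs - sequence of binary vectors; vigilance - maximal distance at which
    an input still counts as close enough to an exemplar.
    Returns the cluster exemplars and the cluster index of each input."""
    exemplars, labels = [], []
    for x in inputs:
        x = np.asarray(x)
        # distance to every existing exemplar (Hamming distance for binary inputs)
        dists = [np.sum(x != e) for e in exemplars]
        if dists and min(dists) <= vigilance:
            labels.append(int(np.argmin(dists)))      # close enough: join that cluster
        else:
            exemplars.append(x)                       # otherwise start a new cluster
            labels.append(len(exemplars) - 1)
    return exemplars, labels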
Carpenter-Grossberg net
Carpenter-Grossberg net for three binary inputs: x0, x1, x2 and two classes.
Kohonen´s feature maps
A Kohonen self-organizing feature map (K-map) uses an analogy
with biological neural structures in which the placement of neurons
is orderly and reflects the structure of external (sensed) stimuli (e.g. in
auditory and visual pathways).
A K-map learns when continuous-valued input vectors are presented to
it without specifying the desired output. The connection weights
adjust to regularities in the input. A large number of examples is needed.
A K-map mimics learning in biological neural structures well.
It is usable in speech recognizers.
Kohonen´s feature maps continued
This is a flat (two-dimensional) structure with connections between
neighbors and connections from each input node to all output nodes.
It learns clusters of input vectors without any help from a teacher and
preserves closeness (topology).
[Figure: two-dimensional grid of output nodes, each connected to the continuous-valued input vector]
Learning in K-maps
1. Initialize weights to small random numbers and set initial radius of
neighborhood of nodes.
2. Get an input x1, …, xn.
3. Compute the distance dj to each output node:
dj = Σi (xi - wij)²
4. Select the output node s with the minimal distance ds.
5. Update the weights of the node s and of all nodes in its neighborhood:
wij' = wij + h*(xi - wij),
where h < 1 is a gain that decreases over time.
Repeat steps 2 - 5.
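A Python sketch of steps 1-5 for a two-dimensional map (the grid size, the neighbourhood-radius schedule and the gain schedule are my assumptions):

import numpy as np

def train_kohonen(inputs, grid=(10, 10), epochs=20, rng=np.random.default_rng(0)):
    """inputs - array of continuous-valued vectors. Returns weights (rows, cols, n)."""
    X = np.asarray(inputs, dtype=float)
    rows, cols = grid
    n = X.shape[1]
    # 1. small random weights; initial neighbourhood radius covers half the grid
    W = rng.uniform(-0.1, 0.1, (rows, cols, n))
    radius0, h0 = max(rows, cols) / 2.0, 0.5
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    steps, step = epochs * len(X), 0
    for _ in range(epochs):
        for x in X:                                    # 2. get an input
            # 3.-4. output node with minimal distance sum_i (x_i - w_ij)^2
            d = np.sum((W - x) ** 2, axis=2)
            s = np.unravel_index(np.argmin(d), d.shape)
            # gain and neighbourhood radius decrease in time
            frac = 1.0 - step / steps
            h, radius = h0 * frac, max(1.0, radius0 * frac)
            # 5. update node s and all nodes in its neighbourhood
            near = np.sum((coords - np.array(s)) ** 2, axis=2) <= radius ** 2
            W[near] += h * (x - W[near])
            step += 1
    return W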
Bayesian networks
Bayesian networks use the conditional probability formula
P(e,H) = P(H|e)P(e) = P(e|H)P(H)
binding the conditional probabilities of the evidence e and the hypothesis H.
A Bayesian network is a graph whose nodes are variables denoting the
occurrence of events; arcs express causal dependence of events. Each
node x has conditional probabilities for every possible combination of
events influencing the node, i.e. for every collection of events in the nodes
of pred(x) immediately preceding the node x in the graph.
Bayesian networks
Example: [Figure: a Bayesian network with nodes x1, …, x6 and arcs x1→x2, x1→x3, x1→x4, x2→x4, x2→x5, x3→x5, x5→x6]
The joint probability assessment for all nodes x1,…,xn:
P(x1,…,xn) = P(x1|pred(x1))*...*P(xn|pred(xn))
constitutes a joint-probability model that supports the assessed event
combination. For the present example it is as follows:
P(x1,…,x6) = P(x6|x5)*P(x5|x2,x3)*P(x4|x1,x2)*P(x3|x1)*P(x2|x1)*P(x1)
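For illustration, the joint probability of this example can be evaluated directly from the factorization; every conditional probability number below is hypothetical, only the structure follows the network:

# Hypothetical conditional probability tables for the example (made-up numbers).
P_x1 = {True: 0.3, False: 0.7}                           # P(x1)
P_x2_x1 = {True: 0.8, False: 0.1}                        # P(x2=True | x1)
P_x3_x1 = {True: 0.6, False: 0.2}                        # P(x3=True | x1)
P_x4_x1x2 = {(True, True): 0.9, (True, False): 0.5,
             (False, True): 0.7, (False, False): 0.1}    # P(x4=True | x1, x2)
P_x5_x2x3 = {(True, True): 0.95, (True, False): 0.4,
             (False, True): 0.6, (False, False): 0.05}   # P(x5=True | x2, x3)
P_x6_x5 = {True: 0.7, False: 0.2}                        # P(x6=True | x5)

def p(table, value, *parents):
    """P(node = value | parents), given a table of P(node = True | parents)."""
    key = parents[0] if len(parents) == 1 else tuple(parents)
    pt = table[key]
    return pt if value else 1.0 - pt

def joint(x1, x2, x3, x4, x5, x6):
    """P(x1,...,x6) = P(x6|x5)*P(x5|x2,x3)*P(x4|x1,x2)*P(x3|x1)*P(x2|x1)*P(x1)."""
    return (p(P_x6_x5, x6, x5) * p(P_x5_x2x3, x5, x2, x3) *
            p(P_x4_x1x2, x4, x1, x2) * p(P_x3_x1, x3, x1) *
            p(P_x2_x1, x2, x1) * P_x1[x1])

# e.g. joint(True, True, False, True, True, False)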
Bayesian networks continued
A Bayesian network can be used for diagnosis/classification: given some
events, the probabilities of the events depending on the given ones can be
predicted.
To construct a Bayesian network, one needs to
• determine its structure (topology)
• find the conditional probabilities for each dependency.
Taxonomy of neural nets
NEURAL NETS
• binary-valued inputs
  – supervised learning: Hopfield nets, Hamming nets
  – unsupervised learning: Carpenter-Grossberg nets
• continuous inputs
  – supervised learning: single-layered perceptrons, multi-layered perceptrons
  – unsupervised learning: Kohonen maps
A decision tree
[Figure: decision tree
outlook = sunny → humidity: high → -, normal → +
outlook = overcast → +
outlook = rain → windy: true → -, false → +]
ID3 algorithm
• To get the fastest decision-making procedure, one has to arrange the
attributes in a decision tree in a proper order - the most
discriminating attributes first. This is done by the algorithm called ID3.
• The most discriminating attribute can be defined in precise terms as
the attribute whose fixing changes the entropy of the possible
decisions the most. Let wj be the frequency of the j-th decision in a
set of examples x. Then the entropy of the set is
E(x) = - Σj wj * log(wj)
• Let fix(x,a,v) denote the set of those elements of x whose value of
attribute a is v. The average entropy that remains in x after the
value of a has been fixed is
H(x,a) = Σv kv * E(fix(x,a,v)),
where kv is the ratio of examples in x with attribute a having value v.
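These two formulas translate directly into Python; in the sketch below the examples are assumed to be dictionaries with a "Class" key (as in the sample data at the end of the lecture), and the logarithm is taken base 2:

from collections import Counter
from math import log2

def entropy(examples, decision="Class"):
    """E(x) = - sum_j w_j * log(w_j), with w_j the frequency of the j-th decision."""
    counts = Counter(e[decision] for e in examples)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def fix(examples, a, v):
    """fix(x,a,v): the examples whose attribute a has value v."""
    return [e for e in examples if e[a] == v]

def avg_entropy(examples, a, decision="Class"):
    """H(x,a) = sum_v k_v * E(fix(x,a,v)), k_v the ratio of examples with a = v."""
    total = len(examples)
    h = 0.0
    for v in {e[a] for e in examples}:
        sub = fix(examples, a, v)
        h += len(sub) / total * entropy(sub, decision)
    return h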
ID3 algorithm
ID3 uses the following variables and functions:
p -- pointer to the root of the decision tree being built;
x -- set of examples;
E(x) -- entropy of the set of examples x;
H(x,a) -- average entropy that remains in x after the value of a has
been fixed;
atts(x) -- attributes of the set of examples x;
vals(a) -- values of the attribute a;
mark(p,d) -- mark node p with d;
newsucc(p,v) -- new successor to the node p, with attribute value v,
returns pointer p to the new node;
fix(x,a,v) -- subset of given set of examples x with the value v of the
attribute a.
ID3 continued
A.3.10: ID3(x,p) =
  if empty(x) then failure
  elif E(x) = 0 then mark(p, decision(x))
  else
    h := bignumber;
    for a ∈ atts(x) do
      if H(x,a) < h then h := H(x,a); am := a fi
    od;
    mark(p, am);
    for v ∈ vals(am,x) do
      ID3(fix(x,am,v), newsucc(p,v))
    od
  fi
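Using entropy, avg_entropy and fix from the sketch above, A.3.10 can be turned into runnable Python roughly as follows (representing the tree as nested dictionaries is my choice, not the lecture's):

def id3(examples, attrs, decision="Class"):
    """Recursive ID3; returns a decision value, or a dict {attribute: {value: subtree}}."""
    if not examples:
        return None                                   # 'failure' on the slide
    if entropy(examples, decision) == 0:
        return examples[0][decision]                  # mark the node with the decision
    # the most discriminating attribute: minimal remaining average entropy
    am = min(attrs, key=lambda a: avg_entropy(examples, a, decision))
    rest = [a for a in attrs if a != am]
    return {am: {v: id3(fix(examples, am, v), rest, decision)
                 for v in {e[am] for e in examples}}}

Called as id3(data, ["Outlook", "Temperature", "Humidity", "Windy"]) on the sample data of the exercise slide, it should reproduce the decision tree shown earlier, with outlook at the root.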
AQ algorithm
This algorithm is for learning knowledge in the form of rules.
The algorithm AQ(ex,cl) builds a set of rules from the given set of
examples ex for the collection of classes cl, using the function
aqrules(p,n,c) for building a set of rules for a class c from its given
positive examples p and negative examples n.
pos(ex,c) is the set of positive examples for class c in ex;
neg(ex,c) is the set of negative examples for class c in ex;
covers(r,e) is a predicate which is true when example e satisfies the rule r;
prune(rules) throws away rules covered by some other rule.
AQ continued
A.3.11: AQ(ex,cl) =
  allrules := { };
  for c ∈ cl do
    allrules := allrules ∪ aqrules(pos(ex,c), neg(ex,c), c)
  od;
  return(allrules)

aqrules(pos,neg,c) =
  rules := {aqrule(selectFrom(pos), neg, c)};
  for e ∈ pos do
    L: { for r ∈ rules do
           if covers(r,e) then break L fi
         od;
         rules := rules ∪ {aqrule(e,neg,c)};
         prune(rules)
       }
  od;
  return(rules)
AQ continued
aqrule(seed,neg,c) -- builds a new rule from the initial condition seed and the negative
examples neg for the class c;
newtests(r,seed,e) -- generates amendments q to the rule r such that r&q covers seed and does not cover e;
worstelements(star) -- chooses the least promising elements of star.
aqrule(seed,neg,c) =
  star := {true};
  for e ∈ neg do
    for r ∈ star do
      if covers(r,e) then
        star := (star ∪ {r&q | q ∈ newtests(r,seed,e)}) \ {r}
      fi;
      while size(star) > maxstar do
        star := star \ worstelements(star)
      od
    od
  od;
  return("if" bestin(star) "then" c)
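A rough Python sketch of aqrule for examples represented as attribute-value dictionaries; the rule-quality measure (shorter rules are better) and the beam trimming are simplifications of mine, not the original AQ heuristics:

def covers(rule, example):
    """A rule is a dict attribute -> required value; true if the example satisfies it."""
    return all(example.get(a) == v for a, v in rule.items())

def newtests(rule, seed, e):
    """Single-attribute tests q such that rule & q still covers seed but not e."""
    return [{a: seed[a]} for a in seed
            if a != "Class" and a not in rule and seed[a] != e[a]]

def aqrule(seed, neg, maxstar=5):
    """Beam search (the 'star') for a conjunction covering seed and no negative example."""
    star = [dict()]                              # the empty rule plays the role of 'true'
    for e in neg:
        new_star = []
        for r in star:
            if covers(r, e):                     # specialize every rule that still covers e
                new_star += [dict(r, **q) for q in newtests(r, seed, e)]
            else:
                new_star.append(r)
        star = sorted(new_star, key=len)[:maxstar]   # keep the most promising rules
    return min(star, key=len)                    # bestin(star): here simply the shortest rule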
A clustering problem
(learning without a teacher)
Hierarchy of learning methods
Learning
• parametric learning: numerically, by automata
• massively parallel learning: neural nets, genetic algorithms
• symbolic learning
  – search in concept space: specific to general, general to specific
  – inductive inference: inverse resolution
Decision tables
A decision table is a compact way of representing knowledge when the knowledge
is used for making a choice among a finite (and not very large) set of possibilities.
A decision table is a compound table consisting of three kinds of areas.
The list of conditions (C1, C2, …, Ck are conditions written down in some
formal language that can be translated into a program):
C1   C2   …   Ck
Decision tables
The selection matrix consists of columns corresponding to the conditions and of
rows corresponding to the choice alternatives.
Each cell of the table can contain one of three values:
y – yes, the condition must be satisfied
n – no, the condition must not be satisfied
0 – it does not matter whether the condition is satisfied or not (the cell is then often simply
left empty).
C1   C2   …   Ck
Decision tables
The third field of the table contains the decisions that can be chosen. If there are two
fields of the first type and two of the second type, the third field can also be given in matrix form:
[Figure: an example with two condition areas ("Conditions", filled with y/n entries) and a decision area ("Decisions")]
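A decision table can be interpreted directly by a small program; a minimal sketch (the concrete rules and decisions below are made up):

# Each rule maps condition indices to 'y'/'n'; missing entries mean "does not matter".
rules = [
    ({0: "y", 1: "y"}, "decision A"),
    ({0: "y", 1: "n"}, "decision B"),
    ({0: "n"},         "decision C"),     # condition C2 is irrelevant here
]

def decide(condition_values, rules):
    """Return the decision of the first rule whose y/n requirements all hold."""
    for requirements, decision in rules:
        if all((condition_values[i] if need == "y" else not condition_values[i])
               for i, need in requirements.items()):
            return decision
    return None

# e.g. conditions C1 = True, C2 = False  ->  "decision B"
print(decide([True, False], rules))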
Bibliography
• Kohonen, T. (1984) Self-Organization and Associative Memory.
Springer Verlag, Holland.
• Lippmann, R. (1987) An Introduction to Computing with Neural
Nets. IEEE ASSP Magazine, No. 4, 4 - 22.
• Michalski, S. (1983) Theory and methodology of inductive learning.
In: S. Michalski, J. Carbonell, T. Mitchell, eds. Machine learning: an
Artificial Intelligence approach. Tioga, Palo Alto, 83 –134.
Exercises
1. Calculate the entropies of the attributes. 2. Build a decision tree.
Sample data for ID3
Outlook    Temperature  Humidity  Windy  Class
Sunny      Hot          High      False  -
Sunny      Hot          High      True   -
Overcast   Hot          High      False  +
Rain       Mild         High      False  +
Rain       Cool         Normal    False  +
Rain       Cool         Normal    True   -
Overcast   Cool         Normal    True   +
Sunny      Mild         High      False  -
Sunny      Cool         Normal    False  +
Rain       Mild         Normal    False  +
Sunny      Mild         Normal    True   +
Overcast   Mild         High      True   +
Overcast   Hot          Normal    False  +
Rain       Mild         High      True   -