Transcript Slides

Review
Belief and Probability
• The connection between toothaches and cavities is not a logical
consequence in either direction.
• However, we can provide a degree of belief on the sentences.
– We usually get this belief from statistical data.
• Assigning probability 0 to a sentence corresponds to an
unequivocal belief that the sentence is false.
• Assigning probability 1 to a sentence corresponds to an
unequivocal belief that the sentence is true.
Syntax
• Basic element: random variable
• Possible worlds defined by assignment of values to random variables.
• Boolean random variables
Cavity (do I have a cavity?)
• Discrete random variables
Weather is one of <sunny,rainy,cloudy,snow>
• Domain values must be exhaustive and mutually exclusive
• Elementary propositions e.g.,
– Weather = sunny (abbreviated as sunny)
– Cavity = false (abbreviated as ¬cavity)
• Complex propositions formed from elementary propositions and standard
logical connectives e.g.,
– Weather = sunny ∨ Cavity = false
– sunny ∨ ¬cavity
Atomic events
• Atomic event: A complete specification of the state of the world
E.g., if the world is described by only two Boolean variables
Cavity and Toothache, then there are 4 distinct atomic events:
Cavity = false ∧ Toothache = false
Cavity = false ∧ Toothache = true
Cavity = true ∧ Toothache = false
Cavity = true ∧ Toothache = true
• Atomic events are mutually exclusive and exhaustive
Joint probability
• Joint probability distribution for a set of random variables gives the
probability of every atomic event on those random variables.
• If we consider all the variables then the joint probability distribution is called
full joint probability distribution.
– A full joint distribution specifies the probability of every
atomic event.
• Any probabilistic question about a domain can be answered by the full joint
distribution.
Prior and Conditional probability
• Prior or unconditional probability associated with a proposition is the degree
of belief accorded to it in the absence of any other information.
P(Cavity = true) = 0.1
(or P(cavity) = 0.1)
P(Weather = sunny) = 0.7
(or P(sunny) = 0.7)
• Conditional or posterior probabilities
P(cavity | toothache) = 0.8
i.e., given that toothache is all I know
• Definition of conditional probability:
P(a | b) = P(a ∧ b) / P(b),  if P(b) ≠ 0
• Product rule:
P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
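A minimal Python sketch of these two definitions; the joint probabilities below are made-up numbers for illustration, not values from the lecture's tables:

# Joint distribution over two Boolean variables A and B, stored as
# {(a, b): probability}.  These numbers are illustrative only.
joint = {(True, True): 0.3, (True, False): 0.2,
         (False, True): 0.1, (False, False): 0.4}

def prob(event):
    """P(event): sum the probabilities of the atomic events where it holds."""
    return sum(p for world, p in joint.items() if event(world))

def cond(event_a, event_b):
    """P(a | b) = P(a and b) / P(b), defined only when P(b) != 0."""
    return prob(lambda w: event_a(w) and event_b(w)) / prob(event_b)

def a(world):   # proposition "A is true"
    return world[0]

def b(world):   # proposition "B is true"
    return world[1]

# Product rule: P(a and b) = P(a | b) P(b) = P(b | a) P(a)
p_ab = prob(lambda w: a(w) and b(w))
assert abs(p_ab - cond(a, b) * prob(b)) < 1e-12
assert abs(p_ab - cond(b, a) * prob(a)) < 1e-12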
Chain rule
• Chain rule is derived by successive application of product
rule:
P(x1, ..., xn) = P(x1, ..., xn−1) P(xn | x1, ..., xn−1)
             = P(x1, ..., xn−2) P(xn−1 | x1, ..., xn−2) P(xn | x1, ..., xn−1)
             = ...
             = ∏_{i=1}^{n} P(xi | x1, ..., xi−1)
Here P(x1, ..., xn) means P(X1 = x1 ∧ ... ∧ Xn = xn).
Inference by enumeration
• Suppose we are given a proposition φ. Start with the joint
probability distribution:
• Sum up the probabilities of the atomic events where φ is
true.
Inference by enumeration
• We can also compute conditional probabilities:
Normalization Constant
P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache)
                      = α P(cavity ∧ toothache)
                      = α (0.108 + 0.012) = α · 0.12
P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
                       = α P(¬cavity ∧ toothache)
                       = α (0.016 + 0.064) = α · 0.08
Since P(cavity | toothache) + P(¬cavity | toothache) = 1,
α · 0.12 + α · 0.08 = 1, so α = 1 / (0.12 + 0.08) = 5.
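The same calculation in a few lines of Python, using only the four atomic events consistent with toothache quoted above (the catch/¬catch split of each pair follows the lecture's table; only the sums matter for this query):

# Atomic events consistent with toothache = true, keyed by (cavity, catch).
joint_toothache = {(True, True): 0.108, (True, False): 0.012,
                   (False, True): 0.016, (False, False): 0.064}

p_toothache = sum(joint_toothache.values())                               # 0.2
p_cavity_and_toothache = sum(p for (cav, _), p in joint_toothache.items() if cav)

alpha = 1 / p_toothache                                                   # 5.0
print(alpha * p_cavity_and_toothache)                  # P(cavity | toothache)  = 0.6
print(alpha * (p_toothache - p_cavity_and_toothache))  # P(¬cavity | toothache) = 0.4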
Hidden Variables
• What does it mean to compute
P(cavity ∧ toothache)?
• We sum up: 0.108 + 0.012
• I.e. the probabilities of atomic events that make the proposition
cavity ∧ toothache true. So,
P(cavity ∧ toothache)
= P(cavity ∧ toothache ∧ catch) + P(cavity ∧ toothache ∧ ¬catch)
• The variable “Catch” is called a hidden variable.
Independence
• We can write:
P(toothache, catch, cavity, cloudy)
=P(cloudy | toothache, catch, cavity) P(toothache, catch, cavity)
=P(cloudy) P(toothache, catch, cavity)
• Thus, the 32-element table for four variables can be constructed from
one 8-element table and one 4-element table!
• A and B are independent if for each a, b in the domain of A and B
respectively, we have
P(a|b) = P(a)
or
P(b|a) = P(b)
or
P(a, b) = P(a) P(b)
• Absolute independence powerful but rare
Bayes' Rule
• Product rule
P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
• Bayes' rule:
P(a | b) = P(b | a) P(a) / P(b)
• Useful for assessing diagnostic probability from causal probability:
P(cause | effect) = P(effect | cause) P(cause) / P(effect)
Bayes’ rule (cont’d)
Let
s be the proposition that the patient has a stiff neck
m be the proposition that the patient has meningitis,
P(s|m) = 0.5
P(m) = 1/50000
P(s) = 1/20
P(m|s) = P(s|m) P(m) / P(s)
= (0.5) x (1/50000) / (1/20)
= 0.0002
That is, we expect only 1 in 5000 patients with a stiff neck to have
meningitis.
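The same arithmetic as a quick Python check, using exactly the numbers above:

p_s_given_m = 0.5        # P(s | m): stiff neck given meningitis
p_m = 1 / 50000          # P(m): prior probability of meningitis
p_s = 1 / 20             # P(s): prior probability of a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s   # Bayes' rule
print(p_m_given_s)                      # 0.0002, i.e. 1 in 5000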
More than two variables
P(cause | effect1, ..., effectn)
= P(effect1, ..., effectn | cause) P(cause) / P(effect1, ..., effectn)
= α P(effect1, ..., effectn | cause) P(cause)
Now, the Naïve Bayes model makes the following assumption:
Although Effect1, …,Effectn might not be independent in general,
they are independent given the value of Cause.
This is called conditional independence.
E.g. If I have a cavity, the probability that the probe catches in it doesn't
depend on whether I have a toothache.
We write this as:
P(toothache, catch | cavity) = P(toothache | cavity) · P(catch | cavity)
Naïve Bayes
P(cause | effect1, ..., effectn)
= α P(effect1, ..., effectn | cause) P(cause)
= α P(effect1 | cause) · ... · P(effectn | cause) P(cause)
= α ∏_{i=1}^{n} P(effecti | cause) · P(cause)
For finding the alpha we also need to compute:
P(¬cause | effect1, ..., effectn) = α ∏_{i=1}^{n} P(effecti | ¬cause) · P(¬cause)
Then the alpha is:
α = 1 / [P(cause | effect1, ..., effectn) + P(¬cause | effect1, ..., effectn)]
(the two terms being the unnormalized products above)
What about when the Naïve Bayes assumption doesn’t hold? Then instead we have
a network of interdependencies. Let’s first review conditional independence.
Conditional Independence Equations
Equivalent statements:
P(toothache, catch | cavity) = P(toothache | cavity) P(catch | cavity)
P(toothache | catch, cavity) = P(toothache | cavity)
In general
P(xi | x1, ..., xi−1) = P(xi | xj1, ..., xjm)
where {xj1, ..., xjm} ⊆ {x1, ..., xi−1}
Conditional Independence (cont’d)
• We can write out full joint distribution using chain rule, e.g.:
P(toothache, catch, cavity)
= P(toothache | catch, cavity) P(catch, cavity)
= P(toothache | catch, cavity) P(catch | cavity) P(cavity)
= P(toothache | cavity) P(catch | cavity) P(cavity)
Because of Cond. Indep.
• In most cases, the use of conditional independence reduces the size of
the representation of the joint distribution from exponential in n to
linear in n.
• Conditional independence is our most basic and robust form of
knowledge about uncertain environments.
Bayesian Networks: Motivation
• The full joint probability can be used to answer any question
about the domain,
– but it becomes intractable as the number of variables grows.
• Furthermore, specifying probabilities of atomic events is rather
unnatural and can be very difficult.
Bayesian networks
• Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (a link means: "directly influences")
– a conditional distribution for each node given its parents:
P(Xi | Parents (Xi))
• The conditional distribution is represented as a conditional
probability table (CPT) giving the distribution over Xi for each
combination of parent values.
Example
• Topology of network encodes conditional independence
assertions:
• Weather is independent of the other variables
• Toothache and Catch are conditionally independent given
Cavity, which is indicated by the absence of a link between
them.
Another Example
The topology shows that burglary and earthquakes directly affect the probability
of the alarm, but whether Mary or John calls depends only on the alarm.
Thus, our assumptions are that they don’t perceive any burglaries directly,
and they don’t confer before calling.
Semantics
• The full joint distribution is
defined as the product of the
local conditional distributions:
P(x1, ..., xn) = ∏_{i=1}^{n} P(xi | parents(Xi))
e.g.,
P(j ∧ m ∧ a ∧ b ∧ e)
= P(j | a) P(m | a) P(a | b, e) P(b) P(e)
=…
Inference in Bayesian Networks
• The basic task for any probabilistic inference system is to
compute the posterior probability for a query variable, given
some observed events (or effects) – that is, some assignment of
values to a set of evidence variables.
• A typical query:
P(x|e1,…,em)
• We could ask: What’s the probability of a burglary if both Mary
and John call?
P(burglary | johncalls, marycalls)
Inference by enumeration
Sum out variables from the joint without actually
constructing its explicit representation
P(b | j, m) = P(b, j, m) / P(j, m)
            = α P(b, j, m)
            = α Σe Σa P(b, e, a, j, m)
e (earthquake) and a (alarm)
are values of the hidden
variables.
All the possible e’s and a’s
have to be considered.
Now, rewrite full joint entries using product of CPT
entries:
P(b | j, m) = α Σe Σa P(b, e, a, j, m)
            = α Σe Σa P(b) P(e) P(a | b, e) P(j | a) P(m | a)
Numerically… (complete it as an exercise)
P(b | j, m) = α P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a) = … = α × 0.00059
P(¬b | j, m) = α P(¬b) Σe P(e) Σa P(a | ¬b, e) P(j | a) P(m | a) = … = α × 0.0015
P(B | j, m) = α ⟨0.00059, 0.0015⟩ ≈ ⟨0.284, 0.716⟩
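A sketch of this enumeration in Python. The slide figure with the CPTs is not reproduced here, so the numbers below are the usual textbook values for this burglary network (P(b) = 0.001, P(e) = 0.002, and so on), an assumption; with those values the sums come out to roughly 0.00059 and 0.0015 as above.

from itertools import product

# Assumed CPT values for the burglary network (standard textbook figures).
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,      # P(a=true | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                       # P(j=true | A)
P_M = {True: 0.70, False: 0.01}                       # P(m=true | A)

def unnormalized(b):
    """P(b, j, m) up to alpha: sum out the hidden variables e and a."""
    total = 0.0
    for e, a in product([True, False], repeat=2):
        p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
        total += P_B[b] * P_E[e] * p_a * P_J[a] * P_M[a]
    return total

pb, pnb = unnormalized(True), unnormalized(False)     # ~0.00059, ~0.0015
alpha = 1 / (pb + pnb)
print(alpha * pb, alpha * pnb)                        # ~0.284, ~0.716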
Machine Learning
Pseudo-code for 1R
For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate
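A minimal Python sketch of 1R, under the assumption that instances are dicts of nominal attribute values plus a "class" key; the data at the end is a toy example, not the lecture's weather table:

from collections import Counter

def one_r(instances, attributes, class_key="class"):
    """Return (best_attribute, rules) where rules maps value -> predicted class."""
    best = None
    for attr in attributes:
        rules, errors = {}, 0
        for value in {inst[attr] for inst in instances}:
            counts = Counter(inst[class_key] for inst in instances if inst[attr] == value)
            majority, hits = counts.most_common(1)[0]
            rules[value] = majority
            errors += sum(counts.values()) - hits       # misclassified for this value
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    return best[1], best[2]

# Toy, hypothetical data set:
data = [{"outlook": "sunny", "windy": True,  "class": "no"},
        {"outlook": "sunny", "windy": False, "class": "no"},
        {"outlook": "rainy", "windy": False, "class": "yes"},
        {"outlook": "overcast", "windy": True, "class": "yes"}]
print(one_r(data, ["outlook", "windy"]))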
Evaluating the weather attributes
Arbitrarily breaking the tie between the first
and third rule sets, we pick the first.
Oddly enough the game is played when it’s
overcast and rainy but not when it’s sunny.
Perhaps it’s an indoor pursuit.
Statistical modeling: Probabilities for
the weather data
Naïve Bayes for classification
• Classification learning: what’s the probability of the class
given an instance?
– e = instance
– h = class value for instance
• Naïve Bayes assumption: evidence can be split into
independent parts (i.e. attributes of instance!)
P(h | e) = P(e | h) P(h) / P(e)
= P(e1|h) P(e2|h)… P(en|h) P(h) / P(e)
The weather data example
P(Play=yes | e) =
P(Outlook=Sunny | Play=yes) *
P(Temp=Cool | Play=yes) *
P(Humidity=High | Play=yes) *
P(Windy=True | Play=yes) *
P(Play=yes) / P(e) =
= (2/9) * (3/9) * (3/9) * (3/9) *(9/14)
/ P(e) = 0.0053 / P(e)
Don’t worry about the 1/P(e); it’s
alpha, the normalization constant.
The weather data example
P(Play=no | e) =
P(Outlook=Sunny | Play=no) *
P(Temp=Cool | Play=no) *
P(Humidity=High | Play=no) *
P(Windy=True | Play=no) *
P(play=no) / P(e) =
= (3/5) * (1/5) * (4/5) * (3/5) *(5/14)
/ P(e) = 0.0206 / P(e)
Normalization constant
α = 1/P(e) = 1/(0.0053 + 0.0206)
So,
P(Play=yes | e) = 0.0053 / (0.0053 + 0.0206) = 20.5%
P(Play=no | e) = 0.0206 / (0.0053 + 0.0206) = 79.5%
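The same computation in a few lines of Python, using the conditional probabilities quoted above:

from fractions import Fraction as F
import math

# P(attribute value | class), read off the weather counts above:
# sunny, cool, high humidity, windy  given  Play = yes / no
likelihood_yes = [F(2, 9), F(3, 9), F(3, 9), F(3, 9)]
likelihood_no  = [F(3, 5), F(1, 5), F(4, 5), F(3, 5)]
prior_yes, prior_no = F(9, 14), F(5, 14)

score_yes = math.prod(likelihood_yes) * prior_yes   # ~0.0053 (unnormalized)
score_no  = math.prod(likelihood_no)  * prior_no    # ~0.0206 (unnormalized)

alpha = 1 / (score_yes + score_no)                  # the normalization constant
print(float(alpha * score_yes))                     # ~0.205  -> Play = yes
print(float(alpha * score_no))                      # ~0.795  -> Play = no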
The “zero-frequency problem”
• What if an attribute value doesn’t occur with every class
value (e.g. “Humidity = High” for class “Play=Yes”)?
– Probability P(Humidity=High | play=yes) will be zero!
• A posteriori probability will also be zero!
– No matter how likely the other values are!
• Remedy: add 1 to the count for every attribute value-class
combination (Laplace estimator)
– I.e. initialize the counters to 1 instead of 0.
• Result: probabilities will never be zero! (stabilizes
probability estimates)
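A one-line illustration of the Laplace fix (a sketch; here k stands for the number of distinct values of the attribute):

def laplace_estimate(count, total, k):
    """P(value | class) with add-one smoothing: never exactly zero."""
    return (count + 1) / (total + k)

# An attribute value never seen with this class, out of 9 class instances
# and 3 possible attribute values:
print(laplace_estimate(0, 9, 3))   # 1/12 instead of 0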
Constructing Decision Trees
• Normal procedure: top down in recursive divide-and-conquer fashion
– First: an attribute is selected for root node and a branch is
created for each possible attribute value
– Then: the instances are split into subsets (one for each
branch extending from the node)
– Finally: the same procedure is repeated recursively for each
branch, using only instances that reach the branch
• Process stops if all instances have the same class
Which attribute to select?
A criterion for attribute selection
• Which is the best attribute?
• The one which will result in the smallest tree
– Heuristic: choose the attribute that produces the “purest”
nodes
• Popular impurity criterion: entropy of nodes
– The lower the entropy, the purer the node.
• Strategy: choose attribute that results in lowest entropy of
the children nodes.
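A sketch of this criterion in Python: compute the entropy of each child node and take the size-weighted average; the attribute with the lowest value wins. The class counts below are illustrative, not tied to a particular table:

import math

def entropy(counts):
    """Entropy of a node given its class counts, in bits."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def weighted_child_entropy(children):
    """Average entropy of the child nodes, weighted by their sizes."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

# Illustrative split producing three children with [yes, no] counts:
children = [[2, 3], [4, 0], [3, 2]]
print(weighted_child_entropy(children))   # lower is better (purer children)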
Example: attribute “Outlook”
The final decision tree
• Note: not all leaves need to be pure; sometimes identical
instances have different classes
Splitting stops when data can’t be split any further
Numerical attributes
• Tests in nodes can be of the form xj > constant
• Divides the space into rectangles.
Considering splits
• The only thing we really need to do differently in our algorithm
is to consider splitting between each data point in each
dimension.
• So, in our bankruptcy
domain, we'd consider 9
different splits in the R
dimension
– In general, we'd expect to
consider m - 1 splits, if
we have m data points;
– But in our data set we
have some examples with
equal R values.
Considering splits II
• And there are another 6 possible splits in the L dimension
– because L is an integer, really, there are lots of duplicate L values.
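A sketch of how the candidate numeric splits could be generated: sort the values in one dimension and place a threshold halfway between each pair of distinct adjacent values (duplicates, like the repeated R and L values mentioned above, simply produce fewer candidates). The values below are hypothetical:

def candidate_splits(values):
    """Thresholds halfway between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# Hypothetical R values from a small data set (duplicates collapse):
print(candidate_splits([0.2, 0.5, 0.5, 0.9, 1.1]))   # [0.35, 0.7, 1.0]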
Bankruptcy Example
• We consider all the possible splits in each dimension, and
compute the average entropies of the children.
• And we see that, conveniently, all the points with L not greater
than 1.5 are of class 0, so we can make a leaf there.
Bankruptcy Example
• Now, we consider all the splits of the remaining part of space.
• Note that we have to recalculate all the average entropies again,
because the points that fall into the leaf node are taken out of
consideration.
Bankruptcy Example
• Now the best split is at R > 0.9. And we see that all the points
for which that's true are positive, so we can make another leaf.
Bankruptcy Example
• Continuing in this way, we finally obtain:
Rules: Covering
• Strategy for generating a rule set directly is based on
covering.
• A rule “if lhs then rhs” covers a data instance if the tests in
“lhs” are true for that instance.
Example: contact lenses data
The numbers on the right show the fraction of “correct” instances in
the set singled out by that choice (the second number is the coverage).
In this case, correct means that the recommendation is “hard.”
Selecting a Test for the LHS of a Rule
• Goal: to maximize accuracy
– t: total number of instances covered by rule
– p: positive examples of the class covered by rule
– t-p: number of errors made by rule
Select test that maximizes the ratio p/t
• We are finished when p/t = 1 or the set of instances can’t
be split any further
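As a sketch, picking the LHS test that maximizes p/t could look like this (instances are assumed to be dicts with a "class" key; a test is an attribute-value pair):

def best_test(instances, target_class, class_key="class"):
    """Choose the (attribute, value) test with the highest p/t ratio."""
    best, best_ratio = None, -1.0
    attributes = [a for a in instances[0] if a != class_key]
    for attr in attributes:
        for value in {inst[attr] for inst in instances}:
            covered = [inst for inst in instances if inst[attr] == value]
            t = len(covered)
            p = sum(1 for inst in covered if inst[class_key] == target_class)
            if t and p / t > best_ratio:
                best, best_ratio = (attr, value), p / t
    return best, best_ratio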
Modified rule and resulting data
The rule isn’t very accurate, getting only 4 out of 12 that it covers.
So, it needs further refinement.
Further refinement
Modified rule and resulting data
Should we stop here? Perhaps. But let’s say we are going
for exact rules, no matter how complex they become.
So, let’s refine further.
Further refinement
The result
Linear Hypothesis Class
• Equation of a hyperplane in the feature space: w · x + b = 0
• w, b are to be learned
• In two dimensions, we can see the geometric
interpretation of w and b.
– The vector w is perpendicular to the linear
separator; such a vector is known as the
normal vector.
– The scalar b, which is called the offset, is
proportional to the perpendicular distance from
the origin to the linear separator.
– The constant of proportionality is the negative
of the magnitude of the normal vector.
Hyperplane: Geometry
Linear classifier
• We can now exploit the sign of this distance to define a linear
classifier, one whose decision boundary is a hyperplane.
• Instead of using 0 and 1 as the class labels (which was an
arbitrary choice anyway), we use the sign of the distance, either
+1 or −1, as the labels (that is, the values of the yi’s).
Margin
• A variant of the signed
distance of a training point to a
hyperplane is the margin of
the point.
Margin: γi = yi (w · xi + b),
proportional to the perpendicular distance of
point xi to the hyperplane.
γi > 0 : point is correctly classified (sign
of distance = yi)
γi < 0 : point is incorrectly classified (sign
of distance ≠ yi)
Training: Perceptron algorithm
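The algorithm itself was on the slide figure, which isn't reproduced here; below is a minimal sketch of the standard perceptron update using the margin convention above (labels yi in {+1, −1}, mistake-driven updates). The data and the epoch count are placeholders:

def train_perceptron(data, epochs=100, rate=1.0):
    """data: list of (x, y) with x a list of features and y in {+1, -1}."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin <= 0:                      # misclassified: update
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y
    return w, b

# Tiny linearly separable example (hypothetical points):
data = [([2.0, 1.0], +1), ([1.5, 2.0], +1), ([-1.0, -0.5], -1), ([-2.0, 1.0], -1)]
print(train_perceptron(data))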
Artificial Neural Networks
• The basic idea in neural nets is to define
interconnected networks of simple units (let's call
them "artificial neurons") in which each
connection has a weight.
– Weight wij is the weight of the ith input
into unit j.
– The networks have some inputs where the
feature values are placed and they
compute one or more output values.
– Each output unit corresponds to a class.
The network prediction is the output
whose value is highest.
• The learning takes place by adjusting the weights
in the network so that the desired output is
produced whenever a sample in the input data set
is presented.
Single Perceptron Unit
• We start by looking at a simpler
kind of "neural-like" unit called a
perceptron.
– This is where the
perceptron algorithm that
we saw earlier came from.
– Perceptrons antedate the
modern neural nets.
• A perceptron unit basically
compares a weighted combination
of its inputs against a threshold
value and then outputs a 1 if the
weighted inputs exceed the
threshold.
• Trick: we treat the (arbitrary)
threshold as if it were a weight w0
on a constant input x0 whose
value is –1.
• In this way, we can write the basic
rule of operation as computing the
weighted sum of all the inputs and
comparing to 0.
Linear Classifier: Single Perceptron Unit
h(x) = θ(w · x + w0 · (−1)) = θ(w̄ · x̄)
where θ(z) = 1 if z > 0, and 0 otherwise
Multi-Layer Perceptron
• Yes. The introduction of "hidden" units into these networks makes them
much more powerful:
– they are no longer limited to linearly separable problems.
• Earlier layers transform the problem into more tractable problems for
the later layers.
Check the XOR example in the full slides.
Multi-Layer Perceptron Learning
• Any set of training points can be separated by a three-layer
perceptron network.
• “Almost any” set of points is separable by a two-layer perceptron
network.
• However, the presence of the discontinuous threshold in the
operation means that there is no simple local search for a good
set of weights;
– one is forced into trying possibilities in a combinatorial way.
• The limitations of the single-layer perceptron and the lack of a
good learning algorithm for multilayer perceptrons essentially
killed the field for quite a few years.
Sigmoid Unit
• The key property of the
sigmoid is that it is
differentiable.
– The output of this
function (y) varies
smoothly with changes in
the weights.
– So, we can use gradient descent.
Gradient Descent
Online version: we consider each time only the error for one data item:
E = ½ (y(x^m, w) − y^m)²
∇w E = (y(x^m, w) − y^m) ∇w y(x^m, w)
where ∇w y(x^m, w) = (∂y/∂w1, ..., ∂y/∂wn)
We change w as follows:
w ← w − η ∇w E
Here we see the gradient of the training error as a function of the weights.
The descent rule is basically to change the weights by taking a small step
(determined by the learning rate η) in the direction opposite this gradient.
Had we a single unit…
∇w y(x^m, w) = (∂y/∂w1, ..., ∂y/∂wn)
z = Σi wi xi
y = s(z) = 1 / (1 + e^(−z))
∂y/∂wi = (∂y/∂z)(∂z/∂wi) = (∂s(z)/∂z) xi
Substituting in the equation of the previous slide we get (for the arbitrary
ith element):
∂E/∂wi = (y − y^m) (∂s(z)/∂z) xi
wi ← wi − η (y − y^m) (∂s(z)/∂z) xi
(the delta rule)
Derivative of the sigmoid:
s(z) = 1 / (1 + e^(−z))
ds(z)/dz = d/dz (1 + e^(−z))^(−1)
         = [(1 + e^(−z))^(−2)] [e^(−z)]
         = [1 / (1 + e^(−z))] [e^(−z) / (1 + e^(−z))]
         = s(z)(1 − s(z))
         = y(1 − y)
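Putting the last three slides together, here is a minimal sketch of online gradient descent for a single sigmoid unit in Python (toy data; the learning rate η and the number of steps are arbitrary choices, and the last input is fixed at −1 to play the role of the threshold):

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sigmoid_unit(data, eta=0.5, steps=5000):
    """data: list of (x, target) with targets in [0, 1]. Online delta rule."""
    n = len(data[0][0])
    w = [random.uniform(-0.1, 0.1) for _ in range(n)]
    for _ in range(steps):
        x, target = random.choice(data)                 # one item at a time
        z = sum(wi * xi for wi, xi in zip(w, x))
        y = sigmoid(z)
        # delta rule: w_i <- w_i - eta * (y - y^m) * y(1 - y) * x_i
        grad_z = (y - target) * y * (1 - y)
        w = [wi - eta * grad_z * xi for wi, xi in zip(w, x)]
    return w

# Toy 2-input data (hypothetical), learning the OR function:
data = [([0.0, 0.0, -1.0], 0), ([1.0, 1.0, -1.0], 1),
        ([1.0, 0.0, -1.0], 1), ([0.0, 1.0, -1.0], 1)]
print(train_sigmoid_unit(data))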
Generalized Delta Rule
w14   4 y1
w24   4 y2
 4  y4 (1  y4 )( 5 w45   6 w46 )
In general, for a hidden unit j we have
wi  j   j yi
 j  y j (1  y j )

 k w j k
kdownstream( j )
Immediate downstream only.
Generalized Delta Rule
For an output unit n we have
Δwin = −η δn yi
δn = yn(1 − yn)(yn − y^m)
where
yn is the output of this output unit, and
y^m is the real class of the data instance in focus.
Backpropagation Algorithm
1. Initialize weights to small random values
2. Choose a random sample training item, say (x^m, y^m)
3. Compute total input zj and output yj for each unit (forward prop)
4. Compute δn for the output layer: δn = yn(1 − yn)(yn − yn^m)
5. Compute δj for all preceding layers by the backprop rule
6. Compute weight change by the descent rule (repeat for all weights)
• Note that each expression involves data local to a particular unit; we
don't have to look around summing things over the whole network.
• It is for this reason (simplicity, locality and, therefore, efficiency)
that backpropagation has become the dominant paradigm for training neural
nets.
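A compact Python sketch of steps 1 to 6 for a small fully connected network like the one in the example that follows (two inputs, one hidden layer of sigmoid units, one sigmoid output); the data, learning rate, and step count are placeholders:

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_backprop(data, n_hidden=2, rate=0.5, steps=10000):
    """data: list of (x, target) pairs, x a list of inputs, target in [0, 1]."""
    n_in = len(data[0][0])
    # Step 1: small random weights; each unit has a bias weight on a constant -1 input.
    w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(steps):
        x, target = random.choice(data)                      # Step 2: random sample
        xb = x + [-1.0]                                      # append the constant -1 input
        # Step 3: forward propagation
        y_hidden = [sigmoid(sum(wi * xi for wi, xi in zip(wj, xb))) for wj in w_hidden]
        hb = y_hidden + [-1.0]
        y_out = sigmoid(sum(wi * hi for wi, hi in zip(w_out, hb)))
        # Step 4: delta for the output unit
        d_out = y_out * (1 - y_out) * (y_out - target)
        # Step 5: deltas for the hidden layer (backprop rule)
        d_hidden = [y_hidden[j] * (1 - y_hidden[j]) * d_out * w_out[j] for j in range(n_hidden)]
        # Step 6: descent rule, w <- w - rate * delta * input
        w_out = [wi - rate * d_out * hi for wi, hi in zip(w_out, hb)]
        w_hidden = [[wi - rate * d_hidden[j] * xi for wi, xi in zip(w_hidden[j], xb)]
                    for j in range(n_hidden)]
    return w_hidden, w_out

# XOR, the classic problem a single unit cannot learn:
data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 0)]
print(train_backprop(data))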
Backpropagation Example
[Figure: a 2-input network with hidden units 1 and 2 feeding output unit 3;
weights w11, w21, w12, w22 connect inputs x1, x2 to the hidden units, w13, w23
connect the hidden units to the output, and w01, w02, w03 are bias weights on a
constant input of −1.]
First do forward propagation: compute the zi's and yi's.
Then compute the deltas:
δ3 = y3(1 − y3)(y3 − y^m)
δ2 = y2(1 − y2) δ3 w23
δ1 = y1(1 − y1) δ3 w13
Then update the weights:
w03 ← w03 − r δ3 (−1)    w13 ← w13 − r δ3 y1    w23 ← w23 − r δ3 y2
w02 ← w02 − r δ2 (−1)    w12 ← w12 − r δ2 x1    w22 ← w22 − r δ2 x2
w01 ← w01 − r δ1 (−1)    w11 ← w11 − r δ1 x1    w21 ← w21 − r δ1 x2