Transcript Slides

Master of Science in Artificial Intelligence, 2010-2012
Knowledge Representation
and Reasoning
University "Politehnica" of Bucharest
Department of Computer Science
Fall 2010
Adina Magda Florea
Lecture 11
Uncertain representation of
Lecture outline
 Uncertain knowledge
 Belief networks
 Bayesian prediction
1. Uncertain knowledge
 Probability theory – 2 main interpretations
 Statistical = measure of proportion of individuals (long
range frequency of a set of events)
Prob of a bird flying = proportion of birds that fly out of the set of all birds
 Personal, subjective or Bayesian = an agent's
measure of belief in some proposition based on the
agent's knowledge
Prob of a bird flying = the agent's measure of belief in the flying ability of an
individual based on the knowledge that the individual is a bird
 Can be viewed as a measure over all the worlds that are
possible, given the agent's knowledge about a particular
situation (in each possible world, the bird either flies or it
does not)
Bayesian probability
 Both views have the same calculus
 We talk about the second view
 We assume uncertainty is epistemological - pertaining
to the agent's knowledge about the world, rather than
ontological – how the world is
Semantics of (prior) probability
 Interpretations – on possible worlds
 Specify not only the truth of formulas but also how likely
each formula is to hold in the real world
 Modal logics – possible worlds + accessibility relation
 Probabilities – possible worlds + a measure on p.w.
Semantics of probability
 A possible world is an assignment of exactly one value to
every random variable.
 Let W be the set of all possible worlds. If w ∈ W and f is a
formula, f is true in w (w |= f) is defined inductively on the
structure of f:
w |= x=v iff w assigns value v to x
w |= ¬f iff w |=/ f
w |= f ∧ g iff w |= f and w |= g
w |= f  g iff
w |= f or w |=W g
(or w |= ¬f)
 Associated with each possible world is a measure. When
there are only a finite no. of worlds:
 0p(w) for all wW
wW p(w) = 1
Semantics of probability
 The probability of a formula f is the sum of all measures
of the possible worlds in which f is true.
P(f) = Σ{w : w |= f} p(w)
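The definition can be tried out on a tiny example. The sketch below enumerates the possible worlds of two Boolean variables and sums the measures of the worlds in which a formula is true. The variable names `bird` and `flies` and all measure values are illustrative assumptions, not numbers from the lecture.

```python
from itertools import product

# Hypothetical random variables, each with a Boolean domain.
variables = ["bird", "flies"]

# A possible world assigns exactly one value to every random variable.
worlds = [dict(zip(variables, values))
          for values in product([True, False], repeat=len(variables))]

# Assumed measure p(w): non-negative and summing to 1 over all worlds.
measure = {
    (True, True): 0.18,
    (True, False): 0.02,
    (False, True): 0.10,
    (False, False): 0.70,
}

def p(world):
    """Measure of one possible world."""
    return measure[(world["bird"], world["flies"])]

def probability(formula):
    """P(f): sum of p(w) over the worlds w in which the formula is true."""
    return sum(p(w) for w in worlds if formula(w))

p_flies = probability(lambda w: w["flies"])
p_bird_or_flies = probability(lambda w: w["bird"] or w["flies"])
```

Formulas are passed as Python predicates over a world, so conjunction, disjunction and negation map onto `and`, `or` and `not`.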
Semantics of conditional probability
 A formula e representing the conjunction of all of the agent's
observations of the world is called evidence
 The measure of belief in formula h based on formula e is
called conditional probability of h given e, P(h|e)
 Evidence e will rule out all possible worlds that are
incompatible with evidence e
Semantics of probability
 Evidence e introduces a new measure pe over possible
worlds where all worlds in which e is false have measure 0
and the remaining worlds are normalized so that the sum
of the measures of the worlds is 1
 pe(w) = p(w) / P(e) if w |= e
pe(w) = 0 if w |=/ e
P(h|e) = Σ{w : w |= h} pe(w) = ( Σ{w : w |= h ∧ e} p(w) ) / P(e) = P(h ∧ e) / P(e)
 We assume P(e)>0. If P(e) = 0 then e is false in all
possible worlds and thus can not be observed
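As a minimal sketch of this construction, the code below builds the new measure pe from a prior measure by zeroing the worlds that contradict the evidence and renormalizing the rest. The worlds (pairs of Boolean values for the hypothetical variables bird and flies) and their prior measures are assumed for illustration only.

```python
# Hypothetical worlds keyed by (bird, flies) with an assumed prior measure p(w).
prior = {
    (True, True): 0.18,
    (True, False): 0.02,
    (False, True): 0.10,
    (False, False): 0.70,
}

def probability(measure, formula):
    """Sum of the measures of the worlds in which the formula is true."""
    return sum(m for w, m in measure.items() if formula(w))

def condition(measure, evidence):
    """Return pe: 0 where the evidence is false, p(w) / P(e) elsewhere."""
    p_e = probability(measure, evidence)
    if p_e == 0:
        raise ValueError("evidence has probability 0 and cannot be observed")
    return {w: (m / p_e if evidence(w) else 0.0) for w, m in measure.items()}

def is_bird(w):
    return w[0]

def flies(w):
    return w[1]

posterior = condition(prior, is_bird)
p_flies_given_bird = probability(posterior, flies)

# The same value obtained as the ratio P(h and e) / P(e).
ratio = (probability(prior, lambda w: flies(w) and is_bird(w))
         / probability(prior, is_bird))
```

Both routes give the same number, which is the point of the identity P(h|e) = P(h ∧ e)/P(e).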
 Chain rule
P(f1 ∧ … ∧ fn) = P(f1) x P(f2|f1) x … x P(fn|f1 ∧ … ∧ fn-1)
Bayes theorem
 Given the current belief in a proposition H based on
evidence K, P(H|K), we observe E.
 P(H|E ∧ K) = P(E|H ∧ K) * P(H|K) / P(E|K)
 If the background knowledge K is implicit
 P(H|E) = P(E|H) * P(H) / P(E)
Independence assumptions
 Independence. The knowledge of the truth of one
proposition does not affect the belief in another
 A random variable X is independent of a random
variable Y given a random variable Z if for all
values of the random variables (i.e., ai, bj, ck)
P(X=ai|Y=bj ∧ Z=ck) = P(X=ai|Z=ck)
 Knowledge of Y's value does not affect the belief
in the value of X, given the value of Z.
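The definition can be checked mechanically on a small joint distribution. In the sketch below, Z is a common cause of X and Y, so X should be independent of Y given Z but not independent of Y on its own. All numbers are assumed for illustration.

```python
from itertools import product

# Assumed illustrative distributions: Z is a common cause of X and Y.
p_z = {True: 0.3, False: 0.7}
p_x_given_z = {True: 0.9, False: 0.2}   # P(X=true | Z=z)
p_y_given_z = {True: 0.6, False: 0.1}   # P(Y=true | Z=z)

def bern(p_true, value):
    """Probability of a Boolean value when P(true) = p_true."""
    return p_true if value else 1.0 - p_true

# Joint over (x, y, z), built as P(Z) * P(X|Z) * P(Y|Z).
joint = {(x, y, z): p_z[z] * bern(p_x_given_z[z], x) * bern(p_y_given_z[z], y)
         for x, y, z in product([True, False], repeat=3)}

def prob(pred):
    return sum(p for w, p in joint.items() if pred(*w))

def cond(pred, given):
    return prob(lambda x, y, z: pred(x, y, z) and given(x, y, z)) / prob(given)

def x_independent_of_y_given_z(tol=1e-12):
    """Check P(X=a | Y=b and Z=c) = P(X=a | Z=c) for all values a, b, c."""
    for a, b, c in product([True, False], repeat=3):
        lhs = cond(lambda x, y, z: x == a, lambda x, y, z: y == b and z == c)
        rhs = cond(lambda x, y, z: x == a, lambda x, y, z: z == c)
        if abs(lhs - rhs) > tol:
            return False
    return True

p_x = prob(lambda x, y, z: x)
p_x_given_y = cond(lambda x, y, z: x, lambda x, y, z: y)
```

Here `p_x` and `p_x_given_y` differ, so knowing Y alone does change the belief in X; the independence only holds once Z is known.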
2. Belief networks
 A BN (Belief Network or Bayesian Network) is a
graphical representation of conditional independence
 It is represented as a Directed Acyclic Graph (DAG)
 The nodes represent random variables.
 The edges represent direct dependence among the variables
 X → Y: X has a direct influence on Y (represents a
statistical dependence)
 X = Parent(Y) if X → Y
 X = Ancestor(Y) if there is a directed path from X to Y
(X → … → Y)
 Z = Descendant(Y) if Z=Y or there is a directed path from
Y to Z (Y → … → Z)
The independence assumption embedded in a BN
 Each random variable is independent of its
nondescendants given its parents
Y1,..Yn – parents of X
P(X=a|Y1=v1 ∧ … ∧ Yn=vn ∧ R) = P(X=a|Y1=v1 ∧ … ∧ Yn=vn)
if R does not involve descendants of X (including X itself)
 The number of probabilities needed to be specified
for each variable is exponential in the number of
parents of a variable
 BN contains a set of conditional probability tables
P(X=a|Y1=v1 ∧ … ∧ Yn=vn)
 Therefore a BN defines a Joint Probability Distribution
(JPD) over the variables in the network
 A value of the JPD can be computed as:
P(X1=x1 ∧ … ∧ Xn=xn) = Πi=1,n P(Xi=xi | parents(Xi))
where parents(Xi) represents the specific values of the variables in Parents(Xi)
 P(X1=x1 ∧ … ∧ Xn=xn) = P(x1,…, xn) =
P(xn | xn-1,…, x1) * P(xn-1,…, x1) = … = Πi=1,n P(xi | xi-1,…, x1)
 Order of variables in the BN
P(Xi | Xi-1,…, X1) = P(Xi | Parents(Xi)) provided that
Parents(Xi) ⊆ { Xi-1,…, X1}
 A BN is a correct representation of the
domain, provided that each node is
conditionally independent of its
predecessors, given its parents
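Conditional probability tables (each row gives the probability that the child variable is true, for the parent value T = true or F = false):
A    P(L | A)
T    0.88
F    0.001
F    P(S | F)
T    0.9
F    0.01
L    P(R | L)
T    0.75
F    0.01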
Instead of computing the joint distribution of all the variables by the chain rule
P(T,F,A,S,L,R) = P(T)*P(F|T)*P(S|F,T)*P(A|S,F,T)*P(L|A,S,F,T)*P(R|L,A,S,F,T)
the BN defines a unique JPD in a factored form, i.e.
P(T,F,A,S,L,R) = P(T) * P(F) * P(A|T,F) * P(S|F) * P(L|A) * P(R|L)
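One way to see what the factored form buys is to count the numbers that have to be specified. The sketch below reads the parent sets off the factorization above and assumes every variable is Boolean, as the T/F tables suggest.

```python
# Parents of each variable, read off the factorization
# P(T) * P(F) * P(A|T,F) * P(S|F) * P(L|A) * P(R|L).
parents = {
    "T": [],
    "F": [],
    "A": ["T", "F"],
    "S": ["F"],
    "L": ["A"],
    "R": ["L"],
}

def bn_parameter_count(parents):
    """One number P(X=true | parent values) per combination of parent values."""
    return sum(2 ** len(ps) for ps in parents.values())

def full_joint_parameter_count(n_variables):
    """A joint table over n Boolean variables has 2^n entries, one of which
    is fixed by the requirement that the entries sum to 1."""
    return 2 ** n_variables - 1

bn_count = bn_parameter_count(parents)                    # 1 + 1 + 4 + 2 + 2 + 2
joint_count = full_joint_parameter_count(len(parents))    # 2^6 - 1
```

The count per variable is exponential in its number of parents, so the saving depends on the network being sparse.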
 The probability of a variable given nondescendants can
be computed using the "reasoning by case" rule
 P(L|S) = P(L|A,S)*P(A|S) + P(L|~A,S)*(1-P(A|S))=
P(L|A)*P(A|S) + P(L|~A)*(1-P(A|S))
 P(A|S) = P(A|F,T)*P(F,T|S) +
P(A|F,~T)*P(F,~T|S) +
P(A|~F,T)*P(~F,T|S) +
P(A|~F,~T)*P(~F,~T|S)
 The second factor of each product can be computed
using the multiplicative rule
P(F,T|S) = P(F|T,S)*P(T|S) = P(F|T,S)*P(T)
 For computing P(F|T,S) we can not use the
independence assumption because S is a descendant of
F; we can use Bayes rule instead
P(F|T,S) = (P(S|F,T)*P(F|T)) / P(S|T) = (P(S|F)*P(F))
/ P(S|T)
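The Bayes step can be evaluated numerically. The sketch below uses the P(S|F) table from the slides; the prior P(F) = 0.01 is an assumption, chosen because it is the value consistent with P(Smoke) = 0.0189. It also uses the fact that S depends on T only through F, and T and F are independent, so P(S|T) = P(S).

```python
# Values from the P(S | F) table in the slides.
p_s_given_f = 0.9        # P(S | F)
p_s_given_not_f = 0.01   # P(S | ~F)

# Assumption: prior P(F) = 0.01 (consistent with P(Smoke) = 0.0189).
p_f = 0.01

# P(S|T) = P(S), obtained by reasoning by cases on F.
p_s = p_s_given_f * p_f + p_s_given_not_f * (1 - p_f)

# Bayes rule: P(F|T,S) = P(S|F) * P(F) / P(S|T).
p_f_given_t_s = p_s_given_f * p_f / p_s
```

Under these assumptions `p_s` comes out at 0.0189 and `p_f_given_t_s` at about 0.476.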
 The prior probabilities (with no evidence) of each variable are:
P(Tampering) = 0.02
P(Fire) = 0.01
P(Report) = 0.028
P(Smoke) = 0.0189
 Observing the Report gives
P(Tampering|Report) = 0.399
P(Fire|Report) = 0.2305
P(Smoke|Report) = 0.215
 The probabilities of both Tampering and Fire are increased by
the Report
 Because the probability of Fire is increased, so is the probability of Smoke
 Suppose instead that Smoke was observed
P(Tampering|Smoke) = 0.02
P(Fire|Smoke) = 0.476
P(Report|Smoke) = 0.320
 Note that the probability of Tampering is not affected by
observing Smoke; however, the probabilities of Report and Fire
are increased
 Suppose that both Report and Smoke were observed
P(Tampering|Report, Smoke) = 0.0284
P(Fire|Report, Smoke) = 0.964
 Thus, observing both makes Fire more likely
 However, in the context of Report, the presence of Smoke
makes Tampering less likely.
 Suppose instead that there is a Report but no Smoke
P(Tampering|Report,~Smoke) = 0.501
P(Fire|Report,~Smoke) = 0.0294
 In the context of Report, Fire becomes much less likely and
so the probability of Tampering increases to explain Report.
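The posteriors in this example can be recomputed by brute-force enumeration of the joint distribution. The sketch below is written under two loudly stated assumptions: P(Fire) = 0.01, and a hypothetical table for P(A|T,F), which does not appear in the slides. The assumed alarm values were picked because they appear to reproduce the quoted figures to the printed precision; they are not claimed to be the lecturer's.

```python
from itertools import product

VARS = ["T", "F", "A", "S", "L", "R"]

P_T = 0.02                          # P(Tampering), from the slides
P_F = 0.01                          # P(Fire), assumed (consistent with P(Smoke) = 0.0189)
# ASSUMPTION: hypothetical alarm table P(A | T, F), keyed by (T, F); not in the slides.
P_A = {(True, True): 0.5, (True, False): 0.85,
       (False, True): 0.99, (False, False): 0.0001}
P_S = {True: 0.9, False: 0.01}      # P(S | F), from the slides
P_L = {True: 0.88, False: 0.001}    # P(L | A), from the slides
P_R = {True: 0.75, False: 0.01}     # P(R | L), from the slides

def bern(p_true, value):
    """Probability of a Boolean value when P(true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(t, f, a, s, l, r):
    """P(T,F,A,S,L,R) = P(T) P(F) P(A|T,F) P(S|F) P(L|A) P(R|L)."""
    return (bern(P_T, t) * bern(P_F, f) * bern(P_A[(t, f)], a)
            * bern(P_S[f], s) * bern(P_L[a], l) * bern(P_R[l], r))

def prob(**fixed):
    """Sum the joint over all possible worlds that agree with the fixed values."""
    total = 0.0
    for values in product([True, False], repeat=len(VARS)):
        world = dict(zip(VARS, values))
        if all(world[k] == v for k, v in fixed.items()):
            total += joint(**{k.lower(): v for k, v in world.items()})
    return total

def posterior(query, **evidence):
    """P(query = true | evidence) = P(query and evidence) / P(evidence)."""
    return prob(**{query: True}, **evidence) / prob(**evidence)

p_report = prob(R=True)
p_tampering_given_report = posterior("T", R=True)
p_fire_given_report_and_smoke = posterior("F", R=True, S=True)
```

Enumeration touches all 2^6 worlds for every query, which is fine at this size but is exactly the cost that structure-exploiting methods try to avoid.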
Determining posterior distributions
 Problem = computing conditional probabilities given the evidence
 Estimating posterior probabilities in a BN within an
absolute error (of less than 0.5) is NP-hard
 3 main approaches
(1) Exploit the structure of the network
 Clique tree propagation method – the network is
transformed into a tree with nodes labeled with sets of
variables. Reasoning is performed by passing messages
between the nodes in the tree
 Time complexity is linear in the number of nodes of the
tree; the tree is in fact a polytree, so its size may be
exponential in the size of the belief network
Determining posterior distributions
(2) Search-based approaches
 Enumerate all possible worlds and estimate posterior
probabilities from the worlds in general
(3) Stochastic simulation
 Random cases are generated according to a probability
distribution. By treating these cases as a set of samples,
one can estimate the marginal distribution on any
combination of variables
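A minimal sketch of stochastic simulation on the Fire → Smoke fragment of the example network: each case is generated by sampling a variable after its parents, and the marginal P(Smoke) is estimated as a sample frequency. The prior P(Fire) = 0.01, the sample size and the seed are assumptions of this sketch.

```python
import random

P_F = 0.01                        # assumed prior P(Fire)
P_S = {True: 0.9, False: 0.01}    # P(Smoke | Fire), from the slides

def sample_case(rng):
    """Forward sampling: sample each variable after its parents."""
    fire = rng.random() < P_F
    smoke = rng.random() < P_S[fire]
    return fire, smoke

def estimate_p_smoke(n_samples, seed=0):
    """Fraction of generated cases in which Smoke is true."""
    rng = random.Random(seed)
    hits = sum(sample_case(rng)[1] for _ in range(n_samples))
    return hits / n_samples

estimate = estimate_p_smoke(200_000)
exact = P_S[True] * P_F + P_S[False] * (1 - P_F)
```

The estimate only approaches the exact value as the number of samples grows; rare events such as Fire need many samples before their frequencies stabilize.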
A structure approach method
 Based on the notion that a BN specifies a factorization of
the JPD
 A factor is a representation of a function from a tuple of
random variables into a number.
 f(X1,..,Xn), X1,..,Xn are the variables of the factor, f is a
factor on X1,..,Xn;
 if f(X1,..,Xn) is a factor and each vi is an element of the
domain of Xi, then f(X1=v1,..,Xn=vn) is a number that is the
value of f when each Xi has value vi
A structure approach method
 The product of two factors f1 and f2 is a factor on the
union of the variables
(f1 x f2)(X1,…,Xi,Y1,…,Yj,Z1,…,Zk) =
f1(X1,…,Xi,Y1,…,Yj) x f2(Y1,…,Yj,Z1,…,Zk)
 Given a factor f(X1,…,Xi), one can sum out a variable,
say X1, and the result is a factor on X2,…,Xi
(ΣX1 f)(X2,…,Xi) = f(X1=v1,X2,…,Xi) + … + f(X1=vk,X2,…,Xi)
 A conditional probability distribution can be seen as a factor
f(X=u,Y1=v1,…,Yj=vj) = P(X=u|Y1=v1 ∧ … ∧ Yj=vj)
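The two factor operations can be sketched with a factor stored as a pair (variables, table), where the table maps a tuple of values to a number. This representation, the Boolean domain and the prior P(F) = 0.01 are assumptions of the sketch.

```python
from itertools import product

DOMAIN = [True, False]  # all variables are assumed Boolean in this sketch

def make_factor(variables, fn):
    """Tabulate fn over every assignment to the given variables."""
    variables = tuple(variables)
    table = {vals: fn(dict(zip(variables, vals)))
             for vals in product(DOMAIN, repeat=len(variables))}
    return (variables, table)

def multiply(f1, f2):
    """Product of two factors: a factor on the union of their variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    union = vars1 + tuple(v for v in vars2 if v not in vars1)

    def value(a):
        return t1[tuple(a[v] for v in vars1)] * t2[tuple(a[v] for v in vars2)]

    return make_factor(union, value)

def sum_out(var, f):
    """Sum a variable out of a factor, leaving a factor on the other variables."""
    variables, table = f
    idx = variables.index(var)
    out = {}
    for vals, p in table.items():
        key = vals[:idx] + vals[idx + 1:]
        out[key] = out.get(key, 0.0) + p
    return (variables[:idx] + variables[idx + 1:], out)

# Conditional probabilities seen as factors (P(F) = 0.01 is an assumed prior).
f_fire = make_factor(["F"], lambda a: 0.01 if a["F"] else 0.99)
f_smoke = make_factor(
    ["S", "F"],
    lambda a: (0.9 if a["F"] else 0.01) if a["S"] else (0.1 if a["F"] else 0.99))

p_smoke = sum_out("F", multiply(f_smoke, f_fire))   # factor on S alone
```

Multiplying P(S|F) by P(F) gives a factor on (S, F); summing out F leaves the marginal on S.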
A structure approach method
 BN inference problem = computing the
posterior distribution of a variable given some evidence
 can be reduced to the problem of computing the
probabilities of conjunctions
 Given the evidence Y1=v1 ∧ … ∧ Yj=vj and the query
variable Z:
P(Z|v1,…,vj) = P(Z,v1,…,vj) / P(v1,…,vj)
= P(Z,v1,…,vj) / Σz P(z,v1,…,vj)
 => compute the factor P(Z,v1,…,vj) and normalize
A structure approach method
 The variables of the BN are X1,…,Xn.
 To compute the factor P(Z,v1,…vj) we must sum out the
other variables from the JPD.
 Let Z1,…,Zk be an enumeration of the other variables in the BN
 {Z1,…,Zk} = {X1,…,Xn} - {Z} - {Y1,…,Yj}
 The factor can be computed by summing out each Zi.
 The order of the Zi is an elimination order
 P(Z,Y1=v1,…,Yj=vj) = ΣZk … ΣZ1 P(X1,…,Xn)_{Y1=v1,…,Yj=vj}
(the subscript sets the observed variables to their observed values)
A structure approach method
 P(Z,Y1=v1,…,Yj=vj) = ΣZk … ΣZ1 P(X1,…,Xn)_{Y1=v1,…,Yj=vj}
 There is a possible world for each assignment of a value
to each variable.
 The JPD P(X1,…Xn) gives the probability (measure) for
each possible world
 The approach selects the worlds with the observed
values for the Y's and sums over the possible worlds with the
same value for Z => in fact this is the definition of
conditional probability
A structure approach method
 By the rule for conjunction of probabilities and the
definition of a BN:
P(X1,…Xn)=P(X1|Parents(X1)) * …*P(Xn|Parents(Xn))
 Now the BN inference problem is reduced to a problem
of summing out a set of variables from a product of factors
 To compute the posterior distribution of a query variable
given observations:
• Construct the JPD in terms of a product of factors
• Set the observed variables to their observed values
• Sum out each of the other variables (the Z1…Zk)
• Multiply the remaining factors and normalize
A structure approach method
 To sum out a variable Z from a product f1…fk of factors:
 We must first partition the factors into those that do not
contain Z, say f1,..,fi, and those that contain Z, say fi+1…fk
 Then
ΣZ f1 x … x fk = f1 x … x fi x (ΣZ fi+1 x … x fk)
 Then explicitly construct a representation (in terms of a
multidimensional array, a tree, or a set of rules) of the
rightmost factor
 The factor size is exponential in the number of variables
of the factor
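The partitioning step can be sketched as follows. Factors are stored as (variables, table) pairs, an assumed representation, and are built from the P(L|A), P(R|L) and P(S|F) tables of the example. Summing out L multiplies only the factors that mention L and leaves P(S|F) untouched.

```python
from itertools import product

def factor(variables, fn):
    """Tabulate fn over all Boolean assignments to the variables."""
    variables = tuple(variables)
    return (variables, {vals: fn(*vals)
                        for vals in product([True, False], repeat=len(variables))})

def multiply(f1, f2):
    (v1, t1), (v2, t2) = f1, f2
    union = v1 + tuple(v for v in v2 if v not in v1)
    table = {}
    for vals in product([True, False], repeat=len(union)):
        a = dict(zip(union, vals))
        table[vals] = t1[tuple(a[v] for v in v1)] * t2[tuple(a[v] for v in v2)]
    return (union, table)

def sum_out(var, f):
    variables, table = f
    i = variables.index(var)
    out = {}
    for vals, p in table.items():
        key = vals[:i] + vals[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return (variables[:i] + variables[i + 1:], out)

def eliminate(var, factors):
    """Sum var out of a product of factors, touching only those that contain it."""
    without = [f for f in factors if var not in f[0]]
    with_var = [f for f in factors if var in f[0]]
    if not with_var:
        return without
    prod = with_var[0]
    for f in with_var[1:]:
        prod = multiply(prod, f)
    return without + [sum_out(var, prod)]

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

# Factors built from the tables in the slides.
f_l = factor(["L", "A"], lambda l, a: bern(0.88 if a else 0.001, l))   # P(L | A)
f_r = factor(["R", "L"], lambda r, l: bern(0.75 if l else 0.01, r))    # P(R | L)
f_s = factor(["S", "F"], lambda s, f: bern(0.9 if f else 0.01, s))     # P(S | F)

remaining = eliminate("L", [f_l, f_r, f_s])
f_r_given_a = remaining[-1]   # factor on (A, R), equal to P(R | A)
```

The intermediate product here has three variables (L, A, R), which illustrates why the cost is driven by the size of the largest factor that has to be built.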
3. Bayesian prediction
5 bags of candies
h1: 100% cherry
h2: 75% cherry
25% lime
h3: 50% cherry
50% lime
h4: 25% cherry
75% lime
h5: 100% lime
H (set of hypotheses) – type of bag with values h1 .. h5
Collect evidence (random variables): d1, d2, … with
possible values cherry or lime
Goal: predict the flavour of the next candy
Bayesian prediction
Let D be the data, with observed value d
The probability of each hypothesis, based on Bayes' rule, is:
P(hi|d) = α P(d|hi) P(hi)
The prediction for an unknown quantity X is
P(X|d) = Σi P(X|hi) P(hi|d)
 Key elements: the prior probabilities P(hi) and the likelihood
of the data under each hypothesis P(d|hi)
P(d|hi) = Πj P(dj|hi)
We assume the prior probability:
h1 h2 h3 h4 h5
0.1 0.2 0.4 0.2 0.1
P(hi|d) = α P(d|hi) P(hi) (1)
P(lime) = 0.1*0 + 0.2*0.25 + 0.4*0.5 + 0.2*0.75+ 0.1*1 = 0.5
α = 1/0.5 = 2
P(h1|lime) = α P(lime|h1)P(h1) = 2 * (0*0.1) = 0
P(h2|lime) = α P(lime|h2)P(h2) = 2 * (0.25*0.2) = 0.1
P(h3|lime) = α P(lime|h3)P(h3) = 2 * (0.5*0.4) = 0.4
P(h4|lime) = α P(lime|h4)P(h4) = 2 * (0.75*0.2) = 0.3
P(h5|lime) = α P(lime|h5)P(h5) = 2 * (1*0.1) = 0.2
P(d|hi) = Πj P(dj|hi) (3)
P(lime,lime) = 0.1*0 + 0.2*0.25*0.25 + 0.4*0.5*0.5 + 0.2*0.75*0.75+
0.1*1*1 = 0.325
α = 1/0.325 = 3.0769
P(h1|lime,lime) = α P(lime,lime|h1)P(h1) = 3.0769 * (0*0*0.1) = 0
P(h2|lime,lime) = α P(lime,lime|h2)P(h2) = 3.0769 * (0.25*0.25*0.2) = 0.0385
P(h3|lime,lime) = α P(lime,lime|h3)P(h3) = 3.0769 * (0.5*0.5*0.4) = 0.3077
P(h4|lime,lime) = α P(lime,lime|h4)P(h4) = 3.0769 * (0.75*0.75*0.2) = 0.3462
P(h5|lime,lime) = α P(lime,lime|h5)P(h5) = 3.0769 * (1*1*0.1) = 0.3077
 P(hi|d1,…,d10) from equation (1)
P(X|d) = Σi P(X|hi) P(hi|d) (2)
P(d2=lime|d1=lime) = P(d2|h1)*P(h1|d1) + P(d2|h2)*P(h2|d1) + P(d2|h3)*P(h3|d1)
+ P(d2|h4)*P(h4|d1) + P(d2|h5)*P(h5|d1) =
= 0*0+0.25*0.1+0.5*0.4+0.75*0.3+1*0.2 = 0.65
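The update and the prediction for the candy example can be reproduced with a few lines of code. The sketch below uses the priors and bag compositions from the slides; the function names are illustrative.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1) .. P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | hi)

def likelihood(h, candy):
    """P(candy | h) for a single observation."""
    return p_lime[h] if candy == "lime" else 1.0 - p_lime[h]

def posterior(observations):
    """P(hi|d) = alpha * P(hi) * product over j of P(dj|hi)."""
    weights = []
    for h, prior in enumerate(priors):
        w = prior
        for candy in observations:
            w *= likelihood(h, candy)
        weights.append(w)
    alpha = 1.0 / sum(weights)
    return [alpha * w for w in weights]

def predict(candy, observations):
    """P(X|d) = sum over i of P(X|hi) * P(hi|d)."""
    post = posterior(observations)
    return sum(likelihood(h, candy) * post[h] for h in range(len(priors)))

after_one = posterior(["lime"])
after_two = posterior(["lime", "lime"])
next_lime = predict("lime", ["lime"])
```

After one lime the posterior is (0, 0.1, 0.4, 0.3, 0.2) and the predicted probability that the next candy is lime is 0.65, matching the hand calculation.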
Bayesian prediction
 The true hypothesis will finally dominate the Bayesian prediction
 Problems if the hypothesis space is big
 Approximation
 Prediction based on the most probable hypothesis
 MAP Learning – maximum a posteriori
 P(X|d) ≈ P(X|hMAP)
 In the example, hMAP = h5 after 3 observations, so the MAP
prediction that the next candy is lime has probability 1.0
 As more data is collected MAP and Bayes tend to
be closer
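The gap between the two predictions can be tabulated for the candy example, assuming every observed candy is lime (the case used in the worked calculation). The function names are illustrative.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1) .. P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | hi)

def posterior_after_limes(n):
    """P(hi | n lime candies observed), for n >= 1."""
    weights = [prior * p ** n for prior, p in zip(priors, p_lime)]
    total = sum(weights)
    return [w / total for w in weights]

def bayes_prediction(n):
    """P(next is lime | d), averaging over all hypotheses."""
    post = posterior_after_limes(n)
    return sum(p * q for p, q in zip(p_lime, post))

def map_prediction(n):
    """Index of the most probable hypothesis and its prediction P(lime | hMAP)."""
    post = posterior_after_limes(n)
    h_map = max(range(len(post)), key=post.__getitem__)
    return h_map, p_lime[h_map]

# (n, (hMAP index, MAP prediction), Bayes prediction) for n = 1 .. 10
comparison = [(n, map_prediction(n), bayes_prediction(n)) for n in range(1, 11)]
```

After 3 limes the MAP hypothesis is h5 (index 4) and predicts lime with probability 1.0, while the full Bayesian prediction is still about 0.8; the two move closer as more limes are observed.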