Transcript Slides
Master of Science in Artificial Intelligence, 2010-2012
Knowledge Representation
and Reasoning
University "Politehnica" of Bucharest
Department of Computer Science
Fall 2010
Adina Magda Florea
http://turing.cs.pub.ro/krr_10
curs.cs.pub.ro
Lecture 11
Uncertain representation of
knowledge
Lecture outline
Uncertain knowledge
Belief networks
Bayesian prediction
1. Uncertain knowledge
Probability theory – 2 main interpretations
Statistical = a measure of the proportion of individuals (the long-run
frequency of a set of events)
Probability of a bird flying = the proportion of birds that fly out of the set of all birds
Personal, subjective or Bayesian = an agent's
measure of belief in some proposition based on the
agent's knowledge
Probability of a bird flying = the agent's measure of belief in the flying ability of an
individual, based on the knowledge that the individual is a bird
Can be viewed as a measure over all the worlds that are
possible, given the agent's knowledge about a particular
situation (in each possible world, the bird either flies or it
does not)
Bayesian probability
Both views have the same calculus
We focus here on the second view
We assume uncertainty is epistemological - pertaining
to the agent's knowledge about the world, rather than
ontological – how the world is
Semantics of (prior) probability
Interpretations – on possible worlds
Specify not only the truth of formulas but also how likely
each possible world is to be the real one
Modal logics – possible worlds + accessibility relation
Probabilities – possible worlds + a measure on p.w.
Semantics of probability
A possible world is an assignment of exactly one value to
every random variable.
Let W be the set of all possible worlds. If w ∈ W and f is a
formula, f is true in w (w |= f) is defined inductively on the
structure of f:
• w |= x=v iff w assigns value v to x
• w |= ¬f iff w |≠ f
• w |= f ∧ g iff w |= f and w |= g
• w |= f ∨ g iff w |= f or w |= g
Associated with each possible world is a measure. When
there are only a finite no. of worlds:
0 ≤ p(w) for all w ∈ W
Σw∈W p(w) = 1
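The possible-worlds semantics above can be sketched directly; a minimal Python illustration with two hypothetical binary variables and a made-up measure:

```python
import itertools

# Possible worlds for two binary random variables (hypothetical example):
# each world assigns exactly one value to every variable.
variables = ["flies", "bird"]
worlds = [dict(zip(variables, vals))
          for vals in itertools.product([True, False], repeat=2)]

# A measure p(w) >= 0 over the worlds, summing to 1 (made-up numbers).
measure = {
    (True, True): 0.4, (False, True): 0.1,
    (True, False): 0.05, (False, False): 0.45,
}

def p(world):
    return measure[(world["flies"], world["bird"])]

def prob(formula):
    """P(f) = sum of the measures of the worlds in which f is true."""
    return sum(p(w) for w in worlds if formula(w))

print(prob(lambda w: w["bird"]))  # 0.4 + 0.1 = 0.5
```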
Semantics of probability
The probability of a formula f is the sum of the measures
of the possible worlds in which f is true:
P(f) = Σw |= f p(w)
Semantics of conditional probability
A formula e representing the conjunction of all of the
agent's observations of the world is called evidence
The measure of belief in formula h based on formula e is
called conditional probability of h given e, P(h|e)
Evidence e will rule out all possible worlds that are
incompatible with evidence e
Semantics of probability
Evidence e introduces a new measure pe over possible
worlds, where all worlds in which e is false have measure 0
and the remaining worlds are normalized so that the sum
of the measures of the worlds is 1:
pe(w) = p(w)/P(e)   if w |= e
pe(w) = 0           if w |≠ e
P(h|e) = Σw |= h pe(w) = (Σw |= h∧e p(w)) / P(e) = P(h ∧ e)/P(e)
We assume P(e) > 0. If P(e) = 0 then e is false in all
possible worlds and thus cannot be observed
Chain rule
P(f1 ∧ … ∧ fn) = P(f1) x P(f2|f1) x … x P(fn|f1 ∧ … ∧ fn-1)
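Conditioning on evidence, as defined above, simply re-weights the possible worlds; a small sketch with made-up numbers:

```python
# Worlds over (bird, flies) with a prior measure p(w) (made-up numbers).
p = {
    (True, True): 0.4, (True, False): 0.1,    # bird worlds
    (False, True): 0.05, (False, False): 0.45,
}

def prob(event):
    return sum(v for w, v in p.items() if event(w))

def conditional(h, e):
    """P(h|e) = P(h and e) / P(e): worlds incompatible with e get measure 0."""
    pe = prob(e)
    assert pe > 0, "evidence with zero probability cannot be observed"
    return prob(lambda w: h(w) and e(w)) / pe

bird = lambda w: w[0]
flies = lambda w: w[1]
print(conditional(flies, bird))  # 0.4 / 0.5 = 0.8
```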
Bayes theorem
Given the current belief in a proposition H based on
evidence K, P(H|K), we observe E.
P(H|E ∧ K) = P(E|H ∧ K) * P(H|K) / P(E|K)
If the background knowledge K is implicit:
P(H|E) = P(E|H) * P(H) / P(E)
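A quick numeric illustration of Bayes' rule, with made-up numbers and P(E) obtained by the rule of total probability:

```python
# Bayes' rule on a hypothetical hypothesis H and evidence E.
p_h = 0.01                 # prior P(H)          (made-up)
p_e_given_h = 0.9          # likelihood P(E|H)   (made-up)
p_e_given_not_h = 0.05     # P(E|not H)          (made-up)

# P(E) by total probability
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# P(H|E) = P(E|H) * P(H) / P(E)
p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 3))  # 0.154
```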
Independence assumptions
Independence. The knowledge of the truth of one
proposition does not affect the belief in another
A random variable X is independent of a random
variable Y given a random variable Z if for all
values of the random variables (i.e., ai, bj, ck)
P(X=ai | Y=bj ∧ Z=ck) = P(X=ai | Z=ck)
Knowledge of Y's value does not affect the belief
in the value of X, given the value of Z.
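The definition can be checked numerically; a sketch on a hypothetical joint distribution over binary X, Y, Z, constructed so that X and Y are independent given Z:

```python
import itertools

# P(X,Y,Z) built as P(Z) * P(X|Z) * P(Y|Z), so X and Y are
# conditionally independent given Z by construction (made-up numbers).
p_z = {True: 0.3, False: 0.7}
p_x_given_z = {True: 0.8, False: 0.2}   # P(X=true | Z=z)
p_y_given_z = {True: 0.6, False: 0.1}   # P(Y=true | Z=z)

def joint(x, y, z):
    px = p_x_given_z[z] if x else 1 - p_x_given_z[z]
    py = p_y_given_z[z] if y else 1 - p_y_given_z[z]
    return p_z[z] * px * py

def cond(num_event, den_event):
    worlds = list(itertools.product([True, False], repeat=3))
    num = sum(joint(*w) for w in worlds if num_event(*w) and den_event(*w))
    den = sum(joint(*w) for w in worlds if den_event(*w))
    return num / den

# P(X=true | Y=true, Z=true) equals P(X=true | Z=true) = 0.8
lhs = cond(lambda x, y, z: x, lambda x, y, z: y and z)
rhs = cond(lambda x, y, z: x, lambda x, y, z: z)
print(lhs, rhs)
```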
2. Belief networks
A BN (Belief Network or Bayesian Network) is a
graphical representation of conditional independence
It is represented as a Directed Acyclic Graph (DAG)
The nodes represent random variables.
The edges represent direct dependence among the
variables.
X → Y: X has a direct influence on Y (represents a
statistical dependence)
X = Parent(Y) if X → Y
X = Ancestor(Y) if there is a directed path from X to Y
(X → .. → Y)
Z = Descendant(Y) if Z = Y or there is a directed path from
Y to Z (Y → .. → Z)
BN
The independence assumption embedded in a BN
is:
Each random variable is independent of its
nondescendants given its parents
Y1,..,Yn – parents of X
P(X=a | Y1=v1 ∧ … ∧ Yn=vn ∧ R) = P(X=a | Y1=v1 ∧ … ∧ Yn=vn)
if R does not involve descendants of X (X itself included)
The number of probabilities that must be specified for
each variable is exponential in the number of parents of
the variable
A BN contains a set of conditional probability tables
P(X=a | Y1=v1 ∧ … ∧ Yn=vn)
BN
Therefore a BN defines a Joint Probability Distribution
(JPD) over the variables in the network
A value of the JPD can be computed as:
P(X1=x1 ∧ … ∧ Xn=xn) = Πi=1,n P(Xi=xi | parents(Xi))
where parents(Xi) represent the specific values of
Parents(Xi)
By the chain rule:
P(X1=x1 ∧ … ∧ Xn=xn) = P(x1,…, xn) =
P(xn | xn-1,…, x1) * P(xn-1,…, x1) = … = Πi=1,n P(xi | xi-1,…, x1)
Order of variables in the BN
P(Xi | Xi-1,…, X1) = P(Xi | Parents(Xi)) provided that
Parents(Xi) ⊆ {Xi-1,…, X1}
BN
P(Xi | Xi-1,…, X1) = P(Xi | Parents(Xi)) provided that
Parents(Xi) ⊆ {Xi-1,…, X1}
A BN is a correct representation of the
domain, provided that each node is
conditionally independent of its
predecessors, given its parents
Example – the fire-alarm belief network:
Tampering → Alarm ← Fire, Fire → Smoke, Alarm → Leaving → Report

P(Tampering) = 0.02
P(Fire) = 0.01

Fire Tampering | P(Alarm)
T    T         | 0.5
T    F         | 0.99
F    T         | 0.85
F    F         | 0.0001

Fire | P(Smoke)
T    | 0.9
F    | 0.01

Alarm | P(Leaving)
T     | 0.88
F     | 0.001

Leaving | P(Report)
T       | 0.75
F       | 0.01

Instead of computing the joint distribution of all the variables by the chain rule
P(T,F,A,S,L,R) = P(T)*P(F|T)*P(S|F,T)*P(A|S,F,T)*P(L|A,S,F,T)*P(R|L,A,S,F,T)
the BN defines a unique JPD in a factored form, i.e.
P(T,F,A,S,L,R) = P(T) * P(F) * P(A|T,F) * P(S|F) * P(L|A) * P(R|L)
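The factored JPD can be checked by brute-force enumeration of all possible worlds; the sketch below uses the example's CPTs, taking P(Tampering)=0.02 and P(Fire)=0.01 (the values consistent with the marginals P(Smoke)=0.0189 and P(Report)=0.028 quoted on the next slide):

```python
import itertools

# Enumerate all 2^6 possible worlds of the fire-alarm network using
# P(T,F,A,S,L,R) = P(T) P(F) P(A|T,F) P(S|F) P(L|A) P(R|L).
p_t = {True: 0.02, False: 0.98}
p_f = {True: 0.01, False: 0.99}
p_a = {(True, True): 0.5, (True, False): 0.85,     # keyed by (tampering, fire)
       (False, True): 0.99, (False, False): 0.0001}
p_s = {True: 0.9, False: 0.01}     # P(Smoke=true | Fire)
p_l = {True: 0.88, False: 0.001}   # P(Leaving=true | Alarm)
p_r = {True: 0.75, False: 0.01}    # P(Report=true | Leaving)

def bern(p, val):                  # P(V=val) for a binary variable
    return p if val else 1 - p

def joint(t, f, a, s, l, r):
    return (p_t[t] * p_f[f] * bern(p_a[(t, f)], a) * bern(p_s[f], s)
            * bern(p_l[a], l) * bern(p_r[l], r))

def prob(event):
    return sum(joint(*w) for w in itertools.product([True, False], repeat=6)
               if event(*w))

p_smoke = prob(lambda t, f, a, s, l, r: s)
p_report = prob(lambda t, f, a, s, l, r: r)
p_fire_given_smoke = prob(lambda t, f, a, s, l, r: f and s) / p_smoke
p_tamp_given_report = prob(lambda t, f, a, s, l, r: t and r) / p_report
print(round(p_smoke, 4), round(p_report, 3))  # 0.0189 0.028
```

The computed posteriors match the figures quoted on the following slides, e.g. P(Fire|Smoke) ≈ 0.476 and P(Tampering|Report) ≈ 0.399.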
Inferences
The probability of a variable given its nondescendants can
be computed using the "reasoning by cases" rule
P(L|S) = P(L|A,S)*P(A|S) + P(L|~A,S)*(1-P(A|S))=
P(L|A)*P(A|S) + P(L|~A)*(1-P(A|S))
P(A|S) = P(A|F,T)*P(F,T|S) +
P(A|F,~T)*P(F,~T|S) +
P(A|~F,T)*P(~F,T|S) +
P(A|~F,~T)*P(~F,~T|S)
The right-hand factor of each product can be computed
using the multiplication rule
P(F,T|S) = P(F|T,S)*P(T|S) = P(F|T,S)*P(T)
For computing P(F|T,S) we cannot use the
independence assumption because S is a descendant of
F; we can use Bayes' rule instead
P(F|T,S) = (P(S|F,T)*P(F|T)) / P(S|T) = (P(S|F)*P(F))
/ P(S|T)
Inferences
The prior probabilities (with no evidence) of each variable are:
P(Tampering) = 0.02
P(Fire) = 0.01
P(Report) = 0.028
P(Smoke) = 0.0189
Observing the Report gives
P(Tampering|Report) = 0.399
P(Fire|Report) = 0.2305
P(Smoke|Report) = 0.215
The probabilities of both Tampering and Fire are increased by
the Report
Because Fire is increased, so is the probability of Smoke
Inferences
Suppose instead that Smoke was observed
P(Tampering|Smoke) = 0.02
P(Fire|Smoke) = 0.476
P(Report|Smoke) = 0.320
Note that the probability of Tampering is not affected by
observing Smoke; however, the probabilities of Report and Fire
are increased
Suppose that both Report and Smoke were observed
P(Tampering|Report, Smoke) = 0.0284
P(Fire|Report, Smoke) = 0.964
Thus, observing both makes Fire more likely
However, in the context of Report, the presence of Smoke
makes Tampering less likely.
Inferences
Suppose instead that there is a Report but no Smoke
P(Tampering|Report,~Smoke) = 0.501
P(Fire|Report,~Smoke) = 0.0294
In the context of Report, Fire becomes much less likely and
so the probability of Tampering increases to explain Report.
Determining posterior distributions
Problem = computing conditional probabilities given the
evidence
Estimating posterior probabilities in a BN within an
absolute error (of less than 0.5) is NP-hard
3 main approaches
(1) Exploit the structure of the network
Clique tree propagation method – the network is
transformed into a tree with nodes labeled with sets of
variables. Reasoning is performed by passing messages
between the nodes in the tree
Time complexity is linear in the number of nodes of the
tree; the tree is in fact a polytree, so its size may be
exponential in the size of the belief network
Determining posterior distributions
(2) Search-based approaches
Enumerate possible worlds and estimate posterior
probabilities from the worlds generated
(3) Stochastic simulation
Random cases are generated according to a probability
distribution. By treating these cases as a set of samples,
one can estimate the marginal distribution on any
combination of variables
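A minimal sketch of stochastic simulation (forward sampling) on a hypothetical two-node network X → Y with made-up CPTs, estimating a marginal from the generated cases:

```python
import random

# Forward sampling: draw each variable from its CPT given its
# already-sampled parents, then treat the samples as possible worlds.
random.seed(0)                 # fixed seed for reproducibility

P_X = 0.3                      # P(X=true)        (made-up)
P_Y = {True: 0.9, False: 0.2}  # P(Y=true | X=x)  (made-up)

def sample_world():
    x = random.random() < P_X
    y = random.random() < P_Y[x]
    return x, y

samples = [sample_world() for _ in range(20000)]
estimate = sum(1 for _, y in samples if y) / len(samples)
exact = P_X * P_Y[True] + (1 - P_X) * P_Y[False]   # 0.41
print(round(estimate, 2))  # close to the exact marginal 0.41
```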
A structure-exploiting method
Based on the notion that a BN specifies a factorization of
the JPD
A factor is a representation of a function from a tuple of
random variables into a number.
f(X1,..,Xn), X1,..,Xn are the variables of the factor, f is a
factor on X1,..,Xn;
if f(X1,..,Xn) is a factor and each vi is an element of the
domain of Xi, then
f(X1=v1,..,Xn=vn) is a number that is the value of f when
each Xi has value vi
A structure-exploiting method
The product of two factors f1 and f2 is a factor on the
union of the variables
(f1 x f2)(X1,…,Xi,Y1,…,Yj,Z1,…,Zk) =
f1(X1,…,Xi,Y1,…,Yj) x f2(Y1,…,Yj,Z1,…,Zk)
Given a factor f(X1,…,Xi), one can sum out a variable,
say X1, and the result is a factor on X2,…,Xi
(ΣX1 f)(X2,…,Xi) = f(X1=v1,…,Xi) + … + f(X1=vk,…,Xi)
A conditional probability distribution can be seen as a factor:
f(X=u, Y1=v1,…,Yj=vj) = P(X=u | Y1=v1,…,Yj=vj)
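The factor operations above can be sketched with a minimal dict-based representation over binary variables (made-up numbers):

```python
import itertools

# A factor is a pair (variables, table), where table maps a tuple of
# values (one per variable, in order) to a number.

def product(f1, f2):
    """The product of two factors, defined on the union of their variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for vals in itertools.product([True, False], repeat=len(out_vars)):
        env = dict(zip(out_vars, vals))
        table[vals] = (t1[tuple(env[v] for v in vars1)]
                       * t2[tuple(env[v] for v in vars2)])
    return (out_vars, table)

def sum_out(var, f):
    """Sum out a variable; the result is a factor on the remaining ones."""
    vars_, t = f
    i = vars_.index(var)
    out_vars = vars_[:i] + vars_[i + 1:]
    table = {}
    for vals, val in t.items():
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + val
    return (out_vars, table)

# f1 represents P(X), f2 represents P(Y|X) (made-up numbers):
f1 = (["X"], {(True,): 0.3, (False,): 0.7})
f2 = (["X", "Y"], {(True, True): 0.9, (True, False): 0.1,
                   (False, True): 0.2, (False, False): 0.8})

# Multiplying gives the joint P(X,Y); summing out X gives the marginal P(Y).
p_y = sum_out("X", product(f1, f2))
print(p_y)  # P(Y=true) = 0.3*0.9 + 0.7*0.2 = 0.41
```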
A structure-exploiting method
BN inference problem = computing the
posterior distribution of a variable given some
evidence
can be reduced to the problem of computing the
probabilities of conjunctions
Given the evidence Y1=v1 ∧ … ∧ Yj=vj and the query
variable Z:
P(Z|v1,…,vj) = P(Z,v1,…,vj) / P(v1,…,vj)
= P(Z,v1,…,vj) / Σz P(z,v1,…,vj)
=> compute the factor P(Z,v1,…,vj) and
normalize
A structure-exploiting method
The variables of the BN are X1,…,Xn.
To compute the factor P(Z,v1,…vj) we must sum out the
other variables from the JPD.
Let Z1,…,Zk be an enumeration of the other variables in the
BN
{Z1,…,Zk} = {X1,…,Xn} - {Z} - {Y1,…,Yj}
The factor can be computed by summing out the Zi.
The order of the Zi is an elimination order
P(Z,Y1=v1,…,Yj=vj) = ΣZk … ΣZ1 P(X1,…,Xn)Y1=v1,…,Yj=vj
A structure-exploiting method
P(Z,Y1=v1,…,Yj=vj) = ΣZk … ΣZ1 P(X1,…,Xn)Y1=v1,…,Yj=vj
There is a possible world for each assignment of a value
to each variable.
The JPD P(X1,…Xn) gives the probability (measure) for
each possible world
The approach selects the worlds with the observed
values for the Y's and sums over the possible worlds with
the same value for Z => in fact, this is the definition of
conditional probability
A structure-exploiting method
By the rule for conjunction of probabilities and the
definition of a BN:
P(X1,…Xn)=P(X1|Parents(X1)) * …*P(Xn|Parents(Xn))
Now the BN inference problem is reduced to a problem
of summing out a set of variables from a product of
factors.
To compute the posterior distribution of a query variable
given observations:
• Construct the JPD in terms of a product of factors
• Set the observed variables to their observed values
• Sum out each of the other variables (the Z1…Zk)
• Multiply the remaining factors and normalize
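The four steps above can be sketched end-to-end; a minimal variable-elimination example on a hypothetical chain network A → B → C with made-up CPTs, querying P(A | C=true):

```python
import itertools

# A factor is (variables, table) with table[assignment_tuple] = number.

def make_factor(vars_, fn):
    return (vars_, {vals: fn(dict(zip(vars_, vals)))
                    for vals in itertools.product([True, False],
                                                  repeat=len(vars_))})

def product(f1, f2):
    v1, t1 = f1
    v2, t2 = f2
    out = v1 + [v for v in v2 if v not in v1]
    table = {}
    for vals in itertools.product([True, False], repeat=len(out)):
        env = dict(zip(out, vals))
        table[vals] = (t1[tuple(env[v] for v in v1)]
                       * t2[tuple(env[v] for v in v2)])
    return (out, table)

def sum_out(var, f):
    vars_, t = f
    i = vars_.index(var)
    out = vars_[:i] + vars_[i + 1:]
    table = {}
    for vals, val in t.items():
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + val
    return (out, table)

def set_evidence(f, var, value):
    vars_, t = f
    if var not in vars_:
        return f
    i = vars_.index(var)
    out = vars_[:i] + vars_[i + 1:]
    return (out, {vals[:i] + vals[i + 1:]: v
                  for vals, v in t.items() if vals[i] == value})

# Step 1: the JPD as a product of factors (the CPTs; made-up numbers).
bern = lambda p: (lambda val: p if val else 1 - p)
fA = make_factor(["A"], lambda e: bern(0.2)(e["A"]))
fB = make_factor(["A", "B"], lambda e: bern(0.8 if e["A"] else 0.1)(e["B"]))
fC = make_factor(["B", "C"], lambda e: bern(0.7 if e["B"] else 0.05)(e["C"]))

# Step 2: set the observed variable C to its observed value.
factors = [set_evidence(f, "C", True) for f in (fA, fB, fC)]

# Step 3: sum out the remaining non-query variable B, multiplying
# only the factors that contain B.
with_b = [f for f in factors if "B" in f[0]]
without_b = [f for f in factors if "B" not in f[0]]
prod = with_b[0]
for f in with_b[1:]:
    prod = product(prod, f)
factors = without_b + [sum_out("B", prod)]

# Step 4: multiply the remaining factors and normalize.
result = factors[0]
for f in factors[1:]:
    result = product(result, f)
z = sum(result[1].values())
posterior = {vals[0]: v / z for vals, v in result[1].items()}
print(posterior[True])  # P(A=true | C=true)
```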
A structure-exploiting method
To sum out a variable Z from a product f1…fk of factors:
We must first partition the factors into those that do not
contain Z, say f1,..,fi, and those that contain Z, say fi+1…fk
Then
ΣZ f1 x … x fk = f1 x .. x fi x (ΣZ fi+1 x … x fk)
Then explicitly construct a representation (in terms of a
multidimensional array, a tree, or a set of rules) of the
rightmost factor
The factor size is exponential in the number of variables
of the factor
3. Bayesian prediction
5 bags of candies
h1: 100% cherry
h2: 75% cherry, 25% lime
h3: 50% cherry, 50% lime
h4: 25% cherry, 75% lime
h5: 100% lime
H (set of hypotheses) – type of bag, with values h1 .. h5
Collect evidence (random variables): d1, d2, … with
possible values cherry or lime
Goal: predict the flavour of the next candy
Bayesian prediction
Let D be the data, with observed value d
The probability of each hypothesis, based on Bayes' rule, is:
P(hi|d) = α P(d|hi) P(hi)   (1)
The prediction about an unknown quantity X is
P(X|d) = Σi P(X|hi) P(hi|d)   (2)
Key elements: the prior probabilities P(hi) and the probability of
the evidence under each hypothesis, P(d|hi); assuming
independent observations
P(d|hi) = Πj P(dj|hi)   (3)
We assume the prior probabilities:
h1 h2 h3 h4 h5
0.1 0.2 0.4 0.2 0.1
Priors: h1: 0.1, h2: 0.2, h3: 0.4, h4: 0.2, h5: 0.1
P(hi|d) = α P(d|hi) P(hi)   (1)
After observing one lime candy:
P(lime) = 0.1*0 + 0.2*0.25 + 0.4*0.5 + 0.2*0.75 + 0.1*1 = 0.5
α = 1/P(lime) = 1/0.5 = 2
P(h1|lime) = α P(lime|h1) P(h1) = 2 * (0*0.1) = 0
P(h2|lime) = α P(lime|h2) P(h2) = 2 * (0.25*0.2) = 0.1
P(h3|lime) = α P(lime|h3) P(h3) = 2 * (0.5*0.4) = 0.4
P(h4|lime) = α P(lime|h4) P(h4) = 2 * (0.75*0.2) = 0.3
P(h5|lime) = α P(lime|h5) P(h5) = 2 * (1*0.1) = 0.2
Priors: h1: 0.1, h2: 0.2, h3: 0.4, h4: 0.2, h5: 0.1
P(hi|d) = α P(d|hi) P(hi)   (1)
P(d|hi) = Πj P(dj|hi)   (3)
After observing two lime candies:
P(lime,lime) = 0.1*0 + 0.2*0.25*0.25 + 0.4*0.5*0.5 + 0.2*0.75*0.75 +
0.1*1*1 = 0.325
α = 1/0.325 ≈ 3.077
P(h1|lime,lime) = α P(lime,lime|h1) P(h1) = 3.077 * (0*0*0.1) = 0
P(h2|lime,lime) = α P(lime,lime|h2) P(h2) = 3.077 * (0.25*0.25*0.2) ≈ 0.038
P(h3|lime,lime) = α P(lime,lime|h3) P(h3) = 3.077 * (0.5*0.5*0.4) ≈ 0.308
P(h4|lime,lime) = α P(lime,lime|h4) P(h4) = 3.077 * (0.75*0.75*0.2) ≈ 0.346
P(h5|lime,lime) = α P(lime,lime|h5) P(h5) = 3.077 * (1*1*0.1) ≈ 0.308
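The candy computations above can be reproduced in a few lines (priors and per-bag lime probabilities taken from the slides):

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1)..P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | hi)

def posterior(n_limes):
    """P(hi | n lime candies observed): Bayes' rule with normalization."""
    unnorm = [pr * pl ** n_limes for pr, pl in zip(priors, p_lime)]
    alpha = 1 / sum(unnorm)
    return [alpha * u for u in unnorm]

def predict_lime(post):
    """P(next candy is lime | data) = sum_i P(lime|hi) P(hi|data)."""
    return sum(pl * po for pl, po in zip(p_lime, post))

print([round(p, 3) for p in posterior(1)])   # [0.0, 0.1, 0.4, 0.3, 0.2]
print(round(predict_lime(posterior(1)), 2))  # 0.65
```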
[Figure: the evolution of the posterior probabilities P(hi|d1,…,d10)
from equation (1), as successive lime candies are observed]
P(X|d) = Σi P(X|hi) P(hi|d)   (2)
P(d2=lime|d1=lime) = P(d2|h1)*P(h1|d1) + P(d2|h2)*P(h2|d1) +
P(d2|h3)*P(h3|d1) + P(d2|h4)*P(h4|d1) + P(d2|h5)*P(h5|d1) =
= 0*0 + 0.25*0.1 + 0.5*0.4 + 0.75*0.3 + 1*0.2 = 0.65
Remarks
The true hypothesis will eventually dominate the
prediction
Problems arise if the hypothesis space is large
Approximation
Prediction based on the most probable
hypothesis
MAP learning – maximum a posteriori
P(X|d) ≈ P(X|hMAP)
In the example, hMAP = h5 after 3 observations, so the
predicted probability of lime is 1.0
As more data are collected, the MAP and Bayesian
predictions become closer
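The MAP approximation and its convergence toward the full Bayesian prediction can be illustrated on the same candy example:

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1)..P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | hi)

def posterior(n_limes):
    unnorm = [pr * pl ** n_limes for pr, pl in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def bayes_predict(n_limes):
    """Full Bayesian prediction: average over all hypotheses."""
    post = posterior(n_limes)
    return sum(pl * po for pl, po in zip(p_lime, post))

def map_predict(n_limes):
    """MAP approximation: predict using only the most probable hypothesis."""
    post = posterior(n_limes)
    h_map = post.index(max(post))
    return p_lime[h_map]

# After 3 limes the MAP hypothesis is h5 (100% lime), so MAP predicts 1.0,
# while the full Bayesian prediction is still below 1; the two approach
# each other as more limes are observed.
print(map_predict(3), round(bayes_predict(3), 3))
```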