#### Transcript Slides

Master of Science in Artificial Intelligence, 2010-2012 Knowledge Representation and Reasoning University "Politehnica" of Bucharest Department of Computer Science Fall 2010 Adina Magda Florea http://turing.cs.pub.ro/krr_10 curs.cs.pub.ro Lecture 11 Uncertain representation of knowledge Lecture outline Uncertain knowledge Belief networks Bayesian prediction 2 1. Uncertain knowledge Probability theory – 2 main interpretations Statistical = measure of proportion of individuals (long range frequency of a set of events) Prob of a bird flying = proportion of birds that fly out of the set af all birds Personal, subjective or Bayesian = an agent's measure of belief in some proposition based on the agent's knowledge Prob of a bird flying = the agent's measure of belief in the flying ability of an individual based on the knowledge that the individual is a bird Can be viewed as a measure over all the worlds that are possible, given the agent's knowledge about a particular situation (in each possible world, the bird either flies or it does not) 3 Bayesian probability Both views have the same calculus We talk about the second view We assume uncertainty is epistemological - pertaining to the agent's knowledge about the world, rather than ontological – how the world is Semantics of (prior) probability Interpretations – on possible worlds Specify not only the truth of formulas but also how likely the real world is as compared to these formulas Modal logics – possible worlds + accessibility relation Probabilities – possible worlds + a measure on p.w. 4 Semantics of probability A possible world is an assignment of exactly one value to every random variable. Let W be the set of all possible worlds. If wW and f is a formula, f is true in w (w |= f) is defined inductively on the structure of f: • • • • w |= x=v iff w assigns value v to x w |=W f iff w |=/ f w |= f g iff w |= f and w |= g w |= f g iff w |= f or w |=W g (or w |= ¬f) Associated with each possible world is a measure. When there are only a finite no. of worlds: 0p(w) for all wW wW p(w) = 1 5 Semantics of probability The probability of a formula f is the sum of all measures of the possible worlds in which f is true. P(f)= w |= f p(w) Semantics of conditional probability A formula e representing the conjunction of all agent's observations of the world is called evidence The measure of belief in formula h based on formula e is called conditional probability of h given e, P(h|e) Evidence e will rule out all possible worlds that are incompatible with evidence e 6 Semantics of probability Evidence e introduces a new measure pe over possible worlds where all worlds in which e is false have measure 0 and the remaining worlds are normalized so that the sum of the measures of the worlds is 1 pe(w) = p(w)/P(e) if w |= f 0 if w |=/ f P(h|e) = w |=h pe(w) = ( w |= he p(w) )/ P(e) = P(he)/P(e) We assume P(e)>0. If P(e) = 0 then e is false in all possible worlds and thus can not be observed Chain rule P(f1 … fn)=P(f1) x P(f2|f1) x …P(fn|f1 … fn-1) 7 Bayes theorem Given the current belief in a proposition H based on evidence K, P(H|K), we observe E. P(H|EK) = P(E|H K) * P(H|K) P(E|K) If the background knowledge K is implicit P(H|E) = P(E|H) * P(H) P(E) 8 Independence assumptions Independence. The knowledge of the truth of one proposition does not affect the belief in another A random variable X is independent of a random variable Y given a random variable Z if for all values of the random variables (i.e., ai, bj, ck) P(X=ai|Y=bj Z=ck) = P(X=ai|Z=ck) Knowledge of Y's value does not affect the belief in the value of X, given the value of Z. 9 2. Belief networks A BN (Belief Network or Bayesian Network) is a graphical representation of conditional independence It is represented a Directed Acyclic Graph (DAG) The nodes represent random variables. The edges represent direct dependence among the variables. XY: X has a direct influence on Y (represents a statistical dependence) X = Parent(Y) if XY X = Ancestor(Y) if there is a direct path from X to Y (X.. Y) Z = Descendant(Y) if Z=Y or there is a direct path from Y to Z (Y.. Z) 10 BN The independence assumption embedded in a BN is: Each random variable is independent of its nondescendants given its parents Y1,..Yn – parents of X P(X=a|Y1=v1 … Yn=vn R)= P(X=a|Y1=v1 … Yn=vn) if R does not involve descendants, including itself The number of probabilities needed to be specified for each variable is exponential in the number of parents of a variable BN contains a set of conditional probability tables P(X=a|Y1=v1 … Yn=vn) 11 BN Therefore a BN defines a Joint Probability Distribution (JPD) over the variables in the network A value of the JPD can be computed as: P(X1=x1 … Xn=xn) = i=1,n P(Xi=xi | parents(Xi)) where parents(xi) represent the specific values of Parents(Xi) P(X1=x1 … Xn=xn) = P(x1,…, xn) = P(xn | xn-1,…, x1) * P(xn-1,…, x1) = … = i=1,n P(xi | xi-1,…, x1) Order of variables in the BN P(Xi | Xi-1,…, X1) = P(Xi | Parents(Xi)) provided that Parents(Xi) { Xi-1,…, X1} 12 BN P(Xi | Xi-1,…, X1) = P(Xi | Parents(Xi)) provided that Parents(Xi) { Xi-1,…, X1} A BN is a correct representation of the domain, provided that each node is conditionally independent of its predecessors, given its parents 13 P(T) 0.002 Tampering F T T F F T T F T F P(A) 0.5 0.99 0.85 0.0001 A P(L) T 0.88 F 0.001 P(F) 0.001 Fire Alarm Smoke F P(S) T 0.9 F 0.01 Leaving Report L P(R) T 0.75 F 0.01 Instead of computing the joint distribution of all the variables by the chain rule P(T,F,A,S,L,R) = P(T)*P(F|T)*P(S|F,T)*P(A|S,F,T)*P(L|A,S,F,T)*P(R|L,A,S,F,T) the BN deﬁnes a unique JPD in a factored form, i.e. P(T,F,A,S,L,R) = P(T) * P(F) * P(A|T,F) * P(S|F) * P(L|A) * P(R|L) Inferences The probability of a variable given nondescendants can be computed using the "reasoning by case" rule P(L|S) = P(L|A,S)*P(A|S) + P(L|~A,S)*(1-P(A|S))= P(L|A)*P(A|S) + P(L|~A)*(1-P(A|S)) P(A|S) = P(A|F,T)*P(F,T|S) + P(A|F,~T)*P(F,~T|S) + P(A|~F,T)*P(~F,T|S) + P(A|~F,~T)*P(~F,~T|S) The right hand side of each product can be computed using the multiplicative rule P(F,T|S) = P(F|T,S)*P(T|S) = P(F|T,S)*P(T) For computing P(F|T,S) we can not use the independence assumption because S is a descendant of F; we can use Bayes rule instead P(F|T,S) = (P(S|F,T)*P(F|T)) / P(S|T) = (P(S|F)*P(F)) / P(S|T) Inferences The prior probabilities (with no evidence) of each variable are: P(Tampering) = 0.02 P(Fire) = 0.1 P(Report) = 0.028 P(Smoke) = 0.0189 Observing the Report gives P(Tampering|Report) = 0.399 P(Fire|Report) = 0.2305 P(Smoke|Report) = 0.215 The probability of both Tampering and Fire are increased by the Report Because Fire is increased, so is the probability of Smoke 16 Inferences Suppose instead that Smoke was observed P(Tampering|Smoke) = 0.02 P(Fire|Smoke) = 0.476 P(Report|Smoke) = 0.320 Note that the probability of tampering is not affected by observing Smoke, however the probability of Report and Fire are increased Suppose that both Report and Smoke were observed P(Tampering|Report, Smoke) = 0.0284 P(Fire|Report, Smoke) = 0.964 Thus, observing both makes Fire more likely However, in the context of Report, the presence of Smoke makes Tampering less likely. 17 Inferences Suppose instead that there is a Report but no Smoke P(Tampering|Report,~Smoke) = 0.501 P(Fire|Report,~Smoke) = 0.0294 In the context of Report, Fire becomes much less likely and so the probability of Tampering increases to explain Report. 18 Determining posterior distributions Problem = computing conditional probabilities given the evidence Estimating posterior probabilities in a BN within an absolute error (of less than 0.5) is NP-hard 3 main approaches (1) Exploit the structure of the network Clique tree propagation method – the network is transformed into a tree with nodes labeled with sets of variables. Reasoning is performed by passing messages between the nodes in the tree Time complexity is linear in the number of nodes of the tree; the tree is in fact a polytree, so its size may be exponential in the size of the belief network 19 Determining posterior distributions (2) Search-based approaches Enumerate all possible worlds and estimate posterior probabilities from the worlds in general (3) Stochastic simulation Random cases are generated according to a probability distribution. By treating these cases as a set of samples, one can estimate the marginal distribution on any combination of variables 20 A structure approach method Based on the notion that a BN specifies a factorization of the JPD A factor is a representation of a function from a tuple of random variables into a number. f(X1,..,Xn), X1,..,Xn are the variables of the factor, f is a factor on X1,..,Xn; if f(X1,..,Xn) is a factor and each vi is an element of the domain of Xi f(X1=v1,..,Xj=vj) is a number that is the value of f when each Xi has value vj 21 A structure approach method The product of two factors f1 and f2 is a factor on the union of the variables (f1 x f2)(X1,…,Xi,Y1,…,Yj,Z1,…,Zk) = f1(X1,…,Xi,Y1,…,Yj) x f2(Y1,…,Yj,Z1,…,Zk) Given a factor f(X1,…,Xi), one can sum out a variable, say X1, and the result is a factor on X2,…,Xi (X1f)(X2,…,Xi) = f(X1=v1,…,Xi)+…+ f(X1=vk,…,Xi) A conditional probability distribution can be seen as f(X=u,Y1=v1…Yj=vj) = P(X=u|Y1=v1….Yj=vj) 22 A structure approach method BN inference problem = computing the posterior distribution of a variable given some evidence can be reduced to the problem of computing the probabilities of conjunctions Given the evidence Y1=v1… Yj=vj and the query variable Z: P(Z|v1,…. vj) = P(Z,v1,…vj) / P(v1,..vj) = P(Z,v1,…vj) / zP(z,v1,..vj) => compute the factor P(Z,v1,…vj) and normalize 23 A structure approach method The variables of the BN are X1,…,Xn. To compute the factor P(Z,v1,…vj) we must sum out the other variables from the JPD. Be Z1,…Zk an enumeration of the other variables in the BN Z1,…Zk = {X1,…,Xn} - {Z} - {Y1,…,Yj} The factor can be computing by summing out on Zi. The order of the Zi is an elimination order P(Z,Y1=v1,…Yj=vj) = Zk….Z1P(X1,…Xn)Y1=v1,…,Yj=vj 24 A structure approach method P(Z,Y1=v1,…Yj=vj) = Zk….Z1P(X1,…Xn)Y1=v1,…,Yj=vj There is a possible world for each assignment of a value to each variable. The JPD P(X1,…Xn) gives the probability (measure) for each possible world The approach selects the worlds with the observed values for the Y's and sum over possible worlds with the same value for Z => in fact this is the definition of conditional probability 25 A structure approach method By the rule for conjunction of probabilities and the definition of a BN: P(X1,…Xn)=P(X1|Parents(X1)) * …*P(Xn|Parents(Xn)) Now the BN inference problem is reduced to a problem of summing out a set of variables from a product of factors. To compute the posterior distribution of a query variable given observations: • Construct the JPD in terms of a product of factors • Set the observed variables to their observed values • Sum out each of the other variables (the Z1…Zk) • Multiply the remaining factors and normalize 26 A structure approach method To sum out a variable Z from a product f1…fk of factors: We must first partition the factors into those that do not contain Z, say f1,..,fi, and those that contain Z, say fi+1…fk Then Zf1 x …x fk = f1 x .. x fi x (Z fi+1 x … x fk) Then explicitly construct a representation (in terms of a multidimensional array, a tree, or a set of rules) of the rightmost factor The factor size is exponential in the number of variables of the factor 27 3. Bayesian prediction 5 bags of candies Candies h1: 100% cherry h2: 75% cherry 25% lime h3: 50% cherry 50% lime h4: 25% cherry 75% lime h5: 100% lime H (set of hypothesis) – type of bag with values h1 .. h5 Collect evidence (random variables): d1, d2, … with possible values cherry or lime Goal: predict the flavour of the next candy 28 Bayesian prediction Be D the data with observed value d The probability of each hypothesis, based on Bayes' rule, is: P(hi|d) = P(d|hi) P(hi) (1) The prediction on an unknown hypothesis X is P(X|d) = Σi P(X|hi) P(hi|d) (2) Key elements: prior probabilities P(hi) and the probability of an evidence for each hypothesis P(d|hi) P(d|hi) = Πj P(dj|hi) (3) We assume the prior probability: h1 h2 h3 h4 h5 0.1 0.2 0.4 0.2 0.1 29 h1 h2 h3 h4 h5 0.1 0.2 0.4 0.2 0.1 P(hi|d) = P(d|hi) P(hi) (1) h1: 100% cherry h2: 75% cherry 25% lime h3: 50% cherry 50% lime h4: 25% cherry 75% lime h5: 100% lime P(lime) = 0.1*0 + 0.2*0.25 + 0.4*0.5 + 0.2*0.75+ 0.1*1 = 0.5 = 1/0.5 = 2 P(h1|lime) = P(lime|h1)P(h1) = 2*0.1*0 = 0 P(h2|lime) = P(lime|h2)P(h2) = 2 * (0.25*0.2) = 0.1 P(h3|lime) = P(lime|h3)P(h3) = 2 * (0.5*0.4) = 0.4 P(h4|lime) = P(lime|h4)P(h4) = 2 * (0.75*0.2) = 0.3 P(h5|lime) = P(lime|h5)P(h5) = 2 * (1*0.1) = 0.2 30 h1 h2 h3 h4 h5 0.1 0.2 0.4 0.2 0.1 P(hi|d) = P(d|hi) P(hi) (1) h1: 100% cherry P(d|hi) = Πj P(dj|hi) (3) h2: 75% cherry 25% lime h3: 50% cherry 50% lime h4: 25% cherry 75% lime h5: 100% lime P(lime,lime) = 0.1*0 + 0.2*0.25*0.25 + 0.4*0.5*0.5 + 0.2*0.75*0.75+ 0.1*1*1 = 0.325 = 1/0.325 = 3.0769 P(h1|lime,lime) = P(lime,lime|h1)P(h1) = 3* 0.1*0*0 =0 P(h2|lime,lime) = P(lime,lime|h2)P(h2) = 3 * (0.25*.25*0.2) = 0.0375 P(h3|lime,lime) = P(lime,lime|h3)P(h3) = 3 * (0.5*0.5*0.4) = 0.3 P(h4|lime,lime) = P(lime,lime|h4)P(h4) = 3 * (0.75*0.75*0.2) = 0.3375 P(h5|lime,lime) = P(lime,lime|h5)P(h5) = 3 * (1*1*0.1) = 0.3 31 P(hi|d1,…,d10) from equation (1) 32 h1 h2 h3 h4 h5 0.1 0.2 0.4 0.2 0.1 h1: 100% cherry h2: 75% cherry h3: 50% cherry h4: 25% cherry h5: 100% lime 25% lime 50% lime 75% lime P(X|d) = Σi P(X|hi) P(hi|d) (2) P(d2=lime|d1)=P(d2|h1)*P(h1|d1) + P(d2|h2)*P(h2|d1) P(d2|h3)*P(h3|d1) + P(d2|h4)*P(h4|d1) + P(d2|h5)*P(h5|d1) = = 0*0.1+0.25*0.2+0.5*0.4+0.75*0.3+1*0.2 = 0.65 + Bayesian prediction 33 Remarks The true hypothesis will finally dominate the prediction Problems if the hypothesis space is big Aproximation Prediction based on the most probable hypothesis MAP Learning – maximum aposteriori P(X|d)=~P(X|hMAP) In the xemaple hMAP=h5 after 3 evidences so 1.0 As more data is collected MAP and Bayes tend to be closer 34