Logic based systems

Abduction, Uncertainty, and Probabilistic Reasoning
Chapters 13, 14, and more
Introduction
• Abduction is a reasoning process that tries to form plausible explanations for abnormal observations
– Abduction is distinct from deduction and induction
– Abduction is inherently uncertain
• Uncertainty has become an important issue in AI research
• Some major formalisms for representing and reasoning about uncertainty
– Mycin’s certainty factor (an early representative)
– Probability theory (esp. Bayesian networks)
– Dempster-Shafer theory
– Fuzzy logic
– Truth maintenance systems
Abduction
• Definition (Encyclopedia Britannica): reasoning that derives
an explanatory hypothesis from a given set of facts
– The inference result is a hypothesis, which if true, could
explain the occurrence of the given facts
• Examples
– Dendral, an expert system to construct 3D structures of
chemical compounds
• Fact: mass spectrometer data of the compound and the
chemical formula of the compound
• KB: chemistry, esp. strength of different types of bonds
• Reasoning: form a hypothetical 3D structure that meets the given chemical formula, and would most likely produce the given mass spectrum if subjected to electron beam bombardment
– Medical diagnosis
• Facts: symptoms, lab test results, and other observed findings
(called manifestations)
• KB: causal associations between diseases and manifestations
• Reasoning: one or more diseases whose presence would
causally explain the occurrence of the given manifestations
– Many other reasoning processes (e.g., word sense disambiguation in natural language processing, image understanding, detective’s work, etc.) can also be seen as abductive reasoning.
Comparing abduction, deduction and induction
Deduction
– major premise: All balls in the box are black
– minor premise: These balls are from the box
– conclusion: These balls are black
– schema: from A => B and A, conclude B
Abduction
– rule: All balls in the box are black
– observation: These balls are black
– explanation: These balls are from the box
– schema: from A => B and B, conclude possibly A
Induction
– case: These balls are from the box
– observation: These balls are black
– hypothesized rule: All balls in the box are black
– schema: from "whenever A then B, but not vice versa", conclude possibly A => B
Induction: from specific cases to general rules
Abduction and deduction: both from part of a specific case to other part of the case using general rules (in different ways)
Characteristics of abductive reasoning
1. Reasoning results are hypotheses, not theorems (may be
false even if rules and facts are true),
– e.g., misdiagnosis in medicine
2. There may be multiple plausible hypotheses
– When given rules A => B and C => B, and fact B
both A and C are plausible hypotheses
– Abduction is inherently uncertain
– Hypotheses can be ranked by their plausibility if that can be
determined
3. Reasoning is often a hypothesize-and-test cycle
– hypothesize phase: postulate possible hypotheses, each of which could explain the given facts (or explain most of the important facts)
– test phase: test the plausibility of all or some of these hypotheses
– One way to test a hypothesis H is to query whether something that is currently unknown but can be predicted from H is actually true.
• If we also know A => D and C => E, then ask if D and E are true.
• If it turns out D is true and E is false, then hypothesis A becomes more plausible (support for A increased, support for C decreased)
• Alternative hypotheses compete with each other (Occam’s razor, explaining away)
4. Reasoning is non-monotonic
– Plausibility of hypotheses can increase/decrease as new
facts are collected (deductive inference determines if a
sentence is true but would never change its truth value)
– Some hypotheses may be discarded/defeated, and new ones
may be formed when new observations are made
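The hypothesize-and-test cycle can be sketched in a few lines of Python, using the rules A => B, C => B, A => D and C => E from the example. This is only an illustration: the function names and the +1 / -1 support score are assumptions made for the sketch, not a calibrated plausibility measure.

```python
# Causal rules "cause => effect" taken from the example in the text.
RULES = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "E")]

def hypothesize(observation):
    """Hypothesize phase: every cause that has a rule cause => observation."""
    return sorted({cause for cause, effect in RULES if effect == observation})

def score_hypotheses(hypotheses, observation, findings):
    """Test phase: check the other effects each hypothesis predicts.

    findings maps an effect to True/False once it has been checked.
    The +1 / -1 scoring is an illustrative assumption.
    """
    support = {}
    for h in hypotheses:
        score = 0
        for cause, effect in RULES:
            if cause == h and effect != observation and effect in findings:
                score += 1 if findings[effect] else -1
        support[h] = score
    return support

candidates = hypothesize("B")                      # both A and C explain B
support = score_hypotheses(candidates, "B", {"D": True, "E": False})
print(candidates, support)
```

With D found true and E found false, the support for A rises and the support for C falls, which matches the competition between hypotheses described above.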
Sources of Uncertainty
• Uncertain data (noise or partial observation)
• Uncertain knowledge (e.g., causal relations)
– A disorder may cause any or all of its POSSIBLE manifestations in a specific case
– A manifestation can be caused by more than one POSSIBLE disorder
• Uncertain reasoning results
– Abduction and induction are inherently uncertain
– Default reasoning, even in deductive fashion, is uncertain
– Incomplete deductive inference may be uncertain
Probabilistic Inference
• Based on probability theory (especially Bayes’ theorem)
– Well-established discipline about uncertain outcomes
– Empirical science, like physics/chemistry: can be verified by experiments
• Probability theory is too rigid to apply directly in many
knowledge-based applications
– Some assumptions have to be made to simplify the reality
– Different formalisms have been developed in which some aspects
of the probability theory are changed/modified.
• We will briefly review the basics of probability theory before
discussing different approaches to uncertainty
• The presentation uses the diagnostic process (an abductive and evidential reasoning process) as an example
Probability of Events
• Sample space and events
– Sample space S (e.g., all people in an area)
– Events $E_1 \subseteq S$ (e.g., all people having cough) and $E_2 \subseteq S$ (e.g., all people having cold)
• Prior (marginal) probabilities of events
– P(E) = |E| / |S| (frequency interpretation)
– P(E) = 0.1 (subjective probability)
– 0 <= P(E) <= 1 for all events
– Two special events, $\emptyset$ and S: $P(\emptyset) = 0$ and P(S) = 1.0
• Boolean operators between events (to form compound events)
– Conjunctive (intersection): E1 ^ E2 ($E_1 \cap E_2$)
– Disjunctive (union): E1 v E2 ($E_1 \cup E_2$)
– Negation (complement): ~E ($E^C = S - E$)
• Probabilities of compound events
– P(~E) = 1 – P(E) because P(~E) + P(E) =1
– P(E1 v E2) = P(E1) + P(E2) – P(E1 ^ E2)
– But how to compute the joint probability P(E1 ^ E2)?
[Figure: Venn diagrams showing E and ~E, and E1 and E2 with their overlap E1 ^ E2]
• Conditional probability (of E1, given E2)
– How likely E1 occurs in the subspace of E2
$$P(E_1 \mid E_2) = \frac{|E_1 \cap E_2|}{|E_2|} = \frac{|E_1 \cap E_2| / |S|}{|E_2| / |S|} = \frac{P(E_1 \wedge E_2)}{P(E_2)}$$
$$P(E_1 \wedge E_2) = P(E_1 \mid E_2)\, P(E_2)$$
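The frequency interpretation and the conditional-probability formula can be checked on a toy sample space. The sketch below uses made-up data (ten people, with hypothetical sets of people having a cough and having a cold); the names `prob` and `cond_prob` are chosen for the example only.

```python
from fractions import Fraction

S = set(range(10))      # hypothetical sample space: 10 people in an area
E1 = {0, 1, 2, 3}       # made-up data: people having a cough
E2 = {2, 3, 4}          # made-up data: people having a cold

def prob(event):
    """P(E) = |E| / |S| (frequency interpretation)."""
    return Fraction(len(event), len(S))

def cond_prob(e1, e2):
    """P(E1 | E2) = |E1 and E2| / |E2|."""
    return Fraction(len(e1 & e2), len(e2))

print(prob(E1), prob(E2), prob(E1 & E2))   # 2/5 3/10 1/5
print(cond_prob(E1, E2))                   # 2/3
print(prob(E1 | E2))                       # 1/2  (set union, i.e. E1 v E2)
```

Exact fractions are used so that the identities P(E1 ^ E2) = P(E1 | E2) P(E2) and P(E1 v E2) = P(E1) + P(E2) – P(E1 ^ E2) hold without rounding error.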
• Independence assumption
– Two events E1 and E2 are said to be independent of each other if $P(E_1 \mid E_2) = P(E_1)$ (given E2 does not change the likelihood of E1)
– Computation can be simplified with independent events
$$P(E_1 \wedge E_2) = P(E_1 \mid E_2)\, P(E_2) = P(E_1)\, P(E_2)$$
$$P(E_1 \vee E_2) = P(E_1) + P(E_2) - P(E_1 \wedge E_2) = P(E_1) + P(E_2) - P(E_1) P(E_2) = 1 - (1 - P(E_1))(1 - P(E_2))$$
• Mutually exclusive (ME) and exhaustive (EXH) set of events
– ME: $E_i \wedge E_j = \emptyset$ ($P(E_i \wedge E_j) = 0$), $i, j = 1, \ldots, n$, $i \neq j$
– EXH: $E_1 \vee \ldots \vee E_n = S$ ($P(E_1 \vee \ldots \vee E_n) = 1$)
Bayes’ Theorem
• In the setting of diagnostic/evidential reasoning
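[Figure: hypothesis node $H_i$ with prior $P(H_i)$, linked by conditional probabilities $P(E_j \mid H_i)$ to evidence/manifestation nodes $E_1, \ldots, E_j, \ldots, E_m$]
– Know prior probabilities of hypotheses $P(H_i)$ and conditional probabilities $P(E_j \mid H_i)$
– Want to compute the posterior probability $P(H_i \mid E_j)$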
– The hypothesis with the greatest posterior probability may be
taken as the most plausible diagnosis, because it is the most
probable cause of the given manifestations
• Computation is called Bayesian reasoning
– From priors and conditionals to posteriors
• Bayes’ theorem (formula 1):
$$P(H_i \mid E_j) = \frac{P(H_i)\, P(E_j \mid H_i)}{P(E_j)}$$
• If the purpose is to find which of the n hypotheses $H_1, \ldots, H_n$ is more plausible for the given $E_j$, then we can ignore the denominator and rank them using relative likelihood
$$rel(H_i \mid E_j) = P(E_j \mid H_i)\, P(H_i)$$
• $P(E_j)$ can be computed from $P(E_j \mid H_i)$ and $P(H_i)$, if we assume all hypotheses $H_1, \ldots, H_n$ are ME and EXH
$$P(E_j) = P(E_j \wedge (H_1 \vee \ldots \vee H_n)) \quad \text{(by EXH)}$$
$$= \sum_{i=1}^{n} P(E_j \wedge H_i) \quad \text{(by ME)}$$
$$= \sum_{i=1}^{n} P(E_j \mid H_i)\, P(H_i)$$
• Then we have another version of Bayes’ theorem:
$$P(H_i \mid E_j) = \frac{P(E_j \mid H_i)\, P(H_i)}{\sum_{k=1}^{n} P(E_j \mid H_k)\, P(H_k)} = \frac{rel(H_i \mid E_j)}{\sum_{k=1}^{n} rel(H_k \mid E_j)}$$
where $\sum_{k=1}^{n} P(E_j \mid H_k)\, P(H_k)$, the sum of relative likelihoods of all n hypotheses, equals $P(E_j)$ and is a normalization factor
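As a small numeric illustration of the normalized form of Bayes’ theorem, the sketch below computes relative likelihoods and posteriors for three hypotheses assumed to be ME and EXH. The priors and conditionals are made-up numbers, and `posterior` is a name chosen for this example.

```python
def posterior(priors, likelihoods):
    """Return (posteriors, P(E)) for ME and EXH hypotheses.

    priors[i] = P(H_i); likelihoods[i] = P(E | H_i).
    """
    rel = [p * l for p, l in zip(priors, likelihoods)]   # rel(H_i | E)
    p_e = sum(rel)                                       # normalization factor
    return [r / p_e for r in rel], p_e

# Hypothetical numbers, for illustration only.
priors = [0.7, 0.2, 0.1]
likelihoods = [0.1, 0.5, 0.9]
post, p_e = posterior(priors, likelihoods)
print(post, p_e)
```

Here the second hypothesis ends up most plausible even though the first has the largest prior, because ranking depends on the product P(E | H_i) P(H_i).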
Probabilistic Inference for simple diagnostic problems
• Knowledge base:
– $E_1, \ldots, E_m$: evidence/manifestations
– $H_1, \ldots, H_n$: hypotheses/disorders
– $E_j$ and $H_i$ are binary; hypotheses form a ME & EXH set
– $P(H_i)$, $i = 1, \ldots, n$: prior probabilities
– $P(E_j \mid H_i)$, $i = 1, \ldots, n$, $j = 1, \ldots, m$: conditional probabilities
• Case input: $E_1, \ldots, E_l$
• Find the hypothesis $H_i$ with the highest posterior probability $P(H_i \mid E_1, \ldots, E_l)$
• By Bayes’ theorem
$$P(H_i \mid E_1, \ldots, E_l) = \frac{P(E_1, \ldots, E_l \mid H_i)\, P(H_i)}{P(E_1, \ldots, E_l)}$$
• How to deal with multiple pieces of evidence?
– Assume all pieces of evidence are conditionally independent, given any hypothesis
$$P(E_1, \ldots, E_l \mid H_i) = \prod_{j=1}^{l} P(E_j \mid H_i)$$
– We then have
$$P(H_i \mid E_1, \ldots, E_l) = \frac{P(H_i) \prod_{j=1}^{l} P(E_j \mid H_i)}{P(E_1, \ldots, E_l)}$$
– How to deal with $P(E_1, \ldots, E_l)$?
• The relative likelihood
$$rel(H_i \mid E_1, \ldots, E_l) = P(E_1, \ldots, E_l \mid H_i)\, P(H_i) = P(H_i) \prod_{j=1}^{l} P(E_j \mid H_i)$$
• The absolute posterior probability
$$P(H_i \mid E_1, \ldots, E_l) = \frac{rel(H_i \mid E_1, \ldots, E_l)}{\sum_{k=1}^{n} rel(H_k \mid E_1, \ldots, E_l)} = \frac{P(H_i) \prod_{j=1}^{l} P(E_j \mid H_i)}{\sum_{k=1}^{n} P(H_k) \prod_{j=1}^{l} P(E_j \mid H_k)}$$
• Evidence accumulation (when new evidence is discovered)
$$rel(H_i \mid E_1, \ldots, E_l, E_{l+1}) = P(E_{l+1} \mid H_i)\, rel(H_i \mid E_1, \ldots, E_l)$$
$$rel(H_i \mid E_1, \ldots, E_l, \sim E_{l+1}) = (1 - P(E_{l+1} \mid H_i))\, rel(H_i \mid E_1, \ldots, E_l)$$
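Evidence accumulation under the conditional-independence assumption amounts to one multiplication per hypothesis for each new finding. The following sketch assumes a hypothetical knowledge base with three disorders and two manifestations; the disorder names, the probabilities, and the helper names `accumulate` and `normalize` are all invented for illustration.

```python
# Hypothetical knowledge base: 3 disorders (assumed ME and EXH), 2 manifestations.
PRIOR = {"flu": 0.3, "cold": 0.6, "other": 0.1}
COND = {  # COND[h][e] = P(e present | h); made-up numbers
    "flu":   {"fever": 0.9, "cough": 0.6},
    "cold":  {"fever": 0.2, "cough": 0.7},
    "other": {"fever": 0.1, "cough": 0.1},
}

def accumulate(rel, evidence, present):
    """Multiply each rel(H_i | ...) by P(E | H_i), or by 1 - P(E | H_i) if E is absent."""
    return {h: r * (COND[h][evidence] if present else 1.0 - COND[h][evidence])
            for h, r in rel.items()}

def normalize(rel):
    """Turn relative likelihoods into absolute posterior probabilities."""
    total = sum(rel.values())
    return {h: r / total for h, r in rel.items()}

rel = dict(PRIOR)                       # with no evidence, rel(H_i) = P(H_i)
rel = accumulate(rel, "fever", True)    # fever observed
rel = accumulate(rel, "cough", False)   # cough checked and found absent
post = normalize(rel)
print(rel, post)
```

Normalization is deferred to the end, since ranking the hypotheses only needs the relative likelihoods.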
Assessment of Assumptions
• Assumption 1: hypotheses are mutually exclusive and
exhaustive
– Single fault assumption (one and only one hypothesis must be true)
– Multi-faults do exist in individual cases
– Can be viewed as an approximation of situations where hypotheses are independent of each other and their prior probabilities are very small
$P(H_1 \wedge H_2) = P(H_1)\, P(H_2) \approx 0$ if both $P(H_1)$ and $P(H_2)$ are very small
• Assumption 2: pieces of evidence are conditionally
independent of each other, given any hypothesis
– Manifestations themselves are not independent of each other, they
are correlated by their common causes
– Reasonable under single fault assumption
– Not so when multi-faults are to be considered
Limitations of the simple Bayesian system
• Cannot handle well hypotheses of multiple disorders
– Suppose $H_1, \ldots, H_n$ are independent of each other
– Consider a composite hypothesis $H_1 \wedge H_2$
– How to compute the posterior probability (or relative likelihood) $P(H_1 \wedge H_2 \mid E_1, \ldots, E_l)$?
– Using Bayes’ theorem
$$P(H_1 \wedge H_2 \mid E_1, \ldots, E_l) = \frac{P(E_1, \ldots, E_l \mid H_1 \wedge H_2)\, P(H_1 \wedge H_2)}{P(E_1, \ldots, E_l)}$$
$P(H_1 \wedge H_2) = P(H_1)\, P(H_2)$ because they are independent
$P(E_1, \ldots, E_l \mid H_1 \wedge H_2) = \prod_{j=1}^{l} P(E_j \mid H_1 \wedge H_2)$, assuming the $E_j$ are independent, given $H_1 \wedge H_2$
How to compute $P(E_j \mid H_1 \wedge H_2)$?
– Assuming $H_1, \ldots, H_n$ are independent, given $E_1, \ldots, E_l$?
$P(H_1 \wedge H_2 \mid E_1, \ldots, E_l) = P(H_1 \mid E_1, \ldots, E_l)\, P(H_2 \mid E_1, \ldots, E_l)$
but this is a very unreasonable assumption
– Ex. B: burglar; E: earthquake; A: alarm set off
E and B are independent, but when A is given, they are (adversely) dependent because they become competitors to explain A: P(B|A, E) << P(B|A)
• Cannot handle causal chaining
– Ex. A: weather of the year; B: cotton production of the year; C: cotton price of next year
– Observed: A influences C
– The influence is not direct (A -> B -> C)
P(C|B, A) = P(C|B): instantiation of B blocks influence of A on C
• Need a better representation and better assumptions
Bayesian Networks (BNs)
• Definition: BN = (DAG, CPD)
– DAG: directed acyclic graph (BN’s structure)
• Nodes: random variables (typically binary or discrete, but
methods also exist to handle continuous variables)
• Arcs: indicate probabilistic dependencies between nodes
(lack of arc signifies conditional independence)
– CPD: conditional probability distribution (BN’s parameters)
• Conditional probabilities at each node, usually stored as a table (conditional probability table, or CPT)
$P(x_i \mid \pi_i)$, where $\pi_i$ is the set of all parent nodes of $x_i$
– Root nodes are a special case: no parents, so just use priors in CPD: $\pi_i = \emptyset$, so $P(x_i \mid \pi_i) = P(x_i)$
Example BN
[Figure: DAG with arcs A -> B, A -> C, B -> D, C -> D, C -> E]
A: P(a) = 0.001
B: P(b|a) = 0.3, P(b|¬a) = 0.001
C: P(c|a) = 0.2, P(c|¬a) = 0.005
D: P(d|b,c) = 0.1, P(d|b,¬c) = 0.01, P(d|¬b,c) = 0.01, P(d|¬b,¬c) = 0.00001
E: P(e|c) = 0.4, P(e|¬c) = 0.002
Uppercase: variables (A, B, …)
Lowercase: values/states of variables (A has two states a and ¬a)
Note that we only specify P(a) etc., not P(¬a), since they have to add to one
Conditional independence and
chaining
• Conditional independence assumption
– $P(x_i \mid \pi_i, q) = P(x_i \mid \pi_i)$, where q is any set of variables (nodes) other than $x_i$ and its descendants
[Figure: node $x_i$ with its parent set $\pi_i$ and an outside set of nodes q]
– $\pi_i$ blocks influence of other nodes on $x_i$ and its descendants (q influences $x_i$ only through variables in $\pi_i$)
– With this assumption, the complete joint probability distribution of all variables in the network can be represented by (recovered from) local CPDs by chaining these CPDs:
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \pi_i)$$
Chaining: Example
[Figure: the example BN, with arcs A -> B, A -> C, B -> D, C -> D, C -> E]
Computing the joint probability for all variables is easy. The joint distribution of all variables:
P(A, B, C, D, E)
= P(E | A, B, C, D) P(A, B, C, D)    (by the product rule)
= P(E | C) P(A, B, C, D)    (by cond. indep. assumption)
= P(E | C) P(D | A, B, C) P(A, B, C)
= P(E | C) P(D | B, C) P(C | A, B) P(A, B)
= P(E | C) P(D | B, C) P(C | A) P(B | A) P(A)
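The chained factorization can be evaluated directly from the CPT entries of the example BN. In the sketch below, the assignment of the two 0.01 entries of D to (b, ¬c) and (¬b, c) is an assumption (the two values are equal, so the result does not depend on it); `bern` and `joint` are names chosen for the example.

```python
from itertools import product

# CPT entries of the example BN (probability that the child is true).
P_A = 0.001
P_B = {True: 0.3, False: 0.001}                       # P(b | A)
P_C = {True: 0.2, False: 0.005}                       # P(c | A)
P_D = {(True, True): 0.1, (True, False): 0.01,
       (False, True): 0.01, (False, False): 0.00001}  # P(d | B, C)
P_E = {True: 0.4, False: 0.002}                       # P(e | C)

def bern(p_true, value):
    """Probability that a binary variable equals `value` when P(true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    """P(A, B, C, D, E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C)."""
    return (bern(P_A, a) * bern(P_B[a], b) * bern(P_C[a], c)
            * bern(P_D[(b, c)], d) * bern(P_E[c], e))

p_all_true = joint(True, True, True, True, True)
total = sum(joint(*values) for values in product([True, False], repeat=5))
print(p_all_true, total)
```

Ten CPT numbers are enough to recover all 32 entries of the full joint distribution, and those entries sum to one.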
Topological semantics
• A node is conditionally independent of its nondescendants given its parents
• A node is conditionally independent of all other nodes in
the network given its parents, children, and children’s
parents (also known as its Markov blanket)
• The method called d-separation can be applied to decide
whether a set of nodes X is independent of another set Y,
given a third set Z
[Figure: three three-node connection patterns]
– Chain (A -> B -> C): A and C are independent, given B
– Diverging (B <- A -> C): B and C are independent, given A
– Converging (B -> A <- C): B and C are independent only when A is NOT given
Inference tasks
• Simple queries: Compute posterior marginal P(Xi | E=e)
– E.g., P(NoGas | Gauge=empty, Lights=on, Starts=false)
– Posteriors for ALL nonevidence nodes (belief update)
– Priors for any/all nodes (E = $\emptyset$)
• Conjunctive queries:
– P(Xi, Xj | E=e) = P(Xi | E=e) P(Xj | Xi, E=e)
• Optimal decisions: Decision networks or influence diagrams include utility information and actions
– Maximize expected utility: $\sum_{outcome} U(outcome)\, P(outcome \mid action, evidence)$
– Probabilistic inference is required to find P(outcome | action, evidence)
• MAP problems (explanation)
– Let X denote the set of all variables in a BN, $V \subseteq X$ the set of instantiated variables, and $U = X - V$ the set of all un-instantiated variables. Then the MAP (maximum a posteriori probability) problem is to find the most probable instantiation of U, given V, i.e., $\max_u P(U = u \mid V)$
– The solution provides a good explanation for your action
– This is an optimization problem
Approaches to inference
• Exact inference
– Enumeration
– Variable elimination
– Belief propagation in polytrees (singly connected BNs)
– Clustering / join tree algorithms
• Approximate inference
– Stochastic simulation / sampling methods
– Markov chain Monte Carlo methods
– Loopy propagation
– Mean field theory
– Simulated annealing
– Genetic algorithms
– Neural networks
Inference by enumeration
• Instead of computing the joint, suppose we just want the
probability for one variable
• Add all of the terms (atomic event probabilities) from the full
joint distribution
• If E are the evidence (observed) variables and Y are the other
(unobserved) variables, excluding X, then the posterior
distribution
P(X|E=e) = α P(X, e) = α ∑yP(X, e, Y)
• Sum is over all possible instantiations of variables in Y
• α is the normalization factor
• Each P(X, e, Y) term can be computed using the chain rule
• Computationally expensive!
Example: Enumeration
[Figure: the example BN, with arcs A -> B, A -> C, B -> D, C -> D, C -> E]
• Suppose we want P(d), and only the value of E is given as true
• P(d|e) = α Σ_{A,B,C} P(A, B, C, d, e)
= α Σ_{A,B,C} P(A) P(B|A) P(C|A) P(d|B,C) P(e|C)
= α (P(a)P(b|a)P(c|a)P(d|b,c)P(e|c) + P(a)P(b|a)P(~c|a)P(d|b,~c)P(e|~c)
+ P(a)P(~b|a)P(c|a)P(d|~b,c)P(e|c) + P(a)P(~b|a)P(~c|a)P(d|~b,~c)P(e|~c)
+ P(~a)P(b|~a)P(c|~a)P(d|b,c)P(e|c) + P(~a)P(b|~a)P(~c|~a)P(d|b,~c)P(e|~c)
+ P(~a)P(~b|~a)P(c|~a)P(d|~b,c)P(e|c) + P(~a)P(~b|~a)P(~c|~a)P(d|~b,~c)P(e|~c))
P(~d|e) = α Σ_{A,B,C} P(A, B, C, ~d, e)
α is chosen so that P(d|e) + P(~d|e) = 1
• With simple iteration to compute this expression, there’s going to be a lot of repetition (e.g., P(e|c) has to be recomputed every time we iterate over C for all possible assignments of A and B)
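A direct, unoptimized version of this enumeration can be written as a sum over the full joint. The sketch reuses the CPT entries of the example BN (with the same assumption about which 0.01 entry of D goes with which negated parent); `query_d_given_e` is a name invented for the example.

```python
from itertools import product

P_A = 0.001
P_B = {True: 0.3, False: 0.001}                       # P(b | A)
P_C = {True: 0.2, False: 0.005}                       # P(c | A)
P_D = {(True, True): 0.1, (True, False): 0.01,
       (False, True): 0.01, (False, False): 0.00001}  # P(d | B, C)
P_E = {True: 0.4, False: 0.002}                       # P(e | C)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    """Chain-rule factorization of the example BN."""
    return (bern(P_A, a) * bern(P_B[a], b) * bern(P_C[a], c)
            * bern(P_D[(b, c)], d) * bern(P_E[c], e))

def query_d_given_e(e=True):
    """Return (P(d | E=e), P(E=e)) by summing the joint over A, B, C."""
    unnormalized = {}
    for d in (True, False):
        unnormalized[d] = sum(joint(a, b, c, d, e)
                              for a, b, c in product([True, False], repeat=3))
    p_e = unnormalized[True] + unnormalized[False]    # this is 1 / alpha
    return unnormalized[True] / p_e, p_e

p_d_given_e, p_e = query_d_given_e(True)
print(p_d_given_e, p_e)
```

Every term recomputes factors such as P(e|c), which is the repetition that variable elimination is designed to avoid.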
Belief Propagation
• Singly connected network (also known as polytree)
– there is at most one undirected path between any two nodes (i.e., the network is a tree if the directions of arcs are ignored)
– The influence of the instantiated variable (evidence) spreads to the rest of the network along the arcs
• The instantiated variable influences its predecessors and successors differently (using CPT along opposite directions)
• Computation is linear in the diameter of the network (the longest undirected path)
• Update belief (posterior) of every non-evidence node in one pass
[Figure: polytree with nodes A, B, C, D, F and evidence node E = e]
– For multi-connected net: conditioning
Conditioning
[Figure: the example BN with nodes A, B, C, D, E]
• Conditioning: Find the network’s smallest cutset S (a set of
nodes whose removal renders the network singly connected)
– In this network, S = {A} or {B} or {C} or {D}
• For each instantiation of S, compute the belief update with the
belief propagation algorithm
• Combine the results from all instantiations of S (each is weighted
by P(S = s))
• Computationally expensive (finding the smallest cutset is in general NP-hard, and the total number of possible instantiations of S is $O(2^{|S|})$)
Junction Tree
• Convert a BN to a junction tree
– Moralization: add an undirected edge between every pair of parents of a common child, then drop the directions of all arcs: Moralized Graph
– Triangulation: add a chord to every cycle of length > 3: Triangulated Graph
– A junction tree is a tree of cliques of the triangulated
graph
– Cliques are connected by links
• A link stands for the set of all variables S shared by these
two cliques
• Each clique has a CPT, constructed from CPT of variables
in the original BN
• Reasoning
– Since it is now a tree, polytree algorithm can be applied,
but now two cliques exchange P(S), the distribution of S
– Complexity:
• O(n) steps, where n is the number of cliques
• Each step is expensive if cliques are large (CPT size is exponential in clique size)
• Construction of the CPTs of the JT is expensive as well, but it needs to be done only once.
Approximate inference: Direct sampling
• Suppose you are given values for some subset of the
variables, E, and want to infer values for unknown
variables, Z
• Randomly generate a very large number of instantiations
from the BN
– Generate instantiations for all variables – start at root variables and
work your way “forward” in topological order
• Rejection sampling: Only keep those instantiations that are
consistent with the values for E
• Use the frequency of values for Z to get estimated
probabilities
• Accuracy of the results depends on the size of the sample
(asymptotically approaches exact results)
• Very expensive
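Rejection sampling on the example BN can be sketched as follows, estimating P(c | e). The CPT values are those of the example BN (with the same assumption about the negated-parent entries of D); the function names, the sample size and the fixed seed are choices made for this sketch.

```python
import random

P_A = 0.001
P_B = {True: 0.3, False: 0.001}                       # P(b | A)
P_C = {True: 0.2, False: 0.005}                       # P(c | A)
P_D = {(True, True): 0.1, (True, False): 0.01,
       (False, True): 0.01, (False, False): 0.00001}  # P(d | B, C)
P_E = {True: 0.4, False: 0.002}                       # P(e | C)

def sample_once(rng):
    """Draw one full instantiation in topological order (parents first)."""
    a = rng.random() < P_A
    b = rng.random() < P_B[a]
    c = rng.random() < P_C[a]
    d = rng.random() < P_D[(b, c)]
    e = rng.random() < P_E[c]
    return {"A": a, "B": b, "C": c, "D": d, "E": e}

def rejection_sample(query, evidence, n, seed=0):
    """Estimate P(query = true | evidence); also return how many samples were kept."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n):
        s = sample_once(rng)
        if all(s[var] == val for var, val in evidence.items()):
            kept += 1            # consistent with the evidence: keep it
            hits += s[query]     # True counts as 1
    return (hits / kept if kept else float("nan")), kept

estimate, kept = rejection_sample("C", {"E": True}, n=200_000)
print(estimate, kept)
```

Because the evidence e is rare in this network (P(e) is about 0.004), well under one percent of the generated samples are kept, which illustrates why the method is expensive.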
Markov chain Monte Carlo algorithm
• So called because
– Markov chain – each instance generated in the sample is dependent
on the previous instance
– Monte Carlo – statistical sampling method
• Perform a random walk through variable assignment space,
collecting statistics as you go
– Start with a random instantiation, consistent with evidence variables
– At each step, for some nonevidence variable x, randomly sample its value by
$$P(x \mid mb(x)) \propto P(x \mid parents(x)) \prod_{Y \in children(x)} P(y \mid parents(Y))$$
• Given enough samples, MCMC gives an accurate estimate of the true distribution of values (no need for importance sampling because of Markov blanket)
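One common MCMC variant that fits this description is Gibbs sampling. The sketch below runs it on the example BN to estimate P(c | e). It computes the conditional of the chosen variable from the full joint, which gives the same distribution as the Markov-blanket formula because all factors not involving that variable cancel on normalization. The function name, step count and seed are assumptions of the sketch.

```python
import random

P_A = 0.001
P_B = {True: 0.3, False: 0.001}                       # P(b | A)
P_C = {True: 0.2, False: 0.005}                       # P(c | A)
P_D = {(True, True): 0.1, (True, False): 0.01,
       (False, True): 0.01, (False, False): 0.00001}  # P(d | B, C)
P_E = {True: 0.4, False: 0.002}                       # P(e | C)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    return (bern(P_A, a) * bern(P_B[a], b) * bern(P_C[a], c)
            * bern(P_D[(b, c)], d) * bern(P_E[c], e))

def gibbs(query, evidence, n_steps, seed=0):
    """Estimate P(query = true | evidence) with a random-scan Gibbs sampler."""
    rng = random.Random(seed)
    state = {v: rng.random() < 0.5 for v in "abcde"}   # random starting point
    state.update(evidence)                              # clamp evidence variables
    free = [v for v in "abcde" if v not in evidence]
    hits = 0
    for _ in range(n_steps):
        x = rng.choice(free)                            # pick a nonevidence variable
        weight = {}
        for value in (True, False):
            state[x] = value
            weight[value] = joint(**state)              # proportional to P(x | mb(x))
        state[x] = rng.random() < weight[True] / (weight[True] + weight[False])
        hits += state[query]                            # collect statistics as we go
    return hits / n_steps

estimate = gibbs("c", {"e": True}, n_steps=100_000)
print(estimate)
```

Unlike rejection sampling, no sample is discarded: every step stays consistent with the evidence.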
Loopy Propagation
• Belief propagation
– Works only for polytrees (exact solution)
– Each evidence propagates once throughout the network
• Loopy propagation
– Let propagation continue until the network stabilizes (one hopes)
• Experiments show
– Many BNs stabilize with loopy propagation
– If it stabilizes, it often yields exact or very good approximate solutions
• Analysis
– Conditions for convergence and quality of approximation are under intense investigation
Learning BN (from case data)
• Need for learning
– Experts’ opinions are often biased, inaccurate, and incomplete
– Large databases of cases become available
• What to learn
– Parameter learning: learning CPT when DAG is known (easy)
– Structural learning: learning DAG (hard)
• Difficulties in learning DAG from case data
– There are too many possible DAGs when # of variables is large (more than exponential)
n = 3: 25 possible DAGs
n = 10: about 4*10^18 possible DAGs
– Missing values in database
– Noisy data
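The two counts quoted above can be reproduced with Robinson's recurrence for the number of labeled DAGs. The recurrence is a standard combinatorial result and is not taken from these slides; `count_dags` is a name chosen for this sketch.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of DAGs on n labeled nodes (Robinson's recurrence).

    a(0) = 1
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k)
    """
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))

for n in (3, 10):
    print(n, count_dags(n))
```

Already at ten variables, exhaustive search over structures is out of reach, which motivates the heuristic search methods used for structure learning.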
BN Learning Approaches
• Bayesian approach (Cooper)
– Find the most probable DAG, given database DB, i.e.,
max(P(DAG|DB)) or max(P(DAG, DB))
– Based on some assumptions, a formula is developed to
compute P(DAG, DB) for a given pair of DAG and DB
– A hill-climbing algorithm (K2) is developed to search for a (sub)optimal DAG
– Extensions to handle some form of missing values
• Minimum description length (MDL) (Lam et al.)
– Sacrifices accuracy for simpler (less dense) structure
• Case data not always accurate
• Fewer links imply smaller CPD tables and less expensive
inference
– L = L1 + L2 where
• L1: the length of the encoding of DAG (smaller for simpler
DAG)
• L2: the length of the encoding of the difference between DAG
and DB (smaller for better match of DAG with DB)
• Smaller L2 implies more accurate (and more complex) DAG,
and thus larger L1
– Find the DAG that minimizes L by heuristic best-first search
Other formalisms for Uncertainty
Fuzzy sets and fuzzy logic
• Ordinary set theory
– $f_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{otherwise} \end{cases}$
$f_A(x)$ is called the characteristic or membership function of set A
– Predicate $A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{otherwise} \end{cases}$
– When it is uncertain if $x \in A$, use probability $P(x \in A)$
– There are sets that are described by vague linguistic terms (sets without hard, clearly defined boundaries), e.g., tall-person, fast-car
• Continuous
• Subjective (context dependent)
• Hard to define a clear-cut 0/1 membership function
• Fuzzy set theory
– Relax $f_A(x)$ from binary {0, 1} to continuous [0, 1]; $f_A(x)$ stands for the degree to which x is thought to belong to set A
height(john) = 6’5”: Tall(john) = 0.9
height(harry) = 5’8”: Tall(harry) = 0.5
height(joe) = 5’1”: Tall(joe) = 0.1
– Examples of membership functions
[Figure: three membership-function plots with a vertical axis from 0 to 1: "Set of teenagers" (age ticks 12, 19), "Set of young people" (age ticks 12, 19), and "Set of mid-age people" (age ticks 20, 35, 50, 65, 80)]
• Fuzzy logic: many-value logic
– Fuzzy predicates (degree of truth): $F_A(x) = y$ if $f_A(x) = y$
– Connectors/Operators
negation: $\sim F_A(x) = 1 - F_A(x)$
conjunction: $F_A(x) \wedge F_B(x) = \min\{F_A(x), F_B(x)\}$
disjunction: $F_A(x) \vee F_B(x) = \max\{F_A(x), F_B(x)\}$
• Compare with probability theory
– Probability: uncertainty of outcome
• Based on large # of repetitions or instances
• For each experiment (instance), the outcome is either true or false (without uncertainty or ambiguity)
• Unsure before it happens but sure after it happens
– Fuzzy: vagueness of conceptual/linguistic characteristics
• Unsure even after it happens
• E.g., whether a child of a tall mother and a short father is tall: unsure before the child is born, and still unsure after the child has grown up (height = 5’6”)
– Empirical vs subjective (testable vs agreeable)
– Fuzzy set operations may lead to unreasonable results
• Consider two events A and B with P(A) < P(B)
• If A => B (or $A \subseteq B$) then
P(A ^ B) = P(A) = min{P(A), P(B)}
P(A v B) = P(B) = max{P(A), P(B)}
• Not the case in general
P(A ^ B) = P(A)P(B|A) <= P(A)
P(A v B) = P(A) + P(B) – P(A ^ B) >= P(B)
(equality holds only if P(B|A) = 1, i.e., A => B)
– Something prob. theory cannot represent
• Tall(john) = 0.9, ~Tall(john) = 0.1
Tall(john) ^ ~Tall(john) = min{0.1, 0.9} = 0.1
john’s degree of membership in the fuzzy set of “median-height people” (both Tall and not-Tall)
• In prob. theory: $P(john \in Tall \wedge john \notin Tall) = 0$
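The fuzzy connectives are simple enough to state as code. The sketch below uses the Tall degrees given for john, harry and joe; the function names `f_not`, `f_and` and `f_or` are chosen for the example.

```python
def f_not(x):
    """Fuzzy negation: 1 - F(x)."""
    return 1.0 - x

def f_and(x, y):
    """Fuzzy conjunction: minimum of the two degrees."""
    return min(x, y)

def f_or(x, y):
    """Fuzzy disjunction: maximum of the two degrees."""
    return max(x, y)

tall = {"john": 0.9, "harry": 0.5, "joe": 0.1}   # degrees from the text

both = f_and(tall["john"], f_not(tall["john"]))   # Tall and not-Tall
either = f_or(tall["john"], f_not(tall["john"]))  # Tall or not-Tall
print(both, either)
```

The conjunction comes out at about 0.1 rather than 0, and the disjunction at 0.9 rather than 1, which is exactly where fuzzy logic departs from probability theory.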
Uncertainty in rule-based systems
• Elements in Working Memory (WM) may be uncertain because
– Case input (initial elements in WM) may be uncertain
Ex: the CD-Drive does not work 70% of the time
– Decision from a rule application may be uncertain even if the
rule’s conditions are met by WM with certainty
Ex: flu => sore throat with high probability
• Combining symbolic rules with numeric uncertainty: Mycin’s Certainty Factor (CF)
– An early attempt to incorporate uncertainty into KB systems
– $CF \in [-1, 1]$
– Each element in WM is associated with a CF: certainty of that assertion
– Each rule C1,...,Cn => Conclusion is associated with a CF: certainty of the association (between C1,...,Cn and Conclusion).
– CF propagation:
• Within a rule: each Ci has CFi, then the certainty of Conclusion is min{CF1,...,CFn} * CF-of-the-rule
• When more than one rule can apply to the current WM for the same Conclusion with different CFs, the largest of these CFs will be assigned as the CF for Conclusion
• Similar to fuzzy rule for conjunctions and disjunctions
– Strengths of Mycin’s CF method
• Easy to use
• CF operations are reasonable in many applications
• Probably the only method for uncertainty used in real-world rule-based systems
– Limitations
• It is in essence an ad hoc method (it can be viewed as a
probabilistic inference system with some strong, sometimes
unreasonable assumptions)
• May produce counter-intuitive results.
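The two propagation rules stated above (minimum within a rule, maximum across rules) can be sketched as follows. The code follows the rules as described in this text and is not meant as a full account of Mycin's combination scheme; the rule CFs and condition CFs are made-up numbers.

```python
def rule_cf(condition_cfs, cf_of_rule):
    """CF of the conclusion from one rule: min of the condition CFs times the rule's CF."""
    return min(condition_cfs) * cf_of_rule

def combine(cfs):
    """Several rules supporting the same conclusion: keep the largest CF."""
    return max(cfs)

# Hypothetical working-memory CFs and rule CFs, for illustration only.
cf_rule1 = rule_cf([0.8, 0.6], cf_of_rule=0.9)   # min(0.8, 0.6) * 0.9
cf_rule2 = rule_cf([0.7], cf_of_rule=0.5)        # 0.7 * 0.5
cf_conclusion = combine([cf_rule1, cf_rule2])
print(cf_rule1, cf_rule2, cf_conclusion)
```

The min/max pattern is the same one used by the fuzzy conjunction and disjunction operators.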
Dempster-Shafer theory
• A variation of Bayes’ theorem to represent ignorance
• Uncertainty and ignorance
– Suppose two events A and B are ME and EXH, given evidence E
A: having cancer; B: not having cancer; E: smoking
– By Bayes’ theorem: our beliefs on A and B, given E, are measured by
P(A|E) and P(B|E), and P(A|E) + P(B|E) = 1
– In reality,
I may have some belief in A, given E
I may have some belief in B, given E
I may have some belief not committed to either one
– The uncommitted belief (ignorance) should not be given to
either A or B, even though I know one of the two must be true,
but rather it should be given to “A or B”, denoted {A, B}
– Uncommitted belief may be given to A and B when new
evidence is discovered
• Representing ignorance
– Frame of discernment: $q = \{h_1, \ldots, h_n\}$, a set of ME and EXH hypotheses. The power set $2^q$ is organized as a lattice of super/subset relations. Each node S is a subset of hypotheses ($S \subseteq q$)
– Each node S is associated with a basic probability assignment m(S): $0 \le m(S) \le 1$; $m(\emptyset) = 0$; $\sum_{S \subseteq q} m(S) = 1$
– Ex: q = {A,B,C}, with the lattice and its m values:
{A,B,C}: 0.15
{A,B}: 0.1; {A,C}: 0.1; {B,C}: 0.05
{A}: 0.1; {B}: 0.2; {C}: 0.3
{}: 0
• Belief function
$Bel(S) = \sum_{S' \subseteq S} m(S')$; $Bel(\emptyset) = 0$; $Bel(q) = 1$
$Bel(\{A, B\}) = m(\{A, B\}) + m(\{A\}) + m(\{B\}) + m(\emptyset) = 0.1 + 0.1 + 0.2 + 0 = 0.4$
$Bel(\{A, B\}^C) = Bel(\{C\}) = 0.3$
– Plausibility (upper bound of belief of a node)
All belief not committed to $S^C$ may be committed to S
$Pls(S) = 1 - Bel(S^C)$
$Pls(\{A, B\}) = 1 - Bel(\{C\}) = 1 - 0.3 = 0.7$
$[Bel(S), Pls(S)]$: belief interval, from the lower bound (known belief) to the upper bound (maximally possible)
[Figure: the lattice of subsets of {A,B,C} with m values {A,B,C}: 0.15; {A,B}: 0.1; {A,C}: 0.1; {B,C}: 0.05; {A}: 0.1; {B}: 0.2; {C}: 0.3; {}: 0]
– Methods are developed to combine the effect of multiple pieces of evidence (belief update by new evidence)
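Belief and plausibility can be computed directly from the basic probability assignment of the {A, B, C} example. This is a minimal sketch: the dictionary `M` holds the m values from the lattice, and the names `bel` and `pls` are chosen for the example.

```python
FRAME = frozenset("ABC")   # frame of discernment q = {A, B, C}

# Basic probability assignment m(S) from the example lattice (m of the empty set is 0).
M = {
    frozenset("ABC"): 0.15,
    frozenset("AB"): 0.1, frozenset("AC"): 0.1, frozenset("BC"): 0.05,
    frozenset("A"): 0.1, frozenset("B"): 0.2, frozenset("C"): 0.3,
}

def bel(s):
    """Bel(S): total mass committed to subsets of S."""
    s = frozenset(s)
    return sum(mass for subset, mass in M.items() if subset <= s)

def pls(s):
    """Pls(S) = 1 - Bel(complement of S)."""
    return 1.0 - bel(FRAME - frozenset(s))

print(bel("AB"), pls("AB"))   # belief interval for {A, B}: about 0.4 and 0.7
```

The gap between Bel and Pls for {A, B} is the mass on sets that straddle {A, B} and {C}, which is the uncommitted belief that a single probability value cannot express.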
• Advantages
– The only formal theory about ignorance
– Disciplined way to handle evidence combination
• Disadvantages
– Computationally very expensive (lattice size 2^|q|)
– Assumes hypotheses are ME and EXH
– How to obtain m(.) for each piece of evidence is not clear, except subjectively