Bayesian Networks


Bayesian Networks
Campbell Wilson
[email protected]
CSE3475
Introduction



- Bayesian Belief Networks, or more simply, Bayesian Networks (BNs), are a powerful tool for modelling the interactions between variables.
- BNs can be used to infer the likelihoods of events given some knowledge of the likelihoods of other events and the dependencies between events (probabilistic inference).
- They have found application in various domains where reasoning under uncertainty is required. Example domains include expert systems (e.g. medical diagnosis) and information retrieval.
Preliminaries - Graphs

We define a graph as follows:

A graph G is a finite collection of vertices (or nodes) and edges (or arcs). Vertices are points and edges are connecting lines between two vertices. Formally:

G = <V, E>

where

V = {v_i}, i = 1, …, n
E = {(v_i, v_j)}, i, j = 1, …, n
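To make this concrete, here is a minimal sketch in Python (not part of the original slides; the particular edges are hypothetical) of storing a graph as explicit vertex and edge sets, plus the adjacency-list view commonly used for traversal:

```python
# A graph G = <V, E> stored as explicit vertex and edge sets.
V = {"v1", "v2", "v3", "v4", "v5"}
E = {("v1", "v2"), ("v2", "v3"), ("v2", "v4"), ("v4", "v5")}  # hypothetical edges

# Equivalent adjacency-list view, often more convenient for traversal.
adjacency = {v: set() for v in V}
for (vi, vj) in E:
    adjacency[vi].add(vj)
    adjacency[vj].add(vi)  # undirected: record both directions

print(adjacency["v2"])  # the neighbours of v2
```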
Example of a Graph
[Figure: an undirected graph on the vertices v1, v2, v3, v4, v5 with edges connecting some of them]
Directed Graph

- A directed graph (or digraph) is a graph G in which each edge (v_i, v_j) is associated with a direction, from its starting point v_i to its ending point v_j.
- Informally, a path in a directed graph is a sequence of edges along which we can walk (in the direction of the edges) from one vertex to another vertex.
- Any such path emanating from one vertex which leads us back to that same vertex is called a cycle.
Example of a Directed Graph
[Figure: a directed graph on the vertices v1–v5 in which one sequence of directed edges forms a cycle]
Directed Acyclic Graphs

- A Directed Acyclic Graph (DAG) is a directed graph which contains no cycles.

[Figure: a DAG example on the vertices v1–v5]
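Acyclicity is easy to test mechanically. The following is a small illustrative sketch (Python, added here, not from the slides) that checks whether a directed graph given as an adjacency list is a DAG by searching depth-first for a back edge:

```python
def is_dag(adjacency):
    """Return True iff the directed graph has no cycles."""
    WHITE, GREY, BLACK = 0, 1, 2            # unvisited / on current path / finished
    colour = {v: WHITE for v in adjacency}

    def dfs(v):
        colour[v] = GREY
        for w in adjacency[v]:
            if colour[w] == GREY:           # back edge: we walked back into our own path
                return False
            if colour[w] == WHITE and not dfs(w):
                return False
        colour[v] = BLACK
        return True

    return all(dfs(v) for v in adjacency if colour[v] == WHITE)

# The cyclic digraph on the previous slide would fail this test;
# a DAG such as the (hypothetical) one below passes.
print(is_dag({"v1": ["v2"], "v2": ["v3", "v4"],
              "v3": [], "v4": ["v5"], "v5": []}))  # True
```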
So what is a Bayesian Network?


- A Bayesian Network (BN) is a directed acyclic graph where the vertices represent random variables and the edges represent dependencies between the variables.
- For our purposes, we define a random variable as a variable which can take any real number value.
Dependency versus Causation


- It turns out that our definition of a BN can also be interpreted in a different way: the random variables can represent the occurrence or otherwise of certain events.
- This alternative view gives us the causal interpretation of Bayesian Networks as follows:

  A Bayesian network is a DAG whose vertices represent events and whose edges represent causal relationships between events.
Causal Interpretation

Event A → Event B

- The directed edge (A, B) implies that the occurrence of event A has a direct causal effect on event B (event A may cause event B to occur).
- This causality is not necessarily absolute. That is, the existence of edge (A, B) does not mean that if event A occurs, event B must occur as a consequence.
An Example
[Figure: a BN in which A (Engine Failure) and B (Mid-air Collision) each have a directed edge to C (Forced Landing), and C has a directed edge to D (Landing Gear Collapse)]

- Event A and/or event B may lead to event C occurring.
- Event C may lead to event D occurring.
- We consider that there is no direct causal relationship between event A and event B.
Joint Probability Distribution

- Suppose we wanted to have complete knowledge of the risk of any combination of these 4 events occurring. This requires us to completely specify the joint probability distribution (JPD) of the system, which is the set of probability values for all possible combinations of the values of the random variables in the system.
Joint Probability Distribution





- In our example, the joint probability distribution is:

  P(ABCD) = {P(A,B,C,D), P(A,B,C,¬D), …, P(¬A,¬B,¬C,¬D)}

- Therefore we need to specify 16 probability values to describe the system.
- In general, how many probability values must we explicitly calculate for a system with n events?
- Answer: 2^n (well, really 2^n - 1… why?)
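As a quick illustration of the 2^n count (a sketch added here, not from the slides), enumerating every true/false assignment of the four events produces all 16 entries of the JPD table:

```python
from itertools import product

# Every combination of truth values for the 4 events A, B, C, D.
assignments = list(product([True, False], repeat=4))

print(len(assignments))   # 16 = 2**4 entries in the JPD table
print(assignments[1])     # (True, True, True, False), i.e. the entry P(A,B,C,¬D)
# Only 2**4 - 1 = 15 of the values are free: the 16 entries must
# sum to 1, so the last one is determined by the others.
```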
BNs and JPDs


- If we know something about the causal dependencies between events, we can exploit this knowledge to simplify the JPD calculation process.
- In fact, depending on the topology of the network, the number of terms required to completely specify the JPD can be dramatically reduced.
Theorem
Given a Bayesian network with random variable nodes {X_1, X_2, …, X_n}, the joint probability distribution of the random variables is given by:

P(X_1, X_2, …, X_n) = ∏_{i=1}^{n} P(X_i | π_{X_i})

where the expression π_{X_i} denotes the parent set of the node X_i, i.e. the set of nodes from which directed edges converge at X_i.
Simplifying the JPD

- In our example:

  P(ABCD) = P(A) P(B) P(C|A,B) P(D|C)

- We have reduced the number of terms required for completely specifying the JPD from 15 to 8:

  P(A), P(B), P(C|A,B), P(C|¬A,B), P(C|A,¬B), P(C|¬A,¬B), P(D|C), P(D|¬C)
Simplifying the JPD

- Since we know the causal interaction between events, we are likely to know at least some of the values for these 8 probabilities. Probability values which we know from previous experience are termed prior probabilities.
- For example, we may know the following prior probabilities:
  - An engine failure occurs once every 1000 flights, and a mid-air collision occurs once every 100,000 flights. So P(A) = 10^-3 and P(B) = 10^-5.
  - One out of every 10 forced landings, and one out of every 100,000 non-forced landings, results in landing gear collapse. So P(D|C) = 0.10 and P(D|¬C) = 10^-5.
  - We may also know (from observation) the probability of a forced landing following an engine failure or a mid-air collision or both occurring, e.g. P(C|A,B) = 0.99, P(C|¬A,B) = 0.80, P(C|A,¬B) = 0.85, P(C|¬A,¬B) = 0.05.
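Putting the factorization and these priors together, the sketch below (Python, added for illustration; the numbers are exactly those quoted above) computes any entry of the JPD as P(A)P(B)P(C|A,B)P(D|C) and checks that the 16 entries sum to 1:

```python
from itertools import product

# Priors from the slide.
P_A = 1e-3                                          # P(A): engine failure
P_B = 1e-5                                          # P(B): mid-air collision
P_C = {(True, True): 0.99, (False, True): 0.80,
       (True, False): 0.85, (False, False): 0.05}   # P(C | A, B)
P_D = {True: 0.10, False: 1e-5}                     # P(D | C)

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) via the BN factorization."""
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pc = P_C[(a, b)] if c else 1 - P_C[(a, b)]
    pd = P_D[c] if d else 1 - P_D[c]
    return pa * pb * pc * pd

# e.g. the probability of an engine failure (no collision) followed by
# a forced landing and a landing gear collapse:
print(joint(True, False, True, True))   # = 1e-3 * (1 - 1e-5) * 0.85 * 0.10

# Sanity check: the 16 entries of the JPD sum to 1.
print(sum(joint(*w) for w in product((True, False), repeat=4)))
```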
Using Bayesian Networks



- The JPD can be used to answer arbitrary queries concerning the likelihood of events, but…
- It is often the case that we do not have all the information we would like concerning the distributions of probabilities in a system.
- We would like to infer the probabilities of particular events given what we do know (the prior probabilities).
The Independence Assumption


- The reason Bayesian networks allow a reduction in the number of terms of the JPD has to do with the notion of conditional independence.
- Two events A and B are considered to be independent if they have no effect on each other, or more formally:

  P(A|B) = P(A) and P(B|A) = P(B)
The Independence Assumption



- Two events A and B are said to be conditionally independent given C if the knowledge that C has occurred renders A and B independent.
- The independence assumption for Bayesian Networks states in a formal way how the topology of the network tells us which events are conditionally independent given other events.
- An aside: naïve Bayes classifiers are also based on an independence assumption, one which essentially ignores any dependency between the random variables; Bayesian networks do address such dependencies.
The Independence Assumption



- The strict definition of the independence assumption requires more mathematical formalism; we will explain the independence assumption slightly less formally.
- Firstly, suppose X, Y and Z are disjoint subsets of vertices of a Bayesian Network. Then X and Y are conditionally independent given Z if Z blocks the flow of inference between X and Y. If Z blocks inference flowing between X and Y, Z is said to d-separate X and Y.
d-separation
[Figure: the example network, with A (Engine Failure) and B (Mid-air Collision) as parents of C (Forced Landing), and D (Landing Gear Collapse) as the child of C; instantiating C blocks the evidence flow between A and D]

- A and D are conditionally independent given C.
- C d-separates A and D.
d-separation
[Figure: the same network; instantiating C (Forced Landing) activates the evidence flow between its parents A (Engine Failure) and B (Mid-air Collision)]

- A and B are conditionally dependent given C.
- C d-connects A and B.
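A small numerical sketch (added here, using the priors quoted earlier) makes this d-connection concrete: once C is observed, also learning B "explains away" the forced landing and lowers our belief in A:

```python
from itertools import product

# Priors from the earlier slide.
P_A, P_B = 1e-3, 1e-5
P_C = {(True, True): 0.99, (False, True): 0.80,
       (True, False): 0.85, (False, False): 0.05}  # P(C | A, B)

def joint_abc(a, b, c):
    """P(A=a, B=b, C=c) = P(a) P(b) P(c|a,b)."""
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pc = P_C[(a, b)] if c else 1 - P_C[(a, b)]
    return pa * pb * pc

# P(A | C): belief in engine failure given only a forced landing.
num = sum(joint_abc(True, b, True) for b in (True, False))
den = sum(joint_abc(a, b, True) for a, b in product((True, False), repeat=2))
print(num / den)          # ≈ 0.017

# P(A | C, B): additionally learning of a mid-air collision explains
# away the forced landing, so belief in engine failure drops.
num = joint_abc(True, True, True)
den = sum(joint_abc(a, True, True) for a in (True, False))
print(num / den)          # ≈ 0.001, much lower than P(A | C)
```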
Predictive versus Diagnostic Inference

- Predictive inference is the process whereby we formulate our degree of belief in a particular hypothesis given some a priori knowledge of the possible supporting evidence for the hypothesis.
- Diagnostic inference is the process whereby we adjust our belief in a hypothesis given some subsequently observed evidence.
- Causal interpretation of BNs: causes precede effects in time.
Predictive versus Diagnostic Inference

[Figure: two diagrams. In predictive inference, belief propagates from evidence to hypothesis along the direction of the edges; in diagnostic inference, observed evidence propagates back to the hypothesis against the direction of the edges]
But Why is the Independence Assumption Useful?

- Because it is the basis of the proof of the theorem we saw before concerning the JPD calculation.
- The independence assumption allows us to use the topology of the network to simplify calculation of the JPD.
Circular Reasoning


- Because inference can take place in both directions along an edge in a BN (given particular a priori evidence), we need to be careful of circular reasoning.
- e.g. If S were the only source of predictive inference to another node N, then that inference should not return as diagnostic inference to S (which would then in turn result in predictive inference to N, ad infinitum).
Evaluating Bayesian Networks


- We can associate belief values with each node in a BN which indicate the likelihood of the occurrence of the event represented by the node.
- The belief value of a particular node X_i is equal to P(X_i | e), where e represents the set of instantiated variables in the network, that is, the set of nodes for which we have some a priori knowledge of their values.
Evaluating Bayesian Networks


- It is rarely the case that we have all the information available to us to perform the sort of joint probability distribution calculations we saw earlier.
- We need to use probabilistic inference to calculate the belief values for all the nodes that have not been instantiated a priori. This is known as evaluating the Bayesian network.
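For a network as small as our four-event example, such belief values can be computed by brute-force enumeration of the JPD. The sketch below (added for illustration, reusing the priors quoted earlier) computes the diagnostic belief P(A | D), i.e. our belief in an engine failure given an observed landing gear collapse:

```python
from itertools import product

# Priors from the earlier slide.
P_A, P_B = 1e-3, 1e-5
P_C = {(True, True): 0.99, (False, True): 0.80,
       (True, False): 0.85, (False, False): 0.05}  # P(C | A, B)
P_D = {True: 0.10, False: 1e-5}                    # P(D | C)

def joint(a, b, c, d):
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pc = P_C[(a, b)] if c else 1 - P_C[(a, b)]
    pd = P_D[c] if d else 1 - P_D[c]
    return pa * pb * pc * pd

def belief(query_index, value, evidence):
    """P(X = value | evidence), summing the matching JPD entries.
    evidence maps a variable index (0=A, 1=B, 2=C, 3=D) to an observed value."""
    num = den = 0.0
    for world in product((True, False), repeat=4):
        if any(world[i] != v for i, v in evidence.items()):
            continue                      # inconsistent with the evidence
        p = joint(*world)
        den += p
        if world[query_index] == value:
            num += p
    return num / den

print(belief(0, True, {3: True}))   # P(A | D): diagnostic inference
```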
Evaluating Bayesian Networks



- Exact evaluation of arbitrary Bayesian networks has been shown to be NP-hard.
- However, for some network topologies, exact and efficient evaluation is possible.
- An example is the algorithm due to Pearl for local inference propagation in singly connected networks.
Pearl’s Algorithm




- A singly connected network is one within which there is at most one possible path between any two nodes X and Y in the network. This implies that the underlying undirected graph is also acyclic (the directed graph is acyclic by definition).
- The algorithm is a local propagation scheme, in that a node's belief values are updated by examining the beliefs of its immediate neighbours. It uses Bayes' Theorem as a basis.
- Updating is performed using message passing:
  - π-messages (predictive)
  - λ-messages (diagnostic)
- Also necessary is the link tensor, which specifies the conditional probabilities of the occurrence of the possible values of the node under consideration given the values of its immediate parents.
Pearl’s Algorithm
[Figure: message passing at a node X with parents U_1, …, U_n and children Y_1, …, Y_m. X receives π-messages π_X(u_1), …, π_X(u_n) from its parents and sends them λ-messages λ_X(u_1), …, λ_X(u_n); X sends π-messages π_Y1(x), …, π_Ym(x) to its children and receives λ-messages λ_Y1(x), …, λ_Ym(x) from them]
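To give a feel for the π/λ scheme without the full formalism, here is a heavily simplified sketch (added here, not from the slides; all numbers are hypothetical) of one propagation pass on a three-node chain X → Y → Z with binary variables and evidence Z = true. The real algorithm generalises this to multiple parents and children via the link tensor and products of messages:

```python
# Simplified π/λ propagation on the chain X -> Y -> Z, evidence Z = True.
# All numbers are hypothetical, chosen only to illustrate the scheme.
P_X = {True: 0.2, False: 0.8}             # prior, i.e. π(x) at the root
P_Y_given_X = {True: 0.9, False: 0.3}     # P(Y=True | x)
P_Z_given_Y = {True: 0.7, False: 0.1}     # P(Z=True | y)

# Downward (predictive) pass: π(y) = Σ_x P(y|x) π(x).
pi_Y = {y: sum((P_Y_given_X[x] if y else 1 - P_Y_given_X[x]) * P_X[x]
               for x in (True, False))
        for y in (True, False)}

# Upward (diagnostic) pass: with Z observed true, the λ message from
# Z to Y is λ_Z(y) = Σ_z P(z|y) λ(z) = P(Z=True | y).
lam_Y = {y: P_Z_given_Y[y] for y in (True, False)}

# Belief at Y combines both message streams: BEL(y) ∝ π(y) λ(y).
norm = sum(pi_Y[y] * lam_Y[y] for y in (True, False))
print({y: pi_Y[y] * lam_Y[y] / norm for y in (True, False)})   # BEL(Y)

# λ message from Y on to the root X, updating BEL(x) ∝ π(x) λ(x).
lam_X = {x: sum((P_Y_given_X[x] if y else 1 - P_Y_given_X[x]) * lam_Y[y]
                for y in (True, False))
         for x in (True, False)}
norm = sum(P_X[x] * lam_X[x] for x in (True, False))
print({x: P_X[x] * lam_X[x] / norm for x in (True, False)})    # BEL(X)
```

For this chain the result agrees with exact inference by enumeration; the attraction of the scheme is that every update is purely local to a node and its neighbours.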
Pearl’s Algorithm



- We will not cover the details of the algorithm here, but there is a nice conceptual justification of the local propagation approach.
- In the words of Pearl:

  "Belief networks encode [direct] relevancies as neighbouring nodes in a graph, thus ensuring that by consulting the neighbourhood one gains a license to act; what you don't see locally doesn't matter."
Non-singly Connected Networks



- Pearl's algorithm is not applicable to non-singly connected networks.
- The algorithm relies on the mutual independence of the parent nodes of the node being evaluated.
- This cannot be assured if we allow loops in the underlying undirected network, e.g.:

[Figure: a network in which E is a parent of both A and B, A and B are both parents of C, and C is a parent of D. Independence of A and B cannot be ensured: there may be a flow of evidence along the path {A, E, B}]
Other Evaluation Algorithms

- Exact algorithms:
  - Clique-tree propagation
  - Node elimination or graph reduction
  - Loop-cutset methods
- Approximate evaluation methods
Summary





- Bayesian Networks are a compact graphical model of causation.
- Their components are nodes and edges, representing random variables (events) and dependencies (causation) respectively.
- They can also encode conditional probabilities as link weights (or in an associated matrix).
- The general problem of network evaluation involves updating our belief in all nodes based on the a priori knowledge of previously instantiated nodes.
- In general, the evaluation problem is NP-hard, but this depends on the network topology. Alterations in network topology can have a huge effect on the efficiency of evaluation.
Query By Example
[Figure: architecture of BIR (Bayesian network Image Retrieval), comprising an indexing network (INet) over Level 0 and Level 1 features, a database object network (DNet) linking image database objects to Level 2 features, and a query network (QNet) attached via transient unity-weighted links]