
Probabilistic Reasoning Systems
- Introduction to Bayesian Network -
CS570 AI
Team #7: T. M. Kim, J. B. Hur, H. Y. Park
Speaker: Kim, Tae Min
Outline
 Introduction to graphical models
 Review: Uncertainty and Probability
 Representing Knowledge in an Uncertain Domain
 Semantics of Bayesian Networks
 Inference in Bayesian Networks
 Summary
 Practice Questions
 Useful links on the WWW
1
What is a graphical model?
 A graphical model is a way of representing
probabilistic relationships between random
variables.
 Variables are represented by nodes.
 Conditional (in)dependencies are represented by (missing) edges.
 Undirected edges simply give correlations between variables (Markov
Random Field or Undirected Graphical Model).
 Directed edges give causality relationships (Bayesian Network or
Directed Graphical Model).
[Figure: example nodes Cavity, Weather, Catch, Toothache]
2
Significance of graphical models
 “Graphical models are a marriage between probability theory and graph theory.
 Probability theory provides the glue whereby the parts are
combined, ensuring that the system as a whole is consistent, and
providing ways to interface models to data.
 The graph theoretic side of graphical models provides both an
intuitively appealing interface by which humans can model highly-interacting
sets of variables, as well as a data structure that lends
itself naturally to the design of efficient general-purpose
algorithms.
 They provide a natural tool for dealing with two problems that occur
throughout applied mathematics and engineering – uncertainty and
complexity.
 In particular, they are playing an increasingly important role in the
design and analysis of machine learning algorithms.
3
Significance of graphical models
 Fundamental to the idea of a graphical model is the notion of
modularity – a complex system is built by combining simpler parts.
 Many of the classical multivariate probabilistic systems studied in
fields such as statistics, systems engineering, information theory,
pattern recognition and statistical mechanics are special cases of the
general graphical model formalism -- examples include mixture
models, factor analysis, hidden Markov models, and Kalman filters.
 The graphical model framework provides a way to view all of these
systems as instances of a common underlying formalism.
 This view has many advantages -- in particular, specialized techniques
that have been developed in one field can be transferred between
research communities and exploited more widely.
 Moreover, the graphical model formalism provides a natural
framework for the design of new systems.“ --- Michael Jordan, 1998.
4
We already know many graphical models:

[Figure: examples of graphical models (picture by Zoubin Ghahramani and Sam Roweis)]
5
Taxonomy of graphical models
6
Review: Probabilistic Independence
 Joint Probability: P(X ∧ Y) = probability of the joint event X ∧ Y
 Independence: X ⊥ Y ⟺ P(X,Y) = P(X) P(Y) ⟺ P(X,Y) = g(X) h(Y)
 Conditional Independence: X ⊥ Y | Z ⟺ P(X,Y | Z) = P(X|Z) P(Y|Z)
⟺ P(X | Y,Z) = P(X | Z)
⟺ P(X,Y,Z) = P(X,Z) P(Y,Z) / P(Z)
⟺ P(X,Y,Z) = g(X,Z) h(Y,Z)
7
Review: Basic Formulas for Probabilities
 Bayes’s Rule: P(X|Y) = P(X) P(Y|X) / P(Y)
 Product Rule: P(X,Y) = P(Y) P(X|Y)
 Chain Rule: repeatedly applying the Product Rule,
P(x1,...,xn) = P(xn | xn-1,...,x1) P(xn-1,...,x1)
             = P(xn | xn-1,...,x1) P(xn-1 | xn-2,...,x1) ... P(x2 | x1) P(x1)
             = Π_{i=1..n} P(xi | xi-1,...,x1)
 Theorem of Total Probability
 Suppose events X1, X2, ..., Xn are mutually exclusive and exhaustive (Σ_i P(Xi) = 1). Then
P(Y) = Σ_{i=1..n} P(Y | Xi) P(Xi)   : Conditioning
P(Y) = Σ_{i=1..n} P(Xi, Y)   : Marginalization
8
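As a quick sanity check of these identities, here is a small Python sketch (my own illustration, using a made-up joint distribution over two binary variables) that verifies Bayes' rule, the product rule, and total probability numerically.

```python
from itertools import product

# A made-up joint distribution P(X, Y) over two binary variables,
# chosen only to illustrate the identities above (it sums to 1).
P = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}

def marginal_x(x):            # P(X=x) by marginalization over Y
    return sum(P[(x, y)] for y in (0, 1))

def marginal_y(y):            # P(Y=y) by marginalization over X
    return sum(P[(x, y)] for x in (0, 1))

def cond_x_given_y(x, y):     # P(X=x | Y=y) from the joint
    return P[(x, y)] / marginal_y(y)

def cond_y_given_x(y, x):     # P(Y=y | X=x) from the joint
    return P[(x, y)] / marginal_x(x)

for x, y in product((0, 1), repeat=2):
    # Product rule: P(X,Y) = P(Y) P(X|Y)
    assert abs(P[(x, y)] - marginal_y(y) * cond_x_given_y(x, y)) < 1e-12
    # Bayes' rule: P(X|Y) = P(X) P(Y|X) / P(Y)
    bayes = marginal_x(x) * cond_y_given_x(y, x) / marginal_y(y)
    assert abs(cond_x_given_y(x, y) - bayes) < 1e-12

# Total probability / conditioning: P(Y=1) = sum_x P(Y=1|X=x) P(X=x)
assert abs(marginal_y(1) - sum(cond_y_given_x(1, x) * marginal_x(x) for x in (0, 1))) < 1e-12
print("Bayes' rule, product rule, and total probability all check out.")
```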
Uncertain Knowledge Representation
 By using a data structure called a Bayesian network (also known as a
Belief network, probabilistic network, causal network, or knowledge
map), we can represent the dependencies between variables and give a
concise specification of the joint probability distribution.
 Definition of Bayesian Network: Topology of the network + CPT
 A set of random variables makes up the nodes of the network
 A set of directed links or arrows connects pairs of nodes.
Example: X->Y means X has a direct influence on Y.
 Each node has a conditional probability table (CPT) that quantifies
the effects that the parents have on the node. The parents of a node
are all those nodes that have arrows pointing to it.
 The graph has no directed cycles (hence is a directed, acyclic graph,
or DAG).
9
Representing Knowledge Example
[Figure: the sprinkler network. Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass]

P(C=F) P(C=T)
  0.5    0.5

C | P(S=F) P(S=T)
F |  0.5    0.5
T |  0.9    0.1

C | P(R=F) P(R=T)
F |  0.8    0.2
T |  0.2    0.8

S R | P(W=F) P(W=T)
F F |  1.0    0.0
T F |  0.1    0.9
F T |  0.1    0.9
T T |  0.01   0.99
10
Conditional Probability Table (CPT)
 Once we get the topology of the network, a conditional probability table
(CPT) must be specified for each node.
 Example of the CPT for the variable WetGrass (parents Sprinkler and Rain):

S R | P(W=F) P(W=T)
F F |  1.0    0.0
T F |  0.1    0.9
F T |  0.1    0.9
T T |  0.01   0.99

[Figure: C → S, C → R, S → W, R → W]

 Each row in the table contains the conditional probability of each node
value for a conditioning case. Each row must sum to 1, because the
entries represent an exhaustive set of cases for the variable.
 A conditioning case is a possible combination of values for the parent
nodes.
11
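To make this concrete, here is one possible Python encoding (my own, not code from the slides) of the four CPTs of the sprinkler network, each stored as a dictionary from a conditioning case to a row, together with the row-sum-to-1 check described above.

```python
# Each CPT maps a conditioning case (tuple of parent values) to a row:
# the distribution over the node's own values (True/False states).
cpt_cloudy    = {(): {False: 0.5, True: 0.5}}
cpt_sprinkler = {(False,): {False: 0.5, True: 0.5},   # parent: Cloudy
                 (True,):  {False: 0.9, True: 0.1}}
cpt_rain      = {(False,): {False: 0.8, True: 0.2},   # parent: Cloudy
                 (True,):  {False: 0.2, True: 0.8}}
cpt_wetgrass  = {(False, False): {False: 1.0,  True: 0.0},   # parents: Sprinkler, Rain
                 (True,  False): {False: 0.1,  True: 0.9},
                 (False, True):  {False: 0.1,  True: 0.9},
                 (True,  True):  {False: 0.01, True: 0.99}}

def check_cpt(name, cpt):
    """Every row (conditioning case) of a CPT must sum to 1."""
    for case, row in cpt.items():
        assert abs(sum(row.values()) - 1.0) < 1e-9, (name, case)
    print(f"{name}: all {len(cpt)} rows sum to 1")

for name, cpt in [("Cloudy", cpt_cloudy), ("Sprinkler", cpt_sprinkler),
                  ("Rain", cpt_rain), ("WetGrass", cpt_wetgrass)]:
    check_cpt(name, cpt)
```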
An Example
Pr(W=1) = Σ_{c,s,r} Pr(C=c, S=s, R=r, W=1) = 0.6471

Pr(S=1 | W=1) = Pr(S=1, W=1) / Pr(W=1)
             = Σ_{c,r} Pr(C=c, S=1, R=r, W=1) / Pr(W=1)
             = 0.2781 / 0.6471 = 0.4298

Pr(R=1 | W=1) = Pr(R=1, W=1) / Pr(W=1)
             = Σ_{c,s} Pr(C=c, S=s, R=1, W=1) / Pr(W=1)
             = 0.4581 / 0.6471 = 0.7079

 Explaining away

[Figure: S → W ← R]

 In the above example, notice that the two causes "compete" to
"explain" the observed data. Hence S and R become conditionally
dependent given that their common child, W, is observed, even
though they are marginally independent.
 For example, suppose the grass is wet, but that we also know that it
is raining. Then the posterior probability that the sprinkler is on
goes down:
• Pr(S=1 | W=1, R=1) = 0.1945
12
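The numbers above can be reproduced by brute-force enumeration of the joint distribution. Below is a minimal Python sketch (my own encoding of the CPTs from slide 10, not code from the slides) that sums out the hidden variables to recover Pr(W=1), Pr(S=1|W=1), Pr(R=1|W=1), and the explaining-away value Pr(S=1|W=1,R=1).

```python
from itertools import product

# CPTs of the sprinkler network (values taken from slide 10).
P_C = {True: 0.5, False: 0.5}
P_S = {False: {True: 0.5, False: 0.5}, True: {True: 0.1, False: 0.9}}   # P(S | C)
P_R = {False: {True: 0.2, False: 0.8}, True: {True: 0.8, False: 0.2}}   # P(R | C)
P_W = {(False, False): 0.0, (True, False): 0.9,                          # P(W=T | S, R)
       (False, True): 0.9,  (True, True): 0.99}

def joint(c, s, r, w):
    """P(C=c, S=s, R=r, W=w) as the product of the CPT entries."""
    pw = P_W[(s, r)] if w else 1.0 - P_W[(s, r)]
    return P_C[c] * P_S[c][s] * P_R[c][r] * pw

def prob(query):
    """Sum the joint over all assignments satisfying `query` (a predicate)."""
    return sum(joint(c, s, r, w)
               for c, s, r, w in product([True, False], repeat=4)
               if query(c, s, r, w))

p_w   = prob(lambda c, s, r, w: w)                       # Pr(W=1)
p_sw  = prob(lambda c, s, r, w: s and w)                 # Pr(S=1, W=1)
p_rw  = prob(lambda c, s, r, w: r and w)                 # Pr(R=1, W=1)
p_srw = prob(lambda c, s, r, w: s and r and w)           # Pr(S=1, R=1, W=1)

print(f"Pr(W=1)            = {p_w:.4f}")                 # 0.6471
print(f"Pr(S=1 | W=1)      = {p_sw / p_w:.4f}")          # 0.4298
print(f"Pr(R=1 | W=1)      = {p_rw / p_w:.4f}")          # 0.7079
print(f"Pr(S=1 | W=1, R=1) = {p_srw / p_rw:.4f}")        # 0.1945
```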
Typical example of BN
 Compact representation: since each CPT row must sum to 1, only one probability per row needs to be stored explicitly, so the network specifies the full joint with far fewer numbers.
13
Conditional independence in BN
 Chain rule: P(x1,...,xn) = Π_{i=1..n} P(xi | xi-1,...,x1)
 If Parents(Xi) ⊆ {X1,...,Xi-1}, then P(Xi | Xi-1,...,X1) = P(Xi | Parents(Xi)), so
P(x1,...,xn) = Π_{i=1..n} P(xi | Parents(Xi))
or, writing πi for the parent set of node i,
p(x1:n) = Π_i p(xi | xπi)

[Figure: the six-node DAG. X1 → X2, X1 → X3, X2 → X4, X3 → X5, X2 → X6, X5 → X6]

 Example
p(x1:6) = p(x1) p(x2|x1) p(x3|x1,x2) p(x4|x1,x2,x3) p(x5|x1,x2,x3,x4) p(x6|x1,x2,x3,x4,x5)   (chain rule)
p(x1:6) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5)   (using the graph structure)
14
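The factorization can be read directly off the parent sets. The small Python sketch below (an illustration I added, using the six-node graph drawn on this slide) builds the factorized form of the joint from a dictionary of parents.

```python
# Parent sets of the six-node DAG on this slide:
# X1 -> X2, X1 -> X3, X2 -> X4, X3 -> X5, X2 -> X6, X5 -> X6
parents = {
    "x1": [],
    "x2": ["x1"],
    "x3": ["x1"],
    "x4": ["x2"],
    "x5": ["x3"],
    "x6": ["x2", "x5"],
}

def factorization(parents):
    """Render p(x1,...,xn) = prod_i p(xi | parents(xi)) as a string."""
    factors = []
    for node, pa in parents.items():
        factors.append(f"p({node})" if not pa else f"p({node}|{','.join(pa)})")
    return " ".join(factors)

print("p(x1:6) =", factorization(parents))
# -> p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5)
```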
Conditional independence in BN
if P(X,Y,Z,A,B,C,D,E) = P(Y|D) P(D|C,E) P(E|Z) P(C|B) P(B|A) P(X|A),
is X ⊥ Z | Y ?

[Figure: the corresponding DAG. A → X, A → B, B → C, C → D, E → D, Z → E, D → Y]
15
Conditional independence in BN
 Def. A topological (total) ordering I of the graph G is such that all
parents of node i occur earlier in I than i.
 Ex. I = {1, 2, 3, 4, 5, 6} is a total ordering.
I = {6, 2, 5, 3, 1, 4} is not a total ordering.
 Def. Non-descendants: the set νi of indices j before i in the total ordering I
other than the parents πi of i:
νi = { j : j precedes i in I, j ∉ πi }
 Ex. For I = {1, 2, 3, 4, 5, 6}: ν = { ∅, ∅, {2}, {1,3}, {1,2,4}, {1,3,4} }
 Markov property: Xi ⊥ Xνi | Xπi, i.e. Xi ⊥ nd(Xi) | pa(Xi)
X1 ⊥ ∅ | ∅,  X2 ⊥ ∅ | X1,  X3 ⊥ X2 | X1,
X4 ⊥ {X1,X3} | X2,  X5 ⊥ {X1,X2,X4} | X3,  X6 ⊥ {X1,X3,X4} | {X2,X5}

[Figure: the six-node DAG. X1 → X2, X1 → X3, X2 → X4, X3 → X5, X2 → X6, X5 → X6]
16
Conditional independence in BN
 To verify x4 ⊥ {x1, x3} | x2:

p(x1:4) = Σ_{x5,x6} p(x1:6)
        = Σ_{x5,x6} p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5)
        = p(x1) p(x2|x1) p(x3|x1) p(x4|x2) Σ_{x5} p(x5|x3) Σ_{x6} p(x6|x2,x5)
        = p(x1) p(x2|x1) p(x3|x1) p(x4|x2)

p(x1:3) = Σ_{x4} p(x1:4) = p(x1) p(x2|x1) p(x3|x1)

p(x4 | x1, x2, x3) = p(x1:4) / p(x1:3) = p(x4|x2)

∴ x4 ⊥ {x1, x3} | x2

[Figure: the six-node DAG. X1 → X2, X1 → X3, X2 → X4, X3 → X5, X2 → X6, X5 → X6]
17
Constructing a BN
 General procedure for incremental network construction:
 choose the set of relevant variables Xi that describe the domain
 choose an ordering for the variables
 while there are variables left:
• pick a variable Xi and add a node to the network for it
• set Parents(Xi) by testing its conditional independence in the net
• define the conditional probability table for Xi ☞ learning BN
 Example: Suppose we choose the ordering B, E, A, J, M
 P(E|B) = P(E)? Yes
 P(A|B) = P(A)? P(A|E) = P(A)? No
 P(J|A,B,E) = P(J|A)? Yes
 P(J|A) = P(J)? No
 P(M|A) = P(M)? No
 P(M|A,J) = P(M|A)? Yes

[Figure: resulting network. Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls]
18
Constructing a Bad BN Example
 Suppose we choose the ordering M, J, A, B, E
 P(J|M) = P(J)? No
 P(A|J,M) = P(A|J)? P(A|J,M) = P(A)? No
 P(B|A,J,M) = P(B|A)? Yes
 P(B|A) = P(B)? No
 P(E|A,J,M) = P(E|A)? Yes
 P(E|A,B) = P(E|A)? No
 P(E|A) = P(E)? No

[Figure: resulting network. MaryCalls → JohnCalls, MaryCalls → Alarm, JohnCalls → Alarm, Alarm → Burglary, Alarm → Earthquake, Burglary → Earthquake]
19
Bayes ball algorithm
 If every undirected path from a node in X to a node in Y is d-separated
by E, then X and Y are conditionally independent given E.
 A set of nodes E d-separates two sets of nodes X and Y if every
undirected path from a node in X to a node in Y is blocked given E.
20
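The d-separation test is easy to implement for small graphs by enumerating undirected paths and applying the blocking rules (chain and fork nodes block when observed; colliders block unless they or one of their descendants is observed). The sketch below is my own brute-force illustration, not the linear-time Bayes ball message-passing scheme named on these slides, so it is only meant for small examples.

```python
def d_separated(children, x, y, evidence):
    """True iff x and y are d-separated given `evidence` in the DAG `children`
    (a dict mapping every node to the list of its children)."""
    parents = {n: set() for n in children}
    for p, kids in children.items():
        for c in kids:
            parents[c].add(p)

    def descendants(node):
        out, stack = set(), [node]
        while stack:
            for c in children[stack.pop()]:
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    def paths(current, visited):
        """All simple undirected paths from `current` to y."""
        if current == y:
            yield visited
            return
        for n in (set(children[current]) | parents[current]) - set(visited):
            yield from paths(n, visited + [n])

    for path in paths(x, [x]):
        blocked = False
        for i in range(1, len(path) - 1):
            a, v, b = path[i - 1], path[i], path[i + 1]
            if a in parents[v] and b in parents[v]:          # collider (head-to-head)
                if v not in evidence and not (descendants(v) & evidence):
                    blocked = True
                    break
            elif v in evidence:                              # chain or fork, observed
                blocked = True
                break
        if not blocked:
            return False        # found an active path
    return True                 # every path is blocked

# The six-node DAG from slide 14 and the sprinkler network:
g = {"x1": ["x2", "x3"], "x2": ["x4", "x6"], "x3": ["x5"], "x4": [], "x5": ["x6"], "x6": []}
sprinkler = {"C": ["S", "R"], "S": ["W"], "R": ["W"], "W": []}

print(d_separated(g, "x4", "x1", set()))            # False: dependent without evidence
print(d_separated(g, "x4", "x1", {"x2"}))           # True:  X4 is independent of X1 given X2
print(d_separated(sprinkler, "S", "R", {"C"}))      # True:  S and R independent given C
print(d_separated(sprinkler, "S", "R", {"C", "W"})) # False: explaining away once W is observed
```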
Three canonical GM’s
 Case I. Serial Connection (Markov Chain)
[Figure: X → Y → Z]

I = {X, Y, Z},  ν = { ∅, ∅, {X} },  so X ⊥ Z | Y

p(x, y, z) = p(x) p(y|x) p(z|y)

p(z | x, y) = p(x, y, z) / p(x, y) = p(x) p(y|x) p(z|y) / ( p(x) p(y|x) ) = p(z | y)
21
Three canonical GM’s
 Case II. Diverging Connection
[Figure: X ← Y → Z, i.e. Y is the common parent of X and Z]

I = {Y, X, Z},  ν = { ∅, ∅, {X} },  so X ⊥ Z | Y

Example: Shoe Size ← Age → Gray Hair
22
Three canonical GM’s
 Case III. Converging Connection
[Figure: X → Y ← Z, i.e. Y is the common child of X and Z]

I = {X, Z, Y},  ν = { ∅, {X}, ∅ },  so X ⊥ Z (marginally):

p(x, z) = Σ_y p(x, y, z) = Σ_y p(y | x, z) p(x) p(z) = p(x) p(z)

 Explaining away: X ⊥̸ Z | Y

Examples: Sprinkler → Lawn wet ← Rain;  Burglar → Alarm ← Earthquake
23
Bayes ball algorithm
 Bayes Ball bouncing rules
 Serial Connection
[Figures: X → Y → Z, shown with Y unobserved (the ball passes through) and with Y observed (the ball is blocked)]
 Diverging Connection
[Figures: X ← Y → Z, shown with Y unobserved (the ball passes through) and with Y observed (the ball is blocked)]
24
Bayes ball algorithm
 Converging Connection
[Figures: X → Y ← Z, shown with Y unobserved (the ball is blocked) and with Y observed (the ball passes through)]
 Boundary condition
[Figures: boundary-condition cases for two-node networks X, Y]
25
Bayes ball algorithm: Summary
26
Examples of Bayes Ball Algorithm
 Markov Chain

[Figure: X1 → X2 → X3 → X4 → X5]

{X2, X1} ⊥ {X4, X5} | X3
X1 ⊥ X5 | X3

[Figure: the six-node DAG. X1 → X2, X1 → X3, X2 → X4, X3 → X5, X2 → X6, X5 → X6]

X4 ⊥̸ {X1, X3}   (no evidence: the path through X2 is active)
X4 ⊥ {X1, X3} | X2
X4 ⊥ {X1, X3, X5, X6} | X2
27
Examples of Bayes Ball Algorithm
[Figure: the six-node DAG. X1 → X2, X1 → X3, X2 → X4, X3 → X5, X2 → X6, X5 → X6]

X1 ⊥ X6 | {X2, X3}
X4 ⊥ X5 | {X2, X3}
28
Examples of Bayes Ball Algorithm
[Figure: the six-node DAG. X1 → X2, X1 → X3, X2 → X4, X3 → X5, X2 → X6, X5 → X6]

X2 ⊥̸ X3 | {X1, X6}   (conditioning on X6 activates the collider X2 → X6 ← X5)
X4 ⊥̸ X5 | {X1, X6}
29
Examples of Bayes Ball Algorithm
[Figure: the DAG from slide 15. A → X, A → B, B → C, C → D, E → D, Z → E, D → Y]

X ⊥̸ Z | Y   (the path X ← A → B → C → D ← E ← Z is active: Y, a descendant of the collider D, is observed, and no other node on the path is observed)
30
Non-descendants
X ⊥ nd(X) | pa(X)
31
Markov blanket
 Markov blanket: Parents + children + children’s parents
32
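A node's Markov blanket can be computed directly from the child lists; the short sketch below (my own helper, following the definition on this slide) collects parents, children, and the children's other parents.

```python
def markov_blanket(children, node):
    """Parents + children + children's other parents of `node`,
    given `children`: a dict mapping each node to a list of its children."""
    parents = {n: set() for n in children}
    for p, kids in children.items():
        for c in kids:
            parents[c].add(p)
    blanket = set(parents[node]) | set(children[node])
    for child in children[node]:
        blanket |= parents[child]          # co-parents of each child
    blanket.discard(node)
    return blanket

# Sprinkler network: the Markov blanket of S is {C, W, R}.
sprinkler = {"C": ["S", "R"], "S": ["W"], "R": ["W"], "W": []}
print(markov_blanket(sprinkler, "S"))      # {'C', 'R', 'W'}
```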
The four types of inferences
 Note that these are just terms used to describe the type of
inference in various systems. Using a Bayesian network, we don't
need to distinguish between the types of reasoning it performs, i.e., it treats
everything as mixed.
33
Other uses of the 4 patterns
 Making decisions based on probabilities in the network and on the
agent's utilities.
 Deciding which additional evidence variables should be observed in
order to gain useful information.
 Performing sensitivity analysis to understand which aspects of the
model have the greatest impact on the probabilities of the query
variables (and therefore must be accurate).
 Explaining the results of probabilistic inference to the user.
34
Exact inference
Inference by enumeration (with the alarm example)

[Figure: Burglary (B) → Alarm (A), Earthquake (E) → A, A → JohnCalls (J), A → MaryCalls (M)]

P(B | j, m) = P(B, j, m) / P(j, m) = ( Σ_e Σ_a P(B, e, a, j, m) ) / P(j, m)

Σ_e Σ_a P(B, e, a, j, m) = Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)

Variable elimination by the distributive law:

Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a) = P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)

 In general, for a query variable Q, hidden variables H, and evidence e
(with Q ⊥ E | H, as in the chain Q, H, E shown on the slide):

P(Q | e) = P(Q, e) / P(e) = Σ_h P(Q, h, e) / P(e) = Σ_h P(Q | h, e) P(h | e) = Σ_h P(Q | h) P(h | e)
35
Exact inference
 Another example of variable elimination (the sprinkler network)

[Figure: C → S, C → R, S → W, R → W]

Pr(W=w) = Σ_c Σ_s Σ_r Pr(C=c, S=s, R=r, W=w)
        = Σ_c Σ_s Σ_r Pr(C=c) Pr(S=s | C=c) Pr(R=r | C=c) Pr(W=w | S=s, R=r)
        = Σ_c Pr(C=c) Σ_s Pr(S=s | C=c) Σ_r Pr(R=r | C=c) Pr(W=w | S=s, R=r)

T1(c, w, s) = Σ_r Pr(R=r | C=c) Pr(W=w | S=s, R=r)
T2(c, w) = Σ_s Pr(S=s | C=c) T1(c, w, s)
Pr(W=w) = Σ_c Pr(C=c) T2(c, w)

 Complexity of exact inference
 O(n) for a polytree (singly connected network): there exists at most
one undirected path between any two nodes in the network (e.g. the
alarm example)
 Multiply connected network: exponential time (e.g. the wet grass example)
36
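Here is a minimal Python sketch of that elimination order (my own illustration using the CPT numbers from slide 10): it materializes the intermediate factors T1 and T2 and recovers Pr(W=T) = 0.6471 without ever enumerating the full joint.

```python
# CPTs of the sprinkler network (from slide 10), with True/False states.
P_C = {True: 0.5, False: 0.5}
P_S = {False: {True: 0.5, False: 0.5}, True: {True: 0.1, False: 0.9}}   # P(S | C)
P_R = {False: {True: 0.2, False: 0.8}, True: {True: 0.8, False: 0.2}}   # P(R | C)
P_W_T = {(False, False): 0.0, (True, False): 0.9,                        # P(W=T | S, R)
         (False, True): 0.9,  (True, True): 0.99}

def P_W(w, s, r):
    return P_W_T[(s, r)] if w else 1.0 - P_W_T[(s, r)]

vals = [True, False]

# T1(c, w, s) = sum_r P(R=r | C=c) P(W=w | S=s, R=r)   (R eliminated)
T1 = {(c, w, s): sum(P_R[c][r] * P_W(w, s, r) for r in vals)
      for c in vals for w in vals for s in vals}

# T2(c, w) = sum_s P(S=s | C=c) T1(c, w, s)            (S eliminated)
T2 = {(c, w): sum(P_S[c][s] * T1[(c, w, s)] for s in vals)
      for c in vals for w in vals}

# Pr(W=w) = sum_c P(C=c) T2(c, w)                      (C eliminated)
P_W_marg = {w: sum(P_C[c] * T2[(c, w)] for c in vals) for w in vals}

print(f"Pr(W=T) = {P_W_marg[True]:.4f}")   # 0.6471, matching slide 12
```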
Exact inference
Clustering Algorithm
 To calculate posterior probabilities for all the variables: O(n²) even
for a polytree
 Cluster the network into a polytree and apply a constraint propagation
algorithm (refer to Chap 5, Constraint Satisfaction Problems, in the text)
 Widely used because of O(n)

[Figure: the clustered network. Cloudy → Spr+Rain → WetGrass, where Spr+Rain combines Sprinkler and Rain]

C | P(S=F) P(S=T)        C | P(R=F) P(R=T)
F |  0.5    0.5          F |  0.8    0.2
T |  0.9    0.1          T |  0.2    0.8

      P(S,R | C)
C |  FF   FT   TF   TT
F | .40  .10  .40  .10
T | .18  .72  .02  .08
37
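The clustered node's CPT is just the product of the two original CPTs, since S and R are conditionally independent given their only parent C. The sketch below (my own illustration with the numbers from the slide) reproduces the P(S,R|C) table above.

```python
# P(S | C) and P(R | C) from the sprinkler network.
P_S = {False: {True: 0.5, False: 0.5}, True: {True: 0.1, False: 0.9}}
P_R = {False: {True: 0.2, False: 0.8}, True: {True: 0.8, False: 0.2}}

# CPT of the clustered node Spr+Rain: P(S, R | C) = P(S | C) P(R | C),
# because S and R are conditionally independent given C.
P_SR = {c: {(s, r): P_S[c][s] * P_R[c][r]
            for s in (False, True) for r in (False, True)}
        for c in (False, True)}

for c in (False, True):
    row = "  ".join(f"{'T' if s else 'F'}{'T' if r else 'F'}={P_SR[c][(s, r)]:.2f}"
                    for s in (False, True) for r in (False, True))
    print(f"C={'T' if c else 'F'}: {row}")
# C=F: FF=0.40  FT=0.10  TF=0.40  TT=0.10
# C=T: FF=0.18  FT=0.72  TF=0.02  TT=0.08
```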
Hybrid (discrete + continuous) BNs
 Option 1: Discretization - possibly large errors, large CPTs
 Option 2: Finitely parameterized canonical form
 Continuous variable with discrete + continuous parents (e.g. Cost)

[Figure: Subsidy → Cost, Harvest → Cost, Cost → Buys]

P(C = c | H = h, S = true) = N(a_t h + b_t, σ_t²)(c)
  = ( 1 / (σ_t √(2π)) ) exp( -(1/2) ((c - (a_t h + b_t)) / σ_t)² )

 Discrete variable with continuous parents (e.g. Buys)
• Probit (cumulative normal pdf): more realistic but difficult to manage.
• Logit (sigmoid function): practical because of its simple derivative.
38
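A minimal sketch of these two canonical forms (with illustrative parameter values of my own; the slide gives none): a linear-Gaussian density for the continuous child Cost and a logit/sigmoid CPD for the discrete child Buys.

```python
import math

def linear_gaussian(c, h, a, b, sigma):
    """P(C = c | H = h, S fixed): Normal density with mean a*h + b and std sigma."""
    mu = a * h + b
    return math.exp(-0.5 * ((c - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def logit_cpd(cost, mu, scale):
    """P(Buys = true | Cost = cost): a sigmoid that decreases as the cost rises."""
    return 1.0 / (1.0 + math.exp((cost - mu) / scale))

# Assumed, made-up parameters: a harvest of 5 tons, cost around a*h + b = 10.
print(linear_gaussian(c=10.0, h=5.0, a=-1.0, b=15.0, sigma=2.0))  # density at the mean
print(logit_cpd(cost=8.0, mu=10.0, scale=1.0))   # cheap  -> high purchase probability
print(logit_cpd(cost=12.0, mu=10.0, scale=1.0))  # costly -> low purchase probability
```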
Simple BNs
 Notation
 Circles denote continuous r.v.'s,
 squares denote discrete r.v.'s,
 clear means hidden, and
 shaded means observed.
 Examples
 Principal component analysis / factor analysis

[Figures: graphical models for factor analysis (latent X1, ..., Xn with observed Y1, ..., Ym), FA/PCA (X → Y), and a mixture of FAs (with an additional discrete mixture indicator Q)]
39
Temporal (Dynamic) models
 Hidden Markov Models

[Figures: an HMM (hidden states Q1, Q2, Q3, Q4 emitting Y1, Y2, Y3, Y4) and an autoregressive HMM (with additional edges between successive observations)]

 Linear Dynamic Systems (LDSs) and the Kalman filter
 x(t+1) = A·x(t) + w(t),  w ~ N(0, Q),  x(0) ~ N(x0, V0)
 y(t) = C·x(t) + v(t),  v ~ N(0, R)

[Figures: the Kalman filter state-space model (hidden X1, X2 emitting Y1, Y2) and an autoregressive model AR(1) (X1 → X2)]
40
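As a concrete illustration of the LDS equations above, here is a minimal scalar Kalman filter sketch (my own example with made-up parameters A, C, Q, R, not taken from the slides) that alternates the predict and update steps.

```python
import random

# Scalar linear dynamical system: x(t+1) = A*x(t) + w, y(t) = C*x(t) + v.
A, C = 0.9, 1.0          # transition and observation coefficients (assumed)
Q, R = 0.1, 0.5          # process and observation noise variances (assumed)

def kalman_filter(ys, x0=0.0, V0=1.0):
    """Return the filtered means E[x(t) | y(1..t)] for a sequence of observations."""
    x, V = x0, V0
    means = []
    for y in ys:
        # Predict: propagate the previous estimate through the dynamics.
        x_pred = A * x
        V_pred = A * V * A + Q
        # Update: correct the prediction with the new observation y.
        K = V_pred * C / (C * V_pred * C + R)      # Kalman gain
        x = x_pred + K * (y - C * x_pred)
        V = (1 - K * C) * V_pred
        means.append(x)
    return means

# Simulate the same model and filter the noisy observations.
random.seed(0)
x_true, ys = 1.0, []
for _ in range(10):
    x_true = A * x_true + random.gauss(0, Q ** 0.5)
    ys.append(C * x_true + random.gauss(0, R ** 0.5))

print([round(m, 3) for m in kalman_filter(ys)])
```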
Approximate inference
 To avoid tremendous computation
 Monte Carlo Methods
 Variational methods
 Loopy junction graphs and loopy belief propagation
41
Monte Carlo Methods
 Direct sampling
 Generating random samples according to the CPTs
 Counting the # of samples matching the query
 Likelihood weighting
 To avoid rejecting samples, generate only events consistent with the
evidence and calculate the likelihood weight of each event
 Sum the weights for each value of the query variable, then normalize
 Example of P(R=T | S=T, W=T) with initial weight w = 1
• P(C) = (0.5, 0.5), return true
• Since S is evidence, w = w · P(S=T | C=T) = 0.1
• P(R | C=T) = (0.8, 0.2), return true
• Since W is evidence, w = w · P(W=T | S=T, R=T) = 0.1 · 0.99 = 0.099
• [t, t, t, t] with weight 0.099
42
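Below is a minimal likelihood-weighting sketch in Python (my own illustration with the CPTs from slide 10): each sample clamps the evidence S=T, W=T, weights the sample by the probability of that evidence, and the weighted counts estimate P(R=T | S=T, W=T), which should approach roughly 0.32.

```python
import random

# Sprinkler CPTs (from slide 10).
P_C = 0.5                                                     # P(C=T)
P_S = {False: 0.5, True: 0.1}                                 # P(S=T | C)
P_R = {False: 0.2, True: 0.8}                                 # P(R=T | C)
P_W = {(False, False): 0.0, (True, False): 0.9,               # P(W=T | S, R)
       (False, True): 0.9,  (True, True): 0.99}

def weighted_sample():
    """One likelihood-weighted sample with evidence S=T, W=T.
    Returns (value of R, weight)."""
    w = 1.0
    c = random.random() < P_C               # sample C from its prior
    w *= P_S[c]                             # S is evidence (S=T): multiply weight
    s = True
    r = random.random() < P_R[c]            # sample R given C
    w *= P_W[(s, r)]                        # W is evidence (W=T): multiply weight
    return r, w

def estimate(n=100_000):
    num = den = 0.0
    for _ in range(n):
        r, w = weighted_sample()
        den += w
        if r:
            num += w
    return num / den

random.seed(0)
print(f"P(R=T | S=T, W=T) ~= {estimate():.3f}   (exact value is about 0.320)")
```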
Monte Carlo Methods Examples
[Figure: the sprinkler network (Cloudy, Sprinkler, Rain, WetGrass) with the CPTs from slide 10: P(C), P(S|C), P(R|C), P(W|S,R)]

Direct sampling for P(R=T): sampled events (C, S, R, W), e.g. T T F T, T F T F, F F F T, T T F F

Likelihood weighting for P(R=T | S=T, W=T): with S=T and W=T fixed, sampled (C, R) pairs with their weights (the slide lists weights 0.099, 0.001, 0.10)
43
Markov Chain Monte Carlo methods
 MCMC (Gibbs sampling) algorithm
 Repeatedly sample one of the nonevidence variables Xi,
conditioned on the current values of the variables in the Markov
blanket of Xi
 Example: to calculate P(R | S=T, W=T)
 Repeat the following steps with the initial state [C S R W] = [T T F T]
 Sample from P(C | S=T, R=F) = P(C, S=T, R=F) / P(S=T, R=F), return F
 Sample from P(R | C=F, S=T, W=T), return T

[Figure: the sprinkler network C → S, C → R, S → W, R → W, and the chain of sampled states X1, X2, X3, X4, X5]
44
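A minimal Gibbs-sampling sketch of this procedure (my own illustration with the CPTs from slide 10): it alternately resamples C and R from their full conditionals given the clamped evidence S=T, W=T, and the fraction of samples with R=T again approaches about 0.32.

```python
import random

# Sprinkler CPTs (from slide 10).
P_C = 0.5                                                     # P(C=T)
P_S = {False: 0.5, True: 0.1}                                 # P(S=T | C)
P_R = {False: 0.2, True: 0.8}                                 # P(R=T | C)
P_W = {(False, False): 0.0, (True, False): 0.9,               # P(W=T | S, R)
       (False, True): 0.9,  (True, True): 0.99}

def bernoulli(p_true, p_false):
    """Sample True with probability p_true / (p_true + p_false)."""
    return random.random() < p_true / (p_true + p_false)

def gibbs(n=100_000, burn_in=1_000):
    """Estimate P(R=T | S=T, W=T); S and W stay clamped to their observed values."""
    c, r, s = True, False, True                  # initial state [C S R W] = [T T F T]
    count = 0
    for t in range(n + burn_in):
        # Resample C given its Markov blanket {S, R}: P(C) P(S=T|C) P(R=r|C).
        c = bernoulli(P_C * P_S[True] * (P_R[True] if r else 1 - P_R[True]),
                      (1 - P_C) * P_S[False] * (P_R[False] if r else 1 - P_R[False]))
        # Resample R given its Markov blanket {C, S, W}: P(R|C) P(W=T|S=T,R).
        r = bernoulli(P_R[c] * P_W[(s, True)],
                      (1 - P_R[c]) * P_W[(s, False)])
        if t >= burn_in and r:
            count += 1
    return count / n

random.seed(0)
print(f"P(R=T | S=T, W=T) ~= {gibbs():.3f}   (exact value is about 0.320)")
```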
Learning from data
 To estimate parameters
 Data observability
 Model structure, if it is unknown

                | Known Structure            | Unknown Structure
Complete Data   | Statistical parametric     | Model selection;
                | estimation: MLE, MAP       | discrete optimization
Incomplete Data | Parametric optimization;   | Combined: EM +
                | EM algorithm               | model selection
45
Summary
 Bayesian networks are a natural way to represent conditional independence
information.
 A Bayesian network is a complete and compact representation of the joint
probability distribution for the domain.
 Inference in Bayesian networks means computing the probability distribution
of a set of query variables, given a set of evidence variables.
 Bayesian networks can reason causally, diagnostically, in mixed mode, or
intercausally. No other uncertain reasoning mechanism can handle all these
modes.
 The complexity of Bayesian network inference depends on the network
structure. In polytrees the computation time is linear in the size of the network.
 With large and highly connected graphical models, an exponential blowup in the
number of computations for exact inference occurs.
 Given the intractability of exact inference in large multiply connected
networks, it is essential to consider approximate inference methods such as
Monte Carlo methods and MCMC.
46
References
 http://www.cs.sunysb.edu/~liu/cse352/#handouts
 http://sern.ucalgary.ca/courses/CPSC/533/W99/presentations/L2_15B_
Griscowsky_Kainth/
 Jeff Bilmes, Introduction to graphical model, lecture note
 http://www.ai.mit.edu/~murphyk/Bayes/bayes_tutorial.pdf
 Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern
Approach, Prentice Hall
47