Transcript old-01

‫‪Introduction to Bayesian Networks‬‬
‫‪Instructor: Dan Geiger‬‬
‫על מה המהומה ? נישואים מאושרים בין תורת ההסתברות ותורת הגרפים‪.‬‬
‫הילדים המוצלחים‪ :‬אלגוריתמים לגילוי תקלות‪ ,‬קודים לגילוי שגיאות‪,‬‬
‫מודלים למערכות מורכבות‪ .‬שימושים במגוון רחב של תחומים‪.‬‬
‫קורס ממבט אישי למדי על נושא מחקריי לאורך שנים רבות‪.‬‬
‫‪Web page: www.cs.technion.ac.il/~dang/courseBN‬‬
‫‪http://webcourse.cs.technion.ac.il/236372‬‬
‫‪Email: [email protected]‬‬
‫‪Phone: 829 4339‬‬
‫‪Office: Taub 616.‬‬
‫‪.‬‬
What is it all above ?
•How to use graphs to represent probability distributions
over thousands of random variables ?
•How to encode conditional independence in directed and
undirected graphs ?
•How to use such representations for efficient computations
of the probability of events of interest ?
•How to learn such models from data ?
2
Course Information
Meetings:
 Lecture: Wednesdays 10:30 –12:30
 Tutorial: Wednesdays 16:30 – 17:30
Grade:
 30% in 5 question sets. These questions sets are obligatory. Each contains mostly
theoretical problems. Submit in pairs before due time (three weeks).
 40% one or two hours lecture (Priority to graduate students).
 10% Attending at least 12 lectures and recitation classes for a **passing grade**.
(In very special circumstances 2% per missing item).
 20% Checking HMWs and presentation of HMW solutions




Prerequisites:
Data structure 1 (cs234218)
Algorithms 1 (cs234247)
Probability (any course)
Information and handouts:

http://webcourse.cs.technion.ac.il/236372

http://www.cs.technion.ac.il/~dang/courseBN/ (Only lecture slides)
3
Relations to Some Other Courses
.‫אמור לי מי חבריך ואומר לך מי אתה‬
 Introduction
to Artificial Intelligence (cs236501)
 Introduction to Machine Learning (cs236756)
 Introduction to Neural Networks (cs236950)
 Algorithms in Computational Biology (cs236522)
 Error correcting codes
 Data mining
4
= Student lectures (8)
= TENTATIVE Student lectures (7)
5

Mathematical Foundations (4 weeks including students’ lectures, based on Pearl’s
Chapter 3 + papers).
1.
Properties of Conditional Independence (Soundness and completeness of marginal
independence, graphoid axioms and their interpretation as “irrelevance”,
incompleteness of conditional independence, no disjunctive axioms possible.)
2.
Properties of graph separation (Paz and Pearl 85, Theorem 3), soundness and
completeness of saturated independence statements. Undirected Graphs as I-maps
of probability distributions. Markov-Blankets, Pairwise independence basis.
Representation theorems (Pearl and Paz, from each basis to I-maps). Markov
networks, HC representation theorem, Completeness theorem. Markov chains
3.
Bayesian Networks, d-separation, Soundness, Completeness.
4.
Chordal Graphs as the intersection of BN and Markov networks. Equivalence of their
4 definitions.

Combinatorial Optimiziation of Exact Inference in Graphical models (3 weeks including
students lectures).
1.
HMMs
2.
Exact inference and their combinatorial optimization.
3.
Clique tree algorithm. Conditioning.
4.
Tree-width. Feedback Vertex Set.

Learning (5 weeks including students lectures).
1.
Introduction to Bayesian statistics
2.
Learning Bayesian networks
3.
Chow and Liu’s algorithm; the TAN model.
4.
Structural EM
5.
Searching for Bayesian networks

Applications (2 weeks including student lectures).
6
Homeworks
• HMW #1. Read Chapter 3.1 & 3.2.1. Answer Questions 3.1,
3.2, Prove Eq 3.5b, and fully expand/fix the proof of Theorem 2.
Submit in pairs no later than 28/10/09 (Two weeks from now).
•HMW #2. Read Chapter 2 in Pearl’s book and answer 6
questions of choice at the end. Submit in pairs no later than
11/11/09.
• HMW #3. Read fully Chapter 3. Answer additional 5
questions of choice at the end. Submit in pairs no later than
2/12/09.
• HMW #4. Submit in pairs 23/12/09
• HMW #5. Submit in pairs 6/1/10
Pearl’s book contains all the notations that I happen not to define
in these slides – consult it often – it is also a very unique and
interesting classic text book.
7
The Traditional View of Probability
in Text Books
• Probability theory provides the impression that we need to
literally represent a joint distribution explicitly as P(x1,…,xn)
on all propositions and their combinations. It is consistent and
exhaustive.
• This representation stands in sharp contrast to human
reasoning: It requires exponential computations to compute
marginal probabilities like P(x1) or conditionals like P(x1|x2).
•Humans judge pairwise conditionals swiftly while
conjunctions are judged hesitantly. Numerical calculations do
not reflect simple reasoning tasks.
8
The Traditional View of Probability
in Text Books
Given ?
Computed ?
Estimated or Computed ?
P(e | h)
Given ?
.‫ כחום גבוה‬e ‫ כמחלה ויראלית נדירה ועל‬h ‫חישבו על‬
9
The Qualitative Notion of
Dependence
•Marginal independence is defined numerically as P(x,y)=P(x) P(y).
The truth of this equation is hard to judge by humans, while judging
whether X and Y are dependent is often easy.
•“Burglary within a day” and “nuclear war within five years”
• Likewise, people tend to judge
x
Y
Z
10
•The notions of relevance and dependence are far more basic
than the numerical values. In a resonating system it should be
asserted once and not be sensitive to numerical changes.
•Acquisition of new facts may destroy existing dependencies
as well as creating new once.
• Learning child’s age Z destroys the dependency between
height X and reading ability Y.
•Learning symptoms Z of a patient creates dependencies
between the diseases X and Y that could account for it.
Probability theory provides in principle such a device via
P(X | Y, K) = P(X |K)
But can we model the dynamics of dependency changes based
on logic, without reference to numerical quantities ?
11
Definition of Marginal Independence
Definition: IP(X,Y) iff for all xDX and yDy
Pr(X=x, Y=y) = Pr(X=x) Pr(Y=y)
Comments:
 Each Variable X has a domain DX with value (or state) x in DX.
 We often abbreviate via P(x, y) = P(x) P(y).
 When Y is the emptyset, we get Pr(X=x) = Pr(X=x).
 Alternative notations to IP(X,Y) such as: I(X,Y) or XY
 Next few slides on properties of marginal independence are based
on “Axioms and algorithms for inferences involving probabilistic
independence.”
12
Properties of Marginal Independence
Trivial Independence: Ip(X,)
Symmetry: Ip(X,Y)  Ip(Y,X)
Decomposition: Ip(X,YW)  Ip(X,Y)
Mixing: Ip(X,Y) and Ip(XY,W)  Ip(X,YW)
Proof (Soundness).
Trivial independence and Symmetry follow from the definition.
Decomposition: Given P(x,y,w) = P(x) P(y,w), simply sum over w
on both sides of the given equation.
Mixing: Given: P(x,y) = P(x) P(y) and P(x,y,w) = P(x,y) P(w).
Hence, P(x,y,w) = P(x) P(y) P(w) = P(x) P(y,w).
13
Properties of Marginal Independence
Are there more such independent properties of independence ?
No. There are none !
Horn axioms are of the form 1 & … & n   where each
statement  stands for an independence statement. We use the symbol
 for a set of independence statements.
Namely:  is derivable from  via these properties if and only if  is
entailed by  (i.e.,  holds in all probability distributions that satisfy
).
Put differently: For every set  and a statement  not derivable from
, there exists a probability distribution P that satisfies S and not s.
14
Properties of Marginal Independence
Can we use these properties to infer a new independence
statements  from a set of given independence statements  in
polynomial time ?
YES. The “membership algorithm” and completeness proof in
Recitation class (Paper P2).
Comment. The question “does  entail ” could in principle be
undecidable, drops to being decidable via a complete set of
axioms, and then drops to polynomial with this claim.
15
Properties of Marginal Independence
we check consistency of a set + independence plus a
set - of negated independence statements ?
Can
The membership algorithm in previous slide applies only
for G that includes one negated statement – simply use the
algorithm to check that it is not entailed from + .
But another property of independence called “Amstrong
Relation” guarantees that consistency is indeed verified by
checking separately (in isolation) that each statement in is not entailed from + .
16
Properties of Marginal Independence
We define a “product”  of two distributions P1 and P2 over a set of n
variables X1,…,Xn as follows:
P(a11 ,…, ann)  P1(a1 ,…, an)  P2(1 ,…, n)
Note that the domains of Xi change in this definition !
This definition implies that for every independence statement :
 hold in P1 and  holds in P2 if and only if  holds in P*.
Hence, consistency can be checked statement by statement.
17
Definitions of Conditional Independence
Ip(X,Z,Y) if and only if whenever P(Y=y,Z=z) >0 (3.1)
Pr(X=x | Y=y , Z=z) = Pr(X=x |Z=z)
18
Properties of Conditional Independence
Same Properties of conditional independence:
Symmetry: I(X,Z,Y)  I(Y,Z,X)
Decomposition: I(X,Z,YW)  I(X,Z,Y)
Mixing: I(X,Z,Y) and I(XY,Z,W)  I(X,Z,YW)
BAD NEWS. Are there more properties of independence ?
Yes, infinitely many independent Horn axioms. No answer to
the membership problem, nor to the consistency problem.
19
Important Properties of Conditional
Independence
Recall there are some more notations.
20
Dependency Models – abstraction of
Probability distributions
21
Graphical Interpretation
22
The Intersection Property Revisited
Intersection: I(X,ZW,Y) and I(X,ZY,W)  I(X,Z,YW)
This property holds for positive distributions.
Counter example: X=Y=W, Z is the emptyset.
A set  of independence statements I(x,y,z) (namely, a dependency model)
that satisfies Symmetry, Decomposition, Weak Union, Contraction, and
Intersection is called a graphoid, and without intersection is called semigraphoid.
23