
Still More Uncertainty
(slides © Daniel S. Weld)
Administrivia
• Read Ch 14, 15
Axioms of Probability Theory
• Just 3 are enough to build entire theory!
1. Range of probabilities: 0 ≤ P(A) ≤ 1
2. P(true) = 1 and P(false) = 0
3. Probability of disjunction of events is:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
[Venn diagram: events A and B overlapping (A ∧ B) inside the universe "True"]
Bayes Rules!
P(H | E) = P(E | H) P(H) / P(E)
Simple proof from the definition of conditional probability
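As a quick numeric sanity check of the rule (the prior and likelihood numbers below are made up for illustration, not from the slides):

```python
# Hypothetical numbers: P(H) = 0.01, P(E | H) = 0.9, P(E | ~H) = 0.2.
p_h = 0.01
p_e_given_h = 0.9
p_e_given_not_h = 0.2

# Total probability: P(E) = P(E|H) P(H) + P(E|~H) P(~H).
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes' rule: P(H|E) = P(E|H) P(H) / P(E).
print(p_e_given_h * p_h / p_e)   # ~0.043
```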
Independence
P(AB) = P(A)P(B)
Conditional Independence
A&B not independent, since P(A|B) < P(A)
Conditional Independence
But: A & B are made conditionally independent given C:
P(A | C) = P(A | B, C)
[Venn diagram: events A, B, and C, with overlaps A ∧ B, A ∧ C, and B ∧ C]
Joint Distribution
• All you need to know
• Can answer any question: inference by enumeration (a minimal sketch follows below)
• But… exponential in both time & space
• Solution: exploit conditional independence
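As referenced above, here is a minimal sketch of inference by enumeration over an explicit joint distribution. The two-variable joint and the helper function are invented for illustration; they are not from the slides.

```python
# Hypothetical joint distribution over two boolean variables (A, B).
joint = {
    (True,  True):  0.12, (True,  False): 0.18,
    (False, True):  0.28, (False, False): 0.42,
}

def prob_true(var_index, evidence, joint):
    """P(X = True | evidence) by summing the matching joint entries."""
    def matches(world):
        return all(world[i] == v for i, v in evidence.items())
    numer = sum(p for w, p in joint.items() if matches(w) and w[var_index])
    denom = sum(p for w, p in joint.items() if matches(w))
    return numer / denom

print(prob_true(0, {}, joint))          # P(A)     = 0.30
print(prob_true(0, {1: True}, joint))   # P(A | B) = 0.12 / 0.40 = 0.30
```

Every query touches the whole table, which is exactly what makes plain enumeration exponential in the number of variables.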
Sample Bayes Net
[Network structure: Earthquake → Radio; Earthquake → Alarm; Burglary → Alarm; Alarm → Nbr1Calls; Alarm → Nbr2Calls]

Pr(B=t) = 0.05, Pr(B=f) = 0.95

Pr(A=t | E, B)  (Pr(A=f | E, B) in parentheses):
e, b:   0.9  (0.1)
e, ¬b:  0.2  (0.8)
¬e, b:  0.85 (0.15)
¬e, ¬b: 0.01 (0.99)
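One way the CPTs above could be written down in code (a sketch; the dictionary layout is my own, and only the numbers shown on the slide are used):

```python
# P(Burglary = True); the complement 0.95 is implicit.
P_B = 0.05

# P(Alarm = True | Earthquake, Burglary), keyed by (E, B),
# following the row order reconstructed above.
P_A_given_EB = {
    (True,  True):  0.90,
    (True,  False): 0.20,
    (False, True):  0.85,
    (False, False): 0.01,
}

# P(Alarm = False | E, B) is just the complement.
P_notA_given_EB = {k: 1 - v for k, v in P_A_given_EB.items()}
```

The CPTs for Earthquake, Radio, and the neighbor-call variables are not given on this slide, so they are omitted here.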
Given Parents, X is Independent of Non-Descendants
Given Markov Blanket, X is Independent of All Other Nodes
MB(X) = Par(X) ∪ Children(X) ∪ Par(Children(X))
[Example network with nodes including Rained and Grass Wet, illustrating the Markov blanket of a node]
Galaxy Quest
• Two astronomers, in Chile & Alaska, measure (M1, M2) the number (N) of stars.
• Normally there is a chance of error, e, of under- or over-counting by 1 star.
• But sometimes, with probability f, a telescope can be out of focus (events F1 and F2), in which case the affected scientist will undercount by 3 stars.
Structures
[Figure: three candidate network structures over the variables F1, F2, N, M1, and M2]
Inference in BNs
• We generally want to compute
  Pr(X), or
  Pr(X | E), where E is (conjunctive) evidence
• The graphical independence representation
  yields efficient inference:
  computations are organized by the network topology
• Two simple algorithms:
  Variable elimination (VE)
  Junction trees
Variable Elimination Procedure
• The initial potentials are the CPTs in the BN
• Repeat until only the query variable remains:
  1. Choose another variable to eliminate
  2. Multiply all potentials that contain the variable
  3. If there is no evidence for the variable, sum the variable out and replace the original potentials with the new result
  4. Else, remove the variable based on the evidence
• Normalize the remaining potential to get the final distribution over the query variable
(A sketch of the multiply and sum-out operations follows below.)
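As referenced above, here is a minimal sketch of the two factor operations that variable elimination repeats: pointwise multiplication and summing a variable out. The factor representation (a tuple of variable names plus a table keyed by boolean assignments) is my own, not from the slides.

```python
from itertools import product

def multiply(f1, f2):
    """Pointwise product of two factors given as (vars_tuple, table)."""
    vars1, t1 = f1
    vars2, t2 = f2
    new_vars = vars1 + tuple(v for v in vars2 if v not in vars1)
    table = {}
    for assign in product([True, False], repeat=len(new_vars)):
        a = dict(zip(new_vars, assign))
        table[assign] = (t1[tuple(a[v] for v in vars1)] *
                         t2[tuple(a[v] for v in vars2)])
    return new_vars, table

def sum_out(factor, var):
    """Eliminate var from a factor by summing over its values."""
    vars_, t = factor
    keep = tuple(v for v in vars_ if v != var)
    table = {}
    for assign, val in t.items():
        a = dict(zip(vars_, assign))
        key = tuple(a[v] for v in keep)
        table[key] = table.get(key, 0.0) + val
    return keep, table

# Tiny check: P(B) from P(A) and P(B | A) by eliminating A.
fA  = (("A",), {(True,): 0.3, (False,): 0.7})
fBA = (("B", "A"), {(True, True): 0.9, (True, False): 0.2,
                    (False, True): 0.1, (False, False): 0.8})
print(sum_out(multiply(fA, fBA), "A"))   # P(B=t) = 0.9*0.3 + 0.2*0.7 = 0.41
```

Evidence handling (step 4) would simply restrict each table to the rows consistent with the observed value before these operations are applied.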
Junction Tree Inference Algorithm
• Incorporate Evidence: for each evidence variable, go to one table that includes that variable;
  set to 0 all entries that disagree with the evidence,
  then renormalize this potential
• Upward Step: Pass message to parents
• Downward Step: Pass message to children
Learning
• Parameter Estimation:
Maximum Likelihood (ML)
Maximum A Posteriori (MAP)
Bayesian
• Learning Parameters for a Bayesian Network
• Learning Structure of Bayesian Networks
Topics
• Bayesian networks overview
• Inference
Variable elimination
Junction trees
• Parameter Estimation:
Maximum Likelihood (ML)
Maximum A Posteriori (MAP)
Bayesian
• Learning Parameters for a Bayesian Network
• Learning Structure of Bayesian Networks
Coin Flip
Which coin will I use?
C1: P(H|C1) = 0.1, P(C1) = 1/3
C2: P(H|C2) = 0.5, P(C2) = 1/3
C3: P(H|C3) = 0.9, P(C3) = 1/3
Uniform prior: all hypotheses are equally likely before we make any observations
Your Estimate?
What is the probability of heads after two experiments?
Most likely coin: C2
Best estimate for P(H): P(H|C2) = 0.5
Maximum Likelihood Estimate
The best hypothesis that fits the observed data, assuming a uniform prior
After 2 experiments: heads, tails
Most likely coin: C2 (P(H|C2) = 0.5, P(C2) = 1/3)
Best estimate for P(H): P(H|C2) = 0.5
Using Prior Knowledge
• Should we always use a uniform prior?
• Background knowledge:
  Heads => we have a take-home midterm
  Dan likes take-homes…
  => Dan is more likely to use a coin biased in his favor
New prior:
C1: P(H|C1) = 0.1, P(C1) = 0.05
C2: P(H|C2) = 0.5, P(C2) = 0.25
C3: P(H|C3) = 0.9, P(C3) = 0.70
Maximum A Posteriori (MAP) Estimate
The best hypothesis that fits the observed data, assuming a non-uniform prior
Most likely coin: C3 (P(H|C3) = 0.9, P(C3) = 0.70)
Best estimate for P(H): P(H|C3) = 0.9
Bayesian Estimate
Minimizes prediction error, given the data and (generally) assuming a non-uniform prior
Posterior over coins after observing heads, tails:
P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485
Estimate: P(H) = Σi P(H|Ci) P(Ci|HT) = 0.1·0.035 + 0.5·0.481 + 0.9·0.485 ≈ 0.680
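To spell out where those numbers come from, here is a small sketch that reproduces the posterior over coins and the Bayesian estimate, using the non-uniform prior from the previous slide (the code itself is mine, not from the slides):

```python
# Non-uniform prior and per-coin heads probabilities from the slides.
prior   = {"C1": 0.05, "C2": 0.25, "C3": 0.70}
p_heads = {"C1": 0.1,  "C2": 0.5,  "C3": 0.9}

# Likelihood of observing heads then tails under each coin.
likelihood = {c: p_heads[c] * (1 - p_heads[c]) for c in prior}

# Posterior P(coin | H, T) by Bayes' rule.
z = sum(likelihood[c] * prior[c] for c in prior)
posterior = {c: likelihood[c] * prior[c] / z for c in prior}
print(posterior)   # ~{'C1': 0.035, 'C2': 0.481, 'C3': 0.485}

# Bayesian estimate: expected P(heads) under the posterior.
print(sum(p_heads[c] * posterior[c] for c in prior))   # ~0.680
```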
Summary For Now
Maximum Likelihood Estimate: uniform prior; uses the most likely hypothesis
Maximum A Posteriori Estimate: any prior; uses the most likely hypothesis
Bayesian Estimate: any prior; uses a weighted combination of hypotheses
Topics
• Parameter Estimation:
Maximum Likelihood (ML)
Maximum A Posteriori (MAP)
Bayesian
• Learning Parameters for a Bayesian Network
• Learning Structure of Bayesian Networks
Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...

We have: the Bayes net structure and observations.
We need: the Bayes net parameters.
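Given fully observed data like the table above, maximum likelihood parameters are just counts. A minimal sketch (the tuple encoding of the five rows is mine):

```python
# Rows are (E, B, R, A, J, M) observations from the table above.
data = [
    (True,  False, True,  True,  False, True),
    (False, False, False, False, False, True),
    (False, True,  False, True,  True,  True),
    (False, False, False, True,  True,  True),
    (False, True,  False, False, False, False),
]

# ML estimate of P(B): fraction of rows with B = True.
print(sum(row[1] for row in data) / len(data))        # 0.4

# ML estimate of P(A | ~E, B): restrict to matching rows, then count A.
rows = [r for r in data if not r[0] and r[1]]
print(sum(r[3] for r in rows) / len(rows))            # 0.5
```

With only five rows these estimates are very noisy, which is exactly why the following slides add a prior.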
Parameter Estimation and Bayesian Networks

[Same data table as above]

P(B) = ?
Prior + data = posterior
[Figure: a prior density over P(B), and the posterior obtained after incorporating the data]
Now compute either the MAP or the Bayesian estimate.
What Prior to Use?
• The following are two common priors
• Binary variable: Beta
  Conjugate to the binomial likelihood, so the posterior is again a Beta
  Easy to compute the posterior
• Discrete variable: Dirichlet
  Conjugate to the multinomial likelihood, so the posterior is again a Dirichlet
  Easy to compute the posterior
One Prior: Beta Distribution
Beta(p; a, b) = Γ(a+b) / (Γ(a) Γ(b)) · p^(a−1) (1−p)^(b−1)
For any positive integer y, Γ(y) = (y−1)!
Beta Distribution
• Example: flip a coin with a Beta distribution as the prior over p [prob(heads)]
  1. Parameterized by two positive numbers: a, b
  2. Mean of the distribution, E[p], is a/(a+b)
  3. Specify our prior belief for p as a/(a+b)
  4. Specify confidence in this belief with high initial values for a and b
• Updating our prior belief based on data:
  increment a for every heads outcome
  increment b for every tails outcome
• So after h heads (and n − h tails) out of n flips, the posterior is Beta(a+h, b+n−h), and our estimate becomes P(heads) = (a+h)/(a+b+n)
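A minimal sketch of this update rule in code (the starting prior Beta(2, 3) and the flip sequence are made-up values for illustration):

```python
# Beta prior update for a coin: increment a on heads, b on tails.
a, b = 2.0, 3.0          # hypothetical prior, mean a / (a + b) = 0.4
flips = "HTHHT"          # hypothetical data: 3 heads, 2 tails

for outcome in flips:
    if outcome == "H":
        a += 1
    else:
        b += 1

# Posterior is Beta(a + h, b + t); its mean is the updated estimate of P(heads).
print(a, b)              # 5.0 5.0
print(a / (a + b))       # 0.5, i.e. (2 + 3) / (2 + 3 + 5)
```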
Parameter Estimation and Bayesian Networks

[Same data table as above]

P(B | data) = ?
Prior: Beta(1,4), i.e. prior P(B) = 1/(1+4) = 20% with equivalent sample size 5
Beta(1,4) + data (2 rows with B, 3 rows with ¬B) = Beta(3,7)
Posterior: P(B) = 0.3, P(¬B) = 0.7
Parameter Estimation and Bayesian Networks

[Same data table as above]

P(A | E, B) = ?
P(A | E, ¬B) = ?
P(A | ¬E, B) = ?
P(A | ¬E, ¬B) = ?
Parameter Estimation and Bayesian Networks

[Same data table as above]

P(A | E, B) = ?
P(A | E, ¬B) = ?
P(A | ¬E, B) = ?   Prior: Beta(2,3)
P(A | ¬E, ¬B) = ?
Parameter Estimation and Bayesian Networks

[Same data table as above]

P(A | E, B) = ?
P(A | E, ¬B) = ?
P(A | ¬E, B) = ?   Prior Beta(2,3) + data (1 matching row with A, 1 with ¬A) = Beta(3,4)
P(A | ¬E, ¬B) = ?
What if we don’t know the structure?

Learning The Structure of Bayesian Networks
• Search through the space…
  of possible network structures!
  (for now, assume we observe all variables)
• For each structure, learn parameters
• Pick the one that fits the observed data best
  Caveat: won’t we end up fully connected?
  When scoring, add a penalty proportional to model complexity
Learning The Structure of Bayesian Networks
• Search through the space
• For each structure, learn parameters
• Pick the one that fits the observed data best
  Penalize complex models
• Problem?
  Exponential number of networks!
  And we need to learn parameters for each!
  Exhaustive search is out of the question!
  So what now?
Structure Learning as Search
• Local Search
  1. Start with some network structure
  2. Try to make a change (add, delete, or reverse an edge)
  3. See if the new network is any better
• What should the initial state be?
  Uniform prior over random networks?
  Based on prior knowledge?
  Empty network?
• How do we evaluate networks? (see the score functions below, and the sketch that follows them)
[Figure: neighboring network structures over variables A–E, reached by adding, deleting, or reversing single edges]
Score Functions
• Bayesian Information Criterion (BIC)
  log P(D | BN) − penalty
  Penalty = ½ (# parameters) log(# data points)
• MAP score
  P(BN | D) ∝ P(D | BN) P(BN)
  P(BN) must decay exponentially with the number of parameters for this to work well
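As referenced on the search slide, here is a rough sketch of greedy structure search guided by a BIC score. It makes several simplifying assumptions that are mine, not the slides': binary variables, fully observed data given as dicts, a fixed variable ordering (so no cycles can arise and edge reversal is skipped), and Laplace-smoothed parameter estimates.

```python
import math
from itertools import product

def bic_score(parents, data, variables):
    """log P(D | BN) minus 0.5 * (number of parameters) * log(number of data points)."""
    n = len(data)
    loglik, n_params = 0.0, 0
    for x in variables:
        pa = parents[x]
        n_params += 2 ** len(pa)          # one free parameter per parent setting
        for pa_vals in product([True, False], repeat=len(pa)):
            rows = [r for r in data
                    if all(r[p] == v for p, v in zip(pa, pa_vals))]
            if not rows:
                continue
            k = sum(r[x] for r in rows)   # rows where x is True
            theta = (k + 1) / (len(rows) + 2)          # smoothed estimate
            loglik += (k * math.log(theta) +
                       (len(rows) - k) * math.log(1 - theta))
    return loglik - 0.5 * n_params * math.log(n)

def local_search(variables, data):
    """Hill-climb over single-edge additions/deletions, starting from the empty net."""
    parents = {x: () for x in variables}
    rank = {x: i for i, x in enumerate(variables)}
    best = bic_score(parents, data, variables)
    improved = True
    while improved:
        improved = False
        for x, y in product(variables, repeat=2):
            if x == y or rank[x] >= rank[y]:
                continue                  # only consider edges x -> y along the ordering
            cand = dict(parents)
            if x in parents[y]:
                cand[y] = tuple(p for p in parents[y] if p != x)   # delete x -> y
            else:
                cand[y] = parents[y] + (x,)                        # add x -> y
            score = bic_score(cand, data, variables)
            if score > best:
                parents, best, improved = cand, score, True
    return parents, best

# Example usage, with each data row given as a dict {variable_name: bool}:
#   structure, score = local_search(("E", "B", "A"), data_rows)
```

Real implementations add edge reversal, random restarts, and caching of family scores, but the skeleton above follows the search-plus-score decomposition described in the slides.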
Naïve Bayes
[Figure: a Class Value node with children F1, F2, F3, …, FN−2, FN−1, FN]
Assumes the features are conditionally independent given the class variable
Works well in practice
Forces probabilities towards 0 and 1
(a small prediction sketch follows below)
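A minimal sketch of naive Bayes prediction under that independence assumption. The two-class, two-feature numbers are invented for illustration:

```python
# P(class | features) is proportional to P(class) * prod_i P(F_i | class).
def predict(x, p_class, p_feat):
    scores = {}
    for c in p_class:
        s = p_class[c]
        for i, xi in enumerate(x):
            p = p_feat[c][i]              # P(F_i = True | class = c)
            s *= p if xi else (1 - p)
        scores[c] = s
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

p_class = {"spam": 0.4, "ham": 0.6}                    # hypothetical priors
p_feat  = {"spam": [0.8, 0.6], "ham": [0.1, 0.3]}      # hypothetical CPTs
print(predict((True, False), p_class, p_feat))         # spam ~0.75, ham ~0.25
```

The normalization step is also where the "probabilities pushed towards 0 and 1" effect shows up: multiplying many conditionally independent factors tends to make one class dominate.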
Tree Augmented Naïve Bayes (TAN)
[Friedman, Geiger & Goldszmidt 1997]
[Figure: a Class Value node with children F1, …, FN, plus tree-structured dependencies among the features]
Models a limited set of dependencies
Guaranteed to find the best structure (within the TAN family)
Runs in polynomial time