Lecture 21 of 42
Introduction to Bayesian Networks
Friday, 02 March 2007
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.kddresearch.org/Courses/Spring-2007/CIS732
Readings:
Section 6.11, Mitchell
Chapter 15, Russell and Norvig
“Bayesian Networks Without Tears”, Charniak
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Lecture Outline
•  Readings: 6.11, Mitchell; Chapter 15, Russell and Norvig; Charniak Tutorial
•  Suggested Exercises: 6.6, Mitchell; 15.2, Russell and Norvig
•  This Week’s Review: “A Theory of Inferred Causation”, Pearl and Verma
•  Graphical Models of Probability
   – Bayesian networks: introduction
      • Definition and basic principles
      • Conditional independence and causal Markovity
   – Inference and learning using Bayesian networks
      • Acquiring and applying distributions (conditional probability tables)
      • Learning tree dependent distributions and polytrees
•  Learning Distributions for Networks with Specified Structure
   – Gradient learning
   – Maximum weight spanning tree (MWST) algorithm for tree-structured networks
•  Reasoning under Uncertainty: Applications and Augmented Models
•  Next Lecture: (More on) Learning Bayesian Network Structure
Bayesian Classification
•  Framework
   – Find most probable classification (as opposed to MAP hypothesis)
   – f: X → V (domain = instance space, range = finite set of values)
   – Instances x ∈ X can be described as a collection of features x = (x1, x2, …, xn)
   – Performance element: Bayesian classifier
      • Given: an example (e.g., Boolean-valued instances: xi ∈ H)
      • Output: the most probable value vj ∈ V (NB: priors for x constant wrt vMAP)
   $v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid x) = \arg\max_{v_j \in V} P(v_j \mid x_1, x_2, \ldots, x_n) = \arg\max_{v_j \in V} P(x_1, x_2, \ldots, x_n \mid v_j)\, P(v_j)$
•  Parameter Estimation Issues
   – Estimating P(vj) is easy: for each value vj, count its frequency in D = {<x, f(x)>}
   – However, it is infeasible to estimate P(x1, x2, …, xn | vj) directly: too many feature combinations, so most observed counts are 0
   – In practice, need to make assumptions that allow us to estimate P(x | v)
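A minimal sketch of these two estimation steps, using a toy data set of my own (not from the lecture), shows why class priors are easy to count while the full class-conditional table is hopelessly sparse:

    # Minimal sketch: MAP classification from counted priors and likelihoods.
    # The data set D and feature encoding here are hypothetical illustrations.
    from collections import Counter

    D = [((1, 0, 1), "+"), ((1, 1, 1), "+"), ((0, 0, 1), "-")]   # (x, f(x)) pairs

    # Estimating P(v_j) by counting label frequencies in D is easy ...
    label_counts = Counter(label for _, label in D)
    priors = {v: c / len(D) for v, c in label_counts.items()}

    # ... but estimating P(x_1, ..., x_n | v_j) directly needs a count for every
    # feature combination; with n binary features that is 2^n cells per class,
    # almost all of which are zero in a small sample.
    joint_counts = Counter((x, label) for x, label in D)
    likelihoods = {(x, v): joint_counts[(x, v)] / label_counts[v]
                   for (x, v) in joint_counts}

    def v_map(x):
        """argmax_v P(x | v) P(v), using the (sparse) direct estimates."""
        return max(priors, key=lambda v: likelihoods.get((x, v), 0.0) * priors[v])

    print(v_map((1, 0, 1)))   # '+' for a combination seen in D; unseen x gives ties at 0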
Bayes Optimal Classifier (BOC)
•  Intuitive Idea
   – hMAP(x) is not necessarily the most probable classification!
   – Example
      • Three possible hypotheses: P(h1 | D) = 0.4, P(h2 | D) = 0.3, P(h3 | D) = 0.3
      • Suppose that for new instance x, h1(x) = +, h2(x) = –, h3(x) = –
      • What is the most probable classification of x?
•  Bayes Optimal Classification (BOC)
   $v^{*} = v_{BOC} = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$
   – Example
      • P(h1 | D) = 0.4, P(– | h1) = 0, P(+ | h1) = 1
      • P(h2 | D) = 0.3, P(– | h2) = 1, P(+ | h2) = 0
      • P(h3 | D) = 0.3, P(– | h3) = 1, P(+ | h3) = 0
      • $\sum_{h_i \in H} P(+ \mid h_i)\, P(h_i \mid D) = 0.4$, $\sum_{h_i \in H} P(- \mid h_i)\, P(h_i \mid D) = 0.6$
   – Result: $v^{*} = v_{BOC} = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) = \,–$
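As a quick check of the arithmetic, here is a minimal sketch of the BOC computation for exactly the three hypotheses on this slide:

    # Bayes optimal classification for the slide's three-hypothesis example.
    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}           # P(h_i | D)
    predictions = {"h1": "+", "h2": "-", "h3": "-"}          # h_i(x) for the new x

    def bayes_optimal(values=("+", "-")):
        # Weight each hypothesis's vote by its posterior, pick the heaviest value.
        score = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
                 for v in values}
        return max(score, key=score.get), score

    print(bayes_optimal())   # ('-', {'+': 0.4, '-': 0.6})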
BOC and
Concept Learning
•  Back to Concept Learning (Momentarily)
   – Recall: every consistent hypothesis has MAP probability
   $P(h \mid D) = \begin{cases} 1 / |VS_{H,D}| & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise} \end{cases}$
   – Bayes optimal prediction
   $P(x \mid D) = \sum_{h_i \in H} P(x \mid h_i)\, P(h_i \mid D) = \frac{1}{|VS_{H,D}|} \sum_{h_i \in VS_{H,D}} h_i(x)$
      • Weighted sum of the predictions of all consistent hypotheses
      • “Each hypothesis contributes in proportion to its own likelihood”
•  Properties of Bayes Optimal Prediction
   – Classifier does not necessarily correspond to any hypothesis h ∈ H
   – BOC algorithm searches for its classifier in a wider concept class
   – Allows linear combinations of H’s elements
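The prediction rule above amounts to a vote among consistent hypotheses. A minimal sketch, with a made-up hypothesis space and data set, might look like this:

    # Bayes optimal prediction over a version space (uniform posterior on
    # consistent hypotheses). Hypotheses and data here are hypothetical.
    D = [((0, 1), 1), ((1, 1), 1)]

    # A tiny hypothesis space H: each hypothesis maps x to {0, 1}.
    H = [lambda x: x[0], lambda x: x[1], lambda x: x[0] and x[1], lambda x: 1]

    version_space = [h for h in H if all(h(x) == y for x, y in D)]

    def p_x_given_D(x):
        """P(h(x) = 1 | D) = (1/|VS|) * sum over consistent h of h(x)."""
        return sum(h(x) for h in version_space) / len(version_space)

    print(p_x_given_D((1, 0)))   # fraction of consistent hypotheses predicting 1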
BOC and
Evaluation of Learning Algorithms
•  Method: Using The BOC as A “Gold Standard”
– Compute classifiers
• Bayes optimal classifier
• Sub-optimal classifier: gradient learning ANN, simple (Naïve) Bayes, etc.
– Compute results: apply classifiers to produce predictions
– Compare results to BOC’s to evaluate (“percent of optimal”)
•  Evaluation in Practice
– Some classifiers work well in combination
• Combine classifiers with each other
• Later: weighted majority, mixtures of experts, bagging, boosting
• Why is the BOC the best in this framework, too?
– Can be used to evaluate “global optimization” methods too
• e.g., genetic algorithms, simulated annealing, and other stochastic methods
• Useful if convergence properties are to be compared
– NB: not always feasible to compute BOC (often intractable)
BOC for
Development of New Learning Algorithms
•  Practical Application: BOC as Benchmark
– Measuring “how close” local optimization methods come to finding BOC
– Measuring how efficiently global optimization methods converge to BOC
– Tuning high-level parameters (of relatively low dimension)
•  Approximating the BOC
– Genetic algorithms (covered later)
• Approximate BOC in a practicable fashion
• Exploitation of (mostly) task parallelism and (some) data parallelism
– Other random sampling (stochastic search)
• Markov chain Monte Carlo (MCMC)
• e.g., Bayesian learning in ANNs [Neal, 1996]
•  BOC as Guideline
– Provides a baseline when feasible to compute
– Shows deceptivity of H (how many local optima?)
– Illustrates role of incorporating background knowledge
Gibbs Classifier
•  Difficulties with BOC
   – Computationally expensive if |H| is high
   – Intractable (i.e., NP-hard) for many non-trivial learning problems
•  Solution Approach
   – A stochastic classifier: result of random sampling
   – Gibbs algorithm: simple random sampling
      • Select a hypothesis h from H according to P(h | D)
      • Use this h to classify new x
•  Quality of Gibbs Classification: Example
   – Assume target concepts are drawn from H according to P(h)
   – Surprising fact: error bound
   $E[\mathrm{error}_{h_{Gibbs}}] \le 2\, E[\mathrm{error}_{h_{BayesOptimal}}]$
   – Suppose assumption correct: uniform priors on H
      • Select any h ∈ VS_{H,D} ~ Uniform(H)
      • Expected error no worse than twice Bayes optimal!
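A minimal sketch of the Gibbs classifier itself, reusing the illustrative posteriors from the BOC slide:

    # Gibbs classifier: draw ONE hypothesis according to its posterior P(h | D),
    # then classify x with that single hypothesis.
    import random

    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(h_i | D)
    predictions = {"h1": "+", "h2": "-", "h3": "-"}     # h_i(x) for the new x

    def gibbs_classify():
        h, = random.choices(list(posteriors), weights=list(posteriors.values()))
        return predictions[h]

    # Over many draws the classifier answers '+' 40% of the time and '-' 60%;
    # its expected error is at most twice that of the Bayes optimal classifier.
    print(gibbs_classify())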
Gibbs Classifier:
Practical Issues
•  Gibbs Classifier in Practice
– BOC comparison yields an expected case ratio bound of 2
– Can we afford mistakes made when individual hypotheses fall outside?
– General questions
• How many examples must we see for h to be accurate with high probability?
• How far off can h be?
– Analytical approaches for answering these questions
• Computational learning theory
• Bayesian estimation: statistics (e.g., aggregate loss)
•  Solution Approaches
– Probabilistic knowledge
• Q: Can we improve on uniform priors?
• A: It depends on the problem, but sometimes, yes (stay tuned)
– Global optimization: Monte Carlo methods (Gibbs sampling)
• Idea: if sampling one h yields a ratio bound of 2, how about sampling many?
• Combine many random samples to simulate integration
Bayesian Learning:
Parameter Estimation
•  Bayesian Learning: General Case
   – Model parameters θ
      • These are the basic trainable parameters (e.g., ANN weights)
      • Might describe graphical structure (e.g., decision tree, Bayesian network)
      • Includes any “low level” model parameters that we can train
   – Hyperparameters γ (higher-order parameters)
      • Might be control statistics (e.g., mean and variance of priors on weights)
      • Might be “runtime options” (e.g., max depth or size of DT; BN restrictions)
      • Includes any “high level” control parameters that we can tune
•  Concept Learning: Bayesian Methods
   – Hypothesis h consists of (θ, γ)
   – γ values used to control update of θ values
   – e.g., priors (“seeding” the ANN), stopping criteria
Case Study:
BOC and Gibbs Classifier for ANNs [1]
•  Methodology
   – θ (model parameters): aj, uij, bk, vjk
   – γ (hyperparameters): σa, σu, σb, σv
•  Computing Feedforward ANN Output
   [Figure: two-layer feedforward ANN — input layer x1, x2, x3; hidden layer h1, …, h4 with input-to-hidden weights uij; output layer o1, o2 with hidden-to-output weights vjk]
   – Output layer activation: $f_k(x) = b_k + \sum_j v_{jk}\, h_j(x)$
   – Hidden layer activation: $h_j(x) = \tanh\!\left(a_j + \sum_i u_{ij}\, x_i\right)$
•  Classifier Output: Prediction
   – Given new input from “inference space”
   – Want: Bayesian optimal test output
   $P(y^{(m+1)} \mid x^{(m+1)}, x^{(1)}, y^{(1)}, \ldots, x^{(m)}, y^{(m)}) = \int P(y^{(m+1)} \mid x^{(m+1)}, \theta)\, P(\theta, \gamma \mid x^{(1)}, y^{(1)}, \ldots, x^{(m)}, y^{(m)})\, d\theta\, d\gamma$
   $P(\theta \mid x^{(1)}, y^{(1)}, \ldots, x^{(m)}, y^{(m)}) = \frac{P(x^{(1)}, y^{(1)}, \ldots, x^{(m)}, y^{(m)} \mid \theta)\, P(\theta)}{P(x^{(1)}, y^{(1)}, \ldots, x^{(m)}, y^{(m)})} \propto L(\theta \mid x^{(1)}, y^{(1)}, \ldots, x^{(m)}, y^{(m)})\, P(\theta)$
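A minimal sketch of the feedforward computation defined by the two activation equations above (the weight values are arbitrary placeholders, not trained values):

    # Feedforward pass for the small ANN on this slide:
    # hidden:  h_j(x) = tanh(a_j + sum_i u_ij * x_i)
    # output:  f_k(x) = b_k + sum_j v_jk * h_j(x)
    import math

    a = [0.1, -0.2, 0.0, 0.3]                                 # hidden biases a_j (4 hidden units)
    u = [[0.5, -0.1, 0.2, 0.0],                               # u[i][j]: input i -> hidden j
         [0.3, 0.4, -0.2, 0.1],
         [-0.6, 0.2, 0.1, 0.5]]
    b = [0.0, 0.1]                                            # output biases b_k (2 outputs)
    v = [[0.2, -0.3], [0.1, 0.4], [-0.5, 0.2], [0.3, 0.1]]    # v[j][k]: hidden j -> output k

    def feedforward(x):
        h = [math.tanh(a[j] + sum(u[i][j] * x[i] for i in range(len(x))))
             for j in range(len(a))]
        f = [b[k] + sum(v[j][k] * h[j] for j in range(len(h)))
             for k in range(len(b))]
        return f

    print(feedforward([1.0, 0.0, -1.0]))                      # two real-valued outputs f_1, f_2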
Case Study:
BOC and Gibbs Classifier for ANNs [2]
•  Problem
   – True parameter space is infinite (real-valued weights and thresholds)
   – Worse yet, we know nothing about target distribution P(h | D)
•  Solution: Markov chain Monte Carlo (MCMC) Methods
   – Sample from a conditional density for h = (θ, γ)
   – Integrate over these samples numerically (as opposed to analytically)
•  MCMC Estimation: An Application of Gibbs Sampling
   – Want: a function v(θ), e.g., $f_k(x^{(m+1)}, \theta)$
   – Target: $E[v] = \int v(\theta)\, Q(\theta)\, d\theta$
   – MCMC estimate: $\hat{E}[v] \approx \frac{1}{N} \sum_{t=1}^{N} v(\theta^{(t)})$
   – FAQ [MacKay, 1999]: http://wol.ra.phy.cam.ac.uk/mackay/Bayes_FAQ.html
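As a toy illustration of the MCMC estimate above, the sketch below averages v(θ) over samples from a simple Metropolis chain. The target Q and function v are stand-ins of my own (a standard normal and v(θ) = θ²); the lecture's application would instead sample ANN weights θ.

    # Toy MCMC estimate: E[v] ~ (1/N) * sum_t v(theta_t), with a Metropolis
    # sampler targeting Q(theta) = N(0, 1) and v(theta) = theta^2, so the
    # estimate should approach E[theta^2] = 1.
    import math, random

    def q_unnorm(theta):                 # unnormalized target density Q(theta)
        return math.exp(-0.5 * theta * theta)

    def v(theta):                        # function whose expectation we want
        return theta * theta

    def mcmc_estimate(n_samples=50_000, step=1.0):
        theta, total = 0.0, 0.0
        for _ in range(n_samples):
            proposal = theta + random.gauss(0.0, step)
            if random.random() < q_unnorm(proposal) / q_unnorm(theta):   # Metropolis accept
                theta = proposal
            total += v(theta)
        return total / n_samples

    print(mcmc_estimate())               # close to 1.0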
BOC and Gibbs Sampling
•  Gibbs Sampling: Approximating the BOC
   – Collect many Gibbs samples
   – Interleave the update of parameters and hyperparameters
      • e.g., train ANN weights using Gibbs sampling
      • Accept a candidate w if it improves error, or if rand() < current threshold
      • After every few thousand such transitions, sample hyperparameters
   – Convergence: lower current threshold slowly
   – Hypothesis: return model (e.g., network weights)
   – Intuitive idea: sample models (e.g., ANN snapshots) according to likelihood
•  How Close to Bayes Optimality Can Gibbs Sampling Get?
   – Depends on how many samples taken (how slowly current threshold is lowered)
   – Simulated annealing terminology: annealing schedule
   – More on this when we get to genetic algorithms
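A minimal sketch of this accept/lower-the-threshold loop. The error function, proposal step, and schedule below are placeholders of my own, not the lecture's:

    # Annealed stochastic search over "ANN weights", in the spirit of the slide:
    # accept a candidate weight vector if it improves error, or otherwise with a
    # probability given by the current threshold, which is lowered slowly.
    import random

    def error(w):                                   # placeholder error surface
        return sum((wi - 1.0) ** 2 for wi in w)

    def anneal(n_steps=20_000, threshold=0.5, decay=0.9995):
        w = [random.uniform(-1, 1) for _ in range(4)]
        snapshots = []                              # "ANN snapshots" to combine later
        for t in range(n_steps):
            candidate = [wi + random.gauss(0, 0.1) for wi in w]
            if error(candidate) < error(w) or random.random() < threshold:
                w = candidate                       # accept
            threshold *= decay                      # lower the threshold slowly
            if t % 1000 == 0:
                snapshots.append(list(w))           # sample a model snapshot
        return w, snapshots

    best, models = anneal()
    print(error(best), len(models))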
Graphical Models
of Probability Distributions
•  Idea
– Want: model that can be used to perform inference
– Desired properties
• Ability to represent functional, logical, stochastic relationships
• Express uncertainty
• Observe the laws of probability
• Tractable inference when possible
• Can be learned from data
•  Additional Desiderata
– Ability to incorporate knowledge
• Knowledge acquisition and elicitation: in format familiar to domain experts
• Language of subjective probabilities and relative probabilities
– Support decision making
• Represent utilities (cost or value of information, state)
• Probability theory + utility theory = decision theory
– Ability to reason over time (temporal models)
Using Graphical Models
•  A Graphical View of Simple (Naïve) Bayes
   [Figure: class node y with children x1, x2, …, xn; each edge labeled with P(xi | y)]
   – xi ∈ {0, 1} for each i ∈ {1, 2, …, n}; y ∈ {0, 1}
   – Given: P(xi | y) for each i ∈ {1, 2, …, n}; P(y)
   – Assume conditional independence
      • ∀ i ∈ {1, 2, …, n}: $P(x_i \mid x_{\neg i}, y) \equiv P(x_i \mid x_1, x_2, \ldots, x_{i-1}, x_{i+1}, x_{i+2}, \ldots, x_n, y) = P(x_i \mid y)$
      • NB: this assumption entails the Naïve Bayes assumption
      $P(x_1, x_2, \ldots, x_n \mid y) = \prod_i P(x_i \mid x_{\neg i}, y) = \prod_i P(x_i \mid y)$
      • Why?
   – Can compute P(y | x) given this info
   $P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)} = \frac{P(y)}{P(x)} \prod_{i=1}^{n} P(x_i \mid x_{\neg i}, y) = \frac{P(y)}{P(x)} \prod_{i=1}^{n} P(x_i \mid y)$
   – Can also compute the joint pdf over all n + 1 variables
   $P(x, y) = P(y)\, P(x \mid y) = P(y) \prod_{i=1}^{n} P(x_i \mid x_{\neg i}, y) = P(y) \prod_{i=1}^{n} P(x_i \mid y)$
•  Inference Problem for a (Simple) Bayesian Network
   – Use the above model to compute the probability of any conditional event
   – Exercise: P(x1, x2, y | x3, x4)
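A minimal sketch of the two computations above, with made-up CPT numbers (n = 3 binary features):

    # Naive Bayes: joint P(x, y) and posterior P(y | x) from P(y) and P(x_i | y).
    # All probabilities below are made-up illustrative numbers.
    p_y = {0: 0.6, 1: 0.4}                               # P(y)
    p_x_given_y = [                                      # p_x_given_y[i][y] = P(x_i = 1 | y)
        {0: 0.2, 1: 0.7},
        {0: 0.5, 1: 0.9},
        {0: 0.1, 1: 0.4},
    ]

    def joint(x, y):
        """P(x, y) = P(y) * prod_i P(x_i | y)."""
        p = p_y[y]
        for i, xi in enumerate(x):
            p_i = p_x_given_y[i][y]
            p *= p_i if xi == 1 else (1.0 - p_i)
        return p

    def posterior(x):
        """P(y | x) = P(x, y) / sum_y' P(x, y')."""
        scores = {y: joint(x, y) for y in p_y}
        z = sum(scores.values())
        return {y: s / z for y, s in scores.items()}

    print(posterior((1, 1, 0)))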
In-Class Exercise:
Probabilistic Inference
•  Inference Problem for a (Simple) Bayesian Network
   – Model: Naïve Bayes
   – Objective: compute the probability of any conditional event
•  Exercise
   – Given
      • P(xi | y), i ∈ {1, 2, 3, 4}
      • P(y)
   – Want: P(x1, x2, y | x3, x4)
   $P(x_1, x_2, y \mid x_3, x_4) = \frac{P(x_3, x_4 \mid x_1, x_2, y)\, P(x_1, x_2, y)}{P(x_3, x_4)} = \frac{P(x_1, x_2, x_3, x_4, y)}{P(x_3, x_4)} = \frac{P(y) \prod_{i=1}^{4} P(x_i \mid y)}{\sum_{y'} P(y')\, P(x_3 \mid y')\, P(x_4 \mid y')}$
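One way to carry out this exercise numerically (again with made-up CPT values) is sketched below; the denominator marginalizes y out of P(x3, x4, y):

    # In-class exercise: P(x1, x2, y | x3, x4) for a 4-feature Naive Bayes model.
    # CPT values are made-up placeholders.
    p_y = {0: 0.7, 1: 0.3}
    p_xi_given_y = [                                 # P(x_i = 1 | y), i = 1..4
        {0: 0.1, 1: 0.8},
        {0: 0.4, 1: 0.6},
        {0: 0.3, 1: 0.9},
        {0: 0.5, 1: 0.2},
    ]

    def p_feature(i, xi, y):
        p = p_xi_given_y[i][y]
        return p if xi == 1 else 1.0 - p

    def exercise(x1, x2, x3, x4, y):
        # Numerator: full joint P(x1, x2, x3, x4, y) = P(y) * prod_i P(x_i | y)
        num = p_y[y]
        for i, xi in enumerate((x1, x2, x3, x4)):
            num *= p_feature(i, xi, y)
        # Denominator: P(x3, x4) = sum_y' P(y') P(x3 | y') P(x4 | y')
        den = sum(p_y[yp] * p_feature(2, x3, yp) * p_feature(3, x4, yp) for yp in p_y)
        return num / den

    print(exercise(1, 0, 1, 1, y=1))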
Unsupervised Learning
and Conditional Independence
•  Given: (n + 1)-Tuples (x1, x2, …, xn, xn+1)
   – No notion of instance variable or label
   – After seeing some examples, want to know something about the domain
      • Correlations among variables
      • Probability of certain events
      • Other properties
•  Want to Learn: Most Likely Model that Generates Observed Data
   – In general, a very hard problem
   – Under certain assumptions, have shown that we can do it
•  Assumption: Causal Markovity
   – Conditional independence among “effects”, given “cause”
   – When is the assumption appropriate?
   – Can it be relaxed?
•  Structure Learning
   – Can we learn more general probability distributions?
   [Figure: Naïve Bayes graphical model (class y with children x1, …, xn), repeated from above]
   – Examples: automatic speech recognition (ASR), natural language, etc.
Bayesian Belief Networks (BBNS):
Definition
•  Conditional Independence
   – X is conditionally independent (CI) of Y given Z (sometimes written X ⊥ Y | Z) iff
     P(X | Y, Z) = P(X | Z) for all values of X, Y, and Z
   – Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning), i.e., T ⊥ R | L
•  Bayesian Network
   – Directed graph model of conditional dependence assertions (or CI assumptions)
   – Vertices (nodes): denote events (each a random variable)
   – Edges (arcs, links): denote conditional dependencies
•  General Product (Chain) Rule for BBNs
   $P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i))$
•  Example (“Sprinkler” BBN)
   [Figure: X1 = Season {Spring, Summer, Fall, Winter}; X2 = Sprinkler {On, Off}; X3 = Rain {None, Drizzle, Steady, Downpour}; X4 = Ground {Wet, Dry}; X5 = Ground {Slippery, Not-Slippery}; edges X1→X2, X1→X3, X2→X4, X3→X4, X4→X5]
P(Summer, Off, Drizzle, Wet, Not-Slippery) = P(S) · P(O | S) · P(D | S) · P(W | O, D) · P(N | W)
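A minimal sketch of this chain-rule evaluation for the Sprinkler network; only the factorization comes from the slide, and the CPT numbers below are invented for illustration:

    # Joint probability in the "Sprinkler" BBN via the chain rule:
    # P(x1..x5) = P(Season) P(Sprinkler | Season) P(Rain | Season)
    #             * P(Wet | Sprinkler, Rain) * P(Slippery | Wet)
    # CPT entries below are invented placeholders.
    p_season = {"Summer": 0.25}                                  # P(X1)
    p_sprinkler = {("Off", "Summer"): 0.4}                       # P(X2 | X1)
    p_rain = {("Drizzle", "Summer"): 0.1}                        # P(X3 | X1)
    p_wet = {("Wet", "Off", "Drizzle"): 0.8}                     # P(X4 | X2, X3)
    p_slippery = {("Not-Slippery", "Wet"): 0.3}                  # P(X5 | X4)

    def joint(season, sprinkler, rain, ground, surface):
        return (p_season[season]
                * p_sprinkler[(sprinkler, season)]
                * p_rain[(rain, season)]
                * p_wet[(ground, sprinkler, rain)]
                * p_slippery[(surface, ground)])

    # P(Summer, Off, Drizzle, Wet, Not-Slippery)
    print(joint("Summer", "Off", "Drizzle", "Wet", "Not-Slippery"))   # 0.25*0.4*0.1*0.8*0.3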
Bayesian Belief Networks:
Properties
•  Conditional Independence
   – Variable (node): conditionally independent of non-descendants given parents
   – Example
   [Figure: “Cancer” BBN — X1 = Age, X2 = Gender, X3 = Exposure-To-Toxics, X4 = Smoking, X5 = Cancer, X6 = Serum Calcium, X7 = Lung Tumor; relative to Cancer (X5), X3 and X4 are its parents, X6 and X7 its descendants, and X1, X2 non-descendants]
   – Result: chain rule for probabilistic inference
   $P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa_i), \qquad Pa_i \equiv \mathrm{parents}(X_i)$
•  Bayesian Network: Probabilistic Semantics
   – Node: variable
   – Edge: one axis of a conditional probability table (CPT)
Bayesian Belief Networks:
Inference
•  Problem Definition
– Given
• Bayesian network with specified CPTs
• Observed values for some nodes in network
– Return: inferred (probabilities of) values for query node(s)
•  Implementation
– Bayesian network contains all information needed for this inference
• If only one variable with unknown value, easy to infer it
• In general case, problem is intractable (NP-hard, by reduction from 3-CNF-SAT)
– In practice, can succeed in many cases using different methods
• Exact inference: work well for some network structures
• Monte Carlo: “simulate” network to randomly calculate approximate solutions
– Key machine learning issues
• Feasible to elicit this information or learn it from data?
• How to learn structure that makes inference more tractable?
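For the Monte Carlo option mentioned above, a minimal rejection-sampling sketch on a two-node network (structure and CPTs are my own toy example, not from the lecture) might look like this:

    # Approximate inference by rejection sampling: estimate P(Rain=1 | Wet=1)
    # in a toy two-node network Rain -> Wet with invented CPTs.
    import random

    P_RAIN = 0.2                                   # P(Rain = 1)
    P_WET_GIVEN_RAIN = {1: 0.9, 0: 0.1}            # P(Wet = 1 | Rain)

    def sample_once():
        rain = 1 if random.random() < P_RAIN else 0
        wet = 1 if random.random() < P_WET_GIVEN_RAIN[rain] else 0
        return rain, wet

    def estimate(n=100_000):
        kept = [rain for rain, wet in (sample_once() for _ in range(n)) if wet == 1]
        return sum(kept) / len(kept)               # fraction of accepted samples with Rain=1

    print(estimate())   # close to the exact value 0.9*0.2 / (0.9*0.2 + 0.1*0.8) ~ 0.69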
Tree Dependent Distributions
•  Polytrees
   – aka singly-connected Bayesian networks
   – Definition: a Bayesian network with no undirected loops
   – Idea: restrict distributions (CPTs) to single nodes
   – Theorem: inference in singly-connected BBN requires linear time
      • Linear in network size, including CPT sizes
      • Much better than for unrestricted (multiply-connected) BBNs
•  Tree Dependent Distributions
   – Further restriction of polytrees: every node has at most one parent
   – Now only need to keep 1 prior, P(root), and n – 1 CPTs (1 per non-root node)
   – All CPTs are 2-dimensional: P(child | parent)
   [Figure: tree-structured BBN — root at top; a node x with its single parent z]
•  Independence Assumptions
   – As for general BBN: x is independent of non-descendants given (single) parent z
   – Very strong assumption (applies in some domains but not most)
Inference in Trees
•  Inference in Tree-Structured BBNs (“Trees”)
   – Generalization of Naïve Bayes to model of tree dependent distribution
   – Given: tree T with all associated probabilities (CPTs)
   – Evaluate: probability of a specified event, P(x)
•  Inference Procedure for Polytrees
   [Figure: polytree fragment — query node X with parents Y1 and Y2; leaves at the bottom]
   – Recursively traverse tree
      • Breadth-first, source(s) to sink(s)
      • Stop when query value P(x) is known
   – Perform inference at each node
   $P(x) \equiv P(X = x) = \sum_{y_1, y_2} P(x \mid y_1, y_2)\, P(y_1, y_2) = \sum_{y_1, y_2} P(x \mid y_1, y_2)\, P(y_1)\, P(y_2)$
   – NB: for trees, proceed root to leaves (e.g., breadth-first or depth-first)
   – Simple application of Bayes’s rule (more efficient algorithms exist)
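A minimal sketch of this root-to-leaves marginal computation in a tree-dependent distribution (the tree shape and CPT numbers are made up):

    # Marginals in a tree-dependent distribution: propagate from the root,
    # P(child = c) = sum_p P(child = c | parent = p) * P(parent = p).
    # Tree structure and CPTs below are illustrative placeholders (binary variables).
    parent = {"B": "A", "C": "A", "D": "B"}          # child -> parent; "A" is the root
    p_root = {1: 0.3, 0: 0.7}                        # P(A)
    cpt = {                                          # cpt[child][(child_val, parent_val)]
        "B": {(1, 1): 0.8, (1, 0): 0.2, (0, 1): 0.2, (0, 0): 0.8},
        "C": {(1, 1): 0.5, (1, 0): 0.1, (0, 1): 0.5, (0, 0): 0.9},
        "D": {(1, 1): 0.6, (1, 0): 0.3, (0, 1): 0.4, (0, 0): 0.7},
    }

    def marginal(node):
        if node == "A":
            return dict(p_root)
        parent_marginal = marginal(parent[node])     # recurse toward the root
        return {c: sum(cpt[node][(c, p)] * parent_marginal[p] for p in (0, 1))
                for c in (0, 1)}

    print(marginal("D"))   # P(D), obtained by chaining P(D | B) with P(B | A) and P(A)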
Learning Distributions:
Objectives
•  Learning The Target Distribution
   – What is the target distribution?
   – Can’t use “the” target distribution
      • Case in point: suppose target distribution was P1 (collected over 20 examples)
      • Using Naïve Bayes would not produce an h close to the MAP/ML estimate
   – Relaxing CI assumptions: expensive
      • MLE becomes intractable; BOC approximation, highly intractable
      • Instead, should make judicious CI assumptions
   – As before, goal is generalization
      • Given D (e.g., {1011, 1001, 0100})
      • Would like to know P(1111) or P(11**) ≡ P(x1 = 1, x2 = 1)
•  Several Variants
   – Known or unknown structure
   – Training examples may have missing values
   – Known structure and no missing values: as easy as training Naïve Bayes (see the sketch below)
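For the easy case at the end of this list (known structure, no missing values), CPT learning reduces to counting; a minimal sketch with a made-up structure and data set follows:

    # Learning CPTs by counting when the structure is known and data are complete.
    # Structure (x1 -> x2) and the data set are made-up placeholders.
    from collections import Counter

    data = ["1011", "1001", "0100", "1011", "0000"]    # each string = (x1, x2, x3, x4)

    # P(x1): simple frequency estimate
    prior_counts = Counter(d[0] for d in data)
    p_x1 = {v: c / len(data) for v, c in prior_counts.items()}

    # P(x2 | x1): counts of (child, parent) pairs normalized per parent value
    pair_counts = Counter((d[1], d[0]) for d in data)
    p_x2_given_x1 = {(c, p): n / prior_counts[p] for (c, p), n in pair_counts.items()}

    print(p_x1)              # e.g., {'1': 0.6, '0': 0.4}
    print(p_x2_given_x1)     # e.g., {('0', '1'): 1.0, ('1', '0'): 0.5, ('0', '0'): 0.5}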
Summary Points
•  Graphical Models of Probability
   – Bayesian networks: introduction
      • Definition and basic principles
      • Conditional independence (causal Markovity) assumptions, tradeoffs
   – Inference and learning using Bayesian networks
      • Acquiring and applying CPTs
      • Searching the space of trees: max likelihood
      • Examples: Sprinkler, Cancer, Forest-Fire, generic tree learning
•  CPT Learning: Gradient Algorithm Train-BN
•  Structure Learning in Trees: MWST Algorithm Learn-Tree-Structure
•  Reasoning under Uncertainty: Applications and Augmented Models
•  Some Material From: http://robotics.Stanford.EDU/~koller
•  Next Lecture: Read Heckerman Tutorial