Health Risk Assessment Frameworks
Download
Report
Transcript Health Risk Assessment Frameworks
Causal graphs for risk assessment
and risk management
Tony Cox
10-2-03
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Main themes
• Risk analysis is largely about how to use observed
information on some variables to help predict the
values of other unobserved variables.
– Exposure Future response
– Current response Past exposures
• Causal graphs/Bayesian networks provide a
systematic way to use information on some
variables to predict (a) the probable values of
others and (b) the likely effects of interventions.
• Influence diagrams are causal graphs with
decision and value nodes.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Main themes (Cont.)
• Methods for learning causal graphs from data,
inferring likely values of unobserved variables
from values of observed ones, and optimizing
decisions are heavily computational.
– Good algorithms and effective approximate heuristics
are available for inference and decision-making.
(Mature technology, many choices.)
– Learning models from data automatically or with
minimal user knowledge input is current cutting edge.
• Commercial software and academic shareware are
closing the gap between theory and practice.
– “Ready for prime time” in RA applications
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Quantifying Uncertainty
• Uncertainty about a random variable, X, can be
quantified by its entropy, denoted H(X).
• X and Y are statistically dependent if and only if
each gives some information (i.e., reduces expected
uncertainty/entropy) about the other.
• Conditional entropy: H(X | Y) H(X)
– Equal iff X and Y are statistically independent
• X and Y always provide equal information about
each other: H(X) – H(X | Y) = H(Y) – H(Y | X) =
mutual information between X and Y.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Uncertainty and Bayes Rule
• If random variables (or uncertain quantities) are
statistically independent, then their values can be
predicted easily, e.g., by sampling from their PDFs.
• If variables are statistically dependent, then Bayes
Rule allows their values to be predicted from
whatever is observed (e.g., by sampling from their
conditional distributions)
– Predicts unobserved from observed values.
– True risk is always unobserved.
• Conditional independence among variables helps
reveal©potential
causal
relations
among
them.
Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Statistical Independence for events and RVs
• Events A and B are statistically independent
if and only if: Pr(A & B) = Pr(A)Pr(B)
– Now, Pr(A & B) = Pr(A)*Pr(B | A)
– Independence: Pr(B | A) = Pr(B) = Pr(B | not-A).
• Random variables X and Y are statistically
independent if Pr(X, Y) = Pr(X)Pr(Y) (“joint
PDF equals product of marginals”)
– Meaning: Pr(X = x and Y = y) = Pr(X = x)*Pr(Y = y)
for all pairs of possible values, (x, y).
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Example
• Joint probability table/ PDF for X and Y.
• Q: Are X and Y statistically independent?
• Find marginal distributions Pr(X) and Pr(Y)
Y
0
1
0.0
0.2
0.3
0.5
X
Low = 0
High = 1
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Marginalization Principle:
“Joint determines marginal”
• Given a joint PDF table, the marginal probability
distribution of any variable(s) can be found… how?
• A: Sum joint PDF over values of other variable(s).
• In this sense, a joint PDF is an “oracle” that can
answer any query about any of its variables by
summing it over appropriate cells in the table.
• Could be useful…
• But joint PDF too big to estimate, store, sum over!
• Bayesian networks provide a solution based on
sparseness of table / statistical independence
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
“DAG model” notation: X Y
• “Directed acyclic graph” (DAG) notation
– “Bayes net”, “Bayesian belief network”, “causal
graph”, “DAG model”, “influence diagram”
• X Y means the same as [Pr(X), Pr(Y | X)]
• Pr(X) is stored at input node X.
– Gives Pr(X = x) for each possible value x of X.
– Pr(X) = “marginal” distribution of input X.
• Pr(Y | X) is stored at node Y.
– Gives Pr(Y = y | X = x) for each x and y.
– Called the conditional probability table (CPT) of Y.
• Extends to many variables: X Y Z W
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Example of a Bayesian Network
(BN) for Health Risk Assessment
action exposure response consequence QALYs
behavior susceptibility treatment
Risk assessment estimates the probable consequences of
actions, given all available information.
Bayes Rule estimates probabilities of values of unobserved
variables conditioned on all observed ones.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Bayes Rule: What it does
• Bayes Rule lets us calculate Pr(X | Y) from Pr(X )
and Pr(Y | X), for any X and Y.
–
–
–
–
X is unobserved, Y is observed
Needed inputs: Pr(X) = prior, Pr(Y | X) = likelihood
Output: Pr(X | Y) = posterior distribution for X given Y
“Reverses arc” in the probability model X Y.
• Multivariate Bayes Rule: X and Y can be sets of
uncertain quantities (e.g., vectors)
• Provides a consistent, computationally practical
framework for data-based modeling, estimation,
prediction, inference, and decision-making!
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Bayes Rule: How it Works
• Bayes Rule [Pr(Y | X) and Pr(X)] Pr(X | Y)
• Q: How is this possible?
• A: Use Pr(Y | X) and Pr(X) to get joint distribution:
– Q: How?
– A: Multiply! Pr(X, Y) = Pr(Y | X)*Pr(X)
– “Joint = conditional times marginal”
• Then use Pr(X, Y) to calculate Pr(X | Y) (How?)
– Pr(X = x | Y = y) = Pr(x, y)/Pr(y) = Pr(cell)/Pr(column)
• Pr(X | Y) = Pr(X, Y)/Pr(Y) = Pr(Y | X)Pr(X)/Pr(Y)
– “Conditional = joint divided by marginal”
– “Posterior
= likelihood
times
prior, normalized”
© Cox Associates,
2003. Ph. 303-388-1778;
Fax 303-388-0609.
[email protected]
Pop Quiz: Find Pr(X = 8 | Y = 2)
Y
1
2
3
4
0.0
0.2
0.1
8
0.1
0.2
0.0
16
0.1
0
0.3
X
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Details on Making Predictions
• DAG model: X Y. We will predict Y: Pr(Y = y)
• If X = x is observed, then prediction for Y is just the
CPT column, Pr(Y = y | X = x) for each value of y.
• But what if X is not observed?
• Then Pr(Y = y) = xPr(Y = y | X = x)Pr(X = x)
= xPr(Y = y & X = x) (since exactly one x will occur!)
– “Marginalize out” X by summing over possible values, x
• Fast, modern prediction: Monte Carlo simulation!
– Sample the values of X from Pr(X = x)
– Sample
the values of Y from the CPT Pr(Y = y | X = x)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Example: Predict Pr(C) for random patient
(Source: http://research.microsoft.com/~breese/)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Manual Solution for Pr(C = none)
• Model: S C, i.e., Smoking Cancer
• Rule: Pr(Y = y) = xPr(Y = y | X = x)Pr(X = x)
Pr(C = c) = sPr(C = c | S = s)Pr(S = s)
Pr(C = none) = sPr(C = none | S = s)Pr(S = s)
Pr(C = none | S = no)Pr(S = no) +
Pr(C = none | S = light)Pr(S = light) +
Pr(C = none | S = heavy)Pr(S = heavy) =
(0.96)(0.80) + (0.88)(0.15) + (0.60)(0.05) = 0.93
= [0.96, 0.88, 0.60]*[0.80, 0.15, 0.05]'
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Better Living through linear algebra
• Rule: Pr(Y = y) = xPr(Y = y | X = x)Pr(X = x)
• Matrix-vector notation: Pr(y) = Pr(y | x)*Pr(x)
x = vector of probabilities for values of x, y = vector of
probabilities for values of y, Pr(y | x) = CPT
Pr(S) = Pr[none, light, heavy] = s = [0.8, 0.15, 0.05];
Pr(C | S) = P = [.96, .88, .60; .03, .08, .25; .01, .04, .15];
• Pr(C) = Pr(C | S)Pr(S) c = P*s'
• MATLAB™ program: c = P*s'
= [0.93, 0.0485, 0.0215] for [none, benign, malignant]
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Bayes Applet Approach
http://www.cs.ubc.ca/labs/lci/CIspace/bayes
• Draw DAG model:
Smoking Cancer
• Populate marginal distributions and CPTs
– Same information as s and P in MATLAB
• Click on a node (“query” it) to get its PDF!
• Program computes PDFs of nodes by exact
formulas (marginalizing out and generalizations):
Pr(Y = y) = xPr(Y = y | X = x)Pr(X = x)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Prediction: Summary
• Given a probability model Pr(Y = y | X = x) (usually
summarized in a CPT, P)
• Given input probabilities Pr(X = x) = marginal PDF, x
• The predicted probability distribution of Y is found by
summing products: [Pr(Y = y) for each y] = y = P*x'
• Bayes Applet automates the calculations
• MC simulation makes (approximate) calculations easy
even when Pr(Y = y | X = x) cannot be stored as a table.
• Framework applies to arbitrary probabilistic models
– The input-output relations can be very nonlinear, have highorder interactions, etc.
– Applies approximately to simplifed parametric models
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Updating Beliefs by Conditioning
• Given a model X Y with prior Pr(X) and likelihood
Pr(Y | X) and observed data or evidence Y = y, find
posterior: Pr(X | Y = y). (X = unobserved, Y = observed)
– Probable diagnosis, explanation, etc. for observations
• Pr(Y = y) = v[Pr(Y = y & X = v)] = v[Pr(y | v)Pr(v)]
• Abbreviated notation: Pr(X = x | Y = y) = Pr(x | y)
• Method: Pr(x | y)Pr(y) = Pr(x, y) = Pr(y | x)Pr(x)
So, Pr(x | y) = Pr(x, y)/Pr(y) (Defn. of conditional PDF)
= Pr(y | x)Pr(x)/Pr(y)
joint = conditional x marginal
= Pr(y | x)Pr(x)/[vPr(y | v)Pr(v)] (Bayes Rule)
where sum is over all possible values, v, of X.
• This is Bayes Rule!
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Pr(S | C) in Smoker Problem
• Q: What is the probability that someone is a heavy
smoker (S = heavy), given that she is diagnosed with
malignant lung cancer (C = malignant)?
• A: Pr(s | c) = Pr(c | s)Pr(s) /[vPr(c | v)Pr(v)]
• Pr(heavy | malignant) = ???
Pr(malignant | heavy)Pr(heavy)/Pr(malignant)
= (0.15)*(0.05)/((.01)*(.8) + (.04)*(.15) + (.15)*(.05))
= 0.3488. (Check with Bayes Applet solver)
• In practice, Probabilistic Logic Sampling (forward Monte
Carlo simulation with rejection of samples that disagree
with observations) and more efficient sampling-based
methods can be used for inference in large BNs.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Bayes Applet Approach
http://www.cs.ubc.ca/labs/lci/CIspace/bayes
• Draw DAG model, X Y:
Smoking Cancer
• Populate marginal distributions and CPTs
– Pr(x) = Pr(X = x), Pr(y | x) = Pr(Y = y | X = x)
• Click on a node (“query” it) to get its PDF!
• Program computes PDFs of nodes by exact
formulas (marginalizing out and generalizations):
Predict: Pr(y) = xPr(Y = y | X = x)Pr(X = x)
Infer: Pr(x | y) = Pr(x, y)/Pr(y) = Pr(y | x)Pr(x)/Pr(y)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Example: Diagnostic Test
• 1% of population is sick, 3% has allergy, 96%
is well
• An imperfect medical test will yield a positive
(“sick”) reading as follows:
• Pr(test = sick | person is sick) = 0.97
• Pr(test = sick | person is well) = 0.05
• Pr(test = sick | person has allergy) = 0.10
• Q: What is Pr(person is sick | test says sick)?
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Solution to Diagnosis (Cont.)
Pr(test sick | sick)Pr(sick)/Pr(test sick) =
(0.97)(0.01)/0.0607 = 0.1598 (answer)
• Even this very reliable test is only right 16%
when it diagnosis illness!
• Explanation: Illness is unlikely!
• Many physicians are surprised by correct
Bayesian inference in diagnosis.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Bayesian Modeling and Updating
• Likelihood Model: Pr(Y | X)
– Pr(outputs | inputs) vs. Pr(data | model)
• Prior data/beliefs: Pr(X), X = predictive model
• Model-learning cycle:
X = model assumptions + inputs
Y = predicted outputs/observed data
Model predictions: [Pr(X), Pr(Y | X)] predictions Pr(Y)
Model updating/learning: [Pr(X) + data on Y] Pr(X |Y)
– After observing enough data, we learn a Pr(X | data) that
makes accurate predictions (if X ranges over enough
models, data are adequately diverse, etc.)
– Bayesian model-averaging (BMA) allows uncertain X
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Applications of Bayes in RA
• Bayes Rule: X Y Pr(X | Y)
– X = unobserved, Y = observed
– E.g., Pr(Y | X) = Pr(data | exposure, model)
• Diagnosis: X = disease, Y = symptom
• Hazard identification: X = cause, Y = health effect
• Exposure assessment: X = true exposures, Y =
measured (e.g., sample) values
• Dose-response modeling: X = true dose-response
model, Y = observed dose-response data
• Risk characterization: X = true models, exposure
distributions (and risk), Y = data
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Bayesian Networks
• Generalizes preceding to graphs (“DAGs”) with
many nodes and arcs
• Example: X Y Z
– (X Y) = PBPK model = (exposure internal dose)
– (Y Z) = PD model = (internal dose response)
• Extension allows composition of probabilistic
relations: (X Z) = (X Y)(Y Z).
Q: Find Pr(Z = z | X = x), Y unobserved
A: yPr(Z = z | Y = y)Pr(Y = y | X = x)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Expert System for Many Variables
• How to design and build a probabilistic expert system to
answer all questions about risk (given joint PDF)?
• Goal: Answer user queries with probabilities, taking into
account all known observations/facts/data.
• Method: Condition the joint distribution Pr(X) on any
known facts (i.e., exclude all cells that are inconsistent
with known facts.) Then, sum over all cells satisfying
query conditions and normalize to get total conditional
probability that the query is satisfied.
• Challenge: Joint PDF of all variables is impossible to
store (or estimate)!
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Key Idea: Factoring a Joint PDF
• For two variables, we have
Pr(X = x, Y = y) = Pr(X = x)Pr(Y = y | X = x)
More briefly: Pr(x, y) = Pr(x)Pr(y | x)
• For n variables, we have (perhaps after renumbering
variables so that all of a variable’s direct parents
precede it in the new ordering):
Pr(x1, x2, …, xn) =
Pr(x1)Pr(x2 | x1)Pr(x3 | x1, x2)…Pr(xn | x1, x2, …, xn-1)
• Most of these terms simplify for sparse DAGs!
• Example: Find Pr(x1, x2, x3, x3) in the following DAG:
X1 X2 X3 X4
• A: Pr(x1, x2, x3, x4) = Pr(x1)Pr(x2 | x1)Pr(x4)Pr(x3 | x2 , x4)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Example of a Bayesian Network
(BN) for Risk Assessment
act exposure response consequence QALYs
behavior susceptibility treatment
ax r c Q
All CPTs and Pr(b) given.
b s m
• Factor joint probability Pr(b, s, m, x, r, c, Q) given a:
Pr(b)Pr(s | b)Pr(m | s)Pr(x | b, a)Pr(r | s, x)Pr(c | r, m)Pr(Q | c)
• BN technology lets us find the PDF of any unobserved
variable(s) given observed values of any others
• Monte Carlo (MC) composes components into whole
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Monte Carlo Prediction for X Y
• Rule: Pr(Y = y) = xPr(Y = y | X = x)Pr(X = x)
• Suppose that Pr(Y = y | X = x) is hard to calculate but easy
to simulate (e.g., it is a large, complex spreadsheet
simulation model). What to do?
Answer: Sample x values from Pr(X = x).
– For univariate X, partition unit interval into sub-intervals of
lengths Pr(x), then use uniform U[0, 1] random number generator
• For each x, calculate/simulate a corresponding value of y
from Pr(Y = y | X = x)
• Fraction of sampled Y values equal to y Pr(Y = y)
• Adaptive importance sampling and variance reduction
techniques (e.g., Latin hypercube sampling) improve the
efficiency of this basic idea.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Gibbs Sampling for BNs
ax r c Q
b s m
All CPTs and Pr(b) given
• Factor joint PDF for all variables:
– Pr(b)Pr(s | b)Pr(m | s)Pr(x | b, a)Pr(r | s, x)Pr(c | r, m)Pr(Q | c)
• Gibbs sampling: Sample b from Pr(b), then s from
Pr(s | b), then… then Q from Pr(Q | c)
• Repeat many times to obtain a random sample
from the joint PDF!!! (Special case of MCMC)
• Can condition on observed values (with more or
less efficiency) to obtain samples from posterior
joint PDF of all unobserved values. (WinBUGS)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Remaining Challenges
• How to learn (and objectively validate) the
causal structure of a BN from data?
– Answer: Conditional independence tests
– Software: Classification trees, BayesiaLab™,
Bayesian Network Toolbox, etc.
• How to learn/estimate a BN’s CPTs from data?
• Once these things are known, WinBUGS, Bayes
Applet, etc. can be used to draw inferences
within the BN about risk, exposure, etc.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Learning Causal DAG Structures
from Data
• Can be very hard for people
– “Often in error but never in doubt!”
• General problem is computationally demanding
• Excellent sampling-based heuristics are now
available (once we agree what a “causal
structure” is!)
• Exact solutions are available for small problems
and models with special structure/constraints
– Includes many risk assessment models!
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
A Cognitive bias (Wason, 1970)
• Consider a deck of cards, each with a letter on one of its
sides and a number on the other.
• Claim: “If a card has a vowel on one side, then it has an
even number on the other side”
A
D
4
7
Q: Which of these 4 cards must be turned over to
determine whether the claim is true of them?
(“Wason selection task”)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Structural vs. Reduced-Form Models
Suppose truth (unknown to risk analyst) is:
E(EFFECT) = AGE – EXPOSURE
EXPOSURE = (1/3)*AGE
(1)
(2)
(Q: What is non-causal about: AGE = 3*EXPOSURE?)
Q: What is E(Effect | Exposure)?
A: E(EFFECT | EXPOSURE) = 2*EXPOSURE
(3)
E(EFFECT)
-1
+1
EXPOSURE AGE
+1/3
Math is right but implications are all wrong!
• (1) and (2) are structural equations (causal);
• (3) is a reduced-form equation (descriptive, not causal)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
X is a potential direct cause of Y if
and only if…
• X is informative about Y
– I(X ; Y) > 0
• Information that X provides about Y cannot be fully
explained away by any other variables
– There is no W for which I(X ; Y | W) = 0
– X W Y vs. W X Y vs. W Y X
• Composition, ordering, and conditional independence
implications are all satisfied
• The history of X up through time t contains information
about the future of Y after t.
These are all information (vs. intervention) criteria
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Tests for Potential Causality and
Conditional Independence in Data
Different approaches:
• Time series: Granger causality, intervention
analysis, change point analysis
– http://support.sas.com/rnd/app/examples/ets/granger/
•
•
•
•
Systems of equations: Simon causal ordering
Correlations: SEMs/path analysis
Conditional independence: Causal graphs
Our favorite: Classification trees
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Classification Trees
• A powerful, popular method for data mining
• Background/software on classification trees:
http://www.statsoftinc.com/textbook/stcart.html;
http://www.stat.wisc.edu/~loh/quest.html
• Idea: “Always ask the most informative question” for
reducing uncertainty (conditional entropy) about the
dependent variable.
• Partitions a set of cases into groups or clusters (the leaf
nodes) with similar conditional probabilities for
dependent variable.
• Download applet from:
http://www.cs.ubc.ca/labs/lci/CIspace/dTree/
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
What is meant by the “Most
Informative” Question?
• The one that minimizes expected entropy about
the quantity being predicted.
• Entropy measures the “surprise” from learning
the correct value of a variable.
• Let X be a binary 0-1 variable with E(X) = p (in
other words, with Pr(X = 1) = p.)
• How surprised will we be to find that in fact X =
1? The answer depends on p.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Quantifying Entropy
• Let S = surprise from finding that X = 1.
– So, S = information gained from this finding.
– The expected value of S is the entropy of X.
• Assume S depends only on p = Pr(X = 1).
• Four axioms for entropy:
–
–
–
–
S(1) = 0
S(p) decreases with p
S(p) is a continuous function of p
S(pq) = S(p) + S(q)
• Theorem: S(p) = -k*log2p. If we choose S(0.5) = 1,
then S(p) = log2p bits.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Entropy for any discrete X
• It is usual to use H(X) to denote the entropy of
a discrete random variable X.
• Let X have possible values x1,…, xn with
probabilities p1,…, pn. Then the entropy of X
is the expected information gained from
learning its value:
• H(X) = -ipilog2pi = E[log2(1/pi)] bits
• H(X) depends only on the probabilities of the
different possible values of X.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Information for two variables, X and Y
http://www.montefiore.ulg.ac.be/~lwh/Info/Transp2000/jour2.pdf
Conditional entropy H(X | Y) is defined as:
• H(X | Y) = yPr(Y = y) H(X | Y = y) = EY[H(X | Y)]
– H(X | Y = y) = -iPr(X = xi | Y = y)log2Pr(X = xi | Y = y)
Properties:
• H(X | Y) H(X),
• H(X | Y) = yPr(Y = y) H(X | Y = y)
• Joint entropy: H(X, Y) = H(X) + H(Y | X)
Mutual information:
I(X ; Y) = I(Y ; X) = H(X) – H(X | Y)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Information Theory Ideas
• Fano’s inequality: The conditional entropy H(X | Y)
puts a lower bound on the probability of estimating X
correctly from Y.
• Kolmogorov complexity: H(X) length of shortest
binary program for computing samples from X.
• Maximum compression is determined by H(X) + 1
• Shannon’s coding theorem: A channel Pr(Y | X) has
capacity C = maxPr(x)I(X ; Y). Asymptotically error-free
communication through the channel is possible iff source
entropy < channel capacity! C is the maximum possible
rate of error-free communication using the channel.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Mutual Information in DAGs
• In a causal graph model X – Y – Z (with the arcs
oriented in any directions), more remote ancestors
can never be more informative than direct parents.
• Thus, I(X; Z) I(X ; Y)
– Information-processing inequality for causal graphs.
– Special case: H(X) H[f(X)]
• Moreover, I(X ; Z | Y) = 0 (conditional
independence of X and Z given Y) unless both X
and Z point into Y.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Entropy and Classification Trees
• Let X be the dependent variable (the one we are
predicting) in a classification tree. Before asking any
questions, the marginal distribution of X has entropy
H(X).
• After asking “What is Y?” and learning that “Y = y”, the
conditional entropy of X becomes H(X | Y = y). This may
be larger or smaller than H(X).
• The expected value of H(X | Y) is < H(X)
– H(X | Y) < H(X) unless X and Y are independent.
• The question Y that maximizes I(X ; Y) is the most
informative question to ask (about X).
– A myopic criterion: the most informative single question
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Classification Tree Algorithms
Growing a tree requires splitting, stopping, pruning.
• Splitting criteria
– Entropy / mutual information
– Impurity / Gini coefficient of inequality
– Gain and error criteria
• Stopping criteria
– Statistical tests for significant differences between conditional
PDFs (F-test, chi-square)
– Merging leaves
• Pruning criteria
– Model cross-validation
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Model Cross-Validation
• Key idea: Choose the model to minimize crossvalidation error (not the naïve resubstitution error)
– Prune trees to minimize CVE. Myopic: Drop node to most
reduce CVE until no further reduction possible.
• The multiple training and test sets can be randomly
selected or deterministically selected (e.g., “leave-oneout” sampling)
• k-fold cross-validation can give an unbiased estimate
of true predictive performance of model
• Re-sampling data to obtain improved models is a key
idea of modern statistics
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Classification/Decision Tree
Algorithms in Practice
• Different decision tree (= classification tree)
algorithms use different splitting, stopping, and
pruning criteria.
• Examples of tree algorithms:
– CHAID: chi-square automatic interaction detection
– CART: Classification and regression trees, allows continuous
as well as discrete variables
– MARS: Multiple adaptive regression splines, smooths the
relations at tree-tips to fit together smoothly
– KnowledgeSeeker allows multiple splits, variable types
– ID3, C5.0, etc.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Generating CPTs with Trees
• Classification trees grown using parents of a
node provide exactly the information needed
to fill in its Conditional Probability Table
(CPTs).
• Each leaf node gives the conditional
probability distribution of the dependent
variable, given the combination of values of
its parents.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Learning Structure with Trees
• In a classification tree, the dependent variable
(root node) is conditionally independent of all
variables not in the tree, given the variables that
are in the tree. (At least as far as the treegrowing heuristic can discover.)
• Generically, all causal parents of a node appear
in every classification tree for that node.
• Starting with a childless node (output), we can
recursively seek direct parents of all nodes.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Eureka! Learning BNs is possible!
• We can use classification trees to perform
conditional independence tests to identify
potential causal graph structures.
• The leaves of the tree for each node provide the
numbers to put in its CPT.
• Many extensions, simplifications (e.g..,
parametric models), and computational
efficiencies have been and are being developed.
• Current status: From theory to software products
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Classification-tree Tests of CI
• Pop Quiz: How to use classification trees to decide
which of the following is correct from (Exposure,
Lifestyle, Response) data for many individuals?
• Alternative hypotheses:
Causal model 1: Exposure Response Lifestyle
Causal model 2: Exposure Lifestyle Response
Causal model 3: Lifestyle Exposure Response
• If a test (e.g., classification tree test for conditional
independence) can discriminate among alternative
models, perhaps it can be used to identify all possible
models from the data?
• The PC algorithm builds on this idea.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
The PC Algorithm
(Reference: Glymour C, Cooper GF, 1999. Computation, Causation &
Discovery, MIT Press)
1. Start with a complete undirected graph
2. Use conditional independence tests to successively remove edges
3. Orient any triple X – Y – Z as X Y Z if and only if X and Z are
dependent when conditioned on Y.
4. Orient any triple X Y – Z as X Y Z.
5. Orient any pair X – Y with a directed path through X to Y as X Y.
6. Repeat steps 3-5 until no more directions can be assigned.
This two-phase approach (eliminate edges via conditional independence,
orient remaining edges where possible) can be extended to latent
variables and possible sample selection biases Fast Causal
Inference (FCI) algorithm.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Pragmatics: Multiple Models
Basic ideas: D = Data M = Model S = Prediction statement
1. Assign CI relations among nodes based on data.
2. Average predictions over all plausible models.
Pr(statement S | data D) and model uncertainty:
Pr(S | D) = MPr(S | M)Pr(M | D), M = models. This is model-averaging.
Occam’s Window for selective model averaging (Madigan, Raftery, 94):
1. Create initial set of models (e.g., model with all nodes connected).
2. Consider all one-step changes (arc additions, deletions, reversals
that do not create cycles). Quantify each model from the data.
(Estimate CPTs by “obvious” MLE. Estimate likelihood of data
using DAG decomposition/Gibbs sampling)
3. Keep any new graphs that (a) are at least 5% as probable as the
most probable model known and (b) Contain no more-probable
subgraphs. Repeat 2 and 3 until no more graphs added.
4. Perform model-averaging using all the structures found.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Influence diagrams with Analytica
• Influence diagram software (e.g., Hugin,
Analytica) allows causal graph models with
decisions and value nodes to be quickly assembled
and managed if the required CPTs and input
probabilities Pr(y) are known.
• Every influence diagram can be automatically
converted to an equivalent Bayesian Network and
solved (e.g., Xiang and Ye, 2001)
– http://citeseer.nj.nec.com/492444.html
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
EXAMPLE: Analytica Influence Diagram
Description: Total cost to society = value of the lives saved cost of Emissions Reduction required to save those lives.
Base
Concentration
Emissions
Reduction
Control Cost
Factor
Concentration
Threshold
Health Damage
Factor
Health
Damage
Control
Cost
Population
Exposed
Excess
Deaths
Total
Cost
"Value of
a Life"
Analysis
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Clicking on “Total Cost” value node and
selecting mean value yields:
400M
Total
Cost
($/yr)
200M
0
0
0.2
0.4
0.6
Emissions Reduction Factor
0.8
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
1
Selecting “Probability Bands” yields:
Total Cost ($/yr)
1G
500M
0
0
0.2
Key
0.4
0.6
Emissions Reduction
0.8
Probability
0.05
0.25
0.5
0.75
0.95
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
1
Example Analytica Model Wrap-Up
• Run time = ~1 sec.
• Method = Monte-Carlo propagation of input
probability distributions through influence
diagram.
• Conclusion: An emissions reduction factor of 0.7
is a robust optimum for this decision problem,
given the probabilistic model (DAG influence
diagram model) mapping inputs to output
probabilities.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Conclusion: Bayes Rule and
Conceptual Models
Bayes’ Rule can help:
• Identify hazards/agents (sources of bad effects)
– Test hypothesized causative agents, pathways, adverse effects
via CI tests
• “Put everything together”
– Compose (modular, validated) components via MC
– Conditional independence
• “Drill down” on components
– Decompose relation into measurable components
– BNs provide testable decompositions
• Construct and validate system of causal connections
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Appendix
Decision Analysis Preview:
Beyond Influence Diagrams
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Individual Decision Analysis Credo
http://www.cs.mcgill.ca/~dprecup/courses/AI/Lectures/ai-lecture17x2.pdf
1. DECISIONS are represented by choices among risk
profiles (probabilities of consequences), one per act.
2. CONSEQUENCES are represented by numerical values
on one or more outcome scales
3. The best decision maximizes expected utility (correctly
defined)
4. UTILITIES reflect risk attitudes and preferences for
consequences
5. UNCERTAINTIES are represented by probabilities, often
derived from published models.
6. LEARNING takes place by conditioning on data (via
Bayes’ Rule and model-based likelihood functions.)
7. MODELS
can be represented by influence diagrams/BNs.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Decision Analysis Framework (Cont.)
•
Justification of DA paradigm: All people should prefer a higher
probability of gaining a preferred outcome!
–
•
•
In prospect [(p, x); (1 – p, 0)], x > 0, pick option with highest p.
Gives well-defined value of information for multi-stage and
sequential decision-making
Gives a universal recipe for prescribing what to do:
1.
2.
3.
4.
5.
6.
7.
Identify alternative acts/choices/decisions
Identify relevant consequences
Create causal model(s) and quantify risk profiles (i.e., probability of each
consequence, given each act)
Clarify value trade-offs. Assign utilities to outcomes.
Optimize decision. (Max EU s.t. model constraints)
Implement decision, monitor results.
Iterate!
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Solving Normal-Form Problems
• Ingredients: States, acts, probabilities, consequences,
utilities, expected utilities
– c(?, ?), Pr(?), u(?), EU(?)
– c(a, s), Pr(s), u(c), EU(a)
• Once the probabilities of consequences for each act
are known (given any available information), finding
expected utilities of acts is easy!
• Picking the EU-maximizing act is justified by a
powerful normative (prescriptive) theory of rational
decision-making.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
EU for “Normal Form” (Act-State-Result)
• Model: act consequence state
• Usually shown as a table.
• Act = decision rule or policy; state = all observed
values; Model determines Pr(state) and
Pr(consequence | act, state) = Pr(c | a, s).
• Q: Given Pr(c | a, s), Pr(s) and u(c), find EU(a)
• A: EU(a) = cPr(c | a)u(c)
= cu(c)[sPr(c | a, s)Pr(s)]
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Simple Example: What to Do?
State:
Act:
Launch
Don’t
launch
Probs:
Problem
No Problem
0
1
0.5
0.8
0.2
0.8
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Simple Example of Normal Form
State:
Act:
Launch
Don’t
launch
Probs:
Problem
No Problem
0
1
EU
0.80
0.5
0.8
0.74
0.2
0.8
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Rocket Launch Problem
• An unmanned rocket is being launched.
• An unreliable warning light comes on with probability
0.5 if the rocket has a problem and with probability
0.333 even if it has no problem.
• The loss is 2 from not launching when there is no
problem. It is 5 from launching when there is a
problem. It is 0 otherwise.
• The prior probability of no problem is p.
• How great must p be to justify launching even if the
warning light comes on? (Assume that the goal is to
minimize expected loss.)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
A More Realistic/Complex Example: EU Table
State:
Rule:
Problem
No Problem
EU
Launch if and
only if no light
?
?
?
Always launch
0
1
p
0.5
0.8
1-p
p
Never launch
Probs:
.5 + .3p
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
A More Realistic/Complex Example: EU Table
State:
Act:
Problem
No Problem
Launch if and
only if no light
EU = 0.25 EU = (1/3)(.8) + .25(1-p)
= (.5)(.5) (2/3)*1 = .933 + .933p
EU
Always launch
Never launch
Probs:
0
1
0.5
0.8
1-p
p
p
.5 + .3p
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Extensions of Normal Form
Framework/
Representation
Act
State
Payoffs
Optimization
algorithms
Payoff matrix
(Raiffa, 1968)
row
column
Payoff
in cell
Eliminate
dominated rows,
then choose row
to maximize
expected utility
Normal Form
a = act
s = state
c(a, s) =
conseque
nce
model
Mathematical
programming:
Max cu(c)p(c | a)
aA
s.t. p(c | a) =
sc(a,
s)p(s | a)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Normal Form Extensions (Cont.)
Framework/
Representation
Act
State
Payoff
Optimization
algorithms
Decision tree
(Raiffa, 1968)
Decision
at each
choice
node
Outcome
at each
chance
node
Payoffs at
tips of
tree
Averaging out &
folding back
(simple dynamic
programming)
Influence
diagram
Decision
at each
choice
node
Outcome
at each
chance
node
Value at
value
node
Arc
reversal
•MCMC
•Variable
elimination
•Etc.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Normal Form Extensions (Cont.)
Framework/
Representation
Act
State
Payoff
Optimization
algorithms
Markov
decision
process
(MDP);
Choice
of act in
each
state
Reward per
unit time in
states
and/or at
transitions
Value iteration
Policy iteration
MCMC
•Dynamic
programming
Stochastic
optimal
control
Decision
rule (maps
observed
history to
control
trajectory)
Transition
time and
next state
at each
visit to a
state
Evolution
of state
trajectory,
given
control
trajectory
Function
of state
and/or
control
trajectory
Mathematical
programming:
Max cu(c)p(c | a)
aA
s.t. p(c | a) =
sc(a,
s)p(s | a)
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
Why use decision analysis?
A: Intuitive decisions are often bad/dominated!
Example:
A = gain $240 for sure vs. B = 25% to win $1,000 (else 0)
C = lose $750 for sure vs. D = 75% to lose $1,000 (else 0)
• You must choose one prospect from the pair (A, B) and one
prospect from (C, D). What do you choose? (A, C), (A, D),
(B, C), or (B, D)?
• Many prefer (A, D) to (B, C)
• But (A, D) gives 25% chance at $240, 75% chance of -$760
• (B, C) gives 25% chance to win $250, 75% chance of -$750
• So, (B, C) dominates (A, D).
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]
A Formula for Expected Utility
• Let Pr(c | a) be the probability of receiving consequence
or result c if act or decision a is taken. a c
• Pr(c | a) for all c = “Risk profile” for act a.
• Let u(c) = utility of consequence c. (If u(c) is a utility
function representing preferences for consequences, so is
ru(c) + s, for any r > 0 and for any s.)
– c is equally preferred to {probability u(c) of best outcome,
probability 1 – u(c) of least-preferred outcome}.
Then EU(a) = cPr(c | a)u(c) is the expected utility of act a.
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]