Presentation file I - Discovery Systems Laboratory

Download Report

Transcript Presentation file I - Discovery Systems Laboratory

Machine Learning Methods for Decision
Support and Discovery
Constantin Aliferis M.D., Ph.D., Ioannis Tsamardinos Ph.D.
Discovery Systems Laboratory,
Department of Biomedical Informatics,
Vanderbilt University
2004 MEDINFO Tutorial
7 September 2004
1
Acknowledgments
Alexander Statnikov for code and for
putting together the Resource Web page and
CD
 Doug Hardin, Pierre Massion, Yindalon
Aphinyanaphongs, Laura E. Brown, and
Nafeh Fananapazir for access to data and
results for case studies

2
Goal
The purpose of this tutorial is:



To help participants develop a solid understanding of some of the most
useful machine learning methods.
To give several examples of how these methods can be applied in
practice, and
To provide resources for expanding the knowledge gained in the tutorial.
3
Outline
Part I: Overview and Foundations
1. Tutorial Overview and goals
2. Importance of Machine Learning for discovery and decision-support system
construction
3. A framework for inductive Machine Learning
4. Generalization and Over-fitting
5. Quick review of data preparation and model evaluation
6. Families of methods
a. Bayesian classifiers
break
b. Neural Networks
c. Support Vector Machines
break
7. Quick Review of Additional families
a. K-Nearest Neighborhs,
b. Clustering,
c. Decision Tree Induction,
d. Genetic Algorithms
4
Outline (cont’d)
Part II.More Advanced Methods and Case Studies
1. More Advanced Methods
a. Causal Discovery methods using Causal Probabilistic Networks
b. Feature selection
break
2: Case Studies
a.
b.
c.
Building a diagnostic model from gene expression data
Building a diagnostic model from mass spectrometry data
Categorizing text into content categories
break
d. Discovery of causal structure using Causal Probabilistic Network induction (demo)
3. Conclusions and wrap-up
a. Resources for machine learning
b. Questions & feedback
5
Definitions & Importance of
Machine Learning
6
A (Simplified) Motivating
Example
Assume we wish to create a decision
support system capable of diagnosing
patients according to two categories: Lung
Cancer and Normal.
 The input to the system will be of array
gene expression measurements from tissue
biopsies.

7
A (Simplified) Motivating
Example
Little is currently known about how gene
expression values differentiate human lung
cancer tissue from normal tissue.
 Thus we will use an automated approach in
which a computer system will examine
patients’ array gene expression
measurements and the correct diagnosis
(provided by a pathologist).

8
A (Simplified) Motivating
Example
The system will produce a program that
implements a function that assigns the
correct diagnosis to any pattern of array
gene expression data to the correct
diagnostic label (and not just the inputoutput patterns of the training data).
 Thus the system will learn (i.e., generalize)
from training data the general input-output
function for our diagnosis problem.

9
A (Simplified) Motivating
Example
What are the principles and specific
methods that enable the creation of such
learning systems?
 What flavors of learning systems currently
exist?
 What are their capabilities and limitations?

…These are some of the questions we will be
addressing in this tutorial
10
What is Machine Learning (ML)? How is it different
than Statistics and Data Mining?
Machine Learning is the branch of Computer Science (Artificial
Intelligence in particular) that studies systems that learn.
Systems that learn = systems that improve their performance with
experience.
11
What is Machine Learning (ML)? How is it different
than Statistics and Data Mining?
Typical tasks:
 image recognition,
 Diagnosis,
 elicitation of possible causal structure of problem domain,
 game playing,
 solving optimization problems,
 prediction of structure or function of biomolecules,
 text categorization,
 identification of relevant variables, etc.
12
Indicative Example applications of ML in
Biomedicine
Bioinformatics
• Prediction of Protein Secondary Structure
• Prediction of Signal Peptides
• Gene Finding and Intron/Exon Splice Site Prediction
• Diagnosis using cDNA and oligonucleotide array gene
expression data
• Identification of molecular subtypes of patients with
various forms of cancer
Clinical problem areas
• Survival after Pneumonia (CAP)
• Survival after Syncope
• Diagnosis of Acute M.I.
• Diagnosis of Prostate Cancer
• Diagnosis of Breast Cancer
• Prescription and monitoring in hemodialysis
• Prediction of renal transplant graft failure
13
Importance of ML: Task Types




Diagnosis (what is the most likely disease given a set of clinical
findings?),
Prognosis (what will be the outcome after a certain treatment has been
given to a patient?),
Treatment selection (what treatment to give to a specific patient?),
Prevention (what is the likelihood that a specific patient will develop
disease X if preventable risk factor Y is present?).
ML has practically replaced Knowledge Acquisition for building Decision
Support (“Expert”) Systems.
14
Importance of ML: Task Types (cont’d)

Discovery
– Feature selection (e.g., what is a minimal set of laboratory values needed
for pneumonia diagnosis?);
– Concept formation (e.g., what are patterns of genomic instability as
measured by array CGH that constitute molecular subtypes of lung cancer
capable of guiding development of new treatments?);
– Feature construction (e.g., how can mass-spectrometry signals be
decomposed into individual variables that are highly predictive for
detection of cancer and can be traced back to individual proteins that may
play important roles in carcinogensis?); information retrieval query
construction (e.g., what are PubMed Mesh terms that predict with high
sensitivity and specificity whether medical journals talk about treatment?);
– Questions about function, interactions, and structure (e.g., how do
genes and proteins regulate each other in the cells of lower and higher
organisms? what is the most likely function of a protein given the
sequence of its aminoacids?), etc.
15
What is Machine Learning (ML)? How is it different
than Statistics and Data Mining?
Broadly speaking ML, DM, and Statistics have similar
goals (modeling for classification and hypothesis
generation or testing).
Statistics has traditionally emphasized models that can be solved analytically
(for example various versions of the Generalized Linear Model – GLM). To
achieve this both restrictions in the expressive power of models and their
parametric distributions are heavily used.
Data Mining emphasizes very large-scale data storage, integration, retrieval
and analysis (typically the last one as a secondary focus).
Machine Learning seeks to use computationally powerful approaches to learn
very complex non- or quasi-parametric models of the data. Some of these
models are closer to human representations of the problem domain per se (or
of problem solving in the domain)
16
Importance of ML:
Data Types and Volume

Overwhelming production of data:
– Bioinformatics (mass-throughput assays for gene
expression, protein abundance, SNPs…)
– Clinical Systems (EPR, POE)
– Bibliographic collections
– The Web: web pages, transaction records,…
17
Importance of ML:
Reliance on Hard data and evidence

Machine learning has become critical for Decision
Support System Construction given extensive
cognitive biases and the corresponding need to
base MDSSs on hard scientific evidence and highquality data
18
Supplementary: Cognitive Biases

Main thesis:
– human cognitive abilities are tailored to support
instinctive, reflexive, life-preserving reactions traced
back in our evolution as species. They are not designed
for rational, rigorous reasoning such as the reasoning
needed in science and engineering.

In other words, there is a disconnect between our
innate cognitive ability and the complexity of
reasoning tasks required by the explosive
advances in science and technology in the last few
hundred years.
19
Supplementary: But is the Cognitive Biases
Thesis Correct?



Psychology of Judgment and Decision Making (Plous)
Tversky and Kahneman (Judgment under uncertainty:
Heuristics and Biases)
Methods of Influence (Cialdini)
And highly-recommended supplementary information can
be found in:


Professional Judgment (Elstein)
Institute of Medicine’s Report in Medical Errors (1999)
20
Supplementary: Tversky and Kahneman
“Judgment under uncertainty: Heuristics and Biases”

This work (a constellation of psychological studies
converging to a description of human decision making
under uncertainty) is very highly regarded and influential It
was recently (2002) awarded the Nobel Prize of Economics.

Main points:
–
–
–
People use a few simple heuristics when making judgments under
uncertainty
These heuristics sometimes are useful and other times lead to severe
and systematic errors
These heuristics are: representativeness, availability and anchoring
21
Supplementary: Representativeness


E.g., : the probability P that patient X has disease D given
that she has findings F is assessed by the similarity of X to
a prototypical description of D (found in a textbook, or
recalled from earlier practice and training).
Why is this wrong?
–
–
–
–
Reason #1: similarity ignores base-rate of D
Reason #2: similarity ignores sample size
Reason #3: similarity ignores predictability
Reason #4: similarity is affected by redundant features
22
Supplementary: Availability
• E.g., : the probability P that patient X with disease D given
that she is given treatment T will become healthy is
assessed by recalling such occurrences in one’s prior
experience
• Why is this wrong?
– Reason #1: availability is influenced by familiarity
– Reason #2: availability is influenced by salience
– Reason #3: availability is influenced by elapsed time
– Reason #4: availability is influenced by rate of abstract terms
– Reason #5: availability is influenced by imaginability
– Reason #6: availability is influenced by perceived association
23
Supplementary: Adjustment and Anchoring
• E.g., : the probability P that patient X has disease D given
that she has findings F is assessed by making an initial
estimate P1 for findings F1 and updating it when new
evidence F2, F3, …, and so on, becomes available.
• What goes wrong?
– Problem #1: initial estimate over-influences the final estimate
– Problem #2: initial estimate is often based on quick and then
extrapolated calculations
– Problem #3: people overestimate the probability of conjunctive
events
– Problem #4: according to initial anchor, people’s predictions are
calibrated differently
24
Supplementary: Additional
• Methods of Influence (Cialdini, 1993):
– Reciprocation
– Commitment & Consistency
– Social Proof
– Liking
– Authority
– Scarcity
• Professional Judgment (Dowie and Elstein 1988)
• Institute of Medicine’s Report in Medical Errors (1999)
25
Supplementary: Putting MDSSs and
Machine Learning in Historical Context

40s
– Foundations of Formal Decision-Making Theory by VonNeuman and
Morgerstern

50s
– Ledley and Lusted lay out how logic and probabilistic reasoning can help in
diagnosis and treatment selection in medicine

60s
– Applications of Bayes theorem for diagnosis and treatment selection pioneered
by Warner and DeDombal
– Medline (NLM)

Early 70s
– Ad-hoc systems (Myers et al; Pauker et al)
– Study of Cognitive Biases (Kahneman, Tversky)

Late 70s
– Rule-based systems (Buchanan & Shortliffe)
26
Supplementary: Milestones
in MDSSs
• 80s
– Analysis of ad-hoc and RBSs (Heckerman et al.)
– Bayesian Networks (Pearl, Cooper, Heckerman et al.)
– Medical Decision Making as discipline (Pauker)
– Literature-driven decision support (Renels & Shortliffe)
• Early 90s
– Web-enabled decision support & wide-spread information retrieval
– Computational Causal Discovery (Pearl, Spirtes et al. Cooper et al.)
– Sound re-formulation of very large ad-hoc systems (Shwe et al)
– Analysis of Bayesian systems (Domingos et al, Henrion et al.)
– Proliferation of focused Statistics and Machine Learning MDDSs
– First-order Logics that combine classical FOL with probabilistic
reasoning, causation and planning (Haddaway)
27
Supplementary: Milestones
in MDSSs
• Late 90s
– Efficient Inference for very large probabilistic systems (Jordan et al)
– Kernel-based methods for sample-efficient learning (Vapnik)
– Evidence-Based Medicine (Haynes et al)
• 21st Century
– Diagnosis, Prognosis and Treatment selection (a.k.a. “Personalized
medicine” or “Pharmacogenomics”) based on molecular information
(proteomic spectra, gene expression arrays, SNPs) collected via massthroughput assaying technology, and modeleld using machine learning
methods
– Provide-order entry delivery of advanced decision support
– Advanced representation, storage, retrieval and application of EBM
information (guidelines, journals, meta-analyses, clinical bioinformatics
models)
28
Importance of ML

How often ML techniques are being used? #Articles in Medline (in parentheses
last 2 years):
– Artificial Intelligence:
– Expert systems:
– Neural Networks
– Support Vector Machines
– Clustering
– Genetic Algorithms
– Decision Trees
– Bayesian (Belief) Networks
– Bayes (Bayesian Statistics + Nets)
Compare to:
– Regression
– Knowledge acquisition
– Knowledge representation
– 4 major Symbolic DSS
(Internist-I, QMR, ILIAD, DxPlain)
– Rule-based systems
12,441
2,271
5,403
163
17,937
2,798
4,958
1,627
4,369
(2,358)
(121)
(1,158)
(121)
(4,080)
(969)
(752)
(585)
(561)
164,305
310
227
145
(28,134)
(56)
(27)
(10)
802
(151)
29
Importance of ML
– Importance of ML becomes very evident in cases where:
» data analysis is too time consuming (e.g., classify web pages or medline
documents into content or quality categories)
» There is little or no domain theory
What is the diagnosis?
Is this an early cancer?
30
A Framework for Inductive ML
& Related Introductory Concepts
31
What is the difference between supervised and
unsupervised ML methods?
Supervised learning:
- Give to the learning algorithm several instances of input-output
pairs; the algorithm learns to predict the correct output that
corresponds to some inputs (not only previously seen but also
previously unseen ones (“generalization”)).
- In our original example: show to learning algorithm array gene
expression measurements from several patient cases as well as
normal subjects; then the learning algorithm induces a classifier
that can classify a previously unseen subject to the correct
diagnostic category given the gene expression values observed in
that subject
32
Classification
A
B
C
D
TRAIN
INSTANCES
CLASSIFIERINDUCTIVE
ALGORITHM
CLASSIFIER
E
APPLICATION
INSTANCES
A1, B1, C1, D1, E1
A2, B2, C2, D2, E2
An, Bn, Cn, Dn, En
CLASSIFICATION
PERFORMANCE
33
What is the difference between supervised and
unsupervised ML methods?
Unsupervised learning:
- Discover the categories (or other structural properties of the
domain)
- Example: give the learning algorithm gene expression
measurements of patients with Lung Cancer as well as normal
subjects; the algorithm finds sub-types (“molecular profiles”) of
patients that are very similar to each other, and different to the rest
of the types. Or another algorithm may discover how various genes
interact among themselves to determine development of cancer.
34
Discovery
A
B
C
D
TRAIN
INSTANCES
E
STRUCTUREINDUCTION
ALGORITHM
A
B
C E
D
A1, B1, C1, D1, E1
A2, B2, C2, D2, E2
PERFORMANCE
An, Bn, Cn, Dn, En
35
A first concrete attempt at solving our
hypothetical diagnosis problem using a
particular type of learning approach
(decision tree induction)
36
Decision Tree Induction
An example decision tree to solve the problem of how to classify
subjects into lung cancer vs normal
Gene139
Over-expressed
Normallyexpressed
Under-expressed
Gene202
Lung cancer
Gene8766
Over-expressed
Normallyexpressed
Over-expressed
Normallyexpressed
Lung cancer
Normal
Lung cancer
Normal
37
How Can I Learn Such A Decision Tree
Automatically?
A basic induction procedure is very simple in principle:
1.
2.
3.
4.
5.
Start with an empty tree
Put at the root of the tree the variable that best classifies the
training examples
Create branches under the variable corresponding to its values
Under each branch repeat the process with the remaining
variables
Until we run out of variables or sample
38
How Can We Generalize From
This Example?
39
A General Description of supervised Inductive
ML
Inductive Machine Learning algorithms can be designed and analyzed
using the following framework:



A language L in which we express models. The set of all possible models
expressible in L constitutes our hypothesis space H
A scoring metric M tells us how good is a particular model
A search procedure S helps us identify the best model in H
Space of all
possible models
x
x
x
x
x
x
x
x
x
x
x
x
xx
Models in H
40
A General Description of supervised Inductive
ML
In our decision tree example:




A language L in which we express models = decision trees
The hypothesis space H = space of all decision trees that can be
constructed with genes 1 to n
A scoring metric M telling us how good is a particular model = min
(classification error + model complexity)
A search procedure S = greedy search
41
How can ML methods fail?
Wrong language Bias: best model is not in H
Example: we look for models expressible as discrete
decision trees but the domain is continuous
x
x
x
x
x
x
x
x
x
x
x
x
x
x
Space of all
possible models
Models in H
42
How can ML methods fail?
Search Failure: best model is in H but search fails to
examine it
Example: greedy search fails to capture a strong gene-gene
interaction effect
Space of all
possible models
xx
x
x
x
x
x
x
x
x
x
x
x
x
Models in H
43
Generalization & Over-fitting
44
Generalization & Over-fitting


It was mentioned previously that a good learning
program learns something about the data beyond
the specific cases that have been presented to it.
Indeed, it is trivial to just store and retrieve the
cases that have been seen in the past (“rote
learning” implemented as a lookup table). This
does not address the problem of how to handle
new cases, however.
45
Generalization & Over-fitting



In supervised learning we typically seek to minimize
“i.i.d.” error, that is error over future cases (not used in
training). Such cases contain both previously encountered
as well as new cases.
“i.i.d.” = independently sampled and identically distributed
problem instances.
In other words, the training and application samples come
from the same population (distribution) with identical
probability to be selected for inclusion and this
population/distribution is time-invariant.
(Note: if not time invariant then by incorporating time as
independent variable or by other appropriate
transformations we restore the i.i.d. condition)
46
Supplementary: Generalization &
Over-fitting

Consider now the following simplified diagnostic classification
problem: classify patients into cancer (red/vertical pattern) versus
normal (green/no pettern) on the basis of the values of two gene values
(gene1, gene2)
Gene1
Gene2
47
Supplementary: Generalization &
Over-fitting

The diagonal line represents a perfect classifier for this problem (do
not worry for the time being how to mathematically represent or
computationally implement the line – we will see how to do so in the
Neural Network and Support Vector Machine segments):
Gene1
Gene2
48
Supplementary: Generalization &
Over-fitting

Let’s solve the same problem from a small sample; one such possible
small sample is:
Gene1
Gene2
49
Supplementary: Generalization &
Over-fitting

We may be tempted to solve the problem with a fairly complicated
line:
Gene1
Gene2
50
Supplementary: Generalization &
Over-fitting

In which case we get several errors:
Gene1
Gene2
51
Supplementary: Generalization &
Over-fitting

…whereas with a simpler line…:
Gene1
Gene2
52
Supplementary: Generalization &
Over-fitting

…a much smaller error:
Gene1
Gene2
53
Generalization & Over-fitting



In general, over-fitting a model to the data means that instead of
general properties of the population from which the data is sampled we
learn idiosyncracies (i.e., non-representative properties) of the sample
data.
Over-fitting and poor generalization (i.e., the error in the overall
population (“true error”) is large) are synonymous as long as we have
learned the training data well (i.e., “small apparent error”).
Over-fitting is not only affected by the “simplicity” of the classifier
(e.g., straight vs wiggly line) but also by:
–
–
–
–
the size of the sample,
the complexity of the function we wish to learn from data,
the amount of noise, and
the number and nature (continuous discrete, ordered, distribution, etc.) of
the variables.
54
Generalization & Over-fitting






We wish to particularly emphasize the danger of grossly over-fitting
the learning when the number of predictive variables is large
relative to the available sample. Consider for example the following
situation:
Assume we have 5 binary predictors and two samples, and wish to
classify instances into two classes
The predictors can encode 2^5=32 possible distinct patterns. Assume
all patterns are equally probable.
Hence the chances of the two cases having different predictive patterns
are 31/32=97%.
Thus in 97% of our samples of size two, the five variables are
sufficient to identify perfectly the case.
Combine this with a powerful enough learning algorithm (i.e., one that
can effectively associate any pattern with the desired outcome) and it
follows that in 97% of samples, one gets optimal apparent error even
when there is no relationship between the target variable and the
predictive variables!
55
Generalization & Over-fitting


This situation is particularly relevant in bioinformatics in which we
routinely have >10,000 continuous variables, noise, and <500 samples.
Under these conditions every training instance has almost always a
unique value set of the predictive variables;
thus if one is not careful, the learning algorithm can simply learn what
amounts to a lookup table (i.e., by associating the unique predictor
signature with the outcome of that case for every case).
56
Generalization & Over-fitting



So how does one avoid over-fitting?
Via a variety of approaches:
– Use learning algorithms that intrinsically (by design) generalize
well
– Pursue simple (“highly biased”) classifiers for small samples
– Choose unbiased and low-variance statistical estimators of the true
error and employ them sparingly
Very important rule:
Estimate the performance (true error) of a model with data you did
not use to construct the model
57
Generalization & Over-fitting


Avoiding over-fitting will be a primary concern of ours in
this tutorial
We will outline here some specific cross-validation
procedures and use them to build models in the case
studies segment
58
Generalization & Over-fitting

Hold-out cross-validation method:
– Split data in Train and Test data
– Learn with Train and estimate true error with Test
test
data
train
59
Generalization & Over-fitting

N-fold Cross-validation:
– Split data in Train and Test data n times such that union of test sets is
full data set
– Learn with Train and estimate true error with Test in each split
separately
– Average test performance
test
test
test
test
train
train train train train
train train
test
test
60
Generalization & Over-fitting

Leave-one-out =
n-fold C.V. where n is equal to the number of data instances
61
Supplementary: Generalization &
Over-fitting

Stratified (balanced) Cross-validation:
An n-fold C.V. in which (by design) the target class has the same
distribution as in the full dataset
62
Supplementary: Generalization &
Over-fitting

Nested Cross-validation:
– Assume we wish to apply cross-validation to find the best parameter
values for parameter C for a classifier from parameter value set
[1,..,100].
– One way to use C.V. to select the best values for C is to apply the
holdout method 100 times, one for each value of C and select the value
that gives the best error in the test sample.
– The problem with this approach is that the true error estimate is not
reliable since it is produced by running the best model on a test set
that was used to derive the best model.
– In other words, our data used to estimate the true error can no longer
be used to produce unbiased estimates since it also guided the
selection of the model.
63
Supplementary: Generalization &
Over-fitting
solution:

– Split the Train data into two (Traintrain and Validation),
– Use the validation set to find the best parameters,
– Use the test set to estimate the true error
test
Validation
data
Traintrain
64
Supplementary: Generalization &
Over-fitting

If the sample is small, the nesting can be repeated with different
assignment of the test set (i.e., nested n-fold C.V.):
Te
data
V TT
V TT
TT V TT
TT V TT
V

V
Te
One can also nest LOO with n-fold C.V. or LOO with LOO
65
Supplementary: Generalization &
Over-fitting

Important notes:
– Estimating the true error of the best model is a separate
procedure than generating the best model; the former requires
an additional layer of nesting our cross-validation
– When there are several types of parameters to be selected (e.g.,
normalization, discretization, classifier parameters) one can:
» do one n-fold cross-validation using the cartesian product of all
parameters which uses more sample but yields more conservative
true error estimates, or one can
» nest the cross-validation to as many nesting levels as the number
of distinct parameters that need optimization, which yields more
unbiased true error estimates but uses less sample
66
Quick Notes On Data Preparation
67
Data preparation

Non-specific
– Is the data lawfully in our disposal?
– Are there issues that deal with patient privacy and
confidentiality as well as intellectual property issues
that need be resolved?
– How were the data produced, by whom, when, with
what purpose in mind?
– Any known or plausible biases present?
– References in the literature?
– Is there a codebook with clear definitions of variables
location, date of creation, method of creation, value list,
value meaning, missing value codes and meanings,
history of the database and its elements?
68
Data preparation

Data specific
–
–
–
–
Valid values?
Variable distributions?
Descriptive statistics?
Mechanisms of missingness & imputation
69
Data preparation

Learner specific
–
–
–
–
–
–
–
–
De-noising
Scaling/Normalization
Discretization
Transforming variable distributions
Co-linearities
Homoskedasticity
Outliers
Feature selection
70
Data preparation

Task specific
– Reconstruct hidden or distorted signals from
observed ones
– Infer presence of hidden variables, determine
their cardinality and values
– Stem, normalize, extract terms
– Weight or Project variables
71
Basic Evaluation Metrics
72
Evaluation Metrics


T: test
D: disease
T+
T-

D+
D-
a
b
a+b
c
d
c+d
a+c
b+d
a+b+
c+d
Accuracy (0/1 loss):
Number of correct classifications
Number of total classifications
a+d
a+b+c+d
73
Evaluation Metrics


T: test
D: disease
T+
T-

D+
D-
a
b
a+b
c
d
c+d
a+c
b+d
a+b+
c+d
Sensitivity: proportion of true positives identified by test
a
a+c
74
Evaluation Metrics


T: test
D: disease
T+
T-

D+
D-
a
b
a+b
c
d
c+d
a+c
b+d
a+b+
c+d
Specificity: proportion of true negatives identified by test
d
b+d
75
Evaluation Metrics


T: test
D: disease
T+
T-

D+
D-
a
b
a+b
c
d
c+d
a+c
b+d
a+b+
c+d
Positive predictive value (PPV): proportion of true positives over test
positives
a
a+b
76
Evaluation Metrics


T: test
D: disease
T+
T-

D+
D-
a
b
a+b
c
d
c+d
a+c
b+d
a+b+
c+d
Negative predictive value (NPV): proportion of true negatives over test
negatives
d
c+d
77
Evaluation Metrics

Mean squared error (MSE) (“Quadratic loss”):
|D|
1/|D|


Si (predicted_value(i)-true_value(i))
2
|D| = cardinality of test dataset
Suitable for continuous outputs
78
Evaluation Metrics

ROC area
1
Sensitivity
0
1
1-Specificity
79
Evaluation Metrics

In information retrieval:
– “Precision” is the name for “PPV” and
– “Recall” is the name for “Sensitivity”
80
Evaluation Metrics

Recall-precision curve (and area under it):
100%
Recall
0
100%
Precision
81
Bayesian Classifiers

Note: we will be discussing Bayesian classifiers using the
diagnostic context, (which in terms of applications and
historical development of the related ideas is
representative). However the ideas discussed readily
translate to any type of learning for classification and
concomitant decision support function.
82
Bayesian Classifiers

Bayes’ Theorem (or formula) says that:
P (D) * P(F| D)
P (D | F) =
P(F)
Where:
– P(D) is the probability of some disease D in the general population (i.e.,
before obtaining some evidence F), a.k.a. as “disease prior probability”
– P(F) is the probability of some evidence in the form of findings such as
lab tests, physical examination findings etc.
– P(F | D) is the probability of the same findings given that someone has
disease D
– P(D | F) is the probability of disease D given that someone has the
findings F (i.e., after obtaining some evidence F), a.k.a. as “disease
posterior probability”
83
Bayesian Classifiers



Since the most likely diagnosis is the one with the
maximum a posteriori probability, Bayes’ formula allows
one to solve the differential diagnosis problem, as well as
as any classification learning problem that can be cast as
supervised learning
Indeed, in the sample limit, there cannot be a better way to
infer the most likely diagnosis than Bayes’ theorem and
thus it serves as the theoretical gold standard against which
statistical and machine learning classifiers are measured in
terms of true error.
In that context it is referenced as the “Bayes Optimal
Classifier”
84
Bayesian Classifiers

Note that Bayes’ formula can be applied to
diagnosis of multiple possibly inter-depended
diseases and non-independent findings since
where there is “F” one can place a vector of
findings (e.g., F1+, F2-, F3-,…,Fn+) and where
there is “D” one can put a vector of diseases (e.g.,
D1-, D2-, D3+,…,Dm+) .
85
Bayesian Classifiers

Further, the intuitive interpretation of Bayes’ rule
is that of updating belief about the patient’s true
state: before seeing F we have some prior belief
(measured as probability) that the patient has
disease(s) D. After seeing F we update the prior
belief (diagnosis) to reflect (incorporate) the new
evidence F; the new belief is the posterior
produced by Bayes’s rule
86
Bayesian Classifiers


Unfortunately there is a significant drawback with
straightforward Bayes rule: we need number of
probabilities, storage and computational time that is
exponential to the number of findings (i.e., |F|) and the
number of diseases (i.e., |D|).
This means that for any diagnostic or other classification
problem of non-trivial size (measured in terms of |F| and
|D|) straight Bayes is not feasible
87
Bayesian Classifiers


This has led to a simplified version in which we disallow
multiple diseases (i.e., require that the patient may have
only one disease at a time) and we require that findings are
independent conditioned on the disease states (note: this
does not mean that the findings are independent in general,
but rather, that they are conditionally independent).
The combination of these two assumptions yields required
number of probabilities, storage and computational time
that is linear to the number of findings and the number of
diseases.
88
Simple (a.k.a. “Naive”) Bayes


Application of Bayes’ rule with the Mutual Disease Exclusivity
assumption (MEE) and the Conditional Independence assumption
(FCID) is known as “Simple Bayes’ Rule”, “Naïve Bayes”, or, rather
non-tastefully, as “Idiot’s Bayes”.
Simple Bayes can be implemented by plugging in the main formula:
P(F | D) = P P(Fi | Dj)
i,j
and P( F) = S P(Fi, Dj) = S [P(Fi | Dj) * P(Dj)]
i,j
i,j
where Fi is the ith (singular) finding and Dj the jth (singular) disease.

Several other (mathematically) equivalent formulations exist using
sensitivities and specificities, likelihood ratios or other convenient
building blocks
89
Simple (a.k.a. “Naive”) Bayes


Simple Bayes was applied very early (from the early 60’s
and on) in Medical Informatics for diagnosis and optimal
treatment selection as well as sequential testing.
See for example the classic papers by Warner et al (1961),
DeDombal (1972), Leaper (1972), Gorry and Barnett
(1968)
90
Variants of Simple Bayes


Since the MEE and FCID assumptions clearly are violated in many
medical contexts, researchers early on sought to relax them and created
modified Bayesian classifiers that approximated P(F |D) (Fryback
1978) or assumed independent diseases and multiple diagnoses
(“Multi-membership model” of Ben-Basat, 1980).
These models (and many others not mentioned here) have primarily
historical significance currently, because:
– (a) It was shown (1997, Domingos and Pazzani) that the MEE and
FCID assumptions are not necessary but sufficient conditions for a
wide variety of target functions under 0/1 loss
– (b) Bayesian Networks were invented and as we will see next they
allow flexible representation of dependencies so that parsimony and
tractability is maintained without compromising soundness
– (c) several other restricted Bayesian classifiers have been shown to
perform well in a variety of practical settings
91
Bayesian Networks
92
Supplementary: Bayesian Networks:
Overview





A Note On Terminology
Brief Historical Perspective
The Bayesian Network Model and Its Uses
Learning BNs
Reference & Resources
93
Supplementary: Bayesian Networks: A Note
On Terminology



Bayesian Networks (or “Nets”): generic name
Belief Networks: subjective probability-based, non-causal
Causal Probabilistic Networks: frequentist probability-based,
causal
94
Supplementary: Bayesian Networks: A Note
On Terminology

Various other names for special model classes:
– Influence Diagrams (Howard and Mathesson): incorporate decision and
utility nodes. Used for decision analyses
– Dynamic Bayesian Networks (Dagum et al.): temporal semantics. Used
as alternatives to multivariate time series models and dynamic control
– Markov Decision Processes (Dean et al.): for decision policy
formulation in temporally-evolving domains
– Modifiable Temporal Belief Networks (Aliferis et al.): for wellstructured and very large problem models that involve time and
causation and cannot be stored explicitly
95
Supplementary: Bayesian Networks:
Historical Perspective




Naïve Bayesian Model (mutually exclusive diseases, findings
independent given diseases) predominant model for medical decision
support systems in the 60’s and early 70’s because it requires linear
number of parameters and computational steps (to total findings and
diseases)
Theorem 1 (Minsky, Peot): Naïve Bayes heuristic usefulness (expected
classification performance) over all domains gets exponentially worse
as number of variables increases
Theorem 2 (see Mitchell): Full Bayesian classifier=perfect classifier
However FBC impractical and serves as analytical tool only
96
Supplementary: Bayesian Networks:
Historical Perspective


In the late 70’s and up to mid-80’s this led to: Production Systems (i.e.,
rule-based systems, that is simplifications of first-order logic). The
most influential version of PSs (Shortliffe, Buchanan) handled
uncertainty through a modular account of subjective belief (the
Certainty Factor Calculus)
Theorem 3 (Heckerman): The CFC is inconsistent with probability
theory unless rule-space search graph is a tree. Consequently, forward
and backward reasoning cannot be combined in a CFC PS and still
produce valid results
97
Supplementary: Bayesian Networks:
Historical Perspective

That led to research (late 80s) in Bayesian Networks which can vary
expressiveness between the full dependency (or even the full Bayesian
classifier) and the Naïve Bayes model (Pearl, Cooper)
Variables
Conditionally
Independent Given
Categories &
Bayesian Networks
Variables
Conditionally
Dependent
Categories Mutually
Exclusive
98
Supplementary: Bayesian Networks:
Historical Perspective



In the early 90’s researchers developed the first algorithms
for learning BNs from data (Herskovits, Cooper,
Heckerman)
In the mid 90’s researchers (Spirtes, Glymour, Sheines,
Pearl, Verma) discovered methods to learn CPNs from
observational data(!). We will cover the foundations of this
in the causal discovery segment.
Overall BNs is the brain child of computer scientists,
medical informaticians, artificial intelligence researchers,
and industrial engineers and is considered to be the
representation language of choice for most biomedical
Decision Support Systems today
99
Bayesian Networks: The Bayesian Network
Model and Its Uses


BN=Graph (Variables (nodes), dependencies (arcs)) + Joint Probability
Distribution + Markov Property
Graph has to be DAG (directed acyclic) in the standard BN model
A
B
C
P(A+,
P(A+,
P(A+,
P(A+,
P(A-,
P(A-,
P(A-,
P(A-,
JPD
B+, C+)=0.006
B+, C-)=0.014
B-, C+)=0.054
B-, C-)=0.126
B+, C+)=0.240
B+, C-)=0.160
B-, C+)=0.240
B-, C-)=0.160
Theorem 4 (Neapolitan): any JPD can be represented in BN form
100
Bayesian Networks: The Bayesian Network
Model and Its Uses

Markov Property: the probability distribution of any node N given its parents
P is independent of any subset of the non-descendent nodes W of N
A
e.g., :
B
C
D ^ {B,C,E,F,G | A}
D
F ^ {A,D,E,F,G,H,I,J |
B, C }
E
F
G
I
H
J
101
Bayesian Networks: The Bayesian Network
Model and Its Uses

Theorem 5 (Pearl): the Markov property enables us to decompose (factor) the
joint probability distribution into a product of prior and conditional probability
distributions
The original JPD:
P(A+, B+, C+)=0.006
P(A+, B+, C-)=0.014
A
P(A+, B-, C+)=0.054
P(A+, B-, C-)=0.126
P(A-, B+, C+)=0.240
Up to
P(A-, B+, C-)=0.160
P(A-, B-, C+)=0.240
Exponential
B
C
P(A-, B-, C-)=0.160
Saving in
P(V) =
P p(V |Pa(V ))
i
i
i
Becomes:
P(A+)=0.8
P(B+ | A+)=0.1
P(B+ | A-)=0.5
P(C+ | A+)=0.3
P(C+ | A-)=0.6
Number of
Parameters!
102
Bayesian Networks: The Bayesian
Network Model and Its Uses
As we will see in the causal discovery segment, BNs are a useful language
for automated causal discovery because the Markov property captures
causality thus:
– Revealing confounders
– Modeling “explaining away”
– Modeling/understanding selection bias
– Modeling causal pathways
– Modeling manipulation in the presence of confounders
– Modeling manipulation in the presence of selection bias
– Identifying targets for manipulation in causal chains
103
Bayesian Networks: The Bayesian Network
Model and Its Uses

Once we have a BN model of some domain we can ask questions:
A
• Forward: P(D+,I-| A+)=?
• Backward: P(A+| C+, D+)=?
B
C
D
• Forward & Backward:
P(D+,C-| I+, E+)=?
E
F
G
I
H
• Arbitrary abstraction/Arbitrary
predictors/predicted variables
J
104
Bayesian Networks: The Bayesian Network
Model and Its Uses
The Markov property tells us which variables are important to predict a
variable (Markov Blanket), thus providing a principled way to reduce
variable dimensionality
A
B
E
C
F
D
G
I
H
J
105
Bayesian Networks: Demonstration of
Flexible Representation

A BN in which FCID holds
D1
F1
D2
F2
D3
F3
F4
106
Bayesian Networks: Demonstration of
Flexible Representation

A BN in which MEE holds
D1
F1
D2
F2
D3
F3
F4
107
Bayesian Networks: Demonstration of
Flexible Representation

A BN in which MEE and FCID hold
D
F1
F2
F3
F4
108
Bayesian Networks: Demonstration of
Flexible Representation

Hybrid assumptions
D1
F1
D2
D3
F2
F3
F5
F6
F4
D2
F7
D3
F8
F9
109
Inference Algorithms

Exact
– Lauritzen & Spigelhalter
– Cooper: Recursive decomposition

Stochastic-approximate
– Likelihood weighting
– Dagum and Luby

Variational (approximate but not stochastic)
– Jordan et al. (1998): solves queries in QMR-DT in seconds
Reference:
An Introduction to Variational Methods for Graphical Methods (1998) Michael I. Jordan,
Zoubin Ghahramani, Tommi S. Jaakkola, Lawrence K. Saul. Machine Learning
An Optimal Approximation Algorithm For Bayesian Inference (1997) Paul Dagum, Michael
Luby. Artificial Intelligence
Probabilistic Reasoning in Expert Systems: Theory and Algorithms by Richard E.
Neapolitan. Kohn Wiley 1990
110
Theoretical Complexity



Inference is NP-hard (Cooper (exact, 1990) Dagum and
Luby (stochastic, 1993))
Learning is NP-hard (Chickering, 1994, Bukaert, 1995)
However: Many widely-applicable algorithms are very
efficient (allowing up to thousands of variables for
inference and up to >100,000 variables for focused
learning)
111
Automatic Construction of Bayesian
Networks from Data

For causal discovery:
– Perl, Verma (1988)
– Spirtes, Glymour, Scheines, (1991)

For classification/automatic DSS construction
– Herskovits, Cooper (1991): Kutato (entropy-based)
– Cooper, Herskovits (1992): K2 (Bayesian)
[to be discussed at length in second part]
Reference:
Computation, Causation, and Discovery by Clark Glymour (Preface), Gregory F. Cooper
(Editor), 2000, AAAI Press
112
Supplementary: Bayesian Networks: Sparse
Candidate Algorithm
Repeat
Select candidate parents Ci for each variable Xi
Set new best NW B to be Gn s.t. Gn maximizes
a Bayesian score Score(G|D) where G is a member
of class of BNs for which: PaG(Xi)  PaBprev(Xi) " Xi
Restriction
Step
Maximization
Step
Until Convergence
Return B
113
Supplementary: Bayesian Networks: Sparse
Candidate Algorithm





SCA proceeds by selecting up to k candidate parents for
each variable on the basis of pair-wise association
Then search is performed for a best network within the
space defined by the union of all potential parents
identified in the previous step
The procedure is iterated by feeding the parents in the
currently best network to the restriction step
Theorem 6 (Friedman) : SCA monotonically improves the
quality of the examined networks
Convergence criterion: no gain in score, and maximum
number of cycles with no improvement in score
114
Supplementary: Bayesian Networks: Learning
Partial Models


Partial model: feature (Friedman et al.)
Examples:
– Order relation (is X an ascendant or descendent of Y?)
– Markov Blanket membership (Is A in the MB of B?)

We want:
P(f(G|D) =
S (f(G) * p(G|D))
G

And we approximate it by:
Conf(f) =
1
m
S f(Gi)
m
*
i=1
115
Supplementary: BN Applications
116
Supplementary: Pathfinder






Heckerman et al early 90s
Diagnosis and test selection of lymph-node pathology
Assumes MEE but not FICD
Similarity networks (special enhancement to BNs) allow more efficient
Knowledge acquisition
Myopic test selection strategy (similar to Gorry and Barnett) &
combined monetary cost/expected survival utility measure
Led to Intellipath commercial product
Reference:
Heckerman DE, Horvitz EJ, Nathwani BN. Toward normative expert systems: Part I. The
Pathfinder project. Methods Inf Med. 1992 Jun;31(2):90-105.
Heckerman DE, Nathwani BN. Toward normative expert systems: Part II. Probability-based
representations for efficient knowledge acquisition and inference.Methods Inf Med.
1992 Jun;31(2):106-16.
Heckerman DE, Nathwani BN. An evaluation of the diagnostic accuracy of Pathfinder.
Comput Biomed Res. 1992 Feb;25(1):56-74.
117
Supplementary: QMR-DT




Stanford, late 80s, early 90s
Probabilistic formulation of QMR KB (subsequent version of
INTERNIST I)
Full-scope of INTERNIST/QMR
Uses
– two-layered BN representation,
– No MEE or FICD assumptions
– Stochastic inference
Reference:
Shwe M, Cooper G. An empirical analysis of likelihood-weighting simulation on a large,
multiply connected medical belief network. Comput Biomed Res. 1991 Oct;24(5):45375.
118
Supplementary: Analysis of sensitivity of BNs
to errors in probability specification

Henrion et al. 1996
– System: CPCS (subset of QMR)
– Results: average probabilities assigned to the actual diseases showed
small sensitivity even to large amounts of noise.
– Explanation: One reason is that the criterion for performance is
average probability of the true hypotheses, which is insensitive to
symmetric noise distributions. But, even asymmetric, logodds-normal
noise has modest effects. A second reason is that the gold-standard
posterior probabilities are often near zero or one, and are little
disturbed by noise.
Reference:
Max Henrion, Malcolm Pradhan, Brendan Del Favero, Kurt Huang, Gregory Provan and Paul O'Rorke. Why is
Diagnosis Using Belief Networks Insensitive to Imprecision in Probabilities? UAI, 1996
119
Supplementary: Temporal Causal and Spatial
Reasoning with Probabilistic methods



Haddaway 1995: temporal, causal and probabilistic FOL
Aliferis (97, 98): temporal and causal Bayesian Networks
with clear causal-temporal semantics
Spatio-temporal BNs for GI endoscopy
Reference:
Aliferis CF, Cooper GF. Temporal representation design principles: an assessment
in the domain of liver transplantation. Proc AMIA Symp. 1998;:170-4.
Ngo L, Haddawy P, Krieger RA, Helwig J. Efficient temporal probabilistic
reasoning via context-sensitive model construction. Comput Biol Med. 1997
Sep;27(5):453-76.
120
Supplementary: Dynamic construction of BNs from
Knowledge Bases to solve problem instances of interest
(KBMC)


Haddaway 1995 probabilistic FOL KBBN
Aliferis: Modifiable Temporal BNs: temporal and causal
Bayesian Networks with adjustable hybrid granularities, variable time
horizon, and interacting subnetworks (“contexts”) 1996-8

Koller et al. object-oriented BNs, 1997
Reference:
Generating Bayesian Networks from Probability Logic Knowledge Bases P. Haddawy , Proceedings of
the Tenth Conference on Uncertainty in Artificial Intelligence, July, 1994.
Daphne Koller and Avi Pfeffer. Object-Oriented Bayesian Networks , UAI, 1997
Aliferis C.F. A Temporal representation and Reasoning Model for Medical Decision Support Systems,
Doctoral Thesis, 1998
121
Supplementary: Other applications

Parsing of natural language with BNs
– Haug et al 1999
– Charniak et al.

Extensive applications for classification and discovery
Margaritis et al (1999), Aliferis & Tsamardinos et al.
(2001,2,3):
– focused causal discovery (parents-children or Markov Blankets)
– feature selection
[to be discussed at length in second part]
Reference:
Fiszman M, Chapman WW, Evans SR, Haug PJ. Automatic identification of pneumonia related
concepts on chest x-ray reports.Proc AMIA Symp. 1999;:67-71.
Charniak E. Bayesian Networks Without Tears. AI Magazine 1991
122
Simple Bayes Revisited


Domingos and pazzani 1997:
– Naïve Bayes assumptions are sufficient for accurate probability
estimates in the sample limit but not necessary for a wide variety
of learning problems when accuracy is the evaluation metric
– In small samples even if the assumptions are violated SB can do
better than more expressive representations due to the biasvariance decomposition of the error
– The best way to correct (extend) SB is not to join highly-associated
findings
These results explain the excellent performance of SB in text
categorization with thousands of variables (words) and many other
learning/inference tasks even against more expressive representations
Reference:
On the Optimality of the Simple Bayesian Classifier under Zero-One Loss (1997) Pedro
Domingos and Michael Pazzani. Machine Learning
123
Other Restricted Bayesian
Classifiers

The TAN classifier augments Naïve Bayes with
“augmenting” edges among findings such that the
resulting network among the findings is a tree
D
F1
F2
F3
F4
124
Other Restricted Bayesian Classifiers

The TAN multinet classifier uses a different TAN for each
value of D and then chooses the predicted class to be the
value of D that has the highest posterior given the
findings
D=2
D=1
F1
F2
F3
F4
F1
F2
F3
F4
D=3
F1
F2
F3
F4
125
Other Restricted Bayesian
Classifiers


The main advantages of TANs and TAN multinets
are superior performance to Naïve Bayes and
efficient learning
Several other restricted Bayesian classifiers have
been proposed, e.g., non-tree augmented Naïve
Bayes (BANs) and corresponding multinets
References:
– Friedman et al Machine Learning 1997
– Cheng and Greiner, UAI 1999
126
Supplementary: Myths Surrounding Bayesian
Decision Support In General and Classifiers in
Particular






Bayes’ Theorem requires that diseases are mutually
exclusive and that findings are independent
Corollary: Simple Bayes is wrong whenever the MEE and
FICD assumptions do not hold
Related: A good way to fix Simple Bayes is to “lamp
together” highly-dependent findings
Bayes’ probabilities are difficult to assess and because
errors accumulate the final conclusions are wrong
Related: To apply Bayesian inference you need an
astronomical number of probabilities
Related: Bayesian inference is too subjective because
probabilities are not frequency-based
127
Supplementary: Conclusions


Since many of the previous advances are very recent, the medical
informatics community that does not specialize in probabilistic systems
has not yet caught up with them
Dissemination issues aside, the significant progress that has been
accomplished in the theory and technology of Bayesian Networks in
the 1990s has yielded:
– Algorithms that allow efficient learning and inference with
thousands of variables without unrealistic assumptions
– Formal handling of temporal and causal reasoning
– Decision-theory and probability theory compliant decision making
– Well-understood properties
– A plethora of tools for representation, inference , learning, and
experimentation
– Pioneering applications in many medical areas
128
Supplementary: Bayesian Networks:
Reference

Simple Bayes weakness:
– M. Peot, Proc. Proc. UAI 96
– M. Minsky, Transactions of IRE, 49:8-30, 1961

Simple Bayes application:
– H. Warner et al. Annals of NYAS, 115:2-16, 1964
– F. de Dombal et al. BMJ, 1:376-380, 1972

Full Bayesian Classifier:
– T. Mitchell, Machine Learning, McGraw Hill, 1997

Bayesian Networks as a knowledge representation:
– J. Pearl, Probabilistic Reasoning in Expert Systems, Morgan Kaufmann,
1988

Certainty Factor/PSs weaknesses:
– D. Heckerman et al., Proc. UAI 86
129
Supplementary: Bayesian Networks:
Reference

Causal discovery using BNs:
– P. Spirtes et al. , Causation, Prediction and Search, MIT Press 2000
– C. Glymour, G. Cooper, Computation, Causation and Discovery, AAAI Press/MIT
Press, 1999
– C. Aliferis, G. Cooper, Proc. UAI 94

Textbooks on BNs:
– R. Neapolitan, Probabilistic Reasoning in Expert Systems, John Wiley, 1990
– F. Jensen, An Introduction to Bayesian Networks, UCL Press, 1996
– E. Castillo, et al. Expert Systems and Probabilistic Network Models, Springer 1997

Learning BNs:
–
–
–
–
–

G. Cooper et al. Machine Learning 9:309-347, 1992
E. Herskovits, Report No. STAN-CS-91-1367 (Thesis)
D. Heckerman, Technical report Msr TR-95-06, 1995
J. Pearl, Causality, Gambridge University Press, 2001
N. Friedman et al. J Comput Biol, 7(3/4):601-620, 2000, and Proc. UAI 99
Comparison to other learning algorithms:
– G. Cooper, C. Aliferis et al. Artificial Intelligence in Medicine, 9:107-138, 1997
130