Evaluating Machine Learning Approaches for Aiding Probe
Mining High-Throughput
Biological Data
David Page
Dept. Biostatistics and Medical Informatics
Dept. Computer Sciences
University of Wisconsin-Madison
Copyrighted © 2006 by David Page
Some Data Types We’ll Discuss
Gene expression microarray
Single-nucleotide polymorphisms
Mass spectrometry proteomics and
metabolomics
Protein-protein interactions (from co-immunoprecipitation)
High-throughput screening of potential
drug molecules
image from the DOE Human Genome Program
http://www.ornl.gov/hgmis
How Microarrays Work
Probes (DNA)
Labeled Sample (RNA)
Hybridization
Gene Chip Surface
Two Views of Microarray Data
Data points are genes
Represented by expression levels across
different samples (ie, features=samples)
Goal: categorize new genes
Data points are samples (eg, patients)
Represented by expression levels of
different genes (ie, features=genes)
Goal: categorize new samples
Two Ways to View The Data
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...
Person 1        1142.0      321.0        2567.2       ...
Person 2        586.3       586.1        759.0        ...
Person 3        105.2       559.3        3210.7       ...
Person 4        42.8        692.1        812.0        ...
...             ...         ...          ...          ...
Data Points are Genes
(Same table as above. Each column, one gene's expression levels across all persons, is a data point.)
Data Points are Samples
(Same table as above. Each row, one person's expression levels across all genes, is a data point.)
Supervision: Add Class Values
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...   Class
Person 1        1142.0      321.0        2567.2       ...   normal
Person 2        586.3       586.1        759.0        ...   cancer
Person 3        105.2       559.3        3210.7       ...   normal
Person 4        42.8        692.1        812.0        ...   cancer
...             ...         ...          ...          ...   ...
Supervised Learning Task
Given: a set of microarray experiments, each
done with mRNA from a different patient
(same cell type from every patient)
Patient’s expression values for each gene
constitute the features, and patient’s disease
constitutes the class
Do: Learn a model that accurately predicts
class based on features
Location in Task Space
Data points are: genes or samples
Task is: clustering or supervised data mining
This task: data points are samples, task is supervised data mining
Predict the class value for a patient based on the expression levels for his/her genes
Leukemia
(Golub et al., 1999)
Classes
Acute Lymphoblastic Leukemia (ALL)
and Acute Myeloid Leukemia (AML)
Approach
Weighted voting (essentially naïve Bayes)
Cross-Validated Accuracy
Of 34 samples, declined to predict 5,
correct on other 29
Cancer vs. Normal
Relatively easy to predict accurately,
because so much goes “haywire” in
cancer cells
Primary barrier is noise in the data…
impure RNA, cross-hybridization, etc
Studies include breast, colon, prostate,
lymphoma, and multiple myeloma
X-Val Accuracies for Multiple
Myeloma (74 MM vs. 31 Normal)
Method          Accuracy (%)
Trees           98.1
Boosted Trees   99.0
SVMs            100.0
Vote            100.0
Bayes Nets      97.0
More MM (300), Benign Condition
MGUS (Hardin et al., 2004)
ROC Curves: Cancer vs. Normal
ROC: Cancer vs. Benign (MGUS)
Work by Statisticians Outside of
Standard Classification/Clustering
Methods to better convert Affymetrix’s
low-level intensity measurements into
expression levels: e.g., work by Speed,
Wong, Irizarry
Methods to find differentially expressed
genes between two samples, e.g. work
by Newton and Kendziorski
But the following is most related…
Ranking Genes by Significance
Some biologists don’t want one predictive model, but
a rank-ordered list of genes to explore further (with
estimated significance)
For each gene we have a set of expression levels
under our conditions, say cancer vs. normal
We can do a t-test to see if the mean expression
levels are different under the two conditions: p-value
Multiple comparisons problem: if we repeat this
test for 30,000 genes, some will pop up as significant
just by chance alone
Could do a Bonferroni correction (multiply p-values by
30,000), but this is drastic and might eliminate all
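A minimal sketch of the Bonferroni correction described above, in Python. The p-values are made up for illustration; only the 30,000-gene count comes from the slide.

```python
def bonferroni(p_values, n_tests):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical raw p-values for three of the 30,000 genes tested.
raw = [1e-7, 2e-4, 0.03]
adjusted = bonferroni(raw, 30000)

# Which genes remain significant at the 0.05 level after correction?
significant = [p < 0.05 for p in adjusted]
```

In this toy case the second gene has a small raw p-value (0.0002) yet is no longer significant after correction, which is the sense in which the correction is drastic.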
False Discovery Rate (FDR)
[Storey and Tibshirani, 2001]
Addresses multiple comparisons but is less
extreme than Bonferroni
Replaces p-value by q-value: fraction of genes
with this p-value or lower that really don’t
have different means in the two classes (false
discoveries)
Publicly available in R as part of Bioconductor
package
Recommendation: Use this in addition to
your supervised data mining… your
collaborators will want to see it
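To make the idea of an FDR-adjusted value concrete, here is a sketch of the Benjamini-Hochberg adjustment in Python. This is a simpler, related FDR procedure that is swapped in for illustration; it is not the Storey and Tibshirani q-value estimator cited above, and the p-values are made up.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]   # hypothetical p-values for four genes
fdr = bh_adjust(raw)
```

Each adjusted value is at least as large as the raw p-value but never larger than the Bonferroni value, which is why FDR control is the less extreme of the two.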
FDR Highlights Difficulties Getting
Insight into Cancer vs. Normal
Using Benign Condition Instead
of Normal Helps Somewhat
Question to Anticipate
You’ve run a supervised data mining
algorithm on your collaborator’s data, and
you present an estimate of accuracy or an
ROC curve (from X-val)
How did you adjust this for the multiple
comparisons problem?
Answer: you don’t need to because you
commit to a single predictive model before
ever looking at the test data for a fold—this is
only one comparison
Prognosis and Treatment
Features same as for diagnosis
Rather than disease state, class value
becomes life expectancy with a given
treatment (or positive response vs.
no response to given treatment)
Breast Cancer Prognosis
(Van’t Veer et al., 2002)
Classes
good prognosis (no metastasis within
five years of initial diagnosis) vs. poor prognosis
Algorithm
Ensemble of voters
Results
83% cross-validated accuracy on 78 cases
A Lesson
Previous work selected features to use in
ensemble by looking at the entire data set
Should have repeated feature selection on
each cross-val fold
Authors also chose ensemble size by seeing
which size gave highest cross-val result
Authors corrected this in web supplement;
accuracy went from 83% to 73%
Remember to “tune parameters” separately
for each cross-val fold!
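The lesson above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of any cited paper: the function names, the toy selector, the one-feature threshold model, and the data are all hypothetical. The point is only that feature selection sees the training part of each fold and nothing else.

```python
def cross_validate(samples, labels, k, select_features, train, predict):
    """k-fold cross-validation with feature selection redone inside each fold."""
    n = len(samples)
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        train_x = [samples[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        # Features are chosen from the training part only; the held-out
        # samples play no role in the choice.
        keep = select_features(train_x, train_y)
        model = train([[x[j] for j in keep] for x in train_x], train_y)
        for i in test_idx:
            reduced = [samples[i][j] for j in keep]
            correct += predict(model, reduced) == labels[i]
    return correct / n

def top_feature(train_x, train_y):
    """Toy selector: the single feature whose class means differ most."""
    def gap(j):
        a = [x[j] for x, y in zip(train_x, train_y) if y == 1]
        b = [x[j] for x, y in zip(train_x, train_y) if y == 0]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return [max(range(len(train_x[0])), key=gap)]

def train_threshold(train_x, train_y):
    """Toy one-feature model: midpoint between class means, plus direction."""
    a = [x[0] for x, y in zip(train_x, train_y) if y == 1]
    b = [x[0] for x, y in zip(train_x, train_y) if y == 0]
    mean1, mean0 = sum(a) / len(a), sum(b) / len(b)
    return ((mean1 + mean0) / 2, mean1 > mean0)

def predict_threshold(model, x):
    cut, high_is_one = model
    return int((x[0] > cut) == high_is_one)

# Made-up data: feature 0 separates the classes, feature 1 is noise.
samples = [[10, 5], [1, 6], [11, 4], [2, 5], [9, 6], [1, 4], [10, 5], [2, 6]]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
accuracy = cross_validate(samples, labels, 4, top_feature,
                          train_threshold, predict_threshold)
```

Any other tuned choice, such as ensemble size, would go in the same place: inside the loop, computed from `train_x` and `train_y` only.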
Prognosis with Specific Therapy
(Rosenwald et al., 2002)
Data set contains gene-expression
patterns for 160 patients with diffuse
large B-cell lymphoma, receiving
anthracycline chemotherapy
Class label is five-year survival
One test-train split 80/80
True positive rate: 60%
False negative rate: 39%
Some Future Directions
Using gene-chip data to select therapy
Predict which therapy gives
best prognosis for patient
Combining Gene Expression Data with
Clinical Data such as Lab Results,
Medical and Family History
Multiple relational tables, may benefit
from relational learning
Unsupervised Learning Task
Given: a set of microarray experiments
under different conditions
Do: cluster the genes, where a gene
described by its expression levels in
different experiments
Location in Task Space
Data points are: genes or samples
Task is: clustering or supervised data mining
This task: data points are genes, task is clustering
Group genes into clusters, where all members of a cluster tend to go up or down together
Example
Genes
(Green = up-regulated, Red = down-regulated)
Experiments (Samples)
Visualizing Gene Clusters
(eg, Sharan and Shamir, 2000)
Normalized
expression
Gene Cluster 1, size=20
Time (10-minute intervals)
Gene Cluster 2, size=43
Unsupervised Learning Task 2
Given: a set of microarray experiments
(samples) corresponding to different
conditions or patients
Do: cluster the experiments
Location in Task Space
Data points are: genes or samples
Task is: clustering or supervised data mining
This task: data points are samples, task is clustering
Group samples by gene expression profile
Examples
Cluster samples from mice subjected to
a variety of toxic compounds
(Thomas et al., 2001)
Cluster samples from cancer patients,
potentially to discover different
subtypes of a cancer
Cluster samples taken at different
time points
Some Biological Pathways
Regulatory pathways
Nodes are labeled by genes
Arcs denote influence on transcription
G1 codes for P1, P1 inhibits G2’s transcription
Metabolic pathways
Nodes are metabolites, large biomolecules (eg,
sugars, lipids, proteins and modified proteins)
Arcs from biochemical reaction inputs to outputs
Arcs labeled by enzymes (themselves proteins)
Metabolic Pathway Example
(Figure: TCA Cycle, also called the Krebs Cycle or Citric Acid Cycle)
Metabolites: Acetyl CoA, Oxaloacetate, Citrate, cis-Aconitate, Isocitrate, α-Ketoglutarate, Succinyl-CoA, Succinate, Fumarate, Malate
Enzymes: citrate synthase, aconitase, IDH, α-KGDH, succinate thiokinase, fumarase, MDH
Cofactors and byproducts: H2O, HSCoA, NAD+, NADH, CO2, FAD, FADH2, GDP + Pi, GTP
Regulatory Pathway (KEGG)
Using Microarray Data Only
Regulatory pathways
Nodes are labeled by genes
Arcs denote influence on transcription
G1 codes for P1, P1 inhibits G2’s transcription
Metabolic pathways
Nodes are metabolites, large biomolecules (eg,
sugars, lipids, proteins, and modified proteins)
Arcs from biochemical reaction inputs to outputs
Arcs labeled by enzymes (themselves proteins)
Supervised Learning Task 2
Given: a set of microarray experiments
for same organism under different
conditions
Do: Learn graphical model that
accurately predicts expression of some
genes in terms of others
Some Approaches to
Learning Regulatory Networks
Bayes Net Learning
(started with Friedman & Halpern, 1999, we’ll see
more)
Boolean Networks
(Akutsu, Kuhara, Maruyama
& Miyano, 1998; Ideker, Thorsson & Karp, 2002)
Related Graphical Approaches
(Tanay & Shamir, 2001; Chrisman, Langley, Baay &
Pohorille, 2003)
Bayesian Network (BN)
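(Figure: data from four experiments, Expt1 to Expt4, over geneA and geneB, and a two-node network with parent node geneA and child node geneB. P(geneA) is 0.5 / 0.5; the table P(geneB) has rows 1.0 / 0.0 and 0.5 / 0.5.)
Note: direction of arrow indicates dependence, not causality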
Problem: Not Causality
A
B
A is a good predictor of B. But is A regulating B??
Ground truth might be:
(Figure: several alternative networks over A, B, and a third gene C)
Or a more complicated variant
Approaches to Get Causality
Use “knock-outs” (Pe’er, Regev, Elidan and
Friedman, 2001). But not available in most
organisms.
Use time-series data and Dynamic Bayesian
Networks (Ong, Glasner and Page, 2002). But
even less data typically.
Use other data sources, eg sequences
upstream of genes, where transcription
regulators may bind. (Segal, Barash, Simon,
Friedman and Koller, 2002; Noto and Craven, 2005)
A Dynamic Bayes Net
(Figure: nodes gene 1, gene 2, gene 3, ..., gene N, drawn once for each of two time slices)
Problem: Not Enough Data Points
to Construct Large Network
Fortunate to get 100s of chips
But have 1000s of genes
E. coli: ~4000
Yeast: ~6000
Human: ~30,000
Want to learn causal graphical model
over 1000s of variables with 100s of
examples (settings of the variables)
Advance: Module Networks
[Segal, Pe’er, Regev, Koller & Friedman, 2005]
Cluster genes by similarity over
expression experiments
All genes in a cluster are “tied
together”: same parents and CPDs
Learn structure subject to this tying
together of genes
Iteratively re-form clusters and re-learn
network, in an EM-like fashion
Problem: Data are Continuous
but Models are Discrete
Gene chips provide a real-valued mRNA
measurement
Boolean networks and most practical
Bayes net learning algorithms assume
discrete variables
May lose valuable information by
discretizing
Advance: Use of Dynamic Bayes
Nets with Continuous Variables
[Segal, Pe’er, Regev, Koller & Friedman, 2005]
Expression measurements used instead
of discretized (up, down, same)
Assume linear influence of parents on
children (Michaelis-Menten assumption)
Work so far constructed the network
from literature and learned parameters
Problem: Much Missing
Information
mRNA from gene 1 doesn’t directly alter level
of mRNA from gene 2
Rather, the protein product from gene 1 may
alter level of mRNA from gene 2 (e.g.,
transcription factor)
Activation of transcription factor might not
occur by making more of it, but just by
phosphorylating it (post-translational
modification)
Example: Transcription Regulation
(Figure: DNA with two operons, one containing P, O, geneR, T and the other containing P, O, geneA, geneB, geneC, T; each is transcribed to mRNA, and R is the product of geneR)
Approach: Measure More Stuff
Mass spectrometry (later) can measure
protein rather than mRNA
Doesn’t measure all proteins
Not very quantitative (presence/absence)
2D gels can measure post-translational
modifications, but still low-throughput
because of current analysis
Co-immunoprecipitation (later) and Yeast 2-Hybrid can measure protein interactions, but
noisy
Another Way Around Limitations
Identify smaller part of the task that is
a step toward a full regulatory pathway
Part of a pathway
Classes or groups of genes
Examples:
Chromatin remodelers
Predicting the operons in E. coli
Chromatin Remodelers and
Nucleosome [Segal et al. 2006]
Previous DNA picture oversimplified
DNA double-helix is wrapped in further
complex structure
DNA is accessible only if part of this
structure is unwound
Can we predict what chromatin
remodelers act on what parts of DNA,
also what activates a remodeler?
The E. Coli Genome
Finding Operons in E. coli
(Craven, Page, Shavlik, Bockhorst and Glasner, 2000)
Given: known operons and other E. coli data
Do: predict all operons in E. coli
Additional Sources of Information
gene-expression data
functional annotation
(Figure: genes g1 through g5 along the genome)
Comparing Naive Bayes
and Decision Trees (C5.0)
Using Only Individual Features
Single-Nucleotide Polymorphisms
SNPs: Individual positions in DNA where
variation is common
Roughly 2 million known SNPs in humans
New Affymetrix whole-genome scan
measures 500,000 of these
Easier/faster/cheaper to measure SNPs than
to completely sequence everyone
Motivation …
If We Sequenced Everyone…
Susceptible to Disease D or Responds to Treatment T
Not Susceptible or Not Responding
Example of SNP Data
Person \ SNP   1     2     3     ...   CLASS
Person 1       C T   A G   T T   ...   old
Person 2       C C   A G   C T   ...   young
Person 3       T T   A A   C C   ...   old
Person 4       C T   G G   T T   ...   young
...            ...   ...   ...   ...   ...
Phasing (Haplotyping)
Advantages of SNP Data
Person’s SNP pattern does not change
with time or disease, so it can give
more insight into susceptibility
Easier to collect samples (can simply
use blood rather than affected tissue)
Challenges of SNP Data
Unphased
Algorithms exist for phasing (haplotyping),
but they make errors and typically need
related individuals, dense coverage
Missing values are more common
than in microarray data (though improving
substantially, down to around 1-2% now)
Many more measurements. For example,
Affymetrix human SNP chip at a half million
SNPs.
Supervised Learning Task
Given: a set of SNP profiles, each from a different
patient.
Phased: nucleotides at each SNP position on each
copy of each chromosome constitute the features,
and patient’s disease constitutes the class
Unphased: unordered pair of nucleotides at each
SNP position constitute the features, and patient’s
disease constitutes the class
Do: Learn a model that accurately predicts
class based on features
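A small Python sketch of the unphased encoding described above: each SNP becomes an unordered pair, so C/T and T/C map to the same feature value. The function name and the string encoding are assumptions made for illustration; the genotypes are Person 1's from the example SNP table.

```python
def unphased_features(genotypes):
    """Encode each SNP as an unordered nucleotide pair, e.g. ('T', 'C') -> 'CT'."""
    return ["".join(sorted(pair)) for pair in genotypes]

# Person 1 from the example SNP table: SNP 1 = C/T, SNP 2 = A/G, SNP 3 = T/T.
person1 = [("C", "T"), ("A", "G"), ("T", "T")]
features = unphased_features(person1)   # ['CT', 'AG', 'TT']

# Order within a pair carries no information when the data are unphased.
same = unphased_features([("T", "C")]) == unphased_features([("C", "T")])
```

With phased data the two copies would instead be kept as separate, ordered features, one per chromosome copy.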
Waddell et al., 2005
Multiple Myeloma, Young (susceptible) vs. Old (less
susceptible), 3000 SNPs, best at 64% acc (training)
SVM with feature selection (repeated on every fold of
cross-validation): 72% accuracy, also naïve Bayes.
Significantly better than chance.
Confusion matrix (rows: actual class)
          Old   Young
Old       31    9
Young     14    26
Listgarten et al., 2005
SVMs from SNP data predict lung
cancer susceptibility at 69% accuracy
Naïve Bayes gives similar performance
Best single SNP at less than 60%
accuracy (training)
Lessons
Supervised data mining algorithms can
predict disease susceptibility at rates
better than chance and better than
individual SNPs
Accuracies much lower than we see
with microarray data, because we’re
predicting who will get disease, not who
already has it
Future Directions
Pharmacogenetics: predicting drug
response from SNP profile
Drug Efficacy
Adverse Reaction
Combining SNP data with other data
types, such as clinical (history, lab tests)
and microarray
Proteomics
Microarrays are useful primarily because
mRNA concentrations serve as surrogate for
protein concentrations
Like to measure protein concentrations
directly, but at present cannot do so in
same high-throughput manner
Proteins do not have obvious direct
complements
Could build molecules that bind, but binding
greatly affected by protein structure
Time-of-Flight (TOF) Mass
Spectrometry (thanks Sean McIlwain)
Measures the time for an ionized particle, starting from the sample plate, to hit the detector
(Figure labels: Laser, Sample, +V, Detector)
Time-of-Flight (TOF)
Mass Spectrometry 2
Matrix-Assisted Laser Desorption-Ionization (MALDI)
Crystalloid structures made using proton-rich matrix molecule
Hitting crystalloid with laser causes molecules to ionize and “fly” towards detector
(Figure labels: Laser, Sample, +V, Detector)
Time-of-Flight Demonstration 0
Sample Plate
Time-of-Flight Demonstration 1
Matrix Molecules
Time-of-Flight Demonstration 2
Protein Molecules
Time-of-Flight Demonstration 3
Laser, Detector, +10KV, Positive Charge
Time-of-Flight Demonstration 4
Laser pulsed directly onto sample
Proton kicked off matrix molecule onto another molecule
Time-of-Flight Demonstration 5
Lots of protons kicked off matrix ions, giving rise to more positively charged molecules
Time-of-Flight Demonstration 6
The high positive potential under sample plate causes positively charged molecules to accelerate towards detector
Time-of-Flight Demonstration 7
Smaller mass molecules hit detector first, while heavier ones detected later
Time-of-Flight Demonstration 8
The incident time measured from when laser is pulsed until molecule hits detector
Time-of-Flight Demonstration 9
Experiment repeated a number of times, counting frequencies of “flight-times”
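Why lighter molecules arrive first can be sketched with the ideal time-of-flight relation: an ion of mass m and charge ze accelerated through potential V gains kinetic energy zeV, so over a drift length L its flight time is t = L * sqrt(m / (2zeV)). The Python below is a simplified illustration, not a model of any particular instrument: the 1 m drift length and the peptide masses are assumptions, the acceleration region is ignored, and only the 10 kV potential comes from the slides.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, coulombs
DALTON = 1.66053906660e-27   # unified atomic mass unit, kilograms

def flight_time(mass_da, charge=1, volts=10_000.0, drift_m=1.0):
    """Ideal drift time in seconds: t = L * sqrt(m / (2 * z * e * V))."""
    mass_kg = mass_da * DALTON
    return drift_m * math.sqrt(mass_kg / (2 * charge * E_CHARGE * volts))

light = flight_time(1_000.0)   # hypothetical 1 kDa peptide ion
heavy = flight_time(4_000.0)   # hypothetical 4 kDa peptide ion
ratio = heavy / light          # flight time grows with the square root of mass
```

Because time scales with the square root of m/z, a fourfold heavier ion with the same charge takes twice as long, which is what lets flight times be converted to the M/Z axis of a spectrum.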
Example Spectra from a
Competition by Lin et al. at Duke
These are different fractions from the same sample.
(Figure: spectra plotted as intensity vs. M/Z)
Trypsin-Treated Spectra
(Figure: spectrum plotted as frequency vs. M/Z)
Many Challenges Raised by Mass
Spectrometry Data
Noise: extra peaks from handling of sample, from
machine and environment (electrical noise), etc.
M/Z values may not align exactly across spectra
(resolution ~0.1%)
Intensities not calibrated across spectra:
quantification is difficult
Cannot get all proteins… typically only several
hundred. To improve odds of getting the ones we
want, may fractionate our sample by 2D gel
electrophoresis or liquid chromatography.
Challenges (Continued)
Better results if partially digest proteins
(break into smaller peptides) first
Can be difficult to determine what
proteins we have from spectrum
Isotopic peaks: C13 and N15 atoms in
varying numbers cause multiple peaks
for a single peptide
Handling Noise: Peak Picking
Want to pick peaks that are statistically
significant from the noise signal
Want to use these as
features in our
learning algorithms.
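One simple way to make "significant from the noise" concrete, offered only as a sketch: call a point a peak if it is a local maximum and exceeds the median intensity by several median absolute deviations (MAD). The threshold rule, the value k = 5, and the intensities below are assumptions for illustration and do not come from the slides.

```python
import statistics

def pick_peaks(intensities, k=5.0):
    """Return indices of local maxima that rise above median + k * MAD."""
    med = statistics.median(intensities)
    mad = statistics.median([abs(v - med) for v in intensities])
    threshold = med + k * mad
    peaks = []
    for i in range(1, len(intensities) - 1):
        v = intensities[i]
        if v > threshold and v > intensities[i - 1] and v >= intensities[i + 1]:
            peaks.append(i)
    return peaks

# Made-up spectrum: low-level noise with two real peaks.
spectrum = [2, 3, 2, 3, 40, 3, 2, 4, 2, 25, 26, 3, 2, 3]
peaks = pick_peaks(spectrum)
```

The picked indices (or their M/Z values and intensities) can then serve as the features passed to a learning algorithm.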
Many Supervised Learning Tasks
Learn to predict proteins from spectra,
when the organism’s proteome is
known
Learn to identify isotopic distributions
Learn to predict disease from either
proteins, peaks or isotopic distributions
as features
Construct pathway models
Using Mass Spectrometry for
Early Detection of Ovarian Cancer
[Petricoin et al., 2002]
Ovarian cancer difficult to detect early, often
leading to poor prognosis
Trained and tested on mass spectra from
blood serum
100 training cases, 50 with cancer
Held-out test set of 116 cases, 50 with cancer
100% sensitivity, 95% specificity (63/66) on
held-out test set
Not So Fast
Data mining methodology seems sound
But Keith Baggerly argues that cancer samples were
handled differently than normal samples, and
perhaps data were preprocessed differently too
If we run cancer samples Monday and normals
Wednesday, could get differences from machine
breakdown or nearby electrical equipment that’s
running on Monday but not Wed
Lesson: tell collaborators they must randomize
samples for the entire processing phase and of
course all our preprocessing must be same
Debate is still raging… results not replicated in trials
Other Proteomics: 3D Structures
Other Proteomics: Interactions
Figure from Ideker et al., Science 292(5518):929-934, 2001
each node represents a gene product (protein)
blue edges show direct protein-protein interactions
yellow edges show interactions in which one protein
binds to DNA and affects the expression of another
Protein-Protein Interactions
Yeast 2-Hybrid
Immunoprecipitation
Antibodies (“immuno”) are made by
combinatorial combinations of certain
proteins
Millions of antibodies can be made, to
recognize a wide variety of different
antigens (“invaders”), often by recognizing
specific proteins
(Figure labels: protein, antibody)
Protein-Protein Interactions
Immunoprecipitation
antibody
Co-Immunoprecipitation
antibody
Many Supervised Learning Tasks
Learn to predict protein-protein
interactions: protein 3D structures may
be critical
Use protein-protein interactions in
construction of pathway models
Learn to predict protein function from
interaction data
ChIP-Chip Data
Immunoprecipitation can also be done to
identify proteins interacting with DNA rather
than other proteins
Chromatin immunoprecipitation (ChIP): grab
sample of DNA bound to a particular protein
(transcription factor)
ChIP-Chip: run this sample of DNA on a
microarray to see which DNA was bound
Example of analysis of such new data: Keles
et al., 2006
Metabolomics
Measures concentration of each low-molecular-weight molecule in sample
These typically are “metabolites,” or small
molecules produced or consumed by
reactions in biochemical pathways
These reactions typically catalyzed by
proteins (specifically, enzymes)
This data typically also mass spectrometry,
though could also be NMR
Lipomics
Analogous to metabolomics, but
measuring concentrations of lipids
rather than metabolites
Potentially help induce biochemical
pathway information or to help
disease diagnosis or treatment choice
To Design a Drug:
Identify Target Protein
  Knowledge of proteome/genome
  Relevant biochemical pathways
Determine Target Site Structure
  Crystallography, NMR
  Difficult if Membrane-Bound
Synthesize a Molecule that Will Bind
  Imperfect modeling of structure
  Structures may change at binding
And even then…
Molecule Binds Target But
May:
Bind too tightly or not tightly enough.
Be toxic.
Have other effects (side-effects) in the
body.
Break down as soon as it gets into the
body, or may not leave the body soon
enough.
It may not get to where it should in the
body (e.g., crossing blood-brain
barrier).
And Every Body is Different:
Even if a molecule works in the test
tube and works in animal studies, it
may not work in people (will fail in
clinical trials).
A molecule may work for some people
but not others.
A molecule may cause harmful side-effects in some people but not others.
Typical Practice when Target
Structure is Unknown
High-Throughput Screening (HTS): Test many
molecules (1,000,000) to find some that bind
to target (ligands).
Infer (induce) shape of target site from 3D
structural similarities.
Shared 3D substructure is called a
pharmacophore.
Perfect example of a machine learning task
with spatial target.
Inactive
Active
An Example of Structure Learning
Common Data Mining
Approaches
Represent a molecule by thousands to
millions of features and use standard
techniques (e.g., KDD Cup 2001)
Represent each low-energy conformer by
feature vector and use multiple-instance
learning (e.g., Jain et al., 1998)
Relational learning
Inductive logic programming (e.g., Finn et al.,
1998)
Graph mining
Supervised Learning Task
Given: a set of molecules, each labeled
by activity -- binding affinity for target
protein -- and a set of low-energy
conformers for each molecule
Do: Learn a model that accurately
predicts activity (may be Boolean or
real-valued)
ILP as Illustration: The Logical
Representation of a Pharmacophore
Active(X) if:
has-conformation(X,Conf),
has-hydrophobic(X,A),
has-hydrophobic(X,B),
distance(X,Conf,A,B,3.0,1.0),
has-acceptor(X,C),
distance(X,Conf,A,C,4.0,1.0),
distance(X,Conf,B,C,5.0,1.0).
This logical clause states that a molecule X is active if it has some
conformation Conf, hydrophobic groups A and B, and a hydrogen acceptor C
such that the following holds: in conformation Conf of molecule X, the
distance between A and B is 3 Angstroms (plus or minus 1), the distance
between A and C is 4, and the distance between B and C is 5.
Background Knowledge I
Information about atoms and bonds in the molecules
atm(m1,a1,o,3,5.915800,-2.441200,1.799700).
atm(m1,a2,c,3,0.574700,-2.773300,0.337600).
atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).
bond(m1,a1,a2,1).
bond(m1,a2,a3,1).
Background knowledge II
Definition of distance equivalence
% True when the Atom1-Atom2 distance in Drug is Dist plus or minus Error.
dist(Drug,Atom1,Atom2,Dist,Error) :-
    number(Error),
    coord(Drug,Atom1,X1,Y1,Z1),
    coord(Drug,Atom2,X2,Y2,Z2),
    euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
    Diff is Dist1-Dist,
    absolute_value(Diff,E1),
    E1 =< Error.
euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D) :-
    Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2,
    D is sqrt(Dsq).
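For readers less familiar with Prolog, the same distance-equivalence test can be sketched in Python. The function name is an assumption; the coordinates are those of atoms a1 and a2 of molecule m1 from the atm facts under Background Knowledge I.

```python
import math

def within_distance(p, q, dist, error):
    """True if the Euclidean distance between points p and q is dist +/- error."""
    return abs(math.dist(p, q) - dist) <= error

# Coordinates of atoms a1 and a2 of molecule m1, taken from the atm facts.
a1 = (5.915800, -2.441200, 1.799700)
a2 = (0.574700, -2.773300, 0.337600)

d = math.dist(a1, a2)                        # roughly 5.55 Angstroms
near = within_distance(a1, a2, 5.5, 0.75)    # inside the tolerance band
far = within_distance(a1, a2, 3.0, 1.0)      # outside the tolerance band
```

The tolerance argument plays the role of Error in the Prolog clause: two distances count as equivalent when they differ by no more than that amount.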
Central Idea: Generalize by
searching a lattice
Lattice of Hypotheses
active(X)
active(X) if has-hydrophobic(X,A)
active(X) if has-donor(X,A)
active(X) if has-acceptor(X,A)
active(X) if has-hydrophobic(X,A), has-donor(X,B), distance(X,A,B,5.0)
active(X) if has-donor(X,A), has-donor(X,B), distance(X,A,B,4.0)
active(X) if has-acceptor(X,A), has-donor(X,B), distance(X,A,B,6.0)
etc.
Conformational model
Conformational flexibility modelled as
multiple conformations:
Sybyl randomsearch
Catalyst
Pharmacophore description
Atom and site centred
Hydrogen bond donor
Hydrogen bond acceptor
Hydrophobe
Site points (limited at present)
User definable
Distance based
Example 1: Dopamine
agonists
Agonists taken from Martin data set on QSAR
society web pages
Examples (5-50 conformations/molecule)
(Figure: chemical structures of example dopamine agonists)
Pharmacophore identified
Molecule A has the desired activity if:
in conformation B molecule A contains a hydrogen acceptor at C, and
in conformation B molecule A contains a basic nitrogen group at D, and
the distance between C and D is 7.05966 +/- 0.75 Angstroms, and
in conformation B molecule A contains a hydrogen acceptor at E, and
the distance between C and E is 2.80871 +/- 0.75 Angstroms, and
the distance between D and E is 6.36846 +/- 0.75 Angstroms, and
in conformation B molecule A contains a hydrophobic group at F, and
the distance between C and F is 2.68136 +/- 0.75 Angstroms, and
the distance between D and F is 4.80399 +/- 0.75 Angstroms, and
the distance between E and F is 2.74602 +/- 0.75 Angstroms.
Example II: ACE inhibitors
28 angiotensin converting enzyme
inhibitors taken from literature
D. Mayer et al., J. Comput.-Aided Mol.
Design, 1, 3-16, (1987)
(Figure: chemical structures of example ACE inhibitors)
ACE pharmacophore
Molecule A is an ACE inhibitor if:
molecule A contains a zinc-site B,
molecule A contains a hydrogen acceptor C,
the distance between B and C is 7.899 +/- 0.750 A,
molecule A contains a hydrogen acceptor D,
the distance between B and D is 8.475 +/- 0.750 A,
the distance between C and D is 2.133 +/- 0.750 A,
molecule A contains a hydrogen acceptor E,
the distance between B and E is 4.891 +/- 0.750 A,
the distance between C and E is 3.114 +/- 0.750 A,
the distance between D and E is 3.753 +/- 0.750 A.
Pharmacophore discovered
Distance   Progol   Mayer
A          4.9      5.0
B          3.8      3.8
C          8.5      8.6
(Figure: distances A, B, and C drawn between the zinc site and H-bond acceptors)
Additional Finding
Original pharmacophore rediscovered
plus one other
different zinc ligand position
similar to alternative proposed by Ciba-Geigy
(Figure: alternative pharmacophore with distances 4.0, 3.9, and 7.3)
Example III: Thermolysin
inhibitors
10 inhibitors for which crystallographic
data is available in PDB
Conformationally challenging molecules
Experimentally observed superposition
Key binding site interactions
(Figure: inhibitor in the thermolysin binding site, showing interactions with Asn112-NH, Asn112 C=O, Arg203-NH, Ala113 C=O, S1’, S2’, and Zn)
Interactions made by
inhibitors
Interactions: Asn112-NH, S2’, Asn112 C=O, Arg203 NH, S1’, Ala113-C=O, Zn
Structures: 1HYT, 1THL, 1TLP, 1TMN, 2TMN, 4TLN, 4TMN, 5TLN, 5TMN, 6TMN
Pharmacophore Identification
Structures considered 1HYT 1THL 1TLP
1TMN 2TMN 4TLN 4TMN 5TLN 5TMN
6TMN
Conformational analysis using “Best”
conformer generation in Catalyst
98-251 conformations/molecule
Thermolysin Results
Ten 5-point pharmacophores identified,
falling into 2 groups (7/10 molecules)
3 “acceptors”, 1 hydrophobe, 1 donor
4 “acceptors”, 1 donor
Common core of Zn ligands, Arg203
and Asn112 interactions identified
Correct assignments of functional
groups
Correct geometry to 1 Angstrom
tolerance
Thermolysin results
Increasing tolerance to 1.5 Angstroms
finds common 6-point pharmacophore
including one extra interaction
Example IV: Antibacterial
peptides [Spatola et al., 2000]
Dataset of 11 pentapeptides showing
activity against Pseudomonas
aeruginosa
6 actives <64mg/ml IC50
5 inactives
Pharmacophore Identified
A Molecule M is active against Pseudomonas Aeruginosa
if it has a conformation B such that:
M has a hydrophobic group C,
M has a hydrogen acceptor D,
the distance between C and D in conformation B is 11.7 Angstroms
M has a positively-charged atom E,
the distance between C and E in conformation B is 4 Angstroms
the distance between D and E in conformation B is 9.4 Angstroms
M has a positively-charged atom F,
the distance between C and F in conformation B is 11.1 Angstroms
the distance between D and F in conformation B is 12.6 Angstroms
the distance between E and F in conformation B is 8.7 Angstroms
Tolerance 1.5 Angstroms
Clinical Databases of the Future
(Dramatically Simplified)
PatientID   Gender   Birthdate
P1          M        3/22/63

PatientID   Date     Lab Test        Result
P1          1/1/01   blood glucose   42
P1          1/9/01   blood glucose   45

PatientID   Date     Physician   Symptoms       Diagnosis
P1          1/1/01   Smith       palpitations   hypoglycemic
P1          2/1/03   Jones       fever, aches   influenza

PatientID   SNP1   SNP2   …   SNP500K
P1          AA     AB     …   BB
P2          AB     BB     …   AA

PatientID   Date Prescribed   Date Filled   Physician   Medication   Dose   Duration
P1          5/17/98           5/18/98       Jones       prilosec     10mg   3 months
Final Wrap-up
Molecular biology collecting lots and lots of
data in post-genome era
Opportunity to “connect” molecular-level
information to diseases and treatment
Need analysis tools to interpret
Data mining opportunities abound
Hopefully this tutorial provided solid start
toward applying data mining to high-throughput biological data
Thanks To
Jude Shavlik
John Shaughnessy
Bart Barlogie
Mark Craven
Sean McIlwain
Jan Struyf
Arno Spatola
Paul Finn
Beth Burnside
Michael Molla
Michael Waddell
Irene Ong
Jesse Davis
Soumya Ray
Jo Hardin
John Crowley
Fenghuang Zhan
Eric Lantz
If Time Permits… some of my
group’s directions in the area
Clinical Data (with Jesse Davis, Beth
Burnside, M.D.)
Addressing another problem with
current approaches to biological
network learning (with Soumya Ray,
Eric Lantz)
Using Machine Learning with
Clinical Histories: Example
We’ll use example of Mammography to
show some issues that arise
These issues arise here even with just
one relational table
These issues are even more
pronounced with data in multiple tables
Supervised Learning Task
Given: a database of mammogram
abnormalities for different patients (same cell
type from every patient)
Radiologist-entered values describing the
abnormality constitute the features, and
abnormality’s biopsy result as benign or
malignant constitutes the class
Do: Learn a model that accurately predicts
class based on features
Mammography Database
[Davis et al, 2005; Burnside et al, 2005]
Patient   Abnormality   Date   Mass Shape   …   Mass Size   Loc   Be/Mal
P1        1             5/02   Spic         …   0.03        RU4   B
P1        2             5/04   Var          …   0.04        RU4   M
P1        3             5/04   Spic         …   0.04        LL3   B
…         …             …      …            …   …           …     …
Original Expert Structure
Level 1: Parameters
Be/Mal
Shape
Size
Given: Features (node labels, or
fields in database), Data, Bayes
net structure
Learn: Probabilities. Note:
probabilities needed are
Pr(Be/Mal), Pr(Shape|Be/Mal),
Pr (Size|Be/Mal)
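Level 1 learning, with the structure fixed, reduces to estimating each probability table by counting. The Python below is a sketch under assumptions: the function name and dictionary keys are hypothetical, no smoothing is applied, and the three rows are the abnormalities shown in the mammography table, far too few for real estimates. A continuous field such as Size would first need to be discretized.

```python
from collections import Counter

def estimate_cpt(rows, child, parent):
    """Maximum-likelihood estimate of Pr(child | parent) by counting."""
    pair_counts = Counter((r[parent], r[child]) for r in rows)
    parent_counts = Counter(r[parent] for r in rows)
    return {(p, c): n / parent_counts[p] for (p, c), n in pair_counts.items()}

# The three abnormalities shown in the mammography table.
rows = [
    {"shape": "Spic", "size": 0.03, "class": "B"},
    {"shape": "Var",  "size": 0.04, "class": "M"},
    {"shape": "Spic", "size": 0.04, "class": "B"},
]

# Pr(Be/Mal): relative frequency of each class.
class_counts = Counter(r["class"] for r in rows)
prior = {c: n / len(rows) for c, n in class_counts.items()}

# Pr(Shape | Be/Mal): one entry per (class, shape) pair seen in the data.
shape_given_class = estimate_cpt(rows, child="shape", parent="class")
```

In practice a small pseudo-count would usually be added so that unseen combinations do not receive probability zero.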
Level 2: Structure
Be/Mal
Shape
Size
Given: Features, Data
Learn: Bayes net structure
and probabilities. Note: with
this structure, now will need
Pr(Size|Shape,Be/Mal)
instead of Pr(Size|Be/Mal).
Level 3: Aggregates
[Figure: Bayes net over the nodes Be/Mal, Shape, Size, plus the aggregate node "Avg size this date"]
Given: Features, Data, Background knowledge – aggregation functions such as average, mode, max, etc.
Learn: Useful aggregate features, Bayes net structure that uses these features, and probabilities. New features may use other rows/tables.
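A minimal sketch of one aggregate feature, "Avg size this date", computed from other rows of the same table. The rows are the three shown in the Mammography Database table; the field names and the grouping key (patient, date) are assumptions.

```python
from collections import defaultdict

# The three abnormalities shown in the table (field names are hypothetical).
rows = [
    {"patient": "P1", "abnormality": 1, "date": "5/02", "size": 0.03},
    {"patient": "P1", "abnormality": 2, "date": "5/04", "size": 0.04},
    {"patient": "P1", "abnormality": 3, "date": "5/04", "size": 0.04},
]

def add_avg_size_this_date(rows):
    """Return copies of the rows with the mean size over the same patient and date."""
    sizes = defaultdict(list)
    for r in rows:
        sizes[(r["patient"], r["date"])].append(r["size"])
    augmented = []
    for r in rows:
        group = sizes[(r["patient"], r["date"])]
        augmented.append(dict(r, avg_size_this_date=sum(group) / len(group)))
    return augmented

augmented = add_avg_size_this_date(rows)
```

Each abnormality now carries a value that depends on other abnormalities, which a row-by-row learner at Levels 1 and 2 cannot see.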
Level 4: View Learning
[Figure: Bayes net over the nodes Be/Mal, Shape, Size, "Avg size this date", plus the view-defined nodes "Shape change in abnormality at this location" and "Increase in average size of abnormalities"]
Given: Features, Data, Background knowledge – aggregation functions and intensionally-defined relations such as “increase” or “same location”
Learn: Useful new features defined by views (equivalent to rules or SQL queries), Bayes net structure, and probabilities.
Example of Learned Rule
is_malignant(A) IF:
'BIRADS_category'(A,b5),
'MassPAO'(A,present),
'MassesDensity'(A,high),
'HO_BreastCA'(A,hxDCorLC),
in_same_mammogram(A,B),
'Calc_Pleomorphic'(B,notPresent),
'Calc_Punctate'(B,notPresent).
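To show how such a rule becomes a new field in the view, the sketch below evaluates it as a Boolean feature over abnormality records. The record layout, the `mammogram` identifier, and the reading of in_same_mammogram(A,B) as "B is a different abnormality in the same mammogram" are assumptions; the two records are hypothetical.

```python
def is_malignant(a, abnormalities):
    """Evaluate the learned rule for abnormality `a` against all records."""
    if not (
        a["BIRADS_category"] == "b5"
        and a["MassPAO"] == "present"
        and a["MassesDensity"] == "high"
        and a["HO_BreastCA"] == "hxDCorLC"
    ):
        return False
    # Existential condition on some other abnormality B in the same mammogram.
    return any(
        b is not a
        and b["mammogram"] == a["mammogram"]
        and b["Calc_Pleomorphic"] == "notPresent"
        and b["Calc_Punctate"] == "notPresent"
        for b in abnormalities
    )

# Hypothetical records, invented only to exercise the rule.
first = {
    "mammogram": "m1", "BIRADS_category": "b5", "MassPAO": "present",
    "MassesDensity": "high", "HO_BreastCA": "hxDCorLC",
    "Calc_Pleomorphic": "present", "Calc_Punctate": "notPresent",
}
second = {
    "mammogram": "m1", "BIRADS_category": "b3", "MassPAO": "notPresent",
    "MassesDensity": "low", "HO_BreastCA": "none",
    "Calc_Pleomorphic": "notPresent", "Calc_Punctate": "notPresent",
}
records = [first, second]
new_field = [is_malignant(a, records) for a in records]
```

The value for the first record depends on the second record, which is what distinguishes a view-defined feature from the original table columns.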
ROC: Level 2 (TAN) vs. Level 1
Precision-Recall Curves
All Levels of Learning
[Figure: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1), with curves for Level 4 (View), Level 3 (Aggregate), Level 2 (Structure), and Level 1 (Parameter)]
SAYU-View
Improved View Learning approach
SAYU: Score As You Use
For each candidate rule, add it to the
Bayesian network and see if it improves
the network’s score
Only add a rule (new field for the view)
if it improves the Bayes net
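The selection loop described above can be sketched as follows. The `score` callable stands in for "learn the Bayesian network with these rules as extra features and score it"; here it is replaced by a toy coverage measure, which is purely an assumption for the sake of a runnable example. The rule names and covered example ids are hypothetical.

```python
def sayu_select(candidates, score, min_gain=0.0):
    """Keep a candidate rule only if adding it improves the current score."""
    selected = []
    best = score(selected)
    for rule in candidates:
        trial = score(selected + [rule])
        if trial > best + min_gain:
            selected.append(rule)
            best = trial
    return selected, best

# Toy stand-in for the network score: fraction of positive examples covered.
coverage = {"r1": {1, 2, 3}, "r2": {2, 3}, "r3": {4}}
positives = {1, 2, 3, 4, 5}

def toy_score(rules):
    covered = set()
    for name in rules:
        covered |= coverage[name]
    return len(covered & positives) / len(positives)

selected, best = sayu_select(["r1", "r2", "r3"], toy_score)
```

Rule r2 is rejected because it adds nothing once r1 is in the model, even though it would look useful if scored on its own.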
Relational Learning Algorithms
[Figure: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1), with curves for SAYU-View, Initial Level 4 (View), and Level 3 (Aggregates)]
Average Area Under PR Curve For Recall >= 0.5
[Figure: bar chart of AURPR (y-axis, 0.00 to 0.12) for SAYU-View, Initial Level 4, and Level 3]
Clinical Databases of the Future
(Dramatically Simplified)

PatientID  Gender  Birthdate
P1         M       3/22/63

PatientID  Date    Lab Test       Result
P1         1/1/01  blood glucose  42
P1         1/9/01  blood glucose  45

PatientID  Date    Physician  Symptoms      Diagnosis
P1         1/1/01  Smith      palpitations  hypoglycemic
P1         2/1/03  Jones      fever, aches  influenza

PatientID  Date Prescribed  Date Filled  Physician  Medication  Dose  Duration
P1         5/17/98          5/18/98      Jones      prilosec    10mg  3 months

PatientID  SNP1  SNP2  …  SNP500K
P1         AA    AB    …  BB
P2         AB    BB    …  AA
Another Problem with Current Learning of Regulatory Models
Current techniques all use greedy heuristics
Bayes net learning algorithms use the “sparse candidate” approach: to be considered as a parent of gene 1, another gene 2 must be correlated with gene 1
CPDs are often represented as trees, which are learned with greedy tree learning algorithms
All can fall prey to functions such as exclusive-or. Do these arise?
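A small sketch of why a correlation filter can miss exclusive-or parents. Over the full truth table of two binary parents, each parent has zero Pearson correlation with their XOR, while the same parents do correlate with an AND target. The variable names are hypothetical; nothing here is specific to gene data.

```python
from itertools import product

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

table = list(product((0, 1), repeat=2))      # all settings of two parents
parent_a = [a for a, _ in table]
parent_b = [b for _, b in table]
target_xor = [a ^ b for a, b in table]
target_and = [a & b for a, b in table]

corr_xor = (pearson(parent_a, target_xor), pearson(parent_b, target_xor))
corr_and = (pearson(parent_a, target_and), pearson(parent_b, target_and))
```

A candidate-parent filter based on pairwise correlation would keep both parents for the AND target and discard both for the XOR target.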
Skewing Example [Page & Ray, 2003; Ray & Page, 2004; Rosell et al., 2005; Ray & Page, 2005]
Female  Sxl Gene  Gene3 … Gene100         Survival
0       0         0…0000000 … 1…1111111   0
0       1         0…0000000 … 1…1111111   1
1       0         0…0000000 … 1…1111111   1
1       1         0…0000000 … 1…1111111   0
Drosophila survival based on gender and Sxl gene activity
Hard Functions
Our Definition: those functions for
which no attribute has “gain” according
to standard purity measures (GINI,
Entropy)
NOTE: “Hard” does not refer to size
of representation
Example: n-variable odd parity
Many others
Learning Hard Functions
Standard method of learning hard
functions (e.g. with decision trees):
depth-k Lookahead
$O(mn^{2^{k+1}-1})$ for m examples in n variables
We devise a technique that allows
learning algorithms to efficiently learn
hard functions
Key Idea
Hard functions are not hard for all data
distributions
We can skew the input distribution to
simulate a different one
By randomly choosing “preferred values”
for attributes
Accumulate evidence over several
skews to select a split attribute
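A rough sketch of the idea, under stated assumptions: each skew picks a random preferred value per attribute, reweights every example by a product of s (attribute matches its preferred value) or 1 - s (it does not), and the Gini gain of every attribute is summed across skews. The skew strength `s = 0.75`, the number of skews, and summing gains as the form of "accumulating evidence" are illustrative choices, not details taken from the cited papers.

```python
import random
from itertools import product

def gini(p):
    """Gini impurity of a binary variable with P(1) = p, in the p(1 - p) form."""
    return p * (1.0 - p)

def weighted_gain(rows, labels, weights, attr):
    """Drop in weighted Gini impurity from splitting on a binary attribute."""
    total = sum(weights)
    pos = sum(w for w, y in zip(weights, labels) if y == 1)
    gain = gini(pos / total)
    for value in (0, 1):
        branch = sum(w for r, w in zip(rows, weights) if r[attr] == value)
        if branch == 0:
            continue
        branch_pos = sum(w for r, w, y in zip(rows, weights, labels)
                         if r[attr] == value and y == 1)
        gain -= (branch / total) * gini(branch_pos / branch)
    return gain

def skewed_split(rows, labels, n_skews=30, s=0.75, seed=0):
    """Pick a split attribute by accumulating gain over random skews."""
    rng = random.Random(seed)
    n_attrs = len(rows[0])
    evidence = [0.0] * n_attrs
    for _ in range(n_skews):
        preferred = [rng.randint(0, 1) for _ in range(n_attrs)]
        weights = []
        for r in rows:
            w = 1.0
            for value, pref in zip(r, preferred):
                w *= s if value == pref else 1.0 - s
            weights.append(w)
        for a in range(n_attrs):
            evidence[a] += weighted_gain(rows, labels, weights, a)
    return max(range(n_attrs), key=lambda a: evidence[a]), evidence

# Five binary attributes; the target is the XOR of the first two.
rows = list(product((0, 1), repeat=5))
labels = [r[0] ^ r[1] for r in rows]
unskewed = [weighted_gain(rows, labels, [1.0] * len(rows), a) for a in range(5)]
best, evidence = skewed_split(rows, labels)
```

On the unweighted data every attribute has zero gain, so a greedy learner has nothing to choose between. After skewing, only the two relevant attributes accumulate evidence.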
Example: Uniform Distribution
x1  x2  x3  f
0   0   0   0
0   0   1   0
0   1   0   1
0   1   1   1
1   0   0   1
1   0   1   1
1   1   0   0
1   1   1   0
$\mathrm{GINI}(f) = 0.25$
$\mathrm{GINI}(f; x_i = 0) = 0.25$
$\mathrm{GINI}(f; x_i = 1) = 0.25$
$\mathrm{GAIN}(x_i) = 0$
Example: Skewed Distribution
(“Sequential Skewing”, Ray & Page, ICML 2004)
x1  x2  x3  f  Weight
0   0   0   0  1/4
0   0   1   0  1/4
0   1   0   1  3/4
0   1   1   1  3/4
1   0   0   1  1/4
1   0   1   1  1/4
1   1   0   0  3/4
1   1   1   0  3/4
$\mathrm{GINI}(f) = 0.25$
$\mathrm{GINI}(f; x_1 = 1) = 0.19$
$\mathrm{GINI}(f; x_2 = 1) = 0.25$
$\mathrm{GINI}(f; x_3 = 1) = 0.25$
$\mathrm{GAIN}(x_1) > 0$
$\mathrm{GAIN}(x_2) = 0$
$\mathrm{GAIN}(x_3) = 0$
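The numbers in the uniform and skewed examples can be checked with a short sketch. It assumes GINI is the p(1 - p) form (which gives 0.25 at p = 0.5), reads f as x1 XOR x2 from the f column, and takes the skewed weights as 3/4 when x2 = 1 and 1/4 otherwise. The helper names are hypothetical.

```python
from itertools import product

def gini(p):
    """Gini impurity in the p(1 - p) form."""
    return p * (1.0 - p)

def weighted_gini(rows, weights, attr=None, value=None):
    """GINI(f) over all rows, or GINI(f; x_attr = value) when attr is given."""
    total = 0.0
    positive = 0.0
    for row, w in zip(rows, weights):
        if attr is not None and row[attr] != value:
            continue
        total += w
        if row[0] ^ row[1]:  # f = x1 XOR x2, as in the truth table
            positive += w
    return gini(positive / total)

def gain(rows, weights, attr):
    """GINI(f) minus the weighted average of GINI(f; x_attr = v) over v."""
    all_w = sum(weights)
    remainder = 0.0
    for value in (0, 1):
        branch_w = sum(w for r, w in zip(rows, weights) if r[attr] == value)
        remainder += (branch_w / all_w) * weighted_gini(rows, weights, attr, value)
    return weighted_gini(rows, weights) - remainder

rows = list(product((0, 1), repeat=3))               # (x1, x2, x3) in table order
uniform = [1.0] * len(rows)
skewed = [0.75 if r[1] == 1 else 0.25 for r in rows]  # weight depends on x2 only

uniform_gains = [gain(rows, uniform, a) for a in range(3)]
skewed_gains = [gain(rows, skewed, a) for a in range(3)]
gini_x1_is_1 = weighted_gini(rows, skewed, attr=0, value=1)
```

Under the skew, GINI(f; x1 = 1) is 0.1875 (shown rounded as 0.19), and GAIN(x1) comes out to 0.0625 while x2 and x3 stay at zero. Skewing x2 is what exposes x1.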