CFS SNP data analysis - CAMDA 2009 Conference


Identifying Potential Biomarkers
for Chronic Fatigue Syndrome
via
Classification Model Ensemble Mining
Ben Goertzel, PhD
Biomind LLC
www.biomind.com
General Methodology
Using machine learning to find
nonlinear combinations of genes,
mutations or clinical indicators that
are associated with diseases, toxic
reactions, symptoms, or other
phenotypic qualities
Summary
 CAMDA 2006 data was analyzed using a novel “classification model
ensemble mining” methodology
 Genetic programming and heuristic search are applied to learn
ensembles of classification rules that distinguish CFS from Control
– one set of classifiers based on microarray data
– one set based on SNP data
 These ensembles are then statistically analyzed to identify genes,
gene categories, and combinations thereof that appear to play
important roles in characterizing CFS.
 The results of this analysis include
– potential microarray and SNP based diagnostic rules for CFS
– lists of SNPs, genes and gene categories that are potentially
significant biomarkers for CFS
• these differ from those found via simple statistical
category-differentiation analysis
Summary
Overall, our results appear compatible with a system-theoretic view of CFS, in which the disorder is a
complex pattern of activity across the organism, including
interlinked disturbances in neural and endocrine
systems.
Conceptual Hypothesis
Recent Related Work
Goertzel BN, Pennachin C, de Souza Coelho L, Maloney EM, Jones JF, Gurbaxani B.
Allostatic load is associated with symptoms in chronic fatigue syndrome patients.
Pharmacogenomics. 2006 Apr;7(3):485-94.
Goertzel BN, Pennachin C, de Souza Coelho L, Gurbaxani B, Maloney EM, Jones JF.
Combinations of single nucleotide polymorphisms in neuroendocrine effector and
receptor genes predict chronic fatigue syndrome. Pharmacogenomics. 2006
Apr;7(3):475-83.
Maloney EM, Gurbaxani BM, Jones JF, de Souza Coelho L, Pennachin C, Goertzel BN.
Chronic fatigue syndrome and high allostatic load. Pharmacogenomics. 2006
Apr;7(3):467-73.
Gurbaxani BM, Jones JF, Goertzel BN, Maloney EM.
Linear data mining the Wichita clinical matrix suggests sleep and allostatic load
involvement in chronic fatigue syndrome. Pharmacogenomics. 2006 Apr;7(3):455-65.
Overview
 Microarray Data Analysis
 SNP Data Analysis
Analyzing Microarray Data
via Supervised Categorization
 Often more statistically meaningful than clustering
– and allows one to do clustering of features based on whether
they’re used in the same categorization models
 The researcher must divide the data into two or more categories,
e.g.
– Case vs. Control
– Early vs. Late (in a time series experiment)
– Multiclass categorization: which kind of cancer?
 Algorithms learn rules (“models”) that predict which category a
microarray gene expression profile falls into, via combining
expression values in an automatically learned mathematical formula
Supervised Categorization Algorithms
Many supervised categorization algorithms exist, each with strengths
and weaknesses
Unlike with clustering, a choice may be made based on rigorous
validation methodology
– Decision trees
– Neural networks
– Logistic regression
– Support vector machines
– Genetic programming
– Etc.
Applications of Supervised
Categorization Analysis
 Classification models may be used as diagnostic rules
 Classification models may be studied to yield intuitive
insight
– particularly interesting in the case of model
ensembles
Example Classification Rule Learned via
Genetic Programming
if
(NM_005110 + NM_001614)/NM_002230 - .3* NM_002297 > 1
then Case
else Control
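The rule above can be evaluated directly on an expression profile. The following is a minimal sketch, assuming the profile is a dictionary keyed by accession number; the function name `classify` and the sample values are illustrative and not taken from the original analysis.

```python
def classify(expr):
    """Apply the example rule to a dict of expression values keyed by accession."""
    score = (expr["NM_005110"] + expr["NM_001614"]) / expr["NM_002230"] \
        - 0.3 * expr["NM_002297"]
    return "Case" if score > 1 else "Control"

# Hypothetical expression values, for illustration only.
sample = {"NM_005110": 2.0, "NM_001614": 1.5, "NM_002230": 1.0, "NM_002297": 0.5}
label = classify(sample)
```

Because the rule divides by an expression value, a real implementation would need to guard against zero denominators, for example after normalization.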
Using Ontologies to Make Enhanced
Feature Vectors
Traditional statistical
and machine
learning methods
characterize
individuals by
expression values
alone.
One may extend these
values with entries
representing the inferred
expression levels of
interesting categories of
genes or proteins, as well
as clinical data.
These enhanced feature
vectors are then fed into
machine learning
algorithms carrying out
categorization, clustering,
etc.
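A minimal sketch of building such an enhanced feature vector follows. The slides do not say how a category's expression level is inferred, so this sketch assumes the mean over the measured member genes; the function `enhance` and the gene-to-category memberships are made up for illustration.

```python
from statistics import mean

def enhance(expression, categories):
    """Return expression values extended with one inferred value per category.

    expression: dict gene_id -> expression value
    categories: dict category_id -> list of member gene_ids
    Assumption: a category's level is the mean of its measured member genes.
    """
    enhanced = dict(expression)
    for cat_id, members in categories.items():
        measured = [expression[g] for g in members if g in expression]
        if measured:  # skip categories with no measured member genes
            enhanced[cat_id] = mean(measured)
    return enhanced

# Hypothetical values and memberships, for illustration only.
expression = {"NM_001527": 1.2, "NM_005110": -0.4, "NM_001614": 0.8}
categories = {"GO:0004407": ["NM_001527", "NM_001614"],
              "GO:0016570": ["NM_999999"]}
features = enhance(expression, categories)
```

The resulting dictionary holds both the direct features (genes) and the enhanced features (categories), so it can be passed to any categorization or clustering algorithm unchanged.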
Gene Expression Data Enhancement
Example Classification Rule Learned via
Genetic Programming
“Enhanced Feature”
based on GO
“Enhanced Feature”
based on PIR
Example Ontology-Based Classification
Rule in Biomind ArrayGenius User
Interface
Classification Accuracy Using Biomind
Tools
Categorization Model Ensembles
 Generally there will be many qualitatively different “classification
models” distinguishing one category from another
 For diagnostics, one needs only a single good rule
– though voting across a model ensemble may give better
accuracy than any individual learned rule
 For gaining qualitative understanding, statistical analysis of feature
usage across models in an ensemble appears to be quite valuable
 Genetic programming is a particularly useful technique here,
because each learned model tends to use a relatively small number
of features
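Voting across an ensemble can be sketched as below. This assumes each model is a function returning "Case" or "Control" and uses simple majority voting; the slides do not specify the voting scheme, and the three rules and feature names `g1` to `g3` are hypothetical.

```python
from collections import Counter

def vote(models, sample):
    """Classify a sample by majority vote over an ensemble of rule functions."""
    counts = Counter(model(sample) for model in models)
    return counts.most_common(1)[0][0]

# Three hypothetical rules, each using only a few features.
models = [
    lambda s: "Case" if s["g1"] + s["g2"] > 1 else "Control",
    lambda s: "Case" if s["g3"] > 0.5 else "Control",
    lambda s: "Case" if s["g1"] - s["g3"] > 0 else "Control",
]
sample = {"g1": 0.9, "g2": 0.4, "g3": 0.2}
prediction = vote(models, sample)
```

With two categories, an odd number of models avoids ties.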
Important Features Analysis
 Given a classification model ensemble, one can
list the features that occur in the greatest
number of models
 These are NOT necessarily the same features
that provide the greatest differentiation between the two
categories, considered individually
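Counting feature occurrences across an ensemble is straightforward. The sketch below assumes each model is represented simply by the set of features it uses; the ensemble shown is invented for illustration.

```python
from collections import Counter

def feature_importance(ensemble):
    """Count, for each feature, how many models in the ensemble use it."""
    counts = Counter()
    for features in ensemble:
        counts.update(set(features))  # count each feature once per model
    return counts.most_common()

# Hypothetical ensemble: each entry is the feature set of one learned model.
ensemble = [
    {"NM_005110", "NM_001614", "NM_002230"},
    {"NM_005110", "GO:0004407"},
    {"NM_005110", "NM_002230", "GO:0004407"},
    {"NM_002297"},
]
ranking = feature_importance(ensemble)
```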
Classification Model Utilization Based
Clustering
 Associate each feature (gene, GO, etc.) with the set of
successful classification models that use this feature
 Interpret these sets as “meta feature vectors”, one for
each direct or enhanced feature in the original dataset
 Cluster these meta feature vectors
 The resulting clusters are sets of genes or GO categories that
have interesting interactions in the context of the
classification problem at hand
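The steps above can be sketched as follows. The slides do not name the distance measure or the clustering algorithm, so this sketch assumes Jaccard distance between model-usage sets and a simple single-linkage grouping with a hypothetical cutoff `max_distance`; the feature names are placeholders.

```python
def meta_vectors(ensemble):
    """Map each feature to the set of model indices that use it."""
    usage = {}
    for i, features in enumerate(ensemble):
        for f in features:
            usage.setdefault(f, set()).add(i)
    return usage

def jaccard_distance(a, b):
    """1 minus the overlap between two model-usage sets."""
    return 1.0 - len(a & b) / len(a | b)

def cluster(usage, max_distance=0.5):
    """Single-linkage grouping: features within max_distance end up together."""
    clusters = []
    for feature in sorted(usage):
        linked = [c for c in clusters
                  if any(jaccard_distance(usage[feature], usage[m]) <= max_distance
                         for m in c)]
        merged = {feature}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters

# Hypothetical ensemble: feature sets of four successful models.
ensemble = [{"A", "B"}, {"A", "B", "C"}, {"C", "D"}, {"D"}]
clusters = cluster(meta_vectors(ensemble))
```

Features A and B are used by exactly the same models, so they fall into one cluster, while C and D remain separate at this cutoff.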
Example Application: Analysis
of CFS Data
• Collaboration with Dr. Suzanne
Vernon at CDC
• Specific Aims
– To determine which genes
are consistently expressed
in the peripheral blood
– To determine if genes and
background knowledge
could be used to classify
CFS
– To determine if there was a
common (perturbed) CFS
pathway
– To understand differences
between post-EBV fatigue
and other types of CFS
Details of CFS Data
 We analyzed
– CAMDA (Wichita) dataset
– Exercise dataset
– Post-Infectious Fatigue dataset
 Gene expression:
– Noise filtering: started with 30,000 genes
• Exercise dataset reduced to 1,921
• Wichita dataset reduced to 10,812
– Log-transformed and Z-score normalized
 Background knowledge
• Exercise dataset added 377 GO and 145 PIR features
• Wichita dataset added 1405 GO and 1413 PIR features
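The normalization step can be sketched as below. The slides do not state the log base, whether normalization is per gene or per sample, or which standard deviation is used; this sketch assumes base 2 and the population standard deviation over one list of positive intensities.

```python
import math
from statistics import mean, pstdev

def log_zscore(values):
    """Log2-transform positive intensities, then z-score them."""
    logged = [math.log2(v) for v in values]
    mu, sigma = mean(logged), pstdev(logged)
    return [(x - mu) / sigma for x in logged]

# Hypothetical raw intensities, for illustration only.
normalized = log_zscore([100.0, 200.0, 400.0, 800.0])
```

After the transformation the values have mean 0 and standard deviation 1; a constant input would give a zero standard deviation and needs separate handling.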
Results
Confusion Matrix on CAMDA Data
(Leave-One-Out Validation)
10 Most Important Features
Comparison of Clustering Approaches
(Example Clusters)
Expression-Based Clustering
– GO:0000118 histone deacetylase complex
– GO:0004407 histone deacetylase activity
– GO:0005667 transcription factor complex
– GO:0006476 protein amino acid deacetylation
– GO:0016570 histone modification
– GO:0016575 histone deacetylation
– GO:0019213 deacetylase activity
– NM_001527 Homo sapiens histone deacetylase 2 (HDAC2), mRNA
Model Usage Based Clustering
– GO:0004407 histone deacetylase activity
– AC016882
– GO:0042221 response to chemical substance
– GO:0008628 induction of apoptosis by hormones
– ENSG00000086758
– AB053232
– GO:0009991 response to extracellular stimulus
Interpretation of Histone
Cluster
 The relation between histone deacetylase and apoptosis is now well
known
 It was demonstrated that caspase-2 and -3, which are part of the
superfamily of caspases, the major group of proteins responsible for
triggering apoptosis (Cryns and Yuan, 1998), are able to interact with and
cleave the amino-terminal portion of histone deacetylase 4,
which accumulates in the nucleus and interacts physically with the
transcription factor MEF2C, thus preventing this factor from
activating anti-death signals that would allow cell survival (Paroni et
al, 2004).
 Coexposure of cells to HDIs in conjunction with STI571 has been
observed to down-regulate proteins related to response to
extracellular stimulus, such as phospho-extracellular signal-regulated kinase (ERK) (Yu et al, 2003).
Signal interaction map for
cluster derived from CFS data
13 of 17 identified genes are in the “hairball” (320 nodes, 1603 edges)
Produced in the MetaCore product by GeneGo
Measuring Clustering Quality
 The quality of a clustering was measured as the
product homogeneity × separation.
 Homogeneity is calculated as the average of the
distances of all members of the cluster to their
nearest cluster-mates.
 Separation is simply the minimum distance from
any given member of the cluster to elements
outside the cluster.
 These particular definitions were used in order
to minimize the influence of the size of the
cluster on its quality.
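A minimal sketch of these definitions follows. The slides do not specify the distance measure, so Euclidean distance is assumed here, and "nearest cluster-mates" is read as each member's single nearest neighbour within the cluster; the points are invented.

```python
import math

def quality(cluster, outside):
    """Return (homogeneity, separation, homogeneity * separation) for one cluster.

    cluster, outside: lists of points (tuples of floats); cluster needs >= 2 points.
    """
    # Distance from each member to its nearest cluster-mate.
    nearest = [min(math.dist(p, q) for j, q in enumerate(cluster) if j != i)
               for i, p in enumerate(cluster)]
    homogeneity = sum(nearest) / len(nearest)
    # Minimum distance from any member to any element outside the cluster.
    separation = min(math.dist(p, q) for p in cluster for q in outside)
    return homogeneity, separation, homogeneity * separation

# Hypothetical 2-D points, for illustration only.
cluster = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
outside = [(4.0, 0.0), (0.0, 5.0)]
h, s, q = quality(cluster, outside)
```

Neither quantity is a sum over members, which is consistent with the stated aim of limiting the influence of cluster size.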
Model Utilization Based
versus Conventional
Clustering
Useful feature map – Exercise Challenge
• Exercise challenge – 31 of 100 most useful features were
GO categories that map to the following pathways:
– RNA Pol II
– Phosphorylation of platelet sec-1 and Syntaxin-4
– RAD (DNA repair)
– RNA Pol I
– mRNA processing, ribosomes, translation
– Lipid metabolism (triglyceride biosynthesis)
– Caspases
• Exercise challenge – 62 of 100 most useful features were
genes that map to the following pathways:
– Recognition and binding of core promoter elements by TFIID
– Ribosome formation and translation initiation
Useful feature map – Wichita
hemostasis
Cell cycle
Binding of TFIID
Metabolism
apoptosis
DNA repair
• Wichita – 80 of 100 most useful features were GO
categories and genes that map to several pathways.
– Noticeably absent are features in DNA repair initiation,
transcription, gene expression and mRNA processing.
Useful feature map – Post-Infective Fatigue
spliceosome formation
and mRNA processing
• Post-infective fatigue – 79 of 100 most useful features were
GO categories and genes that map to mRNA processing and
spliceosome formation.
Interpretation of CFS
Microarray Results
 Features that make up these models indicate
widespread disruption of cell homeostasis in
both the Exercise and Wichita studies.
 Features that make up the PIF classification
model identify mRNA processing pathways
known to be disrupted by EBV.
CFS SNP data analysis
 CFS SNP data publicly available through the
CAMDA 2006 challenge (see
http://www.camda.duke.edu/camda06)
 Genes pre-selected for SNP analysis because of
possible CFS involvement
 SNP data processed with Biomind software
SNP-Sets as Pattern Strength
Classifiers
Each Pattern Strength Classifier is simply a list of SNPs and a
threshold.
For a given individual being evaluated by a given rule, the “sum of SNP
incidences” is computed in the following way:
 if the individual has a SNP s (present in the SNP list of the rule) in
heterozygosis, then the value 2 is summed for s;
 if s is present in homozygosis, then 1 is summed;
 finally, if s is undetermined for that individual, then 0 is summed.
After this sum is computed for all SNPs in the rule list, the value is
compared with the rule threshold: if it is greater than the threshold,
the individual is classified as CFS, otherwise Control.
For each SNP-set, the threshold value is selected that allows the SNP-set to achieve the maximum accuracy for distinguishing Case vs.
Control.
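The classifier and its threshold selection can be sketched as follows. The weights are taken exactly as stated above (heterozygous 2, homozygous 1, undetermined 0). Treating a SNP that is missing from an individual's record as contributing 0 is an assumption, as are the genotype encoding, the function names and the SNP identifiers, none of which come from the Biomind software.

```python
HET, HOM, UNDETERMINED = "het", "hom", None

# Weights as stated in the text: heterozygous 2, homozygous 1, undetermined 0.
WEIGHTS = {HET: 2, HOM: 1, UNDETERMINED: 0}

def snp_sum(genotypes, snp_list):
    """Sum of SNP incidences for one individual over the SNPs in a rule."""
    # Assumption: a SNP missing from the record contributes 0.
    return sum(WEIGHTS[genotypes.get(snp)] for snp in snp_list)

def classify(genotypes, snp_list, threshold):
    """CFS if the sum exceeds the threshold, otherwise Control."""
    return "CFS" if snp_sum(genotypes, snp_list) > threshold else "Control"

def best_threshold(individuals, labels, snp_list):
    """Pick the threshold giving maximum accuracy on the labelled individuals."""
    sums = [snp_sum(g, snp_list) for g in individuals]
    candidates = sorted(set(sums) | {min(sums) - 1})

    def accuracy(t):
        hits = sum(("CFS" if s > t else "Control") == y
                   for s, y in zip(sums, labels))
        return hits / len(labels)

    return max(candidates, key=accuracy)

# Hypothetical SNP-set and genotypes, for illustration only.
snp_list = ["snpA", "snpB"]
individuals = [{"snpA": HET, "snpB": HOM}, {"snpA": HET}, {"snpB": HOM}, {}]
labels = ["CFS", "CFS", "Control", "Control"]
threshold = best_threshold(individuals, labels, snp_list)
```

Only the observed sums (plus one value below the minimum) need to be tried as thresholds, since accuracy cannot change between two consecutive observed sums.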
SNP Based Classifiers for CFS
vs. Control
Important Genes for Differentiating CFS vs.
Control
Conceptual Hypothesis