Transcript wls-ideas03

Data Mining of Gene Expression
Profiles for the Diagnosis and
Understanding of Diseases
Limsoon Wong
Institute for Infocomm Research
Copyright 2003 limsoon wong
Plan
• Some accomplishments and challenges in
knowledge discovery from biological and
clinical data
• Data mining in microarray analysis
– diagnosis of disease state and subtype
– derivation of treatment plan
– understanding of gene interaction network
Copyright 2003 limsoon wong
Knowledge Discovery from
Biological and Clinical Data:
MOTIVATION
Copyright 2003 limsoon wong
Driving Forces: Genes, Proteins,
Interactions, Diagnosis, & Cures
• Complete genomes
are now available
• Proteins, not genes,
• Proteins function by
are responsible for
interacting with other
proteins and
• Knowing the genes is many cellular activities
biomolecules
not enough to
understand how
biology functions
INTERACTOME
GENOME
PROTEOME
Copyright 2003 limsoon wong
If we figure out how these work,
we get these Benefits
To the patient:
Better drug, better treatment
To the pharma:
Save time, save cost, make more $
To the scientist:
Better science
Copyright 2003 limsoon wong
To figure these out,
we bet on...
“solution” =
Data Mgmt + Knowledge Discovery
Data Mgmt =
Integration + Transformation + Cleansing
Knowledge Discovery =
Statistics + Algorithms + Databases
Copyright 2003 limsoon wong
Knowledge Discovery from
Biological and Clinical Data:
ACCOMPLISHMENT
Copyright 2003 limsoon wong
8 years of bioinformatics R&D
in Singapore
Integration
Technology
(Kleisli)
MHC-Peptide Protein Interactions
Extraction (PIES)
Binding
(PREDICT)
Gene Expression
Molecular
Cleansing &
Connections & Medical Record
Warehousing
Datamining (PCL)
(FIMM)
Gene Feature
Recognition (Dragon)
Venom
Informatics
GeneticXchange
1994
ISS
1996
1998
KRDL
2000
Biobase
2002
LIT/I2R
Copyright 2003 limsoon wong
Predict Epitopes,
Find Vaccine Targets
• Vaccines are often the
only solution for viral
diseases
• Finding & developing
effective vaccine targets
is slow and expensive
process
• Develop systems to recognize
protein peptides that bind
MHC molecules
• Develop systems to recognize
hot spots in viral antigens
Copyright 2003 limsoon wong
Recognize Functional Sites,
Help Scientists
• Effective recognition of
initiation, control, and
termination of biological
processes is crucial to
speeding up and focusing
scientific experiments
• Data mining of bio seqs
to find rules for
recognizing &
understanding
functional sites
Dragon’s 10x
reduction of
TSS recognition
false positives
Copyright 2003 limsoon wong
Diagnose Leukaemia,
Benefit Children
• Childhood leukaemia is a
heterogeneous disease
• Treatment is based on subtype
• 3 different tests and 4 different
experts are needed for
accurate diagnosis
 Curable in USA,
 fatal in Indonesia
• A single platform diagnosis
based on gene expression
• Data mining to discover
rules that are easy for
doctors to understand
Copyright 2003 limsoon wong
Understand Proteins,
Fight Diseases
• Understanding function and
role of protein needs organised
info on interaction pathways
• Such info are often reported in
scientific paper but are seldom
found in structured databases
• Knowledge extraction
system to process free text
• extract protein names
• extract interactions
Copyright 2003 limsoon wong
Data Mining in Microarray Analysis:
MICROARRAY BACKGROUND
Copyright 2003 limsoon wong
What’s a Microarray?
• Contain large number of DNA molecules
spotted on glass slides, nylon membranes,
or silicon wafers
• Measure expression of thousands of genes
simultaneously
Copyright 2003 limsoon wong
Affymetrix GeneChip Array
Copyright 2003 limsoon wong
Making Affymetrix GeneChip
quartz is washed to ensure uniform
hydroxylation across its surface and to
attach linker molecules
exposed linkers become deprotected and
are available for nucleotide coupling
Copyright 2003 limsoon wong
Gene Expression Measurement
by GeneChip
Copyright 2003 limsoon wong
A Sample Affymetrix GeneChip
File (U95A)
Copyright 2003 limsoon wong
Data Mining in Microarray Analysis:
DISEASE SUBSTYPE DIAGNOSIS
Copyright 2003 limsoon wong
Pediatric Acute
Lymphoblastic Leukemia
• A heterogeneous disease with more than
12 subtypes, e.g., T-ALL, E2A-PBX1, TELAML1, BCR-ABL, MLL, and Hyperdip>50.
• Treatment response is subtype dependent
• 80% continuous remission if subtype is
correctly diagnosed and the corresponding
treatment plan is applied
Copyright 2003 limsoon wong
Subtype Diagnosis
• Require different tests:
– immunophenotyping
– cytogenetics
– molecular diagnostics
• Require different experts:
– hematologist
– oncologist
– pathologist
– cytogeneticist
Copyright 2003 limsoon wong
Difficulties and Implications
• The different tests and experts are not
commonly available within a single
hospital, especially in less advanced
countries
 An 80%-curable disease in USA can be
a fatal disease in Indonesia!
 Is there a single diagnostic platform that
does not need multiple human
specialists?
Copyright 2003 limsoon wong
A Potential Solution by Microarrays
Yeoh et al., Cancer Cell 1:133--143, 2002
BCR-ABL
T-ALL
Hyperdiploid >50
MLL
Novel
TEL-AML1
E2A-PBX1
Genes for class
distinction (n=271)
Diagnostic ALL BM samples (n=327)
E2APBX1
MLL
1
0
-3 -2 -1
 = std deviation from mean
T-ALL
2
3
Hyperdiploid >50
BCRABL
Novel
TEL-AML1
Copyright 2003 limsoon wong
Some Caveats
• Study was performed on Americans
• May not be applicable to Singaporeans,
Malaysians, Indonesians, etc.
• Large-scale study on local populations
currently in the works
Copyright 2003 limsoon wong
Typical Procedure in Analysing
Gene Expression for Diagnosis
•
•
•
•
Gene expression data collection
Gene selection
Classifier training
Classifier tuning (optional for some
machine learning methods)
• Apply classifier for diagnosis of future
cases
Copyright 2003 limsoon wong
Feature Selection Methods
A refresher of feature selection methods
Copyright 2003 limsoon wong
Signal Selection (Basic Idea)
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance
Copyright 2003 limsoon wong
Signal Selection (eg., t-statistics)
Copyright 2003 limsoon wong
Signal Selection (eg., 2)
Copyright 2003 limsoon wong
Signal Selection (eg., CFS)
• Instead of scoring individual signals, how
about scoring a group of signals as a
whole?
• CFS
– Correlation-based Feature Selection
– A good group contains signals that are highly
correlated with the class, and yet
uncorrelated with each other
Copyright 2003 limsoon wong
Gene Expression Profile
Classification
An introduction to gene expression profile classification
by the example on ALL subtype diagnosis
Copyright 2003 limsoon wong
Subtype Classification of ALL
A tree-structured
diagnostic
workflow was
recommended by
the doctors, as per
Yeoh et al., Cancer
Cell 1:133--143,
2002
Copyright 2003 limsoon wong
Training and Testing Sets
Copyright 2003 limsoon wong
Our procedure for
ALL subtype diagnosis
•
•
•
•
Gene expression data collection
Gene selection by entropy
Classifier training by emerging pattern
Classifier tuning (optional for some
machine learning methods)
• Apply classifier for diagnosis of future
cases by PCL
Copyright 2003 limsoon wong
Signal Selection (eg., entropy)
Copyright 2003 limsoon wong
Emerging Patterns (EPs)
• An EP is a set of conditions
– usually involving several features
– that most members of a class satisfy
– but none or few of the other class satisfy
• A jumping EP is an EP that
– some members of a class satisfy
– but no members of the other class satisfy
• We use only most general jumping EPs
Copyright 2003 limsoon wong
PCL: Prediction by Collective Likelihood
Copyright 2003 limsoon wong
Accuracy (using 20 genes of lowest entropy)
PCL
0:1
0:2
1:1
0:0
4
0:1
1:1
1:1
1:6
14
5
0:1
2:2
1:1
7
0:2
5
Copyright 2003 limsoon wong
Comprehensibility
Copyright 2003 limsoon wong
Gene Expression Profile Classification
How about other feature selection and
classification methods?
Copyright 2003 limsoon wong
Some gene selection heuristics
•
•
•
•
•
•
all-CFS: all features from CFS
top20-2: 20 features w/ highest 2 stats
top20-t: 20 features w/ highest t-stats
top20-mit: 20 features w/ highest MIT stats
entropy: 20 features w/ lowest entropy
all-2: all features meeting 5% significance
level of 2 stats
Copyright 2003 limsoon wong
Some other classification methods
• k-NN (k=1)
– majority votes of the k nearest neighbours
determined by Euclidean distance
• C4.5
– widely used decision tree method.
• Naïve Bayes (NB)
– probabilistic prediction using Bayes’ rule
• SVM
– (linear) discriminant function that maximizes
separation of boundary samples
Copyright 2003 limsoon wong
Accuracy
• Feature selection improves performance
• Entropy+PCL has consistent high performance
Copyright 2003 limsoon wong
When 20 genes are selected randomly
Average over 100 experiments
Cf. 7-15 mistakes total with good feature selection
Copyright 2003 limsoon wong
Data Mining in Microarray Analysis:
TREATMENT PLAN DERIVATION
A pure speculation!
Copyright 2003 limsoon wong
Can we do more with EPs?
• Detect gene groups that are significantly
related to a disease
• Derive coordinated gene expression
patterns from these groups
• Derive “treatment plan” based on these
patterns
Copyright 2003 limsoon wong
Colon Tumour Dataset
Alon et al., PNAS 96:6745--6750, 1999
• We use the colon tumour dataset above to
illustrate our ideas
– 22 normal samples
– 40 colon tumour samples
Copyright 2003 limsoon wong
Detect Gene Groups
• Feature Selection
– Use entropy method
– 35 genes have cut points
• Generate EPs
– 19501 EPs in normals
– 2165 EPs in tumours
• EPs with largest support
are gene groups
significantly co-related to
disease
Copyright 2003 limsoon wong
Top 20 EPs
Copyright 2003 limsoon wong
Observation 1
• Some EPs contain large
number of genes and still
have high freq
• E.g., {2, 3, 6, 7, 13, 17, 33}
has freq 90.91% in normal
and 0% in cancer samples
 Nearly all normal sample’s
gene expr. values satisfy
all conds. implied by these
7 items
Copyright 2003 limsoon wong
Observation 2
• Freq of singleton EP is not necessarily
larger than EP having multiple genes
• E.g., {5} is EP in cancer samples and has
freq 32.5%
• E.g., {16, 58, 62} is EP in cancer samples
and has freq 75.5%
 Groups of genes and their correlation's
could be more impt than single genes
Copyright 2003 limsoon wong
Observation 3
• M33680 has lowest
entropy of the 35 genes
if cutpoint is set at 352
• 18/40 of cancer
samples shift expr level
of M33680 from its
normal range to its
abnormal range
Copyright 2003 limsoon wong
Treatment Plan Idea
• Increase/decrease expression level of
particular genes in a cancer cell so that
– it has the common EPs of normal cells
– it has no common EPs of cancer cells
Copyright 2003 limsoon wong
Treatment Plan Example
• From the EP {2,3,6,7,13,17,33}
– 91% of normal cells express the 7 genes
(T51560, T49941, M62994, R34701, L02426, U20428, R10707) in the
corr. Intervals
– a cancer cell never express all 7 genes in
the same way
– if expression level of improperly expressed
genes can be adjusted, the cancer cell can
have one common EP of normal cells
– a cancer cell can then be iteratively
converted into a normal one
Copyright 2003 limsoon wong
Choosing Genes to Adjust
Copyright 2003 limsoon wong
Doing more adjustments...
• Down regulating T49941 leads to 2 more
top 10 EPs of normal cells to show up in
the adjusted T1
• Down regulating X62153 to below 396 and
T72403 to below 296 leads to T1 having 9
top 10 EPs of normal cells
• Ave. no. of EPs in normal cells is 9
• So the adjusted T1 now has impt features
of normal cells
Copyright 2003 limsoon wong
Next, eliminate common EPs of
cancer cells in T1
• 6 more genes (K03001, T49732, U29171, R76254, D31767,
L40992) are adjusted
• All top 10 EPs of cancer cells now
disappear from T1
• Ave. no. of top 10 EPs contained in
cancer cells is 6
• The adjusted T1 now holds enough
common features of normal cells and no
features of cancer cells
 T1 is converted to normal cellsCopyright 2003 limsoon wong
“Treatment Plan” Validation
• “Adjustments” were made to the 40 colon tumour
samples based on EPs as described
• Classifiers trained on original samples were
applied to the adjusted samples
It works!
Copyright 2003 limsoon wong
A Big But...
• Effective means for identifying mechanisms
and pathways through which to modulate
gene expression of selected genes need to
be developed
Copyright 2003 limsoon wong
Data Mining in Microarray Analysis:
GENE INTERACTION PREDICTION
Copyright 2003 limsoon wong
Beyond Classification of Gene
Expression Profiles
• After identifying the candidate genes by feature
selection, do we know which ones are causal
genes and which ones are surrogates?
Genes for class
distinction (n=271)
Diagnostic ALL BM samples (n=327)
E2APBX1
MLL
1
0
-3 -2 -1
 = std deviation from mean
T-ALL
2
3
Hyperdiploid >50
BCRABL
Novel
TEL-AML1
Copyright 2003 limsoon wong
Gene Regulatory Circuits
• Genes are “connected” in
“circuit” or network
• Expression of a gene in a
network depends on
expression of some other
genes in the network
• Can we reconstruct the
gene network from gene
expression data?
Copyright 2003 limsoon wong
Key Questions
For each gene in the network:
• which genes affect it?
• How they affect it?
– Positively?
– Negatively?
– More complicated ways?
Copyright 2003 limsoon wong
Some Techniques
• Bayesian Networks
– Friedman et al., JCB 7:601--620, 2000
• Boolean Networks
– Akutsu et al., PSB 2000, pages 293--304
• Differential equations
– Chen et al., PSB 1999, pages 29--40
• Classification-based method
– Soinov et al., “Towards reconstruction of gene
network from expression data by supervised
learning”, Genome Biology 4:R6.1--9, 2003
Copyright 2003 limsoon wong
A Classification-based Technique
Soinov et al., Genome Biology 4:R6.1-9, Jan 2003
• Given a gene expression matrix X
– each row is a gene
– each column is a sample
– each element xij is expression of gene i in
sample j
• Find the average value ai of each gene i
• Denote sij as state of gene i in sample j,
– sij = up if xij > ai
– sij = down if xij  ai
Copyright 2003 limsoon wong
A Classification-based Technique
Soinov et al., Genome Biology 4:R6.1-9, Jan 2003
• To see whether the
state of gene g is
determined by the
state of other genes
– we see whether sij | i 
g can predict sgj
– if can predict with high
accuracy, then “yes”
– Any classifier can be
used, such as C4.5,
PCL, SVM, etc.
• To see how the state of
gene g is determined
by the state of other
genes
– apply C4.5 (or PCL or
other “rule-based”
classifiers) to predict sgj
from sij | i  g
– and extract the decision
tree or rules used
Copyright 2003 limsoon wong
Advantages of this method
• Can identify genes affecting a target gene
• Don’t need discretization thresholds
• Each data sample is treated as an
example
• Explicit rules can be extracted from the
classifier (assuming C4.5 or PCL)
• Generalizable to time series
Copyright 2003 limsoon wong
Acknowledgements
Vladimir Brusic
See-Kiong Ng
Jinyan Li
Vladimir Bajic
Huiqing Liu
Copyright 2003 limsoon wong
Data Mining in Microarray Analysis:
NOTES
Copyright 2003 limsoon wong
References
• J.Li, L. Wong, “Geography of differences between
two classes of data”, Proc. 6th European Conf. on
Principles of Data Mining and Knowledge
Discovery, pp. 325--337, 2002
• J.Li, L. Wong, “Identifying good diagnostic genes or
gene groups from gene expression data by using
the concept of emerging patterns”, Bioinformatics,
18:725--734, 2002
• J.Li et al., “A comparative study on feature selection
and classification methods using a large set of gene
expression profiles”, GIW, 13:51--60, 2002
Copyright 2003 limsoon wong
References
• E.-J. Yeoh et al., “Classification, subtype
discovery, and prediction of outcome in pediatric
acute lymphoblastic leukemia by gene expression
profiling”, Cancer Cell, 1:133--143, 2002
• U.Alon et al., “Broad patterns of gene expression
revealed by clustering analysis of tumor colon
tissues probed by oligonucleotide arrays”, PNAS
96:6745--6750, 1999
• L.A.Soinov et al., “Towards reconstruction of gene
networks from expression data by supervised
learning”, Genome Biology 4:R6.1--9, 2003.
Copyright 2003 limsoon wong
>
>
>
>
>
>
>
>
>
>
>
Data Mining of Gene Expression Profiles for
the Diagnosis and Understanding of Diseases
This talk is divided into two parts. In Part I, I will provide a
brief overview of some accomplishments and challenges
in Bioinformatics. In Part II, I will discuss the data mining
in the analysis of microarray gene expression profiles for
(a) diagnosis of disease state or subtype, (b) derivation of
disease treatment plan, and (c) understanding of gene
interaction networks.
Copyright 2003 limsoon wong