Classification, subtype discovery, and prediction of
Download
Report
Transcript Classification, subtype discovery, and prediction of
Using Emerging Patterns to Analyze
Gene Expression Data
Jinyan Li
BioComputing Group
Knowledge & Discovery Program
Laboratories for Information Technology
Singapore
Outline
• Introduction
• Brief history of decision trees and emerging
patterns
• Basic ideas for decision trees and EPs
• Advanced topics
• Comparisons using gene expression data
• Summary
Introduction
• Decision trees and emerging patterns both
classification rules
• Sharp discrimination power (no or little
uncertainty)
• Advantage over black-box learning models
• Decision trees not the best (on accuracy)
• EP-based classifiers: competitive to the best
Brief History of
Decision Trees
CLS (Hunt etal. 1966)--- cost driven
CART (Breiman et al. 1984) --- Gini Index
ID3 (Quinlan, 1986 MLJ) --- Information-driven
C4.5 (Quinlan, 1993) --- Pruning ideas
Brief history of
emerging patterns
General EP (Dong & Li, 1999 Sigkdd)
CAEP (Dong etal, DS99), JEP-C (Li etal, KAIS00)
EP-space (Li etal, ICML00), DeEPs (Li etal, MLJ)
PCL (Li & Wong, ECML02)
Basic definitions
• Relational data
• Attributes (color, gene_x), attribute values
(red, 200.1), attribute-value pair
(equivalently, condition, item)
• Patterns, instances
• Training data, test data
A simple dataset
Outlook
Sunny
Sunny
Sunny
Sunny
Sunny
Overcast
Overcast
Overcast
Overcast
Rain
Rain
Rain
Rain
Rain
Temp
75
80
85
72
69
72
83
64
81
71
65
75
68
70
Humidity
70
90
85
95
70
90
78
65
75
80
70
80
80
96
Windy
true
true
false
true
false
true
false
true
false
true
true
false
false
false
class
Play
Don’t
Don’t
Don’t
Play
Play
Play
Play
Play
Don’t 9 Play samples
Don’t
Play 5 Don’t
Play
Play A total of 14.
A decision tree
outlook
sunny
overcast
humidity
<= 75
Play
2
rain
> 75
windy
false
true
Don’t
3
Play
Play
4
NP-complete problem
3
Don’t
2
C4.5
• A heuristic
• Using information gain to select the most
discriminatory feature (for tree and subtrees)
• Recursive subdivision over the original
training data
Characteristics of C4.5 trees
•
•
•
•
Single coverage of training data (elegance)
Divide-and-conquer splitting strategy
Fragmentation problem
Locally reliable but globally un-significant
rules
Missing many globally significant rules; mislead the system.
Emerging Patterns (1)
• An emerging pattern is a set of conditions
usually involving several genes, with which
most of a class satisfy but none of the other
class satisfy.
• Real example:
{gene(37720_at) > 215, gene(38028_at)<=12}.
73% vs 0%
• EPs are multi-gene discriminators.
Emerging Patterns (2)
100 C1 Samples
80%
100 C2 Samples
0%
C1
C2
An EP = {cond.1, cond.2, and cond.3}
Boundary emerging patterns
• Definition: A boundary EP is an EP whose
proper subsets are not EPs.
• Boundary EPs separate EPs from non- EPs
• Distinguish EPs with high frequency from
EPs with low frequency.
• Boundary EPs are of our greatest interests.
EP rules derived
• A total of 12 EPs, some important ones of
them never discovered by C4.5.
• Examples: {Humi <=80, windy = false} ->
Play (5:0).
• A total of 5 rules in the decision tree
induced by C4.5.
• C4.5 missed many important rules.
Characteristics of
EP approach
• Each EP is a tree with only one branch
• A cluster of trees: EPs combined (loss of
elegance)
• Globally significant rules
• Exponential number in size (need to focus
on very important feature if large number of
features exist.)
Usefulness of Emerging
Patterns in Classification
• PCL (Prediction by Collective Likelihood of
Emerging Patterns).
• Accurate
• Easily understandable.
Spirit of the PCL Classifier (1)
Top-Ranked EPs in
Positive class
Top-Ranked EPs in
Negative class
EP_1 (90%)
EP_2 (86%)
.
.
EP_n (68%)
EP_1 (100%)
EP_2 (95%)
.
.
EP_n (80%)
The idea of summarizing multiple top-ranked EPs is intended
to avoid some rare tie cases.
Spirit of the PCL Classifier (2)
Most freq. EP from posi. class
in the test sample
K=10,
Ideal Score_p = 10
Score_n = 0
Score_p = EP’_1_p / EP_1_p + … + EP’_k_p / EP_k_p
Most freq. EP of posi class
Similarly,
Score_n = EP’_1_n / EP_1_n + … + EP’_k_n / EP_k_n
If Score_p > Score_n, then positive class, otherwise
negative class.
C4.5 and PCL
• Differences:
– C4.5 is a greedy search
algorithm using the
divide-and-conquer
idea.
– PCL is a global search
algorithm.
– Leaves in C4.5 are
allowed to contain
mixed samples,
however PCL does not.
– Multiple trees used by
PCL, but single tree
used by C4.5.
• Similarities:
– Both can provide highlevel rules.
Advanced topics
• Bagging, boosting and C4.5
• Convexity of EP spaces (ICML00, Li etal).
• Decomposition of EP spaces into a series of
P-spaces and a small convex space
(ECML02, Li and Wong).
• DeEPs (to appear in MLJ, Li etal).
• Version spaces (Mitchell, 1982 AI).
Gene Expression Profiles
• Huge number of features
• Most of them can be ignored when for
classification
• Many good discriminating feautures
• Number of instances relatively small
Expression data in this talk
• Prostate disease, 102 instances, Cancer Cell
v.1, issue 1, 2002
• ALL disease, 327 instances, Cancer Cell,
v.1, issue 2, 2002
• MLL disease, 72 instances, Nature
Genetics, Jan., 2002
• Breast cancer, 98 instances, Nature, Jan.
2002
Classification models
in this talk
• K-nearest neighbor (simplest)
• C4.5 (decision tree based, easily understandable)
• Bagged and Boosted C4.5
• Support Vector Machines (black box)
• Our PCL classifier
Our Work Flow
Original Training Data
Feature selection
Establishing Classification Model
Giving a Test Sample
Making a Prediction
Selecting Discriminatory
Genes
• T-statistics and MIT-correlation.
– Based on expression average and deviation
between two classes of cell.
• Entropy-based discretization methods,
including Chi-Square statistics, and CFS.
– Based on clear boundaries in the expression
range of a gene.
An Ideal Gene Expression
Range
C1
Xyz.x
C2
• Expression values are two-end distributed.
Results on Prostate Dataset
52 tumor samples and 50 normal samples, each represented
by ~12,500 numeric values.
Two Problems:
-1- What is the main difference? How to use rules to
represent the difference?
-2- What’s the LOOCV accuracy by PCL and C4.5?
C4.5 Tree
32598_at
<=29
33886_at
<= 10
40707_at
> 10
Normal
6
34950_at
<=5
Tumor
>29
<= -6
> -6
Tumor
3+1
Normal
>5
Normal
3+1
Emerging patterns
Patterns
{9, 36}
{9, 23}
{4, 9}
{9, 14}
{6, 9}
{7, 21}
{7, 11}
{7, 43}
{7, 39}
{24, 29}
Frequency (T)
38 instances
38
38
38
38
0
0
0
0
0
Frequency(N)
0
0
0
0
0
36
35
35
34
34
Reference number 9: the expression of 37720_at > 215.
Reference number 36: the expression of 38028_at <= 12.
LOOCV accuracies
Classifier
PCL
Accuracy
Error rate
95.1
5
C4.5
SVM
Single Bagged Boosted
91.2 94.1
93.1
90.2
9
6
7
10
3-NN
96.1
4
Subtype classification
of Childhood
Leukemia
Using Gene Expression
Profiling
One of our important projects.
Collaborating Parties
• St. Jude Children’s Research Hospital, USA.
– Mary E. Ross, Sheila A. Shurtleff, W. Kent Williams, Divyen Patel, Rami
Mahfouz, Fred G. Behm, Susana C. Raimondi, Mary V. Relling, Anami
Patel, Cheng Cheng, Dario Campana, Ching-Hon Pui, William E. Evans,
Clayton Naeve, and James R. Downing
• NUH, Singapore.
– Eng-Juh Yeoh
• University of Mississippi, USA.
– Dawn Wilkins, Xiaodong Zhou
• LIT, Singapore.
– Jinyan Li, Huiqing Liu, Limsoon Wong
Important Motivations
• Leukemia is a
heterogeneous disease
(T-ALL, E2A-PBX1,
TEL-AML1, BCRABL, MLL, and
Hyperdip>50).
• Response is different.
• Leukemia is 80%
curable if subtype is
correctly diagnosed.
• Correct diagnosis
needs many different
tests and experts.
• Tests and experts not
commonly available in
a single hospital,
especially in less
advanced countries.
• So, developing new
methods is needed.
ALL Data Description
Subtype
Training
Testing
T-ALL
E2A-PBX1
TEL-AML1
BCR-ABL
MLL
Hyperdip50
Others
28
18
52
9
14
42
52
15
9
27
6
6
22
27
Total
215
112
A sample = {gene_1, gene_2, …, gene_12558}
Central questions
• Diagnosis: Classification of more than six
subtypes of the leukemia disease
• Prognosis: Prediction of outcome of therapy
Classification Strategy
A new sample
T-ALL?
no
E2A-PBX1?
no
TEL-AML1?
no
BCR-ABL?
no
MLL?
no
Hyperdip50?
no
Yes T-ALL predicted
Yes E2A-PBX1 predicted
Yes TEL-AML1 predicted
Yes BCR-ABL predicted
Yes MLL predicted
Yes Hyperdip50 predicted
Total Mis-classifications of
the 112 Test Samples
Feature Selection SVM NB k-NN C4.5 PCL
Top20-ChiSquare
Entropy
Top20-mit
6
5
7
8
7
7
7
4
9
14
14
14
4
5
4
Overall, we achieved 96% accuracy level in the prediction.
An Excellent Publication
Cancer Cell, March ‘02
• Lead article
• Cover page
Other Questions for the
Childhood Leukemia
• Can one subtype separating from other
subtypes?
• Can we do classification in parallel, rather
than using a tree structure?
Test Error Rates
Datasets
PCL
(112 test instances)
BCR-ABL vs others 1:0
E2A-PBX1 vs others 0:0
Hyperdip50 vs others 2:2
MLL vs others
0:2
T-ALL vs others
0:0
TEL-AML1 vs others 2:0
Parallel classification 7
C4.5
Single Bagged Boosted
4:4
6:0
4:4
0:0
0:0
0:0
4:7
4:2
4:7
2:2
1:0
2:2
1:0
1:0
1:0
2:2
2:1
2:2
27
20
10
Overall, PCL is better than C4.5 (single, bagged, or boosted).
MLL Distinction from
Conventional ALL
• Armstrong, etal, Nature Genetics, Jan. 2002
• 3 classes, 57 training samples, 15 test
instances.
• Much smaller than the St.Jude’s data.
• Independently obtained data.
Test Error Rates
by PCL and C4.5
Datasets
PCL
ALL vs others
AML vs others
MLL vs others
ALL vs AML
ALL vs MLL
AML vs MLL
0:0
0:0
0:1
0:0
0:0
0:0
C4.5
Single Bagged Boosted
1:2
0:0
1:2
1:0
0:0
0:0
0:0
0:0
0:0
0:0
1:0
0:0
0:0
0:0
0:0
0:0
0:0
0:0
Relapse Study
on Breast Cancer
• Veer etal, Nature, Jan. 2002
• 78 training samples, 19 test data for relapse
study.
Test Error Rates
by Classifiers
Dataset
PCL
Relapse vs
Non-relapse 3:3
(12:7)
C4.5
Single Bagged Boosted
5:0
5:2
2:4
SVM 3-NN
6:2
7:3
Accuracy not good; may need other more sophisticated methods;
Gene expression may be not sufficient for relapse study.
Summary
• Discussed similarities and differences between
decision trees and emerging patterns.
• Discussed advanced topics such as bagging,
boosting, convexity.
• Performance comparison using 4 gene expression
datasets.
• Overall, PCL is better that C4.5 (single, bagged, or
boosted) on accuracy and rules.
Thank you!
May 27, 2002