Transcript slides

Genetic algorithms applied to
multi-class prediction for the
analysis of gene expression
data
C.H. Ooi & Patrick Tan
Presentation by Tim Hamilton
“Genechips”
• DNA microarrays – a
collection of
microscopic DNA
spots representing
single genes.
• Commonly used to
monitor expression
levels of thousands of
genes at once.
Classification
• Gene expression data is commonly used in the
classification of a biological sample.
- Tumor subtypes
- Response to certain types of treatment (e.g.
chemotherapy).
• Most approaches focus on classification of two, or at
most three, classes and have high rates of error when
run on sets containing multiple classes (19%).
• The authors propose using a genetic algorithm (GA) for
analyzing multi-class expression data.
• Previous rank-based approaches perform worse because:
1) correlations between genes are missed.
2) the predictor set size must be specified.
• Data Sets used for the GA:
– NCI60: expression profiles of 64 cancer cell lines containing
9703 cDNA sequences.
– GCM: expression profiles for 198 tumor samples, 90 normal
samples, and 20 unknowns containing 16063 genes.
– Both data sets were pre-processed to generate a truncated
1000-gene data set. Each spot was standardized as (color ratio of a
single spot – color ratio of all spots) / standard deviation, and
the genes with the highest standard deviation were kept.
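The pre-processing step can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: it assumes that "all spots" means all spots on the same array, that the value subtracted is their mean, and that the data is laid out as a genes x samples matrix. The function names `standardize_arrays` and `top_variance_genes` and the toy data are hypothetical.

```python
import numpy as np

def standardize_arrays(ratios: np.ndarray) -> np.ndarray:
    """Standardize each array (column) of a genes x samples ratio matrix.

    Assumption: the value subtracted from each spot is the mean ratio of
    all spots on the same array, divided by that array's standard deviation.
    """
    mean = ratios.mean(axis=0, keepdims=True)
    std = ratios.std(axis=0, keepdims=True)
    return (ratios - mean) / std

def top_variance_genes(data: np.ndarray, n_keep: int = 1000) -> np.ndarray:
    """Return sorted row indices of the n_keep genes with the highest
    standard deviation across samples."""
    gene_std = data.std(axis=1)
    order = np.argsort(gene_std)[::-1]  # most variable genes first
    return np.sort(order[:n_keep])

# Toy data only: 50 "genes" x 8 "samples" with varying spread per gene.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 8)) * rng.uniform(0.1, 3.0, size=(50, 1))
z = standardize_arrays(toy)
kept = top_variance_genes(z, n_keep=10)
truncated = z[kept]  # the truncated data set passed on to the GA
```

With the real data sets, `n_keep` would be 1000.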
Choosing a GA chromosome
• Determine a minimum and maximum size for the
predictor set: [Rmin, Rmax]
• Chromosome string: [R g1 g2… gRmax ]
- R is the size of the predictive set
- any genes past length R are ignored.
- genes are chosen from the list of 1000.
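The chromosome layout above can be illustrated with a short sketch. The helper names `random_chromosome` and `predictor_set` are hypothetical, and drawing distinct genes at initialization is an assumption made here for simplicity; the slides do not say how duplicate genes are handled.

```python
import random

N_GENES = 1000  # size of the truncated gene list

def random_chromosome(r_min, r_max, rng):
    """Build [R, g1, ..., gRmax].

    R is drawn from [r_min, r_max]; the genes are indices into the
    1000-gene list (assumed distinct here).
    """
    r = rng.randint(r_min, r_max)              # inclusive on both ends
    genes = rng.sample(range(N_GENES), r_max)  # always r_max gene slots
    return [r] + genes

def predictor_set(chromosome):
    """Return the first R genes; anything past length R is ignored."""
    r = chromosome[0]
    return chromosome[1:1 + r]
```

For example, `predictor_set([2, 7, 9, 4])` gives `[7, 9]`: R is 2, so the third gene slot is carried in the string but not used.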
Parameters
• Population size: 100
• Generations: 100
Other parameters were varied
• Crossover method: one-point or uniform
• Selection method: stochastic universal sampling (SUS) or roulette
wheel selection (RWS)
• Probability of crossover: 0.7 – 1.0
• Probability of mutation: 0.0005 – 0.01
• Predictor set size range [Rmin, Rmax]: [5, 10], [11, 15], [16, 20], [21,
25], [26,30];
• For each predictor set size range, this produced 96 different runs.
• The GA was run on both the truncated data set and the full data set for comparison.
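The crossover and selection options listed above are standard GA operators. The sketch below shows textbook versions of each, not the authors' implementation. The function names are hypothetical, and applying crossover to the whole chromosome string (including the R slot) is an assumption. Mutation is omitted.

```python
import random

def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length parents after a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def uniform_crossover(a, b, rng):
    """Swap each position independently with probability 0.5."""
    c1, c2 = list(a), list(b)
    for i in range(len(a)):
        if rng.random() < 0.5:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2

def roulette_wheel_selection(fitness, n, rng):
    """RWS: n independent spins, each index drawn with probability
    proportional to its fitness."""
    return rng.choices(range(len(fitness)), weights=fitness, k=n)

def stochastic_universal_sampling(fitness, n, rng):
    """SUS: a single spin places n equally spaced pointers on the
    cumulative fitness line."""
    step = sum(fitness) / n
    start = rng.uniform(0, step)
    picks, cum, i = [], fitness[0], 0
    for k in range(n):
        pointer = start + k * step
        # Advance to the individual whose fitness segment holds the pointer.
        while pointer > cum and i < len(fitness) - 1:
            i += 1
            cum += fitness[i]
        picks.append(i)
    return picks
```

Because SUS uses one spin for all n pointers, each individual is selected close to its expected number of times, whereas RWS can drift further from that through repeated independent spins.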
• Each generation of chromosomes is used to
classify the data sets using a maximum
likelihood (MLHD) method.
• Fitness = 200 – (E1 + E2)
• E1 = cross validation error rate
• E2 = independent test error rate.
• The MLHD classifier involves a lot of math, but is
based upon Bayes' rule.
• Two previous rank-based methods were run on the
same truncated data set for comparison.
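The fitness calculation can be sketched as follows. The slides do not give the MLHD equations, so the classifier here is a stand-in: a linear discriminant obtained from Bayes' rule under the assumptions of Gaussian class densities, a shared (pooled) covariance matrix, and equal priors. Leave-one-out cross validation is also an assumption, as are all function names. Error rates are expressed as percentages, so the best possible fitness is 200.

```python
import numpy as np

def fit_discriminant(X, y):
    """Estimate class means and the inverse pooled covariance.

    X: samples x genes (columns restricted to the chromosome's predictor set),
    y: integer class labels.
    """
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    centered = X - means[np.searchsorted(classes, y)]
    pooled = centered.T @ centered / (len(y) - len(classes))
    return classes, means, np.linalg.pinv(pooled)

def predict(model, X):
    """Assign each sample to the class with the largest discriminant score
    f_q(x) = m_q^T S^-1 x - 0.5 * m_q^T S^-1 m_q (assumed form)."""
    classes, means, inv_cov = model
    w = means @ inv_cov                   # one weight vector per class
    b = -0.5 * np.sum(w * means, axis=1)  # one offset per class
    return classes[np.argmax(X @ w.T + b, axis=1)]

def error_rate(y_true, y_pred):
    """Misclassification rate in percent."""
    return 100.0 * np.mean(y_true != y_pred)

def loo_cv_error(X, y):
    """E1: leave-one-out cross validation error on the training set, in percent."""
    wrong = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        model = fit_discriminant(X[mask], y[mask])
        wrong += int(predict(model, X[i:i + 1])[0] != y[i])
    return 100.0 * wrong / len(y)

def fitness(X_train, y_train, X_test, y_test):
    """Fitness = 200 - (E1 + E2)."""
    e1 = loo_cv_error(X_train, y_train)
    model = fit_discriminant(X_train, y_train)
    e2 = error_rate(y_test, predict(model, X_test))  # independent test error
    return 200.0 - (e1 + e2)
```

In a GA run, the columns of `X_train` and `X_test` would be selected by the predictor set decoded from each chromosome before calling `fitness`.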
Results
• Uniform crossover produced the best predictors in size ranges [11,15]
and [16,20].
• One-point crossover best in ranges [5,10], [21,25] and [26,30].
• Higher predictive accuracies when run against the truncated data set.
Results vs. Other Methods
• Finally, the GA was compared to another method
using SVM classification.
• The SVM had its best performance when all
16063 genes of a data set were used:
22% error.
• The GA used only 32 elements, 18% error.