Gene selection: choice of parameters of the GA/KNN method

January 9, 2002
Kim Hye Jin
Intelligent Multimedia Lab.
[email protected]
Contents

Methods
  Data sets
  Methodology: k-NN, genetic algorithm
  Parameters: sensitivity, reproducibility, and stability
Result
Discussion
Methods

Data sets

Lymphoma data
  4026 genes
  47 samples (original D: 34 training / 13 test)

Colon data
  2000 genes
  57 samples (original D: 40 training / 17 test)

Methodology: k-nearest neighbor, genetic algorithm
K-nearest neighbors

K = 3 (default)
d genes
Classification rules, with neighbors found by Euclidean distance
  Consensus rule: decide only if all 3 neighbors belong to the same class
  Majority rule: decide if 2 out of 3 neighbors belong to the same class
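
The two decision rules can be sketched in a few lines. This is only an illustration, not the presenter's code: the function name knn_classify, its arguments, and the choice to return None for an unclassified sample are assumptions made here.

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k=3, rule="consensus"):
    """Classify x from its k nearest training samples (Euclidean distance).

    Returns the class label, or None when the rule leaves x unclassified.
    """
    train_X = np.asarray(train_X, dtype=float)
    dists = np.linalg.norm(train_X - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(train_y[i] for i in nearest)
    label, count = votes.most_common(1)[0]
    # consensus needs all k votes, majority needs more than half of them
    needed = k if rule == "consensus" else k // 2 + 1
    return label if count >= needed else None
```

With K = 3, a sample whose neighbors split 2 to 1 stays unclassified under the consensus rule but receives a class under the majority rule.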
Genetic Algorithm

[Figure: chromosomes undergoing selection and mutation]
N: dimension of the chromosome, i.e. the number of genes in each chromosome
f_i: fitness function
  - A training sample scores 1 when all k neighbors agree with its known class
  - The scores are summed and divided by M (the number of samples in the training set)
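
A minimal sketch of this fitness computation follows, assuming leave-one-out classification of each training sample and the consensus rule. The function name fitness and the array layout (samples in rows, genes in columns) are assumptions made for illustration.

```python
import numpy as np

def fitness(X, y, genes, k=3):
    """Leave-one-out KNN score of one chromosome, as a fraction of the M samples.

    X: (M, total_genes) expression matrix; y: length-M class labels;
    genes: indices of the genes carried by the chromosome.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    sub = X[:, list(genes)]
    # pairwise Euclidean distances in the subspace spanned by the chromosome
    diff = sub[:, None, :] - sub[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbor
    score = 0
    for i in range(len(y)):
        nearest = np.argsort(dist[i], kind="stable")[:k]
        # score 1 only if all k neighbors share the sample's known class
        if np.all(y[nearest] == y[i]):
            score += 1
    return score / len(y)
```

A chromosome built from a gene that separates the classes scores close to 1, while one built from an uninformative gene scores close to 0.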
Genetic Algorithm

Selection among chromosomes
  Survival-of-the-fittest principle
  The single best chromosome from each niche is entered into the respective subsequent niche deterministically
  The remaining slots are filled according to the relative fitness of the chromosomes
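
The selection step for a single niche might look like the sketch below. It assumes that "relative fitness" means a draw with probability proportional to fitness (roulette-wheel selection); the function name next_generation and the small weight floor are illustrative choices, not taken from the slides.

```python
import random

def next_generation(niche, fitnesses, rng=random):
    """Build the next generation of one niche.

    The best chromosome is copied over deterministically; the other
    slots are drawn with probability proportional to fitness.
    """
    best = max(range(len(niche)), key=lambda i: fitnesses[i])
    survivors = [list(niche[best])]
    # small floor keeps the draw valid even if every fitness is 0
    weights = [f + 1e-9 for f in fitnesses]
    drawn = rng.choices(niche, weights=weights, k=len(niche) - 1)
    survivors.extend(list(c) for c in drawn)
    return survivors
```

Passing a seeded random.Random instance as rng makes a run repeatable, which is convenient when comparing independent re-runs.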
Genetic Algorithm

Mutation
  Evolvability by introducing new genes
  1. Which chromosome?
     Chosen with a probability proportional to its fitness rank
  2. How many genes?
     Between 1 and 5; the number of mutations is assigned randomly:

     Number of mutations   1         2      3       4        5
     Probability           0.53125   0.25   0.125   0.0625   0.03125

  3. Which genes?
     Randomly selected, and replaced randomly from the genes not already in the chromosome
Stop: when 10000 high-R2 chromosomes are obtained
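
The mutation step can be sketched as follows, using the probabilities listed on the slide. The function name mutate, the constant names, and the representation of a chromosome as a list of distinct gene indices are assumptions made for illustration.

```python
import random

MUTATION_COUNTS = [1, 2, 3, 4, 5]
MUTATION_PROBS = [0.53125, 0.25, 0.125, 0.0625, 0.03125]  # sums to 1

def mutate(chromosome, total_genes, rng=random):
    """Return a mutated copy of a chromosome (a list of distinct gene indices)."""
    genes = list(chromosome)
    n_mut = rng.choices(MUTATION_COUNTS, weights=MUTATION_PROBS, k=1)[0]
    # never ask for more mutations than positions or unused genes available
    n_mut = min(n_mut, len(genes), total_genes - len(genes))
    present = set(genes)
    unused = [g for g in range(total_genes) if g not in present]
    positions = rng.sample(range(len(genes)), n_mut)
    replacements = rng.sample(unused, n_mut)
    for pos, new_gene in zip(positions, replacements):
        genes[pos] = new_gene
    return genes
```

Because the replacements are drawn only from genes absent from the chromosome, the mutated chromosome keeps its length and never holds a gene twice.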
Parameters

Sensitivity
  Choice of d: 5, 10, 20, 30, 40, 50
Reproducibility
  Independent re-runs of the GA/KNN method on the same data
Stability
  Reassignments of 'training' and 'test' sets: original / random / discrepant

Result

Sensitivity
Reproducibility
Stability
Sensitivity(1)

A few genes dominate the selection when d is small
As d increases, more peaks arise and the pattern of gene selection stabilizes
Sensitivity(2)

Gene selection is insensitive to choices of d between 20 and 50
Sensitivity(3)

Classification of the test set samples: classification is insensitive to the choice of d
Reproducibility

Repeat the same GA/KNN procedure on the same training set with different seed numbers
Reproducibility is high for all choices of d
Stability(1)

Selection of optimal genes is insensitive to the choice of training and test sets (tested with d = 40, following the sensitivity and reproducibility results)
Training sets compared: original / random / discrepant
  Original: randomly shuffled
  Random: N samples chosen randomly from the whole data set
  Discrepant: the last N samples
Stability(2)

Gene selection
  25~37 of the top 50 genes appear in both.
[Figure: gene selection for the colon and lymphoma data; panels labeled random, discrepant, and original]
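
Counting how many top genes two runs share can be sketched as below. The sketch assumes that genes are ranked by how often they occur among the collected high-R2 chromosomes; that ranking criterion and the function names top_genes and overlap are assumptions made here, not details stated on the slide.

```python
from collections import Counter

def top_genes(chromosomes, n_top=50):
    """Rank genes by how often they occur in the collected chromosomes."""
    counts = Counter(g for chrom in chromosomes for g in chrom)
    # sort by descending frequency, then by gene index for a stable order
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:n_top]

def overlap(chromosomes_a, chromosomes_b, n_top=50):
    """Number of genes shared by the top-n lists of two runs."""
    top_a = set(top_genes(chromosomes_a, n_top))
    top_b = set(top_genes(chromosomes_b, n_top))
    return len(top_a & top_b)
```

Applied to the chromosomes collected from two different training sets, overlap gives the kind of count quoted above for the top 50 genes.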
Stability(3)

Classification of test samples in lymphoma
  Original: 2 errors
  Random / discrepant: 0 errors
Discussion(1)

Choice of d
  As d increases, the pattern of gene selection stabilizes.
  d between 20 and 50 gave the best results
Choice of the termination R2
  R2 = (M-1)/M or (M-2)/M
  Little effect on the selection
  Computationally more rapid
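
As a worked example of the two termination criteria, the sketch below computes the thresholds for a training set of M samples and checks whether a chromosome reaches them. The function names and the use of an integer count of correctly classified samples (to avoid floating-point comparison) are assumptions made for illustration.

```python
from fractions import Fraction

def termination_threshold(n_training, allowed_errors=1):
    """Fitness a chromosome must reach: R2 = (M - allowed_errors) / M."""
    return Fraction(n_training - allowed_errors, n_training)

def meets_termination(n_correct, n_training, allowed_errors=1):
    """True when n_correct / M reaches the termination R2."""
    return n_correct >= n_training - allowed_errors
```

For the 34 lymphoma training samples, termination_threshold(34, 1) is 33/34 and termination_threshold(34, 2) is 32/34, so the second criterion also accepts a chromosome that misclassifies two training samples.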
Discussion(2)

Choice of the number of top genes for classification: 50~200
Information/noise content
  Lymphoma data case
    Consensus rule: 31%
    Majority rule: 61%
  Much of the data does not contribute information