Gene selection: choice of parameters of the GA/KNN method
January 9, 2002
Kim Hye Jin
Intelligent Multimedia Lab.
[email protected]
Contents
Methods
Data Sets
Methodology : k-NN, Genetic Algorithm
Parameters:
Sensitivity, reproducibility, and stability
Result
Discussion
Methods
Data sets
Lymphoma data
4026 genes
47 samples ( original D: 34 training/13 test )
Colon data
2000 genes
57 samples ( original D: 40 training/ 17 test )
K-nearest neighbor
Genetic algorithm
K-nearest neighbors
K = 3 (default)
Samples are compared on d genes
Neighbors are found by Euclidean distance
Consensus rule
Classify only if all 3 neighbors belong to the same class
Majority rule
Classify if at least 2 of the 3 neighbors belong to the same class
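The two decision rules can be sketched in a few lines. This is only an illustration, not the authors' code: the function name `knn_classify`, the `rule` argument, and the convention of returning None for an unclassified sample are assumptions.

```python
import math
from collections import Counter

def knn_classify(query, samples, labels, k=3, rule="consensus"):
    """Classify `query` from its k nearest samples (Euclidean distance).

    Returns a class label, or None when the rule leaves the sample unclassified.
    """
    # Rank the known samples by Euclidean distance to the query.
    order = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    votes = Counter(labels[i] for i in order[:k])
    label, count = votes.most_common(1)[0]
    # Consensus: all k neighbors must agree. Majority: more than half must agree.
    needed = k if rule == "consensus" else k // 2 + 1
    return label if count >= needed else None
```

With K = 3, a query whose neighbors split 2 to 1 is classified under the majority rule but left unclassified under the consensus rule.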
Genetic Algorithm
Selection
mutation
chromosome
N : dimension of the chromosome, i.e. the number of genes in each chromosome (written d in the rest of these slides)
f_i : fitness of chromosome i
- a training sample scores 1 when all k nearest neighbors agree with its known class
- the scores are summed and divided by M (the number of training samples)
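A minimal sketch of this fitness computation follows. It assumes leave-one-out evaluation (each training sample is compared against all the others), which the slides do not state explicitly; the name `fitness` and the data layout (one row of expression values per sample) are also assumptions.

```python
import math

def fitness(chromosome, data, labels, k=3):
    """Fraction of training samples whose k nearest neighbors all share the sample's class.

    chromosome: list of gene (column) indices; data: one row of expression values per sample.
    """
    # Restrict every sample to the genes named by the chromosome.
    reduced = [[row[g] for g in chromosome] for row in data]
    score = 0
    for i, x in enumerate(reduced):
        # Assumed leave-one-out: neighbors come from all other training samples.
        others = [j for j in range(len(reduced)) if j != i]
        others.sort(key=lambda j: math.dist(x, reduced[j]))
        if all(labels[j] == labels[i] for j in others[:k]):
            score += 1
    # Sum of per-sample scores divided by M, the number of training samples.
    return score / len(reduced)
```

A chromosome made of an informative gene scores near 1, while one made of a gene unrelated to the classes scores near 0.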
Genetic Algorithm
Selection among chromosomes
Survival-of-the-fittest principle
The single best chromosome from each niche is entered into the respective subsequent niche deterministically
The remaining slots are filled according to the relative fitness of the chromosomes
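The selection step for a single niche might look like the sketch below. Filling the remaining slots by fitness-proportional sampling with replacement is one common reading of "relative fitness" and is an assumption here, as are the name `next_generation` and the small weight floor.

```python
import random

def next_generation(population, fitnesses, rng=random):
    """Build the next niche: keep the best chromosome, sample the rest by relative fitness."""
    best = max(range(len(population)), key=lambda i: fitnesses[i])
    survivors = [list(population[best])]  # the best chromosome passes deterministically
    # Remaining slots: fitness-proportional sampling with replacement (assumed scheme).
    # A tiny floor keeps the weights valid even if every fitness is zero.
    weights = [f + 1e-9 for f in fitnesses]
    picks = rng.choices(population, weights=weights, k=len(population) - 1)
    survivors.extend(list(c) for c in picks)
    return survivors
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable.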
Genetic Algorithm
Mutation
Evolvability by introducing new genes
1. Which chromosome?
By a probability proportional to its fitness rank
2. How many genes?
Among 1 ~ 5, the number of mutations is assigned randomly with the following probabilities:
Number of mutations : 1, 2, 3, 4, 5
Probability : 0.53125, 0.25, 0.125, 0.0625, 0.03125
3. Which genes?
Randomly selected and replaced randomly from the genes not already in the chromosome
Stop : when 10000 high-R2 chromosomes have been obtained
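The "how many genes" and "which genes" steps can be sketched as below, using the five probabilities from the slide. Choosing which chromosome to mutate by fitness rank is left out. The names `mutate`, `MUTATION_COUNTS`, and `MUTATION_PROBS`, and the use of integer gene indices 0 to n_genes - 1, are assumptions.

```python
import random

MUTATION_COUNTS = [1, 2, 3, 4, 5]
MUTATION_PROBS = [0.53125, 0.25, 0.125, 0.0625, 0.03125]  # sums to 1

def mutate(chromosome, n_genes, rng=random):
    """Replace 1 to 5 randomly chosen genes with genes not already in the chromosome."""
    in_use = set(chromosome)
    unused = [g for g in range(n_genes) if g not in in_use]
    # Draw the number of mutations with the probabilities listed on the slide.
    n_mut = rng.choices(MUTATION_COUNTS, weights=MUTATION_PROBS, k=1)[0]
    n_mut = min(n_mut, len(chromosome), len(unused))
    # Pick the positions to overwrite and the distinct replacement genes.
    positions = rng.sample(range(len(chromosome)), n_mut)
    replacements = rng.sample(unused, n_mut)
    mutated = list(chromosome)
    for pos, gene in zip(positions, replacements):
        mutated[pos] = gene
    return mutated
```

Because replacements are drawn only from unused genes, the mutated chromosome never contains a duplicate.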
Parameters
Sensitivity
Choice of d : 5, 10, 20, 30, 40, 50
Reproducibility
Independent re-runs of the GA/KNN method on the same data
Stability
Reassignments of ‘training’ and ‘test’ sets:
Original / random / discrepant
Result
Sensitivity
Reproducibility
Stability
Sensitivity(1)
Sensitivity
A few genes dominate the selection when d is small
As d increases, more peaks arise and the pattern of gene selection stabilizes
Sensitivity(2)
Gene selection is insensitive to choices of d between 20 and 50
Sensitivity(3)
Classification of the test set samples: classification is insensitive to the choice of d
Reproducibility
Reproducibility
Repeat the same GA/KNN procedure on the same training set with different seed numbers
Reproducibility is high for all choices of d
Stability(1)
Stability
Selection of optimal genes is insensitive to the assignment of samples to training and test sets
Tested with d = 40, chosen from the sensitivity and reproducibility results
Training sets compared : original / random / discrepant
Original : randomly shuffled
Random : N samples chosen at random from the whole data set
Discrepant : the last N samples
Stability(2)
Gene selection
25~37 of the top 50 genes appear in both.
Figure panels (colon, lymphoma) : random vs. original, discrepant vs. original
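An overlap count of this kind can be computed as in the sketch below. It assumes genes are ranked by how often they appear in the collected chromosomes, which the slides imply but do not spell out; `top_genes` and `overlap` are hypothetical helper names.

```python
from collections import Counter

def top_genes(chromosomes, n=50):
    """Return the n genes selected most often across a collection of chromosomes."""
    counts = Counter(g for chrom in chromosomes for g in chrom)
    return [gene for gene, _ in counts.most_common(n)]

def overlap(run_a, run_b, n=50):
    """Number of genes that appear in the top-n lists of both runs."""
    return len(set(top_genes(run_a, n)) & set(top_genes(run_b, n)))
```

Calling `overlap(original_run, random_run)` on the chromosomes from two training-set assignments gives the kind of count reported here.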
Stability(3)
Classification of test samples in lymphoma
Original : 2 errors
Random/discrepant : 0 errors
Discussion(1)
Choice of d
As d increases, the pattern of gene selection stabilizes.
d between 20 and 50 gave the best results
Choice for the termination R2
R2 = (M-1)/M or (M-2)/M, where M is the number of training samples
The choice has little effect on the gene selection
The less stringent cutoff is computationally faster
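As a small worked example of the two cutoffs, the sketch below computes them for the lymphoma training set (M = 34). The helper names `r2_cutoff` and `is_near_optimal` and the floating-point tolerance are assumptions.

```python
def r2_cutoff(m, allowed_misses=1):
    """Termination cutoff (M - allowed_misses) / M for M training samples."""
    return (m - allowed_misses) / m

def is_near_optimal(fitness, m, allowed_misses=1):
    """True when a chromosome's fitness reaches the chosen cutoff."""
    # A small tolerance guards against floating-point rounding.
    return fitness >= r2_cutoff(m, allowed_misses) - 1e-12

# Lymphoma training set, M = 34: the cutoffs are 33/34 and 32/34.
strict = r2_cutoff(34, 1)
relaxed = r2_cutoff(34, 2)
```

A chromosome that leaves two training samples unscored passes the relaxed cutoff but not the strict one.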
Discussion(2)
The choice of the number of top genes for classification : 50~200
Information/noise content
Lymphoma data case
Consensus rule : 31%
Majority rule : 61%
Much of the data does not contribute
information