gene-featureselect-i..
Microarray Data Set
The microarray data set we are dealing with is
represented as a 2D numerical array.
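As a minimal sketch of this representation, the toy matrix below stores genes as rows and samples as columns. The layout, the gene and sample names, the values, and the helper `value` are all illustrative assumptions, and plain Python lists stand in for a real array library.

```python
# Toy expression matrix: rows are genes, columns are samples (assumed layout).
gene_names = ["g1", "g2", "g3", "g4"]
sample_names = ["s1", "s2", "s3"]
expression = [
    [2.1, 1.9, 0.3],  # g1
    [1.8, 2.2, 0.1],  # g2
    [0.2, 0.4, 2.5],  # g3
    [0.1, 0.3, 2.4],  # g4
]

n_genes = len(expression)
n_samples = len(expression[0])

# A 2D array must be rectangular: one value per sample in every gene row.
assert all(len(row) == n_samples for row in expression)


def value(gene, sample):
    """Look up the expression level of one gene in one sample by name."""
    return expression[gene_names.index(gene)][sample_names.index(sample)]


print("genes:", n_genes, "samples:", n_samples)
print("g3 in s3:", value("g3", "s3"))
```

Real data sets have the same shape but far more rows than columns.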
Characteristics of Microarray Data
High dimensionality of gene space, low
dimensionality of sample space.
Thousands to tens of thousands of genes, tens
to hundreds of samples.
Correlation among features (genes).
Genes collaborate to function. Gene correlation
characterizes how the system works.
A plethora of domain knowledge.
A large body of knowledge has accumulated about
the genes in question.
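Gene correlation can be quantified in several ways. The sketch below uses the Pearson coefficient between two expression profiles as one common choice; the slides do not say which measure is intended, and the profiles and the function name `pearson` are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression profiles over four samples.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]  # rises and falls together with g1
g3 = [4.0, 3.0, 2.0, 1.0]  # moves opposite to g1

print(pearson(g1, g2))
print(pearson(g1, g3))
```

A coefficient near +1 or -1 suggests the two genes vary together across samples; a value near 0 suggests no linear relationship.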
Microarray Data Analysis
Analysis from two angles
sample as object, gene as attribute
gene as object, sample/condition as attribute
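The two angles correspond to reading the same matrix by rows or by columns. A small sketch, assuming a genes-by-samples layout and invented values:

```python
# Gene as object, sample as attribute: one row per gene.
by_gene = [
    [2.1, 1.9, 0.3],  # gene 1 across three samples
    [1.8, 2.2, 0.1],  # gene 2 across three samples
]

# Sample as object, gene as attribute: transpose so each row is one sample.
by_sample = [list(column) for column in zip(*by_gene)]

print(by_sample)  # [[2.1, 1.8], [1.9, 2.2], [0.3, 0.1]]
```

Which view is used decides what gets grouped or classified: rows of `by_sample` for sample-level analysis, rows of `by_gene` for gene-level analysis.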
Supervised Analysis
Select training samples (hold out…)
Sort genes (t-test, ranking…)
Select informative genes (top 50 to 200)
Cluster based on informative genes
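The four steps above can be sketched end to end on a toy matrix. Everything concrete here is an assumption made for illustration: the data, the held-out samples, the choice of Welch's t-statistic as the ranking score, the top-2 cutoff (the slides suggest 50 to 200 on real data), and the tiny 2-means routine used for the clustering step.

```python
import math


def welch_t(a, b):
    """Welch t-statistic between two groups of expression values."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def rank_genes(matrix, labels, train_idx):
    """Gene indices sorted by |t| on the training samples, best first."""
    scores = []
    for g, row in enumerate(matrix):
        a = [row[i] for i in train_idx if labels[i] == 1]
        b = [row[i] for i in train_idx if labels[i] == 2]
        scores.append((abs(welch_t(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)]


def two_means(points, n_iter=10):
    """Tiny 2-means, seeded with point 0 and the point farthest from it."""
    def dist2(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    centers = [points[0], max(points, key=lambda p: dist2(p, points[0]))]
    assign = []
    for _ in range(n_iter):
        assign = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                  for p in points]
        for c in (0, 1):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Toy data: 4 genes (rows) x 8 samples (columns), two known classes.
matrix = [
    [5.0, 5.2, 4.9, 5.1, 1.0, 1.2, 0.9, 1.1],  # differs between classes
    [3.0, 1.0, 2.0, 2.5, 2.1, 2.9, 1.2, 2.0],  # noise
    [0.8, 1.1, 1.0, 0.9, 4.6, 5.1, 5.0, 4.9],  # differs between classes
    [2.0, 2.4, 1.7, 2.2, 2.3, 1.8, 2.1, 2.0],  # noise
]
labels = [1, 1, 1, 1, 2, 2, 2, 2]

# Step 1: hold out one sample per class; train on the rest.
held_out = [3, 7]
train_idx = [i for i in range(len(labels)) if i not in held_out]

# Steps 2 and 3: sort genes by score and keep the top ones.
ranked = rank_genes(matrix, labels, train_idx)
informative = ranked[:2]

# Step 4: cluster all samples using only the informative genes.
points = [[matrix[g][s] for g in informative] for s in range(len(labels))]
clusters = two_means(points)

print("informative genes:", informative)
print("cluster per sample:", clusters)
```

On this toy input the held-out samples fall into the same clusters as their classmates, which is the kind of check the hold-out step is meant to allow.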
Gene     Class 1    Class 2
g1       1 1 … 1    0 0 … 0
g2       1 1 … 1    0 0 … 0
…        …          …
g4131    0 0 … 0    1 1 … 1
g4132    0 0 … 0    1 1 … 1
Phenotype Structure Mining
[Figure: expression matrix with samples 1 to 10 as columns and gene1 to gene7 as rows; the rows are grouped into informative genes and noninformative genes.]
An informative gene is a gene whose expression manifests
the phenotype distinction among samples.
Phenotype structure: sample partition + informative genes.
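One way to picture a phenotype structure is as a small record holding the sample partition together with the genes that reflect it. The sketch below is only an illustration: the class name `PhenotypeStructure`, the helper `separates`, the toy values, and in particular the criterion used (non-overlapping value ranges between sample groups) are assumptions, not the definition used in the slides.

```python
from dataclasses import dataclass


@dataclass
class PhenotypeStructure:
    sample_groups: list      # partition of sample indices
    informative_genes: list  # genes that distinguish the groups


def separates(profile, groups):
    """Assumed criterion: the groups' value ranges do not overlap."""
    ranges = sorted(
        (min(profile[i] for i in g), max(profile[i] for i in g))
        for g in groups
    )
    return all(prev[1] < nxt[0] for prev, nxt in zip(ranges, ranges[1:]))


# Hypothetical expression profiles over six samples.
expression = {
    "gene1": [0.9, 1.1, 1.0, 3.9, 4.2, 4.0],
    "gene2": [4.1, 3.8, 4.0, 1.2, 0.9, 1.0],
    "gene3": [2.0, 3.1, 1.0, 2.5, 1.4, 3.0],
}
groups = [[0, 1, 2], [3, 4, 5]]

structure = PhenotypeStructure(
    sample_groups=groups,
    informative_genes=[name for name, profile in expression.items()
                       if separates(profile, groups)],
)
print(structure)
```

Here gene1 and gene2 follow the partition while gene3 does not, so only the first two are recorded as informative.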
Existing Feature Selection and Extraction Algorithms
The characteristics of microarray data sets
make feature selection a critical process.
Too many features, too few samples.
Existing feature selection/extraction
algorithms include:
Single-gene discriminative scores, such as the
t-test score, S2N (signal-to-noise), etc.
Redundancy-removal-based FSS algorithms.
General feature selection algorithms (Relief
family, floating selection, etc.).
General feature extraction algorithms: PCA,
SVD, FLD, etc. No feature extraction algorithms
specific to microarray data have been seen so far.
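To make the first two families concrete, the sketch below scores each gene with the signal-to-noise ratio, S2N = (mean1 - mean2) / (std1 + std2), and then applies a simple greedy redundancy filter. The filter is only one possible scheme, not a specific published FSS algorithm; the 0.9 correlation threshold, the function names, and the data are assumptions chosen for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    """Sample standard deviation."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def s2n(a, b):
    """Signal-to-noise score between two classes for one gene."""
    return (mean(a) - mean(b)) / (std(a) + std(b))


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))


def select_nonredundant(matrix, labels, max_corr=0.9):
    """Rank genes by |S2N|, then drop any gene that is highly correlated
    with a better-ranked gene already kept (assumed greedy scheme)."""
    scores = {}
    for gene, profile in matrix.items():
        a = [v for v, lab in zip(profile, labels) if lab == 1]
        b = [v for v, lab in zip(profile, labels) if lab == 2]
        scores[gene] = abs(s2n(a, b))
    kept = []
    for gene in sorted(scores, key=scores.get, reverse=True):
        if all(abs(pearson(matrix[gene], matrix[k])) < max_corr for k in kept):
            kept.append(gene)
    return kept


# Hypothetical data: gB closely tracks gA, gC is unrelated noise.
matrix = {
    "gA": [5.0, 5.2, 4.9, 1.0, 1.2, 0.9],
    "gB": [4.8, 5.3, 4.6, 1.1, 1.4, 0.8],
    "gC": [1.0, 3.0, 2.0, 2.2, 1.1, 2.9],
}
labels = [1, 1, 1, 2, 2, 2]

selected = select_nonredundant(matrix, labels)
print(selected)
```

In this toy run gB is dropped because it carries almost the same information as the higher-scoring gA. A real pipeline would also cap the number of kept genes, which would exclude low-scoring genes such as gC.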