gene-featureselect-i..

Microarray Data Set
• The microarray data set we are dealing with is represented as a two-dimensional numerical array.
Characteristics of Microarray Data
• High dimensionality of gene space, low dimensionality of sample space.
  • Thousands to tens of thousands of genes, tens to hundreds of samples.
• Correlation among features (genes).
  • Genes collaborate to function; gene correlation characterizes how the system works.
• A plethora of domain knowledge.
  • A large body of knowledge has accumulated about the genes in question.
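As a rough illustration of the gene correlation point above, the sketch below computes a correlation coefficient between gene expression profiles. The slides do not say which correlation measure is meant; Pearson correlation is assumed here, and the profiles `g1`, `g2`, `g3` are made-up toy values.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical profiles of three genes over four samples.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]  # rises together with g1
g3 = [4.0, 3.0, 2.0, 1.0]  # falls as g1 rises

print(pearson(g1, g2), pearson(g1, g3))
```

Strongly correlated genes carry overlapping information, which is one reason redundancy matters when selecting features.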
Microarray Data Analysis
• Analysis from two angles:
  • sample as object, gene as attribute
  • gene as object, sample/condition as attribute
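The two angles can be pictured as the same array read row-wise or column-wise. The sketch below assumes rows are genes and columns are samples (the slides do not fix an orientation); the values and the `transpose` helper are illustrative only.

```python
# Toy expression matrix: rows are genes, columns are samples (assumed orientation).
expression = [
    [2.1, 1.9, 0.3, 0.2],  # gene g1
    [0.4, 0.5, 1.8, 2.0],  # gene g2
    [1.0, 1.1, 0.9, 1.0],  # gene g3
]

def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]

# Gene as object, sample as attribute: each row is one gene's profile.
gene_view = expression
# Sample as object, gene as attribute: each row is one sample's profile.
sample_view = transpose(expression)

print(len(gene_view), "genes x", len(gene_view[0]), "samples")
print(len(sample_view), "samples x", len(sample_view[0]), "genes")
```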
Supervised Analysis
• Select training samples (hold out…)
• Sort genes (t-test, ranking…)
• Select informative genes (top 50 to 200)
• Cluster based on informative genes
         Class 1       Class 2
g1       1 1 … 1       0 0 … 0
g2       1 1 … 1       0 0 … 0
...
g4131    0 0 … 0       1 1 … 1
g4132    0 0 … 0       1 1 … 1
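The supervised steps above can be sketched end to end on toy data. This is a minimal sketch, not the authors' implementation: Welch's t-statistic is assumed for the gene sorting, a nearest-centroid assignment stands in for the final clustering step, and the gene names, values, held-out indices, and the cutoff of 2 genes (instead of 50 to 200) are all invented for illustration.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

def take_samples(expression, indices):
    """Keep only the given sample columns of a gene -> values mapping."""
    return {gene: [values[i] for i in indices] for gene, values in expression.items()}

def rank_genes(expression, labels):
    """Return gene names sorted by decreasing |t| between class 1 and class 2."""
    scores = {}
    for gene, values in expression.items():
        class1 = [v for v, label in zip(values, labels) if label == 1]
        class2 = [v for v, label in zip(values, labels) if label == 2]
        scores[gene] = abs(welch_t(class1, class2))
    return sorted(scores, key=scores.get, reverse=True)

def class_centroids(expression, labels, genes):
    """Mean expression of each selected gene within each class."""
    centroids = {}
    for label in sorted(set(labels)):
        centroids[label] = [
            mean([v for v, lab in zip(expression[gene], labels) if lab == label])
            for gene in genes
        ]
    return centroids

def nearest_class(profile, centroids):
    """Assign a sample profile to the class with the closest centroid."""
    return min(centroids, key=lambda label: math.dist(profile, centroids[label]))

# Hypothetical data: four genes measured on eight samples, two classes.
expression = {
    "gA": [5.0, 5.2, 4.8, 5.1, 1.0, 1.1, 0.9, 1.2],
    "gB": [1.0, 0.8, 1.2, 0.9, 4.0, 4.3, 3.7, 4.1],
    "gC": [2.0, 2.5, 1.5, 2.2, 2.1, 1.6, 2.3, 1.9],
    "gD": [3.0, 2.0, 4.0, 3.1, 3.5, 2.5, 3.3, 2.9],
}
labels = [1, 1, 1, 1, 2, 2, 2, 2]

# Step 1: hold out one sample per class, train on the rest.
train_idx = [0, 1, 2, 4, 5, 6]
test_idx = [3, 7]
train = take_samples(expression, train_idx)
train_labels = [labels[i] for i in train_idx]

# Steps 2 and 3: sort genes by score and keep the top ones.
informative = rank_genes(train, train_labels)[:2]

# Step 4 (stand-in): assign held-out samples using only the informative genes.
centroids = class_centroids(train, train_labels, informative)
predicted = [
    nearest_class([expression[gene][i] for gene in informative], centroids)
    for i in test_idx
]
print(informative, predicted)
```

Scoring genes on the training samples only keeps the held-out samples untouched by the selection step, which is the point of holding them out.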
Phenotype Structure Mining
[Figure: expression matrix with samples 1 to 10 as columns and gene1 to gene7 as rows; the genes are grouped into informative genes and noninformative genes.]
An informative gene is a gene whose expression manifests the phenotype distinction among samples.
Phenotype structure: sample partition + informative genes.
Existing Feature Selection and Extraction Algorithms
• The characteristics of microarray data sets make feature selection a critical process.
  • Too many features, too few samples.
• Existing feature selection/extraction algorithms include:
  • Single-gene discriminative scores, such as the t-test score, S2N, etc.
  • Redundancy-removal-based FSS algorithms.
  • General feature selection algorithms (Relief family, floating selection, etc.).
  • General feature extraction algorithms: PCA, SVD, FLD, etc. We have not yet seen feature extraction algorithms specific to this kind of data.
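To make the single-gene scores above concrete, here is a small sketch of a signal-to-noise (S2N) score. The slides give no formula; the commonly used definition (difference of class means divided by the sum of class standard deviations) is assumed, and the expression values are invented.

```python
from statistics import mean, stdev

def s2n(class1, class2):
    """Signal-to-noise score: (mean1 - mean2) / (stdev1 + stdev2)."""
    return (mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))

# Hypothetical expression values of one gene in two sample classes.
high_in_class1 = s2n([5.0, 5.2, 4.8], [1.0, 1.1, 0.9])
no_difference = s2n([2.0, 2.5, 1.5], [2.0, 2.4, 1.6])

print(high_in_class1, no_difference)
```

A score like this looks at each gene on its own, so two highly correlated genes can both rank at the top; that is the gap the redundancy-removal approaches aim to close.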