Greedy Feature Grouping for Optimal Discriminant Subspaces

Download Report

Transcript Greedy Feature Grouping for Optimal Discriminant Subspaces

Greedy Feature Grouping for
Optimal Discriminant Subspaces
Mahesan Niranjan
Department of Computer Science
The University of Sheffield
&
European Bioinformatics Institute
Overview
• Motivation
• Feature Selection
• Feature Grouping Algorithm
• Simulations
– Synthetic Data
– Gene Expression Data
• Conclusions and Future
Motivation
• Many new high dimensional problems
– Language processing
– Synthetic chemical molecules
– High throughput experiments in genomics
• Discriminant information may well lie in a small
subspace
– Better classifiers
– Better interpretation of classifier
Curse of dimensionality
Density estimation in high dimensions is difficult
Support Vector Machines
Classification, not density estimation
X
X
X
X
X
X
X
X
X
X
X
O
X
O
O
O O
O
O
O O
O
Support Vector Machines
Nonlinear Kernel Functions
X
X
O
X
X
O
X
O
O
O
O
O
X X
X X
X
OO
O O
O O
O
Classifier design
• Usually to minimize error rate
• Error rates can be misleading
– Large imbalance in classes
– Cost of misclassification can change
Adverse Outcome
x
x
x
x
x
x
x
Benign Outcome
x
x
x
x
x
Class Boundary
Threshold
True Positive
False Positive
Area under the ROC Curve: Neat Statistical Interpretation
True Positive
Convex Hull of ROC Curves
Provost & Fawcette
Scott, Niranjan & Prager
False Positive
Feature selection in classification
• Filters
– select subset that scores high
• Wrappers
– Sequential Forward Selection / Backward deletion
• Parcel
– [Scott, Niranjan & Prager; uses convex hulls of ROC
curves ]
PARCEL: Feature subset selection
• Area under Convex Hull of multiple ROCs
• Different classifier architectures (including
different features) in different operating
points.
• Has been put to good use on independent
implementations:
– Oxford, UCL, Surrey
– Sheffield Speech Group
Gene Expression Microarrays
70 “Experiments’’
5000 genes
Inference problems in Microarray Data
• Clustering
Similar expression patterns might imply
• similar function
• regulated in the same way
[ e.g. activated by the same transcription factor; concentration
maintained by same mechanism etc… ]
• Classification
– diagnostics - e.g. disease / not
– prediction - e.g. survival
 discrimination with features that do cluster
Subspaces of gene expressions
• Singular Value Decomposition (SVD)
Eigenarrays
Eigengenes
• Robust SVD for missing values & outliers
Alter, Brown & Botstein
• Combining different datasets
PNAS, 2000
– Pseudo-inverse Projection
Alter & Golub
– Generalized SVD
PNAS, 2004
Yeast Gene Classification:
[ Switch to MATLAB here ]
2000 yeast genes
79 experiments
Ribosome / Not
(125)
(1750)
First use of SVM
Brown et al PNAS 1999
Discriminant Subspaces
Seemingly similar models
• Product of Experts
( Hinton )
• Modular Mixture Model ( Attias )
Full feature set
– mixture model in subspaces
– Combined by hidden nodes
None of these search for combinations of features
Algorithm
Select M
Initial Assignment
At random /
domain knowledge
-- one feature per group
Sequential search through remaining
-- which feature, which group
-- maximize average AUROC / Sum of Fisher Ratios
Stopping criterion
Another view…
Within Class Scatter
Separation of Means
Another view…
Scatter matrix is not full rank
• diagonal, ignoring correlations
• regularize
e.g. Tibshirani
[LASSO]
Block diagonal scatter matrix
Search for groups of features
that contribute to these blocks
Inversion straightforward:
Simulations
• Synthetic Data
– 50 dimensions
– 3 groups of 4 dimensions + 38 irrelevant
( random, but valid, covariance matrices )
– 40 examples ( 20 / 20 )
– 100 simulations
– Results:
• 70% of the time correct features identified
• Often incorrect group assignment
Simulations
• Microarray Data
[ Rosenwald et al 2002 ]
– cancer patients, survival after chemotherapy
– 240 data points; 7000 genes on array
(filtered to 1000 genes)
– two classes: survived to 4 years / not
– 10  20 subspaces; random initial seeds; 60 runs
– classification accuracy comparable to reported results
– 10% of runs failed to group the features
( discrimination based on single subspace )
AML / ALL Leukaemia data
72 Patients
20
20
200 Genes
40
40
60
60
80
80
100
100
120
140
120
160
140
180
160
200
10
20
30
40
50
60
70
10
20
30
40
50
60
70
6 groups; random seeds
ALL
AML
Conclusions
• Searching for feature groups
– Desirable
– Feasible
– Achieves “discriminant clustering’’
• Next Step
– Biological interpretation of groups
[ comparison of genes in clusters with
functional annotations, where known
( gene ontology ) ]
– Careful initialization ( known pathways )