#### Transcript: Greedy Feature Grouping for Optimal Discriminant Subspaces

**Greedy Feature Grouping for Optimal Discriminant Subspaces**
Mahesan Niranjan, Department of Computer Science, The University of Sheffield & European Bioinformatics Institute

**Overview**
• Motivation
• Feature Selection
• Feature Grouping Algorithm
• Simulations
  – Synthetic Data
  – Gene Expression Data
• Conclusions and Future Work

**Motivation**
• Many new high-dimensional problems:
  – language processing
  – synthetic chemical molecules
  – high-throughput experiments in genomics
• Discriminant information may well lie in a small subspace:
  – better classifiers
  – better interpretation of the classifier

**Curse of dimensionality**
• Density estimation in high dimensions is difficult.

**Support Vector Machines**
• Classification, not density estimation.
[Figure: two classes of points (X, O) separated by a linear boundary.]

**Support Vector Machines: Nonlinear Kernel Functions**
[Figure: classes that are not linearly separable, handled by a nonlinear kernel.]

**Classifier design**
• Usually designed to minimize error rate.
• Error rates can be misleading:
  – large imbalance in classes
  – cost of misclassification can change
[Figure: overlapping distributions of adverse and benign outcomes, with a class-boundary threshold trading true positives against false positives.]

**Area under the ROC Curve**
• Neat statistical interpretation.
[Figure: ROC curve, true positive rate against false positive rate.]

**Convex Hull of ROC Curves**
• Provost & Fawcett; Scott, Niranjan & Prager.
[Figure: convex hull spanning several ROC curves.]

**Feature selection in classification**
• Filters – select a subset that scores highly.
• Wrappers – sequential forward selection / backward deletion.
• Parcel – [Scott, Niranjan & Prager]; uses convex hulls of ROC curves.

**PARCEL: Feature subset selection**
• Area under the convex hull of multiple ROC curves.
• Different classifier architectures (including different features) at different operating points.
• Has been put to good use in independent implementations:
  – Oxford, UCL, Surrey
  – Sheffield Speech Group

**Gene Expression Microarrays**
• 70 "experiments" × 5000 genes.

**Inference problems in Microarray Data**
• Clustering – similar expression patterns might imply:
  – similar function
  – regulation in the same way [e.g. activated by the same transcription factor; concentration maintained by the same mechanism, etc.]
• Classification:
  – diagnostics, e.g. disease / not
  – prediction, e.g. survival
• Discrimination with features that do cluster.

**Subspaces of gene expressions**
• Singular Value Decomposition (SVD): eigenarrays and eigengenes [Alter, Brown & Botstein, PNAS 2000].
• Robust SVD for missing values and outliers.
• Combining different datasets:
  – pseudo-inverse projection [Alter & Golub]
  – generalized SVD [PNAS 2004]

**Yeast Gene Classification** [switch to MATLAB here]
• 2000 yeast genes × 79 experiments.
• Ribosome (125) / Not (1750).
• First use of SVMs: Brown et al., PNAS 1999.

**Discriminant Subspaces: Seemingly similar models**
• Product of Experts (Hinton)
• Modular Mixture Model (Attias)
• Full feature set – mixture models in subspaces, combined by hidden nodes.
• None of these search for combinations of features.

**Algorithm**
• Select M, the number of groups.
• Initial assignment: at random or from domain knowledge – one feature per group.
• Sequential search through the remaining features:
  – which feature, which group?
  – maximize average AUROC / sum of Fisher ratios
• Stopping criterion.

**Another view…**
• Fisher criterion: within-class scatter versus separation of the means.
• The scatter matrix is not full rank:
  – make it diagonal, ignoring correlations
  – regularize, e.g. Tibshirani [LASSO]
• Block-diagonal scatter matrix: search for groups of features that contribute to these blocks. Inversion is straightforward.

**Simulations: Synthetic Data**
• 50 dimensions: 3 groups of 4 dimensions + 38 irrelevant (random, but valid, covariance matrices).
• 40 examples (20 / 20); 100 simulations.
• Results:
  – 70% of the time the correct features were identified
  – often with incorrect group assignment

**Simulations: Microarray Data** [Rosenwald et al., 2002]
• Cancer patients, survival after chemotherapy.
• 240 data points; 7000 genes on the array (filtered to 1000 genes).
• Two classes: survived to 4 years / not.
• 10–20 subspaces; random initial seeds; 60 runs.
• Classification accuracy comparable to reported results.
• 10% of runs failed to group the features (discrimination based on a single subspace).

**AML / ALL Leukaemia data**
• 72 patients, 200 genes; 6 groups, random seeds.
[Figure: expression heatmap of 200 genes × 72 patients, ALL versus AML.]

**Conclusions**
• Searching for feature groups is:
  – desirable
  – feasible
  – achieves "discriminant clustering"
• Next steps:
  – biological interpretation of the groups [comparison of genes in clusters with functional annotations, where known (Gene Ontology)]
  – careful initialization (known pathways)
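The AUROC objective used throughout the talk has the "neat statistical interpretation" mentioned on the slides: it is the probability that a randomly chosen positive example scores above a randomly chosen negative one (the Mann-Whitney statistic). A minimal sketch (in Python rather than the MATLAB used in the demo; the function name is illustrative):

```python
import numpy as np

def auroc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney statistic:
    the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one."""
    scores_pos = np.asarray(scores_pos, dtype=float)
    scores_neg = np.asarray(scores_neg, dtype=float)
    # All pairwise comparisons; ties count half.
    wins = (scores_pos[:, None] > scores_neg[None, :]).sum()
    ties = (scores_pos[:, None] == scores_neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(scores_pos) * len(scores_neg))

# Perfect separation gives AUROC 1.0; random scores give ~0.5.
print(auroc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1]))  # → 1.0
```

Because this is a rank statistic, it is insensitive to the class imbalance and shifting misclassification costs that make raw error rates misleading.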
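The PARCEL idea rests on the convex hull of ROC operating points: only classifiers on the upper hull are optimal for some combination of class priors and misclassification costs (Provost & Fawcett). A sketch of computing that hull from a set of (false-positive rate, true-positive rate) points, assuming the trivial always-negative and always-positive classifiers are included:

```python
def roc_convex_hull(points):
    """Upper convex hull of ROC operating points (FPR, TPR).
    Points strictly below the hull are suboptimal for every
    cost/prior combination."""
    # Include the trivial classifiers (0,0) and (1,1).
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Pop points that would make the chain turn upward
        # (non-clockwise): they lie on or below the hull.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull
```

In the PARCEL setting each point can come from a different classifier architecture built on a different feature subset, and the area under the hull is the quantity being maximized.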
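The block-diagonal view of the scatter matrix can be made concrete: restricted to one feature group, the Fisher ratio only needs the inverse of that group's small within-class block, so the full-rank problem of high-dimensional scatter never arises. A sketch under assumed two-class data (function and variable names are illustrative, including the small ridge term added for numerical stability):

```python
import numpy as np

def group_fisher_ratio(X_pos, X_neg, group):
    """Fisher discriminant ratio restricted to one feature group:
    separation of the class means through the inverse of the
    pooled within-class scatter for that block only."""
    A, B = X_pos[:, group], X_neg[:, group]
    d = A.mean(axis=0) - B.mean(axis=0)  # separation of means
    # Pooled within-class scatter for this block alone.
    Sw = np.atleast_2d(np.cov(A, rowvar=False) + np.cov(B, rowvar=False))
    Sw = Sw + 1e-6 * np.eye(len(group))  # small ridge (an assumption)
    return float(d @ np.linalg.solve(Sw, d))  # d' Sw^{-1} d
```

Summing this quantity over the groups gives the "sum of Fisher ratios" objective from the algorithm slide, and each block inversion is cheap, which is the "inversion straightforward" point.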
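The greedy search itself follows the algorithm slide: choose M groups, seed each with one feature, then repeatedly assign the (feature, group) pair that most improves the objective. This is a sketch of that loop, not the original implementation; the `score` callback, the `n_select` stopping rule, and the random seeding are assumptions standing in for the talk's unspecified details:

```python
import numpy as np

def greedy_group(X_pos, X_neg, M, n_select, score, rng=None):
    """Greedy feature grouping: seed M groups with one random
    feature each, then grow them one feature at a time, always
    taking the (feature, group) pair with the largest gain in
    the per-group score."""
    rng = np.random.default_rng(rng)
    n_features = X_pos.shape[1]
    # Initial assignment: one feature per group, chosen at random
    # (domain knowledge could be used instead).
    seeds = rng.choice(n_features, size=M, replace=False)
    groups = [[int(s)] for s in seeds]
    remaining = set(range(n_features)) - {int(s) for s in seeds}
    # Stop when n_select features have been placed (one possible
    # stopping criterion).
    while sum(map(len, groups)) < n_select and remaining:
        best = None  # (gain, feature, group index)
        for f in remaining:
            for g, members in enumerate(groups):
                gain = (score(X_pos, X_neg, members + [f])
                        - score(X_pos, X_neg, members))
                if best is None or gain > best[0]:
                    best = (gain, f, g)
        _, f, g = best
        groups[g].append(f)
        remaining.remove(f)
    return groups
```

Here `score` would be one of the objectives from the talk, e.g. the group's AUROC or its Fisher ratio; with random seeds this reproduces the behaviour seen in the simulations, where the right features are usually found but can land in the wrong group.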