Capturing Best Practice for Microarray Gene Expression Data Analysis

Gregory Piatetsky-Shapiro
Tom Khabaza
Sridhar Ramaswamy
Presented briefly by Joey Mudd
What is Microarray Data?
•Microarray devices measure RNA expression
levels of many genes in a biological sample
•Data obtained can be used for a variety of
medical purposes: diagnosis, predicting
treatment outcome, etc.
•Data produced are typically large and
complex, which makes data mining a useful
approach
Standardizing Data Mining
Process
•CRISP-DM: Cross-Industry
Standard Process for
Data Mining
•CRISP-DM standardizes the
steps of a data mining
process using a high-level
structure and terminology
•Useful for describing best
practice
Microarray Data Analysis Issues
•Typical number of records is small (<100) due to
difficulty of collecting samples
•Typical number of attributes (genes) is large
(many thousands)
•This imbalance can lead to false positives (correlations
due to chance) and over-fitting
•Paper suggests reducing number of genes
examined (feature reduction)
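A rough arithmetic sketch of why chance correlations are a concern when genes far outnumber samples. The gene count (7,000) and the per-gene significance level (0.01) are hypothetical values chosen only for illustration; they are not taken from the paper.

```python
def expected_false_positives(n_genes: int, alpha: float) -> float:
    """Expected number of genes that pass a per-gene test at level `alpha`
    purely by chance, assuming no gene is truly related to the class."""
    return n_genes * alpha


if __name__ == "__main__":
    # Hypothetical numbers: 7,000 genes, each tested at alpha = 0.01.
    print(expected_false_positives(7000, 0.01))
```

Even with no real signal, dozens of genes would look significant under these assumptions, which is one motivation for reducing the number of genes examined.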
Data Cleaning and Preparation
•Thresholding: Determine appropriate range of values
(authors used min 100, max 16,000 for Affymetrix arrays)
•Normalization: Required for clustering
(authors used mean 0, stddev 1)
•Filtering: Remove attributes that do not vary enough
across samples, such as:
MaxValue(G)-MinValue(G)<500,
MaxValue(G)/MinValue(G)<5
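The three preparation steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code. Assumptions: thresholding is done by clipping values into the range, normalization is applied per gene across samples using the population standard deviation, and the slide does not say how the two filter criteria combine, so that is exposed as the hypothetical `require_both` parameter. All function names are invented for this example.

```python
from statistics import mean, pstdev


def threshold(values, lo=100.0, hi=16000.0):
    """Clip each expression value into [lo, hi] (assumed meaning of thresholding)."""
    return [min(max(v, lo), hi) for v in values]


def passes_variation_filter(values, min_diff=500.0, min_ratio=5.0, require_both=True):
    """Return True if a gene varies enough across samples to be kept.

    The gene is checked against Max - Min >= min_diff and Max / Min >= min_ratio.
    `require_both` is an assumption: True keeps a gene only when both checks pass.
    """
    hi, lo = max(values), min(values)
    if lo <= 0:
        raise ValueError("apply thresholding first so that all values are positive")
    diff_ok = (hi - lo) >= min_diff
    ratio_ok = (hi / lo) >= min_ratio
    return (diff_ok and ratio_ok) if require_both else (diff_ok or ratio_ok)


def standardize(values):
    """Rescale one gene's values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


if __name__ == "__main__":
    gene = threshold([20, 450, 3000, 20000])
    if passes_variation_filter(gene):
        print(standardize(gene))
```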
Feature Selection
•Because of the large number of attributes/small number
of samples, feature selection is important
•Use statistical measures to determine “best genes” for
each class
•To avoid underrepresenting some classes, apply the
heuristic of selecting an equal number of genes from each
class
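One way to picture the equal-genes-per-class heuristic for a two-class problem is sketched below. It ranks genes by a t-value and takes the top `k` from each end of the ranking. The Welch form of the t-statistic, the dictionary layout of `expr`, and the function names are assumptions for illustration; the slides only say that statistical measures are used.

```python
from math import sqrt
from statistics import mean, variance


def t_value(a, b):
    """Welch-style t-value comparing the means of two groups of samples."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0
    return (mean(a) - mean(b)) / se


def top_genes_per_class(expr, labels, class_a, class_b, k):
    """Pick k genes most elevated in class_a and k most elevated in class_b.

    expr maps gene name -> list of expression values, aligned with labels.
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == class_a]
        b = [v for v, y in zip(values, labels) if y == class_b]
        scores[gene] = t_value(a, b)
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Largest positive t-values favor class_a; most negative favor class_b.
    return {class_a: ranked[:k], class_b: ranked[::-1][:k]}


if __name__ == "__main__":
    labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
    expr = {  # toy values, not real measurements
        "g_all": [900, 950, 1000, 200, 250, 300],
        "g_aml": [150, 200, 250, 800, 900, 1000],
        "g_flat": [500, 510, 490, 505, 495, 500],
    }
    print(top_genes_per_class(expr, labels, "ALL", "AML", k=1))
```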
Building Classification Models
•For this data, decision trees work poorly, neural nets
work well
•Feature reduction alone is not sufficient
•Test models using a varying number of genes from
each class
•Five-fold cross-validation is sufficient; leave-one-out
cross-validation is considered most accurate
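The two validation schemes differ only in the number of folds: leave-one-out is k-fold cross-validation with k equal to the number of samples. The sketch below shows this under stated assumptions. The nearest-class-mean classifier is a deliberately simple stand-in (the authors used neural nets), folds are assigned round-robin without shuffling, and every name here is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Assign sample indices to k folds of near-equal size (round-robin)."""
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds


def cross_validated_error(samples, labels, fit, predict, k):
    """Average error rate over k folds; k == len(samples) gives leave-one-out."""
    errors = 0
    for test_idx in k_fold_indices(len(samples), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(samples) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = fit(train_x, train_y)
        errors += sum(predict(model, samples[i]) != labels[i] for i in test_idx)
    return errors / len(samples)


def fit_nearest_mean(xs, ys):
    """Stand-in model: store the mean of a single feature for each class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in groups.items()}


def predict_nearest_mean(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # toy single-feature samples
    ys = ["a", "a", "a", "b", "b", "b"]
    print(cross_validated_error(xs, ys, fit_nearest_mean, predict_nearest_mean, k=3))
    print(cross_validated_error(xs, ys, fit_nearest_mean, predict_nearest_mean, k=len(xs)))
```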
Case Study 1
•Leukemia data, 2 classes (AML, ALL), 38 training
samples, 34 test samples (a separate set)
•Filter to reduce number of genes, select top 100 based
on T-values
•Build neural net models, 10 genes turned out to be best
subset size
•97% accuracy (33/34 test records correctly classified)
Case Study 2
•Brain data, 5 classes, 42 samples (no separate test set)
•Same preprocessing as Case Study 1
•Select top genes based on Signal to Noise measure, select
equal number of genes per class
•Build neural net models, 12 genes per class (60 total)
gave best results
•Lowest average error rate was 15%.
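For a multi-class problem such as this one, per-class gene selection can be sketched as a one-versus-rest ranking. The signal-to-noise formula used here, (mean in class minus mean of rest) divided by (sum of the two standard deviations), is a commonly used definition and is an assumption; the slides do not give the formula. The data layout and function names are also hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(in_class, rest):
    """Assumed definition: difference of means over the sum of standard deviations."""
    denom = stdev(in_class) + stdev(rest)
    if denom == 0:
        return 0.0
    return (mean(in_class) - mean(rest)) / denom


def select_equal_per_class(expr, labels, genes_per_class):
    """For each class, pick the genes with the highest one-vs-rest score.

    expr maps gene name -> list of expression values, aligned with labels.
    Each class needs at least two samples so that stdev is defined.
    """
    selected = {}
    for c in sorted(set(labels)):
        scores = {}
        for gene, values in expr.items():
            inside = [v for v, y in zip(values, labels) if y == c]
            rest = [v for v, y in zip(values, labels) if y != c]
            scores[gene] = signal_to_noise(inside, rest)
        ranked = sorted(scores, key=scores.get, reverse=True)
        selected[c] = ranked[:genes_per_class]
    return selected


if __name__ == "__main__":
    labels = ["A", "A", "B", "B", "C", "C"]
    expr = {  # toy values, not real measurements
        "gA": [900, 1000, 100, 120, 110, 130],
        "gB": [100, 120, 900, 1000, 110, 130],
        "gC": [100, 120, 110, 130, 900, 1000],
        "g_flat": [300, 310, 305, 295, 300, 302],
    }
    print(select_equal_per_class(expr, labels, genes_per_class=1))
```

With 5 classes and `genes_per_class=12`, a selection like this would yield up to 60 genes, matching the subset size reported above.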
Case Study 3
•Cluster analysis, with goal of discovering natural classes
•Leukemia data with 3 classes: AML, plus ALL split into ALL-T and ALL-B
•Same preprocessing as before, also normalize values for
clustering
•Used two clustering methods from the Clementine package; both
were able to discover the natural classes in the data, to the
authors’ satisfaction
Conclusions
•Ideas presented could be applicable to other domains
where balance between attributes and samples is similar
(e.g., cheminformatics or drug design)
•Future work could evaluate cost-sensitive classification,
which weights errors by the cost they incur
•Principled methodology can lead to good results