Spanish_Inquisition_Week2 - 91-421-Spring-2009

Spanish Inquisition
Final Project Week 2 - 4/29/09
Breast Cancer Gene Expression Data
Leon Kay, Yan Tran, Chris Thomas
Weka Filtering
• Used CFS with BestFirst Search
• Reduced the number of attributes from
1544 to 125
• CFS stands for Correlation-based Feature
Selection. Basic hypothesis: “A good
feature subset is one that contains
features highly correlated with (predictive
of) the class, yet uncorrelated with (not
predictive of) each other.” [1]
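The hypothesis above is formalized in [1] as a merit score for a subset of k features: merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The sketch below only evaluates that formula. The function and argument names are our own, and it assumes the two mean correlations have already been computed by some other means.

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a k-feature subset, following the CFS heuristic in [1].

    Argument names are illustrative; the correlations are assumed to be
    precomputed averages over the subset.
    """
    if k < 1:
        raise ValueError("subset must contain at least one feature")
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator

# Four features, each moderately predictive of the class (0.5):
uncorrelated = cfs_merit(4, 0.5, 0.0)  # features independent of each other
redundant = cfs_merit(4, 0.5, 1.0)     # features perfectly inter-correlated
print(uncorrelated, redundant)
```

With equally predictive features, the uncorrelated subset scores higher than the redundant one, which is the behavior the hypothesis describes.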
CFS Algorithm - Searching
• Any search algorithm can be plugged into CFS. The author describes
three: forward selection, backward elimination, and best first [1].
All three are essentially greedy heuristic searches. Searching greedily
avoids evaluating every possible feature subset, which reduces the
cost of generating the feature subset.
• “Best first can start with either no features or all features.
In the former, the search progresses forward through the
search space adding single features; in the latter the
search moves backward through the search space
deleting single features. To prevent the best first search
from exploring the entire feature subset search space, a
stopping criterion is imposed. The search will terminate if
five consecutive fully expanded subsets show no
improvement over the current best subset.” [1]
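The forward variant of the search quoted above can be sketched as follows. This is a minimal illustration and not Weka's BestFirst implementation. The `merit` callable, the `max_stale` parameter name, and the toy scoring function are assumptions made for the example.

```python
import heapq
from itertools import count

def best_first_forward(n_features, merit, max_stale=5):
    """Forward best-first search over feature subsets.

    merit: hypothetical callable that scores a frozenset of feature indices.
    Stops once max_stale consecutive expanded subsets fail to improve on
    the best subset found so far.
    """
    start = frozenset()
    tie = count()  # tie-breaker so the heap never has to compare frozensets
    best_subset, best_score = start, merit(start)
    open_heap = [(-best_score, next(tie), start)]
    visited = {start}
    stale = 0
    while open_heap and stale < max_stale:
        _, _, subset = heapq.heappop(open_heap)  # most promising open subset
        improved = False
        for f in range(n_features):  # fully expand: add each single feature
            child = subset | {f}
            if f in subset or child in visited:
                continue
            visited.add(child)
            score = merit(child)
            heapq.heappush(open_heap, (-score, next(tie), child))
            if score > best_score:
                best_subset, best_score = child, score
                improved = True
        stale = 0 if improved else stale + 1
    return best_subset, best_score

def toy_merit(subset):
    # Toy score: features 0 and 2 help, every other feature costs a little.
    useful = {0, 2}
    return len(subset & useful) - 0.1 * len(subset - useful)

subset, score = best_first_forward(5, toy_merit)
print(sorted(subset), score)
```

In practice the merit function would be the CFS score of the subset. The toy function only stands in for it, so the search can run without data.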
CFS Algorithm Visual Diagram [1]
Accuracy (Error Rate) of algorithms before and after
applying CFS/BestFirst filtering

Algorithm         Before*   After**   Error Rate Reduction
J48               32.17     28.02     12.92
Bagging (J48)     18.26     16.38     10.30
Boosting (J48)    20.87     16.38     21.52
Random Forests    15.65     14.22     9.12
SMO (SVM)         15.22     14.22     6.53

* From Week 1: all 1544 attributes
** After applying CFS/BestFirst filtering: 125 attributes
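The "Error Rate Reduction" column appears to be the relative reduction, (before - after) / before, expressed in percent. The slides do not state the formula, so this is an assumption. The sketch below recomputes it from the rounded error rates in the table. The recomputed values match the reported column to within about 0.05, and the small differences would be consistent with the original figures having been computed from unrounded error rates.

```python
def relative_error_reduction(before, after):
    """Relative reduction in error rate, in percent (assumed formula)."""
    return 100.0 * (before - after) / before

# (before, after, reported reduction), copied from the table above.
results = {
    "J48": (32.17, 28.02, 12.92),
    "Bagging (J48)": (18.26, 16.38, 10.30),
    "Boosting (J48)": (20.87, 16.38, 21.52),
    "Random Forests": (15.65, 14.22, 9.12),
    "SMO (SVM)": (15.22, 14.22, 6.53),
}

for name, (before, after, reported) in results.items():
    recomputed = relative_error_reduction(before, after)
    print(f"{name}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```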
ROC – Receiver Operating Characteristic
• ROC graphs “depict the tradeoff between hit
rates and false alarm rates of classifiers” [2]
• “one point in ROC space is better than another if
it is to the northwest (tp rate is higher, fp rate is
lower, or both) of the first” [2]
• Area Under the Curve (AUC) summarizes a whole
ROC curve as a single number, so it can be used
to compare classifiers. A higher AUC is better.
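As an illustration of what the AUC number measures, the sketch below uses the equivalence noted in [2]: the AUC equals the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one, with ties counted as half. The function name and the example scores are made up for the illustration. For a multi-class problem like this one, a common approach (assumed here) is to compute one AUC per class, treating that class as positive and all others as negative.

```python
def auc(scores, labels):
    """AUC from classifier scores and binary labels (1 = positive, 0 = negative).

    Computed as the fraction of (positive, negative) pairs in which the
    positive instance gets the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores: one negative (0.7) is ranked above one positive (0.6).
example_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
example_labels = [1, 1, 0, 1, 0, 0]
print(auc(example_scores, example_labels))
```

This pairwise version is quadratic in the number of instances, which is fine for a small dataset. A sort-based computation would be the usual choice for larger ones.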
ROC Data – Area under Curve

Class                J48      Bagging (J48)   Boosting (J48)   Random Forests   SMO (SVM)
Basal-like           0.8978   0.9851          0.9883           0.9939           0.9802
Claudin-low          0.9515   0.9993          0.9975           0.9979           0.9977
HER2+/ER-            0.8137   0.9614          0.964            0.9476           0.9313
Luminal A            0.856    0.9558          0.9497           0.9735           0.9418
Luminal B            0.7842   0.93            0.9183           0.9563           0.9336
Normal Breast-like   0.7676   0.9731          0.922            0.9772           0.955
Example ROC – Random Forests
MeV Analysis
• Initial Hierarchical Clustering
Analyze the Cluster
FLJ13710 and GATA3
Lowly expressed in basal-like samples.
Highly expressed in luminal samples.
GATA3
• GATA3 levels are a known indicator of
breast cancer prognosis. (Basal-like has a
worse prognosis than Luminal.)
• Associated with estrogen receptor alpha,
which is often highly expressed in the
early stages of breast cancer.
FLJ13710
• Mentioned in a paper on finding prognostic
signatures for breast cancer.
• Couldn’t find any in-depth studies on this
gene.
References
1) Mark Hall, “Correlation-based Feature Selection for Machine Learning”, http://www.cs.waikato.ac.nz/~mhall/thesis.pdf
2) Tom Fawcett, “An introduction to ROC analysis”, doi:10.1016/j.patrec.2005.10.010 – enter into http://dx.doi.org/
3) Wilson, Brian J., Giguère, Vincent. “Meta-analysis of human cancer microarrays reveals GATA3 is integral to the estrogen receptor alpha pathway”, Molecular Cancer 2008, 7:49. http://www.molecularcancer.com/content/7/1/49
4) Hayashi, SI., et al. “The expression and function of estrogen receptor alpha and beta in human breast cancer and its clinical application”, http://erc.endocrinology-journals.org/cgi/content/abstract/10/2/193
5) “Suppl. Table 2: List of probe sets significantly differentially expressed between luminal cell lines and basal cell lines. Probe sets are ordered according to decreasing DS (discriminating score).” www.nature.com/onc/journal/v25/n15/extref/1209254x4.xls
6) Carrivick, L., et al. “Identification of Prognostic Signatures in Breast Cancer Microarray Data using Bayesian Techniques.” http://www.enm.bris.ac.uk/cig/pubs/2005/rs4.pdf