Survival-Time Classification of Breast Cancer Patients
DIMACS Workshop on Data Mining and Scalable Algorithms
August 22-24, 2001, Rutgers University
Y.-J. Lee, O. L. Mangasarian & W.H. Wolberg
Data Mining Institute
University of Wisconsin - Madison
Second Annual Review
June 1, 2001
American Cancer Society
Year 2001 Breast Cancer Estimates
Breast cancer, the most common cancer among women, is
the second leading cause of cancer deaths in women (after lung
cancer)
192,200 new cases of breast cancer in women will be
diagnosed in the United States
40,600 deaths will occur from breast cancer (40,200 among
women, 400 among men) in the United States
According to the World Health Organization, more than 1.2
million people will be diagnosed with breast cancer this year
worldwide
Key Objective
Identify breast cancer patients for whom adjuvant
chemotherapy prolongs survival time
Main Difficulty: Cannot carry out comparative
tests on human subjects
Similar patients must be treated similarly
Our Approach: Classify patients into:
Good, Intermediate & Poor groups
Classification based on: 5 cytological features
plus tumor size
Classification criteria: Tumor size & lymph
node status
Principal Results
For 253 Breast Cancer Patients
All 69 patients in the Good group:
Had the best survival rate
Had no chemotherapy
All 73 patients in the Poor group:
Had the worst survival rate
Had chemotherapy
For the 111 patients in the Intermediate group:
The 67 patients who had chemotherapy had a better survival
rate than
the 44 patients who did not have chemotherapy
The last result reverses the role of chemotherapy seen in the overall population
Very useful for treatment prescription
Outline
Tools used
Support vector machines (SVMs).
Feature selection
Classification
Clustering
k-Median (k-Mean fails!)
Cluster chemo patients into chemo-good & chemo-poor
Cluster no-chemo patients into no-chemo-good & no-chemo-poor
Three final classes
Good = No-chemo good
Poor = Chemo poor
Intermediate = Remaining patients
Generate survival curves for three classes
Use SVM to classify new patients into one of above three classes
Support Vector Machines Used
in this Work
Feature selection: SVM with 1-norm approach, $SVM_{\|\cdot\|_1}$:
$$\min_{w,\,\gamma,\; y \ge 0} \;\; \nu e'y + \|w\|_1 \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e,$$
where $D_{ii} = \pm 1$ denotes Lymph node > 0 or
Lymph node = 0
6 out of 31 features selected by SVM:
5 out of 30 cytological features that describe nuclear size,
shape and texture
Tumor size from surgery
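The 1-norm SVM above becomes an ordinary linear program once w is split into nonnegative parts. The following is a sketch of that standard reformulation, with $\nu$ the error weight and $\gamma$ the threshold of the separating plane; the auxiliary variables p and q are introduced here for illustration and do not appear on the slide:

```latex
\begin{aligned}
\min_{p,\,q,\,\gamma,\,y}\quad & \nu\, e'y + e'(p + q) \\
\text{s.t.}\quad & D\bigl(A(p - q) - e\gamma\bigr) + y \ge e, \\
& p \ge 0,\quad q \ge 0,\quad y \ge 0, \qquad w = p - q .
\end{aligned}
```

At a solution, $p_j$ and $q_j$ are never both positive, so $e'(p+q) = \|w\|_1$. Components with $w_j = 0$ correspond to features the classifier does not use, which is why the 1-norm term acts as a feature selector.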
Classification: Use smooth SVMs (SSVMs) with a Gaussian kernel
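For reference, a Gaussian kernel can be sketched as below. The width parameter `mu` and its default value are assumptions made for illustration; the slides do not state the value used.

```python
import math

def gaussian_kernel(x, z, mu=0.1):
    """K(x, z) = exp(-mu * ||x - z||_2^2); mu is an assumed width parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-mu * sq_dist)

def kernel_matrix(rows, mu=0.1):
    """Kernel values for every pair of rows (symmetric, ones on the diagonal)."""
    return [[gaussian_kernel(r, s, mu) for s in rows] for r in rows]

# Toy points, not patient data.
K = kernel_matrix([(0.0, 0.0), (1.0, 0.0)], mu=1.0)
```

A nonlinear classifier then works with the entries of `K` in place of inner products between feature vectors.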
Clustering in Data Mining
General Objective
Given: A dataset of m points in n-dimensional real space
Problem: Extract hidden distinct properties by clustering
the dataset
Concave Minimization Formulation
of Clustering Problem
Given: Set A of m points in $R^n$ represented by the matrix
$A \in R^{m \times n}$, and a number k of desired clusters
Problem: Determine centers $C_\ell$, $\ell = 1, \ldots, k$ in $R^n$ such
that the sum of the minima over $\ell \in \{1, \ldots, k\}$ of the
1-norm distance between each point $A_i$, $i = 1, \ldots, m$,
and cluster centers $C_\ell$, $\ell = 1, \ldots, k$, is minimized
Objective Function: Sum of m minima of k linear functions,
hence it is piecewise-linear concave
Difficulty: Minimizing a general piecewise-linear concave
function over a polyhedral set is NP-hard
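To make the objective concrete, the sketch below evaluates it for a fixed set of centers: each point contributes its 1-norm distance to the nearest center. The function names and the toy points are illustrative assumptions, not taken from the slides.

```python
def one_norm(u, v):
    """1-norm distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(u, v))

def clustering_objective(points, centers):
    """Sum over points of the distance to the closest center."""
    return sum(min(one_norm(p, c) for c in centers) for p in points)

# Toy data: two tight groups, one center placed in each.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
value = clustering_objective(points, centers)
```

Here two points sit on a center and two are one unit away, so `value` is 2.0. The hard part of the problem is choosing the centers, not evaluating the sum.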
Clustering via Concave Minimization
Minimize the sum of 1-norm distances between each data
point $A_i$ and the closest cluster center $C_\ell$:
$$\min_{C_\ell,\, D_{i\ell}} \;\; \sum_{i=1}^{m} \; \min_{\ell = 1, \ldots, k} \{ e' D_{i\ell} \}$$
$$\text{s.t.} \quad -D_{i\ell} \le A_i' - C_\ell \le D_{i\ell}, \quad i = 1, \ldots, m; \;\; \ell = 1, \ldots, k$$
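Bilinear reformulation:
$$\min_{C_\ell,\, D_{i\ell} \in R^n,\; T_{i\ell} \in R} \;\; \sum_{i=1}^{m} \sum_{\ell=1}^{k} T_{i\ell}\, e' D_{i\ell}$$
$$\text{s.t.} \quad -D_{i\ell} \le A_i' - C_\ell \le D_{i\ell}, \quad \sum_{\ell=1}^{k} T_{i\ell} = 1, \quad T_{i\ell} \ge 0, \quad i = 1, \ldots, m; \;\; \ell = 1, \ldots, k$$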
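One way to see why the bilinear form agrees with the sum-of-minima form: for a fixed point i, the minimum of k numbers equals the minimum of their convex combinations, because a linear function on the simplex attains its minimum at a vertex. A sketch of that step, with $a_\ell$ and $t_\ell$ as shorthand introduced here:

```latex
\min_{1 \le \ell \le k} a_\ell
  \;=\;
  \min_{t \in R^k}
  \Bigl\{ \sum_{\ell=1}^{k} t_\ell\, a_\ell \;:\; \sum_{\ell=1}^{k} t_\ell = 1,\;\; t_\ell \ge 0 \Bigr\},
\qquad a_\ell = e' D_{i\ell}, \quad t_\ell = T_{i\ell}.
```

An optimal $T_{i\ell}$ can therefore be taken as 1 for a nearest center and 0 otherwise, which is the cluster assignment.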
Finite K-Median Clustering Algorithm
(Minimizing Piecewise-linear Concave Function)
Step 0 (Initialization): Given k initial cluster centers
Different initial centers will lead to different clusters
Step 1 (Cluster Assignment): Assign points to the cluster with
the nearest cluster center in 1-norm
Step 2 (Center Update): Recompute the center of each
cluster as the cluster median (closest point to all cluster
points in 1-norm)
Step 3 (Stopping Criterion): Stop if the cluster centers are
unchanged, else go to Step 1
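The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the `max_iter` cap and the handling of an empty cluster (its center is left where it was) are assumptions added here.

```python
from statistics import median

def one_norm(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def k_median(points, initial_centers, max_iter=100):
    """Alternate 1-norm assignment and coordinate-wise median updates."""
    centers = [tuple(c) for c in initial_centers]          # Step 0
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest center in the 1-norm.
        labels = [min(range(len(centers)),
                      key=lambda j: one_norm(p, centers[j]))
                  for p in points]
        # Step 2: the coordinate-wise median minimizes the summed 1-norm
        # distance to the cluster's points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(median(col) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: keep an empty cluster's center
        # Step 3: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Toy data, not patient data.
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
final_centers, final_labels = k_median(pts, [(1, 0), (11, 10)])
```

On this toy data the centers move from the initial guesses to (0, 0) and (10, 10) after one update and then stay fixed, so the loop stops.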
Clustering Process:
Feature Selection & Initial Cluster Centers
6 out of 31 features selected by a linear SVM ($SVM_{\|\cdot\|_1}$)
SVM separating lymph node positive (Lymph > 0)
from lymph node negative (Lymph = 0)
Perform k-Median algorithm in 6-dimensional feature space
Initial cluster centers used: Medians of Good1 & Poor1
Good1: Patients with Lymph = 0 AND Tumor < 2
Poor1: Patients with Lymph > 4 OR Tumor >= 4
Typical indicator for chemotherapy
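The Good1 and Poor1 rules and their medians can be sketched as follows. The record layout (dictionary keys `lymph`, `tumor`, `f1`) and the sample values are hypothetical, chosen only to make the example run; the real data uses six selected features.

```python
from statistics import median

def initial_centers(patients, feature_keys):
    """Return (median of Good1, median of Poor1) over the given features."""
    good1 = [p for p in patients if p["lymph"] == 0 and p["tumor"] < 2]
    poor1 = [p for p in patients if p["lymph"] > 4 or p["tumor"] >= 4]

    def group_median(group):
        # statistics.median raises StatisticsError if a group is empty.
        return tuple(median(p[k] for p in group) for k in feature_keys)

    return group_median(good1), group_median(poor1)

# Made-up records for illustration only.
patients = [
    {"lymph": 0, "tumor": 1.0, "f1": 0.25},
    {"lymph": 0, "tumor": 1.5, "f1": 0.5},
    {"lymph": 0, "tumor": 1.75, "f1": 0.375},
    {"lymph": 6, "tumor": 3.0, "f1": 1.0},    # Poor1 via lymph nodes
    {"lymph": 1, "tumor": 4.5, "f1": 0.5},    # Poor1 via tumor size
    {"lymph": 2, "tumor": 2.5, "f1": 0.75},   # in neither group
]
good_center, poor_center = initial_centers(patients, ["f1", "tumor"])
```

Patients who satisfy neither rule do not influence the starting centers, but they are still assigned to a cluster once k-Median runs.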
Clustering Process
253 Patients (113 NoChemo, 140 Chemo)
Compute Initial Cluster Centers:
Good1 (Lymph=0 AND Tumor<2): Compute Median Using 6 Features
Poor1 (Lymph>=5 OR Tumor>=4): Compute Median Using 6 Features
Cluster 113 NoChemo Patients
Use k-Median Algorithm with Initial Centers: Medians of Good1 & Poor1
69 NoChemo Good -> Good
44 NoChemo Poor -> Intermediate
Cluster 140 Chemo Patients
Use k-Median Algorithm with Initial Centers: Medians of Good1 & Poor1
67 Chemo Good -> Intermediate
73 Chemo Poor -> Poor
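The mapping from the four clusters to the three final classes, together with the group sizes stated on this slide, can be checked with a short sketch. The dictionary keys are labels chosen here for illustration.

```python
# (treatment, cluster) -> final class, as in the flow above.
FINAL_CLASS = {
    ("nochemo", "good"): "Good",
    ("nochemo", "poor"): "Intermediate",
    ("chemo", "good"): "Intermediate",
    ("chemo", "poor"): "Poor",
}

# Cluster sizes as stated on the slide.
cluster_sizes = {
    ("nochemo", "good"): 69,
    ("nochemo", "poor"): 44,
    ("chemo", "good"): 67,
    ("chemo", "poor"): 73,
}

class_sizes = {}
for key, n in cluster_sizes.items():
    cls = FINAL_CLASS[key]
    class_sizes[cls] = class_sizes.get(cls, 0) + n

total = sum(class_sizes.values())
```

The class sizes come out as 69 Good, 111 Intermediate and 73 Poor, which sum to the 253 patients in the study.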
Figure: Survival Curves for Good, Intermediate & Poor Groups
Figure: Survival Curves for Intermediate Group, Split by Chemo & NoChemo
Figure: Survival Curves for All Patients, Split by Chemo & NoChemo
Figure: Survival Curves for Intermediate Group, Split by Lymph Node & Chemotherapy
Figure: Survival Curves for All Patients, Split by Lymph Node Positive & Negative
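The slides do not say how the survival curves were estimated. The Kaplan-Meier product-limit estimator is a common choice for censored follow-up data and is assumed in the sketch below; the function name, the time unit and the sample values are illustrative only.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each distinct event time.

    durations: follow-up time per patient (any consistent unit).
    observed:  True if the event occurred, False if the patient was censored.
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Made-up follow-up data: five patients, two of them censored.
curve = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False])
```

Running the function once per group (for example Good, Intermediate and Poor) and plotting the resulting step functions gives curves of the kind the figure titles describe.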
Nonlinear SVM Classifier
82.7% Tenfold Test Correctness
Four groups from the clustering result:
Good, ChemoGood (Intermediate), NoChemoPoor (Intermediate), Poor
First SVM separates:
Good2: Good & ChemoGood
Poor2: NoChemoPoor & Poor
Good2 branch: Compute LI(x) & CI(x), then SVM assigns Good or Intermediate
Poor2 branch: Compute LI(x) & CI(x), then SVM assigns Intermediate or Poor
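The two-stage decision flow above can be sketched as a small cascade. The trained SVMs are represented by plain callables, and `augment` stands in for the step that computes LI(x) and CI(x), which the slides do not define. All stand-in rules and thresholds below are hypothetical and exist only so the example runs.

```python
def classify_new_patient(x, first_svm, good2_svm, poor2_svm, augment):
    """Route a patient through the first classifier, then the matching second one."""
    side = first_svm(x)   # "Good2" or "Poor2"
    z = augment(x)        # x extended with LI(x) and CI(x)
    if side == "Good2":
        return good2_svm(z)   # "Good" or "Intermediate"
    return poor2_svm(z)       # "Intermediate" or "Poor"

# Hypothetical stand-ins for trained classifiers (not the authors' models).
def first_stub(x):
    return "Good2" if x[0] < 2.0 else "Poor2"

def augment_stub(x):
    return tuple(x) + (0 if x[0] < 1.0 else 1,)

def good2_stub(z):
    return "Good" if z[-1] == 0 else "Intermediate"

def poor2_stub(z):
    return "Poor" if z[0] >= 4.0 else "Intermediate"

labels = [classify_new_patient((v,), first_stub, good2_stub, poor2_stub, augment_stub)
          for v in (0.5, 1.5, 3.0, 5.0)]
```

Each patient passes through exactly two classifiers, and both branches can yield the Intermediate class.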
Conclusion
Used five cytological features & tumor size to cluster
breast cancer patients into 3 groups:
Good – No chemotherapy recommended
Intermediate – Chemotherapy likely to prolong survival
Poor – Chemotherapy may or may not enhance survival
3 groups have very distinct survival curves
First categorization of a breast cancer group for which
chemotherapy enhances longevity
SVM-based procedure assigns new patients to one of
the above three survival groups