Survival-Time Classification of Breast Cancer Patients

DIMACS Workshop on Data Mining and Scalable Algorithms
August 22-24, 2001, Rutgers University
Y.-J. Lee, O. L. Mangasarian & W. H. Wolberg
Data Mining Institute
University of Wisconsin - Madison
Second Annual Review
June 1, 2001
American Cancer Society
Year 2001 Breast Cancer Estimates
• Breast cancer, the most common cancer among women, is the second leading cause of cancer deaths in women (after lung cancer)
• 192,200 new cases of breast cancer in women will be diagnosed in the United States
• 40,600 deaths will occur from breast cancer (40,200 among women, 400 among men) in the United States
• According to the World Health Organization, more than 1.2 million people will be diagnosed with breast cancer this year worldwide
Key Objective
• Identify breast cancer patients for whom adjuvant chemotherapy prolongs survival time
• Main difficulty: Cannot carry out comparative tests on human subjects
  - Similar patients must be treated similarly
• Our approach: Classify patients into Good, Intermediate & Poor groups
• Classification based on: 5 cytological features plus tumor size
• Classification criteria: Tumor size & lymph node status
Principal Results
For 253 Breast Cancer Patients
• All 69 patients in the Good group:
  - Had the best survival rate
  - Had no chemotherapy
• All 73 patients in the Poor group:
  - Had the worst survival rate
  - Had chemotherapy
• For the 111 patients in the Intermediate group:
  - The 67 patients who had chemotherapy had a better survival rate than the 44 patients who did not have chemotherapy
• This last result reverses the role of chemotherapy seen in the overall population
• Very useful for treatment prescription
Outline
• Tools used
  - Support vector machines (SVMs)
    - Feature selection
    - Classification
  - Clustering
    - k-Median (k-Mean fails!)
• Cluster chemo patients into chemo-good & chemo-poor
• Cluster no-chemo patients into no-chemo-good & no-chemo-poor
• Three final classes
  - Good = No-chemo good
  - Poor = Chemo poor
  - Intermediate = Remaining patients
• Generate survival curves for three classes
• Use SVM to classify new patients into one of above three classes
Support Vector Machines Used in this Work
• Feature selection: SVM with 1-norm approach, $\mathrm{SVM}_{\|\cdot\|_1}$:
$$\min_{w,\,\gamma,\,y \ge 0} \; \nu e'y + \|w\|_1 \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e,$$
  where $D_{ii} = \pm 1$ denotes Lymph node > 0 or Lymph node = 0
• 6 out of 31 features selected by SVM:
  - 5 out of 30 cytological features that describe nuclear size, shape and texture
  - Tumor size from surgery
• Classification: Use SSVMs with Gaussian kernel
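The 1-norm objective is what makes this SVM usable for feature selection: it can be rewritten as a linear program, and linear programs of this kind tend to drive many components of $w$ to exactly zero. The slides do not show the rewriting; the block below is the standard variable-splitting form, where $p$ and $q$ are auxiliary variables introduced here for illustration (they are not in the original slides).

```latex
% Split w into nonnegative parts: w = p - q, with p, q >= 0.
% At an optimum, \|w\|_1 = e'(p + q), so the problem becomes a linear program:
\begin{aligned}
\min_{p,\,q,\,\gamma,\,y} \quad & \nu\, e'y + e'(p + q) \\
\text{s.t.} \quad & D\bigl(A(p - q) - e\gamma\bigr) + y \ge e, \\
                  & p \ge 0,\; q \ge 0,\; y \ge 0 .
\end{aligned}
% Feature j is kept when w_j = p_j - q_j \ne 0 and dropped when w_j = 0.
```

Under this reading, the "6 out of 31 features" are the components of $w$ that remain nonzero at the solution.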
Clustering in Data Mining
General Objective
• Given: A dataset of m points in n-dimensional real space
• Problem: Extract hidden distinct properties by clustering the dataset
Concave Minimization Formulation of Clustering Problem
• Given: Set $A$ of $m$ points in $R^n$ represented by the matrix $A \in R^{m \times n}$, and a number $k$ of desired clusters
• Problem: Determine centers $C_\ell$, $\ell = 1, \ldots, k$ in $R^n$ such that the sum of the minima over $\ell \in \{1, \ldots, k\}$ of the 1-norm distance between each point $A_i$, $i = 1, \ldots, m$, and cluster centers $C_\ell$, $\ell = 1, \ldots, k$, is minimized
• Objective function: Sum of $m$ minima of $k$ linear functions, hence it is piecewise-linear concave
• Difficulty: Minimizing a general piecewise-linear concave function over a polyhedral set is NP-hard
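To make the objective concrete, the following minimal sketch evaluates it for a fixed set of centers: each point contributes its 1-norm distance to the nearest center. The function names and the toy points are illustrative assumptions, not taken from the slides.

```python
def one_norm(p, q):
    """1-norm (Manhattan) distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))


def clustering_objective(points, centers):
    """Sum over all points of the distance to the closest center (1-norm)."""
    return sum(min(one_norm(p, c) for c in centers) for p in points)


# Toy data (made up for illustration): two obvious groups in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 8.0)]
centers = [(0.0, 0.0), (10.0, 9.0)]
print(clustering_objective(points, centers))
```

The clustering problem is to choose the centers that make this value as small as possible, which is the hard part.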
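Clustering via Concave Minimization
• Minimize the sum of 1-norm distances between each data point $A_i$ and the closest cluster center $C_\ell$:
$$\min_{C_\ell,\, D_{i\ell}} \; \sum_{i=1}^{m} \min_{\ell = 1, \ldots, k} \{ e' D_{i\ell} \} \quad \text{s.t.} \quad -D_{i\ell} \le A_i' - C_\ell \le D_{i\ell}, \quad i = 1, \ldots, m; \; \ell = 1, \ldots, k$$
• Bilinear reformulation:
$$\min_{C_\ell,\, D_{i\ell} \in R^n,\; T_{i\ell} \in R} \; \sum_{i=1}^{m} \sum_{\ell=1}^{k} T_{i\ell}\, e' D_{i\ell}$$
$$\text{s.t.} \quad -D_{i\ell} \le A_i' - C_\ell \le D_{i\ell}, \quad \sum_{\ell=1}^{k} T_{i\ell} = 1, \quad T_{i\ell} \ge 0, \quad i = 1, \ldots, m; \; \ell = 1, \ldots, k$$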
Finite k-Median Clustering Algorithm
(Minimizing Piecewise-linear Concave Function)
Step 0 (Initialization): Given k initial cluster centers
• Different initial centers will lead to different clusters
Step 1 (Cluster Assignment): Assign points to the cluster with the nearest cluster center in 1-norm
Step 2 (Center Update): Recompute location of center for each cluster as the cluster median (closest point to all cluster points in 1-norm)
Step 3 (Stopping Criterion): Stop if the cluster centers are unchanged, else go to Step 1
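The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names, the `max_iter` safeguard, and the choice to leave an empty cluster's center where it is are all assumptions. The center update uses the coordinate-wise median, which is the point minimizing the sum of 1-norm distances to the cluster members.

```python
from statistics import median


def one_norm(p, q):
    """1-norm distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def k_median(points, initial_centers, max_iter=100):
    """Alternate assignment and median update until centers stop changing."""
    centers = [tuple(c) for c in initial_centers]  # Step 0
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest center in 1-norm.
        labels = [
            min(range(len(centers)), key=lambda j: one_norm(p, centers[j]))
            for p in points
        ]
        # Step 2: move each center to the coordinate-wise median of its cluster.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(median(col) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: keep an empty cluster's center
        # Step 3: stop when the centers are unchanged.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy data (made up): two well-separated groups, deliberately poor starting centers.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, labels = k_median(points, [(2, 2), (8, 8)])
print(centers, labels)
```

Because the result depends on the starting centers (Step 0), the choice of initial centers matters in practice.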
Clustering Process: Feature Selection & Initial Cluster Centers
• 6 out of 31 features selected by a linear SVM ($\mathrm{SVM}_{\|\cdot\|_1}$)
  - SVM separating lymph node positive (Lymph > 0) from lymph node negative (Lymph = 0)
• Perform k-Median algorithm in 6-dimensional feature space
• Initial cluster centers used: Medians of Good1 & Poor1
  - Good1: Patients with Lymph = 0 AND Tumor < 2
  - Poor1: Patients with Lymph > 4 OR Tumor ≥ 4
  - Typical indicator for chemotherapy
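The Good1 and Poor1 rules above translate directly into two filters followed by a coordinate-wise median. The sketch below assumes a hypothetical record layout (`lymph`, `tumor`, `features`) and uses made-up two-dimensional feature vectors for brevity; in the talk the feature space has 6 dimensions.

```python
from statistics import median


def componentwise_median(rows):
    """Median of each coordinate across a list of equal-length tuples."""
    return tuple(median(col) for col in zip(*rows))


def initial_centers(patients):
    """Return (Good1 median, Poor1 median) using the thresholds on the slide."""
    good1 = [p["features"] for p in patients
             if p["lymph"] == 0 and p["tumor"] < 2]
    poor1 = [p["features"] for p in patients
             if p["lymph"] > 4 or p["tumor"] >= 4]
    return componentwise_median(good1), componentwise_median(poor1)


# Hypothetical patients; all values are invented for illustration.
patients = [
    {"lymph": 0, "tumor": 1.0, "features": (1.0, 2.0)},
    {"lymph": 0, "tumor": 1.5, "features": (2.0, 1.0)},
    {"lymph": 0, "tumor": 1.2, "features": (3.0, 3.0)},
    {"lymph": 6, "tumor": 2.0, "features": (8.0, 9.0)},
    {"lymph": 1, "tumor": 4.5, "features": (9.0, 7.0)},
    {"lymph": 7, "tumor": 5.0, "features": (7.0, 8.0)},
    {"lymph": 2, "tumor": 3.0, "features": (5.0, 5.0)},  # in neither group
]
good_center, poor_center = initial_centers(patients)
print(good_center, poor_center)
```

Patients who satisfy neither rule, like the last record, do not influence the starting centers but are still clustered afterwards.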
Clustering Process
• Start: 253 patients (113 NoChemo, 140 Chemo)
• Compute initial cluster centers:
  - Good1 (Lymph = 0 AND Tumor < 2): compute median using 6 features
  - Poor1 (Lymph ≥ 5 OR Tumor ≥ 4): compute median using 6 features
• Cluster the 113 NoChemo patients using the k-Median algorithm with initial centers: medians of Good1 & Poor1
  - 69 NoChemo Good → Good
  - 44 NoChemo Poor → Intermediate
• Cluster the 140 Chemo patients using the k-Median algorithm with initial centers: medians of Good1 & Poor1
  - 67 Chemo Good → Intermediate
  - 73 Chemo Poor → Poor
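The mapping from the four clusters to the three final classes is a simple rule, and the group sizes can be checked against the totals. The sketch below is illustrative only; the function name and the string labels are assumptions, while the counts are the ones reported in the talk.

```python
from collections import Counter


def final_class(had_chemo, cluster):
    """Map (chemo status, 'good'/'poor' cluster) to the final class."""
    if not had_chemo and cluster == "good":
        return "Good"
    if had_chemo and cluster == "poor":
        return "Poor"
    return "Intermediate"  # chemo-good and no-chemo-poor patients


# Cluster sizes as reported: (had_chemo, cluster) -> number of patients.
cluster_sizes = {
    (False, "good"): 69,
    (False, "poor"): 44,
    (True, "good"): 67,
    (True, "poor"): 73,
}

totals = Counter()
for (had_chemo, cluster), n in cluster_sizes.items():
    totals[final_class(had_chemo, cluster)] += n

print(dict(totals), sum(totals.values()))
```

The tally gives 69 Good, 111 Intermediate and 73 Poor patients, which add up to the 253 patients in the study.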
[Figure] Survival Curves for Good, Intermediate & Poor Groups
[Figure] Survival Curves for Intermediate Group: Split by Chemo & NoChemo
[Figure] Survival Curves for All Patients: Split by Chemo & NoChemo
[Figure] Survival Curves for Intermediate Group: Split by Lymph Node & Chemotherapy
[Figure] Survival Curves for All Patients: Split by Lymph Node Positive & Negative
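The transcript does not say how the survival curves were estimated. A common choice for follow-up data with censoring is the Kaplan-Meier product-limit estimator; the sketch below assumes that estimator, and the follow-up times and event flags are made up for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate.

    times:  follow-up duration for each patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


# Hypothetical follow-up data (months); not from the study.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Running the same function separately on each group (for example Good, Intermediate and Poor) would give one curve per group for comparison.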
Nonlinear SVM Classifier
82.7% Tenfold Test Correctness
• Four groups from the clustering result: Good, Intermediate (ChemoGood), Intermediate (NoChemoPoor), Poor
• First SVM separates:
  - Good2: Good & ChemoGood
  - Poor2: NoChemoPoor & Poor
• Good2 branch: compute LI(x) & CI(x), then an SVM separates Good from Intermediate
• Poor2 branch: compute LI(x) & CI(x), then an SVM separates Intermediate from Poor
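The two-stage structure can be sketched as a small cascade of kernel classifiers. Everything below is a hedged illustration: the helper names, the kernel parameter `mu`, the support points and the coefficients are invented, and the LI(x) and CI(x) quantities from the slide are left out because the transcript does not define them. Only the Gaussian kernel and the Good2/Poor2 routing come from the talk.

```python
import math


def gaussian_kernel(x, z, mu):
    """Gaussian kernel exp(-mu * ||x - z||^2)."""
    return math.exp(-mu * sum((a - b) ** 2 for a, b in zip(x, z)))


def make_kernel_classifier(support, coeffs, gamma, mu):
    """Build a decision function: True when the kernel score is nonnegative."""
    def decide(x):
        score = sum(c * gaussian_kernel(x, s, mu)
                    for s, c in zip(support, coeffs)) - gamma
        return score >= 0
    return decide


def classify(x, stage1, good_branch, poor_branch):
    """Stage 1 picks Good2 vs Poor2; stage 2 refines within the chosen branch."""
    if stage1(x):  # Good2 side: Good or Intermediate (ChemoGood)
        return "Good" if good_branch(x) else "Intermediate"
    # Poor2 side: Intermediate (NoChemoPoor) or Poor
    return "Intermediate" if poor_branch(x) else "Poor"


# Hypothetical one-dimensional classifiers, chosen only to exercise the routing.
stage1 = make_kernel_classifier([(0.0,), (10.0,)], [1.0, -1.0], 0.0, 0.1)
good_branch = make_kernel_classifier([(0.0,), (4.0,)], [1.0, -1.0], 0.0, 0.1)
poor_branch = make_kernel_classifier([(6.0,), (10.0,)], [1.0, -1.0], 0.0, 0.1)

results = [classify((v,), stage1, good_branch, poor_branch)
           for v in (1.0, 4.0, 7.0, 9.5)]
print(results)
```

In a real pipeline the three decision functions would be trained on the clustered patients rather than written by hand.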
Conclusion
• Used five cytological features & tumor size to cluster breast cancer patients into 3 groups:
  - Good: No chemotherapy recommended
  - Intermediate: Chemotherapy likely to prolong survival
  - Poor: Chemotherapy may or may not enhance survival
• 3 groups have very distinct survival curves
• First categorization of a breast cancer group for which chemotherapy enhances longevity
• SVM-based procedure assigns new patients into one of above three survival groups