3. - Department of Mathematics and Statistics
INTRODUCTION
•The SSC presented a data set on cervical cancer for analysis.
•Purpose of the analysis: determine which attributes (covariates) predict relapse in women who had cervical cancer and surgery, and classify the patients into Low, Medium and High risk.
•It has been assumed that prediction will be done with the information obtained right after the surgery. Hence, variables observed between the surgery date and the last follow-up date will not be used. These variables are "whether the patient received radiation therapy or not" and the status taken at the time of last follow-up ("dead with disease, dead without disease, alive with disease, etc.").
•905 patients entered the study; 34 patients were dropped since they had no follow-up date yet.
Covariates:
•surgery date
•last follow-up date
•age of the patient at time of surgery
•capillary lymphatic spaces (0=negative, 1,2=positive) (Cls)
•cell differentiation (1=better, 2=moderate, 3=worst) (Grad)
•histology of the cancer cells (determined by the pathologist, ranges from 0 to 6) (Histolog)
•disease left after surgery (0=clear, 1=para-vaginal area, 2=vaginal area, 3=both) (Margins)
•depth of the tumour (in mm.) (Maxdepth)
•pelvis involvement (0=negative, 1=positive) (Pellymph)
•size of the tumour (in mm.)
EXPLORATORY ANALYSIS
Univariate plots of the variables, such as these, were produced to better understand their behaviour.
[Figure: Size by relapse, Maxdepth by relapse and Age by relapse (relapse: No/yes); panels of relapse (No/yes) for Cls=negative, Cls=positive1 and Cls=positive2.]
Pairwise contingency tables were also used as an exploratory tool.
Classification trees are used to uncover inherent structure in data. These are binary arrangements created by splitting observations into "more homogeneous" groups, dictated by rules of the form (e.g.): "if Age<24 and Cls is positive then response is likely 1".
[Figure: Classification tree with NA's as factor. 871 obs; Misclassification Rate = 0.06774; Residual Mean Deviance = 0.2995. Splits on maxdepth, cls, size, grad, age, histolog and pellymph.]
[Figure: Classification tree no NA's on data. 217 obs. Splits on maxdepth, size, age, grad, cls and histolog.]
When observations with NA are dropped, too much information is lost, so we will use NA's as a factor level in all variables.
The resulting model is complex; a smaller tree might do...
Just as regression uses the Residual Sum of Squares as a diagnostic of fit, trees use the Residual Deviance. Hence a decrease in deviance means a better fitted tree. In regression, more parameters might give a better fit but a more complex interpretation. Here, the number of terminal nodes plays the role of the number of parameters. Pruning of a tree can be done based on the following:
[Figure: deviance against tree size (number of terminal nodes), NA's included; the top axis shows the deviance reduction per number of terminal nodes.]
[Figure: Pruned Tree (NA's inc). Misclassification Rate = 0.07233; Residual Mean Deviance = 0.3696.]
This smaller tree is easier to follow and its misclassification rate is still of acceptable size.
Maxdepth, Size and Cls are observed to be important variables in the structure.
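The two tree diagnostics quoted on this poster (Residual Mean Deviance and Misclassification Rate) can be sketched as follows. This is a minimal illustration, assuming the usual multinomial deviance used by S-style classification trees, where a node contributes -2 * sum(n_k * log(n_k / n)) and the residual mean deviance divides the total by (observations - terminal nodes). The leaf counts are hypothetical and are not taken from the cervical cancer data.

```python
import math

def node_deviance(class_counts):
    """Multinomial deviance of one node: -2 * sum(n_k * log(n_k / n))."""
    n = sum(class_counts)
    return -2.0 * sum(k * math.log(k / n) for k in class_counts if k > 0)

def residual_mean_deviance(leaves):
    """Total leaf deviance divided by (observations - number of leaves)."""
    total = sum(node_deviance(counts) for counts in leaves)
    n_obs = sum(sum(counts) for counts in leaves)
    return total / (n_obs - len(leaves))

def misclassification_rate(leaves):
    """Each leaf predicts its majority class; the rest are errors."""
    n_obs = sum(sum(counts) for counts in leaves)
    errors = sum(sum(counts) - max(counts) for counts in leaves)
    return errors / n_obs

# Hypothetical counts [no relapse, relapse]; NOT the poster's data.
root = [[50, 50]]                # no split at all
split = [[45, 5], [5, 45]]       # one split into two terminal nodes

dev_root = node_deviance(root[0])
dev_split = sum(node_deviance(c) for c in split)
rmd_split = residual_mean_deviance(split)
err_split = misclassification_rate(split)
```

A split is worthwhile when it lowers the deviance enough to justify the extra terminal node, which is the trade-off that pruning formalises.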
SURVIVAL ANALYSIS
A Cox Proportional Hazards model was assumed.
•Variable Size is of importance, as seen in the trees. Nevertheless, it has many missing values, and analyses usually drop such observations. In order to keep this information, we categorised it with the missing values as the lowest of the levels and used the quartiles as cutoffs for the other levels.
During the process of modeling, it was seen that the important levels of Size were three categories: Not Measured (NA's), <=30 and >30.
The model for prediction agreed on included Age, Cls, Maxdepth and Size as predictors, along with two two-way interactions: Age with Cls and Maxdepth with Size.
Specifically, the hazard as a function of time can be seen as
\[
h(t) = h_0(t)\,\exp\big(\pm 0.0366\,\text{age} + 1.8048\,\text{cls1} - 3.5872\,\text{cls2} \pm 0.1425\,\text{depth} + 1.0199\,\text{size}_{\le 30} + 0.6119\,\text{size}_{>30} - 0.02653\,\text{age:cls1} + 0.1056\,\text{age:cls2} - 0.0957\,\text{depth:size}_{\le 30} - 0.0023\,\text{depth:size}_{>30}\big)
\]
(The signs of the age and depth main effects, shown as ±, are not legible in this transcript; the other signs follow from the hazard ratios reported for this model.)
The Proportional Hazards assumption was not violated, either for the individual terms or for the global model (p-value = 0.14).
[Figure: Beta(t) against Time for each term of the model: age, clspositive1, clspositive2, maxdepth, size30c<=30, size30c>30, age:clspositive1, age:clspositive2, maxdepth:size30c<=30 and maxdepth:size30c>30.]
[Figure: K-M and Cox Survival Curves (K-M Est, Cox PH) for each level of Cls (Negative, positive1, positive2) and for each level of Size30 (Not Meas., <=30, >30).]
The Cox curves were calculated as the average of the curves corresponding to the different covariate patterns, rather than as the curve at the average value of the covariates (using the S-plus function avg.surv, created by Dr. R. Brant, CHS Dept, U of C).
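The difference between averaging the curves and plotting the curve at the average covariate values can be sketched as below. This is only an illustration of the idea under a Cox model, where S(t | x) = S0(t) ** exp(x'beta); it is not the avg.surv function itself, and the baseline survival values and linear predictors are hypothetical.

```python
import math

def cox_survival(base_surv, linear_predictor):
    """Cox PH survival curve: S(t | x) = S0(t) ** exp(x'beta)."""
    hazard_multiplier = math.exp(linear_predictor)
    return [s0 ** hazard_multiplier for s0 in base_surv]

def average_of_curves(base_surv, linear_predictors):
    """Average the individual curves, one per covariate pattern."""
    curves = [cox_survival(base_surv, lp) for lp in linear_predictors]
    return [sum(c[i] for c in curves) / len(curves)
            for i in range(len(base_surv))]

def curve_at_average(base_surv, linear_predictors):
    """Single curve evaluated at the average linear predictor."""
    mean_lp = sum(linear_predictors) / len(linear_predictors)
    return cox_survival(base_surv, mean_lp)

# Hypothetical baseline survival S0(t) at four time points and
# hypothetical linear predictors x'beta for three covariate patterns.
base_surv = [1.0, 0.95, 0.90, 0.80]
patterns = [-1.0, 0.0, 2.0]

avg_curve = average_of_curves(base_surv, patterns)
mean_curve = curve_at_average(base_surv, patterns)
```

Because the survival function is nonlinear in the linear predictor, the two curves generally differ, and the averaged curve is the one that describes the group of patients as a whole.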
•Some interesting results and interpretation for the model:
The hazard ratio comparing Cls positive1 with Cls positive2, keeping all other variables fixed:
\[
\frac{h(t)_{\text{cls pos1}}}{h(t)_{\text{cls pos2}}} = e^{1.80488 + 3.58724 - (0.02653 + 0.10564)\,\text{age}} = e^{5.39212 - 0.13217\,\text{age}}
\]
So, for Age=30, the hazard ratio=4.166265, that is, at age 30 the hazard of having a relapse when Cls is positive1 is 4.166265 times the hazard of relapse when Cls is positive2.
Now, for Age=50, the hazard ratio=0.2963008, with analogous interpretation.
We can see the effect of the interaction between Age and Cls.
Similarly, we can look at the hazard ratio between the tumour size categories, namely:
\[
\frac{h(t)_{\text{size}\le 30}}{h(t)_{\text{size}>30}} = e^{1.01990 - 0.61199 + (-0.09573 + 0.00234)\,\text{depth}} = e^{0.40791 - 0.09339\,\text{depth}}
\]
So, for Maxdepth=10, the hazard ratio=0.59097, that is, the hazard of having a relapse when Size<=30 is 0.59097 times the hazard of relapse when Size>30.
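The hazard ratios quoted for the Cox model can be checked with a few lines of arithmetic. The sketch below uses only the coefficient values printed on the poster; the function names are ours and purely illustrative.

```python
import math

def hr_cls_pos1_vs_pos2(age):
    """Hazard ratio, Cls positive1 vs positive2, other covariates fixed:
    exp(5.39212 - 0.13217 * age)."""
    return math.exp((1.80488 + 3.58724) - (0.02653 + 0.10564) * age)

def hr_size_le30_vs_gt30(depth):
    """Hazard ratio, Size<=30 vs Size>30, other covariates fixed:
    exp(0.40791 - 0.09339 * depth)."""
    return math.exp((1.01990 - 0.61199) + (-0.09573 + 0.00234) * depth)

hr_age30 = hr_cls_pos1_vs_pos2(30)      # about 4.17
hr_age50 = hr_cls_pos1_vs_pos2(50)      # about 0.30
hr_depth10 = hr_size_le30_vs_gt30(10)   # about 0.59
```

The ratio moving from above 1 at age 30 to below 1 at age 50 is the Age by Cls interaction at work.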
•As with any model, assumptions are needed. Here we assumed non-informative censoring (censoring not related to the chances of recurrence).
LOGISTIC REGRESSION ANALYSIS
•Logistic regression models the log of the odds of a binary outcome event as a linear function of covariates:
\[
\log(\text{odds}) = \sum_{i=1}^{n} \beta_i X_i
\]
•The odds is the ratio of the probability of an event happening to the probability of the same event not happening:
\[
\text{odds}(\text{event}) = \frac{P(\text{event})}{1 - P(\text{event})}
\]
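The relation between probability, odds and log-odds can be sketched in a few lines. This is a generic illustration of the definitions above; the coefficient and covariate values in the example are hypothetical and are not the fitted model.

```python
import math

def odds(p):
    """odds(event) = P(event) / (1 - P(event))."""
    return p / (1.0 - p)

def log_odds(betas, xs):
    """Linear predictor: sum of beta_i * X_i."""
    return sum(b * x for b, x in zip(betas, xs))

def probability(lo):
    """Invert the log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-lo))

# Hypothetical coefficients and covariates, for illustration only;
# the leading 1.0 in xs stands for a constant term.
betas = [-2.0, 0.5]
xs = [1.0, 3.0]
p_event = probability(log_odds(betas, xs))
```

Working on the log-odds scale keeps the model linear in the covariates while the resulting probability always stays between 0 and 1.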
Recall that, during the process of modeling, it was seen that the important levels of Size were really three categories: Not Measured (NA's), <=30 and >30.
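The three-level coding of Size, which keeps patients with a missing measurement instead of dropping them, can be sketched as follows. The function and key names are ours, and treating Not Measured as the reference level is an assumption based on the models carrying one term for each of the other two levels.

```python
from typing import Optional

def size_level(size_mm: Optional[float]) -> str:
    """Map a tumour size in mm (or None when not measured) to a level."""
    if size_mm is None:
        return "Not Measured"
    return "<=30" if size_mm <= 30 else ">30"

def size_dummies(size_mm: Optional[float]) -> dict:
    """Indicator coding with Not Measured as the (assumed) reference."""
    level = size_level(size_mm)
    return {
        "size_le30": int(level == "<=30"),
        "size_gt30": int(level == ">30"),
    }

levels = [size_level(s) for s in (None, 12.0, 30.0, 45.0)]
```

With this coding a missing Size contributes zero to both indicator terms, so every patient can still be used in the fit.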
The model for prediction agreed on included Age, Cls, Maxdepth, Size and Pellymph as predictors, along with a two-way interaction
between Age and Cls.
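Specifically, the logistic model can be seen as
\[
\log(\text{odds}) = \pm 2.9972 \pm 0.0286\,\text{age} + 2.016\,\text{cls1} - 2.9039\,\text{cls2} \pm 0.1862\,\text{size}_{\le 30} \pm 1.1692\,\text{size}_{>30} + 0.066\,\text{depth} \pm 0.5071\,\text{pellymph} - 0.0288\,\text{age:cls1} + 0.0964\,\text{age:cls2}
\]
(The signs shown as ± are not legible in this transcript; the other signs follow from the odds ratios reported for this model.)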
The statistically significant model included an interaction between Pellymph and Size >30. However, there were only three observations with such values, and the inclusion of this interaction created problems for prediction. Hence, for the sake of interpretability and in order to be able to predict, we decided to drop it.
The residual deviance changed from 252.99 for the fuller model to 260.99 for the one kept.
The usual plot for this type of analysis is a probability curve. Given that we had 2 continuous variables in our model, we present some examples of probability surfaces. These make it possible to look up the probability of relapse for any Age/Maxdepth combination.
[Figure: probability surfaces of relapse (scale 0 to 1) over age and maxdepth, for Cls=Neg, Cls=Pos1 and Cls=Pos2. Rows: Pellymph=Neg, Size=Not Measured; Pellymph=Pos, Size=Not Measured.]
Observe the interaction of Age and Cls.
[Figure: probability surfaces of relapse (scale 0 to 1) over age and maxdepth, for Cls=Neg, Cls=Pos1 and Cls=Pos2. Rows: Pellymph=Pos, Size<=30; Pellymph=Pos, Size>30.]
Changing from Size <=30 to Size >30 increases the probability of relapse for a fixed set of the other variables (compare top to bottom).
We can see that Age plays a bigger role when Cls has level positive2.
•Some interesting results and interpretation for the model:
The odds ratio comparing Cls positive1 with Cls positive2, keeping all other variables fixed:
\[
\frac{\text{odds(relapse)}_{\text{cls pos1}}}{\text{odds(relapse)}_{\text{cls pos2}}} = e^{2.0160 + 2.9039 - (0.02888 + 0.09644)\,\text{age}} = e^{4.9199 - 0.1253\,\text{age}}
\]
So, for Age=30, the odds ratio=3.190795, that is, the odds of having a relapse when Cls is positive1 are 3.190795 times the odds of relapse when Cls is positive2.
Now, for Age=50, the odds ratio=0.2602293, with analogous interpretation.
We can see the effect of the interaction between Age and Cls.
Similarly, we can look at the odds ratio for an increase of 10mm in tumour depth, namely:
\[
\frac{\text{odds(relapse)}_{\text{depth}=a+10}}{\text{odds(relapse)}_{\text{depth}=a}} = e^{10 \times 0.06606043}
\]
So, for fixed values of the other variables and an increase of 10 in Maxdepth, the odds ratio=1.935962.
That is, the odds of having a relapse when the tumour is 10mm deeper are 1.935962 times the odds at the original depth.
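The odds ratios quoted for the logistic model can be reproduced from the printed coefficients, as sketched below. The function names are ours; the crossover age at which the Cls positive1 to positive2 odds ratio equals 1 is a derived quantity that the poster does not report, and small differences from the quoted figures come from coefficient rounding.

```python
import math

B_CLS = 2.0160 + 2.9039          # cls1 minus cls2 main effects
B_AGE_CLS = 0.02888 + 0.09644    # size of the age slope difference
B_DEPTH = 0.06606043             # log-odds per mm of Maxdepth

def or_cls_pos1_vs_pos2(age):
    """Odds ratio, Cls positive1 vs positive2, other covariates fixed."""
    return math.exp(B_CLS - B_AGE_CLS * age)

def or_depth_increase(mm):
    """Odds ratio for a tumour that is `mm` millimetres deeper."""
    return math.exp(B_DEPTH * mm)

or_age30 = or_cls_pos1_vs_pos2(30)   # poster quotes 3.190795
or_age50 = or_cls_pos1_vs_pos2(50)   # poster quotes 0.2602293
or_10mm = or_depth_increase(10)      # poster quotes 1.935962

# Age at which the two Cls levels have equal odds (ratio = 1).
crossover_age = B_CLS / B_AGE_CLS
```

Below the crossover age (roughly 39 with these rounded coefficients) positive1 carries the higher odds of relapse, and above it positive2 does.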
One of the purposes of the case study was to classify patients into Low, Medium and High risk of relapse.
We suggest doing this using the probabilities obtained from this logistic regression, in the following way:
Calculate the probability from the model for each patient. If the probability is within a prefixed range, the patient is set as Low; if it is within another range, as Medium; and so on. For example:
Low if in (0,.35], Med if in (.35, .60] and High if >.60
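The example ranges above translate directly into a small classifier. This is a sketch using the example cutoffs from the poster; the function name and the choice to expose the cutoffs as parameters are ours.

```python
def risk_class(prob, low_cut=0.35, high_cut=0.60):
    """Low if prob in (0, low_cut], Med if in (low_cut, high_cut],
    High if prob > high_cut."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    if prob <= low_cut:
        return "Low"
    if prob <= high_cut:
        return "Med"
    return "High"

classes = [risk_class(p) for p in (0.02, 0.35, 0.36, 0.60, 0.61)]
```

Keeping the cutoffs as parameters makes it easy to try other ranges without touching the fitted model.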
Another way of classifying would involve at risk or not at risk as the possible classifications (as a +/- test).
Do this by setting a cutoff point for the calculated probabilities and setting the value of the test for the patient as + or -.
Although this gives only two possibilities, predictive values can be calculated, which gives a measure of accuracy.
Some examples for different cutoffs follow.
•For the cutoff values after the first, the table itself is omitted.

Classified + if predicted Pr(D) >= .5
True D defined as relapse ~= 0

             -------- True --------
Classified |       D        ~D    |   Total
-----------+----------------------+--------
     +     |       5         3    |       8
     -     |      37       612    |     649
-----------+----------------------+--------
   Total   |      42       615    |     657

Positive predictive value   Pr( D| +)   62.50%
Negative predictive value   Pr(~D| -)   94.30%
Correctly classified                    93.91%
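The predictive values reported for the 0.5 cutoff follow directly from the four cells of its classification table (5, 3, 37 and 612). The sketch below recomputes them; the function and argument names are ours.

```python
def predictive_values(tp, fp, fn, tn):
    """Return (PPV, NPV, proportion correctly classified).

    tp: classified + and relapsed      fp: classified + and no relapse
    fn: classified - and relapsed      tn: classified - and no relapse
    """
    ppv = tp / (tp + fp)                    # Pr( D | +)
    npv = tn / (tn + fn)                    # Pr(~D | -)
    correct = (tp + tn) / (tp + fp + fn + tn)
    return ppv, npv, correct

# Cells of the poster's table for the cutoff Pr(D) >= .5
ppv, npv, correct = predictive_values(tp=5, fp=3, fn=37, tn=612)
```

With only 42 relapses among 657 patients, the negative predictive value is high at every cutoff, so the positive predictive value is the more informative number when comparing cutoffs.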
In every case, true D is defined as relapse ~= 0.

Classified + if predicted   Pr( D| +)   Pr(~D| -)   Correctly classified
Pr(D) >= .25                33.33%      94.76%      92.24%
Pr(D) >= .4                 70.00%      94.59%      94.22%
Pr(D) >= .6                 50.00%      93.74%      93.61%

As a "goodness of fit" check, a table by groups follows.
Logistic model for relapse, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)

Group   Prob     Obs_1   Exp_1   Obs_0   Exp_0   Total
1       0.0150   1       0.8     65      65.2    66
2       0.0188   1       1.1     65      64.9    66
3       0.0216   0       1.3     66      64.7    66
4       0.0259   3       1.5     62      63.5    65
5       0.0318   2       1.9     64      64.1    66
6       0.0440   1       2.5     65      63.5    66
7       0.0604   4       3.4     61      61.6    65
8       0.0839   2       4.7     64      61.3    66
9       0.1413   12      6.9     54      59.1    66
10      0.6601   16      17.8    49      47.2    65

number of observations = 657
number of groups = 10
Pvalue = 0.2704
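The goodness-of-fit table can be turned into a test statistic as sketched below. We assume the test is the usual Hosmer-Lemeshow chi-square on (groups - 2) degrees of freedom, which the poster does not state explicitly. Because the expected counts in the table are rounded to one decimal, the result will be close to, but not exactly, the reported p-value of 0.2704.

```python
import math

# Observed and expected counts copied from the poster's table.
obs_1 = [1, 1, 0, 3, 2, 1, 4, 2, 12, 16]
exp_1 = [0.8, 1.1, 1.3, 1.5, 1.9, 2.5, 3.4, 4.7, 6.9, 17.8]
obs_0 = [65, 65, 66, 62, 64, 65, 61, 64, 54, 49]
exp_0 = [65.2, 64.9, 64.7, 63.5, 64.1, 63.5, 61.6, 61.3, 59.1, 47.2]

def hosmer_lemeshow(o1, e1, o0, e0):
    """Sum of (observed - expected)^2 / expected over both outcomes."""
    return sum((a - b) ** 2 / b for a, b in zip(o1, e1)) + \
           sum((a - b) ** 2 / b for a, b in zip(o0, e0))

def chi2_sf_even_df(x, df):
    """Upper tail of a chi-square with even df (closed form):
    exp(-x/2) * sum_{j < df/2} (x/2)^j / j!"""
    if df % 2 != 0:
        raise ValueError("closed form only valid for even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))

hl_stat = hosmer_lemeshow(obs_1, exp_1, obs_0, exp_0)
hl_df = len(obs_1) - 2            # assumed: groups - 2
hl_pvalue = chi2_sf_even_df(hl_stat, hl_df)
```

A large p-value here means there is no evidence that the observed relapse counts depart from those the model expects in each group.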
CONCLUSIONS
•Given the nature of the study, and the assumption that prediction of relapse would be done right after surgery, variables observed after surgery were not taken into account. These were the status of the patient at the last follow-up date and whether the patient received radiation.
•Contrary to what we expected, Disease left after surgery did not play an important role in prediction.
•There was agreement throughout the different analyses (exploratory, survival and logistic) regarding the importance of including three covariates: Maxdepth, Capillary Lymphatic Spaces (Cls) and Size.
•The effect of the variable Age on relapse depends on its interaction with Capillary Lymphatic Spaces (Cls).
•The important variables for predicting the survival to relapse are Age, Cls, Size and Maxdepth.
•The important variables for predicting the probability of relapse are Age, Cls, Size, Maxdepth and Pellymph.
FUTURE WORK
•It would be of relevance to check the importance of the covariates when the response variable is separated into no relapse, relapse before a specific time and relapse after that time.
•Use of trees as a classification tool rather than an exploratory tool.
ACKNOWLEDGEMENTS
We would like to thank the following for their help and support in
the creation of this poster:
StatCar lab, Mathematics and Statistics Dept., U of C
Dr. R. Brant, CHS, U of C
Dr. P. Ehlers, Math and Stats, U of C
B. Teare, Math and Stats, U of C
Learning Commons, U of C
BIBLIOGRAPHY
•Rose, S., Lecture notes for Biostatistics II
•Venables, W.N. and Ripley, B.D. Modern Applied Statistics with
S-plus, Springer Statistics and Computing Series, New York,
1994
•Insightful, S-plus 2000 Guide to Statistics, Seattle, 1999