Classify the Thyroid Disease


Introduction
Objective
Project Plan
Data Selection
Property
Preprocessing
Various Approaches to Classify the Thyroid Disease
C4.5 / C5
SVM
ANN
Ensemble
Conclusion
Korea University , Industrial System Information Engineering
2015-07-17

Experience Data Mining as a part of KDD processes

Focused on using various Data Mining Techniques

Our objective is to find a model (classifier) C = f(A)

Evaluate the constructed models

Used R GUI version 2.9.0 with Tinn-R version 1.17.2.4

4/10   First team meeting
~4/26  Find existing research and a data set for the project
4/28   Submit an initial proposal
5/10   Change the subject of the project
~5/27  Try to get a suitable data set for the project
5/29   Write a modified proposal
6/4    Submit the modified proposal
6/6    Decision tree and SVM classifier modeling
6/10   Ensemble and ANN model construction
6/16   Integrate the results and write the final report
6/18   Submit the final report and presentation
Data Selection

Thyroid Disease Data set from UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease)

Attributes
• 29 nominal (T/F, M/F, etc.) and ratio attributes
• Nominal attributes have text values
• Some attributes are highly correlated

Data Instances
• 2800 training instances, some of which contain missing values
• 972 test instances, some of which also contain missing values

Parallel Coordinate Plot

Example code
library(lattice)  # parallelplot() was named parallel() in older lattice releases
parallelplot(~hypo.data[, 1:22])

There are too many attributes to analyze the correlation between attributes and classes

Parallel Coordinate Plot

Example code
library(lattice)  # parallelplot() was named parallel() in older lattice releases
attach(hypo.data)
parallelplot(~hypo.data[, c(1, 2, 17, 18, 19, 20, 21, 22)] | Diagnosis,
             groups = Diagnosis)

According to this plot, the attributes FTI and TT4 may separate primary and compensated hypothyroid
Dimensionality Reduction
• Eliminate highly correlated attributes
• Select meaningful attributes
Control Anomaly/Missing Values
• Replace these with estimated values
Attribute Transformation
• Text values to integer values
Dimensionality Reduction (29 attributes to 22)

For each instance, the attributes TSH, T3, TT4, T4U and FTI are unknown when the corresponding "measured" attribute is FALSE
• Replace unknowns with zero
• e.g.) If the value of "TSH measured" is FALSE, then the value of TSH is unknown; "TSH measured" is highly correlated with TSH
• Each "measured" attribute is therefore meaningless: DELETE ATTRIBUTES

The values of "TBG measured" are all FALSE, and the TBG values are all unknown as well
• DELETE ATTRIBUTES

ID: a nominal attribute that only serves to identify each instance uniquely
• DELETE ATTRIBUTES
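
The reduction steps above can be sketched in base R. This is only an illustration on a toy data frame: the column names (ID, TSH.measured, TBG and so on) are assumptions, not necessarily the names used in hypo.data.

```r
# Toy stand-in for hypo.data; the column names are assumptions for illustration.
hypo.toy <- data.frame(
  ID           = c("p1", "p2", "p3"),
  TSH.measured = c(TRUE, FALSE, TRUE),
  TSH          = c(1.3, NA, 4.1),
  TBG.measured = c(FALSE, FALSE, FALSE),
  TBG          = c(NA, NA, NA),
  Diagnosis    = c("negative", "negative", "primary hypothyroid")
)

# Replace unknown measurements with zero.
hypo.toy$TSH[is.na(hypo.toy$TSH)] <- 0

# Drop the "measured" flags, the always-unknown TBG, and the ID column.
drop.cols <- c("ID", "TSH.measured", "TBG.measured", "TBG")
hypo.reduced <- hypo.toy[, !(names(hypo.toy) %in% drop.cols)]

names(hypo.reduced)
```

On the real data the same pattern would be repeated for T3, TT4, T4U and FTI.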

Anomaly


One age value is 455; it was presumably meant to be entered as 45 or 55
Replace 455 with 50

Missing Value


We decided to choose some patients who are similar to the patient whose Age value is missing.
Finally, we chose 2 patients using Excel, then replaced the missing age value with the mean of their 2 values

Missing Value

Replaced missing values with the possible values according to their probability distribution (1:2)
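
A minimal base R sketch of the three replacements described on the anomaly and missing-value slides. The toy values, the choice of "similar" patients, and which value receives weight 1 or 2 are all assumptions made for illustration; the original work was done by hand in Excel.

```r
set.seed(1)  # only to make the random draw repeatable

# Toy data; values and column names are illustrative assumptions.
toy <- data.frame(
  age = c(41, 455, NA, 63),
  sex = c("M", "F", NA, "F"),
  stringsAsFactors = FALSE
)

# Anomaly: 455 is treated as a typo for 45 or 55 and replaced by 50.
toy$age[which(toy$age == 455)] <- 50

# Missing age: mean of two similar patients (rows 1 and 4 are assumed similar here).
toy$age[is.na(toy$age)] <- mean(toy$age[c(1, 4)])

# Missing nominal value: draw from the possible values with a 1:2 weighting.
# Which value gets weight 1 and which gets weight 2 is an assumption.
n.missing <- sum(is.na(toy$sex))
toy$sex[is.na(toy$sex)] <- sample(c("M", "F"), n.missing, replace = TRUE, prob = c(1, 2))
```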

Attribute Transformation

All nominal attributes except SEX have TRUE/FALSE values
• Transform these text values to the integer values 0 (FALSE) and 1 (TRUE)

The attribute SEX has MALE/FEMALE values, which are also text values
• Transform to the integer values 1 (MALE) and 2 (FEMALE)
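
The two transformations can be written in a few lines of base R. The column name on.thyroxine is a hypothetical example of a TRUE/FALSE attribute, and the values are assumed to be stored as text.

```r
toy <- data.frame(
  on.thyroxine = c("TRUE", "FALSE", "TRUE"),   # hypothetical TRUE/FALSE attribute
  sex          = c("MALE", "FEMALE", "FEMALE"),
  stringsAsFactors = FALSE
)

# TRUE/FALSE text values become 1/0.
toy$on.thyroxine <- as.integer(toy$on.thyroxine == "TRUE")

# MALE/FEMALE text values become 1/2 through a named lookup vector.
toy$sex <- unname(c(MALE = 1L, FEMALE = 2L)[toy$sex])
```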
Various Approaches to Classify the Thyroid Disease
We decided to construct the first classification model using a decision tree

A decision tree is a method for building a classifier easily

It is based on Hunt's Algorithm

The impurity of the leaf nodes is measured by entropy
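
As a rough illustration of the entropy measure, the following base R function computes the base-2 entropy of a vector of class counts. The function name entropy is our own. The counts 154, 2580, 64 and 2 are the per-class totals of the 2800 training instances, obtained by summing the actual-class columns of the training confusion matrices in this deck; they give a value of about 0.47, far below the maximum of 2 for four classes.

```r
# Shannon entropy (base 2) of a vector of class counts.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]              # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(c(1, 1, 1, 1))        # four equally likely classes: the maximum, log2(4)
entropy(c(154, 2580, 64, 2))  # heavily skewed class counts give a much lower value
```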

We used the tree library to grow the decision tree in RGui

Example code

library(tree)
# x: predictor data, y: actual class labels (assumed definitions)
x <- hypo.data
y <- hypo.data$Diagnosis
hypo.tree <- tree(Diagnosis ~ ., data = hypo.data)
pred.tree <- predict(hypo.tree, x, type = "class")
table(pred.tree, y)
plot(hypo.tree, type = "uniform")
text(hypo.tree, cex = 0.7)


Cross Validation of the Decision Tree
According to this result, the optimal model, with low deviance, is estimated to be the one with 7 leaf nodes

Decision Tree

Training Set

Accuracy = 2784/2800 = 0.9943

                          Actual Class
Predicted Class           Compensated   Negative   Primary       Secondary
                          Hypothyroid              Hypothyroid   Hypothyroid
Compensated Hypothyroid   153           3          6             0
Negative                  1             2573       0             2
Primary Hypothyroid       0             4          58            0
Secondary Hypothyroid     0             0          0             0

The entropy of the original data set is too low (0.4720; max 2)
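
The accuracy figure can be checked directly from the confusion matrix. The sketch below re-enters the decision tree's training-set matrix by hand (rows are predicted classes, columns are actual classes) and also computes the share of each actual class that is recovered, which shows that neither secondary hypothyroid instance is found. The shortened class labels are our own.

```r
classes <- c("compensated", "negative", "primary", "secondary")

# Rows are predicted classes, columns are actual classes.
cm <- matrix(
  c(153,    3,  6, 0,
      1, 2573,  0, 2,
      0,    4, 58, 0,
      0,    0,  0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(predicted = classes, actual = classes)
)

accuracy <- sum(diag(cm)) / sum(cm)   # 2784 / 2800
recall   <- diag(cm) / colSums(cm)    # share of each actual class that is recovered

round(accuracy, 4)
round(recall, 4)
```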

Test Set

Accuracy = 968/972 = 0.9959

                          Actual Class
Predicted Class           Compensated   Negative   Primary       Secondary
                          Hypothyroid              Hypothyroid   Hypothyroid
Compensated Hypothyroid   40            2          1             0
Negative                  0             898        0             0
Primary Hypothyroid       0             1          30            0
Secondary Hypothyroid     0             0          0             0



Support Vector Machine (SVM) is an efficient model that classifies instances by finding a linear or non-linear hyperplane
It is a suitable model when the data set has many dimensions
It is very hard to visualize all data instances with many attributes; however, by plotting two attributes and slicing the others, we can visualize the instances, including the relationship between attributes

SVM modeling using R


We thought the attributes FTI and TT4 were suitable for separating the instances
This figure shows how the attributes FTI and TT4 separate the data set instances, but all of the records in this area are classified as negative


Now, change the axes and apply some slices, which reduce the number of dimensions
The area painted in light pink suggests that instances in that area would be predicted as primary hypothyroid

Prediction of Training Set

Accuracy = 2658/2800 = 0.9493

                          Actual Class
Predicted Class           Compensated   Negative   Primary       Secondary
                          Hypothyroid              Hypothyroid   Hypothyroid
Compensated Hypothyroid   30            0          2             0
Negative                  122           2579       13            2
Primary Hypothyroid       2             1          49            0
Secondary Hypothyroid     0             0          0             0

The entropy of the original data set is too low (0.4720; max 2)

Prediction of Test Set

Accuracy = 933/972 = 0.9599

                          Actual Class
Predicted Class           Compensated   Negative   Primary       Secondary
                          Hypothyroid              Hypothyroid   Hypothyroid
Compensated Hypothyroid   11            0          4             0
Negative                  29            901        6             0
Primary Hypothyroid       0             0          21            0
Secondary Hypothyroid     0             0          0             0

Concept of ANN




An artificial neural network, usually called a "neural network", is a computational model that tries to simulate the structure and/or functional aspects of biological neural networks
In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase
In a neural network model, simple nodes are connected together to form a network of nodes
Its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow


Training / Test error rate
According to this result, the number of hidden nodes to be used in the ANN is 18

Construction of ANN classifier

Used the nnet library

Example code
library(nnet)
y <- hypo.data$Diagnosis
# 18 hidden nodes, weight decay 5e-4, at most 300 iterations
hypo.ann <- nnet(Diagnosis ~ ., data = hypo.data, size = 18,
                 decay = 5e-4, maxit = 300)
hypo.ann
summary(hypo.ann)
pred.ann <- predict(hypo.ann, hypo.data, type = "class")
table(pred.ann, y)

A 22-18-4 network with 490 weights
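
The weight count can be verified by hand, assuming a single hidden layer, one bias unit feeding each non-input layer, and no skip-layer connections (the nnet default): 22 inputs plus a bias feed 18 hidden nodes, and 18 hidden nodes plus a bias feed 4 output nodes.

```latex
\[
(22 + 1) \times 18 + (18 + 1) \times 4 = 414 + 76 = 490
\]
```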

A 22-18-4 network

Prediction of Training Set

Accuracy = 2798/2800 = 0.9993

                          Actual Class
Predicted Class           Compensated   Negative   Primary       Secondary
                          Hypothyroid              Hypothyroid   Hypothyroid
Compensated Hypothyroid   152           0          0             0
Negative                  0             2580       0             0
Primary Hypothyroid       2             0          64            0
Secondary Hypothyroid     0             0          0             2

Highest training accuracy of all the models

Prediction of Test Set

Accuracy = 954/972 = 0.9815

                          Actual Class
Predicted Class           Compensated   Negative   Primary       Secondary
                          Hypothyroid              Hypothyroid   Hypothyroid
Compensated Hypothyroid   35            7          1             0
Negative                  4             891        2             0
Primary Hypothyroid       1             3          28            0
Secondary Hypothyroid     0             0          0             0

Bagging – Algorithm

Also known as Bootstrap aggregation
• Sampling with replacement
• Build a classifier on each bootstrap sample

Step 1
Draw B bootstrap samples from the sample of size N, then construct a classifier model from each bootstrap sample.

Step 2
Aggregate the B decision trees from step 1
$C^*(x) = \arg\max_y \sum_i \delta\big(C_i(x) = y\big)$

Step 3
Assign the class that receives the majority of the votes from step 2
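
The three steps can be sketched in base R as follows. This is only an illustration: the helper name majority.vote, the toy sizes N and B, and the five votes are assumptions, and the tree fitting itself is left out.

```r
# Majority vote over B classifiers for a single instance:
# returns the class y with the largest number of votes C_i(x) == y.
majority.vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

# Step 1 (sketch): row indices of B bootstrap samples, drawn with replacement from N rows.
set.seed(42)
N <- 10
B <- 5
boot.idx <- lapply(seq_len(B), function(b) sample(N, N, replace = TRUE))

# Steps 2 and 3: hypothetical predictions of five trees for one instance.
votes <- c("negative", "negative", "primary hypothyroid",
           "negative", "compensated hypothyroid")
majority.vote(votes)
```

In case of a tie, which.max returns the first of the tied classes in alphabetical order, so a real implementation may want an explicit tie-breaking rule.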

Example of Majority Vote (Tree 1 to Tree 5)
Example of Majority Vote

According to the majority vote, the class of the 80th instance is predicted as NEGATIVE, which is the same as its actual class

Result of Bagging (Majority Vote)

Bagging

Accuracy = 2790/2800 = 0.9964

                          Actual Class
Predicted Class           Compensated   Negative   Primary       Secondary
                          Hypothyroid              Hypothyroid   Hypothyroid
Compensated Hypothyroid   154           4          0             0
Negative                  0             2572       0             2
Primary Hypothyroid       0             4          64            0
Secondary Hypothyroid     0             0          0             0

Secondary Hypothyroid is misclassified again
Conclusion


Mining data from raw data sets was a valuable experience for us
A limitation of our project is that the data set we chose does not have a sufficient distribution of classes
• e.g.) there are just two instances whose class is secondary hypothyroid
Since there are not enough instances, the models we constructed may misclassify classes, especially secondary hypothyroid




Data mining techniques can be applied to pathology to diagnose disease.
We can also use data mining techniques in other medical decisions. Their use with MRI or CT scans may be a good example.
Because our R programming skills are limited, we could not do everything we wanted to perfectly.
There is existing research by J.R. Quinlan. We referred to it when building the decision trees


Comparing training error, the ANN model was the best classifier; however, comparing test error, the decision tree classified the instances well
Most attributes of the data set we used are TRUE/FALSE data. Because decision trees are strong at handling discrete values, they performed well

An ensemble model of decision trees built with the bagging method was also very accurate, because of its majority voting rule

However, because the number of instances is too small and the initial entropy value is too low, it was hard to classify the small class.

In contrast, only the ANN model classified that class well, even though it contains only two instances




Diagnosing serious diseases in pathology is very fascinating, but critical. For example, we might diagnose a patient as normal even though he/she has a very critical disease such as lung cancer
For this reason, we think a very large cost should be assigned to misclassifying patients as normal/negative, and that not only the error rate of the model but also the costs of prediction should be considered
Since there are many considerations in assigning costs, it is hard to estimate them accurately, so we could not apply them to our models
Even though this classifier can diagnose thyroid disease, the right to make the final decision belongs to the doctor
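
One way such costs could be applied, sketched in base R on the SVM test-set confusion matrix from this deck. The cost values are purely hypothetical (1 for an ordinary error, 10 for predicting negative when a disease is present) and were not part of our project.

```r
classes <- c("compensated", "negative", "primary", "secondary")

# SVM test-set confusion matrix (rows predicted, columns actual).
cm <- matrix(
  c(11,   0,  4, 0,
    29, 901,  6, 0,
     0,   0, 21, 0,
     0,   0,  0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(predicted = classes, actual = classes)
)

# Hypothetical costs: every error costs 1, except predicting "negative"
# for a patient who actually has a disease, which costs 10.
cost <- matrix(1, 4, 4, dimnames = dimnames(cm))
diag(cost) <- 0
cost["negative", c("compensated", "primary", "secondary")] <- 10

n.errors   <- sum(cm) - sum(diag(cm))   # plain error count
total.cost <- sum(cm * cost)            # cost-weighted error
```

With these assumed costs, the 35 patients wrongly predicted as negative dominate the total, which the plain error count does not show.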