Classify the Thyroid Disease
Introduction
Objective
Project Plan
Data Selection
Property
Preprocessing
Various Approaches to Classify the Thyroid Disease
C4.5 / C5
SVM
ANN
Ensemble
Conclusion
Korea University , Industrial System Information Engineering
2015-07-17
Experience data mining as a part of the KDD process
Focus on using various data mining techniques
Our objective is to find a model (classifier)
C = f(A)
Evaluate the constructed models
Used R GUI version 2.9.0 with Tinn-R version 1.17.2.4
4/10 First team meeting
~4/26 Find existing research and a data set for the project
4/28 Submit an initial proposal
5/10 Change the subject of the project
~5/27 Try to get a suitable data set for the project
5/29 Write a modified proposal
6/4 Submit the modified proposal
6/6 Decision tree and SVM classifier modeling
6/10 Ensemble & ANN model construction
6/16 Integrate the results and write the final report
6/18 Submit the final report and presentation
Thyroid Disease data set from the UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease)
Attributes
29 nominal (T/F, M/F, etc.) and ratio attributes
Nominal attributes have text values
Some highly correlated attributes
Data Instances
2800 training instances, which contain some missing values
972 test instances, which also contain some missing values
Parallel Coordinate Plot
Example code
library(lattice)  # provides parallel(); newer lattice releases name it parallelplot()
parallel(~hypo.data[, 1:22])
There are too many attributes to analyze the correlation between attributes and classes
Parallel Coordinate Plot
Example code
attach(hypo.data)
# Plot only the selected attributes, one panel and one color per diagnosis
parallel(~hypo.data[, c(1, 2, 17, 18, 19, 20, 21, 22)] | Diagnosis,
         groups = Diagnosis)
According to this plot, the attributes FTI and TT4 may separate primary and compensated hypothyroid
Dimensionality Reduction
• Eliminate highly correlated attributes
• Select meaningful attributes
Control Anomaly/Missing Values
• Replace these with estimated values
Attribute Transformation
• Text values to integer values
Dimensionality Reduction (29 attributes to 22)
For each instance, the attributes TSH, T3, TT4, T4U and FTI are unknown when the corresponding "measured" attribute is FALSE
e.g.) If TSH measured is FALSE, then the value of TSH is unknown; TSH measured is highly correlated with TSH
Replace unknowns with zero
Each "measured" attribute is then a meaningless attribute
DELETE ATTRIBUTES
The values of TBG measured are all FALSE; moreover, the TBG values are all unknown
DELETE ATTRIBUTES
ID: nominal attribute that only serves to identify each instance uniquely
DELETE ATTRIBUTES
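The reduction steps above can be sketched in R roughly as follows. This is a minimal sketch on a made-up three-row data frame: the column names (ID, TSH.measured, TSH, TBG.measured, TBG) and all values are assumptions for illustration, not the actual names or contents of hypo.data.

```r
# Toy stand-in for hypo.data; names and values are hypothetical
toy <- data.frame(
  ID           = c(101, 102, 103),
  TSH.measured = c(TRUE, FALSE, TRUE),
  TSH          = c(1.3, NA, 4.1),      # NA marks an "unknown" value
  TBG.measured = c(FALSE, FALSE, FALSE),
  TBG          = c(NA, NA, NA),
  Diagnosis    = c("negative", "negative", "primary hypothyroid")
)

# Replace unknown lab values with zero
toy$TSH[is.na(toy$TSH)] <- 0

# Delete the "measured" flags, the empty TBG column and the ID column
drop <- c("ID", "TSH.measured", "TBG.measured", "TBG")
reduced <- toy[, !(names(toy) %in% drop)]

print(names(reduced))
```

Dropping by name rather than by column position keeps the step readable and safe if the column order changes.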
Anomaly
One age value is recorded as 455; it was presumably supposed to be 45 or 55
Replace 455 with 50
Missing Value
We decided to choose some patients who are similar to the patient whose Age value is missing.
Finally, we chose 2 patients using Excel, then replaced the missing age value with the mean of their 2 values
Missing Value
Replaced with the possible values according to their probability distribution (1:2)
Attribute Transformation
All nominal attributes except SEX have TRUE/FALSE values
Transform these text values to the integer values 0 (FALSE) and 1 (TRUE)
Attribute SEX has MALE/FEMALE values, which are also text values
Transform these to the integer values 1 (MALE) and 2 (FEMALE)
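The transformation described above could look roughly like this in R. It is a sketch under assumptions: the column names (sex, on.thyroxine) and the text codes ("M"/"F", "t"/"f") are hypothetical stand-ins, and the real data may encode these values differently.

```r
# Hypothetical nominal columns; the names and text codes are assumptions
toy <- data.frame(
  sex          = c("M", "F", "F"),
  on.thyroxine = c("f", "t", "f"),
  stringsAsFactors = FALSE
)

# TRUE/FALSE text values become 1/0
toy$on.thyroxine <- ifelse(toy$on.thyroxine == "t", 1L, 0L)

# MALE becomes 1 and FEMALE becomes 2
toy$sex <- ifelse(toy$sex == "M", 1L, 2L)

print(toy)
```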
We decided to construct the first classification model using a decision tree
A decision tree is a method for easily building a classifier
It is based on Hunt’s Algorithm
The impurity of leaf nodes is measured by entropy
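We used the tree library to grow the decision tree in RGui
Example code
library(tree)
# Fit a classification tree on all attributes
hypo.tree <- tree(Diagnosis ~ ., data = hypo.data)
x <- hypo.data            # assumed: the data to predict on (here the training set)
y <- hypo.data$Diagnosis  # actual classes
# Predict classes and tabulate them against the actual classes
pred.tree <- predict(hypo.tree, x, type = "class")
table(pred.tree, y)
# Draw the tree with uniform branch lengths and label its nodes
plot(hypo.tree, type = "uniform"); text(hypo.tree, cex = 0.7)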
Cross Validation of the Decision Tree
According to this result, the optimal model with low deviance is estimated to have 7 leaf nodes
Decision Tree
Training Set
Accuracy = 2784/2800 = 0.9943
Rows: predicted class; columns: actual class

                         Compensated  Negative  Primary  Secondary
Compensated Hypothyroid          153         3        6          0
Negative                           1      2573        0          2
Primary Hypothyroid                0         4       58          0
Secondary Hypothyroid              0         0        0          0

Too low entropy of the original data set (0.4720; max 2)
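As a rough check of the entropy remark, the class entropy can be recomputed from the actual class counts of the training set, taken here as the column totals of the training confusion matrix (154, 2580, 64 and 2). This is only a sketch; with these counts the result comes out at roughly 0.47, close to the 0.4720 quoted above, and the maximum for four classes is log2(4) = 2.

```r
# Actual class counts in the training set (column totals of the confusion matrix)
counts <- c(compensated = 154, negative = 2580, primary = 64, secondary = 2)

# Shannon entropy in bits: H = -sum(p * log2(p))
p <- counts / sum(counts)
H <- -sum(p * log2(p))

# Upper bound, reached when all four classes are equally likely
H.max <- log2(length(counts))

print(round(H, 3))
print(H.max)
```

The low value reflects that about 92% of the training instances belong to the negative class.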
Test Set
Accuracy = 968/972 = 0.9959
Rows: predicted class; columns: actual class

                         Compensated  Negative  Primary  Secondary
Compensated Hypothyroid           40         2        1          0
Negative                           0       898        0          0
Primary Hypothyroid                0         1       30          0
Secondary Hypothyroid              0         0        0          0
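The accuracy figures on these slides are the sum of the diagonal of the confusion matrix divided by the total number of instances. The sketch below redoes that arithmetic for the decision tree test set; the short class labels and the extra per-class recall are illustrative additions, not part of the original analysis.

```r
# Test-set confusion matrix of the decision tree (rows: predicted, columns: actual)
classes <- c("compensated", "negative", "primary", "secondary")
cm <- matrix(c(40,   2,  1, 0,
                0, 898,  0, 0,
                0,   1, 30, 0,
                0,   0,  0, 0),
             nrow = 4, byrow = TRUE,
             dimnames = list(predicted = classes, actual = classes))

# Accuracy = correctly classified instances / all instances
accuracy <- sum(diag(cm)) / sum(cm)
print(round(accuracy, 4))

# Per-class recall; NaN for a class with no instances in the test set
recall <- diag(cm) / colSums(cm)
print(round(recall, 4))
```

The recall vector makes visible what a single accuracy number hides: the test set contains no secondary hypothyroid instances at all.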
A Support Vector Machine (SVM) is an efficient model that classifies instances by finding a linear or non-linear hyperplane
It is a suitable model when the data set has many dimensions
It is very hard to visualize all data instances with many attributes; however, by plotting two attributes and fixing the others at some slices, we can visualize the instances, including the relationship between attributes
SVM modeling using R
We thought the attributes FTI and TT4 were suitable for separating instances
This figure shows how the attributes FTI and TT4 separate the data set instances, but all of the records in this area are classified as negative
Now, change the axes and fix some slices, which reduces the number of dimensions
The area painted light pink suggests that instances in that area would be predicted as primary hypothyroid
Prediction of Training Set
Accuracy = 2658/2800 = 0.9493
Rows: predicted class; columns: actual class

                         Compensated  Negative  Primary  Secondary
Compensated Hypothyroid           30         0        2          0
Negative                         122      2579       13          2
Primary Hypothyroid                2         1       49          0
Secondary Hypothyroid              0         0        0          0

Too low entropy of the original data set (0.4720; max 2)
Prediction of Test Set
Accuracy = 933/972 = 0.9599
Rows: predicted class; columns: actual class

                         Compensated  Negative  Primary  Secondary
Compensated Hypothyroid           11         0        4          0
Negative                          29       901        6          0
Primary Hypothyroid                0         0       21          0
Secondary Hypothyroid              0         0        0          0
Concept of ANN
An artificial neural network, usually called a “neural network”, is a computational model that tries to simulate the structure and/or functional aspects of biological neural networks
In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase
In a neural network model, simple nodes are connected together to form a network of nodes
Its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow
Training / Test error rate
According to this result, the number of hidden nodes to be used in the ANN would be 18
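Construction of the ANN classifier
Used the nnet library
Example code
library(nnet)
y <- hypo.data$Diagnosis  # actual classes
# Fit a single-hidden-layer network with 18 hidden nodes
hypo.ann <- nnet(Diagnosis ~ ., data = hypo.data,
                 size = 18, decay = 5e-4, maxit = 300)
hypo.ann
summary(hypo.ann)
# Predict classes and tabulate them against the actual classes
pred.ann <- predict(hypo.ann, hypo.data, type = "class")
table(pred.ann, y)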
A 22-18-4 network with 490 weights
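The weight count follows from the layer sizes. As a small worked check, assuming a fully connected network with one bias per hidden and output unit and no skip-layer connections, the arithmetic is:

```r
# Layer sizes of the 22-18-4 network
n.in <- 22; n.hidden <- 18; n.out <- 4

# Each hidden and output unit has one weight per incoming unit plus a bias
n.weights <- (n.in + 1) * n.hidden + (n.hidden + 1) * n.out
print(n.weights)
```

That is 23 × 18 = 414 weights into the hidden layer and 19 × 4 = 76 into the output layer, 490 in total.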
A 22-18-4 network
Prediction of Training Set
Accuracy = 2798/2800 = 0.9993
Rows: predicted class; columns: actual class

                         Compensated  Negative  Primary  Secondary
Compensated Hypothyroid          152         0        0          0
Negative                           0      2580        0          0
Primary Hypothyroid                2         0       64          0
Secondary Hypothyroid              0         0        0          2

Highest training accuracy of all the models
Prediction of Test Set
Accuracy = 954/972 = 0.9815
Rows: predicted class; columns: actual class

                         Compensated  Negative  Primary  Secondary
Compensated Hypothyroid           35         7        1          0
Negative                           4       891        2          0
Primary Hypothyroid                1         3       28          0
Secondary Hypothyroid              0         0        0          0
Bagging – Algorithm
Also known as bootstrap aggregation
Sampling with replacement
Build a classifier on each bootstrap sample
Step 1
Draw B bootstrap samples of size N from the training sample, then construct a classifier model on each bootstrap sample.
Step 2
Aggregate the B decision trees from step 1
Step 3
Assign each instance the class that receives the majority of the votes from step 2
C^*(x) = \arg\max_y \sum_i \delta(C_i(x) = y)
where \delta(\cdot) is 1 if its argument is true and 0 otherwise
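The sampling and voting steps above can be sketched in base R as follows. This sketch covers only the bootstrap sampling and the majority vote, not the fitting of the trees themselves; B = 5, the seed, the helper name majority.vote and the five example votes are hypothetical choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Step 1: draw B bootstrap samples (row indices) of size N with replacement
N <- 2800
B <- 5
boot.idx <- lapply(seq_len(B), function(b) sample(N, size = N, replace = TRUE))

# Steps 2 and 3: majority vote over the classes predicted by the B classifiers
majority.vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

# Hypothetical votes of five trees for one instance
votes <- c("negative", "negative", "primary hypothyroid", "negative", "negative")
print(majority.vote(votes))
```

Each element of boot.idx would select the rows used to fit one tree. Note that which.max breaks ties by taking the first class in alphabetical order, so an odd B is a convenient choice.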
Example of Majority Vote (Trees 1 to 5)
According to the majority vote, the class of the 80th instance is predicted as NEGATIVE, which is the same as its actual class
Result of Bagging (Majority Vote)
Bagging
Accuracy = 2790/2800 = 0.9964
Rows: predicted class; columns: actual class

                         Compensated  Negative  Primary  Secondary
Compensated Hypothyroid          154         4        0          0
Negative                           0      2572        0          2
Primary Hypothyroid                0         4       64          0
Secondary Hypothyroid              0         0        0          0

Secondary Hypothyroid is misclassified again
Mining data from raw data sets was a valuable experience for us
A limitation of our project is that the data set we chose does not have a balanced distribution of classes
e.g.) there are just two instances whose class is secondary hypothyroid
Since there are not enough instances, the models we constructed may misclassify some classes, especially secondary hypothyroid
Data mining techniques can be applied to pathology to diagnose disease.
We can also use data mining techniques for other medical decisions. Their use with MRI or CT scans may be a good example.
Because our R programming skills are limited, we could not do everything we wanted.
There is existing research by J.R. Quinlan. We referred to it when growing the decision trees
Comparing training accuracy, the ANN model was the best classifier (0.9993); comparing test accuracy, however, the decision tree classified instances best (0.9959)
Most attributes of the data set we used are TRUE/FALSE values. Because a decision tree is strong at handling discrete values, it classified them well
An ensemble model of decision trees built with the bagging method was also very accurate, because of its majority voting rule
However, because the number of instances is too small and the initial entropy value is too low, it was hard to classify the small class.
In contrast, only the ANN model classified the smallest class correctly, even though that class has only two instances
Diagnosing serious diseases in pathology is very fascinating, but critical. For example, we could diagnose a patient as normal even though he/she has a very serious disease such as lung cancer
For this reason, we think a very high cost should be assigned to misclassifying patients as normal/negative, and that not only the error rate of the model but also the costs of prediction should be considered
Since there are many considerations in assigning costs, it is hard to estimate them accurately, so we could not apply them to our models
Even if this classifier can diagnose thyroid disease, the right of final decision rests with the doctor
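To illustrate the cost idea, the sketch below weights the errors of the SVM test-set confusion matrix with a cost matrix. The cost values are entirely hypothetical (10 for predicting negative for a patient who actually has a disease, 1 for any other error); they were not estimated in this project and serve only to show the calculation.

```r
classes <- c("compensated", "negative", "primary", "secondary")

# Test-set confusion matrix of the SVM (rows: predicted, columns: actual)
cm.svm <- matrix(c(11,   0,  4, 0,
                   29, 901,  6, 0,
                    0,   0, 21, 0,
                    0,   0,  0, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(predicted = classes, actual = classes))

# HYPOTHETICAL costs: 10 for predicting "negative" for a diseased patient,
# 1 for any other error, 0 for a correct prediction
cost <- matrix(1, 4, 4, dimnames = dimnames(cm.svm))
diag(cost) <- 0
cost["negative", c("compensated", "primary", "secondary")] <- 10

# Total cost = sum over cells of (count * cost of that kind of error)
total.cost <- sum(cm.svm * cost)
print(total.cost)
```

Under these assumed costs, the 35 diseased patients predicted as negative dominate the total, which is the kind of effect that the plain error rate does not show.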