ROUGH SET THEORY AND FUZZY LOGIC BASED
WAREHOUSING OF HETEROGENEOUS CLINICAL
DATABASES
Yiwen Fan
Purpose:
Warehouse medical databases.
Clinical databases have accumulated large quantities of information about patients
and their medical conditions.
To warehouse these databases and analyze a patient's condition, we need an
efficient data mining technique.
Data mining process: data warehousing, data query and cleaning, and data analysis.
Three major data mining Techniques
Regression
Clustering
Classification
Techniques used in this paper
Two phases: clustering and classification
First phase: use Rough Set Theory (RST) for clustering (the clustering technique reduces the
complexity of the RST result).
Second phase: use fuzzy logic to classify the resulting clusters.
Rough Set Theory (RST): clustering
Fuzzy logic: classification
Definition of clustering: a data mining technique for warehousing heterogeneous
databases. It groups data with similar characteristics into the same cluster and places
data with dissimilar characteristics in different clusters. (It is used to handle uncertain
and incomplete information.)
Previous clustering techniques :
K-Means
Expectation Maximization
Association Rule
K-Prototype
Fuzzy K-Modes
etc.
Phase 1 – Clustering
Definition: Partition data into groups of similar categories or objects.
Cluster: A group of similar categories or objects.
Different clusters:
The categories within a cluster are similar to one another and dissimilar to the categories of
other clusters.
Fewer clusters:
1) Cost: data details are lost;
2) Benefit: simplification.
The search for clusters is unsupervised learning.
Cluster types:
1. Exclusive clusters: each category or object belongs to only one cluster.
2. Overlapping clusters: a category or object may belong to many clusters.
3. Probabilistic clusters: a category or object belongs to each cluster with a certain probability.
Notations in Rough Set Theory (RST)
Definition 1: Indiscernibility relation, IND(B)
Definition 2: Equivalence class, [x_i]_IND(B)
Definition 3: Lower approximation
Definition 4: Upper approximation
Definition 5: Roughness
Definition 6: Mean roughness
Definition 7: Standard deviation
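The first five notations have standard forms in Pawlak's rough set theory: two objects are indiscernible on B when they agree on every attribute in B, the lower approximation of a set X is the union of equivalence classes contained in X, the upper approximation is the union of classes that intersect X, and roughness is 1 - |lower| / |upper|. The sketch below computes these on a tiny made-up table. The records, attribute names and function names are illustrative assumptions, not taken from the paper, and the paper's own mean-roughness and standard-deviation formulas are not reproduced here.

```python
from collections import defaultdict

def equivalence_classes(table, attrs):
    """Partition object ids by their values on attrs (the classes of IND(B))."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())

def lower_approx(table, attrs, target):
    """Union of the equivalence classes fully contained in target."""
    result = set()
    for c in equivalence_classes(table, attrs):
        if c <= target:
            result |= c
    return result

def upper_approx(table, attrs, target):
    """Union of the equivalence classes that intersect target."""
    result = set()
    for c in equivalence_classes(table, attrs):
        if c & target:
            result |= c
    return result

def roughness(table, attrs, target):
    """1 - |lower| / |upper|; 0 means target is exactly definable by attrs."""
    upper = upper_approx(table, attrs, target)
    if not upper:
        return 0.0
    return 1.0 - len(lower_approx(table, attrs, target)) / len(upper)

# Hypothetical patient records, made up for illustration only.
patients = {
    "p1": {"chest_pain": "typical", "sex": "M"},
    "p2": {"chest_pain": "typical", "sex": "M"},
    "p3": {"chest_pain": "atypical", "sex": "F"},
    "p4": {"chest_pain": "atypical", "sex": "M"},
}
diseased = {"p1", "p3"}  # hypothetical target set X

B = ["chest_pain", "sex"]
lower = lower_approx(patients, B, diseased)
upper = upper_approx(patients, B, diseased)
rough = roughness(patients, B, diseased)
```

Here p1 and p2 are indiscernible on B, so p1 cannot be placed in the lower approximation even though it belongs to X; only p3 can.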
1) Take the whole data set as the parent node U.
2) Set the current number of clusters, CNC (iterated from 1 to K).
3) From the attribute set A, find the attributes that fall in the same category.
4) Calculate the roughness of these attributes for that category.
5) Find the mean roughness of all these attributes.
6) Calculate and store the standard deviation of these attributes.
7) The attribute with the smallest standard deviation is used for the next iteration.
8) If the standard deviation does not match the smallest value, the next-smallest
value is taken as the splitting attribute.
9) Perform binary splitting: split the whole dataset into two clusters.
10) Use the Distance of Relevance formula to select the cluster (the one with the largest
distance).
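Steps 3 to 7 can be sketched roughly as follows. This is only one plausible reading of the slides: it assumes that the roughness of an attribute is measured for each of its categories against every other single attribute, that the mean and the (population) standard deviation are taken over those values, and that the attribute with the smallest standard deviation is chosen. The function names and the toy records are hypothetical, and the binary split and the Distance of Relevance formula are not reproduced.

```python
from collections import defaultdict
from statistics import mean, pstdev

def partition(table, attr):
    """Group object ids by their value on a single attribute."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[row[attr]].add(obj)
    return list(groups.values())

def roughness(target, classes):
    """1 - |lower| / |upper| of a non-empty target with respect to a partition."""
    lower = sum(len(c) for c in classes if c <= target)
    upper = sum(len(c) for c in classes if c & target)
    return 1.0 - lower / upper

def roughness_stats(table, attrs):
    """Per attribute: (mean roughness, standard deviation of roughness)."""
    stats = {}
    for a_i in attrs:
        values = []
        for a_j in attrs:
            if a_j == a_i:
                continue
            classes_j = partition(table, a_j)
            # Roughness of every category of a_i as seen through a_j.
            values.extend(roughness(x, classes_j) for x in partition(table, a_i))
        stats[a_i] = (mean(values), pstdev(values))
    return stats

def splitting_attribute(table, attrs):
    """Pick the attribute whose roughness values have the smallest spread."""
    stats = roughness_stats(table, attrs)
    return min(attrs, key=lambda a: stats[a][1])

# Toy records, made up for illustration only.
records = {
    "r1": {"a": "x", "b": "p", "c": "s"},
    "r2": {"a": "x", "b": "p", "c": "t"},
    "r3": {"a": "y", "b": "q", "c": "s"},
    "r4": {"a": "y", "b": "q", "c": "t"},
}
stats = roughness_stats(records, ["a", "b", "c"])
chosen = splitting_attribute(records, ["a", "b", "c"])
```

The toy table only illustrates the bookkeeping; whether the smallest standard deviation is a good splitting criterion on real clinical data is a question for the paper, not for this sketch.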
Phase 2 – Classification
Fuzzy inference: generating a mapping from a given input to an output using fuzzy logic. The
mapping then provides a basis from which decisions can be made or patterns discerned.
Fuzzy Inference System (FIS):
1) Fuzzification
2) Fuzzy Rules Generation
3) Defuzzification
Fuzzy Inference Process:
1) Membership Functions
2) Logical Operations
3) If-Then Rules
Fuzzification
Conditions
1. All the “Cluster 1 (C-1)” values are compared with the “Minimum Limit Value
(ML(C-1))”. Any Cluster 1 values less than ML(C-1) are set as L.
2. All the “Cluster 1 (C-1)” values are compared with the “Maximum Limit Value
(XL(C-1))”. Any Cluster 1 values greater than XL(C-1) are set as H.
3. Any “Cluster 1 (C-1)” values greater than ML(C-1) and
less than XL(C-1) are set as M.
Similarly, apply the same conditions to the other cluster, C-2, to generate its fuzzy
values.
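The three conditions amount to a simple thresholding step, sketched below. The slides do not show how the limits ML and XL are chosen, so they are passed in as parameters; the function name, the sample values and the treatment of values exactly equal to a limit (labelled M here) are assumptions.

```python
def fuzzify(values, min_limit, max_limit):
    """Map each numeric value to 'L', 'M' or 'H' using two limits."""
    if min_limit > max_limit:
        raise ValueError("min_limit must not exceed max_limit")
    labels = []
    for v in values:
        if v < min_limit:
            labels.append("L")   # below the minimum limit ML
        elif v > max_limit:
            labels.append("H")   # above the maximum limit XL
        else:
            # Between the limits; values equal to a limit are also
            # treated as M, which is an assumption of this sketch.
            labels.append("M")
    return labels

# Hypothetical Cluster 1 values and limits, for illustration only.
cluster_1 = [110, 130, 145, 160, 180]
labels = fuzzify(cluster_1, min_limit=120, max_limit=160)
```

The same call with the limits of C-2 would produce the fuzzy values for the second cluster.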
Fuzzy Rules Generation
The fuzzy rules are generated from the fuzzy values that the fuzzification
process produces for each feature.
General form of a fuzzy rule:
“IF A THEN B”
IF: antecedent
THEN: conclusion
The output values of the FIS, which range between L and H, are used in training
to generate the fuzzy rules.
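One simple way to picture rule generation is to collect fuzzified training rows and keep, for each distinct antecedent, the conclusion seen most often. The slides do not say how the paper derives its rules, so the majority vote, the function names, the feature names and the training rows below are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def generate_rules(samples):
    """Build one IF-THEN rule per distinct antecedent by majority vote.

    samples: iterable of (antecedent, conclusion) pairs, where antecedent
    is a tuple of fuzzy labels ('L', 'M', 'H') and conclusion is a label.
    """
    votes = defaultdict(Counter)
    for antecedent, conclusion in samples:
        votes[tuple(antecedent)][conclusion] += 1
    return {ante: counts.most_common(1)[0][0] for ante, counts in votes.items()}

def format_rule(feature_names, antecedent, conclusion):
    """Render a rule in the 'IF A THEN B' form."""
    conditions = " AND ".join(
        f"{name} is {label}" for name, label in zip(feature_names, antecedent)
    )
    return f"IF {conditions} THEN output is {conclusion}"

# Hypothetical fuzzified training rows: (feature labels, output label).
training = [
    (("H", "H"), "H"),
    (("H", "H"), "H"),
    (("H", "H"), "M"),
    (("L", "M"), "L"),
]
rules = generate_rules(training)
text = format_rule(["cholesterol", "blood_pressure"], ("H", "H"), rules[("H", "H")])
```

Each entry of `rules` corresponds to one "IF A THEN B" statement, with the tuple of labels as the antecedent A and the stored label as the conclusion B.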
Defuzzification
Input: the fuzzy set
Output: a single value, L, M or H (representing whether the given
input dataset is in the low, medium or high range).
The FIS is trained using the fuzzy rules, and testing is done
on the datasets.
Evaluation metrics
Evaluate the effectiveness of the proposed systems
Justify theoretical and practical developments of these systems
Sensitivity
Sensitivity measures the proportion of actual positives that are correctly identified. It relates to the test's ability to
identify positive results.
Specificity
Specificity measures the proportion of actual negatives that are correctly identified. It relates to the test's ability to identify
negative results.
Accuracy
Accuracy is the proportion of all cases, positive and negative, that are correctly identified, so it can be computed from the same counts as the two measures above.
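In terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), the standard definitions are sensitivity = TP / (TP + FN), specificity = TN / (TN + FP) and accuracy = (TP + TN) / (TP + TN + FP + FN). A minimal sketch follows; the function name and the example counts are made up and are not results from the paper.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) as fractions in [0, 1]."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    sensitivity = tp / (tp + fn)                 # correctly identified positives
    specificity = tn / (tn + fp)                 # correctly identified negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # all correctly identified cases
    return sensitivity, specificity, accuracy

# Hypothetical confusion-matrix counts, for illustration only.
sens, spec, acc = evaluation_metrics(tp=40, tn=45, fp=5, fn=10)
```

Multiplying each value by 100 gives the percentages used in the result tables.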
Results and Discussions
The paper used the heart disease datasets: Cleveland, Hungarian and
Switzerland
Total number of attributes: 76
Generally used: 14 attributes:
Age, sex, chest pain type, resting blood pressure, serum cholesterol in
mg/dl, fasting blood sugar, resting electrocardiographic results,
maximum heart rate achieved, exercise-induced angina, ST depression,
slope of the peak exercise ST segment, number of major vessels, thal
and diagnosis of heart disease.
Clustering Results
The data are clustered into two sets.
[Figure: cluster plot. Red dots: Cluster 1; blue dots: Cluster 2; crosses: centroids]
Cleveland dataset
Performance evaluation for sensitivity, specificity and accuracy
of the Cleveland dataset

Iteration No   Sensitivity (in %)   Specificity (in %)   Accuracy (in %)
1              21                   7                    30
2              29                   19                   37
3              36                   25                   44
4              54                   25                   45
5              57                   38                   47
6              57                   38                   50
7              64                   50                   54
8              71                   57                   59
9              71                   69                   64
10             79                   75                   75
[Figure: graph of the sensitivity, specificity and accuracy of the Cleveland dataset]
Switzerland dataset
Performance evaluation for sensitivity, specificity and accuracy
of the Switzerland dataset

Iteration No   Specificity (in %)   Sensitivity (in %)   Accuracy (in %)
1              8                    98                   15
2              25                   98                   31
3              68                   98                   69
4              83                   98                   85
5              83                   98                   85
6              93                   98                   92
7              93                   98                   92
8              98                   98                   98
[Figure: graph of the sensitivity, specificity and accuracy of the Switzerland dataset]
Hungarian dataset
Performance evaluation for sensitivity, specificity and accuracy
of the Hungarian dataset

Iteration No   Specificity (in %)   Sensitivity (in %)   Accuracy (in %)
1              9                    26                   40
2              9                    58                   50
3              18                   59                   54
4              28                   63                   54
5              37                   69                   57
6              37                   69                   57
7              37                   73                   60
8              46                   79                   62
9              46                   89                   69
10             64                   98                   72
[Figure: graph of the sensitivity, specificity and accuracy of the Hungarian dataset]
Conclusion
Rough Set Theory was used as the clustering algorithm.
Fuzzy logic was used to classify the clusters.
The experimentation was carried out on heart disease datasets.
The evaluation metrics of sensitivity, specificity and accuracy for the
proposed work were also analyzed.
Result
The Switzerland dataset provided better results compared
with the other two datasets.
At the highest iteration level, we achieved good
clustering and classification results.
References:
[1] R. Saravana Kumar, “Rough Set Theory and Fuzzy Logic Based
Warehousing of Heterogeneous Clinical Databases”.
[2] Duo Chen, Du-Wu Cui, Chao-Xue Wang, and Zhu-Rong Wang, “A Rough Set-Based Hierarchical
Clustering Algorithm for Categorical Data”, International Journal of Information Technology, Vol. 12,
No. 3, pp. 149-159, 2006.