National Yunlin University of Science and Technology
A supervised clustering and classification algorithm for mining data with mixed variables
Xiangyang Li and Nong Ye
IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 36, no. 2, 2006, pp. 396-406.
Presenter: Wei-Shen Tai
Advisor: Professor Chung-Chian Hsu
2006/10/11
Outline
Introduction
Review of CCAS
ECCAS
Applications of ECCAS
Results and discussion
Conclusion
Comments
Motivation
Handling mixed data types in data mining: data with mixed variables, including numerical, ordinal, and nominal variables.
Objective
An enhanced supervised clustering algorithm
It improves robustness to the presentation order of training data points and to noise in the training data.
The algorithm supports incremental learning and mixed data types.
Cluster Representation and
Distance Measures
Clustering and Classification Algorithm - Supervised (CCAS)
Clusters data points based on their distances as well as the target class of each data point.
A grid-based supervised clustering of data points (a minimal sketch follows below).
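The following is a minimal sketch of grid-based supervised clustering in the spirit of CCAS, not the authors' exact procedure; the function names (grid_cell, ccas_train), the number of grid intervals, and the rule for joining a cluster (same grid cell and same target class) are simplifying assumptions.

```python
import numpy as np

def grid_cell(x, mins, maxs, n_intervals):
    """Map a numeric data point to its grid cell (a tuple of interval indices)."""
    idx = np.floor((x - mins) / (maxs - mins + 1e-12) * n_intervals).astype(int)
    return tuple(np.clip(idx, 0, n_intervals - 1))

def ccas_train(X, y, n_intervals=5):
    """Incremental, grid-based supervised clustering (simplified CCAS-style sketch).

    Each cluster keeps a centroid, a point count, a target class, and the grid
    cell it lives in.  A point joins the nearest cluster only if that cluster
    lies in the same grid cell and carries the same target class; otherwise a
    new cluster is created."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    clusters = []
    for x, label in zip(X, y):
        cell = grid_cell(x, mins, maxs, n_intervals)
        candidates = [c for c in clusters
                      if c["cell"] == cell and c["label"] == label]
        if candidates:
            nearest = min(candidates,
                          key=lambda c: np.sum((c["centroid"] - x) ** 2))
            n = nearest["n"]
            nearest["centroid"] = (nearest["centroid"] * n + x) / (n + 1)
            nearest["n"] = n + 1
        else:
            clusters.append({"centroid": x.astype(float), "n": 1,
                             "label": label, "cell": cell})
    return clusters, (mins, maxs, n_intervals)
```

Because each point is processed once and clusters are updated in place, this style of clustering naturally supports incremental learning.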
Post-processing of the Cluster
Structure for More Robustness
Data Redistribution
Reduces the impact of the presentation order of data points.
The existing clusters serve as seed clusters while the data points are redistributed. When a seed cluster is found to be the nearest cluster to a data point, the seed cluster is replaced by a new cluster that has the data point as its centroid and contains only this data point.
Supervised Grouping of Clusters
If two clusters that are nearest to each other have the same target class, they are grouped into one cluster.
Removal of Outliers
Removes outliers by checking the number of data points in each cluster; clusters with too few points are discarded. (A sketch of the grouping and outlier-removal steps follows below.)
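A rough sketch of the last two post-processing steps, assuming the cluster representation from the previous sketch; the function names (supervised_grouping, remove_outliers), the mutual-nearest-neighbor merge criterion, and the min_points threshold are illustrative assumptions, not the authors' exact formulas.

```python
import numpy as np

def supervised_grouping(clusters):
    """Repeatedly merge pairs of mutually nearest clusters that share a target class."""
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        for i, a in enumerate(clusters):
            # index of the cluster nearest to a (excluding a itself)
            j = min((k for k in range(len(clusters)) if k != i),
                    key=lambda k: np.sum((clusters[k]["centroid"] - a["centroid"]) ** 2))
            b = clusters[j]
            # b's nearest cluster must be a (mutually nearest) and the classes must match
            i_back = min((k for k in range(len(clusters)) if k != j),
                         key=lambda k: np.sum((clusters[k]["centroid"] - b["centroid"]) ** 2))
            if i_back == i and a["label"] == b["label"]:
                total = a["n"] + b["n"]
                a["centroid"] = (a["centroid"] * a["n"] + b["centroid"] * b["n"]) / total
                a["n"] = total
                clusters.pop(j)
                merged = True
                break
    return clusters

def remove_outliers(clusters, min_points=2):
    """Drop clusters with fewer than min_points data points (threshold is an assumption)."""
    return [c for c in clusters if c["n"] >= min_points]
```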
Classification
Concept
Classify a new data point D using the clusters labeled with target classes. The predicted class is a distance-weighted combination of the classes of the k nearest clusters:

Y = sum_{j=1..k} Wj * Y_Lj,   with   Wj = (1 / d^2(D, Lj)) / sum_{i=1..k} (1 / d^2(D, Li)),

where Lj is the jth nearest cluster, Wj is the weight for cluster Lj based on the squared distance from D to Lj, and Y_Lj and Y are the target class values of cluster Lj and of D, respectively. (A sketch follows below.)
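A minimal sketch of this distance-weighted classification over the learned clusters; the inverse-squared-distance weighting follows the definition above, while the function name (ccas_classify), the choice of k, and the vote-based tie-breaking are assumptions.

```python
import numpy as np

def ccas_classify(x, clusters, k=3):
    """Predict the target class of point x from the k nearest clusters,
    weighting each cluster's class by the inverse of its squared distance."""
    dists = np.array([np.sum((c["centroid"] - x) ** 2) for c in clusters])
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-12)   # inverse squared distance
    weights /= weights.sum()                   # normalize so the weights sum to 1
    # accumulate weighted votes per class, then return the heaviest class;
    # if the class values are numeric (e.g., 0/1), the weighted sum itself
    # could be returned instead, as in the formula above
    votes = {}
    for w, idx in zip(weights, nearest):
        label = clusters[idx]["label"]
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)
```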
ECCAS (extended CCAS)
Method A: Based on a combination of two distance measures
For each nominal attribute with ni categories, count the frequencies of the ni categories over the data points in cluster j, and represent the attribute by these frequencies.
Method B: Based on conversion of nominal variables to binary variables
Each categorical value of a nominal attribute is represented by a binary variable. (A sketch of both encodings follows below.)
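A small sketch of the two nominal-variable encodings; the function names (frequency_representation, binary_conversion), the relative-frequency normalization, and the example attribute values are illustrative assumptions.

```python
from collections import Counter

def frequency_representation(values):
    """Method A style: represent a cluster's nominal attribute by the relative
    frequency of each category among the cluster's data points."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}

def binary_conversion(values, categories=None):
    """Method B style: convert one nominal attribute into 0/1 indicator
    variables, one per category (a simple one-hot encoding)."""
    if categories is None:
        categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

# Example: a 'protocol' attribute of a cluster's data points (hypothetical values)
protocol = ["tcp", "udp", "tcp", "icmp", "tcp"]
print(frequency_representation(protocol))   # {'tcp': 0.6, 'udp': 0.2, 'icmp': 0.2}
print(binary_conversion(protocol)[0])       # one 0/1 vector per data point
```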
Results and discussion
Conclusions
ECCAS
Handles data with both numeric and nominal variables.
Reduces the impact of the data presentation order on the prediction accuracy.
Number of grid intervals
Shows an impact on the prediction accuracy of ECCAS.
Adaptive and dynamic adjustment of the parameters
Includes the grid-interval configuration and the threshold controlling outlier removal.
Comments
Advantage
Provides a concept for supervised clustering guided by the target class.
Offers an alternative method for handling data with mixed types.
Drawback
Attempts to represent hyperspace via a line concept.
If the target class is the determinant of clustering, why do we need the redistribution to improve the robustness?
Experimental results seem disorderly and inconsistent.
Application
A classification solution for data mining with mixed data types.