National Yunlin University of Science and Technology
A supervised clustering and classification algorithm for mining data with mixed variables
Xiangyang Li and Nong Ye
IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 36, no. 2, 2006, pp. 396-406.
Presenter: Wei-Shen Tai
Advisor: Professor Chung-Chian Hsu
2006/10/11
Outline

Introduction
Review of CCAS
ECCAS
Applications of ECCAS
Results and discussion
Conclusion
Comments
Motivation

Handling mixed data types in data mining: data with mixed variables, including numerical, ordinal, and nominal attributes.
Objective

An enhanced supervised clustering algorithm that improves robustness to the presentation order of training data points and to noise in the training data.

The algorithm supports incremental learning and mixed data types.
Cluster Representation and Distance Measures

Clustering and classification algorithm - supervised (CCAS)

A grid-based supervised clustering of data points, based on the distance between data points as well as the target class of each data point (sketched below).
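Below is a minimal sketch of this idea in Python, assuming numeric attributes scaled to [0, 1] and Euclidean distance; the class name, data layout, and update rule are illustrative, not taken from the paper.

```python
import numpy as np

def grid_cell(x, n_intervals):
    """Map a point scaled to [0, 1]^d onto its grid-cell index tuple."""
    return tuple(np.minimum((x * n_intervals).astype(int), n_intervals - 1))

class CCASSketch:
    """One-pass, grid-based supervised clustering in the spirit of CCAS."""

    def __init__(self, n_intervals=5):
        self.n_intervals = n_intervals
        # grid cell -> list of clusters, each stored as [centroid, count, target]
        self.cells = {}

    def learn(self, x, y):
        """Incremental learning: process one training point at a time."""
        cell = self.cells.setdefault(grid_cell(x, self.n_intervals), [])
        if cell:
            # Nearest existing cluster inside the same grid cell.
            nearest = min(cell, key=lambda c: np.linalg.norm(x - c[0]))
            if nearest[2] == y:
                # Same target class: absorb the point, updating the centroid
                # incrementally so no raw data needs to be kept.
                nearest[1] += 1
                nearest[0] += (x - nearest[0]) / nearest[1]
                return
        # Empty cell, or the nearest cluster has a different class: new cluster.
        cell.append([x.astype(float), 1, y])
```

Training is a single pass (for x, y in zip(X, Y): model.learn(x, y)), which is what makes the resulting cluster structure sensitive to presentation order and motivates the post-processing below.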
Post-processing of the Cluster Structure for More Robustness

Data Redistribution

Reduces the impact of the presentation order of data points. The existing clusters serve as seed clusters; when a seed cluster is found to be the nearest cluster to a data point, the seed cluster is replaced by a new cluster with that data point as the centroid and as its only member.

Supervised Grouping of Clusters

Any two clusters that are nearest to each other and have the same target class are grouped into one cluster.

Removal of Outliers

Remove data outliers by checking the number of data points in each cluster. A combined sketch of these three steps follows.
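A hedged sketch of the three post-processing steps, working on a flat cluster list of [centroid, count, target] entries (e.g. clusters = [c for cell in model.cells.values() for c in cell] from the sketch above); where the slide is silent, such as the min_points threshold and the absorb rule for non-seed clusters, the choices below are assumptions.

```python
import numpy as np

def redistribute(clusters, X, Y):
    """Data redistribution: re-present every training point against the
    current clusters (the seeds). When the nearest cluster is an untouched
    seed, it is replaced by a new cluster containing only this point, as the
    slide describes; otherwise the point is absorbed incrementally."""
    seeds = [[c[0].copy(), c[1], c[2], True] for c in clusters]  # True = seed
    for x, y in zip(X, Y):
        near = min(seeds, key=lambda s: np.linalg.norm(x - s[0]))
        if near[3]:
            near[0], near[1], near[2], near[3] = x.astype(float), 1, y, False
        else:
            near[1] += 1
            near[0] += (x - near[0]) / near[1]
    return [s[:3] for s in seeds]

def group_clusters(clusters):
    """Supervised grouping: merge two clusters whenever each is the other's
    nearest cluster and both carry the same target class."""
    def nearest(i):
        return min((j for j in range(len(clusters)) if j != i),
                   key=lambda j: np.linalg.norm(clusters[i][0] - clusters[j][0]))
    i = 0
    while len(clusters) > 1 and i < len(clusters):
        j = nearest(i)
        if nearest(j) == i and clusters[i][2] == clusters[j][2]:
            a, b = clusters[i], clusters[j]
            total = a[1] + b[1]
            a[0] = (a[0] * a[1] + b[0] * b[1]) / total  # count-weighted centroid
            a[1] = total
            del clusters[j]
            i = 0  # indices shifted after the merge, so rescan
        else:
            i += 1
    return clusters

def remove_outliers(clusters, min_points=2):
    """Outlier removal: drop clusters with too few members (assumed threshold)."""
    return [c for c in clusters if c[1] >= min_points]
```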
Classification

Concept

Classify a new data point D using the clusters labeled with target classes, combining the classes of the k nearest clusters:

Y = Σj Wj YLj

where Lj is the jth nearest cluster and Wj is the weight for cluster Lj, based on the squared distance to D; the target class values of this cluster and of D are YLj and Y, respectively.
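A minimal sketch of this classification rule, assuming the weights are inverse squared distances normalized to sum to one and that class values are numeric (e.g. 0/1), so the weighted combination can be rounded to a class; the slide omits the exact equation, so these details are assumptions.

```python
import numpy as np

def classify(clusters, d, k=3, eps=1e-12):
    """Predict the class of point d from its k nearest labeled clusters,
    weighting each cluster by the inverse of its squared distance to d."""
    nearest = sorted(((np.linalg.norm(d - c[0]), c[2]) for c in clusters))[:k]
    weights = np.array([1.0 / (dist ** 2 + eps) for dist, _ in nearest])
    weights /= weights.sum()  # assumed normalization of the W_j
    labels = np.array([y for _, y in nearest], dtype=float)
    # Weighted combination of the clusters' class values Y_Lj, rounded to
    # the nearest class label (binary classes assumed in this sketch).
    return int(round(float(weights @ labels)))
```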
ECCAS (extended CCAS)

Method A: Based on a Combination of Two Distance Measures

For a cluster j with a number of data points, count the frequencies of the ni categories of each nominal attribute and represent these frequencies as the cluster's value for that attribute; the resulting nominal distance is combined with the numeric distance.

Method B: Based on Conversion of Nominal Variables to Binary Variables

Each categorical value of a nominal attribute is represented by a binary variable, after which the numeric distance measure applies directly. A sketch of both methods follows.
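A sketch of both methods, with assumptions where the slide is silent: in Method A, the nominal distance is taken as one minus the relative frequency of the point's category in the cluster, and the two measures are combined with a mixing weight w; in Method B, each category becomes a 0/1 indicator so the numeric distance applies unchanged.

```python
import numpy as np

def nominal_distance(value, freq):
    """Method A: distance of a category to a cluster, computed from the
    cluster's category-frequency counts (freq is a dict: category -> count)."""
    total = sum(freq.values()) or 1
    return 1.0 - freq.get(value, 0) / total

def mixed_distance(x_num, x_nom, centroid, freqs, w=0.5):
    """Method A: weighted combination of the numeric distance and the
    per-attribute nominal distances (w is an assumed mixing parameter)."""
    d_num = np.linalg.norm(x_num - centroid)
    d_nom = sum(nominal_distance(v, f) for v, f in zip(x_nom, freqs))
    return w * d_num + (1 - w) * d_nom

def to_binary(value, categories):
    """Method B: one binary variable per category of a nominal attribute."""
    return np.array([1.0 if value == c else 0.0 for c in categories])
```

With Method B the converted indicator vectors are simply appended to the numeric attributes, after which CCAS runs unchanged.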
Results and discussion
Conclusions

ECCAS

Handles data with both numeric and nominal variables.
Reduces the impact of the data presentation order on the prediction accuracy.

Number of grid intervals

Shows an impact on the prediction accuracy of ECCAS.

Adaptively and dynamically adjusting the parameters

Includes the grid-interval configuration and the threshold controlling outlier removal.
Comments

Advantage

Provides a concept for supervised clustering with a target class.
An alternative method for handling data of mixed types.
Attempts to represent hyperspace via a line concept.

Drawback

If the target class is the determinant of clustering, why do we need the redistribution step to improve robustness?
Experimental results seem disorderly and inconsistent.

Application

A classification solution for data mining with mixed data types.