A Novel Method of Estimating the Number of Clusters in a Dataset

Reza Zafarani and Ali A. Ghorbani
Faculty of Computer Science, University of New Brunswick, Fredericton, NB, Canada
{r.zafarani, ghorbani}@unb.ca
What is Clustering?
• The unsupervised division of patterns (data points, feature vectors, instances, …) into groups of similar objects.
• Objects in the same group are more similar to each other, whereas objects in different groups are more dissimilar.

The Oracle of Clustering
Definition:
• A function, called the Oracle, that can predict whether two random data points belong to the same cluster or not:

$O(x_1, x_2) = \begin{cases} 1 & \text{if } x_1 \text{ and } x_2 \text{ belong to the same cluster} \\ 0 & \text{otherwise} \end{cases}$

• Given this Oracle function, clustering can be done in O(mn) time, where m is the number of clusters and n is the number of data points (see the sketch below).
• The Oracle function can be approximated for different clustering algorithms (preferably for those with quadratic running times); if this approximation can take place, their running times are reduced to O(mn).
• Transfer learning can be used to approximate this Oracle for algorithms with quadratic running times: the relation between the Oracles of different clustering algorithms can be learnt or approximated.
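As an illustration of the O(mn) bound, here is a minimal Python sketch (ours, not the poster's): each point is compared against a single representative of each cluster discovered so far, so no point requires more than m Oracle calls.

```python
def cluster_with_oracle(points, oracle):
    """Cluster `points` with an Oracle that returns 1 if two points
    belong to the same cluster and 0 otherwise.

    Each point is compared against one representative per existing
    cluster, so the Oracle is queried at most m times per point,
    giving O(mn) queries overall for m clusters and n points.
    """
    clusters = []
    for p in points:
        for c in clusters:
            if oracle(c[0], p) == 1:  # p matches this cluster's representative
                c.append(p)
                break
        else:  # no representative matched, so p starts a new cluster
            clusters.append([p])
    return clusters
```

With an exact Oracle this recovers the true partition; with an approximated Oracle, errors in the Oracle translate directly into assignment errors.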
The Need for Clustering
• Analysis of data / finding (dis)similar data.
• Need for abstraction / reducing redundancy.
• Alleviating the effect of low computation power.
• Cost efficiency and business use cases.

Oracle Approximation
Thresholding:
• A simple yet effective approach to predicting the Oracle is to apply a threshold to the similarity function $\sigma$ between data points:

$O(x_1, x_2) = \begin{cases} 1 & \text{if } \sigma(x_1, x_2) \geq \text{threshold} \\ 0 & \text{otherwise} \end{cases}$

• A justifiable threshold could be a linear combination of the mean and standard deviation of the similarities between the data points (see the sketch below).
• To make this more accurate, dimensionality-reduction methods can be applied to the data first.
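A minimal sketch of this thresholding scheme, assuming cosine similarity as the similarity function σ and the suggested mean-plus-standard-deviation threshold; the weighting `alpha` is an illustrative free parameter, not a value fixed by the poster.

```python
import numpy as np

def make_threshold_oracle(data, alpha=1.0):
    """Approximate the Oracle by thresholding pairwise similarity.

    The threshold is a linear combination of the mean and standard
    deviation of the pairwise similarities; cosine similarity stands
    in for the similarity function sigma.
    """
    normed = data / np.linalg.norm(data, axis=1, keepdims=True)
    sims = normed @ normed.T
    off_diag = sims[~np.eye(len(data), dtype=bool)]  # exclude self-similarities
    threshold = off_diag.mean() + alpha * off_diag.std()

    def oracle(x1, x2):
        s = (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
        return 1 if s >= threshold else 0

    return oracle
```

The same construction works for any similarity measure; applying dimensionality reduction to `data` before computing similarities, as suggested above, only changes the first step.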
Challenges in Clustering
• Dynamism
• Validity
• High Dimensions (curse of dimensionality)
• Subjectivity: e.g., the set {ship, bird, fish} can be clustered in two different ways.
• Large Data Sets
• Complexity
• Proximity Measures
• Initial Conditions
• Ensemble Techniques

Future Work
• A reasonable way to predict the number of clusters (m) from the sampled Oracle probabilities is to use methods from Partition Theory along with methods from Convex Optimization.
• Gap statistics is another area worth investigating for detecting the number of clusters; its methods can be used to refine the threshold values used in Oracle prediction.
• The transfer-learning function can be learnt, and the optimal conditions under which the function is learnable can be discovered.

Discussions
• Ensemble Clustering: two points are considered to be in the same cluster if a majority of the different clustering algorithms consider them to be in the same cluster (see the sketch below).
• This method can be computationally inefficient.
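A minimal sketch of the majority-vote rule; the label arrays are assumed to come from several clustering algorithms run beforehand, and evaluating the vote over all pairs of points is what makes the approach quadratic.

```python
def majority_oracle(labelings):
    """Build an Oracle from several clusterings by majority vote.

    `labelings` is a list of label sequences, one per clustering
    algorithm, where labelings[k][i] is the cluster of point i
    under algorithm k.
    """
    def oracle(i, j):
        votes = sum(1 for labels in labelings if labels[i] == labels[j])
        return 1 if votes > len(labelings) / 2 else 0
    return oracle

# Illustrative example with three clusterings of four points:
# oracle = majority_oracle([[0, 0, 1, 1], [0, 0, 1, 0], [0, 1, 1, 1]])
# oracle(0, 1) -> 1, since two of the three clusterings agree.
```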
Related Work
• Early literature in the area of dynamic clustering has attempted to solve this problem by running algorithms for several values of K (the number of clusters); the best K among them is determined based on some coefficient or statistic:
• The distance between two cluster centroids, normalized by the clusters' standard deviations, can be used as a coefficient.
• The silhouette coefficient compares the average distance between points in the same cluster against the average distance between points in different clusters.
• These coefficients are plotted as a function of K, and the best K is selected (see the sketch below).
• Probabilistic measures that determine the best model in mixture models can also be used; in this area, an optimal K corresponds to the best-fitting model. Some famous criteria are BIC, MDL, and MML.
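A minimal sketch of this run-several-Ks procedure, using scikit-learn's KMeans together with the silhouette coefficient; the candidate range 2-10 is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 11)):
    """Run K-means for each candidate K and return the K whose
    clustering achieves the highest silhouette coefficient."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)
```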
Predicting the Number of Clusters Using the Oracle
Monte Carlo Sampling:
• Given this Oracle function, Monte Carlo sampling of it can be used to estimate the probability that two random points belong to the same cluster:
$P(r = 1) = \sum_{i=1}^{m} \left( \frac{a_i}{n} \right)^2$

• $a_i$, m, and n are the size of cluster i, the number of clusters, and the dataset size, respectively.
• The sampling is controlled using Chernoff bounds:

$P(|p_N - p| \geq \epsilon) \leq 2 \exp(-2 \epsilon^2 N) \leq \delta$

• where p is the actual probability, $p_N$ is its estimate, N is the sample size, and $\epsilon$, $\delta$ are two prefixed constants. For example, $N \approx 26{,}500$ suffices if $\epsilon = 0.01$ and $\delta = 0.01$ (see the sketch below).
• This problem links this area to research avenues in Partition Theory, more specifically to variations of Kloosterman sums and the distribution of summands in integer partitions.
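A minimal sketch of the sampling procedure, assuming a generic `oracle` callable; the Chernoff-bound sample size reproduces the N ≈ 26,500 figure for ε = δ = 0.01.

```python
import math
import random

def chernoff_sample_size(eps, delta):
    """Smallest N satisfying 2 * exp(-2 * eps**2 * N) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def estimate_same_cluster_probability(points, oracle, eps=0.01, delta=0.01):
    """Estimate p = P(two random points share a cluster) by Monte Carlo
    sampling of the Oracle; |p_N - p| < eps holds with probability
    at least 1 - delta."""
    n = chernoff_sample_size(eps, delta)  # 26,492 for eps = delta = 0.01
    hits = 0
    for _ in range(n):
        # Draw the pair independently (with replacement), matching the
        # formula above.
        x1, x2 = random.choice(points), random.choice(points)
        hits += oracle(x1, x2)
    return hits / n
```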
References
• Milligan, G., Cooper, M.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2) (1985) 159-179
• Pelleg, D., Moore, A.: X-means: Extending K-means with efficient estimation of the number of clusters. Proceedings of the 17th International Conf. on Machine Learning (2000) 727-734
• Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 63(2) (2001) 411-423
• Guan, Y., Ghorbani, A., Belacel, N.: Y-means: A clustering method for intrusion detection. Proceedings of the Canadian Conference on Electrical and Computer Engineering (2003)
• Erdős, P., Lehner, J.: The distribution of the number of summands in the partitions of a positive integer. Duke Math. J. 8(2) (1941) 335-345
• Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004)
Intelligent and Adaptive Systems Research Group