#### Transcript Data Mining: Concepts and Techniques

##### Literature Survey of Clustering Algorithms

Bill Andreopoulos. Biotec, TU Dresden, Germany, and Department of Computer Science and Engineering, York University, Toronto, Ontario, Canada. June 27, 2006.

##### Outline

- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Supervised Classification

##### Objective of Clustering Algorithms for Categorical Data

- Partition the objects into groups:
  - Objects with similar categorical attribute values are placed in the same group.
  - Objects in different groups contain dissimilar categorical attribute values.
- An issue with clustering in general is defining the goals. Papadimitriou et al. (2000) propose:
  - Seek k groups G1, …, Gk and a policy Pi for each group i.
  - Pi: a vector of categorical attribute values.
  - Maximize Σᵢ₌₁ᵏ Σ_{c ∈ Gᵢ} |Pᵢ ○ c|, where '○' is the "overlap" operator between two vectors.
- The clustering problem is NP-complete:
  - Ideally, one would search all possible clusters and all assignments of objects.
  - The best clustering is the one maximizing a quality measure.

##### General Applications of Clustering

- Pattern recognition
- Spatial data analysis
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image processing
- Economic science (especially market research)
- WWW
  - document classification
  - clustering Weblog data to discover groups of similar access patterns

##### Examples of Clustering Applications

- Software clustering: cluster files in software systems based on their functionality.
- Intrusion detection: discover instances of anomalous (intrusive) user behavior in large system log files.
- Gene expression data: discover genes with similar functions in DNA microarray data.
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Land use: identification of areas of similar land use in an earth observation database.
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost.

##### What Is Good Clustering?

- A good clustering method will produce high-quality clusters with
  - high intra-class similarity, and
  - low inter-class similarity.
- The quality of a clustering depends on:
  - the appropriateness of the method for the dataset,
  - the (dis)similarity measure used, and
  - its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

##### Requirements of Clustering in Data Mining

- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Scalability to high dimensions
- Interpretability and usability
- Incorporation of user-specified constraints

##### Data Structures

- Data matrix (n objects × p attributes): entry x_if holds the value of attribute f for object i.
- Dissimilarity matrix (n × n, lower-triangular): entry d(i, j) holds the dissimilarity between objects i and j, with d(i, i) = 0 on the diagonal.

##### Measure the Quality of Clustering

- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
- There is a separate "quality" function that measures the "goodness" of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

##### Types of Data in Cluster Analysis

- Nominal (categorical) variables
- Interval-scaled variables
- Ordinal (numerical) variables
- Binary variables
- Mixed types

##### Nominal (Categorical) Variables

- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
- Method 1: simple matching, with m the number of matches and p the total number of variables:
  d(i, j) = (p − m) / p
- Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.

##### Interval-Scaled Variables

- Standardize the data:
  - Calculate the mean absolute deviation
    s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|),
    where m_f = (1/n)(x_1f + x_2f + … + x_nf).
  - Calculate the standardized measurement (z-score)
    z_if = (x_if − m_f) / s_f

##### Ordinal (Numerical) Variables

- An ordinal variable can be discrete or continuous; order is important, e.g., rank.
- Can be treated like interval-scaled variables:
  - replace x_if by its rank r_if ∈ {1, …, M_f};
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    z_if = (r_if − 1) / (M_f − 1);
  - compute the dissimilarity using methods for interval-scaled variables.

##### Similarity and Dissimilarity Between Objects

- Distances are normally used to measure the similarity or dissimilarity between two data objects.
- A popular one is the Minkowski distance:
  d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q),
  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer.
- If q = 1, d is the Manhattan distance:
  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
- If q = 2, d is the Euclidean distance:
  d(i, j) = (|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)^(1/2)
- Properties:
  - d(i, j) ≥ 0
  - d(i, i) = 0
  - d(i, j) = d(j, i)
  - d(i, j) ≤ d(i, k) + d(k, j)
- One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.

##### Binary Variables

A contingency table for binary data:

| object i \ object j | 1 | 0 | sum |
|---|---|---|---|
| 1 | a | b | a + b |
| 0 | c | d | c + d |
| sum | a + c | b + d | p |

- Simple matching coefficient (invariant, if the binary variable is symmetric):
  d(i, j) = (b + c) / (a + b + c + d)
- Jaccard coefficient (non-invariant if the binary variable is asymmetric):
  d(i, j) = (b + c) / (a + b + c)

##### Dissimilarity Between Binary Variables

Example:

| Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4 |
|---|---|---|---|---|---|---|---|
| Jack | M | Y | N | P | N | N | N |
| Mary | F | Y | N | P | N | P | N |
| Jim | M | Y | P | N | N | N | N |

- Gender is a symmetric attribute; the remaining attributes are asymmetric binary.
- Let the values Y and P be set to 1, and the value N be set to 0. Then, using the Jaccard coefficient:
  - d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  - d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
  - d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75

##### Variables of Mixed Types

- A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.
- One may use a weighted formula to combine their effects:
  d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) / Σ_{f=1}^{p} δ_ij^(f)
  - If f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise.
  - If f is interval-based: use the normalized distance.
  - If f is ordinal: compute the ranks r_if and z_if = (r_if − 1) / (M_f − 1), and treat z_if as interval-scaled.

##### Clustering of Genomic Data Sets

- Clustering of gene expression data sets
- Clustering of synthetic mutant lethality data sets
- Clustering applied to yeast data sets:
  - clustering the yeast genes in response to environmental changes
  - clustering the cell cycle-regulated yeast genes
  - functional analysis of the yeast genome: finding gene functions (functional prediction)

##### Software Clustering

- Group system files such that files with similar functionality are in the same cluster, while files in different clusters perform dissimilar functions.
  - Each object is a file x of the software system.
- Both categorical and numerical data sets can be used:
  - Categorical data set on a software system: for each file, which other files it may invoke during runtime.
    - After the filename x there is a list of the other filenames that x may invoke.
  - Numerical data set on a software system: the results of profiling the execution of the system, i.e., how many times each file invoked other files during runtime.
    - After the filename x there is a list of the other filenames that x invoked, together with how many times x invoked them during runtime.

##### Major Clustering Approaches

- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
- Density-based: based on connectivity and density functions.
- Grid-based: based on a multiple-level granularity structure.
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model.
- Unsupervised vs. supervised: clustering may or may not be based on prior knowledge of the correct classification.
##### Partitioning Algorithms: Basic Concept

- Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
- Heuristic methods: the k-means and k-medoids algorithms.
  - k-means (MacQueen '67): each cluster is represented by the center of the cluster.
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster.

##### The k-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:

1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when no new assignments are made.

(Figure: example of k-means iterations on a two-dimensional point set.)

##### Comments on the k-Means Method

- Strengths:
  - Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
  - Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
- Weaknesses:
  - Applicable only when a mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance.
  - Unable to handle noisy data and outliers.
  - Not suitable for discovering clusters with non-convex shapes.

##### Variations of the k-Means Method

A few variants of k-means differ in:

- selection of the initial k means,
- dissimilarity calculations, and
- strategies to calculate cluster means.

##### The k-Medoids Clustering Method

- Find representative objects, called medoids, in clusters.
- PAM (Partitioning Around Medoids, 1987):
  - starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering;
  - works effectively for small data sets, but does not scale well to large data sets.
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): randomized sampling

##### PAM (Partitioning Around Medoids) (1987)

- PAM (Kaufman and Rousseeuw, 1987), built into S-Plus.
- Uses real objects to represent the clusters:
  1. Select k representative objects arbitrarily.
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih.
  3. For each pair of i and h: if TC_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object.
  4. Repeat steps 2–3 until there is no change.

##### PAM Clustering: Total Swapping Cost

TC_ih = Σ_j C_jih

(Figure: the four cases for the contribution C_jih of an object j when medoid i is swapped with non-medoid h, where t is another medoid: C_jih = d(j, h) − d(j, i); C_jih = 0; C_jih = d(j, t) − d(j, i); C_jih = d(j, h) − d(j, t).)

##### CLARA (Clustering LARge Applications) (1990)

- CLARA (Kaufmann and Rousseeuw, 1990).
- It draws a sample of the data set, applies PAM on the sample, and gives the best clustering as the output.
- Strength: deals with larger data sets than PAM.
- Weaknesses:
  - Efficiency depends on the sample size.
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.

##### CLARANS ("Randomized" CLARA) (1994)

- CLARANS (A Clustering Algorithm based on RANdomized Search) (Ng and Han '94).
- CLARANS draws multiple samples dynamically; a different sample can be chosen at each loop.
- The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids.
- It is more efficient and scalable than both PAM and CLARA.

##### Squeezer (Single Linkage Clustering)

Not the most effective and accurate clustering algorithm that exists, but it is efficient, with a complexity of O(n), where n is the number of data objects [Portnoy01].

1. Initialize the set of clusters, S, to the empty set.
2. Obtain an object d from the data set. If S is empty, create a cluster with d and add it to S. Otherwise, find the cluster in S that is closest to this object, i.e., the closest cluster C to d in S.
3. If the distance between d and C is less than or equal to a user-specified threshold W, associate d with the cluster C. Else, create a new cluster for d in S.
4. Repeat steps 2 and 3 until no objects are left in the data set.

##### Fuzzy k-Means

- Clusters produced by k-means are "hard" or "crisp" clusters, since any feature vector x either is or is not a member of a particular cluster.
- In contrast, in "soft" or "fuzzy" clusters, a feature vector x can have a degree of membership in each cluster.
- The fuzzy k-means procedure of Bezdek [Bezdek81, Dembele03, Gasch02] allows each feature vector x to have a degree of membership in cluster i.

##### Fuzzy k-Means Algorithm

- Choose the number of classes k, with 1 < k < n.
- Choose a value for the fuzziness exponent f, with f > 1.
- Choose a definition of distance in the variable-space.
- Choose a value for the stopping criterion e (e = 0.001 gives reasonable convergence).
- Make initial guesses for the means m1, m2, …, mk.
- Until there are no changes in any mean:
  - Use the estimated means to find the degree of membership u(j, i) of xj in cluster i. For example, if a(j, i) = exp(−||xj − mi||²), one might use u(j, i) = a(j, i) / Σ_c a(j, c), normalizing over the clusters so that each object's memberships sum to 1.
  - For i from 1 to k: replace mi with the fuzzy mean of all of the examples for cluster i, e.g., mi = Σ_j u(j, i)^f xj / Σ_j u(j, i)^f.

##### k-Modes for Categorical Data (Huang '98)

A variation of the k-means method:

- replacing the means of clusters with modes;
- using new dissimilarity measures to deal with categorical objects;
- using a frequency-based method to update the modes of clusters;
- for a mixture of categorical and numerical data: the k-prototypes method.

##### k-Modes Algorithm

k-Modes deals with categorical attributes.

1. Insert the first k objects into k new clusters.
2. Calculate the initial k modes for the k clusters.
3. Repeat:
   - For each object O:
     - Calculate the similarity between object O and the modes of all clusters.
     - Insert object O into the cluster C whose mode is the least dissimilar to object O.
   - Recalculate the cluster modes so that the cluster similarity between mode and objects is maximized.
4. Until no (or few) objects change clusters.

##### Fuzzy k-Modes

The fuzzy k-modes algorithm contains extensions to the fuzzy k-means algorithm for clustering categorical data.

##### Bunch

- Bunch is a clustering tool intended to aid the software developer and maintainer in understanding and maintaining source code [Mancoridis99].
- Input: a Module Dependency Graph (MDG).
- In a Bunch "good partition", highly interdependent modules are grouped in the same subsystems (clusters), while independent modules are assigned to separate subsystems.
- Figure b shows a "good" partitioning of Figure a.
- Finding a good graph partition involves systematically navigating through a very large search space of all possible partitions for that graph.
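The k-Modes loop described above can be sketched as follows. This is a minimal illustration, not Huang's implementation: the simple matching distance and frequency-based mode update are as described, but the toy data and the tie-breaking (first cluster wins) are assumptions made for the example.

```python
from collections import Counter

def matching_distance(a, b):
    # Simple matching: number of attribute positions where the two objects differ.
    return sum(x != y for x, y in zip(a, b))

def mode_of(cluster):
    # Frequency-based mode: the most frequent value in each attribute position.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(objects, k, max_iter=100):
    # Insert the first k objects into k new clusters; they serve as initial modes.
    modes = [objects[i] for i in range(k)]
    assignment = [None] * len(objects)
    for _ in range(max_iter):
        changed = False
        for idx, obj in enumerate(objects):
            # Assign each object to the cluster whose mode is least dissimilar.
            best = min(range(k), key=lambda c: matching_distance(obj, modes[c]))
            if assignment[idx] != best:
                assignment[idx] = best
                changed = True
        # Recalculate the cluster modes from the current members.
        for c in range(k):
            members = [objects[i] for i in range(len(objects)) if assignment[i] == c]
            if members:
                modes[c] = mode_of(members)
        if not changed:
            break
    return assignment, modes
```

As with k-means, the result depends on the initial modes; Huang '98 also discusses better initial mode selection than simply taking the first k objects.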
##### Hierarchical Clustering

- Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

(Figure: objects a, b, c, d, e merged step by step into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e} — read left to right for agglomerative clustering (AGNES), right to left for divisive clustering (DIANA).)

##### AGNES (AGglomerative NESting)

- Introduced in Kaufmann and Rousseeuw (1990).
- Merge the nodes that have the least dissimilarity.
- Continue in a non-descending fashion.
- Eventually all nodes belong to the same cluster.

(Figure: three stages of AGNES on a two-dimensional point set.)

##### A Dendrogram Shows How the Clusters Are Merged Hierarchically

Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
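An AGNES-style agglomerative pass can be sketched as follows: repeatedly merge the two clusters with the least dissimilarity until one cluster remains, recording each merge (the merge log is exactly what a dendrogram draws). The single-link cluster distance and the toy data are illustrative assumptions.

```python
def agnes(points, dist):
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the least dissimilarity
        # (single link: distance between the closest pair of members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))  # record for the dendrogram
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

This naive version recomputes all pairwise distances at every step, which illustrates the O(n²)-and-worse complexity noted below for agglomerative methods; it also shows why a bad early merge can never be undone.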
##### DIANA (DIvisive ANAlysis)

- Introduced in Kaufmann and Rousseeuw (1990).
- The inverse order of AGNES.
- Eventually each node forms a cluster on its own.

(Figure: three stages of DIANA on a two-dimensional point set.)

##### More on Hierarchical Clustering Methods

- Weaknesses of agglomerative clustering methods:
  - They do not scale well: time complexity of at least O(n²), where n is the number of total objects.
  - They can never undo what was done previously.
- Next: integration of hierarchical with distance-based clustering.
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
  - CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.
  - CHAMELEON (1999): hierarchical clustering using dynamic modeling.

##### BIRCH (1996)

- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD '96).
- Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:
  - Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data).
  - Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.

##### Clustering Feature Vector

Clustering feature: CF = (N, LS, SS), where

- N: the number of data points,
- LS: the linear sum Σᵢ₌₁ᴺ Xᵢ,
- SS: the square sum Σᵢ₌₁ᴺ Xᵢ².

Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)).

##### CF Tree

A non-leaf node in this tree contains summaries of the CFs of its children. A CF tree is a multilevel summary of the data that preserves the inherent structure of the data.
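The clustering feature for the five example points above can be computed directly. The sketch below also shows why CFs work well in a tree: they are additive, so two sub-clusters can be merged in O(1) without revisiting the raw points (per-dimension square sums are used here, matching the example).

```python
def clustering_feature(points):
    # CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum.
    dims = len(points[0])
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in range(dims))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(dims))
    return n, ls, ss

def merge_cf(cf1, cf2):
    # CFs are additive, so merging two sub-clusters needs no raw data.
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))
```

From a CF, the centroid (LS/N) and measures such as cluster radius can be derived, which is all that Phase 2 of BIRCH needs.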
(Figure: a CF tree with L = 6. The root and non-leaf nodes hold CF entries CF1, CF2, … with child pointers; leaf nodes hold CF entries and are chained by prev/next pointers.)

##### BIRCH (1996)

- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans.
- Weakness: handles only numeric data, and is sensitive to the order of the data records.

##### CURE (Clustering Using REpresentatives)

- CURE: proposed by Guha, Rastogi & Shim, 1998.
- CURE goes a step beyond BIRCH by not favoring clusters with spherical shape, and is thus able to discover clusters with arbitrary shape. CURE is also more robust with respect to outliers [Guha98].
- Uses multiple representative points to evaluate the distance between clusters; adjusts well to arbitrarily shaped clusters and avoids the single-link effect.

##### Drawbacks of Distance-Based Methods

- They consider only one point as representative of a cluster.
- They are good only for convex-shaped clusters of similar size and density, and when k can be reasonably estimated.

##### CURE: The Algorithm

1. Draw a random sample s.
2. Partition the sample into p partitions of size s/p. Each partition is a partial cluster.
3. Eliminate outliers by random sampling. If a cluster grows too slowly, eliminate it.
4. Cluster the partial clusters.
5. The representative points falling in each new cluster are "shrunk", or moved toward the cluster center, by a user-specified shrinking factor. These objects then represent the shape of the newly formed cluster.

##### Data Partitioning and Clustering

Figure 11 – Clustering a set of objects using CURE. (a) A random sample of objects. (b) Partial clusters; representative points for each cluster are marked with a "+". (c) The partial clusters are further clustered; the representative points are moved toward the cluster center. (d) The final clusters are non-spherical.

##### CURE: Shrinking Representative Points

- Shrink the multiple representative points towards the gravity center by a fraction α.
- Multiple representatives capture the shape of the cluster.

##### Clustering Categorical Data: ROCK

- ROCK: RObust Clustering using linKs, by S. Guha, R. Rastogi, K. Shim (ICDE '99).
- Uses links to measure similarity/proximity.
- Computational complexity (cubic in the worst case): O(n² + n·m_m·m_a + n² log n), where m_m and m_a are the maximum and average numbers of neighbours.

##### ROCK: Algorithm

- Links: the number of common neighbours between two points.
  - Example points: {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}. With neighbours defined by Jaccard similarity ≥ 0.5, link({1,2,3}, {1,2,4}) = 3 (the common neighbours {1,2,5}, {1,3,4}, {2,3,4}).
- Initially, each tuple is assigned to a separate cluster, and then clusters are merged repeatedly according to the closeness between clusters.
- The closeness between clusters is defined as the sum of the number of "links" between all pairs of tuples, where the number of "links" represents the number of common neighbours between two clusters.

##### CHAMELEON

- CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E. H. Han and V. Kumar '99.
- Measures similarity based on a dynamic model: two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.
- A two-phase algorithm:
  1. Use a graph partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters.
  2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters.

##### Overall Framework of CHAMELEON

Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters

##### Eisen's Hierarchical Clustering of Gene Expression Data

- The hierarchical clustering algorithm by Eisen et al. [Eisen98, Eisen99] is commonly used for cancer clustering, and for clustering genomic (gene expression) data sets in general.
- For a set of n genes, compute an upper-diagonal similarity matrix containing similarity scores for all pairs of genes, using a similarity metric.
- Scan the matrix to find the highest value, representing the pair of genes with the most similar interaction patterns.
- The two most similar genes are grouped in a cluster, and the similarity matrix is recomputed using the average properties of all genes in the cluster.
- More genes are progressively added to the initial pairs to form clusters of genes [Eisen98]; the process is repeated until all genes have been grouped into clusters.

##### LIMBO

- LIMBO, introduced in [Andritsos04], is a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering.
  - LIMBO uses the IB framework to define a distance measure for categorical tuples.
  - LIMBO handles large data sets by producing a memory-bounded summary model for the data.
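The pairwise procedure Eisen et al. describe (scan the similarity matrix for the highest score, group that pair, recompute similarities from cluster averages) can be sketched as below. This is only an illustration: the toy "expression profiles" are invented, and negative squared Euclidean distance stands in for the correlation-based similarity metric Eisen et al. actually use.

```python
def eisen_cluster(profiles, similarity, k):
    # Each gene starts as its own cluster; a cluster is summarised by the
    # average of its members' expression profiles.
    clusters = [[p] for p in profiles]

    def centroid(cluster):
        dims = len(cluster[0])
        return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))

    while len(clusters) > k:
        # Scan the pairwise similarity "matrix" for the highest value.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        # Group the two most similar clusters; similarities to the merged
        # cluster are recomputed from its average profile on the next scan.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Running until k = 1 and logging the merges would yield the dendrogram that tools based on [Eisen98] display alongside the expression heat map.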
##### Density-Based Clustering Methods

- Clustering based on density (a local cluster criterion), such as density-connected points.
- Major features:
  - discover clusters of arbitrary shape;
  - handle noise;
  - need user-specified parameters;
  - one scan.

##### Density-Based Clustering: Background

- Two parameters:
  - Eps: the maximum radius of the neighbourhood.
  - MinPts: the minimum number of points in an Eps-neighbourhood of a point.

(Figure: points p and q with MinPts = 5, Eps = 1 cm.)

- Density-reachable: a point p is density-reachable from a point q w.r.t. Eps and MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi.
- Density-connected: a point p is density-connected to a point q w.r.t. Eps and MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

##### DBSCAN: Density-Based Spatial Clustering of Applications with Noise

- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
- Discovers clusters of arbitrary shape in spatial databases with noise.

(Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5.)

##### DBSCAN: The Algorithm

1. Check the Eps-neighborhood of each object in the database.
2. If the Eps-neighborhood of an object o contains at least MinPts objects, a new cluster with o as a core object is created.
3. Iteratively collect directly density-reachable objects from these core objects, which may involve merging a few density-reachable clusters.
4. Terminate the process when no new object can be added to any cluster.
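The DBSCAN steps above can be sketched directly. This minimal version (toy one-dimensional data and a brute-force neighbourhood scan are assumptions for the example) labels noise as -1 and grows each cluster outward through its core objects:

```python
def dbscan(points, eps, min_pts, dist):
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # The Eps-neighbourhood of point i (includes i itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # not a core object (may later become a border point)
            continue
        # i is a core object: start a new cluster and collect everything
        # density-reachable from it.
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point joins the cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            jn = neighbours(j)
            if len(jn) >= min_pts:  # j is also a core object: expand through it
                queue.extend(jn)
        cluster_id += 1
    return labels
```

Note how the result follows the definitions above: clusters grow only through core objects (density-reachability), while border points are absorbed but never expanded.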
##### OPTICS: A Cluster-Ordering Method (1999)

- OPTICS: Ordering Points To Identify the Clustering Structure. Ankerst, Breunig, Kriegel, and Sander (SIGMOD '99).
- Produces a special order of the database with respect to its density-based clustering structure.
- Good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure.

##### OPTICS: An Extension of DBSCAN

- OPTICS was developed to overcome the difficulty of selecting appropriate parameter values for DBSCAN [Ankerst99].
- The OPTICS algorithm finds clusters using the following steps:
  1. Create an ordering of the objects in the database, storing the core-distance and a suitable reachability-distance for each object. Clusters with the highest density will be finished first.
  2. Based on the ordering information produced by OPTICS, use another algorithm to extract clusters.
  3. Extract density-based clusters with respect to any distance e′ that is smaller than the distance e used in generating the order.

Figure 15a – OPTICS. The core-distance of p is the distance e′ between p and the fourth-closest object. The reachability-distance of q1 with respect to p is the core-distance of p (e′ = 3 mm), since this is greater than the distance between p and q1. The reachability-distance of q2 with respect to p is the distance between p and q2, since this is greater than the core-distance of p (e′ = 3 mm). Adopted from [Ankerst99].

(Figure: reachability plot — reachability-distance plotted against the cluster order of the objects.)

##### DENCLUE: Clustering Using Density Functions

- DENsity-based CLUstEring by Hinneburg & Keim (KDD '98).
- Major features:
  - solid mathematical foundation;
  - good for data sets with large amounts of noise;
  - allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets;
  - significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45);
  - but needs a large number of parameters.

##### DENCLUE: Technical Essence

- Influence function: describes the impact of a data point within its neighborhood.
- The overall density of the data space can be calculated as the sum of the influence functions of all data points.
- Clusters can be determined mathematically by identifying density attractors.
- Density attractors are local maxima of the overall density function.

(Figures: a density attractor; center-defined and arbitrary-shaped clusters.)

##### CACTUS: Categorical Clustering

- CACTUS is presented in [Ganti99].
- Distinguishing sets: clusters are uniquely identified by a core set of attribute values that occur in no other cluster.
- A distinguishing number represents the minimum size of the distinguishing sets, i.e., attribute-value sets that occur uniquely within only one cluster.
- While this assumption may hold true for many real-world datasets, it is unnatural and unnecessary for the clustering process.

##### COOLCAT: Categorical Clustering

- COOLCAT is introduced in [Barbara02] as an entropy-based algorithm for categorical clustering.
- COOLCAT starts with a sample of data objects and identifies a set of k initial tuples such that the minimum pairwise distance among them is maximized.
- All remaining tuples of the data set are placed in one of the clusters such that, at each step, the increase in the entropy of the resulting clustering is minimized.

##### CLOPE: Categorical Clustering

- Take a small market-basket database with 5 transactions: {(apple, banana), (apple, banana, cake), (apple, cake, dish), (dish, egg), (dish, egg, fish)}. For simplicity, transaction (apple, banana) is abbreviated to ab, etc.
- For this small database, we want to compare the following two clusterings:
  1. { {ab, abc, acd}, {de, def} }
  2. { {ab, abc}, {acd, de, def} }
- Histograms of the two clusterings (adopted from [Yang2002]):
  - clustering (1): {ab, abc, acd} has H = 2.0, W = 4; {de, def} has H = 1.67, W = 3.
  - clustering (2): {ab, abc} has H = 1.67, W = 3; {acd, de, def} has H = 1.6, W = 5.
- We judge the qualities of these two clusterings by analyzing the heights and widths of the clusters.
- Leaving out the two identical histograms, for cluster {de, def} and cluster {ab, abc}, the other two histograms are of different quality: the histogram for cluster {ab, abc, acd} has H/W = 0.5, but the one for cluster {acd, de, def} has H/W = 0.32.
- Clustering (1) is better, since we prefer more overlap among transactions in the same cluster. Thus, a larger height-to-width ratio of the histogram means better intra-cluster similarity.

##### Grid-Based Clustering Methods

- Use a multi-resolution grid data structure.
- Several interesting methods:
  - STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997);
  - WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB '98), a multi-resolution clustering approach using the wavelet method;
  - CLIQUE: Agrawal, et al.
(SIGMOD'98)
STING: A Statistical Information Grid Approach • Wang, Yang and Muntz (VLDB'97) • The spatial area is divided into rectangular cells • There are several levels of cells corresponding to different levels of resolution – Each cell at a high level is partitioned into a number of smaller cells in the next lower level – Statistical info of each cell is calculated and stored beforehand and is used to answer queries – Parameters of higher level cells can be easily calculated from parameters of lower level cells • count, mean, s, min, max • type of distribution—normal, uniform, etc. – Use a top-down approach to answer spatial data queries – Start from a pre-selected layer—typically with a small number of cells – When finished examining the current layer, proceed to the next lower level – Repeat this process until the bottom layer is reached – Advantages: • O(K), where K is the number of grid cells at the lowest level – Disadvantages: • All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
WaveCluster (1998) • Sheikholeslami, Chatterjee, and Zhang (VLDB'98) • A multi-resolution clustering approach which applies a wavelet transform to the feature space – A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands. • Input parameters: – the wavelet, and the # of applications of the wavelet transform.
WaveCluster (1998) • How to apply a wavelet transform to find clusters – Summarize the data by imposing a multidimensional grid structure onto the data space – These multidimensional spatial data objects are represented in an n-dimensional feature space – Apply the wavelet transform on the feature space to find the dense regions in the feature space – Apply the wavelet transform multiple times, which results in clusters at different scales from fine to coarse
WaveCluster (1998) • Why is the wavelet transformation useful for clustering – Unsupervised clustering: it uses hat-shaped filters to emphasize regions where points cluster, while simultaneously suppressing weaker information at their boundaries – Effective removal and detection of outliers – Multi-resolution – Cost efficiency • Major features: – Complexity O(N) – Detects arbitrary-shaped clusters at different scales – Not sensitive to noise or to input order – Only applicable to low-dimensional data (Figures: quantization; transformation.)
CLIQUE (CLustering In QUEst) • Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98). • CLIQUE can be considered as both density-based and grid-based. – It partitions each dimension into the same number of equal-length intervals – It partitions an m-dimensional data space into non-overlapping rectangular units – A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter – A cluster is a maximal set of connected dense units within a subspace (Figure: dense units in the salary–age and vacation–age subspaces and their intersection.)
Strength and Weakness of CLIQUE • Strengths – It is insensitive to the order of records in input and does not presume some canonical data distribution – It scales linearly with the size of input and has good scalability as the number of dimensions in the data increases • Weakness – The accuracy of the clustering result may be degraded
Outline • What is Cluster Analysis?
• Types of Data in Cluster Analysis • A Categorization of Major Clustering Methods • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Clustering Methods • Supervised Classification
Major Clustering Approaches • Partitioning algorithms: Construct various partitions and then evaluate them by some criterion • Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion • Density-based: based on connectivity and density functions • Grid-based: based on a multiple-level granularity structure • Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the model • Unsupervised vs. Supervised: clustering may or may not be based on prior knowledge of the correct classification.
Model-Based Clustering Methods • Attempt to optimize the fit between the data and some mathematical model • Statistical and AI approach – Conceptual clustering • A form of clustering in machine learning • Produces a classification scheme for a set of unlabeled objects • Finds a characteristic description for each concept (class) – COBWEB (Fisher'87) • A popular and simple method of incremental conceptual learning • Creates a hierarchical clustering in the form of a classification tree • Each node refers to a concept and contains a probabilistic description of that concept
COBWEB Clustering Method (Figure: a classification tree.)
More on COBWEB Clustering • Limitations of COBWEB – The assumption that the attributes are independent of each other is often too strong because correlation may exist – Not suitable for clustering large databases – skewed trees and expensive probability distributions
AutoClass (Cheeseman and Stutz, 1996) • E is the evidence: the attributes of a data item, as given in the data set.
For example, if each data item is a coin, the evidence E might be represented as follows for a coin i: – Ei = {"land tail"}, meaning that in one trial the coin i landed tail. • If there were many attributes, then the evidence E might be represented as follows for a coin i: – Ei = {"land tail", "land tail", "land head"}, meaning that in 3 separate trials the coin i landed tail, tail and head. • H is a hypothesis about the classification of a data item. – For example, H might state that coin i belongs in the class "two-headed coin". – We usually do not know the H for a data set. Thus, AutoClass tests many hypotheses.
π(H | E) = L(E | H) · π(H) / π(E), where π(E) = Σ_H L(E | H) · π(H)
• AutoClass uses a Bayesian method for determining the optimal class H for each object. – A prior distribution is kept for each attribute, symbolizing the prior beliefs of the user about the attribute. – Change the classifications of items in clusters and change the means and variances of the distributions in each cluster, until the means and variances stabilize. – A normal distribution is assumed for an attribute in a cluster.
Neural Networks and Self-Organizing Maps • Neural network approaches – Represent each cluster as an exemplar, acting as a "prototype" of the cluster – New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure • Competitive learning – Involves a hierarchical architecture of several units (neurons) – Neurons compete in a "winner-takes-all" fashion for the object currently being presented
Self-organizing feature maps (SOMs), a type of neural network • Several units (clusters) compete for the current object • The unit whose weight vector is closest to the current object wins • The winner and its neighbors learn by having their weights adjusted • SOMs are believed to resemble processing that can occur in the brain
Outline • What is Cluster Analysis?
• Types of Data in Cluster Analysis • A Categorization of Major Clustering Methods • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Clustering Methods • Supervised Classification
Major Clustering Approaches • Partitioning algorithms: Construct various partitions and then evaluate them by some criterion • Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion • Density-based: based on connectivity and density functions • Grid-based: based on a multiple-level granularity structure • Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the model • Unsupervised vs. Supervised: clustering may or may not be based on prior knowledge of the correct classification.
Supervised classification • Bases the classification on prior knowledge about the correct classification of the objects.
Support Vector Machines • Support Vector Machines (SVMs) were invented by Vladimir Vapnik for classification based on prior knowledge [Vapnik98, Burges98]. • Create classification functions from a set of labeled training data. – The output is binary: is the input in a category? • SVMs find a hypersurface in the space of possible inputs. – Split the positive examples from the negative examples. – The split is chosen to have the largest distance from the hypersurface to the nearest of the positive and negative examples. • Training an SVM on a large data set can be slow. – Testing data should be near the training data.
CCA-S • Clustering and Classification Algorithm – Supervised (CCA-S) – for detecting intrusions into computer network systems [Ye01]. • CCA-S learns signature patterns of both normal and intrusive activities in the training data. • Then, it classifies the activities in the testing data as normal or intrusive based on the learned signature patterns of normal and intrusive activities.
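As an illustration of classifying new activity against learned per-class patterns, the sketch below uses a simple nearest-centroid classifier: one averaged "signature" vector per labeled class, with new records assigned to the closest signature. This is an illustrative simplification, not the actual CCA-S procedure, and the feature names are hypothetical:

```python
import math

def learn_signatures(records, labels):
    """Average the feature vectors of each class into one signature vector."""
    sums, counts = {}, {}
    for x, y in zip(records, labels):
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def classify(signatures, x):
    """Label a new record with the class of the nearest signature (Euclidean)."""
    return min(signatures, key=lambda y: math.dist(signatures[y], x))

# Hypothetical activity features: (failed logins, CPU time)
train = [(0, 1), (1, 2), (9, 8), (8, 9)]
labels = ["normal", "normal", "intrusive", "intrusive"]
sigs = learn_signatures(train, labels)
print(classify(sigs, (7, 9)))  # → intrusive
```

Real SVMs replace the centroid rule with a maximum-margin hypersurface, but the train-then-classify workflow is the same.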
Classification (supervised) applied to cancer tumor data sets • Class Discovery: dividing tumor samples into groups with similar behavioral properties and molecular characteristics. – Previously unknown tumor subtypes may be identified this way • Class Prediction: determining the correct class for a new tumor sample, given a set of known classes. – Correct class prediction may suggest whether a patient will benefit from treatment, or how the patient will respond to treatment with a certain drug • Examples of cancer tumor classification: – Classification of acute leukemias – Classification of diffuse large B-cell lymphoma tumors – Classification of 60 cancer cell lines derived from a variety of tumors
Chapter 8. Cluster Analysis • What is Cluster Analysis? • Types of Data in Cluster Analysis • A Categorization of Major Clustering Methods • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Clustering Methods • Supervised Classification • Outlier Analysis
What Is Outlier Discovery? • What are outliers? – A set of objects that are considerably dissimilar from the remainder of the data – Example: Sports: Michael Jordan, Wayne Gretzky, ... • Problem – Find the top n outlier points • Applications: – Credit card fraud detection – Telecom fraud detection – Customer segmentation – Medical analysis
Outlier Discovery: Statistical Approaches • Assume a model of the underlying distribution that generates the data set (e.g. normal distribution) • Use discordancy tests depending on – data distribution – distribution parameters (e.g., mean, variance) – number of expected outliers • Drawbacks – Most distribution tests are for a single attribute – In many cases, the data distribution may not be known
Outlier Discovery: Distance-Based Approach • Introduced to counter the main limitations imposed by statistical methods – We need multi-dimensional analysis without knowing the data distribution.
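The statistical (discordancy-test) approach above can be sketched with a simple z-score test under an assumed normal model: flag any value more than a chosen number of sample standard deviations from the mean. The threshold of 2 is an illustrative choice, not prescribed by the slides, and note how a single extreme value inflates the variance — one reason these tests are fragile:

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Flag values whose z-score under an assumed normal model exceeds the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 95]  # 95 is the planted outlier
print(zscore_outliers(data))  # → [95]
```

This also illustrates the single-attribute limitation noted above: the test says nothing about multi-dimensional data, which motivates the distance-based definition that follows.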
• Distance-based outlier: A (p,D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O.
Outlier Discovery: Deviation-Based Approach • Identifies outliers by examining the main characteristics of objects in a group • Objects that "deviate" from this description are considered outliers • Sequential exception technique – simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects • OLAP data cube technique – uses data cubes to identify regions of anomalies in large multidimensional data
Chapter 8. Cluster Analysis • What is Cluster Analysis? • Types of Data in Cluster Analysis • A Categorization of Major Clustering Methods • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Clustering Methods • Outlier Analysis • Summary
Problems and Challenges • Considerable progress has been made in scalable clustering methods – Partitioning: k-means, k-medoids, CLARANS – Hierarchical: BIRCH, CURE, LIMBO – Density-based: DBSCAN, CLIQUE, OPTICS – Grid-based: STING, WaveCluster – Model-based: AutoClass, DENCLUE, COBWEB • Current clustering techniques do not address all the requirements adequately • Constraint-based clustering analysis: Constraints exist in data space (bridges and highways) or in user queries
Constraint-Based Clustering Analysis • Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem
Summary • Cluster analysis groups objects based on their similarity and has wide applications • A measure of similarity can be computed for various types of data • Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods • Outlier detection and analysis are very useful for fraud detection, etc.
and can be performed by statistical, distance-based, or deviation-based approaches • There are still lots of research issues on cluster analysis, such as constraint-based clustering
References (1) • R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98. • M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973. • M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99. • P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996. • M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96. • M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95. • D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987. • D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. In Proc. VLDB'98. • S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98. • A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
References (2) • L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990. • E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98. • G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988. • P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997. • R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94. • E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets.
Proc. 1996 Int. Conf. on Pattern Recognition, 101-105. • G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98. • W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB'97. • T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. SIGMOD'96.
Thank you !!!
Ratio-Scaled Variables • Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(−Bt) • Methods: – treat them like interval-scaled variables – not a good choice! (why?) – apply a logarithmic transformation y_if = log(x_if) – treat them as continuous ordinal data and treat their rank as interval-scaled.
Intrusion detection systems • Clustering is applied to files with logged behavior of system users over time, to separate instances of normal activity from instances of abusive or attack activity. Clustering is primarily used in Intrusion Detection Systems, especially when the log files of system user behavior are too large for a human expert to analyze. • Unsupervised clustering: When unsupervised clustering is applied to Intrusion Detection Systems data on which no prior knowledge exists, the data is first clustered and then different clusters are labeled as either 'normal' or 'intrusions' based on the size of the cluster. Given a new object d, classification proceeds by finding a cluster C which is closest to d and classifying d according to the label of C as either normal or anomalous [Li04, Portnoy01]. IDSs based on unsupervised clustering focus on activity data that may contain features of an individual TCP connection, such as its duration, protocol type, number of bytes transferred and a flag indicating the connection's normal or error status.
Other features may include the number of file creation operations, number of failed login attempts, whether root shell was obtained, the number of connections to the same host as the current connection within the past two seconds, the number of connections to the same service as the current connection within the past two seconds, and others.
Intrusion detection systems • Supervised clustering: Supervised clustering is applied to Intrusion Detection Systems for signature recognition purposes. This means that signatures representing patterns of previous intrusion activities are learned by the system. Then, new patterns of activity are compared to the previously learned signatures to determine if the activities are normal or if they may be intrusions [Ye01]. • IDSs based on signature recognition focus on two main types of activity data: network traffic data and computer audit data. Some categorical activity attributes that can be obtained from this data include the user id, event type, process id, command, and remote IP address. Some numerical activity attributes that can be obtained from this data include the time stamp, CPU time, etc.
Squeezer • Squeezer is introduced in [He2002] as a one-pass algorithm. Squeezer reads tuples from the data set one by one. When the first tuple arrives, it forms a cluster alone. Each subsequent tuple is either put into an existing cluster or, if rejected by all existing clusters under the given similarity function, forms a new cluster. Squeezer clustering consists of the following steps: 1) Initialize the set of clusters, S, to the empty set. 2) Obtain an object d from the data set.
If S is empty, then create a cluster with d and add it to S. Otherwise, find the cluster in S that is closest to this object. In other words, find the closest cluster C to d in S. 3) If the distance between d and C is less than or equal to a user-specified threshold W, then associate d with the cluster C. Else, create a new cluster for d in S. 4) Repeat steps 2 and 3 until no objects are left in the data set.
Farthest First Traversal k-center Algorithm • The farthest-first traversal k-center algorithm (FFT) is a fast, greedy algorithm that minimizes the maximum cluster radius [Hochbaum85]. In FFT, k points are first selected as cluster centers. The remaining points are added to the cluster whose center is the closest. The first center is chosen randomly. The second center is greedily chosen as the point farthest from the first. Each remaining center is determined by greedily choosing the point farthest from the set of already chosen centers, where the farthest point, x, from a set, D, is defined as max_x { min_{j∈D} d(x, j) }.
Farthest_first_traversal(D: data set, k: integer) {
    randomly select first center;
    // select centers
    for (i = 2, …, k) {
        for (each remaining point) {
            calculate distance to the current center set;
        }
        select the point with maximum distance as new center;
    }
    // assign remaining points
    for (each remaining point) {
        calculate the distance to each cluster center;
        put it in the cluster with minimum distance;
    }
}
Clustering Feature Vector • BIRCH's operation is based on clustering features (CF) and clustering feature trees. These are used to summarize the information in a cluster. A CF is a triplet summarizing information about subclusters of objects. A CF typically holds the following information about a subcluster: the number of objects N in the subcluster, a vector holding the linear sum of the N objects in the subcluster, and a vector holding the square sum of the N objects in the subcluster.
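The CF triplet just described can be sketched as a small class. The component-wise additivity used in `merge` (two subclusters combine by adding their CFs) and the centroid derived from N and the linear sum are standard BIRCH facts; the class itself is only an illustration, not BIRCH's implementation:

```python
class CF:
    """Clustering Feature: (N, linear sum LS, square sum SS) summary of a subcluster."""
    def __init__(self, points):
        dim = len(points[0])
        self.n = len(points)                                        # number of objects N
        self.ls = [sum(p[i] for p in points) for i in range(dim)]   # linear sum vector
        self.ss = [sum(p[i] ** 2 for p in points) for i in range(dim)]  # square sum vector

    def merge(self, other):
        """CF additivity: two subclusters merge by adding their CF components."""
        out = CF.__new__(CF)
        out.n = self.n + other.n
        out.ls = [a + b for a, b in zip(self.ls, other.ls)]
        out.ss = [a + b for a, b in zip(self.ss, other.ss)]
        return out

    def centroid(self):
        """Centroid is derivable from the summary alone: LS / N."""
        return [v / self.n for v in self.ls]

a = CF([(1, 2), (3, 4)])
b = CF([(5, 6)])
m = a.merge(b)
print(m.n, m.ls, m.centroid())  # → 3 [9, 12] [3.0, 4.0]
```

This additivity is what lets BIRCH maintain CF-tree nodes incrementally without revisiting the raw points.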
Thus, a CF is a summary of statistics for the given subcluster.
OPTICS: An Extension from DBSCAN – OPTICS was developed to overcome the difficulty of selecting appropriate parameter values for DBSCAN [Ankerst99]. – The OPTICS algorithm finds clusters using the following steps: 1) Create an ordering of the objects in a database, storing the core-distance and a suitable reachability-distance for each object. Clusters with highest density will be finished first. 2) Based on the ordering information produced by OPTICS, use another algorithm to extract clusters. 3) Extract density-based clusters with respect to any distance e' that is smaller than the distance e used in generating the order. Figure 15a – OPTICS. The core-distance of p is the distance e' between p and the fourth closest object. The reachability-distance of q1 with respect to p is the core-distance of p (e'=3mm) since this is greater than the distance between p and q1. The reachability-distance of q2 with respect to p is the distance between p and q2 since this is greater than the core-distance of p (e'=3mm). Adopted from [Ankerst99].
Gradient: The steepness of a slope • Example (Gaussian influence function, overall density, and its gradient):
f_Gaussian(x, y) = e^(−d(x, y)² / (2σ²))
f_Gaussian^D(x) = Σ_{i=1..N} e^(−d(x, x_i)² / (2σ²))
∇f_Gaussian^D(x, x_i) = Σ_{i=1..N} (x_i − x) · e^(−d(x, x_i)² / (2σ²))
Mutual Information Clustering • Mutual information refers to a measure of correlation between the information content of two information elements [Michaels98]. • For example, when clustering gene expression data, the formula M(A,B) = H(A) + H(B) - H(A,B) represents the mutual information M that is shared by two temporal gene expression patterns A and B. • H refers to the Shannon entropy. H(A) is a measure of the number of expression levels exhibited by the gene A expression pattern, over a given time period. • The higher the value of H, the more expression levels gene A exhibits over a temporal expression pattern, and thus the more information the expression pattern contains.
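The entropies and the mutual information M(A,B) = H(A) + H(B) - H(A,B) can be sketched directly, assuming each expression pattern has already been discretized into a sequence of expression-level symbols (the discretization step itself is outside the formula):

```python
import math
from collections import Counter

def H(*patterns):
    """Shannon entropy of the (joint) symbol sequence of one or more patterns."""
    joint = list(zip(*patterns))
    total = len(joint)
    probs = [c / total for c in Counter(joint).values()]
    return -sum(p * math.log2(p) for p in probs) + 0.0  # +0.0 normalizes IEEE -0.0

def mutual_information(a, b):
    """M(A, B) = H(A) + H(B) - H(A, B)."""
    return H(a) + H(b) - H(a, b)

# Hypothetical discretized expression patterns over 8 time points
flat = ["low"] * 8            # constant pattern
a = ["low", "high"] * 4
b = ["high", "low"] * 4       # mirrors a, so it shares all of a's information
print(H(flat))                   # → 0.0
print(mutual_information(a, b))  # → 1.0
```

The flat pattern has entropy zero, matching the observation below that a constant expression pattern carries no information.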
An H of zero means that a gene expression pattern over a time course was completely flat, or constant, and thus the expression pattern carries no information. H(A,B) is a measure of the number of expression levels observed in either gene expression pattern A or B. Thus, the mutual information M(A,B) = H(A) + H(B) - H(A,B) is defined as the number of expression levels that are observed in both gene expression patterns A and B, over a given time period [26].
CACTUS Categorical Clustering • CACTUS is presented in [Ganti99], introducing a novel formalization of a cluster for categorical attributes by generalizing a definition of a cluster for numerical data. • CACTUS is a summary-based algorithm that utilizes summary information of the dataset. It characterizes the detected categorical clusters by building summaries. • The authors assume the existence of a distinguishing number that represents the minimum size of the distinguishing sets – i.e. attribute value sets that occur within only one cluster. • The distinguishing sets in CACTUS rely on the assumption that clusters are uniquely identified by a core set of attribute values that occur in no other cluster. While this assumption may hold true for many real-world datasets, it is unnatural and unnecessary for the clustering process. • The distinguishing sets are then extended to cluster projections.
COOLCAT Categorical Clustering • COOLCAT is introduced in [Barbara02] as an entropy-based algorithm for categorical clustering. The COOLCAT algorithm is based on the idea of entropy reduction within the generated clusters. • It first bootstraps itself using a sample of maximally dissimilar points from the dataset to create initial clusters. • COOLCAT starts with a sample of data objects and identifies a set of k initial tuples such that the minimum pairwise distance among them is maximized. • The remaining points are then added incrementally. Clusters are created by "cooling" them down, i.e. reducing their entropy.
• All remaining tuples of the data set are placed in one of the clusters such that, at each step, the increase in the entropy of the resulting clustering is minimized.
STIRR Categorical Clustering • STIRR produces a dynamical system from a table of categorical data. • STIRR begins with a table of relational data, consisting of k fields (or columns), each of which can assume one of many possible values. Figure 15 shows an example of the data representation. • Each possible value in each field is represented by a node, and the data is represented as a set of tuples – each tuple consists of nodes, with one node for each field. • A configuration is an assignment of a weight w to each node. A normalization function N() is needed to rescale the weights of the nodes associated with each field so that their squares add up to 1. Figure 15 - Representation of a collection of tuples. Adopted from [Gibson98]. • A dynamical system is the repeated application of a function f on some set of values. A fixed point of a dynamical system is a point u for which f(u) = u. So it is a point which remains the same under the repeated application of f. • The dynamical system for a set of tuples is based on a function f, which maps a current configuration of weights to a new configuration. We define f by choosing a combiner function and using it to update each weight w_v.
CLICK Categorical Clustering • CLICK finds clusters in categorical datasets based on a search method for k-partite maximal cliques [Peters04]. • Figure 16 – The CLICK algorithm. Adopted from [Peters04]. • The basic CLICK approach consists of the three principal stages, shown in Figure 17, as follows: Pre-Processing: In this step, the k-partite graph is created from the input database D, and the attributes are ranked for efficiency reasons. Clique Detection: Given Γ(D), all the maximal k-partite cliques in the graph are enumerated.
Post-Processing: The support of the candidate cliques within the original dataset is verified to form the final clusters. Moreover, the final clusters are optionally merged to partially relax the strict cluster conditions.
CLOPE Categorical Clustering • CLOPE proposes a novel global criterion function that tries to increase the intra-cluster overlapping of transaction items by increasing the height-to-width ratio of the cluster histogram. • Let's take a small market basket database with 5 transactions {(apple, banana), (apple, banana, cake), (apple, cake, dish), (dish, egg), (dish, egg, fish)}. For simplicity, transaction (apple, banana) is abbreviated to ab, etc. • For this small database, we want to compare the following two clusterings: (1) { {ab, abc, acd}, {de, def} } and (2) { {ab, abc}, {acd, de, def} }. • For each cluster, we count the occurrence of every distinct item, and then obtain the height (H) and width (W) of the cluster. For example, cluster {ab, abc, acd} has the occurrences a:3, b:2, c:2, and d:1, with H=2.0 and W=4. H is computed by summing the numbers of occurrences and dividing by the number of distinct items. • Figure 17 shows the values of these results as histograms, with items sorted in reverse order of their occurrences, only for the sake of easier visual interpretation. – clustering (1): {ab, abc, acd} with H=2.0, W=4 and {de, def} with H=1.67, W=3 – clustering (2): {ab, abc} with H=1.67, W=3 and {acd, de, def} with H=1.6, W=5 Figure 17 - Histograms of the two clusterings. Adopted from [Yang2002]. • We judge the qualities of these two clusterings by analyzing the heights and widths of the clusters. Leaving out the two identical histograms for cluster {de, def} and cluster {ab, abc}, the other two histograms are of different quality. The histogram for cluster {ab, abc, acd} has H/W=0.5, but the one for cluster {acd, de, def} has H/W=0.32. Clearly, clustering (1) is better since we prefer more overlapping among transactions in the same cluster.
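The H and W computations in the example above can be sketched as follows, where S is the total number of item occurrences in a cluster, W the number of distinct items, and H = S/W:

```python
from collections import Counter

def histogram_stats(cluster):
    """Height H and width W of a CLOPE cluster histogram of transactions."""
    occ = Counter(item for transaction in cluster for item in transaction)
    s = sum(occ.values())   # total item occurrences in the cluster
    w = len(occ)            # number of distinct items (histogram width)
    return s / w, w         # H = S / W

# The four clusters from the example; each transaction is a string of item letters.
for cluster in [("ab", "abc", "acd"), ("de", "def"),
                ("ab", "abc"), ("acd", "de", "def")]:
    h, w = histogram_stats(cluster)
    print(f"{cluster}: H={h:.2f}, W={w}, H/W={h / w:.2f}")
```

Running this reproduces the slide's figures, e.g. H=2.0, W=4 (H/W=0.5) for {ab, abc, acd} and H=1.6, W=5 (H/W=0.32) for {acd, de, def}.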
From the above example, we can see that a larger height-to-width ratio of the histogram means better intra-cluster similarity. This intuition is the basis of CLOPE and defines the global criterion function using the geometric properties of the cluster histograms.
WaveCluster (1998) • WaveCluster uses wavelet transformations to cluster data. It uses a wavelet transformation to transform the original feature space, finding dense regions in the transformed space. • A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands. The wavelet model can be applied to a data set with n dimensions by applying a one-dimensional wavelet n times. • Each time that a wavelet transform is applied, data are transformed so as to preserve the relative distance between objects at different levels of resolution, although the absolute distance increases. This allows the natural clusters in the data to become more distinguishable. • Clusters are then identified by searching for dense regions in the new domain. Wavelets emphasize regions where the objects cluster, but suppress less dense regions outside of clusters. The clusters in the data stand out because they are more dense and "clear" the regions around them [Sheikholeslami98].
CLIQUE: The Major Steps – CLIQUE finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces • Partition the data space and find the number of points that lie inside each cell of the partition. • Identify clusters: – Determine dense units in all subspaces of interest. – Determine connected dense units in all subspaces of interest. • Generate a minimal description for the clusters: – Determine maximal regions that cover a cluster of connected dense units for each cluster – Determine the minimal cover for each cluster
ACDC • ACDC works differently from the algorithms mentioned above [Tzerpos00]. ACDC performs the task of clustering in two stages.
In the first stage, it creates a skeleton of the final decomposition by identifying subsystems using a pattern-driven approach. There are many patterns that have been used in ACDC. Figure 24 shows some of these patterns. • Depending on the pattern used, the subsystems are given appropriate names. In the second stage, ACDC completes the decomposition by using an extended version of a technique known as Orphan Adoption. Orphan Adoption is an incremental clustering technique based on the assumption that the existing structure is well established. It attempts to place each newly introduced resource (called an orphan) in the subsystem that seems "more appropriate". This is usually a subsystem that has a larger amount of connectivity to the orphan than any other subsystem [Tzerpos00].