Wine Clustering


Ling Lin
Contents
❏ Motivation
❏ Data
❏ Dimensionality Reduction: MDS, Isomap
❏ Clustering: K-means, Ncut, Ratio Cut, SCC
❏ Conclusion
❏ Reference
Motivation
• Clustering is a central task in exploratory data mining
 Market segmentation and marketing strategies
 Document clustering
 Targeting appropriate treatment to patients with similar response patterns
 Image segmentation
• Apply clustering methods to a real data set
Data
➢ Wine data
Source of the data set: UCI Machine Learning Repository, University of California, Irvine.
Data sample size: 178 observations of 13 measured variables (plus a class label), in 3 classes corresponding to different cultivars.
Variables:
1) Alcohol 2) Malic acid 3) Ash 4) Alcalinity of ash 5) Magnesium 6) Total phenols
7) Flavanoids 8) Nonflavanoid phenols 9) Proanthocyanins 10) Color intensity 11) Hue
12) OD280/OD315 of diluted wines 13) Proline
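
A minimal loading sketch (an assumption, not part of the slides), using the copy of this UCI wine data bundled with scikit-learn:

# Sketch: load the UCI Wine data via scikit-learn's bundled copy
# (178 samples, 13 chemical measurements, 3 cultivar classes).
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target              # X: (178, 13); y: cultivar labels 0/1/2
print(X.shape, len(wine.feature_names))    # 178 observations x 13 variables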
MDS
Can I separate objects better? ---> change the way the distances are computed:
City-block (L1) Distance, Chebychev Distance, Cosine Distance, Mahalanobis Distance
Distances
• Euclidean Distance: the straight-line distance between two points.
d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
• City-block Distance (L1 Distance): the sum of the absolute differences between two points in each coordinate dimension.
d = \sum_{i=1}^{n} |p_i - q_i|
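
A small numeric sketch of both formulas (illustrative values only, not from the slides):

# Euclidean (L2) and city-block (L1) distances, written directly from the formulas above.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

d_euclidean = np.sqrt(np.sum((x - y) ** 2))   # straight-line distance: sqrt(9 + 4 + 0) ≈ 3.61
d_cityblock = np.sum(np.abs(x - y))           # sum of coordinate-wise gaps: 3 + 2 + 0 = 5
print(d_euclidean, d_cityblock)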
Distances
• Chebychev Distance (Chessboard Distance): the greatest difference between two points in any single coordinate dimension.
d = \max(|x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, |x_{i3} - x_{j3}|, \ldots, |x_{in} - x_{jn}|)
• Cosine Distance: the cosine of the angle between two vectors.
d = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}
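
A companion sketch for these two measures; note that the slide writes the cosine measure as cos(θ) itself, whereas common implementations (e.g. SciPy's pdist) report 1 − cos(θ) as the distance:

# Chebychev (chessboard) and cosine measures for the same example vectors.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

d_chebychev = np.max(np.abs(x - y))                                 # largest single-coordinate gap
cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # cos(θ), as on the slide
d_cosine = 1.0 - cos_theta                                          # the usual "cosine distance"
print(d_chebychev, cos_theta, d_cosine)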
Distances
• Mahalanobis Distance: the dissimilarity of two vectors, where S is the covariance matrix.
d = \sqrt{(x - y)^T S^{-1} (x - y)}
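
A sketch of the Mahalanobis formula, with S estimated as the sample covariance of a data matrix (random stand-in data here, since the preprocessing used on the slides is not stated):

# Mahalanobis distance between two observations; S is the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(178, 13))                  # stand-in for the wine measurements
S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance, 13 x 13

diff = X[0] - X[1]
d_mahalanobis = np.sqrt(diff @ S_inv @ diff)
print(d_mahalanobis)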
Illustration: for two points whose coordinate differences form a right triangle with legs a and b (a ≥ b) and hypotenuse c, and with angle θ between the two position vectors:
Euclidean Distance = c; City-block Distance = a + b; Chebychev Distance = max(a, b) = a; Cosine Distance = cos(θ).
MDS in 3D
MDS in 2D
[Figures: 3D and 2D MDS embeddings of the wine data under the different distance measures]
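
A sketch of how these embeddings could be reproduced, assuming scikit-learn's MDS with precomputed dissimilarity matrices (the slides do not name the software used):

# MDS embeddings of the wine data in 3D and 2D under different distance measures.
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_wine
from sklearn.manifold import MDS

X = load_wine().data
for metric in ("cosine", "mahalanobis"):
    D = squareform(pdist(X, metric=metric))          # pairwise dissimilarity matrix
    for dim in (3, 2):
        emb = MDS(n_components=dim, dissimilarity="precomputed",
                  random_state=0).fit_transform(D)
        print(metric, dim, emb.shape)                # scatter-plot emb to reproduce the figures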
Isomap
[Figures: Isomap embeddings of the wine data under the Cosine and Mahalanobis distances]
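
A corresponding Isomap sketch, again with precomputed cosine and Mahalanobis distance matrices (recent scikit-learn versions accept metric="precomputed"; the neighbourhood size is a guess):

# Isomap embeddings of the wine data from precomputed distance matrices.
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_wine
from sklearn.manifold import Isomap

X = load_wine().data
for metric in ("cosine", "mahalanobis"):
    D = squareform(pdist(X, metric=metric))
    emb = Isomap(n_neighbors=10, n_components=2,
                 metric="precomputed").fit_transform(D)
    print(metric, emb.shape)                         # scatter-plot emb for the Isomap figures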
Kmeans Clustering
[Figure: K-means clustering result alongside the true labels; error rate = 0.03]
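
A sketch of a K-means run and an error rate against the true cultivar labels; standardizing the features and the majority-vote label matching are assumptions, so the value may differ slightly from the 0.03 reported on the slide:

# K-means on the standardized wine data, with an error rate against the true labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# match each cluster to its most common true class, then count mismatches
mapped = np.empty_like(labels)
for c in range(3):
    mapped[labels == c] = np.bincount(wine.target[labels == c]).argmax()
print(np.mean(mapped != wine.target))   # error rate, typically a few percent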
Clustering Comparison
[Figures: clustering results of Ratio Cut, SCC, and Normalized Cut]
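
A partial sketch of this comparison: scikit-learn's SpectralClustering gives a Normalized-Cut-style result, while Ratio Cut and SCC have no off-the-shelf scikit-learn implementation and are omitted here; the RBF affinity and its gamma are guesses:

# Normalized-Cut-style clustering via spectral clustering on an RBF affinity graph.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
labels = SpectralClustering(n_clusters=3, affinity="rbf", gamma=0.1,
                            random_state=0).fit_predict(X)

# same majority-vote matching as in the K-means sketch
mapped = np.empty_like(labels)
for c in range(3):
    mapped[labels == c] = np.bincount(wine.target[labels == c]).argmax()
print(np.mean(mapped != wine.target))   # error rate against the true labels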
Conclusion
• Dimensionality Reduction: different methods for calculating distances and reducing dimensions were applied to the wine data.
 3D MDS: the Cosine distance separates the classes well, the Mahalanobis distance does not.
 2D MDS: the Cosine distance separates the classes well, the Mahalanobis distance does not.
 Isomap makes the Mahalanobis distance a better display.
Conclusion
• Clustering:
Kmeans = Rcut → SCC → Ncut
Ncut and Rcut consider both inter- and intra-cluster connections;
however, in this data set the intra-cluster connections are weak.