Project Presentation


CLUSTERING FOR TAXONOMY EVOLUTION
By
- Anindya Das
- Sneha Bankar
PROBLEM STATEMENT


Problem
- Products are often placed in the wrong category because the correct category does not exist
- This could be an indication of taxonomy evolution
Solution
- Clustering products based on product descriptions
TAXONOMY EVOLUTION
Camera & Photo
Lenses
Flashes
Digital Cameras
Compact System Camera / Point & Shoot Cameras / Digital SLR Cameras
Digital SLR Cameras
TAXONOMY EVOLUTION
Camera & Photo
Lenses
Flashes
Digital Cameras
Compact System Camera
Digital SLR Camera
Point & Shoot Cameras
FEATURE EXTRACTION

Use product description as features

Brand Removal

Stemming

Use of unigrams and bigrams

Feature weighting based on term frequency

Feature weighting based on TF-IDF
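
The feature extraction steps above could be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation: the function names, the brand set, and the TF-IDF variant (raw term count times log(N / document frequency)) are all assumptions, and stemming is left out because it would normally come from an external stemmer.

```python
import math
import re
from collections import Counter


def tokenize(text, brands=frozenset()):
    """Lowercase, split on non-alphanumerics, and drop brand names (assumed brand set)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in brands]


def ngrams(tokens):
    """Return unigrams plus bigrams (adjacent tokens joined by a space)."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(docs, brands=frozenset()):
    """Return one {feature: weight} dict per product description."""
    counts = [Counter(ngrams(tokenize(d, brands))) for d in docs]
    # Document frequency: in how many descriptions each feature occurs.
    df = Counter(f for c in counts for f in c)
    n = len(docs)
    return [{f: tf * math.log(n / df[f]) for f, tf in c.items()} for c in counts]
```

Features that occur in every description get weight 0 under this variant, which is the usual motivation for TF-IDF over plain term frequency.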
HIERARCHICAL AGGLOMERATIVE CLUSTERING
Initially, each item is considered a cluster.
The closest pair of clusters is chosen.
Those two clusters are merged.
Each iteration reduces the number of clusters by one.
Merging continues until a terminating condition is satisfied:
- Number of clusters
- Inter-cluster distance
UPGMA is used for measuring the distance between clusters.
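
The merge loop described above can be sketched as below. This is a naive illustration under stated assumptions: `hac`, `upgma`, and the parameters `k` and `max_dist` are hypothetical names for the two terminating conditions, and the quadratic search over cluster pairs is chosen for readability rather than speed.

```python
from itertools import combinations


def upgma(c1, c2, dist):
    """UPGMA linkage: average pairwise distance between members of two clusters."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def hac(items, dist, k=1, max_dist=float("inf")):
    """Merge the closest pair of clusters until k clusters remain
    or the closest pair is farther apart than max_dist."""
    clusters = [[x] for x in items]  # each item starts as its own cluster
    while len(clusters) > max(k, 1):
        (i, j), d = min(
            (((i, j), upgma(clusters[i], clusters[j], dist))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if d > max_dist:
            break
        clusters[i] += clusters[j]  # merge j into i (i < j, so index i stays valid)
        del clusters[j]
    return clusters
```

For example, `hac([1.0, 1.1, 5.0, 5.2], lambda a, b: abs(a - b), k=2)` groups the two low values together and the two high values together.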
DISTANCE MEASURES

Edit Distance: minimum number of operations to convert A into B

Euclidean Distance: sum of squares of the weights of all disjoint features

Jaccard Distance: 1 − n(A ∩ B) / n(A ∪ B)

Hamming Distance: n(A − (A ∩ B)) + n(B − (A ∩ B))
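
A small sketch of these measures, treating A and B as feature sets (or, for the Euclidean case, as feature-to-weight dicts). The function names are illustrative. The Euclidean variant follows the definition on the slide, which sums squared weights over disjoint features only; the textbook Euclidean distance would also include shared features and take a square root.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def jaccard_distance(a, b):
    """1 - n(A ∩ B) / n(A ∪ B); defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def hamming_distance(a, b):
    """n(A - (A ∩ B)) + n(B - (A ∩ B)): features present in exactly one set."""
    a, b = set(a), set(b)
    return len(a - b) + len(b - a)


def euclidean_disjoint(wa, wb):
    """Sum of squared weights of features that appear in only one of the two dicts."""
    disjoint = set(wa) ^ set(wb)
    return sum((wa.get(f, 0.0) + wb.get(f, 0.0)) ** 2 for f in disjoint)
```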
K-MEANS




Select K initial centroids
Assign data points (ASIN feature vectors) to the nearest centroids based on distance
Recompute each centroid as the mean of its assigned points
Re-assign and update the centroids for as long as data points can still be reassigned
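
The loop above can be sketched as follows, assuming dense numeric vectors and squared Euclidean distance. The random initialisation, the `seed` and `max_iter` parameters, and the choice to leave an empty cluster's centroid unchanged are assumptions made for this illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) after iterating until no point changes cluster."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # K initial centroids
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # nothing was reassigned, so stop
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```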
EXECUTION PIPELINE
Data Preprocessor
Feature Extraction Engine
Clustering Engine
Cluster Evaluation Engine
CLUSTER EVALUATION



How many items in a cluster are talking about the most frequent features of that cluster?
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
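
One possible reading of this evaluation is sketched below. The slides do not spell out how true positives, false positives, and false negatives are counted, so the definitions here are assumptions: an item "talks about" a cluster if it contains at least one of the cluster's top-n features; items in the cluster that do are true positives, items in the cluster that do not are false positives, and items outside the cluster that do are false negatives.

```python
from collections import Counter


def top_features(cluster, n=3):
    """The n features that occur in the most items of the cluster."""
    counts = Counter(f for item in cluster for f in set(item))
    return {f for f, _ in counts.most_common(n)}


def precision_recall(cluster, others, n=3):
    """cluster: feature sets of items in the cluster; others: items outside it."""
    top = top_features(cluster, n)
    tp = sum(1 for item in cluster if top & set(item))  # in cluster, matches
    fp = len(cluster) - tp                              # in cluster, no match
    fn = sum(1 for item in others if top & set(item))   # outside, matches
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```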
RESULTS


Precision Values
             HAC    K-Means
Dataset 1    95%    92%
Dataset 2    92%    96%
Dataset 3    93%    90%
Recall values for all cases lie between 20% and 30%
FUTURE WORK



Mining topics from product descriptions and using them as features
An approach to detect outliers and merge them to form a new category
Use of association rule mining for evaluation instead of the most frequent words
REFERENCES



http://en.wikipedia.org/wiki/Hierarchical_clustering
http://en.wikipedia.org/wiki/K-means_clustering
Liu, Tao. "An Evaluation on Feature Selection for Text Clustering." N.p., 2003. Web.