RabbanykASONAM2012 - Department of Computing Science

Download Report

Transcript RabbanykASONAM2012 - Department of Computing Science

Relative Validity Criteria for
Community Mining Evaluation
Reihaneh Rabbany, Mansoreh Takaffoli, Justin Fagnan, Osmar R. Zaϊane and Ricardo J. G. B. Campello
Department of Computing Science,
University of Alberta,
Edmonton, Canada
ASONAM 2012
Aug 2012
Motivation
Applications in different domains; sociology, criminology
Module identification in Biological Networks
•
Clusters in Protein-Protein Interaction Networks
Protein complexes and parts of pathways; Clusters in a
protein similarity network
protein families. (R Guimerà et al., Functional cartography of complex
metabolic networks, Nature 433, 2005)
Prerequisite of further analysis; Targeted advertising, link prediction,
recommendation
Social Networks: personalized news feed, easier privacy settings
•
Gmail's "Don't Forget Bob!" and "Got the Wrong Bob?" features (M Roth et al., Suggesting Friends Using
the Implicit Social Graph, KDD 2010)
• Citation network of scholars
Paper and collaborator recommendation, Network visualization and Navigation; e.g. CiteULike, Arnet Miner
and Microsoft Academic
• Hyperlinks between web pages - WWW
Detecting Group of closely related topics to refined search results (J Chen et al., An Unsupervised Approach
to Cluster Web Search Results Based on Word Sense Communities. Web Intelligence 2008)
1
Community
Loosely defined as groups of nodes that have relatively more links
between themselves than to the rest of the network
Nodes that have structural similarity (SCAN, Xu et al. 2007)
Nodes that are connected with cliques (CFinder by Palla et al. 2005)
Nodes that a random walk is likely to trap within them (MCL by Dongen,
•
•
•
•
•
•
•
Walktrap by Pons and Latapy)
Nodes that follow the same leader (TopLeaders, 2010)
Nodes that make the graph compress efficiently (Infomap, Infomod, Rosvall and
Bergstrom, 2011)
Nodes that are separated from the rest by min cut, conductance (flow
based methods, e.g. Kernighan-Lin (KL), betweenness of Newman)
Nodes that number of links between them is more than chance (Newman's
Q modularity, FastModularity, Blondel et al.’s Louvain)
2
Evaluation; overlooked
Internal Evaluation
Predefined quality/structure for the communities
•
Graph partitioning
measures
(density,
conductance)
The community
structure
is not
known beforehand
External Evaluation
Agreement between the results and a given known ground-truth
A clustering similarity/agreement
indexes;
Rand Index, Jaccard
No ground
truth
Benchmarks with
ground
truth;
LFR(2008)
No
large data
setGN(2002),
with known
ground truth
•
•
The synthetic benchmarks disagree with some real network characteristics
Karate
GN
LFR
3
Relative Validity Criteria
Validity criteria defined for clustering evaluation; compares different
clusterings of a same data set
We altered criteria
Generalized distance; graph distance measures
Generalized mean/centroid notion; averaging v.s. medoid
•
•
e.g. Variance Ratio Criterion (VRC)
Same for: Dunn index, Silhouette Width Criterion (SWC), Alternative Silhouette, PBM, C-Index, ZStatistics, Point-Biserial (PB)
Distance Alternatives: Edge Path (ED), Shortest Path Distance (SPD), Adjacency Relation Distance (ARD),
Neighbour Overlap Distance (NOD), Pearson Correlation Distance (PCD), ICloseness Distance (ICD)
4
Correlation with External Index
Correlation of relative criteria and external scores on different clusterings
of same data set
random clusterings that range from very close to very far from ground truth
For karate;
5
Correlation with External Index
Correlation of relative criteria and external scores on different clusterings
of same data set
random clusterings that range from very close to very far from ground truth
For karate;
5
Ranking of Criteria on Real World Benchmarks
Data set statistics
Difficulty Analysis
Overall Ranking
6
Ranking of Criteria on Synthetic Benchmarks
Ranking for well separated communities
Data set statistics
Overall ranking for very mixed communities
7
Ranking varies
Criteria Ranking is affected by:
Choice of benchmarks, synthetic generator and its parameters
Choice of External agreement Index; ARI, NMI, AMI, Jacard
Choice of correlation measure; Pearson & Spearman correlation
Choice of clustering randomization
•
•
•
•
Get the ranking in your setting
www.cs.ualberta.ca/~rabbanyk/criteriaComparison
8
Future Works
Evaluation Issues
Community mining specific agreement
measure
Realistic synthetic benchmarks
Extensions of criteria
Incorporating attributes; combine
clustering and community mining for
cases for which we have both attributes
and relations
Incorporating uncertainty and edges with
probability
•
•
•
•
•
...
9
End
Questions?
10
Alternative Distances
•
Edge Path (ED),
•
Shortest Path Distance (SPD),
•
Adjacency Relation Distance (ARD),
•
Neighbour Overlap Distance (NOD),
•
Pearson Correlation Distance (PCD),
•
ICloseness Distance (ICD)
A
Relative criteria
•
•
•
•
•
•
•
•
Variance Ratio Criterion (VRC)
Dunn index,
Silhouette Width Criterion (SWC),
Alternative Silhouette,
PBM,
Davies-Bouldin
C-Index,
Point-Biserial (PB)
B