Detecting Communities Via Simultaneous Clustering of Graphs and

Download Report

Transcript Detecting Communities Via Simultaneous Clustering of Graphs and

Detecting Communities Via
Simultaneous Clustering of
Graphs and Folksonomies
Akshay Java
Anupam Joshi
Tim Finin
University of Maryland, Baltimore County
KDD 2008
Workshop on Web Mining and Web Usage Analysis
Outline
• Introduction
• Community Detection
– Clustering Approach
– Spectral Approach
– Co-Clustering
•
•
•
•
Simultaneous Clustering
Evaluation
Future Work
Conclusions
Outline
• Introduction
• Community Detection
– Clustering Approach
– Spectral Approach
– Co-Clustering
•
•
•
•
Simultaneous Clustering
Evaluation
Future Work
Conclusions
Social Media
Describes the online technologies
and practices that people use to
share opinions, insights,
experiences, and perspectives
and engage with each other.
~Wikipedia
Social Media Graphs
G = (V,E) describing the relationships
between different entities (People,
Documents, etc.)
1
2
3
1
1
4
Tags
2
2
3
Users
4
URLs
G’ = <V,T,R> a tri-partite graph that
expresses how entities ‘Tag’ some
resource
Political Blogs
Twitter Network
Facebook Network
What is a Community
A community in the real world is
identified in a graph as a set of
nodes that have more links within
the set than outside it.
Outline
• Introduction
• Community Detection
– Clustering Approach
– Spectral Approach
– Co-Clustering
•
•
•
•
Simultaneous Clustering
Evaluation
Future Work
Conclusions
Community Detection
Clustering Approach
Clustering Approach
1. Agglomerative/Hierarchical
Topological Overlap: Similarity is measured in terms of number of nodes that both i and
j link to. (Razvasz et al.)
Community Detection
Clustering Approach
Clustering Approach
1. Agglomerative/Hierarchical
2. Divisive/Partition based
Remove edges that have highest edge betweenness centrality
(Girvan-Newman Algorithm)
Political Books
Community Detection
Spectral Approach
Graph Laplacian
L  DW  I D

1
2
*W * D

1
2
• The graph can be partitioned
using the eigenspectrum of the

Laplacian. (Shi and Malik)
• The second smallest
Normalized Cuts
 1
1 
eigenvector of the graph
NCut(A,B)  Cut(A,B)


Vol(A) Vol(B)
Laplacian is the Fiedler vector.
• The graph can be recursively
Cost of edges deleted
to disconnect the graph
partitioned using the sign of the
Total cost of all edges
values in its Fielder vector.
that start from B
Community Detection
Co-Clustering
• Spectral graph bipartitioning
• Compute graph laplacian using
Where
A  nm is the document by term matrix
(Dhillon et al.)

Outline
• Introduction
• Community Detection
– Clustering Approach
– Spectral Approach
– Co-Clustering
•
•
•
•
Simultaneous Clustering
Evaluation
Future Work
Conclusions
Social Media Graphs
Links Between Nodes
Links Between Nodes and Tags
Simultaneous Cuts
Communities in
Social Media
A community in the real world is
identified in a graph as a set of
nodes that have more links within
the set than outside it and share
similar tags.
Clustering Tags and Graphs

 I
W   T
C
'
C 

W 
1
1
1
1 0 0 0
1 1 0 0
0 1 1 1
0 0 1 1
Tags

β= 0 is like co-clustering,
β= 1 Equal importance to blog-blog and blog-tag,
β>> 1 NCut
1
1 1 0 0

1 1 0 0
0 1 1 0

0 0 1 1
0 0 1 1

1 1 1 0
1 1 0 0

0 0 1 1


0 0 1 1
Nodes
Nodes
Tags
1







1
1

0


0
Tags
Nodes
Tags
Nodes
Fiedler Vector Polarity 
1
1
1
1
1
1
1
1
1
Clustering Tags and Graphs
Clustering Only Links
Clustering Links + Tags
 I
W   T
C
'
C 

W 
β= 0 is like co-clustering,
β= 1 Equal importance to blog-blog and blog-tag,
β>> 1 NCut
Clustering Tags and Graphs
Clustering Only Links
Clustering Links + Tags
Outline
• Introduction
• Community Detection
– Clustering Approach
– Spectral Approach
– Co-Clustering
•
•
•
•
Simultaneous Clustering
Evaluation
Future Work
Conclusions
Datasets
• Citeseer
– Agents, AI, DB, HCI, IR, ML
– Words used in place of tags
• Blog data
– derived from the WWE/Buzzmetrics dataset
– Tags associated with Blogs derived from del.icio.us
– For dimensionality reduction 100 topics derived from blog homepages using LDA
(Latent Dirichilet Allocation)
• Pairwise similarity computed
– RBF Kernel for Citeseer
– Cosine for blogs
Citeseer Data
Accuracy = 36%
Accuracy = 62%
Higher accuracy by adding ‘tag’ information
Citeseer Data
NCut
SimCut Results in
• Higher intra-cluster similarity
• Lower inter-cluster similarity
SimCut
Citeseer Data
True
NCut
Constrains cuts based on both
• Link Structure
• Tags
SimCut
Blog Data
NCut
SimCut Results in
• Higher intra-cluster similarity
• Lower inter-cluster similarity
SimCut
Blog Data
NCut
35 Clusters
SimCut
Ncut
Few, Large clusters with low intra-cluster similarity
SimCut
Moderate size clusters higher intra-cluster similarity
Effect of Number of Tags, Clusters
Citeseer
Mutual Information compares clusters to ground truth
More tags help, to an extent
Lower mutual information if only the graph is used
Effect of Number of Tags, Clusters
Blogs
Mutual Information compares clusters to content-based clusters (no tags/graph)
More tags help, to an extent
Lower mutual information if only the graph is used
Outline
• Introduction
• Community Detection
– Clustering Approach
– Spectral Approach
– Co-Clustering
•
•
•
•
Simultaneous Clustering
Evaluation
Future Work
Conclusions
Future Work
• Evaluating SimCut algorithm on derived feature
types like: named entities, sentiments and
opinions, links to main stream media.
• For a dataset with ground truth, a comparison of
graph based, text based and graph+tag based
clustering
• Evaluating effect of varying β
Outline
• Introduction
• Community Detection
– Clustering Approach
– Spectral Approach
– Co-Clustering
•
•
•
•
Simultaneous Clustering
Evaluation
Future Work
Conclusions
Conclusions
• Many Social Media sites allow users to tag
resources
• Incorporating folksonomies in community
detection can yield better results
• SimCut can be easily implemented and relates
to Ncut with two simultaneous objectives
– Minimize number of node-node edges being cut
– Minimize number of node-tag edges being cut
• Detected communities can be associated with
meaningful, descriptive tags
Thanks!
http://ebiquity.umbc.edu
http://socialmedia.typepad.com
More Tags
Only Graph
SimCut
Citeseer (Community Size, Similarity)
Blogs (Community Size, Similarity)