Local/Global Term Analysis for Discovering Community Differences in Social Networks
David Fuhry, Yiye Ruan, and Srinivasan Parthasarathy
Data Mining Research Laboratory
Dept. of Computer Science and Engineering
The Ohio State University
Communities in Social Networks
Observations:
• Social networks consist of many interacting communities
of users.
• Each community can be characterized by the content
which its members generate.
Motivating questions:
• Given a community, how can we determine what its members
are talking about, relative to the entire social network?
• Given two communities, how can we determine the difference
between them?
Methodology
• A community’s users mention relevant terms frequently.
• Much prior work looks only at #hashtags or the most frequent terms.
• But not all frequent terms are relevant.
• Desiderata:
– Consider all content terms
– Interpretable
– Scalable to million-user social networks
Four-step Process
• Four-step process for determining community
differences:
– Community Discovery
– Term Extraction & Aggregation
– Visualization
– Handling Time Varying Data
1. Community Discovery (I)
• Keyword search based identification
of candidate users
• Extract underlying network of users
• Local community identification
• Graph clustering (e.g. METIS
[KARYPIS’99], Graclus [DHILLON’07],
MLR-MCL [SATULURI’09], Localized
Clustering (L-Spar) [SATULURI’11])
• Modularity [NEWMAN’04]
• Content-Sensitive Viewpoint
Neighborhoods [ASUR’09]
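The first route to community discovery can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes messages are `(user, text)` pairs and the network is a list of undirected edges, and the function names are hypothetical. It shows keyword-based candidate selection, extraction of the induced network, and the modularity score [NEWMAN’04] that could be used to judge a candidate partition. The clustering step itself (METIS, Graclus, MLR-MCL, and so on) is not reproduced here.

```python
import re
from collections import defaultdict

def candidate_users(messages, keywords):
    """Users who posted at least one message containing a keyword."""
    wanted = {k.lower() for k in keywords}
    return {user for user, text in messages
            if wanted & set(re.findall(r"\w+", text.lower()))}

def induced_edges(edges, users):
    """Keep only edges whose two endpoints are both candidate users."""
    return [(u, v) for u, v in edges if u in users and v in users]

def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    if m == 0:
        return 0.0
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in one community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    degree_sum = defaultdict(int)  # total degree per community
    for node, d in degree.items():
        degree_sum[community_of[node]] += d
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)
```

A higher modularity means more edges fall inside communities than a random graph with the same degrees would give.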
1. Community Discovery (II)
• Start with the network of all users
• Extract candidate communities
• Using any community discovery
algorithm
• Filter candidate communities by
keyword strength
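The slides do not define "keyword strength", so the sketch below makes an explicit assumption: a community's strength is the fraction of its members who mention at least one keyword. The function names and the `min_strength` threshold are hypothetical and only illustrate the filtering step.

```python
def keyword_strength(community, users_mentioning):
    """Assumed definition: share of community members who mention a keyword."""
    if not community:
        return 0.0
    return len(community & users_mentioning) / len(community)

def filter_communities(communities, users_mentioning, min_strength=0.5):
    """Keep candidate communities whose keyword strength meets the threshold."""
    return [c for c in communities
            if keyword_strength(c, users_mentioning) >= min_strength]
```

Here `communities` is a list of user sets produced by any community discovery algorithm, and `users_mentioning` is the set of users whose messages match the keywords.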
2. Term Extraction & Aggregation
• Extract terms from each
message and weight them
• Term Frequency
• TF-IDF
• Domain-dependent
semantic importance
• Merge terms
• Combine synonyms
• Handle hypernyms
• Aggregate them by user
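As a rough sketch of this step, the code below extracts terms from each message, merges synonyms through a lookup table, weights terms with TF-IDF, and sums the weights per user. The `SYNONYMS` table, the tokenizer pattern, and the choice to treat each message as one document are illustrative assumptions, not details from the slides. Hypernyms could be handled with a similar mapping from specific terms to a broader one.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical synonym table: variant -> canonical term.
SYNONYMS = {"pics": "photo", "photos": "photo", "pictures": "photo"}

def extract_terms(text):
    """Lowercase, tokenize, and map synonyms to one canonical term."""
    tokens = re.findall(r"[a-z0-9#@']+", text.lower())
    return [SYNONYMS.get(t, t) for t in tokens]

def tf_idf_by_user(messages):
    """Weight terms per message with TF-IDF, then sum the weights per user.

    messages: list of (user, text); each message is treated as one document.
    """
    docs = [(user, Counter(extract_terms(text))) for user, text in messages]
    n_docs = len(docs)
    doc_freq = Counter()
    for _, counts in docs:
        doc_freq.update(counts.keys())  # count each term once per message
    per_user = defaultdict(Counter)
    for user, counts in docs:
        for term, tf in counts.items():
            per_user[user][term] += tf * math.log(n_docs / doc_freq[term])
    return per_user
```

Plain term frequency can be obtained from the same loop by dropping the `math.log(...)` factor.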
3. Visualization
• Plot terms by frequency
across two axes.
• Global (all users) on Y-axis
• Local (community users) on
X-axis.
• Terms on the regression line
are equifrequent in both
groups
• Terms off the regression line
are relatively more frequent
in one group
• Support for multiple scales of
local community identification
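The idea of reading differences off the regression line can be made concrete with residuals. The sketch below assumes each term has a local (community) frequency on the X-axis and a global frequency on the Y-axis, fits an ordinary least-squares line, and reports each term's vertical distance from it. The function names are hypothetical, and the slides do not say which regression or frequency scaling the authors used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(local_freq, global_freq):
    """Vertical distance of each shared term from the fitted line.

    Negative: the term is globally rarer than its local frequency predicts,
    so it is relatively more frequent in the community.
    Positive: relatively more frequent in the network as a whole.
    """
    terms = sorted(set(local_freq) & set(global_freq))
    xs = [local_freq[t] for t in terms]
    ys = [global_freq[t] for t in terms]
    slope, intercept = fit_line(xs, ys)
    return {t: global_freq[t] - (slope * local_freq[t] + intercept)
            for t in terms}
```

Sorting terms by residual gives a ranked list of what distinguishes the community, and the same `(x, y)` pairs can be passed to any plotting library to draw the scatter plot.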
4. Handling Time Varying Data
• Time range divided into batches
• Perform steps 1 to 3 for each batch
• Visualize results
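Dividing the time range into batches can be sketched as follows. This assumes each message carries a numeric timestamp in seconds and that batches have a fixed width. Both are illustrative choices, since the slides do not specify how batch boundaries are set.

```python
from collections import defaultdict

def batch_by_time(messages, start, batch_seconds):
    """Group (timestamp, user, text) messages into fixed-width time batches.

    Returns a dict mapping batch index -> list of (user, text), ordered by index.
    """
    batches = defaultdict(list)
    for timestamp, user, text in messages:
        index = int((timestamp - start) // batch_seconds)
        batches[index].append((user, text))
    return dict(sorted(batches.items()))
```

Steps 1 to 3 would then be run on the messages of each batch in turn, so that changes in a community's distinctive terms can be followed over time.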
Experimental Results
Using a dataset of 1M tweets, we look at groups
discussing Canon, Nikon, and Olympus cameras:
Comparing the Nikon and Olympus communities, the Olympus
community talks more about blogs.
Experimental Results
Compared with the global
community, the camera community
talks less about health, teeth, and
success.
Experimental Results
Using a dataset of 2M tweets about the “Occupy”
movement, we compare “Occupy Oakland” to the entire
“Occupy” movement:
The Occupy Oakland community talks less about NYPD, p2 (a group of
progressives using social media), and tcot (“Top Conservatives On
Twitter”).
Filter and Zoom
Conclusions
• Four-step visual analytic framework for discovering
differences between communities in social networks.
– Simple
– Scalable
• Qualitative and quantitative results.
• Future work
– Temporal
– More quantitative measures
– Automatically determine best scale
Thank You!