Connecting Users across Social Media Sites: A Behavioral


REZA ZAFARANI AND HUAN LIU
DATA MINING AND MACHINE LEARNING LABORATORY (DMML)
ARIZONA STATE UNIVERSITY
KDD 2013 – CHICAGO, ILLINOIS
How hard can it be to identify
an individual across sites?
Privacy Experts Claim Advertisers
Know a lot about People
Can they stop showing you the
same repetitive ads across sites?
More information about
individuals
Many social media sites
Partial Information
Huan Liu
Google+
Age
N/A
Location
USA
Education
Complementary Information
Facebook
USC (1985-89)
Connectivity is not available
Consistency in Information Availability
Better User Profiles
Can we connect individuals
across sites?
Can we verify that the
information provided
across sites belongs to the
same individual?
Human behavior generates information redundancy
Information shared across sites provides a behavioral fingerprint
- Behavioral Modeling
- Minimum Information
MOBIUS
MOdeling Behavior for Identifying Users across Sites
Minimum information
available on ALL sites:
Usernames
Identification Function
Candidate
Username (john.smith)
Prior
Usernames ({jsmith, john.s})
Generates
Captured Via
Behavior 1
Information Redundancy
Feature Set 1
Behavior 2
Information Redundancy
Feature Set 2
Behavior n
Information Redundancy
Feature Set n
Identification
Function
Learning
Framework
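The identification function above can be sketched in code. This is a minimal illustration, not the authors' implementation: the two features (`length_gap`, `shared_char_ratio`), the linear scoring rule, and the weights in the usage note are all hypothetical stand-ins for the feature sets and the learned function described on the slide.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical feature type: maps (candidate, prior usernames) to a number.
Feature = Callable[[str, Sequence[str]], float]


def length_gap(candidate: str, priors: Sequence[str]) -> float:
    """Absolute gap between the candidate length and the mean prior length."""
    mean_len = sum(len(p) for p in priors) / len(priors)
    return abs(len(candidate) - mean_len)


def shared_char_ratio(candidate: str, priors: Sequence[str]) -> float:
    """Jaccard overlap between candidate characters and all prior characters."""
    cand_chars = set(candidate.lower())
    prior_chars = set("".join(priors).lower())
    union = cand_chars | prior_chars
    return len(cand_chars & prior_chars) / len(union) if union else 0.0


# Illustrative feature set; a real system would use many more features.
FEATURES: Dict[str, Feature] = {
    "length_gap": length_gap,
    "shared_char_ratio": shared_char_ratio,
}


def featurize(candidate: str, priors: Sequence[str]) -> List[float]:
    """Turn a (candidate, priors) pair into a feature vector."""
    return [feature(candidate, priors) for feature in FEATURES.values()]


def identify(candidate: str, priors: Sequence[str],
             weights: Sequence[float], bias: float) -> bool:
    """Linear decision rule standing in for the learned identification function."""
    features = featurize(candidate, priors)
    score = bias + sum(w * x for w, x in zip(weights, features))
    return score > 0
```

For example, `identify("john.smith", ["jsmith", "john.s"], weights=[-1.0, 6.0], bias=-1.0)` scores the candidate against the prior usernames; the weights here are made up for illustration and would normally come from the learning framework.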
Data
Human
Limitation
Behaviors
Exogenous
Factors
Endogenous
Factors
Time & Memory
Limitation
Knowledge
Limitation
Typing Patterns
Language
Patterns
Personal Attributes
& Traits
Habits
Using Same
Usernames
59% of individuals use
the same username
[Figure: likelihood (y-axis) versus username length (x-axis)]
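One way to turn the username-length observation into a feature is to estimate a length distribution from an individual's prior usernames and score the candidate's length against it. The sketch below is an assumption about how this could be done; the function name, the add-one smoothing, and the `max_len` cap are illustrative choices, not taken from the slides.

```python
from collections import Counter
from typing import Sequence


def length_likelihood(candidate: str, priors: Sequence[str],
                      max_len: int = 30) -> float:
    """Smoothed probability of the candidate's length given prior lengths.

    Lengths above max_len share the last bucket. Add-one smoothing keeps
    unseen lengths from getting probability zero.
    """
    def bucket(name: str) -> int:
        return min(len(name), max_len)

    counts = Counter(bucket(p) for p in priors)
    return (counts[bucket(candidate)] + 1) / (len(priors) + max_len)
```

A candidate whose length matches the lengths an individual has used before receives a higher score than one with an unseen length.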
Limited
Vocabulary
Identifying individuals
by their vocabulary size
Limited
Alphabet
Alphabet Size is
correlated to language:
शमंत कुमार -> Shamanth Kumar
QWERTY Keyboard
Variants: AZERTY, QWERTZ
DVORAK Keyboard
Keyboard type impacts your usernames
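The alphabet and keyboard observations can be illustrated with two simple features. This is a sketch under assumptions: the feature definitions (`alphabet_size`, `same_row_fraction`) are hypothetical, and only the letter rows of a QWERTY layout are modeled.

```python
from typing import Sequence

# Letter rows of a QWERTY layout; other layouts would need their own table.
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
KEY_ROW = {ch: row for row, keys in enumerate(QWERTY_ROWS) for ch in keys}


def alphabet_size(usernames: Sequence[str]) -> int:
    """Number of distinct characters used across a set of usernames."""
    return len(set("".join(usernames).lower()))


def same_row_fraction(username: str) -> float:
    """Fraction of consecutive letter pairs typed on the same keyboard row.

    Characters that are not on a letter row (digits, punctuation) are skipped.
    """
    rows = [KEY_ROW[ch] for ch in username.lower() if ch in KEY_ROW]
    pairs = list(zip(rows, rows[1:]))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

Comparing these values between a candidate username and the prior usernames gives a rough signal of whether both were typed by someone with the same alphabet and keyboard habits.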
Modifying
Previous
Usernames
Adding Prefixes/Suffixes,
Abbreviating, Swapping or
Adding/Removing Characters
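Modifications such as adding or removing characters can be measured with standard string distances. The sketch below uses Levenshtein edit distance and a prefix/suffix check; choosing these particular measures is an assumption for illustration, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def is_affix_variant(candidate: str, prior: str) -> bool:
    """True if the candidate is the prior with a prefix or suffix added."""
    return candidate != prior and (
        candidate.startswith(prior) or candidate.endswith(prior)
    )
```

For instance, `edit_distance("john.smith", "jsmith")` counts the four characters removed in the abbreviation, and `is_affix_variant("jsmith1985", "jsmith")` detects an added suffix.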
Creating
Similar
Usernames
Nametag and
Gateman
Username
Observation
Likelihood
Usernames come from
a language model
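The idea that usernames come from a language model can be sketched with a character bigram model trained on prior usernames. This is an illustrative assumption: the slides do not specify the model order or the smoothing, and the markers `^` and `$` are assumed not to occur in usernames.

```python
import math
from collections import Counter
from typing import Sequence, Tuple

START, END = "^", "$"  # assumed not to appear inside usernames
Model = Tuple[Counter, Counter, set]


def train_bigrams(usernames: Sequence[str]) -> Model:
    """Count character bigrams, including start and end markers."""
    pair_counts, context_counts, vocab = Counter(), Counter(), {END}
    for name in usernames:
        chars = [START] + list(name) + [END]
        vocab.update(name)
        for a, b in zip(chars, chars[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts, vocab


def log_likelihood(candidate: str, model: Model) -> float:
    """Add-one smoothed log-probability of the candidate under the model."""
    pair_counts, context_counts, vocab = model
    v = len(vocab) + 1  # one extra bucket for characters never seen in training
    chars = [START] + list(candidate) + [END]
    total = 0.0
    for a, b in zip(chars, chars[1:]):
        total += math.log((pair_counts[(a, b)] + 1) / (context_counts[a] + v))
    return total
```

A candidate that reuses the character sequences found in the prior usernames gets a higher log-likelihood than an unrelated string of the same length.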
Data:
200,000 instances (50% class balance)
414 Features
Previous Methods:
1) Zafarani and Liu, 2009
2) Perito et al., 2011
Baselines:
1) Exact Username Match
2) Substring Match
3) Patterns in Letters
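The first two baselines are simple enough to sketch, together with an accuracy measure over labeled pairs. The exact matching rules used in the talk are not given, so the case-sensitive comparison and the two-directional substring test below are assumptions, and the `TOY_EXAMPLES` list is made-up data for illustration only. The third baseline, Patterns in Letters, is not specified on the slide and is omitted.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, Sequence[str], bool]  # (candidate, priors, same person?)


def exact_match_baseline(candidate: str, priors: Sequence[str]) -> bool:
    """Predict a match only if the candidate equals a prior username."""
    return candidate in priors


def substring_match_baseline(candidate: str, priors: Sequence[str]) -> bool:
    """Predict a match if the candidate contains, or is contained in, a prior."""
    return any(candidate in p or p in candidate for p in priors)


def accuracy(predict: Callable[[str, Sequence[str]], bool],
             examples: List[Example]) -> float:
    """Fraction of examples where the prediction equals the label."""
    correct = sum(predict(c, p) == label for c, p, label in examples)
    return correct / len(examples)


# Made-up examples, not from the dataset described above.
TOY_EXAMPLES: List[Example] = [
    ("jsmith", ["jsmith", "john.s"], True),
    ("jsmith85", ["jsmith"], True),
    ("john.smith", ["jsmith", "john.s"], True),
    ("kumar", ["jsmith"], False),
    ("smith", ["jsmith"], False),
]
```

Calling `accuracy(exact_match_baseline, TOY_EXAMPLES)` and `accuracy(substring_match_baseline, TOY_EXAMPLES)` compares the two rules on the toy data. With a balanced dataset, as described above, a trivial constant predictor would score 50%.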
[Figure: bar chart (y-axis 0 to 100) comparing Exact Username Match, Substring Matching, Patterns in Letters, Zafarani and Liu, Perito et al., and Naïve Bayes. Bar values as extracted: 77, 63.12, 66, 77.59, 91.38, 49.25]
[Figure: bar chart (y-axis 89 to 94.5) comparing learning algorithms: Naïve Bayes (91.38), J48 (90.87), Random Forest, SVM variants with L1/L2 regularization and loss, L2-regularized Logistic Regression, and L1-regularized Logistic Regression. Remaining bar values as extracted: 93.59, 93.7, 93.71, 93.77, 93.8]
A methodology for connecting individuals across sites
Human behavior results in information redundancy
Information shared across sites acts as a behavioral fingerprint
Discover applications of connecting users across sites
Incorporating features indigenous to specific sites
A behavioral modeling approach
- Uses minimum information across sites
- Allows for integration of additional behaviors when required