Mosaic - Northwestern University
Mosaic: Quantifying Privacy Leakage
in Mobile Networks
Aleksandar Kuzmanovic
EECS Department
Northwestern University
http://networks.cs.northwestern.edu
Buy 1 Get 1 Free
Mosaic
– Joint work with
• Ning Xia (Northwestern University)
• Han Hee Song (Narus Inc.)
• Yong Liao (Narus Inc.)
• Marios Iliofotou (Narus Inc.)
• Antonio Nucci (Narus Inc.)
• Zhi-Li Zhang (University of Minnesota)
Synthoid
– Joint work with
• Marcel Flores (Northwestern University)
2
Scenario
[Figure: users leave different "footprints" through different services and different devices; their traffic appears under dynamic IPs (IP1 … IPK) at ISP A and CSP B.]
3
Problem
[Figure: input packet traces from ISP A and CSP B (IP1 … IPK) feed Tessellation and then Mosaic; "We are here!" marks this stage, next to "Other research work".]
How much private information can be obtained and
expanded about end users by monitoring network traffic?
4
Motivation
"I will know everything about everyone!"
[Figure: agencies and bad guys monitoring traffic from ISP A and CSP B (IP1 … IPK).]
Mobile Traffic:
• Relevant: more personal information
• Challenging: frequent IP changes
5
Challenges
• How to track users when they hop over different IPs?
Sessions:
• Flows (5-tuple) are grouped into sessions
Traffic Markers:
• Identifiers in the traffic that can be used to differentiate users
[Figure: sessions laid out on separate timelines for IP1, IP2, and IP3.]
With Traffic Markers, it is possible to connect the users' true identities to their sessions.
6
Datasets
Dataset               Source  Description
3h-Dataset            CSP-A   Complete payload
9h-Dataset            CSP-A   Only HTTP headers
Ground Truth Dataset  CSP-B   Payload & RADIUS info.
• 3h-Dataset: main dataset for most experiments
• 9h-Dataset: for quantifying privacy leakage
• Ground Truth Dataset: for evaluation of session attribution
  • RADIUS: provides session owners
7
Methodology Overview
[Figure: traffic from ISP A and CSP B (IP1 … IPK) goes through network data analysis (Tessellation: traffic attribution via traffic markers and via activity fingerprinting, giving a mapping from sessions to users); together with public web crawling this feeds Mosaic construction.]
Combine information from both network data and
OSN profiles to infer the user mosaic.
8
Traffic Attribution via Traffic Markers
Traffic Markers:
• Identifiers in the traffic to differentiate users
• Key/value pairs from HTTP header
• User IDs, device IDs, or session IDs
Domain       Keywords                  Category     Source
osn1.com     c_user=<OSN1_ID>          OSN User ID  Cookies
osn2.com     oauth_token=<OSN2_ID>-##  OSN User ID  HTTP header
admob.com    X-Admob-ISU               Advertising  HTTP header
pandora.com  user_id                   User ID      Cookies
google.com   sid                       Session ID   Cookies
How can we select and evaluate
traffic markers from network data?
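As a minimal sketch of what "key/value pairs from HTTP header" could look like in practice, the snippet below collects candidate markers from one request, using the `domain#key` notation that appears elsewhere in these slides. The function name `extract_markers`, the host `example.com`, and the sample header values are hypothetical; treating every cookie and every non-cookie header as a candidate is an assumption, not the selection method described in the talk.

```python
from http.cookies import SimpleCookie


def extract_markers(host, headers):
    """Return {"host#key": value} candidates from one HTTP request.

    Assumption: every cookie and every other header is a candidate;
    culling by uniqueness/persistency would happen in a later step.
    """
    markers = {}
    for name, value in headers.items():
        if name.lower() == "cookie":
            cookie = SimpleCookie()
            cookie.load(value)  # parses "k1=v1; k2=v2"
            for key, morsel in cookie.items():
                markers[f"{host}#{key}"] = morsel.value
        else:
            markers[f"{host}#{name.lower()}"] = value
    return markers


# Hypothetical request headers, for illustration only.
request_headers = {
    "Cookie": "user_id=12345; theme=dark",
    "X-Admob-ISU": "abcdef0123",
}
markers = extract_markers("example.com", request_headers)
for key in sorted(markers):
    print(key, "=", markers[key])
```

Most of these candidates (such as `theme`) would not distinguish users, which is why a selection step is still needed.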
9
Traffic Attribution via Traffic Markers
OSN IDs as Anchors:
• The most popular user identifiers among all services
• Linked to user public profiles
OSN      Source                Session Coverage
OSN1 ID  HTTP URL and cookies  1.3%
OSN2 ID  HTTP header           1.0%
• OSN1 and OSN2 are the top 2 OSN providers from North America
• Only 2.3% of sessions contain OSN IDs
OSN IDs can be used as anchors, but their session coverage is too small
10
Traffic Attribution via Traffic Markers
Block Generation: Group Sessions into Blocks
Session interval δ
• Depends on the CSP
• δ=60 seconds in our study
Block
• Session group on the same IP from the same user
• Traffic markers shared by the same block
[Figure: timeline of IP1 with a session carrying an OSN ID and other sessions around it grouped into a block; a gap ≥δ is marked on the timeline.]
99K session blocks generated from the 12M sessions
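The block generation step can be sketched roughly as follows. This assumes that sessions on the same IP are merged into one block while the idle gap between consecutive sessions stays below δ, and that a gap of at least δ starts a new block; the exact rule is not spelled out on the slide. The `(ip, start, end)` tuple layout and the function name `build_blocks` are hypothetical.

```python
from collections import defaultdict

DELTA_SECONDS = 60  # δ from the slide; depends on the CSP


def build_blocks(sessions, delta=DELTA_SECONDS):
    """Group (ip, start, end) sessions into per-IP blocks.

    Assumption: a gap of at least `delta` seconds between the end of
    the running block and the next session start opens a new block.
    """
    by_ip = defaultdict(list)
    for ip, start, end in sessions:
        by_ip[ip].append((start, end))

    blocks = []
    for ip, spans in by_ip.items():
        spans.sort()
        current = [spans[0]]
        last_end = spans[0][1]
        for start, end in spans[1:]:
            if start - last_end >= delta:
                blocks.append((ip, current))
                current = []
            current.append((start, end))
            last_end = max(last_end, end)
        blocks.append((ip, current))
    return blocks


# Hypothetical sessions, times in seconds.
sessions = [("IP1", 0, 10), ("IP1", 30, 40), ("IP1", 200, 210), ("IP2", 5, 15)]
blocks = build_blocks(sessions)
for ip, spans in blocks:
    print(ip, spans)
```

Any traffic marker found in one session of a block can then be shared by the whole block.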
11
Traffic Attribution via Traffic Markers
Culling the Traffic Markers: OSN IDs are not enough
• Uniqueness: Can the traffic marker differentiate between users?
• Persistency: How long does a traffic marker remain the same?
Uniqueness = 1
• No two users will share the same google.com#sid value
Persistency ~= 1
• The value of google.com#sid remains the same for the same user for nearly all of the observation duration
[Figure: uniqueness and persistency scores (y-axis, 0.9 to 1) per traffic marker: craigslist.org#cl_b, google.com#sid, mydas.mobi#mac-id, mobclix.com#u, pandora.com#user_id, mobclix.com#uid, OSN1 ID, admob.com#isu.]
We pick 625 traffic markers with uniqueness = 1, persistency >
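The slides pose uniqueness and persistency as questions without giving formulas, so the following is only one possible way to score them. It assumes each observation of a marker is a `(user, value)` pair, where the user label comes from some anchor such as an OSN ID. Uniqueness is taken as the fraction of values seen with exactly one user, and persistency as the mean share of a user's observations that carry that user's most common value. Both definitions and both function names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def uniqueness(observations):
    """Fraction of marker values observed with exactly one user (assumed definition)."""
    users_per_value = defaultdict(set)
    for user, value in observations:
        users_per_value[value].add(user)
    single = sum(1 for users in users_per_value.values() if len(users) == 1)
    return single / len(users_per_value)


def persistency(observations):
    """Mean share of each user's observations carrying the user's most common value
    (assumed definition; a time-weighted variant would be equally plausible)."""
    values_per_user = defaultdict(Counter)
    for user, value in observations:
        values_per_user[user][value] += 1
    shares = [
        counts.most_common(1)[0][1] / sum(counts.values())
        for counts in values_per_user.values()
    ]
    return sum(shares) / len(shares)


# Hypothetical observations of one marker, e.g. a session cookie.
obs = [("u1", "a"), ("u1", "a"), ("u1", "a"), ("u1", "b"), ("u2", "c"), ("u2", "c")]
print(uniqueness(obs), persistency(obs))
```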
12
Traffic Attribution via Traffic Markers
Traffic Attribution: Connecting the Dots
[Figure: Tessellation user Ti; blocks on IP 1, IP 2, and IP 3 are linked by the same OSN ID and the same traffic marker.]
Traffic markers are the key in attributing sessions to the
same user over different IP addresses
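One way to picture "connecting the dots" is as a union-find over session blocks: any two blocks that share a marker value are merged, and each resulting group stands for one Tessellation user. This is a sketch under that assumption, not necessarily how the authors implemented it; the function name `attribute_blocks`, the block IDs, and the marker values are made up.

```python
def attribute_blocks(block_markers):
    """Merge blocks that share any marker value.

    block_markers: {block_id: set of "domain#key=value" strings}
    Returns a list of sets of block ids, one set per inferred user.
    """
    parent = {block: block for block in block_markers}

    def find(x):
        # Path-halving find.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_block_with = {}
    for block, markers in block_markers.items():
        for marker in markers:
            if marker in first_block_with:
                parent[find(block)] = find(first_block_with[marker])
            else:
                first_block_with[marker] = block

    groups = {}
    for block in block_markers:
        groups.setdefault(find(block), set()).add(block)
    return list(groups.values())


# Hypothetical blocks on four different IPs.
blocks = {
    "IP1-block": {"osn1.com#c_user=42"},
    "IP2-block": {"osn1.com#c_user=42", "google.com#sid=xyz"},
    "IP3-block": {"google.com#sid=xyz"},
    "IP4-block": {"pandora.com#user_id=7"},
}
users = attribute_blocks(blocks)
print(sorted(sorted(group) for group in users))
```

Here the IP3 block never carries the OSN ID itself; it is linked to the same user only through the marker it shares with the IP2 block.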
13
Traffic Attribution via Activity Fingerprinting
• What if a session block has no traffic markers?
Assumption (Activity Fingerprinting):
• Users can be identified from the DNS names of their favorite services
DNS names:
• Extracted 54,000 distinct DNS names
• Classified into 21 classes
Activity Fingerprinting:
• Favorite (top-k) DNS names as the user's "fingerprint"
Service classes  Service providers
Search           bing, google, yahoo
Chat             skype, mtalk.google.com
Dating           plentyoffish, date
E-commerce       amazon, ebay
Email            google, hotmail, yahoo
News             msnbc, ew, cnn
Picture          Flickr, picasa
…                …
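A top-k fingerprint of this kind could be computed along these lines. The sketch assumes "favorite" means most frequently contacted, with ties broken alphabetically; the function name `activity_fingerprint` and the sample names are hypothetical.

```python
from collections import Counter


def activity_fingerprint(dns_names, k=5):
    """Return the k most frequent DNS names as a frozenset.

    Assumption: frequency of contact ranks the names; ties are
    broken alphabetically so the result is deterministic.
    """
    counts = Counter(dns_names)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return frozenset(name for name, _ in ranked[:k])


# Hypothetical DNS names seen in one user's sessions.
queries = ["cnn.com", "ebay.com", "cnn.com", "flickr.com",
           "ebay.com", "cnn.com", "bing.com"]
fp = activity_fingerprint(queries, k=2)
print(sorted(fp))
```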
14
Traffic Attribution via Activity Fingerprinting
• Fi: top k DNS names from a user as the "activity fingerprint"
• Y(Fi): uniqueness of the fingerprint
[Figure: Y(Fi) on the y-axis (0.9 to 1; the closer to 1, the more distinct the fingerprint is) against a normalized x-axis (0 to 1; normalized by the total number of DNS names), with one curve each for k=4, k=5, k=6, k=7, and k=8.]
Mobile users can be identified
by the DNS names from their preferred services
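To illustrate how a block without traffic markers might be attributed by its fingerprint, the sketch below compares the block's DNS-name set with the fingerprints of already known users and picks the best match. Jaccard similarity and the 0.5 threshold are assumptions chosen for the example; the slides do not state which similarity measure is used. All names and values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(block_fp, user_fps, threshold=0.5):
    """Return the user whose fingerprint is most similar to block_fp,
    or None if no similarity reaches the (assumed) threshold."""
    best_user, best_score = None, 0.0
    for user, fp in user_fps.items():
        score = jaccard(block_fp, fp)
        if score > best_score:
            best_user, best_score = user, score
    return best_user if best_score >= threshold else None


# Hypothetical fingerprints of two known Tessellation users.
known = {
    "T1": {"cnn.com", "ebay.com", "skype.com", "flickr.com"},
    "T2": {"bing.com", "amazon.com", "hotmail.com", "msnbc.com"},
}
unmarked_block = {"cnn.com", "ebay.com", "skype.com", "picasa.com"}
print(best_match(unmarked_block, known))
```

A stricter threshold trades coverage for accuracy: fewer blocks get attributed, but those that do match more closely.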
15
Traffic Attribution Evaluation
• Ri: RADIUS user (ground truth)
• Ti: Tessellation user (correct?)
[Figure: sessions of Tessellation users Ti and Tj compared with sessions of RADIUS users Ri and Rj; the cases are labeled "Correct (not complete)" and "Not correct".]
Coverage = (identified sessions/users) / (total sessions/users)
Accuracy on Covered Set = (correctly identified sessions/users) / (total identified sessions/users)
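The two ratios can be computed at session level roughly as follows. The sketch assumes a session counts as correctly identified when every session of its Tessellation user has the same RADIUS owner, which matches the "correct (not complete)" case; the precise rule used in the evaluation is not given on the slide. The function name `evaluate` and the sample data are hypothetical.

```python
from collections import defaultdict


def evaluate(attribution, ground_truth):
    """Return (coverage, accuracy on covered set) at session level.

    attribution:  {session_id: tessellation_user} for identified sessions
    ground_truth: {session_id: radius_user} for all sessions
    Assumption: a session is correct if its Tessellation user maps to
    a single RADIUS user across all of that user's sessions.
    """
    identified = [s for s in ground_truth if s in attribution]
    coverage = len(identified) / len(ground_truth)

    owners = defaultdict(set)
    for s in identified:
        owners[attribution[s]].add(ground_truth[s])

    correct = [s for s in identified if len(owners[attribution[s]]) == 1]
    accuracy = len(correct) / len(identified) if identified else 0.0
    return coverage, accuracy


# Hypothetical data: s5 is never attributed, and T2 mixes two RADIUS users.
truth = {"s1": "R1", "s2": "R1", "s3": "R1", "s4": "R2", "s5": "R2"}
attributed = {"s1": "T1", "s2": "T1", "s3": "T2", "s4": "T2"}
coverage, accuracy = evaluate(attributed, truth)
print(coverage, accuracy)
```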
16
Traffic Attribution Evaluation
Evaluation Results
[Figure: coverage for users and sessions, by method (OSN ID extraction, via traffic markers, via activity fingerprinting); values shown: 15.70%, 43.20%, 2.40%, 49.80%, 78.60%, 69.00%.]
[Figure: accuracy on covered set, by method (OSN ID extraction, via traffic markers, via activity fingerprinting); User: 100%, 99.30%, 96.40%; Session: 100%, 94.50%, 92.50%.]
17
Construction of User Mosaic
• Mosaic of Real-World User Alice
[Figure: Alice's mosaic, with tesserae shaded from least gain to most gain; sub-classes include residence, coordinates, city, state, etc.]
Example MOSAIC with 12 information classes (tesserae):
• Information (education, affiliation, etc.) from OSN profiles
• Information (locations, devices, etc.) from user sessions
18
Quantifying Privacy Leakage
• Leakage from OSN profiles vs. from Network Data
• OSN profiles provide static user information (education, interests)
• Analysis on network data provides real-time activities and locations
• Information from both sides can corroborate each other
[Figure: bar chart of # of users (0 to 15000) per information class: News_info., Content_exch., Entertainment, Affiliation, E-commerce, Education, Art_culture, Location, Social_actvty, Association, Demographics, Device_info.; each bar is split into "Both public OSN profiles & activity analysis", "Public OSN profiles only", and "Activity analysis only".]
Information from OSN profiles and network data
complement and corroborate each other
19
Preventing User Privacy Leakage
Protect traffic markers
• Traffic markers (OSN IDs, etc.) should be limited and encrypted
Restrict 3rd parties
• Third-party applications/developers should be strongly regulated
20
Beyond Traffic Encryption
Trackers and Information Aggregators
21
Current Approaches (in the Advertising Domain)
Block or disrupt ad interaction
– May disrupt regular site operation
Privacy preserving infrastructures
– Requires participation of ad networks
Do Not Track, Opt-out mechanisms
– Requires trust in ad networks
22
Synthoid
Endpoint User Profile Control
– User explicitly defines who he wants to be online, i.e., his online profile
– Synthoid imprints this profile into all possible trackers and information aggregators
23
Ad Network Synthoid
24
Synthoid Performance
Volume of Synthoid traffic varied from 1% to 100%
It is possible to completely alter the user profile with a small amount of artificial traffic
25
Conclusions
• The prevalent use of OSNs leaves users' true identities available in the network
  • Significant portions of flows can be attributed to users, even without any direct identity leaks
• Ubiquitous encryption is not likely to take place, and it would not solve the problem anyway
• Our endpoint user profile control lets users explicitly define their online profiles
  • Works for all possible trackers at once
  • Requires no changes or consent from trackers
  • Currently developing Synthoid for search, mobile, information aggregators, price discriminators, etc.
26
Questions?
Thanks!
http://networks.cs.northwestern.edu
27