Transcript XOR

When small data is better data
Paul Francis, MPI-SWS
Ruichuan Chen,
Ekin Akkus,
Johannes Gehrke
private
When small data is better data
Paul Francis, MPI-SWS
Ruichuan Chen,
Ekin Akkus,
Johannes Gehrke
The user data “understanding”
• There is a tacit understanding among
users that if you send data to a company,
they are free to use it how they wish
Facebook
“knowing” all
kinds of personal
information
Doubleclick
monitoring your
browsing
behavior
Google
gathering WLAN
traffic during
drive-by
Leads to data gathering services
• Companies build (free) services designed
to gather as much data about users as
they can
– And often secretly gather data about users
when they can’t
• Then try to monetize that data
– Mainly through advertising
– Though Jean Bolot had some interesting
ideas
Leads to “big data” (mining)
• Companies gather what they can, but don’t
always get what they want
– Google knows your searches, but not your
relationship status
– Facebook knows your relationship status, but not
what you buy
– Amazon knows what you buy, but not what you
search for
• So they use big data mining to infer what they
don’t know
A new user data understanding
• It is ok to monetize (or otherwise benefit)
from user data if:
– The user data is very expensive to collected
in any identifiable form
– Users can know what is going on, and users
can opt-out
Why is this interesting?
• Keeping user data on the user device is
the key to user privacy
• Most user data is at, or has passed
through, the user device
– Search and browsing in browser history
– Facebook user profile easily scraped
– Amazon purchases easily scraped
Premise of “Private by Design”
• If we can monetize user data, without
collecting user data, then we have
legitimate access to far more user data
– Less need to deal with big data
– Better monetization, less overhead
My group’s research agenda
• “Private by Design” behavioral advertising
• “Private by Design” aggregate analytics
My group’s research agenda
• “Private by Design” behavioral advertising
• “Private by Design” aggregate analytics
Aggregate Analytics
• Web analytics: want to know
demographics of user base, what other
websites users visit, etc.
• App analytics: want to know what other
apps user runs (competitors)
• Mobile analytics, general analytics,….
Typical database privacy settings:
trusted component sees database
Untrusted
Analyst
Analyst
query
query
Database’
Query Module
(add noise)
query
Database
anonymize
Trusted
Database
Our setting: nobody (except user)
sees individual user data
Untrusted
Analyst
???
Untrusted
Data
Data
Data
Previous work in our setting
• Assumed differential privacy
• Poor scaling characteristics, and/or
• Could not tolerate user fraud
Analyst
Our goal: Assume
differential privacy, but fix
scaling and user fraud
problems.
???
Data
Data
Data
Differential privacy
• Differential privacy adds noise to the
output of a computation (i.e., query).
Database
Query Module
(add noise)
DB1
DB2 (differs by
one user)
Analyst
Components & assumptions
Analyst is potentially malicious
(violating user privacy)
Analyst
Proxy (add DP noise blindly)
Data
Data
Data
Proxy is honest but curious
1) Follows the specified protocol
(does not collude)
2) Tries to exploit additional info
that can be learned in so doing
Clients are user devices.
Clients are potentially malicious
(distorting the final results)
Actually, two proxies!
Honest-but-Curious
proxy must not see
user data
Analyst
Blind
Proxy
Data
Blind
Proxy
Data
Data
If one proxy, need
expensive public key
encryption between
clients and analyst
If two proxies, can
use much cheaper
form of encryption
(one time pad)
Message XOR Random_String = Result
Proxy 1
Result
Sender
Result
Random_String
Random_String
Receiver
Proxy 2
Result XOR Random_String = Message
Queries are
counting queries:
Analyst
Blind
Proxy
Data
Blind
Proxy
Data
Data
Ex: How many
users…..are male and
between ages of 10-20?
Analyst
Blind
Proxy
Data
Blind
Proxy
Data
Data
Clients answer
‘yes’ or ‘no’ only
Analyst
Proxies adds N
additional random
yes/no answers
(coins)
Blind
Proxy
Blind
Proxy
N = 2σ2
But, must not know
how many yes’s and
no’s it added!
Data
Data
Data
Analyst
Each proxy
independently adds
N random coins
Blind
Proxy
Data
Blind
Proxy
Data
Data
XOR at analyst will
produce random
result
But neither proxy
knows what the
result will be
Analyst
Coins and answers
Blind
Proxy
Data
Blind
Proxy
Data
Data
Analyst
Decrypt and tabulate
Blind
Proxy
Data
Blind
Proxy
Data
Data
Buckets
• Not “is your age between 10-20?”,
– but “are you 1?”, “are you 2?”, “are you 3?”….
• Query is generally a vector of yes/no
questions
– Answer a vector of 1’s and 0’s
• Vector can be big:
– List of 20K websites
– 185K combinations of 10 of 20 attributes
Analyst
Blind
Proxy
b1: u4, u12, c2, ……
b2: u6, c3, u19, ……
b3: u12, c7, u6, ……
Blind
Proxy
Proxies add coins
and shuffle user
answers (per bucket)
u1: b1, b2, b3, ……
u2: b1, b2, b3, ……
u3: b1, b2, b3, ……
…….
Data
Data
Data
b1: u4, u12, c2, ……
b2: u6, c3, u19, ……
b3: u12, c7, u6, ……
u1: b1, b2, b3, ……
u2: b1, b2, b3, ……
u3: b1, b2, b3, ……
…….
Analyst
Blind
Proxy
Data
b1: u4, u12, c2, ……
b2: u6, c3, u19, ……
b3: u12, c7, u6, ……
Blind
Proxy
Data
Data
u1: b1, b2, b3, ……
u2: b1, b2, b3, ……
u3: b1, b2, b3, ……
…….
b1: u4, u12, c2, ……
b2: u6, c3, u19, ……
b3: u12, c7, u6, ……
u1: b1, b2, b3, ……
u2: b1, b2, b3, ……
u3: b1, b2, b3, ……
…….
Blind
Proxy
Analyst
b1: u4, u12, c2, ……
b2: u6, c3, u19, ……
b3: u12, c7, u6, ……
Blind
Proxy
u1: b1, b2, b3, ……
u2: b1, b2, b3, ……
u3: b1, b2, b3, ……
…….
The shuffling at each proxy must
be identical (though random)
Because
eachData
bit must Data
be paired
Data
with its XOR partner
b1: u4, u12, c2, ……
b2: u6, c3, u19, ……
b3: u12, c7, u6, ……
u1: b1, b2, b3, ……
u2: b1, b2, b3, ……
u3: b1, b2, b3, ……
…….
Analyst
Blind
Proxy
b1: u4, u12, c2, ……
b2: u6, c3, u19, ……
b3: u12, c7, u6, ……
Blind
Proxy
u1: b1, b2, b3, ……
u2: b1, b2, b3, ……
u3: b1, b2, b3, ……
…….
But the proxies may have a
(slightly) different set of answers.
Data
Data
Data
Analyst
u1: b1, b2, b3, ……
u2: b1, b2, b3, ……
u3: b1, b2, b3, ……
…….
Blind
Proxy
Blind
Proxy
Synchronize the list of answers.
Share a random seed for a
Datagenerator,
Data use Data
random number
to
shuffle.
u1: b1, b2, b3, ……
u2: b1, b2, b3, ……
u3: b1, b2, b3, ……
…….
Time
• Queries (unfortunately) take time:
– There is a period of time during which a query
is active
• 10s of minutes, hours, or days???
Start query
Synchronize and add coins
TIME
Clients pull in and answer queries
Differential Privacy, good and bad
• Good:
– Adds noise
– Lots of machinery being built
• Bad:
– Very pessimistic (measure of privacy loss is
almost certainly way worse than actual
privacy loss)
– “Throwing away the database” not realistic
From INTIMATE workshop
• Jean’s mobility a good application
• Collaborative filtering (Bach, Aruna) looks
hard to do
• Serge’s social knowledge may be
centered on user devices…
– Query for people’s opinions…
• Real-time analytics may be possible
– Streamed coin addition???
Status and future
• Building an application analytics tool
– Initial focus is PC platforms
– Hope to get real app developers to bundle our
tool
• Additional privacy mechanisms (beyond
differential privacy)
• Work on better understanding of privacy
loss in a realistic setting