A case for an interactive approach to privacy
Download
Report
Transcript A case for an interactive approach to privacy
Lower Bounds on Noise
Sergey Yekhanin
Institute for Advanced Study
Setting
Database of information about individuals
E.g. Medical history, Census data, Customer info.
Need to guarantee confidentiality of individual entries
Want to make deductions about the database; learn large
scale trends.
E.g. Learn that a drug V increases likelihood of heart
disease
Do not leak info about individual patients
Message
Two approaches to database privacy:
Interactive: Analyst asks questions; curator returns
approximate answers
Curator
Analyst
Message
Two approaches to database privacy:
Interactive: Analyst asks questions; curator returns
approximate answers
Non-interactive: Publish a “summary” of the database;
analyst can use summary to get answers
Curator
Summary
Analyst
Message
Two approaches to database privacy:
Interactive: Analyst asks questions; curator returns
approximate answers
Non-interactive: Publish a “summary” of the database;
analyst can use summary to get answers
Thesis: The interactive approach is the right way to give
good accuracy for a given level of privacy
Any non-interactive solution permitting “too accurate”
answers to “too many” questions leaks private information.
Plan
Mathematical model of database and queries
Attacks
Somewhat accurate answers to all queries lead to privacy
leakage. (Fourier analysis) [Y] (extends [DiNi]).
Somewhat accurate answers to a fraction of queries lead to
privacy leakage. (Linear programming / Polynomial
interpolation) [DMT,DY]
Study of privacy leads to a variety of mathematical
challenges!
Model
[Dinur-Nissim] Simple Model (easily justifiable)
Database: n-bit binary vector x
Query: vector a
True answer: Dot product ax
Response is ax+ e = True Answer + Noise
Privacy Leakage: Attacker learns a certain bit of x.
Blatant Non-Privacy: Attacker learns n−o(n) bits of x.
Fourier attack
Theorem: If a curator adds o(√n) noise to every response;
then an attacker can ask n questions, perform O(n log n)
computation and recover n-o(n) bits of the database.
Put database records in one-to-one correspondence
with elements of a group Z 2k .
Think of a database as a function D from Z 2k to {0,1}.
Choose queries to ask for Fourier coefficients of D.
Noisy Fourier coefficients approximately determine the
Boolean function D! (Parseval identity).
Linear programming attack
Theorem: If a curator adds o(√n) noise to 0.773 fraction of
responses; then an attacker can ask O(n) questions,
perform O(n3) computation and recover n-o(n) bits of the
database.
Arbitrarily large error on arbitrary and unknown 0.239
fraction on answers.
Linear programming attack
Ask O(n) random +1/-1 questions
Obtain y=Ax+e, where e is the error vector
A natural approach to recover x from y:
Solve: min |e'|0 such that y=Ax'+e‘, x' in Rn (hard!)
Solve a linear program [D, CT, MT]:
min |e'|1
such that
y=Ax'+e'
x' in Rn
Ax'
y
Polynomial interpolation attack
Model: Questions have O(c) large coefficients
Theorem: If a curator adds o(c) noise to 0.501 fraction of
responses; then an attacker can ask c questions, perform
O(c4) computation and reliably recover any particular bit
of the database.
Arbitrarily large error on arbitrary and unknown 0.499
fraction on answers.
Polynomial interpolation attack
Assume c is prime.
Think of the space of queries as a linear space
Fcn .
To obtain a reliable answer to query x = (1,0, … , 0) ,
draw a degree two curve through x.
Ask all queries that correspond to points on the curve.
Use polynomial interpolation to carefully combine the
answers.
x
q1
q4
q2
q3
q5
q6
Implications
Privacy has a Price
There is no safe way to avoid increasing the noise as the
number of queries increases
Applies to Non-Interactive Setting
Any non-interactive solution permitting answers that are “too
accurate” to “too many” questions is vulnerable to attack.
Cannot just output a noisy table.
Non-interactive approach has inherent limitations
Interactive approach works
Can also publish a summary, as long as its clear which stats
are accurate, and which ones are not.
Future directions:
Fewer queries
Understand what can and what cannot be done privately