Transcript Slides

Foundations of Privacy
Lecture 11
Lecturer: Moni Naor
Recap of recent lecture
• Continual changing data
– Counters
– How to combine expert advice
– Multi-counter and the list update problem
• Pan Privacy
• General Transformation to continual output
Petting
The Dynamic Privacy Zoo
Continual Pan
Privacy
Differentially
Private Outputs
Privacy under
Continual
Observation
Pan Privacy
Sketch vs. Stream
User level Privacy
Sanitization Can’t be Too Accurate
Usual counting queries
– Query: q µ [n]
– i 2 q di Response = Answer + noise
Blatant Non-Privacy: Adversary Guesses 99% bits
Theorem: If all responses are within o(n) of the true
answer, then the algorithm is blatantly non-private.
But: require exponential # of queries .
4
Proof: Exponential Adversary
• Focus on Column Containing Super Private Bit
1
0
0
1
0
1
1
“The database”
d
• Assume all answers are within error bound .
Will show that  cannot be o(n)
5
Proof: Exponential Adversary
• Estimate #1’s in all possible sets
– 8 S µ [n]: |K(S) – i 2 S di | ≤ 
• Weed Out “Distant” DBs
K(S) real answer
on S
– For each possible candidate database c:
If for any S µ [n]:
|i 2 S ci – K(S)| > ,
then rule out c.
– If c not ruled out, halt and output c
Claim: Real database d won’t be ruled out
6
Proof: Exponential Adversary
Suppose: 8 S µ [n]: |K(S) – i 2S di | ≤ 
• Claim: For c that has not been ruled out
Hamming distance (c,d) ≤ 2
S0 0
S1 1
d
0
0
0
1
0
1
1
1
c
|K(S0) - i 2S0 ci | ≤  (c not ruled out)
≤ 2
|K(S1) - i 2S1 ci | ≤  (c not ruled out)
Contradiction?
• We have seen algorithms that allow answer each
query with accuracy o(n)
– O(√n) and O(n2/3)
• Why is there no contradiction with current results
What can we do efficiently?
Allowed “too” much power to the adversary
• Number of queries
• Computation
• On the other hand: lack of wild errors in the
responses
Theorem: For any sanitization algorithm:
If all responses are within o(√n) of the true answer, then
it is blatantly non-private even against a polynomial
time adversary making O(n log2 n) random queries.
Show the adversary
The model
• As before: database d is a bit string of length n.
• Users query for subset sums:
– A query is a subset q µ {1, …, n}
– The (exact) answer is aq = i 2q di
• -perturbation
– for an answer: aq ± 
Slide 10
Privacy requires Ω(√n) perturbation
For
every
query
q
:
its
answer
according
to
c
is
j
Consider a database with o(√n) perturbation
at
most
2
far
from
its
(real)
answer
in
d.
2
• Adversary makes t = n log n random queries q ,
getting noisy answers aj
• Privacy violating Algorithm:
j
A solution must
exist: d itself
Construct database c = {ci}1 ≤ i ≤ n by solving Linear Program:
0 ≤ ci ≤ 1
for 1 ≤ i ≤ n
aj -  ≤ i 2q ci ≤ aj +  for 1 ≤ j ≤ t
• Round the solution:
– if ci > 1/2 set to 1 and to 0 otherwise
Bad solutions to LP do not survive
• A query disqualifies a potential database c if its answer
for the query is more than 2 + 1 far from its real
answer in d.
• Idea: show that for a database c that is far away from d
a random query disqualifies c with some constant
probability 
• Want to use the Union Bound: all far away solutions are
disqualified w.p. at least 1 – nn(1 - )t = 1–neg(n)
How do we limit the solution space?
Round each one value to closest 1/n
Privacy requires Ω(√n) perturbation
A query disqualifies a potential database c if its answer
for the query is more than 2 + 1 far from its real
answer in d.
Count number of entries far from d
Claim: a random query disqualifies far away from d
database c with some constant probability 
• Therefore: t = n log2 n queries leave a negligible
probability for each far reconstruction.
• Union bound: all far away suggestions are disqualified
w.p. at least 1 – nn(1 - )t = 1 – neg(n)
Can apply union bound by discretization
Review and Conclusion
• When the perturbation is o(√n), choosing Õ(n)
random queries gives enough information to
efficiently reconstruct an o(n)-close db.
• Database reconstructed using Linear programming
– polynomial time.
o(√n) databases are Blatantly Non-Private.
poly(n) time reconstructable
Slide 14
Ω(√n) lower bound
revisited
• An attack on a o(√n)-perturbation database with
substantially better performance
• Previous attack uses n log2 n queries and runs in
n5 log4 n time (LP)
• New attack: issues n queries and runs in O(nlog n)
time
• New attack is deterministic
– Fixed set of queries for each size
– Not necessarily an advantage – must ask certain queries
Slide 15
The Fourier Attack
Vector defines a function
• Treat the database d as a function Z2logn → Z2
• Query specific subset sums: from which the Fourier
coefficients of the function can be calculated
– One for each Fourier coefficient
• Round reconstructed function’s values to bits
• When the sums have o(√n) error, so do the coefficients
– the reconstruction can be shown to have o(n) error.
• Fourier transform can be computed in time O(n log n)
Key point: linearity of Fourier transform implies small error
in coefficients also mean small error in function
Slide 16
Fourier Transform
• The characters of Z2k : homomorphisms into {-1,1}
• There are 2k characters:
one for each a=(a1, a2, …, ak) 2 Z2k
Ha,b = a (b)
k
H = 2k x 2k

(-1) i=1 ai xi Hadamard matrix
a(x) =
• For function f: Z2logn → R
H H = 2k I
Æ
The Fourier coefficients f(a) are x a(x) f(x)
Æ
We have: f(x) = a a(x) f(a)
Æ
f=Hf
Æ
f = 1/2k H f
Parseval’s Identity
• Relates the absolute values of f to absolute values
of Fourier coefficients of f
x 2 Z
2
k
|f(x)|2 = 1/2k a 2 Z
2
Æ
k
|f(a)|2
Evaluating Fourier Coefficients with
Counting queries
• Let 0 = x f(x)
k-1
|S
|=
2
a
• For a=(a1, a2, …, ak) let
Sa= {x| <a,x>=0 mod 2}
Æ
• f(a) = 2 x 2 Sa f(x) - 0
e: error
vector of
Approximation of counting
query
on
S
yields
a
Æ
Fourier co.
approximation of f(a) with related term
Æ
Æ
e=(e1, e2, …, en)
f = 1/2k H f => 1/2k H (f + e) = f + 1/2k He
e: error vector of Fourier co.
Æ
e=(e1, e2, …, en)
Æ
f = 1/2k H f => 1/2k H (f + e) = f + 1/2k He
If 1/2k He has (n) entries which are ¸ ½
Then by Parseval’s: 1/2k a 2 Z k |ea|2 is (n)
2
n
Contradicting assumption
on accuracy
Hence: at least one |ea| is (√n)
x 2 Z
k
2
|f(x)|2 = 1/2k a 2 Z
k
2
|f(a)|2
Changing the Model: weighted counting
• Previous attacks: assume all queries are within
some small perturbation  Want some randomness of
queries – otherwise repetition
New model:
• To up to ½- of the queries unbounded noise is
added
some prime p = Ω(1/2 +  /)
• To the rest “small” noise  bounded
• Stronger query model:
subset sums are weighted with weights 0...p-1 for
Cannot “hide” single bits: all the weight might be
there
Slide 21
Interpolation attack
By dropping
info
• Treat database as linear form of n variables over Zp
• Treat a query q = (q1, …, qn) as the evaluation of the
form at a point
f(q1, …, qn) = Σi=1..n di qi mod p
– An answer to query q =((p-1)/2, 0, …, 0) that is within
(p-1)/4 error tells us the first db bit
– Similarly to all other bits
• No point in asking the query directly: these useful
queries might have unbounded noise
• Need to deduce (approximate) answer to q from other
queries
Slide 22
Interpolation attack - implementation
• Want to evaluate a specific query q with small error
• Pick a random degree-2 curve that passes through
q and issue queries for the p points on the curve
• Key issue: points on curve are pairwise independent
• Therefore: for sufficiently many queries, with high
probability interpolation gives a correct (up to small
noise) answer for q
• Can try exhaustively all degree 2 polynomials
Similar to Reed Muller decoding
Slide 23
Interpolation attack …
• Interpolation implemented by searching all p3 degree 2
polynomials for one which is -close at ½- of the
entries polynomial
– restrictions of a deg-2 curve to a linear form is a deg-2
polynomial
• Any two such polynomials must be 2-close, due to low
degree
To query
• Hence the accuracy of the reconstructed answer is 2.
• For (p-1)/4 > 2: can figure out any specific database
Slide 24
bit with high probability
Interpolation Attack: evaluating a query
accurately
• DB: f(q1, …, qn) = Σi=1..n di qi (Zpn → Zp)
• Pick a curve: for two random points u1, u2 in Zpn:
c(t) = q + u1 t + u2 t2
• Restriction of f to c: f|c(t) = f(c(t))
this is a degree-2 polynomial
(Zp → Zpn)
(Zp → Zp)
• Query all p points of c to get evaluations of f|c
– answers are inaccurate
• Interpolate to find f|c up to a small error
• Evaluate f|c(0) = f(q) accurately
Slide 25
Interpolation attack - performance
• Time for finding any specific bit: O(p4)=O(-8)
• Independent of db size n? (querying time? |q| = Θ(n))
– Can be used with very large databases if interesting part is
small
• Time to construct whole db with small error: O(n)
with pn queries (or O(n2))
Slide 26
Summary
• Ω(√n) perturbation lower bound revisited – simple
and efficient attack
• When queries allow sufficiently large weights, an
adversary can:
– Handle unbounded noise on large portion of the queries
– Find out private data in time independent of size of DB
Slide 27