ppt - Berkeley Institute of Design - University of California, Berkeley

Download Report

Transcript ppt - Berkeley Institute of Design - University of California, Berkeley

Practical Private Computation and ZeroKnowledge Tools for Privacy-Preserving
Distributed Data Mining
Yitao Duan and John Canny
http://www.cs.berkeley.edu/~duan
Berkeley Institute of Design
Computer Science Division
University of California, Berkeley
Goal

To provide practical solutions with
provable privacy and adequate
efficiency in a realistic adversary model
at reasonably large scale
Goal

To provide practical solutions with
provable privacy and adequate
efficiency in a realistic adversary model
at reasonably large scale
The Scenario




Two data miners mine data from n users
The data miners are semi-honest: follow the
protocol but try to get more info
Some fraction of users can be malicious: they
may input bogus data to disrupt the
computation
A more realistic adversary model than most
existing privacy-preserving data mining
schemes
Model
f
di in Zφm
φ: < 32 or 64-bit
u1
u2
d1
d2
……
Challenge: standard
cryptographic tools not
feasible at large scale
un-1
un
dn-1
dn
Must be
obfuscated
A Practical Solution


Provable privacy: Cryptography
Efficiency:



VSS over small field.
Minimize the number of expensive primitives and
rely on probabilistic guarantee
Realistic adversary model: An extremely
efficient zero-knowledge proof to bound the
L2-norm of a user’s vector. An effective way
to limit the influence malicious users could
have on the computation
Basic Approach
f
=
u1
u2
d1
d2
No leakage beyond final
result for many algorithms
or differential privacy
[Dwork06]
Σ
……
un-1
un
dn-1
dn
Cryptographic privacy
The Power of Addition

A large number of popular algorithms can be run
with addition-only steps





Linear algorithms: voting and summation, nonlinear
algorithm: regression, SVD, PCA, k-means, ID3, EM etc
All algorithms in the statistical query model [Kearns 93]
Many other gradient-based numerical algorithms
A trick used a lot for parallelization in distributed
computing [Chu 06, Das 07]
Addition-only framework has very efficient private
implementation in cryptography and admits efficient
ZKPs
Private Addition


The computation: secret sharing over
small field
Malicious users: efficient zero-knowledge
proof to bound the L2-norm of the user
vector
Big Integers vs. Small Ones



Most applications work with “regular-sized”
integers (e.g. 32- or 64-bit). Arithmetic operations
are very fast when each operand fits into a single
memory cell (~10-9 sec)
Public-key operations (e.g. used in encryption and
verification) must use keys with sufficient length
(e.g. 1024-bit) for security. Existing private
computation solutions must work with large
integers extensively (~10-3 sec)
A 6 orders of magnitude difference!
Private Addition
ui
ui + vi = di
vi
di: user i’s private vector.
ui,,vi and di are all in a small
integer field
Private Addition
μ = Σui
ν = Σvi
ui + vi = di
Private Addition
μ
ν
μ = Σui
ν = Σvi
ui + vi = di
Private Addition
μ+ν
Private Addition



Provable privacy
Computation on each server is over
small field: same cost as non-private
implementation – O(m) small field
operations
So the cost for privacy is only due to
verification

For that we have a solution that involves
only O(log m) large field operations
The Need for Verification


Private computation obfuscates user
data. A malicious user could input
anything.
Think of a voting scheme: “Please place
your vote 0 or 1 in the envelope”
Bush 100,000
Gore -100,000
Zero Knowledge Proofs



I can prove that I know X without
disclosing what X is.
I can prove that an encrypted number is a
ZERO OR ONE, i.e. a bit. (6 extra numbers
needed)
I can prove that an encrypted number is a
k-bit integer. I need 6k extra numbers to
do this (!!!)
Bounding the L2-Norm




A natural and effective way to restrict a
cheating user’s malicious influence
You must have a big vector to produce large
influence on the sum
Perturbation theory bounds system change
with norms:
|σi(A) - σi(B)| ≤ ||A-B||2 [Weyl]
Can be the basis for other checks

Setting L = 1 forces each user to have only 1 vote
An Efficient ZKP of Boundedness


Luckily, we don’t need to prove that every number in a
user’s vector is small, only that the vector is small.
The server asks for some random projections of the
user’s vector, and expects the user to prove that the
square sum of them is small.
• O(log m) public key crypto operations (instead of O(m)) to prove
that the L-2 norm of an m-dim vector is smaller than L.
• Running time reduced from hours to seconds.
Random Projection-based
L2-Norm ZKP



Server generates N random m-vectors in
{-1, 0, +1}m with i.i.d. probability {¼, ½, ¼}
User projects his data to the N directions.
provides ZKP that the square sum of the
projections < NL2/2
Expensive public key operations are only on
the projections and the square sum
Effectiveness
Acceptance/rejection Probabilities
(a) Linear and (b) log plots of probability of user input acceptance as a function of |d|/L for N =
50. (b) also includes probability of rejection. In each case, the steepest (jagged curve) is the
single-value vector (case 3), the middle curve is Zipf vector (case 2) and the shallow curve is
uniform vector (case 1)
Performance Evaluation
(a) Verifier and (b) prover times in seconds for the validation protocol where (from top to
bottom) L (the required bound) has 40, 20, or 10 bits. The x-axis is the vector length.
•
Standard technique takes 6 to 10 hours at m = 106
Current Status





The protocols (the L2-norm ZKP and the
private vector addition) have been
implemented
Adding more mid-tier components
In Java using native code for big integer
Runs on Linux platform
Made an open-source toolkit for building
privacy-preserving real-world applications
More info



[email protected]
http://www.cs.berkeley.edu/~duan/rese
arch/p4p.html
Thank You!