Differential Privacy
Xintao Wu
Oct 31, 2012
Sanitization approaches
• Input perturbation
– Add noise to data
– Generalize data
• Summary statistics
– Means, variances
– Marginal totals
– Model parameters
• Output perturbation
– Add noise to summary statistics
Blending/hiding into a crowd
• K-anonymity-based approaches
• The adversary may have various kinds of background
knowledge to breach privacy
• Privacy models often assume “the
adversary’s background knowledge is
given”
Classic intuition for privacy
• Privacy means that anything that can be
learned about a respondent from the
statistical database can be learned without
access to the database.
• Security of encryption
– Anything about the plaintext that can be
learned from a ciphertext can be learned
without the ciphertext.
• Prior and posterior views about an
individual should not change much
Motivation
• Publicly release statistical information
about a dataset without compromising the
privacy of any individual
Requirement
• Anything that can be learned about a
respondent from a statistical database should be
learnable without access to the database
• Reduce the knowledge gain of joining the
database
• Require that the probability distribution of the
public results is essentially the same,
independent of whether any individual opts into,
or out of, the dataset
Definition
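The slide content is not in the transcript; the standard ε-differential privacy definition it presumably showed is: a randomized mechanism K gives ε-differential privacy if, for all pairs of datasets D_1 and D_2 differing in at most one record,

\[
\Pr[K(D_1) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[K(D_2) \in S]
\quad \text{for all } S \subseteq \mathrm{Range}(K).
\]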
Sensitivity function
• Captures how great a difference must be
hidden by the additive noise
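For a query f, the quantity in question is presumably the standard (global) sensitivity, reconstructed here rather than taken from the slide:

\[
\Delta f \;=\; \max_{D_1, D_2:\, d(D_1, D_2) = 1} \, \lVert f(D_1) - f(D_2) \rVert_1 ,
\]

i.e., the largest change in the query answer that adding or removing a single record can cause.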
Laplace distribution noise
Gaussian noise
Adding Laplace noise
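The transcript omits the formulas; the standard Laplace mechanism the slides presumably covered is the following. The Laplace distribution Lap(b) has density

\[
p(x) = \frac{1}{2b} \, e^{-|x|/b},
\]

and the mechanism releases

\[
K(D) \;=\; f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right),
\]

which satisfies ε-differential privacy. (Gaussian noise N(0, σ²) instead gives the relaxed (ε, δ)-guarantee.)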
Proof sketch
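Reconstructing the usual argument (the slide's own derivation is not in the transcript): for any output value t and datasets D_1, D_2 differing in one record,

\[
\frac{p_{D_1}(t)}{p_{D_2}(t)}
= \exp\!\left(\frac{\varepsilon\,\bigl(|f(D_2) - t| - |f(D_1) - t|\bigr)}{\Delta f}\right)
\le \exp\!\left(\frac{\varepsilon\,|f(D_1) - f(D_2)|}{\Delta f}\right)
\le e^{\varepsilon},
\]

by the triangle inequality and the definition of Δf.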
Plots of the Laplace noise distribution
• Δf = 1, ε varies (ε = 0.01, 0.1, 1, 2, 10)
• Δf = 2, ε varies
• Δf = 3, ε varies
• Δf = 10000, ε varies
Composition
• Sequential composition
– the privacy costs of a sequence of analyses on
the same data add up
• Parallel composition
– for disjoint sets, the ultimate privacy
guarantee depends only on the worst of
the guarantees of each analysis, not the
sum
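Stated formally (standard results, not copied from the slides): if mechanisms M_1, …, M_k satisfy ε_1, …, ε_k-differential privacy respectively, then

\[
\text{sequential (same data): } \varepsilon_{\text{total}} = \sum_{i=1}^{k} \varepsilon_i ,
\qquad
\text{parallel (disjoint subsets): } \varepsilon_{\text{total}} = \max_{i} \varepsilon_i .
\]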
Example
• Let us assume a table with 1000 customers, where each
record has the attributes: name, gender, city, cancer, salary.
– For attribute city, we assume the domain size is 10;
– for attribute cancer, we only record Yes or No for each customer;
– for attribute salary, the domain range is 0-10k.
– The privacy threshold ε is a constant 0.1 set by the data owner.
• Consider the single query “How many customers got cancer?”
• The adversary is allowed to ask the above query three times
(a sketch follows below).
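A minimal Python sketch of how this example plays out under the Laplace mechanism; the counts, variable names, and the even three-way budget split are illustrative assumptions, not taken from the slides.

import numpy as np

rng = np.random.default_rng()

def laplace_count(true_count, sensitivity, epsilon):
    # Release a count perturbed with Laplace noise of scale sensitivity/epsilon.
    return true_count + rng.laplace(scale=sensitivity / epsilon)

# "How many customers got cancer?": adding or removing one customer changes
# the count by at most 1, so the global sensitivity is 1.
epsilon_total = 0.1
true_cancer_count = 42            # hypothetical value for illustration

# A single release may consume the whole budget (noise scale 1/0.1 = 10):
print(laplace_count(true_cancer_count, sensitivity=1, epsilon=epsilon_total))

# If the adversary may ask the same query three times, sequential composition
# forces the budget to be split, so each answer is noisier (scale 30 per query):
epsilon_per_query = epsilon_total / 3
for _ in range(3):
    print(laplace_count(true_cancer_count, sensitivity=1, epsilon=epsilon_per_query))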
Example (continued)
• “How many customers got cancer in each
city?”
• For one single query “What is the sum of
salaries across all customers?” (see the sketch below)
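A Python sketch in the same spirit; the concrete counts and salary total are hypothetical, and the point is the differing sensitivities and the use of parallel composition over disjoint cities.

import numpy as np

rng = np.random.default_rng()

def laplace(value, sensitivity, epsilon):
    return value + rng.laplace(scale=sensitivity / epsilon)

epsilon = 0.1

# Per-city cancer counts: the 10 cities partition the customers (disjoint sets),
# so by parallel composition each cell can be answered with the full budget.
true_counts_by_city = [5, 3, 7, 2, 4, 6, 1, 8, 3, 3]       # hypothetical counts
noisy_histogram = [laplace(c, sensitivity=1, epsilon=epsilon)
                   for c in true_counts_by_city]

# Sum of salaries: one customer can change the sum by up to 10,000 (the domain range),
# so the sensitivity is 10,000 and the noise scale is 10,000 / 0.1 = 100,000.
true_salary_sum = 4_500_000                                  # hypothetical value
noisy_salary_sum = laplace(true_salary_sum, sensitivity=10_000, epsilon=epsilon)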
Type of computing (query)
• some are very sensitive, others are not
• single query vs. query sequence
• query on disjoint sets or not
• outcome expected: number vs. arbitrary
• interactive vs. non-interactive
Sensitivity
• Global sensitivity
• Local sensitivity
• Smooth sensitivity
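For reference, the standard definitions behind these terms (reconstructed, not from the slide):

\[
GS_f = \max_{D}\; \max_{D':\, d(D, D') = 1} \lVert f(D) - f(D') \rVert_1 ,
\qquad
LS_f(D) = \max_{D':\, d(D, D') = 1} \lVert f(D) - f(D') \rVert_1 .
\]

Global sensitivity maximizes over all datasets; local sensitivity is specific to the dataset at hand; smooth sensitivity is a smoothed upper bound on local sensitivity that can be used to calibrate noise without the bound itself leaking information.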
Different areas of DP
• PINQ
• DM with DP
• Optimizing linear counting queries under
differential privacy
– matrix mechanism for answering a
workload of predicate counting queries
PPDM interface – PINQ
• A programmable privacy preserving layer
• Add calibrated noise to each query
• Need to assign a privacy cost budget (a toy sketch follows below)
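A toy Python sketch of the idea of such a layer; this is not the actual PINQ API (PINQ is a C# LINQ wrapper), only an illustration of budget tracking plus calibrated noise.

import numpy as np

rng = np.random.default_rng()

class PrivateTable:
    # Toy privacy-preserving query layer: tracks a budget and noises each answer.
    def __init__(self, rows, epsilon_budget):
        self.rows = rows
        self.remaining = epsilon_budget

    def noisy_count(self, predicate, epsilon):
        # Refuse to answer once the assigned budget is exhausted.
        if epsilon > self.remaining:
            raise ValueError("privacy budget exhausted")
        self.remaining -= epsilon
        true_count = sum(1 for r in self.rows if predicate(r))
        return true_count + rng.laplace(scale=1.0 / epsilon)  # counts have sensitivity 1

# Usage: each analysis must declare how much of the budget it spends.
table = PrivateTable(rows=[{"cancer": "Yes"}, {"cancer": "No"}], epsilon_budget=0.1)
print(table.noisy_count(lambda r: r["cancer"] == "Yes", epsilon=0.05))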
Data Mining with DP
• Previous study – a privacy-preserving interface
(such as PINQ) ensures everything about DP
• Problem – inferior results if the interface is
used naively during data mining
• Solution – consider both together
• DP ID3
– noisy counts
– evaluate all attributes in one exponential-mechanism
query using the entire budget, instead of splitting the
budget among multiple queries (see the sketch below)
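A Python sketch of the exponential-mechanism step described above; the quality function (e.g., information gain) and the sensitivity bound are the usual choices, and the names here are illustrative rather than taken from the DP ID3 paper.

import numpy as np

rng = np.random.default_rng()

def exponential_mechanism(candidates, quality, epsilon, sensitivity):
    # Pick one candidate with probability proportional to exp(eps * q / (2 * sensitivity)).
    scores = np.array([quality(c) for c in candidates], dtype=float)
    # Subtract the max score for numerical stability before exponentiating.
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# In DP ID3, all candidate split attributes at a node are scored in one such query,
# spending the node's entire budget, rather than issuing one noisy-count query per
# attribute and splitting the budget among them.
# best_attr = exponential_mechanism(attributes, info_gain, epsilon=0.1,
#                                   sensitivity=quality_sensitivity)  # placeholders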
DP in Social Networks
• Pages 97-120 of the PAKDD'11 tutorial