An Ad Omnia Approach to Defining and Achieving Private

Download Report

Transcript An Ad Omnia Approach to Defining and Achieving Private

I’m in the Database, But Nobody Knows
Cynthia Dwork, Microsoft Research
Many Threats to Privacy of Electronic Data
Theft
 Phishing
 Viruses
 Cryptanalysis
 Changing Privacy Policies
…

This Talk: Privacy-Preserving Data Analysis
C

“First Tier” Motivating Examples


?
Analysis of Census Data, Medical Outcomes Data, GWAS data,
Epidemiology, Analysis of Vehicle Braking Records
“Second Tier” Examples

Training an advertising classifier, Recommendation System, Netflix
Challenge
“Pure” Privacy Problem
C

Difficult Even if


Curator is Angel
Data are in Vault
?
Typical Suggestions

“Large Set” Queries



Add Random Noise to True Answer



How many MSFT employees have Sickle Cell Trait (CST)?
How many MSFT employees who are not female Distinguished
Scientists with very curly hair have the CST?
Average of responses to repeated queries converges to true answer
Can’t simply detect repetition (undecidable)
Detect When Answering is Unsafe

Refusal can be disclosive
A Litany
AOL Search History Release (2006)
Name: Thelma Arnold
Age: 62
Widow
Residence: Lilburn, GA
William Weld’s Medical Record [Sweeney02]
HMO data
ethnicity
visit date
ZIP
diagnosis
birth
date
procedure
medication
total charge
sex
name
address
date reg.
party
affiliation
last voted
voter registration
data
GWAS Membership [Homer et al. ‘08]
…
T
C
…
…
T
T
…
SNP: Single Nucleotide (A,C,G,T) polymorphism
Reference Population
Major Allele (C): 94%
Minor Allele (T): 6%
Genome-Wide Association Study
Allele frequencies for many
thousands of SNPS
Anonymized Social Networks [BackstromDK07]

Magic Step


Isolate lightly linked-in subgraphs from rest of graph
Special structure of subgraph permits finding A, B
A
S
B
J
Definitional Failures

Failure to Cope with Auxiliary Information

Existing and future databases, newspaper reports, Flikr, literature, etc.

Definitions are Syntactic and Ad Hoc

Dalenius’s Ad Omnia Guarantee (1977):
Anything that can be learned about a respondent from the statistical
database can be learned without access to the database
 Unachievable
Parable: How Tall is Pamela Jones (Groklaw)?
An Admittedly Unreasonable Impossibility Proof



Database teaches average heights of population subgroups
“PJ is 2 inches shorter than avg Swedish woman”
PJ’s height learnable with the DB, not learnable without.
PJ loses privacy whether or not she is in the database


Suggests new notion of privacy: risk incurred by joining DB
The outcome of any analysis is essentially equally likely, independent
of whether any individual joins or refrains from joining the dataset.
(The likelihood is over the choices made by the algorithm.)
Differential Privacy
[Dwork-McSherry-Nissim-Smith 2006]
M gives  - differential privacy if for all adjacent D1 and D2, and

all C µ range(M): Pr[ M (D1) 2 C] ≤ e Pr[ M (D2) 2 C]
Neutralizes all linkage attacks.
Composes unconditionally and automatically: Σi  i
ratio bounded
Pr [response]
Bad Responses:
X
X
X
http://research.microsoft.com/en-us/projects/DatabasePrivacy/
Differential Privacy

Resilience to All Auxiliary Information


Low-error high-privacy DP techniques exist for many problems


Past, present, future data sources and algorithms
datamining tasks (association rules, decision trees, clustering, …),
contingency tables, histograms, synthetic data sets for query logs,
machine learning (boosting, statistical queries learning model, SVMs,
logistic regression), various statistical estimators, network trace
analysis, recommendation systems, …
Programming Platforms


http://research.microsoft.com/en-us/projects/PINQ/
http://userweb.cs.utexas.edu/~shmat/shmat_nsdi10.pdf
Download
today!
Can we store and share your information with health officials and
researchers?
“…This information can be very helpful in monitoring regional health
conditions, plan flu response, and conduct health research. By allowing the
responses to the survey questions to be used for public health, education
and research purposes, you can help your community.”
Snow 1854
Cholera
cases
Suspected
pump
https://h1n1.cloudapp.net/Privacy.aspx
“Microsoft may also disclose information if required to do so by law or in
the good faith belief that such action is necessary to (a) conform to the
edicts of the law or comply with legal process served on Microsoft or the
Site; (b) protect and defend the rights or property of Microsoft and our
family of Web sites; or (c) act in urgent circumstances to protect the
personal safety of users of Microsoft products or members of the public.”
Mission Creep
“Think of the children!”
C
Never store the data!
?
Pan-Private Streaming Algorithms [DNPRY10]

Private “inside and out”
Completely hide the pattern of appearances of any individual

Presence, absence, frequency, etc.
Protect against mission creep, subpoena, intrusion


DiffeP: Limitations and Challenges



Can’t study outliers
Privacy erosion over multiple analyses is cumulative
Privacy erosion over multiple databases is cumulative



Compare real world to one in which my info is everywhere deleted,
looking at a lifetime of exposure against worst-case
adversary/information/collection of databases
Formally capture “reasonable” worlds?
What are the right questions to ask about social networks?

Removing one person can affect data of many other people
Utility Implies Exposure to Harm




Database teaches that smoking causes cancer.
 Smoker S’s insurance premiums rise.
But learning that smoking causes cancer is the whole point.
 Smoker S enrolls in a smoking cessation program.
May be fine for “First-Tier” Uses, but what about “Second Tier”?
Who decides, and how?
Pause

“Ad Omnia” definition of privacy



Achievable, frequently with great accuracy
Usable


Composes automatically and obliviously; independent of aux info
Can program using a privacy-preserving interface
Many questions remain

Is there a weaker “ad omnia” definition than differential privacy that
also composes automatically?
Which Ad(s) Am I Charged For?
Å
Å
More Subtle Attack





A potentially embarrassing interest:
In a Long Leg Cast Anna … models her cast on a stoop,
wiggling/rubbing her casted toes. Length: 90 minutes
User clicks on ad targeted to wealthy, whale loving LLC enthusiasts
) user is wealthy, whale loving LLC enthusiast
User understands that interest in LLC is communicated to seller, but
does not understand that wealth and love of whales are also
communicated
Should privacy of these attributes be protected? What about race?
Fairness: should poor children see different ads than rich children?
Wall Street Journal 4/4/2010






User visits capitalone.com
Capital One uses tracking information via tracking network [x+1] to
personalize offers
Danger: Steering minorities into higher rates
In principle, law seems to allow credit outcomes based on
browsing history; may encode race
How can an ad network prevent steering?
What is fairness in classification, and how can we achieve it?
Work in Progress
D., Fiat, Hardt, Pitassi, Reingold, Zemel
Thank You!
-4R
-3R
-2R
-R
0
R
2R
3R
4R
5R