Transcript Slides

Towards Robustness in Query
Auditing
Shubha U. Nabar
Stanford University
VLDB 2006
Joint Work With
B. Marthi, K. Kenthapadi, N. Mishra, R. Motwani
Data Mining vs Privacy
• Large amount of data available in digital form
• Statisticians query data to mine useful trends
• Potential for privacy breaches
Online Query Auditing
• Given a stream of queries over a DB containing
private information, when should queries be denied
to protect privacy?
• Our focus:




Statistical DBs: census, hospital, employee
Only one private attribute, e.g., salary, disease
Statistical queries over private attribute: sum, max, mean
Stream of queries of single type from single user
Online Query Auditing
Adversary
Alice’s
salary =
$42,000!
Company Database
Name Age Sex Salary
Sum of salaries of
female employees
42,000
Alice
23
F
42K
Bob
25
M
50K
Carl
30
M
80K
Dave 21
M
35K
Online Query Auditing
• In general, more complex queries can be posed
and answers put together to deduce information
• Task of auditor: deny query when answer to
current and past queries can be “stitched together”
to leak information.
Our Contributions
• Auditor for max queries
• Auditor for combinations of max and min queries
• A first analysis of the utility of an auditing scheme
Related Work
• Perturbing data itself [W ‘65, AS ‘00, EGS ’03, CDMSW ‘05]
• Perturbing results supplied to user [DN ‘03, DMNS ‘06]
Statisticians unhappy with addition of noise
Auditors provide exact answers if at all
Previous Work
• Restricting Size and Overlap of Queries
[Dobkins, Jones, Lipton ‘79]
• Offline Auditing
[Chin ‘86]
• Auditing for Boolean Attributes
[Kleinberg, Papadimitriou, Raghavan ‘03]
• Auditing Compliance with a Hippocratic Database
[Agrawal, Bayardo, Faloutsos, Kiernan, Rantzau, Srikant ’04]
• Simulatable Auditing
[Kenthapadi, Mishra, Nissim ‘05]
Naïve Auditor
If answer to current query causes an element to
be determined, deny
Adversary
Carl’s salary =
$80,000!
Company Database
max salary{Alice,Bob,Carl}
80,000
max salary{Alice,Bob}
denied
Name Age Sex Salary
Alice
23
F
42K
Bob
25
M
50K
Carl
30
M
80K
Dave 21
M
35K
Simulatability
• Denials based on answer to current query may
cause privacy breach
• Solution: If attacker can simulate and predict
decision to deny ) denials do not leak information
• Auditor: If there is any dataset consistent with past
answers in which current query causes breach, deny
 Attacker can check condition himself
 Denials do not leak information
Goal
Find online, efficient, simulatable, high-utility
auditors for various classes of queries
Definition of Privacy Breach
• Full Disclosure: some private data point can be
uniquely determined
 e.g. max{xa, xb, xc} = 10 max{xa, xb} = 8 ) xc = 10
• Partial Disclosure (probabilistic compromise):
significant change in attacker’s confidence about
some private data value
Probabilistic Compromise
• Private data known to be drawn according to D
• Range of each data point divided in to intervals
qt
SDB
at
query
0
1
Prior
0
1
Posterior
Outline
•
•
•
•
•
•
Problem Statement
Previous Work
Auditing Max Queries
Auditing Max and Min Queries
Utility
Future Work
See paper for auditing against full disclosure
Skeleton of Probabilistic Auditor
1. Attacker poses query qt
2. Attacker has posterior distribution over answer to qt,
given previous answers
3. Auditor repeatedly:
a. Samples possible answer from this distribution
b. Checks if sampled answer will change attacker’s belief
about some data point
4. If qt “unsafe” in significant fraction of samples, deny
Need to estimate posterior distributions in 2. and 3b.
Probabilistic Max Auditor
• Assumption: dataset drawn uniformly at random
from set of duplicate-free points in [,]n
 For each xi and any interval in [α, ] prior prob uniform
• Given answers to set of queries, what are
posterior probabilities?
Probabilistic Max Auditor
• Given queries q1…qt and answers a1…at create
synopsis Bmax
• Bmax contains predicates
[max(S1) = a1], [max(S2) < a2]…
Sis are disjoint
• Bmax enables succinct representation of audit trail
• Bmax enables computation of posterior probabilities
Determining Posterior Probabilities
max{xa, xb, xc} = 0.75
• Pr{xa = 0.75} = 1/3, since any
one of xa, xb or xc is equally
likely
be max
[0,0.25]}
Pr{x
0.75}
2to
[0.25,0.5]}
[0.5,0.75)}
a =
xa
(0.75, 0, 0)
(0, 0, 0.75)
xb
• With remaining 2/3
probability, xa is uniformly
distributed in [0,0.75)
(0, 0.75, 0)
xc
Probabilistic Max Auditor
1. Attacker poses query qt
2. Attacker has posterior distribution over answer to qt,
given previous answers
3. Auditor repeatedly:
a. Samples possible answer from this distribution
b. Checks if sampled answer will change attacker’s belief
about some data point
4. If qt “unsafe” in significant fraction of samples, deny
Can give guarantees on probability that adversary learns
new information
Outline
•
•
•
•
•
•
Problem Statement
Previous Work
Auditing Max Queries
Auditing Max and Min Queries
Utility
Future Work
Probabilistic Max-and-Min Auditor
• Computing posterior probabilities becomes harder
• Given queries, create synopsis so that a data point
occurs in at most one max and one min predicate
Equivalent Graph Coloring Problem
a, b, c
d, e
max{xa, xb, xc} = 1
min{xa, xb} = 0.2
max{xd, xe} = 2
min{xc, xd, xe} = 0.5
Every valid coloring corresponds to a set of
consistent datasets
a, b
c, d, e
Probabilistic Max-and-Min Auditor
We show
 Can sample consistent dataset according to posterior
distribution by sampling valid coloring according to
distribution P
 Can sample valid coloring according to P using
markov chain over colorings
 Can use sampled colorings to answer questions
about posterior distribution of data points up to
arbitrary precision
See paper for details
Outline
•
•
•
•
•
•
Problem Statement
Previous Work
Auditing Max Queries
Auditing Max and Min Queries
Utility
Future Work
Utility
• Several dimensions of utility:




How many queries are answered?
What kinds of queries are answered?
What can be computed?
“Price of simulatability”
• Expected time to first denial
Utility of Sum Auditor
• Consider full disclosure
• No prior knowledge – data points come from
unbounded range
• Queries chosen uniformly at random
Sum Auditor
1
1
1
1
0
1
0
1
1
0
0
0
0
0
0
0
1
0
1
1
xa
xb
xc
xd
xe
1
0
0
0
0
1
0
0
0
0
1
0
0
0
0
1
0
0
0
0
xa
xb
xc
xd
xe
=
=
a1
a2
a3
a4
a2 - a4 + a3
a4 – a3
a1 - a3
a4 - a2
Utility of Sum Auditor
• We show, expected time to first denial
 ¸ n/4
 · n + lgn
• Good news for large databases – answers not
riddled with denials
• Can’t do much better
• Once n-1 independent queries are answered, at
least half the queries will be denied on average
Utility of Sum Auditor
• Reality
 Users do not choose queries uniformly at random
 Users cannot query arbitrary subsets of the data
 Database frequently updated – old information
becomes irrelevant
e.g. q1 = xa + xb + xc ; xa is modified
q2 = xa + xb
q2 will no longer be denied
• Denials may not be so frequent in reality
Utility: Experiments
Plot 1: Sum queries chosen uniformly at random
Plot 2: Sum queries with updates
Plot 3: 1 dimensional range sum queries
Future Work
• Ways to proactively enhance utility
 Deny innocuous queries in the present in the hope that more
can be answered in the future
• Ward off denial of service attacks
• Devise auditors, study utility for more complex queries
• Remove assumptions about prior knowledge
• Solution to collusion