Transcript Document

Statistical Databases – Query
Auditing
Li Xiong
CS573 Data Privacy and Anonymity
Partial slides credit: Vitaly Shmatikov, Univ Texas at Austin
Query Audit Problem
 Maintaining “privacy” of data
Auditor
Q1 Q 2 … Qn
…
Q
Database
A1 A2 … An
A or “Denied”
Does answer to Q combined
with answers to Q1,…,Qn
reveal something?
slide 2
Variations of the Problem
Specifies subset of the variables
Auditor
Q1 Q 2 … Qn
…
Database
A1 A2 … An
List of real, integer, or Boolean values
Wants to learn
value of some
variable
Min, max, median, sum, average,
or count of specified subset
slide 3
Offline vs. Online
 Offline auditing
Given a collection of queries and answers to them,
check whether anything “forbidden” was revealed
 Detects privacy breaches after the fact
 Online auditing
 Queries are presented to auditor one at a time;
auditor checks if answering the current query (in
combination with past answers) reveals “forbidden”
information
 Prevents privacy breaches on-the-fly

slide 4
Disclosure measures
 Full disclosure (exact-value disclosure) – the
exact value of a protected attribute is
disclosed
 Partial disclosure (interval-based disclosure)
– the disclosed range (difference of the lower
and upper bounds) of the protected attribute
is smaller than a predefined threshold
 Probability-based disclosure – the posterior
distribution of the data after answering
queries is significantly different from its prior
distribution
Offline Auditors for Full
Disclosure
 Sum queries
 Max and min queries
 Median and average queries
Audit Expert (Chin 1982)
 Query auditing method for SUM queries
 A SUM query can be considered as a linear equation
where
is whether record i belongs to the query set, xi
is the sensitive value, and q is the query result
 A set of SUM queries can be thought of as a system of
linear equations
 Maintains the binary matrix representing linearly
independent queries and update it when a new query is
issued
 A row with all 0s except for ith column indicates disclosure
Offline Auditing for Full
Disclosure
 Arbitrary combinations of aggregate queries
 It is unlikely there will be efficient on-line
application algorithms for SUM + MAX
queries, SUM + MIN queries, or SUM + MAX
+ MIN queries.
 Example hardness results:
Theorem. There is no polynomial time fulldisclosure auditing algorithm for sum and
max queries unless P=NP.
Auditing Sum Queries on Booleans
 Database: collection of secret Boolean variables
 Query: specifies subset S of variables
 Answer: sum of variables in S
 Privacy breach: after asking several queries, user
learns the value of some secret variable(s)
 Auditing problem: given a set of Boolean
equations, is there a variable that has the same
value in all solutions?
Auditing Boolean attributes, Kleinberg, 2000
slide 9
Why Is This Interesting?
 Linear Diaphantine equations
 Query can be safe on real-valued, unbounded
data, but reveal information when the data are
discrete, with known bounds
x+y+w=1
y+z=1
x+z=1
Real: multiple solutions, secure
Boolean: unique solution, insecure (why?)
 Hardness results: the auditing problem for
Boolean values is coNP-complete.
slide 10
Offline Auditor for Partial
Disclosure
 Partial disclosure – the disclosed range of the protected
attribute is smaller than a predefined threshold
 Sum queries
 Interval-based disclosure: monitoring upper and lower
bounds for all confidential attributes
Auditing interval-based inference. Li et al. 2002
Offline Auditor for Partial
Disclosure
 New query
 A series of linear programming problems
Incremental evaluation of the LP
 Treats the auditing problem as a series of
updation problems and updates the bounds
with certain rules
 Horizontal updation – given the same set of
queries, the bounds of one variable, how can
the prior result be modified to get the bounds
of other variables
 Vertical updatation – given the same set of
variables, and bounds under the previous
queries, how can the prior result modified to
get the bounds when a new query arrives
An efficient online auditing approach to limit private data disclosure, Lu, 2009
Online Auditing
Auditor
Qi+1
Database
A or “Denied”
“Denied” if answering
Qi+1 would cause a
privacy breach
Previous queries
Q1 … Qi
slide 14
Online Auditing
 Given a sequence of queries and
corresponding answers, and a new query,
determine if the new query should be
answered or denied in order to prevent a
privacy breach.
 Earliest approaches



Query set size control
Query set overlap control
Limited data utility
Offline to Online?
 Can an offline auditor directly solve the online
auditing problem?
 Denials leak information!
Sounds Familiar?
[slide stolen from Kobbi Nissim]
Colonel Oliver North, on the Iran-Contra arms deal
“On the advice of my counsel I respectfully and
regretfully decline to answer the question based on
my constitutional rights.”
David Duncan, former auditor for Enron and
partner in Arthur Andersen
“Mr. Chairman, I would like to answer the
committee's questions, but on the advice of my
counsel I respectfully decline to answer the
question based on the protection afforded me
under the Constitution of the United States.”
slide 17
Example: Sum/Max
 Variables di are real, privacy breached if
adversary learns some di
Auditor
Gimme sum(d1,d2,d3)
Answer=15
Gimme max(d1,d2,d3)
Database
“Denied”
Oh
well
Wait… there must be
a reason why second
query was denied
The only possible reason for
denial is if d1=d2=d3=5
slide 18
Online Audit
 Denials reduce the search space
Possible assignments to {d1,…,dn}
Assignments consistent with
(q1,…qi; a1,…,ai)
qi+1 denied
slide 19
One workaround
 Deny whenever the offline algorithm does, in
addition, randomly deny queries.
 Issues



Leakage is not prevented
Have to remember which queries were
randomly denied
Semantically determine whether two queries
are equivalent
Simulatable Auditing
 Observation: denials have the potential to
leak information if the auditor uses
information that is unavailable to the attacker
(the answer to the new query)
 Simlatable auditing: the attacker should be
able to simulate or mimic the auditors
decisions to answer or deny a query
 Denials provably do not leak information
Simulatable Auditing, Kenthapadi, 2005
Simulatable Auditing
 An auditor is a function of Q, A and X
 An auditor is simulatable if there exists a simulator
that is a function of only Q and A-ai+1 and whose
output is always equal to the auditor.
q1,…,qi
Database
qi+1
qi+1
q1,…,qi
a1,…,ai
Auditor
Deny or answer
Simulator

Deny or answer
slide 22
Simulatable Auditing
Possible assignments to {d1,…,dn}
Assignments consistent with
(q1,…qi, a1,…,ai )
qi+1 denied/allowed
slide 23
Simulatable Auditing
 Query-set-size control
 Query-set-overlap control
 Audit expert for sum queries
Constructing Simulatable Auditor
 General sufficient condition: the auditor
should determine if there is any possible
dataset, consistent with all past responses, in
which the answer to the current query would
cause some element to be fully disclosed
Example revisited: Sum/Max
Auditor
Gimme sum(d1,d2,d3)
Answer=a1
Gimme max(d1,d2,d3)
Database
“Denied”
 Simulatable auditor would always deny the max
query following a sum query
 Lose some utility due to the requirement of
simulatability
slide 26
Challenges
 Privacy definition
Privacy of groups/families
 Algorithmic limitations
 Simulatable algorithms computationally prohibitive
 Most work on sum queries, some on max, min,
median, hardness results on mixed queries
 Collusion
 Reduced utility for legitimate users
 Large audit trail
 Utility


Percentage of denials may not be the best measure