Transcript Talk ppt

Privacy Preservation of Aggregates in
Hidden Databases: Why and How?
Arjun Dasgupta, Nan Zhang,
Gautam Das, Surajit Chaudhuri
Presented by PENG Yu
Outline
Introduction
 Problem Definition
 Our Approach
 Privacy Guarantee
 Experiments
 Conclusion

Privacy leakage

An airline company’s flight search form lets a user search for a
flight by specifying a set of attributes such as departure and
destination, date, number of stops, carrier, and cabin preferences.
Privacy of Preservation of Aggregates

Reasons:




Legitimate interfaces give chances to attackers to detect the
sensitive aggregates information.
Aggregates information can be used by adversaries to master the
whole distribution and other features of the hidden databases behind
the interfaces.
To some extent, aggregates information is more useful than
individual information.
Challenge:
Given a hidden database, develop techniques that make it very
difficult to obtain uniform random samples of the database via its
search interface without necessitating human intervention.
Privacy of Preservation of Aggregates

Some Assumptions
Data is only accessible through a webbased interface
 Consider sampling attacks only
 Keep bona fide users unaffected
 External knowledge is omitted
 Consider Boolean attribute and extend it to
categorical or numerical one

Outline
Introduction
 Problem Definition
 Our Approach
 Privacy Guarantee
 Experiments
 Conclusion

Preliminaries

Terms:
D: database table
 m: number of tuples in D
 Qs: search query
 Sel(Qs): the result set of tuples in D that satisfy
Qs
 n: number of predicates in Qs


Notification
If |Sel(Qs)|>k, only the top-k tuples in Sel(Qs)
will be returned according to a ranking function.
Preliminaries (Cont.)

A query Qs is called
–
–
–
Underflow; if |Sel(Qs)|=0
Overflow; if |Sel(Qs)|>k
Valid; if 0<|Sel(Qs)|≤k
Universal space Ω : the set of all possible
search queries
 Active space Θ : a subset of Ω containing
only those queries that are candidates for
issuing at a subsequent time

Problem Definition

(ε,δ)-privacy
For a sensitive aggregate query QA:

(ε,δ,p)-privacy

Problem
Outline
Introduction
 Problem Definition
 Our Approach
 Privacy Guarantee
 Experiments
 Conclusion

Our Approach

Observation
To obtain a uniform random sample tuple t, a sampler must have
discovered at least one valid research query that contains t in its
result.

Main idea
In order to thwart sampling attacks, we carefully construct and
insert dummy tuples into databases such that most valid and
some underflowing queries are converted to overflowing queries.
Single-Sample Attack
Observation
|Ω|=3n
Pr(picking a valid query)≤m•(2/3)n
 Three possible outcomes of Q1:

–
–
–
underflow : the size of Θ shrinks to 3n-1
overflow : the size of Θ shrinks to 3n-1
valid: the size of Θ shrinks to 1
Single-Sample Attack and Defense

Three possible outcomes of Qc:
–
–
–

underflow : the size of Θ shrinks to (c+1)3n-c
overflow : the size of Θ shrinks to |Θ|/3c
valid: the size of Θ shrinks to 1
Key Observation:
−
−
Shrinking Θ significantly reduces sampling
query cost.
Valid queries as well as long overflowing
queries contribute the most to shrinking Θ.
Single-Sample Defense

Techniques: Neighbor Insertion
It is difficult to find long overflowing queries,
with Pr ≤ m/2c.
 Short valid queries are the most dangerous
threat. We insert dummy tuples into the
“neighboring zone” of real tuples, such that
all valid queries with fewer than b predicates
will overflow, b is a parameter.

Multi-Sample Attack and Defense
Similarly, we analyze the shrinkage of ΘE
and ΘF , and try to minimize it.
Multi-Sample Attack and Defense

Three possible outcomes of Qc:
–
–
–

underflow : up to (c+1)3n-c queries should be removed
from both ΘE and ΘF.
overflow : 2c queries removed from ΘE, ΘF can be as
small as |ΘE|/3c .
valid: similar to underflow, (c+1)3n-c queries should be
removed from both ΘE.
Key Observations


Shrinking ΘE contributes more to the efficiency of
sampling than shrinking ΘF.
Short underflowing queries become a major threat to
defense.
Multi-Sample Defense

Techniques: High-Level Packing
To convert short underflowing queries to
overflowing ones, we add dummy tuples
such that all underflowing queries with fewer
than d predicates will overflow, d is a
parameter.
 For example:
SELECT * FROM D WHERE a1=1

when k=1, we add <1,0,…,0> and <1,0,…,1>
COUNTER-SAMPLER Algorithm
Extensions
The COUNTER-SAMPLER can be
directly applied to both Boolean and
categorical databases.
 For numerical data, we can use
discretization techniques to convert it
into categorical data.

Outline
Introduction
 Problem Definition
 Our Approach
 Privacy Guarantee
 Experiments
 Conclusion

Privacy Guarantee
Outline
Introduction
 Problem Definition
 Our Approach
 Privacy Guarantee
 Experiments
 Conclusion

Delay of Sampling for Boolean
Delay of Sampling for categorical
Efficiency
Outline
Introduction
 Problem Definition
 Our Approach
 Privacy Guarantee
 Experiments
 Conclusion

Conclusion

Main contributions
Develop a dummy tuple insertion method to
prevent sampling of hidden databases.
 Extend it to categorical and numerical
databases


Future Directions
Integration of dummy insertion and query
auditing
 Take external knowledge in to consideration

Thank you!