The Inference Problem in Statistical Databases


Nabil Tabbaa

Introduction

Background Definitions

The Inference Problem

Limiting Disclosure Risk

Disclosure Risk vs. Data Utility

Introduction





Large amounts of data about individuals are collected (medical, educational, human services records…).
These data are invaluable to researchers in a wide range of fields.
Many agencies rely on publicly released census data.
Numerous research projects depend on publicly available medical or educational data sets.

Sensitive information about an individual must remain private for:
Ethical reasons.
Legal reasons.
Trust between a data-collecting agency and its respondents is very important:
Respondents may alter their responses, or simply not respond at all, to some surveys.
Groups or individuals often have incentives to use data maliciously.

Background Definitions


Data (values collected from respondents)

Categorical values (e.g., marital status).
Magnitude values (e.g., income).
Summarized using aggregation functions.

Sensitive Information

A cell in a table is considered sensitive (or unsafe) if publishing its value could disclose specific information about a respondent.

Attackers

An attacker aims to gain access to the details in the sensitive cells of a table.
The attacker works with the published information, the structure of the table, and any a priori knowledge that may be publicly available.

Output pattern

Several methodologies are used to protect the sensitive information in a table.
The output of a methodology is called a pattern.

Loss of information

The information loss of a pattern depends very much on the protection methodology.
The optimization problem underlying a methodology is the problem of finding a protected pattern with minimum loss of information.

The Inference Problem



Inference problems are security concerns that arise when users deduce sensitive information about the database from relatively trivial information.
Inference problems differ from other security problems in that they are not a matter of unauthorized access to data or leakage of information.

Inference rules

Subsume rule: the result of one query and the result of another query together correspond to the same tuple.
Overlapping rule: some of the values returned by a query match some of the values of another query.
Complementary rule: taking the difference between two sets of queries.
Functional dependency rule: based on the relationship between the attributes of a database.
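The complementary rule can be illustrated with a minimal sketch. The table, names, and salaries below are invented for illustration, and `sum_salary` is a hypothetical stand-in for a statistical query interface that only ever returns aggregates. Two permitted aggregate queries, taken together, reveal one respondent's value.

```python
# Hypothetical toy table: all names and salaries are invented for illustration.
records = [
    {"name": "Ann", "dept": "R&D", "role": "manager", "salary": 90},
    {"name": "Bob", "dept": "R&D", "role": "engineer", "salary": 70},
    {"name": "Cy", "dept": "R&D", "role": "engineer", "salary": 65},
    {"name": "Dee", "dept": "Sales", "role": "engineer", "salary": 60},
]


def sum_salary(predicate):
    """Aggregate query: only a total is released, never a single row."""
    return sum(r["salary"] for r in records if predicate(r))


# Query 1: total salary of everyone in R&D.
q1 = sum_salary(lambda r: r["dept"] == "R&D")

# Query 2: total salary of the engineers in R&D.
q2 = sum_salary(lambda r: r["dept"] == "R&D" and r["role"] == "engineer")

# The two query sets differ by exactly one record (the only R&D manager),
# so the difference of the two totals is that person's salary.
inferred_manager_salary = q1 - q2
print(inferred_manager_salary)
```

Neither query touches an individual record, which is why this counts as inference rather than unauthorized access.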

Inference information








Information that is stored in the database.
The design of the database.
The relationship between the different attributes of
the database.
Statistical data derived from the database.
The existence or absence of data.
The changing values of the data.
Specialized information about the database.
Common knowledge and Common sense.

Limiting Disclosure Risk


Basic methods

Limitation of detail.
Top/bottom coding.
Suppression.
Rounding.
Addition of noise.
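
The basic methods listed above can be sketched as small functions. This is only an illustration: the function names, the band width, the cap, the rounding base, the suppression threshold, and the noise scale are all assumptions chosen for the example, not values from the presentation.

```python
import random


def age_band(age, width=10):
    """Limitation of detail: publish a coarse band instead of the exact age."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def top_code(value, cap):
    """Top coding: report any value above `cap` as `cap`."""
    return min(value, cap)


def bottom_code(value, floor):
    """Bottom coding: report any value below `floor` as `floor`."""
    return max(value, floor)


def round_to_base(value, base):
    """Rounding: publish the nearest multiple of `base` (integer inputs)."""
    return base * ((value + base // 2) // base)


def suppress_small(count, threshold):
    """Suppression: withhold (None) a nonzero cell count below `threshold`."""
    return None if 0 < count < threshold else count


def add_noise(value, scale, rng):
    """Addition of noise: perturb the value with zero-mean Gaussian noise."""
    return value + rng.gauss(0, scale)


incomes = [12_000, 48_520, 51_260, 250_000]  # invented example values
top_coded = [top_code(v, 100_000) for v in incomes]
rounded = [round_to_base(v, 1_000) for v in incomes]
suppressed = [suppress_small(c, 5) for c in [0, 2, 17, 40]]
noisy = [add_noise(v, 500, random.Random(42)) for v in incomes]
```

Each function trades detail for protection in a different way, which is why these methods are usually tuned per variable rather than applied uniformly.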


Sampling


Makes it difficult to verify population uniqueness.
Easy to implement and the resulting sampled data
are relatively easy to analyze.

Matrix Masking

Rather than release the data X, one could release the masked data Y = AXB + C.
Special cases of matrix masking include noise addition, sampling, suppressing sensitive variables, cell suppression, and addition of simulated data.
The analyst must know the masking procedure used.
The analysis of the data can be complex, and special software may be needed.
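
A minimal sketch of Y = AXB + C in plain Python follows. It assumes the usual reading of the formula, in which X has one row per record and one column per attribute, A acts on records, B acts on attributes, and C is an additive displacement. The helper names and the tiny data matrix are invented for illustration.

```python
def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def matadd(P, Q):
    """Add two matrices of the same shape element by element."""
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]


def matrix_mask(A, X, B, C):
    """Return the masked data Y = A X B + C."""
    return matadd(matmul(matmul(A, X), B), C)


# Invented data: 3 records (rows) by 2 attributes (columns).
X = [[1, 10], [2, 20], [3, 30]]
I2 = [[1, 0], [0, 1]]
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Sampling: A keeps records 1 and 3; B is the identity; C is zero.
sampled = matrix_mask([[1, 0, 0], [0, 0, 1]], X, I2, [[0, 0], [0, 0]])

# Suppressing a variable: B keeps only the first attribute.
dropped = matrix_mask(I3, X, [[1], [0]], [[0], [0], [0]])

# Noise addition: A and B are identities; C holds the noise terms
# (fixed numbers here so the result is reproducible).
noisy = matrix_mask(I3, X, I2, [[0.5, -1.0], [-0.5, 2.0], [0.0, 1.0]])

print(sampled)  # [[1, 10], [3, 30]]
print(dropped)  # [[1], [2], [3]]
```

Writing the special cases as choices of A, B, and C makes it visible why the analyst needs to know the masking procedure: the same Y can arise from very different masks.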

Data Swapping and Data Shuffling



Data are swapped in such a way as to maintain the
marginal counts of the table.
Swapping only needs to be performed on sensitive
variables in order to remove the relationship
between the record and the respondent.
Drawbacks: may not maintain multivariate
relationships, analysis of sub-populations may be
affected by the swapping procedure, and the
swapping may result in nonsensical combinations.
Raw Data                 Swapped Data
Record  X  Y  Z          Record  X  Y  Z
1       0  1  0          1       1  1  0
2       0  1  0          2       0  1  0
3       0  0  1          3       0  0  1
4       0  0  1          4       1  0  1
5       1  1  1          5       0  1  1
6       1  0  0          6       1  0  0
7       1  0  0          7       0  0  0
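
The raw and swapped values in the table above can be checked directly. The short sketch below (the helper name `marginal` is an assumption for this example) counts value combinations and compares them: every one-way and two-way marginal count is unchanged by the swap, while the full three-way table is not.

```python
from collections import Counter

# (X, Y, Z) for records 1..7, copied from the raw and swapped tables.
raw = [(0, 1, 0), (0, 1, 0), (0, 0, 1), (0, 0, 1), (1, 1, 1), (1, 0, 0), (1, 0, 0)]
swapped = [(1, 1, 0), (0, 1, 0), (0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 0, 0), (0, 0, 0)]


def marginal(rows, cols):
    """Count how often each combination of the chosen columns occurs."""
    return Counter(tuple(row[c] for c in cols) for row in rows)


# Column indices: 0 = X, 1 = Y, 2 = Z.
low_order = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
preserved = {cols: marginal(raw, cols) == marginal(swapped, cols) for cols in low_order}
three_way_preserved = marginal(raw, (0, 1, 2)) == marginal(swapped, (0, 1, 2))

print(preserved)            # every one-way and two-way marginal matches
print(three_way_preserved)  # False: the joint distribution has changed
```

Only X differs between the two tables (records 1 and 5 exchange values, as do records 4 and 7), and the changed three-way counts are a small instance of the drawback that swapping may not maintain multivariate relationships.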

Synthetic Data



The idea is to view sensitive data as missing values
and replace them using multiple imputation
techniques.
Sensitive attributes would be replaced by random
draws from an appropriate posterior predictive
distribution.
Advantage: the ease with which the data can be
analyzed.
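
As a minimal sketch of the idea, consider a single binary sensitive attribute. The example assumes a Bernoulli model with a uniform Beta(1, 1) prior, so the posterior for the success probability is a Beta distribution and each synthetic data set is a draw from the posterior predictive distribution. The model choice, the function name `synthesize_binary`, and the observed values are assumptions for illustration only.

```python
import random


def synthesize_binary(observed, m=3, seed=0):
    """Create m synthetic copies of a binary attribute.

    For each copy: draw p from the Beta posterior, then draw every
    synthetic value from Bernoulli(p) (a posterior predictive draw).
    """
    rng = random.Random(seed)
    ones = sum(observed)
    zeros = len(observed) - ones
    datasets = []
    for _ in range(m):
        # Posterior under a uniform Beta(1, 1) prior (an assumption here).
        p = rng.betavariate(1 + ones, 1 + zeros)
        datasets.append([1 if rng.random() < p else 0 for _ in observed])
    return datasets


observed = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]  # invented sensitive values
synthetic_sets = synthesize_binary(observed, m=3)
for dataset in synthetic_sets:
    print(dataset)
```

Releasing several synthetic copies, rather than one, follows the multiple imputation approach: the variation between copies lets an analyst account for the uncertainty introduced by the synthesis.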

Other Methods

Slicing, Micro-aggregates, and Recombination.
Location Data.
Scrub System, Datafly, Argus, and SUDA2.
Micro-agglomeration, Substitution, Subsampling, and Calibration (MASSC).

Disclosure Risk vs. Data Utility




Disclosure risk can be lowered by applying a disclosure limitation (DL) procedure to mask the data.
This masking will typically also lower the data utility.
It is crucial that the tradeoff between risk and utility be assessed.
The R-U confidentiality map is offered as an analytical framework for this assessment.
[Figure: R-U confidentiality map. Axes: Disclosure Risk and Data Utility. Labeled points: raw data, released data, no data. A line marks the Maximum Tolerable Risk Threshold.]
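
One way to read the map is as a constrained choice: among candidate releases, pick the one with the highest utility whose risk does not exceed the maximum tolerable risk threshold. The sketch below assumes that risk and utility have already been scored on a 0 to 1 scale. Every number in it is invented for illustration, and `choose_release` is a hypothetical helper, not part of the R-U framework itself.

```python
# Hypothetical (risk, utility) scores; the numbers are invented for illustration.
candidates = {
    "raw data": (0.90, 1.00),
    "light noise": (0.55, 0.85),
    "data swapping": (0.30, 0.70),
    "heavy noise": (0.20, 0.60),
    "no data": (0.00, 0.00),
}

MAX_TOLERABLE_RISK = 0.35  # assumed threshold


def choose_release(candidates, max_risk):
    """Return the highest-utility candidate whose risk is within the threshold."""
    admissible = {
        name: (risk, utility)
        for name, (risk, utility) in candidates.items()
        if risk <= max_risk
    }
    return max(admissible, key=lambda name: admissible[name][1])


print(choose_release(candidates, MAX_TOLERABLE_RISK))  # data swapping
```

Because "no data" always has zero risk, some admissible choice exists for any non-negative threshold. The interesting cases are the releases that sit just below the threshold line.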

In all cases, the questions are:
Are the disclosure limitation methods used adequate, but not excessive?
Could less severe distortion or obscuring of the data still keep the risk from data snoopers low, while allowing better data utility?
What explicitly is the tradeoff between disclosure risk and data utility?
Would a different DL method lower disclosure risk while maintaining data utility?

