The effect statistical disclosure control methods have on data

The use of protected microdata in
tabulation: case of SDC-methods
microaggregation and PRAM
Researcher Janika Konnu
Manchester, United Kingdom
17-19 December 2007
Outline
Data
 SDC-methods
 Results
 Conclusions
 Forthcoming research

Janika Konnu
Tuesday 18 December
Data used in the study
The teacher data were originally collected for administrative
purposes.
 Only high school teachers (N = 7798) were included in our
study.
 The data included information about
 teachers: age, gender, position, etc.
 the schools those teachers taught in: location of the
school, number of students, etc.

SDC Methods: Microaggregation

First the data are divided into groups
of k observations, and the group
averages are released instead of the
original values of the variable.

The MDAV algorithm was used for
grouping: it finds the
average observation with respect
to the values and forms groups
based on the distance from this
average observation.

Grouping the data is the crucial
step in this method: when each
group contains the most similar
observations, information loss is
minimised.

In our study microaggregation
was applied to categorical data,
although it is intended for
numerical data.
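
The grouping-and-averaging idea can be sketched in a few lines. This is a simplified univariate illustration, not the MDAV algorithm used in the study: it sorts the values, forms consecutive groups of at least k, and releases the group means. The function name `microaggregate` and the sample ages are hypothetical.

```python
def microaggregate(values, k):
    """Replace each value by the mean of its group of at least k sorted neighbours.

    Simplified sketch: groups are consecutive runs in sorted order,
    not the distance-based groups that MDAV would form.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    result = [0.0] * n
    start = 0
    while start < n:
        end = start + k
        # If fewer than k observations would remain, fold them into this group.
        if n - end < k:
            end = n
        group = order[start:end]
        group_mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = group_mean
        start = end
    return result


ages = [25, 58, 31, 27, 60, 33, 45]  # hypothetical values
protected = microaggregate(ages, k=3)
print(protected)
```

Because every value is replaced by its group mean, the overall total is preserved while individual values are hidden among at least k observations.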
SDC Methods: The Post RAndomization Method (PRAM)

The method changes the values of a
variable according to a probability
matrix (Markov matrix).

Example:
\[
\begin{pmatrix}
0.80 & 0.20 & 0 & 0 \\
0.10 & 0.80 & 0.10 & 0 \\
0 & 0.10 & 0.80 & 0.10 \\
0 & 0 & 0.20 & 0.80
\end{pmatrix}
\]

When PRAM is applied, the data
user must take the probability
matrix into account in order to
obtain correct results.

In our study we tested the
usefulness of PRAM when the
probability matrix is not used in
the analysis.
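
A minimal sketch of what "taking the probability matrix into account" can mean for a frequency table. It assumes the example matrix is the 4 x 4 matrix with 0.80 on the diagonal, read row-wise (row i gives the release probabilities for original category i). The expected observed counts are the transposed matrix applied to the original counts, so a user who knows the matrix can solve that linear system to estimate the original counts. All names and the sample counts are hypothetical.

```python
import random

# Row i: probabilities that original category i is released as category j.
P = [
    [0.80, 0.20, 0.00, 0.00],
    [0.10, 0.80, 0.10, 0.00],
    [0.00, 0.10, 0.80, 0.10],
    [0.00, 0.00, 0.20, 0.80],
]


def pram(values, matrix, rng):
    """Draw a released category for every record from its row of the matrix."""
    categories = range(len(matrix))
    return [rng.choices(categories, weights=matrix[v])[0] for v in values]


def frequencies(values, n_categories):
    counts = [0] * n_categories
    for v in values:
        counts[v] += 1
    return counts


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def estimate_original(observed, matrix):
    """Estimate original counts t from observed counts o via P^T t = o."""
    n = len(matrix)
    transposed = [[matrix[i][j] for i in range(n)] for j in range(n)]
    return solve(transposed, observed)


# Hypothetical data: 1000 records in four categories.
original = [0] * 600 + [1] * 250 + [2] * 100 + [3] * 50
rng = random.Random(0)
protected = pram(original, P, rng)
observed = frequencies(protected, 4)
estimated = estimate_original(observed, P)
print("observed: ", observed)
print("estimated:", [round(x, 1) for x in estimated])
```

Analysing `observed` directly, without the correction, corresponds to the situation tested in the study.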
Empirical work: -Argus software
The software includes disclosure risk measurement and the
following methods: global recoding, local suppression, top
and bottom coding, PRAM, numerical microaggregation,
numerical rank swapping and Sullivan masking.
 The software produces protected data if suppressions are
allowed.
 In our case, only the SDC methods PRAM and numerical
microaggregation were studied. No suppressions were
made, because we needed information on the difference
between the original and the protected data.

Results: Data protected by Microaggregation
Group sizes used in protection are 2, 5, 8, 10 and 15

Microaggregation does not have
an effect on frequencies.

Unfortunately this implies that
hardly any change occurs in the
values.

Conclusion: microaggregation
does not give strong enough
protection when it comes to
categorical data.
Results: Data protected by PRAM (no bandwidth)
Changing probabilities: 0.05, 0.10, 0.20, 0.30 and 0.40

PRAM changes the values of
variables, and in that way the data
are protected.

Unfortunately PRAM causes
problems when the category
frequencies differ greatly:
the large frequencies keep
shrinking and the small ones
keep growing.
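
The flattening effect can be illustrated with expected frequencies. The sketch below assumes that a value changes with probability p and that the new value is drawn uniformly from the other categories; the slides do not state how the matrix was built, so this is only one plausible reading of "no bandwidth". The counts are hypothetical and merely sum to the study's N = 7798.

```python
def expected_frequencies(freqs, p):
    """Expected counts after PRAM when a value changes with probability p
    and moves uniformly to one of the other categories (an assumption)."""
    k = len(freqs)
    total = sum(freqs)
    return [(1 - p) * f + p * (total - f) / (k - 1) for f in freqs]


original = [6000, 1000, 500, 298]  # hypothetical counts, sum 7798
for p in (0.05, 0.10, 0.20, 0.30, 0.40):
    print(p, [round(x) for x in expected_frequencies(original, p)])
```

Under this assumption the largest category loses far more records than it receives, and the smallest gains, at every changing probability.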
Results: Data protected by PRAM (bandwidth is 2)
Changing probabilities: 0.05, 0.10, 0.20, 0.30 and 0.40

Restricting the change of values
cannot solve the problem with
differences in frequencies.

Our study shows that the
frequencies in the categories next to
the one with the largest frequency
still grow too fast.
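
A small sketch of why a bandwidth restriction concentrates the growth in neighbouring categories. It assumes that "bandwidth is 2" means a value may only move to a category at most two positions away, with the changing probability p split evenly among those neighbours; this construction and the counts are hypothetical, not taken from the study.

```python
def banded_pram_matrix(n, p, bandwidth):
    """Transition matrix where changes go only to categories within `bandwidth`."""
    matrix = []
    for i in range(n):
        neighbours = [j for j in range(n) if j != i and abs(i - j) <= bandwidth]
        row = [0.0] * n
        if neighbours:
            row[i] = 1 - p
            for j in neighbours:
                row[j] = p / len(neighbours)
        else:
            row[i] = 1.0  # no admissible target: the value never changes
        matrix.append(row)
    return matrix


def expected_after_pram(freqs, matrix):
    """Expected released counts: column j collects mass from every row i."""
    n = len(freqs)
    return [sum(freqs[i] * matrix[i][j] for i in range(n)) for j in range(n)]


original = [200, 300, 6000, 300, 200, 798]  # hypothetical counts, sum 7798
matrix = banded_pram_matrix(len(original), p=0.20, bandwidth=2)
expected = expected_after_pram(original, matrix)
print([round(x) for x in expected])
```

In this example all records leaving the largest category land in its four nearest neighbours, so those categories grow sharply while a distant category receives little.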
Results: Data protected by PRAM
No bandwidth
Bandwidth is 2
Conclusion: Microaggregation

Microaggregation performs well with numerical data, but its
application to categorical data needs more research.

Data protected by microaggregation include almost the
same information as the original data.

Are we sure that microaggregation is able to protect
categorical data properly?
Conclusion: PRAM

PRAM seems to perform quite well when it comes to
protecting data, but there are some issues to overcome.

PRAM can protect data even with small changing probabilities,
because its protection is based on uncertainty of identification.

In this case our concern is information loss. Are the
protected data useful without using the probability matrix?
Forthcoming research

Include more methods
 rank swapping
 noise adding

Include disclosure risk measures

Include more precise measurement for information loss
Some references

Domingo-Ferrer, J., Torra, V. 2001. A Quantitative Comparison of Disclosure Control
Methods for Microdata. In Confidentiality, Disclosure, and Data Access: Theory and
Practical Applications for Statistical Agencies. Amsterdam: North-Holland.

Gouweleeuw, J., Kooiman, P., Willenborg, L., and de Wolf, P. 1998. Post Randomisation
for Statistical Disclosure Control: Theory and Implementation. Journal of Official
Statistics. Vol. 14, No. 4, pp. 463-478.

Group Crises. 2004. Research Reports: Microaggregation for Privacy Protection in
Statistical Databases. In July 2005. <http://vneumann.etse.urv.es/publications/reports/>.
Thank You!