The effect statistical disclosure control methods have on data
The use of protected microdata in
tabulation: case of SDC-methods
microaggregation and PRAM
Researcher Janika Konnu
Manchester, United Kingdom
17-19 December 2007
Outline
Data
SDC-methods
Results
Conclusions
Forthcoming research
Janika Konnu
Tuesday 18 December
2
Data used in the study
The data on teachers were originally collected for administrative
purposes.
Only high school teachers (N=7798) were included in our
study.
The data included information about
the teachers: age, gender, position, etc.
the schools those teachers taught in: the location of the
school, number of students, etc.
SDC Methods: Microaggregation
First the data are divided into groups
of k observations, and the group
averages are released instead of the
original values of the variable.
The MDAV algorithm was used for
grouping: the algorithm finds the
average observation with respect
to the values and forms groups
by using the distance from this
average observation.
Grouping the data is the crucial
step of this method: when each
group contains the most similar
observations, information loss
is minimised.
In our study microaggregation
was applied to categorical data,
although it is intended for
numerical data.
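The grouping step described above can be sketched in Python roughly as follows. This is a simplified MDAV-style (Maximum Distance to Average Vector) illustration, not the μ-Argus implementation: the function name `mdav_microaggregate`, the use of Euclidean distance and the example ages are all assumptions made for the sketch. The input is expected to be a 2-D array (records by variables).

```python
import numpy as np


def mdav_microaggregate(X, k):
    """Simplified MDAV-style sketch: replace each row by the mean of its group."""
    X = np.asarray(X, dtype=float)
    remaining = list(range(len(X)))  # indices not yet assigned to a group
    groups = []

    def farthest_from(point, pool):
        # index in pool whose record is farthest (Euclidean) from `point`
        d = np.linalg.norm(X[pool] - point, axis=1)
        return pool[int(np.argmax(d))]

    def nearest_k(seed, pool):
        # the k records in pool closest to record `seed`
        d = np.linalg.norm(X[pool] - X[seed], axis=1)
        return [pool[i] for i in np.argsort(d, kind="stable")[:k]]

    def pop_group(seed):
        # form a group around `seed` and remove it from the pool
        nonlocal remaining
        group = nearest_k(seed, remaining)
        remaining = [i for i in remaining if i not in group]
        groups.append(group)

    # While enough records remain, build two groups per pass: one around the
    # record farthest from the average record, one around the record
    # farthest from that first record.
    while len(remaining) >= 3 * k:
        r = farthest_from(X[remaining].mean(axis=0), remaining)
        s = farthest_from(X[r], remaining)
        pop_group(r)
        pop_group(s)
    if len(remaining) >= 2 * k:
        pop_group(farthest_from(X[remaining].mean(axis=0), remaining))
    if remaining:
        groups.append(remaining)  # leftover records form the last group

    protected = X.copy()
    for g in groups:
        protected[g] = X[g].mean(axis=0)  # release group averages
    return protected, groups


# Hypothetical ages, for illustration only (not the teacher data).
ages = np.array([[25.0], [27.0], [31.0], [52.0], [55.0], [58.0]])
protected, groups = mdav_microaggregate(ages, k=3)
print(protected.ravel())
```

Because every value is replaced by its group mean, the overall mean of each variable is preserved, and the more similar the records within a group, the smaller the change to each value.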
SDC Methods: The Post RAndomization Method
The method changes the values of a
variable according to a probability
matrix (Markov matrix).
Example:
0.80 0.20 0    0
0.10 0.80 0.10 0
0    0.10 0.80 0.10
0    0    0.20 0.80
When PRAM is applied, the data
user must take the probability
matrix into account in order to
obtain correct results.
In our study we tested the
usefulness of PRAM when the
probability matrix is not used in
the analysis.
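A minimal Python sketch of the idea, assuming the 4x4 example matrix above with rows as original categories and columns as released categories. The function names (`apply_pram`, `expected_counts`, `corrected_counts`) and the category counts are hypothetical. The correction shown is the usual moment-based one: solve the linear system given by the transposed probability matrix to estimate the original frequencies from the perturbed ones.

```python
import numpy as np

# Example Markov matrix: entry P[i, j] is the probability that original
# category i is released as category j. Each row sums to 1.
P = np.array([
    [0.80, 0.20, 0.00, 0.00],
    [0.10, 0.80, 0.10, 0.00],
    [0.00, 0.10, 0.80, 0.10],
    [0.00, 0.00, 0.20, 0.80],
])


def apply_pram(values, P, rng):
    """Replace each category code i by a random draw from row i of P."""
    n_cat = P.shape[0]
    return np.array([rng.choice(n_cat, p=P[v]) for v in np.asarray(values)])


def expected_counts(counts, P):
    """Expected frequency table after PRAM, given the original counts."""
    return P.T @ counts


def corrected_counts(perturbed_counts, P):
    """Estimate of the original counts that takes P into account."""
    return np.linalg.solve(P.T, perturbed_counts)


# Hypothetical, strongly skewed category counts (not the teacher data).
counts = np.array([7000.0, 500.0, 200.0, 98.0])
naive = expected_counts(counts, P)   # what a user ignoring P would see on average
fixed = corrected_counts(naive, P)   # what a user applying the correction recovers

rng = np.random.default_rng(2007)
original = np.repeat(np.arange(4), counts.astype(int))
released = apply_pram(original, P, rng)
print(np.bincount(released, minlength=4))
```

With skewed counts like these, the uncorrected table understates the largest category and overstates its neighbour, which is the kind of distortion a user faces when the probability matrix is not used in the analysis.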
Empirical work: μ-Argus software
The software includes disclosure risk measurement and the
following methods: global recoding, local suppression, top
and bottom coding, PRAM, numerical microaggregation,
numerical rank swapping and Sullivan masking.
The software produces protected data if suppressions are
allowed.
In our case, only the SDC methods PRAM and numerical
microaggregation were studied. No suppressions were
made, because we needed information on the difference
between the original and protected data.
Results: Data protected by Microaggregation
Group sizes used in protection: 2, 5, 8, 10 and 15
Microaggregation does not have
an effect on the frequencies.
Unfortunately this implies that
hardly any values change.
Conclusion: microaggregation
does not give strong enough
protection when it comes to
categorical data.
Results: Data protected by PRAM (no bandwidth)
Change probabilities: 0.05, 0.10, 0.20, 0.30 and 0.40
PRAM changes the values of
variables, and in that way the
data are protected.
Unfortunately PRAM leads to
problems when the category
frequencies differ greatly.
Large frequencies keep getting
smaller and small ones larger.
Results: Data protected by PRAM (bandwidth is 2)
Change probabilities: 0.05, 0.10, 0.20, 0.30 and 0.40
Restricting the change of values
does not solve the problem with
differences in frequencies.
Our study shows that the
frequencies in categories next to
the one with the largest frequency
still grow too fast.
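The effect can be illustrated with expected frequencies. The sketch below assumes one plausible reading of the settings: each value keeps its category with probability 1 minus the change probability, and the change probability is split equally over the allowed target categories (all other categories with no bandwidth, or only those within the bandwidth). How μ-Argus distributes the probability in detail is not stated here, so the function `pram_matrix` and the category counts are hypothetical.

```python
import numpy as np


def pram_matrix(n_cat, p_change, bandwidth=None):
    """Probability matrix with equal spread of p_change over allowed targets.

    bandwidth=None allows a change to any other category; otherwise only
    categories at most `bandwidth` positions away are allowed (assumption).
    """
    P = np.zeros((n_cat, n_cat))
    for i in range(n_cat):
        targets = [j for j in range(n_cat)
                   if j != i and (bandwidth is None or abs(i - j) <= bandwidth)]
        if not targets:          # a single category can never change
            P[i, i] = 1.0
            continue
        P[i, i] = 1.0 - p_change
        P[i, targets] = p_change / len(targets)
    return P


# Hypothetical skewed counts over six ordered categories (sum 7798).
counts = np.array([200.0, 6000.0, 900.0, 400.0, 200.0, 98.0])

P_none = pram_matrix(6, 0.20)                # no bandwidth
P_band = pram_matrix(6, 0.20, bandwidth=2)   # bandwidth is 2

exp_none = P_none.T @ counts   # expected frequencies after PRAM
exp_band = P_band.T @ counts

for name, exp in [("no bandwidth", exp_none), ("bandwidth 2", exp_band)]:
    print(name, np.round(exp, 1))
```

Under these assumptions the bandwidth concentrates the outflow from the largest category on its neighbours, so their expected frequencies grow even faster than without a bandwidth.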
Results: Data protected by PRAM
[Figure panels: no bandwidth; bandwidth is 2]
Conclusion: Microaggregation
Microaggregation performs well with numerical data, but its
application to categorical data needs more research.
Data protected by microaggregation contain almost the
same information as the original data.
Are we sure that microaggregation is able to protect
categorical data properly?
Conclusion: PRAM
PRAM seems to perform quite well when it comes to
protecting data, but there are some issues to overcome.
PRAM can protect data even with small change probabilities,
because it is based on uncertainty of identification.
In this case our concern is information loss. Are the
protected data useful without using the probability matrix?
Forthcoming research
Include more methods
rank swapping
noise addition
Include disclosure risk measures
Include more precise measures of information loss
Thank You!