Work session on statistical data confidentiality
Manchester 17-19 December 2007
Analytical validity and confidentiality protection of
anonymised longitudinal enterprise microdata –
Survey of a German Project
Maurice Brandt (1), Michael Konold (2),
Rainer Lenz (3) and Martin Rosemann (4)
(1) Research Data Centre of the Federal Statistical Office
(2) Research Data Centre of the Statistical Offices of the Länder
(3) University of Applied Sciences Mainz
(4) Institute for Applied Economic Research
© Federal Statistical Office, Research Data Centre, Maurice Brandt
Slide 1
Overview
1. Introduction
2. The data sets of the project
3. Anonymisation methods and analytical validity
4. Approaches to assessing anonymity
5. Conclusions
Slide 2
1. Introduction
“Business Panel data and de facto anonymisation”
new project since the beginning of 2006
 improve the data infrastructure in Germany regarding
longitudinal data on local units and enterprises
 guarantee the access of the scientific community to the
panel data of economic statistics
 the former project “De facto anonymisation of business
microdata” has shown that de facto anonymisation can be
achieved on a cross-section basis
Slide 3
1. Introduction
 In this project, different business statistics are linked to
form longitudinal data sets
 it is planned to complement the data with information
from the official business register
 the data sets can already be used for scientific work
 the final aim is to produce a scientific use file
Slide 4
2.1 The data sets of the project
Units of analysis are the local units in manufacturing and mining
Complete enumeration of local units with 20 or more employees
Monthly reports
 years from 1995 to 2005
 Information about employees, wages, salaries, turnover
Survey of investments
 years from 1995 to 2005
 Information on a wide range of investment types
Survey of small units
 years from 1995 to 2002
 Local units with 19 or fewer employees
Slide 5
2.2 The data sets of the project
Cost Structure Survey
Stratified sample of enterprises with 20 or more
employees in the manufacturing and mining sector
 years from 1995 to 2005
 altogether more than 43,000 enterprises
 Information on output, production factors, employees
 13,300 enterprises are available for the whole period
from 1999 to 2002
 studies regarding investments in research and
development are possible
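The subset of enterprises observed in every year of a period (such as the 13,300 units covering 1999 to 2002) forms a balanced panel. A minimal sketch of how such a subset could be identified, assuming each survey year is available as a list of enterprise identifiers; the function name and the data layout are hypothetical and not taken from the project:

```python
def balanced_panel_ids(ids_by_year, first_year, last_year):
    """Return the identifiers that occur in every year of the period.

    ids_by_year: dict mapping a year to an iterable of enterprise ids
    (hypothetical layout, chosen for illustration only).
    """
    yearly_sets = [
        set(ids_by_year.get(year, ()))
        for year in range(first_year, last_year + 1)
    ]
    if not yearly_sets:
        return set()
    # An enterprise belongs to the balanced panel only if it was
    # observed in each single year of the period.
    return set.intersection(*yearly_sets)


if __name__ == "__main__":
    sample = {
        1999: ["e1", "e2", "e3"],
        2000: ["e1", "e3"],
        2001: ["e1", "e3", "e4"],
        2002: ["e3", "e1"],
    }
    print(sorted(balanced_panel_ids(sample, 1999, 2002)))
```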
Slide 6
2.3 The data sets of the project
Turnover Tax Statistics
 Very large data set with a total of 4.3 million enterprises
 years from 2000 to 2004 (1.8 million enterprises for the whole period)
 Information on all taxable turnovers, turnover tax, prior
tax and tax liability
IAB Panel of local units
 Information on employment trends, staff structure, hours
worked, turnover, export share, investments and
innovation
 Various waves since 1993, covering about 4,300 to a
maximum of 16,000 local units
Slide 7
3. Anonymisation methods and analytical validity
Anonymisation methods
 methods reducing the information (suppression of variables or
presenting key variables in broader categories)
 methods modifying the values of numerical data (data
perturbation methods)
Data perturbation methods for panel data
 Microaggregation: (a) separately for all variables and all periods
(Individual Ranking), (b) separately for all variables but jointly for
all periods, (c) separately for all periods but jointly for all variables
and (d) jointly for all periods and all variables
 Multiplicative stochastic noise: mixture distribution (approach of
Höhne)
 Multiple Imputation
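Variant (a), individual ranking, can be illustrated with a short sketch: each variable in each period is sorted on its own, consecutive units are grouped, and every value is replaced by its group mean. The group size of 3, the function names and the data layout are assumptions for illustration; the slides do not state the parameters used in the project.

```python
def microaggregate(values, group_size=3):
    """Replace each value by the mean of its rank group (one variable, one period)."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    n = len(values)
    # Indices of the units, ordered by the value of this single variable.
    order = sorted(range(n), key=lambda i: values[i])
    result = [0.0] * n
    start = 0
    while start < n:
        end = start + group_size
        # Merge a short tail into the last group so that no group
        # contains fewer than group_size units.
        if n - end < group_size:
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
        start = end
    return result


def individual_ranking(panel, group_size=3):
    """Apply microaggregation separately to every (variable, period) column.

    panel: dict mapping (variable, period) to a list with one value per unit
    (hypothetical layout, chosen for illustration only).
    """
    return {
        key: microaggregate(column, group_size)
        for key, column in panel.items()
    }


if __name__ == "__main__":
    turnover_1995 = [10, 1, 7, 3, 100, 4, 8]
    print(microaggregate(turnover_1995))
```

Because every value is replaced by a group mean, the column total (and hence the overall mean) is preserved, while the variance within each group is removed.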
Slide 8
3. Anonymisation methods and analytical validity
In Focus
Impacts of data perturbation methods on
 descriptive distribution measures
 the estimation of econometric panel models, particularly the
within estimator, which controls for unobserved individual
heterogeneity
First Results
 the within estimator is consistent in the case of anonymisation by
individual ranking
 Project team derived consistent within-estimators in the case of
anonymisation by multiplicative stochastic noise (including the
method of Höhne) and no autocorrelation
 Case of autocorrelation: work in progress
 Multiple Imputation: separate talk at this conference
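To make the object of study concrete, the sketch below computes the plain within estimator for a single regressor and perturbs a panel with multiplicative noise drawn from a two-component normal mixture centred at 1 - delta and 1 + delta. The parameter values, the function names and the choice to draw the mixture component per value are assumptions for illustration. The corrected estimators derived by the project team are not reproduced here.

```python
import random


def within_estimator(panel):
    """Within (fixed effects) estimate of the slope for one regressor.

    panel: dict mapping a unit id to a list of (x, y) observations over time.
    """
    numerator = 0.0
    denominator = 0.0
    for observations in panel.values():
        x_mean = sum(x for x, _ in observations) / len(observations)
        y_mean = sum(y for _, y in observations) / len(observations)
        # Demeaning by unit removes the time-constant individual effect.
        for x, y in observations:
            numerator += (x - x_mean) * (y - y_mean)
            denominator += (x - x_mean) ** 2
    if denominator == 0:
        raise ValueError("no within-unit variation in x")
    return numerator / denominator


def mixture_noise_factor(rng, delta=0.1, sigma=0.02):
    """Draw a factor from 0.5*N(1-delta, sigma) + 0.5*N(1+delta, sigma).

    delta and sigma are illustrative values, not the project's parameters.
    """
    centre = 1.0 + delta if rng.random() < 0.5 else 1.0 - delta
    return rng.gauss(centre, sigma)


def perturb_panel(panel, rng, delta=0.1, sigma=0.02):
    """Multiply every x and y value by its own noise factor."""
    return {
        unit: [
            (x * mixture_noise_factor(rng, delta, sigma),
             y * mixture_noise_factor(rng, delta, sigma))
            for x, y in observations
        ]
        for unit, observations in panel.items()
    }


if __name__ == "__main__":
    # y = individual effect + 2 * x, without an error term.
    original = {
        "A": [(1, 12), (2, 14), (3, 16)],
        "B": [(2, 54), (4, 58), (6, 62)],
    }
    rng = random.Random(0)
    print("original :", within_estimator(original))
    print("perturbed:", within_estimator(perturb_panel(original, rng)))
```

Comparing the estimate on the original panel with estimates on perturbed copies is one simple way to gauge how strongly a perturbation method distorts the within estimator.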
Slide 9
4. Approaches to assessing anonymity
We calculate coefficients $d(a_i, b_j)$, where $\{a_1, \dots, a_n\}$ is the external data and $\{b_1, \dots, b_n\}$ is the target data, and obtain:
(AP) Minimize
$$\sum_{i=1}^{n} \sum_{j=1}^{n} d(a_i, b_j)\, x_{ij}$$
s.t.
$$\sum_{j=1}^{n} x_{ij} = 1 \quad \text{for } i = 1, \dots, n \quad \text{and} \quad \sum_{i=1}^{n} x_{ij} = 1 \quad \text{for } j = 1, \dots, n,$$
$$x_{ij} \in \{0, 1\} \quad \text{for } i, j = 1, \dots, n.$$
Slide 10
4. Approaches to assessing anonymity
Four approaches in order to estimate the coefficients of
the linear program (AP) are used:




Conventional distance based approach
Correlation based approach
Distribution based approach
Collinearity based approach
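As an illustration of the first approach, the coefficients of (AP) could be computed as distances between records. The sketch below uses a Euclidean distance after scaling each variable by its standard deviation in the pooled data. This particular distance, the function name and the data layout are assumptions; the slides do not define the measure used in the project, and the other three approaches are not sketched here.

```python
import math


def standardised_distances(external, target):
    """Return the matrix of coefficients d(a_i, b_j).

    external, target: lists of records, each record a list of numerical
    values for the same variables in the same order (hypothetical layout).
    """
    if not external or not target:
        return [[] for _ in external]
    n_vars = len(external[0])
    pooled = external + target
    scales = []
    for k in range(n_vars):
        column = [record[k] for record in pooled]
        mean = sum(column) / len(column)
        variance = sum((v - mean) ** 2 for v in column) / len(column)
        # A constant variable carries no information; use scale 1 to
        # avoid a division by zero.
        scales.append(math.sqrt(variance) or 1.0)
    return [
        [
            math.sqrt(sum(((a[k] - b[k]) / scales[k]) ** 2
                          for k in range(n_vars)))
            for b in target
        ]
        for a in external
    ]


if __name__ == "__main__":
    external = [[0.0, 0.0], [10.0, 100.0]]
    target = [[1.0, 0.0], [9.0, 100.0]]   # slightly perturbed copies
    for row in standardised_distances(external, target):
        print([round(value, 3) for value in row])
```

Scaling keeps variables with large magnitudes, such as turnover, from dominating variables with small magnitudes, such as the number of employees.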
Slide 11
5. Conclusions
Within the scope of the project, the panel data sets can be
used via
 remote data processing
 safe scientific workstations at the statistical offices
They are already being used in some research projects
The first scientific use files, for use on a researcher's own
workstation, are expected to be available at the beginning of 2009
Slide 12
Thank you for your attention
Slide 13