ISTAT generalised tools v2


ESSnet “Preparation of standardisation”
Istat Generalized Software Systems
Rome, 6-7 June 2011
Generalised solutions for statistical production
One of the goals of the Italian National Statistical Institute (Istat) is to
make available for each survey stage – from sample design to data analysis
and dissemination – generalised solutions, i.e. tools or systems designed so
as to ensure production functionalities and having the following features:
• they implement advanced methodologies and techniques, so as to ensure the
best possible quality of the information produced;
• they are operable with no or limited need for further software development;
• they are provided with adequate documentation and a user-friendly interface,
usable also by non-expert users.
Another desired requirement is interoperability, so as to make it possible to
share tools and systems with other members of the official statistical
community.
Technologies and software: open vs proprietary
About five years ago Istat took a strategic decision: to favour open source
software and technologies as the instruments for developing generalised IT
tools.
This was due to various factors:
1. costs;
2. interoperability;
3. dynamics of open source communities.
Since 2006 the following results have been obtained:
• migration of all IT tools previously developed in SAS to new releases based
on open source software, first of all R;
• mass training on R (about 300 people);
• a reduction in the use of SAS in production, with a consequent reduction of
SAS fees (from 1.0 million euros to 0.4 million).
PHASES and SUB-PROCESSES – SOFTWARE
2. Design – 2.3 Design frame and sample: MAUSS (Multivariate Allocation of Units in Sampling Surveys) for sampling design
4. Collect – 4.1 Select sample: sampling (R package) for the selection of sample units
4. Collect – 4.3 Run collection: BLAISE for CAPI, CATI and CADI
5. Process – 5.1 Integrate data: RELAIS for record linkage; StatMatch for statistical matching
5. Process – 5.2 Classify and code: BLAISE for assisted coding; ACTR (now G-CODE) for automatic coding
5. Process – 5.3 Review, validate and edit / 5.4 Impute: CONCORD, ADAMSOFT, DIESIS for automatic data editing and imputation; SelEMix for optimised selective editing
5. Process – 5.6 Calculate weights / 5.7 Calculate aggregates: ReGENESEES for calibration and sampling variance estimation
6. Analyse – 6.4 Apply disclosure control: ARGUS for data confidentiality protection
7. Disseminate – 7.2 Produce dissemination products: I.Stat for statistical data warehousing
Sample design: MAUSS
MAUSS (Multivariate Allocation of Units in Sampling Surveys) is based on
Bethel's method (a simplified allocation example is sketched below) and makes
it possible to:
• fix the sample size;
• allocate the total number of units across the different strata of the
population.
Required inputs are:
• the desired precision for each estimate of interest;
• the variability of the estimates in the domains of interest;
• the available budget and the costs associated with data collection.
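As a rough illustration of the multivariate allocation problem MAUSS addresses
(this is not the Bethel algorithm itself, nor the MAUSS interface), the sketch
below computes per-variable Neyman allocations and takes the stratum-wise
maximum; all stratum sizes, standard deviations and precision targets are
invented.

```python
# Illustrative sketch only: MAUSS implements Bethel's algorithm; this is a much
# cruder heuristic for multivariate allocation, purely to fix ideas.
import math

N_h = [5000, 3000, 2000]          # hypothetical stratum population sizes
S   = {"y1": [12.0, 30.0, 55.0],  # std. dev. of variable y1 by stratum
       "y2": [ 4.0,  9.0, 20.0]}  # std. dev. of variable y2 by stratum
n_target = {"y1": 400, "y2": 300} # sample sizes meeting each precision target

def neyman(n, N_h, S_h):
    """Neyman allocation of a total sample size n across strata."""
    w = [N * s for N, s in zip(N_h, S_h)]
    tot = sum(w)
    return [n * wi / tot for wi in w]

# Per-variable Neyman allocations; taking the stratum-wise maximum gives a
# conservative allocation satisfying all precision targets. Bethel's method
# instead solves the joint problem at minimum cost, which is why MAUSS also
# needs the budget and the data collection costs as inputs.
allocs = [neyman(n_target[v], N_h, S[v]) for v in S]
n_h = [math.ceil(max(a[h] for a in allocs)) for h in range(len(N_h))]
print("allocation per stratum:", n_h, "total sample size:", sum(n_h))
```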
Record linkage: RELAIS
In many situations it is necessary to link data from different sources, taking
care to correctly refer the information pertaining to the same units.
If common and unique identifiers are available in both datasets, joining them
is a straightforward task, but this is a very uncommon situation. More often a
subset of common variables has to be compared in order to perform the record
linkage. The task is complicated by the fact that the matching variables are
generally subject to errors and missing values, so a methodology to deal with
these complex situations is required, together with software that enables its
application.
The RELAIS (REcord Linkage at IStat) software makes it possible to develop
different and complex record linkage procedures, both deterministic and
probabilistic.
The methodology adopted for probabilistic record linkage is the Fellegi-Sunter
approach (Statistics Canada).
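To fix ideas, here is a minimal sketch of the Fellegi-Sunter weighting
principle (not the RELAIS implementation or its interface); the matching
variables, the m/u probabilities and the classification thresholds are all
hypothetical, whereas in practice they are estimated from the data (for
example via the EM algorithm).

```python
# Minimal sketch of Fellegi-Sunter pair weighting for probabilistic linkage.
import math

m = {"name": 0.95, "birth_year": 0.90}   # P(agreement | pair is a true match)
u = {"name": 0.02, "birth_year": 0.10}   # P(agreement | pair is a non-match)

def pair_weight(rec_a, rec_b):
    """Sum of log-likelihood ratios over the matching variables."""
    w = 0.0
    for var in m:
        if rec_a[var] == rec_b[var]:
            w += math.log(m[var] / u[var])
        else:
            w += math.log((1 - m[var]) / (1 - u[var]))
    return w

a = {"name": "ROSSI MARIO", "birth_year": 1970}
b = {"name": "ROSSI MARIO", "birth_year": 1971}
w = pair_weight(a, b)
# Two thresholds: above the upper one -> link, below the lower one -> non-link,
# in between -> send the pair to clerical review.
upper, lower = 3.0, 0.0
status = "link" if w > upper else ("non-link" if w < lower else "review")
print(round(w, 2), status)
```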
Statistical matching: package StatMatch
By using "statistical matching”, we want to integrate data sources
that do not have common observed units (or a limited subset
of them), but do have a subset of common variables that are
observed in both sources.
The integration may happen at the micro level, by creating a
synthetic dataset, or at an aggregate level, by inferencing the
values of the parameters that describe relations among
variables that are not jointly observed.
This integration technique is currently used when it is necessary
to combine two or more samples, as in this case the
probability to observe the same units is very small.
Past applications in ISTAT were related to the integration of the
Household Budget Survey (ISTAT) samples with the Income
Survey (Bank of Italy) samples, with the aim of creating the
Social Account Matrices (SAM); or the integration of Labour
Forces Survey with the Time Use Survey.
A useful software that can be used is the open source R package
“StatMatch”.
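A minimal sketch of micro-level statistical matching by nearest-neighbour
(distance) hot deck is given below; it does not use the StatMatch API, and the
two toy files, their variables and their values are invented.

```python
# Illustrative distance hot-deck matching at the micro level.
# File A observes (age, income); file B observes (age, expenditure);
# age is the common variable, income and expenditure are never jointly observed.
file_a = [{"age": 25, "income": 18000},
          {"age": 41, "income": 32000},
          {"age": 63, "income": 27000}]
file_b = [{"age": 27, "expenditure": 1200},
          {"age": 44, "expenditure": 2100},
          {"age": 60, "expenditure": 1700}]

def nearest_donor(recipient, donors, x_var="age"):
    """Pick the donor record minimising the distance on the common variable."""
    return min(donors, key=lambda d: abs(d[x_var] - recipient[x_var]))

# Build a synthetic dataset: each record of file A receives the expenditure of
# its closest donor in file B.
synthetic = [{**rec, "expenditure": nearest_donor(rec, file_b)["expenditure"]}
             for rec in file_a]
print(synthetic)
```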
Edit and imputation: CONCORD and ADAMSOFT
The CONCORD and ADAMSOFT ("localise error" function) systems make it possible
to treat the localisation of errors in data automatically.
Both implement the probabilistic approach known as the Fellegi-Holt approach
(with variants), thus enabling a better identification of random errors and
therefore greater accuracy (a toy example of the underlying minimum-change
principle is sketched after the list below).
In fact, compared to the deterministic approach (IF-THEN rules), the
probabilistic approach minimises:
• false positives (true values considered as errors);
• false negatives (errors considered as true values).
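The sketch below illustrates the minimum-change principle behind the
Fellegi-Holt approach by brute force: find the smallest set of fields that can
be modified so that the record passes all edits. CONCORD and ADAMSOFT do this
far more efficiently; the record, edit rules and domains shown are
hypothetical.

```python
# Toy error localisation: change as few fields as possible to satisfy all edits.
from itertools import combinations, product

record  = {"age": 12, "marital_status": "married", "employed": "yes"}
domains = {"age": range(0, 100),
           "marital_status": ["single", "married"],
           "employed": ["yes", "no"]}
edits = [lambda r: not (r["age"] < 16 and r["marital_status"] == "married"),
         lambda r: not (r["age"] < 15 and r["employed"] == "yes")]

def localise_errors(record):
    """Smallest set of fields that can be imputed to satisfy every edit."""
    fields = list(record)
    for k in range(len(fields) + 1):                 # try 0, 1, 2, ... changes
        for subset in combinations(fields, k):
            for values in product(*(domains[f] for f in subset)):
                trial = {**record, **dict(zip(subset, values))}
                if all(e(trial) for e in edits):
                    return subset                    # minimal set of fields
    return tuple(fields)

print(localise_errors(record))   # ('age',): changing age alone fixes both edits
```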
Edit and imputation: CONCORD and ADAMSOFT
The CONCORD (CONtrollo e CORrezione Dati, i.e. data control and correction)
system can be applied to surveys where categorical variables are prevalent,
and is used in the main household surveys.
ADAMSOFT ("localise error" function) can be applied to surveys where
continuous variables are prevalent, i.e. surveys on businesses and
institutions.
Selective Editing: package SelEMix
SelEMix (Selective Editing via Mixtures) is an R package for the
identification of influential errors in numerical data. The methodology is
based on latent class models (contamination models); a stylised example is
sketched after the lists below.
Required inputs are:
• the model choice (normal or log-normal);
• the sample weights;
• the accuracy threshold;
• technical parameters…
Outputs are:
• estimates of the model parameters;
• the list of influential units (at the given threshold);
• predictions of the "true" values given the observed values.
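As a stylised illustration of the contamination-model idea (not the SelEMix
interface), the sketch below mixes a "clean" Gaussian component with a
variance-inflated error component and computes, for each observed value, the
posterior probability of coming from the error component; all parameters are
assumed to be already estimated and are hypothetical.

```python
# Back-of-the-envelope contamination model for flagging likely errors.
import math

mu, sigma = 10.0, 1.0      # "clean" (e.g. log-scale) mean and std. dev.
inflation = 5.0            # variance-inflation factor of the error component
pi_err    = 0.05           # prior probability of contamination

def normal_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def posterior_error_prob(y):
    """P(unit is contaminated | observed y) under the two-component mixture."""
    clean = (1 - pi_err) * normal_pdf(y, mu, sigma)
    cont  = pi_err * normal_pdf(y, mu, sigma * inflation)
    return cont / (clean + cont)

for y in [9.8, 10.3, 14.9, 25.0]:
    p = posterior_error_prob(y)
    # Flag as a likely outlier above a probability threshold; whether it is an
    # *influential* error additionally depends on its impact on the estimates.
    print(f"y={y:5.1f}  P(error)={p:.3f}  {'flag' if p > 0.5 else 'ok'}")
```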
Selective Editing: package SelEMix
Outlying observations are also returned, ranked according to their (posterior)
probability of being erroneous. They are not necessarily influential errors.
Example: Small and Medium Enterprises Survey
Selective Editing: package SelEMix
SelEMix can also be used to robustly impute missing items. Imputations are
obtained as expectations of the true values given the observed (non-missing)
values.
Advantages:
• the method does not require a set of cleaned data for tuning the parameters;
• the threshold is directly related to the expected residual error in the
data.
Disadvantages:
• departures from the model assumptions (in particular zero inflation) can
deteriorate the performance;
• it is difficult to explicitly take balance edits into account.
Calculation of sampling estimates: ReGenesees
ReGENESEES (R evolution GENEralised software for Sampling Errors and
Estimates in Surveys) is a generalised software that can be used to:
• assign sampling weights to observations, taking account of the survey
design, of total non-response and of the availability of auxiliary
information (calibration estimators; see the sketch after this list), in
order to reduce the bias and variability of the estimates (thus maximising
their accuracy);
• produce the estimates of interest;
• calculate sampling errors to document the accuracy of the estimates, and
present them synthetically through regression models;
• evaluate the efficiency of the sample (deff) for its optimisation and
documentation.
It also makes it possible to:
1. build the vector of known totals for the calibration in an assisted and
controlled way;
2. where needed, calculate the known totals vector automatically by direct
access to sampling frame data;
3. calculate the sampling variance of estimators of whatever complexity.
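The sketch below shows a linear (GREG-type) calibration adjustment of design
weights against known auxiliary totals, the kind of computation ReGENESEES
performs; it is not the ReGENESEES API, and the weights, auxiliary variables
and known totals are invented.

```python
# Linear-distance calibration: adjust design weights so that the weighted
# auxiliary totals match the known population totals.
import numpy as np

d = np.array([50.0, 50.0, 40.0, 60.0])                           # design weights
X = np.array([[1, 30], [1, 45], [1, 52], [1, 38]], dtype=float)  # intercept, age
totals = np.array([210.0, 8600.0])            # known totals: population count, total age

# w = d * (1 + X @ lam), with lam solving the calibration equations w @ X = totals.
T = (X * d[:, None]).T @ X                    # sum_k d_k x_k x_k'
lam = np.linalg.solve(T, totals - d @ X)
w = d * (1 + X @ lam)

print("calibrated weights:", np.round(w, 2))
print("check calibrated totals:", w @ X)      # reproduces the known totals
y = np.array([12.0, 20.0, 31.0, 18.0])        # a survey variable of interest
print("calibration estimator of the total of y:", w @ y)
```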
Disclosure control
ARGUS is a generalised software used to ensure the confidentiality of both
microdata and aggregate data. Two versions are in fact available.
mu-Argus is used for microdata:
1. it makes it possible to assess the disclosure risk associated with a given
dataset;
2. if this exceeds a prefixed threshold, the software makes it possible to
apply different protection techniques (recoding, local suppression,
microaggregation).
tau-Argus is used for aggregate data (tables). It makes it possible to:
1. identify the sensitive cells (using the dominance rule or the
prior-posterior rule);
2. apply a series of protection techniques (redesign of tables, rounding or
suppression).
Protection techniques are based on algorithms that keep the loss of
information as low as possible. In this sense, the quality dimension that is
improved is accessibility.
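As an example of how sensitive cells can be identified, the sketch below
applies an (n, k) dominance rule: a cell is sensitive when its n largest
contributors account for more than k per cent of the cell total. This only
illustrates the rule, not the tau-Argus implementation; the contributions and
parameters are hypothetical.

```python
# Toy (n, k) dominance rule for identifying sensitive table cells.
def is_sensitive(contributions, n=2, k=85.0):
    """True if the n largest contributors exceed k% of the cell total."""
    total = sum(contributions)
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return total > 0 and 100.0 * top_n / total > k

cells = {"A": [500, 480, 30, 20, 10],   # two enterprises dominate -> sensitive
         "B": [120, 110, 100, 95, 90]}  # contributions are balanced -> safe
for name, contribs in cells.items():
    print(name, "sensitive" if is_sensitive(contribs) else "safe")
```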
Generalised solutions for statistical production
Software and the related information are developed, or gathered and tested, by
means of an ad hoc virtual space created by Istat to meet the needs of the
interested users, called Osservatorio Tecnologico per i Software generalizzati
(OTS) – Technological Observatory for generalised software.
OTS has been released and is available at the following address:
http://www.istat.it/strumenti/metodi/software/
These pages contain the software tools developed in Istat (fully tested
versions), available for download.
For software systems that are not the property of Istat, indications are
provided on how to find them.