Ranking Approaches to Confidentiality in Survey Data


1
Rank matching – a disclosure control method for
continuous survey data.
(Work in progress)
Johan Heldal
Statistics Norway
EU-SILC Anonymization Task Force Report
2005:
”The TF has pointed the specificities of so called
register countries. For these countries, some of the
income variables available in the EU-SILC may
come directly from registers (DK, NO, SE, FI, LT,
LI, CZ, SI, IS). If this register information together
with direct identifiers is available to external users,
the risk of disclosure is greatly increased. This
specific issue should be carefully studied. A specific
section of this report is dedicated to it. ”
2
This contribution deals with
• Rank Matching, a confidentiality method for
microdata from sample surveys
• Applicable to variables attached to surveys from
registers (administrative data) or censuses.
• Has emphasis on continuous data.
• Bears resemblance to Rank Swapping (Moore
1996).
3
Basic situation:
We have a finite population U of size N .
A register based sampling frame.
A probability sample s of size n.
Two sets of variables
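- $X_i = (X_{i1}, \ldots, X_{iJ})^T$ linked to the survey
from registers (continuous).
- $Y_i = (Y_{i1}, \ldots, Y_{iK})^T$ provided by respondents.
Sample ranks for each $X_{ij}$: $R_{ij}$, with $i = (R_{ij})_j$,
i.e. $(r)_j$ denotes the sample unit with rank $r$ on variable $j$.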
4
Rank matching
Draw a new sample $s^+$ from the same frame
with the same design.
Link register variables $X_i = (X_{i1}, \ldots, X_{iJ})^T$ to $s^+$.
For the entire $s$, or within strata or domains,
and for each variable, replace $X_{ij}$ with the value
having the same rank on the same variable in $s^+$:
$X_{ij} = X_{(R_{ij})_j\, j} \;\mapsto\; X^*_{ij} = X^+_{(R^+_{lj})_j\, j} = X^+_{lj}$
where $i = (R_{ij})_j$ and $l = (R^+_{lj})_j$ with $R^+_{lj} = R_{ij}$.
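The replacement step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the function names `column_ranks` and `rank_match` and the toy data are hypothetical, the two samples are assumed to have the same size within the matching domain, and ties are broken by position.

```python
def column_ranks(values):
    """Return 1-based ranks of the values (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def rank_match(x_s, x_plus):
    """Replace each value in x_s by the same-rank value of the same variable in x_plus.

    x_s, x_plus: lists of rows (one row per unit, one column per register variable).
    Assumption of this sketch: both samples contain the same number of units.
    """
    if len(x_s) != len(x_plus):
        raise ValueError("this sketch assumes s and s+ have the same size")
    n_vars = len(x_s[0])
    matched = [[None] * n_vars for _ in x_s]
    for j in range(n_vars):
        ranks = column_ranks([row[j] for row in x_s])    # R_ij in s
        sorted_plus = sorted(row[j] for row in x_plus)   # order statistics in s+
        for i, r in enumerate(ranks):
            matched[i][j] = sorted_plus[r - 1]           # X*_ij
    return matched


# Hypothetical toy data: 3 units, 2 register variables.
X_S = [[10.0, 200.0], [30.0, 100.0], [20.0, 300.0]]
X_PLUS = [[11.0, 250.0], [35.0, 150.0], [22.0, 90.0]]
X_STAR = rank_match(X_S, X_PLUS)
```

Each column is matched separately, so the released values keep the sample ranks of the original values on every variable, while the values themselves come from units in the second sample.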
5
Some properties
• RM preserves multivariate sample ranks
$R_i = (R_{i1}, \ldots, R_{iJ})^T$ within RM domains.
• If $X_{ij} \sim F_j(x)$ then $X^*_{ij} \sim F_j(x)$, where $F_j$ is the population cdf, not the
sample cdf.
• Some information loss at the multivariate level.
Less with large RM domains than with small:
$F^*(x \mid r) \neq F(x \mid r)$ but $F_j^*(x_j \mid r_j) = F_j(x_j \mid r_j) \neq F_j(x_j \mid r)$
6
Two worst case scenarios
1. The intruder knows that some members of her IF
are in s and knows their true values on some Xij.
2. The intruder has access to the register but does
not know who is in s.
• In case 1, intrusion must be carried out using Xij
and Xij*. An example of case 1 is given in the paper.
• In case 2, only population and sampling ranks are
relevant. An example of case 2 will be presented.
7
Case 2: The entire register available.
Let $\rho = (\rho_1, \ldots, \rho_N)$ be the population
rank matrix, with $\rho_i = (\rho_{i1}, \ldots, \rho_{iK})$.
Without loss of generality we can assume that
$i = \rho_{i1}$ = unit label in the population and
$j = r_{j1}$ = sample unit number.
$i_j$ = population label associated with sample unit $j$
(stochastic).
8
With $N = 7$, $n = 3$ and $K = 1$:
$\rho = [1,2,3,4,5,6,7]$, $R = [1,2,3]$
$P(i_j = i) = \binom{i-1}{j-1}\binom{7-i}{3-j} / 35$
j\i    1      2      3      4      5      6      7
1    15/35  10/35   6/35   3/35   1/35    0      0
2      0     5/35   8/35   9/35   8/35   5/35    0
3      0      0     1/35   3/35   6/35  10/35  15/35
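The table can be checked numerically. The sketch below assumes simple random sampling without replacement, so that all 35 samples are equally likely (an assumption suggested by the common denominator 35, not stated explicitly on the slide). The function names `p_formula` and `p_enumerated` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def p_formula(i, j, N=7, n=3):
    """P(i_j = i): population unit i has sample rank j (closed form)."""
    return Fraction(comb(i - 1, j - 1) * comb(N - i, n - j), comb(N, n))


def p_enumerated(N=7, n=3):
    """Same probabilities by brute-force enumeration of all equally likely samples."""
    counts = {}
    for s in combinations(range(1, N + 1), n):   # each s is already sorted by label
        for j, i in enumerate(s, start=1):       # i is the j-th smallest label in s
            counts[(j, i)] = counts.get((j, i), 0) + 1
    total = comb(N, n)
    return {key: Fraction(c, total) for key, c in counts.items()}


# Rows j = 1..3, columns i = 1..7, as in the table.
TABLE = [[p_formula(i, j) for i in range(1, 8)] for j in range(1, 4)]
ENUM = p_enumerated()
```

The closed form counts the samples in which j - 1 of the i - 1 smaller units and n - j of the N - i larger units are selected together with unit i.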
9
Assume $K = 2$ and still $N = 7$, $n = 3$.
Population rank matrix (example):
$\rho = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 4 & 5 & 2 & 3 & 1 & 7 & 6 \end{pmatrix}^T$
The sample $s = (2,4,5)$ generates
the sub-matrix
$\rho_s = \begin{pmatrix} 2 & 4 & 5 \\ 5 & 3 & 1 \end{pmatrix}$
10
Sample space $S = \{s\} = \{\rho_s\}$
$\#(S) = \binom{N}{n} = \binom{7}{3} = 35$ possible samples
There are $(n!)^{K-1} = (3!)^{2-1} = 6$ possible
sample rank matrices $R$.
They partition $S$ into subsets $S_R$.
What is then $P(i_j = i \mid S_R)$?
Table
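For the small example with N = 7, n = 3 and two variables, the partition and the conditional probabilities can be enumerated directly. The sketch below is illustrative only: it assumes all 35 samples are equally likely, represents a sample rank matrix by the ranks of its second variable (the first-variable ranks are 1, 2, 3 by the labelling convention), and uses hypothetical function names.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

# Second-variable population ranks of units 1..7 in the example rank matrix.
RHO2 = [4, 5, 2, 3, 1, 7, 6]


def sample_rank_matrix(s):
    """Sample ranks on variable 2 for a sample s given as sorted population labels."""
    second = [RHO2[i - 1] for i in s]
    ordered = sorted(second)
    return tuple(ordered.index(v) + 1 for v in second)


def partition(n=3):
    """Group all samples by the sample rank matrix R they generate."""
    groups = defaultdict(list)
    for s in combinations(range(1, 8), n):
        groups[sample_rank_matrix(s)].append(s)
    return groups


def conditional(groups, R):
    """P(i_j = i | S_R), assuming every sample in S_R is equally likely."""
    samples = groups[R]
    probs = defaultdict(Fraction)
    for s in samples:
        for j, i in enumerate(s, start=1):
            probs[(j, i)] += Fraction(1, len(samples))
    return dict(probs)


GROUPS = partition()
R_OBSERVED = sample_rank_matrix((2, 4, 5))   # ranks generated by s = (2, 4, 5)
COND = conditional(GROUPS, R_OBSERVED)
```

Under these assumptions, the rank matrix generated by s = (2,4,5) is shared by four of the 35 samples, and in all four the third sample unit is population unit 5, so that unit is identified with certainty from the ranks alone.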
11
The intruder's problem:
• Given R, to determine SR
• and units with large P(ij = i | SR).
• With increasing n and N: A non-trivial combinatorial
problem(?).
12
Research issues
• How much protection can the method give in
various scenarios?
• How should it best be implemented to balance
information loss and protection?
• Statistics Norway will start investigating this with
application to the Norwegian EU-SILC.
• We invite researchers from other
nations to participate,
• in particular people from other ”register
nations”.
13
That was all.
Thank you for listening.
14