SteveTalk11-11-02 - CMU Contributed Webserver


CMU/Pitt Applied Decision Modeling Seminar
Disclosure Limitation in Large Statistical Databases
Stephen F. Roehrig
Collaborators
George Duncan, Heinz/Statistics, CMU
Stephen Fienberg, Statistics, CMU
Adrian Dobra, Statistics, Duke
Larry Cox, Center for Health Statistics, H&HS
Jesus De Loera, Math, UC Davis
Bernd Sturmfels, Math, UC Berkeley
JoAnne O’Roarke, Institute for Social Research, U. Michigan
Funding
NSF
NCES
NIA
NCHS
NISS
Census
BLS
How Should Government Distribute What It Knows?
Individuals and businesses contribute data about themselves, as mandated by government
Government summarizes and returns these data to policy makers, policy researchers, individuals and businesses
Everyone is supposed to see the value, but with no privacy downside
Obligations: return as much information as possible
Restrictions: don’t disclose sensitive information
Who Collects Data?

Federal, state and local government:
Census
Bureau of Labor Statistics
National Center for Health Statistics, etc.
Federally funded surveys:
Health and Retirement Study (NIA)
Treatment Episode Data Set (NIH)
Industry:
Health care
Insurance
“Dataminers”
Real-World Example I
The U.S. Census Bureau wants to allow online
queries against its census data.
It builds a large system, with many
safeguards, then calls in “distinguished
university statisticians” to test it.
Acting as “data attackers”, they attempt to
discover sensitive information.
They do, and so the Census Bureau’s plans
must be scaled back.
(Source: Duncan, Roehrig and Kannan)
Real-World Example II
To better understand health care usage
patterns, a researcher wants to measure the
distance from patients’ residences to the
hospitals where their care was given.
But because of confidentiality concerns, it’s
only possible to get the patients’ ZIP codes,
not addresses.
The researcher uses ZIP code centroids
instead of addresses, causing large errors in
the analysis.
(Source: Marty Gaynor)
Real-World Example III
To understand price structures in the health
care industry, a researcher wants to compare
negotiated prices to list prices, for health
services.
One data set has hospital IDs and list prices for various services. Another data set gives negotiated prices for actual services provided, but for confidentiality reasons, no hospital IDs can be provided.
Thus matching is impossible.
(Source: Marty Gaynor)
Data Utility vs. Disclosure Risk
Data utility: a measure of the usefulness of a
data set to its intended users
Disclosure risk: the degree to which a data
set reveals sensitive information
The tradeoff is a hard one, currently judged
mostly heuristically
Disclosure limitation, in theory and practice,
examines this tradeoff.
The R-U Confidentiality Map
[Figure: R-U confidentiality map. Axes are data utility (horizontal) and disclosure risk (vertical); two possible releases of the same data lie between the “no data” point and the original data, relative to a disclosure threshold.]
The R-U Confidentiality Map
For many disclosure limitation methods, we can
choose one or more parameters.
[Figure: R-U confidentiality map. As its parameter varies, each of DL Method 1 and DL Method 2 traces a curve from “no data” toward the original data; the disclosure threshold picks out the admissible releases.]
Traditional Methods: Microdata
Microdata: a set of records containing
information on individual respondents
Suppose you are supposed to release
microdata, for the public good, about
individuals. The data include:
Name
Address
City of residence
Occupation
Criminal record
Microdata
Typical safeguard: delete “identifiers”
So release a dataset containing only city of
residence, occupation and criminal record…
Residence = Amsterdam
Occupation = Mayor
Criminal history = has criminal record
HIPAA says this is OK, so long as a researcher “promises not to attempt re-identification”
Is this far-fetched?
Microdata
Unique (or almost unique) identifiers
are common
The 1997 voting list for Cambridge, MA
has 54,805 voters:
Uniqueness of demographic fields:
birth date alone: 12%
birth date and gender: 29%
birth date and 5-digit ZIP: 69%
birth date and full postal code: 97%
(Source: Latanya Sweeney)
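Uniqueness figures like these are straightforward to compute. A minimal sketch (the toy records and field names are illustrative, not the Cambridge voter list) that counts the fraction of records unique on a chosen set of quasi-identifier fields:

```python
from collections import Counter

def fraction_unique(records, fields):
    """Fraction of records whose combination of the given fields
    appears exactly once in the dataset."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)

# Toy example: four records, increasingly identifiable as fields are added.
records = [
    {"birth_date": "1970-01-01", "gender": "M", "zip": "15213"},
    {"birth_date": "1970-01-01", "gender": "F", "zip": "15213"},
    {"birth_date": "1970-01-01", "gender": "M", "zip": "15213"},
    {"birth_date": "1980-05-05", "gender": "F", "zip": "02138"},
]

print(fraction_unique(records, ["birth_date"]))                   # 0.25
print(fraction_unique(records, ["birth_date", "zip"]))            # 0.25
print(fraction_unique(records, ["birth_date", "gender", "zip"]))  # 0.5
```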
Traditional Solutions for
Microdata
Sampling
Adding noise
Global recoding (coarsening)
Local suppression
Data swapping
Micro-aggregation
Example: Data Swapping

Unique on location, age and sex: find a match in another location, then flip a coin to see if swapping is done.

Location    Age    Sex  Candidate
Pittsburgh  Young  M    Bush
Pittsburgh  Young  M    Gore
Pittsburgh  Young  F    Gore
Pittsburgh  Old    M    Gore
Pittsburgh  Old    M    Bush
Cleveland   Young  F    Bush
Cleveland   Young  F    Gore
Cleveland   Old    M    Gore
Cleveland   Old    F    Bush
Cleveland   Old    F    Gore
Data Swapping Terms
Uniques Key: Variables that are used to
identify records that pose a confidentiality
risk.
Swapping Key: Variables that are used to
identify which records will be swapped.
Swapping Attribute: Variables over which
swapping will occur.
Protected Variables: Other variables, which
may or may not be sensitive.
Roles in the example (columns left to right):
Swapping Attribute: Location
Uniques Key & Swapping Key: Age, Sex
Protected Variable: Candidate

Location    Age    Sex  Candidate
Pittsburgh  Young  M    Bush
Pittsburgh  Young  M    Gore
Pittsburgh  Young  F    Gore
Pittsburgh  Old    M    Gore
Pittsburgh  Old    M    Bush
Cleveland   Young  F    Bush
Cleveland   Young  F    Gore
Cleveland   Old    M    Gore
Cleveland   Old    F    Bush
Cleveland   Old    F    Gore
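The swapping step can be sketched in a few lines. This is illustrative code, not the production procedure: the real method swaps only at-risk (unique) records, while this sketch offers every record a coin flip; the record layout and probability are assumptions.

```python
import random

def swap_records(records, p_swap=0.5, rng=None):
    """For each record, look for partners in another location that agree
    on the swapping key (age, sex); with probability p_swap, exchange
    the swapping attribute (location) with a randomly chosen partner."""
    rng = rng or random.Random(0)
    swapped = [dict(r) for r in records]
    for i, rec in enumerate(swapped):
        partners = [j for j, other in enumerate(swapped)
                    if j != i
                    and other["age"] == rec["age"]
                    and other["sex"] == rec["sex"]
                    and other["location"] != rec["location"]]
        if partners and rng.random() < p_swap:
            j = rng.choice(partners)
            rec["location"], swapped[j]["location"] = \
                swapped[j]["location"], rec["location"]
    return swapped

# The ten records from the example table:
records = [
    {"location": "Pittsburgh", "age": "Young", "sex": "M", "candidate": "Bush"},
    {"location": "Pittsburgh", "age": "Young", "sex": "M", "candidate": "Gore"},
    {"location": "Pittsburgh", "age": "Young", "sex": "F", "candidate": "Gore"},
    {"location": "Pittsburgh", "age": "Old", "sex": "M", "candidate": "Gore"},
    {"location": "Pittsburgh", "age": "Old", "sex": "M", "candidate": "Bush"},
    {"location": "Cleveland", "age": "Young", "sex": "F", "candidate": "Bush"},
    {"location": "Cleveland", "age": "Young", "sex": "F", "candidate": "Gore"},
    {"location": "Cleveland", "age": "Old", "sex": "M", "candidate": "Gore"},
    {"location": "Cleveland", "age": "Old", "sex": "F", "candidate": "Bush"},
    {"location": "Cleveland", "age": "Old", "sex": "F", "candidate": "Gore"},
]
released = swap_records(records, rng=random.Random(42))
# Swapping only exchanges locations, so the per-location counts and the
# joint (age, sex, candidate) counts are identical before and after.
```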
Data Swapping In Use

The Treatment Episode Data Set (TEDS)
National admissions to drug treatment
facilities
Administered by SAMHSA (part of HHS)
Data released through ICPSR
1,477,884 records in 1997, of an estimated
2,207,375 total admissions nationwide
TEDS (cont.)

Each record contains
Age
Sex
Race
Ethnicity
Education level
Marital status
Source of income
State & PMSA
Primary substance abused
and much more…
First Step: Recode

Example recode: Education level
Continuous 0-25 becomes
1: 8 years or less
2: 9-11
3: 12
4: 13-15
5: 16 or more
TEDS Uniques Key

This was determined empirically, after recoding. We ended up choosing:
State and PMSA
Pregnancy
Veteran status
Methadone planned as part of treatment
Race
Ethnicity
Sex
Age
Other Choices
Swapping key: the uniques key, plus
primary substance of abuse
Swapping attribute: state, PMSA,
Census Region and Census Division
Protected variables: all other TEDS
variables
TEDS Results
After recoding, only 0.3% of the
records needed to be swapped
Swapping was done between nearby
locations, to preserve statistics over
natural geographic aggregations
Tables: Magnitude Data

Profit of Software Firms ≥ $10 million

Language   Region A  Region B  Region C  Total
C++        11        47        58        116
Java       1         15        33        49
Smalltalk  2         31        20        53
Total      14        93        111       218

(Source: Java Random.nextInt(75))
Tables: Frequency Data

Income Level, Gender = Male

Income \ Race            White  Black  Chinese  Total
≤ $10,000                96     10     1        107
> $10,000 and ≤ $25,000  72     7      1        80
> $25,000                161    6      2        169
Total                    329    23     4        356

Income Level, Gender = Female

Income \ Race            White  Black  Chinese  Total
≤ $10,000                186    11     0        197
> $10,000 and ≤ $25,000  127    7      1        135
> $25,000                51     3      0        54
Total                    364    21     1        386

(Source: 1990 Census)
Traditional Solutions for Tables

Suppress some cells:
Suppress the sensitive cells, plus others as necessary
Perturb some cells:
Controlled rounding
Publish only the marginal totals
Lots of research here, and good results for 2-way tables
For 3-way and higher, this is surprisingly hard!
Disclosure Risk, Data Utility

Risk:
the degree to which confidentiality might be compromised
perhaps examine cell feasibility intervals, or better, distributions of possible cell values
Utility:
a measure of the value to a legitimate user
the higher, the more accurately a user can estimate the magnitude of errors in analyses based on the released table
Example: Delinquent Children

Number of Delinquent Children by County and Education Level of Head of Household

County  Low  Medium  High  Very High  Total
Alpha   15   1       3     1          20
Beta    20   10      10    15         55
Gamma   3    10      10    2          25
Delta   12   14      7     2          35
Total   50   35      30    20         135

(Source: OMB Statistical Policy Working Paper 22)
Controlled Rounding (Base 3)

County  Low  Medium  High  Very High  Total
Alpha   15   0       3     0          18
Beta    21   9       12    15         57
Gamma   3    9       9     3          24
Delta   12   15      6     3          36
Total   51   33      30    21         135

Uniform (and known) feasibility interval.
Easy for 2-D tables, sometimes impossible for 3-D.
1,025,908,683 possible original tables.
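Finding a controlled rounding is a network-flow problem, but checking one is easy. A minimal sketch (the helper names are illustrative, not a published algorithm): every entry, including the margins, must move to an adjacent multiple of the base, and the rounded table must remain additive.

```python
def adjacent_multiples(v, base):
    """The multiples of `base` at or immediately around v."""
    lo = (v // base) * base
    return {lo, lo + base} if v % base else {v}

def is_controlled_rounding(orig, rounded, base):
    """Check that `rounded` is a valid base-`base` controlled rounding of
    `orig` (both given as full tables including the margin row/column)."""
    n, m = len(orig), len(orig[0])
    # Every cell (margins included) rounds to an adjacent multiple.
    for i in range(n):
        for j in range(m):
            if rounded[i][j] not in adjacent_multiples(orig[i][j], base):
                return False
    # The rounded table must still add up to its own margins.
    for i in range(n - 1):
        if sum(rounded[i][:-1]) != rounded[i][-1]:
            return False
    for j in range(m - 1):
        if sum(rounded[i][j] for i in range(n - 1)) != rounded[-1][j]:
            return False
    return True

orig = [[15, 1, 3, 1, 20],
        [20, 10, 10, 15, 55],
        [3, 10, 10, 2, 25],
        [12, 14, 7, 2, 35],
        [50, 35, 30, 20, 135]]
rounded = [[15, 0, 3, 0, 18],
           [21, 9, 12, 15, 57],
           [3, 9, 9, 3, 24],
           [12, 15, 6, 3, 36],
           [51, 33, 30, 21, 135]]
print(is_controlled_rounding(orig, rounded, 3))  # True
```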
Suppress Sensitive Cells & Others

County  Low  Medium  High  Very High  Total
Alpha   15   p       s     p          20
Beta    20   10      10    15         55
Gamma   3    10      s     p          25
Delta   12   s       7     p          35
Total   50   35      30    20         135

(s = suppressed sensitive cell, p = complementary suppression)

Hard to do optimally (NP-complete).
Feasibility intervals easily found with LP.
Users have no way of finding cell value probabilities.
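On a table this small the feasibility intervals can be seen by brute force rather than LP. A sketch (illustrative code, not the talk's LP formulation) that enumerates every non-negative integer completion of the suppressed table above and reports each suppressed cell's interval:

```python
from itertools import product

# The suppressed table: None marks a suppressed cell (s or p above).
table = [[15, None, None, None],
         [20, 10,   10,   15],
         [3,  10,   None, None],
         [12, None, 7,    None]]
row_margins = [20, 55, 25, 35]
col_margins = [50, 35, 30, 20]

def row_completions(row, margin):
    """All non-negative integer fills of the None entries of one row
    that make it sum to its margin."""
    idx = [j for j, v in enumerate(row) if v is None]
    slack = margin - sum(v for v in row if v is not None)
    if not idx:
        return [row[:]] if slack == 0 else []
    fills = []
    for head in product(range(slack + 1), repeat=len(idx) - 1):
        last = slack - sum(head)
        if last >= 0:
            filled = row[:]
            for j, v in zip(idx, head + (last,)):
                filled[j] = v
            fills.append(filled)
    return fills

options = [row_completions(r, m) for r, m in zip(table, row_margins)]
solutions = [t for t in product(*options)
             if all(sum(r[j] for r in t) == col_margins[j] for j in range(4))]

# Feasibility interval for each suppressed cell:
cells = [(i, j) for i in range(4) for j in range(4) if table[i][j] is None]
bounds = {c: (min(t[c[0]][c[1]] for t in solutions),
              max(t[c[0]][c[1]] for t in solutions)) for c in cells}
print(len(solutions))   # 15
print(bounds[(0, 2)])   # (1, 5)
```

So only 15 tables fit the published pattern, and the snooper learns that Alpha's sensitive cell lies between 1 and 5 even though its true value (3) is hidden.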
Release Only the Margins

County  Low  Medium  High  Very High  Total
Alpha   ?    ?       ?     ?          20
Beta    ?    ?       ?     ?          55
Gamma   ?    ?       ?     ?          25
Delta   ?    ?       ?     ?          35
Total   50   35      30    20         135
Release Only the Margins

18,272,363,056 tables have our margins (thanks to De Loera & Sturmfels).
Low risk, low utility.
Easy!
Very commonly done.
Statistical users might estimate internal cells with, e.g., iterative proportional fitting.
Or with more powerful methods…
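For a two-way table with only the margins released, iterative proportional fitting reduces to the independence estimate. A minimal sketch (margins from the example above; the fixed iteration count is an illustrative stopping rule):

```python
def ipf(row_margins, col_margins, iters=50):
    """Iterative proportional fitting: start from a uniform table and
    alternately rescale rows and columns to match the target margins."""
    n, m = len(row_margins), len(col_margins)
    x = [[1.0] * m for _ in range(n)]
    for _ in range(iters):
        for i in range(n):                       # fit row sums
            s = sum(x[i])
            x[i] = [v * row_margins[i] / s for v in x[i]]
        for j in range(m):                       # fit column sums
            s = sum(x[i][j] for i in range(n))
            for i in range(n):
                x[i][j] *= col_margins[j] / s
    return x

est = ipf([20, 55, 25, 35], [50, 35, 30, 20])
# With no interior information, IPF lands on the independence table
# r_i * c_j / N, e.g. est[0][0] == 20 * 50 / 135.
print(round(est[0][0], 2))  # 7.41
```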
Some New Methods for Disclosure Detection
Use methods from commutative algebra to
explore the set of tables having known
margins
Originated with the Diaconis-Sturmfels paper
(Annals of Statistics, 1998)
Extensions and complements by Dobra and
Fienberg
Related to existing ideas in combinatorics
Background: Algebraic Ideals

Let A be a ring (e.g., ℝ or ℤ), and I ⊆ A. Then I is called an ideal if
0 ∈ I
If f, g ∈ I, then f + g ∈ I
If f ∈ I and h ∈ A, then f h ∈ I
Ex. 1: The ring ℤ of integers, and the set I = {…, -4, -2, 0, 2, 4, …}.
Generators

Let f1, …, fs ∈ A. Then

⟨f1, …, fs⟩ = { Σ_{i=1}^{s} h_i f_i : h_1, …, h_s ∈ A }

is the ideal generated by f1, …, fs.
Ex. 1: The ring ℤ of integers, and the set I = {…, -4, -2, 0, 2, 4, …}.
SAT question: What is a generator of I?
I = ⟨2⟩, since I = 2·{…, -1, 0, 1, …}.
Ideal Example 2

The ring k[x] of polynomials in one variable, and the ideal I = ⟨x^4 - 1, x^6 - 1⟩.
GRE question: What is a minimal generator of I (i.e., no subset is also a generator)?
I = ⟨x^2 - 1⟩, since x^2 - 1 is the greatest common divisor of {x^4 - 1, x^6 - 1}.
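The generator can be found exactly as in the integer case, by the Euclidean algorithm on polynomials. A small sketch (dense coefficient lists over the rationals, lowest degree first; illustrative code, not a computer-algebra library):

```python
from fractions import Fraction

def poly_mod(a, b):
    """Remainder of polynomial a divided by b (coeff lists, low degree first)."""
    a = [Fraction(c) for c in a]
    while len(a) >= len(b) and any(a):
        while a and a[-1] == 0:          # strip leading zeros
            a.pop()
        if len(a) < len(b):
            break
        factor = a[-1] / Fraction(b[-1])
        shift = len(a) - len(b)
        for i, c in enumerate(b):        # cancel the leading term
            a[shift + i] -= factor * c
        a.pop()
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_gcd(a, b):
    """Monic gcd of two polynomials via the Euclidean algorithm."""
    while b:
        a, b = b, poly_mod(a, b)
    lead = a[-1]
    return [c / lead for c in a]

x4_minus_1 = [-1, 0, 0, 0, 1]            # x^4 - 1
x6_minus_1 = [-1, 0, 0, 0, 0, 0, 1]      # x^6 - 1
print([int(c) for c in poly_gcd(x6_minus_1, x4_minus_1)])  # [-1, 0, 1]
```

The result, coefficients [-1, 0, 1], is exactly x^2 - 1.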
Why Are Minimal Generators Useful?

Compact representation of the ideal; the initial description may be “verbose”.
Allow easy generation of elements, often in a disciplined order.
Guaranteed to explore the full ideal.
Disclosure Analysis: Marginals

Suppose a “data snooper” knows only this:

County  Low  Medium  High  Very High  Total
Alpha   ?    ?       ?     ?          20
Beta    ?    ?       ?     ?          55
Gamma   ?    ?       ?     ?          25
Delta   ?    ?       ?     ?          35
Total   50   35      30    20         135
Disclosure Analysis

Of course the data collector knows the true counts:

County  Low  Medium  High  Very High  Total
Alpha   15   1       3     1          20
Beta    20   10      10    15         55
Gamma   3    10      10    2          25
Delta   12   14      7     2          35
Total   50   35      30    20         135
Disclosure Analysis

What are the feasible tables, given the margins? Here is one:

County  Low   Medium  High  Very High  Total
Alpha   15+1  1-1     3     1          20
Beta    20-1  10+1    10    15         55
Gamma   3     10      10    2          25
Delta   12    14      7     2          35
Total   50    35      30    20         135
Disclosure Analysis Problems

Both the data collector and the snooper
are interested in the largest and
smallest feasible values for each cell.
Narrow bounds might constitute disclosure.
Both might also be interested in the
distribution of possible cell values.

A tight distribution might constitute
disclosure.
The Bounds Problem

County  Low  Medium  High  Very High  Total
Alpha   ?    ?       ?     ?          20
Beta    ?    ?       ?     ?          55
Gamma   ?    ?       ?     ?          25
Delta   ?    ?       ?     ?          35
Total   50   35      30    20         135

This is usually framed as a continuous problem. Is this the right question?

[Figure: histogram of possible values 0-5 for a single cell]
The Distribution Problem

County  Low  Medium  High  Very High  Total
Alpha   ?    ?       ?     ?          20
Beta    ?    ?       ?     ?          55
Gamma   ?    ?       ?     ?          25
Delta   ?    ?       ?     ?          35
Total   50   35      30    20         135

Given the margins, and a set of priors over cell values, what distribution results?

[Figure: distribution over the possible values 0-5 of a single cell]
Transform to an Algebraic Problem

Define some indeterminates:

x11 x12 x13 x14
x21 x22 x23 x24
x31 x32 x33 x34
x41 x42 x43 x44

Then write the move

+1 -1
-1 +1

(applied to a 2×2 subtable) as the polynomial x11x22 - x12x21.
Ideal Example 3

Ideal I is the set of monomial differences

x11^{n11} ⋯ xJK^{nJK} - x11^{m11} ⋯ xJK^{mJK}

that take a non-negative table of dimension J×K with fixed margins M to another non-negative table with the same margins.
Putnam Competition question: What is a minimal generator of I?
Solutions: Bounds, Distributions

Upper and lower bounds on cells:
Integer linear programming (max/min, subject to constraints implied by the marginals).
Find a generator of the ideal:
Use Buchberger’s algorithm to find a Gröbner basis.
Apply long division repeatedly; the remainder yields an optimum (Conti & Traverso).
Distribution of cell values:
Use the Gröbner basis to enumerate or sample (Diaconis & Sturmfels, Dobra, Dobra & Fienberg).
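For a two-way table the answer is classical: the basic 2×2 moves form a Markov basis, so a Diaconis-Sturmfels-style walk over tables with the fixed margins can be sketched directly. Illustrative code only; a real sampler for a target distribution would add a Metropolis acceptance step:

```python
import random

def markov_step(table, rng):
    """One step of the walk: apply a random basic 2x2 move (+1/-1 on the
    corners of a rectangle) if the table stays non-negative."""
    n, m = len(table), len(table[0])
    r1, r2 = rng.sample(range(n), 2)
    c1, c2 = rng.sample(range(m), 2)
    eps = rng.choice([1, -1])
    # The move adds eps at (r1,c1),(r2,c2) and -eps at (r1,c2),(r2,c1),
    # so every row sum and column sum is unchanged.
    if min(table[r1][c1] + eps, table[r2][c2] + eps,
           table[r1][c2] - eps, table[r2][c1] - eps) < 0:
        return table          # move would create a negative count: stay put
    t = [row[:] for row in table]
    t[r1][c1] += eps; t[r2][c2] += eps
    t[r1][c2] -= eps; t[r2][c1] -= eps
    return t

rng = random.Random(1)
t = [[15, 1, 3, 1], [20, 10, 10, 15], [3, 10, 10, 2], [12, 14, 7, 2]]
for _ in range(1000):
    t = markov_step(t, rng)
# The walk wanders over feasible tables, but the margins never change:
print([sum(row) for row in t])        # [20, 55, 25, 35]
print([sum(col) for col in zip(*t)])  # [50, 35, 30, 20]
```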
Trouble in Paradise

For larger disclosure problems (dimension ≥ 3), the linear programming bounds may be fractional.
For larger disclosure problems (dimension ≥ 3), the Gröbner basis is very hard to compute.
Why Fractional Bounds?

Sometimes the marginal sum values cause the constraints to intersect at non-integer points.

[Figure: a feasible region in (x1, x2); the LP maximum for x2 lies at a fractional vertex above the integer maximum.]
But Could This Happen?

Standard example from ManSci 101:

[Figure: the same picture: a fractional LP maximum for x2 strictly above the integer maximum.]
Computing Gröbner Bases

Both the Conti & Traverso and Sturmfels constructions include extra variables.
The ring k[x1,…,xIJK] is embedded in k[x1,…,xIJK, y1,…,yN], where N is the number of constraints.
Buchberger’s algorithm is applied.
Then throw away basis elements not in k[x1,…,xIJK].
Computing a basis for the 3×3×3 problem: about 7 hours with Macaulay.
An Improved Algorithm

Designed to compute implicitly in k[x1,…,xIJK, y1,…,yN] but only process elements in k[x1,…,xIJK].
3×3×3 problem runs in 25 ms.
4×3×3 problem runs in 20 min.
5×3×3 problem runs in 3 months…
Ideals
“If they’re so freaking ideal, how come
you can’t compute them?”

Overheard at Grostat 5, the Conference on
Commutative Algebra and Statistics.
Other Approaches

Connections with decomposable and graphical log-linear models (Dobra & Fienberg, Dobra).
Example: 3-D sensitive table, two 2-D margins given. Use conditional independence:

[Figure: known margins AB and BC of a three-way table A×B×C; A and C are linked only through B.]
Decomposable Models
Gröbner bases are “simple,” and can be
written down explicitly.
For some other problems (e.g.,
graphical models), the Gröbner basis
can be “assembled” from bases for
smaller pieces.
Nice!
A New Method for Disclosure Limitation: Cyclic Perturbation

Choose cycles that leave the margins fixed. One example:

Original           Cycle             Perturbed table
15  1  3  1       +1  0 -1  0       16  1  2  1
20 10 10 15   +   -1  0 +1  0   =   19 10 11 15
 3 10 10  2        0  0  0  0        3 10 10  2
12 14  7  2        0  0  0  0       12 14  7  2

The set of cycles determines the published table’s feasibility interval.
Cyclic Perturbation: Details

Choose a set of cycles that covers all table cells “equally”. Example:

[Figure: four 4×4 patterns of +, -, and 0 entries, each cycle summing to zero along every row and column.]

Each cell has exactly two “chances” to move.
Cyclic Perturbation: Details

Flip a three-sided coin with outcomes
A (probability = α)
B (probability = β)
C (probability = 1 - α - β)
If A, add the first cycle (unless there is a zero in the cycle)
If B, subtract the first cycle (unless there is a zero in the cycle)
If C, do nothing
Repeat with the remaining cycles
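The coin-flip procedure can be sketched directly. Illustrative code: the single cycle stands in for the full published cycle set, and the guard (skip a move that would drive a cell negative) is one reading of "unless there is a zero in the cycle":

```python
import random

def perturb(table, cycles, alpha, beta, rng=None):
    """Cyclic perturbation: for each cycle, add it with probability alpha,
    subtract it with probability beta, otherwise leave the table alone.
    A move is skipped if it would drive a cell below zero."""
    rng = rng or random.Random()
    t = [row[:] for row in table]
    for cyc in cycles:
        u = rng.random()
        sign = 1 if u < alpha else (-1 if u < alpha + beta else 0)
        if sign == 0:
            continue
        new = [[t[i][j] + sign * cyc[i][j] for j in range(len(t[0]))]
               for i in range(len(t))]
        if min(min(row) for row in new) >= 0:
            t = new
    return t

table = [[15, 1, 3, 1], [20, 10, 10, 15], [3, 10, 10, 2], [12, 14, 7, 2]]
cycle = [[1, 0, -1, 0], [-1, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
out = perturb(table, [cycle], alpha=1/3, beta=1/3, rng=random.Random(7))
# Whatever the coin does, every cycle has zero row and column sums,
# so the published margins are exactly those of the original:
print([sum(row) for row in out])    # [20, 55, 25, 35]
print([sum(c) for c in zip(*out)])  # [50, 35, 30, 20]
```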
Cyclic Perturbation: Details

For the chosen set of cycles, there are 3^4 = 81 possible perturbed tables.
The feasibility interval is the original value ± 2.

[Figure: the original table TO, with cycles 1-4 applied in turn, yields the perturbed table TP.]
Cyclic Perturbation: Details

Choose α, β.
Perturb with each cycle.
Publish the resulting table.
Publish the cycles and α, β.

Original           Perturbed table
15  1  3  1        16  0  2  2
20 10 10 15        21 11  9 14
 3 10 10  2         2 11 11  1
12 14  7  2        11 13  8  3
Analysis of Cell Probabilities

[Figure: the original table TO is carried to the perturbed table TP; many candidate tables Tk (T1, …, T4, …) could have produced the same TP, each with some probability Pr(TP | Tk).]
Distributions of Cell Values

Since the mechanism is public, a user can calculate the distribution of true cell values:
Compute every table Tk that could have been the original, along with the probability Pr(TP | Tk).
Specify a prior distribution over all the possible original tables Tk.
Apply Bayes’ theorem to get the posterior probability Pr(Tk | TP) for each Tk.
The distribution for each cell is

Pr( t(i,j) = q | TP ) = Σ_{k : t_k(i,j) = q} Pr(Tk | TP)
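The Bayes step can be sketched by brute force for a toy mechanism (a 2×2 table, one cycle, α = β = 1/4; all names are illustrative, and the sketch assumes a uniform prior and that no move was ever blocked):

```python
from itertools import product
from fractions import Fraction

def posterior_cell_dists(published, cycles, alpha, beta):
    """Enumerate every original that could have produced `published`
    and return the posterior distribution of each cell's true value."""
    p = {1: alpha, -1: beta, 0: 1 - alpha - beta}
    n, m = len(published), len(published[0])
    likelihood = {}
    for signs in product([1, 0, -1], repeat=len(cycles)):
        # The original that these coin flips would carry to `published`.
        orig = tuple(tuple(published[i][j]
                           - sum(s * c[i][j] for s, c in zip(signs, cycles))
                           for j in range(m)) for i in range(n))
        if min(min(r) for r in orig) < 0:
            continue  # not a legal table of counts
        prob = Fraction(1)
        for s in signs:
            prob *= p[s]
        likelihood[orig] = likelihood.get(orig, Fraction(0)) + prob
    total = sum(likelihood.values())
    posterior = {k: v / total for k, v in likelihood.items()}
    dists = {}
    for i in range(n):
        for j in range(m):
            d = {}
            for tab, pr in posterior.items():
                d[tab[i][j]] = d.get(tab[i][j], Fraction(0)) + pr
            dists[(i, j)] = d
    return dists

published = [[3, 2], [2, 3]]
cycle = [[1, -1], [-1, 1]]
dists = posterior_cell_dists(published, [cycle],
                             alpha=Fraction(1, 4), beta=Fraction(1, 4))
# The published value 3 is the posterior mode for cell (0, 0):
print(dists[(0, 0)][3])  # 1/2
```

Note that the cost grows as 3^(number of cycles), which is why the full analysis is feasible only up to moderate table sizes.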
Results for the Example

Original           Perturbed table
15  1  3  1        16  0  2  2
20 10 10 15        21 11  9 14
 3 10 10  2         2 11 11  1
12 14  7  2        11 13  8  3

q =                      0     1     2     3     4     5
Pr( t(1,2) = q | TP )  0.71  0.25  0.04  0.00  0.00  0.00
Pr( t(1,4) = q | TP )  0.06  0.25  0.38  0.25  0.06  0.00
Pr( t(3,4) = q | TP )  0.00  0.71  0.25  0.04  0.00  0.00
Pr( t(4,4) = q | TP )  0.00  0.05  0.29  0.44  0.21  0.01
Properties

It’s not difficult to quantify data utility and disclosure risk (cf. cell suppression and controlled rounding).
Priors of data users and data intruders can be different.
Theorem: For a uniform prior, the mode of each posterior cell distribution is its published value.
Scaling

Sets of cycles with desirable properties are easy to find for larger 2-D tables.
Extensions to 3 and higher dimensions are also straightforward.
Computing the perturbation for any size table is easy & fast.
The complete Bayesian analysis is feasible to at least 20×20 (with no special TLC).
What Might Priors Be?
They could reflect historical data.
If I’m in the survey, I know my cell is at
least 1.
Public information.
Insider information.
Rounding Redux

A similar Bayesian analysis can be done, provided the exact algorithm is available.
It’s generally much harder to do.
Using a deterministic version of Cox’s ’87 rounding procedure, we must consider “only” 17,132,236 tables.
For uniform priors, the posterior cell distributions were nearly uniform.
Three days of computing time for a 4×4 table…
Future Work (i.e. What’s Hot?)
Much more to be learned about the
connections between log-linear models,
algebraic structures and networks.
The Grostat conference, and the “Tables”
conference at Davis have resulted in
wonderful collaborations between
statisticians, mathematicians and operations
researchers.
Stay tuned!