#### Transcript: SteveTalk11-11-02 (CMU Contributed Webserver)

##### CMU/Pitt Applied Decision Modeling Seminar

**Disclosure Limitation in Large Statistical Databases**

Stephen F. Roehrig

Collaborators:

- George Duncan, Heinz/Statistics, CMU
- Stephen Fienberg, Statistics, CMU
- Adrian Dobra, Statistics, Duke
- Larry Cox, Center for Health Statistics, HHS
- Jesus De Loera, Math, UC Davis
- Bernd Sturmfels, Math, UC Berkeley
- JoAnne O'Roarke, Institute for Social Research, U. Michigan

Funding: NSF, NCES, NIA, NCHS, NISS, Census, BLS

##### How Should Government Distribute What It Knows?

- Individuals and businesses contribute data about themselves, as mandated by government.
- Government summarizes and returns these data to policy makers, policy researchers, individuals and businesses.
- Everyone is supposed to see the value, but with no privacy downside.
- Obligations: return as much information as possible.
- Restrictions: don't disclose sensitive information.

##### Who Collects Data?

- Federal, state and local government
- Federally funded surveys: Census, Bureau of Labor Statistics, National Center for Health Statistics, etc.
- Health and Retirement Study (NIA)
- Treatment Episode Data Set (NIH)
- Industry: health care, insurance, "dataminers"

##### Real-World Example I

The U.S. Census Bureau wants to allow online queries against its census data. It builds a large system, with many safeguards, then calls in "distinguished university statisticians" to test it. Acting as "data attackers", they attempt to discover sensitive information. They do, and so the Census Bureau's plans must be scaled back. (Source: Duncan, Roehrig and Kannan)

##### Real-World Example II

To better understand health care usage patterns, a researcher wants to measure the distance from patients' residences to the hospitals where their care was given. But because of confidentiality concerns, it's only possible to get the patients' ZIP codes, not addresses. The researcher uses ZIP code centroids instead of addresses, causing large errors in the analysis.
(Source: Marty Gaynor)

##### Real-World Example III

To understand price structures in the health care industry, a researcher wants to compare negotiated prices to list prices for health services. One data set has hospital IDs and list prices for various services. Another data set gives negotiated prices for actual services provided, but for confidentiality reasons, no hospital IDs can be provided. Thus matching is impossible. (Source: Marty Gaynor)

##### Data Utility vs. Disclosure Risk

- Data utility: a measure of the usefulness of a data set to its intended users.
- Disclosure risk: the degree to which a data set reveals sensitive information.
- The tradeoff is a hard one, currently judged mostly heuristically. Disclosure limitation, in theory and practice, examines this tradeoff.

##### The R-U Confidentiality Map

[Chart: disclosure risk plotted against data utility for two possible releases of the same data, with the original data above a disclosure threshold and "no data" at the origin. For many disclosure limitation methods we can choose one or more parameters, so each method (DL Method 1, DL Method 2) traces a curve in the risk-utility plane.]

##### Traditional Methods: Microdata

Microdata: a set of records containing information on individual respondents. Suppose you are supposed to release microdata, for the public good, about individuals.
The data include:

- Name
- Address
- City of residence
- Occupation
- Criminal record

##### Microdata

Typical safeguard: delete "identifiers". So release a dataset containing only city of residence, occupation and criminal record...

- Residence = Amsterdam
- Occupation = Mayor
- Criminal history = has criminal record

HIPAA says this is OK, so long as a researcher "promises not to attempt re-identification". Is this far-fetched?

Unique (or almost unique) identifiers are common. The 1997 voting list for Cambridge, MA has 54,805 voters. Uniqueness of demographic fields:

| Fields | Unique |
|---|---|
| birth date alone | 12% |
| birth date and gender | 29% |
| birth date and 5-digit ZIP | 69% |
| birth date and full postal code | 97% |

(Source: Latanya Sweeney)

##### Traditional Solutions for Microdata

- Sampling
- Adding noise
- Global recoding (coarsening)
- Local suppression
- Data swapping
- Micro-aggregation

##### Example: Data Swapping

- Unique on location, age and sex
- Find a match in another location
- Flip a coin to see if swapping is done

| Location | Age | Sex | Candidate |
|---|---|---|---|
| Pittsburgh | Young | M | Bush |
| Pittsburgh | Young | M | Gore |
| Pittsburgh | Young | F | Gore |
| Pittsburgh | Old | M | Gore |
| Pittsburgh | Old | M | Bush |
| Cleveland | Young | F | Bush |
| Cleveland | Young | F | Gore |
| Cleveland | Old | M | Gore |
| Cleveland | Old | F | Bush |
| Cleveland | Old | F | Gore |

##### Data Swapping Terms

- Uniques key: variables that are used to identify records that pose a confidentiality risk.
- Swapping key: variables that are used to identify which records will be swapped.
- Swapping attribute: variables over which swapping will occur.
- Protected variables: other variables, which may or may not be sensitive.
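The swapping scheme just defined can be sketched in a few lines. This is a minimal illustration on the example table, not the production procedure: records unique on the uniques key (location, age, sex) are matched to a record in another location that agrees on age and sex, and a coin flip decides whether the swapping attribute (location) is exchanged. The helper name `swap_uniques` is ours.

```python
import random
from collections import Counter

records = [  # (location, age, sex, candidate)
    ("Pittsburgh", "Young", "M", "Bush"),
    ("Pittsburgh", "Young", "M", "Gore"),
    ("Pittsburgh", "Young", "F", "Gore"),
    ("Pittsburgh", "Old", "M", "Gore"),
    ("Pittsburgh", "Old", "M", "Bush"),
    ("Cleveland", "Young", "F", "Bush"),
    ("Cleveland", "Young", "F", "Gore"),
    ("Cleveland", "Old", "M", "Gore"),
    ("Cleveland", "Old", "F", "Bush"),
    ("Cleveland", "Old", "F", "Gore"),
]

def swap_uniques(records, rng):
    """Swap the location of records that are unique on (location, age, sex)."""
    records = list(records)
    counts = Counter((r[0], r[1], r[2]) for r in records)
    for i, r in enumerate(records):
        if counts[(r[0], r[1], r[2])] != 1:
            continue  # not at risk: not unique on the uniques key
        # Match on the swapping key (age, sex) in a different location.
        partners = [j for j, s in enumerate(records)
                    if s[0] != r[0] and s[1:3] == r[1:3]]
        if partners and rng.random() < 0.5:  # coin flip
            j = rng.choice(partners)
            # Exchange only the swapping attribute (location); the
            # protected variable (candidate) stays with its record.
            records[i] = (records[j][0],) + r[1:]
            records[j] = (r[0],) + records[j][1:]
    return records

swapped = swap_uniques(records, random.Random(1))
```

Whatever the coin flips decide, the exchange leaves the marginal counts of location and of (age, sex, candidate) unchanged, which is the point of swapping.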
In the example table, location is the swapping attribute, age and sex serve as both the uniques key and the swapping key, and candidate is the protected variable.

##### Data Swapping In Use

- The Treatment Episode Data Set (TEDS): national admissions to drug treatment facilities.
- Administered by SAMHSA (part of HHS); data released through ICPSR.
- 1,477,884 records in 1997, of an estimated 2,207,375 total admissions nationwide.

##### TEDS (cont.)

Each record contains: age, sex, race, ethnicity, education level, marital status, source of income, state & PMSA, primary substance abused, and much more...

##### First Step: Recode

Example recode for education level. Continuous 0-25 becomes:

1. 8 years or less
2. 9-11
3. 12
4. 13-15
5. 16 or more

##### TEDS Uniques Key

This was determined empirically, after recoding. We ended up choosing: state and PMSA, pregnancy, veteran status, methadone planned as part of treatment, race, ethnicity, sex, age.

##### Other Choices

- Swapping key: the uniques key, plus primary substance of abuse.
- Swapping attribute: state, PMSA, Census Region and Census Division.
- Protected variables: all other TEDS variables.

##### TEDS Results

- After recoding, only 0.3% of the records needed to be swapped.
- Swapping was done between nearby locations, to preserve statistics over natural geographic aggregations.

##### Tables: Magnitude Data

Profit of Software Firms ($10 million):

| Language | Region A | Region B | Region C | Total |
|---|---|---|---|---|
| C++ | 11 | 47 | 58 | 116 |
| Java | 1 | 15 | 33 | 49 |
| Smalltalk | 2 | 31 | 20 | 53 |
| Total | 14 | 93 | 111 | 218 |

(Source: Java Random.nextInt(75))

##### Tables: Frequency Data

Income level, Gender = Male (Source: 1990 Census):

| Income | White | Black | Chinese | Total |
|---|---|---|---|---|
| ≤ $10,000 | 96 | 10 | 1 | 107 |
| > $10,000 and ≤ $25,000 | 72 | 7 | 1 | 80 |
| > $25,000 | 161 | 6 | 2 | 169 |
| Total | 329 | 23 | 4 | 356 |

Income level, Gender = Female:

| Income | White | Black | Chinese | Total |
|---|---|---|---|---|
| ≤ $10,000 | 186 | 11 | 0 | 197 |
| > $10,000 and ≤ $25,000 | 127 | 7 | 1 | 135 |
| > $25,000 | 51 | 3 | 0 | 54 |
| Total | 364 | 21 | 1 | 386 |
##### Traditional Solutions for Tables

- Suppress some cells: suppress the sensitive cells, plus others as necessary.
- Perturb some cells: controlled rounding.
- Publish only the marginal totals.

Lots of research here, and good results for 2-way tables. For 3-way and higher, this is surprisingly hard!

##### Disclosure Risk, Data Utility

- Risk: the degree to which confidentiality might be compromised. Perhaps examine cell feasibility intervals, or better, distributions of possible cell values.
- Utility: a measure of the value to a legitimate user. It is higher the more accurately a user can estimate the magnitude of errors in an analysis based on the released table.

##### Example: Delinquent Children

Number of delinquent children by county and education level of head of household (Source: OMB Statistical Policy Working Paper 22):

| County | Low | Medium | High | Very High | Total |
|---|---|---|---|---|---|
| Alpha | 15 | 1 | 3 | 1 | 20 |
| Beta | 20 | 10 | 10 | 15 | 55 |
| Gamma | 3 | 10 | 10 | 2 | 25 |
| Delta | 12 | 14 | 7 | 2 | 35 |
| Total | 50 | 35 | 30 | 20 | 135 |

##### Controlled Rounding (Base 3)

| County | Low | Medium | High | Very High | Total |
|---|---|---|---|---|---|
| Alpha | 15 | 0 | 3 | 0 | 18 |
| Beta | 21 | 9 | 12 | 15 | 57 |
| Gamma | 3 | 9 | 9 | 3 | 24 |
| Delta | 12 | 15 | 6 | 3 | 36 |
| Total | 51 | 33 | 30 | 21 | 135 |

- Uniform (and known) feasibility interval.
- Easy for 2-D tables, sometimes impossible for 3-D.
- 1,025,908,683 possible original tables.

##### Suppress Sensitive Cells & Others

| County | Low | Medium | High | Very High | Total |
|---|---|---|---|---|---|
| Alpha | 15 | p | s | p | 20 |
| Beta | 20 | 10 | 10 | 15 | 55 |
| Gamma | 3 | 10 | s | p | 25 |
| Delta | 12 | s | 7 | p | 35 |
| Total | 50 | 35 | 30 | 20 | 135 |

- Hard to do optimally (NP-complete).
- Feasibility intervals easily found with LP.
- Users have no way of finding cell value probabilities.

##### Release Only the Margins

| County | Low | Medium | High | Very High | Total |
|---|---|---|---|---|---|
| Alpha | | | | | 20 |
| Beta | | | | | 55 |
| Gamma | | | | | 25 |
| Delta | | | | | 35 |
| Total | 50 | 35 | 30 | 20 | 135 |

- 18,272,363,056 tables have our margins (thanks to De Loera & Sturmfels).
- Low risk, low utility. Easy! Very commonly done.
- Statistical users might estimate internal cells with, e.g., iterative proportional fitting.
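Iterative proportional fitting is simple enough to sketch directly. This is a minimal version using the margins of the delinquent-children example; starting from a uniform table (an assumption of this sketch), IPF converges to the independence fit, cell(i, j) = row total × column total / grand total.

```python
row_totals = [20, 55, 25, 35]   # county margins from the example
col_totals = [50, 35, 30, 20]   # education-level margins

# Start from a uniform table; any strictly positive start works.
table = [[1.0] * len(col_totals) for _ in row_totals]

for _ in range(50):  # alternately rescale rows and columns
    for i, rt in enumerate(row_totals):
        s = sum(table[i])
        table[i] = [x * rt / s for x in table[i]]
    for j, ct in enumerate(col_totals):
        s = sum(row[j] for row in table)
        for row in table:
            row[j] *= ct / s

# The fitted table reproduces both sets of released margins; with a
# uniform start it is exactly the independence table rt * ct / 135.
```

Any interaction structure in the true table is of course invisible to the user; IPF only recovers what the margins imply.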
Or with more powerful methods...

##### Some New Methods for Disclosure Detection

- Use methods from commutative algebra to explore the set of tables having known margins.
- Originated with the Diaconis-Sturmfels paper (Annals of Statistics, 1998).
- Extensions and complements by Dobra and Fienberg.
- Related to existing ideas in combinatorics.

##### Background: Algebraic Ideals

Let A be a ring (e.g., ℝ or ℤ), and I ⊆ A. Then I is called an ideal if:

- 0 ∈ I
- if f, g ∈ I, then f + g ∈ I
- if f ∈ I and h ∈ A, then fh ∈ I

Example 1: the ring ℤ of integers, and the set I = {…, -4, -2, 0, 2, 4, …}.

##### Generators

Let f1, …, fs ∈ A. Then

⟨f1, …, fs⟩ = { h1·f1 + ⋯ + hs·fs : h1, …, hs ∈ A }

is the ideal generated by f1, …, fs.

SAT question: what is a generator of the ideal I of Example 1? I = ⟨2⟩, since I = {2h : h ∈ ℤ}.

##### Ideal Example 2

The ring k[x] of polynomials in one variable, and the ideal I = ⟨x⁴ - 1, x⁶ - 1⟩.

GRE question: what is a minimal generator of I (i.e., no proper subset is also a generator)? I = ⟨x² - 1⟩, since x² - 1 is the greatest common divisor of {x⁴ - 1, x⁶ - 1}.

##### Why Are Minimal Generators Useful?

- Compact representation of the ideal; the initial description may be "verbose".
- Allow easy generation of elements, often in a disciplined order.
- Guaranteed to explore the full ideal.

##### Disclosure Analysis: Marginals

Suppose a "data snooper" knows only the margins:

| County | Low | Medium | High | Very High | Total |
|---|---|---|---|---|---|
| Alpha | | | | | 20 |
| Beta | | | | | 55 |
| Gamma | | | | | 25 |
| Delta | | | | | 35 |
| Total | 50 | 35 | 30 | 20 | 135 |

Of course the data collector knows the true counts:

| County | Low | Medium | High | Very High | Total |
|---|---|---|---|---|---|
| Alpha | 15 | 1 | 3 | 1 | 20 |
| Beta | 20 | 10 | 10 | 15 | 55 |
| Gamma | 3 | 10 | 10 | 2 | 25 |
| Delta | 12 | 14 | 7 | 2 | 35 |
| Total | 50 | 35 | 30 | 20 | 135 |

What are the feasible tables, given the margins?
Here is one:

| County | Low | Medium | High | Very High | Total |
|---|---|---|---|---|---|
| Alpha | 15+1 | 1-1 | 3 | 1 | 20 |
| Beta | 20-1 | 10+1 | 10 | 15 | 55 |
| Gamma | 3 | 10 | 10 | 2 | 25 |
| Delta | 12 | 14 | 7 | 2 | 35 |
| Total | 50 | 35 | 30 | 20 | 135 |

##### Disclosure Analysis Problems

- Both the data collector and the snooper are interested in the largest and smallest feasible values for each cell. Narrow bounds might constitute disclosure.
- Both might also be interested in the distribution of possible cell values. A tight distribution might constitute disclosure.

##### The Bounds Problem

Given only the margins, what are the largest and smallest feasible values for each cell? This is usually framed as a continuous problem. Is this the right question?

##### The Distribution Problem

Given the margins, and a set of priors over cell values, what distribution over each cell results?

##### Transform to an Algebraic Problem

Define indeterminates x11, x12, …, x44, one per cell. Then write the move

    +1 -1  0  0
    -1 +1  0  0
     0  0  0  0
     0  0  0  0

as the polynomial x11·x22 - x12·x21.

##### Ideal Example 3

The ideal I is the set of monomial differences

x11^n11 ⋯ xJK^nJK - x11^m11 ⋯ xJK^mJK

that take a non-negative table of dimension J × K with fixed margins M to another non-negative table with the same margins.

Putnam Competition question: what is a minimal generator of I?

##### Solutions: Bounds, Distributions

- Upper and lower bounds on cells: integer linear programming (max/min, subject to the constraints implied by the marginals).
- Find a generator of the ideal: use Buchberger's algorithm to find a Gröbner basis. Apply long division repeatedly; the remainder yields an optimum (Conti & Traverso).
- Distribution of cell values: use the Gröbner basis to enumerate or sample (Diaconis & Sturmfels, Dobra, Dobra & Fienberg).

##### Trouble in Paradise

- For larger disclosure problems (dimension ≥ 3), the linear programming bounds may be fractional.
- For larger disclosure problems (dimension ≥ 3), the Gröbner basis is very hard to compute.
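For tiny tables none of this machinery is needed: brute-force enumeration answers both the bounds and the distribution questions. A minimal sketch, assuming hypothetical 2×2 margins (the talk's 4×4 margins admit 18,272,363,056 tables, so toy scale matters): enumerate every non-negative integer table with the given margins and read off the integer feasibility interval of a cell.

```python
row_totals = [3, 4]   # hypothetical released row margins
col_totals = [5, 2]   # hypothetical released column margins

# In a 2x2 table, the top-left cell t11 determines all the others.
feasible = []
for t11 in range(min(row_totals[0], col_totals[0]) + 1):
    t12 = row_totals[0] - t11
    t21 = col_totals[0] - t11
    t22 = row_totals[1] - t21
    if min(t12, t21, t22) >= 0:  # keep only non-negative tables
        feasible.append([[t11, t12], [t21, t22]])

# The snooper's integer feasibility interval for t11.
lo = min(t[0][0] for t in feasible)
hi = max(t[0][0] for t in feasible)

# Under a uniform prior over the feasible tables, the posterior on t11
# is uniform over {1, 2, 3}: flat here, but a narrow interval or a
# peaked distribution would constitute disclosure.
```

The same enumeration gives cell distributions directly by counting how often each value occurs among the feasible tables; the algebraic machinery is what makes this scale past toy cases.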
##### Why Fractional Bounds?

Sometimes the marginal sum values cause the constraints to intersect at non-integer points, so the LP maximum for a cell exceeds the integer maximum.

[Chart: feasible region in the (x1, x2) plane, with the LP maximum for x2 strictly above the integer maximum.]

##### But Could This Happen?

Yes: it is the standard example from ManSci 101.

##### Computing Gröbner Bases

- Both the Conti & Traverso and Sturmfels constructions include extra variables: the ring k[x1, …, xIJK] is embedded in k[x1, …, xIJK, y1, …, yN], where N is the number of constraints.
- Buchberger's algorithm is applied; then basis elements not in k[x1, …, xIJK] are thrown away.
- Computing a basis for the 3×3×3 problem: about 7 hours with Macaulay.

##### An Improved Algorithm

Designed to compute implicitly in k[x1, …, xIJK, y1, …, yN], but only process elements in k[x1, …, xIJK]:

- the 3×3×3 problem runs in 25 ms
- the 4×3×3 problem runs in 20 min
- the 5×3×3 problem runs in 3 months...

##### Ideals

"If they're so freaking ideal, how come you can't compute them?" (Overheard at Grostat 5, the conference on commutative algebra and statistics.)

##### Other Approaches

- Connections with decomposable and graphical log-linear models (Dobra & Fienberg, Dobra).
- Example: a 3-D sensitive table with two 2-D margins given, say [A, B] and [B, C]. Use the conditional independence of A and C given B.

##### Decomposable Models

- Gröbner bases are "simple," and can be written down explicitly.
- For some other problems (e.g., graphical models), the Gröbner basis can be "assembled" from bases for smaller pieces. Nice!

##### A New Method for Disclosure Limitation: Cyclic Perturbation

Choose cycles that leave the margins fixed.
One example:

Original table:

    15  1  3  1
    20 10 10 15
     3 10 10  2
    12 14  7  2

Cycle:

    +1  0 -1  0
    -1  0 +1  0
     0  0  0  0
     0  0  0  0

Perturbed table:

    16  1  2  1
    19 10 11 15
     3 10 10  2
    12 14  7  2

The set of cycles determines the published table's feasibility interval.

##### Cyclic Perturbation: Details

Choose a set of cycles that covers all table cells "equally"; in the example, four cycles are used, so each cell has exactly two "chances" to move. [The slide's layout of the four cycles is not recoverable from the transcript.]

For each cycle, flip a three-sided coin with outcomes A (probability α), B (probability β) and C (probability 1 - α - β):

- If A, add the cycle (unless there is a zero in the cycle).
- If B, subtract the cycle (unless there is a zero in the cycle).
- If C, do nothing.

Repeat with the remaining cycles.

For the chosen set of cycles, there are 3⁴ = 81 possible perturbed tables. The feasibility interval is the original value ± 2.

To publish: choose α and β, perturb with each cycle, then publish the resulting table, the cycles, and α, β.

Original table:

    15  1  3  1
    20 10 10 15
     3 10 10  2
    12 14  7  2

Published (perturbed) table:

    16  0  2  2
    21 11  9 14
     2 11 11  1
    11 13  8  3

##### Analysis of Cell Probabilities

[Diagram: the original table TO is carried to the published table TP by the four applied cycles; many other candidate tables Tk could also have produced TP.]

##### Distributions of Cell Values

Since the mechanism is public, a user can calculate the distribution of true cell values:

1. Compute every table Tk that could have been the original, along with the probability Pr(TP | Tk).
2. Specify a prior distribution over all the possible original tables Tk.
3. Apply Bayes' theorem to get the posterior probability Pr(Tk | TP) for each Tk.
The distribution for each cell is

Pr( t(i, j) = q | TP ) = Σ over { k : tk(i, j) = q } of Pr( Tk | TP ).

##### Results for the Example

Original table:

    15  1  3  1
    20 10 10 15
     3 10 10  2
    12 14  7  2

Published (perturbed) table:

    16  0  2  2
    21 11  9 14
     2 11 11  1
    11 13  8  3

Posterior distributions for four of the cells:

| q | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Pr( t(1,2) = q \| TP ) | 0.71 | 0.25 | 0.04 | 0.00 | 0.00 | 0.00 |
| Pr( t(1,4) = q \| TP ) | 0.06 | 0.25 | 0.38 | 0.25 | 0.06 | 0.00 |
| Pr( t(3,4) = q \| TP ) | 0.00 | 0.71 | 0.25 | 0.04 | 0.00 | 0.00 |
| Pr( t(4,4) = q \| TP ) | 0.00 | 0.05 | 0.29 | 0.44 | 0.21 | 0.01 |

##### Properties

- It's not difficult to quantify data utility and disclosure risk (cf. cell suppression and controlled rounding).
- Priors of data users and data intruders can be different.
- Theorem: for a uniform prior, the mode of each posterior cell distribution is its published value.

##### Scaling

- Sets of cycles with desirable properties are easy to find for larger 2-D tables.
- Extensions to 3 and higher dimensions are also straightforward.
- Computing the perturbation for any size table is easy & fast.
- The complete Bayesian analysis is feasible to at least 20×20 (with no special TLC).

##### What Might Priors Be?

- They could reflect historical data.
- If I'm in the survey, I know my cell is at least 1.
- Public information.
- Insider information.

##### Rounding Redux

- A similar Bayesian analysis can be done, provided the exact rounding algorithm is available; it's generally much harder to do.
- Using a deterministic version of Cox's '87 rounding procedure, we must consider "only" 17,132,236 tables.
- For uniform priors, the posterior cell distributions were nearly uniform.
- Three days of computing time for a 4×4 table...

##### Future Work (i.e., What's Hot?)

- Much more to be learned about the connections between log-linear models, algebraic structures and networks.
- The Grostat conference, and the "Tables" conference at Davis, have resulted in wonderful collaborations between statisticians, mathematicians and operations researchers.
- Stay tuned!
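To close, the cyclic-perturbation mechanism can be sketched end to end. Assumptions of this sketch: α = β = 1/4; only the first cycle is taken from the example, while the other three are invented for illustration (each preserves margins, but they do not reproduce the slide's exact coverage pattern); and the "zero in the cycle" rule is interpreted as skipping any move that would drive a cell negative.

```python
import random

table = [  # the delinquent-children counts used in the example
    [15, 1, 3, 1],
    [20, 10, 10, 15],
    [3, 10, 10, 2],
    [12, 14, 7, 2],
]

# Each cycle is a same-shape matrix of +1/-1/0 whose rows and columns
# all sum to zero, so applying it never changes the margins.
cycles = [
    [[1, 0, -1, 0], [-1, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]],  # from the slide
    [[0, 1, 0, -1], [0, 0, 0, 0], [0, -1, 0, 1], [0, 0, 0, 0]],  # invented
    [[0, 0, 0, 0], [0, 1, -1, 0], [0, 0, 0, 0], [0, -1, 1, 0]],  # invented
    [[0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, -1], [-1, 0, 0, 1]],  # invented
]

def perturb(table, cycles, alpha, beta, rng):
    t = [row[:] for row in table]
    for c in cycles:
        u = rng.random()
        if u < alpha:
            sign = 1        # outcome A: add the cycle
        elif u < alpha + beta:
            sign = -1       # outcome B: subtract the cycle
        else:
            continue        # outcome C: do nothing
        moved = [[t[i][j] + sign * c[i][j] for j in range(4)]
                 for i in range(4)]
        if all(x >= 0 for row in moved for x in row):
            t = moved       # skip moves that would make a cell negative
    return t

published = perturb(table, cycles, 0.25, 0.25, random.Random(1))
```

Because every accepted move is a margin-preserving cycle, the published table has the same row and column totals as the original regardless of the coin flips, which is what lets the mechanism (and α, β) be released publicly for the Bayesian analysis above.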