How De-Identification Came into U.S. Law and Why


Setting the Stage:
How De-Identification Came into U.S. Law,
and Why the Debate Matters Today
Professor Peter Swire
Ohio State University/Future of Privacy Forum
FPF Conference on DeIdentification
National Press Club
December 5, 2011
Overview
• U.S. history: Census, federal agency statistics, &
HIPAA
• Why Deidentification (DeID) matters today
– The debate – it works or it doesn’t
– Three threat models
– Analogy to law enforcement
• Big picture – useful for many tasks, even with the
limits shown by scientists
Census, Statistics & DeID
• Many years of Census experience
– Highly useful data
– Deidentified
• Periodic opposition to mandatory reporting
• Needed strong confidentiality promises
– Suppress small cell size
• Only home in a census tract
– Fuzz data
– Strict rules against release even for national security
purposes
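
A minimal sketch of small-cell suppression, for illustration only. The threshold of 5, the function name, and the tract counts are assumptions made up for this example; they do not come from the talk, and actual statistical-agency rules are more involved.

```python
from typing import Dict, Optional

def suppress_small_cells(
    counts: Dict[str, int], threshold: int = 5
) -> Dict[str, Optional[int]]:
    """Return a copy of `counts` with any cell below `threshold` withheld.

    A withheld cell is represented as None. The threshold value is an
    assumption for illustration, not an official rule.
    """
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}

# Hypothetical counts of homes per census tract.
tract_counts = {"tract_A": 412, "tract_B": 1, "tract_C": 37}

# tract_B holds the "only home in a census tract", so its count is withheld.
published = suppress_small_cells(tract_counts)
print(published)
```

The single-home tract is the case the slide points to: publishing a count of 1 alongside other attributes would describe one identifiable household.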
Federal Agency Statistics
• Codification in Confidential Information Protection &
Statistical Efficiency Act of 2002 (CIPSEA)
– Good history by Sylvester & Lohr
• Basic rule: if collect data for statistical purposes, use
only for statistical purposes, don’t re-identify (ReID)
• Funny thing: same culture & practice for years in
private sector polling (Gallup-style) and market
research
• Many years of practice here
• Perhaps a basic guideline going forward?
HIPAA
• 1999-2000 regs informed by Sweeney research
• Safe harbor – delete a lot of specified data fields
• Expert determination (I pushed for this) – where there is a
statistical basis, can achieve DeID based on risk, not safe harbor
• Data use agreements – release for research, with
enforceable promise not to ReID
• In short:
– If scrubbed enough, can release publicly
– If scrubbed less, then enforceable promise not to
ReID
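
The safe-harbor idea of deleting specified data fields can be sketched roughly as below. The field names are hypothetical placeholders and are not the regulatory list; the record is invented for illustration.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical field names for illustration only; NOT the HIPAA list.
ILLUSTRATIVE_IDENTIFIER_FIELDS: FrozenSet[str] = frozenset(
    {"name", "street_address", "phone", "email"}
)

def scrub_record(
    record: Dict[str, Any],
    drop_fields: FrozenSet[str] = ILLUSTRATIVE_IDENTIFIER_FIELDS,
) -> Dict[str, Any]:
    """Return a new record without the listed fields; the input is not modified."""
    return {key: value for key, value in record.items() if key not in drop_fields}

# Invented example record.
patient = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "diagnosis_code": "J45",
    "visit_year": 2010,
}

scrubbed = scrub_record(patient)
print(scrubbed)
```

This corresponds only to the "scrubbed enough" branch of the summary above. The other branch, an enforceable promise not to ReID, is a contractual control rather than something code can show.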
Why It Matters Today
• Now data mining far beyond specialized researchers
– The Internet (commercial since only 1993) gives
me access to data
– Storage & processing on my laptop > mainframe
of 25 years ago
– Search is way better
– The erosion of practical obscurity – “they” really
may figure out who “we” are
The Debate is Joined
• Ohm (and others) draw on Sweeney-type research
– DeID likely to lead to ReID
• Yakowitz (and others) respond
– Benefits of public data enormous
– Practical risk/harm from ReID low
• Does anonymization create huge risks or low risks?
• Worth doing anonymization/DeID at all?
• Today’s conference to shed light on this …
Threat Models – Which Attackers?
• Three types of attackers on “anonymized” data:
– Insiders “peeping”
– Outside hackers intruding
– The public who doesn’t get into the database
• DeID often effective for first two
• Ohm/Yakowitz debate primarily on the third
Insiders Peeping
• Swire 2009 Peeping article, at peterswire.net
• Threat: employee or employee of sub-contractor sees the
data and “peeps”
– Sees celebrity information (e.g., Clooney)
– Sees information about friend/family/ex
– Sees information to create harm (ID theft, blackmail)
• Anonymization useful part of anti-peeping strategy
– Employee doesn’t search or stumble upon Clooney
– Employee may lack tools to do Sweeney-type analysis
– Audit logs catch employees who try
– Give employees access to statistical data, not PII
Outside Hackers
• Hacker may intrude for a short while
– Anonymization may prevent the “ah hah” moment (e.g., spotting Clooney)
• Hacker may download database
– If so, then hacker becomes similar to the public
– May or may not be good at Sweeney-type tricks
– May be focused on specific types of information,
and not try to ReID
• Less-than-perfect DeID may substantially reduce
incidence of ReID
ReID by “The Public”
• So, masking may help against some threats
• The debate, though, is whether “the public” (i.e., the
experts) can ReID
• Sweeney & other research provides startling &
important results of ReID
– Can everything be ReIdentified?
ReID & 2 Famous Studies
• Date of birth, ZIP code, & gender -> 80%+ unique
– Yes
– BUT, DOB is off-the-charts different
• Gender – splits population in half
• DOB = 366 (days) x 80 (years) = 29,280 cells (over 25,000)
• Moral – DOB ridiculously strong to ReID
• Netflix – can ReID over 60% of movie reviews
– BUT, takes known IMDb reviewers and matches them to
Netflix
– Can ReID many of those reviewers, but not a big effect overall
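
The cell-count arithmetic in the date-of-birth bullet above can be checked with a short calculation. The 366, 80, and two-way gender split are from the slide; the ZIP-code populations and the assumption that people spread evenly across cells are hypothetical simplifications added for illustration.

```python
DAYS_PER_YEAR = 366   # from the slide (counts Feb 29)
YEARS_OF_BIRTH = 80   # from the slide
GENDERS = 2           # from the slide: splits the population in half

dob_cells = DAYS_PER_YEAR * YEARS_OF_BIRTH   # distinct dates of birth
dob_gender_cells = dob_cells * GENDERS       # DOB x gender combinations

def average_people_per_cell(zip_population: int) -> float:
    """Average residents per (DOB, gender) cell within one ZIP code.

    Assumes an even spread across cells, which real populations
    do not follow; this is a rough intuition aid only.
    """
    return zip_population / dob_gender_cells

print("DOB cells:", dob_cells)
print("DOB x gender cells:", dob_gender_cells)

# Hypothetical ZIP-code populations.
for population in (5_000, 30_000, 100_000):
    print(population, round(average_people_per_cell(population), 2))
```

When the average falls well below one person per cell, most occupied cells hold a single person, which is the intuition behind DOB being so strong for ReID.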
Law Enforcement Analogy
• So, is ReID generally easy or hard, useful or
useless?
• Consider cop with a bunch of clues (male, tall, red
hair, etc.)
– Enough to ReID? No
– Helpful to ReID? Yes
– A matter of how much legwork, analysis, extra data is
available and accurate
– Very big range for difficulty of finding the suspect
– Same is true for ability of “the public” to ReID, to
name the suspect
Conclusion
• Issue matters today: more data potentially available
to “the public”
• History of useful anonymization in statistics
– If collect data for statistical purposes, use only for
statistical purposes, store that way, don’t ReID
• DeID helps against insider & hacker threats
• ReID by “the public” varies widely in the effort
needed to find the “suspect”
• Our conference today to help policymakers learn
where DeID likely to be most useful