Privacy in statistical databases
University of Texas at El Paso
Privacy in Statistical Databases
Dr. Luc Longpré
Computer Science Department
Spring 2006
UTEP
1
Computer Science Dept.
Database with Confidential Information
• Examples:
– census data
– medical information
• Privacy: protect the confidentiality of individuals
• Usefulness: want to derive meaningful statistics
The Need for Privacy Safeguards
• Disk space available per person:
– 1983: 0.02 MB
– 1996: 28 MB
– 2000: 472 MB
• Equivalent to one page for every 3 minutes of life
The Need for Privacy Safeguards
• Misuse of personal health information:
– a banker cross-referencing cancer patients with outstanding loans
– using medical records to make decisions about employees
– snooping on a hospital computer network
– 40% of insurers disclose personal health information to lenders, employers, and marketers without customer permission
Approaches
• Access control, encryption:
– Only control who has access to what
– Do not protect against disclosure by inference
• Problem:
– It is sometimes possible to derive confidential information from released information
Examples
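• Salary database
• Query: what’s the average salary of white male professors with 2 children who have lived in El Paso, Texas since 1994 and lived in Boston from 1987 to 1994?
– A query this specific may match a single person, so the "average" is that person’s salary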
Examples
• 87% of the US population is uniquely identified by the combination of:
– 5-digit ZIP code,
– gender,
– date of birth
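The same kind of uniqueness measurement can be run on any table by counting how many records have a combination of ZIP code, gender, and date of birth that no other record shares. A minimal Python sketch, assuming records are dictionaries; the function name `unique_fraction`, the field names, and the four sample records are made up for illustration and are unrelated to the data behind the 87% figure:

```python
from collections import Counter

def unique_fraction(records, quasi_identifier):
    """Fraction of records whose quasi-identifier values no other record shares."""
    keys = [tuple(r[field] for field in quasi_identifier) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)

# Hypothetical toy data (not real people).
people = [
    {"zip": "79912", "gender": "F", "dob": "1917-03-02"},
    {"zip": "79912", "gender": "M", "dob": "1950-07-14"},
    {"zip": "79912", "gender": "M", "dob": "1950-07-14"},
    {"zip": "79968", "gender": "F", "dob": "1980-11-30"},
]

print(unique_fraction(people, ("zip", "gender", "dob")))  # 0.5
print(unique_fraction(people, ("zip",)))                  # 0.25
```

Adding fields to the identifier can only keep or raise the unique fraction, which is why a few innocuous-looking attributes together can single people out.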
Linking to Re-Identify Data
• Medical database:
– Ethnicity, visit date, diagnosis, procedure, medication, ZIP, birth date, sex
• Voter list:
– Name, address, date registered, ZIP, birth date, sex
• The shared fields (ZIP, birth date, sex) can link a name to a diagnosis
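A linking attack of this kind amounts to a join on the fields the two lists have in common. The sketch below is only an illustration under stated assumptions: the function name `link`, the field names, and all records are hypothetical, and a match is reported only when exactly one voter shares the three fields.

```python
def link(medical, voters, shared=("zip", "birth_date", "sex")):
    """Pair each medical record with a voter name when the shared fields match exactly one voter."""
    index = {}
    for voter in voters:
        index.setdefault(tuple(voter[f] for f in shared), []).append(voter)
    matches = []
    for record in medical:
        candidates = index.get(tuple(record[f] for f in shared), [])
        if len(candidates) == 1:  # unambiguous re-identification
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches

# Hypothetical toy data: the medical table carries no names.
medical = [
    {"zip": "79912", "birth_date": "1950-07-14", "sex": "F", "diagnosis": "asthma"},
    {"zip": "79968", "birth_date": "1980-11-30", "sex": "M", "diagnosis": "flu"},
]
voters = [
    {"name": "Alice Example", "zip": "79912", "birth_date": "1950-07-14", "sex": "F"},
    {"name": "Bob Example", "zip": "79968", "birth_date": "1980-11-30", "sex": "M"},
    {"name": "Carl Example", "zip": "79968", "birth_date": "1980-11-30", "sex": "M"},
]

print(link(medical, voters))  # [('Alice Example', 'asthma')]
```

The second medical record is not linked because two voters share its ZIP, birth date, and sex, so the match is ambiguous.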
Statistical Database
• Data collected for the purpose of releasing statistical information
• Important for research and policy
• Facing tremendous demand for person-specific data
– data mining, fraud detection, homeland security
Sample Size
• Possible solution: do not release any statistic computed on a set of fewer than, say, 10 records
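A size restriction like this could be enforced at the query interface. The following Python sketch assumes records are dictionaries and queries are given as predicates; `restricted_average`, `min_size`, and the sample data are hypothetical names and values chosen for illustration.

```python
def restricted_average(records, predicate, field, min_size=10):
    """Return the average of `field` over matching records, or None if the set is too small."""
    values = [r[field] for r in records if predicate(r)]
    if len(values) < min_size:
        return None  # refuse to answer: query set below the threshold
    return sum(values) / len(values)

# Hypothetical toy data: 12 people with ages 30..41.
records = [{"age": 30 + i, "salary": 40000 + 1000 * i} for i in range(12)]

print(restricted_average(records, lambda r: True, "salary"))           # 45500.0
print(restricted_average(records, lambda r: r["age"] > 38, "salary"))  # None
```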
Problem Remains
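• Query 1: What’s the average salary of males age 89 in ZIP code 79912?
• Query 2: What’s the average salary of people age 89 in ZIP code 79912?
– If exactly one person age 89 in that ZIP code is not male, the two averages and the set sizes reveal that person’s salary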
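The arithmetic behind this pair of queries is short enough to write out. The sketch below assumes the attacker also learns (or can guess) the size of each query set; the function name `difference_average` and the salary figures are invented for illustration.

```python
def difference_average(avg_all, n_all, avg_subset, n_subset):
    """Average of the records that are in the larger query set but not in the smaller one."""
    hidden_total = avg_all * n_all - avg_subset * n_subset
    return hidden_total / (n_all - n_subset)

# Hypothetical numbers: 11 people age 89 in the ZIP code, 10 of them male.
# Both query sets pass a "no fewer than 10 records" rule.
avg_males = 30000.0   # answer to Query 1 (10 records)
avg_people = 32000.0  # answer to Query 2 (11 records)

print(difference_average(avg_people, 11, avg_males, 10))  # 52000.0
```

Each answer is legitimate on its own; the disclosure comes from the difference between two overlapping sets.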
K-anonymity
• Release only data in which every record is identical to at least k − 1 other records on the identifying attributes (work by Sweeney)
• Attacks are still possible:
– Unsorted matching: exploits the order in which records appear
• solution: randomize the order
– Complementary release: a combination of k-anonymous releases may not be k-anonymous
• solution: consider all releases together
– Temporal attack: data is dynamic; adding and removing records affects k-anonymity
• solution: analyze the k-anonymity of the dynamic data
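Checking the k-anonymity condition itself is a counting exercise. The sketch below is a minimal illustration, not Sweeney's algorithm: it only tests whether a table already satisfies the condition and shuffles the rows before release (the fix named above for unsorted matching). The names `is_k_anonymous` and `release`, the generalized values, and the rows are all hypothetical.

```python
import random
from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """True if every combination of quasi-identifier values occurs in at least k rows."""
    counts = Counter(tuple(row[f] for f in quasi_identifier) for row in rows)
    return all(count >= k for count in counts.values())

def release(rows, quasi_identifier, k):
    """Return a shuffled copy of the rows, refusing tables that are not k-anonymous."""
    if not is_k_anonymous(rows, quasi_identifier, k):
        raise ValueError("table is not k-anonymous")
    return random.sample(rows, len(rows))  # random order defeats unsorted matching

# Hypothetical table with already-generalized ZIP codes and birth decades.
table = [
    {"zip": "799**", "birth_decade": "1950s", "diagnosis": "asthma"},
    {"zip": "799**", "birth_decade": "1950s", "diagnosis": "diabetes"},
    {"zip": "021**", "birth_decade": "1960s", "diagnosis": "flu"},
    {"zip": "021**", "birth_decade": "1960s", "diagnosis": "asthma"},
    {"zip": "021**", "birth_decade": "1960s", "diagnosis": "cancer"},
]

print(is_k_anonymous(table, ("zip", "birth_decade"), 2))  # True
print(is_k_anonymous(table, ("zip", "birth_decade"), 3))  # False
```

For complementary releases and changing data, the check would have to be applied to everything an attacker can combine, not to each release in isolation.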
Other Solutions
• Add noise in the answers
• Add noise in the data
• Limit the kinds of queries allowed on the statistical database
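As an illustration of the first option, a query interface could perturb each answer before returning it. The slides do not say which noise distribution or scale to use; the Gaussian noise, the `noise_scale` parameter, the function name `noisy_average`, and the salaries below are assumptions made for this sketch.

```python
import random

def noisy_average(values, noise_scale, rng):
    """True average plus zero-mean Gaussian noise with standard deviation `noise_scale`."""
    true_average = sum(values) / len(values)
    return true_average + rng.gauss(0.0, noise_scale)

# Hypothetical salaries.
salaries = [30000, 35000, 40000, 45000, 50000]
rng = random.Random(2006)  # seeded only to make the sketch repeatable

answer = noisy_average(salaries, noise_scale=1000.0, rng=rng)
print(round(answer))  # close to 40000, but not exactly the true average
```

One design caveat: if the same query can be repeated with fresh noise each time, averaging the answers cancels the noise, so a real system would need to limit repetition or reuse the same perturbation.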
Quantifying Information
• Need a formal model, possibly based on information theory
• Measure the entropy of database records before and after a statistical release
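One way to make the entropy measurement concrete is to model an attacker's belief about a confidential value as a probability distribution and compare its Shannon entropy before and after a release. This is a sketch under assumptions, not a model proposed in the slides: the uniform beliefs, the candidate counts, and the name `entropy_bits` are invented for illustration.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical scenario: before the release the attacker considers 8 salary
# values equally likely; after the release only 2 of them remain possible.
before = [1 / 8] * 8
after = [1 / 2] * 2

print(entropy_bits(before))                        # 3.0
print(entropy_bits(after))                         # 1.0
print(entropy_bits(before) - entropy_bits(after))  # 2.0 bits disclosed
```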
Further Complications
• Some data is more sensitive than other data
– Example: the bits of a salary are not equally sensitive
• Common knowledge and information from other databases
– Could define entropy conditional on the available information
– Very impractical in applications
• Some people already know some of the records
Non-Additivity
• Data sensitivity is non-additive
– Ex: one may not mind any single digit of an SSN being released, but would mind all digits being released
• Privacy loss is non-additive
– Ex: there could be 2 sets of information, each of which reveals nothing if released alone, but which together reveal all the information
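A standard construction with exactly this property is to split a value into a random pad and the value XORed with that pad: either piece alone is a uniformly random number, while the two together recover the value. The Python sketch below is an illustration added here, not an example from the slides; the name `split_secret` and the sample number are hypothetical.

```python
import secrets

def split_secret(secret, bits):
    """Split `secret` (an integer below 2**bits) into two shares whose XOR is the secret."""
    pad = secrets.randbits(bits)  # uniformly random, independent of the secret
    return pad, secret ^ pad      # the second share is also uniformly random on its own

# Hypothetical 30-bit value standing in for a confidential number.
confidential = 123456789
share_a, share_b = split_secret(confidential, 30)

print(share_a ^ share_b == confidential)  # True
```

Any additive measure that assigns each share a privacy loss of zero would wrongly assign zero to the pair as well.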
Past Research
• Denning: “Cryptography and Data Security”, 1982
• Sweeney: Ph.D. thesis, applications to medical data, 1996
• A few more scattered results; the topic is becoming popular again as “privacy-preserving data mining”