Privacy preserving data mining

Download Report

Transcript Privacy preserving data mining

Privacy and Data Mining in the Electronic
Society -- Overview
Xintao Wu
University of North Carolina at Charlotte
August 20, 2012
Privacy Case
• Nydia Velázquez
(1982)
Three weeks after Nydia Velázquez won the New
York Democratic Party's nomination to serve in the
U.S. House of Representatives, somebody at St.
Claire Hospital in New York faxed Velázquez's
medical records to the New York Post. The records
detailed the care that Velázquez had received at
the hospital after a suicide attempt--an attempt
that had happened several years before the
election.
Database Nation: The Death of Privacy in the 21st Century, Simson
Garfinkel, Jan 2000, 1-56592-653-6
2
Privacy Case
•
AOL's publication of the search histories of more than 650,000
of its users has yielded more than just one of the year's bigger
privacy scandals. (Aug 6, 2006)
That database does not include names or user identities. Instead, it
lists only a unique ID number for each user. AOL user 710794



an overweight golfer, owner of a 1986 Porsche 944 and 1998 Cadillac
SLS, and a fan of the University of Tennessee Volunteers Men's
Basketball team.
interested in the Cherokee County School District in Canton, Ga., and has
looked up the Suwanee Sports Academy in Suwanee, Ga., which caters to
local youth, and the Youth Basketball of America's Georgia affiliate.
regularly searches for "lolitas," a term commonly used to describe
photographs and videos of minors who are nude or engaged in sexual
acts.
AOL's disturbing glimpse into users' lives By Declan McCullagh ,
CNET News.com, August 7, 2006, 8:05 PM PDT
3
NetFlix Prize
• An open competition for the best collaborative filtering
algorithm to predict user ratings for films.


On Sept 21 2009, the grand prize $1M was given to BellKor’s
Pragmatic Chaos team which bested Netflix’s own algorithm for
predicting ratings by 10.06%.
Wiki http://en.wikipedia.org/wiki/Netflix_Prize
• NetFlix cancels contest after privacy lawsuit on March 12,
2010.

http://www.wired.com/threatlevel/2010/03/netflix-cancels-contest/
4
Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpg
5
6
National Laws
• USA

HIPAA for health care


Passed August 21, 96
lowest bar and the States are welcome to enact more stringent rules




California State Bill 1386
Grann-Leach-Bliley Act of 1999 for financial institutions
COPPA for childern’s online privacy
etc.
• Canada

PIPEDA 2000


Personal Information Protection and Electronic Documents Act
Effective from Jan 2004
• European Union (Directive 94/46/EC)



Passed by European Parliament Oct 95 and Effective from Oct 98.
Provides guidelines for member state legislation
Forbids sharing data with states that do not protect privacy
7
Privacy & Breaches of Privacy
• Various definitions of privacy



http://www.privacy.org
http://en.wikipedia.org/wiki/Privacy
Context dependent definitions: physical privacy, internet privacy,
medical privacy, genetics, political privacy, surveillance.
• An individual right

The claim of individuals, groups, or institutions to determine for
themselves when, how, and to what extent information about them
is communicated to others. – Alan Westin
• Expansion of government and company databases &
growing use of web and mobile devices lead to increase of
collection, analysis and disclosure of sensitive information.

Location based services need user’s position and preference
8
Privacy vs. Confidentiality
• Privacy is the right to keep one’s personal information out
of the public view
• Confidentiality is the dissemination without public
identification
• Disclosure



Identity disclosure = when a specific person’s record can be found
in a released file.
Attribute disclosure = when sensitive information about a specific
person is revealed through the released file, sometimes with
additional knowledge.
Inferential disclosure = if from the released data one can
determine the value of some characteristic of an individual more
accurately than otherwise would have been possible
9
Mining vs. Privacy
• Data mining

The goal of data mining is summary results (e.g., classification,
cluster, association rules etc.) from the data (distribution)
• Individual Privacy


Individual values in database must not be disclosed, or at least no
close estimation can be got by attackers
Contractual limitations: privacy policies, corporate agreements
• Privacy Preserving Data Mining (PPDM)

How to transform data such that


we can build a good data mining model (data utility)
while preserving privacy at the record level (privacy)?
10
Two Approaches
• Distributed



Suitable for multi-party
platforms
Secure multi-party computation
Tolerated disclosure:
computationally private
• Generalization/randomization
/transformation



Perturb data to protect privacy
of individual records.
Preserve intrinsic distributions
necessary for modeling.
Tolerated disclosure:
statistically private
11
Data miner vs. attacker
12
Scope
ssn
name zip
race
…
age
Sex
Bal
income
…
IntP
1
28223
Asian
…
20
M
10k
85k
…
2k
2
28223
Asian
…
30
F
15k
70k
…
18k
3
28262
Black
…
20
M
50k
120k
…
35k
4
28261
White
…
26
M
45k
23k
…
134k
.
.
.
…
.
.
.
.
…
.
N
28223
Asian
…
20
M
80k
110k
…
15k
69% unique on zip and birth date
87% with zip, birth date and gender.
k-anonymity, L-diversity
SDC etc.
Generalization/randomization
13
Additive Noise Randomization Example
Bal
income
…
IntP
1
10k
85k
…
2k
2
15k
70k
…
18k
3
50k
120k
…
35k
4
45k
23k
…
134k
.
.
.
…
.
N
80k
110k
…
15k
17.334
88.759
2.099
19.199
77.537
25.939
59.199
128.447
38.678
51.208
30.313
135.939
89.048
115.692
21.318
Y
=
=
7.334
3.759
0.099
4.199
7.537
7.939
35
9.199
8.447
3.678
23
134
6.208
7.313
1.939
110
15
9.048
5.692
6.318
10
85
2
15
70
18
50
120
45
80
X
14
+
+
E
Additive Randomization (Z=X+Y)
• R.Agrawal and R.Srikant SIGMOD 00
Alice’s
age
Add random
number to
Age
30
becomes
65
(30+35)
30 | 70K | ...
50 | 40K | ...
Randomizer
Randomizer
65 | 20K | ...
25 | 60K | ...
Reconstruct
Distribution
of Age
Reconstruct
Distribution
of Salary
Classification
Algorithm
...
...
...
Model
15
Identity Theft
• SSN
### - ## - ####
Determined by zip code Group no
Sequential no
https://secure.ssa.gov/apps10/poms.nsf/lnx/0100201030
Facebook study
http://www.heinz.cmu.edu/~acquisti/papers/
privacy-facebook-gross-acquisti.pdf
16
Randomized Response ([ Stanley Warner; JASA 1965])
A
: Cheated in the exam
A
: Didn’t cheat in the exam
A
Purpose
Purpose: Get the proportion(  A) of population
members that cheated in the exam.
 Procedure:
Cheated in
exam
A
“Yes” answer
Didn’t cheat
Randomization device

Do you belong to A? (p)
…
…
Do you belong to A?(1-p)
1 
“No” answer
As:
   A p  (1   A )(1  p)
Unbiased estimate of
ˆ AW
17
A
p 1
ˆ


2 p 1 2 p 1
is:
Linked data
ssn
name zip
race
…
age
Sex
Bal
income
…
IntP
1
28223
Asian
…
20
M
10k
85k
…
2k
2
28223
Asian
…
30
F
15k
70k
…
18k
3
28262
Black
…
20
M
50k
120k
…
35k
4
28261
White
…
26
M
45k
23k
…
134k
.
.
.
…
.
.
.
.
…
.
N
28223
Asian
…
20
M
80k
110k
…
15k
sensitive links
18
Privacy issues in Social Network
• Social network contains much private relation information;
• Anonymization is not enough for protecting the privacy.
Subgraph attacks [Backstrom et al., WWW07, Hay et al., 07].
sensitive link
attacker
19
Other issues
• Statistical disclosure limitation methods for
tabular/microdata
• Secure multi-party computation protocols and tools
• Privacy issues in various application areas such as ecommerce, healthcare, finance, and RFID
20
Tutorials on PPDM
• Privacy in data system, Rakesh Agrawal, PODS03
• Privacy preserving data mining, Chris Clifton, PKDD02, KDD03
• Preserving privacy in database systems, Johann-Chrostoph Freytag,
WAIM06
• Models and methods for privacy preserving data publishing and
analysis, Johannes Gehrke, ICDM05, ICDE06, KDD06
• Cryptographic techniques in privacy preserving data mining, Helger
Lipmaa, PKDD06
• Randomization based privacy preserving data mining, Xintao Wu,
PKDD06 & WAIM06
21