Transcript Document

Privacy Cognizant Information
Systems
Rakesh Agrawal
IBM Almaden Research Center
joint work with Srikant, Kiernan & Xu
GOAL
ƒ
Build information systems that
ƒ protect the privacy and ownership of
information
ƒ do not impede the flow of information
Drivers

Policies and Legislations
– U.S. and international regulations
– Legal proceedings against businesses

Consumer Concerns
– Consumer privacy apprehensions continue to plague the
Web … these fears will hold back roughly $15 billion
in e-Commerce revenue.” Forrester Research, 2001
– Most consumers are “privacy pragmatists.” Westin
Surveys

Moral Imperative
– The right to privacy: the most cherished of human
freedom -- Warren & Brandeis, 1890
Outline

Privacy Preserving Data Mining
– How to have you cake and mine it too!
– Assuming no trusted third party.

Hippocratic Databases
– What I may see or hear in the course of treatment … I
will keep to myself.
- Hippocratic Oath, 8 (circa 400 BC)

Other related topics
– Privacy Preserving Information Integration
– Watermarking & fingerprinting
Data Mining and Privacy
 The
primary task in data mining:
– development of models about aggregated
data.
 Can
we still develop accurate models,
while protecting the privacy of
individual records?
Alice
Randomization Overview
35
95,000
J.S. Bach
painting
nasa
35
95,000
J.S. Bach
painting
nasa
Bob
45
60,000
B. Spears
baseball
cnn
45
60,000
B. Spears
baseball
cnn
Chris
42
85,000
B. Marley,
camping,
microsoft
42
85,000
B. Marley
camping
microsoft
Recommendation
Service
Alice
Randomization Overview
35
95,000
J.S. Bach
painting
nasa
35
95,000
J.S. Bach
painting
nasa
Bob
45
60,000
B. Spears
baseball
cnn
45
60,000
B. Spears
baseball
cnn
Chris
42
85,000
B. Marley,
camping,
microsoft
Recommendation
Service
Mining Algorithm
42
85,000
B. Marley
camping
microsoft
Data Mining Model
Alice
Randomization Overview
50
65,000
Metallica
painting
nasa
35
95,000
J.S. Bach
painting
nasa
Bob
45
60,000
B. Spears
baseball
cnn
38
90,000
B. Spears
soccer
fox
Chris
42
85,000
B. Marley,
camping,
microsoft
32
55,000
B. Marley
camping
linuxware
Recommendation
Service
Alice
Randomization Overview
50
65,000
Metallica
painting
nasa
35
95,000
J.S. Bach
painting
nasa
Bob
45
60,000
B. Spears
baseball
cnn
38
90,000
B. Spears
soccer
fox
Chris
42
85,000
B. Marley,
camping,
microsoft
Recommendation
Service
Recovery
Mining Algorithm
32
55,000
B. Marley
camping
linuxware
Data Mining Model
Inducing Classifiers over Privacy
Preserved Numeric Data
Alice’s
age
Alice’s
salary
John’s
age
30 | 25K | …
30
become
s 65
(30+35)
50 | 40K | …
Randomizer
Randomizer
65 | 50K | …
35 | 60K | …
Reconstruct
Age Distribution
Reconstruct
Salary Distribution
Decision Tree
Algorithm
Model
Reconstruction Problem

Original values x1, x2, ..., xn
– from probability distribution X (unknown)
 To hide these values, we use y1, y2, ..., yn
– from probability distribution Y
 Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y
Estimate the probability distribution of X.
Reconstruction Algorithm
fX0 := Uniform distribution
j := 0
repeat
n
1
fY (( xi  yi )  a ) f Xj (a )
fXj+1(a) := n  
Bayes’ Rule
j
i 1
 fY (( xi  yi )  a ) f X (a )

j := j+1
until (stopping criterion met)
(R. Agrawal & R. Srikant, SIGMOD 2000)

Converges to maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.
Works Well
1000
Original
800
600
Randomized
400
Reconstructed
0
60
200
20
Number of People
1200
Age
Decision Tree Example
Age
23
17
43
68
32
20
Salary
50K
30K
40K
50K
70K
20K
Repeat
Visitor?
Repeat
Repeat
Repeat
Single
Single
Repeat
Age < 25
No
Yes
Salary <
50K
Repeat
Yes
Repeat
No
Single
Accuracy vs. Randomization
Fn 3
100
Accuracy
90
80
Or iginal
70
Random ized
Reconstr ucted
60
50
40
10
20
40
60
80
100
Randomization Level
150
200
Hippocratic Databases

Hippocratic Oath, 8 (circa 400 BC)
– What I may see or hear in the course of
treatment … I will keep to myself.

What if the database systems were to
embrace the Hippocratic Oath?
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu
Hippocratic Databases. VLDB 2002..
The Ten Principles

Driven by current privacy legislation.
– US (FIPA, 1974), Europe (OECD , 1980), Canada (1995),
Australia (2000), Japan (2003)

Principles:
– Collection Group: Purpose Specification, Consent,
Limited Collection
– Use Group: Limited Use, Limited Disclosure,
Limited Retention, Accuracy
– Security & Openness Group: Safety, Openness,
Compliance
Collection Group
1.
Purpose Specification
– For personal information stored in the database, the
purposes for which the information has been collected
shall be associated with that information.
2.
Consent
– The purposes associated with personal information
shall have consent of the donor (person whose
information is being stored).
3.
Limited Collection
– The information collected shall be limited to the
minimum necessary for accomplishing the specified
purposes.
Use Group
4.
Limited Use
– The database shall run only those queries that
are consistent with the purposes for which the
information has been collected.
5.
Limited Disclosure
– Personal information shall not be
communicated outside the database for
purposes other than those for which there is
consent from the donor of the information.
Use Group (2)
6.
Limited Retention
– Personal information shall be retained only as
long as necessary for the fulfillment of the
purposes for which it has been collected.
7.
Accuracy
– Personal information stored in the database
shall be accurate and up-to-date.
Security & Openness Group
8.
Safety
– Personal information shall be protected by security
safeguards against theft and other misappropriations.
9.
Openness
– A donor shall be able to access all information about
the donor stored in the database.
10.
Compliance
– A donor shall be able to verify compliance with the
above principles. Similarly, the database shall be able
to address a challenge concerning compliance.
Architecture
Privacy
Policy
Privacy
Metadata
Creator
Privacy
Metadata
Data
Collection
Queries
Other
Privacy
Constraint
Validator
Attribute
Access
Control
Data
Collection
Analyzer
Data
Accuracy
Analyzer
Query
Intrusion
Detector
Data
Retention
Manager
Audit
Info
Audit
Info
Audit
Trail
Store
Record
Access
Control
Encryption
Support
Decision-Making Across Sovereign Data
Repositories


•
Separate databases due to
statutory, competitive, or security
reasons.
 Selective, minimal sharing on
need-to-know basis.
Minimal Necessary Sharing
R
a
u
Example: Among those who took
a particular drug, how many had
adverse reaction and their DNA
contains a specific sequence?
 Researchers must not learn
anything beyond counts.
v
Algorithms for computing joins
and join counts while revealing
minimal additional information.
u
x
RS
 R must not
know that S
has b & y
 S must not
know that R
has a & x
RS
u
v
S
b
v
y
Count (R  S)
 R & S do not learn
anything except that
the result is 2.
Watermarking Relational Databases

Goal: Deter data theft and assert ownership of
pirated copies.
–


1.
Specify secret
key
2.
Specify
table/attributes
to be marked
2.
Specify
table/attributes
which should
contain marks
Very unlikely to occur by chance.
Hard to find => hard to destroy (robust against
malicious attacks).
Watermark
Existing watermarking techniques developed for
Insertion
multimedia: images, sound, text, … are not
3. Pseudo
applicable to database tables.
–
–
–

Choose secret
key
Examples: Life Sciences, Electronic Parts.
Watermark – Intentionally introduced pattern in
the data.
–
–
1.
Rows in a table are unordered.
Rows can be inserted, updated, deleted.
Attributes can be added, dropped.
randomly
select a subset
of the rows for
marking
Function of
secret key and
attribute
Watermark can be detected using only a subset values
New algorithm for watermarking database
tables.
–
of the rows and attributes of a table.
– Robust against updates,incrementally updatable.
Database
Watermark
Detection
3. Identify
marked
rows/attributes,
compare marks 4. Confirm
with expected presence or
absence of
mark values
the
Requires
watermark
neither original
unmarked data
nor the
Suspicious
watermark
Database
Closing Thoughts


Solutions to complex problems such as
privacy require a mix of legislations,
societal norms, market forces & technology
By advancing technology, we can change
the mix and improve the overall quality of
the solution
References










M. Bawa, R. Bayardo, R. Agrawal. Privacy-preserving indexing of Documents on the Network. 29th
Int'l Conf. on Very Large Databases (VLDB), Berlin, Sept. 2003.
R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. ACM Int’l
Conf. On Management of Data (SIGMOD), San Diego, California, June 2003.
A. Evfimievski, J. Gehrke, R. Srikant. Liming Privacy Breaches in Privacy Preserving Data Mining.
PODS, San Diego, California, June 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An Xpath Based Preference Language for P3P. 12th Int'l
World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l
Conf.on Data Engineering(ICDE), Bangalore, India, March 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P,
Dulles, Virginia, Nov. 2002.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large
Databases (VLDB), Hong Kong, August 2002.
R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large
Databases (VLDB), Hong Kong, August 2002. Expanded version in VLDB Journal 2003.
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy
Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD),
Edmonton, Canada, July 2002.
R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. On Management of Data
(SIGMOD), Dallas, Texas, May 2000.