T - Rakesh Agrawal
Download
Report
Transcript T - Rakesh Agrawal
Privacy in Data Systems
Rakesh Agrawal
IBM Almaden Research Center
joint work with Srikant, Kiernan & Xu
Theme
ƒ
Increasing need for information systems
that
ƒ protect the privacy and ownership of
information
ƒ do not impede the flow of information
ƒ
Resolving the apparent contradiction in the
above statement is a major research
challenge and opportunity
Drivers
Policies and Legislations
– U.S. and international regulations
– Legal proceedings against businesses
Consumer Concerns
– Consumer privacy apprehensions continue to plague the
Web … these fears will hold back roughly $15 billion
in e-Commerce revenue.” Forrester Research, 2001
– Most consumers are “privacy pragmatists.” Westin
Surveys
Moral Imperative
– The right to privacy: the most cherished of human
freedom -- Warren & Brandeis, 1890
Privacy Is Headline News
“Privacy #1 issue in
the 21Century”
-Wall Street Journal,
January 24, 2000
“Anyone today who thinks the privacy issue has peaked is greatly
mistaken…we are in the early stages of a sweeping change in
attitudes that will put once-routine business practices under the
microscope.”
Forrester Research, March 5, 2001
Outline
Privacy Preserving Data Mining
– How to have you cake and mine it too!
– Assuming no trusted third party.
Hippocratic Databases
– What I may see or hear in the course of treatment … I
will keep to myself.
- Hippocratic Oath, 8 (circa 400 BC)
Other related topics
– Privacy Cognizant Information Integration
– Database support for P3P
– Watermarking & fingerprinting
Data Mining and Privacy
The
primary task in data mining:
– development of models about aggregated
data.
Can
we still develop accurate models,
while protecting the privacy of
individual records?
Approaches
Randomization
Cryptographic
Statistical disclosure control
Alice
Randomization Overview
35
80,000
J.S. Bach
painting
nasa
35
95,000
J.S. Bach
painting
nasa
Bob
45
60,000
B. Spears
baseball
cnn
45
60,000
B. Spears
baseball
cnn
Chris
42
85,000
B. Marley,
camping,
microsoft
42
85,000
B. Marley
camping
microsoft
Recommendation
Service
Alice
Randomization Overview
35
80,000
J.S. Bach
painting
nasa
35
95,000
J.S. Bach
painting
nasa
Bob
45
60,000
B. Spears
baseball
cnn
45
60,000
B. Spears
baseball
cnn
Chris
42
85,000
B. Marley,
camping,
microsoft
Recommendation
Service
Mining Algorithm
42
85,000
B. Marley
camping
microsoft
Data Mining Model
Alice
Randomization Overview
50
65,000
Metallica
painting
nasa
35
95,000
J.S. Bach
painting
nasa
Bob
45
60,000
B. Spears
baseball
cnn
38
90,000
B. Spears
soccer
fox
Chris
42
85,000
B. Marley,
camping,
microsoft
32
55,000
B. Marley
camping
linuxware
Recommendation
Service
Alice
Randomization Overview
50
65,000
Metallica
painting
nasa
35
95,000
J.S. Bach
painting
nasa
Bob
45
60,000
B. Spears
baseball
cnn
38
90,000
B. Spears
soccer
fox
Chris
42
85,000
B. Marley,
camping,
microsoft
Recommendation
Service
Recovery
Mining Algorithm
32
55,000
B. Marley
camping
linuxware
Data Mining Model
Inducing Classifiers over Privacy
Preserved Numeric Data
Alice’s
age
Alice’s
salary
John’s
age
30 | 25K | …
30
become
s 65
(30+35)
50 | 40K | …
Randomizer
Randomizer
65 | 50K | …
35 | 60K | …
Reconstruct
Age Distribution
Reconstruct
Salary Distribution
Decision Tree
Algorithm
Model
Reconstruction Problem
Original values x1, x2, ..., xn
– from probability distribution X (unknown)
To hide these values, we use y1, y2, ..., yn
– from probability distribution Y
Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y
Estimate the probability distribution of X.
Reconstruction Algorithm
fX0 := Uniform distribution
j := 0
repeat
n
1
fY (( xi yi ) a ) f Xj (a )
fXj+1(a) := n
Bayes’ Rule
j
i 1
fY (( xi yi ) a ) f X (a )
j := j+1
until (stopping criterion met)
(R. Agrawal & R. Srikant, SIGMOD 2000)
Converges to maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.
Estimate of a single point
Use Bayes' rule for density functions
1
0V
A
g
e
9
0
O
r
i
g
i
n
a
l
d
i
s
t
r
i
b
u
t
i
o
n
f
o
r
A
g
e
P
r
o
b
a
b
i
l
i
s
t
i
c
e
s
t
i
m
a
t
e
o
f
o
r
i
g
i
n
a
l
v
a
l
u
e
o
f
V
Estimate of a single point
Use Bayes' rule for density functions
1
0V
A
g
e
9
0
O
r
i
g
i
n
a
l
D
i
s
t
r
i
b
u
t
i
o
n
f
o
r
A
g
e
P
r
o
b
a
b
i
l
i
s
t
i
c
e
s
t
i
m
a
t
e
o
f
o
r
i
g
i
n
a
l
v
a
l
u
e
o
f
V
Estimate of the distribution
Combine estimates of where a point came
from for all the points:
– yields estimate of original distribution.
10
Age
90
Works Well
1000
Original
800
600
Randomized
400
Reconstructed
0
60
200
20
Number of People
1200
Age
Decision Tree Example
Age
23
17
43
68
32
20
Salary
50K
30K
40K
50K
70K
20K
Repeat
Visitor?
Repeat
Repeat
Repeat
Single
Single
Repeat
Age < 25
No
Yes
Salary <
50K
Repeat
Yes
Repeat
No
Single
Algorithms
Global
– Reconstruct for each attribute once at the beginning
By Class
– For each attribute, first split by class, then reconstruct
separately for each class.
Local
– Reconstruct at each node
See SIGMOD 2000 paper for details.
Experimental Methodology
Compare accuracy against
– Original: unperturbed data without randomization.
– Randomized: perturbed data but without making any
corrections for randomization.
Test data not randomized.
Synthetic benchmark from [AGI+92].
Training set of 100,000 records, split equally
between the two classes.
Decision Tree Experiments
100% Randomization Lev el
100
Accuracy
90
Original
80
Randomized
70
Reconstructed
60
50
Fn 1
Fn 2
Fn 3
Fn 4
Fn 5
Accuracy vs. Randomization
Fn 3
100
Accuracy
90
80
Original
70
Randomized
Reconstructed
60
50
40
10
20
40
60
80
100
Randomization Level
150
200
Discovering Associations Over
Privacy Preserved Categorical Data
A transaction t is a set of items
Support s for an itemset A is the number of
transactions in which A appears
Itemset A is frequent if s smin
Task: Find all frequent itemsets, while
preserving the privacy of individual transaction.
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules
Over Privacy Preserving Data. KDD-2002.
S. Rizvi, J. Haritsa. Privacy-Preserving Association Rule Mining. VLDB 2002
Uniform Randomization
Given
a transaction,
– Keep item with, say 20% probability
– Replace with a new random item with
80% probability.
Privacy Breach
If
one has access to the result (frequent
itemsets) as well as the randomized
transactions, one may make inferences
about the original transactions.
Itemset A causes a privacy breach of
level if, for some item z A
Prz t | A t
See also: A. Evfimievski, J. Gehrke , R. Srikant. PODS 2003.
Example: {x, y, z}
10 M transactions of size 10 with 10 K items:
1%
5% have
have {x, y}, {x, z},
{x, y, z} or {y, z} only
•
0.23
0.008%
800 ts.
97.8%
0.22 •
•
8/10,000
0.00016%
16 trans.
1.9%
Original
94%
have one or zero
items of {x, y, z}
at most
• 0.2 • (9/10,000)2
less than 0.00002%
2 transactions
Randomize
0.3%
d
Privacy Breach: Given {x, y, z} in a randomized transaction,
we have 98% certainty of {x, y, z} being in the original one
Solution
“Where does a wise man hide a leaf? In the forest.
But what does he do if there is no forest?”
“He grows a forest to hide it in.”
G.K. Chesterton
Insert many false items into each transaction
Hide true itemsets among false ones
Cut and Paste Randomization
Given transaction t of size m, construct t’:
t =
t’ =
a, b, c, u, v, w, x, y, z
Cut and Paste Randomization
Given transaction t of size m, construct t’:
– Choose a number j between 0 and Km (cutoff);
t =
a, b, c, u, v, w, x, y, z
t’ =
j=4
Cut and Paste Randomization
Given transaction t of size m, construct t’:
– Choose a number j between 0 and Km (cutoff);
– Include j items of t into t’;
t =
t’ =
a, b, c, u, v, w, x, y, z
b, v, x, z
j=4
Cut and Paste Randomization
Given transaction t of size m, construct t’:
– Choose a number j between 0 and Km (cutoff);
– Include j items of t into t’;
– Each other item is included into t’ with probability pm .
The choice of Km and pm is based on the desired level of privacy.
t =
t’ =
a, b, c, u, v, w, x, y, z
b, v, x, z
j=4
œ, å, ß, ξ, ψ, €, א, ъ, ђ, …
Partial Supports
To recover original support of an itemset, we need randomized
supports of its subsets.
Given an itemset A of size k and transaction size m,
A vector of partial supports of A is
s s0 , s1 ,..., sk , where
sl
1
# t T | # t A l
T
– Here sk is the same as the support of A.
– Randomized partial supports are denoted by
s .
The Unbiased Estimators
Given randomized partial supports, we can estimate original
partial supports:
sest Q s , where Q P 1
Covariance matrix for this estimator:
1 k
T
Cov sest
s
Q
D
[
l
]
Q
,
l
T l 0
where D[l ]i , j Pi , l i j Pi , l Pj , l
To estimate it, substitute sl with (sest)l .
– Special case: estimators for support and its variance
Can we still find frequent itemsets?
Breach level = 50%.
Soccer:
smin = 0.2%
Mailorder:
smin = 0.2%
Itemset
Size
True
Itemsets
True
Positives
False
Drops
False
Positives
1
266
254
12
31
2
217
195
22
45
3
48
43
5
26
Itemset
Size
True
Itemsets
True
Positives
False
Drops
False
Positives
1
65
65
0
0
2
228
212
16
28
3
22
18
4
5
Cryptographic Approach
+
–
?
Accuracy
Performance
Security
– Semi-honest (or passive) adversary: Correctly
follows the protocol specification, yet attempts
to learn additional information by analyzing the
messages.
Cryptographic Primitives
Oblivious transfer [CACM85, MIT81]
– Sender’s input: (x0, x1), Receiver’s input {0,1}
– Receiver learns x , sender learns nothing
– Sufficient for secure computation [STOC 88]
Oblivious polynomial evaluation [STOC99]
– Sender’s input: polynomial Q over F
– Receiver’s input: z F
– Receiver obtains Q(z), sender learns nothing
Yao’s two-party protocol [FOCS 84]
– Party 1 with input x, Party 2 with input y
– Compute f(x,y) without revealing x,y
– Any polynomial-time function can be expressed as a combinatorial
circuit of polynomial size [JACM72]
Private Distributed ID3
Problem: Two parties owning confidential
databases wish to build a decision-tree
classifier on the union of their databases,
without revealing any unnecessary
information.
Y. Lindell, B. Pinkas. Privacy Preserving Data Mining. Crypto 2000.
Basic Idea
Find attribute with highest information gain
privately
We can then split on this attribute and
recurse.
Information Gain
Let
– T = set of records (dataset),
– T(ci) = set of records in class i,
– T(ci,aj) = set of records in class i with value(A) = aj.
– Entropy(T) =
i
| T ( ci ) |
| T (ci ) |
log
|T |
|T |
– Gain(T,A) = Entropy(T) -
| T (aj ) |
j | T | Entropy(T( aj))
Need to compute
– Sj Si |T(aj, ci)| log |T(aj, ci)|
– Sj |T(aj)| log |T(aj)|.
Selecting the Split Attribute
Given v1 known to party 1 and v2 known to
party 2, compute (v1 + v2) log (v1 + v2)
and output random shares.
– Party 1 gets Answer -
– Party 2 gets , where is a random number
Given random shares for each attribute, use
Yao's protocol to compute information gain.
Purdue Toolkit
Partitioned databases
Secure Building Blocks for computing
– Sum
– Set Union
– Size of Set Intersection
– Scalar Product
C. Clifton et al. Tools for Privacy Preserving Data Mining.
SIGKDD Explorations 2003.
Sum
Sum = 24 – R = 17
5
R=7
24
9
12
15
3
Semi-honest Assumption
Algorithms
Association rules
– horizontally partitioned data
– vertically partitioned data
EM Clustering
Work in Statistical Databases
Provide statistical information without
compromising sensitive information about
individuals (AW89, Sho82)
Techniques
– Query Restriction
– Data Perturbation
Negative Results: cannot give high quality
statistics and simultaneously prevent partial
disclosure of individual information [AW89]
Techniques
Query Restriction
– Restrict the size of query result (e.g. FEL72, DDS79)
– Control overlap among successive queries (e.g. DJL79)
– Suppress small data cells (e.g. CO82)
Output Perturbation
– Sample result of query (e.g. Den80)
– Add noise to query result (e.g. Bec80)
Data Perturbation
– Replace db with sample (e.g. LST83, LCL85, Rei84)
– Swap values between records (e.g. Den82)
– Add noise to values (e.g. TYW84, War65)
Summary
Promising technical direction & results
Much more needs to be done, e.g.
– Trade off between the amount of privacy
breach and performance
– Examination of other approaches (e.g.
randomization based on swapping)
Hippocratic Databases
Hippocratic Oath, 8 (circa 400 BC)
– What I may see or hear in the course of
treatment … I will keep to myself.
What if the database systems were to
embrace the Hippocratic Oath?
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu
Hippocratic Databases. VLDB 2002..
The Ten Principles
Driven by current privacy legislation.
– US (FIPA, 1974), Europe (OECD , 1980), Canada (1995),
Australia (2000), Japan (2003)
Principles:
– Collection Group: Purpose Specification, Consent,
Limited Collection
– Use Group: Limited Use, Limited Disclosure,
Limited Retention, Accuracy
– Security & Openness Group: Safety, Openness,
Compliance
Collection Group
1.
Purpose Specification
– For personal information stored in the database, the
purposes for which the information has been collected
shall be associated with that information.
2.
Consent
– The purposes associated with personal information
shall have consent of the donor (person whose
information is being stored).
3.
Limited Collection
– The information collected shall be limited to the
minimum necessary for accomplishing the specified
purposes.
Use Group
4.
Limited Use
– The database shall run only those queries that
are consistent with the purposes for which the
information has been collected.
5.
Limited Disclosure
– Personal information shall not be
communicated outside the database for
purposes other than those for which there is
consent from the donor of the information.
Use Group (2)
6.
Limited Retention
– Personal information shall be retained only as
long as necessary for the fulfillment of the
purposes for which it has been collected.
7.
Accuracy
– Personal information stored in the database
shall be accurate and up-to-date.
Security & Openness Group
8.
Safety
– Personal information shall be protected by security
safeguards against theft and other misappropriations.
9.
Openness
– A donor shall be able to access all information about
the donor stored in the database.
10.
Compliance
– A donor shall be able to verify compliance with the
above principles. Similarly, the database shall be able
to address a challenge concerning compliance.
Architecture
Privacy
Policy
Privacy
Metadata
Creator
Privacy
Metadata
Data
Collection
Queries
Other
Privacy
Constraint
Validator
Attribute
Access
Control
Data
Collection
Analyzer
Data
Accuracy
Analyzer
Query
Intrusion
Detector
Data
Retention
Manager
Audit
Info
Audit
Info
Audit
Trail
Store
Record
Access
Control
Encryption
Support
New Challenges
General
– Language
– Efficiency
Use
– Limited Collection
– Limited Disclosure
– Limited Retention
Security and Openness
– Safety
– Openness
– Compliance
Language
Need a declarative language for specifying privacy
policies & user preferences.
P3P is very limited
– Developed primarily for web shopping
– No enforcement
Some features:
– User negotiation models
Least invasive site
Coalitional game [KPR2001])
– Balance expressibility and usability
Efficiency
How do we minimize the cost of privacy
checking?
– Need cell-level access control
How do we incorporate purpose into database
design and query optimization?
How does the secure databases work on
decomposition of multilevel relations into singlelevel relations [JS91] apply here?
– Difference in granularity and scale from work in secure
databases
Limited Collection
How do we identify attributes that are collected
but not used?
– Assets are only needed for mortgage when salary is
below some threshold.
What’s the needed granularity for numeric
attributes?
– Queries only ask “Salary > threshold” for rent
application.
How do we generate minimal query for a given
purpose?
Limited Disclosure
Need dynamically determined set of
recipients?
Example: Alice wants to add EasyCredit to
set of recipients in EquiRate’s database.
Digital signatures.
Limited Retention
Completely forgetting some information is
non-trivial.
How do we delete a record from the logs
and checkpoints, without affecting
recovery?
How do we continue to support historical
analysis and statistical queries without
incurring privacy breaches?
Safety
Encryption do avoid inadvertent disclosure
of data
– How do we index encrypted data?
– How do we run queries against encrypted data?
– [SWP00], [HILM02]
Openness
A donor shall be able to access all information
about the donor stored in the database.
How does the database check Alice is really Alice
and not somebody else?
– Princeton admissions office broke into Yale’s
admissions using applicant’s social security number and
birth date.
How does Alice find out what databases have
information about her?
– Information discovery literature
– Symmetrically private information retrieval [GIKM98].
Compliance
Universal Logging
– Can we provide each user whose data is
accessed with a log of that access, along with
the query reading the data?
Tracking Privacy Breaches
– Insert “fingerprint” records with emails, telephone
numbers, and credit card numbers.
– Some data may be more valuable for spammers or
credit card theft. How do we identify categories to do
stratified fingerprinting rather than randomly
inserting records?
Summary
Gold mine of challenging research problems
(besides being useful)!
Related Topics
Privacy Cognizant Information Integration
Database support for P3P
Watermarking & fingerprinting
Decision-Making Across Private Data Repositories
•
Separate databases due to
statutory, competitive, or security
reasons.
Selective, minimal sharing on
need-to-know basis.
Minimal Necessary Sharing
R
a
u
Example: Among those who took
a particular drug, how many had
adverse reaction and their DNA
contains a specific sequence?
Researchers must not learn
anything beyond counts.
v
Algorithms for computing joins
and join counts while revealing
minimal additional information.
u
x
RS
R must not
know that S
has b & y
S must not
know that R
has a & x
RS
u
v
S
b
v
y
Count (R S)
R & S do not learn
anything except that
the result is 2.
Database Support for P3P
APPEL
P3P: New W3C standard to encode company privacy policies and user
Privacy
privacy preferences in XML.
Preference
• Can programmatically match preferences against policies.
• Solves the problem that current policies are written by lawyers, for
lawyers.
Matching
• Current implementations do the matching in the client (browser).
result
Proposal: Server-centric preference matching using relational
databases.
Policy-Preference
Matching
Advantages:
• Server-side matching necessary for thin clients, e.g. mobile devices.
APPEL to SQL
• Sets up necessary infrastructure for policy enforcement.
Converter
• Provides companies with extra information for refining privacy
policies.
Query
SQL
Prototype enables DB2 with P3P support.
results
query
• Stores P3P policies in relational tables.
• Reuses database query technology for policy-preference matching.
Database
• Algorithm for converting APPEL preferences into SQL queries.
P3P
Privacy
Policy
Policy Storing
Shredder
Policy Metadata
Watermarking Relational Databases
Goal: Deter data theft and assert ownership of
pirated copies.
–
1.
Specify secret
key
2.
Specify
table/attributes
to be marked
2.
Specify
table/attributes
which should
contain marks
Very unlikely to occur by chance.
Hard to find => hard to destroy (robust against
malicious attacks).
Watermark
Existing watermarking techniques developed for
Insertion
multimedia: images, sound, text, … are not
3. Pseudo
applicable to database tables.
–
–
–
Choose secret
key
Examples: Life Sciences, Electronic Parts.
Watermark – Intentionally introduced pattern in
the data.
–
–
1.
Rows in a table are unordered.
Rows can be inserted, updated, deleted.
Attributes can be added, dropped.
randomly
select a subset
of the rows for
marking
Function of
secret key and
attribute
Watermark can be detected using only a subset values
New algorithm for watermarking database
tables.
–
of the rows and attributes of a table.
– Robust against updates,incrementally updatable.
Database
Watermark
Detection
3. Identify
marked
rows/attributes,
compare marks 4. Confirm
with expected presence or
absence of
mark values
the
Requires
watermark
neither original
unmarked data
nor the
Suspicious
watermark
Database
Closing Thoughts
The right to privacy: the most cherished of
human freedoms. -- Warren & Brandeis, 1890
Code is law … it is all a matter of code: the
software and hardware that now rule. -- L. Lessig
We can architect computing systems to protect
values we believe are fundamental, or we can
architect them to allow those values to disappear.
What do we want to do as database researchers?
References
R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases.
ACM Int’l Conf. On Management of Data (SIGMOD), San Diego, California, June
2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An Xpath Based Preference Language for
P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database
Technology. 19th Int'l Conf.on Data Engineering(ICDE), Bangalore, India, March 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the
Future of P3P, Dulles, Virginia, Nov. 2002.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on
Very Large Databases (VLDB), Hong Kong, August 2002.
R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very
Large Databases (VLDB), Hong Kong, August 2002.
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over
Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data
Mining (KDD), Edmonton, Canada, July 2002.
R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. On
Management of Data (SIGMOD), Dallas, Texas, May 2000.