Transcript PPT
Tools for Privacy
Preserving Distributed
Data Mining
By Michael Holmes
Why Private Data Mining
❖
The CDC may want to use data mining techniques to
identify trends in disease outbreaks.
❖
Insurance companies have useful data but can’t disclose
it because of privacy concerns.
❖
Is there a way to obtain this data without revealing the
identity of the patients?
Private Data Mining
Techniques
❖
Secure Sum
❖
Secure Set Union
❖
Secure Size of Set Intersection
❖
Scalar Product
Private Data Mining Toolkit
❖
Association Rules in horizontally partitioned data
❖
Association Rules in vertically partitioned data
❖
EM Clustering
Secure Sum
❖
Securely compute the sum from individual databases.
❖
Have a site randomly generate a number R
❖
Add this number to every value and send it to site 2.
❖
Site 2 can then add each of it’s values to that values
sent from site 1 and return a single number back to Site
1.
❖
Site 1 can then remove the random number N times and
find the correct sum.
Secure Sum
Secure Set Union
Secure Size of Set
Intersection
❖
Only possible with Commutative Encryption.
❖
very party encrypts their data and then sends it to another party.
❖
The next party also encrypts the encrypted data.
❖
After all parties have encrypted all the data from every other
party only that has been duplicated by the encryption is shared.
❖
Count the duplicates and you know the size of the intersection.
Scalar Product
❖
Want to compute the sum of x1 * y1 between two
databases
❖
Use linear combinations of random numbers to disguise
elements and then computationally remove these once
you get the result.
Association Rules in Horizontally
Partitioned Data
❖
Candidate Set Generation
❖
Local Pruning
❖
Itemset Exchange (Secure Union Step here)
❖
Support Count Exchange
Association Rules in Vertically
Partitioned Data
❖
Uses scalar product to determine if the count of an item
set is greater than a threshold
❖
If the count is above the threshold you’ve determined
that the database is worth querying
❖
Can also user Secure Size Set Intersection to see how
much is in common.
❖
Useful when using algorithm such as apriori algorithm
EM Clustering
❖
Uses secure sum to get a global number associated with
all sites involved.
❖
Once global sum is computed, it can be used in the
Expectation-maximization method to generate staistical
models.
EM Clustering
❖
Uses secure sum to get a global number associated with
all sites involved.
❖
Once global sum is computed, it can be used in the
Expectation-maximization method to generate staistical
models.
Things to Note
❖
These algorithms are not fully private, some information
is learned in the process.
❖
For example in the set intersection, sites can
potentially learn the sizes of each database.
❖
Make sure to pick the appropriate algorithms for what
you need to accomplish
❖
Watch out for intermediate information being leaked!
Thank you