Mining Multiple Private Databases
Top-k Queries Across Multiple Private Databases (2005)
Li Xiong (Emory University)
Subramanyam Chitti (GA Tech)
Ling Liu (GA Tech)
Presented by: Cesar Gutierrez
About Me
ISYE Senior and CS minor
Graduating December, 2008
Humanitarian Logistics and/or Supply Chain
Originally from Lima, Peru
Travel, paintball and politics
Outline
Intro. & Motivation
Problem Definition
Important Concepts & Examples
Private Algorithm
Conclusion
Introduction
↓ Information-sharing restrictions, due to technology
↑ Need for distributed data-mining tools that preserve privacy
[Diagram: three-way trade-off between privacy, accuracy, and efficiency]
Motivating Scenarios
CDC needs to study insurance data to detect disease outbreaks:
Disease incidents
Disease seriousness
Patient background
Legal/commercial problems prevent release of policyholders' information
Motivating Scenarios (cont'd)
Industrial trade group collaboration
Useful pattern: "manufacturing processes using chemical supplies from supplier X have high failure rates"
Trade secret: "manufacturing process Y gives a low failure rate"
Problem & Assumptions
Model: n nodes, horizontal partitioning
...
Assume semi-honesty:
Nodes follow the specified protocol
Nodes may attempt to learn additional information about other nodes
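To make the setup concrete, here is a minimal sketch of horizontal partitioning under the semi-honest model; the node names and record fields are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not from the paper): n nodes with horizontally partitioned data.
# Every node stores records with the same schema; the rows are split across nodes.

from dataclasses import dataclass

@dataclass
class Record:
    patient_id: int
    diagnosis: str
    severity: int          # e.g., 1 (mild) to 10 (severe)

class Node:
    """Semi-honest participant: follows the protocol, but inspects every message it receives."""
    def __init__(self, name, records):
        self.name = name
        self.records = records   # private; never shared directly

# Four insurers, each holding only its own policyholders' records (hypothetical data).
nodes = [
    Node("D1", [Record(1, "flu", 2), Record(2, "measles", 7)]),
    Node("D2", [Record(3, "flu", 3)]),
    Node("D3", [Record(4, "measles", 8), Record(5, "flu", 1)]),
    Node("D4", [Record(6, "measles", 9)]),
]
```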
Challenges
Why not use a Trusted Third Party (TTP)?
Difficult to find one that is trusted
Increased danger from a single point of compromise
Why not use secure multi-party computation techniques?
High communication overhead
Feasible only for small inputs
Recall Our 3-D Goal
[Diagram: the three-way trade-off among privacy, accuracy, and efficiency]
Private Max
Actual data sent on the first pass
Static starting point, known to all
[Diagram: four nodes with local values 30, 10, 40, 20 pass a running maximum around a ring from a fixed starting node; the result is 40, but each successor sees its predecessor's actual data]
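A minimal sketch of this naive protocol, assuming a simple ring of nodes (function and variable names are illustrative, not from the paper): each node forwards the larger of the incoming value and its own private value, so the starting node's successor, and every node after a new maximum, learns actual data about its predecessor.

```python
# Naive ring protocol (sketch): the starting point is fixed and known,
# and real values are forwarded, so each node leaks information to its successor.

def naive_max(values, start=0):
    """values[i] is node i's private value; nodes pass a running max around the ring."""
    n = len(values)
    current = values[start]            # the starting node exposes its actual value
    for step in range(1, n):
        i = (start + step) % n
        # Node i sees `current`; if current < values[i], node i has just learned
        # its predecessor's value (or at least a tight bound on it).
        current = max(current, values[i])
    return current

print(naive_max([30, 10, 40, 20]))     # -> 40, but privacy is lost along the way
```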
Multi-Round Max
Randomly perturbed data passed to successor during multiple passes
Randomized starting point
No successor can determine actual data from its predecessor
[Diagram: nodes D2, D3, D4, ... exchange randomly perturbed values (18, 32, 35, ...) over several rounds and converge to the true maximum 40]
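The sketch below illustrates the general idea rather than the paper's exact protocol: a node that would raise the maximum may instead inject a randomized value between the current result and its own value, and the randomization probability decays over rounds (assumed here to be p = P0 * d**round, matching the parameters on the next slide), so the protocol converges to the true maximum while each forwarded value remains deniable.

```python
import random

# Multi-round randomized max (illustrative sketch; parameter names P0, d, r follow
# the evaluation-parameter slide, but the exact update rule is an assumption).

def multi_round_max(values, p0=0.5, d=0.5, rounds=4, seed=None):
    rng = random.Random(seed)
    n = len(values)
    start = rng.randrange(n)          # randomized starting point
    g = 0                             # running (possibly perturbed) result; assumes non-negative values
    for r in range(rounds):
        p = p0 * (d ** r)             # randomization probability decays each round
        for step in range(n):
            i = (start + step) % n
            v = values[i]
            if v > g:
                if rng.random() < p:
                    # Inject a random value below the true local value, so the
                    # successor cannot tell whether it received real data.
                    g = rng.uniform(g, v)
                else:
                    g = v
            # If v <= g, the node simply forwards g unchanged.
    return g

print(multi_round_max([30, 10, 40, 20], seed=42))   # converges toward the true max 40
```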
Evaluation Parameters
Parameter  Description
n          # of nodes in the system
k          kNN parameter
P0         Initial randomization probability in neighbor selection
d          Dampening factor in neighbor selection
r          # of rounds in neighbor selection
Large k = "avoid information leaks"
Large d = more randomization = more privacy
Small d = more accurate (deterministic)
Large r = "as accurate as ordinary classifier"
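One plausible reading of how P0, d, and r interact (an assumed schedule based on the parameter descriptions above, not a formula quoted from the slides): the randomization probability starts at P0 and is dampened by a factor of d each round, so a large d keeps randomization, and therefore privacy, high for longer, while more rounds r push the protocol toward deterministic, fully accurate behavior.

```python
# Illustrative schedule (assumed form): randomization probability in round j.
p0, d = 0.8, 0.5
for j in range(5):
    print(f"round {j}: p = {p0 * d**j:.3f}")
# round 0: p = 0.800 ... round 4: p = 0.050
# More rounds mean less randomization, i.e., behavior closer to an ordinary classifier.
```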
Accuracy Results
Varying Rounds
Privacy Results
Conclusion
Problems Tackled
Preserving efficiency and accuracy while introducing provable privacy to the system
Improving a naive protocol
Reducing privacy risk in an efficient manner
Critique
Relies on other research papers for a full understanding
Few or no illustrations
A real-life example would have made the charts easier to understand