Information Sharing across Private Databases

Download Report

Transcript Information Sharing across Private Databases

Sovereign Information Sharing,
Searching and Mining
Rakesh Agrawal
IBM Almaden Research Center
Thesis

Organizational boundaries are blurring in the
emerging networked economy
– Compete and co-operate simultaneously
– Int’l value chain

Need to rethink information sharing, searching, and
mining in the new brave world of virtual
organizations
Sovereign Information Sharing



Separate databases due to
statutory, competitive, or security
reasons.
 Selective, minimal sharing on
need-to-know basis.
Minimal Necessary Sharing
R
a
u
Example: Among those who took
a particular drug, how many had
adverse reaction and their DNA
contains a specific sequence?
 Researchers must not learn
anything beyond counts.
v
Commutative Encryption:
u
E1(E2(T)) = E2(E1(T))
x
RS
 R must not
know that S
has b & y
 S must not
know that R
has a & x
RS
u
v
S
b
v
Count (R  S)
 R & S do not learn
anything except that
the result is 2.
y
SIGMOD 00
Privacy Preserving Data Mining
Alice’s
age
Alice’s
salary
Bob’s
age
30 | 70K | ...
50 | 40K | ...
Randomizer
Randomizer
30+35
65 | 20K | ...
Reconstruct
distribution
of Age
25 | 60K | ...

Insight: Preserve privacy at the individual level, while
still building accurate data mining models at the
aggregate level.

Add random noise to individual values to protect
privacy.

EM algorithm to estimate original distribution of values
given randomized values + randomization function.

Algorithms for building classification models and
discovering association rules on top of privacypreserved data with only small loss of accuracy.
Reconstruct
distribution
of Salary
Data Mining Algorithms
Data Mining Model
1200
120
1000
100
800
80
600
60
400
40
20
200
0
0
Original
Randomized
20
40
82
74
66
58
50
42
34
26
18
2
SIGMOD 00
10
10
Reconstructed
60
80
100
150
Randomization Level
Original
Randomized
Reconstructed
200
Finessing Schema Chaos
50
50
 Use a simple regular expression
extractor to get numbers
40
40
30
30
20
20
 Do simple data extraction to get hints
10
10
0
0
 Hint for unit: the word following the
number.
 Hint for attribute name: k following
numbers.
256 MB SDRAM memory
10
20
30
40
0
50
0
10
20
30
40
100
80
60
40
20
Unit Hint:
MB
Attribute Hint:
SDRAM, Memory
 Use only numbers in the queries
 Treat any attribute name in the
query also as hint
W W W 03
0
1
2
3
4
5
7
Query Size
Accuracy
Non-Reflectivity
Randomized NonReflectivity
Reflectivity estimates accuracy
50
Privacy Preserving Indexing



A public mapping function that maps a query to a
set of providers P that may contain the desired
document
P contains false negatives
Providers return a document only if the searcher is
authorized to access the document
VLDB 03
Some Interesting Topics

Current integration approaches do not scale
– Information integration per se is not interesting
– Static vs. dynamic plumbing


Incentive compatibility
Auditing interactions