Green Data Mining: Privacy-Oriented Data Mining by Proof Checking

Download Report

Transcript Green Data Mining: Privacy-Oriented Data Mining by Proof Checking

Privacy-oriented Data Mining
by Proof Checking
Stan Matwin
(joint work with Amy Felty )
SITE
University of Ottawa, Canada
[email protected]
The TAMALE Group
• 4 profs
• Some 30 graduate students
• Areas: machine learning, data mining, text mining,
NLP, data warehousing
• Research in
–
–
–
–
Inductive Logic Programming
Text mining
Learning in the presence of knowledge
Applications of ML/DM (e.g. in SE: tools for maintenance
personnel)
UBC 29/7/03
2
• Why did I get into this research?
• what is already being done… and why it ’s not
enough
• the main idea
• its operation
• discussion – ’correctness ’
• prototype - Coq and CIC
• example
• some technical challenges
• acceptance?
UBC 29/7/03
3
Some useful concepts...
• opting out vs opting in
• Use Limitation Principle: data should be
used only for the explicit purpose for
which it has been collected
UBC 29/7/03
4
…and existing technical proposals
On the web: P3P Platform for Privacy Preferences
• W3C standard
• XML specifications - on websites and in
browsers - of what can be collected and for
what purpose - ULP?
• Handles cookies
• Data exchange protocol more than privacy
protocol: no provisions for opting out after an
initial opt-in
• the ULP part is in NL…not verifiable
UBC 29/7/03
5
Agrawal ’s data perturbation
transformations
• data is perturbed by random distortion:
xi  xi +r
r uniform or gaussian
• a procedure to reconstruct a PACesitimation of the original distribution (but
not the values)
• a procedure to build an accurate decision
tree on the perturbed distribution
UBC 29/7/03
6
Agrawal ’s transformations cont’d
• proposes a measure to quantify privacy –
estimate intervals and their size
• lately extended to non-numerical
attributes, and to association rules
• does not address the ULP
• how do we know it is applied?
UBC 29/7/03
7
the main idea: towards a verifiable ULP
• User sets permissions: what can and
cannot be done with her data
• Any claim that a software respects
these permissions is a proof of a
theorem about the software
• Verifying the claim is then checking that
proof against the software
UBC 29/7/03
8
Who are the players?
•
•
•
•
•
UBC 29/7/03
User C
Data miner Org
Data mining software developer Dev
Independent verifier Veri
…BUT no one owns the data D
9
D:
A:
S:
database scheme
given set of database and
data mining operations
source code for A
UBC 29/7/03
PC(D,A):
C’s permissions
T(PC,S): theorem that S respects PC
R(PC,S): proof of T(PC,S):
B: binary code of S
10
Discussion - properties
• It can be proven that C ’s permissions are
respected (or not): PC is in fact a verifiable
ULP
• PC can be negative (out) or positive (in)
• proof construction needs to be done only
once for a given PC ,D and A
• Scheme is robust against cheating by Dev
or Org
UBC 29/7/03
11
Acceptance issues
• No Org will give Veri access to S
• Too much overhead to check R(PC,S)
for each task, and each user
• Too cumbersome for C
• Based on all Orgs buying in
UBC 29/7/03
12
Acceptance1:Veri’s operation- access
• Veri needs
–
–
–
–
PC from C
R(S, PC) from Dev
S from Dev
B from Org
• Veri could check R(S, PC) at Dev’s
• Veri needs to verify that S (belonging
normally to Dev) corresponds to B that Org
runs.
UBC 29/7/03
13
Acceptance2: overhead
• Veri runs proof checking on a control
basis
• Org’s execution ovhd ?
UBC 29/7/03
14
Issues
• Naming the fields: XML or disclosure
• restricted class of theorems for a given
P-automating proof techniques for this
class
UBC 29/7/03
15
Acceptance3: C’s perspective
• Building PCs must be easy for C, based
on D and processing schema; initially a
closed set?
• permissions could be encoded on a
credit card, smart card, in the electronic
wallet
• or in the CA; they can then be
dynamically modified and revoked
UBC 29/7/03
16
« Political » aspects: who is Veri?
• generally trusted –
– « consumer association »?
– « Ralph Nader »?
– « transparency international »?
• IT expert at the level of instrumenting and
running the proof checker – connection to Open
Software Foundation?
• theorem proving can be cast as « better
testing »
UBC 29/7/03
17
how to make Orgs buy in?
• The first Org is needed to volunteer
• a Green Data Mining logo will be
granted and administered (verified)
by Veri
• other Orgs will have an incentive to
join
UBC 29/7/03
18
Future work
• Build the tools
• expand the prototype
• extend from Weka to commercial data
mining packages
• Integrate with P3P?
• find a willing Org
UBC 29/7/03
19
UBC 29/7/03
20
Link between S and B
• compilation not an option
• watermarking solution: B is watermaked
by a slightly modified compiler with
MD5(tar(S)) =128 bytes
• marks are inserted by a trusted
makefile-and-compiler in locations in B
given by Veri and unknown to Org
UBC 29/7/03
21
Link…
• Veri, given access to S, can verify that
B corresponds to S
• An attack by I requires hacking the
compiler
• An attack by Org requires knowing the
locations of watermarks
UBC 29/7/03
22
Example
C restricts her Employee data from
participating in a join with her Payroll
data
Record Payroll : Set :=
mkPay{PID : nat; JoinInd : bool; Position : string; Salary: nat}.
Record Employee : Set :=
mkEmp{Name : string; EID : nat; …}.
Record Combined : Set :=
mkComb{CID : nat; CName : string; Csalary: nat; …}.
UBC 29/7/03
23
Fixpoint Join [Ps: list Payroll]: (list Employee)  (list
Combined) :=
[Es : list Employee]
Cases Ps of
nil
(nil Combined)
| (cons p ps)
(app
(check_JoinInd_and_find_employee_record p Es)
(Join ps Es))
end.
(check_JoinInd_and_find_employee_record p Es)
if a record is found in Es whose EID matches Ps PID and
JoinInd permits Join, then a list of length 1 with the result of
Join is returned, otherwise empty
UBC 29/7/03
24
Definition Pc:=
[S:((list Payroll)(list Employee) (list Combined))  Prop]
 Ps:list Payroll}.  Es:list Employee. (UniqueJoinInd Ps) 
 P: Payroll.(In P Ps)  ((JoinInd P)=false 
not  C:Combined ((In C (S Ps Es))  ((CID C)=(PID P)))
• PC(S) is written as (PC Join): Coq expands the definition
of PC and provides the theorem
• request to proof checking operator of Coq will check this
proof i.e; it will check that the user permissions are
encoded into the Join program given
• Whole proof: 300 lines of Coq code; proof checking: 1
sec on a 600MHz machine
UBC 29/7/03
25