Privacy-Preserving Data Mining
Presenter: Li Cao
October 15, 2003
1
Presentation organization
- Associate Data Mining with Privacy
- Privacy Preserving Data Mining scheme using random perturbation
- Privacy Preserving Data Mining using Randomized Response Techniques
- Comparing these two cases
2
Privacy protection history
Privacy concerns nowadays:
- Citizens' attitude
- Scholars' attitude
3
Internet users’ attitudes
[Pie chart] Privacy fundamentalists: 17%; Pragmatic majority: 56%; Marginal concern: 27%
4
Privacy value
- Filtering to weed out unwanted information
- Better search results with less effort
- Useful recommendations
- Market trends ...

Example: From the analysis of a large number of purchase transaction records containing the customers' age and income, we can learn what kinds of customers like a given style or brand.
5
Motivation
(Introducing Data Mining)
- Data Mining's goal: discover knowledge, trends, and patterns from large amounts of data.
- Data Mining's primary task: developing accurate models about aggregated data without access to precise information in individual data records (not only discovering knowledge but also preserving privacy).
6
Presentation organization
- Associate Data Mining with Privacy
- Privacy Preserving Data Mining scheme using random perturbation
- Privacy Preserving Data Mining using Randomized Response Techniques
- Comparing these two cases
7
Privacy Preserving Data Mining scheme using random perturbation
- Basic idea
- Reconstruction procedure
- Decision-Tree classification
- Three different algorithms
8
Attribute list of an example
9
Records of an example
10
Privacy preserving methods
- Value-Class Membership: the values for an attribute are partitioned into a set of disjoint, mutually exclusive classes.
- Value Distortion: return xi + r instead of xi, where r is a random value drawn from
  a) a Uniform distribution, or
  b) a Gaussian distribution.
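As an illustration of value distortion, here is a minimal Python sketch (my own, not from the slides) that perturbs a numeric attribute by adding either uniform or Gaussian noise; the attribute values and noise scale are arbitrary assumptions.

import numpy as np

def distort_values(x, method="gaussian", scale=1.0, rng=None):
    """Return x_i + r for each value x_i, where r is random noise:
    method="uniform"  -> r ~ Uniform(-scale, +scale)
    method="gaussian" -> r ~ Normal(0, scale)
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    if method == "uniform":
        r = rng.uniform(-scale, scale, size=x.shape)
    elif method == "gaussian":
        r = rng.normal(0.0, scale, size=x.shape)
    else:
        raise ValueError("method must be 'uniform' or 'gaussian'")
    return x + r

# Example: perturb a (hypothetical) age attribute before it leaves the client.
ages = [23, 35, 47, 52, 61]
print(distort_values(ages, method="uniform", scale=10.0))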
11
Basic idea
- Original data → perturbed data (let users provide a modified value for sensitive attributes)
- Estimate the distribution of the original data from the perturbed data
- Build classifiers using these reconstructed distributions (decision tree)
12
Basic Steps
13
Reconstruction problem
1) View the n original data values x1, x2, ..., xn of a one-dimensional distribution as realizations of n independent identically distributed (iid) random variables X1, X2, ..., Xn, each with the same distribution as the random variable X.
2) To hide these data values, n independent random variables Y1, Y2, ..., Yn have been used, each with the same distribution as a different random variable Y.
14
3) Given x1+y1, x2+y2, ..., xn+yn (where yi is the realization of Yi) and the cumulative distribution function FY for Y, we would like to estimate the cumulative distribution function FX for X.

In short: given a cumulative distribution FY and the realizations of n iid random samples X1+Y1, X2+Y2, ..., Xn+Yn, estimate FX.
15
Reconstruction process
Let the value of Xi+Yi be wi (= xi+yi). Use Bayes' rule to estimate the posterior distribution function F'X1 (given that X1+Y1 = w1) for X1, assuming we know the density functions fX and fY for X and Y respectively (see the formula below).
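The formula itself is not reproduced in this transcript; applying Bayes' rule with the densities defined above gives (a reconstruction, not a quote from the slides):

F'_{X_1}(a) = \Pr\left(X_1 \le a \mid X_1 + Y_1 = w_1\right)
            = \frac{\int_{-\infty}^{a} f_Y(w_1 - z)\, f_X(z)\, dz}
                   {\int_{-\infty}^{\infty} f_Y(w_1 - z)\, f_X(z)\, dz}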
16

To estimate the posterior distribution function F'X given x1+y1, x2+y2, ..., xn+yn, we average the distribution functions for each of the Xi (see the formula below).
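The averaging step, again reconstructed from the text rather than copied from the slide, is:

F'_X(a) = \frac{1}{n} \sum_{i=1}^{n} F'_{X_i}(a)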
17

- The corresponding posterior density function f'X is obtained by differentiating F'X (see the formula below).
- Given a sufficiently large number of samples, f'X will be very close to the real density function fX.
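Differentiating the averaged distribution function gives the density estimate below (reconstructed from the definitions above; this is also the update rule used by the algorithm on the next slide):

f'_X(a) = \frac{1}{n} \sum_{i=1}^{n}
          \frac{f_Y(w_i - a)\, f_X(a)}
               {\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}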
18
Reconstruction algorithm
1. fx0 := Uniform distribution
2. j := 0   // iteration number
3. Repeat
       update fx(j+1) from fx(j) using the reconstruction formula of slide 18, with the current estimate fx(j) in place of fx
       j := j + 1
4. Until (stopping criterion met)
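The update formula is not reproduced in this transcript; below is a minimal Python sketch of the iterative reconstruction, assuming the original density is approximated on a discrete grid and the noise density fY is known (uniform noise here). The grid, sample sizes, and stopping tolerance are illustrative choices, not taken from the slides.

import numpy as np

def reconstruct_distribution(w, noise_pdf, grid, n_iters=100, tol=1e-6):
    """Iteratively estimate the original density f_X on `grid`
    from perturbed values w_i = x_i + y_i, given the noise pdf f_Y."""
    fx = np.full(len(grid), 1.0 / len(grid))       # step 1: uniform start
    dz = grid[1] - grid[0]
    for _ in range(n_iters):                       # step 3: repeat
        new_fx = np.zeros_like(fx)
        for wi in w:
            weights = noise_pdf(wi - grid) * fx    # f_Y(w_i - a) * f_X^j(a)
            denom = weights.sum() * dz             # integral of f_Y(w_i - z) f_X^j(z) dz
            if denom > 0:
                new_fx += weights / denom
        new_fx /= len(w)
        new_fx /= new_fx.sum() * dz                # keep it a proper density
        if np.abs(new_fx - fx).max() < tol:        # step 4: successive estimates close
            fx = new_fx
            break
        fx = new_fx
    return fx

# Example: x ~ Normal(5, 1), perturbed with Uniform(-1, 1) noise.
rng = np.random.default_rng(0)
x = rng.normal(5.0, 1.0, size=1000)
w = x + rng.uniform(-1.0, 1.0, size=x.size)
grid = np.linspace(0.0, 10.0, 200)
uniform_pdf = lambda d: np.where(np.abs(d) <= 1.0, 0.5, 0.0)
fx_hat = reconstruct_distribution(w, uniform_pdf, grid, n_iters=50)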
19
Stopping Criterion

- Does the observed randomized distribution match the result of randomizing the current estimate of the original distribution?
- The difference between successive estimates of the original distribution is very small.
20
Reconstruction effect
21
22
Decision-Tree Classification


Two stages:
(1) Growth
(2) Prune
Example:
23
Tree-growth phase algorithm
Partition(Data S)
begin
    if (most points in S are of the same class) then
        return;
    for each attribute A do
        evaluate splits on attribute A;
    Use the best split to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
end
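A runnable Python sketch of this growth phase, assuming numeric attributes and gini-based splits as described on the following slides; the node representation, stopping size, and toy data are illustrative assumptions, not taken from the slides.

import numpy as np

def gini(y):
    """gini(S) = 1 - sum_j p_j^2; needs only the class-value distribution."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Evaluate candidate splits on every attribute and return the one
    that minimizes the weighted gini index of the two subsets."""
    best = None
    for a in range(X.shape[1]):
        for t in np.unique(X[:, a])[:-1]:          # candidate thresholds
            left = X[:, a] <= t
            g = (left.mean() * gini(y[left])
                 + (1 - left.mean()) * gini(y[~left]))
            if best is None or g < best[0]:
                best = (g, a, t)
    return best                                    # (gini_split, attribute, threshold)

def partition(X, y, min_size=5):
    """Recursive tree-growth phase (pruning is not shown)."""
    if len(np.unique(y)) == 1 or len(y) <= min_size:
        return {"leaf": True, "class": int(np.bincount(y).argmax())}
    split = best_split(X, y)
    if split is None:                              # no usable split left
        return {"leaf": True, "class": int(np.bincount(y).argmax())}
    _, a, t = split
    left = X[:, a] <= t
    return {"leaf": False, "attr": a, "thresh": t,
            "left": partition(X[left], y[left], min_size),
            "right": partition(X[~left], y[~left], min_size)}

# Example usage with toy numeric data (e.g., age and income) and a binary class.
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 100).astype(int)
tree = partition(X, y)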
24
Choose the best split


- Information gain (categorical attributes)
- Gini index (continuous attributes)
25
Gini index calculation


- gini(S), where pj is the relative frequency of class j in S (see the formula below)
- If a split divides S into two subsets S1 and S2, use gini_split(S) (see the formula below)
- Note: calculating this index requires only the distribution of the class values.
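The two formulas referenced here are not shown in the transcript; the standard definitions consistent with the text are:

gini(S) = 1 - \sum_{j} p_j^{2},
\qquad
gini_{split}(S) = \frac{|S_1|}{|S|}\, gini(S_1) + \frac{|S_2|}{|S|}\, gini(S_2)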
26
27
When & how the original distributions are reconstructed
- Global: Reconstruct the distribution for each attribute once, then build the decision-tree classifier.
- ByClass: For each attribute, first split the training data by class, then reconstruct the distributions separately, then build the decision-tree classifier.
- Local: The same as ByClass, except that instead of doing the reconstruction only once, reconstruction is done at each node.
28
Example (ByClass and Local)
29
Comparing the three algorithms
Algorithm   Execution time    Accuracy
Global      Cheapest          Worst
ByClass     Middle            Middle
Local       Most expensive    Best
30
Presentation Organization
- Associate Data Mining with Privacy
- Privacy Preserving Data Mining scheme using random perturbation
- Privacy Preserving Data Mining using Randomized Response Techniques
- Comparing these two cases
- Are there any other classification methods available?
31
Privacy Preserving Data Mining using Randomized Response Techniques
- Randomized Response
- Building Decision-Tree (Key: Information Gain calculation)
- Experimental results
32
Randomized Response
- A survey contains a sensitive attribute A.
- Instead of asking whether the respondent has attribute A, ask two related questions whose answers are opposite to each other ("has A" vs. "does not have A").
- The respondent uses a randomizing device to decide which question to answer. The device is designed in such a way that the probability of choosing the first question is θ.
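A small Python simulation of this protocol (my own illustration, with an arbitrary true proportion and θ): each respondent answers the first question with probability θ and the opposite question otherwise; the last two lines use the estimator derived on the next slide.

import numpy as np

def randomized_response(has_A, theta, rng):
    """Each respondent answers question 1 ("Do you have A?") with probability
    theta, and otherwise answers question 2 ("Do you NOT have A?")."""
    answers_q1 = rng.random(has_A.shape) < theta
    # Truthfully answering question 2 means reporting the opposite of has_A.
    return np.where(answers_q1, has_A, ~has_A)

rng = np.random.default_rng(42)
theta = 0.7
true_has_A = rng.random(100_000) < 0.3      # assumed true proportion: 30%
responses = randomized_response(true_has_A, theta, rng)

p_yes_observed = responses.mean()           # P'(A=yes), observed in the survey
p_yes_estimated = (p_yes_observed - (1 - theta)) / (2 * theta - 1)
print(p_yes_observed, p_yes_estimated)      # ~0.42 observed, ~0.30 estimated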
33

To estimate the percentage of people who have attribute A, we can use:
    P'(A=yes) = P(A=yes) * θ + P(A=no) * (1 - θ)
    P'(A=no)  = P(A=no) * θ + P(A=yes) * (1 - θ)
where
    P'(A=yes): the proportion of "yes" responses observed in the survey data
    P(A=yes):  the estimated true proportion of people who have A
Our goal: P(A=yes) and P(A=no).
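Since P(A=yes) + P(A=no) = 1, the two equations can be solved for the goal quantities. The closed form below is a reconstruction (not shown on the slide) and requires θ ≠ 0.5:

P(A=yes) = \frac{P'(A=yes) - (1 - \theta)}{2\theta - 1},
\qquad
P(A=no) = 1 - P(A=yes)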
34
Example:
Sensitive attribute: Married?
Two questions:
1) A?           Yes / No
2) Ā? (not A)   No / Yes
35
Decision-Tree (Key: Info Gain)
m: the number of classes (m classes assumed)
Qj: the relative frequency of class j in S
v: any possible value of attribute A
Sv: the subset of S for which attribute A has value v
|Sv|: the number of elements in Sv
|S|: the number of elements in S
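The Entropy and Gain formulas themselves are not reproduced in the transcript; with the notation above, the standard definitions they refer to are:

Entropy(S) = -\sum_{j=1}^{m} Q_j \log_2 Q_j,
\qquad
Gain(S, A) = Entropy(S) - \sum_{v} \frac{|S_v|}{|S|}\, Entropy(S_v)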
36
P(E): the proportion of the records in the undisguised (original) data set that satisfy E = true
P*(E): the proportion of the records in the disguised data set that satisfy E = true
Assume the class label is binary. Then Entropy(S) can be calculated; |Sv| is computed in the same way, and finally we get Gain(S, A). A sketch of the estimation step follows below.
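A minimal sketch of the estimation step described here, assuming (purely for illustration) that the event "class = yes" is disguised in the same way as the single attribute on slide 34; the function names and numbers are hypothetical. Estimating |Sv| and Gain(S, A) proceeds analogously from the corresponding conditional proportions.

import math

def estimate_p(p_star, theta):
    """Estimate a true proportion P(E) from its disguised counterpart P*(E),
    assuming P*(E) = P(E)*theta + (1 - P(E))*(1 - theta) and theta != 0.5."""
    return (p_star - (1 - theta)) / (2 * theta - 1)

def binary_entropy(p_yes):
    """Entropy(S) for a binary class label with P(class = yes) = p_yes."""
    if p_yes <= 0.0 or p_yes >= 1.0:
        return 0.0
    return -(p_yes * math.log2(p_yes) + (1 - p_yes) * math.log2(1 - p_yes))

# Hypothetical numbers: theta = 0.8 and 38% "yes" labels in the disguised data
# correspond to an estimated 30% in the original data.
theta = 0.8
p_yes = estimate_p(0.38, theta)      # -> 0.30
print(binary_entropy(p_yes))         # estimated Entropy(S), about 0.881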
37
Experimental results
38
Comparing these two cases
                            Perturbation               Randomized Response
Attribute                   Continuous                 Categorical
Privacy-preserving method   Value distortion           Randomized response
Choose attribute to split   Gini index                 Information gain
Inverse procedure           Reconstruct distribution   Estimate P(E) from P*(E)
39
Future work
- Solve categorical problems with the first scheme
- Solve continuous problems with the second scheme
- Combine the two schemes to solve some problems
- Find other suitable classification methods
40