Multiplicative Data Perturbations
Outline
 Introduction
 Multiplicative data perturbations
 Rotation perturbation
 Geometric Data Perturbation
 Random projection
 Understanding
 Distance preservation
 Perturbation-invariant models
 Attacks
 Privacy Evaluation Model
 Background knowledge and attack analysis
 Attack-resilient optimization
 Comparison
Summary on additive perturbations
 Problems
 Weak against various attacks
 Need to publish the noise distribution
 The column distributions are known
 Need to develop/revise data mining algorithms in order to utilize perturbed data
 So far, we have only seen that decision trees and the naïve Bayes classifier can utilize additive perturbation.
 Benefits
 Can be applied to both the Web model and the corporate model
 Low cost
More thoughts about perturbation
1. Preserve privacy
 Hide the original data: it should not be easy to estimate the original values from the perturbed data
 Protect against data reconstruction techniques: the attacker may have prior knowledge about the published data
2. Preserve data utility for tasks
 Single-dimensional info: column data distribution, etc.
 Multi-dimensional info: covariance matrix, distances, etc.
For most PP approaches…
(Figure: a seesaw balancing "privacy guarantee" against "data utility / model accuracy".)
• Difficult to balance the two factors
• Subject to attacks
• May need new DM algorithms: randomization, cryptographic approaches
Multiplicative perturbations
 Geometric data perturbation (GDP)
 Rotation data perturbation
 Translation data perturbation
 Noise addition
 Random projection perturbation (RPP)
Definition of Geometric Data Perturbation
 G(X) = R*X + T + D
 R: random rotation
 T: random translation
 D: random noise, e.g., Gaussian noise
Characteristics: R and T preserve distances exactly; D slightly perturbs distances.
Example:

Perturbed data G(X)                 Original data X
ID    age    rent    tax            ID    age    rent   tax
001   1158   3143   -2919           001   30     1350   4230
002    943   2424   -2297           002   25     1000   3320

with each record treated as a column vector (age, rent, tax):

         [ .83  -.40   .40 ]        [ 12 ]   [ -.4    .29 ]
G(X)  =  [ .2    .86   .46 ] * X +  [ 18 ] + [ -1.7  -1.1 ]
         [ .53   .30  -.79 ]        [ 30 ]   [  .13   1.2 ]

(The translation vector T is applied to every record; the noise matrix D has one column per record.)
* Each component plays its own role in enhancing resilience to attacks! (A code sketch follows.)
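To make the definition concrete, here is a minimal numpy sketch of G(X) = R*X + T + D. The QR-based construction of R, the translation range, and the noise level sigma are illustrative assumptions for this sketch, not specifics from the talk.

```python
# Minimal sketch of geometric data perturbation G(X) = R*X + T + D.
# X is laid out as in the slide's equation: one row per attribute,
# one column per record.
import numpy as np

rng = np.random.default_rng(0)

def geometric_perturbation(X, sigma=0.1):
    """Perturb an (m attributes x n records) matrix X."""
    m, n = X.shape
    # Random rotation: orthonormal matrix from the QR decomposition
    # of a random Gaussian matrix.
    R, _ = np.linalg.qr(rng.normal(size=(m, m)))
    # Random translation: one offset per attribute, applied to every record.
    T = rng.uniform(-50, 50, size=(m, 1))
    # Random noise: small Gaussian perturbation per cell.
    D = rng.normal(0.0, sigma, size=(m, n))
    return R @ X + T + D, (R, T, D)

# Two records with attributes (age, rent, tax), as in the example above.
X = np.array([[30.0, 25.0], [1350.0, 1000.0], [4230.0, 3320.0]])
GX, _ = geometric_perturbation(X)
print(GX)
```

Any orthonormal R works here because orthonormal transforms preserve pairwise Euclidean distances; D is the only component that distorts them.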
Benefits of Geometric Data Perturbation
(Figure: privacy guarantee decoupled from data utility/model accuracy.)
Makes optimization and balancing easier!
- Almost fully preserves model accuracy, so we optimize privacy only
Applicable to many DM algorithms
- Distance-based clustering
- Classification: linear, kNN, kernel, SVM, …
Resilient to attacks
- The result of attack research
Definition of Random Projection Perturbation
 F(X) = P*X
 X is an m*n matrix: m rows (dimensions) and n columns (records)
 P is a k*m random matrix, k <= m
 Johnson-Lindenstrauss Lemma
There is a random projection F() and a small number e < 1 such that
(1-e)||x-y|| <= ||F(x)-F(y)|| <= (1+e)||x-y||
i.e., distances are approximately preserved.
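A minimal sketch of RPP with a quick distance check. The Gaussian entries scaled by 1/sqrt(k) are one common construction that preserves squared distances in expectation; the talk does not specify P's distribution, so this is an assumption.

```python
# Sketch of random projection perturbation F(X) = P*X and a quick check
# of approximate distance preservation (Johnson-Lindenstrauss flavour).
import numpy as np

rng = np.random.default_rng(1)

def random_projection(X, k):
    """Project an (m x n) data matrix down to k rows, k <= m."""
    m, _ = X.shape
    # Entries N(0, 1/k) make E||P x||^2 = ||x||^2.
    P = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, m))
    return P @ X

m, n, k = 50, 200, 20
X = rng.normal(size=(m, n))
FX = random_projection(X, k)

# Compare pairwise distances before and after projection for a few pairs.
for i, j in [(0, 1), (2, 3), (4, 5)]:
    d_orig = np.linalg.norm(X[:, i] - X[:, j])
    d_proj = np.linalg.norm(FX[:, i] - FX[:, j])
    print(f"pair ({i},{j}): original {d_orig:.3f}, projected {d_proj:.3f}")
```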
Comparison between GDP and RPP
 Privacy preservation
 Both are subject to similar kinds of attacks
 RPP is more resilient to distance-based attacks
 Utility preservation (model accuracy)
 GDP preserves distances well
 RPP only approximately preserves distances
 model accuracy is not guaranteed
Illustration of multiplicative data perturbation
Preserving distances while perturbing each individual dimension
A Model “invariant” to GDP …
 If distance plays an important role
 Class/cluster membership and decision boundaries are determined by distances, not by concrete locations
2D example (figure): two classes separated by a classification boundary; after rotation and translation, the boundary moves rigidly with the data; after distance perturbation (noise addition), the boundary is only slightly changed.
Applicable DM algorithms
 Modeling methods that depend on Euclidean geometric properties
 Models “invariant” to GDP (a sketch follows below)
 All Euclidean distance based clustering algorithms
 Classification algorithms
 k-nearest neighbors
 Kernel methods
 Linear classifiers
 Support vector machines
 Most regression models
 And potentially more …
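As a quick check of this invariance, here is a sketch using scikit-learn's kNN (the dataset and parameters are illustrative): a classifier trained on rotated-and-translated data should reproduce the original predictions, up to floating-point ties, because all pairwise distances are unchanged.

```python
# Sketch: a distance-based model (kNN) is "invariant" to rotation +
# translation, since these preserve Euclidean distances exactly.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Random rotation + translation (records are rows here, so apply on the right).
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
t = rng.uniform(-10, 10, size=5)
X_pert = X @ Q.T + t

knn_orig = KNeighborsClassifier(n_neighbors=5).fit(X, y)
knn_pert = KNeighborsClassifier(n_neighbors=5).fit(X_pert, y)

# Predictions agree because all pairwise distances are unchanged.
print(np.array_equal(knn_orig.predict(X), knn_pert.predict(X_pert)))
```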
When to Use Multiplicative Data Perturbation
(Figure: the data owner computes G(X) = RX + T + D and publishes G(X); the service provider/data user mines F(G(X), ) to obtain models/patterns, and applies F to G(Xnew).)
Good for the corporate model or dataset publishing.
Major issue!! Curious service providers/data users may try to break G(X).
Major issue: attacks!!
 Many existing PP methods are found to be not so effective once attacks are considered
 Ex: various data reconstruction algorithms against the random noise addition approach [Huang05][Guo06]
 Prior knowledge
 Service provider Y has “PRIOR KNOWLEDGE” about X’s domain, and nothing stops Y from using it to infer information from the sanitized data
Knowledge used to attack GDP
 Three levels of knowledge
 Know nothing  naïve estimation
 Know column distributions  Independent Component Analysis (ICA)
 Know specific points (original points and their images in the perturbed data)  distance inference
Methodology of attack analysis
 An attack is an estimate of the original data
 Original O(x1, x2, …, xn) vs. estimate P(x’1, x’2, …, x’n)
 How similar are these two series?
 One effective method is to evaluate the variance/standard deviation of the difference [Rakesh00]: Var(P–O) or std(P–O), P: estimated, O: original
Two multi-column privacy metrics (a sketch follows below)
 qi: privacy guarantee for column i
 qi = std(Pi–Oi), Oi: normalized original column values, Pi: estimated column values
 Min privacy guarantee: the weakest link of all columns = min { qi, i=1..d }
 Avg privacy guarantee: overall privacy guarantee = (1/d) Σ qi
Attack 1: naïve estimation
 Estimate original points purely based on the perturbed data
 If using “random rotation” only
 Intensity of perturbation matters
 Points around the origin are barely perturbed (a numeric check follows below)
(Figure: two classes in the X–Y plane with a classification boundary; points near the origin move little under rotation.)
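A quick numeric illustration of the origin weakness under pure rotation: rotation preserves norms, so ||x - Rx|| <= 2||x||, and a point near the origin barely moves; its perturbed value is already a good estimate of the original. The numbers below are made up for illustration.

```python
# Sketch: why points near the origin are weak spots under pure rotation.
import numpy as np

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # a random rotation

near = np.array([0.01, -0.02, 0.015])   # point near the origin
far = np.array([5.0, -3.0, 8.0])        # point far from the origin

for x in (near, far):
    x_rot = Q @ x
    # Error if the attacker simply takes the perturbed value as the estimate.
    print(f"||x||={np.linalg.norm(x):.3f}  "
          f"||x - x'||={np.linalg.norm(x - x_rot):.3f}")
```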
Counter naïve estimation
 Maximize intensity
 Based on a formal analysis of “rotation intensity”
 Method to maximize intensity: the Fast_Opt algorithm in GDP
 “Random translation” T
 Hides the origin
 Increases the difficulty of attacking: the attacker needs to estimate R first in order to find T
Attack 2: ICA-based attacks (a sketch follows below)
 Independent Component Analysis (ICA)
 Tries to separate R and X from Y = R*X
Characteristics of ICA
1. Ordering of dimensions is not preserved
2. Intensity (value range) is not preserved
Conditions for an effective ICA attack
1. Knowing the column distributions
2. Knowing the value ranges
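A sketch of the attack using scikit-learn's FastICA on Y = R*X, with independent non-Gaussian columns (Laplace sources, chosen here because ICA requires non-Gaussianity). The recovered rows come back in arbitrary order and scale, which is exactly why the attacker needs the column distributions and value ranges to match them to original attributes.

```python
# Sketch of an ICA-based attack on Y = R @ X.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(8)
d, n = 3, 2000

X = rng.laplace(size=(d, n))                   # independent sources
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # secret rotation
Y = R @ X                                      # published perturbed data

ica = FastICA(n_components=d, random_state=0)
X_hat = ica.fit_transform(Y.T).T               # recovered sources (d x n)

# Correlate recovered rows against original rows: large absolute values
# indicate recovery, but row order and scale are arbitrary.
C = np.corrcoef(np.vstack([X, X_hat]))[:d, d:]
print(np.round(np.abs(C), 2))
```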
Counter ICA attack
 Weaknesses of the ICA attack
 Needs a certain amount of knowledge
 Cannot effectively handle dependent columns
 In reality…
 Most datasets have correlated columns
 We can find an optimal rotation perturbation maximizing the difficulty of ICA attacks
Attack 3: distance-inference attack
With only rotation/translation perturbation, the attacker may know a set of original points and their mapping…
(Figure: a known original point and its image in the perturbed data.)
How is the Attack done …
 Knowing points and their images …
 Find the exact images of the known points
 Enumerate pairs by matched distances (less effective for large data)
 Here we assume the pairs are successfully identified
 Estimation (a sketch follows below)
1. Cancel the random translation T from the pairs (x, x’)
2. Calculate R from the pairs: Y = RX  R = Y*X^-1
3. Calculate T with R and the known pairs
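A sketch of these three steps in numpy for the noise-free case G(x) = Rx + T, assuming d+1 known (original, image) pairs in d dimensions; the translation is canceled by differencing against one reference pair.

```python
# Sketch of the distance-inference estimation for G(x) = R x + T (no noise).
import numpy as np

rng = np.random.default_rng(5)
d = 3
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # secret rotation
T = rng.uniform(-10, 10, size=(d, 1))          # secret translation

X = rng.normal(size=(d, d + 1))                # d+1 known original points
Y = R @ X + T                                  # their known images

# Step 1: cancel T by differencing against the last pair.
Xd = X[:, :d] - X[:, [d]]
Yd = Y[:, :d] - Y[:, [d]]
# Step 2: solve R from Yd = R @ Xd (Xd is generically invertible).
R_hat = Yd @ np.linalg.inv(Xd)
# Step 3: recover T using the estimated R.
T_hat = Y[:, [d]] - R_hat @ X[:, [d]]

print(np.allclose(R_hat, R), np.allclose(T_hat, T))
```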
Counter distance-inference: Noise addition
 Noise brings enough variance into the estimation of R and T
 Now the attacker has to use regression to estimate R
 Then use the approximate R to estimate T  increased uncertainty
 Can the noise be easily filtered?
 Filtering would require knowing the noise distribution
 and the distribution of RX + T
 Neither distribution is published, however
 Note: this is very different from the attacks on noise-addition data perturbation [Kargupta03, Huang05]
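A companion sketch showing why the noise component D helps: the exact solve above no longer applies, and the attacker must fall back on least-squares regression, which only approximates R and T. The noise level and pair count are illustrative.

```python
# Sketch: with noise D added, the attacker resorts to least squares.
import numpy as np

rng = np.random.default_rng(6)
d, n_pairs, sigma = 3, 20, 0.1
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
T = rng.uniform(-10, 10, size=(d, 1))

X = rng.normal(size=(d, n_pairs))
Y = R @ X + T + rng.normal(0.0, sigma, size=(d, n_pairs))   # noisy images

# Difference out T by centering, then estimate R by regression.
Xd = X - X.mean(axis=1, keepdims=True)
Yd = Y - Y.mean(axis=1, keepdims=True)
R_hat = np.linalg.lstsq(Xd.T, Yd.T, rcond=None)[0].T
T_hat = (Y - R_hat @ X).mean(axis=1, keepdims=True)

print("error in R:", np.linalg.norm(R_hat - R))
print("error in T:", np.linalg.norm(T_hat - T))
```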
Attackers with more knowledge?
 What if attackers know a large number of original records?
 They can accurately estimate the covariance matrix, column distributions, column ranges, etc., of the original data
 Methods such as PCA and AK_ICA can then be used
 What do we do?
 You have already released so much original information…
 Stop releasing any kind of data anymore 
A randomized perturbation optimization algorithm
 Start with a random rotation
 Goal: pass tests on simulated attacks
 Not simply random – a hill-climbing method (a sketch follows below)
1. Iteratively determine R
- Test on naïve estimation (Fast_Opt)
- Test on ICA (2nd level)
 find a better rotation R
2. Append a random translation component
3. Append an appropriate noise component
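A minimal sketch of the hill-climbing loop, assuming numpy. The attack_resilience() scorer below is a placeholder standing in for the talk's real tests (naïve estimation via Fast_Opt and the ICA test); local moves are small rotations built with a Cayley transform.

```python
# Sketch of randomized hill climbing over rotation matrices.
import numpy as np

rng = np.random.default_rng(7)

def small_rotation(d, eps=0.05):
    # Cayley transform of a small skew-symmetric matrix: an orthogonal
    # matrix close to the identity, used as a local move.
    S = eps * rng.normal(size=(d, d))
    A = S - S.T
    I = np.eye(d)
    return (I - A) @ np.linalg.inv(I + A)

def attack_resilience(R, X):
    """Placeholder score: std of the naive estimate's error.
    A real implementation would also score simulated ICA attacks."""
    return np.std(R @ X - X)

def optimize_rotation(X, iters=500):
    d = X.shape[0]
    R_best, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random start
    best = attack_resilience(R_best, X)
    for _ in range(iters):
        R_new = small_rotation(d) @ R_best              # local move
        score = attack_resilience(R_new, X)
        if score > best:                                # keep improvements only
            R_best, best = R_new, score
    return R_best, best

X = rng.normal(size=(4, 100))
R, score = optimize_rotation(X)
print("best resilience score:", round(score, 3))
```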
Comparison on methods
 Privacy preservation
 In general, RPP should be better than GDP
 Evaluate the effect of attacks on GDP
 ICA and distance perturbation need experimental evaluation
 Utility preservation
 GDP:
 R and T exactly preserve distances
 The effect of D needs experimental evaluation
 RPP:
 # of perturbed dimensions vs. utility
 Datasets
 12 datasets from the UCI Data Repository
Privacy guarantee: GDP
 In terms of naïve estimation and ICA-based attacks
 Uses only the random rotation and translation components (R*X + T)
(Figure: privacy guarantees for three settings: perturbation optimized for naïve estimation only, perturbation optimized for both attacks, and the worst perturbation with no optimization.)
Privacy guarantee: GDP
 In terms of distance-inference attacks
 Uses all three components (R*X + T + D)
 Noise D: Gaussian N(0, σ²)
 Assume pairs of (original, image) points are successfully identified by attackers
 With no noise addition, the privacy guarantee is 0
(Figure: considerably high privacy guarantee even at a small perturbation, σ = 0.1.)
Data utility: GDP with noise addition
 Noise addition vs. model accuracy; noise: N(0, 0.1²)
(Figure: Boolean data is more sensitive to distance perturbation.)
Data Utility: RPP
 Reduced # of dimensions vs. model accuracy (a sketch of this evaluation follows)
(Figure: accuracy of kNN classifiers, SVMs, and perceptrons as the number of projected dimensions is reduced.)
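A minimal sketch of this kind of evaluation, assuming scikit-learn. The dataset (sklearn's copy of the UCI breast cancer data) and the dimension choices are illustrative stand-ins for the talk's 12 UCI datasets; no particular accuracy numbers are implied.

```python
# Sketch: vary the number of projected dimensions k and measure
# classifier accuracy on RPP-perturbed data.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(9)
X, y = load_breast_cancer(return_X_y=True)   # one convenient UCI dataset
m = X.shape[1]

for k in (5, 10, 20, m):
    P = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, m))
    X_proj = X @ P.T                          # records are rows here
    acc = cross_val_score(KNeighborsClassifier(), X_proj, y, cv=5).mean()
    print(f"k={k:2d}  kNN accuracy={acc:.3f}")
```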