multiplicative
Download
Report
Transcript multiplicative
Multiplicative Data Perturbations
Outline
Introduction
Multiplicative data perturbations
Rotation perturbation
Geometric Data Perturbation
Random projection
Understanding
Distance preservation
Perturbation-invariant models
Attacks
Privacy Evaluation Model
Background knowledge and attack analysis
Attack-resilient optimization
Comparison
Summary on additive perturbations
problems
Weak to various attacks
Need to publish noise distribution
The column distribution is known
Need to develop/revise data mining algorithms in order to
utilize perturbed data
So far, we have only seen that decision tree and naïve bayes
classifier can utilize additive perturbation.
Benefits
Can be applied to both the Web model and the corporate
model
Low cost
More thoughts about
perturbation
1. Preserve Privacy
Hide the original data
not easy to estimate
the original values
from the perturbed
data
Protect from data
reconstruction
techniques
The attacker has prior
knowledge on the
published data
2. Preserve Data Utility for
Tasks
Single-dimensional info
column data
distribution, etc.
Multi-dimensional
info
Cov matrix, distance,
etc
For most PP approaches…
Privacy
guarantee
?
Data Utility/
Model accuracy
Privacy guarantee
Data utility/
Model accuracy
•Difficult to balance the two factors
•Subject to attacks
•May need new DM algorithms: randomization, cryptographic approaches
Multiplicative perturbations
Geometric data perturbation (GDP)
Rotation data perturbation
Translation data perturbation
Noise addition
Random projection perturbation(RPP)
Definition of Geometric Data
Perturbation
G(X) = R*X + T + D
R: random rotation
T: random translation
D: random noise, e.g., Gaussian noise
Characteristics:
R&T preserving distance, D slightly perturbing distance
Example:
ID
001
002
age
1158
943
rent
3143
2424
tax
-2919
-2297
=
.83
-.40
.40
.2
.86
.46
.53
.30
-.79
*
ID
001
002
age
30
25
rent
1350
1000
tax
4230
3320
+
12
12
18
18
30
30
+
-.4
.29
-1.7
-1.1
.13
1.2
* Each component has its use to enhance the resilience to attacks!
Benefits of Geometric Data
Perturbation
Privacy
guarantee
decoupled
Data Utility/
Model accuracy
Make optimization and balancing easier!
-Almost fully preserving model accuracy - we optimize privacy only
Applicable to many DM algorithms
-Distance-based Clustering
-Classification: linear, KNN, Kernel, SVM,…
Resilient to Attacks
-the result of attack research
Definition of Random Projection
Perturbation
F(X) = P*X
X is m*n matrix: m columns and n rows
P is a k*m random matrix, k <= m
Johnson-Lindenstrauss Lemma
There is a random projection F() with
e is a small number <1, so that
(1-e)||x-y||<=||F(x)-F(y)||<=(1+e)||x-y||
i.e. distance is approximately preserved.
Comparison between GDP and
RPP
Privacy preservation
Subject to similar kinds of attacks
RPP is more resilience to distance-based
attacks
Utility preservation(model accuracy)
GDP preserves distances well
RPP approximately preserves distances
Model accuracy is not guaranteed
Illustration of multiplicative
data perturbation
Preserving distances
while perturbing each
individual dimensions
A Model “invariant” to GDP …
If distance plays an important role
Class/cluster members and decision boundaries are
correlated in terms of distance, not the concrete locations
2D Example:
Class 1
Class
2
Classification boundary
Rotation and
translation
Classification
boundary
Distance perturbation
(Noise addition)
Slightly
changed
Classification
boundary
Applicable DM algorithms
Modeling methods that depend on Euclidean
geometric properties
Models “invariant” to GDP
all Euclidean distance based clustering algorithms
Classification algorithms
K Nearest Neighbors
Kernel methods
Linear classifier
Support vector machines
Most regression models
And potentially more …
When to Use Multiplicative Data
Perturbation
Service Provider/data user
Data Owner
G(X)=RX+T+D
Apply F to G(Xnew)
G(X)
F(G(X), )
Mined models/patterns
Good for the corporate model or dataset publishing.
Major issue!! curious service providers/data users try to break G(X)
Major issue: attacks!!
Many existing PP methods are found not so
effective
when attacks are considered
Ex: various data reconstruction algorithms to the
random noise addition approach [Huang05][Guo06]
Prior knowledge
Service provider Y has “PRIOR KNOWLEDGE”
about X’s domain and nothing stops Y from using it
to infer information in the sanitized data
Knowledge used to attack GDP
Three levels of knowledge
Know nothing naïve estimation
Know column distributions Independent
Component Analysis
Know specific points (original points and their
images in perturbed data) distance inference
Methodology of attack analysis
An attack is an estimate of the original
data
Original O(x1, x2,…, xn) vs. estimate P(x’1,
x’2,…, x’n)
How similar are these two series?
One of the effective methods is to evaluate the
variance/standard deviation of the difference [Rakesh00]
Var (P–O) or std(P-O), P: estimated, O: original
Two multi-column privacy metrics
qi : privacy guarantee for column i
qi = std(Pi–Oi), Oi normalized column values,
Pi estimated column values
Min privacy guarantee: the weakest link of all columns
min { qi, i=1..d}
Avg privacy guarantee: overall privacy guarantee
1/d qi
Attack 1: naïve estimation
Estimate original points purely based
on the perturbed data
If using “random rotation” only
Intensity of perturbation matters
Points around origin
Y
Class 1
Class
2
Classification boundary
X
Counter naïve estimation
Maximize intensity
Based on formal analysis of “rotation intensity”
Method to maximize intensity
Fast_Opt algorithm in GDP
“Random translation” T
Hide origin
Increase difficulty of attacking!
Need to estimate R first, in order to find out T
Attack 2: ICA based attacks
Independent Component Analysis (ICA)
Try to separate R and X from Y= R*X
Characteristics of ICA
1. Ordering of dimensions is not preserved.
2. Intensity (value range) is not preserved
Conditions of effective ICA-attack
1. Knowing column distribution
2. Knowing value range.
Counter ICA attack
Weakness of ICA attack
Need certain amount of knowledge
Cannot effectively handle dependent
columns
In reality…
Most datasets have correlated columns
We can find optimal rotation perturbation
maximizing the difficulty of ICA attacks
Attack 3: distance-inference attack
If with only rotation/translation perturbation, when the attacker knows
a set of original points and their mapping…
image
Known
point
Original
Perturbed
How is the Attack done …
Knowing points and their images …
find exact images of the known points
Enumerate pairs by matched distances
… Less effective for large data …
we assume pairs are successfully identified
Estimation
1. Cancel random translation T from pairs (x, x’)
2. calculate R with pairs: Y=RX R = Y*X-1
3. calculate T with R and known pairs
Counter distance-inference:
Noise addition
Noise brings enough variance in estimation of
R and T
Now the attacker has to use regression to estimate R
Then, use approximate R to estimate T increase uncertainty
Can the noise be easily filtered?
Need to know noise distribution,
Need to know distribution of RX + T,
Both distributions are not published, however.
Note: It is very different from the attacks to noise addition
data perturbation [Kargupta03, Huang05]
Attackers with more knowledge?
What if attackers know large amount of
original records?
Able to accurately estimate covariance matrix,
column distribution, and column range, etc., of
the original data
Methods PCA, AK_ICA, …,etc can be used
What do we do?
You have released so much original information…
Stop releasing any kind of data anymore
A randomized perturbation
optimization algorithm
Start with a random rotation
Goal: passing tests on simulated attacks
Not simply random – a hillclimbing
method
1. Iteratively determine R
- Test on naïve estimation (Fast_opt)
- Test on ICA (2nd level)
find a better rotation R
2. Append a random translation component
3. Append an appropriate noise component
Comparison on methods
Privacy preservation
In general, RPP should be better than GDP
Evaluate the effect of attacks for GDP
ICA and distance perturbation need experimental
evaluation
Utility preservation
GDP:
R and T exactly preserve distances,
The effect of D needs experimental evaluation
RPP
# of perturbed dimensions vs. utility
Datasets
12 datasets from UCI Data Repository
Privacy guarantee:GDP
In terms of naïve estimation and ICA-based attacks
Use only the random rotation and translation (R*X+T)
components
Optimized for
Naïve estimation only
Optimized perturbation for both attacks
Worst perturbation (no optimization)
Privacy guarantee:GDP
In terms of distance inference attacks
Use all three components (R*X +T+D)
Noise D : Gaussian N(0, 2)
Assume pairs of (original, image) are identified by attackers
no noise addition, privacy guarantee =0
Considerably high PG
at small perturbation
=0.1
Data utility : GDP with noise
addition
Noise addition vs. model accuracy noise: N(0, 0.12)
Boolean data is
more sensitive to
distance perturbation
Data Utility: RPP
Reduced # of dims vs. model accuracy
KNN classifiers
SVMs
Perceptrons