Privacy-preserving Data Mining for the Internet of Things

Download Report

Transcript Privacy-preserving Data Mining for the Internet of Things

Privacy-preserving Data Mining for the
Internet of Things: State of the Art
Yee Wei Law (罗裔纬)
wsnlabs.com
Speaker’s brief bio
• Ph.D. from University of Twente for research on security of wireless
sensor networks (WSNs) in EU project EYES
• Research Fellowship on WSNs from The University of Melbourne
– ARC projects “Trustworthy sensor networks: theory and
implementation”, “BigNet”
– EU FP7 projects “SENSEI”, “SmartSantander”, “IoT-i”, “SocIoTal”
– IBES seed projects on participatory sensing, smart grids
– Taught Master’s course “Sensor Systems”
• Professional membership:
– Associate of (ISC)2 (junior CISSP)
– Smart Grid Australia Research Working Group
Current research interests:
Current research orientation:
• Privacy-preserving data mining
• Secure/resilient control
• Applications of above to the IoT
and smart grid
• Mixed basic/applied research in
data science or network science
• Research involving
probabilistic/statistical,
combinatorial, matrix analysis
Agenda
• The IoT and its research priorities
– Participatory sensing (PS)
– Collaborative learning (CL)
•
•
•
•
Introduction to privacy-preserving data mining
Schemes suitable for PS and CL
Research opportunities challenges
If time permits, SOCIOTAL
H. Sundmaeker et al.,
“Vision and Challenges for
Realising the Internet of
Things,” Cluster of European
Research Projects on the
Internet of Things, Mar.
2010.
A dynamic global network
infrastructure with selfconfiguring capabilities based on
standard and interoperable
communication protocols where
physical and virtual “things” have
identities, physical attributes, and
virtual personalities and use
intelligent interfaces, and are
seamlessly integrated into the
information network.
Evidence of the Internet of Things
Nissan EPORO robot cars
Smart grid
Research priorities
ITU-T: “Through the exploitation of identification, data
capture, processing and communication capabilities,
the IoT makes full use of things to offer services to all
kinds of applications, whilst maintaining the required
privacy.”
Smart transport
Smart grid
Smart water
Smart whatever
Among research priorities:
• Mathematical models and
algorithms for inventory
management, production
scheduling, and data mining
• Privacy aware data processing
ARPAnet
Shifting priorities
Machine-to-machine
communications
• We have enough tech to
hook things up, now we
should make find better
ways of capturing and
analyzing data.
• Introducing participatory
sensing and collaborative
learning...
Some graphics from Sabina Jeschke
Participatory sensing
A process whereby individuals and communities use evermore-capable mobile phones
and cloud services to collect and analyze systematic data for use in discovery.
Citizen-provided data
can improve
governance with
benefits including:
• Increased public
safety
• Increased social
inclusion and
awareness
• Increased resource
efficiency for
sustainable
communities
• Increased public
accountability
Source: Estrin et al.
Data sharing scenarios
Lindell and Pinkas [2000]: “privacy-preserving data mining” refers to privacy-preserving distributed data mining
Data sharing scenarios (cont’d)
• Collaborative learning: Multiple data owners collaboratively
analyze the union of their data with the involvement of a thirdparty data miner.
• Agrawal and Srikant [2000] coined the term “privacy-preserving
data mining” to refer to privacy-preserving collaborative learning.
• Encrypting data to data miner is inadequate, data should be
masked, at a balanced point between accuracy and privacy.
Privacy-preserving collaborative learning
Adversarial models
Privacy criterion
• Requirement imposed by
participatory sensing:
– online data submission,
offline data processing
• Design space:
– Data type:
• continuous or categorical
• voice, images, videos, etc.
Semantic
Syntactic
SMC
Randomization
Proposed
criterion
Differential
privacy
– Data structure:
• relational or time series
• for relational data: horizontal
or vertical partitioned
– Data mining operation
Linear
Additive
Nonlinear
Multiplicative
Adversarial models
Semi-honest (honest but curious)
• Passive attacker tries to
learn the private states of
other parties, without
deviating from protocol
• By definition, semi-honest
parties do not collude
Malicious
• Active attacker tries to learn
the private states of other
parties, and deviates
arbitrarily from protocol
• Common approach: Design in the semi-honest model,
enhance it for the malicious model
• General method: zero-knowledge proofs often not practical
• Semi-honest model often realistic enough
Syntactic privacy criteria
Semantic privacy criteria
• To prevent syntactic attacks,
e.g., table linkage:
• To minimize the difference
between adversarial prior
knowledge and adversarial
posterior knowledge about
individuals represented in the
database
• General enough for most data
types, relational or time series
• Example:
– Attacker has access to an
anonymous table and a
nonanonymous table, with the
anonymous table being a
subset of the nonanonymous
table
– Attacker can infer the presence
of its target’s record in the
anonymous table from the
target’s record in the
nonanonymous table
• Relevant for relational data,
not time series data
• Example:
– k-anonymity
– Cryptographic privacy
– Differential privacy
Cryptographic
privacy
Secure Multiparty
Computation
Differential
privacy
Randomization
Secure multiparty computation
x1
Output
f(x1,x2)
x2
Metaphor: Yao’s millionaire
problem [1982]
Garbled circuits for
arbitrary functions
[Beaver et al. 1990]
Building blocks:
Oblivious transfer
Sender
Receiver
chooses a value
Sender doesn’t
know which
n values
1-out-of-n oblivious transfer
Oblivious transfer
• Introduced by Rabin [1981]
• Killian [1988] showed oblivious
transfer is sufficient for secure twoparty computation
• Naor et al. [2001] reduce the
amortized overhead of oblivious
transfer to one exponentiation per a
log number of oblivious transfers
• Homomorphic encryption can be
used in the semi-honest model
Differential privacy
• In cryptography, semantic security: whatever is
computable about the cleartext given the ciphertext is
also efficiently computable without the ciphertext
• Useless for PPDM: A DB satisfying above has no utility
• Dwork [2006] proposed “differential privacy” for
statistical disclosure control: add noise to query results
Differential privacy (cont’d)
• Theoretical basis for answering “sum queries”
Row index
Row
• Sum queries can be used for histogram, mean, covariance,
correlation, SVD, PCA, k-means, decision tree, etc.
Differential privacy
Sensitivity
Laplace noise
Noisy sum
queries
Taxonomy of attacks against
randomization-based approaches
Known input/sample attack:
• The attacker has some input
samples and all output
samples but does not know
which input sample
corresponds to which output
sample
• Typically begins with
establishing correspondences
between the input samples
and the output samples
Da
ta
......
Re
sul
t
ta
Da
t
sul
Re
Data miner
Known input-output attack:
• The attacker has some input
samples and all output
samples, and knows which
input sample corresponds
to which output sample
Proposed privacy criterion:
The distance between f(X) and
estimated f(X) kept above a
specified threshold under
known attacks
Randomization
Randomized distortion or
perturbation of data
Randomization
Linear
Additive
perturbation
Time series
data
Nonlinear
• Spectral filtering attack by
Kargupta et al. [2003]
• PCA attack by Huang et al.
[2005]:
– Estimate covariance matrix
of original data
Multiplicative
perturbation
Relational data
• Additive perturbation:
adds noise data to data
• iid noise susceptible to:
– Find eigenvalues and
eigenvectors of covariance
matrix through PCA
–
eigenvectors of covar
• Bayesian estimation may
not have analytic form
Collaborative learning using additive perturbation
• Compared to multiplicative
perturbation, easier to
recover the source data
distribution fX(x) from the
perturbed data distribution
and noise distribution
• Against attacks: noise to be
correlated with data and
participant-specific
• PoolView [Ganti et al. 2008]
builds a model of the data,
then generate noise from
the model:
Attac
k
• With a common noise model,
a participant (i) can
reconstruct another
participant’s (j) data from the
perturbed data:
Solved through deconvolution
Estimated with kernel density estimation
Collaborative learning using additive perturbation
• Zhang et al. [2012]
Data-dependence
Participant-dependence
PDF reconstructed by data miner
based on PDF of y and noise
• Catches:
– The data miner has to
know the participants’
parameters—system not
resilient to collusion
– Data correlation
between participants
expose them to attacks
(recall the PCA-based
attack?)
Multiplicative perturbation
Multiplies data with noise
Rotation perturbation [Chen
et al. 2005]
• Noise matrix is an
orthogonal matrix with
orthonormal rows and
columns
• Giannella et al.’s [2013]
attack can estimate the
original data using all
perturbed data and a small
amount of original data
Input
x
Perturbation
Output
y
Attack stage 1
• Find maximally unique map
β that satisfies
Then we know which xi is
mapped to which yi
Attack stage 2
• Find that maximizes
Enhanced version: geometric
perturbation
Multiplicative projection: random projection
• Projection by Gaussian
random matrix
– Statistically orthogonal
– essentially a JohnsonLindenstrauss transform
Perturbed vectors
d dimension
k dimension
• Other JohnsonLindenstrauss transforms:
inter-point distances change by factor (1±ε)
as long as k≥O(ε-2logn)
• Attack against orthogonal
transform adaptable for
this?
Collaborative learning using multiplicative perturbation
Goal is to use a different perturbation matrix for a different participant
Liu et al. [2012]:
Data miner then
get an estimation
of Xu and Xv !
What about the
privacy criterion?
Learn in
approx an
inverse of
Ru and Rv
Nonlinear perturbation
Nonlinear + linear perturbation:
Nonlinear function
Random matrices
=tanh(x)
Extreme values (potential
outliers) are “squashed”
Normalized values
• Relies on linear
perturbation to achieve
projection
• Near-many-to-one
mapping provides
privacy property
• Many-to-one mapping
extended to the
“normal” part of the
curve?
Bayesian estimation attacks against
multiplication perturbation
• Solve underdetermined
system Y=RX for X
• Maximum a posteriori
estimation (why?)
• If R is known
• Gaussian original data
obviously simplifies the
attacker’s problem
• If R is not known
• Difficult optimization
problem, although
Gaussian data simplifies
the problem
• Choice of p(R) matters
Independent component analysis
against multiplicative perturbation
• Prerequisites for attacker
Perturbation matrix
treated as mixing
matrix
Blind source separation
m=n
m<n
m>n
Overcomplete/underdet
ermined ICA
Sparse
representation
Nonnegative
matrix
factorization
– independence
– at most one Gaussian
component
– sparseness (Laplace)
– m≥(n+1)/2
• Steps:
– estimate R
– estimate X
– resolve permutation and
scaling ambiguity
Research opportunities and challenges
Multiplicative perturbation
Nonlinear perturbation
Tools: Statistical analysis, Bayesian
analysis, matrix analysis, time series
analysis, optimization, signal processing
Data mining algorithms
Bayesian estimation
attacks
Participants’ data
ICA attacks
Perturbed data
• Commercial interest? • Challenging multidisciplinary problems
necessitate broad range of tools:
• Large design space:
– Scenario-dependent privacy criteria
effectiveness
– Defenses and attacks evolve side-by-side
depends as much on
the nature of data as
– Role of dimensionality reduction?
the data mining
– Steganography for “traitor tracing”?
algorithms
– Many more from syntactic privacy, SMC, etc.
• What is Big Data?
• Unsupervised learning of Big
Data, e.g., Deep Learning
• Vision: Business-centric
Internet of Things  Citizencentric Internet of Things
• Main non-technical aim:
Create trust and confidence in
Internet of Things systems,
Alice’s sensor network Alice’s friend Bob
while providing user-friendly
ways to contribute to and use
monitoring her house granted access to
Alice’s network while the system thus encouraging
creation of services of high
Alice’s on vacation
socio-economic value.
• Main technical aims:
Motivating use cases:
Sensor network monitoring community
microgrid feeding data to stakeholders
– Reliable and secure
communications
– Trustworthy data collection
– Privacy-preserving data
mining
Duration: Sep 2013- Aug
2016
Funding scheme: STREP
Total Cost: €3.69 m
EC Contribution: €2.81m
Contract Number:
CNECT-ICT- 609112
Conclusion
Source: Cisco IBSG, April 2011
Adversarial models
Privacy criterion
Semantic
Syntactic
SMC
Randomization
Proposed
criterion
Linear
Additive
Differential
privacy
Nonlinear
Multiplicative
• Looking back: 1970s
gives us statistical
disclosure control; 2000s
gives us PPDM
• Technological
development expands
design space, invites
multidisciplinary input
• Socio-economical
development plays
critical role
Syntactic privacy criteria/definitions
To prevent syntactic attacks:
• Table linkage:
– Attacker has access to an anonymous table and a nonanonymous table, with
the anonymous table being a subset of the nonanonymous table
– Attacker can infer the presence of its target’s record in the anonymous table
from the target’s record in the nonanonymous table
• Record linkage:
– Attacker has access to an anonymous table and a nonanonymous table, and
the knowledge that its target is represented in both tables
– Attacker can uniquely identify the target’s record in the anonymous table from
the target’s record in the nonanonymous table
• Attribute linkage:
– Attacker has access to an anonymous table, and the knowledge that its target
is represented in the table, the attacker can infer the value(s) of its target’s
sensitive attribute(s) from the group (e.g., 30-40 year-old females) the target
belongs to
Examples:
• k-anonymity