Transcript dummy
Secure Data Outsourcing
Outline
Motivation
Background
Research issues
Summary
Motivation
Cost of maintaining/mining large data
4-5 times of the cost of data acquisition
DBAs are paid well
More and more data service providers
Low cost – cloud computing
Maintain one database for one user multiple users
Examples:
Alentus.com
Datapipe.com
Discountasp.net
…
Concerns about data security and privacy
Untrusted service provider
Un-trusted service provider
Lazy: incentives to perform less
Curious: incentives to acquire
information
Malicious:
Denial of service
Incorrect results
Possibly compromised
Challenges
Data confidentiality
Data need to be encrypted (?)
Utility of protected data?
Query utility
Mining utility
Access pattern privacy
Integrity
Data integrity
Query integrity
Correct
Complete
Fresh
Why is it hard for query
services?
Arbitrary expressivity
SQL statements
Often, restricted for certain type of query
for simplicity (e.g. range query, knn
query)
Cost
Communication
Computation (server side vs client side)
Why it is hard for mining
services?
Many data mining models
Different utilities to preserve
No one-size-for-all solutions
Data confidentiality
Bucketization method (crypto-index)
Order preserving encryption
Perturbations
Bucketization method
Hacigumus (SIGMOD02)
Main steps
Partition sensitive attributes
Order preserving: supports comparison
Random: query rewriting becomes hard
Build index on the partitions
Rewrite queries to target partitions
‘john doe’ 105
Select * from T’ where name=105
Execute queries and return results
Prune/post-process results on client
Trade off between confidentiality and
overhead
Larger partition increased privacy
increased overheads
Order preserving encryption
Agrawal2004, Boldyreva2009
The set of data is securely
transformed so that the order is
preserved but the distribution and
domain are changed
Benefits: indexing/searching on OPE
encrypted data
Weakness: once the original
distribution is known, OPE is broken
Not attribute-wise order preserving
Order preserving encryption (OPE, Agrawal et al
2004) is not resilient to distribution-based attacks
OPE
Original Xi distribution is known
Bucket based
Estimation
Transformed Xi’ distribution
Data perturbation
Definition
1. randomly change the original data
2. the attacker cannot effectively recover
the original data
3. the desired properties are preserved
Techniques
Single dimension: noise addition
Multidimensional
Geometric perturbation
Random projection
RASP random space perturbation
Noise addition
Y = X+ R
X: original data column, R: random noise
(distribution published), Y: published
data
Applications in data mining
Reconstructing column distribution
Rakesh Agrawal SIGMOD 2000
Applied to privacy-preserving decision tree, naïve
bayes classifier
Attacks
Spectral filtering (Kargupta ICDM 2004)
PCA reconstruction (Huang
Multiplicative perturbations
Geometric data perturbation for
outsourced data mining
Random Projection
RASP perturbation for query services
(range query, kNN query).
Perturbation-based framework
Mining
service
Geometric data perturbation
Y=RX+T+D
R: secret rotation matrix (preserve Euclidean
distances)
T: secret random translation matrix, D: secret
random noise matrix
Distances are approximately preserved (D)
Resilient to most attacks to rotation perturbation
Applications
Outsourced privacy preserving data mining,
applicable for many classification and clustering
algorithms
Attacks
Population based attacks (when covariance matrix is
revealed)
Random Projection
Y=AX+D
A: random projection, e.g., entries from
N(0,1)
Distances are approximately preserved
Applications
Many classification and clustering algorithms
Worse accuracy than geometric perturbation
Good for sparse high-dimensional data (text
data), i.e., sketch methods (A is randomly
generated for EACH record)
Attacks
Possibly more resilient than other two
perturbation methods
But utility (distance) is not well preserved
RASP perturbation
k-dimensional numeric data, n records,
represented as a k x n matrix, x: a record
(1) Extend x to k+2 dimensions
-
(K+1) th dimension is always 1 – homogeneous dimension
(K+2) th dimension v is a real random number drawn from
(2) Encryption
- A is a (k+2)x(k+2) invertible real value matrix, with at
least two non-zero values for each row and the last
column of A has all non-zero values
- A is shared by all records
Properties
Not an OPE
Preserves convexity of the dataset
Convex dataset in Rk another convex dataset in
Rk+2.
Good for range query
Each range query in Rk
hyperplane based query
range query in Rk+2 .
RASP properties
Convexity preserving
Queried range (hypercube) is convex
RASP transforms the range to another convex (polyhedron)
half space: wTx<=a
wTx=a
The intersection of convex sets is also convex.
illustration of convexity
preserving
Original space
Encrypted space
Secure query transformation
A naïve solution
Based on the convexity preserving property
Problems: (1) A-1 can be probed
(2)
is
. . If a is known, the whole
dimension i is breached.
Secure query transformation
Enhanced solution
Xk+2 is always positive
(Xi-a) 0 (Xi-a)Xk+2 0
Correspondingly, in the encrypted space
yTy 0,
Problems addressed:
(1) A-1 cannot be derived from
(2) (Xi-a)Xk+2 0 contains the random component Xk+2 that protects
the condition (Xi-a) 0
Efficient two-stage query processing
illustrated
Stage2:
Filter out the junk records
Stage1:
Querying this bounding
box
Original space
Transformed space
A multidimensional tree index is been built on the encrypted data (in the
transformed space) in the server.
Stage 1:
The client calculates the large bounding box;
The server uses the index to find the results.
Stage 2:
filter the initial results with the conditions
yTiy 0 for 1…2k
Note: the two-stage strategy works, if the output of stage 1 is
significantly smaller than the original database and can be
fit into the memory.
Otherwise, use linear scan with stage 2 filtering.
RASP-based data mining
Preserving range query linear
classifier
Use the boosting framework to get
strong classifiers (PerturBoost, in
ICDM 2013)
Access pattern privacy
On database queries
Problem is the same as PIR
Attackers may use the access pattern to
breach data confidentiality
Each of previous approaches should
handle this problem!
PIR is impractical
Solutions based on private
Information retrieval (PIR)
PIR is still impractical
For Bucktization approach
Based on the architecture of
Hacigumus (SIGMOD02)
Hore VLDB04
For range query
Privacy concern: reveal the distribution
of value in each bucket
“Diffusion”: split buckets and combine
parts of different buckets
Trade off: now the server needs to
return more noisy results larger size
For OPE
Use queries to find out the
distributions, then break the
encryption
For RASP
Secure query transformation
Attacks to transformed queries
Oblivious RAM
Access pattern: read/write data items
Setting:
Client has a small secure memory
Server has large insecure storage, semihonest
Data items are encrypted
Client cannot hide the accessed locations
An active area
Existing Approaches
Dummy Block
Real Block
Real Block
Dummy Block
Real Block
Dummy Block
Dummy Block
Real Block
Inside a level
Some real blocks
Useful data
Some dummy blocks
Random data
Randomly permuted
Only the client knows
the permutation
Existing Approaches
Reading
Read a block from
each level
One real block.
Remaining are
dummy blocks
dummy
real
dummy
dummy
dummy
dummy
Server
Client
Existing Approaches
Server (before) Client
Server (after)
Writing
Shuffle
consecutively
filled levels.
Write into next
unfilled level.
Clear the source
levels
shuffle
blocks
Continuous Shuffling
…
To write:
The Problem with Existing
Approaches
Integrity guarantee
Merkle hash tree
H(H(x1)+H(x2)) , + is string concatenation
Can be stored with tree like structure : index, xml
Hash chains
Query correctness with merkle
by Devanbu et. al.
Using merkle tree
Example:
5<=q<=10
LUB(q) = 4
GLB(q) = 11
Operations:
Selections, projections, equijoins, set ops
Issues
Works only on data with verification objects
Query expressiveness
Expensive
Related work
Pang et. al (ICDE04, SIGMOD05), using ElGamal
function
Sion VLDB05: challenge token
F.Li SIGMOD06: freshness
Secure keyword search
Simple information retrieval
For a keyword, find the documents
containing the keyword
What if the documents are encrypted
word by word
and if the keyword is also encrypted
Secure keyword search
Song 2000
•Seed is random, different for
each Wi
•Key idea: Li and Ri are selfverifiable
•Advantage of XOR
How to set K?
Setting of ki
Ki = Fk’(Wi), k’ is secret
User publishes W and k = Fk’(W)
Server checks CiW
whether <Li, Fk(Li)> == CiW
It reveals nothing if Ci is not the ciphertext
for W.
And Li is random for different Wi – server
cannot find any information from Li.
Hidden search
In previous schemes, W is revealed
Weakness: each search will have to release k for W
Easy to collect information
Solution: encrypt Wi with an private
key, then xor with <Li, Fk(Li)>
Recent developments
Reza 2006
“Searchable symmetric encryption:
improved definitions and efficient
constructions”
Completely solved this problem, with a
solution indistinguishability under chosen
ciphertext attack (IND-CCA)
Trusted hardware
Possible benefits
Discussion
Data confidentiality/access pattern
Restrict cryptographic definition
(keyword search) or
Relaxed definition (perturbation,
bucketization, OPE, etc.)
It is very difficult to formulate and
prove the security of non-traditional
approaches
Do we need to reformulate the security
model? and how?