Transcript dummy

Secure Data Outsourcing
Outline




Motivation
Background
Research issues
Summary
Motivation
 Cost of maintaining/mining large data
 4-5 times of the cost of data acquisition
 DBAs are paid well 
 More and more data service providers
 Low cost – cloud computing
 Maintain one database for one user  multiple users
 Examples:




Alentus.com
Datapipe.com
Discountasp.net
…
 Concerns about data security and privacy
 Untrusted service provider
Un-trusted service provider
 Lazy: incentives to perform less
 Curious: incentives to acquire
information
 Malicious:
 Denial of service
 Incorrect results
 Possibly compromised
Challenges
 Data confidentiality
 Data need to be encrypted (?)
 Utility of protected data?
 Query utility
 Mining utility
 Access pattern privacy
 Integrity
 Data integrity
 Query integrity



Correct
Complete
Fresh
Why is it hard for query
services?
 Arbitrary expressivity
 SQL statements
 Often, restricted for certain type of query
for simplicity (e.g. range query, knn
query)
 Cost
 Communication
 Computation (server side vs client side)
Why it is hard for mining
services?
 Many data mining models
 Different utilities to preserve
 No one-size-for-all solutions
Data confidentiality
 Bucketization method (crypto-index)
 Order preserving encryption
 Perturbations
Bucketization method
 Hacigumus (SIGMOD02)
 Main steps
 Partition sensitive attributes
 Order preserving: supports comparison
 Random: query rewriting becomes hard
 Build index on the partitions
 Rewrite queries to target partitions
 ‘john doe’  105
 Select * from T’ where name=105
 Execute queries and return results
 Prune/post-process results on client
 Trade off between confidentiality and
overhead
 Larger partition  increased privacy 
increased overheads
Order preserving encryption
 Agrawal2004, Boldyreva2009
 The set of data is securely
transformed so that the order is
preserved but the distribution and
domain are changed
 Benefits: indexing/searching on OPE
encrypted data
 Weakness: once the original
distribution is known, OPE is broken
 Not attribute-wise order preserving
 Order preserving encryption (OPE, Agrawal et al
2004) is not resilient to distribution-based attacks
OPE
Original Xi distribution is known
Bucket based
Estimation
Transformed Xi’ distribution
Data perturbation
 Definition
1. randomly change the original data
2. the attacker cannot effectively recover
the original data
3. the desired properties are preserved
 Techniques
 Single dimension: noise addition
 Multidimensional
 Geometric perturbation
 Random projection
 RASP random space perturbation
Noise addition
 Y = X+ R
 X: original data column, R: random noise
(distribution published), Y: published
data
 Applications in data mining
 Reconstructing column distribution
 Rakesh Agrawal SIGMOD 2000
 Applied to privacy-preserving decision tree, naïve
bayes classifier
 Attacks
 Spectral filtering (Kargupta ICDM 2004)
 PCA reconstruction (Huang
 Multiplicative perturbations
 Geometric data perturbation for
outsourced data mining
 Random Projection
 RASP perturbation for query services
(range query, kNN query).
Perturbation-based framework
Mining
service
Geometric data perturbation
 Y=RX+T+D
 R: secret rotation matrix (preserve Euclidean
distances)
 T: secret random translation matrix, D: secret
random noise matrix
 Distances are approximately preserved (D)
 Resilient to most attacks to rotation perturbation
 Applications
 Outsourced privacy preserving data mining,
applicable for many classification and clustering
algorithms
 Attacks
 Population based attacks (when covariance matrix is
revealed)
Random Projection
 Y=AX+D
 A: random projection, e.g., entries from
N(0,1)
 Distances are approximately preserved
 Applications
 Many classification and clustering algorithms
 Worse accuracy than geometric perturbation
 Good for sparse high-dimensional data (text
data), i.e., sketch methods (A is randomly
generated for EACH record)
 Attacks
 Possibly more resilient than other two
perturbation methods
 But utility (distance) is not well preserved
RASP perturbation
k-dimensional numeric data, n records,
represented as a k x n matrix, x: a record
(1) Extend x to k+2 dimensions
-
(K+1) th dimension is always 1 – homogeneous dimension
(K+2) th dimension v is a real random number drawn from
(2) Encryption
- A is a (k+2)x(k+2) invertible real value matrix, with at
least two non-zero values for each row and the last
column of A has all non-zero values
- A is shared by all records
 Properties
 Not an OPE
 Preserves convexity of the dataset
 Convex dataset in Rk  another convex dataset in
Rk+2.
 Good for range query
 Each range query in Rk
 hyperplane based query
 range query in Rk+2 .
RASP properties
 Convexity preserving


Queried range (hypercube) is convex
RASP transforms the range to another convex (polyhedron)
half space: wTx<=a
wTx=a
The intersection of convex sets is also convex.
illustration of convexity
preserving
Original space
Encrypted space
Secure query transformation
 A naïve solution
 Based on the convexity preserving property
Problems: (1) A-1 can be probed
(2)
is
. . If a is known, the whole
dimension i is breached.
Secure query transformation
 Enhanced solution
 Xk+2 is always positive
 (Xi-a)  0  (Xi-a)Xk+2  0
 Correspondingly, in the encrypted space
yTy  0,
Problems addressed:
(1) A-1 cannot be derived from 
(2) (Xi-a)Xk+2  0 contains the random component Xk+2 that protects
the condition (Xi-a)  0
Efficient two-stage query processing
 illustrated
Stage2:
Filter out the junk records
Stage1:
Querying this bounding
box
Original space
Transformed space
A multidimensional tree index is been built on the encrypted data (in the
transformed space) in the server.
Stage 1:
The client calculates the large bounding box;
The server uses the index to find the results.
Stage 2:
filter the initial results with the conditions
yTiy  0 for 1…2k
Note: the two-stage strategy works, if the output of stage 1 is
significantly smaller than the original database and can be
fit into the memory.
Otherwise, use linear scan with stage 2 filtering.
RASP-based data mining
 Preserving range query  linear
classifier
 Use the boosting framework to get
strong classifiers (PerturBoost, in
ICDM 2013)
Access pattern privacy
 On database queries
 Problem is the same as PIR
 Attackers may use the access pattern to
breach data confidentiality
 Each of previous approaches should
handle this problem!
PIR is impractical
 Solutions based on private
Information retrieval (PIR)
 PIR is still impractical
For Bucktization approach
 Based on the architecture of
Hacigumus (SIGMOD02)
 Hore VLDB04
 For range query
 Privacy concern: reveal the distribution
of value in each bucket
 “Diffusion”: split buckets and combine
parts of different buckets
 Trade off: now the server needs to
return more noisy results  larger size
For OPE
 Use queries to find out the
distributions, then break the
encryption
For RASP
 Secure query transformation
 Attacks to transformed queries
Oblivious RAM
 Access pattern: read/write data items
 Setting:
 Client has a small secure memory
 Server has large insecure storage, semihonest
 Data items are encrypted
 Client cannot hide the accessed locations
 An active area
Existing Approaches
Dummy Block
Real Block
Real Block
Dummy Block
Real Block
Dummy Block
Dummy Block
Real Block
 Inside a level
 Some real blocks
 Useful data
 Some dummy blocks
 Random data
 Randomly permuted
 Only the client knows
the permutation
Existing Approaches
 Reading
 Read a block from
each level
 One real block.
 Remaining are
dummy blocks
dummy
real
dummy
dummy
dummy
dummy
Server
Client
Existing Approaches
Server (before) Client
Server (after)
 Writing
 Shuffle
consecutively
filled levels.
 Write into next
unfilled level.
 Clear the source
levels
shuffle
blocks
Continuous Shuffling
…
To write:

The Problem with Existing
Approaches

Integrity guarantee
 Merkle hash tree
H(H(x1)+H(x2)) , + is string concatenation
Can be stored with tree like structure : index, xml
 Hash chains
Query correctness with merkle
by Devanbu et. al.
Using merkle tree
Example:
5<=q<=10
LUB(q) = 4
GLB(q) = 11
 Operations:
 Selections, projections, equijoins, set ops
 Issues
 Works only on data with verification objects
 Query expressiveness
 Expensive
 Related work
 Pang et. al (ICDE04, SIGMOD05), using ElGamal
function
 Sion VLDB05: challenge token
 F.Li SIGMOD06: freshness
Secure keyword search
 Simple information retrieval
 For a keyword, find the documents
containing the keyword
 What if the documents are encrypted
word by word
 and if the keyword is also encrypted
Secure keyword search
 Song 2000
•Seed is random, different for
each Wi
•Key idea: Li and Ri are selfverifiable
•Advantage of XOR
How to set K?
 Setting of ki
 Ki = Fk’(Wi), k’ is secret
 User publishes W and k = Fk’(W)
 Server checks CiW
 whether <Li, Fk(Li)> == CiW
It reveals nothing if Ci is not the ciphertext
for W.
And Li is random for different Wi – server
cannot find any information from Li.
Hidden search
 In previous schemes, W is revealed
 Weakness: each search will have to release k for W
 Easy to collect information
 Solution: encrypt Wi with an private
key, then xor with <Li, Fk(Li)>
Recent developments
 Reza 2006
 “Searchable symmetric encryption:
improved definitions and efficient
constructions”
 Completely solved this problem, with a
solution indistinguishability under chosen
ciphertext attack (IND-CCA)
Trusted hardware
Possible benefits
Discussion
 Data confidentiality/access pattern
 Restrict cryptographic definition
(keyword search) or
 Relaxed definition (perturbation,
bucketization, OPE, etc.)
 It is very difficult to formulate and
prove the security of non-traditional
approaches
 Do we need to reformulate the security
model? and how?