2013 caine secure db

Download Report

Transcript 2013 caine secure db

Knowledge Discovery
in
Protected Vertical Information
Dr. William Perrizo
University Distinguished Professor
of Computer Science
North Dakota State University,
Fargo, ND
[email protected]
Vertical Data and pTrees
Traditionally, a file or dataset comprises one or more horizontal records, each record comprises one
or more fields, and each field comprises some number of bits that represent a piece of data.
Traditionally, a dataset is structured horizontally as a set of such horizontal records, and is processed
vertically, record-by-record.
Alternatively, a file or dataset may be structured vertically as a set of columns and processed
horizontally, column-by-column.
A column-structured dataset may be further vertically decomposed as a set of bit-vectors or bitslices
and processed horizontally, bitslice-by-bitslice using logical operations.
A pTree is a tree-like organization of these bitslices having a zero bit value for each internal node
and a zero or one bit value for each of its leaves (external nodes).
The three basic pTree operations are AND, OR and complement.
An example follows.
predicate Trees (pTrees): slice by attribute or column
vertically slice off each bit position (12 vertical structures)
then compress each bit slice into a tree using a predicate
e.g., the compression of R11 into pTree, P11 :
A pTree example
1st, Vertically Processing of Horizontal Data (VPHD)
e.g., find the number of occurences of 7 0 1 4 =2
2nd, using pTrees find the number of occurences of 7 0 1 4
Base 10
for Horizontally
structured,
record-oriented
data, one must
scan vertically
R(A1 A2 A3 A4)
2
6
3
2
3
2
7
7
7
7
7
7
2
2
0
0
6
6
5
5
1
1
1
1
1
0
1
7
4
5
4
4
R[A1] R[A2] R[A3] R[A4]
Base 2
=
010
011
010
010
011
010
111
111
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
001
000
001
111
100
101
100
100
0
0
0
0
0
0
1
1
R11
pure1? false=0
pure1? true=1
pure1? false=0 pure1? false=0
Record truth of predicate:
"purely 1-bits" in a tree,
recursively on halves,
until the half is pure.
1. Whole thing pure1? false  0
2. Left half pure1?
false  0
3. Right half pure1? false  0
4. Left half of rt half ? false0
5. Rt half of right half? true1
pure1? false=0
010
011
010
010
011
010
111
111
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
001
000
001
111
100
101
100
100
R11 R12 R13 R21 R22 R23 R31 R32 R33
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
0
1
0
0
1
0
1
1
1
1
1
1
0
0
0
0
1
1
1
1
1
1
0
0
1
1
0
1
0
0
0
0
1
1
1
1
0
0
0
0
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
R41 R42 R43
0
0
0
1
1
1
1
1
0
0
0
1
0
0
0
0
1
0
1
1
0
1
0
0
P11 P12 P13
P11
But it's pure0 so
this branch ends
0
0
0
01
1
0
0
0
0 0
0
0
01
1
P21 P22 P23 P31 P32 P33 P41 P42 P43
0
0
0
0
0
0
0
0
0
0
0 0 1 0 1 0
0 01 0 0 0 0 1 0 1 0 0 0 0
10 10
10 01
00 00
0001
0100
^ 10
^
^
^ 01
^
^
^ 10
01
01
01
01
7
0
1
4
=
To count (7,0,1,4)s use 111000001100
P11^P12^P13^P’21^P’22^P’23^P’31^P’32^P33^P41^P’42^P’43
0
*23
0 0 *22
0 1 *21
*20
=2
Securing Vertical pTree Data
How can a pTree database be made secure without causing the processing to be slowed down
instance, requiring a slow decryption step prior to processing)?
(by, for
A first level of security may be provided by re-ordering (e.g., permuting) the pTrees. The re-ordering method
or permutation becomes the security key (those who know it can use the data and those who do not,
cannot). For big data, key compression techniques might be necessary.
A second level of security can be provided through random bit padding (at the bottom and/or top) Padding will
randomly place the true starting and ending position of the pTree somewhere in the middle of the
structure. This second level is intended to prevent an adversary from discovering the re-ordering by
focusing on a smaller part of the dataset (e.g., the first bit or the last bit only).
A third level of security is also possible when needed. That third level would involve re-ordering the bits
within the actual pTrees. For datasets in which the ordering of the record instances is random or
otherwise unknown, this would be unnecessary. However, if there is a known ordering, it might be
possible for the clever adversary to discover the starting and ending positions of the pTrees.
At this time, this third level seems problematic since it would probably require the encryption and therefore a
time consuming decryption step prior to processing. There may be a way to re-order the pTree bits
without introducing significant additional processing time? This is "future work".
The "bottom" padding might be unnecessary from a security perspective (add little or no additional security,
however, we believe it will make it much easier to secure volatile datasets (by providing a large bottom
pad that can be overwritten as the dataset grows in cardinality)
Implementations
This security scheme can be implemented per dataset or over the entire database as a whole.
That is, for each dataset, there could be a massive string of bits in which all the pTrees of
just that dataset are embedded (separate massive string of bits for each dataset),
or
there could be a much more massive string of bits in which all the pTrees of the
database are embedded (and intermingled for additional security?).
The latter might be the best choice since it does not introduce any additional processing time
overhead and it does provide additional security.
In general, a security monitor subsystem would be used to implement this security scheme.
The security monitor would be responsible for identification, authentication and
authorization.
The security monitor could manage free space within the massive bit strings (be they per
dataset or per database).
Conclusion
This vertical database scheme involves at least two levels of security.
The first level is a matter of disguising the location of each vertical bitslice structure (e.g.
permuting the ordering of the bitslice).
The second level is a matter of disguising the bit position location of the pTree within a
bigger bit string (to make it harder to focus on the first bits only, etc.).
An important fact is that neither of these two schemes imposes almost no additional
processing time. The reason we can say that is pTree processing always involves
1. identifying those pTrees that are to be involved in the processing
2. locating those pTrees (which is always a matter of following a pointer and we simple
shuffle the order of those pointers so that those who know the shuffle can find the
data and those who do not, cannot.
3. loading the pertinent pTrees (which is just as fast as long as you know at which bit
position to start the load. Those who know that start position, can load and those
who do not, cannot.).
Conclusion continued
A third possible level of security was mentioned, that of encrypting or otherwise
anonymizing each pTree itself.
Again, this may be unnecessary if there is no know or expected record ordering involved.
One simple way to [somewhat] anonymize individual pTrees it to shuffle the order of the
levels within each pTree. This would be metadata (manage by the security monitor)
but once known, could be used with no or very very little additional processing delay,
since each level of a multi-level pTree is processed separately and therefore the
locating portion of the process involves a pointer and no additional processing
would be required on that pointer (it is just that the correct level of the correct pTree
to be located would not be in the expected place within the pTree as stored.).
Other partial anonymizations, short of full encryption, might suffice in most cases.