Data Security using Vertical Data and pTrees

Download Report

Transcript Data Security using Vertical Data and pTrees

PGP-D Pretty Good Protection of Data
protects vertical pTree data.
key=array(location,pad) 5,54 | 7,539 | 87,3 | 209,126 | 25,896 | 888,23 | ...
APPENDIX
With PGP-D, to get pTree info, you need:
the ordering (the mapping of bit position to table row) the predicate (e.g., table column id and bit slice or bitmap) and the location.
pTrees are compressed, data-mining-ready vertical data structures which need not be uncompressed to be used. PGP-D is a mechanism in which
we "scrambled" pTree information (location info) in such a way that data can be still be processed just as fast.
For data mining purposes, the scrambled pTrees would be unrevealing of the raw data to anyone, but a person qualified to issue data-mining
requests (classification/ARM/clustering). The Key (K) reveals the pTree location.
5,54 | 7,539 | 87,3 | 209,126 | 25,896 | 888,23 | ...
We could make all pTrees (over entire [distr.] DB) same length and pad in the front of each so that statistics can't reveal the pTree start position.
We scramble the locations of the pTrees. For basic pTrees, K would reveal the offset and pre-pad. E.g., the first pTree is found at offset=5 (has
been shuffled forward 5 pTree slots from the slot assigned by the usual static ToC(pTree, location)) and that the first 54 bits are pad bits.
If the DB had 5000 files with 50 columns each (on avg) and each column had 32 bits (on avg), we have 8 million pTrees. We could pad with
statistically indistinguishable additions to make it impossible to try enough alternatives in human time to break the key.
An additional thought: In the distributed case (multiple sites) since we'd want lots of pTrees, it would make sense to always fully replicate
(making all retrievals local). Thus we are guaranteed that all pTrees are statistically "real looking" (because the ARE real). We might not need to
pad with bogus pTrees. A hacker could extract only the first bit of every pTree (e.g., the 8M bits that IS the first horizontal record), then shuffle
those bits until something meaningful appears (or starts to appear). From all meaningful shuffles, he/she might be able to break the key code
(e.g., look at 2nd, 3rd, etc.). To get around that possibility, we could store the entire database as a massive "Big Bit String" and have as part of
our Key (K) the start offset of each pTree (which would be shuffled randomly). We would include a column of the [randomly determined]
amount of padding (now variable) so that the position of first start bits is unknowable. If, as is expected, many impure pTrees are constructed,
those would subsitute for the suggested bogus ones.
We could also include large collection of bogus key-lookup-tables (identify correct one to authorized subgroup only). Additional layer. encrypt?
For multiple users at different levels of security with rights to parts of DB and not others) we would have a separate key for each user level.
Using the key would be simple and quick, and once the key is applied, then accessing and processing the data would be at zero additional time
cost (the current thinking is that we would not encrypt or otherwise alter the pTrees themselves - just their identity).
One would only need to work on the "key mechanism" to improve the method in speed, protection level. (individual pTrees are intact/unaltered)
Some data collections need not be protected in their entirety (tends to be by column and not by row - pTrees are good for column protection. (I.e.,
it is usually the case that certain attributes are sensitive and others are routine public information).
When there are differences in protection level by row (subsets of instances of the entity require different protection levels) then we would simply
create each subset as a separate "file" (all of the same massive length through padding) and protect each at the proper level.
Another issue: It may be necessary to change the database from time to time, for several reasons. If the key gets out or if enough time has elapse
to make it a possibility that the scheme can be broken, then the database will have to be reshuffled and a new key issued. How can that be done?
Does the location of every pTree need to be changed (all pTrees need to be moved), or can security be established with the movement of only a
small number of pTrees? I.e., if a hacker compares the old with the new and can thus find strings that were not changed, can he/she crack it?
8. Data Security using Vertical Data and pTrees
Some preliminaries:
1. with pTrees you need to know the ordering to have any information.
2. You also need to know which pTree stands where (in which column and which bit slice) to have any info.
3. All pTrees same len (max file len of DB); shuffle/scramble/alter ordering of cols/slices and ordering to conceal info.
pTrees are compressed, data-mining-ready vertical data structures which need not be uncompressed to be used. We want to devise a mechanism
based on the above notions (or others?) in which the "scambled" pTree data can be processed without an expensive unscrambling step?
For data mining purposes, the scrambled pTrees would be unrevealing of the raw data to anyone, but a person qualified could issue a datamining
request (a classification/ARM/clustering request) and get the answer even though the actual data would never be exposed. Different from
encrypting? Encrypting massive data stores is never a good options and decryption is usually necessary to datamine information from the store.
The key is a Lookup Table for pTree identity (intact pTrees). One could customize Partial-Key-Lookup-Tables to subgroups, etc.
Could also construct large collection of bogus key-lookup-tables (identify correct one to authorized subgroup only). Additional layer. encrypt?
For multiple users at different levels of security with rights to parts of DB and not others) we would have a separate key for each user level.
Using the key would be simple and quick, and once the key is applied, then accessing and processing the data would be at zero additional time
cost (the current thinking is that we would not encrypt or otherwise alter the pTrees themselves - just their identity).
One would only need to work on the "key mechanism" to improve the method in speed, protection level. (individual pTrees are intact/unaltered)
Some data collections need not be protected in their entirety (tends to be by column and not by row - pTrees are good for column protection. (I.e.,
it is usually the case that certain attributes are sensitive and others are routine public information).
When there are differences in protection level by row (subsets of instances of the entity require different protection levels) then we would simply
create each subset as a separate "file" (all of the same massive length through padding) and protect each at the proper level.
Make all pTrees (entire [distr.] DB) the same depth (padding in such a way that statistical analysts could not determine pad from actual data?)?
Then all pTrees can be intermingled and protected using the same mechanism The more pTrees, the better.
If the DB had 5000 files with 50 columns each (avg) and each column had 32 bits (avg), we have 8 million pTrees. We could pad with
statistically indistinguishable additions to make it impossible to try enough alternatives in human time to break the key.
An additional thought:
In the distributed case (multiple sites) since we'd want lots of pTrees, it would make sense to always fully replicate (making all retrievals local).
Thus we are guaranteed that all pTrees are statistically "real looking" (because the ARE real). We might not need to pad with bogus pTrees.
A hacker could extract only the first bit of every pTree (e.g., the 8M bits that IS the first horizontal record), then shuffle those bits until something
meaningful appears (or starts to appear). From all meaningful shuffles, he/she might be able to break the key code (e.g., look at 2nd, 3rd, etc.).
To get around that possibility, we could store the entire database as a massive "Big Bit String" and have as part of our Key Lookup Table (KLT)
the start offset of each pTree (which would be shuffled randomly). We would include a column of the [randomly determined] amount of padding
(now variable) so that the position of first start bits is unknowable.
Alternatively, we could use a common length but have random "non-pTree" gaps between them.
Data Security using Vertical Data and pTrees
With all the "break ins" occurring (e.g., Citibank, etc.) how can data be protected?
Can vertical data be protected more easily than horizontal data?
Can pTree representation be useful in protecting data? Some modification of standard pTrees?
Some preliminaries:
1. with pTrees you need to know the ordering to have any information.
2. You also need to know which pTree stands where (in which column and which bit slice) to have any info.
3. We may want to make all pTrees the same length (using the max file length over the database). then we
can shuffle/scramble/alter the ordering of columns/slices and even of the ordering to conceal information.
With pTree representations, there are no horizontal data records (as opposed to indexes which are vertical
structures which accompany the horizontal data files). pTrees ARE the data as well as the indexes.
My thoughts include:
pTrees are compressed, data-mining-ready vertical data structures which need not be uncompressed to be
used. Therefore we want to devise a mechanism based on the above notions (or others?) in which the
"scambled" pTree data can be processed without an expensive unscrambling step?
For data mining purposes, the scrambled pTrees would be unrevealing of the raw data to anyone but a person
qualified could issue a datamining request (a classification/ARM/clustering request) and get the answer even
though the actual data would never be exposed.
That's probably not much different really from encrypting the data, but encrypting massive data stores is
never a good options and decryption is usually necessary to mine information from the store.
Data Security using Vertical Data and pTrees -2
The key is a "Lookup Table" (or whatever mechanism is used to identify to the authorized group,
which pTree is which).
One could customize Partial-Key-Lookup-Tables to subgroups, etc.
One could also construct a large collection of bogus key-lookup-tables too (and identify the correct one
to the authorized subgroup only) as an additional layer.
I suppose one could then encrypt those (using PGP?).
For multi-level security (multiple users at different levels of security who have the right to only parts of
the database and not others) we would have a separate key for each user level.
The bottom line point is that using the key would be simple and quick, and once the key is applied,
then accessing and processing the data would be at zero additional time cost (the current thinking is
that we would not encrypt or otherwise alter the pTrees themselves - just their identity).
One would only need to work on the "key mechanism" to improve the method (speed, protection level).
It may be the case also that some data collections need not be protected in their entirety. Then only the
pTrees that need to be protected would be protected We note that that tends to be by column and not
by row and pTrees are perfect for column-wise protection. I.e., it is usually the case that certain
attributes are sensitive and others are routine public information).
When there are differences in protection level by row (subsets of instances of the entity require
different protection levels) then we would simply create each subset as a separate "file" (all of the
same massive length through padding) and protect each at the proper level.
Data Security using Vertical Data and pTrees -3
At this point, we think it would be best to make all pTrees across the entire [distributed] database, the
same depth (padding in such a way that statistical analysts could not determine pad from actual data?).
That way all pTrees can be intermingled and protected using the same mechanism
Note that the more pTrees we have, the better. For example, if the DB had 5000 files with 50 columns
each (on average) and each column had 32 bits (avg), we have 8 million pTrees.
We could pad that up with statistically identical additions to whatever it takes to make it impossible to
try enough alternatives in human time to break the key.
An additional thought:
In the distributed case (multiple sites) since we'd want lots of pTrees, it would make sense to always
fully replicate (which makes all retrievals local). Thus we are guaranteed that all pTrees are statistically
"real looking" (because the ARE real). We might not need to pad with bogus pTrees in this case.
It occurs to me that a hacker could extract only the first bit of every pTree (e.g., the 8,000,000 bits that
IS the first horizontal record). Then that hacker could shuffle those bits until something meaningful
appears (or starts to appear). From all meaningful shuffles, he/she might be able to break the key code
(e.g., look next as 2nd bit, then 3rd, etc.). To get around that possibility, we could store the entire
database as a massive "Big Bit String" and have as part of our Key Lookup Table (KLT) the start offset
of each pTree (which would be shuffled randomly). We could include a column of the [randomly
determined] amount of padding (now variable) so that the position of first start bits is unknowable.
Alternatively, we could use a common length but have random "non-pTree" gaps between them.