Transcript Document

Our Approach
 Vertical, compressed data structures, variously called either
Predicate-trees or Peano-trees (Ptrees in either case)1
processed horizontally (DBMSs process horizontal data vertically)
 Ptrees are data-mining-ready, compressed data
structures, which attempt to address the curses of
scalability and curse of dimensionality.
1
Ptree Technology is patent pending
by North Dakota State University
Current practice:
Structure data into horizontal records.
Process vertically (vertical scans)
Predicate tree technology: vertically project each attribute,
then vertically project each bit position of each attribute,
then compress each bit slice into a basic Ptree.
e.g., compression of R11 into P11 goes as follows:
R[A1] R[A2] R[A3] R[A4]
R( A1 A2 A3 A4)
Horizontally
structured
records
Scanned
vertically
010
011
010
010
101
010
111
111
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
001
000
001
111
100
101
100
100
010
011
010
010
101
010
111
111
0
0
0
0
1
0
1
1
The compressed Ptree
(1-Dim) version of R11,
denoted, P11, is built by
recording the truth of
the predicate “pure 1”
in a tree recursively on
halves, until purity is
achieved.
R11
pure1? false=0
pure1? true=1
pure1? false=0 pure1? false=0
pure1? false=0
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
001
000
001
111
100
101
100
100
R11 R12 R13 R21 R22 R23 R31 R32 R33
0
0
0
0
1
0
1
1
1
1
1
1
0
1
1
1
0
1
0
0
1
0
1
1
1
1
1
1
0
0
0
0
1
1
1
1
1
1
0
0
1
1
0
1
0
0
0
0
1
1
1
1
0
0
0
0
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
R41 R42 R43
0
0
0
1
1
1
1
1
0
0
0
1
0
0
0
0
1
0
1
1
0
1
0
0
Horizontally AND basic Ptrees
1. Whole is pure1? false  0
P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43
2. Left half pure1? false  0
P11 And it’s pure0000so00 10 0 00 0 1 00 1 00 0 00 1 00 0 00 0 01 0 01 0 00 00 0
3. Right half pure1? false  0
10 10
10 01
01 01
0100
0
01
1 01 0001
branch ends 10
0
^ 01
^ 10
^
^
^ 01
^
^^
^
^ 10
01
01
01
01
10
4. Left half of rt half ? false0
7 0 1 4
0 0
5. Rt half of right half? true1
01
To count occurrences of 7,0,1,4 use pure111000001100: 0 23-level
6. Lf half of lf of rt? true1 But it is pure
1
10
P11^P12^P13^P’21^P’22^P’23^P’31^P’32^P33^P41^P’42^P’43 = 0 0 22-level =2
(pure0) so this
^ 21-level
01
7. Rt half of lf of rt? false0 branch ends
RunListTrees? (RLtrees)
To facilitate subsetting (isolating a subset) and processing, a Ptree stucture can be constructed as follows:
R11
0
0
0
0
1
0
1
1
1. Whole file is not pure1 0
2. 1st half is not pure1  0
3. 2nd half is not pure1  0
4. 1st half of 2nd half not  0
5. 2nd half of 2nd half is  1
6. 1st half of 1st of 2nd is  1
0
0
0
01
1
10
7. 2nd half of 1st of 2nd not 0
RL11
0:000 1:100 0:101 1:110
Or, a separate NotPure0 index tree (trees could be terminated at any level). 1 st, AND NP0trees. Only 1-branches / result
need ANDing thru list scans. The more operands, the fewer 1-branches.
R11
0
0
0
0
1
0
1
1
1. Whole file is true
1
st
2. 1 half is false
0
3. 2nd half is true
1
st
nd
4. 1 half of 2 half true  1
5. 2nd half of 2nd half true  1
6. 1st half of 1st of 2nd true 1
7. 2nd of 1st of 2nd false
RL11
0
1
1
11
1
10
0
0:000 1:100 0:101 1:110
Other Indexes on RunLists
We could put Pure0-Run, Pure1-Run and even Mixed-Run (or LZV-Run) RunListIndexes on RL:
R11
0
0
0
0
1
0
1
1
0
1
0
1
0
1
0
1
start
length
P0RI11 000:4 101:1
pattern
RL11
0:0 1:100 0:101 1:110 01:1000
Length (# of
consecutive replicas
of pattern)
P1RI11 100:1 110:2
PLZVRI11 1000:1
Best Pure1 tree Implementation?
My guess circa 04jan
For n-D Pure1 trees:
1.
At any node, if |1-bits| in the tuple set represented by the node < lower threshold, LT,
1.
2.
Then that node will simply show the 1List, the list of 1-bit positions (use a 0-bit if =0) and have no children,
m
Else if the tuple set represented by that node < UT=2n , an upper threshold, leave bit-slice uncompressed
Building such Ptrees bottom up:
1.
2.
Using in-order ordering,
If 1-count of the next UT-segment  LT install P-sequence, else install 1List.
If current UT-segment node is numbered k*(2n–1) and it and all 2n-1 predecessors are 1Lists, and the
cardinality of the union of said 1Lists < LT, install the union in the parent node. Recurse this
collapsing process upward to the root.
Building such Ptrees top down:
1.
2.
3.
For datasets larger than UT, recursively build down the pure1.
If ever a node has < LT 1-bits, install the 1List and terminate that branch.
At the level where the represented tuple set = UT, install 1List if |1-bits| < LT, else install P-sequence.
Notes:
1.
This method should extend well to data streams. When the data stream exceeds the current allocation
(which, for n-D Ptrees will be a power of 2n), just grow the root of each Ptree to a new level, using the
current Ptree as node 0. Nodes 1,2,3,…2n-1 of the new Ptree will start as 0 nodes without children and
will grow as 1Lists until LLT is reach then they will be converted to P-sequences.
Ptrees
Vertical, compressed, lossless structures that facilitates fast horizontal AND-processing
P-trees are all of the following:
Partition-tree: Tree of nested partitions:
Root is the entire bit sequence
Level-1 is a partition of the root P(R)={C1..Cn}
In level-2 each level-1 component is partitioned by P(Ci ) = {Ci,1..Ci,ni} i=1..n,
In level-3 each level-2 component is partitioned by P(Ci,j) = {Ci,j1..Ci,jn }
ij
...
Partition tree
R
/ … \
C1 … Cn
/…\ … /…\
C11…C1,n1
Cn1…Cn,nn
. . .
Predicate-tree: For a predicate on the leaf-nodes of a partition-tree
(also induces predicates on i-nodes using quantifiers)
Predicate-tree nodes can be truth-values (Boolean P-tree)
Predicate can be quantified existentially (1 or a threshold %) or universally
Purity-tree: universally quantified tree (predicate is “1  position) , Pure1-tree)
existential quantified tree (predicate is “ a 1 position), NotPure0-tree)
We will focus on P1trees.
All Ptrees shown so far were
1-dimensional (recursively partition by halving bit files), but they can be
2-D (recursively quartering) (e.g., used for 2-D images)
3-D (recursively eighth-ing), …
Or based on purity runs or LZW-runs or …
Best Ptree Implementation?
My guess circa 04feb
Ptrees can be viewed “breadth first bottom up ” That is, one can view the leaf level (level-0) horizontally and
then view level-1 as another bit vector in which each bit represents the truth of the predicate (e.g.,
pure1), applied to pairs of bits in the leaf level vector. This can be continued upwards for each
successive level until the root is reached. The result is a bottom up construction of the 1-D ptree.
This construction has the advantages that the leaf level vector can be any length and can grow (e.g., for data
streams) , Each level up e.g., level-k, can be grown as new 2k-granules for that level fill up.
Additionally, the breadth-first bottom up view of the 2-D ptree for that bit vector is just the leaf vector with
every other level above it left out. The breadth first bottom up view of the 3-D ptree is just the leaf
level then leaving out 2 consecutive levels at a time, going up the levels. 4-D is the same leaving out
triples of levels at a time, etc.
Another advantage of this view is that, stripping an additional k levels just above the leaf level amounts to
using 2k length p-sequence at the leaf as discussed discussed on a previous slide.
The breadth-first bottom up (B2U) view doubles the storage requirement over the storage requirement of psequences. However, storage is free so it isn’t a big concern. The capture time (on the DII side) is
probably much higher, but that is a one-time cost.
On the face of it, compression is gone. However, we are talking about disk space again, which is free. In
terms of anding speed, the compression is there (just don’t decent on purity). However, one should
have the NZ trees (for the existential case) as well in order to distinguish purity. Note, we are
replacing all pointers with offsets now, so distinguishing mixed from pure zero cannot be done by “no
child pointers”.