Vertical Set Square Distance (VSSD)


Vertical Set Square Distance:
A Fast and Scalable Technique to Compute
Total Variation in Large Datasets
Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo
20th International Conference on Computers and Their Applications (CATA 2005)
March 16-18, 2005
New Orleans, Louisiana
Outline:
 Introduction
 Motivation
 Contribution
 P-tree Vertical Data Structure
 Vertical Set Square Distance (VSSD)
 Experimental Results
 Conclusion and Future Works
Introduction
 The determination of closeness (distance) of a point to
another point or to a set of points (average distance) is
required in data mining tasks.
 For example:
 Distance-based clustering:
Distance is used to determine in which cluster a
certain point should be placed, e.g. K-means.
 Nearest-neighbor or case-based classification:
The assignment of a class label to an
unclassified point is based on the
majority class label of the nearest
neighbor points. “Nearest” here
means the closest distance between points.
Introduction (cont)
 One way to measure the distance from a certain point to a
set of points is to examine the actual total separation of
the set of points from the point in question.
 However, if the examination is done point by point when the
set is very large, scanning the entire space causes the
approach to be very costly, slow and non-scalable.
 In this paper, we introduce a new technique called the
Vertical Set Square Distance (VSSD) that can scalably,
quickly and accurately compute the total length of
separation (total variation) of a set of points about a point.
Motivation
 In data mining, efforts have focused on finding
techniques that can scalably deal with the large cardinality of
datasets, driven by the explosive growth of data stored in
digital form.
 Most existing approaches for measuring closeness of
points are slow and computationally intensive (often
relying upon sampling to become usable).
 The availability of P-tree technology (vertical data
structure) provides an appropriate data structure to address this
curse of cardinality.
Our Contributions
 We introduce a new vertical technique that scalably,
quickly and accurately computes the total length of
separation of a set of points about a fixed point.
 We present empirical results based on real and
synthetic datasets and show a one-to-one comparison,
in terms of scalability and speed (time), between our
new method, which uses the vertical approach (P-tree
vertical data structure), and another method that uses the
horizontal approach (record-based).
P-Tree Vertical Data Structure
 P-tree vertical data representation consists of set structures
essentially representing the data column-by-column rather
than row-by-row (as in relational data).
 The construction of P-trees from an existing relational table
is typically done by decomposing each attribute in the table into
separate bit vectors (e.g., one for each bit position of a
numeric attribute or one bitmap for each category in a
categorical attribute). The construction for raw data coming
from a sensor platform is done directly (without creating a
horizontal relational table first).
Note that many sensor platforms (e.g.,
RSI sensors) produce raw vertical data.
The Construction of P-Tree
[Slide figure: an example relation R(A1, A2, A3, A4) with eight tuples, shown both in decimal and in 3-bit binary, is vertically decomposed into the bit slices R11, R12, R13, ..., R41, R42, R43, which are compressed into the P-trees P11, P12, P13, ..., P41, P42, P43.]
The construction of the P-tree:
1. Vertically project each attribute.
2. Vertically project each bit position.
3. Compress each bit slice into a P-tree.
Logical operations are AND (∧), OR (∨) and complement (').
The root count (rc) operation is the count of 1-bits in the P-tree that results from logical operations on P-trees.
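To make the vertical idea concrete, here is a minimal sketch (not the authors' implementation) of bit-slice decomposition and the root count operation, using plain uncompressed bit vectors in place of compressed P-trees; the helper names and attribute values are illustrative assumptions.

```python
# Minimal sketch: vertical bit-slice decomposition and root counting,
# with uncompressed bit vectors standing in for compressed P-trees.

def bit_slices(column, bits):
    """Vertically project one attribute: slices[j] holds bit j of every tuple
    (j = 0 is the least significant bit)."""
    return [[(value >> j) & 1 for value in column] for j in range(bits)]

def p_and(*vectors):
    """Bitwise AND of bit vectors (the P-tree AND operation)."""
    return [int(all(bits)) for bits in zip(*vectors)]

def root_count(vector):
    """rc(): the number of 1-bits in a bit vector."""
    return sum(vector)

A1 = [2, 3, 2, 2, 5, 2, 7, 7]            # one attribute of an 8-tuple relation
P1 = bit_slices(A1, bits=3)              # P1[2] is the most significant slice
print(root_count(p_and(P1[2], P1[1])))   # tuples with both high bits set -> 2
```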
Vertical Set Square Distance
Binary Representation
Let x be a value from the numerical domain of attribute Ai of relation R(A1,
A2, ..., Ad); then x, written as a b-bit binary number, is
shown on the left-hand side of the equation below.
$$x_i = x_{i,(b-1)} \cdots x_{i,0} = \sum_{j=0}^{b-1} 2^{j}\, x_{i,j}$$
The first subscript of x is the index of the attribute to
which x belongs. The second subscript
of x indicates the bit position.
The summation on the right-hand
side of the equation is equal to
the actual value of x as a decimal
number.
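For example, with b = 3 and a purely illustrative value:

$$x_i = x_{i,2}\, x_{i,1}\, x_{i,0} = 101_2 = 2^2 \cdot 1 + 2^1 \cdot 0 + 2^0 \cdot 1 = 5$$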
Vertical Set Square Dist. (Cont)
Let X be a set of vectors in a relation R(A1, A2, ..., Ad), let
PX be the P-tree class mask of X, let x be a vector with x ∈ X,
and let a be a target vector (also in d-dimensional space).
Then the vectors x and a can be written in binary as:
$$x = (x_{1,(b-1)} \cdots x_{1,0},\; x_{2,(b-1)} \cdots x_{2,0},\; \ldots,\; x_{d,(b-1)} \cdots x_{d,0})$$
$$a = (a_{1,(b-1)} \cdots a_{1,0},\; a_{2,(b-1)} \cdots a_{2,0},\; \ldots,\; a_{d,(b-1)} \cdots a_{d,0})$$
The total variation of the set of vectors X about a vector a
can be computed vertically using the
Vertical Set Square Distance
(VSSD), defined as follows:
Vertical Set Square Dist. (Cont)
$$(X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2$$
$$= \sum_{x \in X} \sum_{i=1}^{d} x_i^2 \;-\; 2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i \;+\; \sum_{x \in X} \sum_{i=1}^{d} a_i^2$$
$$= T_1 + T_2 + T_3$$
Terms T1, T2 and T3 can be computed
separately, and their summation is
the total variation (the sum of the
squared lengths of separation of the
vectors connecting X and a).
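As a quick numeric check of the decomposition (with illustrative data, not taken from the paper), the directly computed total variation equals T1 + T2 + T3:

```python
# Verify that the total variation decomposes into the three terms.
X = [(2, 7), (3, 7), (5, 2), (7, 0)]   # a set of 2-dimensional vectors
a = (4, 3)                             # target vector

direct = sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for x in X)
T1 = sum(sum(xi * xi for xi in x) for x in X)
T2 = -2 * sum(sum(xi * ai for xi, ai in zip(x, a)) for x in X)
T3 = len(X) * sum(ai * ai for ai in a)
assert direct == T1 + T2 + T3
print(direct, T1 + T2 + T3)            # both print 57
```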
Vertical Set Square Dist. (Cont)
$$T_1 = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 = \sum_{x \in X} \sum_{i=1}^{d} \left( \sum_{j=0}^{b-1} 2^{j} x_{i,j} \right)^2 = \sum_{x \in X} \sum_{i=1}^{d} \left( \sum_{j=0}^{b-1} 2^{j} x_{i,j} \right) \left( \sum_{k=0}^{b-1} 2^{k} x_{i,k} \right)$$
$$= \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k} \sum_{x \in X} x_{i,j}\, x_{i,k} = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k}\, rc(P_X \wedge P_{i,j} \wedge P_{i,k})$$
Or we can write T1 expressing the diagonal terms (j = k)
separately (noting also that x_{i,j}^2 = x_{i,j}, since they are bits):
$$T_1 = \sum_{i=1}^{d} \sum_{j=0}^{b-1} 2^{2j}\, rc(P_X \wedge P_{i,j}) \;+\; \sum_{i=1}^{d} \sum_{j=1}^{b-1} \sum_{l=0}^{j-1} 2^{j+l+1}\, rc(P_X \wedge P_{i,j} \wedge P_{i,l})$$
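A small sanity check (illustrative values, hypothetical helper-free code) that the full (j, k) double sum equals the diagonal-plus-off-diagonal form of T1 for one attribute with all rows in X:

```python
# Compare the two forms of T1 for a single attribute.
b = 3
column = [2, 3, 5, 7]
n = [[sum(((v >> j) & 1) & ((v >> k) & 1) for v in column)  # rc of bit j AND bit k
      for k in range(b)] for j in range(b)]

full = sum(2 ** (j + k) * n[j][k] for j in range(b) for k in range(b))
split = (sum(2 ** (2 * j) * n[j][j] for j in range(b))
         + sum(2 ** (j + l + 1) * n[j][l] for j in range(1, b) for l in range(j)))
assert full == split == sum(v * v for v in column)   # all equal 87
```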
Vertical Set Square Dist. (Cont)
$$T_2 = -2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i = -2 \sum_{x \in X} \sum_{i=1}^{d} a_i \left( \sum_{j=0}^{b-1} 2^{j} x_{i,j} \right)$$
$$= -2 \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j} \sum_{x \in X} x_{i,j} = -2 \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j}\, rc(P_X \wedge P_{i,j})$$
And T3 is simply:
$$T_3 = \sum_{x \in X} \sum_{i=1}^{d} a_i^2 = rc(P_X) \sum_{i=1}^{d} a_i^2 = rc(P_X)\,(a \circ a)$$
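Putting the three terms together, here is a minimal sketch of how T1, T2 and T3 can be evaluated from cached root counts, under the assumption of uncompressed bit vectors standing in for P-trees; the function names, variable names and data are hypothetical, not from the paper:

```python
# Sketch of VSSD from pre-computed root counts (illustrative, not the authors' code).

def bit_slices(column, b):
    """Bit slices of one attribute; slices[j] holds bit j (LSB = 0)."""
    return [[(v >> j) & 1 for v in column] for j in range(b)]

def rc(*vectors):
    """Root count: number of positions where all operand bit vectors are 1."""
    return sum(all(bits) for bits in zip(*vectors))

def precompute_counts(columns, class_mask, b):
    """Root counts that are independent of the target vector a."""
    slices = [bit_slices(col, b) for col in columns]          # the P_{i,j}
    rc_X = rc(class_mask)
    rc1 = [[rc(class_mask, slices[i][j]) for j in range(b)]
           for i in range(len(columns))]
    rc2 = [[[rc(class_mask, slices[i][j], slices[i][k]) for k in range(b)]
            for j in range(b)] for i in range(len(columns))]
    return rc_X, rc1, rc2

def vssd(counts, a, b):
    """Total variation of X about target vector a, from the cached counts."""
    rc_X, rc1, rc2 = counts
    d = len(a)
    T1 = sum(2 ** (j + k) * rc2[i][j][k]
             for i in range(d) for j in range(b) for k in range(b))
    T2 = -2 * sum(a[i] * 2 ** j * rc1[i][j]
                  for i in range(d) for j in range(b))
    T3 = rc_X * sum(ai * ai for ai in a)
    return T1 + T2 + T3

columns = [[2, 3, 5, 7], [7, 7, 2, 0]]   # two attributes, four tuples
mask = [1, 1, 1, 1]                      # all tuples belong to class X
counts = precompute_counts(columns, mask, b=3)
print(vssd(counts, a=(4, 3), b=3))       # equals the brute-force total variation (57)
```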
Vertical Set Square Dist. (Cont)
The root count operations are clearly independent of the
vector a. They include the root count of the single P-tree operand
PX, the root counts of two P-tree operands,
$$\sum_{i=1}^{d} \sum_{j=0}^{b-1} rc(P_X \wedge P_{i,j}),$$
and the root counts of three P-tree operands,
$$\sum_{i=1}^{d} \sum_{j=1}^{b-1} \sum_{l=0}^{j-1} rc(P_X \wedge P_{i,j} \wedge P_{i,l}),$$
which appear in T1, T2 and T3.
Vertical Set Square Dist. (Cont)
 This independence allows us to pre-compute the root
counts once, in advance, during the construction of
the P-trees, and to reuse them as needed,
regardless of the number of target vectors a.
 This amortizes the cost of P-tree ANDing for high-value
datasets, e.g. cancer analysis, RSI, etc.
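Continuing the sketch above (the hypothetical `precompute_counts` and `vssd` helpers), the cached counts are computed once and reused for any number of target vectors:

```python
# The expensive AND/count work happens once, at load time;
# each additional target vector costs only O(d * b^2) arithmetic.
counts = precompute_counts(columns, mask, b=3)
for a in [(4, 3), (0, 0), (7, 7), (1, 6)]:   # arbitrary target vectors
    print(a, vssd(counts, a, b=3))
```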
Horizontal Set Square Distance
(HSSD) as a comparison method
Let X be a set of vectors in R(A1, A2, ..., Ad), let
x = (x1, x2, ..., xd) be a vector belonging to X, and let
a = (a1, a2, ..., ad) be a target vector.
Then the Horizontal Set Square Distance (HSSD) is defined
as follows:
$$(X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2$$
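For comparison, a minimal sketch of the horizontal (record-based) computation, which scans every tuple for every target vector; the data and names are illustrative:

```python
# Horizontal scan: cost grows with the number of rows for every target vector.
def hssd(rows, a):
    """Total variation of the set of rows about target vector a."""
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for x in rows)

rows = [(2, 7), (3, 7), (5, 2), (7, 0)]   # illustrative horizontal records
print(hssd(rows, (4, 3)))                 # same value as the vertical VSSD (57)
```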
Experimental Results
 The experiments were conducted using both real and
synthetic datasets.
 The goals are to compare the execution time (speed) and
scalability of our algorithm employing a vertical approach
(vertical data structure and horizontal bitwise AND
operation) with a horizontal approach (horizontal data
structure and vertical scan operation).
 Performance of both algorithms was observed under
different machine specifications,
including an SGI Altix machine.
Experimental Results (Cont.)
The specification of the machines used in the experiments:

Machine     Specification
AMD1GB      AMD Athlon K7 1.4GHz, 1GB RAM
P42GB       Intel P4 2.4GHz processor, 2GB RAM
SGI Altix   SGI Altix CC-NUMA, 12-processor shared memory (12 x 4 GB RAM)
Datasets
 A set of aerial photographs from the Best Management
Plot (BMP) of Oakes Irrigation Test Area (OITA) near
Oakes, ND.
 The original image size
(dataset) is 1024x1024 pixels
(cardinality of 1,048,576) and
contains 3 bands (RGB)
plus synchronized data
for soil moisture, soil nitrate
and crop yield (6 attributes in total).
Datasets (Cont.)
 The datasets contain 4 different yield classes:
 Low Yield         (0 < intensity <= 63)
 Medium Low Yield  (63 < intensity <= 127)
 Medium High Yield (127 < intensity <= 191)
 High Yield        (191 < intensity <= 255)
With super-sampling, five synthetic datasets were created:
 2,097,152 rows
 4,194,304 rows (2048x2048 pixels)
 8,388,608 rows
 16,777,216 rows (4096x4096 pixels)
 25,160,256 rows (5016x5016 pixels)
Timing and Scalability Results
Observation 1.
 The first performance evaluation was done on a P4 2 GB
RAM machine using synthetic datasets having 4.1 and
8.3 million data tuples.
 We use 100 arbitrary unclassified target vectors (resulting
in 400 classification computations for the predicted yield).
 Datasets of sizes greater than 8.3 million rows cannot be
executed on this machine due to out-of-memory problems
when running the Horizontal Set Square
Distance (HSSD).
Results
Average time to compute the total variation for each test case (seconds):

Dataset size (# of Rows)   VSSD     HSSD
4,194,304                  0.0003   7.3800
8,388,608                  0.0004   15.1600

[Slide charts: "VSSD vs HSSD Time Comparison Using 100 Test Cases on 4,194,304 Rows Dataset" and "VSSD vs HSSD Time Comparison Using 100 Test Cases on 8,388,608 Rows Dataset", each plotting time (in seconds) against Test Case ID for VSSD and HSSD.]
Timing and Scalability Results
Observation 1 (Continued)

Dataset (Rows)   VSSD: Root Count Pre-Computation and P-trees Loading (s)   HSSD: Horizontal Dataset Loading (s)
1,048,576        3.900                                                      4.974
2,097,152        8.620                                                      10.470
4,194,304        18.690                                                     19.914
8,388,608        38.450                                                     39.646

This table compares the time required to load the vertical data
structure into memory and perform the one-time root count
operations for VSSD, with the time needed to load horizontal
records into memory for HSSD.
Timing and Scalability Results
Observation 2
This table shows the algorithms' timing and
scalability when they are executed
on different machines.

Average running time to compute the total variation for each test case (seconds):

Cardinality of Dataset   HSSD, AMD 1GB   HSSD, P4 2GB   HSSD, SGI Altix 12x4GB   VSSD, AMD 1GB
1,048,576                2.2000          1.8400         5.4800                   0.0003
2,097,152                4.4100          3.6400         8.3200                   0.0003
4,194,304                8.5800          7.3800         15.8640                  0.0004
8,388,608                —               15.1600        33.9000                  0.0004
16,777,216               —               —              66.5400                  0.0004
25,160,256               —               —              115.2040                 0.0004

(— : the HSSD run could not be completed on that machine.)
Timing and Scalability Results
Observation 2 (Continued)

[Slide chart: "VSSD vs HSSD Time Comparison Using 100 Different Test Cases Running on Different Types of Machines". Time (seconds, 0-120) is plotted against the number of tuples (in units of 1024^2, up to 24) for VSSD on AMD-1G, HSSD on AMD-1G, HSSD on P4-2G and HSSD on SGI-48G; the chart includes an "Out of Memory" annotation.]
Conclusion
 Vertical Set Square Distance (VSSD) is fast and accurate
in computing total variation for classification, clustering
and rule mining, and it scales well to very large datasets
compared to the traditional horizontal approach (HSSD).
The complexity of the VSSD is O(d * b^2), where d is the
number of dimensions and b is the maximum bit-width of
the attributes (it depends only on the width of the dataset,
not on its cardinality).
The VSSD is very fast because of the independence of
the root counts of P-tree operands from the target vector a, which
allows the counts to be pre-computed once in advance,
during the construction of the P-trees.
Future Work
 Comprehensive study on the Vertical Set Square Distance
(VSSD) in various data mining tasks, e.g. classification,
clustering and outlier detection.
For classification, using VSSD in the voting phase has
already been shown to greatly accelerate the assignment of
class labels, since the calculation of class votes can be done
entirely in one computation without having to visit each
individual point, as is the case in the horizontal-based
approach.
Thank You…