High-Dimensional Similarity Search using Data-Sensitive Space Partitioning†

Sachin Kulkarni¹ and Ratko Orlandic²
¹ Illinois Institute of Technology, Chicago
² University of Illinois at Springfield

Database and Expert Systems Applications 2006

† Work supported by the NSF under grant no. IIS-0312266.
Outline
• Problem Definition
• Existing Solutions
• Our Goal
• Design Principle
• GardenHD Clustering and Γ Partitioning
• System Architecture and Processes
• Results
• Conclusions
Problem Definition
• Consider a database of club addresses
• Typical queries:
  – Find all the clubs within 35 miles of 10 West 31st Street, Chicago
  – Find the 5 nearest clubs
[Figure: clubs plotted as points in a two-dimensional space with axes d1 and d2]
Problem Definition
• k-Nearest Neighbor (k-NN) Search:
  – Given a database of N points and a query point q in some metric space, find the k ≥ 1 points closest to q [1] (a brute-force baseline is sketched below)
• Applications:
  – Computational geometry
  – Geographic information systems (GIS)
  – Multimedia databases
  – Data mining
  – Etc.
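For reference, a minimal brute-force baseline (this is the sequential scan the later slides compare against, not the proposed method; the names knn_scan, points, and q are illustrative):

```python
import heapq
import math

def knn_scan(points, q, k):
    """Brute-force k-NN: scan every point and keep the k closest to q.
    O(N * D) time, which is what indexing structures try to beat."""
    best = []  # max-heap of (-distance, point), holding at most k entries
    for p in points:
        d = math.dist(p, q)  # Euclidean distance; any metric would do
        if len(best) < k:
            heapq.heappush(best, (-d, p))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, p))
    return sorted((-nd, p) for nd, p in best)  # (distance, point), nearest first

# Example: the five nearest clubs to a query location
# five_nearest = knn_scan(club_coordinates, query_point, k=5)
```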
Challenge of k-NN Search
• In high-dimensional feature spaces, indexing structures face the problem of dead space (KDB-trees) or overlaps (R-trees).
• Volume and area grow exponentially with the number of dimensions (see the numeric illustration below).
• Finding the k-NN points is costly.
• Traditional access methods perform on par with a sequential scan – the “curse of dimensionality”.
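The volume effect is easy to check numerically: the ball inscribed in the unit cube [0,1]^D occupies a vanishing fraction of the cube as D grows, so a fixed-radius query hypersphere misses almost all of the space. A small check using the standard ball-volume formula (not from the slides):

```python
import math

def inscribed_ball_fraction(d, r=0.5):
    """Volume of the radius-r ball in d dimensions relative to the unit cube
    (whose volume is 1): pi^(d/2) * r^d / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

for d in (2, 10, 20, 100):
    print(d, inscribed_ball_fraction(d))
# ~0.785 at D=2, ~0.0025 at D=10, ~2.5e-8 at D=20: nearly all of the cube's
# volume sits in its corners, far from the center of any query hypersphere.
```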
Existing Solutions
• Approximation and dimensionality reduction
• Exact nearest-neighbor solutions:
  – R-tree
  – SS-tree
  – SR-tree
  – VA-File
  – A-tree
  – iDistance
• Significant effort in finding the exact nearest neighbors has yielded limited success.
Goal
• Our goals:
  – Scalability with respect to dimensionality
  – Acceptable pre-processing (data-loading) time
  – Ability to work on incremental loads of data
Our Solution
• Clustering
• Space partitioning
• Indexing
[Figure: points in the unit square grouped into clusters for storage and indexing]
Design Principle
• “Multi-dimensional data must be grouped on storage in a way that minimizes the extensions of storage clusters along all relevant dimensions and achieves high storage utilization.”
What Does It Imply?
• Storage organization must maximize the densities of storage clusters
• Reduce their internal empty space
• Improve search performance even before the retrieval process hits persistent storage
• For best results, employ a genuine clustering algorithm
Achieving the Principles
• Data space reduction:
  – Detecting dense areas (dense cells) in the space with minimum amounts of empty space.
• Data clustering:
  – Detecting the largest areas with the above-mentioned property, called data clusters.
GardenHD Clustering
• Motivated by the stated principle.
• Efficiently and effectively separates disjoint areas with points.
• A hybrid of cell- and density-based clustering that operates in two phases:
  – Recursive space partitioning (Γ partitioning).
  – Merging of dense cells (a rough sketch follows).
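As a rough, hypothetical illustration of the cell-and-merge idea only (GardenHD's actual detection and merging of dense cells operate on the Γ partition and are described in [3]; width and min_count below are assumed parameters):

```python
from collections import defaultdict, deque

def dense_cell_clusters(points, width=0.1, min_count=10):
    """Sketch of cell/density-based clustering: bucket points into a grid,
    keep cells holding at least min_count points, then merge adjacent
    dense cells into clusters by breadth-first search."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(int(x / width) for x in p)].append(p)
    dense = {c for c, pts in cells.items() if len(pts) >= min_count}
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        group, queue = [], deque([start])
        seen.add(start)
        while queue:
            c = queue.popleft()
            group.append(c)
            for i in range(len(c)):          # neighbors differing by +-1
                for step in (-1, 1):         # in exactly one dimension
                    n = c[:i] + (c[i] + step,) + c[i + 1:]
                    if n in dense and n not in seen:
                        seen.add(n)
                        queue.append(n)
        clusters.append([p for c in group for p in cells[c]])
    return clusters
```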
Γ Partitioning
• G = number of generators, D = number of dimensions
• Number of regions = 1 + (G – 1)·D (a hypothetical point-to-region mapping is sketched below)
• The space partition is compactly represented by a Γ filter (in memory).
[Figure: a two-dimensional Γ partition of the unit square showing Γ regions 0–4 grouped into Γ subspaces]
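The counting formula is direct to reproduce; the point-to-region mapping below is only one plausible reading consistent with it, assuming the generators are increasing cut values shared by all axes with generators[0] = 0 (the exact region geometry is defined in [2]):

```python
def gamma_region_count(G, D):
    """Number of Gamma regions for G generators in D dimensions: 1 + (G-1)*D."""
    return 1 + (G - 1) * D

def gamma_region(point, generators):
    """Hypothetical point-to-region map consistent with the count above.
    Region 0 is the innermost hypercube; each outer 'Gamma shell' i
    contributes D regions, one per first dimension reaching that cut.
    Illustrative only, not the exact geometry of [2]."""
    D = len(point)
    for i in range(len(generators) - 1, 0, -1):  # outermost shell first
        for j, x in enumerate(point):
            if x >= generators[i]:
                return 1 + (i - 1) * D + j
    return 0  # every coordinate falls below the first cut

# Example: G = 3 generators in a 2-D unit space -> 1 + 2*2 = 5 regions,
# and gamma_region((0.9, 0.2), [0.0, 0.5, 0.8]) lands in region 3.
```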
Data-Sensitive Gamma Partition
• DSGP: Data-Sensitive Gamma Partition.
[Figure: a data-sensitive Γ partition with regions 1–4, each captured by a KDB-tree, and the effective boundaries drawn tightly around the data]
System Architecture
[Diagram: Data Clustering → “Data-Sensitive” Space Partitioning → Data Loading / Incremental Data Loading → Data Retrieval (Region Search and Similarity Search)]
Basic Processes
• Each region in space is represented by a separate KDB-tree
  – KDB-trees perform implicit slicing
• Initial and incremental loading of data
  – Dynamic assignment of multi-dimensional data to index pages
• Retrieval
  – Region and k-nearest-neighbor search
  – Several stages of refinement
Similarity Search – GammaNN
• Nearest-neighbor search using GammaNN (a pruning sketch follows).
[Figure: a query point and its query hypersphere over the partitioned space, with region representatives and the clipped portions still to be queried]
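A sketch of the clipping idea under stated assumptions: regions (a map from region id to its points, standing in for the per-region KDB-trees) and min_dist_to_region (a lower bound on the distance from q to any point of a region, e.g. derived from its effective boundaries) are placeholders, and the staged refinement of the actual GammaNN algorithm is not modeled:

```python
import heapq
import math

def gamma_nn(regions, q, k, min_dist_to_region):
    """Region-pruned k-NN sketch: visit regions nearest-first and skip any
    region lying entirely outside the current query hypersphere."""
    best = []  # max-heap of (-distance, point), holding at most k entries
    def consider(p):
        d = math.dist(p, q)
        if len(best) < k:
            heapq.heappush(best, (-d, p))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, p))
    for rid in sorted(regions, key=lambda r: min_dist_to_region(q, r)):
        # once k candidates are found, clip away any region farther than
        # the current k-th nearest distance (the hypersphere radius)
        if len(best) == k and min_dist_to_region(q, rid) > -best[0][0]:
            break
        for p in regions[rid]:
            consider(p)
    return sorted((-nd, p) for nd, p in best)
```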
Region Search
[Figure: a region search over the partition, touching regions 1–4]
Experimental Setup
• PC with a 3.6 GHz CPU, 3 GB RAM, and a 280 GB disk.
• Page size was 8 KB.
• Normalized D-dimensional space [0,1]^D.
• The GammaNN implementations with and without explicit clustering are referred to here as the ‘data-aware’ and ‘data-blind’ algorithms, respectively.
• Comparison with sequential scan and the VA-File.
Datasets
• Synthetic data:
  – Up to 100 dimensions, 100,000 points.
  – Distributed across 11 clusters: one in the center and 10 in random corners of the space.
• Real data:
  – 54-dimensional, 580,900 points, forest cover type (“covtype”).
  – Distributed across 11 different classes.
  – From the UCI Machine Learning Repository.
Metrics
• Pre-processing time
  – Time for space partitioning, I/O, and data loading (i.e., construction of the indices plus insertion of the data).
  – For the VA-File, only the time to generate the vector-approximation file.
• Performance
  – Average page accesses for k-NN queries.
  – Time to process k-NN queries.
Experimental Results
[Chart: pre-processing time in seconds for the three algorithms (Data Aware, Data Blind, VA-File) on covtype data, 54 dimensions, 580,900 points]
Performance Synthetic Data
[Charts: cumulative time in seconds and average page accesses for 100 queries (10-NN, synthetic data) versus 10–100 dimensions, comparing Sequential Scan, VA-File, Data Blind, and Data Aware]
Performance Real Data
[Charts: cumulative time in seconds and average page accesses for 100 queries (10-NN, real data), comparing Sequential Scan, Data Blind, VA-File, and Data Aware]
Progress with k in k-NN
[Charts: time in seconds and average page accesses versus the number of nearest neighbors k (1, 10, 100) on real data, comparing Sequential Scan, VA-File, Data Blind, and Data Aware]
Incremental Load of Data
[Charts: cumulative time in seconds and average page accesses versus the number of points (200k–500k), comparing Data Aware incremental load against Data Aware full load]
Conclusions
• A comparison of the data-sensitive and data-blind approaches clearly highlights the importance of clustering data on storage for efficient similarity search.
• Our approach can support exact similarity search while accessing only a small fraction of the data.
• The algorithm is very efficient at high dimensionalities and performs better than both sequential scan and the VA-File technique.
• Performance remains good even after incremental loads of data without re-clustering.
Current and Future Work
• Incorporate R-trees or A-trees in place of KDB-trees.
• Provide a facility for handling data with missing values.
References
1. Fagin, R., Kumar, R., Sivakumar, D.: Efficient similarity search and classification via rank aggregation. Proc. ACM SIGMOD Conf. (2003) 301–312
2. Orlandic, R., Lukaszuk, J.: Efficient high-dimensional indexing by superimposing space-partitioning schemes. Proc. 8th International Database Engineering & Applications Symposium IDEAS’04 (2004) 257–264
3. Orlandic, R., Lai, Y., Yee, W.G.: Clustering high-dimensional data using an efficient and effective data space reduction. Proc. ACM Conference on Information and Knowledge Management CIKM’05 (2005) 201–208
4. Jagadish, H.V., Ooi, B.C., Tan, K.L., Yu, C., Zhang, R.: iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems 30(2) (2005) 364–395
5. Weber, R., Schek, H.J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proc. 24th VLDB Conf. (1998) 194–205
6. Sakurai, Y., Yoshikawa, M., Uemura, S., Kojima, H.: The A-tree: an index structure for high-dimensional spaces using relative approximation. Proc. 26th VLDB Conf. (2000) 516–526
Questions?
[email protected]
http://cs.iit.edu/~egalite