Transcript FROM R
1
39
Christian Böhm
University for Health Informatics and Technology, Innsbruck
Similarity Search and Data Mining:
Database Techniques Supporting
Next Decade's Applications
Keynote at iiWAS 2002
2
39
Similarity Search
3
39
Feature Based Similarity
4
39
Simple Similarity Queries
Specify query object and
• Find similar objects – range query
• Find the k most similar objects – nearest neighbor q.
5
39
Multidimensional Index Structure (R-tree)
Directory
Data
Page: Page:
rectangle
1, address1
point
1: x11, x12, x13, ...
rectangle
2,xaddress
2
point
:
x
,
2 21 22, x23, ...
rectangle
3,xaddress
3
point
:
x
,
3 31 32, x33, ...
rectangle4, address4
6
39
Range Query with Depth-First Traversal
7
39
Nearest Neighbor: Priority Algorithm
[Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995]
4 page accesses
8
39
Problems of High-Dim. Index Structures
„Curse of dimensionality“:
• Search performance of index deteriorates in high dim.
• Outperformed by sequential scan
Solution
• Optimize various parameters of index structures
Needed: Cost model for queries
How many pages are expected to be accessed for
• Range queries (with given e)
• Nearest neighbor queries (with given k)
9
39
Cost Estimation (Uniformity/Independence)
Minkowski sum:
Estimation of the access probability of a page
[Böhm: A Cost Model for Query Processing in High-Dimensional Data Spaces, TODS 25(2), 2000]
Nearest neighbor: Estimate distance by point density
10
39
Cost Estimation
Boundary and saturation effects in high dim. space
(considered by our model extension)
Correlation between attributes
(considered by the concept of fractal dimension)
Cluster structure has also impact on performance
• Currently neglected by our model
• Histograms and similar data descriptions difficult in
high-dimensional space
(number of histo-bins exponential in dimensionality)
• Other descriptions of cluster structure (dendrograms)
Subject to future work
11
39
Optimization of Index Structures
To avoid the possibility to outperform index based
query processing by the sequential scan:
Optimize various parameters such as
•
•
•
•
Logical block size of the index pages
Indexed dimension
I/O schedule optimization (fast index scan)
Data quantization
Observe the balance! (Master Confucius)
12
39
Page Size Optimization
13
39
Page Size Optimization
14
39
Optimized Dimension Assignment
Hi-dim. Index
R-tree
Problem in hi-dim:
Too few splits in
each dimension
Inverted List
B-tree
Problem in hi-dim:
Too many results
in each dimension
Matching
15
39
Optimized Dimension Assignment
Hi-dim. Index
R-tree
Inverted List
B-tree
Compromise:
A moderate number of R-trees
each indexing a few dimensions
Matching
OPTIMIZE!
16
39
Schedule Optimization (Fast Index Scan)
Range Query: Required Pages are known from the directory
17
39
Schedule Optimization (NN Queries)
Current expenses are traded for possible later savings
Start at 100% page and extend forward and backward
Optimize the cumulated cost balance (CCB):
18
39
Quantization
Approximate the points by
quantization grid based on quantiles
Benefit:fewer bits for representation
Cost: Grid cell partially intersected
access the original point data
How to choose grid resolution ???
[Weber, Schek, Blott: A Quantitative Analysis and Performance Study..., VLDB 1998]
19
39
Independent Quantization (IQ tree)
Combines index, scan, and quantization
[Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000]
Grid resolution optimized by cost model
20
39
Open Research Problems in Optimization
Multi-Parameter Optimization:
• How can parameters be optimized simultaneously?
• Are there conflicts between optimization goals?
Example:
Uniform data:
Quantization
Correlated data:
Tree Striping
21
39
Open Research Problems in Optimization
Consider Insert/Delete/Update:
If the data set faces heavy update, the constructed
index should look differently compared with more
static data sets
• Update-bound: Construct index rather simple
• Query-bound: Spend more effort to organize data
Can be considered as an optimization problem
22
39
Data Mining
23
39
KDD Algorithms Based on Similarity Queries
LOF Simultan.
Nearest
Dist. Neighbor
OPTICS Based Classific.
Outliers
DBSCAN
....
....
....
Spatial
Trend
Detect.
Spatial
Assoc.
Rules
24
39
Join Applikationen
Katalogkonversion (Catalogue Matching)
• z.B. Astronomie-Kataloge
R
S
25
39
Clustering
Clustering (e.g. DBSCAN)
[Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering Clusters, KDD 1996]
26
39
Cache Behavior
27
39
Clustering and Similarity Join
DBSCAN uses similarity join as basic operations
[Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join, CIKM 2000]
28
39
k-Nearest Neighbor Classification
Example:
k=3
Objects with known class
New objects
•
New objects
Known objects
29
39
Distance Range Join (e-Join)
•
•
Most widespread and best evaluated join
Often also called the similarity join
30
39
k-Closest Pair Query
In SQL notation: SELECT * FROM R, S
ORDER BY ||R.obj - S.obj||
STOP AFTER k
31
39
k-Nearest Neighbor Join
In SQL notation: SELECT * FROM R, S
(limited to k = 1) GROUP BY R.obj
ORDER BY ||R.obj - S.obj||
STOP AFTER K (* k *)
32
39
R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, e)
if IsDirpg (R) IsDirpg (S) then
foreach r R.children do
foreach s S.children do
if mindist (r,s) e then
CacheLoad(r); CacheLoad(s);
r_tree_sim_join (r,s,e) ;
else (* assume R,S both DataPg *)
foreach p R.points do
foreach q S.points do
if |p - q| e then report (p,q);
R
e
S
33
39
Modeling and Optimization
[Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday, 1630]
Mating probability of index pages:
Probability that distance between two pages e
Two-fold application of Minkowski sum
34
39
Modeling and Optimization
I/O cost:
• High const. cost per page
• Large capacity optimum
CPU cost:
• Low const. cost per page
• Low capacity optimum
CPU-performance like CPU optimized index
I/O- performance like I/O optimized index
35
39
Open Problems for Research (Sim. Join)
Modeling and Optimization:
•
•
•
•
Dimension
Quantization
Page scheduling
Caching strategies
Nearest Neighbor Join
• Applications
• Algorithms
General
• Integration into object-relational DBMS
36
39
New Challenges
New Challenges
Incertain Features:
Application:
• Biometric Identification
Particularities:
Relative probability
37
39
• Features individually associated with incertainty
(e.g. as Gaussian distributions)
Queries:
• Probability of match
• Find objects with highes probability of match
• Find objects with probability of match >= e
Feature a1
38
39
New Challenges
Support of e-commerce in all phases
• Marketing
• Sales and booking
• Add-on products
Advanced Similarity
•
•
•
•
Adaptable
Multimodal models
Relevance-feedback
Convex hull
customer segmentation
advanced similarity search
Sales transaction analysis
39
39
New Challenges
Stock quota: Technical chart analysis
Known: Database techniques for similarity search
in time sequences (DFT, etc.)
40
39
New Challenges
Professional analyst tools use:
• Trading signals generated by indicators (etc. MACD)
• Formations indicating trends in charts
• Relationships to the market and to derivatives
41
39
Conclusion
Database primitives: abstraction from application:
Similarity Search Range Queries
Nearest Neighbor Queries
Clustering
Classification
Similarity Join
Outlier Detection
Advantages
• General solution, reuse
• Separately optimizable