FROM R

Transcript FROM R

1
39
Christian Böhm
University for Health Informatics and Technology, Innsbruck
Similarity Search and Data Mining:
Database Techniques Supporting
Next Decade's Applications
Keynote at iiWAS 2002
2
39
Similarity Search
3
39
Feature Based Similarity
4
39
Simple Similarity Queries
 Specify query object and
• Find similar objects – range query
• Find the k most similar objects – nearest neighbor q.
5
39
Multidimensional Index Structure (R-tree)
Directory
Data
Page: Page:
rectangle
1, address1
point
1: x11, x12, x13, ...
rectangle
2,xaddress
2
point
:
x
,
2 21 22, x23, ...
rectangle
3,xaddress
3
point
:
x
,
3 31 32, x33, ...
rectangle4, address4
6
39
Range Query with Depth-First Traversal
7
39
Nearest Neighbor: Priority Algorithm
[Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995]
4 page accesses
8
39
Problems of High-Dim. Index Structures
 „Curse of dimensionality“:
• Search performance of index deteriorates in high dim.
• Outperformed by sequential scan
 Solution
• Optimize various parameters of index structures
 Needed: Cost model for queries
How many pages are expected to be accessed for
• Range queries (with given e)
• Nearest neighbor queries (with given k)
9
39
Cost Estimation (Uniformity/Independence)
 Minkowski sum:
Estimation of the access probability of a page
[Böhm: A Cost Model for Query Processing in High-Dimensional Data Spaces, TODS 25(2), 2000]
Nearest neighbor: Estimate distance by point density
10
39
Cost Estimation
 Boundary and saturation effects in high dim. space
(considered by our model extension)
 Correlation between attributes
(considered by the concept of fractal dimension)
 Cluster structure has also impact on performance
• Currently neglected by our model
• Histograms and similar data descriptions difficult in
high-dimensional space
(number of histo-bins exponential in dimensionality)
• Other descriptions of cluster structure (dendrograms)
 Subject to future work
11
39
Optimization of Index Structures
 To avoid the possibility to outperform index based
query processing by the sequential scan:
 Optimize various parameters such as
•
•
•
•
Logical block size of the index pages
Indexed dimension
I/O schedule optimization (fast index scan)
Data quantization
 Observe the balance! (Master Confucius)
12
39
Page Size Optimization
13
39
Page Size Optimization
14
39
Optimized Dimension Assignment
Hi-dim. Index
R-tree
Problem in hi-dim:
Too few splits in
each dimension
Inverted List
B-tree
Problem in hi-dim:
Too many results
in each dimension
Matching
15
39
Optimized Dimension Assignment
Hi-dim. Index
R-tree
Inverted List
B-tree
Compromise:
A moderate number of R-trees
each indexing a few dimensions
Matching
OPTIMIZE!
16
39
Schedule Optimization (Fast Index Scan)
Range Query: Required Pages are known from the directory
17
39
Schedule Optimization (NN Queries)
 Current expenses are traded for possible later savings
 Start at 100% page and extend forward and backward
 Optimize the cumulated cost balance (CCB):
18
39
Quantization
 Approximate the points by
quantization grid based on quantiles
 Benefit:fewer bits for representation
 Cost: Grid cell partially intersected
 access the original point data
 How to choose grid resolution ???
[Weber, Schek, Blott: A Quantitative Analysis and Performance Study..., VLDB 1998]
19
39
Independent Quantization (IQ tree)
Combines index, scan, and quantization
[Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000]
Grid resolution optimized by cost model
20
39
Open Research Problems in Optimization
 Multi-Parameter Optimization:
• How can parameters be optimized simultaneously?
• Are there conflicts between optimization goals?
Example:
Uniform data:
Quantization
Correlated data:
Tree Striping
21
39
Open Research Problems in Optimization
 Consider Insert/Delete/Update:
 If the data set faces heavy update, the constructed
index should look differently compared with more
static data sets
• Update-bound: Construct index rather simple
• Query-bound: Spend more effort to organize data
 Can be considered as an optimization problem
22
39
Data Mining
23
39
KDD Algorithms Based on Similarity Queries
LOF Simultan.
Nearest
Dist. Neighbor
OPTICS Based Classific.
Outliers
DBSCAN
....
....
....
Spatial
Trend
Detect.
Spatial
Assoc.
Rules
24
39
Join Applikationen
 Katalogkonversion (Catalogue Matching)
• z.B. Astronomie-Kataloge
R
S
25
39
Clustering
 Clustering (e.g. DBSCAN)
[Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering Clusters, KDD 1996]
26
39
Cache Behavior
27
39
Clustering and Similarity Join
 DBSCAN uses similarity join as basic operations
[Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join, CIKM 2000]
28
39
k-Nearest Neighbor Classification
 Example:
k=3
Objects with known class
New objects
•
New objects
Known objects
29
39
Distance Range Join (e-Join)
•
•
Most widespread and best evaluated join
Often also called the similarity join
30
39
k-Closest Pair Query
In SQL notation: SELECT * FROM R, S
ORDER BY ||R.obj - S.obj||
STOP AFTER k
31
39
k-Nearest Neighbor Join
In SQL notation: SELECT * FROM R, S
(limited to k = 1) GROUP BY R.obj
ORDER BY ||R.obj - S.obj||
STOP AFTER K (*  k *)
32
39
R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, e)
if IsDirpg (R)  IsDirpg (S) then
foreach r  R.children do
foreach s  S.children do
if mindist (r,s)  e then
CacheLoad(r); CacheLoad(s);
r_tree_sim_join (r,s,e) ;
else (* assume R,S both DataPg *)
foreach p  R.points do
foreach q  S.points do
if |p - q| e then report (p,q);
R
e
S
33
39
Modeling and Optimization
[Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday, 1630]
 Mating probability of index pages:
 Probability that distance between two pages e
 Two-fold application of Minkowski sum
34
39
Modeling and Optimization
 I/O cost:
• High const. cost per page
• Large capacity optimum
 CPU cost:
• Low const. cost per page
• Low capacity optimum
 CPU-performance like CPU optimized index
 I/O- performance like I/O optimized index
35
39
Open Problems for Research (Sim. Join)
 Modeling and Optimization:
•
•
•
•
Dimension
Quantization
Page scheduling
Caching strategies
 Nearest Neighbor Join
• Applications
• Algorithms
 General
• Integration into object-relational DBMS
36
39
New Challenges
New Challenges
Incertain Features:
 Application:
• Biometric Identification
 Particularities:
Relative probability
37
39
• Features individually associated with incertainty
(e.g. as Gaussian distributions)
 Queries:
• Probability of match
• Find objects with highes probability of match
• Find objects with probability of match >= e
Feature a1
38
39
New Challenges
Support of e-commerce in all phases
• Marketing
• Sales and booking
• Add-on products
 Advanced Similarity
•
•
•
•
Adaptable
Multimodal models
Relevance-feedback
Convex hull
 customer segmentation
 advanced similarity search
 Sales transaction analysis
39
39
New Challenges
Stock quota: Technical chart analysis
 Known: Database techniques for similarity search
in time sequences (DFT, etc.)
40
39
New Challenges
 Professional analyst tools use:
• Trading signals generated by indicators (etc. MACD)
• Formations indicating trends in charts
• Relationships to the market and to derivatives
41
39
Conclusion
 Database primitives: abstraction from application:
Similarity Search  Range Queries
Nearest Neighbor Queries
Clustering
Classification
 Similarity Join
Outlier Detection
 Advantages
• General solution, reuse
• Separately optimizable