
Multimedia and Time-Series Data
When Is “Nearest Neighbor” Meaningful?
Group members: Terry Chan, Edward Chu, Dominic Leung, David Mak, Henry Yeung, Jason Yeung
CSIS 7101 Advanced Database
Contribution
• As dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor
• The distinction between nearest and farthest neighbors may blur with as few as 15 dimensions
• A linear scan often outperforms nearest-neighbor processing techniques
Part One
Introduction to Nearest Neighbor
What is the nearest neighbor (NN) problem?
• Given a collection of data points and a query point in an m-dimensional metric space, find the data point that is closest to the query point
Nearest neighbor algorithm
1. Make two sets of nodes, set A and set B, and put all nodes into set B
2. Put your starting node into set A
3. Pick the node that is closest to the last node placed in set A and is not yet in set A; put this closest neighboring node into set A
4. Repeat step 3 until all nodes are in set A (a sketch of this procedure is given below)
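The steps above can be expressed compactly in code. Below is a minimal Python sketch; the function name nearest_neighbor_chain and the use of Euclidean distance are our own illustrative choices, not part of the slides.

```python
# A minimal sketch of the slide's step 1-4 procedure (illustrative names,
# Euclidean distance assumed). Requires Python 3.8+ for math.dist.
import math

def nearest_neighbor_chain(points, start=0):
    """Greedily visit all points, always moving to the closest unvisited one."""
    set_b = set(range(len(points)))        # step 1: all nodes start in set B
    set_a = [start]                        # step 2: starting node goes into set A
    set_b.remove(start)
    while set_b:                           # step 4: repeat until set B is empty
        last = points[set_a[-1]]
        # step 3: closest node to the last node placed in set A, not yet in set A
        nxt = min(set_b, key=lambda i: math.dist(last, points[i]))
        set_a.append(nxt)
        set_b.remove(nxt)
    return set_a

# Example: visit order for four 2-D points, starting from index 0.
print(nearest_neighbor_chain([(0, 0), (5, 5), (1, 0), (1, 1)]))   # -> [0, 2, 3, 1]
```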
[Figure: a query point and its nearest neighbor (NN)]
Practical Applications of NN Search
• Medical Imaging
• Molecular Biology
• Spatial and Multimedia databases
Adaptable Similarity Approach
• In a multimedia database, e.g. given an image database
• one may want to retrieve all images that are similar to a given query image
• the data domain is high-dimensional
Color-oriented similarity of images
• On an image database
Shape-oriented similarity of images
• Aims at the level of individual pixels
Shape-oriented similarity of 3-D objects
• On a 3-D protein database
Approximation-based shape similarity of 3-D surface segments
• Measures the similarity of 3-D segments by using geometric approximations
Exception Case
• The distance from the query point to the nearest neighbor differs little from the distance to any other point in the data set
• We call such a query an unstable query
Part Two
Nearest Neighbor in High-Dimensional Space
Unstable query
• A nearest neighbor query is unstable for a given ε if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor
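In symbols (our notation, not from the slides): writing D for the data set and D_min for the distance to the nearest neighbor, the query q is ε-unstable when almost all points fall inside the (1 + ε)-enlarged nearest-neighbor ball:

\[
D_{\min} = \min_{x \in D} d(q, x), \qquad
\bigl|\{\, x \in D : d(q, x) \le (1 + \varepsilon)\, D_{\min} \,\}\bigr| \approx |D| .
\]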
NN in High-Dimensional Space
• The paper proves that the concept of NN becomes meaningless as the dimensionality (m) increases
• provided a pre-condition holds: as m increases, the difference in distance between the query point and all data points becomes negligible (i.e., the query becomes unstable); see the simulation sketch below
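As a rough illustration of this pre-condition (our own simulation, not an experiment from the slides), the following Python sketch draws IID uniform points and shows the farthest-to-nearest distance ratio shrinking toward 1 as m grows; all names and parameters are illustrative.

```python
# A rough simulation (not from the slides) of the pre-condition: for IID
# uniform data, the farthest-to-nearest distance ratio approaches 1 as the
# dimensionality m grows, so the NN query becomes unstable.
import math
import random

def farthest_to_nearest_ratio(m, n_points=1000, seed=0):
    rng = random.Random(seed)
    query = [rng.random() for _ in range(m)]
    dists = [math.dist(query, [rng.random() for _ in range(m)])
             for _ in range(n_points)]
    return max(dists) / min(dists)

for m in (2, 10, 20, 100, 500):
    print(f"m = {m:4d}   DMAX/DMIN = {farthest_to_nearest_ratio(m):.2f}")
```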
Assumption for the pre-condition to hold
• The data distribution and the query distribution are IID in all dimensions
• Unique dimensions with correlation between all dimensions
What is IID?
• Independent and identically distributed
• It means that the distribution of values in each dimension is identical (i.e., all values are uniformly distributed, or all dimensions have the same skew) and the dimensions are independent
High-dimensional indexing can be meaningful
• When the dimensions of both the query point and the data points follow identical distributions, but are completely dependent (i.e., the value in D1 = the value in D2 = …)
• The result is a set of data points and a query point lying on a diagonal line
• The underlying query can actually be converted to a 1-D NN problem, as the reduction below shows
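The reduction can be read off the distance formula (our notation): under complete dependence every coordinate equals the first one, so

\[
d(q, x) \;=\; \sqrt{\sum_{i=1}^{m} (q_i - x_i)^2} \;=\; \sqrt{m}\,\lvert q_1 - x_1 \rvert ,
\]

hence ranking points by d(q, x) is the same as ranking them by |q_1 − x_1|, i.e., a 1-D NN query.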
Graphical View
• All dimensions have the same value
• All data points lie on the diagonal
[Figure: data points on the diagonal of the X-Y-Z space]
High-dimensional indexing can be meaningful (Cont'd)
• The underlying dimensionality is much lower than the actual dimensionality
• E.g.: it is a 3-D data set, but the data always have the same Z coordinate
High-dimensional indexing can be meaningful (Cont'd)
• When the query point is within some small distance of a data point (instead of being required to be identical to a data point)
• The result of the query is then to return all points within the closest cluster, not just the nearest point
NN query in clustered data
• E.g.: data falls into discrete classes or clusters in some potentially high-dimensional feature space
[Figure: a query point and its nearest cluster]
Distribution of distances in clustered data
• Points in the query's own cluster are close (NN meaningful)
• Points in other clusters are all far
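A minimal Python sketch of this idea, assuming the clusters are already known and the meaningful answer is the entire nearest cluster rather than a single point (function and variable names are illustrative):

```python
# A minimal sketch (illustrative names) of the clustered-data case: return
# every point in the nearest cluster instead of a single nearest point.
import math

def nearest_cluster_query(query, clusters):
    """clusters maps a cluster id to a list of points (coordinate tuples)."""
    def centroid(points):
        return [sum(coord) / len(points) for coord in zip(*points)]
    best = min(clusters, key=lambda cid: math.dist(query, centroid(clusters[cid])))
    return best, clusters[best]            # the whole nearest cluster is the answer

clusters = {"A": [(0, 0), (0.1, 0.2)], "B": [(9, 9), (9.2, 8.8)]}
print(nearest_cluster_query((0.3, 0.1), clusters))   # -> ('A', [(0, 0), (0.1, 0.2)])
```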
Experimental studies of NN
• Want to find out the rate of convergence
• Based on 3 synthetic workloads and one real data set
• NN can become unstable with as few as 10-20 dimensions
• The convergence graph is exponential
• In reality, the dimensionality might be 1000
Correlated Distributions
• Recursive and uniform workloads (NN not meaningful)
• Two-degrees-of-freedom workload (NN meaningful)
Part Three
Linear Scan is powerful…
NN indexing vs. linear scan
• Linear scan can handily beat NN indexing
• NN indexing is meaningful when the data consists of small, well-formed clusters
• and the query is guaranteed to land in or very near one of these clusters
Why linear scan?
• Reading a set of sequentially arranged disk pages is much faster than unordered retrieval of the same pages
• Fetching a large number of data pages through a multi-dimensional index usually results in unordered retrieval (a linear-scan sketch is given below)
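For comparison, a linear scan is just one sequential pass that keeps the best candidate seen so far. The following minimal Python sketch (illustrative names, with an in-memory list standing in for sequentially laid-out pages) shows the access pattern:

```python
# A minimal sketch of a linear-scan NN query: one sequential pass that keeps
# the closest point seen so far.
import math

def linear_scan_nn(query, data_points):
    best_point, best_dist = None, float("inf")
    for point in data_points:              # sequential access pattern
        d = math.dist(query, point)
        if d < best_dist:
            best_point, best_dist = point, d
    return best_point, best_dist

print(linear_scan_nn((0.5, 0.5), [(0, 0), (1, 1), (0.4, 0.6)]))
```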
Linear scan outperforms
• Both the SS-tree and the R*-tree at 10 dimensions in all cases
• The SR-tree in all cases on a 16-dimensional synthetic data set
Justification
• All the reported performance studies examined situations in which the distance from the query point to the NN differed little from the distance to other data points
• In reality, it might be different
Other related work
• Dimensionality Curse
• Fractal Dimensions
Dimensionality Curse
• A vague indication that high dimensionality causes problems in some situations
• Examples:
  • The NN problem
  • "Boundary effects" are not taken into account in NN queries in the high-dimensional case
Fractal Dimensions
• A measure of how "complicated" a self-similar figure (the data) is
• NN queries become stable when the fractal dimensionality is low
• In reality, however, real data sets do not exhibit fractal behavior
Conclusion
• The effect of dimensionality on NN queries
• High-dimensional indexing can be meaningful
• How to evaluate an NN workload
• Linear scan outperforms NN processing techniques on some meaningful workloads
References
• Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft. When Is "Nearest Neighbor" Meaningful?
• Thomas Seidl. Adaptable Similarity Search in 3-D Spatial Database Systems.
• http://www.dbs.informatik.uni-muenchen.de/Forschung/Similarity/