Multimedia and Time-Series Data
When Is “Nearest Neighbor” Meaningful?
Group members: Terry Chan, Edward Chu, Dominic Leung, David Mak, Henry Yeung, Jason Yeung
CSIS 7101 Advanced Database
Contribution
As dimensionality increases, the distance to the
nearest neighbor approaches the distance
to the farthest neighbor
The distinction between the nearest and farthest
neighbors may blur with as few as 15
dimensions
A linear scan almost always outperforms nearest
neighbor processing techniques
Part One
Introduction to Nearest Neighbor
What is the nearest neighbor (NN)
problem?
Given a collection of data points and a
query point in an m-dimensional metric
space, find the data point that is
closest to the query point
Nearest neighbor algorithm
1. Create two sets of nodes, set A and set B,
and put all nodes into set B
2. Move your starting node into set A
3. Pick the node that is closest to the last
node placed in set A and is not yet
in set A; move this closest neighboring node
into set A
4. Repeat step 3 until all nodes are in set A
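The four steps above can be sketched in Python; this is a minimal illustration (the function name and sample points are ours, not from the slides):

```python
import math

def greedy_nn_tour(points, start=0):
    """Greedy nearest-neighbor construction, as in steps 1-4 above:
    repeatedly move the closest remaining node from set B into set A."""
    unvisited = set(range(len(points)))   # set B: all nodes
    tour = [start]                        # set A: starts with the start node
    unvisited.remove(start)
    while unvisited:
        last = points[tour[-1]]
        # step 3: the unvisited node closest to the last node placed in A
        nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

print(greedy_nn_tour([(0, 0), (5, 5), (1, 0), (1, 1)]))  # [0, 2, 3, 1]
```

From (0, 0) the closest unvisited node is (1, 0), then (1, 1), and finally (5, 5), so the tour visits indices 0, 2, 3, 1.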
Query point and its nearest neighbor
Practical Applications of NN Search
Medical Imaging
Molecular Biology
Spatial and Multimedia databases
Adaptable Similarity Approach
In a multimedia database, e.g. an
image database
one may want to retrieve all images that
are similar to a given query image
the data domain is high-dimensional
Color-oriented similarity of images
On Image Database
Shape-oriented similarity of images
Aims at the level of individual pixels
Shape-oriented similarity of 3-D
objects
On 3-D protein database
Approximation-based Shape
similarity of 3-D surface segments
Measures the similarity of 3-D segments by
using geometric approximations
Exception Case
The difference between the distance to the nearest
neighbor and the distance to any other point in the data set is small
We call this an unstable query
Part Two
Nearest Neighbor in high-dimensional space
Unstable query
A nearest neighbor query is unstable
for a given ε if the distance from the
query point to most data points is less
than (1 + ε) times the distance from
the query point to its nearest neighbor
NN in High-Dimensional Space
The paper proves that the concept of NN
becomes meaningless as
dimensionality (m) increases
If a pre-condition holds: as m
increases, the difference in distance
between the query point and all data
points becomes negligible (i.e., the
query becomes unstable)
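This concentration effect is easy to observe empirically. A minimal Python sketch (function name and workload sizes are ours): for IID uniform data, the ratio of the farthest to the nearest distance from a random query point shrinks toward 1 as dimensionality grows.

```python
import math
import random

def nearest_farthest_ratio(n_points, dim, seed=0):
    """Ratio of farthest to nearest distance from a random query point
    to IID uniform data; values near 1 mean the query is unstable."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in data]
    return max(dists) / min(dists)

for dim in (2, 20, 200):
    print(dim, nearest_farthest_ratio(1000, dim))
# the ratio shrinks toward 1 as dimensionality increases
```

At 2 dimensions the nearest neighbor is far closer than the farthest point; at 200 dimensions almost all points sit at nearly the same distance from the query.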
Assumption for the pre-condition to
hold
The data distribution and query
distribution are IID in all dimensions
It also holds for unique dimensions with
correlation between all dimensions
What is IID?
Independent and identically distributed
It means that the distribution of values
in each dimension is identical (e.g., all
values are uniformly distributed, or all
dimensions have the same skew) and the
dimensions are independent
High Dimensional indexing can be
meaningful
When the dimensions of both the query
point and the data points follow an identical
distribution, but are completely dependent
(i.e., value in D1 = value in D2 = …)
The result is a set of data points and a query
point on a diagonal line
The underlying query can actually be
converted to a 1-D NN problem
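The 1-D conversion can be sketched as follows (a hypothetical helper of ours, not from the paper): when every coordinate of a point equals the same value v, the m-dimensional Euclidean distance between two such points is just sqrt(m) · |v_query − v_data|, so the 1-D ordering of the shared values decides the nearest neighbor.

```python
import math

def nn_on_diagonal(data_values, query_value, m):
    """NN for points on the diagonal of an m-dim cube, where each point's
    coordinates all equal one shared value.  The sqrt(m) factor scales
    every distance equally, so the answer matches the 1-D NN problem."""
    dists = [math.sqrt(m) * abs(query_value - v) for v in data_values]
    return min(range(len(data_values)), key=lambda i: dists[i])

# shared coordinate values of three diagonal points; query value 0.45
print(nn_on_diagonal([0.1, 0.5, 0.9], 0.45, m=50))  # index 1, as in 1-D
```

The result is independent of m, which is why high-dimensional indexing remains meaningful here: the effective dimensionality is 1.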
Graphical View
(figure: all dimensions have the same value, so
all data points lie on the diagonal of the X-Y-Z cube)
High Dimensional indexing can be
meaningful (Cont’d)
The underlying dimensionality is much
lower than the actual dimensionality
E.g.: it is a 3-D data set, but the data
always have the same Z coordinate
High Dimensional indexing can be
meaningful (Cont’d)
When the query point is within some
small distance of a data point (instead
of being required to be identical to a
data point)
The query then returns
all points within the closest cluster, not
just the nearest point
NN query in clustered data
E.g.: data falls into discrete classes or clusters
in some potentially high-dimensional feature
space
(figure: a query point and its nearest cluster)
Distribution of distances in clustered data
Points close to the query are in the same cluster (NN meaningful)
Points in other clusters are all far away
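A minimal sketch of this clustered case (cluster centers, spread, and sizes are ours): even in 50 dimensions, distances to the query's own cluster stay far below distances to the other cluster, so the nearest neighbor remains meaningful.

```python
import math
import random

# two well-separated Gaussian clusters in 50 dimensions
rng = random.Random(1)
dim = 50
centers = [[0.0] * dim, [10.0] * dim]
data = [[c + rng.gauss(0, 0.1) for c in center]
        for center in centers for _ in range(100)]

# a query point landing near cluster 0
query = [c + rng.gauss(0, 0.1) for c in centers[0]]
dists = sorted(math.dist(query, p) for p in data)
print(dists[0], dists[-1])  # within-cluster << between-cluster distance
```

The sorted distances fall into two well-separated bands, matching the bimodal distance distribution sketched above.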
Experimental studies of NN
Want to find out the rate of convergence
Based on three synthetic workloads and one
real data set
NN can become unstable with as few as 10-20 dimensions
The convergence curve is exponential
In reality, the dimensionality might be 1000
Correlated Distributions
Recursive and uniform workloads (NN not meaningful)
Two-degrees-of-freedom workload (NN meaningful)
Part Three
Linear Scan is powerful…
NN indexing VS Linear scan
Linear scan can handily beat NN
indexing
NN indexing is meaningful only when the data
consists of small, well-formed clusters
and the query is guaranteed to land in
or very near one of these clusters
Why Linear scan
Reading a set of sequentially arranged disk
pages is much faster than unordered
retrieval of the same pages
Fetching a large number of data pages
through a multi-dimensional index
usually results in unordered retrieval
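For reference, a linear scan is simply one sequential pass over the data; a minimal sketch (function name and sample data are ours):

```python
import math

def linear_scan_nn(data, query):
    """Brute-force nearest neighbor: a single sequential pass over the
    data, which maps to reading disk pages in storage order."""
    best_i, best_d = None, float("inf")
    for i, p in enumerate(data):
        d = math.dist(p, query)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

data = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(linear_scan_nn(data, (1.1, 0.9)))  # index 2 is closest
```

No index structure is needed, and the access pattern stays sequential regardless of dimensionality, which is exactly the property that makes it competitive in high dimensions.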
Linear Scan outperforms
Both the SS-tree and the R*-tree at 10
dimensions in all cases
The SR-tree in all cases on a 16-dimensional
synthetic data set
Justification
All the reported performance studies
examined situations in which the
distance between the
query point and the NN differed little from
the distance to other data points
In reality, it might be different
Other related work
Dimensionality Curse
Fractal Dimensions
Dimensionality Curse
A vague indication that high
dimensionality causes problems in
some situations
Examples:
The NN problem
"Boundary effects" not taken into account
in NN queries in the high-dimensional case
Fractal Dimensions
A measure of how "complicated" a
self-similar figure (data set) is
NN queries become stable when the
fractal dimensionality is low
In practice, however, real data sets do not
exhibit fractal behavior
Conclusion
The effect of dimensionality on NN
queries
High-dimensional indexing can be
meaningful
Evaluate NN workloads carefully
Linear scan outperforms NN
processing techniques on some
meaningful workloads
Reference
Kevin Beyer, Jonathan Goldstein,
Raghu Ramakrishnan, Uri Shaft. When
Is "Nearest Neighbor" Meaningful?
Thomas Seidl. Adaptable Similarity
Search in 3-D Spatial Database
Systems
http://www.dbs.informatik.uni-muenchen.de/Forschung/Similarity/