Lecture 6: Multimedia IR: Indexing and Searching
Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 6 (book chapter 12):
Multimedia IR:
Indexing and Searching
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
Basically, images are handled as if text described them
Namely, feature vectors (or feature hierarchies)
Context can be used when available to determine features
Also, queries by example are common
From the point of view of DBMS, integration with IR
and multimedia-specific techniques is needed
Object-oriented technology is adequate
2
Previous Chapter: Research topics
How can the similarity function be defined?
What features of images (video, sound) are there?
How to better specify the importance of individual
features? (Give me similar houses: similar = size?
color? structure? architectural style?)
How to determine the objects in an image?
Integration with DBMSs and SQL for fast access and
rich semantics
Integration with XML
Ranking: by similarity, taking into account history, profile
3
The problem
Data examples:
2D/3D color/grayscale images: e.g., brain scans, scientific
databases of vector fields
(2D) video
(1D) voice/music; (1D) time series: e.g.,
financial/marketing time series; DNA/genomic databases
Query examples:
find photographs with the same color distribution as this
find companies whose stock prices move like this one
find brain scans with a texture of a tumor
Applications: search; data mining
4
Solution
Reduce the problem to search for multi-dimensional
points (feature vectors; the vector-space IR model is not used)
Define a distance measure
for time series: e.g., Euclidean distance between vectors
for images: e.g., color distribution (Euclidean distance);
another approach: mathematical morphology
Other features as vectors
For search within distance, the vectors are organized
in R-trees
Clustering plays an important role
5
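As a minimal sketch of the distance measures listed above, here is the Euclidean distance between two feature vectors in Python (the example price series are hypothetical):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical 5-day price series for two stocks
stock_a = [10.0, 10.5, 10.2, 10.8, 11.0]
stock_b = [10.1, 10.4, 10.3, 10.9, 10.9]
print(euclidean(stock_a, stock_b))  # small value = similar movement
```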
Types of queries
All within given distance
Find all images that are within 0.05 distance from this one
Nearest-neighbor
Find 5 stocks most similar to IBM
All pairs within given distance
Further: clustering
Whole object vs. sub-pattern match
Find parts of an image that are...
E.g., in 512 × 512 brain scans, find pieces similar to the
given 16 × 16 typical X-ray of a tumor
Like passage retrieval for text documents
6
Neighbor and pairs types of queries
The objects are organized in R-trees
For neighbor queries: branch-and-bound algorithm
For pairs: recently discovered algorithms
These types of queries are not discussed here
7
Desiderata for a method
Fast
No sequential search with all objects
Correct
100% recall
Precision is less important, though the false-alarm rate is
kept low: false alarms are easy to discard manually
Little space overhead
Dynamic
easy to insert, delete, update
8
Types of methods
Linear quadtrees
Complexity = hypersurface of the query region
Grows exponentially with dimensionality
Grid files
Complexity grows exponentially with dimensionality
R-tree methods, such as R*-trees
Most used due to lower complexity
9
R-tree
Objects and parts of images are represented as Minimum
Bounding Rectangles (MBRs)
Can overlap for different objects
Larger objects contain smaller objects
MBRs are nested
MBRs are arranged into a tree
In storage, an index of disk blocks is maintained
Disk blocks are fetched at once at hardware level
For better insertion/deletion, tight MBRs are needed
Good clustering is needed
10
File structure of R-tree
Corresponds to disk blocks
Fanout = 3: number of parts to group
11
R-tree
12
Search in R-tree
Range queries:
find objects within a given distance from the query object
= find MBRs that intersect the query's MBR
Determine MBR of the query
Descend the tree
Discarding all MBRs that do not intersect with the
query’s MBR
Many variations of R-tree method have been proposed
13
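A minimal sketch of this range search, assuming a simple in-memory R-tree of nested MBRs (the Node class and the tuple MBR format are my illustration; a real R-tree maps nodes to disk blocks):

```python
class Node:
    """R-tree node: an MBR plus either child nodes or stored leaf objects."""
    def __init__(self, mbr, children=(), objects=()):
        self.mbr = mbr              # (xmin, ymin, xmax, ymax)
        self.children = children    # inner nodes
        self.objects = objects      # payload at the leaves

def intersects(a, b):
    """True if two MBRs overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(node, query_mbr, result):
    """Descend the tree, discarding subtrees whose MBR misses the query's MBR."""
    if not intersects(node.mbr, query_mbr):
        return                      # prune: nothing inside can intersect
    result.extend(node.objects)     # candidates; exact check happens later
    for child in node.children:
        range_query(child, query_mbr, result)

leaf = Node((0, 0, 1, 1), objects=["img1"])
root = Node((0, 0, 4, 4), children=(leaf,))
out = []
range_query(root, (0.5, 0.5, 2, 2), out)  # -> ["img1"]
```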
Indexing
We consider here only whole-match queries
Given a collection of objects and a distance function,
find objects within a given distance from a given object Q
Problems:
1. Slow comparison of two objects
2. Huge database
GEMINI approach
GEneric Multimedia object INdexIng
Attempts to solve both problems
14
GEMINI indexing
Quick-and-dirty test to quickly discard bad objects
Uses clusters to avoid sequential search
Quick test
Single-valued feature, e.g., average for series.
Averages differ much ⇒ objects differ much
Not vice-versa. False alarms are OK
Several features, but fewer than all data. E.g., deviation
for series
15
Algorithm
Map the actual objects into f-dimensional feature
space
Use clusters (e.g., R-trees) to search
Retrieve objects, compute the actual distances, and
discard false alarms
16
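A minimal sketch of these three steps for time series, assuming the quick single-valued feature is the average and using a linear scan where a real implementation would query an R-tree:

```python
import math

def avg(series):
    # Quick feature: |avg(x) - avg(q)| never exceeds the Euclidean
    # distance ||x - q||, so filtering on it loses no true matches
    return sum(series) / len(series)

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def gemini_search(database, query, eps):
    """All series within eps of the query, with a cheap filter first."""
    q = avg(query)
    # Step 2: search in 1-D feature space (an R-tree would replace this scan)
    candidates = [s for s in database if abs(avg(s) - q) <= eps]
    # Step 3: compute the actual (expensive) distance; discard false alarms
    return [s for s in candidates if euclid(s, query) <= eps]
```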
17
Feature selection
Features should reflect distances
Allow no misses (100% recall)
⇒ features should make objects look closer, never farther
Lower Bound lemma:
If distance in feature space ≤ actual distance,
then 100% recall
(we speak about whole-match queries)
Holds for distance search, nearest-neighbor, pair search
18
Algorithm (more detail)
Determine distance
Choose features
Prove: distance in feature space ≤ distance between actual objects
Use quick method (R-tree) to search in feature space
For found objects, compute the actual distances (this
can be expensive)
Discard false alarms
objects with greater actual distances, even if in feature space
the distance is OK
Example: similar averages, but different series
19
Discussion
The method does NOT improve quality
Provides SAME quality as sequential search, but faster
Distance definition requires domain/application expert
How much do the two images differ?
What is important/unimportant for the specific application?
Feature selection requires a good knowledge engineer
Choose the most characteristic feature: discriminative
If needed, choose the second best, etc.
Good features should be orthogonal: combination adds info
20
Example: Time series
In yearly stock movements, find ones similar to IBM
Distance: Euclidean (365-D vectors); others exist
Features:
The first feature is the average.
If needed, add Discrete Fourier Transform (DFT) coefficients
or Discrete Cosine Transform, Wavelet Transform, etc.
Lower-bound lemma:
Parseval's theorem: DFT preserves distances (DCT, WT too)
The first several coefficients give a distance ≤ the actual one
Transforms “concentrate energy” in the first coefficients
Thus the first coefficients give a realistic prediction of the distance
21
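A sketch of this feature extraction with NumPy's FFT: with the orthonormal scaling, Parseval's theorem makes the DFT distance-preserving, so the distance over the first k coefficients lower-bounds the true Euclidean distance (k = 3 and the random series are illustrative):

```python
import numpy as np

def dft_features(series, k=3):
    """First k DFT coefficients of a time series, orthonormal scaling."""
    return np.fft.fft(series)[:k] / np.sqrt(len(series))

x = np.random.randn(365).cumsum()    # brown-noise-like series, e.g., a stock
y = np.random.randn(365).cumsum()

feat_dist = np.linalg.norm(dft_features(x) - dft_features(y))
true_dist = np.linalg.norm(x - y)
assert feat_dist <= true_dist + 1e-9  # the lower-bound lemma holds
```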
Time series: Applications
Such feature selection is effective for many skewed
spectrum distributions
Colored noises: the energy decreases as F^(−b)
b = 0: white spectrum: unpredictable. Method useless.
b = 1: pink noise: works of art
b = 2: brown noise: stock movements
b > 2: black noise: river levels, rainfall patterns
The greater b, the better the first coefficients of the
transform predict the actual distance
Some other n-D signals show similar properties
JPEG compression ignores higher coefficients
22
Time series: Performance
Fewer features ⇒ more false alarms ⇒ time lost
More features ⇒ more complex computation
Optimal number of features proves to be about 1–3
for skewed enough distributions
JPEG compression shows that photographs have such skewed distributions
23
Time series: Sub-pattern search
Use a sliding window
Encode each window with a few features
24
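A minimal sketch of the sliding-window encoding; the window length and the average-as-feature choice are illustrative:

```python
def window_features(series, w=16):
    """Encode every length-w sliding window by a cheap feature (here: average)."""
    return [(i, sum(series[i:i + w]) / w)        # (offset, feature) pairs
            for i in range(len(series) - w + 1)]
# The resulting low-dimensional points would then be indexed, e.g., in an R-tree.
```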
Example: Color images
Give me images with a texture of a tumor like this one
Give me images with blue at top and red at bottom
Handles color, texture, shape, position, dominant
edges
25
Color images: Color representation
Compute a color histogram
Distance: use a color-similarity matrix
Very expensive computationally: cross-talk between
features (every feature is compared to every other)
26
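The cross-talk distance is the quadratic form d²(x, y) = (x − y)ᵀ A (x − y), where entry a_ij of A is the similarity of colors i and j. A sketch with a hypothetical 3-bin histogram:

```python
import numpy as np

# Hypothetical similarity matrix for 3 bins (red, orange, blue):
# a_ij near 1 means bins i and j hold perceptually similar colors.
A = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

def histogram_distance(x, y):
    """Quadratic-form distance: every bin is compared to every other bin, O(n^2)."""
    d = x - y
    return float(d @ A @ d)

x = np.array([0.6, 0.3, 0.1])    # mostly red image
y = np.array([0.3, 0.6, 0.1])    # mostly orange image
print(histogram_distance(x, y))  # ~0.036: cross-talk keeps the distance small
```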
27
Color images: Feature mapping
The GEMINI question again: What single feature is
the most representative?
Take average R, G, B
Lower-bound?
Yes: Quadratic Distance Bounding theorem
28
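A sketch of this 3-D feature mapping for an image stored as an (H, W, 3) NumPy array (the array layout is an assumption); by the Quadratic Distance Bounding theorem cited above, a suitably scaled distance between such averages lower-bounds the full histogram distance:

```python
import numpy as np

def avg_rgb(image):
    """Map an (H, W, 3) RGB image to a single 3-D feature: average R, G, B."""
    return image.reshape(-1, 3).mean(axis=0)

img = np.random.randint(0, 256, size=(512, 512, 3))  # stand-in for a real image
print(avg_rgb(img))  # three values near 127.5 for uniform random pixels
```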
Automatic feature selection
Features can be selected automatically
In texts: Latent semantic indexing (LSI)
Many methods
Principal component analysis (= LSI), ...
In fact, they can reduce features, but not define them
From colors, one can select characteristic combinations
but not classify images into faces and flowers
So describing the objects is still up to human researchers
29
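A sketch of such automatic feature reduction via principal component analysis, done directly with NumPy's SVD (the 64-bin histogram data is hypothetical):

```python
import numpy as np

def pca_reduce(X, k):
    """Project rows of X (objects x features) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                          # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # k derived features per object

X = np.random.rand(100, 64)      # e.g., 64-bin color histograms of 100 images
X3 = pca_reduce(X, k=3)          # 3 characteristic color combinations
```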
Research topics
Object detection (pattern and image recognition)
Automatic feature selection
Spatial indexing data structures (more than 1D)
New types of data.
What features to select? How to determine them?
Mixed-type data (e.g., webpages, or images with
sound and description)
What clustering/IR methods are better suited for
what features? (What features for what methods?)
Similar methods in data mining, ...
30
Conclusions
How to accelerate search? (Same results as sequential search)
Ideas:
Quick-and-dirty rejection of bad objects, 100% recall
Fast data structure for search (based on clustering)
Careful check of all found candidates
Solution: mapping into a lower-dimensional feature space
Condition: lower-bounding of the distance
Assumption: skewed spectrum distribution
A few coefficients concentrate the energy; the rest are less important
31
Thank you!
Till Tuesday 11, 6 pm
32