Cortina: a web image search engine
Thursday, May 27, 2004
A System for Large-scale, Content-based Web Image Retrieval
- and the Semantics within
Till Quack
Task
Create a content-based image retrieval system for the WWW
Large-scale: one order of magnitude larger than existing systems, i.e. O(10^6) items
Relevance Feedback
Explore and exploit the semantics within
Take large-scale, content-based image retrieval one step closer to commercial applications
Outline
Content-based Image Retrieval on the WWW
PART I: A System for Image Retrieval on the WWW
Features
Retrieval
Relevance Feedback
Software Design
PART II: The Semantics within
Identifying a Method to find Semantics
Data Mining for Semantic Clues
Frequent Itemset Mining and Association Rules
The Visual Link
Discussion & Demonstration
Conclusions & Outlook
Content-based Image Retrieval on the WWW
Characteristics of the data repository
Size: 4.2 billion documents in Google’s index
Diversity: Documents in any context, language
Control: Anybody can publish anything
Dynamics: Ever changing
System Requirements
FAST
SCALABLE
Make use of all the information available
Motivation for a new system
Existing systems are either pure text (Google) or pure content-based
None of them is both content-based and large-scale
PART I: A System for Large-scale, Content-based Image Retrieval on the WWW
Ullrich Moenich
Till Quack
Lars Thiele
System Overview
[System diagram: an image spider crawls the WWW (starting from DMOZ) and collects image binaries and descriptions; offline, keywords are extracted and indexed into a MySQL inverted index (keyid | imageid), and visual features are extracted and partitioned into clusters. At retrieval time, a keyword request returns matching images; the user picks relevant images, and a nearest-neighbor search over the feature clusters refines the results.]
Visual Features describe the Images
Global Features from MPEG-7 Standard
Currently no Segmentation
• Reasons: Scalability and the diversity of the data
Texture Features
Edge Histogram Descriptor (EHD)
• Histogram of quantized edge directions; 80 dimensions
Homogeneous Texture Descriptor (HTD)
• Output of Gabor filter-bank. 62 dimensions.
Color Features
Scalable Color Descriptor (SCD)
• Color Histogram. 256, 128, 64 or 32 dimensions
Dominant Color Descriptor (DCD)
• Up to 8 dominant colors (3-D color space) and their percentages – 32 "dimensions"
• "Bins" defined for each image
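To make the feature layout concrete, here is a minimal sketch of how the per-image descriptors listed above might be held in memory. The struct and field names are illustrative, not Cortina's actual data structures; the dimensions follow the slide, and the DCD layout (up to 8 colors of 3 components plus a percentage) is inferred from the "32 dimensions" note.

```cpp
// Illustrative per-image container for the MPEG-7 descriptors named above.
// A sketch of one possible layout, not the project's actual storage.
#include <array>
#include <vector>

struct DominantColor {
  std::array<float, 3> color;   // dominant color in a 3-D color space
  float percentage;             // fraction of the image covered by this color
};

struct ImageFeatures {
  long image_id;
  std::array<float, 80>  ehd;        // Edge Histogram Descriptor
  std::array<float, 62>  htd;        // Homogeneous Texture Descriptor
  std::array<float, 256> scd;        // Scalable Color Descriptor (largest variant)
  std::vector<DominantColor> dcd;    // Dominant Color Descriptor, up to 8 entries
};
```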
Collateral Text as an Additional Feature
ALT Tag and Collateral Text around images
VERY uncontrolled annotation
Stemming: Porter Stemmer
Example: training -> train
More matching terms for boolean queries
But also some new ambiguities
• train: to train [verb] / the train [noun]
Retrieval in 2 Steps
[System diagram, retrieval path highlighted: 1. Text retrieval: the keyword request is answered from the inverted index (keyid | imageid) and matching images are returned. 2. Visual nearest-neighbor search: the user picks relevant images and their visual features are matched against the clustered feature data.]
Retrieval: Text
Options
Boolean query on the inverted index
Vector Space Model
LSI etc.
Choice
Ranked boolean queries on the inverted index
Ranking: tf*idf
Reasons
Speed
Sparsity of the data:
• 600 000 keywords in total
• 1 document: 10-50 words
Keyword | Keyid | ImageId | tf
shoe    | 124   | 1233    | 1
sport   | 341   | 1233    | 1
red     | 345   | 1233    | 1
banana  | 445   | 1234    | 1
fruit   | 75    | 1234    | 2
Order   | 875   | 1234    | 1
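As an illustration of the chosen retrieval option, the following is a minimal sketch of a ranked boolean (AND) query with tf*idf scoring over an in-memory inverted index. The data structures and the exact idf formula are assumptions for the example; Cortina's index lives in MySQL and its ranking details may differ.

```cpp
// Minimal sketch: ranked boolean (AND) query with tf*idf scoring over an
// in-memory inverted index. Names and structures are illustrative only.
#include <algorithm>
#include <cmath>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Posting { int image_id; int tf; };
using InvertedIndex = std::unordered_map<std::string, std::vector<Posting>>;

std::vector<std::pair<int, double>> rankedBooleanQuery(
    const InvertedIndex& index, const std::vector<std::string>& terms, int n_documents) {
  std::map<int, double> score;   // image_id -> accumulated tf*idf
  std::map<int, int> matched;    // image_id -> number of query terms matched
  for (const auto& term : terms) {
    auto it = index.find(term);
    if (it == index.end()) return {};   // boolean AND: a missing term empties the result
    double idf = std::log(double(n_documents) / it->second.size());
    for (const Posting& p : it->second) {
      score[p.image_id] += p.tf * idf;
      matched[p.image_id] += 1;
    }
  }
  std::vector<std::pair<int, double>> results;
  for (const auto& [id, s] : score)
    if (matched[id] == (int)terms.size()) results.push_back({id, s});  // keep AND matches only
  std::sort(results.begin(), results.end(),
            [](const auto& a, const auto& b) { return a.second > b.second; });
  return results;
}
```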
Retrieval – Visual Features (MPEG-7)
K-Nearest Neighbor search (K-NN)
Find the K closest candidates c_i to the query image q in a vector space
Distance: Minkowski metrics for the distance d(c_i, q), namely the L1 and L2 norms
Most MPEG-7 descriptors are high-dimensional vectors
The "curse of dimensionality" applies
High-dimensional spaces behave "weirdly"
In particular, the distances are not very meaningful
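For concreteness, here is a brute-force sketch of K-NN under Minkowski distances (L1 for p = 1, L2 for p = 2). It scans the whole database linearly, which is exactly the cost the clustering described later avoids; the function names are mine, not the project's.

```cpp
// Minimal sketch of K-NN with Minkowski (L1/L2) distances over raw feature
// vectors; a linear scan, i.e. the brute-force baseline.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

double minkowski(const std::vector<double>& a, const std::vector<double>& b, double p) {
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) sum += std::pow(std::fabs(a[i] - b[i]), p);
  return std::pow(sum, 1.0 / p);  // p = 1 gives the L1 norm, p = 2 the L2 norm
}

std::vector<std::size_t> knn(const std::vector<std::vector<double>>& db,
                             const std::vector<double>& query, std::size_t k, double p = 2.0) {
  std::vector<std::pair<double, std::size_t>> dist;
  for (std::size_t i = 0; i < db.size(); ++i) dist.push_back({minkowski(db[i], query, p), i});
  std::partial_sort(dist.begin(), dist.begin() + std::min(k, dist.size()), dist.end());
  std::vector<std::size_t> ids;
  for (std::size_t i = 0; i < std::min(k, dist.size()); ++i) ids.push_back(dist[i].second);
  return ids;   // indices of the K closest database entries
}
```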
Retrieval – Challenges for Visual Features
We have several (visual) feature types
How can we combine them?
Our database is very large.
How can we search it fast enough?
i.e. how can we avoid comparing the query vector with each database entry?
A Combined Distance for the MPEG-7 Features
We use a combined distance over all the visual feature types
The individual distances occupy different ranges and follow different distributions
The distributions were therefore transformed to a normal distribution in the range [0,1]
The distances are then combined linearly
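A sketch of what such a linearly combined distance might look like. Since the exact normalizing transform is not detailed on the slide, this example uses a simple rescaling to [0,1] from observed per-descriptor distance ranges as a stand-in; the structures and weights are illustrative.

```cpp
// Sketch of a linearly combined distance over several feature types. The
// normalization here is a simple min-max rescaling to [0,1] per descriptor;
// the exact transform used in Cortina may differ. Weights w sum to 1.
#include <vector>

struct FeatureDistances { double ehd, htd, scd, dcd; };   // raw per-descriptor distances
struct Range { double min, max; };                        // observed distance range per descriptor

double normalize(double d, Range r) { return (d - r.min) / (r.max - r.min); }

double combinedDistance(const FeatureDistances& d,
                        const std::vector<Range>& ranges,   // one Range per descriptor
                        const std::vector<double>& w) {     // weights, summing to 1
  double raw[4] = {d.ehd, d.htd, d.scd, d.dcd};
  double total = 0.0;
  for (int i = 0; i < 4; ++i) total += w[i] * normalize(raw[i], ranges[i]);
  return total;
}
```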
Clustering speeds up the search
Problem
Millions of items in the DB
A linear search over the whole dataset is too slow
We are only looking for the K nearest neighbors anyway
(One) Solution
Partition the data into clusters, each identified by a representative, the centroid
Only search the cluster whose centroid is closest to the query q
K-Means clustering algorithm
Not the best, in particular in high-dimensional spaces
But fast!
Problem with clustering:
A query at the border of a cell does not find all of its nearest neighbors
Simple solution: overlapping clusters
Problem: scalability
• Original data: 7 GB
• Overlapping data: 50 GB
Imageid | Primary Descriptor | Secondary Descriptor 1 | Secondary Descriptor 2 | Secondary Descriptor 3
122     | ehd                | htd                    | scd                    | dcd
45233   | ehd                | htd                    | scd                    | dcd
6688    | ehd                | htd                    | scd                    | dcd
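The cluster-based search can then be sketched as a two-step lookup: pick the cluster with the closest centroid, then run K-NN only within it. The Cluster structure is illustrative (in Cortina the assignments live in the relational database), and the helpers are the minkowski()/knn() functions from the earlier sketch.

```cpp
// Sketch of cluster-restricted nearest-neighbor search: 1. pick the cluster whose
// centroid is closest to the query, 2. run K-NN only inside that cluster.
#include <cstddef>
#include <limits>
#include <vector>

// from the earlier K-NN sketch
double minkowski(const std::vector<double>& a, const std::vector<double>& b, double p);
std::vector<std::size_t> knn(const std::vector<std::vector<double>>& db,
                             const std::vector<double>& query, std::size_t k, double p);

struct Cluster {
  std::vector<double> centroid;
  std::vector<std::vector<double>> members;   // feature vectors assigned to this cluster
};

std::vector<std::size_t> clusteredKnn(const std::vector<Cluster>& clusters,
                                      const std::vector<double>& query, std::size_t k) {
  std::size_t best = 0;
  double best_d = std::numeric_limits<double>::max();
  for (std::size_t c = 0; c < clusters.size(); ++c) {          // 1. closest centroid
    double d = minkowski(clusters[c].centroid, query, 2.0);
    if (d < best_d) { best_d = d; best = c; }
  }
  return knn(clusters[best].members, query, k, 2.0);           // 2. K-NN within that cluster only
}
```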
Relevance Feedback Improves the Results
Relevance feedback: user input improves the search results iteration by iteration
i.e. the user selects "good matches"
From this we obtain the following information:
1. A new query vector, which is a combination of the relevant images = query vector movement
2. The ratios for the combination of the feature types
Relevance Feedback: Query Vector Movement
Construct the query vector q_n from the images selected in iteration n: for each feature type f (EHD, SCD, HTD) and each vector component k, average over the i = 1...M relevant images x_i:
q_n^(f)[k] = (1/M) * Σ_{i=1..M} x_i^(f)[k]
The final, new query vector is
q = 0.75 * q_n + 0.25 * q_(n-1)
i.e. move from the old query vector towards the new vector
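A sketch of the query-vector-movement step, assuming q_n is the component-wise mean of the relevant images' feature vectors (as reconstructed above) and using the 0.75/0.25 blend from the slide.

```cpp
// Sketch of query vector movement: average the feature vectors of the images
// marked relevant in this iteration, then blend with the previous query vector.
#include <cstddef>
#include <vector>

std::vector<double> moveQueryVector(const std::vector<std::vector<double>>& relevant,  // M selected images
                                    const std::vector<double>& previous_query) {
  std::vector<double> q_n(previous_query.size(), 0.0);
  for (const auto& img : relevant)
    for (std::size_t k = 0; k < img.size(); ++k) q_n[k] += img[k] / relevant.size();
  std::vector<double> q(previous_query.size());
  for (std::size_t k = 0; k < q.size(); ++k)
    q[k] = 0.75 * q_n[k] + 0.25 * previous_query[k];   // q = 0.75*q_n + 0.25*q_(n-1)
  return q;
}
```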
Relevance Feedback: Weight Adaptation
Which feature is most important for the given query?
The one for which all the relevant images are closest
Determine the ratios for the combination based on the average distance, e.g. for the EHD the average distance of the M relevant images x_i to the query:
d̄_EHD = (1/M) * Σ_{i=1..M} d_EHD(x_i, q)
and set the weight of each feature type accordingly: the smaller its average distance, the larger its share of the combination
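A sketch of this weight adaptation, under the assumption (mine, not stated on the slide) that each feature type's weight is set inversely proportional to its average distance and that the weights are normalized to sum to 1.

```cpp
// Sketch of weight adaptation: for each feature type, average the distances of
// the relevant images to the query; a smaller average distance yields a larger
// weight. The inverse-proportional normalization is an illustrative assumption.
#include <cstddef>
#include <vector>

std::vector<double> adaptWeights(const std::vector<std::vector<double>>& dists) {
  // dists[f][i] = distance of relevant image i to the query under feature type f
  std::vector<double> w(dists.size());
  double norm = 0.0;
  for (std::size_t f = 0; f < dists.size(); ++f) {
    double avg = 0.0;
    for (double d : dists[f]) avg += d / dists[f].size();
    w[f] = 1.0 / avg;            // closer relevant images => more important feature
    norm += w[f];
  }
  for (double& x : w) x /= norm;  // weights sum to 1
  return w;
}
```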
Implementation – Software and Hardware
Languages: C++ and Perl
Inline::CPP to connect them
Layers
WWW: Apache and CGI
Relational DB: MySQL
Operating system: OS X
Hardware
Dual 2 GHz Apple G5, 2 GB RAM
Teran Terabyte Disk Array
Part II: The Semantics Within
Semantics: Combining Text and Visual Features
Our dataset is multi-modal
Keywords and several visual features
Not only valid for WWW data
• Video: image + speech
• Bio-imagery: image + microscope settings, cell-coloring fluid
Goal: try to jointly use the different modes
Do semantic relations between the modes exist?
Learn something about these semantic relations
Improve the retrieval precision based on them
Challenges in our project:
Large scale
Noisy and uncontrolled data
Only global visual features
Identifying a Method to find the Semantics
Related work
Latent Semantic Indexing (LSI) [Westerveld 2000]
• Problem: O(N^2 m^3), N = documents + terms, m = size of the concept space
Statistical models [Barnard, Forsyth 2001-2004]
• Problem: "several hours for several thousand images"
• Problem: it is a (rather strict, hierarchical) model
Others
• Neural networks (SOM etc.)
• Hidden Markov Models
Often: classification
We don't know our classes, or: there are just too many
We can't train them either (the data is too diverse and noisy)
Most of the methods above have only been tested on relatively small, supervised datasets
There is one more option …
Method: Data Mining for Semantic Clues
Mine the data for patterns
Find them only where they exist
Deduce rules from them
Scalable methods available
Frequent Itemset Mining and Association Rules
Classic applications: market baskets, census data …
Some work on multimedia data
• [Zaïane 98]: data cubes with appended keywords
• [Tešić et al. 03]: perceptual associations (texture) within images
Frequent Itemsets and Association Rules
Itemset I: a set of items
Transaction T: a set of items, T ⊆ I
Database D: the set of all transactions
Support of an itemset A: supp(A) = fraction of transactions in D that contain A
A is called frequent if supp(A) ≥ minsupp
Rule: A ⇒ B
Support of a rule: supp(A ⇒ B) = supp(A ∪ B)
Statistical significance
Confidence of a rule: conf(A ⇒ B) = supp(A ∪ B) / supp(A)
Strength of implication
Maximum likelihood estimate that B is true given that A is true
Example & Advantages
Example: Market Baskets
Rule: {Diaper, Milk} ⇒ {Beer}

TID | Items
1   | Bread, Milk
2   | Beer, Diaper, Bread, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Coke, Bread, Diaper, Milk

Advantages
Human readable
Can be edited
Fast Algorithms available
Note: Associations are not correlations
The same concept, just simpler
Associations and correlations:
[Brin, Motwani, Silverstein 98]
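As a worked check against the market-basket table above (my own arithmetic, not from the slides): three of the five transactions contain {Diaper, Milk}, and two of those also contain Beer, so

```latex
% support and confidence of the rule {Diaper, Milk} => {Beer}
supp(\{Diaper, Milk\} \Rightarrow \{Beer\}) = \frac{2}{5} = 0.4
\qquad
conf(\{Diaper, Milk\} \Rightarrow \{Beer\})
  = \frac{supp(\{Diaper, Milk, Beer\})}{supp(\{Diaper, Milk\})}
  = \frac{2/5}{3/5} \approx 0.67
```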
Using FIMI to find the itemsets
Frequent Itemset Mining (FIMI)
Find frequent itemsets with support > minsupp
The minimal support minsupp is given by "an expert"
First algorithm: Apriori [Agrawal et al. 93]
Basic idea: if an itemset is frequent, all its subsets must be frequent (monotonicity)
k passes over the dataset for itemsets of length k
≈ O(knp) for n transactions, p items, and itemsets of length k
Today's algorithms
Rely on the same basic principle
But are much faster (main reason: data structures)
• Usually only 2 database passes
• ≈ linear runtime
State-of-the-art algorithm overview: FIMI'03
We used: fpmax* [Grahne, Zhu: Nov 03]
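To make the level-wise idea concrete, here is a compact Apriori-style sketch: candidate generation by joining frequent k-itemsets, support counting by scanning the transactions, with minsupp given as an absolute count. It omits the explicit subset-pruning step and is an illustration only; it is not the fpmax* algorithm actually used.

```cpp
// Compact Apriori-style frequent itemset mining sketch (illustrative only).
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

using Itemset = std::vector<int>;   // kept sorted
using Transaction = std::set<int>;

static bool contains(const Transaction& t, const Itemset& s) {
  for (int item : s) if (!t.count(item)) return false;
  return true;
}

std::vector<Itemset> apriori(const std::vector<Transaction>& db, std::size_t minsupp) {
  std::vector<Itemset> result, level;
  std::set<int> items;
  for (const auto& t : db) items.insert(t.begin(), t.end());
  for (int item : items) level.push_back({item});              // candidate 1-itemsets
  while (!level.empty()) {
    std::vector<Itemset> frequent;
    for (const auto& cand : level) {                           // one DB pass per level (length k)
      std::size_t count = 0;
      for (const auto& t : db) if (contains(t, cand)) ++count;
      if (count >= minsupp) frequent.push_back(cand);
    }
    result.insert(result.end(), frequent.begin(), frequent.end());
    std::set<Itemset> next;                                    // join step: only frequent itemsets
    for (const auto& a : frequent)                             // are extended (monotonicity)
      for (const auto& b : frequent)
        if (std::equal(a.begin(), a.end() - 1, b.begin()) && a.back() < b.back()) {
          Itemset c = a;
          c.push_back(b.back());
          next.insert(c);
        }
    level.assign(next.begin(), next.end());
  }
  return result;
}
```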
Diapers and Beer !!?
Application to the domain of multimedia data:
1. Formulate images as transactions
2. Low-level clusters serve as a dimensionality reduction for the visual features
3. We find associations of visual features (clusters) and keywords
4. From these associations we deduce semantic rules
Advantages
Comparably low computational complexity
Other data sources can be integrated in the same manner (e.g. long-term relevance feedback)
Challenges
Noisy, uncontrolled data
Associations within keywords are much stronger than associations between keywords and visual features
Uneven distribution of cluster sizes (a K-Means problem)
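A sketch of step 1 of the list above: an image becomes a "transaction" whose items are its stemmed keywords plus the cluster ids of its visual feature vectors. The string-prefix item encoding ("ehd14" etc.) mirrors the rule notation on later slides but is otherwise my own choice, not Cortina's actual encoding.

```cpp
// Sketch: turning an image into a transaction of keyword items and cluster-id items.
#include <set>
#include <string>
#include <vector>

struct ImageRecord {
  std::vector<std::string> keywords;            // stemmed collateral-text keywords
  int ehd_cluster, htd_cluster, scd_cluster;    // cluster ids per visual feature type
};

std::set<std::string> asTransaction(const ImageRecord& img) {
  std::set<std::string> items(img.keywords.begin(), img.keywords.end());
  items.insert("ehd" + std::to_string(img.ehd_cluster));   // e.g. "ehd14"
  items.insert("htd" + std::to_string(img.htd_cluster));
  items.insert("scd" + std::to_string(img.scd_cluster));
  return items;
}
```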
Characteristics of the Itemsets and Rules
There are associations
Within text: {shoe} ⇒ {walk}
Within visual clusters: {EHD 14} ⇒ {SCD 12}
Between text and visual clusters: {shoe} ⇒ {EHD 14}
Measure for interestingness, or choice of rules from the frequent itemsets
Confidence?
Statistical criteria?
Background knowledge? (Example: pregnant ⇒ woman, 100% confidence)
Our "background knowledge": rules that connect keywords and low-level features are more interesting
Since this is known, the mining can be adapted and made even faster
Exploiting the Itemsets and Rules
Selecting Interesting Low-Level Clusters based on Rules
Clusters were introduced to partition the visual feature vector data and to search only certain clusters
Problem: we miss certain nearest neighbors if the images for a concept are spread over several clusters
Unsatisfactory solution: overlapping clusters
But association rules can find and resolve this situation:
Clusters are re-united if the number of images for the concept in both clusters is > minsupp
Example: {shirt} ⇒ {ehd249, ehd310} re-unites these clusters for the initial keyword query "shirt"!
This is scalable, unlike overlapping clusters
Another benefit is that more images labeled with the original keyword are "injected" into the results of the K-NN search
Currently: one keyword as the high-level semantic concept
Future: find high-level semantic concepts by mining associations within text first
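A small sketch of how this rule-based cluster selection could be wired into the query path: if a rule {keyword} ⇒ {cluster ids} exists, all of those clusters are searched together; otherwise the search falls back to the single closest-centroid cluster. The rule map is illustrative; in practice it would be materialized from the mined rules.

```cpp
// Sketch: select the set of clusters to search for a given keyword query.
#include <map>
#include <set>
#include <string>

std::set<int> clustersForKeyword(
    const std::map<std::string, std::set<int>>& rules,   // keyword -> associated cluster ids
    const std::string& keyword, int default_cluster) {
  auto it = rules.find(keyword);
  if (it != rules.end()) return it->second;   // re-unite all clusters named by the rule
  return {default_cluster};                   // fall back to the closest-centroid cluster
}
```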
The Visual Link
Another contribution, NOT related to Frequent Itemset Mining and Association Rules …
Since the search concept suggests visual nearest-neighbor search with relevance feedback after the initial keyword search:
It would be nice to have a diverse selection of images for a given keyword on the first page of results
Images sorted not only by keyword ranking, but also based on visual feature information
Basic idea: for a given keyword query, build groups of images that are visually close
Larger groups are more important
Show only one representative per group
The Visual Link: A Graph-Based Approach
Let I(Q) be the set of images matching a keyword query Q
Define a graph G = (V, E) with V = I(Q) and E = {(i, j) : d(i, j) < t} for a threshold t,
i.e. images are visually linked if the distance between them is lower than a given threshold
Do a connected-component analysis to find the connected components C
For each component C, find the "best" representative r_C
Re-rank the results based on the representatives r_C
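A sketch of the grouping step, using union-find for the connected-component analysis. The threshold and distance function (the minkowski() helper from the earlier sketch) stand in for the actual combined distance, and choosing the representative r_C is left out here.

```cpp
// Sketch of visual-link grouping: link images whose pairwise distance is below a
// threshold, then label connected components with union-find.
#include <cstddef>
#include <numeric>
#include <vector>

// from the earlier K-NN sketch
double minkowski(const std::vector<double>& a, const std::vector<double>& b, double p);

struct UnionFind {
  std::vector<std::size_t> parent;
  explicit UnionFind(std::size_t n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
  std::size_t find(std::size_t x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
  void unite(std::size_t a, std::size_t b) { parent[find(a)] = find(b); }
};

std::vector<std::size_t> visualComponents(const std::vector<std::vector<double>>& features,
                                          double threshold) {
  UnionFind uf(features.size());
  for (std::size_t i = 0; i < features.size(); ++i)
    for (std::size_t j = i + 1; j < features.size(); ++j)
      if (minkowski(features[i], features[j], 2.0) < threshold)   // visually linked
        uf.unite(i, j);
  std::vector<std::size_t> component(features.size());
  for (std::size_t i = 0; i < features.size(); ++i) component[i] = uf.find(i);
  return component;   // images with the same label belong to one visual group
}
```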
The Visual Link: An Example
The Visual Link: An Approximation
Problem: the distance calculations for the graph take too long
The clusters cannot be used
Loading individual vectors takes a lot of time
Solution: approximate the distance
Idea: if two images lie in the same cluster and in the same distance range to the centroid, the probability that they are "close" is high
New definition of visually linked:
in the same cluster and in the same range of relative distance to its centroid
This can be encoded in the relational DB, and comes at nearly no extra cost in creation
Imageid | Clusterid | 2ndClusterid | Reldist
1       | 221       | 122          | 0.6
2       | 342       | 345          | 0.8
3       | 223       | 42           | 0.2
4       | 12        | 126          | 0.4
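Given the table above, the approximate link test reduces to a simple predicate on the stored columns; the band width used to discretize the relative distance is an illustrative choice, not a value from the slides.

```cpp
// Sketch of the approximate visual link: two images are treated as linked if they
// share a cluster and fall in the same band of relative distance to the centroid.
struct ClusterEntry { int image_id; int cluster_id; double reldist; };

bool approxVisuallyLinked(const ClusterEntry& a, const ClusterEntry& b,
                          double band_width = 0.2) {
  if (a.cluster_id != b.cluster_id) return false;          // must share a cluster
  return static_cast<int>(a.reldist / band_width) ==
         static_cast<int>(b.reldist / band_width);          // same relative-distance band
}
```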
Discussion & Demo
Discussion: Precision
Measuring the quality of such a large-scale system is difficult
A precision/recall measure is not possible: the ground truth is not known
• C: correct results
• D: desired results
• A: actual results
(i.e. precision = |C| / |A| and recall = |C| / |D|; D is unknown for the WWW, so recall cannot be computed)
We measure the precision based on user questioning
Before we continue … some numbers
Number of images: 3 006 660
Size of image data: 111 GB
Feature extraction: 15 days (dual 2 GHz CPU, 2 GB RAM)
Number of distinct keywords: 680 256
Size of inverted keyword index table: 50 260 345 lines
MySQL database size: 23 GB
And now … the moment you’ve all been waiting for …
The Demo of Cortina
Conclusions
A system with over 3 million items was implemented
Probably the largest CBIR system to date?
A retrieval concept was introduced
A keyword query followed by relevance feedback and visual nearest-neighbor search
Superior to existing retrieval concepts (query by keyword or query by example)
Data mining to explore and exploit semantics in large-scale systems was introduced
Questions
Outlook
Many extensions and improvements possible
Segmentation
• Or maybe rather some simple tiling
Indexing
• K-Means should be replaced
• Suggestion: VA-File based approach [Manjunath,Tesic 03]
Association Rule Mining
• Multilevel Approach
• First keywords for high level semantic concepts
• Then visual features
Thanks
Ullrich Moenich and Lars Thiele
Which Rules are of Interest?
There are associations
Within text: {shoe} ⇒ {walk}
Within visual clusters: {EHD 14} ⇒ {SCD 12}
Between text and visual clusters: {shoe} ⇒ {EHD 14, SCD 12}
There are long and short rules
Short rules have higher support, by the nature of the problem
Long rules contain more (precise) information about the semantics
Measure for interestingness, or choice of rules from the frequent itemsets
Confidence?
Statistical criteria?
Background knowledge? (Example: pregnant ⇒ woman)
Characteristics and Challenges
Chosen criteria
Mainly interested in rules {keywords} ⇒ {visual feature clusters} (our "background knowledge")
Support, confidence
Mine long and short rules
Restriction of the problem: mine for frequent itemsets per keyword
i.e. all images (= transactions) for a given keyword
This means
• We avoid being distracted by associations within keywords
• The method is made even more scalable
The keyword as a placeholder for a semantic concept
A keyword does not always stand for a single semantic concept
Proposal for future versions: a multi-level approach:
• First mine {keywords} ⇒ {keywords} rules to identify "real" semantic concepts
• Then mine itemsets per identified concept
Characteristics of the Itemsets and Rules – Overall
Why keyword filtering of the results does not work
Proposal: Semantic Clusters
Ultimate goal: search some kind of "semantic clusters" instead of visual feature clusters
Proposal based on the approach of Ester et al. 2002, 2003
Clustering based on frequent itemsets, originally for text
Clustering criterion: minimize overlap