Cortina: a web image search engine


Thursday, May 27, 2004
A System for Large-scale, Content-based Web Image Retrieval - and the Semantics Within
Till Quack
Task
 Create a content-based image retrieval system for the WWW
 Large-scale: one order of magnitude larger than existing systems, i.e. O(10^6) items
 Relevance Feedback
 Explore and exploit the semantics within
 Take large-scale, content-based image retrieval one
step closer to commercial applications
Outline
 Content-based Image Retrieval on the WWW
 PART I: A System for Image Retrieval on the WWW
 Features
 Retrieval
 Relevance Feedback
 Software Design
 PART II: The Semantics within
 Identifying a Method to find Semantics
 Data Mining for Semantic Clues
 Frequent Itemset Mining and Association Rules
 The Visual Link
 Discussion & Demonstration
 Conclusions & Outlook
Content-based Image Retrieval on
the WWW
 Characteristics of the data repository
 Size: 4.2 billion documents in Google’s index
 Diversity: documents in any context and language
 Control: anybody can publish anything
 Dynamics: ever changing
 System Requirements
 FAST
 SCALABLE
 Make use of all the information
available
 Motivation for a new system
 Existing systems
• Either purely text-based (Google)
• Or purely content-based
 Not large-scale
PART I:
A System for Large-scale, Content-based
Image Retrieval on the WWW
Ullrich Moenich
Till Quack
Lars Thiele
System Overview
[System diagram. Offline: an Image Spider crawls the World Wide Web and DMOZ; Keyword Extraction and Feature Extraction process the image descriptions and binaries; Keyword Indexing builds an inverted index (keyid | imageid) in mySQL, and the visual features are partitioned into clusters 1…n. Retrieval: a keyword request is answered from the inverted index with matching images; the user picks relevant images, which drive a nearest neighbor search over the clusters to return refined matching images.]
Visual Features describe the Images
 Global Features from MPEG-7 Standard
 Currently no Segmentation
• Reasons: Scalability and the diversity of the data
 Texture Features
 Edge Histogram Descriptor (EHD)
• Histogram of quantized edge directions. 80 dimensions
 Homogeneous Texture Descriptor (HTD)
• Output of Gabor filter-bank. 62 dimensions.
 Color Features
 Scalable Color Descriptor (SCD)
• Color Histogram. 256, 128, 64 or 32 dimensions
 Dominant Color Descriptor (DCD)
• Up to 8 dominant colors (3d color-space) and their percentages
– 32 “dimensions”
• “Bins” defined for each image
Collateral Text as an additional
Feature
 ALT Tag and Collateral Text around images
 VERY uncontrolled annotation
 Stemming: Porter Stemmer
 Example: training -> train
 More matching terms for boolean queries
 But also some new ambiguities
• train: to train [verb] / the train [noun]
Retrieval in 2 Steps
[System diagram annotated with the two retrieval steps. 1. Text retrieval: a keyword request is matched against the inverted index (keyid | imageid) and returns matching images. 2. Visual nearest neighbor search: the user picks relevant images, and a nearest neighbor search over the visual feature clusters returns refined matching images.]
Retrieval: Text
 Options
 Boolean query on inverted index
 Vector Space Model
 LSI etc.
 Choice
 Ranked boolean queries on inverted index
 Ranking: tf*idf
 Reasons
 Speed
 Sparsity of data:
• 600 000 keywords in total
• 1 document: 10-50 words

Keyword | Keyid | ImageId | tf
shoe    | 124   | 1233    | 1
sport   | 341   | 1233    | 1
red     | 345   | 1233    | 1
banana  | 445   | 1234    | 1
fruit   | 75    | 1234    | 2
Order   | 875   | 1234    | 1
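As a sketch of the ranked boolean retrieval described above, the toy example below runs an AND query over a small inverted index and ranks the hits by summed tf*idf. The index contents mirror the table; the variable names and the exact idf formula are illustrative, not the system's.

```python
import math

# Toy inverted index: keyword -> {image_id: term frequency},
# mirroring the keyid | imageid | tf table above.
index = {
    "shoe":   {1233: 1},
    "sport":  {1233: 1},
    "red":    {1233: 1},
    "banana": {1234: 1},
    "fruit":  {1234: 2},
    "order":  {1234: 1},
}
n_docs = 2  # images 1233 and 1234

def ranked_boolean(query_terms):
    """Boolean AND query on the inverted index, ranked by summed tf*idf."""
    postings = [index.get(t, {}) for t in query_terms]
    if not postings or any(not p for p in postings):
        return []
    # Boolean AND: keep only images containing every query term.
    hits = set(postings[0])
    for p in postings[1:]:
        hits &= set(p)
    scores = {}
    for img in hits:
        score = 0.0
        for t, p in zip(query_terms, postings):
            idf = math.log(n_docs / len(p))  # one simple idf variant
            score += p[img] * idf            # tf * idf
        scores[img] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(ranked_boolean(["banana", "fruit"]))
```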
Retrieval – Visual Features (MPEG-7)
 K-Nearest Neighbor search (K-NN)
 Find the K closest candidates ci to the query image q in a vector space
 Distance: Minkowski metrics for the distance d(ci, q), namely the L1 and L2 norms
 Most MPEG-7 descriptors are high-dimensional vectors
 The “dimensionality curse” applies
 High-dimensional spaces behave “weirdly”
 In particular, the distances are not too meaningful
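A minimal illustration of K-NN under Minkowski metrics; the `minkowski` and `knn` helpers are hypothetical names, and the 3-d vectors stand in for real descriptors with 80/62/256 dimensions.

```python
def minkowski(a, b, p):
    """Minkowski distance; p=1 gives the L1 norm, p=2 the L2 norm."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn(query, candidates, k, p=2):
    """Return the k candidates closest to `query` under the Lp norm."""
    ranked = sorted(candidates, key=lambda c: minkowski(query, c, p))
    return ranked[:k]

# Tiny illustration with 3-d "descriptors".
db = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.1, 0.2, 0.0)]
print(knn((0.0, 0.1, 0.0), db, k=2, p=1))
```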
Retrieval – Challenges for Visual
Features
 We have several (visual) feature types
How can we combine them?
 Our database is very large.
How can we search it fast enough?
 i.e. how can we avoid comparing the query vector with
each database entry?
A Combined Distance for the MPEG-7 Features
 We use a combined distance over all the visual feature types
 The individual distances occupy different ranges and follow different distributions
 The distributions were transformed to a normal distribution in the range [0,1]
 The distances are then combined linearly
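A sketch of the linear combination. A simple min-max normalization stands in for the distribution transform the slides describe, and the weights and value ranges are purely illustrative.

```python
def normalize(d, d_min, d_max):
    """Map a raw per-feature distance into [0, 1]. A min-max stand-in for
    the distribution transform described in the slides."""
    if d_max == d_min:
        return 0.0
    return (d - d_min) / (d_max - d_min)

def combined_distance(dists, stats, weights):
    """dists: {feature: raw distance}; stats: {feature: (min, max)};
    weights: {feature: w}, summing to 1. Linear combination of the
    normalized per-feature distances."""
    return sum(weights[f] * normalize(dists[f], *stats[f]) for f in dists)

dists   = {"ehd": 12.0, "htd": 0.4, "scd": 300.0}   # illustrative values
stats   = {"ehd": (0.0, 40.0), "htd": (0.0, 1.0), "scd": (0.0, 1000.0)}
weights = {"ehd": 0.4, "htd": 0.3, "scd": 0.3}
print(combined_distance(dists, stats, weights))
```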
Clustering speeds up the search
 Problem
 Millions of items in the DB
 Linear search over the whole dataset is too slow
 We are only looking for the K nearest neighbors anyway
 (One) Solution
 Partition the data into clusters, each identified by a representative, the centroid
 Only search the cluster whose centroid is closest to the query q
 K-Means clustering algorithm
 Not the best, in particular in high-dimensional spaces
 But fast!
 Problem with clustering:
 A query at the border of a cell does not find all the nearest neighbors
 Simple solution: overlapping clusters
 Problem: scalability
• Original data: 7 GB
• Overlapping data: 50 GB

Imageid | Primary Descriptor | Secondary Descriptor 1 | Secondary Descriptor 2 | Secondary Descriptor 3
122     | ehd                | htd                    | scd                    | dcd
45233   | ehd                | htd                    | scd                    | dcd
6688    | ehd                | htd                    | scd                    | dcd
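The cluster-restricted search can be sketched as follows; the names are illustrative, and in the real system the clusters come from K-Means over the MPEG-7 vectors.

```python
def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def search_with_clusters(query, clusters, k):
    """clusters: {centroid (tuple): [member vectors]}. Pick the cluster
    whose centroid is closest to the query, then do a linear scan only
    inside that cluster instead of over the whole dataset."""
    centroid = min(clusters, key=lambda c: l2(query, c))
    members = clusters[centroid]
    return sorted(members, key=lambda v: l2(query, v))[:k]

clusters = {
    (0.0, 0.0): [(0.1, 0.0), (0.0, 0.2)],
    (5.0, 5.0): [(4.9, 5.1), (5.2, 5.0)],
}
print(search_with_clusters((0.08, 0.05), clusters, k=1))
```

Note how a query near a cluster border would miss neighbors stored in the adjacent cluster, which is exactly the problem the slide raises.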
Relevance Feedback Improves the
Results
 Relevance feedback: user input improves the search results iteration by iteration
 i.e. the user selects “good matches”
 We obtain the following information:
1. A new query vector, a combination of the relevant images = query vector movement
2. The ratios for the combination of the feature types
Relevance Feedback: Query Vector
Movement
 Construct the query vector qn from the M images selected in iteration n:

qn,k(f) = (1/M) * Σi=1..M xi,k(f)

with vector component k, feature type f (EHD, SCD, HTD), and relevant images xi, i = 1...M
 The final, new query vector is

q = 0.75 * qn + 0.25 * qn-1

i.e. we move from the old query vector towards the new vector
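A sketch of the query vector movement: the component-wise mean of the relevant images, then the update q = 0.75·qn + 0.25·qn-1 stated on the slide. Plain lists stand in for the per-feature descriptor vectors.

```python
def mean_vector(vectors):
    """Component-wise mean of the relevant images' feature vectors (q_n)."""
    m = len(vectors)
    return [sum(col) / m for col in zip(*vectors)]

def move_query(relevant, q_prev, alpha=0.75):
    """New query = alpha * mean(relevant) + (1 - alpha) * previous query."""
    q_n = mean_vector(relevant)
    return [alpha * a + (1 - alpha) * b for a, b in zip(q_n, q_prev)]

relevant = [[1.0, 0.0], [0.0, 1.0]]  # two user-selected images
print(move_query(relevant, q_prev=[0.0, 0.0]))
```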
Relevance Feedback: Weight Adaptation
 Which feature is most important for the given query?
 The one for which all the relevant images are closest
 Determine the ratios for the combination based on the average distance, e.g. for the EHD:

dEHD = (1/M) * Σi=1..M dEHD(q, xi)

 and set the weight inversely proportional to it, normalized over the feature types:

wEHD = (1/dEHD) / (1/dEHD + 1/dSCD + 1/dHTD)
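A sketch of the weight adaptation, assuming each feature type's weight is the inverse of its average distance to the relevant images, normalized to sum to 1. The inverse-distance form is an assumption consistent with "the one for which all the relevant images are closest"; the slide's exact formula was lost in extraction.

```python
def adapt_weights(avg_dists):
    """avg_dists: {feature: average distance of the relevant images to the
    query under that feature}. Smaller average distance -> larger weight;
    weights are normalized to sum to 1."""
    inv = {f: 1.0 / d for f, d in avg_dists.items()}
    total = sum(inv.values())
    return {f: v / total for f, v in inv.items()}

# Illustrative averages: EHD groups the relevant images most tightly.
weights = adapt_weights({"ehd": 0.2, "scd": 0.4, "htd": 0.4})
print(weights)
```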
Implementation – Software and Hardware
 Languages: C++ and Perl
 Inline::CPP to connect them
 Layers
 WWW: Apache and CGI
 Relational DB: mySQL
 Operating system: OS X
 Hardware
 Dual 2 GHz Apple G5, 2 GB RAM
 Teran Terabyte Disk Array
Part II: The Semantics Within
Semantics: Combining Text and Visual
Features
 Our dataset is multi-modal
 Keywords and several visual
features
 Not only valid for WWW data
• Video: image+speech,
• Bio-imagery:
image+microscope setting,
cell coloring fluid
 Goal: Try to jointly use the
different modes
 Do semantic relations between
the modes exist?
 Learn something about these
semantic relations
 Improve the retrieval precision
based on them
 Challenges in our project:
 Large-scale
 Noisy and uncontrolled data
 Only global visual features
Identifying a Method to find the
Semantics
 Related work
 Latent Semantic Indexing (LSI) [Westerveld 2000]
• Problem: O(N^2 m^3), N = documents + terms, m = concept space
 Statistical models [Barnard, Forsyth 2001-2004]
• Problem: “several hours for several thousand images”
• Problem: it is a (rather strict, hierarchical) model
 Others
• Neural networks (SOM etc.)
• Hidden Markov Models
 Often: Classification
 We don’t know our classes, or: there are just too many
 We can’t train them either (data too diverse and noisy)
 Most of the methods above only tested on relatively small,
supervised datasets
 There is one more option …
Method: Data Mining for Semantic
Clues
 Mine the data for patterns
 Find them only where they exist
 Deduce Rules from them
 Scalable methods available
 Frequent Itemset Mining and Association Rules
 Classic Application: Market baskets, Census data …
 Some work on multimedia data
• [Zaïane 98]: Datacubes with appended keywords
• [Tešić et al. 03]: Perceptual associations (texture) within
images
Frequent Itemsets and Association
Rules
 Itemset I: a set of items
 Transaction T: a subset of the items, T ⊆ I
 Database D: a set of transactions
 Support of itemset A: supp(A) = |{T ∈ D : A ⊆ T}| / |D|
 A is called frequent if supp(A) ≥ minsupp
 Rule: A ⇒ B, where A and B are disjoint itemsets
 Support of a rule: supp(A ⇒ B) = supp(A ∪ B)
 Statistical significance
 Confidence of a rule: conf(A ⇒ B) = supp(A ∪ B) / supp(A)
 Strength of implication
 Maximum likelihood estimate that B is true given that A is true
Example & Advantages
 Example: Market Baskets
 Rule: {Diaper, Milk} ⇒ {Beer}

TID | Items
1   | Bread, Milk
2   | Beer, Diaper, Bread, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Coke, Bread, Diaper, Milk

 Advantages
 Human readable
 Can be edited
 Fast algorithms available
 Note: associations are not correlations
 The same concept, just simpler
 Associations and correlations: [Brin, Motwani, Silverstein 98]
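Support and confidence for the example rule can be checked directly on the market-basket table:

```python
# The five market-basket transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Beer", "Diaper", "Bread", "Eggs"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Bread", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(a, b):
    """conf(A => B) = supp(A ∪ B) / supp(A)."""
    return support(a | b) / support(a)

print(support({"Diaper", "Milk"}))          # supp(A)
print(support({"Diaper", "Milk", "Beer"}))  # supp(A ∪ B), the rule support
print(confidence({"Diaper", "Milk"}, {"Beer"}))
```

Here {Diaper, Milk} appears in 3 of 5 baskets and {Diaper, Milk, Beer} in 2 of 5, so the rule has support 0.4 and confidence 2/3.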
Using FIMI to find the itemsets
 Frequent Itemset Mining (FIMI)
 Find frequent itemsets with support > minsupp
 Minimal support minsupp given by “an expert”
 First algorithm: Apriori [Agrawal et al. 93]
 Basic idea: if an itemset is frequent, all its subsets must be frequent (monotonicity)
 k passes over the dataset for itemsets of length k
 ~O(knp) for n transactions, p items, itemsets of length k
 Today’s algorithms
 Rely on the same basic principle
 But much faster (Main Reason: Data structures)
• Usually only 2 database passes
• ~linear runtime
 State-of-the-art algorithm overview: FIMI’03
 We used: fpmax* [Grahne, Zhu: Nov 03]
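For illustration, a minimal Apriori-style miner showing the monotonicity principle: only candidates whose subsets are all frequent get counted at the next level. This sketch is not fpmax*, the far faster algorithm the system actually used.

```python
from itertools import combinations

def apriori(transactions, minsupp):
    """Return all frequent itemsets (as frozensets) with support >= minsupp."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent single items.
    current = [frozenset([i]) for i in items
               if sum(1 for t in transactions if i in t) / n >= minsupp]
    frequent = list(current)
    k = 2
    while current:
        # Candidate generation: unions of frequent (k-1)-itemsets of size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Monotonicity prune: drop candidates with an infrequent (k-1)-subset,
        # then count support for the survivors.
        current = [c for c in candidates
                   if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                   and sum(1 for t in transactions if c <= t) / n >= minsupp]
        frequent += current
        k += 1
    return frequent

tx = [{"Bread", "Milk"}, {"Beer", "Diaper", "Bread", "Eggs"},
      {"Beer", "Coke", "Diaper", "Milk"}, {"Beer", "Bread", "Diaper", "Milk"},
      {"Coke", "Bread", "Diaper", "Milk"}]
print(sorted(map(sorted, apriori(tx, minsupp=0.6))))
```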
Diapers and Beer !!?
 Application to the domain of multimedia data:
1. Formulate images as transactions
2. Low-level clusters serve as a dimensionality reduction for the visual features
3. We find associations of visual features (clusters) and keywords
4. From these associations we deduce semantic rules
 Advantages
 Comparably low computational complexity
 Other data sources can be integrated in the same manner (e.g. long-term relevance feedback)
 Challenges
 Noisy, uncontrolled data
 Associations within keywords are much stronger than associations between keywords and visual features
 Uneven distribution of cluster sizes (a K-Means problem)
Characteristics of the Itemsets and
Rules
 There are associations
 Within text: {shoe} → {walk}
 Within visual clusters: {EHD 14} → {SCD 12}
 Between text and visual clusters: {shoe} → {EHD 14}
 Measure for interestingness or choice of rules from frequent itemsets
 Confidence?
 Statistical criteria?
 Background knowledge? (Example: pregnant → woman, 100% confidence)
 Our “background knowledge”: rules that connect keywords and low-level features are more interesting
 Since this is known, the mining can be adapted and made even faster
Exploiting the Itemsets and Rules
Selecting Interesting Low-Level Clusters
based on Rules
 Clusters were introduced to partition the visual feature vector data and search only certain clusters
 Problem: we miss certain nearest neighbors if the images for a concept are spread over several clusters
 Unsatisfactory solution: overlapping clusters
 But association rules might find and solve this situation
 Clusters are re-united if the number of images for a concept in both clusters is > minsupp
 Example: {shirt} → {ehd249, ehd310} re-unites these clusters for the initial keyword query “shirt”!
 This is scalable - unlike overlapping clusters
 Another benefit: more images labeled with the original keyword are “injected” into the results of the K-NN search
 Currently: one keyword as a high-level semantic concept
 Future: find high-level semantic concepts by mining associations within text first
The Visual Link
 Another contribution, NOT related to Frequent Itemset
Mining and Association Rules…
 Since the search concept suggests visual nearest neighbor search with relevance feedback after an initial keyword search:
 It would be nice to have a diverse selection of images for a
given keyword on the first page of results
 Images sorted not only by keyword ranking, but also based
on visual feature information
 Basic idea: For a given keyword query, build groups of
images that are visually close.
 Larger groups are more important
 Show only one representative per group
The Visual Link: A Graph-Based
Approach
 Let I(Q) be the set of images matching a keyword query Q
 Define a graph G(V, E) with V = I(Q) and E = {(a, b) : d(a, b) < t}
 i.e. images are visually linked if the distance between them is lower than a given threshold t
 Do a connected component analysis to find the connected components C
 For each component C, find the “best” representative rC
 Re-rank the results based on the representatives rC
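The grouping can be sketched with a union-find connected component analysis. Choosing the member closest to the component's centroid as representative is an assumption for illustration; the slides only say "best".

```python
def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def connected_components(vectors, threshold):
    """Union-find over the visual-link graph: i and j are linked
    iff their distance is below the threshold."""
    parent = list(range(len(vectors)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if l2(vectors[i], vectors[j]) < threshold:
                parent[find(i)] = find(j)
    comps = {}
    for i in range(len(vectors)):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

def representatives(vectors, threshold):
    """One representative index per component; larger components first."""
    reps = []
    for comp in sorted(connected_components(vectors, threshold),
                       key=len, reverse=True):
        centroid = [sum(vectors[i][d] for i in comp) / len(comp)
                    for d in range(len(vectors[0]))]
        reps.append(min(comp, key=lambda i: l2(vectors[i], centroid)))
    return reps

imgs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(representatives(imgs, threshold=0.5))
```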
The Visual Link: An Example
The Visual Link: An Approximation
 Problem: distance calculations for the graph take too long
 The clusters cannot be used
 Loading individual vectors takes a lot of time
 Solution: approximate the distance
 Idea: if two images are in the same cluster and in the same distance range to the centroid ⇒ the probability that they are “close” is high
 New definition for visually linked:
 In the same cluster and in the same range of relative distance to its centroid
 Can be encoded in a relational DB! And comes at nearly no extra cost in creation

Imageid | Clusterid | 2ndClusterid | Reldist
1       | 221       | 122          | 0.6
2       | 342       | 345          | 0.8
3       | 223       | 42           | 0.2
4       | 12        | 126          | 0.4
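With this encoding, the approximate link test reduces to comparing two table rows; the band width of 0.2 below is an illustrative choice, not a value from the slides.

```python
BAND = 0.2  # illustrative width of a relative-distance band

def band(reldist):
    """Quantize a relative distance in [0, 1] into a band index."""
    return int(reldist / BAND)

def visually_linked(row_a, row_b):
    """row = (imageid, clusterid, reldist), mirroring the table above.
    Linked iff same cluster and same relative-distance band."""
    _, cluster_a, rel_a = row_a
    _, cluster_b, rel_b = row_b
    return cluster_a == cluster_b and band(rel_a) == band(rel_b)

print(visually_linked((1, 221, 0.61), (7, 221, 0.65)))  # same cluster, same band
print(visually_linked((1, 221, 0.61), (3, 223, 0.62)))  # different cluster
```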
Discussion & Demo
Discussion: Precision
 Measuring the quality of such a large-scale system is difficult
 A precision/recall measure is not possible: the ground truth is not known
• C: correct results, C = A ∩ D
• D: desired results
• A: actual results
• Precision = |C| / |A|, Recall = |C| / |D|
 We measure the precision based on user questioning
Before we continue … some
numbers
 Number of Images: 3 006 660
 Size of Image data: 111 GB
 Feature Extraction: 15 days (dual 2 GHz CPU, 2 GB RAM)
 Number of distinct keywords: 680 256
 Size of inverted keyword index table: 50 260 345 lines
 MySQL database size: 23 GB
And now … the moment you’ve all been
waiting for …
 The Demo of Cortina
Conclusions
 A system with over 3 million items was implemented
 Probably the largest CBIR system to date?
 A retrieval concept was introduced
 A keyword query followed by relevance feedback and visual nearest neighbor search
 Superior to existing retrieval concepts (query by keyword or query by example)
 Data mining to explore and exploit semantics in large-scale systems was introduced
Questions
Outlook
 Many extensions and improvements possible
 Segmentation
• Or maybe rather some simple tiling
 Indexing
• K-Means should be replaced
• Suggestion: VA-File based approach [Manjunath, Tešić 03]
 Association Rule Mining
• Multilevel Approach
• First keywords for high level semantic concepts
• Then visual features
Thanks
 Ullrich Moenich and Lars Thiele
Which Rules are of Interest?
 There are associations
 Within text: {shoe} → {walk}
 Within visual clusters: {EHD 14} → {SCD 12}
 Between text and visual clusters: {shoe} → {EHD 14, SCD 12}
 There are long and short rules
 Short rules have higher support by the nature of the problem
 Long rules contain more (precise) information about the semantics
 Measure for interestingness or choice of rules from frequent itemsets
 Confidence?
 Statistical criteria?
 Background knowledge? (Example: pregnant → woman)
Characteristics and Challenges
 Chosen criteria
 Mainly interested in rules {keywords} → {visual feature clusters} (our “background knowledge”)
 Support, confidence
 Mine long and short rules
 Restriction of the problem: mine for frequent itemsets per keyword
 i.e. all images = transactions for a given keyword
 This means
• We avoid being distracted by associations within keywords
• The method is made even more scalable
 The keyword as a placeholder for a semantic concept
 A keyword does not always stand for a single semantic concept
 Proposal for future versions: a multi-level approach:
• First {keywords} → {keywords} rules to identify “real” semantic concepts
• Then itemset mining per identified concept
Characteristics of the Itemsets and Rules
- Overall
Why keyword filtering of the results does
not work
Proposal: Semantic Clusters
 Ultimate goal: search some kind of “semantic clusters” instead of visual feature clusters
 Proposal based on the approach from Ester et al. 2002, 2003
 Clustering based on frequent itemsets, originally for text
 Clustering criterion: minimize overlap