
Aristotle University of Thessaloniki
Department of Informatics
Visual Information Retrieval
Constantine Kotropoulos
Monday July 8, 2002
1
Outline
• Fundamentals
• Still image segmentation: comparison of ICM and LVQ techniques
• Shape retrieval based on the Hausdorff distance
• Video summarization: detecting shots, cuts, and fades in video; selection of key frames
• MPEG-7: standard for multimedia applications
• Conclusions
2
Fundamentals
• About
• Toward visual information retrieval
• Data types associated with images or video
• First generation systems
• Second generation systems
• Content-based interactivity
• Representation of visual content
• Similarity models
• Indexing methods
• Performance evaluation
3
About
• Visual information retrieval:
  – To retrieve images or image sequences from a database that are relevant to a query.
  – An extension of traditional information retrieval designed to include visual media.
• Needs: tools and interaction paradigms that permit searching for visual data by referring directly to its content.
  – Visual elements (color, texture, shape, spatial relationships) related to perceptual aspects of image content.
  – Higher-level concepts: clues for retrieving images with similar content from a database.
• Multidisciplinary field:
  – Information retrieval
  – Visual data modeling and representation
  – Multimedia database organization
  – Multidimensional indexing
  – Image/video analysis and processing
  – Pattern recognition
  – Computer vision
  – User behavior modeling
  – Human-computer interaction
4
Toward visual information retrieval
• Databases
  – allow a large amount of alphanumeric data to be stored in a local repository and accessed by content through appropriate query languages.
• Information retrieval systems
  – provide access to unstructured text documents
  • search engines working in the textual domain, using either keywords or full text.
• The need for visual information retrieval systems became apparent when
  – digital archives were released, and
  – distribution of image and video data through large-bandwidth computer networks emerged;
  – it grows more prominent as we progress to the wireless era!
5
Query by image content using the NOKIA 9210 Communicator
www.iva.cs.tut.fi/COST211
Iftikhar et al.
6
Data types associated with images or video
• Content-independent metadata
  – Data not directly related to image/video content (e.g., format, author's name, date, etc.)
• Content-dependent metadata
  – Low/intermediate-level features: color, texture, shape, spatial relationships, motion, etc.
  – Data referring to content semantics (content-descriptive metadata)
• These data types have an impact on the internal organization of the retrieval system.
7
First generation systems
• Answers to queries such as: find
  – all images of paintings by El Greco;
  – all Byzantine ikons dated from the 13th century, etc.
• Content-independent metadata: alphanumeric strings
• Representation schemes: relational models, frame models, object-oriented models
• Content-dependent metadata: annotated keywords or scripts
• Retrieval: search engines working in the textual domain (SQL, full-text retrieval)
• Examples: PICDMS (1984), PICQUERY (1988), etc.
• Drawbacks:
  – Difficult for text to capture the distinctive properties of visual features
  – Text is not appropriate for modeling perceptual similarity
  – Subjective
8
Second generation systems
• Support full retrieval by visual content
  – Conceptual level: keywords
  – Perceptual level: objective measurements at the pixel level
  – Other sensory data (speech, sound) might help (e.g., in video streams).
• Image processing, pattern recognition, and computer vision are an integral part of the architecture and operation
• Retrieval systems for
  – 2-D still images
  – Video
  – 3-D images and video
  – WWW
9
Retrieval systems for 2-D still images (1)
• Content
  – Perceptual properties: color, texture, shape, and spatial relationships
  – Semantic primitives: objects, roles, and scenes
  – Impressions, emotions, and meaning associated with the combination of perceptual features
• Basic retrieval paradigm: for each image, a set of descriptive features is pre-computed
• Queries by visual examples
  – The user selects the features and ranges of model parameters, and chooses a similarity measure
  – The system checks the similarity between the visual content of the user's query and the database images.
• Objective: to keep the number of misses as low as possible. What about the number of false alarms?
• Interaction: relevance feedback
10
Retrieval systems for 2-D still images (2)
Similarity vs. matching
• Matching is a binary partition operator:
  "Does the observed object correspond to a model or not?"
  Uncertainties are managed during the process.
• Similarity-based retrieval: to re-order the database of images according to how similar they are to a query example.
  Ranking, not classification.
  The user is in the retrieval loop; need for a flexible interface.
11
Retrieval systems for video (1)
• Video conveys information on multiple planes of communication:
  – how the frames are linked together using editing effects (cuts, fades, dissolves, etc.);
  – what is in the frames (characters, story content, etc.).
• Each type of video (commercials, news, movies, sport) has its own peculiar characteristics.
• Basic terminology
  – Frame: basic unit of information, usually sampled at 1/25 or 1/30 of a second.
  – Shot: a set of frames between a camera turn-on and a camera turn-off.
  – Clip: a set of frames with some semantic content.
  – Episode: a hierarchy of shots.
  – Scene: a collection of consecutive shots that share simultaneity in space, time, and action (e.g., a dialog scene).
• Video is accessed through browsing and navigation.
12
Retrieval systems for video (2)
13
Retrieval systems for 3-D images and video / WWW
• 3-D images and video are available in
  – biomedicine
  – computer-aided design
  – geographic maps
  – painting
  – the games and entertainment industry (immersive environments)
• Expected to flourish in the current decade
• Retrieval on the WWW:
  – A distributed problem
  – Need for standardization (MPEG-7)
  – Response time is critical (work in the compressed domain, summarization)
14
Research directions
1. Visual interfaces
2. Standards for content representation
3. Database models
4. Tools for automatic extraction of features from images and video
5. Tools for extraction of semantics
6. Similarity models
7. Effective indexing
8. Web search and retrieval
9. Role of 3-D
15
Content-based interactivity
• Browsing offers a panoramic view of the visual information space
• Visualization
www.virage.com
16
QBIC: query by color and layout
http://wwwqbic.almaden.ibm.com/
17
Querying by content (1)
For still images:
• To check if the concepts expressed in a query match the concepts of database images:
  "find all Holy Ikons with a nativity"
  "find all Holy Ikons with Saint George" (object categories)
  Treated with free-text or SQL-based retrieval engines (Google).
• To verify spatial relations between spatial entities:
  "find all images with a car parked outside a house"
  – topological queries (disjunction, adjacency, containment, overlapping)
  – metric queries (distances, directions, angles)
  Treated with SQL-like spatial query languages.
18
Querying by content (2)
• To check the similarity of perceptual features (color, texture, edges, corners, and shapes):
  – exact queries: "find all images of President Bush"
  – range queries: "find all images with colors between green and blue"
  – K-nearest-neighbor queries: "find the ten most similar images to the example"
For video:
• Concepts related to video content
• Motion, objects, texture, and color features of video: shot extraction, dominant colors, etc.
19
Google
20
Ark of Refugee Heirloom
www.ceti.gr/kivotos
21
Querying by visual example
• Suited to express perceptual aspects of low/intermediate-level features of visual content.
• The user provides a prototype image as a reference example.
• Relevance feedback: the user analyzes the responses of the system and indicates, for each item retrieved, the degree of relevance or the exactness of the ranking; the annotated results are fed back into the system to refine the query.
• Types of querying:
  – Iconic (PN):
    • suitable for retrieval based on high-level concepts
  – By painting:
    • employed in color-based retrieval (NETRA)
  – By sketch (PICASSO)
  – By image (NETRA)
22
PICASSO/PN
http://viplab.dsi.unifi.it/PN/
23
NETRA
http://maya.ece.ucsb.edu/Netra/netra.html
24
Representation of visual content
• Representation of perceptual features of images and video is a fundamental problem in visual information retrieval.
• Image analysis and pattern recognition algorithms provide the means to extract numeric descriptors.
• Computer vision enables object and motion identification.
• Representation of perceptual features:
  – Color
  – Texture
  – Shape
  – Structure
  – Spatial relationships
  – Motion
• Representation of content semantics:
  – Semantic primitives
  – Semiotics
25
Representation of perceptual features
Color (1)
26
Representation of perceptual features
Color (2)
• Human visual system: the cones are responsible for color perception.
• From a psychological point of view, the perception of color is related to several factors, e.g.,
  – color attributes (brightness, chromaticity, saturation)
  – surrounding colors
  – color spatial organization
  – the observer's memory/knowledge/experience
• Geometric color models (RGB, HSV, Lab, etc.)
• Color histogram: describes the low-level color properties.
27
Image retrieval by color similarity (1)
• Color spaces
• Histograms; moments of the distribution
• Quantization of the color space
• Similarity measures:
  – L1 and L2 norms of the difference between the query histogram H(I_Q) and the histogram of a database image H(I_D)
28
Image retrieval by color similarity (2)
• histogram intersection
• weighted Euclidean distance
(A sketch of these measures follows.)
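A minimal sketch of these color-similarity measures on normalized NumPy histograms; the helper names and the 8-bins-per-channel quantization are illustrative assumptions, not part of any system described above.

import numpy as np

def color_histogram(image, bins=8):
    """Normalized joint RGB histogram of an H x W x 3 uint8 image."""
    q = image.reshape(-1, 3) // (256 // bins)            # quantize each channel
    idx = (q[:, 0] * bins + q[:, 1]) * bins + q[:, 2]    # joint bin index
    h = np.bincount(idx, minlength=bins ** 3).astype(float)
    return h / h.sum()

def l1_distance(hq, hd):
    return np.abs(hq - hd).sum()

def l2_distance(hq, hd):
    return np.sqrt(((hq - hd) ** 2).sum())

def histogram_intersection(hq, hd):
    """Similarity in [0, 1]; 1 for identical normalized histograms."""
    return np.minimum(hq, hd).sum()

def weighted_euclidean(hq, hd, w):
    """w holds per-bin weights, e.g., reflecting perceptual color similarity."""
    return np.sqrt((w * (hq - hd) ** 2).sum())

Note that histogram intersection is a similarity (higher is better), while the other three are dissimilarities.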
29
Representation of perceptual features
Texture (1)
• Texture: one level of abstraction above pixels. [Figure: samples from the Brodatz album]
• Perceptual texture dimensions:
  – Uniformity
  – Density
  – Coarseness
  – Roughness
  – Regularity
  – Linearity
  – Directionality/direction
  – Frequency
  – Phase
30
Representation of perceptual features
Texture (2)
• Statistical methods:
  – Autocorrelation function (coarseness, periodicity)
  – Frequency content [rings, wedges]: coarseness, directionality, isotropic/non-isotropic patterns
  – Moments
  – Directional histograms and related features
  – Run-lengths and related features
  – Co-occurrence matrices (see the sketch below)
• Structural methods (grammars and production rules)
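To make the co-occurrence approach concrete, here is a hedged sketch; the quantization level, the displacement, and the three statistics are illustrative choices, not the specific features used by any system cited here.

import numpy as np

def cooccurrence_matrix(img, dx=1, dy=0, levels=16):
    """Normalized gray-level co-occurrence matrix of a 2-D uint8 image
    for the displacement (dx, dy), with dx, dy >= 0."""
    g = img.astype(int) * levels // 256              # quantize gray levels
    a = g[:g.shape[0] - dy, :g.shape[1] - dx]        # reference pixels
    b = g[dy:, dx:]                                  # displaced neighbors
    C = np.zeros((levels, levels))
    np.add.at(C, (a.ravel(), b.ravel()), 1)
    return C / C.sum()

def texture_features(C):
    """Contrast, energy, and homogeneity of a co-occurrence matrix C."""
    i, j = np.indices(C.shape)
    return {
        "contrast": ((i - j) ** 2 * C).sum(),
        "energy": (C ** 2).sum(),
        "homogeneity": (C / (1.0 + np.abs(i - j))).sum(),
    }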
31
Representation of perceptual features
Shape (1)
• Criteria of a good shape representation:
  – Each shape possesses a unique representation, invariant to translation, rotation, and scaling.
  – Similar shapes should have similar representations.
• Methods to extract shapes and to derive features stem from image processing:
  – Chain codes
  – Polygonal approximations
  – Skeletons
  – Boundary descriptors:
    • contour length/diameter
    • shape numbers
    • Fourier descriptors (see the sketch below)
    • moments
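As one concrete boundary descriptor, a minimal sketch of Fourier descriptors that satisfies the invariance criteria above, assuming an ordered, closed contour with more than k points; the normalization shown is one common choice.

import numpy as np

def fourier_descriptors(contour, k=16):
    """Fourier descriptors of a closed contour given as an (N, 2) array of
    ordered boundary points, N > k.  Dropping the DC coefficient gives
    translation invariance, dividing by the first harmonic gives scale
    invariance, and keeping magnitudes only gives rotation/start-point
    invariance."""
    z = contour[:, 0] + 1j * contour[:, 1]     # boundary as a complex signal
    F = np.fft.fft(z)
    mags = np.abs(F[1:k + 1])
    return mags / (mags[0] + 1e-12)

def shape_distance(c1, c2, k=16):
    """Euclidean distance between the descriptors of two contours."""
    return np.linalg.norm(fourier_descriptors(c1, k) - fourier_descriptors(c2, k))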
32
Representation of perceptual features
Shape (2)
[Figures: polygonal approximation; chain codes (I. Pitas)]
33
Representation of perceptual features
Shape (3)
Face segmentation: (a) original color image; (b) skin segmentation; (c) connected components; (d) best-fit ellipses.
34
Representation of perceptual features
Structure/Spatial relationships
• Structure
  – To provide a Gestalt impression of the shapes in the image:
    • set of edges
    • corners
  – To distinguish photographs from drawings.
  – To classify scenes: portrait, landscape, indoor.
• Spatial relationships
  – Spatial entities: points, lines, regions, and objects
  – Relationships:
    • directional (include a distance/angle measure)
    • topological (do not include distance, but capture set-theoretical concepts, e.g., disjunction)
  – They are represented symbolically.
35
Representation of perceptual features
Motion
• The main characterizing element in a sequence of frames.
• Related to a change in the relative position of spatial entities or to a camera movement.
• Methods:
  – detection of temporal changes of gray-level primitives (optical flow);
  – extraction of a set of sparse characteristic features of the objects, such as corners or salient points, and their tracking in subsequent frames (see the sketch below).
• Plays a crucial role in video.
[Figure: salient features (Kanade et al.)]
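A sketch of the second method in the spirit of Kanade-Lucas-Tomasi tracking, using OpenCV's standard corner detector and pyramidal Lucas-Kanade optical flow; the parameter values are illustrative assumptions.

import cv2

def track_salient_points(prev_gray, next_gray):
    """Detect corners in prev_gray and track them into next_gray."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:                       # no trackable corners found
        return [], []
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    ok = status.ravel() == 1              # keep successfully tracked points
    return pts[ok], nxt[ok]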
36
Representation of content semantics
Semantic primitives
• Identification of objects, roles, actions, and events as abstractions of visual signs.
• Achieved through recognition and interpretation.
• Recognition:
  – select a set of low-level local features and apply statistical pattern recognition for object classification.
• Interpretation is based on reasoning.
• Domain-dependent, e.g., Photobook (www-white.media.mit.edu).
• Retrieval systems including interpretation: facial database systems that compare facial expressions.
37
Representation of content semantics
Semiotics
• Grammar of color usage to formalize effects.
• Association of color hue, saturation, etc. with psychological behaviors.
• Semiotics identifies two distinct steps for the production of meaning:
  – the abstract level, by narrative structures (e.g., camera breaks, colors, editing effects, rhythm, shot angle);
  – the concrete level, by discourse structures: how the narrative elements create a story.
38
Similarity models
• Pre-attentive: perceived similarity between stimuli
  – color/texture/shape;
  – models close to human perception
• Attentive:
  – interpretation
  – previous knowledge and a form of reasoning
  – domain-specific retrieval applications (mugshots); need for models and similarity criteria definition
39
Metric model (1)
• Distance in a metric psychological space.
• Properties of a distance function d: positivity, self-identity, symmetry, and the triangle inequality (see below).
• Commonly used distance functions:
  – Euclidean
  – City-block
  – Minkowski
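In standard notation, the properties of a distance function d and the Minkowski family that subsumes the distances listed above are:

\begin{align*}
& d(x,y) \ge 0, \quad d(x,y) = 0 \iff x = y  && \text{(positivity, self-identity)} \\
& d(x,y) = d(y,x)                            && \text{(symmetry)} \\
& d(x,z) \le d(x,y) + d(y,z)                 && \text{(triangle inequality)} \\[4pt]
& d_p(\mathbf{x},\mathbf{y}) = \Big( \sum_{i=1}^{n} |x_i - y_i|^p \Big)^{1/p}
  && \text{(Minkowski: } p=2 \text{ Euclidean, } p=1 \text{ city-block)}
\end{align*}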
40
Metric model (2)
• Inadequacies: shape similarity
• Advantages:
  – similarity judgment of color stimuli
  – consistent with pattern recognition and computer vision
  – suitable for creating indices
• Other similarity models:
  – virtual metric spaces
  – Tversky's model: a function of two types of features, those that are common to the two stimuli and those that appear in only one of them.
• Transformational distances: elastic graph matching
• User subjectivity?
41
Four eyes approach
• Self-improving database browser and annotator based on user interaction.
• Similarity is presented with groupings.
• The system chooses, in tree hierarchies, the nodes which most efficiently represent the positive examples.
• A set-covering algorithm removes all positive examples already covered.
• Iterations.
42
Indexing methods (1)
• To avoid sequential scanning.
• Retrieved images are ranked in order of similarity to a query.
• Compound measure of similarity between visual features and text attributes.
• Indexing of string attributes.
• Commonly used indexing techniques:
  – hashing tables and signatures
  – cosine similarity function
43
Indexing methods (2)
• Triangle inequality (Barros et al.): the distances d(i,r) between every database item i and a reference item r are pre-computed; by the triangle inequality, d(q,i) ≥ |d(i,r) − d(q,r)|.
  – When the query item q is presented, d(q,r) is computed.
  – Threshold l = the distance from q to the most similar item found so far (initially d(q,r)).
  – Items are examined in order of |d(i,r) − d(q,r)|, i.e., with d(i,r) closest to d(q,r) first.
  – If an item i with d(q,i) less than l is found, item i is regarded as the most similar item, and l = d(q,i).
  – Continue until |d(i,r) − d(q,r)| ≥ l; no remaining item can be closer (a sketch follows).
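A sketch of this pruned nearest-neighbor search; the function and variable names are assumptions, and d can be any metric.

import numpy as np

def nn_search_triangle(query, items, d, d_to_ref, ref):
    """Nearest-neighbor search pruned by the triangle inequality
    (after Barros et al.).  d_to_ref[i] = d(items[i], ref) is
    pre-computed off-line; d(q, i) >= |d(i, ref) - d(q, ref)|."""
    dqr = d(query, ref)
    bounds = np.abs(d_to_ref - dqr)        # lower bounds on d(query, i)
    order = np.argsort(bounds)             # most promising items first
    best, l = None, np.inf
    for i in order:
        if bounds[i] >= l:                 # no remaining item can be closer
            break
        di = d(query, items[i])
        if di < l:
            best, l = i, di
    return best, l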
44
Index structures
• Fixed grids: a non-hierarchical index structure that organizes the space into buckets.
• Grid files: fixed grids with buckets of unequal size.
• K-d trees: a binary tree; the value of one of the k features is checked at each node.
• R-trees: partition the feature space into multidimensional rectangles.
• SS-trees: weighted Euclidean distance; suitable for clustering; ellipsoidal clusters.
45
Performance evaluation
Judgment by evaluator:

                 Relevant                   Not relevant
Retrieved        A (correctly retrieved)    C (falsely retrieved)
Not retrieved    B (missed)                 D (correctly rejected)

recall = A / (A + B)
precision = A / (A + C)
46
Wrap-up
• Visual information retrieval is a research topic at the intersection of digital image processing, pattern recognition, and computer vision (fields of our interest/expertise), but also of information retrieval and databases.
• Related to the semantic web.
• A challenging research topic dealing with many unsolved problems:
  – segmentation
  – machine similarity vs. human perception
  – focused searching
47
Still Image Segmentation:
Comparison of ICM and LVQ
• Comparison of
  – Iterated Conditional Modes (ICM)
  – Split-and-Merge Learning Vector Quantizer (LVQ)
• Ability to extract meaningful image parts based on the ground truth
• Evaluation of still image segmentation algorithms
48
Iterated Conditional Modes
(ICM)
The ICM method is based on the maximization of the probability density function of the image model given real image data. The criterion function is:

p(x_s | y_s, x_q, q ∈ N_8(s)) ∝ exp( −(y_s − m_i)² / (2σ_i²) − Σ_{C ∋ s} V_C(x) )

where x_s is the region assignment and y_s is the luminance value of the pixel s; m_i and σ_i are the mean value and the standard deviation of the luminance of region i; C is a clique containing the pixel s; V_C(x) is the potential function of C; and N_8(s) is the 8-neighborhood of the pixel s.
49
How ICM works
An initial segmentation is obtained using the K-means clustering algorithm; cluster center initialization is based on the image intensity histogram.

At each iteration, the value of the criterion function (a probability) is calculated for each pixel, and pixels are assigned to the clusters (regions) with maximum probability.

Given the new segmentation, the mean intensity value and the cluster variance are re-estimated. The iterative process stops when no change occurs in the clusters.

In the resulting segmentation, small regions are merged with their nearest neighbors. The output image contains the large regions, each assigned its mean luminance value. (A sketch follows.)
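A minimal sketch of this loop for a 2-D gray-level array y; the number of regions k, the potential value beta, the percentile-based initialization, and the omission of the final small-region merging are simplifying assumptions, not the exact implementation evaluated here.

import numpy as np

def icm_segment(y, k=4, beta=0.5, max_iter=20):
    """Minimal ICM sketch for gray-level segmentation."""
    centers = np.percentile(y, np.linspace(5, 95, k))      # histogram-based init
    labels = np.abs(y[..., None] - centers).argmin(-1)
    for _ in range(max_iter):
        m = np.array([y[labels == i].mean() if (labels == i).any() else centers[i]
                      for i in range(k)])
        v = np.array([y[labels == i].var() + 1e-6 if (labels == i).any() else 1.0
                      for i in range(k)])
        # count disagreeing 8-neighbors for every candidate label
        pad = np.pad(labels, 1, mode="edge")
        disagree = np.zeros(y.shape + (k,))
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                if di or dj:
                    nb = pad[1 + di:pad.shape[0] - 1 + di,
                             1 + dj:pad.shape[1] - 1 + dj]
                    disagree += (nb[..., None] != np.arange(k))
        # minimize -log p = (y - m_i)^2 / (2 s_i^2) + beta * (# disagreements)
        energy = (y[..., None] - m) ** 2 / (2 * v) + beta * disagree
        new_labels = energy.argmin(-1)
        if np.array_equal(new_labels, labels):              # convergence
            break
        labels = new_labels
    return labels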
50
Image features and parameters of the
ICM algorithm
The ICM algorithm is applied to the luminance component of the image; the input is a gray-level image.
The parameter of the algorithm is the value of the potential function.
The parameter controls the roughness of the segment boundaries.
The value of the parameter is tuned experimentally.
51
Segmentation results (ICM )
52
Learning Vector Quantizer (1)
• neural network
• self-organizing
• competitive learning law
• unsupervised
• approximates the data pdf by adjusting the weights of the reference vectors
53
Learning Vector Quantizer (2)
• codebook
• reference vectors representing their nearest data patterns
• number of reference vectors:
  – predefined
  – split and merge
54
Learning Vector Quantizer (3)
Minimal error for data representation:

ε = ∫ ‖x − w_c‖^r f(x) dx

Iterative correction of the reference vectors:

w_c(k+1) = w_c(k) + a(k) [x_i(k) − w_c(k)]
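A sketch of one unsupervised pass of this correction rule; the constant learning rate lr is an assumption (in practice a(k) decays over time).

import numpy as np

def lvq_epoch(data, codebook, lr=0.05):
    """One unsupervised LVQ pass: move the winning reference vector
    toward each pattern, w_c(k+1) = w_c(k) + a(k) * (x(k) - w_c(k))."""
    for x in data:
        c = np.linalg.norm(codebook - x, axis=1).argmin()   # winner
        codebook[c] += lr * (x - codebook[c])
    return codebook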
55
Learning Vector Quantizer (4)
Split-and-merge technique:
– Find the winner reference vector w(k) for pattern x(k).
– If x(k) is not an outlier, proceed as in standard LVQ.
– If x(k) is an outlier, either
  • split the cluster and include x(k) in one of the sub-clusters, or
  • create a new cluster with seed x(k).
56
Learning Vector Quantizer (5)
57
Experimental set-up (1)
• Apply both methods to images provided by BAL (the Bridgeman Art Library).
• Explore the ability of the algorithms to extract meaningful image parts based on the qualitative description of the ground truth.
58
Paintings from the Bridgeman Art Library
– sky, mountains, people, water (smpw)
– hammerhead cloud, reflection (cr)
– sky, buildings, trees, people, pavement (sbtpp)
– sky, people, hat (sph)
– sky, trees, water, sails (stws)
– horses, sledges, people, snow, sky (hspss)
59
Experimental setup (2)
• We denote by O = {O1, …, OM} the set of objects given in the qualitative description of the ground truth, where M is the number of objects.
• We denote by T = {T1, …, TN} the set of uniquely labeled regions obtained in the segmented image, where N is the number of regions.
• Three cases are possible for the outcome of the segmentation as compared to the ground truth.
60
Matching
Case 1, best match (BM): a region of the segmented image has a one-to-one correspondence with the ground truth object.
Case 2, reasonable match (RM): the ground truth object has a one-to-many correspondence with regions of the segmented image.
Case 3, mismatch (MM): there is no correspondence between the ground truth object and the regions of the segmented image.
61
Three Cases
For the jth ground truth object O_j, denoting the cases by i and the segmented regions by T_k, the three cases occur as follows:

i = 1 when O_j ∩ T = T_k for a single k ∈ {1, …, N},
i = 2 when O_j ∩ T = ∪_k T_k for more than one region T_k,
i = 3 when O_j ∩ T = ∅.
62
Decision
The decision about the presence of the ground truth object O_j in the segmented image is, for each case i:

r_ij = 1 if O_j falls in case i, and 0 otherwise.

We record a decision for each object after visual examination of the segmented image according to the definition of the ground truth.
63
Assessment of results (1)
Ground truth: sky, buildings, trees, people, pavement
64
Assessment of results (2)
LVQ
65
Assessment of results (3)
ICM
66
Assessment of results (4)
Ground truth: horses, sledges, people, snow, sky
67
Assessment of results (5)
LVQ
68
Assessment of results (6)
ICM
69
Assessment of results (7)
Number of regions

Image    Gr. truth   ICM   LVQ
smpw         5         7     7
cr           4        10     8
sbtpp        5        13     8
sph          3        14     8
stws         4        15    11
hspss        3         5     8
70
Assessment of results (8)
Ranking: ICM vs. LVQ

                 ICM                 LVQ
Image     BM    RM    MM     BM    RM    MM
smpw      0     0.75  0.25   0.5   0.5   0
cr        0     0.75  0.25   0.5   0.5   0
sbtpp     0.4   0.4   0.2    0.2   0.8   0
sph       0     1     0      0     1     0
stws      0.5   0.25  0.25   0.25  0.75  0
hspss     0.33  0     0.66   0     1     0

BM: Best Match; RM: Reasonable Match; MM: Mismatch
71
Assessment of results (9)
Ranking: ICM vs. LVQ (averages over the six images)

                 ICM                 LVQ
          BM    RM    MM     BM    RM    MM
Average   0.20  0.53  0.27   0.24  0.76  0

BM: Best Match; RM: Reasonable Match; MM: Mismatch
72
Evaluation of Image Segmentation
Algorithms (1)
Cián Shaffrey, Univ. of Cambridge
73
Evaluation of Image Segmentation
Algorithms (2)
• Evaluation within the semantic space: impossible to ask the Average User to provide all possible h.
• Compromise: evaluation in the indexing space; allows us to access S without explicitly defining σ.
• Average User: to achieve a consensus on h.
• Ask users to evaluate two proposed arrows π to obtain the Average User's response, implicitly characterizing h and σ.
74
Evaluation of Image Segmentation
Algorithms (3)
Unsupervised algorithms:
1. Multiscale Image Segmentation (UCAM-MIS)
2. Blobworld (UC Berkeley-Blobworld)
3. Iterated Conditional Modes (AUTH-ICM)
4. Learning Vector Quantizer (AUTH-LVQ)
5. Double Markov Random Field (TCD-DMRF)
6. Complex Wavelet-based Hidden Markov Tree (UCAM-CHMT)
75
Evaluation of Image Segmentation
Algorithms (4)
• Hard measurements
• Soft measurements: the speed of response of the user (time⁻¹), i.e., how strongly the user prefers one scheme over the other
  – Faster response: the selected scheme provides a better semantic breakdown of the original image
  – Slower response: reflects the similarity of the two schemes
Aims:
• To determine whether or not agreement exists in users' decisions
• Do pairwise rankings lead to consistent total orderings?
• Do hard and soft measurements coincide?
76
Evaluation of Image Segmentation
Algorithms (5)
Cián Shaffrey, Univ. of Cambridge
77
Evaluation of Image Segmentation
Algorithms (6)
Cián Shaffrey, Univ. of Cambridge
78
Wrap-up
ICM
– continuous, large-sized regions
– appropriate for homogeneous regions
LVQ
– spatially connected, small regions
– more detailed segmentation
Both provide good reasonable matches (RM).
79
Image retrieval based on the Hausdorff distance
• Hausdorff distance definition
• Advantages
• How to speed up the computations
• Experiments
80
Hausdorff distance definition
dH(A,B)  = max(dH+(A,B), dH−(A,B))
dH+(A,B) = sup {d(x,B) : x ∈ A}
dH−(A,B) = sup {d(y,A) : y ∈ B},
where d(v,W) = inf {d(v,w) : w ∈ W}.
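A direct sketch of these definitions for finite point sets given as (n, 2) NumPy arrays; it is quadratic in memory, and SciPy's scipy.spatial.distance.directed_hausdorff computes the directed distance more efficiently.

import numpy as np

def directed_hausdorff(A, B):
    """dH+(A,B) = sup_{x in A} inf_{y in B} d(x,y) for point sets A, B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    return d.min(axis=1).max()

def hausdorff(A, B):
    """dH(A,B) = max(dH+(A,B), dH-(A,B))."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))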
81
Hausdorff distance advantages
• dH(A,B) = 0 ⇔ A = B
  (A, B: sets representing graphical objects, object contours, etc.)
• Information about the parameters of the transformation (complex object recognition)
• Predictable: simple, intuitive interpretation
• dH+ and dH−: useful for partially occluded or erroneously segmented objects
• Possibility of generalization: max → quantiles
• Possibility of taking any object transformations into consideration
82
How to speed up the computations
for comparing one pair (1)
A. Replacing objects by their contours
The HD between two objects may be large although the HD between their contours is small (e.g., disk and ring)
⇒ possibility of false alarms,
but contours of similar objects are always similar (small HD)
⇒ no possibility of omitting similar objects.
83
How to speed up the computations for
comparing one pair (2)
B. Voronoi diagram or distance transform
C. Early scan termination
D. Pruning some parts of transformation space
84
How to speed up the computations –
number of models considered
Idea: pre-compute the matrix of distances between every pair of models ⇒
1. Pruning some models (we know they will not match the query)
2. Database navigation in an optimal search order (possibility of early termination)
85
How to speed up the computations
A. Excluding model objects from the search
[Figure: ref is any model object; given the query and the distance to the closest model found so far, the triangle inequality implies that the model closest to the query object may lie only in the colored area (an annulus around ref).]
86
How to speed up the computations
B. Pruning with
many reference
objects
87
How to speed up the computations
C. Optimal
searching
order
88
How to speed up the computations
D. Introducing other criteria (pre-computation)

Moment invariants:
• M1 = (M20 + M02) / m00²
• M2 = (M20·M02 − M11²) / m00⁴
where the central and raw moments are
M_pq = Σ_{i=1..I} Σ_{j=1..J} (i − i0)^p (j − j0)^q f_ij
m_pq = Σ_{i=1..I} Σ_{j=1..J} i^p j^q f_ij

Shape coefficients:
• Blair-Bliss coefficient
W_BB = S / √(2π ∫ r² dS)
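A sketch computing M1 and M2 from these definitions for an image f with pixel values f_ij; the function name is hypothetical.

import numpy as np

def moment_invariants(f):
    """M1 and M2 as defined above for a binary/gray shape image f."""
    i, j = np.indices(f.shape)
    m00 = f.sum()
    i0, j0 = (i * f).sum() / m00, (j * f).sum() / m00    # center of gravity
    mu = lambda p, q: (((i - i0) ** p) * ((j - j0) ** q) * f).sum()
    M1 = (mu(2, 0) + mu(0, 2)) / m00 ** 2
    M2 = (mu(2, 0) * mu(0, 2) - mu(1, 1) ** 2) / m00 ** 4
    return M1, M2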
89
Experiments - database
Database: 76 islands, represented as *.bmp images
90
Experiment 1: map query
Image retrieval. Step 1: interactive segmentation of query object
91
Experiment 1: map query
Searching order: 8 / 76 model objects were checked
Loading model 1 / 76: "amorgos.bmp"
Hausdorff distance: 0.156709
Loading model 42 / 76: "ithaca.bmp"
Hausdorff distance: 0.143915
Loading model 27 / 76: "ikaria.bmp"
Hausdorff distance: 0.080666
Loading model 31 / 76: "kasos.bmp"
Hausdorff distance: 0.080551
Loading model 20 / 76: "sikinos.bmp"
Hausdorff distance: 0.121180
Loading model 52 / 76: "alonissos.bmp"
Hausdorff distance: 0.153914
Loading model 17 / 76: "rithnos.bmp"
Hausdorff distance: 0.103512
Loading model 61 / 76: "skopelos.bmp"
Hausdorff distance: 0.045430
92
Experiment 1: map query
Minimum Hausdorff distance: the model closest to the query object
93
Experiment 2: mouse-drawing query
Query: mouse-drawn shape
Criteria: HD alone (position shown for min HD) and combined HD + M1 + M2 + WBB (MCD)

Closest:  Santorini   (HD = 0.112, MCD = 1.024)
Second:   Poros       (HD = 0.143, MCD = 1.771)
Furthest: Elafonisos  (max HD = 0.3072, max MCD = 3.4326)
94
Wrap-up
The Hausdorff distance is better for shape recognition than feature-based criteria. The big computational cost of image retrieval based on the HD can be reduced by:
• decreasing the cost of the computation for a pair of objects
  – replacing each object by its contour
  – using a Voronoi diagram
• off-line database processing: calculating the matrix of distances between model objects
• reducing the number of model objects to be compared
  – optimal searching order
  – using features as auxiliary similarity criteria
95
Video Summarization: Detecting shots, cuts, and fades in video – Selection of key frames
96
Outline
• Entropy, joint entropy, and mutual information
• Shot cut detection based on mutual information
• Fade detection based on joint entropy
• Key frame selection
• Comparison with other methods
• Wrap-up
97
Entropy-Joint Entropy
• Entropy of a random variable X (RV): a measure of the information content or the "uncertainty" about X:
  H(X) = − Σ_x p(x) log p(x)
• Joint entropy of RVs X and Y:
  H(X,Y) = − Σ_{x,y} p(x,y) log p(x,y)
98
Mutual Information
Mutual information measures the average reduction in uncertainty about X that results from learning the value of Y, i.e., the amount of information that X conveys about Y:

I(X;Y) = H(X) + H(Y) − H(X,Y) = Σ_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
99
Algorithm for detecting abrupt cuts (1)
For each pair of successive frames f_t and f_{t+1}, whose gray levels vary from 0 to N−1:
• Calculate three N×N co-occurrence matrices, one for each chromatic component R, G, and B: C^R_{t,t+1}, C^G_{t,t+1}, C^B_{t,t+1}. Their (i,j) element is the joint probability of observing a pixel with the ith gray level in f_t and the jth gray level in f_{t+1}.
• Calculate the mutual information of the gray levels for the three components R, G, B independently and sum them (a sketch follows).
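A sketch of the mutual-information computation for one pair of frames; uint8 arrays are assumed, so N = 256, and the function names are illustrative.

import numpy as np

def channel_mutual_information(prev, curr, levels=256):
    """Mutual information of the gray levels of one chromatic component
    of two successive frames, via their co-occurrence matrix C_{t,t+1}."""
    C = np.zeros((levels, levels))
    np.add.at(C, (prev.ravel(), curr.ravel()), 1)
    C /= C.sum()                                    # joint probability p(i, j)
    pi, pj = C.sum(1, keepdims=True), C.sum(0, keepdims=True)
    nz = C > 0                                      # avoid log(0)
    return (C[nz] * np.log2(C[nz] / (pi @ pj)[nz])).sum()

def frame_mutual_information(f1, f2):
    """Sum of the R, G, B mutual informations for (H, W, 3) uint8 frames."""
    return sum(channel_mutual_information(f1[..., c], f2[..., c]) for c in range(3))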
100
Algorithm for detecting abrupt cuts (2)
– Apply a robust estimator of the mean value to the time series of mutual information values by defining a time window around each time instant t0.
– An abrupt cut is detected at t0 if the mutual information drops significantly below this robust local mean.
101
Mutual information pattern (1)
• Mutual information pattern from “star” video sequence that
depicts cuts
cuts
102
Ground truth
103
Performance evaluation
GT denotes the ground truth; Seg denotes the shots (correct and false) segmented by our methods.
Recall corresponds to the probability of detection.
Precision corresponds to the accuracy of the method, taking false detections into account.
Overlap is used for fades.
104
Test results (1)
105
Test results (2)
106
Alternative technique for shot cut
detection
– Features that could be used to define a distance measure:
• Successive color frame differences:
• Successive color vector bin-wise HS histogram differences (invariant to brightness changes):
– Fusion of the two differences:
– Shot cut detection
• by adaptive local thresholding
107
Comparison of abrupt cut detection
methods
results using the combined
method
results using mutual
information
108
Fades (1)
If G(x,y,t) is a gray-scale sequence, then a chromatic scaling of G(x,y,t) can be modeled as S(x,y,t) = α(t) G(x,y,t) with a time-varying scaling factor α(t). Therefore, a fade-out of duration T can be modeled with α(t) = 1 − t/T, and a fade-in with α(t) = t/T.
109
Fades (2)
part of video sequence showing fade-in
part of video sequence showing fade-out
110
Mutual information pattern (2)
• Mutual information pattern from “basketball” video sequence
showing cuts and fade
cuts
fade
111
Algorithms for detecting fades (1)
For each pair of successive frames f_t and f_{t+1}:
• calculate the joint entropy of the basic chromatic components;
• determine the values of the joint entropy close to zero;
• detect a fade-out (fade-in):
  – the first (last) zero value defines the end (start) of the fade-out (fade-in);
  – find the start (end) of the fade-out (fade-in).
A fade should have a duration of at least 2 frames (a sketch follows).
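A sketch of the joint-entropy computation and the near-zero-run test; the threshold eps, the minimum run length, and the run handling are illustrative assumptions.

import numpy as np

def joint_entropy(prev, curr, levels=256):
    """Joint entropy H(f_t, f_{t+1}) of the gray levels of two frames."""
    C = np.zeros((levels, levels))
    np.add.at(C, (prev.ravel(), curr.ravel()), 1)
    C /= C.sum()
    nz = C > 0
    return -(C[nz] * np.log2(C[nz])).sum()

def detect_fades(joint_entropies, eps=0.1, min_len=2):
    """Flag runs of near-zero joint entropy (monochrome frames) of length
    >= min_len; a run's first/last frame marks the end of a fade-out /
    the start of a fade-in."""
    runs, start = [], None
    for t, h in enumerate(joint_entropies):
        if h < eps and start is None:
            start = t
        elif h >= eps and start is not None:
            if t - start >= min_len:
                runs.append((start, t - 1))
            start = None
    if start is not None and len(joint_entropies) - start >= min_len:
        runs.append((start, len(joint_entropies) - 1))
    return runs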
112
Joint entropy pattern (1)
Fade out
frame 1765
frame 1770
frame 1785
frames 1791-1802
cut
frame 1775
frame 1803
frame 1780
frame 1805
113
Joint entropy pattern (2)
threshold
fade
fade
frame 4420
frame 4425
frame 4426
frame 4430
frame 4440
Cut to the dark frame
114
Comparison of fade detection methods (1)
results using the joint entropy
results using the average frame value
115
Comparison of fade detection methods (2)
results using the joint entropy
results using the average frame value
116
Algorithm for key frame selection (1)
A split-and-merge algorithm:
• based on the series of mutual information values of the gray levels of successive frames within the shot;
• choose clusters of large size;
• select as a potential key frame the first frame of each cluster;
• test the similarity of the potential key frames using the mutual information (a rough sketch follows).
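A rough sketch of this idea; the split criterion, the thresholds, and the omission of the merge and key-frame-similarity steps are simplifying assumptions, not the exact algorithm.

import numpy as np

def select_key_frames(mi_series, shot_start, min_cluster=10, tol=0.2):
    """Split a shot's mutual-information time series wherever MI jumps,
    keep the large clusters, and return the first frame of each cluster
    as a potential key frame."""
    breaks = [0]
    for t in range(1, len(mi_series)):
        if abs(mi_series[t] - mi_series[t - 1]) > tol:   # split point
            breaks.append(t)
    breaks.append(len(mi_series))
    clusters = [(a, b) for a, b in zip(breaks, breaks[1:]) if b - a >= min_cluster]
    return [shot_start + a for a, _ in clusters]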
117
Key frame selection (1)
118
Key frame selection (2)
star sequence
119
Key frame selection (3)
frame 1690
frame 1770
120
Key frame selection (4)
key frames selected from different shots
frame 314
frame 2026
frame 2904
frame 4344
two key frames selected from one shot
frame 2607
frame 2637
121
Wrap-up
• New methods for detecting cuts and fades with high precision have been described.
• Accurate detection of fade borders (starting and ending points) has been achieved.
• Comparisons with other methods demonstrate the accuracy of the proposed techniques.
• Satisfactory results for key frame selection by clustering the mutual information series have been reported.
122
MPEG-7: Standard for Multimedia Information Systems
• Introduction
• Applications
• Standard
• Description elements
• Visual structural elements
• Description schemes
  – for still images
  – for video
• Wrap-up
123
Introduction
• MPEG-7 annotates:
  – data in
    • MPEG-4 (object-based, interactive representations)
    • MPEG-2 and MPEG-1 (frame-based encoding of waveforms)
  – analog data (e.g., VHS)
  – photo prints
  – artistic pictures
• It is not about compression.
• Aim: description of audiovisual content through
  – Descriptors
  – Description Schemes
  – a Description Definition Language
124
Applications
 Provides generic description
of audiovisual and multimedia
content for
– systematic access to audiovisual
information sources
– re-usability of descriptions and
annotations
– management and linking of
content, events, and user
interaction
(Jens-Rainer Ohm, HHI)
125
Standard
MPEG-7 consists of:
• Descriptors (D) with Descriptor Values (DV)
• Description Schemes (DS)
• Description Definition Language (DDL)
(Jens-Rainer Ohm, HHI)
126
Description elements
• Structural (can be extracted automatically)
  – signal-based features
  – regions and segments
• Semantic/conceptual (mostly manual annotation)
  – objects
  – scenes
  – events
• Metadata (manual or non-signal-based annotation)
  – acquisition & production
  – high-level content description
  – intellectual property, usage
127
Visual structural elements
• Examples of low-level visual features: color, texture, shape, motion
• Examples of MPEG-7 visual descriptors:
  – Color: color histogram, dominant color
  – Texture: frequency layout, edge histogram
  – Shape: Zernike moments, curvature peaks
  – Motion: motion trajectory, parametric motion
• Examples of MPEG-7 visual description schemes:
  – Still region
  – Moving region
  – Video segment
128
Description Schemes
Layouts for description schemes:
• hierarchical (tree)
• relational (entity-relationship graph)
129
Still Region Description Scheme
130
Video Sequence Description Scheme
131
Description Definition Language
• Based on the Extensible Markup Language (XML)
132
Wrap-up
• MPEG-7: a generic description interface for audiovisual and multimedia content
• MPEG-7 can be used for:
  – search/filtering and manipulation of audiovisual information
  – multimedia browsing and navigation
  – data organization, archiving, and authoring
  – interpretation and understanding of multimedia content
• A key technology
133
Conclusions
• Overview of the fundamentals of visual information retrieval
• Focus on segmentation and its assessment
• Shape retrieval based on the Hausdorff distance
• Video summarization

Acknowledgments:
I. Pitas, E. Pranckeviciene, Z. Chernekova, C. Nikou, and P. Rotter.
134