Indoor-Outdoor Image Classification


Just Another Experiment, but Using What Features?
Da Deng
KIWI Lab
Department of Information Science
University of Otago
[email protected]

Outline
• Content-Based Image Retrieval (CBIR)
• Semantic analysis in CBIR
• Scene classification
• Combination of multiple precision-boosted classifiers
• Music analysis
• Mood and instrument recognition
• Conclusion & discussion

Content-based Image Retrieval (CBIR)
• Finding a picture by keywords alone is generally implausible.
• CBIR is based on the idea of searching for images by matching their low-level features.
• It can work effectively to a certain degree.
• Relevant techniques: feature extraction, indexing, pattern recognition, ...
[Diagram: a CBIR search engine matches visual features against an image database; searches are posed by examples, sketches or visual features.]

CBIR Features
• Colour statistics
  - Global/region histograms, colour correlograms
• Texture features
  - Co-occurrence matrix, edge histogram, Gabor filters
• Shapes
  - Invariant moments, Fourier descriptors etc.
• The MPEG-7 eXperimentation Model (XM) (e.g. see Manjunath et al. 2001) defines a set of audiovisual content descriptors based on core experiments.
  - Colour: CSD, CLD, SCD etc.
  - Texture: HTD, EHD etc.

Limitation of CBIR
• Semantic gap (Smeulders et al. 2000)
  - Human understanding of image content leads to semantic concepts and reasoning.
  - Image storage and CBIR feature schemes remain low-level in machines.
  - Similarity defined in the low-level feature space cannot faithfully reflect the high-level semantic similarity between images.
  - Relevance feedback and custom feature schemes do not really help...

Semantic Analysis
• An ongoing effort to bridge the semantic gap
• Methods
  - Use relationship models between colour, position and semantic concepts.
  - Use machine learning techniques to map visual features to semantic concepts
    (SVM, k-NN, 2-D HMM, ..., mixture of experts; MPEG-7 feature descriptors)

A Scene Classification Problem
• Indoor-outdoor image classification (Szummer & Picard, 1998)
  - Uses fixed-size windows which may not relate reliably to any objects.
  - Others, e.g. Payne & Singh (2004), also used 4x4 windows.
• Can we do better?
  - Use object-related regions.
  - Use more robustly tested features from CBIR.
  - Or a better combination?

Scene Classification Scheme
• Local path: segmentation (JSEG) → local features (EHD, CH, HTD) → classifiers
• Global path: global features (EHD, CH) → classifiers
• The two classifier banks feed a precision-boosted combination, which produces the final result.

Feature Extraction - I
• Colour histograms (CH); the global variant is sketched below.
  - The global CH is generated directly from the RGB space, with 125 (5x5x5) bins.
  - Segment colours are quantised at a finer granularity, giving a local CH of 142 bins (L: 20, U: 70, V: 52).

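As a concrete illustration, here is a minimal sketch of the global 125-bin histogram, assuming the image arrives as an RGB NumPy array; the bin edges and L1 normalisation are my own choices, and the finer local LUV quantisation is not reproduced.

```python
import numpy as np

def global_colour_histogram(rgb, bins_per_channel=5):
    """125-bin (5x5x5) colour histogram over the RGB space, L1-normalised."""
    pixels = rgb.reshape(-1, 3).astype(np.float64)
    hist, _ = np.histogramdd(pixels, bins=(bins_per_channel,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.ravel()                      # flatten 5x5x5 to 125 bins
    return hist / hist.sum()
```
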
Feature Extraction - II
• Edge histogram descriptor (EHD); the per-block classification is sketched below.
  - Captures the spatial distribution of edges in six states: 0°, 45°, 90°, 135°, non-directional and no edge.
  - Global EHD of an image: concatenate the 16 sub-image EHDs into a 96-bin vector.
  - Local EHD of a segment: group the edge histograms of the image-blocks that fall inside the segment.
[Diagram: macro-blocks and image-blocks within the 4x4 grid of sub-images.]

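A minimal sketch of the per-block edge classification behind the EHD: the five 2x2 filter masks are the ones commonly quoted for the MPEG-7 EHD core experiment, while the threshold value and block-averaging convention are assumptions.

```python
import numpy as np

# 2x2 filter masks for the five directional edge types
# (vertical, horizontal, 45 deg, 135 deg, non-directional).
EDGE_FILTERS = {
    "vertical":   np.array([[1.0, -1.0], [1.0, -1.0]]),
    "horizontal": np.array([[1.0, 1.0], [-1.0, -1.0]]),
    "45deg":      np.array([[np.sqrt(2), 0.0], [0.0, -np.sqrt(2)]]),
    "135deg":     np.array([[0.0, np.sqrt(2)], [-np.sqrt(2), 0.0]]),
    "nondir":     np.array([[2.0, -2.0], [-2.0, 2.0]]),
}

def classify_block(block, threshold=11.0):
    """Classify one image-block into one of the six edge states.

    `block` is the block's 2x2 grid of sub-block mean intensities; a block
    whose strongest filter response stays below the threshold is 'no edge'.
    """
    strengths = {name: abs((mask * block).sum())
                 for name, mask in EDGE_FILTERS.items()}
    best = max(strengths, key=strengths.get)
    return best if strengths[best] >= threshold else "noedge"
```
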
Feature Extraction - III
• Segmented regions bear homogeneous colour/texture characteristics.
• The Homogeneous Texture Descriptor (HTD) is therefore included in the local feature scheme; a sketch follows.
  - Average intensity and deviation of the image, plus the log-scale mean energies and energy deviations in 30 Gabor channels.
[Diagram: sampling windows of the Gabor channels in the frequency plane.]

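A minimal sketch of HTD-style statistics, assuming a greyscale image; OpenCV's Gabor kernels stand in for the normative MPEG-7 filter bank, and the kernel size, wavelengths and other parameters are illustrative assumptions.

```python
import cv2
import numpy as np

def htd_features(gray):
    """Mean/std of intensity plus log-scale mean energy and energy deviation
    over 30 Gabor channels (6 orientations x 5 scales)."""
    gray = gray.astype(np.float32)
    feats = [gray.mean(), gray.std()]
    for scale in range(5):
        lambd = 4.0 * (2 ** scale)           # wavelength per scale (assumed)
        for k in range(6):
            theta = k * np.pi / 6            # 6 orientations over 180 degrees
            kernel = cv2.getGaborKernel((31, 31), sigma=lambd / 2.0,
                                        theta=theta, lambd=lambd,
                                        gamma=0.5, psi=0.0)
            resp = cv2.filter2D(gray, cv2.CV_32F, kernel)
            energy = resp ** 2
            feats += [np.log1p(energy.mean()), np.log1p(energy.std())]
    return np.asarray(feats)                 # 2 + 2 * 30 = 62 values
```
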
Classification
• k-NN for feature-set classification; both estimates are sketched below.
• Global feature classification:
  $P(\mathrm{indoor} \mid x_i) = k_{\mathrm{indoor}} / k$
• Local feature classification, with segments voting to label the image:
  $P(\mathrm{indoor} \mid x_i) = \dfrac{N_{\mathrm{indoor}}}{N_{\mathrm{indoor}} + N_{\mathrm{outdoor}}}$

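A minimal sketch of the two posterior estimates, assuming labelled training features; scikit-learn's neighbour search is used here, and the variable names are mine.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def global_posterior(x, train_X, train_y, k=7):
    """Global features: P(indoor | x) = k_indoor / k over the k nearest images.
    `train_y` is an array of "indoor"/"outdoor" labels."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_X)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    return float((train_y[idx[0]] == "indoor").mean())

def image_posterior(segment_labels):
    """Local features: segments vote, P(indoor | image) = N_in / (N_in + N_out)."""
    labels = np.asarray(segment_labels)
    return float((labels == "indoor").mean())
```
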
Bayesian Combination Classification
• A sample $x_i$ is associated with feature sets $F = \{f_1, f_2, \dots, f_n\}$:
  $P(w_c \mid x_i) = P(w_c \mid f_1, f_2, \dots, f_n)$
• Bayes' rule (equal prior probabilities assumed):
  $P(w_c \mid f_1, \dots, f_n) = \frac{P(w_c) \prod_{j=1}^{n} p(f_j \mid w_c)}{\sum_{i \in C} P(w_i) \prod_{j=1}^{n} p(f_j \mid w_i)} = \frac{\prod_{j=1}^{n} p(w_c \mid f_j)}{\sum_{i \in C} \prod_{j=1}^{n} p(w_i \mid f_j)}$

Popular Combination Schemes
• Kittler et al. 1998; all three rules are sketched in code below.
• Product rule:
  $c = \arg\max_i \prod_{j=1}^{n} P(w_i \mid f_j)$
• Sum rule:
  $c = \arg\max_i \sum_{j=1}^{n} P(w_i \mid f_j)$
• Majority voting:
  - Hard 'memberships' (class labels) are counted instead of posteriors; an odd n avoids ties.

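A minimal sketch of the three rules, assuming `posteriors` is an (n_classifiers, n_classes) array of per-classifier posteriors; the array layout is my own convention.

```python
import numpy as np

def product_rule(posteriors):
    return int(np.argmax(np.prod(posteriors, axis=0)))

def sum_rule(posteriors):
    return int(np.argmax(np.sum(posteriors, axis=0)))

def majority_vote(posteriors):
    votes = np.argmax(posteriors, axis=1)    # hard labels, one per classifier
    return int(np.bincount(votes).argmax())
```
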
Precision-Boosted Combination
• Contributions from individual classifiers are weighted equally even though their confidences may vary!
• Our solution: tune the Bayesian posterior probability and assign bigger weights to classifiers with higher precision (sketched below):
  $p_b(w_c \mid f_i) = \tfrac{1}{2} + \left[\, p(w_c \mid f_i) - \tfrac{1}{2} \,\right] \cdot pr(w_c, f_i)$

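A minimal sketch of the boosting step, assuming `pr` holds per-class precisions measured for each classifier (e.g. on held-out data) in the same (n_classifiers, n_classes) layout as the posteriors; feeding the tuned posteriors into the product rule mirrors the Bayesian combination above.

```python
import numpy as np

def precision_boost(posteriors, pr):
    """p_b = 1/2 + (p - 1/2) * pr: a high-precision classifier keeps its
    confidence, a low-precision one is pulled back towards the neutral 1/2."""
    return 0.5 + (posteriors - 0.5) * pr

def precision_boosted_combine(posteriors, pr):
    boosted = precision_boost(posteriors, pr)
    return int(np.argmax(np.prod(boosted, axis=0)))
```
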
Posteriors Boosted? Suppressed!

Experiment
• Data
  - ISB images: 153 images, 3102 segments
  - Scenes labelled as indoor or outdoor: 39% indoor, 61% outdoor
  - Objects include: person, grass, pavement, sky, cloud, building, tree, carpet, sofa, chair, lamp, desk etc.
• Evaluation
  - Leave-one-out cross-validation to evaluate the classifiers (sketched below)
  - Recall, precision and average accuracy compared.

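A minimal sketch of the evaluation protocol, with stand-in data in place of the real ISB features; scikit-learn's LeaveOneOut drives the cross-validation.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(153, 96)                  # stand-in for real image features
y = np.random.randint(0, 2, size=153)        # 0 = indoor, 1 = outdoor

accuracy = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                           X, y, cv=LeaveOneOut()).mean()
print(f"LOOCV accuracy: {accuracy:.3f}")
```
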
Classification Combination Results
• The four best individual classifiers were chosen, plus L-EHD.
• Average accuracy was assessed using different classifiers and different combination rules.
• Precision-boosted classifier combination gives the best accuracy rate.

  Classifier           k    Accuracy (%)
  G-CH                 7    75.8
  G-EHD                5    78.4
  L-CH                 5    83.0
  L-HTD                7    86.3
  L-EHD                9    39.8
  Majority voting      -    88.5
  Product rule         -    91.5
  Sum rule             -    92.3
  Precision-boosted    -    93.5

Hits & Misses

Smarter Image Search Engines?
Automatic:
• Content organisation
• Classification
• Filtering
• Annotation ...

Do Better Than Google
• Images returned from image search engines are typically
  - unorganised,
  - unclassified,
  - and sometimes very irrelevant.
• Why can't we append CBIR capabilities to these web services?
  - Use a SOM or another clustering algorithm to organise pictures according to their visual similarity (sketched below).

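A minimal sketch of SOM-based picture organisation, assuming feature vectors have already been extracted; the MiniSom library, map size and iteration count are my choices, not part of the original system.

```python
import numpy as np
from minisom import MiniSom

features = np.random.rand(200, 192)          # stand-in for real CLD vectors

som = MiniSom(10, 10, input_len=192, sigma=1.5, learning_rate=0.5)
som.train_random(features, 5000)

# Map each picture to its best-matching cell; visually similar pictures
# end up in the same or neighbouring cells of the 10x10 grid.
cells = [som.winner(f) for f in features]
```
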
Good Features for Mapping
• Small features may not be informative.
• Large features are not easily reduced and visualised.
  - SOMs may become 'folded' and hard to read.
• Our choice: the Colour Layout Descriptor (CLD) in MPEG-7 (sketched below)
  - Average/dominant colour on an 8x8 grid
  - 192 dimensions, yet very compact

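A minimal sketch of the 8x8 average-colour grid behind the CLD; the normative MPEG-7 descriptor additionally applies a YCbCr conversion, a DCT and coefficient quantisation, which are omitted here.

```python
import cv2
import numpy as np

def colour_layout(rgb):
    """Average colour on an 8x8 grid: 8 * 8 * 3 = 192 dimensions."""
    grid = cv2.resize(rgb, (8, 8), interpolation=cv2.INTER_AREA)
    return grid.astype(np.float32).ravel()
```
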
The Compactness of CLD
[Figure: residual plots for PCA (top) and Isomap (bottom) against the number of principal components (1-20).]

POISE: Picture Organiser for Image Search Engines

How Much Does It Cost?
• IR: Image Retrieval (from search engines)
• FE: Feature Extraction (MPEG-7 based)
• CO: Content Organisation

Work in Progress
• Further evaluating our approach using larger benchmark image/video and music databases
• Video summarisation (semantics + perception models)
• More MPEG-7 features and more classifiers
• Semantic scene classification?
  - Regional labelling via object/concept recognition.
  - Probabilistic scene classification with hierarchical semantics.
• Cluster-based implementation
  - Image semantic analysis
  - Meta-search engine

What about Music?
Can a machine listen to music as it plays and recognise instruments on the fly?
• Solo detection
• Solo classification

Instrument Recognition Experiments
• The samples were taken from the Iowa Music Samples Collection.
• All samples were recorded in the same acoustic environment (an anechoic chamber) under the same conditions.
• The purpose of this experiment is to test how well different feature schemes model the instruments.
• Results may not generalise!

  Classes     Instruments
  Brass       Trumpet, French Horn, Trombone (tenor, bass), Tuba
  Woodwind    Flute (normal, alto, bass), Clarinet (eb, bb, bass), Oboe, Bassoon, Sax (alto, soprano)
  String      Violin, Viola, Cello, Double Bass
  Piano       Grand Piano

The Auditory Model
• IPEM Toolbox from the University of Ghent, http://www.ipem.ugent.be/Toolbox
• Simulates human auditory perception.
• Represents the instrument sound samples in a physiological way by computing an Auditory Nerve Image (ANI) based on filtering.
• Features are extracted from the ANI.

ANI of a Violin Signal

MPEG-7 Audio Framework

  Feature No.   Description
  --- Perception-based features ---
  1             Zero Crossings (ZC)
  2-3           Mean and standard deviation of Zero Crossing Ratio (ZCR)
  4-5           Mean and standard deviation of RMS (volume root mean square)
  6-7           Mean and standard deviation of Centroid
  8-9           Mean and standard deviation of Bandwidth
  10-11         Mean and standard deviation of Flux
  --- MPEG-7 Harmonic Instrument Timbre DS ---
  12            Harmonic Centroid (HC)
  13            Harmonic Deviation (HD)
  14            Harmonic Spread (HS)
  15            Harmonic Variation (HV)
  16            Spectral Centroid (SC)
  17            Temporal Centroid (TC)
  18            Log-Attack-Time (LAT)
  --- MFCC ---
  19-44         Mean and standard deviation of the first 13 linear MFCCs

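A minimal sketch of the MFCC block of the table (features 19-44), assuming librosa; note that librosa computes mel-scale MFCCs, standing in here for the linear-scale variant listed above.

```python
import numpy as np
import librosa

def mfcc_stats(path):
    """Mean and standard deviation of the first 13 MFCCs: 26 values."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```
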
Classification Results
• Classifiers: k-NN, SMO, MLP (a 10-fold evaluation sketch follows)
• 10-fold cross-validation
• 4-class results with all 44 features:
  - 65.2% (3-NN, 200 prototypes)
  - 82.6% (SMO, c=10, 97 s.v.)
  - 94.1% (MLP, 44-22-1)
• 20-class result: 90.5% (MLP)

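A minimal sketch of the evaluation with stand-in data; scikit-learn's k-NN, SVC and MLP approximate Weka's IBk, SMO and MultilayerPerceptron, and the hidden-layer size echoes the 44-22 topology above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X = np.random.rand(800, 44)                  # stand-in for the 44 audio features
y = np.random.randint(0, 4, size=800)        # 4 instrument families

models = {
    "3-NN": KNeighborsClassifier(n_neighbors=3),
    "SVM (SMO-like), C=10": SVC(C=10),
    "MLP 44-22": MLPClassifier(hidden_layer_sizes=(22,), max_iter=2000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f}")
```
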
Feature Evaluation
• Features ranked by Information Gain / Gain Ratio and Symmetric Uncertainty (Weka); a comparable ranking is sketched below.
• Log-Attack-Time and Harmonic Deviation are the best individual features.
• MFCC is the best single feature set for classifying all instruments.
  - Still some confusion: trumpet → piano, piano → violin

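A minimal sketch of filter-style feature ranking with stand-in data; scikit-learn's mutual information stands in for Weka's Information Gain / Gain Ratio / Symmetric Uncertainty scores.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

X = np.random.rand(800, 44)                  # stand-in feature matrix
y = np.random.randint(0, 20, size=800)       # stand-in instrument labels

scores = mutual_info_classif(X, y)
ranking = np.argsort(scores)[::-1]           # indices of the best features first
print("Top five features:", ranking[:5])
```
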
Combining Features
• MPEG-7 mistakes piano → flute; not good for trumpet vs. flute.
• IPEM mistakes: piano → flute, trumpet → violin.
• IPEM+MPEG-7: flute and trumpet improve, but confusion remains for piano.
• Overall: weaknesses in piano/flute and piano/violin.

  Feature sets          MLP CV accuracy
  MFCC (26)             85.5%
  MPEG-7 (7)            67.75%
  IPEM (11)             74.25%
  MFCC + MPEG-7 (33)    89.5%
  MFCC + IPEM (37)      90.25%
  MPEG-7 + IPEM (18)    80.5%
  All (44)              90.5%

Is It All about Features, after All?
• INFO411-2004 project
• Problem: mood classification of electronic music
• Data: electronic music
  - Drum-and-bass genre sampled at 16000 Hz (single-channel mono), WAV format, 10 seconds in length, 200 samples
• Features:
  - IPEM auditory features incl. zero-crossing, intensity, pitch summary, tonality, roughness etc.

Moody Features?
• Thayer's mood model used, with energy and stress axes defining four quadrants:
  - anxious, frantic, terror (the shower scene in Psycho)
  - exuberant, triumphant (William Tell)
  - content, serene (Jesu, Joy of Man's Desiring)
  - ominous, depression (Firebird)
• Task: Exuberant vs. Anxious
• Findings:
  - MLP 95.5%, 5-NN 86.0%. Very good.
  - Pitch is the feature most relevant to mood classification.
  - ???

Conclusions
• MPEG-7 XM feature descriptors are powerful for a number of semantic analysis problems:
  - Object / scene classification
  - Instrument detection / recognition
  - Mood classification
• However, it depends.
  - Better/bigger benchmarks are needed.
  - Still, new feature schemes may help.
• Classifier combination further improves the results.

Acknowledgement
Thanks to the hard work of
• Jane Zhang
• Christian Simmermacher
• Lincoln Johnston
• Heiko Wolf
• Matt Gleeson

Thank You!