Multimedia concepts and applications,
Augsburg University, Germany
Applied Computer Science,
Bielefeld University, Germany
Comparing feature sets for acted and
spontaneous speech in view of
automatic emotion recognition
Thurid Vogt, Elisabeth André
ICME 2005
Emotion Recognition System
[Diagram with the components: Input, Training data, Feature extraction, Classification, Result]
Research questions
1. Does a large number of features provided to
the selection algorithm enable the selection
of a better feature set?
2. Which analysis units can be calculated
automatically in an online system and still
give good results?
3. How do feature sets for acted and realistic
data differ?
Overview
• Feature extraction:
– Segment length
– Feature calculation
– Feature selection
• Databases
• Results
• Conclusions
Feature extraction
Segment length
• Features are computed over signal segments
• Difficulty:
– Features can be computed more accurately for long
segments
– Emotions can be short and change quickly
• Possible segments:
– Whole utterances
– Larger pauses as segment borders
– Words, syllables, words in context (1 or 2 words to the
left and right)
– Fixed length, e.g. 0.5, 1, 2 seconds
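The fixed-length option from the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fixed_length_segments`, the 16 kHz sample rate, and the choice to drop a trailing remainder are all assumptions.

```python
import numpy as np

def fixed_length_segments(signal, sample_rate, segment_seconds=0.5):
    """Split a 1-D signal into consecutive, non-overlapping fixed-length segments.

    A trailing remainder shorter than one segment is dropped (an assumption).
    """
    signal = np.asarray(signal)
    segment_len = int(round(segment_seconds * sample_rate))
    if segment_len <= 0:
        raise ValueError("segment length must be at least one sample")
    n_segments = len(signal) // segment_len
    return signal[: n_segments * segment_len].reshape(n_segments, segment_len)

# Synthetic stand-in for 2.3 s of audio at an assumed 16 kHz sample rate.
sample_rate = 16000
signal = np.random.default_rng(0).standard_normal(36800)
segments = fixed_length_segments(signal, sample_rate, segment_seconds=0.5)
print(segments.shape)  # (4, 8000): four 0.5 s segments, the last 0.3 s is dropped
```

Shorter segments follow fast emotion changes more closely but leave fewer samples per segment for the feature statistics, which is the trade-off named under "Difficulty".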
Feature calculation
• Features based on pitch, energy + 1st & 2nd
derivatives and 12 MFCCs + 1st & 2nd
derivatives
• Computed from the basic values, from the minima or
maxima only, and from the distances, differences and
slopes between adjacent extrema
• Mean, minimum, maximum, ... of time
segments
• Some others, such as normalised pitch and
pauses
• Based on Oudeyer, 2003
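As a rough sketch of how such features could be derived from a single contour (for example the pitch values of one segment), the code below builds the derivative series, the series of local minima and maxima, and the distances, differences and slopes between adjacent extrema, then summarises each with a few statistics. All function names and the example contour are hypothetical, and only mean, minimum and maximum are computed because the slide does not list the remaining statistics.

```python
import numpy as np

def local_extrema(values):
    """Return indices of strict local minima and maxima of a 1-D series."""
    v = np.asarray(values, dtype=float)
    inner = np.arange(1, len(v) - 1)
    maxima = inner[(v[inner] > v[inner - 1]) & (v[inner] > v[inner + 1])]
    minima = inner[(v[inner] < v[inner - 1]) & (v[inner] < v[inner + 1])]
    return minima, maxima

def summary_stats(prefix, values):
    """Mean, minimum and maximum of a series (empty series yield no features)."""
    v = np.asarray(values, dtype=float)
    if v.size == 0:
        return {}
    return {f"{prefix}_mean": v.mean(), f"{prefix}_min": v.min(), f"{prefix}_max": v.max()}

def contour_features(contour, name="pitch"):
    """Statistics over a contour, its derivatives and its local extrema."""
    c = np.asarray(contour, dtype=float)
    features = {}
    # Basic values plus 1st and 2nd derivatives (finite differences).
    series = {name: c, f"{name}_d1": np.diff(c, n=1), f"{name}_d2": np.diff(c, n=2)}
    for key, values in series.items():
        features.update(summary_stats(key, values))
    # Series consisting only of the local minima or only of the local maxima.
    minima, maxima = local_extrema(c)
    features.update(summary_stats(f"{name}_minima", c[minima]))
    features.update(summary_stats(f"{name}_maxima", c[maxima]))
    # Distances (in frames), value differences and slopes between adjacent extrema.
    extrema = np.sort(np.concatenate([minima, maxima]))
    if extrema.size >= 2:
        distances = np.diff(extrema)
        differences = np.diff(c[extrema])
        features.update(summary_stats(f"{name}_extrema_distance", distances))
        features.update(summary_stats(f"{name}_extrema_difference", differences))
        features.update(summary_stats(f"{name}_extrema_slope", differences / distances))
    return features

# Made-up pitch contour in Hz, one value per frame.
features = contour_features([100, 120, 110, 130, 90, 95])
print(len(features), features["pitch_mean"], features["pitch_maxima_mean"])
```

Applying the same scheme to energy and to each MFCC, with more statistics per series, is one way a feature count in the range of the 1280 mentioned under "Feature selection" could arise; the exact breakdown is not given on the slides.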
Feature selection
• Correlation-based feature selection from
Weka data mining software from University of
Waikato, New Zealand (Witten & Frank,
2000)
• Reduction from 1280 to ~ 90-160 features
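The idea behind correlation-based feature selection is to prefer subsets whose features correlate strongly with the class but weakly with each other, scored by the merit k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below is a simplified stand-in and not Weka's implementation: it assumes absolute Pearson correlation, numerically coded class labels, no constant feature columns, and a plain greedy forward search. The names `cfs_merit` and `greedy_cfs` and the toy data are hypothetical.

```python
import numpy as np

def cfs_merit(feature_class_corr, feature_feature_corr, subset):
    """CFS merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(subset)
    r_cf = feature_class_corr[subset].mean()
    if k == 1:
        return r_cf
    block = feature_feature_corr[np.ix_(subset, subset)]
    # Mean of the off-diagonal entries = average feature-feature correlation.
    r_ff = (block.sum() - np.trace(block)) / (k * (k - 1))
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y):
    """Greedy forward selection that stops when the merit no longer improves."""
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(np.column_stack([X, y]), rowvar=False))
    ff, fc = corr[:n_features, :n_features], corr[:n_features, n_features]
    selected, best = [], 0.0
    while len(selected) < n_features:
        candidates = [j for j in range(n_features) if j not in selected]
        merits = [cfs_merit(fc, ff, selected + [j]) for j in candidates]
        i = int(np.argmax(merits))
        if merits[i] <= best:
            break
        selected.append(candidates[i])
        best = merits[i]
    return selected, best

# Toy data: one informative feature, a near-copy of it, and pure noise.
rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n).astype(float)
informative = y + 0.5 * rng.standard_normal(n)
near_copy = informative + 0.01 * rng.standard_normal(n)
noise = rng.standard_normal(n)
X = np.column_stack([informative, near_copy, noise])
selected, merit = greedy_cfs(X, y)
print(selected, round(merit, 3))
```

In this toy setup the noise column should not be selected, and the near-copy adds almost nothing to the merit because its correlation with the informative feature is close to 1.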
Databases
• Acted speech database: database from TU Berlin for
emotional speech synthesis (Sendelmeier, 2001)
– Recorded from actors
– High quality
– 10 speakers; 20 min
– 7 emotions
• Spontaneous speech database: SmartKom database
from U. of Munich (Steininger et al., 2002)
– Wizard-of-Oz scenario
– Mid quality
– ~80 speakers; 3h20min net; few emotions exhibited
– 11 user states
Results
Which analysis units can be computed
automatically in an online system and
still give good results?
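[Bar chart, axis in % from 0 to 100. Analysis units: whole utterance / pauses as borders, word in context, 0.5 seconds, word. Series: acted data, WOZ data. Values shown: 72.5, 48.3, 67, 61.7, 44.2, 52 (assignment to bars not recoverable).]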
Does a large number of features
provided to the selection algorithm
yield a better feature set?
[Bar chart, axis in % from 0 to 100. Tasks: 7 classes, Activation, Evaluation, Emotion vs. Non-Emotion. Series: Actors: whole feature set, WOZ: whole feature set, Actors: reduced feature set, WOZ: reduced feature set. Values shown: 88.6, 85.4, 77.4, 69.1, 72.5, 67.1, 81.9, 51.6, 48.3, 85.3, 26, 25.6 (assignment to bars not recoverable).]
Does a large number of features
provided to the selection algorithm
yield a better feature set? (cont.)
• Reduced feature set almost always better
than full feature set
• Features perform comparably to Batliner et
al., 2003, on SmartKom data, but our features
are computed completely automatically, while
some of theirs were determined manually
• Selected features are not necessarily those
one would expect
How do feature sets for acted and
realistic data differ?
• Important features for acted emotions:
– Basic pitch
– Pauses (for sadness)
• Important features for WOZ emotions:
– MFCCs (mainly low coefficients and 1st
derivatives)
– Extrema of pitch and energy
Conclusions
• Automatic segment extraction did not prove to
be a disadvantage
• A large feature set provided to the selection
algorithm might compensate for the
disadvantages of completely automatically
computed features
• Feature sets for acted and WOZ emotions
overlap little, so looking at acted data when
building an emotion recognizer for
spontaneous emotions may not make sense