“Multimodal Events” Recognition through DBN-based models


Dynamic Bayesian Networks for Meeting Structuring
Alfred Dielmann, Steve Renals
(University of Sheffield)
Introduction
GOAL: automatic analysis of meetings through the recognition of “multimodal events”, using objective measures and statistical methods.
“Multimodal events” are events which involve one or more communicative modalities and represent the behaviour of a single participant or of the whole group.
Multimodal Recognition
[Block diagram: audio, video and other recordings from the Meeting Room go through signal pre-processing, feature extraction and information retrieval, feed specialised recognition systems (speech, video, gestures) and a knowledge database, and finally drive the “multimodal events” recognition models.]
Group Actions
1. The machine observes group behaviours through objective measures (as an “external observer”) derived from different communicative modalities
2. The results of this analysis are “structured” into a sequence of symbols (a “coding system”; a small illustrative sketch follows this list):
– Exhaustive (covering the entire meeting duration)
– Mutually exclusive (non-overlapping symbols)
We used the coding system adopted by the “IDIAP framework”, composed of 5 “meeting actions”:
• Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard
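For illustration only (not part of the original slides): a minimal Python sketch of how such a coding could be represented and checked for exhaustiveness and mutual exclusivity. The segment representation and the function name are hypothetical.

```python
# Hypothetical segment representation: (action, start_time, end_time) tuples.
MEETING_ACTIONS = {"monologue", "dialogue", "note_taking",
                   "presentation", "whiteboard_presentation"}

def is_valid_coding(segments, meeting_duration):
    """Check that a label sequence is exhaustive (covers the whole meeting)
    and mutually exclusive (segments do not overlap)."""
    prev_end = 0.0
    for action, start, end in segments:
        if action not in MEETING_ACTIONS:
            return False              # unknown symbol
        if start != prev_end or end <= start:
            return False              # gap, overlap, or empty segment
        prev_end = end
    return prev_end == meeting_duration
```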
Corpus
• 60 meetings (two sets of 30) collected in the “IDIAP Smart Meeting Room”:
– 30 meetings are used for training
– 23 meetings are used for testing
– 7 meetings will be used for result validation
• 4 participants per meeting
• 5 hours of multi-channel audio-visual recordings:
– 3 fixed cameras
– 4 lapel microphones + an 8-element circular microphone array
• Meeting agendas are generated “a priori” and strictly followed, so that each meeting contains an average of 5 “meeting actions”
• Available for public distribution: http://mmm.idiap.ch/
Features (1)
Only features derived from audio are currently used:
• Microphone array → beamforming → speaker turns → dimension reduction
• Lapel microphones → prosodic and acoustic features: pitch baseline, energy, rate of speech, …
Features (2)
Speaker Turns Features
Location-based “speech activities” over the 4 locations L1..L4 (SRP-PHAT beamforming, kindly provided by IDIAP), for example:

      L1   L2   L3   L4
t-3   0.1  0.4  0.6  0.3
t-2   0.3  0.5  0.5  0.3
t-1   0.2  0.4  0.7  0.2
t     0.2  0.3  0.7  0.1

Speaker turn features are built as products over three consecutive frames: Li(t) * Lj(t-1) * Lk(t-2)
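As a rough illustration of the Li(t) * Lj(t-1) * Lk(t-2) construction above, here is a short numpy sketch that turns per-location speech activities into the full set of three-frame products. The array shapes and the function name are assumptions, and the subsequent dimension reduction step mentioned in Features (1) is not shown.

```python
import numpy as np

def speaker_turn_features(L):
    """L: (T, n_locations) per-frame location-based speech activities
    (e.g. SRP-PHAT beamforming outputs).  For each frame t >= 2, build
    every product L_i(t) * L_j(t-1) * L_k(t-2), giving a
    (T-2, n_locations**3) feature matrix (dimension reduction not shown)."""
    T, n = L.shape
    feats = np.empty((T - 2, n ** 3))
    for t in range(2, T):
        outer = np.einsum('i,j,k->ijk', L[t], L[t - 1], L[t - 2])
        feats[t - 2] = outer.ravel()
    return feats
```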
Features (3)
Prosodic and acoustic features from the lapel microphones:
• Pitch (pitch extractor followed by filters*)
• RMS energy
• Rate of speech (MRATE)
Features are masked using the “speech activity” obtained from the beamformed microphone array.
(*) Histogram, median and interpolating filters
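A small sketch (an assumption, not the authors’ code) of how the prosodic tracks might be smoothed and masked with the speech activity. Only a median filter is shown, standing in for the histogram/median/interpolating filters listed above; function and variable names are hypothetical.

```python
import numpy as np
from scipy.signal import medfilt

def masked_prosodic_features(pitch, rms_energy, speech_activity, kernel=5):
    """pitch, rms_energy: per-frame tracks from a lapel microphone.
    speech_activity: per-frame 0/1 mask from the beamformed mic. array.
    Returns a (T, 2) matrix of smoothed pitch and RMS energy, with
    non-speech frames zeroed out."""
    pitch_smoothed = medfilt(pitch, kernel_size=kernel)   # median filter only
    feats = np.stack([pitch_smoothed, rms_energy], axis=1)
    return feats * speech_activity[:, None]
```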
Features (4)
We’d like to integrate other features:
• Participant motion features (video → image processing): blob positions, gestures and actions, …
• Audio: ASR transcripts, …
• Other: everything that can be automatically extracted from a recorded meeting
Dynamic Bayesian Networks (1)
Bayesian Networks are a convenient graphical way to describe statistical (in)dependencies among random variables:
• Directed Acyclic Graph + Conditional Probability Tables (CPTs)
[Example graph over the nodes A, F, C, S, L, O]
• Given a set of examples, EM learning algorithms (e.g. Baum-Welch) can be used to train the CPTs
• Given a set of known evidence nodes, the probability of the other nodes can be computed through inference
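A toy example (CPT values invented; node names only echo the slide’s example graph) of a CPT-based network and inference by enumeration:

```python
import numpy as np

# Toy two-node network C -> S with binary variables and invented CPTs.
p_C = np.array([0.7, 0.3])                 # P(C)
p_S_given_C = np.array([[0.9, 0.1],        # P(S | C=0)
                        [0.4, 0.6]])       # P(S | C=1)

# Inference by enumeration: posterior over C given the evidence S=1.
joint = p_C[:, None] * p_S_given_C         # P(C, S)
posterior = joint[:, 1] / joint[:, 1].sum()
print(posterior)                           # [0.28, 0.72]
```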
Dynamic Bayesian Networks (2)
DBNs are an extension of BNs to random variables that evolve in time:
• A static BN is instantiated for each temporal slice t
• Temporal dependencies between variables are made explicit
[Diagram: the static network over C, S, L, O unrolled over the slices t = 0, t = 1, …, t = T]
Dynamic Bayesian Networks (3)
Hidden Markov Models, Kalman Filter Models and other state-space models are just special cases of DBNs.
[Diagram: representation of an HMM as an instance of a DBN — initial distribution p over Q0, transition matrix A linking the hidden states Q0, …, Qt, Qt+1, and emission matrix B linking each hidden state Qt to its observation Yt]
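For concreteness, a short numpy sketch of filtering in this HMM-as-a-DBN (the standard scaled forward recursion); the parameter names follow the usual pi/A/B convention rather than anything specific to the slides.

```python
import numpy as np

def hmm_forward_logprob(pi, A, B, obs):
    """pi[i] = P(Q0=i), A[i, j] = P(Q_{t+1}=j | Q_t=i),
    B[i, o] = P(Y_t=o | Q_t=i), obs = sequence of observed symbols.
    Returns log P(obs) via the scaled forward recursion."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    log_prob = np.log(c)
    alpha = alpha / c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # one DBN time slice
        c = alpha.sum()
        log_prob += np.log(c)
        alpha = alpha / c
    return log_prob
```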
Dynamic Bayesian Networks (4)
Representing HMMs in terms of DBNs makes it easy to create variations on the basic theme:
[Diagram: Factorial HMMs — several hidden chains ({Xt}, {Zt}, {Qt}) jointly generating a single observation stream {Yt}]
[Diagram: Coupled HMMs — parallel hidden chains ({Zt}, {Vt}, {Qt}) with cross-chain links, each emitting its own observations {Yt}]
Dynamic Bayesian Networks (5)
The use of BNs and DBNs presents some advantages:
• An intuitive way to represent models graphically, with a standard notation
• A unified theory for a huge number of models
– Connecting different models in a structured view
– Making it easier to study new models
• A unified set of tools (e.g. GMTK) to work with them (training, inference, decoding)
– Maximizes resource reuse
– Minimizes “setup” time
First Model (1)
“Early integration” of features, modelled through a 2-level Hidden Markov Model.
[Diagram: hidden “meeting actions” A0, …, At, At+1, …, AT; hidden sub-states S0, …, St, St+1, …, ST; observable feature vectors Y0, …, Yt, Yt+1, …, YT]
First Model (2)
The main idea behind this model is to decompose each “meeting action” into a sequence of “sub-actions” or sub-states.
(Note that different actions are free to share the same sub-state.)
The structure is composed of two ergodic HMM chains:
• The top chain links the sub-states {St} with the “actions” {At}
• The lower one maps the feature vectors {Yt} directly onto a sub-state {St}
First Model (3)
• The sequence of actions {At} is known a priori
• The sequence {St} is determined during the training process, and the meaning of each sub-state is unknown
• The cardinality of {St} is one of the model’s parameters
• The mapping of the observable features {Yt} onto the hidden sub-states {St} is obtained through Gaussian Mixture Models
(A generative sketch of this two-level structure follows.)
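A generative sketch of the two-level structure, under the assumption that A_t depends on A_{t-1}, S_t depends on (A_t, S_{t-1}), and Y_t is emitted from S_t (a single Gaussian stands in for the GMM emission). All names and shapes are illustrative, not the authors’ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_two_level_hmm(T, A_trans, S_trans, means, cov):
    """A_trans: (nA, nA) action transition matrix.
    S_trans: (nA, nS, nS) sub-state transitions, one matrix per action.
    means: (nS, d) per-sub-state emission means; cov: (d, d) shared covariance."""
    nA, nS = A_trans.shape[0], S_trans.shape[1]
    a, s = 0, 0
    actions, subs, obs = [], [], []
    for _ in range(T):
        a = rng.choice(nA, p=A_trans[a])            # action chain
        s = rng.choice(nS, p=S_trans[a, s])         # sub-state chain, given the action
        y = rng.multivariate_normal(means[s], cov)  # feature vector from the sub-state
        actions.append(a); subs.append(s); obs.append(y)
    return np.array(actions), np.array(subs), np.array(obs)
```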
Second Model (1)
Multistream processing of the features through two parallel and independent Hidden Markov Models.
[Diagram: a “meeting actions” chain A0, …, At, At+1, …, AT with an action counter Ct and an “enable transitions” node Et; two hidden sub-state chains {St1} and {St2}, observing respectively the speaker turns features {Yt1} and the prosodic features {Yt2}]
Second Model (2)
Each feature group (or modality) Ym is mapped onto an independent HMM chain; therefore every group is evaluated independently and mapped onto a hidden sub-state {Stn}.
As in the previous model, there is another HMM layer {At}, which represents the “meeting actions”.
The whole sub-state {St1 x St2 x … x Stn} is mapped onto an action {At}.
Second Model (3)
It is a variable-duration HMM with an explicit enable node:
• At represents “meeting actions”, as usual
• Ct counts “meeting actions”
• Et is a binary indicator variable that enables state changes inside the node At
Example of the three streams over consecutive frames:

Ct: … 1 1 2 2 2 …
Et: … 0 1 0 0 0 …
At: … 8 8 5 5 5 …
Second Model (4)
• Training: when {At} changes, {Ct} is incremented and {Et} is set on for a single frame (At, Et and Ct are all part of the training dataset)
• Decoding: {At} is free to change only when {Et} is high, and then according to the {Ct} state
The behaviours of {Et} and {Ct} learned during the training phase are thus exploited during decoding (see the example streams above; a small sketch of how they can be derived from {At} follows).
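A minimal sketch (hypothetical function name) of how the counter {Ct} and enable {Et} streams could be derived from a frame-level action sequence {At} for training, reproducing the example streams above:

```python
def counter_and_enable_streams(actions):
    """Ct is incremented whenever At changes; Et is set for the single
    frame that precedes each change, as in the example streams above."""
    T = len(actions)
    C, E = [1] * T, [0] * T
    count = 1
    for t in range(1, T):
        if actions[t] != actions[t - 1]:
            count += 1
            E[t - 1] = 1          # enable the transition one frame earlier
        C[t] = count
    return C, E

# counter_and_enable_streams([8, 8, 5, 5, 5])
#   -> C = [1, 1, 2, 2, 2],  E = [0, 1, 0, 0, 0]
```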
Results
Results obtained with the two models previously described, using only audio-derived features:

               Corr.  Sub.  Del.  Ins.  AER
First Model     93.2   2.3   4.5   4.5  11.4
Second Model    94.7   1.5   3.8   0.8   6.1

The second model effectively reduces both the number of substitutions and the number of insertions.

AER = 100 * (Sub + Ins + Del) / TotalActionsNumber
(equivalent to the Word Error Rate measure used to evaluate speech recogniser performance)
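For reference, a small sketch of how such an AER could be computed from a reference and a recognised action sequence via minimum edit distance, exactly as done for the Word Error Rate; this is a generic implementation, not the evaluation script used by the authors.

```python
import numpy as np

def action_error_rate(ref, hyp):
    """AER = 100 * (Sub + Ins + Del) / TotalActionsNumber, with the error
    counts taken from the minimum edit-distance alignment of the
    reference (ref) and recognised (hyp) action sequences."""
    n, m = len(ref), len(hyp)
    d = np.zeros((n + 1, m + 1), dtype=int)
    d[:, 0] = np.arange(n + 1)          # deletions only
    d[0, :] = np.arange(m + 1)          # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return 100.0 * d[n, m] / n
```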
Conclusions
• A new approach has been proposed
• The achieved results seem promising, and in the future we’d like to:
– Validate them on the remaining part of the test-set (or eventually on an independent test-set)
– Integrate other features: video, ASR transcripts, Xtalk, …
– Try new experiments with the existing models
– Develop new DBN-based models
Multimodal Recognition (2)
Knowledge sources:
• Raw Audio
• Raw Video
• Acoustic Features
• Visual Features
• Automatic Speech Recognition
• Video Understanding
• Gesture Recognition
• Eye Gaze Tracking
• Emotion Detection
• …
Approaches:
• A standalone high-level recogniser operating on low-level raw data
• Fusion of different recognisers at an early stage, generating hybrid recognisers (like AVSR)
• Integration of recogniser outputs through a “high-level” recogniser