Designing Robust Multimodal Systems for
Diverse Users and Mobile Environments
Sharon Oviatt
[email protected];
http://www.cse.ogi.edu/CHCC/
Center for Human Computer Communication
Department of Computer Science, OGI
Introduction to Perceptive Multimodal
Interfaces
• Multimodal interfaces recognize combined natural
human input modes (speech & pen, speech & lip
movements)
• Radical departure from GUIs in basic features,
interface design & architectural underpinnings
• Rapid development in 1990s of bimodal systems
• New fusion & language processing techniques
• Diversification of mode combinations & applications
• More general & robust hybrid architectures
Advantages of Multimodal
Interfaces
• Flexibility & expressive power
• Support for users’ preferred interaction style
• Accommodate more users, tasks, environments
• Improved error handling & robustness
• Support for new forms of computing, including mobile & pervasive interfaces
• Permit multifunctional & tailored mobile interfaces, adapted to user, task & environment
The Challenge of Robustness:
Unimodal Speech Technology’s Achilles’ Heel
• Recognition errors currently limit commercialization
of speech technology, especially for:
– Spontaneous interactive speech
– Diverse speakers & speaking styles (e.g., accented)
– Speech in natural field environments (e.g., mobile)
• 20-50% drop in accuracy typical for real-world
usage conditions
Improved Error Handling in
Flexible Multimodal Interfaces
• Users can avoid errors through mode selection
• Users’ multimodal language is simplified, which
reduces complexity of NLP & avoids errors
• Users mode switch after system errors, which
undercuts error spirals & facilitates recovery
• Multimodal architectures potentially can support
“mutual disambiguation” of input signals
Example of Mutual Disambiguation:
QuickSet Interface during Multimodal “PAN” Command

[Architecture diagram: Multimodal Input on User Interface → Gesture Recognition & Speech Recognition → Gestural & Spoken Language Interpretation → Multimodal Integrator → Multimodal Bridge → System Confirmation to User]

Processing & Architecture
• Speech & gestures processed in parallel
• Statistically ranked unification of semantic interpretations
• Multi-agent architecture coordinates signal recognition, language processing, & multimodal integration
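To make the pipeline above concrete, here is a minimal sketch of statistically ranked unification with mutual disambiguation for a “PAN” command. It is an illustration only: the n-best hypotheses, scores, toy unification rule, and frame fields are invented for this example and are not QuickSet's actual grammar or code.

```python
# Minimal sketch of mutual disambiguation via statistically ranked
# unification of speech and gesture n-best lists. All hypotheses,
# scores, and frame fields are hypothetical, not QuickSet internals.

from itertools import product

speech_nbest = [          # (hypothesis, posterior-style score)
    ("pant", 0.42),       # top speech hypothesis is wrong
    ("pan",  0.38),
    ("plan", 0.20),
]
gesture_nbest = [
    ("arrow", 0.55),      # drawn stroke interpreted as an arrow
    ("line",  0.30),
    ("area",  0.15),
]

def unify(speech, gesture):
    """Toy unification rule: only 'pan' + 'arrow' forms a complete command frame."""
    if speech == "pan" and gesture == "arrow":
        return {"command": "pan", "direction": "from-arrow"}
    return None           # semantically incompatible pair

# Statistically ranked unification: score every cross-product pair,
# keep the pairs that unify, and rank them by joint score.
ranked = sorted(
    ((s_score * g_score, s, g, unify(s, g))
     for (s, s_score), (g, g_score) in product(speech_nbest, gesture_nbest)
     if unify(s, g) is not None),
    reverse=True,
)

joint_score, speech_hyp, gesture_hyp, frame = ranked[0]
print(frame)   # {'command': 'pan', 'direction': 'from-arrow'}
# The second-ranked speech hypothesis "pan" was pulled up because it was
# the only one that unified with a gesture hypothesis -- the mutual
# disambiguation effect the talk measures.
```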
General Research Questions
• To what extent can a multimodal system support
mutual disambiguation of input signals?
• How much is robustness improved in a multimodal
system, compared with a unimodal one?
• In what usage contexts and for what user groups is
robustness most enhanced by a multimodal
system?
• What are the asymmetries between modes in
disambiguation likelihoods?
Study 1- Research Method
• Quickset testing with map-based tasks
(community fire & flood management)
• 16 users— 8 native speakers & 8 accented
(varied Asian, European & African accents)
• Research design— completely-crossed factorial
with between-subjects factors:
(1) Speaker status (accented, native)
(2) Gender
• Corpus of 2,000 multimodal commands
processed by QuickSet
Videotape
Multimodal system processing
for accented and mobile users
Study 1- Results
• 1 in 8 multimodal commands succeeded due to
mutual disambiguation (MD) of input signals
• MD levels significantly higher for accented speakers
than native ones—
15% vs 8.5% of utterances
• Ratio of speech to total signal pull-ups differed for
users—
.65 accented vs .35 native
• Results replicated across signal & parse-level MD
Table 1— Mutual Disambiguation Rates for Native versus Accented Speakers

MD LEVELS:                          NATIVE SPEAKERS    ACCENTED SPEAKERS
Signal MD level                     8.5%               15.0%*
Parse MD level                      25.5%              31.7%*
Ratio of speech signal pull-ups     .35                .65*
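As a rough illustration of how the metrics in Table 1 could be computed from per-command logs, the sketch below counts a command toward mutual disambiguation when the interpretation chosen by the integrator was not rank 1 on at least one unimodal n-best list. The log format and field names are hypothetical; the study's exact operational definitions are in the underlying papers.

```python
# Sketch of signal-level MD rate and speech pull-up ratio over a toy log.
# Field names and the log itself are invented for illustration.

commands = [
    # rank of the chosen speech / gesture hypothesis on its own n-best list
    {"speech_rank": 1, "gesture_rank": 1},   # no disambiguation needed
    {"speech_rank": 3, "gesture_rank": 1},   # speech pulled up by gesture
    {"speech_rank": 1, "gesture_rank": 2},   # gesture pulled up by speech
]

md_commands = [c for c in commands
               if c["speech_rank"] > 1 or c["gesture_rank"] > 1]
speech_pullups = sum(c["speech_rank"] > 1 for c in md_commands)
gesture_pullups = sum(c["gesture_rank"] > 1 for c in md_commands)

signal_md_rate = len(md_commands) / len(commands)
speech_pullup_ratio = speech_pullups / (speech_pullups + gesture_pullups)

print(f"signal MD rate: {signal_md_rate:.1%}")             # 66.7% on this toy log
print(f"speech pull-up ratio: {speech_pullup_ratio:.2f}")  # 0.50
```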
Table 2- Recognition Rate Differentials between Native and Accented Speakers for Speech, Gesture and Multimodal Commands

RECOGNITION RATE DIFFERENTIAL:      NATIVE SPEAKERS    ACCENTED SPEAKERS
Speech                              —                  -9.5%*
Gesture                             -3.4%*             —
Multimodal                          —                  —
Study 1- Results (cont.)
Compared to traditional speech processing,
spoken language processed within a multimodal
architecture yielded:
41.3% reduction in total speech error rate
No gender or practice effects found in MD rates
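Read as a relative reduction, the 41.3% figure above compares the error rate of stand-alone speech recognition against the error rate of the same spoken input after multimodal processing. The sketch below only shows that arithmetic; the two error rates in it are placeholders, not values reported on this slide.

```python
# Relative reduction in speech error rate due to multimodal processing.
# Both error rates below are hypothetical placeholders.
speech_only_error_rate = 0.30           # errors / commands, speech alone
speech_in_multimodal_error_rate = 0.18  # after multimodal integration

relative_reduction = (
    (speech_only_error_rate - speech_in_multimodal_error_rate)
    / speech_only_error_rate
)
print(f"{relative_reduction:.1%}")      # 40.0% with these placeholder rates
```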
Study 2- Research Method
• QuickSet testing with same 100 map-based tasks
• Main study:
– 16 users with high-end mic (close-talking, noise-canceling)
– Research design completely-crossed factorial:
(1) Usage Context- Stationary vs Mobile (within
subjects)
(2) Gender
• Replication:
– 6 users with low-end mic (built-in, no noise cancellation)
– Compared stationary vs mobile
Study 2- Research Analyses
• Corpus of 2,600 multimodal commands
• Signal amplitude, background noise & SNR
estimated for each command
• Mutual disambiguation & multimodal system
recognition rates analyzed in relation to dynamic
signal data
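A minimal sketch of how per-command amplitude, background noise, and signal-to-noise ratio (SNR) could be estimated from the recorded waveform is shown below. The use of the audio preceding the utterance as the noise estimate, the RMS measure, and the dB convention are assumptions for illustration, not the study's documented procedure.

```python
import numpy as np

def estimate_snr(waveform, sample_rate, utterance_start_s, utterance_end_s):
    """Estimate speech amplitude, background noise, and SNR (dB) for one command.
    Noise is taken from the audio preceding the utterance; this windowing
    choice is an assumption made for illustration."""
    start = int(utterance_start_s * sample_rate)
    end = int(utterance_end_s * sample_rate)

    speech = waveform[start:end]
    noise = waveform[:start] if start > 0 else waveform[:sample_rate // 10]

    # Root-mean-square amplitude of each segment.
    speech_rms = np.sqrt(np.mean(np.square(speech, dtype=np.float64)))
    noise_rms = np.sqrt(np.mean(np.square(noise, dtype=np.float64)))

    snr_db = 20.0 * np.log10(speech_rms / noise_rms)
    return speech_rms, noise_rms, snr_db

# Toy usage: 1 s of low-level noise followed by 1 s of a louder "utterance".
sr = 16000
rng = np.random.default_rng(0)
audio = np.concatenate([0.01 * rng.standard_normal(sr),
                        0.20 * rng.standard_normal(sr)])
print(estimate_snr(audio, sr, utterance_start_s=1.0, utterance_end_s=2.0))
```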
Mobile user with hand-held system & close-talking headset in moderately noisy environment
(40-60 dB noise)
Mobile research infrastructure, with user
instrumentation and researcher field
station
Study 2- Results
• 1 in 7 multimodal commands succeeded due to
mutual disambiguation of input signals
• MD levels significantly higher during mobile than
stationary system use—
16% vs 9.5% of utterances
• Results replicated across signal and parse-level MD
Table 3- Mutual Disambiguation Rates during Stationary and Mobile System Use

SIGNAL MD LEVELS:                   STATIONARY    MOBILE
Noise-canceling mic                 7.5%          11.0%*
Built-in mic                        11.4%         21.5%*
RATIO OF SPEECH PULL-UPS            .26           .34*
Table 4- Recognition Rate Differentials during Stationary and Mobile System Use for Speech, Gesture and Multimodal Commands

RECOGNITION RATE DIFFERENTIAL:      STATIONARY    MOBILE
NOISE-CANCELING MIC:
Speech                              —             -5.0%*
Gesture                             —             —
Multimodal                          —             -3.0%*
BUILT-IN MIC:
Speech                              —             -15.0%*
Gesture                             —             —
Multimodal                          —             -13.0%*
Study 2- Results (cont.)
Compared to traditional speech processing,
spoken language processed within a multimodal
architecture yielded:
19-35% reduction in total speech error rate
(for noise-canceling & built-in mics, respectively)
No gender effects found in MD
Conclusions
• Multimodal architectures can support mutual
disambiguation & improved robustness over
unimodal processing
• Error rate reduction can be substantial— 20-40%
• Multimodal systems can reduce or close the
recognition rate gap for challenging users (accented
speakers) & usage contexts (mobile)
• Error-prone recognition technologies can be stabilized within a multimodal architecture, so they function more reliably in real-world contexts
Future Directions & Challenges
• Intelligently adaptive processing, tailored for mobile
usage patterns & diverse users
• Improved language & dialogue processing
techniques, and hybrid multimodal architectures
• Novel mobile & pervasive multimodal concepts
• Break the robustness barrier— reduce error rate
(For more information— http://www.cse.ogi.edu/CHCC/)