Victor-Zue-InterSpeech-07-08-28

On Organic Interfaces
Victor Zue
MIT Computer Science and Artificial Intelligence Laboratory
Acknowledgements
Graduate Students
Anderson, M.
Aull, A.
Brown, R.
Chan, W.
Chang, J.
Chang, S.
Chen, C.
Cyphers, S.
Daly, N.
Doiron, R.
Flammia, G.
Glass, J.
Goddeau, D.
Hazen, T.J.
Hetherington, L.
Huttenlocher, D.
Jaffe, O.
Kassel, R.
Kasten, P.
Kuo, J.
Kuo, S.
Lauritzen, N.
Lamel, L.
Lau, R.
Leung, H.
Lim, A.
Manos, A.
Marcus, J.
Neben, N.
Niyogi, P.
Mou, X.
Ng, K.
Pan, K.
Pitrelli, J.
Randolph, M.
Rtischev, D.
Sainath, T.
Sarma, S.
Seward, D.
Soclof, M.
Spina, M.
Tang, M.
Wichiencharoen, A.
Zeiger, K.
Research Staff
Eric Brill
Scott Cyphers
Jim Glass
Dave Goddeau
T J Hazen
Lee Hetherington
Lynette Hirschman
Raymond Lau
Hong Leung
Helen Meng
Mike Phillips
Joe Polifroni
Shinsuke Sakai
Stephanie Seneff
Dave Shipman
Michelle Spina
Nikko Ström
Chao Wang
Introduction
Virtues of Spoken Language
Natural: Requires no special training
Flexible: Leaves hands and eyes free
Efficient: Has high data rate
Economical: Can be transmitted/received inexpensively
Speech interfaces are ideal for information access and
management when:
• The information space is broad and complex,
• The users are technically naive,
• The information device is small, or
• Only telephones are available.
Communication via Spoken Language
[Figure: a human and a computer communicating via spoken language.
Input: Speech, Recognition, Text, Understanding, Meaning.
Output: Meaning, Generation, Text, Synthesis, Speech.]
Components of a Spoken Dialogue System
[Figure: block diagram of a spoken dialogue system.
Components: SPEECH RECOGNITION, LANGUAGE UNDERSTANDING, DISCOURSE CONTEXT, DIALOGUE MANAGEMENT, DATABASE, LANGUAGE GENERATION, SPEECH SYNTHESIS.
Data labels: Speech, Words, Meaning Representation, Meaning, Sentence, Speech, Graphs & Tables.]
Tremendous Progress to Date
Technological Advances
Data Intensive Training
Inexpensive Computing
Increased Task Complexity
Some Example Systems
BBN, 2007
MIT, 2007
KTH, 2007
Speech Synthesis
• Recent trend moves toward corpus-based approaches
– Increased storage and compute capacity
– Availability of large text and speech corpora
– Modeled after successful utilization for speech recognition
• Many successful implementations, e.g.,
– AT&T
– Cepstral
– Microsoft
[Figure: synthesis example with word labels: compassion, disputed, cedar city, since, giant, since, computer science]
But we are far from done …
• Machine performance typically lags far behind human
performance
• How can interfaces be truly anthropomorphic?
[Figure: bar chart, SWITCHBOARD (Spontaneous Speech), Lippmann, 1997. MACHINE: 43%; HUMAN: 4%.]
Premise of the Talk
• Propose a different perspective on development of speech-based interfaces
• Draw from insights in evolution of computer science
– Computer systems are increasingly complex
– There is a move towards treating these complex systems like
organisms that can observe, grow, and learn
• Will focus on spoken dialogue systems
Organic Interfaces
Computer: Yesterday and Today
Yesterday:
• Computation of static functions in a static environment, with well-understood specification
• Computation is its main goal
• Single agent
• Batch processing of text and homogeneous data
• Stand-alone applications
• Binary notion of correctness
Today:
• Adaptive systems operating in environments that are dynamic and uncertain
• Communication, sensing, and control just as important
• Multiple agents that may be cooperative, neutral, adversarial
• Stream processing of massive, heterogeneous data
• Interaction with humans is key
• Trade off multiple criteria
Increasingly, we rely on probabilistic representation,
machine learning techniques, and optimization
principles to build complex systems
Properties of Organic Systems
• Robust to changes in environment and operating conditions
• Learning through experiences
• Observe their own behavior
• Context aware
• Self healing
• …
Research Challenges
Some Research Challenges
• Robustness
– Signal Representation
– Acoustic Modeling
– Lexical Modeling
– Multimodal Interactions
• Establishing Context
• Adaptation
• Learning
– Statistical Dialogue Management
– Interactive Learning
– Learning by Imitation
* Please refer to written paper for topics not covered in talk
Robustness: Acoustic Modeling
• Statistical n-grams have masked the inadequacies in acoustic modeling, but at a cost
– Size of training corpus
– Application-dependent performance
• To promote acoustic modeling research, we may want to develop a sub-word based recognition kernel
– Application independent
– Stronger constraints than phonemes
– Closed vocabulary for a given language
• Some success has been demonstrated (e.g., Chung & Seneff, 1998)
[Figure: linguistic hierarchy (sentence, syntax, semantics, word (syllable), morphology, phonotactics, phonemics, phonetics, acoustics) shown beside a Speech Recognition Kernel with Sub-word Units, Acoustic Models, and Units LM]
Robustness: Lexical Access
• Current approaches represent words as phoneme strings
• Phonological rules are sometimes used to derive alternate
pronunciations
“temperature”
• Lexical representation based on features offers much
appeal (Stevens, 1995)
– Fewer models, less training data, greater parsimony
– Alternative lexical access models (e.g., Zue, 1983)
• Lexical access based on islands of reliability might be better
able to deal with variability
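The current approach described above (words stored as phoneme strings, with phonological rules deriving alternate pronunciations) can be sketched as follows. This is only an illustration: the baseform for "temperature", the ARPAbet-like symbols, and both rewrite rules are assumptions made for the example and are not taken from any actual lexicon or from the talk.

```python
# Hypothetical baseform using ARPAbet-like symbols (assumed, not from a real lexicon).
BASEFORM = {"temperature": ["t", "eh", "m", "p", "er", "ah", "ch", "er"]}

# Illustrative rewrite rules: (pattern, replacement). Both are assumptions.
RULES = [
    (("p", "er", "ah"), ("p", "r", "ah")),  # assumed reduction of the middle syllable
    (("p", "er", "ah"), ("p", "ah")),       # assumed deletion of the middle syllable
]


def apply_rule(phones, pattern, replacement):
    """Rewrite the first occurrence of pattern, or return None if it is absent."""
    n = len(pattern)
    for i in range(len(phones) - n + 1):
        if tuple(phones[i:i + n]) == pattern:
            return phones[:i] + list(replacement) + phones[i + n:]
    return None


def pronunciations(word):
    """Return the baseform followed by every distinct rule-derived variant."""
    base = BASEFORM[word]
    variants = [base]
    for pattern, replacement in RULES:
        alt = apply_rule(base, pattern, replacement)
        if alt is not None and alt not in variants:
            variants.append(alt)
    return variants


variants = pronunciations("temperature")
for v in variants:
    print(" ".join(v))
```

Each rule adds one more phoneme string to model for the same word, which is one way to see why a feature-based representation with fewer models is appealing.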
Robustness: Multimodal Interactions
• Other modalities can augment/complement speech
[Figure: SPEECH RECOGNITION, HANDWRITING RECOGNITION, GESTURE RECOGNITION, and MOUTH & EYES TRACKING feed LANGUAGE UNDERSTANDING, which produces meaning]
Challenges for Multimodal Interfaces
• Input needs to be understood in the proper context
– “What about that one”
• Timing information is a useful way to relate inputs
Speech: “Move this one over here”
Pointing: (object) (location)
[Both shown along a shared time axis]
• Handling uncertainties and errors (Cohen, 2003)
• Need to develop a unifying linguistic framework
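The use of timing to relate inputs can be sketched as below. This is a minimal illustration and not the method of any cited system: the `Word` and `Pointing` classes, the list of deictic words, the timestamps, and the nearest-in-time pairing rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    time: float  # seconds from utterance start (assumed word midpoint)


@dataclass
class Pointing:
    target: str
    time: float  # seconds from utterance start


# Assumed set of deictic words that need a pointing referent.
DEICTIC = {"this", "that", "here", "there"}


def resolve_deictics(words, gestures):
    """Pair each deictic word with the unused gesture nearest to it in time."""
    remaining = list(gestures)
    bindings = []
    for w in words:
        if w.text.lower() in DEICTIC and remaining:
            nearest = min(remaining, key=lambda g: abs(g.time - w.time))
            remaining.remove(nearest)
            bindings.append((w.text, nearest.target))
    return bindings


# Hypothetical timestamps for the utterance on the slide.
words = [Word("Move", 0.1), Word("this", 0.4), Word("one", 0.6),
         Word("over", 0.9), Word("here", 1.2)]
gestures = [Pointing("object:lamp", 0.45), Pointing("location:desk", 1.15)]

bindings = resolve_deictics(words, gestures)
print(bindings)
```

A greedy nearest-in-time rule like this breaks down when recognition or gesture tracking is uncertain, which is where the handling of uncertainties and errors becomes necessary.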
Audio Visual Symbiosis
• The audio and visual signals both contain information about:
– Identity/location of the person
– Linguistic message
– Emotion, mood, stress, etc.
• Integration of these sources of information has been known to help humans (Benoit, 2000)
• Exploiting this symbiosis can lead to robustness, e.g.,
– Locating and identifying the speaker (Hazen et al., 2003)
– Speech recognition/understanding augmented with facial features (Huang et al., 2004)
– Speech and gesture integration (Cohen, 2005; Gruenstein et al., 2006)
– Audio/visual information delivery (Ezzat, 2003)
Establishing Context
• Context setting is important for dialogue interaction
– Environment
– Linguistic constructs
– Discourse
• Much work has been done, e.g.,
– Context-dependent acoustic and language models
– Sound segmentation
– Discourse modeling
• Some interesting new directions
– Tapestry of applications
– Acoustic scene analysis (Ellis, 2006)
[Figure: tapestry of applications: calendar, photos, weather, address, stocks, phonebook, music]
Acoustic Scene Analysis
• Acoustic signals contain a wealth of information (linguistic
message, environment, speaker, emotion, …)
• We need to find ways to adequately describe the signals
[Figure: an audio stream annotated along a time axis]
signal type: speech
transcript: although both of the, both sides of the Central Artery …
topic: traffic report
speaker: female
...
signal type: speech
transcript: Forecast calls for at least partly sunny weather …
topic: weather, sponsor acknowledgement, time
speaker: male
...
signal type: music
genre: instrumental
artist: unknown
...
signal type: speech
transcript: This is Morning Edition, I’m Bob Edwards …
topic: NPR news
speaker: male, Bob Edwards
...
Some time in the future …
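One possible way to hold descriptions like these is a list of time-stamped segments with free-form attributes. The sketch below is only an illustration of such a data structure: the `Segment` class, the `find` helper, and the start and end times are hypothetical (the slide gives no timestamps); the attribute values are copied from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float  # seconds; hypothetical, the slide shows no times
    end: float
    signal_type: str
    attributes: dict = field(default_factory=dict)


stream = [
    Segment(0.0, 12.0, "speech",
            {"topic": "traffic report", "speaker": "female"}),
    Segment(12.0, 25.0, "speech",
            {"topic": "weather, sponsor acknowledgement, time",
             "speaker": "male"}),
    Segment(25.0, 40.0, "music",
            {"genre": "instrumental", "artist": "unknown"}),
]


def find(segments, signal_type=None, **attrs):
    """Return segments matching a signal type and exact attribute values."""
    return [s for s in segments
            if (signal_type is None or s.signal_type == signal_type)
            and all(s.attributes.get(k) == v for k, v in attrs.items())]


male_speech = find(stream, "speech", speaker="male")
music = find(stream, "music")
print(len(male_speech), len(music))
```

Keeping the attributes open-ended reflects the point that the set of useful descriptors (message, environment, speaker, emotion, and so on) is not fixed in advance.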
Learning
• Perhaps the most important aspect of organic interfaces
– Use of stochastic modeling techniques for speech recognition,
language understanding, machine translation, and dialogue modeling
• Many different ways to learn
– Passive learning
– Interactive learning
– Learning by imitation
Interactive Learning: An Example
• New words are inevitable, and they cannot be ignored (Hetherington, 1991)
• Acoustic and linguistic knowledge is needed to
– Detect
– Learn, and
– Utilize new words
• Fundamental changes in problem formulation and search strategy may be necessary
• New words can be detected and incorporated through
– Dynamic update of vocabulary (Chung & Seneff, 2004)
– Speak and Spell (Filisko & Seneff, 2006)
Learning by Imitation
• Many tasks can be learned through interaction
– “This is how you enable Bluetooth.” → “Enable Bluetooth.”
– “These are my glasses.” → “Where are my glasses?”
• Promising research by James Allen (Allen et al., 2007)
– Learning phase:
* User shows the system how to perform tasks (perhaps through some spoken commentary)
* System learns the task through learning algorithms and updates its knowledge base
– Application phase:
* Looks up tasks in its knowledge base and executes the procedure
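The two phases can be sketched with a toy knowledge base. This is not the approach of Allen et al.: here "learning" is reduced to recording the demonstrated steps as a stand-in for real learning algorithms, and the `TaskLearner` class and the step names are hypothetical.

```python
class TaskLearner:
    """Toy knowledge base mapping a task name to its demonstrated steps."""

    def __init__(self):
        self.knowledge_base = {}

    def learn(self, task, demonstrated_steps):
        # Learning phase: store the demonstration under a normalized name.
        self.knowledge_base[task.lower()] = list(demonstrated_steps)

    def execute(self, task):
        # Application phase: look the task up and return its steps to replay.
        # Returns None for an unknown task; a real system might ask for a demo.
        return self.knowledge_base.get(task.lower())


learner = TaskLearner()

# "This is how you enable Bluetooth." (hypothetical demonstrated steps)
learner.learn("Enable Bluetooth",
              ["open settings", "select Bluetooth", "switch on"])

# "Enable Bluetooth."
steps = learner.execute("enable bluetooth")
print(steps)
```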
In Summary
• Great strides have been made in speech technologies
• Truly anthropomorphic spoken dialogue interfaces can only
be realized if they can behave like organisms
– Observe, learn, grow, and heal
• Many challenges remain …
Thank You
Dynamic Vocabulary Understanding
• Dynamically alter vocabulary within a single utterance
[Figure: Hub architecture connecting Audio, ASR, NLU, Context, Dialog, DB, NLG, and TTS]
User: “What’s the phone number for Flora in Arlington.”
Recognized: What’s the phone number of ???? in Arlington (???? is later filled in as Flora)
Meaning: Clause: wh_question; Property: phone; Topic: restaurant; Name: ???? (later Flora); City: Arlington
DB entries: Arlington Diner, Blue Plate Express, Tea Tray in the Sky, Asiana Grille, Bagels etc, Flora, ….
System: “The telephone number for Flora is …”
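The flow on this slide can be sketched as a two-pass lookup: the first pass leaves the name unknown, the rest of the frame selects candidate names from the database, and the unknown region is then matched against those candidates. The sketch is an assumption-laden illustration and not the system's actual algorithm: the `DB` dictionary, `resolve_unknown_name`, and the rough guess "Florah" are hypothetical, and string similarity from Python's `difflib` stands in for acoustic matching.

```python
import difflib

# Hypothetical database keyed by (topic, city); the names are from the slide.
DB = {
    ("restaurant", "Arlington"): [
        "Arlington Diner", "Blue Plate Express", "Tea Tray in the Sky",
        "Asiana Grille", "Bagels etc", "Flora",
    ],
}


def resolve_unknown_name(frame, unknown_guess):
    """Fill in the name using only names licensed by the rest of the frame.

    unknown_guess stands in for the recognizer's rough hypothesis of the
    out-of-vocabulary region; string similarity replaces acoustic matching.
    """
    candidates = DB.get((frame["topic"], frame["city"]), [])
    match = difflib.get_close_matches(unknown_guess, candidates,
                                      n=1, cutoff=0.6)
    resolved = dict(frame)
    resolved["name"] = match[0] if match else None
    return resolved


# First pass: the name is outside the vocabulary, so it stays unknown.
first_pass = {"clause": "wh_question", "property": "phone",
              "topic": "restaurant", "name": None, "city": "Arlington"}

# Second pass: the candidate list is narrowed by topic and city.
second_pass = resolve_unknown_name(first_pass, "Florah")
print(second_pass["name"])
```

Restricting the candidates to restaurants in the recognized city keeps the added vocabulary small, which is the practical appeal of altering the vocabulary within a single utterance.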