Advancing the Vibrancy of Computing

Introduction to Computer Speech Processing
Alex Acero
Research Area Manager
Microsoft Research
Outline
• Grand challenges in Speech and Language
• Vision videos
• Products today
• Prototypes
• The role of speech
• Technology Introduction
Grand challenges in Speech and Language
User Expectations for Speech
The Turing Test
• Imitation Game:
– Judge, man, and a woman
– All chat via email.
– Man pretends to be a woman.
– Man lies, woman tries to help judge.
– Judge must identify man after 5 minutes.
• Turing Test
– Replace man or woman with a computer.
– Fool judge 30% of the time.
Thanks to Jim Gray for material
What Turing Said
“I believe that in about fifty years' time it will be possible to
programme computers, with a storage capacity of about
10^9, to make them play the imitation game so well that an
average interrogator will not have more than 70 per cent
chance of making the right identification after five minutes of
questioning. The original question, "Can machines think?" I
believe to be too meaningless to deserve discussion.
Nevertheless I believe that at the end of the century the use
of words and general educated opinion will have altered so
much that one will be able to speak of machines thinking
without expecting to be contradicted.”
Alan M. Turing, 1950
“Computing machinery and intelligence.” Mind, Vol. LIX, 433-460
Prediction 59 Years Later
• Turing’s technology forecast was great!
– Gigabyte memory is common
• Computer beat world chess champion
– with some help from its programming staff!
• Computers help design most things today
Prediction 59 Years Later
• Intelligence forecast was optimistic
– Several internet sites offer Turing Test chatterbots.
– None pass (yet) http://www.loebner.net/Prizef/loebner-prize.html
• But I believe it will not be long:
– less than 50 years, more than 10 years
• Turing test still stands as a long-term challenge
Challenges Implicit in the Turing Test
1. Read and understand as well as a human
2. Think and write as well as a human
3. Hear as well as a native speaker: Speech Recognition (speech to text)
4. Speak as well as a native speaker: Speech Synthesis (text to speech)
5. Remember what is heard and quickly return it on request.
Moore’s law (1965)
• Gordon Moore: “The number of transistors per chip will
double every 18 months”: 100x per decade
• Progress in next 18 months
= ALL previous progress
– New storage = sum of all old storage (ever)
– New processing = sum of all old processing.
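The arithmetic behind these two claims can be checked with a short sketch. It assumes the slide's figure of one doubling every 18 months; the function name and the ten-period toy series are illustrative only.

```python
# Growth implied by doubling every 18 months (the figure quoted on the slide).
DOUBLING_MONTHS = 18

def growth_factor(months: float) -> float:
    """Multiplicative growth after `months`, assuming one doubling per 18 months."""
    return 2 ** (months / DOUBLING_MONTHS)

# One decade is 120 months, i.e. 120 / 18 = 6.67 doublings, roughly 100x.
per_decade = growth_factor(120)
print(f"growth per decade: {per_decade:.1f}x")

# Why the next period matches all previous progress: if each period adds
# twice as much as the one before, the newest increment (2**n) is one unit
# larger than the sum of every earlier increment (2**n - 1).
increments = [2 ** k for k in range(10)]
previous_total = sum(increments[:-1])
newest = increments[-1]
print(previous_total, newest)
```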
15 years ago
Making Chips Smaller
• Advances in Lithography: science of "drawing" circuits on
chips
• Impact of Moore’s law:
– Short distances => smaller processing time
– Smaller size => lower cost per transistor
– Amount of memory is increased
• But it is not a law of physics: it is a mere self-fulfilling prophecy.
Moore’s law not applicable to
Machine Intelligence
• Speech technology benefited from Moore’s Law in the
1990’s.
• In the 21st century, faster chips mean recognition errors appear faster
• New algorithmic advances needed to pass the Turing Test
• Error rate halves approx every 7 years
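To see how much slower this is than hardware growth, here is a rough sketch that takes the 7-year halving figure at face value. The 20% starting error rate is a made-up number for illustration, not a measurement.

```python
import math

HALVING_YEARS = 7  # from the slide: error rate halves roughly every 7 years

def projected_error(start_error: float, years: float) -> float:
    """Error rate after `years`, assuming a steady halving every 7 years."""
    return start_error * 0.5 ** (years / HALVING_YEARS)

def years_to_reach(start_error: float, target_error: float) -> float:
    """Years needed to fall from `start_error` to `target_error`."""
    return HALVING_YEARS * math.log2(start_error / target_error)

# Hypothetical starting point: a 20% word error rate.
print(projected_error(0.20, 14))    # two halvings
print(years_to_reach(0.20, 0.05))   # time to cut the error by 4x

# Improvement per decade under this assumption, versus ~100x for chips.
improvement_per_decade = 2 ** (10 / HALVING_YEARS)
print(f"{improvement_per_decade:.2f}x per decade")
```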
Grand Challenges
“Within 10 years speech will be in every device.
Things like speech and ink are so natural, when
they get the right quality level they will be in
everything. As technical hurdles such as
background noise and context are overcome,
major adoption of speech technology will arrive.
Soon, dictating to PCs and giving commands to
cell phones will be basic modes of interacting with
technology”
Bill Gates, March 2004
Vision videos
Speech in Mobile devices
Speech for Students
Speech in cars
Soccer Mom in car
Insurance Agent driving
Products today
Japanese dictation
Telephony: Response point
Directory Assistance
• Automatic generation of robust grammars
– Users say “Calabria” or “Calabria restaurant”
• Nearby cities
– Is “Calabria restaurant” in Redmond or Kirkland?
• Some people say the address too
– “Pizza hut on 3rd Avenue” in New York, New York
• Automatic normalization
– Acronyms, compound words, homonyms, misspelled words
Multimodal voice search
Click-Driven Automated Feedback
Language Model
Acoustic Model
Prototypes
CommuteUX
Speech in Education
VerbalMath
Virtual Receptionist
Video Search
(Frank Seide, MSRA)
Browsing a Video
(Milind Mahajan & Patrick Nguyen)
Podcast authoring (Patrick Nguyen)
The role of speech
Role of Speech in Different Devices
[Figure: devices plotted by ease of GUI (screen/pointer), from low to high, against ease of text input (keyboard/pen), from low to high. Devices shown: Tablet PC, PC, Internet TV, PDA, Screen Phone, Phone, Car.]
A Roadmap for Speech
[Figure: devices (Tablet PC, PC, Internet TV, PDA, Screen Phone, Phone, Car) plotted by ease of GUI (screen/pointer) against ease of text input (keyboard/pen), with speech usage regions overlaid: Dictation, Multimodal, Command/Control, Speech-Only, Telephony.]
Speech Technology
[Table: application areas rated on Customer Need, Poor Alternative, Market Opportunity, and Technology Readiness; the ratings did not survive extraction.]
• Desktop Command & Control
• Desktop Dictation
• Meeting / Voicemail Transcription
• Accessibility
• Mobile Devices / Cars
• Telephony / Call Center
Technology Introduction
Voice-enabled System
Technology Components
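[Diagram: the spoken dialog loop]
Speech -> ASR (Automatic Speech Recognition) -> Words
Words -> SLU (Spoken Language Understanding) -> Meaning
Meaning -> DM (Dialog Management, using Data and Rules) -> Action
Action -> SLG (Spoken Language Generation) -> Words
Words -> TTS (Text-to-Speech Synthesis) -> Speech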
Basic Formulation
• Basic equation of speech recognition is
$$\hat{W} = \arg\max_{W} p(W \mid X) = \arg\max_{W} p(X \mid W)\, p(W)$$
X = X1, X2, …, Xn is the acoustic observation
W is the word sequence
p(X|W) is the acoustic model
p(W) is the language model
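The second equality follows from Bayes' rule: p(W|X) = p(X|W) p(W) / p(X), and p(X) does not depend on W, so it can be dropped from the maximization. A minimal sketch of the decision rule over two candidate word sequences follows. All probabilities are invented for illustration, and a real recognizer searches a vastly larger hypothesis space.

```python
import math

# Hypothetical scores for one utterance X; every number here is made up.
# acoustic[W] stands in for p(X | W), language[W] for p(W).
acoustic = {
    "recognize speech": 0.0020,
    "wreck a nice beach": 0.0025,
}
language = {
    "recognize speech": 0.010,
    "wreck a nice beach": 0.001,
}

def decode(acoustic, language):
    """Return the W maximizing p(X | W) * p(W), computed in the log domain."""
    def score(words):
        # Summing logs avoids numerical underflow on long utterances.
        return math.log(acoustic[words]) + math.log(language[words])
    return max(acoustic, key=score)

best = decode(acoustic, language)
acoustic_only = max(acoustic, key=acoustic.get)
print("with language model:", best)
print("acoustic model only:", acoustic_only)
```

In this toy case the acoustic model alone slightly prefers the wrong phrase, and the language model prior tips the decision.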
Speech Recognition
[Diagram: Input Speech -> Feature Extraction -> Pattern Classification (Decoding, Search) -> Confidence Scoring -> “Hello World” (0.9) (0.8). The Acoustic Model, Word Lexicon, and Language Model feed Pattern Classification.]
Feature Extraction
Goal: Extract robust features (information) from the speech that are relevant for ASR.
Method: Spectral analysis through either a bank of filters or Linear Predictive Coding, followed by a non-linearity and normalization.
Result: Signal compression, where 30 or so features are extracted for each window of speech samples (64,000 b/s -> 5,200 b/s).
Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise and echo.
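A simplified sketch of the bank-of-filters route is given below: windowing, a power spectrum, band energies, a log non-linearity, and mean normalization. It is pure Python for self-containment. The equal-width bands, frame sizes, and test tone are assumptions chosen for brevity; production front ends typically use an FFT and perceptually spaced filters.

```python
import cmath
import math

def frames(signal, size, step):
    """Split the signal into overlapping windows of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

def power_spectrum(frame):
    """Hamming-window the frame and return the power in each DFT bin."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def filterbank_features(frame, num_bands):
    """Log energy in `num_bands` equal-width bands (the non-linearity step)."""
    spectrum = power_spectrum(frame)
    width = len(spectrum) // num_bands
    return [math.log(sum(spectrum[b * width:(b + 1) * width]) + 1e-10)
            for b in range(num_bands)]

def mean_normalize(features):
    """Subtract each band's mean over all frames (the normalization step)."""
    count = len(features)
    means = [sum(f[b] for f in features) / count for b in range(len(features[0]))]
    return [[v - m for v, m in zip(f, means)] for f in features]

# Toy input: a 1 kHz tone sampled at 8 kHz (hypothetical values).
rate = 8000
signal = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(800)]
raw = [filterbank_features(f, num_bands=4) for f in frames(signal, size=64, step=32)]
features = mean_normalize(raw)
print(len(features), "frames of", len(features[0]), "features")
```

Each 64-sample window is reduced to four numbers here, which illustrates the compression the slide describes on a much smaller scale.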
Acoustic Modeling
Goal: Model probability of acoustic features for each phone model, i.e. p(X | /ae/)
Method: Hidden Markov Models (HMM) through maximum likelihood (EM) or discriminative methods
Challenges/variability:
• Background noise: Cocktail Party Effect
• Dialect/accent
• Speaker
• Phonetic context: “Italy” vs “Italian”
• No spaces in speech: “Recognize speech” vs “Wreck a nice beach”
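As an illustration of scoring p(X | phone) with an HMM, the forward algorithm below sums over all state paths of a small left-to-right model. The two-state topology, the discrete symbols 0 and 1, and every probability are assumptions made for this sketch; real acoustic models use continuous densities over feature vectors.

```python
def forward_likelihood(observations, start, trans, emit):
    """p(X | model) via the forward algorithm for a discrete-output HMM."""
    num_states = len(start)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s][observations[0]] for s in range(num_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(num_states)) * emit[s][obs]
            for s in range(num_states)
        ]
    return sum(alpha)

# Shared left-to-right topology: start in state 0, never move backwards.
start = [1.0, 0.0]
trans = [[0.6, 0.4],
         [0.0, 1.0]]

# Hypothetical emission tables over two quantized feature symbols (0 and 1).
emit_ae = [[0.8, 0.2], [0.3, 0.7]]
emit_iy = [[0.2, 0.8], [0.7, 0.3]]

X = [0, 0, 1, 1]
p_ae = forward_likelihood(X, start, trans, emit_ae)
p_iy = forward_likelihood(X, start, trans, emit_iy)
print(f"p(X | /ae/) = {p_ae:.6f}")
print(f"p(X | /iy/) = {p_iy:.6f}")
```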
Word Lexicon
Goal: Map legal phone sequences into words according to phonotactic rules:
David /d/ /ey/ /v/ /ih/ /d/
Multiple Pronunciations: Several words may have multiple pronunciations:
Data /d/ /ae/ /t/ /ax/
Data /d/ /ey/ /t/ /ax/
Challenges:
• How do you generate a word lexicon automatically?
• Letter-to-sound (LTS) rules can be automatically trained with decision trees (CART) with less than 8% errors, but proper nouns are hard!
• How do you add new variant dialects and word pronunciations?
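A lexicon with multiple pronunciations per word can be represented as a simple mapping, inverted so that a phone sequence leads back to its words. This is only a sketch using the two entries from the slide; the function names are illustrative.

```python
from collections import defaultdict

# Pronunciations taken from the slide; a real lexicon has many thousands of entries.
LEXICON = {
    "David": [("d", "ey", "v", "ih", "d")],
    "Data": [("d", "ae", "t", "ax"), ("d", "ey", "t", "ax")],
}

def invert(lexicon):
    """Build an index from phone sequence to the words it can spell."""
    index = defaultdict(list)
    for word, pronunciations in lexicon.items():
        for phones in pronunciations:
            index[phones].append(word)
    return index

PHONES_TO_WORDS = invert(LEXICON)

def lookup(phones):
    """Words whose pronunciation matches the phone sequence exactly."""
    return PHONES_TO_WORDS.get(tuple(phones), [])

print(lookup(["d", "ey", "t", "ax"]))  # one pronunciation of "Data"
print(lookup(["d", "ae", "t", "ax"]))  # the other pronunciation
print(lookup(["d", "ey", "v"]))        # not a complete word in this lexicon
```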
Pattern Classification
Goal: Find “optimal” word sequence by combining information (probabilities) from
• Acoustic model
• Word lexicon
• Language model
Method: Decoder searches through all possible recognition choices using a Viterbi decoding algorithm
Challenge: Searching a large network space is computationally expensive for large-vocabulary ASR; beam search and weighted finite-state transducers (WFST) are used to make it efficient.
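A minimal sketch of Viterbi decoding on a toy two-state model is shown below. The state names, observation symbols, and probabilities are all hypothetical; a real decoder runs the same recursion over a network compiled from the acoustic model, lexicon, and language model.

```python
import math

def viterbi(observations, states, start, trans, emit):
    """Most likely state sequence for the observations, in the log domain."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[s] = log probability of the best path so far that ends in state s.
    best = {s: log(start[s]) + log(emit[s][observations[0]]) for s in states}
    backpointers = []
    for obs in observations[1:]:
        step, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log(trans[p][s]))
            step[s] = best[prev] + log(trans[prev][s]) + log(emit[s][obs])
            pointers[s] = prev
        best = step
        backpointers.append(pointers)

    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

# Toy model: two phone-like states and two quantized observation symbols.
states = ["d", "ey"]
start = {"d": 0.9, "ey": 0.1}
trans = {"d": {"d": 0.5, "ey": 0.5}, "ey": {"d": 0.1, "ey": 0.9}}
emit = {"d": {"low": 0.8, "high": 0.2}, "ey": {"low": 0.3, "high": 0.7}}

path, score = viterbi(["low", "high", "high"], states, start, trans, emit)
print(path, math.exp(score))
```

Beam search keeps the same recursion but discards states whose score falls too far below the current best, trading exactness for speed.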
Confidence Scoring
Goal: Identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU and DM.
Method: A confidence score based on a hypothesis likelihood ratio test is associated with each recognized word:
Label: credit please
Recognized: credit fees
Confidence: (0.9) (0.3)
Command-and-control: false rejection and false acceptance => ROC curves
Challenges: Rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.
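The trade-off between false acceptance and false rejection can be sketched by thresholding the word confidences from the example above. This does not compute the likelihood ratio itself; it assumes the scores are already available, and the thresholds are arbitrary.

```python
# Word-level results from the slide: reference label, recognized word, confidence.
results = [
    ("credit", "credit", 0.9),
    ("please", "fees", 0.3),
]

def accept(confidence, threshold):
    """Keep a recognized word only if its confidence reaches the threshold."""
    return confidence >= threshold

def error_counts(results, threshold):
    """Return (false acceptances, false rejections) at this threshold."""
    false_accepts = sum(1 for ref, hyp, conf in results
                        if hyp != ref and accept(conf, threshold))
    false_rejects = sum(1 for ref, hyp, conf in results
                        if hyp == ref and not accept(conf, threshold))
    return false_accepts, false_rejects

# Sweeping the threshold traces out the trade-off that an ROC curve plots.
for threshold in (0.2, 0.5, 0.95):
    print(threshold, error_counts(results, threshold))
```

A low threshold lets the misrecognized "fees" through, while a very high one throws away the correct "credit".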
Text-to-Speech Systems
TTS Engine
Raw text or tagged text
-> Text Analysis: Document Structure Detection, Text Normalization, Linguistic Analysis
-> tagged text
-> Phonetic Analysis: Homograph disambiguation, Grapheme-to-Phoneme Conversion
-> tagged phones
-> Prosodic Analysis: Pitch & Duration Attachment
-> controls
-> Speech Synthesis: Voice Rendering
-> Speech Audio Out
Multimedia Customer Care
(Courtesy of AT&T)
Language Understanding
• Application Schema (XML for semantic entities) defines the
application status
• A Semantic Context Free Grammar (CFG) parses an English
sentence and fills in slots of the application schema.
Application Schema
<itinerary>
<origin>
<city></city>
<state></state>
</origin>
<destination>
<city></city>
<state></state>
</destination>
<date></date>
</itinerary>
Semantic CFG
<rule name="itinerary">
Show me flights from <ruleref name="origin"/>
to <ruleref name="destination"/>
</rule>
<rule name="origin">
<ruleref name="city"/>
</rule>
<rule name="destination">
<ruleref name="city"/>
</rule>
<rule name="city">
Seattle | San Francisco | New York
</rule>
An example sentence
“Show me flights from Seattle to New York”
would populate the application schema as
<itinerary>
<origin>
<city>Seattle</city>
<state></state>
</origin>
<destination>
<city>New York</city>
<state></state>
</destination>
<date></date>
</itinerary>
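The slot filling shown above can be approximated in a few lines. This sketch uses a regular expression as a stand-in for a grammar engine, which works here only because this particular grammar has no recursion; the function name and the nested-dict representation of the schema are assumptions.

```python
import re

CITIES = ["Seattle", "San Francisco", "New York"]  # alternatives of the "city" rule
CITY = "|".join(re.escape(city) for city in CITIES)

# Mirrors the "itinerary" rule: Show me flights from <origin> to <destination>
ITINERARY = re.compile(
    rf"^Show me flights from (?P<origin>{CITY}) to (?P<destination>{CITY})$"
)

def parse(sentence):
    """Return the filled schema as a nested dict, or None if the grammar rejects it."""
    match = ITINERARY.match(sentence.strip())
    if match is None:
        return None
    return {
        "itinerary": {
            "origin": {"city": match.group("origin"), "state": ""},
            "destination": {"city": match.group("destination"), "state": ""},
            "date": "",
        }
    }

filled = parse("Show me flights from Seattle to New York")
print(filled)
print(parse("Show me flights from Boston to Seattle"))  # city not in the grammar
```

Slots the sentence does not mention, such as state and date, stay empty, as in the populated schema above.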
Who manages the Dialog?
Directed Dialog
– “Who would you like to contact?”
– Finite State Machine
– Simple CFG
– MSConnect
User Initiative Dialog
– “What can I do for you?”
– Ngrams
– Windows Airlines
Reservations
Flight Status
Baggage Claim
Special Announcements
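A directed dialog driven by a finite state machine, as in the first case above, might look like the following sketch. Only the opening prompt comes from the slide; the contact names, the other prompts, and the state names are invented for illustration, and typed text stands in for recognized speech.

```python
# Hypothetical contact list; a real system would match against a grammar.
CONTACTS = {"alice", "bob", "carol"}

def step(state, user_input):
    """One transition of the dialog state machine: returns (next_state, prompt)."""
    text = user_input.strip().lower()
    if state == "ASK_NAME":
        if text in CONTACTS:
            return "CONFIRM:" + text, f"Calling {text}. Is that right?"
        return "ASK_NAME", "Sorry, who would you like to contact?"
    if state.startswith("CONFIRM:"):
        name = state.split(":", 1)[1]
        if text == "yes":
            return "DONE", f"Connecting you to {name}."
        return "ASK_NAME", "Who would you like to contact?"
    return "DONE", ""

def run(inputs):
    """Feed a list of user turns through the machine and collect system prompts."""
    state, transcript = "ASK_NAME", ["Who would you like to contact?"]
    for user_input in inputs:
        state, prompt = step(state, user_input)
        transcript.append(prompt)
        if state == "DONE":
            break
    return state, transcript

final_state, transcript = run(["Dave", "Alice", "yes"])
for line in transcript:
    print(line)
```

The system keeps the initiative at every turn: the caller can only answer the current question, which keeps the grammar at each state small.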
Problems with directed dialogs
User-initiative dialogs
• Pros:
– Can result in a shorter call
– Can feel more natural
– Useful when too many choices
• Cons:
– Requires expensive expertise
– Could lead to user frustration: system appears human
but caller can’t use full natural language
NLU Dialog Module
• Drag-and-drop Dialog Flow Designer
• Developer specifies:
– Destination branches
– Example sentences per branch
– Prompts (initial, mumble, no speech, etc)
• Module generates SLM and classifier
• It handles confirmation, reprompt, etc.
Natural Language
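Multimodal System Technology Components
[Diagram: the spoken dialog loop extended with pen and gesture inputs alongside speech, and visual output alongside speech]
Speech -> ASR (Automatic Speech Recognition) -> Words
Words -> SLU (Spoken Language Understanding) -> Meaning
Meaning -> DM (Dialog Management, using Data and Rules) -> Action
Action -> SLG (Spoken Language Generation) -> Words
Words -> TTS (Text-to-Speech Synthesis) -> Speech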
MiPad
• Multimodal Interactive Pad
• MiPad
– Tap and Talk combines speech and pen
– Use context to simplify recognition
– Dictation allows complex command entry
• Usability studies show double throughput for English
• Speech is mostly useful in cases with lots of alternatives
Speech-centric Multimodal
Multimodality Benefits
• Compared to speech-only:
– User sees system response more quickly
– User sees what system understood
– User can know what system expects
• Compared to GUI-only:
– Faster entry
– Better use of small screen
But general language understanding is hard