CMU Sphinx
Speech Recognition Engine
Reporter: Chun-Feng Liao
NCCU Dept. of Computer Science
Intelligent Media Lab
Purposes of this project
• Find out how an efficient speech recognition engine can be implemented.
• Examine the source code of Sphinx2 to identify the role and function of each component.
• Read key chapters of Dr. Mosur K. Ravishankar's thesis as a reference.
• Give some demo programs during the oral presentation.
Presentation Agenda
• Project Summary / Agenda / Goal (in English).
• Introduction.
• Basics of Speech Recognition.
• Architecture of CMU Sphinx.
– Acoustic Model and HMM.
– Language Model.
• Java™ Platform Issues.
• Demo.
• Conclusion.
Voice Technologies
• In the mid- to late 1990s, personal computers became powerful enough to support automatic speech recognition (ASR).
• The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).
Basics of Speech Recognition
Speech Recognition
• Capturing speech (analog) signals.
• Digitizing the sound waves and converting them to basic language units, or phonemes.
• Constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as write and right), as in the sketch below.
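As a toy illustration of why that contextual analysis is needed (the class name and dictionary entries below are made up for this sketch, not taken from Sphinx), two homophones can map to the same phoneme sequence:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HomophoneSketch {
    public static void main(String[] args) {
        // Toy pronunciation dictionary: "write" and "right" share one
        // phoneme sequence, so acoustics alone cannot distinguish them.
        Map<String, List<String>> dict = new HashMap<>();
        dict.put("write", List.of("R", "AY", "T"));
        dict.put("right", List.of("R", "AY", "T"));
        System.out.println(dict.get("write").equals(dict.get("right"))); // true
    }
}

Only surrounding context (the language model, discussed later) can decide which spelling was intended.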
Speech Recognition Process
Flow
Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
Recognition Process Flow
Summary
• Step 1: User Input
– The system captures the user's voice in the form of an analog acoustic signal.
• Step 2: Digitization
– Digitize the analog acoustic signal.
• Step 3: Phonetic Breakdown
– Break the signal into phonemes.
Recognition Process Flow
Summary (2)
• Step 4: Statistical Modeling
– Map phonemes to their phonetic representation using statistical models.
• Step 5: Matching
– According to the grammar, the phonetic representation, and the dictionary, the system returns an n-best list (i.e., words plus confidence scores); a sketch follows below.
– Grammar: the set of words or phrases that constrains the range of input or output in the voice application.
– Dictionary: the mapping table between phonetic representations and words (e.g., the pronunciations "thu" and "thee" both map to the word "the").
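A toy illustration of such an n-best list (the class name and scores are hypothetical, not actual Sphinx output):

import java.util.List;

public class NBestSketch {
    // Hypothetical n-best entry: a word hypothesis plus a confidence score.
    record Hypothesis(String word, double confidence) {}

    public static void main(String[] args) {
        // Toy n-best list for an utterance that sounded like "right";
        // the grammar and dictionary constrain which hypotheses appear.
        List<Hypothesis> nBest = List.of(
                new Hypothesis("right", 0.82),
                new Hypothesis("write", 0.15),
                new Hypothesis("rite", 0.03));
        for (Hypothesis h : nBest) {
            System.out.println(h.word() + "\t" + h.confidence());
        }
    }
}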
Architecture of CMU Sphinx.
Introduction to CMU Sphinx
• A speech recognition system
developed at Carnegie Mellon
University.
• Consists of a set of libraries
– core speech recognition functions
– low-level audio capture
• Continuous speech decoding
• Speaker-independent
Brief History of CMU Sphinx
• Sphinx-I (1987)
– The world's first speaker-independent, high-performance ASR system.
– Written in C by Kai-Fu Lee (Dr. Kai-Fu Lee, now Chief Technology Advisor / Vice President at Microsoft Asia).
• Sphinx-II (1992)
– Written in C by Xuedong Huang (Dr. Xuedong Huang, now leader of the Microsoft Speech.NET team).
– 5-state HMM / N-gram LM.
• (We may infer that the core technology of CMU Sphinx had a strong influence on the Microsoft Speech SDK.)
Brief History of CMU Sphinx (2)
• Sphinx 3 (1996)
– Built by Eric Thayer and Mosur
Ravishankar.
– Slower than Sphinx-II but the design is
more flexible.
• Sphinx 4 (Originally Sphinx 3j)
– Refactored from Sphinx 3.
– Fully implemented in Java.
– Not finished yet.
Components of CMU Sphinx
Front End
• libsphinx2fe.lib / libsphinx2ad.lib
• Low-level audio access
• Continuous listening and silence filtering.
• Front End API overview (see the sketch below).
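The front-end libraries above are C libraries. As a rough Java-platform analogue of their low-level audio access (an illustration using javax.sound.sampled, not the Sphinx2 API), one can capture 16 kHz, 16-bit mono audio like this:

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.TargetDataLine;

public class CaptureSketch {
    public static void main(String[] args) throws Exception {
        // 16 kHz, 16-bit, mono, signed, little-endian samples.
        AudioFormat fmt = new AudioFormat(16000f, 16, 1, true, false);
        TargetDataLine line = (TargetDataLine) AudioSystem.getLine(
                new DataLine.Info(TargetDataLine.class, fmt));
        line.open(fmt);
        line.start();
        byte[] buf = new byte[4096];
        int n = line.read(buf, 0, buf.length); // one block of raw samples
        System.out.println("captured " + n + " bytes");
        line.stop();
        line.close();
    }
}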
Knowledge Base
• The data that drives the decoder.
• Three sets of data
– Acoustic Model.
– Language Model.
– Lexicon (Dictionary).
Acoustic Model
• /model/hmm/6k
• A database of statistical models.
• Each statistical model represents a phoneme.
• Acoustic models are trained by analyzing large amounts of speech data.
HMM in Acoustic Model
• HMMs represent each unit of speech in the Acoustic Model.
• A typical HMM uses 3-5 states to model a phoneme.
• Each HMM state is represented by a set of Gaussian mixture density functions.
• Sphinx2 default phone set.
Gaussian Mixtures
• Refer to textbook p. 33, eq. 38.
• Represent each state in an HMM.
• Each set of Gaussian mixtures is called a "senone".
• HMMs can share senones.
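In standard notation (the general mixture-density form from the HMM literature, stated here for convenience rather than copied from the textbook equation cited above), the output density of HMM state j for the feature vector o_t observed at frame t is

b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm}\,\mathcal{N}(\mathbf{o}_t;\,\boldsymbol{\mu}_{jm},\,\boldsymbol{\Sigma}_{jm}), \qquad \sum_{m=1}^{M} c_{jm} = 1,

where c_{jm}, \mu_{jm}, and \Sigma_{jm} are the weight, mean, and covariance of the m-th Gaussian. A senone is one such parameter set; two HMM states that share a senone share all of these parameters.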
Language Model
• Describes what is likely to be spoken in a
particular context
• Word transitions are defined in terms of
transition probabilities
• Helps to constrain the search space
• See examples of LM.
N-gram Language Model
• The probability of word N depends on words N-1, N-2, ...
• Bigrams and trigrams are most commonly used.
• Used for large-vocabulary applications such as dictation.
• Typically trained on a very large corpus (millions of words); see the estimate below.
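As an illustration (the standard maximum-likelihood N-gram estimate, not anything Sphinx-specific), the trigram probability is estimated from corpus counts C(.) as

P(w_i \mid w_{i-2}, w_{i-1}) \approx \frac{C(w_{i-2}\,w_{i-1}\,w_i)}{C(w_{i-2}\,w_{i-1})},

which is why a corpus of millions of words is needed: the three-word counts in the numerator are sparse.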
Decoder
• Selects the next set of likely states.
• Scores incoming features against these states.
• Drops low-scoring states.
• Generates results.
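A minimal sketch of the "drops low-scoring states" step, assuming a simple beam-pruning rule (hypothetical code for illustration, not the Sphinx2 decoder itself):

import java.util.List;
import java.util.stream.Collectors;

public class BeamPruneSketch {
    record State(int id, double score) {}

    // Keep only states whose score is within `beam` of the current best;
    // a toy version of the pruning the decoder applies at each frame.
    static List<State> prune(List<State> active, double beam) {
        double best = active.stream()
                .mapToDouble(State::score)
                .max()
                .orElse(Double.NEGATIVE_INFINITY);
        return active.stream()
                .filter(s -> s.score() >= best - beam)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<State> active = List.of(
                new State(1, -10.0), new State(2, -55.0), new State(3, -12.5));
        System.out.println(prune(active, 20.0)); // state 2 is dropped
    }
}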
Speech on the Java™ Platform
Sun Java Speech API
• First released on October 26, 1998.
• The Java™ Speech API allows Java
applications to incorporate speech
technology into their user interfaces.
• Defines a cross-platform API to support command-and-control recognizers, dictation systems, and speech synthesizers, as in the example below.
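A minimal JSAPI 1.0 synthesis example. This assumes a JSAPI engine (such as FreeTTS's JSAPI layer) is installed and registered; createSynthesizer returns null if none is available.

import java.util.Locale;
import javax.speech.Central;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerModeDesc;

public class JsapiHello {
    public static void main(String[] args) throws Exception {
        // Ask Central for any English synthesizer registered on this platform.
        Synthesizer synth = Central.createSynthesizer(
                new SynthesizerModeDesc(Locale.ENGLISH));
        synth.allocate();
        synth.resume();
        synth.speakPlainText("Hello from the Java Speech API", null);
        synth.waitEngineState(Synthesizer.QUEUE_EMPTY); // block until spoken
        synth.deallocate();
    }
}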
Implementations of Java
Speech API
• Open Source
– FreeTTS / CMU Sphinx4.
• IBM Speech for Java.
• Cloud Garden.
• L&H TTS for Java Speech API.
• Conversa Web 3.0.
Free TTS
• Fully implemented in Java.
• Based on Flite 1.1, a small runtime speech synthesis engine developed at CMU.
• Partial support for JSAPI 1.0.
– Speech recognition functions.
– JSML.
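A minimal Free TTS example using its native (non-JSAPI) API. This assumes the VoiceManager lookup and the bundled "kevin16" voice of later Free TTS releases; earlier versions instantiate a voice class directly.

import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

public class FreeTtsHello {
    public static void main(String[] args) {
        // Look up one of the voices bundled with FreeTTS.
        Voice voice = VoiceManager.getInstance().getVoice("kevin16");
        voice.allocate();                   // load the voice's data
        voice.speak("Hello from Free TTS"); // synthesize and play
        voice.deallocate();                 // release resources
    }
}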
Sphinx 4 (Sphinx 3j)
• Fully implemented in Java.
• Speed is equal to or faster than Sphinx3.
• The acoustic model and language model are under construction.
• Source code is available via CVS (but you cannot run any applications without the models!).
For example, to check out Sphinx4, you can use the following command:
cvs -z3 -d:pserver:[email protected]:/cvsroot/cmusphinx co sphinx4
Java™ Platform Issues
• GC makes managing data much easier.
• Native engines typically optimize inner loops for the CPU; that can't be done on the Java platform.
• Native engines arrange data to optimize cache hits; that can't really be done either.
DEMO
• Sphinx-II batch mode.
• Sphinx-II live mode.
• Sphinx-II client/server mode.
• A simple Free TTS application.
• (Java-based) TTS vs. (C-based) SR.
• Motion Planner with Free TTS, using Java Web Start™. (This is the GRA course final project.)
Summary
• Sphinx is an open-source speech recognition system developed at CMU.
• The Front End (FE), Knowledge Base (KB), and Decoder form the core of an SR system.
• The FE receives and processes the speech signal.
• The Knowledge Base provides data for the Decoder.
• The Decoder searches the states and returns the results.
• Speech recognition is a challenging problem for the Java platform.
Reference
• Mosur K. Ravishankar, Efficient Algorithms for Speech Recognition, CMU, 1996.
• Mosur K. Ravishankar and Kevin A. Lenzo, Sphinx-II User Guide, CMU, 2001.
• Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon, Spoken Language Processing, Prentice Hall, 2000.
Reference (on-line)
• On-line documents of the Java™ Speech API
– http://java.sun.com/products/java-media/speech/
• On-line documents of Free TTS
– http://freetts.sourceforge.net/docs/
• On-line documents of Sphinx-II
– http://www.speech.cs.cmu.edu/sphinx/
Q&A