Report Document


Introduction to
Speech Signal Processing
Dr. Zhang Sen
[email protected]
Chinese Academy of Sciences
Beijing, China
2016/4/8
• Introduction
– Sampling and quantization
– Speech coding
• Features and Analysis
– Main features
– Some transformations
• Speech-to-Text
– State of the art
– Main approaches
• Text-to-Speech
– State of the art
– Main approaches
• Applications
– Human-machine dialogue systems
• Some useful websites for ASR tools
– http://htk.eng.cam.ac.uk
• Free, available since 2000, relationship with Microsoft
• Over 12,000 users; versions 2.1, 3.0, 3.1, 3.2
• Includes source code and the HTK book
• A set of tools for training, decoding, and evaluation
• Steve Young at Cambridge University
– http://www.cs.cmu.edu
• Free for research and education
• Sphinx 2 and 3
• Tools, source code, speech databases
• Reddy at CMU
Research on speech recognition
in the world
• Carnegie Mellon University
– CMU SCS Speech Group
– Interact Lab
• Oregon Graduate Institute
– Center for Spoken Language Understanding
• MIT
– Lab for Computer Science, Spoken Language Systems
– Acoustics & Vibration Lab
– AI Lab
– Lincoln Lab, Speech Systems Technology Group
• Stanford University
– Center for Computer Research in Music and Acoustics
– Center for the Study of Language and Information
• University of California
– Berkeley, Santa Cruz, Los Angeles
• Boston University
– Signal Processing and Interpretation Lab
• Georgia Institute of Technology
– Digital Signal Processing Lab
• Johns Hopkins University
– Center for Language and Speech Processing
• Brown University
– Lab for Engineering Man-Machine Systems
• Mississippi State University
• Colorado University
• Cornell University
• Cambridge University
– Speech Vision and Robotics Group
• Edinburgh University
– Human Communication Research Center
– Center for Speech Technology Research
• University College London
– Phonetics and Linguistics
• University of Essex
– Dept. Language and Linguistics
• LIMSI, France
• INRIA
– Institut National de Recherche en Informatique et
Automatique
• University of Karlsruhe, Germany
– Interactive Systems Lab
• DFKI
– German Research Center for Artificial Intelligence
• KTH Speech Communication & Music Acoustics
• CSELT, Italy
– Centro Studi e Laboratori Telecomunicazioni, Torino
• IRST
– Istituto per la Ricerca Scientifica e Tecnologica, Trento
• ATR, Japan
• AT&T, Advanced Speech Product Group
• Lucent Technologies, Bell Laboratories
• IBM, IBM VoiceType
• Texas Instruments Incorporated
• National Institute of Standards and Technology
• Apple Computer Co.
• Digital Equipment Corporation (DEC)
• SRI International
• Dragon Systems Co.
• Sun Microsystems Lab., speech applications
• Microsoft Corporation, speech technology (SAPI)
• Entropic Research Laboratory, Inc.
• Important conferences and journals
– IEEE Trans. on ASSP
– ICASSP (every year)
– EUROSPEECH (every odd year)
– ICSLP (every even year)
– STAR
• Speech Technology and Research at SRI
Brief history and state-of-the-art
of the research on speech recognition
ASR Progress Overview
• 50'S
– ISOLATED DIGIT RECOGNITION (BELL LABS)
• 60'S
– HARDWARE SPEECH SEGMENTER (JAPAN)
– DYNAMIC PROGRAMMING (U.S.S.R.)
• 70'S
– CLUSTERING ALGORITHMS (SPEAKER INDEPENDENCE)
– DTW
• 80'S
– HMM, DARPA, SPHINX
• 90'S
– ADAPTATION, ROBUSTNESS
1952 Bell Labs Digits
• First word (digit) recognizer
• Approximates energy in formants (vocal
tract resonances) over word
• Already has some robust ideas
(insensitive to amplitude, timing variation)
• Worked very well
• Main weakness was technological (resistors
and capacitors)
The 60’s
• Better digit recognition
• Breakthroughs: Spectrum Estimation (FFT,
cepstra, LPC), Dynamic Time Warp (DTW),
and Hidden Markov Model (HMM) theory
• Hardware speech segmenter (Japan)
1971-76 ARPA Project
• Focus on Speech Understanding
• Main work at 3 sites: System Development
Corporation, CMU and BBN
• Other work at Lincoln, SRI, Berkeley
• Goal was 1000-word ASR, a few speakers,
connected speech, constrained grammar,
less than 10% semantic error
Results
• Only CMU's Harpy fulfilled the goals - it used LPC,
segments, and lots of high-level knowledge, and
learned from Dragon* (Baker)

* the CMU system done in the early '70s,
as opposed to the company formed in the '80s
Achieved by 1976
• Spectral and cepstral features, LPC
• Some work with phonetic features
• Incorporating syntax and semantics
• Initial Neural Network approaches
• DTW-based systems (many)
• HMM-based systems (Dragon, IBM)
Dynamic Time Warp
• Optimal time normalization with
dynamic programming
• Proposed by Sakoe and Chiba, circa 1970
• Similar time, proposal by Itakura
• Probably Vintsyuk was first (1968)
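The optimal time normalization described above can be sketched with dynamic programming. This is a minimal illustration over pairs of feature-vector sequences with a Euclidean local cost, not any particular published variant:

```python
import numpy as np

def dtw(x, y):
    """Dynamic time warping distance between feature sequences
    x (T1, D) and y (T2, D), with Euclidean local cost."""
    T1, T2 = len(x), len(y)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            # optimal alignment ends with a match, insertion, or deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[T1, T2]
```

In isolated-word template matching, the test utterance is compared against each stored template with this distance and the nearest template wins.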
HMMs for Speech
• Math from Baum and others, 1966-1972
• Applied to speech by Baker in the
original CMU Dragon System (1974)
• Developed by IBM (Baker, Jelinek, Bahl,
Mercer,….) (1970-1993)
• Extended by others in the mid-1980’s
The 1980’s
• Collection of large standard corpora
• Front ends: auditory models, dynamics
• Engineering: scaling to large
vocabulary continuous speech
• Second major (D)ARPA ASR project
• HMMs become ready for prime time
Standard Corpora Collection
• Before 1984, chaos
• TIMIT
• RM (later WSJ)
• ATIS
• NIST, ARPA, LDC
Front Ends in the 1980’s
• Mel cepstrum (Bridle, Mermelstein)
• PLP (Hermansky)
• Delta cepstrum (Furui)
• Auditory models (Seneff, Ghitza, others)
Dynamic Speech Features
• temporal dynamics useful for ASR
• local time derivatives of cepstra
• “delta’’ features estimated over
multiple frames (typically 5)
• usually augments static features
• can be viewed as a temporal filter
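The multi-frame delta estimate can be sketched as a linear-regression filter; the code below follows the common HTK-style regression formula with a 5-frame window (window=2), as an illustrative NumPy sketch:

```python
import numpy as np

def delta(features, window=2):
    """Delta (time-derivative) features by linear regression over
    2*window + 1 frames; window=2 gives the typical 5-frame estimate."""
    T = len(features)
    # repeat the edge frames so every frame has a full window
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = np.zeros_like(features, dtype=float)
    for t in range(T):
        num = sum(k * (padded[t + window + k] - padded[t + window - k])
                  for k in range(1, window + 1))
        deltas[t] = num / denom
    return deltas
```

The delta vectors are normally appended to the static cepstra, so a 13-dimensional MFCC frame becomes 26-dimensional (39 with delta-deltas).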
HMM’s for Continuous Speech
• Using dynamic programming for continuous speech
(Vintsyuk, Bridle, Sakoe, Ney, …)
• Application of Baker-Jelinek ideas to
continuous speech (IBM, BBN, Philips, ...)
• Multiple groups developing major HMM
systems (CMU, SRI, Lincoln, BBN, ATT)
• Engineering development - coping with
data, fast computers
2nd (D)ARPA Project
• Common task
• Frequent evaluations
• Convergence to good, but similar, systems
• Lots of engineering development - now up to
60,000-word recognition, in real time, on a
workstation, with less than 10% word error
• Competition inspired others not in the project -
Cambridge did HTK, now widely distributed
Some 1990’s Issues
• Independence to long-term spectrum
• Adaptation
• Effects of spontaneous speech
• Information retrieval/extraction with
broadcast material
• Query-style systems (e.g., ATIS)
• Applying ASR technology to related
areas (language ID, speaker verification)
Real Uses
• Telephone: phone company services
(collect versus credit card)
• Telephone: call centers for query
information (e.g., stock quotes,
parcel tracking)
• Dictation products: continuous
recognition, speaker dependent/adaptive
State-of-the-art of ASR
• Tremendous technical advances in the last few
years
• From small to large vocabularies
– 5,000 - 10,000 word vocabulary
– 10,000-60,000 word vocabulary
• From isolated word to spontaneous talk
– Continuous speech recognition
– Conversational and spontaneous speech recognition
• From speaker-dependent to speaker-independent
– Modern ASR is fully speaker independent
SOTA ASR Systems
• IBM, Via Voice
– Speaker independent, continuous command
recognition
– Large vocabulary recognition
– Text-to-speech confirmation
– Barge in (The ability to interrupt an audio
prompt as it is playing)
• Microsoft, Whisper, Dr Who
SOTA ASR Systems
• DARPA
– 1982
– GOALS
• HIGH ACCURACY
• REAL-TIME PERFORMANCE
• UNDERSTANDING CAPABILITY
• CONTINUOUS SPEECH RECOGNITION
– DARPA DATABASES
• 997 WORDS (RM)
• ABOVE 100 SPEAKERS
• TIMIT
SOTA ASR Systems
• SPHINX II
– CMU
– HMM-BASED SPEECH RECOGNITION
– BIGRAM, WORD PAIR
– GENERALIZED TRIPHONE
– DARPA DATABASE
– 97% RECOGNITION (PERPLEXITY 20)
• SPHINX III
– CHMM-BASED
– WER about 15% on WSJ
ASR Advances

[Chart: progress along four dimensions, 1985 → 1995 → 2000 → 2005]
• SPEECH STYLE: careful reading → planned speech → natural
human-machine dialog (user can adapt) → all styles, including
human-human (unaware)
• USER POPULATION: speaker-dependent → speaker independent and
adaptive → native speakers, competent foreign speakers → all speakers
of the language, including foreign regional accents
• NOISE ENVIRONMENT: quiet room, fixed high-quality mic → normal
office, various microphones, telephone → vehicle noise, radio, cell
phones → wherever speech occurs
• COMPLEXITY: application-specific speech and language, expert years
to create an app-specific language model → some application-specific
data and one engineer year → application independent or adaptive
But
• Still <97% accurate on “yes” for telephone
• Unexpected rate of speech causes doubling
or tripling of error rate
• Unexpected accent hurts badly
• Accuracy on unrestricted speech at 60%
• Don’t know when we know
• Few advances in basic understanding
How to Measure the Performance?
• What benchmarks?
– DARPA
– NIST (hub-4, hub-5, …)
• What was the training data?
• What was the test?
• Were they independent?
• The vocabulary and the sample size?
• Was the noise added or coincident with speech?
• What kind of noise?
ASR Performance
[Chart: word error rate (WER), from 0% to 40%, versus level of
difficulty - digits, letters and numbers, command and control,
continuous digits, read speech, broadcast news, conversational speech]
• Spontaneous telephone speech is still a "grand challenge".
• Telephone-quality speech is still central to the problem.
• Broadcast news is a very dynamic domain.
Machine vs Human Performance
[Chart: word error rate, 0% to 20%, versus speech-to-noise ratio
(10 dB, 16 dB, 22 dB, quiet) on the Wall Street Journal task with
additive noise; the machine curve lies well above the human-listener
(committee) curve at every SNR]
• Human performance exceeds machine performance by a factor ranging
from 4x to 10x depending on the task.
• On some tasks, such as credit card number recognition, machine
performance exceeds humans due to human memory retrieval capacity.
• The nature of the noise is as important as the SNR (e.g., cellular
phones).
• A primary failure mode for humans is inattention.
• A second major failure mode is the lack of familiarity with the
domain (i.e., business terms and corporation names).
Core technology for ASR
Why is ASR Hard?
• Natural speech is continuous
• Natural speech has disfluencies
• Natural speech is variable over:
global rate, local rate, pronunciation
within speaker, pronunciation across
speakers, phonemes in different
contexts
Why is ASR Hard?
(continued)
• Large vocabularies are confusable
• Out of vocabulary words inevitable
• Recorded speech is variable over:
room acoustics, channel characteristics,
background noise
• Large training times are not practical
• User expectations are for equal to or
greater than “human performance”
Main Causes of Speech Variability
Environment
– Speech-correlated noise: reverberation, reflection
– Uncorrelated noise: additive noise (stationary, nonstationary)
Speaker
– Attributes of speakers: dialect, gender, age
– Manner of speaking: breath & lip noise, stress, Lombard effect,
rate, level, pitch, cooperativeness
Input Equipment
– Microphone (transmitter)
– Distance from microphone
– Filter
– Transmission system: distortion, noise, echo
– Recording equipment
ASR Dimensions
• Speaker dependent, independent
• Isolated, continuous, keywords
• Lexicon size and difficulty
• Task constraints, perplexity
• Adverse or easy conditions
• Natural or read speech
Telephone Speech
• Limited bandwidth (F vs S)
• Large speaker variability
• Large noise variability
• Channel distortion
• Different handset microphones
• Mobile and hands-free acoustics
What is Speech Recognition?
Speech Signal → Speech Recognition → Words ("How are you?")

• Related areas:
– Who is the talker? (speaker recognition, identification)
– What language did he speak? (language recognition)
– What is his meaning? (speech understanding)
What is the problem?
Find the most likely word sequence Ŵ among all
possible sequences given acoustic evidence A:

Ŵ = argmax_W P(W | A)

A tractable reformulation (via Bayes rule; P(A) does not affect the
maximization) is:

Ŵ = argmax_W P(A | W) · P(W)

where P(A | W) is the acoustic model and P(W) is the language model.
Evaluating this over all word sequences is a daunting search task.
View ASR as Pattern Recognition

Analog Speech → Front End → Observation Sequence O1 O2 … OT →
Decoder → Best Word Sequence W1 W2 … WT

[The decoder consults three knowledge sources: the Acoustic Model,
the Dictionary, and the Language Model.]
View ASR in Hierarchy

Speech Waveform
→ Feature Extraction (Signal Processing)
→ Spectral Feature Vectors
→ Phone Likelihood Estimation (Gaussians or Neural Networks)
→ Phone Likelihoods P(o|q)
→ Decoding (Viterbi or Stack Decoder)
→ Words

[Knowledge sources: Gaussian or neural-net parameters for likelihood
estimation; HMM lexicon and N-gram grammar for decoding.]
Front-End Processing

[Block diagram of front-end processing, including dynamic features,
after K.F. Lee]
Feature Extraction
• GOALS:
– LESS COMPUTATION & MEMORY
– SIMPLE REPRESENTATION OF SIGNAL
• METHODS:
– FOURIER SPECTRUM BASED
• MFCC (mel-frequency cepstrum coefficients)
• LFCC (linear-frequency cepstrum coefficients)
• filter-bank energy
– LINEAR PREDICTION SPECTRUM BASED
• LPC (linear predictive coding)
• LPCC (linear predictive cepstrum coefficients)
– OTHERS
• ZERO CROSSING, PITCH, FORMANT, AMPLITUDE
Cepstrum Computation
• Cepstrum is the inverse Fourier transform of the log spectrum:

c(n) = (1/2π) ∫_{−π}^{π} log S(e^{jω}) e^{jωn} dω,  n = 0, 1, …, L−1

• In computation the IDFT takes the form of a weighted DCT (see HTK)
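The definition above can be sketched directly in NumPy; the small floor added to the magnitude is an implementation convenience to avoid log(0), not part of the definition:

```python
import numpy as np

def real_cepstrum(frame, n_ceps=13):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # floor avoids log(0)
    # inverse transform of the log spectrum gives the (real) cepstrum
    ceps = np.fft.irfft(log_mag, n=len(frame))
    return ceps[:n_ceps]  # keep the low-order coefficients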
Mel Cepstral Coefficients

[Block diagram: spectrum → FFT and log → mel filter bank → DCT
transform]

• Construct the mel-frequency domain using triangularly-shaped
weighting functions applied to mel-transformed log-magnitude spectral
samples
• Filter bank: linear below 1 kHz, logarithmic above 1 kHz
• Motivated by human auditory response characteristics
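A minimal sketch of the filter-bank stage (triangular filters spaced evenly on the mel scale, which is roughly linear below 1 kHz and logarithmic above); exact edge handling differs between toolkits such as HTK, so treat the details here as illustrative:

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular mel-spaced filter bank over the FFT magnitude bins."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # filter edge frequencies equally spaced on the mel scale
    edges = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):       # rising edge of triangle
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling edge of triangle
            fbank[i, k] = (right - k) / max(right - center, 1)
    return fbank
```

MFCCs are then the DCT of the log filter-bank energies, which decorrelates them and compacts the envelope into a few coefficients.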
Cepstrum as Vector Space Features

[Figure: overlapping analysis frames of the waveform, each mapped to a
cepstral feature vector]
Other Features
• LPC
– Linear predictive coefficients
• PLP
– Perceptual Linear Prediction
• Though MFCC has been successfully used,
what is the robust speech feature?
Acoustic Models
• Template-based AM, used in DTW, obsolete
• Hidden Markov Model based AM, popular now
• Other AMs
– Articulatory AM
– KNOWLEDGE BASED APPROACH
• spectrogram reading (expert system)
– CONNECTIONIST APPROACH - TDNN
Template-based Approach
• DYNAMIC PROGRAMMING ALGORITHM
• DISTANCE MEASURE
• ISOLATED WORD
• SCALING INVARIANCE
• TIME WARPING
• CLUSTER METHOD
Definition of HMM

Formal definition of an HMM:
• An output observation alphabet  O = {o1, o2, …, oM}
• The set of states  Ω = {1, 2, …, N}
• A transition probability matrix  A = {aij},  aij = P(st = j | st−1 = i)
• An output probability matrix  B = {bi(k)},  bi(k) = P(Xt = ok | st = i)
• An initial state distribution  π = {πi},  πi = P(s0 = i)

Assumptions
• Markov assumption
• Output independence assumption
Three Problems of HMM
Given a model Ф and a sequence of observations
• The Evaluation Problem
How to compute the probability of the observation sequence?
Forward algorithm
• The Decoding Problem
How to find the optimal sequence associated with a given observation?
Viterbi algorithm
• The Training/Learning Problem
How can we adjust the model parameters to maximize the joint probability?
Baum-Welch algorithm (forward-backward algorithm)
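For the evaluation problem, the forward algorithm can be sketched in a few lines of NumPy using the notation from the definition slide (A transitions, B emissions, π initial distribution):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: P(observation sequence | model).

    A:  (N, N) transitions, A[i, j] = P(s_t = j | s_{t-1} = i)
    B:  (N, M) emissions,  B[i, k] = P(o_k | s_i)
    pi: (N,)  initial state distribution
    obs: sequence of observation indices
    """
    alpha = pi * B[:, obs[0]]           # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # induction step
    return alpha.sum()                  # termination
```

This costs O(T·N²), versus O(N^T) for naive enumeration of all state sequences; real implementations also scale or use log probabilities to avoid underflow.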
Advantages of HMM
• ISOLATED & CONTINUOUS SPEECH RECOGNITION
• NO ATTEMPT TO FIND WORD BOUNDARIES
• RECOVERY FROM ERRONEOUS ASSUMPTIONS
• SCALING INVARIANCE, TIME WARPING, LEARNING CAPABILITY
Limitations of HMM
• HMMs assume state durations follow a geometric (discrete
exponential) distribution
• The transition probability depends only on the
origin and destination states
• All observation frames depend only on the state that
generated them, not on neighboring observation frames
(though in real speech successive frames are strongly correlated)
HMM-based AM
• Hidden Markov Models (HMMs)
– Probabilistic State Machines - state sequence unknown,
only feature vector outputs observed
– Each state has output symbol distribution
– Each state has transition probability distribution
– Issues:
• what topology is proper?
• how many states in a model?
• how many mixtures in a state?
Hidden Markov Models
• Acoustic models encode the
temporal evolution of the
features (spectrum).
• Gaussian mixture distributions
are used to account for
variations in speaker, accent,
and pronunciation.
• Phonetic model topologies are
simple left-to-right structures.
• Skip states (time-warping) and
multiple paths (alternate
pronunciations) are also
common features of models.
• Sharing model parameters (tied)
is a common strategy to reduce
complexity.
AM Parameter Estimation
• Closed-loop data-driven modeling, supervised only from a word-level
transcription.
• The expectation-maximization (EM) algorithm is used to improve our
parameter estimates.
• Computationally efficient training algorithms (forward-backward)
have been crucial.
• Batch-mode parameter updates are typically preferred.
• Decision trees are used to optimize parameter sharing, system
complexity, and the use of additional linguistic knowledge.

[Flow diagram: Initialization → Single Gaussian Estimation → 2-Way
Split → Reestimation → 4-Way Split → Reestimation → ••• (mixture
distribution reestimation)]
Basic Speech Units
• RECOGNITION UNITS
– PHONEME
– WORD
– SYLLABLE
– DEMISYLLABLE
– TRIPHONE
– DIPHONE
Basic Units Selection
• Create a set of HMMs representing the basic
sounds (phones) of a language
– English has about 40 distinct phonemes
– Chinese has about 22 Initials + 37 Finals
– Need a "lexicon" for pronunciations
– Letter-to-sound rules for unusual words
– Co-articulation effects must be modeled
• Triphones - each phone modified by its onset and
trailing context phones (1k-2k used in English)
– e.g. pl-c+pr
Language Models
• What is a language model?
– Quantitative ordering of the likelihood of word
sequences (statistical viewpoint)
– A set of rules specifying how to create word
sequences or sentences (grammar viewpoint)
• Why use language models?
– Not all word sequences are equally likely
– Search space optimization (*)
– Improve accuracy (multiple passes)
– Word lattice to n-best
Finite-State Language Model

[Grammar network: (show | display) → me → (the next | any | the last)
→ (page | picture | text file)]

• Write a grammar of possible sentence patterns
• Advantages:
– Long history / context
– Don't need a large text database (rapid prototyping)
– Integrated syntactic parsing
• Problems:
– Work to write grammars
– Word sequences not in the grammar do not exist for the recognizer
– Used in small-vocabulary ASR, not for LVCSR
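A finite-state grammar like this slide's example can be illustrated by enumerating the sentences it accepts; the encoding below (one list of alternatives per slot) is a simplification for illustration only:

```python
from itertools import product

# The slide's toy grammar, one list of alternatives per slot.
grammar = [
    ["show", "display"],
    ["me"],
    ["the next", "any", "the last"],
    ["page", "picture", "text file"],
]

def sentences(slots):
    """Enumerate every word sequence the finite-state grammar accepts."""
    return [" ".join(words) for words in product(*slots)]

all_sentences = sentences(grammar)
print(len(all_sentences))  # 2 * 1 * 3 * 3 = 18 patterns
```

The recognizer then only has to discriminate among these 18 patterns, which is why such grammars work well for small-vocabulary tasks but do not scale to open dictation.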
Statistical Language Models
• Predict next word based on current and history
• Probability of next word is given by
– Trigram: P(wi | wi-1, wi-2)
– Bigram: P(wi | wi-1)
– Unigram: P(wi)
• Advantage:
– Trainable on Large Text Databases
– ‘Soft’ Prediction (Probabilities)
– Can be directly combined with AM in decoding
• Problems:
– Need a large text database for each domain
– Data sparseness, addressed by smoothing approaches
• backoff approach
• word-class approach
• Used in LVCSR
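A toy illustration of a bigram model with add-one (Laplace) smoothing; real systems use backoff or word-class smoothing as listed above, and the tiny corpus here is invented purely for the example:

```python
from collections import Counter

# Invented toy corpus; real LMs are trained on large domain text.
corpus = [["<s>", "show", "me", "the", "page", "</s>"],
          ["<s>", "show", "me", "any", "picture", "</s>"],
          ["<s>", "display", "the", "page", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(p for sent in corpus for p in zip(sent, sent[1:]))
V = len(unigrams)  # vocabulary size

def p_bigram(w, prev):
    """P(w | prev) with add-one smoothing: unseen bigrams get a small
    nonzero probability instead of zero."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)
```

In decoding, these probabilities are combined (in log space, with a weighting factor) with the acoustic model scores.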
Statistical LM Performance
ASR Decoding Levels
[Diagram: decoding hierarchy]
• States - the building blocks within each acoustic model
• Acoustic models - one per phoneme, e.g. /w/, /ah/, /ts/, /th/, /ax/
• Dictionary - maps phoneme sequences to words, e.g. /w/ → /ah/ → /ts/
("what's"), /th/ → /ax/ ("the")
• Language model - combines words into sentences, e.g. "what's the
willamette's location", "display kirk's longitude", "sterett's latitude"
Decoding Algorithms
• Given the observations, how do we determine the most probable
utterance/word sequence? (DTW in template-based matching)
• The dynamic programming (DP) algorithm was proposed by Bellman in
the 1950s for multistep decision processes; its "principle of
optimality" enables a divide-and-conquer solution.
• DP-based search algorithms are used in speech recognition decoders
to return n-best paths or a word lattice through the acoustic model
and the language model.
• A complete search is usually impossible since the search space is
too large, so beam search is required to prune less probable paths and
save computation.
• Issues: computational underflow, balance of LM and AM.
Viterbi Search
• Uses Viterbi decoding
– Takes MAX, not SUM (Viterbi vs. forward)
– Finds the optimal state sequence, not the optimal
word sequence
– Computation load: O(T·N²)
• Time synchronous
– Extends all paths at each time step
– All paths have the same length (no need to normalize
to compare scores, whereas A* decoding must)
Viterbi Search Algorithm
Function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- num-of-states(state-graph)
  create path probability matrix viterbi[num-states+2, T+2]
  viterbi[0, 0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph
        new-score <- viterbi[s, t] * a[s, s'] * b_s'(o_t)
        if ((viterbi[s', t+1] = 0) || (viterbi[s', t+1] < new-score))
          then
            viterbi[s', t+1] <- new-score
            back-pointer[s', t+1] <- s
  Backtrace from the highest-probability state in the final column of
  viterbi[] and return the path
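A runnable NumPy version of this pseudocode, kept in linear probability space for clarity (real decoders work in log space to avoid the underflow issue noted earlier):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely state sequence for an observation sequence.

    A: (N, N) transitions, B: (N, M) emissions, pi: (N,) initial
    probabilities, obs: list of observation indices.
    Returns (best_path, best_prob).
    """
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # best prob ending in state j at t
    psi = np.zeros((T, N), dtype=int)   # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)       # best predecessor of j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # backtrace from the highest-probability final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())
```

Replacing `max`/`argmax` with a sum over predecessors turns this into the forward algorithm, which is exactly the MAX-vs-SUM distinction the slide draws.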
Viterbi Search Trellis

[Trellis diagram: the states of words W1 and W2 on the vertical axis
against time steps t = 0, 1, 2, 3 on the horizontal axis]
Viterbi Search Insight
[Diagram: trellis columns at time t and time t+1, each containing
states S1, S2, S3 for Word 1 and Word 2. Within a word the update is
OldProb(S1) · OutProb · TransProb; across a word boundary it is
OldProb(S3) · P(W2 | W1). Each cell stores a score, a back-pointer
(backptr), and a parameter pointer (parmptr).]
Backtracking

• Find the best association between words and signal
• Compose words from phones using the dictionary
• Backtracking recovers the best state sequence

[Figure: phone models, e.g. /th/ /e/, aligned against the signal from
t1 to tn]
N-Best Speech Results
Speech Waveform → ASR (guided by a Grammar) → N-best result:
N=1 "Get me two movie tickets…"
N=2 "I want to movie trips…"
N=3 "My car's too groovy"

• Use grammar to guide recognition
• Post-processing based on grammar/LM
• Word lattice to n-best conversion
Complexity of Search
• Lexicon: contains all the words in the system's vocabulary
along with their pronunciations (often there are multiple
pronunciations per word; # of items in lexicon)
• Acoustic Models: HMMs that represent the basic sound
units the system is capable of recognizing (# of models, # of
states per model, # of mixtures per state)
• Language Model: determines the possible word
sequences allowed by the system (fan-out, PP, entropy)
ASR vs Modern AI
• ASR is based on AI techniques
– Knowledge representation & manipulation
• AM and LM, lexicon, observation vector
– Machine Learning
• Baum-Welch for HMMs
• Nearest neighbor & k-means clustering for signal identification
– “Soft” probabilistic reasoning/Bayes rule
• Manage uncertainty mapping in signal, phone, word
– ASR is an expert system
ASR Summary
• Performance criterion is WER (word error rate)
• Three main knowledge sources
– Acoustic Model (Gaussian Mixture Models)
– Language Model (N-Grams, FS Grammars)
– Dictionary (Context-dependent sub-phonetic units)
• Decoding
– Viterbi Decoder
– Time-synchronous
– A* decoding (stack decoding, IBM, X.D. Huang)
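The WER criterion named above is computed from the word-level Levenshtein alignment of reference and hypothesis; a minimal sketch:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of reference words, via edit distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)

print(wer("get me two movie tickets", "get me to movie tickets"))  # 0.2
```

Note WER can exceed 100% when the hypothesis contains many insertions, since the denominator is the reference length.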
We still need
• We still need science
• Need language, intelligence
• Acoustic robustness still poor
• Perceptual research, models
• Fundamentals of statistical pattern
recognition for sequences
• Robustness to accent, stress,
rate of speech, ……..
Future Directions
[Timeline: Analog Filter Banks (1960) → Dynamic Time-Warping (1970) →
Hidden Markov Models (1980) → … (1990, 2004)]

Conclusions:
• supervised training is a good machine learning technique
• large databases are essential for the development of robust
statistics

Challenges:
• discrimination vs. representation
• generalization vs. memorization
• pronunciation modeling
• human-centered language modeling

The algorithmic issues for the next decade:
• Better features by extracting articulatory information?
• Bayesian statistics? Bayesian networks?
• Decision trees? Information-theoretic measures?
• Nonlinear dynamics? Chaos?
References
• Speech and Language Processing
– Jurafsky & Martin, Prentice Hall, 2000
• Spoken Language Processing
– X. D. Huang et al., Prentice Hall, 2000
• Statistical Methods for Speech Recognition
– Jelinek, MIT Press, 1999
• Foundations of Statistical Natural Language Processing
– Manning & Schütze, MIT Press, 1999
• Fundamentals of Speech Recognition
– L. R. Rabiner and B. H. Juang, Prentice Hall, 1993
• Dr. J. Picone - Speech Website
– www.isip.msstate.edu
Test
• Mode
– A final 4-page report or
– A 30-min presentation
• Content
– Review of speech processing
– Speech features and processing approaches
– Review of TTS or ASR
– Audio in computer engineering
THANKS