Transcript Document
Human vs. Machine
• Human beings are much better at resolving signal
ambiguities than are computers.
– Computers are improving (e.g. Watson and Jeopardy)
– Turing Test: Are we communicating with a human or machine?
• Case in point - Speech
Sloppy speech is okay, as long as the hearer still understands.
– “haya dun”
– “ay d ih s h er d s ah m th in ng ah b aw m uh v ih ng r ih s en l ih”
There are difficulties, even when a computer recognizes phonemes in a speech signal
One sentence, eight possible meanings
I made her duck
– I cooked waterfowl for her.
– I stole her waterfowl and cooked it.
– I used my abilities to create a living waterfowl for her.
– I caused her to bid low in the game of bridge.
– I created the plastic duck that she owns.
– I caused her to quickly lower her head or body.
– I waved my magic wand and turned her into waterfowl.
– I caused her to avoid the test.
Robot-human dialog
99% accuracy
Robot: “Hi, my name is Robo. I am looking for work to raise funds for Natural
Language Processing research.”
Person: “Do you know how to paint?”
Robo: “I have successfully completed training in this skill.”
Person: “Great! The porch needs painting. Here are the brushes and paint.”
Robot rolls away efficiently. An hour later he returns.
Robo: “The task is complete.”
Person: “That was fast, here is your salary; good job, and come back again.”
Robo speaks while rolling away with the payment.
Robo: “The car was not a Porsche; it was a Mercedes.”
Moral: You need a sense of humor to work in this field.
Difficulties
Today's best systems cannot match human perception
• Challenges
– Speaker variability
– Slurring and running words together
– Co-articulation
– Handling words not in the vocabulary
– Grammar complexities
– Speech semantics
– Recognizing idioms
– Background noise
– Signal transmission distortion
• Approaches
– Use large pre-recorded data samples
– Train for particular users
– Require artificial pauses between words
– Limit vocabulary size
– Limit the grammar
– Use high quality microphones
– Require low noise environments
ASR Difficulties
• Realizations are points in continuous space, not discrete
• Sounds take on characteristics of adjacent sounds (assimilation)
• Sounds that are combinations of two (co-articulation)
• Articulator targets are often not reached
• Diphthongs combine different phonemes
• Adding (epenthesis) or deleting (elision) sounds
• Missing word and phrase boundaries, endings
• Many tonal variations during speech
• Varied vowel durations
• Common knowledge and a familiar background lead to sloppier speech with additional non-linearities
Possible Applications
• Compare a speaker’s utterance against a database of
recorded utterances
• Convert audio into a text document
• Visually represent the vocal tract of the speaker in
real time
• Recognize a particular speaker for enhanced security
• Transform audio signal to enhance its speech
qualities
• Perform tasks based on user commands
• Recognize the language and perform appropriately
A sample of issues to consider
• Can we assume the target language or is the application to be
language independent?
• Is there access to databases describing grammatical,
morphological, and phonological rules?
• Are there digital dictionaries available? Does the application
require a large dictionary?
• Are there corpora available to scientifically measure
performance against other implementations?
• How does the system perform when the SNR is low? What are
typical SNR characteristics when the application is in use?
• What are the accuracy requirements for the application?
• Are statistical training procedures practical for the application?
Phonological Grammars
Phonology: Study of sound combinations
• Sound Patterns
– English: 13 features for 8192 combinations
– Complete descriptive grammar
– Rule based, meaning a formal grammar can represent valid
sound combinations in a language
– Unfortunately, these rules are language-specific
• Recent research
– Trend towards context-sensitive descriptions
– Little thought concerning computational feasibility
– Listeners likely don’t perceive using thousands of rules
Formal Grammars (Chomsky 1950)
• Formal grammar definition: G = (N, T, s0, P, F)
– N is a set of non-terminal symbols (or states)
– T is the set of terminal symbols (N ∩ T = {})
– s0 is the start symbol
– P is a set of production rules
– F (a subset of N) is a set of final symbols
• Right regular grammar productions have the forms
B → a, B → aC, or B → ε where B,C ∈ N and a ∈ T
• Context Free (programming language) productions have the form
B → w where B ∈ N and w is a possibly empty string over N ∪ T
• Context Sensitive (natural language) productions have the form
αAβ → αγβ where A ∈ N, α,β,γ ∈ (N ∪ T)*, γ is non-empty, and |αAβ| ≤ |αγβ|
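As a concrete illustration (an assumption, not from the slides), a right-regular grammar for the sheep language /baa+!/ that appears in the finite-state automata slides later could be written as:
S → bA
A → aB
B → aB
B → aC
C → !
Each production emits a single terminal and hands off to at most one non-terminal, which is why right-regular grammars correspond exactly to finite-state automata.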
Chomsky Language Hierarchy
Classifying the Chomsky Grammars
Notes
Regular
• Left-hand side contains one non-terminal; the right-hand side has a terminal optionally followed by one non-terminal
Context Free
• Left hand side contains one non-terminal, right hand side mixes terminals and
non-terminals
Context sensitive
• Left hand side has both terminals and non-terminals
Turing Equivalent: All rules are fair game (computational power of a computer)
Context Free Grammars
Chomsky (1956) Backus (1959)
• Capture constituents and ordering
– Regular grammars are too limited to capture constituency
• Context Free Grammars consist of
– A set of non-terminal symbols N
– A finite alphabet of terminals Σ
– A set of productions A → α such that A ∈ N and α is a (possibly empty) string over (Σ ∪ N)
– A designated start symbol
• Used for programming language syntax. Too restrictive for
natural languages
Example Grammar (L0)
Context Free Grammars
for Natural Language
• Context free grammars work well for basic
grammar syntax
• Disadvantage
– Some complex syntactic rules require clumsy
constructions
– Agreement: “He ate many meal” (ill-formed)
– Movement of grammatical components:
o Which flight do you want me to have the travel agent book?
o The object is far from its matching verb
Morphology
• How morphemes combine to make words (and how the combinations are pronounced)
• Important for speech recognition and
synthesis
• Example: singular to plural
– Run to runs: z sound (voiced)
– Hit to Hits: s sound (unvoiced)
• One approach: Devise language specific sets
of rules of pronunciation
Syllables
• Organizational phonological unit
– Vowel between two consonants
– Ambiguous positioning of consonants into
syllables
– Tree structured representation
• Basic unit of prosody
– Lexical stress: inherent property of a word
– Sentential stress: speaker choice to emphasize or
clarify
Finite State Automata
• Definition: (N, T, s0, δ, F) where
• N is a finite, non-empty set of non-terminal states
• T is a finite, non-empty set of terminal symbols
• s0 is an initial state, an element of N
• δ is the state-transition function
– Deterministic transition function: δ : N × T → N
– Nondeterministic transition function: δ : N × T → P(N), the power set of N
– Transducers: add Γ, a set of output symbols, and an output function ω into Γ
• F ⊆ N is the (possibly empty) set of final states
Finite-state Automata
Equivalent to:
• Finite-state automata (FSA)
• Regular languages
• Regular expressions
Finite-state Automata (Machines)
[Figure: an FSA for the sheep language baa!, baaa!, baaaa!, … (regular expression /baa+!/): states q0–q4, with transitions q0 –b→ q1 –a→ q2 –a→ q3, a self-loop on a at q3, and q3 –!→ q4, the final state]
[Figure: an input tape that does not match /baa+!/ is traced through the machine and ends in REJECT]
[Figure: the input tape “b a a a !” is traced through states q0 q1 q2 q3 q3 q4 and ends in ACCEPT]
State-transition Tables
         Input
State    b    a    !
  0      1    0    0
  1      0    2    0
  2      0    3    0
  3      0    3    4
  4:     0    0    0

(State 4, marked with a colon, is the final state)
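A minimal sketch (not from the slides) that drives this transition table in Python; treating the empty/0 cells as “no transition, reject” is an assumption based on the reject trace shown earlier.

```python
# Deterministic FSA for the sheep language /baa+!/, driven by the
# state-transition table above.  State 0 is the start state and
# state 4 is the only accepting state; a missing entry means reject.
TABLE = {
    (0, 'b'): 1,
    (1, 'a'): 2,
    (2, 'a'): 3,
    (3, 'a'): 3,
    (3, '!'): 4,
}
FINAL_STATES = {4}

def accepts(tape: str) -> bool:
    state = 0
    for symbol in tape:
        if (state, symbol) not in TABLE:
            return False          # no transition: reject
        state = TABLE[(state, symbol)]
    return state in FINAL_STATES  # accept only if we end in a final state

if __name__ == "__main__":
    for s in ["baa!", "baaaa!", "ba!", "abab"]:
        print(s, "ACCEPT" if accepts(s) else "REJECT")
```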
Finite State Machine Examples
[Figures: two example machines over small input alphabets ({0,1} and {a,b}); the deterministic machine has at most one transition per state and input symbol, while the non-deterministic machine can have several]
Finite State Transducer
A finite state automaton that produces an output string
Input: Features from a sequence of frames
Processing: Find the most likely path through the sequence
using hidden Markov models or Neural Networks
Output: The most likely word, phoneme, or syllable
O is a set of output symbols, with an output mapping ω: S → O
Back End Processing
• Rule Based: Insufficient to represent the differences in how
words are constructed
• Statistics based: Most other areas of natural language
processing are trending toward statistics-based methods
• Procedure
– Supervised training: An algorithm “learns” the parameters
using a training set of data. The “trained” algorithm then is
ready to run in an actual environment.
– Unsupervised training: An algorithm trains itself by
computing categories from the training data
Representing Stress
• There have been unsuccessful attempts to
automatically assign stress to phonemes
• Notations for representing stress
– IPA (International Phonetic Alphabet) has a diacritic
symbol for stress
– Numeric representation
• 0: reduced, 1: normal, 2: stressed
– Relative
• Reduced (R) or Stressed (S)
• No notation means undistinguished
Random Variables
• Random Variable, X, is a quantity that assigns a
numerical value to each possible event
• Reason: assigning numbers to events lets us analyze the
results mathematically.
• Example: pick a ball out of a bag. Suppose the balls
are red, blue, and green. We could assign X=0 if red,
X=1 if blue, and X=2 if green.
• A discrete random variable has a finite number of
possible values (∑i=1,n p(xi) = ∑i=1,n P(X=xi) = 1).
Probability Chain Rule
• Conditional Probability P(A1,A2) = P(A1) * P(A2|A1)
• The Chain Rule generalizes to multiple events
– P(A1, …,An) = P(A1) P(A2|A1) P(A3|A1,A2)…P(An|A1…An-1)
• Examples:
– P(the dog) = P(the) P(dog | the)
– P(the dog bites) = P(the) P(dog | the) P(bites| the dog)
• Conditional probabilities tell us more than individual relative
word frequencies because they take the context into account
– Dog may be relatively rare word in a corpus
– But if we see barking, P(dog|barking) is much more likely
• In general, the probability of a complete string of words w1…wn is:
P(w1…wn) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|w1…wn-1)
         = ∏k=1,n P(wk | w1…wk-1)
Note: A large n requires a lot of data; chains of two or three work well
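A minimal bigram (n = 2) sketch of the chain rule in Python; the toy corpus below is invented purely for illustration.

```python
from collections import Counter

# Toy corpus; in practice the counts would come from a large corpus.
corpus = "the dog bites the dog barks the cat sleeps".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def bigram_prob(sentence):
    """P(w1..wn) ~= P(w1) * product of P(wk | wk-1): a bigram chain rule."""
    words = sentence.split()
    p = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(bigram_prob("the dog bites"))   # P(the) * P(dog|the) * P(bites|dog)
```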
Probability Density Function
• f(x) is a continuous probability density
function if f(x) ≥ 0 and ∫-∞,∞ f(x)dx = 1
[Figure: a density curve with the area between x = a and x = b shaded]
Note: The shaded area is the probability that a ≤ x ≤ b
Mean, Variance, Standard Deviation
• The mean or expected value
– Discrete: µ = E(x) = ∑ x p(x) over all x values
– Continuous: µ = E(x) = ∫ x f(x) dx from -∞ to ∞
• Variance
– Discrete: σ² = ∑(x - µ)² p(x) = ∑x²p(x) – (∑x p(x))²
– Continuous: σ² = ∫(x - µ)²f(x)dx = ∫x²f(x)dx – (∫x f(x)dx)²
• Standard Deviation: σ = square root of the variance
• Intuition
– Mean: center of the distribution (1st moment)
– Variance: spread of the distribution (2nd moment)
– Standard deviation: spread in the same units as the data; used to say what percentage of values lies within a given distance of the mean
– Skew: asymmetry of the distribution (3rd moment)
– Kurtosis: how peaked the distribution is (4th moment)
Note: Same mean, different variances
Example
• Bag of numbered balls
• Pick a single ball from the bag
• Mean: µ = ∑ x p(x)
  = 1*8/30 + 2*5/30 + 3*3/30 + 4*10/30 + 5*4/30
  = 87/30 = 2.9

  x         1      2      3      4      5
  Quantity  8      5      3      10     4
  P(x)      8/30   5/30   3/30   10/30  4/30

• Variance Method 1: σ² = ∑(x - µ)² p(x)
  σ² = 8/30*(-1.9)² + 5/30*(-0.9)² + 3/30*(0.1)²
       + 10/30*(1.1)² + 4/30*(2.1)² = 2.09
• Variance Method 2 (without the mean): σ² = ∑x²p(x) – (∑x p(x))²
  σ² = 1*8/30 + 4*5/30 + 9*3/30 + 16*10/30 + 25*4/30
       – (1*8/30 + 2*5/30 + 3*3/30 + 4*10/30 + 5*4/30)² = 10.5 – 2.9² = 2.09
• Standard Deviation = (2.09)½ ≈ 1.45
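A quick numerical check of this example in Python, using the counts from the table above:

```python
# Counts of the numbered balls from the table above: value -> quantity.
counts = {1: 8, 2: 5, 3: 3, 4: 10, 5: 4}
total = sum(counts.values())                    # 30 balls

mean = sum(x * n / total for x, n in counts.items())
var = sum((x - mean) ** 2 * n / total for x, n in counts.items())
var2 = sum(x * x * n / total for x, n in counts.items()) - mean ** 2

print(mean, var, var2, var ** 0.5)              # 2.9, ~2.09, ~2.09, ~1.45
```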
Covariance
Covariance determines how two random variables relate
• A positive covariance occurs if two random variables tend to
both be above or below their means together
• A negative covariance occurs when the random variables tend
to be on opposite sides of the mean
• If no correlation, the covariance will be close to zero
• Covariance formula
– Discrete: Cov(X,Y) = ∑x ∑y (x-µx)(y-µy) p(x,y)
– Continuous: Cov(X,Y) = ∫x ∫y (x-µx)(y-µy) f(x,y) dy dx
– A sample covariance estimated from N observations divides by N-1
• Correlation coefficient: ρxy = Cov(X,Y)/(σx σy)
• ρxy lies between -1 and +1; values near ±1 imply the variables are strongly related
Covariance (Dispersion) Matrix
• Given random variables X1, …, Xn with means µ1, …, µn
• The covariance matrix ∑, where ∑i,j = E[(Xi-µi)(Xj-µj)] = Cov(Xi,Xj)
• Equivalent matrix definition: ∑ = E[(X – E[X])(X – E[X])T]
• Note: The T means transpose; some texts use a single quote instead
Covariance Example
Three random variables (x0, x1, x2), five observations (N = 5) each
Deviations from the means (µ0 = 4.1, µ1 = 2.08, µ2 = 0.604):

  x0 – µ0            x1 – µ1             x2 – µ2
  -0.1 = 4.0 - 4.1   -0.08 = 2.0 - 2.08  -0.004 = 0.60 - 0.604
   0.1 = 4.2 - 4.1    0.02 = 2.1 - 2.08  -0.014 = 0.59 - 0.604
  -0.2 = 3.9 - 4.1   -0.08 = 2.0 - 2.08  -0.024 = 0.58 - 0.604
   0.2 = 4.3 - 4.1    0.02 = 2.1 - 2.08   0.016 = 0.62 - 0.604
   0.0 = 4.1 - 4.1    0.12 = 2.2 - 2.08   0.026 = 0.63 - 0.604

Note: ∑ results from multiplying the 3x5 transpose of this deviation matrix by the 5x3 deviation matrix (scaled by the number of observations)
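A sketch of the same computation with NumPy, using the observation values behind the table above (whether you divide by N or N-1 is a convention; NumPy's np.cov uses N-1 by default):

```python
import numpy as np

# Five observations of the three random variables from the example above.
X = np.array([
    [4.0, 2.0, 0.60],
    [4.2, 2.1, 0.59],
    [3.9, 2.0, 0.58],
    [4.3, 2.1, 0.62],
    [4.1, 2.2, 0.63],
])

deviations = X - X.mean(axis=0)                       # 5x3 matrix of (x - mean) values
sigma = deviations.T @ deviations / (len(X) - 1)      # 3x3 covariance matrix

print(sigma)
print(np.cov(X, rowvar=False))                        # NumPy's built-in gives the same result
```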
Uniform Distribution
The probability of every value is equal
• pdf: f(x) = 1/(b-a) for a ≤ x ≤ b; 0 otherwise
• µ = (a + b)/2, Variance: σ² = (b-a)²/12
• Initial training data for acoustic information can
be set up as a uniform distribution
Binomial Distribution
Repeated experiments each with two possible outcomes
• pdf: P(X = k) = C(n,k) p^k (1-p)^(n-k)
  where C(n,k) = n!/(k!(n-k)!),
  n = # of experiments, k = # of successes, and
  p = success probability
• µ = np
• σ2 = np(1-p)
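A small sketch that builds the binomial pmf directly and checks µ = np and σ² = np(1-p); the values n = 10 and p = 0.3 are arbitrary:

```python
from math import comb

n, p = 10, 0.3   # assumed example parameters

pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum(k * k * pk for k, pk in enumerate(pmf)) - mean**2

print(sum(pmf))              # ~1.0: the pmf sums to one
print(mean, n * p)           # both ~3.0: mu = np
print(var, n * p * (1 - p))  # both ~2.1: sigma^2 = np(1-p)
```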
Multinomial Distribution
Number of successes in n independent experiments
• pdf: P(x1, …, xk) = n!/(x1! … xk!) * p1^x1 * … * pk^xk,
  where ∑i xi = n and ∑i pi = 1
• µi = n pi
• σi² = n pi (1-pi)
• Cov(xi,xj) = -n pi pj
• Extends the binomial distribution to multiple random variables
Gaussian Distribution
• When we analyze probability involving many random processes, the
distribution is almost always Gaussian
• Central Limit Theorem: as the number of independent random variables
being summed approaches ∞, the distribution of their sum approaches a Gaussian
• Probability density:
f(x | µ,σ²) = 1/(2πσ²)½ * e^z where z = -(x-µ)² / (2σ²)
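A direct transcription of this density into Python (a sketch; the default µ = 0 and σ = 1 give the standard normal):

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """f(x | mu, sigma^2) = 1/sqrt(2*pi*sigma^2) * exp(-(x-mu)^2 / (2*sigma^2))."""
    z = -(x - mu) ** 2 / (2 * sigma ** 2)
    return exp(z) / sqrt(2 * pi * sigma ** 2)

print(gaussian_pdf(0.0))   # peak of the standard normal, ~0.3989
print(gaussian_pdf(1.0))   # ~0.2420, one standard deviation out
```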
Multivariate Mixture Gaussian
Distribution
• Multiple independent random variables
• Each variable can have its own mean and variance
[Figure: joint density surface of two independent random variables X and Y]
Multivariate Normal Distribution
Determinant of a 3x3 Matrix
Example: Compute the determinant of:
  | 5  3  4 |
  | 2  1  5 |
  | 3  6  2 |
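Expanding along the first row (assuming the row-major reading of the matrix above):
  det = 5(1·2 − 5·6) − 3(2·2 − 5·3) + 4(2·6 − 1·3)
      = 5(−28) − 3(−11) + 4(9)
      = −140 + 33 + 36
      = −71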
Bayes' Rule
Fundamental to many speech recognition algorithms
• P(A | B) = P(B | A) P(A) / P(B)
• P(B) = ∑k=1,n P(B | Ak) P(Ak)
• P(Ai | B) = P(B | Ai) P(Ai) / ∑k=1,n P(B | Ak) P(Ak)
[Figure: an event B overlapping a partition of the sample space into A1 … A5]
Max [P(word | sound)] = Max [P(sound | word) * P(word) / P(sound)]
= Max [P(sound | word) * P(word)] because the denominator is a constant
Bayes Example
Probability that a car will be late to its destination
Noisy Channel Decoding
Source → Noisy Channel → Decoder
• Assume Input = word w; Feature vector = f; V = vocabulary
– We want to find the word w = argmax w∈V p(w|f)
– Using Bayes Rule: argmax w∈V p(f|w)p(w)/p(f)
• Why use Bayes Rule?
– P(w|f) is difficult to compute
– P(f|w) is relatively easy to compute. Just add probabilities to
reflect spelling or pronunciation variation rules
– P(w) is how often w occurs in a large corpus (the prior probability)
– Ignore P(f): f doesn’t change as we search the lexicon
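A minimal sketch of the decoder's argmax over a tiny vocabulary (every number below is invented for illustration):

```python
# Toy noisy-channel decoder: pick the word maximizing P(f|w) * P(w).
prior = {"porch": 0.00002, "porsche": 0.000005, "perch": 0.00001}   # P(w) from a corpus (invented)
likelihood = {"porch": 0.30, "porsche": 0.25, "perch": 0.05}        # P(f|w) for the observed features (invented)

def decode(vocabulary):
    # P(f) is ignored: it stays constant while we search the lexicon.
    return max(vocabulary, key=lambda w: likelihood[w] * prior[w])

print(decode(prior.keys()))   # "porch" under these made-up numbers
```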
Bayesian Inference
[Figure: a Venn diagram of ten students — 4 vegetarians and 3 CS majors, with 2 students who are both]
• Randomly choose students from a population of ten. Find the probabilities:
– p(vegetarian) = .4, p(cs major) = .3
– A vegetarian student is a CS major? p(c|v) = .5 = p(c) p(v|c) / p(v)
– A student is both a vegetarian and a CS major? p(c,v) = .2 = p(v) p(c|v)
– A student is both a vegetarian and a CS major? p(c,v) = .2 = p(c) p(v|c)
– A CS-major student is a vegetarian? p(v|c) = 2/3 ≈ .67 = p(v) p(c|v) / p(c)
Definitions
• Stochastic process: A process of change of one or more random variables
{Xi} over time, based on a well-defined set of probabilities
• Markov model: a Markov model consists of a list of the possible states
of that system, the possible transitions from one state to another, and the
rates that govern those transitions. Transitions can depend on the current
state and some number of previous states.
• Markov Chain
– A Markov model with a finite number of states in which the probability
of the next state depends only upon the current state
• Examples
– The next phoneme’s probability depends solely on the preceding one
of the sequence
– A model of word or phoneme prediction that uses the previous N-1
words or phonemes to predict the next (N-gram model)
– Hidden Markov model, predicting the hidden cause after observing
the output (predicting the words, when observing the features)
Vector Quantization
• Partition the data into cells (Ci)
• Cell centroids quantized as zi
• Compute distance between
received data and centroids
• Received data
– is quantized into one of the cells
– q(x) = zi if x falls inside cell Ci
• Distortion (distance) formulas
– Euclidean: d(x,z) = ∑i=1,D (xi – zi)²
– Linear (covariance-weighted): d(x,z) = (x-z)T ∑-1 (x-z)
– Mahalanobis: Euclidean distance normalized by the variance
– D is the dimension of the feature vectors
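A minimal sketch of the quantization step q(x) with the squared-Euclidean distortion (the 2-D codebook values are invented):

```python
import numpy as np

# Toy codebook: one centroid z_i per cell C_i (values are made up).
codebook = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [3.0, 0.5],
])

def quantize(x):
    """q(x) = z_i for the cell whose centroid is nearest in squared Euclidean distance."""
    distortions = ((codebook - x) ** 2).sum(axis=1)   # d(x, z_i) for every centroid
    i = int(np.argmin(distortions))
    return i, codebook[i]

print(quantize(np.array([0.9, 0.8])))   # cell 1, centroid [1.0, 1.0]
```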
K- Means Algorithm
• Input:
– F = {f1, …, fk} is a list of feature vectors
– N = desired number of categories (phoneme types)
• Output:
– C = {c1, …, cN}: the center of each category
– m: F → C maps each feature vector to one of the categories
• Pseudocode (a Python sketch follows this slide)
Randomly choose members of F as the initial centers C
WHILE true
FOR EACH fj ∈ F assign fj to the closest ck
IF no reassignments have taken place THEN BREAK
Recompute the center of each member of C
• Issues
– What metric do we use to compute distances?
– A poor initial selection will lead to incorrect results or poor performance
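A runnable Python sketch of the pseudocode above, on synthetic 2-D feature vectors; squared Euclidean distance is used here, which is one of the metric choices the Issues bullet raises:

```python
import numpy as np

def k_means(features, n_categories, n_iterations=100, seed=0):
    """Sketch of the loop above: assign vectors to the closest center, then recompute centers."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Randomly choose members of F as the initial centers C.
    centers = features[rng.choice(len(features), n_categories, replace=False)].copy()
    assignment = np.full(len(features), -1)
    for _ in range(n_iterations):
        # FOR EACH f_j in F: assign f_j to the closest c_k (squared Euclidean distance).
        distances = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assignment = distances.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break                                   # no reassignments: converged
        assignment = new_assignment
        # Recompute the center of each member of C.
        for k in range(n_categories):
            if np.any(assignment == k):
                centers[k] = features[assignment == k].mean(axis=0)
    return centers, assignment

# Example: three synthetic clusters of 2-D feature vectors.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(m, 0.1, size=(20, 2)) for m in (0.0, 1.0, 3.0)])
centers, labels = k_means(pts, 3)
print(centers)
```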
LBG Extension of K Means
Linde, Buzo, and Gray
1. Let M = 1 to form a single partition
2. Find the centroid of all training data: (1/T) ∑i=1,T xi
3. While M < the desired number of partitions
   For each of the M partitions
     i.   Compute the centroid position
     ii.  Replace the old centroid with the new one
     iii. Split the partition in half
     iv.  Estimate a centroid in each half
     v.   Use the k-means algorithm to optimize the centroid positions
     vi.  M = 2*M
Maximum Likelihood Formulation
Vector of outcomes ϴ = (ϴ1, …, ϴn), with a probability for each ϴi
Context Free Grammar Example
G = (N, T, s0, P, F)
Goal: Frequency of a particular grammatical construction
Parameters
– X = set of all possible parse trees
– T = {t1 … tO} where ti ∈ X are the observed parse trees
– Let ϴp = probability that a parse tree applies production p ∈ P
– Parameter space Ω = set of ϴ ∈ [0,1]|P| where, for each left-hand-side non-terminal α, the ϴp of its productions sum to 1
– C(ti,p) = number of times production p occurs in tree ti
• Estimate of a parse tree's probability: P(t|ϴ) = ∏p∈P (ϴp)^C(t,p)
• Easier to deal with logs: log P(t|ϴ) = ∑p∈P C(t,p) * log ϴp
• Estimate over all trees: L(ϴ) = ∑t∈T log P(t|ϴ) = ∑t∈T ∑p∈P C(t,p) * log ϴp
• ϴMostLikely,p = (count of production p in the observed trees) / (count of all productions in the observed trees)
               = ∑t∈T C(t,p) / ∑t∈T ∑s∈P C(t,s)
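A sketch of the final counting estimate, representing each observed tree simply as the list of productions it uses (the tree representation and production names are invented for illustration):

```python
from collections import Counter

# Each observed tree is stood in for by the multiset of productions it used.
observed_trees = [
    ["S->NP VP", "NP->Det N", "VP->V NP", "NP->Det N"],
    ["S->NP VP", "NP->Pro", "VP->V"],
]

counts = Counter(p for tree in observed_trees for p in tree)   # C(t, p) summed over trees
total = sum(counts.values())                                   # all productions in all trees

theta = {p: c / total for p, c in counts.items()}              # theta_p, the counting estimate
print(theta)
```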
EM Algorithm
EM = Expectation-Maximization
1. Perform an initial Maximum Likelihood (ML) estimation
2. Expectation Step: Compute the expected value of the likelihood
function with respect to the observed distribution
3. Maximization Step: Use the values computed in step 2 to refine
the parameter estimates
4. Repeat steps 2 and 3 until the algorithm converges
Note: The Baum-Welch Hidden Markov Model algorithm, which we will
discuss later, is a special case of the EM Algorithm
Decision Trees
Partition a series of questions, each with a discrete set of answers
[Figures: the same scatter of data points split two ways; one set of splits is labeled “Reasonably Good Partition” and the other “Poor Partition”]
CART Algorithm
Classification and regression trees
1. Create a set of questions that can distinguish between the
   measured variables
   a. Singleton questions: Boolean (yes/no or true/false) answers
   b. Complex questions: many possible answers
2. Initialize the tree with one root node
3. Compute the entropy for a node to be split
4. Pick the question with the greatest entropy gain
5. Split the tree based on step 4
6. Return to step 3 as long as nodes remain to split
7. Prune the tree to its optimal size by removing leaf nodes
   with minimal improvement
Note: We build the tree from top down. We prune the tree from bottom up.
Example: Play or not Play?
Outlook    Temperature  Humidity  Windy  Play?
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No
Questions
1) What is the outlook?
2) What is the temperature?
3) What is the humidity?
4) Is it Windy?
Goal: Order the questions in
the most efficient way
Example Tree for “Do we play?”
Goal: Find the optimal tree
[Figure: decision tree — root question Outlook?; sunny → Humidity? (high → No, normal → Yes); overcast → Yes; rain → Windy? (true → No, false → Yes)]
Which question to select?
(slide credit: Witten & Eibe)
Computing Entropy
• Entropy: Bits needed to store possible question answers
• Formula: Computing the entropy for a question:
Entropy(p1, p2, …, pn) = - p1log2p1 – p2log2p2 … - pn log2pn
• Where
pi is the probability of the ith answer to a question
log2x is logarithm base 2 of x
• Examples:
– A coin toss requires one bit (head=1, tail=0)
– A question with 30 equally likely answers requires
∑i=1,30 -(1/30)log2(1/30) = -log2(1/30) ≈ 4.907 bits
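A direct sketch of the entropy formula in Python, reproducing both examples above:

```python
from math import log2

def entropy(probabilities):
    """Entropy(p1, ..., pn) = -sum(pi * log2(pi)), skipping zero-probability answers."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))      # coin toss: 1.0 bit
print(entropy([1/30] * 30))     # 30 equally likely answers: ~4.907 bits
```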
Example: question “Outlook”
Compute the entropy for the question: What is the outlook?
Entropy(“Outlook”=“Sunny”)=Entropy(0.4, 0.6)=-0.4 log2(0.4)-0.6 log2(0.6)=0.971
Five outcomes, 2 for play for P = 0.4, 3 for not play for P=0.6
Entropy(“Outlook” = “Overcast”) = Entropy(1.0, 0.0)= -1 log2(1.0) - 0 log2(0.0) = 0.0
Four outcomes, all for play. P = 1.0 for play and P = 0.0 for no play.
Entropy(“Outlook”=“Rainy”)= Entropy(0.6,0.4)= -0.6 log2(0.6) - 0.4 log2(0.4)= 0.971
Five Outcomes, 3 for play for P=0.6, 2 for not play for P=0.4
Entropy(Outlook) = Entropy(Sunny, Overcast, Rainy)
= 5/14*0.971+4/14*0+5/14*0.971 = 0.693
Computing the Entropy gain
• Original Entropy : Do we play?
Entropy(“Play“)=Entropy(9/14,5/14)=-9/14log2(9/14) - 5/14 log2(5/14)=0.940
14 outcomes, 9 for Play P = 9/14, 5 for not play P=5/14
• Information gain equals
(information before) – (information after)
gain("Outlook") = 0.940 – 0.693 = 0.247
• Information gain for other weather questions
– gain("Temperature") = 0.029
– gain("Humidity") = 0.152
– gain("Windy") = 0.048
• Conclusion: Ask, “What is the Outlook?” first
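A sketch that recomputes gain("Outlook") from the play/not-play table above:

```python
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

# (outlook, play) pairs taken from the table above.
rows = [("sunny", "No"), ("sunny", "No"), ("overcast", "Yes"), ("rain", "Yes"),
        ("rain", "Yes"), ("rain", "No"), ("overcast", "Yes"), ("sunny", "No"),
        ("sunny", "Yes"), ("rain", "Yes"), ("sunny", "Yes"), ("overcast", "Yes"),
        ("overcast", "Yes"), ("rain", "No")]

before = entropy([p for _, p in rows])                  # ~0.940

after = 0.0
for value in set(o for o, _ in rows):
    subset = [p for o, p in rows if o == value]
    after += len(subset) / len(rows) * entropy(subset)  # weighted child entropy, ~0.693

print(round(before - after, 3))                         # gain("Outlook") ~= 0.247
```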
Continuing to split
[Figure: the three candidate splits under the “sunny” branch, with their leaf labels]
gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits
For each child question, do the same thing to form the complete decision tree
Example: After the outlook sunny node, we still can ask about temperature,
humidity, and windiness
The final decision tree
Note: The splitting stops when further splits don't reduce
entropy more than some threshold value
Senone Model
Definition: A cluster of similar Markov States
• Goal: Reduce the trainable units that the recognizer needs to process
• Approach:
– HMMs represent sub-phonetic units
– A tree structure combines the sub-phonetic units
– The phoneme recognizer searches the tree to find HMMs
– Nodes partition with questions about neighboring phones
• Performance:
– Triphones reduce the error rate by 15%
– Senones reduce the error rate by 24%
[Figure: example senone-tree questions — Is the left phone a sonorant or nasal? Is the right phone a back-R? Is the right phone voiced? Is the left phone a back-L? Is the left phone s, z, sh, or zh?]
Scoring Acoustic Features
• Choose the model: discrete, continuous, semi-continuous
• Continuous: problematic when training data is insufficient
– Consider discrete ranges of values instead
– Problem: difficult to determine the boundaries between ranges
• Discrete or semi-continuous
– Consider multiple codebooks
– Multiple codebooks require adjustments to the HMM formulas
– For example: αt,j = [∑i=0,N-1 αt-1,i ai,j] ∏c bj,c(xt,c), with one output term per codebook c
• Decide whether to use a word or sub word model
– Word model: collect training data for each word
– Sub-word models: share the subunits across the vocabulary