
Hidden Markov Models -- Introduction



All prior models of speech are nonparametric and non-statistical. Hence estimates of variables are uninformed by the relative deviations of the models.
Hidden Markov Models
– An attempt to reproduce the statistical fluctuation in speech across a small utterance
– An attempt which has a well-motivated training theory
This Lecture


What is a Hidden Markov Model?
What are the various types of estimation procedures?
Agenda


Markov Chains -- how to estimate probabilities
Hidden Markov Models
– definition
– how to identify
– how to choose parameters
– how to optimize parameters to produce the best model
Types of Hidden Markov Models
Agenda II

Next Lecture: Different Types of Hidden Markov Models
– Distinct implementation details.
Overview




Techniques for choosing Hidden Markov Models and estimating their parameters.
Related to the Dynamic Programming already done: the key quantities are recursively defined.
Key Difference
– Can estimate true probabilities, and effectively variances and weight estimates.
– Estimation time is surprisingly fast.
Vocabulary

Hidden Markov Model
– Much more below, but: a doubly stochastic model; the underlying states are Markov, and the outputs are produced by a random process depending on the state.

Alpha Terminal, Beta Terminal
– Alpha terminal: the probability of the initial portion of the observation sequence together with ending in a particular state.
– Beta terminal: the probability of the terminal portion of the observation sequence given that it starts in state s.
Vocabulary II

Maximum Likelihood Estimation
– Choosing the parameters of the model so that the probability of the observed data is maximized.
– The classical principle for statistical inference; others exist.

Sufficient Statistics
– Functions of the input data which bear on the parametric estimates.
– If you know the sufficient statistics, you know everything the data can tell you about the parameters.
Vocabulary III

Jensen's Inequality
– For convex functions and any probability distribution:
$$E(f(x)) \ge f(E(x)), \quad \text{e.g. } E(x^2) \ge E(x)^2$$
– For concave functions the inequality reverses:
$$E(\log x) \le \log E(x)$$
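As a quick numeric illustration (not from the original slides), the following Python sketch checks both directions of Jensen's inequality on samples from a positive distribution; the exponential distribution here is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # any positive-valued sample works

# Convex case: E[f(x)] >= f(E[x]) with f(x) = x^2
print(np.mean(x**2) >= np.mean(x)**2)            # True

# Concave case: E[log(x)] <= log(E[x])
print(np.mean(np.log(x)) <= np.log(np.mean(x)))  # True
```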
Markov Models




Introduction to the basic properties of discrete Markov chains
Definition of a Hidden Markov Model
Their use in discrete word recognition
Techniques to evaluate and train discrete Hidden Markov Models
Markov Chains -- The Weather Model

[Figure: a four-state Markov chain with states Rainy, Sunny, Cloudy, and Snowy; the states are connected by transition probabilities a_jk, and each state has a self-loop a_jj]

where a_jk is the probability of changing from weather state j to weather state k.
The Weather Model

As drawn the model is recurrent, i.e., any state can be reached from any other state.
Definition
Markov Chain

– Consists of a set of states $S_1, \ldots, S_N$. At regular fixed intervals of time the state changes according to the transition probabilities
$$P(q_t = S_j \mid q_{t-1} = S_i) = a_{ij}$$

Furthermore,
$$P(q_t = S_i \mid q_{t-1} = S_{j_1}, \ldots, q_{t-k} = S_{j_k}) = P(q_t = S_i \mid q_{t-1} = S_{j_1})$$

Only one step of memory is used for the transition probabilities.
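Estimating the transition probabilities of an observable Markov chain (the agenda's first item) reduces to counting: the maximum likelihood estimate of $a_{ij}$ is the number of observed $i \to j$ transitions divided by the number of times state i was left. A minimal Python sketch, with a hypothetical weather sequence:

```python
import numpy as np

def estimate_transitions(states, n_states):
    """ML estimate of a_ij: count i->j transitions, then normalize each row."""
    counts = np.zeros((n_states, n_states))
    for prev, cur in zip(states[:-1], states[1:]):
        counts[prev, cur] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for states never left get a uniform row (an arbitrary convention).
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), 1.0 / n_states)

# Hypothetical sequence with Rainy=0, Sunny=1, Cloudy=2, Snowy=3
seq = [1, 1, 2, 0, 0, 2, 1, 1, 3, 3, 2, 1]
print(estimate_transitions(seq, 4))
```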
Hidden Markov Model
Vs Markov Chain


Markov chains have entirely observable states. However, a "Hidden Markov Model" is a model of a Markov source which emits an output at each time slot depending upon the state. The states are not directly observed.
For instance...
Markov Chain and Urn Model

Suppose the states are hidden
– Consider an urn model
– Colored balls in each urn
– The observer sees only the balls selected out of each urn at each slot

[Figure: urns 1 … N corresponding to states q_1 … q_N; each urn has its own output probabilities P(R), P(G), P(B) for red, green, and blue balls]
Operation of the Model

I. Step I
– One is in a state corresponding to an urn q_i

II. Step II
– Select a colored ball at random out of this urn. The observer sees the ball; then replace it.

III. Step III
– Flip a biased die (or choose a special ball out of another urn corresponding to the current one) to select the next urn. Then replace the ball.

Note
– The observer only sees a sequence of colors
Formal Definition

A Hidden Markov Model is a triple $(A, B, \pi)$ where

Transition Probabilities: $A = \{a_{ij}\} = P(q_j \text{ at } t \mid q_i \text{ at } t-1)$
Output Probabilities: $B = \{b_j(k)\} = P(O_k \text{ at } t \mid q_j \text{ at } t)$
Initial Probabilities: $\pi = \{\pi_j\} = P(q_j \text{ at } t = 1)$

Outputs are generated in the following manner:
Output Generation

1. Choose an initial state in accord with the starting distribution $\pi$.
2. Set $t = 1$.
3. Choose $O_t$ in accord with $b_{q_t}(\cdot)$.
4. Choose $q_{t+1}$ in accord with A, i.e., $a_{q_t \,\cdot}$.
5. Set $t = t + 1$ and return to 3.
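A minimal Python sketch of steps 1-5, assuming discrete outputs and row-stochastic A and B; the two-urn toy model below is hypothetical:

```python
import numpy as np

def sample_hmm(A, B, pi, T, rng=None):
    """Generate (states, observations) following steps 1-5 above."""
    rng = rng or np.random.default_rng()
    N, M = B.shape
    q = rng.choice(N, p=pi)                        # step 1: initial state from pi
    states, obs = [], []
    for _ in range(T):                             # steps 2 and 5: t = 1 .. T
        states.append(int(q))
        obs.append(int(rng.choice(M, p=B[q])))     # step 3: output from b_q
        q = rng.choice(N, p=A[q])                  # step 4: next state from a_q.
    return states, obs

# Hypothetical two-urn model: the observer sees only `obs`, never `states`.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])   # P(color | urn)
pi = np.array([0.6, 0.4])
print(sample_hmm(A, B, pi, T=10, rng=np.random.default_rng(1)))
```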
Problems Using Hidden
Markov Models

It's hard a priori to say what is the best structure for an HMM for a given problem.
– Empirically, many models of a given complexity often produce a similar fit, hence it's hard to identify models.

It's possible now, due to Amari, to say whether or not two models are stochastically equivalent, i.e., generate the same probabilities.
– This gives a metric on HMMs.
– (Usually two models are equivalent with probability 0.)
Criticism Leveled Against
HMM’s: Somewhat Bogus

For a hidden Markov model
– The past history is reflected only in the last state that the sequence is in. Therefore prior history cannot influence the result. Speech, because of coarticulation, is dependent upon prior history (e.g., /pinz/ vs. /pits/).

There can be no backward effects.
– There can be no effects of "future" utterances on the present, i.e., backwards assimilation,
– e.g., "grey chips" vs. "grey ship" vs. "great chip".
Answers to Criticism

First Objection
– A Markov model by itself cannot handle this elementary objection. However, distortion (delta) coefficients effectively convey frame information about locally prior parts of the utterance, as in the sketch below.

Second Objection
– Shows that speech has to be locally buffered, and a conclusion about a phoneme cannot be made without a limited lookahead, as people do. One can easily construct a Markov model to do this.
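A common simple form of delta coefficient is a symmetric frame difference appended to each feature vector. The sketch below is one illustrative variant, not the specific scheme the lecture had in mind:

```python
import numpy as np

def add_delta_coefficients(frames, d=2):
    """Append a local slope (O_{t+d} - O_{t-d}) / 2d to each feature frame."""
    padded = np.pad(frames, ((d, d), (0, 0)), mode='edge')  # repeat edge frames
    deltas = (padded[2 * d:] - padded[:-2 * d]) / (2.0 * d)
    return np.hstack([frames, deltas])   # shape (T, F) -> (T, 2F)
```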
No Ideal Method to Determine

the best model for a phone, word, or sentence. However,
– In fact, they are the only existing statistical models of speech recognition.
– They can be used to self-validate as well as recognize, and to validate significance.
Summary


Cannot directly identify HMM structure; however, one can still use the model and assume the speech source obeys the given structure.
BUT
– If one cannot choose suitable parameters for the model, it turns out to be useless.
– This problem has been solved.
History

Technique originated by Leonard Baum.
– Baum (1966)

The author wrote 3 or 4 papers in mathematics journals.
– Probably the most important innovation in mathematical statistics at the time.

It took about 10 years for Fred Jelinek and Baker to pick it up for speech.
Now used all over the place, popularized by A.P. Dempster, N. Laird, and D. Rubin at Harvard.
Preconditions

For the speech recognition application, suppose that frames are vector-quantized codewords representing the speech signal (see later).
Hidden Markov models can do their own quantization; however, this case is treated first for simplicity.
Three Basic Problems for
Hidden Markov Model Use

Problem I
– Given an observation sequence $O_1, \ldots, O_T$ and a model $\lambda = (A, B, \pi)$, how does one compute the probability $P(O \mid \lambda)$?

Problem II
– Given the observation sequence $O_1, \ldots, O_T$, how can one find a state sequence which is optimal in some sense?
Problem III

Given a training sequence $O = O_1 \ldots O_T$, how do we train the model to maximize $P(O \mid \lambda)$?
– Hidden Markov models are a form of maximum likelihood estimation. In principle one can use them to do statistical tests of hypotheses, in particular tests of the values of certain parameters.
– Maximum likelihood estimation is a method which is known to be asymptotically optimal for estimating the parameters, implicitly minimizing the probability of error sequences.
Solutions to the Three Hidden
Markov Problems

Problem I
– Given an observation sequence, how do we compute its likelihood?
– Solution

Brute force
– 1. Enumerate a state sequence $I = q_1, \ldots, q_T$
– 2. Calculate the output probabilities
$$P(O \mid I) = \prod_{t=1}^{T} b_{q_t}(O_t)$$
– 3. Calculate the transition probabilities
$$P(I \mid \lambda) = \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1} q_t}$$
Problem I, Brute Force
Continued


Sum over all state sequences of length T:
$$P(O \mid \lambda) = \sum_{I} P(O \mid I)\, P(I \mid \lambda)$$

The method is exponential in complexity, requiring approximately $2T N^T$ computations: totally intractable. But the problem can be shown to be solvable in time on the order of $N^2 T$, as shown next.
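A direct transcription of the brute-force method into Python; it enumerates all $N^T$ state sequences, so it is only usable for toy sizes, which is exactly the point:

```python
import itertools
import numpy as np

def brute_force_likelihood(A, B, pi, obs):
    """P(O | lambda) by enumerating all N^T state sequences: exponential cost."""
    N, T = A.shape[0], len(obs)
    total = 0.0
    for I in itertools.product(range(N), repeat=T):
        p = pi[I[0]] * B[I[0], obs[0]]                  # pi_{q_1} b_{q_1}(O_1)
        for t in range(1, T):
            p *= A[I[t - 1], I[t]] * B[I[t], obs[t]]    # a_{q_{t-1} q_t} b_{q_t}(O_t)
        total += p
    return total
```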
How to Solve Problem I
Define

$$\alpha_t(i) = P(O_1, \ldots, O_t,\; q_t = S_i \mid \lambda)$$

This function, called the $\alpha$ terminal, is the probability of the initial observations and of ending up in state $S_i$ at time t. There are TN of these alpha terminals, and they can be calculated recursively.

$$\beta_t(i) = P(O_{t+1}, \ldots, O_T \mid q_t = S_i,\; \lambda)$$

This function, called the $\beta$ terminal, is the probability that one observes the given terminal sequence given that one starts at time t in state $S_i$.
Forward Algorithm

Using $\alpha$ and $\beta$ terminals defined recursively, one can compute the answer to these questions in roughly $N^2 T$ steps. First in the forward direction, i.e., the forward algorithm.

Initialization
$$\alpha_1(i) = \pi_i\, b_i(O_1)$$

Recursion
$$\alpha_t(i) = \left[ \sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji} \right] b_i(O_t)$$

Termination
$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$

[Figure: computation trellis; each state k at time t collects contributions along edges $a_{jk}$ from states j at time t-1, weighted by $b_k(O_t)$]
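A minimal vectorized sketch of the forward recursion (numpy, no scaling; real implementations rescale or work in log space to avoid underflow):

```python
import numpy as np

def forward(A, B, pi, obs):
    """alpha[t, i] = P(O_1..O_t, q_t = i | lambda)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # recursion
    return alpha

# Termination: P(O | lambda) = forward(A, B, pi, obs)[-1].sum()
```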
Forward Algorithm
Explanation

Key Recursion
– A sum of products of three terms.
– To calculate the probability of an initial sequence ending in state j,
– one needs to consider the contribution from each prior state k, which consists of
• the alpha terminal,
• multiplied by the corresponding transition probability,
• multiplied by the probability of the output symbol.
Backward Algorithm

Very similar to the forward algorithm.

Initialization
$$\beta_T(i) = 1 \quad \text{(by convention)}$$

Recursion
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$$

Termination
$$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(O_1)\, \beta_1(i)$$

[Figure: computation trellis for the backward recursion, with contributions flowing from states at time t back to states at time t-1]
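The corresponding backward sketch, same conventions as the forward one above:

```python
import numpy as np

def backward(A, B, obs):
    """beta[t, i] = P(O_{t+1}..O_T | q_t = i, lambda)."""
    N, T = A.shape[0], len(obs)
    beta = np.ones((T, N))                          # beta_T(i) = 1 by convention
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # recursion
    return beta

# Termination: P(O | lambda) = (pi * B[:, obs[0]] * backward(A, B, obs)[0]).sum()
```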
Backward Algorithm
Explanation

Backward Algorithm
– A sum of products of three terms (as before).
– To calculate the probability of a terminal sequence starting in state j,
– one needs to consider the contribution from each future state k, which consists of
• the beta terminal,
• multiplied by the corresponding transition probability,
• multiplied by the probability of the output symbol.
Problem II

How do we calculate the probability of the optimal state sequence?
– Why bother?

Often much faster than calculating the probability of the full observation sequence and then choosing the maximum likelihood.
One may want to "parse" a long string to segment it.
Problem: what is the definition of optimality?
– One can choose the most likely state at each time, but
– the result may not even be a valid path. Why?
– The commonly chosen definition of optimality is
$$Q^* = \arg\max_I P(I, O)$$
over legal paths.
Optimal Legal Path
Algorithm: Viterbi Search

Should already be familiar from Dynamic Programming
– Viterbi Search

Initialization
$$\delta_1(i) = \pi_i\, b_i(O_1), \qquad \psi_1(i) = 0$$

Recursion
$$\delta_t(j) = \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\, b_j(O_t)$$
$$\psi_t(j) = \arg\max_i\; \delta_{t-1}(i)\, a_{ij}$$

Termination
$$P^* = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$$
Viterbi Search

The principle is the same as the dynamic programming principle discussed two lectures ago.
Frequently used.
There is a multitude of paths through the full model.
Example

[Figure: a sequence model built by concatenating word models for "one", "two", …, "nine"; each word model is in turn a concatenation of phone models, e.g., "one" = /w/ /ʌ/ /n/]
Frequent Use of Viterbi Search

Calculating the paths through the full model with a full search for a large-vocabulary model involves massive numbers of transitions through the network. One can prune the search at each stage by only considering transitions from states k such that
$$\delta_t(k) \ge \max_j \delta_t(j) - P$$
(reading the $\delta$'s as log probabilities, with P a fixed beam threshold). Such a search is suboptimal and is called a Viterbi Beam Search; a pruning sketch follows below.
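A sketch of the pruning step, assuming the $\delta$'s are kept as log probabilities so the beam threshold is subtractive:

```python
import numpy as np

def beam_prune(log_delta, beam):
    """Indices of states within `beam` of the best log score at this frame."""
    return np.flatnonzero(log_delta >= log_delta.max() - beam)

# At each time t, only transitions out of the surviving states are expanded,
# so the search is much faster but no longer guaranteed to find the optimum.
```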
Problem III

How do we train the model given multiple observation sequences?
– There is no known way to analytically find a formula which maximizes the probability of an observation sequence. There is an iterative update procedure (Baum-Welch), or EM algorithm, which always increases $P(O \mid \lambda)$ until a maximum is achieved.
Need Certain Additional
Quantities

Probability of transferring from state i to state j at time t:
$$\xi_t(i, j) = P(q_t = S_i,\, q_{t+1} = S_j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$$

Probability of being in state i at time t given the model and observation sequence:
$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)}$$
Auxiliary Quantities II

$$\sum_{t=1}^{T-1} \gamma_t(i)$$
is the expected number of transitions out of state i given the observation sequence and model.

$$\sum_{t=1}^{T-1} \xi_t(i, j)$$
is the expected number of transitions from state i to state j given the observation sequence and the model.
Baum-Welch Reupdate: EM
Algorithm

Start with estimates for $(A, B, \pi)$.
Given the observations, estimate the sufficient statistics of the model, which are $\gamma_t$ and $\xi_t$.
Reestimate the parameters by calculating their most likely values. In this case this amounts to replacing the parameters by their expected values.
Update Formula

$$a_{ij}^{\,n+1} = E(a_{ij}^{\,n}) = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

$$\pi_i^{\,n+1} = E(\pi_i^{\,n}) = \frac{\gamma_1(i)}{\sum_{k=1}^{N} \gamma_1(k)}$$

$$b_j^{\,n+1}(O_k) = E(b_j^{\,n}(O_k)) = \frac{\sum_{t:\, O_t = O_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$

Continue reupdating the parameters until one obtains no significant change.
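A minimal single-iteration Baum-Welch sketch combining the forward and backward recursions above with the three update formulas (unscaled, single observation sequence; a production version would rescale and pool counts over multiple sequences):

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One EM reupdate of (A, B, pi) from a single observation sequence."""
    N, T = A.shape[0], len(obs)
    # Forward and backward trellises (see the sketches above)
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()

    # gamma[t, i] = P(q_t = i | O); xi[t, i, j] = P(q_t = i, q_{t+1} = j | O)
    gamma = alpha * beta / likelihood
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood

    # Reupdate formulas: expected counts divided by expected totals
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    pi_new = gamma[0]
    return A_new, B_new, pi_new, likelihood
```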
Properties of the Update Rule

For each revision of the parameters, the likelihood of the sequence increases:
$$\sum_{i=1}^{N} \alpha_T^{\,n+1}(i) \;\ge\; \sum_{i=1}^{N} \alpha_T^{\,n}(i)$$

In other words, the likelihood of the observed data increases with every re-estimation of the parameters. Unfortunately, this reaches a local, not global, maximum (the best one can do).
Baum-Welch: EM Reupdate

Like gradient ascent, but with constant improvement.
Belongs to a class of algorithms called EM algorithms.
– Uses the auxiliary function
$$Q(\lambda, \lambda') = \sum_{I} P(I, O \mid \lambda)\, \log P(I, O \mid \lambda')$$
– Step I: Calculate its expectation.
– Step II: Maximize its expectation by choosing new sets of parameters.
– Step III: Iterate.
EM interpretation

The auxiliary function is the log probability of an observation sequence for a given set of transitions.
It is natural to believe that if we maximize the expectation of the log probability by changing the parameters, then the overall log probability (the likelihood) will increase.
Proof: Result I

Need two results.

Result I: Let $u_i \ge 0$ and $v_i \ge 0$. Then
$$\log \frac{\sum_{j=1}^{n} v_j}{\sum_{j=1}^{n} u_j} \;\ge\; \sum_{j=1}^{n} \frac{u_j}{\sum_{i=1}^{n} u_i}\, \bigl( \log v_j - \log u_j \bigr)$$

That is, the log of the ratio of two sums is greater than or equal to the average, under the probabilities defined by the summands in the denominator, of the log of the ratio of the summands.

Proof
– Direct application of Jensen's inequality, since log is concave: $\log(E(x)) \ge E(\log(x))$.
Result II

If $x_i$ is a vector of probabilities and $c_i$ is a vector of positive numbers, then
– $f(x) = \sum_i c_i \log x_i$ has a maximum when
– $x_i = c_i / \sum_i c_i$

Simple proof
– Use the method of Lagrange multipliers: maximize
$$L(x) = \sum_i c_i \log x_i + \mu \left( \sum_i x_i - 1 \right)$$
Taking the derivative and setting it equal to zero yields $c_i = -\mu\, x_i$; using the constraint yields $-\mu = \sum_i c_i$, hence the result.
Likelihood Always Increases
Using HMM Learning

One does no worse than choosing the current model: if we maximize Q, then the likelihood increases.

Let I be a state sequence, and let
$$u_I = P(I, O \mid \lambda), \qquad v_I = P(I, O \mid \lambda')$$
so that
$$\sum_I u_I = P(O \mid \lambda), \qquad \sum_I v_I = P(O \mid \lambda')$$

Let $Q(\lambda, \lambda') = \sum_I u_I \log v_I$. Then by Result I,
$$\log \frac{P(O \mid \lambda')}{P(O \mid \lambda)} \;\ge\; \frac{Q(\lambda, \lambda') - Q(\lambda, \lambda)}{P(O \mid \lambda)}$$

Now do the optimization and solve the problem:
$$\log v_I = \log \pi'_{s_1} + \sum_{t=1}^{T-1} \log a'_{s_t s_{t+1}} + \sum_{t=1}^{T} \log b'_{s_t}(O_t)$$

Sum over all state sequences and regroup terms:
$$Q(\lambda, \lambda') = P(O \mid \lambda) \left[ \sum_{i,j} \Bigl( \sum_{t=1}^{T-1} \xi_t(i, j) \Bigr) \log a'_{ij} + \sum_{j,k} \Bigl( \sum_{t:\, O_t = O_k} \gamma_t(j) \Bigr) \log b'_j(O_k) + \sum_i \gamma_1(i) \log \pi'_i \right]$$

The reupdate formulas are then derived using Result II.
i
Properties of Reupdate Rule
Structure of the model is preserved.
For parameters which sum to one.
n+1=f(P) , Therefore if a parameter starts out zero
in will stay as zero. If parameters start out as 1 and
represent probabilities they stay as sure event.

Generalizations of Hidden
Markov Models: Very Flexible

Explicitly modeling state duration: next lecture.
Continuous state density hidden Markov models (very general models can be done): next lecture.
Other variants of the EM algorithm (e.g., backprop): next lecture.
Continuous time densities: next time I teach!
Tied States

It's quite possible to force states to have the same transition probabilities. All events which mention the same state are pooled. If the events updating probabilities on two nodes are pooled and they start out equal, they will end up equal, as in the sketch below.
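A small sketch of how tying can be realized in the reupdate: pool the expected counts of the tied states before normalizing, so tied rows of A remain identical. The names here are hypothetical, assuming per-state summed $\xi$ and $\gamma$ counts from the update slides:

```python
def tied_row(xi_sum, gamma_sum, tied_group):
    """Shared updated row of A for a group of tied states.

    xi_sum[i]    : expected transition counts out of state i (vector over j)
    gamma_sum[i] : expected number of times state i is left (scalar)
    """
    num = sum(xi_sum[i] for i in tied_group)      # pooled numerators
    den = sum(gamma_sum[i] for i in tied_group)   # pooled denominator
    return num / den                              # assign to every tied state
```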
Null Transitions: Original IBM
Model

IBM Hidden Markov Model
– For clarity of presentation, models were presented where observations are associated with states.
– However, models might very well be constructed where outputs are associated with transitions.
– In this case, it's useful to have models where null transitions exist, i.e., a jump from one state to another that produces no output.
Examples of Null Transition
Models

[Figure: A. left-right model with at least one segment; B. finite-state network; C. grammar network]
Speech Model

The speech model is usually not fully recurrent.
– Use one or another variant of a left-to-right model.

Lack of full recurrence is no problem for the model: the structure is preserved under reupdate.
Types of Hidden Markov
Models

[Figure: A. fully recurrent model; B. left-to-right model; C. left-right parallel pattern-recognition model]
Summary: Intro to HMM’s


Presented Markov chains
Defined Hidden Markov Models
– showed that it is difficult to estimate parameters directly

Discussed the basic method of estimating parameters and segmenting speech
Summary II:

Showed how the Baum-Welch reupdate leads to ever-increasing likelihood
– Better than classical gradient ascent.

Different types of Markov Models
– tied states
– null transitions
Not Covered II



Continuous time Hidden Markov Models
Continuous State Hidden Markov Models
Additional Material

Not much theoretical material despite heavy use:
– Elliott et al.
– Liptser and Shiryaev …
A blizzard of applied material.