Transcript of PPT slides
T.J. Watson Research Center, Human Language Technologies
EARS Progress Update: Improved MPE, Inline Lattice Rescoring, Fast Decoding, Gaussianization & Fisher Experiments
Dan Povey, George Saon, Lidia Mangu, Brian Kingsbury & Geoffrey Zweig
12/1/2003
Part 1: Improved MPE
Previous discriminative training setup – Implicit Lattice MMI
• Used a unigram decoding graph and fast decoding to generate state-level “posteriors” (actually relative likelihoods: the delta between the best path using the state and the best path overall; written out below)
• Posteriors used directly (without forward-backward) to accumulate “denominator” statistics
• Numerator statistics accumulated as for ML training, with full forward-backward
• Fairly effective, but not “MMI/MPE standard”
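One way to write the state-level quantity described above (the notation is assumed here for illustration, not taken from the slides): with Viterbi log-likelihoods, the score used in place of a posterior for state s at time t is

$$\tilde{\gamma}_s(t) = \hat{L}_{s,t} - \hat{L}^{*},$$

where $\hat{L}_{s,t}$ is the log-likelihood of the best path constrained to pass through state s at time t and $\hat{L}^{*}$ is the log-likelihood of the best path overall, so $\tilde{\gamma}_s(t) \le 0$, with equality on the Viterbi path.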
Current discriminative training setup (for standard MMI)
Creating lattices with unigram scores on links
Forward-backward on lattices (using a fixed state sequence) to get occupation probabilities; the same lattices are reused on multiple iterations
Creating num + den stats in a consistent way
Use slower training speed (E=2, not 1) and more iterations (the update sketch below shows where E enters)
Also implemented MPE
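For reference, a sketch of where the constant E enters the standard extended Baum-Welch (EBW) Gaussian update used for MMI; this is the textbook form with the usual per-Gaussian smoothing constant $D_{jm}$, not copied from the slides:

$$\hat{\mu}_{jm} = \frac{\theta^{\mathrm{num}}_{jm}(O) - \theta^{\mathrm{den}}_{jm}(O) + D_{jm}\,\mu_{jm}}{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm}},\qquad
\hat{\sigma}^{2}_{jm} = \frac{\theta^{\mathrm{num}}_{jm}(O^{2}) - \theta^{\mathrm{den}}_{jm}(O^{2}) + D_{jm}\,(\sigma^{2}_{jm}+\mu^{2}_{jm})}{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm}} - \hat{\mu}^{2}_{jm},\qquad
D_{jm} = E\,\gamma^{\mathrm{den}}_{jm}$$

(with $D_{jm}$ floored so the updated variance stays positive). A larger E, here 2 rather than 1, gives smaller, more conservative steps per iteration, hence the need for more iterations.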
Experimental conditions
Same as for RT’03 evaluation
274 hours of Switchboard training data
Training + test data adapted using FMLLR transform [from ML system]
60-dim PLPs, VTLN, no MLLR
Basic MMI results (eval’00)
With word-internal phone context, 142K Gaussians
              ML      Iter-1  Iter-2  Iter-3  Iter-4
Old MMI, E=1  23.5%   22.7%   22.2%   -       -
New MMI, E=2  23.5%   22.5%   21.7%   20.9%   20.8%
1.4% more improvement (2.7% total) with this setup
MPE results (eval’00)
          ML      Iter-1  Iter-2  Iter-3  Iter-4  Iter-5
MMI       23.5%   22.5%   21.7%   20.9%   20.8%   -
MPE       23.5%   22.2%   21.5%*  21.3%*  -       -
MPE+MMI   23.5%   21.8%   21.3%   20.9%   20.5%   20.3%
Standard MPE is not as good as MMI with this setup
“MPE+MMI”, which is MPE with I-smoothing to the MMI update (not ML), gives 0.5% absolute over MMI
* Conditions differ, treat with caution.
MPE+MMI continued
“MPE+MMI” involves storing 4 sets of statistics rather than 3: num, den, ml and now also mmi-den. 33% more storage, no extra computation.
Do the standard MMI update using the ml and mmi-den stats, and use the resulting mean & var in place of the ML mean & var in I-smoothing.
(Note: I-smoothing is a kind of gradual backoff to a more robust estimate of mean & variance; see the formula sketch below.)
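A sketch of I-smoothing as it is usually written for MPE (the standard form, not taken from the slides): before the EBW update, $\tau$ “fake” counts of a more robust model are added to each Gaussian’s numerator statistics,

$$\gamma^{\mathrm{num}\prime}_{jm} = \gamma^{\mathrm{num}}_{jm} + \tau,\qquad
\theta^{\mathrm{num}\prime}_{jm}(O) = \theta^{\mathrm{num}}_{jm}(O) + \tau\,\mu^{\mathrm{prior}}_{jm},\qquad
\theta^{\mathrm{num}\prime}_{jm}(O^{2}) = \theta^{\mathrm{num}}_{jm}(O^{2}) + \tau\,\big(\sigma^{2,\mathrm{prior}}_{jm} + (\mu^{\mathrm{prior}}_{jm})^{2}\big),$$

where the prior mean and variance are normally the ML estimates; in “MPE+MMI” they are replaced by the MMI estimates obtained from the ml and mmi-den statistics.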
Probability scaling in MPE
MPE training leads to an excess of deletions.
Based on previous experience, this can be due to a probability scale that is too extreme.
Changing the probability scale from 1/18 to 1/10 gave a ~0.3% win (see below for where the scale enters).
1/10 used as the scale in all MPE experiments with left-context (see later)
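One common way to write where the scale enters (notation assumed here): the lattice arc posteriors used to accumulate statistics are computed from scaled log-likelihoods,

$$\gamma_q = \frac{\sum_{\pi \ni q} \exp\!\big(\kappa\,L(\pi)\big)}{\sum_{\pi} \exp\!\big(\kappa\,L(\pi)\big)},$$

where $L(\pi)$ is the combined acoustic + LM log-likelihood of lattice path $\pi$ and $\kappa$ is the probability scale; a smaller $\kappa$ flattens the posteriors, so moving from 1/18 to 1/10 sharpens them.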
Fast MMI
Work presented by Bill Byrne at Eurospeech’03 showed improved results from MMI where the correctly recognized data was excluded*
Achieve a similar effect without hard decisions, by canceling num & den stats (see the sketch below)
I.e., if a state has nonzero occupation probabilities for both numerator and denominator at time t, cancel the shared part so only one is positive.
Gives results as good as or better than the baseline, with half the iterations.
Use E=2 as before.
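A minimal sketch of the cancellation just described, assuming per-frame, per-state numerator and denominator occupation probabilities are available (the function name and interface are illustrative, not from the decoder):

def cancel_stats(gamma_num, gamma_den):
    """Cancel the shared part of the numerator and denominator occupation
    probabilities for one state at one time frame, so that at most one of
    the returned values is positive."""
    shared = min(gamma_num, gamma_den)
    return gamma_num - shared, gamma_den - shared

# Example: a state with gamma_num = 0.9 and gamma_den = 0.4 contributes
# only 0.5 to the numerator statistics and nothing to the denominator.
print(cancel_stats(0.9, 0.4))   # (0.5, 0.0)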
          ML      Iter-1  Iter-2  Iter-3  Iter-4
MMI       23.5%   22.5%   21.7%   20.9%   20.8%
Fast MMI  23.5%   21.2%   20.7%   21.2%   -
* “Lattice segmentation and Minimum Bayes Risk Discriminative Training”, Vlasios Doumpiotis et al., Eurospeech 2003
MMI+MPE with cross-word (left) phone context
Similar-size system (about 160K vs. 142K Gaussians), with cross-word context
Results shown here connect word-traces into lattices indiscriminately (ignoring constraints of context)
There is an additional win possible from using context constraints (~0.2%)
RT’00      ML      Iter-1  Iter-2  Iter-3  Iter-4
Old MMI*   22.0%   20.8%   -       -       -
Fast MMI   22.0%   20.0%   19.9%   19.5%   -
MPE        22.0%   20.5%   20.2%   20.0%   -
MPE+MMI    22.0%   20.5%   19.8%   19.4%   -
*I.e. last year, different setup
MMI and MPE with cross-word context, on RT’03
The new MMI setup (including ‘fast MMI’) is no better than old MMI
About 1.8% improvement on RT’03 from MPE+MMI; MPE alone gives 1.4% improvement.
Those numbers are 2.5% and 2.0% on RT’00
Comparison with MPE results in Cambridge’s 28-mix system (~170K Gaussians) from 2002: the most comparable number is 2.2% improvement (30.4% to 28.2%) on dev01sub using FMLLR (“constrained MLLR”) and F-SAT training (*)
RT’03      ML      Iter-1  Iter-2  Iter-3  Iter-4
Old MMI*   -       -       -       -       29.8%
Fast MMI   30.9%   29.9%   -       -       -
MPE        30.9%   -       -       29.5%   -
MPE+MMI    30.9%   29.7%   29.6%   -       29.1%
“Automatic transcription of conversational telephone speech”, T. Hain et al., submitted to IEEE Transactions on Speech & Audio Processing
Part 2: Inline Lattice Rescoring
Language model rescoring – some preliminary work
Very large LMs help, e.g. moving from a typical to a huge (unpruned) LM can help by 0.8% (*)
Very hard to build static decoding graphs for huge LMs
Good to be able to efficiently rescore lattices with a different LM
Also useful for adaptive language modeling
… adaptive language modeling gives us ~1% on the “superhuman” test set, and 0.2% on RT’03 (+)
* “Large LM”, Nikolai Duta & Richard Schwartz (BBN), presentation at the 2003 EARS meeting, IDIAP, Martigny
+ “Experiments on adaptive LM”, Lidia Mangu & Geoff Zweig (IBM), ibid.
Lattice rescoring algorithm
Taking a lattice and applying a 3- or 4-gram LM involves expanding lattice nodes
This algorithm can take a very large amount of time for some lattices
Can be solved by heavy pruning, but this is undesirable if the LMs are quite different.
Developed a lattice LM-rescoring algorithm.
Finds the best path through a lattice given a different LM (*)
*(We are working on a modified algorithm that will generate rescored lattices)
Lattice rescoring algorithm (cont’d)
Each word-instance in the lattice has k tokens (e.g. k=3)
Each token has a partial word history ending in the current word, and a traceback to the best predecessor token
[Diagram: example lattice over the words WHY, WHEN, THE, CAP, CAT. Each word-instance carries tokens with partial histories and scores, e.g. “WHY, -101” and “WHEN, -101” on the first words; “WHEN THE, -205” and “WHY THE, -210” on THE; “CAP, -310”; and “THE CAT, -310” and “WHY THE CAT, -345” on CAT.]
Lattice rescoring algorithm (cont’d)
For each word-instance in the lattice, from left to right…
…for each token in each predecessor word-instance...
…...add the current word to that token’s word-history and work out the LM & acoustic costs;
…...delete word left-context until the word-history exists in the LM as an LM context;
…...form a new token pointing back to the predecessor token;
……and add the token to the current word-instance’s list of tokens.
Always ensure that no two tokens with the same word-history exist (delete the less likely one)
… and always keep only the k most likely tokens.
Finally, trace back from the most likely token at the end of the utterance.
All done within the decoder (see the sketch below)
Highly efficient
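A minimal sketch of the token-passing procedure above. The data structures and the LM interface (lm.score, lm.has_context) are assumptions made for illustration, not the decoder’s real API; costs here are negative log-probabilities, so lower is better.

from collections import namedtuple

Token = namedtuple("Token", ["history", "cost", "backpointer"])

def rescore_lattice(word_instances, lm, k=3):
    """word_instances: word instances in topological (left-to-right) order;
    each has .word, .acoustic_cost, .predecessors (earlier word instances)
    and receives a .tokens list.  lm.score(history, word) returns the cost of
    `word` given `history`; lm.has_context(history) says whether `history`
    exists in the LM as a context.  The last instance is assumed to end the
    utterance."""
    for wi in word_instances:
        candidates = {}
        preds = wi.predecessors if wi.predecessors else [None]
        for pred in preds:
            pred_tokens = pred.tokens if pred is not None else [Token((), 0.0, None)]
            for tok in pred_tokens:
                # add the current word to the token's history; LM + acoustic cost
                cost = tok.cost + wi.acoustic_cost + lm.score(tok.history, wi.word)
                hist = tok.history + (wi.word,)
                # delete left context until the history exists as an LM context
                while len(hist) > 1 and not lm.has_context(hist):
                    hist = hist[1:]
                # never keep two tokens with the same word history
                if hist not in candidates or cost < candidates[hist].cost:
                    candidates[hist] = Token(hist, cost, tok)
        # keep only the k most likely tokens on this word instance
        wi.tokens = sorted(candidates.values(), key=lambda t: t.cost)[:k]
    # trace back from the most likely token at the end of the utterance
    best = min(word_instances[-1].tokens, key=lambda t: t.cost)
    words, tok = [], best
    while tok is not None and tok.history:
        words.append(tok.history[-1])
        tok = tok.backpointer
    return list(reversed(words)), best.cost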
Lattice rescoring algorithm – experiments
To verify that it works…
Took the 4-gram LM used for the RT’03 evaluation and pruned it 13-fold
Built a decoding graph, and rescored with original LM
Testing on RT’03, MPE-trained system with Gaussianization
                                                     WER (RT’03)
Big LM (132 MB)                                      28.5%
Tiny LM (10 MB)                                      31.7%
Tiny LM + rescoring, k=3                             28.5%
Tiny LM + rescoring, k=2                             28.6%
Tiny LM + rescoring, k=3, backwards traces only (*)  30.1%
* See next slide
Note: all experiments actually include an n-1 word history in each token, even when not necessary. This should decrease the accuracy of the algorithm for a given k.
Lattice rescoring algorithm – forward vs backward
Lattice generation algorithm:
Both alpha and beta likelihoods are available to the algorithm
Whenever a word-end state likelihood is within delta of the best path…
Trace back until a word-beginning state whose best predecessor is a word-end is reached…
...and create a “word trace.”
Join all these word traces to form a lattice (using graph connectivity constraints)
Equivalent to Julian Odell’s algorithm (with n=infinity)
BUT we also add “forwards” traces, based on tracing forward from word beginning to word end. Time-symmetric with backtraces.
There are fewer forwards traces (due to graph topology)
Adding forwards traces is important (0.6% hit from removing them)
I don’t believe there is much effect on lattice oracle WER.
… it is the alignments of word-sequences that are affected.
Part 3: Progress in Fast Decoding
RT’03 Sub-realtime Architecture
Improvements in Fast Decoding
Switched from rank pruning to running beam pruning
Hypotheses are pruned early, based on a running max estimate, during successor expansion, and then pruned again once the final max for the frame is known (see the sketch after the diagram).
[Diagram: states vs. time (frames t and t+1); during expansion of each frame, “max update” and “prune based on current max-beam” steps alternate, with a final prune at the end of the frame.]
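A minimal sketch of the running-beam pruning described above, under assumed interfaces (successors and trans_score are illustrative callables, not the decoder’s actual data structures); scores are log-likelihoods, so higher is better:

import math

def expand_frame(active, successors, trans_score, beam):
    """Expand hypotheses from frame t to frame t+1 with running-beam pruning.
    active: dict mapping state -> log score at frame t."""
    new, running_max = {}, -math.inf
    for state, score in active.items():
        for nxt in successors(state):
            cand = score + trans_score(state, nxt)
            # early pruning against the running max estimate
            if cand < running_max - beam:
                continue
            if cand > new.get(nxt, -math.inf):
                new[nxt] = cand
                running_max = max(running_max, cand)   # max update
    # prune again once the final max for the frame is known
    return {s: c for s, c in new.items() if c >= running_max - beam}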
Runtime vs. WER: Beam and rank pruning
Resulted in a 10% decoding speed-up without loss in accuracy
Reducing the memory requirements
Run-time memory reduction by storing minimum traceback information for Viterbi word-sequence recovery
Previously we stored information for full state-level alignment
Now we store only information for word-level alignment
– Alpha entry has accumulated cost and a pointer to the originating word token
– Two alpha vectors for “flip-flop”
– Permanent word-level tokens created only at active word-ends
No penalty in speed, and dynamic memory reduced by two orders of magnitude (see the sketch below)
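A hedged sketch of the data structures this implies; the names and layout are illustrative, not the decoder’s actual types:

from dataclasses import dataclass
from typing import Optional

@dataclass
class WordToken:
    """Permanent record, created only at active word-ends."""
    word: str
    end_time: int
    prev: Optional["WordToken"]      # previous word token in the hypothesis

@dataclass
class AlphaEntry:
    """Per-state entry in the current frame's alpha vector."""
    cost: float                      # accumulated cost
    word_tok: Optional[WordToken]    # pointer to the originating word token

# Two alpha vectors (state -> AlphaEntry), reused in "flip-flop" fashion
# between frames t and t+1; word-level tokens are the only permanent records.
alpha = [dict(), dict()]
cur, nxt = 0, 1
# ... after processing a frame, swap the roles of the two vectors:
alpha[cur].clear()
cur, nxt = nxt, cur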
Part 4: Feature-Space Gaussianization
Feature space Gaussianization [Saon et al. 04]
Idea: transform each dimension non-linearly such that it becomes Gaussian distributed
Motivations:
Perform speaker adaptation with non-linear transforms
Natural form of non-linear speaker adaptive training (SAT)
The effort of modeling the output distribution with GMMs is reduced
The transform is given by the inverse Gaussian CDF applied to the empirical CDF: $y_i = \Phi^{-1}\!\left(\mathrm{rank}(x_i)/N\right)$
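A minimal sketch of this per-dimension transform; the (rank + 0.5)/N offset is an assumption added here so the inverse CDF stays finite at the extremes, everything else follows the formula above:

import numpy as np
from scipy.stats import norm

def gaussianize(feats):
    """Rank-based Gaussianization of each feature dimension.
    feats: (T, D) array of features for one speaker/cluster."""
    T, D = feats.shape
    out = np.empty_like(feats, dtype=float)
    for d in range(D):
        # empirical CDF via ranks; +0.5 keeps values strictly inside (0, 1)
        ranks = feats[:, d].argsort().argsort()
        ecdf = (ranks + 0.5) / T
        # inverse Gaussian CDF (mean 0, variance 1)
        out[:, d] = norm.ppf(ecdf)
    return out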
Feature Space Gaussianization, Pictorially
[Figure: the inverse Gaussian CDF (mean 0, variance 1) mapping old data values (percentiles 16, 50, 84) to new data values (absolute -1, 0, +1); ±1 std dev covers 68% of the data.]
An actual transform
Feature space Gaussianization: WER
Results on RT’03 at the SAT level (no MLLR):
                      ML      MPE
Baseline (FMLLR-SAT)  30.9%   29.1%
Gaussianized          30.5%   28.5%
Part 5: Experiments with Fisher Data
Acoustic Training Data
Training set size based on aligned frames only.

Corpus             # frames   # hours
Fisher 1-4         130M       361
SWB-1              98.6M      274
IBM Voicemail      37.9M      105
BBN CTRANS         20.5M      57
SWB Cellular       6.4M       18
Callhome English   4.9M       14

Total is 829 hours of speech; 468 hours excluding Fisher.
Training vocabulary includes 61K tokens.
First experiments with Fisher 1-4. Iteration likely to improve results.
Effect of new Fisher data on WER
                 RT-03 Switchboard   RT-03 Fisher   RT-03 overall   IBM Superhuman
2002 System      34.1                26.0           30.2            36.8
All data         32.7                25.1           29.1            36.7
All less Fisher  33.2                25.7           29.6            36.2
All less VM      32.2                25.4           28.9            36.8
Systems are PLP, VTLN, SAT, with 60-dim LDA+MLLT features
One-shot decoding with the IBM 2003 RT-03 LM (interpolated 4gm) for RT-03; generic interpolated 3gm for Superhuman.
Fisher data in AM only – not LM
Summary
Discriminative training
New MPE 0.7% better than old MMI on RT03
Used the MMI estimate rather than the ML estimate for I-smoothing with MPE (consistently gives about 0.4% improvement over standard MPE)
LM rescoring
10x reduction in static graph size – 132MB → 10MB
Useful for rescoring with adaptive LMs
Fast Decoding
10% speedup - incremental application of absolute pruning threshold
Gaussianization
0.6% improvement on top of MPE
Useful on a variety of tasks (e.g. C&C in cars)
Fisher Data
1.3% improvement over last year without it (AM only)
Not useful in a broader context