Transcript PPT slides

Using Web Text Sources
for Conversational Speech Language Modeling
Ivan Bulyko, Mari Ostendorf &
Andreas Stolcke
University of Washington
SRI International
Problem: LMs for conversational speech

• Language models need a lot of training data that matches the task, both in terms of style and topic
• Conversational speech transcripts are expensive to collect
• Easily obtained news & web text isn't conversational

Status Update -- Summary/Outline

• Where we were in May (reminder+)
  – Basic approach: web data + text normalization + class-dependent mixtures
  – Perplexity & WER reductions on both CTS and meeting data
• What we've done since then
  – Further exploring class-dependent mixtures
  – First steps at application to Mandarin
  – Adding disfluencies to web data

Review of Approach

• Collect text data from the web (filtering for topic and style)
• Clean up and transform to spoken form (text normalization + new work on disfluencies)
• Use class-dependent interpolation for handling source mismatch

Obtaining Data

• Use Google to search for…
  – Exact matches of conversational n-grams
    "I never thought I would"
    "I would think so"
    "but I don't know"
  – Topic-related data (preferably conversational)
    "wireless mikes like"
    "kilohertz sampling rate"
    "I know that recognizer"
• Optionally filter based on likelihood of data according to a Switchboard LM (after clean-up); see the sketch below

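As a rough illustration of the two steps on this slide, here is a minimal Python sketch. It assumes Switchboard transcripts are available as plain-text sentences and that a scoring call to an existing in-domain LM is available; the function names, n-gram order, and threshold are illustrative choices, not taken from the slides.

```python
from collections import Counter

def make_queries(swbd_sentences, n=4, top_k=1000):
    """Build exact-phrase web queries from the most frequent
    conversational n-grams in Switchboard transcripts."""
    counts = Counter()
    for sent in swbd_sentences:
        words = sent.split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    # Quote each phrase so the search engine matches it exactly.
    return ['"%s"' % phrase for phrase, _ in counts.most_common(top_k)]

def filter_by_indomain_likelihood(sentences, logprob_per_word, threshold=-2.5):
    """Optional second pass after clean-up: keep web sentences whose average
    log-probability under a Switchboard LM exceeds a threshold.
    `logprob_per_word` stands in for a call to an existing LM scorer."""
    return [s for s in sentences if logprob_per_word(s) > threshold]
```
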
Examples

• Conversational
  We were friends but we don't actually have a relationship.
• Topic-related (for ICSI meetings)
  For our experiments we used the Bellman-Ford algorithm...
• Very conversational
  Well I actually I I really haven't seen her for years
  (… from transcripts of Friends)

Cleaning up data

• Strip HTML tags and headers/footers
• Ignore documents containing 8-bit characters or where OOV rate is >50%
• Automatic sentence boundary detection
• Text normalization (written → spoken)
  123 St. Mary's St. → one twenty three Saint Mary's Street

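As a rough illustration of the document-level filters above, here is a minimal Python sketch. It assumes documents arrive as raw HTML strings and that an in-vocabulary word list is available; sentence boundary detection and full text normalization are omitted, and the regular expressions are simplifications.

```python
import re

def clean_and_filter(html, vocab):
    """Strip markup, then apply the document-level filters from the slide:
    reject documents with 8-bit characters or an OOV rate above 50%."""
    # Crude tag stripping; a real pipeline would also drop headers/footers.
    text = re.sub(r"<[^>]+>", " ", html)
    # Reject documents containing 8-bit (non-ASCII) characters.
    if any(ord(ch) > 127 for ch in text):
        return None
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return None
    oov_rate = sum(w not in vocab for w in words) / len(words)
    if oov_rate > 0.5:
        return None
    return " ".join(words)
```
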
Combining Sources

• Use class-dependent mixture weights:

  p(w_i \mid w_{i-1} \dots w_{i-N+1}) = \sum_{s \in S} \lambda_s(c(w_{i-1})) \, p_s(w_i \mid w_{i-1} \dots w_{i-N+1})

  where c(w_{i-1}) = part-of-speech classes (35) + 100 most frequent words from swbd

• On held-out data from target task:
  – Estimate mixture weights
  – Prune LM (remove n-grams with probabilities below a threshold)

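A minimal Python sketch of the interpolation above. It assumes the per-source n-gram models and the class-dependent weights (estimated on held-out data) already exist; all names are illustrative and not taken from the slides.

```python
def class_mixture_prob(word, history, source_models, weights, word_class):
    """Class-dependent linear interpolation: the mixture weight of each source
    LM depends on the class of the previous word (its POS tag, or the word
    itself if it is one of the top-100 Switchboard words).

    source_models: maps source name -> probability function p_s(word, history)
    weights:       maps (class, source) -> lambda, summing to 1 within a class
    word_class:    maps a word to its class label
    """
    c = word_class(history[-1]) if history else "<s>"
    return sum(weights[(c, s)] * p_s(word, history)
               for s, p_s in source_models.items())
```

If every class shares the same weights, this reduces to ordinary class-independent linear interpolation.
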
Jan-May Experiments

• Task domains & test data
  – CTS (HUB5) eval2001 (swbd1+swbd2+cell)
  – Meeting recorder test set
• LM training data sources
  – LDC conversational speech sources (3M words)
  – LDC broadcast news text (150M words)
  – General meeting transcripts (200K words)
  – Web text: general conversational (191M words), meeting topics (28M), Fisher conv (102M)
• Both tasks use SRI HUB5 recognizer in rescoring mode, intermediate stage

Class-based Mixture Weights on CTS

[Figure: mixture weights by source (ch_en, swbd-p2, swbd-cell, swbd, BN, Google) across n-gram orders, shown without classes and for the Noun and Backchannel classes]

• Weights for web data are higher for content words, lower for conversational speech phenomena
• Higher order n-grams have higher weight on web data

Main Results: Meetings

LM Data sources                        Std. Mix WER   Class Mix WER
Baseline (CTS + BN)                    38.2%
+ 0.2M Meetings                        37.2%          36.9%
+ 28M Web (topic)                      36.9%          36.7%
+ 0.2M Meetings + 28M Web (topic)      36.2%          35.9%

• Lots of web data is better than a little target data
• Class-dependent mixture increases benefit in both cases
• Pruning expts show that the benefit of class-dependent weights is not simply due to increased # of params

Old CTS Results

LM Data sources                        Std. Mix WER   Class Mix WER
Baseline CTS                           38.9%          38.9%
+ 150M BN                              37.9%          37.8%
+ 66M Web (Random)                     38.6%          38.3%
+ 61M Web (conversational)             37.7%          37.6%
+ 191M Web (conversational)            37.6%          37.4%
+ 150M BN + 61M Web                    37.7%          37.3%
+ 150M BN + 191M Web                   37.5%          37.2%
+ 150M BN + 61M Web (PP filtered)      37.7%          37.3%

CTS Experiments

• Initial expts (Jan report, on Eval01, old AM)
  – Web data helps (38.9 -> 37.5)
  – Class mixture helps (37.5 -> 37.2)
• More development (May report, new AM)
  – Eval01: 30.4 -> 29.9 (all sources help a little)
  – Eval03: 33.8 -> 33.0 (no gain from Fisher web data)
  – Class mix gives small gain on Eval01 but not Eval03
• Note: these results do not use interpolation with class (or other) n-grams.

Recent Work: Learning from Eval03…

• Text normalization fixes from IBM help them, but not us (in last data release from UW)
• WER gains from class-dependent mixtures have disappeared, maybe because…
  – it's mainly important when there is little in-domain data (e.g. meetings)
  – recent expts are with an improved acoustic model (though not the latest & greatest)
  …but not because…
  – limited training for class-dependent mixture weights
• But web data is useful for the almost-parsing LM (see Stolcke talk)

Why do we think weight training is OK?

• Increasing to 200 top words for classes doesn't help; increasing much further hurts
• No improvement from constraining how much class weights can deviate from class-independent weights, based on
  – pre-defined priors, or
  – number of observations in held-out data
• No gain from order-independent vs. order-dependent mixture weights

Mandarin LM – Preliminary Results

• Web text normalization (a rough sketch follows below)
  – Use punctuation for sentence segmentation
  – Word segmentation with ICSI word tokenization tools
  – Convert digits into spoken form (more to come…)

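As a very rough illustration of the first and third normalization steps (the word segmentation step relies on existing ICSI tools and is not sketched), here is a minimal Python sketch; the per-digit readings are a simplification of real Mandarin number normalization, and all names are illustrative.

```python
import re

# Chinese sentence-final punctuation used for sentence segmentation.
SENT_END = "。！？；"

# Per-digit readings -- a simplification; proper number normalization
# (units such as 十/百/千/万) is part of the "more to come".
DIGIT_READINGS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
                  "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def segment_sentences(text):
    """Split raw web text on sentence-final punctuation marks."""
    return [s.strip() for s in re.split("[" + SENT_END + "]", text) if s.strip()]

def digits_to_spoken(sentence):
    """Replace each ASCII digit with its spoken (character) form."""
    return "".join(DIGIT_READINGS.get(ch, ch) for ch in sentence)
```
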
• Classes = top 100 words + 30 categories for other words, either:
  – POS from LDC lexicon, OR
  – Automatically learned w/ SRI LM tools

LM training sources              CER
CTS(0.5M)+BN(100M)               61.9%
CTS+BN+Web(152M)                 61.4%
CTS+BN+filtered Web(84M)         61.4%   (same performance as unfiltered)

Inserting Disfluencies

• Use SWBD class n-grams (POS + top 100 words) as a generative model to insert
  – um, uh and fragments
  – Repetitions of I, and, the
  – Sentence-initial and, but, so, well
• Randomly generate according to a linear combination of standard and reverse N-grams (see the sketch below):

  P(\text{DF before } w_i) = \lambda \, P(\text{DF} \mid w_{i-1}, w_{i-2}, w_{i-3}) + (1 - \lambda) \, P(\text{DF} \mid w_i, w_{i+1}, w_{i+2})

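A minimal Python sketch of the sampling step, assuming the forward and reverse class n-gram probabilities P(DF | ...) are available as callables; the function names, interpolation weight, and inserted token are illustrative and not taken from the slides.

```python
import random

def insert_disfluencies(words, p_df_fwd, p_df_rev, lam=0.5, token="uh"):
    """Before each word, draw a Bernoulli with probability
    lam * P(DF | previous words) + (1 - lam) * P(DF | following words),
    and insert a disfluency token when it fires. `p_df_fwd` and `p_df_rev`
    stand in for the standard and reverse SWBD class n-gram models."""
    out = []
    for i, w in enumerate(words):
        p = lam * p_df_fwd(words[max(0, i - 3):i]) + \
            (1.0 - lam) * p_df_rev(words[i:i + 3])
        if random.random() < p:
            out.append(token)
        out.append(w)
    return out
```
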
Examples

• Well I I don't know how uh she puts up with this speech impairment of mine
• And I I think that's really important instead of always doing um the hard work
• Well monitoring for acid rain where uh the the primary components are sulphates and nitrates was conducted in twenty nine parks

Inserting Disfluencies -- Results

[Figure: bar charts of mixture weights by source (CTS, BN, Google): class-independent weights for 1- to 4-grams, and Noun and UH class weights before and after disfluency insertion]

• Weights of web data increase with added disfluencies
• Small PP reduction, but no WER reduction… yet.

Summary Observations

• Findings that generalize across tasks (so far):
  – Web data is useful, but is better leveraged with Google "filtering" (+ text normalization)
  – Additional perplexity-based filtering is not useful
  – No gain, but no loss, with automatic classes IF top 100 words are included
• Results that vary with task:
  – Class-dependent mixture weights are mainly useful when there is less in-domain data
• The jury is still out on the usefulness of disfluency insertion.

Other Observations (in response to Rich)

• Pruning LMs doesn't hurt that much (but maybe we didn't get big enough)
• High-order n-gram hit rate is not as good a predictor as perplexity (bigram hit rate is not bad), based on correlation with WER.

Questions

• Will web data still be useful for English CTS once we have the new data? (Note: web data will still be 10x in-domain data or more.)
• Should we collect (and give out) more English CTS-oriented web data in the next couple of months?
• If so, should we switch focus to more topic-driven collections?
• Research challenge: how to model style differences in a principled way?