Statistical language modeling – overview of my work
Character-based language models
Tomáš Mikolov, 2010
1
Motivation – why this?
No OOV (out-of-vocabulary) problem
More information for the model (a model can make words out of characters, but not vice versa)
A theoretically better solution (no ad-hoc definition of what a “word” is)
2
Why not?
Worse performance: larger models (the ARPA LM format is not suitable for long-context LMs), lower word accuracy
Smoothing seems to be the weak point: word histories should not be clustered just by the length of the context!
However, some people claim smoothing is not important...
3
Was anyone working on this before?
Yes, many researchers
Mahoney, Schmidhuber – text compression, information-theoretic approach
Elman – models of language based on connectionist models, linguistic approach
Carpenter – language modeling for classification etc.; just started reading that...
4
Comparison of standard LM and RNN LM
Conclusion: a simple RNN can learn long-context information (6-9 grams and maybe more)
MODEL      ENTROPY
RNN 160    -41 227
RNN 320    -40 582
RNN 640    -40 484
RNN 1280   -40 927
LM 4gram   -41 822
LM 6gram   -39 804
LM 9gram   -40 278
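The slides do not spell out how the ENTROPY figures were computed; one common setup, sketched below, is to score a fixed held-out text by its total log-probability under each model and compare the results (the higher, i.e. less negative, score fits the data better). The add-one-smoothed character n-gram model, the toy texts and the orders used here are illustrative assumptions, not the models from the table.

    import math
    from collections import defaultdict

    def train_char_ngram(text, order=6):
        """Count n-gram and context occurrences for a character-level model."""
        counts, context_counts = defaultdict(int), defaultdict(int)
        padded = "\0" * (order - 1) + text
        for i in range(order - 1, len(padded)):
            context = padded[i - order + 1:i]
            counts[(context, padded[i])] += 1
            context_counts[context] += 1
        return counts, context_counts

    def total_log2_prob(counts, context_counts, text, order=6, alphabet_size=256):
        """Total log2 probability of `text`, with add-one smoothing
        (a deliberately crude stand-in for the smoothing used in real LMs)."""
        padded = "\0" * (order - 1) + text
        score = 0.0
        for i in range(order - 1, len(padded)):
            context = padded[i - order + 1:i]
            c = counts[(context, padded[i])]
            n = context_counts[context]
            score += math.log2((c + 1) / (n + alphabet_size))
        return score

    # Toy comparison: the model giving the higher (less negative) score fits better.
    train_text = "yeah you know i kind of think so okay and we " * 100
    test_text = "you know i think so okay "
    for order in (3, 6, 9):
        counts, ctx = train_char_ngram(train_text, order=order)
        print(order, round(total_log2_prob(counts, ctx, test_text, order=order), 1))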
5
Comparison of char-based and word-based LMs
MODEL                         WORD ERROR RATE
Baseline – word bigram        32.6%
Word KN 4gram                 30.4%
Char 6gram                    36.5%
Char 9gram                    32.3%
Word KN 4gram + Char 9gram    30.3%
Task: RT07, LM trained just on Switchboard
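The slides do not say how the word 4-gram and the character 9-gram were combined in the last row; a standard technique is linear interpolation of the two models' word-level probabilities, with the character model scoring a word as the product of its per-character probabilities. A minimal sketch under that assumption follows; word_lm, char_lm and the interpolation weight lam are hypothetical stand-ins, and lam would normally be tuned on held-out data.

    def char_lm_word_prob(char_lm, word, char_history):
        """Score a word with a character LM: product of per-character probabilities
        (a trailing space acts as the word delimiter)."""
        prob, history = 1.0, char_history
        for ch in word + " ":
            prob *= char_lm(ch, history)          # assumed interface: P(ch | history)
            history += ch
        return prob

    def interpolated_word_prob(word_lm, char_lm, word, word_history, char_history, lam=0.5):
        """Linear interpolation of a word LM and a character LM at the word level.
        `lam` is a placeholder weight."""
        p_word = word_lm(word, word_history)      # assumed interface: P(word | history)
        p_char = char_lm_word_prob(char_lm, word, char_history)
        return lam * p_word + (1.0 - lam) * p_char

    # Toy usage with uniform stand-in models (illustration only).
    toy_word_lm = lambda w, h: 1e-4
    toy_char_lm = lambda c, h: 1.0 / 27
    print(interpolated_word_prob(toy_word_lm, toy_char_lm, "yeah", [], ""))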
6
What can be done
Combining strengths: automatically derived “word” units
What is a word? This varies a lot across languages
Word boundaries in English can be found automatically – high entropy at the first few characters of a word, very low entropy in the rest of the word (see the sketch below)
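As a hedged illustration of the boundary idea above, the sketch below trains character n-gram counts on an unsegmented character stream and proposes a unit boundary wherever the conditional entropy of the next character exceeds a threshold. The toy vocabulary, the n-gram order and the threshold are assumptions for illustration; the slides do not give the settings used in the original experiments.

    import math
    import random
    from collections import defaultdict

    def next_char_entropy(counts, context):
        """Entropy (bits) of the next-character distribution after `context`."""
        dist = counts.get(context)
        if not dist:
            return float("inf")  # unseen context: treat as maximally surprising
        total = sum(dist.values())
        return -sum((c / total) * math.log2(c / total) for c in dist.values())

    def segment_by_entropy(stream, order=4, threshold=2.0):
        """Place a unit boundary wherever the entropy of the next character,
        given the last `order` characters, exceeds `threshold` (both values
        are illustrative and would normally be tuned)."""
        counts = defaultdict(lambda: defaultdict(int))
        for i in range(1, len(stream)):
            counts[stream[max(0, i - order):i]][stream[i]] += 1
        units, current = [], stream[0]
        for i in range(1, len(stream)):
            if next_char_entropy(counts, stream[max(0, i - order):i]) > threshold:
                units.append(current)
                current = ""
            current += stream[i]
        units.append(current)
        return units

    # Toy demo: a character stream built from randomly ordered words, no spaces.
    random.seed(0)
    vocab = ["yeah", "you", "know", "i", "kind", "of", "think", "so", "okay", "and"]
    stream = "".join(random.choice(vocab) for _ in range(3000))
    print(segment_by_entropy(stream)[:20])

On such a stream, the next character is nearly deterministic inside a word and unpredictable at word onsets, so most proposed boundaries should land at word starts – the high-entropy / low-entropy pattern described above.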
7
Example from Elman
8
Example of automatically derived lexical units
Word boundaries are simply chosen at places with high entropy (trivial approach)
LETTERS   WORDS     SUBWORDS   MULTIWORDS
A         YEAH      RE         YOUKNOW
S         HMMM      TH         KINDOF
I         WE        CO         YEAHI
O         OKAY      SE         ANDI
E         AND       DON        ONTHE
T         BECAUSE   LI         ITIS
9
Conclusion I.
Units similar to words can be automatically detected in sequential data by using entropy
We can attempt to build an LVCSR system without any implicit language model – it can be learned from the data
However, these learned units are not words
This approach is quite biologically plausible – the “weak” model here is the one that works on the phoneme/character level, a higher model works on words/subwords, and even higher ones may work on phrases, etc.
10
Conclusion II.
LVCSR: acoustic models + language models
Language models are very good predictors for phonemes in the middle/end of words
Shouldn't acoustic models focus more on the first few phonemes in words? (Or maybe simply on high-entropy phonemes, as given by the LM – see the sketch below.)
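As a hedged sketch of this suggestion, the code below computes, for each position in a phoneme sequence, the entropy of a language model's next-phoneme distribution; positions with high entropy are the ones where, by this argument, the acoustic model has to do most of the work. The bigram phoneme model and the made-up lexicon are assumptions for illustration only.

    import math
    from collections import defaultdict

    def train_phoneme_bigram(sequences):
        """Count next-phoneme frequencies given the previous phoneme ('<s>' at word start)."""
        counts = defaultdict(lambda: defaultdict(int))
        for seq in sequences:
            prev = "<s>"
            for ph in seq:
                counts[prev][ph] += 1
                prev = ph
        return counts

    def positional_entropy(counts, seq):
        """Entropy (bits) of the predicted next-phoneme distribution at each position."""
        result, prev = [], "<s>"
        for ph in seq:
            dist = counts[prev]
            total = sum(dist.values())
            h = -sum((c / total) * math.log2(c / total) for c in dist.values())
            result.append((ph, round(h, 2)))
            prev = ph
        return result

    # Made-up phoneme lexicon, purely for illustration: word-initial positions
    # come out with higher entropy than word-internal ones.
    words = [["y", "ae", "h"], ["ow", "k", "ey"], ["s", "ow"], ["dh", "ae", "t"]]
    model = train_phoneme_bigram(words * 50)
    for w in words:
        print(positional_entropy(model, w))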
11