LING 696B: Computational Models of Phonological Learning


LING 439/539: Statistical Methods in Speech and Language Processing
Ying Lin
Department of Linguistics
University of Arizona

Welcome!

- Get the syllabus
- Fill out and return the information sheet
- Email: [email protected]
- Office: Douglass 224
- Office hours: MW 2:00-3:00 by appointment (also teaching another undergrad class)
- Course webpage: see syllabus
- Listserv coming soon

438/538 and 439/539

LING 438/538 (Computational Linguistics):
- Symbolic representations (mostly syntax), e.g. FSA, CFG
- Focus on logic
- Simple probabilistic models, e.g. N-grams

438/538 and 439/539

This class complements 438/538:
- Numerical representations (speech signals): need digital signal processing
- Focus on statistics/learning
- More sophisticated probabilistic models, e.g. HMM, PCFG

Main reference texts (!)

- Huang, Acero and Hon (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice-Hall.
- Manning and Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press.
- Rabiner and Juang (1993). Fundamentals of Speech Recognition. Prentice-Hall.
- Duda, Hart and Stork (2001). Pattern Classification (2nd ed.). John Wiley & Sons.
- Rabiner and Schafer (1978). Digital Processing of Speech Signals. Prentice-Hall.
- Hastie, Tibshirani and Friedman (2001). The Elements of Statistical Learning. Springer.

Guideline for course reading

- There is no single book that covers all of our material
  - Most books are written for an EE or CS audience only
  - A few chapters are selected from each book (see the reading list); lecture notes will summarize the reading
  - Expect a rough ride the first time through -- feedback is greatly appreciated!

Three skills for this class

1. Linguistics: understanding the source of particular patterns
2. Math/Statistics: the principles underlying the models
3. Programming: implementation

- This class emphasizes 2, because:
  - The models are based on simple structures
  - Programming skills require much practice

What is a "statistical approach"?

- Narrow sense (the focus of this course): uses statistical principles, i.e. is based on the probability calculus or other theories of inductive inference
  - Compare logic: deductive inference
- Broad sense: any work that uses a quantitative measure of success
  - Relevant to both language engineering and linguistic science

Language engineering: speech recognition

- Tasks: increasing level of difficulty
[Figure: word error rate for recognition tasks of increasing difficulty]

A brief history of speech recognition

- 1950s: U.S. government started funding research on automatic recognition of speech
- 1960s-70s: isolated words, digit strings
  - Debate: rules vs. statistics
  - Dynamic time warping
- 1980-now: continuous speech, speech understanding, spoken dialog
  - Hidden Markov models dominate

Why didn't the rules work?

- Completely bottom-up approach:
  [Diagram: speech signal → phonetic rules → phone string (h A U A j U) → phonological rules → "How are you?"]
- Rules are hand-coded by experts
- Problem: variability in speech
  - Sophisticated, symbolic rules are not flexible enough to handle continuous speech

The rise of statistical methods in speech

- Initial solution: hire many linguists to continually improve the rule system
  - This turned out to be costly and slow, failing to meet the high expectations
- Advantages of statistical models:
  - Allow training on different data: flexible, scalable
  - Computing power is much cheaper than experts
  - Drives the move to less and less constrained tasks
- Bitterness: "Every time I fire a linguist, the performance of the recognizer goes up" -- F. Jelinek (IBM)

The rise of statistics in NLP

- Very similar scenarios have also played out in NLP:
  - E.g. tagging, parsing, machine translation
  - "Old" NLP: deductive systems, hand-coded
  - "New" NLP: broad-coverage, corpus-based, emphasizing training and evaluation
- Speech is now merging with NLP
  - Many tools originated in speech, then were copied over to NLP
  - New tasks keep emerging: the web as an (unstructured) data source

Basic architecture of today's ASR system

[Diagram: audio speech → feature extraction → features X → scoring. The acoustic model supplies likelihoods p(X|M1), p(X|M2); the language model supplies priors p(M1), p(M2); scoring ranks the hypotheses and outputs the ANSWER. Model parameters are trained offline. Example hypotheses: M1 = "I recognize speech", M2 = "I wreck a nice beach". A minimal scoring sketch follows below.]

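The scoring step combines the two knowledge sources by ranking each hypothesis M by p(X|M)·p(M), usually in log space. Here is a minimal sketch of that idea in Python; the log-probability values are made-up numbers for illustration, not output of a real recognizer.

```python
# Hypothetical log-probabilities for two competing hypotheses (illustrative numbers only).
hypotheses = {
    "I recognize speech":   {"log_acoustic": -120.0, "log_lm": -9.2},
    "I wreck a nice beach": {"log_acoustic": -118.5, "log_lm": -14.7},
}

def score(h):
    # log p(X|M) + log p(M): the quantity maximized in the fundamental equation
    return h["log_acoustic"] + h["log_lm"]

best = max(hypotheses, key=lambda m: score(hypotheses[m]))
for m, h in hypotheses.items():
    print(f"{m}: {score(h):.1f}")
print("ANSWER:", best)
```
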
Component 1: signal processing / feature extraction

- First 1/3 of the course (also useful for understanding synthesis)

Examples of some common features

[Figure: examples of common acoustic features; a toy feature-extraction sketch follows below]

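Since the slide itself is only a figure, here is a minimal, hedged sketch of frame-based feature extraction (short-time log-energy and a magnitude spectrum per frame), just to make the idea concrete. The synthetic signal, frame length, and frame shift are arbitrary choices for illustration, not values from the course.

```python
import numpy as np

# Synthetic "speech": a 440 Hz tone plus noise, 16 kHz sampling (illustrative only).
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(sr)

frame_len, frame_shift = 400, 160   # 25 ms windows every 10 ms at 16 kHz
frames = []
for start in range(0, len(signal) - frame_len, frame_shift):
    frame = signal[start:start + frame_len] * np.hamming(frame_len)
    log_energy = np.log(np.sum(frame ** 2) + 1e-10)   # one common feature
    spectrum = np.abs(np.fft.rfft(frame))             # basis for spectral features
    frames.append((log_energy, spectrum))

print(len(frames), "frames; first log-energy =", round(frames[0][0], 2))
```
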
Component 2: Acoustic models

- Mixture of Gaussians: p(o_t | q_i) = Σ_k c_ik N(o_t; μ_ik, Σ_ik)  (see the sketch below)
- Dimension reduction: principal component analysis, linear discriminant analysis, parameter tying

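A minimal numpy sketch (my own illustration, not course code) of evaluating a diagonal-covariance Gaussian mixture p(o_t | q_i) for one state; the weights, means, and variances below are made-up numbers.

```python
import numpy as np

def log_gaussian_diag(o, mean, var):
    # log N(o; mean, diag(var)) for a single observation vector
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mean) ** 2 / var)

def log_gmm_likelihood(o, weights, means, variances):
    # log p(o | q_i) = log sum_k c_k N(o; mu_k, Sigma_k), computed stably in log space
    log_terms = [np.log(w) + log_gaussian_diag(o, m, v)
                 for w, m, v in zip(weights, means, variances)]
    return np.logaddexp.reduce(log_terms)

# Toy 2-component mixture over 3-dimensional features (illustrative numbers).
weights   = np.array([0.6, 0.4])
means     = np.array([[0.0, 1.0, -1.0], [2.0, 0.0, 0.5]])
variances = np.array([[1.0, 0.5, 2.0], [0.8, 1.2, 1.0]])

o_t = np.array([0.3, 0.9, -0.4])
print("log p(o_t | q_i) =", round(log_gmm_likelihood(o_t, weights, means, variances), 3))
```
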
Component 3: Pronunciation modeling

- Model for different pronunciations of "you" in continuous speech; each unit is an HMM
  [Diagram: start → j → (ou | a) → end; a sketch of this network follows below]
- Other types of units: triphones, syllables

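A hedged sketch of how such a pronunciation network might be represented as a weighted graph of HMM units; the arc probabilities below are guesses for illustration, not the course's actual model.

```python
# Pronunciation network for "you": start -> j -> (ou | a) -> end.
# Each arc label would expand into a phone-level HMM; probabilities are made up.
pron_network = {
    "start": [("j", 1.0)],
    "j":     [("ou", 0.7), ("a", 0.3)],   # full form "you" vs. reduced "ya"
    "ou":    [("end", 1.0)],
    "a":     [("end", 1.0)],
}

def enumerate_paths(node="start", prob=1.0, path=()):
    # Enumerate all phone sequences from start to end with their path probabilities.
    if node == "end":
        yield path, prob
        return
    for nxt, p in pron_network[node]:
        new_path = path if nxt == "end" else path + (nxt,)
        yield from enumerate_paths(nxt, prob * p, new_path)

for phones, p in enumerate_paths():
    print(" ".join(phones), p)
```
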
Component 4: Language model

- Provides the probability p(M) of the word-sequence model M, to be combined with the acoustic model p(X|M)
  - Common: N-grams with smoothing and backoff -- a very hard and specialized business (see the bigram sketch below)
  - Just starting to integrate parsing
- Fundamental equation:
  M* = argmax_M p(M|X) = argmax_M p(X|M) p(M)
- Search: Viterbi, beam, A*, N-best

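A minimal bigram language model with add-one smoothing, sketched from scratch; the tiny corpus and the smoothing choice are mine for illustration, not the course's.

```python
from collections import Counter
import math

# Tiny toy corpus (made up); real LMs are trained on vast text collections.
corpus = [
    "<s> i recognize speech </s>",
    "<s> i recognize words </s>",
    "<s> i wreck a nice beach </s>",
]

unigrams, bigrams = Counter(), Counter()
vocab = set()
for sent in corpus:
    words = sent.split()
    vocab.update(words)
    unigrams.update(words[:-1])              # history counts
    bigrams.update(zip(words, words[1:]))    # (history, word) counts

def log_p_bigram(w, history, V=len(vocab)):
    # Add-one (Laplace) smoothing: p(w | h) = (c(h, w) + 1) / (c(h) + V)
    return math.log((bigrams[(history, w)] + 1) / (unigrams[history] + V))

def log_p_sentence(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(log_p_bigram(w, h) for h, w in zip(words, words[1:]))

print(log_p_sentence("i recognize speech"))
print(log_p_sentence("i wreck a nice beach"))
```
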
ASR: example of a generative model

- Components 2+3+4 provide an instance of a generative model:
  - The language model M generates word sequences
  - The word sequence generates pronunciations
  - The pronunciation generates acoustic features
- Unsupervised learning/training (a toy EM sketch follows below):
  - Maximum likelihood estimation
  - Expectation-Maximization algorithm (in different incarnations)
  - Main focus of this class

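As a preview of what "EM in different incarnations" looks like, here is a toy EM loop for a two-component 1-D Gaussian mixture; the data are synthetic and the example is mine, but the HMM training seen later follows the same E-step / M-step pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (illustrative only).
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

# Initial guesses for mixture weights, means, and variances.
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior responsibility of each component for each data point.
    dens = (w / np.sqrt(2 * np.pi * var)) * \
           np.exp(-0.5 * (data[:, None] - mu) ** 2 / var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibility-weighted data.
    n_k = resp.sum(axis=0)
    w = n_k / len(data)
    mu = (resp * data[:, None]).sum(axis=0) / n_k
    var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / n_k

print("weights", w.round(2), "means", mu.round(2), "vars", var.round(2))
```
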
Other models to look at:

- Descriptive/maximum entropy models
  - Started in vision, then copied to speech, then NLP
- Discriminative models: directly use the data to construct classifiers, with weak assumptions about the probability distribution
  - Supervised learning, focused on the perspective of classification
  [Diagram: input string → count → feature vector → classifier → output labels; a toy sketch follows below]
  - The "machine learning approach to NLP"

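A hedged sketch of that input string → count features → classifier pipeline, using a hand-rolled perceptron on made-up examples; this is my own illustration of the idea, not anything from the course.

```python
from collections import Counter

# Toy labeled strings (made up): classify as "greeting" (1) vs "other" (0).
train = [
    ("how are you", 1),
    ("hello there", 1),
    ("the beach is nice", 0),
    ("i wreck a nice beach", 0),
]

def featurize(s):
    # Feature vector = word counts of the input string.
    return Counter(s.split())

weights = Counter()
for _ in range(10):                      # a few perceptron epochs
    for text, label in train:
        x = featurize(text)
        score = sum(weights[f] * v for f, v in x.items())
        pred = 1 if score > 0 else 0
        if pred != label:                # update weights only on mistakes
            for f, v in x.items():
                weights[f] += (label - pred) * v

test = "hello how are you"
s = sum(weights[f] * v for f, v in featurize(test).items())
print(test, "->", "greeting" if s > 0 else "other")
```
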
Problem solved?

- No; improvements are mostly due to larger training sets and speedups
  - Driven by Moore's law?

Challenges

- Environment distortion (microphone, noise, cocktail party) breaks the feature extraction
  - Acoustic condition mismatch
- Between- and within-speaker variability breaks the pronunciation modeling and acoustic modeling
- Conversational speech breaks the language model
- Understanding these problems is crucial for improving the performance of ASR

Dreaming

- "2001: A Space Odyssey" (1968)
  Dave: "Open the pod bay doors, HAL."
  HAL 9000: "I'm sorry, Dave. I'm afraid I can't do that."

The reality, before the problem is solved

- Speech is used as a user interface only when people can't use their hands:
  - Driving a car (use speech to drive?)
  - Device too small (cellphone)
  - Customer service (who will tolerate touch-tone?)
  - Dictation (how many people actually use it?)

For next time:

- We will start with signal processing
  - It uses engineering math, including power series (including convergence), trigonometric functions, integration, and the representation of complex numbers (see the small example below)
  - If you have forgotten or do not know this material, please find references and study it before class

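One concrete piece of that background, as a quick self-check (my example, not from the slides): Euler's formula e^(j·ω) = cos ω + j·sin ω, the identity behind the Fourier analysis used in the signal-processing unit.

```python
import cmath, math

# Euler's formula: e^(j*omega) = cos(omega) + j*sin(omega).
for omega in (0.0, math.pi / 4, math.pi / 2, math.pi):
    lhs = cmath.exp(1j * omega)
    rhs = complex(math.cos(omega), math.sin(omega))
    print(f"omega={omega:.3f}  e^(j*omega)={lhs:.3f}  cos+j*sin={rhs:.3f}  "
          f"equal={abs(lhs - rhs) < 1e-12}")
```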