Snack for Ruby - rbSnack


Snack for Ruby
S Legrand
Talk Objectives
Tour of API
Learn the walk and talk
Have Fun
Snack
The Snack library is a tool to aid in
learning about sound, voice, and ASR,
and is hopefully a fun way to
experiment
Snack is a tcl-based API
Snack has been adapted to and
included in Standard Python
Distribution
Snack
Snack is Swedish for “talk” or
“chat”
Kåre Sjölander is the principal
investigator for tcl-based snack
Tcl Snack is available at
http://www.speech.kth.se/snack/
Snack for Ruby
rbSnack is a ruby wrapper around
tcl snack
rbSnack has additional ruby based
utilities
rbSnack has html-based help.
(rdoc+rbTeX)
rbSnack can be found at
http://rbsnack.sourceforge.net/
Snack Toolkit Includes
Recording, Playback
Waveform display
Spectrogram: Fourier, LPC
Formant analysis
Power analysis
Filters
(will demo)
The Speech Signal
Continuous speech is discretely
sampled
The signal consists of rapidly changing
data points.
The display of the sampled signal is
called the waveform
Snack can display the waveform
in real time
Analysis uses frames
Signal is broken into frames
Frames may overlap
Characteristics of signal analyzed
using Fourier and LPC analysis
on a per frame basis.
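As a rough illustration of framing, the sketch below splits a sampled signal into fixed-length frames that may overlap. It is plain Ruby, not rbSnack code; the names `frames`, `frame_size`, and `hop` are assumptions made for this example.

```ruby
# Split a sampled signal into fixed-length frames.
# `hop` is the distance between frame starts; hop < frame_size means overlap.
# (Hypothetical helper for illustration; not part of rbSnack.)
def frames(signal, frame_size, hop)
  raise ArgumentError, "hop must be positive" unless hop.positive?

  last_start = signal.length - frame_size
  return [] if last_start.negative?

  (0..last_start).step(hop).map { |start| signal[start, frame_size] }
end

signal = (0...10).to_a              # stand-in for sampled audio
overlapping = frames(signal, 4, 2)  # each frame shares half its samples with the next
overlapping.each { |frame| p frame }
```

Each frame would then be handed to a per-frame analysis such as Fourier or LPC.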
Going in Circles
Complex numbers are just a funny
way of multiplying: add angles.
Euler’s formula
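A small sketch of both ideas using Ruby's built-in Complex class: multiplying two numbers on the unit circle adds their angles, and Euler's formula says e^(i*theta) = cos(theta) + i*sin(theta). The angles chosen here are arbitrary.

```ruby
# Multiplying complex numbers on the unit circle adds their angles.
a = Complex.polar(1.0, Math::PI / 6)   # angle of 30 degrees
b = Complex.polar(1.0, Math::PI / 3)   # angle of 60 degrees
product = a * b                        # angle should be 90 degrees

# Euler's formula: e^(i*theta) = cos(theta) + i*sin(theta)
theta = Math::PI / 4
euler_lhs = Complex(Math::E, 0)**Complex(0, theta)
euler_rhs = Complex(Math.cos(theta), Math.sin(theta))

puts product.arg * 180 / Math::PI
puts (euler_lhs - euler_rhs).abs
```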
Fourier Analysis
The Fourier matrix is a unitary matrix
Multiplication by Fourier matrix
returns the frequency components
of the signal, called the Fourier
coefficients
Easy to compute the inverse: it is the
conjugate transpose, called the Fourier inverse
The Fourier Matrix
Looks Like
Spinning disks
Multiplication by signal produces Fourier
coefficients (frequency components)
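The following sketch builds a small unitary Fourier matrix from plain Ruby arrays, multiplies it by a signal to get the Fourier coefficients, and recovers the signal with the conjugate transpose. The helper names (`fourier_matrix`, `apply`, `conjugate_transpose`) are made up for this example, and a real toolkit would use a fast Fourier transform rather than a full matrix product.

```ruby
# N x N unitary Fourier matrix: entry (k, t) is exp(-2*pi*i*k*t/N) / sqrt(N).
# Each row is a "spinning disk" turning k times over the frame.
def fourier_matrix(n)
  Array.new(n) do |k|
    Array.new(n) { |t| Complex.polar(1.0 / Math.sqrt(n), -2 * Math::PI * k * t / n) }
  end
end

# Multiply a matrix (array of rows) by a vector.
def apply(matrix, vector)
  matrix.map { |row| row.zip(vector).sum { |m, v| m * v } }
end

# Conjugate transpose; for a unitary matrix this is the inverse.
def conjugate_transpose(matrix)
  matrix.transpose.map { |row| row.map(&:conjugate) }
end

n = 8
signal = Array.new(n) { |t| Math.cos(2 * Math::PI * t / n) } # one cycle per frame
f = fourier_matrix(n)
coefficients = apply(f, signal)                          # frequency components
recovered = apply(conjugate_transpose(f), coefficients)  # Fourier inverse

coefficients.each_with_index { |c, k| puts format("k=%d magnitude=%.3f", k, c.abs) }
```

For this one-cycle cosine, the energy lands in components 1 and n-1, and the other magnitudes are close to zero.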
Examining Fourier
components
A Spectrogram gives a picture of
the Fourier components
(coefficients) as they evolve over
time. Snack can display it in real time.
Looks like an X Ray
Bands of high activity correspond
to formants
Linear Filters
Useful to understand nature of
speech signals
Generators: generate square
waves, sine waves, sawtooth, etc.
Composers: compose several
filters.
FIR: Finite impulse response
IIR: Infinite impulse response
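To give a feel for what a generator produces, here is a plain-Ruby sketch of sine and square wave generators. This is not the Snack generator filter; the function names and parameters are assumptions for illustration.

```ruby
# Sine wave: `count` samples of a `freq` Hz tone sampled at `sample_rate` Hz.
def sine_wave(freq, sample_rate, count)
  Array.new(count) { |t| Math.sin(2 * Math::PI * freq * t / sample_rate) }
end

# Square wave: +1 for the first half of each period, -1 for the second half.
def square_wave(freq, sample_rate, count)
  Array.new(count) do |t|
    phase = (freq * t / sample_rate.to_f) % 1.0
    phase < 0.5 ? 1.0 : -1.0
  end
end

sine   = sine_wave(1, 8, 8)    # one full period in 8 samples
square = square_wave(1, 8, 8)
p sine.map { |s| s.round(3) }
p square
```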
FIR Filter
Determined completely by
response to a unit impulse.
Response finite in duration.
y(t) = b0 x(t) + b1 x(t-1) + b2 x(t-2) + … + bn x(t-n)
(We will demo FIR using rbSnack)
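The FIR formula can be sketched directly in plain Ruby. This is not the rbSnack demo; `fir_filter` is a hypothetical helper, the coefficients are arbitrary, and samples before t = 0 are assumed to be zero. Feeding in a unit impulse returns the coefficients themselves, which shows why the response is finite.

```ruby
# FIR filter: y(t) = b0*x(t) + b1*x(t-1) + ... + bn*x(t-n)
# Samples before t = 0 are treated as zero.
def fir_filter(b, x)
  Array.new(x.length) do |t|
    b.each_with_index.sum { |coeff, k| t >= k ? coeff * x[t - k] : 0.0 }
  end
end

impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
b = [0.5, 0.3, 0.2]                # arbitrary example coefficients
response = fir_filter(b, impulse)  # the impulse response is b, then zeros
p response
```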
IIR Filter
Also called Recursive filter
Response infinite in duration.
y(t) = b0 x(t) + b1 x(t-1) + b2 x(t-2) + … + bn x(t-n)
       + a1 y(t-1) + a2 y(t-2) + … + an y(t-n)
(We will demo IIR using rbSnack)
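A matching plain-Ruby sketch of the recursive form, again with a hypothetical helper name and arbitrary coefficients rather than the rbSnack demo. With a single feedback coefficient of 0.5, the response to a unit impulse halves at every step and never reaches exactly zero.

```ruby
# IIR filter: feedforward terms on x plus feedback terms on earlier outputs.
# b[k] multiplies x(t-k); a[k] multiplies y(t-1-k), so a[0] is a1.
# Samples before t = 0 are treated as zero.
def iir_filter(b, a, x)
  y = []
  x.each_index do |t|
    feedforward = b.each_with_index.sum { |coeff, k| t >= k ? coeff * x[t - k] : 0.0 }
    feedback = a.each_with_index.sum { |coeff, k| t > k ? coeff * y[t - 1 - k] : 0.0 }
    y << feedforward + feedback
  end
  y
end

impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
response = iir_filter([1.0], [0.5], impulse)  # y(t) = x(t) + 0.5*y(t-1)
p response
```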
Linear Predictive Analysis
Analogous to Fourier analysis
Assumption: For each frame, the signal
is predicted by
y(t) = a1 y(t-1) + a2 y(t-2) + … + ap y(t-p)
The LPC coefficients are the best least
squares approximation.
Can also be used to predict formants
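One common way to get least-squares LPC coefficients for a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that method; the talk does not say which one Snack or rbSnack uses, and the helper names are made up for this example.

```ruby
# Autocorrelation of a frame at lags 0..max_lag.
def autocorrelation(frame, max_lag)
  (0..max_lag).map do |lag|
    (lag...frame.length).sum { |t| frame[t] * frame[t - lag] }
  end
end

# Levinson-Durbin recursion for y(t) ~ a1*y(t-1) + ... + ap*y(t-p).
# Returns [a1, ..., ap]; an all-zero frame yields all-zero coefficients.
def lpc(frame, order)
  r = autocorrelation(frame, order)
  return Array.new(order, 0.0) if r[0].zero?

  a = []         # a[k - 1] holds coefficient a_k of the current order
  error = r[0]   # prediction error energy
  (1..order).each do |i|
    acc = r[i] - (1...i).sum { |k| a[k - 1] * r[i - k] }
    reflection = acc / error
    updated = (1...i).map { |k| a[k - 1] - reflection * a[i - k - 1] }
    a = updated + [reflection]
    error *= (1 - reflection**2)
  end
  a
end

# A decaying exponential obeys y(t) = 0.9*y(t-1), so a1 should be close to 0.9.
frame = Array.new(200) { |t| 0.9**t }
coefficients = lpc(frame, 1)
p coefficients
```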
What is Sound?
What is Speech?
Sound is the signal created by
longitudinal waves in
some medium like air.
Sound waves are continuous
Can be decomposed into a linear
combination of sine waves.
Speech is a special noise made by
humans
It’s Just Tubing…
The simplest model of speech is to
consider the lungs and trachea as
one long tube.
Resonance frequencies are called
Formants.
F1
F2
Some Speech
Recognition Features
Formants
Pitch
Voiced/Unvoiced
Nasality
Frication
Energy
Our current work only uses Formants and Energy
Basic Utterances
A basic unit of speech is called a Phone
Vowels are utterances with constant
formants
A diphthong is the transition from one
vowel to another
Vowels and Diphthongs are essentially
characterized by the first and second
formant.
Other Phones: The
Consonants
Plosives: closure in oral cavity /p/
Nasal: Closure in oral cavity, air flows through nasal cavity /m/
Fricative: Turbulent airstream noise /s/
Retroflex liquid: Vowel-like, tongue high and
curled back /r/
Lateral liquid: Vowel-like, tongue
central, side air stream /l/
Glide: Vowel-like /y/
Some Problems with
Speech Signals
Segmentation: when does a word
begin and end? (Noise?)
Wetware: (speaker’s internal
configuration + lip smacks,
breathing etc.)
SegmentationWorkshop demos one
approach.
Code Books
A code book consists of code
words.
Idea is to search through code
book to find code word
corresponding to best match of
feature sequence.
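A minimal sketch of a code book search, assuming each code word is a single [F1, F2] feature vector and the match is scored by squared Euclidean distance. The labels and formant values are illustrative placeholders, not measured data, and this is not the rbSnack implementation.

```ruby
# Squared Euclidean distance between two feature vectors of equal length.
def distance(u, v)
  u.zip(v).sum { |a, b| (a - b)**2 }
end

# Search the code book and return the label of the closest code word.
def best_match(code_book, features)
  code_book.min_by { |_label, code_word| distance(code_word, features) }.first
end

# Hypothetical code book: label => [F1, F2] in Hz (illustrative values only).
code_book = {
  "i" => [300.0, 2300.0],
  "a" => [750.0, 1200.0],
  "u" => [300.0, 800.0]
}

match = best_match(code_book, [700.0, 1100.0])
puts match
```

The search is a linear scan, which is part of why the approach suits small vocabularies.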
rbSnack uses the code book approach
in word recognition.
Code Book Approach
++ Easy to implement
+ Good for isolated words
+- Works best on small
vocabularies
-- Is insensitive to context,
prone to errors
Code Book Approach
WhichWay is a simple
demo of this approach
More Problems with
Speech Signals
Accent: Southern vs. New England
vs. California Valley vs. Other.
Variation in rate of speech makes it
hard to compare words
Dynamic Time
Warping
A pattern comparison technique
A way of stretching or compressing
one sequence to match another.
Evaluated using dynamic
programming
Dynamic Programming
Form a grid, with start at lower
left, end at upper right.
Label each node with difference
(error) between pattern 1 at time i
and pattern 2 at time j.
Find the minimal distance from start to
end using dynamic programming.
Dynamic Programming
Basic Assumption:
If best path P(S,E)
passes through node
N, then P(S,E) is the
concatenation of
P(S,N) (best from S
to N) and P(N,E)
(best from N to E)
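The grid and the best-path assumption can be sketched as a short dynamic time warping routine. This version assumes the simplest local steps (one cell right, one cell up, or one diagonal, all with equal weight) and an absolute-difference error; it is not one of the rbSnack time alignment examples.

```ruby
# Dynamic time warping distance between two numeric sequences.
# cost[i][j] is the cheapest total error of any path from the start (0, 0)
# to node (i, j); it is built from the best paths to the neighbouring nodes.
def dtw_distance(p1, p2)
  rows = p1.length
  cols = p2.length
  cost = Array.new(rows) { Array.new(cols, Float::INFINITY) }

  rows.times do |i|
    cols.times do |j|
      local = (p1[i] - p2[j]).abs   # error between pattern 1 at i and pattern 2 at j
      best_prev =
        if i.zero? && j.zero?
          0.0
        else
          [
            i.positive? ? cost[i - 1][j] : Float::INFINITY,
            j.positive? ? cost[i][j - 1] : Float::INFINITY,
            i.positive? && j.positive? ? cost[i - 1][j - 1] : Float::INFINITY
          ].min
        end
      cost[i][j] = local + best_prev
    end
  end

  cost[rows - 1][cols - 1]
end

slow = [1.0, 2.0, 2.0, 3.0]   # same shape as `fast`, spoken more slowly
fast = [1.0, 2.0, 3.0]
puts dtw_distance(fast, slow)
```

Because the slower pattern only repeats a value, the warped distance between the two is zero.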
A possible path
Dynamic Programming
[Figure: local path constraints with step weights, Type I and Type III]
rbSnack includes examples for various
time alignment approaches
Dynamic Programming
[Figure: local path constraints with step weights, Type IV and Itakura]
Hidden Markov Models
Sometimes the second (or third) best
match is the right word. Use HMMs to
ascertain the correct word in the
context of the sentence. (Ditto for
phones within a word)
HMMs are similar to non-deterministic
finite state machines, except that they
have non-deterministic output.
Hidden Markov Models
Dynamic Programming is used to
compute weights.
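HMMs look like
[Figure: state diagram with states 1 to 4, transition probabilities such as .4 and .2, and output probabilities on a state: P(/i/)=.5, P(/a/)=.2, P(/o/)=.3]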
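One dynamic programming computation on an HMM is the forward algorithm, which gives the probability that a model emits an observed sequence. The sketch below uses a hypothetical two-state model: state 1 reuses the output probabilities shown on the slide, while the initial, transition, and state 2 values are invented for illustration.

```ruby
# Forward algorithm: probability that the model emits `observations`.
# alpha[s] is the probability of having emitted the sequence so far and being in state s.
def sequence_probability(initial, transition, emission, observations)
  states = initial.keys
  alpha = states.to_h { |s| [s, initial[s] * emission[s][observations.first]] }

  observations.drop(1).each do |obs|
    alpha = states.to_h do |s|
      reach = states.sum { |prev| alpha[prev] * transition[prev][s] }
      [s, reach * emission[s][obs]]
    end
  end

  alpha.values.sum
end

# Hypothetical model; only state 1's output probabilities come from the slide.
initial    = { 1 => 1.0, 2 => 0.0 }
transition = { 1 => { 1 => 0.6, 2 => 0.4 }, 2 => { 1 => 0.0, 2 => 1.0 } }
emission   = {
  1 => { "i" => 0.5, "a" => 0.2, "o" => 0.3 },
  2 => { "i" => 0.1, "a" => 0.7, "o" => 0.2 }
}

probability = sequence_probability(initial, transition, emission, %w[i a])
puts probability
```

Scoring each candidate word model this way and keeping the most probable one is one way context could break ties between close matches.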
Possible Future
Directions
Examine other features (pitch?)
Incorporate other libraries. (Do the
computationally hard work in C)
Add more signal processing
routines
Add more examples
Use Hidden Markov Models
Lessons Learned
/to be learned
Document everything.
Nothing’s perfect
Automate everything
Project is never done
What’s next?
Try it out.