Speech Processing
Production and Classification of
Speech Sounds
Introduction
• Simplified view of speech production (see Figure 3.1 in the next slide)
• Lungs – act as a power supply and provide airflow to the larynx stage.
• Larynx – modulates airflow and provides either:
  ◦ Periodic puff-like airflow, or
  ◦ Noisy airflow to the vocal tract.
• Vocal tract – gives the modulated airflow its “color” (spectrally shaping the source) with:
  ◦ Oral,
  ◦ Nasal, and
  ◦ Pharynx cavities.
2 April 2016
Veton Këpuska
2
Figure 3.1
Introduction

• Sound sources can also be generated by constrictions and boundaries that are made within the vocal tract itself:
  ◦ Periodic source,
  ◦ Noisy source, or
  ◦ Impulsive airflow source.
• Note that the speech production mechanism does not generate a perfect periodic, impulsive, or noisy source.
• Three general categories of the source for speech sounds:
  1. Periodic
  2. Noisy
  3. Impulsive
• Illustration of each in the word “shop”:
  ◦ “sh” – noisy
  ◦ “o” – periodic
  ◦ “p” – impulsive
Example of “Shop”
[Figure: waveform of “shop” with three labeled regions: noise-like signal, periodic source, and impulse source]
Introduction

• Distinguishable speech sounds are determined
  ◦ not only by the source, but
  ◦ also by different vocal tract configurations,
  ◦ and by combinations of both.
• Speech sound classes are referred to as phonemes.
  ◦ Phonemics is the discipline that studies phoneme realizations (e.g., in a language).
  ◦ Each phoneme class provides a certain meaning in a word.
  ◦ Within a phoneme class there exist many sound variations that provide the same meaning. The study of these sound variations is called phonetics.
• Phonemes are the basic building blocks of a language:
  ◦ They are concatenated (more or less) as discrete elements into words,
  ◦ According to certain phonemic and grammatical rules.
Introduction
• This chapter will cover:
  ◦ Description of the speech production mechanism
  ◦ Resulting variety of phonetic sound patterns
  ◦ How these sounds differ among different speakers.
Anatomy and Physiology of
Speech Production
Anatomy and Physiology of Speech
Production

• Anatomy of speech production is shown in Figure 3.2.
• Lungs:
  ◦ Inhalation and exhalation of air.
  ◦ Connected through the trachea (“windpipe”) and epiglottis to the vocal tract.
    The trachea is a ~12-cm-long, ~1.5-2-cm-diameter pipe.
• During speaking, the rhythmical cycle of inhalation and exhalation changes to accommodate speech production:
  ◦ Duration of exhalation becomes roughly equal to the length of the sentence/phrase.
  ◦ Lung air pressure during this time is maintained at a roughly constant level, slightly above atmospheric pressure.
Anatomy and Physiology of Speech
Production

• Larynx
  ◦ Complicated system of cartilages, flesh, muscles, and ligaments.
  ◦ Primary function (in the context of speech production) is to control the vocal cords (vocal folds) as illustrated in Figure 3.3.
• Vocal folds are:
  ◦ ~15 mm long in men
  ◦ ~13 mm long in women
Anatomy and Physiology of Speech
Production

• Three primary states of the vocal folds:
  ◦ Breathing – arytenoid cartilages are held outward.
  ◦ Voiced – arytenoid cartilages are held close together.
  ◦ Unvoiced – arytenoid cartilages are held outward or partially closed.
• Complex motion of the vocal folds is illustrated in Figure 3.4.
• Nonlinear two-mass model of Flanagan et al. (Figure 3.5).
Note on terminology: arytenoid (ar·y·te·noid, \ˌa-rə-ˈtē-ˌnȯid, ə-ˈri-tən-ˌȯid\), adjective, circa 1751. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally “ladle-shaped”, from arytaina, “ladle”.
  1. Relating to or being either of two small laryngeal cartilages to which the vocal cords are attached.
  2. Relating to or being either of a pair of small muscles or an unpaired muscle of the larynx.
Anatomy and Physiology of Speech
Production


• If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately similar to that of Figure 3.6:
  ◦ Closed phase: folds are closed and no flow occurs.
  ◦ Open phase: folds are open and the flow increases up to a maximum.
  ◦ Return phase: time interval from the maximum airflow until the glottal closure.
• The specific flow shape can change with:
  ◦ Speaker,
  ◦ Speaking style, and
  ◦ Specific speech sound.
• Airflow velocity at the glottis is referred to as glottal flow.
  ◦ Time duration of one glottal cycle is referred to as the pitch period.
  ◦ Reciprocal of the pitch period is referred to as pitch, also as fundamental frequency.
Example 3.1

Consider a glottal flow waveform model of the form:
u[n] = g[n]*p[n]
where g[n] is the glottal flow waveform over a single cycle and p[n] is an impulse train with spacing P:
$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$
Because the waveform is infinitely long, a segment is extracted by multiplying u[n] by a short sequence called an analysis window or simply a window. The window, denoted by w[n,τ], is centered at time τ, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as:
u[n,τ] = w[n,τ](g[n]*p[n])
Using the Multiplication and Convolution Theorems of Chapter 2, the following expression in the frequency domain is obtained:
$$U(\omega,\tau) = \frac{1}{P}\, W(\omega,\tau) * \left[\sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega - \omega_k)\right]$$
$$U(\omega,\tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where
• W(ω,τ) is the Fourier transform of w[n,τ],
• G(ω) is the Fourier transform of g[n],
• ω_k = (2π/P)k, where 2π/P is the fundamental frequency or pitch.
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at ω = 0 with lower surrounding side lobes. The figure also shows the effect of the harmonics of the glottal waveform on the spectrum.
Figure 3.7
Example 3.1
• A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform: ω_k = (2π/P)k.
• The first harmonic is also the fundamental frequency.
• At each harmonic frequency there is a translated window Fourier transform W(ω−ω_k) weighted by G(ω_k).
• The magnitude of the spectral shaping function, i.e., of the glottal flow transform |G(ω)|, is referred to as the spectral envelope of the harmonics.
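The relation between pitch period and harmonic spacing in Example 3.1 can be checked numerically. The following is a minimal sketch, not the author's code: the raised-cosine pulse used for g[n], the number of periods, and the FFT size are illustrative assumptions.

```python
import numpy as np

def windowed_glottal_spectrum(P, n_periods=8, nfft=4096):
    """Magnitude spectrum of a Hamming-windowed u[n] = g[n]*p[n]."""
    # Hypothetical one-cycle glottal pulse g[n]: a raised-cosine bump over
    # the first 60% of the period (open phase), zero for the rest (closed phase).
    n_open = int(0.6 * P)
    g = np.zeros(P)
    g[:n_open] = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(n_open) / n_open))
    # Impulse train p[n] with spacing P, then u[n] = g[n]*p[n] (convolution).
    p = np.zeros(n_periods * P)
    p[::P] = 1.0
    u = np.convolve(p, g)[: n_periods * P]
    # Apply the analysis window and take a zero-padded DFT.
    segment = u * np.hamming(len(u))
    return np.abs(np.fft.rfft(segment, nfft))

def harmonic_spacing_bins(P, nfft=4096):
    """Spacing of the harmonics w_k = (2*pi/P)*k, expressed in DFT bins."""
    return nfft / P
```

For example, halving P from 64 to 32 samples doubles the spacing returned by `harmonic_spacing_bins`, and the first spectral peak of `windowed_glottal_spectrum` moves accordingly.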
Anatomy and Physiology of Speech
Production


• The Fourier transform of a periodic glottal waveform is characterized by harmonics.
• Typically the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  ◦ Rolloff is dependent on the nature of the airflow and on speaker characteristics.
  ◦ See Exercise 3.18 for further details.
• The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time:
  ◦ The pitch period can “randomly” vary over successive periods – pitch “jitter”.
  ◦ Amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods – amplitude “shimmer”.
• Those variations are (perhaps) due to:
  ◦ Time-varying characteristics of the vocal tract and vocal folds,
  ◦ Nonlinear behavior in the speech anatomy, or
  ◦ An underlying deterministic (chaotic) system whose output only appears random.
• Jitter and shimmer are one component that gives vowels their naturalness.
  ◦ In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  ◦ Voice character is determined by the extent of jitter and shimmer in the voice (e.g., hoarse voice).
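Jitter and shimmer can be illustrated by perturbing the ideal impulse train p[n] of Example 3.1. This is a hedged sketch: the uniform perturbation model and the parameter names `jitter` and `shimmer` are assumptions made for illustration, not a model taken from the slides.

```python
import numpy as np

def jittered_impulse_train(P, n_periods, jitter=0.02, shimmer=0.1, seed=0):
    """Impulse train whose period and amplitude vary from cycle to cycle."""
    rng = np.random.default_rng(seed)
    # Each pitch period is P perturbed by a relative jitter; each pulse
    # amplitude is 1 perturbed by a relative shimmer (uniform: an assumption).
    periods = np.rint(P * (1.0 + jitter * rng.uniform(-1, 1, n_periods))).astype(int)
    amps = 1.0 + shimmer * rng.uniform(-1, 1, n_periods)
    # Pulse k starts where the previous k periods end.
    onsets = np.concatenate(([0], np.cumsum(periods[:-1])))
    p = np.zeros(int(np.sum(periods)))
    p[onsets] = amps
    return p, periods, amps
```

Setting `jitter=0` and `shimmer=0` recovers the ideal, machine-like impulse train with spacing P.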
Anatomy and Physiology of Speech
Production

• States of Vocal Folds:
  ◦ Breathing
  ◦ Voicing
  ◦ Unvoicing –
    Turbulence at the vocal folds – aspiration
    Example: “he” – whispered sounds
• Aspiration occurs also with voiced sounds (breathy voice):
  ◦ Part of the vocal folds vibrates and part of them is nearly fixed.
Anatomy and Physiology of Speech
Production
• Other forms of atypical vocal fold movement:
  ◦ Creaky voice – very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has:
    High pitch, and
    Irregular pitch.
  ◦ Vocal fry – vocal folds are massy and relaxed, resulting in a voice with an abnormally:
    Low pitch, and
    Irregular pitch.
    Characterized by secondary glottal pulses close to and overlapping the primary glottal pulse.
    Result of coupling of the false vocal folds with the true vocal folds.
  ◦ Diplophonic voice – secondary glottal pulses occur between the primary pulses within the closed phase (see Figure 3.9b and Figure 3.16).
Examples of atypical voice types
Vocal Tract
• Comprised of the oral cavity:
  ◦ From the larynx
  ◦ To the lips, including
  ◦ the nasal passage – coupled to the oral tract by way of the velum.
• The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators:
  ◦ Tongue
  ◦ Teeth
  ◦ Lips
  ◦ Jaw.
• Average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
• Purpose of the vocal tract is to:
  ◦ Spectrally “color” the source, and
  ◦ Generate new sources for sound production.
Spectral Shaping
• Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
• Resonance frequencies of the vocal tract are called formant frequencies or simply formants.
• Formants (resonance frequencies) change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping

• The peaks of the spectrum of the vocal tract response correspond approximately to its formants:
  ◦ For a time-invariant all-pole linear system model of the vocal tract, a pole at z0 = r0·e^{jω0} corresponds approximately to a vocal tract formant.
    Frequency of the formant is ω0.
    Bandwidth depends on the distance of the pole from the unit circle (i.e., on r0).
  ◦ Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
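The link between a pole location and a formant can be sketched for a single complex-conjugate pole pair. The bandwidth expression B ≈ −ln(r0)·fs/π is a common narrowband approximation that does not appear in the slides; it and the function names are assumptions of this sketch.

```python
import numpy as np

def pole_to_formant(r0, w0, fs):
    """Formant frequency (Hz) and approximate 3-dB bandwidth (Hz) of a pole r0*exp(j*w0)."""
    freq_hz = w0 * fs / (2.0 * np.pi)
    # Assumed narrowband approximation: the closer r0 is to 1, the narrower the peak.
    bandwidth_hz = -np.log(r0) * fs / np.pi
    return freq_hz, bandwidth_hz

def resonator_response(r0, w0, nfft=8192):
    """|H(w)| on [0, pi] for H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1)), c = r0*exp(j*w0)."""
    w = np.linspace(0.0, np.pi, nfft)
    z_inv = np.exp(-1j * w)
    c = r0 * np.exp(1j * w0)
    mag = 1.0 / np.abs((1.0 - c * z_inv) * (1.0 - np.conj(c) * z_inv))
    return w, mag
```

With r0 = 0.98 and ω0 corresponding to 500 Hz at an 8 kHz sampling rate, the magnitude response peaks very close to 500 Hz; moving the pole toward the origin broadens the peak.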
Spectral Shaping
• Formants of the vocal tract are numbered from low to high according to their location:
  ◦ F1, F2, etc.
• In general, the formant frequencies decrease as the vocal tract length increases:
  ◦ Male speakers tend to have lower formants than female speakers.
  ◦ Female speakers tend to have lower formants than children.
• Under the assumptions that:
  ◦ The vocal tract is linear and time-invariant, and
  ◦ The sound source occurs at the glottis,
• Then:
  ◦ The speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
Example 3.2

Consider a periodic glottal flow source of the form:
u[n] = g[n]*p[n]
where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by:
x[n] = h[n]*(g[n]*p[n])
A window centered at time τ, w[n,τ], is applied to the vocal tract output to obtain the speech segment:
x[n,τ] = w[n,τ]{h[n]*(g[n]*p[n])}
Using the Multiplication and Convolution Theorems, the frequency-domain representation (Fourier transform) of the speech segment is obtained:
$$X(\omega,\tau) = \frac{1}{P}\, W(\omega,\tau) * \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$$
$$X(\omega,\tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where W(ω,τ) is the Fourier transform of w[n,τ], ω_k = (2π/P)k, and 2π/P is the fundamental frequency or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, …, ω_N is determined by the spectral envelope |H(ω)G(ω)|, consisting of:
• Glottal and
• Vocal tract contributions
(unlike Example 3.1, which consisted only of the glottal contribution).
Example 3.2
• The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
  ◦ The nature of the glottal flow waveform over a cycle, e.g., a gradual or abrupt closing, and by
  ◦ The manner in which formant tails add.
• Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω,τ)| because of the sparse sampling of the spectral envelope |H(ω)G(ω)| by the source harmonics.
  ◦ This is especially the case for high-pitched speech.
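The sparse sampling of the envelope |H(ω)G(ω)| by the harmonics in Example 3.2 can be reproduced with a small source-filter sketch. Everything specific here is an assumption for illustration: a single two-pole formant stands in for h[n], a decaying exponential stands in for g[n], and the function names are hypothetical.

```python
import numpy as np

def resonator_impulse_response(f_hz, bw_hz, fs, n):
    """h[n] of a two-pole resonator modelling one formant."""
    r = np.exp(-np.pi * bw_hz / fs)      # pole radius from bandwidth (approximation)
    w0 = 2.0 * np.pi * f_hz / fs         # pole angle from formant frequency
    k = np.arange(n)
    # Closed form for 1/((1 - c z^-1)(1 - c* z^-1)) with c = r*exp(j*w0).
    return r ** k * np.sin((k + 1) * w0) / np.sin(w0)

def voiced_segment_spectrum(P, fs, f_formant, bw_hz=80.0, n_periods=10, nfft=8192):
    """|X(w, tau)| for x[n] = h[n]*(g[n]*p[n]) under a Hamming window."""
    h = resonator_impulse_response(f_formant, bw_hz, fs, int(0.05 * fs))
    g = 0.8 ** np.arange(P)              # stand-in glottal pulse (assumption)
    warmup = -(-len(h) // P)             # periods to skip so the segment is steady-state
    p = np.zeros((warmup + 1 + n_periods) * P)
    p[::P] = 1.0
    x = np.convolve(h, np.convolve(g, p))
    start = (warmup + 1) * P
    segment = x[start : start + n_periods * P] * np.hamming(n_periods * P)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return freqs, np.abs(np.fft.rfft(segment, nfft))

def peak_frequency(freqs, mag, lo_hz, hi_hz):
    """Frequency of the largest spectral magnitude within [lo_hz, hi_hz]."""
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    return freqs[band][np.argmax(mag[band])]
```

With a 100 Hz pitch (P = 80 at fs = 8000) a harmonic falls on the 1000 Hz formant and the spectral peak sits there; with a 400 Hz pitch (P = 20) no harmonic falls at 1000 Hz, so the strongest peak near the formant appears at a neighboring harmonic instead.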
Spectral Shaping
• The previous example is important because:
  ◦ It illustrates the difference between:
    Formant (resonance frequency of the vocal tract), and
    Harmonic frequency.
  ◦ A formant corresponds to a vocal tract pole (resonant frequency).
  ◦ Harmonics arise due to the periodicity of the glottal source (pitch).
• In developing signal processing algorithms that require formants, the sparsity of spectral information can be detrimental to formant estimation.
• On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).
Example 3.3
• A soprano singer often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
• To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
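The benefit of moving F1 onto the first harmonic can be estimated with a single-formant model. This is only a rough sketch under stated assumptions: one two-pole resonator represents F1, and the 80 Hz bandwidth and the example frequencies are illustrative values, not figures from the slides.

```python
import numpy as np

def resonator_gain_db(f_eval_hz, f_formant_hz, bw_hz, fs):
    """Gain (dB) at f_eval_hz of a two-pole resonator centered at f_formant_hz."""
    r = np.exp(-np.pi * bw_hz / fs)                      # pole radius (approximation)
    c = r * np.exp(1j * 2.0 * np.pi * f_formant_hz / fs)  # pole location
    z_inv = np.exp(-1j * 2.0 * np.pi * f_eval_hz / fs)
    gain = 1.0 / abs((1.0 - c * z_inv) * (1.0 - np.conj(c) * z_inv))
    return 20.0 * np.log10(gain)
```

Evaluating the gain at an 800 Hz fundamental once with F1 = 500 Hz and once with F1 = 800 Hz shows how much louder the first harmonic becomes when the formant is raised to meet it.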
Figure 3.12
Nasal Sounds
Spectral Shaping
• Nasal and oral components of the vocal tract are coupled by the velum.
• When the vocal tract velum is lowered, introducing an opening into the nasal passage, and
• The oral tract is shut off by the tongue or lips,
  sound propagates through the nasal passage and out through the nose.
• The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds:
  ◦ Examples: “nose” and “mouse”.
Spectral Shaping: Nose
Spectral Shaping: Mouse
Spectral Shaping
• Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
• The velum can be lowered even when the vocal tract is open:
  ◦ When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
  ◦ There are two dominant effects that characterize nasalization:
    Broadening of the formant bandwidths of the oral tract because of loss of energy through the nasal passage,
    Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
• The previous section discussed the effect of vocal tract shape on sound production.
• Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  ◦ Build-up of pressure behind the closure, and
  ◦ Abrupt release of pressure.
Source Generation: Plosives “Drop”
Fricatives
Source Generation
• Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
• As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
• There is no harmonic structure with these types of inputs. The source spectrum is shaped at all frequencies by |H(ω)|.
• Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives
“NASA”
Source Generation
• There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
• This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
• A vortex can be thought of as a tiny rotational airflow in the oral tract.
• There is evidence that sources due to vortices influence the
  ◦ temporal,
  ◦ spectral, and perhaps
  ◦ perceptual characteristics of speech sounds.
Categorization of Sound By Source
• Voiced: speech sounds generated with a periodic glottal source.
• Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  ◦ Fricatives – sounds generated from the friction of the moving air against an oral tract constriction. Example: “thin”
  ◦ Plosives – created with an impulsive source within the oral tract. Example: “top”
  ◦ Whispers – a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: “he”.
• However, these subcategories are not exclusive to unvoiced sounds: the vocal folds can be vibrating simultaneously with impulsive or noisy sources. Thus the above subcategories also exist for voiced sounds.
  ◦ Examples:
    “zebra” vs. “sheba” (fricatives)
    “bin” vs. “pin” (plosives)
Spectrographic Analysis of
Speech
Spectrographic Analysis of Speech
• A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  ◦ Example: the word “to”.
• A single Fourier transform of the entire acoustic signal of the word “to” cannot capture this time-varying frequency content.
• In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
• In Examples 3.1 and 3.2 presented earlier, a sliding (analysis) window concept was introduced.
  ◦ This window, w[n,τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  ◦ Example – Hamming window:
    w[n,τ] = 0.54 − 0.46cos[2π(n−τ)/(Nw−1)]   for 0 ≤ n ≤ Nw−1
  ◦ The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
• The short-time Fourier transform is:
$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\, e^{-j\omega n}$$
where
x[n,τ] = w[n,τ]x[n]
represents the windowed speech segments as a function of the window center at time τ.
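A discrete version of the short-time Fourier transform above can be sketched with an FFT per frame. This is a minimal illustration, not a reference implementation: the function name `stft` and its parameters are assumptions, and each frame's FFT is referenced to the frame start, so its phase differs from X(ω,τ) as defined above by a linear-phase term while the magnitudes agree.

```python
import numpy as np

def stft(x, n_window, frame_interval, nfft=None):
    """Short-time Fourier transform using a Hamming window.

    Returns a complex array of shape (n_frames, nfft // 2 + 1).
    """
    nfft = nfft or n_window
    # np.hamming(N) evaluates 0.54 - 0.46*cos(2*pi*n/(N-1)), the taper given above.
    w = np.hamming(n_window)
    # The window advances by frame_interval samples, not one sample at a time.
    starts = range(0, len(x) - n_window + 1, frame_interval)
    frames = np.stack([x[s : s + n_window] * w for s in starts])
    return np.fft.rfft(frames, nfft, axis=1)
```

For a stationary sinusoid every frame peaks at the same frequency bin; for speech, the peak locations move from frame to frame as the sound changes.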
Spectrographic Analysis of Speech

• The spectrogram is graphically displayed as:
S(ω,τ) = |X(ω,τ)|²
  ◦ S(ω,τ) is a 2-D (two-dimensional) representation of the “energy density” of the signal.
  ◦ For each window position τ, one could plot S(ω,τ).
  ◦ A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  ◦ This display is illustrated (as a caricature) in Figure 3.14.
• This figure also illustrates two kinds of spectrograms:
  ◦ Narrowband – gives good spectral resolution: a good view of the frequency content of sine waves with closely spaced frequencies.
  ◦ Wideband – gives good temporal resolution: a good view of impulses closely spaced in time.
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n−kP]:
x[n,τ] = w[n,τ]{(p[n]*g[n])*h[n]}
x[n,τ] = w[n,τ]{p[n]*ĥ[n]}
where the glottal waveform over a cycle and the vocal tract impulse response were combined as ĥ[n] = g[n]*h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as:
$$S(\omega,\tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where
$$\tilde{H}(\omega) = H(\omega) G(\omega)$$
and where ω_k = (2π/P)k, and 2π/P is the fundamental frequency.
Spectrographic Analysis of Speech
• The difference between narrowband and wideband spectrograms is in the length of the (analysis) window w[n,τ].
• Narrowband Spectrogram:
  ◦ Uses a “long” window with a duration of typically at least two pitch periods.
  ◦ Under the conditions that:
    The main lobes of the shifted window Fourier transforms are non-overlapping, and that
    The corresponding transform side lobes are negligible,
    the following approximation follows from the spectrogram equation of the previous slide (Exercise 3.8):
$$S(\omega,\tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} \left| \tilde{H}(\omega_k) \right|^2 \left| W(\omega - \omega_k, \tau) \right|^2$$
Spectrographic Analysis of Speech
• Narrowband Spectrogram (cont.):
  ◦ Harmonic lines are “resolved”: they appear as horizontal striations in the time-frequency plane of the spectrogram.
  ◦ A long window that covers several pitch periods smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that closely precede a voiced sound are poorly represented).
Spectrographic Analysis of Speech
• Wideband Spectrogram:
  ◦ The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  ◦ Shortening the window widens its Fourier transform (recall the uncertainty principle).
  ◦ This widening causes the window transforms at neighboring harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope |H̃(ω)| due to the vocal tract and glottal flow contributions.
  ◦ From a temporal perspective, since the window length is less than a pitch period, the window “sees” essentially pieces of the periodically occurring sequence ĥ[n].
Spectrographic Analysis of Speech
• Wideband Spectrogram (cont.):
  ◦ For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9):
$$S(\omega,\tau) \approx \beta \left| \tilde{H}(\omega) \right|^2 E[\tau]$$
  ◦ where β is a constant scale factor and where E[τ] is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} \left| x[n,\tau] \right|^2$$
Spectrographic Analysis of Speech
• Wideband Spectrogram (cont.):
  ◦ Shows the formants of the vocal tract in frequency, and also
  ◦ Gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
    Vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
• Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
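The origin of the vertical striations can be checked on a synthetic periodic signal by comparing how much the frame energy fluctuates under a 20-ms and a 4-ms Hamming window. This is a hedged sketch: the pulse-train test signal, the 1-ms hop, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def pulse_train_signal(P, n_samples, decay=0.7):
    """Periodic test signal: impulse train (spacing P) convolved with a short decaying pulse."""
    p = np.zeros(n_samples)
    p[::P] = 1.0
    return np.convolve(p, decay ** np.arange(20))[:n_samples]

def spectrogram(x, fs, window_ms, hop_ms=1.0, nfft=1024):
    """S = |X|^2 with a Hamming window of window_ms milliseconds; shape (n_frames, nfft//2 + 1)."""
    n_w = int(round(window_ms * 1e-3 * fs))
    hop = int(round(hop_ms * 1e-3 * fs))
    w = np.hamming(n_w)
    starts = range(0, len(x) - n_w + 1, hop)
    frames = np.stack([x[s : s + n_w] * w for s in starts])
    return np.abs(np.fft.rfft(frames, nfft, axis=1)) ** 2

def frame_energy_fluctuation(S):
    """Relative variation (std / mean) over time of the energy summed across frequency."""
    e = S.sum(axis=1)
    return e.std() / e.mean()
```

For a 200 Hz pitch at fs = 8000 (P = 40), the 20-ms window always spans four pitch periods and its frame energy is nearly constant, whereas the 4-ms window is shorter than a pitch period and its frame energy rises and falls once per period.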
Figure 3.15
Figure 3.16
Categorization of Speech Sounds


• A sound source can be created with either the
  ◦ vocal folds or
  ◦ a constriction in the vocal tract.
• Classification of speech sounds can also be done from the following perspectives:
  1. The nature of the source:
     Periodic,
     Noisy,
     Impulsive, or
     Combination of the three.
  2. The shape of the vocal tract – place and manner of articulation:
     Place of the tongue hump along the oral tract, and
     The degree of the constriction of the hump.
     The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
• Phoneme – a fundamental distinctive unit of a language.
  ◦ To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  ◦ Different languages contain different phoneme sets.
    Syllables contain one or more phonemes.
    Words are formed from one or more syllables.
    Phrases are concatenations of words.
• If the first two perspectives (nature of the source and shape of the vocal tract) are used to study speech sounds, this is referred to as articulatory phonetics.
• If the last two perspectives (time-domain waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.
Elements of a Language
• One broad classification for the English language is done in terms of:
  ◦ Vowels,
  ◦ Consonants,
  ◦ Diphthongs,
  ◦ Affricates, and
  ◦ Semi-vowels.
• In the next slide, this classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
• Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
• Articulatory features (corresponding to the first two category descriptors) include:
  ◦ Vocal fold state:
    Vibrating, or
    Open
  ◦ Tongue position and height:
    Front,
    Central, or
    Back along the palate.
  ◦ Constriction:
    Partial, or
    Complete
  ◦ Velum state:
    Nasal sound, or
    Not a nasal sound.
Elements of a Language






• In English the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  ◦ 11 in Polynesian
  ◦ 141 in the “click” language of Khoisan
• Rules of a language define which phones can be strung together, and how, to form words.
  ◦ In Italian consonants are not allowed at the end of words.
• A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
• The articulatory properties are influenced by:
  ◦ Adjacent phonemes,
  ◦ Speaking rate,
  ◦ Emphasis in speaking, and
  ◦ Time-varying nature of the articulators.
• The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme:
  ◦ Example: “butter”, “but” and “to”, where /t/ in each word is somewhat different.
• Motor theory of perception – uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels

• Vowels
  ◦ Source: quasi-periodic
    Pitch (not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch)
  ◦ System:
    Each vowel phoneme corresponds to a different vocal tract configuration.
  ◦ Spectrogram:
    The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  ◦ Waveform:
    Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
• In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  ◦ Articulatory differences among speakers are one cause of allophonic variations:
    The place and degree of constriction of the tongue hump, and
    Cross-section and length of the vocal tract,
    => and therefore the vocal tract formants, will vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals

• Nasals:
  ◦ Source:
    Quasi-periodic airflow puffs from the vibrating vocal folds.
  ◦ System:
    The velum is lowered and the air flows mainly through the nasal cavity.
    Because the oral tract is constricted, the sound is radiated at the nostrils.
    Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  ◦ Spectrogram:
    Is dominated by the low resonance of the large volume of the nasal cavity.
    The closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue:
      These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
      Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract.
      Consequently nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives


• There are two broad classes of fricatives:
  ◦ Voiced and
  ◦ Unvoiced
• Source:
  ◦ Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  ◦ The constriction is narrower than with vowels.
  ◦ For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  ◦ For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
• System:
  ◦ The location of the constriction made by the tongue or lips determines which sound is produced:
    Back,
    Center, or
    Front of the oral tract, as well as
    The teeth or lips.
• Spectrogram:
  ◦ Noise-like. Energy is concentrated in higher frequencies.
Example 3.4

• A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as:
u[n] = g[n]*p[n]
  ◦ g[n] is the glottal flow over one cycle,
  ◦ p[n] is an impulse train with pitch period P.
• Simplified model of the voiced fricative output at the lips due to the periodic source:
xg[n] = h[n]*(g[n]*p[n])
  ◦ h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
• The noise source component, the turbulent airflow velocity at the constriction, is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity that has impulse response hf[n]:
xq[n] = hf[n]*(q[n]u[n])
Example 3.4
• We assume in this simplified model that the results of the two airflow sources add:
  x[n] = xg[n] + xq[n]
       = h[n]*u[n] + hf[n]*(q[n]u[n])
• See Exercise 3.10 for special characteristics of x[n].
• Issues that have been ignored:
  ◦ u[n] is modified by the oral cavity.
  ◦ xq[n] can be influenced by the back cavity.
  ◦ Sources of nonlinear effects (distributed sources due to traveling vortices).
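The additive voiced-fricative model x[n] = h[n]*u[n] + hf[n]*(q[n]u[n]) translates directly into a few convolutions. This is a minimal sketch under the same simplifications as the example; the function name, the `noise_gain` parameter, and the choice of Gaussian white noise for q[n] are assumptions added for illustration.

```python
import numpy as np

def voiced_fricative(g, P, h, h_f, n_periods, noise_gain=1.0, seed=0):
    """Simplified voiced fricative: periodic part x_g plus glottal-modulated noise part x_q."""
    rng = np.random.default_rng(seed)
    # Periodic glottal flow u[n] = g[n]*p[n].
    p = np.zeros(n_periods * P)
    p[::P] = 1.0
    u = np.convolve(g, p)[: len(p)]
    # White noise q[n], modulated sample by sample by the glottal flow u[n].
    q = rng.standard_normal(len(u))
    x_g = np.convolve(h, u)[: len(u)]                      # h[n]*u[n]
    x_q = np.convolve(h_f, noise_gain * q * u)[: len(u)]   # hf[n]*(q[n]u[n])
    return x_g + x_q, x_g, x_q, u
```

Because the noise is multiplied by u[n], the noise source is switched off during the closed phase of each glottal cycle, which is what gives the noise component a pitch-synchronous envelope.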
Elements of a Language: Fricatives
• Spectrogram:
  ◦ Unvoiced fricatives are characterized by a “noisy” spectrum, while
  ◦ Voiced fricatives often show both noise and harmonics.
• Waveform:
  ◦ An unvoiced fricative contains only noise,
  ◦ A voiced fricative contains noise superimposed on a quasi-periodic signal.
• Whisper:
  ◦ Forms a class of its own under the general category of consonants.
  ◦ Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives


Plosives form a class of sounds where the constriction is complete, however
brief, followed by a burst of flow. As with fricatives, plosives can be:
 Voiced and
 Unvoiced.
System:
 Constriction can occur at the:
 Front,
 Center, or
 Back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time
duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

 With voiced plosives, the vocal folds vibrate for the duration of all 4 steps. During the
period when the oral tract is closed, we hear a low-frequency vibration due to the
propagation of vocal fold vibrations through the walls of the throat. This activity
is referred to as a “voice bar”.
 After the release of the burst, unlike the unvoiced plosive, there is little or no
aspiration.
 There is a much shorter delay between the burst and the voicing of the vowel onset.
 Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives
 Waveform:
Elements of a Language: Plosives
 Spectrogram:
Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive.
 A voiced plosive is generated with a burst source and can also have a periodic source present
throughout the burst and into the following vowel. Assuming that the burst occurs at time n=0,
we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic
source component is given by:
u[n] = g[n]*p[n]
 g[n] is the glottal flow over one cycle.
 p[n] is an impulse train with pitch period P.
 Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during
its transition from the burst to a following steady vowel.
 This implies that the vocal tract output cannot be obtained by the convolution operator.
 The vocal tract output must thus be computed using the time-varying impulse response concept
introduced in Chapter 2.
 In this simple model, the periodic glottal flow excites a time-varying vocal tract, with impulse
response denoted by h[n,m], while the burst excites a time-varying front cavity beyond a
constriction, with impulse response denoted by hf[n,m].
 h[n,m] and hf[n,m] represent time-varying impulse responses at time n due to a unit sample
applied m samples earlier, at time n-m.
The output then can be written using generalization of the convolution operator:


x[n] = Σ_m h[n,m] u[n-m] + Σ_m hf[n,m] δ[n-m]
(both sums run over m from -∞ to ∞)

We have assumed that two outputs can be linearly combined.
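The generalized convolution of Example 3.5 can be sketched as follows. The sketch assumes causal responses with finite memory M, stored as arrays of shape (N, M) whose row n holds h[n,m] for m = 0..M-1. The function names and this storage layout are illustrative assumptions, not part of the slides.

```python
import numpy as np

def time_varying_output(h, u):
    """x[n] = sum over m of h[n, m] * u[n - m], for m = 0 .. M-1.

    h[n, m] is the response at time n to a unit sample applied
    m samples earlier; u[n - m] is taken as zero for n - m < 0.
    """
    n_samples, memory = h.shape
    x = np.zeros(n_samples)
    for n in range(n_samples):
        for m in range(min(memory, n + 1)):
            x[n] += h[n, m] * u[n - m]
    return x

def voiced_plosive(h, hf, u):
    """Periodic-source output plus burst output, with the burst idealized as delta[n]."""
    burst = np.zeros(len(u))
    burst[0] = 1.0
    return time_varying_output(h, u) + time_varying_output(hf, burst)
```

When every row of h is identical, the system is time-invariant and the result reduces to the ordinary convolution h[n]*u[n]. With the impulse as input, the burst term simply reads out hf[n,n].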
Elements of a Language:
Transitional Speech Sounds

 Diphthongs:
 Vowel-like nature with vibrating vocal folds.
 Do not have a steady vocal tract configuration:
 They are produced by varying the vocal tract smoothly in time between two vowel configurations.
 Characterized by movement from one vowel target to another.
 Examples:
 hide /Y/
 out /W/
 boy /O/
 new /JU/
Elements of a Language:
Transitional Speech Sounds
 Semi-Vowels: Two categories of vowel-like
sounds:
 Glides (/w/ as in “we” and /y/ as in “you”), and
 Liquids (/r/ as in “read”, and /l/ as in “let”).
 Glides:
 Greater constriction of oral tract during the
transition, and
 Greater speed of the oral tract movement,
compared to diphthongs
Figure 3.28 – Liquids & Glides
Elements of a Language:
Transitional Speech Sounds
 Affricates: the counterpart of
diphthongs, consisting of consonant plosive-fricative combinations.
 The difference as compared to fricatives is that
the affricates have:
 A fricative portion preceded by a complete
constriction of the oral cavity
 Formed at the same place as for the plosive.
 Examples:
 /tS/ as in “chew” - unvoiced
 /J/ as in “just” - voiced
Coarticulation
 Vocal fold/vocal tract muscles are “programmed” to seek a
target state or shape, but often the target is never reached:
 Our speech anatomy cannot move to a desired position
instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and
graceful, the brain anticipates the future, and so the
articulators at any time instant are influenced by where they
have been and where they are going.
 Coarticulation refers to the influence of the articulation of
one sound on the articulation of another sound in the same
utterance.

 Coarticulation can occur on different temporal levels:
 Local – articulation of a phoneme is influenced by its adjacent
neighbors or by neighbors close in time:
 “horse” vs. “horseshoe”,
 “sweep” vs. “seep”.
 Global – articulators are influenced by phonemes that occur
some time in the future, beyond the succeeding or nearby
phonemes.
Prosody: The Melody of Speech
 Prosody of a language is defined by the
rules that define changes in speech
extending over more than one phoneme:
 Intonation (change in pitch)
 Amplitude/Energy (loudness)
 Timing (articulation rate or rhythm).
 These rules are followed to convey
different:
 Meaning,
 Stress, and
 Emotion
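Two of the prosodic cues listed above, intonation and energy, can be tracked frame by frame. The slides do not prescribe a measurement method, so the sketch below swaps in a common choice: mean-square energy per frame and a plain autocorrelation pitch estimate. The function names, the 60-400 Hz search range, and the frame sizes are assumptions for illustration, and every frame is assumed to be voiced.

```python
import numpy as np

def frame_energy(frame):
    """Mean-square energy of one frame (a loudness cue)."""
    return float(np.mean(frame ** 2))

def frame_pitch_hz(frame, fs, fmin=60.0, fmax=400.0):
    """Autocorrelation pitch estimate in Hz; assumes the frame is voiced."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:]  # lags 0 .. n-1
    lo = int(fs / fmax)                  # shortest candidate period in samples
    hi = min(int(fs / fmin), n - 1)      # longest candidate period in samples
    lag = lo + int(np.argmax(r[lo:hi + 1]))
    return fs / lag

def prosody_contours(x, fs, frame_len, hop):
    """Return (energy, pitch) arrays with one value per frame."""
    starts = range(0, len(x) - frame_len + 1, hop)
    energy = np.array([frame_energy(x[s:s + frame_len]) for s in starts])
    pitch = np.array([frame_pitch_hz(x[s:s + frame_len], fs) for s in starts])
    return energy, pitch
```

Timing cues such as articulation rate would need segment boundaries and are not covered by this sketch.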
Figure 3.29 - Prosody
Figure 3.30 – Global Coarticulation
Narrowband Spectrogram
Wideband Spectrogram
Utterance Depicted in Previous
slides.
 “Cat and Dogs each hate the other.”
END