Transcript Fourth

Speech and Audio Processing
and Coding (cont.)
Dr Wenwu Wang
Centre for Vision Speech and Signal Processing
Department of Electronic Engineering
[email protected]
http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html
Psychoacoustics
 Psychoacoustics is the study of how humans perceive sound, such as
o Perception of loudness
o Pitch perception
o Space perception
 References
o B. C.J. Moore, An Introduction to the Psychology of Hearing,
Academic Press, 1995.
o D.M. Howard and J. Angus, Acoustics and Psychoacoustics, Focal
Press, 1996.
o W. A. Yost, Fundamentals of Hearing: an Introduction, Academic
Press, 1994.
o R. M. Warren, Auditory Perception, Cambridge Univ. Press, 1999.
Inner Ear Function
 The inner ear consists of the cochlea, which has a snail-like structure.
o It transfers the mechanical vibrations to the movement of the basilar
membrane, which the organ of Corti (consisting of a number of hair
cells) then converts into nerve firings.
o The basilar membrane carries out frequency analysis of input
sounds, and it responds best to high frequencies at the (narrow
and thin) base end, and to low frequencies at the (wide and thick)
apex end.
Inner Ear Function
Figure: (a) the spiral nature of the cochlea; (b) the cochlea unrolled; (c) vertical cross-section through the cochlea; (d) detailed view of the cochlea tube. From: (Howard & Angus, 1996)
Basilar Membrane
Idealised shape of unrolled basilar membrane
From: (Howard & Angus, 1996)
Displacement of Basilar Membrane
Idealised envelope of basilar membrane movement to sounds at five
different frequencies
From: (Howard & Angus, 1996)
‘Place’ Theory of Hearing
 The displacement of the basilar membrane changes as the
frequencies change.
 The basilar membrane is stimulated from the base end which
responds best to high frequencies, and it is important to note that its
envelope of movement for a pure tone (or individual component of a
complex sound) is not symmetrical, but it tails off less rapidly towards
higher frequencies than towards lower frequencies.
 The linear distance measured from the apex to the place of the
maximum basilar membrane displacement is directly proportional to
the logarithm of the input frequency.
Critical Bands
 An illustration of the perceptual changes when playing two tones
simultaneously with the frequency of a pure tone (F1) fixed and the
other (F2) changing.
From: (Howard & Angus, 1996)
Critical Bands (cont)
 The discrimination between two frequencies depends on whether the
basilar membrane displacements are separated or not.
 The frequency difference between two pure tones at which a listener’s
perception changes from rough and separate to smooth and separate is
known as the ‘critical bandwidth’ (CB).
“The critical bandwidth is that bandwidth at which subjective responses
rather abruptly change.” (Scharf, 1970)
 The ‘equivalent rectangular bandwidth’ (ERB) was proposed to use the
notion of critical bandwidth practically. (Moore and Glasberg, 1983)
$$\mathrm{ERB} = \left\{ [6.23 \times 10^{-6} \times f_c^2] + [93.39 \times 10^{-3} \times f_c] + 28.52 \right\}\ \mathrm{Hz}$$
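As a minimal sketch of how the ERB expression above could be evaluated, assuming the centre frequency f_c is given in Hz (the function name `erb_hz` is illustrative and not from the lecture):

```python
def erb_hz(fc_hz: float) -> float:
    """Equivalent rectangular bandwidth (Hz) for a centre frequency in Hz,
    using the Moore and Glasberg (1983) polynomial quoted above."""
    return 6.23e-6 * fc_hz ** 2 + 93.39e-3 * fc_hz + 28.52


if __name__ == "__main__":
    # Print the ERB at a few illustrative centre frequencies.
    for fc in (100.0, 1000.0, 4000.0):
        print(f"fc = {fc:6.0f} Hz -> ERB = {erb_hz(fc):6.1f} Hz")
```

The bandwidth grows with centre frequency, which is consistent with the plot of ERB against centre frequency referred to on the next slide.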
Critical Bands (cont)
 The relationship between the ERB and the centre filter frequency
(Howard & Angus, 1996)
Critical Bands (cont)
 Semitone: the smallest musical interval between musical notes,
defined as the interval between two adjacent notes in a 12-tone scale
(e.g. from C to C#). Hence, it equals 100 cents (i.e. a twelfth of an
octave).
 Octave: the interval between two musical pitches where one has double
the frequency of the other. In other words, one note is 12 semitones
higher or lower than the other. For example, an A4 note is one octave
higher than an A3 note, but one octave lower than an A5 note.
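The relationship between semitones, cents and octaves can be sketched as follows. This assumes equal temperament (each semitone is a frequency ratio of 2^(1/12)) and A4 = 440 Hz; the function names are illustrative only.

```python
import math


def shift_semitones(freq_hz: float, n_semitones: float) -> float:
    """Frequency reached by moving n_semitones up (or down, if negative)."""
    return freq_hz * 2 ** (n_semitones / 12)


def interval_cents(f1_hz: float, f2_hz: float) -> float:
    """Size of the interval from f1 to f2 in cents (1200 cents per octave)."""
    return 1200 * math.log2(f2_hz / f1_hz)


if __name__ == "__main__":
    a4 = 440.0
    print("A5:", shift_semitones(a4, 12))    # one octave (12 semitones) up
    print("A3:", shift_semitones(a4, -12))   # one octave down
    print("one semitone in cents:", interval_cents(a4, shift_semitones(a4, 1)))
```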
Loudness Perception
 The ear’s sensitivity to sounds of different frequencies varies over a wide
range of sound pressure. The minimum sound pressure that can be
detected by the human hearing system around 4kHz is approximately
$10^{-5}$ Pa, while the maximum (i.e. the threshold of pain) is 20Pa.
 For convenience, in practice, the sound pressure level (SPL) is usually
represented in decibels (dB) relative to $2 \times 10^{-5}$ Pa.
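$$\mathrm{dB(SPL)} = 20 \log_{10}\left(\frac{P_m}{P_r}\right)$$
where $P_m$ is the measured sound pressure and $P_r$ is the reference pressure.
 For example, the threshold of hearing at 1 kHz is, in fact, $P_r = 2 \times 10^{-5}$ Pa.
In dB, it equals
$$20 \log_{10}\left(\frac{2 \times 10^{-5}}{2 \times 10^{-5}}\right) = 0\ \mathrm{dB}$$
 The threshold of pain is 20Pa, which in dB equals
$$20 \log_{10}\left(\frac{2 \times 10}{2 \times 10^{-5}}\right) = 120\ \mathrm{dB}$$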
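The two worked values above can be checked with a short sketch. It assumes the reference pressure of 2 × 10^-5 Pa stated on this slide; the names `spl_db` and `P_REF_PA` are illustrative.

```python
import math

P_REF_PA = 2e-5  # reference pressure in Pa (threshold of hearing at 1 kHz)


def spl_db(pressure_pa: float, p_ref_pa: float = P_REF_PA) -> float:
    """Sound pressure level in dB for a measured pressure in Pa."""
    return 20 * math.log10(pressure_pa / p_ref_pa)


if __name__ == "__main__":
    print("threshold of hearing:", spl_db(2e-5), "dB")
    print("threshold of pain:   ", round(spl_db(20.0), 1), "dB")
```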
Loudness Perception (cont.)
 The perceived loudness of an acoustic sound is related to its
amplitude (but not a simple one-to-one relationship), as well as the
context and nature of the sound.
 As the sensitivity of our hearing system varies as the frequency
changes, it is possible for a sound with a larger pressure amplitude to
be heard as quieter than a sound with a lower pressure amplitude (for
example, if they are at different frequencies). [recall the equal
loudness contour of the human auditory system shown in the first
lecture]
Demos for Loudness Perception
 Resources: Audio Box CD from Univ. of Victoria

Decibels vs Loudness
Starting with a 440Hz tone (i.e. note A4), then it is reduced
1dB each step
Starting with a 440Hz tone (i.e. note A4), then it is reduced
3dB each step
Starting with a 440Hz tone (i.e. note A4), then it is reduced
5dB each step

Intensity vs Loudness
Various frequencies played at a constant SPL
A reference tone is played and then the same tone is played
5dB higher; followed by the reference tone, and then the tone
8dB higher and finally the reference tone and then the one
10dB higher
Pitch Perception
Pitch
 What is pitch? Pitch
• is “the attribute of auditory sensation in terms of which sounds may be
ordered on a musical scale extending from low to high” (American
Standards Association, 1960)
• is a “subjective” attribute, and cannot be measured directly. Therefore, a
specific pitch value is usually given as the frequency of a pure tone that
has the same subjective pitch as the sound. In other words, the
measurement of pitch requires a human listener (the “subject”) to make a
perceptual judgement. This is in contrast to the measurement in the
laboratory of, for example, the fundamental frequency of a complex tone,
which is an “objective” measurement. (Howard & Angus, 1996)
• is related to the repetition rate of the waveform of a sound; it therefore
corresponds to the frequency of a pure tone and the fundamental
frequency of a complex tone. In general, sounds having a periodic
acoustic pressure variation with time are perceived as pitched sounds, and
those with a non-periodic acoustic pressure waveform as non-pitched sounds.
(Howard & Angus, 1996)
Pitch
 Comparison of pitched and non-pitched sounds (Howard & Angus, 1996)
                              Pitched                          Non-pitched
Waveform (time domain)        Periodic (regular repetitions)   Non-periodic (no regular repetitions)
Spectrum (frequency domain)   Line (harmonic components)       Continuous (no harmonic components)
Pitch
 Examples of pitched (see the figures in “Musical Notes and its
Fundamental Frequencies”) and non-pitched sounds (see the figure
below, the waveform and spectrum of a drum being brushed, Howard
& Angus, 1996)
Existing Pitch Perception Theories
 ‘Place’ theory
 Spectral analysis is performed on the stimulus in the inner ear,
different frequency components of the input sound excite different
places or positions along the basilar membrane, and hence
neurones with different centre frequencies.
 ‘Temporal’ theory
 Pitch corresponds to the time pattern of the neural impulses
evoked by that stimulus. Nerve firings tend to occur at a particular
phase of the stimulating waveform, and thus the intervals between
successive neural impulses approximate integral multiples of the
period of the stimulating waveform.
Place Theory
Three methods are commonly used for finding the value of f0 based on a
place analysis of the frequency components of the input sound:
 Method 1: locate the f0 component itself.
 Method 2: find the minimum frequency difference between adjacent
harmonics, i.e. (n+1)*f0 – n*f0 = f0.
 Method 3: find the highest common factor of the frequency
components that are present in the input sound.
Place Theory (cont)
 Method 1:
• Suggests that the pitch of a sound corresponds to the place
stimulated by the lowest frequency component, i.e. fundamental
frequency f0.
• Assumes that f0 is always present in the sound. For example, as
stated by Ohm: “a pitch corresponding to a certain frequency can
only be heard if the acoustic wave contains power at that
frequency”.
 Exceptional case:
• Schouten (1940) demonstrated that even after removing the f0
component from a pulse wave, its pitch remained the same.
• Therefore, f0 doesn’t have to be present for pitch perception. Also,
the lowest frequency component is not the basis for pitch
perception.
Place Theory (cont)
 Method 2:
• Suggests that whether or not the fundamental frequency f0 is
present, some adjacent harmonics, provided that they exist, should
be used as a basis for pitch perception.
• For most musical sound, adjacent harmonics are indeed present.
 Exceptional case:
• As shown in the figure below, when f0 is present (or absent), the
differences between adjacent frequencies are f0, 2f0, 2f0, etc. (or
3f0, 2f0, 2f0, etc.), while the perceived pitch does not change.
(Howard & Angus, 1996)
Place Theory
 Method 3:
• The highest common factor is the highest value appearing in all rows
of the place analysis table below, where as an example, f0 = 100Hz.
• It can address the exceptional cases in both Method 1 and Method 2.
(Howard & Angus, 1996)
Place Theory
 Method 3:
• Another example shown by Schouten used the analysis table
to interpret pitch perception for a non-harmonic sound. For a sound
whose component frequencies were 1040Hz, 1240Hz and 1440Hz,
the pitch was found to be approximately 207Hz. Using Method
2, the pitch would be the spacing between these components, and
hence 200Hz.
• Using the processing table (shown on the next page), the highest
common factor would be approximately 207Hz, which is the average
of 208Hz, 207Hz and 206Hz, the values obtained by treating the
components as the 5th, 6th and 7th harmonics respectively. The pitch
perceived in such a situation is referred to as “residue pitch”, “pitch
of the residue”, or “virtual pitch”. Strictly, the fundamental frequency
of these components is 40Hz, of which they are the 26th, 31st and 36th
harmonics respectively. It seems that the pitch found by
the auditory system is based on the adjacent harmonics that are
present in these frequencies.
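The place analysis table of Method 3 can be approximated in code. The sketch below is only one possible reading of the method, not the procedure used by Howard & Angus: it divides the lowest component by 1, 2, 3, ... to form candidate pitches, and accepts the highest candidate for which every other component, divided by its nearest integer, agrees within a tolerance. The function name, the 2% tolerance and the limit of 10 rows are assumptions made for illustration.

```python
def residue_pitch(components_hz, max_harmonic=10, tolerance=0.02):
    """Approximate highest common factor of a set of component frequencies.

    Returns the mean of the matching table entries, or None if no candidate
    up to max_harmonic fits within the relative tolerance.
    """
    comps = sorted(components_hz)
    lowest = comps[0]
    for n in range(1, max_harmonic + 1):        # highest candidates first
        candidate = lowest / n
        entries = [candidate]
        for f in comps[1:]:
            m = max(1, round(f / candidate))    # nearest harmonic number
            entries.append(f / m)
        if all(abs(v - candidate) <= tolerance * candidate for v in entries):
            return sum(entries) / len(entries)
    return None


if __name__ == "__main__":
    # Harmonic sounds with f0 = 100 Hz, with the fundamental removed.
    print(residue_pitch([200, 300, 400]))       # adjacent harmonics
    print(residue_pitch([300, 500, 700]))       # odd harmonics only
    # Schouten's non-harmonic example from the slide.
    print(round(residue_pitch([1040, 1240, 1440])))
```

With the 2% tolerance, Schouten's components are matched as the 5th, 6th and 7th harmonics and the result is close to the 207Hz quoted above; a much tighter tolerance would instead find the exact common factor of 40Hz if `max_harmonic` were raised to 26 or more.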
Place Theory
(Howard & Angus, 1996)
Problems with the Place Theory
 Although it provides a basis for understanding how f0 is found in terms of
frequency analysis, it does not explain (Howard & Angus, 1996):
• The discrimination of frequency difference in pitch perception. [To discuss]
• The pitch perception of sounds with frequency components that cannot be
resolved by the place mechanism of the basilar membrane. [In general, no
harmonic above about the 5th to 7th is resolved for any fundamental frequency,
because the critical bandwidth at the centre frequencies of these harmonics
is greater than the fundamental frequency.]
• The pitch perceived for some sounds which have non-harmonic (i.e.
continuous) spectra. [For example, most listeners would rate ‘ss’ in “sea” as
having a higher pitch than ‘sh’ in “shell”, as the energy is biased
more towards the lower frequencies for ‘sh’, with a peak around 2.5kHz,
compared with a peak around 5kHz for ‘ss’. Figure shown on the next page.]
• Pitch perception for sounds with a fundamental frequency less than 50Hz
[This is because the pattern of vibration on the basilar membrane does not
seem to change in that region.]
‘ss’ versus ‘sh’
Frequency Discrimination
 The frequency difference limen (DL), sometimes called the
just noticeable difference (JND), is the smallest detectable change in
frequency. Two methods have been used to measure the DL:
 DLF - The subject is asked to judge which of two frequencies has
higher pitch. This method was used by Henning (1970), Moore
(1973), etc. It was found that expressed in Hz, the change is
smallest at low frequencies, and increases monotonically with
increasing frequencies; expressed as a proportion of centre
frequency, it tends to be smallest for middle frequencies, and larger
for very high and very low frequencies.
 FMDL - Tones which are frequency modulated (FM) at a low rate
(typically 2-4Hz) are used for the measurement. This method was
used by Shower & Biddulph (1931). FMDL seems to vary less with
frequency than DLF, and both get smaller as the sound level
increases.
Frequency Discrimination
 The frequency discrimination thresholds change with the centre
frequencies, plotted as log(threshold) versus square root of centre
frequency below:
Figure: Frequency discrimination thresholds measured by several different authors; all measured DLFs except S & B (Shower & Biddulph), who measured FMDLs (figure first published by Wier et al., 1977, and reproduced in Moore, 1995)
Temporal Theory
 This theory is based on the fact that the waveform of an acoustic signal with a
strong pitch is periodic.
 This theory suggests that it is the detailed nature of the actual waveform that
excites the different places along the basilar membrane. Therefore, it depends
on the timing of neural firings generated in the organ of Corti, in response to
vibrations of the basilar membrane.
 It can be simulated by a bank of band-pass filters whose centre frequencies
and bandwidths vary according to the critical bandwidth of the human hearing
system.
 The nerve fibres fire at all places along the basilar membrane, and a given
nerve fibre may only fire at one phase or instant in each cycle of the
stimulating waveform. This process is known as phase locking.
 Due to phase locking, the time between firings for any particular nerve will
always be an integer multiple of the period of the stimulus. At each place,
a number of nerves are involved.
(Howard & Angus, 1996)
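Phase locking can be illustrated with a toy simulation. This is a sketch under simplifying assumptions that are not part of the lecture: a single nerve fibre either fires or stays silent in each cycle with a fixed probability, and when it fires it does so at exactly the same phase. The function names and the firing probability are hypothetical.

```python
import random


def simulate_firing_times(f0_hz, n_cycles=200, fire_probability=0.3, seed=0):
    """Firing times (s) of one idealised nerve fibre locked to phase zero."""
    rng = random.Random(seed)
    period = 1.0 / f0_hz
    # The fibre skips cycles at random, but always fires at the same phase.
    return [k * period for k in range(n_cycles) if rng.random() < fire_probability]


def intervals_in_periods(firing_times, f0_hz):
    """Inter-firing intervals expressed as a number of stimulus periods."""
    return [(t2 - t1) * f0_hz for t1, t2 in zip(firing_times, firing_times[1:])]


if __name__ == "__main__":
    f0 = 261.6  # fundamental of the violin C4 used in the simulation slides
    times = simulate_firing_times(f0)
    print([round(x) for x in intervals_in_periods(times, f0)[:10]])
```

Although the cycles in which the fibre fires are random, every interval comes out as a whole number of periods, which is the property the temporal theory relies on.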
Simulation of Temporal Theory
 Band-pass filtering of note C4 played on a violin, whose f0 is 261.6Hz.
(Howard & Angus, 1996)
Simulation of Temporal Theory
 The first six harmonics (around 260, 520, 780, 1040, 1300 and 1560Hz) are
well resolved by the band-pass filters, and therefore can be explained by the
place theory.
 The output waveforms of filters whose centre frequencies lie above the sixth
harmonic are not sinusoidal, since these harmonics are not resolved individually:
the filter bandwidth is greater than the fundamental frequency.
 When two components close in frequency are combined, they produce a beat
waveform. If both components are adjacent harmonics of some fundamental
frequency, the beat frequency is equal to f0, as shown in the filter outputs
above 1.5kHz in the figure on the previous page.
 The minimum time between the firings (i.e. 1 period of the stimulus) can be
inferred from the filter output (which is the period of the lower harmonics and
the period of the input wave itself).
 Note that, although the nerve does not necessarily fire in every cycle, and the
cycle in which it fires tends to be random, due to phase locking, the time
between the firings for any particular nerve will always be an integer multiple of
periods of the stimulating waveform.
(Howard & Angus, 1996)
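The beating of two unresolved adjacent harmonics can be checked numerically. The sketch below assumes two equal-amplitude sine components at n·f0 and (n+1)·f0 (an idealisation of a filter output, not the violin data). Their sum equals 2·sin((2n+1)π·f0·t)·cos(π·f0·t), so its envelope is |2·cos(π·f0·t)|, which repeats every 1/f0 seconds.

```python
import math


def two_harmonics(t, f0_hz, n):
    """Sum of the n-th and (n+1)-th harmonics of f0 at time t (s)."""
    return (math.sin(2 * math.pi * n * f0_hz * t)
            + math.sin(2 * math.pi * (n + 1) * f0_hz * t))


def beat_envelope(t, f0_hz):
    """Envelope of the summed pair; it repeats with period 1/f0."""
    return abs(2 * math.cos(math.pi * f0_hz * t))


if __name__ == "__main__":
    f0 = 261.6
    period = 1.0 / f0
    for k in range(4):
        t = 0.3 * period + k * period   # same point in successive beat cycles
        print(f"t = {t * 1e3:6.3f} ms, envelope = {beat_envelope(t, f0):.4f}")
```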
Nerve Firing
 An illustration of
nerve firing along the
basilar membrane for
the first 16 harmonics
of an input sound.
(Howard & Angus, 1996)
Problems with Temporal Theory
 Although it provides a basis for understanding how the fundamental period
could be found from an analysis of the timing of the nerve firing from all places
across the basilar membrane, it couldn’t explain the following:
• Pitch perception of sounds whose f0 is higher than 5kHz. [This is because
phase locking breaks down above 5kHz.]
• In practice, this means there will be only approximately two harmonics to
be analysed, due to the limitation of the human hearing system (i.e. the
upper limit 20kHz).
(Howard & Angus, 1996)
Contemporary Theory
 Neither theory alone fully explains the mechanism of human pitch
perception. Combining the two benefits the analysis of pitch
perception, as in the model proposed by Moore (1982) for complex tones,
shown below.
(Howard & Angus, 1996)
Musical Intervals (Melody)
 A single tone evokes a pitch; a sequence of tones with appropriate
frequencies can evoke the perception of a musical interval (or melody).
 A sequence of tones below 5kHz evokes a sense of melody, while a
sequence of tones above 5kHz does not evoke a clear sense of
melody, although different frequencies can be heard. (Moore, 1989)
 For example, two tones which are separated in frequency by an
interval of one octave (i.e. one has twice the frequency of the other)
sound similar. Hence, they are judged to have the same name on the
musical scale (for example, C or D).
 It appears that the musical interval of an octave is only clearly
perceived when both tones are below 5kHz. Above 5kHz, a sequence
of pure tones does not produce a clear sense of melody, as shown by
Attneave and Olson (1971).
Pitch versus Sound Level
 The pitch of a pure tone is determined mainly by its frequency,
but also, slightly, by its sound level.
 On average, the pitch of tones below about 2kHz decreases with
increasing sound level, while the pitch of tones above about 4kHz
increases with increasing sound level. (Moore, 1989)
 For tones between 1 and 2kHz, changes in pitch with level are
generally less than 1%, while for tones of lower and higher
frequencies, the changes can be larger (up to 5%). (Verschuure and
van Meeteren, 1975; Moore, 1989)
Musical Notes
 Notes played by musical instruments have different pitches.
Musical Note and its Fundamental
Frequency (Waveform of A4)
(Howard &
Angus, 1996)
$$f_0 = \frac{1}{T} = \frac{1}{2.27\ \mathrm{ms}} \approx 440.5\ \mathrm{Hz} \approx \mathrm{A4}\ (440\ \mathrm{Hz})$$
Musical Note and its Harmonics
(Spectrum of A4)
Musical Note and its Harmonics
 The shape of the waveform and the spectrum for each of the notes
played by the four different instruments shown on the previous page is
different, even though they are all perceived as note A4 (i.e. they have the
same fundamental frequency). It is the so-called “timbre” that
distinguishes the four different musical instruments.
 The frequency components of notes produced by any pitched
instruments are called harmonics which are integer multiples of the
fundamental frequency f0. Therefore, the first harmonic is the
fundamental frequency f0, and the 2nd harmonic is 2f0, and the third is
3f0, etc.
 Another term that is also used by many authors is “overtones”. The
first overtone refers to the first frequency component that is above f0,
which is the second harmonic, i.e. 2f0. For example, for the note A4
played by violin, f0=440.5Hz, the first harmonic is therefore 440.5Hz,
and the first overtone is 881.0Hz.
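The numbering of harmonics and overtones differs by one, which a short sketch makes explicit. The function names are illustrative; the value f0 = 440.5Hz is the violin A4 from the text above.

```python
def harmonic_hz(f0_hz: float, n: int) -> float:
    """Frequency of the n-th harmonic (n = 1 is the fundamental itself)."""
    return n * f0_hz


def overtone_hz(f0_hz: float, k: int) -> float:
    """Frequency of the k-th overtone, i.e. the (k + 1)-th harmonic."""
    return (k + 1) * f0_hz


if __name__ == "__main__":
    f0 = 440.5
    for n in range(1, 5):
        print(f"harmonic {n}: {harmonic_hz(f0, n):7.1f} Hz")
    print(f"first overtone: {overtone_hz(f0, 1):7.1f} Hz")
```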
Demos for Pitch Perception
 Resources: Audio Box CD from Univ. of Victoria
These three demos show how pitch is perceived for signals of different
time duration. In each track, short bursts of sound are played. Three
different pitches are played in these three tracks.
Space Perception
Sound Localisation
 Sound localisation refers to judgements of the direction and distance
of a sound source, usually achieved through the use of two ears
(binaural hearing).
 Helps humans and animals to locate the sounds of threats and to
avoid such threats.
 Helps humans and animals direct visual attention.
 Helps humans and animals focus attention on sounds from specific
directions by excluding other interfering sounds in a noisy and
reverberant environment.
 Blind people, in particular, can use information from
echoes and reflections to estimate the distance of sound sources.
 Although binaural hearing is crucial for sound localisation, monaural
perception is similarly effective in some cases, such as in the detection
of signals in quiet, intensity discrimination, and frequency
discrimination.
Localisation Cues
 There are two important cues that enable us to
localise sounds:
 interaural time difference
 interaural intensity difference
Interaural Time Difference (ITD)
 The two ears are separated by the dimension of the head. For an
average head, the distance between the ears is about 18cm. As such,
there will be a time difference between the sound reaching the ear
near the source and the one further away. Such difference is called
interaural time difference (ITD).
 A simple and rough model to calculate the ITD is given below, which
assumes that the extra distance the sound travels around the head can be ignored:
$$\Delta t = \frac{d \sin(\theta)}{c}$$
where
$\Delta t$ - ITD (in s)
$d$ - Distance between the ears (in m)
$\theta$ - The angle of arrival of the sound from the median (in radians)
$c$ - Sound speed (in m/s)
Interaural Time Difference (ITD)
(Howard & Angus, 1996)
Interaural Time Difference (ITD)
 However, in reality, the sound has to travel around the head in order to
reach the ear.
 A more accurate model to calculate the ITD is given below, which
assumes that the head is spherical:
$$\Delta t = \frac{r (\theta + \sin(\theta))}{c}$$
where
$\Delta t$ - ITD (in s)
$r$ - Half the distance between the ears (in m)
$\theta$ - The angle of arrival of the sound from the median (in radians)
$c$ - Sound speed (in m/s)
 Based on this equation, it can be shown that the maximum ITD
occurs at 90 degrees and, for the average head diameter, is
$6.73 \times 10^{-4}\ \mathrm{s} = 673\ \mu\mathrm{s}$.
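Both ITD models can be compared with a short sketch. It assumes the values used elsewhere in these slides (ear separation d = 0.18 m, so r = 0.09 m, and a sound speed of 344 m/s); the function names are illustrative.

```python
import math

SPEED_OF_SOUND = 344.0  # m/s, the value used later in these slides


def itd_simple(theta_rad, d_m=0.18, c=SPEED_OF_SOUND):
    """ITD (s) ignoring the path around the head: d * sin(theta) / c."""
    return d_m * math.sin(theta_rad) / c


def itd_spherical(theta_rad, r_m=0.09, c=SPEED_OF_SOUND):
    """ITD (s) for a spherical head: r * (theta + sin(theta)) / c."""
    return r_m * (theta_rad + math.sin(theta_rad)) / c


if __name__ == "__main__":
    for deg in (0, 30, 60, 90):
        theta = math.radians(deg)
        print(f"{deg:3d} deg: simple = {itd_simple(theta) * 1e6:6.1f} us, "
              f"spherical = {itd_spherical(theta) * 1e6:6.1f} us")
```

At 90 degrees the spherical model gives roughly 673 µs, while the simple model gives a smaller value because it ignores the extra path around the head.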
Interaural Time Difference (ITD)
(Howard & Angus, 1996)
ITD as a Function of Angle
(Howard & Angus, 1996)
ITD and IPD
 The ear appears to use the interaural phase difference (IPD) caused
by the ITD in the two waves to resolve the sound direction.
 The phase difference is given by:
$$\Phi = \frac{2 \pi f r (\theta + \sin(\theta))}{c}$$
where
$\Phi$ - The phase difference between the two ears (in radians)
$r$ - Half the distance between the ears (in m)
$\theta$ - The angle of arrival of the sound from the median (in radians)
$f$ - The frequency (in Hz)
$c$ - Sound speed (in m/s)
 When the phase difference is greater than 180 degrees, there will be an
unresolvable ambiguity in the sound direction, as the source could be at
that angle to either the left or the right.
ITD and IPD (cont)
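 The maximum frequency (without phase ambiguity), at a particular
angle, is reached when the phase difference equals $\pi$ radians, and is given by
$$f_{\max} = \frac{1}{2 \Delta t} = \frac{c}{2 r (\theta + \sin(\theta))}$$
 For an angle of $\theta = 90^\circ$ and the average head size $r = 0.09$ m (with $c = 344$ m/s):
$$f_{\max} = \frac{344}{2 \times 0.09 \times (\pi/2 + \sin(\pi/2))} \approx 743\ \mathrm{Hz}$$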
 The ambiguous frequency limit would be higher at smaller angles. For
the frequencies higher than the maximum frequency, other cues are
used by human ears to resolve the direction of sound sources, such as
the interaural intensity difference (IID).
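The frequency limit can be evaluated at several angles with a small sketch. It assumes the spherical-head ITD model, r = 0.09 m and c = 344 m/s, and sets the phase difference to π radians; the function name is illustrative.

```python
import math


def max_unambiguous_frequency(theta_rad, r_m=0.09, c=344.0):
    """Highest frequency (Hz) whose interaural phase difference stays
    within pi radians at the given angle (theta must be > 0)."""
    if theta_rad <= 0:
        raise ValueError("theta must be positive; the ITD is zero on the median plane")
    return c / (2 * r_m * (theta_rad + math.sin(theta_rad)))


if __name__ == "__main__":
    for deg in (15, 45, 90):
        f_max = max_unambiguous_frequency(math.radians(deg))
        print(f"{deg:3d} deg: f_max = {f_max:7.0f} Hz")
```

The limit rises as the angle shrinks, in line with the remark above that the ambiguous frequency limit is higher at smaller angles.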
Interaural Intensity Difference (IID)
 Due to the shading effect of the head, the intensity of the sound
reaching each ear is also different. Such difference is called interaural
intensity difference (IID).
 When the sound source is on the median plane, the sound level at
each ear is equal, while the level at one ear progressively reduces,
and increases at the other, as the sources move away from the
median plane.
 The shading effect of the head is difficult to calculate, however,
experiments seem to show that the intensity ratio between the two
ears varies sinusoidally from 0dB up to 20dB with the sound direction
angles, for various frequencies.
 The shading effect is not significant unless the head is at least
about one third of a wavelength in size. For a head with a diameter of
18cm, this corresponds to a minimum frequency (Howard & Angus,
1996) of:
$$f_{\min}(\theta = \pi/2) = \frac{1}{3}\left(\frac{c}{d}\right) = \frac{1}{3}\left(\frac{344\ \mathrm{m/s}}{0.18\ \mathrm{m}}\right) \approx 637\ \mathrm{Hz}$$
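The minimum frequency for a significant shading effect follows directly from the one-third-wavelength rule. A minimal sketch, assuming the same head diameter and sound speed as the equation above (the function name is illustrative):

```python
def min_shading_frequency(d_m: float = 0.18, c: float = 344.0) -> float:
    """Frequency (Hz) at which the head diameter equals one third of a wavelength."""
    return c / (3 * d_m)


if __name__ == "__main__":
    print(f"f_min = {min_shading_frequency():.0f} Hz")
```

A smaller head diameter gives a higher minimum frequency, since shorter wavelengths are then needed before the head casts an appreciable acoustic shadow.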
Shading Effect in IID
(Howard & Angus, 1996)
IID as a Function of Angle and
Frequency
(Data from Gulick, 1971, reproduced from Howard & Angus, 1996)
ITD and IID Trading
 Both ITD and IID are used for the perception of sound source
directions, while in fact it is possible that one cue could be confused
(or cancelled) by the other. This is known as ITD and IID trading.
 The time delay versus intensity trading is effective over the range of
delay times which correspond to the maximum interaural time delay of
0.673ms.
 For delays between 0.673ms and 30ms, a small intensity difference
will not alter the perceived direction of the sound source. However, if
the delayed sound’s intensity is more than 12dB greater than that of the
earlier-arriving sound, we will perceive the direction of the sound to be
towards the delayed sound.
 For the delays of more than 30ms, the delayed sound is perceived as
an echo.
 Therefore, it is possible to determine the direction of the sound source
based purely on ITD or IID.
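The three delay regions described above can be summarised as a simple decision rule. This is only a rough sketch of the slide's wording, not a perceptual model: the thresholds (0.673ms, 30ms and 12dB) are taken from the text, while the function name, the return labels and the treatment of the exact boundary values are assumptions.

```python
MAX_ITD_S = 0.673e-3        # maximum interaural time delay from the slides
ECHO_THRESHOLD_S = 30e-3    # delays beyond this are heard as an echo
LEVEL_OVERRIDE_DB = 12.0    # level advantage needed by the delayed sound


def perceived_source(delay_s: float, delayed_minus_earlier_db: float) -> str:
    """Classify what is perceived for a delayed copy of a sound.

    delayed_minus_earlier_db is the level of the delayed sound minus the
    level of the earlier-arriving sound, in dB.
    """
    if delay_s > ECHO_THRESHOLD_S:
        return "echo"
    if delay_s > MAX_ITD_S:
        # Direction follows the first arrival unless the delayed sound
        # is much more intense.
        if delayed_minus_earlier_db > LEVEL_OVERRIDE_DB:
            return "towards delayed sound"
        return "towards earlier sound"
    return "time-intensity trading"


if __name__ == "__main__":
    print(perceived_source(0.3e-3, 3.0))
    print(perceived_source(10e-3, 3.0))
    print(perceived_source(10e-3, 15.0))
    print(perceived_source(50e-3, 0.0))
```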
ITD and IID Trading
(Data from Madsen, 1990, reproduced from Howard & Angus, 1996)