Transcript Fifth

Speech and Audio Processing
and Coding (cont.)
Dr Wenwu Wang
Centre for Vision Speech and Signal Processing
Department of Electronic Engineering
http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html
1
Timbre Perception
(Ack. S. Zielinski)
2
What is Timbre?
 According to the American Standards Association, it is defined as “that
attribute of sensation in terms of which a listener can judge that two
sounds having the same loudness and pitch are dissimilar”.
 Musically, it is “the quality of a musical note which distinguishes
different types of musical instruments.”
 It can be defined as “everything that is not loudness, pitch or spatial
perception”.
• Loudness <-> Amplitude (frequency dependent)
• Pitch <-> Fundamental Frequency
• Spatial perception <-> IID, IPD
• Timbre <-> ???
3
Physical Parameters
 Timbre relates to:
• Static spectrum (e.g. harmonic content of spectrum)
• Envelope of spectrum (e.g. the peaks in the LPC spectrum, which
correspond to formants)
• Dynamic spectrum (time evolving)
• Phase
• …
4
Static Spectrum
5
Spectrum Envelope
Formants affect the sensation of timbre
6
Spectrum Envelope (cont)
Formants determine not only timbre, but also the
recognition of vowels
7
Spectrum Envelope (cont)
This figure shows what the spectral envelope
of a trumpet sound looks like
8
Spectrum Envelope (cont)
The spectral envelopes of the flute (upper figure)
and the piano (lower figure) show that the envelope
differs between musical instruments.
9
Dynamic Spectrum
This figure shows what the spectral envelope
of a trumpet sound looks like
10
Phase
The above two magnitude spectra are identical, while their
waveforms are totally different. The timbres of these two sounds
are almost identical, so phase affects timbre only to a very small
extent. This also suggests that human hearing is not very
sensitive to phase differences.
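The point about phase can be checked numerically. The following is a minimal sketch, assuming a signal made of five equal-amplitude harmonics; the phase values and the helper names `harmonic_wave` and `dft_magnitudes` are hypothetical and chosen only for illustration. It builds two waveforms that differ only in the phases of their harmonics and compares their magnitude spectra.

```python
import cmath
import math


def harmonic_wave(phases, n_samples=64):
    """One period of a sum of equal-amplitude harmonics 1..len(phases)."""
    return [
        sum(math.cos(2 * math.pi * (k + 1) * n / n_samples + ph)
            for k, ph in enumerate(phases))
        for n in range(n_samples)
    ]


def dft_magnitudes(x):
    """Magnitudes of a naive O(N^2) DFT, sufficient for a short demo signal."""
    n_samples = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples)))
        for k in range(n_samples // 2 + 1)
    ]


# Hypothetical phase sets: all-zero phases versus arbitrary fixed phases.
zero_phase = [0.0] * 5
shifted_phase = [0.0, 1.3, 2.9, 0.7, 4.1]

wave_a = harmonic_wave(zero_phase)
wave_b = harmonic_wave(shifted_phase)
mag_a = dft_magnitudes(wave_a)
mag_b = dft_magnitudes(wave_b)

# The magnitude spectra agree up to rounding error; the waveforms do not.
max_mag_diff = max(abs(a - b) for a, b in zip(mag_a, mag_b))
max_wave_diff = max(abs(a - b) for a, b in zip(wave_a, wave_b))
print(f"largest magnitude-spectrum difference: {max_mag_diff:.2e}")
print(f"largest waveform difference: {max_wave_diff:.2f}")
```

The sketch only shows that the two signals are physically different while sharing a magnitude spectrum. Whether they sound alike is a perceptual question that has to be answered by listening.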
11
Demos for Timbre Perception
 Resources: Audio Box CD from Univ. of Victoria
Examples of differences in timbres
12
Auditory Masking
13
What is masking?
 Masking: One sound is made inaudible by another one.
• Simultaneous masking refers to the situation where one sound
(the signal) is made inaudible by another simultaneous sound (the
masker). In other words, the signal and the masker occur at the
same time. It is also known as frequency masking or spectral
masking: two sounds that share the same frequency band can be
perceived clearly when presented separately, but not when
presented simultaneously, such as tones at 440 Hz and 450 Hz.
• Non-simultaneous masking refers to the situation where one
sound (the signal) is made inaudible by another sound (the
masker) that precedes or follows the signal. In other words, they
are not present at the same time.
14
What is masking? (cont)
15
Simultaneous Masking
On-frequency masking
The masker and the signal are within the same auditory filter band, and the
louder sound masks the quieter one.
Off-frequency masking
The masker and the signal are in different frequency bands. The masking
effect is weaker than in on-frequency masking.
(Source: figures from wikipedia, 2010)
16
Simultaneous Masking (cont)
To achieve the same masking effect as in on-frequency masking, the level of
the masker needs to be greater in off-frequency masking.
In off-frequency masking, the amount by which the masker raises the threshold
of the signal is much less than in on-frequency masking. However, it does
have some masking effect on the signal, as shown in the above figure.
(Source: figures from wikipedia, 2010)
17
Demos for Simultaneous Masking
(Frequency Domain Masking)
 Resources: Audio Box CD from Univ. of Victoria
A single tone is played, followed by the same tone and a higher
frequency tone. The higher frequency tone is reduced in intensity
first by 12 dB, then by steps of 5 dB. The sequence is repeated
twice. The second time the frequency separation between the
tones is increased.
Pure tones mask higher frequencies better than lower frequencies.
This demo tries to mask high frequencies.
Pure tones mask higher frequencies better than lower frequencies.
This demo tries to mask low frequencies.
This demo shows that a tone of greater intensity masks a broader range
of tones than a tone of lower intensity. A single tone is played,
followed by the same tone and a higher frequency tone. The higher
frequency tone is reduced in intensity first by 10 dB, then by steps of
3 dB. The sequence is repeated twice, the second time with the
intensity of the single tone increased by 28 dB.
18
The Amount of Masking
In the example above, the amount of masking is 16 dB, which is the
difference between the masked threshold and the unmasked threshold. Note
that the threshold for a masked signal is raised compared with the
threshold for the same signal when it is not masked (for example, when
it is heard in a quiet environment).
(Source: figures from wikipedia, 2010)
19
Masking Interprets Frequency
Resolution of Auditory System
 Frequency selectivity, also known as frequency resolution, refers
to the ability of the human auditory system to separate the different
frequency components of a complex sound. Recall the concept of the
critical bandwidth: two sounds with sufficiently different frequencies
(pitches) can be heard as two separate tones.
 It is achieved by the filtering process of the cochlea,
where the complex sound is (band-pass) filtered and decomposed into
individual frequency components (sinusoids), which are then coded
independently in the auditory nerve.
 Masking is usually used to quantify and characterise the frequency
resolution of the auditory system. The auditory system would not be
able to separate the two frequencies if the sound of one frequency is
masked by that of the other. Therefore, masking explains the limits of
frequency resolution of the human auditory system.
20
Use Masking to Estimate
the Critical Band
 The original experiment by Fletcher (1940) measured the threshold for
detecting a sinusoidal signal as a function of the bandwidth of a band-pass
noise masker.
 Conditions: The noise was centred at the signal frequency. Noise power
density was constant.
 Findings: At first, the threshold increases as the noise bandwidth increases.
However, it flattens off with further increases in noise bandwidth. This is due
to the critical bandwidth: once the noise bandwidth exceeds the bandwidth of the
auditory filter, the threshold ceases to increase even though the total noise
power increases.
 The power-spectrum model of masking assumes (Moore, 1995):
 The auditory system is a bank of linear overlapping band-pass filters.
 The listener uses one filter with a centre frequency close to that of the
signal to detect the signal.
 The signal is only masked by the noise component that passes through the
auditory filter.
 The threshold corresponds to a certain signal-to-noise (masker) ratio.
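The band-widening result follows directly from these assumptions. Here is a minimal sketch, assuming an idealised rectangular auditory filter; the critical bandwidth of 160 Hz, the noise density of 20 dB/Hz and the 0 dB threshold signal-to-noise ratio are hypothetical values chosen only for illustration, not measurements.

```python
import math


def masked_threshold_db(noise_bw_hz, noise_density_db, critical_bw_hz,
                        snr_at_threshold_db=0.0):
    """Signal threshold predicted by a rectangular-filter power-spectrum model.

    Only the part of the noise that falls inside the auditory filter masks
    the signal, so the effective noise bandwidth is capped at the filter
    bandwidth.
    """
    passed_bw_hz = min(noise_bw_hz, critical_bw_hz)
    noise_power_db = noise_density_db + 10 * math.log10(passed_bw_hz)
    return noise_power_db + snr_at_threshold_db


# Hypothetical values, for illustration only.
CRITICAL_BW_HZ = 160.0
NOISE_DENSITY_DB = 20.0

bandwidths = [20, 40, 80, 160, 320, 640]
thresholds = [
    masked_threshold_db(bw, NOISE_DENSITY_DB, CRITICAL_BW_HZ)
    for bw in bandwidths
]
for bw, th in zip(bandwidths, thresholds):
    print(f"{bw:4d} Hz noise bandwidth -> threshold {th:.1f} dB")
```

In this sketch the threshold rises by about 3 dB per doubling of the noise bandwidth and then stays constant once the noise is wider than the assumed filter. Real auditory filters are not rectangular, so the measured knee is more gradual.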
21
Psychophysical Tuning Curves
Psychophysical tuning curves (PTCs) are a method for estimating the shape of the
auditory filter. The PTCs above were determined in simultaneous masking, using sinusoidal
signals at 10 dB SPL. For each curve, the diamond below it shows the frequency and the
level of the signal. The masker was a sinusoid that had a fixed starting phase relationship to
the signal. The masker level required for threshold (i.e. to just mask the signal) is plotted as a
function of masker frequency on a logarithmic scale. The dashed line represents the
absolute threshold for the signal. Figure from (Moore, 1995).
22
Shape of Auditory Filter
The shape of the auditory filter centred at 1 kHz, plotted for input sound levels ranging
from 20 to 90 dB SPL/ERB. The output level of the filter is plotted as a function of
frequency. On the low-frequency side, the filter becomes progressively less
sharply tuned with increasing sound level. On the high-frequency side, the sharpness
of tuning increases slightly with increasing sound level. At moderate sound levels the
filter is approximately symmetric on the linear frequency scale used. Figure from
(Moore, 1995).
23
Bark Scale
 Proposed in 1961 by Eberhard Zwicker, named after Heinrich
Barkhausen who proposed the first subjective measurement of
loudness.
 The scale ranges from 1 to 24 and corresponds to the first 24 critical
bands of hearing. The successive band edges are (in Hz) 20, 100,
200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000,
2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000,
15500.
\[ \mathrm{Bark} = 13\arctan(0.00076 f) + 3.5\arctan\left((f/7500)^2\right) \]
where f is the frequency in Hz.
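The Bark formula above can be evaluated directly. The sketch below applies it to the band edges listed on this slide; the function name `hz_to_bark` and the constant name `BAND_EDGES_HZ` are hypothetical names introduced for this example.

```python
import math


def hz_to_bark(f_hz):
    """Convert a frequency in Hz to Bark using the formula on this slide."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))


# Critical band edges in Hz, copied from the list on this slide.
BAND_EDGES_HZ = [
    20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
    2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500,
]

for edge in BAND_EDGES_HZ:
    print(f"{edge:6d} Hz -> {hz_to_bark(edge):5.2f} Bark")
```

The formula is an analytic approximation to the tabulated band edges, so the computed values are close to, but not exactly, whole numbers at each edge.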
24
Non-Simultaneous Masking
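[Figure: timelines of the masking tone and the masked tone, separated by a gap T]
Forward masking: the masking tone precedes the masked tone.
T cannot be as long as 20-30 ms
Backward masking: the masking tone follows the masked tone.
T cannot be more than 10 ms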
25
Forward Masking
The left figure shows the amount of forward masking of a 2 kHz signal as a function of the
time delay between the signal and the end of the noise masker. Each curve represents a
different noise level. The results for each spectrum level fall on a straight line when the
signal delay is plotted on a logarithmic scale. The right figure shows the same thresholds
plotted as a function of the masker level. The slopes of these growth of masking functions
decrease with increasing signal delay. Figures from (Moore, 1995)
26
Forward Masking
 Forward masking is greater the closer in time the signal is to the
masker.
 Increments in masker level do not produce equal increments in the amount of
forward masking, i.e. the slope of the growth of masking function is less
than 1, which is in contrast to simultaneous masking, where the slope
is close to 1.
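Both properties can be illustrated with a toy model. The following is a minimal sketch, assuming that the amount of masking is linear in the logarithm of the signal delay and grows linearly with masker level; the coefficients `a`, `b` and `c` are made-up values chosen only so that the numbers look plausible, and are not fitted to any data.

```python
import math


def forward_masking_db(masker_level_db, delay_ms, a=0.2, b=2.3, c=10.0):
    """Toy amount of forward masking in dB (hypothetical coefficients).

    The growth-of-masking slope a * (b - log10(delay)) shrinks as the
    signal delay increases and stays well below 1 for delays over 1 ms.
    """
    slope = a * (b - math.log10(delay_ms))
    return max(0.0, slope * (masker_level_db - c))


for delay_ms in (10, 20, 40):
    at_60 = forward_masking_db(60, delay_ms)
    at_70 = forward_masking_db(70, delay_ms)
    print(f"delay {delay_ms:2d} ms: {at_60:4.1f} dB at 60 dB masker, "
          f"{at_70:4.1f} dB at 70 dB masker")
```

In this toy model a 10 dB increase in masker level raises the amount of masking by only a few dB, and by less at longer delays, which mirrors the two statements above.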
27
PTCs Comparisons
Comparison of the psychophysical tuning curves determined by simultaneous
masking (triangles) and forward masking (squares). The masker frequency is
plotted as the deviation from the centre frequency divided by the centre
frequency. The unit for the centre frequency is kHz. Figures from (Moore et al, 1984)
28
Demos for Non-simultaneous
Masking (Time Domain Masking)
 Resources: Audio Box CD from Univ. of Victoria
Forward masking: a masking tone is played, followed after a 100 ms delay
by a tone one semitone lower. Both tones can be heard even though the
second tone is decreased in 3 dB steps.
Forward masking: a masking tone is played, followed after a 10 ms delay
by a tone one semitone lower. Masking occurs in this demo. How many
steps are audible before the second tone is masked?
Backward masking: the initial tone is masked by the one that
follows. The time delay is 100 ms.
Backward masking: the initial tone is masked by the one that
follows. The time delay is decreased but is still more than 10 ms.
Backward masking: the initial tone is masked by the one that
follows. The time delay is below 10 ms. Masking occurs. How many
steps are audible?
29
Examples of Modern Audio Formats
 MP3: MPEG-1 or MPEG-2 Audio Layer 3 (or III), is a patented lossy audio codec.
It is a common audio format for consumer audio storage, as well as a standard of
digital audio compression for the transfer and playback of music on digital audio
players.
 Ogg Vorbis: a lossy audio codec developed by the Xiph.Org Foundation (formerly
Xiphophorus company). Free and open source.
 AAC: Advanced Audio Coding, an audio compression format specified by MPEG-2
and MPEG-4, and successor to MPEG-1’s “MP3” format.
 WMA: Windows Media Audio, is an audio codec developed by Microsoft.
 MPEG-1 Layer II or MPEG-2 Audio Layer II (MP2): a lossy audio compression
format defined by ISO/IEC 11172-3 alongside MPEG-1 Audio Layer I and MPEG-1
Audio Layer III (MP3). While MP3 is much more popular for PC and internet
applications, MP2 remains a dominant standard for audio broadcasting.
 ATRAC: Adaptive Transform Acoustic Coding (ATRAC) is a family of proprietary
audio compression algorithms developed by Sony. ATRAC allowed a relatively small
disc like MiniDisc to have the same running time as CD while storing audio
information with minimal loss in perceptible quality.
30
Auditory Scene Analysis
31
Demos for Sequential Organisation
 Resources: Audio Box CD from Univ. of Victoria
In this demo, the sound is perceived as a single stream of notes C4
G4 F4 B3.
As the notes are sped up, rhythmic beats played as a melody begin
to be heard. The auditory system is now hearing two groups of two
notes.
If the time delay is further decreased, we no longer hear a melody;
we only hear the rhythmic beats. Our auditory system is now
hearing four groups of one note each.
32
Demo for Speech Segregation
 Resources: Audio Box CD from Univ. of Victoria
This demo begins with the two melodies of “Camptown Races” and
“Yankee Doodle” at the same pitch. Each time the interleaved
melody is played, one of the songs is shifted in pitch until
eventually the two melodies become distinguishable.
This demo plays the two melodies at the same pitch, but with
different timbres. The two melodies are distinguishable instantly.
This demo adjusts the amplitude of the two songs while leaving the
pitch constant.
33
Segregation of a melody from
interfering tones
Track 1 in Bregman’s ASA Demonstration
34
Segregation of a melody from
interfering tones
Track 5 in Bregman’s ASA Demonstration
35
Segregation of high notes from low
ones in a sonata by Telemann
Track 6 in Bregman’s ASA Demonstration
36
Streaming in African xylophone music
Track 7 in Bregman’s ASA Demonstration
37
Effects of a timbre difference
between the two parts in African
xylophone music
Track 9 in Bregman’s ASA Demonstration
38
Stream segregation of vowels
and diphthongs
Track 11 in Bregman’s ASA Demonstration
39
Stream segregation of high and
low bands of noise
Track 14 in Bregman’s ASA Demonstration
40
Apparent Continuity
Track 28 in Bregman’s ASA Demonstration
41
Perceptual continuation
Track 29 in Bregman’s ASA Demonstration
42