Basic Audio Technology A Short Course (bring your coffee!)


Conversion – Issues in Hearing, Sampling, Quantization, and Implementation
James D. Johnston
Microsoft Corporation
Audio Architect
Copyright © 2006 by James D. Johnston
All rights reserved.
Basic Hearing Issues
• Parts of the ear.
• Their contribution to the hearing process.
• What is loudness?
• What is intensity?
• How “loud” can you hear?
• How “quiet” can you hear?
• How “high” is high?
• How “low” is low?
• What do two ears do for us?
Fundamental Divisions of the Ear
• Outer ear:
– Head and shoulders
– Pinna
– Ear Canal
• Middle Ear
– Eardrum
– 3 bones and spaces
• Inner ear
– Cochlea
• Organ of Corti
– Basilar Membrane
– Tectorial Membrane
– Inner Hair Cells
– Outer Hair Cells
Functions of the Outer Ear
• In a word, HRTF’s. HRTF means “head related transfer functions”,
which are defined as the transfer functions of the body, head, and
pinna as a function of direction. Sometimes people refer to HRIR’s,
or “head related impulse responses”, which are the same
information expressed in the time, as opposed to frequency, domain.
• HRTF’s provide information on directionality above and beyond that
of binaural hearing.
• HRTF’s also provide disambiguation of front/back, up/down
sensations along the “cone of confusion”.
• The ear canal resonates somewhere between 1 and 4 kHz, resulting
in some increased sensitivity at the point of resonance. This
resonance is about 1 octave wide, give or take.
Middle Ear
• The middle ear acts primarily as a
highpass filter (at about 700 Hz) followed
by an impedance-matching mechanical
transformer.
• The middle ear is affected by muscle
activity, and can also provide some level
clamping and protection at high levels.
– You don’t want to be exposed to sound at
that kind of level.
Inner Ear (Cochlea)
• In addition to the balance mechanism, the inner
ear is where most sound is transduced into
neural information.
• The inner ear is a mechanical filterbank,
implementing a filter whose center-frequency
tuning goes from high to low as one goes farther
into the cochlea.
– The bandpass nature is actually due to coupled
tuning of two highpass filters, along with detectors
(inner hair cells) that detect the difference between
the two highpass (HP) filters.
Critical Bandwidths
• The bandwidth of a filter is referred to as the “Critical Band” or “Equivalent
Rectangular Bandwidth” (ERB).
• ERB’s and Critical Bands (measured in units of “Barks”, after Barkhausen)
are reported as slightly different.
• ERB’s are narrower at all frequencies.
• ERB’s are probably closer to the right bandwidths; note the narrowing of the
filters on the “Bark” scale in the previous slide at high Barks (i.e. high
frequencies).
• I will use the term “Critical Band” in this talk, by habit. Nonetheless, I
encourage the use of a decent ERB scale.
• Bear in mind that both Critical Band(widths) and ERB’s are useful, valid
measures, and that you may wish to use one or the other, depending on
your task.
• There is no established “ERB” scale to date; rather, researchers disagree
quite strongly, especially at low frequencies. It is likely that leading-edge
effects as well as filter bandwidths lead to these differences. The physics
suggests that the lowest critical bands or ERB’s are not as narrow as the
literature suggests.
What are the main points?
• The cochlea functions as a mechanical
time/frequency analyzer.
– Center frequency changes as a function of the
distance from the entrance end of the cochlea. High
frequencies are closest to the entrance.
– At higher center frequencies, the filters are roughly a
constant fraction of an octave bandwidth.
– At lower center frequencies, the filters are close to
uniform bandwidth.
– The filter bandwidth, and therefore the filter time
response length varies by a factor of about 40:1.
What happens as a function of
Level?
• As level rises, the ear desensitizes itself
by many dB.
• As level rises, the filter responses in the
ear shift slightly in position vs. frequency.
• The ear, with a basic 30dB SNR (1000^.5)
in the detector, can range over at least
120dB of level.
What does this mean?
• The internal experience, called Loudness,
is a highly nonlinear function of level,
spectrum, and signal timing.
• The external measurement in the
atmosphere, called Intensity, behaves
according to physics, and is somewhat
close to linear.
• The moral? There’s nothing linear
involved.
Some points on the SPL Scale
(x is very loud pipe organ, o is threshold of pain/damage, + is moderate barometric pressure change)
Edge effects and the Eardrum
• The eardrum’s HP filter desensitizes the ear
below 700Hz or so. The exact frequency varies
by individual. This means that we are not
deafened by the loudness of weather patterns,
for instance.
• At both ends of the cochlea, edge effects lessen
the compression mechanisms in the ear.
• The result of these two effects, coupled with the
ear’s compression characteristics, is the
kind of response shown in the next slide.
Fletcher and Munson’s famous
“equal loudness curves”.
The best one-picture summary of hearing in existence.
What’s quiet, and what’s loud?
• As seen from the previous graph, the ear can hear to
something below 0dB SPL at the ear canal resonance.
• As we approach 120dB, the filter responses of the ear
start to broaden, and precise frequency analysis
becomes difficult. As we approach 120dB SPL, we also
approach the level at which near-instantaneous injury to
the cochlea occurs.
• Air is made of discrete molecules. As a result, the noise
floor of the atmosphere at STP is approximately 6dB
SPL white noise in the range of 20Hz-20kHz. This noise
may JUST be audible at the point of ear canal
resonance. Remember that the audibility of such noise
must be calculated inside of an ERB or critical band, not
broadband.
So, what’s “high” and what’s “low”
in frequency?
• First, the ear is increasingly insensitive to
low frequencies, as shown in the Fletcher
curve set. This is due to both basilar
membrane and eardrum effects. 20Hz is
usually mentioned as the lowest frequency
detected by the hearing apparatus. Low
frequencies at high levels are easily
perceived by skin, chest, and abdominal
areas as well as the hearing apparatus.
• At higher frequencies, all of the detection ability
above 15-16 kHz lies in the very first small
section of the basilar membrane. While some
young folks have been said to show supra-20kHz hearing ability (and this is likely true due
to their smaller ear, ear canal, and lack of
exposure damage), in the modern world, this
first section of the basilar membrane appears to
be damaged very quickly by “environmental”
noise.
• At very high levels, high (and ultrasonic) signals
can be perceived by the skin. You probably don’t
want to be exposed to that kind of level.
What about binaural issues?
Binaurally, with broadband signals, we can
distinguish 10 microsecond shifts in left vs.
right stimuli of the right characteristics.
While this has implications in block-processed
algorithms with pre-echo, it
does not generally relate substantially to
ADC and DAC hardware that is properly
clocked.
The results?
• For presentation (NOT capture, certainly not
processing), a range of 6dB SPL (flat noise floor)
to 120dB (maximum you should hear, also
maximum most systems can achieve) should be
more than sufficient. This is about 19 bits.
• An input signal range of 20Hz to 20kHz is
probably enough, there are, however, some filter
issues that will be raised later, that may affect
your choice of sampling frequency.
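The "about 19 bits" figure above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the usual 20*log10(2) (about 6.02 dB) per bit; the function name `bits_for_range` is made up for illustration.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB of level range per bit


def bits_for_range(floor_db_spl: float, ceiling_db_spl: float) -> float:
    """Number of bits whose 6.02 dB steps span the given level range."""
    return (ceiling_db_spl - floor_db_spl) / DB_PER_BIT


# 6 dB SPL noise floor up to 120 dB SPL maximum, as on the slide.
presentation_bits = bits_for_range(6.0, 120.0)
print(f"{presentation_bits:.1f} bits")
```

The 114 dB span divided by roughly 6.02 dB per bit lands just under 19, which is where the slide's figure comes from.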
Sampling
and
Quantization
Sampling and Quantization
• Continuous domain vs. sampled domain.
– Sampling
– Aliasing
• Discrete level (quantized) vs. noisy continuous domain
– Quantization
– Dithering
• Time/frequency duality
• FFT’s (DFT’s too)
What do “Analog” and “Digital”
really mean?
• The domain we commonly refer to as “analog” is a time-continuous
(at least to mortal eyes) domain, with
“continuous” level resolution limited by physical
properties of material. The level resolution and time
resolution are never exact due to basic physics.
• The “digital” domain is a sampled, quantized domain.
That means that we only know the value of the signal at
specified times, and that the level of the signal occupies
one of a set of discrete levels. The set of levels, and the
times that the signal has a value, however, are exact in
the digital framework. (Although there may, of course, be
errors in acquisition.)
Properties of the analog domain
• The time domain is continuous. This means that
any frequency limits come from physical
processes, not from mathematical restrictions.
However, physics places some very strong
constraints on such signals:
– All signals have finite energy
– All signals have finite bandwidth
– All signals have finite duration
– All signals have a finite noise floor
– All four of the points above are very important!
A reminder about Duality in the
Fourier domain
• Multiplying two signals means that you convolve
their (full, complex) spectra.
• Convolving two signals means that you multiply
their (full, complex) spectra.
• These two properties of Fourier Analysis (other
commonly used transforms obey them as well)
are very important. Remember them even if you
don’t know anything about Fourier Analysis.
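The two duality properties can be checked numerically on short sequences. This is a sketch only, using a naive DFT and circular convolution (the finite-length, periodic version of the properties above); the test sequences and helper names are arbitrary choices, not from the talk. With this DFT convention, multiplying in time corresponds to convolving the spectra with a 1/N scale factor.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(N^2), fine for tiny inputs)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]


def circular_convolve(a, b):
    """Circular (periodic) convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[k] * b[(m - k) % n] for k in range(n)) for m in range(n)]


x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 0.0, 3.0]
y = [0.5, -0.5, 1.0, 0.0, 0.0, 2.0, 1.0, 0.0]
n = len(x)

# Multiply in time  <->  convolve the (full, complex) spectra.
lhs_mul = dft([xi * yi for xi, yi in zip(x, y)])
rhs_mul = [v / n for v in circular_convolve(dft(x), dft(y))]

# Convolve in time  <->  multiply the (full, complex) spectra.
lhs_conv = dft(circular_convolve(x, y))
rhs_conv = [a * b for a, b in zip(dft(x), dft(y))]

max_err = max(abs(p - q) for p, q in zip(lhs_mul + lhs_conv, rhs_mul + rhs_conv))
print("largest mismatch:", max_err)
```

The mismatch is at floating-point rounding level, which is all the "property" claims.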
Fourier Domain Properties
• Please remember the properties. I don’t want or
expect anyone to understand all the details, but
please remember the PROPERTIES.
• Fourier analysis is valid on all finite energy,
finite-bandwidth signals. That describes all real-world audio signals that we care to deal with.
(The only counterexamples occur in
astrophysics and particle physics, neither of
which a listener can be comfortably seated
near.)
So, let’s sample that analog signal.
• Sampling means capturing the value of the
signal at a periodic rate.
– This means that we MULTIPLY the signal by an
impulse at each regular sampling interval (an impulse train).
• Quite obviously, that’s not what actual hardware does. Most
use a track/hold, or other capture method. The result, in the
sampled domain, is the same.
– That means that we CONVOLVE the signal spectrum
and the sampling spectrum. This means that the
spectrum repeats at every multiple of the sampling
frequency.
– Hence, we have the Nyquist criterion, later proven by
Shannon.
Three examples: 2*b<fs, 2*b=fs, 2*b>fs
In each case, top is signal spectrum (same for all 3), middle is sampling spectrum, and bottom is the result.
The Nyquist Criterion
• Simply put, if we wish to sample a signal
of bandwidth ‘B’, we must sample it at
least at 2B sampling rate.
– If you think about this briefly, that follows from
the previous slide, where the spectrum
(extending to +-B from DC) will overlap if you
sample it at a lower frequency
– This overlap is called “aliasing”.
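Because the spectrum repeats at every multiple of the sampling frequency, it is easy to work out where an out-of-band sine lands. This is a small sketch; `alias_frequency` is a hypothetical helper and the 48 kHz rate is just an example value.

```python
import math


def alias_frequency(f_hz: float, fs_hz: float) -> float:
    """Frequency in 0..fs/2 where a real sine at f_hz appears after sampling at fs_hz."""
    f_folded = f_hz % fs_hz                  # the spectrum repeats every fs
    return min(f_folded, fs_hz - f_folded)   # and mirrors about fs/2


fs = 48_000.0
for f in (10_000.0, 30_000.0, 50_000.0):
    print(f"{f:>8.0f} Hz sampled at {fs:.0f} Hz shows up at {alias_frequency(f, fs):.0f} Hz")

# Sample values confirm it: a 30 kHz sine and an (inverted) 18 kHz sine
# produce the same samples at 48 kHz.
worst = max(abs(math.sin(2 * math.pi * 30_000 * n / fs) +
                math.sin(2 * math.pi * 18_000 * n / fs)) for n in range(64))
print("largest sample difference:", worst)
```

Once the samples are taken, nothing distinguishes the 30 kHz input from an 18 kHz one, which is why the band limiting has to happen before the sampler.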
A Graphical Example
Top to bottom: Sampling train, spectrum of sampling train, Sine wave below half the
sampling rate and resulting samples, their spectra, sine wave as far above half the
sampling rate and resulting samples, their spectra
What would I hear if there was a
demo of aliasing?
• Aliasing and imaging (imaging, as we will
see shortly, is the reconstruction version of
aliasing) sound awful. Aliasing in general
is anharmonic, and remarkably annoying.
• THEREFORE: Filtering is a requirement.
It’s not an option.
• The presence of a filter has consequences
that we will discuss later.
• This leads us to the sampling theorem.
Hence, the Sampling Theorem
• We must limit the bandwidth of the signal to fs/2,
where fs is the sampling frequency. (This is
saying the same thing as the Nyquist conjecture,
restated in terms of the data to be captured,
rather than the sampling rate.)
• While this does not mean dc to fs/2, that’s what
we do in audio, since we want signals close to
dc. (There are other sampling methods that
sample other regions of frequency.)
• This means that we must band limit the signal
into the sampler. An anti-alias filter is not just a
good idea, it’s mathematically necessary.
Consequences of the Sampling
Theorem
• We must band limit the signal in order to avoid
aliasing.
• Any out-of-band signals will alias back into the
base band.
• That ^^ has consequences far beyond the initial
sampling of the material. We’ll get to that later
when we talk about things like clipping and
nonlinearities, or jitter.
• This means that we know what times the
samples were taken, so we can “reconstruct”
that periodicity later, without an error that always
grows with time, distance, number of copies, etc.
Ok, but we represent those
samples as binary values, right?
Yes, we do.
That’s called “quantization”. That’s the other
necessary process in digitization of a signal.
Quantization is why digital signals can be saved
and re-saved without degradation in terms of
level.
It’s also why digital PCM signals have a fixed,
unchanging noise floor. (There are other
possibilities; we’ll talk about those later.)
So, quantization is like rounding,
right? Well, Let’s see!
Using rounding only:
Original, quantized, error, and spectrum of original and error.
To drive the point even farther
home:
That’s right, we have to dither quantizers. It’s not just a “good idea”.
Dither? What’s that?
• As the spectra (and error waveform of the
second slide) show, the error of an undithered
quantizer is highly correlated to the original
signal.
• Dither consists of adding some random function
BEFORE the quantization so that the noise is
decorrelated.
• The first kind of dither people tried was called
“uniform”. Let’s see how that works out.
Uniform +- ½ step-size dither.
Not bad, but notice the noise level coming and going around the zero crossing?
Hence, Triangular PDF Dither:
Now, the noise stays constant over all amplitudes.
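The three cases (rounding only, uniform dither, triangular PDF dither) can be compared on a constant input sitting at different positions between two quantizer levels. This is a sketch under assumptions: 1 LSB = 1.0, the function names are invented for illustration, and the statistics are estimated from random trials rather than derived.

```python
import math
import random


def quantize(x: float) -> int:
    """Round to the nearest integer step (1 LSB = 1.0)."""
    return math.floor(x + 0.5)


def no_dither(rng):
    return 0.0


def uniform_dither(rng):
    return rng.uniform(-0.5, 0.5)                            # +- 1/2 step, flat PDF


def tpdf_dither(rng):
    return rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)   # +- 1 step, triangular PDF


def error_stats(value, dither, trials=20000, seed=1):
    """Mean and variance of the quantization error for a constant input value."""
    rng = random.Random(seed)
    errors = [quantize(value + dither(rng)) - value for _ in range(trials)]
    mean = sum(errors) / trials
    var = sum((e - mean) ** 2 for e in errors) / trials
    return mean, var


for v in (0.0, 0.25, 0.5):
    for name, d in (("none", no_dither), ("uniform", uniform_dither), ("TPDF", tpdf_dither)):
        mean, var = error_stats(v, d)
        print(f"input {v:4.2f}  dither {name:8s} mean error {mean:+.3f}  error variance {var:.3f}")
```

With no dither the error is a fixed function of the input. Uniform dither removes the average error, but the noise power still depends on where the input sits. With triangular dither both the mean and the noise power stop depending on the input.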
To recap:
Notice, even in this very mild case, where harmonics do not alias over each
other, in addition to eliminating tonal components and preserving information,
dither RAISES the noise floor, and lowers the PERCEIVED noise floor.
To summarize:
• A digital signal is sampled and quantized.
• Sampling requires anti-aliasing filters.
• Quantization requires TPD (triangular PDF dither).
• Dither and Anti-aliasing are not options!
What about reconstruction?
• Yes, that convolution theorem applies again, this
time usually convolving a “square pulse” with the
digital signal.
• This leads to a form of signal “images”. While
“images” and “aliases” come about by
mathematically similar processes, people persist
in having different names for them.
• Some (many in the high end) omit the anti-imaging filter, and imagine that there is a
‘beating’ problem. If you haven’t heard about this
“problem” yet, you will at some point. The next few
graphs show why it isn’t so.
Basically, the same thing happens.
Top Line: Sine wave Below fs/2
Next line Sine wave plus first alias pair (blue) and just the aliases (red)
Each line adds another pair, except the last, which adds the first 100 pairs.
The gain of the red waveform is greatly increased in order to make it visible.
Notice that after 100 alias pairs are added in, the original waveform has the familiar “stairstep” shape.
Notice, at the bottom, the “beating” that some audio enthusiasts complain
about. Notice that “beating” only happens when aliases are added.
Low frequency reconstruction example.
Elements of reconstruction:
• In reconstruction, filtering is also necessary to remove the “image”
signals that originate in the same fashion as aliases arise in
sampling.
• In reconstruction, the waveform is sometimes a “step” rather than an
impulse, so other compensation is sometimes necessary to get a flat
frequency response. Why?
– Again, using the “step” in time (convolving) means that you multiply the
signal by the frequency response of the “step” in the frequency domain,
leading to a rolloff like sin(x)/x.
– This rolloff can be as much as -3.92dB at fs/2, and can cause audible
“softness” if not corrected somehow.
• Modern converters of the delta-sigma variety do not use a step at
the final sampling frequency (although they certainly use a “step” it’s
at a much higher frequency). Their design, however, introduces
other issues. That discussion comes later.
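The -3.92dB figure follows directly from sin(x)/x with x = pi*f/fs. This is a minimal sketch, assuming a full-width "step" (a hold lasting one whole sample period); `zoh_gain_db` is a made-up name and 48 kHz is an example rate.

```python
import math


def zoh_gain_db(f_hz: float, fs_hz: float) -> float:
    """Gain in dB of a one-sample-period hold ("step") at frequency f_hz."""
    x = math.pi * f_hz / fs_hz
    if x == 0.0:
        return 0.0
    return 20 * math.log10(abs(math.sin(x) / x))


fs = 48_000.0
for f in (1_000.0, 10_000.0, 20_000.0, fs / 2):
    print(f"{f:>7.0f} Hz: {zoh_gain_db(f, fs):6.2f} dB")
```

At fs/2 the ratio is 2/pi, which is the -3.92 dB quoted above; the droop is already a couple of dB at 20 kHz for this sample rate, hence the need for compensation.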
How to Build converters
A quick survey of methods.
No real commercial converter is described.
Baseband Conversion
[Block diagram: Audio Input, Filter, A to D converter, Sampling Clock]
• This is the basic block diagram for any PCM converter.
• In this converter, the filter is outside the converter, and the quantizer
is part of the sampling mechanism.
• This method is not very common any more, but we will discuss its
properties before moving on to oversampling converters.
Spectrum of Signal and Noise
[Figure panels: Original; Spectrum of Original; Quantized and Dithered; Spectrum of Quantized, Dithered Signal]
Red 8 bits, green 9 bits
(note, to make scaling easier, I will use 7/8/9 bit quantization)
• Things to notice.
– The noise floor is flat. If you sum up all of the energy
in the noise floor, you will wind up with the SNR you
expect. Notice that practically all of the noise is IN
BAND when there is no oversampling.
– It’s really hard to see quantization at even very noisy
levels in a waveform plot.
• Each bit of quantization is worth 6.02dB of signal
to noise ratio. 1 more bit will drop the noise floor
by 6 dB. 1 less bit raises the noise floor by 6dB.
• NOTICE THAT NOISE SPREADS OVER THE
ENTIRE OUTPUT SPECTRUM.
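The 6 dB per bit rule can be seen by quantizing the same sine at 7, 8 and 9 bits and summing the error energy. This is a sketch under assumptions: a +-1.0 full-scale range, TPDF dither of +-1 step, a sine at 0.9 of full scale, and a hypothetical helper name `noise_floor_db`.

```python
import math
import random


def noise_floor_db(bits: int, n: int = 20000, seed: int = 7) -> float:
    """Total error power (dB re full scale = 1.0) of a TPDF-dithered quantizer."""
    rng = random.Random(seed)
    step = 2.0 / (2 ** bits)          # +-1.0 range split into 2**bits steps
    err_power = 0.0
    for i in range(n):
        x = 0.9 * math.sin(2 * math.pi * 0.01234 * i)
        dither = (rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)) * step
        y = math.floor((x + dither) / step + 0.5) * step
        err_power += (y - x) ** 2
    return 10 * math.log10(err_power / n)


floors = {bits: noise_floor_db(bits) for bits in (7, 8, 9)}
for bits, level in floors.items():
    print(f"{bits} bits: total noise {level:.2f} dB")
```

Each extra bit halves the step size, so the summed noise power drops by a factor of four, which is the 6.02 dB.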
Why so much emphasis on how
the noise spectrum spreads
out?
Therein lies the beginnings of oversampling.
Sine wave, original sampling rate
Spectrum of 8 bit quantization
Sine wave, 4x sampling rate
Spectrum of 7 bit quantization, shown in same bandwidth!
4x oversampled.
Full spectrum of 7 bit quantized signal at 4x sample rate. Notice
that the noise has 4x the bandwidth, but ¼ of it falls in the original
passband
That demonstrates the most
trivial form of oversampling.
• This trivial form of oversampling provides
the equivalent of 1 bit, in-band, for every
4x increase in the sampling frequency, i.e. 3dB per
doubling.
• Now we move on to more sophisticated
forms of oversampling, with the noise
spectrum shaped as well.
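For the trivial (flat noise floor) case, the in-band gain is just the fraction of the noise that stays in the original passband. This is a small sketch with invented function names, assuming ideal filtering of everything above the original band.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB


def oversampling_gain_db(ratio: float) -> float:
    """In-band SNR gain when flat quantization noise is spread over `ratio` times the band."""
    return 10 * math.log10(ratio)


def oversampling_gain_bits(ratio: float) -> float:
    """The same gain expressed in equivalent bits."""
    return oversampling_gain_db(ratio) / DB_PER_BIT


for ratio in (2, 4, 16, 64):
    print(f"{ratio:>3}x: {oversampling_gain_db(ratio):5.2f} dB = {oversampling_gain_bits(ratio):.2f} bits")
```

The 4x case gives one bit, which matches the 7 bit quantizer at 4x equalling the 8 bit quantizer at 1x in the previous plots.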
Noise Shaping
[Block diagram: +, H(s), Quantizer; signals labeled Error Signal, Quantized Signal, Output Bits]
This is the basic form of a noise-shaper. I’m not going to do a full
mathematical analysis for the sake of time. What H(s) does is shape
the noise floor. This can be done with or without oversampling. Two
examples will follow.
The output bits of this system are PCM. ALL OVERSAMPLED
SYSTEMS ARE PCM SYSTEMS AT THEIR HEART!!!!
Adding this H(s) introduces some
Issues, of course.
• The values of H(s) must be carefully controlled
in order to ensure stability.
• H(s) has storage in it, so quantization noise gets
stored. This means that you have “more noise”
than just the basic quantization noise. So, there
is a penalty, especially if there is a lot of
“storage” or “memory” in the noise shaper, as
well as a gain.
• The shape of the noise is closely related to the
inverse of H(s).
• I won’t try to present a full analysis, that’s for the
hardware engineer and chip designer.
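As a toy illustration only, here is about the simplest noise shaper one can write: the previous sample's total error is subtracted from the next input. That corresponds to a one-sample delay in the feedback path, which is an assumption made for this sketch and not the H(s) of any slide or product. The function names, the test signal, and the block-averaging "lowpass" are likewise hypothetical.

```python
import math
import random


def quantize(x: float) -> int:
    """Round to the nearest integer step (1 LSB = 1.0)."""
    return math.floor(x + 0.5)


def requantize(signal, shape_noise: bool, seed: int = 3):
    """TPDF-dithered quantizer; optionally feeds back the previous sample's error."""
    rng = random.Random(seed)
    prev_error = 0.0
    out = []
    for x in signal:
        target = x - prev_error if shape_noise else x
        dither = rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)
        y = quantize(target + dither)
        prev_error = y - target       # total error made on this sample
        out.append(y)
    return out


def noise_power(signal, output, block: int = 1) -> float:
    """Mean-square error after averaging over `block` samples (block > 1: crude lowpass)."""
    errs = [y - x for x, y in zip(signal, output)]
    means = [sum(errs[i:i + block]) / block
             for i in range(0, len(errs) - block + 1, block)]
    return sum(m * m for m in means) / len(means)


signal = [4.0 * math.sin(2 * math.pi * i / 1000.0) for i in range(64 * 400)]
plain = requantize(signal, shape_noise=False)
shaped = requantize(signal, shape_noise=True)

print("total noise    plain / shaped:", noise_power(signal, plain), noise_power(signal, shaped))
print("low-band noise plain / shaped:", noise_power(signal, plain, 64), noise_power(signal, shaped, 64))
```

The shaped version carries more total noise (the "storage" penalty mentioned above) but far less of it at low frequencies, because the error of each sample is largely cancelled by the next one.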
An example of noise shaping with no oversampling:
NOTE: This is an example, nobody uses this particular H(z), and in fact I’ve not
even tested it for stability!!!!
The point is simple, you CAN do noise shaping even with no oversampling, and
some DAC’s do it, to attempt to match zero loudness curves. We can discuss
the utility of that in Q/A.
What’s the point?
• Within limits, using a noise-shaping
system, you can move the noise around in
frequency.
• You can, for instance, push lots of the
noise up to high frequencies.
– That is one of the reasons for oversampling.
What’s another reason for
Oversampling?
• You get to control the response of the initial anti-aliasing/anti-imaging filter digitally.
– As most everyone knows by now, high-order analog
filters have a variety of problems:
• They are hard to manufacture
– That means expensive
• Their long-term performance is very hard to assure.
– That means that they tend to annoy the customer
• Since they are IIR filters, they have startling phase problems
near the transition and stop bands.
– That annoys the customer, too
What sort of oversampling does the
filter issue lead us to?
[Block diagram labels: 4x Oversampling, Digital FIR filter; 5th order Analog filter]
This is shown as an example.
The results?
• The 13th order analog filter (with horrible
phase response) is replaced by a 5th order
analog filter.
• The first, sharp antialiasing filter is now a
digital filter, with deterministic behavior
and performance. All it takes is “MIPS”.
– Nowadays, MIPS are cheap.
– The filter is trustworthy. It won’t drift, oscillate,
distort, etc, if it’s designed and implemented
properly. Its characteristics are exactly known.
Remember the Filters in the ear?
• Your ear is a time analyzer. It will analyze the
filter taps, etc, AS THEY ARRIVE, it doesn’t wait
for the whole filter to arrive.
• If the filter has substantial energy that leads the
main peak, this may be able to affect the
auditory system.
– In Codecs this is a known, classic problem, and one
that is hard to solve.
– In some older rate converters, the pre-echo was quite
audible.
• The point? We have to consider how the ear will
analyze any anti-aliasing filter. Two examples
follow.
An example of a filter with passband ripple and barely enough stop band rejection.
An example of a good, longer filter, with less passband ripple.
An interesting result
• Trying to use the shortest possible filter
(i.e. minimizing MIPS) results in a worse
time response from the point of view of the
auditory system.
– Passband ripple means that there are “tails”
on the filter.
Another interesting result
• Sharper filters have more “ringing”, and may
have more auditory problems:
– A filter cutting off in 2.05 kHz must
necessarily have a wider main lobe than the
narrowest (in time) cochlear filter. df * dt >= 1.
– A filter cutting off over 4kHz will have
a main lobe a bit narrower than the narrowest cochlear
filter.
– This suggests that for higher sampling rates, we do
not want the ‘fastest’ filter, rather a filter with a wider
transition band, and narrower time response.
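The df * dt >= 1 bound turns a transition width into a minimum time width with one division. This is a rough sketch; reading 2.05 kHz as the 20 kHz to 22.05 kHz gap at a 44.1 kHz sampling rate, and 4 kHz as 20 kHz to 24 kHz at 48 kHz, is an interpretation and not something the slide states. The 96 kHz row is an added example.

```python
def min_time_width_ms(transition_hz: float) -> float:
    """Smallest main-lobe width (ms) allowed by df * dt >= 1 for a given transition width."""
    return 1000.0 / transition_hz


# (label, transition width in Hz); the sampling rates are assumed, see the note above.
cases = [
    ("20 kHz -> 22.05 kHz (44.1 kHz rate)", 2_050.0),
    ("20 kHz -> 24 kHz (48 kHz rate)", 4_000.0),
    ("20 kHz -> 48 kHz (96 kHz rate)", 28_000.0),
]
for label, df in cases:
    print(f"{label}: dt >= {min_time_width_ms(df):.3f} ms")
```

A wider transition band lets the filter be several times shorter in time, which is the trade the last bullet recommends at higher sampling rates.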
Two examples:
Is this audible?
• That’s a good question. Since we are stuck, in
general, with the filters our ADC’s and DAC’s
use, it’s dreadfully hard to actually run this
listening test.
• How would I do that?
– Get a DAC with a SLOW rolloff running at 4x (192K).
– Make a DC to 20 K Gaussian pulse at 192kHz.
– Downsample by zeroing 3 of every 4 samples and
multiplying the others by 4.
– Generate a third signal with a TIGHT filter.
– Compare the three signals in a listening test.
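The signal preparation for that test is easy to sketch. Everything numeric here is an assumption for illustration: the Gaussian's width (chosen so its spectrum is far down by 20 kHz), the buffer length, and the function names. The "TIGHT filter" third signal is not generated here.

```python
import math

FS = 192_000  # Hz, the 4x rate from the list above


def gaussian_pulse(n_samples: int = 257, sigma_s: float = 30e-6):
    """Gaussian pulse centred in the buffer; sigma_s (seconds) is an assumed width."""
    centre = n_samples // 2
    sigma_samples = sigma_s * FS
    return [math.exp(-0.5 * ((i - centre) / sigma_samples) ** 2) for i in range(n_samples)]


def zero_three_of_four(x, gain: float = 4.0):
    """Keep every 4th sample (scaled by `gain`) and zero the other three."""
    return [gain * v if i % 4 == 0 else 0.0 for i, v in enumerate(x)]


pulse = gaussian_pulse()
stuffed = zero_three_of_four(pulse)

# The x4 gain keeps the low-frequency (DC) level of the pulse unchanged.
print("sum of original:", sum(pulse))
print("sum of stuffed :", sum(stuffed))
```

Played through a slow-rolloff DAC at 192 kHz, the second signal keeps the 48 kHz sampling images that the first one lacks, which is what the comparison is after.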
Is this 4x oversampling what
people do?
• Not generally. That’s what they did for a while,
until MIPS got even cheaper.
• What they did was go to more oversampling.
A LOT more oversampling? Yes.
– Uses more digital, less analog
– There are a whole variety of circuitry and linearity
reasons, almost all of them point toward much more
oversampling and less “analog” hardware.
Massive oversampling:
• Remember: One gets 3dB per doubling of Fs
from oversampling with a flat noise floor.
• If we also put a single integrator with its zero at
20kHz into H(s), we will see that the increased
SNR available is 3 + 6 dB/doubling of Fs. There
will be some cost in the form of a constant
negative term to this SNR, which is overcome by
very moderate levels of oversampling.
• Each additional order of integration adds
another 6dB/doubling of the sampling frequency.
• On the next page are some curves:
[Figure: Noise shape vs. order, integrator pole at w=1; labels 1 to 7]
(Note: Curves as examples only. Real-world circuit considerations limit these curves)
Right. So what does that do for me,
anyhow?
• Remember, nearly the same amount of
noise is being shaped in each case.
– As there is more “space” under the curve at
high frequencies, more of the noise moves to
high frequencies.
– That means there is LESS noise at low
frequencies.
– Therefore, if we FILTER OUT the high
frequencies, we wind up with a lower
sampling rate signal with a higher SNR.
Some Examples (low order)
[Figure panels: Order 1 and Order 2, at 1x, 2x, 4x, 8x, 16x, 32x, 64x and 128x]
Original SNR 0dB. Base before upsampling = 48kHz
SNR vs. order vs downsampling
rate for that (ideal) system.
(SNR in dB)
Rate   Order 1  Order 2  Order 3  Order 4  Order 5  Order 6
1x       0        0        0        0        0        0
2x       5.9      8.9     11.9     15.0     17.8     29.8
4x      11.8     17.8     23.7     30.0     35.6     41.6
8x      17.6     26.5     35.3     44.1     53.8     61.9
16x     23.1     34.8     46.5     58.1     69.8     81.5
32x     28.1     40.3     57.0     71.2     85.5     99.8
64x     32.1     47.8     66.3     82.8     99.4    116.0
128x    34.7     50.6     74.0     92.1    110.8    129.3
Real converters
• IC designers have found that having 4-bit flash
converters (16 levels, 24dB SNR) inside a delta-sigma
converter is often the cheapest way to
achieve the required results with present-day
digital circuitry.
– The sampling rate can be lower, so the circuitry and
power run slower.
– The filters can be shorter.
– The flash converter takes some space, but less power
and space than additional DSP circuitry.
Details about those examples
• All of the examples have ‘n’ integrators
with a knee at 20kHz. This is not
necessarily the optimum solution; it is
used as an example.
• The examples are theoretically calculated,
there is no component or electrical error
involved.
– Nothing is ever this good in the real world. Are
you surprised?
SUMMARY
Auditory system characteristics
• Everything must be considered within the
relevant cochlear filter bandwidth.
• 0dB SPL is slightly below atmospheric noise
level.
• 120dB SPL is a good maximum, even that level
is very dangerous for hearing.
• High frequency issues may be due to actual
hearing, to filter time response issues, or both.
• Gradual filters are safer than steep filters.
Quantization and Sampling
• Anti-aliasing and anti-imaging filters are not
just a good idea, they’re a requirement.
• Dithering is not just a good idea, it’s a
requirement.
• There are many ways to quantize and
sample.
Converter Technology
• Converter technology exists to do proper, clean conversions that
operate over the advisable part of the human hearing range, both in
frequency and level.
• There is no basic mathematical difference in the result of SAR vs. a
Delta-Sigma converter in terms of what it delivers to the PCM
system, the differences are due to circuitry and cost issues.
• Capturing the direct delta-sigma waveform (single or multibit) can be
done. One group of high-rate proponents sells such a system. The
only things this results in, practically, are the removal of the sharp
anti-aliasing filter and the retention of the high-frequency noise.
– This also makes it hard to process or capture the signal
– For instance, an IIR filter that does bass boost might take 128 bits of
arithmetic width to implement for a 1-bit data input.
– Most electronics then require a filter to protect them from the HF noise.
Tweeters and power amps in particular do not like this kind of input at
all.
How to test converters
• Noise in the presence of low signal
• Noise in the presence of maximum signal
• Single tone source
• Broadband (i.e. something like the “room
correction” noise) stimuli
• Multitones that test for aliasing.
One could do another 2 hours on how to test
converters. Is there a demand?