Digital Audio Compression
CIS 465 Spring 2013
Speech Compression
Compression of voice data
◦ We have previously mentioned several
methods that are used to compress voice
data
mu-law and A-law companding
ADPCM and delta modulation
◦ These are examples of methods which work
in the time domain (as opposed to the
frequency domain)
Often they are not even considered compression
methods
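To make the time-domain idea concrete, here is a minimal sketch of μ-law companding (with the standard μ = 255); the function names are illustrative, not from any particular library.

```python
import numpy as np

MU = 255.0  # the North American / Japanese standard value

def mu_law_compress(x):
    """Compand a signal in [-1, 1]: fine resolution near zero,
    coarser resolution for loud samples."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    """Invert the companding curve."""
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

# Quantizing the companded signal to 8 bits gives roughly the
# perceived quality of finer uniform quantization.
x = np.linspace(-1.0, 1.0, 9)
assert np.allclose(mu_law_expand(mu_law_compress(x)), x)
```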
Speech Compression
Although the previous techniques are
generally applied to speech data, they are not
designed specifically for such data
Vocoders, instead, are
◦ Can’t be used with other analog signals
◦ Model speech so that the salient features can be
captured in as few bits as possible
◦ Linear Predictive Coders model the speech
waveform in time
◦ Also channel vocoders and formant vocoders
◦ In electronic music, vocoders allow a voice to
modulate a musical source (via synthesizer, e.g.)
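As a sketch of what "modeling the waveform in time" means, the following computes linear-prediction coefficients for one frame by solving the autocorrelation normal equations directly (real coders use the equivalent Levinson-Durbin recursion); all names here are illustrative.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Fit a predictor x[n] ~ sum_k a[k] * x[n-k], k = 1..order,
    by solving the Toeplitz normal equations R a = r."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    return np.linalg.solve(R, r[1:])

# A vocoder transmits the coefficients plus a compact description
# of the prediction residual, not the raw samples.
frame = np.sin(2 * np.pi * 0.05 * np.arange(240)) + 0.01 * np.random.randn(240)
a = lpc_coefficients(frame, order=4)
prediction = np.convolve(frame, np.concatenate(([0.0], a)))[:len(frame)]
residual = frame - prediction   # small if the model fits well
```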
General Audio Compression
If we want to compress general audio
(not just speech), different techniques are
needed
◦ In particular, music compression is a more
general form of audio compression
We make use of psychoacoustical
modeling
◦ Enables perceptual encoding based upon an
analysis of how the ear and brain perceive sound
◦ Perceptual encoding exploits audio elements
that the human ear cannot hear well
Psychoacoustics
If you have been listening to very loud
music, you may have trouble afterwards
hearing soft sounds (that normally you
could hear)
◦ Temporal masking
A loud sound at one frequency (a lead
guitar) may drown out a sound at another
frequency (the singer)
◦ Frequency masking
Equal-Loudness Relations
If we play two pure tones, sinusoidal
sound waves, with the same amplitude but
different frequencies
◦ One may sound louder than another
◦ The ear does not hear low or high
frequencies as well as mid-range ones
(speech)
◦ This can be shown with equal-loudness curves
which plot perceived loudness on the axes of
true loudness and frequency
Equal-Loudness Relations
[Figure: equal-loudness curves]
Threshold of Hearing
The following plot shows the threshold of
human hearing for pure tones – at loudness
levels below the curve, we don't hear the tone
[Figure: threshold-of-hearing curve]
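Psychoacoustic models commonly approximate this curve analytically; a sketch using Terhardt's formula for the threshold in quiet (frequency in Hz, result in dB SPL):

```python
import numpy as np

def threshold_in_quiet_db(f_hz):
    """Terhardt's approximation of the absolute threshold of
    hearing for pure tones, in dB SPL."""
    f = np.asarray(f_hz, dtype=float) / 1000.0   # kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# Most sensitive around 2-4 kHz; much less so at the extremes.
print(threshold_in_quiet_db([100, 1000, 3500, 15000]))
```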
Threshold of Hearing
A loud sound can mask other sounds at
nearby frequencies, as shown below
[Figure: masking threshold raised around a loud tone]
Frequency Masking
We can determine how a pure tone at a
particular frequency affects our ability to
hear tones at nearby frequencies
Then, if a signal can be decomposed into
frequencies, for those frequencies that are
only partially masked, only the audible
part will be used to set the quantization
noise thresholds
Critical Bands
Human hearing range divides into critical
bands
The human auditory system cannot resolve sounds
more finely than about one critical band when
other sounds are present
Critical bandwidth represents the ear’s resolving
power for simultaneous tones
At lower frequencies the bands are narrower than
at higher frequencies
The band is the section of the inner ear which
responds to a particular frequency
Critical Bands
[Figure: critical bands]
Critical Bands
Generally, the audio frequency range for
hearing (20 Hz – 20 kHz) can be
partitioned into about 24 critical bands
(25 are typically used for coding
applications)
◦ The previous slide does not show several of
the highest frequency critical bands
◦ The critical band at the highest audible
frequency is over 4000 Hz wide
◦ The ear is not very discriminating within a
critical band
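Critical-band rate is usually expressed in Bark (one Bark per critical band); a sketch using Zwicker's common approximation:

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker's approximation of critical-band rate in Bark.
    The audible range (20 Hz - 20 kHz) spans roughly 24-25 Bark."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

# Bands are narrow at low frequencies and wide at high ones:
# 100 Hz is about 1 Bark; 20 kHz is about 24.6 Bark.
print(hz_to_bark([100, 1000, 10000, 20000]))
```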
Temporal Masking
A loud tone causes the hearing receptors
in the inner ear to become saturated, and
they require time to recover
◦ This leads to the temporal masking effect
◦ After the loud tone we cannot immediately
hear another tone – post-masking
The length of the masking depends on the duration
of the masking tone
◦ A masking tone can also block sounds played
just before – pre-masking (shorter time)
Temporal Masking
MPEG audio compression takes advantage
of both temporal and frequency masking
to transmit masked frequency
components using fewer bits
MPEG Audio Compression
MPEG (Moving Picture Experts Group) is
a family of standards for compression of
both audio and video data
◦ MPEG-1 (1991) CD quality audio
◦ MPEG-2 (1994) Multi-channel surround sound
◦ MPEG-4 (1998) Also includes MIDI, speech,
etc.
◦ MPEG-7 (2003) Not compression – searching
◦ MPEG-21 (2004) Not compression – digital
rights management
MPEG Audio Compression
MPEG-1 defined three downward
compatible layers of audio compression
◦ Each layer offers more complexity in the
psychoacoustic model used and hence better
compression
◦ Increased complexity leads to increased delay
◦ Compatibility achieved by shared file header
information
◦ Layer 1 – used for the Digital Compact Cassette (DCC)
◦ Layer 2 – proposed for digital audio broadcasting
◦ Layer 3 – music (MPEG-1 Layer 3 == MP3)
MPEG Audio Compression
MPEG audio compression relies on
quantization, masking, and critical bands
◦ The encoder uses a bank of 32 filters to
decompose the signal into sub-bands
Uniform width – not exactly aligned to critical bands
Overlapping
◦ A Fourier transform is used for the psychoacoustical model
◦ Layer 3 adds a DCT to the sub-band filtering
so that layers 1 and 2 work in the temporal
domain and layer 3 in the frequency domain
MPEG Audio Compression
The PCM input is filtered into 32 bands
The PCM data is also FFT-transformed for the psychoacoustic model
Windows of samples (384, 576, or 1152) are
coded at a time
MPEG Audio Compression
Since the sub-bands overlap, aliasing may
occur
◦ This is overcome by the use of a quadrature
mirror filter bank
Attenuation slopes of adjacent bands are mirror
images
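The mirror-image relationship is easy to see with a two-band QMF pair, where the high-pass filter is the low-pass prototype modulated by (−1)ⁿ; a minimal sketch (the short prototype here is illustrative only):

```python
import numpy as np

# A short symmetric low-pass prototype (illustrative; real codecs
# use carefully designed prototypes with hundreds of taps)
h0 = np.array([0.25, 0.5, 0.5, 0.25])
# The QMF partner: modulating by (-1)^n shifts the response by
# half the sampling rate, so |H1| mirrors |H0|
h1 = h0 * (-1.0) ** np.arange(len(h0))

H0 = np.abs(np.fft.rfft(h0, 512))
H1 = np.abs(np.fft.rfft(h1, 512))
assert np.allclose(H0, H1[::-1])   # mirror images about fs/4
```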
MPEG Audio Algorithm
The PCM audio data is assembled into
frames
◦ Header – sync code of 12 1s
◦ SBS format – describes how many sub-band
samples (SBS) are in the frame
◦ The SBS themselves (384 in Layer 1, 1152 in Layers 2 and 3)
◦ Ancillary data – e.g. multi-lingual data or
surround-sound data
MPEG Audio Algorithm
The sampling rate determines the
frequency range
That range is divided up into 32
overlapping bands
The frames are sent through a
corresponding 32-filter filter bank
If X is the number of samples per frame,
each filter produces X/32 samples
◦ These are still samples in the temporal
domain
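A toy illustration of that bookkeeping (emphatically not the standard's 512-tap polyphase filter bank): isolate each of 32 uniform bands with an FFT mask and keep every 32nd sample, so a 384-sample frame yields 12 samples per band.

```python
import numpy as np

def toy_subband_split(frame, n_bands=32):
    """Split a frame into n_bands uniform bands (via FFT masking)
    and decimate each band by n_bands. Real MPEG encoders use a
    polyphase filter bank, but the sample accounting is the same."""
    X = np.fft.rfft(frame)
    edges = np.linspace(0, len(X), n_bands + 1).astype(int)
    bands = []
    for b in range(n_bands):
        Y = np.zeros_like(X)
        Y[edges[b]:edges[b + 1]] = X[edges[b]:edges[b + 1]]
        bands.append(np.fft.irfft(Y, len(frame))[::n_bands])
    return bands

frame = np.random.randn(384)            # one Layer 1 frame
bands = toy_subband_split(frame)
assert len(bands) == 32 and len(bands[0]) == 12   # 384/32 per band
```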
MPEG Audio Algorithm
The Fourier transform is performed on a
window of samples surrounding the
samples in the frame (either 1024 or
2*1024 samples)
◦ This feeds into the psychoacoustic model
(along with the subband samples)
◦ Analyze tonal and nontonal elements in each
band
◦ Determine spreading functions (how much
each band affects another)
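A sketch of the last two steps with a heavily simplified model: a two-slope spreading function on the Bark scale (the slopes and the tonal-masker offset below are typical textbook values, not the standard's):

```python
import numpy as np

def spreading_db(dz):
    """Attenuation (dB) of a masker's influence dz Bark away.
    Roughly 25 dB/Bark toward lower bands and 10 dB/Bark toward
    higher bands -- masking spreads further upward in frequency."""
    return np.where(dz < 0.0, 25.0 * dz, -10.0 * dz)

def masking_threshold_db(masker_db, masker_band, n_bands=25):
    """Per-band threshold due to one tonal masker; the 14.5 dB
    offset is a typical tonal-masker correction."""
    dz = np.arange(n_bands) - masker_band
    return masker_db - 14.5 + spreading_db(dz)

# A 70 dB tone in band 10 masks neighbouring bands progressively less
print(masking_threshold_db(70.0, masker_band=10)[7:14])
```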
MPEG Audio Algorithm
Find the masking threshold and signal-to-mask ratios (SMRs) for each band
The scaling factor for each band is the
maximum amplitude of the samples in that
band
The bit-allocation algorithm takes the
SMRs and scaling factors and determines how
many bits to allocate (the quantization
granularity) for each band
◦ In MP3, the bits can be moved from band to band
as needed to ensure a minimum amount of
compression while achieving higher quality
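A greedy sketch of the idea: each extra bit buys about 6 dB of quantization SNR, so keep giving a bit to the band whose mask-to-noise ratio is currently worst until the budget is spent (the real rate loops consult standard tables and are more involved):

```python
import numpy as np

def allocate_bits(smr_db, bit_budget):
    """Greedy SMR-driven allocation. Band b's mask-to-noise ratio
    is ~6.02 * bits[b] - smr_db[b]; improve the worst band first."""
    bits = np.zeros(len(smr_db), dtype=int)
    for _ in range(bit_budget):
        mnr = 6.02 * bits - np.asarray(smr_db)
        bits[int(np.argmin(mnr))] += 1
    return bits

smr = [20.0, 5.0, -3.0, 12.0]     # dB; negative means fully masked
print(allocate_bits(smr, bit_budget=8))   # most bits go to band 0
```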
MPEG Audio Algorithm
Layer 1 has 12 samples encoded per band
per frame
Layer 2 has 3 groups of 12 (36 samples) per
band per frame
Layer 3 has non-equal frequency bands
Layer 3 also performs a Modified DCT on
the filtered data, so we are in the frequency
(not time) domain
Layer 3 does non-uniform quantization
followed by Huffman coding
◦ All of these modifications make for better (if
more complex) performance for MP3
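For reference, a direct O(N²) MDCT of one 2N-sample block, as used conceptually in Layer 3's hybrid filter bank (real encoders apply a window, overlap blocks by 50%, and use a fast factorization):

```python
import numpy as np

def mdct(x):
    """MDCT of 2N samples -> N coefficients:
    X[k] = sum_n x[n] * cos(pi/N * (n + 0.5 + N/2) * (k + 0.5))"""
    n2 = len(x)
    N = n2 // 2
    n = np.arange(n2)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2.0)
                   * (k[:, None] + 0.5))
    return basis @ x

block = np.random.randn(36)   # a Layer 3 "long block"
coeffs = mdct(block)          # 18 frequency-domain coefficients
```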
Stereo Encoding
MPEG codes stereo data in several
different ways
◦ Joint stereo
◦ Intensity stereo
◦ Etc.
We are not discussing these
MPEG File Format
MPEG files do not have a header (so you
can start playing/processing anywhere in
the file)
◦ Consist of a sequence of frames
◦ Each frame has a header followed by audio
data
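A sketch of how a player can start anywhere: scan for the sync pattern of 12 set bits that begins every frame header (a robust parser would also validate the layer, bitrate, and sample-rate fields before trusting a match; the file name is hypothetical).

```python
def find_frame_syncs(data: bytes):
    """Yield byte offsets where an MPEG-1 audio frame header may
    begin: 0xFF followed by a byte whose top four bits are set,
    i.e. 12 sync bits in a row."""
    for i in range(len(data) - 1):
        if data[i] == 0xFF and data[i + 1] & 0xF0 == 0xF0:
            yield i

with open("song.mp3", "rb") as f:          # hypothetical input
    offsets = list(find_frame_syncs(f.read()))
print(offsets[:5])
```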
MPEG File Format
ID3 is a metadata container most often
used in conjunction with the MP3 audio
file format.
Allows the title, artist, album, track
number, year, genre, and other metadata
to be stored in the file itself.
ID3v1 tags occupy the last 128 bytes of the file
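The ID3v1 layout is fixed: a "TAG" marker, then a 30-byte title, 30-byte artist, 30-byte album, 4-byte year, 30-byte comment, and a 1-byte genre index. A minimal reader (file name hypothetical):

```python
def read_id3v1(path):
    """Parse the ID3v1 tag from the last 128 bytes of an MP3 file,
    or return None if no tag is present."""
    with open(path, "rb") as f:
        f.seek(-128, 2)                     # 128 bytes from the end
        tag = f.read(128)
    if tag[:3] != b"TAG":
        return None
    text = lambda a, b: tag[a:b].rstrip(b"\x00 ").decode("latin-1")
    return {
        "title":   text(3, 33),
        "artist":  text(33, 63),
        "album":   text(63, 93),
        "year":    text(93, 97),
        "comment": text(97, 127),
        "genre":   tag[127],                # index into the genre list
    }

print(read_id3v1("song.mp3"))               # hypothetical file
```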
Bit Rates
Audio (or Video) compression schemes
can be characterized as either constant
bit rate (CBR) or variable bit rate (VBR)
◦ In general, higher compression can be
achieved with VBR (at the cost of added
complexity to encode/decode)
◦ MPEG-1 Layers 1 and 2 are CBR only
◦ MP3 is either VBR or CBR
◦ Average Bit Rate (ABR) is a compromise
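The CBR arithmetic is straightforward: a Layer 3 frame holds 1152 samples, so its size is 144 × bitrate ÷ sample rate bytes (plus an optional padding byte); under VBR this varies frame by frame. A quick check:

```python
def layer3_frame_bytes(bitrate_bps, sample_rate_hz, padded=False):
    """Layer 3 frame size: 1152 samples/frame over 8 bits/byte
    gives 144 * bitrate / sample_rate bytes, +1 if padded."""
    return 144 * bitrate_bps // sample_rate_hz + int(padded)

# 128 kbps CBR at 44.1 kHz: 417 (or 418 padded) bytes per frame,
# one frame per 1152/44100 ~ 26 ms of audio.
print(layer3_frame_bytes(128_000, 44_100))         # 417
print(layer3_frame_bytes(128_000, 44_100, True))   # 418
```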
MPEG-2 AAC
MPEG-2 (which is used for encoding DVDs)
has an audio component as well
MPEG-2 AAC (Advanced Audio Coding)
standard was aimed at transparent sound
reproduction for theatres
◦ 320 kbps for five channels (left, right, center, left-surround, and right-surround)
◦ 5.1 channel systems include a low-frequency
enhancement channel (“woofer”)
◦ AAC can also deliver high-quality stereo sound at
bitrates less than 128 kbps
MPEG-2 AAC
AAC is the default audio format for (e.g.):
YouTube, iPod (iTunes), PS3, Nintendo DSi,
etc.
Compared to MP3
◦ More sampling frequencies
◦ More channels
◦ More efficient, simpler filterbank (pure
MDCT)
◦ Arbitrary bit rates and variable frame lengths
◦ Etc. etc.
MPEG-4 Audio
MPEG-4 audio integrates a number of
audio components into one standard
◦ Speech compression
◦ Text-to-speech
◦ MIDI
◦ MPEG-4 AAC (similar to MPEG-2 AAC)
◦ Alternative coders (perceptual coders and structured coders)