
Digital Audio Compression
CIS 465 Spring 2013
Speech Compression

Compression of voice data
◦ We have previously mentioned several
methods that are used to compress voice
data
 mu-law and A-law companding
 ADPCM and delta modulation
◦ These are examples of methods which work
in the time domain (as opposed to the
frequency domain)
 Often they are not even considered compression
methods
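To make the time-domain idea concrete, here is a minimal sketch of mu-law companding and expansion (mu = 255, the North American telephony value); numpy is assumed, and the function names are mine:

    import numpy as np

    def mu_law_compress(x, mu=255.0):
        # Compand a signal in [-1, 1]: quiet samples get proportionally
        # more of the output range than loud ones do.
        return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

    def mu_law_expand(y, mu=255.0):
        # Exact inverse of the companding curve above.
        return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

    x = np.linspace(-1, 1, 5)
    y = mu_law_compress(x)
    # Uniformly quantizing y (e.g., to 8 bits) gives finer effective
    # resolution for quiet samples than uniform quantization of x would.
    print(np.allclose(mu_law_expand(y), x))  # True

Any bit savings come from the coarse quantization applied after companding, which is one reason these schemes are often not counted as compression methods in their own right.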
Speech Compression


Although the previous techniques are generally applied to speech data, they are not designed specifically for such data
Vocoders, on the other hand, are
◦ Can’t be used with other analog signals
◦ Model speech so that the salient features can be
captured in as few bits as possible
◦ Linear Predictive Coders model the speech
waveform in time
◦ Also channel vocoders and formant vocoders
◦ In electronic music, vocoders allow a voice to
modulate a musical source (via synthesizer, e.g.)
General Audio Compression

If we want to compress general audio
(not just speech), different techniques are
needed
◦ In particular, music compression is a more
general form of audio compression

We make use of psychoacoustical
modeling
◦ Enables perceptual encoding based upon an analysis of how the ear and brain perceive sound
◦ Perceptual encoding exploits audio elements
that the human ear cannot hear well
Psychoacoustics

If you have been listening to very loud
music, you may have trouble afterwards
hearing soft sounds (that normally you
could hear)
◦ Temporal masking

A loud sound at one frequency (a lead
guitar) may drown out a sound at another
frequency (the singer)
◦ Frequency masking
Equal-Loudness Relations

If we play two pure tones, sinusoidal
sound waves, with the same amplitude but
different frequencies
◦ One may sound louder than another
◦ The ear does not hear low or high
frequencies as well as mid-range ones
(speech)
◦ This can be shown with equal-loudness curves
which plot perceived loudness on the axes of
true loudness and frequency
Equal-Loudness Relations
(figure: equal-loudness curves)
Threshold of Hearing

The following image is a plot of the
threshold of human hearing for pure
tones – at loudness below the curve, we
don’t hear a tone
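A widely quoted closed-form approximation of this curve is Terhardt's formula; the sketch below (numpy assumed) is illustrative, not the exact curve any particular codec mandates:

    import numpy as np

    def threshold_of_hearing_db(f_hz):
        # Terhardt's approximation of the absolute threshold of hearing
        # (dB SPL) for a pure tone at frequency f_hz.
        f = np.asarray(f_hz, dtype=float) / 1000.0  # kHz
        return (3.64 * f ** -0.8
                - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
                + 1e-3 * f ** 4)

    # The ear is most sensitive in the mid range (minimum near 3-4 kHz):
    for f in (100, 1000, 4000, 15000):
        print(f, "Hz:", round(float(threshold_of_hearing_db(f)), 1), "dB SPL")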
Threshold of Hearing
(figure: threshold-of-hearing curve for pure tones)

A loud sound can mask other sounds at nearby frequencies, as shown below
(figure: masking curve around a loud tone)
Frequency Masking
We can determine how a pure tone at a
particular frequency affects our ability to
hear tones at nearby frequencies
 Then, if a signal can be decomposed into
frequencies, for those frequencies that are
only partially masked, only the audible
part will be used to set the quantization
noise thresholds

Critical Bands

Human hearing range divides into critical
bands
 Human auditory system cannot resolve sounds
better than within about one critical band when
other sounds are present
 Critical bandwidth represents the ear’s resolving
power for simultaneous tones
 At lower frequencies the bands are narrower than
at higher frequencies
 The band is the section of the inner ear which
responds to a particular frequency
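Zwicker's classic formula approximates the critical-band rate (in Bark) for a frequency in Hz; a small sketch (the function name is mine):

    import math

    def hz_to_bark(f_hz):
        # Zwicker's approximation of critical-band rate.
        return (13.0 * math.atan(0.00076 * f_hz)
                + 3.5 * math.atan((f_hz / 7500.0) ** 2))

    # Bands are narrow at low frequencies and wide at high ones:
    for f in (100, 500, 1000, 5000, 15000, 20000):
        print(f, "Hz ->", round(hz_to_bark(f), 2), "Bark")

Note that hz_to_bark(20000) comes out near 25 Bark, consistent with the roughly 24-25 critical bands quoted on the next slide.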
Critical Bands
(figure: critical-band center frequencies and bandwidths)
Critical Bands

Generally, the audio frequency range for hearing (20 Hz – 20 kHz) can be partitioned into about 24 critical bands (25 are typically used for coding applications)
◦ The previous slide does not show several of
the highest frequency critical bands
◦ The critical band at the highest audible
frequency is over 4000 Hz wide
◦ The ear is not very discriminating within a
critical band
Temporal Masking

A loud tone causes the hearing receptors
in the inner ear to become saturated, and
they require time to recover
◦ This leads to the temporal masking effect
◦ After the loud tone we cannot immediately
hear another tone – post-masking
 The length of the masking depends on the duration
of the masking tone
◦ A masking tone can also block sounds played
just before – pre-masking (shorter time)
Temporal Masking

MPEG audio compression takes advantage
of both temporal and frequency masking
to transmit masked frequency
components using fewer bits
MPEG Audio Compression

MPEG (Moving Picture Experts Group) is a family of standards for compression of both audio and video data
◦ MPEG-1 (1991) CD quality audio
◦ MPEG-2 (1994) Multi-channel surround sound
◦ MPEG-4 (1998) Also includes MIDI, speech,
etc.
◦ MPEG-7 (2003) Not compression – searching
◦ MPEG-21 (2004) Not compression – digital
rights management
MPEG Audio Compression

MPEG-1 defined three downward
compatible layers of audio compression
◦ Each layer offers more complexity in the
psychoacoustic model used and hence better
compression
◦ Increased complexity leads to increased delay
◦ Compatibility achieved by shared file header
information
◦ Layer 1 – used for Digital Audio Tape
◦ Layer 2 – proposed for digital audio broadcasting
◦ Layer 3 – music (MPEG-1 layer 3 == mp3)
MPEG Audio Compression

MPEG audio compression relies on
quantization, masking, critical bands
◦ The encoder uses a bank of 32 filters to
decompose the signal into sub-bands
 Uniform width – not exactly aligned to crit. bands
 Overlapping
◦ A Fourier transform is used for the psychoacoustical model
◦ Layer 3 adds a modified DCT (MDCT) to the sub-band filtering, so layers 1 and 2 work in the temporal domain and layer 3 in the frequency domain
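The encoder's actual filterbank is an efficient 512-tap polyphase design specified in the standard. Purely as an illustrative stand-in, the sketch below (numpy and scipy assumed) splits PCM into 32 uniform-width sub-bands with ordinary FIR filters and keeps every 32nd output sample, the same critical decimation the real filterbank performs:

    import numpy as np
    from scipy.signal import firwin, lfilter

    def naive_32_band_split(x, fs):
        # Illustrative stand-in for the MPEG polyphase filterbank:
        # 32 uniform bands, each decimated by 32.
        edges = [i / 32.0 for i in range(33)]  # normalized to Nyquist
        bands = []
        for i in range(32):
            if i == 0:
                h = firwin(255, edges[1])                      # lowpass
            elif i == 31:
                h = firwin(255, edges[31], pass_zero=False)    # highpass
            else:
                h = firwin(255, [edges[i], edges[i + 1]],
                           pass_zero=False)                    # bandpass
            bands.append(lfilter(h, 1.0, x)[::32])  # keep every 32nd sample
        return bands

    fs = 48000
    t = np.arange(4096) / fs
    x = np.sin(2 * np.pi * 3300 * t)  # 3300 Hz falls in band 4 (3000-3750 Hz)
    bands = naive_32_band_split(x, fs)
    print([round(float(np.abs(b).max()), 2) for b in bands[:8]])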
MPEG Audio Compression
 PCM input filtered into 32 bands
 PCM FFT-transformed for the psychoacoustic (PA) model
 Windows of samples (384, 576, or 1152) coded at a time

MPEG Audio Compression

Since the sub-bands overlap, aliasing may
occur
◦ This is overcome by the use of a quadrature
mirror filter bank
 Attenuation slopes of adjacent bands are mirror
images
MPEG Audio Algorithm

The PCM audio data is assembled into
frames
◦ Header – sync code of 12 1s
◦ SBS format – describe how many sub-band
samples (SBS) are in the frame
◦ The SBS (384 in Layer 1, 1152 in Layers 2, 3)
◦ Ancillary data – e.g. multi-lingual data or
surround-sound data
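Because the sync code is twelve 1 bits, a frame header always begins with a 0xFF byte followed by a byte whose top nibble is 0xF. A minimal scan for candidate frame starts (a real decoder also validates the remaining header fields, since these two bytes can occur by chance inside audio data):

    def find_sync_offsets(data: bytes):
        # Byte offsets whose first 12 bits are all 1s: candidate
        # MPEG audio frame headers.
        return [i for i in range(len(data) - 1)
                if data[i] == 0xFF and (data[i + 1] & 0xF0) == 0xF0]

    # Two fake headers embedded among junk bytes:
    blob = b"\x00\x12" + b"\xff\xfb\x90\x00" + b"\x42" + b"\xff\xfa\x00\x00"
    print(find_sync_offsets(blob))  # [2, 7]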
MPEG Audio Algorithm
The sampling rate determines the frequency range
 That range is divided up into 32 overlapping bands
 The frames are sent through a corresponding bank of 32 filters
 If X is the number of samples per frame, each filter produces X/32 samples

◦ These are still samples in the temporal
domain
MPEG Audio Algorithm

The Fourier transform is performed on a
window of samples surrounding the
samples in the frame (either 1024 or
2*1024 samples)
◦ This feeds into the psychoacoustic model
(along with the subband samples)
◦ Analyze tonal and nontonal elements in each
band
◦ Determine spreading functions (how much
each band affects another)
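The spreading functions in the standard's psychoacoustic models are tabulated and fairly elaborate; as a toy stand-in, the sketch below uses the textbook simplification of a triangular spread, roughly 25 dB per Bark below the masker and 10 dB per Bark above it (all numbers illustrative):

    def spread_masking_db(masker_level_db, masker_bark, band_bark):
        # Toy triangular spreading function: masking falls off ~25 dB
        # per Bark toward lower bands and ~10 dB per Bark toward higher
        # ones, with a flat 10 dB offset for a tonal masker.
        dz = band_bark - masker_bark
        slope = 25.0 if dz < 0 else 10.0
        return masker_level_db - slope * abs(dz) - 10.0

    # A 70 dB SPL tone at 8 Bark raises the threshold in nearby bands:
    for z in range(5, 12):
        print(z, "Bark:", round(spread_masking_db(70.0, 8.0, z), 1), "dB")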
MPEG Audio Algorithm



 Find the masking threshold and signal-to-mask ratios (SMRs) for each band
 The scaling factor for each band is the maximum amplitude of the samples in that band
 The bit-allocation algorithm takes the SMRs and scaling factors and determines how many bits can be allocated (quantization granularity) for each band
◦ In MP3, the bits can be moved from band to band
as needed to ensure a minimum amount of
compression while achieving higher quality
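Bit allocation is often described as a greedy loop: repeatedly grant one more quantization bit to the band whose noise-to-mask ratio is currently worst, until the frame's bit budget is spent. A sketch under that description (each added bit is assumed to buy about 6 dB less quantization noise; the SMR values are made up):

    def allocate_bits(smr_db, total_bits, max_bits_per_band=15):
        # Greedy allocation: noise-to-mask ratio of a band is its
        # SMR minus ~6 dB for every bit already granted to it.
        bits = [0] * len(smr_db)
        for _ in range(total_bits):
            nmr = [s - 6.0 * b if b < max_bits_per_band else float("-inf")
                   for s, b in zip(smr_db, bits)]
            worst = nmr.index(max(nmr))
            if nmr[worst] == float("-inf"):
                break  # every band is already at the cap
            bits[worst] += 1
        return bits

    smr = [22.0, 10.0, 3.0, -5.0]  # signal-to-mask ratio per band (dB)
    print(allocate_bits(smr, total_bits=8))  # [5, 2, 1, 0]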
MPEG Audio Algorithm
Layer 1 has 12 samples encoded per band
per frame
 Layer 2 has 3 groups of 12 (36 samples) per
frame
 Layer 3 has non-equal frequency bands
 Layer 3 also performs a Modified DCT on
the filtered data, so we are in the frequency
(not time) domain
 Layer 3 does non-uniform quantization followed by Huffman coding
◦ All of these modifications make for better (if more complex) performance for MP3
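The MDCT takes 2N time samples from 50%-overlapping windows and produces N frequency coefficients; below is a direct (O(N^2)) numpy sketch of the analysis transform. Windowing is omitted here, though Layer 3 applies a window before transforming; its long blocks use 36 samples in, 18 coefficients out:

    import numpy as np

    def mdct(x):
        # Direct Modified DCT: 2N input samples -> N coefficients.
        # The 50% overlap between consecutive blocks is what makes
        # perfect reconstruction possible despite the 2:1 reduction.
        two_n = len(x)
        n_out = two_n // 2
        n = np.arange(two_n)
        k = np.arange(n_out)
        basis = np.cos(np.pi / n_out
                       * (n[None, :] + 0.5 + n_out / 2) * (k[:, None] + 0.5))
        return basis @ x

    x = np.sin(2 * np.pi * 3 * np.arange(36) / 36)  # one Layer 3 long block
    print(mdct(x).shape)  # (18,)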
Stereo Encoding

MPEG codes stereo data in several
different ways
◦ Joint stereo
◦ Intensity stereo
◦ Etc.
We are not discussing these
MPEG File Format

MPEG files do not have a header (so you
can start playing/processing anywhere in
the file)
◦ Consist of a sequence of frames
◦ Each frame has a header followed by audio
data
MPEG File Format
(figure: MPEG audio frame layout)
MPEG File Format
 ID3 is a metadata container most often used in conjunction with the MP3 audio file format
 Allows information such as the title, artist, album, track number, year, genre, and other information about the file to be stored in the file itself
 An ID3v1 tag occupies the last 128 bytes of the file
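Since ID3v1 is a fixed-layout 128-byte record beginning with the ASCII marker 'TAG', reading it takes only a seek from the end of the file; a minimal sketch (the file name in the usage line is hypothetical):

    def read_id3v1(path):
        # Field layout: 'TAG'(3) title(30) artist(30) album(30)
        # year(4) comment(30) genre(1) = 128 bytes.
        with open(path, "rb") as f:
            f.seek(-128, 2)  # 128 bytes before end of file
            tag = f.read(128)
        if tag[:3] != b"TAG":
            return None      # no ID3v1 tag present
        text = lambda b: b.rstrip(b"\x00 ").decode("latin-1")
        return {
            "title":   text(tag[3:33]),
            "artist":  text(tag[33:63]),
            "album":   text(tag[63:93]),
            "year":    text(tag[93:97]),
            "comment": text(tag[97:127]),
            "genre":   tag[127],  # index into the ID3v1 genre list
        }

    # info = read_id3v1("song.mp3")  # hypothetical file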
Bit Rates

Audio (or Video) compression schemes
can be characterized as either constant
bit rate (CBR) or variable bit rate (VBR)
◦ In general, higher compression can be
achieved with VBR (at the cost of added
complexity for code/decode)
◦ MPEG-1 Layers 1 and 2 are CBR only
◦ MP3 is either VBR or CBR
◦ Average Bit Rate (ABR) is a compromise
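For CBR MP3 the frame size follows from numbers already seen: a Layer 3 frame carries 1152 samples, i.e. 1152/8 = 144 bytes per unit of bitrate/sample-rate, plus an optional padding byte. A quick check:

    def mp3_cbr_frame_bytes(bitrate_bps, sample_rate_hz, padding=0):
        # Layer 3: 1152 samples/frame -> 144 * bitrate / sample_rate bytes.
        return 144 * bitrate_bps // sample_rate_hz + padding

    # 128 kbps at 44.1 kHz gives the classic 417/418-byte MP3 frame:
    print(mp3_cbr_frame_bytes(128_000, 44_100))  # 417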
MPEG-2 AAC


MPEG-2 (which is used for encoding DVDs) has an audio component as well
The MPEG-2 AAC (Advanced Audio Coding) standard was aimed at transparent sound reproduction for theatres
◦ 320 kbps for five channels (left, right, center, left-surround, and right-surround)
◦ 5.1 channel systems include a low-frequency
enhancement channel (“woofer”)
◦ AAC can also deliver high-quality stereo sound at
bitrates less than 128 kbps
MPEG-2 AAC
AAC is the default audio format for (e.g.) YouTube, iPod (iTunes), PS3, Nintendo DSi, etc.
 Compared to MP3:
◦ More sampling frequencies
◦ More channels
◦ More efficient, simpler filterbank (pure
MDCT)
◦ Arbitrary bit rates and variable frame lengths
◦ Etc. etc.
MPEG-4 Audio

MPEG-4 audio integrates a number of
audio components into one standard
◦ Speech compression
◦ Text-to-speech
◦ MIDI
◦ MPEG-4 AAC (similar to MPEG-2 AAC)
◦ Alternative coders (perceptual coders and structured coders)