8. Audio databases
About digital audio:
• Advent of the digital audio CD in 1983.
• Order-of-magnitude improvement in overall sound quality and signal-to-noise ratio over the best analog systems.
• Wide bandwidth required in on-line transmission.
Converting an analog signal into digital form:
• Linear Pulse Code Modulation (PCM)
• Two-stage process:
  (a) Sampling: observing the signal amplitude at certain time intervals; typical sampling frequencies: 16-48 kHz
  (b) Quantization: discrete scale for observed amplitudes, typically 16 bits per sample → 65536 possible values.
• Audio CD: 16-bit samples at a 44.1 kHz rate, with two (stereo) channels: 2 x 16 x 44 100 ≈ 1.4 Mbits per second
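The two PCM stages and the CD bitrate arithmetic above can be sketched in a few lines. This is a minimal illustration, not a production encoder: the function names, the 440 Hz test tone and the rounding-based quantizer are assumptions made for the example.

```python
import math

def pcm_encode(signal, duration_s, sample_rate_hz, bits=16):
    """Sample a signal (a function of time in seconds, range -1..1)
    and quantize each sample to a signed integer of `bits` bits."""
    max_level = 2 ** (bits - 1) - 1                 # 32767 for 16 bits
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                      # (a) sampling instant
        samples.append(round(signal(t) * max_level))  # (b) quantization
    return samples

def pcm_bitrate(channels, bits, sample_rate_hz):
    """Uncompressed PCM bitrate in bits per second."""
    return channels * bits * sample_rate_hz

def tone(t):
    """Hypothetical 440 Hz test tone with full-scale amplitude."""
    return math.sin(2 * math.pi * 440 * t)

samples = pcm_encode(tone, duration_s=0.01, sample_rate_hz=44_100)
print(len(samples))                  # 441 samples for 10 ms of audio
print(pcm_bitrate(2, 16, 44_100))    # 1411200 bits per second
```

The second printed value is the 1.4 Mbits per second quoted for the audio CD.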
MMDB-8
J. Teuhola 2012
Illustration of audio concepts
[Figure: a waveform plotted as amplitude against time, with the wavelength and the sampling interval marked.]
Audio compression techniques
(a) Delta modulation:
• Extremely simple, used sometimes for speech coding
• 1-bit quantizer for amplitude differences: 0 = -, 1 = +
(b) Adaptive Differential Pulse Code Modulation (ADPCM)
• The next sample value is predicted on the basis of recent history; the prediction error is quantized and coded
• Used mainly for speech coding, e.g. ITU-T G.726
(c) Subband coding
• Division of the signal into frequency components (bands)
• Encoding of bands separately
• E.g. ITU-T recommendation G.722: high-quality speech at 64 Kbits per second
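Delta modulation, technique (a) above, is simple enough to sketch directly. The version below is a minimal illustration under assumptions: a fixed step size, a running estimate that starts at zero, and hypothetical function names. ADPCM differs by using a better predictor and an adaptive multi-bit quantizer for the prediction error.

```python
def delta_encode(samples, step):
    """1-bit delta modulation: 1 = '+', 0 = '-' relative to the running estimate."""
    bits, estimate = [], 0
    for s in samples:
        bit = 1 if s > estimate else 0
        estimate += step if bit else -step   # the decoder tracks the same estimate
        bits.append(bit)
    return bits

def delta_decode(bits, step):
    """Rebuild a staircase approximation of the signal from the bit stream."""
    out, estimate = [], 0
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out

signal = [0, 3, 6, 9, 9, 9, 6, 3]
bits = delta_encode(signal, step=3)
print(bits)                        # [0, 1, 1, 1, 1, 0, 0, 0]
print(delta_decode(bits, step=3))  # [-3, 0, 3, 6, 9, 6, 3, 0]
```

The decoded staircase lags the input by up to one step, which is the price of spending only one bit per sample.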
MPEG audio
• Sampling rates 32, 44.1 or 48 kHz (or half of these); samples are processed in frames of 384 or 1152 samples.
• Subband coding with a bank of 32 filters, each with a bandwidth of 1/64 of the sampling frequency.
• Samples are coded with variable quantization steps.
• Psychoacoustic modelling exploits the masking properties of the human ear.
• Compressed bitrates range from 32 to 224 Kbits per second.
• Compression factor from 2.7 to 24.
• MPEG Layer I: best for bitrates > 128 Kbits per second (per channel).
• MPEG Layer II: best for bitrates ≈ 128 Kbits per second (per channel).
• MPEG Layer III: best for bitrates ≈ 64 Kbits per second (per channel) = MP3 music on the Internet (compression ≈ 12:1).
• Discrete Cosine Transform (DCT) on subband signals.
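The figures on this slide follow from simple arithmetic, sketched below. The helper names are hypothetical, and the 12:1 case assumes 16-bit stereo PCM at 48 kHz coded at 2 x 64 Kbits per second, which is one plausible reading of the Layer III figure rather than something the slide states.

```python
def subband_bandwidth_hz(sample_rate_hz, n_bands=32):
    """32 bands cover 0 .. sample_rate/2, so each is sample_rate/64 wide."""
    return (sample_rate_hz / 2) / n_bands

def frame_duration_ms(samples_per_frame, sample_rate_hz):
    """Playing time covered by one frame of samples."""
    return 1000 * samples_per_frame / sample_rate_hz

def compression_factor(pcm_bps, coded_bps):
    """Ratio of the uncompressed PCM bitrate to the coded bitrate."""
    return pcm_bps / coded_bps

print(subband_bandwidth_hz(44_100))                      # 689.0625
print(round(frame_duration_ms(1152, 44_100), 2))         # 26.12
print(compression_factor(2 * 16 * 48_000, 2 * 64_000))   # 12.0
print(compression_factor(16 * 48_000, 32_000))           # 24.0
```

The last line shows that the upper compression factor of 24 matches one 48 kHz channel coded at the lowest bitrate of 32 Kbits per second.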
Audio data retrieval
(a) Based on metadata
• Additional attributes can be attached to audio data (as with images and video), e.g. speaker, date, duration, composer, orchestra, instrument, ...
• Attributes can be connected to the whole audio sequence or to some parts of it (e.g. parts of a symphony).
• General document retrieval techniques usually apply.
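One way to picture metadata attached both to a whole recording and to its parts is a small inverted index. Everything in the sketch below is hypothetical: the record layout, the identifiers and the attribute values are made up for illustration, and a real system would normally rely on a database or a text retrieval engine.

```python
from collections import defaultdict

# Hypothetical catalogue: attributes on the whole recording and on its parts.
recordings = {
    "rec1": {
        "attributes": {"composer": "Composer A", "orchestra": "Orchestra X"},
        "parts": [{"title": "Part 1", "tempo": "Allegro"},
                  {"title": "Part 2", "tempo": "Andante"}],
    },
    "rec2": {
        "attributes": {"speaker": "Speaker B", "date": "2012-03-01"},
        "parts": [],
    },
}

def build_index(recordings):
    """Inverted index: (attribute, value) -> set of (recording id, part number).
    The part number is None when the attribute describes the whole recording."""
    index = defaultdict(set)
    for rec_id, rec in recordings.items():
        for attr, value in rec["attributes"].items():
            index[(attr, str(value).lower())].add((rec_id, None))
        for part_no, part in enumerate(rec["parts"]):
            for attr, value in part.items():
                index[(attr, str(value).lower())].add((rec_id, part_no))
    return index

def lookup(index, attr, value):
    """Case-insensitive exact match on one attribute."""
    return index.get((attr, str(value).lower()), set())

index = build_index(recordings)
print(lookup(index, "composer", "composer a"))   # {('rec1', None)}
print(lookup(index, "tempo", "Andante"))         # {('rec1', 1)}
```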
Audio data retrieval (cont.)
(b) Speech recognition:
• Proximity search of the waveform; feature extraction e.g. from the coefficients of the DCT-transformed signal.
• Some fuzziness is involved.
• Simple application:
  - Giving voice commands to a user interface.
• Advanced application:
  - Parsing of spoken sentences and conversion e.g. to database queries
  - Can be coupled with natural language understanding techniques.
  - Usually based on a predefined set of patterns and associated phonetic rules.
Audio data retrieval (cont.)
(c) Speaker recognition:
• Application: security systems.
• Sensitive to the physical condition (e.g. flu) of the speaker.
• Variations:
  - Text-dependent recognition (simpler): restricted set of possible words/sentences; comparison of digital waveforms.
  - Text-independent recognition (more difficult): based e.g. on voice pitch recognition. More elaborate sentences from particular users must be stored, and complex verification algorithms are run against the spoken samples.
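Voice pitch, mentioned above as one basis for text-independent recognition, can be estimated from a short stretch of samples. The sketch below uses plain autocorrelation, which is a common textbook approach and an assumption here, not necessarily what any particular speaker recognition system uses. The function name, the 60-400 Hz search range and the synthetic 200 Hz "voice" are all illustrative.

```python
import math

def estimate_pitch_hz(samples, sample_rate_hz, min_hz=60, max_hz=400):
    """Return the pitch whose period (lag) maximizes the autocorrelation."""
    min_lag = int(sample_rate_hz / max_hz)
    max_lag = int(sample_rate_hz / min_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate_hz / best_lag

rate = 8000
voice = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
print(estimate_pitch_hz(voice, rate))   # 200.0 for this synthetic tone
```

Real speech is far less regular than this test tone, so a single pitch value would be only one of many features in a verification algorithm.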
Audio data retrieval (cont.)
(d) Recognition and retrieval of songs (recorded music)
Query input alternatives:
• Query-by-humming: succeeds for clearly distinguishable melodies (or themes), in spite of small pitch errors. The similarity measure uses some kind of edit distance.
• Tapping the tempo: complements humming/singing.
• Playing a (virtual) keyboard.
Output:
• Ranked list of candidate songs
Example search engine:
• Musipedia (http://www.musipedia.org/)
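A query-by-humming search with an edit-distance similarity can be sketched as follows. The melody representation is an assumption: each tune is reduced to an up/down/repeat contour, which tolerates small pitch errors, and the contours are compared with the Levenshtein distance. The song table holds only the opening notes as MIDI note numbers, and all function names are hypothetical.

```python
def contour(pitches):
    """Melodic contour: U(p), D(own) or R(epeat) for each successive note."""
    return "".join("U" if b > a else "D" if b < a else "R"
                   for a, b in zip(pitches, pitches[1:]))

def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def rank_songs(hummed_pitches, songs):
    """Return song titles ordered by contour distance to the hummed query."""
    query = contour(hummed_pitches)
    scored = [(edit_distance(query, contour(p)), title)
              for title, p in songs.items()]
    return [title for _, title in sorted(scored)]

songs = {
    "Twinkle Twinkle": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64, 62],
    "Frere Jacques": [60, 62, 64, 60, 60, 62, 64, 60],
}
hummed = [58, 58, 66, 65, 67, 67, 65]   # off-key, with one wrong direction
print(rank_songs(hummed, songs)[0])     # Twinkle Twinkle
```

The output is a ranked list, so a near miss still places the intended song near the top instead of returning nothing.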
Encoding and retrieval of (synthetic) music
Music encoding:
• For digital electronic instruments (no singing!)
• Timing of note-on/note-off events
• Control of instrument and playback parameters (pitch, loudness)
• Can be played with a synthesizer
Encoding formats:
• MIDI (Musical Instrument Digital Interface)
• MPEG-4 SA (Structured Audio)
• MusicXML (notes represented using structured markup)
Retrieval criteria:
• Notes: generalization of string matching (but: polyphony!)
• Time-dependent parameters: instruments, tempo, volume, ...
• Textual metadata: title, composer, artist, genre, date, ...
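The note-on/note-off idea and note-based matching can be illustrated with a simplified event list. This is not the MIDI file format: the (time, kind, pitch) tuples, the tick values and the function names are assumptions for the example. The matcher compares pitch intervals, so a pattern is found regardless of key, and it handles a single melodic line only, which is exactly where polyphony makes the real problem harder.

```python
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int      # note number, 60 = middle C (MIDI convention)
    start: int      # onset time in ticks
    duration: int   # length in ticks

def events_to_notes(events):
    """Pair each note-on with the next note-off of the same pitch.
    `events` is a list of (time, 'on' | 'off', pitch) sorted by time."""
    sounding, notes = {}, []
    for time, kind, pitch in events:
        if kind == "on":
            sounding[pitch] = time
        elif kind == "off" and pitch in sounding:
            start = sounding.pop(pitch)
            notes.append(Note(pitch, start, time - start))
    return sorted(notes, key=lambda n: (n.start, n.pitch))

def find_transposed(pitches, pattern):
    """Indexes where `pattern` occurs in `pitches`, in any transposition."""
    def intervals(seq):
        return [b - a for a, b in zip(seq, seq[1:])]
    target, source = intervals(pattern), intervals(pitches)
    return [i for i in range(len(source) - len(target) + 1)
            if source[i:i + len(target)] == target]

events = [(0, "on", 60), (0, "on", 64), (480, "off", 60),
          (480, "on", 62), (960, "off", 62), (960, "off", 64)]
print(events_to_notes(events))
print(find_transposed([60, 62, 64, 65, 67, 69, 71], [65, 67, 69]))  # [0, 3, 4]
```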
Indexing of audio data
Indexing of metadata (external attributes):
• As with any other documents: inverted indexes, multiattribute indexes, signature files, etc.
Indexing of audio signal:
• First split into segments (= frames, windows). Segmentation requires some rules, e.g. 'quiet' zones are possibly good split points.
• Transformation (e.g. DCT) of each segment into features.
• A multidimensional index is built from groups of the features (e.g. main DCT coefficients).
• Proximity queries (nearest neighbor, or k nearest neighbors of the query sample) should be supported by the index.
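The signal-indexing steps above can be sketched end to end. Several simplifications are assumed: frames have a fixed length instead of being split at quiet zones, the "main" coefficients are taken to be the first few DCT coefficients, and a brute-force scan stands in for a real multidimensional index. All names and the two synthetic test tones are hypothetical.

```python
import math

def dct(frame):
    """Unnormalized DCT-II of one frame."""
    n = len(frame)
    return [sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(frame))
            for k in range(n)]

def frame_features(signal, frame_len, n_coeffs):
    """Split into fixed-length frames; keep the first n_coeffs DCT coefficients."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [dct(f)[:n_coeffs] for f in frames]

def k_nearest(index, query, k):
    """Brute-force k-NN over (label, feature vector) pairs, Euclidean distance."""
    return sorted(index, key=lambda item: math.dist(item[1], query))[:k]

low = [math.sin(2 * math.pi * 1 * n / 32) for n in range(64)]
high = [math.sin(2 * math.pi * 3 * n / 32) for n in range(64)]
index = ([("low", f) for f in frame_features(low, 32, 8)] +
         [("high", f) for f in frame_features(high, 32, 8)])

# Query: a quieter copy of one low-tone frame.
query = frame_features([0.9 * x for x in low[:32]], 32, 8)[0]
print([label for label, _ in k_nearest(index, query, k=2)])   # ['low', 'low']
```

A tree-based or hashing-based multidimensional index would replace `k_nearest` once the number of frames makes a linear scan too slow.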