A MISSING FEATURE APPROACH TO
INSTRUMENT IDENTIFICATION IN
POLYPHONIC MUSIC
Jana Eggink and Guy J. Brown
University of Sheffield
Automatic Music Transcription
• input: audio recording
• output: score or other symbolic representation
• needed (for every note):
• pitch
• start and duration
• instrument
• extras: key (C major), meter (4/4), bars, loudness, expression...
• useful for:
• musicologists
• musicians
• music information retrieval
Instrument Identification
possible clues:
• method of excitation (striking, blowing, plucking or bowing strings) causes:
• noise during the onset
• delayed entry of individual partials during the onset
• spectral fluctuations during the steady state
• resonance properties of the instrument body mostly affect the steady state:
• energy distribution among high and low partials
• formant regions
• spectral bandwidth
Example Spectrograms
[Spectrograms of an oboe tone and a cello tone: Frequency (Hz), 0-8000, against Time (s)]
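Spectrograms like these can be produced with standard signal processing tools; below is a minimal sketch (the file name, window length and overlap are assumptions, not the settings used for the slide), using scipy and matplotlib:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

# load a mono recording (path is a placeholder)
sr, audio = wavfile.read('oboe_note.wav')
audio = audio.astype(float)

# short-time Fourier analysis: 1024-sample Hann windows, 75% overlap
f, t, Sxx = spectrogram(audio, fs=sr, window='hann',
                        nperseg=1024, noverlap=768)

plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12))  # dB scale
plt.ylim(0, 8000)                                 # same range as the slide
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.show()
```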
Human Instrument Identification
• different clues from the onset and the steady state are used; individual clues such as the static spectrum can be enough to identify some, but not all, instruments
• the onset seems most relevant for instrument family discrimination
• better performance on musical phrases than on single tones
• experts are better than non-experts
Computer Instrument Identification
JC Brown et al. (2001):
• GMM classifier
• frame-based cepstral coefficients
• 4 woodwinds (flute, clarinet, oboe, saxophone)
• realistic, monophonic phrases
• computer: 60% correct on average, 80% with the best parameter choice
• humans: 85%
KD Martin (1999):
• hierarchical classification scheme
• different features, both temporal and spectral
• 27 different instruments
• realistic, monophonic phrases and single notes
• computer: 48% instrument correct, 75% instrument family
• humans: 57% instrument correct, 95% instrument family
Polyphonic
Kashino & Murase (1999):
• time domain approach
• example waveforms stored for each note of each instrument
• best match found using adaptive filtering techniques
• iterative subtraction scheme
• 3 instruments: flute, violin, piano
• specially made recording
• F0s and onset times supplied
• 68% correct (max. polyphony 3)
Kinoshita et al. (1999):
• frequency domain approach
• features measuring temporal variation at the onset and spectral energy distribution
• colliding partials are identified and the corresponding feature values are (mostly) ignored
• 3 instruments: clarinet, violin, piano
• random chord combinations made from 2 isolated tones
• 70% correct (78% if correct F0s were supplied)
Our System
• the missing feature approach works for speech recognition in the presence of noise
• GMMs trained with spectral features perform well for realistic monophonic music
• GMMs have also been used in combination with a missing feature approach for speaker identification in noise
→ use a GMM classifier in combination with a missing feature approach for instrument recognition in realistic, polyphonic music
System Overview
[Block diagram: the sampled audio signal feeds both a Fourier analysis, which yields the spectral features, and an F0 analysis, which yields the feature mask; spectral features and feature mask are passed to the GMM classifier, which outputs the instrument class]
F0-analysis
• iterative approach based on harmonic sieves (Scheffers, 1983), as sketched below
[Illustration: a badly fitting sieve versus the best fitting sieve, which determines the F0]
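A minimal sketch of the harmonic sieve idea (not the authors' exact implementation; the peak list, candidate range and 3% tolerance are assumptions): each candidate F0 defines a sieve of harmonic positions, and the candidate whose sieve collects the most spectral peak energy is taken as the F0.

```python
import numpy as np

def sieve_f0(peak_freqs, peak_mags, candidates, tol=0.03):
    """Score each F0 candidate by the peak energy falling into its harmonic
    sieve (harmonics within a relative tolerance) and return the best fit."""
    scores = np.zeros(len(candidates))
    for c, f0 in enumerate(candidates):
        for f, m in zip(peak_freqs, peak_mags):
            k = int(round(f / f0))                  # nearest harmonic number
            if k >= 1 and abs(f - k * f0) <= tol * k * f0:
                scores[c] += m                      # peak falls through the sieve
    return candidates[np.argmax(scores)]

# spectral peaks of a tone with F0 = 220 Hz; the best fitting sieve should
# land within the 3% tolerance of 220 Hz (subharmonic candidates are kept
# out of the search range here to avoid octave errors)
peaks = np.array([220.0, 440.0, 660.0])
mags = np.array([1.0, 0.6, 0.4])
print(sieve_f0(peaks, mags, np.arange(150.0, 500.0, 1.0)))
```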
Missing Feature Estimation
• distinguishing reliable from unreliable features is one of the main problems
• instrument tones have an approximately harmonic overtone series
• based on the extracted F0s, all frequency regions where a partial from a non-target tone is found are marked as unreliable and excluded from the recognition process (see the sketch below)
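A minimal sketch of this masking step, assuming linearly spaced 60 Hz bands (the band count and the tolerance around each partial are illustrative, not the authors' exact settings):

```python
import numpy as np

def reliability_mask(nontarget_f0s, n_bands=120, band_width=60.0, tol=0.03):
    """Binary mask over linearly spaced frequency bands (band_width Hz each):
    True = reliable, False = a partial of a non-target tone falls here."""
    mask = np.ones(n_bands, dtype=bool)
    max_freq = n_bands * band_width
    for f0 in nontarget_f0s:
        for k in range(1, int(max_freq // f0) + 1):
            partial = k * f0
            lo, hi = partial * (1 - tol), partial * (1 + tol)
            first = max(int(lo // band_width), 0)
            last = min(int(hi // band_width), n_bands - 1)
            mask[first:last + 1] = False   # exclude bands overlapping the partial
    return mask

# with one interfering tone at 415 Hz, bands around 415, 830, 1245 Hz ... are excluded
m = reliability_mask([415.0])
print(np.flatnonzero(~m)[:6] * 60.0)       # lower edges of the first unreliable bands
```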
Features
• local spectral features are required for the missing feature approach
• frame-based (exact onset detection is hard in polyphonic music)
• energy in narrow frequency bands (60 Hz wide)
• linear spacing, corresponding to the linear spacing of partials (see the sketch below)
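A minimal sketch of such frame-based features (frame length, band count and log compression are assumptions): the power spectrum of a windowed frame is pooled into linearly spaced 60 Hz bands.

```python
import numpy as np

def band_energies(frame, sr, band_width=60.0, n_bands=120):
    """Frame-based spectral features: log energy in linearly spaced,
    band_width-Hz-wide frequency bands."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    band_idx = (freqs // band_width).astype(int)
    energies = np.zeros(n_bands)
    for b in range(n_bands):
        e = spectrum[band_idx == b].sum()
        energies[b] = np.log(e + 1e-12)    # log compression, floor avoids log(0)
    return energies

# e.g. a 4096-sample frame at 44.1 kHz gives ~10.8 Hz FFT resolution,
# fine enough to pool into 60 Hz bands
feat = band_energies(np.random.randn(4096), sr=44100)
```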
Example Features with Mask
[Feature plots: target tone (violin D), target tone + mask, non-target tone (oboe G sharp), non-target tone + mask, mixture, mixture + mask]
GMMs
• approximate a distribution by a combination of individual Gaussians
[Illustration: a 2-dimensional distribution modelled by a GMM consisting of 3 individual Gaussians]
• means and covariances trained by the EM algorithm (sketched below)
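A minimal training sketch (the number of mixture components, the use of scikit-learn for EM, and the random placeholder feature matrices are assumptions made only for illustration); one diagonal-covariance GMM is fitted per instrument from its frame-based spectral features:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_instrument_gmm(features, n_components=10, seed=0):
    """Fit a diagonal-covariance GMM to the frame-based spectral features of one
    instrument; means, variances and mixture weights are estimated by EM."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag',
                          random_state=seed)
    return gmm.fit(features)

# one model per instrument (the feature matrices here are placeholders)
models = {name: train_instrument_gmm(np.random.randn(500, 120))
          for name in ['flute', 'clarinet', 'oboe', 'violin', 'cello']}
```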
GMMs with Missing Features
The probability density function (pdf) of an observed D-dimensional spectral feature vector x is modeled as:

p(\mathbf{x}) = \sum_{i=1}^{N} p_i \, \phi_i(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)

Assuming feature independence, this can be rewritten as:

p(\mathbf{x}) = \sum_{i=1}^{N} p_i \prod_{j=1}^{D} \phi_i(x_j; m_{ij}, \sigma_{ij}^2)

Approximating the pdf from reliable data only leads to:

p(\mathbf{x}) = \sum_{i=1}^{N} p_i \prod_{j \in M'} \phi_i(x_j; m_{ij}, \sigma_{ij}^2)

N = number of Gaussians in the mixture model, p_i = mixture weight, \phi_i = Gaussian component, \mu_i = mean vector, \Sigma_i = covariance matrix, m_{ij} = mean and \sigma_{ij}^2 = variance of feature j in component i, M' = subset of reliable features in mask M.
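A sketch of how the last equation can be evaluated for a diagonal-covariance GMM: the product over features is restricted to the reliable subset M' selected by the mask (function and array names are illustrative, not the authors' code).

```python
import numpy as np

def missing_feature_loglik(x, mask, weights, means, variances):
    """Log p(x) for a diagonal-covariance GMM, marginalising over unreliable
    features: only dimensions where mask is True enter the product over j."""
    x_r = x[mask]
    mu_r = means[:, mask]                  # shape (N, |M'|)
    var_r = variances[:, mask]
    # per-component sum over reliable j of log N(x_j; m_ij, sigma^2_ij)
    log_phi = -0.5 * (np.log(2 * np.pi * var_r)
                      + (x_r - mu_r) ** 2 / var_r).sum(axis=1)
    # log-sum-exp over components, weighted by the mixture weights p_i
    a = np.log(weights) + log_phi
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# with a GaussianMixture g fitted as above (covariance_type='diag'):
#   missing_feature_loglik(x, mask, g.weights_, g.means_, g.covariances_)
# classification picks the instrument whose GMM gives the highest value
```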
Results Monophonic
• GMMs trained for 5 instruments: flute, clarinet, oboe, violin, cello
• realistic monophonic phrases (3-4 per instrument): 83% correct
• single notes: 66% instrument correct, 85% instrument family correct

Confusion matrix for single notes (%, rows: presented instrument, columns: recognised as):

            flute   clarinet   oboe   violin   cello
flute         77       15        0       0       8
clarinet      15       62        0       8      15
oboe           0       15       69       8       8
violin         0        0       15      54      31
cello          0        0       15      15      69
Random 2-tone Chords
• correct F0s were provided
• 49% instrument correct, 72% instrument family correct

Confusion matrix (%, rows: presented instrument, columns: recognised as):

            flute   clarinet   oboe   violin   cello
flute         75        6        0      10       9
clarinet      13       49        3      22      12
oboe          20       13       25      28      16
violin         3        4       10      57      25
cello          3        9       16      36      37
Realistic Duet Recording
• duet for flute and clarinet by H. Villa-Lobos
• F0s extracted by the system
system output: [plot of fundamental frequency (Hz), 0-700, against time (frames), 0-120]
original score: flute and clarinet in A
F0s according to the score in Hz:
415 - 415 - 415 - 622 - 622
208 - 185 - 175 - 277 - 294 - 247 - 220 - 208
Conclusions
• looks promising for small ensembles
• works with realistic stimuli
Future Work
• include temporal information
• idea: one HMM for every instrument tone
• missing feature approach comparable to the one used here, or
• spectral subtraction based on templates