On the Correlation between Energy and Pitch Accent in Read

On the Correlation between Energy and Pitch Accent in Read English Speech
Andrew Rosenberg
Weekly Speech Lab Talk
6/27/06
Talk Outline

Introduction to Pitch Accent
Previous Work
Contribution and Approach
Corpus
Results and Discussion
Conclusion
Future Work
Introduction

Pitch accent is the way a word is made to “stand out” from its surrounding utterance.
  As opposed to lexical stress, which refers to the most prominent syllable within a word.

Accurate detection of pitch accent is particularly important to many NLU tasks.
  Identification of “important” words.
  Indication of discourse status and structure.
  Disambiguation of syntax/semantics.

Pitch (f0), duration, and energy are all known correlates of pitch accent.
Previous Work

Sluijter and van Heuven 96, 97 showed that accent in Dutch strongly correlates with the energy of a word extracted from the frequency subband > 500Hz.
Heldner 99, 01 and Fant et al. 00 found that energy in a particular spectral region indicated accent in Swedish.
A lot of research attention has been given to the automatic identification of prominent or accented words.
  Tamburini 03, 05 used the energy components of the 500Hz-2000Hz band.
  Tepperman 05 used the RMS energy from the 60Hz-400Hz band.
  Far too many others to mention here.
Contribution and Approach


There is no agreement as to the best (most discriminative) frequency subband from which to extract energy information.
We set up a battery of analysis-by-classification experiments varying:
  The frequency band:
    Lower bound frequency ranged from 0 to 19 bark.
    Bandwidth ranged from 1 to 20 bark.
    Upper bound was limited to 20 bark by the 8 kHz Nyquist rate.
    We also analyzed the first and/or second formants.
  The region of analysis:
    Full word, only syllable nuclei, longest syllable, longest syllable nuclei.
  Speaker:
    Each of 4 speakers separately, and all together.
We performed the classification using J48, a Java implementation of C4.5.
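
The frequency-band grid described on this slide can be sketched as a simple enumeration. This is only an illustration: the 1-bark step size and the function name `enumerate_bark_bands` are assumptions, since the slides give the ranges but not the step or the exact set of bands that were run.

```python
def enumerate_bark_bands(max_bark=20):
    """List candidate (lower, upper) band edges in bark.

    Assumes integer (1-bark) steps, which the slides do not state.
    """
    bands = []
    for lower in range(0, max_bark):            # lower bound: 0..19 bark
        for width in range(1, max_bark + 1):    # bandwidth: 1..20 bark
            upper = lower + width
            if upper <= max_bark:               # upper bound limited to 20 bark
                bands.append((lower, upper))
    return bands


bands = enumerate_bark_bands()
print(len(bands), "candidate bands, e.g.", bands[:3])
```

Under these assumptions, both the 3-18 bark and the 2-20 bark bands discussed in the results are members of the grid.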
Contribution and Approach

Local Features:
  Minimum, maximum, mean, standard deviation and RMS of energy
  z-score of max energy within the word
  Mean slope
  Energy contour classification {rising, falling, peak, valley}
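
As a rough sketch of the local features, assuming each word is represented by a list of frame-level energy values from one subband: the function name, the use of the population standard deviation, and the definition of mean slope as the average frame-to-frame change are all assumptions, not details from the slides. The contour classification is left out because the slides do not say how the four classes were assigned.

```python
import math


def local_energy_features(energy):
    """Summary statistics over one word's frame-level energy values."""
    n = len(energy)
    mean = sum(energy) / n
    # Population standard deviation (assumed; the slides do not specify).
    std = math.sqrt(sum((e - mean) ** 2 for e in energy) / n)
    rms = math.sqrt(sum(e * e for e in energy) / n)
    peak = max(energy)
    # Average frame-to-frame difference, which telescopes to (last - first) / (n - 1).
    mean_slope = (energy[-1] - energy[0]) / (n - 1) if n > 1 else 0.0
    return {
        "min": min(energy),
        "max": peak,
        "mean": mean,
        "std": std,
        "rms": rms,
        "z_max": (peak - mean) / std if std > 0 else 0.0,
        "mean_slope": mean_slope,
    }


features = local_energy_features([1.0, 3.0, 2.0, 2.0])
print(features)
```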
Context-based Features:
  Use 6 contexts: (# previous words, # following words)
    (2,2) (1,1) (1,0) (2,0) (0,1) (2,1)
  (max_word - mean_region) / std.dev_region
  (mean_word - mean_region) / std.dev_region
  (max_word - max_region) / std.dev_region
  max_word / (max_region - min_region)
  mean_word / (max_region - min_region)
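
The five context-normalized features above could be computed along these lines. This is a minimal sketch under stated assumptions: region statistics are taken over all energy frames of the target word plus its neighbors (the slides do not say whether the target word is included), the region is truncated at utterance edges, and the names `context_features` and `safe_div` are hypothetical.

```python
import statistics

# (# previous words, # following words), as listed on the slide
CONTEXTS = [(2, 2), (1, 1), (1, 0), (2, 0), (0, 1), (2, 1)]


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den != 0 else 0.0


def context_features(words, i, n_prev, n_next):
    """words: list of per-word frame-energy lists; i: index of the target word."""
    word = words[i]
    start = max(0, i - n_prev)
    end = min(len(words), i + n_next + 1)
    # Region frames include the target word itself (an assumption).
    region = [e for w in words[start:end] for e in w]
    max_w, mean_w = max(word), statistics.fmean(word)
    max_r, min_r = max(region), min(region)
    mean_r, std_r = statistics.fmean(region), statistics.pstdev(region)
    return {
        "max_minus_mean": safe_div(max_w - mean_r, std_r),
        "mean_minus_mean": safe_div(mean_w - mean_r, std_r),
        "max_minus_max": safe_div(max_w - max_r, std_r),
        "max_over_range": safe_div(max_w, max_r - min_r),
        "mean_over_range": safe_div(mean_w, max_r - min_r),
    }


words = [[1.0, 2.0], [2.0, 4.0], [1.0, 2.0]]
for n_prev, n_next in CONTEXTS:
    print((n_prev, n_next), context_features(words, 1, n_prev, n_next))
```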
Corpus

Boston Directions Corpus (BDC) [Hirschberg&Nakatani96]
  Speech elicited from a direction-giving task.
  Used only the read portion.
  50 minutes
  Fully ToBI labeled
  10825 words
  Manually segmented
  4 Speakers: 3 male, 1 female
Results and Discussion

Energy from different frequency regions predicts pitch accent differently.
  Mean relative improvement of best region over worst: 14.8%
Results and Discussion


Our experiments did not confirm previously reported results.
The single most predictive subband for all speakers was 3-18 bark over full words.
  Classification Accuracy: 76% (42.4% baseline)
  p=71.6, r=73.4
  However, it performs significantly worse than the best subband when analyzing a single speaker
    (not the female speaker).
Results and Discussion

The subband from 2-20 bark performs significantly worse than the most predictive in only a single experiment (h1nucl).
  Accuracy: 75.5% (p=70.5, r=72.5)
  Due to its robustness we consider this band the “best”.

The formant-based energy features tend to perform worse.
  6.4% mean accuracy reduction from 2-20 bark
  Attributable to:
    Errors in the formant tracking algorithm
    The presence of discriminative information in higher formants
Results and Discussion

Most predictive features were normalized maximum energy relative to the mean and standard deviation of three contextual regions:
  1 previous and 1 following word
  2 previous and 1 following word
  2 previous and 2 following words
Results and Discussion



There is a relatively small intersection of correct predictions even among similar subbands.
10823 of 10825 words were correctly classified by at least one classifier.
Using a majority voting scheme:
  Accuracy: 81.9% (p=76.7, r=82.5)
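
A majority voting scheme of this kind can be sketched as follows. The tie-breaking rule (ties count as unaccented), the function names, and the reading of p and r as precision and recall on the accented class are assumptions; the slides do not say which classifiers took part in the vote.

```python
def majority_vote(predictions):
    """predictions: one list of boolean labels per classifier (True = accented)."""
    n_classifiers = len(predictions)
    # A word is labeled accented only if a strict majority votes for it.
    return [sum(votes) * 2 > n_classifiers for votes in zip(*predictions)]


def accuracy_precision_recall(predicted, gold):
    """Accuracy over all words; precision and recall on the accented class."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    accuracy = sum(1 for p, g in pairs if p == g) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Toy example: three classifiers, four words.
predictions = [
    [True, False, True, False],
    [True, True, False, False],
    [False, True, True, False],
]
gold = [True, False, True, False]
voted = majority_vote(predictions)
print(voted, accuracy_precision_recall(voted, gold))
```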
Results and Discussion

How do the regioning strategies perform?
Full Word > All Nuclei > Longest Syllable ~ Longest Nuclei

Why does analysis of the full word outperform other regioning strategies?
  Duration is a crude measure of lexical stress.
  Syllable/nuclei segmentation algorithms are imperfect.
  Pitch accents are not neatly placed.
  More data can highlight distinctions more easily.
Conclusion

Using an analysis-by-classification approach we showed:
  Energy from different frequency bands correlates with pitch accent differently.
  The “best” (highest accuracy, most robust) frequency region is 2-20 bark (>2 bark?).
  A voting classifier based exclusively on energy can predict accent reliably.
Future Work

Can we predict which bands will predict accent best for a given word?
We plan on incorporating these findings into a general pitch accent classifier with pitch and duration features.
We plan on repeating these experiments on spontaneous speech data.