Snerg2003 - UB Computer Science and Engineering


Towards a model of speech production: Cognitive modeling and computational applications
Michelle L. Gregory
SNeRG 2003
Outline
Where I’ve been
• Predictability effects on word duration
• Predictability effects on pitch accent
Where I’m at
• Computational model of pitch accent
• Prosodic information to aid parsing
• Psychological models of production
Where I’d like to go
• Prosody
• Disfluencies
• Speech synthesis

Where I’ve been
Where I’m at…
CU-BOULDER LINGUISTICS PROFESSOR WINS 2002 MACARTHUR FELLOWSHIP
Where I’m headed …
Predictability effects on word duration
Methodology
• Corpus and design (swbd, regression)
• Measures of predictability (frequency, bigram, joint, mutual information, repetition)
Function words (top ten most frequent)
• Vowel reduction
• Coda deletion
• Duration
Content words
• t/d deletion
• Duration
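The predictability measures listed above can be estimated directly from corpus counts. The sketch below assumes unsmoothed relative-frequency estimates over a tokenized transcript. The function name, the toy utterance, and the use of pointwise mutual information in bits are illustrative assumptions, not the procedure used in the study.

```python
import math
from collections import Counter


def predictability_measures(tokens):
    """Compute simple predictability measures for each word given the previous word.

    All estimates are plain relative frequencies over `tokens` (no smoothing),
    so `tokens` must contain at least two words.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_bigrams = len(tokens) - 1

    rows = []
    seen = {tokens[0]}  # words already produced, for the repetition measure
    for prev, word in zip(tokens, tokens[1:]):
        p_word = unigrams[word] / n_words
        p_prev = unigrams[prev] / n_words
        p_joint = bigrams[(prev, word)] / n_bigrams
        # P(word | prev), estimated as count(prev word) / count(prev)
        p_cond = bigrams[(prev, word)] / unigrams[prev]
        rows.append({
            "word": word,
            "frequency": p_word,
            "conditional": p_cond,
            "joint": p_joint,
            "mutual_information": math.log2(p_joint / (p_prev * p_word)),
            "repeated": word in seen,
        })
        seen.add(word)
    return rows


# Toy utterance (hypothetical, far too small for real estimates).
toy = "i would i would like to not have to work".split()
for row in predictability_measures(toy):
    print(row)
```

A real analysis would estimate these quantities from the full corpus and would likely need smoothing for rare bigrams.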
Predictability effects on word duration
The probabilistic reduction hypothesis:
The higher the probability of a word, the more it is reduced/shortened/lenited in lexical production (Gregory et al. 1999; Jurafsky, Bell, Gregory, and Raymond 2000).

Implications
• Any factor that increases the probability of a word also increases phonological reduction.
Is that the only role of probabilistic information?
Predictability effects on pitch accent
Same database, used regression models, but this time coded for pitch accent.

What is pitch accent?
Perceptual phenomenon (Hirschberg, 1993)
Associated with duration, amplitude, and F0 of units. Words that appear more intonationally prominent than others are said to bear pitch accent.
Predictability and pitch accent
Even now I would I would LIKE to NOT have to WORK in SOME ways.
Pitch accent is associated with meaning:
[Figure: two panels plotted against Time (s)]
Predictability effects on pitch accent
Results. More predictable words are less likely to bear pitch accent (this is true for all parts of speech), as measured by:
• Frequency
• Conditional bigram probability
• Joint probability
• Semantic relatedness
• Repetition
(Not all the same measures that affect reduction; e.g., preceding context is more important with pitch accent.)

Implications
• The role of predictability is not limited to reduction processes.
• Predictability is not just a fact about lexical access; this information is available during phonological encoding.
• Prosody in speech synthesis is rudimentary; a probabilistic model is (relatively) easy to implement.
Current Research
• Computational model of pitch accent
• Prosodic information to aid parsing
• Psychological models of production
Computational model of pitch accent
(joint work with Yasemin Altun)
Problem
Predicting accent is not an exact science.
Hirschberg (1993) and Pan & Hirschberg (2000) demonstrate that frequency and conditional probability increase accuracy in pitch accent prediction.
1. Function vs content only: 68%
2. Frequency, conditional probability: 71%
3. BOTH 1 and 2: 73%
Will the addition of more/different probabilistic variables increase accuracy as well?
Computational model of pitch accent
Testing more variables
• Joint probability, reverse conditional probability
• The effects of surrounding accents
• More fine-grained part of speech
• Things like rate of speech, etc.
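One way to test whether extra variables help is to fit the same classifier with and without them and compare accuracy. The sketch below assumes a logistic regression fitted by batch gradient descent. The feature set, the eight data rows, and every name in the code are invented for illustration; they are not the model or data from the work described here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """P(accent | features x) under the fitted model."""
    return sigmoid(bias + sum(w * xj for w, xj in zip(weights, x)))


def train_logistic(X, y, lr=1.0, epochs=5000):
    """Fit logistic regression by batch gradient descent on the log loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def accuracy(weights, bias, X, y):
    """Share of words whose predicted accent (threshold 0.5) matches the label."""
    hits = sum((predict_proba(weights, bias, xi) >= 0.5) == bool(yi)
               for xi, yi in zip(X, y))
    return hits / len(X)


# Hypothetical toy rows: (is_content_word, frequency score in [0, 1],
# conditional probability given the previous word, accented?)
DATA = [
    (1, 0.10, 0.05, 1),
    (1, 0.20, 0.10, 1),
    (1, 0.30, 0.20, 1),
    (1, 0.80, 0.70, 0),
    (0, 0.30, 0.05, 1),
    (0, 0.90, 0.80, 0),
    (0, 0.95, 0.60, 0),
    (0, 0.85, 0.90, 0),
]
labels = [row[3] for row in DATA]
X_base = [[row[0]] for row in DATA]       # function vs content only
X_full = [list(row[:3]) for row in DATA]  # plus the probabilistic variables

for name, X in (("function/content only", X_base), ("with probabilities", X_full)):
    w, b = train_logistic(X, labels)
    print(name, accuracy(w, b, X, labels))
```

Accuracy here is measured on the training rows only; a real comparison would use held-out data.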
Prosodic information to aid parsing
(Joint work with Mark Johnson and Eugene Charniak)
Problem:
Parsing conversational speech is difficult

Accuracy of parsing
• The Wall Street Journal (wsj): 90% (Charniak 2000)
• Switchboard (swbd): 84.5%
• wsj, no punctuation: 86%
• swbd, no punctuation: 81%
Add prosodic features instead of punctuation
Prosodic information to aid parsing
Methodology
• Get timing information from the transcripts
• Add pause duration information as a term in the parser
  • (use pauses as a cue instead of punctuation)
  • For sentence-internal punctuation only
http://cog.brown.edu:16080/~mj/papers/acl02-emptynodes.pdf

Results
• Accuracy goes down (80%)
• Because the language model is not as strong?
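The idea of using pauses as a cue in place of punctuation can be sketched as a preprocessing step over time-aligned words. This is only one simple possibility: the slides do not say how pause duration was added to the parser, and the `TimedWord` type, the `<pause>` token, and the 0.2 s threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    text: str
    start: float  # seconds
    end: float    # seconds


def add_pause_tokens(words, min_pause=0.2, pause_token="<pause>"):
    """Insert a pseudo-punctuation token wherever the silence between two
    consecutive words lasts at least `min_pause` seconds."""
    tokens = []
    prev_end = None
    for w in words:
        if prev_end is not None and w.start - prev_end >= min_pause:
            tokens.append(pause_token)
        tokens.append(w.text)
        prev_end = w.end
    return tokens


# Hypothetical alignment for a short utterance.
utterance = [
    TimedWord("move", 0.0, 0.3),
    TimedWord("the", 0.3, 0.4),
    TimedWord("red", 0.9, 1.2),
    TimedWord("cup", 1.2, 1.5),
]
print(add_pause_tokens(utterance))
```

The resulting token stream could then be fed to a parser trained on text where the pause token stands in for sentence-internal punctuation.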
Psychological models of production:
Disfluencies
(joint work with Julie Sedivy and Dan Grodner)
Looking at what’s going on during speech and when

Initially, we were interested in how prosody maps to discourse constraints in the production of prenominal adjectives
Move the red cup

Facts:
• Speakers only use scalar or material adjectives in the environment of a contrast. Speakers use color adjectives ALL the time.
• Contrast is prosodically marked (there is an increase in pitch range in the presence of a contrast)
• Despite an increase in pitch range, there is not a duration increase with adjectives produced in a contrastive environment.
• BUT scalar adjectives are longer
Psychological models of production:
Disfluencies
Really neat fact: Speakers produce more disfluencies with scalar adjectives compared to material or color.

Disfluencies account for about 6% of spontaneous speech (Shriberg 2002):
• silent pauses: move the <sil> red …
• elongated pronunciations: move theee
• filled pauses: move the um
• repetitions: move the the
• restarts: move the uh the red …
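Some of these disfluency types can be spotted from a transcript alone. The sketch below assumes a token list in which silences are marked `<sil>` and filled pauses are spelled "um" or "uh", as in the examples above. It covers silent pauses, filled pauses, and immediate repetitions only; elongations and restarts would need acoustic or alignment information and are not handled. The function name and label strings are illustrative.

```python
FILLED_PAUSES = {"um", "uh"}
SILENCE = "<sil>"


def find_disfluencies(tokens):
    """Return (index, type) pairs for disfluencies detectable from text alone."""
    found = []
    for i, tok in enumerate(tokens):
        if tok == SILENCE:
            found.append((i, "silent pause"))
        elif tok.lower() in FILLED_PAUSES:
            found.append((i, "filled pause"))
        elif i > 0 and tok.lower() == tokens[i - 1].lower():
            # Same word twice in a row, e.g. "the the"
            found.append((i, "repetition"))
    return found


for utterance in ("move the <sil> red", "move the um", "move the the"):
    print(utterance, find_disfluencies(utterance.split()))
```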
Psychological models of production:
Disfluencies
• Used an eye-tracking device to find out what’s happening during the disfluency
Move the, uh, big car next to the turtle
Psychological models of production:
Disfluencies
Results:
• We found that speakers are looking more at the contrasting object in the case of the scalars during the disfluency AND during the adjective!
Implications:
• Marking a contrast set does not increase processing load
• Encoding a relative property does increase processing load
• Duration is affected by lexical encoding (suggests a continuum of planning difficulty effects)
Near-future research
Prosody
In general, continue looking at the factors that influence prosodic variation and see if these can be modeled probabilistically.

The challenge:
• Lots of people have found discourse-pragmatic factors contribute to prosodic marking
• Others, including myself, have found that prosody is affected by probabilistic variables
• How can we model aspects of the speech context probabilistically?
Disfluencies
Disfluencies have proven to be a very useful window into processes of speech production.
• Are there more disfluencies around evaluative terms in general?
• Do different types of disfluencies correspond to difficulties associated with different aspects of production (initial planning versus lexical encoding and access)?
• Investigate more fully the connection between disfluencies and the length of surrounding words.
  • Why is it that words following a disfluency are longer?
  • How much of duration variation can be accounted for by planning difficulties versus other factors?
Speech synthesis
Three types of TTS systems:

Concatenated or diphone models.
• Advantages: the ability to process novel strings of text; does not require a huge database of stored speech.
• Disadvantage: mechanical-sounding speech, a lot of post-processing

Corpus based--prosodic patterns (durations, stress, F0 contours) are not defined by the signal processor, but rather the phoneme sequences are chosen based on exact prosodic pattern matches in a corpus.
• Advantage: natural-sounding speech, specifically with regard to prosody.
• Disadvantage: a much larger database is required with a lot more hand coding involved. It also does not allow for totally novel sequences of sounds or words that are not in the database.

Phrase splicing (unit selection)--selects the largest unit possible from a corpus of one speaker.
• Advantage: Very natural, requires very little post-speech processing from a signal processor.
• Disadvantage: Requires an extremely large (~10 hours) hand-annotated corpus of speech. It also does not allow for novel sequences of speech, thus must be used in conjunction with a diphone model.
Speech synthesis
(joint work with Mike Buckley and Kris Schindler)
Using a Probabilistic Model to Improve Speech Synthesis in the UB Talker
The UB Talker:
• Artificial speaking device
• Menu-driven means of selecting words and phrases
• Menus, words, and phrases can be pre-programmed or entered on-screen
• Uses context-awareness and phrase completion to predict responses
• Statistics are derived using frequency of use, most recently used, time of day, day of week, and time of year to present most-likely phrases to users.
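Ranking phrases from usage statistics like these can be sketched as a weighted score. The sketch below assumes three of the listed cues (frequency of use, recency, and time of day); day of week and time of year could be added the same way. The `PhraseStats` record, the scoring formula, and the equal weights are hypothetical and do not describe the UB Talker's actual implementation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class PhraseStats:
    text: str
    use_count: int = 0
    last_used: float = 0.0  # timestamp in seconds
    hour_counts: dict = field(default_factory=dict)  # hour of day -> uses


def score(p, now, hour, w_freq=1.0, w_recent=1.0, w_hour=1.0):
    """Combine frequency, recency, and time-of-day fit into one score."""
    freq = math.log1p(p.use_count)
    # Decays from 1.0 toward 0 as hours since last use grow.
    recency = 1.0 / (1.0 + (now - p.last_used) / 3600.0)
    # Share of this phrase's past uses that fell in the current hour.
    hour_share = p.hour_counts.get(hour, 0) / p.use_count if p.use_count else 0.0
    return w_freq * freq + w_recent * recency + w_hour * hour_share


def rank_phrases(phrases, now, hour, top_k=3):
    """Return the `top_k` phrases with the highest score."""
    return sorted(phrases, key=lambda p: score(p, now, hour), reverse=True)[:top_k]


# Hypothetical usage history.
now = 1_000_000.0
phrases = [
    PhraseStats("good morning", 10, now - 86400, {8: 9, 20: 1}),
    PhraseStats("good night", 10, now - 86400, {22: 10}),
]
print([p.text for p in rank_phrases(phrases, now, hour=8)])
```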
Speech synthesis
Once a string is selected, a synthesizer component produces speech.
Two goals:
1. Add a probabilistic model of prosody to the current free TTS system
2. Build a corpus of speech toward a unit selection model (the client has about 2,000 phrases in the system that can be pre-recorded)
Speech synthesis
Some academically available and commercially available synthesizers:
http://www.cstr.ed.ac.uk/projects/festival/userin.html
http://www.rhetorical.com/cgi-bin/demo.cgi
http://www.research.att.com/projects/tts/demo.html