
Using Prosody to Recognize
Student Emotions and Attitudes in
Spoken Tutoring Dialogues
Diane Litman
Department of Computer Science
and
Learning Research and Development Center
University of Pittsburgh
Outline
 Introduction
 The ITSPOKE System and Corpora
 Emotion Prediction from Prosody & Other Features
– Method
– Human-human tutoring
– Computer-human tutoring
 Current Directions and Summary
Motivation
 Working hypothesis regarding learning gains
– Human Dialogue > Computer Dialogue > Text
 Most human tutoring involves face-to-face spoken interaction, while most computer dialogue tutors are text-based
– Evens et al., 2001; Zinn et al., 2002; VanLehn et al., 2002; Aleven et al., 2001
 Can the effectiveness of tutorial dialogue systems be further increased by using spoken interactions?
Potential Benefits of Speech
 Self-explanation correlates with learning and occurs more in speech
– Hausmann and Chi, 2002
 Speech contains prosodic information, providing new sources of information for dialogue adaptation
– Forbes-Riley and Litman, 2004
 Spoken computational environments may prime a more social interpretation that enhances learning
– Moreno et al., 2001; Graesser et al., 2003
 Potential for hands-free interaction
Spoken Tutorial Dialogue Systems
 Recent tutoring systems have begun to add spoken language capabilities
– Rickel and Johnson, 2000; Graesser et al., 2001; Mostow and Aist, 2001; Aist et al., 2003; Fry et al., 2001; Schultz et al., 2003
 However, there has been little empirical analysis of the learning ramifications of using speech
Outline
 Introduction
 The ITSPOKE System and Corpora
 Emotion Prediction from Prosody & Other Features
– Method
– Human-human tutoring
– Computer-human tutoring
 Current Directions and Summary
ITSPOKE: Intelligent Tutoring SPOKEn Dialogue System
 Back-end is the text-based Why2-Atlas tutorial dialogue system (VanLehn et al., 2002)
 Student speech digitized from microphone input; Sphinx2 speech recognizer
 Tutor speech played via headphones/speakers; Cepstral text-to-speech synthesizer
 Other additions: access to Why2-Atlas “internals”, speech recognition repairs, etc.
Spoken Tutoring Corpora
 Human-Human Corpus
– 14 students, 128 physics problems (dialogues), 5948 student turns, 5505 tutor turns
– Student and tutor turns were manually transcribed for content and segmented for boundaries
Corpus Transcription & Annotation
Spoken Tutoring Corpora (cont.)
 Computer-Human Corpus
– 20 students, 100 physics problems (dialogues), 2445 student turns, 2967 tutor turns
– Noisy student turn boundaries and transcriptions extractable from ITSPOKE logs
– Content of student turns also manually transcribed
ITSPOKE Corpora Comparison

Human-Human excerpt (…1.3 minutes into session…):
Student: Can I ask you questions?
Tutor: Yes
Student: Is that ok?
Tutor: Mm-hm
Student: Um from what we learned from the last problem they're gonna have the same- the keys and you have the same acceleration right?
Tutor: Yes
Student: So that means they'll both hit the bottom at the same time. But I don't understand where- I don't understand if you're in the elevator
Tutor: You see
Student: Where are you going to-?
Tutor: The uh let me uh the key uh- the person holds the key in front of
Student: Their face yeah
Tutor: Uh his or her face and then lets it uh uh it is let go so the the the question relates to the relative position of the key and the face uh-
Tutor: Uh how will they- as they go- as they both fall what will be their relative position? That is the question.
Student: (sigh)
Student: So the key and the face-

Human-Computer excerpt (…3.5 minutes into session…):
Tutor: I didn't understand. What force acts on all objects within close proximity to earth?
Student: uh the force of gravity
Tutor: Fine. Besides the Earth's gravitational force, are there any other forces acting on the person, keys, and elevator after the release?
Student: no
Tutor: Yep. Thus, keep in mind that after the release, the only force on the person, keys, and elevator is the force of gravity. If gravity is the only force acting on an object, will it be moving or staying still?
Student: it will be moving
Tutor: Fine. Not only are the person, keys, and elevator moving, they have only gravitational forces on them. When an object is falling and has only gravitational force on it, it is said to be in what?
Student: freefall
Tutor: Good. Thus, after the release the only force on the keys and man is the downward force of earth's gravity, so they are in freefall. Before the release, what is the relationship between the man's velocity and the keys' velocity?
Student: they're equal
Tutor: Correct. We know that after the release, the keys, the person, and the elevator are in freefall. Is freefall acceleration near earth a constant? Or does it vary depending upon which object is in freefall?
Student: it's a constant
Outline
 Introduction
 The
ITSPOKE System and Corpora
 Emotion Prediction from Prosody & Other Features
– Method
– Human-human tutoring
– Computer-human tutoring

Current Directions and Summary
Monitoring Student State (motivation)
Tutor26: Let me ask my question in a more general way. What happens to the velocity of a body when there is no force acting on it?
Student27: dammit (ASR: it is)
Tutor28: Could you please repeat that?
Student29: same (ASR: i same)
Tutor30: Right. And what happens to the HORIZONTAL velocity of a body when there is no HORIZONTAL force acting on it?
Student31: zero (ASR: the zero)
Tutor32: Well... OK, let's consider how forces in the vertical direction affect the motion of a body in a horizontal direction in a different scenario <…omitted…>
Student33: oh gravity you already said this (ASR: beats gravity you're exert less)
Tutor34: Fine. Are there any other forces acting on the apple as it falls?
Student35: no why are you doing this again (ASR: no y and to it yes)
Tutor36: Correct. The only force acting on the apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?
Student37: downward you computer (ASR: downward you computer)
Related Research in Emotional Speech
 Elicited Speech (Polzin & Waibel 1998; Oudeyer 2002; Liscombe et al. 2003)
 Naturally-Occurring Speech (Ang et al. 2002; Lee et al. 2002; Batliner et al. 2003; Devillers et al. 2003; Shafran et al. 2003)
 Our Work
– naturally-occurring tutoring data
– analysis of comparable human and computer corpora
Methodology
 Emotion Annotation
 Machine Learning Experiments
– Extract linguistic features from student turns
– Use different feature sets to predict emotions
– Significant reduction of baseline error
Emotion Annotation Scheme (Sigdial’04)
 ‘Emotion’: emotions/attitudes that may impact learning
 Annotation of Student Turns
 3 Main Emotion Classes
– negative, e.g. uncertain, bored, irritated, confused, sad
– positive, e.g. confident, enthusiastic
– neutral: no expression of negative or positive emotion
 3 Minor Emotion Classes
– weak negative, weak positive, mixed
Feature Extraction per Student Turn
 Five feature types
– Acoustic-prosodic (1)
– Non acoustic-prosodic
   – Lexical (2)
   – Other Automatic (3)
   – Manual (4)
– Identifiers (5)
 Research questions
– Relative predictive utility of feature types
– Impact of speech recognition
– Comparison across computer and human tutoring
Feature Types (1)
Acoustic-Prosodic Features
 4 pitch (f0): max, min, mean, standard dev.
 4 energy (RMS): max, min, mean, standard dev.
 4 temporal: turn duration (seconds); pause length preceding turn (seconds); tempo (syllables/second); internal silence in turn (zero f0 frames)
 available to ITSPOKE in real time (see the sketch below)
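To make the turn-level acoustic-prosodic features concrete, here is a minimal sketch, assuming per-frame f0 and RMS values have already been extracted for the student turn (e.g., by an external pitch tracker) and that turn timing and syllable counts come from the logs; the function and field names are illustrative, not the actual ITSPOKE code.

```python
# Minimal sketch (not the actual ITSPOKE code) of the 12 acoustic-prosodic
# features, assuming per-frame f0 and RMS energy values for one student turn.
import statistics

def prosodic_features(f0, rms, turn_duration, prior_pause, n_syllables):
    """f0, rms: per-frame values; f0 == 0 marks unvoiced/silent frames."""
    voiced = [v for v in f0 if v > 0]            # pitch stats over voiced frames
    return {
        # 4 pitch (f0) features
        "f0_max": max(voiced), "f0_min": min(voiced),
        "f0_mean": statistics.mean(voiced), "f0_sd": statistics.pstdev(voiced),
        # 4 energy (RMS) features
        "rms_max": max(rms), "rms_min": min(rms),
        "rms_mean": statistics.mean(rms), "rms_sd": statistics.pstdev(rms),
        # 4 temporal features
        "duration_sec": turn_duration,           # turn duration
        "prior_pause_sec": prior_pause,          # pause preceding the turn
        "tempo": n_syllables / turn_duration if turn_duration else 0.0,
        # internal silence via zero-f0 frames (fraction of frames; an assumption)
        "internal_silence": sum(1 for v in f0 if v == 0) / len(f0),
    }
```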
Feature Types (2)
Lexical (Word Occurrence Vectors)
 Human-transcribed lexical items in the turn
 ITSPOKE-recognized lexical items (see the sketch below)
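A word-occurrence vector can be sketched as a simple bag-of-words indicator over the corpus vocabulary, built either from the human transcription or from the recognizer output; the helper below is hypothetical.

```python
# Hypothetical helper: binary word-occurrence vector over a fixed vocabulary,
# built from either the human transcription or the Sphinx2 output for the turn.
def word_occurrence_vector(turn_text, vocabulary):
    words = set(turn_text.lower().split())
    return {f"has_{w}": int(w in words) for w in vocabulary}

# e.g. word_occurrence_vector("uh the force of gravity", vocab) marks "force",
# "gravity", etc. as present and all other vocabulary words as absent.
```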
Feature Types (3)
Other Automatic Features: available from logs
 Turn Begin Time (seconds from dialog start)
 Turn End Time (seconds from dialog start)
 Is Temporal Barge-in (student begins before tutor turn ends)
 Is Temporal Overlap (student begins and ends in tutor turn)
 Number of Words in Turn
 Number of Syllables in Turn
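Under the assumption that the ITSPOKE logs provide turn start/end times in seconds from the start of the dialogue, the "other automatic" features above could be derived roughly as follows (argument names are assumptions, not the actual log format):

```python
# Sketch only: turn-level "other automatic" features from logged timestamps
# (seconds from dialogue start).
def automatic_features(student_start, student_end, tutor_start, tutor_end,
                       turn_text, n_syllables):
    return {
        "turn_begin_time": student_start,
        "turn_end_time": student_end,
        # student starts speaking before the tutor turn has ended
        "is_temporal_barge_in": student_start < tutor_end,
        # student turn both begins and ends within the tutor turn
        "is_temporal_overlap": tutor_start <= student_start and student_end <= tutor_end,
        "n_words": len(turn_text.split()),
        "n_syllables": n_syllables,
    }
```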
Feature Types (4)
Manual Features: (currently) available only
from human transcription
 Is Prior Tutor Question (tutor turn contains “?”)
 Is Student Question (student turn contains “?”)
 Is Semantic Barge-in (student turn begins at tutor
word/pause boundary)
 Number of Hedging/Grounding Phrases (e.g. “mm-hm”,
“um”)
 Is Grounding (canonical phrase turns not preceded by a
tutor question)
 Number of False Starts in Turn (e.g. acc-acceleration)
Feature Types (5)
Identifier Features
 student number
 student gender
 problem number
Empirical Results I
Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources
Kate Forbes-Riley and Diane Litman
Proceedings of the Human Language Technology Conference: 4th Meeting of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2004)
Annotated Human-Human Excerpt
(weak, mixed -> neutral)
Tutor: Uh let us talk of one car first.
Student: ok. (EMOTION = NEUTRAL)
Tutor: If there is a car, what is it that exerts force on the
car such that it accelerates forward?
Student: The engine. (EMOTION = POSITIVE)
Tutor: Uh well engine is part of the car, so how can it
exert force on itself?
Student: um… (EMOTION = NEGATIVE)
Human Tutoring:
Annotation Agreement Study
• 453 student turns, 10 dialogues
• 2 annotators (the authors)
• 385/453 agreed (85%, Kappa .7)

Confusion matrix (rows/columns: Negative, Neutral, Positive):
            Negative  Neutral  Positive
Negative        90        6         4
Neutral         23      280        30
Positive         0        5        15
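The reported agreement and Kappa can be re-derived from the confusion matrix above; a small check in Python:

```python
# Re-deriving percent agreement and Cohen's Kappa from the 3x3 matrix above.
def agreement_and_kappa(matrix):
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    row_marg = [sum(row) / total for row in matrix]
    col_marg = [sum(matrix[i][j] for i in range(n)) / total for j in range(n)]
    expected = sum(r * c for r, c in zip(row_marg, col_marg))
    return observed, (observed - expected) / (1 - expected)

hh = [[90, 6, 4], [23, 280, 30], [0, 5, 15]]   # negative, neutral, positive
print(agreement_and_kappa(hh))                  # ~(0.85, 0.68): 85% agreement, Kappa ~ .7
```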
Machine Learning Experiments
 Task: predict negative/positive/neutral using 5 feature types
 Data: “agreed” subset of annotated student turns
 Weka software: boosted decision trees
 Methodology: 10 runs of 10-fold cross validation (see the sketch below)
 Evaluation Metrics
– Mean Accuracy: %Correct
– Relative Improvement Over Baseline (RI):
  RI = (error(baseline) – error(x)) / error(baseline)
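A hedged stand-in for this setup, using scikit-learn's boosted decision trees in place of Weka purely for illustration (the feature matrix X and labels y are assumed to be assembled already):

```python
# Illustrative stand-in for the Weka experiments: boosted decision trees with
# 10 runs of 10-fold cross-validation, plus the relative improvement (RI) metric.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def mean_accuracy(X, y):
    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    return cross_val_score(clf, X, y, cv=cv).mean()

def relative_improvement(baseline_acc, acc):
    """RI = (error(baseline) - error(x)) / error(baseline)."""
    return ((1 - baseline_acc) - (1 - acc)) / (1 - baseline_acc)

# e.g. relative_improvement(0.7274, 0.8319) ~ 0.38 for the best -ident feature set
```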
Acoustic-Prosodic vs. Other Features
 Acoustic-prosodic features (“speech”) outperform the majority baseline, but other feature types yield even higher accuracy, and the more the better

Feature Set                              -ident
speech                                   76.20%
lexical                                  78.31%
lexical + automatic                      80.38%
lexical + automatic + manual             83.19%

• Baseline = 72.74%; RI range = 12.69% – 43.87%
Acoustic-Prosodic plus Other Features
 Adding acoustic-prosodic to other feature sets doesn’t significantly improve performance

Feature Set                              -ident
speech + lexical                         79.26%
speech + lexical + automatic             79.64%
speech + lexical + automatic + manual    83.69%

• Baseline = 72.74%; RI range = 23.29% – 42.26%
Adding Contextual Features
 Adding contextual features improves prediction accuracy (Litman et al. 2001; Batliner et al. 2003)
 Local Features: the values of all features for the two student turns preceding the student turn to be predicted
 Global Features: running averages and totals for all features, over all student turns preceding the student turn to be predicted (see the sketch below)
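A minimal sketch of how such contextual features could be computed, assuming each student turn is already represented as a dict of numeric features in dialogue order (names are illustrative):

```python
# Sketch of local and global contextual features, assuming each student turn is
# a dict like {"feats": {...numeric features...}} listed in dialogue order.
def add_context(turns):
    history = []                                  # feature dicts of earlier student turns
    for turn in turns:
        ctx = {}
        # local context: copy the features of the two preceding student turns
        for back, prev in enumerate(reversed(history[-2:]), start=1):
            ctx.update({f"prev{back}_{k}": v for k, v in prev.items()})
        # global context: running averages and totals over all preceding turns
        for k in turn["feats"]:
            vals = [h[k] for h in history if k in h]
            if vals:
                ctx[f"avg_{k}"] = sum(vals) / len(vals)
                ctx[f"tot_{k}"] = sum(vals)
        turn["context"] = ctx
        history.append(turn["feats"])
    return turns
```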
Previous Feature Sets plus Context
 Adding global contextual features marginally improves performance, e.g.

Feature Set                              +context       -ident
speech + lexical + auto + manual         local          82.44
speech + lexical + auto + manual         global         84.75
speech + lexical + auto + manual         local+global   81.43

• Same feature set with no context: 83.69%
Feature Usage

Feature Type           Turn + Global
Acoustic-Prosodic          16.26%
  Temporal                 13.80%
  Energy                    2.46%
  Pitch                     0.00%
Other                      83.74%
  Lexical                  41.87%
  Automatic                 9.36%
  Manual                   32.51%
Accuracies over ML Experiments
Empirical Results II
Predicting Student Emotions in Computer-Human Tutoring Dialogues
Diane J. Litman and Kate Forbes-Riley
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004)
Computer Tutoring Study
 Additional dataset
– Consensus (all turns after annotators resolved disagreements)
 Different treatment of minor classes
 Additional binary prediction tasks (in paper)
– Emotional/non-emotional and negative/non-negative
 Slightly different features
– strict turn-taking protocol (no barge-in)
– ASR output rather than actual student utterances
Annotated Computer-Human Excerpt
(weak -> pos/neg, mixed -> neutral)
ITSPOKE: What happens to the velocity of a body when there is no force acting on it?
Student: dammit (NEGATIVE)
ASR: it is
ITSPOKE: Could you please repeat that?
Student: same (NEUTRAL)
ASR: i same
Computer Tutoring:
Annotation Agreement Study
• 333 student turns, 15 dialogues
• 2 annotators (the authors)
• 202/333 agreed (61%; Kappa = .4)

Confusion matrix (rows/columns: Negative, Neutral, Positive):
            Negative  Neutral  Positive
Negative        89       30         6
Neutral         32       94        38
Positive         6       19        19
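The same agreement_and_kappa sketch from the human-human study reproduces these figures:

```python
# Applying the agreement_and_kappa sketch to the computer-human matrix:
ch = [[89, 30, 6], [32, 94, 38], [6, 19, 19]]   # negative, neutral, positive
print(agreement_and_kappa(ch))                   # ~(0.61, 0.37): 61% agreement, Kappa ~ .4
```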
Acoustic-Prosodic vs. Lexical Features (Agreed Turns)
 Both acoustic-prosodic (“speech”) and lexical features significantly outperform the majority baseline
 Combining feature types yields an even higher accuracy

Feature Set          -ident
speech               55.49%
lexical              52.66%
speech+lexical       62.08%

• Baseline = 46.52%
Adding Identifier Features (Agreed Turns)
 Adding identifier features improves all results
 With identifier features, lexical information now yields the highest accuracy

Feature Set          -ident     +ident
speech               55.49%     62.03%
lexical              52.66%     67.84%
speech+lexical       62.08%     63.52%

• Baseline = 46.52%
Using Automatic Speech Recognition (Agreed Turns)
 Surprisingly, using ASR output rather than human transcriptions does not particularly degrade accuracy

Feature Set          -ident     +ident
lexical              52.66%     67.84%
ASR                  57.95%     65.70%
speech+lexical       62.08%     63.52%
speech+ASR           61.22%     62.23%

• Baseline = 46.52%
Summary of Results (Agreed Turns)
[Bar chart: accuracy (40%–70%) for feature sets sp, asr, lex, sp+asr, sp+lex, with (+id) and without (-id) identifier features, versus the majority baseline (maj)]
Comparison with Human Tutoring
[Bar chart: accuracy (0%–90%) for feature sets sp, lex, sp+lex in the human and ITSPOKE corpora, with and without identifier features (human+id, human-id, ITSPOKE+id, ITSPOKE-id)]
- In human tutoring dialogues, emotion prediction (and annotation) is more accurate and based on somewhat different features
Summary of Results (Consensus Turns)
[Bar chart: accuracy (40%–70%) for feature sets sp, asr, lex, sp+asr, sp+lex, with (+id) and without (-id) identifier features, versus the majority baseline (maj)]
- Using consensus rather than agreed data decreases predictive accuracy for all feature sets, but other observations generally hold
Recap
 Recognition of annotated student emotions in spoken computer and human tutoring dialogues, using multiple knowledge sources
 Significant improvements in predictive accuracy compared to majority class baselines
 A first step towards implementing emotion prediction and adaptation in ITSPOKE
Outline
 Introduction
 The ITSPOKE System and Corpora
 Emotion Prediction from Prosody & Other Features
– Method
– Human-human tutoring
– Computer-human tutoring
 Current Directions and Summary
Word Level Emotion Models
(joint research with Mihai Rotaru)
 Motivation
– Emotion might not be expressed over the entire turn
– Some pitch features make more sense at a smaller level
 Simple word-level emotion model (see the sketch below)
– Label each word with turn class
– Learn a word-level emotion model
– Predict the class of each word in a test turn
– Combine word classes using majority/weighted voting
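A minimal sketch of this word-level scheme, using k-nearest neighbors as a rough stand-in for the memory-based learner (MBL) actually used in these experiments, and assuming each word has already been encoded as a numeric feature vector (e.g. pitch plus lexical features):

```python
# Sketch of the simple word-level model: every word inherits its turn's label
# for training; a word-level classifier is learned (k-NN here as a rough
# stand-in for memory-based learning); test-turn word predictions are combined
# by majority vote.
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

def train_word_model(train_turns):
    """train_turns: list of (word_vectors, turn_label) pairs."""
    X = [vec for vectors, _ in train_turns for vec in vectors]
    y = [label for vectors, label in train_turns for _ in vectors]
    return KNeighborsClassifier(n_neighbors=5).fit(X, y)

def predict_turn(model, word_vectors):
    word_labels = model.predict(word_vectors)
    return Counter(word_labels).most_common(1)[0][0]   # majority vote over the words
```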
Word Level Emotion Models - Results
 Feature sets
– Lexical
– Pitch
– PitchLex
 Results (Turn and Word levels)
[Bar chart: accuracies (50%–75%) for Baseline, Lex-Word-MBL, Pitch-Word-MBL, PitchLex-Word-MBL, Lex-Turn-MBL, Pitch-Turn-MBL, PitchLex-Turn-MBL, and DK-MBL; axis labels: HC, EnE, MBL]
– Word-level better than Turn-level counterpart
– PitchLex at Word-level always among the best performers
– PitchLex at Word-level comparable with state-of-the-art on our corpora
Prosody-Learning Correlations
(joint work with Kate Forbes-Riley)
 What aspects of spoken tutoring dialogues correlate with learning gains? (see the sketch below)
– Dialogue features (Litman et al. 2004)
– Student emotions (frequency or patterns)
– Acoustic-prosodic features
 Human Tutoring
– Faster tempos (syllables/second) and longer turns (seconds) negatively correlate with learning (p < .09)
 Computer Tutoring
– Higher pitch features (average, max, min) negatively correlate with learning (p < .07)
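A sketch of the correlation analysis, assuming one value per student for a prosodic feature (e.g., mean tempo) and a learning gain computed as posttest minus pretest (the exact gain measure here is an assumption):

```python
# Sketch of the prosody-learning correlation test; one mean feature value and
# one learning gain per student.
from scipy.stats import pearsonr

def learning_correlation(feature_per_student, gain_per_student):
    r, p = pearsonr(feature_per_student, gain_per_student)
    return r, p   # e.g. a negative r for tempo in the human corpus (p < .09)
```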
Other Directions
 Co-training to address the annotation bottleneck
– Maeireizo, Litman, and Hwa, ACL 2004
 Development of adaptive strategies for ITSPOKE
– Annotation of human tutor turns
 ITSPOKE version 2 and beyond
– Pre-recorded prompts and domain-specific TTS
– Barge-in
– Dynamic adaptation to predicted student emotions
Summary
 Recognition of annotated student emotions in spoken tutoring dialogues
 Significant improvements in predictive accuracy compared to majority class baselines
– role of feature types and speech recognition errors
– comparable analysis of human and computer tutoring
 This research is a first step towards implementing emotion prediction and adaptation in ITSPOKE
Acknowledgments
 Kurt VanLehn and the Why2 Team
 The ITSPOKE Group
– Kate Forbes-Riley, LRDC
– Beatriz Maeireizo, Computer Science
– Amruta Purandare, Intelligent Systems
– Mihai Rotaru, Computer Science
– Scott Silliman, LRDC
– Art Ward, Intelligent Systems
 NSF and ONR
Thank You!
Questions?
Architecture
[Architecture diagram: a www browser and www server (html, student essay text); ITSpoke (java) wrapping the Why2 (xml) back end; a Text Manager; Essay Analysis (Carmel, Tacituslite+) producing tutorial and repair goals; Speech Analysis (Sphinx); Cepstral synthesis; and a Spoken Dialogue Manager exchanging dialogue and tutor turns (xml) with the Content Dialogue Manager (Ape, Carmel)]
Speech Recognition: Sphinx2 (CMU)
 Probabilistic language models for different dialogue states
 Initial training data
– typed student utterances from Why2-Atlas corpora
 Later training data
– spoken utterances obtained during development and pilot testing of ITSPOKE
 Total vocabulary
– 1240 unique words
 “Semantic Accuracy” Rate = 92.4%
Speech Synthesis: Cepstral
 Commercial outgrowth of the Festival text-to-speech synthesizer (Edinburgh, CMU)
 Required additional processing of Why2-Atlas prompts (e.g., f=m*a)
Common Experimental Aspects
 Students take a physics pretest
 Students read background material
 Students use a web interface to work through up to 10 problems with either a computer or a human tutor
 Students take a posttest
– 40 multiple choice questions, isomorphic to pretest
Hypotheses
 Compared to typed dialogues, spoken interactions will yield better learning gains, and will be more efficient and natural
 Different student behaviors will correlate with learning in spoken versus typed dialogues, and will be elicited by different tutor actions
 Findings in human-human and human-computer dialogues will vary as a function of system performance
Recap
 Human Tutoring: spoken dialogue yielded significant performance improvements
– Greater learning gains
– Reduced dialogue time
– Many differences in superficial dialogue characteristics
 Computer Tutoring: spoken dialogue made little difference
– No change in learning
– Increased dialogue time
– Fewer dialogue differences