Misunderstandings, Corrections and Beliefs in Spoken Language


misunderstandings, corrections and beliefs in spoken language interfaces
Dan Bohus
www.cs.cmu.edu/~dbohus
[email protected]
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
problem
spoken language interfaces lack robustness when faced with understanding errors
 stems mostly from speech recognition
 spans most domains and interaction types
 exacerbated by operating conditions
more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I’m not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I’m still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at ………
some statistics …
 semantic error rates: ~25-35%
[bar chart: semantic error rates of 25%, 27%, 28%, 32%, and 36% across SpeechActs [SRI], CU Communicator [CU], Jupiter [MIT], CMU Communicator [CMU], and How May I Help You? [AT&T]]
 corrections [Krahmer, Swerts, Litman, Levow]
 30% of utterances correct system mistakes
 corrections are 2-3 times more likely to be misrecognized
two types of understanding errors
NONunderstanding
 the system cannot extract any meaningful information from the user’s turn
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
MISunderstanding
 the system extracts incorrect information from the user’s turn
S: What city are you leaving from?
U: Birmingham [BERLIN PM]
misunderstandings
MISunderstanding
 the system extracts incorrect information from the user’s turn
S: What city are you leaving from?
U: Birmingham [BERLIN PM]
 two ways to attack the problem: fix recognition, or detect potential misunderstandings and do something about them
outline
 detecting misunderstandings
 detecting user corrections
[late-detection of misunderstandings]
 belief updating
[construct accurate beliefs by integrating information from multiple turns]
detecting misunderstandings
 recognition confidence scores
S: What city are you leaving from?
U: Birmingham [BERLIN PM] {conf=0.63}
 traditionally: speech recognition confidence scores [Bansal, Chase, Cox, Kemp, many others]
 use acoustic, language model and search information
 computed at frame, phoneme, or word level
“semantic” confidence scores
 we’re interested in semantics, not words
 YES = YEAH, NO = NO WAY
 use machine learning to build confidence annotators
 in-domain, manually labeled data
 utterance: [BERLIN PM] → Birmingham
 labels: correct / misunderstood
 features from different knowledge sources
 binary classification problem
 probability of misunderstanding: regression problem (sketched below)
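To make the recipe concrete, here is a minimal sketch of such a confidence annotator, using logistic regression so the output can be read directly as a probability. The feature names and the tiny dataset are hypothetical stand-ins, not the features or corpora discussed in this talk.

```python
# Minimal sketch of a "semantic" confidence annotator (hypothetical features/data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: features for one concept hypothesis, drawn from several
# knowledge sources (ASR, parser, dialog state). All values are made up.
X = np.array([
    [0.82, 5, 1.0],   # [asr_confidence, num_words, grammar_coverage]
    [0.31, 2, 0.4],
    [0.63, 3, 0.7],
    [0.12, 7, 0.2],
])
# Labels from manual annotation: 1 = correctly understood, 0 = misunderstood.
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# For a new hypothesis, the model returns P(correct) -- a concept-level
# confidence score rather than a word-level ASR score.
p_correct = model.predict_proba([[0.63, 3, 0.7]])[0, 1]
print(f"P(hypothesis correct) = {p_correct:.2f}")
```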
a typical result
 identifying misunderstandings in How May I Help You? [Walker, Wright, Langkilde]
 HowMayIHelpYou corpus: call routing for phone services
 11787 turns
 features
 ASR: recog, numwords, duration, dtmf, rg-grammar, tempo …
 understanding: confidence, context-shift, top-task, diff-conf, …
 dialog & history: sys-label, confirmation, num-reprompts, num-confirms, num-subdials, …
 binary classification task
 majority baseline (error): 36.5%
 RIPPER (error): 14%
outline
 detecting misunderstandings
 detecting user corrections
[late-detection of misunderstandings]
 belief updating
[construct accurate beliefs by integrating information from multiple turns]
detect user corrections
 is the user trying to correct the system?
S: Where would you like to go?
U: Huntsville [SEOUL] ← misunderstanding
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M] ← user correction of a misunderstanding
 same story: use machine learning
 in-domain, manually labeled data
 features from different knowledge sources
 binary classification problem
 probability of correction: regression problem
typical result
 Identifying User Corrections Automatically in a Spoken Dialog System [Hirschberg, Litman, Swerts]
 TOOT corpus: access to train information
 2328 turns, 152 dialogs
 features
 prosodic: f0max, f0mn, rmsmax, dur, ppau, tempo …
 ASR: gram, str, conf, ynstr, …
 dialog position: diadist
 dialog history: preturn, prepreturn, pmeanf
 binary classification task
 majority baseline: 29%
 RIPPER: 15.7%
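A stand-in sketch of such a correction detector: RIPPER is a rule learner that is not in common ML toolkits, so a shallow decision tree is used here instead. The features are loosely named after the TOOT set above, and the data is invented.

```python
# Sketch of a correction detector over prosodic / ASR / dialog features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Columns: [f0max, duration, asr_confidence, num_reprompts] (hypothetical values)
X = np.array([
    [310.0, 2.4, 0.35, 2],   # loud, long, low-confidence turn after reprompts
    [180.0, 0.8, 0.85, 0],
    [295.0, 1.9, 0.40, 1],
    [190.0, 0.6, 0.90, 0],
])
y = np.array([1, 0, 1, 0])   # 1 = user is correcting the system

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print("P(correction) =", clf.predict_proba([[300.0, 2.0, 0.3, 1]])[0, 1])
```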
outline
 detecting misunderstandings
 detecting user corrections
[late-detection of misunderstandings]
 belief updating
[construct accurate beliefs by integrating information from multiple turns]
belief updating problem: an easy case
S: on which day would you like to travel?
U: on September 3rd [AN DECEMBER THIRD] {CONF=0.25}
departure_date = {Dec-03/0.25}
S: did you say you wanted to leave on December 3rd?
U: no [NO] {CONF=0.88}
departure_date = {Ø}
belief updating problem: a trickier case
S: Where would you like to go?
U: Huntsville [SEOUL] {CONF=0.65}
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}
destination = {?}
belief updating problem formalized
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}
destination = {?}
 given:
 an initial belief Pinitial(C) over concept C
 a system action SA
 a user response R
 construct an updated belief:
 Pupdated(C) ← f(Pinitial(C), SA, R)
outline
 detecting misunderstandings
 detecting user corrections
[late-detection of misunderstandings]
 belief updating
[construct accurate beliefs by integrating information from multiple turns]
 current solutions
 a restricted version
 data
 user response analysis
 experiments and results
 discussion. caveats. future work
belief updating: current solutions
 most systems only track values, not beliefs
 new values overwrite old values
 explicit confirm + yes → trust hypothesis
 explicit confirm + no → kill hypothesis
 explicit confirm + “other” → non-understanding
 implicit confirm: not much
“users who discover errors through incorrect implicit confirmations have a harder time getting back on track” [Shin et al, 2002]
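As a sketch, the heuristic policy above can be written out in a few lines. The concept representation and the action/response names here are illustrative, not any particular system's actual API.

```python
# Sketch of the traditional heuristic value-tracking rules listed above,
# for a system that stores a single hypothesis per concept.
def update_concept(concept, system_action, user_response):
    """Apply the heuristic update after a confirmation action."""
    if system_action == "explicit_confirm":
        if user_response == "yes":
            concept["grounded"] = True          # trust the hypothesis
        elif user_response == "no":
            concept["value"] = None             # kill the hypothesis
            concept["grounded"] = False
        else:
            pass  # "other": treat as a non-understanding, re-ask later
    elif system_action == "implicit_confirm":
        pass      # most systems do nothing here
    return concept

# Example: "did you say you wanted to leave on December 3rd?" -> "no"
concept = {"value": "Dec-03", "grounded": False}
print(update_concept(concept, "explicit_confirm", "no"))   # value cleared
```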
outline
 detecting misunderstandings
 detecting user corrections
[late-detection of misunderstandings]
 belief updating
[construct accurate beliefs by integrating information from multiple turns]
 current solutions
 a restricted version
 data
 user response analysis
 experiments and results
 discussion. caveats. future work
belief updating: general form
 given:
 an initial belief Pinitial(C) over concept C
 a system action SA
 a user response R
 construct an updated belief:
 Pupdated(C) ← f (Pinitial(C), SA, R)
restricted version: 2 simplifications
1. compact belief
 the system is unlikely to “hear” more than 3 or 4 values
 single vs. multiple recognition results
 in our data: max = 3 values; only 6.9% have >1 value
 keep only the confidence score of the top hypothesis
2. updates after confirmation actions
 reduced problem:
ConfTopupdated(C) ← f(ConfTopinitial(C), SA, R)
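A minimal sketch of this reduced problem, assuming a compact belief that stores only the top hypothesis and its confidence. The dataclass, the toy model, and all names are hypothetical illustrations of the f(ConfTopinitial(C), SA, R) interface, with the learned model left as a plug-in.

```python
# Sketch of the restricted belief-updating problem: the belief over a concept
# is compressed to the confidence of its top hypothesis, and the update
# function f can be any learned regressor.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class CompactBelief:
    concept: str
    top_hypothesis: str
    conf_top: float          # ConfTop(C): confidence of the top hypothesis

def make_update(model: Callable[[float, str, Dict], float]):
    """Wrap a model into the update f(ConfTop_initial, SA, R)."""
    def update(belief: CompactBelief, system_action: str,
               response_features: Dict) -> CompactBelief:
        new_conf = model(belief.conf_top, system_action, response_features)
        return CompactBelief(belief.concept, belief.top_hypothesis, new_conf)
    return update

# Stand-in model: boost confidence on a recognized "yes", cut it on anything else.
def toy_model(conf, action, feats):
    if action == "explicit_confirm":
        return min(conf + 0.3, 1.0) if feats.get("resp") == "yes" else conf * 0.3
    return conf

update = make_update(toy_model)
b = CompactBelief("destination", "seoul", 0.65)
print(update(b, "explicit_confirm", {"resp": "no"}))  # conf_top drops to ~0.2
```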
outline
 detecting misunderstandings
 detecting user corrections
[late-detection of misunderstandings]
 belief updating
[construct accurate beliefs by integrating information from multiple turns]
 current solutions
 a restricted version
 data
 user response analysis
 experiments and results
 discussion. caveats. future work
data
 collected with RoomLine
 a phone-based mixed-initiative spoken dialog system for conference room reservation
 search and negotiation
 explicit and implicit confirmations
 confidence threshold model (+ some exploration)
 implicit confirmation example:
“I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?”
user study
 46 participants, 1st time users
 10 scenarios, fixed order
 presented graphically (explained during briefing)
 compensated per task success
corpus statistics
 449 sessions, 8848 user turns
 orthographically transcribed
 manually annotated
 misunderstandings (concept-level)
 non-understandings
 user corrections
 correct concept values
outline
 detecting misunderstandings
 detecting user corrections
[late-detection of misunderstandings]
 belief updating
[construct accurate beliefs by integrating information from multiple turns]
 current solutions
 a restricted version
 data
 user response analysis
 experiments and results
 discussion. caveats. future work
user response types
 following Krahmer and Swerts
 study on Dutch train-table information system
 3 user response types
 YES: yes, right, that’s right, correct, etc.
 NO: no, wrong, etc.
 OTHER
 cross-tabulated against correctness of confirmations
user responses to explicit confirmations
 from transcripts:
          CORRECT      INCORRECT
YES       94% [93%]    1% [6%]
NO        0% [0%]      72% [57%]
Other     5% [7%]      27% [37%]
[numbers in brackets from Krahmer&Swerts]
 from decoded speech:
          CORRECT      INCORRECT
YES       87%          1%
NO        1%           61%
Other     12%          38%
other responses to explicit confirmations
 ~70% of users repeat the correct value
 ~15% of users don’t address the question
 attempt to shift conversation focus
                        CORRECT    INCORRECT
User does not correct   1159       29   [10% of incor]
User corrects           0          250  [90% of incor]
user responses to implicit confirmations
 from transcripts:
          CORRECT       INCORRECT
YES       30% [0%]      6% [0%]
NO        7% [0%]       33% [15%]
Other     63% [100%]    61% [85%]
[numbers in brackets from Krahmer&Swerts]
 from decoded speech:
          CORRECT       INCORRECT
YES       28%           7%
NO        5%            27%
Other     67%           66%
ignoring errors in implicit confirmations
                        CORRECT    INCORRECT
User does not correct   552        118  [51% of incor]
User corrects           2          111  [49% of incor]
 users correct later (40% of the 118)
 users interact strategically: they correct only if the error is critical
             ~correct later   correct later
~critical    55               2
critical     14               47
outline
 detecting misunderstandings
 detecting user corrections
[late-detection of misunderstandings]
 belief updating
[construct accurate beliefs by integrating information from multiple turns]
 current solutions
 a restricted version
 data
 user response analysis
 experiments and results
 discussion. caveats. future work
machine learning approach
 need good probability outputs
 low cross-entropy between model predictions and reality
 cross-entropy = negative average log posterior (sketched below)
 logistic regression
 sample efficient
 stepwise approach → feature selection
 logistic model tree for each action
 root splits on response-type
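For concreteness, the cross-entropy criterion (the "soft error" that reappears in the results) might be computed as below; the probabilities and labels are made up.

```python
# Sketch of the soft-error metric: cross-entropy between predicted
# confidences and reality, i.e. the negative average log posterior
# assigned to the true outcomes.
import numpy as np

def soft_error(p_correct, was_correct):
    """Negative average log posterior of the true labels."""
    p = np.asarray(p_correct, dtype=float)
    y = np.asarray(was_correct, dtype=float)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Confident-and-right beats vague beats confident-and-wrong:
print(soft_error([0.9, 0.8], [1, 1]))  # low  (~0.16)
print(soft_error([0.5, 0.5], [1, 1]))  # mid  (~0.69)
print(soft_error([0.1, 0.2], [1, 1]))  # high (~1.96)
```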
features. target.
 initial situation
 initial confidence score
 concept identity, dialog state, turn number
 system action
 other actions performed in parallel
 features of the user response
 acoustic / prosodic features
 lexical features
 grammatical features
 dialog-level features
 target: was the value correct?
baselines
 initial baseline
 accuracy of system beliefs before the update
 heuristic baseline
 accuracy of the heuristic rule currently used in the system
 oracle baseline
 accuracy if we knew exactly when the user is correcting the system
results: explicit confirmation
             Hard-error (%)   Soft-error
Initial      31.15            0.51
Heuristic    8.41             0.19
LMT          3.57             0.12
Oracle       2.71             —
results: implicit confirmation
             Hard-error (%)   Soft-error
Initial      30.40            0.67
Heuristic    23.37            0.61
LMT          16.15            0.43
Oracle       15.33            —
results: unplanned implicit confirmation
             Hard-error (%)   Soft-error
Initial      15.40            0.43
Heuristic    14.36            0.46
LMT          12.64            0.34
Oracle       10.37            —
informative features
 initial confidence score
 prosody features
 barge-in
 expectation match
 repeated grammar slots
 concept id
 priors on concept values [not included in these results]
outline
 detecting misunderstandings
 detecting user corrections
[late-detection of misunderstandings]
 belief updating
[construct accurate beliefs by integrating information from multiple turns]
 current solutions
 a restricted version
 data
 user response analysis
 experiments and results
 discussion. caveats. future work
discussion
 evaluation
 does it make sense? what would be a better evaluation?
 current limitation: belief compression
 extending the models to N hypotheses + other
 current limitation: system actions
 extending the models to cover all system actions
thank you!
a more subtle caveat
 distribution of training data
 confidence annotator + heuristic update rules
 distribution of run-time data
 confidence annotator + learned model
 always a problem when interacting with the world!
 hopefully, the distribution shift will not cause a large degradation in performance
 remains to be validated empirically
 maybe a bootstrap approach?
KL-divergence & cross-entropy
 KL divergence: $D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
 cross-entropy: $CH(p, q) = H(p) + D(p \,\|\, q) = -\sum_x p(x) \log q(x)$
 negative log-likelihood: $LL(q) = -\sum_x \log q(x)$
logistic regression
 regression model for binomial (binary) dependent variables
$P(x = 1 \mid f) = \frac{1}{1 + e^{-w \cdot f}}$
$\log \frac{p(x = 1)}{p(x = 0)} = w \cdot f$
 fit the model using maximum likelihood (average log-likelihood)
 any stats package will do it for you
 no R² measure
 test fit using a “likelihood ratio” test
 stepwise logistic regression (sketched below)
 keep adding variables while the data likelihood increases significantly
 use the Bayesian information criterion to avoid overfitting
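A sketch of that stepwise recipe, assuming BIC is computed by hand from the fitted model's log-likelihood; the data is synthetic and the stopping rule is simplified relative to a full significance-tested procedure.

```python
# Sketch of forward stepwise logistic regression with a BIC stopping criterion.
import numpy as np
from sklearn.linear_model import LogisticRegression

def bic(model, X, y):
    """BIC = -2 * log-likelihood + k * log(n)."""
    p = model.predict_proba(X)[:, 1]
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = X.shape[1] + 1  # weights + intercept
    return -2 * loglik + k * np.log(len(y))

def forward_stepwise(X, y):
    selected, remaining = [], list(range(X.shape[1]))
    best_bic = np.inf
    while remaining:
        # Try adding each remaining feature; keep the one that lowers BIC most.
        scores = []
        for j in remaining:
            cols = selected + [j]
            m = LogisticRegression().fit(X[:, cols], y)
            scores.append((bic(m, X[:, cols], y), j))
        score, j = min(scores)
        if score >= best_bic:   # no feature improves BIC -> stop
            break
        best_bic, selected = score, selected + [j]
        remaining.remove(j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                       # 6 candidate features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=200) > 0).astype(int)
print("selected features:", forward_stepwise(X, y))
```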
logistic regression
[plot: fitted logistic curve of P(Task Success = 1) against % Nonunderstandings (FNON), 0-50%]
logistic model tree
 a regression tree, but with logistic models on the leaves
[diagram: root splits on feature f (f = 0 vs. f = 1); one branch splits again on g (g ≤ 10 vs. g > 10); each leaf holds its own logistic curve of P(Task Success = 1) against % Nonunderstandings (FNON)]
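A minimal sketch of the idea, assuming a single categorical root split (as with the response-type split mentioned earlier) and an ordinary logistic regression in each leaf; the class, feature names, and data are invented.

```python
# Minimal sketch of a logistic model tree: a depth-1 tree whose leaves
# each hold their own logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

class LogisticModelTree:
    def __init__(self, split_feature):
        self.split_feature = split_feature  # index of the root split feature
        self.leaves = {}                    # one logistic model per branch

    def fit(self, X, y):
        # Fit a separate logistic model on each value of the split feature.
        for v in np.unique(X[:, self.split_feature]):
            mask = X[:, self.split_feature] == v
            self.leaves[v] = LogisticRegression().fit(X[mask], y[mask])
        return self

    def predict_proba_pos(self, X):
        # Route each row to its leaf and return that leaf's P(y = 1).
        out = np.empty(len(X))
        for v, model in self.leaves.items():
            mask = X[:, self.split_feature] == v
            if mask.any():
                out[mask] = model.predict_proba(X[mask])[:, 1]
        return out

# Hypothetical usage: feature 0 is a categorical response-type
# (0=yes, 1=no, 2=other); the rest feed the per-leaf logistic models.
rng = np.random.default_rng(1)
X = np.column_stack([rng.integers(0, 3, 300), rng.normal(size=(300, 2))])
y = (X[:, 1] + (X[:, 0] == 1) > 0).astype(int)
lmt = LogisticModelTree(split_feature=0).fit(X, y)
print(lmt.predict_proba_pos(X[:5]))
```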