misunderstandings, corrections and
beliefs in spoken language interfaces
Dan Bohus
www.cs.cmu.edu/~dbohus
[email protected]
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
problem
spoken language interfaces lack robustness
when faced with understanding errors
stems mostly from speech recognition
spans most domains and interaction types
exacerbated by operating conditions
more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry I’m not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I’m still having trouble understanding you. To go on, I need you to answer
the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at
1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at
1:40pm arrives Seoul at ………
some statistics …
semantic error rates: ~25-35%

  SpeechActs [SRI]             25%
  CU Communicator [CU]         27%
  Jupiter [MIT]                28%
  CMU Communicator [CMU]       32%
  How May I Help You? [AT&T]   36%

corrections [Krahmer, Swerts, Litman, Levow]
30% of utterances correct system mistakes
corrections are 2-3 times more likely to be misrecognized
two types of understanding errors
NONunderstanding
System cannot extract any meaningful information
from the user’s turn
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
MISunderstanding
System extracts incorrect information from the user’s
turn
S: What city are you leaving from?
U: Birmingham [BERLIN PM]
misunderstandings
MISunderstanding: system extracts incorrect information from the user’s turn
S: What city are you leaving from?
U: Birmingham [BERLIN PM]
two ways to attack the problem: fix recognition, or detect potential misunderstandings and do something about them
outline
detecting misunderstandings
detecting user corrections
[late-detection of misunderstandings]
belief updating
[construct accurate beliefs by integrating information from multiple turns]
detecting misunderstandings
recognition confidence scores
S: What city are you leaving from?
U: Birmingham [BERLIN PM]
conf=0.63
traditionally: speech recognition confidence scores [Bansal, Chase, Cox, Kemp, many others]
use acoustic, language model and search info
computed at the frame, phoneme, or word level
“semantic” confidence scores
we’re interested in semantics, not words
YES = YEAH, NO = NO WAY
use machine learning to build confidence annotators
in-domain, manually labeled data
  utterance: [BERLIN PM] Birmingham
  labels: correct / misunderstood
features from different knowledge sources
binary classification problem
probability of misunderstanding: regression problem
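as a concrete sketch of the recipe (assuming scikit-learn; asr_confidence, num_words, barge_in, expectation_match are invented names, not the actual feature set):

```python
# a minimal sketch of a learned confidence annotator, assuming scikit-learn;
# the feature names below are illustrative, not the actual feature set
import numpy as np
from sklearn.linear_model import LogisticRegression

# one row per concept hypothesis:
# [asr_confidence, num_words, barge_in, expectation_match]  (hypothetical)
X_train = np.array([
    [0.63, 2, 0, 0],   # "Birmingham" heard as [BERLIN PM] -> misunderstood
    [0.91, 1, 0, 1],   # "Chicago" heard as [CHICAGO]      -> correct
    [0.40, 3, 1, 0],
    [0.85, 1, 0, 1],
])
y_train = np.array([0, 1, 0, 1])   # 1 = correct, 0 = misunderstood

# logistic regression treats the binary task as a regression on P(correct),
# matching the "probability of misunderstanding" framing above
annotator = LogisticRegression().fit(X_train, y_train)

p_correct = annotator.predict_proba([[0.63, 2, 0, 0]])[0, 1]
print(f"P(hypothesis correct) = {p_correct:.2f}")
```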
a typical result
Using Natural Language Processing and Discourse Features to Identify
Understanding Errors in a Spoken Dialogue System [Walker, Wright, Langkilde]
How May I Help You? corpus: call routing for phone services
11787 turns
features
ASR: recog, numwords, duration, dtmf, rg-grammar, tempo …
understanding: confidence, context-shift, top-task, diff-conf, …
dialog & history: sys-label, confirmation, num-reprompts, num-confirms, num-subdials, …
binary classification task
majority baseline (error): 36.5%
RIPPER (error): 14%
outline
detecting misunderstandings
detecting user corrections
[late-detection of misunderstandings]
belief updating
[construct accurate beliefs by integrating information from multiple turns]
detect user corrections
is the user trying to correct the system?
S: Where would you like to go?
U: Huntsville [SEOUL]   (misunderstanding)
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]   (user correction; also misunderstood)
same story: use machine learning
in-domain, manually labeled data
features from different knowledge sources
binary classification problem
probability of correction: regression problem
typical result
Identifying User Corrections Automatically in a
Spoken Dialog System [Hirschberg, Litman, Swerts]
TOOT corpus: access to train information
2328 turns, 152 dialogs
features
prosodic: f0max, f0mn, rmsmax, dur, ppau, tempo …
ASR: gram, str, conf, ynstr, …
dialog position: diadist
dialog history: preturn, prepreturn, pmeanf
binary classification task
majority baseline: 29%
RIPPER: 15.7%
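for concreteness, a sketch of how a few of the prosodic features above (f0max, f0mn, rmsmax, dur) might be computed, assuming librosa and one audio file per user turn; the cited work used its own signal-processing front end:

```python
# sketch: per-turn prosodic features, assuming librosa (illustrative only)
import numpy as np
import librosa

def prosodic_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    dur = len(y) / sr                                  # turn duration (s)

    # fundamental frequency track; unvoiced frames come back as NaN
    f0, voiced, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]

    rms = librosa.feature.rms(y=y)[0]                  # frame-level energy

    return {
        "f0max": float(f0.max()) if len(f0) else 0.0,
        "f0mn": float(f0.mean()) if len(f0) else 0.0,
        "rmsmax": float(rms.max()),
        "dur": dur,
    }
```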
outline
detecting misunderstandings
detecting user corrections
[late-detection of misunderstandings]
belief updating
[construct accurate beliefs by integrating information from multiple turns]
belief updating problem: an easy case
S: on which day would you like to travel?
U: on September 3rd
[AN DECEMBER THIRD] {CONF=0.25}
departure_date = {Dec-03/0.25}
S: did you say you wanted to leave on December 3rd?
U: no
[NO] {CONF=0.88}
departure_date = {Ø}
belief updating problem: a trickier case
S: Where would you like to go?
U: Huntsville
[SEOUL] {CONF=0.65}
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to
travel?
U: no no I’m traveling to Birmingham
[THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}
destination = {?}
belief updating problem formalized
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: [THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}
destination = {?}

given:
  an initial belief Pinitial(C) over concept C
  a system action SA
  a user response R
construct an updated belief:
  Pupdated(C) ← f (Pinitial(C), SA, R)
outline
detecting misunderstandings
detecting user corrections
[late-detection of misunderstandings]
belief updating
[construct accurate beliefs by integrating information from multiple turns]
current solutions
a restricted version
data
user response analysis
experiments and results
discussion. caveats. future work
belief updating: current solutions
most systems only track values, not beliefs
new values overwrite old values
explicit confirm + yes → trust hypothesis
explicit confirm + no → kill hypothesis
explicit confirm + “other” → non-understanding
implicit confirm: not much
“users who discover errors through incorrect implicit
confirmations have a harder time getting back on track”
[Shin et al, 2002]
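a minimal sketch of the heuristic rules above; a belief is modeled as a dict {value: confidence} per concept, which is illustrative, not the systems’ actual data structure:

```python
# sketch of the value-tracking heuristics: trust on "yes", kill on "no",
# treat anything else as a non-understanding
def heuristic_update(belief, response_type):
    """Update a concept belief after an explicit confirmation."""
    if not belief:
        return belief
    top = max(belief, key=belief.get)          # the confirmed hypothesis
    if response_type == "yes":
        return {top: 1.0}                      # trust hypothesis
    if response_type == "no":
        return {v: c for v, c in belief.items() if v != top}   # kill it
    return belief          # "other" -> treat as a non-understanding

# the earlier easy case: {"Dec-03": 0.25} + "no" -> {} (empty belief)
print(heuristic_update({"Dec-03": 0.25}, "no"))
```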
outline
detecting misunderstandings
detecting user corrections
[late-detection of misunderstandings]
belief updating
[construct accurate beliefs by integrating information from multiple turns]
current solutions
a restricted version
data
user response analysis
experiments and results
discussion. caveats. future work
belief updating: general form
given:
an initial belief Pinitial(C) over concept C
a system action SA
a user response R
construct an updated belief:
Pupdated(C) ← f (Pinitial(C), SA, R)
restricted version: 2 simplifications
1. compact belief
system unlikely to “hear” more than 3 or 4 values
single vs. multiple recognition results
in our data: max = 3 values, only 6.9% have >1 value
confidence score of top hypothesis
2. updates after confirmation actions
reduced problem
ConfTopupdated(C) ← f (ConfTopinitial(C), SA, R)
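as an interface, the reduced update is tiny; a sketch with an illustrative action encoding and a stand-in learned model (e.g., the logistic models described later):

```python
# sketch of the reduced problem's interface; the action encoding and the
# passed-in model are illustrative stand-ins, not the systems' actual code
def update_conf_top(conf_initial, system_action, response_features, model):
    """ConfTop_updated(C) <- f(ConfTop_initial(C), SA, R)"""
    action = {"explicit_confirm": 0, "implicit_confirm": 1}[system_action]
    x = [conf_initial, action] + list(response_features)
    return model.predict_proba([x])[0, 1]   # updated P(top value is correct)
```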
outline
detecting misunderstandings
detecting user corrections
[late-detection of misunderstandings]
belief updating
[construct accurate beliefs by integrating information from multiple turns]
current solutions
a restricted version
data
user response analysis
experiments and results
discussion. caveats. future work
data
collected with RoomLine
a phone-based mixed-initiative spoken dialog
system
conference room reservation
search and negotiation
explicit and implicit confirmations
confidence threshold model (+ some exploration)
implicit confirmation example:
I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?
user study
46 participants, 1st time users
10 scenarios, fixed order
presented graphically (explained during briefing)
compensated per task success
corpus statistics
449 sessions, 8848 user turns
orthographically transcribed
manually annotated
misunderstandings (concept-level)
non-understandings
user corrections
correct concept values
outline
detecting misunderstandings
detecting user corrections
[late-detection of misunderstandings]
belief updating
[construct accurate beliefs by integrating information from multiple turns]
current solutions
a restricted version
data
user response analysis
experiments and results
discussion. caveats. future work
user response types
following Krahmer and Swerts
study on a Dutch train timetable information system
3 user response types
YES: yes, right, that’s right, correct, etc.
NO: no, wrong, etc.
OTHER
cross-tabulated against correctness of
confirmations
user responses to explicit confirmations
from transcripts [numbers in brackets from Krahmer & Swerts]:

          CORRECT      INCORRECT
  YES     94% [93%]     1% [6%]
  NO       0% [0%]     72% [57%]
  Other    5% [7%]     27% [37%]

from decoded:

          CORRECT      INCORRECT
  YES     87%           1%
  NO       1%          61%
  Other   12%          38%
other responses to explicit confirmations
~70% users repeat the correct value
~15% users don’t address the question
  attempt to shift conversation focus

                          CORRECT   INCORRECT
  User does not correct     1159      29   [10% of incor]
  User corrects                0     250   [90% of incor]
user responses to implicit confirmations
from transcripts [numbers in brackets from Krahmer & Swerts]:

          CORRECT       INCORRECT
  YES     30% [0%]       6% [0%]
  NO       7% [0%]      33% [15%]
  Other   63% [100%]    61% [85%]

from decoded:

          CORRECT   INCORRECT
  YES     28%         7%
  NO       5%        27%
  Other   67%        66%
ignoring errors in implicit confirmations
                          CORRECT   INCORRECT
  User does not correct     552       118   [51% of incor]
  User corrects               2       111   [49% of incor]

users correct later (40% of 118)
users interact strategically: correct only if essential

              ~correct later   correct later
  ~critical         55               2
  critical          14              47
outline
detecting misunderstandings
detecting user corrections
[late-detection of misunderstandings]
belief updating
[construct accurate beliefs by integrating information from multiple turns]
current solutions
a restricted version
data
user response analysis
experiments and results
discussion. caveats. future work
machine learning approach
need good probability outputs
low cross-entropy between model predictions
and reality
cross-entropy = negative average log posterior
logistic regression
sample efficient
stepwise approach → feature selection
logistic model tree for each action
root splits on response-type
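the models must output usable probabilities, so evaluation uses cross-entropy (presumably the soft-error metric in the results that follow); a small sketch with made-up numbers:

```python
# cross-entropy as the negative average log posterior the model assigns
# to what actually happened (numbers are made up for illustration)
import numpy as np

def cross_entropy(p_correct, y_true):
    p = np.clip(p_correct, 1e-12, 1 - 1e-12)      # numerical safety
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# confident-and-right scores far better than confident-and-wrong
print(cross_entropy(np.array([0.9, 0.8, 0.2]), np.array([1, 1, 0])))  # low
print(cross_entropy(np.array([0.9, 0.8, 0.2]), np.array([0, 0, 1])))  # high
```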
features. target.
initial situation
initial confidence score
concept identity, dialog state, turn number
system action
other actions performed in parallel
features of the user response
acoustic / prosodic features
lexical features
grammatical features
dialog-level features
target: was the value correct?
baselines
initial baseline
accuracy of system beliefs before the update
heuristic baseline
accuracy of heuristic rule currently used in the
system
oracle baseline
accuracy if we knew exactly when the user is
correcting the system
results: explicit confirmation
              Hard-error (%)   Soft-error
  Initial         31.15           0.51
  Heuristic        8.41           0.19
  LMT              3.57           0.12
  Oracle           2.71             -
results: implicit confirmation
              Hard-error (%)   Soft-error
  Initial         30.40           0.67
  Heuristic       23.37           0.61
  LMT             16.15           0.43
  Oracle          15.33             -
results: unplanned implicit confirmation
              Hard-error (%)   Soft-error
  Initial         15.40           0.43
  Heuristic       14.36           0.46
  LMT             12.64           0.34
  Oracle          10.37             -
informative features
initial confidence score
prosody features
barge-in
expectation match
repeated grammar slots
concept id
priors on concept values
[not included in these results]
outline
detecting misunderstandings
detecting user corrections
[late-detection of misunderstandings]
belief updating
[construct accurate beliefs by integrating information from multiple turns]
current solutions
a restricted version
data
user response analysis
experiments and results
discussion. caveats. future work
discussion
evaluation
does it make sense?
what would be a better evaluation?
current limitation: belief compression
extending models to N hypotheses + other
current limitation: system actions
extending models to cover all system actions
thank you!
a more subtle caveat
distribution of training data
confidence annotator + heuristic update rules
distribution of run-time data
confidence annotator + learned model
always a problem when interacting with the
world!
hopefully, distribution shift will not cause
large degradation in performance
remains to validate empirically
maybe a bootstrap approach?
KL-divergence & cross-entropy
KL divergence: $D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
Cross-entropy: $CH(p, q) = H(p) + D(p \| q) = -\sum_x p(x) \log q(x)$
Negative log-likelihood: $NLL(q) = -\frac{1}{N} \sum_i \log q(x_i)$, an empirical estimate of $CH(p, q)$ when the $x_i$ are drawn from $p$
logistic regression
regression model for binomial (binary) dependent variables

  $P(x = 1 \mid f) = \frac{1}{1 + e^{-w \cdot f}}$

  $\log \frac{P(x = 1)}{P(x = 0)} = w \cdot f$

fit a model using max likelihood (avg log-likelihood)
  any stats package will do it for you
no R^2 measure
  test fit using “likelihood ratio” test
stepwise logistic regression
  keep adding variables while data likelihood increases significantly
  use Bayesian information criterion to avoid overfitting
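one way to realize the stepwise procedure, assuming statsmodels; forward selection that keeps the feature whose addition most lowers BIC:

```python
# greedy forward selection for logistic regression using BIC; a sketch,
# not the talk's actual implementation
import numpy as np
import statsmodels.api as sm

def stepwise_logit(X, y):
    """Greedy forward selection over columns of X using BIC."""
    selected, remaining = [], list(range(X.shape[1]))
    best_bic = np.inf
    while remaining:
        trials = []
        for j in remaining:
            exog = sm.add_constant(X[:, selected + [j]])
            trials.append((sm.Logit(y, exog).fit(disp=0).bic, j))
        bic, j = min(trials)
        if bic >= best_bic:       # adding anything else hurts BIC: stop
            break
        best_bic = bic
        selected.append(j)
        remaining.remove(j)
    return selected               # indices of the chosen features
```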
logistic regression
[figure: example logistic regression fit, P(Task Success = 1) as a function of % Nonunderstandings (FNON)]
logistic model tree
regression tree, but with logistic models on leaves
[figure: a logistic model tree; the root splits on a feature f (f=0 vs. f=1), one branch splits again on g (g<=10 vs. g>10), and each leaf holds its own logistic curve of P(Task Success = 1) vs. % Nonunderstandings (FNON)]
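a toy version of the idea, assuming scikit-learn; the single fixed split is illustrative (real logistic model trees are induced from data):

```python
# toy logistic model tree: split on one feature, fit a separate logistic
# model in each leaf (sketch only; assumes both leaves see both classes)
import numpy as np
from sklearn.linear_model import LogisticRegression

class LogisticModelTree:
    def __init__(self, split_feature, threshold):
        self.split_feature = split_feature     # which feature the root tests
        self.threshold = threshold
        self.leaves = {}                       # branch -> logistic model

    def _branch(self, X):
        return X[:, self.split_feature] > self.threshold

    def fit(self, X, y):
        b = self._branch(X)
        for side in (False, True):             # fit one model per leaf
            self.leaves[side] = LogisticRegression().fit(X[b == side],
                                                         y[b == side])
        return self

    def predict_proba(self, X):
        b = self._branch(X)
        p = np.empty(len(X))
        for side in (False, True):
            mask = b == side
            if mask.any():
                p[mask] = self.leaves[side].predict_proba(X[mask])[:, 1]
        return p
```

in the belief-updating models above, the root split is on response type (yes / no / other), with a logistic model on each leaf.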