
Approaches for Automatically Tagging Affect
Nathanael Chambers, Joel Tetreault, James Allen
University of Rochester
Department of Computer Science
Affective Computing
• Why use computers to detect affect?
– Make human-computer interaction more natural
• Computers express emotion
• And detect user’s emotion
• Tailor responses to situation
– Use affect for text summarization
• Understanding affect improves computer-human interaction systems
From the Psychologist’s P.O.V
• However, if computers can detect affect, they can also help humans understand it
• By observing changes in emotion and attitude in people conversing, psychologists can determine appropriate treatments for patients
Marriage Counseling
• Emotion and communication are important to
mental and physical health
• Psychological theories suggest that how well a
couple copes with serious illness is related to how
well they interact to deal with it
• Poor interactions (i.e., disengagement during conversations) can at times exacerbate an illness
• Tested this hypothesis by observing the engagement levels of conversations between married couples presented with a task
Example Interactions
• Good interaction sequence:
W: Well I guess we'd just have to develop a plan wouldn't we?
H: And we would be just more watchful or plan or maybe not, or be together
more when the other one went to do something
W: In other words going together
H: Going together more
W: That's right. And working more closely together and like you say, doing
things more closely together. And I think we certainly would want to share
with the family openly what we felt was going on so we could kind of work out
family plans
• Poor interaction sequence:
W: So how would you deal with that?
H: I don't know. I'd probably try to help. And you know, go with you or do
things like that if I, if I could. And you know, I don't know. I would try to do the
best I could to help you
Testing the Theory
• Record and transcribe conversations of married
couples presented with “what-if” scenario of one
of them having Alzheimer’s.
– Participants asked to discuss how they would deal with
the sickness
• Tag sentences of transcripts with affect-related codes. Certain textual patterns evoke negative or positive connotations
• Use distribution of tags to look for correlations
between communication and marital satisfaction
• Use tag distribution to decide on treatment for
couple
Problem
• However, tagging (step 2) is time-consuming, requires training time for new annotators, and is unreliable
• Solution: use computers to do tagging work
so psychologists can spend more time with
patients and less time coding
Goals
• Develop algorithms to automatically tag
transcripts of a Marriage Counseling Corpus
(Shields, 1997)
• Develop a tool that human annotators can
use to pre-tag a transcript given the best
algorithm, and then quickly correct it
Outline
• Background
• Marriage Counseling Corpus
• N-gram based approaches
• Information-Retrieval/Call-Routing approaches
• Results
• CATS Tool
Background
• Affective computing, or detecting emotion in texts
or from a user, is a young field
• Earliest approaches used keyword matching
• Tagged dictionaries with grammatical features
(Boucouvalas and Ze, 2002)
• Statistical methods – LSA (Webmind project),
TSB (Wu et al., 2000) to tag a dialogue
• Liu et al. (2003) use common-sense rules to detect
emotion in emails
New Methods for Tagging Affect
• Our approaches differ from others in two ways:
– Use different statistical methods based on computing N-grams
– Tag individual sentences as opposed to discourse chunks
• Our approaches are based on methods that have
been successful in another domain: discourse act
tagging
Marriage Counseling Corpus
• 45 annotated transcripts of married couples working on the Alzheimer’s task
• Collected by psychologists in the Center for
Future Health, Rochester, NY
• Transcripts broken into “thought units” – one or
more sentences that represent how the speaker
feels toward a topic (4,040 total)
• Tagging thought units takes into account positive and negative words, level of detail, sensitivity, and comments on health, family, travel, etc.
Code Tags
• DTL – “Detail” (11.2%) speaker’s verbal content
is concise and distinct with regards to illness,
emotions, dealing with death:
– “It would be hard for me to see you so helpless”
• GEN – “General” (41.6%) verbal content towards
illness is vague or generic, or speaker does not
take ownership of emotions:
– “I think that it would be important”
Code Tags
• SAT: “Statements About the Task” – (7.2%)
couple discusses what the task is, how to perform
it:
– “I thought I would be the caregiver”
• TNG – “Tangent” – (2.9%) statements that are far off topic
• ACK – “Acknowledgments” (22.8%) of the other
speaker’s comments:
– “Yeah” “right”
N-Gram Based Approaches
n-gram: a sequential list of n words, used to encode the likelihood that the phrase will appear in the future.
Involves splitting a sentence into chunks of consecutive words of length n:
“I don’t know what to say”
1-gram (unigram): I, don’t, know, what, to, say
2-gram (bigram): I don’t, don’t know, know what, what to, to say
3-gram (trigram): I don’t know, don’t know what, know what to, etc.
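The splitting step above can be sketched in a few lines of Python (a minimal illustration using naive whitespace tokenization; the function name `ngrams` is my own, not from the paper):

```python
def ngrams(sentence, n):
    """Return all contiguous n-word chunks of a sentence as tuples."""
    words = sentence.split()  # naive whitespace tokenization
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# The slide's example sentence, split into bigrams:
ngrams("I don't know what to say", 2)
# → [('I', "don't"), ("don't", 'know'), ('know', 'what'), ('what', 'to'), ('to', 'say')]
```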
Frequency Table (Training)

                        GEN   DTL   ACK   SAT
“I”                     0.5   0.2   0.2   0.1
“Yeah”                  0.3   0.2   0.4   0.1
“Don’t want to be”      0.2   0.8   0.0   0.0
“I don’t want to be”    0.0   1.0   0.0   0.0

Each entry: probability that the n-gram is labeled a certain tag
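One plausible way to build such a table is a maximum-likelihood estimate of P(tag | n-gram) from tag counts in the training corpus (a sketch under that assumption; `train_table` and its argument names are hypothetical, not the authors' code):

```python
from collections import Counter, defaultdict

def train_table(tagged_units, max_n=5):
    """Estimate P(tag | n-gram) by counting tag occurrences per n-gram."""
    counts = defaultdict(Counter)
    for tag, sentence in tagged_units:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])][tag] += 1
    # Normalize each n-gram's tag counts into a probability distribution.
    table = {}
    for gram, tag_counts in counts.items():
        total = sum(tag_counts.values())
        table[gram] = {t: c / total for t, c in tag_counts.items()}
    return table
```

For example, training on two toy thought units, `[("ACK", "yeah"), ("GEN", "yeah right")]`, gives "yeah" a 0.5/0.5 split between ACK and GEN, mirroring the table's layout.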
N-Gram Motivation
Advantages
• Encode not just keywords, but also word ordering, automatically
• Models are not biased by hand coded lists of words, but are completely
dependent on real data
• Learning features of each affect type is relatively fast and easy
Disadvantages
• Long range dependencies are not captured
• Dependent on having a corpus of data to train from
– Sparse data for low frequency affect tags adversely affects the quality of
the n-gram model
Naïve Approach
P(tag_i | utt) = max_{j,k} P(tag_i | ngram_{j,k})
• Where tag_i is one of {GEN, DTL, ACK, SAT, TNG}
• And ngram_{j,k} is the j-th n-gram of length k
• So for all n-grams in a thought unit, find the one
with the highest probability for a given tag, and
select that tag
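In code, the naïve rule amounts to scanning every n-gram in the thought unit and keeping the tag of the single most confident one (a sketch of the idea; `table` stands for the hypothetical frequency table of P(tag | n-gram) values described above):

```python
def naive_tag(thought_unit, table, max_n=5):
    """Tag a thought unit with the tag of its highest-probability n-gram."""
    words = thought_unit.split()
    best_tag, best_prob = None, -1.0
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            # table maps n-gram -> {tag: P(tag | n-gram)}
            for tag, prob in table.get(gram, {}).items():
                if prob > best_prob:
                    best_tag, best_prob = tag, prob
    return best_tag, best_prob
```

On the slide's example, the 5-gram "I don't want to be" (probability 1.0 for DTL) would win over any shorter n-gram.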
Naïve Approach Example
I don’t want to be chained to a wall.
k   Tag   Top N-gram            Probability
1   GEN   don’t                 0.665
2   GEN   to a                  0.692
3   GEN   <s> I don’t           0.524
4   DTL   don’t want to be      0.833
5   DTL   I don’t want to be    1.00
N-Gram Approaches
• Weighted Approach
– Weight the longer n-grams higher in the stochastic model
• Lengths Approach
– Include a length-of-utterances factor, capturing the differences in
utterance length between affect tags
• Weights with Lengths Approach
– Combine Weighted with Lengths
• Repetition Approach
– Combine all of the above information with the overlap of words between thought units
Repetition Approach
Many acknowledgment (ACK) utterances were being mistagged as GEN by the previous approaches. Most of the errors came from grounding that involved word repetition:
A - so then you check that your tire is not flat.
B - check the tire
• We created a model that takes into account word repetition in adjacent
utterances in a dialogue.
• We also include a length probability to capture the Lengths Approach.
• Only unigrams are used to avoid sparseness in the training data.
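The repetition cue can be sketched as a simple unigram-overlap score between adjacent utterances (an illustration of the idea only, not the authors' exact model; the function name is my own):

```python
def word_overlap(prev_utt, utt):
    """Fraction of the current utterance's words repeated from the previous one."""
    prev = set(prev_utt.lower().split())  # unigrams only, to avoid sparseness
    words = utt.lower().split()
    if not words:
        return 0.0
    return sum(w in prev for w in words) / len(words)

# The grounding example from the slide: "check" and "tire" are repeats.
word_overlap("so then you check that your tire is not flat", "check the tire")
```

A high overlap score is evidence for ACK, which such a model can fold in alongside the n-gram and length probabilities.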
IR-based approaches
• Work based on call-routing algorithm of Chu-Carroll and
Carpenter (1999)
• Problem: route a user’s call to a financial call center to the
correct destination
• Do this by converting a query from the user (speech converted to text) into a vector to be compared with a list of possible destination vectors in a database
Database Table (Training)

Query: “yeah, that’s right”

                       Query   GEN   DTL   ACK   SAT
“I”                    0.0     0.5   0.2   0.2   0.1
“yeah”                 1.0     0.3   0.2   0.4   0.1
“Don’t want to be”     0.0     0.2   0.8   0.0   0.0
“I don’t want to be”   0.0     0.0   1.0   0.0   0.0

Cosine comparison: the query (thought unit) is compared against each tag vector in the database
Database Creation
• Construct the database in the same manner as the N-gram models
• The database is then normalized
• Filter: Inverse Document Frequency (IDF) – lowers the
weight of terms that occur in many documents:
IDF(t) = log2 (N / d(t) )
• Where d(t) is the number of tags containing n-gram t, and
N is the total number of tags
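The IDF filter translates directly into code (a sketch; `tag_docs` is a hypothetical mapping from each tag to the set of n-grams appearing in that tag's document):

```python
import math

def idf(ngram, tag_docs):
    """IDF(t) = log2(N / d(t)): down-weights terms that occur under many tags."""
    N = len(tag_docs)                                      # total number of tags
    d = sum(ngram in terms for terms in tag_docs.values())  # tags containing t
    return math.log2(N / d) if d else 0.0
```

With four tags, a term appearing under two of them gets IDF = log2(4/2) = 1.0, while a term unique to one tag gets log2(4/1) = 2.0.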
Method 1: Routing-based method
• Modified call-routing method with entropy (amount of
disorder) to further reduce contribution of terms that occur
frequently
• Also created two more terms (rows in database)
– Sentence length: tags may be correlated with sentences
of a certain length
– Repetition – acknowledgments tend to repeat the words
stated in the previous thought unit
Method 1: Example
Cosine scores for tags compared against the query vector for
“I don’t want to be chained to a wall”:
GEN = 0.072, DTL = 0.073, SAT = 0.014, ACK = 0.002, TNG = 0.0001
Method 2: Direct Comparison
• Instead of comparing queries to a normalized
database of exemplar documents, compare them to
all test sentences
• Advantage: no normalizing or construction of
documents
• The cosine test is used to get the top ten matches; the scores of matches with the same tag are summed, and the tag with the highest total is selected
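The direct-comparison step can be sketched as cosine scoring over sparse term vectors followed by top-k vote summing (an illustration; the dict-based vector representation and the function names are assumptions, not the authors' code):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def direct_tag(query_vec, tagged_vecs, k=10):
    """Sum cosine scores of the top-k matching sentences per tag; pick the best tag."""
    scored = sorted(((cosine(query_vec, v), tag) for tag, v in tagged_vecs),
                    reverse=True)
    totals = Counter()
    for score, tag in scored[:k]:
        totals[tag] += score
    return totals.most_common(1)[0][0]
```

Two moderate DTL matches can outvote one strong match with another tag, which is exactly the behavior in the example that follows (0.56 + 0.55 = 1.11 beats a single 0.64).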
Method 2: Example
Cosine Score   Tag   Sentence
0.64           SAT   Are we supposed to get them?
0.60           GEN   That sounds good
0.60           TNG   That’s due to my throat
0.56           DTL   But if I said to you I don’t want…
0.55           DTL   If it were me, I’d want to be a guinea pig to try things

DTL selected with total score of 1.11
Evaluation
• Performed six-fold cross-validation over the
Marriage Corpus and Switchboard Corpus
• Averaged scores from each of the six evaluations
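A six-fold evaluation of this kind can be sketched as a generic k-fold partition (an illustration only; the authors' exact fold assignment is not specified):

```python
def cross_val_splits(items, folds=6):
    """Yield (train, test) splits for k-fold cross-validation."""
    fold_size = len(items) // folds
    for k in range(folds):
        test = items[k * fold_size:(k + 1) * fold_size]
        train = items[:k * fold_size] + items[(k + 1) * fold_size:]
        yield train, test

# Train a model on each train split, score it on the matching test split,
# then average the six accuracy figures.
```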
Results
6-Fold Cross Validation for N-gram Methods
Naive                  66.80%
Weighted               67.43%
Lengths                64.35%
Weights with Lengths   66.02%
Repetition             66.60%

6-Fold Cross Validation for IR Methods
Original               61.37%
Entropy                66.16%
Repetition             66.39%
Length                 66.76%
Repetition and Length  66.76%
Direct                 63.16%
Discussion
• N-gram approaches do slightly better than IR over the Marriage Counseling Corpus
• Incorporating the additional features of sentence length and repetition improves both models
• Entropy model better than IDF in call-routing
system (gets 4% boost)
• Psychologists are currently using the tool to tag their work; they note that the computer sometimes tags better than the human annotators
CATS
CATS: An Automated Tagging System for affect and other
similar information retrieval tasks.
• Written in Java for cross-platform interoperability.
• Implements the Naïve approach with unigrams and bigrams only.
• Builds the stochastic models automatically from a tagged corpus, input by the user into the GUI display.
• Automatically tags new data using the user’s models. Each tag also
receives a confidence score, allowing the user to hand check the
dialogue quickly and with greater confidence.
The CATS GUI provides a clear workspace for text and tags.
Tagging new data and training old data is done with a mouse click.
Customizable models are available. Create your own list
of tags, provide a training corpus, and build a new model.
Tags are marked with confidence scores based on the
probabilistic models.