Detecting Deception in Asynchronous CMC
Using Linguistic Analysis and Classification Techniques to Identify
Ingroup and Outgroup Messages in the Enron Email Corpus
Introduction
Called "America's most innovative company" from 1999 to 2000,
Enron was the 7th largest company in the United States
Enron had 21,000 employees in mid-2001
Went bankrupt in December 2001
This study applied linguistic analysis to the
publicly available Enron email corpus
Research Direction
Can linguistic cues used in deception
detection be utilized to identify other
classifications?
Ingroup vs. outgroup communication
Motivators
Baseline truth and deception may be too
difficult or costly to identify
Existing automated techniques could be
readily applied
Ingroup and Outgroup Communication
Social Identity Theory (Tajfel and Turner):
discrimination in favor of ingroup and in
opposition to outgroup
Includes prejudice, stereotyping, negotiation,
and language use
Linguistic Masking (Platow and Broadie)
Done with strategic use of passive and active voice
Linguistic Cues for Deception Detection
Existing research on automated linguistic
analysis of asynchronous computer-mediated communication
Performs better than chance
Existing and established cues and
automation techniques could be applied to
similar classification schemes
Cues identified by Twitchell et al. (2005)
were used in this study
Methodology
Define our selection criteria
Ingroup: communication between people who were found guilty,
submitted a guilty plea, or were awaiting trial
Outgroup: communication from a person who was found guilty,
submitted a guilty plea, or was awaiting trial to a person
not convicted or charged
Identify ingroup and outgroup members
News articles
Court transcripts
Extract senders from emails
Identify ingroup and outgroup messages
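The selection criteria above can be sketched as a small labeling function. This is only an illustration, not the authors' procedure: the function name `label_message`, the placeholder names, and the decision to leave messages with mixed recipients unlabeled are all assumptions, since the slides do not say how such messages were handled.

```python
from typing import Iterable, Optional, Set


def label_message(sender: str, recipients: Iterable[str], charged: Set[str]) -> Optional[str]:
    """Label one message as 'ingroup' or 'outgroup' under the slide's definitions.

    `charged` is the set of people found guilty, who pleaded guilty, or who
    were awaiting trial. Returns None when the message fits neither definition.
    """
    recipients = list(recipients)
    # Both definitions require the sender to be a charged person.
    if sender not in charged or not recipients:
        return None
    if all(r in charged for r in recipients):
        return "ingroup"
    if all(r not in charged for r in recipients):
        return "outgroup"
    # Assumption: a mix of charged and uncharged recipients is left unlabeled.
    return None


# Hypothetical names, for illustration only.
charged_people = {"Person A", "Person B"}
print(label_message("Person A", ["Person B"], charged_people))  # ingroup
print(label_message("Person A", ["Person C"], charged_people))  # outgroup
```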
Methodology (cont)
Email Identification
Publicly available corpus appears to include email
from 151 employees
Parsing the sender/receiver addresses (using the Enron
naming convention, which includes first and last
name) resulted in an actual employee count of
over 5,000
Identified 29 ingroup messages and over 600
outgroup messages
Random sample of 29 outgroup messages was used for
analysis
Analyze identified emails with GATE and Weka
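The address parsing and balanced sampling steps might look roughly like the sketch below. It assumes the naming convention has the form `first.last@enron.com`, which the slides do not spell out, and the helper names (`employee_name`, `employees_in_headers`, `balanced_sample`) are hypothetical. Addresses with middle initials or other variations are simply skipped here.

```python
import random
import re
from email.utils import getaddresses

# Assumed convention: first.last@enron.com (letters only).
NAME_RE = re.compile(r"^([a-z]+)\.([a-z]+)@enron\.com$")


def employee_name(address):
    """Return 'First Last' for an address matching the assumed convention, else None."""
    match = NAME_RE.match(address.strip().lower())
    if match is None:
        return None
    first, last = match.groups()
    return f"{first.capitalize()} {last.capitalize()}"


def employees_in_headers(header_values):
    """Collect distinct employee names from raw From/To/Cc header strings."""
    names = set()
    for _display_name, address in getaddresses(header_values):
        name = employee_name(address)
        if name is not None:
            names.add(name)
    return names


def balanced_sample(outgroup_messages, n, seed=0):
    """Draw n outgroup messages at random so both classes have equal size."""
    return random.Random(seed).sample(outgroup_messages, n)


# Hypothetical headers, for illustration only.
headers = ["Jane Doe <jane.doe@enron.com>, someone@example.com", "john.roe@enron.com"]
print(sorted(employees_in_headers(headers)))
print(len(balanced_sample(list(range(600)), 29)))
```

Sampling 29 of the 600-plus outgroup messages gives a balanced set of 58, which matches the total reported in the results.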
Methodology: GATE and Weka
GATE (General Architecture for Text Engineering)
Extracted 39 features for each message
Weka (Waikato Environment for Knowledge Analysis)
Classification engine supporting decision
trees, neural networks, and other AI algorithms
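To give a sense of what a per-message feature looks like, here is a minimal sketch of two of the cues in plain Python. It is not GATE and does not reproduce the study's feature definitions: the sentence splitting, the word pattern, and the definition of "you references" as a share of all words are assumptions made for illustration. Cues such as pleasantness or passive verb ratio would need a lexicon or a part-of-speech tagger and are left out.

```python
import re

WORD_RE = re.compile(r"[A-Za-z']+")
YOU_FORMS = {"you", "your", "yours", "yourself", "yourselves"}


def average_sentence_length(text):
    """Mean number of words per sentence, splitting naively on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return len(WORD_RE.findall(text)) / len(sentences)


def you_reference_ratio(text):
    """Share of words that are second-person pronouns (assumed definition)."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 0.0
    return sum(w in YOU_FORMS for w in words) / len(words)


message = "Can you send the report? I need your numbers today."
print(average_sentence_length(message))  # 5.0
print(you_reference_ratio(message))      # 0.2
```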
Automated Analysis Results
Using a J48 decision tree with ten-fold cross-validation
Correctly classified 48 out of 58 e-mail messages
as ingroup or outgroup (82.8% accuracy).
Cues
Only 5 out of 39 cues were needed for classification:
Pleasantness
Average Sentence Length
Verb Quantity
You References
Passive Verb Ratio
J48 Decision Tree
Pleasantness <= 0.007673: false (19.0/1.0)
Pleasantness > 0.007673
| Average_Sentence_Length <= 34.5
| | Verb_Quantity <= 8: false (3.0)
| | Verb_Quantity > 8
| | | You_References <= 0.024155: true (20.0)
| | | You_References > 0.024155
| | | | passive_verb_ratio <= 0: true (9.0/1.0)
| | | | passive_verb_ratio > 0: false (2.0)
| Average_Sentence_Length > 34.5: false (5.0)
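The printed tree can be read as a chain of threshold tests. The sketch below transcribes it into a Python function, keeping the `true`/`false` labels exactly as printed, since the slides do not state which label corresponds to ingroup. In Weka's J48 output, a leaf annotation such as `(19.0/1.0)` gives the number of training instances reaching the leaf and, after the slash, how many of them the leaf misclassifies; the `LEAVES` list is just those numbers copied from the tree.

```python
def classify(pleasantness, average_sentence_length, verb_quantity,
             you_references, passive_verb_ratio):
    """Apply the printed J48 tree; returns the leaf label as a bool."""
    if pleasantness <= 0.007673:
        return False
    if average_sentence_length > 34.5:
        return False
    if verb_quantity <= 8:
        return False
    if you_references <= 0.024155:
        return True
    # Remaining branch: true only when no passive verbs are used.
    return passive_verb_ratio <= 0


# (instances reaching leaf, instances misclassified), in printed order.
LEAVES = [(19, 1), (3, 0), (20, 0), (9, 1), (2, 0), (5, 0)]
total = sum(n for n, _ in LEAVES)
training_errors = sum(e for _, e in LEAVES)

print(total)                    # 58 messages, matching the sample size
print(training_errors)          # 2 errors on the data the tree was built from
print(round(48 / 58 * 100, 1))  # 82.8, the cross-validated accuracy
```

The leaf counts describe fit on the training data, so they are more optimistic than the 48 of 58 obtained under ten-fold cross-validation.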
Future Research Directions
Perform similar analysis on transcripts of
wire-tapped phone conversations (also
publicly available)
Perform additional research to identify
deceptive and non-deceptive messages
from Enron corpus
Explore additional ingroup and outgroup
scenarios
Questions?