Detection Deception in Asynchronous CMC


Using Linguistic Analysis and Classification Techniques to Identify Ingroup and Outgroup Messages in the Enron Email Corpus

Introduction
• Named "America's most innovative company" from 1999 to 2000, Enron was the 7th largest company in the United States
• Enron had 21,000 employees in mid-2001
• Went bankrupt in December 2001
• This study applies linguistic analysis to the publicly available Enron email corpus

Research Direction

• Can linguistic cues used in deception detection be used to identify other classifications?
    • Ingroup vs. outgroup communication
• Motivators
    • Baseline truth and deception may be too difficult or costly to identify
    • Existing automated techniques could be readily applied
Ingroup and Outgroup Communication

• Social Identity Theory (Tajfel and Turner): discrimination in favor of the ingroup and in opposition to the outgroup
    • Includes prejudice, stereotyping, negotiation, and language use
• Linguistic Masking (Platow and Broadie)
    • Done with strategic use of passive and active voice
Linguistic Cues for Deception Detection

• Existing research on automated linguistic analysis of asynchronous computer-mediated communication
    • Detection accuracy is better than chance
• Existing and established cues and automation techniques could be applied to similar classification schemes
• Cues identified by Twitchell et al. (2005) are used in this study
Methodology

• Define our selection criteria
    • Ingroup: communication between people who were found guilty, submitted a guilty plea, or were awaiting trial
    • Outgroup: communication from a person who was found guilty, submitted a guilty plea, or was awaiting trial to a person not convicted or charged
• Identify ingroup and outgroup members
    • News articles
    • Court transcripts
• Extract senders from emails
• Identify ingroup and outgroup messages
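The selection criteria above can be sketched as a small labeling function. This is a minimal sketch, not the authors' procedure: the function label_message, the charged set, and the placeholder names are all hypothetical, and excluding messages with a mixed audience is an assumption, since the slides do not say how such messages were handled.

```python
from typing import Iterable, Optional, Set


def label_message(sender: str, recipients: Iterable[str],
                  charged: Set[str]) -> Optional[str]:
    """Label one message as 'ingroup', 'outgroup', or None (not selected).

    `charged` holds the people found guilty, who pleaded guilty, or who
    were awaiting trial (built by hand from news articles and transcripts).
    """
    recipients = set(recipients)
    # Both groups require a sender who belongs to the charged set.
    if sender not in charged or not recipients:
        return None
    if recipients <= charged:
        return "ingroup"       # every recipient is also in the charged set
    if recipients.isdisjoint(charged):
        return "outgroup"      # no recipient is in the charged set
    return None                # mixed audience: excluded (assumption)


# Placeholder names, not real corpus participants.
charged = {"alex.one", "blake.two"}
print(label_message("alex.one", ["blake.two"], charged))
print(label_message("alex.one", ["casey.three"], charged))
```

Returning None for unselected messages keeps the filter explicit, so the caller decides whether to drop or inspect them.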
Methodology (cont)

• Email identification
    • The publicly available corpus appears to include email from 151 employees
    • Parsing the sender/receiver addresses (using the Enron naming convention, which includes first and last name) resulted in an actual employee count of over 5,000
    • Identified 29 ingroup messages and over 600 outgroup messages
    • A random sample of 29 outgroup messages was used for analysis
• Analyze identified emails with GATE and Weka
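The address parsing and the balanced sampling step might look roughly like the sketch below. It assumes addresses of the form first.last@enron.com (the slides only say the convention includes first and last name), and the function names, the sample message, and the seed are hypothetical. Addresses that do not match the assumed pattern are simply skipped.

```python
import random
import re
from email import message_from_string
from email.utils import getaddresses

# Assumed form of the naming convention: first.last@enron.com
NAME_RE = re.compile(r"^([a-z]+)\.([a-z]+)@enron\.com$")


def employee_name(address):
    """Return (first, last) if the address fits the assumed convention."""
    match = NAME_RE.match(address.strip().lower())
    return (match.group(1), match.group(2)) if match else None


def participants(raw_email):
    """Extract the sender and the recognizable receivers from one message."""
    msg = message_from_string(raw_email)
    from_addrs = [a for _, a in getaddresses(msg.get_all("From", []))]
    to_fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    to_addrs = [a for _, a in getaddresses(to_fields)]
    sender = employee_name(from_addrs[0]) if from_addrs else None
    receivers = [n for n in map(employee_name, to_addrs) if n]
    return sender, receivers


def balanced_sample(outgroup_messages, size=29, seed=0):
    """Draw a random outgroup sample matching the ingroup count."""
    return random.Random(seed).sample(outgroup_messages, size)


RAW = ("From: jane.doe@enron.com\n"
       "To: john.smith@enron.com, someone@example.com\n"
       "Subject: placeholder\n\nBody text.\n")
print(participants(RAW))
```

Sampling 29 outgroup messages to match the 29 ingroup messages gives a balanced set of 58, so a chance-level classifier would be expected to score about 50%.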
Methodology: GATE and Weka

• GATE (General Architecture for Text Engineering)
    • Extracted 39 features for each message
• Weka (Waikato Environment for Knowledge Analysis)
    • Classification engine supporting decision trees, neural networks, and other AI algorithms
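To give a feel for what a per-message feature looks like, here is a rough sketch of two surface cues. It does not reproduce the GATE pipeline: the cue definitions (words per sentence, and second-person words as a share of all words), the word list, and the function name simple_cues are assumptions chosen for illustration. Cues such as verb quantity or passive verb ratio would need part-of-speech tagging and are left out.

```python
import re

# Assumed second-person word list; the study's actual list is not given.
YOU_WORDS = {"you", "your", "yours", "yourself", "yourselves"}


def simple_cues(text):
    """Compute two illustrative surface cues for one message body."""
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        return {"Average_Sentence_Length": 0.0, "You_References": 0.0}
    return {
        # Mean number of words per sentence.
        "Average_Sentence_Length": len(words) / len(sentences),
        # Share of words that refer to the reader.
        "You_References": sum(w in YOU_WORDS for w in words) / len(words),
    }


print(simple_cues("Thank you for your note. I will call tomorrow."))
```

Each message then becomes one row of numeric features, which is the shape of input a classifier such as a decision tree expects.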
Automated Analysis Results

• Using a J48 decision tree with ten-fold cross-validation
    • Correctly classified 48 out of 58 e-mail messages as ingroup or outgroup (82.8% accuracy)
• Cues
    • Only 5 out of 39 cues were needed for classification:
        • Pleasantness
        • Average Sentence Length
        • Verb Quantity
        • You References
        • Passive Verb Ratio
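Ten-fold cross-validation splits the 58 messages into ten folds, trains on nine, tests on the held-out fold, and pools the correct predictions. The sketch below shows the idea with a plain shuffled split, which may differ from what Weka does internally; the function names and the fit/predict callback interface are hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validated_accuracy(X, y, fit, predict, k=10, seed=0):
    """Pool correct predictions over all held-out folds."""
    correct = 0
    for train, test in k_fold_indices(len(y), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(y)


# The reported figure: 48 correct out of 58 messages.
print(round(100 * 48 / 58, 1))  # 82.8
```

Because every message is scored by a model that never saw it during training, the pooled accuracy is a fairer estimate than the error counts shown on the tree's own leaves.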
J48 Decision Tree
Pleasantness <= 0.007673: false (19.0/1.0)
Pleasantness > 0.007673
| Average_Sentence_Length <= 34.5
| | Verb_Quantity <= 8: false (3.0)
| | Verb_Quantity > 8
| | | You_References <= 0.024155: true (20.0)
| | | You_References > 0.024155
| | | | passive_verb_ratio <= 0: true (9.0/1.0)
| | | | passive_verb_ratio > 0: false (2.0)
| Average_Sentence_Length > 34.5: false (5.0)
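The tree above can be read as a chain of threshold tests. The sketch below transcribes it into a function, using the attribute names and thresholds exactly as printed; the function name j48_predict and the dictionary input are hypothetical. The slides do not state which class the labels true and false stand for, so the function returns the raw boolean.

```python
def j48_predict(cues):
    """Apply the printed J48 tree to one message's cue values."""
    if cues["Pleasantness"] <= 0.007673:
        return False
    if cues["Average_Sentence_Length"] > 34.5:
        return False
    if cues["Verb_Quantity"] <= 8:
        return False
    if cues["You_References"] <= 0.024155:
        return True
    # Last split: any passive verb at all flips the label to false.
    return cues["passive_verb_ratio"] <= 0


# Hypothetical cue values for one message.
example = {
    "Pleasantness": 0.02,
    "Average_Sentence_Length": 18.0,
    "Verb_Quantity": 12,
    "You_References": 0.01,
    "passive_verb_ratio": 0.0,
}
print(j48_predict(example))  # True
```

The leaf counts in the printed tree sum to 58 (19 + 3 + 20 + 9 + 2 + 5), matching the number of messages analyzed.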
Future Research Directions
• Perform similar analysis on transcripts of wiretapped phone conversations (also publicly available)
• Perform additional research to identify deceptive and non-deceptive messages from the Enron corpus
• Explore additional ingroup and outgroup scenarios

Questions?