5th Symposium on Information Systems Assurance
Data Mining of E-Mails to Support Periodic & Continuous Assurance
Glen L. Gray
California State University at Northridge
Roger Debreceny
University of Hawai`i at Mānoa
Toronto: October 2007
In this Presentation
 Continuous monitoring of emails – why?
 Technologies
    Social Network Analysis
    Text analysis
 Challenges
 Opportunities
Continuous Monitoring of Emails – Why?
 Increased focus on forensic approaches to auditing
 Increased interest in continuous assurance and monitoring of business processes
 Emails = Organization’s DNA
 Evidential matter on:
    Employee & management fraud (overrides)
    Compliance (e.g., HIPAA)
    Loss of intellectual property
    Corporate policies
Enron Email Archive
 Released by Federal Energy Regulatory Commission
 500K emails
 151 Enron employees
 Cleaned version at Carnegie Mellon: www.cs.cmu.edu/~enron/
 Relational DB version at USC: www.isi.edu/~adibi/Enron/Enron_Dataset_Report.pdf
Email Mining Targets
[Diagram: Email Data Mining, branching into Content Analysis (Key Word Queries, Deception Clues) and Log Analysis (Volume & Velocity, Social Network Analysis)]
Content Analysis
Key Word Queries
 Yes, people do say self-incriminating things in their emails
    Fraud
    Corporate dysfunction
 Overwhelming false positives
 Need “smart” compound queries
 Good continuous auditing (CA) candidate
 Already scanning for spam, porn, etc.
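One way to read "smart" compound queries is to flag a message only when terms from several groups co-occur, which should cut false positives compared with single-keyword hits. The sketch below is a minimal illustration under that assumption; the term groups and the function names `matches_rule` and `flag_messages` are hypothetical and do not come from the presentation.

```python
import re

# Hypothetical compound rule: a message is flagged only when it contains
# at least one term from EVERY group, which is stricter than any single keyword.
RULE = [
    {"invoice", "payment", "wire"},        # assumed "money" terms
    {"delete", "shred", "off the books"},  # assumed "concealment" terms
]

def matches_rule(body, rule=RULE):
    """Return True if the body has a whole-word hit in every term group."""
    text = body.lower()
    return all(
        any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in group)
        for group in rule
    )

def flag_messages(messages, rule=RULE):
    """Keep only the messages that satisfy the compound rule."""
    return [m for m in messages if matches_rule(m, rule)]
```

A message that mentions only a payment is ignored; it is the combination with a concealment term that raises the flag.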
Sender Deception -- Content
 Deceptive emails include:
    Fewer first-person pronouns to dissociate themselves from their own words
    Fewer exclusive words, such as but and except, to indicate a less complex story
    More negative emotion words because of the sender’s underlying feeling of guilt
    More action verbs to, again, indicate a less complex story
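The four cues above can be turned into simple per-message rates. The sketch below is only illustrative: the word lists in `CUE_WORDS` are tiny, made-up samples (real studies rely on much larger validated lexicons), and `cue_rates` is a hypothetical helper name.

```python
import re

# Illustrative word lists only; these are assumptions, not a validated lexicon.
CUE_WORDS = {
    "first_person": {"i", "me", "my", "mine", "myself"},
    "exclusive": {"but", "except", "without", "although"},
    "negative_emotion": {"hate", "worthless", "afraid", "sorry", "angry"},
    "action_verbs": {"go", "went", "take", "took", "move", "moved"},
}

def cue_rates(body):
    """Return each cue's share of all words in the message body."""
    words = re.findall(r"[a-z']+", body.lower())
    total = len(words) or 1  # avoid dividing by zero on an empty body
    return {
        cue: sum(w in vocab for w in words) / total
        for cue, vocab in CUE_WORDS.items()
    }
```

On their own these rates say little; they become useful when compared against a baseline for the same sender or peer group.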
Sender Deception -- Identification
 Writeprint features
    Lexical -- characters & words
    Function words
    Root words
    Syntactic -- sentences
    Structural -- paragraphs
    Content-specific
Sender Deception -- Identification
 Number of potential features unlimited
    Optimum number can vary by context and language
 Developing user profiles and comparing new emails to profiles would be challenging for real-time CA
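To make the profile comparison concrete, here is a minimal sketch that reduces a message to a handful of writeprint-style features and measures how far a new email sits from a stored profile. The feature choice, the short `FUNCTION_WORDS` list, and the use of Euclidean distance are all assumptions made for illustration, not the method described in the talk.

```python
import math
import re

FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "but")  # assumed short list

def writeprint(body):
    """Return a small numeric feature vector for one message."""
    words = re.findall(r"[A-Za-z']+", body)
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    paragraphs = [p for p in body.split("\n\n") if p.strip()]
    n_words = len(words) or 1
    lower = [w.lower() for w in words]
    features = [
        sum(len(w) for w in words) / n_words,     # lexical: mean word length
        len(set(lower)) / n_words,                # lexical: vocabulary richness
        len(words) / (len(sentences) or 1),       # syntactic: words per sentence
        len(sentences) / (len(paragraphs) or 1),  # structural: sentences per paragraph
    ]
    # Relative frequency of each function word.
    features += [lower.count(fw) / n_words for fw in FUNCTION_WORDS]
    return features

def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

A profile could be the average of `writeprint` vectors over a sender's past emails; a new message with a large `distance` from that profile would be a candidate for review. The features here are on different scales, so a real comparison would likely need normalization.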
Temporal/Log Analysis
Volume & Velocity
 Volume = number of emails a person sends and/or receives over a period of time.
 Velocity = how quickly the volume changes.
 Many external factors (e.g., vacations, seasonal activities, etc.) impact these numbers
 Need “rolling histogram”
Volume & Velocity
 Key issue -- determining the optimum time intervals to sample the data
 Continuous monitoring cannot be continuous in terms of sampling in real time
 Comparing hourly, daily, and even weekly volumes and velocities will result in many false positives
 Optimum time interval could vary by job title
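As a rough illustration of volume, velocity, and a rolling baseline, the sketch below counts one person's emails per day and flags days whose volume sits far above the preceding window. The daily interval, the 7-day `window`, and the z-score `threshold` are arbitrary assumptions; as the slide notes, the right interval is an open question and may differ by job title.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import mean, pstdev

def daily_volume(send_dates):
    """Count emails per calendar day, filling in zero for quiet days."""
    counts = Counter(send_dates)
    day, last = min(counts), max(counts)
    series = []
    while day <= last:
        series.append((day, counts.get(day, 0)))
        day += timedelta(days=1)
    return series

def flag_spikes(series, window=7, threshold=3.0):
    """Flag days whose volume exceeds the rolling mean by `threshold` std devs."""
    flagged = []
    for i in range(window, len(series)):
        history = [v for _, v in series[i - window:i]]
        mu, sigma = mean(history), pstdev(history)
        day, volume = series[i]
        velocity = volume - series[i - 1][1]  # change versus the previous day
        if sigma > 0 and (volume - mu) / sigma > threshold:
            flagged.append((day, volume, velocity))
    return flagged
```

Because the baseline rolls forward, a gradual seasonal rise is absorbed into the history, while an abrupt jump stands out.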
Social Network Analysis
 Social relationships as an undirected graph
 Importance of understanding relationships within the flow of email exchanges
Social Network Analysis in Emails
 Emails are semi-structured data:
    sender
    primary recipient(s)
    copied recipient(s)
    date
    subject line
 Social groups and cliques
 CA = who doesn’t belong?
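The header fields listed above are enough to build a simple social graph and ask "who doesn't belong?". The sketch below is a minimal, assumption-laden illustration: emails are represented as plain dictionaries with hypothetical keys `sender`, `to`, and `cc`, the graph is undirected as on the earlier slide, and the expected group membership is supplied by hand rather than detected automatically.

```python
from collections import defaultdict

def build_graph(emails):
    """Undirected weighted graph: edge weight = number of emails linking two people."""
    graph = defaultdict(lambda: defaultdict(int))
    for mail in emails:
        for other in mail["to"] + mail.get("cc", []):
            if other != mail["sender"]:
                graph[mail["sender"]][other] += 1
                graph[other][mail["sender"]] += 1
    return graph

def outsiders(graph, group):
    """Map each person outside `group` to the group members they exchange email with."""
    group = set(group)
    found = defaultdict(set)
    for member in group:
        for contact in graph.get(member, {}):
            if contact not in group:
                found[contact].add(member)
    return dict(found)
```

In a continuous-assurance setting, the group might be a project team or department, and any outsider who starts appearing in its traffic would be surfaced for review.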
Thread Analysis – This?
[Figure: one possible thread layout along a Time axis, with nodes labeled S, R, and C]
Thread Analysis – Or this?
[Figure: an alternative thread layout along a Time axis, with nodes labeled S, R, and C]
Integrating Content Analysis and Social Network Analysis
[Diagram: Email Data Mining, with Content Analysis (Key Word Queries, Deception Clues) and Log Analysis (Volume & Velocity, Social Network Analysis)]
Challenges of Email Mining
 Textual
    Inconsistent use of abbreviations
    Misspelled words
    Smileys, etc.
    Replies, replies, and more replies…
 Inability to identify:
    Identities of email participants
        [email protected]
    Roles and responsibilities
What Do the Enron Emails Show?
 People do say the darnedest things
 What did he know and when did he know it?
 Verified numerous bodies of email data mining research
    Content analysis
    Social network analysis
Tools
 Content monitoring
    eSoft Corporation’s ThreatWall
    Symantec’s Mail Security 8x00 Series
    Vericept Corporation’s Vericept Content 360º
    Reconnex Corporation’s iGuard Appliance
    InBoxer, Inc. Anti-Risk Appliance
 Social networks
    Microsoft SNARF
    Heer’s Vizster
Research Opportunities
Research Questions
 Role of email monitoring in overall CA environment?
    Join SNA with examination of textual patterns.
    Link SNA with control environment
    Frauds/control overrides footprint?
    What email cleaning is required for CA purposes?
    Privacy and policy issues?
    Lessons from existing commercial products?
Your Questions
Thank You