Transcript document

Detecting Phishing in Emails
Srikanth Palla
Ram Dantu
University of North Texas, Denton
What is Phishing?




Phishing is a form of online identity theft
Employs both social engineering and technical subterfuge
Targets consumers' personal identity data and financial account
credentials such as credit card numbers, account usernames,
passwords and social security numbers.
Social-engineering schemes use 'spoofed' e-mails to lead
consumers to counterfeit websites.
-Anti Phishing Working Group (APWG)
Phishing Tactics




Hijacking reputable brand
names
Creating a plausible premise
Redirecting URLs
Collecting confidential
information through emails
Do we need to restrict Phishing attacks?
The Statistics…
Sources: Anti Phishing Working Group
Problems with Current Spam
Filtering Techniques

Current spam filters focus on analyzing the content

Most phishers obfuscate their email content to
bypass email filters

Filters label an email as BULK and expect the recipient to
decide on the authenticity of the email source

Current spam filters have a high rate of false positives
Methodology
Our method examines:




The header of the email (not content)
The social network of the recipient
Credibility of the source
Classifies Phishers as:




Prospective Phishers
Recent Phishers
Suspects
Serial Phishers
Traffic Profile
The following Figure describes the incoming email traffic profiles
based on number of recipients and how often they receive the message.
[Figure: incoming email traffic profiles. Axes: frequency of emails arriving vs. number of recipients in an enterprise. Region labels: productivity gain, productivity loss, legitimate, optional, annoyance/counterfeit/nuisance. Traffic types shown: discussion threads, individual discussions, business discussions, professional discussions, telemarketing, phishing, news groups, professional/business announcements, personal, club invitations, strangers, good news.]
Email Corpus Traffic Profile




Our analysis requires the recipient's sent-email folder.
The emails provided in the TREC evaluation toolkit are spam and non-spam emails.
We require a mix of legitimate and phishing emails to evaluate our filter.
We analyzed a live corpus of 13,843 emails collected over 2.5 years. This corpus has a mix of legitimate, spam and phishing emails. The different categories of emails are shown in the figure.
Experimental Setup





We deployed our classifier on a recipient's local machine running an IMAP proxy and Thunderbird (MUA).
All of the recipient's emails were fed directly into our classifier by the proxy.
Our classifier periodically scans the user's mailbox files for new incoming emails.
DNS-based header analysis, social network analysis and wantedness analysis were performed on each email.
The end result is the tagging of each email as Phishing, Opt-outs, Socially distinct or Socially close.
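The periodic mailbox scan described above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: it assumes the MUA stores mail in mbox format, and the function name `scan_new_messages` and the `seen_ids` set are hypothetical.

```python
import mailbox


def scan_new_messages(mbox_path, seen_ids):
    """Return messages not seen in earlier scans and record them in seen_ids.

    Assumption: the mailbox file is in mbox format. Messages are identified
    by their Message-ID header, falling back to the mailbox key.
    """
    new_messages = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key, message in box.iteritems():
            identity = message.get("Message-ID") or f"key:{key}"
            if identity in seen_ids:
                continue  # already handed to the classifier in an earlier scan
            seen_ids.add(identity)
            new_messages.append(message)
    finally:
        box.close()
    return new_messages
```

Calling the function repeatedly with the same `seen_ids` set returns only the messages that arrived since the previous call, which is the behavior a periodic scanner needs.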
Architecture
The architecture model of our classifier consists of three
analyses followed by a classification step




Step 1: DNS-based header
analysis
Step 2: Social network analysis
Step 3: Wantedness analysis
Step 4: Classification
[Figure: classifier architecture. The mailbox feeds DNS-based header analysis, social network analysis and wantedness analysis, with user feedback. Classification based on wantedness and credibility yields four outputs: phishing, opt-outs, socially close, socially distinct.]
Step 1: DNS-based Header Analysis
Stage 1: We validate the information provided in the email
header: the hostname position of the sender, the mail server and the
relays in the rest of the path. We divide the entire corpus into two
buckets:
• Emails that are valid for DNS lookups (Bucket 1).
• Emails that are not valid for DNS lookups (Bucket 2).
Stage 2: We do a DNS lookup on each hostname
provided in the Received: lines of the header and match the IP
address returned against the IP address that the relays stored next to the
hostname during the SMTP authorization process.
Bucket 1 is further divided into:
• Trusted bucket.
• Untrusted bucket.
We pass Bucket 2 and both the trusted and untrusted buckets to the
Social Network Analysis phase for further analysis.
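A minimal sketch of the two stages above, under stated assumptions: an email counts as "valid for DNS lookups" here if at least one Received: line carries both a hostname and a bracketed IPv4 address, and it is "trusted" only if every such hostname resolves to the recorded address. The names `RECEIVED_RE`, `resolve_ipv4` and `classify_header` are hypothetical, and the exact validity criteria used in the study are not given in the slides.

```python
import re
import socket
from email import message_from_string

# Matches Received lines of the form "from host (... [192.0.2.1]) by ...".
RECEIVED_RE = re.compile(
    r"from\s+(?P<host>[A-Za-z0-9.-]+)\s.*?\[(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\]",
    re.S,
)


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses for hostname, empty on lookup failure."""
    try:
        return set(socket.gethostbyname_ex(hostname)[2])
    except (OSError, UnicodeError):
        return set()


def classify_header(raw_email, resolver=resolve_ipv4):
    """Return 'bucket2', 'trusted' or 'untrusted' for one raw email."""
    msg = message_from_string(raw_email)
    hops = []
    for line in msg.get_all("Received", []):
        match = RECEIVED_RE.search(line)
        if match:
            hops.append((match.group("host"), match.group("ip")))
    if not hops:
        return "bucket2"  # Stage 1: nothing usable for a DNS lookup
    for host, recorded_ip in hops:
        # Stage 2: the looked-up address must match the one the relay recorded.
        if recorded_ip not in resolver(host):
            return "untrusted"
    return "trusted"
```

The `resolver` parameter is injectable so the check can be exercised without network access, for example with a stub that returns a fixed address set.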
Step 2: Social Network Analysis
Each of the three buckets received from the DNS-based header analysis
(Bucket 2, the untrusted bucket and the trusted bucket) is
treated with rules formulated by analyzing the emails in the
receiver's “sent” folder.
For instance:
• All emails from trusted domains will be removed
• Familiarity with the sender's community
• Familiarity with the path traversed
The rules can be built according to the recipient's email filtering
preferences.
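One such rule can be sketched as follows, assuming that a domain is "trusted" when the recipient has previously sent mail to it. This is a simplification for illustration; the helper names `domains_from_sent` and `is_socially_trusted` are hypothetical and the slides do not specify how the rules are encoded.

```python
from email.utils import getaddresses


def domains_from_sent(recipient_headers):
    """Collect lower-cased domains from To/Cc header values of sent emails."""
    domains = set()
    for _name, address in getaddresses(recipient_headers):
        if "@" in address:
            domains.add(address.rsplit("@", 1)[1].lower())
    return domains


def is_socially_trusted(sender_address, trusted_domains):
    """True if the sender's domain is one the recipient has written to."""
    if "@" not in sender_address:
        return False
    return sender_address.rsplit("@", 1)[1].lower() in trusted_domains
```

For example, building the set from the header value `"Alice <alice@example.com>, bob@example.org"` marks later mail from any `example.com` or `example.org` address as socially trusted.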
Classification of Trusted and Untrusted Senders
[Figure: classification tree for the email corpus (size: 13843). DNS lookup valid (size: 13087) and DNS lookup invalid (size: 1875) are each split into socially trusted and socially untrusted. Untrusted emails (size: 563) are divided into phishers and opt-outs. Trusted emails (size: 13280) are divided into socially wanted and socially unwanted.]
Step 3: Wantedness Analysis
Measuring the sender's credibility (ρ):



We believe the credibility of a sender depends on the nature
of his recent emails
If the recent emails sent by the sender are legitimate, his
credibility increases
If the recent emails from the sender are fraudulent, his
fraudulency increases
Credibility Drops As Time Progresses for
Untrusted Senders
ρ i,n   ρ i, je  ΔT
Computing Credibility
$$\text{Belief } (\tau) = \frac{1}{\Delta T_{\text{legitimate emails}}} \tag{1}$$
$$\text{Disbelief } (\hat{\tau}) = \frac{1}{\Delta T_{\text{fraudulent emails}}} \tag{2}$$
$$\text{Credibility } (\rho) = \frac{\tau}{\hat{\tau}} = \frac{\Delta T_{\text{fraudulent emails}}}{\Delta T_{\text{legitimate emails}}} \tag{3}$$
$$\text{Fraudulency } (\hat{\rho}) = 1 - \rho \tag{4}$$
$\Delta T_{\text{legitimate emails}}$ is the average time period of all legitimate emails w.r.t. the most recent email
$\Delta T_{\text{fraudulent emails}}$ is the average time period of all fraudulent emails w.r.t. the most recent email
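The computation in equations (1) to (4) can be sketched as below. The sketch assumes that "average time period w.r.t. the most recent email" means the mean gap between each email's timestamp and the most recent email's timestamp; the helper names `mean_gap` and `credibility` are hypothetical.

```python
def mean_gap(timestamps, most_recent):
    """Average gap between each timestamp and the most recent email."""
    if not timestamps:
        raise ValueError("at least one timestamp is required")
    return sum(most_recent - t for t in timestamps) / len(timestamps)


def credibility(legitimate_times, fraudulent_times, most_recent):
    """Return (credibility, fraudulency) following equations (1) to (4)."""
    dt_legitimate = mean_gap(legitimate_times, most_recent)
    dt_fraudulent = mean_gap(fraudulent_times, most_recent)
    if dt_legitimate <= 0 or dt_fraudulent <= 0:
        raise ValueError("average gaps must be positive")
    belief = 1.0 / dt_legitimate      # equation (1)
    disbelief = 1.0 / dt_fraudulent   # equation (2)
    rho = belief / disbelief          # equation (3)
    return rho, 1.0 - rho             # equation (4)
```

For instance, with legitimate emails on days 2 and 6, one fraudulent email on day 7 and the most recent email on day 10, the average gaps are 6 and 3 days, so the credibility is 3/6 = 0.5 and the fraudulency is 0.5.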
Credibility of Untrusted Senders
[Figure: credibility of untrusted senders. Y-axis: credibility value (0 to 0.8); x-axis ticks run from 0 to 120. A threshold line is marked, with opt-outs (O1, O2, O3) and phishers labeled. Annotations: low credible domains, e.g. www.ebay.com, www.paypal.com etc.; high credible domains.]
Measuring Recipient’s Wantedness


Tolerance (α+) for a sender is higher if the recipient reads
and stores his emails for a longer period
Intolerance (β-) for a sender is higher if the recipient deletes
his emails without reading them
Measuring Wantedness
$$\text{Recipient's Tolerance } (\alpha^{+}) = T_{rd} \times \frac{1}{\Delta t_{\text{legitimate emails}}}$$
$$\text{Recipient's Intolerance } (\beta^{-}) = T_{urd} \times \frac{1}{\Delta t_{\text{fraudulent emails}}}$$
$$\text{Wantedness } (\chi_R) = \frac{\text{Tolerance } (\alpha^{+})}{\text{Intolerance } (\beta^{-})} = \frac{T_{rd} \times \Delta t_{\text{fraudulent emails}}}{T_{urd} \times \Delta t_{\text{legitimate emails}}}$$
$$\text{Unwantedness } (\gamma_R) = 1 - \chi_R$$
$\Delta t_{\text{legitimate emails}}$ is the average time period of all legitimate emails w.r.t. the most recent email
$\Delta t_{\text{fraudulent emails}}$ is the average time period of all fraudulent emails w.r.t. the most recent email
$T_{rd}$ is the average storage time period of all the read emails
$T_{urd}$ is the average storage time period of all unread emails
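The wantedness measure can be sketched as follows, reading tolerance as Trd divided by the average gap of legitimate emails and intolerance as Turd divided by the average gap of fraudulent emails. The function name `wantedness` and its parameter names are hypothetical, and all four inputs are assumed to be positive and in the same time unit.

```python
def wantedness(t_read, t_unread, dt_legitimate, dt_fraudulent):
    """Return (wantedness, unwantedness) for one sender.

    t_read / t_unread: average storage time of read / unread emails.
    dt_legitimate / dt_fraudulent: average gap of legitimate / fraudulent
    emails relative to the most recent email.
    """
    values = (t_read, t_unread, dt_legitimate, dt_fraudulent)
    if any(v <= 0 for v in values):
        raise ValueError("all inputs must be positive")
    tolerance = t_read / dt_legitimate      # alpha+
    intolerance = t_unread / dt_fraudulent  # beta-
    chi = tolerance / intolerance
    return chi, 1.0 - chi
```

As a worked example, with Trd = 4, Turd = 2, a legitimate gap of 8 and a fraudulent gap of 2, tolerance is 0.5, intolerance is 1.0, and the wantedness and unwantedness are both 0.5.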
Wantedness of Trusted Senders
Classification

Classification of Phishers:


Credibility Vs Phishing Frequency
Classification of Trusted Senders:

Credibility Vs Wantedness
Classification of Phishers
[Figure: classification of phishers. Y-axis: fraudulency (0 to 1); x-axis: phishing frequency (0 to 30). Region labels: High Risk, Recent Phishers, Prospective Phishers, Suspects, Phishers Under Review.]
Classification of Trusted Senders
[Figure: classification of trusted senders. Y-axis: credibility (0 to 1); x-axis: wantedness (0 to 1). Region labels: Socially Close, Socially Distinct, Opt-Ins, High Risk, Strangers. Annotations: spammers, phishers, telemarketers; family, friends etc.]
Summary of Results
Corpus     Analysis                                                                   # of emails     False Positives  False Negatives  Precision
Corpus-I   DNS Analysis                                                               11968           260              0                85%
Corpus-I   {[DNS Analysis] + [Social Network Analysis]}                               2548            3                5                95.6%
Corpus-I   {[DNS Analysis] + [Social Network Analysis] + [Wantedness Analysis]}       563 (Domains)   3                1                98.4%
Corpus-II  DNS Analysis                                                               756             5                0                90.4%
Corpus-II  {[DNS Analysis] + [Social Network Analysis]}                               59              0                0                93.75%
Corpus-II  {[DNS Analysis] + [Social Network Analysis] + [Wantedness Analysis]}       148             1                0                99.2%
Precision is the percentage of messages that were classified as phishing that actually are phishing
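The definition above corresponds to the usual ratio of true positives to all messages flagged as phishing. A minimal sketch, with a hypothetical function name and made-up counts that are not taken from the results table:

```python
def precision(true_positives, false_positives):
    """Fraction of messages flagged as phishing that actually are phishing."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no messages were classified as phishing")
    return true_positives / flagged
```

For example, if 100 messages are flagged and 95 of them really are phishing, `precision(95, 5)` gives 0.95, that is, 95%.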
Conclusions

Phishers use special software to conceal the path taken by their
emails to reach the recipient. Most of the time the path length is a
single hop.

Our classifier can be used in conjunction with any existing spam
filtering technique to restrict spam and phishing emails.

Rather than labeling an email as BULK, we further classify the sender,
based on credibility and wantedness, as:


Prospective phishers

Suspects

Recent phishers

Serial phishers
We classified two different email corpora with precisions of 98.4%
and 99.2%, respectively.