PhishDef & Weblog



Reporter: Jing Chiu
Advisor: Yuh-Jye Lee
Email: [email protected]
Data Mining and Machine Learning Lab.
2011/3/17
• Authors:
  - Anh Le, Athina Markopoulou (University of California, Irvine)
  - Michalis Faloutsos (University of California, Riverside)
• Source:
  - To appear in IEEE INFOCOM 2011 Mini Conference, Shanghai, China, April 10-15, 2011. (poster, tech report)
• Introduction
• Dataset and Feature Extraction
• Classification Algorithms
• Evaluation Results
• System Deployment
• Conclusion
• “How well can one detect phishing URLs using only lexical features compared to using full features?”
• PhishDef Properties:
  - High accuracy: 96%-97%
  - Light-weight: low latency, imposes a modest overhead
  - Proactive approach: as opposed to reactively relying on blacklists
  - Resilience to noise: 95%-86% accuracy when there is 5%-45% noise
• Dataset
  - Malicious URLs
    - PhishTank
    - MalwarePatrol
  - Legitimate URLs
    - Yahoo Directory
    - Open Directory (DMOZ)
• External Feature Collection
  - WHOIS
  - Team Cymru
• Feature Extraction
  - Automatically selected features
    - Delimiters: ‘/’, ‘?’, ‘.’, ‘=’, ‘_’, ‘&’ and ‘-’.
    - Four parts:
      - Domain Name
      - Directory
      - File Name
      - Argument
  - Obfuscation-resistant lexical features
    - Four different URL obfuscation techniques
    - Five categories of hand-selected lexical features
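The automatic feature step can be illustrated with a minimal sketch. It assumes the four parts map onto what Python's `urlsplit` returns and that each token becomes a binary bag-of-words feature; the function names `split_url`, `tokenize` and `bag_of_words` are hypothetical and not taken from the paper.

```python
import re
from urllib.parse import urlsplit

# Delimiters listed on the slide: / ? . = _ & -
DELIMS = re.compile(r"[/?.=_&-]")

def split_url(url):
    """Split a URL into domain name, directory, file name and argument."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    # Everything before the last '/' is the directory, the rest is the file name.
    directory, _, file_name = parts.path.rpartition("/")
    return {
        "domain": parts.netloc,
        "directory": directory,
        "file": file_name,
        "argument": parts.query,
    }

def tokenize(text):
    """Break one URL part into tokens at the delimiters, dropping empty tokens."""
    return [tok for tok in DELIMS.split(text) if tok]

def bag_of_words(url):
    """Assumed encoding: one binary feature per '<part>:<token>' pair."""
    return {
        f"{part}:{tok}"
        for part, text in split_url(url).items()
        for tok in tokenize(text)
    }
```

Prefixing each token with its part keeps, for example, `login` in the file name distinct from `login` in the domain name.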
• (I) Obfuscating the host with an IP address
• (II) Obfuscating the host with another domain
• (III) Obfuscating with large host names
• (IV) Domain unknown or misspelled

• Features related to the full URL
  - Length of the URL (Type II)
  - Number of dots in the URL (Type II)
  - Blacklisted words (Type IV)
    - confirm, account, banking, secure, ebayisapi, webscr, login and signin
    - Paypal, free, lucky and bonus
• Features related to the domain name
  - Length of the domain name (Type III)
  - IP or port number is used in the domain name (Type I)
  - Number of tokens of the domain name (Type III)
  - Number of hyphens used in the domain name (Type III)
  - The length of the longest token (Type III)
• Features related to the directory
  - Length of the directory (Type II)
  - Number of sub-directory tokens (Type II)
  - Length of the longest sub-directory token (Type II)
  - Maximum number of dots and other delimiters used in a sub-directory token (Type II)
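A rough sketch of how these three feature groups might be computed is given here. Several choices are assumptions rather than details from the slides: blacklisted words are counted as substring hits on the lower-cased URL, only IPv4 literals are recognised as IP hosts, and domain tokens are split on dots and hyphens. The function name `url_domain_dir_features` and the feature keys are hypothetical.

```python
import re
from urllib.parse import urlsplit

DELIMS = re.compile(r"[/?.=_&-]")
# Word list from the slide, lower-cased for matching.
BLACKLIST = ("confirm", "account", "banking", "secure", "ebayisapi", "webscr",
             "login", "signin", "paypal", "free", "lucky", "bonus")
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def url_domain_dir_features(url):
    parts = urlsplit(url)
    host = parts.hostname or ""                      # host without the port
    directory = parts.path.rpartition("/")[0]        # path up to the last '/'
    domain_tokens = [t for t in re.split(r"[.-]", host) if t]
    dir_tokens = [t for t in directory.split("/") if t]
    lowered = url.lower()
    return {
        # Full URL
        "url_length": len(url),
        "url_dots": url.count("."),
        "blacklisted_words": sum(w in lowered for w in BLACKLIST),
        # Domain name
        "domain_length": len(host),
        "domain_has_ip_or_port": bool(IPV4.match(host)) or parts.port is not None,
        "domain_tokens": len(domain_tokens),
        "domain_hyphens": host.count("-"),
        "domain_longest_token": max(map(len, domain_tokens), default=0),
        # Directory
        "dir_length": len(directory),
        "dir_tokens": len(dir_tokens),
        "dir_longest_token": max(map(len, dir_tokens), default=0),
        "dir_max_delims": max((len(DELIMS.findall(t)) for t in dir_tokens), default=0),
    }
```

For a URL such as `http://192.168.0.1:8080/paypal.com/secure-login/index.html`, the IP host triggers the Type I feature, while the brand name hidden in the directory shows up in the blacklisted-word count and the directory features.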


• Features related to the file name
  - Length of the file name (Type II)
  - Number of dots and other delimiters used in the file name (Type II)
• Features related to the argument part
  - Length of the argument part
  - Number of variables
  - Length of the longest variable value
  - The maximum number of delimiters used in a value
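The file-name and argument features can be sketched in the same way. This version assumes the argument part is a standard `name=value&...` query string and uses `parse_qsl`, which percent-decodes values before their lengths are measured; the function name `file_and_argument_features` and the feature keys are hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qsl

DELIMS = re.compile(r"[/?.=_&-]")

def file_and_argument_features(url):
    parts = urlsplit(url)
    file_name = parts.path.rpartition("/")[2]        # text after the last '/'
    # keep_blank_values so that "a=&b=1" still counts as two variables
    variables = parse_qsl(parts.query, keep_blank_values=True)
    values = [value for _, value in variables]
    return {
        "file_length": len(file_name),
        "file_delims": len(DELIMS.findall(file_name)),
        "arg_length": len(parts.query),
        "arg_variables": len(variables),
        "arg_longest_value": max(map(len, values), default=0),
        "arg_max_delims": max((len(DELIMS.findall(v)) for v in values), default=0),
    }
```

The `default=0` arguments keep the function usable on URLs that have no file name or no argument part.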


Summary of dataset
• Batch Learning
  - Support Vector Machine (SVM)
• Online Learning
  - Online Perceptron (OP)
  - Confidence Weighted (CW)
  - Adaptive Regularization of Weights (AROW)
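To make the online setting concrete, the sketch below shows one update step for the Online Perceptron and for AROW. It is an illustration, not the configuration used in the paper: the AROW variant keeps only a diagonal covariance (one variance per feature), the regularization parameter `r=1.0` is an arbitrary choice, and features are assumed to be sparse dictionaries with labels +1 (phishing) and -1 (legitimate).

```python
from collections import defaultdict

class Perceptron:
    """Online Perceptron: change the weights only when a URL is misclassified."""
    def __init__(self):
        self.w = defaultdict(float)

    def score(self, x):
        return sum(self.w.get(f, 0.0) * v for f, v in x.items())

    def update(self, x, y):
        if y * self.score(x) <= 0:                   # mistake (or zero score)
            for f, v in x.items():
                self.w[f] += y * v

class DiagonalAROW:
    """AROW with a diagonal covariance: mean weight and variance per feature."""
    def __init__(self, r=1.0):
        self.r = r
        self.w = defaultdict(float)                  # mean weights
        self.var = defaultdict(lambda: 1.0)          # variances, start at 1

    def score(self, x):
        return sum(self.w.get(f, 0.0) * v for f, v in x.items())

    def update(self, x, y):
        margin = y * self.score(x)
        if margin >= 1.0:                            # hinge loss is zero
            return
        confidence = sum(self.var[f] * v * v for f, v in x.items())
        beta = 1.0 / (confidence + self.r)
        alpha = (1.0 - margin) * beta
        for f, v in x.items():
            sigma_x = self.var[f] * v
            self.w[f] += alpha * y * sigma_x         # move the mean
            self.var[f] -= beta * sigma_x * sigma_x  # shrink the variance
```

Each URL is seen once and then discarded, which is what lets the online learners keep up with a stream of new URLs. In AROW, the shrinking variance means frequently seen features receive smaller updates over time.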

• Batch-based vs. Online algorithms
  - SVM vs. AROW
  - Yahoo-Phish
• Lexical Features vs. Full Features
  - OP, CW and AROW
  - Yahoo-Phish
• Obfuscation-Resistant Lexical Features
  - Performance of AROW with/without OR features after the last URL
• The resilience of AROW to noisy data
  - AROW and CW
  - Yahoo-Phish
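A noise experiment of this kind could be set up as sketched below, under the assumption that "noise" means a fraction of training labels being flipped; the slides do not spell out the procedure. The helper name `flip_labels` and its `seed` parameter are hypothetical.

```python
import random

def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of +1/-1 labels with a fraction `noise_rate` of them flipped."""
    rng = random.Random(seed)                        # fixed seed for repeatability
    noisy = list(labels)
    n_flip = round(noise_rate * len(noisy))
    # Pick distinct positions so exactly n_flip labels change.
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = -noisy[i]
    return noisy
```

Training on `flip_labels(labels, p)` for `p` between 0.05 and 0.45 while testing on clean labels would cover the 5%-45% noise range quoted earlier in the talk.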
• Minimum/Maximum URL Similarity Distance distribution
• Proposed PhishDef, a proactive defense scheme against phishing attacks
• PhishDef detects phishing URLs on-the-fly
• PhishDef uses only lexical features
  - High accuracy (97%)
  - Low overhead
  - Resilient to noisy training data
• Firefox and Chrome add-ons implementation
• Q&A