Anomaly based Web Phishing Page Detection
Reporter: Jing Chiu
Advisor: Yuh-Jye Lee
Email: [email protected]
2015/4/13
Data Mining & Machine Learning Lab
Paper Information
Authors:
Ying Pan
School of Information Systems, Singapore Management University
Xuhua Ding
School of Information Systems, Singapore Management University
Source
Annual Computer Security Applications Conference 2006 (ACSAC’06)
Outline
Introduction
Related Work
Analysis of Phishing Pages
Mechanism
Architecture
Identity Extractor
Page Classifier
Feature Vector Generation
Experiments
Experiments of Identity Extractor
Experiments of Page Classifier
Conclusion
Introduction
A common factor among all phishing sites
They maliciously mislead users into believing that they are other, legitimate sites
A phishing site maliciously claims a false identity
Proposed Method
Use web DOM object to obtain web identity
Use the web identity to capture phishing site anomalies
Related Work
Existing anti-phishing schemes
Server based schemes
Requiring server authentication to defend against phishing
attacks
Black listing services
Browser based schemes
Browsers regulate web pages’ visual behaviors to prevent cheating
Blacklist plug-ins in the browser
Proactive schemes
Detecting phishing pages based on visual similarity
Detecting phishing pages by phishing-related activity
Analysis of Phishing Pages
Web identity: a set of words that uniquely identifies the web site’s ownership in cyberspace
An abbreviation of the organization’s full name
A unique string appearing in its domain name
A phishing web site with its own identity A attempts to claim a false identity B
A list of characteristics of phishing pages
Based on study of about 300 phishing sites from APWG’s
repository
List I & List II
Mechanism
Architecture
Identity Extractor
Page Classifier
Feature Vector Generation
Architecture
Identity Extractor
Extract identity from DOM objects/properties
Title
Description
Copyright
ALT/title
Address
Body
Related DOM objects/properties
Extract the identity by the following steps
Form an identity-relevant object set D
Initialize a word set W from D as identity candidates
Use Chi-square to separate identity from ordinary words
Identity Extraction Algorithm (I, II)
Page Classifier
Support Vector Machine
LibSVM
Feature Vector Generation
Given the identity set I
10 features are extracted
Feature Vector Generation
Feature 1: URL address
F1 = 1 if no identity appears in the URL address
F1 = 0 if the page uses only an IP address that cannot be resolved into a host name
F1 = -1 otherwise
Feature 2: DNS record
F2 = -1 if all identities are substrings of DNS record R
F2 = 0 if no record returned
F2 = 1 otherwise
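The two rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names, the case-insensitive substring test, and the choice to check the unresolvable-IP case first are assumptions, and the DNS record is passed in as a plain string rather than looked up.

```python
def feature_url(hostname, identities, is_unresolved_ip=False):
    """F1 sketch: 0 for an unresolvable IP, -1 if an identity is in the host, else 1."""
    if is_unresolved_ip:
        # Assumption: the IP-only case takes precedence over the other two rules.
        return 0
    host = hostname.lower()
    if any(word.lower() in host for word in identities):
        return -1
    return 1


def feature_dns(dns_record, identities):
    """F2 sketch: 0 if no record, -1 if every identity is a substring of it, else 1."""
    if not dns_record:
        return 0
    record = dns_record.lower()
    if all(word.lower() in record for word in identities):
        return -1
    return 1
```

For example, a page claiming the identity "paypal" but hosted on evil.example.net would get F1 = 1 under this sketch.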
Feature Vector Generation (cont.)
Feature 3.1-3.3: URL of anchor
F31: Nil anchor (points to nothing)
F32: ID anchor (points to another domain that contains the identity)
F33: Domain anchor (points to a foreign domain)
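As a rough sketch of how anchors might be sorted into these three categories, the Python below classifies each href and counts the results. The slides do not say how the counts become feature values, so that step is left out. The names classify_anchor and anchor_counts, the extra "local" category, the treatment of "#" and "javascript:" links as nil, and the comparison of full host names (rather than registered domains) are all assumptions. ID and domain anchors are made mutually exclusive here for simplicity.

```python
from collections import Counter
from urllib.parse import urlparse


def classify_anchor(href, page_host, identities):
    """Return 'nil', 'id', 'domain', or 'local' for one anchor href."""
    target = (href or "").strip()
    if target in ("", "#") or target.lower().startswith("javascript:"):
        return "nil"  # points to nothing
    host = (urlparse(target).hostname or "").lower()
    if not host or host == page_host.lower():
        return "local"  # relative link or same host
    if any(word.lower() in host for word in identities):
        return "id"  # another host whose name contains the identity
    return "domain"  # any other foreign host


def anchor_counts(hrefs, page_host, identities):
    """Count anchors per category for one page."""
    return Counter(classify_anchor(h, page_host, identities) for h in hrefs)
```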
Feature Vector Generation (cont.)
Feature 4: Server form handler
F4 = 1 if any void or foreign form handler exists
F4 = 0 if no form
F4 = -1 otherwise
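A minimal sketch of the F4 rule, assuming the form action attributes have already been pulled from the page. The function name, the set of values treated as "void" (empty, "#", "about:blank"), and the host-name comparison used for "foreign" are assumptions, not details from the paper.

```python
from urllib.parse import urlparse


def feature_form_handler(form_actions, page_host):
    """F4 sketch; form_actions holds one action value per form (None if missing)."""
    if not form_actions:
        return 0  # no form on the page
    for action in form_actions:
        target = (action or "").strip()
        if target in ("", "#", "about:blank"):
            return 1  # void handler
        host = urlparse(target).hostname
        if host and host.lower() != page_host.lower():
            return 1  # handler on a foreign host
    return -1
```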
Feature 5.1-5.2: Request URL
F51: ID request URL (points to another domain that contains the identity)
F52: Domain request URL (points to a foreign domain)
Feature Vector Generation (cont.)
Feature 6: Domain in cookie
F6 = 1 if any foreign domain exists in cookie
F6 = 0 if no domain in cookies or no cookies
F6 = -1 otherwise
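The cookie rule could look roughly like the following. It assumes the Domain attribute values of the page's cookies have already been collected into a list (cookies without a Domain attribute are simply omitted), and it treats a cookie domain as non-foreign when the page host equals it or is a subdomain of it. The function name and that matching rule are assumptions.

```python
def feature_cookie_domain(cookie_domains, page_host):
    """F6 sketch: 0 if no cookie domains, 1 if any is foreign, else -1."""
    if not cookie_domains:
        return 0
    host = page_host.lower()
    for domain in cookie_domains:
        d = domain.lower().lstrip(".")
        if not (host == d or host.endswith("." + d)):
            return 1  # cookie scoped to a domain the page does not belong to
    return -1
```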
Feature 7: Certificate in SSL
F7 = 1 if one of the claimed identities does not appear in the certificate, or if the URL specified in the certificate is different from L
F7 = 0 if the SSL is not applied
F7 = -1 otherwise
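A hedged sketch of the F7 rule follows. The slides do not define L; the code assumes it refers to the page's own address and compares host names only. The function name and the idea of passing the certificate's distinguished-name text and host as plain strings are also assumptions, and no real certificate parsing is done.

```python
def feature_certificate(cert_text, cert_host, page_host, identities):
    """F7 sketch; cert_text is None when the page is not served over SSL."""
    if cert_text is None:
        return 0  # SSL is not applied
    text = cert_text.lower()
    if any(word.lower() not in text for word in identities):
        return 1  # a claimed identity is missing from the certificate
    if cert_host.lower() != page_host.lower():
        return 1  # certificate issued for a different host (assumed meaning of L)
    return -1
```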
Experiments
Dataset
279 phishing pages vs. 100 official pages
The 279 attacks have only 49 distinct targets
Experiments of Identity Extractor
Results for three web pages
Success rate
Experiments of Page Classifier
Dataset
Training set size: 50 positive + 50 negative
Testing set size: 50 pages
Positive portions: 2%, 6%, 10%, 20%, 30%, 40%, 50%
Use FP rate and miss rate (FN rate) as measurement
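For reference, the two measurements can be computed as below. The sketch uses the usual definitions (FP rate = false positives over all legitimate pages, miss rate = false negatives over all phishing pages); the slides do not spell these out, so treat the definitions and the function name as assumptions.

```python
def error_rates(labels, predictions):
    """Return (fp_rate, miss_rate); True means 'phishing' in both sequences."""
    pairs = list(zip(labels, predictions))
    false_pos = sum(1 for y, p in pairs if not y and p)
    false_neg = sum(1 for y, p in pairs if y and not p)
    negatives = sum(1 for y, _ in pairs if not y)
    positives = len(pairs) - negatives
    fp_rate = false_pos / negatives if negatives else 0.0
    miss_rate = false_neg / positives if positives else 0.0
    return fp_rate, miss_rate
```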
Exp. of Identity Extractor
Identity Extraction Results of Three Web Pages
Success Rate (λ) of the Identity Extractor
λ = n / N
N is the total number of pages
n is the number of pages with a correctly extracted identity
Exp. of Page Classifier
Exp. of Page Classifier (cont.)
Conclusion
The benefits
Does not require online interaction with a third party
Does not require users to change their navigation behavior
Resistant to adaptive phishing attackers
Complete evasion of this scheme imposes a high cost on the attacker
Characteristics of Phishing Pages I
Disguised Keyword/Description
A phishing page uses the fake identity to pose as a normal site
Abnormal URL
The hostname in the URL, or resolved from the IP, does not match the claimed identity
Abnormal DNS record
DNS usually contains identity information
Abnormal Anchors
The domains of anchors’ URLs differ from the page’s domain, yet contain the claimed identity
Anchors do not link to any page
Characteristics of Phishing Pages II
Abnormal Server Form Handler
The form has no action, or the action is handled by a server in a different domain
Abnormal request URL
A phishing site usually has objects that reference the real site
Abnormal cookie
A phishing site’s cookies point either to its own domain (inconsistent with the claimed identity) or to the real site (inconsistent with its own domain)
Abnormal certificate in SSL
The Distinguished Names in the certificates are inconsistent
with the claimed identities
Identity Extraction Algorithm
Input: Web page P; Output: Identity set I
Construction of object set D
From the related DOM objects/properties
Construction of word set W
Tokenize by stop marks, remove stop words, and apply stemming
Remove from D every object d that consists only of stop words
Calculation of the occurrences Cw,d
Supplement of body object
Calculation of term frequency
Identity Extraction Algorithm (cont.)
Calculation of expected probability
Where
Calculation of the χ² value
Output the identity set with the largest χ² value
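The exact formulas did not survive in this transcript, so the following is only a generic illustration of chi-square word scoring, not the paper's algorithm. It assumes the standard Pearson statistic: for each word w and object d, the observed count C(w,d) is compared with the count expected if w were spread across objects in proportion to their size. The function name and the input shape (a dict from object name to an already tokenized word list) are hypothetical.

```python
from collections import Counter


def chi_square_scores(objects):
    """Map each word to its Pearson chi-square score across the given objects."""
    counts = {name: Counter(tokens) for name, tokens in objects.items()}
    col_total = {name: sum(c.values()) for name, c in counts.items()}
    grand = sum(col_total.values())
    words = set().union(*counts.values())
    scores = {}
    for w in words:
        row_total = sum(c[w] for c in counts.values())
        chi2 = 0.0
        for name, c in counts.items():
            expected = row_total * col_total[name] / grand
            if expected > 0:
                chi2 += (c[w] - expected) ** 2 / expected
        scores[w] = chi2
    return scores
```

Under this sketch, a word concentrated in one short object such as the title scores higher than a word spread evenly over title and body, which matches the stated goal of separating identity words from ordinary words.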
Related DOM objects/properties