Anomaly based Web Phishing Page Detection


Reporter: Jing Chiu
Advisor: Yuh-Jye Lee
2015/4/13
Data Mining & Machine Learning Lab
Paper Information
• Authors:
  - Ying Pan, School of Information Systems, Singapore Management University
  - Xuhua Ding, School of Information Systems, Singapore Management University
• Source:
  - Annual Computer Security Applications Conference 2006 (ACSAC'06)
Outline
• Introduction
• Related Work
• Analysis of Phishing Pages
• Mechanism
  - Architecture
  - Identity Extractor
  - Page Classifier
  - Feature Vector Generation
• Experiments
  - Experiments of Identity Extractor
  - Experiments of Page Classifier
• Conclusion
Introduction
• A common factor among all phishing sites
  - They maliciously mislead users into believing that they are other, legitimate sites
  - A phishing site maliciously claims a false identity
• Proposed method
  - Use web DOM objects to obtain the web identity
  - Use the web identity to capture phishing-site anomalies
Related Work
• Existing anti-phishing schemes
  - Server-based schemes
    - Require server authentication to defend against phishing attacks
    - Blacklisting services
  - Browser-based schemes
    - The browser regulates web pages' visual behavior to prevent cheating
    - Blacklist plug-ins in the browser
  - Proactive schemes
    - Detect phishing pages based on visual similarity
    - Detect phishing pages by phishing-related activity
Analysis of Phishing Pages
• Web identity: a set of words that uniquely identifies the web site's ownership in cyberspace
  - An abbreviation of the organization's full name
  - A unique string appearing in its domain name
• A phishing web site with its own identity A attempts to claim a false identity B
• A list of characteristics of phishing pages
  - Based on a study of about 300 phishing sites from APWG's repository
  - See Characteristics of Phishing Pages I and II
Mechanism
• Architecture
• Identity Extractor
• Page Classifier
• Feature Vector Generation
Architecture
Identity Extractor
• Extract the identity from DOM objects/properties
  - Title
  - Description
  - Copyright
  - ALT/title
  - Address
  - Body
  - Related DOM objects/properties
• Extract the identity in the following steps
  - Form an identity-relevant object set D
  - Initialize a word set W from D as identity candidates
  - Use the chi-square test to separate identity words from ordinary words
  - Identity Extraction Algorithm (I, II)
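As a rough illustration of forming the object set D, the sketch below collects text from the DOM objects listed above using only Python's standard `html.parser`. The class name `IdentityObjectCollector`, the helper `extract_identity_objects`, and the rule for spotting copyright text (a line containing "©" or the word "copyright") are assumptions made for this example, not details taken from the paper.

```python
from html.parser import HTMLParser


class IdentityObjectCollector(HTMLParser):
    """Collect text from identity-relevant DOM objects (a stand-in for the set D)."""

    TRACKED_TAGS = ("title", "address", "body")

    def __init__(self):
        super().__init__()
        self.objects = {"title": [], "description": [], "copyright": [],
                        "alt_title": [], "address": [], "body": []}
        self._open = []  # tracked tags that are currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <meta name="description" content="..."> feeds the Description object.
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.objects["description"].append(attrs.get("content") or "")
        # ALT and title attributes of any element feed the ALT/title object.
        for key in ("alt", "title"):
            if attrs.get(key):
                self.objects["alt_title"].append(attrs[key])
        if tag in self.TRACKED_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for tag in set(self._open):
            self.objects[tag].append(text)
        # Assumed heuristic: treat any text mentioning copyright as the Copyright object.
        if "©" in text or "copyright" in text.lower():
            self.objects["copyright"].append(text)


def extract_identity_objects(html):
    """Return a dict mapping object name to the list of text fragments found."""
    collector = IdentityObjectCollector()
    collector.feed(html)
    return collector.objects
```

A real extractor would also need to skip script and style content and handle malformed markup, which this sketch does not attempt.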
Page Classifier
• Support Vector Machine
  - LIBSVM
• Feature Vector Generation
  - Given the identity set I
  - 10 features are extracted
Feature Vector Generation
• Feature 1: URL address
  - F1 = 1 if no identity appears in the URL address
  - F1 = 0 if the page uses only an IP address that cannot be resolved into a host name
  - F1 = -1 otherwise
• Feature 2: DNS record
  - F2 = -1 if all identities are substrings of the DNS record R
  - F2 = 0 if no record is returned
  - F2 = 1 otherwise
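The two rules above can be sketched as small pure functions. This is a minimal sketch under stated assumptions: "no identity in the URL" is read as "none of the identity words appears", the DNS record is passed in as a plain string instead of being looked up, and the names `feature_url`, `feature_dns`, and `resolved_host` are hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def _is_ip(host):
    """Return True if host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def feature_url(url, identities, resolved_host=None):
    """F1: 1 if no identity is in the URL, 0 for an unresolvable IP, -1 otherwise."""
    host = (urlparse(url).hostname or "").lower()
    if _is_ip(host):
        if resolved_host is None:
            return 0  # IP-only URL that could not be resolved to a host name
        text = resolved_host.lower()
    else:
        text = url.lower()
    return -1 if any(word.lower() in text for word in identities) else 1


def feature_dns(identities, dns_record):
    """F2: -1 if every identity is a substring of the record, 0 if no record, 1 otherwise."""
    if not dns_record:
        return 0
    record = dns_record.lower()
    return -1 if all(word.lower() in record for word in identities) else 1
```

For example, `feature_url("http://192.0.2.7/login", {"examplebank"})` yields 0 because the host is a bare IP address with no resolved name supplied.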
Feature Vector Generation (cont.)
• Features 3.1-3.3: URLs of anchors
  - F31: Nil anchor (points to nothing)
  - F32: ID anchor (points to another domain that contains the identity)
  - F33: Domain anchor (points to a foreign domain)
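The slide names the three anchor categories but not how each feature is valued, so the sketch below simply reports the fraction of anchors in each category. That valuation is an assumption, as are the function name `anchor_features`, the treatment of empty, `#`, and `javascript:` links as nil anchors, the comparison of full host names in place of registered domains, and the choice to count an ID anchor as a domain anchor too.

```python
from urllib.parse import urljoin, urlparse


def anchor_features(page_url, hrefs, identities):
    """Return (nil, id, domain) anchor fractions for a page's href values."""
    page_host = (urlparse(page_url).hostname or "").lower()
    nil_count = id_count = foreign_count = 0
    for href in hrefs:
        href = (href or "").strip()
        # Assumed definition of a nil anchor: it navigates nowhere.
        if href in ("", "#") or href.lower().startswith("javascript:"):
            nil_count += 1
            continue
        # Resolve relative links against the page URL before comparing hosts.
        host = (urlparse(urljoin(page_url, href)).hostname or "").lower()
        if host and host != page_host:
            foreign_count += 1
            if any(word.lower() in host for word in identities):
                id_count += 1
    total = len(hrefs)
    if total == 0:
        return (0.0, 0.0, 0.0)
    return (nil_count / total, id_count / total, foreign_count / total)
```

A page hosted on one domain whose anchors mostly lead to a different domain carrying the claimed identity would score high on the second and third values.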
Feature Vector Generation (cont.)
• Feature 4: Server form handler
  - F4 = 1 if any void or foreign form handler exists
  - F4 = 0 if the page has no form
  - F4 = -1 otherwise
• Features 5.1-5.2: Request URL
  - F51: ID request URL (points to another domain that contains the identity)
  - F52: Domain request URL (points to a foreign domain)
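A minimal sketch of the Feature 4 rule follows, taking the `action` attribute of each form on the page as input. The function name `feature_form_handler`, the set of values treated as a void handler (missing, empty, `#`, `about:blank`, or a `javascript:` URL), and the host-name comparison used to decide "foreign" are assumptions for illustration.

```python
from urllib.parse import urljoin, urlparse


def feature_form_handler(page_url, form_actions):
    """F4: 1 if any form handler is void or foreign, 0 if no form, -1 otherwise."""
    if not form_actions:
        return 0  # the page contains no form
    page_host = (urlparse(page_url).hostname or "").lower()
    for action in form_actions:
        action = (action or "").strip()
        # Assumed definition of a void handler.
        if action in ("", "#", "about:blank") or action.lower().startswith("javascript:"):
            return 1
        # A handler is foreign when it resolves to a different host than the page.
        host = (urlparse(urljoin(page_url, action)).hostname or "").lower()
        if host and host != page_host:
            return 1
    return -1
```

Each element of `form_actions` is the raw attribute value, with `None` standing for a form that has no `action` attribute at all.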
Feature Vector Generation (cont.)
• Feature 6: Domain in cookie
  - F6 = 1 if any foreign domain exists in a cookie
  - F6 = 0 if the cookies contain no domain or there are no cookies
  - F6 = -1 otherwise
• Feature 7: Certificate in SSL
  - F7 = 1 if one of the claimed identities does not appear in the certificate, or the URL specified in the certificate is different from L
  - F7 = 0 if SSL is not used
  - F7 = -1 otherwise
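The Feature 6 rule could be sketched as below, given the `Domain` attribute of each cookie the page sets. The function name `feature_cookie_domain` is hypothetical, and "foreign" is assumed to mean that the page's host name neither equals the cookie domain nor is a subdomain of it.

```python
from urllib.parse import urlparse


def feature_cookie_domain(page_url, cookie_domains):
    """F6: 1 if any cookie domain is foreign, 0 if none is given, -1 otherwise."""
    page_host = (urlparse(page_url).hostname or "").lower()
    # Drop cookies without a Domain attribute and normalise a leading dot.
    domains = [d.lstrip(".").lower() for d in cookie_domains if d]
    if not domains:
        return 0
    for domain in domains:
        same_site = page_host == domain or page_host.endswith("." + domain)
        if not same_site:
            return 1
    return -1
```

Passing `None` for a cookie models a cookie set without an explicit domain, which falls under the F6 = 0 case when no other cookie names one.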
Experiments
• Dataset
  - 279 phishing pages vs. 100 official pages
  - The 279 attacks have only 49 distinct targets
• Experiments of Identity Extractor
  - Results for three web pages
  - Success rate
• Experiments of Page Classifier
  - Dataset
    - Training set size: 50 positive + 50 negative
    - Testing set size: 50 pages
    - Positive portions: 2%, 6%, 10%, 20%, 30%, 40%, 50%
  - FP rate and miss rate (FN rate) are used as measurements
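The two measurements can be computed from true and predicted labels as sketched here. The sketch assumes the usual definitions (FP rate = FP / (FP + TN), miss rate = FN / (FN + TP)) and assumes that label 1 marks a phishing (positive) page and -1 a legitimate one; the function name `fp_and_miss_rate` is hypothetical.

```python
def fp_and_miss_rate(y_true, y_pred):
    """Return (false-positive rate, miss rate) for labels in {1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    # Guard against empty classes so the rates stay defined.
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    miss_rate = fn / (fn + tp) if (fn + tp) else 0.0
    return fp_rate, miss_rate
```

With a 50-page test set at a 2% positive portion there is a single phishing page, so the miss rate can only be 0 or 1 in that setting.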
Exp. of Identity Extractor
• Identity extraction results for three web pages
• Success rate (λ) of the Identity Extractor: λ = n / N
  - N is the total number of pages
  - n is the number of pages whose identity is extracted correctly
Exp. of Page Classifier
Exp. of Page Classifier (cont.)
Conclusion
• The benefits
  - Requires no online interaction with a third party
  - Requires no change in users' navigation behavior
  - Resistant to adaptive phishing attackers
    - Completely evading the scheme imposes a high cost on the attacker
Characteristics of Phishing Pages I
• Disguised keyword/description
  - A phishing page uses the fake identity to pose as a legitimate site
• Abnormal URL
  - The host name in the URL, or the one resolved from the IP address, does not match the claimed identity
• Abnormal DNS record
  - A DNS record usually contains identity information
• Abnormal anchors
  - The domains of the anchors' URLs differ from the page's domain, and these domains contain the claimed identity
  - Anchors do not link to any page
Characteristics of Phishing Pages II
• Abnormal server form handler
  - The form has no action, or the action is handled by a server in a different domain
• Abnormal request URL
  - A phishing site usually references objects on the real site
• Abnormal cookie
  - A phishing site's cookies point either to its own domain (inconsistent with the claimed identity) or to the real site (inconsistent with its own domain)
• Abnormal certificate in SSL
  - The Distinguished Names in the certificates are inconsistent with the claimed identities
Identity Extraction Algorithm
• Input: web page P; output: identity set I
• Construction of the object set D
  - From the related DOM objects/properties
• Construction of the word set W
  - Tokenize by stop marks, remove stop words, and apply stemming
  - Remove from D every object d that contains only stop words
• Calculation of the occurrences C_{w,d}
• Supplement of the body object
• Calculation of term frequency
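The word-set and counting steps can be illustrated with the standard library alone. This sketch rests on several assumptions: the stop-word list is a tiny made-up sample, stemming is omitted, objects left empty after stop-word removal are dropped, and term frequency is taken to be a word's total count divided by the total number of retained tokens, since the slide does not give the formula. The names `tokenize`, `count_occurrences`, and `term_frequency` are hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word sample only; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "is"}


def tokenize(text):
    """Lower-case, split on non-alphanumeric characters, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def count_occurrences(objects):
    """Map each object name d to a Counter of word counts, i.e. C[d][w]."""
    counts = {}
    for name, text in objects.items():
        tokens = tokenize(text)
        if tokens:  # objects consisting only of stop words are removed
            counts[name] = Counter(tokens)
    return counts


def term_frequency(counts):
    """Assumed definition: total count of w divided by the total token count."""
    totals = Counter()
    for counter in counts.values():
        totals.update(counter)
    n = sum(totals.values())
    return {word: c / n for word, c in totals.items()} if n else {}
```

Here `objects` maps an object name such as `"title"` to its text, and the nested counters play the role of the occurrences C_{w,d}.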
Identity Extraction Algorithm (cont.)
• Calculation of expected probability
• Calculation of the χ² value
• Output the identity set with the largest χ² value
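The formulas for the expected probability are not given in this transcript, so the following is only a generic Pearson chi-square sketch and should not be read as the paper's exact computation. It assumes the expected count of word w in object d is the word's total count times the share of all tokens that fall in d, and it scores each word by the sum over d of (observed - expected)² / expected. The names `chi_square_scores` and `top_words` are hypothetical.

```python
def chi_square_scores(counts):
    """Score each word by a Pearson chi-square over objects.

    counts maps object name d -> {word w: observed count of w in d}.
    """
    object_sizes = {d: sum(c.values()) for d, c in counts.items()}
    total = sum(object_sizes.values())
    if total == 0:
        return {}
    words = {w for c in counts.values() for w in c}
    scores = {}
    for w in words:
        word_total = sum(c.get(w, 0) for c in counts.values())
        score = 0.0
        for d, c in counts.items():
            # Assumed expectation: w is spread in proportion to object size.
            expected = word_total * object_sizes[d] / total
            if expected > 0:
                score += (c.get(w, 0) - expected) ** 2 / expected
        scores[w] = score
    return scores


def top_words(counts, k):
    """Return the k words with the largest chi-square score."""
    scores = chi_square_scores(counts)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

Under this assumed model, a word whose distribution across objects departs most from a size-proportional spread receives the largest score.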
Related DOM objects/properties