A Supervised Machine Learning Approach to Classify Host Roles


Network Security Monitoring and Analysis
based on Big Data Technologies
Bingdong Li
August 26, 2013
Outline
Motivation
Objectives
System Design
Monitoring and Visualization
Network Measurement
Classification and Identification of Network Objects
Conclusion
Future Work
Motivation
▪ Traditional security systems assume a static system
▪ Network attacks
– sophisticated
– organized
– targeted
– persistent
– dynamic
– external
– internal
Motivation
▪ Problem: Network Security is becoming more challenging
▪ Resource: A Large Amount of Security Data
– Network flow
– Firewall log
– Application log
– Server log
– SNMP
▪ Opportunity: Big Data Technologies, Machine Learning
Objectives
▪ A network security monitoring and analysis system based on Big Data technologies that provides
– Network measurement
– Real-time continuous monitoring and interactive visualization
– Intelligent network object classification and identification based on role behavior as context
Objectives
[Diagram: Network Security, Big Data, Machine Learning]
System Design
▪ Data Collection
System Design
▪ Online Real Time Process
System Design
▪ NoSQL Storage
System Design
▪ User Interfaces
System Design
▪ The design supports the following features:
– Real Time Continuous Monitoring and Interactive Visualization
– Network Measurement
– Classification and Identification of Network Objects
Monitoring and Visualization
▪ Real time: response within a time constraint
▪ Interactive: involves user interaction
▪ Continuous: “continue to be effective over time in light of the inevitable changes that occur” (NIST)
Monitoring and Visualization
▪ Retrieve Data
▪ Web User Interfaces
▪ Video Demo
Monitoring and Visualization
▪ Data Retrieval:
Data are stored with the IP address as the primary key and the time slice as the secondary (column) key
Accessing these data takes Θ(1) time
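A minimal sketch of the key layout described above, assuming an in-memory dictionary stands in for the NoSQL store. The class name FlowStore, the 60-second slice length, and the record fields are hypothetical and do not come from the slides. A lookup by (IP, time slice) is two hash lookups, so its expected cost does not grow with the number of stored hosts or slices.

```python
from collections import defaultdict


class FlowStore:
    """In-memory stand-in for a table keyed by IP (row) and time slice (column)."""

    def __init__(self, slice_seconds=60):
        self.slice_seconds = slice_seconds  # assumed slice length
        # row key (IP) -> column key (time slice start) -> list of records
        self._rows = defaultdict(dict)

    def _slice_of(self, timestamp):
        # Round the timestamp down to the start of its time slice.
        ts = int(timestamp)
        return ts - ts % self.slice_seconds

    def put(self, ip, timestamp, record):
        self._rows[ip].setdefault(self._slice_of(timestamp), []).append(record)

    def get(self, ip, timestamp):
        # Two hash lookups, independent of how many IPs or slices are stored.
        return self._rows.get(ip, {}).get(self._slice_of(timestamp), [])


store = FlowStore(slice_seconds=60)
store.put("10.0.0.5", 1377500410, {"dst_port": 80, "bytes": 1200})
store.put("10.0.0.5", 1377500430, {"dst_port": 443, "bytes": 640})
same_slice = store.get("10.0.0.5", 1377500415)   # both records share one slice
next_slice = store.get("10.0.0.5", 1377500470)   # nothing stored in this slice
print(len(same_slice), len(next_slice))
```

A real column-oriented store would persist the same two-level key on disk; the sketch only illustrates why the lookup does not require a scan.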
Real Time Querying
Host Network Connection
Network Status
Top N
Demo of Interactivity and Continuity
Video Demo
Network Measurement
▪ A case study: anonymity technology usage on the campus network, using sFlow
– Geo-Location
– Usage of Anonymity Systems
Geo-location of Anonymity Usage on Campus
One Instance: Bahamas, Belarus,
Belgium, Bulgaria, Cambodia, Chile,
Colombia, Estonia, Ghana, Greece,
Hungary, Ireland, Israel, Jamaica,
Jordan, Korea, Mongolia, Namibia,
Nigeria, Pakistan, Panama, Philippines,
Slovakia, Turkey, Ukraine, Vietnam,
Zimbabwe
Two Instances: Chad, Czech Republic,
Denmark, Hong Kong, Iran, Japan,
Kazakhstan, Poland, Romania, Spain,
Switzerland
Three Instances: Austria, France,
Singapore
Four Instances: Australia,
Indonesia, Taiwan, Thailand
Usage of Anonymity Systems
System       Packets (%)      Traffic (MB %)    Observed IPs (%)
Proxies      5,580 (62.65)    8.13 (43.53)      234 (3.23)
Tor          3,129 (35.13)    9.04 (48.37)      152 (0.25)
I2P          190 (2.13)       1.50 (8.02)       23 (1.01)
Commercial   7 (0.08)         0.016 (0.08)      2 (N/A)
Total        8,906 (100)      16.69 (100)       411 (N/A)
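The packet percentages in the table can be reproduced from the raw packet counts. The short check below uses the four counts reported on the slide; the function name shares is illustrative only.

```python
# Packet counts per anonymity system, as reported in the table.
packets = {"Proxies": 5580, "Tor": 3129, "I2P": 190, "Commercial": 7}


def shares(counts):
    """Return the total and each entry's share of it, in percent (2 decimals)."""
    total = sum(counts.values())
    return total, {name: round(100 * n / total, 2) for name, n in counts.items()}


total, pct = shares(packets)
for name, value in pct.items():
    print(f"{name}: {packets[name]} packets ({value}%)")
print("Total:", total)
```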
Classification of Host Roles
Data: Three months of sFlow data from a large campus
Role                Count
Client              5494
Server              1920
Public Place        784
Personal Office     416
College1            163
College2            253
Web Server          56
Web Email Server    25
Classification of Host Roles
▪ Algorithms
– Decision Tree
– On-line SVM
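The slides do not say which on-line SVM variant was used. As one possible illustration, the sketch below trains a linear SVM one example at a time with stochastic gradient descent on the hinge loss. The class name, learning rate, regularization strength, and the two toy features are all assumptions made for this example, and the decision tree is not shown.

```python
class OnlineLinearSVM:
    """Linear SVM updated one example at a time (hinge-loss SGD)."""

    def __init__(self, n_features, learning_rate=0.1, reg=0.001):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.learning_rate = learning_rate  # assumed value
        self.reg = reg                      # assumed L2 regularization strength

    def decision(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def partial_fit(self, x, y):
        """Update the model with one example; y must be +1 or -1."""
        violated = y * self.decision(x) < 1.0
        # L2 regularization shrinks the weights slightly on every step.
        shrink = 1.0 - self.learning_rate * self.reg
        self.w = [shrink * wi for wi in self.w]
        if violated:
            # Hinge-loss gradient step for an example inside the margin.
            self.w = [wi + self.learning_rate * y * xi
                      for wi, xi in zip(self.w, x)]
            self.b += self.learning_rate * y

    def predict(self, x):
        return 1 if self.decision(x) >= 0.0 else -1


# Hypothetical 2-feature hosts, already scaled to [0, 1]: +1 = server, -1 = client.
stream = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], -1), ([0.2, 0.7], -1)]
svm = OnlineLinearSVM(n_features=2)
for _ in range(200):          # replay the stream to mimic a long-running feed
    for x, y in stream:
        svm.partial_fit(x, y)
predictions = [svm.predict(x) for x, _ in stream]
print(predictions)
```

Because each update touches only the current example, the model can follow a flow stream without storing past records, which is the usual reason for choosing an on-line learner.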
Classification of Host Roles
▪ Features
– Ad hoc, based on domain knowledge
– Features are aggregated for on-line classification
– 24 features, each normalized to the range [0, 1]
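The slides do not state how the features were scaled. Min-max scaling is one common way to map a feature into [0, 1]; the sketch below assumes that method, and the sample packet sizes are made up for illustration.

```python
def min_max_normalize(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


packet_sizes = [64, 512, 1500, 782]        # hypothetical raw feature values
normalized = min_max_normalize(packet_sizes)
print(normalized)
```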
Classification of Host Roles
▪ Features
24 features derived from
– Source/destination IP address
– Source/destination port number
– TTL
– Packet size
– Transport protocol
Classification of Host Roles
▪ Ground Truth
– Host information in Active Directory
– Crawler to validate its status
Classification of Host Roles
▪ Classifying Client vs. Server
▪ Classifying Web Server vs. Web Email Server
▪ Classifying Hosts at Personal Office vs. Public Place
▪ Classifying Hosts at Two Different Colleges
▪ Feature Contributions
Classifying Client vs. Server
Classifying Web Server vs. Web Email Server
Classifying Host From Personal Office vs. Public Place
Classifying Host From Two Different Colleges
Accuracy
▪ High accuracies of Host Role Classification
Classification                                  Accuracy (%)
Clients vs. Server                              99.2
Regular web server vs. Web email server         100
Hosts from personal office vs. public places    93.3
Hosts from two different colleges               93.3
Feature Contribution
Identification of a User
Data: NetFlow data from a large campus
         College1    College2
Count    163         253
Identification of a User
▪ Algorithms
– Decision Tree
– On-line SVM
▪ Ground Truth
– Host information in Active Directory
– Crawler to validate its status
Identification of a User
▪ Features
Discrete probability distribution function (pdf)
An Example:
System port numbers [6, 8, 9, 11, 14, 30, 80, 1020]
– Outlier fraction (P) is 1%
– 80 is the port of interest (S)
– Number of bins (R) is 4
Identification of a User
▪ An Example
Outlier cutoff: (1 - 0.01) * 8 = 7.92, rounded down to 7; the 7th value is 80
Bin slice size = 80 / (4 - 1) ≈ 26.7
[6, 8, 9, 11, 14, 30, 80, 1020]
Bin contents       pdf
6, 8, 9, 11, 14    0.625
30                 0.125
80                 0.125
1020               0.125
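The worked example above can be reproduced with a few lines of code. This is a sketch of one reading of the slide: the cutoff is the value at rank floor((1 - P) * n), the first R - 1 bins split the range from 0 to that cutoff evenly, and the last bin collects the outliers. The function and parameter names are hypothetical, and the role of S is not spelled out on the slide, so it is left out here.

```python
import math


def binned_pdf(values, outlier_fraction, n_bins):
    """Discrete pdf with n_bins - 1 regular bins plus one outlier bin.

    Assumes positive values and at least one non-outlier value.
    """
    ordered = sorted(values)
    # 1-based rank of the largest value that is not treated as an outlier.
    cutoff_rank = max(1, math.floor((1 - outlier_fraction) * len(ordered)))
    upper = ordered[cutoff_rank - 1]
    width = upper / (n_bins - 1)
    counts = [0] * n_bins
    for v in ordered:
        if v > upper:
            counts[-1] += 1                      # outlier bin
        else:
            # min() keeps v == upper inside the last regular bin.
            counts[min(int(v // width), n_bins - 2)] += 1
    return [c / len(ordered) for c in counts]


ports = [6, 8, 9, 11, 14, 30, 80, 1020]
pdf = binned_pdf(ports, outlier_fraction=0.01, n_bins=4)
print(pdf)
```

Capping the regular bins at the cutoff keeps a single extreme port such as 1020 from stretching the bin width and hiding the structure among the low port numbers.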
Identification of a User
▪ An Example without P and S
Bin slice size is 1024 / 4 = 256
[6, 8, 9, 11, 14, 30, 80, 1020]
Bin range    Bin contents               pdf
0-255        6, 8, 9, 11, 14, 30, 80    0.875
256-511      (none)                     0
512-767      (none)                     0
768-1023     1020                       0.125
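For comparison, the fixed-width variant above needs no outlier handling. The sketch below assumes the system port range 0 to 1023 is split into equal bins; the function name and the range_max parameter are illustrative.

```python
def fixed_range_pdf(values, n_bins, range_max=1024):
    """Discrete pdf over n_bins equal-width bins covering [0, range_max)."""
    width = range_max / n_bins
    counts = [0] * n_bins
    for v in values:
        # min() guards against a value equal to range_max.
        counts[min(int(v // width), n_bins - 1)] += 1
    return [c / len(values) for c in counts]


ports = [6, 8, 9, 11, 14, 30, 80, 1020]
flat_pdf = fixed_range_pdf(ports, n_bins=4)
print(flat_pdf)
```

Seven of the eight ports land in the first bin, so this pdf separates hosts far less than the outlier-aware binning does.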
Identify a User Among Other Users
Accuracy
▪ Identifying a particular user among other users
– Decision Tree: 93.3%
– On-line Support Vector Machine: 78.5%
Feature Contribution
Conclusion
▪ Major Contributions
– A Big Data analysis system
• a conference paper
– Monitoring and interactive visualization
– Usage of anonymity technologies
• a conference and a journal paper
– Models for classification of host roles and identification of users
• a conference paper
Conclusion
▪ The Big Data analysis system is high-performance and scalable
▪ Real-time continuous network monitoring and interactive visualization are implemented on top of this high-performance system
Conclusion
▪ Proxies and Tor are the main anonymity technologies used on campus
– US, Germany, and China are the top 3 countries
▪ Models and features for classification of host roles:
– client vs. server, web server vs. web email server, personal office vs. public place, hosts from two different colleges
▪ Models and features for identification of a particular user among other users
Future Work
▪ Improvements to the Current Work
– More interactive features and better user interfaces
– Further analysis of user identification: features and algorithms (such as deep learning)
Future Work
▪ Extensions to the Current Work
– Define and filter out background traffic
– Detection of operating system fingerprinting
– Identity anonymity
– Fusion with other network security data sources
Future Work
▪ Vision
To provide network security as a service for individuals, small businesses, or government offices