A Supervised Machine Learning Approach to Classify Host Roles
Network Security Monitoring and Analysis
based on Big Data Technologies
Bingdong Li
August 26, 2013
Outline
Motivation
Objectives
System Design
Monitoring and Visualization
Network Measurement
Classification and Identification of Network Objects
Conclusion
Future Work
1
Motivation
Traditional security systems assume a static system
Network attacks
– sophisticated
– organized
– targeted
– persistent
– dynamic
– external
– internal
3
Motivation
Problem: Network Security is becoming more
challenging
Resource: A Large Amount of Security Data
– Network flow
– Firewall log
– Application log
– Server log
– SNMP
Opportunity: Big Data Technologies, Machine
Learning
6
Objectives
A network security monitoring and analysis system based
on Big Data technologies that provides
– Network measurement
– Real-time continuous monitoring and interactive
visualization
– Intelligent network object classification and identification
based on role behavior as context
7
Objectives
Network Security, Big Data, Machine Learning
8
10
System Design
Data Collection
11
System Design
Online Real Time Process
12
System Design
NoSQL Storage
13
System Design
User Interfaces
14
15
System Design
The Design supports features:
– Real Time Continuous Monitoring and
Interactive Visualization
– Network Measurement
– Classification and Identification of
Network Objects
16
Monitoring and Visualization
Real Time: response within a time constraint
Interactive: involves user interaction
Continuous: “continue to be effective over time in light
of the inevitable changes that occur” (NIST)
17
Monitoring and Visualization
Retrieve Data
Web User Interfaces
Video Demo
18
Monitoring and Visualization
Data Retrieval:
Data are stored with the IP address as the primary key and the time slice as
the secondary (column) key
Accessing these data takes Θ(1) time
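The keying scheme above can be sketched with nested hash maps. This is only an illustration of why lookups by (IP, time slice) take constant expected time; the class name `FlowStore`, the 60-second slice length, and the record layout are assumptions, and the actual NoSQL store used in the system is not named here.

```python
from collections import defaultdict


class FlowStore:
    """Toy in-memory stand-in for a store keyed by IP, then by time slice."""

    def __init__(self, slice_seconds=60):
        # slice_seconds is a hypothetical choice; the slides do not give one.
        self.slice_seconds = slice_seconds
        self.rows = defaultdict(dict)  # ip -> {time_slice -> record}

    def time_slice(self, timestamp):
        """Map a Unix timestamp to the index of its time slice."""
        return int(timestamp) // self.slice_seconds

    def put(self, ip, timestamp, record):
        """Store a record under (ip, time slice)."""
        self.rows[ip][self.time_slice(timestamp)] = record

    def get(self, ip, timestamp):
        """Two hash lookups: constant expected time regardless of data volume."""
        return self.rows.get(ip, {}).get(self.time_slice(timestamp))


store = FlowStore()
store.put("10.0.0.1", 125, {"packets": 3})
```

A query for any timestamp inside the same slice, such as `store.get("10.0.0.1", 130)`, returns the stored record without scanning other rows.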
19
Real Time Querying
20
Host Network Connection
21
Network Status
22
Top N
23
Demo of Interactivity and Continuity
Video Demo
24
Network Measurement
A case study
The Anonymity Technology Usage on Campus Network
Using sFlow
– Geo-Location
– Usage of Anonymity Systems
25
Geo-location of Anonymity Usage on Campus
One Instance: Bahamas, Belarus,
Belgium, Bulgaria, Cambodia, Chile,
Colombia, Estonia, Ghana, Greece,
Hungary, Ireland, Israel, Jamaica,
Jordan, Korea, Mongolia, Namibia,
Nigeria, Pakistan, Panama, Philippines,
Slovakia, Turkey, Ukraine, Vietnam,
Zimbabwe
Two Instances: Chad, Czech Republic,
Denmark, Hong Kong, Iran, Japan,
Kazakhstan, Poland, Romania, Spain,
Switzerland
Three Instances: Austria, France,
Singapore
Four Instances: Australia,
Indonesia, Taiwan, Thailand
26
Usage of Anonymity Systems
System       Packets (%)      Traffic in MB (%)   Observed IPs (%)
Proxies      5,580 (62.65)    8.13 (43.53)        234 (3.23)
Tor          3,129 (35.13)    9.04 (48.37)        152 (0.25)
I2P          190 (2.13)       1.50 (8.02)         23 (1.01)
Commercial   7 (0.08)         0.016 (0.08)        2 (N/A)
Total        8,906 (100)      16.69 (100)         411 (N/A)
27
Classification of Host Roles
Data: Three months of sFlow data from a large campus
Role               Count
Client             5494
Server             1920
Public Place       784
Personal Office    416
College1           163
College2           253
Web Server         56
Web Email Server   25
28
Classification of Host Roles
Algorithms
Decision Tree
On-line SVM
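To make "on-line" concrete, here is a minimal sketch of a linear SVM that is updated one sample at a time by stochastic gradient descent on the regularized hinge loss. It is not necessarily the variant used in this work; the class name, learning rate, regularization strength, and the toy samples are all assumptions made for illustration.

```python
class OnlineLinearSVM:
    """Linear SVM trained one sample at a time with hinge-loss SGD."""

    def __init__(self, n_features, learning_rate=0.1, reg=0.01):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.learning_rate = learning_rate  # assumed constant step size
        self.reg = reg                      # assumed L2 regularization strength

    def decision(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def predict(self, x):
        return 1 if self.decision(x) >= 0 else -1

    def partial_fit(self, x, y):
        """Update on a single sample; y must be +1 or -1."""
        violated = y * self.decision(x) < 1  # sample is inside the margin
        shrink = 1.0 - self.learning_rate * self.reg
        self.w = [shrink * wi for wi in self.w]  # L2 regularization step
        if violated:
            self.w = [wi + self.learning_rate * y * xi
                      for wi, xi in zip(self.w, x)]
            self.b += self.learning_rate * y


# Made-up samples: two normalized features per host, +1 = server, -1 = client.
samples = [([0.9, 0.8], 1), ([0.1, 0.2], -1), ([0.8, 0.9], 1), ([0.2, 0.1], -1)]
model = OnlineLinearSVM(n_features=2)
for _ in range(200):  # replay the stream a few times for this tiny example
    for x, y in samples:
        model.partial_fit(x, y)
```

Because each update touches only the current sample, the model can follow a stream of aggregated host features without storing past data.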
29
Classification of Host Roles
Features
Ad hoc based on domain knowledge
Aggregating features for on-line classification
24 features normalized between 0 and 1, inclusive
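One common way to bring a feature into the closed interval [0, 1] is min-max scaling. The slides do not say which normalization was used, so the sketch below is an assumption; for on-line use the minimum and maximum would have to be fixed in advance or tracked incrementally rather than computed over a batch as done here.

```python
def min_max_normalize(column):
    """Scale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to 0 by convention.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


scaled = min_max_normalize([2, 4, 6])
```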
30
Classification of Host Roles
Features
24 features derived from
src/dest IP address
src/dest Port number
TTL
Packet Size
Transport protocol
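The sketch below shows how per-host features might be aggregated from the fields listed above. The five features here are hypothetical examples chosen for illustration, not the 24 features used in this work, and the `FlowSample` record layout, the 1024 port threshold, and the 255 and 1500 scaling constants are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class FlowSample:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    ttl: int
    size: int       # packet size in bytes
    protocol: str   # "tcp" or "udp"


def host_features(host, samples):
    """Aggregate samples involving `host` into features scaled to [0, 1]."""
    mine = [s for s in samples if host in (s.src_ip, s.dst_ip)]
    if not mine:
        return None
    n = len(mine)
    # Port used on the host's own side of each sampled packet.
    own_ports = [s.src_port if s.src_ip == host else s.dst_port for s in mine]
    return {
        "as_source": sum(s.src_ip == host for s in mine) / n,
        "own_low_port": sum(p < 1024 for p in own_ports) / n,
        "tcp": sum(s.protocol == "tcp" for s in mine) / n,
        "mean_ttl": sum(s.ttl for s in mine) / n / 255,
        "mean_size": min(sum(s.size for s in mine) / n / 1500, 1.0),
    }


flows = [
    FlowSample("10.0.0.5", "10.0.0.9", 443, 51000, 64, 1500, "tcp"),
    FlowSample("10.0.0.9", "10.0.0.5", 51000, 443, 128, 60, "tcp"),
]
feats = host_features("10.0.0.5", flows)
```

In this made-up example the host always uses port 443 on its own side, which is the kind of signal that could help separate servers from clients.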
31
Classification of Host Roles
Ground Truth
Host Information in Active Directory
Crawler to validate its status
32
Classification of Host Roles
Classifying Client vs. Server
Classifying Web Server vs. Web Email Server
Classifying Hosts at Personal Office vs. Public Place
Classifying Hosts at Two Different Colleges
Feature Contributions
33
Classifying Client vs. Server
34
Classifying Web Server vs. Web Email Server
35
Classifying Host From Personal Office vs. Public Place
36
Classifying Host From Two Different Colleges
37
Accuracy
High accuracies of Host Role Classification
Classification                                  Accuracy (%)
Client vs. server                               99.2
Regular web server vs. web email server         100
Hosts from personal office vs. public places    93.3
Hosts from two different colleges               93.3
38
Feature Contribution
39
Identification of a User
Data: NetFlow data from a large campus
        College1   College2
Count   163        253
40
Identification of a User
Algorithms
Decision Tree
On-line SVM
Ground Truth
Host Information in Active Directory
Crawler to validate its status
41
Identification of a User
Features
Discrete probability distribution function (pdf)
An Example:
System Port Number [6, 8, 9, 11, 14, 30, 80, 1020]
– Outlier fraction (P) is 1%
– 80 is the port of interest (S)
– Number of bins (R) is 4
42
Identification of a User
An Example
(1 − 0.01) × 8 = 7.92, rounded down to 7; the 7th value is 80
Bin slice size = 80 / (4 − 1) ≈ 26.7
[6, 8, 9, 11, 14, 30, 80, 1020]
Bin values          pdf
6, 8, 9, 11, 14     0.625
30                  0.125
80                  0.125
1020                0.125
43
Identification of a User
An Example without P and S
Bin slice size is 1024 / 4 = 256
[6, 8, 9, 11, 14, 30, 80, 1020]
Bin values                 pdf
6, 8, 9, 11, 14, 30, 80    0.875
(empty)                    0
(empty)                    0
1020                       0.125
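The two binning examples can be reproduced with the sketch below. It is a reading of the slides under stated assumptions, not the author's code. With an outlier fraction P, the last of the R bins is reserved for values above the (1 − P) cut-off and the remaining R − 1 bins split the range up to that cut-off. Without P, the range 0 to `max_value` is split into R equal bins. The slides do not spell out how the port of interest S enters the computation; in the example it coincides with the cut-off value 80, so S is not modeled separately here.

```python
import math


def binned_pdf(values, bins, outlier_frac=0.0, max_value=None):
    """Return the fraction of values falling into each of `bins` bins."""
    xs = sorted(values)
    n = len(xs)
    counts = [0] * bins
    if outlier_frac > 0:
        keep = math.floor((1 - outlier_frac) * n)  # e.g. floor(7.92) = 7
        upper = xs[keep - 1]                       # e.g. the 7th value, 80
        width = upper / (bins - 1)                 # last bin holds the outliers
        for x in xs:
            if x > upper:
                counts[-1] += 1
            else:
                # Clamp so that x == upper lands in the last regular bin.
                counts[min(int(x // width), bins - 2)] += 1
    else:
        width = max_value / bins                   # e.g. 1024 / 4 = 256
        for x in xs:
            counts[min(int(x // width), bins - 1)] += 1
    return [c / n for c in counts]


ports = [6, 8, 9, 11, 14, 30, 80, 1020]
with_outlier_bin = binned_pdf(ports, bins=4, outlier_frac=0.01)
equal_width = binned_pdf(ports, bins=4, max_value=1024)
```

Comparing the two results shows the point of the outlier cut-off: the equal-width version puts seven of the eight ports in a single bin, while the cut-off version spreads them across three.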
44
Identify a User Among Other Users
45
Accuracy
Identifying a particular user among other users
Decision Tree 93.3%
On-line Support Vector Machine 78.5%
46
Feature Contribution
47
Conclusion
Major Contributions
– A Big Data analysis system
• a conference paper
– Monitoring and interactive visualization
– Usage of anonymity technologies
• a conference and a journal paper
– Models for classification of host roles and identification
of users
• a conference paper
48
Conclusion
The Big Data analysis system is high performance and
scalable
Real Time Continuous Network Monitoring and
Interactive Visualization are implemented and supported
by the high performance system
49
Conclusion
Proxies and Tor are the main anonymity technologies used
on campus;
– US, Germany, and China are the top 3 countries
Models and Features for Classification of Host roles:
– client vs. server, web server vs. web email server,
personal office vs. public place, from two different
colleges
Models and Features for Identification of a particular user
among other users
50
Future Work
Improvement to the Current Work
– More interactive features and better user interfaces
– Further analysis on user identification: features,
algorithm (such as deep learning)
51
Future Work
Extension to the Current Work
– Define and filter out background traffic
– Detection of operating system fingerprinting
– Identity anonymity
– Fusion with other network security data sources
52
Future Work
Vision
To Provide network security as a service for individuals,
small businesses, or government offices
53