Statistical Methods for Detecting Computer Attacks from Streaming Internet Data
Ginger Davis, University of Virginia, Systems & Information Engineering Department
Joint work with:
David Marchette & Karen Kafadar
INTERFACE 2008
May 22, 2008
1
Outline
• Motivation
• Data
• TCP Classification
• Graphical Displays
2
Motivation
• Cyber attacks on computer networks are threats to nearly all operations in society.
• We need computational tools and statistical methods to identify attacks and stop them before they force shutdowns.
• Use patterns in Internet traffic data to
– Perform user profiling
– Detect anomalies, network interruptions, unusual behavior, masquerades
3
Project Background
[Figure: Personal Computer + The Internet (circa 2006) = Burning Power Transformer (May 2007)]
Facts:
• The Internet is growing
• Computer network attacks are increasing
• Need for network security research & tools
4
Previous Work in Detecting Aberrations
• Examples
– Disease surveillance
– Nuclear product manufacturing
– Fraud detection (credit cards; phone use)
• These data sets are often
– Reasonably small (say, less than 100 per day)
– Easily stratified (by disease, site, cardholder)
– Approximately independent
• Can often apply Statistical Process Control tools
5
Features of Internet Traffic Data
• Relentless (“streaming”)
• Not independent of other systems: thousands of messages from thousands of ports/addresses each minute
• Diverse (text, numeric, image)
• Dispersed (geographically)
• Data often do not follow any convenient mathematical probability density function (pdf)
6
Four Stages of Data Graphics
• Static Graphics
– Scatterplot, conditioning plot, density plot
• Interactive Graphics
– Brushing, cropping, cutting, coloring, rotating, linked plots
• Dynamic Graphics (interact directly with a fixed-size data set on the client)
– Recursive or dynamically smoothed plot, mode tree
• Evolutionary Graphics (continually evolving streaming data sets)
– Waterfall diagram, streaming chart, skyline plot
7
Challenges
• Internet traffic data are streaming
• Data are unusable in raw form and require preprocessing
• Detecting anomalies requires characterizing typical behavior
8
Specific challenges for streaming data
• Data value
– What to collect/discard/save for later
• Data warehouse
– Acquisition, storage, distribution
• Tools/algorithms for pre-processing
• Methods for analysis
– Robustness, sufficiency
• Informative visual displays
9
Internet Traffic Data
• All Internet communications are transmitted via packets.
• The packet is the fundamental unit of information.
• A packet consists of data and headers that control communication
– Internet Protocol (IP) addresses
– Transmission Control Protocol (TCP)
10
[Figure slides 11–13: Internet Traffic Data examples]
[Figure slide 14: IP Header (Marchette 2001)]
[Figure slides 15–16: TCP]
[Figure slide 17: TCP Header]
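To make the header fields above concrete, here is a minimal Python sketch (the project's own pipeline used a C++ parser) that pulls out the IP addresses, ports, and TCP flags that are later used for session aggregation. It assumes the buffer starts at the IPv4 header with no link-layer header; the function and field names are illustrative assumptions.

```python
import struct

def parse_ipv4_tcp(packet: bytes):
    """Extract the IP/TCP header fields used for session identification.

    Illustrative sketch: assumes `packet` begins at the IPv4 header and
    carries a TCP segment.
    """
    # IPv4 header: version/IHL, TOS, total length, ID, flags/fragment offset,
    # TTL, protocol, checksum, source address, destination address.
    ver_ihl, tos, total_len, ident, flags_frag, ttl, proto, cksum, src, dst = \
        struct.unpack("!BBHHHBBH4s4s", packet[:20])
    ihl = (ver_ihl & 0x0F) * 4            # IP header length in bytes
    if proto != 6:                        # protocol 6 = TCP
        return None

    # TCP header: source port, destination port, sequence number, ack number,
    # data offset + flags, window, checksum, urgent pointer.
    sport, dport, seq, ack, off_flags, window, tcksum, urg = \
        struct.unpack("!HHIIHHHH", packet[ihl:ihl + 20])
    flags = off_flags & 0x01FF            # FIN, SYN, RST, PSH, ACK, URG, ...

    return {
        "src": ".".join(map(str, src)),
        "dst": ".".join(map(str, dst)),
        "sport": sport,
        "dport": dport,
        "seq": seq,
        "ack": ack,
        "syn": bool(flags & 0x02),
        "fin": bool(flags & 0x01),
        "length": total_len,
    }
```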
Hierarchy of Data
• Packets
– Identifying characteristics
– Bytes of information being sent
• Flows
– Communication between a source-destination pair
• Connection
– Collection of source flows and destination flows
• Activity
– Collection of similar connections
• User session
– Collection of activities
18
[Figure slide 19: Hierarchy of Data Example]
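As a companion to the example, here is a rough sketch of how the hierarchy above might be represented in code; the class and field names are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Packet:
    src: str
    dst: str
    sport: int
    dport: int
    nbytes: int
    time: float

@dataclass
class Flow:
    """One direction of communication between a source and a destination."""
    packets: List[Packet] = field(default_factory=list)

@dataclass
class Connection:
    """A source flow paired with its matching destination flow."""
    src_flow: Flow = field(default_factory=Flow)
    dst_flow: Flow = field(default_factory=Flow)

@dataclass
class Activity:
    """A collection of similar connections (e.g. repeated web fetches)."""
    connections: List[Connection] = field(default_factory=list)

@dataclass
class UserSession:
    """A collection of activities belonging to one user session."""
    activities: List[Activity] = field(default_factory=list)
```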
Goal for Data Hierarchy
• Develop models for each level of the hierarchy that depend on the models for the other levels of the hierarchy
20
TCP Classification
• Detecting anomalies requires characterizing typical behavior
• We will classify network traffic according to its application
21
Background
Motivation:
• Port numbers map packets to their respective applications
• The only requirement is that the two communicating hosts know which port number to look for
• Malicious users can run other traffic over a well-known port like 80 (web traffic) and, as a result, are less likely to be noticed
22
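For context, port-based labeling is essentially a table lookup, which is exactly what a masquerading user defeats. A minimal sketch (the port mapping follows the standard IANA well-known-port convention; the function is illustrative):

```python
# Conventional port-to-application mapping. A malicious user can run any
# service on port 80, so this lookup alone is not a reliable classifier,
# which motivates the session-based models developed below.
WELL_KNOWN_PORTS = {
    80: "http",
    443: "https",
    110: "pop3",
    25: "smtp",
    22: "ssh",
}

def label_by_port(dport: int) -> str:
    """Naive baseline: label a session by its destination port alone."""
    return WELL_KNOWN_PORTS.get(dport, "unknown")
```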
Goal and Objective
Goal:
To prevent malicious users from masquerading their activities.
Objective:
To develop classification tree and multinomial logit models that can be used to correctly identify application protocols from session variable characteristics
23
Data
Preliminary Data Processing Methodology:
• Convert: Binary -> Text -> SQL
– Proved to be slow and inefficient
– Inadequate session aggregation results
[Diagram: GMU inward traffic + GMU outward traffic -> Step 1: Wireshark merge & data dump -> packet dump file -> Step 2: C++ parser -> MySQL packet table -> Step 3: session aggregator -> MySQL session table]
24
Data
Revised Data Processing Methodology:
• Convert: Binary -> Text -> SQL
– Faster, more efficient, tracks more variables for each session
[Diagram: libpcap files -> Step 1: C++ binary packet importer -> MySQL packet table -> Step 2: revised session aggregator -> MySQL session table]
25
Data
Session Aggregation Process:
• Ordered observations in the database by time
• Logically grouped each packet into a session using standard TCP semantics (see the sketch below)
• Created unique session definitions
• Maintained averages and variances for each session’s variables
• Determined and marked session completion status according to TCP semantics
• Linked packet and session tables by foreign keys
26
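A minimal Python sketch of the aggregation logic described above (the actual aggregator was a C++/MySQL tool): packets are keyed by their TCP 4-tuple, per-session running means and variances are maintained with Welford's algorithm, and a FIN or RST marks the session complete. Field names are assumptions.

```python
from collections import defaultdict

class SessionStats:
    """Running mean/variance of a per-packet variable (Welford's algorithm)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.complete = False

    def update(self, x: float):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

def aggregate(packets):
    """Group time-ordered packets into sessions keyed by the TCP 4-tuple."""
    sessions = defaultdict(SessionStats)
    for p in sorted(packets, key=lambda p: p["time"]):
        # Both directions of a connection map to the same session key.
        key = tuple(sorted([(p["src"], p["sport"]), (p["dst"], p["dport"])]))
        s = sessions[key]
        s.update(p["length"])              # e.g. track packet-size statistics
        if p.get("fin") or p.get("rst"):   # TCP semantics: session finished
            s.complete = True
    return sessions
```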
Data
Enterprise Data Set – Collected by: Lawrence Berkeley National Laboratory – Contains: 129,903,861 TCP packets, 453,135 TCP sessions
GMU Data Set – Collected by: George Mason University – Contains: 7,024,590 TCP packets, 91,016 TCP sessions
House Data Set – Collected by: Capstone Team 8 – Contains: 1,110,335 TCP packets, 21,311 TCP sessions
27
Model Creation
Training and Testing Data Set Creation
[Diagram: MySQL session table -> Step 1: IMiner csv export -> complete-sessions and all-sessions csv files -> Step 2: IMiner training and testing data set creation -> training data (70% of original observations) and testing data (30% of original observations)]
28
Model Creation
Scenarios Used in Data Analysis
• Real World Corporate Scenario – used all application ports present in the data sets
• Idealized Scenario – used only “top” application ports in the data sets
• Home Network Scenario – used only http, https, pop, and smtp application ports present in the data sets
29
Model Creation
Classification Tree Algorithm Parameters
• RPART – originally developed for R
• Dependent Variable – Application Port
• Independent Variables – 39 session variables
• Splitting Criteria – Gini Index
30
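The original trees were fit with RPART (via IMiner). As a rough Python analogue of the same setup, assuming scikit-learn, Gini impurity as the splitting criterion, the application port as the dependent variable, and the session variables as predictors (file and column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the exported session table and make the 70/30 split described earlier.
sessions = pd.read_csv("complete_sessions.csv")
X = sessions.drop(columns=["application_port"])   # the 39 session variables
y = sessions["application_port"]                  # dependent variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=0)

# Classification tree with the Gini index as splitting criterion.
tree = DecisionTreeClassifier(criterion="gini")
tree.fit(X_train, y_train)
print("held-out accuracy:", tree.score(X_test, y_test))
```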
Model Creation
Classification Tree Snapshot
31
Model Creation
Multinomial Logit Models
• Dependent variable – Application Port
• Independent variables – 39 session variables
32
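A corresponding sketch of the multinomial logit model, again in Python rather than the original toolchain; the same dependent and independent variables are used, and file and column names are assumptions:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

sessions = pd.read_csv("complete_sessions.csv")
X = sessions.drop(columns=["application_port"])   # the 39 session variables
y = sessions["application_port"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=0)

# Standardize the predictors (fit on training data only), then fit a
# multinomial logistic regression over the application-port classes.
scaler = StandardScaler().fit(X_train)
logit = LogisticRegression(multi_class="multinomial", max_iter=1000)
logit.fit(scaler.transform(X_train), y_train)
print("held-out accuracy:", logit.score(scaler.transform(X_test), y_test))
```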
Results: Classification Trees
Real World Corporate Scenario: All Ports and All Variables
Takeaway: Good prediction capability within the same data set; inconsistent results when benchmarked against different data sets.
33
Results: Classification Trees
Idealized Scenario: Top Ports and All Variables
Takeaway: Significant prediction improvement for the Enterprise data set. Limiting the ports cleansed the noise from the data.
34
Results: Classification Trees
Home Network Scenario: Four Ports and All Variables
Takeaway: Improved prediction results both within and across data sets.
35
Results: Classification Trees
Port 80 Across Data Sets – 4 Application Ports
Takeaway: HTTP traffic (port 80) predictions appear to be robust across the models when looking at only the four application ports.
36
Results: Multinomial Logit Models
Idealized Scenario: Top Ports – All Variables
Takeaway: Weaker prediction results on the Enterprise data set; practical in a real-time setting given an appropriate environment and implementation.
37
Conclusion
Project Takeaways:
• Successfully replicated and expanded prior research on real network data
• Used a fast, exportable model creation and classification process -> classification trees
• Created a robust toolkit for processing and storing network data
38
Future Work
• Implement classification trees in a real network security
application
• Handle minority class presence in the data
• Make use of pruning to develop smaller models
39
Evolutionary Displays for EDA
40
[Figure slides 41–43: Waterfall Diagrams (Wegman & Marchette 2003)]
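For readers unfamiliar with the idea, here is a minimal sketch of a waterfall-style evolutionary display: each new time bin is pushed onto the top of an image while older bins scroll down, so the plot evolves as data stream in. The per-port packet counts and bin sizes are illustrative assumptions, not the Wegman & Marchette implementation.

```python
import numpy as np
import matplotlib.pyplot as plt

n_rows, n_ports = 60, 50                  # keep the most recent 60 time bins
waterfall = np.zeros((n_rows, n_ports))

def update(counts_this_bin: np.ndarray) -> None:
    """Push the newest per-port packet counts onto the top of the display."""
    global waterfall
    waterfall = np.vstack([counts_this_bin, waterfall[:-1]])

# Simulated stream: one update per time bin.
rng = np.random.default_rng(0)
for _ in range(200):
    update(rng.poisson(lam=3.0, size=n_ports))

plt.imshow(waterfall, aspect="auto", cmap="viridis")
plt.xlabel("port index")
plt.ylabel("time bins ago")
plt.title("Waterfall display of per-port packet counts")
plt.show()
```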
Summary / Future Work
44