Understanding traffic activities in large networks
Tackling Network Management Problems
with Machine Learning Techniques
Preliminary Oral Exam
Yu Jin
Advisor: Professor Zhi-Li Zhang
Motivation
Effective solutions are required for various problems in large
networks:
Traffic classification
Anomaly detection
Host/traffic profiling
Trouble-shooting
Manual rule-based solutions are either costly or unavailable
Scalability is a major issue
Our Solution
Abstract into machine learning problems
Select advanced machine learning algorithms to solve these
problems
Address the scalability issue
Deliver a system and evaluate it in a real operational environment
Outline
Work in the past – Traffic classification
Design and implementation of a modular machine learning
architecture for flow-level traffic classification
Analysis of traffic interaction patterns using traffic activity
graphs (TAGs)
Visualizing spatial traffic class distribution using colored TAGs
Traffic classification using colored TAGs
On-going work
Customer ticket prediction and trouble-shooting in DSL
networks
Summary and timetable
A Light-Weight Modular Machine Learning Approach
to Large-Scale Network Traffic Classification
Motivation
Required for:
Security monitoring
Traffic policing and prioritization
Predicting application trends
Identifying new applications
An interesting research topic
Across multiple research areas:
machine learning, traffic profiling, social network analysis, system
optimization
Design an operational system for a large ISP network
Learn valuable lessons for solving other practical problems
Challenges:
Scalability: training and operating on 10Gbps links
Accuracy: similar performance to the rule-based classifier
Stability: remain accurate without human intervention
Versatility: reconfigurable
[Figure: new flow arrival rate (per minute) on a 1Gbps link over one week (Mon-Sun), for TCP and UDP; the y-axis ranges from 50K to 300K flows per minute]
Current solution – Rule-based classifier
Matches layer-4/layer-7 packet headers against manually defined rules
Expensive in operation (flow/packet sampling is required)
Expensive in deployment (special hardware is required)
Inapplicable if the packet is encrypted, e.g., end-to-end IPSec
traffic
However, the rule-based classifier can provide a good set of "labeled" flow data
A machine learning solution
[Diagram: raw traffic data is collected from several networks (Network 0-4); the rule-based classifier labels this data to produce the training set for the machine learning classifier]
A modular machine learning architecture
A modular architecture for parallelization in both training and operation
First-level modularization: pre-partition the flow data by flow features (by flow size, IP protocol, etc.)
Better accuracy
Higher scalability from parallelization
[Diagram: incoming flow data is partitioned by flow features into partitions 1..m; a separate classifier is trained on each partition and produces the prediction for that partition]
Second-level modularization
From a single k-class classifier to k 1-vs-all classifiers
Accelerates training and operation
Low memory consumption
Parallelization
No significant performance loss
[Diagram: the flow data in partition j feeds classifier j, a bank of binary classifiers, one per application class 1..k; the prediction on partition j is the class with the maximum posterior P(Cj | x)]
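To make the combination concrete, here is a minimal sketch of the second-level modularization, assuming scikit-learn-style AdaBoost over decision stumps; the class name, feature layout, and 640-round default are illustrative stand-ins, not the system's actual code.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

class OneVsAllPartitionClassifier:
    # k one-vs-all binary classifiers for one flow partition,
    # combined by taking the class with the largest posterior P(Cj | x).
    def __init__(self, classes, n_rounds=640):
        self.classes = list(classes)
        self.models = {
            c: AdaBoostClassifier(
                DecisionTreeClassifier(max_depth=1),  # decision stumps
                n_estimators=n_rounds)
            for c in self.classes}

    def fit(self, X, y):
        # Train one class-vs-rest model per application class;
        # the k models are independent and can be trained in parallel.
        for c in self.classes:
            self.models[c].fit(X, (y == c).astype(int))
        return self

    def predict(self, X):
        # The max over the per-class posteriors decides the predicted class.
        posteriors = np.column_stack(
            [self.models[c].predict_proba(X)[:, 1] for c in self.classes])
        return np.asarray(self.classes)[posteriors.argmax(axis=1)]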
Training a binary classifier
Sampling is necessary due to the huge amount of training data (90
million TCP flows plus 85 million UDP flows)
Weighted threshold sampling for a more balanced and
representative training set
if count(Cj) ≤ θ:
    keep all the flows in Cj
else:
    sample flows in Cj with rate θ/count(Cj)
[Pie charts: application class distribution before and after sampling, over Business, Chat, Dns, FileSharing, Ftp, Games, Mail, Multimedia, NetNews, SecurityThreat, Voip, and Web; before sampling one class dominates with 63.276% of the flows while several classes fall below 0.1%, after sampling the classes are far more balanced (mostly 4-11% each)]
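The sampling rule above maps directly to code. A minimal sketch, assuming flows are available as an in-memory list and get_class extracts the application label (both names are illustrative):

import random
from collections import defaultdict

def weighted_threshold_sample(flows, get_class, theta, seed=0):
    rng = random.Random(seed)
    counts = defaultdict(int)
    for f in flows:                          # first pass: per-class flow counts
        counts[get_class(f)] += 1
    sampled = []
    for f in flows:                          # second pass: keep or downsample
        c = get_class(f)
        if counts[c] <= theta or rng.random() < theta / counts[c]:
            sampled.append(f)
    return sampled

Classes with at most θ flows are kept whole, while a class with count(Cj) > θ is retained at rate θ/count(Cj), so each class contributes roughly θ flows at most.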
Selection of binary classifier
Any classifier can be used as a component in our traffic classification
architecture
Boosting decision stumps (1-level decision trees)
Fast
Accurate
Simple
Implicit L1 regularization
[Figure: TCP flow error rates for different binary classifiers; the "non-linear" boosting trees (BTree) classifier has the best performance, and boosting stumps (BStump) is slightly better than L1Maxent]
Logistic calibration
The IID assumption is violated by the weighted threshold sampling method
The raw scores output by AdaBoost make direct combination of the binary classification results infeasible
We need to calibrate the binary classifiers to:
Address the difference in traffic distributions
Convert scores fc(x) into posterior probabilities P(C|x)
[Figure: reliability diagram for the TCP Web class]
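As an illustration, logistic calibration can be implemented as Platt-style scaling: fit a sigmoid that maps raw boosting scores fc(x) to posteriors P(C|x) on a small independent sample. The sketch below leans on scikit-learn's LogisticRegression as the sigmoid fitter, which is an assumption about the implementation rather than the authors' exact calibrator.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibrator(scores, labels):
    # scores: raw AdaBoost scores f_c(x) on a held-out simple random sample
    # labels: 1 if the flow truly belongs to class C, else 0
    lr = LogisticRegression()
    lr.fit(np.asarray(scores).reshape(-1, 1), labels)
    return lr

def posterior(calibrator, scores):
    # P(C | x) = sigmoid(a * f_c(x) + b), with a and b learned above
    return calibrator.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]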
Training architecture for a binary classifier
One binary classifier for each application class
One calibrator is trained on the classification results over a small independent flow sample (simple random sample)
[Diagram: the threshold-sampled training data trains the binary classifier; the reliability diagram on the independent sample trains the calibrator; together they form the application classifier]
Performance Evaluation - Accuracy
Our classifier can reproduce the classification results from the rule-
based classifier with high accuracy
Direct training of a multi-class classifier yields little gain in accuracy according to tests on small samples
[Bar chart: flow error rates before and after calibration for BStump and NBayes on TCP and UDP traffic; the values shown include 26.92%, 13.76%, 10.19%, 7.64%, 3.07%, 1.77%, 1.34%, and 0.34%]
Scalability
Scalability in training
Accuracy increases with the size of the training data and the number of iterations
Training takes less than 1.5 hours with 700K training samples and 640 iterations
Scalability in operation
Using only basic optimizations
Recall that on 2 x 1Gbps links, the new flow arrival rate is 450K/min
We achieve 800K flows/min with a single thread
Close to 7M flows/min with 10 threads, which can scale up to 10Gbps links
Evaluation on Stability
Temporal stability
After two months, the flow error rates are 3±0.5% for TCP traffic and
0.4±0.04% for UDP traffic
After one year, the TCP flow error rate is 5.48±0.54% and the UDP flow
error rate is 1.22±0.2%.
Spatial stability
Train and test at two geolocations
Importance of the Port number
Using our architecture, we obtain an error rate of 4.1% for TCP
and 0.72% for UDP with only port features (3.13% for TCP and
0.35% for UDP after adding other flow features)
We use a port graph to visualize and understand the machine-learned port rules
TCP Multimedia uses ports 554 and 5000-5180
UDP Games uses port 88 (Xbox)
UDP Chat uses ports 6661-6670
Summary
We have designed a modular machine learning architecture for large-scale ISP/enterprise network traffic classification
The system scales up to 10Gbps links and remains accurate for one
year on multiple sites without re-training
We have conducted evaluation on a large operational ISP network
What if the port number and other flow level statistics are
unavailable? For example, classifying the end-to-end
IPSec traffic.
IPSec traffic
Limited traffic statistics for IPSec traffic
[Diagram: an IPSec header followed by the encrypted inner packet; no port number, protocol, payload, etc. is visible]
Available features: number of packets, number of bytes, average packet inter-arrival time, and average packet size in both directions
Only 80% accuracy using the proposed machine learning architecture
Our solution: classification on Traffic Activity Graphs
Visualizing and Inferring Network Applications
using Traffic Activity Graphs (TAGs)
Traffic Activity Graphs
Nodes represent the hosts in the network
Edges represent the interaction between these hosts
Defined on a set of flows collected in a time period T
Help us study the communication patterns of different
applications
[Diagram: application TAGs GHTTP, GEmail, and GGnutella connecting UMN hosts to the Internet]
[Figure: application TAGs and their evolution as the number of flows grows from 1K to 3K, for HTTP, Email, AOL IM, and DNS]
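For concreteness, a TAG is easy to build from flow records. A minimal sketch with networkx, assuming each flow record carries src_ip, dst_ip, and app fields (hypothetical names):

import networkx as nx

def build_tag(flows, app=None):
    # Nodes are hosts; an edge joins two hosts that exchanged
    # at least one flow (of the given application) in the window T.
    g = nx.Graph()
    for f in flows:
        if app is None or f["app"] == app:
            g.add_edge(f["src_ip"], f["dst_ip"])
    return g

# Giant connected component and its share of the edges:
# gcc = g.subgraph(max(nx.connected_components(g), key=len))
# share = gcc.number_of_edges() / g.number_of_edges()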
Properties of TAGs
We observe differences in basic statistics, such as graph density, average in/out degree, etc.
All TAGs contain a giant connected component (GCC), which accounts for more than 85% of all the edges.
[Bar chart: GCC size as a fraction of edges for each application TAG, all between roughly 85% and 100%]
Understanding the Interaction Patterns by Decomposing TAGs
Block structures in the adjacency matrices indicate dense subgraphs in TAGs, e.g.:
1110010
1110000
1110000
0001111
0001011
[Figure: adjacency matrices of the HTTP, Email, AOL IM, BitTorrent, and DNS TAGs]
TAG Decomposition using Tri-Nonnegative Matrix Factorization
Extracting dense subgraphs can be formulated as a co-clustering problem, i.e., cluster hosts into inside host groups and outside host groups, then extract pairs of groups with more edges connected (higher density).
This co-clustering problem can be solved by the tri-nonnegative matrix factorization (tNMF) algorithm, which minimizes
\min_{R,H,C \ge 0} \| A - RHC \|_F^2
where A is the adjacency matrix associated with the TAG, R (m-by-k) is the row-group membership indicator matrix, C (r-by-n) is the column-group membership indicator matrix, and the entries of H (k-by-r) are proportional to the densities of the corresponding subgraphs.
Since R is m-by-k and C is r-by-n, the product RHC is a low-rank approximation of A with rank at most min(k, r).
We identify dense subgraphs based on the large entries in H.
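A generic tNMF solver uses multiplicative updates; the sketch below implements standard Lee-Seung-style updates for this objective and is not necessarily the exact algorithm from the paper.

import numpy as np

def tnmf(A, k, r, n_iter=200, eps=1e-9, seed=0):
    # Minimize ||A - R H C||_F^2 with R, H, C >= 0.
    m, n = A.shape
    rng = np.random.default_rng(seed)
    R = rng.random((m, k))   # inside-host (row) group memberships
    H = rng.random((k, r))   # group-pair densities
    C = rng.random((r, n))   # outside-host (column) group memberships
    for _ in range(n_iter):
        RHC = R @ H @ C
        R *= (A @ C.T @ H.T) / (RHC @ C.T @ H.T + eps)
        RHC = R @ H @ C
        H *= (R.T @ A @ C.T) / (R.T @ RHC @ C.T + eps)
        RHC = R @ H @ C
        C *= (H.T @ R.T @ A) / (H.T @ R.T @ RHC + eps)
    return R, H, C  # large entries of H mark dense subgraphs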
Subgraph prototypes
Recall that inside (UMN) hosts are (likely) service requesters and outside hosts are service providers.
Based on the number of inside/outside hosts in each subgraph, we propose three prototypes:
In-star: one inside client accesses multiple outside servers
Out-star: multiple inside clients access one outside server
Bi-mesh: multiple inside clients interact with many outside servers
Characterizing TAGs with subgraph prototypes
Different application TAGs contain different types of subgraphs
[Figure: subgraph composition of the HTTP, Email, AOL IM, BitTorrent, and DNS TAGs]
We can distinguish and characterize applications based on their subgraph components
What do these subgraphs mean?
Interpreting HTTP bi-mesh structures
Most star structures are due to popular servers or active clients
We can explain more than 80% of the HTTP bi-meshes identified in one day
Server correlation driven:
Server farms: Lycos, Yahoo, Google
Correlated service providers
CDNs: LLNW, Akamai, SAVVIS, Level3
Advertising providers: DoubleClick, etc.
User interest driven:
News: WashingtonPost, New York Times, Cnet
Media: ImageShack, casalemedia, tl4s2
Online shopping: Ebay, Costco, Walmart
Social networks: Facebook, MySpace
[Diagram: redirection links between correlated servers]
How are the dense subgraphs connected?
[Diagram: four connection patterns between dense subgraphs, with AS boundaries (AS1, AS2) marked: (A) randomly connected stars, (B) tree with hosts in a client/server dual role, (C) pool, (D) correlated pool]
(SIGMETRICS/Performance 2009)
Summary
We introduce the notion of traffic activity graphs (TAGs)
Different applications show different interaction patterns
We propose a tNMF-based graph decomposition method to help understand the formation of application TAGs.
Can we classify different application classes based on TAGs?
Traffic classification using
Collective Traffic Statistics
Colored TAGs
Different applications are displayed as edges with different
colors in TAGs
[Figure: the original TAG vs. the same TAG with Web and FileSharing traffic removed]
Characterizing Colored TAGs
Clustering effects: edges with the
same color tend to cluster together
Attractive (A) / repulsive (R) effects: edges of some color pairs tend to appear together, while others rarely share a host
The collective traffic statistics summarize the
spatial distribution of application classes in TAGs
Methodology
A two-step approach:
Bootstrapping: initial edge classification based only on the traffic statistics, turning the unclassified TAG into an initially labeled TAG
Graph calibration: edge color calibration using only the colored neighborhood information in the initially labeled traffic graph, producing the prediction on the traffic graph
[Diagram: input (unclassified TAG and traffic statistics) → bootstrapping (initially labeled TAG) → graph calibration → prediction output]
Training the Two-Step Model
The two-step model can be integrated into the existing traffic
classification system easily
The bootstrapping step uses traffic statistics associated with
each edge to conduct initial classification on the edges
Available traffic statistics depend on specific applications
The graph calibration step uses collective traffic statistics in
the TAG to reinforce/correct the initial labels
Collective traffic statistics are encoded as histograms
[Figure: an edge eij whose endpoints i and j carry normalized class histograms hi and hj, e.g., over classes such as Web, FileSharing/Ftp, and Mail]
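A minimal sketch of computing these histograms, assuming the initially labeled TAG is stored as a nested dict mapping host -> neighbor -> current class label (an illustrative data structure):

from collections import Counter

def class_histogram(tag, host, classes):
    # Normalized histogram of the class labels on the host's incident edges
    counts = Counter(tag[host].values())
    total = sum(counts.values()) or 1
    return [counts[c] / total for c in classes]

# For an edge e_ij, the collective statistic is the pair (h_i, h_j);
# e.g., h_i = [0.7, 0.2, 0.1] over ("Web", "FileSharing", "Mail") means
# 70% of host i's edges are initially labeled Web.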
Evaluation on Network-Level Traffic
classification
Packet header information (e.g., port number, TCP flags) is
unavailable for bootstrapping, similar to the situation of
classifying end-to-end IPSec traffic
How accurately can we classify such traffic when the flow-level classification system only achieves 80% accuracy due to the lack of traffic features?
Evaluation on Accuracy
Our graph-based calibration reduces the error rate by 50%!
The classifier remains stable across time and geolocation.
[Bar chart: edge accuracy of the bootstrap step vs. after calibration, evaluated at 2 days, 1 week, 1 month, and 1 year, and at 1 month at site 2; the accuracy axis spans 74% to 92%]
Evaluation on Per-Class Classification
The two-step method improves the accuracy for all traffic classes
The repulsive rules enable us to improve on the traffic classes
with very poor initial labeling.
Evaluation on Real-time Performance
We can implement the two-step approach as a real-time
system with little additional cost
Evaluation on Flow-Level Traffic
Classification
We have access to all packet header information
How much will the collective traffic statistics improve the
overall accuracy of our system?
Evaluation on Accuracy
We achieve a 15% reduction in errors within a month
The F1 scores are improved for most application classes
Summary
We introduced the concept of colored TAGs
We proposed a two-step model to utilize the spatial
distribution of application classes in TAGs (collective traffic
statistics) to help improve the classification accuracy
The collective traffic statistics help reduce 50% of the errors
for classification at the network layer
15% of the errors can be reduced for the flow-level traffic
classification using graph calibration
Trouble-shooting in Large DSL Networks
(work in progress)
Motivation
The current solution for trouble-shooting in DSL networks is reactive and inefficient
It can potentially lead to customer churn
Challenges
Millions of users
A large number of devices on each DSL line which cannot be
controlled remotely
Many possible locations where the line problem can happen
Methodology
Trouble Locator
Measure the line condition between the DSL server and the
cable modem for each customer
Use machine learning techniques to learn the correlation
between different line problems and our line measurements
Ticket Predictor
Maintain periodic measurements for each customer (every
Saturday night)
Learn the correlation between the measurement history and the
potential line problems
Overview of the Proactive Solution
Proactively resolve line problems before the customer
complains
Planned Trial
Evaluate our method in an operational DSL network
Timetable
Design of the DSL network trouble-shooting system (Feb. 2010 to Apr. 2010)
Implementation of the system and offline evaluation (May 2010 to Jul. 2010)
Trial in an operational DSL network (Aug. 2010 to Oct. 2010)
Thesis (Nov. 2010 to Jan. 2011)
Publications before 2010
Aiyou Chen, Yu Jin, Jin Cao, Li (Erran) Li, "Tracking Long Duration Flows in Network Traffic," to appear in Proc. of the 29th IEEE International Conference on Computer Communications (IEEE INFOCOM 2010, mini-conference) (acceptance ratio 24.3%).
Yu Jin, Esam Sharafuddin, Zhi-Li Zhang, "Unveiling Core Network-Wide Communication Patterns through Application Traffic Activity Graph Decomposition," in Proc. of the 2009 ACM International Conference on Measurement and Modeling of Computer Systems (ACM SIGMETRICS 2009) (acceptance ratio 14.9%).
Jin Cao, Yu Jin, Aiyou Chen, Tian Bu, Zhi-Li Zhang, "Identifying High Cardinality Internet Hosts," in Proc. of the 28th IEEE Conference on Computer Communications (IEEE INFOCOM 2009) (acceptance ratio 19.6%).
Yu Jin, Esam Sharafuddin, Zhi-Li Zhang, "Identifying Dynamic IP Address Blocks Serendipitously through Background Scanning Traffic," in Proc. of the 3rd International Conference on emerging Networking EXperiments and Technologies (CoNEXT 2007), New York, NY, December 10, 2007 (acceptance ratio 19.5%).
Yu Jin, Zhi-Li Zhang, Kuai Xu, Feng Cao, Sambit Sahu, "Identifying and Tracking Suspicious Activities through IP Gray Space Analysis," in Proc. of the 3rd Workshop on Mining Network Data (MineNet'07), San Diego, CA, June 12, 2007 (in conjunction with ACM SIGMETRICS'07).
Yu Jin, Gyorgy Simon, Kuai Xu, Zhi-Li Zhang, Vipin Kumar, "Gray's Anatomy: Dissecting Scanning Activities Using IP Gray Space Analysis," in Proc. of the Second Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML07), Boston, MA, April 10, 2007 (in conjunction with USENIX NSDI'07).
Thanks!
Questions?
Backup slides
Characteristics of app. TAGs
These statistics show differences between various application TAGs, but they do not explain the formation of TAGs
TNMF algorithm related
Iterative optimization algorithm
Group density matrix derivation
Backup for traffic classification
Default flow features
Comparison of different algorithms for multi-class classification
Training time for different machine
learning algorithms
Selection of flow size for partitioning
Boosting decision stumps
t = 1: tcpflag contains S? no: S− = −1.066; yes: S+ = −2.523
t = 2: dstport_low = 443? no: S− = −0.226; yes: S+ = 2.139
…
t = T: byte ≥ 64.5? no: S− = 0.446; yes: S+ = −0.202
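For illustration, these stumps can be applied to a flow as follows; the predicates and weights are taken from the slide above, and the flow dict layout is assumed.

def stump_score(flow, stumps):
    # Each round t adds S+ if its predicate holds for the flow, else S-;
    # the accumulated score is what the logistic calibrator maps to P(C|x).
    return sum(sp if pred(flow) else sm for pred, sp, sm in stumps)

stumps = [
    (lambda f: "S" in f["tcpflag"],     -2.523, -1.066),  # t = 1
    (lambda f: f["dstport_low"] == 443,  2.139, -0.226),  # t = 2
    (lambda f: f["byte"] >= 64.5,       -0.202,  0.446),  # t = T
]
score = stump_score({"tcpflag": "SA", "dstport_low": 443, "byte": 128}, stumps)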