Understanding traffic activities in large networks


Tackling Network Management Problems
with Machine Learning Techniques
Preliminary Oral Exam
Yu Jin
Advisor: Professor Zhi-Li Zhang
Motivation
 Effective solutions are required for various problems in large
networks:
 Traffic classification
 Anomaly detection
 Host/traffic profiling
 Trouble-shooting
 Manual rule-based solutions are either costly or unavailable
 Scalability is a major issue
Our Solution
 Abstract these problems into machine learning problems
 Select advanced machine learning algorithms to solve these
problems
 Address the scalability issue
 Deliver a system and evaluate it in a real operational
environment
Outline
 Work in the past – Traffic classification
 Design and implementation of a modular machine learning
architecture for flow-level traffic classification
 Analysis of traffic interaction patterns using traffic activity
graphs (TAGs)
 Visualizing spatial traffic class distribution using colored TAGs
 Traffic classification using colored TAGs
 Ongoing work
 Customer ticket prediction and trouble-shooting in DSL
networks
 Summary and timetable
A Light-Weight Modular Machine Learning Approach
to Large-Scale Network Traffic Classification
Motivation
 Required for:
 Security monitoring
 Traffic policing and prioritization
 Predicting application trends
 Identifying new applications
 An interesting research topic
 Across multiple research areas:
machine learning, traffic profiling, social network analysis, system
optimization
 Design an operational system for a large ISP network
 Learn valuable lessons for solving other practical problems
Challenges
 Scalability: training and operating on 10Gbps links
 Accuracy: similar performance to the rule-based classifier
 Stability: remain accurate without human intervention
 Versatility: reconfigurable
[Figure: new flow arrival rate (per minute) for TCP and UDP on a 1Gbps link over one week (Mon-Sun)]
Current solution – Rule-based classifier
 Matches layer-4/layer-7 packet headers against manually defined rules
 Expensive in operation (flow/packet sampling is required)
 Expensive in deployment (special hardware is required)
 Inapplicable if the packet is encrypted, e.g., end-to-end IPSec traffic
 However, the rule-based classifier can provide a good set of “labeled” flow data
A machine learning solution
[Diagram: raw traffic data is collected from several networks (Network 0-4); the training data is labeled by the rule-based classifier and used to train the machine learning classifier]
A modular machine learning architecture
 A modular architecture for parallelization in both training and operation
 First-level modularization: pre-partition the data by flow features (e.g., flow size, IP protocol)
 Better accuracy
 Higher scalability from parallelization
[Diagram: flow data → flow partition (by flow size, IP protocol, etc.) → classification on partitions 1 through m → predictions on partitions 1 through m]
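A minimal sketch of the first-level partitioning step, in Python (the feature names and the size threshold are assumptions for illustration, not the system's actual rules):

from collections import defaultdict

def partition_key(flow):
    # Route each flow to a partition keyed by IP protocol and a
    # coarse flow-size bucket; the threshold here is illustrative.
    size_bucket = "small" if flow["bytes"] < 10_000 else "large"
    return (flow["protocol"], size_bucket)

def partition_flows(flows):
    partitions = defaultdict(list)
    for flow in flows:
        partitions[partition_key(flow)].append(flow)
    return partitions  # each partition is handled by its own classifier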
Second level modularization
 From a single k-class classifier to k 1-vs-all binary classifiers
 Accelerates training and operation
 Low memory consumption
 Parallelization
 No significant performance loss
[Diagram: flow data in partition j feeds binary classifiers for application classes 1 through k; the prediction on partition j is the class with the maximum posterior P(Cj | x)]
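A hedged sketch of the combination rule: each per-class binary classifier emits a calibrated posterior P(Cj | x), and the class with the largest posterior wins (the classifier interface below is an assumption for illustration):

def predict(binary_classifiers, x):
    # binary_classifiers: dict mapping class name -> model exposing a
    # posterior(x) method that returns the calibrated P(Cj | x).
    posteriors = {cls: clf.posterior(x) for cls, clf in binary_classifiers.items()}
    return max(posteriors, key=posteriors.get)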
Training a binary classifier
 Sampling is necessary due to the huge amount of training data (90
million TCP flows plus 85 million UDP flows)
 Weighted threshold sampling for a more balanced and
representative training set
import random

def threshold_sample(flows_in_class, theta):
    # Weighted threshold sampling: keep every flow of a small class;
    # downsample a large class at rate theta / count(Cj) so that it
    # contributes roughly theta flows to the training set.
    if len(flows_in_class) <= theta:
        return flows_in_class
    rate = theta / len(flows_in_class)
    return [f for f in flows_in_class if random.random() < rate]
[Pie charts: application-class distribution of the training data before and after weighted threshold sampling. Classes: Business, Chat, DNS, FileSharing, FTP, Games, Mail, Multimedia, NetNews, SecurityThreat, VoIP, Web. Before sampling the classes are highly skewed (63.276% in one class, several below 0.1%); after sampling the shares are much more balanced (mostly 4%-11% each)]
Selection of binary classifier
 Any classifier can be used as a component in our traffic classification
architecture
 Boosting decision stumps (1-level decision trees)
 Fast
 Accurate
 Simple
 Implicit L1 regularization
[Figure: TCP flow error rates for different binary classifiers. The “non-linear” classifier Boosting Trees (BTree) has the best performance; Boosting Stumps (BStump) is slightly better than L1-Maxent]
Logistic calibration
 The IID assumption is violated by the weighted threshold sampling method
 The raw AdaBoost score output makes direct combination of the binary classification results infeasible
 We need to calibrate the binary classifiers
 Address the difference in traffic distributions
 Convert scores fc(x) to posterior probabilities P(C|x)
[Figure: reliability diagram for TCP Web]
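A minimal sketch of a logistic (Platt-style) calibrator fit on held-out (score, label) pairs; the system's exact calibration procedure may differ, and scikit-learn is used purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibrator(scores, labels):
    # Fit P(C | x) = sigmoid(a * fc(x) + b) on an independent sample.
    model = LogisticRegression()
    model.fit(np.asarray(scores).reshape(-1, 1), labels)
    return model

def posterior(calibrator, score):
    # Map a raw boosting score to a calibrated posterior probability.
    return calibrator.predict_proba([[score]])[0, 1]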
Training Architecture for a binary classifier
 One binary classifier for each application class
 One calibrator is trained from the classification results on a small independent flow sample (simple random sample)
[Diagram: the threshold-sampled training data trains the binary classifier; an independent sample yields a reliability diagram from which the calibrator is trained; the binary classifier plus its calibrator form the application classifier]
Performance Evaluation - Accuracy
 Our classifier can reproduce the classification results from the rule-based classifier with high accuracy
 Direct training of a multi-class classifier has little gain in accuracy according to testing on small samples
[Bar chart: flow error rates before and after calibration for BStump and NBayes on TCP and UDP traffic; reported values include 26.92%, 13.76%, 10.19%, 7.64%, 3.07%, 1.77%, 1.34%, and 0.34%]
Scalability
 Scalability in training
 Accuracy increases with the training data size and the number of iterations
 Less than 1.5 hours with 700K training samples and 640 iterations
 Scalability in operation
 With only basic optimization
 Recall that on 2 x 1Gbps links, the new flow arrival rate is 450K/min
 We achieve 800K/min with a single thread
 Close to 7M/min with 10 threads, which can scale up to 10Gbps links
Evaluation on Stability
 Temporal stability
 After two months, the flow error rates are 3±0.5% for TCP traffic and 0.4±0.04% for UDP traffic
 After one year, the TCP flow error rate is 5.48±0.54% and the UDP flow error rate is 1.22±0.2%
 Spatial stability
 Train and test at two geolocations
Importance of the Port number
 Using our architecture, we obtain an error rate of 4.1% for TCP and 0.72% for UDP with only port features (3.13% for TCP and 0.35% for UDP after adding other flow features)
 We use a port graph to visualize and understand the machine-learned port rules
 TCP Multimedia uses port 554 and ports 5000-5180
 UDP Games uses port 88 (Xbox)
 UDP Chat uses ports 6661-6670
Summary
 We have designed a modular machine learning architecture for large-scale ISP/enterprise network traffic classification
 The system scales up to 10Gbps links and remains accurate for one
year on multiple sites without re-training
 We have conducted evaluation on a large operational ISP network
 What if the port number and other flow-level statistics are unavailable? For example, when classifying end-to-end IPSec traffic.
IPSec traffic
 Limited traffic statistics for IPSec traffic
[Diagram: an IPSec header followed by the encrypted inner packet; no port number, protocol, payload, etc. are visible. Only the number of packets, the number of bytes, the average packet inter-arrival time, and the average packet size in both directions are observable]
 Only 80% accuracy using the proposed machine learning architecture
 Our solution: classification on traffic activity graphs
Visualizing and Inferring Network Applications
using Traffic Activity Graphs (TAGs)
Traffic Activity Graphs
 Nodes represent the hosts in the network
 Edges represent the interaction between these hosts
 Defined on a set of flows collected in a time period T
 Help us study the communication patterns of different
applications
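A minimal sketch of building a TAG from flow records collected in a window T, using networkx for illustration (the flow-record field names are assumptions):

import networkx as nx

def build_tag(flows):
    # Nodes are hosts; an edge means the two hosts exchanged at least
    # one flow during the observation period T.
    g = nx.Graph()
    for flow in flows:
        g.add_edge(flow["src_ip"], flow["dst_ip"])
    return g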
[Figures: example TAGs between UMN hosts and the Internet for HTTP, Email, and Gnutella traffic; evolution of the HTTP, Email, AOL IM, and DNS TAGs as each grows from 1K to 3K edges]
Properties of TAGs
 We observe differences in basic statistics, such as graph density, average in/out degree, etc.
 All TAGs contain a giant connected component (GCC), which accounts for more than 85% of all the edges (see the sketch below for computing this fraction).
[Bar chart: GCC size as a fraction of edges per application TAG; all values exceed 85%]
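A hedged sketch of computing the GCC edge fraction with networkx (illustrative only; the talk does not specify the implementation):

import networkx as nx

def gcc_edge_fraction(g):
    # Fraction of all edges that lie inside the giant connected component.
    gcc_nodes = max(nx.connected_components(g), key=len)
    return g.subgraph(gcc_nodes).number_of_edges() / g.number_of_edges()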
Understanding the Interaction Patterns
by Decomposing TAGs
 Block structures in the adjacency matrices indicate dense subgraphs in TAGs
[Figure: reordered 0/1 adjacency matrices of the HTTP, Email, AOL IM, BitTorrent, and DNS TAGs, each showing block structures]
TAG Decomposition using Tri-Nonnegative Matrix Factorization
 Extracting dense subgraphs can be formulated as a co-clustering problem, i.e., cluster hosts into inside host groups and outside host groups, then extract pairs of groups with more edges connected (higher density).
 This co-clustering problem can be solved by the tri-nonnegative matrix factorization (tNMF) algorithm, which minimizes
$\min_{R,H,C \ge 0} \; \| A - R H C \|_F^2$
where A is the adjacency matrix associated with the TAG, R (m-by-k) is the row-group membership indicator matrix, C (r-by-n) is the column-group membership indicator matrix, and H (k-by-r) is the group density matrix, with entries proportional to subgraph density. Since R is m-by-k and C is r-by-n, the product R H C is a low-rank approximation of A with rank at most min(k, r). We identify dense subgraphs based on the large entries in H.
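A minimal sketch of tNMF via standard multiplicative updates (this plain alternating scheme is an illustration; the actual algorithm may add orthogonality or indicator constraints on R and C):

import numpy as np

def tnmf(A, k, r, iters=200, eps=1e-9):
    # Minimize ||A - R H C||_F^2 with R, H, C >= 0 by multiplicative updates.
    m, n = A.shape
    rng = np.random.default_rng(0)
    R, H, C = rng.random((m, k)), rng.random((k, r)), rng.random((r, n))
    for _ in range(iters):
        R *= (A @ C.T @ H.T) / (R @ H @ C @ C.T @ H.T + eps)
        H *= (R.T @ A @ C.T) / (R.T @ R @ H @ C @ C.T + eps)
        C *= (H.T @ R.T @ A) / (H.T @ R.T @ R @ H @ C + eps)
    return R, H, C  # large entries in H mark dense (row-group, column-group) pairs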
Subgraph prototypes
 Recall that inside (UMN) hosts are (likely) service requesters and outside hosts are service providers.
 Based on the number of inside/outside hosts in each subgraph, we propose three prototypes:
 In-star: one inside client accesses multiple outside servers
 Out-star: multiple inside clients access one outside server
 Bi-mesh: multiple inside clients interact with many outside servers
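A hedged sketch of mapping a dense subgraph to a prototype from its inside/outside host counts (the exact decision rule is an assumption for illustration):

def prototype(n_inside, n_outside):
    # One inside host fanning out -> in-star; one outside host
    # serving many inside hosts -> out-star; many-to-many -> bi-mesh.
    if n_inside == 1 and n_outside > 1:
        return "in-star"
    if n_inside > 1 and n_outside == 1:
        return "out-star"
    return "bi-mesh"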
Characterizing TAGs with subgraph prototypes
 Different application TAGs contain different types of subgraphs
[Figure: subgraph prototype composition of the HTTP, Email, AOL IM, BitTorrent, and DNS TAGs]
 We can distinguish and characterize applications based on their subgraph components
 What do these subgraphs mean?
Interpreting HTTP bi-mesh structures
 Most star structures are due to popular servers or active clients
 We can explain more than 80% of the HTTP bi-meshes identified in one day
 Server correlation driven
 Server farms: Lycos, Yahoo, Google
 Correlated service providers
 CDNs: LLNW, Akamai, SAVVIS, Level3
 Advertising providers: DoubleClick, etc.
 User interest driven
 News: WashingtonPost, New York Times, Cnet
 Media: ImageShack, casalemedia, tl4s2
 Online shopping: Ebay, Costco, Walmart
 Social networks: Facebook, MySpace
[Diagram: bi-meshes formed through server-side redirection]
How are the dense subgraphs
connected?
[Figure: connection patterns between dense subgraphs: (A) randomly connected stars, (B) tree: client/server dual role, (C) pool, (D) correlated pool spanning AS1 and AS2] (SIGMETRICS/Performance 2009)
Summary
 We introduce the notion of traffic activity graphs (TAGs)
 Different applications show different interaction patterns
 We propose a tNMF-based graph decomposition method to help understand the formation of application TAGs
 Can we classify different application classes based on TAGs?
Traffic classification using
Collective Traffic Statistics
Colored TAGs
 Different applications are displayed as edges with different
colors in TAGs
[Figures: the original colored TAG, and the same TAG with the Web and FileSharing traffic removed]
Characterizing Colored TAGs
 Clustering effects: edges with the same color tend to cluster together
 Attractive (A) / repulsive (R) effects between different colors
 The collective traffic statistics summarize the spatial distribution of application classes in TAGs
Methodology
 A two-step approach:
 Bootstrapping: initial edge classification based only on the traffic statistics, producing an initially labeled TAG
 Graph calibration: edge color calibration using only the colored neighborhood information in the initially labeled traffic graph
[Diagram: input (unclassified TAG & traffic statistics) → bootstrapping → initially labeled TAG → graph calibration → prediction on the traffic graph]
Training the Two-Step Model
 The two-step model can be integrated into the existing traffic classification system easily
 The bootstrapping step uses the traffic statistics associated with each edge to conduct an initial classification of the edges
 The available traffic statistics depend on the specific application
 The graph calibration step uses collective traffic statistics in the TAG to reinforce/correct the initial labels
 Collective traffic statistics are encoded as histograms (see the sketch after the figure)
[Figure: an edge e_ij whose endpoints carry normalized class histograms, e.g., h_i over {Web, FS, Mail} and h_j over {Web, FTP, Mail}]
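A minimal sketch of the per-endpoint histogram encoding, assuming an initially labeled TAG g and an edge-label map (the data structures are illustrative):

from collections import Counter

def neighborhood_histogram(g, node, edge_labels):
    # edge_labels maps frozenset({u, v}) -> current class color of edge (u, v).
    # Returns the normalized class histogram over the node's incident edges.
    counts = Counter(edge_labels[frozenset((node, nbr))] for nbr in g[node])
    total = sum(counts.values())
    return {cls: c / total for cls, c in counts.items()}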
Evaluation on Network-Level Traffic Classification
 Packet header information (e.g., port number, TCP flags) is
unavailable for bootstrapping, similar to the situation of
classifying end-to-end IPSec traffic
 How accurately can we classify such traffic when the flow-level classification system only achieves 80% accuracy due to the lack of traffic features?
Evaluation on Accuracy
 Our graph-based calibration reduces the error rate by 50%!
 The classifier remains stable across time and geolocation.
[Bar chart: edge accuracy of the bootstrap step vs. the calibration step after 2 days, 1 week, 1 month, 1 year, and 1 month at site 2 (y-axis 74%-92%)]
Evaluation on Per-Class Classification
 The two-step method improves the accuracy for all traffic classes
 The repulsive rules enable us to improve on the traffic classes
with very poor initial labeling.
Evaluation on Real-time Performance
 We can implement the two-step approach as a real-time
system with little additional cost
Evaluation on Flow-Level Traffic
Classification
 We have access to all packet header information
 How much will the collective traffic statistics improve the
overall accuracy of our system?
Evaluation on Accuracy
 We achieve 15% reduction in errors within a month
 The F1 scores are improved for most application classes
Summary
 We introduced the concept of colored TAGs
 We proposed a two-step model to utilize the spatial
distribution of application classes in TAGs (collective traffic
statistics) to help improve the classification accuracy
 The collective traffic statistics help reduce 50% of the errors
for classification at the network layer
 Graph calibration reduces the errors of flow-level traffic classification by 15%
Trouble-shooting in Large DSL Networks
(work in progress)
Motivation
 The current solution for trouble-shooting in DSL networks is reactive and inefficient, and can potentially lead to customer churn
Challenges
 Millions of users
 A large number of devices on each DSL line which cannot be
controlled remotely
 Many possible locations where the line problem can happen
Methodology
 Trouble Locator
 Measure the line condition between the DSL server and the
cable modem for each customer
 Use machine learning techniques to learn the correlation
between different line problems and our line measurements
 Ticket Predictor
 Maintain periodic measurements for each customer (every
Saturday night)
 Learn the correlation between the measurement history and the potential line problems (see the sketch below)
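A hedged sketch of how the Ticket Predictor could be cast as supervised learning over the weekly measurement history (the feature layout, label definition, and choice of model are assumptions; the talk does not specify them):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_ticket_predictor(histories, had_ticket):
    # histories: one fixed-length row of weekly line measurements per
    # customer; had_ticket: 1 if a trouble ticket followed, else 0.
    model = RandomForestClassifier(n_estimators=100)
    model.fit(np.asarray(histories), np.asarray(had_ticket))
    return model  # rank customers by predicted ticket probability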
Overview of the Proactive Solution
 Proactively resolve line problems before the customer
complains
Planned Trial
 Evaluate our method in an operational DSL network
Timetable
 Design of the DSL network trouble-shooting system (Feb. 2010 to Apr. 2010)
 Implementation of the system and offline evaluation (May 2010 to Jul. 2010)
 Trial in an operational DSL network (Aug. 2010 to Oct. 2010)
 Thesis (Nov. 2010 to Jan. 2011)
Publications before 2010
 Aiyou Chen, Yu Jin, Jin Cao, Li (Erran) Li, "Tracking Long Duration Flows in Network Traffic," to appear in the 29th IEEE International Conference on Computer Communications (INFOCOM 2010, mini-conference) (acceptance ratio 24.3%).
 Yu Jin, Esam Sharafuddin, Zhi-Li Zhang, "Unveiling Core Network-Wide Communication Patterns through Application Traffic Activity Graph Decomposition," in Proc. of the 2009 ACM International Conference on Measurement and Modeling of Computer Systems (ACM SIGMETRICS 2009) (acceptance ratio 14.9%).
 Jin Cao, Yu Jin, Aiyou Chen, Tian Bu, Zhi-Li Zhang, "Identifying High Cardinality Internet Hosts," in Proc. of the 28th Conference on Computer Communications (IEEE INFOCOM 2009) (acceptance ratio 19.6%).
 Yu Jin, Esam Sharafuddin, Zhi-Li Zhang, "Identifying Dynamic IP Address Blocks Serendipitously through Background Scanning Traffic," in Proc. of the 3rd International Conference on emerging Networking EXperiments and Technologies (CoNEXT 2007), New York, NY, December 10, 2007 (acceptance ratio 19.5%).
 Yu Jin, Zhi-Li Zhang, Kuai Xu, Feng Cao, Sambit Sahu, "Identifying and Tracking Suspicious Activities through IP Gray Space Analysis," in Proc. of the 3rd Workshop on Mining Network Data (MineNet'07), San Diego, CA, June 12, 2007 (in conjunction with ACM SIGMETRICS'07).
 Yu Jin, Gyorgy Simon, Kuai Xu, Zhi-Li Zhang, Vipin Kumar, "Gray's Anatomy: Dissecting Scanning Activities Using IP Gray Space Analysis," in Proc. of the Second Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML07), Boston, MA, April 10, 2007 (in conjunction with USENIX NSDI'07).
Thanks!
Questions?
Backup slides
Characteristics of app. TAGs
 These statistics show differences between various app. TAGs
 They do not explain the formation of TAGs
TNMF algorithm related
 Iterative optimization algorithm
 Group density matrix derivation
Backup for traffic classification
Default flow features
Comparison of different algorithms for multi-class classification
Training time for different machine
learning algorithms
Selection of flow size for partitioning
Boosting decision stumps
[Diagram: a sequence of decision stumps t = 1, ..., T. Stump t=1 tests whether tcpflag contains S (no: S- = -1.066; yes: S+ = -2.523); stump t=2 tests dstport_low = 443 (no: S- = -0.226; yes: S+ = 2.139); ...; stump t=T tests byte >= 64.5 (no: S- = 0.446; yes: S+ = -0.202)]
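A minimal sketch of how the boosted stumps produce a class score: each 1-level tree contributes S+ or S- depending on its test, and the contributions are summed into fc(x) (the predicate encoding below is an assumption; the stump values mirror the diagram):

def stump_score(flow, stumps):
    # Each stump is (predicate, s_plus, s_minus); the boosted score
    # fc(x) is the sum of the per-stump contributions.
    return sum(s_plus if pred(flow) else s_minus
               for pred, s_plus, s_minus in stumps)

# Stumps mirroring the diagram (illustrative encoding of the tests):
stumps = [
    (lambda f: "S" in f["tcpflag"], -2.523, -1.066),
    (lambda f: f["dstport_low"] == 443, 2.139, -0.226),
    (lambda f: f["byte"] >= 64.5, -0.202, 0.446),
]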