On the Validation of Traffic Classification Algorithms

Géza Szabó, Dániel Orincsay, Szabolcs Malomsoky, István Szabó
Traffic Lab, Ericsson Research Hungary
Aim & Contents
• Aim:
  – Introduce our novel validation method, which makes it possible to measure the accuracy of traffic classification methods
• Contents:
  – Requirements – How should validation be done?
  – Related work – How is it currently done?
  – Our proposal – What have we proposed?
  – Working mechanism – How does our proposal work?
  – Validating a state-of-the-art traffic classification method – What have we learnt from the validation?
  – Future work – What else can be done with the proposed method?
2 /17
On the Validation of Traffic Classification Algorithms
2008-04-29
Requirements – How should validation be done?
• Objective of traffic classification:
  – Identify applications in passively observed traffic
• Validation of the classification method by an active test:
  – It should be independent of the classification methods
  – The test should provide reference information about each packet
  – The test should be deterministic
  – Feasibility: large tests can be created in a highly automated way
  – Realistic environment
Related work – How is it currently done?
CURRENTLY
• Weak and ad hoc validation
• No reliable and widely accepted validation technique
• No reference packet trace with well-defined content is available

Traffic classification methods and their weaknesses:
  – Port based classification: dynamically allocated ports
  – Signature based classification: proprietary protocols, encryption, signatures have to be kept up to date
  – Connection pattern based classification: lot of flows, simultaneous applications
  – Statistics based classification: previously well-classified traces
  – Information theory based classification: just a hint
  – Combined classification method

Measurement data:
  – Manually created / active measurement: nonrealistic environment
  – Publicly available: header traces → port based method
  – Non-publicly available: impossible to validate by others
  – Online measurement: impossible to repeat with the same conditions

Validation methods:
  – Manual validation
  – Use of another traffic classification method

Cited examples:
  – S. Sen and J. Wang: Analyzing Peer-to-Peer Traffic Across Large Networks
  – J. Erman, M. Arlitt and A. Mahanti: Traffic Classification Using Clustering Algorithms
  – T. Karagiannis, K. Papagiannaki and M. Faloutsos: BLINC: Multilevel Traffic Classification in the Dark
  – L. Bernaille et al.: Traffic Classification On The Fly
OUR PROPOSAL
The proposed method for validation
• Principle:
  – Packets are collected into flows at the traffic generating terminal
  – Flows are marked with the identifier of the application that generated the packets of the flow
• The main requirements on the realization of the method:
  – It should not deteriorate the performance of the terminal
  – The byte overhead of marking should be negligible
• The preferred realization is a driver that can be easily installed on terminals
[Figure: components of the terminal. User mode: IExplorer, Outlook, Skype. Kernel mode: TCP, UDP, IP, packet marking driver (NDIS hook driver), NDIS, network drivers. Outside the terminal: measurement point, Internet.]

Network connections–Process ID association
Protocol  Local Address:Port  Foreign Address:Port  State        Process ID
TCP       192.168.0.1:2154    82.99.36.186:80       Established  5126
TCP       192.168.0.1:2189    86.101.125.82:110     Established  1932
UDP       0.0.0.0:2196        212.19.63.112:9612    Established  2056
...

Process ID–Application name association
Process ID  Application
5126        IExplorer.exe
1932        Outlook.exe
2056        Skype.exe
...
The position of the proposed driver within the terminal
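
The two associations on this slide can be combined into a single lookup from a connection to the application that owns it. The following is a minimal sketch of that join, assuming the tables are already available as in-memory data; the names `Connection`, `CONNECTIONS`, `PROCESSES` and `application_for` are illustrative and do not come from the driver itself.

```python
from typing import List, NamedTuple, Optional


class Connection(NamedTuple):
    """One row of the network connections-process ID association."""
    protocol: str
    local: str      # "address:port"
    foreign: str    # "address:port"
    state: str
    pid: int


# Rows copied from the example tables on the slide.
CONNECTIONS: List[Connection] = [
    Connection("TCP", "192.168.0.1:2154", "82.99.36.186:80", "Established", 5126),
    Connection("TCP", "192.168.0.1:2189", "86.101.125.82:110", "Established", 1932),
    Connection("UDP", "0.0.0.0:2196", "212.19.63.112:9612", "Established", 2056),
]

# Process ID-application name association.
PROCESSES = {5126: "IExplorer.exe", 1932: "Outlook.exe", 2056: "Skype.exe"}


def application_for(protocol: str, local: str, foreign: str) -> Optional[str]:
    """Return the executable name owning the connection, or None if unknown."""
    for conn in CONNECTIONS:
        if (conn.protocol, conn.local, conn.foreign) == (protocol, local, foreign):
            return PROCESSES.get(conn.pid)
    return None
```

For example, `application_for("TCP", "192.168.0.1:2154", "82.99.36.186:80")` resolves to `IExplorer.exe` through process ID 5126.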
Working mechanism
1. The packet is examined to determine whether it is incoming or outgoing
2. In case of an outgoing packet, the size of the packet is examined
   – The process continues only with packets that are smaller than the MTU decreased by the size of the marking
3. The process continues only with TCP or UDP packets
4. According to the five-tuple identifier of the packet, it is checked whether information is already available about which application the flow belongs to
5. Query the operating system
6. Need marking:
   – Randomly
   – Only first
   – Leave the first
   – No mark
[Figure: flowchart of the driver. Packet passing through the interface → Outgoing packet? → Proper size? → Protocol TCP/UDP? → Exist info? (Get info if not) → Need to mark? → Mark → Send. The NO and Other branches lead to Send without marking.]
The working mechanism of the introduced driver
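
The decision steps of the flowchart can be sketched as a single function. This is only an illustration under stated assumptions: the real implementation is a kernel-mode driver, while here `Packet`, `decide`, `query_os` and `need_mark` are hypothetical names, and the marking policy (randomly, only first, leave the first, no mark) is passed in as a callable rather than implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# (protocol, local address, local port, foreign address, foreign port)
FiveTuple = Tuple[str, str, int, str, int]

MARK_SIZE = 4  # bytes added to the packet by the marking


@dataclass
class Packet:
    outgoing: bool
    size: int        # packet size in bytes
    protocol: str    # e.g. "TCP", "UDP", "ICMP"
    flow: FiveTuple


def decide(
    packet: Packet,
    mtu: int,
    flow_table: Dict[FiveTuple, str],
    query_os: Callable[[FiveTuple], Optional[str]],
    need_mark: Callable[[Packet], bool],
) -> Optional[str]:
    """Return the application name to mark the packet with, or None to send it unmarked."""
    if not packet.outgoing:                      # step 1: only outgoing packets
        return None
    if not packet.size < mtu - MARK_SIZE:        # step 2: room for the marking
        return None
    if packet.protocol not in ("TCP", "UDP"):    # step 3: only TCP or UDP
        return None
    app = flow_table.get(packet.flow)            # step 4: is the flow already known?
    if app is None:
        app = query_os(packet.flow)              # step 5: ask the operating system
        if app is None:
            return None
        flow_table[packet.flow] = app            # remember the answer for later packets
    return app if need_mark(packet) else None    # step 6: apply the marking policy
```

Caching the answer in `flow_table` reflects step 4: the operating system has to be queried only once per flow.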
Place of marking
• Extending the original IP packet with one option field
  – Router Alert option field
    • Transparent for both the routers on the path and the receiver host (according to RFC 2113 [3])
• The first two characters of the corresponding executable file name are added
  – The size of the packet is increased by 4 bytes
  – The packet size field in the IP header is also increased by 4 bytes
  – The header checksum is recalculated
A marked packet of the BitTorrent protocol
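
The marking steps can be illustrated on a raw IPv4 packet. The sketch below assumes the layout of the Router Alert option from RFC 2113 (type 148, length 4, two value octets) with the two characters placed in the value octets; the function names are illustrative and `BitTorrent.exe` is a hypothetical executable name. It also increments the header length (IHL) field, which the IPv4 header format requires but the slide does not list.

```python
import struct

ROUTER_ALERT_TYPE = 0x94  # option type 148 (RFC 2113)
ROUTER_ALERT_LEN = 4      # type octet + length octet + 2 value octets


def ipv4_checksum(header: bytes) -> int:
    """One's-complement checksum over the 16-bit words of an IPv4 header."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def mark_packet(packet: bytes, exe_name: str) -> bytes:
    """Append a 4-byte option carrying the first two characters of exe_name."""
    ihl = packet[0] & 0x0F
    if ihl + 1 > 15:
        raise ValueError("no room left for another option in the IP header")
    mark = exe_name[:2].encode("ascii")
    if len(mark) != 2:
        raise ValueError("executable name must have at least two characters")

    header = bytearray(packet[: ihl * 4])
    payload = packet[ihl * 4:]

    header[0] = (header[0] & 0xF0) | (ihl + 1)            # IHL grows by one 32-bit word
    total_len = struct.unpack("!H", header[2:4])[0] + 4   # packet size field + 4 bytes
    struct.pack_into("!H", header, 2, total_len)
    header += bytes([ROUTER_ALERT_TYPE, ROUTER_ALERT_LEN]) + mark

    struct.pack_into("!H", header, 10, 0)                 # zero the checksum field first
    struct.pack_into("!H", header, 10, ipv4_checksum(bytes(header)))
    return bytes(header) + payload
```

A packet marked with `mark_packet(packet, "BitTorrent.exe")` carries the bytes `0x94 0x04 'B' 'i'` at the end of its IP header.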
PROOF-OF-CONCEPT
Reference measurement
• Available at http://pics.etl.hu/~szabog/measurement.tar
• In a separate access network
• Our driver has been installed onto all computers on this network
• Duration of the measurement: 43 hours
• Captured data volume: 6 Gbytes, containing 12 million packets
• The measurement contains the traffic of the most popular
  – P2P protocols:
    • BitTorrent
    • eDonkey
    • Gnutella
    • DirectConnect
  – VoIP and chat applications:
    • Skype
    • MSN Live
  – FTP sessions
  – Download manager
  – E-mail sending, receiving sessions
  – Web based e-mail (e.g., Gmail)
  – SSH sessions
  – SCP sessions
  – FPS, MMORPG gaming sessions
  – Streaming:
    • Radio
    • Video
    • Web based
The traffic mix of the measurement
Validation results (1) – Success
• Combined traffic classification method (described in [1]) with the addition that the classification of VoIP applications has been extended with ideas from [2]
• Accurately identified:
  – E-mail
  – File transfer
  – Streaming
  – Secure channel
  – Gaming traffic
• Success due to:
  – Well-documented protocols
  – Open standards
  – Do not constantly change
• Difficulties in case of…?
  – Encryption:
    • But: the session initiation phase is critical, as this phase can be identified accurately
    • Success: SSH or SCP
The results of the classification [1] compared to the reference measurement
[1] G. Szabo, I. Szabo and D. Orincsay: Accurate Traffic Classification
[2] M. Perenyi and S. Molnar: Enhanced Skype Traffic Identification
Validation results (2) – P2P
Difficulties:
• Many TCP flows containing 1-2 SYN packets, probably to disconnected peers
  – No payload in these packets => the signature based methods cannot work
  – Dynamically allocated source ports towards not well-known destination ports => the port based methods fail
  – Server search and P2P communication heuristic [1] methods also fail => there are no other successful flows to such IPs
• Also some small non-P2P flows were misclassified into the P2P class
  – Not fully proper content of the port-application database
  – Creating too many port-application associations easily results in the rise of the misclassification ratio
• The constant change of P2P protocols
  – New features added to P2P clients day-by-day
  – Working mechanism can be typical for a selected client, not the whole protocol itself
The results of the classification [1] compared to the reference measurement
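
The problematic flows described above (one or two SYN packets, no payload) can be picked out of a trace before examining why a classifier fails on them. The following is a rough sketch assuming the packets have already been parsed into simple records; `TcpPacket`, `syn_only_flows` and the `max_syn` threshold are illustrative names, not part of the method in [1].

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple

# (source address, source port, destination address, destination port)
FlowKey = Tuple[str, int, str, int]


class TcpPacket(NamedTuple):
    flow: FlowKey
    syn: bool
    ack: bool
    payload_len: int


def syn_only_flows(packets: Iterable[TcpPacket], max_syn: int = 2) -> List[FlowKey]:
    """Return flows made up solely of at most max_syn payload-free SYN packets."""
    per_flow: Dict[FlowKey, List[TcpPacket]] = defaultdict(list)
    for pkt in packets:
        per_flow[pkt.flow].append(pkt)

    result = []
    for key, pkts in per_flow.items():
        only_syn = all(p.syn and not p.ack and p.payload_len == 0 for p in pkts)
        if only_syn and len(pkts) <= max_syn:
            result.append(key)
    return result
```

Such flows carry no payload for signatures and have no established connection to analyse, so only the reference marking reveals which application generated them.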
Validation results (3) – Philosophy
• Traffic which is the derivation of other traffic:
  – E.g., DNS traffic
  – MSN: HTTP protocol for transmitting chat messages
  – MSN client transmits advertisements over HTTP, but this cannot be recognized as deliberate web browsing
• Hit := the classification outcome and the generating application type (the validation outcome) agreed
  – E.g., the chat on the DirectConnect hubs, which has been classified as chat, could have been considered as actually correct, but in this comparison it was considered as misclassification
The results of the classification [1] compared to the reference measurement
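
The hit definition above translates directly into a per-class hit ratio. The following minimal sketch assumes each flow is given as a pair of the reference class (from the marking) and the class assigned by the classifier; the function name and the class labels in the example are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def hit_ratios(flows: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each reference class to the fraction of its flows classified identically.

    Each element of flows is (reference_class, classified_class).
    """
    total: Counter = Counter()
    hits: Counter = Counter()
    for reference, classified in flows:
        total[reference] += 1
        if reference == classified:   # a hit: both outcomes agree
            hits[reference] += 1
    return {cls: hits[cls] / total[cls] for cls in total}
```

Under this definition, a DirectConnect hub chat flow whose reference class is P2P but which the classifier labels as chat counts as a miss, as in `hit_ratios([("P2P", "P2P"), ("P2P", "chat")])`, which yields 0.5 for P2P.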
Validation results (4) – VoIP: MSN, Skype
• High VoIP hit ratio is due to the successful identification of
  – MSN Messenger
  – Skype
• Skype is difficult to identify
  – Same problem as in the case of P2P
  – Proprietary protocol designed to ensure secure communication
• Extension of [1] was straightforward
  – [2] characteristic feature: the application sends packets at an exact 20 sec interval even when there is no ongoing call
  – In [1]: a P2P identification heuristic which was designed to track any message which has a periodicity in packet sending
• The validation showed:
  – The deficiency of the classification of Skype
    • Simple extension of the algorithm
  – Idea of [1] has been validated, as it proved to be robust for the extension with new application recognition
  – Also the validation mechanism proved to be useful
The results of the classification [1] compared to the reference measurement
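
A periodicity check of the kind mentioned above can be sketched in a few lines. This is not the heuristic of [1] or [2], only an illustration of the idea; the function name, the tolerance and the minimum number of matching gaps are assumptions chosen for the example.

```python
from typing import Sequence


def has_periodic_sending(
    timestamps: Sequence[float],
    period: float = 20.0,      # seconds; the interval reported for Skype in [2]
    tolerance: float = 0.5,    # assumed allowed deviation in seconds
    min_gaps: int = 3,         # assumed number of matching gaps required
) -> bool:
    """Return True if enough consecutive packet gaps are close to the period."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    matching = sum(1 for gap in gaps if abs(gap - period) <= tolerance)
    return matching >= min_gaps
```

The input is the list of sending times of the packets between one pair of hosts; traffic interleaved with the periodic packets would have to be filtered out first.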
Summary
• We introduced a new active measurement method which can help in the validation of traffic classification methods
• The introduced method is a network driver
  – It marks the outgoing packets from the clients with an application specific marking
• With the introduced method we created a measurement and used this to validate the method presented in [1]
  – The method has been proved to be working accurately
  – Some deficiencies in the classification:
    • P2P applications
    • Skype
• Benefits:
  – It is independent of the classification methods
  – The test provides reference information about each packet
  – The test is deterministic
  – Feasibility: creates large tests in a highly automated way
Further work
• Use the marking method at the measurement side for online traffic classification
  – Assumptions:
    • The terminals accessing an operator’s network all have the proposed driver installed
    • The driver is made tamper-proof to prevent users from forging the marking
  – Online clustering of the traffic into QoS classes based on the resource requirements of the generating application
  – Used by operators to charge on the basis of the application used by the user
• Extension of the marking with other information about the traffic generating application
  – E.g., version number
    • Operator could track the security risks of an old application
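
On the measurement side, using the marking means reading the option back from captured packets. The sketch below assumes the marking sits in a 4-byte option of type 148 (the Router Alert type from RFC 2113) with the two characters in its value octets; `extract_mark`, `qos_class` and the `QOS_CLASS` table are hypothetical and only illustrate the idea of mapping applications to QoS classes.

```python
from typing import Optional

ROUTER_ALERT_TYPE = 0x94  # option type 148 (RFC 2113)

# Hypothetical mapping from two-character marks to QoS classes.
QOS_CLASS = {"Sk": "interactive", "IE": "best-effort"}


def extract_mark(packet: bytes) -> Optional[str]:
    """Return the two-character mark from the IPv4 options, or None if absent."""
    header_len = (packet[0] & 0x0F) * 4
    options = packet[20:header_len]
    i = 0
    while i < len(options):
        opt_type = options[i]
        if opt_type == 0:          # End of Option List
            break
        if opt_type == 1:          # No Operation, a single octet
            i += 1
            continue
        if i + 1 >= len(options):
            break
        opt_len = options[i + 1]
        if opt_len < 2 or i + opt_len > len(options):
            break                  # malformed option, stop parsing
        if opt_type == ROUTER_ALERT_TYPE and opt_len == 4:
            return options[i + 2:i + 4].decode("ascii", errors="replace")
        i += opt_len
    return None


def qos_class(packet: bytes) -> str:
    """Look up the QoS class for a packet; unmarked or unknown marks fall through."""
    return QOS_CLASS.get(extract_mark(packet) or "", "unclassified")
```

A packet that carries a genuine Router Alert option would also be read as a mark by this sketch, so a deployment would need a way to tell the two apart.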
• Thank you very much for your kind attention!
Questions, discussion…
• Contact:
  – E-mail: [email protected]