Application Identification in information-poor environments
Charalampos Rotsos
What is application identification
Current status
My work
Future plans
Open questions
02/02/2010
1
Why?
E.g., CoS, security, performance analysis
Taxonomy of Application identification techniques
• Deep Packet Inspection
Match payload against well-known protocol signatures
• Statistical Analysis
Extract network measurements (packet size, packet inter-arrival time) and search for patterns (ML, statistical analysis, etc.)
• Behavioral/Graph Analysis
Find connection patterns
Create features based on the connection graph
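As a rough illustration of the statistical-analysis branch, the sketch below summarises packet sizes and inter-arrival times over the first few packets of a flow. The `Packet` type, the `flow_features` helper, the feature names, and the `first_n=5` default are all assumptions made for this example; they are not taken from the talk.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Packet:
    timestamp: float  # capture time in seconds (hypothetical field)
    size: int         # packet length in bytes (hypothetical field)


def flow_features(packets, first_n=5):
    """Summarise the first `first_n` packets of a single flow."""
    head = sorted(packets, key=lambda p: p.timestamp)[:first_n]
    sizes = [p.size for p in head]
    # Inter-arrival time: gap between consecutive packets.
    gaps = [b.timestamp - a.timestamp for a, b in zip(head, head[1:])]
    return {
        "packets": len(head),
        "mean_size": mean(sizes) if sizes else 0.0,
        "std_size": pstdev(sizes) if sizes else 0.0,
        "mean_gap": mean(gaps) if gaps else 0.0,
        "max_gap": max(gaps, default=0.0),
    }


# Toy flow: two small packets followed by two full-size ones.
flow = [Packet(0.00, 60), Packet(0.01, 60), Packet(0.03, 1500), Packet(0.04, 1500)]
features = flow_features(flow)
```

A feature dictionary like this would then be handed to whatever classifier or pattern search is in use.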
Statistical Analysis
???
Focused on flow features
• Which features are high quality?
• Which features are computationally simple?
Progress so far
• The problem is solved
– 5 packets sufficient to classify a flow
– Achieve at least 90% accuracy on all classes
• But not really…
– Difficult to extract required features
– Identification accuracy
– Temporal stability is awful
– Technical issues:
  • Long-running connections are difficult to label
  • What about new applications?
  • What about simplex flows?
  • Measuring is hard
  • Labelling traffic is EVEN harder
  • Hard to keep fast, lightweight, and up to date… all things you need if this is part of your IDS
Can we do better?
• Restate the problem.
• Use information that can be extracted from
current networks (e.g., SNMP, NetFlow).
• Use better machine learning.
• Define models that bridge the gap between
statistical and behavioral properties.
Better ML on NetFlow
• Semi-supervised learning on NetFlow data using Bayesian data analysis.
✓ Better performance than Bayes classifier in Weka
✓ Bayesian modeling provides good parameterization
✓ Efficient reduction of the effect of time dependence of the feature set.
Temporal and Spatial decay
Difficult to keep the model both accurate and flexible
NetFlow doesn’t provide clean separation of classes
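To make the idea of semi-supervised learning on flow records concrete, here is a minimal sketch of one well-known variant: a Gaussian naive Bayes model refined with EM over unlabelled flows. This is a stand-in technique chosen for illustration, not the Bayesian model used in the talk. The two features (mean packet size, duration), the class indices, and all data values are made-up assumptions.

```python
import math


def log_gauss(x, mu, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)


def fit(rows, n_classes, var_floor=1e-3):
    """M-step. `rows` is a list of (features, class_weights) pairs."""
    dims = range(len(rows[0][0]))
    total = sum(sum(w) for _, w in rows)
    model = []
    for c in range(n_classes):
        mass = sum(w[c] for _, w in rows)
        mu = [sum(w[c] * x[j] for x, w in rows) / mass for j in dims]
        var = [max(sum(w[c] * (x[j] - mu[j]) ** 2 for x, w in rows) / mass,
                   var_floor) for j in dims]
        model.append((mass / total, mu, var))  # (prior, means, variances)
    return model


def posterior(x, model):
    """E-step for one flow: class probabilities under naive Bayes."""
    logs = [math.log(prior) + sum(log_gauss(xj, m, v)
                                  for xj, m, v in zip(x, mu, var))
            for prior, mu, var in model]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    norm = sum(exps)
    return [e / norm for e in exps]


def train(labeled, unlabeled, n_classes, iterations=10):
    """Start from the labelled flows, then let unlabelled flows refine the fit."""
    one_hot = [(x, [1.0 if c == y else 0.0 for c in range(n_classes)])
               for x, y in labeled]
    model = fit(one_hot, n_classes)
    for _ in range(iterations):
        soft = [(x, posterior(x, model)) for x in unlabeled]
        model = fit(one_hot + soft, n_classes)
    return model


def predict(x, model):
    probs = posterior(x, model)
    return probs.index(max(probs))


# Toy data: [mean packet size in bytes, duration in seconds]; class 0 and 1
# are arbitrary illustrative application classes.
labeled = [([120.0, 0.5], 0), ([150.0, 0.8], 0),
           ([1400.0, 30.0], 1), ([1300.0, 45.0], 1)]
unlabeled = [[130.0, 0.6], [1450.0, 40.0], [110.0, 0.4], [1350.0, 25.0]]
model = train(labeled, unlabeled, n_classes=2)
```

Each class needs at least one labelled flow so that its weight in `fit` is non-zero. Time dependence could be approximated by down-weighting old flows in `rows`, but that is left out here.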
What is next?
• Richer dataset
– Aggregate flows for ports/hosts/networks
– Increase dimensions by simple feature engineering.
• Better mathematical models
– Incorporate domain-specific knowledge.
– Inference diagram defined by the connection graph.
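The "richer dataset" idea can be sketched as a small aggregation step over flow records. The record fields (`src`, `dst`, `dst_port`, `packets`, `bytes`), the `/24` prefix length, and the derived features below are assumptions for illustration; real NetFlow exports carry more fields and the talk does not specify which aggregates were used.

```python
from collections import defaultdict
from ipaddress import ip_network


def aggregate(flows, key):
    """Sum flow records per key and derive two simple engineered features."""
    totals = defaultdict(lambda: {"flows": 0, "packets": 0, "bytes": 0, "peers": set()})
    for f in flows:
        t = totals[key(f)]
        t["flows"] += 1
        t["packets"] += f["packets"]
        t["bytes"] += f["bytes"]
        t["peers"].add(f["src"])
    return {
        k: {
            "flows": t["flows"],
            "packets": t["packets"],
            "bytes": t["bytes"],
            "distinct_peers": len(t["peers"]),
            "bytes_per_packet": t["bytes"] / t["packets"] if t["packets"] else 0.0,
        }
        for k, t in totals.items()
    }


def by_port(f):
    return f["dst_port"]


def by_host(f):
    return f["dst"]


def by_network(f, prefix=24):
    # strict=False lets ip_network mask off the host bits for us.
    return str(ip_network(f"{f['dst']}/{prefix}", strict=False))


# Toy records using documentation address ranges.
flows = [
    {"src": "10.0.0.1", "dst": "192.0.2.10", "dst_port": 80, "packets": 10, "bytes": 8000},
    {"src": "10.0.0.2", "dst": "192.0.2.10", "dst_port": 80, "packets": 6, "bytes": 3000},
    {"src": "10.0.0.1", "dst": "192.0.2.20", "dst_port": 22, "packets": 4, "bytes": 400},
]
per_port = aggregate(flows, by_port)
per_host = aggregate(flows, by_host)
per_network = aggregate(flows, by_network)
```

Swapping the key function changes the aggregation level without touching the rest of the pipeline, which is one way to compare ports, hosts, and networks as classification units.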
Inference Diagram
[Figure: Alice and Bob, each running a web browser, connect to the same Web Server]
• The flows between Alice and the web server are correlated and correspond to the same application.
• The flows between Alice and the web server and between Bob and the web server also correspond to the same application.
• Research on application identification hasn’t found a framework to accommodate these observations.
Inference Diagram – more difficult
[Figure: clients Alice and Bob, each using random ports; servers: Ftp Server – port 22, Database Server – port 1680, Web Server – port 80]
• Computers will run multiple applications in parallel.
• BUT applications on a particular server will always use a specific port.
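The observation that servers keep a fixed port while clients pick random ones suggests a simple heuristic: an endpoint contacted by several distinct peer endpoints is probably a service. The sketch below is one possible reading of that idea; the `service_endpoints` helper, the `min_peers` threshold, and the client port numbers are hypothetical.

```python
from collections import defaultdict


def service_endpoints(flows, min_peers=2):
    """Return (host, port) endpoints seen talking to at least `min_peers` peers."""
    peers = defaultdict(set)
    for src, sport, dst, dport in flows:
        peers[(src, sport)].add((dst, dport))
        peers[(dst, dport)].add((src, sport))
    return {endpoint for endpoint, seen in peers.items() if len(seen) >= min_peers}


# Toy flows: clients use a fresh random port per connection, servers do not.
flows = [
    ("alice", 50312, "web", 80),
    ("alice", 50313, "web", 80),
    ("bob", 41200, "web", 80),
    ("alice", 50400, "db", 1680),
    ("bob", 41201, "db", 1680),
]
services = service_endpoints(flows)
```

Each client endpoint appears in a single flow and is filtered out, while the two server endpoints survive. A client that reuses a port across connections would need a stricter threshold.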
A first approach!
• A similar problem arises in node labeling
– Aggregate flow records over a defined period
– Use a Markov Random Field model to propagate inferences
– Apply approximate inference methods (Gibbs sampling, message passing)
– In the end, apply some engineering ideas to refine results
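The steps above can be sketched with sum-product message passing on a small pairwise Markov Random Field. Everything specific here is an assumption for illustration: nodes are flows, flows sharing a server endpoint are linked, the pairwise potential is a Potts model with an arbitrary `coupling` of 3.0, and the two labels stand for two application classes. The talk does not state which graph or potentials were actually used.

```python
from collections import defaultdict
from itertools import combinations


def belief_propagation(nodes, edges, priors, n_labels, coupling=3.0, iterations=20):
    """Loopy sum-product on a pairwise MRF with a Potts potential."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    uniform = [1.0] * n_labels
    phi = {n: priors.get(n, uniform) for n in nodes}
    # msgs[(a, b)][x]: what a tells b about b taking label x.
    msgs = {(a, b): uniform[:] for a in nodes for b in neighbours[a]}
    for _ in range(iterations):
        new = {}
        for a, b in msgs:
            # Combine a's prior with messages from all neighbours except b.
            inbound = phi[a][:]
            for c in neighbours[a] - {b}:
                inbound = [i * m for i, m in zip(inbound, msgs[(c, a)])]
            out = [sum(inbound[xa] * (coupling if xa == xb else 1.0)
                       for xa in range(n_labels))
                   for xb in range(n_labels)]
            z = sum(out)
            new[(a, b)] = [o / z for o in out]
        msgs = new
    beliefs = {}
    for n in nodes:
        b = phi[n][:]
        for c in neighbours[n]:
            b = [i * m for i, m in zip(b, msgs[(c, n)])]
        z = sum(b)
        beliefs[n] = [v / z for v in b]
    return beliefs


# Toy flows: id -> (client, server endpoint).
flows = {
    "f1": ("alice", "web:80"), "f2": ("alice", "web:80"), "f3": ("bob", "web:80"),
    "f4": ("alice", "db:1680"), "f5": ("bob", "db:1680"),
}
by_server = defaultdict(list)
for flow_id, (_, server) in flows.items():
    by_server[server].append(flow_id)
edges = [pair for group in by_server.values() for pair in combinations(group, 2)]

# Only f1 and f4 carry label evidence: label 0 for f1, label 1 for f4.
priors = {"f1": [0.9, 0.1], "f4": [0.1, 0.9]}
beliefs = belief_propagation(list(flows), edges, priors, n_labels=2)
```

The unlabelled flows `f2` and `f3` inherit a preference for label 0 from `f1`, and `f5` ends up at roughly `[0.3, 0.7]` because of `f4`. On graphs with cycles, such as the three web flows here, the result is an approximation and convergence is not guaranteed in general.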
Open problems
• Is the model a good approximation?
• What am I classifying and for how long?
• Ports, Hosts or Networks? Is it possible to do
multi-layer analysis?
• Do the approximation techniques converge?
Turning the difficulty up to “Eleven”…
• Compute the performance of an individual traffic flow within a VPN… by monitoring alone.