Application Identification in
information-poor environments
Charalampos Rotsos
What is application identification
Current status
My work
Future plans
Open questions
02/02/2010
Why?
E.g., CoS, security, performance analysis
Taxonomy of Application
identification techniques
• Deep Packet Inspection
Match payload with well-known protocol signatures
• Statistical Analysis
Extract network measurements (packet size, packet
inter-arrival time) and search for patterns (ML,
statistical analysis, etc.)
• Behavioral/Graph Analysis
Find connection patterns
Create features based on the connection graph
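The statistical-analysis branch above can be illustrated with a minimal sketch. It assumes a flow is available as a list of (timestamp, size) tuples and summarises only the first few packets; the function name `flow_features`, the `first_n` parameter and the chosen feature set are hypothetical and are not taken from the slides.

```python
from statistics import mean, pstdev


def flow_features(packets, first_n=5):
    """Summarise the first `first_n` packets of one flow.

    `packets` is a list of (timestamp_seconds, size_bytes) tuples in
    arrival order (an assumed input format, not a real capture API).
    """
    head = packets[:first_n]
    if not head:
        raise ValueError("flow has no packets")
    sizes = [size for _, size in head]
    times = [ts for ts, _ in head]
    # Inter-arrival times between consecutive packets.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "pkt_count": len(head),
        "mean_size": mean(sizes),
        "std_size": pstdev(sizes),
        "mean_iat": mean(gaps) if gaps else 0.0,
        "max_iat": max(gaps) if gaps else 0.0,
    }


# Made-up packets: only the first five are used.
example = [(0.00, 60), (0.02, 60), (0.03, 1500), (0.05, 1500), (0.09, 52), (0.50, 40)]
features = flow_features(example)
```

A feature vector like this would then be handed to whatever classifier is chosen; the classifier itself is out of scope for the sketch.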
Statistical Analysis
???
Focused on flow features
• Which features are high quality?
• Which features are computationally simple?
Progress so far
• The problem is solved
– 5 packets sufficient to classify a flow
– Achieve at least 90% accuracy on all classes
• But not really….
– Difficult to extract required features
– Identification accuracy
– Temporal stability is awful
– Technical issues:
•Long running connections are difficult
to label
•What about new applications?
•What about simplex flows?
•Measuring is hard
•Labelling traffic is EVEN harder.
•Hard to keep fast, lightweight, up to date
and ….. Things you need if
this is part of your IDS
Can we do better?
• Restate the problem.
• Use information that can be extracted from
current networks (e.g., SNMP, NetFlow).
• Use better machine learning.
• Define models that bridge the gap between
statistical and behavioral properties.
Better ML on NetFlow
• Semi-supervised learning on NetFlow data using
Bayesian data analysis.
– Better performance than the Bayes classifier in Weka
– Bayesian modeling provides good parameterization
– Efficient reduction of the effect of time dependence
of the feature set
– Temporal and spatial decay
– Difficult to balance a model that is both accurate
and flexible
– NetFlow doesn’t provide clean separation of classes
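One way to picture temporal decay is a classifier whose counts fade with age. The sketch below is a plain supervised naive-Bayes scorer over a single categorical feature with exponentially decayed counts. It is an assumption made for illustration, not the semi-supervised Bayesian model the slide refers to; `decayed_class_scores`, `half_life` and the sample data are hypothetical.

```python
from collections import defaultdict


def decayed_class_scores(observations, now, half_life, feature_value, smoothing=1.0):
    """Posterior over classes for `feature_value` using time-decayed counts.

    `observations` is a list of (timestamp, feature_value, label) tuples.
    Each observation contributes weight 0.5 ** (age / half_life).
    """
    class_weight = defaultdict(float)
    joint_weight = defaultdict(float)
    values = {feature_value}
    for ts, value, label in observations:
        weight = 0.5 ** ((now - ts) / half_life)
        class_weight[label] += weight
        joint_weight[(label, value)] += weight
        values.add(value)
    total = sum(class_weight.values())
    scores = {}
    for label, cw in class_weight.items():
        prior = cw / total
        # Laplace-smoothed likelihood of the feature value given the class.
        likelihood = (joint_weight[(label, feature_value)] + smoothing) / (
            cw + smoothing * len(values)
        )
        scores[label] = prior * likelihood
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# Port 8080 was labelled "web" long ago and "p2p" recently (made-up data).
history = [(0, 8080, "web"), (10, 8080, "web"), (990, 8080, "p2p"), (1000, 80, "web")]
recent_view = decayed_class_scores(history, now=1000, half_life=100, feature_value=8080)
static_view = decayed_class_scores(history, now=1000, half_life=1e9, feature_value=8080)
```

With a short half-life the recent "p2p" observation dominates; with an effectively infinite half-life the older "web" observations win, which is the time dependence the decay is meant to reduce.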
What is next?
• Richer dataset
– Aggregate flows for ports/hosts/networks
– Increase dimensions by simple feature
engineering.
• Better mathematical models
– Incorporate domain-specific knowledge.
– Connection graph defined inference diagram.
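Aggregating flows per port/host can be sketched as follows. The record layout (dict keys `src_ip`, `dst_ip`, `dst_port`, `packets`, `bytes`) is an assumption standing in for exported NetFlow fields, and the derived features are only examples of simple feature engineering.

```python
from collections import defaultdict


def aggregate_by_endpoint(flows):
    """Aggregate flow records per (dst_ip, dst_port) endpoint."""
    acc = defaultdict(lambda: {"flows": 0, "packets": 0, "bytes": 0, "peers": set()})
    for flow in flows:
        entry = acc[(flow["dst_ip"], flow["dst_port"])]
        entry["flows"] += 1
        entry["packets"] += flow["packets"]
        entry["bytes"] += flow["bytes"]
        entry["peers"].add(flow["src_ip"])
    result = {}
    for key, entry in acc.items():
        result[key] = {
            "flows": entry["flows"],
            "packets": entry["packets"],
            "bytes": entry["bytes"],
            # Simple engineered features derived from the raw counters.
            "distinct_peers": len(entry["peers"]),
            "bytes_per_packet": entry["bytes"] / entry["packets"] if entry["packets"] else 0.0,
        }
    return result


# Made-up records using documentation address ranges.
records = [
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.1", "dst_port": 80, "packets": 10, "bytes": 8000},
    {"src_ip": "192.0.2.11", "dst_ip": "198.51.100.1", "dst_port": 80, "packets": 30, "bytes": 12000},
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.2", "dst_port": 22, "packets": 5, "bytes": 500},
]
aggregated = aggregate_by_endpoint(records)
```

The same grouping key could be swapped for a host or a network prefix to aggregate at a coarser level.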
Inference Diagram
[Figure: Alice and Bob, each running a web browser, and a web server]
• The flows between Alice and the web server are correlated and correspond to the same
application.
• The flows Alice - web server and Bob - web server also correspond to the
same application.
• Research on application identification hasn’t found a framework to
accommodate these observations.
Inference Diagram – more difficult
[Figure: clients Alice and Bob (both use random ports); servers: Ftp Server (port 22), Database Server (port 1680), Web Server (port 80)]
• Computers will run multiple applications in parallel.
• BUT, applications on a particular server will always use a specific port.
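The fixed-server-port observation suggests a simple heuristic, sketched here under stated assumptions: the endpoint of a flow that appears in more flows is treated as the server side, and a label known for one flow is copied to every flow sharing that server endpoint. The function names and record keys are hypothetical, and the "more frequent endpoint is the server" rule is a guess that can fail on real traffic.

```python
from collections import Counter


def server_endpoint_of(flow, endpoint_counts):
    """Pick the endpoint seen in more flows; ties go to the destination."""
    src = (flow["src_ip"], flow["src_port"])
    dst = (flow["dst_ip"], flow["dst_port"])
    return src if endpoint_counts[src] > endpoint_counts[dst] else dst


def propagate_labels(flows, known):
    """Copy labels from `known` (flow index -> label) to flows that
    share a server endpoint with a labelled flow."""
    counts = Counter()
    for flow in flows:
        counts[(flow["src_ip"], flow["src_port"])] += 1
        counts[(flow["dst_ip"], flow["dst_port"])] += 1
    endpoint_label = {}
    for index, label in known.items():
        endpoint_label[server_endpoint_of(flows[index], counts)] = label
    labelled = {}
    for index, flow in enumerate(flows):
        endpoint = server_endpoint_of(flow, counts)
        if endpoint in endpoint_label:
            labelled[index] = endpoint_label[endpoint]
    return labelled


# Made-up flows: two clients hit the web server, one hits the database.
example_flows = [
    {"src_ip": "alice", "src_port": 50001, "dst_ip": "server", "dst_port": 80},
    {"src_ip": "bob", "src_port": 50002, "dst_ip": "server", "dst_port": 80},
    {"src_ip": "alice", "src_port": 50003, "dst_ip": "server", "dst_port": 1680},
]
labelled = propagate_labels(example_flows, {0: "web"})
```

Here only flow 0 is labelled, yet Bob's flow to the same port inherits the label, while the database flow stays unlabelled.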
A first approach!
• A similar problem arises in node labeling
– Aggregate flow records over some defined period
– Use a Markov Random Field model for inference
propagation
– Apply approximate inference methods (Gibbs
sampling, message passing)
– In the end, apply some engineering ideas to refine
results
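The steps above can be sketched with a toy Markov Random Field. The assumptions are: nodes are flows, an edge joins two flows that share a server endpoint, neighbouring nodes prefer equal labels (a Potts-style potential), and a few nodes have known labels. `gibbs_label_marginals`, the `coupling` value and the example graph are hypothetical; the slides do not specify the actual model.

```python
import math
import random
from collections import Counter, defaultdict


def gibbs_label_marginals(edges, observed, labels, coupling=1.5,
                          sweeps=500, burn_in=100, seed=0):
    """Estimate label marginals of unlabelled nodes by Gibbs sampling."""
    rng = random.Random(seed)
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    nodes = sorted(neighbours)
    # Observed nodes stay clamped; the rest start from a random label.
    state = {n: observed[n] if n in observed else rng.choice(labels) for n in nodes}
    free = [n for n in nodes if n not in observed]
    tallies = {n: Counter() for n in free}
    for sweep in range(sweeps):
        for n in free:
            # Conditional weight: exp(coupling * number of agreeing neighbours).
            weights = [
                math.exp(coupling * sum(state[m] == label for m in neighbours[n]))
                for label in labels
            ]
            state[n] = rng.choices(labels, weights=weights)[0]
        if sweep >= burn_in:
            for n in free:
                tallies[n][state[n]] += 1
    kept = sweeps - burn_in
    return {n: {label: tallies[n][label] / kept for label in labels} for n in free}


# f0..f2 share the web endpoint, f3 and f4 share the database endpoint.
flow_edges = [("f0", "f1"), ("f0", "f2"), ("f1", "f2"), ("f3", "f4")]
marginals = gibbs_label_marginals(flow_edges, {"f0": "web", "f3": "db"}, ["web", "db"])
```

The unlabelled flows f1 and f2 should lean towards "web" and f4 towards "db". Message passing (loopy belief propagation) would be an alternative approximate method on the same graph.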
Open problems
• Is the model a good approximation?
• What am I classifying and for how long?
• Ports, Hosts or Networks? Is it possible to do
multi-layer analysis?
• Are the approximation techniques
converging?
Turning the difficulty up to “Eleven”…
• Compute the performance of an individual
traffic within a VPN… by monitoring alone.