Passive inference:
Troubleshooting the Cloud with Tstat
Alessandro Finamore
<[email protected]>
TMA
Traffic monitoring
and Analysis
4th TMA PHD School - London – Apr 16th, 2014
Active vs. passive inference
Active inference:
Study cause/effect relationships, i.e., inject some traffic into the
network to observe a reaction
PRO: world-wide scale (e.g., PlanetLab)
CONS: synthetic benchmarks suffer from a lack of generality
Passive inference:
Study traffic properties just by observing the traffic, without
interfering with it
PRO: study traffic generated by actual Internet users
CONS: limited number of vantage points
The network monitoring playground
[Diagram: deploy some vantage points (passive probe) → collect some measurements → extract analytics; components: probe, Supervisor, Repository, data. Example questions: "What is the performance of a cache?", "What is the performance of YouTube video streaming?"]
Challenges?
Automation
Flexibility/Openness
Pushing the paradigm further with mPlane
FP7 European project about the design and implementation of a
measurement plane for the Internet
Large scale
Vantage points deployed on a worldwide scale
Flexible
Offers APIs for integrating existing measurement frameworks
Not strictly bound to specific “use cases”
Intelligent
Automate/simplify the process of “cooking” raw data
Identify anomalies and unexpected events
Provide root-cause-analysis capabilities
mPlane consortium
16 partners: 3 operators, 6 research centers, 5 universities, 2 small enterprises
FP7 IP, 3 years long, 11 M€
Coordinator: Marco Mellia (POLITO)
[Org chart: WP1–WP7 leads across the consortium]
Saverio Nicolini (NEC), Ernst Biersack (Eurecom), Dina Papagiannaki (Telefonica), Brian Trammell (ETH), Tivadar Szemethy (NetVisor), Andrea Fregosi (Fastweb), Dario Rossi (ENST), Guy Leduc (Univ. Liege), Pietro Michiardi (Eurecom), Fabrizio Invernizzi (Telecom Italia), Pedro Casas (FTW)
Pushing the paradigm further with mPlane
[Diagram: active probe, passive probe, Supervisor, Repository; data and control flows]
Integration with existing monitoring frameworks
Active and passive analysis for iterative root-cause analysis
What else besides mPlane?
“From global measurements to local management”
Specific Targeted Research Project (STReP)
3 years (2 left), 10 partners, 3.8 M€ … it is a sort of “mPlane use case”
Builds a measurement framework out of probes
IETF, Large-Scale Measurement of Broadband Performance (LMAP)
Standardization effort on how to do broadband measurements
Strong similarities for the architecture core:
defining the components, protocols, rules, etc.
It does not specifically target adding “a brain” to the system
The network monitoring trinity
[Diagram: probe, Supervisor, Repository, post-processing]
Try not to focus on just one aspect but rather on “mastering the trinity”
Focus on:
How to process network traffic?
How to scale at 10Gbps?
Repository:
Raw measurements
http://tstat.polito.it
Tstat is the passive sniffer developed at Polito over the last 10 years
[Diagram: passive probe on the border router between a private network and the rest of the world, exporting traffic stats]
Question: Which are the most popular services accessed?
Question: How are CDNs/datacenters composed?
http://tstat.polito.it
Tstat is the passive sniffer developed at Polito over the last 10 years
Per-flow stats including
Several L3/L4 metrics (e.g., #pkts, #bytes, RTT, TTL, etc.)
Traffic classification
Deep Packet Inspection (DPI)
Statistical methods (Skype, obfuscated P2P)
Different output formats (logs, RRDs, histograms, pcap)
Runs on off-the-shelf HW
Up to 2 Gb/s with a standard NIC
Currently adopted in real network scenarios (campus and ISP)
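The per-flow accounting Tstat performs can be illustrated with a minimal sketch (Python; the packet-tuple format and function names below are hypothetical — real Tstat is written in C and tracks many more L3/L4 metrics):

```python
from collections import defaultdict

def flow_key(src, sport, dst, dport, proto):
    """Direction-agnostic 5-tuple key, so both directions map to one flow."""
    a, b = (src, sport), (dst, dport)
    return (proto,) + (a + b if a <= b else b + a)

def per_flow_stats(packets):
    """Aggregate #pkts and #bytes per flow from
    (src, sport, dst, dport, proto, size) tuples."""
    stats = defaultdict(lambda: {"pkts": 0, "bytes": 0})
    for src, sport, dst, dport, proto, size in packets:
        s = stats[flow_key(src, sport, dst, dport, proto)]
        s["pkts"] += 1
        s["bytes"] += size
    return dict(stats)
```

The direction-agnostic key is the detail that matters: without it, the two directions of one TCP connection would show up as two separate flows.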
Research/technology challenge
Challenge: is it possible to build a “full-fledged” passive probe that
copes with >10 Gbps?
Ad-hoc NICs are too expensive (>10 k€)
Software solutions built on top of common Intel NICs:
ntop DNA
netmap — [ACM Queue] “Revisiting Network I/O APIs: The netmap Framework”
PFQ — [PAM’12] “PFQ: a Novel Engine for Multi-Gigabit Packet Capturing with Multi-Core Commodity Hardware”
[IMC’10] “High Speed Network Traffic Analysis with Commodity Multi-core Systems”
By offering direct access to the NIC (i.e., bypassing the kernel stack),
these libraries can count packets at wire speed
…but what about doing real processing?
Possible system architecture
[Diagram: Read pkts → Dispatch/Scheduling → consumer1…consumerN → out1…outN → merge. Plot: % pkts dropped vs. wire speed [Gbps], 2 Tstat instances + libDNA, synthetic traffic]
How to organize the analysis modules workflow?
N identical consumer instances?
Within each consumer, a single execution flow?
If needed, design “mergeable” output
Per-flow packet scheduling is the simplest option, but there is margin to improve:
What about correlating multiple flows (e.g., DNS/TCP)?
What about scheduling per traffic class?
One or more processes for reading? Depends…
Under testing: a solution based on libDNA
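The per-flow packet scheduling mentioned here can be sketched as a hash on a direction-agnostic flow key, so every packet of a flow (in either direction) lands on the same consumer instance (an illustrative sketch, not the libDNA-based implementation under test):

```python
import zlib

def dispatch(src, sport, dst, dport, n_consumers):
    """Per-flow packet scheduling: hash a direction-agnostic flow key so
    both directions of a flow are dispatched to the same consumer index."""
    a, b = (src, sport), (dst, dport)
    key = repr(a + b if a <= b else b + a).encode()
    return zlib.crc32(key) % n_consumers
```

This keeps each consumer's flow table self-contained; correlating across flows (e.g., DNS with the TCP connections it resolves) would require a smarter, class-aware scheduler, which is exactly the open question on the slide.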
Other traffic classification tools?
WAND (Shane Alcock) - http://research.wand.net.nz
Libprotoident, traffic classification using 4 bytes of payload
Libtrace, rebuilds TCP/UDP, and other tools for processing pcaps
It doesn’t matter having a fancy classifier if you do not have proper flow characterization
ntop (Luca Deri) - http://www.ntop.org/products/ndpi
nDPI, a superset of OpenDPI
l7filter, but it is known to be inaccurate
The literature is full of statistical/behavioral traffic classification
methodologies [1,2] but, AFAIK:
no real deployment
no open-source tool released
[1] “A survey of techniques for internet traffic classification using machine learning”
IEEE Communications Surveys & Tutorials, 2009
[2] “Reviewing Traffic Classification”, LNCS Vol. 7754, 2013
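A Libprotoident-style lightweight classifier, matching only the first payload bytes, can be sketched as follows (the signature table is illustrative, not Libprotoident's; the real tool combines the first 4 payload bytes of both directions with payload sizes and port hints):

```python
# Hypothetical signature table: payload prefix -> label (illustrative only).
SIGNATURES = {
    b"GET ": "http",
    b"POST": "http",
    b"\x16\x03": "tls",   # TLS record header: handshake, version 3.x
}

def classify(payload):
    """Return the label of the longest signature prefix matching the
    first payload bytes, or 'unknown' if none matches."""
    for sig in sorted(SIGNATURES, key=len, reverse=True):
        if payload.startswith(sig):
            return SIGNATURES[sig]
    return "unknown"
```

The appeal of this approach is that only a handful of bytes per flow need inspecting, which is why it scales where full DPI struggles.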
Measurement frameworks
RIPE Atlas – https://atlas.ripe.net
World-wide deployment of inexpensive active probes
User Defined Measurements (UDM), credit based
Ping, traceroute/traceroute6, DNS, HTTP
Google M-Lab Network Diagnostic Test (NDT)
http://mlab-live.appspot.com/tools/ndt
Connectivity and bandwidth speed
Publicly available data … but IMO not straightforward to use
Recent research activities
[Diagram: probe, Supervisor, Repository, post-processing]
Focus on:
How to process network traffic?
How to scale at 10Gbps?
Repository:
Raw measurements
How to export/consolidate data continuously?
What about BigData?
(Big)Data export frameworks
Overcrowded scenario
https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Logging_Solutions_Recommendation
All are general-purpose frameworks:
Data-center scale
Emphasis on throughput and/or real-time and/or consistency, etc.
Typically designed/optimized for HDFS
log_sync, “ad-hoc” solution @ POLITO
Designed to manage a few passive probes
Emphasis on throughput and data consistency
Data management @ POLITO
[Diagram: probe1…probeN (log_sync server) → ISP/Campus gateway → NAS cluster (log_sync client, pre-processing) → cluster gateway → Hadoop cluster]
NAS: ~40 TB (3 TB × 12) = 1 year of data; nodes: dual 4-core, 3 TB disk, 16 GB RAM
Cluster: 11 nodes = 9 data nodes + 2 namenodes
416 GB RAM = 32 GB × 9 + 64 GB × 2
~32 TB HDFS
Single 6-core per node = 66 cores (×2 with HT)
Debian 6 + CDH 4.5.0
BigData = Hadoop?
Almost true, but there are other NoSQL solutions
MongoDB, REDIS, Cassandra, Spark, Neo4J, etc. http://nosql-database.org
How to choose? Not so easy to say, but
Avoid BigData frameworks if you have just a few GB of data
Sooner or later you are going to do some coding, so pick
something that seems “comfortable”
Fun fact: MapReduce is a NoSQL paradigm but people are used
to SQL queries
Rise of Pig, Hive, Impala, Shark, etc., which allow SQL-like
queries on top of MapReduce
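As a toy illustration of the paradigm that Pig/Hive-style engines compile SQL-like queries into, a GROUP-BY can be expressed as a map + shuffle + reduce (an in-memory sketch, not Hadoop code; record format is illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Minimal in-memory MapReduce: map each record to (key, value)
    pairs, shuffle (sort) by key, then reduce each key group."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=itemgetter(0))            # the "shuffle" phase
    return {k: reducer(k, [v for _, v in grp])
            for k, grp in groupby(mapped, key=itemgetter(0))}

# Equivalent of: SELECT ip, SUM(bytes) FROM flows GROUP BY ip
flows = [("1.2.3.4", 100), ("5.6.7.8", 50), ("1.2.3.4", 200)]
totals = map_reduce(flows,
                    mapper=lambda rec: [(rec[0], rec[1])],
                    reducer=lambda k, vals: sum(vals))
```

The mapper/reducer pair is exactly what a Hive query plan emits under the hood, which is why SQL users can stay in SQL while the execution remains MapReduce.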
Recent research activities
[Diagram: probe, Supervisor, Repository, post-processing]
Focus on:
How to process network traffic?
How to scale at 10Gbps?
Repository:
Raw measurements
How to export/consolidate data continuously?
What about BigData?
“DBStream: an Online Aggregation, Filtering and Processing System for Network Traffic Monitoring”, TRAC’14
Focus on post-processing:
Case study of an Akamai “cache” performance
Monitoring an Akamai cache
Focusing on a vantage point of ~20k ADSL customers
1 week of HTTP logs (May 2012)
Content served by the Akamai CDN
The ISP hosts an Akamai “preferred cache” (a specific /25 subnet)
Reasoning about the problem
Q1: Is this affecting specific FQDN accessed?
Q2: Are the variations due to “faulty” servers?
Q3: Was this triggered by CDN performance issues?
Etc…
How to automate/simplify this reasoning?
DBStream (FTW)
Continuous big data analytics
Flexible processing language
Full SQL processing capabilities
Processing in small batches
Storage for post-mortem analysis
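DBStream's small-batch SQL processing can be mimicked with plain SQLite as a toy sketch (this is not DBStream's actual API; the table layout and 5-minute batch size are illustrative assumptions):

```python
import sqlite3

BATCH = 300  # 5-minute batches, in seconds (assumed batch size)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE http_log (ts REAL, fqdn TEXT, bytes INTEGER)")

def ingest(rows):
    """Append (ts, fqdn, bytes) records as they arrive from the probe."""
    conn.executemany("INSERT INTO http_log VALUES (?, ?, ?)", rows)

def batch_volume(t0):
    """One continuous-query step: aggregate a single 5-minute batch
    with plain SQL, keeping the raw rows for post-mortem analysis."""
    cur = conn.execute(
        "SELECT fqdn, SUM(bytes) FROM http_log "
        "WHERE ts >= ? AND ts < ? GROUP BY fqdn", (t0, t0 + BATCH))
    return dict(cur.fetchall())

ingest([(10, "a.example", 100), (20, "a.example", 50), (400, "b.example", 70)])
```

Running the aggregation batch-by-batch, while the raw table persists, is the core idea: online analytics plus stored history.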
Q1: Is this affecting a specific FQDN?
Select the top 500 Fully Qualified Domain Names (FQDN) served by Akamai
Check if they are served by the preferred /25 subnet
Repeat every 5 min
[Plots, Mon 06:00 – Wed 00:00: (top) fraction of Akamai vs. other traffic served by the preferred /25 subnet vs. other subnets; (bottom) rank of the top-500 FQDNs, highlighting FQDNs not served by the preferred cache and FQDNs hosted by the preferred cache except during the anomaly]
The two sets have “services” in common
Same results extending to more than 500 FQDN
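The Q1 check — ranking FQDNs and testing whether each is served from the preferred /25 — could be sketched as follows (Python; the subnet address and log format are hypothetical, the slide does not disclose the real /25):

```python
from collections import Counter, defaultdict
from ipaddress import ip_address, ip_network

# Hypothetical /25 standing in for the ISP's Akamai preferred cache.
PREFERRED = ip_network("203.0.113.0/25")

def top_fqdn_split(http_log, top_n=500):
    """Rank FQDNs by request count over one 5-min batch of
    (fqdn, server_ip) records, and record for each whether it was
    served from the preferred /25 cache, other subnets, or both."""
    hits = Counter(fqdn for fqdn, _ in http_log)
    served_by = defaultdict(set)
    for fqdn, server_ip in http_log:
        where = "preferred" if ip_address(server_ip) in PREFERRED else "other"
        served_by[fqdn].add(where)
    return [(fqdn, sorted(served_by[fqdn]))
            for fqdn, _ in hits.most_common(top_n)]
```

FQDNs tagged with both labels are the "services in common" between the two sets that the slide points out.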
Q2: Are the variations due to “faulty” servers?
Compute the traffic volume per IP address
Check the behavior during the disruption
Repeat every 5 min
[Plot, Mon 06:00 – Wed 00:00: per-IP traffic volume for the ~120 Akamai preferred IPs (/25 subnet)]
Q3: Was this triggered by performance issues?
Compute the distribution of the server query elaboration time
It is the time between the TCP ACK of the HTTP GET and the
reception of the first byte of the reply
Focus on the traffic of the /25 preferred subnet
[Diagram: client — passive probe — server]
Compare the quartiles of the server elaboration time every 5 min
[Plot, Mon 06:00 – Wed 00:00: 5th/25th/50th/75th percentiles of the query processing (elaboration) time; performance decreases right before the anomaly @6pm]
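The Q3 computation — binning elaboration-time samples into 5-minute windows and extracting quartiles — might look like this (a sketch; the `(timestamp, elaboration_time)` sample format is an assumption):

```python
from collections import defaultdict
from statistics import quantiles

def elaboration_quartiles(samples, bin_s=300):
    """samples: (ts, et) pairs, where et is the elaboration time, i.e.
    t(first byte of HTTP reply) - t(TCP ACK of the GET).
    Returns {bin_start: (q25, q50, q75)} for each 5-minute bin with
    at least two samples (quantiles needs >= 2 data points)."""
    bins = defaultdict(list)
    for ts, et in samples:
        bins[int(ts // bin_s) * bin_s].append(et)
    return {b: tuple(quantiles(v, n=4)) for b, v in bins.items() if len(v) >= 2}
```

Plotting these per-bin quartiles over time is what reveals the processing-time increase right before the anomaly.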
Reasoning about the problem
Q1: Is this affecting only specific services?
Q2: Are the variations due to “faulty” servers?
Q3: Was this triggered by CDN performance issues?
What else?
Do other vantage points report the same problem? YES!
What about extending the time period?
The anomaly is present along the whole period we considered
Ongoing extension of the analysis to more recent data sets (possibly
exposing also other effects/anomalies)
Routing? TODO: RouteViews
DNS mapping? TODO: RIPE Atlas + ISP active probing infrastructure
Other suggestions are welcome
Reasoning about the problem
…ok, but what are the final takeaways?
Try to automate your analysis
Think about what you measure and be creative, especially for visualization
Enlarge your perspective:
multiple vantage points
multiple data sources
analysis on large time windows
Don’t be afraid to ask opinions
?? || ##
<[email protected]>