
Passive inference:
Troubleshooting the Cloud with Tstat
Alessandro Finamore
<[email protected]>
TMA - Traffic Monitoring and Analysis
4th TMA PhD School, London, Apr 16th, 2014
Active -vs- passive inference

Active inference:
 Study cause/effect relationships, i.e., inject some traffic into the network to observe a reaction
 PRO: world-wide scale (e.g., PlanetLab)
 CON: synthetic benchmarks suffer from a lack of generality

Passive inference:
 Study traffic properties just by observing it, without interfering with it
 PRO: study traffic generated by actual Internet users
 CON: limited number of vantage points
The network monitoring playground

[Diagram: passive probes deployed as vantage points collect measurements; a supervisor coordinates them; data flows to a repository where analytics are extracted. Example questions: what is the performance of a cache? What is the performance of YouTube video streaming?]

Challenges?
 Automation
 Flexibility/Openness
Pushing the paradigm further with mPlane

 FP7 European project about the design and implementation of a measurement plane for the Internet
 Large scale
  Vantage points deployed on a worldwide scale
 Flexible
  Offers APIs for integrating existing measurement frameworks
  Not strictly bound to specific “use cases”
 Intelligent
  Automate/simplify the process of “cooking” raw data
  Identify anomalies and unexpected events
  Provide root-cause-analysis capabilities
mPlane consortium

 16 partners
  3 operators
  6 research centers
  5 universities
  2 small enterprises
 FP7 Integrated Project (IP), 3 years long, 11 Meuro
 Coordinator: Marco Mellia (POLITO)
 Partners and people named on the slide (WP1-WP7): Saverio Nicolini (NEC), Ernst Biersack (Eurecom), Dina Papagiannaki (Telefonica), Brian Trammell (ETH), Tivadar Szemethy (NetVisor), Andrea Fregosi (Fastweb), Dario Rossi (ENST), Guy Leduc (Univ. Liege), Pietro Michiardi (Eurecom), Fabrizio Invernizzi (Telecom Italia), Pedro Casas (FTW)
Pushing the paradigm further with mPlane

[Diagram: a supervisor issues control to active and passive probes; measurement data flows to a repository; integration with existing monitoring frameworks; active and passive analysis are combined for iterative root-cause-analysis.]
What else besides mPlane?

 “From global measurements to local management”
  Specific Targeted Research Project (STReP)
  3 years (2 left), 10 partners, 3.8 Meuro … it is a sort of “mPlane use case”
  Builds a measurement framework out of probes
 IETF, Large-Scale Measurement of Broadband Performance (LMAP)
  Standardization effort on how to do broadband measurements
  Strong similarities with the mPlane architecture core for defining the components, protocols, rules, etc.
  It does not specifically target adding “a brain” to the system
The network monitoring trinity

[Diagram: probe, repository, and post-processing, coordinated by a supervisor.]
 Focus on the probe: how to process network traffic? How to scale to 10 Gbps?
 Focus on the repository: raw measurements
 Try not to focus on just one aspect, but rather on “mastering the trinity”
http://tstat.polito.it

 Tstat is the passive sniffer developed at POLITO over the last 10 years

[Diagram: a Tstat probe sits on the border router between a private network and the rest of the world, producing traffic stats. Example questions: which are the most popular accessed services? How are CDNs/datacenters composed?]
http://tstat.polito.it

 Tstat is the passive sniffer developed at POLITO over the last 10 years
 Per-flow stats including (see the parsing sketch after this slide)
  Several L3/L4 metrics (e.g., #pkts, #bytes, RTT, TTL, etc.)
  Traffic classification
   Deep Packet Inspection (DPI)
   Statistical methods (Skype, obfuscated P2P)
 Different output formats (logs, RRDs, histograms, pcap)
 Runs on off-the-shelf HW
  Up to 2 Gb/s with a standard NIC
 Currently adopted in real network scenarios (campus and ISP)
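A minimal sketch of how such per-flow logs can be consumed downstream. Tstat writes one flow per line in whitespace-separated text logs (e.g., log_tcp_complete); the column names and positions below are placeholders, since the exact layout depends on the Tstat version and configuration.

```python
# Minimal sketch: loading a Tstat-style per-flow text log with pandas.
# Column names/positions are placeholders; check the header of your own
# log_tcp_complete, the layout varies across Tstat versions.
import pandas as pd

COLS = ["c_ip", "c_port", "c_pkts", "c_bytes",   # hypothetical subset
        "s_ip", "s_port", "s_pkts", "s_bytes"]

def load_flows(path):
    # Tstat logs are whitespace-separated, one flow per line, '#' header
    return pd.read_csv(path, sep=r"\s+", comment="#",
                       names=COLS, usecols=range(len(COLS)))

flows = load_flows("log_tcp_complete")
print("flows:", len(flows))
print("bytes from servers:", flows["s_bytes"].sum())
print(flows.groupby("s_ip")["s_bytes"].sum().nlargest(10))
```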
Tstat research/technology challenge

 Challenge: is it possible to build a “full-fledged” passive probe that copes with >10 Gbps?
  Ad-hoc NICs are too expensive (>10 keuro)
  Software solutions built on top of common Intel NICs:
   ntop DNA: “High Speed Network Traffic Analysis with Commodity Multi-core Systems” [IMC’10]
   netmap: “Revisiting Network I/O APIs: The netmap Framework” [ACM Queue]
   PFQ: “PFQ: a Novel Engine for Multi-Gigabit Packet Capturing with Multi-Core Commodity Hardware” [PAM’12]
 By offering direct access to the NIC (i.e., bypassing the kernel stack), these libraries can count packets at wire speed
…but what about doing real processing? (A baseline capture loop over the standard kernel path is sketched below.)
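For contrast, a minimal packet-counting loop over the ordinary kernel capture path, i.e., the path that DNA/netmap/PFQ replace with direct NIC access. It needs Linux and root; the interface name is a placeholder.

```python
# Baseline packet counter over the standard kernel path (AF_PACKET).
# One syscall per packet is exactly what breaks down well before 10 Gbps;
# kernel-bypass libraries avoid it by mapping NIC buffers into user space.
# Requires Linux and root; "eth0" is a placeholder interface name.
import socket, time

ETH_P_ALL = 0x0003                      # capture every protocol

sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                     socket.htons(ETH_P_ALL))
sock.bind(("eth0", 0))

count, start = 0, time.time()
while time.time() - start < 10:         # count for ~10 seconds
    sock.recv(65535)                    # one syscall per packet
    count += 1
print(f"{count} packets in 10 s ({count / 10:.0f} pkt/s)")
```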
Possible system architecture

 How to organize the analysis modules’ workflow?

[Diagram: packets are read from the NIC and dispatched/scheduled to N consumer instances (consumer1 … consumerN); each produces its own output (out1 … outN), which is then merged.]

 N identical consumer instances? If needed, design “mergeable” output
 Within each consumer, a single execution flow?
 Per-flow packet scheduling is the simplest option (a dispatching sketch follows), but
  What about correlating multiple flows (e.g., DNS/TCP)?
  What about scheduling per traffic class?
 One or more processes for reading? It depends…
 Currently testing a solution based on libDNA

[Plot: % packet drops vs. wire speed (Gbps), 2 Tstat instances + libDNA on synthetic traffic; margins to improve.]
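The per-flow scheduling idea can be sketched as follows: a symmetric hash of the 5-tuple sends both directions of a connection to the same consumer, so each consumer keeps private per-flow state without locks. All names and the toy packet format are illustrative, not the actual Tstat/libDNA interface.

```python
# Sketch of per-flow dispatching to N consumer processes.
# A symmetric hash of the 5-tuple keeps both directions of a connection
# on the same consumer, so per-flow state never needs locking.
import hashlib
from multiprocessing import Process, Queue

N_CONSUMERS = 4

def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    # sort endpoints so A->B and B->A hash identically
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return f"{a}-{b}-{proto}".encode()

def dispatch(pkt, queues):
    h = int.from_bytes(hashlib.md5(flow_key(*pkt[:5])).digest()[:4], "big")
    queues[h % N_CONSUMERS].put(pkt)

def consumer(idx, q):
    flows = {}                                  # private per-flow byte counters
    while (pkt := q.get()) is not None:         # None acts as a poison pill
        flows[flow_key(*pkt[:5])] = flows.get(flow_key(*pkt[:5]), 0) + pkt[5]
    print(f"consumer {idx}: {len(flows)} flows")   # "mergeable" per-consumer output

if __name__ == "__main__":
    queues = [Queue() for _ in range(N_CONSUMERS)]
    procs = [Process(target=consumer, args=(i, q)) for i, q in enumerate(queues)]
    for p in procs: p.start()
    # toy packets: (src_ip, sport, dst_ip, dport, proto, bytes)
    for pkt in [("10.0.0.1", 1234, "1.2.3.4", 80, "tcp", 1500),
                ("1.2.3.4", 80, "10.0.0.1", 1234, "tcp", 6000)]:
        dispatch(pkt, queues)
    for q in queues: q.put(None)
    for p in procs: p.join()
```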
Other traffic classification tools?

 WAND (Shane Alcock) - http://research.wand.net.nz
  Libprotoident, traffic classification using 4 bytes of payload (a toy illustration follows the references below)
  Libtrace, which rebuilds TCP/UDP, and other tools for processing pcaps
 ntop (Luca Deri) - http://www.ntop.org/products/ndpi
  nDPI, a superset of OpenDPI
 l7filter, but it is known to be inaccurate
 It doesn’t matter having a fancy classifier if you do not have proper flow characterization
 The literature is full of statistical/behavioral traffic classification methodologies [1,2] but AFAIK
  no real deployment
  no open source tool released

[1] “A survey of techniques for internet traffic classification using machine learning”, IEEE Communications Surveys & Tutorials, 2009
[2] “Reviewing Traffic Classification”, LNCS Vol. 7754, 2013
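The payload-prefix idea behind Libprotoident can be illustrated with a toy matcher: look only at the first few payload bytes of a flow. The handful of signatures below are a well-known illustrative subset, not Libprotoident's actual rule set.

```python
# Toy illustration of payload-prefix classification (the Libprotoident idea):
# classify a flow from the first bytes of its payload.
# These signatures are illustrative only, not Libprotoident's rules.
SIGNATURES = [
    (b"GET ",            "HTTP"),
    (b"POST ",           "HTTP"),
    (b"\x16\x03",        "TLS handshake"),
    (b"\x13BitTorrent",  "BitTorrent"),
    (b"SSH-",            "SSH"),
]

def classify(first_payload: bytes) -> str:
    for prefix, label in SIGNATURES:
        if first_payload.startswith(prefix):
            return label
    return "unknown"

print(classify(b"GET /index.html HTTP/1.1\r\n"))   # -> HTTP
print(classify(b"\x16\x03\x01\x02\x00"))           # -> TLS handshake
```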
Measurement frameworks

 RIPE Atlas – https://atlas.ripe.net
  World-wide deployment of inexpensive active probes
  User Defined Measurements (UDM), credit based (a sketch of pulling results via the API follows)
  Ping, traceroute/traceroute6, DNS, HTTP
 Google M-Lab Network Diagnostic Test (NDT) – http://mlab-live.appspot.com/tools/ndt
  Connectivity and bandwidth speed
  Publicly available data … but IMO not straightforward to use 
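RIPE Atlas results can also be pulled programmatically over its REST API. In the sketch below the measurement id, the endpoint path, and the JSON field names are my assumptions about the v2 API; verify them against the Atlas documentation before relying on them.

```python
# Hedged sketch: pulling results of a public RIPE Atlas ping measurement.
# Endpoint path and JSON field names are assumptions about the v2 API;
# check https://atlas.ripe.net/docs/ before relying on them.
import requests

MSM_ID = 1001  # placeholder: substitute any public ping measurement id

url = f"https://atlas.ripe.net/api/v2/measurements/{MSM_ID}/results/"
results = requests.get(url, params={"format": "json"}, timeout=30).json()

rtts = sorted(r["avg"] for r in results if r.get("avg", -1) > 0)
if rtts:
    print(f"{len(rtts)} results, median avg RTT {rtts[len(rtts) // 2]:.1f} ms")
else:
    print("no usable results")
```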
Recent research activities

[Diagram: the monitoring trinity again: probe, repository, post-processing, with a supervisor.]
 Focus on the probe: how to process network traffic? How to scale to 10 Gbps?
 Focus on the repository: raw measurements; how to export/consolidate data continuously? What about BigData?
(Big)Data export frameworks

 Overcrowded scenario
https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Logging_Solutions_Recommendation
(Big)Data export frameworks

 Overcrowded scenario: all general purpose frameworks
  Data-center scale
  Emphasis on throughput and/or real-time and/or consistency, etc.
  Typically designed/optimized for HDFS
 log_sync, an “ad-hoc” solution @ POLITO (a hypothetical export loop doing the same job is sketched below)
  Designed to manage a few passive probes
  Emphasis on throughput and data consistency
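log_sync itself is an in-house tool and its interface is not shown here; a hypothetical minimal probe-side loop doing the same job (ship completed log directories to the repository, keeping them locally until the transfer succeeds) could look like this, using rsync for the transfer. Paths and host names are placeholders.

```python
# Hypothetical sketch in the spirit of log_sync (NOT its actual interface):
# periodically ship completed log directories to the repository,
# keeping them locally until rsync reports a successful transfer.
import subprocess, time
from pathlib import Path

SPOOL = Path("/var/tstat/complete")          # placeholder paths
DEST  = "repo.example.org:/data/probe1/"

def ship(directory: Path) -> bool:
    # --remove-source-files deletes local files only after a successful transfer
    cmd = ["rsync", "-a", "--remove-source-files", str(directory), DEST]
    return subprocess.run(cmd).returncode == 0

while True:
    for d in sorted(SPOOL.iterdir()):
        if d.is_dir() and ship(d):
            print("shipped", d.name)
    time.sleep(300)                          # every 5 minutes
```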
Data management @ POLITO

[Diagram: probes at the ISP/Campus run log_sync (server); data flows through a gateway to a NAS and to the cluster gateway, which runs log_sync (client) and pre-processing (dual 4-core, 3 TB disk, 16 GB RAM).]

 NAS: ~40 TB (3 TB x 12) = 1 year of data
 Cluster: 11 nodes = 9 data nodes + 2 name nodes
  416 GB RAM = 32 GB x 9 + 64 GB x 2
  ~32 TB HDFS
  Single 6-core per node = 66 cores (x2 with HT)
  Debian 6 + CDH 4.5.0
BigData = Hadoop?

 Almost true, but there are other NoSQL solutions
  MongoDB, Redis, Cassandra, Spark, Neo4j, etc. (http://nosql-database.org)
  How to choose? Not so easy to say, but
   Avoid BigData frameworks if you have just a few GB of data
   Sooner or later you are going to do some coding, so pick something that seems “comfortable”
 Fun fact: MapReduce is a NoSQL paradigm, but people are used to SQL queries
  Hence the rise of Pig, Hive, Impala, Shark, etc., which allow SQL-like queries on top of MapReduce (see the sketch below)
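To make the SQL-vs-MapReduce point concrete: the aggregation below (bytes per service) is a one-line GROUP BY in Hive or Pig, while with plain Hadoop Streaming you write the mapper and reducer by hand. The tab-separated input format and field positions are assumptions for the sketch.

```python
#!/usr/bin/env python3
# Hadoop Streaming mapper + reducer computing bytes per service:
# the MapReduce equivalent of  SELECT service, SUM(bytes) ... GROUP BY service.
# Assumes tab-separated input: service name in column 1, bytes in column 2.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        print(f"{fields[0]}\t{fields[1]}")            # emit (service, bytes)

def reducer():
    current, total = None, 0
    for line in sys.stdin:                            # Hadoop sorts by key between phases
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)()
```

It would typically be launched through the hadoop-streaming jar, passing the script as both mapper and reducer; in Hive or Pig the same job collapses to a single GROUP BY statement, which is exactly why those layers caught on.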
Recent research activities

[Diagram: the monitoring trinity again: probe, repository, post-processing, with a supervisor.]
 Focus on the probe: how to process network traffic? How to scale to 10 Gbps?
 Focus on the repository: raw measurements; how to export/consolidate data continuously? What about BigData?
 Focus on post-processing: case study of an Akamai “cache” performance; “DBStream: an Online Aggregation, Filtering and Processing System for Network Traffic Monitoring”, TRAC’14
Monitoring an Akamai cache

 Focusing on a vantage point of ~20k ADSL customers
 1 week of HTTP logs (May 2012)
 Content served by the Akamai CDN
 The ISP hosts an Akamai “preferred cache” (a specific /25 subnet)

[Plot: traffic toward the preferred cache over the week shows unexpected variations, marked with “?”.]
Reasoning about the problem

 Q1: Is this affecting specific FQDNs?
 Q2: Are the variations due to “faulty” servers?
 Q3: Was this triggered by CDN performance issues?
 Etc…

How to automate/simplify this reasoning? DBStream (FTW)
 Continuous big data analytics
 Flexible processing language
 Full SQL processing capabilities
 Processing in small batches (a minimal stand-in is sketched below)
 Storage for post-mortem analysis
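DBStream itself is described in the TRAC'14 paper; the "SQL processing in small batches" idea can be sketched independently. Below, a minimal stand-in using sqlite3 (not DBStream's actual language): the same aggregation query is re-run over each 5-minute window and its result appended to a rollup table kept for post-mortem analysis. Table and column names are placeholders.

```python
# Minimal stand-in for "SQL processing in small batches" (NOT DBStream itself):
# for every batch, aggregate the last 5 minutes of flow records into a rollup table.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE flows  (ts INTEGER, fqdn TEXT, server_ip TEXT, bytes INTEGER);
  CREATE TABLE rollup (bin INTEGER, fqdn TEXT, bytes INTEGER);
""")

def process_batch(bin_start, bin_len=300):
    # the same query is re-run for every 5-minute window as data arrives
    db.execute("""
        INSERT INTO rollup
        SELECT ?, fqdn, SUM(bytes)
          FROM flows
         WHERE ts >= ? AND ts < ?
         GROUP BY fqdn
    """, (bin_start, bin_start, bin_start + bin_len))
    db.commit()

# toy data and one batch
db.executemany("INSERT INTO flows VALUES (?,?,?,?)",
               [(10, "img.example.net", "1.2.3.4", 5000),
                (20, "img.example.net", "1.2.3.5", 7000)])
process_batch(0)
print(db.execute("SELECT * FROM rollup").fetchall())
```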
Q1: Is this affecting a specific FQDN?

 Select the top 500 Fully Qualified Domain Names (FQDN) served by Akamai
 Check if they are served by the preferred /25 subnet
 Repeat every 5 minutes (a pandas sketch of this check follows)

[Plot: top-500 Akamai FQDNs over time (Mon–Wed), split between the preferred /25 subnet and the other subnets. Some FQDNs are never served by the preferred cache; others are hosted by the preferred cache except during the anomaly.]

 The two sets have “services” in common
 Same results when extending to more than 500 FQDNs
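The Q1 check maps to a short pandas computation: per 5-minute bin, what fraction of each top FQDN's requests is served by the preferred /25. The file name, column names, and the subnet below are placeholders.

```python
# Sketch of the Q1 check with pandas: for each 5-minute bin, which of the
# top Akamai FQDNs are served from the preferred /25?
# File name, column names, and the subnet are placeholders.
import ipaddress
import pandas as pd

PREFERRED = ipaddress.ip_network("192.0.2.0/25")      # placeholder /25

logs = pd.read_csv("http_log.csv")                    # columns: ts, fqdn, server_ip, bytes
top = logs.groupby("fqdn")["bytes"].sum().nlargest(500).index
logs = logs[logs.fqdn.isin(top)].copy()

logs["preferred"] = logs.server_ip.map(
    lambda ip: ipaddress.ip_address(ip) in PREFERRED)
logs["bin"] = logs.ts // 300 * 300                    # 5-minute bins (ts in epoch seconds)

share = (logs.groupby(["bin", "fqdn"])["preferred"]
             .mean()                                  # fraction of hits served by the /25
             .unstack())
print(share.head())
```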
Q2: Are the variations due to “faulty” servers?

 Compute the traffic volume per IP address
 Check the behavior during the disruption
 Repeat every 5 minutes (a short continuation of the previous sketch follows)

[Plot: per-IP traffic volume over time (Mon–Wed) for the Akamai preferred IPs (/25 subnet).]
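Q2 is the same kind of rollup keyed by server IP instead of FQDN; continuing from the Q1 snippet (same placeholder columns):

```python
# Q2 sketch, continuing from the Q1 snippet (same placeholder columns):
# traffic volume per server IP in each 5-minute bin, preferred /25 only.
per_ip = (logs[logs.preferred]
          .groupby(["bin", "server_ip"])["bytes"]
          .sum()
          .unstack(fill_value=0))
print(per_ip.head())      # a "faulty" server would show up as a column dropping to zero
```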
Q3: Was this triggered by performance issues?

 Compute the distribution of the server query elaboration time
  It is the time between the TCP ACK of the HTTP GET and the reception of the first byte of the reply, as seen by the passive probe between client and server
 Focus on the traffic of the preferred /25 subnet
 Compare the quartiles of the server elaboration time every 5 minutes (sketched below)

[Plot: 5th/25th/50th/75th percentiles of the query processing time over time (Mon–Wed); performance decreases right before the anomaly at 6pm.]
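The quartile comparison follows the same pattern once each request record carries the elaboration time. Here it is assumed to be a ready-made column in the probe's HTTP log; the file and column names are placeholders.

```python
# Q3 sketch: quartiles of the server elaboration time per 5-minute bin.
# Assumes each request record already carries "elab_ms": the time between
# the TCP ACK of the HTTP GET and the first byte of the reply.
import pandas as pd

req = pd.read_csv("http_requests.csv")        # placeholder columns: ts, server_ip, elab_ms
req["bin"] = req.ts // 300 * 300

quartiles = (req.groupby("bin")["elab_ms"]
                .quantile([0.05, 0.25, 0.50, 0.75])
                .unstack())
print(quartiles.head())   # a rise in the 50th/75th percentile points to server-side load
```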
Reasoning about the problem

 Q1: Is this affecting only specific services?
 Q2: Are the variations due to “faulty” servers?
 Q3: Was this triggered by CDN performance issues?
 What else?
  Do other vantage points report the same problem? YES!
  What about extending the time period?
   The anomaly is present along the whole period we considered
   Ongoing extension of the analysis to more recent data sets (possibly exposing other effects/anomalies)
  Routing? TODO  route views
  DNS mapping? TODO  RIPE Atlas + ISP active probing infrastructure
  Other suggestions are welcome 
Reasoning about the problem

…ok, but what are the final takeaways?
 Try to automate your analysis
 Think about what you measure and be creative, especially for visualization
 Enlarge your perspective
  multiple vantage points
  multiple data sources
  analysis on large time windows
 Don’t be afraid to ask for opinions
?? || ##
<[email protected]>
TMA - Traffic Monitoring and Analysis