Transcript ppt - OASIS
An Annotation Layer for Network Management
George Porter, Arne Baste,
David Chu, Dilip Joseph
Randy H. Katz
NetRads Retreat - June 2005
Goal of today’s talk
Snapshot of our thinking in this area
Several open research problems as to
appropriateness of piggybacking, effectiveness of
distributed observation, etc.
Your feedback appreciated
Outline
Motivating example: Discovering and protecting
network service performance during stress
PNEs as A-Layer building block
Overview: Annotation layer as provider of
component building block for network
management
Revisit network service example with A-Layer
Research challenges, open issues, opportunities
Motivating Example:
Network service slowdown/failure
[Figure: network diagram with labels Client, IC, R, Dist Tier, IS, and a Server tier containing FTP, Web, NFS, and DNS servers]
Problem:
Users in the access tier complain of slow web access, can’t
mount files, and see “DNS operation timed out” messages
This problem started today at 10am
Where to begin?
Network connectivity between users and outside seems ok
But name resolution is intermittent and slow
We need tools to figure out who is affected, who isn’t affected,
the cause, and a solution.
Motivating Example:
Network service slowdown/failure
Network connectivity to DNS? [ping,traceroute]
Are DNS requests making it to the server tier?
What is happening to the request completion rate (is it lower)? Compare with
network path losses (i.e., is it the path or the service?). DNS server CPU
level is up
Localize the problem:
Only this user? Or other clients?
Just that server? What is happening to the DNS req/reply completion rate
of other servers in that cluster? Correlations? Is this user anomalous?
So far: DNS overloaded, leading to timeouts on client end
[Figure: network diagram with labels Client, IC, R, Dist Tier, IS, II, SMTP, and a Server tier containing Web, NFS, and DNS servers]
Why is the service overloaded?
Is there an unusual number of requests from other sources? [deviation from
the mean]
What is the status of requests to this service network-wide? How has it
changed since before the first reports of the problem?
We discover that the number of DNS requests from access and ISP
networks is unchanged (so the extra load must originate in the server tier)
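The "deviation from the mean" check above can be sketched as a z-score test over per-interval request counts. This is only an illustration: the function name, the cutoff of 3 standard deviations, and the sample counts are assumptions, not values from the talk.

```python
from statistics import mean, stdev

def is_unusual(history, current, threshold=3.0):
    """Return True if `current` is more than `threshold` standard
    deviations away from the mean of `history` (assumed cutoff)."""
    if len(history) < 2:
        return False  # not enough samples to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold

# Hypothetical DNS requests per minute, grouped by where they originate.
access_history = [118, 122, 120, 119, 121, 120]
server_tier_history = [40, 42, 39, 41, 40, 38]

access_unusual = is_unusual(access_history, 121)        # ordinary load
server_unusual = is_unusual(server_tier_history, 400)   # large surge
```

With these made-up numbers the access network looks normal while the server tier stands out, which mirrors the conclusion reached in the example.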
Other correlations? Yes, to SMTP traffic at ISP ingress
We suspect the endpoint of SMTP traffic, a spam appliance, as the cause of DNS
performance loss
No unusual surges of DNS from access or ISP (from outside our enterprise network)
Thus originating inside the server tier
And correlated to SMTP traffic
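One way to look for the correlation described above is a Pearson coefficient between per-interval traffic volume and DNS latency. The sketch below uses invented sample series and is not the method used in the talk, which does not name a specific statistic at this point.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(vx * vy)

# Hypothetical per-minute samples taken over the same six intervals.
smtp_msgs = [100, 150, 400, 900, 950, 300]
web_reqs = [500, 480, 510, 495, 490, 505]
dns_latency_ms = [20, 22, 60, 140, 150, 45]

r_smtp = pearson(smtp_msgs, dns_latency_ms)  # strongly positive
r_web = pearson(web_reqs, dns_latency_ms)    # weak
```

A high coefficient is only a hint; the intervention step that follows is what separates correlation from causation.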
Eliminate false positives: testing this conjecture via experimental intervention
Temporarily b/w throttle SMTP traffic from ISP ingress
Test DNS latency from access network
Find that DNS latency goes down when SMTP volume goes down
We enact a new (but temporary) policy:
Redirect requests from access tier to secondary or tertiary DNS
server (service separation for different users)
BW regulate SMTP traffic to keep DNS server CPU load from
peaking
Access users’ service restored--their traffic is protected.
Problem localized and mitigated
Long term solution: software upgrade, firmware upgrade, add
dedicated DNS cache for appliance
Example Review
Localizing and identifying problem required
Network-wide visibility despite stressed links/servers
Path information (network connectivity, protocol
request/reply completion information)
Finding changes in behavior (avg # requests/unit time,
rate of change of traffic)
Finding correlations between traffic (traffic classes,
volume, network level paths)
Experimental intervention (correlation to causation)
Enabling new policy (redirecting traffic to secondary
server, BW throttling/fencing misbehaving flows)
Principles for network
management
Network-wide visibility
despite surges/overload/high
loss rates
Low overhead
Path statistics gathering
Some protocol visibility (TCP,
IP, Services like DNS, NFS)
Need to discover
Changes to request-reply
rate, completions, latency
over time
Correlations between
different flows, protocols,
parts of the network
New policies (Actions)
For experimental
intervention (root cause
discovery)
To protect good traffic
BW shaping, blocking,
scheduling, fencing, selective
drop
Security
Against non-operators
using this infrastructure
Against DoS attacks
PNEs (Programmable Network
Elements) and iBoxes
Inspection-and-action points
Deep, multiprotocol, packet inspection
No routing, just observation and marking
Actions: Selective drop, b/w fencing and shaping,
notification of operators, query “points of observation”
Some protocol visibility to TCP, UDP, ‘good’ network
service protocols like DNS/NFS
Per-flow session state and reverse path visibility
Per-flow and per-path simple statistics gathering
(latencies, round trip times, requests/sec, address source
and destinations)
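The per-flow statistics listed above (request/reply completion, latency) could be gathered by matching each reply to the request that preceded it. The following is a minimal sketch under assumptions: the class name, the key format (client address plus transaction ID), and the 2-second timeout are all hypothetical.

```python
class CompletionTracker:
    """Match replies to requests and keep simple completion statistics."""

    def __init__(self, timeout=2.0):
        self.timeout = timeout   # assumed timeout in seconds
        self.pending = {}        # key -> time the request was seen
        self.completed = 0
        self.timed_out = 0
        self.latencies = []

    def on_request(self, key, now):
        self.pending[key] = now

    def on_reply(self, key, now):
        sent = self.pending.pop(key, None)
        if sent is None:
            return  # reply with no observed request; ignore it
        self.completed += 1
        self.latencies.append(now - sent)

    def expire(self, now):
        """Count requests that have waited longer than the timeout."""
        stale = [k for k, t in self.pending.items() if now - t > self.timeout]
        for k in stale:
            del self.pending[k]
            self.timed_out += 1

    def completion_rate(self):
        total = self.completed + self.timed_out
        return self.completed / total if total else None

tracker = CompletionTracker(timeout=2.0)
tracker.on_request(("10.0.0.5", 0x1A2B), now=0.0)
tracker.on_request(("10.0.0.6", 0x0001), now=0.1)
tracker.on_reply(("10.0.0.5", 0x1A2B), now=0.25)
tracker.expire(now=5.0)
rate = tracker.completion_rate()
```

Here one request completes and one times out, so the completion rate is one half.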
Annotation Layer
Explicit layer for iBox-to-iBox communication
via packet annotations
[Figure: iBoxes along a path; example annotation label “url: X”]
Annotations:
Fixed size
Encoded to enable the de-annotation of packets
Multiple payload types based on any layer of the flow
Security field for authentication
A-Layer Annotation Design
[Figure: AL unit layout, drawn 32 bits wide (bit positions 0-31)]
Labels in the figure: Prior Protocol, Type, Authentication Field (10 bytes), Sequence Number, Destination Address, Source Address, Annotation Layer Payload
AL unit headers: 14 bytes
Annotation Layer payload: 12 bytes in one AL unit
Encode annotations in between IP and transport
Allow annotations to be stacked (multiple)
Annotations are removed by iBoxes before reaching
endhosts
Motivation: start with large (but versatile) annotation
format
When we discover the set of annotations that are most
effective for network management, we can reduce the
footprint to support that set
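To make stacking and removal concrete, here is a rough sketch of pushing fixed-size AL units between IP and transport and stripping them again. The field split is an assumption: the slide gives only the 14-byte header, the 10-byte authentication field, and the 12-byte payload, so the 1-byte prior protocol, 1-byte type, and 2-byte sequence number are guesses, and the address fields from the figure are left out. The protocol number 253 (an IANA experimental value) is also an assumption.

```python
import struct

AL_PROTO = 253             # assumed protocol number for the A-Layer
AL_FORMAT = "!BBH10s12s"   # assumed: prior proto, type, seq, auth, payload
AL_SIZE = struct.calcsize(AL_FORMAT)  # 14-byte header + 12-byte payload

def annotate(proto, body, ann_type, seq, auth, payload):
    """Push one AL unit in front of `body`; return (new_proto, new_body)."""
    unit = struct.pack(AL_FORMAT, proto, ann_type, seq, auth, payload)
    return AL_PROTO, unit + body

def strip_all(proto, body):
    """Pop every stacked AL unit and restore the original protocol."""
    annotations = []
    while proto == AL_PROTO:
        if len(body) < AL_SIZE:
            raise ValueError("truncated AL unit")
        prior, ann_type, seq, _auth, payload = struct.unpack(
            AL_FORMAT, body[:AL_SIZE])
        annotations.append((ann_type, seq, payload))
        proto, body = prior, body[AL_SIZE:]
    return proto, body, annotations

TCP = 6
key = b"k" * 10  # placeholder authentication bytes
proto, body = annotate(TCP, b"tcp-segment", 1, 7, key, b"dns-rtt:35ms")
proto, body = annotate(proto, body, 2, 8, key, b"smtp-surge!!")
orig_proto, orig_body, anns = strip_all(proto, body)
```

Storing the prior protocol number in each unit is what allows an iBox to undo the annotation and hand the endhost an unmodified packet; the outermost annotation is recovered first.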
Categories of annotations
SNMP proxy
Netflows
Alteon, Packeteer.
iBox placement
In an Enterprise Network: iBoxes at points of hierarchical division
[Figure: enterprise network diagram. Labels: Internet Edge (II), Server Edge (IS), Access Edge (IA), Distribution Tier; routers R; servers S (primary and secondary DNS servers, mail server, spam appliance); access hosts 10.0.0.1 ... 10.0.0.100 and 10.0.0.101 ... 10.0.0.255; points A, B, C, D]
These locations give iBoxes ability to monitor and classify
traffic flowing through them. Also, iBoxes can slow down,
block, fence, and drop traffic to ease surges and protect
“good” traffic from bad/ugly traffic
Routing to other iBoxes
Once we know which iBoxes
exist, we need to know how to
reach them so we can send
them annotations
Requires building up this table
at each iBox
Topology dependent
If a packet’s destination
address doesn’t match an iBox
in this table, we remove all
annotations to ensure endhost
correctness
iBox routing table (figure callouts mark some rows as “core” iBoxes and others as “edge” iBoxes):
IPv4/v6 Address    iBox ID
169.229.62/24      A
169.229.60/24      B
169.229/16         C
128.40.1.3/32      D
128.40.1.4/32      B
0/0                none
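A lookup against a table like this one is naturally a longest-prefix match. The sketch below uses the prefixes from the slide, written out in full because Python's `ipaddress` module does not accept the abbreviated form (for example `169.229.62/24`). The function name is hypothetical.

```python
import ipaddress

# (prefix, iBox ID) pairs from the slide; None stands for "none".
IBOX_TABLE = [
    ("169.229.62.0/24", "A"),
    ("169.229.60.0/24", "B"),
    ("169.229.0.0/16", "C"),
    ("128.40.1.3/32", "D"),
    ("128.40.1.4/32", "B"),
    ("0.0.0.0/0", None),
]
PARSED = [(ipaddress.ip_network(p), ibox) for p, ibox in IBOX_TABLE]

def next_ibox(dst):
    """Return the iBox ID for the most specific matching prefix,
    or None when no iBox lies toward the destination."""
    addr = ipaddress.ip_address(dst)
    matches = [(net.prefixlen, ibox) for net, ibox in PARSED if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0])[1]

# Per the slide, a result of None means all annotations are removed
# before the packet is forwarded.
must_strip = next_ibox("8.8.8.8") is None
```

A linear scan is fine for a table this small; a real device would more likely use a trie or hardware lookup.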
Active vs Passive annotations
When to send “active” annotations (i.e., a separate packet) vs. when
to passively annotate?
Passive: available during high traffic; active: expedient
Associate timers with each queue
When a packet arrives and an annotation is dequeued, we reset the
timer
If the timer goes off, we generate a new dummy packet, annotate
it, send it off to the right destination iBox, and reset the timer
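The timer rule above can be sketched as a per-destination queue with a deadline. Time is passed in explicitly so the logic is easy to follow; the class name, the method names, and the 1-second wait are illustrative assumptions.

```python
from collections import deque

class AnnotationQueue:
    """Pending annotations for one destination iBox, with a send deadline."""

    def __init__(self, max_wait, now=0.0):
        self.max_wait = max_wait
        self.pending = deque()
        self.deadline = now + max_wait

    def enqueue(self, annotation):
        self.pending.append(annotation)

    def on_packet(self, now):
        """A data packet toward this iBox arrived: piggyback (passive)."""
        if not self.pending:
            return None
        self.deadline = now + self.max_wait  # dequeue resets the timer
        return ("passive", self.pending.popleft())

    def on_tick(self, now):
        """Periodic check: if the timer expired, send a dummy packet (active)."""
        if now < self.deadline:
            return None
        self.deadline = now + self.max_wait
        if not self.pending:
            return None
        return ("active", self.pending.popleft())

q = AnnotationQueue(max_wait=1.0)
q.enqueue("dns-latency-report")
q.enqueue("smtp-rate-report")
first = q.on_packet(now=0.4)   # traffic is flowing, so piggyback
idle = q.on_tick(now=1.0)      # timer was reset at 0.4, not yet expired
second = q.on_tick(now=1.5)    # no traffic since, so send actively
```

The longer `max_wait` is, the more annotations ride on existing traffic, at the cost of staler information when the path is quiet.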
A-Layer as component building
blocks for observe-analyse-act
Observe
Path statistics; req/reply completion rate, latency;
new conn rate; connection age; protocol
types/mixtures; their change over time
Analyse
Correlations; mean changing over time (chi-sq);
PCA; experimental intervention (act, then observe)
Act
BW throttling, selective drop, packet scheduling, BW
fencing
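As one possible reading of the "chi-sq" item under Analyse, the sketch below computes a Pearson chi-squared statistic comparing the protocol mix seen in the current window with a baseline mix. The counts are invented, and 5.991 is the standard 5% critical value for two degrees of freedom; whether the authors intended this exact test is not stated.

```python
def chi_squared(observed, baseline):
    """Chi-squared statistic of observed counts against a baseline mix.

    Both arguments map protocol name -> packet count. Every baseline
    count must be positive; protocols missing from `observed` count as 0.
    """
    total_obs = sum(observed.values())
    total_base = sum(baseline.values())
    stat = 0.0
    for proto, base_count in baseline.items():
        expected = total_obs * base_count / total_base
        stat += (observed.get(proto, 0) - expected) ** 2 / expected
    return stat

CRITICAL_DF2_5PCT = 5.991  # chi-squared critical value, df = 2, alpha = 0.05

baseline = {"dns": 200, "smtp": 300, "web": 500}
normal_window = {"dns": 21, "smtp": 29, "web": 50}
surge_window = {"dns": 60, "smtp": 90, "web": 50}

normal_stat = chi_squared(normal_window, baseline)
surge_stat = chi_squared(surge_window, baseline)
mix_changed = surge_stat > CRITICAL_DF2_5PCT
```

The normal window stays far below the critical value, while the window with extra DNS and SMTP traffic exceeds it by a wide margin.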
Why Distributed observe-analyse-act?
Distributed
Quick distribution of
information
Need for information
throughout the network
Works during network
partitions, provides visibility
during surges when it is hard
to get packets through
Up-to-date info, but might be
inconsistent
But, consistency hard; could
start bad feedback loops; need
to elect leader
Centralized
More control, consistent
information (but could be out
of date)
Centralize policy (no need to
cast policy over multiple
nodes)
Distributed routing preferred
over centralized approach
Similar motivation for
iBoxes/A-Layer
Path-oriented connectivity and reachability
Network service monitoring
Are requests getting through? What is their rate? What has been happening to the
DNS latency? Where are “DNS hotspots”?
iBoxes can store characteristics of paths through the network
Types of protocols they see, volume of protocols, rate of change of traffic,
distribution of source/destination addresses seen, network errors, topology
information
NetFlows as statistics gathering at a single point
Extract and share reports from this information
Annotate packets with iBox Source annotation to have access to inside-vs-outside/paths chosen and paths taken
Annotate packets with service reachability reports, link conditions, traffic
rates and changes of traffic rates
Annotate packets with protocol reports that represent the mixture of
protocols seen at various points throughout the network
Relationship between traffic classes, correlations, anomalies
Discovering anomalies: iBoxes consuming annotations from other
parts of the network need to be able to discover when good services
lose performance
SLT problem of anomaly detection made easier with more information and
visibility
Network data stored in vector form for rate, quantity, time domain
Discovering correlations: For good services that are degrading,
finding correlations to anomalous traffic surges, flash traffic, etc.
provides hints to cause of problem
Each iBox representing affected traffic needs annotations containing network
wide events capturing changes in traffic patterns
“Analysis” components of observe-analyze-act done from multiple network
vantage points or centralized?
Experimental Intervention, protection of good traffic via policy actions
Experimental intervention:
Control annotations sent to iBox near source of surge to temporarily
throttle
Annotations routed to iBox at ISP ingress to invoke new policy
The policy in the annotation relies on iBox actions of BW shaping, fencing,
and TCP ack manipulation to reduce SMTP flow rate
Protection of good traffic:
Policy could include network-level redirection to channel good DNS
requests from access networks to a secondary, backup DNS service
Marking traffic not affiliated with surge for protection elsewhere in the
network closer to the service location
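The BW shaping action mentioned above is commonly built on a token bucket. The talk does not say which mechanism the iBoxes use, so the following is a generic sketch with made-up rate and burst values.

```python
class TokenBucket:
    """Token-bucket limiter: tokens are bytes, refilled at a fixed rate."""

    def __init__(self, rate_bytes_per_s, burst_bytes, now=0.0):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = now

    def allow(self, size, now):
        """Return True if a packet of `size` bytes may pass at time `now`."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # caller may drop or queue the packet

# Hypothetical policy: hold SMTP from the ISP ingress to 1000 bytes/s
# with a 1500-byte burst allowance.
smtp_limit = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
decisions = [
    smtp_limit.allow(1000, now=0.0),  # fits within the initial burst
    smtp_limit.allow(1000, now=0.1),  # bucket not yet refilled
    smtp_limit.allow(1000, now=1.0),  # refilled after 0.9 s
]
```

For the experimental intervention, an operator would apply such a limit briefly, measure DNS latency from the access network, and then lift or keep the limit depending on the result.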
Policy expression and
deployment
When correlations discovered, what to do with
them?
Initial efforts are to provide observation platform
for visualization of network state
A-Layer/iBoxes as building blocks for operator
interaction
“Above the network” services
Right now we envision iBoxes understanding well
known network services
Open question as to visibility to higher level
applications like web services, enterprise-specific
apps
New policy complexity, new correlations and state
management needed
Statistical visualization for
operators
Open problem to aggregate distributed
observations into coherent visualization for
operators
Where does the visualization reside?
What are the right metrics/correlations/deviations
from mean that are relevant?
How do actions relate to visualization?
SLT analysis
Choice of algorithm
Finding “interesting” correlations
Not being overloaded with too many correlations
and events
Deviation from mean, finding patterns, what is
normal operation for a protocol?
Managing distributed actions
Managing feedback loops
Providing coherent actions at the global scale
based on iBoxes distributed throughout the
network
Coordinating actions despite network surges and
limited network access, path losses, etc.
Q&A
Q: What about the e2e argument?
Adding/removing annotations:
Annotations easy to remove
Packet paths not modified
Actions such as throttling, scheduling, dropping
Con: affects traffic in ways endhosts can detect
Pro: Provides “library” of components to enable new
network services / management features
That’s how we build software
A-Layer gives enterprise operators control over their
networks
As long as their applications are supported and work
Enterprise networks usually have white list of allowed
apps, all other disallowed
Contrast this to ISPs
Q: What about per-flow state
management?
Some routers can keep per-flow state (Netflows)
iBoxes can sample traffic
iBoxes not in correctness path--can act as ‘nops’
Network traffic parallelizable, targeting 1 GigE
Can be merged into expandable network devices
(see Cisco’s server cards that plug into routers)
Q: What about e2e security (IPsec)?
E2e security obscures the protocol, but not path stats
Conceivable to discover request/response phases, infer
completion rate; keep stats on # connections, flow rates
Statistically infer when a flow is starved for bandwidth;
observe bandwidth over time; correlate with
destination/source function (web server, mail server, etc.)
Correlations still work over encrypted traffic
Can still perform experiments by affecting flow X,
observing flow Y
Q: Why annotate? (Why not send
separate packets?)
Annotations are about path characteristics
Can bind to the flow they describe
Statistics follow paths where they are the most relevant
Marries per-path context with each packet of a particular
flow (gives iBoxes info they need to throttle, fence, etc)
As packet flow rate increases, more opportunity for
visibility by piggybacking
Lower overhead during times of stress
Possible preference of fewer large packets than more
small packets
Explicit sending of separate packets still ok
Especially for discovery, control, and policy distribution
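A back-of-the-envelope comparison shows why piggybacking costs less than separate packets. The AL unit size comes from the design slide (14-byte header plus 12-byte payload); the 20-byte figure is the minimum IPv4 header. Link-layer framing and any transport header on the separate packet are ignored, so the real gap would be larger.

```python
AL_UNIT_BYTES = 14 + 12   # header + payload sizes from the design slide
IPV4_HEADER_BYTES = 20    # minimum IPv4 header, no options

def piggyback_cost(n_annotations):
    """Extra bytes when each annotation rides on an existing packet."""
    return n_annotations * AL_UNIT_BYTES

def separate_cost(n_annotations):
    """Extra bytes when each annotation needs its own IP packet."""
    return n_annotations * (IPV4_HEADER_BYTES + AL_UNIT_BYTES)

extra_piggyback = piggyback_cost(1000)
extra_separate = separate_cost(1000)
extra_packets = 1000  # separate sending also adds this many packets
```

Beyond the byte count, separate sending adds packets to queues that are already under stress, which is the point of the "fewer large packets" remark.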
Q: Why distributed?
Centralized statistics gathering easy in enterprise
networks
Information might be needed in more than one place
“Act” operations to protect good traffic need timely info
Contrast to 5-min avgs common in SNMP
Raises difficulty, though
But hard during times of stress/traffic spikes/flash traffic
Election protocols, distributed consensus, negative
feedback loops, management of iBoxes
Let’s experiment and see
Open research question as to benefit of distributed vs
centralized network observation, analysis, and
action/actuation