Anemone: Edge-based network management
Anemone:
Edge-based network management
Mort (Richard Mortier)
MSR-Cambridge
December 2004
Network management
…is the process of monitoring and controlling a
large complex distributed system of dumb devices
where failures are common and resources scarce
Enterprise networks are large but closely managed
No-one has the big picture!
Contrast with the Internet or university campus networks
Internet routeing uses distributed protocols
Current management tools all consider local info
Patchy SNMP support, configuration issues, sampling
artefacts, tools generate CPU and network load
Anemone
Building edge-based network management platform
Collect flow information from hosts, and
Combine with topology information from routeing protocols
Enable visualization, analysis, simulation, control
Avoid problems of not-quite-standard interfaces
Do the work where resources are plentiful
Management support is typically ‘non-critical’ (i.e. buggy)
and not extensively tested for inter-operability
Hosts have lots of cycles and little traffic (relatively)
Protocol visibility: see into tunnels, IPSec, etc
Problem context: Enterprise networks
Large
Geographically distributed
~10^5 edge devices, ~10^3 network devices
Multiple continents, ~10^2 countries
Tightly controlled
IT department has (nearly) complete control over
user desktops and network connected equipment
Talk outline
System outline
What would it be good for?
In more detail…
Research issues
System outline
[Diagram: packets observed at hosts → flows → traffic matrix; routeing protocol → topology → set of routes; the Anemone platform combines routes, srcs, and dsts, feeding visualize/simulate/control back-ends via a simulator]
Where is my traffic going today?
Pictures of current topology and traffic
In fact, where did my traffic go yesterday?
Routes + flows + forwarding rules ⇒ the BIG PICTURE
Keep historical data for capacity planning, etc
A platform for anomaly detection
Historical data suggests “normality,” live
monitoring allows anomalies to be detected
Where might my traffic go tomorrow?
Plug into a simulator back-end
Run multiple ‘what-if’ scenarios
Discrete event simulator, flow allocation solver
…failures
…reconfigurations
…technology deployments
E.g. “What happens if we coalesce all the
Exchange servers in one data-centre?”
Where should my traffic be going?
Close the loop: compute link weights to
implement policy goals
Allows more dynamic policies
Recompute on order of hours/days
Modify network configuration to track e.g. time of
day load changes
Make network more efficient (~cheaper)?
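Closing the loop amounts to searching the link-weight space against a demand matrix. A minimal sketch of the evaluation step, comparing two candidate weight settings by peak link load; the toy topology, demands, and max-load score are illustrative assumptions, not from the talk:

```python
import heapq

def shortest_path(adj, src, dst):
    """Dijkstra over a weighted adjacency dict; returns the links on the path."""
    dist, prev = {src: 0}, {}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [], dst
    while node != src:
        path.append((prev[node], node))
        node = prev[node]
    return list(reversed(path))

def max_link_load(adj, demands):
    """Route each (src, dst, bw) demand on its shortest path; return peak link load."""
    load = {}
    for src, dst, bw in demands:
        for link in shortest_path(adj, src, dst):
            load[link] = load.get(link, 0) + bw
    return max(load.values())

def ring_with_chord(w):
    """4-node ring A-B-C-D plus an A-C chord of weight w (toy topology)."""
    return {
        "A": {"B": 1, "D": 1, "C": w},
        "B": {"A": 1, "C": 1},
        "C": {"B": 1, "D": 1, "A": w},
        "D": {"A": 1, "C": 1},
    }

demands = [("A", "C", 10), ("B", "C", 5), ("A", "B", 5)]
scores = {w: max_link_load(ring_with_chord(w), demands) for w in (1, 3)}
best_weight = min(scores, key=scores.get)
```

A real deployment would wrap this evaluation in a local search over all weights, recomputed on the hours/days timescale above.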
Where are we now?
Three major components
Flow collection
Route collection
Anemone platform
Studying feasibility and building prototypes
Data collection: flows
Hosts track active flows
Using ETW, low overhead event posting
infrastructure
Built prototype device driver provider & userspace consumer
Used 24h packet traces from (client, server)
for feasibility study
Peaks at (165, 5667) live and (39, 567) active
flows per sec
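The host-side flow table the driver maintains can be sketched as follows; the 5-tuple key and per-interval export are assumptions, with plain function calls standing in for ETW packet events:

```python
from collections import defaultdict

def _new_record():
    return {"packets": 0, "bytes": 0}

class FlowTable:
    """Per-host flow accounting: accumulate counts per 5-tuple, export periodically."""

    def __init__(self):
        self.flows = defaultdict(_new_record)

    def on_packet(self, src_ip, dst_ip, proto, src_port, dst_port, length):
        # In the prototype this would be driven by an ETW event, not a call.
        key = (src_ip, dst_ip, proto, src_port, dst_port)
        rec = self.flows[key]
        rec["packets"] += 1
        rec["bytes"] += length

    def export(self):
        """Hand the aggregated records to the platform and reset the table."""
        snapshot, self.flows = dict(self.flows), defaultdict(_new_record)
        return snapshot

ft = FlowTable()
ft.on_packet("10.0.0.1", "10.0.0.2", "tcp", 1234, 80, 1500)
ft.on_packet("10.0.0.1", "10.0.0.2", "tcp", 1234, 80, 40)
records = ft.export()
```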
Data collection: routes
OSPF is link-state so collect link state adverts
Completely passive
Modulo configuration
Process data to recover network “events” and topology
Data collected for (local, backbone) areas (20 days)
Similar to Sprint IS-IS collection
Was also done at AT&T (NSDI’04 paper)
LSA DB size: (700, 1048) LSAs ~ (21, 34) kB
Event totals: (2526, 3238) events ~ (5.3, 6.7) evts/hr
Small, generally stable with bursts of activity
[Chart: LSA event inter-arrival times, complete dataset and steady state; NB spike to ~100 from initial DB collection, truncated for readability]
Inter-arrival peaks: 35 mins (LSRefreshTime+CheckAge?), 30 mins (LSRefreshTime?), 10 mins (data ca. 25/Nov?), 1–2 mins (RouterDeadInterval?)
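Topology reconstruction from the collected adverts can be sketched like this; the LSA representation is a simplification of the wire format, and the replace-on-arrival rule mirrors how newer LSA instances supersede older ones:

```python
def build_topology(lsas):
    """Rebuild the adjacency map from router LSAs, modelled as
    (advertising router, [(neighbour, cost), ...]). A later LSA from the
    same router replaces its earlier one, as OSPF refreshes do; LSAs are
    assumed to be in arrival order."""
    adj = {}
    for router, links in lsas:
        adj[router] = dict(links)
    return adj

lsas = [
    ("r1", [("r2", 10), ("r3", 5)]),
    ("r2", [("r1", 10), ("r3", 2)]),
    ("r3", [("r1", 5), ("r2", 2)]),
    ("r1", [("r2", 10)]),               # refresh: r1-r3 adjacency has gone down
]
topology = build_topology(lsas)
```

Diffing successive rebuilds against each other is one way to recover the network "events" mentioned above.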
The Anemone platform
“Distributed database,” logically containing
1. Traffic flow matrix (bandwidths), {srcs} × {dsts}
2. …each entry annotated with current route, src to dst
Hosts can supply flows they source and sink
Only need a subset of this data to get complete traffic matrix
Note src/dst might be e.g. (IP end-point, application)
OSPF supplies topology → routes
Where/what/how much to distribute/aggregate?
Is data read- or write-dominated?
Which is more dynamic, flow or topology data?
Can the system successfully self-tune?
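The "only a subset" point can be illustrated directly: every flow is visible at both end-points, so the matrix is complete as long as at least one end reports, and double reports need reconciling. A sketch with made-up hosts and a max-of-observations policy (one possible reconciliation rule among several):

```python
def merge_reports(reports):
    """reports: {host: {(src, dst): bw}}. Flows reported by both ends
    appear once in the matrix; take the larger of the two observations."""
    matrix = {}
    for host, flows in reports.items():
        for (src, dst), bw in flows.items():
            matrix[(src, dst)] = max(matrix.get((src, dst), 0), bw)
    return matrix

reports = {
    "h1": {("h1", "h2"): 100},   # h1 reports the flow it sources
    "h2": {("h1", "h2"): 98},    # h2 reports the same flow it sinks
    "h3": {("h3", "h2"): 50},    # h2's report of this flow is missing
}
matrix = merge_reports(reports)
```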
The Anemone platform
Wish to be able to answer queries like
“Who are the top-10 traffic generators?”
“What is the load on link l?”
Can aggregate from hosts, but need to know routes
“What happens if we remove links {l…m}?”
Easy to aggregate, don’t care about topology
Interaction between traffic matrix, topology, even flow control
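The two example queries differ in what they touch: the first aggregates the matrix alone, while the second also needs each entry's route annotation. A sketch with illustrative data:

```python
from collections import Counter

def top_generators(matrix, n):
    """matrix: {(src, dst): bw}. Aggregate outbound bandwidth per source."""
    totals = Counter()
    for (src, _dst), bw in matrix.items():
        totals[src] += bw
    return totals.most_common(n)

def link_load(matrix, routes, link):
    """routes: {(src, dst): [link, ...]} as annotated in the platform.
    Sum the bandwidth of every matrix entry whose route crosses the link."""
    return sum(bw for key, bw in matrix.items() if link in routes[key])

matrix = {("h1", "h2"): 100, ("h1", "h3"): 40, ("h4", "h3"): 60}
routes = {
    ("h1", "h2"): [("r1", "r2")],
    ("h1", "h3"): [("r1", "r2"), ("r2", "r3")],
    ("h4", "h3"): [("r2", "r3")],
}
```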
Related work
{ distributed, continuous query, temporal } databases
Sensor networks, Astrolabe, SDIMS, PHI …
The Anemone platform
Building simulation model
OSPF data gives topology, event list, routes
Simple load model to start with (load ~ # subnets)
Predecessor matrix (from SPF) reduces flow-data query set
Can we do as well/better than e.g. NetFlow?
Accuracy/coverage trade-off
How should we distribute the data and by what protocols?
Just OSPF data? Just flow data? A mixture?
How many levels of aggregation?
How many nodes do queries touch?
What sort of API is suitable?
Example queries for sample applications
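How the predecessor matrix trims the query set can be sketched as follows: given each source's shortest-path tree (predecessor per destination, as produced by SPF), we can enumerate exactly which (src, dst) pairs cross a link, so a link-load query need only contact those sources. The trees below are hand-built for a toy three-node line A–B–C:

```python
def pairs_crossing(pred, link):
    """pred: {src: {dst: predecessor-of-dst on src's SP tree}}.
    Return the (src, dst) pairs whose shortest path uses the directed link."""
    u, v = link
    hit = set()
    for src, tree in pred.items():
        for dst in tree:
            node = dst
            while node != src:          # walk back from dst towards src
                p = tree[node]
                if (p, node) == (u, v):
                    hit.add((src, dst))
                    break
                node = p
    return hit

# Shortest-path trees for the line topology A - B - C.
pred = {
    "A": {"B": "A", "C": "B"},
    "B": {"A": "B", "C": "B"},
    "C": {"B": "C", "A": "B"},
}
crossing = pairs_crossing(pred, ("B", "C"))
```

Only the sources in `crossing` hold flow data relevant to link B→C, which is the reduction in the query set.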
Research issues
Corner cases
Scalability
Robustness, accuracy
Control systems
Research issues
Corner cases
Multi-homed hosts: how best to define a flow
L4 routeing, NAT, proxy ARP, transparent proxies
(Solve using device config files, perhaps SNMP)
Scalability
Host measurement must not be intrusive (in terms of
packet latency, CPU load, network bandwidth)
Aggregators must elect themselves in such a way that they
do not implode under event load
What happens if network radically alters? E.g.
Extensive use of multicast
Connection patterns shift due to e.g. P2P deployment
Research issues
Robustness
Network management had better still work as nodes fail or
the network partitions!
Accuracy in the face of late, partial information
By accident: unmonitored hosts
By design: aggregation, more detail about local area
Inference of link contribution to cumulative metrics, e.g. RTT
Network control: modify link weights
How efficient is the current configuration anyway?
What are plausible timescales to reconfigure?
Summary
Aim to build a coherent edge-based network
management platform using flow monitoring and
standard routeing protocols
Applications include visualization, simulation, dynamic
control
Research issues include
Scalability: want to manage a 300,000 node network
Robustness: must work as nodes fail or network partitions
Accuracy: will not be able to monitor 100% of traffic
Control systems: use the data to optimize the network in
real-time, as well as just observe and simulate
Current status
Submitted Networking 2005 paper
Prototype ETW provider/consumer driver
Studied feasibility of flow monitoring
Prototype OSPF collector & topology reconstruction
Investigating “distributed database” via simulation
Query properties
System decomposition
Protocols for data distribution
Questions, comments?
Backup slides
SNMP
Internet routeing
OSPF
BGP
Security
SNMP
Protocol to manage information tables at devices
Provides get, set, trap, notify operations
get, set: read, write values
trap: signal a condition (e.g. threshold exceeded)
notify: reliable trap
Complexity mostly in the table design
Some standard tables, but many vendor specific
Non-critical, so tables are often populated incorrectly
Internet routeing
Q: how to get a packet from node to destination?
A1: advertise all reachable destinations and apply a
consistent cost function (distance vector)
A2: learn network topology and compute consistent
shortest paths (link state)
Each node (1) discovers and advertises adjacencies;
(2) builds link state database; (3) computes shortest paths
A1, A2: Forward to next-hop using longest-prefix-match
OSPF (~link state routeing)
Q: how to route given packet from any node to
destination?
A: learn network topology; compute shortest paths
For each node
Discover adjacencies (~immediate neighbours); advertise
Build link state database (~network topology)
Compute shortest paths to all destination prefixes
Forward to next-hop using longest-prefix-match (~most
specific route)
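The final forwarding step can be sketched with Python's standard ipaddress module; the table entries are illustrative:

```python
import ipaddress

def longest_prefix_match(table, addr):
    """table: [(network, next_hop)]. Pick the most specific matching route."""
    ip = ipaddress.ip_address(addr)
    best = None
    for net, nh in table:
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, nh)
    return best[1] if best else None

table = [
    (ipaddress.ip_network("10.0.0.0/8"), "r1"),
    (ipaddress.ip_network("10.1.0.0/16"), "r2"),
    (ipaddress.ip_network("0.0.0.0/0"), "r0"),   # default route
]
```

Real forwarding planes use trie structures rather than a linear scan, but the matching rule is the same.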
BGP (~path vector routeing)
Q: how to route given packet from any node to destination?
A: neighbours tell you destinations they can reach; pick cheapest
option
For each node
Receive (destination, cost, next-hop) for all destinations known to
neighbour
Select among all possible next-hops for given destination
Advertise selected (destination, cost+, next-hop') for all known
destinations
Selection process is complicated
Routes can be modified/hidden at all three stages
General mechanism for application of policy
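The receive/select/advertise loop can be sketched as follows, using shortest AS path as a (much simplified) stand-in for BGP's full decision process, with path-based loop prevention:

```python
def select_routes(my_as, adverts):
    """adverts: [(dest, as_path, next_hop)] heard from neighbours.
    Keep the best route per destination; here 'best' is shortest AS path."""
    best = {}
    for dest, path, nh in adverts:
        if my_as in path:
            continue                  # loop prevention: our AS is already on the path
        if dest not in best or len(path) < len(best[dest][0]):
            best[dest] = (path, nh)
    # What we would advertise onwards: our own AS number prepended.
    outgoing = {dest: [my_as] + path for dest, (path, nh) in best.items()}
    return best, outgoing

adverts = [
    ("10.0.0.0/8", [200, 300], "n1"),
    ("10.0.0.0/8", [400], "n2"),           # shorter AS path wins
    ("192.168.0.0/16", [300, 100], "n3"),  # contains our AS (100): dropped
]
best, outgoing = select_routes(100, adverts)
```

The policy knobs on the slide above correspond to filtering or rewriting at each of the three stages: on receipt, in selection, and on re-advertisement.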
Security
Threat: malicious/compromised host
Threat: DoS on monitors
Authenticate participants
Must secure route collector as if a router
Difference between client under DoS and server?
Rate pace output from monitors
Threat: eavesdropping
Standard IPSec/encryption solutions