Enterprise Network Management
ENMA:
Co-operation in the corporation
Mort (Richard Mortier)
MSR-Cambridge
September 2004
Network management
…is the process of monitoring and controlling a
large complex distributed system of dumb devices
where failures are common and resources scarce
Enterprise networks are large but closely managed
No-one has the big picture!
Contrast with the Internet or university campus networks
Internet routeing uses distributed protocols
Current management tools all consider local info
Patchy SNMP support, configuration issues, sampling
artefacts, tools generate CPU and network load
This project
Building edge-based network management platform
Collect flow information from hosts, and
Combine with topology information from routeing protocols
Enable visualization, analysis, simulation, control
Avoid problems of not-quite-standard interfaces
Do the work where resources are plentiful
Management support is typically ‘non-critical’ (i.e. buggy)
and not extensively tested for inter-operability
Hosts have lots of cycles and little traffic (relatively)
Protocol visibility: see into tunnels, IPSec, etc
Problem context: Enterprise networks
Large
Geographically distributed
10^5 edge devices, 10^3 network devices
Multiple continents, 10^2 countries
Tightly controlled
IT department has (nearly) complete control over
user desktops and network connected equipment
Talk outline
System outline
What would it be good for?
In more detail…
Research issues
System outline
[Diagram: Packets → Flows → Traffic matrix ({srcs} × {dsts}); Routeing protocol → Topology → Set of routes; both feed the Distributed database, which drives the Simulator and the Visualize, Simulate, Control applications]
Where is my traffic going today?
Pictures of current topology and traffic
In fact, where did my traffic go yesterday?
Routes + flows + forwarding rules = BIG PICTURE
Keep historical data for capacity planning, etc
A platform for anomaly detection
Historical data suggests “normality”, live
monitoring allows anomalies to be detected
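A minimal sketch of the anomaly-detection idea, assuming "normality" is summarised by the mean and standard deviation of historical per-interval byte counts, and that a live sample counts as anomalous when it lies more than k standard deviations from that mean. The function name, the threshold k and the sample data are illustrative assumptions, not part of the project.

```python
from statistics import mean, stdev

def is_anomalous(history, sample, k=3.0):
    """Return True if `sample` is more than k standard deviations
    away from the mean of `history` (needs at least two history points)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any deviation at all is unusual.
        return sample != mu
    return abs(sample - mu) > k * sigma

# Hypothetical historical byte counts for one link, one value per interval.
history = [100, 110, 95, 105, 98, 102]
print(is_anomalous(history, 104))
print(is_anomalous(history, 500))
```

A real deployment would probably need per-time-of-day baselines, since the talk notes that load changes with time of day.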
Where might my traffic go tomorrow?
Plug into a simulator back-end
Run multiple ‘what-if’ scenarios
Discrete event simulator, flow allocation solver
…failures
…reconfigurations
…technology deployments
E.g. “What happens if we coalesce all the
Exchange servers in one data-centre?”
Where should my traffic be going?
Close the loop: compute link weights to
implement policy goals
Allows more dynamic policies
Recompute on order of hours/days
Modify network configuration to track e.g. time of
day load changes
Might make network more efficient (~cheaper)
Where are we now?
Three major components
Flow collection
Route collection
Distributed database
Still studying feasibility
Starting to build prototypes
Data collection
Flow collection
Hosts track active flows
Used packet traces for feasibility study on (client, server)
Using low overhead event posting infrastructure, ETW
Built prototype device driver provider & user-space consumer
Peaks at (165, 5667) live and (39, 567) active flows per second on (client, server)
Route collection
OSPF is link-state: passively collect link state adverts
Extension of my work at Sprint (for IS-IS and BGP); also
been done at AT&T (NSDI’04 paper)
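A minimal sketch of how a host might track its flows, assuming a flow is keyed by the usual 5-tuple and that some per-packet event source (ETW in the prototype, not modelled here) calls on_packet. The class, method names and idle timeout are hypothetical.

```python
from collections import namedtuple

FlowKey = namedtuple("FlowKey", "src_ip src_port dst_ip dst_port proto")

class FlowTable:
    """Per-host table of flows, fed by (hypothetical) per-packet events."""

    def __init__(self, idle_timeout=60.0):
        self.idle_timeout = idle_timeout
        # FlowKey -> [bytes, packets, first_seen, last_seen]
        self.flows = {}

    def on_packet(self, key, size, now):
        """Account one packet of `size` bytes seen at time `now`."""
        rec = self.flows.get(key)
        if rec is None:
            self.flows[key] = [size, 1, now, now]
        else:
            rec[0] += size
            rec[1] += 1
            rec[3] = now

    def expire(self, now):
        """Remove and return flows idle for longer than idle_timeout."""
        done = {k: v for k, v in self.flows.items()
                if now - v[3] > self.idle_timeout}
        for k in done:
            del self.flows[k]
        return done

table = FlowTable(idle_timeout=60.0)
k = FlowKey("10.0.0.1", 40000, "10.0.1.5", 80, "tcp")
table.on_packet(k, 1500, now=0.0)
table.on_packet(k, 500, now=1.0)
```

Expired records are what a host would hand to an aggregator; keeping only counters per flow is one way to keep the measurement cheap.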
The distributed database
Logically contains
1. Traffic flow matrix (bandwidths), {srcs} × {dsts}
2. …each entry annotated with current route from src to dst
N.B. src/dst might be e.g. (IP end-point, application)
Large dynamic data set suggests aggregation
Related work
{ distributed, continuous query, temporal } databases
Sensor networks
Potential starting points: Astrolabe or SDIMS (SIGCOMM’04)
Where/what/how much to aggregate?
Is data read- or write-dominated?
Which is more dynamic, flow or topology data?
Can the system successfully self-tune?
The distributed database
Construct traffic matrix from flow monitoring
Hosts can supply flows they source and sink
Only need a subset of this data to get complete traffic matrix
Construct topology from route collection
OSPF supplies topology → routes
Wish to be able to answer queries like
“Who are the top-10 traffic generators?”
Easy to aggregate, don’t care about topology
“What is the load on link l?”
Can aggregate from hosts, but need to know routes
“What happens if we remove links {l…m}?”
Interaction between traffic matrix, topology, even flow control
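A minimal sketch of the first two queries, assuming the traffic matrix is held as a dictionary from (src, dst) to bandwidth and each entry is annotated with the list of links on its current route. All names and data are hypothetical, and everything lives in one process here rather than in a distributed database.

```python
from collections import defaultdict

# Hypothetical traffic matrix: (src, dst) -> bandwidth in Mbit/s.
traffic = {("a", "c"): 10.0, ("a", "b"): 5.0, ("b", "c"): 2.0}

# Hypothetical route annotation: (src, dst) -> links traversed.
routes = {
    ("a", "c"): [("a", "b"), ("b", "c")],
    ("a", "b"): [("a", "b")],
    ("b", "c"): [("b", "c")],
}

def top_generators(traffic, n=10):
    """Sum bandwidth per source; no topology needed."""
    totals = defaultdict(float)
    for (src, _dst), bw in traffic.items():
        totals[src] += bw
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

def link_load(traffic, routes, link):
    """Sum bandwidth of every flow whose route crosses `link`."""
    return sum(bw for pair, bw in traffic.items() if link in routes[pair])

print(top_generators(traffic))
print(link_load(traffic, routes, ("b", "c")))
```

The third query is not covered: removing links changes the routes themselves, so it needs the topology and a route recomputation rather than a simple aggregation.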
The distributed database
Building simulation model
OSPF data gives topology, event list, routes
Simple load model to start with (load ~ # subnets)
Precedence matrix (from SPF) reduces flow-data query set
Can we do as well/better than e.g. NetFlow?
Accuracy/coverage trade-off
How should we distribute the DB?
Just OSPF data? Just flow data? A mixture?
How many levels of aggregation?
How many nodes do queries touch?
What sort of API is suitable?
Example queries for sample applications
Research issues
Corner cases
Scalability
Robustness, accuracy
Control systems
Research issues
Corner cases
Multi-homed hosts: how best to define a flow
L4 routeing, NAT, proxy ARP, transparent proxies
(Solve using device config files, perhaps SNMP)
Scalability
Host measurement must not be intrusive (in terms of
packet latency, CPU load, network bandwidth)
Aggregators must elect themselves in such a way that they
do not implode under event load
What happens if network radically alters? E.g.
Extensive use of multicast
Connection patterns shift due to e.g. P2P deployment
Research issues
Robustness
Network management had better still work as nodes fail or
the network partitions!
Accuracy in the face of late, partial information
By accident: unmonitored hosts
By design: aggregation, more detail about local area
Inference of link contribution to cumulative metrics, e.g. RTT
Network control: modify link weights
How efficient is the current configuration anyway?
What are plausible timescales to reconfigure?
Summary
Aim to build a coherent edge-based network
management platform using flow monitoring and
standard routeing protocols
Applications include visualization, simulation, dynamic
control
Research issues include
Scalability: want to manage a 300,000 node network
Robustness: must work as nodes fail or network partitions
Accuracy: will not be able to monitor 100% of traffic
Control systems: use the data to optimize the network in
real-time, as well as just observe and simulate
Current status
Submitted HotNets paper
Prototype ETW provider/consumer driver
Studied feasibility of flow monitoring
Prototype OSPF collector & topology reconstruction
Investigating “distributed database” via simulation
Query properties
System decomposition
Questions, comments?
Backup slides
SNMP
Internet routeing
OSPF
BGP
Security
SNMP
Protocol to manage information tables at devices
Provides get, set, trap, notify operations
get, set: read, write values
trap: signal a condition (e.g. threshold exceeded)
notify: reliable trap
Complexity mostly in the table design
Some standard tables, but many vendor specific
Non-critical, so often tables populated incorrectly
Internet routeing
Q: how to get a packet from node to destination?
A1: advertise all reachable destinations and apply a
consistent cost function (distance vector)
A2: learn network topology and compute consistent
shortest paths (link state)
Each node (1) discovers and advertises adjacencies;
(2) builds link state database; (3) computes shortest paths
A1, A2: Forward to next-hop using longest-prefix-match
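A minimal sketch of longest-prefix-match forwarding, using Python's standard ipaddress module and a linear scan over a small table. The table contents and next-hop names are made up; real routers use tries or hardware lookups rather than a scan.

```python
import ipaddress

def longest_prefix_match(table, dst):
    """table: prefix string -> next hop.
    Return the next hop of the most specific prefix containing dst,
    or None if no prefix matches."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None

# Hypothetical forwarding table.
fib = {"0.0.0.0/0": "r0", "10.0.0.0/8": "r1", "10.1.0.0/16": "r2"}
print(longest_prefix_match(fib, "10.1.2.3"))
```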
OSPF (~link state routeing)
Q: how to route given packet from any node to
destination?
A: learn network topology; compute shortest paths
For each node
Discover adjacencies (~immediate neighbours); advertise
Build link state database (~network topology)
Compute shortest paths to all destination prefixes
Forward to next-hop using longest-prefix-match (~most
specific route)
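A minimal sketch of the "compute shortest paths" step, using Dijkstra's algorithm over a link state database held as a dictionary of node -> {neighbour: cost}. This is an illustration of the technique, not OSPF itself: areas, LSA types and equal-cost multipath are ignored, and the topology is made up.

```python
import heapq

def shortest_paths(lsdb, source):
    """lsdb: node -> {neighbour: cost}.
    Return (dist, first_hop): cost to each reachable node and the
    neighbour of `source` to forward through to reach it."""
    dist = {source: 0}
    first_hop = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale queue entry, a shorter path was found already
        for nbr, cost in lsdb.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                first_hop[nbr] = nbr if node == source else first_hop[node]
                heapq.heappush(heap, (nd, nbr))
    return dist, first_hop

# Hypothetical three-router topology.
lsdb = {
    "a": {"b": 1, "c": 5},
    "b": {"a": 1, "c": 1},
    "c": {"a": 5, "b": 1},
}
dist, first_hop = shortest_paths(lsdb, "a")
print(dist, first_hop)
```

Re-running the same function on a copy of the database with a link deleted gives a simple what-if answer for that failure.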
BGP (~path vector routeing)
Q: how to route given packet from any node to destination?
A: neighbours tell you destinations they can reach; pick cheapest
option
For each node
Receive (destination, cost, next-hop) for all destinations known to
neighbour
Select among all possible next-hops for given destination
Advertise selected (destination, cost+, next-hop') for all known
destinations
Selection process is complicated
Routes can be modified/hidden at all three stages
General mechanism for application of policy
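A minimal sketch of the receive, select, advertise cycle at one node, assuming selection simply picks the lowest cost and that the advertising neighbour is the next-hop. Real BGP selection is far more involved, as the slide notes, and this sketch applies no policy. Function names, the link cost and the data are hypothetical.

```python
def select_routes(received):
    """received: neighbour -> {destination: cost}.
    Return destination -> (cost, neighbour) for the cheapest offer."""
    best = {}
    for nbr, adverts in received.items():
        for dest, cost in adverts.items():
            if dest not in best or cost < best[dest][0]:
                best[dest] = (cost, nbr)
    return best

def advertise(best, me, link_cost=1):
    """Re-advertise the selected routes with increased cost and
    this node as the next-hop."""
    return [(dest, cost + link_cost, me)
            for dest, (cost, _nbr) in sorted(best.items())]

# Hypothetical adverts received from two neighbours.
received = {
    "n1": {"10.0.0.0/8": 3, "192.0.2.0/24": 1},
    "n2": {"10.0.0.0/8": 2},
}
best = select_routes(received)
print(advertise(best, "me"))
```

Policy would hook in at each of the three stages, for example by filtering the received adverts, biasing the selection, or suppressing entries before advertising.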
Security
Threat: malicious/compromised host
Authenticate participants
Must secure route collector as if a router
Threat: DoS on monitors
Difference between client under DoS and server?
Rate pace output from monitors
Threat: eavesdropping
Standard IPSec/encryption solutions