Network Management
Richard Mortier
Microsoft Research, Cambridge
(Guest lecture, Digital Communications II)
Overview

- Introduction
- Abstractions
- IP network components
- IP network management protocols
- Pulling it all together
- An alternative approach
Overview

- Introduction
  - What's it all about then?
- Abstractions
- IP network components
- IP network management protocols
- Pulling it all together
- An alternative approach
What is network management?

- One point of view: a large field full of acronyms
  - From question.com: EMS, TMN, NE, CMIP, CMISE, OSS, ASN.1, TL1, EML, FCAPS, ITU, ...
  - (Don't ask me what all of those mean, I don't care!)
- In 1989, a random of the journalistic persuasion asked hacker Paul Boutin "What do you think will be the biggest problem in computing in the 90s?" Paul's straight-faced response: "There are only 17,000 three-letter acronyms." (To be exact, there are 26^3 = 17,576.)
- We will ignore most of them
What is network management?

- Computer networks are considered to have three operating timescales:
  - Data: packet forwarding [ μs, ms ]
  - Control: flows/connections [ secs, mins ]
  - Management: aggregates, networks [ hours, days ]
- …so we're concerned with "the network" rather than particular devices
- Standardization is key!
Overview

- Introduction
- Abstractions
  - ISO FCAPS, TMN EMS, ATM
- IP network components
- IP network management protocols
- Pulling it all together
- An alternative approach
ISO FCAPS: functional separation

- Fault
  - Recognize, isolate, correct, log faults
- Configuration
  - Collect, store, track configurations
- Accounting
  - Collect statistics, bill users, enforce quotas
- Performance
  - Monitor trends, set thresholds, trigger alarms
- Security
  - Identify, secure, manage risks
TMN EMS: administrative separation

- Telecommunications Management Network / Element Management System
- "...simple but elegant..." (!) (my emphasis)
  - NEL: network elements (switches, transmission systems)
  - EML: element management (devices, links)
  - NML: network management (capacity, congestion)
  - SML: service management (SLAs, time-to-market)
  - BML: business management (RoI, market share, blah)
The B-ISDN reference model

- Asynchronous Transfer Mode "cube"
- Plane management: the whole network
- …vs layer management: specific layers
  - Topology, configuration, fault, operations, accounting, performance

[figure: the ATM "cube" — control plane and user plane, each comprising higher layers, ATM adaptation layer, ATM layer, physical layer; a management plane alongside, split into plane management and layer management]

See IAP lectures, maybe
Network management

- Models of general communication networks tend to be quite abstract and exceedingly tedious!
  - Many practitioners still seem excited about OO programming, WIMP interfaces, etc
  - …probably because implementation is hard due to so many excessively long and complex standards!
- My view: the basic "need-to-know" requirements are (letters index the FCAPS functions)
  1. What should be happening? [ c ]
  2. What is happening? [ f, p, a ]
  3. What shouldn't be happening? [ f, s ]
  4. What will be happening? [ p, a ]
Network management

- We'll concentrate on IP networks
  - Still acronym city: ICMP, SNMP, MIB, RFC
- We'll concentrate on the network core
  - Routers, not hosts
  - Sample size: 10^2 routers, 10^5 hosts
- We'll ignore "service management"
  - DNS, AD, file stores, etc
Overview

- Introduction
- Abstractions
- IP network components
  - IP primer, router configuration
- IP network management protocols
- Pulling it all together
- An alternative approach
IP primer (you probably know all this)

- Destination-routed packets – no connections
  - Time-to-live field allows removal of looping packets
- Routers forward packets based on routeing tables
  - Tables populated by routeing protocols
- Routers and protocols operate independently
  - …although protocols aim to build consistent state
- RFCs ~= standards
  - Often much looser semantics than e.g. ISO, ITU standards
  - Compare for example OSPF [RFC2328] and IS-IS [RFC1142, RFC1195], two link-state routeing protocols
So, how do you build an IP network?

1. Buy (lease) routers
2. Buy (lease) fibre
3. Connect them all together
4. Configure routers appropriately
5. Configure end-systems appropriately

Assume you've done 1–3 and someone else is doing 5…
Router configuration

- Initialization
  - Name the router, set up boot options, set up authentication options
- Configure interfaces
  - Loopback, ethernet, fibre, ATM
  - Subnet/mask, filters, static routes
  - Shutdown (or not), queueing options, full/half duplex
- Configure routeing protocols (OSPF, BGP, IS-IS, …)
  - Process number, addresses to accept routes from, networks to advertise
- Access lists, filters, ...
  - Numeric id, permit/deny, subnet/mask, protocol, port
  - Route-maps, matching routes rather than data traffic
- Other configuration aspects: traps, syslog, etc
Router configuration fragments

hostname FOOBAR
!
boot system flash slot0:a-boot-image.bin
boot system flash bootflash:
logging buffered 100000 debugging
logging console informational
aaa new-model
aaa authentication login default tacacs local
aaa authentication login consoleport none
aaa authentication ppp default if-needed tacacs
aaa authorization network tacacs
!
ip tftp source-interface Loopback0
no ip domain-lookup
ip name-server 10.34.56.78
ip multicast-routing
ip dvmrp route-limit 7000
ip cef distributed
!
interface Loopback0
 description router-1.network.corp.com
 ip address 10.65.21.43 255.255.255.255
!
interface FastEthernet0/0/0
 description Link to New York
 ip address 10.65.43.21 255.255.255.128
 ip access-group 175 in
 ip helper-address 10.65.12.34
 ip pim sparse-mode
 ip cgmp
 ip dvmrp accept-filter 98 neighbor-list 99
 full-duplex
!
interface FastEthernet4/0/0
 no ip address
 ip access-group 183 in
 ip pim sparse-mode
 ip cgmp
 shutdown
 full-duplex
!
router ospf 2
 log-adjacency-changes
 passive-interface FastEthernet0/0/0
 passive-interface FastEthernet0/1/0
 passive-interface FastEthernet1/0/0
 passive-interface FastEthernet1/1/0
 passive-interface FastEthernet2/0/0
 passive-interface FastEthernet2/1/0
 passive-interface FastEthernet3/0/0
 network 10.65.23.45 0.0.0.255 area 1.0.0.0
 network 10.65.34.56 0.0.0.255 area 1.0.0.0
 network 10.65.43.0 0.0.0.127 area 1.0.0.0
!
access-list 24 remark Mcast ACL
access-list 24 permit 239.255.255.254
access-list 24 permit 224.0.1.111
access-list 24 permit 239.192.0.0 0.3.255.255
access-list 24 permit 232.192.0.0 0.3.255.255
access-list 24 permit 224.0.0.0 0.0.0.255
access-list 1011 deny 0000.0000.0000 ffff.ffff.ffff ffff.ffff.ffff 0000.0000.0000 0xD1 2 eq 0x42
access-list 1011 permit 0000.0000.0000 ffff.ffff.ffff 0000.0000.0000 ffff.ffff.ffff
!
tftp-server slot1:some-other-image.bin
tacacs-server host 10.65.0.2
tacacs-server key xxxxxxxx
rmon event 1 trap Trap1 description "CPU Utilization>75%" owner config
rmon event 2 trap Trap2 description "CPU Utilization>95%" owner config
Router configuration

- Lots of quite large and fragile text files
  - 100s/1000s of routers, 100s/1000s of lines per config
  - Errors are hard to find and have non-obvious results
  - Router configuration is also editable on-line
- How to keep track of them all?
  - Naming schemes, directory hierarchies, CVS
  - ssh upload and atomic commit to router
  - Perhaps even a database (this counts as quite advanced!)
- State of the art is pretty basic
  - Few tools to check consistency
  - Generally, configurations are generated from templates, with a human-intensive process to control access to running configs (see the sketch below)
  - Topic of current research [Feamster et al]
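
A minimal sketch of the template approach in Python; the template text and per-router fields below are hypothetical, not taken from any real tool:

from string import Template

# Skeleton config with per-router fields to fill in (hypothetical).
CONFIG_TEMPLATE = Template("""\
hostname $hostname
!
interface Loopback0
 description $description
 ip address $loopback 255.255.255.255
!
router ospf $ospf_process
 network $subnet $wildcard area $area
""")

routers = [
    {"hostname": "FOOBAR", "description": "router-1.network.corp.com",
     "loopback": "10.65.21.43", "ospf_process": "2",
     "subnet": "10.65.43.0", "wildcard": "0.0.0.127", "area": "1.0.0.0"},
]

for r in routers:
    config = CONFIG_TEMPLATE.substitute(r)
    # In practice this would be committed to CVS and uploaded over ssh;
    # here we just write one file per router.
    with open(r["hostname"] + ".cfg", "w") as f:
        f.write(config)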
Overview

- Introduction
- Abstractions
- IP network components
- IP network management protocols
  - ICMP, SNMP, Netflow
- Pulling it all together
- An alternative approach
ICMP

- Internet Control Message Protocol [RFC792]
  - IP protocol #1
  - In-band "control"
- Variety of message types
  - echo/echo reply [ PING ]
  - time exceeded [ TRACEROUTE ]
  - destination unreachable, redirect
  - source quench
Ping (Packet INternet Groper)

- Test for liveness
  - …also used to measure (round-trip) latency
- Send ICMP echo (sketched below)
  - A valid IP host [RFC1122, RFC1123] must reply with ICMP echo response
- Subnet PING?
  - Useful but often not available/deprecated
  - "ACK" implosion could be a problem
- RFCs ~= standards
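
A minimal sketch of the echo mechanism in Python, using a raw socket (so it needs root); the target address is illustrative:

import os, socket, struct, time

def checksum(data: bytes) -> int:
    # RFC 1071 Internet checksum over 16-bit words.
    if len(data) % 2:
        data += b"\x00"
    s = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    s = (s >> 16) + (s & 0xFFFF)
    s += s >> 16
    return ~s & 0xFFFF

def ping(host: str, timeout: float = 1.0) -> float:
    sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
    sock.settimeout(timeout)        # socket.timeout raised if no reply
    ident = os.getpid() & 0xFFFF
    # ICMP echo request: type 8, code 0, checksum, identifier, sequence.
    payload = struct.pack("!d", time.time())
    header = struct.pack("!BBHHH", 8, 0, 0, ident, 1)
    header = struct.pack("!BBHHH", 8, 0, checksum(header + payload), ident, 1)
    sent = time.time()
    sock.sendto(header + payload, (host, 0))
    sock.recv(1024)                 # echo reply from a compliant host
    return (time.time() - sent) * 1000.0

print("rtt %.1f ms" % ping("192.0.2.1"))   # hypothetical address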
Traceroute

- Which route do my packets take to their destination?
  - Send UDP packets with increasing time-to-live values (sketched below)
  - A compliant IP host must respond with ICMP "time exceeded"
  - This triggers each router along the path to respond in turn
- Not quite that simple
  - One router, many IP addresses: which source address? Router control processor, inbound or outbound interface?
  - Routes often asymmetric, so return path != outbound path
  - Routes change
  - Do we want full-mesh host-host routes anyway?! Size of data set, amount of probe traffic
- This is topology; what about load on links?
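
A sketch of the mechanism in Python, again needing root; destination, hop limit and base port are illustrative choices:

import socket

def traceroute(dest: str, max_hops: int = 30, port: int = 33434):
    dest_addr = socket.gethostbyname(dest)
    for ttl in range(1, max_hops + 1):
        recv = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
        send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        recv.settimeout(2.0)
        send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        send.sendto(b"", (dest_addr, port))
        try:
            # The ICMP error's source address names the hop -- though, as
            # noted above, which of the router's addresses appears varies.
            _, (hop, _) = recv.recvfrom(512)
        except socket.timeout:
            hop = "*"
        finally:
            send.close(); recv.close()
        print(ttl, hop)
        if hop == dest_addr:   # destination replies with "port unreachable"
            break

traceroute("example.org")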
SNMP

- Protocol to manage information tables at devices
- Provides get, set, trap, notify operations (get sketched below)
  - get, set: read, write values
  - trap: signal a condition (e.g. threshold exceeded)
  - notify: reliable trap
- Complexity mostly in the MIB design
  - Some standard tables, but many vendor specific
  - Non-critical, so tables often populated incorrectly
  - Many tens of MIBs (thousands of lines) per device
  - Different versions, different data, different semantics
    - Yet another configuration tracking problem
    - Inter-relationships between MIBs
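
For flavour, a sketch of an SNMP get against two standard tables, assuming the third-party pysnmp library; the address and community string are illustrative:

from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

error_ind, error_status, error_index, var_binds = next(getCmd(
    SnmpEngine(),
    CommunityData("public"),                    # SNMPv2c read community
    UdpTransportTarget(("192.0.2.1", 161)),     # hypothetical router
    ContextData(),
    ObjectType(ObjectIdentity("SNMPv2-MIB", "sysDescr", 0)),
    ObjectType(ObjectIdentity("IF-MIB", "ifInOctets", 1)),
))

if error_ind or error_status:
    print("query failed:", error_ind or error_status.prettyPrint())
else:
    for name, value in var_binds:
        print(name.prettyPrint(), "=", value.prettyPrint())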
IPFIX

- IETF working group
  - Export of flow-based data out of IP network devices
  - Developing a suitable protocol based on Cisco NetFlow™ v9 [RFC3954, RFC3955]
- Statistics reporting
  - Set up template
  - Send data records matching the template (header parsing sketched below)
- Many variables
  - Packet/flow counters, rule matches, quite flexible
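
A sketch of reading the v9 export packet header and walking its flowsets, following the layout in RFC 3954; matching data records against previously seen templates is omitted:

import struct

def parse_v9(packet: bytes):
    # Export packet header: version, count, sysUpTime, UNIX secs,
    # sequence number, source id (20 bytes in total).
    version, count, uptime, secs, seq, source_id = struct.unpack_from("!HHIIII", packet, 0)
    assert version == 9, "not a NetFlow v9 export packet"
    offset, flowsets = 20, []
    while offset + 4 <= len(packet):
        set_id, length = struct.unpack_from("!HH", packet, offset)
        if length < 4:
            break                      # malformed flowset; stop parsing
        # id 0: template flowset; id 1: options template; id > 255: data
        # records described by the template with that id.
        flowsets.append((set_id, packet[offset + 4 : offset + length]))
        offset += length
    return {"sequence": seq, "source_id": source_id, "flowsets": flowsets}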
Overview

- Introduction
- Abstractions
- IP network components
- IP network management protocols
- Pulling it all together
  - Network mapping, statistics gathering, control
- An alternative approach
An hypothetical NMS

- GUI around ICMP (ping, traceroute), SNMP, etc
- Recursive host discovery (sketched below)
  - Broadcast ping, ARP, default gateway: start somewhere
  - Recursively SNMP query for known hosts/connected networks
  - Ping known hosts to test liveness
  - Iterate
- Display topology: allow "drill-down" to particular devices
- Configure and monitor known devices
  - Trap, Netflow™, syslog message destinations
  - Counter thresholds, CPU utilization threshold, fault reporting
  - Particular faults or fault patterns
  - Interface statistics and graphs
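
The discovery loop above, sketched in Python; snmp_neighbours() and is_alive() are hypothetical stand-ins for the SNMP table walks and pings:

def discover(seeds, snmp_neighbours, is_alive):
    known, frontier = set(), set(seeds)       # start from gateway, ARP, etc
    while frontier:
        host = frontier.pop()
        if host in known or not is_alive(host):
            continue
        known.add(host)
        # Query the device for hosts/networks it knows about and recurse.
        frontier.update(snmp_neighbours(host) - known)
    return known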
[figure: a real NOC (Network Operations Centre), from AT&T]
An hypothetical NMS

- All very straightforward? No, not really
  - A lot of software engineering: corner cases, traceroute interpretation, NATs, etc
  - MIBs may contain rubbish
  - Can only view inside your network anyway
- Efficiency
  - Rate-pacing discovery traffic: ping implosion/explosion
  - SNMP overloading router CPUs
- Tunnelled, encrypted protocols becoming prevalent
- Using NMSs also not straightforward
  - How to set up "correct" thresholds?
  - How to decide when something "bad" has happened?
  - How to present (or even interpret) reams and reams of data?
Overview

- Introduction
- Abstractions
- IP network components
- IP network management protocols
- Pulling it all together
- An alternative approach
  - From the edges…
ENMA

- Edge-based network management platform
  - Collect flow information from hosts, and
  - Combine with topology information from routeing protocols
- Enable visualization, analysis, simulation, control
- Avoid problems of not-quite-standard interfaces
  - Management support is typically 'non-critical' (i.e. buggy) and not extensively tested for inter-operability
- Do the work where resources are plentiful
  - Hosts have lots of cycles and little traffic (relatively)
- Protocol visibility: see into tunnels, IPSec, etc
System outline

[figure: packets are aggregated into flows, giving a traffic matrix; routeing protocol messages give topology, and hence a set of routes; both feed a distributed database (routes, srcs, dsts) behind visualization, simulation, and control front-ends]
Where is my traffic going today?

- Pictures of current topology and traffic
  - Routes + flows + forwarding rules → the BIG PICTURE
- In fact, where did my traffic go yesterday?
  - Keep historical data for capacity planning, etc
- A platform for anomaly detection (toy example below)
  - Historical data suggests "normality"; live monitoring allows anomalies to be detected
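
A toy illustration of the idea: historical samples define "normal", and live samples far outside it raise an alarm. The three-sigma threshold and the data are made up:

from statistics import mean, stdev

def is_anomalous(history, sample, k=3.0):
    # Flag samples more than k standard deviations from the mean.
    mu, sigma = mean(history), stdev(history)
    return abs(sample - mu) > k * sigma

daily_load_mbps = [410, 395, 420, 405, 398, 415]   # made-up history
print(is_anomalous(daily_load_mbps, 407))          # False: normal
print(is_anomalous(daily_load_mbps, 900))          # True: investigate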
Where might my traffic go tomorrow?

- Plug into a simulator back-end
  - Discrete event simulator, flow allocation solver
- Run multiple 'what-if' scenarios
  - …failures
  - …reconfigurations
  - …technology deployments
- E.g. "What happens if we coalesce all the Exchange servers in one data-centre?"
Where should my traffic be going?

- Close the loop: compute link weights to implement policy goals
  - Recompute on order of hours/days
- Allows more dynamic policies
  - Modify network configuration to track e.g. time-of-day load changes
- Make network more efficient (~cheaper)?
Where are we now?

- Three major components
  - Flow collection
  - Route collection
  - Distributed database
- Building prototypes, simulating system
Data collection

- Flow collection (sketched below)
  - Hosts track active flows
    - Using low-overhead event posting infrastructure, ETW
    - Built prototype device driver provider & user-space consumer
  - Used packet traces for feasibility study on (client, server)
    - Peaks at (165, 5667) live and (39, 567) active flows per sec
- Route collection
  - OSPF is link-state: passively collect link state adverts
  - Extension of my work at Sprint (for IS-IS and BGP); also been done at AT&T (NSDI'04 paper)
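
A sketch of the kind of per-host flow tracking described: packets aggregated into flows keyed by the 5-tuple. The Packet record is a hypothetical stand-in for what an ETW-style provider would post:

from collections import namedtuple, defaultdict

Packet = namedtuple("Packet", "src dst sport dport proto length")

class FlowTable:
    def __init__(self):
        self.flows = defaultdict(lambda: {"packets": 0, "bytes": 0})

    def observe(self, pkt: Packet):
        # One flow per (src, dst, sport, dport, proto) 5-tuple.
        key = (pkt.src, pkt.dst, pkt.sport, pkt.dport, pkt.proto)
        rec = self.flows[key]
        rec["packets"] += 1
        rec["bytes"] += pkt.length

table = FlowTable()
table.observe(Packet("10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 1500))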
The distributed database

- Logically contains
  1. Traffic flow matrix (bandwidths), {srcs} × {dsts}
  2. …each entry annotated with current route from src to dst
- N.B. src/dst might be e.g. (IP end-point, application)
- Large dynamic data set suggests aggregation
Related work

- { distributed, continuous query, temporal } databases
- Sensor networks
  - Potential starting points: Astrolabe or SDIMS (SIGCOMM'04)
- Where/what/how much to aggregate?
  - Is data read- or write-dominated?
  - Which is more dynamic, flow or topology data?
  - Can the system successfully self-tune?
The distributed database

- Construct traffic matrix from flow monitoring
  - Hosts can supply flows they source and sink
  - Only need a subset of this data to get complete traffic matrix
- Construct topology from route collection
  - OSPF supplies topology → routes
- Wish to be able to answer queries like
  - "Who are the top-10 traffic generators?" Easy to aggregate, don't care about topology
  - "What is the load on link l?" Can aggregate from hosts, but need to know routes (sketch below)
  - "What happens if we remove links {l…m}?" Interaction between traffic matrix, topology, even flow control
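
A sketch of the second query: combine the traffic matrix with per-pair routes to get per-link load. The data values are illustrative:

def link_loads(traffic_matrix, routes):
    loads = {}
    for (src, dst), bandwidth in traffic_matrix.items():
        for link in routes[(src, dst)]:        # route = sequence of links
            loads[link] = loads.get(link, 0) + bandwidth
    return loads

tm = {("A", "C"): 10, ("B", "C"): 5}
routes = {("A", "C"): [("A", "B"), ("B", "C")], ("B", "C"): [("B", "C")]}
print(link_loads(tm, routes))   # {('A','B'): 10, ('B','C'): 15}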
The distributed database

- Building simulation model
  - OSPF data gives topology, event list, routes
  - Simple load model to start with (load ~ # subnets)
  - Precedence matrix (from SPF) reduces flow-data query set
- Can we do as well as or better than e.g. NetFlow?
  - Accuracy/coverage trade-off
- How should we distribute the DB?
  - Just OSPF data? Just flow data? A mixture?
- How many levels of aggregation?
  - How many nodes do queries touch?
- What sort of API is suitable?
  - Example queries for sample applications
Summary

- Introduction
  - What is network management?
- Abstractions
  - ISO FCAPS, TMN EMS, ATM
- IP network components
  - IP, routers, configurations
- IP network management protocols
  - ICMP, SNMP, etc
- Pulling it all together
  - Outline of a network management system
- An alternative approach: from the edges
The end

- Questions
- Answers?

http://www.cisco.com/
http://www.routergod.com/
http://www.ietf.org/
http://ipmon.sprintlabs.com/pyrt/
http://www.nanog.org/
Backup slides

- Internet routeing
- OSPF
- BGP
Internet routeing

- Q: how to get a packet from node to destination?
- A1: advertise all reachable destinations and apply a consistent cost function (distance vector)
- A2: learn network topology and compute consistent shortest paths (link state)
  - Each node (1) discovers and advertises adjacencies; (2) builds link state database; (3) computes shortest paths
- A1, A2: forward to next-hop using longest-prefix match (sketched below)
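
A sketch of longest-prefix match over a toy routeing table, using Python's standard ipaddress module; prefixes and next-hops are illustrative:

from ipaddress import ip_address, ip_network

table = {                                 # prefix -> next hop (illustrative)
    ip_network("10.0.0.0/8"):   "10.255.0.1",
    ip_network("10.65.0.0/16"): "10.255.0.2",
    ip_network("0.0.0.0/0"):    "10.255.0.254",   # default route
}

def next_hop(dst: str) -> str:
    addr = ip_address(dst)
    matches = [net for net in table if addr in net]
    # Most specific route wins: longest prefix among the matches.
    return table[max(matches, key=lambda net: net.prefixlen)]

print(next_hop("10.65.21.43"))   # 10.255.0.2 via the /16
print(next_hop("192.0.2.7"))     # 10.255.0.254 via the default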
OSPF (~link state routeing)

- Q: how to route given packet from any node to destination?
- A: learn network topology; compute shortest paths
- For each node
  - Discover adjacencies (~immediate neighbours); advertise
  - Build link state database (~network topology)
  - Compute shortest paths to all destination prefixes (sketched below)
  - Forward to next-hop using longest-prefix-match (~most specific route)
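
The shortest-path step is textbook Dijkstra over the link state database; a sketch, with an illustrative three-node topology:

import heapq

def shortest_paths(graph, source):
    dist, heap = {source: 0}, [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                      # stale queue entry
        for neighbour, cost in graph[node]:
            nd = d + cost
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, neighbour))
    return dist

lsdb = {"A": [("B", 1), ("C", 4)], "B": [("A", 1), ("C", 2)],
        "C": [("A", 4), ("B", 2)]}
print(shortest_paths(lsdb, "A"))   # {'A': 0, 'B': 1, 'C': 3}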
BGP (~path vector routeing)

- Q: how to route given packet from any node to destination?
- A: neighbours tell you destinations they can reach; pick cheapest option
- For each node
  - Receive (destination, cost, next-hop) for all destinations known to neighbour
  - Longest-prefix-match among next-hops for given destination
  - Advertise selected (destination, cost+, next-hop') for all known destinations
- Selection process is complicated
  - Routes can be modified/hidden at all three stages
  - General mechanism for application of policy