Network Management
Richard Mortier
Microsoft Research, Cambridge
(Guest lecture, Digital Communications II)
Overview
Introduction
Abstractions
IP network components
IP network management protocols
Pulling it all together
An alternative approach
Overview
Introduction
What’s it all about then?
Abstractions
IP network components
IP network management protocols
Pulling it all together
An alternative approach
What is network management?
One point-of-view: a large field full of acronyms
From question.com:
EMS, TMN, NE, CMIP, CMISE, OSS, ASN.1, TL1, EML,
FCAPS, ITU, ...
(Don’t ask me what all of those mean, I don’t care!)
In 1989, a random of the journalistic persuasion asked
hacker Paul Boutin “What do you think will be the biggest
problem in computing in the 90s?” Paul's straight-faced
response: “There are only 17,000 three-letter acronyms.”
(To be exact, there are 26^3 = 17,576.)
We will ignore most of them
What is network management?
Computer networks are considered to have
three operating timescales
Data: packet forwarding [ μs, ms ]
Control: flows/connections [ secs, mins ]
Management: aggregates, networks [ hours, days ]
…so we’re concerned with “the network”
rather than particular devices
Standardization is key!
Overview
Introduction
Abstractions
ISO FCAPS, TMN EMS, ATM
IP network components
IP network management protocols
Pulling it all together
An alternative approach
ISO FCAPS: functional separation
Fault: recognize, isolate, correct, log faults
Configuration: collect, store, track configurations
Accounting: collect statistics, bill users, enforce quotas
Performance: monitor trends, set thresholds, trigger alarms
Security: identify, secure, manage risks
TMN EMS: administrative separation
Telecommunications Management Network
Element Management System
“...simple but elegant...” (!)
(my emphasis)
NEL: network elements (switches, transmission systems)
EML: element management (devices, links)
NML: network management (capacity, congestion)
SML: service management (SLAs, time-to-market)
BML: business management (RoI, market share, blah)
The B-ISDN reference model
Asynchronous Transfer Mode “cube”
[Figure: the B-ISDN protocol reference model. User plane and control plane, each with higher layers, sit above the ATM adaptation layer, the ATM layer and the physical layer; a management plane spans them, divided into plane management and layer management]
Plane management: the whole network
…vs layer management: specific layers
Functions: topology, configuration, fault, operations, accounting, performance
See IAP lectures, maybe
Network management
Models of general communication networks
Tend to be quite abstract and exceedingly tedious!
Many practitioners still seem excited about OO
programming, WIMP interfaces, etc
…probably because implementation is hard due to so
many excessively long and complex standards!
My view: basic “need-to-know” requirements are
1. What should be happening? [ c ]
2. What is happening? [ f, p, a ]
3. What shouldn’t be happening? [ f, s ]
4. What will be happening? [ p, a ]
Network management
We’ll concentrate on IP networks
We’ll concentrate on the network core
Still acronym city: ICMP, SNMP, MIB, RFC
Sample size: 10^2 routers, 10^5 hosts
Routers, not hosts
We’ll ignore “service management”
DNS, AD, file stores, etc
Overview
Introduction
Abstractions
IP network components
IP primer, router configuration
IP network management protocols
Pulling it all together
An alternative approach
IP primer (you probably know all this)
Destination-routed packets – no connections
Routers forward packets based on routeing tables
Tables populated by routeing protocols
Routers and protocols operate independently
…although protocols aim to build consistent state
Time-to-live field: allows removal of looping packets
RFCs ~= standards
Often much looser semantics than e.g. ISO, ITU standards
Compare for example OSPF [RFC2328] and IS-IS [RFC1142, RFC1195], two link-state routeing protocols
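Because each router forwards using only its own table, two routers whose tables disagree can bounce a packet between them; the TTL field bounds the damage. A minimal Python sketch, assuming each router's table is a plain dict from destination to next hop (the `forward` helper and the table layout are hypothetical, not any router's data structures):

```python
def forward(tables, src, dst, ttl=8):
    """Hop-by-hop, destination-based forwarding with TTL decrement.

    tables maps router -> {destination: next_hop}. Returns (path, outcome),
    where outcome is "delivered", "no route" or "time exceeded".
    """
    node, path = src, [src]
    while node != dst:
        next_hop = tables[node].get(dst)
        if next_hop is None:
            return path, "no route"
        ttl -= 1  # every forwarding step decrements the TTL
        if ttl == 0:
            # Packet is discarded; a real router would also send ICMP time exceeded
            return path, "time exceeded"
        node = next_hop
        path.append(node)
    return path, "delivered"
```

With consistent tables the packet is delivered; with a two-router loop, e.g. `{"A": {"D": "B"}, "B": {"D": "A"}}`, it circulates until the TTL runs out and is then dropped.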
So, how do you build an IP network?
1. Buy (lease) routers
2. Buy (lease) fibre
3. Connect them all together
4. Configure routers appropriately
5. Configure end-systems appropriately
Assume you’ve done 1–3 and someone else is doing 5…
Router configuration
Initialization
Name the router, set up boot options, set up authentication options
Configure interfaces
Loopback, ethernet, fibre, ATM
Subnet/mask, filters, static routes
Shutdown (or not), queueing options, full/half duplex
Configure routeing protocols (OSPF, BGP, IS-IS, …)
Process number, addresses to accept routes from, networks to advertise
Access lists, filters, ...
Numeric id, permit/deny, subnet/mask, protocol, port
Route-maps, matching routes rather than data traffic
Other configuration aspects: traps, syslog, etc
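The access-list matching described above (numeric id, permit/deny, address and mask) can be sketched as follows. This assumes a standard, source-address-only list written with IOS-style wildcard masks, where 1 bits mean "don't care", the first matching entry wins, and an implicit deny ends every list; `acl_permits` is a hypothetical helper, not a Cisco API.

```python
import ipaddress


def _ip(s):
    """Dotted-quad string -> 32-bit integer."""
    return int(ipaddress.IPv4Address(s))


def acl_permits(acl, address):
    """Evaluate a standard access list against an IPv4 address.

    acl is a list of (action, base, wildcard) tuples in configuration order.
    Wildcard bits set to 1 are ignored when comparing address and base.
    """
    addr = _ip(address)
    for action, base, wildcard in acl:
        care = ~_ip(wildcard) & 0xFFFFFFFF  # bits that must match exactly
        if addr & care == _ip(base) & care:
            return action == "permit"
    return False  # implicit deny at the end of every list
```

For example, an entry `("permit", "239.192.0.0", "0.3.255.255")`, like one line of access-list 24 in the configuration fragment, matches 239.192.0.0 through 239.195.255.255.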
Router configuration fragments
hostname FOOBAR
!
boot system flash slot0:a-boot-image.bin
boot system flash bootflash:
logging buffered 100000 debugging
logging console informational
aaa new-model
aaa authentication login default tacacs local
aaa authentication login consoleport none
aaa authentication ppp default if-needed tacacs
aaa authorization network tacacs
!
ip tftp source-interface Loopback0
no ip domain-lookup
ip name-server 10.34.56.78
!
ip multicast-routing
ip dvmrp route-limit 7000
ip cef distributed
!
interface Loopback0
 description router-1.network.corp.com
 ip address 10.65.21.43 255.255.255.255
!
interface FastEthernet0/0/0
 description Link to New York
 ip address 10.65.43.21 255.255.255.128
 ip access-group 175 in
 ip helper-address 10.65.12.34
 ip pim sparse-mode
 ip cgmp
 ip dvmrp accept-filter 98 neighbor-list 99
 full-duplex
!
interface FastEthernet4/0/0
 no ip address
 ip access-group 183 in
 ip pim sparse-mode
 ip cgmp
 shutdown
 full-duplex
!
router ospf 2
 log-adjacency-changes
 passive-interface FastEthernet0/0/0
 passive-interface FastEthernet0/1/0
 passive-interface FastEthernet1/0/0
 passive-interface FastEthernet1/1/0
 passive-interface FastEthernet2/0/0
 passive-interface FastEthernet2/1/0
 passive-interface FastEthernet3/0/0
 network 10.65.23.45 0.0.0.255 area 1.0.0.0
 network 10.65.34.56 0.0.0.255 area 1.0.0.0
 network 10.65.43.0 0.0.0.127 area 1.0.0.0
!
access-list 24 remark Mcast ACL
access-list 24 permit 239.255.255.254
access-list 24 permit 224.0.1.111
access-list 24 permit 239.192.0.0 0.3.255.255
access-list 24 permit 232.192.0.0 0.3.255.255
access-list 24 permit 224.0.0.0 0.0.0.255
access-list 1011 deny 0000.0000.0000 ffff.ffff.ffff ffff.ffff.ffff 0000.0000.0000 0xD1 2 eq 0x42
access-list 1011 permit 0000.0000.0000 ffff.ffff.ffff 0000.0000.0000 ffff.ffff.ffff
!
tftp-server slot1:some-other-image.bin
tacacs-server host 10.65.0.2
tacacs-server key xxxxxxxx
rmon event 1 trap Trap1 description "CPU Utilization>75%" owner config
rmon event 2 trap Trap2 description "CPU Utilization>95%" owner config
Router configuration
Lots of quite large and fragile text files
Hundreds to thousands of routers, hundreds to thousands of lines per config
Errors are hard to find and have non-obvious results
Router configuration also editable on-line
How to keep track of them all?
Naming schemes, directory hierarchies, CVS
ssh upload and atomic commit to router
Perhaps even a database
State of the art is pretty basic
Few tools to check consistency
Generally generate configurations from templates and have
human-intensive process to control access to running configs
Topic of current research [Feamster et al]
(this counts as quite advanced!)
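Generating configurations from templates, the common practice noted above, might look like this minimal sketch using Python's standard `string.Template`. The stanza layout mirrors the interface blocks in the configuration fragment; the field names and the `render_config` helper are hypothetical.

```python
from string import Template

# Hypothetical template for one interface stanza; field names are illustrative.
INTERFACE = Template(
    "interface $name\n"
    " description $description\n"
    " ip address $address $mask\n"
    "!\n"
)


def render_config(hostname, interfaces):
    """Build a configuration text from a hostname and per-interface dicts.

    Template.substitute raises KeyError if any field is missing, so an
    incomplete inventory record fails loudly instead of emitting a broken config.
    """
    parts = [f"hostname {hostname}\n!\n"]
    parts += [INTERFACE.substitute(intf) for intf in interfaces]
    return "".join(parts)
```

Keeping the per-router data (names, addresses, descriptions) separate from the template is what makes the generated files reviewable and diffable in version control.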
Overview
Introduction
Abstractions
IP network components
IP network management protocols
ICMP, SNMP, Netflow
Pulling it all together
An alternative approach
ICMP
Internet Control Message Protocol [RFC792]
IP protocol #1
In-band “control”
Variety of message types
echo/echo reply [ PING (packet internet groper) ]
time exceeded [ TRACEROUTE ]
destination unreachable, redirect
source quench
Ping (Packet INternet Groper)
Test for liveness
…also used to measure (round-trip) latency
Send ICMP echo
A valid IP host [RFC1122, RFC1123] must reply with an ICMP echo reply
Subnet PING?
Useful but often not available/deprecated
“ACK” implosion could be a problem
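What ping sends is small: an ICMP echo request (type 8, code 0) carrying an identifier, a sequence number, optional payload and the 16-bit one's-complement Internet checksum; the echo reply comes back as type 0. A sketch that builds such a message in Python without opening a raw socket (actually sending it would need one, usually with elevated privileges); the function names are illustrative:

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit words, complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident: int, seq: int, payload: bytes = b"") -> bytes:
    """ICMP echo request: type 8, code 0, checksum, identifier, sequence."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)  # checksum field zeroed
    csum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload
```

A receiver validates the message by running the same checksum over the whole thing, checksum field included: a correct message sums to zero.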
Traceroute
Which route do my packets take to their destination?
Send UDP packets with increasing time-to-live values
A compliant router that decrements a probe’s TTL to zero must respond with ICMP “time exceeded”
So successive probes elicit a response from each router along the path in turn
Not quite that simple
One router, many IP addresses: which source address?
Router control processor, inbound or outbound interface
Routes often asymmetric, so return path != outbound path
Routes change
Do we want full-mesh host-host routes anyway?!
Size of data set, amount of probe traffic
This is topology, what about load on links?
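The basic TTL-stepping idea can be simulated over a known forward path. This sketch assumes a single fixed path and one address per router, so it deliberately ignores the complications listed above (multiple interface addresses, asymmetric and changing routes); `traceroute` here is a hypothetical simulation, not the system tool.

```python
def traceroute(path, max_ttl=30):
    """Simulate traceroute over a forward path [src, r1, ..., dst].

    Returns a list of (ttl, responder, ICMP message) tuples.
    """
    hops = []
    for ttl in range(1, max_ttl + 1):
        if ttl >= len(path) - 1:
            # The probe reaches the destination, which answers instead:
            # for UDP probes to an unused port, with ICMP port unreachable
            hops.append((ttl, path[-1], "port unreachable"))
            break
        # The router ttl hops away decrements the TTL to zero and drops the probe
        hops.append((ttl, path[ttl], "time exceeded"))
    return hops
```

Each probe reveals one hop, so a path of n hops costs at least n probes (in practice several per hop), which is where the concern about the amount of probe traffic comes from.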
SNMP
Protocol to manage information tables at devices
Provides get, set, trap, notify operations
get, set: read, write values
trap: signal a condition (e.g. threshold exceeded)
notify: reliable trap (acknowledged by the receiver, the SNMPv2 InformRequest)
Complexity mostly in the MIB design
Some standard tables, but many vendor specific
Non-critical, so tables are often populated incorrectly
Many tens of MIBs (thousands of lines) per device
Different versions, different data, different semantics
Yet another configuration tracking problem
Inter-relationships between MIBs
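The get/set/trap model can be illustrated with a toy in-memory agent. This is not an SNMP implementation or any real library's API: OIDs are plain strings, the CPU OID below is made up, and a trap is just a callback fired when a watched value rises above its threshold, much like the `rmon event` CPU-utilization lines in the configuration fragment.

```python
# Hypothetical enterprise OID, for illustration only.
CPU_OID = "1.3.6.1.4.1.99999.1.1"


class ToyAgent:
    """Toy stand-in for an SNMP agent: a dict of OID -> value plus thresholds."""

    def __init__(self, on_trap):
        self.mib = {}
        self.thresholds = {}  # oid -> (limit, trap name)
        self.on_trap = on_trap

    def get(self, oid):
        """Read a value; None if the OID is not populated."""
        return self.mib.get(oid)

    def set(self, oid, value):
        """Write a value and fire a trap on an upward threshold crossing."""
        old = self.mib.get(oid)
        self.mib[oid] = value
        if oid in self.thresholds:
            limit, name = self.thresholds[oid]
            # Trap only when crossing the limit, not on every sample above it
            if value > limit and (old is None or old <= limit):
                self.on_trap(name, oid, value)
```

Triggering only on the crossing, rather than on every sample above the limit, is one simple answer to the alarm-flood problem; RMON alarms use rising and falling thresholds for the same reason.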
IPFIX
IETF working group
Statistics reporting
Export of flow based data out of IP network devices
Developing suitable protocol based on Cisco NetFlow™ v9
[RFC3954, RFC3955]
Setup template
Send data records matching template
Many variables
Packet/flow counters, rule matches, quite flexible
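The template mechanism decouples what is measured from how it is encoded: the exporter announces a template (a list of field ids and lengths) once, then sends compact data records that only make sense with that template in hand. A minimal Python sketch of that idea; it omits the message and set headers, template ids and variable-length fields of the real NetFlow v9/IPFIX formats, the 4-byte counter lengths are illustrative, and the field ids are believed to match the standard numbering but should be checked against the IANA IPFIX registry.

```python
import struct

# Template: (field id, length in bytes), in record order.
# 8 = source IPv4 address, 12 = destination IPv4 address,
# 2 = packet count, 1 = octet count.
TEMPLATE = [(8, 4), (12, 4), (2, 4), (1, 4)]

_FMT = {1: "B", 2: "H", 4: "I", 8: "Q"}  # supported fixed field lengths


def record_format(template):
    """struct format string (network byte order) for one data record."""
    return "!" + "".join(_FMT[length] for _, length in template)


def encode(template, records):
    """Pack data records; each record is a tuple in template order."""
    fmt = record_format(template)
    return b"".join(struct.pack(fmt, *r) for r in records)


def decode(template, data):
    """Unpack data records into dicts keyed by field id."""
    fmt = record_format(template)
    ids = [fid for fid, _ in template]
    return [dict(zip(ids, values)) for values in struct.iter_unpack(fmt, data)]
```

The collector cannot interpret a single data byte until it has seen the template, which is why exporters periodically resend templates over unreliable transports.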
Overview
Introduction
Abstractions
IP network components
IP network management protocols
Pulling it all together
Network mapping, statistics gathering, control
An alternative approach
An hypothetical NMS
GUI around ICMP (ping, traceroute), SNMP, etc
Recursive host discovery
Broadcast ping, ARP, default gateway: start somewhere
Recursively SNMP query for known hosts/connected networks
Ping known hosts to test liveness
Iterate
Display topology: allow “drill-down” to particular devices
Configure and monitor known devices
Trap, NetFlow™, syslog message destinations
Counter thresholds, CPU utilization threshold, fault reporting
Particular faults or fault patterns
Interface statistics and graphs
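The recursive discovery loop amounts to a breadth-first search from a seed device. A sketch assuming two hypothetical callbacks: `neighbours_of(host)` stands in for the SNMP queries of routing/ARP tables, and `is_alive(host)` for a ping.

```python
from collections import deque


def discover(seed, neighbours_of, is_alive):
    """Breadth-first recursive discovery starting from one known device.

    Returns (live hosts, links between live hosts); links are frozensets.
    """
    seen, live, links = {seed}, set(), set()
    queue = deque([seed])
    while queue:
        host = queue.popleft()
        if not is_alive(host):
            continue  # do not query devices that fail the liveness test
        live.add(host)
        for n in neighbours_of(host):
            links.add(frozenset((host, n)))
            if n not in seen:
                seen.add(n)
                queue.append(n)
    # Keep only links whose endpoints both answered
    return live, {link for link in links if link <= live}
```

The `seen` set keeps each device from being queried twice, which matters once the network contains cycles; the pacing and CPU-load concerns below are about how fast this loop is allowed to run.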
A real NOC (Network Operations Centre)
[ from AT&T ]
An hypothetical NMS
All very straightforward? No, not really
A lot of software engineering: corner cases, traceroute
interpretation, NATs, etc
MIBs may contain rubbish
Can only view inside your network anyway
Efficiency
Rate pacing discovery traffic: ping implosion/explosion
SNMP overloading router CPUs
Tunnelled, encrypted protocols becoming prevalent
Using NMSs also not straightforward
How to set up “correct” thresholds?
How to decide when something “bad” has happened?
How to present (or even interpret) reams and reams of data?
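The slides do not say how discovery traffic should be paced; one common technique is a token bucket, which allows bursts of up to `burst` probes while holding the long-run rate to `rate` probes per second. A minimal sketch that computes send times rather than actually sleeping; `paced_send_times` is a hypothetical helper.

```python
def paced_send_times(n_probes, rate, burst):
    """Send times (seconds from start) for n_probes under a token bucket.

    The bucket starts full with `burst` tokens and refills at `rate` tokens/s;
    each probe consumes one token and waits until one is available.
    """
    times, tokens, now = [], float(burst), 0.0
    for _ in range(n_probes):
        if tokens < 1.0:
            wait = (1.0 - tokens) / rate  # time until one whole token accrues
            now += wait
            tokens += wait * rate
        tokens -= 1.0
        times.append(now)
    return times
```

With `rate=2.0, burst=2`, five probes go out at 0, 0, 0.5, 1.0 and 1.5 seconds: an initial burst, then a steady two per second.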
Overview
Introduction
Abstractions
IP network components
IP network management protocols
Pulling it all together
An alternative approach
From the edges…
ENMA
Edge-based network management platform
Collect flow information from hosts, and
Combine with topology information from routeing protocols
Enable visualization, analysis, simulation, control
Avoid problems of not-quite-standard interfaces
Do the work where resources are plentiful
Management support is typically ‘non-critical’ (i.e. buggy)
and not extensively tested for inter-operability
Hosts have lots of cycles and little traffic (relatively)
Protocol visibility: see into tunnels, IPSec, etc
System outline
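[Figure: packets are aggregated into flows, giving a traffic matrix (srcs × dsts); the routeing protocol gives the topology and a set of routes; both feed a distributed database, which drives a simulator, to visualize, simulate and control the network]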
Where is my traffic going today?
Pictures of current topology and traffic
In fact, where did my traffic go yesterday?
Routes + flows + forwarding rules = BIG PICTURE
Keep historical data for capacity planning, etc
A platform for anomaly detection
Historical data suggests “normality,” live
monitoring allows anomalies to be detected
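The simplest version of normality learned from history is a mean and standard deviation per metric, flagging live values that fall too far outside them. A deliberately naive Python sketch: the multiplier `k` is an assumption, and real traffic has daily and weekly cycles that this ignores.

```python
from statistics import mean, stdev


def is_anomalous(history, current, k=3.0):
    """Flag current if it is more than k standard deviations from the mean.

    history needs at least two samples (statistics.stdev requires them).
    """
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu  # perfectly flat history: any change stands out
    return abs(current - mu) > k * sigma
```

A per-hour-of-week baseline (one history per time slot) is the obvious next refinement, and the long-term historical data is exactly what makes it possible.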
Where might my traffic go tomorrow?
Plug into a simulator back-end
Discrete event simulator, flow allocation solver
Run multiple ‘what-if’ scenarios
…failures
…reconfigurations
…technology deployments
E.g. “What happens if we coalesce all the
Exchange servers in one data-centre?”
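A failure scenario boils down to recomputing shortest paths with some links removed. A minimal Python sketch using Dijkstra's algorithm over undirected weighted links; the data layout and the `shortest_paths` helper are hypothetical, and a real simulator would also reallocate flows over the new routes.

```python
import heapq


def shortest_paths(links, src, removed=frozenset()):
    """Dijkstra over undirected weighted links, optionally ignoring some.

    links maps frozenset({a, b}) -> weight; removed holds keys to skip, for
    'what-if' failure scenarios. Returns {node: (cost, previous hop)}.
    """
    adj = {}
    for link, w in links.items():
        if link in removed:
            continue
        a, b = tuple(link)
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    best = {src: (0, None)}
    heap = [(0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best[node][0]:
            continue  # stale heap entry: a cheaper path was already found
        for nbr, w in adj.get(node, []):
            new = cost + w
            if nbr not in best or new < best[nbr][0]:
                best[nbr] = (new, node)
                heapq.heappush(heap, (new, nbr))
    return best
```

For links A–B and B–C of weight 1 and A–C of weight 5, C is reached from A via B at cost 2; removing A–B forces traffic onto the direct link at cost 5.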
Where should my traffic be going?
Close the loop: compute link weights to
implement policy goals
Allows more dynamic policies
Recompute on the order of hours/days
Modify network configuration to track e.g. time of
day load changes
Make network more efficient (~cheaper)?
Where are we now?
Three major components
Flow collection
Route collection
Distributed database
Building prototypes, simulating system
Data collection
Flow collection
Hosts track active flows
Used packet traces for feasibility study on (client, server)
Using low-overhead event posting infrastructure, ETW (Event Tracing for Windows)
Built prototype device driver provider & user-space consumer
Peaks at (165, 5667) live and (39, 567) active flows per sec
Route collection
OSPF is link-state: passively collect link state adverts
Extension of my work at Sprint (for IS-IS and BGP); similar work has also been done at AT&T (NSDI’04 paper)
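Both collectors keep fairly simple state. A minimal sketch with hypothetical names throughout: a per-host flow table keyed by 5-tuple with an idle timeout (ETW itself is not modelled), and a passively built link-state database that keeps only the newest advert per router, judged by sequence number (real OSPF also compares age and checksum and has several LSA types).

```python
class FlowTable:
    """Per-host flow table keyed by 5-tuple (src, dst, proto, sport, dport)."""

    def __init__(self, idle_timeout=30.0):
        self.idle_timeout = idle_timeout
        self.flows = {}  # key -> [first_seen, last_seen, packets, bytes]

    def observe(self, key, size, now):
        """Account one packet of `size` bytes seen at time `now`."""
        rec = self.flows.get(key)
        if rec is None:
            self.flows[key] = [now, now, 1, size]
        else:
            rec[1] = now
            rec[2] += 1
            rec[3] += size

    def expire(self, now):
        """Remove and return flows idle for longer than the timeout."""
        done = {k: r for k, r in self.flows.items()
                if now - r[1] > self.idle_timeout}
        for k in done:
            del self.flows[k]
        return done


def update_lsdb(lsdb, lsa):
    """Install a router advert if newer than the stored copy; True if installed.

    lsa is a dict with hypothetical keys 'adv_router', 'seq' and 'links'
    ({neighbour: cost}).
    """
    current = lsdb.get(lsa["adv_router"])
    if current is not None and current["seq"] >= lsa["seq"]:
        return False
    lsdb[lsa["adv_router"]] = lsa
    return True


def topology(lsdb):
    """Directed links (router, neighbour) -> cost from the newest adverts."""
    return {(r, n): c for r, lsa in lsdb.items() for n, c in lsa["links"].items()}
```

Because the listener only stores adverts and never originates any, it can sit on the network passively without affecting routeing.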
The distributed database
Logically contains
1. Traffic flow matrix (bandwidths), {srcs} × {dsts}
2. …each entry annotated with current route from src to dst
N.B. src/dst might be e.g. (IP end-point, application)
Large dynamic data set suggests aggregation
Related work
{ distributed, continuous query, temporal } databases
Sensor networks
Potential starting points: Astrolabe or SDIMS (SIGCOMM’04)
Where/what/how much to aggregate?
Is data read- or write-dominated?
Which is more dynamic, flow or topology data?
Can the system successfully self-tune?
The distributed database
Construct traffic matrix from flow monitoring
Hosts can supply flows they source and sink
Only need a subset of this data to get complete traffic matrix
Construct topology from route collection
OSPF supplies topology → routes
Wish to be able to answer queries like
“Who are the top-10 traffic generators?”
“What is the load on link l ?”
Easy to aggregate, don’t care about topology
Can aggregate from hosts, but need to know routes
“What happens if we remove links {l…m} ?”
Interaction between traffic matrix, topology, even flow control
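The first two example queries differ in exactly the way described: top traffic generators need only the traffic matrix, while link load needs each entry's route as well. A sketch over in-memory dicts, with a hypothetical layout and no distribution or aggregation:

```python
from collections import Counter


def link_loads(traffic, routes):
    """Load on each link: sum of demands whose route crosses it.

    traffic maps (src, dst) -> bandwidth; routes maps (src, dst) -> list of
    nodes on the path. Links are undirected, keyed by frozenset of endpoints.
    """
    load = Counter()
    for pair, bw in traffic.items():
        path = routes[pair]
        for a, b in zip(path, path[1:]):
            load[frozenset((a, b))] += bw
    return load


def top_generators(traffic, k=10):
    """Top-k sources by total traffic: needs no topology at all."""
    totals = Counter()
    for (src, _dst), bw in traffic.items():
        totals[src] += bw
    return totals.most_common(k)
```

Removing links, the third query, changes `routes`, which is why it needs the simulator rather than a lookup.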
The distributed database
Building simulation model
OSPF data gives topology, event list, routes
Simple load model to start with (load ~ # subnets)
Precedence matrix (from SPF) reduces flow-data query set
Can we do as well/better than e.g. NetFlow?
Accuracy/coverage trade-off
How should we distribute the DB?
Just OSPF data? Just flow data? A mixture?
How many levels of aggregation?
How many nodes do queries touch?
What sort of API is suitable?
Example queries for sample applications
Summary
Introduction
What is network management?
Abstractions
ISO FCAPS, TMN EMS, ATM
IP network components
IP, routers, configurations
IP network management protocols
ICMP, SNMP, etc
Pulling it all together
Outline of a network management system
An alternative approach: from the edges
The end
Questions
Answers?
http://www.cisco.com/
http://www.routergod.com/
http://www.ietf.org/
http://ipmon.sprintlabs.com/pyrt/
http://www.nanog.org/
Backup slides
Internet routeing
OSPF
BGP
Internet routeing
Q: how to get a packet from node to destination?
A1: advertise all reachable destinations and apply a
consistent cost function (distance vector)
A2: learn network topology and compute consistent
shortest paths (link state)
Each node (1) discovers and advertises adjacencies;
(2) builds link state database; (3) computes shortest paths
A1, A2: Forward to next-hop using longest-prefix-match
OSPF (~link state routeing)
Q: how to route given packet from any node to
destination?
A: learn network topology; compute shortest paths
For each node
Discover adjacencies (~immediate neighbours); advertise
Build link state database (~network topology)
Compute shortest paths to all destination prefixes
Forward to next-hop using longest-prefix-match (~most
specific route)
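The longest-prefix-match rule in the forwarding step can be sketched with Python's standard `ipaddress` module. The table contents are illustrative, and real routers typically use tries or TCAMs rather than a linear scan.

```python
import ipaddress


def longest_prefix_match(table, address):
    """Next hop of the most specific prefix containing address, or None.

    table maps prefix strings (e.g. "10.65.0.0/16") to next hops.
    """
    addr = ipaddress.ip_address(address)
    best = None  # (prefix length, next hop)
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, next_hop)
    return best[1] if best else None
```

With a default route, a /16 and a /25 in the table, an address inside the /25 uses the /25's next hop even though all three prefixes contain it.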
BGP (~path vector routeing)
Q: how to route given packet from any node to destination?
A: neighbours tell you destinations they can reach; pick cheapest
option
For each node
Receive (destination, cost, next-hop) for all destinations known to
neighbour
Longest-prefix-match among next-hops for given destination
Advertise selected (destination, cost+, next-hop') for all known
destinations
Selection process is complicated
Routes can be modified/hidden at all three stages
General mechanism for application of policy
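A single selection step can be sketched as below, with heavy simplifications: cost is taken to be AS-path length, an import-policy hook may rewrite or hide routes, routes whose AS path already contains the node's own AS are dropped to prevent loops, and the re-advertised route has the node's AS prepended (the "cost+" step). Real BGP selection (local preference, MED and so on) is far more involved; `select_routes` and its argument layout are hypothetical.

```python
def select_routes(my_as, adverts, import_policy=lambda route: route):
    """One path-vector selection step for a single node.

    adverts: list of (destination, as_path tuple, next_hop) from neighbours.
    import_policy may return a modified route, or None to hide it.
    Returns (best route per destination, advertisements to send on).
    """
    best = {}
    for route in adverts:
        route = import_policy(route)
        if route is None:
            continue  # hidden by policy
        dest, as_path, next_hop = route
        if my_as in as_path:
            continue  # our own AS already on the path: would loop
        if dest not in best or len(as_path) < len(best[dest][1]):
            best[dest] = (dest, as_path, next_hop)
    # Re-advertise with our AS prepended; "self" marks us as the next hop
    exported = [(d, (my_as,) + p, "self") for d, p, _ in best.values()]
    return best, exported
```

An export-policy hook applied to `exported` would cover the third stage at which routes can be modified or hidden.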