Network Control and Management in the 100x100 Architecture

Rethinking Network Control & Management
The Case for a New 4D Architecture
David A. Maltz
Carnegie Mellon University/Microsoft Research
Joint work with Albert Greenberg, Gisli Hjalmtysson, Andy Myers, Jennifer Rexford, Geoffrey Xie, Hong Yan, Jibin Zhan, and Hui Zhang

The Role of Network Control and Management

• Many different network environments
  – Access and backbone networks
  – Data-center networks, enterprise/campus networks
  – Sizes: 10–10,000 routers/switches
• Many different technologies
  – Longest-prefix routing (IP), fixed-width routing (Ethernet), label switching (MPLS, ATM), circuit switching (optical, TDM)
• Many different policies
  – Routing, reachability, transit, traffic engineering, robustness

The control plane software binds these elements together and defines the network.

We Can Change the Control Plane!

• Pre-existing industry trend towards separating router hardware from software
  – IETF: FORCES, GSMP, GMPLS
  – SoftRouter [Lakshman, HotNets’04]
• Incremental deployment path exists
  – Individual networks can upgrade their control planes and gain benefits
  – Small enterprise networks have the most to gain
  – No changes to end-systems required

A Clean-slate Design

• What are the fundamental causes of network problems?
• How to secure the network and protect the infrastructure?
• How to provide flexibility in defining management logic?
• What functionality needs to be distributed – and what can be centralized?
• How to reduce/simplify the software in networks?
  – What would a “RISC” router look like?
• How to leverage technology trends?
  – CPU and link speeds are growing faster than the number of switches

Three Principles for Network Control & Management

Network-level Objectives: express goals explicitly
• Security policies, QoS, egress point selection
• Do not bury goals in box-specific configuration

[Diagram: management logic driven by explicit objectives – a reachability matrix and traffic engineering rules]

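To make the principle concrete, here is a minimal sketch in Python of a reachability matrix written down as an explicit, network-level objective rather than scattered across per-box ACLs. The encoding is hypothetical (the group names follow the chi/nyc example later in the talk), not the 4D prototype’s actual data structures.

# Minimal sketch: a reachability matrix as an explicit network-level
# objective. (Hypothetical encoding; not the prototype's code.)

GROUPS = ["chi-DC", "chi-FO", "nyc-DC", "nyc-FO"]

# Start from "everyone may reach everyone", then remove forbidden pairs:
# a front office may not send traffic into the other site's data center.
allow = {(src, dst): True for src in GROUPS for dst in GROUPS}
allow[("chi-FO", "nyc-DC")] = False
allow[("nyc-FO", "chi-DC")] = False

def permitted(src_group, dst_group):
    # The decision logic consults the matrix; no per-router knobs involved.
    return allow[(src_group, dst_group)]

for pair in sorted(allow):
    print(pair, "PERMIT" if allow[pair] else "DENY")
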
Three Principles for Network Control & Management

Network-wide Views: design the network to provide timely, accurate information
• Topology, traffic, resource limitations
• Give the logic the inputs it needs

[Diagram: management logic reads state information from the network to build network-wide views]

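As one illustration of “giving the logic the inputs it needs”, the sketch below assembles a network-wide topology view from per-router neighbor reports; the report format is invented for this example and is not the prototype’s wire format.

from collections import defaultdict

# Each router reports (router_id, [neighbor ids]) over the dissemination
# plane; the decision plane fuses the reports into one topology view.
reports = [
    ("R1", ["R2", "R3"]),
    ("R2", ["R1", "R4"]),
    ("R3", ["R1", "R4"]),
    ("R4", ["R2", "R3"]),
]

def build_view(reports):
    # Keep only links confirmed by both endpoints, so a stale or one-sided
    # report shows up as a missing edge instead of a phantom link.
    claimed = defaultdict(set)
    for router, neighbors in reports:
        claimed[router].update(neighbors)
    return {tuple(sorted((a, b)))
            for a, nbrs in claimed.items() for b in nbrs
            if a in claimed.get(b, set())}

print(sorted(build_view(reports)))
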
Three Principles for Network Control & Management

Direct Control: allow the logic to directly set forwarding state
• FIB entries, packet filters, queuing parameters
• The logic computes the desired network state – let it implement that state

[Diagram: management logic reads state information from the network and writes forwarding state back to it]

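A minimal sketch of direct control, with hypothetical names: the decision logic computes the exact forwarding state it wants and writes it to each router, instead of setting protocol parameters and predicting what the routers will compute.

from dataclasses import dataclass

@dataclass(frozen=True)
class FibEntry:
    prefix: str    # e.g. "10.1.0.0/16"
    next_hop: str  # neighbor or egress interface id

def push_state(router_id, entries):
    # In a real system the write travels over the dissemination plane;
    # here we just print the operations the DE would issue.
    for e in entries:
        print(f"{router_id}: install {e.prefix} -> {e.next_hop}")

# The DE decides, the router obeys -- no distributed route computation.
push_state("R1", [FibEntry("10.1.0.0/16", "R3"), FibEntry("0.0.0.0/0", "R2")])
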
Overview of the 4D Architecture

[Diagram: the four planes in a stack – Decision (fed by network-level objectives), Dissemination, Discovery, and Data – with network-wide views flowing up to the decision plane and direct control flowing down from it]

Decision Plane:
• All management logic is implemented on centralized servers that make all decisions
• Decision Elements use the views to compute data-plane state that meets the objectives, then directly write this state to the routers

Overview of the 4D Architecture

Dissemination Plane:
• Provides a robust communication channel to each router – and robustness is the only goal!
• May run over the same links as user data, but is logically separate and independently controlled

Overview of the 4D Architecture

Discovery Plane:
• Each router discovers its own resources and its local environment
  – E.g., the identity of its immediate neighbors

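The talk leaves the discovery mechanism abstract. One plausible realization, sketched here with an invented frame format, is a periodic hello beacon on every interface, so a router learns the identity of each neighbor with zero configuration:

import json, time

def make_hello(router_id, ifname):
    # Sent periodically on every interface; the router announces only
    # facts it can observe about itself, so nothing is configured.
    return json.dumps({"type": "HELLO", "router": router_id,
                       "ifname": ifname, "ts": time.time()}).encode()

def on_hello(my_id, my_if, frame, neighbors):
    msg = json.loads(frame)
    if msg.get("type") == "HELLO" and msg["router"] != my_id:
        # Learn who sits at the far end of this link, then report the
        # fact upward as part of the network-wide view.
        neighbors[my_if] = msg["router"]

nbrs = {}
on_hello("R1", "eth0", make_hello("R2", "eth3"), nbrs)
print(nbrs)  # {'eth0': 'R2'}
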
Overview of the 4D Architecture

Data Plane:
• Spatially distributed routers/switches
• Can be deployed with today’s technology
• Looking at ways to unify forwarding paradigms across technologies

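The slide only says the authors are looking at ways to unify forwarding paradigms; one conceivable direction, sketched purely for illustration, is a single match-action table whose match kinds cover longest-prefix (IP), exact-match (Ethernet), and label switching (MPLS):

import ipaddress

def lookup(table, packet):
    # table: list of (kind, key, action); longest prefix wins among
    # prefix rules, exact and label rules match directly.
    best = None
    for kind, key, action in table:
        if kind == "exact" and packet.get("dst_mac") == key:
            return action
        if kind == "label" and packet.get("label") == key:
            return action
        if kind == "prefix" and "dst_ip" in packet:
            net = ipaddress.ip_network(key)
            if ipaddress.ip_address(packet["dst_ip"]) in net:
                if best is None or net.prefixlen > best[0]:
                    best = (net.prefixlen, action)
    return best[1] if best else "drop"

table = [("prefix", "10.0.0.0/8", "fwd:R2"),
         ("prefix", "10.1.0.0/16", "fwd:R3"),
         ("label", 28, "swap:17,fwd:R4")]
print(lookup(table, {"dst_ip": "10.1.2.3"}))  # fwd:R3 (longest prefix)
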
Concerns and Challenges

• Distributed-systems issues
  – How will communication between the routers and the DEs survive failures in the network?
  – Latency means the DE’s view of the network is behind reality – will the control loop be stable?
  – What is the overhead of the traffic to/from the DEs?
  – What happens in a network partition?
• Networking issues
  – Does the 4D simplify control and management?
  – Can we create logic that meets multiple objectives?

The Feasibility of the 4D Architecture

We designed and built a prototype of the 4D Architecture.
• The 4D Architecture permits many designs – the prototype is a single, simple design point
• Decision plane
  – Contains logic to simultaneously compute routes and enforce the reachability matrix
  – Multiple Decision Elements per network, using a simple election protocol to pick a master
• Dissemination plane
  – Uses source routes to direct control messages
  – Extremely simple, but can route around failed data links

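The source-routing idea fits in a few lines. In the sketch below (message format invented for illustration), the DE writes the full hop list into each control message, so delivery depends only on the listed links being up, not on any converged routing state; if a link fails, the DE re-sends along a disjoint path.

def forward(msg):
    # Each hop pops itself off the path and relays toward the next hop.
    path, payload = msg
    here, rest = path[0], path[1:]
    if not rest:
        return ("deliver", here, payload)
    return ("relay", here, (rest, payload))

msg = (["R1", "R4", "R7"], b"install-fib-v42")
while True:
    action, node, msg = forward(msg)
    print(action, "at", node)
    if action == "deliver":
        break
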
Evaluation of the 4D Prototype

• Evaluated using Emulab (www.emulab.net)
• Linux PCs used as routers (650–800 MHz)
• Tested on 9 enterprise network topologies (10–100 routers each)

[Diagram: example network with 49 switches and 5 DEs]

Performance of the 4D Prototype

The trivial prototype has performance comparable to well-tuned production networks.
• Recovers from a single link failure in < 300 ms
  – A response under 1 s is considered “excellent”
  – Faster forwarding reconvergence is possible
• Survives failure of the master Decision Element
  – A new DE takes control within 1 s
  – No disruption unless a second fault occurs
• Gracefully handles complete network partitions
  – Less than 1.5 s of outage

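The slides do not spell out the election protocol, but a sketch in the same spirit (fields and timings are assumptions) shows why failover is fast: every DE floods heartbeats, each router obeys the highest-priority DE heard from recently, and a dead master is replaced as soon as its heartbeats time out.

HEARTBEAT_TIMEOUT = 0.6  # assumed: silence this long marks a DE as dead

def elect_master(heartbeats, now):
    # heartbeats: {de_id: (priority, last_seen_time)} as observed locally.
    alive = [(prio, de) for de, (prio, seen) in heartbeats.items()
             if now - seen < HEARTBEAT_TIMEOUT]
    return max(alive)[1] if alive else None

hb = {"DE-1": (10, 9.9), "DE-2": (5, 10.0)}
print(elect_master(hb, now=10.0))  # DE-1: highest priority, still alive
hb["DE-1"] = (10, 9.0)             # DE-1's heartbeats stop arriving
print(elect_master(hb, now=10.0))  # DE-2 takes over within ~1 s
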
Fundamental Problem: Wrong Abstractions

[Diagram: today’s three planes and the tools at each level – management tools talk to the routers via configs, SNMP, netflow, and modems; each router runs OSPF and BGP processes, driven by link metrics and routing policies, feeding a FIB and packet filters]

Management Plane
• Figure out what is happening in the network
• Decide how to change it
• Tools: shell scripts, traffic engineering, planning tools, databases

Control Plane
• Multiple routing processes on each router
• Each router with a different configuration program
• Huge number of control knobs: link metrics, ACLs, routing policies

Data Plane
• Distributed routers
• Forwarding, filtering, queueing
• Based on FIBs or labels

Good Abstractions Reduce Complexity

[Diagram: today’s stack (Management Plane → configs → Control Plane → FIBs, ACLs → Data Plane) beside the 4D stack (Decision Plane → FIBs, ACLs via the Dissemination plane → Data Plane)]

All decision-making logic is lifted out of the control plane
• Eliminates the duplicate logic in the management plane
• The dissemination plane provides robust communication to/from the data-plane switches

Today: Simple Things are Hard to Do

[Diagram: a destination D reached across access networks and inter-POP links]

Fundamental Problem: Configurations Allow Too Many Degrees of Freedom

• Computing configuration files that cause the control plane to compute the desired forwarding states is intractable
  – NP-hard in many cases
  – Requires a predictive model of control-plane behavior
• Configuration files form a program that defines a set of forwarding states
  – Very hard to create a program that permits only the desired states and doesn’t transit through bad ones

[Diagram: within the set of forwarding states allowed by the configs, auto-adaptation leads to/through bad states; direct control avoids the bad states]

Fundamental Problem: Conflation of Issues

• Ideal case: all routing information is flooded to all routers inside the network
  – Robustness is achieved via flooding
• Reality: routing information is filtered and aggregated extensively
  – Route filtering is used to implement security and resource policies
  – Route aggregation is used to achieve scalability

4D Separates Distributed Computing Issues from Networking Issues

• Distributed computing issues → protocols and network architecture
  – Overhead
  – Resiliency
  – Scalability
• Networking issues → management logic
  – Traffic engineering and service provisioning
  – Egress point selection
  – Reachability control (VPNs)
  – Precomputation of backup paths

Future Work

• Scalability
  – Evaluate over 1–10K switches, 10–100K routes
  – Networks with backbone-like propagation delays
• Structuring decision logic
  – Arbitrate among multiple, potentially competing objectives
  – Unify control when some logic takes longer to run than others
• Protocol improvements
  – Better dissemination and discovery planes
• Deployment in today’s networks
  – Data center, enterprise, campus, backbone (RCP)

Future Work

• Experiment with network appliances
  – Traffic shapers, traffic scrubbers
• Expand relationships with security
  – Using 4D as a mechanism for monitoring/quarantine
• Formulate models that establish the bounds of 4D
  – Scale, latency, stability, failure models, objectives
  – Generate evidence to support/refute the principles

Questions?

Direct Control Provides Complete Control

• Zero device-specific configuration
• Supports many models for “pushing” routes
  – Trivial push: convergence requires time for all updates to be received and applied – same as today
  – Synchronized update: updates are propagated, but not applied until an agreed time in the future – clock skew defines the convergence time
  – Controlled state trajectory: the DE serializes updates to avoid all incorrect transient states

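A sketch of the synchronized-update model (hypothetical code, not the prototype’s): the DE stamps a batch of updates with an apply-at time, and routers buffer the batch on receipt and commit it at that moment, so the window of inconsistency shrinks to the clock skew rather than the propagation delay.

import time

def make_batch(updates, lead_time=0.5):
    # lead_time must exceed the worst-case dissemination delay.
    return {"apply_at": time.time() + lead_time, "updates": updates}

def apply_when_due(router_id, batch):
    # Router side: buffer the batch, then commit everything at apply_at.
    time.sleep(max(0.0, batch["apply_at"] - time.time()))
    for update in batch["updates"]:
        print(router_id, "commits", update)

batch = make_batch(["10.0.0.0/8 -> R3", "drop chi-FO -> nyc-DC"])
for r in ("R1", "R2"):  # in reality each router runs this locally
    apply_when_due(r, batch)
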
Fundamental Problem: Wrong Abstractions
interface Ethernet0
 ip address 6.2.5.14 255.255.255.128
!
interface Serial1/0.5 point-to-point
 ip address 6.2.2.85 255.255.255.252
 ip access-group 143 in
 frame-relay interface-dlci 28
!
access-list 143 deny 1.1.0.0/16
access-list 143 permit any
!
route-map 8aTzlvBrbaW deny 10
 match ip address 4
route-map 8aTzlvBrbaW permit 20
 match ip address 7
!
ip route 10.2.2.1/16 10.2.1.7
!
router ospf 64
 redistribute connected subnets
 redistribute bgp 64780 metric 1 subnets
 network 66.251.75.128 0.0.0.127 area 0
!
router bgp 64780
 redistribute ospf 64 match route-map 8aTzlvBrbaW
 neighbor 66.253.160.68 remote-as 12762
 neighbor 66.253.160.68 distribute-list 4 in

Fundamental Problem: Wrong Abstractions

[Chart: size of the configuration files in a single enterprise network of 881 routers – y-axis: lines per config file (0–2000); x-axis: router ID (0–881), sorted by file size]

Fundamental Problem: Conflating Distributed Systems Issues with Networking Issues

[Diagram: routing processes flood the route for destination D (“left”) along multiple paths, so every process learns it]

• Distributed-systems concern: resiliency to link failures
• Solution: multiple paths through the routing-process graph

Fundamental Problem: Conflating Distributed Systems Issues with Networking Issues

[Diagram: after a link failure, the route for D still reaches every routing process, now arriving via “right” rather than “left”]

• Distributed-systems concern: resiliency to link failures
• Solution: multiple paths through the routing-process graph

Fundamental Problem: Conflating Distributed Systems Issues with Networking Issues

[Diagram: a routing process filters the routes to D, so processes behind it never learn them]

• Networking concern: implement resource or security policy
• Solution: restrict the flow of routing information – filter routes, summarize/aggregate routes

4D Supports Network Evolution & Expansion

• Decision logic can be upgraded as needed
  – No need to update distributed protocols implemented in software on every switch
• Decision Elements can be upgraded as needed
• Network expansion requires upgrades only to the DEs, not to every switch

Reachability Example

[Diagram: routers R1–R5 connect two sites – Chicago (chi) and New York (nyc) – each with a data center and a front office]

• Two locations, each with a data center & front office
• All routers exchange routes over all links

Reachability Example

[Diagram: the same network with the four subnets labeled chi-DC, chi-FO, nyc-DC, and nyc-FO]

Reachability Example

[Diagram: packet filters installed at the data-center routers – one drops nyc-FO -> * (permitting everything else), the other drops chi-FO -> *]

Reachability Example

[Diagram: the same network with a new link added directly between the two data centers]

• A new short-cut link is added between the data centers
• Intended for backup traffic between the centers

Reachability Example

[Diagram: packets can now cross the new data-center link, bypassing the packet filters]

• Oops – the new link lets packets violate the security policy!
• Routing changed, but packet filters don’t update automatically

[Figure: prohibiting packets from chi-FO to nyc-DC]

Reachability Example

[Diagram: additional packet filters installed to cover the new link]

• Typical response – add more packet filters to plug the holes in the security policy

Reachability Example

[Diagram: a link fails; traffic between the front offices must now cross the filtered links]

• Packet filters have surprising consequences
• Consider a link failure
  – chi-FO and nyc-FO are still connected

Reachability Example

[Diagram: the only surviving path between the front offices passes through a packet filter that drops the traffic]

• The network has less survivability than its topology suggests
  – chi-FO and nyc-FO are still connected
  – But the packet filter means no data can flow!
  – Probing the network won’t predict this problem

[Figure: allowing packets from chi-FO to nyc-FO]

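The root cause in this example is that the filters are placed by hand against one snapshot of the routing, and nothing recomputes them when routing changes. Under 4D, the decision logic can re-derive filter placement from the reachability matrix whenever its view changes. The sketch below is hypothetical code on a toy topology loosely modeled on the example; it drops forbidden traffic at the last hop of the path currently in use.

from collections import deque

def shortest_path(links, src, dst):
    # BFS over an undirected graph given as {node: set(neighbors)}.
    prev, seen, q = {}, {src}, deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = [u]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        for v in links[u] - seen:
            seen.add(v)
            prev[v] = u
            q.append(v)
    return None

def filters_for(links, attach, denied_pairs):
    # Rerun on every topology change. A real decision element would cover
    # all usable paths, not just one shortest path per forbidden pair.
    rules = set()
    for src, dst in denied_pairs:
        path = shortest_path(links, attach[src], attach[dst])
        if path and len(path) > 1:
            rules.add((path[-1], f"drop {src} -> {dst} arriving from {path[-2]}"))
    return rules

links = {"R1": {"R2", "R3", "R5"}, "R2": {"R1", "R4", "R5"},
         "R3": {"R1", "R4"}, "R4": {"R2", "R3"}, "R5": {"R1", "R2"}}
attach = {"chi-DC": "R1", "chi-FO": "R3", "nyc-DC": "R2", "nyc-FO": "R4"}
denied = [("chi-FO", "nyc-DC"), ("nyc-FO", "chi-DC")]
for rule in sorted(filters_for(links, attach, denied)):
    print(rule)
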
Multiple Interacting Routing Processes

[Diagram: a client reaches a server across the Internet; each router’s FIB is fed by an OSPF process, and the border routers glue OSPF to BGP through Policy1 and Policy2]

[Figure: the routing instance graph of an 881-router network]

[Figure: reconvergence time under a single link failure]

[Figure: reconvergence time when the master DE crashes]

[Figures: reconvergence time when the network partitions]

Many Implementations Possible

• Single redundant decision engine
  – Hot stand-by
• Multiple decision engines
  – Divide the network & load-share
• Distributed decision engines
  – Up to one per router
• The choice can be based on reliability requirements
  – The dissemination plane can be in-band, or can leverage out-of-band links
• Less need for distributed solutions (which are harder to reason about)
  – More focus on network issues, less on distributed protocols

Direct Expression Enables New Algorithms

[Diagram: OSPF’s single shortest path to destination D versus the multiple paths a decision plane can install]

OSPF normally calculates a single path to each destination D
• OSPF allows load-balancing only over equal-cost paths, to avoid loops
• Using ECMP requires careful engineering of link weights

A Decision Plane with a network-wide view can compute multiple paths
• “Backup paths” installed for free!
• Bounded stretch, bounded fan-in

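A sketch of how a decision plane with a network-wide view might compute those extra paths (illustrative, not the prototype’s algorithm): for each router, any neighbor strictly closer to the destination is a loop-free alternate next hop, and a stretch bound keeps detours short.

import heapq

def dists_to(dest, links):
    # Dijkstra from dest over {node: {neighbor: cost}}, symmetric costs.
    dist, pq = {dest: 0}, [(0, dest)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in links[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return dist

def alternates(dest, links, max_stretch=2.0):
    d = dists_to(dest, links)
    out = {}
    for u in links:
        if u == dest:
            continue
        # Loop-free condition: the neighbor is strictly closer to dest,
        # so handing it the packet can never loop back through u.
        out[u] = sorted(v for v, w in links[u].items()
                        if d[v] < d[u] and w + d[v] <= max_stretch * d[u])
    return out

links = {"A": {"B": 1, "C": 1}, "B": {"A": 1, "D": 1},
         "C": {"A": 1, "D": 1}, "D": {"B": 1, "C": 1}}
print(alternates("D", links))  # A gets two loop-free next hops: B and C
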
Systems of Systems

• Systems are designed as components to be used in larger systems – in different contexts, for different purposes, interacting with different components
  – Example: OSPF and BGP are complex systems in their own right, yet they are components of a network’s routing system, interacting with each other, with packet filters, and with management tools
  – Complex configuration is the price of this flexibility
• The glue has a tremendous impact on network performance
  – State of the art: multiple interacting distributed programs written in assembly language
• We lack an intellectual framework for understanding the global behavior

Supporting Network Evolution

• The logic for controlling the network needs to change over time
  – Traffic engineering rules
  – Interactions with other networks
  – Service characteristics
• Upgrades to field-deployed network equipment must be avoided
  – Very high cost
  – Software upgrades often require hardware upgrades (more CPU or memory)

Supporting Network Evolution Today

Today’s “solution”:
• Vendors stuff their routers with software implementing all possible “features”
  – Multiple routing protocols
  – Multiple signaling protocols (RSVP, CR-LDP)
  – Each feature controlled by parameters set at configuration time to achieve late binding
• Feature creep creates a configuration nightmare
  – Tremendous complexity in syntax & semantics
  – Mis-interactions between features are common

Our goal: separate the decision-making logic from the field-deployed devices.

Supporting Network Expansion

• Networks are constantly growing
  – New routers/switches/links are added
  – Old equipment is rarely removed
• Adding a new switch can cause old equipment to become overloaded
  – The CPU/memory demands on each device should not scale up with network size

Supporting Network Expansion Today

• Routers run a link-state routing protocol
  – The size of the link-state database scales with the number of routers
  – An expanding network can exceed the memory limits of old routers
• Today’s “solution”
  – Monitor resources on all routers
  – Predict the approach of exhaustion, then either upgrade globally or re-architect the routing design to add summarization, route aggregation, and information hiding

Our goal: make demands scale with the hardware (e.g., the number of interfaces).

Supporting Remote Devices

• Maintaining communication with all network devices is critical for network management
  – Diagnosis of problems
  – Monitoring status and network health
  – Updating configuration or software
• “The chicken or the egg…”
  – Cannot send a device configuration/management information until it can communicate
  – A device cannot communicate until it is correctly configured

Supporting Remote Devices Today

Today’s “solution”: use the PSTN as the management network of last resort
• Connect the consoles of remote routers to phone modems
• Can’t be used for customer-premises equipment (CPE): DSL/cable modems, integrated access devices (IADs)
• In a converged network, the PSTN is decommissioned

Our goal: preserve management communication to any device that is not physically partitioned, regardless of its configuration state.

Recent Publications

• G. Xie, J. Zhan, D. A. Maltz, H. Zhang, A. Greenberg, G. Hjalmtysson, J. Rexford, “On Static Reachability Analysis of IP Networks,” Proceedings of IEEE INFOCOM 2005, Orlando, FL, March 2005.
• J. Rexford, A. Greenberg, G. Hjalmtysson, D. A. Maltz, A. Myers, G. Xie, J. Zhan, H. Zhang, “Network-Wide Decision Making: Toward a Wafer-Thin Control Plane,” Proceedings of ACM HotNets-III, San Diego, CA, November 2004.
• D. A. Maltz, J. Zhan, G. Xie, G. Hjalmtysson, A. Greenberg, H. Zhang, “Routing Design in Operational Networks: A Look from the Inside,” Proceedings of ACM SIGCOMM 2004, Portland, OR, 2004.
• D. A. Maltz, J. Zhan, G. Xie, H. Zhang, G. Hjalmtysson, A. Greenberg, J. Rexford, “Structure Preserving Anonymization of Router Configuration Data,” Proceedings of the ACM/USENIX Internet Measurement Conference (IMC 2004), Sicily, Italy, 2004.