Experience in Black-box OSPF Measurement

Operations and Management of IP Networks:
What Researchers Should Know
Aman Shaikh
Albert Greenberg
AT&T Labs (Research)
SIGCOMM 2005 Tutorial
Aman Shaikh, Albert Greenberg, August 2005
Perceptions…
• IP networks are simple …
– Best effort service only
– Simple and stupid core; complex and intelligent edge
• IP networks manage themselves and work just
fine…
– Capable of routing around failures
– Excess capacity in the core
• IP network operations and management is SNMP
– SNMP MIBs exist for everything you need to know
– SNMP is widely supported and deployed
Reality: Limitations of IP
• Application needs: reliable, predictable network service
– But, IP only provides best-effort service
– But, IP network elements are not very reliable
• Operators want fine-grained control over the network
– But, routers do not do fine-grained resource allocation
• Operators want accountability of resources
– But, routers do not maintain state about packet transfers
– But, measurement is not part of the infrastructure
• SNMP is the only exception; but SNMP is not adequate
IP was not designed with network management in mind!
Network management is much more than SNMP!
Reality: Scale and Diversity
• IP networks are large, diverse and complex
– Network elements, protocols, processes, applications, services
– IP/Optical integration and general cross layer interactions
– Multiple vendors and multiple platforms (even within one
vendor)
• IP networks are large and distributed
– Workflow management and maintenance across multiple time
zones, in a highly coupled distributed system
• IP network control plane is diverse and complex
– Protocols offer numerous complex, overlapping features
– Protocols sometimes interact in strange and complex ways
• IP supports diverse services and applications
– Application overlays on the IP infrastructure (e.g., VoIP)
– Application-level servers, gateways, databases…
Reality: Dynamism
• Failures, maintenance and upgrades are common (e.g., IOS
upgrades)
• Technological advances (e.g., ULH), protocol evolution (e.g.,
SIP)
• Architectural changes (e.g., MPLS)
• Network migrations, convergence (e.g., BGP route free cores)
• New applications and services (e.g., IPTV)
• Threats: Worms, viruses, DDoS, malware, … (e.g., Witty worm)
• Customers leave, join, change/upgrade services, … (e.g., Frame
Relay to VPN migrations)
• Traffic fluctuates: routing anomalies, failures, misconfiguration,
attacks, flash crowds (e.g., customer-side rebalancing)
How to Build an IP Network?
• Multiple routing processes on each router
• Each router with different configuration program
• Huge number of control knobs: metrics, ACLs, policy
• Distributed routers
• Forwarding, filtering, queuing
• FIBs, LFIBs, Labels
• Plethora of uncoordinated, overlapping network management scripts, tools, databases
[Figure: routers running OSPF, BGP, and LDP, each with a FIB/LFIB and packet filters, driven by link metrics and routing policies, and surrounded by shell scripts, traffic engineering tools, planning tools, and databases that rely on configs, SNMP, netflow, and modems]
Complex Associations Below (and Above) IP Layer
[Figure: seven uplink arrangements drawn with RAR, BR, and ADM elements. Panels: dual uplinks on ION; dual uplinks on common ring; dual uplinks on different rings, common conduits; one uplink ring, other ION; dual uplinks on different rings, diversely routed; uplinks with ring and ION; uplinks with unprotected, non-diverse segments]
Network Management Systems Interactions
[Figure: dense interaction diagram of anonymized network management systems (Mgt 1 to Mgt 4, ticket management, portals, platforms, transport) and databases (DB1 to DB4, Database 1 to 3), linked by numbered interfaces]
Complex, Massive Streaming Data
COMMON FORMAT to RAW CDR FORMAT field mapping, as of 03/15/2005
Category | Description | Common Format Field # | Gway 1 STOP Field # | Gway 1 ATTEMPT Field # | Gway 1 FEATURE Field # | Gway 2 STOP Field # | Gway 2 ATTEMPT Field #
MAIN | Record Type (Start / Stop / Attempt) | 1 | 1 | 1 | 1 | 1 | 1
MAIN | Call ID | 2 | 4 | 4 | 4 | 3 | 3
MAIN | Originating Network Element Name | 3 | 2 | 2 | 2 | 2 | 2
MAIN | Destination Network Element Name | 4 | 14 | 14 | 14 | NP | NP
MAIN | Start Date | 5 | 21 | 61 | 21 (or 61#) | 6 | 6
MAIN | "Start Time" (ASX/GSX definitions differ) | 6 | 21 | 61 | 21 (or 61#) | 7 | 7
MAIN | Calling Number | 7 | 5 | 5 | 5 | 20 | 17
MAIN | Called Number | 8 | 6 | 6 | 6 | 21 | 18
MAIN | Disconnect Date | 9 | 61 | 61 | 61 (or 21*) | 11 | 105
MAIN | Disconnect Time | 10 | 61 | 61 | 61 (or 21*) | 12 | 10
MAIN | Call Duration (seconds) | 11 | computed | NP | computed | 14 | NP
MAIN | Disconnect Reason (for GSX, progress msg) | 12 | 64 | 63 | NP | 15 | 12
MAIN | Call Direction | 13 | 28 | 28 | 28 | 17 | 14
MAIN | Disconnect Initiator | 14 | 65 | 64 | NP | 64 | 57
TERMINATION | Disconnect Reason Xmit to Ingress | 15 | NP | NP | NP | 121 | 111
TERMINATION | Disconnect Reason Xmit to Egress | 16 | NP | NP | NP | 122 | 112
PATH | Ingress PSTN Trunk | 17 | NP | NP | NP | 34 | 31
PATH | Ingress PSTN Circuit End Point | 18 | NP | NP | NP | 35 | 32
PATH | Egress PSTN Circuit End Point | 19 | NP | NP | NP | 37 | 34
PATH | Ingress IP Circuit End Point | 20 | NP | NP | NP | 36 | 33
PATH | Egress IP Circuit End Point | 21 | 13 | 13 | 13 | 38 | 35
• First 25 lines describing individual VoIP Call Detail Record Data Types
• A simple case!
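The mapping on this slide amounts to a per-gateway lookup from common-format fields to raw CDR field positions. A minimal Python sketch of that idea follows. It assumes the columns read as Gway 1 STOP and Gway 2 STOP, that field numbers are 1-based, and that a raw record has already been split into a list; the names `FIELD_MAP`, `to_common_format`, `gway1`, and `gway2` are hypothetical.

```python
# Field positions (1-based) for three common-format fields, read from the
# mapping table under the assumptions stated above. Only STOP records shown.
FIELD_MAP = {
    ("gway1", "STOP"): {"call_id": 4, "calling_number": 5, "called_number": 6},
    ("gway2", "STOP"): {"call_id": 3, "calling_number": 20, "called_number": 21},
}


def to_common_format(gateway, record_type, raw_fields):
    """Pick the common-format fields out of one raw CDR (a list of strings)."""
    positions = FIELD_MAP[(gateway, record_type)]
    return {name: raw_fields[pos - 1] for name, pos in positions.items()}


# Illustrative raw record with made-up values in fields 1 to 6.
raw = ["STOP", "gw-name", "x", "call-42", "5551000", "5552000"]
common = to_common_format("gway1", "STOP", raw)
```

Each new gateway type or record type adds another entry to the lookup, which is one reason the real table grows so quickly.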
Tutorial Objectives
• Understanding elements of network management
– Numbers, network elements, services, systems, processes
– Problems, solutions
– Research challenges and opportunities
• Expose the tip of the iceberg
– To excite you to look deeper and help improve the state of the art
[Figure: fields surrounding network management: IP/MPLS networking, optical networking, statistics, security, visualization, software, algorithms, machine learning, data mining, automation]
How does Network Management Fit In?
• Product Management and Sales
– Strategy and New Technologies: VPNs, IPTV, WiMax, VoIP,
CDNs, …
• Network Development
– Architecture, Capacity Planning, Testing and Certification,
Technology incubation
• Software Development
– Network management systems; Billing systems
• Network Management (Operations): the focus of this tutorial
– Customer Care
– Network Care
Boundaries can be fuzzy. IP Operations often write
significant (and creative) code or scripts, for example.
Our Network Management Tutorial
• Lay of the Land
• Network Operations and Management
• VoIP Case Study
• Some Directions and Challenges
Lay of the Land
• Physical networking
– What IP networks look like
• Topologies, network structures, taxonomies
• Logical networking
– Routing protocols, MPLS switching
IP Networks
• IP is the most prevalent technology for communication
– Everything over IP
• Enterprise networks
– Use IP networking for internal communication needs
– Hierarchical topologies typically: the right structure for small set
of hubs (data centers), huge set of spokes (remote offices)
• Service provider networks (the focus of this tutorial)
– Use IP to support a wide range of communication services to a
wide range of business and residential customers
– Mesh-like backbone structure: the right structure for convolving
tens of thousands of enterprise and other networks
• Routers concentrated in PoPs (Points of Presence)
• Both enterprise and service provider networks can have
enormous geographic span, and involve thousands of
complex network elements
AT&T North America IP Network
[Figure: map of the AT&T North America IP network, showing PoP cities across the United States, Canada, Mexico, and Puerto Rico, with links to Anchorage, Honolulu, Tokyo, Hong Kong, Sydney, and Singapore; November 2004]
AT&T EMEA IP Network
[Figure: map of the AT&T EMEA IP network, showing PoP cities from Dublin and Lisbon to Moscow and Haifa, with links to New York City, Washington DC, South Africa, and Pakistan; November 2004]
CalREN Backbone
Abilene Backbone
Taxonomy of Routers by Roles
• Customer Edge Routers (CE)
– On the customer premises
• Provider Edge Routers (PE)
– Terminate access for large number of customers
• Complex, customer specific access control, packet handling, routing
policies
• IP and IP VPN service
• End-to-end SLAs for on-net services (VPN, VoIP, IPTV, …)
– Terminate peering for a moderate number of private and public
peering points
• Complex, peer specific routing policies
• Bilateral/proprietary peering agreements
• Provider Core Routers (P)
– WAN transport between and within PoPs
– High-speed links, high-speed switching, low functionality,
high reliability
Tier-1 Service Provider Network
[Figure: PoPs containing P and PE routers, interconnected by intercity OC-48 or OC-192 links over DWDM systems; customer-facing PE interfaces reach CE routers (CPE) through metro and LEC access]
P: Backbone (core) Router
PE: Provider Edge Router
CE: Customer Edge Router
Rough stats:
100s of offices
100s of Ps, 1000s of PEs, 10000s of CEs
100000s of transport facilities
Taxonomy of Links by Roles
• Core links
– High-speed links: OC48, OC192, n x OCX composite links
– Core Link Protection
• IP Layer
– Intra-PoP and inter-PoP carried directly over DWDMs
– Optical restoration currently has little utility for IP backbones
– ULH (Ultra-long Haul) technologies may change that
• Edge links
– Access, peering and network management
• High-speed links: OCX, Ethernet/OCX
• Low-speed links: TDM backhauled over transport access network to PE,
potentially over multiple carriers
– Plethora of transport technologies (Ethernet, Cable, DSL, Frame Relay,
Wireless), and vendors
• Edge link protection
– IP layer and transport layer
– Higher speed: SONET Rings, Intelligent Optical Networks
– Lower speed: TDM mesh networks (intelligent networks or centralized
control)
Routing
• Routing protocols allow routers to build their FIB
(Forwarding Information Base)
– FIB contains (next-hop router, outgoing interface) for each
prefix and is consulted when router forwards packets
• Every router performs following steps:
– Learn topology information
• Identify and keep up with changes
– Calculate the FIB
• Variety of different routing protocols
– OSPF (Open Shortest Path First) [rfc2328]
– RIP (Routing Information Protocol) [rfc2453]
– IS-IS (Intermediate System to Intermediate System) [rfc1195]
– EIGRP (Cisco proprietary protocol)
– BGP (Border Gateway Protocol) [rfc1771]
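To make the FIB description concrete, here is a minimal Python sketch of a forwarding lookup. It assumes the usual longest-prefix-match rule for choosing among overlapping prefixes, which the slide does not spell out; the table contents, addresses, and interface names are hypothetical.

```python
import ipaddress

# Hypothetical FIB: prefix -> (next-hop router, outgoing interface).
FIB = {
    ipaddress.ip_network("0.0.0.0/0"): ("192.0.2.1", "POS1/0"),
    ipaddress.ip_network("10.0.0.0/8"): ("192.0.2.2", "POS2/0"),
    ipaddress.ip_network("10.1.0.0/16"): ("192.0.2.3", "POS3/0"),
}


def lookup(fib, destination):
    """Return the entry of the longest prefix that contains destination."""
    addr = ipaddress.ip_address(destination)
    matches = [prefix for prefix in fib if addr in prefix]
    if not matches:
        return None
    best = max(matches, key=lambda prefix: prefix.prefixlen)
    return fib[best]
```

A linear scan is enough to show the rule; real routers use tries or hardware tables for the same lookup.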
Taxonomy of Routing Protocols
Administrative Hierarchy
[Figure: three autonomous systems, AS 1, AS 2, and AS 3]
• The Internet is a collection of Autonomous Systems (ASes)
– An AS is roughly a network administered by a single authority
• ISPs, enterprises, educational institutes, government organizations
• AS is identified by AS Number (ASN)
• Two classes of routing protocols
– Intra-AS or Interior Gateway Protocol (IGP)
• Requirements: simplicity, stability, fast convergence
• OSPF, IS-IS, RIP, EIGRP
– Inter-AS or Exterior Gateway Protocol (EGP)
• Requirements: stability, “fast” convergence, scalability, and policy control
– Policy example: AT&T would not want to provide transit to peers
• BGP is the only EGP used today!
Taxonomy of Routing Protocols
Topology Information Learned by each Router
• Link-state routing protocol: each router learns the
entire topology
– Examples: OSPF, IS-IS
• Distance vector protocol: each router learns how
far every destination in the network is from each
of its neighbors
– Example: RIP
• Path vector protocol: each router learns each
neighbor’s path to every destination
– Example: BGP
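Because a link-state router knows the entire topology, it can compute its routes locally. The Python sketch below shows one way to do that with Dijkstra's algorithm, the shortest-path computation used by OSPF and IS-IS. The four-router topology and its link costs are hypothetical, and costs are assumed to be positive.

```python
import heapq


def shortest_paths(topology, source):
    """Dijkstra over the full topology: {router: (cost, first_hop)}."""
    best = {source: (0, None)}
    heap = [(0, source, None)]
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if cost > best[node][0]:
            continue  # stale heap entry, a cheaper path was already found
        for neighbor, weight in topology[node].items():
            new_cost = cost + weight
            hop = neighbor if node == source else first_hop
            if neighbor not in best or new_cost < best[neighbor][0]:
                best[neighbor] = (new_cost, hop)
                heapq.heappush(heap, (new_cost, neighbor, hop))
    return best


# Hypothetical topology: router -> {neighbor: link cost}.
TOPOLOGY = {
    "A": {"B": 10, "C": 50},
    "B": {"A": 10, "C": 10},
    "C": {"A": 50, "B": 10, "D": 5},
    "D": {"C": 5},
}
routes = shortest_paths(TOPOLOGY, "A")
```

The `first_hop` value is what would end up in the FIB as the next-hop router. A distance vector router, by contrast, never holds `TOPOLOGY`; it only sees the per-destination costs its neighbors report.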
IGP in Service Provider Networks
• Most tier-1 service providers use OSPF and IS-IS as IGPs
• For scalability, OSPF and IS-IS allow hierarchical routing
– Use of areas (OSPF) or levels (IS-IS) to form a hub-and-spoke topology
• Typically each PoP forms a spoke and inter-PoP links form the hub
– Link-state routing is used within an area, whereas distance vector approach is used
across areas
– Advantage: reduction of state and processing within each area, problem localization
• Example: impact of problems in one area can be minimized/hidden from other
areas
– Disadvantage: sub-optimal routing, management complexity
[Figure: provider backbone divided into OSPF areas: Area 0 plus Areas 1, 2, and 3, with PoPs containing P and PE routers and intercity links]
BGP in Service Provider Networks
• BGP is used to learn routes from neighbor ASes (peers and customers)
– PE routers form eBGP (external BGP) sessions with CE routers (customers) or PE
routers (peers)
• PE router sends externally learned routes to all routers in the service provider AS
– PE router forms iBGP (internal BGP) sessions with all routers (PE and P) in the AS
• iBGP scalability
– Routers (PE and P) have to form a full mesh, which does not scale beyond a few tens
of routers
– Form clusters of routers (cluster leader is called Route Reflector)
• Typically routers in a PoP form a cluster
– Disadvantage: information hiding, complicated routing and management
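The scaling argument is simple arithmetic, sketched below in Python. The sketch assumes one route reflector per cluster, reflectors fully meshed among themselves, and each client peering only with its own reflector; real deployments often use redundant reflectors, which would add sessions. The router counts are hypothetical.

```python
def full_mesh_sessions(n):
    """Number of iBGP sessions when n routers are fully meshed."""
    return n * (n - 1) // 2


def route_reflector_sessions(num_clusters, clients_per_cluster):
    """Sessions with one reflector per cluster (assumption, see lead-in)."""
    reflector_mesh = full_mesh_sessions(num_clusters)
    client_sessions = num_clusters * clients_per_cluster
    return reflector_mesh + client_sessions


# 20 PoPs, each with 1 reflector and 9 clients: 200 routers in total.
mesh = full_mesh_sessions(200)
reflected = route_reflector_sessions(20, 9)
```

For this example the full mesh needs 19900 sessions, while the clustered design needs 370.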
[Figure: eBGP sessions to neighboring ASes at the edge; iBGP sessions between routers and Route Reflectors inside the AS]
MPLS (Multi-Protocol Label Switching)
• Outgrowth of IP switching technologies
– E.g., Ipsilon’s IP switching, Cisco’s tag switching
• Key concept: separate routing (i.e., selection of paths)
from forwarding/switching
– Traditional forwarding: each router looks at destination
address along the path
• Even though routing protocol has already determined the path!
– MPLS-based forwarding: assign a label to a path and switch
packets based on the label at each router
• Gives rise to a Label Switched Path (LSP)
Layer 2 Header | PID | MPLS Label 1 | MPLS Label 2 | … | MPLS Label n | Layer 3 Packet
Each MPLS label entry: Label (20 bits) | CoS (3 bits) | Stack (1 bit) | TTL (8 bits)
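The four fields add up to one 32-bit word per label entry. As an illustration, the Python sketch below packs and unpacks such an entry in the field order shown on the slide, using network byte order; the function names are hypothetical.

```python
import struct


def pack_label(label, cos, bottom_of_stack, ttl):
    """Encode one label entry: label(20) | CoS(3) | stack(1) | TTL(8)."""
    if not (0 <= label < 2**20 and 0 <= cos < 8 and 0 <= ttl < 256):
        raise ValueError("field out of range")
    word = (label << 12) | (cos << 9) | (int(bottom_of_stack) << 8) | ttl
    return struct.pack("!I", word)


def unpack_label(data):
    """Decode 4 bytes into (label, cos, bottom_of_stack, ttl)."""
    (word,) = struct.unpack("!I", data)
    return (word >> 12, (word >> 9) & 0x7, bool((word >> 8) & 0x1), word & 0xFF)


entry = pack_label(16, 5, True, 64)
```

A label stack is then just a sequence of such entries, with the stack bit set only on the last one before the Layer 3 packet.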
MPLS in Service Provider Networks
• Form an FEC (Forwarding Equivalence Class) of
all packets with same forwarding requirements
– E.g., BGP destination prefix, packets with same CoS
(Class of Service) bits
• Associate an LSP with each FEC
– All packets within an FEC are forwarded same way
– Typically LSPs are established from ingress to egress
• Switch packets from ingress to egress
[Figure: an LSP across the backbone]
• Ingress router: push a label onto packet
• Backbone routers: label-switch packet
• Egress router: pop label from packet
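The push, switch, and pop roles can be traced with a small simulation. This Python sketch uses hypothetical router names and label values, and it assumes each backbone router swaps the incoming label for an outgoing one taken from its own label table, which is the usual meaning of label switching.

```python
# Hypothetical label tables: router -> {incoming label: (outgoing label, next router)}.
LABEL_TABLES = {
    "P1": {100: (200, "P2")},
    "P2": {200: (300, "PE-egress")},
}


def forward_along_lsp(packet, ingress_label, first_hop, egress):
    """Push a label at ingress, swap it at each backbone router, pop at egress."""
    stack = [ingress_label]              # ingress router pushes the label
    router, path = first_hop, []
    while router != egress:
        out_label, next_router = LABEL_TABLES[router][stack[-1]]
        stack[-1] = out_label            # backbone router swaps the label
        path.append(router)
        router = next_router
    stack.pop()                          # egress router pops the label
    path.append(router)
    return packet, path, stack


result = forward_along_lsp("payload", 100, "P1", "PE-egress")
```

No backbone router in the loop looks at the packet's destination address; the label alone decides the next hop.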
MPLS Applications
• IP VPNs (booming demand)
– Provider-based, simple, scalable VPNs
– Overlapping, private addressing
• Converged Cores
– IP VPNs + Internet Access
– Common BGP-free core
• Switch packets from ingress router to egress router
• Backbone routers do not need BGP routes
• Potential for More Reliable Cores
– Traffic engineering
• Establish “customized” paths
– IGPs do not provide fine-grained control over traffic
– Fast re-route
• Pre-compute and establish alternate MPLS paths to quickly route around
failures
Network Management
• Customer Care
• Network Care
Operations and Management Division of Roles
Customer vs. Network Care
• Customer Care is Edge (CE-PE) centric
– Focus: where customers meet the network
• CE-PE access circuit; CE, PE configuration
– A great deal of coordination is needed across the lifecycle of
assessment, onboarding, and steady state management
• Often about determining on which side of the customer/provider
interface a problem and associated action lies
– Call centers, relatively large teams, with technical and soft
skills needed to deal with customer problems
• Customer care is a differentiating service feature sold to customers
• Network Care is Core (PE) centric
– Focus: network internals that customers have no
interest in
• e.g., BGP route reflectors
– Network Operations Centers, relatively small teams, with deep
technical skills needed to deal with network problems
Customer Care
• Fundamentals
• Provisioning
• SLAs
• Data Management
What is a Customer?
• Multiple views for multiple purposes
– Sales, Billing, Provisioning, Troubleshooting, Interactions with third parties
– A network provider publishes different data to different contacts
• Who gets views of the bill vs. who gets views of the trouble tickets
– Defining a customer is not easy!
• Much more complex than knowing the customer’s Dun & Bradstreet D-U-N-S
number (though this helps)
• Customer Data Management is a difficult, dynamic problem
– A layer above networking, yet critical to network management
• For our purposes, associate a customer with a project
– Contracted network services: Internet access, VPN, reporting, SLAs, …
– CE (Customer Edge) routers – on the customer premises
• Managed by the customer or outsourced to the provider
– outsourced management may extend beyond the CE’s WAN interface
– PE (Provider Edge) routers – on the provider’s network
• Managed by the provider
– Access arrangements
• Site info: location, circuit ID, associated IP addresses, …
• Off-net (third party) access (LECs, PTTs) or on-net access via Packet over SONET
(POS), Frame Relay, Ethernet, DSL, …
Elements of Customer Care
Bootstrapping: customer + requirements → network + services
• Vanilla customer: automated flow-through from technical
questionnaire to service activation
• Complex customer: multiple, iterative steps
– Assessment
• Understanding existing customer networks/services
• Understanding requirements for new networks/services
– Base-lining of existing services
• Understanding traffic, topology, configuration …
– Design of new services
– Phased implementation of new services
• A complex enterprise may migrate to a new provider over a multiyear time frame
• Provisioning
– Logistics of getting routers, circuits, configurations to right place at right time
– Managed by workflow management systems
• Phased, scripted component and end to end test and turn up procedures
• In synchrony with updates to databases supporting network management systems
and billing
• Gold mine of complex data about network operations
Elements of Customer Care
Troubleshooting/Tech Support: customer + problem → solution
• Reactive: call centers – 24x7 tiered support
– Low tiers handle high volume, relatively simple or localized problem types
– High tiers handle lower volume, relatively complex and higher severity problems
• Proactive: alerts from monitoring systems
– Triggered by reachability, performance and fault monitoring
– Internal notifications to network care, access providers
– External notifications (IVR, email contact lists) to customers
• Essence of Customer Care: superfast problem localization and dispatch
– Detection and classification including level of severity
– Localization to the appropriate control domain: customer, network provider, access
provider
• Solution dispatch
– Again, to the appropriate control domain: customer, provider (network care), or
access provider
– And track the problem
• At the heart, this is automated systems workflow
– Rules driven, automated and audited escalations of problems through technical and
business channels from detection to localization to post mortem reporting
Provisioning
Transforming Service Intent to Network Reality
• Customers want service increasingly “on demand”
• Providers want revenue, which flows the moment
service is provisioned
• Provisioning speed is a huge priority
– Today’s bottleneck: physical provisioning of circuits
– Technology mechanisms, such as intelligent optical
networking (with bandwidth on demand) have sprung up to
address networking issues
– Market mechanisms, such as exchange points (e.g., PAIX),
have sprung up to address some of the physical wiring issues
• Customer brings a fiber to the exchange point, and chooses a provider
among those already there
Provisioning Workflow
• Technical Questionnaire (Service Level)
– E.g., Web form
• Logic: allocations of ports, IP addresses, VRFs, …
• Device/service specific templates, with embedded variables and callouts to computations and databases
– E.g., callouts for ports, IP addresses, ACL clauses, …
• Detailed Device Configuration commands, bundled as a “configlet” (Network Element Level)
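As a rough illustration of this workflow, the Python sketch below fills a device template from questionnaire answers and a port allocation to produce a configlet. The template text, the questionnaire keys, and the `allocate_port` callout are all hypothetical; a production system would call out to inventory and address databases instead of a list.

```python
from string import Template

# Hypothetical device template with embedded variables.
INTERFACE_TEMPLATE = Template(
    "interface $port\n"
    " description $customer\n"
    " ip address $address $mask\n"
)


def allocate_port(free_ports):
    """Stand-in for a callout to an inventory database."""
    return free_ports.pop(0)


def build_configlet(questionnaire, free_ports):
    """Turn service-level answers into network-element-level commands."""
    port = allocate_port(free_ports)
    return INTERFACE_TEMPLATE.substitute(
        port=port,
        customer=questionnaire["customer"],
        address=questionnaire["address"],
        mask=questionnaire["mask"],
    )


answers = {"customer": "ExampleCo", "address": "192.0.2.1", "mask": "255.255.255.252"}
configlet = build_configlet(answers, ["POS7/0", "POS7/1"])
```

`Template.substitute` raises `KeyError` when a variable is missing, which is a useful property here: an incomplete questionnaire fails before anything reaches a router.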
Provisioning Example
Access Interfaces
• Basic interface configuration
– Media and location in router (POS7/3, ATM5/0.1)
– IP address and network address (mask)
– Capacity (bandwidth)
• Rich configurable parameters at layer 3
– Packet marking and scheduling (differentiated services)
– Buffer management (memory size, RED parameters)
– Access control (inbound and outbound packet filters)
• Diverse communication media at layer 2
– Serial link, ATM, Frame Relay, packet over SONET, etc.
– Various low-level, media-specific parameters
Provisioning Example
BGP Customer Configuration
• Determine customer’s AS number
– Some customers have their own AS number
• Example: customers multi-homed to multiple providers
– Some customers cannot get their own AS number
• Example: single-homed customers
• Assign private ASN (64512 to 65535) or use provider’s ASN
• Establish communication with the customer
– Determine interface(s) connected to the customer
– Configure BGP session with the customer
– Associate BGP session with the interfaces
• Enforce provider’s routing policies while taking
customer’s routing intent into account
– BGP import and export policies
• Configure other BGP session parameters
– Password, timer settings, description, etc.
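The AS number decision in the first step can be written as a small rule. The Python sketch below uses the private range quoted on the slide; the function names, the pool of free private ASNs, and the order of preference (customer's own ASN, then a private ASN, then the provider's ASN) are assumptions for illustration.

```python
def is_private_asn(asn):
    """True for the private 2-byte ASN range quoted on the slide."""
    return 64512 <= asn <= 65535


def pick_asn(customer_own_asn, free_private_asns, provider_asn):
    """Return the ASN to configure for the customer's BGP session."""
    if customer_own_asn is not None:
        return customer_own_asn          # e.g., multi-homed customers
    for asn in free_private_asns:
        if is_private_asn(asn):
            return asn                   # single-homed: assign a private ASN
    return provider_asn                  # otherwise fall back to the provider's ASN
```

For example, `pick_asn(None, [64512, 64513], 64496)` returns `64512`, where 64496 is a documentation-range ASN standing in for the provider.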
Provisioning Example
BGP Routing Policies
• What are BGP routing policies?
– Applied to BGP update messages at PE (or AR) router
• Based on the prefix (and/or other attributes) listed in the update
– Determines route selection and distribution within AS as well as
distribution to other customers and peers
• Two kinds of routing policies
– Import: applied to routes received from the customer
• Filter routes for unwanted prefixes
• Influence the selection of the best route
• Tag routes for future export to other customers, and/or peers
– Export: applied to routes sent to the customer
• Filter routes for unwanted prefixes
• Select routes and attributes to send to customer
– E.g., send default route to customer (if needed)
• What makes them complicated?
– Often have to decompose them across routers to achieve intent
Example: Controlling Route Distribution
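[Figure: the Peer (10.0.0.1) attaches to provider router A; the Customer (192.168.0.1) attaches to provider router C]
Customer intent: “Don’t advertise my routes to peers”
Need policies at both the customer and peer
! Router C: assign the “Don’t import to peers” tag to routes from the customer
neighbor 192.168.0.1 route-map IMPORT-C in
route-map IMPORT-C permit 10
 set community 0:1000
! Router A: don’t send routes with the “Don’t import to peers” tag to the peer
ip community-list 1 permit 0:1000
neighbor 10.0.0.1 route-map EXPORT-A out
route-map EXPORT-A deny 10
 match community 1
! Added for completeness: without a later permit clause, the implicit deny at the end of the route-map would block every route
route-map EXPORT-A permit 20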
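The two-router policy can be modeled in a few lines. This Python sketch mirrors the logic only: tag on import at router C, filter on export at router A. Routes are plain dictionaries and the prefixes are documentation addresses; none of this is router code.

```python
NO_PEER_EXPORT = "0:1000"  # community value used in the config above


def import_from_customer(route):
    """Router C: set the community on every route learned from the customer."""
    tagged = dict(route)
    tagged["communities"] = {NO_PEER_EXPORT}
    return tagged


def export_to_peer(routes):
    """Router A: drop routes carrying the tag, send the rest."""
    return [r for r in routes if NO_PEER_EXPORT not in r.get("communities", ())]


customer_route = import_from_customer({"prefix": "203.0.113.0/24"})
other_route = {"prefix": "198.51.100.0/24", "communities": set()}
sent_to_peer = export_to_peer([customer_route, other_route])
```

The intent is only achieved when both halves are in place, which is the decomposition problem the slide points at: the tag is useless without the filter, and the filter has nothing to match without the tag.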
Auditing What’s Provisioned (Checks and Balances)
• Again, provisioning is about translating service intent to
network reality
– Automation helps enormously
– Simpler, better configuration languages (e.g., XML-based) and
configuration protocols (e.g., IETF’s netconf) may help
• Yet!!!
– Engineered artifacts (large-scale, operationally complex
networks and databases) are imperfect, are moving targets, and
are hard to reason about
– Flaws creep into design, realization, management
– Some level of noise or error is inevitable
• Key parts of the solution
– Auditing service intent and network reality, flagging and
fixing “discords”
– Data integrity and data cleaning
Auditing is Bottom Up!
[Figure: auditing and provisioning around a customer/network database; router configuration is polled into a low-level standard form (tables), queried, and discords are fixed]
• Parsing network-level data
– Box-level dumps (show running config; show diag; show trace …) translated into a form (RDBMS, XML) for network-level query and analysis
• Cross-validation
– Box level compliance to templates
– Network-wide integrity (routing…)
– Access control and security
– Alignment of network views with database views
– IP and Optical Associations (interfaces to circuit-IDs)
• Fixing config discords
– Report warnings & errors
– Cruft, serious problems, time-bombs waiting to explode when the triggering network event occurs
Example: Joining Parts of OSPF Config Together
(references/constraints scattered thru config file)
hostname MyRouter
!
interface POS7/0
 ! Remote end is in 12.123.36.72/30
 ip address 12.123.36.74 255.255.255.252
 ip ospf cost 50
!
router ospf 2
 ! Interface participates in OSPF
 network 12.123.36.0 255.255.255.0 area 9
 passive-interface Serial2/1/0/3.1
!
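The two annotations on this config are both small address computations. The Python sketch below reproduces them with the standard `ipaddress` module. It treats the mask on the `network` line as a netmask, exactly as written on the slide, which is an assumption about how to read that statement; the function names are hypothetical.

```python
import ipaddress


def in_network_statement(interface_ip, network, mask):
    """True if interface_ip falls inside the range given by network and mask."""
    covered = ipaddress.ip_network(f"{network}/{mask}")
    return ipaddress.ip_address(interface_ip) in covered


def remote_end(interface_ip, mask):
    """Return the interface's subnet and the other usable addresses in it."""
    subnet = ipaddress.ip_interface(f"{interface_ip}/{mask}").network
    others = [str(h) for h in subnet.hosts() if str(h) != interface_ip]
    return str(subnet), others


participates = in_network_statement("12.123.36.74", "12.123.36.0", "255.255.255.0")
subnet, others = remote_end("12.123.36.74", "255.255.255.252")
```

An auditing tool has to perform this join for every interface, because nothing in the `router ospf` block names POS7/0 directly.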
Example: Remote End in Different OSPF Area
(auditing tool joins/analyzes info in database)
[Figure: tables joined by simple SQL queries. Labels: extracted tables, intermediate tables, presentation (query result). Tables shown: interface, link, OSPF network, OSPF passive interface, OSPF interface, active OSPF interface, OSPF link with area mismatch]
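A minimal sketch of such an audit query, using Python's built-in `sqlite3` module, is shown below. The schema, the column names, and every row except MyRouter's POS7/0 in area 9 are hypothetical; the slide names the tables but not their layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE active_ospf_interface (router TEXT, interface TEXT, area INTEGER);
CREATE TABLE link (router_a TEXT, if_a TEXT, router_b TEXT, if_b TEXT);
INSERT INTO active_ospf_interface VALUES
  ('MyRouter', 'POS7/0', 9),
  ('RemoteRouter', 'POS3/1', 8),
  ('MyRouter', 'POS7/1', 9),
  ('OtherRouter', 'POS1/0', 9);
INSERT INTO link VALUES
  ('MyRouter', 'POS7/0', 'RemoteRouter', 'POS3/1'),
  ('MyRouter', 'POS7/1', 'OtherRouter', 'POS1/0');
""")

# Join each link to the OSPF area of both of its ends and keep the mismatches.
AREA_MISMATCH = """
SELECT l.router_a, l.if_a, a.area, l.router_b, l.if_b, b.area
FROM link AS l
JOIN active_ospf_interface AS a ON a.router = l.router_a AND a.interface = l.if_a
JOIN active_ospf_interface AS b ON b.router = l.router_b AND b.interface = l.if_b
WHERE a.area <> b.area
"""
mismatches = conn.execute(AREA_MISMATCH).fetchall()
```

The query itself is simple; the hard part in practice is building trustworthy `link` and `active_ospf_interface` tables from thousands of box-level dumps.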
Service Level Agreements (SLAs)
• A sort of warranty: financials + the “fine print”
• Fine print: technical reliability and performance
– Measurement intervals/methods, statistics (VoIP R-values, delay, loss,
jitter, availability), force majeure indemnifications regarding hurricanes …,
outages caused by the customer itself, variations based on interfaces and
bandwidth characteristics, etc.
– IP networking is maturing and the marketplace is extremely competitive
– SLAs have real meaning and are getting increasingly stringent
• Site to site (CE to CE) VPN SLAs cover Class of Service (CoS) specific targets
for delay, loss, jitter, and availability between pairs of sites within the customer’s
VPN
• Financials
– SLA compliance data incorporated into the billing data stream
– When SLAs are not met
• Customers unhappy: service quality is below expectations
• Providers unhappy: revenue suffers
Example: Provisioning for site to site SLAs
Network
CE
PE
Provider Network
PE
CE
CE, PE interfaces
• 4 interfaces are essential for a given CE pair
• To meet the SLA the detailed configurations
• must be aligned end to end across the network
• must match customer and service-specific data
– bandwidths (e.g. rate limiting parameters for FR/ATM CEPE link), CoS markings, shaping/queuing/marking/
dropping packet handling behaviors, customer-specific
routing and packet filter parameters
Example: Provisioning for site to site SLAs
Probes/reporting
[Figure: CE - PE - Provider Network - PE - CE, with the CE probes highlighted]
• 2 CE probes are essential for a given CE pair
– Probes have roles as both senders and responders
• Agent running on the CE router (e.g., Cisco’s SAA), or another box attached
to a CE port or on the CE – PE link (e.g., RMON probe)
– Collects detailed data on site to site performance via passive and active
measurements
• The detailed probe configurations must be aligned with the network
configurations, the customer, and the service
– Probe packet type (UDP, ICMP, CoS), probing frequency
– Interface packet filtering must permit the probes
• The performance/SLA monitoring system must collect the data with very high
fidelity
– Agent must itself be very reliable
– Performance monitoring platform must be designed for statistical soundness and
high reliability
• Polling frequency; data collection; data validation; SLA reporting
Data Management
• Extremely important issue in running networks
– Often overlooked by the academic community
– True for both customer and network care
• Customer level
– Service contracts, VPNs, CE, routing/access control
parameters/policies, site and access/circuit data, ordering/
billing records, provisioning/ updating service, workflow
related events and trouble tickets, performance reports, …
• Network level
– Layers 1-3 (+ network servers, such as DNS) topology,
routing, performance, security, fault, operational workflow,
provisioning/updating network and associated systems,
network-focused inventory and configuration, …
Data Management Challenges!
• Scale: tens of thousands of business customers; millions of
consumers
– Cisco platforms/command sets, VoIP telephony adapters, firmware updates
• Customization for complex customers
– Example: variations in IOS version, features, architecture
• Rapid Evolution of IP network services
– VoIP: multiple telephony adapters, firmware loads, interactions with
equipment
– Number of features that need to be maintained increases
• Features (almost) never die
• Data is managed by applications
– Software rot can lead to data rot
• Strikes when a program’s assumptions become out of date
• Churn in database design and development
– Multiple teams, creating multiple APIs
Important Data Management Activities
• Data integration/correlation
– Associations (keys) for
• mapping customers to access circuits, router interfaces, network
policies, transport facilities, monitoring systems
– Provisioning requires precise, normalized current data
– Troubleshooting requires extensive correlation of current and
historical data
• Data integrity/cleaning
– Getting high quality, readily available data
• Large real world datasets always have some level of dirty data
– Auditing and fixing process and process “fallouts” due to
inconsistent or missing data
• Approaches
– Top Down: Data modeling/engineering: integration
– Bottom Up: Google-like (read-only) virtual integration, but
read/write
Virtual Data Integration Methodology
[Figure: virtual integration architecture. Labels: MetaSearch, Local Interfaces, External Interfaces, VIP GUI, Custom Views, VIP Cache, Data Staging, Web Crawlers, DB snapshots, Direct Access, Data Access, and Data Sources A-E and V-Z]
• Ideal solution: integration of all systems BUT
– Large integration projects often fail because it is too expensive and time-consuming to re-engineer everything and get all the necessary buy-ins.
– Will it just create another monster?
• Virtual integration: use lightweight web and database technologies to give users the impression/value of systems integration
– Virtually integrated = not physically integrated
– No re-engineering of legacy systems
Virtual Integration Benefits
• Troubleshooting requires accessing many different systems
– Multiple logins; variety of interfaces (terminal, java-based, web)
• Access to the data is the first step towards assuring data quality
– Virtual integration system exploits established APIs in all component
systems (CGI-programming, terminal emulators, data feeds, …)
• Cross-index datasets on all possible combinations of joinable
keys
– Allow users to get to data by any means available
– Google-style approach, include direct links to main databases
• Build customized web interfaces based on user feedback
– Fast, no reorganization of the underlying systems
• Exploit big opportunities for automation
– Auto-populate forms
– Use AI/Data-Mining techniques to flag/correct input errors
Network Care
• Fundamentals
• Troubleshooting
• Maintenance
• Network Security
Elements of Network Care
• Troubleshooting and Maintenance are intertwined with each other
– One can trigger the other;
• Example: diagnosis of a failing line card can lead to its replacement
• Example: things can go wrong during maintenance that lead to troubleshooting
[Figure: two parallel workflows. Maintenance and upgrades: Plan → Notify customers → Prepare network → Perform → Verify. Troubleshooting: Detect → Localize → Diagnose → Fix → Verify.]
Proactivity and Reactivity
• Target
– Prevent problems, rather than getting better and better at fixing
problems
• How
– Robust design
– Automation of network management
– Forensics and post mortem analysis of problems
• Limitations
– Moving target!
• Can walk on water if it's frozen
– Silent failures!
• No trap, no measurement
Network Troubleshooting
• Workflow
– Detect, Localize, Diagnose, Fix, Verify
• Target: automate all of this!
– Reality: we are not all the way there yet
– Reality: better at automating the earlier parts of the workflow, thanks to
continuously operating comprehensive monitoring tools and systems
• Importance of real time execution: obvious
• Importance of off line analysis (post mortem): critical driver for
network improvement
• Systems and Tools
– Passive and Active monitoring
• Tools that apply in many roles across the workflow
– Correlation
• Accelerating, improving and automating the workflow
Example
• Detect
– Continuous active monitoring (PE-PE) shows loss of continuity for some
PE pairs
• Localize
– Active monitoring, in this case, provides immediate localization to
impacted PEs
• Diagnose
– Syslogs, MIBs, OSPF monitoring reveal
• CPU spikes coincident with high BGP workloads, OSPF sessions dropped,
customer provisioning
– Diagnosis: Unsustainable BGP workload (running at higher priority than
OSPF) on a certain class of PE routers
• Fix
– Short-term: Control provisioning and other configuration changes to avoid
triggering the problem
– Permanent: Vendor fixed scheduling priorities of OSPF and BGP processes
• Verify
– Enhancements to active monitoring, specific to provisioning
Network Troubleshooting Toolkit: Box Level
• Up Down
– Triggered by hard failures (link, card, router, etc)
– Near real-time alarms
• Statistical
– Traffic, buffers, CPU, …
– Degrading conditions; e.g., significant loss, no queues → degrading hardware on linecard → plan maintenance
• Scalable
– Easy when looking data source by data source
– Harder when looking at the huge number of data sources from
diverse network elements: SNMP, syslogs, SONET alarms, …
Network Troubleshooting Toolkit: Network Level and External
• Network Level
– Active measurement
• Path level performance information
• Delay and delay variation measurements
• Indication of customer degradation (except hard failures)
• Scalability problems (N Squared issues)
– Control Plane monitoring (BGP, OSPF, LDP)
• Passively forming the views of routing akin to the routers themselves
– Correlation
• Data fusion of network measurements and associating alerts
• Anomography – network wide anomaly detection from network
element
• External
– TACACS and workflow logs – who is doing what and where
on the network
– Alerting and tickets from other layers (Optical, VoIP)
Box-level: SNMP
• What is it?
– SNMP = Simple Network Management Protocol
– Allows NMS to query devices for information
• Information stored as MIB (Management Information Base)
– Allows devices to notify NMS about events
– SNMPv1, SNMPv2, SNMPv3 (work in progress)
• Backwards-compatible?
• Usage in troubleshooting
– Detection (Up Down and Statistical)
– Quite often in diagnosis
• Other usage
– Reporting, trending and statistics
• SNMP link utilization forms key component of traffic matrix estimation
– Capacity planning, evolution of network architecture
SNMP MIB
• A model of how information is stored in a device
– Collection of objects identified by object identifiers (OIDs)
• Information is organized hierarchically
– Hierarchies allow grouping of information by topics
• E.g., interface group stores information about interface state
– Hierarchies allow controlled extension of the model
• E.g., router vendors have defined their own MIBs
• Accessing the MIB
– NMS → Device read: get, getNext
– NMS → Device write: set
– Device → NMS notification: Traps
Example MIB
ROOT
  ccitt(0)
  iso(1)
    standard(0)
    reg-authority(1)
    member-body(2)
    ident-org(3)
      dod(6)
        internet(1)
          directory(1)
          mgmt(2)
            mib(1)
              system(1), interfaces(2), at(3), ip(4), icmp(5), tcp(6), udp(7), egp(8), transmission(10), snmp(11), RMON(16), RMON2(17)
          experimental(3)
          private(4)
            enterprises(1)
              cisco(9) (vendor-specific MIBs)
  joint(2)
OID for ICMP: 1.3.6.1.2.1.5
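As a small illustration of how an OID is read off the tree, the sketch below walks from a named node up to the root and joins the arc numbers. The dictionary covers only part of the tree shown on the slide, `org` is used as shorthand for the identified-organization arc, and the helper name `oid_of` is made up for this example.

```python
# child name -> (parent name, arc number); a partial copy of the slide's tree.
MIB_TREE = {
    "iso": (None, 1),
    "org": ("iso", 3),
    "dod": ("org", 6),
    "internet": ("dod", 1),
    "mgmt": ("internet", 2),
    "mib": ("mgmt", 1),
    "system": ("mib", 1),
    "interfaces": ("mib", 2),
    "ip": ("mib", 4),
    "icmp": ("mib", 5),
    "tcp": ("mib", 6),
    "udp": ("mib", 7),
    "private": ("internet", 4),
    "enterprises": ("private", 1),
    "cisco": ("enterprises", 9),
}


def oid_of(name):
    """Return the dotted OID of a node by following parent pointers to the root."""
    arcs = []
    while name is not None:
        parent, arc = MIB_TREE[name]
        arcs.append(str(arc))
        name = parent
    return ".".join(reversed(arcs))


ICMP_OID = oid_of("icmp")    # matches the slide: 1.3.6.1.2.1.5
CISCO_OID = oid_of("cisco")  # root of Cisco's vendor-specific subtree
```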
Limitations of SNMP
• Inadequate
– SNMP MIBs are inconsistently implemented (or not at all)
– SNMP MIBs cover only a small portion of critical information
on the health and behavior of the router
• Statistics hard-coded
– No local intelligence to: accumulate relevant information, alert
NMS to pre-specified conditions, etc.
• Highly aggregated traffic information
– Aggregate link statistics
– Cannot drill down
• Protocol: simple = dumb
– Cannot express complex queries over MIB information in
SNMPv1
• “Get all or nothing”
• More expressibility in SNMPv3
Box-level: Syslog
• What is it?
– Moral equivalent of #if (DEBUG) printf(…) in the router
code
– Vendors print plethora of information via syslog
– The syslog output can be collected at a remote server
• Usage in Troubleshooting
– Detection, localization, diagnosis
• Valuable source of information on what equipment is doing
• Limitation
– Syslog output is not standardized
• No consistency across vendors or different platforms of same vendor
– Makes it cumbersome to write portable tools that feed off
syslog
– Syslog is not reliable
• Loss of messages when router CPU is busy
Box-level: Telnet/CLI
• What is it?
– Telnet/ssh into routers and issue commands for
troubleshooting
• Ping, traceroute, show/debug,…
• Resetting sometimes fixes problems!
– “shutdown/no shutdown” can sometimes solve problems on linecard!
– Often used extensively in troubleshooting
• Usage in Troubleshooting
– Localization, diagnosis, fix, verify
• Limitations
– Doing things via CLI is playing with fire…
• Tight access control and authorization, considerable expertise required,
“ask yourself” training
• Also need to control how many people can simultaneously telnet in
– Places load on router CPU
Network-level: Active Measurements
[Figure: probes at PoP 1 and PoP 2 exchange edge to edge probes across the PEs; CEs attach to the PEs]
• Probe may be onboard the router (SAA) or separate server
• Utility
– Alarms are driven on estimates of application impact
– Routing design can be assessed and adjusted for efficiency
– The effect of equipment/facility failures can be assessed and mitigation put into
place
– Operations Methods are designed to minimize application impact
– The behavior of new applications (e.g. VoIP) can be estimated
– The risk for Service Level Agreements can be gauged
– Customers are given a view of the measurements to provide a view into backbone
performance
Active Monitoring Design
• Goal
– Schedule packet transmissions (Poisson, Periodic, …) so that
virtually every performance affecting event longer than a few
seconds will be detected
• Performance impacting events include
– Card changes on backbone routers that cause re-routes
(previously not considered customer-impacting)
– Small but persistent drops at interfaces
– Major congestion events
– Events that cause indirect harm via excessive jitter
• This way, the ability of the backbone to support real-time protocols can be tracked fairly accurately
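To make the scheduling choice concrete, here is a minimal sketch of the two probe schedules named on the slide. The function names, the probing rate and the seed are assumptions chosen for the example; they are not taken from the monitoring system described in the tutorial.

```python
import random


def poisson_schedule(rate_per_s, duration_s, seed=None):
    """Send times (seconds) of a Poisson probe stream over [0, duration_s)."""
    rng = random.Random(seed)
    times = []
    t = 0.0
    while True:
        # Exponentially distributed gaps give a Poisson arrival process.
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return times
        times.append(t)


def periodic_schedule(period_s, duration_s):
    """Send times (seconds) of a fixed-interval probe stream over [0, duration_s)."""
    count = int(duration_s / period_s)
    return [i * period_s for i in range(count)]


# Illustrative values: roughly one probe per second for one minute.
POISSON_TIMES = poisson_schedule(rate_per_s=1.0, duration_s=60.0, seed=1)
PERIODIC_TIMES = periodic_schedule(period_s=2.0, duration_s=10.0)
```

With an average gap of one second, an event lasting a few seconds is very likely to overlap at least one probe, which is the property the goal on this slide asks for.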
Views of the Information
• Public/Customer View
– Current Round Trip (RT) Loss and mean RT delay by city-pair
– Monthly averages for Loss and Delay Network-wide
• Global Operations View
– RT Loss
– RT Delay (95th percentile, min, mean)
– Inter-Packet Delay Variation (IPDV) or ‘jitter’
– Degraded seconds or minutes in test
• Operations View
– For analysis and investigation
– Numerous metrics and raw data available
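The metrics in the operations views can be computed from raw probe results roughly as follows. This is a sketch under stated assumptions: a lost probe is represented by `None`, the 95th percentile uses the nearest-rank definition, and IPDV is taken as the mean absolute difference between the delays of successive received probes. The production system may define these differently.

```python
import math


def summarize_probes(rtts_ms):
    """Summarize one test interval.

    rtts_ms: round-trip times in milliseconds, in send order; None marks a lost probe.
    Assumes at least one probe was received.
    """
    received = [r for r in rtts_ms if r is not None]
    ordered = sorted(received)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    # Delay variation between consecutive received probes.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "loss": 1.0 - len(received) / len(rtts_ms),
        "min_ms": ordered[0],
        "mean_ms": sum(received) / len(received),
        "p95_ms": p95,
        "ipdv_ms": sum(deltas) / len(deltas) if deltas else 0.0,
    }


# Made-up interval: five probes, the third one lost.
SUMMARY = summarize_probes([10.0, 12.0, None, 11.0, 30.0])
```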
Network-level: Control Plane Monitoring
• What are Route Monitors?
– Allow collection and analysis of routing messages
• E.g., OSPF → Link State Advertisements (LSAs), BGP → routing
updates
• Trouble-shooting usage:
– Detect: Real-time tracking of routing events
– Diagnosis: Post-mortem analysis of problems
• Other usage:
– Network maintenance
• Track and validate maintenance steps
– “What-if” Analysis
• Capacity planning, architectural changes, policy changes, risk analysis
– Understanding routing dynamics of commercial networks
• Convergence, stability, robustness
• Interaction of protocols
Route Monitors in Practice
• Research and academic
– Route-views and RIPE [route-views, ripe-ris]
• Public archives of BGP updates
• Have spawned numerous research papers on BGP
– OSPF Monitor from AT&T Labs [shaikh-nsdi04]
– IPMon project at Sprint Labs [spint-ipmon,pyrt]
• Commercial products:
– RouteExplorer by PacketDesign [packetdesign]
• OSPF, IS-IS, EIGRP, BGP
– RouteDynamics by IPSUM [ipsum]
• OSPF, IS-IS, BGP
Collecting Routing Data
• Challenge
– How to collect data passively
• BGP Monitor
– Use of public-domain routing software: Zebra/Quagga
– Passiveness achieved through configuration on routers
• Route filters that block any route updates from the monitors
• OSPF Monitor [shaikh-nsdi04]
– Various modes of connecting to the network
– Need one connection per area
– Passiveness achieved through careful implementation of the
collector
Correlation Across Data Sources
• What is it?
– Correlate multiple data sources
– Simplest is to align multiple time-series
• Trouble-shooting usage:
– Detection:
• Dramatic reduction in false positives, and in redundant alarms
• Discovery of new and unexpected failure modes (e.g., IP/Optical interactions)
– Localization:
• Correlation of active and passive monitoring helps to simultaneously provide
the severity and the locus of the problem
– Diagnosis: root cause analysis and fault localization
• Correlation enables automated drill-down
• Correlation capabilities are extremely powerful for post mortem analysis and
for identification of recurring failure modes flying, previously, under the radar
• Sample research work:
– BGP and SNMP (link utilization) for anomaly detection [roughan04]
– Risk modeling for fault localization [kompella05]
– OSPF and BGP correlation for root cause analysis [teixeira04]
Two Sources: SNMP and BGP
• SNMP
– Traffic volumes within a time interval
– Two detection algorithms
• Holt-Winters
• Decomposition-based algorithm
• BGP
– Fluctuations in number of routes per exit-point
– Use EWMA (Exponentially Weighted Moving
Average)
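A minimal sketch of EWMA-based detection on a per-exit-point route count is shown below. The smoothing factor, the threshold rule (deviation larger than a multiple of the smoothed absolute error), the warm-up length and the sample series are all assumptions for illustration; the detector in [roughan04] may differ in these details.

```python
def ewma_anomalies(series, alpha=0.2, threshold=3.0, warmup=5):
    """Return indices of samples that deviate sharply from the EWMA forecast.

    A sample is flagged when its error against the current EWMA exceeds
    `threshold` times the EWMA of past absolute errors. Flagged samples are
    not folded into the baseline, so one spike does not mask the next.
    """
    mean = series[0]
    deviation = 0.0
    flagged = []
    for i, value in enumerate(series[1:], start=1):
        error = abs(value - mean)
        if i >= warmup and deviation > 0 and error > threshold * deviation:
            flagged.append(i)
            continue
        mean = alpha * value + (1 - alpha) * mean
        deviation = alpha * error + (1 - alpha) * deviation
    return flagged


# Made-up route counts for one exit point; the jump to 160 is the anomaly.
ROUTE_COUNTS = [100, 101, 99, 100, 102, 100, 101, 160, 100, 99]
ANOMALIES = ewma_anomalies(ROUTE_COUNTS)
```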
Example 1
Anomaly that triggers an alarm –
major network peer failure
Anomaly that does not
trigger an alarm: monitor
session resets
Example 2
No alarm:
monitor data loss
Alarm – again,
peering related
Network Maintenance
• A very large problem
– Under-explored. Research opportunities
• Why?
– Continuous drivers for software update (routers, linecards, processors)
• Bugs, vulnerabilities, upgrades, enhancements
– new features, new knobs to turn, new protocols, new services
– Continuous drivers for hardware updates (routers, linecards, processors)
• Failures, upgrades (higher speeds, new technologies), …
• Workflow
– Plan, Notify Customers, Prepare Network, Perform, Verify
• Very large opportunities for automation of workflow execution
• Systems
– Decision support
• Analysis of network and customer impact for each network update
• Optimizing, scheduling systems and workforce
– Execution
• Methodology and tools for minimizing impact during update execution
Example: Router OS Upgrade
• Plan
– On site work force available? Customer notification required?
Piggyback Opportunities? Architectural Exceptions? Special
customer exceptions? …
– Resolve conflicts with other activities
– Risk/impact analysis on network and customers
• Notify customers if needed
– Leveraging the customer database
Decision Support
• Prepare the network
– Move traffic around by reconfiguring IGP (and BGP)
– Take out of production the router under maintenance
• E.g., move traffic off links incident on the router
• Perform the update
– Checkpoint state
– Minimize hit on the network, and time to upgrade
Example: Router OS Upgrade (Continued)
• Verify
– In final steps of execution, perform series of checks
• Examples: diff with checkpoint, check OS version after
router reboot
– Rollback network to previous state
• Revert IGP and BGP (e.g., move traffic back on links)
– Check performance and fault monitoring
• Router is in production
• No adverse impact on network
• No adverse impact on customers
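The "diff with checkpoint" check can be sketched with the standard `difflib` module. The state lines below are invented placeholders; what exactly gets checkpointed (neighbor counts, session counts, configuration) is an assumption of this example.

```python
import difflib


def state_diff(checkpoint_lines, current_lines):
    """Return only the added/removed lines between checkpoint and current state."""
    diff = difflib.unified_diff(
        checkpoint_lines,
        current_lines,
        fromfile="checkpoint",
        tofile="current",
        lineterm="",
    )
    return [
        line
        for line in diff
        # Keep change lines, skip the "---"/"+++" file headers.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Hypothetical state captured before and after the upgrade.
BEFORE = ["ospf neighbors: 4", "bgp sessions: 12"]
AFTER = ["ospf neighbors: 4", "bgp sessions: 11"]
CHANGES = state_diff(BEFORE, AFTER)
```

An empty result means the router came back in the state it was checkpointed in; here one BGP session is missing and would need follow-up before the router is declared back in production.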
Decision Support
• Goal: A robust network configuration
– Good performance, even during failures and planned changes
– Limit impact of network update
• Maintenance: assess impact of planned outages
– Assessment of impact from maintenance on routers or underlying
technologies (fibers, transponders, optical amplifiers, …)
– What if tools
• Compute flexible set of potential routing metric changes to minimize impact
• Key ingredients
– Data, models, and process – IP and cross-layer (optical, service)
• Importance and difficulty of data flow and data integrity
• Wide field of use beyond maintenance
– Risk, survivability and vulnerability analysis, network and service
evolution, capacity planning
Decision Support Needs
• Risk modeling
– Transport level SRLG data: Shared Risk Link Groups
• e.g., all IP links whose integrity depends on a common fiber conduit
belong to an SRLG associated with that conduit
– IP Level: routers, interfaces
• Traffic modeling
– Traffic matrix: where the traffic is coming from and going to
– Hard problem in IP networks!
• Topology and Routing Analysis
– Via configuration management and route monitoring systems
– Route simulation
• Algorithms and analysis
– Impact analysis, optimization plans to minimize impact, …
Risk Modelling
• Risk management: tradeoff likelihood of failure, impact and economics
– Links (lasers), Fiber spans (SRLG), fibers (e.g., optical amplifiers), routers
• Impact analyzed through Risk Assessment Tool
– Probabilities model; drives requirements
• Integrity of a simple IP link depends on a complex set of transport facilities
[Figure: the same four nodes (LA, NY, SF, Washington) drawn at the IP (logical) layer and at the physical (fiber) layer; distinct IP links share a common SRLG at the fiber layer]
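A small sketch of how SRLG data feeds impact analysis: given which IP links ride on which conduit, remove all links of a failed SRLG and test whether the IP topology stays connected. The link list and the SRLG assignments below are invented for illustration and do not describe the figure's actual topology.

```python
def surviving_links(ip_links, srlg_members, failed_srlg):
    """IP links still up after every link in the failed SRLG goes down."""
    down = set(srlg_members.get(failed_srlg, ()))
    return [link for link in ip_links if link not in down]


def is_connected(nodes, links):
    """True if every node can reach every other node over the given links."""
    neighbors = {n: set() for n in nodes}
    for u, v in links:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, stack = set(), [nodes[0]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(neighbors[node] - seen)
    return len(seen) == len(nodes)


# Hypothetical data: both of SF's IP links share one fiber conduit.
NODES = ["LA", "NY", "SF", "Washington"]
IP_LINKS = [("SF", "LA"), ("SF", "NY"), ("LA", "Washington"), ("NY", "Washington")]
SRLGS = {
    "conduit-1": [("SF", "LA"), ("SF", "NY")],
    "conduit-2": [("NY", "Washington")],
}
SF_ISOLATED = not is_connected(NODES, surviving_links(IP_LINKS, SRLGS, "conduit-1"))
STILL_OK = is_connected(NODES, surviving_links(IP_LINKS, SRLGS, "conduit-2"))
```

At the IP layer SF looks dual-homed, yet a single conduit cut isolates it, which is the point of the slide's last bullet.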
Traffic Matrices: Big Picture
• Router Level Demand Matrices
– Granularity: router or router interface
– Killer App: Network Maintenance
– Innovation: Tomo-gravity
Focus Here
• Flow Level Demand Matrices
– Granularity: TCP/IP headers
– Killer App: Traffic Analysis with Drill-down
– Innovation: Priority Sampling
• Path Matrices
– Granularity: TCP/IP headers
– Killer App: Passive Performance Measurement
– Innovation: Trajectory Sampling
• Still working its way through standards and implementations
Requirements
• Use only data that is widely available, is built into the
network elements, and is easy to collect on any interface
on any router in a timely fashion
• Simple, statistically sound, scalable algorithms
– Frameworks that cover range of approaches, and help to
explain how and why the approaches work
– Robust to the harsh realities of the operational environment
• Graceful degradation given data loss, corruption
• What this means for traffic matrix estimation
– Use link loads: SNMP MIB-II
• Ubiquitously available, robust
– Cope gracefully with missing, late, corrupted or otherwise
flawed data
Network Tomography
Have link traffic measurements
Want to know demands from source to destination
[Figure: network with nodes A, B, C]
$$TM = \begin{pmatrix} \cdot & x_{A,B} & x_{A,C} \\ \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \end{pmatrix}$$
Problem: b=Ax
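$$b_1 = x_2 + x_3$$
Only measure at links
[Figure: routers 1, 2, 3 joined by links 1, 2, 3; routes 1, 2, 3 each traverse two of the links]
$$\begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$$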
Problem: Estimate traffic matrix (x’s) from the link measurements (b’s)
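The toy system on this slide can be worked through in a few lines. The sketch below computes link loads from demands and, because this particular 3x3 routing matrix happens to be invertible, recovers the demands in closed form. The demand values are made up. In an operational backbone there are far more demands than links, so no such inversion is available.

```python
# Routing matrix from the slide: A[link][route] = 1 if the route uses the link.
A = [
    [0, 1, 1],  # link 1 carries routes 2 and 3
    [1, 0, 1],  # link 2 carries routes 1 and 3
    [1, 1, 0],  # link 3 carries routes 1 and 2
]


def link_loads(x):
    """Forward problem: b = A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def demands_from_loads(b):
    """Inverse for this toy case only.

    Each demand appears on exactly two links, so sum(b) = 2 * sum(x),
    and b_i = sum(x) - x_i gives x_i = sum(b) / 2 - b_i.
    """
    total = sum(b) / 2
    return [total - bi for bi in b]


X_TRUE = [3, 5, 7]             # made-up demands
B = link_loads(X_TRUE)         # what SNMP would report per link
X_EST = demands_from_loads(B)  # recovered demands
```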
Approach: Direct SVD Solution of b=Ax
The problem is massively under-constrained
A successful approach: Tomo-gravity
• Tomo-gravity = tomo-graphy + gravity modeling
• Reduce problem size
– Exploit topological equivalence
• Find a solution x, which
– satisfies the constraints, and is closest to the generalized
gravity model solution (g)
– minimizes $\|x - g\|$
[Figure: the tomo-gravity solution (x) is the point of the constraint subspace (b=Ax, from link measurements) closest to the generalized gravity solution (g)]
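For orientation, the sketch below computes the simple gravity model, in which the demand from i to j is proportional to the traffic entering at i times the traffic leaving at j. It is not the generalized gravity model used by tomo-gravity, and it omits the tomography step that moves the estimate onto the constraint subspace b=Ax. The node names and volumes are made up.

```python
def gravity_estimate(traffic_in, traffic_out):
    """Simple gravity model.

    traffic_in[i]:  total traffic entering the network at node i
    traffic_out[j]: total traffic leaving the network at node j
    Returns {(i, j): estimated demand}, with demand(i, j) = in[i] * out[j] / total.
    """
    total = sum(traffic_out.values())
    return {
        (src, dst): traffic_in[src] * traffic_out[dst] / total
        for src in traffic_in
        for dst in traffic_out
    }


# Made-up edge totals; inbound and outbound totals agree (100 units each).
G = gravity_estimate({"A": 60, "B": 40}, {"A": 50, "B": 50})
```

By construction the estimates for each source sum to that source's inbound total, which is the independence assumption between source and destination.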
Foundation in Information Theory
• Minimize Mutual Information I(S,D)
– Information gained about source (S) from destination (D)
– Assume no information beyond the link load constraints
b=Ax
• Framework for tomo-gravity
– Gravity model = independence (between S and D)
– Generalized gravity model = conditional independence
• Explains tomo-gravity’s success with
$$\|x - g\|^2 = \sum_p \left( \frac{x[p] - g[p]}{\sqrt{g[p]}} \right)^2$$
– since this is the first-order approx. to Kullback-Leibler
divergence from independence for I(S,D)
$$K(x \,\|\, g) = \sum_p x[p] \log \frac{x[p]}{g[p]} \approx \sum_p x[p] \left( \frac{x[p]}{g[p]} - 1 \right)$$
$$\sum_p x[p] \left( \frac{x[p]}{g[p]} - 1 \right) - \sum_p \left( x[p] - g[p] \right) = \sum_p \left( \frac{x[p] - g[p]}{\sqrt{g[p]}} \right)^2$$
(the subtracted term $\sum_p (x[p] - g[p])$ is zero when x and g carry the same total traffic)
There will be a test at the end of the tutorial ;-)
Tomo-gravity Works
• Best of tomography and gravity modeling (solid foundation in information theory)
Killer App: Network Maintenance
• Simple, and quick: A few seconds for large IP backbone
• Accurate: average ~11% error
– Including netflow now significantly improves this! Errors become a few percent.
• Uses widely available SNMP data
– Highly robust → Can work within the limitations of SNMP data
– Only uses first order statistics → Interpolation very effective
• Limited scope for improvement
– Can easily incorporate additional
constraints
Executing the Plan
• Prepare Network, Perform, Verify
– Back to the Router Upgrade Example – Via IGP (OSPF, IS-IS) metric
changes
• Cost-out: assign high weight to link(s) so that traffic is drained
out before bringing the link down
Cost-out a link → Bring the link down → Perform maintenance/upgrade → Bring the link up → Cost-in the link
• Cost-out/cost-in does not mean zero impact on traffic
– Possibility of loops
– However, traffic is handled more gracefully
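The effect of a cost-out on routing can be sketched with a plain shortest-path computation: raising one link's weight makes the IGP prefer a path around it, after which the link can be taken down without carrying traffic. The topology and weights are made up, and 65535 (the largest OSPF interface cost) is used as the high weight by assumption.

```python
import heapq


def shortest_path(weights, src, dst):
    """Dijkstra over an undirected graph given as {(u, v): weight}."""
    adjacency = {}
    for (u, v), w in weights.items():
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))
    heap = [(0, src, [src])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for nxt, w in adjacency.get(node, []):
            if nxt not in done:
                heapq.heappush(heap, (cost + w, nxt, path + [nxt]))
    return None


def cost_out(weights, link, high_weight=65535):
    """Return a copy of the weights with one link made very unattractive."""
    costed = dict(weights)
    costed[link] = high_weight
    return costed


# Made-up triangle: A-B-C is normally cheaper than the direct A-C link.
WEIGHTS = {("A", "B"): 10, ("B", "C"): 10, ("A", "C"): 50}
BEFORE = shortest_path(WEIGHTS, "A", "C")
AFTER = shortest_path(cost_out(WEIGHTS, ("A", "B")), "A", "C")
```

Traffic from A to C moves from the path via B to the direct link once A-B is costed out. The sketch shows only the converged state; the transient loops mentioned above arise while routers update at different times.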
Router Cost-out Options
• Option 1: Cost-out all outgoing links of a router
– Based on IETF RFC 3137 [rfc3137]
– Configuration changes only at the router in question
• Cisco ‘max-metric router-LSA’ command allows one to
perform entire cost-out in one atomic operation
• Option 2: Cost-out all incoming links of a router
– Have to cost-out links at the neighboring routers
• Which option is better?
– Option 1 is operationally easier than option 2
– Impact on traffic: not clear
Hitless Upgrades?
• Make hardware/software upgrades completely nonintrusive
– No impact on routing and forwarding performance of routers
– Other than the router being upgraded, no impact on customer
performance and traffic
• Other operational uses:
– Router internals for continuous operation during upgrade also
provide increased reliability and availability during failures
• How?
– Component redundancy
– Component Plug-n-play
– Protocol extensions
• Nirvana
– Active research area
Component Redundancy
• Backup route processor
– Duplicate state at the backup route processor
– Seamlessly transfer control to backup processor
• E.g., Avici’s NSR (non-stop routing) [avici-nsr]
• Bundle multiple physical links into a single IP
layer link
– Issues:
• Ensure packets from a single flow are delivered in-order
• Fast failover required
• Failure of some links can overload link
– Bandwidth thresholds to bring a link down
– Avici’s Composite links [avici-composite-link]
Protocol Extensions for Hitless Restart
• Extend routing protocols so that a router is used for
forwarding even if routing process is inactive
• Issues:
– Need support from multiple routers
– What to do upon topology changes to avoid black-holes and
loops?
• Example: Two proposals for OSPF
– Graceful restart [rfc3623]
• Support from neighbor routers required
• Abandon hitless restart upon topology change
– I’ll Be Back (IBB) [shaikh-infocom02]
• Support from entire OSPF domain required
• Abandon IBB only if loops and/or black-holes can actually form and
only for affected destinations
• Cisco’s NSF with SSO [cisco-bgp-nsf]
Network Security
• Intelligence is key
– If you don’t understand it how can you secure it?
– If you don’t understand it how can you tell what’s different?
• Network Security and (normal) Network Management are two sides
of the same coin
– Information Needs
• Topology, Traffic, Routing, Configuration, Service – Customer associations
– Example: same network-wide data sources that feed traffic engineering, feed online
threat analysis – e.g., netflow
– Yet, for security, perhaps more so than normal NM tasks, the details really
matter
– DoS is Denial of Service – not necessarily Distributed Denial of Service
Attack (DDoS Attack)
• A (difficult) task for network care is to determine whether an anomaly arises
from “natural causes” or from DDoS attacks
• Example: SYN floods caused by web server crash (HTTP and user retries) or a
router crash (BGP retries) vs. SYN floods caused by an attack
• Example: Spikes and swings in traffic with root causes in the optical layer –
traffic not being monitored by DoS sensors suddenly becomes monitored
• ...
End System Trends (Enterprise and Home)
• Explosion of security risks in the end systems
– PDAs generate and hold a ton of private information
• Example: Paris Hilton’s sidekick PDA
– Appealing applications open new doors for exploits
• Email (W32/Mytob …), instant messaging, …
– Urgent! Click. Try this URL? Click. Install this? Click. You sure? Click.
» Malware installed
– Solutions
• Ways to cope: vulnerability testing, user training, desktop configuration
management
• Microsoft Tuesdays: teams of specialists who analyze monthly advisories from
major software houses on newly discovered vulnerabilities, and on cost/benefit
analyses on deployment
• Explosion of software and devices running software
– Adding a lot of new code and new vulnerabilities
• Bad guys never had it so good
– More complex end system firewalls and rules may not be the solution
• VPNs, fancy group management, network definitions, bandwidth controls, …
• Witty worm: clever one packet worm that successfully exploited a firewall
manufacturer’s product line, exploiting ports the firewall meant to block
Complexity!
Enterprise Trends: Outsourcing
• Outsource the wide area network
– MPLS VPNs run by a network service provider
• Outsource the servers
– Network firewall, hosted email, e-business, VoIP infrastructure, web applications run
by a network service provider
• Why?
– Complexity: Complicated to secure and expensive to manage
• Advisories, patches, best practices, churn
• Routers increasingly complex: distributed intelligence across line cards, route processors
– VPN technologies
• Greasing the skids for server outsourcing
• Lowering the expense for backhauling to data centers to reach outsourced servers
• Enterprises cope by concentrating and centralizing solutions and expertise
– Providers have a multiplexing advantage
• Amortization of knowledge: more data, more confirmation of attacks or problems, more
information shared across customers
• More efficient engineering – less over-engineering with pooled resources
• Consequence: Security an active area for network service providers and
networking research
– Securing the core, the data centers, the networked applications, and the customers
Attack Traffic Trends
• Decline of worms and viruses that jam the network
– On the front page in the early 2000’s, the carpet bombers that jam networks
-- Slammer, Sapphire, Code Red, …
• Relatively small, stable residual traffic persists from these
– Yet, the potential is still there for another carpet bombing worm
– Greatest potential for research on worm mitigation is for the enterprise
• Throttling at or near the source
• Not every enterprise hit by the Slammer
• Rise of the targeted, purposeful attacks that jam or compromise
more focused targets
– Hackers against hackers
• Poor Man’s Internet gaming
– Attacks against specific applications, services, customers
• Wide set of popular toolkits/attacks available: smurf, fraggle, TCP SYN flood,
connection killing, distributed reflection
– Identity theft
• Example: phishing attacks
– Malware installation
• Bots bought and sold to spammers, and other bad guys
Network Security Activities: Prevention
• Prevention has a bigger bang for the buck
– Some Enterprises may think of detection and mitigation as too little too late
• Perimeter defense
– Routers: ACLs, blackhole routes
– Servers: firewalls
• Core cloaking
– MPLS (one IP hop) core stops attackers from knowing internal topology and
routing
• Access control for network elements
– Logins/passwords via centralized authentication servers
– Controls on which users/systems can execute which commands
– Audit trails
• Vulnerability Analysis
– Testing: how routers, switches, servers hold up to a range of emulated attacks in the
test lab
– Simulation: identification of weaknesses and better mitigation policies via network-level simulations
• Network’s weakest link is the network’s strongest component
• Rigorous and periodic security audits for all network and service elements
– Routers, switches, servers, …
Prevention: Perimeter Control
• Forwarding mechanisms
– Unicast reverse path forwarding controls for spoofed source IP addresses
• Drop at edge router if source IP is not routable
– Blackhole routes for specific destination IP addresses
• Static routes whose BGP next hop is not routable
• Used to drop packets directed to infrastructure, and to drop attacks on customer
routes
– Somewhat coarse logging to support attack forensics
• Filtering mechanisms
– Data: Access Control Lists (ACLs)
• More precise blocking (src/dest IPs, ports) and rate limiting of packet streams
• Provider network ACLs: relatively simple, instantiated at edge interfaces
• Enterprise network ACLs: relatively complex, instantiated across the network
• Somewhat more precise logging to support attack forensics
• Downside: intensive processing and memory requirements often preclude wide use
– Control: routing import and export policies
• CE import policies: control plane counterpart of ACLs – route scoping
• CE export policies: route scoping and good citizenship – limiting route
propagation to specific groups, not propagating any instabilities
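The forwarding mechanisms above can be sketched in a few lines. This is a minimal illustration of a loose-mode reverse path check plus blackhole routes, not any router's implementation; the routing table, prefixes, and interface names are hypothetical.

```python
import ipaddress

# Hypothetical routing table: prefix -> next hop.
# The special next hop "discard" marks a blackhole route.
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): "ge-0/0/1",
    ipaddress.ip_network("192.0.2.0/24"): "ge-0/0/2",
    ipaddress.ip_network("192.0.2.66/32"): "discard",
}

def lookup(addr):
    """Longest-prefix match; returns the next hop, or None if unroutable."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in ROUTES if ip in net]
    if not matches:
        return None
    return ROUTES[max(matches, key=lambda net: net.prefixlen)]

def forward(src, dst):
    """Decide what an edge router would do with a packet from src to dst."""
    # Loose reverse path check: drop if the source address has no route at all.
    if lookup(src) is None:
        return "drop: source not routable (uRPF)"
    hop = lookup(dst)
    if hop is None:
        return "drop: no route"
    if hop == "discard":
        return "drop: blackholed destination"
    return f"forward via {hop}"
```

For example, `forward("203.0.113.5", "10.1.1.1")` is dropped by the reverse path check, while `forward("10.1.1.1", "192.0.2.66")` hits the more specific blackhole route.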
Prevention: Testing
• Complex feature interactions in routers have the
potential of amplifying small DoS attacks
[Figure: router CPU load versus time (diurnal traffic loading pattern). Annotations: “CPU correlated with link load!” and “QoS config that removes process switching”.]
• Subtle sequence of QoS configuration commands can cause
packets to be process switched (by CPU, whose cycles are needed
for OSPF, BGP, …) rather than line card switched
• Consequences
– CPU and traffic correlated
– Small DoS attack (e.g., on a T1 interface) can bring down a large CE
Network Security Activities: Detection/Forensics
• Network Providers strategically positioned to fight DDoS
• Traffic Analysis
– Detection – early sensing of possible attack
• Today, catching the high volume attacks for the most part
– Forensics – sustained analysis and trace-back
• Challenges and balancing acts in creating and maintaining relatively raw data
– Traffic analysis challenge: massive traffic volumes across the network edge
• Flow-based monitoring: scalable, comprehensive
• Packet-header monitoring: deep analysis on important interfaces (or interfaces under attack)
• Dark space monitoring: helpful when source IP spoofing is occurring
• Arbor, Riverhead (Cisco), Cloudshield (underlay), Snort …
– Owing to the scale of the network and the traffic, all of the above is research
• Routing analysis
– Monitoring diversion of routes and traffic from intended destinations
– In the middle of the Internet, a BGP speaker lies about routes
• Detection: ISPs/enterprises set up “customer” connectivity to other ISPs to monitor the
advertisement of their private address spaces (AT&T Peermon)
– Active research domain
• Again, there is a multiplexing advantage
– Seeing a large fraction of the Internet helps
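As a rough illustration of flow-based detection of high-volume attacks, the sketch below sums bytes per destination over one measurement interval and flags destinations above a threshold. The flow record layout `(src, dst, bytes)` and the threshold are assumptions made for the example; production systems work on sampled NetFlow-style records and use far more careful baselining.

```python
from collections import defaultdict

def high_volume_destinations(flows, threshold_bytes):
    """Return {destination: total_bytes} for destinations above the threshold.

    flows: iterable of (src_ip, dst_ip, byte_count) tuples (hypothetical layout).
    """
    totals = defaultdict(int)
    for _src, dst, nbytes in flows:
        totals[dst] += nbytes
    return {dst: total for dst, total in totals.items() if total > threshold_bytes}

# Hypothetical flow records for one interval.
sample_flows = [
    ("198.51.100.1", "192.0.2.10", 900_000),
    ("198.51.100.2", "192.0.2.10", 800_000),
    ("198.51.100.3", "192.0.2.20", 50_000),
]
suspects = high_volume_destinations(sample_flows, threshold_bytes=1_000_000)
```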
Network Security Activities: Mitigation
• Blackhole routes and ACLs
– Stops the bad traffic and any collateral damage to the victim, not
necessarily the DoS
– Buys time for forensics, other mitigation
• Scrubbing
– Diversion to scrubbing farm, which attempts to drop/analyze the attack
traffic and send remaining on to the destination
– Makes most sense in a network, as a shared resource
• Scrubbers involve expensive deep packet inspection of traffic diverted to the
scrubbing farm via routing and tunneling
• Challenges
– Whether or not to mitigate
• Size and duration of the attack, damage (including collateral), customer
– How long to mitigate
• Fixed time interval (one week?), until the attack disappears?
– How to automate
• High cost of mitigating false positives
• Adaptive defenses, closed loop incorporating false positives and associated
costs [Duffield]
VoIP Case Study
Outline for VoIP
• Fundamentals
• Commercial VoIP service models
• VoIP network management
Voice is Big (~ as Big as Data)
• U.S. long distance (rough round numbers)
– 4.5 Petabytes/day (petabyte = 10^15 bytes)
• ~1 billion calls/day
• ~3 minutes/call
• ~2 x 100 kbps for encoding two 64 kbps streams per call
– Flash crowds (e.g., American Idol voting)
• Tens of millions of calls in 10 minutes (first few minutes of
voting) to a handful of phone numbers
• By comparison, a very large, tier 1 ISP carries ~
2-3 petabytes/day
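The 4.5 petabytes/day figure follows directly from the round numbers on this slide; the arithmetic can be checked as follows (variable names are ours).

```python
CALLS_PER_DAY = 1_000_000_000      # ~1 billion calls/day
SECONDS_PER_CALL = 3 * 60          # ~3 minutes/call
BITS_PER_SECOND = 2 * 100_000      # ~2 x 100 kbps per call (both directions)

bytes_per_day = CALLS_PER_DAY * SECONDS_PER_CALL * BITS_PER_SECOND // 8
petabytes_per_day = bytes_per_day / 10**15   # petabyte = 10^15 bytes
```

Here `petabytes_per_day` evaluates to 4.5.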
PSTN is Reliable, and Society Banks on that!
• That is, voice on the PSTN (Public Switched
Telephone Network) is amazingly reliable
– Five nines (99.999% availability) engineering
– In the U.S., outages are reported to the FCC
• FCC = Federal Communications Commission
• Voice supports critical services
– 911 and GETS
• GETS = Government Emergency Telecommunications Service
– Out-of-band network configuration management for
IP networks
• Dial into the router, rather than telnet in
– Lest you saw off the limb you are standing on 
VoIP Signaling
[Figure: SIP signaling within the VoIP infrastructure. Alice's and Bob's VoIP phones register with the Registrar Service (“Register Alice”, “Register Bob”). For “Call Bob”, a Proxy Server resolves Bob's location through the Location Server and sends the call to the Proxy Server in Bob's domain.]
• Signaling
– Call setup, session management, negotiation of session parameters, dealing
with advanced features
– Competing protocols and standards
• SIP [rfc3261], H.323 (ITU-T), MGCP [rfc3435], ...
VoIP Transmission: Voice Samples/UDP
[Figure: media path between Alice's and Bob's VoIP phones across the IP cloud: coder (A/D converter) at the sender; de-jitter buffer and decoder (D/A converter) at the receiver. Media packet layout: Eth | IPSec? | IP | UDP | RTP | voice sample.]
• Voice samples transmission: IP packets (RTP [rfc3250] over UDP)
• Coder + Decoder = Codec
– Perform analog-to-digital and digital-to-analog conversion
– Vary in sound quality, bandwidth requirement, computational requirement…
– Each phone, gateway, and service supports several different codecs
– Example codecs: ITU G.711 (64 kbps), ITU G.729 (8 kbps)
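The codec rates above exclude packet headers. A small sketch of the IP-level rate per direction, assuming 20 ms of voice per packet and the usual 20-byte IPv4, 8-byte UDP, and 12-byte RTP headers (the packetization interval is our assumption; link-layer and IPSec overhead are not counted):

```python
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP, no options or extensions

def ip_bandwidth_kbps(codec_kbps, packet_ms=20):
    """IP-level bandwidth of one voice stream, in kbps."""
    payload_bytes = codec_kbps * packet_ms / 8   # kbps * ms = bits per packet
    packets_per_second = 1000 / packet_ms
    bits_per_packet = (payload_bytes + IP_UDP_RTP_HEADER_BYTES) * 8
    return bits_per_packet * packets_per_second / 1000

g711 = ip_bandwidth_kbps(64)   # G.711: 160-byte payload per packet
g729 = ip_bandwidth_kbps(8)    # G.729: 20-byte payload per packet
```

Under these assumptions G.711 needs 80 kbps and G.729 needs 24 kbps per direction, which shows why header overhead matters most for low-rate codecs.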
Consumer VoIP Service Models
• BYOA (Bring Your Own Access) model
– “Overlay,” “Third Party,” …
– To date: this model works extremely well
• Today’s DSL vs. Cable wars help to explain why
– QoS in the end systems (telephony adapters), with no access to
QoS in access or core networks; accordingly no transport SLA
– Commercial offers: AT&T CallVantage, Vonage, 8x8, Skype
• BSP (Broadband Service Provider – Comcast, SBC, …) model
– End-to-end QoS for on-net and on-net ↔ PSTN flows, with potential for
transport SLAs
• BSP capabilities to tag and differentiate their own service offers (e.g.,
voice, video, web) from third party services
– DOCSIS, PacketCable, …
• BSP capabilities to potentially integrate modem, telephony adapter, router,
firewall and more in one residential gateway box per home
Consumer VoIP Service Models in Pictures
[Figure, BYOA model: home (phone, TA, PC, cable/DSL modem), cable/DSL provider, Internet, separate VoIP provider, PSTN.]
[Figure, BSP model: home (phone, TA, PC, cable/DSL modem), combined cable/DSL/VoIP provider, Internet, PSTN.]
TA = Telephony Adapter
Business VoIP Service Model
[Figure: Enterprise site 1 and Enterprise site 2 connected through a WAN/VPN service provider, with a connection to the PSTN.]
• Service models parallel to PSTN counterparts
– DIY (Do it Yourself): Enterprise maintains its own PBX
• PBX = Private Branch eXchange
– Outsourced PBX : Use of IP-centrex [ip-centrex]
• Equipment
– Enterprise side
• VoIP phones, (potentially) PBX, VoIP infrastructure
• (potential) capability of using PSTN for fail-over
– Service Provider side
• VPNs with QoS capabilities, VoIP infrastructure, (potentially) IP-centrex
• End-to-end QoS expected and possible!
VoIP Network Management Challenges
• Data Plane
– End system management for Consumer (BYOA)
VoIP
– QoS management for Business VoIP
• Control Plane
– VoIP server infrastructure monitoring
– VoIP security issues
VoIP End System Management
• Consumer expectations: surf (at roughly the same speed
as before adding VoIP) and talk simultaneously
– Data: set MSS and QoS parameters
– Voice: use appropriate codec with right set of parameters
• Biggest issue: upstream (from home to Internet)
bandwidth
– Too small ⇒ VoIP is infeasible!
• Natural, longer term solution: estimate available
bandwidth and dynamically change codec in TAs
– Most current generation TAs cannot do either
– Get it wrong, and web speed may degrade – a potential
dissatisfier
Estimating Upstream Bandwidth of Customer
[Figure: a measurement source sends ICMP ECHO packets to the customer; the customer returns ICMP ECHOREPLY packets alongside its upstream background traffic.]
• Assumption: provider does not have direct access to
customer
• Measurement source sends ICMP ECHO packets to a
customer node
• Estimate the customer’s upstream bandwidth by
measuring the arriving rate of ICMP ECHOREPLY
packets from customer nodes
Bandwidth Estimation: A Bit More Detail
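[Figure: measurement path to the customer, labeled T_echo and B_downstream in the downstream direction and T_echoreply and B_upstream in the upstream direction.]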
• Let S_echo and S_echoreply be the packet sizes of the ICMP ECHO
and ECHOREPLY packets
B_estimated = N * S_echoreply / T_echoreply ≈ B_upstream
• Assumptions:
– The upstream link of the customer is the bottleneck of the
roundtrip path, i.e., T_echo < T_echoreply; otherwise, B_estimated ≈ B_downstream
• Most broadband clients satisfy this requirement
– The customer node replies to (large-size) ICMP ECHO
packets
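The estimate on this slide is a single division. A minimal sketch, assuming N is the number of ECHOREPLY packets received and the reply time is the interval over which they arrive (the slide does not define N explicitly, so this reading is our assumption); sizes are taken in bytes and the result is reported in bits per second:

```python
def estimate_upstream_bps(n_replies, reply_size_bytes, t_echoreply_s):
    """Estimated upstream bandwidth: N * S_echoreply / T_echoreply, in bits/s."""
    if t_echoreply_s <= 0:
        raise ValueError("reply interval must be positive")
    return n_replies * reply_size_bytes * 8 / t_echoreply_s

# Hypothetical measurement: 50 replies of 1500 bytes arriving over 1.5 seconds.
estimate = estimate_upstream_bps(50, 1500, 1.5)
```

With these made-up numbers the estimate is 400,000 bits/s, i.e. roughly a 400 kbps upstream link.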
Bandwidth Estimation: Potential Deal breakers
• Downstream congestion could make the
downstream path the bottleneck
• ICMP packet generation delay at the customer
node can increase T_echoreply
• Strictly speaking, B_estimated is a value between real
available bandwidth and upstream capacity
• BSP may block ICMP…
QoS Management for Business VoIP
• QoS is possible because of end-to-end control
• Approaches for classifying and marking VoIP traffic at
enterprise sites
– Approach 1: mark traffic in the VoIP phones themselves
– Approach 2: use separate VLAN (Virtual LAN) for VoIP
traffic, and mark traffic coming over this VLAN
– Approach 3: use an agent that looks for RTP traffic and marks
the packets
• Provide QoS to marked traffic inside service provider
network
– Routers provide service differentiation for different classes of
traffic
• Example: Cisco’s MPLS class of service [cisco-mpls-cos] uses WRED
and WFQ for service differentiation
– WRED (Weighted RED): for controlling packet loss probability
– WFQ (Weighted Fair Queuing): for controlling delay and bandwidth
– Setting these parameters is often challenging!
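To make the role of WRED concrete, here is a sketch of the textbook RED drop curve with one threshold profile per traffic class. It is not Cisco's implementation; the class names and threshold values are hypothetical, and choosing them well is exactly the challenge the slide mentions.

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """RED drop curve: 0 below min_th, linear up to max_p, then drop everything."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)

# Hypothetical per-class profiles: (min_th, max_th, max_p), thresholds in packets.
WRED_PROFILES = {
    "voice": (40, 60, 0.02),        # dropped late and rarely
    "best_effort": (20, 40, 0.10),  # dropped earlier and more aggressively
}

def wred_drop_probability(traffic_class, avg_queue):
    """Weighted RED: apply the RED curve with the class-specific profile."""
    min_th, max_th, max_p = WRED_PROFILES[traffic_class]
    return red_drop_probability(avg_queue, min_th, max_th, max_p)
```

At an average queue of 30 packets, best-effort traffic already sees a 5% drop probability in this example while voice sees none.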
SLA Offerings for Business VoIP
• SLA offerings possible because of end-to-end control
and QoS
– Subject to Acceptable Usage Policy (AUP)
• Example: all bets are off if bandwidth usage is more than X%
• SLAs are offered in terms of voice quality
• Determining voice quality
– Voice stands out as one network application where huge
investment has been sunk into quality evaluation
• Voice is tricky, and voice is important!
– Psycho-acoustic measures
• MOS (Mean Opinion Score)
– 5 (excellent), 4 (good), 3 (fair), 2 (poor), 1 (bad)
– Via panels of people listening to voice samples
– Important to relate quality to measurable impairments on the
path from mouth to ear
• ITU’s E-Model [e-model]
ITU’s E-Model
A tool for voice transmission planning, developed and used by the
world’s experts.
R is an index for the quality of a voice connection, also known as
the “R-factor”. Scale: 0 to 100.

R                          MOS    %GOB    %POW
93 (G.107 default value)   4.4    98.4    0.1
90                         4.3    97.0    0.2
80                         4.0    89.5    1.4
70                         3.6    73.6    5.9
60                         3.1    50.1    17.4
50                         2.6    26.6    37.7
0                          1.0    0       99.8

(%GOB = percent rating good or better; %POW = percent rating poor or worse)

User satisfaction by R range:
90 to 100: Very Satisfied
80 to 90: Satisfied
70 to 80: Some Users Dissatisfied
60 to 70: Many Users Dissatisfied
50 to 60: Nearly All Users Dissatisfied
0 to 50: Not Recommended
Courtesy: Al Morton
The “R-Factor”
• R-factor is a single-integer measure of voice quality
– Range: [0, 100], 0: worst, 100: best
– PSTN range for R is 80 to 90, nominally 85, toll quality is R ≥
80
• R = 100 – Is – Id – If + A
– Simple, additive model
– Is, Id and If model impairments from the network (delay, loss,
jitter) as well as the codec (loss, delay from compression and
de-jitter buffer depth)
– A reflects lowered expectations given added convenience
• Example: A = 10 for cellular
• R can be estimated from network measurements
– Parameterized by codec parameters
• Offboard in a network probe
• Onboard the routers (example: Cisco SAA)
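A sketch of the additive model and of the R-to-MOS conversion. The additive function follows the formula on this slide; the conversion polynomial is the one given in ITU-T G.107 [e-model], quoted from memory, so treat the coefficients as an assumption to verify against the Recommendation. The impairment values in the example are hypothetical.

```python
def r_factor(i_s, i_d, i_f, a=0.0):
    """Additive model from the slide: R = 100 - Is - Id - If + A, clamped to [0, 100]."""
    return max(0.0, min(100.0, 100.0 - i_s - i_d - i_f + a))

def r_to_mos(r):
    """Map an R-factor to an estimated MOS (G.107 conversion, as recalled)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

# Hypothetical impairments: Is = 6, Id = 4, If = 10, no advantage factor.
r = r_factor(6, 4, 10)
mos = r_to_mos(r)
```

The conversion reproduces the MOS values shown on the E-Model slide for R = 80, 70, 60 and 50 (4.0, 3.6, 3.1, 2.6) after rounding to one decimal place.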
Impact of Loss and Delay on R-Factor
[Figure: two plots. Delay Impairment, Id (0 to 20), versus one-way delay (0 to 300 ms). Equipment Impairment Factor, Ie (0 to 18), versus packet loss ratio (0% to 1.2%) for G.711 with 20 ms packets.]
• Knee at ~ 150 ms, for 1-way delay – quality degrades drastically
• Long propagation delays, caused by optical or routing level
anomalies, have a large negative impact for global VoIP
– Need for latency sensitive routing, fast fail-over and convergence
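The shape of these curves can be approximated with the simplified E-model fit commonly attributed to [cole01]. The coefficients below are recalled from that line of work, are not given on the slides, and may differ from the curves plotted here; treat them as assumptions to check against the paper.

```python
import math

def delay_impairment(one_way_delay_ms):
    """Approximate Id: gentle slope, then a steeper slope past a knee near 177 ms."""
    d = one_way_delay_ms
    extra = 0.11 * (d - 177.3) if d > 177.3 else 0.0
    return 0.024 * d + extra

def loss_impairment_g711(loss_ratio):
    """Approximate equipment impairment for G.711 under random packet loss."""
    return 30 * math.log(1 + 15 * loss_ratio)

def estimated_r(one_way_delay_ms, loss_ratio):
    """R estimate for G.711, starting from the 94.2 baseline used in this fit."""
    return 94.2 - delay_impairment(one_way_delay_ms) - loss_impairment_g711(loss_ratio)
```

For instance, `delay_impairment(150)` is 3.6 while `delay_impairment(300)` is about 20.7, which illustrates how quickly quality falls once one-way delay passes the knee.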
Courtesy Al Morton
VoIP Control Plane Management Challenges
• Architectural: VoIP signaling and calls span multiple domains
• Technological: VoIP infrastructure software relatively immature
– Example: equipment does not gracefully handle overload conditions
• Flash crowds can become a big problem
– Business-oriented VoIP: conference call
– Customer-oriented VoIP: vote for “American Idol”
• Standards-related: Protocols are still evolving…
– Leads to inter-operability issues between vendors
– Multiple signaling protocols (SIP, H.323, MGCP, …) make matters worse!
• Overlapping functionality but each provides its own unique functionality
• Will community converge to a single protocol?
– SIP seems to be the protocol of choice going forward…
» Simple, service agnostic, extensible (trading off QoS and security TBD
or by leveraging other protocols, at least initially), improving standards
and implementation
• Leads to deployment of software for protocol interworking
– Adds more components to the VoIP infrastructure
VoIP Control Plane Monitoring
• Multiple, Distributed Servers
– Access servers (AS), Call Routing Elements, PSTN Gateways,
Advanced Feature Servers
• Class 5 features (dial-tone, DTMF, call-waiting, call forwarding, …),
routing on-net or off-net (to PSTN), supporting advanced features such
as “locate me”, conferencing
• Multiple information sources
– CDRs (Call Detail Records), SNMP MIBs/Traps, Server status
and logs, distributed SIP sniffers/analyzers, active VoIP-specific probes
• Complex, distributed systems debugging
– CDRs provide “cause codes” for problems
– SIP sniffers help to localize the problem to specific
servers/databases/gateways
– Device specific diagnosis helps to trouble-shoot problem
VoIP-Specific Security Issues
• Today’s middle-boxes, such as firewalls and NATs, do not always
work well with VoIP
– Firewalls: block dynamic ports used by VoIP
– NAT: hide the identity of the user behind it
• Traditional IP security measures can have adverse impact on
delay, jitter and bandwidth
– Example: crypto-engines used for encryption may not support QoS
• There is a trade-off between application-level encryption versus
IP-level encryption
– Application-level not good for wiretap requirements like CALEA [calea]
– IP-level (e.g., IPSec) can significantly increase bandwidth usage since VoIP
packets are small
• VoIP services require trust and closed user group management
– Otherwise, I can hang up your phone, make your message light go on, steal
service, …
– Gaps in today’s control plane, filled via the management plane
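The bandwidth cost of IP-level encryption for small voice packets can be estimated with a short calculation. The sketch assumes ESP in tunnel mode over IPv4 with a 16-byte cipher block, a 16-byte IV, and a 12-byte integrity check value; these parameter choices are ours, and other cipher suites or transport mode give different numbers.

```python
def esp_tunnel_packet_bytes(inner_ip_bytes, block=16, iv=16, icv=12):
    """Size of an IPv4 ESP tunnel-mode packet carrying one inner IP packet."""
    outer_ip = 20      # new outer IPv4 header
    esp_header = 8     # SPI + sequence number
    trailer = 2        # pad length + next header fields
    # Payload plus trailer is padded up to a multiple of the cipher block size.
    padded = -(-(inner_ip_bytes + trailer) // block) * block
    return outer_ip + esp_header + iv + padded + icv

g729_plain = 60    # 20-byte G.729 payload + 40 bytes of IP/UDP/RTP headers
g711_plain = 200   # 160-byte G.711 payload + 40 bytes of IP/UDP/RTP headers
g729_esp = esp_tunnel_packet_bytes(g729_plain)
g711_esp = esp_tunnel_packet_bytes(g711_plain)
```

Under these assumptions a G.729 packet doubles from 60 to 120 bytes, while a G.711 packet grows from 200 to 264 bytes, so the relative penalty is largest for the low-rate codec.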
Some Directions and Challenges
Core Network Management
• Myth of five nines?
– What level of reliability is really required
• SONET rings provide 50 msec protection – should customers really care
• How to design end equipment that’s more tolerant to small outages?
– Reliability is critical for some services: out of band control (now using the PSTN!
when VoIP succeeds…); 911
• Is reliability really about FRR or SONET rings?
– Old news, numerous solutions
• Yet, how do we get to robustness: understanding and controls to assure that a small push applied to
the network will have a small impact
• Where are the new and impactful opportunities
– Edge, enterprise, higher layer interactions
– Reliable router?
• How to deal with the whirlwind of new features and interactions
– How to be proactive?
• How to explore huge multi-dimensional space in testing
– How to uncover the plethora of failure modes in the field
• Correlation, learning
– How to design for a software defect rate much higher than acceptable
– How to simplify enterprise networks so that they are inherently less fragile and much
simpler to reason about and control
Security
• Problems really arise at the end systems
– Servers, PDAs, software, software, software…
– Should solutions be focused on the end systems, or the
network?
• To what extent can the network help protect the
customer’s software infrastructure?
– How much and where?
• DDoS attacks – despite all the research, a huge amount of improvement
needed
• VPNs – membership in many, without exploding complexity and
information leaks
• Stepping stones, bot-armies, marketplace of malware?
• Network itself faces thorny security challenges
– Secure router designs for handling both public and private
traffic
– Access control
Automation
• How do we create
– Information base: accurate, timely information -- customer-feature
associations, performance/fault measurements, …
– Decision support: effective rules and/or decision support tools – predictable
response to potential control actions
– Control mechanisms: effective protocol and network management
mechanisms for direct implementation of desired controls
• How far can we push automation, coping with
– Multiple objectives, multiple criteria
– Software rot: assumptions in the software disconnecting from reality
– Small errors in information, decision and control having large impact
• Where should different elements of network management
functionality be placed? control vs. management planes
– Lift intelligence into the management plane
– Rework the control plane architecture
Services
• How do you design an Internet that can support a range
of new services
– What do these new services require?
• TV? R-factor for video?
• How to do scalable, application-level monitoring and
adaptation, coping with
– Pollution of QoS classes
– Network/application interactions: design, management, fault
localization, provisioning
– Localization: is the application or the network broken
• What new services or enhanced existing services can the
network offer?
References
See
http://www.research.att.com/~ashaikh/network-management
Routing
• [rfc2328] J. Moy, “OSPF Version 2”, IETF RFC 2328
• [rfc1771] Y. Rekhter and T. Li, “A Border Gateway Protocol 4
(BGP-4)”, IETF RFC 1771
• [rfc1195] R. Callon, “Use of OSI IS-IS for Routing in TCP/IP and
Dual Environments”, IETF RFC 1195
• [rfc2453] G. Malkin, “RIP Version 2”, IETF RFC 2453
• [cisco-eigrp] “Enhanced Interior Gateway Routing Protocol
(EIGRP)”,
http://www.cisco.com/en/US/tech/tk365/tk207/tsd_technology_support_sub-protocol_home.html
• [route-views] http://www.route-views.org
• [ripe-ris] http://www.ripe.net/ris/index.html
• [ipsum] http://www.ipsumnetworks.com
• [packetdesign] http://www.packetdesign.com
Routing (cont’d.)
• [sprint-ipmon] http://ipmon.sprint.com
• [pyrt] http://mort.belltower.co.uk/pyrt.html
• [bgplay] http://www.ris.ripe.net/bgplay
• [ripe-libbgpdump] http://www.ris.ripe.net/source
• [shaikh-nsdi04] A. Shaikh and A. Greenberg, “OSPF Monitoring:
Architecture, Design and Deployment Experience”, Proc. USENIX
NSDI, Mar. 2004
• [caesar05-policies] M. Caesar and J. Rexford, “BGP Policies in
ISP Networks”, UC Berkeley Technical Report UCB/CSD-05-1377, Mar. 2005
• [nordstrom04] O. Nordstrom and C. Dovrolis, “Beware of BGP
attacks”, in ACM SIGCOMM CCR, Apr. 2004
• [feamster-sigmetrics04] N. Feamster et al., “A Model of BGP
Routing for Network Engineering”, in ACM SIGMETRICS, Jun.
2004
• [smart-routing05] http://www.nanog.org/mtg-0206/smart.html,
NANOG 25 panel, Jun. 2002
Routing (cont’d.)
• [feamster-imc04] N. Feamster et al., “BorderGuard: Detecting
Cold Potatoes from Peers”, in Proc. IMC, Oct 2004
• [feamster-nsdi05] N. Feamster and H. Balakrishnan, “Detecting
BGP Configuration Faults with Static Analysis”, Proc. USENIX
NSDI, May 2005
Network Troubleshooting
• [cisco-netflowv9]
http://www.cisco.com/en/US/products/sw/iosswrel/ps5187/products_feature_guide09186a00801b0696.html#wp1069814
• [sflow] http://www.sflow.org/
• [ietf-ipfix] http://www.ietf.org/html.charters/ipfix-charter.html
• [daytona] http://www.research.att.com/sw/tools/daytona
• [smartsamp] http://www.research.att.com/projects/flowsamp
• [lumeta] http://www.lumeta.com
• [traceroute] http://www.traceroute.org
• [tcpdump] http://www.tcpdump.org
• [nimi] http://www.ncne.nlanr.net/nimi
• [planetlab] http://www.planet-lab.org
• [peterson02] L. Peterson et al., “A Blueprint for Introducing
Disruptive Technology into the Internet”, HotNets, Oct. 2002
• [bavier04] A. Bavier et al., “Operating System Support for
Planetary-Scale Services”, Proc. USENIX NSDI, Mar. 2004
Network Troubleshooting (cont’d.)
• [lakhina04] A. Lakhina, M. Crovella and C. Diot, “Diagnosing
Network-wide Traffic Anomalies”, Proc. ACM SIGCOMM, Sept.
2004
• [roughan04] M. Roughan et al., “Combining Routing and Traffic
Data for Detection of IP Forwarding Anomalies”, Proc. ACM
SIGCOMM NetTS Workshop, Aug. 2004
• [kompella05] R. Kompella et al., “IP Fault Localization via Risk
Modeling”, Proc. USENIX NSDI, May 2005
• [teixeira04] R. Teixeira et al., “Dynamics of Hot-Potato Routing
in IP Networks”, Proc. ACM SIGMETRICS, June 2004
• [agarwal04] S. Agarwal et al., “Impact of BGP Dynamics on
Router CPU Utilization”, Proc. PAM, April 2004
Maintenance and Upgrade
• [rfc3623] J. Moy et al., “Graceful OSPF Restart”, IETF RFC
3623
• [shaikh-infocom02] A. Shaikh, R. Dube and A. Varma, “Avoiding
Instability during Graceful Shutdown of OSPF”, Proc. IEEE
Infocom, June 2002
• [rfc3137] A. Retana et al., “OSPF Stub Router Advertisement”,
IETF RFC 3137
• [avici-nsr] http://www.avici.com/products/nsr.shtml
• [avici-composite-link]
http://www.avici.com/technology/composite_links.shtml
• [cisco-bgp-nsf]
http://www.cisco.com/en/US/products/sw/iosswrel/ps1839/products_feature_guide09186a008015fede.html#wp1027129
VoIP
• [voip-info-wiki] http://www.voip-info.org/tiki-index.php
• [goode02] B. Goode, “Voice Over Internet Protocol (VOIP)”,
Proc. of the IEEE, VOL. 90, NO. 9, Sept. 2002
• [mehta01] P. Mehta and S. Udani, “Overview of Voice over IP”,
Tech. Report MS-CIS-01-31, University of Pennsylvania, Feb.
2001
• [sinden02] R. Sinden, “Comparison of Voice over IP with Circuit
Switching Techniques”, Department of Electronics and Computer
Science, Southampton University, Jan. 2002
• [rfc3261] J. Rosenberg et al., “SIP: Session Initiation Protocol”,
IETF RFC 3261
• [rfc3263] J. Rosenberg and H. Schulzrinne, “Session Initiation
Protocol (SIP): Locating SIP Servers”, IETF RFC 3263
• [rfc3435] F. Andreasen and B. Foster, “Media Gateway Control
Protocol (MGCP) Version 1.0”, IETF RFC 3435
• [rfc3250] H. Schulzrinne et al., “RTP: A Transport Protocol for
Real-Time Applications”, IETF RFC 3550
VoIP (cont’d.)
• [internetnews-brandx] “Court Backs Cable in Brand X Case”,
Internetnews.com, June 27, 2005
http://www.internetnews.com/bus-news/article.php/3515801
• [ip-centrex] http://www.ip-centrex.org/
• [e-model] “The E-model, a computational model for use in
transmission planning”, ITU-T Recommendation G.107, May
2000
• [cole01] R. Cole and J. Rosenbluth, “Voice over IP Performance
Monitoring”, ACM SIGCOMM CCR, Volume 31, Issue 2, Apr.
2001
• [rosenbluth01] J. Rosenbluth, “A framework for Setting Packet
Loss objectives for VoIP”, ITU-T Study Group 12 Delayed
contribution, Oct 2001
• [nist05] D. Kuhn et al., “Security Considerations for Voice Over
IP Systems”, NIST Special Publication 800-58, Jan. 2005
• [boutremans02] C. Boutremans et al., “Impact of Link Failures on
VoIP Performance”, Proc. ACM NOSSDAV, 2002
VoIP (cont’d.)
• [cisco-saa] “Service Assurance Agent (SAA)”,
http://www.cisco.com/en/US/tech/tk447/tk823/tsd_technology_support_sub-protocol_home.html
• [cisco-mpls-cos] “MPLS Class of Service”,
http://www.cisco.com/en/US/products/sw/iosswrel/ps1830/products_feature_guide09186a00800e977a.html#26823
• [calea] “Communications Assistance for Law Enforcement Act”,
http://www.askcalea.net/