IP Routing Protocols, BGP, Longest Prefix Match

Download Report

Transcript IP Routing Protocols, BGP, Longest Prefix Match

Week 7
Routing Protocols
1
Intra-AS Routing
 Also known as Interior Gateway Protocols (IGP)
 Most common Intra-AS routing protocols:

RIP: Routing Information Protocol

OSPF: Open Shortest Path First

IGRP: Interior Gateway Routing Protocol (Cisco
proprietary)
2
RIP ( Routing Information Protocol)
 Distance vector algorithm
 Included in BSD-UNIX Distribution in 1982
 Distance metric: # of hops (max = 15 hops)
u
v
A
z
C
B
D
w
x
y
destination hops
u
1
v
2
w
2
x
3
y
3
z
2
3
RIP advertisements
 Distance vectors: exchanged among
neighbors every 30 sec via Response
Message (also called advertisement)
 Each advertisement: list of up to 25
destination nets within AS
4
RIP: Example
z
w
A
x
D
B
y
C
Destination Network
w
y
z
x
….
Next Router
Num. of hops to dest.
….
....
A
B
B
--
2
2
7
1
Routing table in D
5
RIP: Example
Dest
w
x
z
….
Next
C
…
w
hops
4
...
A
Advertisement
from A to D
z
x
Destination Network
w
y
z
x
….
D
B
C
y
Next Router
Num. of hops to dest.
….
....
A
B
B A
--
Routing table in D
2
2
7 5
1
6
RIP: Link Failure and Recovery
If no advertisement heard after 180 sec -->
neighbor/link declared dead
 routes via neighbor invalidated
 new advertisements sent to neighbors
 neighbors in turn send out new advertisements (if
tables changed)
 link failure info quickly propagates to entire net
 poison reverse used to prevent ping-pong loops
(infinite distance = 16 hops)
7
OSPF (Open Shortest Path First)
 “open”: publicly available
 Uses Link State algorithm
 LS packet dissemination
 Topology map at each node
 Route computation using Dijkstra’s algorithm
 OSPF advertisement carries one entry per neighbor
router
 Advertisements disseminated to entire AS (via
flooding)

Carried in OSPF messages directly over IP (rather than TCP
or UDP
8
OSPF “advanced” features (not in RIP)
 Security: all OSPF messages authenticated (to




prevent malicious intrusion)
Multiple same-cost paths allowed (only one path in
RIP)
For each link, multiple cost metrics for different
TOS (e.g., satellite link cost set “low” for best effort;
high for real time)
Integrated uni- and multicast support:
 Multicast OSPF (MOSPF) uses same topology data
base as OSPF
Hierarchical OSPF in large domains.
9
Hierarchical OSPF
10
Hierarchical OSPF
 Two-level hierarchy: local area, backbone.
Link-state advertisements only in area
 each nodes has detailed area topology; only know
direction (shortest path) to nets in other areas.
 Area border routers: “summarize” distances to nets
in own area, advertise to other Area Border routers.
 Backbone routers: run OSPF routing limited to
backbone.
 Boundary routers: connect to other AS’s.

11
Internet inter-AS routing: BGP
 BGP (Border Gateway Protocol): the de
facto standard
 BGP provides each AS a means to:
1.
2.
3.
Obtain subnet reachability information from
neighboring ASs.
Propagate the reachability information to all
routers internal to the AS.
Determine “good” routes to subnets based on
reachability information and policy.
 Allows a subnet to advertise its existence
to rest of the Internet: “I am here”
12
BGP basics
 Pairs of routers (BGP peers) exchange routing info over semi-
permanent TCP conctns: BGP sessions
 Note that BGP sessions do not correspond to physical links.
 When AS2 advertises a prefix to AS1, AS2 is promising it will
forward any datagrams destined to that prefix towards the
prefix.

AS2 can aggregate prefixes in its advertisement
3c
3a
3b
AS3
1a
AS1
2a
1c
1d
1b
2c
AS2
2b
eBGP session
iBGP session
13
Distributing reachability info
 With eBGP session between 3a and 1c, AS3 sends prefix
reachability info to AS1.
 1c can then use iBGP do distribute this new prefix reach info
to all routers in AS1
 1b can then re-advertise the new reach info to AS2 over the
1b-to-2a eBGP session
 When router learns about a new prefix, it creates an entry
for the prefix in its forwarding table.
3c
3a
3b
AS3
1a
AS1
2a
1c
1d
1b
2c
AS2
2b
eBGP session
iBGP session
14
Outline
1. Highlights
2. Addressing and CIDR
3. BGP Messages and Prefix Attributes
4. BGP Decision and Filtering Processes
5. I-BGP
6. Route Reflectors
7. Multihoming
8. Aggregation
9. Routing Instability
10. BGP Table Growth
15
Internet Topology
large ISP
large ISP
medium ISP
dialup
ISP
dedicated
access ISP
• AS (Autonomous System) - a collection of routers under the same technical and
administrative domain.
• EGP (External Gateway Protocol) - used between two AS’s to allow them to exchange
routing information so that traffic can be forwarded across AS borders. Example: BGP
16
Purpose: to share connectivity
information
you can reach
net A via me
AS2
BGP
AS1
R3
R2
traffic to A
R1
table at R1:
dest next hop
A
R2
A
R
border router
internal router
17
BGP Sessions
 One router can participate in many BGP sessions.
 Initially … node advertises ALL routes it wants
neighbor to know (could be >50K routes)
 Ongoing … only inform neighbor of changes
AS1
BGP Sessions
AS3
AS2
18
Routing Protocols
IGP: Interior Gateway Protocol.
Examples: IS-IS, OSPF
I-BGP
R3
R2
A
AS1
E-BGP
announce B
AS2
R1
AS3
R5
R4
R
border router
internal router
B
19
Configuration and Policy
 A BGP node has a notion of which routes to
share with its neighbor. It may only
advertise a portion of its routing table to a
neighbor.
 A BGP node does not have to accept every
route that it learns from its neighbor. It can
selectively accept and reject messages.
 What to share with neighbors and what to
accept from neighbors is determined by the
routing policy, that is specified in a router’s
configuration file.
20
History
 Popularity:
 until
a number of years ago BGP fairly unknown.
Used by small number of large ISPs.
 in 1995 (beginning of web popularity) number of
organizations using BGP grew tremendously.
 Two reasons for growth of usage and
interest:
significant growth in number of ISPs;
 organizations were born that had missioncritical dependence upon their connectivity

 CIDR (Classless Inter-Domain Routing)
introduced in 1995
21
Assigning IP address and AS
numbers (Ideally)
 A host gets its IP address from the IP address
block of its organization
 An organization gets an IP address block from its
ISP’s address block
 An ISP gets its address block from its own
provider OR from one of the 3 routing registries:



ARIN: American Registry for Internet Numbers
RIPE: Reseaux IP Europeens
APNIC: Asia Pacific Network Information Center
 Each AS is assigned a 16-bit number (65536 total)
Currently 10,000 AS’s in use
22
Addressing Schemes
 Original addressing schemes (class-based):
 32
bits divided into 2 parts:
Class A
0
8
0 network

Class B
0
0 network

Class C
0
0 network

host
16
host
24
host
~2 million nets
256 hosts
 CIDR introduced to solve 2 problems:
exhaustion of IP address space
 size and growth rate of routing table

23
Problem #1: Lifetime of
Address Space
 Example: an organization needs 500 addresses.
A single class C address not enough (256 hosts).
Instead a class B address is allocated. (~64K
hosts) That’s overkill -a huge waste.
 CIDR allows networks to be assigned on
arbitrary bit boundaries.
permits arbitrary sized masks: 178.24.14.0/23 is valid
 requires explicit masks to be passed in routing
protocols

 CIDR solution for example above: organization is
allocated a single /23 address (equivalent of 2
class C’s).
24
Problem #2: Routing Table Size
Without CIDR:
232.71.0.0
232.71.1.0
232.71.2.0
…..
232.71.255.0
service
provider
232.71.0.0
232.71.1.0
232.71.2.0
…..
232.71.255.0
Global
internet
With CIDR:
232.71.0.0
232.71.1.0
232.71.2.0
…..
232.71.255.0
service
provider
232.71.0.0/16
Global
internet
25
CIDR: Classless Inter-Domain
Routing
 Address format <IP address/prefix P>.
The prefix denotes the upper P bits of the IP
address.
 Idea - use aggregation - provide routing for a
large number of customers by advertising one
common prefix.

This is possible because nature of addressing is
hierarchical
 Summarizing routing information reduces the size
of routing tables, but allows to maintain
connectivity.
 Aggregation is critical to the scalability and
survivability of the Internet
26
Address Arithmetic: Address
Blocks
 The <address/prefix> pair defines an address block:
 Examples:
 128.15.0.0/16 => [ 128.15.0.0 - 128.15.255.255 ]
 188.24.0.0/13 => [ 188.24.0.0 - 188.31.255.255 ]
consider 2nd octet in binary:
198.00011000.0.0
settable
 Address block sizes
13th bit
 a /13 address block has 232-13 addresses (/16 has 232-16)
 a /13 address block is 8 times as big as a /16 address block
because 232-13 = 232-16 * 23
27
CIDR: longest prefix match
 Because prefixes of arbitrary length allowed,
overlapping prefixes can exist.
 Example:
router hears 124.39.0.0/16 from one neighbor
and 124.39.11.0/24 from another neighbor
 Router forwards packet according to most specific
forwarding information, called longest prefix match


Packet with destination 124.39.11.32 will be forwarded using
/24 entry.
Packet w/destination 124.39.22.45 will be forwarded using
/16 entry
28
Will CIDR work ?
 For CIDR to be successful need:
address registries must assign addresses using
CIDR strategy
 providers and subscribers should configure
their networks, and allocate addresses to allow
for a maximum amount of aggregation
 BGP must be configured to do aggregation as
much as possible

 Factors that complicate achieving
aggregation

multihoming, proxy aggregation, changing
providers
29
Four Basic Messages
 Open:
Establishes BGP session (uses TCP port #179)
 Notification:
Report unusual conditions
 Update:
Inform neighbor of new routes that become active
Inform neighbor of old routes that become inactive
 Keepalive:
Inform neighbor that connection is still viable
30
UPDATE Message
 used to either advertise and/or withdraw
prefixes
 path attributes: list of attributes that
pertain to ALL the prefixes in the
Reachability Info field
FORMAT:
Withdrawn routes length (2 octets)
Withdrawn routes
(variable length)
Total path attributes length (2 octets)
Path Attributes
(variable length)
Reachability Information (variable length)
33
ATTRIBUTES
 ORIGIN:


Who originated the announcement? Where was a prefix
injected into BGP?
IGP, EGP or Incomplete (often used for static routes)
 AS-PATH:



a list of AS’s through which the announcement for a
prefix has passed
each AS prepends its AS # to the AS-PATH attribute
when forwarding an announcement
useful to detect and prevent loops
Prefix
128.73.4.21/21
Next hop
232.14.63.4
AS Path
1239 701 3985 631
35
AS-Path Attribute
 Sequence of ASes a route
has traversed
 Loop detection
 Apply policy
AS 200
AS 100
170.10.0.0/16
180.10.0.0/16
Network
Path
180.10.0.0/16 300 200 100
170.10.0.0/16 300 200
AS 300
AS 400
150.10.0.0/16
AS 500
Network
180.10.0.0/16
170.10.0.0/16
150.10.0.0/16
Path
300 200 100
300 200
300 400
36
Loop Handling
172.16.10.0/24 -- 1
AS2
172.16.10.0/24 – 2 1
AS1
172.16.10.0/24
AS3
172.16.10.0/24 – 4 3 2 1
AS4
172.16.10.0/24 – 3 2 1
37
AS_path Manipulation
192.213.1.0/24 – 200 50
192.213.1.0/24 -- 50
AS50
AS200
AS300
192.213.1.0/24 – 300 200 50
192.213.1.0/24
192.213.1.0/24 – 50 50 50
AS100
192.213.1.0/24 – 100 50 50 50
38
Attribute: NEXT HOP
For EBGP session, NEXT HOP = IP address
of neighbor that announced the route.
 For IBGP sessions, if route originated inside
AS, NEXT HOP = IP address of neighbor
that announced the route
 For routes originated outside AS, NEXT
HOP of EBGP node that learned of route, is
carried unaltered into IBGP.

209.15.1.0/24
1.1.1.1 A
3.3.3.3 D
C
2.2.2.2
IBGP
B
140.20.1.0/24
BGP Table at Router C:
Destination
Next Hop
140.20.1.0/24
2.2.2.2
209.15.1.0/24
1.1.1.1
IP Routing Table at Router C:
Destination
Next Hop
140.20.1.0/24
2.2.2.2
2.2.2.0/24
3.3.3.3
3.3.3.0/24
Connected
209.15.1.0/24
1.1.1.1
1.1.1.0/24
3.3.3.3
39
Attribute: Multi-Exit
Discriminator (MED)
 when AS’s interconnected via







2 or more links
Hint to external neighbors
about the preferred path
into an AS that has multiple
entry points
AS announcing prefix sets
MED
enables AS2 to indicate its
preference
AS receiving prefix uses
MED to select link
a way to specify how close a
prefix is to the link it is
announced on
a lower value is preferred
called an external metric
AS1
Link B
Link A
MED=50
MED=10
AS2
AS4
AS3
40
Attribute: Local Preference
 Used to indicate
preference among
multiple paths for the
same prefix anywhere in
the internet.
 The higher the value the
more preferred
 Exchanged between
IBGP peers only. Local to
the AS.
 Often used to select a
specific exit point for a
particular destination
140.20.1.0/24
AS1
AS3
AS2
AS4
BGP table at AS4:
Destination
AS Path
Local Pref
140.20.1.0/24
AS3 AS1
300
140.20.1.0/24
AS2 AS1
100
41
128.213.0.0/16
WINET
Example
ZNET
XNET
128.213.0.0/16
YNET
128.213.0.0/16
T3
T1
Set local preference to 200
Set local preference to 300
SJ
LA
ANET
Affects the traffic going out of the AS
42
Routing Process Overview
Routes
received
from
peers
accept,
deny, set
preferences
Input
policy
engine
Choose
best route
Decision
process
BGP table
forward,
not forward
set MEDs
Routes
used by
router
Output
policy
engine
Routes
sent to
peers
IP routing
table
Page 145 of Halabi
43
Input Policy Engine
 Inbound filtering controls outbound traffic
 filters route updates received from other peers
 filtering based on IP prefixes, AS_PATH, community
 denying a prefix means BGP does not want to reach that
prefix via the peer that sent the announcement
 accepting a prefix means traffic towards that prefix may
be forwarded to the peer that sent the announcement
 Attribute Manipulation
 sets attributes on accepted routes
 example: specify LOCAL_PREF to set priorities among
multiple peers that can reach a given destination
44
BGP Decision Process
1. Choose route with highest LOCAL-PREF
2. If have more than 1 route, select route with shortest
AS-PATH
3. If have more than 1 route, select according to lowest
ORIGIN type where IGP < EGP < INCOMPLETE
4. If have more than 1 route, select route with lowest
MED value
5. Select min cost path to NEXT HOP using IGP metrics
6. If have multiple internal paths, use BGP Router ID to
break tie.
45
Output Policy Engine
 Outbound Filtering controls inbound traffic
forwarding a route means others may choose to
reach the prefix through you
 not forwarding a route means others must use
another router to reach the prefix
 may depend upon whether E-BGP or I-BGP peer
 example: if ORIGIN=EGP and you are a nontransit AS and BGP peer is non-customer, then
don’t forward

 Attribute Manipulation
 sets attributes such as AS_PATH and MEDs
46
Transit vs. Nontransit AS
Transit traffic = traffic whose source and destination are outside the AS
Nontransit AS: does not carry transit traffic
• Advertise own routes only
• Do not propagate routes learned from other AS’s
• case 1:
r1
Transit AS: does carry transit traffic
• Advertises its own routes PLUS routes
learned from other AS’s
ISP1
ISP1
r3 ISP2
r1
r3
r2
r2
r2
r1
ISP2
r3
r2,r3
r2
r2,r1
AS1
AS1
r2
r3
AS2
r1
• case 2:
r3
r1
r1
r3
r2
AS1
r2
47
Clients, Providers and Peers
 AS has customers,
providers and peers
 Relationships between
AS pairs:


customer-provider
peer-to-peer
 Type of relationship
influences policies
 Exporting to provider:
AS exports its routes & its
customer’s routes, but not
routes learned from other
providers or peers
 Exporting to peer:
(same as above)
 Exporting to customer:
AS exports its routes plus
routes learned from its
providers and peers
48
Internal BGP (I-BGP)
 Used to distribute routes learned via EBGP to all the
routers within an AS
 I-BGP and E-BGP are same protocol in that




same message types used
same attributes used
same state machine
BUT use different rules for readvertising prefixes
 Rule #1: prefixes learned from an E-BGP neighbor can
be readvertised to an I-BGP neighbor, and vice versa
 Rule #2: prefixes learned from an I-BGP neighbor
cannot be readvertised to another I-BGP neighbor
49
I-BGP: Preventing Loops and
Setting Attributes
 Why rule #2? To prevent announcements from
looping.


In E-BGP can detect via AS-PATH.
AS-PATH not changed in I-BGP
 Implication of rule: a full mesh of I-BGP sessions
between each pair of routers in an AS is required
 Setting Attributes: The router that injects the route
into the I-BGP mesh is responsible for


setting the LOCAL-PREF attribute
prepending AS # to AS-PATH
50
Route Reflectors
 Problem: requiring a full mesh
of I-BGP sessions between all
pairs of routers is hard to
manage for large AS’s.
 Solution:



 I-BGP Peering


RR
group routers into clusters.
Assign a leader to each
cluster, called a route
reflector (RR).
Members of a cluster are
called clients of the RR
clients peer only with their
RR
RR’s must be fully meshed
RR
RR
A
B
C
clusters
clients
51
Route Reflectors: Rule on
Announcements
 If received from RR, reflect to clients
 If received from a client, reflect to RRs
and clients
 If received from E-BGP, reflect to all RRs and clients
 RR’s reflect only the best route to a given
prefix, not all announcements they receive.
 helps
size of routing table
 sometimes clients don’t need to carry full table
52
Default Routes
 If you don’t have a
routing entry in the
table for a
destination, send it
along the default
route
 Can be statically
configured
 Can be learned via BGP
 Can have multiple
defaults

use LOCAL_PREF to
distinguish primary and
backup default routes
AS2
0.0.0.0/0
next_hop=1.1.1.1
Local pref=300
0.0.0.0/0
next_hop=2.2.2.2
Local pref=100
AS1
53
Multihoming
54
Single-homed vs. Multi-homed
subscribers
 A single-homed network has one connection to the
Internet (i.e., to networks outside its domain)
 A multi-homed network has multiple connections to
the Internet. Two scenarios:


can be multi-homed to a single provider
can be multi-homed to multiple providers
 Why multi-home?
 Reliability
 Performance
• a site’s bandwidth to internet is sum of bandwidth on all
links
55
Single-homed AS
 Subscriber called a “stub AS”
 Provider-Subscriber communication
for route advertisement:

static configuration. most common.
• Provider’s router is configured with
subscriber’s prefix.
• good if customer’s routes can be
represented by small set of aggregate
routes
• bad if customer has many noncontiguous
subnets

Provider
R1
R2
Subscriber
can use BGP between provider and stub
AS
56
Multihoming to a Single
Provider: 4 scenarios
ISP
ISP
R1
R1
R2
R2
Customer
R3
Customer
ISP
R1
ISP
R2
R3
Customer
R1
R2
R3
R4
Customer
57
Multihoming to Multiple
Providers
ISP 3
ISP 2
ISP 1
Customer
58
Multihoming Issues
 Load sharing
 how
distribute the traffic over the multiple
links?
 Reliability

if load sharing leads to preferering certain
links for certain subnets, is reliability reduced?
 Address/Aggregation
which subnet addresses should the multihomed
customer use?
 how will this affect its provider’s ability to
aggregate routes?

59
Load sharing from Customer to
ISP using policy
 Goal: send traffic to
ISP’s customers on one
link; send traffic to
the rest of the
Internet on 2nd link
 Implement using policy
to control
announcements
blue: announcements
red: traffic
ISP
R1
R2
advertise
customer
routes only
advertise
default
route 0/0
R3
traffic
Customer
60
Load sharing from ISP to
customer using attributes
 Goal: provider splits
traffic across 2 links
according to prefix
 Implement this
strategy using
attributes


customer sets MEDs
provider sets
LOCAL_PREF
61
Example
C4
C3
C2
C5
SF
NY
X:200
Y:200
Rest:300
C3:300
C2:300
Rest:200
(X,Y)
W:200
Z:200
Rest:250
SF
NY
C5:300
C4:300
Rest:250
(Z,W)
IBGP
62
Example
ISP
R1
140.35/16
208.22/16
R2
R3
Customer
208.22/16
140.35/16
63
Address/Aggregation Issue
 Where should the
customer get its
address block from?




1. From ISP1
2. From ISP2
3. From both ISP1 and
ISP2
4. Independently from
address registry
ISP 3
ISP 1
ISP 2
customer
(cases 1 and 2 are equivalent)
64
Case 1 & 2: Get address block
from one ISP
 example: customer gets




address from ISP 1
ISP 1’s aggregation is not
broken
customer’s prefix not
aggregatable at ISP 2
longer prefix becomes a
traffic magnet
How good is load sharing?
If all ISP’s generate same
amount of traffic for
customer, then ISP2customer link twice as loaded
as ISP1-customer link
ISP 3
140.20/16
200.50/16
140.20.6/24
ISP 1
ISP 2
140.20/16
200.50/16
140.20.6/24
140.20.6/24
customer
140.20.6/24
65
Case 3: Get address block from
both ISPs
 announcement policy:
announce prefix only
to its “parent”
 advantage: both ISP’s
can aggregate the
prefix they receive
 disadvantage: lose
reliability
 load balancing good?
depends upon how
much traffic sent to
each prefix
ISP 3
ISP 1
ISP 2
140.20/16
200.50/16
140.20.1/24
200.50.1/24
customer
140.20.1/24
200.50.1/24
66
Case 4: obtain address block
from registry
 no aggregation possible
 no traffic magnets created
ISP 3
 good reliability
 how achieve load sharing?


customer breaks address
block into 2 /25 blocks,
and announce one per link
(but may lose reliability)
OR use method of
AS-PATH manipulation
100.20/16
200.50/16
150.55.10/24
150.55.10/24
ISP 1
ISP 2
100.20/16
200.50/16
150.55.10/24
150.55.10/24
customer
150.55.10/24
67
AS-PATH manipulation
 Idea: prepend your AS
number in AS-PATH
multiple times to
discourage use of a link
 makes AS-PATH seem
longer than it is
 recall BGP decision
process uses shortest
AS-PATH length as a
criteria for selecting
best path
 Example: ISP 3 will
choose path through AS
2 because its AS-PATH
appears shorter
ISP 3
150.55.10/24 - 2 33
150.55.10/24 - 1 33 33
AS 2
AS 1
150.55.10/24 - 33 33
150.55.10/24 - 33
customer
150.55.10/24
AS 33
68
Aggregation
69
Address Arithmetic:
When is aggregation possible?
Case 1
 Possible when one prefix contained in
another.
 Example: Two AS’s having customer-provider
relationship. Provider does the aggregation.
 Provider
has address block 18.0.0.0/8
 Its customers have address blocks 18.6.0.0/15
and 18.9.0.0/15
 Provider announces its address block only
 Rule: Prefix p1 contains prefix p2 iff
length(p2) > length(p1) AND
address(p2) / 232-length(p1) = address(p1) / 232length(p1)
70
Address Arithmetic:
When is aggregation possible?
Case 2
 Some pairs of consecutive prefixes
 Example: routes within the same AS:
AS has 2 address blocks:
1.2.2.0/24 = 0000001.00000010.00000010.00000000/24
1.2.3.0/24 = 0000001.00000010.00000011.00000000/24
can announce instead 1.2.2.0/23
 Rule: consecutive prefixes p1 and p2 are aggregatable
iff
length(p1)=length(p2) AND
address(p1) / 232-length(p1) +1 = address(p2) / 232-length(p2) AND
address(p1) % 233-length(p1) = 0
71
Black holes and cardinal sins
 “The cardinal sin of BGP
NAP
routing is advertising
routes that you don't
wrong !!
know how to get to;
100.24.0.0/18
called "black-holing"
100.24/16
someone”
100.24.56.0/21
 If you announce part of
[100.24.56.0-100.24.63.255]
222.2/16
IP space owned by
Net C
someone else, using a
more-specific prefix, all
their traffic flows to
ISP 1
ISP 2
you. Makes that address
100.24/16
222.2/16
block disconnected from
the Internet.
 Example: black holes can
100.24.16.0/21
100.24.0.0/20
happen inadvertently by
[100.24.0.0-100.24.15.255]
non-careful aggregation [100.24.16.0-100.24.23.255]
Net B
Net A
73
Limitations to Aggregation
 Lack of hierarchical allocation of address
space prior to CIDR (before 1995)
 A single AS can have noncontiguous address
blocks
 Customer AS and provider AS can have noncontiguous address blocks
 Reluctance of customers to renumber their
address space when they switch providers
 Multi-homing
multi-homed prefixes require global visibility
 may choose not to: load sharing

74
Rules of Thumb for
Aggregation
 To avoid black holes: an ISP is not allowed
to aggregate routes unless it is a supernet
of the address block it wants to aggregate
 In other words, an ISP has to specifically
announce each IP address range of its
downstream customers that are not part of
its own address space.
75
Routing Instability
76
Route Stability
 Routing instability: rapid fluctuation of network
reachability information
 route flapping: when a route is withdrawn and
re-announced repeatedly in a short period of time

happens via UPDATE messages
 because messages propagate to global Internet, route
flapping behavior can cascade and deteriorate routing
performance in many places
 Effects: increased packet loss, increased network
latency, CPU overhead, loss of connectivity
 Route Stability Studies by C. Labovitz, R. Malan & F. Jahanian
77
Types of Routing Updates
 Forwarding instability
 reflects legitimate topology changes
 e.g., changes in Prefix, NEXT_HOP and/or ASPATH
 affects forwarding paths used
 Policy fluctuation
 reflects changes in policy
 e.g., changes in MED, LOCAL_PREF, etc.
 may not necessarily affect forwarding paths used
 Pathological
 redundant messages
 reflect neither topology nor policy changes
78
Anecdotes of Route Flap
Storms
 April 25, 1997 - small Virginia ISP injected
incorrect map into global Internet. Map said
Virginia ISP had optimal connectivity to all
destinations. Everyone sent their traffic to this
ISP. Result: shutdown of Tier-1 ISPs for 2 hours.
 August 14, 1998 - misconfigured database server
forwarded all queries to “.net” to wrong server.
Result: loss of connectivity to all .net servers for
few hours.
 Nov. 8, 1998 - router software bug led to
malformed routing control message. Caused
interoperability problem between Tier-1 ISPs.
Result: persistent pathological oscillations and
connectivity loss for several hours.
79
Taxonomy (as per Labovitz et.
al.) Name Type
Character
WADiff
AADiff
WADup
AADup
WWDup
Explicit withdrawal followed by
announcement. Replace route
with different path
Announced twice (implicit
withdrawal). Replace route
with different path.
Explicit withdrawal followed by
announcement. Replace route
with same path.
Announced twice (implicit
withdrawal). Replace route
with same path.
Repeated duplicate
withdrawals
Legitimate
Legitimate
Legitimate or
pathological
Policy change or
pathological
Pathological
80
General Statistics
 1996: 3-5 million updates per day in Internet core
 1998: 300K-700K updates per day in Internet core
 1996: average number of announcements per day
was ~275K.
 1998: average number of announcements per day
was ~400K
 Correlation of instability and usage


instability highest during business hours
instability lowest during nights, on weekends and in
summer
81
Per Event Type Statistics
 1996 relative impact
(approximately):
WWDup (96%),
AADup (2%),
WADup (1%),
AADiff(1/2%),
WADiff (1/2%)
 1998 relative impact
(approximately):
AADup (30-40%),
WWDup(25-30%),
AADiff (~20%)
other (rest)
82
Who’s Responsible?
 AS’s
No single AS dominates instability statistics
 No correlation between the size of an AS and
its share of updates generated.

 Prefixes
 Instability is evenly distributed across routes.
 Example of measurements:
• 75% of AADiff events come from prefixes change
less than 10 times a day.
• 80-90% of instability comes from prefixes that are
announced less than 50 times/day.
83
Sources of Instabilities in
General
 router configuration errors
 transient physical and data link problems
 software bugs
 problems with leased lines (electrical
timing issues that cause false alarms of
disconnect)
 router failures
 network upgrades and maintenance
84
Instability Problem and Cause.
Example 1.
 Problem: 3-5 million duplicate withdrawals
 Cause: stateless BGP implementation
 time-space tradeoff: no state maintained on what
advertised to peers
 when receive any change, transmit withdrawal to all peers
regardless of whether previously notified or not
 sent updates for both explicit and implicit withdrawals
 By 1998, most vendors had BGP implementations
with partial state.
 Result: number of WWDups reduced by an order
of magnitude
85
Instability Problem and Cause.
Example 2
 Problem: duplicate announcements
 Cause: min-advertisement timer &
stateless BGP
min-adv timer: wait 30 seconds. Combine all
received updates in last 30 seconds into single
outbound update message (if possible).
 within 30 seconds route can be withdrawn and
re-announced so that there is no net change to
original announcement

 Solution: Have BGP keep some state about
recently sent messages to peers. Avoid
sending duplicate messages
86
Instability Problem and Cause.
Example 3
AS 1
Net X
R2
R8
R1
R3
4
R4
AS3
R6
3
R7
6
5
10
2
R5
AS2
R
Example: interaction IGP/BGP
policy: set MED using IGP metrics,
such as shortest path
border router
internal router
E-BGP
IGP
87
Controlling route instability:
Route Dampening
 track number of times a route has flapped
over a period of time
 routes that exhibit a high level of
instability in a short period of time should
be suppressed (not advertised)
 penalize ill behaved routes proportionally
to their expected future stability
 if a suppressed route stops flapping for a
long enough period of time, unsuppress it
(readvertise)
88
Route Dampening
penalty
suppress limit
reuse limit
time
89
Route Dampening Algorithm
 Each time a route flaps, increase the
penalty
 If the route has not flapped in the last
‘half-life’ time units, then cut penalty in
half.
 If the penalty > suppress limit, then
suppress the route
 If the penalty < reuse limit, then free a
suppressed route
90
Side Effect of Route
Dampening
 A legitimate update may arrive and it will
be ignored because that route has been
suppressed and not yet released.
 The modification needed for the legitimate
announcement is delayed
91
Aggregation can help route
flapping
 If a more-specific route is
flapping, but provider only
announces aggregated
prefix, then other networks
don’t see route flap.
 Hence aggregation can mask
route flapping.
 Aggregation helps combat
instability because it
reduces the number of
networks visible in the core
Internet.
Tier 1 provider
not flapping
140.40/16
Tier 2 provider
flapping
140.40.4/24
customer
92
Growth of BGP Tables
93
Long Term Growth Trends
in Internet Routing
 Will this routing system be able to scale
and meet the growth of the Internet and
its ever-expanding level of demands?
Are there any inherent limitations?
 As more devices connect to Internet and
consume addresses, the need to maintain
reachability to these addresses implies larger
routing tables

 What is the ability of the system to
produce a stable view of the overall
network topology?
94
BGP Table Growth (1989-2001)
95
Reasons for Exponential
Growth
Measures
Growth
Number of AS’s
Growing exponentially - 50%
yearly
Range of addresses spanned by Growing 7% yearly
table
Average span of route table
entry
Growing 1 bit per 29 months
Holes in the table
Currently at 37%
Number of announcements
Growing 42% yearly
Data in last 3 slides from Geoff Huston’s INET publications
97
Increasing fine levels of
routing details
in BGP table
 AS space growth at
50%
&
addresses spanned by
the table growing at 7%
=> each AS advertising
smaller address ranges
 /24 is fastest growth
prefix in table (in
absolute terms). /24 /31 is area with fastest
relative growth
 1999: average span of
prefix 16,000
addresses (mean
prefix length 18.03)
2000: average span of
prefix 12,000
addresses (mean
prefix length 18.44)
98
Holes
 When both aggregated prefix and a more-
specific prefix exist in the table, the
more-specific prefix is called a “hole”.
 Why are holes created?
Punch hole in policy of larger aggregated
announcement to create a different policy for
finer announcement.
 Load sharing in multihoming scenario
 reliability via multihoming

 37% of BGP table is holes.
99
Multihoming vs. Resiliency
 Multiple peering relationships can be
cheaper than using a single upstream
provider

implies: multihoming is seen as a substitute for
upstream service resiliency
 Impact
 providers can’t command a price for reliability,
and thus don’t spend money to engineer it into
network.
 resiliency is becoming responsibility of customer
not provider
 Can BGP scale adequately to continue to
undertake this role?
100
Conclusions
 Things are getting better (stability)
router software and configuration management
are maturing
 increased emphasis on aggregation and route
dampening are helping

 Things are getting worse (scalability)
 multihoming is still growing
 internet topology growing less hierarchical
 ‘noise’ in BGP table is growing
101
Longest Prefix Match
Algorithms
102
Forwarding Engine
Packet
payload
header
Router
Destination
Address
Routing Lookup
Data Structure
Outgoing
Port
Forwarding Table
Dest-network
Port
65.0.0.0/8
3
128.9.0.0/16
1
149.12.0.0/19
7
103
The Search Operation is not a
Direct Lookup
(Incoming port, label)
Memory
(Outgoing port, label)
IP addresses: 32 bits long  4G entries
104
The Search Operation is also
not an Exact Match Search
Exact match search: search for a key in
a collection of keys of the same length.
Relatively well studied data structures:
 Hashing
 Balanced binary search trees
105
Example Forwarding Table
Destination IP Prefix
Outgoing Port
65.0.0.0/8
3
128.9.0.0/16
Prefix length
142.12.0.0/19
1
7
IP prefix: 0-32 bits
65.0.0.0/8
0
65.0.0.0
128.9.0.0/16
128.9.16.14
224
65.255.255.255
142.12.0.0/19
232-1
106
Prefixes can Overlap
Longest
matching prefix
128.9.176.0/24
128.9.16.0/21 128.9.172.0/21
65.0.0.0/8
0
128.9.0.0/16
128.9.16.14
142.12.0.0/19
232-1
Routing lookup: Find the longest matching
prefix (aka the most specific route) among all
prefixes that match the destination address.
107
Prefix Length
Difficulty of Longest Prefix
Match
32
24
128.9.176.0/24
128.9.16.0/21
128.9.172.0/21
142.12.0.0/19
128.9.0.0/16
8
65.0.0.0/8
128.9.16.14
Prefixes
108
Lookup Rate Required
Year
Line
Line-rate
(Gbps)
40B
packets
(Mpps)
1998-99
OC12c
0.622
1.94
1999-00
OC48c
2.5
7.81
2000-01
OC192c
10.0
31.25
2002-03
OC768c
40.0
125
31.25 Mpps  33 ns
DRAM: 50-80 ns, SRAM: 5-10 ns
109
Number of Prefixes
Size of the Forwarding Table
100000
90000
80000
70000
60000
50000
10,000/year
40000
30000
20000
10000
0
95
96
97
Year
98
99
Source: http://www.telstra.net/ops/bgptable.html
00
110
Longest Prefix Match is Harder
than Exact Match
 The destination address of an arriving
packet does not carry with it the
information to determine the length of the
longest matching prefix
 Hence, one needs to search among the
space of all prefix lengths; as well as the
space of all prefixes of a given length
111
LPM in IPv4
Use 32 exact match algorithms for LPM!
Exact match
against prefixes
of length 1
Network Address
Exact match
against prefixes
of length 2
Priority
Encode
and pick
Port
Exact match
against prefixes
of length 32
112
Metrics for Lookup Algorithms
 Speed (= number of memory accesses)
 Storage requirements (= amount of memory)
 Low update time (support >10K updates/s)
 Scalability
 With length of prefix: IPv4 unicast (32b), Ethernet
(48b), IPv4 multicast (64b), IPv6 unicast (128b)
 With size of routing table: (sweetspot for today’s
designs = 1 million)
 Flexibility in implementation
 Low preprocessing time
113
Radix Trie
Trie node
A
1
P1
111*
H1
P2
10*
H2
P3
1010*
H3
P4
10101
H4
C
P2
G
P3
Lookup 10111
next-hop-ptr (if prefix)
right-ptr
left-ptr
B
1
0
1
D
1
E
0
1
Add P5=1110*
0
P4
H
P5
P1
F
I
114
Radix Trie
 W-bit prefixes: O(W) lookup, O(NW)
storage and O(W) update complexity
Advantages
Disadvantages
Simplicity
Worst case lookup slow
Wastage of storage space in
chains
Extensible to wider fields
115
Leaf-pushed Binary Trie
Trie node
A
1
P1
111*
H1
P2
10*
H2
P3
1010*
H3
P4
10101
H4
C
P2
G
B
1
0
1
0
left-ptr or
next-hop
right-ptr or
next-hop
D
P1
E
P2
P3 P4
116
Practical Algorithm To Retrieve
Information Coded In Alphanumeric
PATRICIA
B
D
2
0
3
0
1
A
Patricia tree internal node
1
P1
E
C
bit-position
right-ptr
left-ptr
5
P2
F
P1
111*
H1
P2
10*
H2
P3
1010*
H3
P4
10101
H4
0
P3
1
P4
G
Lookup 10111
Bitpos 12345
117
PATRICIA
• W-bit prefixes: O(W2) lookup, O(N)
storage and O(W) update complexity
Advantages
Disadvantages
Decreased storage
Extensible to wider fields
Worst case lookup slow
Backtracking makes
implementation complex
118
Path-compressed Tree
Lookup 10111
1, , 2
0
0
10,P2,4
1
P4
111*
H1
P2
10*
H2
P3
1010*
H3
P4
10101
H4
1
P1
C
D
1010,P3,5
P1
B
A
E
Path-compressed tree node structure
variable-length next-hop (if
prefix present)
bitstring
left-ptr
bit-position
right-ptr
119
Path-compressed Trie
• W-bit prefixes: O(W) lookup, O(N)
storage and O(W) update complexity
Advantages
Disadvantages
Decreased storage
Worst case lookup slow
120
Binary Search by Prefix Intervals
{}, 1, 10, 100, 101, 1110
1: 10000-11111
{}: 00000-11111
10: 10000-10111
101: 10100-10111 1110: 11100-11101
100: 10000-10011
(
(
(
(
)
(
)
)
(
)
)
)
00000
10000
10000
10000
10011
10100
10111
10111
11100
11101
11111
11111
1100
121
Rules
 If the destination address we search for
fits to the right of a left paranthesis, the
prefix represented by the left paranthesis
is the longest match
 Otherwise, increment for each right
paranthesis, decrement for each left
paranthesis until the count reaches -1
122
Example revisited
(
(
(
(
)
(
)
)
(
)
)
)
00000
10000
10000
10000
10011
10100
10111
10111
11100
11101
11111
11111
1100
-1
0
1
2
1
2
1
1100 is in the interval 10000-11111, longest prefix
match is for the prefix 1
123
Binary Search on Intervals
•W-bit N prefixes: O(logN) lookup,
O(N) storage
Advantages
Disadvantages
Storage is linear
Can be ‘balanced’
Lookup time independent
of W
But, lookup time is dependent
on N
Incremental updates more
complex than tries
Each node is big in size:
requires higher memory
bandwidth
124