IP Routing Protocols, BGP, Longest Prefix Match
Download
Report
Transcript IP Routing Protocols, BGP, Longest Prefix Match
Week 7
Routing Protocols
1
Intra-AS Routing
Also known as Interior Gateway Protocols (IGP)
Most common Intra-AS routing protocols:
RIP: Routing Information Protocol
OSPF: Open Shortest Path First
IGRP: Interior Gateway Routing Protocol (Cisco
proprietary)
2
RIP ( Routing Information Protocol)
Distance vector algorithm
Included in BSD-UNIX Distribution in 1982
Distance metric: # of hops (max = 15 hops)
u
v
A
z
C
B
D
w
x
y
destination hops
u
1
v
2
w
2
x
3
y
3
z
2
3
RIP advertisements
Distance vectors: exchanged among
neighbors every 30 sec via Response
Message (also called advertisement)
Each advertisement: list of up to 25
destination nets within AS
4
RIP: Example
z
w
A
x
D
B
y
C
Destination Network
w
y
z
x
….
Next Router
Num. of hops to dest.
….
....
A
B
B
--
2
2
7
1
Routing table in D
5
RIP: Example
Dest
w
x
z
….
Next
C
…
w
hops
4
...
A
Advertisement
from A to D
z
x
Destination Network
w
y
z
x
….
D
B
C
y
Next Router
Num. of hops to dest.
….
....
A
B
B A
--
Routing table in D
2
2
7 5
1
6
RIP: Link Failure and Recovery
If no advertisement heard after 180 sec -->
neighbor/link declared dead
routes via neighbor invalidated
new advertisements sent to neighbors
neighbors in turn send out new advertisements (if
tables changed)
link failure info quickly propagates to entire net
poison reverse used to prevent ping-pong loops
(infinite distance = 16 hops)
7
OSPF (Open Shortest Path First)
“open”: publicly available
Uses Link State algorithm
LS packet dissemination
Topology map at each node
Route computation using Dijkstra’s algorithm
OSPF advertisement carries one entry per neighbor
router
Advertisements disseminated to entire AS (via
flooding)
Carried in OSPF messages directly over IP (rather than TCP
or UDP
8
OSPF “advanced” features (not in RIP)
Security: all OSPF messages authenticated (to
prevent malicious intrusion)
Multiple same-cost paths allowed (only one path in
RIP)
For each link, multiple cost metrics for different
TOS (e.g., satellite link cost set “low” for best effort;
high for real time)
Integrated uni- and multicast support:
Multicast OSPF (MOSPF) uses same topology data
base as OSPF
Hierarchical OSPF in large domains.
9
Hierarchical OSPF
10
Hierarchical OSPF
Two-level hierarchy: local area, backbone.
Link-state advertisements only in area
each nodes has detailed area topology; only know
direction (shortest path) to nets in other areas.
Area border routers: “summarize” distances to nets
in own area, advertise to other Area Border routers.
Backbone routers: run OSPF routing limited to
backbone.
Boundary routers: connect to other AS’s.
11
Internet inter-AS routing: BGP
BGP (Border Gateway Protocol): the de
facto standard
BGP provides each AS a means to:
1.
2.
3.
Obtain subnet reachability information from
neighboring ASs.
Propagate the reachability information to all
routers internal to the AS.
Determine “good” routes to subnets based on
reachability information and policy.
Allows a subnet to advertise its existence
to rest of the Internet: “I am here”
12
BGP basics
Pairs of routers (BGP peers) exchange routing info over semi-
permanent TCP conctns: BGP sessions
Note that BGP sessions do not correspond to physical links.
When AS2 advertises a prefix to AS1, AS2 is promising it will
forward any datagrams destined to that prefix towards the
prefix.
AS2 can aggregate prefixes in its advertisement
3c
3a
3b
AS3
1a
AS1
2a
1c
1d
1b
2c
AS2
2b
eBGP session
iBGP session
13
Distributing reachability info
With eBGP session between 3a and 1c, AS3 sends prefix
reachability info to AS1.
1c can then use iBGP do distribute this new prefix reach info
to all routers in AS1
1b can then re-advertise the new reach info to AS2 over the
1b-to-2a eBGP session
When router learns about a new prefix, it creates an entry
for the prefix in its forwarding table.
3c
3a
3b
AS3
1a
AS1
2a
1c
1d
1b
2c
AS2
2b
eBGP session
iBGP session
14
Outline
1. Highlights
2. Addressing and CIDR
3. BGP Messages and Prefix Attributes
4. BGP Decision and Filtering Processes
5. I-BGP
6. Route Reflectors
7. Multihoming
8. Aggregation
9. Routing Instability
10. BGP Table Growth
15
Internet Topology
large ISP
large ISP
medium ISP
dialup
ISP
dedicated
access ISP
• AS (Autonomous System) - a collection of routers under the same technical and
administrative domain.
• EGP (External Gateway Protocol) - used between two AS’s to allow them to exchange
routing information so that traffic can be forwarded across AS borders. Example: BGP
16
Purpose: to share connectivity
information
you can reach
net A via me
AS2
BGP
AS1
R3
R2
traffic to A
R1
table at R1:
dest next hop
A
R2
A
R
border router
internal router
17
BGP Sessions
One router can participate in many BGP sessions.
Initially … node advertises ALL routes it wants
neighbor to know (could be >50K routes)
Ongoing … only inform neighbor of changes
AS1
BGP Sessions
AS3
AS2
18
Routing Protocols
IGP: Interior Gateway Protocol.
Examples: IS-IS, OSPF
I-BGP
R3
R2
A
AS1
E-BGP
announce B
AS2
R1
AS3
R5
R4
R
border router
internal router
B
19
Configuration and Policy
A BGP node has a notion of which routes to
share with its neighbor. It may only
advertise a portion of its routing table to a
neighbor.
A BGP node does not have to accept every
route that it learns from its neighbor. It can
selectively accept and reject messages.
What to share with neighbors and what to
accept from neighbors is determined by the
routing policy, that is specified in a router’s
configuration file.
20
History
Popularity:
until
a number of years ago BGP fairly unknown.
Used by small number of large ISPs.
in 1995 (beginning of web popularity) number of
organizations using BGP grew tremendously.
Two reasons for growth of usage and
interest:
significant growth in number of ISPs;
organizations were born that had missioncritical dependence upon their connectivity
CIDR (Classless Inter-Domain Routing)
introduced in 1995
21
Assigning IP address and AS
numbers (Ideally)
A host gets its IP address from the IP address
block of its organization
An organization gets an IP address block from its
ISP’s address block
An ISP gets its address block from its own
provider OR from one of the 3 routing registries:
ARIN: American Registry for Internet Numbers
RIPE: Reseaux IP Europeens
APNIC: Asia Pacific Network Information Center
Each AS is assigned a 16-bit number (65536 total)
Currently 10,000 AS’s in use
22
Addressing Schemes
Original addressing schemes (class-based):
32
bits divided into 2 parts:
Class A
0
8
0 network
Class B
0
0 network
Class C
0
0 network
host
16
host
24
host
~2 million nets
256 hosts
CIDR introduced to solve 2 problems:
exhaustion of IP address space
size and growth rate of routing table
23
Problem #1: Lifetime of
Address Space
Example: an organization needs 500 addresses.
A single class C address not enough (256 hosts).
Instead a class B address is allocated. (~64K
hosts) That’s overkill -a huge waste.
CIDR allows networks to be assigned on
arbitrary bit boundaries.
permits arbitrary sized masks: 178.24.14.0/23 is valid
requires explicit masks to be passed in routing
protocols
CIDR solution for example above: organization is
allocated a single /23 address (equivalent of 2
class C’s).
24
Problem #2: Routing Table Size
Without CIDR:
232.71.0.0
232.71.1.0
232.71.2.0
…..
232.71.255.0
service
provider
232.71.0.0
232.71.1.0
232.71.2.0
…..
232.71.255.0
Global
internet
With CIDR:
232.71.0.0
232.71.1.0
232.71.2.0
…..
232.71.255.0
service
provider
232.71.0.0/16
Global
internet
25
CIDR: Classless Inter-Domain
Routing
Address format <IP address/prefix P>.
The prefix denotes the upper P bits of the IP
address.
Idea - use aggregation - provide routing for a
large number of customers by advertising one
common prefix.
This is possible because nature of addressing is
hierarchical
Summarizing routing information reduces the size
of routing tables, but allows to maintain
connectivity.
Aggregation is critical to the scalability and
survivability of the Internet
26
Address Arithmetic: Address
Blocks
The <address/prefix> pair defines an address block:
Examples:
128.15.0.0/16 => [ 128.15.0.0 - 128.15.255.255 ]
188.24.0.0/13 => [ 188.24.0.0 - 188.31.255.255 ]
consider 2nd octet in binary:
198.00011000.0.0
settable
Address block sizes
13th bit
a /13 address block has 232-13 addresses (/16 has 232-16)
a /13 address block is 8 times as big as a /16 address block
because 232-13 = 232-16 * 23
27
CIDR: longest prefix match
Because prefixes of arbitrary length allowed,
overlapping prefixes can exist.
Example:
router hears 124.39.0.0/16 from one neighbor
and 124.39.11.0/24 from another neighbor
Router forwards packet according to most specific
forwarding information, called longest prefix match
Packet with destination 124.39.11.32 will be forwarded using
/24 entry.
Packet w/destination 124.39.22.45 will be forwarded using
/16 entry
28
Will CIDR work ?
For CIDR to be successful need:
address registries must assign addresses using
CIDR strategy
providers and subscribers should configure
their networks, and allocate addresses to allow
for a maximum amount of aggregation
BGP must be configured to do aggregation as
much as possible
Factors that complicate achieving
aggregation
multihoming, proxy aggregation, changing
providers
29
Four Basic Messages
Open:
Establishes BGP session (uses TCP port #179)
Notification:
Report unusual conditions
Update:
Inform neighbor of new routes that become active
Inform neighbor of old routes that become inactive
Keepalive:
Inform neighbor that connection is still viable
30
UPDATE Message
used to either advertise and/or withdraw
prefixes
path attributes: list of attributes that
pertain to ALL the prefixes in the
Reachability Info field
FORMAT:
Withdrawn routes length (2 octets)
Withdrawn routes
(variable length)
Total path attributes length (2 octets)
Path Attributes
(variable length)
Reachability Information (variable length)
33
ATTRIBUTES
ORIGIN:
Who originated the announcement? Where was a prefix
injected into BGP?
IGP, EGP or Incomplete (often used for static routes)
AS-PATH:
a list of AS’s through which the announcement for a
prefix has passed
each AS prepends its AS # to the AS-PATH attribute
when forwarding an announcement
useful to detect and prevent loops
Prefix
128.73.4.21/21
Next hop
232.14.63.4
AS Path
1239 701 3985 631
35
AS-Path Attribute
Sequence of ASes a route
has traversed
Loop detection
Apply policy
AS 200
AS 100
170.10.0.0/16
180.10.0.0/16
Network
Path
180.10.0.0/16 300 200 100
170.10.0.0/16 300 200
AS 300
AS 400
150.10.0.0/16
AS 500
Network
180.10.0.0/16
170.10.0.0/16
150.10.0.0/16
Path
300 200 100
300 200
300 400
36
Loop Handling
172.16.10.0/24 -- 1
AS2
172.16.10.0/24 – 2 1
AS1
172.16.10.0/24
AS3
172.16.10.0/24 – 4 3 2 1
AS4
172.16.10.0/24 – 3 2 1
37
AS_path Manipulation
192.213.1.0/24 – 200 50
192.213.1.0/24 -- 50
AS50
AS200
AS300
192.213.1.0/24 – 300 200 50
192.213.1.0/24
192.213.1.0/24 – 50 50 50
AS100
192.213.1.0/24 – 100 50 50 50
38
Attribute: NEXT HOP
For EBGP session, NEXT HOP = IP address
of neighbor that announced the route.
For IBGP sessions, if route originated inside
AS, NEXT HOP = IP address of neighbor
that announced the route
For routes originated outside AS, NEXT
HOP of EBGP node that learned of route, is
carried unaltered into IBGP.
209.15.1.0/24
1.1.1.1 A
3.3.3.3 D
C
2.2.2.2
IBGP
B
140.20.1.0/24
BGP Table at Router C:
Destination
Next Hop
140.20.1.0/24
2.2.2.2
209.15.1.0/24
1.1.1.1
IP Routing Table at Router C:
Destination
Next Hop
140.20.1.0/24
2.2.2.2
2.2.2.0/24
3.3.3.3
3.3.3.0/24
Connected
209.15.1.0/24
1.1.1.1
1.1.1.0/24
3.3.3.3
39
Attribute: Multi-Exit
Discriminator (MED)
when AS’s interconnected via
2 or more links
Hint to external neighbors
about the preferred path
into an AS that has multiple
entry points
AS announcing prefix sets
MED
enables AS2 to indicate its
preference
AS receiving prefix uses
MED to select link
a way to specify how close a
prefix is to the link it is
announced on
a lower value is preferred
called an external metric
AS1
Link B
Link A
MED=50
MED=10
AS2
AS4
AS3
40
Attribute: Local Preference
Used to indicate
preference among
multiple paths for the
same prefix anywhere in
the internet.
The higher the value the
more preferred
Exchanged between
IBGP peers only. Local to
the AS.
Often used to select a
specific exit point for a
particular destination
140.20.1.0/24
AS1
AS3
AS2
AS4
BGP table at AS4:
Destination
AS Path
Local Pref
140.20.1.0/24
AS3 AS1
300
140.20.1.0/24
AS2 AS1
100
41
128.213.0.0/16
WINET
Example
ZNET
XNET
128.213.0.0/16
YNET
128.213.0.0/16
T3
T1
Set local preference to 200
Set local preference to 300
SJ
LA
ANET
Affects the traffic going out of the AS
42
Routing Process Overview
Routes
received
from
peers
accept,
deny, set
preferences
Input
policy
engine
Choose
best route
Decision
process
BGP table
forward,
not forward
set MEDs
Routes
used by
router
Output
policy
engine
Routes
sent to
peers
IP routing
table
Page 145 of Halabi
43
Input Policy Engine
Inbound filtering controls outbound traffic
filters route updates received from other peers
filtering based on IP prefixes, AS_PATH, community
denying a prefix means BGP does not want to reach that
prefix via the peer that sent the announcement
accepting a prefix means traffic towards that prefix may
be forwarded to the peer that sent the announcement
Attribute Manipulation
sets attributes on accepted routes
example: specify LOCAL_PREF to set priorities among
multiple peers that can reach a given destination
44
BGP Decision Process
1. Choose route with highest LOCAL-PREF
2. If have more than 1 route, select route with shortest
AS-PATH
3. If have more than 1 route, select according to lowest
ORIGIN type where IGP < EGP < INCOMPLETE
4. If have more than 1 route, select route with lowest
MED value
5. Select min cost path to NEXT HOP using IGP metrics
6. If have multiple internal paths, use BGP Router ID to
break tie.
45
Output Policy Engine
Outbound Filtering controls inbound traffic
forwarding a route means others may choose to
reach the prefix through you
not forwarding a route means others must use
another router to reach the prefix
may depend upon whether E-BGP or I-BGP peer
example: if ORIGIN=EGP and you are a nontransit AS and BGP peer is non-customer, then
don’t forward
Attribute Manipulation
sets attributes such as AS_PATH and MEDs
46
Transit vs. Nontransit AS
Transit traffic = traffic whose source and destination are outside the AS
Nontransit AS: does not carry transit traffic
• Advertise own routes only
• Do not propagate routes learned from other AS’s
• case 1:
r1
Transit AS: does carry transit traffic
• Advertises its own routes PLUS routes
learned from other AS’s
ISP1
ISP1
r3 ISP2
r1
r3
r2
r2
r2
r1
ISP2
r3
r2,r3
r2
r2,r1
AS1
AS1
r2
r3
AS2
r1
• case 2:
r3
r1
r1
r3
r2
AS1
r2
47
Clients, Providers and Peers
AS has customers,
providers and peers
Relationships between
AS pairs:
customer-provider
peer-to-peer
Type of relationship
influences policies
Exporting to provider:
AS exports its routes & its
customer’s routes, but not
routes learned from other
providers or peers
Exporting to peer:
(same as above)
Exporting to customer:
AS exports its routes plus
routes learned from its
providers and peers
48
Internal BGP (I-BGP)
Used to distribute routes learned via EBGP to all the
routers within an AS
I-BGP and E-BGP are same protocol in that
same message types used
same attributes used
same state machine
BUT use different rules for readvertising prefixes
Rule #1: prefixes learned from an E-BGP neighbor can
be readvertised to an I-BGP neighbor, and vice versa
Rule #2: prefixes learned from an I-BGP neighbor
cannot be readvertised to another I-BGP neighbor
49
I-BGP: Preventing Loops and
Setting Attributes
Why rule #2? To prevent announcements from
looping.
In E-BGP can detect via AS-PATH.
AS-PATH not changed in I-BGP
Implication of rule: a full mesh of I-BGP sessions
between each pair of routers in an AS is required
Setting Attributes: The router that injects the route
into the I-BGP mesh is responsible for
setting the LOCAL-PREF attribute
prepending AS # to AS-PATH
50
Route Reflectors
Problem: requiring a full mesh
of I-BGP sessions between all
pairs of routers is hard to
manage for large AS’s.
Solution:
I-BGP Peering
RR
group routers into clusters.
Assign a leader to each
cluster, called a route
reflector (RR).
Members of a cluster are
called clients of the RR
clients peer only with their
RR
RR’s must be fully meshed
RR
RR
A
B
C
clusters
clients
51
Route Reflectors: Rule on
Announcements
If received from RR, reflect to clients
If received from a client, reflect to RRs
and clients
If received from E-BGP, reflect to all RRs and clients
RR’s reflect only the best route to a given
prefix, not all announcements they receive.
helps
size of routing table
sometimes clients don’t need to carry full table
52
Default Routes
If you don’t have a
routing entry in the
table for a
destination, send it
along the default
route
Can be statically
configured
Can be learned via BGP
Can have multiple
defaults
use LOCAL_PREF to
distinguish primary and
backup default routes
AS2
0.0.0.0/0
next_hop=1.1.1.1
Local pref=300
0.0.0.0/0
next_hop=2.2.2.2
Local pref=100
AS1
53
Multihoming
54
Single-homed vs. Multi-homed
subscribers
A single-homed network has one connection to the
Internet (i.e., to networks outside its domain)
A multi-homed network has multiple connections to
the Internet. Two scenarios:
can be multi-homed to a single provider
can be multi-homed to multiple providers
Why multi-home?
Reliability
Performance
• a site’s bandwidth to internet is sum of bandwidth on all
links
55
Single-homed AS
Subscriber called a “stub AS”
Provider-Subscriber communication
for route advertisement:
static configuration. most common.
• Provider’s router is configured with
subscriber’s prefix.
• good if customer’s routes can be
represented by small set of aggregate
routes
• bad if customer has many noncontiguous
subnets
Provider
R1
R2
Subscriber
can use BGP between provider and stub
AS
56
Multihoming to a Single
Provider: 4 scenarios
ISP
ISP
R1
R1
R2
R2
Customer
R3
Customer
ISP
R1
ISP
R2
R3
Customer
R1
R2
R3
R4
Customer
57
Multihoming to Multiple
Providers
ISP 3
ISP 2
ISP 1
Customer
58
Multihoming Issues
Load sharing
how
distribute the traffic over the multiple
links?
Reliability
if load sharing leads to preferering certain
links for certain subnets, is reliability reduced?
Address/Aggregation
which subnet addresses should the multihomed
customer use?
how will this affect its provider’s ability to
aggregate routes?
59
Load sharing from Customer to
ISP using policy
Goal: send traffic to
ISP’s customers on one
link; send traffic to
the rest of the
Internet on 2nd link
Implement using policy
to control
announcements
blue: announcements
red: traffic
ISP
R1
R2
advertise
customer
routes only
advertise
default
route 0/0
R3
traffic
Customer
60
Load sharing from ISP to
customer using attributes
Goal: provider splits
traffic across 2 links
according to prefix
Implement this
strategy using
attributes
customer sets MEDs
provider sets
LOCAL_PREF
61
Example
C4
C3
C2
C5
SF
NY
X:200
Y:200
Rest:300
C3:300
C2:300
Rest:200
(X,Y)
W:200
Z:200
Rest:250
SF
NY
C5:300
C4:300
Rest:250
(Z,W)
IBGP
62
Example
ISP
R1
140.35/16
208.22/16
R2
R3
Customer
208.22/16
140.35/16
63
Address/Aggregation Issue
Where should the
customer get its
address block from?
1. From ISP1
2. From ISP2
3. From both ISP1 and
ISP2
4. Independently from
address registry
ISP 3
ISP 1
ISP 2
customer
(cases 1 and 2 are equivalent)
64
Case 1 & 2: Get address block
from one ISP
example: customer gets
address from ISP 1
ISP 1’s aggregation is not
broken
customer’s prefix not
aggregatable at ISP 2
longer prefix becomes a
traffic magnet
How good is load sharing?
If all ISP’s generate same
amount of traffic for
customer, then ISP2customer link twice as loaded
as ISP1-customer link
ISP 3
140.20/16
200.50/16
140.20.6/24
ISP 1
ISP 2
140.20/16
200.50/16
140.20.6/24
140.20.6/24
customer
140.20.6/24
65
Case 3: Get address block from
both ISPs
announcement policy:
announce prefix only
to its “parent”
advantage: both ISP’s
can aggregate the
prefix they receive
disadvantage: lose
reliability
load balancing good?
depends upon how
much traffic sent to
each prefix
ISP 3
ISP 1
ISP 2
140.20/16
200.50/16
140.20.1/24
200.50.1/24
customer
140.20.1/24
200.50.1/24
66
Case 4: obtain address block
from registry
no aggregation possible
no traffic magnets created
ISP 3
good reliability
how achieve load sharing?
customer breaks address
block into 2 /25 blocks,
and announce one per link
(but may lose reliability)
OR use method of
AS-PATH manipulation
100.20/16
200.50/16
150.55.10/24
150.55.10/24
ISP 1
ISP 2
100.20/16
200.50/16
150.55.10/24
150.55.10/24
customer
150.55.10/24
67
AS-PATH manipulation
Idea: prepend your AS
number in AS-PATH
multiple times to
discourage use of a link
makes AS-PATH seem
longer than it is
recall BGP decision
process uses shortest
AS-PATH length as a
criteria for selecting
best path
Example: ISP 3 will
choose path through AS
2 because its AS-PATH
appears shorter
ISP 3
150.55.10/24 - 2 33
150.55.10/24 - 1 33 33
AS 2
AS 1
150.55.10/24 - 33 33
150.55.10/24 - 33
customer
150.55.10/24
AS 33
68
Aggregation
69
Address Arithmetic:
When is aggregation possible?
Case 1
Possible when one prefix contained in
another.
Example: Two AS’s having customer-provider
relationship. Provider does the aggregation.
Provider
has address block 18.0.0.0/8
Its customers have address blocks 18.6.0.0/15
and 18.9.0.0/15
Provider announces its address block only
Rule: Prefix p1 contains prefix p2 iff
length(p2) > length(p1) AND
address(p2) / 232-length(p1) = address(p1) / 232length(p1)
70
Address Arithmetic:
When is aggregation possible?
Case 2
Some pairs of consecutive prefixes
Example: routes within the same AS:
AS has 2 address blocks:
1.2.2.0/24 = 0000001.00000010.00000010.00000000/24
1.2.3.0/24 = 0000001.00000010.00000011.00000000/24
can announce instead 1.2.2.0/23
Rule: consecutive prefixes p1 and p2 are aggregatable
iff
length(p1)=length(p2) AND
address(p1) / 232-length(p1) +1 = address(p2) / 232-length(p2) AND
address(p1) % 233-length(p1) = 0
71
Black holes and cardinal sins
“The cardinal sin of BGP
NAP
routing is advertising
routes that you don't
wrong !!
know how to get to;
100.24.0.0/18
called "black-holing"
100.24/16
someone”
100.24.56.0/21
If you announce part of
[100.24.56.0-100.24.63.255]
222.2/16
IP space owned by
Net C
someone else, using a
more-specific prefix, all
their traffic flows to
ISP 1
ISP 2
you. Makes that address
100.24/16
222.2/16
block disconnected from
the Internet.
Example: black holes can
100.24.16.0/21
100.24.0.0/20
happen inadvertently by
[100.24.0.0-100.24.15.255]
non-careful aggregation [100.24.16.0-100.24.23.255]
Net B
Net A
73
Limitations to Aggregation
Lack of hierarchical allocation of address
space prior to CIDR (before 1995)
A single AS can have noncontiguous address
blocks
Customer AS and provider AS can have noncontiguous address blocks
Reluctance of customers to renumber their
address space when they switch providers
Multi-homing
multi-homed prefixes require global visibility
may choose not to: load sharing
74
Rules of Thumb for
Aggregation
To avoid black holes: an ISP is not allowed
to aggregate routes unless it is a supernet
of the address block it wants to aggregate
In other words, an ISP has to specifically
announce each IP address range of its
downstream customers that are not part of
its own address space.
75
Routing Instability
76
Route Stability
Routing instability: rapid fluctuation of network
reachability information
route flapping: when a route is withdrawn and
re-announced repeatedly in a short period of time
happens via UPDATE messages
because messages propagate to global Internet, route
flapping behavior can cascade and deteriorate routing
performance in many places
Effects: increased packet loss, increased network
latency, CPU overhead, loss of connectivity
Route Stability Studies by C. Labovitz, R. Malan & F. Jahanian
77
Types of Routing Updates
Forwarding instability
reflects legitimate topology changes
e.g., changes in Prefix, NEXT_HOP and/or ASPATH
affects forwarding paths used
Policy fluctuation
reflects changes in policy
e.g., changes in MED, LOCAL_PREF, etc.
may not necessarily affect forwarding paths used
Pathological
redundant messages
reflect neither topology nor policy changes
78
Anecdotes of Route Flap
Storms
April 25, 1997 - small Virginia ISP injected
incorrect map into global Internet. Map said
Virginia ISP had optimal connectivity to all
destinations. Everyone sent their traffic to this
ISP. Result: shutdown of Tier-1 ISPs for 2 hours.
August 14, 1998 - misconfigured database server
forwarded all queries to “.net” to wrong server.
Result: loss of connectivity to all .net servers for
few hours.
Nov. 8, 1998 - router software bug led to
malformed routing control message. Caused
interoperability problem between Tier-1 ISPs.
Result: persistent pathological oscillations and
connectivity loss for several hours.
79
Taxonomy (as per Labovitz et.
al.) Name Type
Character
WADiff
AADiff
WADup
AADup
WWDup
Explicit withdrawal followed by
announcement. Replace route
with different path
Announced twice (implicit
withdrawal). Replace route
with different path.
Explicit withdrawal followed by
announcement. Replace route
with same path.
Announced twice (implicit
withdrawal). Replace route
with same path.
Repeated duplicate
withdrawals
Legitimate
Legitimate
Legitimate or
pathological
Policy change or
pathological
Pathological
80
General Statistics
1996: 3-5 million updates per day in Internet core
1998: 300K-700K updates per day in Internet core
1996: average number of announcements per day
was ~275K.
1998: average number of announcements per day
was ~400K
Correlation of instability and usage
instability highest during business hours
instability lowest during nights, on weekends and in
summer
81
Per Event Type Statistics
1996 relative impact
(approximately):
WWDup (96%),
AADup (2%),
WADup (1%),
AADiff(1/2%),
WADiff (1/2%)
1998 relative impact
(approximately):
AADup (30-40%),
WWDup(25-30%),
AADiff (~20%)
other (rest)
82
Who’s Responsible?
AS’s
No single AS dominates instability statistics
No correlation between the size of an AS and
its share of updates generated.
Prefixes
Instability is evenly distributed across routes.
Example of measurements:
• 75% of AADiff events come from prefixes change
less than 10 times a day.
• 80-90% of instability comes from prefixes that are
announced less than 50 times/day.
83
Sources of Instabilities in
General
router configuration errors
transient physical and data link problems
software bugs
problems with leased lines (electrical
timing issues that cause false alarms of
disconnect)
router failures
network upgrades and maintenance
84
Instability Problem and Cause.
Example 1.
Problem: 3-5 million duplicate withdrawals
Cause: stateless BGP implementation
time-space tradeoff: no state maintained on what
advertised to peers
when receive any change, transmit withdrawal to all peers
regardless of whether previously notified or not
sent updates for both explicit and implicit withdrawals
By 1998, most vendors had BGP implementations
with partial state.
Result: number of WWDups reduced by an order
of magnitude
85
Instability Problem and Cause.
Example 2
Problem: duplicate announcements
Cause: min-advertisement timer &
stateless BGP
min-adv timer: wait 30 seconds. Combine all
received updates in last 30 seconds into single
outbound update message (if possible).
within 30 seconds route can be withdrawn and
re-announced so that there is no net change to
original announcement
Solution: Have BGP keep some state about
recently sent messages to peers. Avoid
sending duplicate messages
86
Instability Problem and Cause.
Example 3
AS 1
Net X
R2
R8
R1
R3
4
R4
AS3
R6
3
R7
6
5
10
2
R5
AS2
R
Example: interaction IGP/BGP
policy: set MED using IGP metrics,
such as shortest path
border router
internal router
E-BGP
IGP
87
Controlling route instability:
Route Dampening
track number of times a route has flapped
over a period of time
routes that exhibit a high level of
instability in a short period of time should
be suppressed (not advertised)
penalize ill behaved routes proportionally
to their expected future stability
if a suppressed route stops flapping for a
long enough period of time, unsuppress it
(readvertise)
88
Route Dampening
penalty
suppress limit
reuse limit
time
89
Route Dampening Algorithm
Each time a route flaps, increase the
penalty
If the route has not flapped in the last
‘half-life’ time units, then cut penalty in
half.
If the penalty > suppress limit, then
suppress the route
If the penalty < reuse limit, then free a
suppressed route
90
Side Effect of Route
Dampening
A legitimate update may arrive and it will
be ignored because that route has been
suppressed and not yet released.
The modification needed for the legitimate
announcement is delayed
91
Aggregation can help route
flapping
If a more-specific route is
flapping, but provider only
announces aggregated
prefix, then other networks
don’t see route flap.
Hence aggregation can mask
route flapping.
Aggregation helps combat
instability because it
reduces the number of
networks visible in the core
Internet.
Tier 1 provider
not flapping
140.40/16
Tier 2 provider
flapping
140.40.4/24
customer
92
Growth of BGP Tables
93
Long Term Growth Trends
in Internet Routing
Will this routing system be able to scale
and meet the growth of the Internet and
its ever-expanding level of demands?
Are there any inherent limitations?
As more devices connect to Internet and
consume addresses, the need to maintain
reachability to these addresses implies larger
routing tables
What is the ability of the system to
produce a stable view of the overall
network topology?
94
BGP Table Growth (1989-2001)
95
Reasons for Exponential
Growth
Measures
Growth
Number of AS’s
Growing exponentially - 50%
yearly
Range of addresses spanned by Growing 7% yearly
table
Average span of route table
entry
Growing 1 bit per 29 months
Holes in the table
Currently at 37%
Number of announcements
Growing 42% yearly
Data in last 3 slides from Geoff Huston’s INET publications
97
Increasing fine levels of
routing details
in BGP table
AS space growth at
50%
&
addresses spanned by
the table growing at 7%
=> each AS advertising
smaller address ranges
/24 is fastest growth
prefix in table (in
absolute terms). /24 /31 is area with fastest
relative growth
1999: average span of
prefix 16,000
addresses (mean
prefix length 18.03)
2000: average span of
prefix 12,000
addresses (mean
prefix length 18.44)
98
Holes
When both aggregated prefix and a more-
specific prefix exist in the table, the
more-specific prefix is called a “hole”.
Why are holes created?
Punch hole in policy of larger aggregated
announcement to create a different policy for
finer announcement.
Load sharing in multihoming scenario
reliability via multihoming
37% of BGP table is holes.
99
Multihoming vs. Resiliency
Multiple peering relationships can be
cheaper than using a single upstream
provider
implies: multihoming is seen as a substitute for
upstream service resiliency
Impact
providers can’t command a price for reliability,
and thus don’t spend money to engineer it into
network.
resiliency is becoming responsibility of customer
not provider
Can BGP scale adequately to continue to
undertake this role?
100
Conclusions
Things are getting better (stability)
router software and configuration management
are maturing
increased emphasis on aggregation and route
dampening are helping
Things are getting worse (scalability)
multihoming is still growing
internet topology growing less hierarchical
‘noise’ in BGP table is growing
101
Longest Prefix Match
Algorithms
102
Forwarding Engine
Packet
payload
header
Router
Destination
Address
Routing Lookup
Data Structure
Outgoing
Port
Forwarding Table
Dest-network
Port
65.0.0.0/8
3
128.9.0.0/16
1
149.12.0.0/19
7
103
The Search Operation is not a
Direct Lookup
(Incoming port, label)
Memory
(Outgoing port, label)
IP addresses: 32 bits long 4G entries
104
The Search Operation is also
not an Exact Match Search
Exact match search: search for a key in
a collection of keys of the same length.
Relatively well studied data structures:
Hashing
Balanced binary search trees
105
Example Forwarding Table
Destination IP Prefix
Outgoing Port
65.0.0.0/8
3
128.9.0.0/16
Prefix length
142.12.0.0/19
1
7
IP prefix: 0-32 bits
65.0.0.0/8
0
65.0.0.0
128.9.0.0/16
128.9.16.14
224
65.255.255.255
142.12.0.0/19
232-1
106
Prefixes can Overlap
Longest
matching prefix
128.9.176.0/24
128.9.16.0/21 128.9.172.0/21
65.0.0.0/8
0
128.9.0.0/16
128.9.16.14
142.12.0.0/19
232-1
Routing lookup: Find the longest matching
prefix (aka the most specific route) among all
prefixes that match the destination address.
107
Prefix Length
Difficulty of Longest Prefix
Match
32
24
128.9.176.0/24
128.9.16.0/21
128.9.172.0/21
142.12.0.0/19
128.9.0.0/16
8
65.0.0.0/8
128.9.16.14
Prefixes
108
Lookup Rate Required
Year
Line
Line-rate
(Gbps)
40B
packets
(Mpps)
1998-99
OC12c
0.622
1.94
1999-00
OC48c
2.5
7.81
2000-01
OC192c
10.0
31.25
2002-03
OC768c
40.0
125
31.25 Mpps 33 ns
DRAM: 50-80 ns, SRAM: 5-10 ns
109
Number of Prefixes
Size of the Forwarding Table
100000
90000
80000
70000
60000
50000
10,000/year
40000
30000
20000
10000
0
95
96
97
Year
98
99
Source: http://www.telstra.net/ops/bgptable.html
00
110
Longest Prefix Match is Harder
than Exact Match
The destination address of an arriving
packet does not carry with it the
information to determine the length of the
longest matching prefix
Hence, one needs to search among the
space of all prefix lengths; as well as the
space of all prefixes of a given length
111
LPM in IPv4
Use 32 exact match algorithms for LPM!
Exact match
against prefixes
of length 1
Network Address
Exact match
against prefixes
of length 2
Priority
Encode
and pick
Port
Exact match
against prefixes
of length 32
112
Metrics for Lookup Algorithms
Speed (= number of memory accesses)
Storage requirements (= amount of memory)
Low update time (support >10K updates/s)
Scalability
With length of prefix: IPv4 unicast (32b), Ethernet
(48b), IPv4 multicast (64b), IPv6 unicast (128b)
With size of routing table: (sweetspot for today’s
designs = 1 million)
Flexibility in implementation
Low preprocessing time
113
Radix Trie
Trie node
A
1
P1
111*
H1
P2
10*
H2
P3
1010*
H3
P4
10101
H4
C
P2
G
P3
Lookup 10111
next-hop-ptr (if prefix)
right-ptr
left-ptr
B
1
0
1
D
1
E
0
1
Add P5=1110*
0
P4
H
P5
P1
F
I
114
Radix Trie
W-bit prefixes: O(W) lookup, O(NW)
storage and O(W) update complexity
Advantages
Disadvantages
Simplicity
Worst case lookup slow
Wastage of storage space in
chains
Extensible to wider fields
115
Leaf-pushed Binary Trie
Trie node
A
1
P1
111*
H1
P2
10*
H2
P3
1010*
H3
P4
10101
H4
C
P2
G
B
1
0
1
0
left-ptr or
next-hop
right-ptr or
next-hop
D
P1
E
P2
P3 P4
116
Practical Algorithm To Retrieve
Information Coded In Alphanumeric
PATRICIA
B
D
2
0
3
0
1
A
Patricia tree internal node
1
P1
E
C
bit-position
right-ptr
left-ptr
5
P2
F
P1
111*
H1
P2
10*
H2
P3
1010*
H3
P4
10101
H4
0
P3
1
P4
G
Lookup 10111
Bitpos 12345
117
PATRICIA
• W-bit prefixes: O(W2) lookup, O(N)
storage and O(W) update complexity
Advantages
Disadvantages
Decreased storage
Extensible to wider fields
Worst case lookup slow
Backtracking makes
implementation complex
118
Path-compressed Tree
Lookup 10111
1, , 2
0
0
10,P2,4
1
P4
111*
H1
P2
10*
H2
P3
1010*
H3
P4
10101
H4
1
P1
C
D
1010,P3,5
P1
B
A
E
Path-compressed tree node structure
variable-length next-hop (if
prefix present)
bitstring
left-ptr
bit-position
right-ptr
119
Path-compressed Trie
• W-bit prefixes: O(W) lookup, O(N)
storage and O(W) update complexity
Advantages
Disadvantages
Decreased storage
Worst case lookup slow
120
Binary Search by Prefix Intervals
{}, 1, 10, 100, 101, 1110
1: 10000-11111
{}: 00000-11111
10: 10000-10111
101: 10100-10111 1110: 11100-11101
100: 10000-10011
(
(
(
(
)
(
)
)
(
)
)
)
00000
10000
10000
10000
10011
10100
10111
10111
11100
11101
11111
11111
1100
121
Rules
If the destination address we search for
fits to the right of a left paranthesis, the
prefix represented by the left paranthesis
is the longest match
Otherwise, increment for each right
paranthesis, decrement for each left
paranthesis until the count reaches -1
122
Example revisited
(
(
(
(
)
(
)
)
(
)
)
)
00000
10000
10000
10000
10011
10100
10111
10111
11100
11101
11111
11111
1100
-1
0
1
2
1
2
1
1100 is in the interval 10000-11111, longest prefix
match is for the prefix 1
123
Binary Search on Intervals
•W-bit N prefixes: O(logN) lookup,
O(N) storage
Advantages
Disadvantages
Storage is linear
Can be ‘balanced’
Lookup time independent
of W
But, lookup time is dependent
on N
Incremental updates more
complex than tries
Each node is big in size:
requires higher memory
bandwidth
124