Transcript Document

Chapter 4
Internetworking
4.1 Simple Internetworking (IP)
4.2 Routing
4.3 Global Internet
4.4 Multicast
4.5 Multiprotocol Label Switching (MPLS)
1
4.1 Simple Internetworking
(IP)

Best Effort Service Model
 Global Addressing Scheme
 ARP (Address Resolution Protocol
 ICMP (Internet Message Control Protocol)
2
IP Internet
Network 1 (Ethernet)
 Concatenation of Networks
H1
H2
H7
H3
R3
H8
Network 4
(point-to-point)
Network 2 (Ethernet)
R1
R2
H4
 Protocol Stack
Network 3 (FDDI)
H6
H5
H1
H8
T CP
R1
IP
ETH
R2
IP
ETH
R3
IP
FDDI
FDDI
IP
PPP
PPP
T CP
IP
ETH
ETH
3
Service Model
 Connectionless (datagram-based)
 Best-effort delivery (unreliable service)
 packets are lost
 packets are delivered out of order
 duplicate copies of a packet are delivered
 packets can be delayed for a long time
 Datagram format
0
4
Version
8
HLen
16
T OS
31
Length
Ident
T TL
19
Flags
Protocol
Offset
Checksum
SourceAddr
DestinationAddr
Options (variable)
Pad
(variable)
Data
4
Fragmentation and Reassembly
Each network has some MTU
Design decisions
fragment when necessary (MTU < Datagram)
try to avoid fragmentation at source host
re-fragmentation is possible
fragments are self-contained datagrams
use CS-PDU (not cells) for ATM
delay reassembly until destination host
do not recover from lost fragments
5
Start of header
Ident = x
Example
0 Offset = 0
Rest of header
(a)
1400 data bytes
Start of header
Ident = x
1 Offset = 0
Rest of header
H1
R1
R1
R2
R2
R3
R3
H8
512 data bytes
(b)
Start of header
ETH IP (1400)
FDDI IP (1400)
Ident = x
1 Offset = 64
PPP IP (512)
ETH IP (512)
PPP IP (512)
ETH IP (512)
Rest of header
PPP IP (376)
ETH IP (376)
512 data bytes
Start of header
Ident = x
0 Offset = 128
Rest of header
376 data bytes
6
Global Addresses
 Properties
 globally unique
 hierarchical: network + host
 Dot Notation
 10.3.2.4
 128.96.33.81
 192.12.69.77
7
(a)
0
24
Network
Host
14
(b)
1
0
16
Network
Host
21
(c)
1
1
0
Network
8
Host
7
Datagram Forwarding
 Strategy
 every datagram contains destination’s address
 if connected to destination network, then forward to host
 if not directly connected, then forward to some router
 forwarding table maps network number into next hop
 each host has a default router
 each router maintains a forwarding table
 Example (R2)
Network Number
1
2
3
4
Next Hop
R3
R1
interface 1
interface 0
8
Address Translation
 Map IP addresses into physical addresses
 destination host
 next hop router
 Techniques
 encode physical address in host part of IP address
 table-based
 ARP
 table of IP to physical address bindings
 broadcast request if IP address not in table
 target machine responds with its physical address
 table entries are discarded if not refreshed
9
ARP Details
 Request Format
 HardwareType: type of physical network (e.g., Ethernet)
 ProtocolType: type of higher layer protocol (e.g., IP)
 HLEN & PLEN: length of physical and protocol addresses
 Operation: request or response
 Source/Target-Physical/Protocol addresses
 Notes
 table entries timeout in about 10 minutes
 update table with source when you are the target
 update table if already have an entry
 do not refresh table entries upon reference
10
ARP Packet Format
0
8
16
Hardware type = 1
HLen = 48
PLen = 32
31
ProtocolType = 0x0800
Operation
SourceHardwareAddr (bytes 0― 3)
SourceHardwareAddr (bytes 4- 5)
SourceProtocolAddr (bytes 0 - 1)
SourceProtocolAddr (bytes 2 - 3)
TargetHardwareAddr (bytes 0 – 1)
TargetHardwareAddr (bytes 2 - 5)
TargetProtocolAddr (bytes 0 - 3)
11
Internet Control Message Protocol
(ICMP)
 Echo (ping)
 Redirect (from router to source host)
 Destination unreachable (protocol, port, or host)
 TTL exceeded (so datagrams don’t cycle forever)
 Checksum failed
 Reassembly failed
 Cannot fragment
12
Redirect
G1
Network
(1)
H1
Network
(2)
G2
Network
H2
 G2 finds that H1 is directly connected and
will inform H1 to redirect the IP datagrams to G2.
4.2 Routing
 Forwarding vs Routing
 forwarding: to select an output port based on
destination address and routing table
 routing: process by which routing table is built
 Network as a Graph
A
3
4
C
6
1
2
1
B
9
E
F
1
D
 Problem: Find lowest cost path between two nodes
 Factors
 static: topology
 dynamic: load
14
Distance Vector
 Each node maintains a set of triples
 (Destination, Cost, NextHop)
 Directly connected neighbors exchange updates
 periodically (on the order of several seconds)
 whenever table changes (called triggered update)
 Each update is a list of pairs:
 (Destination, Cost)
 Update local table if receive a “better” route
 smaller cost
 came from next-hop
 Refresh existing routes; delete if they time out
15
Routing Table Example (Node B)
B
C
A
D
E
F
G
Destination Cost NextHop
A
1
A
C
1
C
D
2
C
E
2
A
F
2
A
G
3
A
16
Routing Loops
 Example 1
 F detects that link to G has failed
 F sets distance to G to infinity and sends update to A
 A sets distance to G to infinity since it uses F to reach G
 A receives periodic update from C with 2-hop path to G
 A sets distance to G to 3 and sends update to F
 F decides it can reach G in 4 hops via A
B
C
A
D
E
F
G
17
Routing Loops
 Example 2
 link from A to E fails
 A advertises distance of infinity to E
 B and C advertise a distance of 2 to E
 B decides it can reach E in 3 hops; advertises this to A
 A decides it can read E in 4 hops; advertises this to C
 C decides that it can reach E in 5 hops…
B
C
A
D
E
F
G
18
Distance Vector: link cost changes
Link cost changes:
 node detects local link cost change
1
 updates routing info, recalculates
distance vector
 if DV changes, notify neighbors
“good
news
travels
fast”
x
4
y
50
1
z
At time t0, y detects the link-cost change, updates its DV,
and informs its neighbors.
At time t1, z receives the update from y and updates its table.
It computes a new least cost to x and sends its neighbors its DV.
At time t2, y receives z’s update and updates its distance table.
y’s least costs do not change and hence y does not send any
message to z.
19
Distance Vector: link cost changes
“good news Travels
fast”
Dy
1
x
4
y
50
1
z
algorithm
terminates
Dz
20
Distance Vector: link cost changes
Link cost changes:
 bad news travels slow - “count to infinity” problem!
 44 iterations before algorithm stabilizes
 z (y) does not know that the least distance from y (z)
60
X
4
Y
1
Z
50
to x that y (z) tells z (y) is the distance of the path yz-y-x (z-y-x)
algorithm
continues
on!
21
Distance Vector: poisoned reverse
If Z routes through Y to get to X :
 Z tells Y its (Z’s) distance to X is infinite (so Y
won’t route to X via Z)
 will this completely solve count to infinity
problem?
 Loops involving three or more nodes cannot be
solved using the technique
60
X
4
Y
50
1
Z
algorithm
terminates
22
RIP ( Routing Information Protocol)
 Distance vector algorithm
 Included in BSD-UNIX Distribution in 1982
 Distance metric: # of hops (max = 15 hops)
Source node: A
u
v
A
z
C
B
D
w
x
y
destination hops
u
1
v
2
w
2
x
3
y
3
z
2
23
RIP advertisements
 Distance vectors:
exchanged among
neighbors every 30 sec
via Response Message
(also called
advertisement)
 Each advertisement: a list
of up to 25 destination
subnets within AS
0
8
Command
16
Version
Family of net 1
31
Must be zero
Address of net 1
Address of net 1
Distance to net 1
Family of net 2
Address of net 2
Address of net 2
Distance to net 2
24
RIP: Example
z
w
A
x
D
B
y
C
Destination Network
w
y
z
x
….
Next Router
Num. of hops to dest.
….
....
A
B
B
--
2
2
7
1
Routing table in D
25
RIP: Example
Dest
w
x
z
….
Next
C
…
w
hops
4
...
A
Advertisement
from A to D
z
x
Destination Network
w
y
z
x
….
D
B
C
y
Next Router
Num. of hops to dest.
….
....
A
B
B A
--
Routing table in D
2
2
7 5
1
26
RIP: Link Failure and Recovery
If no advertisement heard after 180 sec --> neighbor or
link declared dead
 routes via neighbor invalidated
 new advertisements sent to neighbors
 neighbors in turn send out new advertisements (if
tables changed)
 link failure info quickly propagates to entire net
 poison reverse used to prevent ping-pong loops
(infinite distance = 16 hops)
27
RIP Table processing
 RIP routing tables managed by application-level
process called route-d (daemon)
 advertisements sent in UDP packets, periodically
repeated
routed
routed
Transprt
(UDP)
network
(IP)
link
physical
Transprt
(UDP)
forwarding
table
forwarding
table
network
(IP)
link
physical
28
Link State
 Strategy
 send to all nodes (not just neighbors)
information about directly connected links (not
entire routing table)
 Link State Packet (LSP)
 id of the node that created the LSP
 cost of link to each directly connected neighbor
 sequence number (SEQNO)
 time-to-live (TTL) for this packet
29
Link State (cont)
Reliable flooding
store most recent LSP from each node
forward LSP to all nodes but one that sent it
generate new LSP periodically
increment SEQNO
start SEQNO at 0 when reboot
decrement TTL of each stored LSP
discard when TTL=0
30
Reliable Flooding
X
A
C
B
D
X
A
C
B
(a)
X
A
C
B
(c)
D
(b)
D
X
A
C
B
D
(d)
31
Route Calculation
 Dijkstra’s shortest path algorithm
 Let





N denotes set of nodes in the graph
l (i, j) denotes non-negative cost (weight) for edge (i, j)
s denotes this node
M denotes the set of nodes incorporated so far
C(n) denotes cost of the path from s to node n
M = {s}
for each n in N - {s}
C(n) = l(s, n)
while (N != M)
M = M union {w} such that C(w) is the minimum for
all w in (N - M)
for each n in (N - M)
C(n) = MIN(C(n), C (w) + l(w, n ))
32
A Link-State Routing Algorithm
Dijkstra’s algorithm
 net topology, link costs known
to all nodes
 accomplished via “link state
broadcast”
 all nodes have same info
 computes least cost paths from
one node (‘source”) to all other
nodes
 gives forwarding table for
that node
 iterative: after k iterations,
know least cost path to k
destinations
Notation:
 c(x,y): link cost from node x to
y; = ∞ if not direct neighbors
 D(v): current value of cost of
path from source to destination v
 p(v): predecessor node along
path from source to v
 N': set of nodes whose least cost
path definitively known
33
Dijsktra’s Algorithm
1 Initialization:
u: source node
2 N' = {u}
3 for all nodes v
4
if v adjacent to u
5
then D(v) = c(u,v)
6
else D(v) = ∞
7
8 Loop
9 find w not in N' such that D(w) is a minimum
10 add w to N'
11 update D(v) for all v adjacent to w and not in N' :
12
D(v) = min( D(v), D(w) + c(w,v) )
13 /* new cost to v is either old cost to v or known
14 shortest path cost to w plus cost from w to v */
15 until all nodes in N'
34
Dijkstra’s algorithm: example
Step
0
1
2
3
4
5
N'
u
ux
uxy
uxyv
uxyvw
uxyvwz
D(v),p(v) D(w),p(w)
2,u
5,u
2,u
4,x
2,u
3,y
3,y
D(x),p(x)
1,u
D(y),p(y)
∞
2,x
D(z),p(z)
∞
∞
4,y
4,y
4,y
5
2
u
v
2
1
x
3
w
3
1
5
z
1
y
2
35
Dijkstra’s algorithm: example
5
5
2
u
3
v
2
1
x
w
3
5
2
y
1
z
1
2
u
2
1
2
u
2
1
x
w
5
z
1
3
x
2
y
1
5
5
v
3
v
3
w
3
1
5
z
1
y
2
2
u
v
2
1
x
3
w
3
1
5
z
1
y
2
36
Dijkstra’s algorithm: example
5
5
2
u
v
2
1
x
3
w
3
1
5
z
1
y
2
2
u
v
2
1
x
3
w
3
1
5
z
1
y
2
37
Dijkstra’s algorithm, discussion
Algorithm complexity: n nodes
 each iteration: need to check all nodes, w, not in N
 n(n+1)/2 comparisons: O(n2)
 more efficient implementations possible: O(nlogn)
Oscillations possible:
 e.g., link cost = amount of carried traffic
D
1
1
0
A
0 0
C
e
1+e
e
initially
B
1
2+e
A
0
D 1+e 1 B
0
0
C
… recompute
routing
0
D
1
A
0 0
C
2+e
B
1+e
… recompute
2+e
A
0
D 1+e 1 B
e
0
C
… recompute
38
OSPF (Open Shortest Path First)
 “open”: publicly available – defined in RFC 2328
 Uses Link State algorithm
 Link-State packet dissemination
 Topology map at each node
 Route computation using Dijkstra’s algorithm
 OSPF advertisement carries one entry per neighbor
router
 Advertisements disseminated to entire AS (via flooding)
 Carried in OSPF messages directly over IP (rather than TCP
or UDP)
39
OSPF “advanced” features (not in RIP)
 Security: all OSPF messages authenticated (to prevent
malicious intrusion)
 Load Balancing: Multiple same-cost paths allowed
(only one path in RIP)
 For each link, multiple cost metrics for different TOS
(e.g., satellite link cost set “low” for best effort; high
for real time)
 Integrated uni- and multicast support:
 Multicast OSPF (MOSPF) uses same topology data
base as OSPF
 Hierarchical OSPF in large domains.
40
Hierarchical OSPF
 An OSPF autonomous system (AS) can be configured
into areas
 Exactly one OSPF area in the AS is configured to be
the backbone area
 Each area runs its own OSPF link-state routing
algorithm
 Two-level hierarchy: local area, backbone.
 Link-state advertisements only in area
 each nodes has detailed area topology; only know
direction (shortest path) to nets in other areas.
41
Hierarchical OSPF
42
Hierarchical OSPF
Four types of routers
 Internal routers: perform only intra AS routing
 Area border routers: belong to both an area
and the backbone
 Backbone routers: run OSPF routing limited to
backbone.
 Boundary routers: connect to other AS’s.
43
OSPF Advertisement Format
0
8
16
31
LS Age
Version
Type
Message length
Checksum
Authentication type
T ype=1
Link-state ID
Advertising router
SourceAddr
AreaId
Options
LS sequence number
LS checksum
Length
0 Flags
0
Number of links
Link ID
Authentication
Link data
Link type
Num_T OS
Metric
Optional T OS information
More links
Header Format
Link-State Advertisement
44
Comparison of LS and DV algorithms
Message complexity
 LS: with n nodes, E links,
O(nE) messages sent
 DV: exchange between
neighbors only
 convergence time varies
Speed of Convergence
 LS: O(n2) algorithm requires
O(nE) messages
 may have oscillations
 DV: convergence time varies
 may be routing loops
 count-to-infinity problem
Robustness: what happens if
router malfunctions?
LS:
 node can advertise incorrect
link cost
 each node computes only its
own table
DV:
 DV node can advertise
incorrect path cost
 each node’s table used by
others
 error propagate thru
network
45
Metrics
 Original ARPANET metric
 measures number of packets queued on each link
 took neither latency or bandwidth into consideration
 New ARPANET metric
 stamp each incoming packet with its arrival time (AT)
 record departure time (DT)
 when link-level ACK arrives, compute
Delay = (DT - AT) + Transmit + Latency
 if timeout, reset DT to departure time for
retransmission
 link cost = average delay over some time period
46
Metrics
 Still has problems
 Under light load, it works well since the two static
factors of delay dominated the cost.
 Under heavy load, a congested link would start to
advertise a very high cost. This caused all the
traffic to move off that link, leaving it idle, so then
it advertise a low cost,…
 The range of link values was much too large.
 Fine Tuning
 compressed dynamic range
 replaced Delay with link utilization
47
Revised ARPANET routing metric
versus link utilization
225
9.6-Kbps satellite link
140
9.6-Kbps terrestrial link
56-Kbps satellite link
56-Kbps terrestrial link
90
75
60
30
25%
50%
Utilization
75%
100%
48
Revised ARPANET routing metric
versus link utilization
 A highly loaded link never shows a cost of
more than three times its cost when idle
 The most expensive link is only seven times
the cost of least expensive
 A high-speed satellite link is more attractive
than a low-speed terrestrial link
 Cost is a function of link utilization only at
moderate to high loads.
49
4.3 Global Internet Structure
Tree Structure of the Internet in 1990
NSFNET backbone
Stanford
BARRNET
regional
Berkeley
Westnet
regional
PARC
MidNet
regional
■■■
UNM
NCAR
ISU
UNL
KU
UA
50
Global Internet
 One of the salient features of this topology is that it
consists of “end user” sites (e.g, Stanford university)
that connect to “service provider” networks (e.g,
BARRNET)
 Each provider and end user is likely to be an
administratively independent entity – Autonomous
System (AS).
 Scalability problems
 Scalability of routing
 Address utilization
 Subnetting – deals with address space utilization
 Classless routing or supernetting – tackles both address
utilization and routing scalability
51
Subnetting
 Inefficient use of Hierarchical Address Space
 class C with 2 hosts (2/255 = 0.78% efficient)
 class B with 256 hosts (256/65535 = 0.39% efficient)
 Still Too Many Networks
 routing tables do not scale
 route propagation protocols do not scale
 Subnetting provides an elegantly simple way to
reduce the total number of networks that are assigned
 The idea is to take a single IP network number and
allocate the IP addresses with that network number to
several physical networks – subnets.
52
Subnetting
 Add another level to address/routing hierarchy: subnet
 Subnet masks define variable partition of host part
 A single network number can be shared among multiple
networks involves configuring all the nodes on each
subnet with a subnet mask.
 Subnets visible only within site
Network number
Host number
Class B address
111111111111111111111111
00000000
Subnet mask (255.255.255.0)
Network number
Subnet ID
Host ID
53
Subnetted address
Subnet Example
H1  H2
255.255.255.128
128.96.34.139
128.96.34.128
Subnet mask: 255.255.255.128
Subnet number: 128.96.34.0
128.96.34.15
128.96.34.1
R1
H1
Subnet mask: 255.255.255.128
Subnet number: 128.96.34.128
128.96.34.130
128.96.34.139
128.96.34.129
H3
R2
H2
R1
255.255.255.128
128.96.34.139
128.96.34.128
128.96.33.1
128.96.33.14
Forwarding table at router R1
Subnet mask: 255.255.255.0
Subnet number: 128.96.33.0
Subnet Number
128.96.34.0
128.96.34.128
128.96.33.0
Subnet Mask
255.255.255.128
255.255.255.128
255.255.255.0
Next Hop
interface 0
interface 1
R2 54
Forwarding Algorithm
D = destination IP address
for each entry (SubnetNum, SubnetMask, NextHop)
D1 = SubnetMask & D
if D1 = SubnetNum
if NextHop is an interface
deliver datagram directly to D
else
deliver datagram to NextHop
 Use a default router if nothing matches
 Not necessary for all 1s in subnet mask to be
contiguous
 Can put multiple subnets on one physical network
 Subnets not visible from the rest of the Internet
55
Classless Routing (CIDR) Supernetting
 CIDR: Classless Inter-Domain Routing
 A technique that addresses two scaling concerns:
 the growth of backbone routing tables, and
 the potential for the 32-bit IP address space to be
exhausted well before the 4 billionth host is attached
to the Internet.
 Even though subnetting can help to assign
addresses carefully, it does not get around the
fact that any AS with more than 255 hosts wants
a class B address – exhaustion of IP address
space.
56
Classless Routing (CIDR) Supernetting
 CIDR tries to balance the desire to minimize
the number of routes that a router needs to
know against the need to hand out addresses
efficiently
 Assign block of contiguous network numbers to
nearby networks
 Represent blocks with a single pair
(first_network_address, count)
 Restrict block sizes to powers of 2
 Use a bit mask (CIDR mask) to identify block size
 All routers must understand CIDR addressing
57
Route aggregation with CIDR
Customers
128.112.128/24
Advertise
128.112.128/21
ISP
..
.
128.112.135/24
 Since all of the customers are reachable through the same
Provider network, it can advertise a single route to all of
Them by just advertising the common 21-bit prefix they share
58
IP Forwarding Revisited
 Find the network number in a packet and then lookup
that number in a forwarding table.
 Reexamine this assumption with CIDR
 Prefixes length 2-32 bits
 Prefixes may “overlap”
 Some addresses may match more than one prefix.
 Longest Prefix Matching (LPM)
 For example
 171.69 (16-bit prefix)
 171.69.10 (24-bit prefix)
 171.69.10.5 matches both
 171.69.20.5 only matches 171.69
59
Interdomain Routing (BGP)
• AS = routing domain
• Routing Policies
• Two major Interdomain
routing protocols
-- Exterior gateway Protocol
(EGP)
-- Border gateway Protocol
(BGP-4)
R1
R3
R2
Autonomous system 1
R4
Autonomous system 2
R5
R6
60
BGP-4: Border Gateway Protocol
 AS Types
 stub AS: has a single connection to one other AS
carries local traffic only
 multihomed AS: has connections to more than one AS
refuses to carry transit traffic
 transit AS: has connections to more than one AS
carries both transit and local traffic
 Each AS has:
 one or more border routers
 one BGP speaker that advertises:
local networks
other reachable networks (transit AS only)
gives path information
61
Today’s multibackbone Internet
Large corporation
“Consumer”
ISP
Peering
point
Backbone service provider
“Consumer”
ISP
Large corporation
Peering
point
“Consumer”
ISP
Small
corporation
62
BGP Example
 Speaker for AS2 advertises reachability to P and Q
 network 128.96, 192.4.153, 192.4.32, and 192.4.3, can be reached
directly from AS2
Customer P
(AS 4)
128.96
192.4.153
Customer Q
(AS 5)
192.4.32
192.4.3
Customer R
(AS 6)
192.12.69
Customer S
(AS 7)
192.4.54
192.4.23
Regional provider A
(AS 2)
Backbone network
(AS 1)
Regional provider B
(AS 3)
 Speaker for backbone advertises
 networks 128.96, 192.4.153, 192.4.32, and 192.4.3 can be reached
along the path (AS1, AS2).
 Speaker can cancel previously advertised paths
63
Internet inter-AS routing: BGP
 BGP (Border Gateway Protocol): the de facto
standard
 BGP provides each AS a means to:
1. Obtain subnet reachability information from
neighboring ASs.
2. Propagate the reachability information to all routers
internal to the AS.
3. Determine “good” routes to subnets based on
reachability information and policy.
 Allows a subnet to advertise its existence to rest
of the Internet: “I am here”
64
BGP basics
 Pairs of routers (BGP peers) exchange routing information
over semi-permanent TCP connections: BGP sessions
 Note that BGP sessions do not correspond to physical links.
 When AS2 advertises a prefix to AS1, AS2 is promising it will
forward any datagrams destined to that prefix towards the
prefix.
 AS2 can aggregate prefixes in its advertisement
3c
3a
3b
AS3
1a
AS1
2a
1c
1d
1b
2c
AS2
2b
External BGP (eBGP) session
Internal BGP (iBGP) session
65
Aggregation of prefixes
138.16.64/24
138.16.65/24
138.16.66/24 => 138.16.64/22
138.16.67/24
66
Distributing reachability info
 With eBGP session between 3a and 1c, AS3 sends prefix reachability
information to AS1.
 1c can then use iBGP to distribute this new prefix reachability
information to all routers in AS1
 1b can then re-advertise the new reachability information to AS2 over
the 1b-to-2a eBGP session
 When router learns about a new prefix, it creates an entry for the prefix
in its forwarding table.
3c
3a
3b
AS3
1a
AS1
2a
1c
1d
1b
2c
AS2
2b
eBGP session
iBGP session
67
Path attributes & BGP routes
 When advertising a prefix, advertisement includes BGP
attributes.
 prefix + attributes = “route”
 Two important attributes:
 AS-PATH: contains the ASs through which the advertisement for
the prefix passed: AS 67 AS 17
 used to detect and prevent looping advertisement
 also use in choosing among multiple path to the same prefix
 NEXT-HOP: Indicates the specific internal-AS router to next-hop
AS. (There may be multiple links from current AS to next-hopAS.)
 When gateway router receives route advertisement, uses
import policy to accept/decline.
68
BGP route selection
 Router may learn about more than 1 route to any
one prefix. Router must select route.
 Elimination rules invoked sequentially until one
route remains:
1. Local preference value attribute: policy decision – AS’s
network administrator
2. Shortest AS-PATH
3. Closest NEXT-HOP router: hot potato routing
4. Additional criteria
69
BGP messages
 BGP messages exchanged using TCP.
 BGP messages:
 OPEN: opens TCP connection to peer and authenticates
sender
 UPDATE: advertises new path (or withdraws old)
 KEEPALIVE keeps connection alive in absence of
UPDATES; also ACKs OPEN request
 NOTIFICATION: reports errors in previous message; also
used to close connection
70
BGP routing policy
legend:
B
W
provider
network
X
A
customer
network:
C
Y
Figure 4.5-BGPnew: a simple BGP scenario
 A,B,C are provider networks
 X,W,Y are customer (of provider networks)
 X is dual-homed: attached to two networks
 X does not want to route from B via X to C
 .. so X will not advertise to B a route to C
71
BGP routing policy (2)
legend:
B
W
provider
network
X
A
customer
network:
C
Y
Figure 4.5-BGPnew: a simple BGP scenario
 A advertises to B the path AW
 B advertises to X the path BAW
 Should B advertise to C the path BAW?
 No way! B gets no “revenue” for routing CBAW since neither W nor C
are B’s customers
 B wants to force C to route to w via A
 B wants to route only to/from its customers!
72
Why different Intra- and Inter-AS routing ?
Policy:
 Inter-AS: administrator wants control over how its traffic
routed, who routes through its net.
 Intra-AS: single admin, so no policy decisions needed
Scale:
 hierarchical routing saves table size, reduced update traffic
Performance:
 Intra-AS: can focus on performance
 Inter-AS: policy may dominate over performance
73
IP Version 6
 Features
 128-bit addresses (classless)
 multicast
 real-time service
 authentication and security
 autoconfiguration
 end-to-end fragmentation
 protocol extensions
 Header
 40-byte “base” header
 extension headers (fixed order, mostly fixed length)
 fragmentation
 source routing
 authentication and security
 other options
74
4.4 Broadcast/Multicast routing
 Broadcast routing –- deliver a packet from a
source node to all other nodes
 Multicast routing – deliver a packet from a
source node to a subset of other nodes
75
Source-duplication versus in-network duplication
duplicate
R1
duplicate
creation/transmission
R1
duplicate
R2
R2
R3
R4
(a)
R3
R4
(b)
(a) source duplication, (b) in-network duplication
76
Broadcast routing algorithms
Uncontrolled flooding
Controlled flooding
Spanning-tree broadcast
77
Uncontrolled flooding
 The source node sends a copy of the packet to
all of its neighbors
 When a node receives a broadcast packet, it
duplicates the packet and forwards it to all of
its neighbors (except the neighbor from which
it receives the packet)
Problems:
 If the graph has cycles, then one or more copies
of each broadcast packet will cycle indefinitely
 Broadcast storm
78
Controlled flooding
Sequence-number-controlled flooding
Source node puts its address and a broadcast
sequence number into a broadcast packet
Each node maintains a list of the source
address and sequence number of each packet
it has received
When a node receives a broadcast packet
If the packet is in the list, the packet is dropped
Otherwise, the packet is duplicated and
forwarded
79
Controlled flooding
 Reverse path forwarding
 When a router receives a broadcast packet, it
duplicates and forwards the packet only if the
packet arrives on the link that is on its own shortest
unicast path back to the source
A
B
c
F
E
Packet will be forwarded
D
G
Packet not forwarded beyond receiving router
80
Controlled flooding
Drawback
 Some of the nodes receive redundant packets
A
B
c
F
E
D
G
Redundant packets
Ideally, every node should receive only one copy of the broadcast packet.
81
Spanning-tree broadcast
Spanning tree – a tree that contains all nodes in a graph
Minimum spanning tree – a spanning tree whose cost is
the minimum among all the spanning trees of a graph
 Broadcast along a spanning tree
A
B
c
F
A
E
B
c
D
F
G
(a) Broadcast initiated at A
E
D
G
(b) Broadcast initiated at D
82
Construction of Spanning-tree
 Many algorithms have been developed
 Center-based approach
 Select a center node (rendezvous or core)
 Each node unicasts tree-join message to the center node
A
A
3
B
c
4
F
2
E
1 Center node
B
c
D
F
5
E
D
G
G
(a) Stepwise construction
of spanning tree
(b) Constructed spanning
tree
83
Multicast Routing: Problem Statement
 Goal: find a tree (or trees) connecting routers
having local multicast group members
 tree: not all paths between routers used
 source-based: different tree from each sender to receivers
 shared-tree: same tree used by all group members
Source-based trees
Shared tree
84
Approaches for building multicast trees
source-based tree: one tree per source
shortest path trees
reverse path forwarding
group-shared tree: group uses one tree
minimal spanning (Steiner)
center-based trees
…we first look at basic approaches, then specific
protocols adopting these approaches
85
Shortest Path Tree
 multicast forwarding tree: tree of shortest path
routes from source to all receivers
 Dijkstra’s algorithm
S: source
LEGEND
R1
1
2
R4
R2
3
R3
router with attached
group member
5
4
R6
router with no attached
group member
R5
6
R7
i
link used for forwarding,
i indicates order link
added by algorithm
86
Reverse Path Forwarding
 rely on router’s knowledge of unicast shortest
path from it to sender
 each router has simple forwarding behavior:
if (multicast datagram received on incoming link on
shortest path back to sender)
then flood datagram onto all outgoing links
else ignore datagram
87
Reverse Path Forwarding: example
S: source
LEGEND
R1
R4
router with attached
group member
R2
R5
R3
R6
R7
router with no attached
group member
datagram will be
forwarded
datagram will not be
forwarded
• result is a source-specific reverse SPT
– may be a bad choice with asymmetric links
88
Reverse Path Forwarding: pruning
 forwarding tree contains subtrees with no multicast
group members
 no need to forward datagrams down subtree
 “prune” messages sent upstream by router with no
downstream group members
LEGEND
S: source
R1
router with attached
group member
R4
R2
P
R5
R3
R6
P
R7
P
router with no attached
group member
prune message
links with multicast
forwarding
89
Shared-Tree: Steiner Tree
 Steiner Tree: minimum cost tree connecting
all routers with attached group members
 problem is NP-complete
 excellent heuristics exists
 not used in practice:
 computational complexity
 information about entire network needed
 monolithic: rerun whenever a router needs to
join/leave
90
Center-based trees
 single delivery tree shared by all
 one router identified as “center” of tree
 to join:
 edge router sends unicast join-message addressed to
center router
 join-message “processed” by intermediate routers and
forwarded towards center
 join-message either hits existing tree branch for this
center, or arrives at center
 path taken by join-message becomes new branch of tree
for this router
91
Center-based trees: an example
Suppose R6 chosen as center:
LEGEND
R1
3
R2
router with attached
group member
R4
2
R5
R3
1
R6
1
router with no attached
group member
path order in which join
messages generated
R7
92
Internet Multicasting Routing: DVMRP
 DVMRP: distance vector multicast routing
protocol, RFC1075
 flood and prune: source-based tree, reverse path
forwarding,
 RPF tree based on DVMRP’s own routing tables
constructed by communicating DVMRP routers
 no assumptions about underlying unicast
 initial datagram to multicast group flooded everywhere
via RPF
 routers not wanting group: send upstream prune
messages
93
DVMRP: continued…
 soft state: DVMRP router periodically (1 min.)
“forgets” branches are pruned:
 multicast data again flows down unpruned branch
 downstream router: reprune or else continue to
receive data
 routers can quickly regraft to tree
 following IGMP join at leaf
 odds and ends
 commonly implemented in commercial routers
 Mbone routing done using DVMRP
94
Tunneling
Q: How to connect “islands” of multicast routers in a
“sea” of unicast routers?
physical topology
logical topology
 multicast datagram encapsulated inside “normal” (non-multicast-
addressed) datagram
 normal IP datagram sent thru “tunnel” via regular IP unicast to
receiving multicast router
 receiving multicast router decapsulates to get multicast datagram
95
PIM: Protocol Independent Multicast
 Not dependent on any specific underlying unicast
routing algorithm (like RIP, OSPF, works with all)
 Two different multicast distribution scenarios :
Dense:
Sparse:
 group members
 # of routers with group
densely packed, in
“close” proximity.
members is small wrt total
# of routers
 group members “widely
dispersed”
96
Consequences of Sparse-Dense Dichotomy:
Dense
Sparse:
 group membership by
 no membership until
routers explicitly join
routers assumed until
routers explicitly prune  receiver-driven
 data-driven construction construction of multicast
tree (e.g., center-based)
of multicast tree (e.g.,
RPF)
 bandwidth and nongroup-router processing
 bandwidth and nonconservative
group-router processing
profligate
97
PIM- Dense Mode
Flood-and-prune RPF, similar to DVMRP but
 underlying unicast protocol provides RPF
information for incoming datagram
 less complicated (less efficient) downstream
flood than DVMRP
 reduces reliance on underlying routing
algorithm
 has protocol mechanism for router to detect if
it is a leaf-node router
98
PIM - Sparse Mode
 Center-based approach
R1
 router sends join
message to rendezvous
point (RP)
 intermediate routers
update state and forward
join
 after joining via RP,
router can switch to
source-specific tree
R4
join
R2
R3
join
R5
join
R6
all data multicast
from rendezvous
point
R7
rendezvous
point
99
PIM - Sparse Mode
Sender(s):
 unicast data to RP,
which distributes down
RP-rooted tree
 RP can extend multicast
tree upstream to source
 RP can send stop
message to the source if
no attached receivers
R1
R4
join
R2
R3
join
R5
join
R6
all data multicast
from rendezvous
point
R7
rendezvous
point
 “no one is listening!”
100
4.5 MultiProtocol Label Switching
(MPLS)
Prior Work
MPLS Overview
MPLS Architecture
Prior Work
Tag Switching (Cisco)
Aggregate Route-Based IP Switching (ARIS,
IBM)
IP Navigator
IFMP-IP Switching (Ipsilon)
Cell Switching Router (CSR, Toshiba)
102
Prior Work
Tag switching is based on the control-driven approach.
The set up of LSPs (Label Switched Paths) closely
follows control messages such as routing updates and
RSVP messages.
Aggregate route-based IP switching (ARIS) is based on
the control-driven approach. Very similar to tag
switching. ARIS introduces the concept of an “egress
identifier” (FECs) to express the granularity of LSPs.
IP Navigator is again a control-driven protocol. Use
OSPF as the internal routing protocol used within a
routing domain. Explicit routing is used to setting up
the VCs.
103
Prior Work
 Ipsilon Flow Management Protocol (IFMP) is a traffic driven
protocol. When the number of packets from a flow exceeds a
predetermined threshold, the controller uses IFMP to set up an
LSP for the particular flow.
 Cell switch router (CSR) proposal is similar to IP switching.
CSR is primarily designed as a device for interconnecting ATM
clouds. Within an LIS (logical IP subnet), ATM forum standards
are used to connection hosts and switched together.
 Multiple LISs are then interconnected with CSRs that are
capable of running both IP forwarding and cell forwarding. The
setup of LSPs is data-driven for best effort traffic and RSVPdriven for flows that require resource reservation.
104
MPLS Overview
RFC 3812
The IETF MPLS working group is to standardize a
base technology that integrates the label swapping
forwarding paradigm with network layer routing.
Cisco is the major contributor to the MPLS working
group.
substitute “Label” for “Tag” in Tag Switching @
MPLS
Core mechanisms of MPLS
Semantics assigned to a stream label
Labels are associated with specific streams of
data.
Forwarding Methods
Forwarding is simplified by the use of the short
fixed length labels to identify streams.
Forwarding may require simple functions such as
looking up a label in a table, swapping labels, and
possibly decrementing and checking a TTL.
Label Distribution Methods
Allow nodes to determine which labels to use for
specific streams.
Native IP Forwarding
IP routing: both the packet forwarding and
route determination process in an IP
network.
Native IP forwarding (NIF): hop-by-hop,
destination-based packet forwarding.
Each packet’s next hop and output port are
determined by a longest-prefix-match
forwarding table lookup.
Additional packet classification may also be
performed to derive output port queuing and
scheduling rules.
107
A Simplified NIF forwarding engine
Longest Prefix Match lookup
Forwarding
Table
Next hop + port
Packet
Classification
Input
Ports
IP Header
Queuing and
Scheduling rules
Output
Ports
IP payload
Packet Classification keys: IP source and destination addresses,
IP protocol type, DiffServ (DS) or TOS byte, and TCP/UDP port
numbers.
108
Per-Hop classification, queuing, and
scheduling
Classify
Queue
Port 1
Port N
S
Port M
109
A Simplified LSR forwarding engine
Switching
Table
Next hop + port
Queuing and
Scheduling rules
Output
Ports
Input
Ports
MPLS label
MPLS payload
110
Traffic Engineering
Conventional IP routing attempts to find the shortest
path between a packet’s current location and its
intended destination.
“Hot spots” and packet loss rates, latency, and jitter
increase as the average load on a router rises.
Solutions: (1) Faster routers, (2) Alternate routes.
Routing policy may also require traffic engineering.
For example, the external link between R6 and A3
may have been funded solely by A2 and A3.
Therefore, A1’s traffic must not be allowed to
traverse it.
111
Traffic Engineering
-- Override the shortest path route
IP Backbone
Access 1
R1
R6
Access 3
R5
Access 2
R2
R3
R4
Route from A2 to D
Desired route from A1 to D
Actual route from A1 to D
Destination D
112
Signaling and Provisioning
Signaling: when network (re)configuration can be
requested by users at any time and achieved within
milliseconds or seconds.
Provisioning: When the reaction time for
(re)configuration becomes measured in minutes or hours.
In either case, the (re)configuring action involves
establishing (or modifying) information used by routers or
switches to control their forwarding actions, including
 forwarding (routing) information,
 classification rules, and/or
 queuing and scheduling parameters.
113
Core MPLS Components
The basic routing approach
Routing is accomplished through the use of standard
L3 routing protocols (e.g. OSPF and BGP).
The information maintained by the L3 routing
protocols is then used to distribute labels to
neighboring nodes that are used in the forwarding of
packets.
Labels
Label semantics, Label granularity, Label
assignment, Label stack and forwarding operations.
Label Semantics
 The label is nothing more than a shorthand for an
aggregate stream of user data.
 The meaning of the label is a strictly local issue
between two neighboring nodes.
 MPLS could be employed between any two
neighboring nodes, even if no other nodes in the
network participate in MPLS.
 When MPLS is used between more than two nodes,
then the operation between any two neighboring
nodes could be interpreted as independent of the
operation between any other pair of nodes.
Label Granularity
 The device uses the label to forward packets
will forward all packets with the same label in
the same way.
 A Forwarding Equivalence Class (FEC) is a set
of L3 packets which are all forwarded in the
same manner by a particular Label Switching
Router (LSR).
 For unicast IP traffic, the granularity of a label
allows various levels of aggregation in a Label
Information Base (LIB).
 For IP multicast, the natural binding of a label
would be to a multicast tree.
Label assignment
 Label assignment involves allocating a label,
and then binding a label to a route.
 Label assignment can be driven by control
traffic or data traffic. (discussed later.)
 Label withdrawal is primarily a matter of
garbage collection, that is collecting up unused
labels so that they may be reassigned.
117
Routing Aggregation
R6
Access 1
4
1
R1
R5
Access 3
2
Access 2
R2
R3
3
5
R4
Destination D
118
Forwarding Component
Label Stack and Forwarding Operations
label swap : looking up the incoming label to
determine the outgoing label, encapsulation, port,
and any additional information which may pertain
to the stream such as a particular queue or other
QoS related treatment.
label push : When a packet first enters an MPLS
domain, the packet is associated with a label.
label pop : When a packet leaves an MPLS
domain, the label is removed.
The label stack is useful within hierarchical routing
domain.
Encapsulation
Label-based forwarding makes use of various pieces of
information, including a label or stack of labels, and
possibly additional information such as a TTL field.
“MPLS encapsulation” : encapsulate the label
information and information used for label based
forwarding.
An encapsulation scheme may make use of the
following fields:
label, TTL, class of service, stack indicator, next
header type indicator, and checksum
MPLS label stack encoding
Stack bottom
Stack top
COS
Label
(20 bits)
Label
(20 bits)
Label
(20 bits)
Exp
(3 bits)
Exp
(3 bits)
Exp
(3 bits)
S
(1 bit)
S
(1 bit)
S
(1 bit)
TTL
(8 bits)
TTL
(8 bits)
TTL
(8 bits)
...
Original
Packet
MPLS frame delivered to link layer
121
Label Assignment
Topology driven (Tag)
 In response to normal processing of routing protocol control
traffic
 Labels are pre-assigned; no label setup latency at forwarding
time.
Request driven (RSVP)
 In response to normal processing of request based control traffic
 May require a large number of labels to be assigned.
Traffic driven (Ipsilon)
 The arrival of data at an LSR triggers label assignment and
distribution.
 Label setup latency; potential for packet reordering.
122
Label Distribution
Explicit Label Distribution
Downstream label allocation
label allocation is done by the downstream LSR
most natural mechanism for unicast traffic
Upstream label allocation
label allocation is done by the upstream LSR
may be used for optimality for some multicast traffic
A unique label for an egress LSR within the MPLS
domain
Any stream to a particular MPLS egress node could use the
label of that node.
123
Label Distribution
Explicit Label Distribution Protocol (LDP)
Reliability : by transport protocol or as part of LDP.
Separate routing computation and label distribution.
Piggybacking on Other Control Messages
Use existing routing/control protocol for distributing
routing/control and label information.
 OSPF, BGP, RSVP, PIM
Combine routing and label distribution.
Label purge mechanisms
By time out
Exchange of MPLS control packets
124
Label Distribution Protocol
 LDP Peer:
 Two LSRs that exchange label/stream mapping information via LDP
 LDP messages
 Discovery messages (via UDP)
 announce and maintain the presence of LSR
 Session messages
 maintain session between LDP peers
 Advertisement message
 label operation (Label distribution)
 Notification message
 advisory information and signal error information
 Error notification: signal fatal errors
 Advisory notification: status of the LDP session or some previous message
received from the peer.
125
Label Swapping
 Labeled Packet
Example : Forwarding a Labeled Packet
 Map the incoming label to a
next hop label, determines
where to forward the packet.
 Encodes the new label stack
into the packet, and then
forwards it.
Incoming Label Map (ILM)
Input
Port
Label
1
4
 Unlabeled Packet
 LSR analyzes the L3 header, to
determine the packet’s stream.
 Map the stream to a next hop,
determines where to forward
the packet.
 Encodes the new label stack
into the packet, and then
forwards it.
Output
Port
Label
2
6
Label Switching Router
(LSR)
L3 Header
L2 Header
IP Router
Module
Label
Dat
H3
4
H2
Dat
1
H3
6
2
126
H2
Use of MPLS in a Hierarchy
Swap
L1
L4
OSPF
R2
R1
L2
Push
IN
OUT
IN
OUT
L2
L3
L3
L1
Swap
OUT
L1
R3
IN
OUT
L1
L4
R5
R6
R4
Pop
L2
L3
L3
L1
L1
L1
L1
BGP
L2
L1
Domain 1
Domain 2
127
Conclusion
MPLS improves the scalability of hop-by-hop
routing and forwarding, and provides traffic
engineering capabilities for better network
provisioning.
It decouples forwarding from routing and
allows multi-protocol support without
requiring changes to the basic forwarding
paradigm.
Generalized MPLS (GMPLS)
λMPLS (Optical wavelength-based)
128