A 50-Gb/s IP Router - Purdue University :: Computer Science

Transcript A 50-Gb/s IP Router - Purdue University :: Computer Science

A 50-Gb/s IP Router
Authors: Craig Partridge et al.
IEEE/ACM TON June 1998
Presenter: Srinivas R. Avasarala
CS Dept., Purdue University
Why a Gigabit Router?




Transmission link bandwidths are improving at very
fast rates
Network usage is expanding
Host Adapters, OS, switches and MUX also need to
get faster for improved network performance
The goal of the work is to show routers can keep
pace with other technologies
S.R.Avasarala
CS Dept., Purdue University
2
Goals of a Multi-Gigabit Router (MGR)
Enough internal bandwidth to move packets
between its interfaces at gigabit rates
2. Enough packet processing power to forward
several million packets per second (MPPS)
3. Conformance to protocol standards
1.
MGR achieves up to 32 MPPS forwarding rates with
50 Gb/s of full duplex backplane capacity
S.R.Avasarala
CS Dept., Purdue University
3
Router Architecture




Multiple line cards, each with one or more network
interfaces
Forwarding Engine cards (FEs), to make packet
forwarding decisions
High speed switch
Network processor
S.R.Avasarala
CS Dept., Purdue University
4
Router Architecture
S.R.Avasarala
CS Dept., Purdue University
5
Major Innovations
1.
2.
3.
4.
5.
Each FE has a complete set of the routing tables
A switched fabric is used instead of the
traditional shared bus
FEs are on boards distinct from the line cards
Use of an abstract link layer header
Include QoS processing in the router
S.R.Avasarala
CS Dept., Purdue University
6
The Forwarding Engine Processor





A 415-MHz DEC Alpha 21164 processor
64bit 32 register super scalar RISC processor
2 Integer logic units E0, E1
2 Floating point logic units FA, FM
Each cycle schedules one instruction to each logic
unit, processing 4 instructions (quad) in a group
S.R.Avasarala
CS Dept., Purdue University
7
Forwarding Engine’s Caches


3 internal caches
 First level instruction cache (Icache) 8kB
 First-level data cache (Dcache) 8kB
 An on-chip secondary cache (Scache) 96kB used
as a cache of recent routes. Can store 12000
routes approx. with 64b per route
An external tertiary cache (Bcache) 16 MB
 Divided into two 8MB banks
 One bank stores entire forwarding table
 Other is updated by NW processor via PCI bus
S.R.Avasarala
CS Dept., Purdue University
8
Forwarding Engine Hardware






Headers are placed in a request FIFO queue
Alpha reads from queue head, examines header,
makes route decision and informs inbound card
Header includes 24/56B of packet + 8B abstract
link layer header and alpha reads a min of 32B
Alpha writes out the updated header indicating the
outbound interface to use (dispatching info)
Updated header contains the outbound link layer
address and a flow id used for packet scheduling
Unique approach to ARP !!
S.R.Avasarala
CS Dept., Purdue University
9
Forwarding Engine Software




A few 100 lines of code
85 instructions in the common case taking a
minimum of 42 cycles.
This gives a peak forwarding rate of 9.8MPPS
 415 MHz/42 cycles ~ 9.8MPPS
Fast path of the code is in 3 stages, each with
about 20-30 instructions (10-15 cycles)
S.R.Avasarala
CS Dept., Purdue University
10
Fast path of the code

Stage 1:
1. Basic error checking to see if header is from a
IP datagram
2. Confirm packet/header lengths are reasonable
3. Confirm that IP header has no options
4. Compute hash offset into route cache and load
the route
5. Start loading of next header
S.R.Avasarala
CS Dept., Purdue University
11
Fast path of the code


Stage 2
1. Check if cached route matches destination of
the datagram
2. If not then do an extended lookup in the route
table in Bcache
3. Update TTL and CHECKSUM fields
Stage 3
1. Put updated ttl, checksum and the route
information into IP hdr along with link layer
info from the forwarding table
S.R.Avasarala
CS Dept., Purdue University
12
An exception !!





IP HDR checksum is not verified but only updated
The incremental update algorithm is safe because
if the checksum is bad, it remains bad
Reason: Checksum verification is expensive and is
a large penalty to pay for a rare error that can be
caught end-to-end
Requires 17 instructions with min of 14 cycles,
increasing forwarding time by 21%
IPv6 does not include a header checksum too!!
S.R.Avasarala
CS Dept., Purdue University
13
Some datagrams not handled in fast path
1. Headers whose destination misses in the cache
2. Headers with errors
3. Headers with IP options
4. Datagrams that require fragmentation
5. Multicast datagrams


Requires multicast routing which is based on
source address and inbound link as well
Requires multiple copies of header to be sent
to different line cards
S.R.Avasarala
CS Dept., Purdue University
14
Instruction set




27% of them do bit, byte or word manipulation
due to extraction of various fields from headers
The above instructions can only be done in E0,
resulting in contention (checksum verifying)
Floating point instructions account for 12% but do
not have any impact on performance as they only
set SNMP values and can be interleaved
There is a minimum of loads(6) and stores(4)
S.R.Avasarala
CS Dept., Purdue University
15
Issues in forwarding design

Why not use an ASIC in place of the engine ?



Since IP protocol is stable, why not do it ?
Answer depends on where the router will be
deployed: corporate LAN or ISP’s backbone?
How effective is a route cache ?


A full route lookup is 5 times more expensive
than a cache hit. So we need modest hit rates.
And modest hit rates seem to be assured
because of packet trains
S.R.Avasarala
CS Dept., Purdue University
16
Abstract link layer header

Designed to keep the forwarding engine and its
code simple
S.R.Avasarala
CS Dept., Purdue University
17
The Switched bus



Instead of the conventional shared bus, MGR uses
a 15-port point-to-point switch
Limitation of a point-to-point switch is that it does
not support one to many transfers
The switch has 2 interfaces to each function card
 Data Interface: 75 input, 75 output pins clocked
at 51.84 MHz
 Allocation Interface: 2 request pins, 2 inhibit
pins, 1 input status pin and 1 output status pin
clocked at 25.92 MHz
S.R.Avasarala
CS Dept., Purdue University
18
Data transfer in the switch






An epoch is 16 ticks of data clock (8 allocation clk)
Up to 15 simultaneous transfers in an epoch
Each transfer is 1024 bits of data + 176 auxiliary
bits for parity and control
Aggregate data bandwidth is 49.77 Gb/s. 58.32
Gb/s including the auxiliary bits. 3.3 Gb/s per line
card
The 1024 bits are sent in two 64B blocks
Function cards are expected to wait several epochs
for another 64B block to fill the transfer
S.R.Avasarala
CS Dept., Purdue University
19
Scheduling of the switch


Minimum of 4 epochs to schedule and complete a
transfer but scheduling is pipelined.
 Epoch1: source card signals that it has data to
send to the destination card
 Epoch2: switch allocator schedules transfer
 Epoch3: source and destination cards are
notified and told to configure themselves
 Epoch4: transfer takes place
Flow control through inhibit pins
S.R.Avasarala
CS Dept., Purdue University
20
The Switch Allocator card





Takes connection requests from function cards
Takes inhibit requests from destination cards
Computes a transfer configuration for each epoch
15X15 = 225 possible pairings with 15! Patterns
Disadvantages of the simple allocator
 Unfair: there is a preference for low-numbered
sources
 Requires evaluating 225 positions per epoch,
which is too fast for an FPGA
S.R.Avasarala
CS Dept., Purdue University
21
The Switch Allocator Card
S.R.Avasarala
CS Dept., Purdue University
22
The Switch Allocator



Solution to unfairness problem: Random shuffling
of sources and destinations
Solution to timing problem: Parallel evaluation of
multiple locations
Priority to requests from forwarding engines over
line cards to avoid header contention on line cards
S.R.Avasarala
CS Dept., Purdue University
23
Line Card Design



A line card in MGR can have up to 16 interfaces on
it, all of the same type
Total bandwidth of all interfaces on a card must
not exceed 2.5 Gb/s. The difference between 2.5
and 3.3 Gb/s is to allow for transfer of headers to
and from forwarding engines
Can support:
 1 OC-48c 2.4 Gb/s SONET interface
 4 OC-12c 622 Mb/s SONET interfaces
 3 HIPPI 800 Mb/s interfaces
 16 100Mb/s Ethernet or FDDI interfaces
S.R.Avasarala
CS Dept., Purdue University
24
Line card: Inbound packet processing



Assigns a packet id and breaks data into a chain of
64B
The first page is sent to the FE to get routing info
2 Complications
 Multicasting: FE sends multiple copies of
updated first pages for a single packet
 ATM: Cells are 53b. So we need SAR. OAM cells
between interfaces on the same card must be
sent directly in a single page.
S.R.Avasarala
CS Dept., Purdue University
25
Line card: Outbound packet processing





Receives pages of a packet from the switch
Assembles them in a list
Creates a packet record pointing to the list
Passes the packet record QoS processor (an FPGA)
which does scheduling based on flow ids
Any link layer scheduling is done separately later
S.R.Avasarala
CS Dept., Purdue University
26
Network processor, Routing tables







233-MHz 21064 Alpha processor
Access to line cards through a PCI bridge
Runs 1.1 NetBSD UNIX
All routing protocols run on the NW processor
FEs have only small tables with minimal info
NW processor periodically downloads new tables
into FEs
FEs then switch memory banks and invalidate the
route cache
S.R.Avasarala
CS Dept., Purdue University
27
Conclusions


Makes 2 important contributions
 Emphasizes on examining every header:
improves robustness and security
 Shows it is feasible to build routers that can
serve in emerging high-speed networks
In all, an excellent paper providing complete and
intricate details about high speed router design
S.R.Avasarala
CS Dept., Purdue University
28

A 50-Gb/s IP Router - Purdue University :: Computer Science

Transcript A 50-Gb/s IP Router - Purdue University :: Computer Science

Directory