PI Meeting Presentation
Extreme Networking
Achieving Nonstop Network Operation
Under Extreme Operating Conditions
DARPA PI Meeting, January 27-29, 2003
Jon Turner
[email protected]
http://www.arl.wustl.edu/arl
Project Overview
Motivation
» data networks have become mission-critical resource
» networks often subject to extreme traffic conditions
» need to design networks for worst-case conditions
» technology advances making extreme defenses practical
Extreme network services
» Lightweight Flow Setup (LFS)
» Network Access Service (NAS)
» Reserved Tree Service (RTS)
Key router technology components
» Super-Scalable Packet Scheduling (SPS)
» Dynamic Queues with Auto-aggregation (DQA)
» Scalable Distributed Queueing (SDQ)
Prototype Extreme Router
[Figure: prototype router organization. A Control Processor and a Switch Fabric connect multiple ports; each port combines an input/output port processor (IPP/OPP), an FPX, an SPC and a Line Card.]
Prototype Extreme Router
[Figure: the same router organization, highlighting the Field Programmable Port Extenders. Ports connect through an ATM switch core; each port's FPX combines a Network Interface Device and a Reprogrammable Application Device, shown with 128 MB of SDRAM and 4 MB of SRAM, alongside the SPC and Line Card.]
Prototype Extreme Router
[Figure: the same router organization, highlighting the Smart Port Card 2 (embedded processors at each port): a Pentium processor with cache, North Bridge, 128 MB of memory, APIC, FPGA and flash disk.]
Prototype Extreme Router
[Figure: the same router organization, highlighting a Gigabit Ethernet Line Card built from a GBIC optical interface, a framer and an FPGA.]
Performance of SPC-2
[Figure: SPC forwarding performance versus packet size (0-1400 bytes). One panel plots forwarding rate in Mb/s, where SPC-2 shows a 2.5x improvement over SPC-1 for average packet lengths; the other plots forwarding rate in packets/s, with a 2.6x improvement for average packet lengths. The largest gain is at small packet sizes; the PCI bus limits performance for large packets.]
More SPC-2 Performance
[Figure: forwarding rate (Mb/s) versus input rate (Mb/s) for 32-byte and 1500-byte packets. For 32-byte packets, SPC-2 reaches 220K packets/s versus 78K packets/s for SPC-1; throughput falls off at high loads due to PCI bus contention and input priority. For 1500-byte packets, SPC-2 reaches 22K packets/s versus 15K packets/s for SPC-1.]
Field Programmable Port Extender (FPX)
Functions for extreme router.
» high speed packet storage manager
» packet classification & route lookup
– fast route lookup
– exact match filters
– 32 general filters
» flexible queue manager
– per-flow queues for reserved flows
– route packets to/from SPC
Network Interface Device (NID) routes cells to/from RAD.
Reprogrammable Application Device (RAD) functions:
» will implement core router functions in extensible router
» may also implement arbitrary packet processing functions
[Figure: FPX block diagram. The NID and RAD each have a 2 Gb/s external interface and are joined by a 6.4 Gb/s path; the RAD (roughly 400 Kgates plus 80 KB of block RAM) connects to two 36-bit, 1 MB SRAMs and two 64-bit, 100 MHz, 64 MB SDRAMs.]
Logical Port Architecture
[Figure: logical port architecture, showing input side and output side processing. Reassembly contexts feed packet classification & route lookup in the FPX (PCU), which steers packets into virtual output queues (DQ), special flow queues, output queues, and SPC plugins.]
FPX Packet Processor Block Diagram
[Figure: FPX packet processor block diagram. Packets arriving from the line card and switch pass through the ISAR, the Packet Storage Manager (SDRAM, including the free space list), header processing with Classification and Route Lookup (SRAM), and the Queue Manager (SRAM), then leave through the OSAR to the line card and switch. A Control Cell Processor handles route & filter updates, register sets, and DQ status & rate control; the data path also supports packet discard.]
Classification and Route Lookup (CARL)
Three lookup engines.
» route lookup for routing datagrams - best prefix match
» flow filters for multicast & reserved flows - exact match
» general filters (32) for management - exhaustive search
Input processing.
» parallel check of all three engines
» return highest priority exclusive and highest priority non-exclusive match
» general filters have unique priorities
» all flow filters share a single priority
» ditto for routes
Output processing.
» no route lookup on output (bypass)
Route lookup & flow filters share off-chip SRAM.
General filters processed on-chip.
[Figure: packet headers pass through an input demux to the route lookup, flow filter and general filter engines in parallel; a result processing & priority resolution stage combines their outputs.]
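The priority-resolution step above can be sketched in a few lines of C. The structures and the lower-number-wins priority convention are illustrative assumptions, not the actual FPX interface:

/* Sketch: combine the results of the three parallel lookup engines,
 * keeping the highest-priority exclusive and the highest-priority
 * non-exclusive match (here, lower number = higher priority). */
struct lookup_result {
    int      valid;      /* 1 if this engine matched */
    int      exclusive;  /* exclusive vs. non-exclusive filter */
    int      priority;   /* general filters: unique; flow filters and routes: shared */
    unsigned qid;        /* queue selected by this result */
};

struct resolved {
    struct lookup_result best_excl;     /* best exclusive match */
    struct lookup_result best_nonexcl;  /* best non-exclusive match */
};

static void consider(struct resolved *r, const struct lookup_result *c)
{
    struct lookup_result *slot;

    if (!c->valid)
        return;
    slot = c->exclusive ? &r->best_excl : &r->best_nonexcl;
    if (!slot->valid || c->priority < slot->priority)
        *slot = *c;
}

/* route, flow and general come from the route lookup, flow filter and
 * general filter engines, respectively. */
struct resolved resolve(const struct lookup_result *route,
                        const struct lookup_result *flow,
                        const struct lookup_result *general)
{
    struct resolved r = { {0, 0, 0, 0}, {0, 0, 0, 0} };

    consider(&r, route);
    consider(&r, flow);
    consider(&r, general);
    return r;
}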
Exact Match Lookup
Exact match lookup table used for reserved flows.
» includes LFS, signaled QOS flows and multicast
» also used for flows requiring processing by SPCs
» each of these flows has separate queue in QM
» multicast flows have two queues (recycling multicast)
» implemented using hashing
tag = [src, dst, sport, dport, proto]
data includes
• 2 outputs + 2 QIDs
• LFS rates
• packet, byte counters
• flags
[Figure: a simple hash on the packet's src and dst selects a table entry; ingress-valid and egress-valid bits are kept in on-chip SRAM, and the tag+data entries in off-chip SRAM, with separate memory areas for ingress and egress packets.]
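A minimal sketch of this hashed exact-match lookup follows. The table size, hash function and entry layout are invented for the example; only the ingress side is shown:

#include <stdint.h>
#include <stddef.h>

#define TABLE_SIZE 4096      /* illustrative table size */

/* Lookup tag: the full 5-tuple, as on the slide. */
struct flow_tag {
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t  proto;
};

/* Entry data as listed on the slide (field widths are illustrative):
 * 2 outputs + 2 QIDs, LFS rates, packet/byte counters and flags. */
struct flow_data {
    uint8_t  output[2];
    uint16_t qid[2];
    uint32_t lfs_rate;
    uint64_t pkts, bytes;
    uint32_t flags;
};

struct flow_entry {          /* tag + data, kept in off-chip SRAM */
    struct flow_tag  tag;
    struct flow_data data;
};

/* On-chip valid bits let a miss be declared without reading off-chip
 * SRAM; ingress and egress use separate areas (only ingress shown). */
static uint8_t           ingress_valid[TABLE_SIZE];
static struct flow_entry ingress_table[TABLE_SIZE];

static unsigned hash_tag(const struct flow_tag *t)
{
    uint32_t h = t->src ^ t->dst;                     /* simple hash */
    h ^= ((uint32_t)t->sport << 16) | t->dport;
    h ^= t->proto;
    return h % TABLE_SIZE;
}

/* Returns the entry data on a hit, NULL on a miss. */
const struct flow_data *exact_match(const struct flow_tag *t)
{
    unsigned i = hash_tag(t);
    const struct flow_tag *e = &ingress_table[i].tag;

    if (!ingress_valid[i])
        return NULL;
    if (e->src != t->src || e->dst != t->dst || e->sport != t->sport ||
        e->dport != t->dport || e->proto != t->proto)
        return NULL;                                  /* different flow */
    return &ingress_table[i].data;
}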
General Filter Match
General filter match considers full 5-tuple
» prefix match on source and destination addresses
» range match on source and destination ports
» exact or wildcard match on protocol
» each filter has a priority and may be exclusive or nonexclusive
Intended primarily for management filters.
» firewall filters
» class-based monitoring
» class-based special processing
Implemented using parallel
exhaustive search.
» limit of 32 filters
[Figure: a bank of parallel matchers operating on a shared filter memory.]
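A minimal sketch of the general filter match: the hardware runs the matchers in parallel, while this C version simply loops over them. The field layout, the protocol-zero wildcard convention, and the lower-number-wins priority are illustrative assumptions:

#include <stdint.h>

#define MAX_GENERAL_FILTERS 32

/* One general filter: prefix match on addresses, range match on ports,
 * exact or wildcard match on protocol. */
struct gen_filter {
    uint32_t src, src_mask;       /* address prefix as value + mask */
    uint32_t dst, dst_mask;
    uint16_t sport_lo, sport_hi;  /* inclusive port ranges */
    uint16_t dport_lo, dport_hi;
    uint8_t  proto;               /* 0 treated as wildcard here */
    uint8_t  exclusive;           /* exclusive vs. non-exclusive */
    uint8_t  priority;            /* unique per filter */
    uint8_t  in_use;
};

struct pkt_hdr {
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t  proto;
};

static int filter_matches(const struct gen_filter *f, const struct pkt_hdr *h)
{
    if ((h->src & f->src_mask) != f->src) return 0;
    if ((h->dst & f->dst_mask) != f->dst) return 0;
    if (h->sport < f->sport_lo || h->sport > f->sport_hi) return 0;
    if (h->dport < f->dport_lo || h->dport > f->dport_hi) return 0;
    if (f->proto != 0 && f->proto != h->proto) return 0;
    return 1;
}

/* Exhaustive search over all filters; returns the index of the best
 * (lowest priority number) matching filter, or -1 if none match. */
int general_filter_match(const struct gen_filter *filters, const struct pkt_hdr *h)
{
    int best = -1;
    for (int i = 0; i < MAX_GENERAL_FILTERS; i++) {
        if (!filters[i].in_use || !filter_matches(&filters[i], h))
            continue;
        if (best < 0 || filters[i].priority < filters[best].priority)
            best = i;
    }
    return best;
}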
Fast IP Lookup (Eatherton & Dittia)
[Figure: example lookup for address 101 100 101 000 in a multibit trie with 3-bit strides, showing each node's internal and external bit vectors.]
Multibit trie with clever data encoding.
» small memory requirements (<7 bytes per prefix)
» small memory bandwidth and simple lookup logic yield fast lookup rates
» updates have negligible impact on lookup performance
Avoid impact of external memory latency on throughput by
interleaving several concurrent lookups.
» an 8 lookup-engine configuration uses about 6% of the Virtex 2000E's logic cells
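A simplified single-node step of a tree bitmap lookup in the spirit of the Eatherton & Dittia scheme, using a 3-bit stride; the bit-ordering conventions and field names are illustrative, not the exact FPX encoding:

#include <stdint.h>

#define STRIDE 3                      /* address bits consumed per node */

/* One multibit-trie node in tree bitmap form. The internal bitmap marks
 * prefixes stored inside the node (lengths 0..STRIDE-1); the external
 * bitmap marks which of the 2^STRIDE children exist. Children and
 * results are stored contiguously, so a population count over the
 * bitmap gives the array index. */
struct tb_node {
    uint8_t internal;                 /* 2^STRIDE - 1 = 7 bits used */
    uint8_t external;                 /* 2^STRIDE     = 8 bits used */
    int32_t child_base;               /* index of first child node */
    int32_t result_base;              /* index of first stored result */
};

static int popcount8(uint8_t v)
{
    int c = 0;
    while (v) { c += v & 1; v >>= 1; }
    return c;
}

/* Process one node: 'bits' holds the next STRIDE address bits.
 * Records the best matching result inside this node in *result (caller
 * initializes *result to -1), and returns the index of the child to
 * follow, or -1 if there is no child for these bits. */
static int32_t tb_step(const struct tb_node *n, unsigned bits, int32_t *result)
{
    /* Longest prefix inside the node: try lengths STRIDE-1 down to 0.
     * Bit position of prefix p of length l is (2^l - 1) + p. */
    for (int len = STRIDE - 1; len >= 0; len--) {
        unsigned p = bits >> (STRIDE - len);
        unsigned pos = (1u << len) - 1 + p;
        if (n->internal & (1u << pos)) {
            *result = n->result_base +
                      popcount8(n->internal & ((1u << pos) - 1));
            break;
        }
    }
    /* Follow the child for these STRIDE bits, if it exists. */
    if (n->external & (1u << bits))
        return n->child_base + popcount8(n->external & ((1u << bits) - 1));
    return -1;
}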
Lookup Throughput
[Figure: lookup throughput (millions of lookups per second, 0-12) versus number of lookup engines (1-8) with 450 MB/s of SRAM bandwidth, for Mae-West split tree, Mae-West single tree, and worst-case single tree. Throughput grows linearly with the number of engines; the split tree cuts storage by 30%.]
Update Performance
[Figure: lookup throughput (millions of lookups per second) versus number of lookup engines (1-8) for a single tree with no updates and with update rates from 1 update per ms up to 10^6 updates/second; reasonable update rates have little impact on lookup throughput.]
Queue Manager Logical View (QM)
[Figure: queue manager logical view. Arriving packets go into a separate queue set for each output; each set holds 64 hashed datagram queues for traffic isolation, separate queues for each reserved flow, and a separate queue for each SPC flow. Per-output VOQ packet schedulers feed the switch, a link packet scheduler feeds the outgoing link, and an SPC packet scheduler handles traffic to and from the SPC.]
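A minimal sketch of how best-effort datagrams might be spread over the 64 hashed queues for traffic isolation; the hash and field layout are illustrative assumptions, not the FPX queue manager:

#include <stdint.h>

#define NUM_DATAGRAM_QUEUES 64    /* per-output hashed datagram queues */

/* Fields identifying a flow; layout is illustrative. */
struct flow_key {
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t  proto;
};

/* Pick one of the 64 datagram queues for a best-effort packet, so that
 * different flows tend to land in different queues. Any reasonable
 * hash works; this one just mixes the 5-tuple. */
static unsigned datagram_queue(const struct flow_key *k)
{
    uint32_t h = k->src ^ (k->dst * 2654435761u);
    h ^= ((uint32_t)k->sport << 16) | k->dport;
    h ^= k->proto;
    h ^= h >> 16;
    return h % NUM_DATAGRAM_QUEUES;
}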
Backlogged TCP Flows with Tail Discard
[Figure: buffer level versus time for 100 TCP sources sharing a 500 Mb/s link with 100 ms RTT under tail drop. With a 20K buffer the occupancy swings widely, so flows with large buffers see large delay variance; with a 1K buffer the queue repeatedly underflows, giving low throughput.]
DRR with Discard from Longest Queue
[Figure: buffer level versus time for the same scenario (100 sources, 100 ms RTT, 500 Mb/s link) under DRR with discard from the longest queue, with 20K and 1K buffers.]
Smaller fluctuations, but still significant.
Queue State DRR
Add hysteresis to the packet discard policy.
» discard from the same queue until it becomes the shortest non-empty queue
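A minimal sketch of this discard rule, assuming a simple array of per-queue lengths (the structure and names are illustrative, not the FPX queue manager):

#include <stddef.h>

/* Sticky discard-queue selection with hysteresis (QSDRR-style).
 * Keep discarding from the previously chosen queue until it becomes
 * the shortest non-empty queue; then switch to the longest queue. */
static int pick_discard_queue(const unsigned *qlen, size_t nqueues, int last)
{
    size_t i;
    int longest = -1, shortest = -1;

    for (i = 0; i < nqueues; i++) {
        if (qlen[i] == 0)
            continue;
        if (longest < 0 || qlen[i] > qlen[longest])
            longest = (int)i;
        if (shortest < 0 || qlen[i] < qlen[shortest])
            shortest = (int)i;
    }
    if (longest < 0)
        return -1;                       /* all queues empty, nothing to drop */

    /* Hysteresis: stick with the last victim unless it has drained down
     * to the shortest non-empty queue (or emptied entirely). */
    if (last >= 0 && qlen[last] > 0 && qlen[last] > qlen[shortest])
        return last;

    return longest;
}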
[Figure: buffer level versus time for the same scenario (100 sources, 100 ms RTT, 500 Mb/s link) under QSDRR, with 20K and 1K buffers. Variation is low even with small queues, giving low delay with no tuning.]
Packet Scheduling with Approx. Radix Sorting
[Figure: three timing wheels of increasing granularity, each with fast-forward bits marking occupied slots, feeding an output list.]
To implement virtual time schedulers, need to quickly find the queue
whose “lead packet” has the smallest virtual finish time.
» using a priority queue, this requires O(log n) time for n queues
Use approximate radix sorting, with compensation – O(1) time.
» timing wheels with increasing granularity and range
» approximate sorting produces inter-packet timing errors
» observe errors & compensate when next packet scheduled
Fast-forward bits used to skip over empty slots.
Scheduler puts no limit on number of queues.
Two copies of data structure needed for approx. version of WF2Q+.
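A minimal sketch of the timing-wheel idea with fast-forward bits; the slot counts, granularities and data layout are illustrative assumptions, not the FPX scheduler:

#include <stdint.h>

#define WHEEL_SLOTS 8                 /* small slot count for illustration */

/* One timing wheel: each slot holds queues whose lead packet finishes
 * in that slot's time range; fast-forward bits mark non-empty slots so
 * the scheduler can skip empty ones quickly. */
struct wheel {
    uint32_t granularity;             /* virtual-time units per slot */
    uint8_t  ff_bits;                 /* bit i set => slot i non-empty */
    int      slot_head[WHEEL_SLOTS];  /* head queue id per slot, -1 if empty */
};

/* Insert queue q into the first wheel whose range covers its finish
 * time. Coarser wheels cover larger ranges at lower resolution, which
 * is the source of the (compensated) approximation error. */
static void wheel_insert(struct wheel *wheels, int nwheels,
                         uint32_t now, uint32_t finish, int q)
{
    uint32_t delta = finish - now;

    for (int w = 0; w < nwheels; w++) {
        uint32_t range = wheels[w].granularity * WHEEL_SLOTS;
        if (delta < range || w == nwheels - 1) {
            int slot = (int)((finish / wheels[w].granularity) % WHEEL_SLOTS);
            wheels[w].slot_head[slot] = q;   /* real code would link a list */
            wheels[w].ff_bits |= (uint8_t)(1u << slot);
            return;
        }
    }
}

/* Find the next non-empty slot by scanning the fast-forward bits. */
static int next_nonempty_slot(const struct wheel *w, int start)
{
    for (int i = 0; i < WHEEL_SLOTS; i++) {
        int slot = (start + i) % WHEEL_SLOTS;
        if (w->ff_bits & (1u << slot))
            return slot;
    }
    return -1;
}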
Resource Usage Estimates
Key resources in Xilinx FPGAs
» flip flops - 38,400
» lookup tables (LUTs) - 38,400
– each can implement any 4-input Boolean function
» block RAMs (4 Kbits each) - 160
                    Number                    % of total
            flops    LUTs   RAMs      flops    LUTs   RAMs
CARL        3,781   5,199     28       9.8%   13.5%  17.5%
CCP         1,500     750      5       3.9%    2.0%   3.1%
FIFOs         159     340     12       0.4%    0.9%   7.5%
ISAR        3,674   5,053     28       9.6%   13.2%  17.5%
OSAR        3,795   3,208     22       9.9%    8.4%  13.8%
PSM         6,196   5,746     20      16.1%   15.0%  12.5%
QM          5,605   6,472     14      14.6%   16.9%   8.8%
Total      24,710  26,768    129      64.3%   69.7%  80.6%
Resources  38,400  38,400    160
FPGA Performance Characteristics
[Figure: delay (ns) versus number of logic levels (LUTs) for XCV2000e-6 and XCV1000e-7 devices, showing minimum and maximum separation.]
Summary
Version 1 Hardware status.
» hardware operating in lab, passing packets
» but still have some bugs to correct
» typical test-diagnose-correct cycle takes about one day
» version 1 has simplified queue manager
Planning several system demos in the next month.
» system level throughput testing – focus on lookup proc.
» verifying basic fair queueing behavior
» TCP SYN attack suppressor
– SPC-resident plugin monitors new TCP connections going to a server
– when there are too many "half-open" connections, the oldest are reset
– flow filters inserted for stable connections, enabling hardware forwarding (see the sketch below)
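A minimal sketch of the suppressor's decision logic; the thresholds, table layout and helper functions are illustrative assumptions (the real plugin runs on the SPC):

#include <stdint.h>

#define MAX_CONNS     4096   /* illustrative table capacity */
#define MAX_HALF_OPEN 1024   /* illustrative half-open threshold */

/* One tracked connection to the protected server. */
struct conn {
    uint32_t client_ip;
    uint16_t client_port;
    uint32_t first_syn_time;  /* when the SYN was seen */
    int      established;     /* 1 once the handshake completes */
};

/* Hypothetical helpers standing in for plugin actions. */
static void send_reset(const struct conn *c)          { (void)c; /* would emit a TCP RST */ }
static void install_flow_filter(const struct conn *c) { (void)c; /* would add an exact-match flow filter */ }

/* Called when a new SYN toward the server is seen: if too many
 * half-open connections are outstanding, reset the oldest one. */
void on_syn(struct conn *table, int *nconn, struct conn newc)
{
    int half_open = 0, oldest = -1;

    for (int i = 0; i < *nconn; i++) {
        if (table[i].established)
            continue;
        half_open++;
        if (oldest < 0 || table[i].first_syn_time < table[oldest].first_syn_time)
            oldest = i;
    }
    if (half_open >= MAX_HALF_OPEN && oldest >= 0) {
        send_reset(&table[oldest]);
        table[oldest] = table[--(*nconn)];   /* drop the reset connection */
    }
    if (*nconn < MAX_CONNS)
        table[(*nconn)++] = newc;            /* newc.established starts at 0 */
}

/* Called when the handshake completes: mark the connection stable and
 * install a flow filter so later packets are forwarded in hardware. */
void on_established(struct conn *c)
{
    c->established = 1;
    install_flow_filter(c);
}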
Expect to complete version 2 hardware in the next six months.