Reflections on the Development of A/P Nets

Download Report

Transcript Reflections on the Development of A/P Nets

Longest Prefix Matching
Trie-based Techniques
CS 685 Network Algorithmics
Spring 2006
Spring 2006
CS 685 Network Algorithmics
1
The Problem
• Given:
– Database of prefixes with associated next hops, say:
1000101* 128.44.2.3
01101100*  4.33.2.1
10001*  124.33.55.12
10*  151.63.10.111
01*  4.33.2.1
1000100101*  128.44.2.3
– Destination IP address, e.g. 120.16.8.211
• Find: the longest matching prefix and its next hop
Spring 2006
CS 685 Network Algorithmics
2
Constraints
• Handle 150,000 prefixes in database
• Complete lookup in minimum-sized (40-byte) packet
transmission time
– OC-768 (40 Gbps): 8 nsec
• High degree of multiplexing—packets from 250,000
flows interleaved
• Database updated every few milliseconds
 performance  number of memory accesses
Spring 2006
CS 685 Network Algorithmics
3
Basic ("Unibit") Trie Approach
• Recursive data structure (a tree)
• Nodes represent prefixes in the
database
– Root corresponds to prefix of length
zero
• Node for prefix x has three fields:
– 0 branch: pointer to node for prefix x0
(if present)
– 1 branch: pointer to node for prefix x1
(if present)
– Next hop info for x (if present)
Spring 2006
CS 685 Network Algorithmics
Example Database:
a: 0*  x
b: 01000*  y
c: 011*  z
d: 1*  w
e: 100*  u
f: 1100*  z
g: 1101*  u
h: 1110*  z
i: 1111*  x
4
a: 0*  x
b: 01000*  y
c: 011*  z
d: 1*  w
e: 100*  u
f: 1100*  z
g: 1101*  u
h: 1110*  z
i: 1111*  x
0
1
ax
0
dw
0
1
0
1
0
cz
0
0
1
1
0
1
0
1
1
ew
1
0
0
1
fz
0
0
1
gu
1
0
1
hz
1
0
ix
1
0
1
by
0
1
Spring 2006
CS 685 Network Algorithmics
5
Trie Search Algorithm
typedef struct foo {
struct foo *trie_0,
*trie_1;
NEXTHOPINFO trie_info;
} *TRIENODE;
NEXTHOPINFO best = NULL;
TRIENODE np = root;
unsigned int bit =
0x80000000;
Spring 2006
while (np != NULL) {
if (np->trie_info)
best = np->trie_info;
// check next bit
if (addr&bit)
np = np->trie_1;
else
np = np->trie_0;
bit >>= 1;
}
return best;
CS 685 Network Algorithmics
6
Conserving Space
Sparse database  wasted space
– Long chains of trie nodes with only one non-NULL pointer
– Solution: handle "one-way" branches with special nodes
• encode the bits corresponding to the missing nodes using text
strings
Spring 2006
CS 685 Network Algorithmics
7
a: 0*  x
b: 01000*  y
c: 011*  z
d: 1*  w
e: 100*  u
f: 1100*  z
g: 1101*  u
h: 1110*  z
i: 1111*  x
0
1
ax
0
dw
0
1
0
1
0
cz
0
0
1
1
0
1
0
1
1
eu
1
0
0
1
fz
0
0
1
gu
1
0
1
hz
1
0
ix
1
0
1
by
0
1
Spring 2006
CS 685 Network Algorithmics
8
a: 0*  x
b: 01000*  y
c: 011*  z
d: 1*  w
e: 100*  u
f: 1100*  z
g: 1101*  u
h: 1110*  z
i: 1111*  x
0
1
ax
0
dw
0
1
0
1
0
cz
0
1
0
1
1
eu
1
0
0
1
fz
0
0
1
gu
1
0
1
hz
1
0
ix
1
0
1
by
0
1
Spring 2006
CS 685 Network Algorithmics
9
Bigger Issue: Slow!
• Computing one bit at a time is too slow
– Worst-case: one memory access per bit (32 accesses!)
• Solution: compute n bits at a time
– n = stride length
– Use n-bit chunks of addresses as index into array in each
trie node
• How to handle prefixes which are not a multiple of n
in length?
– Extend them, replicate entries as needed
– E.g. n=3, 1* becomes 100, 101, 110, 111
Spring 2006
CS 685 Network Algorithmics
10
Extending Prefixes
Original Database
a: 0*  x
b: 01000*  y
Example:
stride length=2
c: 011*  z
d: 1*  w
e: 100*  w
f: 1100* 
g: 1101* 
h: 1110* 
i: 1111* 
Spring 2006
z
u
z
x
CS 685 Network Algorithmics
Expanded Database
a0: 00*  x
a1: 01*  x
b0: 010000*  y
b1: 010001*  y
c0: 0110*  z
c1: 0111* z
d0: 10*  w
d1: 11*  w
e0: 1000*  u
e1: 1001*  u
f: 1100*  z
g: 1101*  u
h: 1110*  z
i: 1111*  x
11
Expanded Database
a0: 00*  x
a1: 01*  x
b0: 010000*  y
b1: 010001*  y
c0: 0110*  z
c1: 0111* z
d0: 10*  w
d1: 11*  w
e0: 1000*  w
e1: 1001*  w
f: 1100*  z
g: 1101*  u
h: 1110*  z
i: 1111*  x
00
01
x
10
11
x
w
w
00
01
z
u
10
11
z
x
00
01
10
11
z
z
00
01
u
u
00
01
y
y
10
11
10
11
Total cost: 40 pointers (22 null)
Max #memory accesses: 3
Spring 2006
CS 685 Network Algorithmics
12
a: 0*  x
b: 01000*  y
c: 011*  z
d: 1*  w
e: 100*  u
f: 1100*  z
g: 1101*  u
h: 1110*  z
i: 1111*  x
0
1
x
w
0
0
1
0
1
0
z
0
1
1
0
0
0
1
Spring 2006
1
u
0
1
z
by
0
1
0
1
u
1
0
1
z
1
0
x
1
0
1
Total cost: 46 pointers (21 null)
Max #memory accesses: 5
CS 685 Network Algorithmics
13
Choosing Fixed Stride Lengths
• We are trading space for time:
– Larger stride length  fewer memory accesses
– Larger stride length  more wasted space
• Use the largest stride length that will fit in memory
and complete required accesses within the time
budget
Spring 2006
CS 685 Network Algorithmics
14
Updating
Insertion
1. Keep a unibit version of the trie, with each node labeled with
longest matching prefix and its length
2. To insert P, search for P, remembering last node, until
1. Null pointer (not present), or
2. Reach the last stride in P
3. Expand P as needed to match stride length
4. Overwrite any existing entries with length less than P's
Deletion is similar
1. Find entry for prefix to be deleted
2. Remove its entry (from unibit copy also!)
3. Expand any entries that were "covered" by the deleted prefix
Spring 2006
CS 685 Network Algorithmics
15
Variable Stride Lengths
• It is not necessary that every node have the same
stride length
• Reduce waste by allowing stride length to vary per
node
– Actual stride length encoded in pointer to the trie node
– Nodes with fewer used pointers can have smaller stride
lengths
Spring 2006
CS 685 Network Algorithmics
16
Expanded Database
a0: 00*  x
a1: 01*  x
b: 01000*  y
c0: 0110*  z
c1: 0111* z
d0: 10*  w
d1: 11*  w
e: 100*  w
f: 1100*  z
g: 1101*  u
h: 1110*  z
i: 1111*  x
1 bit
2 bits
00
01
10
11
x
x
w
w
2 bits
00
01
10
11
z
u
z
x
00
01
10
11
z
z
0
1
y
1 bit
0
1
u
Total waste: 16 pointers
Max #memory accesses: 3
Note: encoding stride length costs 2 bits/pointer
Spring 2006
CS 685 Network Algorithmics
17
Calculating Stride Lengths
• How to pick stride lengths?
– We have two variables to play with: height and stride length
– Trie height determines lookup speed  set max height first
• Call it h
– Then choose strides to minimize storage
• Define cost of trie T, C(T):
– If T is a single node, then number of array locations in the node
– Else number of array locations in root + i C(Ti), where Ti's are
children of T
• Straightforward recursive solution:
– Root stride s results in y=2s subtries T1, ... Ty
– For each possible s, recursively compute optimal strides for C(Ti)'s
using height limit h-1
– Choose root stride s to minimize total cost = (2s + i C(Ti))
Spring 2006
CS 685 Network Algorithmics
18
Calculating Stride Lengths
• Problem: Expensive, repeated subproblems
• Solution (Srinivasan & Varghese):
Dynamic programming
• Observe that each subtree of a variable-stride trie contains the
set of prefixes as some subtree of the original unibit trie
• For each node of the unibit trie, compute optimal stride and cost
for that stride
• Start at bottom (height = 1), work up
• Determine optimal grouping of leaves in subtree
• Given subtree optimal costs, compute parent optimal cost
• This results in optimal stride length selections for the given set
of prefixes
Spring 2006
CS 685 Network Algorithmics
19
Stride = 1
Cost = 7
0
Stride = 0
Cost = 1
x
0
1
1
0
w
0
1
0
Stride = 0
Cost = 1
1
0
z
Stride = 2
Cost = 4
0
1
Stride = 1
Cost = 2
1
u
1
0
0
1
z
0
0
1
u
1
0
1
z
1
0
x
1
0
1
by
0
1
Spring 2006
Stride = 0
Cost = 1
CS 685 Network Algorithmics
20
Alternative Method: Level Compression
• LC-trie (Nilsson & Karlsson '98) is a variable-stride
trie with no empty entries in trie nodes
• Procedure:
– Select largest root stride that allows no empty entries
– Do this recursively down through the tree
• Disadvantage: cannot control height precisely
Spring 2006
CS 685 Network Algorithmics
21
0
x
1
w
Stride = 1
0
0
1
0
1
0
z
0
1
Stride = 1
0
1
1
u
1
0
0
1
z
0
0
1
u
1
0
1
z
1
0
x
1
0
1
Stride = 1
Stride = 2
by
0
1
Spring 2006
CS 685 Network Algorithmics
22
Performance Comparisons
• MAE-East database (1997 snapshot)
– ~ 40K prefixes
• "Unoptimized" multibit trie: 2003 KB
• Optimal fixed-stride: 737 KB, computed in 1 msec
– Height limit = 4 ( 1 Gbps wire speed @ 80 nsec/access)
• Optimized (S&V) variable-stride: 423 KB, computed in
1.6 sec, Height limit = 4
• LC-compressed
– Height = 7
– 700 KB
Spring 2006
CS 685 Network Algorithmics
23
Lulea Compressed Tries
•
Goals:
–
–
Minimize number of memory accesses
Aggressively compress trie
•
•
Three-level trie with strides of 16, 8, 8
–
•
Goal: so it can fit in SRAM (or even cache)
8 mem accesses typical
Main Techniques
1.
2.
3.
4.
Spring 2006
Leaf-pushing
Eliminate duplicate pointers from trie node arrays
Efficient bit-counting using precomputation for large bitmaps
Use of indices instead of full pointers for next-hop info
CS 685 Network Algorithmics
24
1. Leaf-Pushing
• In general, a trie node entry has associated
– A pointer to a next trie node
– A prefix (i.e. pointer to next-hop info)
– Or both, or neither
• Observation: we don't need to know about a prefix
pointer along the way until we reach a leaf
• So: "push" prefix pointers down to leaves
– Keep only one set of pointers per node
Spring 2006
CS 685 Network Algorithmics
25
Leaf-Pushing: the Concept
Prefixes
Spring 2006
CS 685 Network Algorithmics
26
Expanded Database
a0: 00*  x
a1: 01*  x
b0: 010000*  y
b1: 010001*  y
c0: 0110*  z
c1: 0111* z
d0: 10*  w
d1: 11*  w
e0: 1000*  u
e1: 1001*  u
f: 1100*  z
g: 1101*  u
h: 1110*  z
i: 1111*  x
Before
00
01
x
10
11
x
w
w
00
01
z
u
10
11
00
01
00
01
z
x
10
11
z
z
00
01
u
u
00
01
10
11
00
01
10
11
y
10
11
Cost: 40 pointers
(22 wasted)
10
11
x
y
x
z
z
00
01
10
11
y
y
x
x
After Leaf-Pushing
00
01
10
11
Spring 2006
z
u
z
x
00
01
10
11
u
u
w
w
CS 685 Network Algorithmics
Cost: 20 pointers
27
2. Removing Duplicate Pointers
• Leaf-pushing results in many
consecutive duplicate pointers
• Would like to remove redundancy
and store only one copy in each
node
• Problem: now we can't directly
index into array using address bits
– Example: k=2, bits 01 = index 1 needs
to be converted to index 0 somehow
Spring 2006
CS 685 Network Algorithmics
00
01
10
11
u
u
w
w
u
w
28
2. Removing Duplicate Pointers
Solution: Add a bitmap: one bit per
original entry
00
01
– 1 indicates new value
– 0 indicates duplicate of previous value
• To convert index i, count 1's up to
position i in the bitmap, and
subtract 1
Example: old index 1  new index 0
old index 2  new index 1
Spring 2006
CS 685 Network Algorithmics
10
11
00
01
10
11
u
u
w
w
1
0
1
0
u
w
29
Bitmap for Duplicate Elimination
Prefixes
100000000000100010000100000000000000000010000000000100000000000010001000000100000011000000000000000000
Spring 2006
CS 685 Network Algorithmics
30
3. Efficient Bit-Counting
• Lulea first-level 16-bit stride  64K entries
• Impractical to count bits up to, say, entry 34578 on
the fly!
• Solution: Precompute (P2a)
– Divide bitmap into chunks (say, 64 bits each)
– Store the number of 1 bits in each chunk in an array B
– Compute # 1 bits up to bit k by:
chunkNum = k >> 6;
posInChunk = k & 0x3f; // k mod 64
numOnes = B[chunkNum] +
count1sInChunk(chunkNum,posInChunk) – 1;
Spring 2006
CS 685 Network Algorithmics
31
Bit-Counting Precomputation Example
index = 35
Chunk Size = 8 bits
1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 1
0
3
3
6
7
9
Converted index = 7 + 2 – 1 = 8
Cost: 2 memory accesses (maybe less)
Spring 2006
CS 685 Network Algorithmics
32
4. Efficient Pointer Representation
• Observation: the number of different next-hop
pointers is limited
– Each corresponds to an immediate neighbor of the router
– Most routers have at most a few dozen neighbors
– In some cases a router might have a few hundred distinct
next hops, even a thousand
• Apply P7: avoid unnecessary generality
– Only a few bits (say 8-12) needed to distinguish between
actual next-hop possibilities
• Store indices into table of next-hops info
– E.g., to support up to 1024 next hops: 10 bits
– 40K prefixes  40K pointers  160KB @ 32 bits,
vs 50KB @ 10 bits
Spring 2006
CS 685 Network Algorithmics
33
Other Lulea Tricks
• First level of trie uses two levels of bit-counting array
– First counts bits before the 64-bit chunk
– Second counts bits in the 16-bit word within chunk
• Second- and third-level trie nodes are laid out
differently depending on number of pointers in them
– Each node has 256 entries
– Categorized by number of pointers
• 1-8: "sparse" — store 8-bit indices + 8 16-bit pointers (24B)
• 9-64: "dense" — like first level, but only one bit-counting array
(only six bits of count needed)
• 65-256: "very dense" — like first level, with two bit-counting
arrays: 4 64-bit chunks, 16 16-bit words
Spring 2006
CS 685 Network Algorithmics
34
Lulea Performance Results
1997 MAE-East database
–
–
–
–
32K entries, 58K leaves, 56 different next hops
Resulting Trie size: 160KB
Build time: 99 msec
Almost all lookups took < 100 clock cycles
(333MHz Pentium)
Spring 2006
CS 685 Network Algorithmics
35
Trie Bitmap
(Eatherton, Dittia & Varghese)
• Goal: storage, speed comparable to Lulea plus fast
insertion
• Main culprit in slow insertion is leaf-pushing
• So get rid of leaf-pushing
– Go back to storing node and prefix pointers explicitly
– Use the same compression bitmap trick on both lists
• Store next-hop information separately, only retrieve
at the end
– Like leaf-pushing, only in the control plane!
• Use smaller strides to limit memory accesses to one
per trie node (Lulea requires at least two)
Spring 2006
CS 685 Network Algorithmics
36
Storing Prefixes Explicitly
• To avoid expansion/leaf pushing, we have to store
prefixes in the node explicitly
• There are 2k+1 – 1 possible prefixes of length k
– Store list of (unique) next hop pointers for each prefix
covered by this node
– Use same bitmap/bit counting technique as Lulea to find
pointer index
– Keep trie nodes small (stride 4 or less), exploit hardware
(P5) to do prefix matching, bit counting
Spring 2006
CS 685 Network Algorithmics
37
Example: Root node, stride = 3
a: 0*  x
b: 01000*  y
c: 011*  z
d: 1*  w
e: 100*  u
f: 1100*  z
g: 1101*  u
h: 1110*  z
i: 1111*  x
*
0
000 0
0*
1
001 0
1*
1
x
00* 0
w
011 0
01* 0
z
100 0
10* 0
u
101 0
11* 0
110 1
000* 0
010 1
001* 0
111 0
010* 0
011* 1
100* 1
to child nodes
101* 0
110* 0
111* 0
Spring 2006
CS 685 Network Algorithmics
38
Tree Bitmap Results
• Insertions are as in simple multibit tries
• May cause complete revamp of trie node, but that
requires only one memory allocation
• Performance comparable to Lulea, but insertion much
faster
Spring 2006
CS 685 Network Algorithmics
39
A Different Lookup Paradigm
• Can we use binary search to do longest-prefix
lookups?
• Observe that each prefix corresponds to a range of
addresses
– E.g. 204.198.76.0/24 covers the range
204.198.76.0 – 204.198.76.255
– Each prefix has two range endpoints
– N disjoint prefixes divide the entire space into 2N+1 disjoint
segments
– By sorting range endpoints, and comparing to address, can
determine exact prefix match
Spring 2006
CS 685 Network Algorithmics
40
Prefixes as Ranges
Spring 2006
CS 685 Network Algorithmics
41
Binary Search on Ranges
• Store 2N endpoints in sorted order
– Including the full address range for *
• Store two pointers for each entry
– ">" entry: next-hop info for addresses strictly greater than
that value
– "=" entry: next-hop info for addresses equal to that value
Spring 2006
CS 685 Network Algorithmics
42
Example: 6-bit addresses
Example Database:
a: 0*  x
b: 01000*  y
c: 011*  z
d: 1*  w
e: 100*  u
f: 1100*  z
g: 1101*  u
h: 1110*  z
i: 1111*  x
Spring 2006
a: 000000-011111  x
b: 010000-010001  y
c: 011000-011111  z
d: 100000-111111  w
e: 100000-100111  u
f: 110000-110011  z
g: 110100-110111  u
h: 111000-111011  z
i: 111100-111111  x
CS 685 Network Algorithmics
000000
010000
010001
011000
011111
100000
100111
110000
110011
110100
110111
111000
111011
111100
111111
>
x
y
x
x
x
u
w
z
x
u
x
z
x
x
-
=
x
y
y
z
x
u
u
z
z
u
u
z
z
x
x
43
Range Binary Search Results
• N prefixes can be searched in log2 N + 1 steps
– Slow compared to multibit tries
– Insertion can also be expensive
• Memory-expensive
– Requires 2 full-size entries per prefix
– 40K prefixes, 32-bit addresses: 320KB, not counting nexthop info
• Advantage: no patent restrictions!
Spring 2006
CS 685 Network Algorithmics
44
Binary Search on Prefix Lengths
Waldvogel, et al
• For same-length prefixes, a hash table gives fast
comparisons
• But linear search on prefix lengths is too expensive
• Can we do a faster (binary) search on prefix lengths?
– Challenge: how do we know whether to move "up" or
"down" in length on failure?
– Solution: include extra information to indicate presence of a
longer prefix that might match
– These are called marker entries
– Each marker entry also contains best-matching prefix for
that node
– When searching, remember best-matching prefix when
going "up" because of a marker, in case of later failure
Spring 2006
CS 685 Network Algorithmics
45
Example: Binary Search on Prefix Length
Example Database:
a: 0*  x
b: 01000*  y
c: 011*  z
d: 1*  w
e: 100*  u
f: 1100*  z
g: 1101*  u
h: 1110*  z
i: 1111*  x
Prefix Lengths: 1, 3, 4, 7
length 1
BMP
0* 1*
a,x d,w
length 3 011* 100* 110M 111M 010M
c,z
e,u
d,w d,w
a,x
BMP
length 4 1100* 1101* 1110* 1111* 0100M
f,z
g,u
h,z
i,x
a,x
BMP
length 5 01000*
b,y
BMP
Example: Search for address 011000 and 101000
Spring 2006
CS 685 Network Algorithmics
46
Binary Search on Prefix Length
Performance
• Worst-case number of hash-table accesses: 5
• However, most prefixes are 16 or 24 bits
– Arrange hash tables so these are handled in one or two
accesses
• This technique is very scalable for larger address
lengths (e.g. 128 bits for IPv6)
– Unibit Trie for IPv6: 128 accesses!
Spring 2006
CS 685 Network Algorithmics
47
Memory Allocation for Compressed
Schemes
• Problem: when using a compressed scheme (like
Lulea), trie nodes are kept at minimal size
• If a node grows (changes size), it must be
reallocated and copied over
• As we have discussed, memory allocators can
perform very badly
– Assume M is the size of the largest possible request
– Cannot guarantee more than 1/log2 M of memory will be
used!
• E.g. if M=32, 20% is max guaranteed utilization
• Router vendors cannot claim to support large databases
Spring 2006
CS 685 Network Algorithmics
48
Memory Allocation for Compressed
Schemes
• Solution: Compaction
– Copy memory from one location to another
• General-purpose OS's avoid compaction!
– Reason: very hard to find and update all pointers to objects
in the moved region
• The good news:
– Pointer usage is very constrained in IP lookup algorithms
– Most lookup structures are trees  at most one pointer to
any node
• By storing a "parent" pointer, can easily update pointers as
needed
Spring 2006
CS 685 Network Algorithmics
49