SEATTLE and Recent Work
Jennifer Rexford
Princeton University
http://www.cs.princeton.edu/~jrex
Joint with Changhoon Kim, Minlan Yu, and Matthew Caesar
Challenges in Edge Networks
• Large number of hosts
–Tens or even hundreds of thousands of hosts
• Dynamic hosts
–Host mobility
–Virtual machine migration
• Cost conscious
–Equipment costs
–Network-management costs
Need a scalable and efficient self-configuring network
An All-Ethernet Solution?
• Self-configuration
–Hosts: MAC addresses
–Switches: self-learning
• Simple host mobility
–Location-independent flat addresses
• But, poor scalability and performance
–Flooding frames to unknown destinations
–Large forwarding tables (one entry per address)
–Broadcast for discovery (e.g., ARP, DHCP)
–Inefficient delivery of frames over a spanning tree
An All-IP Solution?
• Scalability
–Hierarchical prefixes (smaller tables)
–No flooding
• Performance
–Forwarding traffic over shortest paths
• But, several disadvantages
–Complex joint configuration of routing and DHCP
–Clumsy mobility (location-dependent addresses)
–Expensive longest-prefix match in data plane
Compromise: Hybrid Architecture
Ethernet-based IP subnets interconnected by routers
Ethernet Bridging
- Flat addressing
- Self-learning
- Flooding
- Forwarding along a tree
IP Routing
- Hierarchical addressing
- Subnet configuration
- Host configuration
- Forwarding along shortest paths
Sacrifices Ethernet’s simplicity and IP’s efficiency for scalability
Can We Do Better?
• Shortest-path routing on flat addresses
–Shortest paths: scalability and performance
–MAC addresses: self-configuration and mobility
[Figure: a network topology of switches (S) with attached hosts (H).]
• Scalability without hierarchical addressing
–Limit dissemination and storage of host info
–Sending packets on slightly longer paths
SEATTLE
Scalable Ethernet Architecture for Large Enterprises
(joint work with Changhoon Kim and Matt Caesar)
SEATTLE Design Decisions
Objective → Approach → Solution
1. Avoiding flooding: never broadcast unicast traffic → network-layer one-hop DHT
2. Restraining broadcasting: bootstrap hosts via unicast → network-layer one-hop DHT
3. Reducing routing state: populate host info only when and where it is needed → traffic-driven resolution with caching
4. Shortest-path forwarding: allow switches to learn the topology → L2 link-state routing maintaining only the switch-level topology
* Meanwhile, avoid modifying end hosts
Network-Layer One-hop DHT
• Maintains <key, value> pairs with function F
– Consistent hash mapping a key to a switch
– F is defined over the set of live switches
• One-hop DHT
– Link-state routing ensures switches know each other
• Benefits
– Fast and efficient reaction to changes
– Reliability and capacity naturally grow with the size of the network
[Figure: consistent-hash ring spanning the key space 0 … 2^128-1]
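A minimal sketch of how such a one-hop DHT could work, using consistent hashing over the set of live switches; the hash choice, switch names, and MAC address below are illustrative assumptions rather than SEATTLE's actual implementation:

```python
import hashlib

def hash_key(key: str, bits: int = 128) -> int:
    """Map a key (e.g., a MAC address) onto the 2^bits hash ring."""
    digest = hashlib.md5(key.encode()).digest()   # 128-bit digest (illustrative choice)
    return int.from_bytes(digest, "big") % (1 << bits)

class OneHopDHT:
    """Consistent hash F over the set of live switches.
    Link-state routing gives every switch the full switch list, so any
    switch can evaluate F locally -- resolution takes a single hop."""
    def __init__(self, switches):
        # Place each switch on the ring at the hash of its identifier.
        self.ring = sorted((hash_key(s), s) for s in switches)

    def resolve(self, key: str) -> str:
        """Return the first live switch clockwise from hash(key)."""
        h = hash_key(key)
        for point, switch in self.ring:
            if point >= h:
                return switch
        return self.ring[0][1]            # wrap around the ring

# Example: which switch resolves host MAC 00:1a:2b:3c:4d:5e?
dht = OneHopDHT(["switch-A", "switch-B", "switch-C", "switch-D"])
resolver = dht.resolve("00:1a:2b:3c:4d:5e")
```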
Location Resolution
<key, val> = <MAC addr, location>
[Figure: location resolution. Switch A discovers host x (the owner) and publishes <MACx, A> to resolver B, where the hash F(MACx) = B; B stores <MACx, A>. When user y at switch D sends traffic to x, D computes F(MACx) = B and tunnels the packet to B; B tunnels it on to A and notifies D of <MACx, A>, so later traffic is forwarded directly from D to A.]
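Continuing the same toy model (hypothetical classes and names, not the SEATTLE code), the publish, tunnel, and cache steps on this slide might look roughly like this, reusing the OneHopDHT sketch above:

```python
class SeattleSwitch:
    """Toy switch: publishes host locations to their resolver and caches lookups."""
    def __init__(self, name, dht, network):
        self.name, self.dht, self.network = name, dht, network
        self.store = {}   # <MAC, location> entries this switch resolves
        self.cache = {}   # <MAC, location> entries learned from notifications

    def publish(self, mac, location):
        """Host discovery: the attachment switch publishes <MAC, location>."""
        resolver = self.network[self.dht.resolve(mac)]
        resolver.store[mac] = location

    def next_hop_for(self, dst_mac):
        """Return the switch to tunnel a frame toward."""
        if dst_mac in self.cache:                     # later packets go directly
            return self.cache[dst_mac]
        resolver = self.network[self.dht.resolve(dst_mac)]
        # Look up at the resolver and cache the answer (SEATTLE tunnels the
        # first packet via the resolver, which then notifies this switch).
        location = resolver.store[dst_mac]
        self.cache[dst_mac] = location
        return location

network = {}
for name in ["switch-A", "switch-B", "switch-C", "switch-D"]:
    network[name] = SeattleSwitch(name, dht, network)
network["switch-A"].publish("00:1a:2b:3c:4d:5e", "switch-A")   # A discovers host x
print(network["switch-D"].next_hop_for("00:1a:2b:3c:4d:5e"))   # -> switch-A
```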
Address Resolution
<key, val> = <IP addr, MAC addr>
[Figure: address resolution. Host x is attached to switch A with binding <IPx, MACx>. Host y broadcasts an ARP request for IPx; its switch D computes the hash F(IPx) = B and sends a unicast look-up to B, which stores <IPx, MACx, A> and returns a unicast reply <IPx, MACx, A>.]
Traffic following ARP takes a shortest path
without separate location resolution
Handling Network and Host Dynamics
• Network events
–Switch failure/recovery
 Change in <key, value> for DHT neighbor
 Fortunately, switch failures are not common
–Link failure/recovery
 Link-state routing finds new shortest paths
• Host events
–Changes in host location, MAC address, or IP address
–Must update stale host-information entries
Handling Host Information Changes
Dealing with host mobility
[Figure: host x moves from its old location (switch A) to a new location (switch D). The resolver updates its entry from <x, A> to <x, D>, and switches that cached <x, A> while talking with x update their entries to <x, D>.]
MAC- or IP-address change can be handled similarly
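In the same toy model, a host move can be handled by re-publishing the new location and correcting stale cached entries; in SEATTLE the corrections are driven by notifications rather than the global sweep used in this illustrative sketch:

```python
def move_host(network, mac, old_switch, new_switch):
    """Host identified by `mac` moves from old_switch to new_switch."""
    new_switch.publish(mac, new_switch.name)        # resolver now maps the host to its new location
    for sw in network.values():                     # correct stale cached entries (SEATTLE does this
        if sw.cache.get(mac) == old_switch.name:    # via notifications, not a global sweep)
            sw.cache[mac] = new_switch.name

move_host(network, "00:1a:2b:3c:4d:5e", network["switch-A"], network["switch-D"])
```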
Packet-Level Simulations
• Large-scale packet-level simulation
–Event-driven simulation of control plane
–Synthetic traffic based on LBNL traces
–Campus, data center, and ISP topologies
• Main results
–Much less routing state than Ethernet
–Only slightly more stretch than IP routing
–Low overhead for handling host mobility
Amount of Routing State
SEATTLE reduces the amount of routing state by more than an order of magnitude.
[Figure: routing-table size for Ethernet, SEATTLE with caching, and SEATTLE without caching.]
Cache Size vs. Stretch
Stretch = actual path length / shortest path length (in latency)
SEATTLE offers near-optimal stretch with a very small amount of routing state.
[Figure: stretch vs. cache size for SEATTLE and ROFL.]
Sensitivity to Mobility
SEATTLE rapidly updates routing state with very low overhead.
[Figure: sensitivity to host mobility for Ethernet, SEATTLE with caching, and SEATTLE without caching.]
Prototype Implementation
[Figure: prototype built from XORP and Click. An OSPF daemon in XORP exchanges link-state advertisements and exposes the network map and routing table through a Click interface; a user/kernel-level Click SeattleSwitch element contains a Ring Manager and a Host Info Manager, which exchange host-info registration and notification messages and forward data frames.]
Throughput: 800 Mbps for 512B packets, or 1400 Mbps for 896B packets
Conclusions on SEATTLE
• SEATTLE
–Self-configuring
–Scalable
–Efficient
• Enabling design decisions
–One-hop DHT with link-state routing
–Reactive location resolution and caching
–Shortest-path forwarding
VL2 Architecture (Microsoft)
VL2 Data Center Architecture
• VL2 work at Microsoft
– Focus on data centers, rather than enterprise networks
• Similarities to SEATTLE
– Flat addressing
– Mapping of destination address to egress switch
– Packet encapsulation to egress switch
• Differences from SEATTLE
– Separate directory servers, not an in-network DHT
– Lookup and encapsulation performed by the servers
– Load balancing over many paths (ECMP + anycast)
Scalable Flow-Based Networking
with DIFANE
Joint work with Minlan Yu, Mike Freedman, and Jia Wang
Flexible Policies in Enterprises
• Access control
– Drop packets from
malicious hosts
• Customized routing
– Direct Skype calls on
a low-latency path
• Measurement
– Collect detailed HTTP
traffic statistics
Flow-based Switches
• Install rules in flow-based switches
– Store rules in high speed memory (TCAM)
• Perform simple actions based on rules
– Rules: Match on bits in the packet header
– Actions: Drop, forward, count
[Figure: a two-dimensional flow space (source vs. destination); one region of the space is forwarded via link 1 and another region is dropped.]
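A rough sketch of this match-action model, where rules match wildcard bit patterns and the highest-priority match wins; the field widths, rule values, and helper names are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """A TCAM-style rule: wildcard patterns over header fields plus an action.
    A '*' in a pattern means that bit position matches anything."""
    priority: int
    src: str      # e.g. "00**"
    dst: str      # e.g. "111*"
    action: str   # e.g. "forward:1", "drop", "count"

def bits_match(pattern: str, value: str) -> bool:
    return all(p in ("*", v) for p, v in zip(pattern, value))

def classify(rules, src_bits: str, dst_bits: str) -> str:
    """Return the action of the highest-priority matching rule."""
    matching = [r for r in rules if bits_match(r.src, src_bits)
                and bits_match(r.dst, dst_bits)]
    if not matching:
        return "miss"                        # no rule covers this flow
    return max(matching, key=lambda r: r.priority).action

rules = [Rule(210, "00**", "111*", "forward:1"),
         Rule(209, "1110", "11**", "drop")]
print(classify(rules, "0010", "1110"))       # -> forward:1
```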
Challenges of Policy-Based Management
• Policy-based network management
– Specify high-level policies in a management system
– Enforce low-level rules in the switches
• Challenges
– Large number of hosts, switches and policies
– Limited TCAM space in switches
– Support host mobility
– No hardware changes to commodity switches
Pre-install Rules in Switches
[Figure: the controller pre-installs rules in the switches; packets hit the pre-installed rules and are forwarded.]
• Problems:
– No host mobility support
– Switches do not have enough memory
Install Rules on Demand (Ethane, NOX)
[Figure: the first packet misses the rules, so the switch buffers the packet and sends its header to the controller; the controller installs rules, and the switch then forwards the packet.]
• Problems:
– Delay of going through the controller
– Switch complexity
– Misbehaving hosts
DIFANE Architecture
(two stages)
DIstributed Flow Architecture
for Networked Enterprises
Stage 1: The controller proactively generates the rules and distributes them to authority switches.
Partition and Distribute the Flow Rules
[Figure: the controller partitions the flow space among authority switches A, B, and C and distributes the partition information to the switches. Packets travel from the ingress switch through the authority switch responsible for their region of the flow space (which may accept or reject them) and on to the egress switch.]
Stage 2: The authority switches always keep packets in the data plane and reactively cache rules.
Packet Redirection and Rule Caching
[Figure: the first packet is redirected from the ingress switch through the authority switch to the egress switch; the authority switch caches rules at the ingress switch, so following packets hit the cached rules and are forwarded directly.]
A slightly longer path in the data plane is
faster than going through the control plane
Locate Authority Switches
• Partition information in ingress switches
– Using a small set of coarse-grained wildcard rules
– … to locate the authority switch for each packet
• Distributed directory service but not DHT
– Hashing does not work for wildcards
– Keys can have wildcards in arbitrary bit positions
[Figure: example partition rules mapping regions of the flow space to authority switches:
X: 0-1, Y: 0-3 → Authority Switch A
X: 2-5, Y: 0-1 → Authority Switch B
X: 2-5, Y: 2-3 → Authority Switch C]
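A sketch of how an ingress switch might use such coarse-grained partition rules to locate the authority switch for a packet; the ranges are copied from the example above, and the function itself is hypothetical:

```python
# Partition rules from the example figure: inclusive (X, Y) ranges mapped to
# the authority switch responsible for that region of the flow space.
PARTITION = [((0, 1), (0, 3), "Authority Switch A"),
             ((2, 5), (0, 1), "Authority Switch B"),
             ((2, 5), (2, 3), "Authority Switch C")]

def authority_for(x: int, y: int) -> str:
    """Locate the authority switch whose partition covers the packet's (X, Y)."""
    for (xlo, xhi), (ylo, yhi), switch in PARTITION:
        if xlo <= x <= xhi and ylo <= y <= yhi:
            return switch
    raise LookupError("no partition rule covers this packet")

print(authority_for(3, 2))   # -> Authority Switch C
```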
Three Sets of Rules in TCAM
• Cache rules (in ingress switches, reactively installed by authority switches)
– Priority 210: Field 1 = 00**, Field 2 = 111*, Action = Forward to Switch B, Timeout = 10 sec
– Priority 209: Field 1 = 1110, Field 2 = 11**, Action = Drop, Timeout = 10 sec
– …
• Authority rules (in authority switches, proactively installed by the controller)
– Priority 110: Field 1 = 00**, Field 2 = 001*, Action = Forward and trigger cache manager, Timeout = Infinity
– Priority 109: Field 1 = 0001, Field 2 = 0***, Action = Drop and trigger cache manager
– …
• Partition rules (in every switch, proactively installed by the controller)
– Priority 15: Field 1 = 0***, Field 2 = 000*, Action = Redirect to authority switch
– Priority 14: …
– …
DIFANE Switch Prototype
Built with an OpenFlow switch; only a software modification is needed for authority switches.
[Figure: the data plane holds the cache rules, authority rules, and partition rules. A cache manager in the control plane, present only in authority switches, receives notifications from the data plane and sends and receives cache updates.]
Caching Wildcard Rules
• Overlapping wildcard rules
– Cannot simply cache matching rules
Caching Wildcard Rules
• Multiple authority switches
– Contain independent sets of rules
– Avoid cache conflicts in ingress switch
Partition Wildcard Rules
• Partition rules
– Minimize the TCAM entries in switches
– Decision-tree based rule partition algorithm
[Figure: two candidate cuts of the flow space; Cut B is better than Cut A.]
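The intuition behind preferring one cut over another can be sketched by counting how many wildcard rules a candidate cut would split, since a split rule must be duplicated on both sides and costs extra TCAM entries; the data layout and helper below are illustrative assumptions, not the DIFANE algorithm itself:

```python
def rules_split_by_cut(rules, dim, threshold):
    """Count rules whose range on dimension `dim` straddles `threshold`.
    Each rule is a dict mapping a dimension name to an inclusive (lo, hi) range."""
    return sum(1 for r in rules if r[dim][0] < threshold <= r[dim][1])

rules = [{"x": (0, 3), "y": (0, 1)},     # hypothetical wildcard rules as ranges
         {"x": (2, 5), "y": (2, 3)}]
# A cut at x = 2 splits the first rule (its x-range 0-3 straddles 2);
# a cut at y = 2 splits neither rule, so it is the better cut.
print(rules_split_by_cut(rules, "x", 2), rules_split_by_cut(rules, "y", 2))   # -> 1 0
```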
Handling Network Dynamics
How network dynamics affect the three rule sets:
• Policy changes at the controller: authority rules change; partition rules mostly do not change; cache rules time out.
• Topology changes at switches: authority rules and partition rules do not change; cache rules change.
• Host mobility: authority rules and partition rules do not change; cache rules time out.
Prototype Evaluation
• Evaluation setup
– Kernel-level Click-based OpenFlow switch
– Traffic generators, switches, controller run on separate
3.0GHz 64-bit Intel Xeon machines
• Compare delay and throughput
– NOX: Buffer packets and reactively install rules
– DIFANE: Forward packets to authority switches
Delay Evaluation
• Average delay (RTT) of the first packet
– NOX: 10 ms
– DIFANE: 0.4 ms
• Reasons for performance improvement
– Always keep packets in the data plane
– Packets are delivered without waiting for rule caching
– Easily implemented in hardware to further improve performance
Peak Throughput
• One authority switch; single-packet flows
[Figure: throughput (flows/sec) vs. sending rate (flows/sec) for 1 to 4 ingress switches, comparing DIFANE and NOX. Callouts mark DIFANE at 800K flows/sec, an ingress-switch bottleneck at 20K flows/sec, and a controller bottleneck at 50K flows/sec.]
DIFANE further increases the throughput linearly with the number of authority switches.
Scaling with Many Rules
• How many authority switches do we need?
– Depends on total number of rules
– … and the TCAM space in these authority switches
– Campus network: 30K rules, 1.7K switches; assuming 160 KB of TCAM per authority switch, 5 authority switches (0.3% of switches) are required.
– IPTV network: 5M rules, 3K switches; assuming 1.6 MB of TCAM per authority switch, 100 authority switches (3% of switches) are required.
Conclusion
• DIFANE is a scalable way to support policy
– Mix of reactive and proactive rule installation
– Proactively compute and partition the rules
– Reactively cache rules at ingress switches
• Generalized SEATTLE
– SEATTLE: destination-based forwarding
 Rules based (only) on flat MAC addresses
 Lookup using hashing, in a one-hop DHT
– DIFANE: multidimensional packet classification
 Rules based on many fields, with possible wildcards
 Lookup based on coarse-grained partition rules
• Scalable solution for flow-based networking
Backup Slides
BUFFALO
Bloom Filter Forwarding Architecture for Large Organizations
Data Plane Scaling Challenge
• Large layer-two networks
–Many end-host MAC addresses
–Many switches
• Forwarding-table size becomes a problem
–Requires a large, fast memory
–Expensive and power hungry
–Over-provisioning to avoid running out
• Buffalo’s goal
–Given a small, fast memory
–… make the most of it!
Bloom Filters
• Bloom filters in fast memory
– A compact data structure for a set of elements
– Calculate s hash functions to store element x
– Easy to check set membership
– Reduce memory at expense of false positives
[Figure: a Bloom filter is a bit vector V0 … Vm-1; storing element x sets the bits at positions h1(x), h2(x), h3(x), …, hs(x).]
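A minimal Bloom-filter sketch matching the figure; the hash construction and parameters are arbitrary illustrative choices:

```python
import hashlib

class BloomFilter:
    """m-bit vector with s hash functions; membership tests can yield false
    positives but never false negatives."""
    def __init__(self, m: int, s: int):
        self.m, self.s = m, s
        self.bits = bytearray(m)                  # one byte per bit, for simplicity

    def _positions(self, x: str):
        for i in range(self.s):                   # s salted hashes stand in for h1..hs
            h = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, x: str):
        for p in self._positions(x):
            self.bits[p] = 1

    def __contains__(self, x: str) -> bool:
        return all(self.bits[p] for p in self._positions(x))

bf = BloomFilter(m=1024, s=8)
bf.add("00:1a:2b:3c:4d:5e")
assert "00:1a:2b:3c:4d:5e" in bf                  # no false negatives
```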
Bloom Filter Forwarding
• One Bloom filter (BF) per next hop
– Store all addresses forwarded to that next hop
[Figure: the packet's destination address is queried against one Bloom filter per next hop (Nexthop 1 … Nexthop T); the packet is forwarded out the next hop whose filter hits.]
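Reusing the BloomFilter sketch above, building one filter per next hop and querying them might look roughly like this (handling multiple hits is covered later in the talk):

```python
def build_filters(forwarding_table, m=1 << 16, s=8):
    """forwarding_table: dict mapping destination MAC -> next-hop id.
    Returns one Bloom filter per next hop, holding that next hop's addresses."""
    filters = {}
    for mac, hop in forwarding_table.items():
        filters.setdefault(hop, BloomFilter(m, s)).add(mac)
    return filters

def candidate_next_hops(filters, dst_mac):
    """Query every next hop's filter; normally exactly one hits, but
    false positives can produce extra candidates."""
    return [hop for hop, bf in filters.items() if dst_mac in bf]
```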
BUFFALO Challenges
• How to optimize memory usage?
–Minimize the false-positive rate
• How to handle false positives quickly?
–No memory or payload overhead
• How to handle routing dynamics?
–Make it easy and fast to adapt the Bloom filters
1. Optimize Memory Usage
• Goal: Minimize overall false-positive rate
– Probability that one BF has a false positive
• Input:
– Fast memory size M
– Number of destinations per next hop
– The maximum number of hash functions
• Output: the size of each Bloom filter
– Larger BF for next hop with more destinations
1. Optimize Memory Usage (cont.)
• Constraints
– Memory constraint
 Sum of all BF sizes <= fast memory size M
– Bound on number of hash functions
 To bound CPU calculation time
 Bloom filters share the same hash functions
• Convex optimization problem
– An optimal solution exists
– Solved by IPOPT optimization tool
– Runs in about 50 msec
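In rough form (my notation, not necessarily the paper's): if the filter for next hop t holds n_t addresses in m_t bits and all filters share k hash functions, the standard Bloom-filter approximation gives the per-filter false-positive rate, and the sizing problem is approximately

\[
f_t \approx \left(1 - e^{-k n_t / m_t}\right)^{k},
\qquad
\min_{m_1,\dots,m_T} \; \sum_{t=1}^{T} f_t
\quad \text{s.t.} \quad \sum_{t=1}^{T} m_t \le M, \;\; k \le k_{\max}.
\]

Minimizing the sum of the f_t is a reasonable proxy for the overall false-positive probability when each f_t is small, which is consistent with the convex-optimization formulation and IPOPT solution mentioned above.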
1. Minimize False Positives
• Forwarding table with 200K entries, 10 next hops
• 8 hash functions
[Figure: overall false-positive fraction (log scale, 1 down to 10^-6) vs. fast-memory size, 0 to 1000 KB.]
1. Comparing with Hash Table
• Save 65% memory with 0.1% false positives
[Figure: fast-memory size vs. number of forwarding-table entries (up to 2000K), comparing a hash table against Bloom filters with false-positive rates of 0.01%, 0.1%, and 1%.]
2. Handle False Positives
• Design goals
– Should not modify the packet
– Never go to slow memory
– Ensure timely packet delivery
• BUFFALO solution
– Exclude incoming interface
 Avoid loops in one false positive case
– Random selection among the rest
 Guarantee reachability with multiple FPs
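A hedged sketch of that selection step, building on candidate_next_hops above; the exclude-the-incoming-interface-then-choose-randomly policy follows the bullets, but the code itself is only illustrative:

```python
import random

def pick_next_hop(filters, dst_mac, incoming_hop):
    hits = candidate_next_hops(filters, dst_mac)
    if not hits:
        raise LookupError("destination not in any next-hop filter")
    if len(hits) == 1:
        return hits[0]                       # common case: no false positive
    # Multiple hits mean at least one false positive: exclude the incoming
    # interface (avoiding the simple loop case) and choose randomly among the
    # rest so the packet keeps making progress toward the destination.
    candidates = [h for h in hits if h != incoming_hop] or hits
    return random.choice(candidates)
```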
2. One False Positive
• Most common case: one false positive
–When there are multiple matching next hops
–Avoid sending to incoming interface
• We prove that there is at most a 2-hop loop, with stretch <= l(AB) + l(BA)
[Figure: a packet bouncing once between switches A and B on its way to dst.]
2. Multiple False Positives
• Handle multiple false positives
– Random selection from matching next hops
– Random walk on the shortest-path tree plus a few false-positive links
– … to eventually find a way to the destination
[Figure: a shortest-path tree rooted at the destination plus a false-positive link; the random walk eventually reaches dst.]
2. Multiple False Positives (cont.)
• Provable stretch bound
–With k false positives
–… expected stretch is at most O(k^2 · 3^(k/3))
–Proved by random walk theories
• Stretch bound is actually not that bad
–False positives are independent
 Different switches use different hash functions
–k false positives for one packet are rare
 Probability of k false positives drops exponentially in k
• Tighter bounds in special cases: e.g., tree
2. Stretch in Campus Network
[Figure: CCDF of stretch, normalized by shortest-path length, for false-positive rates of 0.5%, 0.1%, and 0.001%. When fp = 0.5%, 99.9% of the packets have no stretch and only 0.0003% of packets have a stretch of 6 times the shortest-path length; when fp = 0.001%, only 0.0002% of packets have any stretch.]
3. Update on Routing Change
• Use CBF in slow memory
– Assist BF to handle forwarding-table updates
– Easy to add/delete a forwarding-table entry
[Figure: a counting Bloom filter (CBF) in slow memory backs the Bloom filter (BF) in fast memory. Deleting a route decrements the CBF counters; a BF bit is cleared only when its corresponding counter reaches zero.]
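A counting-Bloom-filter sketch in the same toy style, extending the BloomFilter class above; again an illustration rather than BUFFALO's implementation:

```python
class CountingBloomFilter(BloomFilter):
    """Same hash positions as BloomFilter, but each position is a counter,
    so forwarding-table entries can be deleted as well as added."""
    def __init__(self, m: int, s: int):
        super().__init__(m, s)
        self.bits = [0] * m                   # integer counters instead of single bits

    def add(self, x: str):
        for p in self._positions(x):
            self.bits[p] += 1

    def remove(self, x: str):
        for p in self._positions(x):
            if self.bits[p] > 0:
                self.bits[p] -= 1

    def to_bloom_filter(self) -> BloomFilter:
        """Rebuild the fast-memory BF: a bit is set iff its counter is nonzero."""
        bf = BloomFilter(self.m, self.s)
        bf.bits = bytearray(1 if c else 0 for c in self.bits)
        return bf
```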
3. Occasionally Resize BF
• Under significant routing changes
–# of addresses in BFs changes significantly
–Re-optimize BF sizes
• Use CBF to assist resizing BF
–Large CBF and small BF
–Easy to expand BF size by contracting CBF
[Figure: it is easy to contract the large CBF to a smaller size (e.g., size 4) to rebuild the BF, but hard to expand the small BF to size 4.]
BUFFALO Switch Architecture
Prototype implemented in kernel-level Click
Prototype Evaluation
• Environment
– 3.0 GHz 64-bit Intel Xeon
– 2 MB L2 data cache (fast-memory size M)
– 200K forwarding-table entries and 10 next hops
• Peak forwarding rate
– 365 Kpps, 1.9 μs per packet
– 10% faster than hash-based EtherSwitch
• Performance with forwarding-table updates
– 10.7 μs to update a route
– 0.47 s to reconstruct BFs based on CBFs
Conclusion on BUFFALO
• BUFFALO
–Small, bounded memory requirement
–Small stretch
–Fast reaction to routing updates
• Enabling design decisions
–Bloom filter per next hop
–Optimization of Bloom filter sizes
–Preventing forwarding loops
–Dynamic updates using counting Bloom filters
SeaBuff
Seattle + Buffalo
Seattle + Buffalo
• Seattle
–Shortest-path routing and scalable control plane
–Fewer host MAC addresses stored at switches
• Buffalo
–Less memory for given # of routable addresses
–Graceful handling of increase in # of addresses
• Combined data plane
[Figure: the destination address is looked up in the Seattle cache to find the egress switch, and Buffalo's Bloom filters then map the egress switch to an outgoing link.]
Two Small Sources of Stretch
[Figure: traffic from host y (attached to switch D) to host x (attached to switch A) may be diverted through relay switch B.]
• Seattle: diverting some packets through relay B
• Buffalo: extra stretch from D to B, or B to A
Choosing the Right Solution
• Spatial distribution of traffic
–Sparse: caching in Seattle is effective
–Dense: caching is not as effective
• Temporal distribution of traffic
–Stable: shortest path routing is effective
–Volatile: forwarding through relay spreads traffic
• Topology
–Arbitrary: false positives have more impact
–Tree-like: false positives have less impact
Conclusions
• Growing importance of edge networks
–Enterprise networks
–Data center networks
• Shortest-path routing and flat addressing
–Self-configuring and efficient
–Scalability in exchange for stretch
• Ongoing work
–SeaBuff = Seattle + Buffalo
–Theoretical analysis of stretch in Buffalo
–Efficient access control in OpenFlow