PEREGRINE: AN ALL-LAYER-2
CONTAINER COMPUTER NETWORK
Tzi-cker Chiueh, Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, and Yu-Ming Huan
∗Computer Science Department, Stony Brook University
†Industrial Technology Research Institute, Taiwan
Outline
Motivation
  - Layer 2 + Layer 3 design
  - Requirements for a cloud-scale DC
  - Problems of classic Ethernet in the cloud
Solutions
  - Related solutions
  - Peregrine's solution
Implementation and Evaluation
  - Software architecture
  - Performance evaluation
L2 + L3 Architecture: Problems
- Bandwidth bottleneck in the upper tiers of the hierarchy
- Configuration burden: routing tables in the routers, IP assignment, DHCP coordination, VLAN and STP
- Forwarding table size: commodity switches hold only 16-32K entries
- Virtual machine mobility constrained to a physical location
Ref: Cisco data center with FabricPath and the Cisco FabricPath Switching System
Requirements for Cloud-Scale DC
- Any-to-any connectivity with a non-blocking fabric: scale to more than 10,000 physical nodes
- Virtual machine mobility: a large layer 2 domain
- Quick failure detection and recovery: fast fail-over
- Support for multi-tenancy: share resources between different customers
- Load-balancing routing: efficiently use all available links
Solution: A Huge L2 Switch!
Make the whole data center look like one huge Layer 2 switch:
- A single L2 network, config-free and plug-and-play
- Non-blocking backplane bandwidth
- Linear cost and power scaling
- Scales to 1 million VMs
However, Ethernet does not scale!
[Figure: many VMs attached to one logical Layer 2 switch.]
Revisit Ethernet: Spanning Tree Topology
[Figure: an example Ethernet topology with hosts N1-N8 connected through switches s1-s4.]
Revisit Ethernet: Spanning Tree Topology
[Figure: the same topology after spanning-tree convergence: one switch is the Root, ports are marked Designated (D), Root (R), or Blocked (B), and the blocked port breaks the loop.]
Revisit Ethernet: Broadcast and Source Learning
[Figure: a broadcast from one host is flooded along the spanning tree; each switch learns the source MAC address on the port it arrived on.]
Benefit: plug-and-play
Ethernet’s Scalability Issues
- Limited forwarding table size: commodity switches hold 16K to 64K entries
- STP as the loop-prevention mechanism: not all physical links are used, and there is no load-sensitive dynamic routing
- Slow fail-over: fail-over latency is high (> 5 seconds)
- Broadcast overhead: a typical layer 2 domain is limited to hundreds of hosts
Related Works / Solution Strategies
- Scalability: Clos network / fat-tree topologies to scale out
- Alternatives to STP: link aggregation (e.g. LACP, Layer 2 trunking); routing protocols applied to the layer 2 network
- Limited forwarding table size: packet header encapsulation or rewriting
- Load balancing: randomized or traffic-engineering approaches
Design of the Peregrine
Peregrine’s Solutions
Problem → solution:
- Not all links are used; L2 loop prevention via STP → disable the Spanning Tree Protocol and have a Route Server calculate routes for all node pairs
- Broadcast-based source learning and forwarding → redirect broadcasts and block flooding packets
- Limited switch forwarding table size → Mac-in-Mac two-stage forwarding by a Dom0 kernel module
ARP Intercept and Redirect
[Figure: ARP intercept and redirect. (1) Host A's ARP request is intercepted and forwarded to the Directory Service (DS) as a DS-ARP; (2) the DS sends back a DS-Reply; (3) A sends data to B through switches sw1-sw4. The Route Algorithm Server (RAS) computes routes; control flow goes to the DS/RAS, data flow stays on the switches.]
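The redirect itself is essentially a directory lookup: the intercepted ARP is forwarded to the DS, which answers from its table instead of letting the request flood. Below is a minimal sketch of the DS side, assuming an in-memory map from IP to (VM MAC, attachment-switch MAC); the class and method names are illustrative, not Peregrine's actual code.

```python
# Sketch of the DS side of ARP intercept-and-redirect: answer a
# redirected ARP (DS-ARP) from a directory instead of flooding it.
# The directory layout is an assumption made for this illustration.

class DirectoryService:
    def __init__(self):
        self.location = {}   # ip -> (vm_mac, edge_switch_mac)

    def register(self, ip, vm_mac, edge_switch_mac):
        """Called when a VM comes up at (or migrates to) a given edge switch."""
        self.location[ip] = (vm_mac, edge_switch_mac)

    def handle_ds_arp(self, target_ip):
        """Build the DS-Reply: the VM's MAC plus the switch it sits behind."""
        if target_ip not in self.location:
            return None                      # unknown host: nothing is flooded either way
        vm_mac, edge_switch_mac = self.location[target_ip]
        return {"ip": target_ip, "mac": vm_mac, "edge_switch": edge_switch_mac}

ds = DirectoryService()
ds.register("10.0.0.2", "02:00:00:00:00:0b", "02:00:00:00:04:04")   # B behind sw4
print(ds.handle_ds_arp("10.0.0.2"))
```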
Peregrine’s Solutions
Problem → solution:
- Not all links are used; L2 loop prevention via STP → disable the Spanning Tree Protocol and have a Route Server calculate routes for all node pairs (fast fail-over: a primary and a backup route for each pair)
- Broadcast-based source learning and forwarding → redirect broadcasts and block flooding packets
- Limited switch forwarding table size → Mac-in-Mac two-stage forwarding by a Dom0 kernel module
Mac-in-Mac Encapsulation
[Figure: Mac-in-Mac two-stage forwarding data path. (1) A's ARP request is redirected to the DS; (2) the DS answers that B is located at sw4; (3) the reply (sw4) reaches A; (4) the sender encapsulates sw4 into the frame as the intermediate destination address, so the frame becomes IDA (sw4) | DA (B) | SA (A); (5) at sw4 the frame is decapsulated and the original DA (B) | SA (A) frame is restored and delivered to B.]
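The two-stage forwarding step can be illustrated with plain frame manipulation: the sender prepends the destination's attachment switch as an intermediate destination address, and the frame is restored at the far edge. A minimal Python sketch follows; the field layout and helper names are simplifying assumptions for illustration, not the actual Dom0 kernel module.

```python
# Sketch of Mac-in-Mac style two-stage forwarding as described on the
# slide: encapsulation prepends an intermediate destination address
# (IDA = the destination's edge switch); decapsulation restores the
# original frame.  Frame layout is simplified for illustration.

from dataclasses import dataclass

@dataclass
class Frame:
    da: str            # destination MAC (the VM, e.g. B)
    sa: str            # source MAC (the VM, e.g. A)
    payload: bytes

@dataclass
class EncapFrame:
    ida: str           # intermediate DA: MAC of the destination's edge switch
    da: str
    sa: str
    payload: bytes

def encap(frame: Frame, edge_switch_mac: str) -> EncapFrame:
    """Sender-side step: address the frame to the remote edge switch."""
    return EncapFrame(ida=edge_switch_mac, da=frame.da, sa=frame.sa,
                      payload=frame.payload)

def decap(eframe: EncapFrame) -> Frame:
    """Edge-switch-side step: restore the original frame."""
    return Frame(da=eframe.da, sa=eframe.sa, payload=eframe.payload)

original = Frame(da="MAC_B", sa="MAC_A", payload=b"hello")
on_wire  = encap(original, edge_switch_mac="MAC_sw4")
assert decap(on_wire) == original
```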
Fast Fail-Over
Goal: fail-over latency < 100 ms, application-agnostic (well below the 200 ms TCP retransmission timeout).
Strategy: pre-compute a primary and a backup route for each VM; each VM has two virtual MACs. When a link fails, hosts whose primary routes are affected are notified to switch to the corresponding backup routes (see the sketch below).
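The strategy can be sketched as a small table of pre-computed route pairs plus a failure callback; the data structures and the on_link_failure name are illustrative assumptions, not the actual RS/DS implementation.

```python
# Sketch of the pre-computed primary/backup fail-over idea: each VM pair
# has a primary route and a backup route, and a link failure flips the
# affected pairs onto their backup routes.

routes = {
    # (src, dst) -> primary/backup link lists and which one is active
    ("A", "B"): {"primary": ["A-sw1", "sw1-sw3", "sw3-sw4", "sw4-B"],
                 "backup":  ["A-sw1", "sw1-sw2", "sw2-sw4", "sw4-B"],
                 "active":  "primary"},
}

def on_link_failure(failed_link: str):
    """Switch every pair whose active primary route uses the failed link."""
    affected = []
    for pair, r in routes.items():
        if r["active"] == "primary" and failed_link in r["primary"]:
            r["active"] = "backup"     # host starts using the backup virtual MAC
            affected.append(pair)
    return affected                    # in Peregrine, the DS would notify these hosts

print(on_link_failure("sw3-sw4"))      # -> [('A', 'B')]
```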
When a Network Link Fails
Implementation and Evaluation
Software Architecture
Review All Components
[Figure: the components to evaluate. At what rate do ARP requests arrive, and how fast can the DS handle them? How long does the RAS take to process a route request? What is the performance of the MIM (Mac-in-Mac) Dom0 module, the DS, the RAS, and the switches (sw1-sw7) along the ARP-redirect and backup-route paths?]
Mac-In-Mac Performance
[Figure: CDF of per-packet processing time.]
Time spent for decap / encap / total: 1 µs / 5 µs / 7 µs on a 2.66 GHz CPU, i.e. around 2.66K / 13.3K / 18.6K cycles.
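The cycle counts follow directly from the clock rate (cycles = time × frequency):

$$1\,\mu\mathrm{s} \times 2.66\,\mathrm{GHz} \approx 2{,}660 \ \text{cycles}, \qquad 5\,\mu\mathrm{s} \approx 13{,}300 \ \text{cycles}, \qquad 7\,\mu\mathrm{s} \approx 18{,}620 \ \text{cycles}.$$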
Aggregate Throughput for Multiple VMs
[Figure: aggregate TCP throughput with and without the MIM module for 1, 2, and 4 VMs; the two curves stay within roughly the 910-980 range.]
Setup: (1) ARP table size < 1K entries; (2) measure the TCP throughput of 1, 2, and 4 VMs communicating with each other.
ARP Broadcast Rate in a Data Center
What is the ARP traffic rate in the real world?
- A CMU CS department trace of 2,456 hosts reports 1,150 ARP/sec at peak and 89 ARP/sec on average.
- A university network of 3,800 hosts shows around 1,000 ARP/sec at peak and fewer than 100 ARP/sec on average.
- Scaling to 1M nodes therefore implies roughly 20K-30K ARP/sec on average.
- The current optimized DS handles 100K ARP/sec.
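As a rough sanity check of that extrapolation (assuming the average per-host ARP rate stays constant as the network grows):

$$\frac{89}{2456} \approx 0.036 \ \text{ARP/s per host}, \quad \frac{100}{3800} \approx 0.026 \ \text{ARP/s per host} \;\Rightarrow\; 10^{6} \ \text{hosts} \times (0.026\text{-}0.036) \approx 26\text{K-}36\text{K ARP/s},$$

which is in the same ballpark as the 20K-30K/sec figure above.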
Fail-over time and its breakdown
Average fail-over time: 75 ms. Breakdown:
- Switch: 25-45 ms to send the trap after the (soft) unplug
- RS: 25 ms to receive the trap and process it
- DS: 2 ms to receive the updated information from the RS
- The rest is network delay and Dom0 processing time
[Figure: fail-over time and its RS/DS breakdown across 11 trials; y-axis 0-400 ms.]
Conclusion
- A unified Layer-2-only network for LAN and SAN
- Centralized control plane and distributed data plane
- Uses only commodity Ethernet switches: an army of commodity switches instead of a few high-port-density switches; requirements on the switches: run fast and have a programmable routing table
- Centralized load-balancing routing using a real-time traffic matrix
- Fast fail-over using pre-computed primary/backup routes
Questions?
Thank you
Review All Components: Result
[Figure: measured results per component. MIM Dom0 module: 7 µs per-packet processing; DS: 100K ARP/sec; RS: 25 ms per request; link-down detection: 35 ms. ARP-redirect and backup-route paths through sw1-sw7 as in the earlier component diagram.]
Backup slides
OpenFlow Architecture
- OpenFlow switch: a data plane that implements a set of flow rules specified in terms of the OpenFlow instruction set
- OpenFlow controller: a control plane that sets up the flow rules in the flow tables of OpenFlow switches
- OpenFlow protocol: a secure protocol for an OpenFlow controller to set up the flow tables in OpenFlow switches
[Figure: an OpenFlow switch with a hardware data path and a control path that speaks the OpenFlow protocol (over SSL/TCP) to an OpenFlow controller.]
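Conceptually, a flow entry is a (match, actions, priority) triple that the controller installs over the OpenFlow protocol. The sketch below illustrates that idea in a library-agnostic way; the dictionary field names are illustrative, not the OpenFlow wire format.

```python
# Conceptual sketch of an OpenFlow-style flow table: the controller
# installs (match, actions) pairs; the switch's data path applies the
# actions of the highest-priority matching entry to each packet.

flow_table = []

def install_flow(match: dict, actions: list, priority: int = 100):
    """What an OpenFlow controller conceptually does via the OpenFlow protocol."""
    flow_table.append({"match": match, "actions": actions, "priority": priority})
    flow_table.sort(key=lambda e: -e["priority"])

def forward(packet: dict):
    """Switch data path: first matching entry wins; otherwise punt to the controller."""
    for entry in flow_table:
        if all(packet.get(k) == v for k, v in entry["match"].items()):
            return entry["actions"]
    return [("send_to_controller", None)]   # table miss

install_flow({"eth_dst": "02:00:00:00:04:04"}, [("output", 3)])
print(forward({"eth_dst": "02:00:00:00:04:04", "eth_src": "02:00:00:00:00:0a"}))
```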
Conclusion and Contribution
- Using commodity switches to build a large-scale layer 2 network
- Provides solutions to Ethernet's scalability issues:
  - Suppressing broadcast
  - Load-balancing route calculation
  - Controlling the MAC forwarding table: scales up to one million VMs via Mac-in-Mac two-stage forwarding
  - Fast fail-over
- Future work: high availability of the DS and RAS (master-slave model)
Inter
Comparisons
Scalable and available data center fabrics
- IEEE 802.1aq Shortest Path Bridging, IETF TRILL
- Competitors: Cisco, Juniper, Brocade
- Differences: commodity switches, centralized load-balancing routing, and proactive backup-route deployment
Network virtualization
- OpenStack Quantum API
- Competitors: Nicira, NEC
- Generality carries a steep performance price: every virtual network link is a tunnel
- Differences: simpler and more efficient because Peregrine runs on L2 switches directly
Three Stage Clos Network (m,n,r)
[Figure: a three-stage Clos(m, n, r) network: r ingress n×m switches, m middle r×r switches, and r egress m×n switches, connecting rn inputs to rn outputs.]
Clos Network Theory
Clos(m, n, r) configuration:
- rn inputs and rn outputs
- 2r n×m switches plus m r×r switches, less switching hardware than a single rn×rn crossbar
- Each r×r middle switch can in turn be implemented as a 3-stage Clos network
- Clos(m, n, r) is rearrangeably non-blocking iff m >= n
- Clos(m, n, r) is strictly non-blocking iff m >= 2n-1
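The hardware comparison and the two non-blocking conditions are easy to check programmatically. A small sketch follows, counting an a×b switch as a·b crosspoints (the usual convention, and an assumption here).

```python
# Clos(m, n, r): r ingress n-by-m switches, m middle r-by-r switches,
# r egress m-by-n switches, versus a single (rn)-by-(rn) crossbar.
# Each a-by-b switch is counted as a*b crosspoints.

def clos_crosspoints(m: int, n: int, r: int) -> int:
    return 2 * r * (n * m) + m * (r * r)

def crossbar_crosspoints(n: int, r: int) -> int:
    return (r * n) ** 2

def rearrangeably_nonblocking(m: int, n: int) -> bool:
    return m >= n              # condition from the slide

def strictly_nonblocking(m: int, n: int) -> bool:
    return m >= 2 * n - 1      # condition from the slide

m, n, r = 31, 16, 16           # strictly non-blocking since m = 2n - 1
print(clos_crosspoints(m, n, r), "vs", crossbar_crosspoints(n, r))   # 23808 vs 65536
print(strictly_nonblocking(m, n), rearrangeably_nonblocking(m, n))   # True True
```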
Link Aggregation
Stacking or trunking: multiple physical switches act logically as a single switch, with synchronized state.
Link aggregation: multiple physical links act logically as a single link to the upper layer, so STP sees a loop-free topology.
[Figure: switch-stacking and link-aggregation examples.]
ECMP: Equal-Cost Multipath
Layer 3 ECMP routing: flows are hashed across the 4 uplinks.
[Figure: a multi-rooted topology where each edge switch hashes flows onto four equal-cost uplinks.]
Pros: multiple links are used.
Cons: hash collisions, and traffic re-converges downstream onto a single link.
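The "flow hashing to 4 uplinks" step is just a hash of the flow identifier modulo the number of uplinks. A minimal sketch is below; the 5-tuple fields and the CRC32 hash are illustrative choices, since real switches use vendor-specific hash functions.

```python
# Sketch of ECMP flow hashing: all packets of a flow take the same uplink
# (no reordering), but two flows can collide on one uplink, which is the
# "Cons" noted on the slide.
import zlib

UPLINKS = ["u1", "u2", "u3", "u4"]

def ecmp_uplink(src_ip, dst_ip, proto, src_port, dst_port):
    """Pick one of the equal-cost uplinks by hashing the flow 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return UPLINKS[zlib.crc32(key) % len(UPLINKS)]

print(ecmp_uplink("10.0.0.1", "10.0.1.2", "tcp", 12345, 80))
```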
Example: Brocade Data Center
[Figure: a Brocade data center design combining L3 ECMP with link aggregation.]
Ref: Deploying Brocade VDX 6720 Data Center Switches with Brocade VCS in Enterprise Data Centers
PortLand
• Scale-out: three-layer, multi-rooted topology
• Hierarchical: encodes location into the MAC address
• Location Discovery Protocol to find shortest paths; routes by MAC
• Fabric Manager maintains the IP-to-MAC mapping
• 60-80 ms fail-over; centralized control and notification
VL2: Virtual Layer 2
• Three-layer Clos network
• Flat addressing with IP-in-IP encapsulation: Locator Addresses (LA) and Application Addresses (AA)
• Link-state routing disseminates LAs
• VLB + flow-based ECMP
• Depends on ECMP to detect link failures
• Packet interception at the server; the VL2 Directory Service resolves AA-to-LA mappings
[Figure: IP-in-IP encapsulation at the source and decapsulation near the destination, with VLB bouncing through intermediate switches.]
Monsoon
• Three-layer, multi-rooted topology
• 802.1ah MAC-in-MAC encapsulation with source routing
• Centralized routing decisions
• VLB + MAC rotation
• Depends on LSAs to detect failures
• Packet interception at the server; the Monsoon Directory Service maps IP <-> (server MAC, ToR MAC)
TRILL and SPB
TRILL
• Transparent Interconnection of Lots of Links (IETF)
• IS-IS as the topology management protocol
• Shortest-path forwarding
• New TRILL header
• Transit hash to select the next hop
SPB
• Shortest Path Bridging (IEEE)
• IS-IS as the topology management protocol
• Shortest-path forwarding
• 802.1ah MAC-in-MAC
• Computes 16 source-node-based trees
TRILL Packet Forwarding
Link-state routing; frames carry a TRILL header.
A-ID: nickname of A; C-ID: nickname of C; HopC: hop count.
Ref: NIL Data Communications
SPB Packet Forwarding
Link-state routing with 802.1ah Mac-in-Mac encapsulation.
I-SID: Backbone Service Instance Identifier; B-VID: Backbone VLAN Identifier.
Ref: NIL Data Communications
Re-arrangeable non-blocking Clos network
Example:
1. A three-stage Clos network
2. Condition: k >= n
3. An unused input at an ingress switch can always be connected to an unused output at an egress switch
4. Existing calls may have to be rearranged
[Figure: N = 6, n = 2, k = 2: an ingress stage of 2×2 (n×k) switches, a middle stage of 3×3 ((N/n)×(N/n)) switches, and an egress stage of 2×2 (k×n) switches.]
Features of Peregrine network
• Utilize all links
  – Load-balancing routing algorithm
• Scale up to 1 million VMs
  – Two-stage dual-mode forwarding
• Fast fail-over
  – Load-balancing routing algorithm
Goal
[Figure: a primary route and a backup route between source S and destination D.]
• Given a mesh network and a traffic profile:
  – Load-balance the network resource utilization
    • Prevent congestion by balancing the network load, to support as much traffic as possible
  – Provide fast recovery from failure
    • Provide a primary-backup route pair to minimize recovery time
Factors
Candidate cost factors for route selection:
• Hop count only
• Hop count and link residual capacity
• Hop count, link residual capacity, and link expected load
• Hop count, link residual capacity, link expected load, and the additional forwarding table entries required
How can these be combined into one number for a particular candidate route?
Route Selection: idea
[Figure: two candidate routes from S1 to D1: one through A and B that leaves link C-D free, and one that shares link C-D with the S2-D2 traffic.]
Which route is better from S1 to D1? Link C-D is more important, because S2-D2 must share it. Idea: use such critical links as sparsely as possible.
Route Selection: Hop Count and Residual Capacity
Traffic matrix: S1 -> D1: 1G; S2 -> D2: 1G.
[Figure: the same topology; one S1-D1 route leaves link C-D free, the other shares it with S2-D2.]
Using hop count or residual capacity alone makes no difference between the two routes!
Determine Criticality of a Link
Determine the importance of a link:
$f_l(s,d)$ = fraction of all $(s,d)$ routes that pass through link $l$
Expected load of a link at the initial state:
$$f_l = \sum_{(s,d)} f_l(s,d)\,B(s,d)$$
where $B(s,d)$ is the bandwidth demand matrix entry for the pair $(s,d)$.
Criticality Example
From B to C there are four possible routes.
Case 2: calculate $f_l(s,d)$ for $s = B$, $d = C$.
Case 3: $s = A$, $d = C$ is similar.
[Figure: link criticality values for $s = B$, $d = C$: most links are used by two of the four routes ($f_l(s,d) = 2/4$), two links by all four routes ($4/4$), and one link by none ($0$).]
Expected Load
Assumption: load is equally distributed over each possible route between S and D.
Consider a bandwidth demand of 20 for the pair B-C. Expected load:
$$f_l = \sum_{(s,d)} f_l(s,d)\,B(s,d)$$
[Figure: the same topology with each link labeled by its expected load: 10 on the shared links, 20 near B and C, and 0 on the unused link.]
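Plugging the example numbers into the formula for one of the shared links (only the B-C demand is non-zero here):

$$f_l = \sum_{(s,d)} f_l(s,d)\,B(s,d) = f_l(B,C)\,B(B,C) = \tfrac{2}{4} \times 20 = 10.$$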
Cost Metrics
The cost metric represents the expected load per unit of available capacity on the link:
$$\mathrm{cost}(l) = \frac{f_l}{R_l}$$
where $f_l$ is the expected load and $R_l$ is the residual capacity of link $l$.
[Figure: the example links labeled with costs of 0.01, 0.02, and 0.]
Idea: pick the link with minimum cost.
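The criticality-to-cost pipeline is easy to prototype. The sketch below uses networkx and assumes that every simple path between a pair counts as a candidate route; the slides do not say exactly which route set the real Route Algorithm Server enumerates, so treat this as an illustration only.

```python
# Sketch of link criticality f_l(s,d), expected load f_l, and
# cost(l) = f_l / R_l as defined on the slides.  Candidate routes are
# taken to be all simple paths, an assumption made for this illustration.
import networkx as nx

def link_costs(g: nx.Graph, demand: dict, residual: dict) -> dict:
    expected = {tuple(sorted(e)): 0.0 for e in g.edges()}
    for (s, d), bw in demand.items():
        routes = list(nx.all_simple_paths(g, s, d))
        for path in routes:
            for u, v in zip(path, path[1:]):
                # each route carries 1/len(routes) of the pair's demand
                expected[tuple(sorted((u, v)))] += bw / len(routes)
    return {e: load / residual[e] for e, load in expected.items()}

# Tiny example: two disjoint B-C paths, plus a stub link A-B.
g = nx.Graph([("A", "B"), ("B", "X"), ("X", "C"), ("B", "Y"), ("Y", "C")])
demand = {("B", "C"): 20}
residual = {tuple(sorted(e)): 1000 for e in g.edges()}
print(link_costs(g, demand, residual))   # shared links get cost 0.01, A-B gets 0
```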
Forwarding Table Metric
Consider commodity switches with 16-32K forwarding table entries.
Fwd(n) = available forwarding table entries at node n
INC_FWD = extra entries needed to route A-C
[Figure: an A-B-C example with Fwd values of 300, 200, and 0 at the three nodes.]
Idea: minimize entry consumption and prevent the forwarding table from being exhausted.
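The forwarding-table factor can likewise be counted per candidate route: a switch on the route needs an extra entry only if it does not already know the destination (here, the destination's edge-switch MAC). A sketch under that assumption, reusing the slide's Fwd and INC_FWD naming:

```python
# Sketch of the forwarding-table metric: INC_FWD counts the extra entries
# a candidate route would consume, given each switch's current table and
# free-entry count.  Assumes one entry per (switch, destination MAC).

def inc_fwd(route, dst_mac, fwd_tables):
    """Extra forwarding entries needed to install `route` toward dst_mac."""
    return sum(1 for sw in route if dst_mac not in fwd_tables[sw])

def feasible(route, dst_mac, fwd_tables, free_entries):
    """Reject routes through switches whose tables would overflow."""
    return all(dst_mac in fwd_tables[sw] or free_entries[sw] > 0 for sw in route)

fwd_tables   = {"A": {"sw4"}, "B": set(), "C": {"sw4"}}
free_entries = {"A": 300, "B": 200, "C": 0}          # Fwd(n) from the slide's example
print(inc_fwd(["A", "B", "C"], "sw4", fwd_tables))                 # 1 extra entry (at B)
print(feasible(["A", "B", "C"], "sw4", fwd_tables, free_entries))  # True
```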
Load Balanced Routing
• Simulated network
  – 52 PMs with 4 NICs, 384 links in total
  – Replay of 17 multi-VDC 300-second traces
• Compare
  – Random shortest-path routing (RSPR)
  – Full Link Criticality-based Routing (FLCR)
• Metric: congestion count
  – Number of links whose capacity is exceeded
• Low additional traffic induced by FLCR