Transcript PPT

B4: Experience with a Globally Deployed Software Defined WAN
Presenter: Klara Nahrstedt
CS 538 Advanced Networking Course
Based on “B4: Experience with a Globally Deployed Software Defined WAN”, by
Sushant Jain et al., ACM SIGCOMM 2013, Hong Kong, China
Overview
• Current WAN Situation
• Google Situation
  • User Access Network
  • Data Centers Network (B4 Network)
• Problem Description regarding B4 Network
• B4 Design
• Background on Existing Technologies used in B4
• New Approaches used in B4
  • Switch design
  • Routing design
  • Network controller design
  • Traffic engineering design
• Experiments
• Conclusion
Current WAN Situation
• WANs (Wide Area Networks) must deliver performance and reliability
across the Internet at terabit/s bandwidths
• WAN routers are high-end, specialized equipment, very expensive
because they must be provisioned for high availability
• WANs treat all bits the same; all applications receive equal treatment
• Current solution:
  • WAN links are provisioned to 30-40% utilization
  • WANs are over-provisioned to deliver reliability, at the very real cost of 2-3x
    bandwidth over-provisioning and high-end routing gear
Google WAN Current Situation
• Two types of WAN networks
• User-facing network – peers with and exchanges traffic with other Internet domains
• End user requests and responses are delivered to Google data centers and edge caches
• B4 network – providing connectivity among data centers
• Network applications (traffic classes) over B4 network
• User data copies (email, documents, audio/video files) to remote data centers
for availability and durability
• Remote storage access for computation over inherently distributed data
sources
• Large-scale data push synchronizing state across multiple data centers
• Over 90% of internal application traffic runs over B4
Google’s Data Center WAN (B4 Network)
1. Google controls applications, servers, and LANs, all the way to the edge of
the network
2. Bandwidth-intensive apps
- Perform large-scale data copies from one site to another
- Adapt their transmission rate
- Defer to higher-priority interactive apps during failure periods or
resource constraints
3. No more than a few dozen data center deployments, hence making central
control of bandwidth possible
Problem Description
• How to increase WAN link utilization to close to 100%?
• How to decrease the cost of bandwidth provisioning and still provide
reliability?
• How to stop over-provisioning?
• Traditional WAN architectures cannot achieve the level of scale, fault
tolerance, cost efficiency, and control required for B4
B4 Requirements
• Elastic Bandwidth Demand
• Synchronization of datasets across sites demands large amounts of bandwidth,
but can tolerate periodic failures with temporary bandwidth reductions
• Moderate number of sites
• End application control
• Google controls both applications and site networks connected to B4
• Google can enforce relative application priorities and control bursts at
network edges (no need for overprovisioning)
• Cost sensitivity
• B4’s capacity targets and growth rate led to unsustainable cost projections
Approach for Google’s Data Center WAN
• Software Defined Networking architecture (OpenFlow)
• Dedicated software-based control plane running on commodity servers
• Opportunity to reason about global state
• Simplified coordination and orchestration for both planned and unplanned
network changes
• Decoupling of software and hardware evolution
• Customization of routing and traffic engineering
  • Rapid iteration of novel protocols
  • Simplified testing environment
  • Improved capacity planning
  • Simplified management through a fabric-centric rather than router-centric
    WAN view
B4 Design
OFA – OpenFlow Agent
OFC – OpenFlow Controller
NCA – Network Control Application
NCS – Network Control Server
TE – Traffic Engineering
RAP – Routing Application Proxy
Paxos – family of protocols for solving consensus
in a network of unreliable processors (consensus is
the process of agreeing on one result among a group
of participants; Paxos is typically used where
durability is needed, such as in replication).
Quagga – network routing software suite
providing implementations of Open Shortest Path
First (OSPF), Routing Information Protocol (RIP),
Border Gateway Protocol (BGP), and IS-IS
Background
• IS-IS (Intermediate System to Intermediate System) routing protocol –
an interior gateway protocol (IGP) distributing routing information within an AS – a
link-state routing protocol (similar to OSPF)
• ECMP – Equal-cost multi-path routing
• Next hop packet forwarding to single destination can occur over multiple
“best paths”
• See animation http://en.wikipedia.org/wiki/Equal-cost_multi-path_routing
Equal-cost multipath routing (ECMP)
• ECMP
• Multipath routing strategy that splits traffic over multiple
paths for load balancing
• Why not just round-robin packets?
• Reordering (leads to triple duplicate ACKs in TCP?)
• Different RTT per path (for TCP RTO)…
• Different MTUs per path
http://www.cs.princeton.edu/courses/archive/spring11/cos461/
Equal-cost multipath routing (ECMP)
• Path-selection via hashing
• # buckets = # outgoing links
• Hash network information (source/dest IP addrs) to select
outgoing link: preserves flow affinity
http://www.cs.princeton.edu/courses/archive/spring11/cos461/
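To make the hashing concrete, here is a minimal Python sketch (field and function names are illustrative, not from B4 or any real switch) of selecting an outgoing link by hashing the flow 5-tuple:

```python
import hashlib

def ecmp_select_link(src_ip, dst_ip, src_port, dst_port, proto, num_links):
    """Select an outgoing link by hashing the flow 5-tuple.

    Every packet of a flow carries the same 5-tuple, so it always hashes
    to the same bucket: flow affinity is preserved and packets of one
    flow are never reordered across paths.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links  # #buckets = #links

# Two packets of the same TCP flow always pick the same link out of 4:
a = ecmp_select_link("10.0.0.1", "10.0.0.2", 51000, 443, "tcp", 4)
b = ecmp_select_link("10.0.0.1", "10.0.0.2", 51000, 443, "tcp", 4)
assert a == b
```

Because the chosen bucket depends only on the 5-tuple, a flow sticks to one path, avoiding the reordering, RTT, and MTU problems listed above.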
Now: ECMP in datacenters
• Datacenter networks are multi-rooted trees
• Goal: Support for 100,000s of servers
• Recall Ethernet spanning tree problems: No loops
• L3 routing and ECMP: Take advantage of multiple paths
http://www.cs.princeton.edu/courses/archive/spring11/cos461/
Switch Design
• Traditional Design of routing equipment
• Deep buffers
• Very large forwarding tables
• Hardware support for high availability
• B4 Design
• Avoid deep buffers while also avoiding expensive packet drops
• Run across a relatively small number of data centers, hence smaller forwarding
tables suffice
• Switch failures are caused more by software than hardware, hence moving
software functions off the switch hardware minimizes failures
Switch Design
• Two-stage switch
• 128-port 10G switch
• 24 individual 16x10G non-blocking switch chips (spine layer)
• Forwarding Protocol
• Ingress chip bounces incoming packets to the spine
• Spine forwards packets to the appropriate output chip (unless dest is on the same ingress chip)
• OpenFlow Agent (OFA)
  • User-level process running on switch hardware
  • OFA connects to the OFC, accepting OF (OpenFlow) commands
  • OFA translates OF messages into driver commands to set chip forwarding table entries
  • Challenges:
    • OFA exports the abstraction of a single non-blocking switch with hundreds of 10 Gbps ports; however,
      the underlying switch has multiple physical switch chips, each with its own forwarding tables.
    • The OpenFlow architecture assumes a neutral representation of forwarding table entries, but switches have
      many linked forwarding tables of various sizes and semantics.
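A minimal sketch of the two-stage forwarding protocol above (the port-to-chip mapping and the 8-edge-chip layout are hypothetical, chosen only for illustration):

```python
# 16x10G chips; assume (hypothetically) 8 edge chips serve the 128
# external ports and the remaining chips form the spine.
chip_of_port = {port: port // 16 for port in range(128)}

def two_stage_route(dst_port, ingress_chip):
    """Return the intra-switch hops for a packet entering at ingress_chip."""
    egress_chip = chip_of_port[dst_port]
    if egress_chip == ingress_chip:
        return [("deliver locally", ingress_chip)]       # no spine hop needed
    # otherwise bounce to the spine, which forwards to the chip owning the port
    return [("to spine", ingress_chip), ("spine to chip", egress_chip)]

print(two_stage_route(dst_port=5, ingress_chip=0))    # same chip: local
print(two_stage_route(dst_port=90, ingress_chip=0))   # different chip: via spine
```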
Network Control Design
• Each site has multiple NCS
• NCS and switches share a dedicated
out-of-band control-plane network
• One NCS serves as the leader
• Paxos handles leader election
• Paxos instances perform application-level
failure detection among a pre-configured set of
available replicas for a given piece of control
functionality.
• Modified Onix is used
• Network Information Base (NIB) contains
network state (topology, trunk configurations,
link status)
• OFC replicas are warm standbys
• OFA maintains multiple connections to
multiple OFCs
• Active communication to only one OFC at a
time
Routing Design
• Use the open-source Quagga stack for
BGP/IS-IS on the NCS
• Develop RAP to provide connectivity
between Quagga and OF switches for
• BGP/IS-IS route updates
• Routing protocol packets flowing between
switches and Quagga
• Interface updates from switches to
Quagga
• Translation of RIB entries, which form a
network-level view of global connectivity,
into low-level HW tables
• Translation of the RIB into two OpenFlow
tables
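A rough Python sketch of this kind of translation (the data layout and names are assumptions for illustration): each RIB prefix points into one table, while a second multipath group table holds the sets of output ports:

```python
# Simplified RIB: prefix -> list of viable egress ports (multipath).
rib = {
    "10.1.0.0/16": ["port1", "port2"],   # two equal-cost next hops
    "10.2.0.0/16": ["port3"],
}

flow_table = {}    # prefix -> multipath group id
group_table = {}   # group id -> egress port set

for gid, (prefix, ports) in enumerate(rib.items()):
    group_table[gid] = ports     # one group per distinct route
    flow_table[prefix] = gid     # prefix entry points into the group table

print(flow_table)   # {'10.1.0.0/16': 0, '10.2.0.0/16': 1}
print(group_table)  # {0: ['port1', 'port2'], 1: ['port3']}
```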
Traffic Engineering Architecture
• TE Server operates over states
• Network topology
• Flow Group (FG)
• Applications are aggregated into FGs
• {source site, dest site, QoS}
• Tunnel (T)
• Site-level path in network
• A->B->C
• Tunnel Group (TG)
• Map of FGs to set of tunnels and
corresponding weights
• Weight specifies fraction of FG
traffic to be forwarded along each
tunnel
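The three abstractions can be pictured with a small Python sketch (a hypothetical encoding, not B4's actual data model):

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class FlowGroup:
    src_site: str   # {source site, dest site, QoS}
    dst_site: str
    qos: str

Tunnel = Tuple[str, ...]   # site-level path in the network, e.g. ("A", "B", "C")

# Tunnel Group: maps an FG to tunnels with split weights (fractions sum to 1).
tunnel_groups: Dict[FlowGroup, Dict[Tunnel, float]] = {
    FlowGroup("A", "C", "BULK_COPY"): {
        ("A", "B", "C"): 0.75,   # 3/4 of this FG's traffic
        ("A", "C"):      0.25,   # 1/4 on the direct path
    }
}
```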
Bandwidth Functions
• Each application is associated
with bandwidth function
• Bandwidth Function
• Gives the contract between an app and B4
• Specifies BW allocation to app
given flow’s relative priority on a
scale, called fair share
• Derived from admin-specified static
weights (the slope) specifying app
priority
• Configured, measured and
provided to TE via Bandwidth
Enforcer
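A minimal sketch of a piecewise-linear bandwidth function, assuming the admin weight is the slope and the allocation is capped at the app's demand (the numbers are illustrative):

```python
def make_bandwidth_function(weight, demand):
    """Piecewise-linear bandwidth function: the admin weight is the slope,
    and the allocation saturates at the app's demand."""
    def bw(fair_share):
        return min(weight * fair_share, demand)
    return bw

high = make_bandwidth_function(weight=10, demand=15)  # high-priority app
low  = make_bandwidth_function(weight=1,  demand=5)   # low-priority app

# At the same fair share, the high-weight app gets 10x the bandwidth:
assert high(1.0) == 10.0 and low(1.0) == 1.0
# Past fair share 1.5 the high-priority app is capped at its demand:
assert high(2.0) == 15.0
```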
TE Optimization Algorithm
• Two components
• Tunnel Group Generation – allocates
BW to FGs using bandwidth functions
to prioritize at bottleneck edges
• Tunnel Group Quantization – changes
split ratios in each TG to match the
granularity supported by switch tables
TE Optimization Algorithm (2)
• Tunnel Group Generation
Algorithm
• Allocates BW to FGs based on
demand and priority
• Allocates edge capacity
among FGs based on BW
function
• FGs either receive an equal fair share
or have their demand fully satisfied
• Preferred tunnel for FG is
minimum cost path that
does not include bottleneck
edge
• Algorithm terminates when
each FG is satisfied or no
preferred tunnel exists
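A coarse, step-based sketch of allocating one bottleneck edge's capacity by raising the fair share until the edge fills or all demands are met (the names, fixed step size, and single-edge scope are simplifications, not the paper's implementation):

```python
def allocate_edge(capacity, fgs):
    """Raise the fair share in small steps, granting each FG its
    bandwidth-function increment, until the edge fills or every
    demand is met. fgs is a list of (weight, demand) pairs."""
    alloc = [0.0] * len(fgs)
    fair, step = 0.0, 0.01
    while capacity > 1e-9 and any(a < d for a, (_, d) in zip(alloc, fgs)):
        fair += step
        for i, (w, d) in enumerate(fgs):
            want = min(w * fair, d) - alloc[i]   # increment from bw function
            grant = min(want, capacity)
            alloc[i] += grant
            capacity -= grant
    return alloc

# Capacity 10 shared by a weight-10 and a weight-1 FG: the edge fills
# at fair share ~0.91, giving roughly 9.1 and 0.9.
print(allocate_edge(10, [(10, 15), (1, 5)]))
```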
TE Optimization Algorithm (3)
• Tunnel Group Quantization
• Adjusts splits to granularity
supported by underlying
hardware
• Is equivalent to solving integer
linear programming problem
• B4 uses heuristics to maintain
fairness and throughput
efficiency
• Example
• Split above allocation in
multiples of 0.5
• Problem (a) 0.5:0.5
• Problem (b) 0.0:1.0
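To see the quantization mechanically, here is a brute-force Python sketch (B4 itself uses a greedy heuristic; the exhaustive search and the 0.7:0.3 ideal split below are invented for illustration):

```python
from itertools import product

def quantize_split(ideal, q=0.5):
    """Exhaustively pick multiples of q that sum to 1 and are closest
    (L1 distance) to the ideal split ratios."""
    n = int(round(1 / q))                     # number of quanta to hand out
    best, best_err = None, float("inf")
    for combo in product(range(n + 1), repeat=len(ideal)):
        if sum(combo) != n:
            continue
        split = [c * q for c in combo]
        err = sum(abs(s - w) for s, w in zip(split, ideal))
        if err < best_err:
            best, best_err = split, err
    return best

# An ideal split of 0.7:0.3 over two tunnels, quantized to multiples
# of 0.5, becomes 0.5:0.5.
print(quantize_split([0.7, 0.3], q=0.5))   # -> [0.5, 0.5]
```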
TE Protocol
• B4 switches operate in three roles:
• Encapsulating switch initiates tunnels and splits traffic between them
• Transit switch forwards packets based on outer header
• Decapsulating switch terminates tunnels and then forwards packets using regular
routes
• Source site switches implement FGs
• Switch maps packets to an FG when their destination IP address matches one of the prefixes
associated with the FG
• Each incoming packet hashes to one of the tunnels associated with the TG in the desired ratio
• Each site in tunnel path maintains per-tunnel forwarding rules
• Source site switches encapsulate packet with outer IP header (tunnel-ID) whose
destination IP address uniquely identifies tunnel
• Installing tunnel requires configuring switches at multiple sites
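A sketch of how an encapsulating switch might hash a flow onto one of a TG's tunnels in the desired ratio (the weighted-bucket scheme and the 0.25 granularity are assumptions for illustration):

```python
import hashlib

def pick_tunnel(flow_key, tg, granularity=0.25):
    """Hash a flow onto one of the TG's tunnels, in proportion to each
    tunnel's weight (weights assumed to be multiples of `granularity`)."""
    buckets = []
    for tunnel, weight in tg:
        buckets += [tunnel] * int(round(weight / granularity))
    h = int.from_bytes(hashlib.md5(flow_key.encode()).digest()[:4], "big")
    return buckets[h % len(buckets)]

# A TG splitting traffic 3:1 between a two-hop tunnel and a direct one:
tg = [(("A", "B", "C"), 0.75), (("A", "C"), 0.25)]
print(pick_tunnel("10.0.0.1|10.2.0.9|443|tcp", tg))
```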
TE Protocol Example
Coordination between Routing and TE
• B4 supports both
• Shortest path routing and
• TE routing protocol
• This is a very robust solution, since if TE
gets disabled, the underlying routing
continues working
• We require support for multiple
forwarding tables
• At the OpenFlow level, we use RAP,
mapping different flows and groups to
appropriate HW tables – routing/BGP
populates the LPM (Longest Prefix Match) table
with appropriate entries
• At the TE level, we use the Access Control List
(ACL) table to set the desired forwarding
behavior
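A toy lookup illustrating the precedence this slide implies, with TE entries in the ACL table consulted before the routing-populated LPM table (the table contents are made up):

```python
import ipaddress

def lookup(dst_ip, acl_table, lpm_table):
    """TE entries in the ACL table win; otherwise fall back to
    longest-prefix match on the routing-populated LPM table."""
    dst = ipaddress.ip_address(dst_ip)
    for prefix, action in acl_table:                  # TE takes precedence
        if dst in ipaddress.ip_network(prefix):
            return action
    best = None
    for prefix, action in lpm_table:                  # longest prefix wins
        net = ipaddress.ip_network(prefix)
        if dst in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, action)
    return best[1] if best else "drop"

acl = [("10.2.0.0/16", "TE tunnel 7")]                        # installed by TE
lpm = [("10.0.0.0/8", "port 3"), ("10.2.0.0/16", "port 5")]   # BGP/IS-IS routes
print(lookup("10.2.0.9", acl, lpm))   # -> "TE tunnel 7"
```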
Role of Traffic Engineering Database (TED)
• TE server coordinates T/TG/FG rule
installation across multiple OFCs.
• TE optimization output is translated
to per-site TED
• TED captures state needed to
forward packets along multiple paths
• Each OFC uses TED to set forwarding
states at individual switches
• TED maintains key-value store for
global tunnels, tunnel groups, and
flow groups
• TE operation (TE op) can
add/delete/modify exactly one TED
entry at one OFC
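A hypothetical picture of the TED as a key-value store, with a TE op touching exactly one entry (the keys and schema are invented for illustration):

```python
# TED as a key-value store over tunnels, tunnel groups, and flow groups.
ted = {
    "Tunnel/1": {"path": ["SiteA", "SiteB", "SiteC"]},
    "Tunnel/2": {"path": ["SiteA", "SiteC"]},
    "TG/1":     {"splits": {"Tunnel/1": 0.75, "Tunnel/2": 0.25}},
    "FG/1":     {"src": "SiteA", "dst": "SiteC", "qos": "BULK", "tg": "TG/1"},
}

def te_op(ted, op, key, value=None):
    """A single TE op adds, deletes, or modifies exactly one TED entry."""
    if op in ("add", "modify"):
        ted[key] = value
    elif op == "delete":
        ted.pop(key, None)

# Rebalance one tunnel group with a single op:
te_op(ted, "modify", "TG/1", {"splits": {"Tunnel/1": 0.5, "Tunnel/2": 0.5}})
```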
Other TE Issues
• What are some of the other issues with TE?
• How would you synchronize TED between TE and OFC?
• What are other issues related to reliability of TE, OFC, TED, etc?
• ….
Impact of Failures
• Traffic between two sites
• Measurements of duration of
any packet loss after six types
of events
• Findings:
• Failure of a transit router that is a
neighbor of an encap router is bad
– very long convergence
• Reason?
• TE server does not incur any
loss
• Reason?
Utilization
Link utilization (effectiveness of hashing)
Site-to-site edge utilization
Conclusions – Experience from Outage
• Scalability and latency of the packet I/O path between OFA and OFC are critical
• Why?
• How would you remedy problems?
• OFA should be asynchronous and multi-threaded for more parallelism
• Why?
• Loss of control connectivity between TE and OFC does not invalidate
forwarding state
• Why not?
• TE must be more adaptive to failed/unresponsive OFCs when modifying
TGs that depend on creating new Tunnels
• Why?
• What other issues do you see with B4 design?