Jupiter Google presentation at Intel Apr 2016


Transcript: Jupiter Google presentation at Intel, Apr 2016

Jupiter rising: A decade of Clos topologies and centralized
control in Google’s datacenter networks
Credits
Authors of “Jupiter Rising” [SIGCOMM2015]:
Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb
Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason
Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat
And several teams at Google:
Platforms Networking Hardware and Software Development, Platforms SQA, Mechanical
Engineering, Cluster Engineering, NetOps, Global Infrastructure Group (GIG), and SRE.
Grand challenge for datacenter networks
• Tens of thousands of servers interconnected in clusters
• Islands of bandwidth a key bottleneck for Google a decade ago
■ Engineers struggled to optimize for b/w locality
■ Stranded compute/memory resources
■ Hindered app scaling
[Figure: bandwidth available per machine: roughly 1 Gbps within a rack, 100 Mbps within a small cluster, and 1 Mbps across the datacenter]
Grand challenge for datacenter networks
• Challenge: Flat b/w profile across all servers
• Simplify job scheduling (remove locality)
• Save significant resources via better bin-packing
• Allow application scaling
[Figure: the goal is a flat X Gbps/machine bandwidth profile across the entire datacenter]
Motivation
Traditional network architectures:
• Cost prohibitive
• Could not keep up with our bandwidth demands
• Operational complexity of “box-centric” deployment
Opportunity: A datacenter is a single administrative domain
• One organization designs, deploys, controls, and operates the n/w
• ...And often also the servers
Three pillars that guided us
• Merchant silicon: general purpose, commodity priced, off-the-shelf switching components
• Clos topologies: accommodate low-radix switch chips to scale nearly arbitrarily by adding stages (see the sketch below)
• Centralized control / management
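To make the Clos scaling point concrete, here is a minimal sketch (my illustration, not from the talk) of how far a classic three-stage fat-tree built from k-port merchant switches can scale; the radix values and 10G link speed are assumptions, not Jupiter's actual parameters.

```python
def fat_tree_capacity(k, link_gbps=10):
    """Scaling of a classic 3-stage fat-tree built from k-port switches.

    Standard fat-tree arithmetic: k pods of k/2 edge and k/2 aggregation
    switches plus (k/2)^2 core switches support k^3/4 hosts at full
    bisection bandwidth -- scale comes from radix and stages, not bigger boxes.
    """
    hosts = k ** 3 // 4
    switches = 5 * k ** 2 // 4                    # k^2 pod + k^2/4 core switches
    bisection_tbps = hosts * link_gbps / 1000.0   # full bisection: one link per host
    return hosts, switches, bisection_tbps

# Growing the fabric by moving to higher-radix merchant silicon:
for k in (24, 48, 64):
    hosts, switches, tbps = fat_tree_capacity(k)
    print(f"k={k:2d}: {hosts:6d} hosts, {switches:5d} switches, {tbps:6.1f} Tbps bisection")
```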
SDN: The early days
Control options:
• Protocols: OSPF, ISIS, BGP, etc.; box-centric config/management
• Build our own
Reasons we chose to build our own central control/management:
• Limited support for multipath forwarding (see the ECMP sketch below)
• No robust open source stacks
• Broadcast protocol scalability a concern at scale
• Network manageability painful with individual switch configs
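As background on the multipath concern: in a Clos fabric every destination is reachable over many equal-cost paths, so forwarding typically hashes each flow onto one of the available next hops (ECMP). A minimal, generic sketch of flow-hash next-hop selection follows; the field names and the 32-uplink figure are illustrative assumptions.

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Pick one of many equal-cost next hops by hashing the flow 5-tuple.

    Hashing keeps every packet of a flow on the same path (avoiding
    reordering) while spreading different flows across all uplinks.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    index = int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % len(next_hops)
    return next_hops[index]

uplinks = [f"spine-{i}" for i in range(32)]   # e.g. 32 equal-cost uplinks
print(ecmp_next_hop("10.0.1.5", "10.0.9.7", 4321, 80, "tcp", uplinks))
```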
Challenges faced in building our own solution
Topology and deployment
• Introducing our network to production
• Unmanageably high number of cables/fiber
• Cluster-external burst b/w demand
Control and management
• Operating at huge scale
• Routing scalability / routing with massive multipath
• Interop with external vendor gear
Performance and reliability
• Small on-chip buffers
• High availability from cheap/less reliable components
Outline
• Motivation
• Network evolution
• Centralized control / management
• Experience
[Chart: bisection bandwidth (bps, log scale from 1T to 1000T) delivered by each generation of cluster fabric, 2004 to 2013; the following slides add the generations to this timeline]
2004 State of the art: 4 Post cluster network
[Figure: four cluster routers interconnected by 2x10G links; 512 server racks, each with a ToR switch uplinked at 1G to the cluster routers]
+ Standard network configuration
- Scales to 2 Tbps (limited by the biggest router; see the back-of-the-envelope check below)
- Scale up: forklift the cluster when upgrading routers
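As a back-of-the-envelope check on that 2 Tbps figure (my arithmetic, assuming each of the 512 ToRs has one 1G uplink to each of the four cluster routers, as in the published description of this topology):

```python
# Rough check of the 4-post scaling limit under the assumed topology:
# 512 ToR switches, each with one 1G uplink to each of 4 cluster routers.
racks = 512
cluster_routers = 4
uplink_gbps = 1

aggregate_tbps = racks * cluster_routers * uplink_gbps / 1000
print(f"aggregate ToR uplink capacity ~= {aggregate_tbps:.1f} Tbps")  # about 2 Tbps
```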
DCN bandwidth growth demanded much more
Five generations of Clos for Google scale
[Figure: generic Clos fabric: spine blocks 1..M connected to edge aggregation blocks 1..N, which connect down to server racks with ToR switches]
Firehose 1.0
+ Scales to 10 Tbps (10K servers @ 1G)
- Issues with servers housing switch cards
- Was not deployed in production
Firehose 1.1
+ Chassis-based solution (but no backplane)
- Bulky CX4 copper cables restrict scale
Firehose 1.1
+ In production as a “Bag-on-side” (deployed alongside the legacy network)
+ Central control and management
Watchtower
+ Chassis with backplane
+ Fiber (10G) in all stages
+ Scales to 82 Tbps fabric
+ Global deployment
Watchtower: cable bundling
+ Cable bundling saves 40% TCO
+ 10x reduction in fiber runs to deploy
Watchtower: external connectivity
+ Connect externally via border routers
+ Massive external burst b/w
+ Enables cross-cluster MapReduce
+ Retire cluster routers completely
Saturn
+ 288x10G-port chassis
+ Enables 10G to hosts
+ Scales to 207 Tbps fabric
+ Reuse in WAN (B4)
Jupiter topology
+ Scales out building-wide to 1.3 Pbps
Jupiter racks
+ Enables 40G to hosts
+ External control servers
+ OpenFlow
Network control and config
New conventional wisdom from engineering systems at scale:
• Logically centralized control plane beats full decentralization
• Centralized configuration and management dramatically simplifies system aspects
Network config/management
[Figure: a network designer writes a short spec (size of spine, base IP prefix, ToR rack indexes, ...); fabric tools/scripts expand it into a Fabric DB (bill of materials, rack config, port maps, cable bundles, CPN design, cluster config, ...), which drives the configuration delivered to every switch and server in the datacenter network]
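A toy sketch of the idea behind this pipeline; the function, field names, and addressing scheme below are hypothetical, not Google's tooling. The point is that a small declarative spec is mechanically expanded into per-switch configuration rather than hand-configuring each box.

```python
import ipaddress

def expand_spec(spine_count, base_prefix, tor_racks):
    """Expand a small cluster spec into per-switch config records (toy version)."""
    subnets = ipaddress.ip_network(base_prefix).subnets(new_prefix=26)
    configs = []
    for i in range(spine_count):
        configs.append({"name": f"spine-{i}", "role": "spine",
                        "mgmt_subnet": str(next(subnets))})
    for rack in tor_racks:
        configs.append({"name": f"tor-{rack}", "role": "tor",
                        "mgmt_subnet": str(next(subnets)),
                        # flattened toy topology: every ToR uplinks to every spine
                        "uplinks": [f"spine-{i}" for i in range(spine_count)]})
    return configs

for cfg in expand_spec(spine_count=4, base_prefix="10.128.0.0/20", tor_racks=[1, 2, 3]):
    print(cfg)
```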
Network config/management
[Figure: a Fabric N/w Management layer operates on the switches as a group (e.g. install, push config, update software, drain, ...); scalable infrastructure monitors, alerts, and collects/processes logs from the switches and servers; the N/w Operator works from cluster configs and network/server monitoring that cover every switch and link, rather than configuring individual boxes]
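To illustrate the kind of fabric-level operation this enables, here is a sketch of a drain-then-update workflow; the FabricManager class and its method names are entirely hypothetical, standing in for the slide's install / push config / update software / drain operations.

```python
import time

class FabricManager:
    """Hypothetical fabric-management client (illustrative only)."""
    def drain(self, switch): print(f"draining traffic away from {switch}")
    def update_software(self, switch, image): print(f"updating {switch} to {image}")
    def undrain(self, switch): print(f"returning {switch} to service")
    def healthy(self, switch): return True   # stand-in for real monitoring checks

def rolling_update(switches, image, mgr):
    """Update switches one at a time; draining each lets the fabric's path
    diversity hide the maintenance from applications."""
    for sw in switches:
        mgr.drain(sw)
        mgr.update_software(sw, image)
        time.sleep(1)                         # placeholder for waiting on reboot
        assert mgr.healthy(sw), f"{sw} failed post-update checks"
        mgr.undrain(sw)

rolling_update(["spine-0", "spine-1"], image="switch-os-2.3", mgr=FabricManager())
```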
Firepath route controller
[Figure: Firepath clients 1..N send interface state updates over the Control Plane Network to the Firepath Master, which maintains the Link State database and redistributes it to all clients via the FMRP protocol]
Firepath route controller with external peering
[Figure: as above, plus border routers 1..M that run a Firepath client alongside a BGP stack and exchange routes with external BGP peers over eBGP (inband)]
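A highly simplified sketch of this pattern (not Firepath's actual protocol, message formats, or data structures): clients report local interface state, the master folds it into a single link-state database, and every client computes its routes from that shared view.

```python
import heapq

class Master:
    """Toy central controller: one authoritative link-state database."""
    def __init__(self):
        self.lsdb = {}                        # switch -> {neighbor: link is up}

    def interface_update(self, switch, links):
        self.lsdb[switch] = dict(links)       # replace that switch's reported state
        return self.lsdb                      # redistributed to every client

def shortest_paths(lsdb, source):
    """Dijkstra over the up links; each client runs this on the shared view."""
    dist, heap = {source: 0}, [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for nbr, up in lsdb.get(node, {}).items():
            if up and d + 1 < dist.get(nbr, float("inf")):
                dist[nbr] = d + 1
                heapq.heappush(heap, (d + 1, nbr))
    return dist

m = Master()
m.interface_update("tor-1", {"agg-1": True, "agg-2": True})
m.interface_update("agg-1", {"tor-1": True, "spine-1": True})
lsdb = m.interface_update("agg-2", {"tor-1": True, "spine-1": False})  # one link down
print(shortest_paths(lsdb, "tor-1"))
```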
Challenges faced in building our own solution: how we addressed them
• Interop with external vendor gear: integrate a BGP stack on the border routers
• Small on-chip buffers: tune switches (e.g. ECN) and hosts (DCTCP)
• High availability from cheap/less reliable components: redundancy; diversity; implement only what was needed
Experience: Outages
Three broad categories of outages:
• Control software failures at scale
■ A cluster-wide reboot did not converge: the liveness protocol contended for CPU with the routing process
■ Cannot test at scale in a hardware lab; developed virtualized testbeds
• Aging hardware exposes corner cases
• Component misconfigurations
[Figure: frequent topology updates plus local and remote link churn pegged the routing client's embedded CPU, causing missed liveness heartbeats]
Grand challenge for datacenter networks
• Challenge: Flat b/w profile across all servers
• Simplify job scheduling (remove locality)
• Save significant resources (better bin-packing)
• Allow application scaling
• Scaled datacenter networks to Petabit scale in under a decade
• Bonus: reused solution in campus aggregation and WAN
[Figure: the flat X Gbps/machine bandwidth profile across the datacenter]
Backup slides
What’s different about datacenter networking
• Single administrative domain (homogeneity, protocol modification much easier)
• More plentiful and more uniform bandwidth (aggregate bandwidth of the entire Internet in one building)
• Tiny round trip times
• Massive multipath
• Little buffering
• Latency/tail latency as important as bandwidth
• From client-server to large-scale, massively parallel computation
[Backup chart: bisection b/w timeline, Watchtower highlights]
+ Depop deployments
+ Cable bundling
+ Connect externally via Border Routers
+ Reuse in inter-cluster
Job mix on an example cluster
No locality within blocks
Firepath route controller (detail)
[Figure: the Firepath Master speaks the Firepath protocol to a Firepath client embedded on every switch; each client consumes CONFIG, reports port status, and receives route updates into the switch's embedded stack (kernel and device drivers). On a CBR (cluster border router) the Firepath client also redistributes intra- and inter-cluster routes with a BGP process and RIB that handle eBGP packets; non-CBR fabric switches run the Firepath client alone]
Experience: Congestion
Packet loss initially > 1%
Mitigation:
• QoS
• Switch buffer tuning
• Upgradable bandwidth
• ECMP configuration tuning
• ECN + DCTCP (see the sketch below)
• Congestion window bound
Packet loss reduced to < 0.01%
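For context on the ECN + DCTCP item (a generic sketch of the published DCTCP sender reaction, not Google's specific host tuning): the sender tracks the fraction of ECN-marked ACKs and cuts its congestion window in proportion, which keeps shallow switch buffers from overflowing without the full window halving that ordinary TCP applies on loss.

```python
def dctcp_update(cwnd, alpha, marked_acks, total_acks, g=1 / 16):
    """One round-trip of DCTCP's sender-side reaction (generic sketch).

    alpha is an EWMA of the fraction of ECN-marked packets; on a marked
    round the window shrinks by alpha/2, so light marking gives a gentle
    backoff while persistent marking approaches TCP's halving.
    """
    frac = marked_acks / max(total_acks, 1)
    alpha = (1 - g) * alpha + g * frac
    if marked_acks:
        cwnd = max(cwnd * (1 - alpha / 2), 1.0)   # proportional backoff
    else:
        cwnd += 1.0                               # normal additive increase
    return cwnd, alpha

cwnd, alpha = 100.0, 0.0
for marked in (0, 2, 8, 20, 20):                  # marks per 20 ACKs in successive RTTs
    cwnd, alpha = dctcp_update(cwnd, alpha, marked, 20)
    print(f"alpha={alpha:.3f}  cwnd={cwnd:.1f}")
```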
Drops in an example Saturn cluster