Transcript: BCube

Presenter:
Po-Chun Wu ([email protected])
Outline
• Introduction
• BCube Structure
• BCube Source Routing (BSR)
• Other Design Issues
• Graceful degradation
• Implementation and Evaluation
• Conclusion
Introduction
Container-based modular DC
• 1000-2000 servers in a single container
• Core benefits of Shipping Container DCs:
– Easy deployment
• High mobility
• Just plug in power, network, & chilled water
– Increased cooling efficiency
– Manufacturing & H/W Admin. Savings
BCube design goals
• High network capacity for:
– One-to-one unicast
– One-to-all and one-to-several reliable groupcast
– All-to-all data shuffling
• Only use low-end, commodity switches
• Graceful performance degradation
– Performance degrades gracefully as server/switch
failures increase
BCube Structure
BCube Structure
[Figure: a BCube1 with n = 4, showing level-1 switches <1,0>–<1,3>, level-0 switches <0,0>–<0,3>, and servers 00–33]
• A BCubek has:
– k+1 levels: 0 through k
– n-port switches, the same count (n^k) at each level
– n^(k+1) total servers, (k+1)·n^k total switches
– e.g., n=8, k=3: 4 levels connecting 4096 servers using 512 8-port switches at each level
• A server is assigned a BCube addr (ak, ak-1, …, a0), where ai ∈ [0, n-1], i ∈ [0, k]
• Neighboring server addresses differ in only one digit
• Switches only connect to servers
• Connecting rule
– The i-th server in the j-th BCube0 connects to the j-th
port of the i-th level-1 switch
– Server “13” is connected to switches <0,1> and <1,3>
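To make the addressing and connecting rule above concrete, here is a minimal Python sketch (my own illustration, not code from the paper; the helper names switches_of and neighbors are hypothetical):

```python
# Sketch of BCube addressing, assuming addresses are stored as [a_k, ..., a_0].

def switches_of(server):
    """For each level i, return (level, switch_label, switch_port) for this server.
    The level-i switch label is the address with digit a_i removed, and the
    server plugs into port a_i of that switch."""
    k = len(server) - 1
    links = []
    for i in range(k + 1):
        idx = k - i                                   # digit a_i sits at list index k - i
        label = tuple(d for j, d in enumerate(server) if j != idx)
        links.append((i, label, server[idx]))
    return links

def neighbors(server, n):
    """Level-i neighbors differ from this server in digit a_i only."""
    k = len(server) - 1
    out = []
    for i in range(k + 1):
        idx = k - i
        for v in range(n):
            if v != server[idx]:
                nb = list(server)
                nb[idx] = v
                out.append((i, tuple(nb)))
    return out

# Example from the slide: server "13" in a BCube1 with n = 4
print(switches_of([1, 3]))     # [(0, (1,), 3), (1, (3,), 1)] -> switches <0,1> and <1,3>
print(neighbors([1, 3], 4))    # level-0 neighbors 10,11,12; level-1 neighbors 03,23,33
```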
Bigger BCube: 3 levels (k=2)
[Figure: a BCube2 built from n BCube1s plus a new level of switches]
BCube: Server-centric network
[Figure: a packet sent from server 20 to server 03 in a BCube1. The packet carries the BCube addresses (src 20, dst 03) plus per-hop Ethernet MACs; it is relayed 20 → switch <0,2> → 23 → switch <1,3> → 03, while the switches forward purely on their MAC tables (e.g., switch <0,2> maps ports 0–3 to MAC20–MAC23)]
• Server-centric BCube
– Switches never connect to other switches; they only connect to servers
– Servers control routing, load balancing, and fault tolerance
Bandwidth-intensive application
support
• One-to-one:
– one server moves data to another server. (disk backup)
• One-to-several:
– one server transfers the same copy of data to several
receivers. (distributed file systems)
• One-to-all:
– a server transfers the same copy of data to all the other
servers in the cluster (broadcast)
• All-to-all:
– every server transmits data to all the other servers
(MapReduce)
Multi-paths for one-to-one traffic
• THEOREM 1. The diameter (the longest shortest path between any two servers) of a BCubek is k+1
• THEOREM 3. There are k+1 parallel paths between any two servers in a
BCubek
[Figure: a BCube1, illustrating the k+1 = 2 parallel paths between a pair of servers]
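As a sketch of why Theorem 3 holds (my own simplification, not the paper's BuildPathSet algorithm): when the two addresses differ in every digit, correcting the digits in k+1 rotated orders already gives k+1 paths whose intermediate servers are disjoint; the full construction also covers addresses that share digits.

```python
# Simplified parallel-path construction for the case where src and dst
# differ in every digit (addresses stored as [a_k, ..., a_0]).

def one_path(src, dst, order):
    """Correct one digit per hop, following the given order of digit positions."""
    path, cur = [tuple(src)], list(src)
    for pos in order:
        if cur[pos] != dst[pos]:
            cur[pos] = dst[pos]           # changing one digit = one server-to-server hop
            path.append(tuple(cur))
    return path

def parallel_paths(src, dst):
    k = len(src) - 1
    orders = [[(s + j) % (k + 1) for j in range(k + 1)] for s in range(k + 1)]
    return [one_path(src, dst, o) for o in orders]

# Example: BCube1, src 00, dst 13
for p in parallel_paths([0, 0], [1, 3]):
    print(p)
# (0,0)->(1,0)->(1,3)  and  (0,0)->(0,3)->(1,3): the intermediate servers are disjoint
```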
Speedup for one-to-several traffic
• THEOREM 4. Server A and a set of servers {di | di is A’s level-i neighbor}
form an edge-disjoint complete graph of diameter 2
• Writing to r servers is r times faster than pipeline replication
[Figure: in a BCube1, the source splits the data into parts P1 and P2 and delivers them to the replica servers over edge-disjoint paths]
Speedup for one-to-all traffic
[Figure: two edge-disjoint spanning trees rooted at the source server in a BCube1]
• THEOREM 5. There are k+1
edge-disjoint spanning
trees in a BCubek
• The one-to-all and one-to-several spanning trees can
be implemented with TCP
unicast to achieve reliability
Aggregate bottleneck throughput for
all-to-all traffic
• The flows that receive the smallest throughput are called the
bottleneck flows.
• Aggregate bottleneck throughput (ABT)
– (throughput of the bottleneck flow) × (total number of flows in the all-to-all traffic)
• A larger ABT means a shorter all-to-all job finish time
• In BCube there are no bottleneck links, since all links are used equally
• THEOREM 6. The ABT for a BCube network is n(N - 1)/(n - 1),
where n is the switch port number and N is the total server number
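A quick worked example (my own arithmetic, plugging in the full n = 8, k = 3 BCube mentioned earlier, so N = 8^4 = 4096 servers):

```latex
\mathrm{ABT} \;=\; \frac{n}{n-1}\,(N-1) \;=\; \frac{8}{7}\times 4095 \;=\; 4680
```

Per server this averages to roughly n/(n-1) ≈ 1.14, i.e., slightly more than one link's worth of all-to-all throughput.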
BCube Source Routing (BSR)
BCube Source Routing (BSR)
• Server-centric source routing
– Source server decides the best path for a flow by probing a
set of parallel paths
– Source server adapts to network conditions by re-probing
periodically or due to failures
– Intermediate servers only forward the packets based on
the packet header.
[Figure: the source sends probe packets along the k+1 parallel paths, through intermediate servers, to the destination]
BSR Path Selection
• Source server:
– 1. Constructs k+1 parallel paths using BuildPathSet
– 2. Probes all these paths (no link-state broadcasting)
– 3. If a path is not available, uses BFS to find an alternative (after
removing all the other parallel paths)
– 4. Uses a metric to select the best path (maximum available
bandwidth / end-to-end delay)
• Intermediate servers:
– Update the bandwidth field: min(PacketBW, InBW, OutBW)
– If the next hop is not found, return a failure to the source
• Destination server:
– Updates the bandwidth field: min(PacketBW, InBW)
– Sends the probe response to the source on the reverse path
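A rough sketch of the probe-and-select logic above (my own simplification; in real BSR the bandwidth field travels inside the probe packet header rather than in a function return value):

```python
# Sketch of BSR path selection: probe every parallel path, let each hop clamp the
# available bandwidth, then keep the path with the largest remaining bandwidth.

def probe(path, link_bw):
    """Return the available bandwidth along `path`, or None if a hop is down.
    link_bw maps (server, next_server) -> available bandwidth on that link."""
    bw = float("inf")
    for hop, nxt in zip(path, path[1:]):
        if (hop, nxt) not in link_bw:      # next hop not reachable -> report failure
            return None
        bw = min(bw, link_bw[(hop, nxt)])  # min(PacketBW, InBW, OutBW) at each server
    return bw

def select_path(paths, link_bw):
    """Metric: maximum available bandwidth (ties could be broken by delay)."""
    best, best_bw = None, -1.0
    for p in paths:
        bw = probe(p, link_bw)
        if bw is not None and bw > best_bw:
            best, best_bw = p, bw
    return best, best_bw
```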
Path Adaptation
• Source performs path selection periodically (every 10
seconds) to adapt to failures and network condition
changes.
• If a failure is reported, the source switches to an
available path and waits for the next timer to expire
before the next selection round, rather than re-probing immediately.
• Randomness is added to the timer to avoid path
oscillation.
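The periodic re-selection might look like the loop below (a sketch only; the 10-second period is from the slide, while the jitter range is my own choice):

```python
import random
import time

def path_adaptation_loop(reselect, base_period=10.0, jitter=2.0):
    """Re-run BSR path selection roughly every base_period seconds; the random
    jitter desynchronizes sources and helps avoid path oscillation."""
    while True:
        reselect()  # probe the k+1 parallel paths and switch to the best one
        time.sleep(base_period + random.uniform(-jitter, jitter))
```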
Packet Forwarding
• Each server has two components:
– Neighbor status table with (k+1)×(n-1) entries
• Maintained by the neighbor maintenance protocol (updated upon
probing / packet forwarding)
• Uses next hop index (NHI) encoding for indexing neighbors ([DP:DV])
– DP: which digit differs (2 bits)
– DV: value of the differing digit (6 bits)
– NHA: the array of NHIs carried in the packet header (8 bytes: maximum diameter = 8)
• Almost static (except the status field)
– Packet forwarding procedure
• Intermediate servers update the next-hop MAC address in the packet if the next
hop is alive
• Intermediate servers update neighbor status from received packets
• One table lookup per packet
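The [DP:DV] encoding packs one next hop index into a single byte; a sketch of that packing (my own illustration, following the 2-bit/6-bit split stated above):

```python
def encode_nhi(dp, dv):
    """Pack a next hop index: DP = which digit differs (2 bits, so up to 4 levels),
    DV = the value of that digit (6 bits, so up to 64-port switches)."""
    assert 0 <= dp < 4 and 0 <= dv < 64
    return (dp << 6) | dv

def decode_nhi(byte):
    """Unpack one NHI byte back into (DP, DV)."""
    return byte >> 6, byte & 0x3F

# The NHA array in the packet header holds up to 8 such bytes,
# one per remaining hop (maximum diameter = 8).
```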
Path compression and fast packet
forwarding
• A traditional address array needs 16 bytes:
Path(00,13) = {02, 22, 23, 13}
• The Next Hop Index (NHI) array needs only 4 bytes:
Path(00,13) = {0:2, 1:2, 0:3, 1:1}
• Forwarding table of server 23:
NHI   Output port   MAC
0:0   0             MAC20
0:1   0             MAC21
0:2   0             MAC22
1:0   1             MAC03
1:1   1             MAC13
1:3   1             MAC33
[Figure: the path 00 → 02 → 22 → 23 → 13 in a BCube1; each forwarding node resolves the next NHI to an output port and next-hop MAC with one table lookup]
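To show how the 4-byte NHI array for Path(00,13) is obtained from the server path (a sketch of my own, reusing the [a_k, …, a_0] digit convention from the earlier sketches):

```python
def path_to_nhi(path):
    """Each hop changes exactly one digit; record (DP = which digit, DV = new value)."""
    k = len(path[0]) - 1
    nhis = []
    for cur, nxt in zip(path, path[1:]):
        (j,) = [i for i in range(k + 1) if cur[i] != nxt[i]]   # the single differing index
        nhis.append((k - j, nxt[j]))                           # list index j holds digit a_(k-j)
    return nhis

print(path_to_nhi([(0, 0), (0, 2), (2, 2), (2, 3), (1, 3)]))
# [(0, 2), (1, 2), (0, 3), (1, 1)]  -> matches {0:2, 1:2, 0:3, 1:1} on the slide
```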
Other Design Issues
Partial BCubek
• A natural approach: (1) build the needed BCube k-1s, and
(2) connect them with only a partial layer of level-k switches?
• Solution
– connect the BCube k-1s using a full layer of level-k switches
• Advantage
– BCubeRouting performs just as in a complete BCube, and BSR works as before
• Disadvantage
– switches in the level-k layer are not fully utilized
[Figure: a partial BCube1 with two BCube0s (servers 00–13) connected by the full set of level-1 switches <1,0>–<1,3>]
Packing and Wiring (1/2)
• 2048 servers and 1280 8-port switches
– a partial BCube with n = 8 and k = 3
• 40-foot container (12 m × 2.35 m × 2.38 m)
• 32 racks in a container
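A quick check of those numbers (my own arithmetic, assuming levels 0–2 are built only for the 2048 installed servers while level 3 is kept as a full layer, as the partial-BCube slide suggests):

```latex
\underbrace{3 \times \tfrac{2048}{8}}_{\text{levels 0--2}} \;+\; \underbrace{8^{3}}_{\text{full level-3 layer}} \;=\; 768 + 512 \;=\; 1280 \ \text{8-port switches}
```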
Packing and Wiring (2/2)
• One rack = BCube1
• Each rack has 44 units
– 1U = 2 servers or 4 switches
– 64 servers occupy 32 units
– 40 switches occupy 10 units
• Super-rack (8 racks) = BCube2
Routing to external networks (1/2)
• Ethernet has a two-level link rate hierarchy
– 1G for end hosts and 10G for the uplink aggregator
[Figure: a BCube1 in which one server per BCube0 acts as a gateway; the gateways connect to an external 10G aggregator, while the other servers reach them over the 1G BCube links]
Routing to external networks (2/2)
• When an internal server sends a packet to an
external IP address:
1) It chooses one of the gateways.
2) The packet is then routed to the gateway using
BSR (BCube Source Routing)
3) After the gateway receives the packet, it strips
the BCube protocol header and forwards the
packet to the external network via the 10G
uplink
Graceful degradation
Graceful degradation
• Graceful degradation: when server or switch failures
increase, ABT reduces slowly and there are no
dramatic performance falls. (Simulation based)
• Simulations compare BCube against fat-tree and DCell under
server failures and switch failures
[Figure: ABT vs. server failure ratio and vs. switch failure ratio for BCube, fat-tree, and DCell]
Implementation and Evaluation
Implementation
[Figure: implementation architecture. In software, the BCube driver is a kernel intermediate driver between the TCP/IP protocol driver and the Ethernet miniport driver; it covers BCube configuration, neighbor maintenance, BSR path probing & selection, a flow-path cache, available-bandwidth calculation, packet send/recv, and packet forwarding over interfaces IF 0 … IF k. Neighbor maintenance, packet forwarding, and available-bandwidth calculation can also be offloaded to hardware on the server ports.]
Testbed
• A BCube testbed
– 16 servers (Dell Precision 490 workstations with
Intel 2.00 GHz dual-core CPU, 4 GB DRAM, 160 GB
disk)
– 8 8-port mini-switches (D-Link 8-port Gigabit
switch DGS-1008D)
• NIC
– Intel PRO/1000 PT quad-port Ethernet NIC
– NetFPGA
[Photos: Intel PRO/1000 PT Quad Port Server Adapter and NetFPGA card]
Bandwidth-intensive application
support
• Per-server throughput
Support for all-to-all traffic
• Total throughput for all-to-all
Related work
Speedup
Conclusion
• BCube is a novel network architecture for
shipping-container-based MDCs
• It forms a server-centric network architecture
• It uses mini-switches instead of 24-port switches
• BSR enables graceful degradation and meets
the special requirements of MDCs