Dr. Multicast
for Data Center Communication Scalability
HotNets, October 5, 2008
Ymir Vigfusson, Hussam Abu-Libdeh, Mahesh Balakrishnan, Ken Birman
Cornell University
Yoav Tock
IBM Research Haifa
IP Multicast in Data Centers
• IPMC is not used in data centers
• Would speed up products that use multicast
IP Multicast in Data Centers
• Why is IP multicast rarely used?
o Limited IPMC scalability on switches/routers and
NICs
o Broadcast storms: Loss triggers a horde of
NACKs, which triggers more loss, etc.
o Disruptive even to non-IPMC applications.
IP Multicast in Data Centers
• IP multicast has a bad reputation
o Works great up to a point, after which it breaks catastrophically
IP Multicast in Data Centers
• Bottom line:
o Administrators have no control over multicast
use ...
o Without control, they opt for never.
Dr. Multicast
Dr. Multicast (MCMD)
• Policy: Permits data center operators to
selectively enable and control IPMC
• Transparency: Standard IPMC interface, system
calls are overloaded.
• Performance: Uses IPMC when possible,
otherwise point-to-point unicast
• Robustness: Distributed, fault-tolerant service
Terminology
• Process: Application that joins logical IPMC
groups
• Logical IPMC group: A virtualized abstraction
• Physical IPMC group: As usual
• UDP multi-send: New kernel-level system-call
• Collection: Set of logical IPMC groups with
identical membership
Acceptable Use Policy
• Assume a higher-level network management tool
compiles policy into primitives
• Explicitly allow a process to use IPMC groups
o allow-join(process,logical IPMC)
o allow-send(process,logical IPMC)
• UDP multi-send always permitted
• Additional restraints
o max-groups(process,limit)
o force-udp(process,logical IPMC)
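The primitives above could be held in a small policy table. The following is a minimal sketch, not the MCMD implementation: the class name `Policy`, its fields, and the helper methods `may_join` and `transport` are hypothetical, and the reading that a sender falls back to UDP multi-send whenever IPMC is not allowed or is forced off is an assumption drawn from "UDP multi-send always permitted".

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical in-memory policy table; method names mirror the slide's primitives."""
    join_ok: set = field(default_factory=set)        # (process, group) pairs
    send_ok: set = field(default_factory=set)        # (process, group) pairs
    group_limit: dict = field(default_factory=dict)  # process -> max groups
    udp_only: set = field(default_factory=set)       # (process, group) pairs

    def allow_join(self, process, group):
        self.join_ok.add((process, group))

    def allow_send(self, process, group):
        self.send_ok.add((process, group))

    def max_groups(self, process, limit):
        self.group_limit[process] = limit

    def force_udp(self, process, group):
        self.udp_only.add((process, group))

    def may_join(self, process, group, joined_count):
        """Check the per-process group limit first, then the explicit allow list."""
        limit = self.group_limit.get(process)
        if limit is not None and joined_count >= limit:
            return False
        return (process, group) in self.join_ok

    def transport(self, process, group):
        """Assumed fallback: UDP multi-send unless IPMC is allowed and not forced off."""
        key = (process, group)
        if key in self.send_ok and key not in self.udp_only:
            return "ipmc"
        return "udp-multisend"


policy = Policy()
policy.allow_join("p1", "g1")
policy.allow_send("p1", "g1")
policy.max_groups("p1", 1)
policy.force_udp("p1", "g2")
```

Keeping the table as plain sets keeps each check a constant-time lookup, which matters if the library consults it on every send.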
Overview
• Library module
• Mapping module
• Gossip layer
• Optimization questions
• Results
MCMD Library Module
• Transparent. Overloads the IPMC
functions
o setsockopt(), send(), etc.
• Translation. Each logical IPMC group maps to a
set of P-IPMC/unicast addresses.
o Two extremes
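The translation step can be pictured as a table lookup followed by one send per destination. This is a minimal sketch under stated assumptions: the mapping layout, the function names `translate` and `multi_send`, the `send_one` callback, and the example addresses are all hypothetical and stand in for whatever the library actually does when it overloads send().

```python
def translate(logical_group, mapping):
    """Return the (kind, address) destinations for one logical group."""
    entry = mapping[logical_group]
    dests = [("ipmc", addr) for addr in entry.get("physical", [])]
    dests += [("unicast", addr) for addr in entry.get("unicast", [])]
    return dests


def multi_send(payload, logical_group, mapping, send_one):
    """Send payload once per destination; send_one is a caller-supplied callback."""
    dests = translate(logical_group, mapping)
    for _kind, addr in dests:
        send_one(payload, addr)
    return len(dests)


# Illustrative mapping: L1 is carried by one physical IPMC address,
# L2 is carried entirely by point-to-point unicast.
mapping = {
    "L1": {"physical": ["239.1.1.1"], "unicast": []},
    "L2": {"physical": [], "unicast": ["10.0.0.5", "10.0.0.6"]},
}

sent = []
count = multi_send(b"hello", "L2", mapping, lambda payload, addr: sent.append(addr))
```

Passing the actual send as a callback keeps the sketch free of real sockets; a real library would call sendto() on a UDP socket at that point.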
MCMD Mapping Role
• MCMD Agent runs on each machine
o Contacted by the library modules
o Provides a mapping
• One agent elected to be a leader:
o Allocates IPMC resources according to the
current policy
MCMD Mapping Role
• Allocating IPMC resources: An optimization
problem
[Figure: Procs, L-IPMC groups, and Collections; one box labeled "This box intentionally left BLACK"]
MCMD Gossip Layer
• Runs system-wide as part of the agent
• Automatic failure detection
• Group membership fully replicated via gossip
o Node reports its own state
o Future: Replicate more selectively
o Leader runs optimization algorithm on data and
reports the mapping
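Full replication by gossip can be illustrated with a small push-gossip simulation. This is a sketch of the general technique, not the MCMD protocol: the version counters, the random peer choice, and the names `make_views`, `merge`, `gossip_round`, and `converged` are assumptions made for illustration.

```python
import random


def make_views(memberships):
    """Each node starts knowing only its own state: node -> (version, groups)."""
    return {n: {n: (1, frozenset(g))} for n, g in memberships.items()}


def merge(mine, theirs):
    """Keep the entry with the higher version for every node seen so far."""
    for node, (version, groups) in theirs.items():
        if node not in mine or mine[node][0] < version:
            mine[node] = (version, groups)


def gossip_round(views, rng):
    """One epoch: every node pushes its whole view to one random peer."""
    nodes = list(views)
    for sender in nodes:
        peer = rng.choice([n for n in nodes if n != sender])
        merge(views[peer], views[sender])


def converged(views):
    """True once every node holds the same replicated membership table."""
    first = next(iter(views.values()))
    return all(view == first for view in views.values())


rng = random.Random(0)
views = make_views({"a": {"g1"}, "b": {"g1", "g2"}, "c": {"g2"}, "d": set()})
rounds = 0
while not converged(views) and rounds < 200:
    gossip_round(views, rng)
    rounds += 1
```

Once the views agree, any node (in particular the leader) can run the optimization on a complete membership table.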
MCMD Gossip Layer
• But gossip is slow...
• Implications:
o Slow propagation of group membership
o Slow propagation of new maps
o We assume a low rate of membership churn
• Remedy: Broadcast module
o Leader broadcasts urgent messages
o Bounded bandwidth of urgent channel
o Trade-off between latency and scalability
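One common way to bound the bandwidth of a channel like the urgent one is a token bucket. The slides do not say how MCMD enforces the bound, so the sketch below is a swapped-in technique shown only for illustration; the class name `UrgentChannel` and its parameters are hypothetical.

```python
class UrgentChannel:
    """Token bucket: at most `rate` bytes/sec on average, bursts up to `burst` bytes."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start with a full bucket
        self.last = 0.0      # time of the previous call, in seconds

    def try_send(self, size, now):
        """Return True if a message of `size` bytes may be broadcast at time `now`."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        # Over budget: the caller would leave this update to the slower gossip path.
        return False


channel = UrgentChannel(rate=100, burst=200)
first = channel.try_send(150, now=0.0)   # fits in the initial burst
second = channel.try_send(100, now=0.0)  # only 50 tokens left
third = channel.try_send(100, now=1.0)   # one second refills 100 tokens
```

A larger bucket lowers latency for bursts of urgent updates at the cost of more broadcast traffic, which is one concrete form of the latency/scalability trade-off.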
Optimization Questions
[Figure: Procs, L-IPMC groups, and Collections; one box labeled "BLACK"]
• First step: compress logical IPMC groups
Optimization Questions
• How compressible are subscriptions?
o Multi-objective optimization:
▪ Minimize number of collections
▪ Minimize bandwidth overhead on network
o Thm: The general problem is NP-complete
o Thm: In uniform random allocation, "little"
compression opportunity.
o Social preferences
o Lots of duplicates due to replication (e.g. for
load balancing)
Optimization Questions
• Which collections get an IPMC address?
o Thm: Ordering collections by decreasing
traffic*size and assigning P-IPMC addresses
greedily minimizes bandwidth.
• Tiling heuristic:
o Sort L-IPMC by traffic*size
o Greedily collapse identical groups
o Assign IPMC to collections in decreasing order of
traffic*size, UDP-multisend to the rest
• Building tilings incrementally
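The tiling heuristic can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the function name `tile`, the input layout, and the choice to score a collection by the summed traffic of its groups times its member count are hypothetical, and the sketch collapses identical groups before ranking rather than sorting the L-IPMC groups first.

```python
from collections import defaultdict


def tile(groups, num_ipmc):
    """groups: name -> (members, traffic). Returns (ipmc, rest) lists of collections."""
    # Collapse logical groups with identical membership into one collection.
    collections = defaultdict(lambda: {"groups": [], "traffic": 0.0})
    for name, (members, traffic) in groups.items():
        collection = collections[frozenset(members)]
        collection["groups"].append(name)
        collection["traffic"] += traffic
    # Rank collections by traffic * size, largest first.
    ranked = sorted(
        collections.items(),
        key=lambda item: item[1]["traffic"] * len(item[0]),
        reverse=True,
    )
    # The top collections get a physical IPMC address; the rest use UDP multi-send.
    ipmc = [c["groups"] for _, c in ranked[:num_ipmc]]
    rest = [c["groups"] for _, c in ranked[num_ipmc:]]
    return ipmc, rest


groups = {
    "g1": ({"a", "b", "c"}, 10.0),
    "g2": ({"a", "b", "c"}, 5.0),  # same members as g1, so same collection
    "g3": ({"a", "d"}, 20.0),
    "g4": ({"b"}, 1.0),
}
ipmc, rest = tile(groups, num_ipmc=1)
```

Here g1 and g2 share one collection with score 15 * 3 = 45, which beats g3 at 20 * 2 = 40, so with a single available address the g1/g2 collection is the one mapped to IPMC.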
Experimental Results
Overhead (max. throughput)
• Insignificant overhead when mapping L-IPMC to P-IPMC.
Overhead (CPU utilization)
• Insignificant overhead when mapping L-IPMC to P-IPMC.
Network Overhead
• Gossip Layer uses constant background
bandwidth, urgent channel behaves well
Latency
• Latency of propagation of joins/leaves
and new maps
Policy control
• A malfunctioning node bombards an existing
IPMC group.
• MCMD policy prevents ill-effects
[Figure annotations: "Traffic starts", "New policy"]
Conclusion
• IPMC has been a bad
citizen...
• Dr. Multicast has the cure!
• Opportunity for big
performance enhancements
and policy control.
Thank you!
Overhead
• Insignificant overhead when mapping L-IPMC to
P-IPMC.
Overhead
• Linux kernel module increases UDP-multisend
throughput by 17% (compared to user-space
UDP-multisend)
Latency of events
• Gossip: 99% of nodes aware of change within 9
epochs (now 1 sec)
Conclusions
• Policy: Allows data center operators to
enable and control IPMC
• Transparency: Standard IPMC interface, system
calls are overloaded.
• Performance: Uses IPMC when possible,
otherwise point-to-point UDP
• Robustness: Distributed, fault-tolerant service
Results
• Library Module
o Insignificant slowdown
o Linux kernel module provides 17% speed-up
for UDP multi-send
Optimization questions
[Figure: Users, Topics, and Groups; one box labeled "This box intentionally left BLACK"]
• Multi-objective:
o Minimize number of groups
o Minimize bandwidth overhead on network
• Thm: This problem is NP-complete
o Reduction from Minimum Normal Set Basis
MCMD Library Layer
• Overloads the IPMC functions
o setsockopt(), send(), etc.
• Translates logical IPMC addresses to
physical IPMC, or point-to-point UDP
packets depending on policy
• Notifies MCMD immediately about
joins/leaves
• Learns about new mappings from
MCMD
• Keeps statistics about group traffic
rates
• Caches translation maps
• Maintains a connection to MCMD for
updates