Transcript 07-seattle
SEATTLE
A Scalable Ethernet Architecture for Large Enterprises
M.Sc. Pekka Hippeläinen
IBM
phippela@gmail
1.10.2012
T-110.6120 – Special Course in Future Internet Technologies
SEATTLE
Based on, and figures borrowed from: Kim, C.; Caesar, M.; Rexford, J.
"Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises"
Is it possible to build a protocol that maintains the
same configuration-free properties as Ethernet
bridging, yet scales to large networks?
Contents
Motivation: network management challenge
Ethernet features: ARP and DHCP broadcasts
1) Ethernet Bridging
2) Scaling with Hybrid networks
3) Scaling with VLANs
Distributed Hashing
SEATTLE approach
Results
Conclusions
Network management challenge
IP networks require massive effort to configure and manage
Up to 70% of an enterprise network's cost goes to maintenance and configuration
Ethernet is much simpler to manage
However, Ethernet does not scale well beyond small LANs
The SEATTLE architecture aims to provide the scalability of IP with the simplicity of Ethernet management
Why is Ethernet so wonderful?
Easy to set up, easy to manage
A DHCP server, some hubs, plug and play
Flooding query 1: DHCP requests
Let's say node A joins the Ethernet
To get an IP address (or confirm one), node A sends a DHCP request as a broadcast
The request floods through the broadcast domain
Flooding query 2: ARP
In order for node A to communicate with node B in the same broadcast domain, the sender needs the MAC address of node B
Let's assume that node B's IP is known
Node A sends an Address Resolution Protocol (ARP) broadcast to find out the MAC address of node B
As with the DHCP broadcast, the request is flooded through the whole broadcast domain
This is basically an {IP -> MAC} mapping lookup
Why is flooding bad?
Large Ethernet deployments contain a vast number of hosts and thousands of bridges
Ethernet was not designed for such a scale
Virtualization and mobile deployments can cause many dynamic events, generating control traffic
Broadcast messages need to be processed by the end hosts, interrupting the CPU
The bridges' forwarding tables grow roughly linearly with the number of hosts
1) Ethernet bridging
Ethernet consists of segments, each comprising a single physical layer
Ethernet bridges are used to interconnect segments into a multi-hop network, i.e. a LAN
This forms a single broadcast domain
A bridge learns how to reach a host by inspecting incoming frames and associating the source MAC address with the incoming port
A bridge stores this information in a forwarding table and uses the table to forward frames in the correct direction (see the sketch below)
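
A minimal Python sketch of this learning behavior (class and port names are hypothetical, not from the paper):

# Minimal sketch of Ethernet bridge MAC learning (illustrative names).
class LearningBridge:
    def __init__(self, ports):
        self.ports = ports       # physical ports, e.g. [1, 2, 3]
        self.fwd_table = {}      # learned: source MAC -> incoming port

    def handle_frame(self, src_mac, dst_mac, in_port):
        self.fwd_table[src_mac] = in_port   # learn the sender's port
        if dst_mac in self.fwd_table:       # known destination: unicast
            return [self.fwd_table[dst_mac]]
        return [p for p in self.ports if p != in_port]  # unknown: flood

bridge = LearningBridge(ports=[1, 2, 3])
bridge.handle_frame("aa:aa", "bb:bb", in_port=1)         # floods to 2 and 3
print(bridge.handle_frame("bb:bb", "aa:aa", in_port=2))  # [1], learned earlier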
Bridge spanning tree
One bridge is configured to be the root bridge
The other bridges collectively compute a spanning tree based on their distance to the root
Thus traffic is not forwarded along the shortest path but along the spanning tree
This approach avoids forwarding loops and thus broadcast storms (a centralized sketch follows)
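
A centralized analogy in Python (real bridges compute the tree distributively with the spanning tree protocol; names are hypothetical):

# Sketch: computing a spanning tree outward from the root bridge via BFS.
from collections import deque

def spanning_tree(links, root):
    # links: {bridge: [neighbors]}; returns each bridge's parent toward root
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in links[u]:
            if v not in parent:          # first (shortest) path to root wins
                parent[v] = u
                queue.append(v)
    return parent

links = {"r": ["b1", "b2"], "b1": ["r", "b2"], "b2": ["r", "b1"]}
print(spanning_tree(links, "r"))         # {'r': None, 'b1': 'r', 'b2': 'r'}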
2) Hybrid IP/Ethernet
In this approach multiple LANs are interconnected with IP routing
In hybrid networks each LAN contains at most a few hundred hosts, which form an IP subnet
Each IP subnet is associated with an IP prefix
Assigning IP prefixes to subnets and associating subnets with router interfaces is a manual process
Unlike a MAC address, which is a host identifier, an IP address denotes the host's current location in the network
Drawbacks of Hybrid approach
The biggest drawback is configuration overhead
Router interfaces must be configured
Hosts must have an IP address matching the subnet they are located in (DHCP can be used)
Networking policies are usually defined per network prefix, i.e. tied to the topology
When the network changes, the policies must be updated
Limited mobility support
Mobile users & virtualized hosts at datacenters
If the IP address is to stay constant, the user must stay in the same subnet
3) Virtual LANs
Overcomes some problems of Ethernet and IP networks
Administrators can logically group hosts into the same broadcast domain
VLANs can be configured to overlap by configuring the bridges, not the hosts
Broadcast overhead is reduced by the isolated domains
Mobility is simplified: the IP address can be retained while moving between bridges
Virtual LANs
Traffic from B1 to B2 can be 'trunked' over multiple bridges
Inter-domain traffic needs to be routed
Drawbacks of VLANs
Trunk configuration overhead
Extending a VLAN across multiple bridges requires the VLAN to be configured at each participating bridge, often manually
Limited control-plane scalability
Bridges keep forwarding-table entries and see broadcast traffic for every active host in every VLAN visible to them
Insufficient data-plane efficiency
A single spanning tree is still used within each VLAN
Inter-VLAN traffic must be routed via IP gateways
Distributed Hash Tables
Hash tables are used to store {key -> value} pairs
With multiple nodes, consistent hashing gives a nice way to
Keep the nodes symmetric
Distribute the hash table entries evenly among the nodes
Keep reshuffling of entries small when nodes are added or removed
The idea is to calculate H(key), which is mapped to a host; one can visualize this as mapping the key to an angle (or a point on a circle)
Distributed Hash Tables
Each node is mapped to randomly distributed points on the circle
Thus each node owns multiple buckets
One calculates H(key) and stores the entry at the node owning that bucket
If a node is removed, its values are reassigned to the next buckets
If a node is added, some entries are moved to its new buckets (see the sketch below)
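
A minimal consistent-hashing sketch (node names are hypothetical; SHA-1 stands in for whatever hash function a real deployment would use):

# Consistent hashing with virtual nodes (illustrative only).
import bisect, hashlib

def H(key):
    # Map a key to a point on the circle (here a 32-bit integer).
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes, vnodes=4):
        # Each node owns several randomly placed points (buckets).
        self.points = sorted((H(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))

    def owner(self, key):
        # The entry belongs to the first node clockwise from H(key).
        keys = [p for p, _ in self.points]
        i = bisect.bisect(keys, H(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["s1", "s2", "s3"])
print(ring.owner("aa:bb:cc:dd:ee:ff"))   # some switch, e.g. "s2"
# Removing a node reassigns only the entries in its buckets to the next
# points on the circle; all other entries stay where they are.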
SEATTLE approach 1/2
1) Switches calculate shortest paths among themselves
This is a link-state protocol; shortest paths are computed with Dijkstra's algorithm (sketched below)
Discovery works at the switch level; Ethernet hosts do not respond
The switch topology is much more stable than the host-level topology
It is also much more scalable than routing at the host level
Each switch has an ID: the MAC address of one of its interfaces
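
A minimal sketch of the shortest-path computation with Dijkstra's algorithm over a hypothetical switch topology:

# Dijkstra over a switch-level topology (illustrative only).
import heapq

def shortest_paths(graph, src):
    # graph: {switch: {neighbor: link_cost}}; returns cost from src to each
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry, already improved
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

topo = {"s1": {"s2": 1, "s3": 4},
        "s2": {"s1": 1, "s3": 1},
        "s3": {"s1": 4, "s2": 1}}
print(shortest_paths(topo, "s1"))         # {'s1': 0, 's2': 1, 's3': 2}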
SEATTLE approach 2/2
2) A DHT is used in the switches
{IP -> MAC} mapping
This essentially serves ARP requests while avoiding flooding
{MAC -> location} mapping
Once the destination switch is located, forwarding along the shortest path can be used
The DHCP service location can also be stored
SEATTLE thus reduces flooding, allows shortest-path forwarding, and offers a nice way to locate the DHCP service
SEATTLE
Control overhead is reduced with consistent hashing
When the set of switches changes due to a network failure or recovery, only some entries must be moved
Load is balanced with virtual switches
If a switch is more powerful, it can represent itself as several virtual switches, attracting more load
Flexible service discovery is enabled
This is mainly for DHCP, but could be something like a {"PRINTER" -> location} entry
Topology changes
Adding and removing switches/links can alter the topology
Switch/link failures and recoveries can also lead to partitioning events (rarer)
Non-partitioning link failures are easy to handle: the resolver for a hash entry does not change
Switch failures
If a switch fails or recovers, hash entries need to be moved
The switch that published a value monitors the liveness of its resolver, republishing the entry when needed
The entries also have a TTL
Partitioning events
Each switch also has to keep track of its locally stored location entries
If a switch s_old is removed or becomes unreachable, all switches need to remove the location entries pointing to it
This approach correctly handles partitioning events
Scaling: location
The directory service is used to publish and maintain {MAC -> location} mappings
When host a with MAC address mac_a arrives, it attaches to switch S_a (steps 1-3 in the paper's figure)
Switch S_a publishes {mac_a -> location} by calculating the correct bucket F(mac_a), i.e. the resolver switch
When node b wants to send a message to node a, F(mac_a) is calculated to fetch the location
'Reactive resolution': even cache misses do not lead to flooding (see the sketch below)
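
A minimal sketch of this publish/resolve flow, assuming a simplified F() that hashes directly onto a fixed switch list instead of the consistent-hash ring; all names are hypothetical:

# SEATTLE-style location publishing and resolution (illustrative only).
import hashlib

SWITCHES = ["s1", "s2", "s3", "s4"]
directory = {s: {} for s in SWITCHES}     # each switch's slice of the DHT

def F(key):
    # Hash a key to its resolver switch (simplification of the ring).
    return SWITCHES[int(hashlib.sha1(key.encode()).hexdigest(), 16) % len(SWITCHES)]

def publish(mac, location):
    directory[F(mac)][mac] = location     # access switch stores the mapping

def resolve(mac):
    return directory[F(mac)].get(mac)     # one unicast lookup, no flooding

publish("aa:bb:cc:dd:ee:ff", "s2")        # host a attaches to switch s2
print(resolve("aa:bb:cc:dd:ee:ff"))       # -> "s2"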
Scaling: ARP
When node b makes an ARP request, SEATTLE converts it into a unicast lookup of the {IP_a -> mac_a} entry at the resolver F(IP_a)
The resolver switch F(IP_a) is usually different from F(mac_a)
Optimization for hosts making ARP requests
The F(IP_a) address resolver can also store mac_a and S_a
When node b makes the F(IP_a) ARP request, the {mac_a -> S_a} mapping is also cached at S_b (sketched below)
The shortest path (path 10 in the paper's figure) can now be used
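
A sketch of the optimization under the same assumptions: the F(IP_a) resolver returns both facts, so S_b caches the location in the same step:

# ARP optimization sketch: the resolver at F(IP_a) stores both mac_a
# and the location S_a, so one unicast lookup yields both facts.
arp_directory = {"10.0.0.5": ("aa:bb:cc:dd:ee:ff", "s2")}  # slice at F(IP_a)
local_cache = {}                                           # cache at S_b

def seattle_arp(ip):
    mac, location = arp_directory[ip]   # single lookup, no flooding
    local_cache[mac] = location         # cache {mac_a -> S_a} at S_b too
    return mac

print(seattle_arp("10.0.0.5"))  # MAC resolved; shortest path usable at once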
Handling host dynamics
Location change
Wireless handoff
VM moved while retaining its MAC
Host MAC address changes
NIC replaced
Failover event
VM migration forcing a MAC change
Host changes IP
DHCP lease expires
Manual reconfiguration
Insert, delete and update
Location change
Host h moves from s_old to s_new
s_new updates the existing MAC-to-location entry
MAC change
IP-to-MAC update
MAC-to-location deletion (old) and insertion (new)
IP change
S_h deletes the old IP-to-MAC entry and inserts the new one (see the sketch below)
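
A sketch of the three cases as operations on two toy directory slices (an {ip -> mac} slice at F(ip) and a {mac -> location} slice at F(mac)); everything here is an illustrative stand-in:

# The three dynamics cases over toy in-memory directory slices.
F = lambda key: f"s{sum(map(ord, key)) % 2 + 1}"   # toy resolver choice
ip_dir = {"s1": {}, "s2": {}}    # F(ip)  -> {ip: mac}
loc_dir = {"s1": {}, "s2": {}}   # F(mac) -> {mac: location}

def on_location_change(mac, s_new):
    loc_dir[F(mac)][mac] = s_new                 # overwrite the location

def on_mac_change(ip, old_mac, new_mac, location):
    ip_dir[F(ip)][ip] = new_mac                  # IP-to-MAC update
    loc_dir[F(old_mac)].pop(old_mac, None)       # delete old MAC-to-location
    loc_dir[F(new_mac)][new_mac] = location      # insert new MAC-to-location

def on_ip_change(old_ip, new_ip, mac):
    ip_dir[F(old_ip)].pop(old_ip, None)          # delete old IP-to-MAC
    ip_dir[F(new_ip)][new_ip] = mac              # insert new IP-to-MAC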
Ethernet: bootstrapping hosts
Hosts are discovered by their access switches
SEATTLE switches snoop ARP requests
Most OSes generate an ARP request at boot or when an interface comes up
DHCP messages, or noticing that a host has gone down, can also be used
Host configuration without broadcast
The DHCP server's switch hashes the string "DHCP_SERVER" and stores the server's location at the resulting resolver switch
The "DHCP_SERVER" string is then used to locate the service (see the sketch below)
No need to broadcast for ARP or DHCP
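
A standalone sketch of the same idea: service discovery is just another directory key (switch names and key handling are hypothetical):

# Service discovery via the directory; "DHCP_SERVER" is the well-known key.
import hashlib

SWITCHES = ["s1", "s2", "s3", "s4"]
directory = {s: {} for s in SWITCHES}

def F(key):  # hash the key to its resolver switch (simplified)
    return SWITCHES[int(hashlib.sha1(key.encode()).hexdigest(), 16) % 4]

directory[F("DHCP_SERVER")]["DHCP_SERVER"] = "s4"   # publish server location
print(directory[F("DHCP_SERVER")]["DHCP_SERVER"])   # locate it, no broadcast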
Scalable and flexible VLANs
To support broadcasts, the authors suggest using groups
Similar to a VLAN, a group is defined as a set of hosts that share the same broadcast domain
The groups are not limited to layer-2 reachability
Group-wide broadcasting is multicast-based
A multicast tree with a broadcast root is built for each group
F(group_id) is used to locate the broadcast root
Simulations
1) Campus: ~40,000 students
517 routers and switches
2) AP-Large (Access Provider)
315 routers
3) Datacenter (DC)
4 core routers with 21 aggregation switches
Routers were converted to SEATTLE switches
Cache timeout in AP-large with 50k hosts
The shortest-path cache timeout affects the number of location lookups
Even with a 60 s timeout, 99.98% of packets were forwarded without a lookup
Control overhead (the blue curve in the paper's figure) decreases very fast, whereas the table size increases only moderately
The shortest path is used for the majority of forwarding in these simulations
Table size increase in DC
Ethernet bridges store an entry for each destination, giving ~O(sh) state across the network (s switches, h hosts)
SEATTLE requires only ~O(h) state, since only the access and resolver switches need to store location information for each host
With this topology the table size was reduced by a factor of 22
In the AP-large case the factor increased to 64
Control overhead in AP-large
Control overhead is measured as the number of control messages over all links in the topology, divided by the number of switches and the duration of the trace
SEATTLE significantly reduces control overhead in the simulations
This is mainly because Ethernet generates network-wide floods for a significant number of packets
Effect of switch failure in DC
Switches were allowed to fail randomly
The average recovery time was 30 seconds
SEATTLE can use all the links in the topology, whereas Ethernet is restricted to the spanning tree
Ethernet must recompute the tree, causing outages
Effect of host mobility in Campus
Hosts were randomly moved between access switches
For high mobility rates, SEATTLE's loss rate was lower than Ethernet's
With Ethernet it takes some time for switches to evict the stale location information and relearn the new location
SEATTLE provided low loss and low broadcast overhead
What was omitted
The authors suggest multi-level one-hop DHTs
In large, dynamic networks it can be beneficial to store entries close by
This is achieved with regions and a backbone; border switches connect to the backbone switches
Topology changes
An approach to seamless mobility is described in the paper
Updating remote hosts' caches requires switch-based MAC revocation lists
Some simulation results
The authors also made a sample implementation
Conclusions
Operators today face challenges in managing and configuring large networks, largely due to the complexity of administering IP networks
Ethernet is not a viable alternative
Poor scaling and inefficient path selection
SEATTLE promises scalable self-configuring routing
Simulations suggest efficient routing and low latency with quick recovery
Host mobility is supported with low control overhead
Ethernet stacks at end hosts are not modified
Thank you for your attention!
Questions? Comments?