Improving Robustness in
Distributed Systems
Per Bergqvist
Erlang User Conference 2001
(courtesy CellPoint Systems AB)
Design base
Cluster of cooperating hosts
Erlang and C
COTS hardware based
Unix based (i.e. Solaris or Linux)
10/100/1000 base-T back plane
("system area network")
Cluster
Shared, distributed, system configuration
Each host has ONE cluster controller
Dispatch and supervise worker tasks
Master cluster controller: holds configuration
database (persistent replica)
Slave cluster controller: gets configuration
from master cluster controllers
Cluster is DOWN when all master cluster
controllers are inaccessible
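The availability rule on this slide can be written down as a small predicate. The following is only a sketch in C: the `cluster_controller` record, its field names and the notion of a `reachable` flag are assumptions made for illustration, not taken from the actual system.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical role and reachability record for one cluster controller. */
enum cc_role { CC_MASTER, CC_SLAVE };

struct cluster_controller {
    enum cc_role role;
    bool reachable;   /* outcome of whatever liveness check is in use */
};

/* The cluster is DOWN when no master cluster controller is reachable.
 * Slave controllers do not count: they get their configuration from
 * the masters. An empty list is treated as DOWN. */
bool cluster_is_down(const struct cluster_controller *cc, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (cc[i].role == CC_MASTER && cc[i].reachable)
            return false;
    }
    return true;
}
```

With two masters and one slave, the cluster stays up as long as either master answers, no matter how many slaves are reachable.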
Typical system
[Diagram labels: Traffic, Firewall, Switch, Control]
Cluster Key Benefits
Single system view
Enforces decoupling of parts of O&M
from actual traffic processing
Implementing a cluster
Cluster->Host->Node->NodeData
Cluster global parameters
Subscription mechanisms for conf. changes
Mnesia as configuration database on master
cluster controllers
Home-brewed configuration distribution to slave
controllers (NOT using Mnesia)
(Worker) node supervision
Mnesia gotchas
First distributed node startup:
- Disallow writes when not all replicas are
accessible
- Use a timeout on table load, then force load
... BUT ...
TCP based distribution
Network partitioning
Network parameters
Align TCP retransmission intervals w/
Erlang heartbeats
Align TCP and IP rerouting parameters
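A rough way to reason about the alignment is to compare two numbers: the window in which Erlang declares a silent node down, and how many TCP retransmissions fit into that window. The C sketch below does this arithmetic. The Erlang window follows the documented behaviour of `net_ticktime` (detection between T - T/4 and T + T/4). The TCP side is a deliberately simplified model and an assumption: a timer that starts at `initial_ms`, doubles after each retransmission and is capped at `max_ms`. Real stacks derive the timer from measured round-trip times, so treat the result as an estimate only. The function names are hypothetical.

```c
/* Window (in milliseconds) in which Erlang declares a silent node down
 * for a given net_ticktime T: between T - T/4 and T + T/4. */
struct tick_window { long min_ms; long max_ms; };

struct tick_window erlang_tick_window(long net_ticktime_ms)
{
    struct tick_window w = {
        net_ticktime_ms - net_ticktime_ms / 4,
        net_ticktime_ms + net_ticktime_ms / 4
    };
    return w;
}

/* Simplified model (assumption): the retransmission timer starts at
 * initial_ms, doubles after every retransmission and is capped at max_ms.
 * Returns how many retransmissions are sent within budget_ms. */
int retransmits_within(long initial_ms, long max_ms, long budget_ms)
{
    int count = 0;
    long elapsed = 0;
    long rto = initial_ms > max_ms ? max_ms : initial_ms;

    while (rto > 0 && elapsed + rto <= budget_ms) {
        elapsed += rto;          /* timer fires, segment is resent */
        count++;
        rto *= 2;                /* exponential backoff */
        if (rto > max_ms)
            rto = max_ms;
    }
    return count;
}
```

Under this model, a 3 s initial timer gets only 4 retransmissions into the 45 s lower bound of the default 60 s tick time, while a 200 ms timer capped at 1 s gets 46. More retransmissions inside the window give a rerouted path more chances to carry the connection before the node is declared down.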
Typical system II:
Dual back plane
[Diagram labels: Firewall, Switch, Traffic, Control]
Erlang multi-homing problem
[Diagram labels: Host A, Host B, Host C]
Multi-home Erlang w/ TCP
Add an alias interface to the loopback i/f
Patch TCP distribution to bind to the alias
Publish the alias interface via (all wanted)
real hw i/f's:
- Method 1: Static routes and
gratuitous/proxy ARP
- Method 2: Use a new (routing) protocol
ARP method
Implement a utility to:
- broadcast unsolicited ARP responses
- respond to ARP requests
for the alias i/f address
Add static routes on all far end systems
NOTE: all real i/f's need to be on the same
IP subnet
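To make the "broadcast unsolicited ARP responses" step concrete, here is a sketch in C that builds such a frame for the alias address. It only fills a buffer. Putting the frame on the wire needs a platform-specific raw link-layer interface and is left out. The function name is hypothetical, and setting the target hardware address to broadcast is one common choice, assumed here rather than taken from the talk.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARP_FRAME_LEN 42  /* 14-byte Ethernet header + 28-byte ARP payload */

/* Fill `frame` (at least ARP_FRAME_LEN bytes) with an unsolicited ARP
 * reply announcing that alias_ip is reachable at if_mac, the MAC address
 * of one real interface. Returns the number of bytes written. */
size_t build_unsolicited_arp_reply(uint8_t *frame,
                                   const uint8_t if_mac[6],
                                   const uint8_t alias_ip[4])
{
    static const uint8_t bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
    uint8_t *p = frame;

    /* Ethernet header */
    memcpy(p, bcast, 6);    p += 6;   /* destination: broadcast */
    memcpy(p, if_mac, 6);   p += 6;   /* source: the real interface */
    *p++ = 0x08; *p++ = 0x06;         /* EtherType: ARP */

    /* ARP payload */
    *p++ = 0x00; *p++ = 0x01;         /* hardware type: Ethernet */
    *p++ = 0x08; *p++ = 0x00;         /* protocol type: IPv4 */
    *p++ = 6;                         /* hardware address length */
    *p++ = 4;                         /* protocol address length */
    *p++ = 0x00; *p++ = 0x02;         /* opcode: reply */
    memcpy(p, if_mac, 6);   p += 6;   /* sender hardware address */
    memcpy(p, alias_ip, 4); p += 4;   /* sender protocol address: the alias */
    memcpy(p, bcast, 6);    p += 6;   /* target hardware address (assumed: broadcast) */
    memcpy(p, alias_ip, 4); p += 4;   /* target protocol address: same as sender */

    return (size_t)(p - frame);
}
```

The same payload layout, with the target fields set to the requester's addresses and the frame sent unicast, would serve for the "respond to ARP requests" half of the utility.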
New routing protocol
Broadcast (Ethernet frames) what you
have, including interface priority
Let the far end select path based on
what/when they receive
Far end dynamically sets up host routes
Use short retransmission intervals
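The far-end selection step might look like the following C sketch. Everything here is an assumption for illustration: the `advert` record, the rule that a higher priority wins, the tie-break on the most recently heard advertisement, and the `max_age_ms` staleness limit standing in for "what/when they receive".

```c
#include <stddef.h>

/* Hypothetical record of the latest advertisement heard from one far-end
 * host on one of our local interfaces. */
struct advert {
    int  local_ifindex;   /* interface the frame arrived on */
    int  priority;        /* advertised interface priority; higher wins (assumption) */
    long last_heard_ms;   /* local clock value when the last frame arrived */
};

/* Choose the interface for the host route: the highest priority among
 * advertisements heard within max_age_ms; ties go to the most recent.
 * Returns -1 when every path is stale. */
int select_path(const struct advert *ads, size_t n,
                long now_ms, long max_age_ms)
{
    const struct advert *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (now_ms - ads[i].last_heard_ms > max_age_ms)
            continue;                       /* too old: path presumed dead */
        if (best == NULL ||
            ads[i].priority > best->priority ||
            (ads[i].priority == best->priority &&
             ads[i].last_heard_ms > best->last_heard_ms))
            best = &ads[i];
    }
    return best ? best->local_ifindex : -1;
}
```

A short advertisement interval lets `max_age_ms` be small, so a failed link is dropped quickly and the host route moves to the next-best interface.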
Erlang multi-homing resolved?
[Diagram labels: Host A, Host B, Host C]
Summing up
Erlang can support multihoming with some
additional work
By using a loopback alias i/f, link failure
becomes a routing problem (the peer-to-peer
association is kept intact)
Solaris TCP/IP stack parameters are:
- hard to find (only in out-of-date app. notes)
- hard to set "right"
- host global
A distribution mechanism with built-in support
for multi-homing is preferred
Erlang Distribution over SCTP
Per Bergqvist et al
Erlang User Conference 2002