Laboratoire LIP
ENS-Lyon
A Survey on High Availability Mechanisms for IP Services
11 October 2005
N. Ayari, FT R&D; D. Barbaron, FT R&D
L. Lefevre, INRIA; P. Primet, INRIA
2005 High Availability and Performance Computing Workshop (HAPCW'2005)
Santa Fe, USA
France Telecom
Research & Development
Introduction
Different types of clusters
MPP and SMP clusters
– Scalability via CPU and memory interconnects
– Use of special-purpose hardware and/or software
– High availability through
- Job scheduling and migration
- Fault detection and checkpointing
Clusters of independently working nodes
– An attractive alternative based on commodity hardware and/or general-purpose operating systems
– Scalability achieved by efficiently distributing incoming requests across the available nodes
– High availability?
- Uninterrupted service and service integrity
HAPCW'2005
Distribution of this document is subject to France Telecom’s authorization
Introduction
Scalability issues in clusters of commodity hardware/software nodes
The request distribution should
• Increase performance by
– Improving system responsiveness
– Increasing the number of concurrent connections supported per unit of time
– Keeping response times reasonable
– When is the bottleneck observed?
• Support upper-layer session integrity
– Integrity depends on the switching granularity
– Distribution on a per-datagram, per-connection, or per-session basis
Switch designs
Can be
• Stateless or stateful
Applies to
• Layer 4 switching
– Uses layer 2 to 4 packet information (TCP/IP model)
• Layer 5 switching
– Uses layer 2 to 5 packet information (TCP/IP model)
Stateless vs stateful switch designs
Stateless switch design
• Achieves better latency by
– Processing each datagram independently of its predecessors
– Not maintaining any state information
• Implements service integrity
– On a per-connection basis in layer 4 switching
– Uses hashing to map all datagrams originating from the same client, identified by <IP address, port number, protocol>, to the same cluster node
– On a per-session basis in layer 5 switching
– Depends on the IP data application
- Cookie-based persistence for web traffic
- Cookie switching
- Cookie-based hashing
What about other data applications?
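The per-connection hashing used by a stateless layer 4 switch can be sketched as follows. This is a minimal illustration, not the method of any particular product: the function name `pick_node`, the node names, and the choice of SHA-256 as the hash are assumptions made for the example.

```python
import hashlib

def pick_node(src_ip: str, src_port: int, protocol: str, nodes: list) -> str:
    """Map a client identifier to a node without keeping any per-flow state."""
    # Hypothetical key layout: <IP address, port number, protocol>.
    key = f"{src_ip}|{src_port}|{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]

nodes = ["node-a", "node-b", "node-c"]  # hypothetical cluster nodes
first = pick_node("192.0.2.10", 40001, "tcp", nodes)
second = pick_node("192.0.2.10", 40001, "tcp", nodes)
print(first == second)  # every datagram of the same client reaches the same node
```

Because the node is recomputed from the datagram alone, the switch keeps no table; the price is that the mapping is only as stable as the node list.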
Stateless vs stateful switch designs
Stateless switch design limitations
Upper-layer session integrity
• A request belonging to one session may go to the wrong server
– Hash collisions call for robust hash functions
Faulty node handling
– When the hash function depends on the number of active nodes
– All sessions must be replayed when one or more nodes crash
Fair load distribution
• The stateless design relies on static load balancing
– Source hashing
• Yet requests have varying service times and resource needs
– Long SIP sessions, bandwidth-consuming FTP transfers, etc.
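The faulty-node limitation can be made concrete with a small experiment. The sketch below assumes a plain "hash modulo number of active nodes" mapping (one possible stateless scheme, chosen for illustration) and counts how many clients are sent to a different node after one node out of five crashes. All names and addresses are hypothetical.

```python
import hashlib

def stable_hash(client: str) -> int:
    """Deterministic 64-bit hash of a client identifier."""
    return int.from_bytes(hashlib.sha256(client.encode()).digest()[:8], "big")

def assign(client: str, nodes: list) -> str:
    """Static mapping that depends on the number of active nodes."""
    return nodes[stable_hash(client) % len(nodes)]

def moved_fraction(clients: list, before: list, after: list) -> float:
    """Share of clients whose node changes between two node lists."""
    moved = sum(1 for c in clients if assign(c, before) != assign(c, after))
    return moved / len(clients)

clients = [f"198.51.100.{i % 250}:{10000 + i}" for i in range(2000)]
five_nodes = ["n0", "n1", "n2", "n3", "n4"]
four_nodes = five_nodes[:-1]  # n4 has crashed
fraction = moved_fraction(clients, five_nodes, four_nodes)
print(f"{fraction:.0%} of clients changed node")
```

For a uniform hash, about four clients in five are expected to move when going from five nodes to four, although only one fifth of them were on the failed node. This is why sessions on healthy nodes are disrupted as well.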
Stateless vs stateful switch designs
Stateful switch designs
They aim to improve both
• Upper-layer session integrity
– Maintains connection/session state
– Source and destination IP addresses, port numbers, transport protocol
- No semantics delimit a UDP connection
– Maintains multiple-purpose timers
– Avoids keeping inactive sessions/connections
- DDoS countermeasure
– Computes statistics on the average client session duration
– Needs to speed up the lookup for each datagram
– Uses index hashing
• Load distribution fairness
– Uses service-state-aware load distribution policies
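A stateful switch can be pictured as a hash-indexed connection table with idle timers. The sketch below is an assumption-laden illustration: the class `ConnectionTable`, the 5-tuple key layout, the round-robin choice for new flows, and the timeout value are all hypothetical and not taken from any implementation discussed here.

```python
import time

class ConnectionTable:
    """Per-flow state keyed by 5-tuple, with idle-timeout expiry."""

    def __init__(self, nodes, idle_timeout=30.0, clock=time.monotonic):
        self.nodes = list(nodes)
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.entries = {}      # 5-tuple -> (node, last_seen); dict gives hashed lookup
        self.next_index = 0    # round-robin cursor for new flows

    def lookup(self, five_tuple):
        """Return the node for this datagram, creating state for a new flow."""
        now = self.clock()
        entry = self.entries.get(five_tuple)
        if entry is not None and now - entry[1] <= self.idle_timeout:
            node = entry[0]
        else:
            node = self.nodes[self.next_index % len(self.nodes)]
            self.next_index += 1
        self.entries[five_tuple] = (node, now)  # refresh the idle timer
        return node

    def expire(self):
        """Drop flows idle for longer than the timeout; return how many."""
        now = self.clock()
        stale = [k for k, (_, seen) in self.entries.items()
                 if now - seen > self.idle_timeout]
        for k in stale:
            del self.entries[k]
        return len(stale)
```

The idle timer stands in for the missing end-of-connection signal in UDP and bounds the table size, which is also what limits the damage of a flood of bogus flows.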
Stateless vs stateful switch designs
Stateful design limitations
Cost effectiveness
• Server state distribution overhead
Efficiency depends on the granularity of the switching operation
• Layer 4 or layer 5?
– Does layer 4 scale for all IP services?
Load distribution fairness?
• The decision is taken on the first datagram of a session/connection
– New mechanisms are needed
Fair Scheduling
How to measure load?
• Using a robust, simple summary metric that adapts quickly
– CPU, memory, and disk I/O utilization
– Number of active application processes and connections
– Availability of network protocol buffers
– Number of active users
Policies?
• Static
– Randomization, (Weighted) Round Robin, Source/Destination Hashing
• Dynamic (server/client state aware)
– (Weighted) Least Connections, Shortest Expected Delay, Minimum Misses
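As an illustration of a dynamic policy, weighted least connections can be sketched as picking the server with the smallest ratio of active connections to weight. The `Server` class, the weights, and the tie-breaking by list order are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    weight: int       # relative capacity, hypothetical values below
    active: int = 0   # currently open connections

def weighted_least_connections(servers):
    """Pick the server with the smallest active/weight ratio."""
    candidates = [s for s in servers if s.weight > 0]
    return min(candidates, key=lambda s: s.active / s.weight)

servers = [Server("big", weight=3), Server("small", weight=1)]
picks = []
for _ in range(8):
    chosen = weighted_least_connections(servers)
    chosen.active += 1          # the new connection stays open in this demo
    picks.append(chosen.name)
print(picks)
```

With connections that never close, the two servers end up sharing the eight connections in proportion to their weights. Unlike a static policy, the choice reacts as soon as connections on one server terminate.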
Fair Scheduling
Policies?
• Dynamic (server/client state aware) (cont.)
– Cache affinity
– The files are partitioned among the nodes
– SITA-E (Size Interval Task Assignment with Equal load)
– The node is determined by the 'size' of the request
– CAP (Client Aware Policy)
– Consecutive connections from the same client are assigned to the same node
• Admission control policies
– Locality-Based Least-Connection, Locality-Based Least-Connection with Replication
[Figure: load thresholds TLow, THigh, and TLimit, with regions labeled Under Utilized, Fully Utilized, Acceptable, and Unacceptable]
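The size-interval policy listed above (Size Interval Task Assignment with Equal load) can be sketched as follows. This is a simplified illustration that derives the size cutoffs from an empirical sample of request sizes, so that each interval carries roughly the same total work; the function names and the sample are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

def equal_load_cutoffs(sample_sizes, num_nodes):
    """Return size cutoffs so each interval carries about the same total work."""
    sizes = sorted(sample_sizes)
    running = list(accumulate(sizes))   # cumulative work, smallest requests first
    total = running[-1]
    cutoffs = []
    for k in range(1, num_nodes):
        target = total * k / num_nodes
        idx = bisect_left(running, target)
        cutoffs.append(sizes[idx])
    return cutoffs

def node_for(size, cutoffs):
    """Requests up to cutoffs[0] go to node 0, the next interval to node 1, etc."""
    return bisect_left(cutoffs, size)

# Hypothetical workload: many small requests, a few large ones.
sample = [1] * 90 + [10] * 9 + [90]
cutoffs = equal_load_cutoffs(sample, 3)
print(cutoffs, [node_for(s, cutoffs) for s in (1, 10, 90)])
```

In this sample each of the three nodes receives a total of 90 size units, even though the number of requests per node differs widely. Small requests are never queued behind large ones.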
Fair Scheduling
Policies?
• Network-traffic-based balancing
– Focuses on predicting the volume of incoming traffic from a source based on past history
• Priority-based balancing
– Assigns higher priority to some data traffic
• Topology-based redirection
– Redirects traffic to the cluster nearest the client in terms of
- Hop count (static)
- Network latency (dynamic)
• Application-specific redirection
– Layer 5 load balancing specializes back-end servers for particular content or services
• Etc.
Layer 4 Switching
How?
Works at the TCP/IP level
• Content-blind switching
Layer 4 switches
– Two-way architectures: packet double rewriting
– One-way architectures: packet single rewriting, packet forwarding, packet tunnelling
[Figure: client, switch, and servers in the two-way and one-way architectures]
Layer 4 Switching
A kernel implementation
The IP Virtual Server implementation
• Supports NAT, DR, and tunnelling
– As add-on modules in the networking layer of the kernel
• Based on the Linux packet filtering and routing capabilities
The Linux Virtual Server
• A cluster of independently working nodes
• Using the IPVS load balancer
[Figure: LVS components: cluster management, KTCPVS, IPVS]
Some recommendations [WZ]

                 VS/NAT          VS/TUN       VS/DR
Server           any             tunneling    non-arp device
Server network   private         LAN/WAN      LAN
Server number    low (10~20)     high (100)   high (100)
Server gateway   load balancer   own router   own router
Layer 4 switching
Performance: Single CPU Linux 2.2 LVS-NAT vs. LVS-DR scaling
Performance [Rou2001].
Layer 4 switching
Some Layer 4 switching products
Two-way architectures
Packet double rewriting
- Cisco's Local Director (commercial)
- Magic Router (Berkeley)
- LSNAT
- F5 Network's BIG-IP 5100
- LVS
- Foundry Network's Server Iron
- Cyber IQ's Hyper Flow
- Coyote Point's Equalizer
One-way architectures
Packet single rewriting
- TCP Router
Packet tunneling
- LVS
Packet forwarding
- IBM Network Dispatcher (component of IBM WebSphere NetEdge server)
- OneIP (Bell Labs)
- LSMAC
- Intel NetStructure
- Traffic Director
- Nortel Network's Alteon 780 series
- Foundry Network's Server Iron
- Radware WSD Pro
- LVS
- VA Balance (VA Linux Systems Japan)
Layer 4 Switching
The Netfilter capabilities and return codes
[Figure: Netfilter hooks on the packet path between the external and internal interfaces: NF_IP_PREROUTING (DNAT), NF_IP_LOCAL_IN (INPUT), NF_IP_FORWARD (FORWARD), NF_IP_LOCAL_OUT (OUTPUT, DNAT), NF_IP_POSTROUTING (SNAT), with sanity checks on input, the routing decision, the local process, and the layer 2 functions on output]

Return code   Meaning
NF_DROP       Discard the packet
NF_ACCEPT     Keep the packet
NF_STOLEN     Forget about the packet
NF_QUEUE      Queue packet for user space
NF_REPEAT     Call this hook function again
Layer 4 Switching
The IPVS Architecture
[Figure: packet path through IPVS and the Netfilter hooks PREROUTING, ROUTING, LOCAL_IN, FORWARD, LOCAL_OUT, and POSTROUTING, with the local process]
Layer 4 Switching
Persistency handling
[Figure: persistency handling: path of a marked packet through PREROUTING, ROUTING, LOCAL_IN, FORWARD, LOCAL_OUT, and POSTROUTING, with the local process]
Layer 4 switching
Issues
The persistence template for layer 4 switching may not scale
• Example: VoIP data exchange using SIP
– Different transport connections for different transactions within the same SIP session
• Session corruption implies datagram losses
– More latency (TCP AIMD)
[Figure: SIP call flows (INVITE, 100, 180, 200, ACK, RTP, BYE, 200) between client and server transactions, shown for a stateless proxy and for stateful proxies with Record-Route]
Layer 5 Switching
The solution?
The switch also provides the single view of the cluster
Request distribution is based on
• The load estimate of the cluster's nodes
• The connection identifiers of the request
– <source and destination IP addresses, source and destination port numbers, protocol>
• The session identifiers of the request and the content type
– Layer 5 header information
Additional delay
• The connection must be completed before the data can be parsed
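Switching on a session identifier can be illustrated with SIP, where the Call-ID header (compact form `i`) is shared by all requests of a call. The sketch below hashes the Call-ID so that an INVITE and a later BYE reach the same node even if they arrive over different transport connections. The helper names, node names, and sample messages are hypothetical, and real SIP parsing involves more than this regular expression.

```python
import hashlib
import re

# Matches "Call-ID: value" or the compact form "i: value" at the start of a line.
CALL_ID = re.compile(r"^(?:Call-ID|i)\s*:\s*(\S+)", re.IGNORECASE | re.MULTILINE)

def session_key(sip_message: str):
    """Extract the Call-ID value, or None if the header is absent."""
    match = CALL_ID.search(sip_message)
    return match.group(1) if match else None

def node_for_session(sip_message: str, nodes: list) -> str:
    """Choose a node from the layer 5 session identifier, not the 5-tuple."""
    key = session_key(sip_message)
    if key is None:
        raise ValueError("no Call-ID header found")
    digest = hashlib.sha256(key.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

invite = ("INVITE sip:bob@example.com SIP/2.0\r\n"
          "Call-ID: a84b4c76e66710@pc33.example.com\r\n"
          "CSeq: 1 INVITE\r\n\r\n")
bye = ("BYE sip:bob@example.com SIP/2.0\r\n"
       "Call-ID: a84b4c76e66710@pc33.example.com\r\n"
       "CSeq: 2 BYE\r\n\r\n")
nodes = ["proxy-1", "proxy-2", "proxy-3"]
print(node_for_session(invite, nodes) == node_for_session(bye, nodes))
```

The switch has to receive the message body before it can decide, which is the additional delay mentioned above.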
Layer 5 Switching
The solution?
Layer 5 switches
– Two-way architectures: TCP gateway, TCP splicing
– One-way architectures: TCP splicing variants, TCP handoff, TCP handoff variants
Layer 5 Switching
TCP Gateway, the problems
[Figure: application-layer forwarding. The content-based switch relays data between client and server at the application layer (user space), through the transport-layer receive and send buffers and the network layer (kernel space)]
Cost effectiveness
• Multiple copies and context switches
• The proxy rapidly becomes the bottleneck because it is a two-way architecture
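The copy overhead of a TCP gateway shows up even in a minimal user-space relay. The sketch below is an illustration only: it relays one direction, and in-process socket pairs stand in for the client-side and server-side TCP connections. The function name `relay` and the sample request are assumptions.

```python
import socket

def relay(src: socket.socket, dst: socket.socket, bufsize: int = 4096) -> int:
    """Copy bytes from src to dst until src reaches end of stream."""
    copied = 0
    while True:
        chunk = src.recv(bufsize)   # copy 1: kernel receive buffer -> user space
        if not chunk:
            break
        dst.sendall(chunk)          # copy 2: user space -> kernel send buffer
        copied += len(chunk)
    return copied

# Socket pairs stand in for the two TCP connections of the gateway.
client, gateway_in = socket.socketpair()
gateway_out, server = socket.socketpair()
request = b"GET /index.html HTTP/1.0\r\n\r\n"
client.sendall(request)
client.shutdown(socket.SHUT_WR)     # signal end of request
forwarded = relay(gateway_in, gateway_out)
received = server.recv(4096)
for s in (client, gateway_in, gateway_out, server):
    s.close()
print(forwarded, received == request)
```

Every chunk crosses the user/kernel boundary twice, and the same work is repeated for the response, since a two-way gateway also carries the server's reply.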
Layer 5 Switching
TCP Splicing, the Packet Mapping operations.
[Figure: TCP splicing. The content-based switch forwards packets between client and server at the network layer (kernel space) with header translation, instead of passing data through the transport-layer receive and send buffers up to the application layer (user space)]
TCP header fields: SourcePort, DestPort, SEQNber, ACKNber, Len, FLG, AdvWin, CheckSum, UrgPtr, Options, Padding
• Modifications also affect
• IP pseudo header
• Socket options
Layer 5 Switching
TCP Splicing message timeline, the Delayed Binding.
[Figure: message timeline between client, switch, and server]
– Client to switch: SYN (CSEQ)
– Switch to client: SYN (PrSEQ), ACK (CSEQ+1)
– Client to switch: ACK (PrSEQ+1)
– Client to switch: DATA (CSEQ+1)
– Switch: scheduling and packet rewriting
– Switch to server: SYN (PsSEQ)
– Server to switch: SYN (SSEQ), ACK (PsSEQ+1)
– Switch to server: ACK (SSEQ+1)
– Switch to server: DATA (PsSEQ+1)
– Server to switch: DATA (SSEQ+1), ACK (PsSEQ+1+len)
– Switch: packet rewriting
– Switch to client: DATA (PrSEQ+1), ACK (CSEQ+1+len)
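The packet rewriting step in this timeline amounts to shifting sequence and acknowledgement numbers by fixed offsets between the two connections. The sketch below uses the initial sequence numbers named in the timeline (CSEQ, PrSEQ, PsSEQ, SSEQ); the class `Splice` and the sample values are hypothetical, and checksum and option updates are left out.

```python
MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits

class Splice:
    """Sequence/ack translation between the client-side and server-side connections."""

    def __init__(self, cseq, prseq, psseq, sseq):
        self.cseq = cseq    # client's initial sequence number
        self.prseq = prseq  # switch's initial sequence number toward the client
        self.psseq = psseq  # switch's initial sequence number toward the server
        self.sseq = sseq    # server's initial sequence number

    def client_to_server(self, seq, ack):
        """Rewrite a client segment so it fits the switch-server connection."""
        return ((seq - self.cseq + self.psseq) % MOD,
                (ack - self.prseq + self.sseq) % MOD)

    def server_to_client(self, seq, ack):
        """Rewrite a server segment so it fits the client-switch connection."""
        return ((seq - self.sseq + self.prseq) % MOD,
                (ack - self.psseq + self.cseq) % MOD)

splice = Splice(cseq=1000, prseq=5000, psseq=9000, sseq=4294967290)
length = 100  # bytes of client data, hypothetical
# Server sends DATA (SSEQ+1), ACK (PsSEQ+1+len); the client must see
# DATA (PrSEQ+1), ACK (CSEQ+1+len).
seq, ack = splice.server_to_client((4294967290 + 1) % MOD, 9000 + 1 + length)
print(seq == 5000 + 1, ack == 1000 + 1 + length)
```

Since the offsets are constant for the lifetime of the spliced connection, each forwarded packet needs only these additions plus a checksum update.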
Layer 5 Switching
TCP Splicing, the issues
Delayed binding
Double processing overhead
• Two-way switch mechanism
Buffer size for large-scale forwarders
The transition between control mode and forwarder mode
• Delay activation of the spliced connection until the buffers have drained
• Or forward data concurrently with draining the buffers
End-to-end flow control
• From small/big AdvWin to big/small AdvWin
Layer 5 Switching
TCP Splice improvements
Pre-forking TCP splice
• Reduces the cost of the three-way handshake
Pre-allocate server scheme
• Guesses the real server on receipt of the TCP SYN
Etc.
Layer 5 Switching
TCP Handoff
One-way mechanism
– Migrates the TCP connection from the front end to the back-end servers using the handoff protocol (Msg/Ack)
– Handoff message fields: Magic Nber, ConnMagic, Conn_Info
– Ack message fields: Magic Nber, ConnMagic, Ack Msg
– MagicNber = HdPrIdentifier, ConnMagic = NxtSeqNber; AckMsg reports the result of the handoff
– The connection is set up on the server without going through the three-way handshake procedure
[Figure: client, switch, and server TCP/IP stacks, with the ConnReq, Handoff, Ack, and Reply messages and the switch's forward module]
Layer 5 Switching
TCP Handoff message timeline
[Figure: message timeline between client, switch, and server]
– Client to switch: SYN (CSEQ)
– Switch to client: SYN (PrSEQ), ACK (CSEQ+1)
– Client to switch: ACK (PrSEQ+1)
– Client to switch: DATA (CSEQ+1)
– Switch: scheduling and connection migration
– Switch to server: Migrate Request (DATA, CSEQ, PrSEQ)
– Server to client: DATA (PrSEQ+1), ACK (CSEQ+1+len)
– The exchange continues with ACK, DATA, and FIN segments; the switch applies packet rewriting to the segments it forwards
TCP Handoff vs TCP Splice
Based on the LVS TCPSP and TCPHA 2.4 kernel implementations
• Throughput (13 KB file)
• Overhead due to L7 processing at the front end -> bottleneck -> low scalability
[Figure: Apache throughput (conn/sec) versus number of back-end nodes in the cluster]
Layer 5 Switching
The limitations
Highly available connections?
• Connection failover
One-way vs two-way architectures
• Improvements on TCP handoff
Current implementations do not cover all data traffic
Layer 5 Switching
Some layer 5 switching products
Two-way architecture
TCP Gateway
- IBM Network Dispatcher CBR
- CAP (Client Aware Proxy)
- Vovida's Load balancer Proxy
TCP Splicing
- Foundry Network Server Iron
- Radware WSD Pro+
- Hydra WS Hydra2500
- Alteon Applications switching series from Nortel
- Sharp Corporation Super Proxy
- Resonate's Central Dispatcher (with redirection capabilities)
- Cisco's CSS 11500 (Content Service switch)
- OpenFusion Load balancing service for Corba based applications and services from PrismTech
- Kemp technologies LoadMaster series (2460, 2860, etc.)
- Sun Fire B10n Content Load Balancing Blade switch (tunneling based)
- Procera MLXP Layer 5 switch
- OctaGate Smart Web switch
- Extreme Network layer 5 CA switch device
One-way architecture
TCP Handoff
- ScalaServer
- TCPHA
TCP Connection Hop
- Resonate's Central Dispatch
High Availability
How to detect that a member has failed?
• Pings, timeouts
• Heartbeat message exchange
– Status, cluster transition, and retransmission messages
– TCPHA includes state message exchange
– The accuracy of failure detection
– Timeouts with multiple retries detect failures accurately with high probability
How to recover from a failure?
• Load balancer failover
– State synchronization
• Subsystem failover
– IP takeover through channel bonding
• Application failover
– The Linux watchdog timer interface, etc.
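Failure detection by timeouts with multiple retries can be sketched as a monitor that declares a member failed only after several consecutive heartbeat intervals pass without a message. The class `HeartbeatMonitor`, the interval, and the retry count are assumptions chosen for illustration.

```python
import time

class HeartbeatMonitor:
    """Declare a member failed after max_missed silent heartbeat intervals."""

    def __init__(self, interval=1.0, max_missed=3, clock=time.monotonic):
        self.interval = interval
        self.max_missed = max_missed
        self.clock = clock
        self.last_seen = {}   # member -> time of last heartbeat

    def heartbeat(self, member):
        """Record a heartbeat message from a member."""
        self.last_seen[member] = self.clock()

    def missed(self, member):
        """Number of whole intervals elapsed since the last heartbeat."""
        elapsed = self.clock() - self.last_seen[member]
        return int(elapsed // self.interval)

    def failed_members(self):
        """Members that stayed silent for at least max_missed intervals."""
        return sorted(m for m in self.last_seen
                      if self.missed(m) >= self.max_missed)
```

Requiring several missed heartbeats trades detection latency for accuracy: a single lost or delayed message does not trigger a failover.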
High Availability
More on connection failover
• Through connection migration and reliable sockets
– Different from TCP handoff
Approaches include
• Migratory TCP
• Fault-tolerant TCP
• Connection passing
High Availability
The accuracy in distributed architectures
DNS: scalability through site redundancy
• DNS SRV RRs are used for service location
– Locating available SIP proxies
• The effectiveness of DNS-based scalability and failover is limited by how often DNS caches are updated
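How a client might order SRV targets can be sketched as follows: lower priority values are tried first, and targets with equal priority are drawn at random in proportion to their weights. The sketch works on already-resolved records and does no DNS lookup; the record values, host names, and the function `order_targets` are hypothetical.

```python
import random
from collections import namedtuple

SrvRecord = namedtuple("SrvRecord", "priority weight port target")

def order_targets(records, rng=random):
    """Order SRV records: ascending priority, weighted random within a priority."""
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            weights = [r.weight for r in group]
            if sum(weights) == 0:
                weights = [1] * len(group)   # all-zero weights: pick uniformly
            pick = rng.choices(group, weights=weights, k=1)[0]
            group.remove(pick)
            ordered.append(pick)
    return ordered

records = [
    SrvRecord(20, 0, 5060, "backup.example.com"),
    SrvRecord(10, 60, 5060, "sip1.example.com"),
    SrvRecord(10, 40, 5060, "sip2.example.com"),
]
ordered = order_targets(records, random.Random(1))
print([r.target for r in ordered])
```

The backup proxy is contacted only after both primary proxies have been tried. A client that keeps using a cached answer will not see changes to the records until the cache entry expires.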
High Availability
The accuracy in distributed architectures
RSerPool
[Figure: RSerPool architecture: server pools of pool elements (PE), a pool user (PU), an ENRP server, and a redundant ENRP server; the labeled interactions are registration, name resolution, and PE state update]
High availability
Other tips for distributed architectures
Multicast
• Needs explicit support from all routers on the client-server path
IP anycast route redundancy
• Different servers running the same service can all have the same anycast address on one of their interfaces
• If a server fails, the router updates its route to the nearest available node
– Depends on the router's update frequency
Conclusion and Future directions
Further work will address
• Kernel implementation of layer 5 switching to handle session-oriented data transfers
• Improvements to the forwarder kernel component
• Fair load distribution in session-oriented data transfers
• IPv6 compliance?
• Security concerns in connection failover
THANKS