Presentation on HA-Clustering results

An Empirical Examination of Current High-Availability Clustering Solutions’ Performance
Jeffrey Absher
DePaul University
Research Symposium Presentation
November 2003
See the actual paper for bibliographic and procedural information and for appropriate academic references.
HA and Related Technology
• Distributed OS
• Load Balancing
• Disaster Recovery
• Fault Tolerance
• HA clustering

HA’s defining traits
• SPOF (single point of failure) avoided by using redundancy
• Single image to the outside world using a single virtual IP address and hostname
• Automated fault management and recovery
• Multiple access paths from each cluster node to each resource group (set of HA services)
• Simple abstraction for applications and administrators
• Undisrupted (or minimally disrupted) services during failover

“If a computer breaks down, the functions performed by that computer will be handled by some other computer in the cluster.”
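
The quoted behaviour can be illustrated with a toy model. This is only a sketch and does not represent any of the tested products: the `Node` and `Cluster` classes, the node names, and the address `192.0.2.10` (a documentation-range IP) are all hypothetical. It shows a single virtual IP moving to a healthy node when its current owner fails.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical cluster member; 'healthy' stands in for a heartbeat result."""
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """Toy cluster that presents one virtual IP to the outside world."""
    virtual_ip: str
    nodes: list
    active: int = 0  # index of the node currently holding the virtual IP

    def owner(self) -> str:
        return self.nodes[self.active].name

    def check_and_failover(self) -> bool:
        """Move the virtual IP to the first healthy node if the owner is down.

        Returns True when a failover took place, False when none was needed.
        """
        if self.nodes[self.active].healthy:
            return False
        for i, node in enumerate(self.nodes):
            if node.healthy:
                self.active = i
                return True
        raise RuntimeError("no healthy node left: service is down")


cluster = Cluster("192.0.2.10", [Node("primary"), Node("secondary")])
cluster.nodes[0].healthy = False        # simulate a fault on the primary
failed_over = cluster.check_and_failover()
print(cluster.owner(), failed_over)     # prints "secondary True"
```

Clients keep using the same virtual IP throughout; only the node behind it changes. A real cluster also has to detect the fault and restart the service, and that time is what the trials measure.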
[Figure: A cluster and tester topology]

Event/Failure, and what it simulates

• Baseline
  Simulates: No events.
• Kill process on primary server
  Simulates: A simple fault that causes an abend of the HA process but does not take out the server.
• Kill process on primary server and hold the process down for 30 seconds
  Simulates: A core dump that takes a long time, or a more complex fault.
• Kill process on primary, hold down for 30 seconds, and fail to start on second node
  Simulates: A core dump or more complex fault, as well as a misconfiguration on the secondary server.
• Kill the cluster/watchdog process on the primary server
  Simulates: A bug in the cluster programming that causes an abend, or a mistaken shutdown of the cluster processes.
• Short power failure on primary node
  Simulates: A single-node power failure, technician error, a loose power cable, etc.
• Simultaneous power failure on both nodes, primary/secondary recovers first
  Simulates: A datacenter power failure with the two possible recovery orders.
• For AIX and Linux, loss of serial communication for 60 seconds. For Windows, the Virtual Shared Disk processes were killed and disabled for 60 seconds.
  Simulates: A loose serial cable or a technician error such as a cable disconnect, a port misconfiguration, or a mistaken command such as echo hello> /dev/tty0.
• Primary/secondary server public network loss for 60 seconds
  Simulates: A loose network cable or a technician error such as a cable disconnect, a card misconfiguration, or a mistaken command such as ifconfig en0 down.
• Public/private network down for 60 seconds
  Simulates: A power failure on the public hub or MAU, a network storm, or a technician’s error such as a VLAN misconfiguration.
• IP address clash on the public network for 60 seconds
  Simulates: A situation where another machine on the same VLAN is accidentally brought online with an incorrect IP address.
[Figure: AIX Trials. Y-axis: Uptime (1=100%); x-axis: Failure Type. Series: Cluster 200s as % of baseline; Cluster 200s + 206s as % of baseline; Nocluster 200s as % of baseline; Nocluster 200s + 206s as % of baseline.]
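
The chart legends express uptime as a fraction of a baseline run. The sketch below shows one way such a ratio could be computed. It assumes that "200s" and "206s" are counts of HTTP 200 (OK) and 206 (Partial Content) responses recorded by the tester; the slides do not say so explicitly. The function name `uptime_fraction` and all counts are hypothetical, not measured results.

```python
def uptime_fraction(trial_counts, baseline_counts, statuses=(200,)):
    """Successful responses in a trial as a fraction of the baseline run.

    trial_counts / baseline_counts map an HTTP status code to the number of
    responses with that code; 'statuses' selects which codes count as success.
    """
    ok_trial = sum(trial_counts.get(s, 0) for s in statuses)
    ok_base = sum(baseline_counts.get(s, 0) for s in statuses)
    if ok_base == 0:
        raise ValueError("baseline has no successful responses")
    return ok_trial / ok_base


# Hypothetical response counts, for illustration only.
baseline = {200: 900, 206: 100}
trial = {200: 720, 206: 30, 500: 50}

print(uptime_fraction(trial, baseline))                        # 0.8
print(uptime_fraction(trial, baseline, statuses=(200, 206)))   # 0.75
```

A value of 1 means the trial served as many successful responses as the fault-free baseline.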
[Figure: Win2K Trials. Y-axis: Uptime (1=100%); x-axis: Failure Type. Series: Cluster 200s as % of baseline; Cluster 200s + 206s as % of baseline; Nocluster 200s as % of baseline; Nocluster 200s + 206s as % of baseline.]
[Figure: RedHat Trials. Y-axis: Uptime (1=100%); x-axis: Failure Type. Series: Cluster 200s as % of baseline; Cluster 200s + 206s as % of baseline; Nocluster 200s as % of baseline; Nocluster 200s + 206s as % of baseline.]
Inter OS Comparison

                                             AIX             Win2K        Linux
Configuration                                most difficult  reasonable   simplest
Scripting required?                          some            none         much
Features                                     many            many         few
OS integration                               medium          high         low/none
Installation                                 Interdependent  Independent  Independent
Trials with HA resulting in a longer outage  4/14            2/14         3/14
Trials requiring manual intervention         0               1            1

Subjective Observations

• HA clustering is difficult to configure properly, and the available documentation is lacking.
  • Multiple machines must be configured simultaneously; packages and software must often be installed and configured in a specific order.
  • For what should be a loosely coupled system, there are many interdependencies.
  • Youn et al. suggest that the design of “administration of clusters…needs improvement.” I agree.
  • Vogels et al. state, “Users find it difficult to configure clusters with the desired management … properties. It is difficult to configure applications to be automatically launched in an appropriate order. Lacking solutions to these problems, clusters will remain awkward and time-consuming tools.” I agree.

Objective Conclusions Based on Empirical Evidence

• HA is not a perfect solution for every environment, and may be a bad solution for some, depending on the expected faults.
• High failover time for some systems contributes to lower-than-expected performance of HA systems when compared to non-HA systems.
  • Failover times need to be significantly smaller than the time required for a reboot, or even for a restart of a slow-to-start process.
  • Primary-node negotiation time at boot contributes to poor performance during power outages.
  • There were cases where clustering was shown to actually decrease the uptime of a service or site.
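
The failover-time conclusion comes down to a simple comparison, sketched below under stated assumptions: with HA the outage is roughly detection time plus failover time, and without HA it is roughly the time for the process or node to restart in place. The function names and every number are hypothetical and are not taken from the trials.

```python
def outage_with_ha(detect_s, failover_s):
    """Approximate outage (seconds) when the cluster detects a fault and fails over."""
    return detect_s + failover_s


def outage_without_ha(restart_s):
    """Approximate outage (seconds) when the service simply restarts in place."""
    return restart_s


def ha_helps(detect_s, failover_s, restart_s):
    """True only if failing over is quicker than waiting for the restart."""
    return outage_with_ha(detect_s, failover_s) < outage_without_ha(restart_s)


# Hypothetical timings, for illustration only.
# A quick process restart beats a slow failover:
print(ha_helps(detect_s=10, failover_s=45, restart_s=20))    # False
# A slow reboot loses to a fast failover:
print(ha_helps(detect_s=5, failover_s=10, restart_s=180))    # True
```

Under this simplified model, clustering shortens an outage only when detection plus failover is faster than the recovery the node would have achieved on its own.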
Q&A
