Enabling Grids for E-sciencE
Failover Procedures
A. Cavalli
A. Pagano
6th CIC on Duty meeting
Lyon 27-29/03/2006
Outline
• Failover idea
• Local failover concepts
• Points from last COD (Barcelona)
• Geo-failover proposal
• Conclusions
Failover idea
• A backup operation that automatically switches to a standby database, server or network if the primary system fails or is temporarily shut down for servicing.
• Failover is an important fault-tolerance function of mission-critical systems that rely on constant accessibility.
• Failover automatically and transparently redirects user requests from the failed or downed system to a backup system that mimics the operations of the primary.
Uptime…
How much availability must we guarantee?
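
The usual "nines" arithmetic makes the question concrete. A minimal Python sketch (generic figures, not from this talk) of the yearly downtime each availability target allows:

# Yearly downtime budget implied by a given availability target.
HOURS_PER_YEAR = 24 * 365
for target in (99.0, 99.9, 99.99, 99.999):
    downtime_h = (1 - target / 100) * HOURS_PER_YEAR
    print(f"{target:>7}% availability -> {downtime_h:8.3f} h/year of downtime")
# 99% allows ~87.6 h/year; "five nines" (99.999%) only ~5.3 min/year.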
Downtime causes
• Magic words are:
– Redundancy
– Remove SPOF (Single Point of Failure)
LB: forwards and balances requests from clients among a set of servers, behind a single IP address.
Server Array: a set of servers running the actual network services.
Shared Storage: makes it easy for the servers to hold the same contents and provide the same services.
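
As an illustration of the LB idea, a minimal Python sketch (the addresses are hypothetical, and a real deployment would use a kernel-level balancer such as LVS rather than application code):

import itertools
import socket

SERVER_ARRAY = [("10.0.0.11", 80), ("10.0.0.12", 80)]  # hypothetical real servers
_pool = itertools.cycle(SERVER_ARRAY)

def pick_backend():
    """Return the next healthy member of the server array, round-robin."""
    for _ in range(len(SERVER_ARRAY)):
        host, port = next(_pool)
        try:
            # a TCP connect as a cheap health check before dispatching
            socket.create_connection((host, port), timeout=1).close()
            return host, port
        except OSError:
            continue  # dead member, try the next one
    raise RuntimeError("no healthy backend in the server array")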
Example HA load balance
Split brain… “STONITH”: Shoot The Other Node In The Head
DRBD = mirroring of a whole block device via the network, like RAID-1 --> SAN/NAS (GPFS?)
HEARTBEAT = switches systems over when a machine goes down --> (only a ping check, but what about the services?)
MON = alert system that takes a decision (mail, reboot, …)
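
On the "only a ping check, but what about the services?" point, a minimal Python sketch of a MON-style service-level check with a mail action (the hostnames and addresses are assumptions, not the actual setup):

import smtplib
import socket
from email.message import EmailMessage

def service_alive(host, port, timeout=2):
    """TCP-level probe: stronger than a bare ping, it checks the service port."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

def alert(subject, body=""):
    """MON-style action: send a mail (a reboot/STONITH call could go here)."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "mon@backup.example.org"   # hypothetical addresses
    msg["To"] = "admins@example.org"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if not service_alive("sft.cern.ch", 80):
    alert("SFT service down", "host may still answer ping, the service does not")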
Points from COD-6
• SFT SERVER
1. Ask for the status of CNAF Oracle support: DONE
CNAF is active in the “3D Project”, Oracle replication tested, waiting for customers
(contact is: [email protected])
2. Register a new domain to do DNS-based geographical failover:
the idea came up with Piotr and Kostas at COD-6: see the PROPOSAL
3. Implement the mechanism to detect failures and switch to the backup:
see the PROPOSAL – TODO
4. SFT SERVER replica installation started, but based on old documentation and an old release:
needs to be brought in sync with the latest release in production
• SFT CLIENT
1. Client update: CVS (Piotr – SFT dev.)
2. Implement client failure detection and let the job know where to publish:
see the PROPOSAL – TODO
Points from COD-6 (2)
• CIC PORTAL
Web + DB (Oracle) + Lavoisier: we can do the same as for SFT
+ rule: if just one element fails, we switch everything to a backup site
• SFT ADMIN
Failover is achieved through inclusion in the CIC portal
• GGUS
− Local failover: 2 separate network accesses + heartbeat
− Geo failover: too expensive
• GSTAT
A 2nd GSTAT has been installed, but it is still being baby-sat with Min
• MAILING LISTS
Organize, centralize and replicate them (?): TODO
Proposal: Geographical Failover
• New DNS domain to remap services under new names, e.g.: sft.gon.org points to the IP of sft.cern.ch
• All the DNS servers at the participating sites are masters for the domain, which means that:
– there are no problems related to a master/slave setup (master down, stale slave records)…
– but updates must be done on each master (nsupdate; see the sketch after the next slide)
• Commercial solutions similar to this one exist, e.g. with distributed DNS servers and controllers, and check frequencies as short as every 10 seconds.
• What happens when sft.cern.ch goes down for any reason? … see the next slide …
[Diagram slide: the geo-failover sequence when sft.cern.ch goes down]
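
As for the update mechanics, pushing the same record change to every master could look like the following minimal Python sketch; only the nsupdate command syntax is standard BIND, while the nameserver names and the IP are hypothetical:

import subprocess

# hypothetical master nameservers, one per participating site
MASTERS = ["ns1.site-a.example.org", "ns1.site-b.example.org"]

def repoint(name, new_ip, ttl=60):
    """Push the same A-record change to every master (standard nsupdate syntax)."""
    batch = (f"update delete {name} A\n"
             f"update add {name} {ttl} A {new_ip}\n"
             "send\n")
    for master in MASTERS:
        subprocess.run(["nsupdate"], input=f"server {master}\n" + batch,
                       text=True, check=True)

# e.g. repoint("sft.gon.org", "192.0.2.10")  # hypothetical backup IP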
Geographical Failover (2)
• Replication: proper technologies have to be chosen for the replication/synchronization of the backup instances:
– Rely on the Oracle teams’ support where a DB backend is needed: a slave (read-only) replica must be able to become the master (read-write) when needed
– Web: static contents and dynamic procedures updated 1-2 times per day (rsync? scp with keys? AFS?)
• Control: should we attach action scripts to existing tools (nagios? mon?) or develop a completely new one (e.g. in Python)?
– The n-1 backup services monitor the master one: by resolving cic.gon.org, everyone knows who the master is
– When the n-1 controllers detect a master failure, the next host in the list from “dig gon.org” becomes master and updates all the nameservers (see the sketch below)
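
A minimal Python sketch of this control loop as run by one backup (all host and nameserver names are hypothetical assumptions, not the proposed deployment):

import socket
import subprocess
import time

MASTERS = ["ns1.site-a.example.org", "ns1.site-b.example.org"]     # hypothetical
CANDIDATES = ["cic.site-a.example.org", "cic.site-b.example.org"]  # agreed order
MY_NAME = "cic.site-b.example.org"

def healthy(host, port=80):
    """Service-level check, not just a ping."""
    try:
        socket.create_connection((host, port), timeout=3).close()
        return True
    except OSError:
        return False

def promote(new_ip, ttl=60):
    """Repoint cic.gon.org to this backup on every master nameserver."""
    batch = (f"update delete cic.gon.org A\n"
             f"update add cic.gon.org {ttl} A {new_ip}\n"
             "send\n")
    for master in MASTERS:
        subprocess.run(["nsupdate"], input=f"server {master}\n" + batch,
                       text=True, check=True)

while True:
    if not healthy(socket.gethostbyname("cic.gon.org")):
        # first live candidate in the agreed "dig gon.org" order takes over
        survivor = next((c for c in CANDIDATES if healthy(c)), None)
        if survivor == MY_NAME:
            promote(socket.gethostbyname(MY_NAME))
    time.sleep(30)  # check period; commercial tools go down to ~10 s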
Geographical Failover (3)
Issues to take into account
• Reverse resolution: do we care about it?
• SITE-1 is completely unreachable and SITE-2 has taken its place: we don’t want SITE-1’s DNS to be outdated when it comes back up, because a sort of “split-brain” would occur. Could frequent tries of nsupdate be a solution? (see the sketch below)
• Service masters (SFT, CIC, GSTAT…) shouldn’t be concentrated at one site, so CONTROL must monitor and switch on a per-service basis
• Can we rely on existing DNS servers/support, or do we need to install/manage new ones? At CNAF, nsupdate on the institute’s server is almost certainly forbidden
• DNS TTLs (and timeouts?) have to be short, to be able to switch without the bad effect of stale cached records (and slow responses?)
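
On the nsupdate-retry question, the idea could be as simple as this minimal Python sketch (an assumption, not part of the proposal):

import subprocess
import time

def push_until_acked(master, batch, retry_every=60):
    """Keep retrying an nsupdate batch until the recovering master accepts it."""
    while True:
        done = subprocess.run(["nsupdate"], text=True,
                              input=f"server {master}\n" + batch)
        if done.returncode == 0:
            return
        time.sleep(retry_every)  # master still unreachable: try again later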
Conclusions
• Local failover can be obtained with different approaches and technologies; we can encourage one another to implement it, but in the end it is up to the local admins
• Something like a DNS-based geographical failover is a complex but needed and interesting idea, to be studied and put in place. Thanks to Kostas and Piotr, who had the idea
• Can we start with a good example service and 2 sites for a pilot test?
• Comments, questions, ideas?