Planning for LCG emergencies

Planning for LCG Emergencies
HEPiX, Fall 2005
SLAC, 13 October 2005
David Kelsey
CCLRC/RAL, UK
[email protected]
LHC Tier 0/1/2 Network Architecture
[Figure: network diagram. Recoverable labels:]
• T0: CERN
• Other labelled centres: ASCC, Brookhaven, CNAF, Fermilab, GridKa, IN2P3, Nordic, PIC, RAL, SARA, TRIUMF
• T2: multiple sites, not individually named
• General Purpose IP Research Networks: NRENs, GEANT2, LHCNet, ESnet, Abilene, dedicated links, etc.
• Special Purpose Optical Private Network: GEANT2+NREN 10 Gbit circuits and LHCNet dedicated 10 Gbit links to US
13-Oct-05
David Kelsey, LCG Emergencies
2
Background
• Computing and networking are essential
– Tier 0 (CERN) and 12 Tier 1 critical for data taking
• 10 Gbps Optical Private link to each T1
– The T1s collectively keep a second copy of the raw data
– The T1s play a vital role in (re)processing and providing access to derived data
– During data taking, can cope with Tier 0 - Tier 1 link down for 12 hours to < few days. All T1s down – very bad!
– LCG MoU requires avg T1 uptime during data taking: 99%
• LCG TDR says
– “Special attention needs to be paid to the security aspects of the Tier-0, the Tier-1s and their network connections to maintain these essential services during or after an incident so as to reduce the effect on LHC data taking.”
• LCG also essential for analysis
• Need to keep the Grid running at all times
– Therefore must deal quickly with incidents
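To put the 99% average uptime figure in perspective, the minimal sketch below converts an availability target into a downtime budget. The 200-day data-taking period is an illustrative assumption, not a figure from the talk, and the function name is hypothetical.

```python
def downtime_budget_hours(availability: float, period_days: float) -> float:
    """Hours of downtime permitted over the period at the given average availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_days * 24.0


# Assumed 200-day data-taking period (illustrative only, not from the slides).
budget = downtime_budget_hours(0.99, 200)
print(f"Downtime budget: {budget:.0f} hours")
```

Under that assumed period the budget comes to roughly two days in total, so a single multi-day outage at a T1 could use up the whole allowance.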
Security Incident Response
• Joint (LCG/EGEE) Security Policy Group & EGEE Operational Security Coordination Team
– Based Security Incident Response Policy and procedures on work of Open Science Grid
• Agreement on Incident Response
– See https://edms.cern.ch/document/428035/
• Sites must
– Take local action to prevent disruption
– Report to local security officers
– Report to others via Grid Incident Response mail list
• “Volunteer” incident response team created when needed
Incident classification
• High: (team leader required)
– The incident could lead to exploitation of the trust fabric, i.e. user and host identities, or the incident could lead to instability of the overall Grid, or a denial-of-service is in progress against all replicas of a given Grid service.
• Medium: (team leader required if widespread)
– The incident affects an instance of a Grid service, but Grid stability is not at risk, or a denial-of-service affects one replica of a given Grid service, or a local attack compromised a privileged user account.
• Low: (team leader probably not required)
– A local attack compromised individual user, non-privileged credentials, or a denial-of-service attack or compromise affects only local grid resources.
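The three classes above can be read as a decision rule. The following is a minimal sketch of that rule, not part of the LCG/EGEE policy itself: the `Incident` fields and function names are hypothetical, chosen to mirror the criteria on the slide, and "probably not required" for Low is simplified to `False`.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class Incident:
    # Hypothetical field names; each mirrors one criterion from the slide.
    trust_fabric_at_risk: bool = False            # user or host identities exploitable
    grid_stability_at_risk: bool = False          # overall Grid could become unstable
    dos_all_replicas: bool = False                # DoS against every replica of a service
    service_instance_affected: bool = False       # one instance of a Grid service hit
    dos_one_replica: bool = False                 # DoS against a single replica
    privileged_account_compromised: bool = False  # local attack on a privileged account
    widespread: bool = False                      # incident spans many sites


def classify(incident: Incident) -> Severity:
    """Return the highest class whose criteria the incident meets."""
    if (incident.trust_fabric_at_risk
            or incident.grid_stability_at_risk
            or incident.dos_all_replicas):
        return Severity.HIGH
    if (incident.service_instance_affected
            or incident.dos_one_replica
            or incident.privileged_account_compromised):
        return Severity.MEDIUM
    # Everything else (non-privileged credentials, purely local impact) is Low.
    return Severity.LOW


def team_leader_required(incident: Incident) -> bool:
    """High: always; Medium: only if widespread; Low: simplified here to never."""
    severity = classify(incident)
    if severity is Severity.HIGH:
        return True
    if severity is Severity.MEDIUM:
        return incident.widespread
    return False


print(classify(Incident(dos_all_replicas=True)).value)
print(team_leader_required(Incident(dos_one_replica=True, widespread=True)))
```

Checking the High criteria first means an incident that matches several classes is always assigned the most severe one.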
Emergency procedures
• JSPG (Joint Security Policy Group) discussed this at its last meeting (Sep 2005)
• Started from the point of view of security incidents
– But quickly realised that other disasters are also likely, so should deal with these too
• Very early overview of the issues at this point
– Certainly no plan yet
– Invite feedback from HEPiX
• There must be lots of site-based plans
• JSPG will produce a draft emergency plan (and address policy issues)
– Grid Operations and OSCT (Operational Security Coordination Team) will need to define the details
JSPG discussion topics
• What is the scope?
– LCG vs EGEE?
– Critical: Tier 0/1, data taking, data integrity
• Inter-site information flow
– This is the critical point to be tackled
– Users, Sys Admins and Managers
• External information
– including interface(s) to the Press
• How do we keep the infrastructure operational?
– Is this the aim?
• What do we take down?
– And who decides?
• Can optical private networks remain up?
– And are they sufficient for LCG data taking?
• How do we deal with Tier 2 problems?
LCG/EGEE Emergency Procedures
Denise Heagerty
CERN
When are emergency procedures required?

• Emergency procedures are required to cover the following cases:
– Incident response plans cannot be followed: critical parts of the infrastructure are unavailable (e.g. mailing lists)
– Incident response plans are inappropriate: e.g. need to rapidly inform large parts of the community beyond the security contacts, or incident communication channels are compromised
• Examples
– Major power cut at Site A lasted several days
– Cable cut network access to Site B
– Major worm disrupted network access at Site C
– Security incident blocks user access to accounts at Site D
– Wide area exploit of the (homogeneous) security fabric
What is needed in an emergency?

• Out of band communication channels
– Alternative service providers (Internet, telephony)
– Alternative contact details (e-mail, chat, …)
– Alternative technology
• Clear decision-making roles
– There is no time for consensus during a crisis
– Usual decision-making process needs to be bypassed
• Clear information flow and roles
– For at least management, users, the press
– Reduce the risk of miscommunication
• Disaster Recovery Plan
– Definition of critical infrastructure to be kept running or repaired quickly
– Dependencies and sequence must be clear for restoring services
– Mailing lists (at CERN) are key to restoring communication
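The requirement that dependencies and sequence be clear when restoring services can be made concrete with a dependency graph and a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The service names and their dependencies are hypothetical examples and do not describe any actual LCG site.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service maps to the set of services
# that must already be running before it can be restored.
DEPENDS_ON = {
    "power": set(),
    "network": {"power"},
    "dns": {"network"},
    "mailing_lists": {"dns"},
    "web_status_page": {"dns"},
    "grid_services": {"dns", "mailing_lists"},
}


def restoration_order(depends_on):
    """Return one valid restoration sequence with prerequisites first.

    TopologicalSorter raises graphlib.CycleError if the dependencies are
    circular, which would itself be worth finding before an emergency.
    """
    return list(TopologicalSorter(depends_on).static_order())


order = restoration_order(DEPENDS_ON)
print(" -> ".join(order))
```

Several valid orderings may exist, for example when two services depend only on DNS. A written plan would still need to choose between them, for instance by putting communication services such as mailing lists ahead of others.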
Some ideas to stimulate discussion

• Define an emergency advisory committee?
– Members, mandate
– Goal is to ensure rapid and appropriate decisions
• Assure information flow
– E.g. update DNS servers to point to temporary (web) servers
– Pre-record messages on telephone help services
• Prepare alternative communication channels
– E.g. commercial conference call facilities
– Alternative Internet providers (e-mail addresses, chat, phone, …)
• When/do we return to normal Incident Response?
Final words
• LCG needs a written plan
• Clear definition of roles
• Operations staff need to know what to do
– Training
• The sites need to agree to policy and procedures
– Recognise the powers of operations staff
• Sites already have their own internal plans
– Now trying to extend to the Grid
• Feedback and advice is welcome!