DE-KIT_GridKa_procedures-1.1 - Indico

GridKa – DE-KIT procedures
Bruno Hoeft
LHC-OPN Meeting
10. – 11. 03. 08
Bruno Hoeft,
Aurelie Reymund
LHC-OPN 2008, Madrid, 10-11th March.
LHC-OPN Hardware at DE-KIT (GridKa):
A fully redundant border router setup is in place (resilience)
Two Cisco Catalyst 6509 border routers
- 2 supervisor engines WS-SUP720-3B (IOS s72033_rp-IPSERVICESK9_WAN-VM, Version 12.2(33)SXF9)
- line cards WS-X6704-10GE, equipped with single mode transceivers XENPAK-10GB-SR
DFN: 2 Huawei DWDM systems
- one DWDM provides the light colour from DE-KIT (GridKa) to CERN and SARA (direction north from Karlsruhe)
- the second DWDM provides the light colour from DE-KIT (GridKa) to IN2P3 and CNAF (direction south from Karlsruhe)
The link to CERN leaves Karlsruhe northbound because the DANTE peering with DFN for the DFN/DANTE link DE-KIT (GridKa) – CERN is located in Frankfurt.
DE-KIT LHC-OPN links

R-inet-gis-I
Interface (Layer-2) | VLAN | IP (Layer-3)      | Link Name (DFN)      | Description
Te 7/2              | 10   | 192.16.166.34/30  | GE10/HUA0674_FRA_FZK | (Frankfurt/Dante -> Geneva) CERN (fra-gen_LHC_CERN-DFN_06006)
Te 1/1              | 751  | 192.16.166.105/30 | GE10/HUA0778_FZK_MUE | Muenster/Surfnet -> Amsterdam/SARA (DFN/Surfnet CBF)

R-inet-gis-II
Interface (Layer-2) | VLAN | IP (Layer-3)      | Link Name (DFN)      | Description
Te 3/2              | 752  | 192.16.166.109/30 | GE10/HUA1106_FZK_KEH | (Kehl) IN2P3 (DFN/RENATER CBF)
Te 2/2              | 750  | 192.16.166.101/30 | GE10/HUA0673_BAS_FZK | (Milano) Bologna INFN (CNAF) (DFN/Switch/GARR CBF)
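Each LHC-OPN link above is addressed out of a /30, which leaves exactly two usable host addresses. As a minimal sketch, assuming the addresses in the table are the DE-KIT side of each link (the slides do not list the remote side), the neighbour address can be derived with Python's standard `ipaddress` module. The `LINKS` dictionary and the `peer_address` helper are illustrative names, not part of the GridKa tooling.

```python
import ipaddress

# Local (DE-KIT side) link addresses as listed in the table above.
LINKS = {
    "CERN": "192.16.166.34/30",
    "SARA": "192.16.166.105/30",
    "IN2P3": "192.16.166.109/30",
    "CNAF": "192.16.166.101/30",
}

def peer_address(local: str) -> ipaddress.IPv4Address:
    """Return the other usable host address in the /30 that contains `local`."""
    iface = ipaddress.ip_interface(local)
    if iface.network.prefixlen != 30:
        raise ValueError(f"{local} is not a /30 link address")
    hosts = list(iface.network.hosts())  # the two usable addresses of the /30
    if iface.ip not in hosts:
        raise ValueError(f"{local} is a network or broadcast address")
    return hosts[0] if hosts[1] == iface.ip else hosts[1]

for site, addr in LINKS.items():
    print(f"{site}: local {addr}, expected peer {peer_address(addr)}")
```

Such a derived peer address is only an expectation; the configured neighbour should be confirmed against the router configuration.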
Operative service levels
Three service level entities:
- First level support is GGUS (5*8)
- General FZK network support (5*8, plus an automated incident broadcast (SMS) 24*7)
  - Telematis (an external company covers the on-call support for incident broadcasts outside working hours)
- Expert support (5*8, plus Experts on call)
The combination of the three operative service levels provides 24*7 LHC-OPN support. This will match the requirements specified by the LHC experiments in their CDR.
All operators will be granted fully transparent access to the DE-KIT (GridKa) wiki knowledge base, the DE-KIT (GridKa) log analyser facility and monitoring system, as well as the LHC-OPN monitoring systems, namely:
- DE-KIT (GridKa) local
  - DE-KIT (GridKa) general monitoring site [http://www.gridka.de/monitoring/main.html]: cacti, netflow, ganglia, nagios, log analyser
  - iepm [http://192.108.45.161/iepm-bw.fzk.de/LHC-ATLAS.slac_wan_bw_tests.html#node1.uchicago.edu]
- LHC-OPN central monitoring pages
  - BGP – ENOC monitoring page
  - Dante E2ECU monitoring page
- Several DE-KIT (GridKa) local information sites are restricted to local access only.
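How the three service levels combine into 24*7 coverage can be sketched as a simple lookup by time of day. The slides only state "5*8", so the office hours used here (Monday to Friday, 08:00 to 16:00) are an assumption for illustration, as are the function names.

```python
from datetime import datetime, time
from typing import List

# Assumed office hours; the slides only say "5*8" without giving times.
OFFICE_START = time(8, 0)
OFFICE_END = time(16, 0)

def in_office_hours(ts: datetime) -> bool:
    """True on Monday to Friday between OFFICE_START and OFFICE_END."""
    return ts.weekday() < 5 and OFFICE_START <= ts.time() < OFFICE_END

def reachable_support(ts: datetime) -> List[str]:
    """List the support entities that cover the given point in time."""
    if in_office_hours(ts):
        return ["GGUS first level", "FZK network support", "Expert support"]
    # Outside office hours the automated broadcast and on-call staff take over.
    return ["Automated SMS incident broadcast", "Telematis on-call", "Experts on call"]

print(reachable_support(datetime(2008, 3, 10, 10, 0)))  # a Monday morning
print(reachable_support(datetime(2008, 3, 9, 23, 30)))  # a Sunday night
```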
Incident origination:
- DE-KIT (GridKa) Monitoring (LogMonitoring/PortMonitoring)
  - an incident is triggered by the DE-KIT (GridKa) monitoring tools via automated email/SMS (e.g. router port up/down, flapping, BGP changes…), or by router operators
  - operation at DE-KIT (GridKa) will open a GGUS (or LCU) ticket
  - GGUS (or LCU) will control the ticket
  - the mainly involved tier-1 site (DE-KIT (GridKa)) will operate the ticket until it is solved or closed
  - appropriate partner(s) affected by the incident will be included in the ticket
- GGUS/LCU:
  - GGUS/LCU ticket initiated by a HEP user, distant NOC/Tier-0/1 or NREN
  - GGUS/LCU submits the ticket to the appropriate site (DE-KIT (GridKa))
  - the ticket will still be controlled by GGUS (/LCU), and DE-KIT (GridKa) will take over the operative part
- LIPCU (LCU)/E2ECU:
  - no difference to a GGUS/LCU ticket
- Information by a site:
  - request to open a GGUS/LCU ticket
  - however, appropriate actions will be taken immediately to solve the issue
- Maintenance/changes at DE-KIT (GridKa) / EGEE Broadcast:
  - a GGUS (and/or LCU) ticket will be opened and announced in GOC; this should inform all LHC-OPN sites via EGEE Broadcast as well as through GOC (for each EGEE broadcast there should be a corresponding ticket)
Incident and ticket handling
- the ticket of an incident is handled and controlled by either GGUS, LCU, or E2ECU
- operation of certain actions is transferred to the affected/corresponding location, such as a tier-1 centre (DE-KIT (GridKa)) or an NREN
- the management still resides with the ticket owner (GGUS, LCU/LIPCU, E2ECU)
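The split between controlling a ticket and operating it can be sketched as a small data structure. This is an illustrative model only, not the GGUS or LCU ticket schema; the class, field and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Entities that may own (control) a ticket according to the slides.
CONTROLLERS = {"GGUS", "LCU", "E2ECU"}

@dataclass
class Ticket:
    summary: str
    controller: str   # ticket owner; management always stays here
    operator: str     # site or NREN doing the operative work
    partners: List[str] = field(default_factory=list)
    status: str = "open"

    def __post_init__(self) -> None:
        if self.controller not in CONTROLLERS:
            raise ValueError(f"unknown ticket controller: {self.controller}")

    def hand_over(self, new_operator: str) -> None:
        """Transfer the operative part; the controller is left unchanged."""
        self.operator = new_operator

    def include_partner(self, partner: str) -> None:
        """Add a partner affected by the incident to the ticket."""
        if partner not in self.partners:
            self.partners.append(partner)

ticket = Ticket("router port flapping", controller="GGUS", operator="DE-KIT (GridKa)")
ticket.include_partner("CERN")
ticket.hand_over("DFN")
print(ticket)
```

Keeping `controller` fixed while `operator` changes mirrors the rule that management remains with the ticket owner.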
Operation of an Incident (1)
Layer-1 incident (an issue on layer 1 means that there is no light on the path)
- No light (Descr.: there is a light cut somewhere on the path)
Actions:
- check the router / transceiver / hardware / cable / logs
- evaluate the impact (backup path available)
- contact DFN and Di-Data as well as T0/T1
- send an EGEE broadcast if no backup path is available (depending on the estimated duration and impact) and escalate to Experts
- report the incident and its solution in the documentation
Involved groups:
- Internal:
GIS / NG (Network Group)
- External:
DFN, Di-Data, T0/T1 network responsible, NREN / Dante
- Monitoring, e.g.: http://stats.geant2.net/e2emon/mon/G2_E2E_index_PROD.html
- Local hardware failure (Descr.: a hardware element seems to be deficient on the local network)
Actions:
- check the router / transceiver / hardware / cable / logs
- evaluate the impact (backup path available)
- contact T0/T1
- send an EGEE broadcast if no backup path is available (depending on the estimated duration and impact) and escalate to Experts
- report the incident and its solution in the documentation
Involved groups:
- Internal:
GIS / NG
- External:
DFN, Di-Data, T0/T1 network responsible, NREN / Dante
- Remote hardware failure (Descr.: a hardware element seems to be deficient on the remote network)
Actions:
- check the router / transceiver / hardware / cable / logs
- evaluate the impact (backup path available)
- if nothing suspicious detected, contact T0/T1
- send an EGEE broadcast if no backup path is available (depending on the estimated duration and impact) and escalate to Experts
- report the incident and its solution in the documentation
Involved groups:
- Internal:
GIS / NG
- External:
DFN, Di-Data, T0/T1 network responsible, NREN / Dante
- Monitoring, e.g.: http://stats.geant2.net/e2emon/mon/G2_E2E_index_PROD.html
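The three layer-1 cases share the same sequence of actions and differ mainly in who is contacted. A minimal sketch of that runbook as data follows; the incident keys and the function name are hypothetical, and treating the escalation to Experts as conditional on the missing backup path is an assumed reading of the slide.

```python
from typing import List

# Contact step per layer-1 incident type, taken from the Actions lists above.
CONTACT_STEP = {
    "no_light": "contact DFN and Di-Data as well as T0/T1",
    "local_hardware_failure": "contact T0/T1",
    "remote_hardware_failure": "if nothing suspicious is detected, contact T0/T1",
}

def layer1_actions(kind: str, backup_path_available: bool) -> List[str]:
    """Return the ordered action list for a layer-1 incident."""
    actions = [
        "check the router / transceiver / hardware / cable / logs",
        "evaluate the impact (backup path available)",
        CONTACT_STEP[kind],
    ]
    if not backup_path_available:
        # Assumption: broadcast and escalation both depend on the missing backup path.
        actions.append("send an EGEE broadcast (depending on estimated duration and impact)")
        actions.append("escalate to Experts")
    actions.append("report the incident and its solution in the documentation")
    return actions

for step in layer1_actions("no_light", backup_path_available=False):
    print("-", step)
```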
Operation of an Incident (2)
Layer-2 incident (the light on the path is maintained, but there is no connectivity to the neighbour)
- No MAC (Descr.: missing MAC entry from the neighbour's network)
Actions:
- check router configuration
- evaluate the impact
- contact T0/T1
- send an EGEE broadcast if no backup path is available (depending on the estimated duration and impact) and escalate to Experts
- report the incident and its solution in the documentation
Involved groups:
- Internal:
GIS / NG
- External:
T0/T1 network responsible
Operation of an Incident (3)
Layer-3 incident (a routing issue on layer 3: the light on the path is maintained, but the neighbour is not reachable)
- Routing issue: no route to neighbour (Descr.: T1-center cannot reach the neighbour)
Actions:
- check router configuration / routing / acls
- evaluate the impact
- contact T0/T1
- send an EGEE broadcast if no backup path is available (depending on the estimated duration and impact) and escalate to Experts
- report the incident and its solution in the documentation
Involved groups:
- Internal:
GIS / NG
- External:
T0/T1 network responsible
- BGP issue: no announcement from neighbour (Descr.: the BGP table shows no routes received from the neighbour)
Actions:
- check router configuration / routing / acls
- evaluate the impact
- contact T0/T1
- send an EGEE broadcast if no backup path is available (depending on the estimated duration and impact) and escalate to Experts
- report the incident and its solution in the documentation
Involved groups:
- Internal:
GIS / NG
- External:
T0/T1 network responsible
- BGP issue: no routes advertised to neighbour (Descr.: local BGP does not advertise the network(s) correctly to the neighbour)
Actions:
- check router configuration / routing / acls
- evaluate the impact
- contact T0/T1
- send an EGEE broadcast if no backup path is available (depending on the estimated duration and impact) and escalate to Experts
- report the incident and its solution in the documentation
Involved groups:
- Internal:
GIS / NG
- External:
T0/T1 network responsible
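The incident classes on the three "Operation of an Incident" slides can be read as a bottom-up check from layer 1 to layer 3. The sketch below assumes the five symptoms are already available as booleans (how they are collected from the routers is outside its scope); `LinkState` and `classify` are illustrative names. It does not distinguish local from remote hardware failures, which need a manual check.

```python
from dataclasses import dataclass

@dataclass
class LinkState:
    light: bool                # optical signal present on the path
    neighbour_mac: bool        # MAC entry of the neighbour is present
    route_to_neighbour: bool   # neighbour is reachable on layer 3
    prefixes_received: bool    # BGP announcements received from the neighbour
    prefixes_advertised: bool  # local networks advertised to the neighbour

def classify(state: LinkState) -> str:
    """Return the lowest-layer incident class matching the observed symptoms."""
    if not state.light:
        return "layer-1: no light"
    if not state.neighbour_mac:
        return "layer-2: no MAC"
    if not state.route_to_neighbour:
        return "layer-3: routing issue, no route to neighbour"
    if not state.prefixes_received:
        return "layer-3: BGP issue, no announcement from neighbour"
    if not state.prefixes_advertised:
        return "layer-3: BGP issue, no routes advertised to neighbour"
    return "no incident detected"

print(classify(LinkState(True, True, True, False, True)))
```

Checking the lowest layer first avoids reporting a BGP issue when the underlying cause is a light cut.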
Maintenance window
- The light path and/or the connectivity / reachability can be affected (Descr.: T1-center plans maintenance on the network infrastructure)
Actions:
- send an EGEE broadcast
- contact T0/T1, NREN, Dante
Involved groups:
- Internal: GIS / NG / Security
- External: T0/T1 network responsible, NREN (DFN) / Dante
Configuration / Infrastructure change
- Configuration change (the light path and/or the connectivity / reachability can be affected; Descr.: T1-center makes a change to the network configuration)
Actions:
- send an EGEE broadcast
- contact T0/T1, NREN, Dante
Involved groups:
- Internal:
GIS / NG / Security
- External: T0/T1 network responsible, NREN (DFN) / Dante
- Infrastructure change (the light path and/or the connectivity / reachability can be affected; Descr.: T1-center plans a change in the network infrastructure/topology)
Actions:
- send an EGEE broadcast
- contact T0/T1, NREN, Dante
Involved groups:
- Internal:
GIS / NG / Security
- External: T0/T1 network responsible, NREN (DFN) / Dante
General remarks:
- all actions involving the LHC-OPN:
  - shall, as far as they can be planned, be announced 3 days in advance where possible (ticket, GOC, EGEE Broadcast)
  - changes of the infrastructure (e.g. routing / reorganisation of router ports) shall be discussed with the affected site, CERN and the coordination unit (LCU/LIPCU)
- The configuration of the DE-KIT (GridKa) installation will be documented, and all changes will be included in the documentation
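The 3-day advance notice for plannable actions amounts to simple date arithmetic. A minimal sketch, with hypothetical function names and assuming "3 days" means 72 hours before the start of the work:

```python
from datetime import datetime, timedelta

# Advance notice requested for plannable LHC-OPN actions (assumed to mean 72 hours).
ADVANCE_NOTICE = timedelta(days=3)

def latest_announcement(start: datetime) -> datetime:
    """Latest time at which a planned action should be announced."""
    return start - ADVANCE_NOTICE

def announced_in_time(announced: datetime, start: datetime) -> bool:
    """True if the announcement respects the advance notice."""
    return announced <= latest_announcement(start)

start = datetime(2008, 3, 20, 9, 0)
print(latest_announcement(start))
print(announced_in_time(datetime(2008, 3, 18, 9, 0), start))
```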