LHCOPN: Operations report
Guillaume.Cessieux @ cc.in2p3.fr
Network team, FR-CCIN2P3
LHCOPN meeting, CERN, 2010-10-08
From last LHCOPN meeting, 2010-06-29, Barcelona

Conclusion on Operations
– Processes are followed unevenly across sites, because there is
no clear feeling of usefulness and little evidence of
network failures
– WLCG relationships are weak
– Monitoring and SLD required to really assess Operations

Items not solved
– LHCOPN representatives
• How to push effectively for some issues/administrative
tasks to be properly solved
– In plain words: put pressure on sites and escalate stalled issues
– Merging LHCOPN helpdesk with standard GGUS
Outline

Operation status
– TTS stats
– Long standing issues & Ops phoneconf report

Operational exchanges with WLCG
– Post mortem analysis of some issues
– Ease exchanges with WLCG

AOB
Number of tickets put in the LHCOPN TTS per month
AVG: 23 tickets/month
Kind of tickets per month
KPI-1: Infrastructure vs operations behavior
Ticket ownership during [2010-07-01, 2010-09-30]
Joy of terminating 6 LHCOPN links
Ownership of tickets per month per site
Conclusion from TTS stats

Workflow stable, but unclear whether this is good
– SLD and monitoring are missing to correlate and focus
on service-impacting events

Many L2 events (80%), well handled
– Often clear-cut, easy to spot

Not used to complex issues
– These often turn into a long story
• Packet loss, MTU...
Long standing issues

Only administrative!
– Validate prefix acceptance etc.
– Waiting for the GGUS feature “clone this ticket and
assign it to all impacted sitename” to follow this
on a per-site basis

Followed during the LHCOPN Ops
phoneconf, every 3 months
– Recurrent issue: hard to get administrative
issues solved
Issues highlighted by WLCG (1/4)

Painful to spot, and many are not related to the
LHCOPN at all
1. #GGUS-54473 transfer error from
PIC_DATADISK to SARA-MATRIX_DATADISK
– Child issues: #GGUS-54416, #GGUS-54474, #GGUS-54500
– “The two LHCOPN routers at CERN were connected
via a VLAN, and VLAN tagging adds 4 bytes to a
packet. The MTU between these routers has been
increased”
– Opened 2010-01-05 12:17, closed 2010-01-08 16:16
– No related LHCOPN tickets
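The arithmetic behind the quoted fix can be sketched as follows. This is a minimal illustration only, assuming standard Ethernet framing (14-byte header, 4-byte FCS, 4-byte IEEE 802.1Q tag) and a 1500-byte IP MTU picked purely as an example; the MTU values actually configured on the CERN routers are not given in the slides, and the function names are hypothetical.

```python
# Illustrative frame-size arithmetic for a VLAN-tagged link.
# Assumption: the 1500-byte IP MTU below is only an example value.
ETH_HEADER = 14  # destination MAC + source MAC + EtherType
ETH_FCS = 4      # frame check sequence
DOT1Q_TAG = 4    # IEEE 802.1Q VLAN tag, the "4 bytes" mentioned in the ticket


def frame_size(ip_mtu: int, tagged: bool) -> int:
    """Size on the wire of a frame carrying a full-size IP packet."""
    return ETH_HEADER + (DOT1Q_TAG if tagged else 0) + ip_mtu + ETH_FCS


def largest_ip_packet(max_frame: int, tagged: bool) -> int:
    """Largest IP packet that fits in a link accepting max_frame bytes."""
    return max_frame - ETH_HEADER - ETH_FCS - (DOT1Q_TAG if tagged else 0)


untagged = frame_size(1500, tagged=False)
tagged = frame_size(1500, tagged=True)
fits = largest_ip_packet(untagged, tagged=True)

print(f"untagged frame: {untagged} bytes")
print(f"tagged frame:   {tagged} bytes")
print(f"largest IP packet if the link limit stays at {untagged}: {fits} bytes")
```

If the link limit is not raised to absorb the tag, full-size packets no longer fit, which is consistent with the MTU increase described in the ticket.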
Issues highlighted by WLCG (2/4)
2. #LHCOPN-58197: Poor performance between CERN and ASGC
– Opened 2010-05-12, closed 2010-05-17
– Never updated, only opened/closed for the record
• Only a communication problem; the issue itself was handled
• Network staff changes at TW-ASGC, solved
• SIR filed: https://twiki.cern.ch/twiki/bin/view/LCG/SIRCernAsgcLinkMay2010
3. #GGUS-59791: Transfer problem from INFN-T1_DATADISK to PIC_DATADISK
– Child issue: #GGUS-59697 T0 export to INFN-T1_DATADISK failures:
No valid space tokens
– Opened 2010-07-07 00:06, closed 2010-07-14 18:05
– “Network issue of MTU black hole + route asymetry at CNAF/GARR”
– No LHCOPN tickets
Issues highlighted by WLCG (3/4)
4. #GGUS-61306: Functional test transfer errors to RAL-LCG2_DATADISK
– Related to
• #GGUS-61942 “NDGF-T1 transfer error from RAL-LCG2 and to
BNL-OSG2”
• #GGUS-61835 “Transfer errors from NDGF-T1_DATADISK to RAL-LCG2_DATADISK”
• #GGUS-62287 “Transfer errors at NDGF-T1_SCRATCHDISK”
– Opened 2010-08-19 17:41, closed 2010-09-17 15:09
– #LHCOPN-62228, opened/closed 2010-09-17
• Symbolic, for the record only; contains no information
– “The linecard terminating the RAL primary link on the
CERN router was replaced and the issue was definitely
solved”
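The open and close timestamps quoted for the three GGUS tickets can be turned into resolution times. The sketch below uses only the timestamps given in the slides; #LHCOPN-58197 is left out because only its dates, not its times, are quoted. The names `TICKETS` and `resolution_time` are hypothetical and not part of any GGUS tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Opened/closed timestamps exactly as quoted in the slides.
TICKETS = {
    "GGUS-54473": ("2010-01-05 12:17", "2010-01-08 16:16"),
    "GGUS-59791": ("2010-07-07 00:06", "2010-07-14 18:05"),
    "GGUS-61306": ("2010-08-19 17:41", "2010-09-17 15:09"),
}


def resolution_time(opened: str, closed: str):
    """Elapsed time between opening and closing a ticket."""
    return datetime.strptime(closed, FMT) - datetime.strptime(opened, FMT)


durations = {tid: resolution_time(o, c) for tid, (o, c) in TICKETS.items()}

for tid, delta in durations.items():
    print(f"{tid}: {delta}")
```

The resolution times range from roughly three days to roughly four weeks, which illustrates the "long story" pattern for MTU and packet-loss issues.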
Issues highlighted by WLCG (4/4)

4 LHCOPN issues this year
– Nothing particularly wrong
– Problem is mainly around communication

Main mistake is forgetting to create a ticket
in the LHCOPN helpdesk
– This was the agreed process

Not aware of any other LHCOPN-related
issue from WLCG
– But other network issues (LAN, generic IP...)
Separated LHCOPN helpdesk in GGUS, why? (1/3)

Key requirements (2008-03)
– Not doing user support, but coordinating network teams
– Match the operational model, particularly the responsibility and notification
scheme
– Network issue ≠ Grid issue; many non-service-impacting events have to
be registered
• Avoid disturbing or misleading people
– Network teams have no access to standard GGUS
• And did not want it
– Centralize everything related to LHCOPN Ops
– Clear desire to be isolated/protected
• “If we use standard GGUS this will be a mess”
• Real fear of enquiries about anything
• Did not want to be considered a catch-all networking support; we should
accept only selected LHCOPN-related enquiries coming through storage teams

So we ended up with the LHCOPN helpdesk
Separated LHCOPN helpdesk in GGUS, why? (2/3)
Now
– General workflow is agreed; discussion is on how to implement it
– A lot has evolved
• GGUS support scheme, experience in applying processes etc.
– Several problems/concerns experienced
• Problems cannot be solved independently by the network team?
– A lot of interaction with storage, system etc.
– Aren’t iperf tests or monitoring sufficient?
• We lack a clear bridge with WLCG Ops
– Hope was placed in the awaited parent/child relationship feature for GGUS tickets
– Cross-helpdesk access and exchanges required?
• Enquiries often already have a standard GGUS ticket
– “Why create an LHCOPN TT if there is already a GGUS one?”
» Competition between LHCOPN helpdesk and standard GGUS
– Tickets turn out to be network related only after some time and investigation
– LHCOPN tickets: overhead or true advantage?
» Notification, responsibility, tracking etc.
Separated LHCOPN helpdesk in GGUS, why? (3/3)

So create 12 related support units in the standard
GGUS?
• LHCOPN_CA-TRIUMF etc.
– Will this lead to smooth interactions with everybody?
– Can we keep our set of particular features and be smartly
integrated into the current GGUS workflow?
• Particular view, non-service-impacting events hidden, categories, tickets for
maintenance, notification and assignment scheme?
• Transparent for us? Can a standard ticket be turned into an LHCOPN one?
– Aren’t we doing more than user support?
AOB (1/3)

Routing policies
– To be documented accurately through a routing matrix
– https://twiki.cern.ch/twiki/bin/view/LHCOPN/RoutingPolicies

Escalation process
– Existing, but never used
– https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#Escalated_incident_management_pr
– Give this privilege to WLCG people on LHCOPN tickets?

Scheme of responsibilities to be improved?
– Set on a per-link basis, so who is responsible for an IT-INFNCNAF ↔ US-T1-BNL issue?
• Can this really happen without problems between IT-INFNCNAF ↔ CERN or US-T1-BNL ↔ CERN?
AOB (2/3)

Issues/requests related to MDM
– Must be visible, tracked and centralised like any other
LHCOPN issue
• Must be in the LHCOPN TTS
– Maybe new problem categories etc. to support this
– How far? Track software bugs or only site implementations?
• DANTE/GN3 could have a login/password for GGUS if no certificate
– Any concerns about this?
– Is documentation about MDM boxes available?
• Should be on the LHCOPN twiki, even if very brief
– Are the list and IP addresses of the boxes enough?
• Hard to solve problems knowing only the local boxes
• DANTE/GN3 should have R/W access to the LHCOPN twiki
AOB (3/3)

Too many off-the-record e-mail exchanges
about LHCOPN issues
– MUST be in the LHCOPN TTS
• Visible, followed, timestamped etc.
• Tickets in the LHCOPN TTS have a clear scheme of
responsibilities… unlike an e-mail sleeping in an inbox
– If no LHCOPN ticket, no LHCOPN issue
Conclusion

Awaiting monitoring to revitalise Ops
– And SLD to really know what matters

Main weakness of LHCOPN Ops: relationship with
WLCG
– GGUS merging: To be investigated/discussed further
• Why not if this solves issues

Be careful with the scope of our model
– LHCOPN only
– Key reason for it being so specific?
• But be careful before changing something that works
• Also wait for EGI networking support and Tier-2 networking to
converge