gcx-LHCOPN-Ops


LHCOPN Operations: Yearly review
Guillaume.Cessieux @ cc.in2p3.fr
Network team, FR-CCIN2P3
LHCOPN meeting, Lyon, 2011-02-11
Tickets per month
AVG 19 tickets/month
231 tickets in 2010
280 tickets in 2009
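As a quick arithmetic check on the figures above, the monthly averages follow from dividing the yearly totals by 12. This is only a sketch: the variable names are illustrative, the totals are the ones quoted on the slide, and it assumes tickets were counted over 12 full months each year.

```python
# Yearly ticket totals quoted on the slide.
tickets_per_year = {2009: 280, 2010: 231}

# Average tickets per month, assuming 12 months of activity in each year.
monthly_average = {year: total / 12 for year, total in tickets_per_year.items()}

for year, avg in sorted(monthly_average.items()):
    print(f"{year}: about {avg:.1f} tickets/month")
```

The 2010 value rounds to the 19 tickets/month quoted on the slide.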
LHCOPN tickets’ ownership
34% of tickets
Breakdown per category and kind of problem
80% carrier issues
Service impact reported
At least 67% of events have no service impact
Top 10 link IDs involved in tickets
Top 10 link IDs involved in L2 incident tickets
Excellent redundancy and diversity
Situation when simultaneously losing the 12 previous links
Correlation Monitoring / Operations
77% of events have no related GGUS tickets
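A minimal sketch of how such a correlation could be computed, assuming monitoring events and tickets both carry a link ID and a timestamp. The record layout, the link IDs, the sample data and the two-hour matching window are all hypothetical; neither the monitoring system nor GGUS is queried here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    link_id: str      # hypothetical link identifier
    start: datetime   # when monitoring recorded the event


@dataclass(frozen=True)
class Ticket:
    link_id: str      # link the ticket refers to
    opened: datetime  # when the ticket was opened


def has_related_ticket(event, tickets, window=timedelta(hours=2)):
    """True if a ticket on the same link was opened within `window` of the event."""
    return any(
        t.link_id == event.link_id and abs(t.opened - event.start) <= window
        for t in tickets
    )


def share_without_ticket(events, tickets, window=timedelta(hours=2)):
    """Fraction of monitoring events that have no matching ticket."""
    if not events:
        return 0.0
    unmatched = sum(1 for e in events if not has_related_ticket(e, tickets, window))
    return unmatched / len(events)


# Made-up sample data, for illustration only.
events = [
    Event("LINK-A", datetime(2010, 3, 1, 10, 0)),
    Event("LINK-A", datetime(2010, 3, 5, 23, 0)),
    Event("LINK-B", datetime(2010, 3, 7, 4, 30)),
    Event("LINK-C", datetime(2010, 3, 9, 12, 0)),
]
tickets = [Ticket("LINK-A", datetime(2010, 3, 1, 10, 40))]

print(f"{share_without_ticket(events, tickets):.0%} of events have no related ticket")
```

Matching on link and time window is only one possible rule; a real correlation would depend on what the monitoring data and the tickets actually record.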
Focus on useful events
Now (diagram): service impacting events; non service impacting events (backup link down...); events reported and managed in the TTS; ~30%
Target (diagram): events reported and managed in the TTS; non service impacting events (backup link down...); ~70%
Ensuring, showing and tracking restoration of redundancy
Attendance to the quarterly LHCOPN Ops phoneconf
LHCOPN operations: Status

Good
– Process and tools implemented and agreed
– Clear improvement on documenting

Bad
– Hard to push for administrative issues (update doc etc.)
– Redundancy prevents us from regularly practising
– No evidence of faults giving feeling of unnecessary actions
– Can’t clearly assess work done by sites due to lack of monitoring
• Can’t correlate service impacting events vs managed events
• Can’t assess, can’t improve
– Backup tests (4 sites reported something in 2010)
– Change management DB (9 entries related to 3 sites)
– Interactions with WLCG
Role of LHCOPN helpdesk
• Centralise information
• Unify communications
Diagram: the T0 network team and three T1 network teams around the LHCOPN TTS (GGUS)
Problem for LHCOPN network support
Diagram: two sites, each with a storage team and a network team; experiments; LHCOPN TTS (GGUS); WLCG TTS (GGUS)
How/why did we end up here?

Two information repositories but two different goals
– Problem solving vs coordinating network teams
• Backup link down, informational/changes tickets, routing issue etc.
– We ended up with 230 tickets/year for this...
– Scheduled events vs users’ enquiries
– Lots of LHCOPN tickets not of interest for WLCG
• “No service impacting event, no Grid problem, no need for a WLCG ticket”
– Standard GGUS not tailored for network support
• Particularly multi-sites notification scheme?
– Clear weaknesses for user support
• But disturb everything for 4 enquiries/year?
– It was assumed we could link the LHCOPN TTS and the WLCG TTS

Who are users of the LHCOPN?
– It was said: only storage teams on sites
• Network teams did not want direct exposure in WLCG TTS
– Only accepting enquiries from local teams or remote network teams
• Only local storage team can state if there is a network problem or not
Why something specific for the LHCOPN? (1/2)

Not so specific processes, just a clear implementation of usual processes for a delimited and dedicated network

Can’t we handle generic IP issues in the same way?
– Same concepts sound applicable
• Project ↔ On site Grid related teams (storage...) ↔ local network team
– Generic IP issues or ... LHCONE issues?
• Careful scaling required: Point to point vs any to any; 12 sites vs 300
Why something specific for the LHCOPN? (2/2)

Two important points
1. Should sites’ network teams act directly in the WLCG TTS for generic issues, or should information be relayed by some other teams (Storage, Grid, support, etc.)?
• What kind of issue are we discussing? Expected link cut or complex performance issues?
– Previously agreed: Clear demarcation point for network teams = iperf test working
– We learnt that solving complex issues needs concurrent involvement from a LOT of supporters
• As it is for scheduled network downtimes: Only resource managers talk to projects
• Network teams did not want to duplicate actions for several projects
– Generic networks = generic processes not focused on WLCG
– Handle non dedicated networks as a generic resource like electricity?
2. Ownership of issues has to be very clear
• Who is in charge of a London – St Petersburg issue between two Tier 2s?
• Can this really be predetermined? Maybe enable transfer of responsibility if default assignment is not good
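The ownership point above can be sketched as a small data structure: one owner at any time, a default assignment, and an explicit transfer that leaves a trail. This is a hypothetical illustration only. The `Issue` class, its fields and the default rule (the source-end site owns the issue) are assumptions, not an agreed LHCOPN or WLCG process.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    source_site: str
    destination_site: str
    owner: str = ""
    history: list = field(default_factory=list)

    def __post_init__(self):
        # Hypothetical default rule: the source-end site owns the issue.
        if not self.owner:
            self.owner = self.source_site
        self.history.append(self.owner)

    def transfer(self, new_owner: str) -> None:
        """Hand the issue over, keeping exactly one owner and a record of past owners."""
        if new_owner == self.owner:
            raise ValueError(f"issue is already owned by {new_owner}")
        self.owner = new_owner
        self.history.append(new_owner)


# The London / St Petersburg example from the slide.
issue = Issue(source_site="London", destination_site="St Petersburg")
issue.transfer("St Petersburg")  # default assignment turned out not to be good
print(issue.owner, issue.history)
```

Whatever the default rule, the useful property is that ownership is never ambiguous and every hand-over is recorded.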
Summary on LHCOPN network support

Problems
– We are not doing network support
• No clear ownership or process for network issues appearing in WLCG TTS
– Generic IP vs LHCOPN
– Our helpdesk is particular, isolated and restricted

Possible solutions
1. Use only WLCG TTS
• Could it make us fully happy? Which changes are really required?
2. Make a clear and strong bridge between the two helpdesks
• Initially envisioned features like “Linking tickets” etc. not sufficient
• Need real cross helpdesk interactions
3. Make the two helpdesks transparent while keeping their specificities
• When something turns out to be an LHCOPN issue, transform the ticket into an LHCOPN ticket and allow a wide range of supporters to act on it
• Otherwise keep things as they currently are
Conclusion

Infrastructure quality hides Ops weaknesses

LHCOPN operations need improvements

Two key issues
1. Network monitoring
• Preventing improvement process
2. User support
• Communication issues between two worlds