Transcript EGI-InSPIRE

EGI-InSPIRE
EGI 2nd level support training
Marian Babik, David Collados, Wojciech Lapka, Pedro Andrade, Paloma Fuente, Jacobo Tarragon (CERN)
Emir Imamagic (SRCE)
Christos Triantafyllidis (AUTH)
EGI-InSPIRE RI-261323
www.egi.eu
Introduction
• Aim
– provide a detailed technical overview of SAM
• improve understanding of how the system works
• help you solve the most common issues
– get feedback from 2nd level
• Approach:
– overview of architecture
– per component (3 slides)
• configuration, debugging
• what are the most common issues, how to resolve them
Introduction
• GGUS 2nd level
– 69 tickets
• GGUS 3rd level
– 249 tickets
Disclaimer
• many internal/development APIs will be shown
• they can change anytime and shouldn’t be considered public
• public API is documented at:
– https://tomtools.cern.ch/confluence/display/SAMDOC/Web+Services+Specification
Terminology
• service – endpoint (hostname, port)
• service flavour – service type (GOCDB)
• profile – set of tuples (flavour, metric, vo, fqan)
• status – discrete state (one of ok, critical, warning, unknown)
• availability – share of the time period for which status was ok; time in downtime counts against it (- downtime)
• reliability – like availability, but time in downtime is not counted against the service (+ downtime)
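The difference between the two figures can be sketched with a small calculation. This is only an illustration: it assumes the ratio-based definitions commonly used for such reports, and the exclusion of unknown time from the period is an assumption that is not stated on the slide. `PeriodSummary` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PeriodSummary:
    """Hours spent in each state over one period (hypothetical structure)."""
    ok: float
    downtime: float
    unknown: float
    total: float


def availability(p: PeriodSummary) -> float:
    # Assumption: unknown time is excluded from the period, downtime is not.
    known = p.total - p.unknown
    return p.ok / known if known > 0 else 0.0


def reliability(p: PeriodSummary) -> float:
    # Assumption: downtime is also excluded, so it is not held against the service.
    known = p.total - p.unknown - p.downtime
    return p.ok / known if known > 0 else 0.0


# One day with 18 hours ok, 4 hours of downtime and 2 hours critical.
day = PeriodSummary(ok=18.0, downtime=4.0, unknown=0.0, total=24.0)
print(f"availability = {availability(day):.2f}")  # 18 / 24
print(f"reliability  = {reliability(day):.2f}")   # 18 / 20
```

With these numbers the service is 75% available but 90% reliable, because the four hours of downtime are removed from the denominator in the second case.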
SAM Architecture
ATP - Configuration
• atp_synchro.conf : main configuration file
– debug level
– external data sources location (GOCDB, CIC, VOMS, etc)
– location of vo feed and roc configuration files
– synchronizer selector
• atp_db.conf : database connection configuration
• atp_logging_files.conf : location of log configuration file
• atp_logging_parameters_config.conf : log configuration
• roc.conf : list of enabled regions
• vo_feeds.conf : list of enabled vo feeds
• All configuration files are based on key-value pairs
• Default configuration structure distributed in ATP package
ATP - Debugging
• Log of last execution: /var/log/atp/atp.log
• Log of all executions: /var/log/atp/atp_full.log (with logrotate)
• Errors are also sent to system logging
• Six levels of debugging:
– CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET
– Default configuration is on INFO (20)
• Standard log file line:
– "2012-03-22 15:24:02,308 - ATP - INFO - CIC - Execution - Starting"
– CIC: synchronizer name (e.g. CIC, GOCDB Topology, VOFeeds, etc)
– Execution: task type (e.g. configuration, validation, execution)
– Starting: action description
• ATP_sync probe
• POEM/NCG calls (for all non-deleted VOs):
– localhost/atp/api/search/servicemap/json?vo=<vo>&ismonitored=on
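The standard log line described above can be split into its fields with a few lines of Python. This is a minimal sketch, assuming the fields are separated by " - " with plain hyphens; `AtpLogRecord` and `parse_atp_log_line` are hypothetical names, not part of ATP.

```python
from typing import NamedTuple, Optional


class AtpLogRecord(NamedTuple):
    timestamp: str
    logger: str
    level: str
    synchronizer: str
    task: str
    action: str


def parse_atp_log_line(line: str) -> Optional[AtpLogRecord]:
    # Split on " - " at most five times so the action text may itself contain " - ".
    parts = line.strip().split(" - ", 5)
    if len(parts) != 6:
        return None  # not a standard ATP log line
    return AtpLogRecord(*(p.strip() for p in parts))


sample = "2012-03-22 15:24:02,308 - ATP - INFO - CIC - Execution - Starting"
record = parse_atp_log_line(sample)
print(record.synchronizer, record.task, record.action)
```

Lines that do not have six fields (for example tracebacks) return `None`, so they can be skipped or printed as-is.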
ATP – Common Issues
• A line-by-line analysis of atp.log explains 99% of the problems with the ATP synchronizer
• ATP synchronizes data from several distinct external data sources. Sometimes ATP execution fails due to “invalid” or “not available” input data
– Check for the “Validation” tag in the log to see which data source was not reachable or was providing invalid data
• ATP is based on several PL/SQL procedures/functions
– If you detect ORA-* error codes, please assign the ticket to 3rd level
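The two checks above (look for "Validation" entries, escalate on ORA-* codes) can be sketched as a small triage helper. The log excerpt is invented for illustration and real messages will differ; `triage_atp_log` is a hypothetical name.

```python
import re

ORA_ERROR = re.compile(r"ORA-\d+")


def triage_atp_log(lines):
    """Return (validation_lines, ora_codes) found in an iterable of log lines."""
    validation, ora_codes = [], []
    for line in lines:
        if "Validation" in line:
            validation.append(line.rstrip("\n"))
        ora_codes.extend(ORA_ERROR.findall(line))
    return validation, ora_codes


# Hypothetical log excerpt, shaped like the standard ATP log line.
excerpt = [
    "2012-03-22 15:24:02,308 - ATP - INFO - CIC - Execution - Starting",
    "2012-03-22 15:24:05,112 - ATP - ERROR - GOCDB Topology - Validation - data source not reachable",
    "2012-03-22 15:24:09,740 - ATP - ERROR - VOFeeds - Execution - ORA-00001: unique constraint violated",
]
validation, ora_codes = triage_atp_log(excerpt)
print("validation problems:", len(validation))
print("assign to 3rd level:", bool(ora_codes), ora_codes)
```

Because the function accepts any iterable of lines, an open file handle on /var/log/atp/atp.log can be passed in directly.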
POEM sync
• /etc/poem/poem_sync.ini
– logging
– database details
– POEM_SYNC_NS_URLS – list of URLs from which to
synchronize (NGI defaults to grid-monitoring, VO defaults to
localhost)
– POEM_SYNC_NS_RESTRICT – space separated list of
namespace!profile which should be synchronized for given
namespace (ch.cern.sam!ROC ch.cern.sam
• reasonable defaults are provided
• debugging
– localhost/poem_sync/api/0.1/json/servicemetricinstances
– localhost/poem_sync/api/0.1/json/profiles
• Poem_sync probe (dumps log information)
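The POEM_SYNC_NS_RESTRICT format (space separated namespace!profile entries) can be illustrated with a short parser. This is a sketch of how such a value could be read, not how poem_sync itself does it; "ch.cern.sam!ROC" is taken from the slide, while the second entry and the function name are hypothetical.

```python
from collections import defaultdict


def parse_ns_restrict(value: str) -> dict:
    """Map each namespace to the set of profiles allowed for it."""
    allowed = defaultdict(set)
    for entry in value.split():
        namespace, sep, profile = entry.partition("!")
        if not sep or not namespace or not profile:
            raise ValueError(f"expected namespace!profile, got {entry!r}")
        allowed[namespace].add(profile)
    return dict(allowed)


# The second entry is a made-up example of a further namespace!profile pair.
restrict = parse_ns_restrict("ch.cern.sam!ROC org.example!MY_PROFILE")
print(restrict)
```

An entry without the "!" separator raises `ValueError`, which makes typos in the list visible early.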
POEM Web
• /etc/poem/poem.ini : main configuration file for poem web
– database details
– logging
– namespace
• poem web instance, list of defined profiles, metrics
– localhost/poem/api/0.1/json/profiles/
– localhost/poem/api/0.1/json/namespace/
• poem web (mod_wsgi), django admin
– DEBUG=True in /etc/poem/poem.ini
POEM known issues
• no history
– changes take effect immediately (critical profiles need to be
changed at beginning of a month – PROC10)
• metric configuration is not integrated with poem
– poem web doesn’t filter metrics in any way
– no guidance in terms of dependencies, internal metrics, etc.
• FQAN support
– if fqan is null this means results with any fqan will be accepted
– local profiles with custom fqans can overwrite results of the
central profiles
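The FQAN rule above (a null fqan accepts results with any fqan) can be written as a one-line predicate. This is a sketch only: exact string comparison for non-null FQANs is an assumption, and the FQAN values are hypothetical examples.

```python
from typing import Optional


def fqan_accepts(profile_fqan: Optional[str], result_fqan: Optional[str]) -> bool:
    # A null profile FQAN acts as a wildcard: any result FQAN is accepted.
    if profile_fqan is None:
        return True
    # Assumption: otherwise the two FQANs must match exactly.
    return profile_fqan == result_fqan


print(fqan_accepts(None, "/ops/Role=lcgadmin"))
print(fqan_accepts("/ops/Role=lcgadmin", "/ops/Role=lcgadmin"))
print(fqan_accepts("/ops/Role=lcgadmin", "/ops"))
```

The wildcard case is what lets results from a local profile with a custom fqan also satisfy a central profile whose fqan is null.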
NCG configuration
• /etc/ncg/ncg.conf
– basic structure
• /etc/ncg/ncg
• outputs to /etc/nagios/wlcg.d/
• log /var/log/ncg/ncg.log
NCG debugging
• review /var/log/ncg/ncg.log
• check metric configuration
– /etc/ncg-metric-config.conf
– /etc/ncg-metric-config.d
• probes
– NCGPidFile (freshness)
– ncg_sync
NCG known issues
voms2htpasswd
• Authorization for Nagios
• Configuration files:
– /etc/voms2htpasswd.conf Main configuration file
– /etc/voms2htpasswd-bans.conf Banned DNs
– /etc/voms2htpasswd-static.d/ Files containing list of DNs
• Sample entries for /etc/voms2htpasswd.conf:
– atps://grid-monitoring.cern.ch/atp/api/search/contactgroup/json?groupname=NGI_HU
– atps://grid-monitoring.cern.ch/atp/api/search/contactgroup/json?groupname=NGI_PL&role=Regional%20Manager
– atps://grid-monitoring.cern.ch/atp/api/search/contactsite/json?sitename=KR-KISTI-GSDC01
• Sample entries for /etc/voms2htpasswd-bans.conf and /etc/voms2htpasswd-static.d/
– /C=GR/O=HellasGrid/OU=auth.gr/CN=Christos Triantafyllidis
• Debugging:
– Check existence of entries in: /etc/httpd/httpd.users
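When reviewing entries in /etc/voms2htpasswd.conf it can help to break an atps:// URL into its parts. The sketch below uses only Python's standard urllib.parse; `describe_source` is a hypothetical helper, and the meaning given to each path segment is read off the sample entries rather than taken from voms2htpasswd itself.

```python
from urllib.parse import parse_qs, urlsplit


def describe_source(entry: str) -> dict:
    parts = urlsplit(entry)
    # parse_qs decodes percent-escapes such as %20 and returns lists of values.
    filters = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        # Second-to-last path segment, e.g. "contactgroup" or "contactsite".
        "lookup": parts.path.rstrip("/").split("/")[-2],
        "filters": filters,
    }


entry = ("atps://grid-monitoring.cern.ch/atp/api/search/contactgroup/json"
         "?groupname=NGI_PL&role=Regional%20Manager")
info = describe_source(entry)
print(info["lookup"], info["filters"])
```

This makes it easy to spot a mistyped host name or an unescaped space in a role name.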
Messaging config
• brokers:
– /var/cache/msg/broker-cache-file/broker-list
• msg-to-handler daemon:
– /etc/msg-to-handler.conf (/etc/msg-to-handler.d)
• Nagios probes:
– org.egee.SendToMsg – publishes config and metrics
– org.egee.RecvFromQueue – imports results
MRS configuration
• basic configuration
– mrs.conf is located at:
• /etc/mrs.d/mysql-mrs.conf (MySQL)
• /etc/mrs.d/oracle-mrs.conf (Oracle)
• send_to_db.ini is located at
– /etc/nagios/plugins/send_to_db.ini
• structure:
– [send_to_db]
– db_uri=mrs;host=localhost
– db_user=msuser
– db_pwd=mspass
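The send_to_db.ini structure above is plain INI, so it can be checked with Python's standard configparser. This is a sketch for verifying that the expected keys are present; it reads the sample from a string, and the list of required keys is simply the three shown on the slide.

```python
import configparser

SAMPLE = """\
[send_to_db]
db_uri=mrs;host=localhost
db_user=msuser
db_pwd=mspass
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)
section = parser["send_to_db"]

# Report any of the three keys from the slide that are absent or empty.
missing = [key for key in ("db_uri", "db_user", "db_pwd") if not section.get(key)]
print("missing keys:", missing)
print("db_uri:", section["db_uri"])
```

To check a real file, `parser.read("/etc/nagios/plugins/send_to_db.ini")` can be used in place of `read_string`.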
MRS debugging
• Latest entry in metricdata_spool; it shouldn’t be old (if it is too old, metrics may not be arriving from messaging):
– select uts_to_w3ctime(max(check_time)) from metricdata_spool; (ORACLE)
– select FROM_UNIXTIME(max(check_time)) from metricdata_spool; (MySQL)
• Latest entry in metricdata; it shouldn’t be old (if it is too old, metrics may not be arriving from metricdata_spool):
– select uts_to_w3ctime(max(check_time)) from metricdata; (ORACLE)
– select FROM_UNIXTIME(max(check_time)) from metricdata; (MySQL)
• Rejected metrics; see the reason to understand why a metric was rejected:
– select uts_to_w3ctime(m.check_time), uts_to_w3ctime(m.insert_time), m.* from metricdata_rejected m; (ORACLE)
– select FROM_UNIXTIME(m.check_time), FROM_UNIXTIME(m.insert_time), m.* from metricdata_rejected m; (MySQL)
• Nagios probes: SendToMetricStore, MrsDirSize, MrsCheckMissingProbes
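The freshness checks above boil down to comparing max(check_time) with the current time. The sketch below shows that comparison in Python, using an in-memory SQLite table as a stand-in for the real MySQL or Oracle database; the one-hour threshold and the function name are assumptions for illustration.

```python
import sqlite3
import time


def newest_age_seconds(conn, table: str, now: float) -> float:
    # The table name is interpolated, so only pass fixed, trusted names.
    (latest,) = conn.execute(f"select max(check_time) from {table}").fetchone()
    if latest is None:
        return float("inf")  # empty table: nothing has ever arrived
    return now - latest


# Stand-in database with one row that is ten minutes old.
conn = sqlite3.connect(":memory:")
conn.execute("create table metricdata_spool (check_time integer)")
now = time.time()
conn.execute("insert into metricdata_spool values (?)", (int(now) - 600,))

age = newest_age_seconds(conn, "metricdata_spool", now)
MAX_AGE = 3600  # hypothetical threshold, tune to the local check interval
print("stale" if age > MAX_AGE else "fresh")
```

Running the same check first on metricdata_spool and then on metricdata shows at which stage the flow of results stopped.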
MRS known issues
• no known issues 
• Basic contracts
– metric is marked as REMOVED if status is MISSING and service is
marked as deleted
– metric is marked as REMOVED if its tuple disappears from mrs
bootstrapper
– metric is marked as MISSING after 24 hours
• statuschange_service_profile table keeps data for 12
months
• metricdata table keeps data for 6 months
• metricdata_rejected table keeps data for 1 month
• metricdata_latest table contains metric results newer
than 7 days.
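The basic contracts above can be read as a small decision function. This is one possible reading, not MRS code: it assumes "after 24 hours" means 24 hours without a new result, the order of the checks is an assumption, and the "PRESENT" label is a hypothetical name for a metric that is neither MISSING nor REMOVED.

```python
from datetime import datetime, timedelta

MISSING_AFTER = timedelta(hours=24)


def metric_state(last_result: datetime, now: datetime,
                 service_deleted: bool, in_bootstrapper: bool) -> str:
    # The tuple is no longer listed by the MRS bootstrapper: REMOVED.
    if not in_bootstrapper:
        return "REMOVED"
    missing = now - last_result > MISSING_AFTER
    # A MISSING metric on a deleted service becomes REMOVED.
    if missing and service_deleted:
        return "REMOVED"
    return "MISSING" if missing else "PRESENT"


now = datetime(2012, 3, 22, 12, 0)
old = now - timedelta(hours=25)
print(metric_state(old, now, service_deleted=False, in_bootstrapper=True))
print(metric_state(old, now, service_deleted=True, in_bootstrapper=True))
```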
SAM reloading
• /etc/rc.d/init.d/sam-sync
• /var/log/sam-sync.log
• reloads SAM:
– suspends ATP, POEM
– ncg.reload.sh
– mrs bootstrapping
– resumes ATP, POEM
myEGI config and debug
• /etc/mywlcg/mywlcg.ini
– database connection
• /var/log/httpd/error.log
• based on django (mod_wsgi)
– you can get more explicit errors if you set DEBUG=True
• myegi tests
• myegi web service tests
ACE - Configuration
• ace.conf: main configuration file
– database configuration file path
– logging level and configuration file path
– computation_delay: how far behind the current time computations stop, in minutes. E.g.:
• Current time: 08.45
• Computation delay: 15 (minutes)
• When calculations are performed, the last period considered will end at 08.30
• ace_db.conf : database connection configuration
• atp_logging.conf : log path and logging configuration
• All configuration files are based on key-value pairs
• Default configuration structure distributed in ACE package
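The computation_delay example above is a simple subtraction, reproduced here in Python. The function name is hypothetical, and treating the delay as whole minutes follows the slide's example.

```python
from datetime import datetime, timedelta


def last_period_end(now: datetime, computation_delay_minutes: int) -> datetime:
    # The last period considered ends this many minutes before the current time.
    return now - timedelta(minutes=computation_delay_minutes)


now = datetime(2012, 3, 22, 8, 45)
print(last_period_end(now, 15).strftime("%H.%M"))  # prints 08.30
```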
ACE - Debugging
• Log of last execution: /var/log/ace/ace.log
– Used for both ace_status and ace_availability
• Five levels of logging:
– CRITICAL, ERROR, WARNING, INFO, DEBUG
– Default configuration is on ERROR (40)
• Logging of performed actions
– Status auto-summarization (missing status calculations in the past 24h)
– Regular status summarization (from last summarization to current time minus delay)
– Availability auto-summarization (missing availability calculations in the past 24h)
– Regular availability summarization (from last summarization to current time minus delay)
• Hourly, daily, weekly and monthly calculations for each hour, day, week and month within the period.
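The level names and numbers quoted for ACE (ERROR is 40) match Python's standard logging module. Assuming that is the mechanism in use, the sketch below shows which messages survive at a given threshold, which explains why ace.log is quiet by default. The logger names are hypothetical.

```python
import io
import logging


def emitted_levels(threshold: int) -> list:
    """Log one message per level and return the level names that got through."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{threshold}")
    logger.setLevel(threshold)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger.addHandler(handler)
    for level in (logging.DEBUG, logging.INFO, logging.WARNING,
                  logging.ERROR, logging.CRITICAL):
        logger.log(level, "message")
    logger.removeHandler(handler)
    return stream.getvalue().split()


print(logging.ERROR, emitted_levels(logging.ERROR))
print(logging.INFO, emitted_levels(logging.INFO))
```

At ERROR only ERROR and CRITICAL messages are written, so lowering the level to INFO or DEBUG is usually the first step when the log shows nothing useful.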
ACE – Common Issues
• Availability recomputation requests
– Must follow request policy:
• caused by problems in the monitoring infrastructure
• requested up to 10 days after the publication of the monthly report
– If coming from site admin, assign to regional operations staff
• policy for EGI sites and regions: https://wiki.egi.eu/wiki/PROC10
– If coming from regional operations staff, assign to 3rd level
• Apparently wrong values caused by external reasons
– topology issues
– MRS data
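The handling of recomputation requests described above can be sketched as a small routing function. This is an illustration of the stated rules only: it treats the 10-day limit as inclusive (an assumption), does not check the cause of the problem, and uses a made-up publication date. PROC10 remains the authoritative policy.

```python
from datetime import date, timedelta

REQUEST_WINDOW = timedelta(days=10)

# Routing taken from the slide: requester -> where the ticket goes next.
ROUTING = {
    "site admin": "regional operations staff",
    "regional operations staff": "3rd level",
}


def handle_request(requester: str, report_published: date, requested_on: date) -> str:
    # Requests later than 10 days after publication fall outside the policy.
    if requested_on > report_published + REQUEST_WINDOW:
        return "outside request window"
    return ROUTING.get(requester, "unknown requester")


published = date(2012, 4, 2)  # hypothetical publication date
print(handle_request("site admin", published, date(2012, 4, 10)))
print(handle_request("regional operations staff", published, date(2012, 4, 12)))
print(handle_request("site admin", published, date(2012, 4, 13)))
```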
Documentation
• https://tomtools.cern.ch/confluence/display/SAMDOC/Home
• https://tomtools.cern.ch/confluence/display/SAMDOC/FAQs
• https://tomtools.cern.ch/confluence/display/SAMDOC/Troubleshooting
• https://tomtools.cern.ch/confluence/display/SAMDOC/Released+Probes