Management of the LHCb DAQ Network
Guoming Liu*†, Niko Neufeld*
* CERN, Switzerland
† University of Ferrara, Italy
Outline
 Introduction to the LHCb DAQ system
 Network Monitoring based on SCADA system
 Network Configuration
 Network Debugging
 Status of LHCb network installation and deployment
LHCb online system
The LHCb Online system consists of three major components:
 Data Acquisition (DAQ)
  transfers the event data from the detector front-end electronics to the permanent storage
 Timing and Fast Control (TFC)
  drives all stages of the data readout of the LHCb detector between the front-end electronics and the online processing farm
 Experiment Control System (ECS)
  controls and monitors all parts of the experiment: the DAQ system, the TFC system, the High Level Trigger farm, the Detector Control System, the experiment's infrastructure, etc.
LHCb online system
[Figure: LHCb online system architecture. The detector front-end electronics (VELO, ST, OT, RICH, ECal, HCal, Muon) feed the readout boards, which send event data over the readout network to the HLT farm for event building and on to CASTOR; the TFC system distributes the LHC clock, the L0 trigger decisions and the MEP requests, and the ECS controls and monitors all components, including the monitoring farm. Legend: event data, timing and fast control signals, control and monitoring data.]
LHCb online network
 Two large-scale Ethernet networks:
  DAQ network: dedicated to data acquisition
  Control network: for the instruments and computers in the LHCb experiment
[Figure: LHCb online network layout, showing the control and DAQ switches of the subdetectors (RICH, L0Muon trigger, TFC, TELL1 ccpc, etc.), the data aggregation and HLT farm switches, the calibration farm, the monitoring, storage aggregation and storage switches, and the 10 Gb uplinks.]
In total:
 ~170 switches
 ~9000 ports
LHCb DAQ network
 DAQ works in a push mode
 Components:
  Readout boards: TELL1/UKL1, ~330 in total
  Aggregation switches
  Core DAQ switch: Force10 E1200i
  supports up to 1260 GbE ports
  switch capacity: 3.5 Tb/s
  Edge switches: 50
[Figure: DAQ network structure, from the ~330 readout boards through the aggregation switches and the core switch to the 50 edge switches and the HLT CPUs, with storage aggregation towards CASTOR.]
LHCb DAQ network
 Protocols
  Readout: MEP, a light-weight datagram protocol over IP
  Storage: standard TCP/IP
 Network throughputs (see the worked numbers below)
  Readout: ~35 GByte/s
  L0 trigger accept rate: 1 MHz
  Avg. event size: ~35 kByte
  Storage: ~70 MByte/s
  HLT accept rate: ~2 kHz
[Figure: DAQ data flow with the aggregate rates: ~280 Gb/s from the ~330 readout boards through the aggregation switches and the core switch into the 50 edge switches and the HLT CPUs, and ~560 Mb/s through the storage aggregation towards CASTOR.]
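The quoted bandwidths follow directly from the trigger rates and the average event size; a minimal back-of-the-envelope check in Python (not part of the original slides):

# Back-of-the-envelope check of the quoted DAQ throughputs.
l0_rate_hz   = 1.0e6   # L0 trigger accept rate: 1 MHz
hlt_rate_hz  = 2.0e3   # HLT accept rate: ~2 kHz
event_size_b = 35e3    # average event size: ~35 kByte

readout = l0_rate_hz * event_size_b    # bytes/s into the HLT farm
storage = hlt_rate_hz * event_size_b   # bytes/s towards storage

print("readout: %.0f GByte/s = %.0f Gb/s" % (readout / 1e9, readout * 8 / 1e9))
print("storage: %.0f MByte/s = %.0f Mb/s" % (storage / 1e6, storage * 8 / 1e6))
# -> readout: 35 GByte/s = 280 Gb/s, storage: 70 MByte/s = 560 Mb/s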
Network Monitoring
 Part of the LHCb ECS
  Uses the same tools and framework
  Provides the same operation interface
 Implementation
  Monitoring and integration: PVSS and JCOP
  Data collection: various front-end processors
  Data exchange: Distributed Information Management (DIM)
Network Monitoring
 Monitoring the status of the LHCb DAQ network at different levels:
  Topology
  IP routing
  Traffic
  Hardware/system
[Figure: Architecture of the Network Monitoring]
Network Monitoring
 Monitoring the status of the LHCb DAQ network at different levels:
  Topology
  IP routing
  Traffic
  Hardware/system
[Figure: Structure of the Finite State Machine for Network Monitoring]
Network Monitoring: Topology
 The topology is quite “static”
 NeDi: an open-source tool to discover the network
  Discovery of the network topology based on the Link Layer Discovery Protocol (LLDP): it queries the neighbors of the seed, then the neighbors of those neighbors, and so on, until all devices in the network have been discovered (see the sketch below)
  Discovery of the network nodes
 All information is stored in the database and can be queried by PVSS
 PVSS monitors only the topology (the uplinks between the switches); the nodes are monitored by Nagios.
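A minimal sketch of this neighbor-by-neighbor discovery, assuming SNMP v2c access to the switches via the net-snmp command-line tools, an installed LLDP MIB, and neighbor system names that resolve to reachable hosts; NeDi's actual implementation is considerably more complete:

import collections
import subprocess

def lldp_neighbors(switch, community="public"):
    # Walk LLDP-MIB::lldpRemSysName on one switch to get its neighbors
    # (SNMP v2c access and the LLDP MIB are assumptions).
    out = subprocess.check_output(
        ["snmpwalk", "-v2c", "-c", community, "-Oqv", switch,
         "LLDP-MIB::lldpRemSysName"], text=True)
    return {name.strip().strip('"') for name in out.splitlines() if name.strip()}

def discover(seed):
    # Breadth-first discovery: query the neighbors of the seed, then the
    # neighbors of those neighbors, until no new device is found.
    seen, queue, links = {seed}, collections.deque([seed]), []
    while queue:
        switch = queue.popleft()
        for neighbor in lldp_neighbors(switch):
            links.append((switch, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen, links

devices, uplinks = discover("sw-daq-01")   # seed switch name is a placeholder
print("%d devices, %d links" % (len(devices), len(uplinks)))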
Network Monitoring: IP routing
 Monitoring the status of the routing with the Internet Control Message Protocol (ICMP), specifically "ping"
 Three stages of the DAQ are covered:
  The entire readout path from the readout boards to the HLT farm. ICMP is not fully implemented in the readout boards, so a general-purpose computer is inserted to stand in for the readout board (see the sketch below):
  tests the status of the readout boards using "arping"
  tests the availability of the HLT nodes using "ping"
  Selected events from the HLT to the LHCb online storage
  From the online storage to CERN CASTOR
 A front-end script gets the results and exchanges the messages with PVSS using DIM
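A minimal sketch of such a reachability check, assuming the Linux iputils "arping" and "ping" commands and purely hypothetical board addresses, node names and interface name; in production the results would be published to PVSS through DIM rather than printed:

import subprocess

def arping_ok(ip, iface="eth0"):
    # Layer-2 check of a readout board with "arping"
    # (typically needs root; the interface name is an assumption).
    return subprocess.call(["arping", "-c", "1", "-I", iface, ip],
                           stdout=subprocess.DEVNULL) == 0

def ping_ok(host):
    # ICMP echo check of an HLT node.
    return subprocess.call(["ping", "-c", "1", "-W", "1", host],
                           stdout=subprocess.DEVNULL) == 0

# Hypothetical address/host lists.
readout_boards = ["10.130.1.1", "10.130.1.2"]
hlt_nodes = ["hlt-node-01", "hlt-node-02"]

unreachable = [ip for ip in readout_boards if not arping_ok(ip)]
unreachable += [h for h in hlt_nodes if not ping_ok(h)]
print("unreachable:", unreachable or "none")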
Network Monitoring: traffic
 Front-end processors:
  Collect all the interface counters from the network devices using SNMP (see the sketch below)
  Input and output traffic
  Input and output errors, discards
  Exchange the data as a DIM server
 PVSS:
  Receives the data via the PVSS-DIM bridge
  Analyzes the traffic and archives it
  Displays the current status and trending of the bandwidth utilization
  Issues alarms in case of errors
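A minimal sketch of the counter collection on a front-end processor, assuming SNMP v2c access via the net-snmp command-line tools; the device name and community string are placeholders, and the DIM publishing step is omitted:

import subprocess

COUNTERS = ["ifHCInOctets", "ifHCOutOctets",
            "ifInErrors", "ifOutErrors",
            "ifInDiscards", "ifOutDiscards"]

def walk_counters(switch, community="public"):
    # Collect the per-interface counters of one device with snmpwalk.
    values = {}
    for counter in COUNTERS:
        out = subprocess.check_output(
            ["snmpwalk", "-v2c", "-c", community, "-Oq", switch,
             "IF-MIB::%s" % counter], text=True)
        for line in out.splitlines():
            oid, value = line.split(None, 1)
            ifindex = int(oid.rsplit(".", 1)[1])
            values.setdefault(ifindex, {})[counter] = int(value)
    return values

# A front-end processor would publish these values as a DIM server;
# here interfaces reporting errors or discards are simply printed.
for ifindex, c in sorted(walk_counters("sw-daq-01").items()):
    if any(c[k] for k in COUNTERS[2:]):
        print("interface %d: %s" % (ifindex, c))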
Network Monitoring: traffic
[Figure: example traffic monitoring display]
Network Monitoring: hardware/system
 A syslog server is set up to receive and parse the syslog messages from the network devices. When a network device runs into problems, error messages are generated and sent to the syslog server, as configured in the network device (see the sketch below)
  Hardware: temperature, fan status, power supply status
  System: CPU, memory, login authentication, etc.
 Syslog can collect some information not covered by SNMP
 All the collected messages are communicated to PVSS
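A minimal sketch of such a receiver, assuming plain UDP syslog and a simple keyword filter for the hardware/system events listed above (port and keywords are illustrative); the real server forwards the parsed messages to PVSS:

import socketserver

KEYWORDS = ("temperature", "fan", "power", "cpu", "memory", "login")

class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For a UDP server, self.request is a (data, socket) pair.
        msg = self.request[0].decode("utf-8", errors="replace").strip()
        if any(k in msg.lower() for k in KEYWORDS):
            # In production this would be forwarded to PVSS instead of printed.
            print("ALERT from %s: %s" % (self.client_address[0], msg))

if __name__ == "__main__":
    # The standard syslog UDP port is 514 (needs privileges); 5514 used here.
    with socketserver.UDPServer(("0.0.0.0", 5514), SyslogHandler) as server:
        server.serve_forever()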
Network Configuration
 The LHCb online network system is quite large:
  Different devices with different operating systems and command sets
  But luckily quite static: only a few features are essential for configuring the network devices
 Currently a set of Python scripts is used for configuring the network devices, using the pexpect module for interactive CLI access (see the sketch below):
  Initial setup of newly installed switches
  Firmware upgrades
  Configuration file backup and restore
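A minimal sketch of such a script, here backing up a configuration over SSH with pexpect; the prompts, the "show running-config" command, the host name and the credentials are vendor-dependent assumptions, the SSH host key is assumed to be known, and output paging is assumed to be disabled:

import pexpect

def backup_config(host, user, password, outfile):
    # Prompts and CLI commands are vendor dependent.
    child = pexpect.spawn("ssh %s@%s" % (user, host), timeout=30)
    child.expect("[Pp]assword:")
    child.sendline(password)
    child.expect("#")                      # privileged prompt
    child.sendline("show running-config")
    child.expect("#")
    with open(outfile, "wb") as f:
        f.write(child.before)              # everything up to the next prompt
    child.sendline("exit")
    child.close()

backup_config("sw-daq-01", "admin", "secret", "sw-daq-01.cfg")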
Network Configuration
NeDi CLI access
 Web-based interface
 Possible to select a set of switches by type, IP, name, etc.
 Can apply a batch of commands to a set of switches
Network Diagnostics Tools
 sFlow Sampler
  sFlow is a mechanism to capture packet headers and collect statistics from the device, especially in high-speed networks
  Samples the packets on a switch port and displays the header information
  Very useful for debugging packet-loss problems, e.g. those caused by a wrong IP or MAC address
 Relatively high-speed traffic monitoring (see the sketch below)
  Queries the counters of selected interfaces using SNMP or the CLI with a finer time resolution
  Shows the utilization of the selected interfaces
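A minimal sketch of this finer-grained polling for a single interface, assuming SNMP v2c access via the net-snmp command-line tools; the host name, interface index and link speed are placeholders:

import subprocess
import time

def if_in_octets(host, ifindex, community="public"):
    # Read the 64-bit input byte counter of one interface.
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host,
         "IF-MIB::ifHCInOctets.%d" % ifindex], text=True)
    return int(out.split()[0])

def watch_utilization(host, ifindex, link_bps=1e9, interval=1.0):
    # Poll with a finer time resolution than the regular monitoring
    # and print the input utilization of the link.
    prev = if_in_octets(host, ifindex)
    while True:
        time.sleep(interval)
        cur = if_in_octets(host, ifindex)
        rate_bps = (cur - prev) * 8 / interval
        print("in: %8.1f Mb/s (%4.1f%% of link)"
              % (rate_bps / 1e6, 100.0 * rate_bps / link_bps))
        prev = cur

watch_utilization("sw-daq-01", ifindex=1)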
Status of Network Installation and Deployment
 Current setup:
  With 2 aggregation switches
  Only 2 linecards inserted in the core DAQ switch
  For an L0 trigger rate of ~200 kHz
 Upgrade for 1 MHz full-speed readout:
  Core DAQ switch: Force10 E1200i
  14 linecards with 1260 GbE ports will be ready at the end of June
  Upgrade from Terascale to Exascale: doubles the switch capacity, and all ports run at line rate
  All readout boards will be connected directly to the core DAQ switch