Management of the LHCb DAQ Network
Guoming Liu*†, Niko Neufeld*
* CERN, Switzerland
† University of Ferrara, Italy
Outline
Introduction to LHCb DAQ system
Network Monitoring based on SCADA system
Network Configuration
Network Debugging
Status of LHCb network installation and deployment
LHCb online system
The LHCb online system consists of three major components:
Data Acquisition (DAQ): transfers the event data from the detector front-end electronics to the permanent storage
Timing and Fast Control (TFC): drives all stages of the data readout of the LHCb detector between the front-end electronics and the online processing farm
Experiment Control System (ECS): controls and monitors all parts of the experiment: the DAQ system, the TFC system, the High Level Trigger farm, the Detector Control System, the experiment's infrastructure, etc.
Figure: overview of the LHCb online system. The detector front-end electronics (VELO, ST, OT, RICH, ECal, HCal, Muon) feed the readout boards; the readout network performs the event building into the HLT and monitoring farms; the TFC system distributes the LHC clock and L0 trigger decisions; the ECS controls and monitors all components; accepted data are sent to CASTOR. (Legend: event data, timing and fast control signals, control and monitoring data.)
LHCb online network
Two large-scale Ethernet networks:
DAQ network: dedicated to data acquisition
Control network: for the instruments and computers in the LHCb experiment
Figure: layout of the DAQ and control networks, showing the individual switches for data aggregation, the HLT farm, the calibration farm, storage and monitoring. In total: ~170 switches, ~9000 ports.
LHCb DAQ network
DAQ works in a push mode
Components:
Readout boards: TELL1/UKL1, ~330 in total
Aggregation switches
Core DAQ switch: Force10 E1200i, supporting up to 1260 GbE ports with a switch capacity of 3.5 Tb/s
Edge switches: 50 in total, connecting the HLT CPUs
Figure: DAQ data flow from the ~330 readout boards through the aggregation switches and the core switch to the 50 edge switches, the HLT CPUs, the storage aggregation and CASTOR.
LHCb DAQ network
Protocols:
Readout: MEP, a light-weight datagram protocol over IP
Storage: standard TCP/IP
Network throughput:
Readout: ~35 GByte/s (~280 Gb/s)
L0 trigger accept rate: 1 MHz
Avg. event size: ~35 kByte
Storage: ~70 MByte/s (~560 Mb/s)
HLT accept rate: ~2 kHz
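These figures are consistent with each other; a quick back-of-envelope check in Python (all numbers taken from this slide):

```python
# Back-of-envelope check of the quoted throughputs (all figures from this slide).
l0_accept_rate_hz  = 1.0e6   # L0 trigger accept rate: 1 MHz
avg_event_bytes    = 35e3    # average event size: ~35 kByte
hlt_accept_rate_hz = 2.0e3   # HLT accept rate: ~2 kHz

readout_gbytes = l0_accept_rate_hz * avg_event_bytes / 1e9   # into the HLT farm
storage_mbytes = hlt_accept_rate_hz * avg_event_bytes / 1e6  # towards storage/CASTOR

print(f"Readout: ~{readout_gbytes:.0f} GByte/s (~{readout_gbytes * 8:.0f} Gb/s)")
print(f"Storage: ~{storage_mbytes:.0f} MByte/s (~{storage_mbytes * 8:.0f} Mb/s)")
# Readout: ~35 GByte/s (~280 Gb/s), Storage: ~70 MByte/s (~560 Mb/s)
```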
Network Monitoring
Part of the LHCb ECS:
Uses the same tools and framework
Provides the same operation interface
Implementation:
Monitoring and integration: PVSS and JCOP
Data collection: various front-end processors
Data exchange: Distributed Information Management (DIM)
Network Monitoring
Monitoring the status of the LHCb DAQ network at different levels:
Topology
IP routing
Traffic
Hardware/system
Figure: architecture of the network monitoring
Figure: structure of the Finite State Machine for network monitoring
Network Monitoring: Topology
The topology is quite “static”
NeDi: an open-source tool used to discover the network
Discovery of the network topology, based on the Link Layer Discovery Protocol (LLDP): it queries the neighbors of the seed device, then the neighbors of those neighbors, and so on until all devices in the network have been discovered (sketched below)
Discovery of the network nodes
All information is stored in the database and can be queried by PVSS
PVSS monitors the topology only (the uplinks between the switches); the nodes are monitored by Nagios
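The neighbor-walking discovery can be pictured as a simple breadth-first search. The sketch below is not NeDi's actual implementation: lldp_neighbors() is a hypothetical stand-in for the SNMP query of a device's LLDP neighbor table, backed here by a toy topology so the example runs.

```python
from collections import deque

# Toy topology, used only so the sketch is executable; a real implementation
# would query each device's LLDP neighbor table over SNMP instead.
EXAMPLE_LINKS = {
    "sw-core":    ["sw-agg-01", "sw-agg-02"],
    "sw-agg-01":  ["sw-core", "sw-edge-01"],
    "sw-agg-02":  ["sw-core"],
    "sw-edge-01": ["sw-agg-01"],
}

def lldp_neighbors(device: str) -> list:
    """Hypothetical stand-in for the SNMP query of the device's LLDP neighbors."""
    return EXAMPLE_LINKS.get(device, [])

def discover(seed: str) -> set:
    """Breadth-first walk: start from the seed switch, then visit the neighbors
    of every newly found device until no unknown devices remain."""
    known, queue = {seed}, deque([seed])
    while queue:
        for neighbor in lldp_neighbors(queue.popleft()):
            if neighbor not in known:
                known.add(neighbor)
                queue.append(neighbor)
    return known

print(discover("sw-core"))   # finds all four example switches
```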
Network Monitoring: IP routing
Monitoring the status of the routing with the Internet Control Message Protocol (ICMP), specifically “ping”
Three stages for the DAQ:
The full event readout from the readout boards to the HLT farm; since ICMP is not fully implemented in the readout board, a general-purpose computer is inserted to stand in for it:
Test the status of the readout boards using “arping”
Test the availability of the HLT nodes using “ping”
Selected events from the HLT to the LHCb online storage
From the online storage to CERN CASTOR
The front-end script gets the results and exchanges messages with PVSS using DIM (a minimal sketch follows)
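A minimal sketch of such a front-end check, assuming Linux-style ping and arping command-line tools; the DIM publishing step is only indicated by a comment, and check_daq_path() is an illustrative helper, not the actual production script.

```python
import subprocess

def ping_ok(host: str) -> bool:
    """One ICMP echo request (1 s timeout); True if the host answered."""
    cmd = ["ping", "-c", "1", "-W", "1", host]
    return subprocess.run(cmd, stdout=subprocess.DEVNULL).returncode == 0

def arping_ok(host: str, iface: str = "eth0") -> bool:
    """One ARP request, for devices such as readout boards without full ICMP."""
    cmd = ["arping", "-c", "1", "-I", iface, host]
    return subprocess.run(cmd, stdout=subprocess.DEVNULL).returncode == 0

def check_daq_path(readout_boards, hlt_nodes):
    """Illustrative helper: collect one status flag per device.
    A real front-end script would publish this dictionary to PVSS via DIM."""
    status = {rb: arping_ok(rb) for rb in readout_boards}
    status.update({node: ping_ok(node) for node in hlt_nodes})
    return status
```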
Network Monitoring: traffic
Front-end processors:
Collect all the interface counters from the network devices using SNMP (see the sketch below):
Input and output traffic
Input and output errors, discards
Exchange the data as a DIM server
PVSS:
Receives the data via the PVSS-DIM bridge
Analyzes the traffic and archives it
Displays the current status and trending of the bandwidth utilization
Issues an alarm in case of errors
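A sketch of the counter collection, assuming the net-snmp snmpget command-line tool and the standard IF-MIB counters; hostnames, interface indices and the community string are placeholders, and the DIM server side is only indicated by a comment.

```python
import subprocess, time

# 64-bit counters from the standard IF-MIB (traffic, errors, discards).
OIDS = {
    "in_octets":   "IF-MIB::ifHCInOctets",
    "out_octets":  "IF-MIB::ifHCOutOctets",
    "in_errors":   "IF-MIB::ifInErrors",
    "in_discards": "IF-MIB::ifInDiscards",
}

def snmp_get(host: str, oid: str, ifindex: int, community: str = "public") -> int:
    """Read one counter with the net-snmp CLI; -Ovq prints just the value."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", community, "-Ovq", host, f"{oid}.{ifindex}"])
    return int(out.decode().strip())

def poll_input_rate(host: str, ifindex: int, interval: float = 10.0):
    """Yield the input rate in bytes/s once per interval.
    A real front-end processor would publish these values as a DIM server."""
    last = snmp_get(host, OIDS["in_octets"], ifindex)
    while True:
        time.sleep(interval)
        now = snmp_get(host, OIDS["in_octets"], ifindex)
        yield (now - last) / interval
        last = now
```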
Network Monitoring: hardware/system
A syslog server is set up to receive and parse the syslog messages from the network devices
When a network device runs into problems, error messages are generated and sent to the syslog server, as configured on the device:
Hardware: temperature, fan status, power supply status
System: CPU, memory, login authentication, etc.
Syslog can collect some information not covered by SNMP
All collected messages are communicated to PVSS (a minimal receiver sketch follows)
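A minimal sketch of the receiving side, assuming plain UDP syslog (RFC 3164 framing) on port 514; in practice a full syslog daemon with a parsing script would be used, and forwarding to PVSS via DIM is only indicated by a comment.

```python
import re, socket

# RFC 3164 framing: "<PRI>message", where severity = PRI % 8.
PRI_RE = re.compile(r"^<(\d+)>(.*)$", re.S)

def serve(bind_addr: str = "0.0.0.0", port: int = 514):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))            # binding to port 514 needs privileges
    while True:
        data, (src, _) = sock.recvfrom(4096)
        m = PRI_RE.match(data.decode(errors="replace"))
        if not m:
            continue
        severity, text = int(m.group(1)) % 8, m.group(2).strip()
        if severity <= 3:                   # emergency / alert / critical / error
            # a real front-end would forward this message to PVSS via DIM
            print(f"[{src}] severity={severity}: {text}")
```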
Network Configuration
The LHCb online network system is quite large, with different devices running different operating systems and command sets
Luckily it is also quite static: only a few features are essential for configuring the network devices
Currently a set of Python scripts is used to configure the network devices, using the pexpect module for interactive CLI access (a sketch follows the list):
Initial setup of newly installed switches
Firmware upgrades
Configuration file backup and restore
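One of these tasks, configuration backup, could look roughly as follows with pexpect; the SSH login flow, prompt patterns and the copy command are assumptions modeled on an FTOS-like CLI and would need adapting per device type.

```python
import pexpect

def backup_config(switch: str, user: str, password: str, tftp_server: str):
    """Log in over SSH and copy the running configuration to a TFTP server.
    Prompt patterns and the 'copy' syntax are FTOS-like assumptions."""
    child = pexpect.spawn(f"ssh {user}@{switch}", timeout=60)
    child.expect("password:")
    child.sendline(password)
    child.expect("#")                       # privileged prompt
    child.sendline(f"copy running-config tftp://{tftp_server}/{switch}.cfg")
    child.expect("#")                       # wait until the copy has finished
    child.sendline("exit")
    child.close()
```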
Network Configuration
NeDi CLI access:
Web-based interface
Possible to select a set of switches by type, IP, name, etc.
Can apply a batch of commands to a set of switches
Network Diagnostics Tools
sFlow sampler:
sFlow is a mechanism to capture packet headers and collect statistics from the device, especially in high-speed networks
Samples packets on a switch port and displays the header information
Very useful for debugging packet-loss problems, e.g. those caused by a wrong IP or MAC address
Relatively high-speed traffic monitoring:
Queries the counters for selected interfaces using SNMP or the CLI with a finer time resolution (see the sketch below)
Shows the utilization of the selected interfaces
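A sketch of the finer-grained polling, reading the same IF-MIB counters as in the traffic-monitoring sketch but at about one-second resolution and converting them to a fraction of the (assumed GbE) link speed; snmp_get() is repeated here so the example is self-contained.

```python
import subprocess, time

def snmp_get(host, oid, ifindex, community="public"):
    """Read one counter value with the net-snmp CLI."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", community, "-Ovq", host, f"{oid}.{ifindex}"])
    return int(out.decode().strip())

def watch_utilization(host, ifindexes, link_bps=1e9, interval=1.0):
    """Print per-port input utilization at ~1 s resolution (link_bps assumes GbE)."""
    last = {i: snmp_get(host, "IF-MIB::ifHCInOctets", i) for i in ifindexes}
    while True:
        time.sleep(interval)
        for i in ifindexes:
            now = snmp_get(host, "IF-MIB::ifHCInOctets", i)
            print(f"port {i}: {(now - last[i]) * 8 / interval / link_bps:6.1%}")
            last[i] = now
```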
Status of Network Installation and Deployment
Current setup:
2 aggregation switches
Only 2 linecards inserted in the core DAQ switch
Sufficient for an L0 trigger rate of ~200 kHz
Upgrade for 1 MHz full-speed readout:
Core DAQ switch: Force10 E1200i
14 linecards providing 1260 GbE ports will be ready at the end of June
Upgrade from TeraScale to ExaScale: doubles the switch capacity, and all ports run at line rate
All readout boards will be connected directly to the core DAQ switch