
LHCb
on-line/off-line computing
Domenico Galli, Bologna
INFN CSN1
Pisa, 23.6.2004

DC04 (May-August 2004) – Physics Goals

Demonstrate performance of the HLT (needed for the computing TDR) and improve/confirm the S/B estimates of the reoptimisation TDR:
  Large minimum-bias sample + signal;
  Large bb sample + signal.
Validation of Gauss/Geant4 and of the generators:
  EvtGen has replaced QQ;
  Inclusion of new processes in generation (e.g. prompt J/ψ).
Vincenzo Vagnoni from Bologna, as a member of the Physics Panel, coordinates the MC generator group.

DC04 – Computing Goals

Main goal: gather information to be used for writing the LHCb computing TDR:
  Robustness test of the LHCb software and production system;
  Test of the LHCb distributed computing model, including distributed analyses;
  Incorporation of the LCG software into the LHCb production environment;
  Use of LCG resources as a substantial fraction of the production capacity.

Goal: Robustness Test of the LHCb Software and Production System

First use of the simulation program Gauss, based on Geant4.
Introduction of the new digitisation program, Boole.
Robustness of the reconstruction program, Brunel:
  Including any new tuning or other available improvements;
  Not including mis-alignment/calibration (discussion now going on).

[Diagram: processing chain Gauss (MC simulation) → Boole (digitisation) → Brunel (reconstruction) → DST; 50 Mevents, 25 TB.]

Goal: Robustness Test of the LHCb Software and Production System (II)

Pre-selection of events based on physics criteria (DaVinci), AKA “stripping”:
  Performed by the production system after the production;
  One job for all the physics channels;
  1:1000 reduction for each physics channel;
  ~10 physics channels → 1:100 total reduction;
  25 TB → 250 GB (worked through in the sketch below).
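
As a quick cross-check of the reduction figures above, here is a minimal back-of-the-envelope sketch (Python) using only the numbers quoted on this slide.

```python
# Back-of-the-envelope check of the stripping reduction quoted above.
per_channel_reduction = 1.0 / 1000    # 1:1000 reduction per physics channel
n_channels = 10                       # ~10 physics channels
dst_sample_tb = 25.0                  # full DST sample size

total_reduction = per_channel_reduction * n_channels      # ~1:100 overall
stripped_sample_gb = dst_sample_tb * 1000 * total_reduction

print("total reduction: 1:%.0f" % (1.0 / total_reduction))   # 1:100
print("stripped sample: %.0f GB" % stripped_sample_gb)       # 250 GB
```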

Goal: Test of the LHCb Computing Model

Distributed data production:
  As in 2003, it will be run on all available production sites, including LCG2;
  Controlled by the production manager at CERN;
  In close collaboration with the LHCb production site managers.
Distributed data sets:
  CERN:
    Complete DST (copied from the production centres);
    Master copies of the pre-selections (stripped DSTs).
  Tier-1:
    Complete replica of the pre-selections;
    Master copy of the DSTs produced at the associated sites.

DC04 Production System

[Diagram: DC04 production system. Tier-0 (CASTOR at CERN), Tier-1 centres (CASTOR and disk storage, running DIRAC and LCG) and Tier-1-associated sites (LCG), connected by GRIDFTP, BBFTP and RFIO data transfers.]

DC04 Production Share

LHCb Italy is participating in DC04 with on the order of 400 processors (200k SPECint) at the INFN Tier-1:
  At this moment it is the most important regional centre, with an amount of resources comparable to CERN's.

[Chart: DC04 production share, with the INFN Tier-1 and CERN contributions labelled.]

Migration to LCG

We started using DIRAC as the main production system because it is urgent for LHCb to produce samples for physics.
LCG production is now under test, and the LCG quota is growing.
We hope to have most of the production under LCG by the end of DC04.

Problems

Manpower dedicated to the hardware and software infrastructure at the Tier-1 is largely insufficient:
  The people involved are working hard, but the service nevertheless remains uncovered;
  Patching a problem takes, on average, 1-2 days;
  The duty cycle is about 20-30%, and the situation is currently getting worse.
The problems are still not solved; we live with them.
Main problems:
  Disk storage (both hardware hangs and NFS client hangs);
  Instability of the PBS-Maui queues.

On-line computing and trigger

The most challenging aspect of LHCb on-line computing is the use of a software trigger for L1 as well (not only for the HLT), with a 1 MHz input rate:
  Cheaper than other solutions (custom hardware, Digital Signal Processors);
  More configurable.
Data flow:
  L1: 45-88 Gb/s (see the estimate below);
  HLT: 13 Gb/s.
Latency:
  L1: < 2 ms;
  HLT: ~1 s.
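
As an illustration of what the L1 figures imply per event, the sketch below simply divides the quoted L1 data flow by the 1 MHz input rate; the uniform-rate assumption is mine, for illustration only.

```python
# Average data volume per L1 event, from the figures quoted above.
l1_input_rate_hz = 1.0e6          # 1 MHz L1 input rate
l1_data_flow_gbps = (45.0, 88.0)  # quoted L1 data flow range, Gb/s

for gbps in l1_data_flow_gbps:
    kilobytes_per_event = gbps * 1e9 / l1_input_rate_hz / 8 / 1e3
    print("%.0f Gb/s -> %.1f kB per L1 event" % (gbps, kilobytes_per_event))
# Roughly 5.6-11 kB of event data shipped per L1 event on average.
```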

L1&HLT Architecture

[Diagram: L1 and HLT architecture. Front-end electronics (FE, TRM) feed a Gigabit Ethernet readout network through a multiplexing layer and 62-87 switches; quoted figures include 126-224 Level-1 links (44 kHz, 5.5-11.0 GB/s), a multiplexing layer of 64-137 links (88 kHz), 323 HLT links (4 kHz, 1.6 GB/s), and 94-175 links (7.1-12.6 GB/s) towards 94-175 sub-farm controllers (SFCs) in front of a farm of ~1800 CPUs; the TFC system, the L1-decision sorter and the storage system are also shown.]

L1&HLT Data Flow

[Diagram: L1 and HLT data flow through the architecture above. Level-1 and HLT traffic from the front-end electronics cross the readout network to 94 SFCs (94 links, 7.1 GB/s) and on to the ~1800-CPU farm, where the L1 trigger and the HLT run (a B → ΦΚs candidate is sketched); L0 and L1 “Yes” decisions, the TFC system, the L1-decision sorter and the storage system are also shown.]

First Sub-Farm Prototype Built in Bologna

2 Gigabit Ethernet switches:
  3Com 2824, 24 ports.
16 1U rack-mounted PCs:
  Dual Intel Xeon processors, 2.4 GHz;
  SuperMicro X5DPL-iGM motherboard;
  533 MHz FSB (front-side bus);
  2 GB ECC RAM;
  Intel E7501 chipset (8 Gb/s hub interface);
  Intel P64H2 bus controller hub (2 x PCI-X, 64 bit, 66/100/133 MHz);
  3 1000Base-T interfaces (1 x Intel 82545EM + 2 x Intel 82546EB).

Farm Configuration

16 nodes running Red Hat 9, with a 2.6.5 kernel:
  1 Gateway, acting as bastion host and NAT to the external network;
  1 Service PC, providing network-boot services, central syslog, time synchronization, NFS exports, etc.;
  1 diskless Sub-Farm Controller (SFC), with 3 Gigabit Ethernet links (2 for data and 1 for control traffic);
  13 diskless Sub-Farm Nodes (SFNs) (26 physical, 52 logical processors with Hyper-Threading), each with 2 Gigabit Ethernet links (1 for data and 1 for control traffic).

Bootstrap Procedure

Little disks, little problems:
  The hard disk is the PC component most subject to failure;
  Disk-less (and swap-less) systems have already been successfully tested in the Bologna off-line cluster.
Network bootstrap using DHCP + PXE + MTFTP (a sketch of the DHCP side follows below).
NFS-mounted disks:
  Root filesystem on NFS.
A new scheme (proposed by the Bologna group) has already been tested:
  Root filesystem on a 150 MB RAMdisk (instead of NFS);
  Compressed image downloaded from the network together with the kernel at boot time (Linux initrd);
  More robust against temporary network congestion.
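
A minimal sketch of the DHCP side of such a network boot, for illustration only: it generates ISC dhcpd host entries pointing each diskless node at a PXE boot loader on a TFTP server. All host names, MAC addresses and IP addresses below are hypothetical placeholders, not the actual farm configuration.

```python
# Sketch: generate ISC dhcpd.conf host entries for PXE network boot.
# Host names, MAC addresses and IP addresses are hypothetical examples.
nodes = [
    ("sfn01", "00:30:48:aa:bb:01", "192.168.100.11"),
    ("sfn02", "00:30:48:aa:bb:02", "192.168.100.12"),
]
tftp_server = "192.168.100.2"   # e.g. the Service PC providing boot services

stanza = (
    "host {name} {{\n"
    "  hardware ethernet {mac};\n"
    "  fixed-address {ip};\n"
    "  next-server {tftp};      # TFTP server holding kernel + initrd\n"
    '  filename "pxelinux.0";   # PXE boot loader\n'
    "}}\n"
)

with open("dhcpd-farm.conf", "w") as conf:
    for name, mac, ip in nodes:
        conf.write(stanza.format(name=name, mac=mac, ip=ip, tftp=tftp_server))
```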

Studies on Throughput and Datagram Loss in Gigabit Ethernet Links

“Reliable” protocols (TCP, or level 4 in general) can't be used, because retransmission introduces an unpredictable latency.
A dropped IP datagram means 25 events lost.
It is mandatory to verify that the IP datagram loss is acceptable for the task.
The limit value for the BER specified in IEEE 802.3 (10^-10 for 100 m cables) is not sufficient (see the estimate after this list).
Measurements performed at CERN show a BER < 10^-14 for 100 m cables (small enough).
However, we had to verify that the following are also acceptable:
  Datagram loss in the IP stack of the operating system;
  Ethernet frame loss in the level-2 Ethernet switch.
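
The following rough estimate (my own illustration, using a full-size frame of 1538 B on the wire, i.e. a 1472 B UDP payload plus the per-frame overhead listed on a later slide) shows why the IEEE 802.3 BER limit alone is not sufficient, while the measured BER is:

```python
# Rough frame-corruption probability for a given bit error rate (BER).
# A 1538 B frame on the wire (1472 B payload + framing overhead) is an
# illustrative assumption.
frame_bits = 1538 * 8

for ber in (1e-10, 1e-14):
    # Probability that at least one bit of the frame is corrupted.
    p_frame = 1.0 - (1.0 - ber) ** frame_bits
    print("BER %.0e -> frame error probability ~ %.1e" % (ber, p_frame))
# ~1e-6 per frame at the 802.3 limit (too large, since each lost
# datagram costs 25 events), ~1e-10 per frame at the measured BER.
```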

Studies on Throughput and Datagram Loss in Gigabit Ethernet Links (II)

Concerning the PCs, the best performance reached is:
  Total throughput (4096 B datagrams): 999.90 Mb/s;
  Lost-datagram fraction (4096 B datagrams): 7.1×10^-10.
Obtained in the following configuration:
  OS: Linux, kernel 2.6.0-test11, compiled with the preemption flag;
  NAPI-compliant network driver;
  FIFO scheduling;
  Tx/Rx ring descriptors: 4096;
  qdisc queue (pfifo discipline) size: 1500;
  IP socket send buffer size: 512 kiB;
  IP socket receive buffer size: 1 MiB (see the sketch below).
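
A minimal sketch of how the socket buffer sizes listed above can be requested from user space (Python on Linux; the actual benchmark programs are not reproduced here, and whether the kernel grants the full sizes also depends on the net.core.rmem_max / wmem_max limits):

```python
import socket

SEND_BUF = 512 * 1024    # 512 kiB send buffer, as in the configuration above
RECV_BUF = 1024 * 1024   # 1 MiB receive buffer, as in the configuration above

# Receiver: UDP socket with an enlarged receive buffer.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, RECV_BUF)
rx.bind(("0.0.0.0", 9000))        # port number is an arbitrary example

# Sender: UDP socket with an enlarged send buffer.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, SEND_BUF)
tx.sendto(b"\x00" * 4096, ("127.0.0.1", 9000))   # one 4096 B datagram

data, sender = rx.recvfrom(65536)
print(len(data), "bytes received from", sender)
```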

Studies on Throughput and Datagram Loss in Gigabit Ethernet Links (III)

[Plot: total rate and UDP payload rate (Mb/s) vs. datagram size (B), for kernel 2.6.0-test11, point-to-point link, flow control on; curves labelled with datagram sizes from 498 B to 8872 B, with the total rate saturating at 1000 Mb/s.]

Per-frame overhead on top of the UDP payload:
  UDP header (8 B),
  + IP header (20 B),
  + Ethernet header (14 B),
  + Ethernet preamble (7 B),
  + Ethernet Start Frame Delimiter (1 B),
  + Ethernet Frame Check Sequence (4 B),
  + Ethernet Inter Frame Gap (12 B).
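
The overhead listed above fixes the maximum packet and UDP payload rates on a 1000 Mb/s link as a function of datagram size; the following sketch reproduces that limit for the datagram sizes shown in the plots (assuming one datagram per Ethernet frame, i.e. jumbo frames for the larger sizes):

```python
# Wire-speed limits on a 1000 Mb/s link, from the per-frame overhead above.
LINK_RATE_BPS = 1000e6
OVERHEAD_BYTES = 8 + 20 + 14 + 7 + 1 + 4 + 12   # UDP + IP + Ethernet framing

for payload in (498, 1472, 2952, 4432, 5912, 7392, 8872):  # sizes from the plots
    frame_bits = (payload + OVERHEAD_BYTES) * 8
    packet_rate = LINK_RATE_BPS / frame_bits              # frames per second
    payload_rate = packet_rate * payload * 8 / 1e6        # UDP payload, Mb/s
    print("%5d B: %8.0f p/s  %6.1f Mb/s payload" % (payload, packet_rate, payload_rate))
# For 1472 B datagrams this gives ~81000 p/s and ~957 Mb/s of payload.
```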
 10 2
3500
3000
kernel 2.6.0-test11
point-to-point
flow control on
279000 p/s
2500
2000
498 B
1500
500
0
10
2
10
4432 B
5912 B
7392 B
8872 B
80000 p/s
2952 B
1000
1472 B
packet rate [p/s]
Studies on Throughput and Datagram
Loss in Gigabit Ethernet Links (IV)
3
LHCb on-line/off-line computing. 20
Domenico Galli
4
10 datagram size [B]

Studies on Throughput and Datagram Loss in Gigabit Ethernet Links (V)

Frame Loss in the Gigabit Ethernet Switch HP ProCurve 6108

[Plot: fraction of dropped frames (×10^-4) vs. raw send rate (920-1000 Mb/s) and UDP payload send rate (881-957 Mb/s).]

Studies on Throughput and Datagram Loss in Gigabit Ethernet Links (VI)

An LHCb public note has been published:
  A. Barczyk, A. Carbone, J.-P. Dufey, D. Galli, B. Jost, U. Marconi, N. Neufeld, G. Peco, V. Vagnoni, “Reliability of Datagram Transmission on Gigabit Ethernet at Full Link Load”, LHCb note 2004-030, DAQ.

Studies on Port Trunking

In several tests performed at CERN, AMD Opteron CPUs show better performance than Intel Xeon in serving IRQs.
The use of Opteron PCs, together with port trunking (i.e. splitting the data across more than one Ethernet cable), could help simplify the on-line farm design by reducing the number of sub-farm controllers:
  Every SFC could support more computing nodes.
We plan to investigate Linux kernel performance in port trunking in the different configurations (balance-rr, balance-xor, 802.3ad, balance-tlb, balance-alb); a small helper sketch follows below.

[Diagram: an SFC connected through an Ethernet switch to several computing nodes (CN) over trunked links.]
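
As a small helper for those tests, the sketch below just inspects the Linux bonding driver's status file (/proc/net/bonding/bond0, present once the bonding module is loaded and a bond0 device is configured, which is an assumption here) and reports the active trunking mode and slave state:

```python
# Report the Linux bonding (port trunking) mode and slave interface status.
# Assumes the bonding module is loaded and a bond0 device is configured.
from pathlib import Path

status = Path("/proc/net/bonding/bond0").read_text()

for line in status.splitlines():
    line = line.strip()
    if line.startswith(("Bonding Mode:", "Slave Interface:",
                        "MII Status:", "Speed:")):
        print(line)
```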

On-line Farm Monitoring, Configuration and Control

One critical issue in administering the event-filter farm is how to monitor, keep configured and up-to-date, and control each node.
A stringent requirement on such a control system is that it has to be interfaced to the general DAQ framework.
PVSS provides a runtime DB, automatic archiving of data to permanent storage, alarm generation, easy realization of graphical panels, and various protocols to communicate via the network.

On-line Farm Monitoring, Configuration and Control (II)

The DIM network communication layer, already integrated with PVSS, is very suitable for our needs:
  It is simple and efficient;
  It allows bi-directional communication.
The idea is to run light agents on the farm nodes, providing information to a PVSS project, which publishes it through GUIs and which can also receive arbitrarily complex commands to be executed on the farm nodes, passing back the output.
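
A minimal sketch of such a light agent, with a plain TCP socket standing in for the DIM/PVSS machinery (the real system exchanges commands and output through DIM services): it accepts one command per connection, runs it on the node, and passes stdout and stderr back.

```python
# Sketch of a light farm-node agent: receive a command, execute it on the
# node, and send the output back.  Plain TCP is used here only to keep the
# example self-contained; the real system would use DIM, published to PVSS.
import socket
import subprocess

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 9999))     # port number is an arbitrary example
server.listen(1)

while True:
    conn, _ = server.accept()
    command = conn.recv(4096).decode().strip()
    result = subprocess.run(command, shell=True,
                            capture_output=True, text=True)
    conn.sendall((result.stdout + result.stderr).encode())
    conn.close()
```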

On-line Farm Monitoring, Configuration and Control (III)

All the relevant quantities useful to diagnose hardware or configuration problems should be traced:
  CPU fans and temperatures;
  Memory occupancy;
  RAM-disk filesystem occupancy;
  CPU load;
  Network interface statistics, counters, errors;
  TCP/IP stack counters;
  Status of relevant processes;
  Network switch statistics (via the SNMP-PVSS interface); see the sketch below.
The information should be viewable as actual values and/or historical trends.
Alarms should be issued whenever relevant quantities fall outside the allowed ranges:
  PVSS naturally allows this, and can even start feedback procedures.
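
A minimal sketch of how a node-side agent could sample a few of the quantities listed above from the Linux /proc filesystem (CPU load, memory occupancy, per-interface network counters) before handing them to PVSS over DIM; the flat dictionary layout is an illustrative choice, not the actual data-point structure.

```python
# Sample a few monitored quantities from /proc on a farm node (Linux).
def read_node_metrics():
    metrics = {}

    # 1-minute CPU load average.
    with open("/proc/loadavg") as f:
        metrics["load_1min"] = float(f.read().split()[0])

    # Total and free memory, in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key in ("MemTotal", "MemFree"):
                metrics[key] = int(value.split()[0])

    # Per-interface received/transmitted byte counters.
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:
            iface, counters = line.split(":", 1)
            fields = counters.split()
            metrics[iface.strip() + "_rx_bytes"] = int(fields[0])
            metrics[iface.strip() + "_tx_bytes"] = int(fields[8])

    return metrics

if __name__ == "__main__":
    print(read_node_metrics())
```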

On-line Farm Monitoring, Configuration and Control (IV)

Concerning configuration and control, the idea is to embed in the framework every common operation usually needed by the system administrator, to be performed by means of GUIs.
On the Service PC side:
  Upgrade of operating systems;
  Upgrade of application software;
  Automatic setup of configuration files (dhcpd table, NFS exports table, etc.).
On the farm-node side:
  Inspection and modification of files;
  Broadcast of commands to the entire farm (e.g., reboot);
  Fast logon by means of a shell-like environment embedded inside a PVSS GUI (e.g., commands, stdout and stderr passed back and forth by DIM);
  (Re)start of on-line processes.