Future experiment specific needs for LHCb


OpenFabrics/Infiniband Workshop at CERN, Monday, June 26
Sai Suman Cherukuwada and Niko Neufeld, CERN/PH
LHCb Trigger-DAQ system: Today
• LHC crossing-rate: 40 MHz
• Visible events: 10 MHz
• Two-stage trigger system
  – Level-0: synchronous in hardware; 40 MHz → 1 MHz
  – High Level Trigger (HLT): software on a CPU farm; 1 MHz → 2 kHz
• Front-end Electronics (FE): interface to the Readout Network
• Readout network
  – Gigabit Ethernet LAN
  – Full readout at 1 MHz
• Event filter farm
  – ~1800 to 2200 1 U servers
[Diagram: FE boards feed the Readout Network under control of the L0 trigger and the Timing and Fast Control system; the network connects to the CPUs of the event filter farm and to permanent storage.]

LHCb DAQ system: features
• On average, every 1 µs new data become available at each of ~300 sources (= custom electronics boards, “TELL1”)
• Data from several 1 µs cycles (= “triggers”) are concatenated into one IP packet → reduces the message/packet rate (see the sketch after this list)
• IP packets are pushed over 1000 BaseT links → short distances allow using 1000 BaseT throughout
• The destination IP address is synchronously assigned via a custom optical network (TTC) to all TELL1s
• For each trigger a PC server must receive IP packets from all TELL1 boards (“event-building”).

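As an illustration of the concatenation idea, here is a minimal Python sketch, not the actual TELL1 firmware: the header layout and field names are invented for the example.

```python
import struct

def pack_triggers(fragments, first_event_id):
    """Concatenate several trigger fragments into one payload.

    Each fragment is the raw bytes from one trigger (1 µs cycle).
    A small invented header (first event-id, fragment count) is
    prepended, and each fragment is preceded by a 2-byte length.
    """
    payload = struct.pack("!IH", first_event_id, len(fragments))
    for frag in fragments:
        payload += struct.pack("!H", len(frag)) + frag
    return payload

# With a packing factor of 8, the per-source packet rate in the
# current DAQ drops from 1 MHz to 125 kHz, at the price of larger packets.
fragments = [bytes([i]) * 100 for i in range(8)]   # 8 dummy 100-byte fragments
message = pack_triggers(fragments, first_event_id=42)
print(len(message), "bytes in one packet instead of 8 separate packets")
```
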
Terminology
• channel: elementary sensitive element = 1 ADC = 8 to 10 bits. The entire detector comprises millions of channels.
• event: all data fragments (comprising several channels) created at the same discrete time together form an event. It is an electronic snapshot of the detector response to the original physics reaction.
• zero-suppression: send only the channel numbers of non-zero-value channels (applying a suitable threshold); see the sketch after this list.
• packing factor: number of event fragments (“triggers”) packed into a single packet/message
  – reduces the message rate
  – optimises bandwidth usage
  – is limited by the number of CPU cores in the receiving CPU (to guarantee prompt processing and thus limit latency)

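A minimal sketch of zero-suppression as defined above, assuming the raw data are simply a list of ADC counts indexed by channel number; the threshold value and data layout are illustrative only.

```python
def zero_suppress(adc_counts, threshold=3):
    """Keep only (channel_number, value) pairs above threshold.

    adc_counts: list of raw ADC values, indexed by channel number.
    Returns the zero-suppressed data: channel numbers and values of
    the channels that fired, instead of the full channel list.
    """
    return [(ch, val) for ch, val in enumerate(adc_counts) if val > threshold]

raw = [0, 0, 7, 0, 0, 0, 12, 1, 0, 0]   # 10 channels, 2 above threshold
print(zero_suppress(raw))                # [(2, 7), (6, 12)]
```
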
Following the data-flow
[Diagram: event fragments from the sub-detectors (VELO, TT, IT, OT, RICH, CALO, MUON) pass from the L0 front-end electronics into the TELL1/UKL1 readout boards, which pack them into MEPs; the TFC system assigns the MEP destination (e.g. PC #876), and the MEPs are pushed over the Readout Network (400 links, 35 GByte/s) to ~50 subfarms of PCs, where an HLT process on the destination PC performs the event-building, issues MEP requests, and sends accepted events to the storage system.]

Data pre-processing:
The LHCb common Readout Board TELL1
• Receiver cards get data from the detector via optical fibres
• FPGAs do pre-processing, zero-suppression and data formatting (into IP packets)
• The FPGA is attached to an Ethernet quad-MAC on an SPI3 bus (simple FIFO protocol)
• IP packets are pushed out to the Data Acquisition on a private LAN over 4 x 1000 BaseT links
[Block diagram: optical/analogue receiver cards (O-RxCard/A-RxCard) feed four PP-FPGAs with L1B buffers; a SyncLink-FPGA with TTCrx, ECS interface and throttle drives the readout transmitter (RO-Tx) over 4 x 1000 BaseT.]

Improving the LHCb trigger
• Triggering is filtering. The quality of the trigger is determined (using simulated data) by measuring how many of the possible good events are selected: efficiency ε = N(good, selected) / N(good, all)
• Each stage has its own efficiency. LHCb loses mostly in the “L0” step: 40 MHz → 1 MHz
• Reason: only coarse information (“high pT”) is used
• Solution: reconstruct secondary vertices at the collision rate of 40 MHz!

Upgrade
We want to have a DAQ and Event filter which:
• allows for vertex triggering at the collision rate (40 MHz)
• fits within the existing infrastructure:
  – 1 MW of power and cooling
  – 50 racks with a total space of 2200 U
• preserves the main good features of the current LHCb DAQ
  – simple, scalable, industry-standard technologies, as much as possible commodity items
• costs < 10^7 of a reasonable currency

Two Options
• Two-stage readout:
  – Read out ~10 kB @ 40 MHz. Data are buffered in the FL1 for a suitable amount of time: 40 ms (?)
  – An algorithm on the event-filter farm selects 1 MHz of “good” events and informs (how?) the FL1 boards of its decision (yes/continue – no/discard)
  – In case of “yes” the entire detector is read out: 35 kB @ 1 MHz
• Always read out the entire detector: 35 kB @ 40 MHz (“brute force”)

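A back-of-the-envelope comparison of the aggregate readout bandwidth of the two options above, using the event sizes and rates quoted on the slides; the helper function is just for illustration.

```python
def readout_bandwidth_gb_per_s(event_size_kb, rate_mhz):
    """Aggregate readout bandwidth in GB/s for a given event size and rate."""
    return event_size_kb * 1e3 * rate_mhz * 1e6 / 1e9

# Option 1: two-stage readout (10 kB subset at 40 MHz, full 35 kB at 1 MHz)
two_stage = readout_bandwidth_gb_per_s(10, 40) + readout_bandwidth_gb_per_s(35, 1)
# Option 2: brute force (full 35 kB at 40 MHz)
brute_force = readout_bandwidth_gb_per_s(35, 40)

print(f"two-stage:   {two_stage:.0f} GB/s")    # ~435 GB/s
print(f"brute force: {brute_force:.0f} GB/s")  # ~1400 GB/s
```
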
Full read-out at 40 MHz
• At a collision rate of 40 MHz, the data rate for a full readout is ~1400 GB/s, or ~12 Tb/s
  – network with ~2 x 1200 x 10 Gigabit ports
• Need several switches as building blocks → an optimised topology is highly desirable (non-Banyan)
• Advantages:
  – No latency constraints
  – Lower memory requirements on the FL1
• Disadvantages:
  – Huge, expensive
  – Almost all of the data shipped will never be looked at (physics algorithms do not change much)
  – Requires zero-suppression and FPGA pre-processing for all detector data at 40 MHz (not obvious)

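A quick check of the rate and port count quoted above, under the assumption of 10 Gb/s per port and the same capacity needed towards the farm as from the FL1s (Python sketch).

```python
# Full readout: 35 kB events at 40 MHz
full_readout_tb_per_s = 35e3 * 40e6 * 8 / 1e12           # ~11.2 Tb/s (slides round to ~12 Tb/s)
ports_per_direction = full_readout_tb_per_s * 1e3 / 10   # 10 Gb/s per port
print(f"{full_readout_tb_per_s:.1f} Tb/s -> ~2 x {ports_per_direction:.0f} x 10 Gigabit ports "
      f"(input from the FL1s plus output towards the farm; slides round to 2 x 1200)")
```
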
Parameters / Assumptions
• Vertex reconstruction requires only a subset of the total event, roughly 10 kB @ 40 MHz (essentially the future Vertex Locator plus some successor of the TT)
• FE with full 40 MHz readout capability
• We assume we have at our disposal the successor of the TELL1, the FL1*, which has several 10 Gigabit output links and can do pre-processing / zero-suppression at the required rate
• Several triggers are packed into an MTP. This reduces the message rate from each board. In this presentation we assume 8 triggers per message, i.e. a Tx-message rate of 5 MHz (per FL1); see the sketch below.
(*) FL1 for Future L1 or Fast L1 or FormuLa 1

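The message-rate assumption in the last bullet, spelled out (a trivial Python sketch, but it sets the scale for the network load per FL1):

```python
trigger_rate_mhz = 40          # collision / readout rate
triggers_per_message = 8       # assumed packing factor into one MTP
tx_message_rate_mhz = trigger_rate_mhz / triggers_per_message
print(f"{tx_message_rate_mhz:.0f} MHz of Tx messages per FL1")   # 5 MHz
```
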
Data pre-processing:
A new readout board: the FL1
• Receiver cards get data from the detector via optical fibres
• FPGAs do pre-processing, zero-suppression and data formatting
• The FPGA is attached to an HCA on a ??? bus (are there alternatives to PCIe?)
• Output to the Data Acquisition private LAN on (up to) 4 x CX4 cables
• A host processor is needed (??) to handle the complex protocol stack
[Block diagram: optical receiver cards (O-RxCard) feed four PP-FPGAs with L1B buffers; a SyncLink-FPGA with TTC sync info, ECS interface, host processor and throttle drives the readout transmitter (RO-Tx) over 4 x CX4.]

Event filter farm for the upgraded LHCb
• We need an event filter which can absorb 4 x 10^7/s x 10 kB + 10^6/s x 35 kB ~ 435 GB/s!
• Assume 2000 servers:
  – A server is something which takes one U in space and has two processor sockets
  – Each socket holds a chip, which comprises several CPU cores
• Each server must accept ~210 MB/s as 500 kHz of messages of ~400 Bytes
• Options for attaching servers to the network:
  – 3 Gigabit links as a trunk: not very practical, because we would have to bring > 130 links into one rack!
  – Use an (underused) 10 Gigabit link

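A small Python sketch of the per-server load implied by these numbers (total rate, server count and message rate taken from the bullets above; the slides round to ~210 MB/s and ~400 bytes):

```python
total_gb_per_s = 4e7 * 10e3 / 1e9 + 1e6 * 35e3 / 1e9    # ~435 GB/s into the farm
servers = 2000
per_server_mb_per_s = total_gb_per_s * 1e3 / servers     # ~218 MB/s per server
msg_rate_khz = 500                                       # quoted per-server message rate
avg_msg_bytes = per_server_mb_per_s * 1e6 / (msg_rate_khz * 1e3)
print(f"{per_server_mb_per_s:.0f} MB/s per server, "
      f"~{avg_msg_bytes:.0f} B per message at {msg_rate_khz} kHz")
```
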
Server Horoscopes
• Quad-core processors from Intel and AMD will most likely be available in 2007
• Could we have “octo-cores” by the end of 2008?
• Can thus assume we will have 8 cores running at 2 to 2.4 GHz (probably not more!) in one U
• Commitment by Intel and AMD: power consumption per processor < 100 W
• Reasonable rumors:
  – 2007 will see the first mainboards with a 10 Gigabit interface on board: most likely CX4, for either 10 Gigabit Ethernet or Infiniband (?)

CPU power for triggering / latency / buffering
• Assuming 2000 servers / 16000 cores and 40 MHz of events, each core has on average 2.5 ms to reach a decision when processing the ~10 kB of vertex-detector data
  – → we should have at least 40 ms of buffering in the FL1s to cope with fluctuations in processing time (the processing-time distribution is known to have long tails)
• Assuming 400 FL1s means that they have to have 12.5 GB of buffer memory

High Density Switches
[Diagram: upgraded readout topology. 1. Read out 10 kB events @ 40 MHz from the LHCb detector and buffer them on the ~400 Front-End L1 (FL1) boards, each with 4 x 10 Gbps links (cable runs of ~60 m and 20 m+). 2. Send the data through a fabric of high-density switches (~400 ports “in” per switch) to the farm for the trigger decision (~65 Gbps per farm rack; 50 farm racks, each with 1 x 32-port or 2 x 16-port switch). 3./4. The trigger decision is sent back to and received by the FL1s. 5. If the decision is positive, the full 35 kB events are read out @ 1 MHz.]

Power Consumption
• Probably need 512 MB per core (trigger process)
• x 8 cores ⇒ 4 GB per server
• 4 GB of high-speed memory + an on-board 10 Gigabit interface will also need power (assume conservatively 50 W)
• The 1 U box should stay below 300 W
• Total power for the CPUs < 600 kW
• 10 Gigabit distribution switches also need power (count on at least 250 W each)

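A simple budget sketch using the numbers above; the number of distribution switches (one per farm rack) is an assumption for illustration, since the slides only give a lower bound of 250 W per switch.

```python
servers = 2000
mem_per_core_mb, cores_per_server = 512, 8
mem_per_server_gb = mem_per_core_mb * cores_per_server / 1024      # 4 GB per server
server_power_w = 300                                               # upper limit per 1 U box
switches, switch_power_w = 50, 250                                 # assumed: one switch per farm rack
total_kw = (servers * server_power_w + switches * switch_power_w) / 1e3
print(f"{mem_per_server_gb:.0f} GB per server, "
      f"~{total_kw:.1f} kW total (infrastructure limit: 1 MW of power and cooling)")
```
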
Open questions
• Can an FPGA drive the HCA, or do we need an embedded host processor with an OS?
• It would be nice to centrally assign the next destination (server) to all FL1 boards. This means determining the Queue Pair number and DLID/DGID to send a message to. Can we use the Infiniband network for this as well?
• Almost the entire traffic is unidirectional (from the FL1s to the servers). Can we take advantage of this fact?