Transcript: Boukhelef-XFEL-2ndCRISP-Mar2013

Data Recording Model at XFEL
CRISP 2nd Annual meeting
March 18-19, 2013
Djelloul Boukhelef
Djelloul Boukhelef - XFEL
Outline
• Purpose and scope
• Hardware setup
• Software architecture
• Experiments & results
  – Network
  – Storage
• Summary & outlook
Purpose and present scope
• Build a prototype of a fully featured DAQ/DM/SC system
  – Select/install adequate h/w, develop s/w, and test all of the
    system's properties: the Control, DAQ, DM, and SC systems
• Current prototype focuses on:
– Data acquisition, pre-processing, formatting and storage
– Assess the performance and stability of the h/w + s/w
• Network: bandwidth (10Gbps), UDP packets loss, TCP behavior…
• Processing: concurrent read, processing, write operations, …
• Storage: performance of disk (write), concurrent IO operations, …
• Software development
– Application architecture: processing pipeline, communication, …
– Design for performance, robustness, scalability, flexibility, …
Hardware setup
• 2D detector generates ~10GB of image data per second
• Data is multiplexed onto 16 channels (10GbE)
• 1GB / 1.6 sec = 640MB/s per channel
• Lots of other slow data streams
Details were presented in the IT&DM meeting in October.
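The per-channel figure is simple arithmetic; a quick check (Python, using binary units as on the slide):

```python
# 2D detector: ~10 GB of image data per second, multiplexed onto 16 channels.
total_mb_per_s = 10 * 1024               # aggregate rate in MB/s
channels = 16

per_channel = total_mb_per_s / channels  # 640.0 MB/s per channel
# Equivalent statement: one 1 GB train per channel every 1.6 s.
per_train = 1024 / 1.6                   # 640.0 MB/s

print(per_channel, per_train)
```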
SOFTWARE ARCHITECTURE
Overview
• Current prototype consists of three software components
  – Data feeder
    • No TB board (Train Builder emulator): feed the PCL with train data
    • With TB board: feed the TB with detector data
  – PC Layer software
    • Acquire, pre-process, reduce, monitor, format, send data to
      storage and SC
  – Storage service
• Device/DeviceServer model to build a distributed control system
  – PSR, flexibility and configurability

[Diagram: Data Feeders 1..M → (UDP) → Train Builder Emulator → (UDP) → PC Layer (PCL nodes 1..N) → (TCP) → Storage service (Storage nodes 1..S).]
Architecture overview
• Data-driven model
• Configurations (xml files)

[Diagram: a Timer server (network timer) drives the Train Builder Emulator (Master) and, via TCP clock-ticks, the layers below. Data Feeders 1..M feed PCL nodes 1..N over UDP; PCL nodes feed Storage nodes 1..N over TCP. Each layer has its own xml configuration (groups: id, rate, …; net-timer; PCL nodes; storage nodes; train metadata; data files; folder and naming convention) and monitored resources (CPU, queues, network).]
Processing pipeline
• Distributed and parallel processing
• Pipelining and multithreading on multi-core

[Diagram: the Train builder (Generate → Build → Send) feeds the acquisition side of the PC Layer (Receive → Process → Format → Send), which feeds Storage (Receive → Write); branches go to online analysis and SC. Each stage runs on its own threads (T1, T2).]
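A minimal sketch of the pipelined, multithreaded model above (Python; the stage names match the diagram, but the trivial "work" functions are illustrative stand-ins, not the real algorithms):

```python
import queue
import threading

def stage(name, inq, outq, work):
    """Generic pipeline stage: pull an item, transform it, push it on."""
    def run():
        while True:
            item = inq.get()
            if item is None:            # poison pill: shut down, propagate
                if outq is not None:
                    outq.put(None)
                break
            result = work(item)
            if outq is not None:
                outq.put(result)
    t = threading.Thread(target=run, name=name)
    t.start()
    return t

# Queues decouple the stages so they overlap on a multi-core machine.
q_recv, q_proc, q_fmt = queue.Queue(), queue.Queue(), queue.Queue()
done = []

threads = [
    stage("process", q_recv, q_proc, lambda train: ("processed", train)),
    stage("format",  q_proc, q_fmt,  lambda train: ("formatted", train)),
    stage("send",    q_fmt,  None,   lambda train: done.append(train)),
]

for train_id in range(3):               # "Receive": feed three dummy trains
    q_recv.put(train_id)
q_recv.put(None)                        # end of run

for t in threads:
    t.join()
print(done)
```

Each stage preserves FIFO order, so trains leave the pipeline in the order they arrived even though the stages run concurrently.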
Detector emulator

[Diagram: an Image Data Generator and a Detector Data Generator generate & store images (CImg) as data files (offline). On a DAQ request, the Data Feeder loads them into an image data queue and a detector data queue (pointers). The Train Builder assembles trains in a raw data buffer and trains buffer (train data queue, tokens); the Packetizer, driven by clock-ticks (TCP) from the Timer server, sends packets (UDP) to the PCL nodes.]

PC-Layer node

[Diagram: packets (UDP) from the train builder enter the De-packetizer, which fills a trains buffer and a process queue (train id). The processing pipeline (statistics, monitoring, reduction, …; currently simple processing: checksum; real algorithms still needed) feeds a format queue (train id). The Formatter produces in-memory files on a write queue (file name); the Writer streams them over TCP to online storage.]
Storage server
• Memory ring buffer
  – Buffer size (16GB)
  – Slot (chunk) size (1GB)
• Thread pool
  – Reader: reads the data stream (TCP) and stores it into the
    memory buffer
  – Writer: writes filled buffer slots to disk (cache)
  – Sync: flushes the cache to disk (physical writing)
• IO modes
  – Cached: issues IO commands via the OS
  – Direct: talks directly to the IO device
  – Splice: data transfer within kernel space

[Diagram: TCP network → Reader (get free slot, fill slot) → ring buffer → Writer (flush slot, write) → Sync → disk array; written slots are returned as free slots.]
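The slot-recycling scheme can be sketched with two queues, one holding free slot indices and one holding filled ones (Python; sizes are shrunk, and the TCP reader and fsync-based Sync step are replaced by stand-ins):

```python
import queue
import threading

N_SLOTS, SLOT_SIZE = 4, 16          # real system: 16 GB buffer, 1 GB slots
free_slots = queue.Queue()
filled_slots = queue.Queue()
ring = [bytearray(SLOT_SIZE) for _ in range(N_SLOTS)]
for i in range(N_SLOTS):
    free_slots.put(i)               # all slots start out free

written = []                        # stands in for the disk array

def reader(n_chunks):
    """Read the (simulated) TCP stream into free slots of the ring buffer."""
    for chunk in range(n_chunks):
        slot = free_slots.get()     # blocks if the writer falls behind
        ring[slot][:] = bytes([chunk % 256]) * SLOT_SIZE   # "fill slot"
        filled_slots.put(slot)
    filled_slots.put(None)          # end of stream

def writer():
    """Write filled slots to 'disk' and recycle them as free slots."""
    while True:
        slot = filled_slots.get()
        if slot is None:
            break
        written.append(bytes(ring[slot]))  # a Sync thread would fsync here
        free_slots.put(slot)        # "free slot": hand it back to the reader

threads = [threading.Thread(target=reader, args=(10,)),
           threading.Thread(target=writer)]
for t in threads: t.start()
for t in threads: t.join()
print(len(written))
```

Because the reader blocks on `free_slots.get()`, a slow disk back-pressures the network side instead of overwriting unwritten data.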
NETWORK PERFORMANCE
Train Transfer Protocol (TTP)
• Train data format
  – Header, images & descriptors, detector data, trailer
• Train Transfer Protocol (TTP)
  – Based on UDP: designed for fast data transfer where TCP is not
    implemented or not suitable (overhead, delays)
  – Transfers are identified by unique identifiers (frame number)
  – Packetization: bundle the data block (frame) into small packets
    (8 kbytes) that are tagged with increasing packet numbers
  – Flags: SoF, EoF, Padding
  – Packet trailer mode

[Packet layout: 8-kbyte data payload plus a trailer carrying the frame number (4b), packet number (4b), and SoF/EoF flags (1b). Train layout: header (sof), images data, images descriptors, detector-specific data, trailer (eof), split across packets 0..8 in the figure.]
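A sketch of the packetization step described above (Python; the trailer fields match the slide's 4-byte frame number, 4-byte packet number and 1-byte flags, but the byte order, flag values and trailer position are assumptions):

```python
import struct

PAYLOAD = 8192                      # 8-kbyte data payload per packet
SOF, EOF = 0x1, 0x2                 # start-of-frame / end-of-frame flags

def packetize(frame_id, frame):
    """Split one data block (frame) into TTP-style UDP payloads."""
    packets = []
    n = (len(frame) + PAYLOAD - 1) // PAYLOAD   # packets needed, rounded up
    for pkt_no in range(n):
        data = frame[pkt_no * PAYLOAD:(pkt_no + 1) * PAYLOAD]
        data = data.ljust(PAYLOAD, b"\0")       # pad the last packet
        flags = (SOF if pkt_no == 0 else 0) | (EOF if pkt_no == n - 1 else 0)
        trailer = struct.pack("<IIB", frame_id, pkt_no, flags)  # 4b + 4b + 1b
        packets.append(data + trailer)
    return packets

pkts = packetize(7, b"x" * 20000)   # 20000 bytes -> 3 packets
print(len(pkts))
```

The receiver can reassemble out-of-order packets purely from the (frame number, packet number) pair and detect the end of a train from the EoF flag.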
Previous results (reminder)
• Two types of programs run in parallel on all machines
  – Feeders: generate train data, packetize it into packets, send
    them using UDP
  – Receivers: reconstruct train data from packets, store them in
    memory (overwrite)
• Run length (#packets/total time):
  – Typical: ~2.5×10^9 packets (few hours)
  – Maximum: 5×10^9 packets (16h37m)
• Time profile
  – XFEL: 10 MHz for 16 channels → send 1 train (~131100 packets)
    within 1.6 sec
  – Continuous: no waiting time between trains

[Diagram: PCL nodes 1–8 running unidirectional streams and concurrent send/receive.]
Previous results (reminder)
• Network transfer rate: 1GB train → 0.87 sec ≈ 9.9 Gbps
• CPU usage (i.e. receiver core) → 40%
• Packet loss
  – A few packets (tens to hundreds) are sometimes lost per run
  – It happens only at the beginning of some runs (not trains)
  – Sometimes observed on all machines, sometimes on some machines
    only; we have also run with no packet loss on any machine
• Ignoring the first lost packets, which affect only the first train:
  – Typical run (3.5×10^8): less than 3.7 out of 10000 trains affected
  – Long run (5×10^9): less than 26 out of one million trains affected
Train switching
• In previous experiments:
  – Each feeder is configured to feed one PC layer node (one-to-one)
  – Packet loss appears at the start of a run
  – In the TB, trains are sent out through different channels every
    time: 10 trains → 16 channels per second
• Question:
  – What if the feeder sends train data to a different PC layer node
    every time?

[Diagrams: (1) one-to-one: Feeders 1–2 (st101, st102) → TTP → 10GbE switch (sub-net 1) → PCLayer 1–2 (st104, st105); (2) switching: Feeders 1–3 (st101–st103) → TTP → 10GbE switch (sub-net 1) → PCLayer 1–2 (st104, st105).]
Train switching
• Test configuration
  – 3 feeder nodes (st101–st103)
    • Pre-load images from disk
    • Build train data (header, trailer, …)
    • Calculate checksum (Adler32)
    • Packetize train data (TTP)
  – 2 PC layer nodes (st104, st105)
    • Depacketize (TTP)
    • No processing is performed
    • Format to HDF5 file
    • Stream files through TCP (splice)
  – 2 storage nodes (st106, st107)
    • Write files to shared memory (splice)

[Diagram: Timer (st401) → Train builder; Feeders 1–3 → TTP → 10GbE switch (sub-net 1) → PCLayer 1–2 → TCP → 10GbE switch (sub-net 2) → Storage 1–2.]
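The Adler32 step on the feeders can be reproduced with Python's zlib (a stand-in for whatever implementation the prototype uses; `zlib.adler32` supports running updates, which is convenient when a train arrives as packets):

```python
import zlib

def train_checksum(chunks):
    """Incrementally Adler32-checksum a train delivered as chunks."""
    value = zlib.adler32(b"")       # Adler32 of empty input is 1
    for chunk in chunks:
        value = zlib.adler32(chunk, value)
    return value

whole = b"image data " * 1000
parts = [whole[i:i + 8192] for i in range(0, len(whole), 8192)]
# Chunked and one-shot checksums agree:
assert train_checksum(parts) == zlib.adler32(whole)
```

Adler32 is much cheaper than CRC32 at the cost of weaker error detection, a reasonable trade-off for a throughput test.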
Train switching
• 3 feeders feed 2 PC layer nodes in a round-robin manner
  – Rate: 2 trains every 1.6 seconds
• A PC layer node receives 1 train every 1.6 sec, each time from a
  different feeder
• A feeder sends out 1 train every 2.4 sec, each time to a different
  IP address
  – The packetizer checks the send buffer (e.g. every 100 packets) in
    order to avoid overwriting packets that have not been sent yet
  – All feeder-to-PCLayer data transfers are done on the same
    sub-network
  – Train transfer time is 0.88 sec, i.e. there is an overlap of
    0.08 sec between two consecutive trains (9% of the time)

[Schedule: at times 0.0, 0.8, 1.6, 2.4, 3.2, 4.0, 4.8, … the sending feeder cycles 1, 2, 3, 1, 2, 3, 1, … and the receiving PCL node cycles 1, 2, 1, 2, 1, 2, 1, …]
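The round-robin schedule follows directly from the counts (a sketch; indices are 1-based as on the slide):

```python
N_FEEDERS, N_PCL = 3, 2
TRAIN_PERIOD = 0.8                  # one train enters the system every 0.8 s

def schedule(n_trains):
    """(time, feeder, pcl_node) for each train, round-robin on both sides."""
    return [(round(i * TRAIN_PERIOD, 1), i % N_FEEDERS + 1, i % N_PCL + 1)
            for i in range(n_trains)]

for t, feeder, node in schedule(7):
    print(t, feeder, node)
# Each feeder recurs every 3 trains (2.4 s); each PCL node every 2 (1.6 s).
```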
Experiment
• Total run time
  – Short run: less than ½ hour
  – Long run: 18 hours (81657 trains ~ 80TB)
• Observations:
  – 6 trains were affected at the beginning of the run for each PC
    layer node; then they continued smoothly with no train loss until 1am.
  – At 8am, 2 more trains were affected at one PC layer node (probably
    due to the nightly update). No train loss on the other node.
  – The run then continued very stably until the end.
  – Network send and receive (TTP and TCP) were very stable.
  – Formatting time was not stable all the time.
Summary
• Results:
  – Trains sent: 81657 (27219 trains per feeder)
  – PCLayer01: received 40820, affected 8
  – PCLayer02: received 40823, affected 6
  – Train size: 1073754173 bytes = 1GB (image data) + 12.06MB
    (header, descriptors, detector data, and trailer)
  – # packets per train: 131202
  – Transfer time: 0.877579 sec
  – Transfer rate: 1.1395 GBytes/sec = 9.116 Gbps
• Sustainable and stable network bandwidth
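The quoted rate is consistent with the train size and transfer time once "GBytes" is read in binary units (GiB); a quick check:

```python
train_bytes = 1_073_754_173   # 1 GiB of image data + ~12.06 MB of metadata
transfer_s = 0.877579

rate_gib_s = train_bytes / transfer_s / 2**30   # "GBytes/sec" read as GiB/s
print(round(rate_gib_s, 4))                     # ~1.1395
print(round(rate_gib_s * 8, 3))                 # ~9.116, the slide's "Gbps"
```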
STORAGE PERFORMANCE
Direct IO vs Cached IO
• Cached IO
  – Issues read/write commands via the OS kernel, which executes the
    IO. Data are copied to/from the page cache.
• Direct IO
  – Performs read/write directly to/from the device, bypassing the
    page cache
  – DMA: memory alignment (512)
  – RAID: stripe size (64K), # of disks
  – Per file vs. per partition (Linux)
• Zero-copy operation
  – Splice socket and file descriptors; the data transfer is performed
    within kernel space (transparent), via page remapping.
  – Problem using the Linux splice function with IBM/Mellanox;
    couldn't figure out the reason!

[Diagram: user land (application buffer) above kernel space (page cache / kernel buffer, flush) above the device drivers and hardware layer (disk, network); cached IO goes through the page cache, direct IO bypasses it.]
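A minimal sketch of the direct-IO write path (Python on Linux; an anonymous mmap provides the page-aligned buffer that O_DIRECT's DMA alignment requires, and `direct=False` falls back to cached IO for filesystems, such as tmpfs, that reject O_DIRECT):

```python
import mmap
import os

CHUNK = 1 << 20                     # 1 MiB: a multiple of the 512-byte alignment

def stream_to_disk(path, total_bytes, direct=True):
    """Write total_bytes (a multiple of CHUNK) in aligned CHUNK-sized writes."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if direct:
        flags |= os.O_DIRECT        # bypass the page cache (Linux only)
    buf = mmap.mmap(-1, CHUNK)      # anonymous mmap => page-aligned memory
    fd = os.open(path, flags, 0o644)
    try:
        written = 0
        while written < total_bytes:
            written += os.write(fd, buf)
        os.fsync(fd)                # the prototype's Sync thread does this step
        return written
    finally:
        os.close(fd)
```

With cached IO the same loop works with any buffer and size; O_DIRECT additionally constrains buffer address, write size and file offset to the device's alignment.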
Cached IO
• Run length: 2+ hours
• Buffer size: 4GB
• Method: read all file data into one buffer, write it once, then sync

[Plots: write time (sec) vs. file number for two configurations, both starting on an empty disk: file size 1GB with a 1.6s time period, and file size 2GB with a 3.2s time period.]
Run configuration
• Two types of programs run in parallel on four machines
  – Sender (Dell PCL nodes 101–104): open the in-memory data file,
    stream its content using TCP, close the file
  – Receiver (IBM storage nodes 201–204, internal and external disks):
    read file data from the socket into a RAM buffer (16GB), write it
    to disk
• Run length
  – Typical: 4.5 TB (2 hours)
  – Maximum: until the disk is full;
    9TB (4 hours) and 28TB (12 hours)
  – Disks are cleaned before every run
• Time profile
  – 1GB → 1.6 sec per box

  Size (GB)  | 1   | 10 | 20 | 40
  Time (sec) | 1.6 | 16 | 32 | 64
Direct IO – external storage
• Storage extension box: 3TB 7.2Krpm 6Gbps NL SAS, RAID6

[Plot: read, write, and overall time (sec) vs. file size (GB), 0–40GB.]

File (GB) | Network (Gbps) avg | Storage (GB/s) avg | Net read (sec) max / avg | Disk write (sec) max / avg | Overall (sec) max / avg
1         | 9.86 | 0.95 |  2.27 /  0.87 |  2.95 /  1.06 |  3.95 /  1.93
10        | 9.85 | 0.97 |  9.01 /  8.72 | 15.58 / 10.30 | 16.45 / 11.17
20        | 9.83 | 0.98 | 17.79 / 17.48 | 30.07 / 20.45 | 30.95 / 21.33
40        | 9.61 | 0.97 | 38.81 / 35.74 | 58.06 / 41.39 | 59.06 / 42.36
Direct IO – internal storage
• Internal disks: 14×900GB 10Krpm 6Gbps SAS, RAID6

[Plot: read, write, and overall time (sec) vs. file size (GB), 0–40GB.]

File (GB) | Network (Gbps) avg | Storage (GB/s) avg | Net read (sec) max / avg | Disk write (sec) max / avg | Overall (sec) max / avg
1         | 9.86 | 1.17 |  3.42 /  0.87 |  2.47 /  0.86 |  4.23 /  1.73
10        | 9.60 | 1.09 |  9.38 /  8.95 | 11.28 /  9.19 | 12.15 / 10.07
20        | 9.36 | 1.06 | 19.18 / 18.36 | 22.63 / 18.84 | 23.55 / 19.75
40        | 9.38 | 1.07 | 38.11 / 36.64 | 45.25 / 37.54 | 46.22 / 38.50
Long run experiments
• Internal storage: 9TB, 918 files, 4 hours
• External storage: 28TB, 2792 files, 12 hours
• Buffer size: 16GB; slot size: 1GB; file size: 10GB; time profile: 16 sec

[Plots: network read, disk write, and overall time per file (with averages) for both runs.]
Statistics from Ganglia
• Long run experiment (5:24am - 5:29pm)
  – Host: exflst201 (with external storage)
• Disk write: 671.14MB/s
• Network bandwidth: 676.8MB/s
• CPU usage: syst: 5.81%, user: 0.39%
  – Reader (core 0): syst: 49.54%, user: 0.75%
  – Writer (core 1): syst: 7.03%, user: 0.010%

[Ganglia graphs: network, CPU, and disk activity over the run.]
Result summary
• We need to write a 1GB data file within 1.6s per storage box
  – Both the internal and external storage configurations achieve
    this rate (1.1GB/s and 0.97GB/s, resp.)
  – 16 storage boxes are needed to handle the 10GB/s train data stream
• High network bandwidth and low CPU load (stable)
• Direct IO:
  – Network read and disk write operations overlap by 97%
    → low overall time per file
  – Application buffer: for big files, the bigger the slot size the
    better the disk IO performance (as long as DMA allows)
• To do:
  – Concurrent IO operations: write/write, write/read, file merging
  – Storage manager: file indexing, disk space management
SUMMARY & OUTLOOK
Summary
• First half of the slice test hardware is configured and running
• Testing and tuning network and IO performance using
  – System/community tools: netperf, iozone, …
  – PCL software
  – Train builder board
• TB (Emulator) → PCL software (Dell)
  – Bandwidth: 9.9 Gbps (99% of the wire speed)
  – Low UDP packet loss rate: only a few packets lost at the start of
    runs (3.5×10^8 to 5×10^9 packets); at most 3.7 to 0.26 per 10^5
    trains can be affected
• PC Layer (Dell) → Storage boxes (IBM)
  – TCP data streaming: ~9.8 Gbps
  – Writes terabytes of data to disk at 0.97 to 1.1 GB/s
Outlook
• Fully featured DAQ system
– Data readout, pre-processing, monitoring, storage
– Feed the system with real data and apply real algorithms
(processing, monitoring, scientific computing)
– Deployment, configuration and control: upload libraries, initiate
devices, start/stop and monitor runs  Device composition
• Soak and stress testing
– Test performance (CPU, IO, network), behavior (bugs, memory
leaks), reliability (error handling, failure), and stability of the system
• Significant workload applied over long period of time
– Parallel tasks: forwarding data to online analysis or scientific
computing, multiple streams into the same storage server, …
• Cluster file system vs. DDN vs. local storage system
• Data management: structure, access control, metadata, …
Thanks!