High-Performance Storage System for the LHCb Experiment

Sai Suman Cherukuwada, CERN
Niko Neufeld, CERN
IEEE NPSS RealTime 2007, FNAL
LHCb Background
• Large Hadron Collider beauty
• One of 4 major CERN experiments at the LHC
• Single-arm forward spectrometer
• b-physics, CP violation in the interactions of b-hadrons
LHCb Data Acquisition
• Level 0 Trigger: 1 MHz
• High Level Trigger: 2-5 kHz
• Written to disk at the experiment site
• Staged out to the Tape Centre
[Diagram: data flows from the L0 Electronics through the Readout Boards and the High-Level Trigger Farm to Storage at the Experiment Site]
Storage Requirements
• Must sustain write operations for 2-5 kHz of event data at ~30 kB/event, with peak loads going up to twice as much
• Must sustain matching read operations for staging out data to the tape centre
• Must support reading of data for analysis tasks
• Must be fault tolerant
• Must easily scale to support higher storage capacities and/or throughput
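A rough bandwidth figure implied by these numbers (back-of-the-envelope arithmetic, not stated on the slide): $2\text{--}5\,\mathrm{kHz} \times 30\,\mathrm{kB/event} \approx 60\text{--}150\,\mathrm{MB/s}$ of sustained writes, or roughly $300\,\mathrm{MB/s}$ at peak, with a comparable read stream on top for tape staging and analysis.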
Architecture : Choices
Fully Partitioned Independent Servers
• Independent servers run local file systems on their local disks

Cluster File System
• Combines all servers over IP to form a single namespace
• Writer farm nodes can access the cluster file system transparently
• Unified namespace
[Diagram: writer farm nodes connecting over an IP network to independent servers with storage, contrasting the fully partitioned layout with the cluster file system]
Architecture : Shared-Disk File System Cluster
• Can scale compute or storage components independently
• Locking overhead restricted to very few servers
• Unified namespace is simpler to manage
• Storage fabric delivers high throughput
• HLT Farm nodes (the event data writer clients) write data over IP using a custom protocol
• Servers connect to shared storage over a Fibre Channel network
• The SAN file system provides all servers with a consistent namespace on the shared storage
[Diagram: HLT Farm nodes send data over an IP network to a fault-tolerant, load-balanced Event Data Writer Service running on a shared-disk file system cluster, which is attached to shared storage over a Fibre Channel network]
Hardware : Components
• Dell PowerEdge 2950 Intel Xeon Quad Core servers (1.6 GHz) with 4 GB FBD RAM
• QLogic QLE2462 4 Gbps Fibre Channel adapters
• DataDirect Networks S2A 8500 storage controllers with 2 Gbps host-side ports
• 50 x Hitachi 500 GB 7200 rpm SATA disks
• Brocade 200E 4 Gbps Fibre Channel switch
Hardware : Storage Controller
DirectRAID
• Combined features of RAID3, RAID5, and RAID0
• 8 + 1p + 1s
• Very low impact on disk rebuild
• Large sector sizes (up to 8 kB) supported
• Eliminates host-side striping
[Chart: Read, Write, and Read+Write throughput in MB/sec, "Normal" vs "Rebuild" mode]
• IOZone File System Benchmark with 8 threads writing 2 GB files each on one server
• Tested first in "Normal" mode with all disks in normal health, and then in "Rebuild", with one disk in the process of being replaced by a global hot spare
Software
[Stack diagram, top to bottom:]
Writer Service (Discovery, I/O Threads, Failover Thread)
GFS File System
Linux Logical Volume Manager
Linux Multipath Driver
SCSI LUNs (Logical Units)
Software : Shared-Disk File System
• Runs on RAID volumes exported from storage arrays (called LUNs or Logical Units)
• Can be mounted by multiple servers simultaneously
• Lock manager ensures consistency of operations
• Scales almost linearly up to 4 nodes (at least)
[Charts: per-node Read and Write throughput in MB/sec for Nodes 1-4, and aggregate Read/Re-Read/Write/Re-Write throughput in MB/sec versus number of servers (1-4)]
• IOZone test with 8 threads, O_DIRECT I/O
• LUNs striped over 100+ disks
• 2 Gbps Fibre Channel connections to disk array
Writer Service : Design Goals
• Enable a large number of HLT Farm servers to write to disk
• Write data to the shared-disk file system at close to maximum disk throughput
• Failover + failback with no data loss
• Load balancing between instances
• Write hundreds of concurrent files per server
Writer Service : Discovery
• Discovery and status updates performed through multicast
• A Service Table maintains the current status of all known hosts
• Service Table contents are constantly relayed to all connected Gaudi Writer Processes from the HLT Farm
[Diagram: Writer Services 1-3 exchange multicast messages and relay Service Table information to Gaudi Writer Processes 1 and 2]
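A minimal sketch of how such multicast status announcements could look on Linux. The group address, port, and payload format are illustrative assumptions, not the real protocol; receivers would join the group with IP_ADD_MEMBERSHIP and fold the messages into their Service Table.

/* Sketch: periodic multicast status announcement from a Writer Service. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    struct sockaddr_in group;
    memset(&group, 0, sizeof group);
    group.sin_family = AF_INET;
    group.sin_addr.s_addr = inet_addr("239.192.0.1"); /* assumed group address */
    group.sin_port = htons(45678);                    /* assumed port          */

    for (;;) {
        /* identity plus current load; the format is purely illustrative */
        const char *status = "writer-01 load=0.35 open_files=120";
        if (sendto(sock, status, strlen(status), 0,
                   (struct sockaddr *)&group, sizeof group) < 0)
            perror("sendto");
        sleep(1);
    }
}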
Writer Process : Writing
• Cache every event
• Send to the Writer Service
• Wait for acknowledgement
• Flush and free (this loop is sketched below)
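A minimal sketch of the cache / send / wait-for-acknowledgement / free loop over a connected TCP socket. The framing (sequence number plus length) and the 8-byte acknowledgement are assumptions standing in for the real custom protocol.

/* Sketch: client-side write path - cache the event, send it to the Writer
 * Service, block for the acknowledgement, then flush and free the buffer. */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

struct event_buf {
    uint64_t seq;       /* sequence number echoed back in the ack */
    uint32_t len;       /* payload length in bytes                */
    char     payload[];
};

/* transfer the full buffer, retrying on short reads/writes */
static int xfer_all(int fd, void *buf, size_t len, int writing)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = writing ? write(fd, p, len) : read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

int write_event(int fd, struct event_buf *ev)
{
    /* 1. the event is already cached in ev (allocated by the caller) */
    /* 2. send it to the Writer Service */
    if (xfer_all(fd, ev, sizeof *ev + ev->len, 1) < 0)
        return -1;              /* keep ev so it can be resent on failover */

    /* 3. wait for the acknowledgement carrying the same sequence number */
    uint64_t acked;
    if (xfer_all(fd, &acked, sizeof acked, 0) < 0 || acked != ev->seq)
        return -1;

    /* 4. flush and free: the chunk is now safely with the service */
    free(ev);
    return 0;
}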
Writer Service : Failover
• Writer Processes are aware of all instances of the Writer Service
• Each data chunk is entirely self-contained
• Writing of a data chunk is idempotent
• If a Writer Service fails, the Writer Process can reconnect and resend unacknowledged chunks (sketched below)
[Diagram: after a failed connection to Writer Service 1, Gaudi Writer Process 1 connects to the next entry in the Service Table, updates its Service Table, and replays unacknowledged data chunks to Writer Service 2]
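A sketch of that failover path. The chunk layout and the helpers connect_to_service and send_chunk are illustrative assumptions; the point is that a self-contained chunk (file name, offset, length) can simply be replayed against the next service in the table, and duplicates are harmless because the write is idempotent.

/* Sketch: failover by replaying unacknowledged, self-contained chunks. */
#include <stdint.h>
#include <stddef.h>

struct chunk {
    char          file[256];  /* target file name              */
    uint64_t      offset;     /* absolute offset in that file  */
    uint32_t      length;     /* payload length                */
    struct chunk *next;       /* list of unacknowledged chunks */
    char          payload[];
};

/* assumed helpers, not shown: open a connection to the n-th Service Table
 * entry, and transmit one chunk (0 on success) */
int connect_to_service(int table_index);
int send_chunk(int fd, const struct chunk *c);

/* On a failed connection: connect to the next Service Table entry and
 * replay every chunk that was never acknowledged. */
int failover(struct chunk *unacked, int next_index)
{
    int fd = connect_to_service(next_index);
    if (fd < 0)
        return -1;                     /* caller tries the next entry */
    for (struct chunk *c = unacked; c != NULL; c = c->next)
        if (send_chunk(fd, c) < 0)
            return -1;                 /* caller tries the next entry */
    return fd;                         /* resume normal writing on fd */
}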
Writer Service : Throughput
• Cached concurrent write performance for large numbers of files is insufficient
• Large CPU and memory load (memory copy)
• O_DIRECT reduces CPU and memory usage
• Data need to be page-aligned for O_DIRECT
• Written event data are not aligned to anything
[Chart: CPU (sys %) and I/O (MB/sec) for Cached vs O_DIRECT writes]
• Custom test writing 32 files per thread x 8 threads
• Write sizes varying from 32 bytes to 1 MB
• LUNs striped over 16 disks
• 2 Gbps Fibre Channel connections to disk array
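A minimal illustration of the alignment constraint mentioned above: O_DIRECT bypasses the page cache, but the buffer address, transfer size, and file offset must all be suitably aligned, so unaligned event data has to be packed into aligned buffers first. File name and sizes here are arbitrary.

/* Sketch: a single page-aligned O_DIRECT write on Linux. */
#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t align = 4096;        /* page size      */
    const size_t size  = 1 << 20;     /* 1 MB write     */
    void *buf = NULL;

    /* O_DIRECT requires an aligned buffer: posix_memalign, not plain malloc */
    if (posix_memalign(&buf, align, size) != 0) { perror("posix_memalign"); return 1; }
    memset(buf, 0, size);             /* unaligned event data would be packed here */

    int fd = open("/tmp/odirect_test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, buf, size) < 0)     /* size and file offset are also aligned */
        perror("write");

    close(fd);
    free(buf);
    return 0;
}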
Writer Service : Throughput per Server
• Scales up with the number of clients
• Write throughput within 3% of the maximum achievable through GFS
[Chart: write throughput in MB/sec versus number of clients (1, 2, 4, 8, 16)]
• Custom test writing event sizes ranging from 1 byte to 2 MB
• LUNs striped over 16 disks
• 2 Gbps Fibre Channel connections to disk array
• 2 x 1 Gbps Ethernet connections to server
• CPU utilisation ~ 7-10%
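For context (my arithmetic, not from the slide): the 2 x 1 Gbps Ethernet connections give a theoretical client-facing ceiling of roughly $2 \times 125\,\mathrm{MB/s} = 250\,\mathrm{MB/s}$ per server.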
Conclusions & Future Work
• Solution that can offer high read and write throughput with minimal overhead
• Can be scaled up easily with more hardware
• Failover with no performance hit
• A more sophisticated "Trickle" load-balancing algorithm is being prototyped
• Maybe worth implementing a full-POSIX FS version someday?
Thank You
Writer Service
• Linux-HA not suited for load balancing
• Linux Virtual Server not suited for write workloads
• NFS in sync mode too slow; async mode can lead to information loss on failure
[Chart: write throughput in MB/sec, NFS async vs NFS sync]