High-Performance Storage System for the LHCb Experiment
Sai Suman Cherukuwada, CERN
Niko Neufeld, CERN
IEEE NPSS RealTime 2007, FNAL
LHCb Background
Large Hadron Collider beauty
- One of the 4 major CERN experiments at the LHC
- Single-arm forward spectrometer
- b-physics, CP violation in the interactions of b-hadrons

LHCb Data Acquisition

- Level 0 Readout – 1 MHz
- High Level Trigger – 2-5 kHz
- Written to disk at the experiment site
- Staged out to the Tape Centre

[Diagram: data flows from the L0 Electronics through the Readout Boards and the High-Level Trigger Farm to Storage at the Experiment Site]

Storage Requirements

- Must sustain write operations for 2-5 kHz of event data at ~30 kB/event, with peak loads going up to twice as much
- Must sustain matching read operations for staging data out to the tape centre
- Must support reading of data for analysis tasks
- Must be fault tolerant
- Must easily scale to support higher storage capacities and/or throughput

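A rough check of the implied bandwidth, assuming the upper trigger rate: 5 kHz x ~30 kB/event is about 150 MB/s of sustained writes, so peak loads approach 300 MB/s, with the matching reads for tape staging and analysis on top of that.
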
Architecture : Choices

Two options were considered:

Fully Partitioned Independent Servers
- Independent servers with storage, each running a local file system on its local disks
- HLT Farm nodes are bound to specific servers

Cluster File System
- Combines all servers over IP to form a single, unified namespace
- The HLT Farm can access the cluster file system transparently

Architecture : Shared-Disk File System Cluster

- Can scale compute or storage components independently
- Locking overhead restricted to very few servers
- Unified namespace is simpler to manage
- Storage fabric delivers high throughput

[Diagram "Shared-Disk File System Cluster": Event Data Writer clients on the HLT Farm write data over IP, using a custom protocol, to a fault-tolerant, load-balanced Event Data Writer Service; the servers connect to shared storage over a Fibre Channel network, and the SAN file system provides all servers with a consistent namespace on the shared storage]

Hardware : Components

- Dell PowerEdge 2950 quad-core servers
- QLogic QLE2462 4 Gbps Fibre Channel adapters
- DataDirect Networks S2A 8500 storage controllers with 2 Gbps host-side ports
- 50 x Hitachi 500 GB 7200 rpm SATA disks
- Brocade 200E 4 Gbps Fibre Channel switch

Hardware : Storage Controller

- DirectRAID: combined features of RAID 3, RAID 5, and RAID 0
- 8 + 1p + 1s (8 data disks, 1 parity, 1 spare)
- Very low impact on disk rebuild
- Large sector sizes (up to 8 kB) supported
- Eliminates host-side striping

[Chart: controller throughput (MB/sec) for Read, Write, and Read+Write, under normal operation and during rebuild]

Software
[Diagram: software stack, top to bottom – Writer Service (Discovery, I/O Threads, Failover Thread), GFS File System, Linux Logical Volume Manager, Linux Multipath Driver, SCSI LUNs (Logical Units)]

Software : Shared-Disk File System

- Runs on RAID volumes exported from storage arrays (called LUNs, or Logical Units)
- Can be mounted by multiple servers simultaneously
- A lock manager ensures consistency of operations
- Scales almost linearly up to (at least) 4 nodes; the figures alongside are for GFS

[Charts: throughput (KB/sec) for Read and Write, broken down by node for 1-4 GFS nodes, and per-node throughput for two nodes in a group]

Writer Service : Design Goals

- Enable a large number of HLT Farm servers to write to disk
- Write data to the shared-disk file system at close to maximum disk throughput
- Failover + failback with no data loss
- Load balancing between instances
- Write hundreds of concurrent files per server

Writer Service : Discovery

- Discovery and status updates are performed through multicast (see the sketch below)
- A Service Table maintains the current status of all known hosts
- Service Table contents are constantly relayed to all connected Gaudi Writer Processes from the HLT Farm

[Diagram: Writer Services 1-3 exchange multicast messages; Service Table information is relayed to Gaudi Writer Processes 1 and 2]

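As an illustration of the multicast-based discovery and status updates, here is a minimal sketch of a service instance announcing itself to a multicast group. The group address, port, message text, and host name are placeholders for the example, not the actual Writer Service protocol.

// Minimal sketch of multicast-based service announcement, assuming IPv4 UDP.
// The group address (239.192.0.1), port, and message text are illustrative
// placeholders, not the actual protocol used by the Writer Service.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* group = "239.192.0.1";     // hypothetical multicast group
    const unsigned short port = 45000;     // hypothetical port

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, group, &addr.sin_addr);

    // A Writer Service instance would periodically multicast its identity and
    // load, so peers and HLT farm clients can keep their Service Table current.
    const char msg[] = "WRITER-SERVICE host=storage01 status=up";
    if (sendto(sock, msg, sizeof(msg), 0,
               reinterpret_cast<const sockaddr*>(&addr), sizeof(addr)) < 0) {
        perror("sendto");
    }
    close(sock);
    return 0;
}
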
Writer Service : Failover

- Gaudi Writer Processes are aware of all instances of the Writer Service
- Each data chunk is entirely self-contained
- Writing of a data chunk is idempotent (see the sketch after the diagram)
- If a Writer Service fails, the Gaudi Writer Process can reconnect and resend unacknowledged chunks

[Diagram: on a failed connection to Writer Service 1, Gaudi Writer Process 1 (1) connects to the next entry in its Service Table, (2) updates to the new Service Table, and (3) replays unacknowledged data chunks to Writer Service 2]

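To show why replaying chunks is safe, here is a minimal sketch of a self-contained data chunk written with plain POSIX calls; the header fields and fixed-size layout are assumptions for illustration, not the real Writer Service wire format.

// A minimal sketch of a self-contained, idempotent data chunk, written
// against POSIX file I/O. The field names and layout are assumptions for
// illustration; the real Writer Service protocol may differ.
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>

struct ChunkHeader {
    char     fileName[256];  // target file, carried inside every chunk
    uint64_t offset;         // absolute offset of the payload in that file
    uint32_t length;         // payload length in bytes
    uint32_t seqNumber;      // used for acknowledgements and replay
};

// Because each chunk names its own file and offset, writing it twice has the
// same effect as writing it once: the operation is idempotent, so a Gaudi
// Writer Process can safely replay any chunk that was never acknowledged.
bool writeChunk(const ChunkHeader& h, const char* payload) {
    int fd = open(h.fileName, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return false;
    ssize_t written = pwrite(fd, payload, h.length,
                             static_cast<off_t>(h.offset));
    close(fd);
    return written == static_cast<ssize_t>(h.length);
}
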
Writer Service : Throughput

- Suboptimal cached concurrent write performance for large numbers of files
- Large CPU and memory load (memory copy)
- O_DIRECT reduces CPU and memory usage
  - Data need to be page-aligned for O_DIRECT (see the sketch below)
  - Written event data are not aligned to anything

[Chart: CPU load (sys %) and I/O throughput (MB/sec) for cached vs. O_DIRECT writes]
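
For reference, a minimal sketch of a page-aligned O_DIRECT write on Linux, under the assumptions noted in the comments:

// A minimal sketch of a page-aligned O_DIRECT write on Linux. The 4 kB
// alignment, buffer size, and file path are assumptions for the example;
// unaligned event data would first have to be copied into such a buffer.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              // needed for O_DIRECT on Linux
#endif
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

int main() {
    const size_t alignment = 4096;        // typical page / sector alignment
    const size_t bufSize   = 1 << 20;     // 1 MiB aligned buffer

    void* buf = nullptr;
    if (posix_memalign(&buf, alignment, bufSize) != 0) return 1;
    std::memset(buf, 0, bufSize);         // stand-in for event data

    // O_DIRECT bypasses the page cache, avoiding the extra memory copy, but
    // it requires the buffer address, offset, and length to be aligned.
    int fd = open("/data/event_chunk.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); std::free(buf); return 1; }

    if (write(fd, buf, bufSize) < 0) perror("write");

    close(fd);
    std::free(buf);
    return 0;
}
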
Writer Service : Throughput per Server

- Scales up with the number of clients
- Write throughput within 3% of the maximum achievable through GFS

[Chart: per-server write throughput (MB/sec) for 1, 2, 4, 8, and 16 clients]

Thank You

Writer Service

- Linux-HA not suited for load balancing
- Linux Virtual Server not suited for write workloads
- NFS in sync mode too slow; async mode can lead to information loss on failure
- Cached operations do not ...

[Chart: throughput (MB/sec) for NFS async vs. NFS sync]