Slide Transcript

New directions in storage: hard disks with built-in networking
Presenter: Patrick Fuhrmann, dCache.org
Patrick Fuhrmann, Paul Millar, Tigran Mkrtchyan, Yves Kemp (DESY)
Christopher Squires (HGST)
Objectives
• Current storage systems are composed of sophisticated building
blocks: large file servers, often equipped with special RAID
controllers
• Are there options to build storage systems based on small,
independent, low-care (and low-cost) components?
• Are there options for a massive scale-out using different
technologies?
• Are there options for moving away from specialised hardware
components to software components running on standard
hardware?
• What changes would be needed to software and operational
aspects, and how would the composition of TCO change?
Building storage systems with
small, independent data nodes
• The actual storage system must be able to handle
independent data nodes.
– Like CEPH and dCache
• It must support protocols that allow massive scale-out (no bottlenecks).
– CEPH: OK (client-side proprietary driver)
– dCache: through NFS 4.1/pNFS, GridFTP and WebDAV
• It must protect against failures and support data-integrity features.
– CEPH and dCache continue operation if data nodes fail.
– CEPH and dCache support data-integrity checks.
Selection
As we run dCache at DESY, we have focused our investigation on dCache for now; however, we will build a CEPH cluster soon.
CEPH has already been successfully ported to an HGST setup.
We believe that combining CEPH as the storage backend with dCache as the protocol engine would be exactly the right solution for DESY.
dCache design
[Diagram: number-crunching nodes access the dCache pool (data) nodes; the dCache head node provides the MDS (name-space operations).]
Available ‘small’ systems
• DELL C5000 blades, HP moonshot, Panasas, …
– These still share quite a lot of infrastructure.
– Often more than one disk per CPU: too large for a simple system, but too small for a serious RAID system.
• The extreme setup would be:
– One disk is one system
– Small CPU with limited RAM
– Network connectivity included.
• HGST and Seagate announced such a device
– Equipped with a standard disk drive, an ARM CPU and a network interface.
– We have chosen the HGST Open Ethernet drive for our investigations.
HGST Open Ethernet
• Small ARM CPU with Ethernet, piggybacked onto a regular disk.
• Spec:
– Any Linux (Debian on the demo unit)
– CPU: 32-bit ARM, 512 KB level-2 cache
– 2 GB DDR3 DRAM
• 1792 MB available
– Block storage exposed as SCSI device sda
– Ethernet network exposed as eth0
– dCache pool code working!
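As a quick sanity check of this spec sheet, here is a minimal sketch (our addition, not part of the talk) that could be run on the drive itself; it assumes Python is installed on the drive's Debian system and only reads the standard /proc and /sys interfaces.

```python
# Minimal sketch (assumption: Python available on the drive's Debian system).
# Confirms the advertised layout: ~1792 MB usable RAM, block device sda,
# network interface eth0, ARM CPU.
from pathlib import Path

mem_kb = next(int(line.split()[1])
              for line in Path("/proc/meminfo").read_text().splitlines()
              if line.startswith("MemTotal:"))
print(f"usable RAM   : {mem_kb // 1024} MB")                              # expect ~1792 MB
print(f"block device : sda present = {Path('/sys/block/sda').exists()}")
print(f"network      : eth0 present = {Path('/sys/class/net/eth0').exists()}")
print(f"ARM CPU      : {'ARM' in Path('/proc/cpuinfo').read_text()}")
```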
Photo Evidence
[Photo: single-disk enclosure.]
Professional DESY Setup
[Photo: the test setup at DESY, with retaining clip, power supply, HGST disk, and Yves.]
dCache Instance for CMS at
DESY
• dCache instance for the CMS collaboration at DESY
• 199 File Servers in total
– 172 with locally attached disks
• 12 disks: RAID-6
• Varying file size
– 27 with remotely attached disks (Fibre Channel or SAS)
– All file servers are equipped with 24 GB RAM and 1 or 10 GBit Ethernet
• dCache with 5 PBytes net capacity
– Some dCache partitions are running in resilient mode, holding 2 copies of each
file and reducing the total net capacity.
• Four head nodes for controlling and accounting
– 2 CPUs + 32 GB RAM each
• 10 protocol initiators (doors)
– Simple CPU, 16 GB RAM
Potential setup with HGST Open Ethernet drive / iSCSI
The standard setup of 12 disks in RAID-6, attached via SAS to a RAID controller card, is replaced by:
– One pizza box with an Intel CPU and a fast network, connected to 12 HGST drives.
– Each HGST drive acts as one iSCSI target.
– RAID-6 is built in software on the pizza box.
(A sketch of this assembly follows after the diagram.)
[Diagram: classic node (CPU + SAS RAID-6 controller + 12 SAS disks) versus new node (CPU + Ethernet switch carrying the iSCSI protocol + 12 HGST Open Ethernet disks).]
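A minimal sketch of how the pizza box could assemble such a node, assuming open-iscsi and mdadm are installed, each drive exports a single iSCSI target, and the script runs with root privileges; the drive IP addresses and the resulting /dev/sdX names are hypothetical and need to be adapted.

```python
# Sketch only: import 12 Open Ethernet drives as iSCSI targets and build a
# software RAID-6 over them (assumes open-iscsi + mdadm, root privileges).
import subprocess

DRIVE_IPS = [f"192.168.10.{i}" for i in range(101, 113)]   # 12 hypothetical drive addresses

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Discover and log in to the target exported by each drive.
for ip in DRIVE_IPS:
    run("iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", ip)
    run("iscsiadm", "-m", "node", "-p", ip, "--login")

# 2. Create the software RAID-6 over the imported block devices
#    (device names depend on the login order; adjust before use).
devices = [f"/dev/sd{chr(ord('b') + i)}" for i in range(12)]
run("mdadm", "--create", "/dev/md0", "--level=6", "--raid-devices=12", *devices)
```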
Potential setup with HGST Open Ethernet drive / CEPH
The standard setup of 12 disks in RAID-6, attached via SAS to a RAID controller card, is replaced by:
– All HGST drives acting as one large object store.
– Pizza boxes act as CEPH clients and can run higher-level services, e.g. the dCache storage system.
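For illustration, a minimal sketch of the pizza-box side using the python-rados bindings, assuming the Ethernet drives already run CEPH OSDs that form one cluster and that a valid /etc/ceph/ceph.conf and keyring are in place; the pool name "dcache-data" is hypothetical.

```python
# Sketch only: a CEPH client on the pizza box storing an object in the pool
# backed by the Open Ethernet drives (assumes python-rados and a configured
# cluster; the pool name is hypothetical).
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("dcache-data")          # hypothetical pool name
    ioctx.write_full("example-object", b"payload spread across the Ethernet drives")
    print(ioctx.read("example-object"))
    ioctx.close()
finally:
    cluster.shutdown()
```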
Potential CEPH deployment
[Diagram: N classic nodes (CPU + SAS RAID-6 controller + 12 SAS disks, the CPU running CEPH plus high-level services, e.g. a dCache pool) versus HGST Open Ethernet disks 1…M running CEPH behind an Ethernet switch (CEPH protocol), with separate CPUs running the high-level services, e.g. dCache.]
Potential dCache deployment
[Diagram: N classic nodes (CPU + SAS RAID-6 controller + 12 SAS disks, the CPU running a dCache pool) versus HGST Open Ethernet disks 1…M, each running a dCache pool node, behind an Ethernet switch, with separate CPUs running the dCache head nodes.]
dCache composed of HGST disks
(A quantitative Investigation)
• Qualitative comparison between a dCache pool setup based on HGST drives and one based on RAID servers.
• Assumptions:
– All files identical in size (4 GBytes)
– Streaming read only
– Files are equally distributed (flat, random)
• Initial measurement: Total bandwidth versus number
of concurrent streams.
– Using IOzone (streaming read)
– Applying a fit to the measured data (a sketch of this step follows below).
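A sketch of the fitting step, assuming scipy is available and reading the "1/n fit" of the next slide as a b + a/n model of the total bandwidth; the IOzone numbers below are hypothetical placeholders, not the measured values.

```python
# Sketch only: fit the two models mentioned on the next slide to (hypothetical)
# IOzone totals measured at increasing numbers of concurrent streams.
import numpy as np
from scipy.optimize import curve_fit

streams    = np.array([1, 2, 4, 8, 16, 32, 64])           # concurrent sequential reads
total_mb_s = np.array([110, 108, 100, 90, 75, 60, 45])    # hypothetical totals, MB/s

mean_model = lambda n, b: b + 0 * n                       # "mean value" fit: constant total
inv_model  = lambda n, a, b: b + a / n                    # one reading of the "1/n" fit

(b_mean,), _      = curve_fit(mean_model, streams, total_mb_s)
(a_inv, b_inv), _ = curve_fit(inv_model, streams, total_mb_s)
print(f"mean-value fit: {b_mean:.1f} MB/s")
print(f"1/n fit       : {b_inv:.1f} + {a_inv:.1f}/n MB/s")
```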
Comparison: Ethernet Disk - RAID
[Plot (measured): total bandwidth (MB/s) versus the number of concurrent sequential read operations, for a single HGST drive and a 12-drive RAID system, each with a mean-value fit and a 1/n fit.]
HGST Disk dCache: Qualitative
Simulation
• Simulating and comparing two 4-PByte dCache instances:
– A) 100 RAID servers (12 disks of 4 TB each), single file copy.
– B) 2000 HGST disks of 4 TB each, two file copies.
• dCache only supports replication of entire files (no erasure coding).
• Computing the total bandwidth for a varying number of streams:
– The IOzone measurements from the previous slides are used.
– Streams are counted per server, and the corresponding per-server bandwidth is looked up and added up (see the sketch below).
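A minimal sketch of this bookkeeping, with hypothetical per-node bandwidth curves standing in for the real IOzone fits; it reproduces only the qualitative behaviour (RAID ahead at few streams, HGST ahead at many).

```python
# Sketch only: distribute N concurrent streams over the nodes and add up the
# per-node bandwidth predicted by a fitted curve. The curves below are
# hypothetical stand-ins for the measured IOzone fits.
import random

def total_bandwidth(n_streams, n_nodes, per_node_bw):
    load = [0] * n_nodes
    for _ in range(n_streams):
        load[random.randrange(n_nodes)] += 1        # flat, random file placement
    return sum(per_node_bw(n) for n in load if n > 0)

bw_raid = lambda n: 800.0 * min(1.0, 8.0 / n)       # 12-disk RAID-6 server (hypothetical, MB/s)
bw_hgst = lambda n: 60.0 * min(1.0, 2.0 / n)        # single Open Ethernet drive (hypothetical, MB/s)

for streams in (100, 1000, 8000):
    raid = total_bandwidth(streams, 100, bw_raid) / 1000     # GB/s, 100 RAID servers
    hgst = total_bandwidth(streams, 2000, bw_hgst) / 1000    # GB/s, 2000 HGST disks
    print(f"{streams:5d} streams: RAID ~ {raid:6.1f} GB/s, HGST ~ {hgst:6.1f} GB/s")
```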
Simulation
[Plot: simulated total bandwidth (GB/s) versus the number of concurrent read operations (up to ~4000); HGST (+) and RAID (x) series.]
Real Monitor
[Monitoring plots: the photon dCache typically sees << 100 simultaneous reads (~75 streams), where the RAID setup would have the best performance; the CMS dCache typically sees >> 1000 simultaneous reads (~8000 streams), where the HGST setup would have the best performance.]
Total Cost of Ownership
(some considerations)
• At present, neither the price nor the delivery units are known.
• Assumption for an HGST product:
– 4U box with 60 disks and an internal switch
– 4 x 10 GE network
• Reminder: dCache only supports full replicas (erasure codes would be more efficient).
What?           | 100 RAID                                            | 2000 HGST                                   | Comment
Network: ports  | 100 network ports                                   | 2000/60 boxes x 4 uplinks ≈ 133 ports       | Small overhead
Network: IP     | 100 IP addresses                                    | 2000 IP addresses                           | Public IPv4: no; private IPv4 & IPv6: OK
Power           | 1200 disks + 200 Intel CPUs + 100 RAID controllers  | 2000 disks + 2000 ARM CPUs + 133 switches   | Roughly similar power consumption expected
Space           | 100 x 2 U = 200 U (usual DESY setup)                | 2000/60 boxes x 4 U ≈ 133 U                 | Potentially two computing rooms; caveat: denser form factors exist for RAID
Management      | 100 systems, with RAID controllers                  | 2000 systems, without RAID controllers      | Either way, a scaling life-cycle management system is needed
Operations      | 100 x 12 disks in RAID-6, no resilience             | 2000 disks with resilience                  | HGST setup: no RAID rebuild times, no need for timely replacement of defective hardware, just re-establishing resilience
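A quick back-of-the-envelope check of the derived entries (our addition), assuming the hypothetical 4U / 60-disk / 4 x 10 GE product described above; the "x 4 uplinks" reading of the ports row is our interpretation of the slide's 133.

```python
# Sketch only: derive the HGST-column figures from the assumed enclosure.
import math

hgst_disks, disks_per_box, uplinks_per_box, box_height_u = 2000, 60, 4, 4
boxes = math.ceil(hgst_disks / disks_per_box)            # 34 enclosures (2000/60 ~ 33.3)

print("uplink ports :", boxes * uplinks_per_box)         # ~136; the slide rounds 2000/60*4 to 133
print("rack units   :", boxes * box_height_u, "U")       # ~136 U, versus 100 x 2 U = 200 U for RAID
print("IP addresses :", hgst_disks)                      # 2000, versus 100 for the RAID servers
```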
Summary
• Performance:
– To preserve high overall throughput at a high number of streams, many small, independent storage units are preferable.
– If the focus is on single-stream performance at a small or medium number of streams, a RAID system is preferable.
• As an example of such a small storage unit, we selected the HGST Open Ethernet drive.
• Operational considerations:
– A dCache setup using two copies as its data-integrity method is roughly similar in TCO to a traditional RAID setup.
– A setup based on, e.g., CEPH, with more advanced and efficient data-integrity methods, would clearly shift the TCO in favour of the HGST setup compared to a traditional RAID setup.
Thanks
Further reading:
• HGST: www.hgst.com
• dCache: www.dCache.org
• DESY: www.DESY.de