PetaByte Storage Facility at RHIC

Razvan Popescu - Brookhaven National Laboratory
Who are we?

Relativistic Heavy-Ion Collider @ BNL
– Four experiments: Phenix, Star, Phobos, Brahms.
– 1.5PB per year.
– ~500MB/sec.
– >20,000 SpecInt95.

Startup in May 2000 at 50% capacity, ramping up to nominal parameters within 1 year.
Overview

Data Types (sanity check below):
– Raw: very large volume (1.2PB/yr), average bandwidth (50MB/s).
– DST: average volume (500TB), large bandwidth (200MB/s).
– mDST: low volume (<100TB), large bandwidth (400MB/s).
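As a rough sanity check of these numbers (an estimate, not from the slides): at a sustained 50MB/s, writing 1.2PB of raw data takes about three quarters of a calendar year, which is consistent with continuous data taking plus downtime.

```python
# Rough consistency check of the raw-data numbers above.
# Assumption: a calendar year of ~3.15e7 seconds.

RAW_VOLUME_BYTES = 1.2e15        # 1.2 PB/yr (Overview slide)
AVG_BANDWIDTH_B_S = 50e6         # 50 MB/s average (Overview slide)
SECONDS_PER_YEAR = 3.15e7

writing_seconds = RAW_VOLUME_BYTES / AVG_BANDWIDTH_B_S
live_fraction = writing_seconds / SECONDS_PER_YEAR

print(f"Writing time at 50 MB/s: {writing_seconds / 86400:.0f} days")
print(f"Implied live-time fraction: {live_fraction:.0%}")
# ~278 days of writing, i.e. ~76% of the year.
```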
Data Flow (generic)
[Diagram: generic data flow among RHIC, the Reconstruction Farm (Linux), the Archive, the File Servers (DST/mDST), and the Analysis Farm (Linux); raw data at 35-50MB/s into reconstruction and the archive, DSTs at 200MB/s to the file servers, mDSTs at 400MB/s to the analysis farm, plus ~10MB/s return paths.]
The Data Store

HPSS (ver. 4.1.1, patch level 2)
– Deployed in 1998.
– After overcoming some growing pains, we consider the present implementation successful.
– One major/total reconfiguration to adapt to new hardware (and improved system understanding).
– Flexible enough for our needs; one shortcoming: no preemptible priority scheme.
– Very high performance.
The HPSS Archive

Constraints - large capacity & high bandwidth:
– Two types of tape technology: SD-3 (best $/GB) & 9840 (best $/MB/s).
– Two-tape-layer hierarchies, for easy management of migration (sketched below).

Reliable and fast disk storage:
– FC-attached RAID disk.

Platform compatible with HPSS:
– IBM, SUN, SGI.
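To make the two-tape-layer idea concrete, here is a hypothetical sketch of cartridge-based migration between an upper and a lower tape layer (the mechanism mentioned on the later "Issues" slide); the choice of 9840 as the upper layer and SD-3 as the bulk layer, and all names, are assumptions.

```python
# Hypothetical sketch of a hierarchy with two tape layers, where data
# migrates between layers cartridge by cartridge. Layer ordering and
# all names are assumptions for illustration.

from dataclasses import dataclass, field

@dataclass
class Cartridge:
    label: str
    files: list = field(default_factory=list)
    full: bool = False

@dataclass
class TapeLayer:
    name: str
    cartridges: list = field(default_factory=list)

def migrate_full_cartridges(upper: TapeLayer, lower: TapeLayer) -> None:
    """Move whole (full) cartridges from the upper to the lower layer."""
    keep = []
    for cart in upper.cartridges:
        (lower.cartridges if cart.full else keep).append(cart)
    upper.cartridges = keep

# Example (names assumed): 9840 as the fast layer, SD-3 as the bulk layer.
upper = TapeLayer("9840", [Cartridge("A1", ["f1", "f2"], full=True),
                           Cartridge("A2", ["f3"])])
lower = TapeLayer("SD-3")
migrate_full_cartridges(upper, lower)
print([c.label for c in upper.cartridges], [c.label for c in lower.cartridges])
# ['A2'] ['A1']
```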
Present Resources

Tape Storage (aggregate bandwidth sketched below):
– (1) STK Powderhorn silo (6000 cart.).
– (11) SD-3 (Redwood) drives.
– (10) 9840 (Eagle) drives.

Disk Storage:
– ~8TB of RAID disk:
  • 1TB for HPSS cache.
  • 7TB Unix workspace.

Servers:
– (5) RS/6000 H50/70 for HPSS.
– (6) E450 & E4000 for file serving and data mining.
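Combining this drive complement with the per-transport rates quoted on the "HPSS Performance" slide (>9MB/s per SD-3, ~10MB/s per 9840) gives a rough aggregate tape bandwidth of about 200MB/s; a back-of-the-envelope sketch:

```python
# Back-of-the-envelope aggregate tape bandwidth for the drive complement
# above, using the per-transport rates from the "HPSS Performance" slide.

drives = {
    "SD-3 (Redwood)": {"count": 11, "mb_per_s": 9},    # ">9MB/s per SD-3"
    "9840 (Eagle)":   {"count": 10, "mb_per_s": 10},   # "~10MB/s per 9840"
}

total = 0.0
for name, d in drives.items():
    bw = d["count"] * d["mb_per_s"]
    total += bw
    print(f"{name:15s}: {d['count']:2d} x {d['mb_per_s']:2d} MB/s = {bw:5.0f} MB/s")

print(f"Aggregate tape bandwidth: ~{total:.0f} MB/s")
# ~200 MB/s in total, to be shared between the raw stream and DST traffic.
```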
Phenix Data Flow

[Diagram: Phenix-specific data flow among RHIC, HPSS (RAW), HPSS (DST), the Reconstruction Farm, the File Server, and the Analysis Farm, including a calibration stream; disk buffers of 150GB @ 80MB/s in front of the HPSS classes and a 3TB @ 100MB/s analysis pool; tape resources of (3) Redwood and (2) 9840 transports; individual streams range from ~1MB/s to ~65MB/s.]
HPSS Structure

(1) Core Server:
– RS/6000 Model H50
– 4x CPU
– 2GB RAM
– Fast Ethernet (control)
– OS-mirrored storage for metadata (6 pv.)
HPSS Structure

(3) Movers:
– RS/6000 Model H70
– 4x CPU
– 1GB RAM
– Fast Ethernet (control)
– Gigabit Ethernet (data) (1500 & 9000 MTU)
– 2x FC-attached RAID - 300GB - disk cache
– (3-4) SD-3 "Redwood" tape transports
– (3-4) 9840 "Eagle" tape transports
HPSS Structure

Guarantee availability of resources for a specific user group → separate resources → separate PVRs & movers.
One mover per user group → total exposure to single-machine failure.
Guarantee availability of resources for the Data Acquisition stream → separate hierarchies.
Result: 2 PVRs, 2 COSs & 1 mover per group (sketched below).
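A hypothetical illustration of what "2 PVRs, 2 COSs & 1 mover per group" amounts to for the four experiments; all identifiers below are invented, since the actual HPSS configuration names are not given here.

```python
# Hypothetical layout: each experiment gets two PVRs (one per tape
# technology), two classes of service (e.g. raw vs. DST hierarchy),
# and a single dedicated mover. All names are invented.

EXPERIMENTS = ["Phenix", "Star", "Phobos", "Brahms"]

layout = {
    exp: {
        "pvrs":   [f"{exp.lower()}-pvr-sd3", f"{exp.lower()}-pvr-9840"],
        "cos":    [f"{exp.lower()}-cos-raw", f"{exp.lower()}-cos-dst"],
        "movers": [f"{exp.lower()}-mvr-1"],   # one mover: single point of failure
    }
    for exp in EXPERIMENTS
}

for exp, resources in layout.items():
    print(exp, resources)
```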
HPSS Topology

[Diagram: two networks. Net 1 - Data (1000baseSX Gigabit Ethernet) connects the Core server, the movers M1-M3 (M3 also routing), and the clients; Net 2 - Control (100baseT) carries control traffic; the STK silo is attached via 10baseT; N x PVR and pftpd are also shown.]
HPSS Performance

80 MB/sec for the disk subsystem.
~1 CPU per 40MB/sec of TCP/IP Gigabit traffic @ 1500 MTU, or per 90MB/sec @ 9000 MTU (packet-rate sketch below).
>9MB/sec per SD-3 transport.
~10MB/sec per 9840 transport.
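The MTU dependence suggests that per-packet CPU overhead dominates; a rough sketch of the implied packet rates per CPU (that interpretation is an inference, not stated on the slide):

```python
# Rough packet-rate comparison behind "~1 CPU per 40 MB/s @ 1500 MTU
# vs. per 90 MB/s @ 9000 MTU". The point: per-packet overhead dominates,
# so jumbo frames roughly double the throughput per CPU.

def packets_per_second(throughput_mb_s: float, mtu_bytes: int) -> float:
    """Approximate packet rate, ignoring protocol headers."""
    return throughput_mb_s * 1e6 / mtu_bytes

for mb_s, mtu in [(40, 1500), (90, 9000)]:
    pps = packets_per_second(mb_s, mtu)
    print(f"{mb_s:3d} MB/s @ MTU {mtu}: ~{pps / 1000:.0f}k packets/s per CPU")

# ~27k packets/s at 1500 MTU vs. ~10k packets/s at 9000 MTU: one CPU
# handles fewer packets per second with jumbo frames, yet moves more
# than twice the data.
```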
I/O Intensive Systems

Mining and Analysis systems.
High I/O & moderate CPU usage.
To avoid heavy network traffic, merge the file servers with the HPSS movers:
– Major problem: HPSS support on non-AIX platforms.
– Options: several (Sun) SMP machines, or a large (SGI) modular system.
Problems

Short life cycle of the SD-3 heads:
– ~500 hours, i.e. < 2 months at average usage (6 of 10 drives in 10 months).
– Built a monitoring tool to try to predict transport failure, based on soft-error frequency (sketched below).

Low-throughput interface (F/W) for SD-3: high slot consumption.
SD-3 production discontinued?!
9840 ???
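The slides do not describe the monitoring tool itself; the following is a minimal sketch of the kind of heuristic it implies, with made-up thresholds and a made-up per-drive soft-error counter source:

```python
# Hypothetical sketch of a soft-error-rate monitor for tape transports,
# in the spirit of the tool mentioned above. Thresholds, drive names and
# the error-counter source are invented for illustration.

from collections import deque

WINDOW = 50                 # number of recent mounts to consider
SOFT_ERRORS_PER_MOUNT = 5   # alarm threshold (made up)

class DriveMonitor:
    def __init__(self, name: str):
        self.name = name
        self.recent = deque(maxlen=WINDOW)   # soft-error counts per mount

    def record_mount(self, soft_errors: int) -> None:
        self.recent.append(soft_errors)

    def at_risk(self) -> bool:
        """Flag the transport when the recent soft-error rate is high."""
        if not self.recent:
            return False
        rate = sum(self.recent) / len(self.recent)
        return rate > SOFT_ERRORS_PER_MOUNT

# Example: feed per-mount soft-error counts (however they are collected).
mon = DriveMonitor("SD-3 transport #4")
for errors in [0, 1, 2, 7, 9, 12, 15]:
    mon.record_mount(errors)
print(mon.name, "at risk:", mon.at_risk())
```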
Issues

Tested the two-tape-layer hierarchies:
– Cartridge-based migration.
– Manually scheduled reclaim.

Work with large files: ~1GB preferable, >200MB tolerable (bundling sketch below).
– Is this true with the 9840 tape transports?

Don't even think about NFS. Wait for DFS/GPFS?
– We use pftp exclusively.
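One common way to respect the large-file preference is to bundle many small files into roughly 1GB archives before they are stored; a minimal sketch, where the tar packaging and the exact target size are assumptions rather than something the slides prescribe:

```python
# Hypothetical sketch: pack many small files into ~1 GB tar bundles so
# that what reaches HPSS is a small number of large files. The target
# size and the tar packaging are assumptions, not part of the slides.

import os
import tarfile
from pathlib import Path

TARGET_BUNDLE_BYTES = 1_000_000_000   # ~1 GB, the preferred file size

def bundle_files(files, out_dir: Path):
    """Group files into consecutive tar archives of roughly 1 GB each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    bundle, size, index = [], 0, 0
    for f in files:
        bundle.append(f)
        size += os.path.getsize(f)
        if size >= TARGET_BUNDLE_BYTES:
            _write(bundle, out_dir / f"bundle_{index:04d}.tar")
            bundle, size, index = [], 0, index + 1
    if bundle:
        _write(bundle, out_dir / f"bundle_{index:04d}.tar")

def _write(members, path: Path):
    with tarfile.open(path, "w") as tar:      # no compression: tape-friendly
        for m in members:
            tar.add(m, arcname=os.path.basename(m))

# The resulting bundles would then be transferred with pftp as usual.
```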
Issues

Guarantee availability of resources for specific user groups:
– Separate PVRs & movers.
– Total exposure to single-machine failure!

Reliability:
– Distribute resources across movers → share movers (acceptable?).
– Inter-mover traffic:
  • 1 CPU per 40MB/sec of TCP/IP per adapter: expensive!!!
Inter-Mover Traffic Solutions

Affinity.
– Limited applicability.
Diskless hierarchies (not for DFS/GPFS).
– Not for SD-3; not enough tests on 9840.
High-performance networking: the SP switch. (This is your friend.)
– IBM only.
Lighter protocol: HIPPI.
– Expensive hardware.
Multiply-attached storage (SAN): most promising! See STK's talk. Requires HPSS modifications.
Summary

HPSS works for us.
Buy an SP2 and the SP switch.
– Simplified administration. Fast interconnect. Ready for GPFS.

Keep an eye on STK's SAN/RAIT.
Avoid SD-3 (not a risk anymore).
Avoid small-file access, at least for the moment.
Thank you!
Razvan Popescu
[email protected]