Transcript haas-20041025

Parallel File Systems
Peter W. Haas
[email protected]
Universität Stuttgart
Höchstleistungsrechenzentrum Stuttgart (HLRS)
www.hlrs.de
DESY, October 25, 2004
Slide 1
Table of Contents
1. Global Parallel File System Developments
– ASCI PathForward File System Strategy
– Lustre, Panasas ActiveScale, IBM SAN FS
– Parallel File Systems w/o Metadata Servers
2. Rationale of HSM Systems
– Storage Design Space
– Future of Tape
– Scalable Global Parallel HSM Systems
3. Layering of Legacy Networks
– IEEE 10Gbase-* Standardization
– Backplane and Data Center Networks
4. Conclusion
Slide 2
Unified Heterogeneous File Systems – for 100+ years
Shannon Filing Cabinet, Schlicht & Field Co.,
Rochester, NY, 1886
http://www.officemuseum.com
Slide 3
SGS File System SOW: Figure 2
A Typical Supercomputer Center (2001)
[Figure: FY01 platform, I/O, and DVC plans. Platform plan: 12.3 TF to GA by October; 3 x 1.3 TF = 3.9 TF. I/O plan: link speeds from 64 to 800 MB/s (e.g. 200, 280, 96, and 70 MB/s paths through the switch) with a 2 TB visualization disk cache; I/O speeds determined by ASCI Apps Milestone requirements in FY00. DVC plan interconnects: TB3 (150 MB/s), Gb Ethernet (125 MB/s), HIPPI (100 MB/s).]
Slide 4
Data Visualization Corridor
SGS File System SOW: Figure 4
Slide 5
SNIA Storage Model
[Figure: SNIA Shared Storage Model. The storage domain spans the application, the file/record subsystem (file system (VFS), database (DBMS)), and the block subsystem with host-based, SN-based, and device-based block aggregation down to the storage devices (disks, ...). The services subsystem covers discovery, monitoring; resource mgmt, configuration; security, billing; redundancy mgmt (backup, ...); and high availability (fail-over, ...). The model is annotated here with Parallel Data Management APIs.]
Copyright 2000, Storage Networking Industry Association
Slide 6
File Systems for Clusters
[Figure: four approaches. Distributed file system (e.g. NFS/CIFS): the server is the bottleneck. Symmetric clustered file system (e.g. GPFS): lock management is the bottleneck. SAN-based file systems (like SANergy): the server is the bottleneck and scale is limited. Asymmetric parallel file system (like PFS) with component servers and a metadata server: the metadata server is the bottleneck.]
Slide 7
Lustre Solution
[Figure: asymmetric cluster file system. A scalable MDS handles object allocation, the OSTs handle block allocation, and clients access the OSTs directly.]
Slide 8
Lustre ("Linux cluster" = Lustre) is a completely new start:
a high-performance, scalable, open distributed file system for clusters and shared-data environments.
• Scalability at inception
– Separation of metadata & file data
– Scalable metadata
– Scalable file data
• Block management at the OST level
– Efficient locking
• Object architecture
• Lustre is open source
– Check www.lustre.org
• Not encumbered by existing architecture
[Figure: clients, an MDS cluster (metadata control, access control, coherence management, security & resource database), and object storage targets (storage management).]
Slide 9
FS Requirements for HPTC
[Figure: a global file space shared by 10,000's of clients over a high-speed interconnect, with parallel I/O. Access patterns pair N or 1 clients with N or 1 files for temp, snapshot, result, and development data.]
Philippe Bernadat: The Lustre File System, 3rd HLRS Workshop on SGPFS, March 22, 2004
Slide 10
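The N-clients-to-one-file pattern shown above is the defining workload for a global parallel file system. The following is a minimal sketch of that pattern in Python, assuming nothing more than a POSIX-style shared mount; the path, block size, and writer count are invented for illustration, and a real HPC code would typically use MPI-IO rather than local processes.

    # Minimal sketch: N writers share one file on a global mount and each
    # writes only its own disjoint byte range (path/sizes are illustrative).
    import os
    from multiprocessing import Pool

    SHARED_FILE = "/lustre/scratch/result.dat"   # hypothetical global path
    BLOCK = 4 * 1024 * 1024                      # 4 MiB per writer
    NPROC = 8                                    # stand-in for N clients

    def write_my_region(rank):
        """Each 'client' writes only its own region; pwrite avoids seek races."""
        payload = bytes([rank % 256]) * BLOCK
        fd = os.open(SHARED_FILE, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            return os.pwrite(fd, payload, rank * BLOCK)
        finally:
            os.close(fd)

    if __name__ == "__main__":
        with Pool(NPROC) as pool:
            written = pool.map(write_my_region, range(NPROC))
        print("wrote", sum(written), "bytes from", NPROC, "writers")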
Lustre Object Storage Model
[Figure: object storage clients exchange metadata control and access control with the metadata servers and obtain configuration from LDAP; file data flows directly between the clients and the object storage targets over the SAN.]
Slide 11
Lustre Components & Interactions
[Figure: the LDAP server holds configuration information and handles network connection & security management. The Lustre client (CFS/OSC) talks to the meta-data server (MDS) for directory operations, meta-data & concurrency, recovery, file status, and file creation, and to the object storage targets (OSTs) for file I/O and file locking.]
Slide 12
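To make the division of labor concrete, here is a toy model in Python of the idea that a client contacts the metadata service once for a file's layout and thereafter computes by itself which storage target serves any byte range. The class names, the path, and the round-robin striping rule are assumptions for the example, not Lustre's actual API or wire protocol.

    # Toy model: one metadata lookup, then pure client-side striping math.
    from dataclasses import dataclass

    @dataclass
    class Layout:
        stripe_size: int        # bytes per stripe chunk
        osts: list              # indices of the OSTs holding the file's objects

    class ToyMDS:
        """Stands in for the MDS: owns namespace and layout, never file data."""
        def __init__(self):
            self._namespace = {"/scratch/run42/out.dat": Layout(1 << 20, [3, 7, 11, 15])}
        def open(self, path):
            return self._namespace[path]

    def ost_for_offset(layout, offset):
        """Map a file offset to (OST index, offset inside that OST's object)."""
        chunk = offset // layout.stripe_size
        ost = layout.osts[chunk % len(layout.osts)]
        obj_off = (chunk // len(layout.osts)) * layout.stripe_size + offset % layout.stripe_size
        return ost, obj_off

    layout = ToyMDS().open("/scratch/run42/out.dat")     # one metadata request
    for off in (0, 1 << 20, 5 << 20):                    # then no MDS involvement
        print(off, "->", ost_for_offset(layout, off))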
Bringing It All Together: Request Processing
[Figure: Lustre software stack. The client file system (CFS) combines a metadata writeback cache, the logical object volume (LOV) with its OSCs, an MDC, and a lock client, layered over the Portal library, Portal NALs, the NIO API, and network fabric devices (Elan, TCP, ...). System & parallel file I/O and file locking go to the OSTs, each with networking, a lock server and lock client, recovery, and an object-based disk (OBD) server over an Ext3, Reiser, or XFS file system on SAN or Fibre Channel storage. Directory, metadata & concurrency traffic (recovery, file status, file creation) goes to the MDS, which adds load balancing, an MDS server, its own lock server and client, and recovery, again over Ext3, Reiser, or XFS on SAN or Fibre Channel.]
Slide 13
Recent Results
• File I/O as % of raw bandwidth: >90%
• Achieved OST throughput: 270 MB/s
• Achieved client I/O: 260 MB/s
• GigE end-to-end throughput: 120 MB/s
• Aggregate I/O, 1,000 clients: 11.1 GB/s
• Attribute retrieval rate in a 10M-file directory, 1,000 clients: 7,500/s
• Creation rate, one directory, 1,000 clients: 5,000/s
* Performance measurements made in November 2003 on production clusters at Lawrence Livermore National Laboratory and the National Center for Supercomputing Applications.
Slide 14
Clusters Demand a New Storage Architecture
Garth Gibson: ActiveScale Storage Cluster, 3rd HLRS Workshop on SGPFS, March 22, 2004
Slide 15
New Object Storage Architecture
Slide 16
Object Access Example
Slide 17
Object Storage Bandwidth
Slide 18
How does an Object Scale?
Slide 19
Additional Strengths of Object Model
Slide 20
Object Storage Access Security
Slide 21
Standardization Timeline
Slide 22
ActiveScale SW Architecture
Slide 23
Storage Tank (SAN FS)
David Pease: Storage Tank Research Directions, 3rd HLRS Workshop on SGPFS, March 22, 2004
Slide 24
Storage Tank Overview
Storage Tank is a distributed file system that provides:
• High-performance, heterogeneous file sharing using Storage Area Network (SAN) technology
• Automated, policy-based storage and data management facilities (like those provided on mainframes); a toy placement example follows after this slide
• Virtually unlimited SAN-wide scalability
• Enterprise-level availability
• IBM TotalStorage SAN File System (SAN FS)
Slide 25
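The toy example below illustrates the flavor of policy-based placement mentioned above: first-match rules map a file's name and size hint to a storage pool at creation time. The pool names and rule format are invented and are not SAN FS's actual policy language.

    # Illustrative first-match placement rules: (name pattern, min size, pool).
    from fnmatch import fnmatch

    PLACEMENT_RULES = [
        ("*.ckpt", 0,       "scratch_pool"),
        ("*.mpeg", 0,       "media_pool"),
        ("*",      1 << 30, "capacity_pool"),   # anything of 1 GiB or more
        ("*",      0,       "default_pool"),
    ]

    def choose_pool(name, size_hint):
        for pattern, min_size, pool in PLACEMENT_RULES:
            if fnmatch(name, pattern) and size_hint >= min_size:
                return pool
        return "default_pool"

    for f, s in [("job.ckpt", 0), ("viz.mpeg", 0), ("archive.tar", 2 << 30), ("notes.txt", 512)]:
        print(f, "->", choose_pool(f, s))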
Storage Tank Overview
[Figure: AIX, Solaris, HP/UX, and Linux clients (VFS with cache) and Windows 2000 clients (IFS with cache) talk to the Storage Tank server cluster (metadata servers with a metadata store) over an IP network for client/metadata cluster communications; NFS/CIFS external clients and an admin client attach to the same network. File data flows over the storage network to shared storage devices organized into multiple storage pools (the data store).]
Slide 26
Object-based Storage
• Concept
– Storage (controller or disk) manages "objects"
– One file maps (roughly) to one object
– File system does I/O to objects, rather than blocks
• Security
– Object has a security key
– Key is shared between metadata server and storage device
– MDS gives a non-forgeable credential to the client
– All allocation and I/O done under the credential (a minimal sketch follows below)
• Scalability
– Space allocation offloaded to the storage subsystem
• Copy Services
– Potential for more intelligent copy services
Slide 27
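A minimal sketch of the credential scheme outlined in the Security bullets, assuming the metadata server and the storage device already share a per-object secret (key distribution is out of scope here): the MDS signs a capability, the client presents it with each request, and the device verifies it locally. The field names and encoding are invented and do not follow the OSD standard's actual capability format.

    # HMAC-signed capability: the client can use it but cannot forge or alter it.
    import hmac, hashlib, time

    SHARED_KEY = b"per-object-secret-shared-by-MDS-and-device"   # assumed out-of-band setup

    def mds_issue_capability(object_id, rights, ttl_s=300):
        cap = "%s|%s|%d" % (object_id, rights, int(time.time()) + ttl_s)
        tag = hmac.new(SHARED_KEY, cap.encode(), hashlib.sha256).hexdigest()
        return cap, tag                      # handed to the client by the MDS

    def device_check(cap, tag, object_id, op):
        good = hmac.new(SHARED_KEY, cap.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(tag, good):
            return False                     # forged or altered capability
        obj, rights, expiry = cap.split("|")
        return obj == object_id and op in rights and time.time() < int(expiry)

    cap, tag = mds_issue_capability("obj-001742", rights="rw")
    print(device_check(cap, tag, "obj-001742", "w"))   # True: I/O allowed under credential
    print(device_check(cap, tag, "obj-999999", "w"))   # False: wrong object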
Storage Tank Object Store Extensions
Slide 28
Distributed/Grid Tank
[Figure: a local Tank cluster (MDS cluster with DST features) serves ST client hosts over LAN and SAN; clients always access files through their local ST cluster. File name and location resolution uses a Virtual Filesystem Directory Service and a Replica Location Service; DST caches/replicates heterogeneous or remote data locally, with cache and replica management using DFMAPI.]
• Remote Tank: remote Tank clusters are accessed through a local Tank cluster via an inter-Tank protocol, providing local SAN performance and single-site semantics.
• Grid-based storage: DST allows Data Grids to be accessed through a local Tank cluster, utilizing Grid protocols and services (e.g. GridFTP over WAN/LAN) to resolve file names and file locations.
• NAS: existing data stored in NAS can be accessed through DST, protecting the investment in legacy NAS hardware and protocols; this can be extended to support smooth migration of data from NAS to ST.
Slide 29
SGS File System SOW: Table 5
File System Capacities

                                  1999             2002             2005             2008
Teraflops                         3.9              30               100              400
Memory size (TB)                  2.6              13-20            32-67            44-167
File system size (TB)             75               200-600          500-2,000        3,000-20,000
Number of client tasks            6144 to 8192     8192 to 16384    8192 to 32768    8192 to 65536
  [see RATIONALE point (1)]
Number of users                   1,000            3,000            3,500            3,500
Number of directories             5.0*10^6         1.5*10^7         1.8*10^7         1.8*10^7
  [see RATIONALE point (2)]
Number of devices/subsystem       5000             3250-10000       2084-8375        1,350-8750
  [see RATIONALE point (3)]       (18 GB drives)   (72 GB drives)   (300 GB drives)  (1200 GB drives)
Number of files                   7.5*10^7         3.75*10^8        4.5*10^8         4.5*10^8
  [see RATIONALE point (4)]       to xxxxx         to xxxx          to xxxx          to xxxx
Slide 30
Advanced Areal Density Trends
[Figure: areal density (Gb/in^2, log scale from 0.001 to 1,000,000) versus GA year (1987-2022) for atom surface density / atom-level storage, probe contact area viability, magnetic disk, tape demos, optical disk, serpentine longitudinal tape, helical tape, and parallel-track longitudinal tape. Source: M. Leonhardt, 4-9-02.]
Slide 31
Storage Cost
Storage Subsystem Price Trends
(OEM price/equivalent unless otherwise noted; no capacity compression or utilization factors)
[Figure: price (OEM/integrator, $/GB, log scale from 0.01 to 1000) versus GA year (1994-2008) for enterprise-class disk subsystems (capacity and performance), performance disk drives (IDC), desktop disk drives (IDC), optical disk subsystems (IDC), tape drives, tape drives + 100 tapes, and tape media only, with average and low curves.]
Slide 32
There are 50+ HPSS installations at 20+ organizations throughout the world.
Slide 33
Slide 34
HPSS Architecture (based on HPSS 6)
• Shared, secure global file system
• Metadata-mediated via a database based on IBM DB2
• Highly distributed
• Multiple movers and subsystems for scalability (see the sketch after this slide)
• Parallel disk and tape I/O
• API for maximum control and performance
• Parallel FTP (pftp)
• Proven petabyte capability with no identified upper limits
• Joint IBM and DOE sponsorship assures both COTS and a large (30-person) dev and test staff
[Figure: client computers and disk arrays connect over the LAN and SAN to the core server with its metadata disks, a backup core server, tape-disk movers, and robotic tape libraries.]
Slide 35
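As a generic illustration of why multiple movers raise throughput (this is not the HPSS mover protocol), the sketch below cuts one transfer into stripes and lets several workers handle them concurrently; the source path and stripe size are assumptions.

    # Stripe one large transfer across several concurrent 'movers'.
    import os
    from concurrent.futures import ThreadPoolExecutor

    SRC = "/gpfs/scratch/big.dat"        # hypothetical source on the parallel FS
    STRIPE = 64 * 1024 * 1024            # 64 MiB per mover request
    MOVERS = 4

    def move_stripe(work):
        offset, length = work
        with open(SRC, "rb") as f:
            f.seek(offset)
            data = f.read(length)
        # a real mover would now write 'data' to its disk or tape device
        return len(data)

    def parallel_transfer():
        size = os.path.getsize(SRC)
        stripes = [(off, min(STRIPE, size - off)) for off in range(0, size, STRIPE)]
        with ThreadPoolExecutor(max_workers=MOVERS) as pool:
            moved = sum(pool.map(move_stripe, stripes))
        print("moved", moved, "of", size, "bytes with", MOVERS, "movers")

    if __name__ == "__main__":
        parallel_transfer()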
LLNL ASCI Case for a Hierarchical SAN File System
with an Explicit Interface between Lustre & HPSS
• LLNL policy is that hierarchical and archival storage is too massive to be left to empirical algorithms such as least recently referenced or age
– This precludes using the DMAPI (X/Open XDSM) approaches used in most commercial archives such as ADIC SNMS and Veritas Storage Manager
– Also, parallel and concurrent data transfers are required for archive throughput, and this is inconsistent with DMAPI
• LLNL prefers to manage the archiving policy with a site-developed agent that is aware of exactly what vital data needs to be archived and exactly when (a sketch follows below)
– This also allows creation of "containers" of small files to save metadata entries in the archive and to reduce interfile gaps on tape
– Non-critical data is never archived
Slide 36
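A hedged sketch of such a site-provided agent: the application hands it an explicit list of vital files, large files are passed through unchanged, and small files are bundled into a single tar "container" so the archive sees one object instead of many. The size threshold, paths, and the hand-off to the parallel movers are assumptions, not LLNL's actual agent.

    # Bundle small vital files into one container; big files go as-is.
    import os, tarfile

    SMALL_FILE_LIMIT = 8 * 1024 * 1024       # below this, files go into a container

    def build_archive_batch(vital_files, container_path="/tmp/archive_container.tar"):
        """Split an explicit list of vital files into big files (archived as-is)
        and one tar container holding all the small ones."""
        big, small = [], []
        for path in vital_files:
            (big if os.path.getsize(path) >= SMALL_FILE_LIMIT else small).append(path)
        if small:
            with tarfile.open(container_path, "w") as tar:
                for path in small:
                    tar.add(path)
            big.append(container_path)
        return big        # hand this list to the parallel movers (e.g. via pftp)

    # Usage: the application, not an age-based daemon, decides what is vital:
    # batch = build_archive_batch(["/lustre/run42/restart.000", "/lustre/run42/run.log"])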
ASCI Purple Data Movement Vision
[Figure: a capability or capacity platform (application, Lustre client, Lustre disk), HPSS, and an archive agent.
- HPSS parallel data movers open, read, and write Lustre files using conventional Unix semantics.
- Lustre is an open-source shared global file system in development by DOE, HP, and others.
- A site-provided agent controls migration based on file content and not on empirical data.]
• Simplicity (configuration, equipment expenditures, networking)
• Performance potential
• Minimize disk cache
Slide 37
Detail Showing HPSS Parallel Data Movers with PLFM Capability
[Figure: the application on a capability or capacity platform, HPSS with its movers, and several HPSS PLFMs attached to SAN disk. PLFMs can reside on any platform connected to the SGFS. The PFTP client provides the user interface and orchestrates the transfer.]
Slide 38
Lustre Native HSM Capability
[Figure: the client cluster and MDS (file metadata, file location data, OST configuration) reach OSTs O1..On and their block allocation data over the interconnect; the HSM mirrors them with near-line OSTs nO1..nOn and an nMDS on a second interconnect.]
1. Mirrored nMDS with featured storage classes.
2. The nMDS becomes a metadata store of its own.
Slide 39
Lustre: Attachment of Industry-Standard HSMs
[Figure: the client cluster and MDS (file metadata, file location data, OST configuration) reach OSTs O1..On and their block allocation data over the interconnect; the OSTs are backed by a commercial HSM consisting of TSM1..TSMn instances.]
Slide 40
IEEE 10GBase-* Standardization
Slide 41
Ethernet Backplane (Study Group)
Slide 42
Data Center Ethernet (CFI)
Facilitation of Ethernet Clustering
Slide 43
Data Center Ethernet (CFI)
Slide 44
Data Center Ethernet (CFI)
Slide 45
DWDM-GBIC: Optical Data Sheet
Product numbers: DWDM-GBIC-XX.XX (32 different SKUs, one per wavelength)
Equipment side: standard GBIC interface
Network side: dual SC connector (power consumption 1.8 W)

Optical Parameter                     Value               Unit
Receiver wavelength range             1260 - 1620         nm
Transmitter wavelength range          1530.33 - 1560.61   nm
Transmitter power range               0 - 3               dBm
Receiver power sensitivity (1)        -28                 dBm
Receiver OSNR sensitivity (2)         20                  dB
Receiver power overload               -7                  dBm
Chromatic dispersion tolerance (3)    (-1000)/+3600       ps/nm
Dispersion power penalty (4)          3                   dB
Dispersion OSNR penalty (4)           0                   dB

1: Measured at a 10^-12 BER and with OSNR 20 dB @ 0.1 nm RBW
2: Measured with a 0.1 nm resolution bandwidth (RBW)
3: Equivalent to 200 km of G.652 fiber with an 18 ps/(nm*km) dispersion coefficient
4: Measured at 3600 ps/nm of dispersion and with OSNR 20 dB @ 0.1 nm RBW
Slide 46
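For orientation, the short calculation below turns the data-sheet values into a worst-case optical power budget; the fiber attenuation figure is a typical assumption for G.652 fiber at 1550 nm and is not taken from the data sheet.

    # Worst-case link budget from the table above (attenuation is an assumption).
    TX_POWER_MIN_DBM = 0.0        # transmitter power range 0 - 3 dBm, worst case
    RX_SENSITIVITY_DBM = -28.0    # receiver power sensitivity
    DISPERSION_PENALTY_DB = 3.0   # dispersion power penalty at 3600 ps/nm
    FIBER_LOSS_DB_PER_KM = 0.25   # assumed G.652 attenuation at 1550 nm

    budget_db = TX_POWER_MIN_DBM - RX_SENSITIVITY_DBM - DISPERSION_PENALTY_DB
    reach_km = budget_db / FIBER_LOSS_DB_PER_KM

    print("worst-case power budget: %.1f dB" % budget_db)          # 25.0 dB
    print("attenuation-limited reach: about %.0f km" % reach_km)   # the dispersion
    # tolerance itself corresponds to about 200 km of G.652 fiber (footnote 3)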
High Performance at Its Time
Arie van Praag, HNF-Europe, [email protected]
This tower can be seen in Jaffa, Israel, slightly north of the old harbor. It is not a minaret, as many people say, but an ORIGINAL FIRE TOWER, and very old (ca. 300 BC). There is no mosque under it but an Arabic restaurant.
Distance between towers: about 5 to 20 km
Bandwidth: 0.02 Baud
Remark: faster than a running slave
Fires on towers were also used along the Dutch and Belgian coast, around AD 800-1000, to warn against the invasions of the Vikings.
Slide 47
Conclusions on Parallel File and HSM Systems
Parallel File System Considerations
• Uniform Global Name Space
• No limitations with respect to the number of servers, clients, or storage subsystems
• Security is primary concern: Kerberos and/or PKI authentication
• Authorization via Access Control Lists: NFS V4, NTFS
• Parallel Metadata Servers required for HA and performance
HSM Systems
• Either part of a parallel file system, or attached via a parallel interface
• X/Open XDSM and XBSA specify interfaces between file systems and storage management applications (HSM, backup)
• HSM should enable direct and partial access to files on any level
• HSM configuration should be based on an IP SAN
• Long-term repositories require HSMs with independent GNS
Slide 48
For details, please refer to the
3rd HLRS Workshop on SGPFS
at URL:
http://www.hlrs.de/news-events/events/2004/hwwws
User Name: hww3gfs
Password: paragon
Thank You !
Slide 49