A Survey of Clustered Parallel File Systems for High Performance Computing Clusters
James W. Barker, Ph.D.
Los Alamos National Laboratory
Computer, Computational and Statistical Sciences Division
7/17/2015
Definition of Terms
Distributed File System - The generic term for a client/server or
"network" file system where the data is not locally attached to a host.
Network File System (NFS) is the most common distributed file system
currently in use.
Storage Area Network (SAN) File System – Provides a means for hosts to share
Fibre Channel storage, which is traditionally separated into private physical
areas bound to different hosts. A block-level metadata manager manages access
to different SAN devices. A SAN file system mounts storage natively on only
one node and connects all other nodes to that storage by distributing the
block address of that storage to all other nodes.
Scalability is often an issue due to the significant workload required of
the metadata managers and the large network transactions required in
order to access data.
Examples include: IBM’s General Parallel File System (GPFS) and
Sistina (now Red Hat) Global File System (GFS)
Definition of Terms
Symmetric File Systems - A symmetric file system is one in
which the clients also host the metadata manager code, resulting in
all nodes understanding the disk structures.
A concern with these systems is the burden that metadata management
places on the client node, serving both itself and other nodes, which can
impact the ability of the client node to perform its intended
computational jobs.
Examples include IBM’s GPFS and Red Hat GFS
Asymmetric File Systems - An asymmetric file system is a file
system in which there are one or more dedicated metadata
managers that maintain the file system and its associated disk
structures.
Examples include Panasas ActiveScale, Lustre and traditional NFS file
systems.
Definition of Terms
Cluster File System - a distributed file system that is not a single server
with a set of clients, but a cluster of servers that all work together to provide
high performance storage service to their clients.
To the clients, the cluster file system is transparent; it is simply "the file system,"
but the file system software manages distributing requests to elements of the
storage cluster.
Examples include: Hewlett-Packard Tru64 cluster and Panasas ActiveScale
Parallel File System - a parallel file system is one in which data blocks
are striped, in parallel, across multiple storage devices on multiple storage
servers. Support for parallel applications is provided, allowing all nodes to
access the same files at the same time and thus providing concurrent read
and write capability (a striping sketch follows the examples below).
Network Link Aggregation, another parallel file system technique, is the
technology used by PVFS2, in which the I/O is spread across several network
connections in parallel, each packet taking a different link path from the previous
packet.
Examples of this include: Panasas ActiveScale, Lustre, PVFS2, GPFS and GFS.
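To make the striping idea concrete, here is a minimal C sketch of generic round-robin striping arithmetic. It is an illustration of the concept only, not the layout policy of any of the file systems named above; the stripe unit size and server count are assumed values.

#include <stdint.h>
#include <stdio.h>

/* Generic round-robin striping arithmetic: map a byte offset in a file to
 * the server that holds it and the offset within that server's piece. */
struct stripe_loc {
    uint32_t server;
    uint64_t local_offset;
};

static struct stripe_loc locate(uint64_t file_offset,
                                uint64_t stripe_unit, uint32_t n_servers)
{
    uint64_t block = file_offset / stripe_unit;      /* which stripe unit    */
    struct stripe_loc loc;
    loc.server = (uint32_t)(block % n_servers);      /* round-robin server   */
    loc.local_offset = (block / n_servers) * stripe_unit
                       + file_offset % stripe_unit;  /* offset on that server */
    return loc;
}

int main(void)
{
    /* Example: a 64 KiB stripe unit across 4 servers. */
    struct stripe_loc loc = locate(3 * 65536 + 100, 65536, 4);
    printf("server %u, local offset %llu\n",
           loc.server, (unsigned long long)loc.local_offset);
    return 0;
}

With these numbers, the fourth stripe unit of the file lands on server 3 at local offset 100, so a large sequential read touches all four servers in turn.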
Definition of Terms
An important note: all of the above definitions overlap. A SAN file
system can be symmetric or asymmetric. Its servers may be
clustered or single servers. And it may support parallel applications
or it may not.
For example, the Panasas Storage Cluster and its ActiveScale File
System (a.k.a. PanFS) is a clustered (many servers share the work),
asymmetric (metadata management does not occur on the clients),
parallel (supports concurrent reads and writes), object-based (not
block-based), distributed (clients access storage via the network) file system.
Another example: the Lustre File System is also a clustered,
asymmetric, parallel, object-based (objects are referred to as targets by Lustre),
distributed file system.
Another example: the Parallel Virtual File System 2 (PVFS2) is a
clustered, symmetric, parallel, aggregation-based, distributed file
system.
And finally, the Red Hat Global File System (GFS) is a clustered,
symmetric, parallel, block-based, distributed file system.
Object Storage Components
An Object contains the data and enough additional information to allow the
data to be autonomous and self-managing.
An Object-based Storage Device (OSD) is an intelligent evolution of the disk
drive, capable of storing and serving objects rather than simply copying data
to tracks and sectors. (The term OSD does not exist in Lustre; the Panasas
term OSD corresponds to the Lustre term OST.)
An Object-based Storage Target (OST) is an abstraction layer above the physical
blocks of a physical disk (in Panasas terminology, not in Lustre).
An Object-Based Disk (OBD) is an abstraction of the physical blocks of the
physical disks (in Lustre terminology; OBDs do not exist in Panasas
terminology).
An Installable File System (IFS) integrates with compute nodes, accepts
POSIX file system commands and data from the operating system,
addresses the OSDs directly and stripes the objects across multiple OSDs.
A Metadata Server mediates among the multiple compute nodes in the
environment, allowing them to share data while maintaining cache
consistency on all nodes.
The Network Fabric ties the compute nodes to the OSDs and metadata
servers.
Storage Objects
Each file or directory can be thought of as an object. As
with all objects, storage objects have attributes.
Each storage object attribute can be assigned a value
such as file type, file location, whether the data is striped
or not, ownership, and permissions.
An object storage device (OSD) allows us to specify for each file
where to store the blocks allocated to the file, via a metadata
server and object storage targets.
Extending the storage attributes further, it is also possible to
specify how many object storage targets to stripe onto
and what level of redundancy to employ on each target.
Some implementations (Panasas) allow the specification of
RAID 0 (striped) or RAID 1 (mirrored) on a per-file basis.
Panasas
Within the storage device, all objects are accessed via a 96-bit
object ID, together with the beginning of the range of bytes inside
the object and the length of the byte range that is of interest
(<objectID, offset, length>; a sketch follows the list below).
There are three different types of objects:
The “Root” object on the storage
device identifies the storage
device and various attributes of
the device; including total capacity
and available capacity.
A “Group” object provides a
“directory” to a logical subset of
the objects on the storage device.
A ”User” object contains the actual
application data to be stored.
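As a rough C illustration of the addressing just described, an object request carries a 96-bit object ID plus a byte range, and every object is one of the three types above. The type and field names are invented for this sketch; this is not the actual OSD command set.

#include <stdint.h>

/* Illustrative only: the <objectID, offset, length> access tuple and the
 * three object types described above. Names are hypothetical. */
enum object_type { OBJECT_ROOT, OBJECT_GROUP, OBJECT_USER };

struct object_request {
    uint8_t  object_id[12];   /* 96-bit object ID */
    uint64_t offset;          /* start of the byte range inside the object */
    uint64_t length;          /* length of the byte range of interest */
};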
Panasas
The “User” object is a container for data and two types of attributes:
Application Data is essentially the equivalent of the data that a file would
normally have in a conventional file system. It is accessed with file-like
commands such as Open, Close, Read and Write.
Storage Attributes are used by the storage device to manage the block
allocation for the data. This includes the object ID, block pointers,
logical length and capacity used. This is similar to the inode-level
attributes inside a traditional file system.
User Attributes are opaque to the storage device and are used by
applications and metadata managers to store higher-level information
about the object.
These attributes can include: file system attributes such as ownership and
access control lists (ACLs), Quality of Service requirements that apply to a
specific object, and how the storage system treats a specific object (i.e., what
level of RAID to apply, the size of the user's quota, or the performance
characteristics required for that data).
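A minimal C sketch of the two attribute classes above; the field names are purely illustrative and are not Panasas's on-disk format.

#include <stdint.h>

/* Hypothetical sketch of a "User" object's attribute classes. */
struct storage_attributes {      /* managed by the storage device itself */
    uint8_t  object_id[12];      /* 96-bit object ID */
    uint64_t logical_length;
    uint64_t capacity_used;
    uint64_t block_pointers[16]; /* inode-like block allocation metadata */
};

struct user_attributes {         /* opaque to the device; interpreted by
                                    applications and metadata managers */
    uint32_t owner;
    uint8_t  acl[64];            /* access control list blob */
    uint8_t  raid_level;         /* e.g., RAID 0 or RAID 1 for this object */
    uint64_t quota_bytes;
    uint32_t qos_class;          /* Quality of Service requirements */
};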
Panasas
The Panasas concept of object
storage is implemented
entirely in hardware.
The Panasas ActiveScale File
System supports two modes of
data access:
DirectFLOW is an out-of-band
solution enabling Linux cluster
nodes to directly access data
on StorageBlades in parallel.
NFS/CIFS operates in band,
utilizing the DirectorBlades as
a gateway between NFS/CIFS
clients and StorageBlades.
Panasas Performance
Random I/O – on the SPECsfs97_R1.v3 benchmark, as measured by the Standard
Performance Evaluation Corporation (www.spec.org), a Panasas
ActiveScale storage cluster produced a peak of 305,805 random I/O
operations per second.
Data Throughput – as measured "in-house" by Panasas, a
similarly configured cluster delivered a sustained 10.1 GBytes/second
on sequential I/O read tests.
Lustre
Lustre is an open, standards-based technology that runs on commodity
hardware and uses object-based disks for storage and metadata servers for
file system metadata.
Replicated, failover MetaData Servers (MDSs) maintain a transactional
record of high-level file and file system changes.
Distributed Object Storage Targets (OSTs) are responsible for actual file
system I/O and for interfacing with storage devices.
This design provides an efficient division of labor between computing and
storage resources.
File operations bypass the metadata server completely and utilize the parallel
data paths to all OSTs in the cluster.
Lustre’s approach of separating metadata operations from data operations
results in enhanced performance.
The division of metadata and data operations creates a scalable file system with
greater recoverability from failure conditions by providing the advantages of both
journaling and distributed file systems.
Lustre
Lustre supports strong file and metadata locking
semantics to maintain coherency of the file systems
even under a high volume of concurrent access.
File locking is distributed across the Object Storage
Targets (OSTs) that constitute the file system, with each
OST managing locks for the objects that it stores.
Lustre uses an open networking stack composed of
three layers:
At the top of the stack is the Lustre request processing layer.
Beneath the Lustre request processing layer is the Portals API,
developed by Sandia National Laboratories.
At the bottom of the stack is the Network Abstraction Layer
(NAL) which is intended to provide out-of-the-box support for
multiple types of networks.
Lustre
Lustre provides security in the form of authentication,
authorization and privacy by leveraging existing security
systems.
This eases incorporation of Lustre into existing enterprise
security environments without requiring changes to Lustre.
Similarly, Lustre leverages the underlying journaling file
systems provided by Linux.
These journaling file systems enable persistent state recovery,
providing resiliency and recoverability from failed OSTs.
Finally, Lustre's configuration and state information is
recorded and managed using open standards such as
XML and LDAP,
easing the task of integrating Lustre with existing environments
or third-party tools.
Lustre
Lustre technology is
designed to scale while
maintaining resiliency.
As servers are added to a
typical cluster environment,
failures become more likely
due to the increasing
number of physical
components.
Lustre’s support for
resilient, redundant
hardware provides
protection from inevitable
hardware failures through
transparent failover and
recovery.
Lustre File System Abstractions
The Lustre file system provides several
abstractions designed to improve both
performance and scalability.
At the file system level, Lustre treats files
as objects that are located through
Metadata Servers (MDSs).
Metadata Servers support all file system
namespace operations, including file lookups, file
creation, and file and directory attribute
manipulation, as well as directing actual
file I/O requests to Object Storage Targets
(OSTs), which manage the storage that is
physically located on underlying Object-Based Disks (OBDs).
Metadata servers maintain a transactional
record of file system metadata changes
and cluster status, as well as supporting
failover operations.
Lustre Inodes, OSTs & OBDs
Like traditional file systems, the Lustre file system has a
unique inode for every regular file, directory, symbolic
link, and special file.
Creating a new file causes the client to contact a metadata
server, which creates an inode for the file and then contacts the
OSTs to create objects that will actually hold file data.
The objects allocated on OSTs hold the data associated with the
file and can be striped across several OSTs in a RAID pattern.
Within the OST, data is actually read and written to underlying
storage known as Object-Based Disks (OBDs).
Subsequent I/O to the newly created file is done directly between
the client and the OST, which interacts with the underlying OBDs
to read and write data.
Metadata for the objects is held in the inode as extended attributes
for the file.
The metadata server is only updated when additional namespace
changes associated with the new file are required.
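Because this flow is hidden behind the standard POSIX interface, a plain C open/write sequence is all a client application sees. The sketch below maps each call to the steps described above; the comments reflect this document's description of the flow, not traced Lustre internals, and the mount point is an assumed example.

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* open(O_CREAT): the client contacts a metadata server, which creates
     * an inode for the file and contacts OSTs to create the objects that
     * will actually hold the file data. */
    int fd = open("/mnt/lustre/results.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return 1;

    /* write(): data moves directly between this client and the OSTs,
     * which read and write the underlying OBDs; the metadata server is
     * not on this path and is updated only for further namespace changes. */
    const char buf[] = "simulation output\n";
    if (write(fd, buf, sizeof buf - 1) < 0) {
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}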
Lustre Network Independence
Lustre can be used over a wide variety of
networks due to its use of an open
Network Abstraction Layer. Lustre is
currently in use over TCP and Quadrics
(QSWNet) networks.
Myrinet, Fibre Channel, Stargen and
InfiniBand support are under development.
Lustre's network neutrality enables it to
quickly take advantage of performance
improvements provided by new network
hardware and protocols.
Lustre provides unique support for
heterogeneous networks.
For example, it is possible to connect some
clients over an Ethernet to the MDS and
OST servers, and others over a QSW
network, in a single installation.
Lustre
One drawback to Lustre
is that a Lustre client
cannot run on a server
that is providing OSTs.
Lustre has not been
ported to UNIX or
Windows operating
systems.
Lustre clients can and
probably will be
implemented on non-Linux
platforms, but as of this
date, Lustre is available
only on Linux.
Lustre Performance
Hewlett-Packard (HP) and Pacific Northwest National
Laboratory (PNNL) have partnered on the design,
installation, integration and support of one of the top 10
fastest computing clusters in the world.
The HP Linux super cluster, with more than 1,800
Itanium® 2 processors, is rated at more than 11
TFLOPS.
PNNL has run Lustre for more than a year and currently
sustains over 3.2 GB/s of bandwidth running production
loads on a 53-terabyte Lustre-based file share.
Individual Linux clients are able to write data to the parallel
Lustre servers at more than 650 MB/s.
Lustre Summary
Lustre is a storage architecture and distributed file system that
provides significant performance, scalability, and flexibility to
computing clusters.
Lustre uses an object storage model for file I/O and storage
management to provide an efficient division of labor between
computing and storage resources.
Replicated, failover Metadata Servers (MDSs) maintain a transactional
record of high-level file and file system changes.
Distributed Object Storage Targets (OSTs) are responsible for actual file
system I/O and for interfacing with local or networked storage devices
known as Object-Based Disks (OBDs).
Lustre leverages open standards such as Linux, XML, LDAP, readily
available open source libraries, and existing file systems to provide
a scalable, reliable distributed file system.
Lustre uses failover, replication, and recovery techniques to
minimize downtime and to maximize file system availability, thereby
maximizing cluster productivity.
Storage Aggregation
Rather than providing scalable performance by striping
data across dedicated storage devices, storage
aggregation provides scalable capacity by utilizing
available storage blocks on each compute node.
Each compute node runs a server daemon that provides
access to free space on the local disks.
Additional software runs on each client node that combines
those available blocks into a virtual device and provides locking
and concurrent access to the other compute nodes.
Each compute node could potentially be both a server of blocks and a
client. Using storage aggregation on a large (>1,000 node)
cluster, tens of terabytes of free storage could potentially be made
available for use as high-performance temporary space.
Parallel Virtual File System (PVFS2)
Parallel Virtual File System 2 (PVFS2) is an open source
project from Clemson University that provides a
lightweight server daemon to provide simultaneous
access to storage devices from hundreds to thousands
of clients.
Each node in the cluster can be a server, a client, or
both.
Since storage servers can also be clients, PVFS2
supports striping data across all available storage
devices in the cluster (i.e., storage aggregation).
PVFS2 is best suited for providing large, fast temporary storage.
Parallel Virtual File System (PVFS2)
PVFS2 implicitly maintains consistency by carefully
structuring metadata and the namespace.
It uses relaxed semantics,
defining the data access semantics that
can be achieved without locking.
Parallel Virtual File System (PVFS2)
PVFS2 shows that it is possible to build a
parallel file system that implicitly maintains
consistency by carefully structuring the metadata
and name space and by defining the semantics
of data access that can be achieved without
locking.
This design leads to file system behavior that
some traditional applications do not expect.
These relaxed semantics are not new in the field of
parallel I/O. PVFS2 closely implements the
semantics dictated by MPI-IO.
Parallel Virtual File System (PVFS2)
PVFS2 also has native support
for flexible noncontiguous data
access patterns.
For example, imagine an
application that reads a
column of elements out of an
array. To retrieve this data, the
application might issue a large
number of small and scattered
reads to the file system.
However, if it could ask the file
system for all of the
noncontiguous elements in a
single operation, both the file
system and the application
could perform more efficiently.
PVFS2 Stateless Architecture
PVFS2 is designed around a stateless architecture.
PVFS2 servers do not keep track of typical file system
bookkeeping information such as which files have been opened,
file positions, and so on.
There is also no shared lock state to manage.
The major advantage of a stateless architecture is that
clients can fail and resume without disturbing the system
as a whole.
It also allows PVFS2 to scale to hundreds of servers and
thousands of clients without being impacted by the
overhead and complexity of tracking file state or locking
information associated with these clients.
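One way to picture "stateless" is that every client request is self-describing: it carries everything the server needs, so no per-client open-file table or file position is kept on the server. The sketch below is hypothetical and the field names are invented; it is not PVFS2's actual wire format.

#include <stdint.h>

/* Hypothetical, self-describing I/O request: the server needs no memory
 * of prior requests from this client. Names are invented for this sketch. */
struct pvfs2_like_io_request {
    uint64_t file_handle;   /* which file system object */
    uint64_t offset;        /* where to read or write -- supplied each time */
    uint64_t size;          /* how many bytes */
    uint8_t  is_write;      /* operation type */
};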
PVFS2 Design Choices
These design choices enable PVFS2 to perform well in a
parallel environment, but not so well if treated as a local
file system.
Without client-side caching of metadata, status operations
typically take a long time, as the information is retrieved over the
network. This can make programs like “ls” take longer to
complete than might be expected.
PVFS2 is better suited for I/O intensive applications,
rather than for hosting a home directory.
PVFS2 is optimized for efficient reading and writing of large
amounts of data, and thus it’s very well suited for scientific
applications.
PVFS2 Components
The basic PVFS2 package consists of three
components: a server, a client, and a kernel
module.
The server runs on nodes that store either file system
data or metadata.
The client and the kernel module are used by nodes
that actively store or retrieve the data (or metadata)
from the PVFS2 servers.
Unlike the original PVFS, each PVFS2 server
can operate as a data server, a metadata server,
or both simultaneously.
Accessing PVFS2 File Systems
Two methods are provided for accessing PVFS2 file
systems.
The first is to mount the PVFS2 file system. This lets the user
change and list directories, move files, and execute
binaries from the file system.
This mechanism introduces some performance overhead but is the
most convenient way to access the file system interactively.
Scientific applications use the second method, MPI-IO.
The MPI-IO interface helps optimize access to single files by many
processes on different nodes. It also provides "noncontiguous"
access operations that allow for efficient access to data spread
throughout the file.
For a strided pattern such as the column example described earlier, this is
done by asking for every eighth element, starting at offset 0 and ending at
offset 56, all as one file system operation (a sketch follows below).
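A minimal MPI-IO sketch of that strided request, written in C. The file name and the choice of double-precision elements are assumptions made for this example; the point is that one file view plus one collective read fetches all eight scattered elements in a single operation.

#include <mpi.h>

/* Read every eighth element (element offsets 0..56) of a file of doubles
 * as one noncontiguous MPI-IO operation. Error checking is omitted for
 * brevity. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/mnt/pvfs2/array.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Describe the strided pattern once: 8 blocks of 1 double, 8 apart. */
    MPI_Datatype column;
    MPI_Type_vector(8, 1, 8, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* The file view exposes only those noncontiguous elements ...        */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, column, "native", MPI_INFO_NULL);

    /* ... so a single collective read fetches all eight scattered values. */
    double col[8];
    MPI_File_read_all(fh, col, 8, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}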
PVFS2 Summary
There is no single file system that is the perfect
solution for every I/O workload, and PVFS2 is no
exception.
High-performance applications rely on a different
set of features to access data than those
provided by typical networked file systems.
PVFS2 is best suited for I/O-intensive applications.
PVFS2 was not intended for home directories, but as
a separate, fast, scalable file system, it is very
capable.
Red Hat Global File System
Red Hat Global File System (GFS) is an open source,
POSIX-compliant cluster file system.
Red Hat GFS executes on Red Hat Enterprise Linux
servers attached to a storage area network (SAN).
GFS allows simultaneous reading and writing of blocks to a
single shared file system on a SAN.
GFS runs on all major server and storage platforms supported by
Red Hat.
GFS can be configured without any single points of failure.
GFS can scale to hundreds of Red Hat Enterprise Linux servers.
GFS is compatible with all standard Linux applications.
GFS supports direct I/O by databases,
improving database performance by avoiding traditional file
system overhead.
Red Hat Global File System
Red Hat Enterprise Linux allows organizations to utilize the default
Linux file system, Ext3 (Third Extended file-system), NFS (Network
File System) or Red Hat's GFS cluster file system.
Ext3 is a journaling file system, which uses log files to preserve the
integrity of the file system in the event of a sudden failure. It is the
standard file system used by all Red Hat Enterprise Linux systems.
NFS is the de facto standard approach to accessing files across the
network.
GFS (Global File System) allows multiple servers to share access to the
same files on a SAN while managing that access to avoid conflicts.
Sistina Software, the original developer of GFS, was acquired by Red Hat at
the end of 2003. Subsequently, Red Hat contributed GFS to the open source
community under the GPL license.
GFS is provided as a fully supported, optional layered product for Red Hat
Enterprise Linux systems.
GFS Logical Volume Manager
Red Hat Enterprise Linux includes the Logical Volume Manager (LVM),
which provides kernel-level storage virtualization capabilities. LVM supports
combining physical storage elements into a collective storage pool,
which can then be allocated and managed according to application
requirements, without regard for the specifics of the underlying physical disk
systems.
Initially developed by Sistina, LVM is now part of the standard Linux kernel.
LVM provides enterprise-level volume management capabilities that are
consistent with those of the leading proprietary enterprise operating systems.
LVM capabilities include:
Storage performance and availability management by allowing for the addition
and removal of physical devices and through dynamic disk volume resizing.
Logical volumes can be resized dynamically online.
Ext3 supports offline file system resizing (requiring unmount, resize, and
mount operations).
Disk system management that enables the upgrading of disks, removal of failing
disks, reorganization of workloads, and adaptation of storage capacity to
changing system needs.
GFS Multi-Pathing
Red Hat GFS works in concert with Red Hat
Cluster Suite to provide failover of critical
computing components for high availability.
Multi-path access to storage is essential to
continued availability in the event of path
failure (such as failure of a Host Bus
Adapter).
Red Hat Enterprise Linux's multi-path
device driver (the MD driver) recognizes
multiple paths to the same device,
eliminating the problem of the system
assuming each path leads to a different
disk.
The MD driver combines the paths to a single
disk, enabling failover to an alternate path if
one path is disrupted.
GFS Enterprise Storage Options
Although SAN and NAS have emerged as the preferred enterprise
storage approach, direct attached storage remains widespread
throughout the enterprise. Red Hat Enterprise Linux supports the full
set of enterprise storage options:
Direct attached storage
Networked storage
SAN (access to block-level data over Fibre Channel or IP networks)
NAS (access to data at the file level over IP networks)
Storage interconnects
SCSI
ATA
Serial ATA
SAS (Serial Attached SCSI)
Fibre Channel (FC)
iSCSI
GNBD (global network block device)
NFS
GFS on SANs
SANs provide direct block-level
access to storage. When
deploying a SAN with the Ext3 file
system, each server mounts and
accesses disk partitions
individually. Concurrent access is
not possible. When a server shuts
down or fails, the clustering
software will “failover” its disk
partitions so that a remaining
server can mount them and
resume its tasks.
Deploying GFS on SAN-connected
servers allows full
sharing of all file system data,
concurrently. These two
configuration topologies are
shown in the diagram.
GFS on NFS
In general, an NFS file server,
usually configured with local
storage, will serve file-level data
across a network to remote NFS
clients. This topology is best
suited for non-shared data files
(individual users' directories, for
example) and is widely used in
general purpose computing
environments.
NFS configurations generally offer
lower performance than block-based
SAN environments, but
they are configured using
standard IP networking hardware
and so offer excellent scalability. They
are also considerably less
expensive.
GFS on iSCSI
Combining the performance and sharing capabilities of a
SAN environment with the scalability and cost
effectiveness of a NAS environment is highly desirable.
A topology that achieves this uses SAN technology to
provide the core (“back end”) physical disk infrastructure,
and then uses block-level IP technology to distribute
served data to its eventual consumer across the
network.
The emerging technology for delivering block-level data
across a network is iSCSI.
This has been developing slowly for a number of years, but as
the necessary standards have stabilized, adoption by industry
vendors has started to accelerate considerably.
Red Hat Enterprise Linux currently supports iSCSI.
GFS on GNBD
As an alternative to iSCSI, Red Hat Enterprise Linux provides support for
Red Hat’s Global Network Block Device (GNBD) protocol, which allows
block-level data to be accessed over TCP/IP networks.
The combination of GNBD and GFS provides additional flexibility for sharing
data on the SAN. This topology allows a GFS cluster to scale to hundreds of
servers, which can concurrently mount a shared file system without the
expense of including a Fibre Channel HBA and associated Fibre Channel
switch port with every machine.
GNBD can make SAN data available to many other systems on the network
without the expense of a Fibre Channel SAN connection.
Today, GNBD and iSCSI offer similar capabilities; however, GNBD is a mature
technology, while iSCSI is still relatively new.
Red Hat provides GNBD as part of Red Hat Enterprise Linux so that customers
can deploy IP network-based SANs today.
As iSCSI matures it is expected to supplant GNBD, offering better performance
and a wider range of configuration options. An example configuration is shown in
the diagram that follows.
GFS Summary
Enterprises can now deploy large sets of open
source, commodity servers in a horizontal
scalability strategy and achieve the same levels
of processing power for far less cost.
Such horizontal scalability can lead an
organization toward utility computing, where
server and storage resources are added as
needed. Red Hat Enterprise Linux provides
substantial server and storage flexibility: the
ability to add and remove servers and storage
and to redirect and reallocate storage resources
dynamically.
Summary
Panasas ActiveScale is a clustered, asymmetric, parallel, object-based, distributed file system.
Implements the file system entirely in hardware.
Claims the highest sustained data rate of the four systems reviewed.
Lustre is a clustered, asymmetric, parallel, object-based, distributed file system.
An open, standards-based system.
Great modularity and compatibility with interconnects, networking components and storage
hardware.
Currently only available for Linux.
Parallel Virtual File System 2 (PVFS2) is a clustered, symmetric, parallel,
aggregation-based, distributed file system.
Data access is achieved without file or metadata locking.
PVFS2 is best suited for I/O-intensive (i.e., scientific) applications.
PVFS2 could be used for high-performance scratch storage where data is copied and
simulation results are written from thousands of cycles simultaneously.
Red Hat Global File System (GFS) is a clustered, symmetric, parallel, block-based,
distributed file system.
An open, standards-based system.
Great modularity and compatibility with interconnects, networking components and storage
hardware.
A relatively low-cost, SAN-based technology.
Only available on Red Hat Enterprise Linux.
Conclusions
No single clustered parallel file system can address the
requirements of every environment.
Hardware-based implementations have greater throughput than
software-based implementations.
Standards-based implementations exhibit greater modularity and
flexibility in interoperating with third-party components and appear
most open to the incorporation of new technology.
All implementations appear to scale well into the range of thousands of
clients, hundreds of servers, and hundreds of TB of storage.
All implementations appear to address the issue of hardware and
software redundancy, component failover, and avoidance of a single
point of failure.
All implementations exhibit the ability to take advantage of low-latency,
high-bandwidth interconnects, thus avoiding the overhead
associated with TCP/IP networking.
Questions?
References
Panasas:
http://www.panasas.com/docs/Object_Storage_Architecture_WP.pdf
Lustre:
http://www.lustre.org/docs/whitepaper.pdf
A Next-Generation Parallel File System for Linux Clusters:
http://www.pvfs.org/files/linuxworld-JAN2004-PVFS2.ps
Red Hat Global File System:
http://www.redhat.com/whitepapers/rha/gfs/GFS_INS0032US.pdf
Red Hat Enterprise Linux: Creating a Scalable Open Source Storage Infrastructure:
http://www.redhat.com/whitepapers/rhel/RHEL_creating_a_scalable_os_storage_infrastructure.pdf
Exploring Clustered Parallel File Systems and Object Storage by Michael Ewan:
http://www.intel.com/cd/ids/developer/asmona/eng/238284.htm?prn=Y