Transcript ppt

NASD (Network-Attached Secure Disks)
“A Cost-Effective, High-Bandwidth Storage Architecture”. Gibson et al.
ASPLOS’98
Shimin Chen
Big Data Reading Group Presentation
Goal: avoid the file-server bottleneck
LAN: Local Area Network; PAN: Peripheral Area Network;
SAN: Storage Area Network
• Direct drive-client transfer
• Asynchronous oversight:
  capabilities carry the file manager's policy decisions and are enforced at the drives
• Cryptographic integrity (see the sketch below):
  all messages are protected by cryptographic digests; the computation is expensive and requires hardware support
• Object-based interface:
  in contrast to fixed-size blocks, drives export variable-size objects, with layout managed by the drives
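As a concrete illustration of asynchronous oversight plus cryptographic integrity, here is a minimal Python sketch (not the paper's actual protocol or message format): the file manager mints a capability bound by a keyed digest, and the drive enforces it locally without consulting the file manager. DRIVE_KEY, the field names, and the HMAC-SHA256 construction are illustrative assumptions.

import hashlib, hmac

DRIVE_KEY = b"key shared by file manager and drive"  # assumed shared secret

def make_capability(object_id, rights, expiry):
    # File manager: bind (object, rights, expiry) together with a keyed digest.
    blob = f"{object_id}|{sorted(rights)}|{expiry}".encode()
    tag = hmac.new(DRIVE_KEY, blob, hashlib.sha256).hexdigest()
    return {"object_id": object_id, "rights": rights, "expiry": expiry, "tag": tag}

def drive_check(request, cap, now):
    # Drive: enforce the file manager's policy without contacting it (asynchronous oversight).
    blob = f"{cap['object_id']}|{sorted(cap['rights'])}|{cap['expiry']}".encode()
    tag_ok = hmac.compare_digest(cap["tag"],
                                 hmac.new(DRIVE_KEY, blob, hashlib.sha256).hexdigest())
    return (tag_ok and now < cap["expiry"]
            and request["object_id"] == cap["object_id"]
            and request["op"] in cap["rights"])

cap = make_capability(object_id=42, rights={"read"}, expiry=1000)
print(drive_check({"object_id": 42, "op": "read"}, cap, now=500))   # True
print(drive_check({"object_id": 42, "op": "write"}, cap, now=500))  # False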
File Systems for NASD
• NFS and AFS ported to NASD
– Using the Andrew benchmark, NASD-NFS and NFS times are within 5% of each other
• A parallel filesystem for NASD (Cheops)
Scaling of a Parallel Data Mining Application
NASD: n clients reading from a single NASD PFS file striped across n drives;
scales to 45 MB/s.
NFS: a single file striped across n disks; bottlenecks near 20 MB/s.
NFS-parallel: each client reads from a replica of the file on an independent
disk through the single server; maximum bandwidth 22.5 MB/s.
All NFS configurations show the maximum achievable bandwidth with the given
number of disks, each twice as fast as a NASD, and up to 10 clients.
Discussions
• Requires SAN (devices and computers on the
same network)
• Object-based storage device (OSD)
– Supports NFS or CIFS interface
• Direct client-device transfer, async oversight
– Google File System: chunkserver ~ NASD
– Google File System does not use an object interface
pNFS
Standardizing Storage Clusters
Garth Goodson, Sai Susarla, and Rahul Iyer
Network Appliance
IETF Draft: pNFS Problem Statement
Garth Gibson and Peter Corbett
CMU/Panasas and Network Appliance
Presented by Steve Schlosser
NFSv4.1: A standard interface for
parallel storage access
• Motivation
– Commodity clusters = scalable compute
– Single-server storage (NFS) doesn’t scale
– Network-attached clustered storage systems
provide similar scaling
• e.g., NASD, Panasas, Lustre, GPFS, GFS, et al.
– Proprietary, one-off, not interoperable
• pNFS standardizes the interface to parallel
storage systems using an extension of the
NFSv4 protocol
pNFS
• Separate control and data paths
• Data striped on back-end storage nodes
• On open, client obtains a recallable data layout object
from metadata server (MDS)
– And access rights, if supported
• Using layout, fetch data directly from storage servers
– Layouts can be cached, are recalled by MDS if layout changes
• Before close, client commits metadata changes to MDS,
if any
• Layout drivers run within a standard NFSv4.1 client
– Three data layout types: block (iSCSI, FCP), object (OSD), and
file (NFSv4.1)
– Vendor provides this driver, rest of client remains unmodified
• Client can always fall back to standard NFSv4 through
the MDS
File operations
[Figure: clients, metadata server (MDS), and data servers; the MDS hands out layout objects to clients]
1) OPEN
2) LAYOUTGET
3) READ/WRITE
4) LAYOUTCOMMIT
5) LAYOUTRETURN
6) CLOSE
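A rough sketch of the client-side sequence above, following the numbered operations; the mds and data-server objects and their method names are assumptions for illustration, not real NFSv4.1 calls or the wire protocol.

def pnfs_read(mds, path, offset, length):
    handle = mds.open(path)                       # 1) OPEN at the metadata server
    layout = mds.layoutget(handle)                # 2) LAYOUTGET: recallable layout object
    data = b""
    for seg in layout.segments(offset, length):   # layout maps byte ranges to data servers
        data += seg.data_server.read(seg.object_id, seg.offset, seg.length)  # 3) READ
    mds.layoutreturn(handle, layout)              # 5) LAYOUTRETURN (4, LAYOUTCOMMIT, applies after writes)
    mds.close(handle)                             # 6) CLOSE
    return data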
What is standardized?
• Only the pNFS protocol is specified
  – Supports 3 data classes: block, object, file
• Storage access and control protocols are left unspecified
[Figure: clients with layout drivers speak the pNFS protocol to the metadata server (MDS) and a storage access protocol to the data servers; the MDS uses a control protocol to the data servers]
pNFS client stack
[Figure: user applications sit on a standard NFSv4.1 client (NFSv4.1 protocol over Sun RPC / RDMA) containing generic pNFS layout handling/caching plus layout-specific drivers: file layout (NFSv4.1), object (OSD) layout, and block layout (SCSI over TCP/IP or FCP)]
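To make the role of the layout-specific drivers concrete, here is a hedged sketch of a plug-in interface a generic client might use; the LayoutDriver class and its methods are assumptions, not the actual NFSv4.1 client driver API.

class LayoutDriver:
    layout_type = None                # "file", "object", or "block"
    def read(self, layout, offset, length):
        raise NotImplementedError

class FileLayoutDriver(LayoutDriver):
    layout_type = "file"              # data servers are reached over NFSv4.1
    def read(self, layout, offset, length):
        ds = layout.server_for(offset)
        return ds.nfs_read(layout.filehandle_for(offset), offset, length)

DRIVERS = {cls.layout_type: cls() for cls in (FileLayoutDriver,)}

def generic_read(layout, offset, length):
    # the rest of the client stays unmodified; only the driver differs per layout type
    return DRIVERS[layout.type].read(layout, offset, length)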
File sharing semantics
• Close-to-open consistency
– Client will see changes made by other clients
only after they close
– Assumes write-sharing of files is rare
– NFSv4 compatibility
• Clients notify metadata server after
updates
– Control protocol can be lazy
• Client cannot issue multi-server requests
– Write atomicity must be handled by clients
Conclusions
• pNFS could provide standard access to
cluster storage systems
• Implementation, standardization underway
– Files: NetApp, IBM, Sun
– Objects: Panasas
– Blocks: EMC
– For what it’s worth, Connectathon, Bakeathon
results suggest that prototypes are
interoperating…
GPFS: A Shared-Disk File System
for Large Computing Clusters
Frank Schmuck and Roger Haskin
FAST’02
Presented by Babu Pillai
Big Data Reading Group
03 Dec 2007
Outline
• Overview
• Parallel data access
• Metadata management
• Handling failures
• Unanswered questions
GPFS Overview
• IBM’s General Parallel File System
• Deployed at 100s of sites
• Largest: 1500+ node, 2PB ASCI Purple at
LLNL
• Goal: provide fast parallel fs access
– Inter- and intra-file shared access
– Scale to 1000s of nodes and disks
Overview 2
• Up to 4096 disks, 1 TB each (in 2002)
• Assumes no disk intelligence, only block
access needed
Low-level details
• Very large files (64-bit sizes), sparse files
• Stripes files across disks
• Large block size
• Efficient support of large directories
  – Extensible hashing (see the sketch below)
• Metadata journaling
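A toy in-memory sketch of extensible hashing for directory lookups, the technique named above. GPFS's on-disk directory blocks are more involved, and the full-rehash split here is a simplification of a real single-bucket split.

class ExtensibleDir:
    BUCKET_CAP = 4

    def __init__(self):
        self.depth = 1
        self.buckets = [dict() for _ in range(2)]  # indexed by low `depth` bits of the name hash

    def _index(self, name):
        return hash(name) & ((1 << self.depth) - 1)

    def lookup(self, name):
        return self.buckets[self._index(name)].get(name)

    def insert(self, name, inode):
        if len(self.buckets[self._index(name)]) >= self.BUCKET_CAP:
            # simplified split: double the table and rehash everything
            # (real extensible hashing splits only the overflowing bucket)
            self.depth += 1
            old = [e for bucket in self.buckets for e in bucket.items()]
            self.buckets = [dict() for _ in range(1 << self.depth)]
            for n, i in old:
                self.buckets[self._index(n)][n] = i
        self.buckets[self._index(name)][name] = inode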
Parallel Data Access
• Central token / lock manager
• Byte-range locking on files (see the sketch below)
• Fine-grain sub-block sharing
  – “Data shipping” to / from the node that owns the file block
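A rough sketch of byte-range locking through a central token manager, as referenced above; the classes, the flush_and_release callback, and the revocation policy are illustrative assumptions, not GPFS's actual token protocol.

class ByteRangeTokenManager:
    def __init__(self):
        self.grants = {}  # file_id -> list of (start, end, node)

    def acquire(self, file_id, start, end, node):
        held = self.grants.setdefault(file_id, [])
        # revoke overlapping ranges held by other nodes (they flush dirty data first)
        for (s, e, owner) in list(held):
            if owner is not node and s < end and start < e:
                owner.flush_and_release(file_id, s, e)
                held.remove((s, e, owner))
        held.append((start, end, node))
        return (start, end)   # holding the token, the node can read/write this range locally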
Metadata Management
• File metadata:
– “Metanode” lock holder coordinates
– Lazily updated by nodes (akin to shipping)
• Allocation tables (see the sketch below):
  – Partitioned into many regions
  – Each node is assigned a region
  – Deallocations are shipped lazily
• Others: similar combination of central coarse-grain locking and distributed shipping for fine-grain sharing
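A small sketch of region-partitioned allocation with lazily shipped deallocations, as referenced in the allocation-tables bullet; the class and field names are illustrative assumptions, not GPFS's allocation-map format.

class AllocationMap:
    def __init__(self, num_regions, blocks_per_region):
        self.blocks_per_region = blocks_per_region
        self.free = [set(range(r * blocks_per_region, (r + 1) * blocks_per_region))
                     for r in range(num_regions)]
        self.owner = {}           # region -> owning node
        self.pending = {}         # region -> blocks freed remotely, shipped lazily

    def allocate(self, node):
        # a node allocates only from regions it owns, so no cross-node locking is needed
        for region, owner in self.owner.items():
            if owner == node and self.free[region]:
                return self.free[region].pop()
        raise RuntimeError("assign this node another region")

    def deallocate(self, block, node):
        region = block // self.blocks_per_region
        if self.owner.get(region) == node:
            self.free[region].add(block)                       # local free, immediate
        else:
            self.pending.setdefault(region, set()).add(block)  # ship to the owner lazily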
Scaling tweaks
• Multiple token managers (by inode)
• Token eviction to limit memory
• Optimized token protocol
– Batch requests
– Lazy release on deletes
– Revocation by new token holder
Failure Handling
• Journaling of metadata updates
• Failover of token manager service
• Handles partitioning of nodes
– Majority partition can continue
– Disk fencing
• Simple replication for disk failures
Conclusions
• Good handling of parallel access
• Good combination of coarse-grain central,
fine-grain distributed control
• Details missing
• Failure handling seems minimal
Lustre FS
• Parallel File System Blitz (Big Data Reading Group)
• Presented by: Julio López
Lustre Overview
• Cluster parallel file system, POSIX interface
• Target: Supercomputing environments
• Tens of thousands of nodes
• Petabytes of storage capacity
• Deployed at LLNL, ORNL, PNNL, LANL, PSC
• Development:
• License: GPL
• Initial developer: Cluster File Systems
• Maintainer: Sun Microsystems, Inc
User Point of View
• Single unified FS view
• Kernel support, mounted like a regular FS
• User-level library for parallel applications
  – Higher throughput, lower latency
• Tools to manipulate file striping
• No client-side caching (lightweight version)
• POSIX semantics
  – Multiple processes access it as if it were a local FS
  – No application rewriting
POSIX semantics: Why are they important?
• HPC access patterns
  – Disjoint byte-range accesses
  – Overlapping blocks
• Guaranteed consistency
• Simpler application code
[Figure: multiple clients accessing disjoint byte ranges that overlap in blocks]
Architecture
• Object Storage Device (OSD)
• MDS: metadata server
• OSS: object storage server
• Clients
[Figure: clients, MDS nodes, and multiple OSSs, each fronting OSTs backed by enterprise RAID]
Operation
• Clients retrieve layout data (LoV, logical object volume) from the MDS
• Clients communicate directly with the OSSs (not the OSTs)
[Figure: a client obtains the LoV from the MDS, then reads/writes through the OSSs to their OSTs]
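A minimal sketch of how a client might resolve a file offset to a storage server using striping information from the layout retrieved from the MDS; the field names and the simple round-robin (RAID-0 style) striping are assumptions about a typical configuration, not Lustre's actual data structures.

class StripedLayout:
    def __init__(self, stripe_size, objects):
        self.stripe_size = stripe_size   # bytes per stripe unit
        self.objects = objects           # list of (server, object_id), one per stripe

    def locate(self, offset):
        # map a file offset to (server, object, offset within that object)
        stripe_no = offset // self.stripe_size
        server, object_id = self.objects[stripe_no % len(self.objects)]
        obj_offset = (stripe_no // len(self.objects)) * self.stripe_size \
                     + offset % self.stripe_size
        return server, object_id, obj_offset

layout = StripedLayout(stripe_size=1 << 20,
                       objects=[("oss0", 101), ("oss1", 57), ("oss2", 9)])
print(layout.locate(5 * (1 << 20) + 123))   # -> ('oss2', 9, 1048699)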
More features
• MDS and OSS run Linux
  – Native FS as storage backend (e.g., ext3)
• Networking: TCP/IP over Ethernet, InfiniBand, Myrinet, Quadrics, and others
• RDMA when available
• Distributed lock manager (LDLM)
• High availability:
  – Single MDT, failover MDS
  – Failover OSS
  – Online upgrades
  – Dynamic OST additions
Discussion
Boxwood: Abstractions as the Foundation for
Storage Infrastructure
John MacCormick, Nick Murphy, Marc Najork,
Chandramohan A. Thekkath, Lidong Zhou
Microsoft Research Silicon Valley
Main Idea
• Problem: building distributed, fault-tolerant, storage-intensive software is hard
• Idea: high-level abstractions provided directly by the storage subsystem (e.g., hash tables, lists, trees)
  – adieu, block-oriented interfaces!
  – high-level applications are easier to build
  – no need to manage free space for sparse/non-contiguous data structures
  – the system can take advantage of structural information for performance (caching, pre-fetching, etc.)
Approach
• Experiment with high(er)-level abstractions to storage
  – chunk store: allocate/free/read/write variable-sized data (memory)
  – B-tree: create/insert/delete/lookup/enumerate (uses the chunk store)
• Boxwood system = collection of services
  – service = software module running on multiple machines
  – implemented in user-level processes
• Test “application”: BoxFS, a multi-node NFS file server
• Target environment: machine room/enterprise setting
  – processing nodes with local storage, high-speed network
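A hedged sketch of the two abstractions listed above as a client might see them; the operation names follow the slide, but the signatures and the in-memory backing are assumptions, not Boxwood's actual interfaces.

class ChunkStore:
    """Allocate/free/read/write variable-sized chunks (kept in memory here)."""
    def __init__(self):
        self._chunks = {}
        self._next_handle = 0

    def allocate(self, nbytes):
        handle = self._next_handle
        self._next_handle += 1
        self._chunks[handle] = bytearray(nbytes)
        return handle

    def free(self, handle):
        del self._chunks[handle]

    def read(self, handle):
        return bytes(self._chunks[handle])

    def write(self, handle, data):
        self._chunks[handle][:len(data)] = data

class BTree:
    """Stands in for the B-Link tree service; backed by the chunk store in
    Boxwood, by a plain dict index in this sketch."""
    def __init__(self, chunk_store):
        self._cs = chunk_store
        self._index = {}

    def insert(self, key, value):
        handle = self._cs.allocate(len(value))
        self._cs.write(handle, value)
        self._index[key] = handle

    def lookup(self, key):
        handle = self._index.get(key)
        return None if handle is None else self._cs.read(handle)

    def delete(self, key):
        self._cs.free(self._index.pop(key))

    def enumerate(self):
        return sorted(self._index)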
Boxwood Architecture
• B-Tree
  – Distributed implementation of a B-Link tree (high-concurrency B-tree)
• Chunk store
  – Backed by a pair of chunk managers and an RLDev
  – Chunk managers map a unique chunk handle (id) to an RLDev offset
• RLDev: reliable “media”
  – Block-oriented replicated storage
  – Fixed-size segments, 2-way replication
  – Primary/secondary structure
[Figure: service stack beneath a storage application: B-Tree over Chunk Store over Replicated Logical Device]
Distributed lock service
• Multiple-reader, single-writer locks
• Master/slave structure
Transaction service
• Undo/redo logging
• Logs stored on the RLDev
Distributed consensus
• Storage: stores global state, e.g.
  ― identity of the master lock server
  ― locations of replicated data
• 2k+1 instances tolerate k failures
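A quick illustration of the "2k+1 instances tolerate k failures" rule via majority quorums: any two majorities of 2k+1 intersect, so a quorum of k+1 live instances still exists after k failures. This is just the arithmetic, not Boxwood's actual consensus service.

def quorum_size(k):
    n = 2 * k + 1
    return n // 2 + 1          # majority = k + 1

for k in range(1, 4):
    n, q = 2 * k + 1, quorum_size(k)
    assert n - k >= q          # k instances may fail and a full quorum remains
    print(f"k={k}: n={n} instances, quorum={q}, tolerates {k} failures")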
BoxFS: multi-node NFS file server
• Multiple nodes export a single FS via NFS v2
• User-level implementation, ~2500 lines of C#
• B-trees store the directory structure and entries
• File data is stored in the chunk store
• Simplifying assumptions for performance
  – data cache flushed every 30 seconds
  – B-tree log (metadata operations) flushed to stable storage every 30 seconds instead of on transaction commit
  – access times not always updated
[Figure: BoxFS built on the B-Link Tree and Chunk Store services]
Performance
• RLDev read/write (2-8 servers)
  – Throughput scales linearly; limited by disk latency at small sizes (0.5K) and TCP throughput at large sizes (256K)
• B-Tree insert/delete/lookup (2-8 servers)
  – Throughput scales linearly using private trees (no contention)
  – Throughput flattens using a single public tree with a partitioned key space (contention in the upper levels of the tree)
• BoxFS vs. stock NFS using Connectathon (single server)
  – BoxFS: no RLDev replication; NFS: native Win2003 server on NTFS
  – Performance comparable; BoxFS with asynchronous metadata log flushes wins on metadata-intensive operations (B-trees)
• BoxFS scaling (2-8 servers)
  – Read file: linear scaling (no contention)
  – MkDirEnt, write file: flat performance, limited by lock contention for metadata updates
Ursa Minor: a brief review
Michael Kozuch
Motivation
Data distribution – protocol family
• Data encoding
  – Replication or erasure coding
  – Block size
• Storage node fault type
  – Crash or Byzantine
  – Number of faults to tolerate
• Timing model
  – Synchronous or asynchronous
• Data location
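As a concrete rendering of the protocol-family dimensions above, here is a hedged sketch of a per-object data-distribution record; the field names and example values are assumptions, not Ursa Minor's actual API.

from dataclasses import dataclass
from typing import List, Literal

@dataclass
class DataDistribution:
    encoding: Literal["replication", "erasure"]     # how data is redundantly encoded
    block_size: int                                 # bytes
    fault_type: Literal["crash", "byzantine"]       # storage-node fault model
    faults_tolerated: int                           # number of faults to tolerate
    timing: Literal["synchronous", "asynchronous"]  # timing model
    locations: List[str]                            # which storage nodes hold the data

# e.g., a 3-way replicated object tolerating one crash fault:
dist = DataDistribution("replication", 64 * 1024, "crash", 1,
                        "asynchronous", ["node1", "node2", "node3"])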
Architecture
• Object-based
  – Objects stored as slices with common characteristics
  – Similar to files:
    - Create/delete
    - Read/write
    - Get/set attr.
    - Etc.
Results