Systems Lecture

Distributed Storage and Consistency
Storage moves into the net
• Network delays
• Network cost
• Storage capacity/volume
• Administrative cost
• Network bandwidth
Shared storage with scalable bandwidth and capacity.
Consolidate — multiplex — decentralize — replicate.
Reconfigure to mix-and-match loads and resources.
Storage as a service
SSP: Storage Service Provider
ASP: Application Service Provider
Outsourcing: storage and/or applications as a service.
For ASPs (e.g., Web services), storage is just a component.
Storage Abstractions
• relational database (IBM and Oracle)
tables, transactions, query language
• file system
hierarchical name space of files with ACLs
Each file is a linear space of fixed-size blocks.
• block storage
SAN, Petal, RAID-in-a-box (e.g., EMC)
Each logical unit (LU) or volume is a linear space of fixed-size blocks.
• object storage
object == file, with a flat name space: NASD, DDS, Porcupine
Varying views of the object size: NASD/OSD/Slice objects may act as
large-ish “buckets” that aggregate file system state.
• persistent objects
pointer structures, requires transactions: OODB, ObjectStore
Network Block Storage
One approach to scalable storage is to attach raw block
storage to a network.
• Abstraction: OS addresses storage by <volume, sector>.
iSCSI, Petal, FibreChannel: access through special device driver
• Dedicated Storage Area Network or general-purpose
network.
FibreChannel (FC) vs. Ethernet
• Volume-based administrative tools
backup, volume replication, remote sharing
• Called “raw” or “block”, “storage volumes” or just “SAN”.
• Least common denominator for any file system or database.
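To make the <volume, sector> abstraction concrete, here is a minimal hypothetical sketch of a client-side interface to a raw volume; the interface and method names are illustrative, not any particular product's API.

```java
// Hypothetical sketch of the raw block abstraction: clients address storage
// only by <volume, sector>; naming, access control, and structure are the
// job of whatever file system or database is layered on top.
public interface BlockVolume {
    int sectorSize();                            // fixed block size, e.g., 512 or 4096 bytes
    byte[] readSector(long sector);              // read one block from this logical unit/volume
    void writeSector(long sector, byte[] data);  // write one block in place
}
```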
“NAS vs. SAN”
In the commercial sector there is a raging debate today about
“NAS vs. SAN”.
• Network-Attached Storage has been the dominant approach
to shared storage since NFS.
NAS == NFS or CIFS: named files over Ethernet/Internet.
E.g., Network Appliance “filers”
• Proponents of FibreChannel SANs market them as a
fundamentally faster way to access shared storage.
no “indirection through a file server” (“SAD”)
lower overhead on clients
network is better/faster (if not cheaper) and dedicated/trusted
Brocade, HP, Emulex are some big players.
NAS vs. SAN: Cutting through the BS
• FibreChannel is a high-end technology incorporating NIC
enhancements to reduce host overhead....
...but bogged down in interoperability problems.
• Ethernet is getting faster faster than FibreChannel.
gigabit, 10-gigabit, + smarter NICs, + smarter/faster switches
• Future battleground is Ethernet vs. Infiniband.
• The choice of network is fundamentally orthogonal to
storage service design.
Well, almost: flow control, RDMA, user-level access (DAFS/VI)
• The fundamental questions are really about abstractions.
shared raw volume vs. shared file volume vs. private disks
Storage Architecture
Any of these abstractions can be built using any, some, or all
of the others.
Use the “right” abstraction for your application.
Basic operations: create/remove, open/close, read/write.
The fundamental questions are:
• What is the best way to build the abstraction you want?
division of function between device, network, server, and client
• What level of the system should implement the features and
properties you want?
Duke Mass Storage Testbed
[Figure: IBM Shark/HSM storage on a campus FC net, with IP LANs serving the Brain Lab and the Med Ctr.]
Goal: managed storage on demand for cross-disciplinary research.
Direct SAN access for “power clients” and NAS PoPs; other clients access through NAS.
Problems
poor interoperability
• Must have a common volume layout across heterogeneous
SAN clients.
poor sharing control
• The granularity of access control is an entire volume.
• SAN clients must be trusted.
• SAN clients must coordinate their access.
$$$
Duke Storage Testbed, v2.0
[Figure: IBM Shark/HSM on the campus FC net; NAS PoPs mediate access over the campus IP net for the Brain Lab and the Med Ctr.]
Each SAN volume is managed by a single NAS PoP.
All access to each volume is mediated by its NAS PoP.
Testbed v2.0: pro and con
Pro: supports resource sharing and data sharing.
Con: does not leverage the Fibre Channel investment.
Con: does not scale access to individual volumes.
Con: prone to load imbalances.
Con: data crosses the campus IP network in the clear.
Con: identities and authentication must be centrally administered.
Con: it’s only as good as the NAS clients, which tend to be fair at best.
Sharing Network Storage
How can we control sharing to a space of files or blocks?
• Access control etc.
• Data model and storage abstraction
• Caching
• Optimistic replication
Consistency
• One-copy consistency vs. weak consistency
• Read-only (immutable) files?
• Read-mostly files with weak consistency?
• Write-anywhere files?
File/Block Cache Consistency
• Basic write-ownership protocol.
Distributed shared memory (software DSM)
• Timestamp validation (NFS).
Timestamp each cache entry, and periodically query the server:
“has this file changed since time t?”; invalidate cache if stale.
• Callback invalidation (AFS, Sprite, Spritely NFS).
Request notification (callback) from the server if the file changes;
invalidate cache and/or disable caching on callback.
• Leases (NQ-NFS, NFSv4, DAFS) [Gray&Cheriton89, Macklem93]
Software DSM 101
Software-based distributed shared memory (DSM) provides an illusion of
shared memory on a cluster.
• remote-fork the same program on each node
• data resides in common virtual address space
library/kernel collude to make the shared VAS appear consistent
• The Great War: shared memory vs. message passing
for the full story, take CPS 221
[Figure: cluster nodes connected by a switched interconnect.]
Page Based DSM (Shared Virtual Memory)
Virtual address space is shared.
[Figure: a shared virtual address space mapped onto the physical DRAM of multiple nodes.]
The Sequential Consistency Memory Model
[Figure: sequential processors P1, P2, P3 issue memory ops in program order; a switch, randomly set after each memory op, connects one processor at a time to memory, ensuring some serial order among all operations.]
Easily implemented with a shared bus.
For page-based DSM, weaker consistency models may be useful...but that’s for later.
Inside Page-Based DSM (SVM)
The page-based approach uses a write-ownership token
protocol on virtual memory pages.
• Kai Li [Ivy SVM, 1986], Paul Leach [Apollo, 1982]
• Each node maintains per-node per-page access mode.
{shared, exclusive, no-access}
determines local accesses allowed
For SVM, modes are enforced with VM page protection:

mode           shared    exclusive   no-access
load (read)    yes       yes         no
store (write)  no        yes         no
Write-Ownership Protocol
A write-ownership protocol guarantees that nodes observe
sequential consistency of memory accesses:
• Any node with any access has the latest copy of the page.
On any transition from no-access, fetch current copy of page.
• A node with exclusive access holds the only copy.
At most one node may hold a page in exclusive mode.
On transition into exclusive, invalidate all remote copies and set
their mode to no-access.
• Multiple nodes may hold a page in shared mode.
Permits concurrent reads: every holder has the same data.
On transition into shared mode, invalidate the exclusive remote
copy (if any), and set its mode to shared as well.
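Below is a minimal sketch of the write-ownership bookkeeping just described, for a single page. The directory structure and method names are assumptions for illustration; a real SVM system (e.g., Ivy) also transfers page contents and enforces the modes with VM page protection, which this sketch omits.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of write-ownership state transitions for one page.
class PageDirectory {
    enum Mode { NO_ACCESS, SHARED, EXCLUSIVE }

    private final Map<Integer, Mode> modeByNode = new HashMap<>();

    // Read fault on `node`: grant SHARED; demote any exclusive holder to SHARED
    // (concurrent readers are fine, since every holder has the same data).
    synchronized void acquireShared(int node) {
        for (Map.Entry<Integer, Mode> e : modeByNode.entrySet()) {
            if (e.getValue() == Mode.EXCLUSIVE) e.setValue(Mode.SHARED);
        }
        modeByNode.put(node, Mode.SHARED);    // node also fetches the current copy here
    }

    // Write fault on `node`: grant EXCLUSIVE; invalidate all other copies so
    // at most one node holds the page in exclusive mode.
    synchronized void acquireExclusive(int node) {
        modeByNode.replaceAll((n, m) -> n == node ? m : Mode.NO_ACCESS);
        modeByNode.put(node, Mode.EXCLUSIVE);
    }
}
```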
Network File System (NFS)
[Figure: on the client, user programs enter the syscall layer and VFS, which dispatches to a local *FS or to the NFS client; the server has its own syscall layer, VFS, and local *FS, with the NFS server calling down into them. NFS client and NFS server communicate via RPC over UDP or TCP.]
NFS Protocol
NFS is a network protocol layered above TCP/IP.
• Original implementations (and most today) use UDP
datagram transport for low overhead.
Maximum IP datagram size was increased to match FS block
size, to allow send/receive of entire file blocks.
Some implementations use TCP as a transport.
• The NFS protocol is a set of message formats and types.
Client issues a request message for a service operation.
Server performs requested operation and returns a reply message
with status and (perhaps) requested data.
File Handles
Question: how does the client tell the server which file or
directory the operation applies to?
• Similarly, how does the server return the result of a lookup?
More generally, how to pass a pointer or an object reference as an
argument/result of an RPC call?
In NFS, the reference is a file handle or fhandle, a token/ticket
whose value is determined by the server.
• Includes all information needed to identify the file/object on
the server, and find it quickly.
volume ID
inode #
generation #
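The handle is opaque to the client; only the issuing server interprets it. As a rough sketch (the exact layout varies by server and is not fixed by the protocol), the fields named above might be packed like this:

```java
// Hypothetical sketch of the state behind an NFS file handle. Clients treat
// the handle as an opaque token and simply pass it back on later requests.
public record FileHandle(int volumeId, long inodeNumber, int generation) {
    // The generation number detects reuse of an inode slot: if the file is
    // removed and the inode is recycled for a new file, old handles no longer
    // match and the server can reject them as stale.
}
```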
Consistency for File Systems
How is the consistency problem different for network file
systems, relative to DSM/SVM?
Note: The CDK text includes a lot of detail about the kernel
implementation issues for these file systems. These are interesting and
useful, but in this course we focus on the distribution aspects.
NFS as a “Stateless” Service
A classical NFS server maintains no in-memory hard state.
The only hard state is the stable file system image on disk.
• no record of clients or open files
• no implicit arguments to requests
E.g., no server-maintained file offsets: read and write requests
must explicitly transmit the byte offset for each operation.
• no write-back caching on the server
• no record of recently processed requests
• etc., etc....
“Statelessness makes failure recovery simple and efficient.”
Recovery in Stateless NFS
If the server fails and restarts, there is no need to rebuild in-memory state on the server.
• Client reestablishes contact (e.g., TCP connection).
• Client retransmits pending requests.
Classical NFS uses a connectionless transport (UDP).
• Server failure is transparent to the client; no connection to
break or reestablish.
A crashed server is indistinguishable from a slow server.
• Sun/ONC RPC masks network errors by retransmitting a
request after an adaptive timeout.
A dropped packet is indistinguishable from a crashed server.
Drawbacks of a Stateless Service
The stateless nature of classical NFS has compelling design
advantages (simplicity), but also some key drawbacks:
• Recovery-by-retransmission constrains the server interface.
ONC RPC/UDP has execute-mostly-once semantics (“send and
pray”), which compromises performance and correctness.
• Update operations are disk-limited.
Updates must commit synchronously at the server.
• NFS cannot (quite) preserve local single-copy semantics.
Files may be removed while they are open on the client.
Server cannot help in client cache consistency.
Let’s look at the consistency problem...
Timestamp Validation in NFS [1985]
NFSv2 and NFSv3 cache consistency uses a form of timestamp validation
like today’s Web
• Timestamp cached data at file grain.
• Maintain per-file expiration time (TTL)
• Probe for new timestamp to revalidate if cache TTL has expired.
Get attributes (getattr)
Key difference: NFS file cache and access primitives are block-grained,
and the client may issue many operations in sequence on the same file.
• Clustering: File-grained timestamp for block-grained cache
• Piggyback file attributes on each response
• Adaptive TTL
What happens on server failure? Client failure?
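A rough client-side sketch of the TTL-plus-getattr revalidation described above follows. The class and method names (including the getattr stand-in) are illustrative assumptions, not the real NFS client code, and clustering/piggybacking are omitted.

```java
// Hypothetical sketch of NFS-style timestamp validation at file grain.
class CachedFile {
    long modifyTime;   // server's modify time when the blocks were cached
    long expires;      // local clock time when revalidation is required
    long ttlMillis;    // adaptive TTL for this file
}

class AttrValidator {
    interface Server { long getattrModifyTime(Object fileHandle); }  // stand-in for a getattr RPC

    private final Server server;
    AttrValidator(Server server) { this.server = server; }

    // Returns true if cached blocks for this file may still be used;
    // returns false if the caller must invalidate them and refetch.
    boolean revalidate(Object fh, CachedFile f) {
        long now = System.currentTimeMillis();
        if (now < f.expires) return true;             // TTL not expired: trust the cache
        long mtime = server.getattrModifyTime(fh);    // probe: "has this file changed since t?"
        boolean fresh = (mtime == f.modifyTime);
        f.modifyTime = mtime;
        f.expires = now + f.ttlMillis;                // start a new TTL window
        return fresh;
    }
}
```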
AFS [1985]
AFS is an alternative to NFS developed at CMU.
Duke still uses it.
Designed for wide area file sharing:
• Internet is large and growing exponentially.
• Global name hierarchy with local naming contexts and location
info embedded in fully qualified names.
Much like DNS
• Security features, with per-domain authentication / access control.
• Whole file caching or 64KB chunk caching
Amortize request/transfer cost
• Client uses a disk cache
Cache is preserved across client failure.
Again, it looks a lot like the Web.
Callback Invalidations in AFS-2
AFS-1 uses timestamp validation like NFS; AFS-2 uses
callback invalidations.
• Server returns “callback promise” token with file access.
Like ownership protocol, confers a right to cache the file.
Client caches the token on its disk.
• Token states: {valid, invalid, cancelled}
• On a sharing collision, server cancels token with a callback.
Client invalidates cached copy of the associated file.
Detected on client write to server: last writer wins.
(No distinction between read/write token.)
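A minimal sketch of the callback-promise state a client might keep per cached file; the state names follow the slide, while the class and its methods are assumptions for illustration.

```java
// Hypothetical sketch of AFS-2 callback-promise state on the client.
class CallbackToken {
    enum State { VALID, INVALID, CANCELLED }

    private State state = State.VALID;   // granted along with the fetched file

    // Server callback on a sharing collision: the promise is cancelled.
    synchronized void onCallback() { state = State.CANCELLED; }

    // Client restart, or timeout with no server contact: the token is suspect
    // and must be revalidated (like an NFS getattr) before the cache is used.
    synchronized void markInvalid() { state = State.INVALID; }

    synchronized boolean mayUseCache() { return state == State.VALID; }
}
```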
Issues with AFS Callback Invalidations
What happens after a failure?
• Client invalidates its tokens on client restart.
Invalid tokens may be revalidated, like NFS getattr or WWW.
• Server must remember tokens across server restart.
• Can the client distinguish a server failure from a network
failure?
• Client invalidates tokens after a timeout interval T if the
client has no communication with the server.
Weakens consistency in failures.
Then there’s the problem of update semantics: two clients
may be actively updating the same file at the same time.
NQ-NFS Leases
In NQ-NFS, a client obtains a lease on the file that permits
the client’s desired read/write activity.
“A lease is a ticket permitting an activity; the lease is valid until
some expiration time.”
• A read-caching lease allows the client to cache clean data.
Guarantee: no other client is modifying the file.
• A write-caching lease allows the client to buffer modified
data for the file.
Guarantee: no other client has the file cached.
Allows delayed writes: client may delay issuing writes to
improve write performance (i.e., client has a writeback cache).
Using NQ-NFS Leases
1. Client NFS piggybacks lease requests for a given file on
I/O operation requests (e.g., read/write).
NQ-NFS leases are implicit and distinct from file locking.
2. The server determines if it can safely grant the request, i.e.,
does it conflict with a lease held by another client.
read leases may be granted simultaneously to multiple clients
write leases are granted exclusively to a single client
3. If a conflict exists, the server may send an eviction notice
to the holder of the conflicting lease.
If a client is evicted from a write lease, it must write back.
Grace period: server grants extensions while the client writes.
Client sends vacated notice when all writes are complete.
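Here is a server-side sketch of the grant/conflict check in step 2 and the eviction decision in step 3, for a single file. The names and the single-file scope are simplifying assumptions; lease terms, grace periods, and vacated notices are omitted.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of NQ-NFS-style lease granting: read leases may be held
// by many clients at once, a write lease by only one.
class FileLeases {
    private final Set<String> readers = new HashSet<>();  // read-caching lease holders
    private String writer = null;                          // write-caching lease holder

    // Returns null if granted; otherwise the clients to send eviction notices to.
    synchronized Set<String> requestRead(String client) {
        if (writer != null && !writer.equals(client)) return Set.of(writer);
        readers.add(client);
        return null;
    }

    synchronized Set<String> requestWrite(String client) {
        Set<String> conflicts = new HashSet<>(readers);
        conflicts.remove(client);
        if (writer != null && !writer.equals(client)) conflicts.add(writer);
        if (!conflicts.isEmpty()) return conflicts;   // evict conflicting holders first
        readers.remove(client);
        writer = client;
        return null;
    }
}
```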
NQ-NFS Lease Recovery
Key point: the bounded lease term simplifies recovery.
• Before a lease expires, the client must renew the lease.
• What if a client fails while holding a lease?
Server waits until the lease expires, then unilaterally reclaims the
lease; client forgets all about it.
If a client fails while writing on an eviction, server waits for
write slack time before granting conflicting lease.
• What if the server fails while there are outstanding leases?
Wait for lease period + clock skew before issuing new leases.
• Recovering server must absorb lease renewal requests and/or
writes for vacated leases.
NQ-NFS Leases and Cache Consistency
• Every lease contains a file version number.
Invalidate the cache iff the version number has changed.
• Clients may disable client caching when there is concurrent
write sharing.
no-caching lease (Sprite)
• What consistency guarantees do NQ-NFS leases provide?
Does the server eventually receive/accept all writes?
Does the server accept the writes in order?
Are groups of related writes atomic?
How are write errors reported?
What is the relationship to NFS V3 commit?
The Distributed Lock Lab
The lock implementation is similar to DSM systems, with
reliability features similar to distributed file caches.
• use Java RMI
• lock token caching with callbacks
lock tokens passed through server, not peer-peer as DSM
• synchronizes multiple threads on same client
• state bit for pending callback on client
• server must reissue callback each lease interval (or use RMI
timeouts to detect a failed client)
• client must renew token each lease interval
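One plausible shape for the lab's remote interfaces, assuming tokens flow through the server and the server can call back to clients; the interface and method names here are illustrative, not the assigned API.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical sketch of RMI interfaces for server-mediated lock tokens
// with callbacks and lease renewal.
interface LockServer extends Remote {
    // Acquire the token for a named lock; the server may first call back
    // the current holder and wait for it to release.
    void acquire(String lockName, LockClient caller) throws RemoteException;

    // Return a cached token, either voluntarily or after a callback.
    void release(String lockName, LockClient caller) throws RemoteException;

    // Renew cached tokens each lease interval so the server can detect
    // failed clients that stop renewing.
    void renew(String lockName, LockClient caller) throws RemoteException;
}

interface LockClient extends Remote {
    // Server-to-client callback asking for the token back; the client sets a
    // pending-callback bit and releases once its local threads are done.
    void revoke(String lockName) throws RemoteException;
}
```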
Remote Method Invocation (RMI)
RMI is “RPC in Java”, supporting
Emerald-like distributed object
references, invocation, and garbage
collection, derived from SRC Modula-3
network objects [SOSP 93].
The registry provides a bootstrap naming service using URLs, e.g., rmi://slowww.server.edu/object1.
[Figure: the server app and client app run in separate VMs. The server app registers objects (obj1, obj2, obj3) with the RMI registry (1: Naming.bind(URL, obj1)); the client app obtains a stub (2: stub1 = Naming.lookup(URL)) and invokes through it (3: stub2 = stub1->method()). Calls pass through the stub/skeleton, the RMI layer, and the transport in each VM.]
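Filling in the fragments above, a minimal end-to-end bind/lookup flow might look like the following; the Hello interface, names, and localhost URL are placeholders for illustration.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.server.UnicastRemoteObject;

// Hypothetical sketch of the three numbered steps in the figure.
interface Hello extends Remote { String greet() throws RemoteException; }

class HelloImpl implements Hello {
    public String greet() { return "hello from the server"; }
}

class ServerApp {
    public static void main(String[] args) throws Exception {
        LocateRegistry.createRegistry(1099);                       // start a registry in this VM
        Hello obj1 = (Hello) UnicastRemoteObject.exportObject(new HelloImpl(), 0);
        Naming.bind("rmi://localhost/object1", obj1);              // 1: Naming.bind(URL, obj1)
    }
}

class ClientApp {
    public static void main(String[] args) throws Exception {
        Hello stub1 = (Hello) Naming.lookup("rmi://localhost/object1");  // 2: lookup returns a stub
        System.out.println(stub1.greet());                                // 3: invoke through the stub
    }
}
```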
Background Slides
These slides were not discussed. I use them in CPS 210, the
operating systems course. They provide useful background for
the material on NFS.
Cluster File Systems
[Figure: cluster FS instances running on storage client nodes, all sharing a block storage service (FC/SAN, Petal, NASD).]
Examples:
xFS [Dahlin95]
Petal/Frangipani [Lee/Thekkath]
GFS
Veritas
EMC Celerra
Issues: trust; compatibility with NAS protocols; sharing, coordination, and recovery.
Sharing and Coordination
[Figure: *FS clients reach *FS services over NAS protocols for shared access; the *FS services share a “SAN” storage service plus lock manager. Issues: block allocation and layout; locking/leases and their granularity; a separate lock service; logging and recovery; network partitions; reconfiguration.]
What does Frangipani need from Petal?
How does Petal contribute to F’s *ility?
Could we build Frangipani without Petal?
A Typical Unix File Tree
Each volume is a set of directories and files; a host’s file tree is the set of
directories and files visible to processes on a given host.
File trees are built by grafting volumes from different devices or from network servers.
In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem.
mount(coveredDir, volume)
coveredDir: directory pathname
volume: device specifier or network volume
The volume root contents become visible at pathname coveredDir.
[Figure: an example tree rooted at /, with entries such as bin, ls, sh, etc, tmp, usr, vmunix, users, project, packages, tex, and emacs, and a mount point where a volume root is grafted into the tree.]
Filesystems
Each file volume (filesystem) has a type, determined by its
disk layout or the network protocol used to access it.
ufs (ffs), lfs, nfs, rfs, cdfs, etc.
Filesystems are administered independently.
Modern systems also include “logical” pseudo-filesystems in
the naming tree, accessible through the file syscalls.
procfs: the /proc filesystem allows access to process internals.
mfs: the memory file system is a memory-based scratch store.
Processes access filesystems through common system calls.
VFS: the Filesystem Switch
Sun Microsystems introduced the virtual file system interface
in 1985 to accommodate diverse filesystem types cleanly.
VFS allows diverse specific file systems to coexist in a file tree,
isolating all FS-dependencies in pluggable filesystem modules.
[Figure: user space sits above the syscall layer (file, uio, etc.); below it, the Virtual File System (VFS) dispatches to NFS, FFS, LFS, *FS, etc. NFS uses the network protocol stack (TCP/IP); local filesystems use device drivers.]
VFS was an internal kernel restructuring with no effect on the syscall interface.
Incorporates object-oriented concepts: a generic procedural interface with multiple implementations, based on abstract objects with dynamic method binding by type...in C.
Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects.
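The “generic procedural interface with multiple implementations” is the essential idea; as a hypothetical analogy in Java (the kernel does the same thing in C with tables of function pointers), it is an interface with one implementation per filesystem type, selected at mount time:

```java
// Hypothetical Java analogy for the VFS switch (not the actual kernel code).
interface FileSystemOps {
    byte[] read(String path, long offset, int length);
    void write(String path, long offset, byte[] data);
}

// One implementation per filesystem type; the syscall layer calls through
// the generic interface without knowing which type backs a given mount.
class NfsOps implements FileSystemOps {
    public byte[] read(String path, long offset, int length) { /* issue NFS RPCs */ return new byte[0]; }
    public void write(String path, long offset, byte[] data) { /* issue NFS RPCs */ }
}

class FfsOps implements FileSystemOps {
    public byte[] read(String path, long offset, int length) { /* local disk I/O */ return new byte[0]; }
    public void write(String path, long offset, byte[] data) { /* local disk I/O */ }
}
```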
Vnodes
In the VFS framework, every file or directory in active use is
represented by a vnode object in kernel memory.
[Figure: the syscall layer above a pool of vnodes; specific filesystems (NFS, UFS) keep their resident vnodes cached, with a shared list of free vnodes.]
Each vnode has a standard file attributes struct.
The generic vnode points at a filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem.
Each specific file system maintains a cache of its resident vnodes.
Vnode operations are macros that vector to filesystem-specific procedures.
Vnode Operations and Attributes
vnode attributes (vattr)
type (VREG, VDIR, VLNK, etc.)
mode (9+ bits of permissions)
nlink (hard link count)
owner user ID
owner group ID
filesystem ID
unique file ID
file size (bytes and blocks)
access time
modify time
generation number
generic operations
vop_getattr (vattr)
vop_setattr (vattr)
vhold()
vholdrele()
directories only
vop_lookup (OUT vpp, name)
vop_create (OUT vpp, name, vattr)
vop_remove (vp, name)
vop_link (vp, name)
vop_rename (vp, name, tdvp, tvp, name)
vop_mkdir (OUT vpp, name, vattr)
vop_rmdir (vp, name)
vop_symlink (OUT vpp, name, vattr, contents)
vop_readdir (uio, cookie)
vop_readlink (uio)
files only
vop_getpages (page**, count, offset)
vop_putpages (page**, count, sync, offset)
vop_fsync ()
V/Inode Cache
[Figure: vnodes hashed by HASH(fsid, fileid), with a VFS free list head.]
Active vnodes are reference-counted by the structures that hold pointers to them:
- system open file table
- process current directory
- file system mount points
- etc.
Each specific file system maintains its own hash of vnodes (BSD).
- specific FS handles initialization
- free list is maintained by VFS
vget(vp): reclaim cached inactive vnode from VFS free list
vref(vp): increment reference count on an active vnode
vrele(vp): release reference count on a vnode
vgone(vp): vnode is no longer valid (file is removed)
Pathname Traversal
When a pathname is passed as an argument to a system call,
the syscall layer must “convert it to a vnode”.
Pathname traversal is a sequence of vop_lookup calls to descend
the tree to the named file or directory.
open("/tmp/zot")
vp = get vnode for / (rootdir)
vp->vop_lookup(&cvp, "tmp");
vp = cvp;
vp->vop_lookup(&cvp, "zot");
Issues:
1. crossing mount points
2. obtaining root vnode (or current dir)
3. finding resident vnodes in memory
4. caching name->vnode translations
5. symbolic (soft) links
6. disk implementation of directories
7. locking/referencing to handle races
with name create and delete operations
Problem 1: Retransmissions and Idempotency
For a connectionless RPC transport, retransmissions can saturate
an overloaded server.
Clients “kick ‘em while they’re down”, causing a steep “hockey stick” in the response-time curve.
Execute-at-least-once constrains the server interface.
• Service operations should/must be idempotent.
Multiple executions should/must have the same effect.
• Idempotent operations cannot capture the full semantics we
expect from our file system.
remove, append-mode writes, exclusive create
Solutions to the Retransmission Problem
1. Hope for the best and smooth over non-idempotent requests.
E.g., map ENOENT and EEXIST to ESUCCESS.
2. Use TCP or some other transport protocol that produces
reliable, in-order delivery.
higher overhead...and we still need sessions.
3. Implement an execute-at-most-once RPC transport.
TCP-like features (sequence numbers)...and sessions.
4. Keep a retransmission cache on the server [Juszczak90].
Remember the most recent request IDs and their results, and just
resend the result....does this violate statelessness?
DAFS persistent session cache.
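A sketch of the retransmission-cache idea from solution 4: remember recent request IDs and their replies, and resend the saved reply for a duplicate instead of re-executing a non-idempotent operation. The request-ID scheme and the bounded eviction policy are simplifying assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a server-side retransmission (duplicate request) cache.
class RetransmissionCache {
    private static final int MAX_ENTRIES = 1024;

    // Insertion-ordered map that evicts the oldest remembered reply when full.
    private final Map<Long, byte[]> replies =
        new LinkedHashMap<Long, byte[]>() {
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > MAX_ENTRIES;
            }
        };

    // Returns the saved reply for a retransmitted request, or null if unseen.
    synchronized byte[] lookup(long requestId) { return replies.get(requestId); }

    // Record the reply after executing a request for the first time.
    synchronized void record(long requestId, byte[] reply) { replies.put(requestId, reply); }
}
```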
Problem 2: Synchronous Writes
Stateless NFS servers must commit each operation to stable
storage before responding to the client.
• Interferes with FS optimizations, e.g., clustering, LFS, and
disk write ordering (seek scheduling).
Damages bandwidth and scalability.
• Imposes disk access latency for each request.
Not so bad for a logged write; much worse for a complex
operation like an FFS file write.
The synchronous update problem occurs for any storage
service with reliable update (commit).
Speeding Up Synchronous NFS Writes
Interesting solutions to the synchronous write problem, used
in high-performance NFS servers:
• Delay the response until convenient for the server.
E.g., NFS write-gathering optimizations for clustered writes
(similar to group commit in databases).
Relies on write-behind from NFS I/O daemons (iods).
• Throw hardware at it: non-volatile memory (NVRAM)
Battery-backed RAM or UPS (uninterruptible power supply).
Use as an operation log (Network Appliance WAFL)...
...or as a non-volatile disk write buffer (Legato).
• Replicate server and buffer in memory (e.g., MIT Harp).
NFS V3 Asynchronous Writes
NFS V3 sidesteps the synchronous write problem by adding a
new asynchronous write operation.
• Server may reply to client as soon as it accepts the write,
before executing/committing it.
If the server fails, it may discard any subset of the accepted but
uncommitted writes.
• Client holds asynchronously written data in its cache, and
reissues the writes if the server fails and restarts.
When is it safe for the client to discard its buffered writes?
How can the client tell if the server has failed?
NFS V3 Commit
NFS V3 adds a new commit operation to go with async-write.
• Client may issue a commit for a file byte range at any time.
• Server must execute all covered uncommitted writes before
replying to the commit.
• When the client receives the reply, it may safely discard any
buffered writes covered by the commit.
• Server returns a verifier with every reply to an async write or
commit request.
The verifier is just an integer that is guaranteed to change if the
server restarts, and to never change back.
• What if the client crashes?
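To make the “when is it safe to discard” rule concrete, here is a client-side sketch of buffering async writes and using the verifier to detect a server restart; the buffering structure and names are simplifying assumptions, not the actual NFS client code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the client side of NFSv3 async write + commit.
class AsyncWriteBuffer {
    record PendingWrite(long offset, byte[] data) {}

    private final List<PendingWrite> pending = new ArrayList<>();
    private long lastVerifier;
    private boolean haveVerifier = false;

    // Buffer data written with an async write until a covering commit succeeds.
    synchronized void buffer(long offset, byte[] data) {
        pending.add(new PendingWrite(offset, data));
    }

    // Called with the verifier from each async-write or commit reply.
    // Returns true if the server restarted, so buffered writes must be reissued.
    synchronized boolean serverRestarted(long verifier) {
        boolean restarted = haveVerifier && verifier != lastVerifier;
        lastVerifier = verifier;
        haveVerifier = true;
        return restarted;
    }

    // After a commit reply whose verifier matched, the covered writes are
    // stable on the server and the client may discard them.
    synchronized void discardCommitted() { pending.clear(); }

    synchronized List<PendingWrite> writesToReissue() { return new ArrayList<>(pending); }
}
```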