Distributed File Systems

By Ryan Middleton
Outline
 Introduction
 Definition
 Example architecture
 Early implementations
 NFS
 AFS
 Later advancements
 Journal articles
 Patent applications
 Conclusion
Introduction: Definition
 Non-distributed file systems
 Provide a “nice” OS-level interface for working with disks, built around the concept of “files”
 Also provide file locking, access control, file organization, and more
 Distributed file systems
 Share files (possibly distributed) with multiple remote clients
 Ideally provide:
 Transparency: access, location, mobility, performance, scaling
 Concurrency control/consistency
 Replication support
 Hardware and OS heterogeneity
 Fault tolerance
 Security
 Efficiency
 Challenges: availability, load balancing, reliability, security, concurrency control
SOURCE:
COULOURIS, G.F., DOLLIMORE, J. AND KINDBERG, T. 2005. Distributed systems: concepts and design. Addison-Wesley Longman.
Introduction: Examples
 List of some distributed file systems:
 Sun’s Network File System (NFS)
 Andrew File System (AFS)
 Coda
 PFS
 PPFS
 TLDFS
 PVFS
 EDRFS
 Umbrella FS
 WheelFS
Introduction: Example Architecture
Early Implementations:
Sun’s Network File System (NFS)
 First distributed file system to be developed as a commercial
product (1985)
 Originally implemented in the UNIX 4.2 BSD kernel
 Design goals:
 Architecture & OS independent
 Crash Recovery (server crashes)
 Access transparency
 UNIX file system semantics
 “Reasonable Performance” (1985)
 “about 80% as fast as a local disk.”
SOURCE: SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D. AND LYON, B. 1985. Design and implementation of the Sun network filesystem. In Proceedings of the Summer 1985 USENIX Conference, 119–130.
Early Implementations:
Sun’s Network File System (NFS)
 Main components
 1. Client side
 Implements the Virtual File System (VFS) at kernel level using virtual nodes (vnodes)
 Remote file systems mounted through the MOUNT protocol (part of the NFS protocol)
 2. Server side
 Provides an RPC interface for clients to invoke procedures for file access
 Allows clients to “mount” its file systems into their local file systems using the MOUNT protocol
 File handles: inode #, inode generation, file system id
 Implements the Virtual File System (VFS) at kernel level using virtual nodes (vnodes)
 3. Protocol
 Uses Sun RPC for simplicity of implementation (synchronous until later versions of NFS)
 Stateless, which simplifies crash recovery
 Transport independent
 File handles
 MOUNT protocol uses system-dependent paths
SOURCE: SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D. AND LYON, B. 1985. Design and implementation of the Sun network filesystem. In Proceedings of the Summer 1985 USENIX Conference, 119–130.
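To make the stateless, file-handle-based design concrete, here is a minimal sketch (illustrative Python, not from the Sandberg paper; all names are invented): every request carries the full file handle and offset, so the server keeps no per-client state and a crash loses nothing that a retried request cannot recover.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileHandle:
    """Opaque NFS-style handle: file system id, inode number, inode generation."""
    fs_id: int
    inode: int
    generation: int

class StatelessFileServer:
    """Every call carries the handle and offset, so no per-client state is kept;
    after a crash and restart the same requests can simply be retried."""

    def __init__(self):
        self.files = {}  # FileHandle -> bytes (stands in for the exported file system)

    def create(self, handle: FileHandle, data: bytes = b"") -> None:
        self.files[handle] = data

    def read(self, handle: FileHandle, offset: int, count: int) -> bytes:
        return self.files[handle][offset:offset + count]

    def write(self, handle: FileHandle, offset: int, data: bytes) -> int:
        old = self.files[handle]
        self.files[handle] = old[:offset] + data + old[offset + len(data):]
        return len(data)

# Usage: the "RPC" here is a plain method call; in NFS it travels over Sun RPC.
server = StatelessFileServer()
fh = FileHandle(fs_id=1, inode=42, generation=7)
server.create(fh, b"hello world")
assert server.read(fh, offset=6, count=5) == b"world"
```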
Early Implementations:
Sun’s Network File System (NFS)
SOURCE: SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D. AND LYON, B. 1985. Design and implementation of the Sun network filesystem. In Proceedings of the Summer 1985 USENIX Conference, 119–130.
Early Implementations:
Sun’s Network File System (NFS)
 Some functions in the VFS interface:
SOURCE: SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D. AND LYON, B. 1985. Design and implementation of the Sun network filesystem. In Proceedings of the Summer 1985 USENIX Conference, 119–130.
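The slide's table of VFS functions did not survive transcription. As a hedged stand-in, the sketch below shows the shape of a vnode-style interface (illustrative Python; the operation names are typical examples, not the exact list from the NFS paper): the kernel calls these operations without caring whether a local file system or the NFS client implements them.

```python
from abc import ABC, abstractmethod

class VNode(ABC):
    """Sketch of a vnode: one object per file or directory, exposing the
    operations the kernel needs, independent of the underlying file system."""

    @abstractmethod
    def lookup(self, name: str) -> "VNode":
        """Resolve a name inside this directory to another vnode."""

    @abstractmethod
    def getattr(self) -> dict:
        """Return attributes such as size, owner, and modification time."""

    @abstractmethod
    def read(self, offset: int, count: int) -> bytes: ...

    @abstractmethod
    def write(self, offset: int, data: bytes) -> int: ...

    @abstractmethod
    def readdir(self) -> list:
        """List directory entries."""

# A local UFS file system and the NFS client each supply their own concrete
# VNode subclass; everything above the VFS layer stays unchanged.
```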
Early Implementations:
Sun’s Network File System (NFS)
 Later developments
 Automounter
 Client/Server caching
 Clients must poll server copies to validate cache using timestamps
 Faster and more efficient local file systems
 Better security than originally implemented, using the Kerberos authentication system
 Performance improvements
 Very effective
 Depends on hardware, network, dedicated OS
SOURCES:
COULOURIS, G.F., DOLLIMORE, J. AND KINDBERG, T. 2005. Distributed systems: concepts and design. Addison-Wesley Longman.
SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D. AND LYON, B. 1985. Design and implementation of the Sun network filesystem. In Proceedings of the Summer 1985 USENIX Conference, 119–130.
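A minimal sketch of the timestamp-polling idea above (illustrative Python, not the actual NFS client; `server` is any object offering hypothetical get_mtime/fetch calls): cached data is trusted for a short window, after which the client compares the server's modification time with the one it cached.

```python
import time

FRESHNESS_WINDOW = 3.0  # seconds a cache entry is trusted without asking the server

class CacheEntry:
    def __init__(self, data, server_mtime):
        self.data = data
        self.server_mtime = server_mtime
        self.validated_at = time.monotonic()

class PollingCacheClient:
    def __init__(self, server):
        self.server = server   # assumed interface: get_mtime(path), fetch(path)
        self.cache = {}        # path -> CacheEntry

    def read(self, path):
        entry = self.cache.get(path)
        if entry and time.monotonic() - entry.validated_at < FRESHNESS_WINDOW:
            return entry.data                        # recently validated: no network traffic
        mtime = self.server.get_mtime(path)          # poll the server's timestamp
        if entry and entry.server_mtime == mtime:
            entry.validated_at = time.monotonic()    # unchanged on the server: keep cache
            return entry.data
        data = self.server.fetch(path)               # changed or not cached: refetch
        self.cache[path] = CacheEntry(data, mtime)
        return data
```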
Early Implementations:
Sun’s Network File System (NFS)
 Characteristics
 Access transparency: yes, through client module using VFS
 Location transparency: yes, remote file systems are mounted into the local file system
 Mobility transparency: mostly, remote mount tables must be updated with each move
 Scalability: yes, can handle large loads efficiently (economic and performance)
 File replication: limited, read-only replication is supported and multiple remote file systems may be given to the automounter
 Security: yes, using Kerberos authentication
 Fault tolerance, heterogeneity, efficiency, and consistency: yes
SOURCES:
COULOURIS, G.F., DOLLIMORE, J. AND KINDBERG, T. 2005. Distributed systems: concepts and design. Addison-Wesley Longman.
SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D. AND LYON, B. 1985. Design and implementation of the Sun network filesystem. In Proceedings of the Summer 1985 USENIX Conference, 119–130.
Early Implementations:
Andrew File System (AFS)
 Provides transparent access for UNIX programs
 Uses UNIX file primitives
 Compatible with NFS
 Files referenced using handles like NFS
 NFS client may access data from AFS “server”
 Design goals
 Access transparency
 Scalability (most important)
 Performance requirements met by serving whole files and
caching them at clients
SOURCE:
COULOURIS, G.F., DOLLIMORE, J. AND KINDBERG, T. 2005. Distributed systems: concepts and design. Addison-Wesley Longman.
Early Implementations:
Andrew File System (AFS)
 Caching algorithm
 On client open system call
 If a copy of the file is not in cache then request a copy from the server
 Received copy is cached on the client computer
 Read, write, and other operations are performed on local copy
 On close system call
 If local copy has been modified it is sent to the server
 Server saves the client’s copy over its own copy
 No concurrency control built in:
 During concurrent writes, the last write overwrites previous writes, resulting in the silent loss of all but the last write
SOURCE:
COULOURIS, G.F., DOLLIMORE, J. AND KINDBERG, T. 2005. Distributed systems: concepts and design. Addison-Wesley Longman.
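The open/close algorithm above can be sketched as follows (illustrative Python, not the Venus implementation; `server` is a hypothetical object with fetch/store). The last line of close shows the behaviour noted on the slide: whoever closes last silently overwrites earlier concurrent writes.

```python
class WholeFileCachingClient:
    """AFS-style caching: fetch the whole file on open (if not cached),
    operate on the local copy, write the whole file back on close if modified."""

    def __init__(self, server):
        self.server = server   # assumed interface: fetch(name) -> bytes, store(name, bytes)
        self.cache = {}        # name -> bytearray (local copy)
        self.dirty = set()

    def open(self, name):
        if name not in self.cache:
            self.cache[name] = bytearray(self.server.fetch(name))

    def read(self, name, offset, count):
        return bytes(self.cache[name][offset:offset + count])   # purely local

    def write(self, name, offset, data):
        buf = self.cache[name]
        buf[offset:offset + len(data)] = data                    # purely local
        self.dirty.add(name)

    def close(self, name):
        if name in self.dirty:
            self.dirty.discard(name)
            self.server.store(name, bytes(self.cache[name]))     # last close wins
```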
Early Implementations:
Andrew File System (AFS)
 Server
 Implemented by the Vice user-level process
 Handles requests from Venus by operating on its local files
 Serves file copies to the client’s Venus process with a callback promise: a guarantee to notify the client if another client invalidates its cache
 Clients
 Implemented by the Venus user-level process
 Local files accessed as normal UNIX files
 Remote files accessed through the /root/cmu subtree
 Database of access servers maintained
 Kernel intercepts file system calls and passes them to the Venus process
 Venus translates paths and filenames to file ids; Vice only uses file ids
 If notified that its cache is invalid, Venus sets the callback promise to cancelled
 When performing a file operation, if the file isn’t cached or the callback promise is cancelled then it fetches a new copy from the server
 Periodically validates cache using timestamp comparison
 Applications must implement concurrency control themselves if desired; it is not required
SOURCE:
COULOURIS, G.F., DOLLIMORE, J. AND KINDBERG, T. 2005. Distributed systems: concepts and design. Addison-Wesley Longman.
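A hedged sketch of the callback-promise bookkeeping (illustrative Python with invented names, not Venus/Vice code): each cached file carries a promise flag, the server cancels it through a callback when another client stores a new version, and the next open then refetches.

```python
VALID, CANCELLED = "valid", "cancelled"

class CacheEntry:
    def __init__(self, data):
        self.data = data
        self.callback_promise = VALID   # server promises to tell us if this goes stale

class VenusLikeClient:
    def __init__(self, server):
        self.server = server            # assumed interface: fetch(fid) -> bytes
        self.cache = {}                 # file id -> CacheEntry

    def callback(self, fid):
        """Called by the server when another client closes a modified copy of fid."""
        if fid in self.cache:
            self.cache[fid].callback_promise = CANCELLED

    def open(self, fid):
        entry = self.cache.get(fid)
        if entry is None or entry.callback_promise == CANCELLED:
            entry = CacheEntry(self.server.fetch(fid))   # refetch; comes with a fresh promise
            self.cache[fid] = entry
        return entry.data
```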
Early Implementations:
Andrew File System (AFS)
 Characteristics
 Access transparency: yes
 Location transparency: yes
 Mobility transparency: mostly, remote access database must be
updated
 Scalability: yes, very high performance
 File replication: limited, read-only replication is supported
 Security: yes, using Kerberos authentication
 Fault tolerance, heterogeneity, consistency: somewhat
SOURCE:
COULOURIS, G.F., DOLLIMORE, J. AND KINDBERG, T. 2005. Distributed systems: concepts and design. Addison-Wesley Longman.
Later Advancements: Zebra File System
 Published in 1992 and 1995; widely cited
 Allows clients to store data on remote computers
 Zebra stripes data from client streams across multiple servers
 Reduces load on a single server
 Maximizes throughput
 Parity information and redundancy can allow continued operation if a server is down (like a RAID array)
 Each client creates an append-only log of files and stripes it across the servers
 Data is “viewed” at the log level, not at the file level
 LFS introduced the idea of append-only logs
 Metadata is kept for information about log locations
 Designed for “UNIX workloads”: short file lifetimes, sequential
file access, infrequent write-sharing
SOURCE: Hartman, J. H., Ousterhout, J. K. 1995. The Zebra Striped Network File System. ACM Transactions on Computer Systems, Vol. 13, No. 3, 274–310.
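The striping-with-parity idea can be shown with a toy sketch (illustrative Python, not Zebra's actual fragment layout): a log segment is cut into one fragment per data server plus an XOR parity fragment, so any single lost fragment can be rebuilt, as in a RAID array.

```python
from functools import reduce

def stripe(segment: bytes, num_data_servers: int) -> list:
    """Split a log segment into fixed-size data fragments (one per server)
    plus one XOR parity fragment, RAID-style."""
    frag_len = -(-len(segment) // num_data_servers)   # ceiling division
    frags = [segment[i * frag_len:(i + 1) * frag_len].ljust(frag_len, b"\x00")
             for i in range(num_data_servers)]
    parity = bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*frags))
    return frags + [parity]

def rebuild(frags: list) -> list:
    """Recover a single missing fragment (marked None) by XOR-ing the others."""
    missing = frags.index(None)
    others = [f for f in frags if f is not None]
    frags[missing] = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*others))
    return frags

# Usage: stripe a segment over 4 data servers (+1 parity), lose one fragment, rebuild it.
segment = b"client's append-only log segment ..."
pieces = stripe(segment, 4)
pieces[2] = None                      # simulate a crashed or unreachable server
assert rebuild(pieces)[2] == stripe(segment, 4)[2]
```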
Later Advancements: Zebra File System
 Servers
 5 Operations
 Store a fragment
 Append to an existing fragment
 Retrieve a fragment
 Delete a fragment
 Identify fragments
 Stripes can’t be modified except to delete them: preserves the performance of local access at the server
 “Deltas” appended to the end of stripes, or new stripes created
 Race Condition: accessing a stripe while it is being deleted
 Optimistic control used instead of LFS locking
 File manager
 Stores all information about files except file data itself
 Hosted on own server
 Client
 Use the file manager to determine where to read/write stripe fragments from
SOURCE: Hartman, J. H., Ousterhout, J. K. 1995. The Zebra Striped Network File System. ACM Transactions on Computer Systems, Vol. 13, No. 3, 274–310.
Later Advancements: Speculative Execution
 Work published in 2005
 Applicable to most/all distributed file systems
 Don’t block during a remote operation
 Create a checkpoint
 Continue execution based upon expected (speculated) results (e.g. cached results)
 Execution at a given time may be based upon multiple speculations (causal dependencies)
 If the actual results differ, revert to the checkpoint
 Share speculative state
 Track causal dependencies propagated through distributed communication
 Correct execution guaranteed by preventing output externalization until execution is validated
 Performance increased
 I/O latency
 I/O Throughput
 Large improvements to existing distributed file systems’ performance, even
with many rollbacks
SOURCE: Nightingale, E. B., Chen, P. M., Flinn, J. 2005. Speculative Execution in a Distributed File System. SOSP ‘05, 191–205.
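The checkpoint/speculate/rollback loop can be modelled with a toy sketch (illustrative Python, not the kernel mechanism described in the paper): the caller guesses the result of a slow remote call, keeps executing, and either keeps the speculative work or rewinds to the checkpoint when the real answer arrives; output is buffered rather than externalized while speculative.

```python
import copy

class SpeculativeTask:
    """Toy model: run ahead on a guessed result, roll back if it was wrong."""

    def __init__(self, state):
        self.state = state
        self.checkpoint = None
        self.pending_output = []   # held back until the speculation is confirmed

    def speculate(self, guessed_result, continue_with):
        self.checkpoint = copy.deepcopy(self.state)   # checkpointing is cheap in the real system
        continue_with(self.state, guessed_result)     # keep executing on the guess

    def resolve(self, guessed_result, actual_result, continue_with):
        if actual_result == guessed_result:
            released, self.pending_output = self.pending_output, []
            return released                           # correct guess: externalize buffered output
        self.state = self.checkpoint                  # wrong guess: rewind ...
        self.pending_output.clear()
        continue_with(self.state, actual_result)      # ... and redo with the real result
        return []

# Usage: speculate that a remote validation says the cached document is still fresh.
task = SpeculativeTask(state={"doc": "cached contents"})
work = lambda state, still_fresh: state.update(doc=state["doc"] if still_fresh else "refetched")
task.speculate(True, work)         # assume the cache is valid and keep going
task.resolve(True, False, work)    # the server disagreed: rolled back and redone
assert task.state["doc"] == "refetched"
```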
Later Advancements: Hadoop
 2007
 Design goals
 Highly fault-tolerant
 High throughput
 Parallel batch processing on streaming data (high throughput rather than low latency)
 Run on commodity hardware
 Inspired by Google File System (GFS) and MapReduce
 Tuned for large files: typical files GB to TB
 Used by (see wiki.apache.org/hadoop/PoweredBy):
 Amazon Web Services, IBM Blue Cloud Computing Clusters
 Facebook, Yahoo
 Hulu, Joost, Veoh, Last.fm
 Rackspace (for processing email logs for search)
SOURCES: Dhruba Borthakur, 2007. The Hadoop Distributed File System: Architecture and Design. http://hadoop.apache.org.
hadoop.apache.org
en.wikipedia.org/wiki/Hadoop
Later Advancements: Hadoop
 Implementation
 Implemented using Java
 Files are write-once-read-many
 What are the consistency and replication consequences?
 Clients (e.g. Hulu server) use NameNode for namespace
translation and DataNodes for serving data
 NameNode performs namespace operations and translation
 DataNodes serve data
 Uses RPC over TCP/IP
 Move processing closer to data
 MapReduce engine allows executing processing tasks on DataNodes with
data or closest possible
SOURCES: Dhruba Borthakur, 2007. The Hadoop Distributed File System: Architecture and Design. http://hadoop.apache.org.
hadoop.apache.org
en.wikipedia.org/wiki/Hadoop
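A hedged sketch of the read path described above (illustrative Python with invented class names, not the HDFS client API): the client contacts the NameNode only for metadata, then streams block contents directly from DataNodes, which is what makes moving computation close to the data worthwhile.

```python
class NameNode:
    """Namespace and block-location metadata only; no file data passes through it."""
    def __init__(self):
        self.blocks = {}    # path -> list of (block_id, [datanode addresses])

    def add_file(self, path, block_map):
        self.blocks[path] = block_map

    def get_block_locations(self, path):
        return self.blocks[path]

class DataNode:
    def __init__(self):
        self.store = {}     # block_id -> bytes

    def read_block(self, block_id):
        return self.store[block_id]

def read_file(path, namenode, datanodes):
    """Client-side read: metadata from the NameNode, data from DataNodes."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):
        node = datanodes[replicas[0]]          # pick the first (ideally closest) replica
        data += node.read_block(block_id)
    return data

# Usage: one file split into two blocks, each stored on a different DataNode.
nn, dns = NameNode(), {"dn1": DataNode(), "dn2": DataNode()}
dns["dn1"].store["b0"] = b"write-once-"
dns["dn2"].store["b1"] = b"read-many"
nn.add_file("/logs/day1", [("b0", ["dn1"]), ("b1", ["dn2"])])
assert read_file("/logs/day1", nn, dns) == b"write-once-read-many"
```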
Later Advancements: Hadoop
SOURCE: Dhruba Borthakur, 2007. The Hadoop Distributed File System: Architecture and Design. http://hadoop.apache.org.
Later Advancements: WheelFS
 Articles published 2007-2009
 User-level wide-area file system
 POSIX interface
 Design Goal:
 Help nodes in a distributed application share data
 Provide fault tolerance for distributed applications
 Applications tune operations and performance
 Applications using WheelFS show similar performance to
BitTorrent services
 Distributed web cache
 Email service
 Large file distribution
SOURCE: Stribling, J., Sovran, Y., Zhang, I., Pretzer, X. 2009. Flexible, Wide Area Storage for Distributed Systems with WheelFS. NSDI ’09, 43–58.
Later Advancements: WheelFS
 Implemented using Filesystem in Userspace (FUSE)
 User-level process providing a bridge between a user-created file system and kernel operations
 Communication uses RPC over SSH
 Servers
 Unified directory structure presented
 Each file/directory object has a primary maintainer agreed upon by all nodes
 Subdirectories can be maintained by other servers
 Replication to non-primary servers supported
 Clients
 Find the primary maintainer (or a backup) through a configuration service replicated to multiple sites (nodes cache a copy of the lookup table)
 Cached copy is leased from the maintainer
 Writes buffered in cache until the close operation, for performance
 Clients can read from other clients’ caches
 File system entry through the wfs directory in the root of the local file system
SOURCE: Stribling, J., Sovran, Y., Zhang, I., Pretzer, X. 2009. Flexible, Wide Area Storage for Distributed Systems with WheelFS. NSDI ’09, 43–58.
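A small sketch of the lease idea above (illustrative Python, invented names, not WheelFS code): a client may answer reads from its cached copy only while the lease granted by the object's primary maintainer is still valid; after that it must go back to the maintainer.

```python
import time

class Maintainer:
    """Primary maintainer of an object: hands out time-limited read leases."""
    LEASE_SECONDS = 5.0

    def __init__(self, data):
        self.data = data

    def fetch_with_lease(self):
        return self.data, time.monotonic() + self.LEASE_SECONDS

class LeasingClient:
    def __init__(self, maintainer):
        self.maintainer = maintainer
        self.cached = None
        self.lease_expiry = 0.0

    def read(self):
        if self.cached is None or time.monotonic() >= self.lease_expiry:
            # Lease expired (or nothing cached): go back to the primary maintainer.
            self.cached, self.lease_expiry = self.maintainer.fetch_with_lease()
        return self.cached

# Usage
client = LeasingClient(Maintainer(b"object contents"))
assert client.read() == b"object contents"   # fetches and caches under a lease
assert client.read() == b"object contents"   # served from cache while the lease is valid
```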
Later Advancements: WheelFS
 Semantic cues included in file or directory paths
 Allow user applications to customize data access policies
 /wfs/.cue/dir1/dir2/file
 Apply to all sub-paths following the cue
 Cues: Hotspot, MaxTime, EventualConsistency, …
 Ex: /wfs/.MaxTime=200/url causes operations accessing /wfs/url to fail after 200 ms
 Ex: EventualConsistency allows operations to access data through backups, caches, etc. instead of directly from the maintainer
 The configuration service eventually reconciles differences
 Version numbers; conflicts resolved by choosing the latest version number (may result in lost writes)
 Combined with MaxTime, a client will use the backup/cache version with the highest version number that is reachable in the specified time
SOURCE: Stribling, J., Sovran, Y., Zhang, I., Pretzer, X. 2009. Flexible, Wide Area Storage for Distributed Systems with WheelFS. NSDI ’09, 43–58.
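A hedged sketch of how cue-bearing paths might be interpreted (illustrative Python; WheelFS's real parsing lives in its FUSE client and its cue set is richer): cues are path components beginning with '.', optionally carrying a value, and they apply to everything after them in the path.

```python
def parse_cues(path):
    """Split a /wfs path into (plain_path, cues). A cue is a component that
    starts with '.', e.g. '.EventualConsistency' or '.MaxTime=200'."""
    cues, parts = {}, []
    for component in path.strip("/").split("/"):
        if component.startswith("."):
            name, _, value = component[1:].partition("=")
            cues[name] = value or True
        else:
            parts.append(component)
    return "/" + "/".join(parts), cues

# Usage: a read that may fall back to replicas/caches and must answer within 200 ms.
plain, cues = parse_cues("/wfs/.EventualConsistency/.MaxTime=200/url")
assert plain == "/wfs/url"
assert cues == {"EventualConsistency": True, "MaxTime": "200"}
# A client would then try the primary maintainer first and, once 200 ms have
# elapsed, use the freshest reachable backup or cached copy instead.
```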
Later Advancements: Patent Applications
 Patent No. US 7,406,473
 July 2008
 Brassow et al.
 Meta-data servers, locking servers, file servers organized into single layer each
 Patent No. US 7,475,199 B1
 January 2009
 Bobbitt et al.
 All files organized into virtual volumes
 Virtualization layer intercepts file operations and maps to physical location
 Client operations handled by “Venus”
 Sound familiar?
 Others
 Reserving space on remote storage locations
 Distributed file system using NFS servers coupled to hosts to directly request
data from NAS nodes, bypassing NFS clients
Conclusion
 Non-Distributed File Systems
 Distributed File Systems
 Access files over a distributed system
 Concurrency issues
 Performance
 Transparency
 New Advances in Distributed File Systems
 Parallel processing and striping
 Advanced caching
 Tune file system performance on a per-application basis
 Very large file support
 Future Work
 Relational-database organization
 Better concurrency control
 Developments in non-distributed file systems can be applicable to distributed file systems