Distributed Systems:
General Services
November 2005
Distributed systems: general services
1
Overview of chapters
• Introduction
• Co-ordination models and languages
• General services
– Ch 11 Time Services, 11.1-11.3
– Ch 9 Name Services
– Ch 8 Distributed File Systems
– Ch 10 Peer-to-peer Systems
• Distributed algorithms
• Shared data
November 2005
Distributed systems: general services
2
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
November 2005
Distributed systems: general services
3
Introduction
• Objectives
– Study existing subsystems
• Services offered
• Architecture
• Limitations
– Learn about non-functional requirements
• Techniques used
• Results
• Limitations
November 2005
Distributed systems: general services
4
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
November 2005
Distributed systems: general services
5
Time services: Overview
• What & Why
• Clock synchronization
• Logical clocks (see chapter on algorithms)
November 2005
Distributed systems: general services
6
Time services
• Definition of time
– International Atomic Time (TAI)
A second is defined as 9,192,631,770 periods of the transition
between two hyperfine levels of the ground state of Caesium-133
– Astronomical time
Based on the rotation of the earth on its axis and its revolution
around the sun. Due to tidal friction, the period of the earth's
rotation is slowly getting longer
– Co-ordinated Universal Time (UTC)
International standard based on atomic time, but a leap second is
occasionally inserted or deleted to keep in step with astronomical
time
November 2005
Distributed systems: general services
7
Time services (cont.)
• Why is time important?
– Accounting (e.g. logins, files,…)
– Performance measurement
– Resource management
– some algorithms depend on it:
• Transactions using timestamps
• Authentication (e.g. Kerberos)
• Example Make
November 2005
Distributed systems: general services
8
Time services (cont.)
• An example: Unix tool “make”
– large program:
• multiple source files: graphics.c
• multiple object files: graphics.o
• change of one file ⇒ recompilation of all files
– use of make to handle changes:
• change of graphics.c at 3h45
• make examines creation time of graphics.o
• e.g. 3h43: recompiles graphics.c to create an up-to-date graphics.o
November 2005
Distributed systems: general services
9
Time services (cont.)
• Make in a distributed environment:
• source file on system A
• object file on system B
[Figure: timeline of clocks A and B. Clock B runs ahead (3h47–3h50) of clock A (3h41–3h44). graphics.o is created on system B and stamped with B's clock (3h47); graphics.c is modified later on system A but stamped with A's clock (3h43). Since 3h43 < 3h47, make graphics.o concludes the object file is up to date: no effect!! "Equal times" on the two axes do not correspond to the same real time.]
November 2005
Distributed systems: general services
10
Time services (cont.)
• Hardware clocks
– each computer has a physical clock
• H(t): value of hardware clock/counter
• different clocks ⇒ different values
• e.g. crystal-based clocks drift due to
– different shapes, temperatures, …
• Typical drift rates:
– 1 μsec/sec ⇒ 1 sec in 11.6 days
– High-precision quartz clocks: 0.1-0.01 μsec/sec
• Software clocks
– C(t) = αH(t) + β
• Ex. nanoseconds elapsed since a reference time
November 2005
Distributed systems: general services
11
Time services (cont.)
• Changing local time
– Hardware: influence H(t)
– Software: change α and/or β
– Forward:
• Set clock forward
• some clock ticks appear to have been missed
– Backward:
• Set backward? unacceptable
• Let clock run slower for a short period of time
November 2005
Distributed systems: general services
12
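As an illustration (not from the slides), a minimal Python sketch of a software clock C(t) = αH(t) + β over a hardware counter, adjusted by slewing α rather than stepping backwards; the class and method names are hypothetical.

    import time

    class SoftwareClock:
        """Software clock C(t) = alpha * H(t) + beta over a hardware counter."""

        def __init__(self):
            self.alpha = 1.0                # rate factor
            self.beta = 0.0                 # offset in nanoseconds
            self._h = time.monotonic_ns     # stands in for the hardware counter H(t)

        def now_ns(self):
            return self.alpha * self._h() + self.beta

        def adjust(self, offset_ns, slew_interval_ns):
            """Forward corrections step beta; backward corrections slew alpha
            so the clock never runs backwards (a real clock would restore
            alpha to 1.0 once the slew interval has elapsed)."""
            if offset_ns >= 0:
                self.beta += offset_ns
            else:
                self.alpha += offset_ns / slew_interval_ns

    clock = SoftwareClock()
    clock.adjust(-2_000_000, slew_interval_ns=10_000_000_000)  # lose 2 ms over ~10 s
    print(clock.now_ns())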
Time services (cont.)
• Properties
– External synchronization to UTC (S)
| S(t) - C(t) | < D
– Internal synchronization
| Ci(t) – Cj(t)| < D
– Correctness
Drift rate < ρ
– Monotonicity
t’ > t ⇒ C(t’) > C(t)
November 2005
Distributed systems: general services
13
Time services: Overview
• What & Why
• Synchronising Physical clocks
– Synchronising with UTC
– Cristian’s algorithm
– The Berkeley algorithm
– Network Time Protocol
• Logical clocks (see chapter on algorithms)
November 2005
Distributed systems: general services
14
Time services
synchronising with UTC
• Broadcast by Radio-stations
– WWV (USA) (accuracy 0.1 - 10 msec.)
• Broadcast by satellites
– Geos (0.1 msec.)
– GPS (1 msec)
• Receivers to workstations
November 2005
Distributed systems: general services
15
Time services
Cristian’s algorithm
• assumption: time server process Ts on a
computer with UTC receiver
• process P: request/reply with time server
[Figure: process P sends a T-request to the time server Ts; Ts returns a reply containing its current time t.]
November 2005
Distributed systems: general services
16
Time services
Cristian’s algorithm
• how to set clock at P?
– time t inserted by Ts in the reply
– time for P: t + Ttrans
– Ttrans = min + x
• min = minimum transmission time
• x? depends on
– network delay
– process scheduling
November 2005
Distributed systems: general services
17
Time services
Cristian’s algorithm
• In a synchronous system
– Upper bound on delay: max
– P sets time to t + (max – min) / 2
– Accuracy: (max – min) / 2
• In an asynchronous system?
November 2005
Distributed systems: general services
18
Time services
Cristian’s algorithm
• practical approach
– P measures round-trip time: Tround
– P sets time to t + Tround/2
– accuracy?
• time at Ts when message arrives at P is in range:
[t + min, t + Tround - min]
• accuracy = ± ((Tround/2) - min)
• problems:
– single point of failure
– impostor time server
November 2005
Distributed systems: general services
19
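A small Python sketch (an illustration, not the course's code) of the practical version of Cristian's algorithm above: the client measures Tround and sets its clock to t + Tround/2. The time_server_request callable is a hypothetical stand-in for the request/reply exchange with Ts.

    import time

    def cristian_sync(time_server_request, min_delay=0.0):
        """Estimate the server's current time with Cristian's algorithm.
        time_server_request: performs the request/reply with Ts and returns t.
        min_delay: known minimum one-way transmission time (for the accuracy bound)."""
        t0 = time.monotonic()
        t = time_server_request()          # reply contains server time t
        t_round = time.monotonic() - t0    # measured round-trip time
        estimate = t + t_round / 2         # assume symmetric delays
        accuracy = t_round / 2 - min_delay # +/- bound on the error
        return estimate, accuracy

    # Example with a fake server that simply returns its own clock:
    est, acc = cristian_sync(lambda: time.time())
    print(f"estimated server time {est:.6f} s, accuracy +/- {acc:.6f} s")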
Time services
The Berkeley algorithm
• Used on computers running Berkeley Unix
• active master elected among computers
whose clocks have to be synchronised
– no central time server
– upon failure of master: re-election
November 2005
Distributed systems: general services
20
Time services
The Berkeley algorithm
• algorithm of master:
– periodically polls the clock value of other
computers
– estimates each slave's local clock time (based on
round-trip delays)
– averages the obtained values
– returns to each slave the amount by which its clock
needs adjustment
– fault-tolerant average (strange values excluded)
may be used
November 2005
Distributed systems: general services
21
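A sketch (illustrative, not the actual Berkeley Unix code) of the master's fault-tolerant averaging step: readings that differ too much from the median are excluded, the rest are averaged, and each slave is sent the adjustment for its own clock.

    import statistics

    def berkeley_adjustments(readings, tolerance):
        """readings: dict slave_id -> estimated clock value (seconds), already
        corrected for round-trip delay by the master.
        Returns dict slave_id -> adjustment to apply to that slave's clock."""
        median = statistics.median(readings.values())
        # fault-tolerant average: ignore clocks that are too far off
        good = [v for v in readings.values() if abs(v - median) <= tolerance]
        average = sum(good) / len(good)
        return {sid: average - value for sid, value in readings.items()}

    clocks = {"master": 100.00, "s1": 100.02, "s2": 99.97, "s3": 250.0}  # s3 is faulty
    print(berkeley_adjustments(clocks, tolerance=0.5))
    # s3 receives a large correction but did not distort the average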
Time services
Network Time Protocol
• Standard for the Internet
• Design aims and features:
– Accurate synchronization despite variable message
delays
• Statistical techniques to filter timing data from different
servers
– Reliable service that can survive losses of connectivity
• Redundant servers
• hierarchy of servers in synchronised subnets
November 2005
Distributed systems: general services
22
Time services
Network Time Protocol
[Figure: hierarchy of NTP servers in synchronisation subnets. Level = stratum; stratum-1 servers have a UTC receiver; stratum-2 servers synchronise with stratum-1 servers, stratum-3 servers with stratum-2 servers, and so on.]
November 2005
Distributed systems: general services
23
Time services
Network Time Protocol
• Three working modes for NTP servers
– Multicast mode
• used on high speed LAN
• slave sets clock assuming small delay
– Procedure-call mode
• similar to Cristian’s algorithm
– Symmetric Mode
• used by master servers and higher levels
• association between servers
November 2005
Distributed systems: general services
24
Time services
Network Time Protocol
• Procedure-call and Symmetric mode
– each message bears timestamps of recent
message events
[Figure: message exchange between a pair of NTP peers. Server A sends message m at time T i-3; server B receives it at T i-2 and later sends message m’ at T i-1; server A receives m’ at T i.]
November 2005
Distributed systems: general services
25
Time services
Network Time Protocol
– Compute approximations of offset and delay
Let t and t’ be the actual transmission times of m and m’, and o the true offset of B’s clock relative to A’s:
T i-2 = T i-3 + t + o and T i = T i-1 + t’ - o
Define a = T i-2 - T i-3 = t + o and b = T i-1 - T i = o - t’
Estimate of the delay (total transmission time): d i = a - b = t + t’
Estimate of the offset: o i = (a + b) / 2
The true offset lies in the range: o i - d i /2 ≤ o ≤ o i + d i /2
November 2005
Distributed systems: general services
26
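A Python sketch of the offset/delay estimate above, computed from the four timestamps T i-3 … T i of one message pair; the helper name and example values are illustrative, not NTP's actual implementation.

    def ntp_estimate(t3, t2, t1, t0):
        """t3 = T_i-3 (m sent by A), t2 = T_i-2 (m received by B),
        t1 = T_i-1 (m' sent by B), t0 = T_i (m' received by A).
        Returns (offset estimate o_i, delay estimate d_i); the true offset
        lies in [o_i - d_i/2, o_i + d_i/2]."""
        a = t2 - t3            # = t + o
        b = t1 - t0            # = o - t'
        offset = (a + b) / 2   # o_i
        delay = a - b          # d_i = t + t'
        return offset, delay

    # Example: B's clock is 5 ms ahead, one-way delays 20 ms and 30 ms
    o, d = ntp_estimate(0.000, 0.025, 0.040, 0.065)
    print(o, d)   # 0.0 and 0.05: the true offset 0.005 lies within o_i +/- d_i/2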
Time services
Network Time Protocol
• Procedure-call and Symmetric mode
– data filtering applied to successive pairs
– NTP has a complex peer-selection algorithm; favoured are peers with:
• a lower stratum number
• a lower synchronisation dispersion
– Accuracy:
• 10 msec on Internet
• 1 msec on LAN
November 2005
Distributed systems: general services
27
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
November 2005
Distributed systems: general services
28
Name services
• Overview
– Introduction
– Basic service
– Case study: Domain name system
– Directory services
November 2005
Distributed systems: general services
29
Name services
Introduction
• Types of name:
– users
– files, devices
– service names (e.g. RPC)
– port, process, group identifiers
• Attributes associated with names
– Email address, login name, password
– disk, block number
– network address of server
– ...
November 2005
Distributed systems: general services
30
Name services
Introduction
• Goal:
– binding between name and attributes
• History
– originally:
• quite simple problem
• scope: a single LAN
• naïve solution: all names + attributes in a single file
– now:
• interconnected networks
• different administrative domains in a single name
space
November 2005
Distributed systems: general services
31
Name services
Introduction
• naming service separate from other services:
– unification
• general naming scheme for different objects
• e.g. local + remote files in same scheme
• e.g. files + devices + pipes in same scheme
• e.g. URL
– integration
• scope of sharing difficult to predict
• merge different administrative domains
• requires extensible name space
November 2005
Distributed systems: general services
32
Name services
Introduction
• General requirements:
– scalable
– long lifetime
– high availability
– fault isolation
– tolerance of mistrust
November 2005
Distributed systems: general services
33
Name services
• Overview
– Introduction
– Basic service
– Domain name system
– Directory services
November 2005
Distributed systems: general services
34
Name services
Basic service
• Operations:
identifier ← lookup (name, context)
register (name, context, identifier)
delete (name, context)
• From a single server to multiple servers:
– hierarchical names
– efficiency
– availability
November 2005
Distributed systems: general services
35
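A toy in-memory sketch (illustrative only) of the three basic operations above; a real name service would of course be remote and partitioned over several servers.

    class NameServer:
        """Toy single-server name service: (context, name) -> identifier."""

        def __init__(self):
            self._table = {}

        def register(self, name, context, identifier):
            self._table[(context, name)] = identifier

        def lookup(self, name, context):
            return self._table[(context, name)]   # raises KeyError if unbound

        def delete(self, name, context):
            del self._table[(context, name)]

    ns = NameServer()
    ns.register("printer", context="/dept/cs", identifier="srv3:5001")
    print(ns.lookup("printer", context="/dept/cs"))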
Name services
Basic service
• Hierarchical name space
– introduction of directories or subdomains
• /users/pv/edu/distri/services.ppt
• cs.kuleuven.ac.be
– easier administration
• delegation of authority
• fewer name conflicts: unique per directory
• unit of distribution: server per directory + single
root server
November 2005
Distributed systems: general services
36
Name services
Basic service
• Hierarchical name space
– navigation
• process of locating naming data from more than one
server
• iterative or recursive
– result of lookup:
• attributes
• references to another name server
November 2005
Distributed systems: general services
37
Name services
Basic service
[Figure: client P issues Lookup(“/users/pv/edu/…”, …) to the root name server; the name tree (root → users, sys; users → pv, ann; pv → src, …) is partitioned over several servers.]
November 2005
Distributed systems: general services
38
Name services
Basic service
[Figure: iterative navigation. The root server returns a reference to server-pv; client P then issues Lookup(“pv/edu/…”, …) to that server and continues down the tree (root → users, sys; users → pv, ann; pv → src, …).]
November 2005
Distributed systems: general services
39
Name services
Basic service
• Efficiency caching
– hierarchy introduces multiple servers
⇒ more communication overhead
⇒ more processing capacity needed
– caching at client side:
• <name, attributes> pairs are cached
• fewer requests ⇒ higher lookup performance
• inconsistency?
– Data is rarely changed
– limit on validity
November 2005
Distributed systems: general services
40
Name services
Basic service
• Availability replication
– at least two failure independent servers
– primary server: accepts updates
– secondary server(s):
• get copy of data of primary
• periodic check for updates at primary
– level of consistency?
November 2005
Distributed systems: general services
41
Name services
• Overview
– Introduction
– Basic service
– Case study: Domain name system
– Directory services
November 2005
Distributed systems: general services
42
Name services
Domain name system
• DNS = Internet name service
• originally: an ftp-able file
– not scalable
– centralised administration
• a general name service for computers and
domains
– partitioning
– delegation
– replication and caching
November 2005
Distributed systems: general services
43
Name services
Domain name system
• Domain name
– hierarchy
nix.cs.kuleuven.ac.be
be: country (Belgium)
ac: univ. (academic)
kuleuven: K.U.Leuven
cs: dept. computer science
nix: name of computer system
November 2005
Distributed systems: general services
44
Name services
Domain name system
• 3 kinds of Top Level Domains (TLD)
– 2-letter country codes (ISO 3166)
– generic names (similar organisations)
• com: commercial organisations
• org: non-profit organisations (e.g. vzw)
• int: international organisations (NATO, EU, …)
• net: network providers
– USA oriented names
• edu: universities
• gov: American government
• mil: American army
– new generic names
• biz, info, name, aero, museum, ...
November 2005
Distributed systems: general services
45
Name services
Domain name system
• For each TLD:
– administrator (assign names within domain)
– “be”: DNS BE vzw (previously: dept. computer science)
• Each organisation with a name :
– responsible for new (sub)names
– e.g. cs.kuleuven.ac.be
• Hierarchical naming + delegation
⇒ workable structure
November 2005
Distributed systems: general services
46
Name services
Domain name system
• Name servers
– root name server +
– server per (group of) subdomains
+ replication ⇒ high availability
+ caching ⇒ acceptable performance
– time-to-live to limit inconsistencies
November 2005
Distributed systems: general services
47
Name services
Domain name system
Name server cs.kuleuven.ac.be
Systems/subdomains of cs.kuleuven.ac.be | type | IP address
nix | A | 134.58.42.36
idefix | A | 134.58.41.7
droopy | A | 134.58.41.10
stevin | A | 134.58.41.16
... | |
A = Address
November 2005
Distributed systems: general services
48
Name services
Domain name system
Name server kuleuven.ac.be
Machines/subdomains of kuleuven.ac.be | type | IP address
cs | NS | 134.58.39.1
esat | NS | …
www | A | …
... | |
NS = NameServer
November 2005
Distributed systems: general services
49
Name services
Domain name system
Example: resolving www.cs.vu.nl
[Figure: the local NS (cs.kuleuven.ac.be) queries the root NS, which refers it to NS (nl); NS (nl) refers it to NS (vu.nl); NS (vu.nl) refers it to NS (cs.vu.nl), which returns the address 130.37.24.11 to the local NS and hence to the client.]
November 2005
Distributed systems: general services
50
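A sketch of iterative navigation as in the example above: the local server starts at the root and follows referrals (NS records) until it obtains an address (A record). The record layout and server objects here are hypothetical, not the DNS wire protocol.

    def resolve_iteratively(name, root_server):
        """Follow NS referrals from the root until an A record is found.
        Each server is a dict mapping a name to ('A', address) or
        ('NS', next_server_dict)."""
        server = root_server
        while True:
            rtype, value = server[name]
            if rtype == "A":
                return value            # final answer: an IP address
            server = value              # 'NS': continue at the referred server

    # Hypothetical data for www.cs.vu.nl:
    ns_cs_vu = {"www.cs.vu.nl": ("A", "130.37.24.11")}
    ns_vu    = {"www.cs.vu.nl": ("NS", ns_cs_vu)}
    ns_nl    = {"www.cs.vu.nl": ("NS", ns_vu)}
    root     = {"www.cs.vu.nl": ("NS", ns_nl)}
    print(resolve_iteratively("www.cs.vu.nl", root))   # 130.37.24.11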
Name services
Domain name system
• Extensions:
– mail host location
• used by electronic mail software
• requests where mail for a domain should be
delivered
– reverse resolution (IP address -> domain name)
– host information
• Weak points:
– security
November 2005
Distributed systems: general services
51
Name services
Domain name system
• Good system design:
– partitioning of data
• multiple servers
– replication of servers
• high availability
• limited inconsistencies
• NO load balancing; NO server preference
– caching
• acceptable performance
• limited inconsistencies
November 2005
Distributed systems: general services
52
Name services
• Overview
– Introduction
– Basic service
– Case study: Domain name system
– Directory services
November 2005
Distributed systems: general services
53
Name services
Directory services
• Name service
exact name ⇒ attributes
• Directory service
some attributes of object ⇒ all attributes
• Examples:
– request: E-mail address of Janssen at KULeuven
– result: list of names of all persons called Janssen
– select person and read attributes
November 2005
Distributed systems: general services
54
Name services
Directory services
• Directory information base
– list of entries
– each entry
• set of type/list-of-values pairs
• optional and mandatory pairs
• some pairs determine name
– distributed implementation
November 2005
Distributed systems: general services
55
Name services
Directory services
• X.500
– CCITT standard
• LDAP
– Lightweight Directory Access Protocol
– Internet standard
November 2005
Distributed systems: general services
56
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
November 2005
Distributed systems: general services
57
Distributed file systems
• Overview
– File service architecture
– case study: NFS
– case study AFS
– comparison NFS <> AFS
November 2005
Distributed systems: general services
58
Distributed file systems
File service architecture
• Definitions:
– file
– directory
– file system
cf. Operating systems
November 2005
Distributed systems: general services
59
Distributed file systems
File service architecture
• Requirements:
– addressed by most file systems:
• access transparency
• location transparency
• failure transparency
• performance transparency
– related to distribution:
• concurrent updates
• hardware and operating system heterogeneity
• scalability
November 2005
Distributed systems: general services
60
Distributed file systems
File service architecture
• Requirements (cont.) :
– scalable to a very large number of nodes
• replication transparency
• migration transparency
– future:
• support for fine-grained distribution of data
• tolerance to network partitioning
November 2005
Distributed systems: general services
61
Distributed file systems
File service architecture
• File service components
[Figure: file service components. User programs call a client module (API) on the client machine; across the network, the client module uses the directory service to translate names into UFIDs, and the flat file service to access file data via UFIDs.]
November 2005
Distributed systems: general services
62
Distributed file systems
File service architecture
• Flat file service
– file = data + attributes
– data = sequence of items
– operations: simple read & write
– attributes: in flat file & directory service
– UFID: unique file identifier
November 2005
Distributed systems: general services
63
Distributed file systems
File service architecture
• Flat file service: operations
– Read (FileID, Position, n) ⇒ Data
– Write (FileID, Position, Data)
– Create () ⇒ FileID
– Delete (FileID)
– GetAttributes (FileID) ⇒ Attr
– SetAttributes (FileID, Attr)
November 2005
Distributed systems: general services
64
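A minimal in-memory sketch (illustrative only) matching the flat file operations above: files are byte sequences addressed by a UFID, and read/write are positional.

    import itertools

    class FlatFileService:
        """Toy flat file service: files are byte arrays addressed by a UFID."""

        _ids = itertools.count(1)

        def __init__(self):
            self._files = {}     # UFID -> bytearray
            self._attrs = {}     # UFID -> dict of attributes

        def create(self):
            ufid = next(self._ids)
            self._files[ufid] = bytearray()
            self._attrs[ufid] = {}
            return ufid

        def write(self, ufid, position, data):
            # no gap handling: a real service would pad or reject sparse writes
            f = self._files[ufid]
            f[position:position + len(data)] = data

        def read(self, ufid, position, n):
            return bytes(self._files[ufid][position:position + n])

        def delete(self, ufid):
            del self._files[ufid], self._attrs[ufid]

    svc = FlatFileService()
    fid = svc.create()
    svc.write(fid, 0, b"hello flat file")
    print(svc.read(fid, 0, 5))   # b'hello'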
Distributed file systems
File service architecture
• Flat file service: fault tolerance
– straightforward for simple servers
– idempotent operations
– stateless servers
November 2005
Distributed systems: general services
65
Distributed file systems
File service architecture
• Directory service
– translate file name into UFID
– substitute for open
– responsible for access control
November 2005
Distributed systems: general services
66
Distributed file systems
File service architecture
• Directory service: operations
– Lookup (Dir, Name, AccessMode, UserID) ⇒ UFID
– AddName (Dir, Name, FileID, UserID)
– UnName (Dir, Name)
– ReName (Dir, OldName, NewName)
– GetNames (Dir, Pattern) ⇒ list-of-names
November 2005
Distributed systems: general services
67
Distributed file systems
File service architecture
• Implementation techniques:
– known techniques from OS experience
– remain important
– distributed file service: comparable in
• performance
• reliability
November 2005
Distributed systems: general services
68
Distributed file systems
File service architecture
• Implementation techniques: overview
– file groups
– space leaks
– capabilities and access control
– access modes
– file representation
– file location
– group location
– file addressing
– caching
November 2005
Distributed systems: general services
69
Distributed file systems
File service architecture
• Implementation techniques: file groups
– (similar: file system, partition)
= collection of files mounted on a server
computer
– unit of distribution over servers
– transparent migration of file groups
– once created, a file remains in its file group
– UFID = file group identifier + file identifier
November 2005
Distributed systems: general services
70
Distributed file systems
File service architecture
• Implementation techniques: space leaks
– 2 steps for creating a file
• create (empty) file and get new UFID
• enter name + UFID in directory
– failure after step 1:
• file exists in file server
• unreachable: UFID not in any directory
⇒ lost space on disk
– detection requires co-operation between
• file server
• directory server
November 2005
Distributed systems: general services
71
Distributed file systems
File service architecture
• Implementation techniques: capabilities
= digital key: access to resource granted on
presentation of the capability
– request to directory server: file name + user id + mode of access
⇒ UFID including permitted access modes
– construction of UFID:
• unique
• encodes access rights
• unforgeable
November 2005
Distributed systems: general services
72
Distributed file systems
File service architecture
• Implementation techniques: capabilities
[Figure: successive UFID layouts and their weaknesses]
File group id | File nr → Reuse?
File group id | File nr | Random nr → Access check?
File group id | File nr | Random nr | Access bits → Forgeable?
File group id | File nr | Encrypted (access bits + random number) | Access bits
November 2005
Distributed systems: general services
73
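A sketch of the last (unforgeable) layout above: the access bits travel in clear but are protected by a keyed digest over (access bits + random number) that only the server can recompute. HMAC-SHA-256 is used here as a stand-in for the encryption box on the slide, and the field layout is illustrative.

    import hmac, hashlib, os

    SERVER_KEY = os.urandom(32)          # known only to the file server

    def make_ufid(group_id, file_nr, access_bits):
        rand = os.urandom(8)
        tag = hmac.new(SERVER_KEY, rand + access_bits.encode(), hashlib.sha256).digest()
        return (group_id, file_nr, rand, tag, access_bits)

    def check_ufid(ufid, requested_mode):
        group_id, file_nr, rand, tag, access_bits = ufid
        expected = hmac.new(SERVER_KEY, rand + access_bits.encode(), hashlib.sha256).digest()
        return hmac.compare_digest(tag, expected) and requested_mode in access_bits

    cap = make_ufid(group_id=7, file_nr=42, access_bits="rw")
    print(check_ufid(cap, "w"))                 # True: capability grants write
    forged = cap[:4] + ("rwx",)                 # tampered access bits
    print(check_ufid(forged, "x"))              # False: digest no longer matches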
Distributed file systems
File service architecture
• Implementation techniques: file location
– from UFID ⇒ location of file server
– use of replicated group location database
file group id ⇒ PortId
– why replication?
– why location not encoded in UFID?
November 2005
Distributed systems: general services
74
Distributed file systems
File service architecture
• Implementation techniques: caching
– server cache: reduce delay for disk I/O
• selecting blocks for release
• coherence:
– dirty flags
– write-through caching
– client cache: reduce network delay
• always use write-through
• synchronisation problems with multiple caches
November 2005
Distributed systems: general services
75
Distributed file systems
• Overview
– File service model
– case study: Network File System -- NFS
– case study: AFS
– comparison NFS <> AFS
November 2005
Distributed systems: general services
76
Distributed file systems
NFS
• Background and aims
– first file service product
– emulate UNIX file system interface
– de facto standard
• key interfaces published in public domain
• source code available for reference implementation
– supports diskless workstations
• not important anymore
November 2005
Distributed systems: general services
77
Distributed file systems
NFS
• Design characteristics
– client and server modules can be in any node
Large installations include a few servers
– Clients:
• On Unix: emulation of standard UNIX file system
• for MS/DOS, Windows, Apple, …
– Integrated file and directory service
– Integration of remote file systems in a local
one: mount remote mount
November 2005
Distributed systems: general services
78
Distributed file systems
NFS: configuration
• Unix mount system call
– each disk partition contains hierarchical FS
– how integrate?
• Name partitions
a:/usr/students/john
• glue partitions together
– invisible for user
– partitions remain useful for system managers
November 2005
Distributed systems: general services
79
Distributed file systems
NFS: configuration
• Unix mount system call
[Figure: Unix mount. Root partition: / (root), vmunix, usr, … Partition c: / (root), students, staff, ann, pierre, network, proj, …
Directory staff ⇒ root of c: /usr/staff/ann/network]
November 2005
Distributed systems: general services
80
Distributed file systems
NFS: configuration
• Remote mount
[Figure: remote mount. Client: / (root), vmunix, usr, students, staff, … Server 1: / (root), users, ann, pierre, …
Directory staff ⇒ users (on Server 1): /usr/staff/ann/...]
November 2005
Distributed systems: general services
81
Distributed file systems
NFS: configuration
• Mount service on server
– enables clients to integrate (part of) remote file
system in the local name space
– exported file systems in /etc/exports + access
list (= hosts permitted to mount; secure?)
• On client side
– file systems to mount enumerated in /etc/rc
– typically mount at start up time
November 2005
Distributed systems: general services
82
Distributed file systems
NFS: configuration
• Mounting semantics
– hard
• client waits until request for a remote file succeeds
• eventually forever
– soft
• failure returned if request does not succeed after n
retries
• breaks Unix failure semantics
November 2005
Distributed systems: general services
83
Distributed file systems
NFS: configuration
• Automounter
– principle:
• empty mount points on clients
• mount on first request for remote file
– acts as a server for a local client
• gets references to empty mount points
• maps mount points to remote file systems
• referenced file system mounted on mount point via
a symbolic link, to avoid redundant requests to
automounter
November 2005
Distributed systems: general services
84
Distributed file systems
NFS: implementation
• In UNIX: client and server modules
implemented in kernel
• virtual file system:
– internal key interface, based on file handles for
remote files
November 2005
Distributed systems: general services
85
Distributed file systems
NFS: implementation
[Figure: NFS architecture. On the client computer, a user client process issues system calls to the UNIX kernel; the kernel's virtual file system passes local requests to the UNIX file system and remote requests to the NFS client module, which communicates over the network with the NFS server module on the server computer using the NFS protocol; the NFS server accesses the server's UNIX file system through its own virtual file system.]
November 2005
Distributed systems: general services
86
Distributed file systems
NFS: implementation
• Virtual file system
– Added to UNIX kernel to make distinction
• Local files
• Remote files
– File handles: file ID in NFS
• Base: inode number
– File ID in partition on UNIX system
• Extended with:
– File system identifier
– inode generation number (to enable reuse)
November 2005
Distributed systems: general services
87
Distributed file systems
NFS: implementation
• Client integration:
– NFS client module integrated in kernel
• offers standard UNIX interface
• no client recompilation/reloading
• single client module for all user level processes
• encryption in kernel
• server integration
– only for performance reasons
– user level = 80% of kernel level version
November 2005
Distributed systems: general services
88
Distributed file systems
NFS: implementation
• Directory service
– name resolution co-ordinated in client
– step-by-step process for multi-part file names
– mapping tables in server: high overhead
reduced by caching
• Access control and authentication
– based on UNIX user ID and group ID
– included and checked for every NFS request
– secure NFS 4.0 thanks to use of DES
encryption
November 2005
Distributed systems: general services
89
Distributed file systems
NFS: implementation
• Caching
– Unix caching
• based on disk blocks
• delayed write (why?)
• read ahead (why?)
• periodically sync to flush buffers in cache
– Caching in NFS
• Server caching
• Client caching
November 2005
Distributed systems: general services
90
Distributed file systems
NFS: implementation
• Server caching in NFS
– based on standard UNIX caching: 2 modes
– write-through (instead of delayed write)
• failure semantics
• performance
– delayed write
• Data stored in cache, till commit operation is
received
– Close on client ⇒ commit operation on server
• Failure semantics?
• Performance
November 2005
Distributed systems: general services
91
Distributed file systems
NFS: implementation
• Client caching
– cached are results of
• read, write, getattr, lookup, readdir
– problem: multiple copies of same data at
different NFS clients
– NFS clients use read-ahead and delayed write
November 2005
Distributed systems: general services
92
Distributed file systems
NFS: implementation
• Client caching (cont.)
– handling of writes
• block of file is fetched and updated
• changed block is marked as dirty
• dirty pages of files are flushed to server
asynchronously
– on close of file
– sync operation on client
– by bio-daemon (when block is filled)
• dirty pages of directories are flushed to server
– by bio-daemon without further delay
November 2005
Distributed systems: general services
93
Distributed file systems
NFS: implementation
• Client caching (cont.)
– consistency checks
• based on time-stamps indicating last modification
of file on server
• validation checks
– when file is opened
– when a new block is fetched
– assumed to remain valid for a fixed time (3 sec for file, 30
sec for directory)
– next operation causes another check
• costly procedure
November 2005
Distributed systems: general services
94
Distributed file systems
NFS: implementation
• Caching
– Cached entry is valid
(T – Tc) < t or (Tm_client = Tm_server)
– T: current time
– Tc: time when cache entry was last validated
– t: freshness interval (3 .. 30 secs in Solaris)
– Tm: time when block was last modified at …
– consistency level
• acceptable
• most UNIX applications do not depend critically on
synchronisation of file updates
November 2005
Distributed systems: general services
95
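A sketch of the validity test above for one cached block; getattr_from_server stands in for the NFS getattr call and the entry layout is hypothetical.

    import time

    FRESHNESS_INTERVAL = 3.0   # seconds; 3 for files, 30 for directories

    def cache_entry_valid(entry, getattr_from_server):
        """entry: dict with 'tc' (when last validated) and 'tm_client'
        (modification time recorded when the block was cached)."""
        now = time.time()
        if now - entry["tc"] < FRESHNESS_INTERVAL:
            return True                       # recently validated: trust it
        tm_server = getattr_from_server()     # one getattr round trip
        entry["tc"] = now                     # we have just validated it
        # a False result would cause the block to be fetched again
        return entry["tm_client"] == tm_server

    entry = {"tc": time.time() - 10, "tm_client": 1234.0}
    print(cache_entry_valid(entry, getattr_from_server=lambda: 1234.0))  # True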
Distributed file systems
NFS: implementation
• Performance
– reasonable performance
• remote files on fast disk better than local files on
slow disk
• RPC packets are 9 Kb to contain 8Kb disk blocks
– Lookup operation covers about 50% of server
calls
– Drawbacks:
• frequent getattr calls for time-stamps (cache
validation)
• poor performance of (relatively infrequent) writes
(because of write-through)
November 2005
Distributed systems: general services
96
Distributed file systems
• Overview
– File service model
– case study: NFS
– case study: Andrew File System -- AFS
– comparison NFS <> AFS
November 2005
Distributed systems: general services
97
Distributed file systems
AFS
• Background and aims
– Base: observation of UNIX file systems
• files are small ( < 10 Kb)
• read is more common than write
• sequential access is common, random access is not
• most files are not shared
• shared files are often modified by one user
• file references come in bursts
– Aim: combine best of personal computers and
time-sharing systems
November 2005
Distributed systems: general services
98
Distributed file systems
AFS
• Background and aims (cont.)
– Assumptions about environment
• secured file servers
• public workstations
• workstations with local disk
• no private files on local disk
– Key targets: scalability and security
• CMU 1991: 800 workstations, 40 servers
November 2005
Distributed systems: general services
99
Distributed file systems
AFS
• Design characteristics
– whole-file serving
– whole-file caching
⇒ entire files are transmitted, not blocks
– client cache realised on local disk (relatively large)
⇒ lower number of open requests on the network
– separation between file and directory service
November 2005
Distributed systems: general services
100
Distributed file systems
AFS
• Configuration
– single (global) name space
– local files
• temporary files
• system files for start-up
– volume = unit of configuration and
management
– each server maintains a replicated location
database (volume-server mappings)
November 2005
Distributed systems: general services
101
Distributed file systems
AFS
• File name space
[Figure: AFS file name space. Local: / (root) with tmp, bin (symbolic link), afs, … Shared: / (root) with users (ann, pierre), bin, …
Directory afs ⇒ root of shared file system]
November 2005
Distributed systems: general services
102
Distributed file systems
AFS
• Implementation
– Vice = file server
• secured systems, controlled by system management
• understands only file identifiers
• runs in user space
– Venus = user/client software
• runs on workstations
• workstations keep autonomy
• implements directory services
– kernel modifications for open and close
November 2005
Distributed systems: general services
103
Distributed file systems
AFS
[Figure: AFS distribution of processes. Each workstation runs user programs, a Venus process and the Unix kernel; each server runs a Vice process on top of the Unix kernel; workstations and servers communicate over the network.]
November 2005
Distributed systems: general services
104
Distributed file systems
AFS
[Figure: system-call interception in AFS. A user program issues Unix file system calls to the Unix kernel; local operations go to the Unix file system, non-local file operations are passed to Venus.]
November 2005
Distributed systems: general services
105
Distributed file systems
AFS
• Implementation of file system calls
– open, close:
• UNIX kernel of workstation
• Venus (on workstation)
• VICE (on server)
– read, write:
• UNIX kernel of workstation
November 2005
Distributed systems: general services
106
Distributed file systems
AFS
• Open system call
open( FileName, …)
kernel: if FileName is in shared space /afs then pass request to Venus
venus: if the file is not present in the local cache OR is present with an invalid callback, then pass request (over the network) to the Vice server
vice: transfer a copy of the file and a valid callback promise
venus: place the copy of the file in the local file system and record FileName
kernel: open the local file and return a descriptor to the user process
November 2005
Distributed systems: general services
107
Distributed file systems
AFS
• Read Write system call
read( FileDescriptor,…)
kernel
perform normal UNIX read op local copy of file
….
November 2005
Distributed systems: general services
108
Distributed file systems
AFS
• Close system call
close( FileDescriptor, …)
kernel: close the local copy and inform Venus about the close
venus: if the local copy has changed, then send the copy (over the network) to the Vice server
vice: store the copy of the file and send a callback to the other clients holding a callback promise
...
November 2005
Distributed systems: general services
109
Distributed file systems
AFS
• Caching
– callback principle
• service supplies callback promise at open
• promise is stored with file in client cache
– callback promise can be valid or cancelled
• initially valid
• server sends message to cancel callback promise
– to all clients that cache the file
– whenever file is updated
November 2005
Distributed systems: general services
110
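An illustrative sketch (not AFS code) of the callback bookkeeping Venus keeps for cached files: a promise is either valid or cancelled, and a callback message from Vice flips it.

    VALID, CANCELLED = "valid", "cancelled"

    class VenusCache:
        def __init__(self):
            self._entries = {}   # file id -> {"data": bytes, "callback": state}

        def store(self, fid, data):
            # called after Vice transfers the file together with a callback promise
            self._entries[fid] = {"data": data, "callback": VALID}

        def cancel_callback(self, fid):
            # invoked when Vice sends a callback because another client updated fid
            if fid in self._entries:
                self._entries[fid]["callback"] = CANCELLED

        def open_needs_fetch(self, fid):
            e = self._entries.get(fid)
            return e is None or e["callback"] == CANCELLED

    cache = VenusCache()
    cache.store("fid-1", b"contents")
    print(cache.open_needs_fetch("fid-1"))   # False: local copy with valid promise
    cache.cancel_callback("fid-1")
    print(cache.open_needs_fetch("fid-1"))   # True: must fetch from Vice again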
Distributed file systems
AFS
• Caching maintenance
– when client workstation reboots
• cache validation necessary because of missed
messages
• cache validation requests are sent for each valid
promise
– valid callback promises are renewed
• on open
• when no communication has occurred with the
server during a period T
November 2005
Distributed systems: general services
111
Distributed file systems
AFS
• Update semantics
– guarantee after successful open of file F on server S:
latest(F, S, 0)
or
lostcallback(S, T) and incache(F) and latest(F, S, T)
– no other concurrency control
• 2 copies can be updated at different workstations
• all updates except those from the last close are (silently) lost
• differs from normal UNIX operation
November 2005
Distributed systems: general services
112
Distributed file systems
AFS
• Performance: impressive compared to NFS
– benchmark: load on server
• 40% AFS
• 100% NFS
– whole-file caching:
• reduction of load on servers
• minimises effect of network latency
– read-only volumes are replicated (master copy
for occasional updates)
Andrew optimised for a specific pattern of use!!
November 2005
Distributed systems: general services
113
Distributed file systems
• Overview
– File service model
– case study: NFS
– case study AFS
– comparison NFS <> AFS
November 2005
Distributed systems: general services
114
Distributed file systems
NFS <>AFS
• Access transparency
– both in NFS and AFS
– Unix file system interface is offered
• Location transparency
– uniform view on shared files in AFS
– in NFS
• mounting freedom
• same view possible if same mounting; discipline!
November 2005
Distributed systems: general services
115
Distributed file systems
NFS <>AFS
• Failure transparency
– NFS
• no state of clients stored in servers
• idempotent operations
• transparency limited by soft mounting
– AFS
• state about clients stored in servers
• cache maintenance protocol handles server crashes
• limitations?
November 2005
Distributed systems: general services
116
Distributed file systems
NFS <>AFS
• Performance transparency
– NFS
• acceptable performance degradation
– AFS
• only delay for first open operation on file
• better than NFS for small files
November 2005
Distributed systems: general services
117
Distributed file systems
NFS <>AFS
• Migration transparency
– limited: update of locations required
– NFS
• update configuration files on all clients
– AFS
• update of replicated database
November 2005
Distributed systems: general services
118
Distributed file systems
NFS <>AFS
• Replication transparency
– NFS
• not supported
– AFS
• limited support; for read-only volumes
• one master for exceptional updates; manual
procedure for propagating changes to other volumes
November 2005
Distributed systems: general services
119
Distributed file systems
NFS <>AFS
• Concurrency transparency
– not supported (in UNIX)
• Scalability transparency
– AFS better than NFS
November 2005
Distributed systems: general services
120
Distributed file systems
• Overview
– File service architecture
– case study: NFS
– case study AFS
– comparison NFS <> AFS
November 2005
Distributed systems: general services
121
Distributed file systems
• Conclusions
– Standards ≠ quality
– evolution to standards and common key
interface
• AFS-3 incorporates Kerberos, vnode interface,
large messages for file blocks (64Kb)
– network (mainly access) transparency causes
inheritance of weakness: e.g. no concurrency
control
– evolution mainly performance driven
November 2005
Distributed systems: general services
122
Distributed file systems
• Good system design:
– partitioning
• different storage volumes
– replication of servers
• high availability
• limited inconsistencies
• load balancing?
• server preference?
– caching
• acceptable performance
• limited inconsistencies
November 2005
Distributed systems: general services
123
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
November 2005
Distributed systems: general services
124
Peer-to-peer systems
• Overview
– Introduction
– Napster
– Middleware
– Routing algorithms
– OceanStore – Pond file store
November 2005
Distributed systems: general services
125
Peer-to-peer systems
Introduction
• Definition 1:
– Support useful distributed services
– Using data and computer resources in the PCs
and workstations available on the Internet
• Definition 2:
– Applications exploiting resources available at
the edges of the Internet
November 2005
Distributed systems: general services
126
Peer-to-peer systems
Introduction
• Characteristics:
– Each user contributes resources to the system
– All nodes have the same functional capabilities and
responsibilities
– Correct operation does not depend on the existence of
any centrally administered systems
– Can offer limited degree of anonymity
– Efficient algorithms for the placement of data across
many hosts and subsequent access
– Efficient algorithms for load balancing and availability
of data
November 2005
Distributed systems: general services
127
Peer-to-peer systems
Introduction
• 3 generations of peer-to-peer systems and
application development:
– Napster: music exchange service
– File-sharing applications offering greater
scalability, anonymity and fault tolerance:
Freenet, Gnutella, Kazaa
– Middleware for the application-independent
management of distributed resources on a
global scale: Pastry, Tapestry,…
November 2005
Distributed systems: general services
128
Peer-to-peer systems
Napster
• Download digital music files
• Architecture:
– Centralised indices
– Files stored and accessed on PCs
• Method of operation (next slide)
November 2005
Distributed systems: general services
129
Peer-to-peer systems
Napster
[Figure: Napster method of operation. Peers interact with Napster servers holding an index:
1. File location request (peer → Napster server)
2. List of peers offering the file (Napster server → peer)
3. File request (peer → peer)
4. File delivered (peer → peer)
5. Index update (peer → Napster server)]
– step 5: file sharing expected!
November 2005
Distributed systems: general services
130
Peer-to-peer systems
Napster
• Conclusion:
– Feasibility demonstrated
– Simple load balancing techniques
– Replicated unified index of all available music
files
– No updates of files
– Availability not a hard concern
– Legal issues: anonymity
November 2005
Distributed systems: general services
131
Peer-to-peer systems
• Overview
– Introduction
– Napster
– Middleware
– Routing algorithms
– OceanStore – Pond file store
November 2005
Distributed systems: general services
132
Peer-to-peer systems
middleware
• Goal:
– Automatic placement of distributed objects
– Subsequent location of distributed objects
• Functional requirements:
– Locate and communicate with any resource
– Add and remove resources
– Add and remove hosts
– API independent of type of resource
November 2005
Distributed systems: general services
133
Peer-to-peer systems
middleware
• Non-functional requirements:
– Global scalability (millions of objects on hundreds of thousands of hosts)
– Optimization for local interactions between
neighbouring peers
– Accommodating to highly dynamic host availability
– Security of data in an environment with heterogeneous
trust
– Anonymity, deniability and resistance to censorship
November 2005
Distributed systems: general services
134
Peer-to-peer systems
middleware
• Approach:
– Knowledge of locations of objects
• Partitioned
• Distributed
• Replicated (x16)
throughout the network
⇒ Routing overlay:
• Distributed algorithm for locating objects and nodes
November 2005
Distributed systems: general services
135
Peer-to-peer systems
middleware
[Figure: a routing overlay. A message addressed to an object is relayed via nodes A, B, C and D; each of these nodes holds part of the routing knowledge needed to reach the object's current host.]
November 2005
Distributed systems: general services
136
Peer-to-peer systems
middleware
• Resource identification:
– GUID = globally unique identifier
– Secure hash from
• all state
– self certifying
– for immutable objects
• part of state
– e.g. name + …
November 2005
Distributed systems: general services
137
Peer-to-peer systems
middleware
• API for a DHT in Pastry
put(GUID, data)
The data is stored in replicas at all nodes responsible for
the object identified by GUID.
remove(GUID)
Deletes all references to GUID and the associated data.
value = get(GUID)
The data associated with GUID is retrieved from one of
the nodes responsible for it.
DHT = distributed hash table
November 2005
Distributed systems: general services
138
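A toy, purely local stand-in for the Pastry DHT API above, to show how the three operations fit together; a real DHT would route put/get to the nodes responsible for the GUID and replicate the data there.

    import hashlib

    class ToyDHT:
        """Local dictionary pretending to be a DHT: GUID -> data."""

        def __init__(self):
            self._store = {}

        @staticmethod
        def guid_for(content: bytes) -> str:
            return hashlib.sha1(content).hexdigest()   # 160-bit GUID as hex

        def put(self, guid, data):
            self._store[guid] = data       # a real DHT replicates at responsible nodes

        def get(self, guid):
            return self._store[guid]

        def remove(self, guid):
            self._store.pop(guid, None)

    dht = ToyDHT()
    g = ToyDHT.guid_for(b"my immutable object")
    dht.put(g, b"my immutable object")
    print(dht.get(g) == b"my immutable object")   # True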
Peer-to-peer systems
middleware
• API for a DOLR in Tapestry
publish(GUID )
This function makes the node performing a publish operation the host for
the object corresponding to GUID.
unpublish(GUID)
Makes the object corresponding to GUID inaccessible.
sendToObj(msg, GUID, [n])
an invocation message is sent to an object in order to access it. The
optional parameter [n], requests the delivery of the same message to n
replicas of the object.
DOLR = distributed object location and routing
November 2005
Distributed systems: general services
139
Peer-to-peer systems
• Overview
– Introduction
– Napster
– Middleware
– Routing algorithms
– OceanStore – Pond file store
November 2005
Distributed systems: general services
140
Peer-to-peer systems
Routing algorithms
• prefix routing
– Used in Pastry, Tapestry
• GUID: 128-bit
– Host: secure hash on public key
– Object: secure hash on name or part of state
• N participating nodes
– O(log N) steps to route a message to any GUID
– O(log N) messages to integrate a new node
November 2005
Distributed systems: general services
141
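As a small illustration (not Pastry code): a GUID can be derived as a 128-bit secure hash of a node's public key or an object's name, and prefix routing works on the length of the shared hexadecimal prefix between GUIDs. The truncated SHA-256 here is an assumption for the sketch.

    import hashlib

    def guid(data: bytes) -> str:
        """128-bit GUID as 32 hex digits (truncated SHA-256, for illustration)."""
        return hashlib.sha256(data).hexdigest()[:32]

    def common_prefix_len(a: str, b: str) -> int:
        """Number of leading hexadecimal digits two GUIDs share."""
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    g1 = guid(b"node-public-key-1")
    g2 = guid(b"some object name")
    print(g1, g2, common_prefix_len(g1, g2))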
Peer-to-peer systems
Routing algorithms
• Simplified algorithm: circular routing
– Leaf set in each active node
– Element in leaf set:
GUID – IP address
of nodes whose GUIDs are numerically closest
on either side of its own GUID
– Size L = 2*l
– Leaf sets are updated when nodes
• Join
• Leave
November 2005
Distributed systems: general services
142
Peer-to-peer systems
Routing algorithms
[Figure: circular GUID space from 0 to FFFFF....F (2^128 - 1); live nodes shown include 65A1FC, D13DA3, D467C4, D46A1C and D471F1.]
• Circular space for GUIDs
• Dot = live node
• Leaf size = 8
• Shown: routing of a message from node 65A1FC to D46A1C
November 2005
Distributed systems: general services
143
Peer-to-peer systems
Routing algorithms
• Full algorithm (Pastry)
– Tree-structured routing table
• GUID – IP pairs spread throughout entire range of
GUIDs
• Increased density of coverage for GUIDs
numerically close to GUID of local node
• GUIDs viewed as hexadecimal values
• Table classifies GUIDs based on hexadecimal
prefixes
• As many rows as hexadecimal digits
November 2005
Distributed systems: general services
144
Peer-to-peer systems
Routing algorithms
• Full algorithm (Pastry)
November 2005
Distributed systems: general services
145
Peer-to-peer systems
Routing algorithms
[Figure: circular GUID space from 0 to FFFFF....F (2^128 - 1); live nodes shown include 65A1FC, D13DA3, D4213F, D462BA, D467C4, D46A1C and D471F1.]
• Routing a message from 65A1FC to D46A1C
• Well-populated table: ~log16(N) hops
November 2005
Distributed systems: general services
146
Peer-to-peer systems
Routing algorithms
• Full algorithm (Pastry)
To handle a message M addressed to a node D (where R[p,i] is the element at column i, row p of the routing table):
1. If (L_-l < D < L_l) { // the destination is within the leaf set or is the current node.
2. Forward M to the element L_i of the leaf set with GUID closest to D, or the current node A.
3. } else { // use the routing table to despatch M to a node with a closer GUID
4. Find p, the length of the longest common prefix of D and A, and i, the (p+1)th hexadecimal digit of D.
5. If (R[p,i] ≠ null) forward M to R[p,i] // route M to a node with a longer common prefix.
6. else { // there is no entry in the routing table
7. Forward M to any node in L or R with a common prefix of length i, but a GUID that is numerically closer.
}
}
November 2005
Distributed systems: general services
147
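A compact Python sketch of the routing decision above for one node (an illustration, not the Pastry implementation): deliver within the leaf set if possible, otherwise use the routing table entry for the next digit, otherwise fall back to any known node that is numerically closer. The data representation is an assumption.

    def next_hop(D, A, leaf_set, routing_table):
        """D, A: GUIDs as upper-case hex strings of equal length; leaf_set: GUIDs
        of nodes around A; routing_table: dict (row p, hex digit) -> GUID or None.
        Returns the GUID of the node to forward to (A itself if no better node)."""
        dist = lambda g: abs(int(g, 16) - int(D, 16))
        ints = [int(g, 16) for g in leaf_set]
        if ints and min(ints) <= int(D, 16) <= max(ints):
            return min(leaf_set + [A], key=dist)          # destination within leaf set
        p = next((k for k, (x, y) in enumerate(zip(D, A)) if x != y), len(D))
        if p == len(D):
            return A                                      # D is the current node
        entry = routing_table.get((p, D[p]))
        if entry:
            return entry                                  # node with a longer common prefix
        candidates = [g for g in leaf_set + [g for g in routing_table.values() if g]
                      if dist(g) < dist(A)]
        return min(candidates, key=dist) if candidates else A

    table = {(0, "D"): "D13DA3"}
    print(next_hop("D46A1C", "65A1FC",
                   leaf_set=["65A0FF", "65A2AA"], routing_table=table))  # D13DA3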
Peer-to-peer systems
Routing algorithms
• Full algorithm (Pastry): host integration
– New node X:
• computes GUIDX
• X should know at least one nearby Pastry node A with GUIDA
• X sends special join message to A (destination GUIDX)
– A despatches the join message via Pastry
• Route of message: A, B, …, Z (GUID_Z numerically closest to GUID_X)
– Each node sends relevant part of its routing table and
leaf sets to X
November 2005
Distributed systems: general services
148
Peer-to-peer systems
Routing algorithms
• Full algorithm (Pastry): host integration
– Route of message: A, B, ….Z
– Routing table of X:
• Row 0 of A ⇒ row 0 of X
• B and X share hexadecimal digits: so row i of B ⇒ row i of X
• …
• Choice for entry? proximity neighbour selection algorithm
(metric: IP hops, measured latency, …)
– Leaf set of Z ⇒ leaf set of X
– X sends its leaf set and routing table to all nodes in its
routing table and leaf set
November 2005
Distributed systems: general services
149
Peer-to-peer systems
Routing algorithms
• Full algorithm (Pastry): host failure
– Repair actions to update leafsets: leaf set requested
from node(s) close to failed node
– Routing table: repairs on a “when discovered” basis
November 2005
Distributed systems: general services
150
Peer-to-peer systems
Routing algorithms
• Full algorithm (Pastry): Fault tolerance
– All nodes in leaf set alive?
• Each node sends heartbeat messages to nodes in
leaf set
– Malicious nodes?
• Clients can use an at-least-once delivery mechanism
• Use limited randomness in node selection in routing
algorithm
November 2005
Distributed systems: general services
151
Peer-to-peer systems
Routing algorithms
• Full algorithm (Pastry): dependability
– Include acks in routing algorithm. No ack ⇒
• use an alternative route
• mark the node as a suspected failure
– Heartbeat messages
– Suspected failed nodes in routing tables
• Probed
• Fail: alternative nodes
– Simple Gossip protocol to exchange routing
table info between nodes
November 2005
Distributed systems: general services
152
Peer-to-peer systems
Routing algorithms
• Full algorithm (Pastry): evaluation
– Message loss
– Performance: Relative Delay Penalty (RDP)
= Pastry delivery / IP/UDP delivery
Results over 100,000 messages:
IP loss | Pastry loss | Pastry wrong node | RDP
0% | 1.5 | 0 | 1.8
5% | 3.3 | 1.6 | 2.2
November 2005
Distributed systems: general services
153
Peer-to-peer systems
• Overview
– Introduction
– Napster
– Middleware
– Routing algorithms
– OceanStore – Pond file store
November 2005
Distributed systems: general services
154
Distributed file systems
OceanStore
• Objectives:
– Very large scale, incrementally-scalable,
persistent storage facility
– Mutable data objects with long-term
persistence and reliability
– Environment of network and computing
resources constantly changing
• Intended use
• Mechanisms used
November 2005
Distributed systems: general services
155
Distributed file systems
OceanStore
• Intended use:
– NFS-like file system
– Electronic mail hosting
• Mechanism needed:
– Consistency between replicas
• Tailored to application needs by a Bayou like
system
– Privacy and integrity
• Encryption of data
• Byzantine agreement protocol for updates
November 2005
Distributed systems: general services
156
Distributed file systems
OceanStore
• Storage organization
[Figure: storage organisation of OceanStore objects. The AGUID leads to a certificate holding the VGUID of the current version (i+1) and the VGUID of version i; each version's root block points, via indirection blocks, to data blocks d1 d2 d3 d4 d5; unchanged blocks are shared between versions (copy on write, referenced by BGUIDs), and the root block of version i also refers to the VGUID of version i-1.
Version i+1 has been updated in blocks d1, d2 and d3. The certificate and the root blocks include some metadata not shown. All unlabelled arrows are BGUIDs.]
November 2005
Distributed systems: general services
157
Distributed file systems
OceanStore
• Storage organization (cont)
Name | Meaning | Description
BGUID | block GUID | Secure hash of a data block
VGUID | version GUID | BGUID of the root block of a version
AGUID | active GUID | Uniquely identifies all the versions of an object
November 2005
Distributed systems: general services
158
Distributed file systems
OceanStore
• Storage organization (cont)
– New object ⇒ AGUID
• small set of hosts act as inner ring
• publish (AGUID)
• Refers to signed certificate recording the sequence
of versions of the object
– Update of object
• Byzantine agreement between host in inner ring
(primary copy)
• Result disseminated to secondary replicas
November 2005
Distributed systems: general services
159
Distributed file systems
OceanStore
• Performance (times in seconds)
Phase | LAN: Linux NFS | LAN: Pond | WAN: Linux NFS | WAN: Pond | Predominant operations in benchmark
1 | 0.0 | 1.9 | 0.9 | 2.8 | Read and write
2 | 0.3 | 11.0 | 9.4 | 16.8 | Read and write
3 | 1.1 | 1.8 | 8.3 | 1.8 | Read
4 | 0.5 | 1.5 | 6.9 | 1.5 | Read
5 | 2.6 | 21.0 | 21.5 | 32.0 | Read and write
Total | 4.5 | 37.2 | 47.0 | 54.9 |
November 2005
Distributed systems: general services
160
Distributed file systems
OceanStore
• Performance (cont)
– Benchmarks:
• Create subdirectories recursively
• Copy a source tree
• Examine the status of all files in the tree
• Examine every byte of data in all files
• Compile and link the files
November 2005
Distributed systems: general services
161
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
November 2005
Distributed systems: general services
162
Distributed Systems:
General Services
November 2005
Distributed systems: general services
163