Coda Server Internals
Peter J Braam
Contents
Data structure overview
Volumes
Vnodes
Inodes
Data Structure Overview

Object                                   Purpose                   Resides where
------                                   -------                   -------------
Inodes                                   File contents             /vicep* partitions
Volumes, Vnodes, directory contents,     Metadata & dir contents   RVM
  ACLs, reslogs
Volinfo records                          Volume location           VLDB, VRDB: RW db files
VSGDB, Pdb records, tokens               Security                  VSGDB, .pdb, .tk files: dynamic RO db files
Servers/SCM, partitions, startup flags,  Configuration data        Static data files
  skipvolumes, LOG & DATA & DB locators
RVM layout
(coda_globals.h)
Already_initialized (int)
struct VolHead[MAXVOLS]
struct VnodeDiskObject *SmallVnodeFreeLists[SM_FREESIZE]
short SmallVnodeIndex
... the same fields again for large vnodes ...
MaxVolId (unsigned long)
Remainder is dynamically allocated
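Pulled together, the static part of the segment might look roughly like this sketch; the field names follow the slide, but the struct name, the MAXVOLS and SM_FREESIZE values, and the exact types are illustrative assumptions, not the literal coda_globals.h layout:

    /* Hedged sketch of the recoverable static segment (cf. coda_globals.h). */
    #define MAXVOLS     1024               /* assumed value */
    #define SM_FREESIZE 64                 /* assumed value */

    struct VnodeDiskObject;                /* on-RVM vnode image, later slides */
    struct VolHead { int stub; };          /* per-volume head, expanded below */

    struct coda_rvm_segment {              /* struct name is an assumption */
        int already_initialized;           /* set once the segment is formatted */
        struct VolHead VolumeList[MAXVOLS];
        struct VnodeDiskObject *SmallVnodeFreeList[SM_FREESIZE];
        short SmallVnodeIndex;             /* top of the small free list */
        /* ... the same two fields again for large vnodes ... */
        unsigned long MaxVolId;            /* highest volume id handed out */
    };
    /* everything beyond these statics is rvm_malloc()ed dynamically */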
Volume zoo
RVM: structures
VolumeData
VolHead
VolumeHeader
VolumeDiskData
(volume.h, camprivate.h)
VM: structures
Volume
VolumeInfo, …
A volume in RVM
VolHead:
  VolumeHeader
    stamp
    id
    parentid
    type
  VolumeData
    *volumeDiskData
    *smallVnodeLists
    nsmallVnodes
    nsmallLists
    -- same for large --
VolumeData contains pointers to rvm_malloc'ed data
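As a sketch in C; the field names come from the slide (and volume.h, camprivate.h), while the exact types and comments are assumptions:

    struct VolumeDiskData;                     /* the big record, next slide */
    struct rec_smolist;                        /* recoverable singly-linked list */

    struct VolumeHeader {
        unsigned int stamp;                    /* magic/version stamp */
        unsigned int id;                       /* this replica's volume id */
        unsigned int parentId;                 /* replicated (parent) volume id */
        int type;                              /* RW, RO, backup, ... */
    };

    struct VolumeData {
        struct VolumeDiskData *volumeDiskData; /* rvm_malloc()ed */
        struct rec_smolist *smallVnodeLists;   /* rvm_malloc()ed array */
        unsigned int nsmallVnodes;
        unsigned int nsmallLists;
        /* -- the same three fields again for large vnodes -- */
    };

    struct VolHead {                           /* one slot of VolumeList[] */
        struct VolumeHeader header;
        struct VolumeData data;
    };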
VolumeDiskData (rvm)
Lots of stuff:
Identity & location: partition, name
Runtime info: in use, inService, blessed, salvaged
Vnode related: next uniquifier
Version vector
Resolution flags, pointer to recov_vol_log
Quota
Resource usage: filecount, diskused, etc.
Volumes in VM
struct Volume objects sit in the VolHash with
copies of the RVM data structures
Volumes are salvaged before being "attached" to the VolHash
Model of operation (FS), sketched below:
GetVolume copies the data out of RVM
do your modifications in VM
PutVolume commits them in one RVM transaction
Model of operation (Volutil):
operate directly on RVM
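A minimal sketch of the FS model; GetVolume and PutVolume are the entry points named above, but their signatures, the Volume stub, and UpdateDiskUsage are assumptions for illustration:

    struct Volume { long diskused; /* ... plus the VolumeDiskData fields ... */ };
    Volume *GetVolume(unsigned long volid);   /* copies RVM state into VM */
    void PutVolume(Volume *vp);               /* writes it back in one RVM txn */

    void UpdateDiskUsage(unsigned long volid, long delta) {
        Volume *vp = GetVolume(volid);        /* VM copy, locked */
        if (!vp) return;
        vp->diskused += delta;                /* modify only the VM copy */
        PutVolume(vp);                        /* a single RVM transaction commits */
    }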
Volumes in Venus RPCs
One RPC: GetVolInfo
used for mount point traversal
Only relates to
volume location database
volume replication database
VSGDB
Could sit in a separate Volume
Location Server
Vnodes
(cvnode.h)
Small & large: large vnodes are for directories;
the difference is the ACL at the back of large vnodes
Inode field:
small vnodes: holds the inode number of the container file on disk
large vnodes: holds the RVM address of the directory inode
Contains one important small structure: the vv_t (version vector)
Pointers to reslog entries
VM: cvnodes with a hash table, freelists, etc.
Vnodes in RVM
RVM: VnodeDiskObject (rvm_malloc'ed)
vnodes sit on rec_smolists
each link points to a VnodeDiskObject
a list links vnodes with identical vnode numbers
but different uniquifiers
new vnodes are grabbed from the free lists
(index.cc, recov{a,b,c}.cc)
volumes have arrays of rec_smolists which
grow when they are full
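A sketch of the lookup this implies; the VnodeDiskObject shape and the FindVnode helper are simplified assumptions (the real code lives around index.cc and recov{a,b,c}.cc):

    struct VnodeDiskObject {
        unsigned long uniquifier;        /* distinguishes reused vnode numbers */
        struct VnodeDiskObject *next;    /* link on the rec_smolist */
        /* ... vv_t, inode field, ACL at the back of large vnodes ... */
    };

    struct VnodeDiskObject *FindVnode(struct VnodeDiskObject **lists,
                                      unsigned int nlists,
                                      unsigned long vnum, unsigned long unique) {
        /* all vnodes with this vnode number live on one list */
        struct VnodeDiskObject *v = lists[vnum % nlists];
        while (v && v->uniquifier != unique)
            v = v->next;
        return v;                        /* 0 if not present in this volume */
    }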
Vnodes in action
Model:
GetFSObj calls GetVnode
work is done
PutFSObjects calls
rvm_begin_transaction
ReplaceVnode - copies data from VM to RVM
rvm_end_transaction
Getting a vnode takes 3 pointer dereferences, possibly
3 page faults, vs. 1 for local file systems.
Is this necessary? Probably not. Cure it? Yes!
Directories (rvm)
DirInode
page table and “copy on write” refcount
DirPages, 2048 bytes each,
build up the directory;
each is divided into 64 32-byte blobs
Hash table for fast name lookups
Blob Freelist
Array of free blobs per page
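The blob addressing works out as in this sketch; the sizes are from the slide, while the struct shapes and the BlobAddr helper are assumptions:

    #define DIR_PAGESIZE 2048
    #define DIR_BLOBSIZE 32
    #define BLOBS_PER_PAGE (DIR_PAGESIZE / DIR_BLOBSIZE)   /* = 64 */

    struct DirPage { char blob[BLOBS_PER_PAGE][DIR_BLOBSIZE]; };

    struct DirInode {
        int refcount;              /* > 1 when shared by copy-on-write */
        int npages;
        struct DirPage **pages;    /* page table, rvm_malloc()ed */
    };

    /* a blob number addresses page blobno/64, slot blobno%64 */
    char *BlobAddr(struct DirInode *d, int blobno) {
        return d->pages[blobno / BLOBS_PER_PAGE]->blob[blobno % BLOBS_PER_PAGE];
    }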
Directories
More than one vnode can point to a
directory (copy on write)
VM: hash table of DirHandles
point to VM contiguous copy of dir
point to DirInode
have a lock etc
Model: as for volumes & vnodes
Critique: too baroque
Files
Vnode references file by InodeNumber
Files are copy on write
There are "FileInodes", like dir inodes, but
they are held in an external DB or in the inode
itself
Server always reads/writes whole files
(could be exploited)
Volinit and salvage
Set up volume hash table, serverlist,
DiskPartitionList
Cycle through the partitions, checking each for
its list of inodes:
every inode has a vnode
every vnode has a directory name
every directory name has a vnode
Put volume in a VM hash table
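The three invariants amount to a cross-check like this sketch; it is purely illustrative (the real salvager streams inodes from the partitions rather than building sets):

    #include <set>

    typedef unsigned long Ino;

    /* diskInodes: inodes found on /vicep*; vnodeInodes: inodes referenced
       by vnodes; namedInodes: inodes reachable through a directory name */
    bool SalvageOK(const std::set<Ino> &diskInodes,
                   const std::set<Ino> &vnodeInodes,
                   const std::set<Ino> &namedInodes) {
        return diskInodes == vnodeInodes && vnodeInodes == namedInodes;
    }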
Server connection info
Array of HostEntry (a “venus”)
Contains a linked list of connections
Contains a callback connection id
Connection setup
first binding creates a host & callback conn
new binding creates a new connection and
verifies callback
in RPC2_NewBinding & ViceNewConnectFS
Callbacks
Hashtable of FileEntries:
each contains Fid
number of users
linked list of callbacks
Callbacks point to a HostEntry
Ops:
RPC: BreakCallBack
Local: placing, delete, deleteVenus
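A sketch of that bookkeeping; ViceFid matches the usual Coda fid triple, while the other struct shapes are assumptions:

    struct ViceFid { unsigned long Volume, Vnode, Unique; };
    struct HostEntry;                      /* one per venus; holds the CB conn id */

    struct CallBackEntry {
        struct HostEntry *host;            /* whom to notify via BreakCallBack */
        struct CallBackEntry *next;
    };

    struct FileEntry {                     /* lives in a hash table keyed by Fid */
        struct ViceFid fid;
        int users;                         /* number of interested clients */
        struct CallBackEntry *callbacks;   /* linked list, one per client */
    };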
Callbacks
The connection is unauthenticated. This should
be fixed. The session key for the CB connection
should not expire.
The side effect of the callback connection is used
for the BackFetch bulk transfer of files during
reintegration.
RPC processing
Venus RPCs:
srvproc.cc - standard file ops
srvproc2.cc - standard volume ops
codaproc.cc - repair stuff
codaproc2.cc - reintegration stuff
Volutil RPCs:
vol-your-rpc.cc (in coda-src/volutil)
Resolution: see below
RPC processing
RPC structure:
ValidateParms: validate, hand off COP2, cid
GetObject: vm copy, lock objects
CheckSemantics:
Concurrency, Integrity, Permissions
Perform operations:
BulkTransfer, UpdateObjects, OutParms
PutObject: rvm transactions, inode deletions
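A skeleton of a handler following that structure; the helper names echo the slide, but the signatures and control flow are assumptions:

    typedef long RPC2_Handle;              /* stand-in for the rpc2 type */
    long ValidateParms(RPC2_Handle cid);   /* validate, hand off COP2, cid */
    long GetObjects(void);                 /* VM copies made, objects locked */
    long CheckSemantics(void);             /* concurrency/integrity/permissions */
    long PerformOp(void);                  /* bulk transfer, updates, out parms */
    void PutObjects(long rc);              /* RVM txn, inode deletions, unlock */

    long ViceSomeOp(RPC2_Handle cid /*, in/out parms ... */) {
        long rc = ValidateParms(cid);
        if (rc) return rc;
        if ((rc = GetObjects()) == 0) {
            if ((rc = CheckSemantics()) == 0)
                rc = PerformOp();
            PutObjects(rc);                /* commit on success, clean up on error */
        }
        return rc;
    }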
vlists
GetFSObjects: instantiate a vlist
the RPC needs a list of objects copied from RVM
modification status is held there (did
copy-on-write kick in, etc.)
PutObjects
rvm_begin_transaction
walk through the list, copy, rvm_set_range,
unlock
rvm_end_transaction
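As a sketch; the vle layout and UnlockObject are assumptions, and the rvm_* prototypes are simplified stand-ins for the real ones in rvm.h (which take an rvm_tid_t * and return rvm_return_t):

    #include <string.h>

    struct vle {                 /* one entry of the vlist */
        struct vle *next;
        char *rvmAddr;           /* where the object lives in RVM */
        char *vmCopy;            /* the modified VM copy */
        unsigned long len;
    };
    void rvm_begin_transaction(void);                /* simplified stand-ins */
    void rvm_set_range(void *dest, unsigned long len);
    void rvm_end_transaction(void);
    void UnlockObject(struct vle *v);

    void PutObjects(struct vle *vlist) {
        rvm_begin_transaction();                     /* one txn for the whole list */
        for (struct vle *v = vlist; v; v = v->next) {
            rvm_set_range(v->rvmAddr, v->len);       /* declare the bytes we touch */
            memcpy(v->rvmAddr, v->vmCopy, v->len);   /* copy VM state back to RVM */
            UnlockObject(v);                         /* unlock as we go */
        }
        rvm_end_transaction();                       /* commit */
    }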
COP2 handling
In a COP2, Venus gives the final VV to the server
COP2s are sent out by Venus (with some delay)
often piggybacked in bulk
the server tracks pending COP2 entries
in a hash table (coppend.cc)
A manager thread, the CopPendingManager,
runs every minute and
removes entries more than 900 secs old
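The sweep amounts to something like this sketch; the entry layout and function name are assumptions (cf. coppend.cc):

    #include <time.h>
    #include <list>

    struct CopPendingEntry { time_t queued; /* + StoreId, fids in reality */ };

    void CopPendingSweep(std::list<CopPendingEntry> &table) {
        time_t now = time(0);
        for (std::list<CopPendingEntry>::iterator it = table.begin();
             it != table.end(); )
            if (now - it->queued > 900)    /* older than 900 s: drop it */
                it = table.erase(it);
            else
                ++it;
    }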
COP2 to RVM
The data can be
piggybacked on another RPC, or
sent in a ViceCop2 RPC.
Both cases call InternalCop2 (srvproc.cc)
InternalCop2 (codaproc.cc):
notifies the manager to dequeue
gets the FS objects listed in the COP2
installs the final VVs into RVM (an RVM transaction!)
COP2 Problems
An easy cause of conflicts in replicated
volumes when clients access objects in
rapid succession. (Can be fixed easily
during the writeback-caching work.)
Not optimized for singly replicated
volumes.
Resolution
Initiated by the client with an RPC to the coordinator:
ViceResolve (codaproc.cc)
coordinator
sets up connections to the VSG (unauthenticated)
LockAndFetch (res/reslock, resutil):
lock volumes,
collect the "closure"
Resolution - special cases
RegResDirRequired (rvmres/rvmrescoord.cc)
check for:
unresolved ancestors
already-inconsistent objects
runts (missing objects)
weak equality (identical storeids)
RecovDirResolve
Phase II: (rvmres/{rescoord,subphase?}.cc)
coordinator requests logs from the other servers
subordinates lock the affected dirs, marshal logs
coordinator merges the logs
Phase III:
ship the merged log to the subordinates
perform the operations on VM copies
return results to the coordinator
Resolution
Phase IV: (the old Phase 3 …)
collect results, compute new VVs, ship to
subordinates
commit results
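The whole coordinator sequence, as a hedged outline; every function name here is invented for illustration (the real code is in rvmres/rvmrescoord.cc and rvmres/rescoord.cc):

    void LockVolumesAndFetchClosure(void); /* unauthenticated VSG connections */
    void CheckSpecialCases(void);          /* runts, weak equality, inconsistency */
    void CollectAndMergeLogs(void);        /* phase II */
    void ShipLogAndReplay(void);           /* phase III, on VM copies */
    void CommitNewVVs(void);               /* phase IV */

    void ResolveDir(/* ViceFid fid */) {
        LockVolumesAndFetchClosure();
        CheckSpecialCases();
        CollectAndMergeLogs();
        ShipLogAndReplay();
        CommitNewVVs();
    }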
Comments on resolution
Old versions of resolution:
OldDirResolve: resolves only runts and weak equality
DirResolve: resolves only in VM
Remove these.
The resolve directory has nothing to do with
resolution: it should be called librepair. Srv
uses merely one function in it; repair
uses the rest.
Volume Log
During FS operations, log entries are
created for use during resolution
Different format per operation
(rvmres/recov_vollog.cc)
Added to the vlist by SpoolVMLogRecord
Put in RVM at commit time
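A sketch of the spooling; the rlent layout is an assumption (the real per-operation record formats live in rvmres/recov_vollog.cc):

    struct rlent {                   /* one resolution log entry */
        int opcode;                  /* record format differs per operation */
        /* ... operation-specific fields: fids, names, storeid, ... */
        struct rlent *next;
    };

    /* spool a record onto the vlist in VM; PutObjects later copies the
       spooled records into RVM inside the commit transaction */
    void SpoolVMLogRecord(struct rlent **vlistLog, struct rlent *rec) {
        rec->next = *vlistLog;
        *vlistLog = rec;
    }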
Repair
Venus makes the ViceRepair RPC.
File and symlink repair: BulkTransfer the
object.
Directory repair: BulkTransfer the repair file
and replay the operations.
Venus follows this with a COP2 multi-RPC.
For directory repair, Venus invokes an
asynchronous resolve.
Future
Good:
Design is simple and efficient
There is little C++: it should be eliminated
easy to multi-thread
Bad:
Scalability: ~8 GB in practice, ~40 GB in theory
Data handling is bad: tricky to fix
Volume code was & is the worst: rewrite it