Coda Server Internals

Peter J Braam
Contents
Data structure overview
Volumes
Vnodes
Inodes
Data Structure Overview
Object | Purpose | Resides where
Inodes | File contents | /vicep* partitions
Volumes, Vnodes, Directory contents, ACL, Reslogs | Meta data & dir contents | RVM
Volinfo records | Volume location | VLDB, VRDB: RW db files
VSGDB, Pdb records, Tokens | Security | VSGDB, .pdb, .tk files: dynamic RO db files
Servers/SCM, Partitions, Startup flags, Skipvolumes, LOG & DATA & DB locators | Configuration data | Static data
RVM layout
(coda_globals.h)
 Already_initialized (int)
 struct VolHead[MAXVOLS]
 struct VnodeDiskObject *SmallVnodeFreeLists[SM_FREESIZE]
 short SmallVnodeIndex
 …. Same for large …
 MaxVolId (unsigned long)
 Remainder is dynamically allocated
Volume zoo
RVM: structures
VolumeData
VolHead
VolumeHeader
VolumeDiskData
(volume.h, camprivate.h)
VM: structures
Volume
VolumeInfo ……..
A volume in RVM
VolHead contains:
VolumeHeader: stamp, id, parentid, type
VolumeData: *volumeDiskData, *smallVnodeLists, nsmallVnodes, nsmallLists (same for big)
The pointers in VolumeData point to rvm malloced data
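The containment on this slide can be pictured with a small Python model. It is only an illustration: the real structures are C declarations (volume.h, camprivate.h), and the attribute names `header` and `data`, the integer types, and the example volume id are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VolumeHeader:
    # Field names come from the slide; the integer types are assumed.
    stamp: int = 0
    id: int = 0
    parentid: int = 0
    type: int = 0


@dataclass
class VolumeData:
    # In the server these are pointers to rvm-malloced data; plain
    # Python references stand in for them here.
    volumeDiskData: Optional[dict] = None
    smallVnodeLists: List[list] = field(default_factory=list)
    nsmallVnodes: int = 0
    nsmallLists: int = 0
    # The slide notes that the same fields exist for large vnodes.


@dataclass
class VolHead:
    # "header" and "data" are hypothetical attribute names.
    header: VolumeHeader = field(default_factory=VolumeHeader)
    data: VolumeData = field(default_factory=VolumeData)


head = VolHead(header=VolumeHeader(id=0x7F000001))
head.data.smallVnodeLists = [[] for _ in range(4)]
head.data.nsmallLists = len(head.data.smallVnodeLists)
```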
VolumeDiskData (rvm)
Lots of stuff:
Identity & location: partition, name,
runtime info: use, inService, blessed, salvaged
Vnode related: next uniquefier
Versionvector
Resolution flags, pointer to recov_vol_log
Quota
Resource usage: filecount, diskused etc
Volumes in VM
struct Volumes sit in VolHash with copies of
RVM data structures
Salvage before “attaching” to VolHash
Model of operation (FS):
GetVolume copy out from RVM
Do your mods in VM
PutVolume does RVM transaction
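The copy-out, modify, write-back cycle can be sketched as follows. `RecoverableStore` is a hypothetical stand-in for RVM, and `get_volume` / `put_volume` only mimic the roles of GetVolume and PutVolume; none of this is the server's actual code.

```python
import copy


class RecoverableStore:
    """Stand-in for RVM: contents only change inside commit()."""

    def __init__(self):
        self._volumes = {}
        self.commits = 0

    def read(self, volid):
        return copy.deepcopy(self._volumes[volid])

    def commit(self, volid, record):
        # Models one transaction that installs the new record.
        self._volumes[volid] = copy.deepcopy(record)
        self.commits += 1


def get_volume(store, volid):
    """Copy the volume record out of the recoverable store into VM."""
    return store.read(volid)


def put_volume(store, volid, vm_copy):
    """Write the modified VM copy back in a single transaction."""
    store.commit(volid, vm_copy)


store = RecoverableStore()
store.commit(1, {"name": "vol.demo", "filecount": 0})

vol = get_volume(store, 1)               # copy out
vol["filecount"] += 1                    # modify the VM copy only
unchanged = store.read(1)["filecount"]   # the store has not moved yet
put_volume(store, 1, vol)                # one transaction writes it back
```

Until `put_volume` runs, the recoverable copy is untouched, which is the point of doing the modifications in VM.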
Model of operation (Volutil):
 operate on RVM
Volumes in Venus RPC’s
One RPC: GetVolInfo
used for mount point traversal
Only relates to
volume location database
volume replication database
VSGDB
Could sit in a separate Volume Location Server
Vnodes
(cvnode.h)
Small & large: large for directories
difference is ACL at back of large vnodes
Inode field:
small vnodes: points to diskfile inode number
large vnodes: is RVM address of dir inode
Contain important small structure: vv_t
Pointers to reslog entries
VM: cvnode’s with hash table, freelists etc
Vnodes in RVM
RVM: VnodeDiskinfo (rvm_malloced)
vnodes sit on rec_smolists
each link points to a DiskVnode
lists link vnodes with identical vnodenumbers
but different uniquefiers
new vnodes grabbed from FreeLists
(index.cc, recov{a,b,c}.cc)
volumes have arrays of rec_smolists which
grow when they are full
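A minimal sketch of the lookup this implies: an array of lists, each holding the vnodes that share a vnode number and differ in uniquefier. The direct indexing by vnode number and the doubling growth policy are assumptions for illustration; the real mapping lives in index.cc and recov{a,b,c}.cc.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiskVnode:
    vnode_number: int
    uniquefier: int


class VnodeIndex:
    """Array of lists; each list holds vnodes sharing a vnode number."""

    def __init__(self, nlists: int = 4):
        self.lists: List[List[DiskVnode]] = [[] for _ in range(nlists)]

    def _slot(self, vnode_number: int) -> int:
        # Assumed mapping: the vnode number indexes the array directly.
        while vnode_number >= len(self.lists):
            # Grow when full; doubling is an assumption of this sketch.
            self.lists.extend([[] for _ in range(len(self.lists))])
        return vnode_number

    def insert(self, vnode: DiskVnode) -> None:
        self.lists[self._slot(vnode.vnode_number)].append(vnode)

    def lookup(self, vnode_number: int, uniquefier: int) -> Optional[DiskVnode]:
        if vnode_number >= len(self.lists):
            return None
        for vnode in self.lists[vnode_number]:
            if vnode.uniquefier == uniquefier:
                return vnode
        return None


idx = VnodeIndex()
idx.insert(DiskVnode(2, 10))
idx.insert(DiskVnode(2, 11))   # same vnode number, different uniquefier
idx.insert(DiskVnode(9, 1))    # forces the array to grow
```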
Vnodes in action
Model:
GetFSObj calls GetVnode
work is done
PutFS Objects calls
rvm_begin_transaction
ReplaceVnode - copies data from VM to RVM
rvm_end_transaction
Getting a vnode takes 3 pointer derefs, possibly
3 page faults vs. 1 for local file systems.
Is this necessary? Probably not. Cure it: yes!
Directories (rvm)
DirInode
page table and “copy on write” refcount
DirPages 2048 bytes each
build up the directory
each divided into 64 blobs of 32 bytes
Hash table for fast name lookups
Blob Freelist
Array of free blobs per page
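The page and blob sizes fix the addressing arithmetic. The sketch below uses only the two sizes from the slide; the idea of a directory-wide blob index and both helper names are hypothetical.

```python
PAGE_SIZE = 2048                         # bytes per DirPage (from the slide)
BLOB_SIZE = 32                           # bytes per blob (from the slide)
BLOBS_PER_PAGE = PAGE_SIZE // BLOB_SIZE  # 64


def blob_location(blob_index):
    """Map a directory-wide blob index to (page, byte offset in page)."""
    page, slot = divmod(blob_index, BLOBS_PER_PAGE)
    return page, slot * BLOB_SIZE


def blobs_needed(nbytes):
    """Number of blobs needed to hold nbytes (ceiling division)."""
    return -(-nbytes // BLOB_SIZE)
```

For example, blob 65 is the second blob of the second page, so it sits at page 1, byte offset 32.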
Directories
More than one vnode can point to
directory (copy on write)
VM: hash table of DirHandles
point to VM contiguous copy of dir
point to DirInode
have a lock etc
Model: as for volumes & vnodes
Critique: too baroque
Files
Vnode references file by InodeNumber
Files are copy on write
There are “FileInodes” like dir inodes, but
they are held in external DB or in inode
itself
Server always reads/writes whole files
(could be exploited)
Volinit and salvage
Set up volume hash table, serverlist,
DiskPartitionList
Cycle through partitions, check each for
list of inodes
every inode has a vnode
every vnode has a directory name
every directory name has a vnode
Put volume in a VM hash table
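The three consistency rules checked during salvage can be written as a small checker. This is a toy model, not the salvager: the input shapes (a set of inode numbers, a vnode-to-inode map, a name-to-vnode map) and the function name are assumptions, and it ignores that directory vnodes hold an RVM address rather than a disk inode.

```python
def salvage_check(inodes, vnodes, dir_entries):
    """Return violations of the three salvage invariants.

    inodes:      set of inode numbers found on the partition
    vnodes:      dict vnode id -> inode number
    dir_entries: dict name -> vnode id
    """
    problems = []
    # Every inode has a vnode.
    for ino in sorted(inodes - set(vnodes.values())):
        problems.append(("inode without vnode", ino))
    # Every vnode has a directory name.
    for vid in sorted(set(vnodes) - set(dir_entries.values())):
        problems.append(("vnode without directory name", vid))
    # Every directory name has a vnode.
    for name, vid in sorted(dir_entries.items()):
        if vid not in vnodes:
            problems.append(("name without vnode", name))
    return problems


problems = salvage_check(
    inodes={100, 101, 102},
    vnodes={1: 100, 2: 101},
    dir_entries={"a": 1, "b": 3},
)
```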
Server connection info
Array of HostEntry (a “venus”)
Contains a linked list of connections
Contains a callback connection id
Connection setup
first binding creates a host & callback conn
new binding creates a new connection and
verifies callback
in RPC2_NewBinding & ViceNewConnectFS
Callbacks
Hashtable of FileEntries:
each contains Fid
number of users
linked list of callbacks
Callbacks: point to HostEntry
Ops:
RPC: BreakCallBack
Local: placing, delete, deleteVenus
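A compact model of the callback table and its operations. It replaces the linked lists with Python sets and uses host names as strings; the class, its method names, and the `except_host` parameter are hypothetical, chosen to mirror the operations listed above.

```python
from collections import defaultdict


class CallbackTable:
    """Maps a fid to the set of hosts holding a callback on it."""

    def __init__(self):
        self._by_fid = defaultdict(set)

    def place(self, fid, host):
        self._by_fid[fid].add(host)

    def delete(self, fid, host):
        hosts = self._by_fid.get(fid)
        if hosts is not None:
            hosts.discard(host)
            if not hosts:
                del self._by_fid[fid]

    def delete_venus(self, host):
        """Drop every callback held by one client."""
        for fid in list(self._by_fid):
            self.delete(fid, host)

    def users(self, fid):
        return len(self._by_fid.get(fid, ()))

    def break_callbacks(self, fid, except_host=None):
        """Return the hosts to notify and forget their callbacks."""
        hosts = self._by_fid.get(fid, set())
        notified = sorted(h for h in hosts if h != except_host)
        for host in notified:
            self.delete(fid, host)
        return notified


cb = CallbackTable()
fid = (1, 2, 3)
cb.place(fid, "venus-a")
cb.place(fid, "venus-b")
cb.place((1, 4, 5), "venus-a")
notified = cb.break_callbacks(fid, except_host="venus-a")
users_after_break = cb.users(fid)
cb.delete_venus("venus-a")
```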
Callbacks
Connection is non-authenticated. Should
be fixed. Session key for CB connection
should not expire.
Side effect of callback connection is used
for BackFetch bulk transfer of files during
reintegration.
RPC processing
Venus RPC’s:
srvproc.cc - standard file ops
srvproc2.cc - standard volume ops
codaproc.cc - repair stuff
codaproc2.cc - reintegration stuff
Volutil RPC’s:
vol-your-rpc.cc (in coda-src/volutil)
Resolution: below
RPC processing
RPC structure:
ValidateParms: validate, hand off COP2, cid
GetObject: vm copy, lock objects
CheckSemantics:
Concurrency, Integrity, Permissions
Perform operations:
BulkTransfer, UpdateObjects, OutParms
PutObject: rvm transactions, inode deletions
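The five phases can be read as a fixed pipeline. The sketch below only fixes the ordering; each phase is passed in as a callable. That the final phase still runs after a failed semantic check (so locks are released without committing) is an assumption of this sketch, and `run_rpc` and `deny` are hypothetical names.

```python
def run_rpc(params, validate, get_objects, check_semantics, perform, put_objects):
    """Run one request through the five phases, in order."""
    trace = ["ValidateParms"]
    validate(params)
    trace.append("GetObject")
    objects = get_objects(params)
    result, error = None, None
    try:
        trace.append("CheckSemantics")
        check_semantics(params, objects)
        trace.append("Perform")
        result = perform(params, objects)
    except PermissionError as exc:
        error = exc
    trace.append("PutObject")
    # Commit only when every earlier phase succeeded.
    put_objects(objects, error is None)
    return result, error, trace


def deny(params, objects):
    raise PermissionError("no write access")


commits = []
phases = dict(
    validate=lambda p: None,
    get_objects=lambda p: [p["fid"]],
    check_semantics=lambda p, objs: None,
    perform=lambda p, objs: "ok",
    put_objects=lambda objs, commit: commits.append(commit),
)
ok_result, ok_error, ok_trace = run_rpc({"fid": "1.2.3"}, **phases)
phases["check_semantics"] = deny
bad_result, bad_error, bad_trace = run_rpc({"fid": "1.2.3"}, **phases)
```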
vlists
GetFSObjects: instantiate a vlist
RPC needs list of objects copied from RVM
Modification status is held there (did
CopyOnWrite kick in etc)
PutObjects
rvm_begin_transaction
walk through the list, copy, rvm_set_range,
unlock
rvm_end_transaction
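The write-back walk over a vlist might look like the following toy. `FakeRVM` and `VlistEntry` are hypothetical; the method names only echo rvm_begin_transaction, rvm_set_range and rvm_end_transaction. The toy declares the range before copying, on the understanding that RVM expects a range to be declared before its bytes change.

```python
class FakeRVM:
    """Toy recoverable store that records what a transaction touched."""

    def __init__(self):
        self.data = {}
        self.log = []

    def begin_transaction(self):
        self.log.append("begin")

    def set_range(self, key):
        self.log.append(("set_range", key))

    def end_transaction(self):
        self.log.append("end")


class VlistEntry:
    def __init__(self, fid, vm_copy):
        self.fid = fid
        self.vm_copy = vm_copy
        self.modified = False   # modification status is kept on the entry
        self.locked = True


def put_objects(rvm, vlist):
    rvm.begin_transaction()
    for entry in vlist:
        if entry.modified:
            rvm.set_range(entry.fid)                   # declare the region
            rvm.data[entry.fid] = dict(entry.vm_copy)  # copy VM -> store
        entry.locked = False                           # unlock either way
    rvm.end_transaction()


rvm = FakeRVM()
a = VlistEntry("1.2.3", {"length": 10})
b = VlistEntry("1.2.4", {"length": 0})
a.modified = True
put_objects(rvm, [a, b])
```

Only the modified entry is written, but both entries are unlocked inside the same transaction.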
COP2 handling
In COP2 Venus gives the final VV to the server
COP2 messages are sent out by Venus (with some delay)
often piggybacked in bulk
server knows about pending COP2 entries
in hash table (coppend.cc)
Manager thread CopPendingManager
Runs every minute.
Removes entries more than 900 secs old
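The pending table and its periodic sweep can be sketched as below. The 900 second limit and the one minute period come from the slide; keying entries by a store id string, the class name, and the explicit `now` argument (used instead of a real clock so the example is deterministic) are assumptions.

```python
COP_PENDING_TIMEOUT = 900   # seconds (from the slide)
MANAGER_PERIOD = 60         # seconds: the manager runs every minute


class CopPendingTable:
    def __init__(self):
        self._entries = {}   # store id -> time the entry was added

    def add(self, storeid, now):
        self._entries[storeid] = now

    def dequeue(self, storeid):
        """Called when the COP2 for this store id arrives."""
        return self._entries.pop(storeid, None) is not None

    def expire(self, now):
        """One manager pass: drop entries older than the timeout."""
        stale = [sid for sid, added in self._entries.items()
                 if now - added > COP_PENDING_TIMEOUT]
        for sid in stale:
            del self._entries[sid]
        return stale

    def __len__(self):
        return len(self._entries)


table = CopPendingTable()
table.add("s1", now=0)
table.add("s2", now=500)

removed_at = {}
for now in range(MANAGER_PERIOD, 1021, MANAGER_PERIOD):
    for sid in table.expire(now):
        removed_at[sid] = now

s2_dequeued = table.dequeue("s2")   # its COP2 arrived in time
```

Because the sweep runs once a minute, an entry added at time 0 is removed on the first pass after it crosses 900 seconds, not at exactly 900.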
Cop2 to RVM
Data can be
PiggyBacked on another rpc
sent in ViceCop2 rpc.
Both cases call InternalCop2 (srvproc.cc)
InternalCop2 (codaproc.cc)
notifies the manager to dequeue
gets the FS objects listed for the COP2
installs final VV’s into RVM (rvm transaction!)
COP2 Problems
Easy cause of conflicts in replicated
volumes when clients access objects in
rapid succession. (Can be fixed easily
during the writeback caching operation)
Not optimized for singly replicated
volume.
Resolution
Initiated by client with RPC to coordinator
ViceResolve (codaproc.cc)
coordinator
sets up connections in VSG (unauthenticated)
LockAndFetch (res/reslock, resutil):
lock volumes,
collect “closure”
Resolution - special cases
RegResDirRequired (rvmres/rvmrescoord.cc)
check for
unresolved ancestors
already inconsistent
runts (missing objects)
weak equality (identical storeid)
RecovDirResolve
Phase II: (rvmres/{rescoord,subphase?}.cc)
coordinator requests logs from other servers
subordinates lock affected dirs, marshal logs
coordinator merges logs
Phase III:
ship merged log to subordinates
perform operations on VM copies
Return results to coordinator
Resolution
Phase IV: (is old Phase 3 …)
collect results, compute new VVs, ship to subordinates
commit results
Comments on resolution
Old versions of resolution:
OldDirResolve: resolve only runts and weak
DirResolve: resolve only in VM
Remove these
The resolve directory has nothing to do with resolution and should be called librepair.
Srv uses only one function in it; repair uses the rest
Volume Log
During FS operations, log entries are
created for use during resolution
Different format per operation
(rvmres/recov_vollog.cc)
Added to the vlist by SpoolVMLogRecord
Put in RVM at commit time
Repair
Venus makes ViceRepair RPC.
File and symlink repair: BulkTransfer the
object
Directory repair: BulkTransfer the repair file and replay operations
Venus follows this with a COP2 multi rpc
For directory repair Venus invokes
asynchronous resolve
Future
Good:
Design is simple and efficient
There is little C++: should eliminate it
easy to multi-thread
Bad:
Scalability ~8GB in practice, ~40GB in theory
Data handling is bad: tricky to fix
Volume code was & is worst: rewrite