Unit OS 10: Fault Tolerance
Windows Operating System Internals - by David A. Solomon and Mark E. Russinovich with Andreas Polze
Copyright Notice
© 2000-2005 David A. Solomon and Mark Russinovich
These materials are part of the Windows Operating System Internals Curriculum Development Kit, developed by David A. Solomon and Mark E. Russinovich with Andreas Polze.
Microsoft has licensed these materials from David Solomon Expert Seminars, Inc. for distribution to academic organizations solely for use in academic environments (and not for commercial use).
Roadmap for Section 10.1
The Notion of Fault-Tolerance
Fault-Tolerance Support in NTFS
Volume Management: Striped and Spanned Volumes
Distributed File System (DFS) and File Replication Service (FRS)
Network Load Balancing (NLB)
Windows Clustering (MSCS)
Fault-tolerant Systems
Fault tolerance is the property of a system that continues operating properly in the event of the failure of some of its parts
If its operating quality decreases at all, the decrease is proportional to the severity of the failure
Fault tolerance is particularly sought-after in high-availability or life-critical systems
Fault tolerance is not just a property of individual machines; it may also characterize the rules by which they interact
Fault Models and Protocols
A fault model must be specified when discussing fault-tolerant (FT) systems
All FT mechanisms in Windows deal only with crash faults of computers or applications
Crash faults can be handled by replication in space or in time
Fault-tolerance (FT) by duplication
Three approaches toward FT systems
Replication:
multiple identical system instances
directing tasks or requests to all of them in parallel, and
choosing the correct result on the basis of a quorum (see the sketch after this list)
Redundancy:
fail-over among multiple identical system instances
fall-back or backup
Diversity:
multiple different implementations of the same spec.,
using them like replicated systems to cope with errors in a specific implementation
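To make the replication approach concrete, here is a minimal sketch of quorum voting in Python. The function name and the three-replica scenario are invented for illustration; this is not code from the curriculum.

```python
from collections import Counter

def quorum_result(results):
    """Return the value reported by a majority of replicas,
    or None if no value reaches a quorum."""
    value, votes = Counter(results).most_common(1)[0]
    return value if votes > len(results) // 2 else None

# Three replicas handle the same request in parallel; one of them
# crashes mid-request and returns a bogus answer.
print(quorum_result([42, 42, 17]))  # -> 42 (quorum of 2 out of 3)
print(quorum_result([1, 2, 3]))     # -> None (no majority)
```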
Fault Tolerance in NTFS: Increasing System Availability
Transaction-based logging scheme
Fast, even for large disks
Recovery is limited to file system data
Use a transaction-processing system such as SQL Server for user data
Tradeoff: performance versus a fully fault-tolerant file system (FS)
Design options for file I/O & caching:
Careful write: VAX/VMS FS, other proprietary OS FSs
Lazy write: most UNIX FSs, OS/2 HPFS
Recoverable File System (Journaling File System)
Safety of a careful write FS with the performance of a lazy write FS
Log file + fast recovery procedure
Log file imposes some overhead
Optimization over lazy write: distance between cache flushes is increased
NTFS supports cache write-through and cache flushing triggered by applications
No extra disk I/O is necessary to update FS data structures:
all changes to the FS structure are recorded in the log file, which can be written in a single operation
In the future, NTFS may support logging for user files (hooks are in place)
Recovery - Principles
NTFS performs automatic recovery
Based on update records and checkpoints in the log file
Update records store sub-operations that change the file system structure
NTFS writes a checkpoint every 5 seconds
Includes a copy of the transaction table and the dirty page table
The checkpoint includes the LSNs of the log records containing the tables;
it is really a series of records, interleaved with update records
Recovery depends on two NTFS in-memory tables:
Transaction table: keeps track of active (not yet committed) transactions
(sub-operations of these transactions must be removed from disk)
Dirty page table: records which pages in the cache contain modifications to the file system structure that have not yet been written to disk
[Figure: log file layout around a checkpoint: update records interleaved with the dirty page table, the transaction table, and the checkpoint record, between the begin and the end of the checkpoint operation]
Recovery - Passes
1. Analysis pass
• NTFS scans forward in the log file from the beginning of the last checkpoint
• Updates the transaction/dirty page tables it copied into memory
• NTFS scans the tables for the oldest update record of a non-committed transaction
2. Redo pass
• NTFS looks for "page update" records, which contain volume modifications that might not have been flushed to disk
• NTFS redoes these updates in the cache until it reaches the end of the log file
• The cache manager's lazy writer thread then begins to flush the cache to disk
3. Undo pass
• Roll back any transactions that were not committed when the system failed
• After the undo pass, the volume is in a consistent state
• Write an empty LFS restart area; no recovery is needed if the system fails now
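To make the three passes concrete, here is a toy, in-memory sketch. The record format and field names are invented for illustration; real NTFS log records describe low-level sub-operations on file system metadata.

```python
# Toy three-pass recovery over an in-memory log (hypothetical record
# format; real NTFS records describe file system metadata changes).

def recover(log, checkpoint_index):
    # 1. Analysis pass: scan forward from the last checkpoint to find
    #    transactions that never committed.
    active = set()
    for rec in log[checkpoint_index:]:
        if rec["type"] == "update":
            active.add(rec["txn"])
        elif rec["type"] == "commit":
            active.discard(rec["txn"])

    # 2. Redo pass: reapply every logged update, since any of them
    #    might not have been flushed to disk before the crash.
    pages = {}
    for rec in log[checkpoint_index:]:
        if rec["type"] == "update":
            pages[rec["page"]] = rec["redo"]

    # 3. Undo pass: walk backwards, rolling back updates that belong
    #    to transactions still active at the time of the crash.
    for rec in reversed(log):
        if rec["type"] == "update" and rec["txn"] in active:
            pages[rec["page"]] = rec["undo"]
    return pages

log = [
    {"type": "update", "txn": 1, "page": "bitmap",
     "redo": "set 3-9", "undo": "clear 3-9"},
    {"type": "commit", "txn": 1},
    {"type": "update", "txn": 2, "page": "index",
     "redo": "add name", "undo": "remove name"},
    # power failure: transaction 2 never committed
]
print(recover(log, 0))  # -> {'bitmap': 'set 3-9', 'index': 'remove name'}
```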
Undo Pass - Example
[Figure: undo pass example. The log holds records LSN 4044 through 4049, with a "transaction committed" record appearing before a power failure that occurs after LSN 4048. Example redo/undo pairs from the log:
Redo: Add the filename to the index / Undo: Remove the filename from the index
Redo: Allocate/initialize an MFT file record / Undo: Deallocate the file record
Redo: Set bits 3-9 in the bitmap / Undo: Clear bits 3-9 in the bitmap]
Transaction 1 was committed before the power failure
Transaction 2 was still active
NTFS must log undo operations in the log file!
Power might fail again during recovery;
NTFS would then have to redo its undo operations
NTFS Recovery - Conclusions
Recovery returns the volume to some preexisting consistent state
(not necessarily the state just before the crash)
Lazy commit algorithm: the log file is not immediately flushed when a "transaction committed" record is written
The Log File Service batches records
Flushed when the cache manager calls, when a checkpoint record is written (once every 5 seconds), or when the log is full
Several parallel transactions might have been active before the crash
NTFS uses the log file mechanisms for error handling
Most I/O errors are not file system errors
NTFS might create an MFT record and then detect that the disk is full when allocating space for the file in the bitmap
NTFS uses the log info to undo its changes and returns a "disk full" error to the caller
Fault-Tolerance Support Using Multiple Disks
NTFS's capabilities are enhanced by the fault-tolerant volume managers FtDisk/DMIO
They lie above the hard disk drivers in the I/O system's layered driver scheme
FtDisk: for basic disks
DMIO: for dynamic disks
Volume management capabilities:
Redundant data storage
Dynamic data recovery from bad sectors on SCSI disks
NTFS itself implements bad-sector recovery for non-SCSI disks
Terminology
A disk is a physical storage device such as a hard disk, a 3.5-inch floppy disk, or a CD-ROM
A disk is divided into sectors, addressable blocks of fixed size
Sector sizes are determined by hardware
All current x86-processor hard disk sectors are 512 bytes, and CD-ROM sectors are typically 2048 bytes
Future x86 systems might support larger hard disk sector sizes
Partitions are collections of contiguous sectors on a disk
A partition table or other disk-management database stores a partition's starting sector, size, and other characteristics
Simple volumes are objects that represent sectors from a single partition that file system drivers manage as a single unit
Multipartition volumes are objects that represent sectors from multiple partitions and that file system drivers manage as a single unit
Multipartition volumes offer performance, reliability, and sizing features that simple volumes do not
Basic vs Dynamic Disks
Two disk partitioning schemes are used by Windows:
Basic disk partitioning
Dynamic disk partitioning
Basic disks rely on MS-DOS-style disk partitioning
They are really Windows legacy disks
Partition information for each disk is stored on the disk
Multipartition information is not stored on disk
It can be lost when a disk is moved or the OS is reinstalled
Dynamic disks implement a more flexible partitioning scheme
Configuration of multipartition volumes is on disk and mirrored across the dynamic disks of the same computer. This allows for easy migration and minimizes the chances of disk configuration loss.
The disadvantage is that the partitioning is not understood by other OSs
Laptops support only basic disks
They usually have only one disk, and the disk is not removable
All disks are basic disks unless created new as dynamic disks or converted
Basic Disk Partitioning
A disk has a sector called a Master Boot Record (MBR) as its first sector; it defines the first level of partitioning with its partition table
[Figure: the MBR holds boot code and a four-entry partition table. Partition 1 is the boot partition with its own boot sector; partition 3 is an extended partition whose extended partition boot records introduce the partitions within it; partitions 2 and 4 are further primary partitions]
Basic Disk Partitioning
The MBR describes up to 4 primary partitions
The first sector of each primary partition is a boot record
One primary partition can be marked "bootable"
Each partition has a partition type (FAT, FAT32, NTFS, …)
To overcome the 4-partition limit, basic disks define a special type of partition called an extended partition
It is like a subdisk, complete with its own MBR
In NT 4, configuration for multipartition volumes is stored in the Registry's HKLM\System\Disk subkey
It is lost if the system is reinstalled or the disk is moved to another system
Dynamic Disk Partitioning
The dynamic disk partitioning scheme is defined by a component called the Logical Disk Manager (LDM)
LDM consists of service and driver components
The dynamic disk partitioning scheme was co-developed with Veritas Software, which ported LDM from its UNIX implementations
LDM maintains one unified database that stores all partitioning information for all disks in the system
The database also stores the multipartition configuration
The database occupies the last 1 MB of each dynamic disk and is mirrored across a system's dynamic disks
Veritas offers add-on software that allows dynamic disks to be managed in subsets called disk groups
[Figure: dynamic disk layout: master boot record, LDM partition area, and the 1-MB LDM database at the end of the disk]
Dynamic Partitioning
A computer's boot and system disks have a mix of dynamic and basic disk partitioning
NTLDR only understands basic-disk partitioning
LDM partitions are called soft partitions, whereas basic-disk partitions are hard partitions
Even on "pure" dynamic disks, the MBR contains a basic-disk partition table that defines the entire usable area of the disk as a single hard partition of type LDM
LDM manages the space within the LDM partition in its database
Multipartition Volumes
The multipartition volumes available in Windows are:
Spanned volumes
Mirrored volumes
Striped volumes
RAID-5 volumes
All partitions that make up new multipartition volumes must be on dynamic disks
Windows preserves NT 4 multipartition volumes on basic disks during an upgrade
Volume Management Features: Spanned Volumes
[Figure: volume set D: occupies half of each of two disks; the first disk holds C: (100 MB) and half of D: (100 MB), the second holds the other half of D: (100 MB) and E: (100 MB)]
Spanned Volumes:
A single logical volume composed of a maximum of 32 areas of free space on one or more disks
NTFS volume sets can be dynamically increased in size
(only the bitmap file, which stores the allocation status, needs to be extended)
FtDisk/DMIO hide the physical configuration of the disks from the file system
Tool: Windows Disk Management MMC snap-in
Spanned volumes were called volume sets in Windows NT 4.0
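As an illustration of how the volume manager hides the physical layout, here is a hypothetical sketch of spanned-volume address translation; the two 100-MB areas mirror the figure above, and the disk names are invented.

```python
# Sketch: spanned-volume address translation. Member areas are simply
# concatenated, so an offset falls into the first area whose cumulative
# size exceeds it. Disk names and sizes are illustrative.
areas = [("disk1", 100 * 2**20), ("disk2", 100 * 2**20)]  # two 100-MB areas

def span_map(offset):
    for disk, size in areas:
        if offset < size:
            return disk, offset
        offset -= size
    raise ValueError("offset beyond end of volume")

print(span_map(150 * 2**20))  # -> ('disk2', 52428800): second area of D:
```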
Striped Volumes (RAID-0)
[Figure: a striped volume across three 150-MB disks; logical stripes 1 through 4 rotate round-robin over the disks]
A series of partitions, one partition per disk (of the same size)
Combined into a single logical volume
FtDisk/DMIO optimize data storage and retrieval times
Stripes are narrow: 64 KB
Data tends to be distributed evenly among the disks
Multiple pending read/write operations will operate on different disks
Latency for disk I/O is often reduced (parallel seek operations)
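A sketch of the corresponding striped-volume translation, assuming the 64-KB stripe size mentioned above (the function is illustrative, not FtDisk/DMIO's actual code):

```python
STRIPE_SIZE = 64 * 1024  # 64-KB stripes, as used by FtDisk/DMIO

def stripe_map(volume_offset, num_disks):
    """Map a byte offset in a RAID-0 volume to
    (disk index, byte offset on that disk)."""
    stripe = volume_offset // STRIPE_SIZE
    disk = stripe % num_disks             # stripes rotate across the disks
    stripes_before = stripe // num_disks  # full stripes already on this disk
    return disk, stripes_before * STRIPE_SIZE + volume_offset % STRIPE_SIZE

# Offset 200 KB falls into stripe 3 (0-based), which lands on disk 0
# of a 3-disk set, 8 KB into that disk's second stripe.
print(stripe_map(200 * 1024, 3))  # -> (0, 73728)
```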
Fault-Tolerant Volumes
FtDisk/DMIO implement redundant storage schemes:
Mirror sets (RAID-1)
Stripe sets with parity (RAID-5)
Sector sparing
Tool: Windows Disk Management MMC snap-in
Mirrored Volumes:
[Figure: the contents of partition C: on one disk are duplicated as C: (mirror) on a second disk]
The contents of a partition on one disk are duplicated on another disk
FtDisk/DMIO write the same data to both locations
Read operations are done simultaneously on both disks (load balancing)
Mirrored Volumes
Performance improves because reads of different data can occur in parallel through dynamic load balancing
RAID-5 Volumes
[Figure: a RAID-5 volume; parity blocks are distributed across the member disks]
A fault-tolerant version of a regular stripe set
Parity: logical sum (XOR)
Parity info is distributed evenly over the available disks
FtDisk/DMIO reconstruct missing data by using the XOR operation
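A minimal sketch of the XOR parity idea (illustrative only; real RAID-5 operates on 64-KB stripes, not 4-byte blocks):

```python
def parity(blocks):
    """XOR equal-sized byte blocks together into one parity block."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Compute parity over three data blocks, "lose" the middle one, and
# rebuild it from the survivors plus parity: A ^ C ^ (A ^ B ^ C) = B.
data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)
assert parity([data[0], data[2], p]) == data[1]
print("reconstructed:", parity([data[0], data[2], p]))
```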
The Disk Management MMC Snap-in
The Volume Manager
The LDM volume manager inserts itself above the disk drivers
It exports "volumes" to file systems
It takes volume-oriented requests and can create sub-requests aimed at the different disks of multipartition volumes
Bad Cluster Recovery
Sector sparing is supported by FtDisk/DMIO
Dynamic copying of recovered data to spare sectors
Without intervention from the file system or the user
Works for certain SCSI disks
FtDisk/DMIO return a bad-sector warning to NTFS
Sector re-mapping is supported by NTFS
NTFS will not reuse bad clusters
NTFS copies data recovered by FtDisk/DMIO into a new cluster
NTFS cannot recover data from a bad sector without help from FtDisk/DMIO
NTFS will never write to a bad sector (it re-maps before writing)
Bad-cluster re-mapping
[Figure: bad-cluster re-mapping. The bad-cluster file's MFT record maps VCN 0 to the bad cluster at LCN 1357. The user file's MFT record (standard info, NTFS filename, security descriptor, data attribute) maps its data with three runs of (starting VCN, starting LCN, number of clusters): (0, 1355, 2) covering LCNs 1355-1356; (2, 1049, 1), the replacement for the remapped bad cluster; and (3, 1588, 4) covering LCNs 1588-1591]
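A small sketch of resolving a VCN to an LCN through a run list like the one in the figure. The list-of-tuples representation is a simplification; NTFS encodes runs compactly inside the data attribute.

```python
# Runs taken from the figure: (starting VCN, starting LCN, cluster count).
runs = [
    (0, 1355, 2),
    (2, 1049, 1),   # replacement run for the remapped bad cluster
    (3, 1588, 4),
]

def vcn_to_lcn(vcn):
    """Translate a virtual cluster number to a logical cluster number."""
    for start_vcn, start_lcn, count in runs:
        if start_vcn <= vcn < start_vcn + count:
            return start_lcn + (vcn - start_vcn)
    raise ValueError("VCN not mapped")

print(vcn_to_lcn(2))  # -> 1049 (remapped away from bad LCN 1357)
print(vcn_to_lcn(5))  # -> 1590
```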
Distributed File System (DFS): Data Replication for High Availability
A strategic storage management solution
Namespaces: simplified views of shared folders, regardless of where the files physically reside in a network
A namespace abstracts away file paths
Changing a file server's name does not break virtual DFS paths
DFS stores path names logically as a single namespace
Replicated file servers provide high availability
How DFS Works
DFS has a client and a server component
Any Windows system can be a DFS client
Windows NT/2000/2003 Server include the DFS server component
The view of shared folders on different servers is called the DFS namespace
It is like a virtual UNC path
A single namespace can map to physical resources residing on multiple servers
DFS Operation
DFS operates in multiple steps:
1. The client makes a request of the DFS namespace
2. DFS returns the appropriate path to the data (including Active Directory site-costing information when AD is in use)
3. The client makes a connection to the server and share
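A hypothetical sketch of that referral flow: a logical namespace path maps to physical UNC targets, and the client picks the cheapest available one. The server names and the cost model are invented for illustration.

```python
# Hypothetical DFS-style namespace: one logical path, several targets.
namespace = {
    r"\\corp\public\docs": [
        {"target": r"\\server1\docs", "cost": 1, "up": True},
        {"target": r"\\server2\docs", "cost": 5, "up": True},
    ],
}

def resolve(path):
    """Step 2 of the flow above: return the closest available target."""
    candidates = [t for t in namespace[path] if t["up"]]
    if not candidates:
        raise OSError("no replica of %s is available" % path)
    return min(candidates, key=lambda t: t["cost"])["target"]

print(resolve(r"\\corp\public\docs"))  # -> \\server1\docs (lowest cost)
```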
DFS Authentication
DFS is a multi-protocol architecture
It uses SMB and LAN Manager authentication protocols to communicate between a DFS client and a DFS server
A DFS server can redirect requests, with protocol-specific authentication, to various types of shares:
Server Message Block (SMB) servers
Network File System (NFS) servers
Services for Macintosh™ (AFP) servers, and
Netware™ Core Protocol (NCP) servers
Windows client machines must install suitable redirector drivers
DFS Request Routing for Replicated Servers
Files/folders on a DFS server can be replicated using the File Replication Service (FRS)
FRS resolves file and folder name conflicts to make data consistent among the replica members
FRS uses a "last writer wins" rule to resolve conflicts (see the sketch after this list)
DFS requests are then routed to the closest server
If a server becomes unavailable, DFS ensures that requests are routed to the next closest server by using site costing
Active Directory site and costing information is used for routing decisions,
i.e., whether sites are connected via inexpensive, high-speed links or by expensive WAN links
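A toy sketch of the "last writer wins" rule; the change-record shape and timestamps are invented for illustration.

```python
def last_writer_wins(changes):
    """changes: list of (replica, modification_time, content).
    Keep the version with the newest modification time."""
    return max(changes, key=lambda change: change[1])

# The same file was modified on three replica members; replica B
# wrote last, so its version propagates everywhere.
print(last_writer_wins([("A", 100, "v1"), ("B", 140, "v2"), ("C", 120, "v1b")]))
# -> ('B', 140, 'v2')
```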
File Replication Service (FRS) Details
Continuous replication
Subject to the replication schedule, server load, and network load
When a file or folder is changed and closed, FRS begins replicating it within three seconds
Fault-tolerant replication path
Fault-tolerant distribution by way of multiple connection paths between members
Identical file data is sent no more than once to any replica member
FRS Details (contd.)
Replication scheduling
Replication can be scheduled to occur at specified times and durations
Replicating data during off-hours may free up network bandwidth for other uses
Replication integrity
Files are replicated only after they have been changed and closed
FRS relies on the update sequence number (USN) journal to log records of files that have changed on a replica member
FRS does not lose track of a changed file even if a replica member shuts down abruptly
Distributed File System Solution in Microsoft Windows Server 2003 R2
DFS Replication is a new state-based, multimaster replication engine
Supports replication scheduling and bandwidth throttling
Uses a new compression protocol called Remote Differential Compression (RDC)
Allows files to be updated efficiently over a limited-bandwidth network
RDC detects insertions, removals, and re-arrangements of data in files
RDC replicates only the changes when files are updated
Cross-file RDC can help reduce the amount of bandwidth required to replicate new files
A substantial improvement over the File Replication Service (FRS)
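A toy illustration of the idea behind RDC: split files into chunks, hash them, and send only the chunks the receiver lacks. Real RDC uses content-defined chunk boundaries and recursive signatures; this fixed-size-chunk version is only a sketch.

```python
import hashlib

def chunks(data, size=4):
    """Split data into fixed-size chunks (real RDC chunks by content)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def delta(receiver_has, new_version):
    """Return only the chunks of the new version the receiver lacks."""
    known = {hashlib.sha256(c).digest() for c in chunks(receiver_has)}
    return [c for c in chunks(new_version)
            if hashlib.sha256(c).digest() not in known]

# Only the middle 4 bytes changed, so only that chunk crosses the wire.
print(delta(b"AAAABBBBCCCC", b"AAAAXXXXCCCC"))  # -> [b'XXXX']
```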
Increased Availability with DFS and FRS
DFS may point to multiple volumes that can be alternates for each other
DFS manages failover to an alternate volume
Multiple copies of read-only shares can be mounted under the same logical DFS name (replication)
Client accesses to DFS volumes are evenly distributed across multiple alternate network shares
FRS and DFS Replication (in Server 2003 R2) can be used to maintain consistency among replicated volumes
Network Load Balancing (NLB)
A clustering technology for stateless services
Part of all Windows 2000 Server and Windows Server 2003 family operating systems
Uses a distributed algorithm to load-balance network traffic across a number of hosts
Helps to enhance the scalability and availability of mission-critical, IP-based services:
Web, Virtual Private Networking, Streaming Media, Terminal Services, Proxy, etc.
High availability by detecting host failures and redistributing traffic to operational hosts
NLB versus Server Clusters
A server cluster (MSCS) is a collection of servers
It provides a single, highly available platform
Applications can be failed over (SQL Server, Exchange Server data stores, file and print servers)
MSCS clusters are used for stateful applications
NLB clusters distribute network traffic
NLB clusters provide a highly available and scalable platform for applications such as IIS, ISA Server, etc.
NLB is used for stateless applications, i.e., those that do not build up any state as a result of a request
NLB for High Availability
How does NLB detect a server failure?
Each NLB cluster host emits heartbeats
A convergence process removes a failed host from the cluster
By default, five seconds are required to detect a failed host
The convergence process then takes three seconds to evict the failed host and redistribute its load
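A toy sketch of heartbeat-based failure detection with the five-second window from the slide. The data structure is invented for illustration; the real protocol runs inside the NLB driver.

```python
import time

DETECTION_WINDOW = 5.0  # seconds without heartbeats before eviction

# Timestamp of the last heartbeat seen from each cluster host.
last_heartbeat = {"hostA": time.time(), "hostB": time.time() - 7.0}

failed = [host for host, seen in last_heartbeat.items()
          if time.time() - seen > DETECTION_WINDOW]
print(failed)  # -> ['hostB']: convergence evicts it, redistributes its load
```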
NLB Load Balancing Algorithm
A fully distributed filtering algorithm maps incoming clients to the cluster hosts
All hosts simultaneously inspect arriving packets and determine which host should handle each packet
A randomization function determines the destination, calculating a host priority based on IP address, port, etc.
The destination host forwards the packet to its TCP/IP network stack
The other cluster hosts discard the packet
The mapping remains unchanged unless the membership of cluster hosts changes
A given client's IP address and port will always map to the same cluster host
Client affinity settings modify the statistical mapping algorithm
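A minimal sketch of the distributed filtering idea: every host computes the same deterministic hash over the client's address, so exactly one host accepts each packet without any coordination. The CRC-based hash is illustrative, not the actual wlbs.sys algorithm.

```python
import zlib

def owning_host(client_ip, client_port, num_hosts):
    """Deterministically map a client to a host index; every cluster
    member computes the same answer for the same packet."""
    key = f"{client_ip}:{client_port}".encode()
    return zlib.crc32(key) % num_hosts

# Each host runs the same check locally on every arriving packet.
MY_INDEX = 2
if owning_host("10.0.0.7", 51324, num_hosts=4) == MY_INDEX:
    print("pass packet up the TCP/IP stack")
else:
    print("silently discard the packet")
```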
NLB Implementation and Overhead
NLB has a kernel component called wlbs.sys
This is an intermediate NDIS driver
NLB also has user-mode management components
NLB creates additional CPU load
It increases linearly with increased throughput on the network interface
Server Cluster (Windows Server 2003)
A clustering technology for stateful applications
A dramatically improved version of the Microsoft Cluster Service (MSCS) component
MSCS was included with Windows 2000 Advanced Server and Windows 2000 Datacenter Server
Between two and eight servers act as nodes in the cluster
Cluster resources include network names, IP addresses, applications, services, and disk drives
Server Cluster Operation (single quorum)
Nodes in a cluster use a quorum to track which node owns a clustered application
The quorum is the storage device controlled by the primary node for a clustered application
Only one node at a time may own the quorum
On failover, the backup node takes ownership of the quorum
The quorum may be created on a storage device attached to all nodes (single quorum device server cluster)
Majority Node Set (MNS) Server Clusters
The quorum is stored on a locally attached storage device connected to each of the cluster nodes
The backup node must have a copy of the data stored within the quorum
The server cluster handles this requirement by replicating quorum data across the network
The network can be a LAN, WAN, or VPN
Availability of a Server Cluster
To fail over effectively between nodes, majority node set clusters must have at least three nodes
More than half of the cluster nodes must be active at all times,
i.e., in a cluster with three nodes, two of them must be active for the cluster to be functional
Eight-node clusters must have five nodes active to remain online (see the sketch below)
Single quorum device server clusters require only that a single node continues to function
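The majority rule in one line, matching the node counts above (a trivial sketch):

```python
def has_quorum(active_nodes, total_nodes):
    """An MNS cluster stays online only with a strict majority active."""
    return active_nodes > total_nodes // 2

print(has_quorum(2, 3))  # -> True: a 3-node cluster survives one failure
print(has_quorum(4, 8))  # -> False: an 8-node cluster needs 5 active nodes
```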
Hardware and Software Failures
Failed disks, memory, processors, power supplies, and network equipment are all common sources of unplanned downtime
Server Cluster and NLB can be used to provide availability in the event of a failure of a processor, memory chip, power supply, or other hardware component
Windows Server 2003 clusters provide availability at many other layers
To provide complete redundancy, all layers of your application must be clustered
NLB provides availability and scalability for the firewalls, front-end servers, and application servers;
Server Cluster provides high availability for the database
Clustered Applications: The Bigger Picture
Further Reading
Mark E. Russinovich and David A. Solomon,
Microsoft Windows Internals, 4th Edition, Microsoft Press, 2004.
Chapter 12 - File Systems
NTFS Recovery Support (from pp. 775)
Chapter 13 - Networking
Network Load Balancing and File Replication Service (from pp. 841)
Chapter 10 - Storage Management
Volume Management (from pp. 622)
Distributed File System (DFS) and File Replication Service (FRS),
http://www.microsoft.com/windowsserver2003/technologies/storage/dfs/default.mspx
Network Load Balancing (NLB) for Windows 2000 and Windows Server 2003,
http://www.microsoft.com/technet/prodtechnol/windowsserver2003/technologies/clustering/nlbfaq.mspx
Windows Server 2003 Clustering,
http://www.microsoft.com/windowsserver2003/techinfo/overview/bdmtdm/default.mspx
Source Code References
Windows Research Kernel (WRK) sources
\base\ntos\fstub – partition table/MBR support code
Note: the other topics covered in this unit are not
included with the WRK