Module 7: Server Cluster Maintenance and Troubleshooting
Overview

Cluster Maintenance

Troubleshooting Cluster Service

Server cluster maintenance and troubleshooting are
considered two separate disciplines. Maintenance is
continuous, whereas troubleshooting has a beginning,
when the problem is discovered, and an end, when the
problem is resolved. The two disciplines are
complementary, however: if every troubleshooting
procedure that you follow fails, you will need to rebuild
the cluster from a backup tape that was generated
during a maintenance procedure.
After completing this module, you will be able to:

Perform the steps to successfully back up a server
cluster.

Perform the steps to successfully restore a server cluster.

Evict a node from a server cluster.

Identify the tools that are necessary to troubleshoot a
cluster failure.

Interpret the entries on the cluster log.

Identify and troubleshoot common server cluster failures:
network communication failures, small computer system
interface (SCSI) configuration problems, and group,
resource, and quorum failures.
 Cluster Maintenance

Backup

Restoring the First Node

Restoring Cluster Disks

Restoring the Second Node

Evicting a Node

Cluster service uses the self-tuning features of
Microsoft® Windows® 2000 and requires very little
maintenance. The only day-to-day maintenance
operation that you need to perform is to back up the
cluster.

Under special circumstances, a node in the cluster may
need to be replaced, for example, when your
organization decides to perform a hardware upgrade. In
this situation, you need to evict a node from the cluster
and add the upgraded node to the cluster.
Backup

Backing Up the System State

Backing Up the Local Disk

Backing Up the Cluster Disks

Backing up the cluster is no different from backing up
Microsoft Windows 2000 Advanced Server. It is
recommended that you perform regular backups by
using the Windows 2000 Backup program (NTBackup),
or other compatible backup programs. Additional
backup agents are still necessary to back up
applications running on the cluster, such as Microsoft
SQL Server™ and Microsoft Exchange.

Note: A cluster-aware backup program will be able to
perform the same backup operations as NTBackup,
especially with regard to backing up the System State
and the cluster configuration database.
Backing Up the System State

The configuration information for the cluster is located
in the registry on each node
(HKEY_LOCAL_MACHINE\Cluster). The Backup
tool that is included with Windows 2000 backs up the
cluster database when you back up each node's system
state.

NTBackup backs up the system state on each node. The
system state includes the following (a command-line
sketch of a system state backup follows the list):

The quorum log.

The local registry.

The Cluster registry hive.
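The same system state backup can be scripted for scheduled jobs by
using NTBackup's command-line interface. The following is a minimal
sketch; the job name and target path are illustrative examples, not
values from this module:

    rem Back up the system state, which includes the cluster database,
    rem to a backup file (the job name and path are examples).
    ntbackup backup systemstate /j "Node1 system state" /f "C:\Backup.bkf"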
Backing Up the Local Disk

Follow standard computer backup procedures to back
up the operating system and the data on the local
drives. You must also back up key cluster files on the
local disks.



On each node, back up the cluster database files:
%systemroot%\cluster\CLUSDB
%systemroot%\cluster\CLUSDB.LOG

On each node, back up the Cluster service files:
%systemroot%\cluster\*.*
Note: Backup is essential, but regular testing to make
sure that backups and restores actually work as
expected is also necessary. A good practice is to
schedule test backup and restore operations frequently.
Backing Up the Cluster Disks

It is critical to back up cluster files on the quorum disk and data on
the cluster disks, because Cluster service writes information to
files in the \MSCS directory on the quorum disk, and cluster-aware
applications will likely be placing data on the cluster disks. Because
either node of the cluster could own the cluster disk resource at
any time, it is possible for each node to back up the data on the
drive. However, having each node back up data would require you
to install backup hardware and software on each cluster node,
which is not the best solution.

One possibility is to identify a nonclustered server running
Windows 2000 Server and schedule it to back up data remotely
through a network connection to the Cluster disk’s administrative
share or a hidden share that you create. For example, you might
create FBackup$, GBackup$, HBackup$, and WBackup$ file share
resources on the virtual server for the root of drives F, G, H, and W.
F, G, and H would be cluster disks with data, and W would be the
drive letter for the quorum disk. Hidden shares would not appear in
a browse list and you could configure them to allow access only to
members of the Backup Operators group.
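Assuming shares like these exist, the nonclustered backup server could
then back them up over the network with a scheduled NTBackup job. This
is a sketch only; the virtual server name FILESRV and the share name
are hypothetical, following the example in the preceding paragraph:

    rem Back up the quorum disk remotely through its hidden share
    rem (the server name, share name, and paths are hypothetical).
    ntbackup backup "\\FILESRV\WBackup$" /j "Quorum disk" /f "E:\QuorumDisk.bkf"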

The following sections describe the procedure for
restoring a server cluster in the event that both nodes
and the cluster disk fail. It is possible that any one of
the components in the cluster could fail independently.
In the case of a failed component, you follow the same
procedure for restoring that specific component.

Performing a complete restore of a server cluster is a
straightforward process:

1. Restore a node of the cluster.
2. Restore the cluster disks on the restored first node.
3. Restore the remaining node of the cluster.
4. Perform node testing.
Restoring the First Node
Steps For Restoring a Server Cluster:
1. Restore the first node
2. Restore the cluster disks
3. Restore the second node
4. Perform node testing
Restoring a Node of the Cluster

To restore a node in a server cluster, you follow the same
procedure that you would use in restoring a Windows 2000
operating system.

1. Install a fresh copy of Windows 2000 Advanced Server on the node
to be restored.
2. Log on as Administrator and restore the system and boot partition,
system state, and associated volumes from the backup. Make sure
that you select the option to restore the system state to the
original location in the backup program.
3. Restart the node.
4. Perform the steps for restoring the cluster disk, which follow in
the next section.

Note: The difference between the time of the backup and the time
of the restoration to the new computer may affect the computer
account on the domain controller. You may have to join a
workgroup and then rejoin the domain.
Restoring Cluster Disks

Restoring Disk Signature Files

Restoring the Data on the Cluster Disk

Restoring the Cluster Configuration Files

After you have restored a node in the cluster, you must
restore the cluster disks. Restoring the cluster disks
involves restoring the disk signature file that the cluster
uses to identify the disk. You may also need to replace a
cluster disk if you are running out of disk space or if
a disk is about to fail. Mistakes made while replacing a
cluster disk can be costly; the consequence can be the
irrecoverable loss of all of the data on that disk. If the
disk is the quorum disk, the server cluster's configuration
data is at risk.

Before restoring the cluster disks, stop Cluster service
on all of the nodes of the cluster. Stopping Cluster
service ensures that it will not attempt to start and
place a lock on the disks.
 Restoring Disk Signature Files

Because Cluster service relies on disk signatures to
identify and mount volumes, if a disk is replaced, or if
the bus is re-enumerated, Cluster service will not find
the disk signatures that it is expecting and will not
function.

You can run Dumpcfg.exe to extract the disk signature
from the registry and write it to the new disk. Cluster
service will recognize the new disk and successfully
start the resource.

Note: Dumpcfg.exe is a Resource Kit utility that
restores an old disk signature to a new disk.

If the disk that you are replacing is the quorum disk, use
Cluster Administrator to move the quorum to a different
disk, and proceed with replacing the disk. After the
disk is brought back online, you can move the
quorum back to the new disk.
Restoring the Data on the Cluster Disk

Restoring the data on the cluster disk is the same as
restoring data on a local disk. Before restoring the data,
make sure that you have associated each cluster disk with
the same drive letter as before the disaster or failure.
When restoring, make sure that you restore the data to the
original location, and verify the integrity of the data
after you have completed the restore.
Restoring the Cluster Configuration Files

The cluster configuration files include the cluster
database and the quorum log. The cluster database holds
the configuration data (cluster objects and their
settings) that is pertinent to the cluster. This
database is the product of the cluster registry key
checkpoint and the changes that are recorded in the
quorum log. All of the nodes of the cluster maintain
a local copy of this database in the node's local registry.

After you have restored the disk signature file and data,
you can start the server cluster. If the cluster files were
not restored, or were corrupted, the following procedure
can restore the cluster database from the registry of the
restored node.

Identify the node on which you will restore the database (in the case
of a disaster restore, this will be the first node that you have
restored). Restore the cluster database on the selected node by
restoring the system state. Restoring the system state creates a
temporary folder under the %Systemroot%\Cluster folder called
Cluster_backup.

You use NTBackup to restore the cluster configuration files, which
places them on the node. You then restore the cluster database to the
node’s registry by using the Clusrest.exe tool. Clusrest.exe restores
both the quorum log (Quorum.log) file and the cluster database
(Clusdb).

Note: The Clusrest.exe tool is available in the Windows 2000
Resource Kit. This tool is a free download from www.microsoft.com.
Restoring the Second Node

Restoring the Remaining Node(s) of a Cluster

Perform Node Testing

After you complete the process of restoring a node of a
cluster, and Cluster service has started successfully on
the newly restored node, you can start the restore
process on the other node of the cluster.
Restoring the Remaining Node(s) of the Cluster

The restoration of the second node of a cluster is the
same procedure as restoring the first node of a cluster,
except that you will not have to restore the cluster
disks.
Performing Node Testing

Testing the failover and failback policy is recommended
before putting the cluster back into production. A
command-line sketch of these tests follows the steps.

1. Verify that the disk and cluster resources are available
on the correct node.
2. Fail over each group and resource to verify that they
can successfully start on the other node of the cluster.
3. Test the failback policy of each resource by allowing
the resource to fail back to a preferred owner after the
node has come back online.
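The following sketch shows one way to drive these tests with the
cluster.exe tool, run at a command prompt on a cluster node. The
group and node names are examples only:

    rem List every group with its current state and owning node.
    cluster group
    rem Fail a group over to the other node to test its failover policy
    rem (the group and node names are examples).
    cluster group "Mygroup" /moveto:SERVER2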
Evicting a Node
Steps for Evicting a Node
1. Back up both nodes
2. Verify backup
3. Move all groups to the remaining node
4. Stop Cluster service on the node to be removed
5. Evict the node
6. Unplug the server from the shared bus

If you need to change a node of a cluster, for example,
to add a more powerful server, you need to logically
remove the node from the cluster before physically
removing it. After you configure the new server with
the shared bus and the public and private networks,
you can run the Cluster Installation Wizard to add it
to the cluster.

To remove a node from a cluster, right-click the node in
Cluster Administrator to access the menu that contains the
Stop Cluster Service and Evict Node options. A
command-line sketch of the procedure follows the steps
below.
To evict a node:

1. Back up both nodes.
2. Verify the backup.
3. Move all of the groups to the remaining node.
4. Stop Cluster service on the node that is to be removed.
5. Evict the node.
6. Unplug the server from the shared bus (if the shared
bus is a SCSI bus, be careful about termination).
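Steps 3 through 5 can also be performed from a command prompt with
cluster.exe and net.exe. This is a sketch only; the group and node
names are examples:

    rem Move each group to the remaining node (repeat for every group).
    cluster group "Mygroup" /moveto:SERVER1
    rem On the node that is being removed, stop Cluster service.
    net stop clussvc
    rem Evict the node from the cluster.
    cluster node SERVER2 /evict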

Note: If a new server is to join the cluster later, run the
Cluster Installation Wizard and select Join a Cluster.
 Troubleshooting Cluster Service

Troubleshooting Tools

Examining the Cluster Log

Troubleshooting Network Communications

SCSI Configuration Problems

Group and Resource Failures

Quorum Log Corruption

Troubleshooting a problem with Cluster service can be
more complex than troubleshooting a single server
because of the virtual servers and the need for
intracluster communications. Virtual servers change
ownership from one node to another, which may cause
network connectivity problems. Applications running on
the cluster can be difficult to troubleshoot because they
are running on a virtual server instead of a physical
server. You may also encounter node-to-node
communication problems, which do not arise when servers
work independently of one another. Finally, you might
experience hardware problems with the shared bus and
the cluster disk resources.

The most common failures are due to improper
configurations within groups and resources. Cluster
service will fail if the quorum log becomes corrupt. It is
important to know how to repair the quorum log to
restart the cluster.

You use the same tools to identify problems on the
cluster as you would use to identify problems on a
physical server. The best resource for troubleshooting
is the cluster log because Cluster service records the
activity of each node in the cluster log. This log can
help you identify problems on the node or in the cluster.
Troubleshooting Tools

Disk Manager

Task Manager

Performance Monitor

Network Monitor

Dr. Watson

Services Snap-in

When troubleshooting Cluster service, you can use the
same tools and methodologies that you would when
troubleshooting Windows 2000 Advanced Server.

Cluster service writes logging information to the system
log of every node in the cluster. Cluster service also
writes a more detailed log of cluster activity to the
cluster log on each node. Use these two sources to
gather information when you begin troubleshooting a
problem. You will be able to determine whether the
problem is related to the network, to services or
applications, or to physical components in the cluster.

Note: Use Event Viewer to filter the system log on the
event source ClusSvc. You can view general events, such
as "Microsoft Cluster service failed to join the cluster
on this node" and "Microsoft Cluster service successfully
created a cluster on this node."

After you have determined the type of problem, you can
use the following tools to search for the source of the
problem. You must check each node individually when
using any of these tools.

Disk Manager. Use Disk Manager to check the health of the
cluster disks. You can check whether the operating system
recognizes the disks, and whether the cluster disks are
basic or dynamic. You also need to verify that the drive
letters of the cluster disks are the same on both nodes.

Task Manager. You can verify that Cluster service is
running in Microsoft Windows 2000 Task Manager. You
can also use Task Manager as a performance monitor,
although it does not provide the level of detail that
Performance Monitor does. In Task Manager, you can
verify the CPU utilization percentage and the memory
resources on the node.

Performance Monitor. Microsoft Windows 2000
Performance Monitor is the primary tool for finding
bottlenecks on servers running Windows 2000. It is
recommended that you create a baseline before and
after you add cluster resources to the cluster. You also
need to create a baseline on each node during failover
and failback of resources to check for potential physical
resource deficiencies. It is also recommended that you
configure a computer to monitor the Cluster service on
every node of the cluster and send an e-mail message to
an administrator when a node or the cluster is offline.

Network Monitor. You use Microsoft Windows 2000
Network Monitor to troubleshoot any node-to-node and
client-to-node communication. You must configure
Network Monitor to capture data on the private network
to see node-to-node communication.

Dr. Watson. Dr. Watson is a user-mode debugging tool.
If a clustered application or Cluster Administrator
crashes, the debugging information is found in the Dr.
Watson log file.

Services Snap-in. Cluster service runs as a service in
Windows 2000. If Cluster service is not running
correctly, check the properties of the service through
the Services snap-in to ensure that the default
properties have not changed. Verify that Cluster service:

Is set to start automatically.

Is set to log on as the designated domain service account.

Is set to restart after a failure.

Make sure that the following four services have started (a
quick command-prompt check follows the list):

Network Connections (Network Connections has a Remote
Procedure Call (RPC) dependency)

RPC

Windows Management Instrumentation Driver Extensions

Windows Time
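This check uses only commands included with Windows 2000; the filter
strings are simply fragments of the service display names listed above:

    rem List started services and keep the lines for the services above,
    rem plus Cluster service itself.
    net start | findstr /i "Cluster RPC Connections Instrumentation Time"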
Examining the Cluster Log
[Slide: a node's cluster log open in WordPad. Callouts identify the
parts of each entry: the IDs of the process and thread issuing the log
entry, the timestamp, and the event description. Sample entries include
"[CS] Service Starting…", "[NM] Initializing networks.",
"[FM] Worker thread running", and "[INIT] Attempting to join cluster
MYCLUSTER".]
The cluster log is a diagnostic log that is a more complete record of
cluster activity than the Microsoft Windows 2000 event log. The
cluster log records the Cluster service activity (Clussvc.exe and
associated processes) that leads up to the events that are recorded
in the event log. Although the event log can point you to a problem,
the cluster log helps you to determine the source of the problem.
So, for diagnosis, check the event log for general information and
the cluster log for specific details about the cluster status. If you
see a problem in the event log, note the timestamp and go to
approximately the same timestamp in the cluster log.

The cluster log is enabled by default when you install Cluster
service, but it does not start logging information until after the
first restart of the node. Cluster log output is written to
%SystemRoot%\Cluster\Cluster.log, and you can view it with
Microsoft WordPad.
Setting the Logging Level

Four logging levels are possible in the cluster log. The
default level is 2, which logs enough information for
normal troubleshooting. To set a different logging level,
click Start, point to Settings, click Control Panel, and
then double-click the System icon. On the Advanced tab,
create a system environment variable called
ClusterLogLevel with a value of 0, 1, 2, or 3, where
0 = no logging, 1 = errors only, 2 = errors and warnings,
and 3 = everything that happens.
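Because system environment variables are stored in the registry, this
setting can also be scripted. The following .reg file is a sketch that
sets the level to 3; note that services typically pick up system
environment variable changes only after the node restarts:

    Windows Registry Editor Version 5.00

    ; System environment variables are stored under this key.
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Environment]
    "ClusterLogLevel"="3"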
Setting the Log File Size

The log file defaults to a maximum size of 8 megabytes
(MB). When the log file reaches 8 MB, it starts
overwriting the data in the file. To specify a
larger file size, add the registry entry ClusterLogSize
under HKEY_LOCAL_MACHINE\SYSTEM\
CurrentControlSet\Services\ClusSvc\Parameters.
ClusterLogSize has a type of DWORD and should
specify the maximum size in MB for the log file. If this
value is set to 0, logging is disabled.
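A .reg file sketch for this entry follows; the dword value 0x10 sets a
16-MB maximum, which is an example size rather than a recommendation
from this module:

    Windows Registry Editor Version 5.00

    ; Maximum cluster log size in MB (0x10 = 16 MB); 0 disables logging.
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters]
    "ClusterLogSize"=dword:00000010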
Cluster Log Entries

There are two types of cluster log entries: Component
Event Log entries and Resource dynamic-link library
(DLL) log entries. Cluster service is made up of a number
of components, such as the database manager and the
global update manager. The cluster log records the
interactions of these components, making it a powerful
diagnostic tool. Because resource groups are the basic
unit of failover, resource DLL entries are essential to
understanding cluster activity.

The first line in the body of a typical cluster log is:

378.32c::1999/06/09-18:00:18.874 Cluster service started Cluster Node Version 3.2051

The main elements of this line are common to every line
of the log:

The IDs of the process and thread issuing the log entry.
These two IDs are concatenated, separated by a period.
In the previous example, the Process ID is 378, and the
Thread ID is 32c.

Timestamp. The timestamp is recorded in the following
format, in Greenwich Mean Time (GMT):
yyyy/mm/dd-hh:mm:ss.sss

Event description. One example of an event description
would be "Cluster service started".
Component Event Log Entries

In the following example, [NM] indicates the component
that wrote the event to the cluster log; in this case, NM
stands for node manager.
378.380::1999/06/09-18:00:50.881 [NM] Forming cluster
membership.
Resource DLL Log Entries

The following example is a cluster log entry for a
resource DLL event. This example is one of the entries
from the disk arbitration process.
15c.458::1999/06/09-18:00:47.897 Physical Disk
<Disk D:>: [DISKARB] Arbitration Parameters (1 9999).


Instead of listing an abbreviated component name
between the timestamp and event description as
component log entries do, entries describing resource
DLL events list the following information:

Resource type (Physical Disk)

Resource name (<Disk D:>)

The event description in this example is [DISKARB]
Arbitration Parameters (1 9999).
Troubleshooting Network Communications

Troubleshooting Node-to-Node Communications

Verify RPC Communications

Verify Cluster Heartbeats

Troubleshooting Client-to-Node Communications

Check NetBT Cache with Nbtstat

Ping IP Address

WINS Static Mappings

There are two types of cluster network communications
that can fail: the client may be unable to access the
cluster or the nodes may be unable to communicate
with each other. When client communications are
interrupted, there is a problem with the public network.
When the nodes are unable to communicate, there is a
problem with either the public or the private network.
Troubleshooting these two types of network-related
problems requires different approaches.
Troubleshooting Node-to-Node Communications

You can use Windows 2000 Network Monitor before
installing Cluster service to capture the trace of the ping
between the nodes on the public and private network.
After Cluster service is installed, you use Network
Monitor to verify remote procedure call (RPC)
communication and cluster heartbeats.

Note: You can also use RPC Ping, which is an RPC
connectivity verification tool that is a free download
from www.microsoft.com. This tool verifies that
Windows 2000 Server services are responding to the
call requests of remote procedures between nodes.
Verifying RPC Communication

To verify that RPC communication is occurring between
the nodes of a cluster, use a network capture utility,
such as Microsoft Network Monitor. Windows 2000
Server includes a simple version of Network Monitor
that you can install by using the Network program in
Control Panel.

To verify RPC communication, configure the Capture
utility to capture all of the traffic between the nodes of a
cluster. After you have started a capture, using Cluster
Administrator to create a group or resource will result in
RPC traffic between the nodes.
Verifying Cluster Heartbeats

As with RPC communication, to verify that cluster
heartbeats are occurring between the nodes of a cluster,
you must use a network capture utility.

Cluster service uses User Datagram Protocol (UDP) port
3343 to send heartbeats on the network. Use Network
Monitor to capture traffic on port 3343 and verify that
both nodes of the cluster are sending and receiving
cluster heartbeats.
Troubleshooting Client-to-Node Communications

After a failover occurs, clients must still be able to gain
access to a cluster, even though they will be accessing
a different node. The client must be able to resolve any
cluster network names so that they will always connect
to the node on which the resources are online. If clients
cannot connect to virtual servers, verify that:


The client is accessing the cluster by using the correct
network name or IP address.
The client has the Transmission Control Protocol/Internet
Protocol (TCP/IP) protocol correctly installed and
configured.
Check NetBT Cache with Nbtstat

Depending on the resource that is being accessed, the
client can address the cluster by specifying either the
resource network name or the IP address. In the case of
the network name, you can verify proper name
resolution by checking the NetBT cache (using the
Nbtstat.exe utility) to determine whether the name has
been previously resolved. Also, confirm proper
Windows Internet Name Service (WINS) configuration at
the client and at the cluster nodes.
Ping IP Address Using Ping Utility

If the client is accessing the resource through a specific
IP address, ping the IP address of the cluster resource
and cluster nodes from a command prompt.
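A minimal command-prompt session covering both checks might look like
the following; the IP address is an example only:

    rem Display the NetBT name cache to see whether the virtual server's
    rem network name has already been resolved.
    nbtstat -c
    rem Purge and reload the cache to force the name to be resolved again.
    nbtstat -R
    rem Test IP connectivity to the cluster resource (address is an example).
    ping 10.1.1.10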
WINS Static Mappings

You should not create static network-name-to-IP-address
mappings for any cluster names in a WINS database.
WINS is the only name resolution method for which static
mappings cause problems, because WINS static mappings
use the media access control (MAC) address of the network
adapter as part of the static mapping.

If clients are having a problem connecting to a virtual server, an
administrator might have created a WINS static mapping for a
virtual server. The node for which the mapping is created will be
able to bring the network name resource online and clients will be
able to connect. However, if failover occurs, the second node in the
cluster will be able to bring the IP address online but not the
network name. When the second node attempts to bring the
network name online, WINS will return an error preventing it from
registering the network name. WINS prevents the network name
from going online because the second node does not have the
same physical address as the one recorded in the static mapping
for the network name.

Note: For more WINS troubleshooting information, see
"Recommended WINS Configuration for Microsoft Cluster Server,"
Q193890, on the Student compact disc.
SCSI Configuration Problems

SCSI Controllers

SCSI Termination

SCSI Cabling

If you experience hardware failures, you may have to
replace hardware components of the cluster. If you
replace components in the SCSI subsystems, you need
to make sure that the new SCSI configurations conform
to the following guidelines.

SCSI Controllers

SCSI IDs. Each device on the shared SCSI bus must have a unique
SCSI ID. Most SCSI controllers default to SCSI ID 7. Therefore, you
must change the SCSI ID for one of the controllers on the shared SCSI
bus to something other than ID 7.

Boot-Time SCSI Bus Reset. Cluster service uses SCSI bus resets, but
in a controlled way during a membership regroup operation. Some SCSI
controllers reset the SCSI bus when they initialize at start time,
before Windows 2000 is loaded. If the SCSI controllers reset the SCSI
bus, the bus reset can interrupt any data transfers between the other
node and drives on the shared SCSI bus. Therefore, you should disable
automatic SCSI bus resets, if possible, by using the adapter
configuration program accessible at computer start time.

Noncompliant Controllers. It is important to verify that the SCSI
controllers that are being used are on the Cluster service Hardware
Compatibility List (HCL). For a SCSI controller to work with Cluster
service, it must support the SCSI reserve and release commands and
bus resets.
SCSI Termination

Active or Forced Perfect Termination. There are three types of
termination that are used for terminating the SCSI bus: passive
termination, active termination, and forced perfect termination.
Because both active and forced perfect termination use electronics to
provide termination, these types provide the best termination.
You should not use passive termination in a cluster, because
it can result in problems, such as unnecessary failover or
inability to access the quorum disk.

On-Card Termination. Many SCSI controllers provide on-card
termination; however, on-card termination does not provide
termination when the computer is not turned on. On-card termination
only becomes an issue when external terminators are not used.
When using external terminators, the on-card termination
should be disabled.

SCSI Cabling

Tri-Link or Y-Cable SCSI Connectors. Attaching Y-cables or tri-link
connectors to the back of the SCSI controllers at each end of the bus
is one method that you can use to allow the SCSI bus to remain
terminated even when one node is turned off. These components allow
you to use external terminators that will continue to provide
termination if a node is turned off. You must ensure that the SCSI
cards in the nodes are not providing termination when using these
connectors.

Long Cables. It is very common to have multiple external SCSI drives
on the shared SCSI bus. When configuring multiple external drives, it
is very important not to exceed the maximum combined cable length
that the controller manufacturer recommends. The SCSI specifications
specify the maximum combined cable length when using different types
of cabling. If the manufacturer of the controller recommends a
shorter distance, be sure to follow the recommendation of the
manufacturer.
Group and Resource Failures
[Slide: Cluster Administrator connected to MYCLUSTER, showing the
groups Cluster Group, Mygroup, and SQL Group on nodes SERVER1 and
SERVER2. In the resource list, Cluster IP Address, Cluster Name,
Disk W:, and Printer Spooler are Online on SERVER2, while the Public
file share resource is in a Failed state.]

If groups or resources are not available to clients, you
need to verify whether it is a restart, failover, or failback
problem. In Cluster Administrator, you will see a visual
notification that a group or a resource in a group is
offline. Because there are a variety of reasons for a
failure, you will have to troubleshoot the cause to find
out whether it is a resource or group failure. The following
table lists common problems and possible resolutions; a
command-line sketch for checking resource states follows
the table.
Problem: A Resource Fails, But Is Not Brought Back Online
Possible resolution:
In the Policies dialog box for the resource properties, verify that
Don't restart is cleared (not selected).
Verify that the resource dependencies are correctly configured.
Verify that any dependent resources are online.

Problem: The Default Quorum Resource Will Not Come Online
Possible resolution:
Verify that there are no hardware errors by using Event Viewer and
looking for disk input/output (I/O) error messages.

Problem: Cannot Bring a Group Online
Possible resolution:
Verify that there are no hardware or configuration problems with
any disk resources for the group.
Verify that the resource dependencies are correctly configured.
Move the group to the other node and attempt to bring the group
online. If this works, verify that the first node can gain access to
everything that is necessary to bring the group's resources online
(for instance, the disk resource).

Problem: A Group Cannot Be Moved or Failed Over to the Other Node
Possible resolution:
Verify that the resource is properly installed on the node.
Verify that the other node is set as a possible owner for all
resources in the group in the Properties dialog box for the resource.

Problem: A Group Failed Over But Did Not Fail Back
Possible resolution:
Verify that the failback policies for the group are properly
configured.
In the Properties dialog box for the group, verify that Prevent
failback is cleared. If Failback immediately is selected, be sure to
wait long enough for the group to fail back. Check these settings for
all of the resources within a group. Because groups fail over as a
whole, one resource that is prevented from failing back will affect
the entire group.
Ensure that the node to which you want the groups to fail back is
configured as the preferred owner of the group. If not, Cluster
service will leave the groups on the node to which they failed over.

Problem: The Entire Group Failed and Has Not Restarted
Possible resolution:
If the node on which the group had been running is offline, verify
that the other node is a possible owner of the group and of all of
the resources in the group.
Ensure that the group has not exceeded its failover threshold or its
failover period.
Bring the resources online one at a time to determine which resource
is causing the problem.
Create a temporary group (for testing purposes), and then move the
resources to it one at a time, bringing each resource online after
moving the resource.
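As referenced above, the state checks in this table can also be made
from a command prompt on a cluster node with cluster.exe; the resource
name here is an example:

    rem List the state and owner of every group and every resource.
    cluster group
    cluster resource
    rem Attempt to bring a single failed resource online
    rem (the resource name is an example).
    cluster resource "Printer Spooler" /online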
Quorum Log Corruption

Reset the Quorum Log

Clussvc -debug -resetquorumlog

Delete the Quorum Log

-noquorumlogging

Microsoft Cluster service maintains details about
changes within the cluster through a quorum log file. If
this file becomes corrupted for any reason, it is possible
that Cluster service will not start. The following error
message may occur when you attempt to start Cluster
service on a node of the server cluster:

Event ID: 1147
Source: ClusSvc

If the cluster will not start because of a corrupted
quorum log, you can reset the quorum log. If Cluster
service still will not start after attempting a reset, you
can access the quorum disk and remove the corrupted
quorum log.
Reset the Quorum Log

If you do not have a backup of the quorum log file, perform the
following steps (a consolidated command-prompt sketch follows the
steps):

1. Open a command prompt.
2. Go to the %Systemroot%\Cluster folder.
3. Start Cluster service by typing clussvc -debug -resetquorumlog,
which attempts to create a new quorum log file that is based on
the cluster configuration information in the local system's cluster
registry hive.
4. Stop Cluster service by pressing CTRL+C.
5. Restart Cluster service by typing net start clussvc.
6. Close the command prompt.
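Condensed into a single command-prompt session, the procedure above
looks like the following sketch:

    cd /d %systemroot%\cluster
    rem Rebuild the quorum log from the local cluster registry hive;
    rem press CTRL+C to stop the debug instance once it is running.
    clussvc -debug -resetquorumlog
    rem Restart Cluster service normally.
    net start clussvc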
Delete the Quorum Log

If the log file becomes corrupted and cannot be reset,
Cluster service may not start. To correct this problem,
you must use the -noquorumlogging option when
starting Cluster service. This option allows the cluster
to start without quorum logging. You may then access
the quorum disk and remove the corrupted Quolog.log
file.

Use the following procedure to help recover from this situation:

1. If Cluster service is running, use Control Panel on both nodes to
stop Cluster service.
2. On one node, use the Services tool in Control Panel to specify the
startup parameter for Cluster service as -noquorumlogging, and start
the service.
3. On the quorum disk, run Chkdsk. If the disk does not show
corruption, the log file may be corrupted. In this case, delete the
Quolog.log file and any .tmp files that are located in the MSCS
folder on the quorum disk.
4. In Services, stop Cluster service, and then start Cluster service
without startup parameters. After the service starts, you may start
it on the other node.

Note: When you disable quorum logging within a cluster, changes to the
cluster configuration cannot be logged. If a node goes offline during this
period, recent changes may be lost if changes could not be
communicated to the other node. Quorum logging should only be
disabled when necessary to recover from log file corruption.
Lab A: Cluster Maintenance
Objectives
After completing this lab, you will be able to:

Back up cluster configuration files.

Restore cluster configuration files.

Evict a node from the cluster.

Uninstall Cluster service.
Scenario

In this lab, you will back up a node's system state,
which includes the cluster configuration files. After the
backup is complete, you will restore the system state
and verify that the cluster configuration files were
restored to the node. At this point, to restore the cluster,
you would run the Clusrest.exe utility, but for the
purposes of this lab, you will not restore the cluster. You
will evict a node from a cluster and uninstall Cluster
service on both nodes.

The following exercises will refer to your computers as
Node A and Node B. For this lab, you will perform all of
the tasks on both Node A and Node B, with the
exception of evicting a node, which you will perform
only on Node B.
Exercise 1: Backup and Restore

In this exercise, you will learn how NTBackup is used to
back up and restore the cluster.

To back up the cluster

Complete this task from Node A and Node B.

1. Click Start, point to Programs, point to Accessories, point to
System Tools, and then click Backup.
2. In the Backup dialog box, click Backup Wizard.
3. In the Backup Wizard dialog box, click Next.
4. Select Only back up the System State data, and then click Next.
5. In the Backup media or file name box, type c:\Backup.bkf
and then click Next.
6. Click Finish to start the backup.
7. NTBackup will start backing up the system state, which will take a
couple of minutes.
8. When the backup is complete, click Close.
To restore the cluster

1. In the Backup dialog box, click Restore Wizard.
2. Click Next.
3. Click Import File to locate the backup file of the system state.
4. In the Catalog backup file box, type c:\Backup.bkf and then
click OK.
5. In the What to restore box, expand File, and then expand the
media created.
6. Select the System State box, click Next, and then click Finish.
7. In the Enter Backup File Name dialog box, click OK.
8. The Restore process will take a couple of minutes.
9. When Restore is complete, click Close.
10. When prompted to restart the computer, click No.
11. Close NTBackup.

Note: NTBackup does not restore the cluster files to
the cluster disk. NTBackup places the cluster files on
the local node.
To examine the cluster files that are restored by NTBackup

1. Click Start, and then click Run.
2. In the Run dialog box, type %systemroot%\cluster and then
click OK.
3. Double-click the cluster_backup folder to view the files that are
restored by NTBackup.

What utility would you use to restore these files to the
shared drive?___________________
To create a group after backup

To test the restore process, you will create a group after the
backup. The restore procedure will roll back the cluster to the
state when the backup was performed.

Perform this task from Node A.

1. In Cluster Administrator, click File, select New, and then select
Group.
2. In the New Group dialog box, fill out the following properties:
Name: Test Group
Description: Test Group
3. Click Next.
4. In the Preferred Owners dialog box, click Finish.
5. Click OK to acknowledge that the group was successfully created.
To install Clusrest.exe

In this task, you will install the Clusrest utility on Node B and
restore the cluster to the state of the last backup. Close Cluster
Administrator on Node A and Node B if it is running.

Perform this task from Node B.

1. On the Start menu, click Run.
2. In the Run dialog box, type c:\moc\2087a\labfiles\mscs and then
click OK.
3. In the Microsoft Web Installation Wizard – Tool: CLUSREST.EXE
dialog box, click Next.
4. Click I Agree, and then click Next.
5. Click Install Now.
6. Click Finish.
7. On the Start menu, click Run.
8. In the Run dialog box, type cmd and then click OK.
9. At the command prompt, type cd\program files\resource kit and
then press ENTER.
10. At the command prompt, type clusrest and press ENTER.
11. At the command prompt, type y to continue.
12. Wait for clusrest to complete before proceeding.
13. Open Cluster Administrator.
14. Expand Groups and notice that the Test Group that was created in
the previous task is now missing and that Node A is offline.
Exercise 2: Removing Cluster Service

In this exercise, you will remove Cluster service from
both computers in the cluster.

To evict a node

Complete this task from Node A only.

1. Log on as Administrator with a password of password.
2. Open Cluster Administrator from the Administrative Tools menu.
3. If prompted, click Yes to restart Cluster service on Node A.
4. Right-click Node B.
5. Click Stop Cluster Service.
6. Click Yes.
7. Right-click Node B.
8. Click Evict Node.
9. Click Yes.
To remove Cluster service from Node A and Node B

Complete this task from Node A and Node B.

1. Log on as Administrator with a password of password.
2. On the Start menu, select Settings, and then click Control Panel.
3. Open Add/Remove Programs from Control Panel.
4. Click Add/Remove Windows Components.
5. Clear the Cluster Service check box, and then click Next.
6. Click Finish.
7. Click Yes to restart the computer.
Review

Cluster Maintenance

Troubleshooting Cluster Service