Chapter 13 Troubleshooting the Operating System

13.1 - Identifying and Locating Symptoms and Problems
13.2 - LILO Boot Errors
13.3 - Various Reasons for Package Dependency Problems
13.4 - Troubleshooting Network Problems
13.5 - Disaster Recovery
Identifying and Locating Symptoms and Problems
Hardware Problems
• Although a few problems are due to a combination of factors, most can be traced to one of these origins:
– Hardware, Kernel, Application Software, Configuration, and User Error
• Some hardware problems leave traces that the kernel detects and records.
• If an error does not crash the system, evidence might be left in the log file /var/log/messages, with the message prefixed by the word Oops.
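As a rough illustration of looking for that evidence, the sketch below filters log lines that mention an oops. The function names are hypothetical, the case-insensitive match is an assumption, and reading /var/log/messages normally requires root privileges.

```python
import re

# Matches the word "Oops" as a whole word. Ignoring case is an assumption
# made so that lower-case variants in the log are caught as well.
OOPS_PATTERN = re.compile(r"\boops\b", re.IGNORECASE)


def find_oops_lines(lines):
    """Return the log lines that mention a kernel oops."""
    return [line.rstrip("\n") for line in lines if OOPS_PATTERN.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as log:
        return find_oops_lines(log)
```

Calling scan_log() with no arguments checks the default log file named above; an empty list means no oops was recorded there.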
Kernel Problems
• Released Linux kernels are remarkably stable,
unless experimental versions are used or
individual modifications are made.
• Loadable kernel modules are considered part
of the kernel as well, at least for the time
period they are loaded.
• Sometimes these can cause difficulties, too.
• The good news with modules is that they can be unloaded and replaced with fixed versions while the system is still running.
Application Software
• Errors in application packages are most identifiable in
that they occur only when running the application.
• This is in contrast to hardware and kernel conditions
that affect an entire system.
• Some common signs of application bugs are failure to
execute and program crash.
• An application may consume too much system
memory and ultimately begin to swap so badly that
the whole system is affected.
• Some errors are caused by things that have to do with
the running program itself.
Configuration
• Configuration problems tend to affect whole subsystems,
such as the graphics, printing, or networking subsystems.
• If the system is rebooted and a remote file system that
was once present is not, the first place to look is in the
configuration file /etc/fstab to see if the file system is
supposed to be mounted at boot time.
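The /etc/fstab check can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and it assumes the usual whitespace-separated fstab fields (device, mount point, type, options), treating the noauto option as "not mounted at boot".

```python
def mounts_at_boot(fstab_lines, mount_point):
    """Report whether mount_point is set to mount at boot.

    Returns True or False when an entry exists, and None when the
    mount point does not appear in the file at all.
    """
    for raw in fstab_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        target, options = fields[1], fields[3]
        if target == mount_point:
            return "noauto" not in options.split(",")
    return None
```

A result of None for a remote file system that used to be present suggests its entry was removed or never added, which matches the symptom described above.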
User Error
• It is forgivable to make a mistake in using a computer
program or to be ignorant of the right way to do
something. It is only unforgivable to insist on
remaining stubbornly so.
• There is more to know about the ins and outs of
operating almost any software package than
everyday users will ever care or attempt to learn.
Using System Utilities and System Status Tools
• Linux operating systems
provide various system
utilities and system status
tools.
• The setserial utility provides information about, and sets options for, the serial ports on the system.
• The lpq command helps
resolve printing problems.
• The command will display
all the jobs that are
waiting to be printed.
Using System Utilities and System Status Tools
• The ifconfig command can be entered at the shell to return the current network interface configuration of the system.
• The route command
displays or sets the
information on the system’s
routing, which it uses to
send information to
particular IP addresses.
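To make the idea of a routing table concrete, here is a minimal sketch of how a destination address can be matched against a list of routes. It assumes the common rule that the most specific matching network wins; the function name and the sample addresses are hypothetical, and a real kernel also weighs metrics and other attributes.

```python
import ipaddress


def choose_route(routes, destination):
    """Pick the gateway for destination from (network, gateway) pairs.

    The route whose network has the longest prefix (most specific match)
    is chosen. Returns None when no route matches.
    """
    dest = ipaddress.ip_address(destination)
    best = None
    for network, gateway in routes:
        net = ipaddress.ip_network(network)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway)
    return None if best is None else best[1]
```

With a table holding a local network and a default route (0.0.0.0/0), local addresses match the local entry and everything else falls through to the default gateway.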
Unresponsive Programs and Processes
• Sometimes there are programs and
processes that for various reasons can
become unresponsive or “lock up”.
• Sometimes only the program or process itself locks up; at other times it can cause the entire system to become unresponsive.
• One way to troubleshoot the problem is to identify the unresponsive program or process and then kill or restart it.
When to Start, Stop, or Restart a Process
• It is easiest to terminate a program by using the kill
command.
• Other processes, such as daemons, need to be stopped by using their Sys V startup scripts.
• When restarting a program, service, or daemon, it is best to first consult the documentation, because different programs have to be restarted in different ways.
• Some support a restart option, some need to be stopped completely and then started again, and others can simply reread their configuration files without being stopped or restarted.
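What the kill command does can be sketched from a program: send the process a termination signal and wait for it to exit. The example below is an illustration under assumptions: the helper names are hypothetical, it targets a child process started by the script itself, and falling back to SIGKILL after a timeout is one possible policy rather than a required one. It relies on POSIX signals.

```python
import signal
import subprocess
import sys


def start_sleeper():
    """Start a harmless child process that would otherwise run for a minute."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def stop_process(proc, timeout=5.0):
    """Ask proc to exit with SIGTERM and fall back to SIGKILL if it does not.

    Returns the exit status; on POSIX a negative value is the number of
    the signal that ended the process.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # the process ignored SIGTERM, so force it
        proc.wait()
    return proc.returncode
```

Sending SIGTERM first gives a well-behaved program the chance to clean up, which is why the forced kill is kept as a last resort.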
Troubleshooting Persistent Problems
• The best way to fix programs that crash repeatedly is
to replace them with new software or with a different
kind of software that performs the same task.
• If possible, try using the software in a different way; if a particular keystroke or command causes the program to fail, stop using it.
• Most times there will be replacement software available.
• If it is a daemon that is crashing regularly, try other methods of starting and running it.
Examining Log Files
• Some of the more important log files on a Linux system are /var/log/messages, /var/log/secure, and /var/log/syslog.
• The system’s log files can be
used to monitor system loads
such as how many pages a
web server has served.
• They can also be used to check for security breaches such as intrusion attempts, verify that the system is functioning properly, and note any errors that might be generated by software or programs.
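As one example of checking a log for intrusion attempts, the sketch below counts failed password attempts per source address. It assumes the log contains sshd-style "Failed password for ... from ..." lines, which will not hold for every service or distribution, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed message shape: "Failed password for [invalid user ]NAME from ADDRESS"
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")


def failed_logins_by_source(lines):
    """Count failed password attempts, keyed by the source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts
```

A source address with an unusually high count is a reasonable starting point for a closer look at /var/log/secure.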
Examining Log Files
• There are several different types of
information that are good to know, which will
make identifying problems using the log files
a little easier.
• Some of these are listed below:
– Monitoring System Loads
– Intrusion Attempts and Detection
– Normal System Functioning
– Missing Entries
– Error Messages
The dmesg Command
• The dmesg command can
be used to display the
recent kernel messages,
also known as the kernel
ring buffer.
• These messages contain
information about the
hardware installed in the
system and the drivers.
• The information in these
messages relates to
whether the drivers are
being loaded successfully
and what devices the
drivers are controlling.
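A small sketch of searching the kernel messages for a particular driver or device follows. The helper names are hypothetical, and on some systems reading the ring buffer with dmesg requires root privileges.

```python
import subprocess


def read_dmesg():
    """Return the kernel ring buffer as text by running the dmesg command."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def filter_kernel_messages(text, keyword):
    """Return the lines of text that mention keyword, ignoring case."""
    keyword = keyword.lower()
    return [line for line in text.splitlines() if keyword in line.lower()]
```

For example, filter_kernel_messages(read_dmesg(), "eth0") would list the messages that mention that network interface, if any exist.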
Troubleshooting Problems Based on User Feedback
• There are several
different types of
problems that users
report.
• Some of the most
common ones are:
– Login Problems
– File Permission Problems
– Removable Media Problems
– E-mail Problems
– Program Errors
– Shutdown Problems
LILO Boot Errors
Error Codes
• The LILO boot loader is the first piece of code that takes control of the boot process from the BIOS. It loads the Linux kernel, and then passes control entirely to the Linux kernel.
• When there is a problem with LILO, an error code will be displayed:
– None, L error-code, LI, LI101010…, LIL, LIL?, LIL-, LILO
Booting a Linux System without LILO
• Using the LILO on a Floppy method is the least useful, but it can help in some instances.
• From this screen, a LILO boot floppy disk can be created, which can then be used to boot Linux from LILO.
Emergency Boot System
• Linux provides an
emergency system’s copy
of LILO, which can be
used to boot Linux in the
event that the original LILO
boot loader has errors or is
not working.
• This is known as the
Emergency Boot System.
• To use this copy of LILO, configuration changes must be made in lilo.conf.
Using an Emergency Boot Disk in Linux
• There are several
reasons and errors that
can cause a Linux
system not to boot,
besides LILO problems.
• The emergency boot disk should have the necessary disk utilities such as fdisk, mkfs, and fsck, which can be used to partition, format, and check a hard drive so that Linux can be installed on it.
Using an Emergency Boot Disk in Linux
• It is always important to
include some sort of
backup software utility.
• If a change or repair to
some configuration files
needs to be made, first
back them up.
• Most distributions come
with some sort of backup
utility like tar, restore,
cpio, and possibly others.
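Backing up configuration files before changing them can be sketched with a tar archive. This uses Python's tarfile module in place of the tar command itself; the function names and the choice of gzip compression are assumptions made for the illustration.

```python
import tarfile


def backup_configs(paths, archive_path):
    """Write the given files into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            archive.add(str(path))
    return archive_path


def list_backup(archive_path):
    """Return the member names stored in the archive, to verify the backup."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

Listing the archive right after creating it is a cheap way to confirm that the files were actually saved before any edits begin.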
Recognizing Common Errors
Various Reasons for Package Dependency Problems
• When a package is installed in a Linux system there
might be other packages that need to be installed for
that particular package to work properly.
• The dependency package may have certain files
which need to be in place or it may run certain
services which need to be started before the package
that is to be installed can work.
• Linux will often notify the user if they are installing a
package that has dependencies so that they can be
installed as well.
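The dependency idea can be modelled with a small graph: each package lists what it needs, and dependencies must be installed first. The sketch below is not how any particular package manager works; the package names are made up, the function names are hypothetical, and it relies on graphlib from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def missing_dependencies(package, dependencies, installed):
    """Return the direct dependencies of package that are not installed."""
    return sorted(set(dependencies.get(package, ())) - set(installed))


def install_order(dependencies):
    """Return an order in which every package follows its dependencies.

    dependencies maps each package name to the set of packages it needs.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

If two packages depend on each other, static_order raises graphlib.CycleError, which corresponds to a dependency loop that no install order can satisfy.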
Solutions to Package Dependency Problems
• One way to get past a package dependency problem is to simply ignore the error message and forcibly install the package anyway.
• The correct and recommended method is to modify the system so that it has the dependencies the package needs to run properly.
• It may be necessary to rebuild the package from source code
if there are dependency error messages showing up.
• The easiest way is to locate a different version of the package
that is causing the problems.
• Another option is to look for a newer version of the package.
Backup and Restore Errors
• Backup and Restore errors can occur at different points.
• Some errors will occur when the system is actually
performing the backup.
• Other errors will occur during the restore process when
the system is attempting to recover data.
• Some of the most common types of problems:
– Driver problems
– Tape drive access errors
– File access errors
– Media errors
– Files not found errors
Application Failure on Linux Servers
• There are several things that can provide some
indication of an application failure or software problem
on a Linux server:
– Failure to Start
– Failure to Respond
– Slow Responses
– Unexpected Responses
– Crashing Application or Server
• A good general rule is to check the system’s logs.
• The system’s log files are usually the place to find most
error messages that are generated because they are not
always displayed on the screen.
Troubleshooting Network Problems
Loss of Connectivity
• Loss of connectivity can be hardware and/or software
related. The first rule of troubleshooting is to check
for physical connectivity.
• Ensure that the cables are properly plugged in at
both ends, that the network adapter is functioning by
checking the link light on the NIC, that the hub's
status lights are on, and that the communication
problem is not a simple hardware malfunction.
Operator Error
• Be sure that users are using the correct username and
password and that their accounts are not restricted in a
way that prevents them from being able to connect to the
network.
• Software settings might have been changed by the installation routine of a recently installed program, or the user might have been experimenting with settings.
• Users accidentally, or purposely, delete files, and power
surges or shutting down the computer abruptly can
damage file data.
• Viruses can also damage system files or user data.
Using TCP/IP Utilities
• The first step in checking for a
suspected connectivity
problem is to ping the host.
• If a reply is received, the
physical connection between
the two computers is intact and
working.
• If the host is on the Internet, a successful reply also signifies that the calling system can reach the Internet.
• The term ping time refers to
the amount of time that
elapses between the sending
of the Echo Request and
receipt of the Echo Reply.
• A low ping time indicates a fast
connection.
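Ping times can be pulled out of the command's output for comparison. The sketch below assumes the "time=... ms" form that common ping implementations print on each reply line; the function names are hypothetical.

```python
import re

# Assumed reply form: "... icmp_seq=1 ttl=64 time=0.412 ms"
PING_TIME = re.compile(r"time[=<]([\d.]+)\s*ms")


def ping_times_ms(output):
    """Return the round-trip times, in milliseconds, found in ping output."""
    return [float(value) for value in PING_TIME.findall(output)]


def average_ping_ms(output):
    """Return the mean round-trip time, or None if no reply was received."""
    times = ping_times_ms(output)
    return sum(times) / len(times) if times else None
```

An empty result means no Echo Reply lines were found, which points back to the connectivity checks described above.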
Using TCP/IP Utilities
• Tracing utilities are used to
discover the route taken by a
packet to reach its destination.
• The way to determine packet
routing in UNIX systems is the
traceroute command.
• Traceroute shows all the
routers through which the
packet passes as it travels
through the network from
sending computer to destination
computer.
• This is useful for determining at
what point connectivity is lost or
slowed.
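Finding the point where replies stop can be sketched by scanning traceroute output for the first hop that shows only asterisks. The line format is assumed from typical traceroute output, and the function name is hypothetical. A silent hop does not always mean a fault, because some routers simply do not answer probes.

```python
import re

# A hop line starts with the hop number, followed by the probe results.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")


def first_silent_hop(output):
    """Return the number of the first hop where every probe timed out.

    Returns None when every listed hop answered at least one probe.
    """
    for line in output.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue  # header or blank line
        probes = match.group(2).split()
        if probes and all(token == "*" for token in probes):
            return int(match.group(1))
    return None
```

The hop just before the returned number is the last router known to respond, so it is a sensible place to start investigating.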
Using TCP/IP Utilities
• The ipconfig
command is used in
Windows NT and
Windows 2000 to
display the IP address,
subnet mask, and
default gateway for
which a network
adapter is configured.
• For more detailed
information, the /all
switch is used.
Problem-Solving Guidelines
• Troubleshooting a network requires problem-solving
skills.
• The use of a structured method to detect, analyze,
and address each problem as it is encountered
increases the likelihood of successful
troubleshooting.
• These steps should be followed:
– Gather information
– Analyze the information
– Formulate and implement a "treatment" plan
– Test to verify the results of the treatment
– Document everything
Windows 2000 Diagnostic Tools
• The network diagnostic
tools for Microsoft
Windows 2000 Server
include Ipconfig,
Nbtstat, Netstat,
Nslookup, Ping, and
Tracert.
• Windows 2000 Server
also includes the
Netdiag and Pathping
commands.