Network Management Session 1 Network Basics


COMP3122
Network Management
Richard Henson
April 2012
Week 11 – Troubleshooting
& Optimisation

Learning Objectives:
– Explain the principles of troubleshooting as a
means of guarding against failure
– Use the various tools available on a named
operating system to identify potential faults
and problems
– Take appropriate action to stop a fault
becoming a failure
“A stitch in time saves nine”
Business - Worst Possible Scenario (1)

There is an interruption in the power
supply
– UPS is invoked
– the interruption continues…
– servers all have to be shut down

Power supply restored…
– but main domain controller doesn’t reboot
– no other domain controllers therefore
connect to it
– the domain tree fails
Business - Worst Possible Scenario (2)

Organisation cannot do business with the
network down…
– server can’t be persuaded to boot
– new main domain controller has to be
commissioned
– whole directory tree has to be rebuilt!!!
– word spreads very rapidly…

Business loses so much custom, trust, and
credibility that even when it starts doing
business again customers choose to go
elsewhere
– without a flourishing customer base… the
business folds
Analysis: This scenario shouldn’t
have occurred…

Unlikely that the server would fail to boot
without prior warning…
– warnings would have been presented…
– but were clearly not acted upon!

Disaster recovery plan!?!
– not formulated?
– not tested?
– not effective (in the event of a domain tree controller
failure…)
But it does…

Actual example (15th Feb 2010):
– root domain controller [on the network] had not
been backed up for 10 months, when it crashed
(well… at least it had been backed up at some
time…)
– http://searchwindowsserver.techtarget.com/generic/0,295582,sid68_gci1381567,00.html

The consultant called in to fix it reported that:
– “I had never seen a case where the forest
root domain had to be recovered -- and I
couldn't find anyone who had.”
Analysis: Who is to blame? (1)

In this example, the organisation said
they were following Microsoft guidelines
– they set up an empty root domain
– the root domain controller had a RAID-5
disk configuration

This was true, to some extent
– Microsoft did espouse this as best
practice… in the year 2000!
– guidelines had changed since then…
Analysis: Who is to blame? (2)

The disaster that struck was:
– two RAID drives failed on the same day!
– unlucky? possible to prepare for this?

The recovery process took about three weeks
– most of the time was spent studying logs, doing
the restore, etc.

In this case, the tree was still able to function
without a root domain
– business was able to continue
– customer base wasn’t compromised…
Fault Tolerance and Risk
Assessment

General “common sense” principle:
– always have a backup
– ESPECIALLY for the most important computer
on the network…

Q:
– How can you tell what needs backing up?

A:
– Risk Assessment and Risk Management
Why not Risk Management?
Time consuming!

However, without proper risk
management…

– how does the organisation know what
processes are most important to its
functioning?
– how can an organisation provide resources
to protect aspects of its network?
Risk Management and
Risk Assessment

Risk Assessment is an essential first step
– requires putting a “value” on assets
– more valuable… greater protection

Do information assets have value?
– organisations still failing to acknowledge that they
do…
– categorisation of information assets therefore
potentially problematic
– need to look at the consequence to the
organisation of losing that asset…
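
One rough way to make the "value" idea concrete is to score each asset by the loss expected if it fails. The sketch below is illustrative only: the asset names, values and likelihoods are invented, and multiplying value by likelihood is one common simplification rather than a method taken from these slides.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    value: float       # estimated cost to the organisation if lost (hypothetical figure)
    likelihood: float  # estimated chance of loss per year, between 0.0 and 1.0

    @property
    def exposure(self) -> float:
        # Simple expected-loss score: consequence multiplied by likelihood.
        return self.value * self.likelihood


def rank_by_exposure(assets):
    """Return assets ordered from highest to lowest exposure."""
    return sorted(assets, key=lambda a: a.exposure, reverse=True)


# Hypothetical inventory, for illustration only.
assets = [
    Asset("print server", 5_000, 0.20),
    Asset("root domain controller", 500_000, 0.05),
    Asset("file server", 80_000, 0.10),
]

for asset in rank_by_exposure(assets):
    print(f"{asset.name}: {asset.exposure:,.0f}")
```

Assets at the top of the list are the ones where backup and fault tolerance spending is easiest to justify.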
How do you back up a
Domain Controller?

The Windows “Backup” program works, and
can easily be scheduled
– but heavily criticised…
– even the 2008 server version…

Third Party products give more flexibility and
protection e.g. :
– Recovery Manager
» http://www.quest.com/recovery-manager-for-active-directory
– Backup Exec
» http://www.symantec.com/business/products/family.jsp?familyid=backupexec
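
Whatever product is used, it helps to check that backups are actually happening. The sketch below flags a stale backup; the seven-day limit and the dates are assumptions chosen for illustration, loosely modelled on the "10 months" example earlier.

```python
from datetime import date


def backup_is_stale(last_backup: date, today: date, max_age_days: int = 7) -> bool:
    """Return True if the last backup is older than the allowed age."""
    return (today - last_backup).days > max_age_days


# Hypothetical dates: a backup taken roughly ten months before a crash.
print(backup_is_stale(date(2009, 4, 15), date(2010, 2, 15)))  # True
# A backup taken the day before would pass the check.
print(backup_is_stale(date(2010, 2, 14), date(2010, 2, 15)))  # False
```

A check like this could run daily and feed an alert, so that a missed backup is noticed within days rather than months.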
Prevention is Better than Cure

A server shouldn’t crash unexpectedly!
– should be kept cool (environmental unit mustn’t
break down!)
– monitoring should show that unexpected things are
happening
– action can then (usually) be taken to take care of
the unexpected

Many tools available to:
– Check/monitor the system on a regular basis
– Provide stats to administrators
» could also be used for security purposes
– Generate alerts if something is starting to go
wrong…
Troubleshooting Tools for a Windows
Server: Task Manager

Applications tab:
– shows which applications are running
– enables changing of process priority
» use view/update speed
– can be used to
» open new applications
» shut rogue applications down
Task Manager (continued)

Processes tab:
– all system processes
– Memory usage of each
– % CPU time for each
– total CPU time since boot up
– also used to close a process down
» careful! (but you get a warning…)
Task Manager (continued)

Performance tab:
– total no. of threads, processes, handles running
– Graph: % CPU usage
» User mode
» Kernel mode (optional: view menu)
» graph per CPU (optional: view menu)
– physical memory (RAM) available/usage
– virtual memory (page file) available/usage
Event Viewer

Events recorded into “event log” files
– System log
– Auditing log (customisable)
– Application log
– customisable - additional files

New files recorded daily; old ones
archived
– time before archiving also customisable
Event Viewer

Three types of events recorded in log:
– Information
– Warning
– Error

More information on each event obtained by
double-clicking
– make note of event code
– heed and take action if necessary
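
As a minimal sketch of "heed and take action", the example below summarises a list of events by type and pulls out the warnings and errors to follow up. The records are hypothetical dictionaries, not the real event log format, and the sources and codes are placeholders.

```python
from collections import Counter

# Hypothetical event records; real event logs carry many more fields.
events = [
    {"type": "Information", "source": "EventLog", "code": 6005},
    {"type": "Warning", "source": "Disk", "code": 51},
    {"type": "Error", "source": "Disk", "code": 7},
    {"type": "Warning", "source": "Disk", "code": 51},
]


def summarise(events):
    """Count events per type (Information, Warning, Error)."""
    return Counter(e["type"] for e in events)


def needs_attention(events):
    """Return the distinct (source, code) pairs for warnings and errors."""
    return sorted(
        {(e["source"], e["code"]) for e in events if e["type"] in ("Warning", "Error")}
    )


print(summarise(events))
print(needs_attention(events))
```

Repeated warnings from the same source, as with the disk entries here, are the kind of early signal that should be investigated before it becomes a failure.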
Using Event Viewer

Wise to check all event logs regularly
– take time/trouble to find out what those
messages really mean…

Then take whatever action is needed
– sort out potential problems now
– make sure they don’t become real ones
later…
Auditing Further Events
Any “object” can be audited

Objects to audit, and processes
audited, can be set through audit
(group) policy
– Using MMC & relevant snap-in

Types of process audited:
– access
– attempt to access
Security auditing
Same principles as general
auditing

Refers to “restricted” objects

Events appear in separate
security log

Event Management software
(SIEM)

Who’s going to look at all these log files?
– in practice, often no-one..

Solution – SIEM software to analyse and
present information from:
– network and security devices
– identity & access management applications
– vulnerability management/policy compliance tools
– OS, database & application logs
– external threat data

http://www.focus.com/briefs/how-select-security-information-and-event-management-siem
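
One small part of what SIEM software does is bring events from separate sources into a single timeline. The sketch below shows only that merging step, under the assumption that each source is already sorted by timestamp; the sources, timestamps and messages are invented.

```python
import heapq

# Hypothetical events as (timestamp, source, message), each list sorted by time.
firewall_log = [
    (1, "firewall", "blocked inbound connection"),
    (5, "firewall", "port scan detected"),
]
os_log = [
    (2, "os", "failed logon"),
    (3, "os", "failed logon"),
    (9, "os", "service stopped"),
]


def merged_timeline(*sources):
    """Merge already-sorted event lists into one list ordered by timestamp."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))


for timestamp, source, message in merged_timeline(firewall_log, os_log):
    print(timestamp, source, message)
```

Real products add correlation rules, storage and reporting on top; this only illustrates why a combined view is easier to read than several separate log files.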
Other Troubleshooting
Resources

NT Diagnostics (winmsd.exe)
– hardware & operating system data from registry

Performance Monitor
– Can monitor many aspects of system performance
– Either display current data graphically, in real-time
– or log data at regular intervals to get a longer term
picture
– Useful role in system optimisation
Other Troubleshooting
Resources

Network Monitor
– captures, filters, or analyses frames or packets
sent over the network

Alerts
– notify administrator when a particular threshold
value has been reached

System Recovery
– if a fatal error occurs:
» a dump of system memory is made, and can be used for
identifying the cause of the problem
» alerts are sent to users
» system is restarted automatically
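
The threshold idea behind Alerts can be sketched in a few lines. This is an assumption-laden illustration rather than how the Windows alert service is implemented: it raises an alert only after several samples in a row exceed the threshold, so that a single spike does not page the administrator. The sample values are invented.

```python
def threshold_alerts(samples, threshold, consecutive=3):
    """Return indexes where `consecutive` samples in a row exceeded the threshold."""
    alerts = []
    run = 0  # length of the current run of over-threshold samples
    for index, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run == consecutive:
            alerts.append(index)
    return alerts


# Hypothetical % CPU readings taken at regular intervals.
cpu_samples = [40, 95, 97, 50, 91, 92, 96, 99, 30]
print(threshold_alerts(cpu_samples, threshold=90))  # [6]
```

The two-sample spike at the start is ignored; the alert fires at index 6, the third consecutive reading above 90.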
Performance Monitor
For Windows 2003 Server, but not available
on the installation disk

To obtain and download Performance
Monitor Wizard (PerfWiz), visit the
following Web site:
– http://www.microsoft.com/downloads/details.aspx?FamilyID=31fccd98-c3a1-4644-9622-faa046d69214&displaylang=en
What if the machine
doesn’t boot…

Tools available:
– The boot error itself
» blue screen? driver software
» constant reboot? motherboard
– Last Known Good…
» Gives machine a chance to go back to the
previous (usually last but one)
configuration
What if the machine
doesn’t boot… (continued)

Safe Mode
– includes VGA Mode or boot
logging
– Debugging mode also available
» output difficult to decipher for non-experts

Recovery Console
– “DOS-type prompt” for performing
minor repairs
What if the machine
doesn’t boot… (continued)

System Configuration Utility
(Msconfig.exe)
– automates the routine troubleshooting
steps relating to Windows configuration
issues
– can be used to modify the system
configuration and troubleshoot the problem
using a process-of-elimination method
What if the machine
doesn’t boot… (continued)

Emergency Repair Disk (ERD)
– reboot machine using different media
» e.g. floppy disk
– media should be generated BEFORE it
needs to be used!
– option to create the ERD during the setup
process…
What if the machine
doesn’t boot… (continued)

Full restore
– assumes a full backup has already been
made
– still have to:
» reformat hard disk from scratch…
» and then restore the backup files using
backup/restore option….
– but better than losing all your data!
Network Troubleshooting Chart – 1

Identify the problematic network node
– Use commands such as PING & TraceRt

Is there a problem with one of the network protocols?
– Isolate the problem to a protocol layer and fix it

Is there a memory problem?
– Is there a memory leak?
» Fix or eliminate the software with the memory leak
– Is there sufficient memory?
» Add more memory

URL: http://teamapproach.ca/trouble
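
For the memory leak question in the chart, one hedged way to spot a candidate is to log a process's memory use at regular intervals and look for steady growth. The sketch below fits a least-squares line to the samples; the readings and the growth limit of 1 MB per sample are assumptions for illustration, and a rising trend is only a hint, not proof of a leak.

```python
def growth_per_sample(samples):
    """Least-squares slope of the samples against their index."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def looks_like_leak(samples, min_growth=1.0):
    """Flag a process whose memory use rises by at least min_growth per sample."""
    return growth_per_sample(samples) >= min_growth


# Hypothetical memory readings in MB for two processes.
growing = [100, 102, 104, 106, 108]
steady = [100, 101, 99, 100, 100]
print(looks_like_leak(growing), looks_like_leak(steady))  # True False
```

A process flagged this way is a candidate for the "fix or eliminate" branch; one that is simply large but flat points to the "add more memory" branch instead.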
Network Troubleshooting Chart – 2

Does the system freeze?
– Investigate priority and device driver problems

Is there high processor utilization?
– Is it caused by hardware or software?
» Provide adequate processor resources
– Hardware: can an upgraded device driver fix the problem?
» Upgrade your hardware to offload the processor
Network Troubleshooting Chart – 3

Is there a disk problem?
– Is there sufficient file cache?
» Add more memory to ensure sufficient cache
– Use NTFS and do regular maintenance
» Use RAID
– Is there a boot record problem?
» Use FixBoot or FixMBR from the recovery console
Network Troubleshooting Chart – 4

Is there a network problem?
– Use Network Monitor to identify top broadcasters
» Eliminate unnecessary broadcasts
– Use Network Monitor to identify top talkers
» Eliminate unnecessary network traffic
– Correct poor configuration
» Reorganize & upgrade network for more capacity
– Is there an address or name resolution problem?
» Examine ARP cache, WINS, DNS, and NBTStat
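
The "top talkers" idea from the chart amounts to totalling traffic per source address. The sketch below does this over a list of captured frames; the (source, size) tuples and the addresses are hypothetical and do not reflect Network Monitor's actual capture format.

```python
from collections import Counter


def top_talkers(frames, n=3):
    """Return the n source addresses that sent the most bytes."""
    totals = Counter()
    for source, size in frames:
        totals[source] += size
    return totals.most_common(n)


# Hypothetical capture: (source address, frame size in bytes).
frames = [
    ("10.0.0.5", 1500),
    ("10.0.0.9", 60),
    ("10.0.0.5", 1500),
    ("10.0.0.7", 400),
    ("10.0.0.9", 60),
]
print(top_talkers(frames, n=2))  # [('10.0.0.5', 3000), ('10.0.0.7', 400)]
```

Filtering the same list to broadcast destinations first would give the "top broadcasters" view in the same way.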
Optimisation…
All about improving the performance
of system resources…

A network manager should never
have “nothing to do…”
