Network Management Session 1 Network Basics
COMP3122
Network Management
Richard Henson
April 2012
Week 11 – Troubleshooting
& Optimisation
Learning Objectives:
– Explain the principles of troubleshooting as a
means of mitigating failure
– Use the various tools available on a named
operating system to identify potential faults
and problems
– Take appropriate action to stop a fault
becoming a failure
“A stitch in time saves nine”
Business - Worst Possible Scenario (1)
There is an interruption in the power
supply
– UPS is invoked
– the interruption continues…
– servers all have to be shut down
Power supply restored…
– but main domain controller doesn’t reboot
– no other domain controllers therefore
connect to it
– the domain tree fails
Business - Worst Possible Scenario (2)
Organisation cannot do business with the
network down…
– server can’t be persuaded to boot
– new main domain controller has to be
commissioned
– whole directory tree has to be rebuilt!!!
– word spreads very rapidly…
Business loses so much custom, trust, and
credibility that even when it starts doing
business again customers choose to go
elsewhere
– without a flourishing customer base… the
business folds
Analysis: This scenario shouldn’t
have occurred…
Unlikely that the server would fail to boot
without prior warning…
– warnings would have been presented…
– but were clearly not acted upon!
Disaster recovery plan!?!
– not formulated?
– not tested?
– not effective (in the event of a domain tree controller
failure…)
But it does…
Actual example (15th Feb 2010):
– root domain controller [on the network] had not
been backed up for 10 months, when it crashed
(well… at least it had been backed up at some
time…)
– http://searchwindowsserver.techtarget.com/generic/0,295582,sid68_gci1381567,00.html
The consultant called in to fix it reported that:
– “I had never seen a case where the forest
root domain had to be recovered -- and I
couldn't find anyone who had.”
Analysis: Who is to blame? (1)
In this example, the organisation said
they were following Microsoft guidelines
– they set up an empty root domain
– the root domain controller had a RAID-5
disk configuration
This was true, to some extent
– Microsoft did espouse this as best
practice… in the year 2000!
– guidelines had changed since then…
Analysis: Who is to blame? (2)
The disaster that struck was:
– two RAID drives failed on the same day!
– unlucky? possible to prepare for this?
The recovery process took about three weeks
– most of the time was spent studying logs, doing
the restore, etc.
In this case, the tree was still able to function
without a root domain
– business was able to continue
– customer base wasn’t compromised…
Fault Tolerance and Risk
Assessment
General “common sense” principle:
– always have a backup
– ESPECIALLY for the most important computer
on the network…
Q:
– How can you tell what needs backing up?
A:
– Risk Assessment and Risk Management
Why not Risk Management?
Time consuming!
However, without proper risk
management…
– how does the organisation know what
processes are most important to its
functioning?
– how can an organisation provide resources
to protect aspects of its network?
Risk Management and
Risk Assessment
Risk Assessment is an essential first step
– requires putting a “value” on assets
– more valuable… greater protection
Do information assets have value?
– organisations still failing to acknowledge that they
do…
– categorisation of information assets therefore
potentially problematic
– need to look at the consequence to the
organisation of losing that asset…
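A minimal sketch of how "more valuable… greater protection" could be turned into a ranking. The impact × likelihood score is an assumed scheme (the slides do not prescribe one), and the asset names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    impact: int      # assumed scale: consequence of loss, 1 (minor) to 5 (critical)
    likelihood: int  # assumed scale: chance of loss, 1 (rare) to 5 (frequent)

    @property
    def risk(self) -> int:
        # Simple impact x likelihood score; other weightings are possible
        return self.impact * self.likelihood


def rank_by_risk(assets):
    """Return assets ordered from highest to lowest risk score."""
    return sorted(assets, key=lambda a: a.risk, reverse=True)


# Hypothetical figures, for illustration only
assets = [
    Asset("root domain controller", impact=5, likelihood=2),
    Asset("file server", impact=4, likelihood=3),
    Asset("print server", impact=2, likelihood=3),
]

ranked = rank_by_risk(assets)
for asset in ranked:
    print(f"{asset.name}: risk score {asset.risk}")
```

The assets at the top of the list are the first candidates for backup and fault tolerance.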
How do you back up a
Domain Controller?
The Windows “Backup” program works, and
can easily be scheduled
– but heavily criticised…
– even the 2008 server version…
Third Party products give more flexibility and
protection e.g. :
– Recovery Manager
» http://www.quest.com/recovery-manager-for-active-directory
– Backup Exec
» http://www.symantec.com/business/products/family.jsp?familyid=backupexec
Prevention is Better than Cure
A server shouldn’t crash unexpectedly!
– should be kept cool (environmental unit mustn’t
break down!)
– monitoring should show that unexpected things are
happening
– action can then (usually) be taken to take care of
the unexpected
Many tools available to:
– Check/monitor the system on a regular basis
– Provide statistics to administrators
» could also be used for security purposes
– Generate alerts if something is starting to go
wrong…
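A rough sketch of the "check regularly, then alert" idea, using disk space as the monitored value. The 80% and 95% thresholds and the function names are assumptions chosen for illustration, not values from any particular tool.

```python
import shutil


def classify(fraction_used, warn_at=0.80, fail_at=0.95):
    """Map a usage fraction (0.0 to 1.0) to an alert level."""
    if fraction_used >= fail_at:
        return "ERROR"
    if fraction_used >= warn_at:
        return "WARNING"
    return "OK"


def check_disk(path="."):
    """Return (alert level, fraction used) for the volume holding `path`."""
    usage = shutil.disk_usage(path)
    fraction = usage.used / usage.total if usage.total else 0.0
    return classify(fraction), fraction


level, fraction = check_disk(".")
print(f"disk usage {fraction:.0%}: {level}")
```

Run on a schedule, a check like this gives a warning while there is still time to act, before the fault becomes a failure.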
Troubleshooting Tools for a Windows
Server: Task Manager
Applications tab:
– shows which applications are running
– enables changing of process priority
» use view/update speed
– can be used to
» open new applications
» shut rogue applications down
Task Manager (continued)
Processes tab:
– all system processes
– Memory usage of each
– % CPU time for each
– total CPU time since boot up
– also used to close a process down
» careful! (but you get a warning…)
Task Manager (continued)
Performance tab:
– total no. of threads, processes, handles running
– Graph: % CPU usage
» User mode
» Kernel mode (optional: view menu)
» graph per CPU (optional: view menu)
– physical memory (RAM) available/usage
– virtual memory (page file) available/usage
Event Viewer
Events recorded into “event log” files
– System log
– Auditing log (customisable)
– Application log
– customisable - additional files
New files recorded daily; old ones
archived
– time before archiving also customisable
Event Viewer
Three types of events recorded in log:
– Information
– Warning
– Error
More information on each event obtained by
double-clicking
– make note of event code
– heed and take action if necessary
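A small sketch of the same triage done in code: pick out Warning and Error events and count how often each event code recurs. The records are hypothetical (made-up sources and codes), shaped like rows exported from an event log; this does not read the real Windows event log.

```python
from collections import Counter

ATTENTION_TYPES = {"Warning", "Error"}


def needs_attention(events):
    """Keep only the events an administrator should look at."""
    return [e for e in events if e["type"] in ATTENTION_TYPES]


def recurring_codes(events):
    """Count occurrences of each (source, event code) pair."""
    return Counter((e["source"], e["event_id"]) for e in events)


# Hypothetical records, for illustration only
events = [
    {"log": "System", "type": "Information", "source": "svc-a", "event_id": 1001},
    {"log": "System", "type": "Warning", "source": "disk-a", "event_id": 2001},
    {"log": "System", "type": "Warning", "source": "disk-a", "event_id": 2001},
    {"log": "Application", "type": "Error", "source": "backup-a", "event_id": 3001},
]

flagged = needs_attention(events)
for (source, code), count in recurring_codes(flagged).most_common():
    print(f"{source} event {code}: seen {count} time(s)")
```

A code that keeps recurring is usually the first one worth looking up.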
Using Event Viewer
Wise to check all event logs regularly
– take time/trouble to find out what those
messages really mean…
Take whatever action is needed
– sort out potential problems now
– make sure they don’t become real ones
later…
Auditing Further Events
Any “object” can be audited
Objects to audit, and processes
audited can be set through audit
(group) policy
– Using MMC & relevant snap-in
Types of process audited:
– access
– attempt to access
Security auditing
Same principles as general
auditing
Refers to “restricted” objects
Events appear in separate
security log
Event Management software
(SIEM)
Who’s going to look at all these log files?
– in practice, often no-one..
Solution – SIEM software to analyse and
present information from:
– network and security devices
– identity & access management applications
– vulnerability management/policy compliance tools
– OS, database & application logs
– external threat data
http://www.focus.com/briefs/how-select-security-information-and-event-management-siem
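A toy sketch of the core idea behind SIEM: bring events from several sources into one time-ordered view, then filter by severity. Real products do far more (correlation, normalisation, threat feeds); the sources and events here are hypothetical.

```python
import heapq
from datetime import datetime

SEVERITY = {"Information": 0, "Warning": 1, "Error": 2}


def merge_timeline(*sources):
    """Merge event lists (each already sorted by time) into one timeline."""
    return list(heapq.merge(*sources, key=lambda event: event["time"]))


def at_or_above(events, minimum="Warning"):
    """Keep events whose type is at least as severe as `minimum`."""
    return [e for e in events if SEVERITY[e["type"]] >= SEVERITY[minimum]]


# Hypothetical events from two sources, for illustration only
firewall = [
    {"time": datetime(2012, 4, 2, 9, 0), "source": "firewall",
     "type": "Warning", "message": "repeated denied connections"},
    {"time": datetime(2012, 4, 2, 9, 7), "source": "firewall",
     "type": "Information", "message": "rule set reloaded"},
]
server = [
    {"time": datetime(2012, 4, 2, 9, 3), "source": "server",
     "type": "Error", "message": "logon failure"},
]

timeline = merge_timeline(firewall, server)
for event in at_or_above(timeline):
    print(event["time"], event["source"], event["type"], event["message"])
```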
Other Troubleshooting
Resources
NT Diagnostics (winmsd.exe)
– hardware & operating system data from registry
Performance Monitor
– Can monitor many aspects of system performance
– Either display current data graphically, in real-time
– or log data at regular intervals to get a longer term
picture
– Useful role in system optimisation
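A minimal sketch of logging at regular intervals, the second mode described above. It is not Performance Monitor itself: `log_samples` and the free-disk-space sampler are assumptions chosen because they run anywhere with the standard library.

```python
import csv
import io
import shutil
import time
from datetime import datetime


def log_samples(sampler, out, count, interval_s):
    """Call `sampler` `count` times, `interval_s` apart, writing CSV rows."""
    writer = csv.writer(out)
    writer.writerow(["timestamp", "value"])
    for i in range(count):
        writer.writerow([datetime.now().isoformat(timespec="seconds"), sampler()])
        if i < count - 1:
            time.sleep(interval_s)


# Example: three samples of free disk space, 0.1 s apart, into memory
buffer = io.StringIO()
log_samples(lambda: shutil.disk_usage(".").free, buffer, count=3, interval_s=0.1)
print(buffer.getvalue())
```

In practice the interval would be minutes rather than fractions of a second, and the output a file, so that trends show up over days or weeks.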
Other Troubleshooting
Resources
Network Monitor
– captures, filters, or analyses frames or packets
sent over the network
Alerts
– notify administrator when a particular threshold
value has been reached
System Recovery
– if a fatal error occurs:
» a dump of system memory is made, and can be used for
identifying the cause of the problem
» alerts are sent to users
» system is restarted automatically
Performance Monitor
Wizard available for Windows 2003 Server, but not
supplied on the installation disk
To obtain and download Performance
Monitor Wizard (PerfWiz), visit the
following Web site:
– http://www.microsoft.com/downloads/details.aspx?FamilyID=31fccd98-c3a1-4644-9622-faa046d69214&displaylang=en
What if the machine
doesn’t boot…
Tools available:
– The boot error itself
» blue screen? suspect driver software
» constant reboot? suspect the motherboard
– Last Known Good…
» Gives machine a chance to go back to the
previous (usually last but one)
configuration
What if the machine
doesn’t boot… (continued)
Safe Mode
– includes VGA Mode or boot
logging
– Debugging mode also available
» output difficult to decipher for non-experts
Recovery Console
– “DOS-type prompt” for performing
minor repairs
What if the machine
doesn’t boot… (continued)
System Configuration Utility
(Msconfig.exe)
– automates the routine troubleshooting
steps relating to Windows configuration
issues
– can be used to modify the system
configuration and troubleshoot the problem
using a process-of-elimination method
What if the machine
doesn’t boot… (continued)
Emergency Repair Disk (ERD)
– reboot machine using different media
» e.g. floppy disk
– media should be generated BEFORE it
needs to be used!
– option to create the ERD during the set
up process…
What if the machine
doesn’t boot… (continued)
Full restore
– assumes a full backup has already been
made
– still have to:
» reformat hard disk from scratch…
» and then restore the backup files using
backup/restore option….
– but better than losing all your data!
Network Troubleshooting Chart – 1
URL: http://teamapproach.ca/trouble
Identify the problematic network node
– Use commands such as PING & TraceRt
Is there a problem with one of the network protocols?
– Isolate the problem to a protocol layer and fix it
Is there a memory problem?
– Is there a memory leak?
» Fix or eliminate the software with the memory leak
– Is there sufficient memory?
» Add more memory
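The first step of the chart, finding the problematic node with PING, can be sketched as a script that pings a list of nodes in order. The function names are hypothetical; the only assumption about the `ping` command is that the echo count is set with `-n` on Windows and `-c` on Unix-like systems.

```python
import platform
import subprocess


def ping_command(host, count=2):
    """Build the ping command line for the current operating system."""
    # Windows ping takes -n for the echo count; Unix-like ping takes -c
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def is_reachable(host, count=2, timeout_s=10):
    """Return True if ping exits successfully, False otherwise."""
    try:
        result = subprocess.run(
            ping_command(host, count),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def first_unreachable(hosts):
    """Return the first host that does not answer, or None if all do."""
    for host in hosts:
        if not is_reachable(host):
            return host
    return None
```

Calling `first_unreachable` with addresses ordered from the nearest node (e.g. the default gateway) to the farthest suggests where the path breaks. Treat the result as a hint only: some nodes are configured not to answer ping.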
Network Troubleshooting Chart – 2
Does the system freeze?
– Investigate priority and device driver problems
Is there high processor utilization?
– Is it caused by hardware or software?
» hardware: can an upgraded device driver fix the problem?
» Provide adequate processor resources
» Upgrade your hardware to offload the processor
Network Troubleshooting Chart – 3
Is there a disk problem?
– Is there sufficient file cache?
» Add more memory to ensure sufficient cache
– Use NTFS and do regular maintenance
– Use RAID
Is there a boot record problem?
– Use FixBoot or FixMBR from the recovery console
Network Troubleshooting Chart – 4
Is there a network problem?
– Use Network Monitor to identify top broadcasters
» Eliminate unnecessary broadcasts
– Use Network Monitor to identify top talkers
» Eliminate unnecessary network traffic
– Correct poor configuration
– Reorganize & upgrade network for more capacity
Is there an address or name resolution problem?
– Examine ARP cache, WINS, DNS, and NBTstat
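A quick sketch of checking name resolution from a script, as a complement to examining DNS by hand. It asks the operating system's resolver, so it reflects whatever mix of hosts file, DNS and other lookups the machine is configured to use; the `resolve` helper is a hypothetical name.

```python
import socket


def resolve(name):
    """Return the sorted list of addresses a name resolves to ([] on failure)."""
    try:
        infos = socket.getaddrinfo(name, None)
    except (socket.gaierror, UnicodeError):
        return []
    # Each entry's sockaddr starts with the address string
    return sorted({info[4][0] for info in infos})


for name in ["localhost", "127.0.0.1"]:
    addresses = resolve(name)
    print(f"{name}: {addresses if addresses else 'no resolution'}")
```

An empty result for a name that should exist points towards a name resolution problem rather than a connectivity one.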
Optimisation…
All about improving the performance
of system resources…
A network manager should never
have “nothing to do…”