Network Management Session 1 Network Basics
COMP2221
Networks in Organisations
Richard Henson
May 2013
Week 11 – Troubleshooting
& Optimisation
Learning Objectives:
– Explain the principles of troubleshooting as a
means of guarding against failure
– Use the various tools available on a named
operating system to identify potential faults
and problems
– Take appropriate action to stop a fault
becoming a failure
“A stitch in time saves nine”
Business - Worst Possible Scenario (1)
There is an interruption in the power
supply
– UPS is invoked
– the interruption continues…
– servers all have to be shut down
Power supply restored…
– but main domain controller doesn’t reboot
– no other domain controllers therefore
connect to it
– the domain tree fails
Business - Worst Possible Scenario (2)
Organisation cannot do business with the
network down…
– server can’t be persuaded to boot
– new main domain controller has to be
commissioned
– whole directory tree has to be rebuilt!!!
– word spreads very rapidly…
Business loses so much custom, trust, and
credibility that even when it starts doing
business again customers choose to go
elsewhere
– without a flourishing customer base… the
business folds
Analysis: This scenario shouldn’t
have occurred…
Unlikely that the server would fail to boot
without prior warning…
– warnings would have been presented…
– but were clearly not acted upon!
Disaster recovery plan!?!
– not formulated?
– not tested?
– not effective (in the event of a domain tree controller
failure…)
But it does…
Actual example (15th Feb 2010):
– root domain controller [on the network] had not
been backed up for 10 months, when it crashed
(well… at least it had been backed up at some
time…)
– http://searchwindowsserver.techtarget.com/generic/0,295582,sid68_gci1381567,00.html
The consultant called in to fix it reported that:
– “I had never seen a case where the forest
root domain had to be recovered -- and I
couldn't find anyone who had.”
Analysis: Who is to blame? (1)
In this example, the organisation said
they were following Microsoft guidelines
– they set up an empty root domain
– the root domain controller had a RAID-5
(best) disk configuration
This was true, to some extent…
– Microsoft did espouse this as best
practice… (in the year 2000!)
– guidelines had changed since then…
Analysis: Who is to blame? (2)
The disaster that struck was:
– two RAID drives failed on the same day!
– unlucky? possible to prepare for this?
The recovery process took about three weeks
– most of the time was spent studying logs, doing
the restore, etc.
In this case, the tree was still able to function
without a root domain
– business was able to continue
– customer base wasn’t compromised…
Fault Tolerance and Risk
Assessment
General “common sense” principle:
– always have a backup
– ESPECIALLY for the most important computer
on the network…
Q:
– How can you tell what needs backing up?
A:
– Risk Assessment and Risk Management
Why not Risk Management?
Time consuming!
However, without proper risk
management…
– how does the organisation know what
processes are most important to its
functioning?
– how can an organisation provide resources
to protect aspects of its network?
Risk Management and
Risk Assessment
Risk Assessment is an essential first step
– requires putting a “value” on assets
– more valuable… greater protection
Do information assets have value?
– organisations still failing to acknowledge that they
do…
– categorisation of information assets therefore
potentially problematic
– need to look at the consequence to the
organisation of losing that asset…
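One common way to make the "value" idea concrete is to score each asset by likelihood of loss multiplied by impact of loss, then rank the results. The sketch below is illustrative only: the asset names, the 1 to 5 scales, and the names `Asset` and `rank_assets` are assumptions, not taken from any particular risk standard.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (business-threatening)

    @property
    def risk_score(self) -> int:
        # Simple likelihood x impact product; real schemes may weight differently.
        return self.likelihood * self.impact


def rank_assets(assets):
    """Return assets ordered from highest to lowest risk score."""
    return sorted(assets, key=lambda a: a.risk_score, reverse=True)


# Hypothetical figures, for illustration only.
assets = [
    Asset("root domain controller", likelihood=2, impact=5),
    Asset("staff desktop PC", likelihood=3, impact=1),
    Asset("customer database", likelihood=2, impact=4),
]

for asset in rank_assets(assets):
    print(f"{asset.name}: {asset.risk_score}")
```

The assets at the top of the ranking are the ones that justify the most protection (backup, redundancy, monitoring).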
How do you back up a
Domain Controller?
The Windows “Backup” program works, and
can easily be scheduled
– but heavily criticised…
– even the 2008 server version…
Third Party products give more flexibility and
protection e.g. :
– Recovery Manager
» http://www.quest.com/recovery-manager-for-active-directory
– Backup Exec
» http://www.symantec.com/business/products/family.jsp?familyid=backupexec
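Whichever backup product is used, it is worth checking that the most recent backup really is recent. A minimal sketch, assuming the time of the last successful backup can be obtained from the backup tool's own report; the function name `backup_is_stale` and the one-day limit are illustrative assumptions.

```python
from datetime import datetime, timedelta


def backup_is_stale(last_backup, now, max_age=timedelta(days=1)):
    """Return True if the most recent backup is older than max_age."""
    return (now - last_backup) > max_age


# Hypothetical timestamps: a backup taken roughly ten months before a crash.
now = datetime(2010, 2, 15)
last_backup = datetime(2009, 4, 15)

if backup_is_stale(last_backup, now):
    print("WARNING: backup is out of date")
```

Run as a scheduled job, a check like this turns a silent gap in the backup schedule into a visible warning.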
Prevention is Better than Cure
A server shouldn’t crash unexpectedly!
– should be kept cool (environmental unit mustn’t
break down!)
– monitoring should show that unexpected things are
happening
– action can then (usually) be taken to take care of
the unexpected
Many tools available to:
– Check/monitor the system on a regular basis
– Provide stats to administrators
» could also be used for security purposes
– Generate alerts if something is starting to go
wrong…
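The "generate alerts" idea can be sketched as a comparison of current readings against thresholds. This is a simplified illustration, not how any particular monitoring product works; the names `find_alerts` and `disk_percent_used`, the threshold values, and the placeholder CPU figure are all assumptions.

```python
import shutil


def find_alerts(readings, thresholds):
    """Return a message for each reading that exceeds its threshold."""
    alerts = []
    for name, value in readings.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            alerts.append(f"{name} at {value:.0f}% (limit {limit}%)")
    return alerts


def disk_percent_used(path="."):
    """Percentage of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


# Thresholds are illustrative assumptions, not recommended values.
thresholds = {"disk": 90, "cpu": 85}
readings = {"disk": disk_percent_used(), "cpu": 40.0}  # cpu figure is a placeholder
for message in find_alerts(readings, thresholds):
    print("ALERT:", message)
```

The point is that a threshold set below the failure level gives the administrator time to act before a fault becomes a failure.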
Troubleshooting Tools for a Windows
Server: Task Manager
Applications tab:
– shows which applications are running
– enables changing of process priority
» use view/update speed
– used to
» open new applications
» shut rogue applications down
Task Manager (continued)
Processes tab:
– all system processes
– Memory usage of each
– % CPU time for each
– total CPU time since boot up
– also used to close a process down
» careful! (but you get a warning…)
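The same per-process information is available from the command line through `tasklist`, which makes it possible to script a regular check. The sketch below parses output in the shape produced by `tasklist /FO CSV /NH`; the sample data is made up, the function name `top_memory_processes` is hypothetical, and the "12,345 K" memory format assumes an English-locale system.

```python
import csv
import io


def top_memory_processes(tasklist_csv, count=3):
    """Parse `tasklist /FO CSV /NH` output; return (name, pid, kb), largest first."""
    rows = []
    for fields in csv.reader(io.StringIO(tasklist_csv)):
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        name, pid, mem = fields[0], int(fields[1]), fields[4]
        # Memory looks like "12,345 K" on an English-locale system (assumption).
        kb = int(mem.replace(",", "").replace("K", "").strip())
        rows.append((name, pid, kb))
    return sorted(rows, key=lambda r: r[2], reverse=True)[:count]


# Made-up sample in the same shape as real output.
sample = '''"System Idle Process","0","Services","0","24 K"
"explorer.exe","2316","Console","1","45,120 K"
"sqlservr.exe","1780","Services","0","512,004 K"
'''
for name, pid, kb in top_memory_processes(sample, count=2):
    print(f"{name} (PID {pid}): {kb} KB")
```

On a Windows machine the input text could be captured with `subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout` rather than a hard-coded sample.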
Task Manager (continued)
Performance tab:
– total no. of threads, processes, handles running
– Graph: % CPU usage
» User mode
» Kernel mode (optional: view menu)
» graph per CPU (optional: view menu)
– physical memory available/usage
– virtual (page file) memory available/usage
Event Viewer
Events recorded into “event log” files
– System log
– Auditing log (customisable)
– Application log
– customisable - additional files
New files recorded daily; old ones
archived
– time before archiving also customisable
Event Viewer
Three types of events recorded in log:
– Information
– Warning
– Error
More information on each event obtained by
double-clicking
– make note of event code
– heed and take action if necessary
Using Event Viewer
Wise to check all event logs regularly
– take time/trouble to find out what those
messages really mean…
Then take whatever action is needed:
– sort out potential problems now
– Make sure they don’t become real ones
later…
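A regular log check is easier to keep up if the warnings and errors are summarised automatically. The sketch below counts Warning and Error events in a log that has been exported from Event Viewer as CSV. The column names (`Level`, `Source`, `Event ID`) are an assumption about the export format and vary between Windows versions; the sample rows and the function name `summarise_problems` are made up for illustration.

```python
import csv
import io
from collections import Counter


def summarise_problems(exported_csv):
    """Count Warning and Error events by (source, event ID), most frequent first."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(exported_csv)):
        if row["Level"] in ("Warning", "Error"):
            counts[(row["Source"], row["Event ID"])] += 1
    return counts.most_common()


# Made-up sample; column names are an assumption about the export format.
sample = """Level,Date and Time,Source,Event ID
Information,15/02/2010 09:00:00,Service Control Manager,7036
Warning,15/02/2010 09:05:00,Disk,51
Error,15/02/2010 09:06:00,Disk,7
Warning,15/02/2010 09:10:00,Disk,51
"""

for (source, event_id), count in summarise_problems(sample):
    print(f"{source} event {event_id}: {count} occurrence(s)")
```

A repeated disk warning of this kind is exactly the sort of early signal that should prompt action before the drive fails.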
Auditing Further Events
Any “object” can be audited
Objects to audit, and processes
audited can be set through audit
(group) policy
– Using MMC & relevant snap-in
Types of process audited:
– access
– attempt to access
Security auditing
Same principles as general
auditing
Refers to “restricted” objects
Events appear in separate
security log
Event Management software
(SIEM)
Who’s going to look at all these log files?
– in practice, often no-one…
Solution – SIEM software to analyse and
present information from:
– network and security devices
– identity & access management applications
– vulnerability management/policy compliance tools
– os, database & application logs
– external threat data
http://www.focus.com/briefs/how-select-security-information-and-event-management-siem
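The core SIEM idea, combining events from several sources so that patterns invisible in any single log become visible, can be sketched in a few lines. This is a toy illustration only, not how any commercial SIEM product is implemented; the event fields (`time`, `user`, `type`), the sample data, and the function names are all assumptions.

```python
from collections import Counter
from datetime import datetime


def merge_timeline(*sources):
    """Combine event lists from several sources into one list ordered by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e["time"])


def repeated_failures(events, threshold=3):
    """Return users with at least `threshold` failed logons in `events`."""
    fails = Counter(e["user"] for e in events if e["type"] == "logon_failure")
    return sorted(user for user, n in fails.items() if n >= threshold)


# Hypothetical events from two different sources.
firewall = [
    {"time": datetime(2013, 5, 1, 9, 2), "user": "alice", "type": "logon_failure"},
    {"time": datetime(2013, 5, 1, 9, 0), "user": "bob", "type": "logon_success"},
]
server = [
    {"time": datetime(2013, 5, 1, 9, 1), "user": "alice", "type": "logon_failure"},
    {"time": datetime(2013, 5, 1, 9, 3), "user": "alice", "type": "logon_failure"},
]

timeline = merge_timeline(firewall, server)
print(repeated_failures(timeline))
```

In this sample neither source alone reaches the threshold of three failures for one user; only the merged timeline does.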
Performance Monitor
Not available on disk
To obtain and download Performance
Monitor Wizard (PerfWiz), visit the
following Web site:
– http://www.microsoft.com/downloads/details.aspx?FamilyID=31fccd98-c3a1-4644-9622-faa046d69214&displaylang=en
What if the machine
doesn’t boot…
Tools available:
– The boot error itself
» blue screen? driver software
» constant reboot? motherboard
– Last Known Good…
» Gives machine a chance to go back to the
previous (usually last but one)
configuration
What if the machine
doesn’t boot… (continued)
Safe Mode
– includes VGA Mode or boot
logging
– Debugging mode also available
» output difficult to decipher for non-experts
Recovery Console
– “DOS-type prompt” for performing
minor repairs
What if the machine
doesn’t boot… (continued)
System Configuration Utility
(Msconfig.exe)
– automates the routine troubleshooting
steps relating to Windows configuration
issues
– can be used to modify the system
configuration and troubleshoot the problem
using a process-of-elimination method
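The process-of-elimination method amounts to a halving search: enable half of the startup items, reboot, and see whether the fault is still there. Msconfig does not automate this; the administrator does it by hand. The sketch below shows only the logic, under the assumption that exactly one item is faulty and that it fails regardless of what else is enabled. The function `find_culprit` and the item names are hypothetical, and `problem_occurs` stands in for a reboot-and-observe step.

```python
def find_culprit(items, problem_occurs):
    """Halve the candidate list until one faulty item remains.

    `problem_occurs(enabled)` stands in for rebooting with only the
    `enabled` items switched on and observing whether the fault appears.
    Assumes exactly one faulty item that fails regardless of the others.
    """
    candidates = list(items)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if problem_occurs(half):
            candidates = half                     # fault is in the enabled half
        else:
            candidates = candidates[len(half):]   # fault is in the other half
    return candidates[0]


# Hypothetical startup items; "driver_d" is the faulty one in this example.
startup_items = ["service_a", "service_b", "updater_c", "driver_d", "tray_e"]
print(find_culprit(startup_items, lambda enabled: "driver_d" in enabled))
```

With this approach the number of reboots grows with the logarithm of the number of items, rather than one reboot per item.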
What if the machine
doesn’t boot… (continued)
Emergency Repair Disk (ERD)
– reboot machine using different media
» e.g. floppy disk (yes… still possible)
– media should be generated BEFORE it
needs to be used!
– option to create the ERD during the setup
process…
What if the machine
doesn’t boot… (continued)
Full restore
– assumes a full backup has already been
made
– still have to:
» reformat hard disk from scratch…
» and then restore the backup files using
backup/restore option….
– but better than losing all your data!
Optimisation…
All about improving the performance
of system resources…
A network manager should never
have “nothing to do…”