Performance Management (Best Practices)
REF:www.cisco.com
Document ID 15115
Introduction
• Performance Management involves
optimization of network response time and
management of consistency and quality of
individual and overall network services
• The most important task is to measure
user/application response time.
• For most users, response time is the critical
performance success factor.
Background (1)
• Performance problems often correlate with
capacity of resources (CPU, RAM, Bandwidth).
– In networks, this is typically bandwidth and data
that must wait in queues before it can be
transmitted through the network.
– In voice applications, this wait time almost
certainly impacts users because factors such as
delay and jitter affect the quality of the voice call.
Performance management issues
• User performance
• Application performance
• Capacity planning
• Proactive fault management
• It is important to note that with newer applications
like video and voice, performance management is the
key to success
Performance management process
flow (1)
Develop a network management
concept of operation
Measure Performance
Perform a Proactive Fault Analysis
Performance management process
flow (2)
1. Develop a network management concept of
operation
– Define the required features : Services, Scalability
objectives
– Define availability and network management
objectives
– Define performance SLAs and Metrics
– Define SLA
Performance management process
flow (3)
2. Measure Performance
– Gather network baseline data
– Measure availability
– Measure response time
– Measure accuracy
– Measure utilization
– Capacity planning
Performance management process
flow (4)
3. Perform a proactive fault analysis
– Use thresholds for proactive fault management
– Network management implementation
– Network operation metrics
Develop a network management
concept of operation
• The purpose of the concept of operations document
is to describe the overall desired system
characteristics from an operational standpoint
• Its focus is to shape the long-range operational
planning activities for network management and
operation.
• It also provides guidance for the development
of all subsequent definition documentation,
such as service level agreements.
Define the required features:
Services, Scalability Objectives
• Define service objectives:
– Describe what the network and its services are
expected to deliver
– This step requires that you understand applications,
basic traffic flows, user and site counts, and required
network services.
• Define scalability objectives:
– Help network engineers design networks that meet
future growth requirements and do not experience
resource constraints (media capacity, number of routes,
etc.)
Define availability and network
management Objectives (1)
• Availability objectives define the level of
service needed (service level requirements)
• This helps to ensure the solution meets end
availability requirements
• It might lead to
– Categorizing different classes of service for each
availability requirement
– Increased redundancy and support procedures where
a higher availability objective requires them
Define availability and network
management objectives (2)
• Define manageability objectives to ensure that
overall network management does not lack
management functionality
• It might lead to
– Understanding the processes and tools used by the
organization
– Uncovering all important MIB or network tool
information required to support a potential network
– Identifying the training required to support the new
network service
Define performance SLAs and Metrics
• Performance SLAs and metrics help define and
measure the performance of new network
solutions to ensure they meet performance
requirements.
• The performance SLAs should include the
average expected volume of traffic, peak
volume of traffic, average response time and
maximum response time allowed
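As a rough illustration of how these four SLA figures might be compared with measurements, here is a minimal Python sketch. The class `PerformanceSla`, the function `sla_deviations`, the units (Mbps, milliseconds) and the sample numbers are all hypothetical and are not taken from the source document.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PerformanceSla:
    """Hypothetical container for the four SLA figures listed above."""
    avg_traffic_mbps: float
    peak_traffic_mbps: float
    avg_response_ms: float
    max_response_ms: float


def sla_deviations(sla, traffic_mbps, response_ms):
    """Return a list of deviations from the SLA (empty if all figures hold)."""
    deviations = []
    if mean(traffic_mbps) > sla.avg_traffic_mbps:
        deviations.append("average traffic above expected volume")
    if max(traffic_mbps) > sla.peak_traffic_mbps:
        deviations.append("peak traffic above expected peak")
    if mean(response_ms) > sla.avg_response_ms:
        deviations.append("average response time too high")
    if max(response_ms) > sla.max_response_ms:
        deviations.append("maximum response time exceeded")
    return deviations


# Example with made-up samples collected over one reporting period.
sla = PerformanceSla(avg_traffic_mbps=40, peak_traffic_mbps=80,
                     avg_response_ms=100, max_response_ms=250)
print(sla_deviations(sla, [30, 40, 35], [80, 90, 300]))
```

Traffic above the expected volume is reported here as a deviation rather than a failure, since it may simply mean the SLA's load assumptions need revisiting.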
Define SLAs (1)
• SLA (Service Level Agreement): Customer
(Enterprise); SLM (Service Level Management): Provider
• SLAs include definitions for problem types and
severity, and for help desk responsibilities
– Escalation path, time before escalation at each support
tier level
– Time to start work on the problem
– Target time to close, based on priority
– Services to provide in the areas of capacity planning and
hardware replacement
Measure Performance
• Gather Network Baseline data
– Perform a baseline of the network before and
after a new solution deployment
– A typical router/switch baseline report includes
capacity issues related to CPU, memory, buffer,
link/media utilization, throughput
– Application baseline: bandwidth used by app per
time period
Measure availability
• Availability is the measure of time for
which a network system or application is
available to a user
– Coordinate the help desk phone calls with the
statistics collected from managed devices
– Check scheduled outages
– Etc
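The availability measure can be expressed as a percentage. The sketch below is one possible way to compute it; the function name is hypothetical, and the choice to exclude scheduled outages from the measurement window (rather than count them as downtime) is an assumption, not something the source states.

```python
def availability_percent(period_s, unscheduled_down_s, scheduled_down_s=0.0):
    """Availability as a percentage of the time the service was expected to be up.

    Assumption (not from the source): scheduled outages are removed from the
    measurement window instead of being counted as downtime.
    """
    expected_up_s = period_s - scheduled_down_s
    if expected_up_s <= 0:
        raise ValueError("measurement period must exceed scheduled downtime")
    return 100.0 * (expected_up_s - unscheduled_down_s) / expected_up_s


# 30-day month, 4 hours of scheduled maintenance, 43 minutes of real outage.
month_s = 30 * 24 * 3600
print(round(availability_percent(month_s, 43 * 60, 4 * 3600), 3))
```

Outage durations would in practice come from correlating help desk records with device statistics, as the bullets above suggest.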
Measure Response Time
• Network response time is the time traffic requires to
travel between two points
• Simple level: pings from the network management
station to key points in the network (not accurate)
• Server-centric polling: SAA (Service Assurance Agent)
on a Cisco router to measure response time to a
destination device
• Generate traffic that resembles the particular
application or technology of interest
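To illustrate the last point, the sketch below times an arbitrary probe that stands in for application-like traffic. It is a generic timing wrapper, not SAA and not a ping implementation; `measure_response_ms` and the `probe` callable are hypothetical names, and what the probe actually sends (an HTTP request, a database query, and so on) is left to the caller.

```python
import time


def measure_response_ms(probe, samples=5):
    """Run `probe` several times and return min/avg/max elapsed milliseconds.

    `probe` is any zero-argument callable that performs one request
    resembling the application of interest (an assumption of this sketch).
    """
    if samples < 1:
        raise ValueError("samples must be at least 1")
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        probe()
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "min": min(timings),
        "avg": sum(timings) / len(timings),
        "max": max(timings),
    }


# Stand-in probe: sleeps briefly instead of contacting a real server.
print(measure_response_ms(lambda: time.sleep(0.01), samples=3))
```

Reporting the minimum and maximum alongside the average gives a first impression of jitter as well as delay.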
Measure accuracy
• Accuracy is the measure of interface traffic
that does not result in errors and can be
expressed as a percentage
• Accuracy = 100 – error rate
• Error rate = ifInErrors * 100 / (ifInUcastPkts +
ifInNUcastPkts)
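The two formulas above translate directly into code. In this minimal sketch the function names are hypothetical, and the arguments are assumed to be the changes in each SNMP counter between two polls rather than raw cumulative values; the source does not spell this out.

```python
def error_rate_percent(in_errors, in_ucast_pkts, in_nucast_pkts):
    """Error rate = ifInErrors * 100 / (ifInUcastPkts + ifInNUcastPkts).

    Arguments are assumed to be counter deltas over one polling interval.
    """
    total_pkts = in_ucast_pkts + in_nucast_pkts
    if total_pkts == 0:
        return 0.0  # no traffic in the interval, so nothing to be in error
    return in_errors * 100.0 / total_pkts


def accuracy_percent(in_errors, in_ucast_pkts, in_nucast_pkts):
    """Accuracy = 100 - error rate."""
    return 100.0 - error_rate_percent(in_errors, in_ucast_pkts, in_nucast_pkts)


# 5 errored packets out of 1000 received in the interval.
print(accuracy_percent(5, 900, 100))  # 99.5
```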
Measure Utilization (1)
• Utilization measures the use of a particular
resource over time
• It is expressed as a percentage that compares the
usage of a resource with its maximum operational
capacity
• High utilization is not necessarily bad
• A sudden jump in utilization can indicate an
abnormal condition
Measure Utilization (2)
• Input utilization =
ifInOctets * 8 * 100 / ((time in seconds) * ifSpeed)
• Output utilization =
ifOutOctets * 8 * 100 / ((time in seconds) * ifSpeed)
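A minimal sketch of the utilization formula follows. It assumes the octet figure is the difference between two polls of a 32-bit counter, so a single wrap between polls is tolerated, and that `ifSpeed` is in bits per second. The function names are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Difference between two counter samples, allowing for one wrap."""
    return (current - previous) % modulus


def utilization_percent(octets_prev, octets_now, seconds, if_speed_bps):
    """Octet delta * 8 * 100 / (seconds * ifSpeed), as a percentage."""
    if seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    octets = counter_delta(octets_prev, octets_now)
    return octets * 8 * 100.0 / (seconds * if_speed_bps)


# 3,750,000 octets in 5 minutes on a 10 Mbit/s interface.
print(utilization_percent(0, 3_750_000, 300, 10_000_000))  # 1.0
```

The same function serves for input or output utilization, depending on whether it is fed ifInOctets or ifOutOctets samples.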
Capacity planning
• The following are potential areas for concern:
– CPU
– Backplane or I/O
– Memory
– Interface and pipe sizes
– Queuing, latency and jitter
– Speed and distance
– Application characteristics
Perform a Proactive fault analysis
• One method to perform fault management is
through the use of the RMON alarm and event
groups
• A distributed management system enables
polling at a local level with aggregation of data
at a manager of managers
Use threshold for proactive fault
management (1/2)
• A threshold is a point of interest in a specific
data stream; an event is generated when the
threshold is crossed
• 2 classes of threshold for numeric data
– Continuous thresholds apply to continuous or time
series data, such as data stored in SNMP counters or
gauges
– Discrete thresholds apply to enumerated objects or
discrete numeric data, such as Boolean objects
Use threshold for proactive fault
management (2/2)
• 2 different forms of continuous threshold
– Absolute: use with gauges
– Relative (delta): use with counters
• Steps to determine thresholds
– 1. Select the objects
– 2. Select the devices and interfaces
– 3. Determine the threshold values for each object or
interface
– 4. Determine the severity for the event generated by
each threshold
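The four steps above, together with the absolute/relative distinction, can be sketched as follows. This is an illustration only, not RMON itself: the `Threshold` class, the `check` function, the object and interface labels, and the severity strings are all hypothetical, and counter wrap is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    object_name: str   # step 1: the object being watched
    interface: str     # step 2: the device or interface it applies to
    value: float       # step 3: the threshold value
    severity: str      # step 4: severity of the generated event
    relative: bool     # True for counters (delta), False for gauges (absolute)


def check(threshold, current, previous=None):
    """Return an event dict if the threshold is crossed, otherwise None."""
    if threshold.relative:
        if previous is None:
            return None  # a delta needs two samples
        observed = current - previous
    else:
        observed = current
    if observed < threshold.value:
        return None
    return {
        "object": threshold.object_name,
        "interface": threshold.interface,
        "observed": observed,
        "severity": threshold.severity,
    }


errors = Threshold("ifInErrors", "router1:Gi0/1", 100, "major", relative=True)
cpu = Threshold("cpu_busy_percent", "router1", 90, "critical", relative=False)
print(check(errors, current=1250, previous=1100))
print(check(cpu, current=95))
```

A discrete threshold would compare an enumerated value against a set of alarm states instead of using a numeric comparison.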
Network management implementation
• The organization should have an implemented
network management system.
• SNMP/RMON or other network management
system tools
Network operation metrics (1/2)
• Number of problems that occur, by call priority
• Minimum, maximum and average time to close
in each priority
• Breakdown of problems by problem type
(hardware, software crash, configuration,
power, user error)
Network operation metrics (2/2)
• Breakdown of time to close for each problem
type
• Availability by availability group or SLA
• How often you met or missed SLA
requirements
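Several of these operation metrics are simple aggregations over trouble tickets. The sketch below computes the count and the minimum, maximum and average time to close per priority; the function name, the `(priority, hours_to_close)` ticket layout and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def time_to_close_by_priority(tickets):
    """Summarize (priority, hours_to_close) pairs per priority level."""
    grouped = defaultdict(list)
    for priority, hours in tickets:
        grouped[priority].append(hours)
    return {
        priority: {
            "count": len(hours),
            "min": min(hours),
            "max": max(hours),
            "avg": mean(hours),
        }
        for priority, hours in grouped.items()
    }


tickets = [("P1", 2), ("P1", 4), ("P2", 10)]
for priority, stats in sorted(time_to_close_by_priority(tickets).items()):
    print(priority, stats)
```

Grouping by problem type instead of priority would give the per-type breakdown with the same code.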