Building a NOC

How to Build a NOC
• Identify Customers
– Who are your customers?
• Understand Customer Expectations
– What are your user expectations?
– SLAs?
• Support Service Offerings
– Besides networking, what other services are
being offered?
• Monitor and Troubleshoot Service Issues
– How large and complex is your network?
– What level troubleshooting and/or monitoring
will your NOC do?
– How will you communicate outages and
planned work to customers?
• Determine Appropriate Staffing Levels
– What will be the service hours?
  • Not all NOCs need to be 7x24x365
  • What about holidays? Weekends? On-call?
– Do your SLAs require in-person staffing?
– Do you have after hours service response
requirements?
– What level of staff needs to be present, and
when?
– What will be the means of responding to
issues when NOC is not staffed 24x7?
• Organizational Structure
– What staffing tiers/hierarchy will you have for
support? Techs? Leads? NEs?
– What will be your escalation practices and
policies?
– To what group does the NOC report?
– What other groups report there, and what is
your organizational relationship to other key
groups?
– Who will write and update procedures, training
manuals, etc.?
• Location and Design of NOC Facility
– How much space do you need?
– What is your facility like?
– How do you want to arrange your
staff? Separate offices? "War room"?
• NOC Funding
– How will your organization be funded?
• NOC Tools
– How will you track customer
information? (Database needs, CRM?)
– How will you monitor and
troubleshoot? Tools, specifically.
– Are you writing any of your own tools?
– Who will maintain your applications?
– How will you track trouble tickets?
• Reporting
– What reports will you issue?
– How will you measure the data?
• What factors may determine operational changes for your
organization? New services, expanded hours,
increased number of customers, new
equipment types, deeper skill level?
Case Study
Building a NOC
Karibu Telecoms
* Our customers are in Iringa, Tanzania.
*We track customer information in a database
designed and maintained locally.
*Demand and Service have driven our need for
7x24.
* Escalation policies also drive our need for on-call schedules and on-site personnel.
* Our customers expect 48 hours notice prior to
work, unless it is an emergency.
* Outages are communicated via an application
we have built locally.
* We also want to know when our customers
have planned work.
*Expectations are included in the contract and
located on the Karibu web page.
*http://www.kaributelecoms.co.tz/noc
* We prefer to have the NOC as the primary
customer contact point for our organization in
order to maintain quality of service and quality
of experience when a ticket moves between
groups.
*In addition to network monitoring for
Karibu Telecoms, our NOC monitors:
*Connectivity
* Approximately 5000 sites throughout Iringa
* Karibu Telecoms troubleshoots with the
customer on routing problems, latency, and
loss of connectivity.
* Node - Node and IGP status is monitored.
* We manage DWDM, SONET, and MPLS
circuits.
* Complexity is increased with escalation
paths being different depending on what
isn’t working.
* Outages and planned events are sent via SMS and email
announcement in a standard format.
* We include the date/time of the work or when the outage
began.
* If the customer’s connectivity is entirely down, we also call
them.
* Updates are sent at predefined intervals for large events, or
when we have a change in status.
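
The notification practice above (one standard format, SMS and email always, a phone call on total loss of connectivity, 48 hours notice for non-emergency work) could be sketched roughly as follows. This is only an illustration, not Karibu's locally built application: the `Notice` fields, the message layout, and the function names are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

NOTICE_WINDOW = timedelta(hours=48)  # customers expect 48 hours notice before planned work


@dataclass
class Notice:
    kind: str                 # hypothetical labels, e.g. "OUTAGE" or "PLANNED WORK"
    site: str                 # affected customer site
    start: datetime           # date/time of the work, or when the outage began
    summary: str              # one-line description
    total_loss: bool = False  # True if the customer's connectivity is entirely down


def format_notice(n: Notice) -> str:
    """Render one fixed layout (an assumed one) used for both SMS and email."""
    return (
        f"[{n.kind}] {n.site}\n"
        f"Start: {n.start:%Y-%m-%d %H:%M}\n"
        f"Summary: {n.summary}"
    )


def channels(n: Notice) -> List[str]:
    """SMS and email always; add a phone call when connectivity is entirely down."""
    result = ["sms", "email"]
    if n.total_loss:
        result.append("phone")
    return result


def meets_notice_window(announced_at: datetime, work_start: datetime,
                        emergency: bool = False) -> bool:
    """Planned work needs 48 hours notice unless it is an emergency."""
    return emergency or (work_start - announced_at) >= NOTICE_WINDOW
```

Keeping the layout in a single function means every channel carries the same fields in the same order, which is the point of a standard format.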
*We are 7x24 with full-time staff.
* On weekends, only one person covers
each day, so vacations and sick time are
problems.
*Holidays are covered by on-site and on-call
staff.
*On-call consists of a 7-day period and rotates
among all NOC staff on a regular schedule.
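
A 7-day on-call period that rotates among all NOC staff can be generated with a simple round-robin. The sketch below assumes a fixed staff list and a known first day of the rotation. The function names and staff names are hypothetical, and holiday or vacation swaps are not handled.

```python
from datetime import date, timedelta
from typing import List, Tuple


def on_call_schedule(staff: List[str], first_day: date,
                     weeks: int) -> List[Tuple[date, date, str]]:
    """Return (start, end, person) tuples, one 7-day on-call period per entry."""
    if not staff:
        raise ValueError("staff list must not be empty")
    schedule = []
    for week in range(weeks):
        start = first_day + timedelta(days=7 * week)
        end = start + timedelta(days=6)        # inclusive last day of the period
        person = staff[week % len(staff)]      # round-robin through all staff
        schedule.append((start, end, person))
    return schedule


def who_is_on_call(staff: List[str], first_day: date, day: date) -> str:
    """Look up the on-call person for a given day without building the schedule."""
    weeks_elapsed = (day - first_day).days // 7
    return staff[weeks_elapsed % len(staff)]


# Example with made-up names:
for start, end, person in on_call_schedule(["Asha", "Baraka", "Neema"], date(2024, 1, 1), 4):
    print(start, end, person)
```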
*Tier 1 are student staff, also called Network
Analysts.
*Tier 2 are full-time staff; most are titled
Network Specialists.
*Tier 3 are full-time staff, titled Network
Engineers.
*We advertise to all customers an on-site
7x24 staff for immediate response to
outages.
*Our SLAs indicate when a problem
should be escalated beyond the on-call
staff to a manager, director, and CEO at
any time of the day or week.
*Our escalation practices and policies are based on
length or severity of outage.
*At predetermined intervals additional management
levels are notified of severe outages in order to help
with escalation at other organizations (telcos), or to
keep peers updated at the affected sites.
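
Escalation based on length or severity of outage can be expressed as a table of time thresholds. The slides name the levels (on-call staff, manager, director, CEO) but not the intervals, so every threshold and severity multiplier below is a made-up placeholder.

```python
from datetime import timedelta
from typing import List

# Hypothetical intervals: how long an outage lasts before each level is notified.
ESCALATION_LADDER = [
    (timedelta(minutes=0), "on-call staff"),
    (timedelta(hours=1), "manager"),
    (timedelta(hours=4), "director"),
    (timedelta(hours=8), "CEO"),
]

# Hypothetical rule: more severe outages reach each level proportionally sooner.
SEVERITY_SPEEDUP = {"low": 1, "high": 2, "critical": 4}


def levels_to_notify(outage_age: timedelta, severity: str = "low") -> List[str]:
    """Return every level whose (severity-adjusted) threshold has been reached."""
    speedup = SEVERITY_SPEEDUP[severity]
    return [
        level
        for threshold, level in ESCALATION_LADDER
        if outage_age >= threshold / speedup
    ]


# Example: a two-hour outage at "high" severity.
print(levels_to_notify(timedelta(hours=2), "high"))
```

Holding the ladder in data rather than in code makes it easy to keep in step with whatever the SLA actually specifies.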
*Consider lighting and noise control with
shared offices.
*How many monitors will each person
need? Will you use a large central
monitor for some things?
*Provide an impromptu meeting space for
collaboration on big events.
*Conference bridges greatly
enhance collaboration across
geographic distances whether
working on outages or events.
*A very useful tool is live chat or IM for
coordinating efforts no matter where your
office is.
*Our customer information is tracked in a
home-grown database which has grown and
morphed over a dozen years.
*New needs such as SLAs and layer 1 info now
require significant investment in upgrades.
*Our monitoring system – Habari – is also
“homegrown”.
*We monitor interface state and IP reachability;
performance and protocol state connectivity will
soon be integrated into our “event system” (NMS).
*Automated tools can page the appropriate group to
notify them of outages or threshold conditions.
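
As a rough illustration of paging the appropriate group on outages or threshold conditions, the sketch below routes a monitoring event to a group name. It does not describe how Habari works: the event kinds, group names, and the 90% utilization threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    device: str
    kind: str           # assumed kinds: "interface_down", "unreachable", "utilization"
    value: float = 0.0  # only meaningful for "utilization" (percent)


# Hypothetical mapping from event kind to the group that gets paged.
PAGE_GROUP = {
    "interface_down": "transport",
    "unreachable": "routing",
    "utilization": "capacity",
}
UTILIZATION_THRESHOLD = 90.0  # hypothetical threshold, in percent


def group_to_page(event: Event) -> Optional[str]:
    """Return the group to page, or None if the event needs no page."""
    if event.kind == "utilization" and event.value < UTILIZATION_THRESHOLD:
        return None                      # below threshold: no page
    return PAGE_GROUP.get(event.kind)    # unknown kinds also return None
```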
*We have a separate Tools team (with 5 staff
members) who design, write, implement, and
maintain tools.
*This allows us to have full-featured and robust
tools.
*One trade-off is fewer “one-off” tools for
specific or isolated issues.
*Karibu Telecoms uses Request Tracker (RT), an
open-source application, to track trouble tickets.
*Weekly reports are generated for our Directors
by sector, severity, and type.
*Monthly reports are generated by sector for
billing purposes.
* Key metrics we track include:
  1. Ticket numbers by sector for billing
  2. Phone call volumes
  3. Duration of outages
  4. Root Cause Analysis for high-impact events
* Outage time is measured by duration of the customer impact.
* After Action Review and Follow-up is conducted for serious events.
* Monthly report is emailed to the customer for traffic sent
to/from their site.
* Our internal reporting includes “operational impacts” to groups
under our main organization.
* How do you measure your NOC’s success?
  * Reduced calls?
  * Response times?
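
The report breakdowns described above (ticket counts by sector, severity, and type, plus outage time measured as duration of customer impact) amount to simple aggregations over ticket records. The sketch below assumes a hypothetical `Ticket` record. It does not reflect RT's actual data model or API.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable


@dataclass
class Ticket:
    sector: str
    severity: str
    kind: str                # ticket type, e.g. "outage" or "planned work"
    impact_start: datetime   # when the customer impact began
    impact_end: datetime     # when the customer impact ended


def count_by(tickets: Iterable[Ticket], field: str) -> Counter:
    """Count tickets grouped by one attribute: 'sector', 'severity', or 'kind'."""
    return Counter(getattr(t, field) for t in tickets)


def outage_time_by_sector(tickets: Iterable[Ticket]) -> Dict[str, timedelta]:
    """Sum customer-impact duration per sector."""
    totals: Dict[str, timedelta] = {}
    for t in tickets:
        duration = t.impact_end - t.impact_start
        totals[t.sector] = totals.get(t.sector, timedelta()) + duration
    return totals
```

Measuring from `impact_start` to `impact_end`, rather than from ticket open to ticket close, matches the rule that outage time is the duration of the customer impact.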
*Factors that have determined operational changes for
our organization have been increased size, complexity,
and number of networks monitored.
*We need to respond to outages 24 hours/day with on-site
personnel.
*Skill and responsibility levels have increased
significantly, and continue to do so.