Building a NOC
How to Build a NOC
• Identify Customers
  – Who are your customers?
• Understand Customer Expectations
  – What are your user expectations?
  – SLAs?
• Support Service Offerings
  – Besides networking, what other services are being offered?
• Monitor and Troubleshoot Service Issues
  – How large and complex is your network?
  – What level of troubleshooting and/or monitoring will your NOC do?
  – How will you communicate outages and planned work to customers?
• Determine Appropriate Staffing Levels
  – What will be the service hours?
    • Not all NOCs need to be 7x24x365
    • What about holidays? Weekends? On-call?
  – Do your SLAs require in-person staffing?
  – Do you have after-hours service response requirements?
  – What level of staff needs to be present, and when?
  – What will be the means of responding to issues when the NOC is not staffed 24x7?
• Organizational Structure
  – What staffing tiers/hierarchy will you have for support? Techs? Leads? NEs?
  – What will be your escalation practices and policies?
  – To what group does the NOC report?
  – What other groups report there, and what is your organizational relationship to other key groups?
  – Who will write and update procedures, training manuals, etc.?
• Location and Design of NOC Facility
  – How much space do you need?
  – What is your facility like?
  – How do you want to arrange your staff? Separate offices? "War room"?
• NOC Funding
  – How will your organization be funded?
• NOC Tools
  – How will you track customer information? (Database needs, CRM?)
  – How will you monitor and troubleshoot? Tools, specifically.
  – Are you writing any of your own tools?
  – Who will maintain your applications?
  – How will you track trouble tickets?
• Reporting
  – What reports will you issue?
  – How will you measure the data?
• What factors may determine operational changes for your organization? New services, expanded hours, an increased number of customers, new equipment types, deeper skill levels.
Case Study: Building a NOC
Karibu Telecoms
* Our customers are in Iringa, Tanzania.
* We track customer information in a database designed and maintained locally.
* Demand and service have driven our need for 7x24 operation.
* Escalation policies also drive our need for on-call schedules and on-site personnel.
* Our customers expect 48 hours' notice prior to work, unless it is an emergency.
* Outages are communicated via an application we have built locally.
* We also want to know when our customers have planned work.
* Expectations are included in the contract and located on the Karibu web page:
* http://www.kaributelecoms.co.tz/noc
* We prefer to have the NOC as the primary customer contact point for our organization, in order to maintain quality of service and quality of experience when a ticket moves between groups.
* In addition to network monitoring for Karibu Telecoms, our NOC monitors connectivity to approximately 5,000 sites throughout Iringa.
* Karibu Telecoms troubleshoots with the customer on routing problems, latency, and loss of connectivity.
* Node-to-node and IGP status are monitored.
* We manage DWDM, SONET, and MPLS circuits.
* Complexity is increased because escalation paths differ depending on what isn't working.
* Outages and planned events are sent via SMS and email announcements in a standard format.
* We include the date/time of the work or when the outage began.
* If the customer's connectivity is entirely down, we also call them.
* Updates are sent at predefined intervals for large events, or when we have a change in status.
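A minimal sketch of what such a standard-format notice could look like is below. The Event fields, layout, ticket reference, and site names are illustrative assumptions, not Karibu's actual application format.

```python
# Sketch of a standard-format outage / planned-work notice (illustrative only).
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List

@dataclass
class Event:
    kind: str            # "OUTAGE" or "PLANNED WORK"
    ticket: str          # reference to the trouble ticket
    sites: List[str]     # affected customer sites
    started: datetime    # when the work starts or the outage began
    summary: str

def format_notice(ev: Event, status: str = "OPEN") -> str:
    """Render one event in a fixed layout suitable for SMS or email."""
    return (
        f"[{ev.kind}] {ev.ticket} status={status}\n"
        f"Start: {ev.started.strftime('%Y-%m-%d %H:%M %Z')}\n"
        f"Sites: {', '.join(ev.sites)}\n"
        f"Summary: {ev.summary}"
    )

if __name__ == "__main__":
    ev = Event("PLANNED WORK", "RT#12345", ["Iringa-POP-3"],
               datetime(2024, 5, 10, 22, 0, tzinfo=timezone.utc),
               "Fibre splice on the Iringa ring; expect ~20 minutes of loss.")
    print(format_notice(ev))            # initial announcement
    print(format_notice(ev, "UPDATE"))  # re-sent at predefined intervals
```

Keeping the layout fixed makes it easy for customers (and scripts on their side) to parse every announcement the same way.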
* We are 7x24 with full-time staff.
* On weekends we only have one person covering each day, so vacations and sick time are problems.
* Holidays are covered by on-site and on-call staff.
* On-call consists of a 7-day period and rotates among all NOC staff on a regular schedule.
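One simple way to implement a 7-day rotation is to count weeks from a fixed anchor date and step through the roster; the names and anchor date in this sketch are hypothetical.

```python
# Sketch of a weekly on-call rotation: weeks since an anchor date select the
# next person on the roster. Names and anchor date are placeholders.
from datetime import date

STAFF = ["Asha", "Baraka", "Neema", "Juma"]   # hypothetical NOC roster
ROTATION_START = date(2024, 1, 1)             # assumed start of week 0

def on_call(day: date) -> str:
    """Return who carries the on-call duty on a given day."""
    week = (day - ROTATION_START).days // 7
    return STAFF[week % len(STAFF)]

print(on_call(date.today()))
```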
* Tier 1 are student staff, also called Network Analysts.
* Tier 2 are full-time staff, most titled Network Specialists.
* Tier 3 are full-time staff, titled Network Engineers.
* We advertise to all customers an on-site 7x24 staff for immediate response to outages.
* Our SLAs indicate when a problem should be escalated beyond the on-call staff to a manager, director, and CEO, at any time of the day or week.
* Our escalation practices and policies are based on length or severity of outage.
* At predetermined intervals, additional management levels are notified of severe outages in order to help with escalation at other organizations (telcos), or to keep peers updated at the affected sites.
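Duration-based escalation of this kind can be expressed as a simple ladder; the thresholds and roles below are invented placeholders, not Karibu's actual intervals.

```python
# Sketch of duration-based escalation: the longer a severe outage runs, the
# more management levels must have been notified. Values are placeholders.
ESCALATION_LADDER = [
    (0,   "NOC on-call"),
    (30,  "NOC manager"),
    (120, "Director"),
    (240, "CEO"),
]  # (minutes of outage, who must have been notified by then)

def notify_list(outage_minutes: int, severe: bool) -> list:
    """Everyone who should have been notified for an outage of this length."""
    if not severe:
        return ["NOC on-call"]
    return [role for threshold, role in ESCALATION_LADDER
            if outage_minutes >= threshold]

print(notify_list(150, severe=True))   # ['NOC on-call', 'NOC manager', 'Director']
```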
* Consider lighting and noise control with shared offices.
* How many monitors will each person need? Will you use a large central monitor for some things?
* Provide an impromptu meeting space for collaboration on big events.
* Conference bridges greatly enhance collaboration across geographic distances, whether working on outages or events.
* A very useful tool is live chat or IM for coordinating efforts, no matter where your office is.
* Our customer information is tracked in a home-grown database which has grown and morphed over a dozen years.
* New needs such as SLAs and layer 1 info now require significant investment in upgrades.
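As a rough illustration of the kind of data involved (customers, SLA terms, layer-1 circuits), here is a small sqlite3 schema sketch; the real home-grown database and its schema are Karibu's own and are not described in the text.

```python
# Illustrative schema only; field names and structure are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    sector   TEXT,               -- later used for per-sector reporting
    contact  TEXT
);
CREATE TABLE sla (
    customer_id    INTEGER REFERENCES customer(id),
    response_mins  INTEGER,      -- required response time
    notice_hours   INTEGER       -- e.g. 48 hours before planned work
);
CREATE TABLE circuit (           -- layer-1 info: DWDM / SONET / MPLS
    id           INTEGER PRIMARY KEY,
    customer_id  INTEGER REFERENCES customer(id),
    kind         TEXT,
    a_end        TEXT,
    z_end        TEXT
);
""")
```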
* Our monitoring system, Habari, is also home-grown.
* We monitor interface state and IP reachability; performance and protocol state monitoring will soon be integrated into our "event system" (NMS).
* Automated tools can page the appropriate group to notify them of outages or threshold conditions.
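A minimal sketch of this kind of reachability check with paging on failure is below. The real checks live in Habari; the site list uses documentation addresses (192.0.2.0/24) and page_group() stands in for whatever pager or SMS hook is actually used.

```python
# Sketch of a simple IP-reachability check that pages on failure.
import subprocess

SITES = {"Iringa-POP-1": "192.0.2.1", "Iringa-POP-2": "192.0.2.2"}  # placeholders

def reachable(ip: str) -> bool:
    """Send one ICMP echo (Linux ping flags); return code 0 means a reply."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                            capture_output=True)
    return result.returncode == 0

def page_group(group: str, message: str) -> None:
    print(f"PAGE {group}: {message}")   # stand-in for the real paging mechanism

for site, ip in SITES.items():
    if not reachable(ip):
        page_group("noc-oncall", f"{site} ({ip}) is unreachable")
```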
* We have a separate Tools team (with 5 staff members) who design, write, implement, and maintain tools.
* This allows us to have full-featured and robust tools.
* One trade-off is fewer "one-off" tools for specific or isolated issues.
* Karibu Telecoms uses Request Tracker (RT), an open-source application, to track trouble tickets.
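RT also exposes a REST interface that tools can use to open tickets automatically. The sketch below follows RT's REST 1.0 convention; the hostname, queue name, and credentials are placeholders, and the request format should be verified against the documentation for your RT version.

```python
# Sketch of opening a ticket via RT's REST 1.0 interface (values assumed).
import requests

RT_NEW_TICKET = "https://rt.example.net/REST/1.0/ticket/new"   # placeholder host

content = "\n".join([
    "id: ticket/new",
    "Queue: NOC",                                    # assumed queue name
    "Subject: Loss of connectivity, Iringa-POP-3",
    "Text: Customer reports total loss of connectivity since 14:05.",
])

resp = requests.post(RT_NEW_TICKET,
                     data={"content": content},
                     params={"user": "noc-api", "pass": "changeme"})  # placeholders
print(resp.status_code, resp.text.splitlines()[0])   # e.g. "RT/4.4.4 200 Ok"
```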
* Weekly reports are generated for our Directors by sector, severity, and type.
* Monthly reports are generated by sector for billing purposes.
* Key metrics we track include:
  1. Ticket numbers by sector for billing
  2. Phone call volumes
  3. Duration of outages
  4. Root Cause Analysis for high-impact events
* Outage time is measured by the duration of the customer impact.
* After Action Review and follow-up is conducted for serious events.
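The roll-up itself can be very simple once tickets carry sector, severity, type, and impact duration. The sketch below shows the idea with fabricated example records; the field names are assumptions.

```python
# Sketch of a weekly roll-up by sector and of outage time measured as
# customer-impact duration. Ticket records are fabricated examples.
from collections import Counter

tickets = [
    {"sector": "Education", "severity": "high", "type": "outage",  "impact_mins": 95},
    {"sector": "Education", "severity": "low",  "type": "request", "impact_mins": 0},
    {"sector": "Health",    "severity": "high", "type": "outage",  "impact_mins": 40},
]

by_sector = Counter(t["sector"] for t in tickets)   # ticket numbers for billing
impact = sum(t["impact_mins"] for t in tickets if t["type"] == "outage")

print("Tickets by sector:", dict(by_sector))
print("Total customer-impact minutes:", impact)
```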
* A monthly report is emailed to each customer covering traffic sent to/from their site.
* Our internal reporting includes "operational impacts" to groups under our main organization.
* How do you measure your NOC's success? Reduced calls? Response times?
* Factors that have determined operational changes for our organization have been increased size, complexity, and number of networks monitored.
* Need to respond to outages 24 hours/day with on-site personnel.
* Skill and responsibility levels have increased significantly, and continue to do so.