`Network Management`? - Systems and Computer Engineering

January 2017
Henry Starzynski
Network Operations Support
Global Network Mgmt Centre
Bell Canada
Henry Starzynski – Manager, Global Network Management Centre
• Graduated from the University of Waterloo in 1982 with Bachelor of Mathematics
(Computer Science)
• Post graduation, worked for a computer time sharing company called Datacrown,
which became Canada Systems Group, then SHL-Systemhouse, now CGI
• I’ve been with Bell 32.5 years! (yes there have been LOTS of changes since I started!)
• Started out working on network design tools for services called Datapac and Megastream
• Moved to our network management centre taking care of Datapac, managing the 7/24 console
then Frame Relay (Hyperstream) support
• Today, I continue with legacy network support, PLUS bring in new business for our centre,
support our computers (PCs), and handle international escalations
• I have a life outside of Bell too! I’m involved in the local community with Scouts Canada –
so, when you are free of University life, don’t forget to
be involved in your community as well. You have lots of energy and knowledge that can
help make local communities, wherever you end up, much better!
• Don’t forget, when you leave Carleton, learning never ever stops! Keep your brains active, technology
is continually changing
Bell Canada’s GNMC
• GNMC = Global Network Management Centre
• One of the world’s first Data Network Management Centres
• Operating locally in Ottawa, serving Bell Canada customers
globally
Bell Canada GNMC
A bit about who we are …
• Involved in managing data networks in Canada since 1974, globally since 1992
• Originally - the National Data Network Control (NDNC) for domestic (Canada only)
core data networks: Dataroute, Datapac (packet switching), Megastream (Pt-Pt T1),
Hyperstream (frame relay), Canadian ATM Gateway networks
• Expanded to include private networks (Lotto Quebec) and VPN clouds
• Started internationally with Financial Networks Associates (FNA – consortium of 8
countries) network in 1991 (Alcatel based network) – this network no longer exists
• Evolved into Global Network Management (GNMC) at the individual customer circuit
level
• Today, we serve as International Help Desk/SPOC (single point of contact) for
international data circuit troubles going OUT of Canada (with the exception of
Canadian government circuits, which are handled by a separate group)
Bell Canada GNMC
Main Focus Areas:
• Single Point of Contact (SPOC) for international customer data circuits
• VPN Managed Services (MPLS) and support of private or virtual private network clouds
and routers (LAN)
• Core Network Management (WAN) of legacy data networks (Datapac = Packet Switching,
Frame Relay)
• Technical Support on existing legacy networks
• Surveillance of 2 major customers’ networks internationally
GNMC is involved in major processes of Network Management:
• Fault Management
• Configuration Management (Provisioning)
• Performance Management & Change Management
• Security Management
Network Management
• Like any industry, we toss around lots of BUZZ WORDS
• What do all those terms mean??
• WANs
• Clouds
• OSSs
• Network Management
• SPOC
• Why do we do network management & customer management?
• Why is it important?
WELL
let’s start … WHAT IS A NETWORK?
What is a Network ??
A Network means something different to everyone
For example, a ‘network’ can be ..
• LAN (Local), WAN (Wide), MAN (Metropolitan), CAN (Campus) Area networks
• Point to Point network - connecting two sites regardless of distance
• The ‘CLOUD’ - the service provider’s network – the infrastructure, sometimes
termed the Public Network
• The ’NET - the ubiquitous network
• The PSTN – Public Switched Telephone Network
• Wireless network
• Home Network
• A VPN – a Virtual Private Network
• A ‘social’ network!
A NETWORK MEANS DIFFERENT THINGS TO DIFFERENT PEOPLE
BUT whatever your definition, all networks do the same thing!
What is a Network ?
• A standard definition of a ‘network’ we will use is the following:
• A set of elements or NODES linked together to provide paths to transmit
information, (data, voice, video) from one location to another.
• A critical tool which allows businesses & people to operate and communicate
• When it is all boiled down, all information is ‘data’, and it travels over a network.
• Successful networks are managed
Examples of Data Networks
• Transport Networks (Sonet, DS3, DS1, Fibre, IP core) – the BIG
infrastructures
• Circuit Switched (Public Switched Telephone Network)
• Dedicated (Point to point)
• Packet/Frame/Cell (legacy services)
• IP (Internet/ Intranet)
• Local Area Networks, in the home, office, or around the campus.
• Private (TV, Radio, Financial, Lottery) or Virtual Private Networks
(VPNs)
• Wireless
• Metropolitan (MAN), Wide Area (WAN), Personal (PAN)
• Cellular
• Satellite
• Unfortunately .. Star Wars HOLONET is not a reality yet (although it
existed in a galaxy far far away long long ago)
Network Characteristics
• Common characteristic of all networks is
• the transmission of DATA (information, etc.)
• Some type of information (i.e. - data) is being transmitted from one
person/computer/location to another, for business, pleasure, research,
etc.
• In today’s world, we take data communications over networks for
granted - it is there, reliable, fault tolerant, and it NEVER fails.
• We use it every day, it is part of our daily routines, part of our ‘life’!
We expect connectivity!
What then - is Network Management
and why is it important ?
• Network management has 5 main processes:
Fault Management
Configuration Management (Provisioning)
Accounting Management
Performance Management (including Change Management
and Capacity Management)
Security Management
Question!
What is the latest estimate of the number of internet users in the world?
http://www.internetworldstats.com/stats.htm
Blasts from the Past!!
ROOT CAUSES OF BLACKOUTS AND THEIR REMEDY
The electric power transmission system of the United States is seriously deficient.
Experts generally agree that fixing this system to an adequate level would take
many years and cost tens of billions of dollars. But the root causes of the recent
“Blackout of 2003” can be solved in a relatively short time and at a much more
reasonable cost.
The root causes of the present problems are:
• A totally outdated reliability philosophy; and
• Inadequate real time monitoring of the transmission grid.
Isn’t the power grid a network too? Of course! Electricity is just a form of ‘data’!
http://www.speedmatters.org/blog/archive/fcc-verizon-at-fault-for-network-failures-of2012-derecho/#.UPgdWh1lGQG
In June 2012, large parts of the Midwest and Middle Atlantic were, without warning, hit
by a destructive rain and windstorm called a derecho. It left in its wake 22 dead, hundreds
of injuries and millions of people without power or communications.
Today, the FCC released a lengthy report prepared by its Public Safety and Homeland
Security Bureau that looks at the communications outages that followed from the derecho,
and made recommendations to avoid or reduce future failures.
FCC Commissioner issued a statement reinforcing the findings and recommendations,
and commenting on the service breakdowns:
"Tragically, many of these were avoidable interruptions involving a lack of back-up
power to central offices or failures of the service providers' monitoring systems...
Carriers should test their networks and ensure that plans are in place in case of an
emergency. It is time for an honest accounting of the resiliency of our nation's
network infrastructure in the wireless and digital age."
In computer networking: “Resiliency is the ability to provide and maintain an
acceptable level of service in the face of faults and challenges to normal operation.”[1]
Network resilience touches a very wide range of topics. In order to increase the
resilience of a given communication network, the probable challenges and risks have
to be identified and appropriate resilience metrics have to be defined for the service to
be protected.
Why ‘Network Management’?
From a network provider’s viewpoint …
• Manage network resources equitably to ensure users can establish communications
quickly & reliably
• Ensure information is transferred with original quality, integrity, and securely
• Operate a high performance, reliable, cost effective network that meets customer/
business/organizational needs and requirements
• Plan and implement measures to prevent or mitigate interruptions or
degradation of service
• Make $$$$$ for the network provider and its shareholders
• Gain market share for the network provider
• At Bell Canada, networks are the building blocks of our own business – they are why we
exist!
Why ‘Network Management’?
From the customer’s viewpoint …
• Ensure information is transferred with original quality, integrity, and securely
• Obtain service at best cost/service/value combination
• To ensure a customer’s business operates with minimum downtime, in order to meet
the requirements of its customers
• Meet regulatory, legal, safety requirements
• For a customer, networks are critical
• For businesses, for their operations.
• For the general public, so we can communicate, get money, do our assignments,
talk, text, tweet, message .. BE CONNECTED
Network Management Poses Endless Challenges
by Willie Schatz
If network managers are in accord about anything, it’s that they have a lot
more tasks to do than resources to handle them.
The fundamental roles of a network administrator are to provide network
connections for computer equipment and to ensure availability and
performance of network communications.
But that’s only the beginning. The administrator must set up and manage
hardware and software solutions, enabling servers, clients, printers and other
peripherals to communicate. He or she also is responsible for providing users
the highest quality server functionality, which means uninterrupted, optimum
network availability and performance.
This same individual also must plan so any changes required in the network
conform with changes in the larger enterprise system.
“People really think network management is easier than it really is”.
Network Management Processes
There are five processes involved in network management
Configuration Management == Provisioning
• Programming network elements to communicate with each other and user equipment
• User datafill to make their service functional
• Copying critical (non default) network provisioning parameters to storage in
offline databases
• Ensuring billable parameters/features are updated in related billing systems
• Providing ‘dumps’, downloads, or application program interfaces (APIs) to other
downstream systems
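As a rough illustration of the "copy non default parameters to offline storage" step above, here is a minimal Python sketch. The parameter names, the default values, and the JSON record layout are all assumptions made up for this example; a real network element and its OSS would define their own.

```python
import json

# Hypothetical factory defaults for a network element (assumed values).
DEFAULTS = {"mtu": 1500, "admin_state": "down", "speed_kbps": 64}


def non_default_params(running, defaults=DEFAULTS):
    """Return only the parameters whose values differ from the defaults."""
    return {k: v for k, v in running.items() if defaults.get(k) != v}


def to_offline_record(element_id, running):
    """Serialize the non-default parameters as JSON for offline storage."""
    record = {"element": element_id, "params": non_default_params(running)}
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    running_config = {"mtu": 1500, "admin_state": "up", "speed_kbps": 1544}
    print(to_offline_record("node-A", running_config))
```

Storing only the values that differ from the defaults keeps the offline copy small, and it makes it easier to see what was provisioned for a given user.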
Why is Configuration/Provisioning management important?
• Users want their service when it is ordered (on due date or
NOW ..when we walk into BestBuy)
• Users want to get the options they pay for
• The network provider needs to ensure their service is billed
Network Management Processes
Fault Management == Service Assurance
• Surveillance - proactive - alarms/traps from the network that indicate major problems
• Isolating problems - reactive - when users have troubles
• Having clearly defined escalation procedures - how to prioritize troubles
• Providing customers with timely and honest status on problems - when will it be fixed?
• Performing analysis on failures for trends, root cause
Service Assurance is REAL-TIME surveillance, control, and analysis of a
network, with the objective of ensuring maximum use of network resources, particularly
when it is under stress due to traffic overload or failure conditions.
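One way to picture "how to prioritize troubles" is a queue that always hands the operator the most severe alarm first. The sketch below is only an illustration in Python: the four severity names and their ranking are assumptions, not a description of any particular surveillance system.

```python
import heapq

# Assumed severity ranking; a lower number is handled first.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "warning": 3}


class AlarmQueue:
    """Orders incoming alarms by severity, then by arrival order."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: equal severities stay first-in, first-out

    def push(self, severity, source, text):
        rank = SEVERITY_RANK[severity]
        heapq.heappush(self._heap, (rank, self._counter, severity, source, text))
        self._counter += 1

    def pop(self):
        """Return the most urgent alarm as (severity, source, text)."""
        _, _, severity, source, text = heapq.heappop(self._heap)
        return severity, source, text

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    queue = AlarmQueue()
    queue.push("minor", "node-1", "fan speed low")
    queue.push("critical", "node-2", "link down")
    print(queue.pop())  # ('critical', 'node-2', 'link down')
```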
Network Management Processes
Performance Management
• Performance measures can be internal (for the provider), regulated (CRTC), or
to assist the customer (how is my network performing)
• Network performance measures (Mean time to repair, Network availability) are standard
metrics used in the industry, and are often the basis for ‘service level agreements’
• Customers may require information on their traffic patterns - are they
paying for bandwidth they don’t require, or is their network overloaded?
• Many customers want guarantees of performance – a Service Level Agreement (SLA)
in order to ensure they are getting the performance they pay for.
• An SLA may include the following:
• Network Availability
• Frame/Cell/Packet delivery
• Mean time to Repair
• Penalty clauses for non-performance
• Delay metrics
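To make the SLA items above concrete, here is a small Python sketch that compares one reporting period's measurements against contracted targets. The field names, units, and the sample numbers are hypothetical; real SLAs define their own metrics, measurement periods, and penalty rules.

```python
from dataclasses import dataclass


@dataclass
class SlaTargets:
    """Contracted targets; the field names and units are illustrative."""
    min_availability_pct: float
    max_mttr_hours: float
    min_delivery_pct: float
    max_round_trip_ms: float


def sla_violations(targets, measured):
    """Return a list naming each SLA metric that missed its target.

    measured is a dict with keys availability_pct, mttr_hours,
    delivery_pct and round_trip_ms for the reporting period.
    """
    missed = []
    if measured["availability_pct"] < targets.min_availability_pct:
        missed.append("availability")
    if measured["mttr_hours"] > targets.max_mttr_hours:
        missed.append("mean time to repair")
    if measured["delivery_pct"] < targets.min_delivery_pct:
        missed.append("delivery")
    if measured["round_trip_ms"] > targets.max_round_trip_ms:
        missed.append("delay")
    return missed


if __name__ == "__main__":
    targets = SlaTargets(99.9, 4.0, 99.5, 150.0)
    month = {"availability_pct": 99.95, "mttr_hours": 5.5,
             "delivery_pct": 99.7, "round_trip_ms": 120.0}
    print(sla_violations(targets, month))  # ['mean time to repair']
```

A non-empty result is what would trigger a look at the penalty clauses.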
Network Management Processes
Change Management
• Scheduling downtime / maintenance activities (new software, network upgrades)
with users (notification, release or emergency)
• Ensuring software levels are compatible with all network components
• Keeping the customer informed of planned service interruptions is critical
Networks are in need of periodic maintenance for software or hardware upgrades,
etc. In a 7x24 world, unscheduled downtime can mean
• loss of revenue
• legal liability
• threats to public safety.
FROM: CHANGE MANAGEMENT PLANNED OUTAGE
Foreign-Tel COMMUNICATIONS Dept.: GNMC
Phone: 1-555-868-7883
Fax: 1-555-868-7822
Please respond to the following Email: [email protected]
ForeignTel Communications would like to inform you that the Change Management activity will be
performed as indicated below:
_____________________________________________________________________
Outage #: POM041793 / POT356369
Your ref. #:
Description:
DISREGARD OUTAGE NOTICE//THIS IS NOT SERVICE
AFFECTING//WE ARE ADDING BACKBONE CAPACITY:
PORTLAND-SANTA CLARA DURING THIS PERIOD,
NETWORK WILL BE IN HAZARDOUS CONDITION. WALL
NOC WILL CLOSELY MONITOR THE NETWORK AND ANY
ALARMS ON IT
Scheduled Planned Start Date (UTC): february 16, 2014 15:00:00
Scheduled Planned End Date (UTC): february 24, 2014 03:00:00
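When a notice like this arrives, two quick checks are how long the window is and whether it collides with anything else already scheduled. The Python sketch below does both; the function names are made up for illustration, and the dates are the ones in the sample notice above.

```python
from datetime import datetime, timezone


def window_hours(start, end):
    """Length of a maintenance window in hours."""
    if end <= start:
        raise ValueError("window must end after it starts")
    return (end - start).total_seconds() / 3600.0


def windows_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open windows [start, end) share any time."""
    return a_start < b_end and b_start < a_end


if __name__ == "__main__":
    start = datetime(2014, 2, 16, 15, 0, tzinfo=timezone.utc)
    end = datetime(2014, 2, 24, 3, 0, tzinfo=timezone.utc)
    print(window_hours(start, end))  # 180.0
```

Keeping every timestamp in UTC, as the notice does, avoids confusion when carriers in different time zones compare schedules.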
Related Network Management Activities
• Co-ordination with other Carriers and Agencies.
No one carrier can route traffic everywhere on the planet. Strategic alliances and
co-operation amongst carriers are essential.
• Dynamic Controls.
Can traffic be rerouted around failures or congestion? Is this automatic or manual?
• Disaster recovery planning.
Could it happen to you? What would you do in the event of a ‘disaster’?
• Security
Who has access to the network infrastructure? Can it be ‘hacked’? Ensuring one
customer’s data does not go to another customer.
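The "Dynamic Controls" question above (can traffic be rerouted around a failure?) can be sketched as a path search that skips any link marked as down. This Python example uses a plain breadth-first search for the fewest-hop path; the city names and links are invented for illustration, and real networks use their own routing protocols and metrics.

```python
from collections import deque


def shortest_path(links, src, dst, failed=frozenset()):
    """Breadth-first search for a fewest-hop path that avoids failed links.

    links  -- iterable of (node, node) pairs, treated as bidirectional
    failed -- set of frozenset({a, b}) links that are currently down
    Returns a list of nodes from src to dst, or None if no path exists.
    """
    neighbours = {}
    for a, b in links:
        if frozenset((a, b)) in failed:
            continue
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


if __name__ == "__main__":
    # Hypothetical topology, for illustration only.
    links = [("Ottawa", "Toronto"), ("Toronto", "Vancouver"),
             ("Ottawa", "Montreal"), ("Montreal", "Winnipeg"),
             ("Winnipeg", "Vancouver")]
    down = {frozenset(("Toronto", "Vancouver"))}
    print(shortest_path(links, "Ottawa", "Vancouver", failed=down))
```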
Security Management
• The goal of security management is to control access to
network resources according to local guidelines so that the
network cannot be sabotaged (intentionally or
unintentionally) and sensitive information cannot be
accessed by those without appropriate authorization.
• Security management subsystems work by partitioning
network resources into authorized and unauthorized areas.
– They identify sensitive network resources (including systems, files,
and other entities) and determine mappings between sensitive
network resources and user sets.
– They also monitor access points to sensitive network resources and
log inappropriate access to sensitive network resources.
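The mapping between sensitive resources and user sets, plus the logging of inappropriate access, can be sketched in a few lines of Python. The resource names and group names here are hypothetical, and treating unknown resources as denied is a design choice of this example rather than something stated above.

```python
import logging

logger = logging.getLogger("security")

# Hypothetical mapping of sensitive resources to the user sets allowed to reach them.
ACCESS_MAP = {
    "core-router-config": {"noc-admin"},
    "billing-db": {"billing", "noc-admin"},
}


def is_authorized(user_groups, resource, access_map=ACCESS_MAP):
    """Return True if any of the user's groups may access the resource.

    Resources absent from the map are denied (an assumption of this sketch).
    Every denied attempt is written to the log.
    """
    allowed = access_map.get(resource, set())
    if allowed & set(user_groups):
        return True
    logger.warning("denied access to %s for groups %s",
                   resource, sorted(user_groups))
    return False


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(is_authorized({"billing"}, "billing-db"))          # True
    print(is_authorized({"billing"}, "core-router-config"))  # False, and logged
```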
Nice visualization of data breaches
http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
• In October, cybercriminals launched major DDoS attacks, disrupting a host of websites,
including the likes of Twitter, Netflix, PayPal, Pinterest and the PlayStation Network,
amongst many others.
• X-rated adult website AdultFriendFinder has now been hit by cybercriminals in
consecutive years, with the November 2016 attack involving far more people
than before.
This time, the number of accounts compromised was immense: approximately
412 million users had personal information stolen and published in criminal
marketplaces on the dark web.
• Yahoo hacks
Network Management Centre
Functions
• 7 x 24 operation - it’s more than a buzzword.
• Operations Support Systems for provisioning, change management, surveillance, trouble
tracking, customer records
• Subject experts/access to engineering support personnel or labs
• Multiple & diverse communications channels
• Situation (War) room
• Secure and Independent Power Supply
• Access to Information Databases
• Contact information for support resources (level 1, 2, 3 support, vendor support)
• Secure location
• Fully redundant backup location
When Disaster strikes!
• If something can go wrong .. it will ..
• Ice Storms (1998 & 2013)/Hurricane Katrina/Sandy & other natural disasters
• Bell Canada Toronto Simcoe Central Office fire July 1999
• Power plant failures
• Hackers and viruses
• Terrorist attacks
• All of these test the plans of a network provider.
• Are contingency plans in place? Have they been tested, or have they gathered dust for 5 years?
• Is there an escalation chain of command?
• Are there agreements with other suppliers/vendors/competitors?
• What contingencies are in place to get critical services restored as quickly as possible?
• When service is lost, the prime objective, after immediate human safety, is the
restoration of service
From July 1999 …
TORONTO - Phones stopped ringing in several major cities in Canada on Friday
after an explosion caused a major system failure at a Bell Canada building in Toronto.
The failure knocked out phone lines, most cell phones, internet services and bank machines
in downtown Toronto. Cantel and digital cell phones appear to be working. Police
report 911 emergency systems are working, but the police are urging people to use these
systems only for real emergencies. The failure was caused by an explosion on the fourth
floor at the downtown Bell centre at around 8:00 am. One person was reportedly
injured. Immediately after the explosion, battery powered backup systems kicked in.
But they ran out of power a few hours later. The Toronto Stock Exchange is back up and
running after it suspended trading briefly but brokerages are having trouble
communicating. Phone systems in Ottawa and Montreal and as far away as Halifax and
Vancouver have also been affected as calls that normally routed through Toronto are
rerouted through other cities. Bell Canada says it hopes to have services restored
by midafternoon.
The Globe and Mail
Published Thursday, Oct. 10 2013, 11:18 AM EDT
Rogers Communications Inc. said a software glitch created a big spike in “signalling traffic” that caused one of the worst
wireless network outages in the company’s history.
Canada’s largest wireless carrier determined that root cause on Thursday roughly 18 hours after implementing a fix that restored
voice and text services for customers across the country.
DISASTERS CAN HAPPEN! How will your network provider handle the trouble?
• Another aspect of Network Management is Planning
• A carrier will have a plan for a disaster situation,
as well as anticipating potential issues
• Examples of planning for potential issues include
• Y2K
• more recently, the change in dates for Daylight
Savings Time
• Other various clock rollover issues
• A carrier may also do periodic disaster simulations
to test the response of various groups as well
as procedures
• Where will YOU be January 19 2038?
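Why January 19, 2038? Many systems count time as seconds since January 1, 1970 (UTC) in a signed 32-bit integer, and that counter runs out on that date. The short Python sketch below works out the exact moment and what a wrapped counter would read one second later; the helper names are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2**31 - 1  # largest value a signed 32-bit counter can hold


def last_32bit_second():
    """Latest UTC time representable by a signed 32-bit seconds counter."""
    return EPOCH + timedelta(seconds=MAX_INT32)


def wrap_int32(value):
    """Emulate signed 32-bit overflow (two's complement wrap-around)."""
    return (value + 2**31) % 2**32 - 2**31


if __name__ == "__main__":
    print(last_32bit_second())  # 2038-01-19 03:14:07+00:00
    wrapped = wrap_int32(MAX_INT32 + 1)
    print(EPOCH + timedelta(seconds=wrapped))  # 1901-12-13 20:45:52+00:00
```

One second after 03:14:07 UTC, a system that has not been fixed may believe it is December 1901.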
Operational Support Systems
• Successful network management uses standardized protocols or vendor-specific
mechanisms to transmit alarms and commands
(e.g. Simple Network Management Protocol)
• Operational control data can be transmitted over conventional data networks,
over the same network (inband), or over another network (out of band).
• The systems that receive alarms and allow for network configuration, troubleshooting,
and control are commonly called Operational Support Systems (OSS).
• OSS may be more than 10 times the cost of the network infrastructure!
• OSSs may consist of Workstations, Databases, network elements, scripts, provisioning
systems, security systems, offline databases and billing systems.
• Without a good OSS structure, a great network infrastructure will fail. The network
objectives cannot be met without this.
Operational Support Systems
• No one OSS does it all - in fact, many OSSs are required, and these must interact
with each other. This is typically via Application Program Interfaces (API) or
some standard format for information exchange.
• The interaction can be simple - or complex. Often, simple format changes in one
OSS will impact many other ‘downstream’ OSSs.
• Remember where the money is spent - Not on the network infrastructure, but on
the systems that make the network run.
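To illustrate why a "simple format change" in one OSS can ripple downstream, here is a hedged Python sketch of two systems exchanging a trouble ticket as JSON. The field names and the `schema_version` convention are assumptions for this example, not a real OSS interface. The downstream side refuses a format it does not recognize, so the change is caught at the boundary instead of silently corrupting records.

```python
import json

SUPPORTED_VERSIONS = {1}
REQUIRED_FIELDS = {"ticket_id", "circuit", "status"}


def export_trouble_ticket(ticket_id, circuit, status):
    """Upstream OSS: serialize a ticket with an explicit schema version."""
    return json.dumps({"schema_version": 1, "ticket_id": ticket_id,
                       "circuit": circuit, "status": status})


def import_trouble_ticket(payload):
    """Downstream OSS: parse a ticket, refusing formats it does not understand."""
    record = json.loads(payload)
    version = record.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record


if __name__ == "__main__":
    payload = export_trouble_ticket("T-1001", "CKT-42", "open")
    print(import_trouble_ticket(payload)["status"])  # open
```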
Metrics – Key Performance Indicators
• Each network needs some means of measuring its success, and to see where
improvement can be made. Public networks may be regulated. Metrics may be stipulated
in Service level agreements (SLAs) between provider and customer
• To the end user/customer, the most critical metrics are the following:
• Mean time to repair (MTTR)
• Network Availability = (Total available time - Total downtime) / (Total available time)
• Quality of Service (QOS)
• round trip delay
• Network congestion/blocking
• frame/packet/cell loss
• repeat failures
• To the network provider, the following are important metrics:
• Network Availability
• EBITDA (Earnings Before Interest Taxes Depreciation & Amortization)
• Cost / Revenue (return on investment)
• Market Share
• Network capacity
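The two metrics customers ask about most, availability and mean time to repair, are simple arithmetic. Here is a small Python sketch using the availability formula from the list above; the sample month (720 hours with three outages) is invented for illustration.

```python
def availability_pct(total_hours, downtime_hours):
    """Availability = (total available time - total downtime) / total available time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * (total_hours - downtime_hours) / total_hours


def mean_time_to_repair(repair_hours):
    """Average repair duration across a list of outages, in hours."""
    if not repair_hours:
        raise ValueError("at least one outage is required")
    return sum(repair_hours) / len(repair_hours)


if __name__ == "__main__":
    outages = [1.5, 0.5, 2.0]   # hours to repair each failure (assumed data)
    month_hours = 30 * 24       # a 30-day month
    print(round(availability_pct(month_hours, sum(outages)), 3))  # 99.444
    print(round(mean_time_to_repair(outages), 2))                 # 1.33
```

Four hours of downtime in a month already drops availability below "three nines" (99.9%), which is why these numbers show up in SLAs.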
Metrics
• To the shareholder, the following are important:
• Dividend
• share price
• Return on Investment
Summary
• Networks can be simple, or extremely complex and mission critical
• Network quality, reliability, diversity, and low cost are essential
• The operation of a high quality, reliable, cost effective network requires
effective Network Management Centre(s), along with skilled people and good support
tools (operational support systems)
• As networks continue to evolve, customers will manage more and more of their own
networks.
• Challenges for the future include global coverage, scaling for growth,
new technologies, telco mergers, acquisitions, failures - an industry always in flux.