Performance Management: Application-driven Evolution

OAM and QoS
Presented by:
Yaakov (J) Stein
Chief Scientist
Unique Access Solutions
SERVICE GUARANTEES
OAMQoS-YJS Slide 2
Why do we pay for services ?
Generally good (and frequently much better than toll quality)
voice service is available free of charge (Skype, Fring, Nimbuzz, …)
So why does anyone pay for voice services ?
Similarly, one can get free
• (WiFi) Internet access
• email boxes
• file storage and sharing
• web hosting
• software services
So why pay ?
OAMQoS-YJS Slide 3
Paying for QoS
The simple answer is that one doesn’t pay for the service
one pays for Quality of Service guarantees
In our voice model
(figure: price versus QoS; labelled points: BE, toll quality, with mobility)
But what does QoS mean
and why are we willing to pay for it ?
To explain, we need to review some history
OAMQoS-YJS Slide 4
Father of the telephone
Everyone knows that the father of the telephone was
Alexander Graham Bell
(along with his assistant Mr. Watson)
But Bell did not invent the telephone network
Bell and Watson sold pairs of phones to customers
The father of the telephone network was
Theodore Vail
OAMQoS-YJS Slide 5
Theodore Vail Theodore Who?
Son of Alfred Vail (Morse’s coworker)
Ex-General Superintendent of US Railway Mail Service
First general manager of Bell Telephone
Father of the PSTN
Why is he so important?
Organized PSTN
Established principle of reinvestment in R&D
Established Bell Telephone’s IPR division
Executed merger with Western Union to form AT&T
Solved the main technological problems
• use of copper wire
• use of twisted pairs
Organized telephony as a service (like the postal service!)
Vailism is the philosophy that public services should be run as closed
centralized monopolies for the public good
OAMQoS-YJS Slide 6
What’s the difference ?
In the Bell-Watson model
the customer pays once, but is responsible for
• installation
• wires
• wiring
• operations
+
• power
• fault repair
• performance (distortion and noise)
• infrastructure maintenance
while the Bell company is responsible only for
providing functioning telephones
In the Vail model the customer pays a monthly fee
but the provider assumes responsibility for everything
including fault repair and performance maintenance
the telephone company owns the telephone sets and even the wires in the walls !
OAMQoS-YJS Slide 7
Service Level Agreements
In order to justify recurring payments
the provider agrees to a minimum level of service in an SLA
SLAs should capture Quality of user Experience (QoE)
but this is often hard to quantify
So in practice SLAs usually detail measurable network parameters
that influence QoE, such as :
• availability (e.g., the famous five nines)
• time to repair (e.g., the famous 50 ms)
• information rate (throughput)
• information latency (delay)
• allowable defect densities (noise/distortion)
Availability (basic connectivity) always influences QoE
It is hard to predict the effect of the other parameters on QoE
even when there is only one application (e.g., voice)
When multiple applications are in use - it may be impossible
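To make an availability figure such as five nines concrete, the allowed downtime can be computed directly. This is only a minimal sketch: the function name and the 365-day year are assumptions made for illustration, not part of the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def allowed_downtime_seconds(availability: float,
                             period_s: float = SECONDS_PER_YEAR) -> float:
    """Maximum downtime within a period for a given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_s


if __name__ == "__main__":
    for label, a in [("three nines", 0.999), ("five nines", 0.99999)]:
        minutes = allowed_downtime_seconds(a) / 60
        print(f"{label}: {minutes:.2f} minutes of downtime per year")
```

Five nines leaves roughly five minutes of downtime per year, which is why fast repair times matter so much for such SLAs.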
OAMQoS-YJS Slide 8
Some Applications
System traffic
routing protocols, DNS, DHCP, time delivery, system update, OAM,
tunneling and VPN setup
Business processes
database access, backup and data-center, B2B, ERP
Communications - interactive
voice, video conferencing, telepresence, instant messaging,
remote desktop, application sharing
Communications – non-interactive
email, broadcast programming, music
video : progressive download, live streaming, interactive
Information gathering
http(s), Web 2.0, file transfer
Recreational
gaming, p2p file transfer
Malicious
DoS, malware injection, illicit information retrieval
OAMQoS-YJS Slide 9
What do applications need ?
Some applications only require availability
Some also require minimum available throughput
Some require end-to-end (or round-trip, RT) delay below some bound
Some require packet loss ratio (PLR) less than some percentage
and these parameters are not necessarily independent
For example,
TCP throughput drops with PLR
(plot: TCP throughput versus PLR for 1000 B packets and 50 ms RTT)
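The dependence of TCP throughput on PLR can be illustrated with the widely used Mathis et al. approximation, rate ≈ (MSS / RTT) · C / sqrt(PLR). The slides do not say which model the plot was based on, so the choice of formula, the constant C = sqrt(3/2), and treating the 1000 B packet as the segment size are all assumptions of this sketch.

```python
import math


def tcp_throughput_bps(mss_bytes: float, rtt_s: float, plr: float) -> float:
    """Approximate steady-state TCP throughput (Mathis et al. model)."""
    if plr <= 0.0 or plr > 1.0:
        raise ValueError("plr must be in (0, 1]")
    if rtt_s <= 0.0:
        raise ValueError("rtt_s must be positive")
    c = math.sqrt(1.5)  # assumed constant of the simple model
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(plr)


if __name__ == "__main__":
    # Parameters from the slide: 1000 B packets, 50 ms RTT
    for plr in (1e-4, 1e-3, 1e-2):
        mbps = tcp_throughput_bps(1000, 0.050, plr) / 1e6
        print(f"PLR {plr:.4f}: about {mbps:.2f} Mbps")
```

In this model a hundredfold increase in PLR reduces throughput tenfold, independent of the link capacity.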
OAMQoS-YJS Slide 10
Some rules of thumb
Mission Critical (and life critical) applications require
• high availability
If there are any MC applications
then system traffic requires high availability too
MC applications do not necessarily require strict throughput
but always indirectly require
• a certain minimal average throughput
• bounded delay
If the MC application uses TCP then it requires
• low PLR
Real-time applications require
• sufficient throughput
but not necessarily low PLR (audio and video codecs have PLC)
Interactive applications require
• low RT delay
It may be more scalable for an SP to measure 1-way delays
OAMQoS-YJS Slide 11
OAM
OAMQoS-YJS Slide 12
Monitoring an SLA
The Service Provider’s justification for payment
is the maintenance of an SLA
To ensure SLA compliance, the SP must :
• monitor the SLA parameters
• take action if a parameter drops below its compliance level
But how does the SP verify/ensure that the SLA is being met ?
Monitoring is carried out using
Operations, Administration, Maintenance (OAM)
The customer too may use OAM to see that the SP is compliant !
Technical note:
OAM is a user-plane function
but may influence control and management plane operations
for example
• OAM may trigger protection switching, but doesn’t switch
• OAM may detect provisioned links, but doesn’t provision them
OAMQoS-YJS Slide 13
Operations, Administration, Maintenance
Traditionally, one distinguishes between 2 OAM functionalities :
1. Fault Monitoring
• OAM runs continuously/periodically at required rate
• detection and reporting of anomalies, defects, and failures
• used to trigger mechanisms in the
• control plane (e.g. protection switching) and
• management plane (alarms)
• required for maintenance of basic connectivity (availability)
2. Performance Monitoring
• OAM is run :
• before enabling a service
• on-demand or
• per schedule
• measurement of performance criteria (delay, PDV, etc.)
• required for maintenance of all other QoE attributes
OAMQoS-YJS Slide 14
Early OAM
Analog channels and 64 kbps digital channels
did not have mechanisms to check signal validity and quality
Thus
• major faults could go undetected for long periods of time
• hard to characterize and localize faults when reported
• minor defects might be unnoticed indefinitely
As PDH networks evolved, more and more OAM was added on :
• monitoring for valid signal
• loopbacks
• defect reporting
• alarm indication/inhibition
The OAM overhead started to explode in size !
When SONET/SDH was designed
bounded overhead was reserved for OAM functions
OAMQoS-YJS Slide 15
OAM for Packet Switched Networks
OAM is more complex for Packet Switched Networks
in addition to the previous defects :
• loss of signal
• bit errors
we have new defect types
• packets may be lost
• packets may be delayed
• packets may be delivered to the wrong destination
The first PSN-like network to acquire OAM was ATM (I.610)
Although technically ATM is cell-based, not packet-based
OAMQoS-YJS Slide 16
Some FM OAM mechanisms (1)
How do we perform Continuity Check ?
• send OAM packets at a constant known rate
• if CC packets are not received for >3 intervals then declare a fault
see also LB / echo mode
How do we perform Connectivity Verification ?
• send OAM packets to a known destination
• if CV packets are received somewhere else then declare a fault
How do we indicate AIS (FDI) ?
• when forward traffic is not received, send AIS OAM packets
• if AIS packets received then declare a fault
How do we indicate RDI (BDI) ?
• when reverse traffic is not received, send RDI OAM packets
• if RDI packets received then declare a fault
Note: RDI is often a flag set on CC message
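The Continuity Check rule above (declare a fault when CC packets are missing for more than 3 intervals) can be sketched as a small timer check. The class and method names are hypothetical, and timestamps are passed in explicitly so the logic can be exercised without a real clock.

```python
class ContinuityMonitor:
    """Declares a fault when no CC packet arrives for more than 3 intervals."""

    def __init__(self, interval_s: float, start_s: float = 0.0,
                 max_missed_intervals: float = 3.0):
        self.interval_s = interval_s
        self.max_missed_intervals = max_missed_intervals
        self.last_rx_s = start_s  # assumption: monitoring starts "healthy"

    def on_cc_received(self, now_s: float) -> None:
        """Record the arrival time of a CC packet."""
        self.last_rx_s = now_s

    def is_faulty(self, now_s: float) -> bool:
        """True if the silence has lasted longer than the allowed intervals."""
        silence = now_s - self.last_rx_s
        return silence > self.max_missed_intervals * self.interval_s


if __name__ == "__main__":
    monitor = ContinuityMonitor(interval_s=1.0)
    monitor.on_cc_received(10.0)
    print(monitor.is_faulty(12.5))  # still within 3 intervals
    print(monitor.is_faulty(13.5))  # more than 3 intervals of silence
```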
OAMQoS-YJS Slide 17
Some FM OAM mechanisms (2)
How do we use LoopBack ?
• non-intrusive (in-service) (echo mode)
• send LB request OAM packet to remote site
• remote site replies with LB reply
• if LB reply not received then declare a fault
• intrusive (out-of-service)
• put remote site into LB mode
• remote site reflects (and does not forward) all traffic
(note that it must monitor OAM traffic)
• if packets sent are not received then declare a fault
note: need to inform next hops of LB by locking
How do we use LinkTrace ?
• send LB request OAM packet to next hop
• send LB request to following hop
• etc.
OAMQoS-YJS Slide 18
Some PM OAM mechanisms (1)
How do we measure Packet Loss Ratio ?
• Traffic (counter) based
maintain 2 counters:
• Tx : number of packets transmitted to peer
• Rx : number of packets received from peer
• send Tx counter to peer at time 1 : Tx(1)
• peer notes its Rx counter at time of reception : Rx(2)
and its Tx counter at time of its reply : Tx(3)
• originator notes its Rx counter when reply is received : Rx(4)
calculate PLR in both directions
• Synthetic :
do not maintain counters – use OAM packets
Note : synthetic loss is only a rough estimate
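The counter-based method can be sketched as follows. The slide only lists the four counter values, so computing the ratio from the differences between two successive exchanges is an assumption of this sketch, as are the names `CounterSample` and `plr_between`.

```python
from typing import NamedTuple, Tuple


class CounterSample(NamedTuple):
    tx1: int  # originator Tx counter when the request is sent
    rx2: int  # peer Rx counter when the request arrives
    tx3: int  # peer Tx counter when the reply is sent
    rx4: int  # originator Rx counter when the reply arrives


def _ratio(sent: int, received: int) -> float:
    """Fraction of packets lost; zero if nothing was sent."""
    return (sent - received) / sent if sent > 0 else 0.0


def plr_between(prev: CounterSample, cur: CounterSample) -> Tuple[float, float]:
    """Return (originator-to-peer PLR, peer-to-originator PLR) for the interval."""
    forward = _ratio(cur.tx1 - prev.tx1, cur.rx2 - prev.rx2)
    reverse = _ratio(cur.tx3 - prev.tx3, cur.rx4 - prev.rx4)
    return forward, reverse


if __name__ == "__main__":
    first = CounterSample(tx1=1000, rx2=990, tx3=500, rx4=500)
    second = CounterSample(tx1=2000, rx2=1980, tx3=1500, rx4=1490)
    print(plr_between(first, second))
```

Using differences between samples means the counters never need to be reset or started at the same moment.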
How do we measure Throughput?
• Primitive way (RFC 2544)
• send packets at maximum rate and observe packet loss
• reduce rate until no loss is observed
Note : there are more sophisticated mechanisms !
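The primitive procedure described above (start at maximum rate, step down until no loss is seen) can be sketched as a simple loop. The `measure_loss` callback stands in for a real traffic generator and is purely hypothetical, as is the fixed step size.

```python
from typing import Callable


def find_lossless_rate(max_rate_bps: float, step_bps: float,
                       measure_loss: Callable[[float], float]) -> float:
    """Step the offered rate down from the maximum until no loss is observed."""
    if step_bps <= 0:
        raise ValueError("step_bps must be positive")
    rate = max_rate_bps
    while rate > 0:
        if measure_loss(rate) == 0.0:
            return rate
        rate -= step_bps
    return 0.0  # no lossless rate found


def simulated_link(capacity_bps: float) -> Callable[[float], float]:
    """Toy link model (assumption): loss is the fraction exceeding capacity."""
    def measure(rate_bps: float) -> float:
        return max(0.0, (rate_bps - capacity_bps) / rate_bps)
    return measure


if __name__ == "__main__":
    rate = find_lossless_rate(100e6, 5e6, simulated_link(70e6))
    print(f"lossless rate: {rate / 1e6:.0f} Mbps")
```

The result is only as fine-grained as the step, which is one reason the more sophisticated mechanisms mentioned in the note exist.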
OAMQoS-YJS Slide 19
Some PM OAM mechanisms (2)
How do we measure 1-way Packet Delay (Latency) ?
synchronize clocks at both OAM peers
• send timestamp T1 to peer
• peer timestamps receipt with T2
calculate time difference T2 – T1
How do we measure 2-way Packet Delay (Latency) ?
• send timestamp T1 to peer
• peer timestamps receipt with T2
• peer replies at T3
• originator timestamps receipt of reply at T4
calculate time difference (T4 – T1) – (T3 – T2)
assuming symmetry, 1-way delay is half this amount
Note : do not need to synchronize clocks
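The 2-way calculation (T4 – T1) – (T3 – T2) can be checked with a short sketch. The function name and the example timestamps (in microseconds, with the peer clock deliberately offset by 5 s) are assumptions chosen to show that the offset cancels.

```python
def two_way_delay(t1: int, t2: int, t3: int, t4: int) -> int:
    """Round-trip delay excluding the peer's processing time.

    t1 and t4 are read from the originator's clock,
    t2 and t3 from the peer's clock.
    """
    return (t4 - t1) - (t3 - t2)


if __name__ == "__main__":
    offset = 5_000_000          # peer clock runs 5 s ahead (unknown in practice)
    t1 = 0                      # request sent (originator clock)
    t2 = offset + 10_000        # request received after 10 ms (peer clock)
    t3 = t2 + 2_000             # reply sent after 2 ms of processing (peer clock)
    t4 = 22_000                 # reply received (originator clock)
    rtt = two_way_delay(t1, t2, t3, t4)
    print(f"two-way delay: {rtt} us, one-way estimate: {rtt // 2} us")
```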
How do we measure Packet Delay Variation ?
• send timestamps at a constant rate
• peer calculates timestamp differences and statistics thereof
Note : do not need to synchronize clocks
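A sketch of the PDV calculation at the receiving peer follows. The slides do not say which PDV statistic is used; taking each packet's delay relative to the minimum observed delay is one common choice and is an assumption here, as is the function name.

```python
from typing import List, Sequence


def delay_variation(tx_times: Sequence[int], rx_times: Sequence[int]) -> List[int]:
    """Per-packet delay variation relative to the smallest observed delay.

    tx_times come from the sender's clock and rx_times from the receiver's;
    any constant offset between the clocks cancels in the subtraction.
    """
    if not tx_times or len(tx_times) != len(rx_times):
        raise ValueError("need two non-empty sequences of equal length")
    raw = [rx - tx for tx, rx in zip(tx_times, rx_times)]
    base = min(raw)
    return [d - base for d in raw]


if __name__ == "__main__":
    offset = 123_456                      # unknown clock offset, in microseconds
    tx = [0, 10_000, 20_000]              # timestamps sent at a constant rate
    rx = [offset + 5_000, offset + 17_000, offset + 25_500]
    pdv = delay_variation(tx, rx)
    print(pdv, "max:", max(pdv))
```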
OAMQoS-YJS Slide 20
ETHERNET OAM
OAMQoS-YJS Slide 21
What about Ethernet ?
Carrier Ethernet has replaced ATM as the default layer-2
Ethernet is by far the most widespread network interface
Ethernet has some advantages as compared to ATM
• it has network-wide unique addresses
• it has a source address in every packet
but some aspects make Ethernet OAM more difficult
• ConnectionLess (CL)
• multipoint to multipoint
• overlapping layering – need OAM for operator, SPs, customer
• some specific problematic ETH behaviors (flooding, multicast, …)
OAMQoS-YJS Slide 22
What’s the problem with CL ?
OAM makes a lot of sense in Connection Oriented environments
• connections last a relatively long amount of time
• there is some SLA at the connection level
For CL networks, the network path is neither known nor pinned
So it doesn’t really make sense to talk about FM
what does continuity mean if, when a link goes down,
the network automatically reroutes around the failure ?
The Ethernet CL problem is solved by overlaying CO functionality :
• flows or
• EVCs
OAMQoS-YJS Slide 23
Ethernet OAM
For many years there was no OAM for Ethernet
(LANs don’t need OAM)
now there are two incompatible ones!
• Link layer OAM – 802.3 clause 57 (EFM OAM, 802.3ah)
single link only
slow protocol, limited functionality
some management functions
• Service OAM – Y.1731, 802.1ag (CFM)
any network configuration
multilevel OAM functionality
In some cases one may need to run both
while in others only service OAM makes sense
Link layer OAM is only for a single link, which is necessarily CO
Service OAM is most frequently used for infrastructure networks,
which are also CO
OAMQoS-YJS Slide 24
Layer 2 control protocols (L2CPs)
Do not be confused - L2CPs are NOT OAM !
Here are a few well-known L2CPs :
protocol | DA | reference
STP/RSTP/MSTP | 01-80-C2-00-00-00, 802.2 LLC | 802.1D §8,9, 802.1D §17, 802.1Q §13
PAUSE | 01-80-C2-00-00-01 | 802.3 §31B, 802.3x
LACP/LAMP | 01-80-C2-00-00-02, EtherType 88-09, Subtype 01 and 02 | 802.3 §43 (ex 802.3ad)
Link OAM | 01-80-C2-00-00-02, EtherType 88-09, Subtype 03 | 802.3 §57 (ex 802.3ah)
ESMC | 01-80-C2-00-00-02, EtherType 88-09, Subtype 10 | G.8264
Port Authentication | 01-80-C2-00-00-03 | 802.1X
E-LMI | 01-80-C2-00-00-07 | MEF-16
Provider MSTP | 01-80-C2-00-00-08 | 802.1D § 802.1ad
Provider MMRP | 01-80-C2-00-00-0D | 802.1ak
LLDP | 01-80-C2-00-00-0E, EtherType 88-CC | 802.1AB-2009
GARP (GMRP, GVRP) | Block 01-80-C2-00-00-20 through 01-80-C2-00-00-2F | 802.1D §10, 11, 12
Note: IEEE disallows forwarding of L2CPs, MEF allows it under certain circumstances
OAMQoS-YJS Slide 25
Link Layer OAM (AKA EFM OAM)
Ethernet in the First Mile (Last Mile ?)
EFM networks are mostly p2p DSL links or p2mp PONs
thus a link layer OAM is sufficient for EFM applications
Since EFM link is between customer and Service Provider
EFM OAM entities are either active (SP) or passive (customer)
active entity can place passive one into LB mode but not the reverse
EFM OAMPDUs are slow protocol frames – never forwarded
Ethertype = 88-09 and subtype 03
messages multicast to slow protocol specific group address
OAMPDUs must be sent once per second (heartbeat)
messages are TLV-based
DA (01-80-C2-00-00-02) | SA | TYPE (88-09) | SUBTYPE (03) | FLAGS (2B) | CODE (1B) | DATA | CRC
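The OAMPDU layout above can be sketched with Python's struct module. The function name is hypothetical, the CRC and minimum-frame padding are deliberately left out (they are normally added by the MAC), and the flags and code values in the usage example are placeholders to be checked against 802.3 clause 57.

```python
import struct

SLOW_PROTOCOLS_DA = bytes.fromhex("0180c2000002")  # slow protocols group address
SLOW_PROTOCOLS_ETHERTYPE = 0x8809
OAM_SUBTYPE = 0x03


def build_oampdu(src_mac: bytes, flags: int, code: int, data: bytes = b"") -> bytes:
    """Build DA | SA | TYPE | SUBTYPE | FLAGS (2B) | CODE (1B) | DATA, without CRC."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    header = struct.pack("!6s6sHBHB", SLOW_PROTOCOLS_DA, src_mac,
                         SLOW_PROTOCOLS_ETHERTYPE, OAM_SUBTYPE, flags, code)
    return header + data


if __name__ == "__main__":
    # Placeholder values: flags = 0 and code = 0 are used only for illustration.
    frame = build_oampdu(bytes.fromhex("020000000001"), flags=0x0000, code=0x00)
    print(frame.hex("-"))
```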
OAMQoS-YJS Slide 26
EFM OAM capabilities
6 codes are defined
• Information (autodiscovery, heartbeat, fault notification)
• Event notification (statistics reporting)
• Variable request (active entity queries passive’s configuration) (mngt)
• Variable response (passive entity responds to query) (mngt)
• Loopback control (active entity enable/disable of intrusive LB mode)
• Organization specific (proprietary extensions)
and there are flags in every OAMPDU to
expedite notification of critical events
• link fault (RDI)
• dying gasp
• unspecified
monitor slow degradations in performance
OAMQoS-YJS Slide 27
Service OAM (AKA CFM, Y.1731)
Many SPs need to monitor full networks
not just single links
Service layer OAM provides end-to-end integrity
of the Ethernet service over arbitrary server layers
Because Ethernet is flat
not true client-server layering (except MAC-in-MAC)
service layer OAM is multilevel
Because SPs want to replace transport networks with Ethernet
service OAM must support all OAM features
and must enable advanced transport capabilities
(such as linear/ring protection switching)
a transport network is a network with :
1. High availability (Fault Management OAM and Automatic Protection Switching)
2. SLA support (Performance Management OAM and QoS mechanisms)
3. a Management plane (optionally a control plane) for configuration and provisioning
4. Efficiency and Scalability
OAMQoS-YJS Slide 28
Y.1731 messages
Y.1731 supports many OAM message types:
• Continuity Check : proactive heartbeat with 7 possible rates
• Synthetic Loss Measurement : on demand loss rate estimation
• LoopBack : unicast/multicast pings with optional patterns
• Link Trace : identify path taken to detect failures and loops
• AIS : periodically sent when CC fails
• RDI : flag set to indicate reverse defect
• Client Signal Fail : sent by MEP when client doesn’t support AIS
• LoCK signal : inform peer entity about diagnostic actions
• TeST signal : in-service/out-of-service tests for loss rate, etc.
• Automatic Protection Switching
• Maintenance Communications Channel : remote maintenance
• EXPerimental
• Vendor SPecific
OAMQoS-YJS Slide 29
Y.1731 frame format
after DA, SA and Ethertype (8902)
Y.1731/802.1ag PDUs have the following header
(may be VLAN tagged)
LEVEL (3b) | VER (5b) | OPCODE (1B) | FLAGS (1B) | TLV-OFF (1B)
if there are sequence numbers/timestamp(s)
they immediately follow
then come TLVs, the “end TLV”, followed by the CRC
TLVs have 1B type and 2B length fields
there may or may not be a value field
the “end-TLV” has type = zero and no length or value fields
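The common header and TLV encoding described above can be sketched as follows. The function names are hypothetical, and the TLV offset used in the example is only an illustrative value; the Ethernet header (DA, SA, Ethertype 8902) and the CRC are not included.

```python
import struct

CFM_ETHERTYPE = 0x8902  # precedes this header on the wire


def build_cfm_header(level: int, version: int, opcode: int,
                     flags: int, tlv_offset: int) -> bytes:
    """Pack LEVEL (3b) | VER (5b) | OPCODE (1B) | FLAGS (1B) | TLV-OFF (1B)."""
    if not 0 <= level <= 7:
        raise ValueError("level must fit in 3 bits")
    if not 0 <= version <= 31:
        raise ValueError("version must fit in 5 bits")
    first_byte = (level << 5) | version
    return struct.pack("!BBBB", first_byte, opcode, flags, tlv_offset)


def encode_tlv(tlv_type: int, value: bytes = b"") -> bytes:
    """Encode a TLV: 1B type, 2B length, value; the end TLV is a lone zero byte."""
    if tlv_type == 0:
        return b"\x00"
    return struct.pack("!BH", tlv_type, len(value)) + value


if __name__ == "__main__":
    # Opcode 1 is CCM in the slide's PDU table; level 5 is a customer level.
    pdu = build_cfm_header(level=5, version=0, opcode=1, flags=0, tlv_offset=70)
    pdu += encode_tlv(0)
    print(pdu.hex())
```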
OAMQoS-YJS Slide 30
Y.1731 PDU types
opcode | OAM Type | DA
1 | CCM | M1 or U
3 | LBM | M1 or U
2 | LBR | U
5 | LTM | M2
4 | LTR | U
6-31 | RES IEEE |
32-63 unused | RES ITU-T |
33 | AIS | M1 or U
35 | LCK | M1 or U
37 | TST | M1 or U
39 | Linear APS | M1 or U
40 | Ring APS | M1 or U
41 | MCC | M1 or U
43 | LMM | M1 or U
42 | LMR | U
45 | 1DM | M1 or U
47 | DMM | M1 or U
46 | DMR | U
49 | EXM |
48 | EXR |
51 | VSM |
50 | VSR |
52 | CSF | M1 or U
55 | SLM | U
54 | SLR | U
64-255 | RES IEEE |
OAMQoS-YJS Slide 31
MEPs and MIPs
Maintenance Entity (ME) – entity that requires maintenance
ME is a relationship between ME end points
because Ethernet is MP2MP, we need to define an ME Group (MEG)
MEGs can be nested, but not overlapped
MEG LEVEL takes a value 0 … 7
by default - 0,1,2 operator, 3,4 SP, 5,6,7 customer
MEP = MEG end point
(MEG = ME group, ME = Maintenance Entity)
(in IEEE MEG is called MA = Maintenance Association)
unique MEG IDs specify to which MEG we send the OAM message
MEPs are responsible for preventing OAM messages from leaking out of the MEG
but transparently transfer OAM messages of higher level
MIPs = MEG Intermediate Points
• never originate OAM messages,
• process some OAM messages
• transparently transfer others
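The level rules above can be sketched as two small functions. The default level-to-role split is taken from the slide; treating frames at the MEP's own level as processed and frames at lower levels as dropped is this sketch's reading of "not leaking out", and the function names are hypothetical.

```python
def default_role(meg_level: int) -> str:
    """Default MEG level assignment: 0-2 operator, 3-4 SP, 5-7 customer."""
    if not 0 <= meg_level <= 7:
        raise ValueError("MEG level must be 0..7")
    if meg_level <= 2:
        return "operator"
    if meg_level <= 4:
        return "service provider"
    return "customer"


def mep_action(mep_level: int, frame_level: int) -> str:
    """Decide what a MEP does with an OAM frame of a given MEG level."""
    if frame_level > mep_level:
        return "forward"   # higher-level OAM passes through transparently
    if frame_level == mep_level:
        return "process"   # OAM of the MEP's own MEG terminates here
    return "drop"          # assumption: lower-level OAM must not leak out


if __name__ == "__main__":
    for level in (1, 3, 6):
        print(level, default_role(level), mep_action(3, level))
```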
OAMQoS-YJS Slide 32
MEPs and MIPs (cont.)
OAMQoS-YJS Slide 33
How is OAM used ?
MEF-30 Service OAM FM
and MEF-xx Service OAM PM
describe the use of OAM for Carrier Ethernet networks, such as
• which Y.1731/802.1 features/messages should be used
• where to put MEPs, which MA names and MEG levels should be used
• minimum number of EVCs that must be supported
• what should be reported and how
Y.1564 (ex Y.156sam) Ethernet Service Activation Test Methodology
describes commissioning procedures (replaces RFC2544-like benchmarking)
Tests that desired performance level can be achieved, including
• CIR, EIR (and optionally CBS and EBS for bursting)
• traffic policing
• rate, loss, delay, delay variation, availability (measured simultaneously)
Testing in two steps :
• Service Configuration Test – each service separately
• Service Performance Test – all services together
Performance testing may be for :
• 15 minutes (new service on operational network)
• 2 hours (single operator network)
• 24 hours (multiple operator networks)
OAMQoS-YJS Slide 34
QOS ENFORCEMENT
OAMQoS-YJS Slide 35
QoS approaches
There are two approaches to QoS handling
IntServ (guaranteed QoS)
• define traffic flows (CO approach)
• guarantee QoS attributes for each flow
• reserve resources at each router along the flow
• signaling protocol (e.g., RSVP) needed
DiffServ (statistical QoS)
• retain CL paradigm
• no guaranteed QoS attributes
• mark packets (differentiated – e.g., gold, silver, bronze)
• marking can be by VLAN, P-bits, IP-ToS/DSCP, or general “flow”
• offer special treatment (priority) relative to other packets
• no resource reservation
For Ethernet and IP, DiffServ is the preferred approach
OAMQoS-YJS Slide 36
Some fields for marking
Example:
For an IPv4 packet inside Q-in-Q Ethernet
we have various choices for marking priority
DA (6B)
SA (6B)
ET=88A8 (2B) | P(3b) DEI(1b) SVID(12b)
ET=8100 (2B) | P(3b) CFI(1b) CVID(12b)
ET=0800 (2B)
Ver(4b) IHL(4b) ToS(1B) Len(2B) ... Source IP Address (4B) Destination IP Address (4B)
802.1p
• user priority field AKA P-bits 0 … 7
• priority tagging (VLAN=0) if no VLAN
• P=0 means non-expedited traffic
• 802.1Q recommends mappings
IP ToS
RFC 2474 redefined ToS to contain
• 6 bit DSCP (see also RFC 4594)
• 2 bit ECN
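To show where the marking fields sit in such a frame, the following sketch extracts the P-bits from both tags and the DSCP/ECN bits from the ToS byte. It assumes the outer tag is the S-tag (88A8) followed by the C-tag (8100), and the function name and example values are hypothetical.

```python
import struct
from typing import Dict


def parse_markings(frame: bytes) -> Dict[str, int]:
    """Extract priority markings from an IPv4-in-Q-in-Q Ethernet frame."""
    s_et, s_tci, c_et, c_tci, ip_et = struct.unpack_from("!HHHHH", frame, 12)
    if (s_et, c_et, ip_et) != (0x88A8, 0x8100, 0x0800):
        raise ValueError("not an IPv4 packet inside Q-in-Q")
    tos = frame[23]  # second byte of the IPv4 header
    return {
        "s_pbits": s_tci >> 13, "s_dei": (s_tci >> 12) & 1, "svid": s_tci & 0x0FFF,
        "c_pbits": c_tci >> 13, "cvid": c_tci & 0x0FFF,
        "dscp": tos >> 2, "ecn": tos & 0x03,
    }


if __name__ == "__main__":
    s_tci = (5 << 13) | (1 << 12) | 100   # P=5, drop eligible, SVID 100
    c_tci = (3 << 13) | 200               # P=3, CVID 200
    tos = 46 << 2                         # DSCP 46, ECN 0
    frame = (bytes(12)                    # DA and SA, zeroed for the example
             + struct.pack("!HHHHH", 0x88A8, s_tci, 0x8100, c_tci, 0x0800)
             + bytes([0x45, tos]))
    print(parse_markings(frame))
```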
OAMQoS-YJS Slide 37
Queuing
Ethernet switches have queues (FIFO buffers)
on each output port
If there were only one queue
then traffic handling would be FIFO
To enable DiffServ prioritization
multiple queues are used
Outgoing frames are inserted into queues
according to priority marking
Many methods for emptying queues
The most popular are :
• Strict Priority
always take from nonempty queue
of highest priority
• Weighted Fair Queuing
take from nonempty queues according
to configured “weight”
(figure: input ports, switch fabric, queues, output ports)
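The two queue-emptying disciplines can be sketched as follows. Strict priority is implemented as described; for the weighted case, a simple weighted round-robin is used as a stand-in, since real Weighted Fair Queuing also accounts for frame lengths. All names are hypothetical.

```python
from collections import deque
from typing import Deque, Iterator, List, Optional


def strict_priority_pick(queues: List[Deque[str]]) -> Optional[str]:
    """Take from the nonempty queue of highest priority (index 0 is highest)."""
    for q in queues:
        if q:
            return q.popleft()
    return None


def weighted_round_robin(queues: List[Deque[str]],
                         weights: List[int]) -> Iterator[str]:
    """Serve up to `weight` frames from each nonempty queue per round."""
    if len(queues) != len(weights) or any(w <= 0 for w in weights):
        raise ValueError("need one positive weight per queue")
    while any(queues):
        for q, w in zip(queues, weights):
            for _ in range(w):
                if not q:
                    break
                yield q.popleft()


if __name__ == "__main__":
    gold = deque(["g1", "g2", "g3"])
    bronze = deque(["b1", "b2"])
    print(list(weighted_round_robin([gold, bronze], [2, 1])))
```

With strict priority a busy high-priority queue can starve the others, whereas the weighted scheme gives every queue a share of the output port.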
OAMQoS-YJS Slide 38
Traffic shaping
One of the most important parts of an SLA is the
Committed Information Rate (bps)
This is the data rate (bandwidth) that the SP guarantees will be forwarded
There may also be an
Excess Information Rate (bps)
This is a datarate that the SP will forward if possible
Packet traffic is often bursty
A customer who did not send data for a while
will expect to be able to send a higher rate afterwards
This is accomplished via traffic shaping
• time integration is accomplished by leaky/token buckets
• the effect of shaping is marking drop eligibility
(marking a packet on the line is only possible with S-tags!)
There is often also traffic policing
policing simply discards packets to police a maximum rate !
OAMQoS-YJS Slide 39
MEF token bucket algorithm
Metro Ethernet Forum 10.x defines a bandwidth profile
there are two byte buckets, C of size CBS and E of size EBS (in bytes)
tokens are added to the buckets at rate CIR/8 and EIR/8
when bucket overflows tokens are lost (use it or lose it)
for simplicity we assume
• no coupling and
• no sharing !
if ingress frame length < number of tokens in C bucket
    frame is green and its length in tokens is debited from C bucket
else if ingress frame length < number of tokens in E bucket
    frame is yellow and its length in tokens is debited from E bucket
else frame is red
green frames are delivered
and service objectives apply
yellow frames are delivered
but service objectives don’t apply
red frames are discarded
(figure: bucket C of size CBS and bucket E of size EBS)
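The two-bucket algorithm can be sketched as a small meter class. It follows the comparison as written on the slide and the same simplifications (no coupling, no sharing); starting with full buckets, refilling lazily on each frame arrival, and the class name are assumptions of this sketch.

```python
class TwoBucketMeter:
    """Colors frames green, yellow, or red using C and E token buckets."""

    def __init__(self, cir_bps: float, cbs_bytes: float,
                 eir_bps: float, ebs_bytes: float, start_s: float = 0.0):
        self.cir_bps, self.cbs = cir_bps, cbs_bytes
        self.eir_bps, self.ebs = eir_bps, ebs_bytes
        self.c = cbs_bytes   # assumption: buckets start full
        self.e = ebs_bytes
        self.last_s = start_s

    def color(self, now_s: float, frame_len: int) -> str:
        """Refill both buckets for the elapsed time, then classify the frame."""
        elapsed = now_s - self.last_s
        self.last_s = now_s
        # Tokens (bytes) arrive at CIR/8 and EIR/8; overflow is lost.
        self.c = min(self.cbs, self.c + elapsed * self.cir_bps / 8)
        self.e = min(self.ebs, self.e + elapsed * self.eir_bps / 8)
        if frame_len < self.c:
            self.c -= frame_len
            return "green"
        if frame_len < self.e:
            self.e -= frame_len
            return "yellow"
        return "red"


if __name__ == "__main__":
    meter = TwoBucketMeter(cir_bps=8000, cbs_bytes=1500, eir_bps=8000, ebs_bytes=1500)
    for t in (0.0, 0.0, 0.0, 1.0):
        print(t, meter.color(t, 1000))
```

A burst of three back-to-back frames exhausts first the C bucket and then the E bucket, while a sender that pauses for a second regains its committed allowance.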
OAMQoS-YJS Slide 40