Network reliability and QoS measurements

Henning Schulzrinne
University of Cincinnati
March 2003
Overview
• The IRT Lab at Columbia University
• Application: Internet multimedia
• Quality of service =
  - scheduling and admission control → thousands of papers…
  - network signaling
  - end-system performance → embedded end systems + PCs
  - QoS → network application reliability
Laboratory overview
• 11 PhDs
  - 3 at IBM, Lucent, Telcordia
• 5 MS
• Visitors (Ericsson, Fujitsu, Mitsubishi, Nokia, U. Coimbra, U. Oulu, …)
• China, Finland, Greece, India, Japan, Portugal, Spain, Sweden, US, Taiwan
IRT topics
• Internet multimedia protocols and systems
  - Internet telephony and radio (SIP, RTSP, RTP)
  - Content distribution networks
  - Internet-scale event distribution
  - Service creation
  - Ubiquitous, context-aware computing and communications
  - Protocols and services for wireless ad-hoc networks
  - Service discovery
• Quality of service
  - Pricing for adaptive services
  - Scalable resource reservation protocols (CASP, BGRP, YESSIR)
  - End-system evaluation
  - Network measurements
  - Service reliability
Internet multimedia
• Internet telephony = replacing the existing circuit-switched system with Internet-based systems
  - Signaling and services
  - Quality of service philosophies:
    - end systems adapt and compensate
      - end systems use FEC (forward error correction), LBR (low-bit-rate redundancy), PLC (packet loss concealment)
      - jitter → playout delay compensation
    - network offers guarantees → difficult architecturally and as a business, not necessarily technically
    - we pursue both
Assessment of VoIP Service Availability
Wenyu Jiang
Henning Schulzrinne
IRT Lab, Dept. of Computer Science
Columbia University
Overview
(on-going work, preliminary results, still looking for measurement sites, …)
• Service availability
• Measurement setup
• Measurement results
  - call success probability
  - overall network loss
  - network outages
  - outage-induced call abortion probability
Service availability
• Users do not care about QoS
  - at least not about packet loss, jitter, delay
  - FEC and PLC can deal with losses up to 5-8%
• rather, it’s service availability → how likely is it that I can place a call and not get interrupted?
• availability = MTBF / (MTBF + MTTR)
  - MTBF = mean time between failures
  - MTTR = mean time to repair
• availability = successful calls / first call attempts
• equipment availability: 99.999% (“5 nines”) → 5 minutes/year
• AT&T: 99.98% availability (1997)
• IP frame relay SLA: 99.9%
• UK mobile phone survey: 97.1-98.8%
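The time-based availability definition above reduces to simple arithmetic. A minimal Python sketch, assuming a 365.25-day year; the MTBF/MTTR pair in the usage lines is hypothetical, and only the time-based figures from the slide are converted (the UK mobile survey figure may be a call-based rate, so it is left out):

```python
# Minimal sketch of the availability arithmetic on the slide above.
# Assumption: a 365.25-day year; the MTBF/MTTR example is hypothetical.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes per year during which the service is unavailable."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    quoted = [
        ("equipment, 5 nines", 0.99999),
        ("AT&T (1997)", 0.9998),
        ("IP frame relay SLA", 0.999),
    ]
    for label, a in quoted:
        print(f"{label:20s} {a:.5f} -> {downtime_minutes_per_year(a):8.1f} min/year")
    # Hypothetical element: fails every 1,000 h on average, takes 1 h to repair.
    print(f"MTBF=1000 h, MTTR=1 h -> availability {availability(1000, 1):.5f}")
```

At 99.98%, the 1997 AT&T figure corresponds to roughly 105 minutes of downtime per year, about twenty times the “5 nines” budget.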
Availability – PSTN metrics
PSTN metrics (World Bank study):
• fault rate
  - “should be less than 0.2 per main line”
• fault clearance (~ MTTR)
  - “next business day”
• call completion rate
  - during network busy hour
  - “varies from about 60% - 75%”
• dial tone delay
Example PSTN statistics
Source: World Bank
Measurement setup
Node name       Location                            Connectivity   Network
columbia        Columbia University, NY             >= OC3         I2
wustl           Washington U., St. Louis                           I2
unm             Univ. of New Mexico                                I2
epfl            EPFL, Lausanne, CH                                 I2+
hut             Helsinki University of Technology                  I2+
rr              NYC                                 cable modem    ISP
rrqueens        Queens, NY                          cable modem    ISP
njcable         New Jersey                          cable modem    ISP
newport         New Jersey                          ADSL           ISP
sanjose         San Jose, California                cable modem    ISP
suna            Kitakyushu, Japan                   3 Mb/s         ISP
sh              Shanghai, China                     cable modem    ISP
Shanghaihome    Shanghai, China                     cable modem    ISP
Shanghaioffice  Shanghai, China                     ADSL           ISP
Measurement setup
• Active measurements
  - call duration 3 or 7 minutes
  - UDP packets:
    - 36 bytes alternating with 72 bytes (FEC)
    - 40 ms spacing
• September 10 to December 6, 2002
• 13,500 call hours
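As a rough sense of scale for the probe stream above, a short Python sketch. It assumes the 36/72-byte sizes are UDP payloads (headers excluded), that the 72-byte packets are the ones carrying FEC, and that the 40 ms spacing is exact:

```python
# Hypothetical constants restating the probe-stream parameters from the slide.
PACKET_SPACING_S = 0.040      # one UDP probe every 40 ms
PAYLOAD_SIZES = (36, 72)      # alternating sizes; presumably the 72-byte ones carry FEC


def packets_per_call(duration_min: float) -> int:
    """Number of probes sent during one measurement call."""
    return round(duration_min * 60 / PACKET_SPACING_S)


def mean_payload_rate_bps() -> float:
    """Average UDP payload bit rate, ignoring UDP/IP headers (assumption)."""
    mean_bytes = sum(PAYLOAD_SIZES) / len(PAYLOAD_SIZES)
    return mean_bytes * 8 / PACKET_SPACING_S


if __name__ == "__main__":
    for minutes in (3, 7):
        print(f"{minutes}-minute call: {packets_per_call(minutes)} packets")
    print(f"mean payload rate: {mean_payload_rate_bps() / 1000:.1f} kb/s")
```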
Call success probability
• 62,027 calls succeeded, 292 failed → 99.53% availability
• roughly constant across I2, I2+, commercial ISPs

Paths                     Call success
All                       99.53%
Internet2                 99.52%
Internet2+                99.56%
Commercial                99.51%
Domestic (US)             99.45%
International             99.58%
Domestic commercial       99.39%
International commercial  99.59%
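The headline figure follows directly from the call counts. A small Python sketch that reproduces it and adds a normal-approximation 95% confidence interval (the interval is an illustrative addition, not something the slides report):

```python
from __future__ import annotations

import math


def call_success_probability(succeeded: int, failed: int) -> float:
    """Fraction of first call attempts that succeeded."""
    return succeeded / (succeeded + failed)


def normal_ci95(p: float, n: int) -> tuple[float, float]:
    """Approximate 95% confidence interval (normal approximation to the binomial)."""
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


if __name__ == "__main__":
    ok, failed = 62_027, 292
    p = call_success_probability(ok, failed)
    lo, hi = normal_ci95(p, ok + failed)
    print(f"availability = {p:.2%} (95% CI {lo:.2%} to {hi:.2%})")
```

With about 62,000 attempts the interval is roughly ±0.05 percentage points. The slides do not give the number of calls per path category, so this sketch cannot say whether the differences between rows of the table exceed sampling noise.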
Overall network loss
• PSTN: once connected, call usually of good quality
  - exception: mobile phones
• compute periods of time below loss threshold
  - 5% causes degradation for many codecs
  - others acceptable till 20%

Percentage of time with packet loss at or below each threshold:
Paths      0%      5%      10%     20%
All        82.3    97.48   99.16   99.75
ISP        78.6    96.72   99.04   99.74
I2         97.7    99.67   99.77   99.79
I2+        86.8    98.41   99.32   99.76
US         83.6    96.95   99.27   99.79
Int.       81.7    97.73   99.11   99.73
US ISP     73.6    95.03   98.92   99.79
Int. ISP   81.2    97.60   99.10   99.71
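The slide does not say how long each measurement interval was. A minimal Python sketch of the computation, assuming non-overlapping windows of 25 packets (1 s at the 40 ms probe spacing, chosen only for illustration) and reading the 0% column as “loss at or below the threshold”; the trace is hypothetical:

```python
from __future__ import annotations

from typing import Sequence


def window_loss_rates(received: Sequence[bool], window: int) -> list[float]:
    """Loss rate in each non-overlapping window of `window` packets (partial tail dropped)."""
    rates = []
    for start in range(0, len(received) - window + 1, window):
        chunk = received[start:start + window]
        rates.append(1.0 - sum(chunk) / window)
    return rates


def fraction_at_or_below(rates: Sequence[float], threshold: float) -> float:
    """Share of windows whose loss rate does not exceed `threshold`."""
    return sum(1 for r in rates if r <= threshold) / len(rates)


if __name__ == "__main__":
    # Hypothetical trace: a clean second, a second with one loss, a fully lost second.
    trace = [True] * 25 + [True] * 24 + [False] + [False] * 25
    rates = window_loss_rates(trace, window=25)
    for thr in (0.0, 0.05, 0.10, 0.20):
        print(f"loss <= {thr:4.0%}: {fraction_at_or_below(rates, thr):.1%} of time")
```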
Network outages
• sustained packet losses
  - arbitrarily defined at 8 packets
  - far beyond any recoverable loss (FEC, interpolation)
• outages make up a significant part (23%) of the 0.25% unavailability
• symmetric: A→B ⇔ B→A
• spatially correlated: A→B ⇔ A→X
• not correlated across networks (e.g., I2 and commercial)
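A minimal Python sketch of the outage definition above: any run of at least 8 consecutive lost probes (8 × 40 ms = 320 ms at the probe spacing) counts as an outage. The function names and the sample trace are hypothetical:

```python
from __future__ import annotations

from typing import Iterable

PACKET_SPACING_S = 0.040   # probe spacing from the measurement setup
OUTAGE_MIN_PACKETS = 8     # the slide's (arbitrary) outage threshold


def received_mask(seqs: Iterable[int], total: int) -> list[bool]:
    """Turn the sequence numbers seen at the receiver into a per-packet received flag."""
    seen = set(seqs)
    return [i in seen for i in range(total)]


def find_outages(received: list[bool], min_run: int = OUTAGE_MIN_PACKETS) -> list[tuple[int, int]]:
    """Return (first_lost_index, run_length) for every run of >= min_run consecutive losses."""
    outages = []
    run_start = None
    for i, ok in enumerate(received):
        if not ok and run_start is None:
            run_start = i                      # a loss run begins
        elif ok and run_start is not None:
            if i - run_start >= min_run:
                outages.append((run_start, i - run_start))
            run_start = None                   # the run ended
    if run_start is not None and len(received) - run_start >= min_run:
        outages.append((run_start, len(received) - run_start))  # run lasts to the end
    return outages


if __name__ == "__main__":
    # Hypothetical call: packets 5-7 lost (short burst), 10-19 and 21-28 lost (outages).
    got = [s for s in range(29) if not (5 <= s <= 7 or 10 <= s <= 19 or s >= 21)]
    for start, length in find_outages(received_mask(got, 29)):
        print(f"outage at packet {start}: {length} packets = {length * PACKET_SPACING_S:.2f} s")
```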
Network outages
[Plot: complementary CDF (log scale) of outage duration (sec, 0-400); one panel compares US domestic and international paths, the other all paths and Internet2]
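The plotted curves are empirical complementary CDFs, P[duration > x]. A short Python sketch of how such a curve can be computed from a list of outage durations; the sample durations are made up, and a log-scale plot cannot show the final point, where the probability reaches zero:

```python
from __future__ import annotations


def empirical_ccdf(durations: list[float]) -> list[tuple[float, float]]:
    """(x, P[D > x]) at every distinct observed duration x, in increasing order of x."""
    xs = sorted(durations)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        if i + 1 < n and xs[i + 1] == x:
            continue  # emit each distinct value once, after its last occurrence
        points.append((x, (n - (i + 1)) / n))
    return points


if __name__ == "__main__":
    # Hypothetical outage durations in seconds (not the measured data).
    sample = [0.4, 1.0, 1.0, 2.5, 12.0, 300.0]
    for x, p in empirical_ccdf(sample):
        print(f"P[outage > {x:6.1f} s] = {p:.3f}")
```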
Network outages
Paths  No. of    %          Mean duration  Median duration  Total       Outages > 1000
       outages   symmetric  (packets)      (packets)        (all, h:m)  packets (h:m)
all    10,753    30%        145            25               17:20       10:58
I2     819       14.5%      360            25               3:17        2:33
I2+    2,708     10%        259            26               7:47        5:37
ISP    8,045     37%        107            24               9:33        4:58
US     1,777     18%        269            20               5:18        3:53
Int.   8,976     33%        121            26               12:02       6:42
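The durations in this table are consistent with being counted in packets: multiplying the number of outages by the mean duration and the 40 ms probe spacing reproduces the “total” column to within a few minutes for every row, which is as close as the rounded means allow. A Python sketch of that cross-check (an illustrative addition, not part of the original analysis):

```python
PACKET_SPACING_S = 0.040  # probe spacing from the measurement setup

# Values copied from the table: (number of outages, mean duration in packets, total "h:m").
ROWS = {
    "all": (10_753, 145, "17:20"),
    "I2": (819, 360, "3:17"),
    "I2+": (2_708, 259, "7:47"),
    "ISP": (8_045, 107, "9:33"),
    "US": (1_777, 269, "5:18"),
    "Int.": (8_976, 121, "12:02"),
}


def hm_to_hours(hm: str) -> float:
    """Convert an "h:m" string to hours."""
    hours, minutes = hm.split(":")
    return int(hours) + int(minutes) / 60


def estimated_total_hours(count: int, mean_packets: float) -> float:
    """Total outage time implied by count x mean duration x packet spacing."""
    return count * mean_packets * PACKET_SPACING_S / 3600


if __name__ == "__main__":
    for name, (count, mean_pkts, total) in ROWS.items():
        est = estimated_total_hours(count, mean_pkts)
        print(f"{name:5s} count x mean: {est:5.2f} h   table: {hm_to_hours(total):5.2f} h")
```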
Outage-induced call abortion probability
• Long interruption → user likely to abandon call
• from E.855 survey: $P[\text{holding}] = e^{-t/17.26}$ (t in seconds)
  → half the users will abandon the call after 12 s
• 2,566 calls have at least one outage
  - 946 of 2,566 expected to be dropped → 1.53% of all calls

Paths      Expected call abortion
all        1.53%
I2         1.16%
I2+        1.15%
ISP        1.82%
US         0.99%
Int.       1.78%
US ISP     0.86%
Int. ISP   2.30%
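The abandonment model above can be evaluated directly; 17.26 · ln 2 ≈ 12 s gives the “half the users” point. A Python sketch, assuming each affected call is summarized by a single outage duration (for example its longest outage); the slides do not say how multiple outages per call were combined, and the sample durations are hypothetical. Dividing 946 by the 62,027 successful calls reproduces the 1.53% figure:

```python
import math
from typing import Sequence

MEAN_HOLDING_S = 17.26  # parameter of the E.855-based fit quoted on the slide


def p_holding(t_s: float) -> float:
    """P[user still holding after an interruption of t_s seconds] = exp(-t / 17.26)."""
    return math.exp(-t_s / MEAN_HOLDING_S)


def median_abandon_time_s() -> float:
    """Interruption length after which half the users have hung up: 17.26 * ln 2."""
    return MEAN_HOLDING_S * math.log(2)


def expected_abandoned(outage_per_call_s: Sequence[float]) -> float:
    """Expected abandoned calls, given ONE representative outage duration per affected call."""
    return sum(1.0 - p_holding(t) for t in outage_per_call_s)


if __name__ == "__main__":
    print(f"median abandon time: {median_abandon_time_s():.1f} s")
    print(f"946 of 62,027 calls = {946 / 62_027:.2%}")
    # Hypothetical calls whose representative outage lasted 1 s, 12 s and 60 s.
    print(f"expected abandoned: {expected_abandoned([1.0, 12.0, 60.0]):.2f} of 3 calls")
```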
Conclusion
• Availability in space is (mostly) solved → availability in time restricts usability for new applications
• initial investigation into service availability for VoIP
• need to define metrics for, say, web access
  - unify packet loss and “no Internet dial tone”
• far less than “5 nines”
• working on identifying fault sources and locations
• looking for additional measurement sites
Quality and Performance Evaluation of VoIP End-points
Wenyu Jiang
Henning Schulzrinne
Columbia University
Motivations
• The quality of VoIP depends on both the network and the end-points
• Extensive QoS literature on network performance, e.g., IntServ, DiffServ
  - Focus is on limiting network loss & delay
• Little is known about the behavior of VoIP end-points
Performance Metrics for VoIP End-points
• Mouth-to-ear (M2E) delay
  - compare with network delay
• Clock skew
  - amount of clock drift
  - whether it causes any voice glitches
• Silence suppression behavior
  - whether the voice is clipped (depends much on hangover time)
  - robustness to non-speech input, e.g., music
• Robustness to packet loss
  - voice quality under packet loss
• Acoustic echo cancellation
• Jitter adaptation: delay > max(jitter)?
Measurement Approach
• Capture both original and output audio
• Use adelay program to measure M2E delay
  - autocorrelation
  - no clock synchronization needed
• Assume a LAN environment by default
  - Serves as a baseline of reference, or lower bound
[Diagram: original audio (mouth) from a notebook speaker and output audio (ear) from the IP phones, picked up via couplers and recorded as a stereo signal on a PC line-in; two IP phones (In/Out) connected by Ethernet on a LAN]
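The slides say adelay finds the delay by correlating the two recorded channels, so no clock synchronization is needed. The Python sketch below illustrates that idea with a brute-force cross-correlation between the “mouth” and “ear” channels; it is not adelay's code, and the 8 kHz rate and synthetic signal are assumptions. A practical tool would repeat this over short successive segments, so that changes in delay over time can be tracked, and would use an FFT-based correlation for speed:

```python
import random


def estimate_delay_samples(mouth: list, ear: list, max_lag: int) -> int:
    """Lag (in samples) at which `ear` best matches `mouth`, by brute-force cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate mouth[i] with ear[i + lag] over the overlapping part.
        score = sum(m * e for m, e in zip(mouth, ear[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


if __name__ == "__main__":
    rate_hz = 8000                      # assumed narrowband sample rate
    rng = random.Random(1)
    mouth = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
    true_lag = 400                      # 50 ms at 8 kHz
    ear = [0.0] * true_lag + [0.5 * x for x in mouth]   # delayed, attenuated copy
    lag = estimate_delay_samples(mouth, ear, max_lag=800)
    print(f"estimated M2E delay: {lag} samples = {1000 * lag / rate_hz:.1f} ms")
```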
VoIP End-points Tested
• Hardware end-points
  - Cisco, 3Com and Pingtel IP phones
  - Mediatrix 1-line SIP/PSTN gateway
• Software clients
  - Microsoft Messenger, NetMeeting (Win2K, WinXP)
  - Net2Phone (NT, Win2K, Win98)
  - sipc/RAT (Solaris, Ultra-10)
    - Robust Audio Tool (RAT) from UCL as media client
• Operating parameters:
  - In most cases, codec is G.711 μ-law, packet interval is 20 ms
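For the default operating point above (G.711 μ-law, 20 ms packets), the per-packet size and bit rates follow from the 8 kHz, 8-bit codec. A Python sketch; the header sizes assume IPv4 without options and RTP without CSRCs or extensions, and link-layer framing is ignored:

```python
SAMPLE_RATE_HZ = 8000         # G.711 narrowband sampling rate
BYTES_PER_SAMPLE = 1          # one 8-bit mu-law codeword per sample
HEADER_BYTES = 12 + 8 + 20    # RTP + UDP + IPv4 fixed headers, no options/extensions


def g711_stream(interval_ms: int) -> tuple:
    """Payload bytes per packet, payload bit rate and IP-level bit rate for one direction."""
    payload = SAMPLE_RATE_HZ * interval_ms // 1000 * BYTES_PER_SAMPLE
    packets_per_s = 1000 / interval_ms
    payload_bps = payload * 8 * packets_per_s
    ip_bps = (payload + HEADER_BYTES) * 8 * packets_per_s
    return payload, payload_bps, ip_bps


if __name__ == "__main__":
    for ms in (10, 20, 30):
        payload, pay_bps, ip_bps = g711_stream(ms)
        print(f"{ms} ms: {payload} B/packet, {pay_bps / 1000:.0f} kb/s payload, "
              f"{ip_bps / 1000:.0f} kb/s at IP layer")
```

Shorter packet intervals lower packetization delay but raise the header overhead, since the 40-byte header is paid once per packet.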
IP Phone Hardware
• DSP for audio coding, AEC
• µC (microcontroller) for protocol processing
• embedded OS (Linux, Wind River, …) with web browser
• Ethernet interface, maybe with hub
Example M2E Delay Plot
• 3Com to Cisco, shown with gaps > 1 sec
• Delay adjustments correlate with gaps, even though the 3Com phone has no silence suppression
[Plot: M2E delay (ms, 35-60) vs. time (sec, 0-350) for experiments 1-1 and 1-2, with silence gaps marked]
Visual Illustration of M2E Delay Drop, Snapshot #1
• 3Com to Cisco, 1-1 case
• Left/upper channel is original audio
• Highlighted section shows M2E delay (59 ms)
Snapshot #2
• M2E delay drops to 49 ms, at time of 4:16
Snapshot #3
• Presence of a gap during the delay change
Effect of RTP Marker Bits on Delay Adjustments
• Cisco phone sends M-bits, whereas Pingtel phone does not
  - Presence of M-bits results in more adjustments
[Plot: M2E delay (ms, 20-100) vs. time (sec, 0-300) for Cisco to 3Com 1-1 and Pingtel to 3Com 2-1, with new talkspurts (M-bit=1) marked]
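The marker bit sits in the second byte of the fixed RTP header (RFC 3550); for audio, the RTP profile (RFC 3551) uses it to flag the first packet after a silence period. A minimal Python sketch of how a receiver could locate talkspurt boundaries, which is where per-talkspurt playout adjustment normally happens; the helper names are hypothetical:

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    marker: bool
    payload_type: int
    seq: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the 12-byte fixed RTP header (RFC 3550, section 5.1)."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte RTP fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,           # top two bits
        marker=bool(b1 & 0x80),    # M bit: first packet of a talkspurt for audio
        payload_type=b1 & 0x7F,    # e.g. 0 = PCMU (G.711 mu-law)
        seq=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


def talkspurt_starts(packets: list) -> list:
    """Indices of packets with M=1, where a receiver may re-compute its playout delay."""
    return [i for i, p in enumerate(packets) if parse_rtp_header(p).marker]


if __name__ == "__main__":
    def make(marker: bool, seq: int) -> bytes:
        # Version 2, no padding/extension/CSRCs, PT 0, 160-sample timestamp steps.
        return struct.pack("!BBHII", 0x80, 0x80 if marker else 0x00, seq, seq * 160, 0x1234)

    stream = [make(True, 0), make(False, 1), make(False, 2), make(True, 3)]
    print(talkspurt_starts(stream))
```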
Sender Characteristics
• Certain senders may introduce delay spikes, despite operating on a LAN
[Plot: M2E delay (ms, 0-300) vs. time (sec, 0-300) for Mediatrix to 3Com 3-1, Mediatrix to Cisco 1-1 and Mediatrix to Pingtel 1-1]
Average M2E Delays for IP Phones and sipc
• Averaging the M2E delay allows more compact presentation of end-point behaviors
• Receiver (especially RAT) plays an important role in M2E delay
[Bar chart: average M2E delay (ms, 0-250) by receiver (3Com, Cisco, Mediatrix, Pingtel, RAT), for sender 3Com and sender Cisco]
Average M2E Delays for PC Software Clients
• Messenger 2000 wins the day
  - Its delay as receiver (68 ms) is even lower than Messenger XP, on the same hardware
  - It also results in slightly lower delay as sender
• NetMeeting is a lot worse (> 400 ms)
• Messenger’s delay performance is similar to or better than a GSM mobile phone

A              B                     A→B       B→A
MgrXP (pc)     MgrXP (notebook)      109 ms    120 ms
Mgr2K (pc)     MgrXP (notebook)      96.8 ms   68.5 ms
NM2K (pc)      NM2K (notebook)       401 ms    421 ms
Mobile (GSM)   PSTN (local number)   115 ms    109 ms
Delay Behaviors for PC Clients, contd.
• Net2Phone’s delay is also high
  - ~200-500 ms
  - V1.5 reduces PC→PSTN delay
  - PC-to-PC calls have fairly high delays

A                        B                              A→B      B→A
N2P v1.1 NT P-2 (pc2)    PSTN (local number)            292 ms   372 ms
N2P v1.5 NT P-2 (pc2)    PSTN (local number)            201 ms   373 ms
N2P v1.5 W2K K7 (pc)     PSTN (local number)            196 ms   401 ms
N2P v1.5 W2K K7 (pc)     N2P v1.5 W98 P-3 (notebook2)   525 ms   350 ms
Effect of Clock Skew: Cisco to 3Com, Experiment 1-1
• Symptom of playout buffer underflow
• Waveforms are dropped
• Occurred at point of delay adjustment
• Bugs in software?
Clock Skew Rates
• Mostly symmetric between two devices
• RAT (Sun Ultra-10) has unusually high drift rates, > 300 ppm (parts per million)
  - High clock skews confirmed in many (but not all) PCs and workstations

Drift rates (in ppm)
             3Com     Cisco    Mediatrix  Pingtel  RAT
3Com         -8.3     55.4     43.3       41.2     -333
Cisco        -55.2    -0.4     -11.8      -12.1    -381
Mediatrix    -43.1    11.7     1.3        -0.8     n/a
Pingtel      -40.9    12.7     2.8        -3.5     -380
RAT          343      403      n/a        376      12.3
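The slides do not say how the drift rates were obtained. One plausible approach, sketched below in Python as an assumption, is to fit a straight line to the M2E delay between two playout adjustments: a delay that changes by 0.3 ms every second corresponds to 300 ppm. The sign convention here is arbitrary:

```python
def drift_ppm(times_s: list, delays_ms: list) -> float:
    """Least-squares slope of M2E delay against time, converted to parts per million.

    A slope of 1 ms of delay change per second of wall-clock time is 1e-3 s/s = 1000 ppm.
    """
    n = len(times_s)
    mean_t = sum(times_s) / n
    mean_d = sum(delays_ms) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(times_s, delays_ms))
    var = sum((t - mean_t) ** 2 for t in times_s)
    slope_ms_per_s = cov / var
    return slope_ms_per_s * 1000.0


if __name__ == "__main__":
    # Hypothetical segment between two playout adjustments: delay creeps up 0.3 ms/s.
    t = [float(s) for s in range(0, 301, 10)]
    d = [50.0 + 0.3 * s for s in t]
    print(f"estimated drift: {drift_ppm(t, d):.1f} ppm")
    print(f"at 300 ppm the delay shifts by {300e-6 * 60 * 1000:.0f} ms per minute")
```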
Drift Rates for PC Clients
• Drift rates not always symmetric!
  - But appears to be consistent between Messenger 2K/XP and Net2Phone on the same PC
  - Existence of 2 clocking circuits in sound card?

A                    B                  A→B (ppm)  B→A (ppm)
MgrXP (pc)           MgrXP (notebook)   172        87.7
Mgr2K (pc)           MgrXP (notebook)   165        85.6
NM2K (pc)            NM2K (notebook)    ?          -33?
Net2Phone NT (pc2)   PSTN               290        -287
Net2Phone 2K (pc)    PSTN               166        82
Mobile (GSM)         PSTN               0          0
Packet Loss Concealment
• Common PLC methods
  - Silence substitution (worst)
  - Packet repetition, with optional fading
  - Extrapolation (one-sided)
  - Interpolation (two-sided), best quality
• Use deterministic bursty loss pattern
  - 3/100 means 3 consecutive losses out of every 100 packets
  - Easier to locate packet losses
  - Tested 1/100, 3/100, 1/20, 5/100, etc.
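The loss patterns above are easy to generate. A Python sketch of a k/n pattern generator plus one of the listed concealment methods (packet repetition with fading); placing the burst at the end of each period and halving the gain per lost frame are assumptions, not the test tool's actual settings:

```python
def deterministic_loss_mask(burst: int, period: int, total: int) -> list:
    """True marks a lost packet: `burst` consecutive losses in every `period` packets.

    Assumption: the burst sits at the end of each period (the slides do not say where).
    """
    return [(i % period) >= period - burst for i in range(total)]


def conceal_by_repetition(frames: list, lost: list, fade: float = 0.5) -> list:
    """Packet repetition with fading: replay the last good frame, attenuating it
    by `fade` for each further consecutive loss; silence if nothing was received yet."""
    out, last_good, gain = [], None, 1.0
    for frame, is_lost in zip(frames, lost):
        if not is_lost:
            out.append(list(frame))
            last_good, gain = frame, 1.0
        elif last_good is None:
            out.append([0.0] * len(frame))        # silence substitution fallback
        else:
            gain *= fade
            out.append([gain * s for s in last_good])
    return out


if __name__ == "__main__":
    for burst, period in [(1, 100), (3, 100), (1, 20), (5, 100)]:
        mask = deterministic_loss_mask(burst, period, 1000)
        print(f"{burst}/{period}: loss rate {sum(mask) / len(mask):.0%}")
    print(conceal_by_repetition([[1.0], [2.0], [9.0], [9.0]], [False, False, True, True]))
```

Note that 1/20 and 5/100 both give 5% loss: the first spreads it as isolated losses, while the second concentrates it into 100 ms bursts at a 20 ms packet interval.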
PLC Behaviors
• Loss tolerance (at 20 ms interval)
  - By measuring loss-induced gaps in output audio
  - 3Com and Pingtel phones: 2 packet losses
  - Cisco phone: 3 packet losses
• Level of audio distortion by packet loss
  - Inaudible at 1/100 for all 3 phones
  - Inaudible at 3/100 and 1/20 for Cisco phone, yet audible to very audible for the other two
• Cisco phone is the most robust
  - Probably uses interpolation
Effect of PLC on Delay
• No noticeable effect on M2E delay
  - E.g., sipc to Pingtel
[Plot: mouth-to-ear delay (ms, 50-80) vs. time (sec, 0-60) for loss patterns 0/100, 3/100 and 1/20]
Silence Suppression
• Why?
  - Saves bandwidth
  - May reduce processing power (e.g., in conferencing mixer)
  - Facilitates per-talkspurt delay adjustment
• Key parameters
  - Silence detection threshold
  - Hangover time, to delay silence suppression and avoid end clipping of speech
    - Usually 200 ms is long enough [Brady ’68]
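A minimal energy-based silence detector with hangover, sketched in Python to show how the two key parameters interact. The -40 dB threshold is hypothetical; the 10-frame hangover equals the 200 ms figure above at a 20 ms frame size. The detectors in the tested end-points are not documented here and are likely more elaborate:

```python
def vad_with_hangover(levels_db: list, threshold_db: float = -40.0,
                      hangover_frames: int = 10) -> list:
    """Per-frame send/suppress decision for a simple energy-based silence detector.

    A frame counts as speech if its level exceeds threshold_db; after the last speech
    frame, transmission continues for hangover_frames more frames to avoid clipping
    the end of the talkspurt.
    """
    send, remaining = [], 0
    for level in levels_db:
        if level > threshold_db:
            remaining = hangover_frames
            send.append(True)
        elif remaining > 0:
            remaining -= 1
            send.append(True)      # still inside the hangover period
        else:
            send.append(False)     # suppressed: no packet is sent
    return send


if __name__ == "__main__":
    frame_ms = 20
    levels = [-10.0] * 5 + [-60.0] * 20 + [-10.0] * 2
    decisions = vad_with_hangover(levels, hangover_frames=200 // frame_ms)
    print("".join("#" if s else "." for s in decisions))
```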
Hangover Time
• Measured by feeding ON-OFF waveforms and monitoring RTP packets
• Cisco phone’s is the longest (2.3-2.36 sec), then Messenger (1.06-1.08 sec), then NetMeeting (0.56-0.58 sec)
• A long hangover time is not necessarily bad, as it reduces voice clipping
  - Indeed, no unnatural gaps are found
  - Does waste a bit more bandwidth
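A Python sketch of the measurement described above, assuming packet capture timestamps are available and the instant the ON-OFF test signal switches off is known; the rule that a gap longer than two packet intervals marks the start of suppression is an assumption:

```python
def measure_hangover(packet_times_s: list, off_time_s: float, interval_s: float = 0.020) -> float:
    """Estimate how long packets keep flowing after the test input switches OFF.

    Walks the packets captured at or after off_time_s and stops at the first gap larger
    than two packet intervals; returns the time of the last packet before that gap,
    measured from off_time_s (0.0 if no packet followed the switch-off closely).
    """
    last = off_time_s
    for t in sorted(packet_times_s):
        if t < off_time_s:
            continue
        if t - last > 2 * interval_s:
            break                     # silence suppression kicked in
        last = t
    return last - off_time_s


if __name__ == "__main__":
    # Hypothetical capture: ON for 1 s, packets continue until 2 s, then silence until 4 s.
    times = [i * 0.020 for i in range(101)] + [4.0 + i * 0.020 for i in range(10)]
    print(f"hangover = {measure_hangover(times, off_time_s=1.0):.2f} s")
```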
Robustness of Silence Detectors to Music
• On-hold music is often used in customer support centers
  - Need to ensure music is played without any interruption due to silence suppression
• Tested with a 2.5-min long soundtrack
  - Messenger starts to generate many unwanted gaps at an input level of -24 dB
  - Cisco phone is more robust, but still fails from an input level of -41.4 dB
Acoustic Echo Cancellation
• Important for hands-free/conferencing (business) applications
• Primary metric: Echo Return Loss (ERL)
  - Measured by LAN-sniffing RTP packets
• Most IP phones support AEC
  - ERL depends slightly on input level and speaker-phone volume
  - Usually > 40 dB (good AEC performance)

IP phone    ERL (dB)
3Com        40-45
Cisco       53-
ipDialog    49-54
Pingtel     -5 (no AEC)
Snom-100    33-42
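ERL compares the level of the far-end signal fed to the phone with the level of the echo it sends back. A Python sketch, assuming both directions have already been decoded from the sniffed RTP into time-aligned linear samples and that only the far end is active (no double talk); the test tone and 40 dB attenuation are hypothetical:

```python
import math


def mean_power(samples: list) -> float:
    """Mean squared amplitude of a block of linear samples."""
    return sum(s * s for s in samples) / len(samples)


def echo_return_loss_db(far_end: list, returned_echo: list) -> float:
    """ERL = 10*log10(P_far_end / P_echo): how far the echo sent back is below the input."""
    return 10.0 * math.log10(mean_power(far_end) / mean_power(returned_echo))


if __name__ == "__main__":
    far_end = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]  # 1 s test tone
    echo = [0.01 * s for s in far_end]        # hypothetical residual echo, 40 dB down
    print(f"ERL = {echo_return_loss_db(far_end, echo):.1f} dB")
```

Under this definition, a negative value such as the Pingtel phone's -5 dB means more energy comes back than goes in.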
M2E Delay under Jitter
• Delay properties under the LAN environment serve as a baseline of reference
• When operating over the Internet:
  - Fixed portion of delay adds to M2E delay as a constant
  - Variable portion (jitter) has a more complex effect
• Initial test
  - Used typical cable modem delay traces
  - Tested RAT to Cisco
  - No audible distortion due to late loss
  - Added delay is normal
[Plot: mouth-to-ear delay (ms, 90-180) vs. time (sec, 0-180) for the high-jitter (uplink) and low-jitter (downlink) traces]
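The receiver's playout algorithm decides how jitter turns into added M2E delay and late loss. The tested end-points' algorithms are not described here; as an illustration, the Python sketch below follows a well-known textbook scheme along the lines of Ramjee, Kurose, Towsley and Schulzrinne (1994): exponentially weighted estimates of mean delay and its variation, with playout at mean + 4 × variation, updated only at talkspurt boundaries. The fixed talkspurt length and parameter values are assumptions:

```python
def adaptive_playout(delays_ms: list, talkspurt_len: int = 50,
                     alpha: float = 0.998, k: float = 4.0) -> tuple:
    """Per-talkspurt playout delay from exponentially weighted delay statistics.

    d_hat tracks mean network delay, v_hat its mean deviation; at the start of each
    talkspurt the playout delay is set to d_hat + k * v_hat and kept for the spurt.
    Returns (playout delay used for each packet, fraction of packets arriving too late).
    """
    d_hat, v_hat = float(delays_ms[0]), 0.0
    playout, late = d_hat, 0
    schedule = []
    for i, n in enumerate(delays_ms):
        if i % talkspurt_len == 0:
            playout = d_hat + k * v_hat       # adjust only between talkspurts
        schedule.append(playout)
        if n > playout:
            late += 1                         # arrived after its playout time: late loss
        d_hat = alpha * d_hat + (1 - alpha) * n
        v_hat = alpha * v_hat + (1 - alpha) * abs(d_hat - n)
    return schedule, late / len(delays_ms)


if __name__ == "__main__":
    # Hypothetical trace: 40 ms delay, then a sustained step up to 140 ms.
    trace = [40.0] * 200 + [140.0] * 400
    schedule, late_fraction = adaptive_playout(trace)
    print(f"playout delay: start {schedule[0]:.0f} ms, end {schedule[-1]:.0f} ms")
    print(f"late loss: {late_fraction:.1%}")
```

With alpha close to 1 the estimates move slowly, so a sustained delay step causes a burst of late loss before the playout delay catches up; smaller alpha reacts faster but tracks jitter more nervously.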
M2E Delay under Jitter, contd.
• Cisco phone generally within expectation
  - Follows network delay changes in a timely manner
  - Takes longer (10-20 sec) to adapt to decreasing delay
  - Does not overshoot playout delay
• More end-points to be examined
[Plots: M2E delay (ms) vs. time (sec) for the network delay trace and two test runs (test1, test2); panels “Artificial Trace” and “Real Trace with Spikes”]
Conclusions
• Average M2E delay (measured with adelay):
  - Low (mostly < 80 ms) for hardware IP phones
  - Software clients: lowest for Messenger 2000 (68.5 ms)
  - Application (receiver) most vital in determining delay
• Clock skew high on SW clients (RAT, Net2Phone)
• Packet loss concealment quality
  - Acceptable in all 3 IP phones tested, with Cisco more robust
• Silence detector behavior
  - Poor implementation easily undoes good network QoS
  - Long hangover time works well for speech input
  - But may falsely classify music as silence
• Acoustic echo cancellation: good on most IP phones
• Playout delay behavior: good based on initial tests
Future Work
• Further tests with more end-points on how jitter influences M2E delay
• Measure the sensitivity (threshold) of various silence detectors
• Investigate the non-symmetric clock drift phenomena
• Additional experiments as more brands of VoIP end-points become available