SIP Server Scalability
IRT Internal Seminar
Kundan Singh, Henning Schulzrinne
and Jonathan Lennox
May 10, 2005
Agenda
Why do we need scalability?
Scaling the server
  SIP express router (Iptel.org)
  SIPd (Columbia University)
  Threads/Processes/Events
Scaling using load sharing
  DNS-based, Identifier-based
  Two stage architecture
Conclusions
27 slides

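Internet telephony
(SIP: Session Initiation Protocol)
[Figure: SIP call setup between the yahoo.com and example.com domains. Labels: REGISTER, INVITE (shown twice), hosts 129.1.2.3 and 192.1.2.4, DB and DNS lookups. The two user addresses appear only as "[email protected]" in the source.]
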
Scalability Requirements
Depends on role in the network architecture
  Cybercafe
  Enterprise server: 1000 customers
  Edge ISP server: 10,000 customers
  Carrier (3G): 10 million customers
[Figure: network diagram. Labels: IP network, IP phones, ISP, carrier network, SIP/MGC, SIP/PSTN, GW, MG, PBX, T1 PRI/BRI, PSTN, PSTN phones.]

Scalability Requirements
Depends on traffic type
Registration (uniform)
  Authentication, mobile users
Call routing (Poisson)
  stateful vs stateless proxy, redirect, programmable scripts
Beyond telephony (Don’t know)
  Instant message, presence (including sensors), device control
Stateful calls (Poisson arrival, exponential call duration)
  Firewall, conference, voicemail
Transport type
  UDP/TCP/TLS (cost of security)

SIPstone
SIP server performance metrics
Steady state rate for successful registration, forwarding and unsuccessful call attempts, measured using 15 min test runs.
Measure: #requests/s with given delay constraint.
Performance = f(#user, #DNS, UDP/TCP, g(request), L)
  where g = type and arrival pdf (#request/s), L = logging?
For register, outbound proxy, redirect, proxy480, proxy200.
Parameters
  Measurement interval, transaction response time, RPS (registers/s), CPS (calls/s), transaction failure probability < 5%
  Delay budget: R1 < 500 ms, R2 < 2000 ms
Shortcomings:
  Does not consider forking, scripting, Via header, packet size, different call rates, SSL. Is there linear combination of results?
  Whitebox measurements: turnaround time
Extend to SIMPLEstone
[Figure: test setup with Loader, Server, Handler and SQL database. Registration flow: REGISTER, 200 OK (delay R1). Call flow: INVITE, 100 Trying, 180 Ringing, 200 OK, ACK, BYE, 200 OK (delay R2).]

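The delay budget and failure limit on this slide (R1 < 500 ms, R2 < 2000 ms, transaction failure probability < 5%) can be turned into a small pass/fail check. This is a minimal sketch: the function name and the sample delays are hypothetical, and counting a transaction as failed when it is unanswered or over budget is an assumption made here, not the SIPstone definition.

```python
def meets_budget(delays_ms, budget_ms, max_failure_prob=0.05):
    """Return True if the share of failed transactions is below the limit.

    delays_ms holds one response delay per transaction, or None when no
    response arrived. A transaction counts as failed when it got no
    response or reached budget_ms (an assumption of this sketch).
    """
    if not delays_ms:
        raise ValueError("no transactions measured")
    failed = sum(1 for d in delays_ms if d is None or d >= budget_ms)
    return failed / len(delays_ms) < max_failure_prob


R1_BUDGET_MS = 500     # registration delay budget from the slide
R2_BUDGET_MS = 2000    # call delay budget from the slide

# Hypothetical measurements, not data from the talk.
register_delays = [120] * 97 + [650, 700, None]   # 3% failed
invite_delays = [900] * 90 + [2500] * 10          # 10% failed

register_ok = meets_budget(register_delays, R1_BUDGET_MS)
invite_ok = meets_budget(invite_delays, R2_BUDGET_MS)
```

With these made-up samples the registration run passes and the call run does not; the highest offered load at which a run still passes would be the reported RPS or CPS.
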
SIP server
What happens inside a proxy?
[Figure: flowchart of message processing in a proxy. Boxes: recvfrom or accept/recv; parse; Request / Response branch; match transaction (found, stateful, or stateless proxy); modify response; REGISTER: update DB; other: lookup DB; redirect/reject: build response; proxy: modify request, DNS; sendto, send or sendmsg. Legend: (blocking) I/O; critical section (lock); critical section (r/w lock).]

Lessons Learnt (sipd)
In-memory database
Call routing involves (≥ 1) contact lookups
  10 ms per query (approx)
Cache (FastSQL)
  Loading entire database is easy
  Periodic refresh
  Potentially useful for DNS lookups
[Figure: Web config writes to the SQL database; the cache is filled from it by periodic refresh; cache lookups take < 1 ms.]
[Figure: turnaround time vs RPS, single CPU Sun Ultra10. [2002:Narayanan]]

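The cache idea on this slide (load the whole table into memory, refresh it periodically, answer lookups without touching SQL) can be sketched as follows. This is an illustration under assumptions, not the FastSQL code: `ContactCache`, `load_all` and the sample addresses are hypothetical names, and a plain dict stands in for the SQL database.

```python
import threading


class ContactCache:
    """In-memory snapshot of a contact table, reloaded on a fixed period."""

    def __init__(self, load_all, refresh_s=60.0):
        self._load_all = load_all          # callable returning {user: [contacts]}
        self._refresh_s = refresh_s
        self._lock = threading.Lock()
        self._table = load_all()           # loading the entire table is cheap
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def refresh_now(self):
        fresh = self._load_all()           # slow query runs outside the lock
        with self._lock:
            self._table = fresh            # swap in the new snapshot

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._refresh_s):
            self.refresh_now()

    def lookup(self, user):
        with self._lock:
            return list(self._table.get(user, []))


db = {"alice": ["sip:alice@192.0.2.10"]}   # stand-in for the SQL table
cache = ContactCache(lambda: {u: list(c) for u, c in db.items()})
db["bob"] = ["sip:bob@192.0.2.11"]         # a new registration reaches the DB
stale = cache.lookup("bob")                # not visible until the next refresh
cache.refresh_now()
after = cache.lookup("bob")
```

The trade-off is visible in `stale`: a lookup can miss a registration made since the last refresh, so the refresh period bounds how out of date call routing can be.
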
Lessons Learnt (sipd)
Thread-per-request does not scale
One thread per message
  Doesn’t scale
    Too many threads over a short timescale
    Stateless: 2-4 threads per transaction
    Stateful: 30s holding time
Thread pool + queue
  Thread overhead less; more useful processing
  Pre-fork processes for SIP-CGI
Overload management
  Graceful failure, drop requests over responses
Not enough if holding time is high
  Each request holds (blocks) a thread
[Figure: thread per request (incoming requests R1-4, one thread each for R1, R2, R3, R4) vs a fixed number of threads fed from a queue of incoming requests.]
[Figure: throughput vs load, comparing thread pool with overload control against thread per request.]

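The "thread pool + queue" and "drop requests over responses" points can be illustrated with a bounded queue that has two thresholds. This is a sketch under assumptions rather than sipd's implementation: `OverloadPool`, its limits and the message strings are hypothetical.

```python
import queue
import threading


class OverloadPool:
    """Fixed pool of worker threads fed by a bounded queue."""

    def __init__(self, handler, workers=4, request_limit=100, hard_limit=200):
        self._handler = handler
        self._request_limit = request_limit    # shed new requests above this depth
        self._queue = queue.Queue(maxsize=hard_limit)
        self.dropped = 0
        self._threads = [threading.Thread(target=self._work, daemon=True)
                         for _ in range(workers)]
        for thread in self._threads:
            thread.start()

    def submit(self, message, is_response):
        """Queue a message; return False if it was dropped under overload."""
        if not is_response and self._queue.qsize() >= self._request_limit:
            self.dropped += 1                  # requests are shed first
            return False
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1                  # queue is full: responses shed too
            return False

    def _work(self):
        while True:
            message = self._queue.get()
            if message is None:                # shutdown marker
                return
            self._handler(message)

    def close(self):
        for _ in self._threads:
            self._queue.put(None)
        for thread in self._threads:
            thread.join()


# Demo: the single worker is held busy so that the queue fills up.
started, gate, handled = threading.Event(), threading.Event(), []

def slow_handler(message):
    started.set()
    gate.wait()                                # blocks until the demo releases it
    handled.append(message)

pool = OverloadPool(slow_handler, workers=1, request_limit=2, hard_limit=4)
pool.submit("INVITE-0", is_response=False)
started.wait()                                 # worker holds INVITE-0, queue empty
results = [pool.submit(f"INVITE-{i}", is_response=False) for i in (1, 2, 3)]
results += [pool.submit(f"200-{i}", is_response=True) for i in (1, 2, 3)]
gate.set()
pool.close()
```

Responses finish transactions that already hold state, so admitting them past the request threshold frees resources, while a dropped request can be retransmitted by the sender.
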
What is the best architecture?
Event-based
  Reactive system
Process pool
  Each pool process receives and processes to the end (SER)
Thread pool
  1. Receive and hand-over to pool thread (sipd)
  2. Each pool thread receives and processes to the end
  3. Staged event-driven: each stage has a thread pool
[Figure: the proxy processing flowchart from "What happens inside a proxy?", repeated.]

Stateless proxy
UDP, no DNS, six messages per call
[Figure: the proxy processing flowchart from "What happens inside a proxy?", repeated.]

Stateless proxy
UDP, no DNS, six messages per call
[Figure: bar chart, y-axis from 0 to 4, comparing Event, Th/msg, Th-pool1, Th-pool2 and Proc-pool on 1xP/Linux, 4xP/Linux, 1xS/Solaris and 2xS/Solaris.]

Calls per second (CPS) by architecture and hardware:

Architecture    1xP/Linux   4xP/Linux   1xS/Solaris   2xS/Solaris
Event-based     1650        370         150           190
Thread/msg      1400        TBD         100           TBD
Thread-pool1    1450        600 (?)     110           220 (?)
Thread-pool2    1600        1150 (?)    152           TBD
Process-pool    1700        1400        160           350

Hardware:
  1xP/Linux:   1 PentiumIV 3GHz, 1GB, Linux2.4.20
  4xP/Linux:   4 pentium, 450MHz, 512 MB, Linux2.4.20
  1xS/Solaris: 1 ultraSparc-IIi, 300 MHz, 64MB, Solaris
  2xS/Solaris: 2 ultraSparc-II, 300 MHz, 256MB, Solaris

Stateful proxy
UDP, no DNS, eight messages per call
Event-based
  single thread: socket listener + scheduler/timer
Thread-per-message
  pool_schedule => pthread_create
Thread-pool1 (sipd)
Thread-pool2
  N event-based threads
  Each handles specific subset of requests (hash(call-id))
  Receive & hand over to the correct thread
  poll in multiple threads => bad on multi-CPU
Process pool
  Not finished yet

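The thread-pool2 rule "each handles specific subset of requests (hash(call-id))" can be sketched as a stable mapping from Call-ID to worker index. The function name, the CRC32 hash and the sample Call-IDs are assumptions for illustration; the talk does not say which hash sipd uses.

```python
import zlib


def worker_for(call_id, n_workers):
    """Map a Call-ID to a worker index; equal Call-IDs always map alike."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(call_id.encode("utf-8")) % n_workers


N_WORKERS = 4
inboxes = [[] for _ in range(N_WORKERS)]       # one inbox per event-based thread

messages = [
    ("c1@host-a", "INVITE"),
    ("c2@host-b", "INVITE"),
    ("c1@host-a", "ACK"),
    ("c2@host-b", "BYE"),
    ("c1@host-a", "BYE"),
]
for call_id, method in messages:
    # The receiver only hashes and hands over; the owner does the processing.
    inboxes[worker_for(call_id, N_WORKERS)].append((call_id, method))
```

Because every message of a call lands in the same inbox, per-call transaction state needs no lock shared between workers.
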
Stateful proxy
UDP, no DNS, eight messages per call
[Figure: bar chart, y-axis from 0 to 1.8, comparing Event, Th/msg, Th-pool1 and Th-pool2 on 1xP/Linux, 4xP/Linux, 1xS/Solaris and 2xS/Solaris.]

Calls per second (CPS) by architecture and hardware:

Architecture    1xP/Linux   4xP/Linux    1xS/Solaris   2xS/Solaris
Event-based     1200        300          160           160
Thread/msg      650         175          90            120
Thread-pool1    950         340 (p=4)    120           120 (p=4)
Thread-pool2    1100        500 (p=4)    155           200 (p=4)
Process-pool    -           -            -             -

Hardware:
  1xP/Linux:   1 PentiumIV 3GHz, 1GB, Linux2.4.20
  4xP/Linux:   4 pentium, 450MHz, 512 MB, Linux2.4.20
  1xS/Solaris: 1 ultraSparc-IIi, 360MHz, 256 MB, Solaris5.9
  2xS/Solaris: 2 ultraSparc-II, 300 MHz, 256 MB, Solaris5.8

Lessons Learnt
What is the best architecture?
Stateless
  CPU is bottleneck
  Memory is constant
  Process pool is the best
  Event-based not good for multi-CPU
  Thread/msg and thread-pool similar
  Thread-pool2 close to process-pool
Stateful
  Memory can become bottleneck
  Thread-pool2 is good
    But not N x CPU
    Not good if P [symbol missing] CPU
  Process pool may be better (?)

Lessons Learnt (sipd)
Avoid blocking function calls
DNS
  10-25 ms (29 queries)
  Cache
    110 to 900 CPS
    Internal vs external
  non-blocking
Logger
  Lazy logger as a separate thread
  Logger:
    while (1) {
        lock;       /* protect the pending log entries */
        writeall;   /* write everything queued so far */
        unlock;
        sleep;      /* wait until the next flush */
    }
Date formatter
  strftime() 10% REG processing
  Update date variable every second
random32()
  Cache gethostid()- 37 µs

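The lazy logger loop and the "update date variable every second" point can be combined in a short runnable sketch. It is an illustration under assumptions, not sipd's code: `LazyLogger` and `sink` are hypothetical names, the date string is refreshed inside `log()` rather than by a timer, and the lock is held only while swapping the pending list, not for the whole write as in the slide's loop.

```python
import threading
import time


class LazyLogger:
    """Queue log lines in memory; a background thread writes them in batches."""

    def __init__(self, sink, interval_s=0.5):
        self._sink = sink                  # callable that receives a list of lines
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._pending = []
        self._stamp_second = None          # second for which _stamp was formatted
        self._stamp = ""
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, text):
        now = int(time.time())
        with self._lock:
            if now != self._stamp_second:  # call strftime at most once a second
                self._stamp = time.strftime("%Y-%m-%d %H:%M:%S",
                                            time.localtime(now))
                self._stamp_second = now
            self._pending.append(f"{self._stamp} {text}")

    def _flush(self):
        with self._lock:                   # hold the lock only to swap the list
            batch, self._pending = self._pending, []
        if batch:
            self._sink(batch)              # slow I/O happens outside the lock

    def _run(self):
        while not self._stop.wait(self._interval_s):
            self._flush()

    def close(self):
        self._stop.set()
        self._thread.join()
        self._flush()                      # write whatever is still queued


written = []
logger = LazyLogger(written.extend, interval_s=0.05)
for method in ("REGISTER", "INVITE", "BYE"):
    logger.log(f"handled {method}")
logger.close()
```

Request threads pay only for a list append, and the date is formatted once per second instead of once per message.
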
Lessons Learnt (sipd)
Resource management
Socket management
  Problems: OS limit (1024), “liveness” detection, retransmission
  One socket per transaction does not scale
    Global socket if downstream server is alive, soft state – works for UDP
    Hard for TCP/TLS – apply connection reuse
  Socket buffer size
    64KB to 128KB; Tradeoff: memory per socket vs number of sockets
Memory management
  Problems: too many malloc/free, leaks
  Memory pool
    Transaction specific memory, free once; also, less memcpy
    About 30% performance gain
      Stateful: 650 to 800 CPS; Stateless: 900 to 1200 CPS

Stateless processing time (µs) per message:

                  INV   180   200   ACK   BYE   200   REG   200
W/o mempool       155    67    67    95   139    62   237    70
W/ mempool        111    49    48    64   106    41   202    48
Improvement (%)    28    27    28    33    24    34    15    31

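The improvement row can be recomputed from the two timing rows as a consistency check. This sketch assumes the flattened numbers on this slide form eight columns (INV, 180, 200, ACK, BYE, 200, REG, 200) with one row without and one row with the memory pool; the variable names are hypothetical.

```python
columns = ["INV", "180", "200", "ACK", "BYE", "200", "REG", "200"]
without_pool = [155, 67, 67, 95, 139, 62, 237, 70]   # processing time per message
with_pool = [111, 49, 48, 64, 106, 41, 202, 48]

# Improvement = time saved as a percentage of the time without the pool.
improvement = [round(100 * (old - new) / old)
               for old, new in zip(without_pool, with_pool)]

for name, pct in zip(columns, improvement):
    print(f"{name:>4}: {pct}% faster with the memory pool")
```

The recomputed percentages agree with the slide's improvement row, and most of them sit close to the stated gain of about 30%.
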
Lessons Learnt (SER)
Optimizations
Reduce copying and string operations
  Data lumps, counted strings (+5-10%)
Reduce URI comparison to local
  User part as a keyword, use r2 parameters
Parser
  Lazy parsing (2-6x), incremental parsing
  32-bit header parser (2-3.5x)
    Use padding to align
    Fast for general case (canonicalized)
  Case compare
    Hash-table, sixth bit
Database
  Cache is divided into domains for locking
[2003:Jan Janak] SIP proxy server effectiveness, Master’s thesis, Czech Technical University

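One reading of the "32-bit header parser" and "sixth bit" bullets: in ASCII, upper and lower case letters differ only in bit 0x20, so four bytes of a header name can be lowercased and compared as one integer. The sketch below illustrates that reading and is not SER's parser; `first_word`, `KNOWN` and `guess_header` are hypothetical names. Forcing bit 0x20 on non-letters is lossy (for example "@" becomes "`"), so a full comparison still confirms the match.

```python
def first_word(name: bytes) -> int:
    """Pack the first four bytes into one integer with bit 0x20 set in each."""
    padded = name[:4].ljust(4, b" ")           # pad short names such as "To"
    return int.from_bytes(padded, "little") | 0x20202020


# Lookup table keyed by the case-folded first word of each known header name.
KNOWN = {first_word(n): n for n in (b"from", b"to", b"via", b"cseq", b"call-id")}


def guess_header(name: bytes):
    """Return the known header matching name, ignoring case, else None."""
    candidate = KNOWN.get(first_word(name))    # one integer compare rejects most
    if candidate is not None and candidate == name.lower():
        return candidate
    return None


from_hit = guess_header(b"FROM")
call_id_hit = guess_header(b"Call-ID")
contact_miss = guess_header(b"Contact")        # first word not in the table
callback_miss = guess_header(b"Callback")      # shares "call", fails full check
```
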
Lessons Learnt (SER)
Protocol bottlenecks and other scalability concerns
Protocol bottlenecks
  Parsing
    Order of headers
    Host names vs IP address
    Line folding
    Scattered headers (Via, Route)
  Authentication
    Reuse credentials in subsequent requests
  TCP
    Message length unknown until Content-Length
Other scalability concerns
  Configuration:
    broken digest client, wrong password, wrong expires
  Overuse of features
    Use stateless instead of stateful if possible
    Record route only when needed
    Avoid outbound proxy if possible

Load Sharing
Distribute load among multiple servers
Single server scalability
  There is a maximum capacity limit
Multiple servers
  DNS-based
  Identifier-based
  Network address translation
  Same IP address

Load Sharing (DNS-based)
Redundant proxies and databases
REGISTER
  Write to D1 & D2
INVITE
  Read from D1 or D2
Database write/synchronization traffic becomes bottleneck
[Figure: proxies P1, P2 and P3 receive REGISTER and INVITE requests and share databases D1 and D2.]

Load Sharing (Identifier-based)
Divide the user space
Proxy and database on the same host
First-stage proxy may get overloaded
  Use many
Hashing
  Static vs dynamic
[Figure: P1 with D1 serves users a-h, P2 with D2 serves i-q, P3 with D3 serves r-z.]

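The a-h / i-q / r-z split and the hashing alternative can be sketched as two routing functions. The host names, function names and the CRC32 hash are assumptions for illustration only.

```python
import zlib

# Static partition from the slide: first letter of the user name picks the server.
RANGES = [
    ("a", "h", "p1.example.com"),
    ("i", "q", "p2.example.com"),
    ("r", "z", "p3.example.com"),
]
HOSTS = [host for _, _, host in RANGES]


def static_target(user):
    """Return the server for a user by first letter, or None if out of range."""
    first = user[:1].lower()
    for low, high, host in RANGES:
        if low <= first <= high:
            return host
    return None


def hashed_target(user, hosts):
    """Return the server for a user by hashing the whole identifier."""
    return hosts[zlib.crc32(user.lower().encode("utf-8")) % len(hosts)]


alice_static = static_target("alice")
kundan_static = static_target("kundan")
zoe_static = static_target("Zoe")
digits_static = static_target("42")        # no letter range covers digits
alice_hashed = hashed_target("alice", HOSTS)
```

The static split is easy to reason about, but real names are not spread evenly over the alphabet; hashing spreads users more evenly at the price of remapping many users when the server list changes.
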
Load Sharing
Comparison of the two designs
DNS-based (P1, P2, P3 share D1 and D2)
  Total time per DB = ((tr/D) + 1)TN = (A/D) + B
Identifier-based (P1/D1: a-h, P2/D2: i-q, P3: r-z)
  Total time per DB = ((tr + 1)/D)TN = (A/D) + (B/D)
  High scale
  Low reliability
where
  D = number of database servers
  N = number of writes (REGISTER)
  r = #reads/#writes = (INV+REG)/REG
  T = write latency
  t = read latency/write latency

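The two formulas can be evaluated directly. In the DNS-based design every database takes all N writes and 1/D of the reads; in the identifier-based design both are divided by D. The sketch below uses the slide's symbols; the function names and the sample values are hypothetical.

```python
import math


def dns_based_time(D, N, r, T, t):
    """Total time per DB when writes go to every DB and reads are shared."""
    return ((t * r / D) + 1) * T * N


def identifier_based_time(D, N, r, T, t):
    """Total time per DB when each DB owns 1/D of the users."""
    return ((t * r + 1) / D) * T * N


# Hypothetical values: 3 databases, 1000 registrations, 4 reads per write,
# 10 ms write latency, reads half as expensive as writes.
D, N, r, T, t = 3, 1000, 4, 10.0, 0.5

dns_ms = dns_based_time(D, N, r, T, t)
ident_ms = identifier_based_time(D, N, r, T, t)
```

With these numbers the DNS-based design needs about 16,667 ms per database and the identifier-based design 10,000 ms. The difference is the write term B = TN, which replication does not divide by D.
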
Scalability (and Reliability)
Two stage architecture for CINEMA
First stage: s1, s2, s3, with backup ex
  example.com _sip._udp
    SRV 0 40 s1.example.com
    SRV 0 40 s2.example.com
    SRV 0 20 s3.example.com
    SRV 1 0 ex.backup.com
Second stage, a*@example.com: a1 (Master), a2 (Slave)
  a.example.com _sip._udp
    SRV 0 0 a1.example.com
    SRV 1 0 a2.example.com
Second stage, b*@example.com: b1 (Master), b2 (Slave)
  b.example.com _sip._udp
    SRV 0 0 b1.example.com
    SRV 1 0 b2.example.com
[Figure: the first stage forwards each request to the a or b group; the two example request URIs appear only as "sip:[email protected]" in the source.]
Request-rate = f(#stateless, #groups)
Bottleneck: CPU, memory, bandwidth?

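How a client might use the first-stage SRV records: try the lowest priority value first, and within one priority pick a server with probability proportional to its weight. The sketch below is a simplification of the selection procedure in RFC 2782, not a complete implementation; `pick_server` and the `alive` set (assumed knowledge of which hosts currently respond) are hypothetical.

```python
import random

# (priority, weight, target), as listed for example.com on the slide.
RECORDS = [
    (0, 40, "s1.example.com"),
    (0, 40, "s2.example.com"),
    (0, 20, "s3.example.com"),
    (1, 0, "ex.backup.com"),
]


def pick_server(records, alive, rng=random):
    """Pick a reachable target: lowest priority first, weighted within it."""
    for priority in sorted({p for p, _, _ in records}):
        group = [(w, host) for p, w, host in records
                 if p == priority and host in alive]
        if not group:
            continue                        # whole priority level is down
        hosts = [host for _, host in group]
        weights = [w for w, _ in group]
        if sum(weights) == 0:
            return rng.choice(hosts)        # all weights zero: pick uniformly
        return rng.choices(hosts, weights=weights, k=1)[0]
    return None


rng = random.Random(2005)                   # fixed seed for a repeatable demo
everyone = {host for _, _, host in RECORDS}
counts = {}
for _ in range(10000):
    host = pick_server(RECORDS, everyone, rng)
    counts[host] = counts.get(host, 0) + 1

# If s1, s2 and s3 are all down, the priority-1 backup is used.
fallback = pick_server(RECORDS, {"ex.backup.com"}, rng)
```

With all first-stage servers alive, s1 and s2 each receive roughly 40% of the requests and s3 roughly 20%, which matches the 40/40/20 weights.
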
Load Sharing
Result (UDP, stateless, no DNS, no mempool)

S   P   CPS
3   3   2800
2   3   2100
2   2   1800
1   2   1050
0   1   900

Lessons Learnt
Load sharing
Non-uniform distribution
  Identifier distribution (bad hash function)
  Call distribution => dynamically adjust
Stateless proxy
  S=1050, P=900 CPS
  S3P3 => 10 million BHCA (busy hour call attempts)
Stateful proxy
  S=800, P=650 CPS
Registration (no auth)
  S=2500, P=2400 RPS
  S3P3 => 10 million subscribers (1 hour refresh)
Memory pool and thread-pool2/event-based further increase the capacity (approx 1.8x)

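The "10 million" figures follow from simple rate arithmetic. The sketch below uses the 2800 CPS measured for S3P3 on the results slide; the function names are hypothetical, and the registration line only computes the rate that 10 million subscribers with a one hour refresh would require.

```python
SECONDS_PER_HOUR = 3600


def busy_hour_call_attempts(calls_per_second):
    """Call attempts handled in one hour at a sustained rate."""
    return calls_per_second * SECONDS_PER_HOUR


def registrations_per_second(subscribers, refresh_interval_s):
    """Average REGISTER rate when every subscriber refreshes once per interval."""
    return subscribers / refresh_interval_s


bhca = busy_hour_call_attempts(2800)                      # S3P3 stateless result
rps_needed = registrations_per_second(10_000_000, SECONDS_PER_HOUR)
```

2800 CPS sustained for an hour is 10,080,000 call attempts, and 10 million subscribers refreshing hourly generate about 2778 registrations per second.
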
Conclusions and future work
Server scalability
  Non-blocking, process/events/thread, resource management, optimizations
Load sharing
  DNS, Identifier, two-stage
Current and future work:
  Measure process pool performance for stateful
  Optimize sipd
    Use thread-pool2/event-based (?)
    Memory - use counted strings; clean after 200 (?)
    CPU - use hash tables
  Presence, call stateful and TLS performance (Vishal and Eilon)

Backup slides

Telephone scalability
(PSTN: Public Switched Telephone Network)
local telephone switch (class 5 switch)
  10,000 customers
  20,000 calls/hour
regional telephone switch (class 4 switch)
  100,000 customers
  150,000 calls/hour
database (SCP) for freephone, calling card, …
  10 million customers
  2 million lookups/hour
signaling router (STP)
  1 million customers
  1.5 million calls/hour
[Figure: telephone switches (SSP) connected by the “bearer” network, with signaling routers (STP) and databases (SCP) on the signaling network (SS7).]

SIP server
Comparison with HTTP server
Signaling (vs data) bound
  No File I/O (exception: scripts, logging)
  No caching; DB read and write frequency are comparable
Transactions
  Stateful wait for response
Depends on external entities
  DNS, SQL database
Transport
  UDP in addition to TCP/TLS
Goals
  Carrier class scaling using commodity hardware
  Try not to customize/recompile OS or implement (parts of) server in kernel (khttpd, AFPA)

Related work
Scalability for (web) servers
Existing work
  Connection dispatcher
  Content/session-based redirection
  DNS-based load sharing
HTTP vs SIP
  UDP+TCP, signaling not bandwidth intensive, no caching of response, read/write ratio is comparable for DB
SIP scalability bottleneck
  Signaling (chapter 4), real-time media data, gateway
  302 redirect to less loaded server, REFER session to another location, signal upstream to reduce

Related work
3GPP (release 5)’s IP Multimedia core network Subsystem uses SIP
Proxy-CSCF (call session control function)
  First contact in visited network. 911 lookup. Dialplan.
Interrogating-CSCF
  First contact in operator’s network.
  Locate S-CSCF for register
Serving-CSCF
  User policy and privileges, session control service
  Registrar
Connection to PSTN
  MGCF and MGW

Server-based vs peer-to-peer
Reliability, failover latency
  Server-based: DNS-based. Depends on client retry timeout, DB replication latency, registration refresh interval.
  Peer-to-peer: DHT self organization and periodic registration refresh. Depends on client timeout, registration refresh interval.
Scalability, number of users
  Server-based: Depends on number of servers in the two stages.
  Peer-to-peer: Depends on refresh rate, join/leave rate, uptime.
Call setup latency
  Server-based: One or two steps.
  Peer-to-peer: O(log(N)) steps.
Security
  Server-based: TLS, digest authentication, S/MIME.
  Peer-to-peer: Additionally needs a reputation system, working around spy nodes.
Maintenance, configuration
  Server-based: Administrator: DNS, database, middle-box.
  Peer-to-peer: Automatic: one time bootstrap node addresses.
PSTN interoperability
  Server-based: Gateways, TRIP, ENUM.
  Peer-to-peer: Interact with server-based infrastructure or co-locate peer node with the gateway.

Comparison of sipd and SER
sipd
  Thread pool
  Events (reactive system)
  Memory pool
  PentiumIV 3GHz, 1GB, 1200 CPS, 2400 RPS (no auth)
SER
  Process pool
  Custom memory management
  PentiumIII 850 MHz, 512 MB => 2000 CPS, 1800 RPS
