Transcript threads
Combining Events and Threads
for Scalable Network Services
Peng Li and Steve Zdancewic
University of Pennsylvania
PLDI 2007, San Diego
1
Overview
A Haskell framework for massively concurrent network
applications
Servers, P2P systems, load generators
Massive concurrency ::=
1,000 threads?
(easy)
A lazy, purely
functional
language
| 10,000 programming
threads? (common)
| 100,000http://www.haskell.org
threads? (challenging)
| 1,000,000 threads? (20 years later?)
| 10,000,000 threads? (in 15 minutes)
How to write such programs?
The very first decision to make: the programming model
Shall we use threads or events?
2
Threads vs. Events
The event-driven model:
The multithreaded model
One thread ↔ 10000 clients
Asynchronous I/O
Scheduling: programmer
One thread ↔ one client
Synchronous I/O
Scheduling: OS/runtime libs
“Why threads
a bad
int send_data(int
fd1, are
int fd2)
{ idea
(for most purposes)”
while (!EOF(fd1))
{
[USENIX
ATC 1999]
size =
read_chunk(fd,
buf, count);
write_chunk(fd, buf, size);
}
events are a bad idea
while(1)“Why
{
(for high-concurrency
servers)”
nfds=epoll_wait(kdpfd,
events,
MAXEVT,-1);
for(n=0; n<nfds;
++n) 2003]
[HotOS
handle_event(events[n]);
…
…
Threads
Events
Expressiveness and
Abstraction
(for programming each client)
Synchronous I/O + intuitive Finite state machines /
control flow primitives
Continuation-passing style
(CPS) programming
Flexibility and Control
(for resource scheduling)
Baked into OS/runtime,
difficult to customize
Programmer has complete
control – tailored to each
application’s needs
3
Can we get the best of both worlds?
Programming with each client: threads
• Synchronous I/O
• Intuitive control-flow primitives
One
application
program
The bridge between threads/events?
(some kind of “continuation” support)
Resource scheduling: events
•Written as part of the application
• Tailored to application’s needs
4
Roads to lightweight,
application-level concurrency
Direct language support for continuations:
Good if you have them
Source-to-source CPS translations
Requires hacking on compiler/runtime
Often not very elegant
Other solutions?
(no language support)
(no compiler/runtime hacks)
5
The poor man’s concurrency monad
“A poor man’s concurrency monad” by Koen Claessen,
JFP 1999. (Functional Pearl)
The thread interface:
The CPS monad
The event interface:
A lazy, tree-like data structure called “trace”
Internal
Representation
Multithreaded code
Scheduler code
SYS_NBIO(accept)
CPS
Monad
Event Abstraction
SYS_EPOLL_WAIT(s)
Thread Abstraction
server_loop s = do {
sock <- sock_accept s;
sys_fork (client sock);
server_loop;
}
client_loop sock = do {
sock_send sock data;
sock_close sock;
}
sock_send sock data = do {
...
n<-sys_nbio (write_nb ...);
...;
sys_epoll_wait sock EPOLL_READ;
...
foo;
...
n<-sys_nbio (write_nb ...);
...
}
Trace
SYS_NBIO(accept)
SYS_FORK
SYS_NBIO(write_nb)
SYS_EPOLL_WAIT(sock)
SYS_NBIO(write_nb)
scheduler = do {
...
trace <- fetch_thread;
execute trace;
...
}
execute trace =
case trace of
SYS_NBIO c ->
do { cont <- c;
execute cont;
}
SYS_FORK t1 t2 ->
...
6
Questions on the poor man’s approach
Does it work for high-performance
network services?
(using a pure, lazy, functional language?)
How does the design scale up to real systems?
Symmetrical multiprocessing? Synchronization? I/O?
How cheap is it?
How much does a poor man’s thread cost?
How poor is it?
Does it offer acceptable performance?
7
Our experiment
A high-performance Haskell framework for massivelyconcurrent network services!!!
Supported features:
Linux Asynchronous IO (AIO)
epoll() and nonblocking IO
OS thread pools
SMP support
Thread synchronization primitives
Applications developed
IO benchmarks on FIFO pipes / Disk head scheduling
A simple web server for static files
HTTP load generator
Prototype of an application-level TCP stack
We used the Glasglow Haskell Compiler (GHC)
8
Multithreaded code example
Nested function calls
Exception handling
Conditional branches
Synchronous call to
I/O lib
Recursion
9
Event-driven code example
An event loop running in a
separate OS thread
A wrapper function to the C
library call using the Haskell
Foreign Function Interface (FFI)
Put events in queues for
processing in other OS threads
10
A complete event-driven I/O
One “virtual
subsystem
processor” event
loop for each
ready_queue
CPU
System call completion, thread ready to run
Fetch threads
OS thread pool for CPS thread execution
worker_main
OS thread pool for blocking I/O
Context switch
/ fork
worker_blio
blio_queue
Fetch thread
worker_main
worker_main
worker_main
Haskell Foreign
Function
Inteface (FFI)
worker_blio
SYS_BLIO
Register event handler
Submit AIO request
with event handler
Epoll
interface
AIO interface
Event notification
AIO completion
worker_epoll
worker_aio
Each event loop
runs in a
separate OS
thread
11
Modular and customizable I/O system
(add a TCP stack if you like)
ready_queue
System call completion, thread ready to run
Fetch threads
OS thread pool for CPS thread execution
worker_main
OS thread pool for blocking I/O
Context switch
/ fork
worker_blio
blio_queue
SYS_BLIO
Fetch thread
worker_main
Define
/ interpret Register
TCPevent handler
worker_main
syscalls (22 lines)
Epoll
interface
Event
Event notification
worker_blio
worker_epoll
loop for incoming
packets (7 lines)
worker_main
Submit AIO request
with event handler
AIO interface
worker_main
Blocking
TCP User requests
TCP stack
states
AIO completion
Request Completion
Request Completion
Event loop for timers
(9 lines)
worker_aio
worker_tcp_input
worker_tcp_timer
12
How cheap is a poor man’s thread?
Minimal memory consumption:
48 bytes
48 bytes
Each thread just loops and does nothing
Actual size determined by thread-local states
Even an ethernet packet can be >1,000 bytes…
Pay as you go --- only pay for things needed
In contrast:
A Linux POSIX thread’s stack has
2MB by default
The state-of-the-art user-level thread system (Capriccio) use at
least a few KBs for each thread
Observation:
The poor man’s thread is extremely memory-efficient
(Challenging most event-driven systems)
13
I/O scalability test
Comparison against the Linux POSIX Thread Library
(NPTL)
Highly optimized OS thread implementation
Each NPTL thread’s stack limited to 32KB
Mini-benchmarks used:
Disk head scheduling (all threads running)
FIFO pipe scalability with idle threads (128 threads running)
14
A simple web server
15
How poor is the poor man’s monad?
Not too shabby
Benchmarks shows comparable (if not higher)
performance to existing, optimized systems
An elegant design is more important than 10%
performance improvement
type safety for
Added benefit:
many dangerous things
Continuations, thread queues, schedulers, asynchronous I/O
16
Related Work
We are motivated by two projects:
Twisted: the python event-driven framework for
scalable internet applications
- The programmer must write code in CPS
Capriccio: a high-performance user-level thread
system for network servers
- Requires C compiler hacks
- Difficult to customize (e.g. adding SMP support)
Continuation-based concurrency
[Wand 80], [Shivers 97], …
Other languages and programming models:
CML, Erlang, …
17
Conclusion
Haskell and The Poor Man’s Concurrency
Monad are a promising solution for highperformance, massively-concurrent networking
applications:
Get the best of both threads and events!
This poor man’s approach is actually very
cheap, and not so poor!
http://www.cis.upenn.edu/~lipeng/homepage/unify.html
18