
Multi-Level Architecture for Data Plane Virtualization
Eric Keller
Oral General Exam
5/5/08
The Internet (and IP)
• Usage of the Internet is continuously evolving
• The way packets are forwarded hasn't changed (IP)
– Meant for communication between machines
– Address tied to a fixed location
– Hierarchical addressing
– Best-effort delivery
– Addresses easy to spoof
• Great innovation at the edge (Skype/VoIP, BitTorrent)
– Programmability of hosts at the application layer
– Can't add any functionality into the network
Proposed Modifications
• Many proposals to modify some aspect of IP
– No single one is best
– Difficult to deploy
• Publish/Subscribe mechanism for objects
– Instead of routing on machine address, route on object ID
– e.g. DONA (data-oriented network architecture), scalable simulation
• Route through intermediary points
– Instead of communication between machines
– e.g. i3 (Internet indirection infrastructure), DOA (delegation-oriented architecture)
• Flat addressing to separate location from ID
– Instead of hierarchical addressing based on location
– e.g. ROFL (routing on flat labels), SEIZE (scalable and efficient zero-configuration enterprise)
Challenges
• Want to innovate in the network
– Can't, because networks are closed
• Need to lower the barrier for who innovates
– Allow individuals to create a network and define its functionality
• Virtualization as a possible solution
– For both network of future and overlay networks
– Programmable and sharable
– Examples: PlanetLab, VINI
Network Virtualization
• Running multiple virtual networks at the same time over a shared physical infrastructure
– Each virtual network is composed of virtual routers with custom functionality
[Figure: physical machines each host several virtual routers; a virtual network is, e.g., the blue virtual routers plus the blue links]
Virtual Network Tradeoffs
• Goal: Enable custom data planes per virtual network
– Challenge: How to create the shared network nodes
[Figure: tradeoff triangle: programmability, isolation, performance]
Virtual Network Tradeoffs
• Goal: Enable custom data planes per virtual network
– Challenge: How to create the shared network nodes
Programmability:
– How easy is it to add new functionality?
– What is the range of new functionality that can be added?
– Does it extend beyond "software routers"?
Virtual Network Tradeoffs
• Goal: Enable custom data planes per virtual network
– Challenge: How to create the shared network nodes
Programmability
Does resource usage by one
virtual networks have an
effect on others?
Faults?
How secure is it given a
shared substrate?
Isolation
Performance
8
Virtual Network Tradeoffs
• Goal: Enable custom data planes per virtual network
– Challenge: How to create the shared network nodes
Performance:
– How much overhead is there for sharing?
– What is the forwarding rate? Throughput? Latency?
Virtual Network Tradeoffs
• Network Containers
– Duplicate the stack or data structures
– e.g. Trellis, OpenVZ, Logical Router
• Extensible Routers
– Assemble custom routers from common functions
– e.g. Click, Router Plug-Ins, Scout
• Virtual Machines + Click
– Run one operating system on top of another
– e.g. Xen, PL-VINI (Linux-VServer)
[Each approach is scored on the slide along the programmability / isolation / performance axes]
Outline
• Architecture
• Implementation
– Virtualizing Kernel
– Challenges with kernel execution
– Extending beyond commodity hardware
• Evaluation
• Conclusion/Future Work
User Experience (creating a virtual network)
• Custom functionality
– Custom user environment on each node (for controlling the virtual router)
– Specify a single node's packet handling as a graph of common functions (see the sketch below)
[Figure: example node: a user control environment (determine shortest paths, populate routing tables) sits above a config/query interface to the packet-handling graph (elements A1–A5, e.g. check header, destination lookup) that runs from the input devices to the output devices]
• Isolated from others sharing the same node
– Allocated a share of resources (e.g. CPU, memory, bandwidth)
– Protected from faults in others (e.g. another virtual router crashing)
• Highest performance possible
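As a concrete sketch of such a graph (mine, not from the talk; the element names are standard Click, and the addresses and MACs are made up), in Click's configuration language:

  fd :: FromDevice(eth0);
  rt :: RadixIPLookup(10.0.0.0/8 0, 0.0.0.0/0 1);   // destination lookup
  fd -> Strip(14)          // remove the Ethernet header
     -> CheckIPHeader      // check the IP header
     -> rt;
  rt[0] -> EtherEncap(0x0800, 1:1:1:1:1:1, 2:2:2:2:2:2) -> Queue(1024) -> ToDevice(eth1);
  rt[1] -> EtherEncap(0x0800, 1:1:1:1:1:1, 3:3:3:3:3:3) -> Queue(1024) -> ToDevice(eth2);

The user control environment would populate rt's routing table at run time through its handlers (the config/query interface).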
Lightweight Virtualization
• Combine graphs into a single graph
– Provides lightweight virtualization
[Figure: Graph 1 and Graph 2 combine into a master graph]
• Add extra packet processing (e.g. mux/demux)
– Needed to direct packets to the correct graph
[Figure: the master graph contains Graph 1 and Graph 2 between shared input and output ports]
• Add resource accounting
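A minimal sketch of what the combined master graph might look like in Click's language (my illustration; the ethertype-based demux and the Counter stand-ins for each network's graph are assumptions):

  from :: FromDevice(eth0);
  demux :: Classifier(12/0800, -);   // added demux: IP packets to network 1, everything else to network 2
  net1 :: Counter;                   // stand-in for virtual network 1's graph
  net2 :: Counter;                   // stand-in for virtual network 2's graph
  mux :: RoundRobinSched;            // added mux on the shared output
  from -> demux;
  demux[0] -> net1 -> Queue(1024) -> [0]mux;
  demux[1] -> net2 -> Queue(1024) -> [1]mux;
  mux -> ToDevice(eth0);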
Increasing Performance and Isolation
• Partition the master graph into multiple graphs across multiple targets
– Each target has different capabilities
 Performance, programmability, isolation
– Add connectivity between targets
– Unified run-time interface (it appears as a single graph)
 To query and configure the forwarding capabilities
[Figure: Graph 1 and Graph 2 are combined into a master graph, which is then partitioned into Target0, Target1, and Target2 graphs]
Examples of Multi-Level
• Fast Path/Slow Path
– IPv4: forwarding in fast path, exceptions in slow path
– i3: Chord ring lookup function in fast path, handling requests in slow path
• Preprocessing
– IPsec: encryption/decryption in HW, the rest in SW
• Offloading
– TCP Offload
– TCP Splicing
• Pipeline of coarse-grained services
– e.g. transcoding, firewall
– SoftRouter from Bell Labs
Outline
• Architecture
• Implementation
– Virtualizing Kernel
– Challenges with kernel execution
– Extending beyond commodity hardware
• Evaluation
• Conclusion/Future Work
Implementation
• Each network has custom functionality
– Specified as graph of common functions
– Click modular router
• Each network allocated share of resources
– e.g. CPU
– Linux-VServer: single resource accounting for both control and packet processing
• Each network protected from faults in others
– Library of elements considered safe
– Container for unsafe elements
• Highest performance possible
– FPGA for modules with a hardware option, kernel for modules without
Click Background: Overview
• Software architecture for building flexible and configurable routers
– Widely used – commercially and in research
– Easy to use, flexible, high performance (but not sharable)
• Routers assembled from packet-processing modules (elements)
– Simple and complex
• Processing is a directed graph: FromDevice(eth0) -> Counter -> Discard
• Includes a scheduler
– Schedules tasks (a series of elements)
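Written out in Click's configuration language, the graph above is the first line below; the compound element that follows is my illustrative addition:

  FromDevice(eth0) -> Counter -> Discard;

  // the same counting stage packaged as a reusable compound element
  elementclass CountedPath {
    input -> Counter -> output;
  }
  FromDevice(eth1) -> CountedPath -> Discard;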
Linux-VServer
Linux-VServer + Click + NetFPGA
[Figure: a coordinating process drives per-network installers; each installer manages a Click instance (user-space Click, kernel Click, or Click on the NetFPGA)]
Outline
• Architecture
• Implementation
– Virtualizing Click in the Kernel
– Challenges with kernel execution
– Extending beyond software routers
• Evaluation
• Conclusion/Future Work
Virtual Kernel Mode Click
• Want to run in kernel mode
– Close to 10x higher performance than in user mode
• Use a library of 'safe' elements
– Since the kernel is a shared execution space
• Need resource accounting
– The Click scheduler does not do resource accounting
– Want resource accounting system-wide (i.e. not just inside packet processing)
Resource Accounting with VServer
• Purpose of Resource Accounting
– Provides isolation between virtual networks
• Unified resource accounting
– For packet processing and control
• VServer's token-bucket extension to the Linux scheduler
– Controls the eligibility of processes/threads to run
• Integrating with Click
– Each individual Click configuration is assigned its own thread
– Each thread is associated with a VServer context
 The basic mechanism is to manipulate the task_struct
Outline
• Architecture
• Implementation
– Virtualizing Kernel
– Challenges with kernel execution
– Extending beyond software routers
• Evaluation
• Conclusion/Future Work
Unyielding Threads
• Linux kernel threads are cooperative (i.e. they must yield)
– The token scheduler controls when they are eligible to start
• A single long task can cause short-term disruptions
– Affecting delay and jitter on other virtual networks
• The token bucket does not go negative
– Long term, a virtual network can get more than its share
[Figure: token bucket: tokens added at rate A; bucket of size S; minimum M tokens to execute; one token consumed per scheduler tick]
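In token-bucket terms (a standard formulation written against the parameters in the figure, not transcribed from the talk): with b_t tokens in the bucket at tick t, a thread is eligible to run only when b_t >= M, consumes one token per tick while running (c_t = 1, else 0), and accrues tokens at rate A up to the bucket size S:

  b_{t+1} = \max\bigl(0,\ \min(S,\ b_t + A - c_t)\bigr)

Because b_t is clamped at zero instead of going negative, cycles consumed beyond the available tokens are never charged, which is how a long-running virtual network can exceed its long-term share of A tokens per tick.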
Unyielding Threads (solution)
• Determine maximum allowable execution time
– e.g. from token bucket parameters, network guarantees
• Determine pipeline’s execution time
– Elements from the library have known execution times
– Custom elements' execution times are unknown
[Figure: pipeline elem1 -> elem2 -> elem3]
• Break the pipeline up (for known execution times)
[Figure: elem1 -> elem2 and elem3 split into separately scheduled pieces]
• Execute inside a container (for unknown execution times)
[Figure: elem1 -> elem2 run in the kernel; elem3 is diverted to user space and back (To User / From Kern)]
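In stock Click, one way to break a pipeline into separately scheduled tasks (my sketch; the Counter elements stand in for elem1 through elem3, whose costs are the point of the exercise) is to insert a Queue, since the push path upstream and the Unqueue-driven path downstream become independent tasks:

  // task 1: receive and run the first two elements, then park the packet
  FromDevice(eth0) -> e1 :: Counter -> e2 :: Counter -> q :: Queue(64);
  // task 2: a separately scheduled Unqueue drains the queue through the third element
  q -> Unqueue -> e3 :: Counter -> Discard;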
Custom Elements Written in C++
• Elements have access to global state
– Kernel state/functions
– Click global state
• Could…
– Pre-compile in user mode
– Pre-compile with restricted header files
• Not perfect:
– With C++, you can manipulate pointers
• Instead, custom elements are treated as unknown ("unsafe")
– Executed in a container in user space
Outline
• Architecture
• Implementation
– Virtualizing Kernel
– Challenges with kernel execution
– Extending beyond commodity hardware
• Evaluation
• Conclusion/Future Work
Extending beyond commodity HW
• PC + programmable NIC (e.g. NetFPGA)
– FPGA on a PCI card
– 4 GigE ports
– On-board SRAM and DRAM
• Jon Turner's "pool of processing elements" with a crossbar
– PEs can be GPPs, NPUs, or FPGAs
– Switch fabric = crossbar
[Figure: processing engines PE1…PEm and line cards LC1…LCn connected by a switch fabric]
• Partition between FPGA and software; generalize: partition among PEs
FPGA Click
• Two previous approaches
– Cliff: Click graph to Verilog, standard interface on modules
– CUSP: optimizes the Click graph by parallelizing internal statements
• Our approach:
– Build on Cliff by integrating FPGAs into Click (the tool)
• Software analogies
– Connection to the outside environment
– Packet transfer
– Element specification and implementation
– Run-time querying and configuration
– Memory
– Notifiers
– Annotations
[Figure: FromDevice(eth0) -> Element(LEN 5) -> ToDevice(eth0)]
Outline
• Architecture
• Implementation
– Virtualizing Kernel
– Challenges with kernel execution
– Extending beyond commodity hardware
• Evaluation
• Conclusion/Future Work
Experimental Evaluation
• Is multi-level the right approach?
– i.e. is it worth the effort to support kernel and FPGA?
– Does programmability imply less performance?
• What is the overhead of virtualization?
– From container: when you need to go to user space.
– From using multiple threads: when running in the kernel.
• Are the virtual networks isolated in terms of resource usage?
– What is the maximum short-term disruption from unyielding threads?
– How long can a task run without leading to long-term unfairness?
Setup
PC3000 nodes on Emulab (3 GHz, 2 GB RAM)
[Figure: n0 generates packets addressed from n0 to n1, tagged with the send time. rtr, the router under test (Linux or a Click config), rewrites the IP and Ethernet headers to be from n1 to n2. n2 receives the packets and stores the average of (arrival time minus packet timestamp) in memory. n1 and n3 are additional nodes.]
Is multi-level the right approach?
[Bar chart: peak send rate (64-byte pkts/sec), 0–700,000, for FPGA, Click-Kernel, Linux, and Click-User]
• Performance benefit going from user to kernel, and from kernel to FPGA
• Does programmability imply less performance?
– Not sacrificing performance by introducing programmability
What is the overhead of virtualization? From the container
• When you must go to user space, what is the cost of executing in a container?
[Bar chart: peak rate (64-byte pkts/sec), 0–45,000: Click in a container vs. Click not in a container]
• Overhead of executing in a VServer is minimal
What is the overhead of virtualization? From using multiple threads
• Put the same Click graph in each thread (each thread runs X tasks per yield)
• Round-robin traffic between the threads
[Figure: PollDevice -> RoundRobin -> a 4portRouter compound element per thread -> ToDevice]
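A sketch of the experiment's shape in Click's language (mine; the compound element's internals are a stand-in for the real 4-port router):

  elementclass FourPortRouter {
    input -> Counter -> output;      // stand-in internals
  }
  pd :: PollDevice(eth0);
  rr :: RoundRobinSwitch;            // round-robin traffic between the copies
  r1 :: FourPortRouter;              // one copy per thread
  r2 :: FourPortRouter;
  sched :: RoundRobinSched;
  pd -> rr;
  rr[0] -> r1 -> Queue(1024) -> [0]sched;
  rr[1] -> r2 -> Queue(1024) -> [1]sched;
  sched -> ToDevice(eth0);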
How long to run before yielding
[Chart: peak send rate (64-byte pkts/sec, 0–160,000) vs. tasks/yield (1 to 100,000, log scale) for 2, 4, and 10 virtual networks]
• Number of tasks per yield:
– Low => high context switching, I/O executes often
– High => low context switching, I/O executes infrequently
What is the overhead of virtualization? From using multiple threads
[Chart: for 1–10 virtual networks, bars show peak send rate (64-byte pkts/sec, up to ~180,000) and the line shows the sweet-spot tasks/yield (0–70)]
• Given the sweet spot for each number of virtual networks:
– Increasing the number of virtual networks from 1 to 10 does not hurt aggregate performance significantly
• Alternatives to consider
– Single-threaded with VServer
– Single-threaded, modifying Click to do resource accounting
– Integrating polling into threads
What is the maximum short-term disruption from unyielding threads?
• Profile of (some) elements
– Standard N-port router example: ~5400 cycles (1.8 µs)
– RadixIPLookup (167k entries): ~1000 cycles
– Simple elements:
 CheckLength: ~400 cycles
 Counter: ~700 cycles
 HashSwitch: ~450 cycles
[Measurement pipeline: InfiniteSource -> RoundTripCycleCount -> element under test -> Discard (NoFree)]
• Maximum disruption is the length of the longest task
– Possible to break up pipelines
How long can a task run without leading to long-term unfairness?
[Figure: two virtual networks, each an InfiniteSource feeding a 4portRouter compound element into a Discard; one network additionally runs "Chewy", a cycle-consuming element, and is limited to 15% of the CPU; consumed cycles are counted]
How long can a task run without leading to long-term unfairness?
[Chart: fraction of CPU used by the network running Chewy (0–0.8) vs. Chewy load (0–900); a zoomed-in view (load 0–50, fraction 0–0.2) marks the knee at ~10k extra cycles per task]
• Tasks longer than one token can lead to unfairness
• Run long-executing elements in user space
– The performance overhead of user space is then not as big an issue
Outline
• Architecture
• Implementation
– Virtualizing Kernel
– Challenges with kernel execution
– Extending beyond commodity hardware
• Evaluation
• Conclusion/Future Work
Conclusion
• Goal: Enable custom data planes per virtual network
• Tradeoffs
– Performance
– Isolation
– Programmability
• Built a multi-level version of Click
– FPGA
– Kernel
– Container
Future Work
• Scheduler
– Investigate alternatives to improve efficiency
• Safety
– Process to certify element as safe (can it be automated?)
• Applications
– Deploy on VINI testbed
– Virtual router migration
• HW/SW Codesign Problem
– Partition decision making
– Specification of elements (G language)
Questions
[Slide art: "Multi-Level Click!"]
Backup
Signs of Openness
• There are signs that network owners and equipment providers are opening up
• Peer-to-peer and network provider collaboration
– Allowing intelligent selection of peers
– e.g. Pando/Verizon (P4P), BitTorrent/Comcast
• Router vendor APIs
– Allowing creation of software to run on routers
– e.g. Juniper PSDP, Cisco AXP
• Cheap and easy access to compute power
– Define functionality and communication between machines
– e.g. Amazon EC2, Sun Grid
Example 1: User/Kernel Partition
• Execute "unsafe" elements in a container
– Add communication elements
[Figure: safe elements s1, s2, s3 run in the kernel; unsafe element u1 runs in a user-space container; packets leave the kernel via ToUser (tu), enter the container via FromKernel (fk), pass through u1, return via ToKernel (tk), and re-enter the kernel graph via FromUser (fu)]
Example 2: Non-Commodity HW
• PC + programmable NIC (e.g. NetFPGA)
– FPGA on a PCI card
– 4 GigE ports
– On-board SRAM and DRAM
• Jon Turner's "pool of processing elements" with a crossbar
– PEs can be GPPs, NPUs, or FPGAs
– Switch fabric = crossbar
[Figure: processing engines PE1…PEm and line cards LC1…LCn connected by a switch fabric]
• Partition between FPGA and software; generalize: partition among PEs
Example 2: Non-Commodity HW
• Redrawing the picture for FPGA/SW…
– Elements can have a HW implementation, a SW implementation, or both (choose one)
[Figure: sw1 runs in software; hw1, hw2, hw3 run on the FPGA; the graph crosses the boundary via ToCPU (tc), FromCPU (fc), FromDevice (fd), and ToDevice (td)]
Connection to outside environment
• In Linux, the "board" is a set of devices (e.g. eth0)
– Can query Linux for what's available
– Network driver (to read/write packets)
– Inter-process communication (for communication with handlers)
• FPGA is a chip on a board
– Using "eth0" needs:
 Pins to connect to
 Some on-chip logic (in the form of an IP core)
• Board API
– Specify available devices
– Specify size of address block (used by the char driver)
– Provide elaborate() function
 Generates a top level Verilog module
 Generates a UCF file (pin assignments)
Packet Transfer
• In software it is a function call
• In the FPGA, use a pipeline of elements with a standard interface
• Option 1: stream the packet through, one word at a time
– Could just be the header
– Push/pull is a bit tricky
• Option 2: pass a pointer
– But it would have to go through memory (inefficient)
[Figure: Element1 connects to Element2 through a standard interface: data, ctrl, valid, ready]
Element specification and implementation
• Need
– Meta-data
– Specify packet processing
– Specify run-time query handling (next slide)
• Meta-data
– Use Click C++ API
– Ports
– Registers to use specific devices
 e.g. FromDevice(eth0) registers to use eth0
• Packet Processing
– Use C++ to print out Verilog
 Specialized based on instantiation parameters (config. string)
– Standard interface for packet
– Standard interface for handler
 Currently a memory-mapped register
Run-time querying and configuration
• Query state and update configuration in elements
– e.g. "add ADDR/MASK [GW] OUT"
[Figure: the user runs telnet to talk to Click; requests pass through a char driver in the kernel and over PCI to decode logic on the FPGA, which reaches elem1, elem2, elem3]
• When creating an element
– Request an address block
– Specify software handlers
– Use read/write methods to get data
• Allocating addresses
– Given the total size and the size of each element's requested block
• Generating decode logic
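On the software side, this style of run-time access already exists in stock Click; a minimal sketch (mine, with a made-up port number) using a ControlSocket and a routing-table element whose add handler takes exactly the "ADDR/MASK [GW] OUT" syntax above:

  ControlSocket(TCP, 7777);            // expose element handlers over TCP
  rt :: RadixIPLookup(0.0.0.0/0 0);
  Idle -> rt -> Discard;

  // then, at run time:
  //   $ telnet localhost 7777
  //   write rt.add 10.0.0.0/8 0       (update configuration)
  //   read rt.table                   (query state)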
Memory
• In software
– malloc
– Static arrays
– Share a table through global variables or by passing a pointer
– Elements that do no packet processing (passed as a configuration string to other elements)
• In FPGA
– Elements have local memory (registers/BRAM)
– Unshared (off-chip) memories – treat like a device
– Shared (off-chip) global memories (Unimplemented)
 Globally shared vs. Shared between subset of elements
– Elements that do no packet processing (Unimplemented)
Notifiers, Annotations
• Notifiers
– Element registers as listener or notifier
– In FPGA, create extra signal(s) from notifier to listener
• Annotations
– Extra space in Packet data structure
– Used to mark the packet with info not in the packet
 Which input port the packet arrived on
 Result of a lookup
– In software
 fixed byte array
– In FPGA
 packet is streamed through, so adding extra bytes is simple
User/Kernel Communication
• Add communication elements
– Use mknod for each direction
– ToUser/FromUser store packets and provide file functions
– ToKernel/FromKernel use file I/O
[Figure: safe elements s1, s2, s3 in the kernel; unsafe u1 in a user-space container; packets leave the kernel via ToUser (tu), enter the container via FromKernel (fk), pass through u1, return via ToKernel (tk), and re-enter the kernel graph via FromUser (fu)]
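A sketch of the two graph halves in Click's language (mine; ToUser/FromUser/ToKernel/FromKernel are the custom elements described above, the device-node paths and argument syntax are assumptions, and Counter stands in for the real elements):

  // kernel-side graph: safe elements, with the unsafe hop diverted to user space
  FromDevice(eth0) -> s1 :: Counter -> ToUser(/dev/click_up);
  FromUser(/dev/click_down) -> s2 :: Counter -> s3 :: Counter -> Discard;

  // user-side graph, running inside the container
  FromKernel(/dev/click_up) -> u1 :: Counter -> ToKernel(/dev/click_down);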
FPGA/Software Communication
• Add communication elements
– ToCPU/FromCPU use a device that communicates with Linux over the PCI bus
– Network driver in Linux
– ToDevice/FromDevice: standard Click elements
[Figure: sw1 runs in software; hw1, hw2, hw3 run on the FPGA; packets cross between the two via ToCPU (tc) and FromCPU (fc), with FromDevice (fd) and ToDevice (td) at the edges]
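A matching sketch for the FPGA/software split (mine; ToCPU/FromCPU are the system's elements described above, the NetFPGA interface name nf2c0 and the argument syntax are assumptions, and Counter stands in for real processing):

  // FPGA-side graph: hw1 runs in hardware, then hands the packet to software
  FromDevice(nf2c0) -> hw1 :: Counter -> ToCPU;
  FromCPU -> hw2 :: Counter -> hw3 :: Counter -> ToDevice(nf2c0);

  // software-side graph (a separate, standard Click config on the host,
  // reached through the Linux network driver)
  FromDevice(nf2c0) -> sw1 :: Counter -> ToDevice(nf2c0);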