Networking in the Linux Kernel
WASHINGTON UNIVERSITY IN ST LOUIS
Mike Wilson – 15 March 2005
Introduction
Overview of the Linux Networking implementation:
Covered:
• Data path through the kernel
• Quality of Service features
• Hooks for extensions (netfilter, KIDS, protocol demux placement)
• VLAN Tag processing
• Virtual Interfaces
Not covered:
• Kernels prior to 2.4.20, or 2.6+
• Specific protocol implementations
• Detailed analysis of existing protocols, such as TCP. This is covered
only in enough detail to see how they link to higher/lower layers.
OSI Model
The Linux kernel adheres closely to the OSI 7-layer
networking model
OSI layer                            Linux networking
Application, Presentation, Session   Application (above socket): HTTP, SSH, etc.
Transport                            TCP/UDP
Network                              Internet (IP)
Data Link, Physical                  Data Link (802.x, PPP, SLIP)
OSI Model (Interplay)
Layers generally interact in the same manner, no
matter where placed
1. Start with Layer N+1 Data
2. Add header and/or trailer: Layer N+1 Control + Layer N+1 Data
3. Pass to layer N as raw data: Layer N Data
Socket Buffer
When discussing the data path through the Linux kernel, the
data being passed is stored in sk_buff (socket buffer)
structures. Each sk_buff holds packet data and management
information.
• The sk_buff is first created incomplete, then filled in
during passage through the kernel, both for received
packets and for sent packets.
• Packet data is normally never copied. We just pass around
pointers to the sk_buff and change structure members.
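The "change structure members instead of copying" idea can be sketched in user space. The following is a minimal sketch, not kernel code: struct toy_skb and the toy_* functions are hypothetical names, loosely modeled on the kernel's skb_reserve(), skb_put(), skb_push() and skb_pull(). Each layer adds or strips its header only by moving the data pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy socket buffer (hypothetical stand-in for struct sk_buff).
 * Invariant: head <= data <= tail <= end. */
struct toy_skb {
    unsigned char *head;  /* start of the allocated buffer */
    unsigned char *data;  /* start of the valid packet bytes */
    unsigned char *tail;  /* end of the valid packet bytes */
    unsigned char *end;   /* end of the allocated buffer */
};

struct toy_skb *toy_alloc(size_t size)
{
    struct toy_skb *skb = malloc(sizeof(*skb));
    if (!skb)
        return NULL;
    skb->head = malloc(size);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    memset(skb->head, 0, size);
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    return skb;
}

void toy_free(struct toy_skb *skb)
{
    free(skb->head);
    free(skb);
}

/* Leave headroom for headers that lower layers will prepend later. */
void toy_reserve(struct toy_skb *skb, size_t len)
{
    assert(skb->data == skb->tail);  /* only valid on an empty buffer */
    assert((size_t)(skb->end - skb->tail) >= len);
    skb->data += len;
    skb->tail += len;
}

/* Append len bytes at the tail; returns where the caller writes them. */
unsigned char *toy_put(struct toy_skb *skb, size_t len)
{
    unsigned char *p = skb->tail;
    assert((size_t)(skb->end - skb->tail) >= len);
    skb->tail += len;
    return p;
}

/* Prepend a header (send path): move data backwards into the headroom. */
unsigned char *toy_push(struct toy_skb *skb, size_t len)
{
    assert((size_t)(skb->data - skb->head) >= len);
    skb->data -= len;
    return skb->data;
}

/* Strip a header (receive path): move data forwards past it. */
unsigned char *toy_pull(struct toy_skb *skb, size_t len)
{
    assert((size_t)(skb->tail - skb->data) >= len);
    skb->data += len;
    return skb->data;
}

size_t toy_len(const struct toy_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}
```

On the send path a caller would reserve room for all headers, put the payload, then let each layer push its own header. The payload bytes stay at the same address throughout.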
Socket Buffer
[Figure: sk_buff structure]
• next, prev, list – queue linkage. All sk_buff’s are members of a queue
• head, data, tail, end – pointers into the Packet Data (MAC Header,
IP Header, TCP Header, data)
• dev – associated device
• dev_rx – source device
• sk – socket
• Cloned sk_buff’s share data, but not control
struct sk_buff is defined in:
include/linux/skbuff.h
Socket Buffer
sk_buff features:
• Reference counts for cloned buffers
• Separate allocation pool and support
• Functions for manipulating the data space
• Very “feature-rich” – this is a very complex, detailed
structure, encapsulating information from protocols at
multiple layers
There are also numerous support functions for queues of
sk_buff’s.
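Reference-counted cloning can be illustrated with a small user-space sketch. Everything here (struct toy_data, struct toy_buf, the toy_buf_* functions) is a hypothetical simplification, not the kernel's layout: a clone gets its own control structure but shares the packet bytes, which are freed only when the last user lets go.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Packet bytes plus a count of the control structures sharing them. */
struct toy_data {
    int refcnt;
    size_t len;
    unsigned char bytes[64];
};

/* Per-clone control information (hypothetical; far smaller than sk_buff). */
struct toy_buf {
    struct toy_data *shared;  /* packet data, shared between clones */
    int priority;             /* example of private control state */
};

struct toy_buf *toy_buf_new(const void *src, size_t len)
{
    struct toy_buf *b = malloc(sizeof(*b));
    struct toy_data *d = malloc(sizeof(*d));
    if (!b || !d || len > sizeof(d->bytes)) {
        free(b);
        free(d);
        return NULL;
    }
    memcpy(d->bytes, src, len);
    d->len = len;
    d->refcnt = 1;
    b->shared = d;
    b->priority = 0;
    return b;
}

/* Clone: copy the control fields, share (do not copy) the packet bytes. */
struct toy_buf *toy_buf_clone(const struct toy_buf *orig)
{
    struct toy_buf *c;
    assert(orig != NULL);
    c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    *c = *orig;
    c->shared->refcnt++;
    return c;
}

/* Free one control structure; returns 1 if the shared data was freed too. */
int toy_buf_free(struct toy_buf *b)
{
    int last = (--b->shared->refcnt == 0);
    if (last)
        free(b->shared);
    free(b);
    return last;
}
```

Changing priority on a clone leaves the original untouched, while both still point at the same bytes.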
Data Path Overview
[Figure: data path overview, from user space down to hardware]
user:     socket
kernel:   socket, socket demux
          TCP, UDP, other protocols (Layer 4 protocol demux)
          IP, other protocols (Layer 3 protocol demux)
          net_rx_action() (softirq), Queue Discipline
          DMA rings
          Driver
hardware: Network Device
OSI Layers 1&2 – Data Link
The code presented resides mostly in the following files:
• include/linux/netdevice.h
• net/core/skbuff.c
• net/core/dev.c
• net/dev/core.c
• arch/i386/irq.c
• drivers/net/net_init.c
• net/sched/sch_generic.c
• net/ethernet/eth.c (for layer 3 demux)
Data Link – Data Path
[Figure: Data Link data path]
Layer 3: IP
Layer 2: net_rx_action() (softirq) services the poll_queue
         netif_rx_schedule() adds the device pointer to the poll_queue
         enqueue() feeds the Queue Discipline
         DMA Rings
Driver:  net_interrupt (net_rx, net_tx, net_error), DMA
Network Device
(The figure also marks the kernel/hardware boundary.)
Data Link – Features
• NAPI
– Old API would reach interrupt livelock under 60 MBps
– New API ensures earliest possible drop under overload
1. Packet received at NIC
2. NIC copies to DMA ring (struct sk_buff *rx_ring[])
3. NIC raises interrupt; the handler calls netif_rx_schedule()
4. Further interrupts are blocked
5. Clock-based softirq calls net_rx_action(), which calls dev->poll()
6. dev->poll() calls netif_receive_skb(), which does protocol demux
(usually calling ip_rcv())
• Backward compatibility for non-DMA interfaces maintained
– All legacy devices use the same backlog (equivalent to DMA ring)
– Backlog queue is treated just like all other modern devices
– Per-CPU poll_list of devices to poll. Ensures no packet
re-ordering necessary
• No memory copies in kernel – packet stays in the sk_buff at the same
memory location until passed to user space
Data Link – transmission
• Transmission
1. Packet sent from IP layer to Queue Discipline
2. Any appropriate QoS in qdisc – discussed later
3. qdisc notifies network driver when it’s time to send –
calls hard_start_xmit()
   a. Place all ready sk_buff pointers in tx_ring
   b. Notifies NIC that packets are ready to send
   c. NIC signals (via interrupt) when packet(s) successfully
   transmitted. (Highly variable on when interrupt is sent!)
   d. Interrupt handler queues transmitted packets for deallocation
4. At next softirq, all packets in completion_queue are
deallocated
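The split between the transmit-done interrupt and the later softirq can be sketched in user space. This is a minimal sketch: struct toy_txdev and the toy_* functions are hypothetical, the ring holds only 4 entries for brevity, and setting a freed flag stands in for releasing the sk_buff. The interrupt handler only moves pointers onto the completion queue; the actual cleanup is deferred.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TX_RING_SIZE 4   /* assumed size, kept small for illustration */

struct toy_pkt {
    int id;
    struct toy_pkt *next;  /* link used only on the completion queue */
    int freed;             /* stand-in for deallocation */
};

struct toy_txdev {
    struct toy_pkt *tx_ring[TOY_TX_RING_SIZE];
    int tx_count;
    struct toy_pkt *completion_queue;  /* sent, awaiting deallocation */
};

/* Driver entry point: place a packet pointer in the ring; -1 if full. */
int toy_hard_start_xmit(struct toy_txdev *dev, struct toy_pkt *pkt)
{
    assert(dev != NULL && pkt != NULL);
    if (dev->tx_count == TOY_TX_RING_SIZE)
        return -1;
    dev->tx_ring[dev->tx_count++] = pkt;
    return 0;
}

/* Transmit-done interrupt: cheap pointer moves only, no freeing here. */
void toy_tx_interrupt(struct toy_txdev *dev)
{
    int i;
    for (i = 0; i < dev->tx_count; i++) {
        dev->tx_ring[i]->next = dev->completion_queue;
        dev->completion_queue = dev->tx_ring[i];
        dev->tx_ring[i] = NULL;
    }
    dev->tx_count = 0;
}

/* Softirq: release everything on the completion queue; returns the count. */
int toy_tx_softirq(struct toy_txdev *dev)
{
    int n = 0;
    while (dev->completion_queue) {
        struct toy_pkt *p = dev->completion_queue;
        dev->completion_queue = p->next;
        p->freed = 1;
        n++;
    }
    return n;
}
```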
Data Link – VLAN Features
• Still dependent on individual NICs
– Not all NICs implement VLAN filtering. A partial list is
available at need (not included here)
– For non-VLAN NICs, Linux filters in software and
passes to the appropriate virtual interface for ingress
prioritization and layer 3 protocol demux. See
net/8021q/vlan_dev.c (and others in this directory)
• Virtual interface passes through to real interface
– No VID-based demux needed for received packets, as
different VLANs are irrelevant to the IP layer.
– Some changes in 2.6 – still need to research this
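Software VLAN filtering starts by reading the 802.1Q tag out of the Ethernet header. The sketch below is illustrative user-space code, not taken from net/8021q: parse_vlan_tag() and struct vlan_info are hypothetical names. It relies only on the standard tag layout: a TPID of 0x8100 in the EtherType position, followed by a 16-bit TCI holding a 3-bit priority, a 1-bit flag and a 12-bit VLAN ID, followed by the encapsulated EtherType.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_HDR_LEN 14      /* dst MAC (6) + src MAC (6) + type (2) */
#define TOY_TPID_8021Q  0x8100  /* EtherType value marking a VLAN tag */

struct vlan_info {
    uint16_t vid;        /* 12-bit VLAN identifier */
    uint8_t  priority;   /* 3-bit priority, used for ingress prioritization */
    uint16_t ethertype;  /* protocol carried inside the tag */
};

/* Returns 1 and fills *out for a tagged frame, 0 for an untagged frame,
 * and -1 if the frame is too short to decide. */
int parse_vlan_tag(const uint8_t *frame, size_t len, struct vlan_info *out)
{
    uint16_t type, tci;

    assert(frame != NULL && out != NULL);
    if (len < TOY_ETH_HDR_LEN)
        return -1;
    type = (uint16_t)((frame[12] << 8) | frame[13]);
    if (type != TOY_TPID_8021Q)
        return 0;
    if (len < TOY_ETH_HDR_LEN + 4)
        return -1;
    tci = (uint16_t)((frame[14] << 8) | frame[15]);
    out->priority = (uint8_t)(tci >> 13);
    out->vid = (uint16_t)(tci & 0x0FFF);
    out->ethertype = (uint16_t)((frame[16] << 8) | frame[17]);
    return 1;
}
```

The VID would select the virtual interface and the priority bits would drive ingress prioritization; the encapsulated EtherType then feeds the usual layer 3 demux.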
OSI Layer 3: Internet
The code presented resides mostly in the following files:
• net/ipv4/ip_input.c – process packet arrivals
• net/ipv4/ip_output.c – process packet departures
• net/ipv4/ip_forward.c – process packet traversal
• net/ipv4/ip_fragment.c – IP packet fragmentation
• net/ipv4/ip_options.c – IP options
• net/ipv4/ipmr.c – IP multicast
• net/ipv4/ipip.c – IP over IP, also good virtual interface
example
Internet: Data Path
Note: chart copied from DataTag’s
“A Map of the Networking Code in the Linux Kernel”
Internet: Features
• Netfilter hooks in many places
– INPUT, OUTPUT, FORWARD (iptables)
– NF_IP_PRE_ROUTING – ip_rcv()
– NF_IP_LOCAL_IN – ip_local_deliver()
– NF_IP_FORWARD – ip_forward()
– NF_IP_LOCAL_OUT – ip_build_and_send_pkt()
– NF_IP_POST_ROUTING – ip_finish_output()
• Connection tracking in IPv4, not in TCP/UDP/ICMP.
– Used for NAT, which must maintain connection state in violation
of OSI Layering
– Can also gather statistics for networking usage, but all of this
functionality comes from the netfilter module
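The hook mechanism can be pictured as a table of callback lists, one list per hook point, consulted as a packet passes each stage. The following is a user-space sketch only: the toy_* names, the fixed-size table and the port-based filter are all hypothetical and do not reproduce the real netfilter registration API. The DROP/ACCEPT verdict idea is the part that carries over.

```c
#include <assert.h>
#include <stddef.h>

enum toy_hook {
    TOY_PRE_ROUTING, TOY_LOCAL_IN, TOY_FORWARD,
    TOY_LOCAL_OUT, TOY_POST_ROUTING, TOY_NUM_HOOKS
};

enum toy_verdict { TOY_DROP = 0, TOY_ACCEPT = 1 };

struct toy_packet { unsigned int dst_port; };

typedef enum toy_verdict (*toy_hookfn)(struct toy_packet *pkt);

#define TOY_MAX_FNS 4   /* assumed per-hook limit for this sketch */
static toy_hookfn toy_hooks[TOY_NUM_HOOKS][TOY_MAX_FNS];
static int toy_nfns[TOY_NUM_HOOKS];

/* Attach fn to hook point h; returns -1 if that hook's list is full. */
int toy_register_hook(enum toy_hook h, toy_hookfn fn)
{
    assert(fn != NULL);
    if (toy_nfns[h] == TOY_MAX_FNS)
        return -1;
    toy_hooks[h][toy_nfns[h]++] = fn;
    return 0;
}

/* Run every function registered at h in order; the first DROP wins. */
enum toy_verdict toy_run_hook(enum toy_hook h, struct toy_packet *pkt)
{
    int i;
    for (i = 0; i < toy_nfns[h]; i++)
        if (toy_hooks[h][i](pkt) == TOY_DROP)
            return TOY_DROP;
    return TOY_ACCEPT;
}

/* Example filter: refuse packets addressed to the telnet port. */
enum toy_verdict toy_drop_telnet(struct toy_packet *pkt)
{
    return pkt->dst_port == 23 ? TOY_DROP : TOY_ACCEPT;
}
```

In this picture, a function such as ip_local_deliver() would call toy_run_hook(TOY_LOCAL_IN, pkt) and continue only on an ACCEPT verdict.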
Socket Structure and System Call Mapping
The following files are useful:
• include/linux/net.h
• net/socket.c
There are two significant data structures involved,
the socket and the net_proto_family. Both involve
arrays of function pointers to handle each system
call type that is relevant.
System Call: socket
1. From user space, an application calls socket(family, type, protocol).
2. The kernel calls sys_socket(), which calls sock_create().
3. sock_create() references net_families[family], an array of network protocol
families, to find the corresponding protocol family, loading any modules
necessary on the fly.
• If a module must be loaded, it is requested as “net-pf-<num>”, where the
protocol family number is used directly in the string. For TCP, the family is
PF_INET (was: AF_INET), and the type is SOCK_STREAM.
• Note: Linux has a hard limit of 32 protocol families. (These include PF_INET,
PF_PACKET, PF_NETLINK, PF_INET6, etc.)
• Layer 4 protocols are registered in inet_add_protocol()
(include/net/protocol.h), and socket interfaces are registered by
inet_register_protosw(). Raw IP datagram sockets are registered like any
other Layer 4 protocol.
4. Once the correct family is found, sock_create() allocates an empty socket,
obtains a mutex, and calls net_families[family]->create(). This is
protocol-specific, and fills in the socket structure. The socket structure
includes another function array, ops, which maps all system calls valid on
file descriptors.
5. sys_socket() calls sock_map_fd() to map the new socket to a file descriptor,
and returns it.
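The two levels of function-pointer dispatch (family table first, then per-socket ops) can be sketched in user space. This is a simplified illustration, not the kernel's definitions: every toy_* name is hypothetical, the ops table has a single bind entry, and module loading, locking and file descriptors are left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPROTO  32  /* mirrors the 32-family limit noted in the slide */
#define TOY_PF_INET 2

struct toy_socket;

/* Per-socket operations; the real ops table has many more entries. */
struct toy_proto_ops {
    int (*bind)(struct toy_socket *sock, int port);
};

struct toy_socket {
    int type;
    int port;
    const struct toy_proto_ops *ops;
};

struct toy_net_proto_family {
    int family;
    int (*create)(struct toy_socket *sock, int protocol);
};

static const struct toy_net_proto_family *toy_net_families[TOY_NPROTO];

/* Register a protocol family; fails if out of range or already taken. */
int toy_sock_register(const struct toy_net_proto_family *pf)
{
    assert(pf != NULL);
    if (pf->family < 0 || pf->family >= TOY_NPROTO)
        return -1;
    if (toy_net_families[pf->family])
        return -1;
    toy_net_families[pf->family] = pf;
    return 0;
}

/* Look up the family, then let its create() fill in the socket. */
int toy_sock_create(int family, int type, int protocol,
                    struct toy_socket *sock)
{
    if (family < 0 || family >= TOY_NPROTO || !toy_net_families[family])
        return -1;
    sock->type = type;
    sock->port = 0;
    sock->ops = NULL;
    return toy_net_families[family]->create(sock, protocol);
}

/* A toy "inet" family whose create() installs its own ops table. */
static int toy_inet_bind(struct toy_socket *sock, int port)
{
    sock->port = port;
    return 0;
}

static const struct toy_proto_ops toy_inet_ops = { toy_inet_bind };

static int toy_inet_create(struct toy_socket *sock, int protocol)
{
    (void)protocol;  /* unused in this sketch */
    sock->ops = &toy_inet_ops;
    return 0;
}

const struct toy_net_proto_family toy_inet_family = {
    TOY_PF_INET, toy_inet_create
};
```

After toy_sock_create() succeeds, later calls go through sock->ops, for example sock->ops->bind(sock, 8080), without consulting the family table again.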
Other socket System Calls
Subsequent socket system calls are passed to the
appropriate function in socket->ops[]. These
include (exhaustive list):
•release
•bind
•connect
•socketpair
•accept
•getname
•poll
•ioctl
•listen
•shutdown
•setsockopt
•getsockopt
•sendmsg
•recvmsg
•mmap
•sendpage
Technically, Linux offers only one socket system call,
sys_socketcall(), which multiplexes to all other socket
system calls via the first parameter. This means that
socket-based protocols could provide new and different
system calls via a library and a mux, although this is never
done in practice.
PF_PACKET
A brief word on the PF_PACKET protocol family:
PF_PACKET creates a socket bound directly to a
network device. The call may specify a packet
type. All packets sent to this socket are sent
directly over the device, and all incoming packets
of this type are delivered directly to the socket.
No processing is done in the kernel. Thus, this
interface can be, and is, used to create user-space
protocol implementations. (E.g., PPPoE uses this
with packet type ETH_P_PPP_DISC)
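Opening such a socket from user space looks roughly like the sketch below. The wrapper open_packet_socket() is a hypothetical helper; the socket() call, PF_PACKET, SOCK_RAW, htons() and ETH_P_PPP_DISC are the standard Linux interfaces. Creating a packet socket normally requires the CAP_NET_RAW capability (typically root), so the helper reports failure as a negative errno value rather than aborting.

```c
#include <assert.h>
#include <errno.h>
#include <arpa/inet.h>       /* htons */
#include <sys/socket.h>      /* socket, PF_PACKET, SOCK_RAW */
#include <linux/if_ether.h>  /* ETH_P_PPP_DISC and other ETH_P_* values */
#include <unistd.h>          /* close */

/* Open a raw packet socket for one Ethernet packet type.
 * Returns the file descriptor on success, or -errno on failure. */
int open_packet_socket(unsigned short eth_proto)
{
    int fd;

    /* A protocol of 0 would create a socket that receives nothing. */
    assert(eth_proto != 0);

    /* The packet type is passed in network byte order. */
    fd = socket(PF_PACKET, SOCK_RAW, htons(eth_proto));
    if (fd < 0)
        return -errno;
    return fd;
}
```

A PPPoE discovery client would call open_packet_socket(ETH_P_PPP_DISC), bind the socket to an interface, and then exchange whole Ethernet frames with send() and recv(); the caller should close() the descriptor when done.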
Quality of Service Mechanisms
Linux has two QoS mechanisms:
– Traffic Control
• Provides for multiple queues and priority schemes within those
queues between the IP layer and the network device
• Defaults are 100-packet queues with 3 priorities and a FIFO ordering.
– KIDS (Karlsruhe Implementation architecture of Differentiated
Services)
• Designed to be component-extensible at runtime.
• Consists of a set of components with similar interfaces that can be
plugged together in almost arbitrarily complex constructions
Neither mechanism implements the higher-level traffic
agreements, such as Traffic Conditioning Agreements
(TCA’s). MPLS is offered in Linux 2.6.
Traffic Control
Traffic Control consists of three types of components:
1. Queue Disciplines
• These implement the actual enqueue() and dequeue()
• Also have child components
2. Filters
• Filters classify traffic received at a Queue Discipline into Classes
• Normally children of a Queuing Discipline
3. Classes
• These hold the packets classified by Filters, and have associated queuing
disciplines to determine the queuing order.
• Normally children of a Filter and parents of Queuing Disciplines
Components are connected into structures called “trees,” although
technically they aren’t true trees because they allow upward
(cyclical) links.
Traffic Control: Example
This is a typical TC tree.
The top-level Queuing Discipline is the
only access point from the outside, the
“outer queue.” From external access,
this is a single queue structure.
Internally, packets received at the outer
queue are matched against each filter in
order. The first match wins, with a final
default case.
Dequeue requests to the outer queue are
passed along recursively to the inner
queues to find a packet ready for
sending.
[Figure: a typical TC tree]
enqueue / dequeue: Queuing Discipline 1:0
  Filter: Class 1:1, Queuing Discipline 2:0
  Filter: Class 1:2, Queuing Discipline 3:0
  ...
  Default
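The enqueue and dequeue behavior of such a tree can be sketched in user space. This is a toy model under stated assumptions: the toy_* names are hypothetical, each class owns a small FIFO as its inner queue, and the outer discipline dequeues in strict class order, which is just one possible policy. It shows the first-match-wins classification with a default class, and a dequeue request being passed down to the inner queues.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_QLEN   8
#define TOY_NCLASS 3   /* two filtered classes plus a default class */

struct toy_pkt { unsigned int dst_port; };

/* Inner queuing discipline: a fixed-size FIFO of packet pointers. */
struct toy_fifo {
    const struct toy_pkt *q[TOY_QLEN];
    int head, count;
};

/* A filter returns nonzero when the packet belongs to its class. */
typedef int (*toy_filter)(const struct toy_pkt *p);

/* Outer queuing discipline: filter i selects class i; last is default. */
struct toy_qdisc {
    toy_filter filters[TOY_NCLASS - 1];
    struct toy_fifo classes[TOY_NCLASS];
};

void toy_qdisc_init(struct toy_qdisc *qd, toy_filter f0, toy_filter f1)
{
    memset(qd, 0, sizeof(*qd));
    qd->filters[0] = f0;
    qd->filters[1] = f1;
}

static int fifo_enqueue(struct toy_fifo *f, const struct toy_pkt *p)
{
    if (f->count == TOY_QLEN)
        return -1;
    f->q[(f->head + f->count) % TOY_QLEN] = p;
    f->count++;
    return 0;
}

static const struct toy_pkt *fifo_dequeue(struct toy_fifo *f)
{
    const struct toy_pkt *p;
    if (f->count == 0)
        return NULL;
    p = f->q[f->head];
    f->head = (f->head + 1) % TOY_QLEN;
    f->count--;
    return p;
}

/* Classify against each filter in order; the first match wins.
 * Returns the class index used, or -1 if that class's queue is full. */
int toy_enqueue(struct toy_qdisc *qd, const struct toy_pkt *p)
{
    int i;
    assert(qd != NULL && p != NULL);
    for (i = 0; i < TOY_NCLASS - 1; i++)
        if (qd->filters[i] && qd->filters[i](p))
            break;
    /* If no filter matched, i is now the index of the default class. */
    return fifo_enqueue(&qd->classes[i], p) == 0 ? i : -1;
}

/* Pass the dequeue request to each inner queue until one has a packet. */
const struct toy_pkt *toy_dequeue(struct toy_qdisc *qd)
{
    int i;
    for (i = 0; i < TOY_NCLASS; i++) {
        const struct toy_pkt *p = fifo_dequeue(&qd->classes[i]);
        if (p)
            return p;
    }
    return NULL;
}

int toy_is_ssh(const struct toy_pkt *p)  { return p->dst_port == 22; }
int toy_is_http(const struct toy_pkt *p) { return p->dst_port == 80; }
```

From the outside only toy_enqueue() and toy_dequeue() are visible, which matches the description of the outer queue as the single access point.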
Traffic Control (Cont’d)
The TC architecture supports a number of pre-built
filters, classes, and disciplines, found in net/sched/.
Files named cls_* are filters, whereas sch_* are
disciplines (classes are collocated with disciplines).
Some disciplines:
• ATM
• Class-Based Queuing
• Clark-Shenker-Zhang
• Differentiated Services mark
• FIFO
• RED
• Hierarchical Fair Service Curve (SIGCOMM’97)
• Hierarchical Token Bucket
• Network Emulator (for protocol testing)
• Priority (3 levels)
• Generic RED
• Stochastic Fairness Queuing
• Token Bucket
• Equalizer (for equalizing line rates of different links)
KIDS
KIDS establishes 5 general component types (by interface)
• Operative Components – receive a packet and run an algorithm on it.
The packet may be modified or simply examined. E.g., Token
Buckets, RED, Shaper
• Queue Components – Data structures used to enqueue/dequeue.
Includes FIFO, “Earliest-Deadline-First” (EDF), etc.
• Enqueuing Components – enqueue packets based on special methods:
tail-enqueue, head-enqueue, EDF-enqueue, etc.
• Dequeuing Components – dequeue based on special methods
• Strategic Components – strategies for dequeue requests. E.g., WFQ,
Round Robin
KIDS (Cont’d)
• KIDS has 8 different hook points in the Linux kernel, 5 at
the IP layer and 3 at Layer 2:
– IP_LOCAL_IN – just prior to delivery to Layer 4
– IP_LOCAL_OUT – just after leaving Layer 4
– IP_FORWARD – packet being forwarded (router)
– IP_PRE_ROUTING – packet newly arrived at IP layer from
interface
– IP_POST_ROUTING – packet routed from IP to Layer 2
– L2_INPUT_<dev> – packet has just arrived from interface
– L2_ENQUEUE_<dev> – packet is being queued at Layer 2
– L2_DEQUEUE_<dev> – packet is being transmitted by Layer 2