Lecture note 6


Make Hosts Ready for Gigabit Networks
Hardware Requirements
• To allow a host to fully utilize Gbps bandwidth, its
hardware system must be ready for Gbps. For
example:
– CPU speed
• Is a Pentium 100 MHz PC fast enough to process a large number
of packets per second? (10 bits/Hz?)
– Memory throughput
• Is SDRAM's sustained throughput large enough to move data
in and out of it at Gbps?
– I/O bus bandwidth
• Is a 32-bit, 33 MHz PCI bus fast enough to move data at Gbps?
(See the arithmetic note after this list.)
– Network interface
• Is the firmware on the NIC fast enough to process packets at
Gbps?
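(A quick arithmetic check of two of the questions above: a 32-bit, 33 MHz PCI bus peaks at 32 bits × 33 MHz ≈ 1056 Mb/s, i.e., 132 MB/s, so even in theory it can barely carry 1 Gb/s, and bus arbitration and protocol overhead reduce that further. Likewise, for a 100 MHz CPU to keep up with a 1 Gb/s stream it must handle 1 Gb/s ÷ 100 MHz = 10 bits of data per clock cycle, which is what the "10 bits/Hz" figure refers to.)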
Software Design
• If a host’s hardware system can barely support Gbps
bandwidth, its software system must be carefully
designed so that Gbps can still be achieved for an
application. For example:
– NIC device driver in the OS
– TCP/IP protocol stack in the OS
– Routing table lookup in the OS
– Buffer system in the OS
– API between the OS and application programs
– Networking services (e.g., NAT, firewall)
(Improving the design and implementation of software
systems is the focus of our course.)
The Path of Moving Data
• What networking does is basically move data from a
networking application residing on one machine to a
networking application residing on a different machine.
• The path of moving data is:
– application -> operating system -> network interface ->
network -> network interface -> operating system ->
application program.
• Therefore, to achieve Gbps, moving data between the
application and the operating system, and between the
operating system and the network interface, must be
performed at least at Gbps.
The Cost of Moving Data
• The cost of moving data is very high.
– CPU speed has improved continuously and has now
reached 2 GHz. However, the throughput and access
speed of memory (e.g., SDRAM) remain about the same
as they were a few years ago.
– Therefore, the CPU now wastes more clock cycles
waiting to access a word in memory.
– The cost of moving data now becomes increasingly
high and the memory becomes the performance
bottleneck.
• Therefore, the goal is to minimize the need for
moving data or hide the cost of moving data.
Hide Memory Access Cost (1)
• Scoreboarding processor
– Instructions that load data into a register do not
need to wait for the data to come back from
memory, but rather mark the registers as
awaiting data. (single stream)
– The processor then can continue execution.
– Only if an instruction accesses the register
before the memory access has completed does
the processor need to stall.
Hide Memory Access Cost (2)
• Super-scalar processor
– Permit independent instructions to be executed in the same
clock cycle (multiple instruction streams)
– Therefore, an instruction that is loading data from memory
can be executed in parallel with an instruction that does not
need this data.
Both scoreboarding and super-scalar methods help when reading a
small amount of data; they are not very useful for reading a
large amount of data. Therefore, the operating system should
be designed to minimize the number of times that a large amount
of data has to be copied.
Host Memory Hierarchy
Good cache performance depends on good locality.
However, networking code often violates the locality assumptions.
Example: when a packet arrives, it interrupts the execution of the
processor. This forces the processor to load new instructions.
Furthermore, because the data of the packet is not in the data cache,
it needs to be fetched from memory.
The Problem with Layered Code
• Layering is a useful concept that enables network
researchers to work concurrently on different aspects
of a networking problem.
• However, an implementation of protocol stacks based
on strict layering often results in bad performance.
– Because the upper layer does not know which format the
lower layer wants, the packets copied into the lower layer
often need to be reformatted and recopied.
– Nowadays, we are seeing more and more
implementations violate the layering concept for higher
performance.
• E.g., Content-aware (URL) Web switching at an IP router.
Reduce Memory Copy Operations
• Currently, on a UNIX host, two data copy
operations are needed to move data from an
application to the network interface (a sketch
follows this list):
– Application -> OS
– OS -> network interface.
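To make the two copies concrete, here is a minimal sketch of this traditional send path; the names (os_send, nic_pio_write, kernel_buf) are hypothetical stand-ins for the real kernel routines, and the hardware access is stubbed out:

    #include <stddef.h>
    #include <string.h>

    #define PKT_MAX 1514

    /* Hypothetical kernel-side staging buffer (e.g., an mbuf's data area). */
    static char kernel_buf[PKT_MAX];

    /* Stub for programmed I/O into NIC memory; a real driver would write
     * to device registers or a mapped NIC buffer instead. */
    static void nic_pio_write(const char *src, size_t len)
    {
        (void)src; (void)len;   /* hardware access omitted in this sketch */
    }

    static size_t os_send(const char *user_buf, size_t len)
    {
        if (len > PKT_MAX)
            len = PKT_MAX;
        /* Copy 1: application buffer -> OS buffer (what copyin()
         * does inside write()/send()). */
        memcpy(kernel_buf, user_buf, len);
        /* Copy 2: OS buffer -> network interface. */
        nic_pio_write(kernel_buf, len);
        return len;
    }

    int main(void)
    {
        char msg[] = "application data";
        os_send(msg, sizeof msg);
        return 0;
    }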
One-Copy Techniques
• Virtual page remapping:
– The first copy can be eliminated by using virtual memory
mechanism to map the pages used by the application to
the pages used by the buffer in OS.
– The buffers in the application must start and end at a page
boundary for this mechanism to work. (A userspace sketch of
page remapping follows this list.)
• Copy-on-write:
– The first copy can also be avoided by copy-on-write (COW).
– If a packet needs to be copied from one domain to another,
COW can be used to reduce or eliminate the copy operation.
– The pages of the packet are actually copied only when the
packet is modified; otherwise, the same pages are shared by
the different domains.
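As a userspace analogue of virtual page remapping (a sketch of the idea, not the in-kernel mechanism itself): on a reasonably recent Linux/glibc, memfd_create() plus two mmap() calls make the same physical pages visible at two virtual addresses, so data written through one mapping is seen through the other without any copy. The region is exactly page-sized and page-aligned, mirroring the page-boundary restriction above; error checking is omitted for brevity.

    /* Linux-specific sketch: one set of physical pages, two mappings. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        int fd = memfd_create("pktbuf", 0);   /* anonymous page-backed file */
        ftruncate(fd, page);

        /* The "application" mapping and the "OS" mapping of the same pages. */
        char *app = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *os  = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        strcpy(app, "packet payload");
        printf("seen through the OS mapping: %s\n", os);  /* no copy happened */

        munmap(app, page);
        munmap(os, page);
        close(fd);
        return 0;
    }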
One-Copy Techniques
• Memory-mapped buffer:
– The second copy can be eliminated by mapping the
memory on the NIC to a part of the system memory.
– The OS then can use the mapped system memory for its
buffer area.
– Therefore, when the application's data is copied to the
buffer in the OS, it is effectively copied into the
memory on the NIC.
– (The PCI specification shows that if there is memory on
a PCI card, it can be mapped into a part of the system
address space; a sketch follows.)
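A sketch of what such a mapping looks like from software on a modern Linux machine, assuming (this is an assumption about the setup, not something every NIC provides) that a UIO driver exposes the card's on-board memory as /dev/uio0:

    /* Sketch: map a PCI card's on-board memory into our address space. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NIC_BUF_SIZE 4096   /* hypothetical size of the card's buffer */

    int main(void)
    {
        int fd = open("/dev/uio0", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* After this mmap(), stores through nic_buf go to the card's
         * memory, so "copying into the OS buffer" is copying onto the NIC. */
        char *nic_buf = mmap(NULL, NIC_BUF_SIZE, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
        if (nic_buf == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        memcpy(nic_buf, "frame bytes", 11);   /* lands directly on the card */

        munmap(nic_buf, NIC_BUF_SIZE);
        close(fd);
        return 0;
    }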
Zero-Copy Technique
• Memory-mapped buffer + virtual page remapping:
– We can first map the NIC’s buffer to the buffer in the OS
(PCI hardware map operation).
– We can then map the buffer in the application to the buffer
in the OS. (OS software map operation)
– Then, the buffer in the application is mapped to the NIC’s
buffer.
– This will result in zero-copy operation.
• Although zero-copy is good from a network performance
viewpoint, it is very difficult for the application
to use.
– The application now needs to know hardware details,
which should be hidden by the OS.
DMA Technique
• To avoid the data copy operation between the OS and
the NIC, instead of using normal programmed I/O (PIO),
we can use DMA.
• Using DMA, a NIC can transfer data directly from/to
memory without involving the CPU. This enables the
CPU to execute in parallel with the data transfer.
(However, the CPU may still be stalled.)
• Generally, DMA's performance is better than PIO's.
However, there are some situations where PIO is
preferred (e.g., computing checksums).
• Scatter-gather capability in a DMA-based NIC is
important because it can avoid data copy operations
(see the sketch below).
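To illustrate why scatter-gather helps: the driver hands the NIC a list of (address, length) descriptors, so a packet whose header and payload live in separate buffers need not be copied into one contiguous staging buffer first. The descriptor layout below is invented for illustration; every real NIC defines its own format:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical DMA descriptor -- real NICs define their own layouts. */
    struct dma_desc {
        uint64_t addr;   /* physical address of a fragment (simplified:
                            this sketch just uses virtual addresses)    */
        uint32_t len;    /* fragment length in bytes                    */
        uint32_t flags;  /* e.g., end-of-packet on the last fragment    */
    };
    #define DESC_EOP 0x1

    /* Build a descriptor chain for a packet whose header and payload
     * live in separate buffers -- no copy into a staging buffer. */
    static int build_sg_chain(struct dma_desc *ring,
                              void *frags[], uint32_t lens[], int n)
    {
        for (int i = 0; i < n; i++) {
            ring[i].addr  = (uint64_t)(uintptr_t)frags[i];
            ring[i].len   = lens[i];
            ring[i].flags = (i == n - 1) ? DESC_EOP : 0;
        }
        return n;   /* a real driver would now ring the NIC's doorbell */
    }

    int main(void)
    {
        char hdr[54], payload[1460];
        void *frags[2] = { hdr, payload };
        uint32_t lens[2] = { sizeof hdr, sizeof payload };
        struct dma_desc ring[2];

        int n = build_sg_chain(ring, frags, lens, 2);
        printf("built %d descriptors, last flags=0x%x\n",
               n, ring[n - 1].flags);
        return 0;
    }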
Buffer Editing
• To support Gbps, the design of a buffer system should
allow buffers to be created, clipped, shared, split,
concatenated, and destroyed with little overhead.
• Otherwise, a packet may need to be copied to a new
buffer again and again while traversing the layers of a
protocol stack.
– E.g., as a packet goes down/up a protocol stack when it is
sent/received, more and more headers need to be
prepended/stripped to/from it.
• Generally, lists or tree structures are used as the data
structures to easily support the above operations.
– E.g., the mbuf used in BSD UNIX (a simplified sketch
follows).
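A minimal sketch of the mbuf idea, heavily simplified from the real BSD structure: data sits in the middle of a fixed-size buffer with leading space reserved, so prepending a header is mostly a pointer adjustment, and buffers chain through a next pointer so splitting and concatenation are pointer operations rather than copies:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MBUF_SIZE     256
    #define LEADING_SPACE 64   /* room reserved for headers to be prepended */

    /* Heavily simplified version of the BSD mbuf idea. */
    struct mbuf {
        struct mbuf *next;   /* chain link: split/concat = pointer ops */
        char        *data;   /* start of valid data within buf[]       */
        int          len;    /* bytes of valid data                    */
        char         buf[MBUF_SIZE];
    };

    static struct mbuf *m_alloc(void)
    {
        struct mbuf *m = calloc(1, sizeof *m);
        m->data = m->buf + LEADING_SPACE;  /* leave room for headers */
        return m;
    }

    /* Prepend a header by moving the data pointer back -- only the
     * header itself is copied, never the existing payload. */
    static int m_prepend(struct mbuf *m, const void *hdr, int hlen)
    {
        if (m->data - m->buf < hlen)
            return -1;                     /* out of leading space */
        m->data -= hlen;
        m->len  += hlen;
        memcpy(m->data, hdr, hlen);
        return 0;
    }

    int main(void)
    {
        struct mbuf *m = m_alloc();
        memcpy(m->data, "payload", 7);  m->len = 7;

        m_prepend(m, "TCP.", 4);        /* transport header */
        m_prepend(m, "IP..", 4);        /* network header   */

        printf("%.*s\n", m->len, m->data);   /* IP..TCP.payload */
        free(m);
        return 0;
    }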
API Design
• The design of application program interface (API)
can significantly affect the data passing performance
between the user application and the OS.
• Currently, the read() and write() system calls
provided on UNIX allow the user to choose a buffer
with arbitrary address, size, and alignment, and with
unconstrained access to that buffer.
– This makes it difficult for the OS to avoid the data copy
operation between the application and the OS.
• Suppose instead that UNIX required the buffer to
start and end at a page boundary and its length to be
a multiple of the page size; then the copy-on-write
technique could be used to avoid one copy operation
(a sketch follows).
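A sketch of what the constrained API would look like from the application side, using the standard posix_memalign() to satisfy the alignment and length rules. The copy-avoiding behavior itself is hypothetical here: ordinary write() on most systems still copies, but a COW-capable interface could map these pages read-only instead.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t len = 4 * (size_t)page;   /* length: multiple of page size */
        void *buf;

        /* Buffer starts (and therefore ends) on a page boundary. */
        if (posix_memalign(&buf, (size_t)page, len) != 0)
            return 1;
        memset(buf, 'x', len);           /* fill with application data */

        /* With a COW-capable interface, the kernel could now remap or
         * write-protect these pages and skip the copy; plain write()
         * will still copy. */
        int fd = open("/dev/null", O_WRONLY);
        ssize_t n = write(fd, buf, len);
        fprintf(stderr, "wrote %zd bytes from a page-aligned buffer\n", n);

        close(fd);
        free(buf);
        return 0;
    }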
Data Manipulation
• Data manipulations are computations that inspect and
possibly modify every word of data in a network
packet.
– E.g., encryption, compression, checksumming, presentation
conversion, etc.
• Typically, different network layers manipulate data
independently from each other.
• Each data manipulation requires the CPU to load
potentially un-cached data from memory and store the
inspected/modified data to memory.
• Therefore, the data must cross the CPU/memory data
path multiple times, which limits the achievable
maximum throughput.
Integrated Layer Processing
• The integrated layer processing (ILP) technique can be used to
minimize the number of data transfers.
• The data manipulation steps from different protocol layers
are combined into a pipeline.
• A word of data is loaded into a register, then manipulated
by multiple manipulation layers while it remains in a
register, then finally stored – all before the next word of
data is processed.
• In this way, a combined series of data manipulations
transfers data from memory to the CPU and back only once,
instead of transferring the data once per distinct layer.
The difficulty is that different manipulations cannot be easily
integrated. (A sketch of the idea follows.)
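A minimal sketch of the idea, integrating two made-up manipulation layers (a simple 16-bit checksum and a toy XOR "encryption") so that each word is loaded and stored once instead of twice:

    #include <stdint.h>
    #include <stdio.h>

    /* Separate layers: the data crosses the CPU/memory path twice. */
    static uint32_t checksum_then_encrypt(uint16_t *buf, int n, uint16_t key)
    {
        uint32_t sum = 0;
        for (int i = 0; i < n; i++)        /* pass 1: load every word  */
            sum += buf[i];
        for (int i = 0; i < n; i++)        /* pass 2: load/store again */
            buf[i] ^= key;
        return sum;
    }

    /* Integrated layer processing: each word is loaded once, handled
     * by both layers while in a register, and stored once. */
    static uint32_t ilp_checksum_encrypt(uint16_t *buf, int n, uint16_t key)
    {
        uint32_t sum = 0;
        for (int i = 0; i < n; i++) {
            uint16_t w = buf[i];   /* one load              */
            sum += w;              /* layer 1: checksum     */
            buf[i] = w ^ key;      /* layer 2 and one store */
        }
        return sum;
    }

    int main(void)
    {
        uint16_t a[4] = {1, 2, 3, 4}, b[4] = {1, 2, 3, 4};
        printf("sum=%u (two passes)\n",
               (unsigned)checksum_then_encrypt(a, 4, 0xFF));
        printf("sum=%u (one pass)\n",
               (unsigned)ilp_checksum_encrypt(b, 4, 0xFF));
        return 0;
    }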
Copy-Avoiding Techniques Relationship
NIC to NIC Transfers
• What we have discussed so far is to reduce
the number of copy operations required for
sending data from the user application,
through the OS, to the NIC.
• Here, we discuss the methods that can
reduce the number of copy operations
required for forwarding data from one NIC
to another NIC (i.e., the system functions
as a routing or switching device).
Techniques for NIC-to-NIC (1)
• Hardware streaming (peer-to-peer)
– The maximum achievable forwarding throughput is the
I/O bus bandwidth.
– The problems are that special hardware is required and
the OS has no chance to inspect/modify packets.
• As a result, some processing (e.g., routing table lookup) needs
to be performed by the CPU on the NICs.
• However, for economic reasons, the CPUs on NICs are often
much slower than the CPU on the system.
• DMA-DMA streaming
– The maximum achievable throughput is only ½ of the
I/O bus bandwidth, because each packet crosses the bus
twice (once from the NIC into memory, once from memory
to the other NIC).
– However, packets can be inspected/modified by the OS.
• E.g., routing table lookup
Techniques for NIC-to-NIC (2)
• OS kernel streaming
– Packets are first DMA’ed into memory.
– Packets then are read from memory to the CPU for
inspection or modification.
– Depending on the number of inspections/modifications
and the memory system's read/write throughput, the
achievable maximum forwarding throughput is further
limited to (memory throughput / number of memory
reads and writes). (A worked example follows this list.)
• User-level streaming
– In some applications, packets may need to go up to the
user level for inspection/modification.
• E.g., a Web proxy system, an email relay system, NATD
– The throughput will be further limited.
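As a back-of-the-envelope illustration of the kernel-streaming limit (the numbers are assumed for illustration, not measured): if sustained memory throughput is 800 MB/s and each forwarded packet crosses memory four times (DMA in, CPU read, CPU write, DMA out), the forwarding throughput is at most 800 / 4 = 200 MB/s, i.e., about 1.6 Gb/s, before any other overhead is counted.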
Latency of Small Packets
• For large packets, we care about the cost of
copying them (i.e., transfer throughput).
• For small packets, however, what we care about is
the latency of their transmission.
• The following three interactions between the
processor and memory can affect latency:
– Branch misses
– Context switching
– Interrupts
Branch Misses
– To keep the instruction pipeline full (some processors have
up to 13 stages), most processors today fetch instructions
continuously.
– Conditional branches, however, present a problem because
the target instruction cannot be determined until the condition
has been computed.
– If the CPU waits for the completion of the condition test
before fetching the next instruction, the pipeline cannot be
kept full most of the time. This results in low CPU utilization.
– To solve this problem, most processors today try to predict
the next instruction to execute.
– If the guess is wrong, the instructions that have already been
fetched must be abandoned. This also results in low CPU
utilization.
Do not put too many if-then-else branches in your networking code
(a sketch of branch hints follows).
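Where branches cannot be removed, the common case can at least be marked for the compiler. The likely()/unlikely() macros below are the conventional wrappers around GCC/Clang's __builtin_expect; the packet-checking function is a hypothetical example:

    #include <stddef.h>
    #include <stdio.h>

    /* Conventional wrappers around GCC/Clang's __builtin_expect. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* Hypothetical fast-path packet check: errors are rare, so tell
     * the compiler to lay out code for the no-error path. */
    static int process_packet(const unsigned char *pkt, size_t len)
    {
        if (unlikely(pkt == NULL || len < 14))
            return -1;             /* rare error path */
        if (likely(len <= 1514))
            return 0;              /* common case: normal-sized frame */
        return -1;                 /* rare: oversized frame */
    }

    int main(void)
    {
        unsigned char frame[64] = {0};
        printf("result: %d\n", process_packet(frame, sizeof frame));
        return 0;
    }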
Context Switches
– Context switches are very expensive because they
require that both new code and data be fetched from the
slow memory and loaded into the processor cache.
– In a perfect system, no more than one context switch
should be needed to send a packet, and one context
switch plus an interrupt to receive a packet.
– In a micro-kernel OS, sending and receiving a packet
need more context switches than in a traditional UNIX
kernel (because the packet must traverse the application
program, network server, and micro-kernel domains).
For a high-speed system, macros are preferred over function calls,
and function calls are preferred over threads (which must save their
PC and stack) for processing an incoming packet.
Interrupts
• Interrupts are very expensive.
– They cause context switches, which in turn cause a lot of code
and data cache misses.
– Sometimes the host's mode needs to be changed from
user mode to privileged mode when an interrupt occurs.
Changing modes, however, is a very slow operation.
• One solution is to minimize the number of interrupts.
– Do not issue a receive interrupt for every incoming packet.
Issue an interrupt only when a certain number of packets have
been received or a timer has expired.
– When a receive interrupt occurs, the device driver retrieves
and processes as many packets from the NIC as possible.
– Do not issue a transmit interrupt for every sent packet. Issue a
transmit interrupt only after a certain number of packets have
been sent or a timer has expired. (A sketch of the receive side
follows.)
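A sketch of the receive side of this advice; the NIC access functions are stubs invented for illustration, and the interrupt itself is simulated by a direct call:

    #include <stdio.h>

    /* Hypothetical NIC interface -- stubs standing in for hardware
     * access so the control flow can be shown. */
    static int pkts_pending = 5;            /* pretend 5 frames arrived */
    static int nic_rx_ready(void)           { return pkts_pending > 0; }
    static void nic_rx_fetch(char *b)       { (void)b; pkts_pending--; }
    static void deliver_to_stack(char *b)   { (void)b; }

    /* Receive interrupt handler: one interrupt, many packets. The NIC
     * is assumed to be configured (mitigation timer / packet-count
     * threshold) so it raises this interrupt only after several
     * packets or a timeout, not once per frame. */
    static void rx_interrupt(void)
    {
        char frame[1514];
        int drained = 0;

        /* Drain everything the NIC has buffered before returning, so
         * the per-interrupt overhead is amortized over many packets. */
        while (nic_rx_ready()) {
            nic_rx_fetch(frame);
            deliver_to_stack(frame);
            drained++;
        }
        printf("one interrupt handled %d packets\n", drained);
    }

    int main(void)
    {
        rx_interrupt();   /* simulate the interrupt firing once */
        return 0;
    }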
Receive Livelock Problem
• Can happen in an interrupt-driven kernel
• This problem happens when packets arrive at the
system at high rates.
• When this problem occurs, the system will spend
all of its time processing interrupts, to the
exclusion of other necessary tasks.
• The result is that, under extreme conditions, no
packets are delivered to the user application or to
the output of the system.
• To avoid this problem, tasks and interrupts must
be carefully scheduled.
Techniques to Avoid Livelock
• Limit the interrupt arrival rate
– For example, when the ipintr queue is about to overflow
and packets are about to be dropped, we can temporarily
disable interrupts. Interrupts can be re-enabled when
the buffer occupancy of the ipintr queue drops below a
certain threshold.
• Use of polling
– Poll each NIC at a fixed rate. This limits the packet
processing rate and also provides fair resource allocation
among multiple interfaces. (A sketch follows this list.)
• Avoid preemption
– Let higher-level protocol processing (e.g., TCP/IP) be
executed at the same priority level as that used by an
interrupt service routine.
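A sketch of the polling idea (the per-NIC backlog is a stub standing in for real hardware queues; Linux's later NAPI mechanism works along broadly similar lines): each interface is polled with a fixed per-round budget, which caps the packet processing rate and shares the CPU fairly among interfaces:

    #include <stdio.h>

    #define NUM_NICS 2
    #define BUDGET   8   /* max packets per NIC per polling round */

    /* Hypothetical per-NIC backlog, standing in for hardware queues. */
    static int backlog[NUM_NICS] = { 20, 3 };

    static int nic_poll_one(int nic)   /* fetch+process one packet */
    {
        if (backlog[nic] == 0)
            return 0;
        backlog[nic]--;
        return 1;
    }

    /* One polling round: each interface gets at most BUDGET packets,
     * so a flooded NIC cannot monopolize the CPU or starve other
     * interfaces; leftover work waits for the next round instead of
     * triggering an interrupt storm. */
    static void poll_round(void)
    {
        for (int nic = 0; nic < NUM_NICS; nic++) {
            int done = 0;
            while (done < BUDGET && nic_poll_one(nic))
                done++;
            printf("nic %d: processed %d packets this round\n", nic, done);
        }
    }

    int main(void)
    {
        poll_round();   /* a real kernel would run rounds off a timer */
        poll_round();
        return 0;
    }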