How to Advance Network Driver Performance

Network Driver Performance
Outline
Software features for high performance NICs
Some of the top features include:
Scatter-Gather DMA
Automatic Tuning of resources
Task Offloading support for IPv6
Hardware features for high performance NICs
Some of the top features include:
Task Offloading support
Receive-Side Scaling (RSS) support
Performance Tools
NTttcp
Kernrate Profiler
Goals
This information can be used to optimally tune
your network driver to work with your hardware
for best networking performance
This information can be used to fine-tune your
hardware features to operate at their optimal
performance
How to use NTttcp to isolate network
performance problems
How to use Kernrate to identify bottlenecks on
hot paths
Note: The mention of packets is relevant to NDIS 5.x drivers and translates to
NetBuffers and NetBufferLists for NDIS 6.0 drivers on Windows codenamed
“Longhorn”
Software Optimizations
Network Software Optimizations
Scatter Gather DMA
SG DMA yields optimum performance with NDIS 6.0
model
It is highly recommended to pre-allocate the buffer
hosting the SCATTER_GATHER_LIST as part of
Transmit Control Block during the initialization phase
and reuse it.
Use maximum buffer size for
MaximumPhysicalMapping parameter in
NdisMInitializeScatterGatherDma function to avoid
buffer allocation and copy
Using Cached Memory to allocate NIC receive
buffers
x86, IA64, and x64 hardware guarantee DMA coherency,
so there is no need to call IoFlushBuffer; it would
be a no-op
NdisMAllocateSharedMemory ( MiniportAdapterHandle, // adapter handle (placeholder name)
pMpRxbuf->AllocSize,
TRUE, // Cached
&pMpRxbuf->AllocVa,
&pMpRxbuf->AllocPa);
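The pre-allocation advice for the SCATTER_GATHER_LIST buffer can be sketched as a plain C, user-space model. Everything named here (tcb_t, tcb_pool_t, TCB_COUNT, SG_BUF_SIZE) is hypothetical and is not an NDIS type or constant; a real miniport would size the buffer from what NDIS reports and hand it to the NDIS scatter-gather calls. The point is only that the send path takes and returns blocks without allocating.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TCB_COUNT   4    /* hypothetical transmit ring depth */
#define SG_BUF_SIZE 256  /* hypothetical size of one scatter-gather list buffer */

/* Hypothetical Transmit Control Block: the SG list storage lives inside it. */
typedef struct tcb {
    struct tcb   *next_free;
    unsigned char sg_list_buf[SG_BUF_SIZE]; /* reused for every send */
} tcb_t;

typedef struct {
    tcb_t *blocks;    /* single allocation made during initialization */
    tcb_t *free_list; /* TCBs currently available to the send path */
} tcb_pool_t;

/* Initialization phase: allocate every TCB (and its SG buffer) once. */
static int tcb_pool_init(tcb_pool_t *pool)
{
    assert(pool != NULL);
    pool->free_list = NULL;
    pool->blocks = calloc(TCB_COUNT, sizeof(tcb_t));
    if (pool->blocks == NULL)
        return -1;
    for (size_t i = 0; i < TCB_COUNT; i++) {
        pool->blocks[i].next_free = pool->free_list;
        pool->free_list = &pool->blocks[i];
    }
    return 0;
}

/* Send path: take a TCB without allocating; NULL means all are in use. */
static tcb_t *tcb_get(tcb_pool_t *pool)
{
    tcb_t *tcb = pool->free_list;
    if (tcb != NULL)
        pool->free_list = tcb->next_free;
    return tcb;
}

/* Send-complete path: return the TCB so its SG buffer can be reused. */
static void tcb_put(tcb_pool_t *pool, tcb_t *tcb)
{
    tcb->next_free = pool->free_list;
    pool->free_list = tcb;
}

/* Teardown: release the single allocation. */
static void tcb_pool_destroy(tcb_pool_t *pool)
{
    free(pool->blocks);
    pool->blocks = NULL;
    pool->free_list = NULL;
}
```

A caller would run tcb_pool_init once at initialization, call tcb_get and tcb_put on the send and send-complete paths, and treat a NULL from tcb_get as "transmit resources exhausted".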
More Network Software Optimizations
NDIS Safe APIs
Required for NDIS 6.0 model!
They have shown overall TCP/IP improvements of up to 7%
in kernel-mode scenarios (e.g. IIS 6.0)
They eliminate the need to call into the kernel to probe
and lock buffers
Set the NDIS_ATTRIBUTE_USES_SAFE_BUFFER_APIS
flag in NdisMSetAttributesEx for NDIS 5.x drivers. The
flag does not need to be set for NDIS 6.0 drivers
Example: When using NdisQueryBufferSafe, the
VirtualAddress parameter should be set to NULL to
avoid mapping of buffers sent down by NDIS
64-bit DMA Support
Avoid copies for addresses above the 4GB range by
setting Dma64Addresses to TRUE in
NdisMInitializeScatterGatherDma
Locking Mechanisms Optimizations
Locks are an expensive hit to system performance if not used properly
Measurements show approximately 160 cycles per lock acquire
and 140 cycles per lock release.
Spinlocks should be used to protect data and not code.
Locking at DPC Level
When at DPC level, avoid extra code by using the following:
NdisDprAcquireSpinLock
NdisDprReleaseSpinLock
Reader-Writer Locks
To minimize the number of spinlock acquire and release operations,
use the NDIS ReadWriteLock functions for scalability:
NdisInitializeReadWriteLock
NdisAcquireReadWriteLock
NdisReleaseReadWriteLock
Read-write locks allow multiple concurrent readers to share a
single lock and limit write access to a single writer. No read
access is allowed during a write access. They still behave like a
spinlock and raise the IRQL to DISPATCH_LEVEL when acquired.
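The cycle counts quoted above make the cost of locking easy to estimate. The sketch below is arithmetic only: the packet rate, the number of lock pairs per packet, and the batch size are hypothetical inputs, and batching is just one illustrative way to reduce the number of acquire and release operations.

```c
#include <assert.h>
#include <stdint.h>

#define LOCK_ACQUIRE_CYCLES 160u /* measurement quoted in the text */
#define LOCK_RELEASE_CYCLES 140u /* measurement quoted in the text */

/* Cycles per second spent purely on acquire/release pairs when every
 * packet takes the given number of lock pairs. */
static uint64_t lock_overhead_cycles(uint64_t packets_per_sec,
                                     uint64_t lock_pairs_per_packet)
{
    return packets_per_sec * lock_pairs_per_packet *
           (LOCK_ACQUIRE_CYCLES + LOCK_RELEASE_CYCLES);
}

/* Same traffic, but one acquire/release pair covers a whole batch. */
static uint64_t batched_lock_overhead_cycles(uint64_t packets_per_sec,
                                             uint64_t packets_per_batch)
{
    assert(packets_per_batch > 0);
    /* Round up: a partial batch still needs one lock pair. */
    uint64_t batches =
        (packets_per_sec + packets_per_batch - 1) / packets_per_batch;
    return batches * (LOCK_ACQUIRE_CYCLES + LOCK_RELEASE_CYCLES);
}
```

With 100,000 packets per second and two lock pairs per packet, lock_overhead_cycles gives 60,000,000 cycles per second; taking one lock pair per batch of 32 packets brings that down to 937,500.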
Auto Tuning Network Drivers
Static: Driver and NIC hardware parameters are
based on system configuration, such as whether it is
a client or server machine, the CPU, the memory, and
what the NIC can do.
Dynamic: System conditions dictate what type of
tuning is necessary for optimum performance. It
uses resource utilization and network load as metrics
for determining the best operating points for the NIC
and driver.
Some of the primary auto tuning parameters include:
Interrupt moderation
Receive Buffers allocation
Small buffer coalescing
Packets processed per DPC
Drivers can obtain current processor utilization by
using the NdisGetCurrentProcessorCounts function.
Hardware Optimizations
Task Offload Support
Checksum Offload
It has been shown to improve overall TCP/IP performance by
up to 20%
It improves cache behavior and eliminates churning:
8% increase
It reduces code path length: 12% improvement
TCP Segmentation Offload
It has been shown to improve overall TCP/IP performance by
up to 11%
It reduces the sender's cycles-per-byte cost by 2x (it goes
below 1.5)
NDIS 6.0 supports its successor: Giant Send
Offload (> 64K)
NDIS 6.0 has IPv6 support for TCP Segmentation
Offload
NDIS 6.0 offers support for IPSec Offload
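To show the kind of per-byte work these offloads take off the host CPU, here is a software sketch of the standard Internet checksum (the ones' complement sum used by IP, TCP, and UDP, described in RFC 1071) and of the number of segments a large send must be cut into. The function names are hypothetical, and the MSS in the usage note is only a typical Ethernet value, not something stated in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement Internet checksum over a byte buffer.
 * This touches every byte, which is the work checksum offload
 * moves from the CPU to the NIC. */
static uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint64_t sum = 0;
    size_t i = 0;

    /* Add the buffer as big-endian 16-bit words. */
    for (; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];

    /* An odd trailing byte is padded with a zero byte. */
    if (i < len)
        sum += (uint32_t)data[i] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}

/* Number of wire segments needed for one large send: the splitting
 * that segmentation offload performs in hardware. */
static uint32_t segment_count(uint32_t payload_len, uint32_t mss)
{
    assert(mss > 0);
    return (payload_len + mss - 1) / mss;
}
```

For example, a 65,536-byte send with an MSS of 1,460 bytes needs segment_count(65536, 1460) = 45 segments, each of which would otherwise be built and checksummed by the host.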
Message Signaled Interrupts (MSI)
MSI has the following attributes:
No acknowledgment is necessary for the message
No sharing is usually necessary
There is support for many interrupts per PCI function
Caveat: It only works on P4 and later chipsets
Advantages of MSI
With no sharing in place, latency is less with a single
ISR running
Bus utilization goes down by eliminating some read
operations from device
Device can target interrupts at designated processors
(e.g. RSS)
It guarantees data buffer coherency because message
follows DMA traffic on bus
Receive Side Scaling (RSS)
Existing stack limits receive processing to one CPU
Restricts scalability of Web server to the number of short-lived
connections a single CPU can process (per NIC)
Limits transaction throughput to packet receive processing rate of
one CPU
Example: A four-processor machine cannot use more than 25%
of its overall CPU cycles when hosting a single NIC on the system
RSS helps both long and short-lived connections
At times when CPU processing is dominated by connection setup,
RSS improves performance
Connection setup tasks map well to a general purpose CPU
RSS gives us parallel receive processing = parallel DPCs
Planned availability in the Windows Server 2003 Scalable
Networking Pack add-on and Longhorn
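How received packets get spread across CPUs can be sketched as follows. The slides do not name the hash; this sketch assumes the Toeplitz hash that the RSS specification defines, computed over connection fields such as addresses and ports, with the low bits of the result indexing an indirection table of CPU numbers. The key bytes, the table contents, and the function names are sample values chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sample 40-byte secret key (illustrative; any key of this size works). */
static const uint8_t sample_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2, 0x41, 0x67,
    0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0, 0xd0, 0xca, 0x2b, 0xcb,
    0xae, 0x7b, 0x30, 0xb4, 0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30,
    0xf2, 0x0c, 0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa
};

/* Toeplitz hash: for every set bit of the input (most significant bit
 * first), XOR in the 32 key bits that start at that bit position. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t data_len)
{
    /* The key must cover every input bit plus a 32-bit window. */
    assert(key_len >= 4 && key_len * 8 >= data_len * 8 + 32);

    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
    size_t next_key_bit = 32; /* next key bit to shift into the window */
    uint32_t result = 0;

    for (size_t i = 0; i < data_len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if ((data[i] >> bit) & 1u)
                result ^= window;
            uint32_t incoming =
                (key[next_key_bit / 8] >> (7 - next_key_bit % 8)) & 1u;
            window = (window << 1) | incoming;
            next_key_bit++;
        }
    }
    return result;
}

/* Map a hash to a CPU through an indirection table whose size is a
 * power of two; the low bits of the hash select the entry. */
static uint8_t rss_select_cpu(uint32_t hash, const uint8_t *table,
                              size_t table_size)
{
    assert(table_size > 0 && (table_size & (table_size - 1)) == 0);
    return table[hash & (table_size - 1)];
}
```

Because the hash depends only on the connection's fields, every packet of one connection lands on the same CPU, while different connections fan out across the table. That is what makes the parallel DPCs safe to run without reordering a connection's packets.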
Receive Side Scaling
[Diagram. Today: one processor per NIC, with the ISR and DPC handled on a single CPU (CPU0). With RSS: multiple processors per NIC, with parallel receive packet queues on the NIC feeding parallel DPCs on CPU0, CPU1, and CPU2.]
Network Performance Tools
NTttcp benchmark
Uses Winsock 2.x publicly available APIs
Uses Overlapped I/O and Multithreading model
Transfers random data from Memory to Memory
Provides Throughput, CPU, and Interrupt rate
Provides Cycles per Byte metric - key for measuring
performance to catch regressions
Provides Packet to ACK ratio to detect link condition
Provides number of Segment Retransmits and Errors
Supports all Windows hardware architectures
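The cycles-per-byte metric can be reproduced from the other reported figures. The sketch below uses the conventional definition (CPU cycles consumed per second divided by bytes transferred per second); the slides do not give the exact formula NTttcp uses, so treat this as an assumption, and the function name as hypothetical.

```c
#include <assert.h>

/* Cycles per byte from measured utilization and throughput.
 * cpu_utilization is a fraction (0.0 to 1.0) averaged over all CPUs. */
static double cycles_per_byte(double cpu_utilization, unsigned cpu_count,
                              double cpu_hz, double throughput_bits_per_sec)
{
    assert(throughput_bits_per_sec > 0.0);
    double cycles_per_sec = cpu_utilization * (double)cpu_count * cpu_hz;
    double bytes_per_sec = throughput_bits_per_sec / 8.0;
    return cycles_per_sec / bytes_per_sec;
}
```

For example, one 3 GHz CPU at 50% utilization moving 1 Gbit/s spends 1.5e9 cycles on 1.25e8 bytes each second, or 12 cycles per byte. Tracking this number across driver builds is what makes regressions visible even when raw throughput stays pinned at line rate.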
[Screenshot: NTttcp Output for a Single Thread]
[Screenshot: NTttcp Output for Multiple Threads]
More Network Performance Tools
Kernrate Profiling tool
General purpose profiler for tracking CPU utilization
Samples periodically (programmable) to see what is
executing
Adjustable granularity
Per-processor, per-process, and total
Supports all Windows hardware architectures
Supports Windows 2000 and beyond
Highly customizable (numerous options)
The profiling tool and its viewer (KrView) can be
downloaded from:
http://www.microsoft.com/whdc/system/sysperf/krview.mspx
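The idea behind a sampling profiler can be sketched in a few lines: each periodic sample records the address that was executing, and samples are counted per code range to expose the hot path. This is a conceptual model only, not Kernrate's implementation or output format; the structure, function names, and address ranges are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical code range (for example, one driver routine). */
typedef struct {
    const char *name;
    uintptr_t   start; /* inclusive */
    uintptr_t   end;   /* exclusive */
    unsigned    hits;  /* samples that landed in this range */
} code_range_t;

/* Attribute one sampled program counter to the range containing it.
 * Returns the range index, or -1 if the sample falls outside all ranges. */
static int record_sample(code_range_t *ranges, size_t count, uintptr_t pc)
{
    for (size_t i = 0; i < count; i++) {
        if (pc >= ranges[i].start && pc < ranges[i].end) {
            ranges[i].hits++;
            return (int)i;
        }
    }
    return -1;
}

/* Index of the range with the most samples: the likely bottleneck. */
static size_t hottest_range(const code_range_t *ranges, size_t count)
{
    assert(count > 0);
    size_t best = 0;
    for (size_t i = 1; i < count; i++) {
        if (ranges[i].hits > ranges[best].hits)
            best = i;
    }
    return best;
}
```

A shorter sampling interval gives more samples and therefore finer resolution at the cost of more profiling overhead, which is the trade-off behind the adjustable granularity mentioned above.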
Call To Action
NDIS 6.0 driver developers need to implement
Task Offloading support for IPv6
Fine-tune your hardware so it operates at its
optimal performance point
Fine-tune your network driver to work optimally
with your hardware for best performance
For questions, please e-mail
ndis6fb @ microsoft.com. Please include your
name, company name, and phone number
Additional Resources
Email: ndis6fb @ microsoft.com
Web Resources:
Analyzing Driver Performance:
http://www.microsoft.com/whdc/driver/perform/drvperf.mspx
High Performing Adapters and Drivers whitepaper:
http://www.microsoft.com/whdc/device/network/NetAdapters-Drvs.mspx
Kernrate is available for download from the following:
http://www.microsoft.com/whdc/system/sysperf/krview.mspx
© 2005 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.