Transcript Slides

Passive monitoring of 10 Gb/s lines
with PC hardware
Sven Ubik, Petr Žejdl
CESNET
TNC2008, Bruges, 19 May 2008
Active vs. passive monitoring
Active monitoring – a probe
• Uses test packets
• Results are strictly applicable only to the test packets themselves
Passive monitoring – a view
• Does not send anything; provides characteristics
of the real traffic
• Many characteristics are inherent to real traffic
and cannot be obtained from test packets:
traffic volume, protocol usage, burstiness, real packet loss,
anomalies, security attacks, …
Hardware vs. software
• Hardware processing is often considered "fast" and
software processing "slow"
• Software runs on top of hardware and hardware is often
programmed; there is no clear line between HW and SW processing
• Hardware programming is sometimes considered "design-time"
and software programming "run-time", but dynamically
reconfigurable HW exists and software also needs to be designed
• There is often a difference in flexibility of programming
(SW better than HW)
• What is more "powerful": an FPGA, a network processor or
a multi-core CPU?
NICs and monitoring cards
• 10GE NICs are now commonly available and relatively
inexpensive: ~$1300 per port including an XFP transceiver
(was ~$4000 four years ago)
• 10GE monitoring cards are few and expensive –
DAG (Endace), Napatech, COMBO (Invea-Tech),
~6500 Euro per port and more including an XFP transceiver
(was over 20000 Euro two years ago)
• Two main differences between NICs and monitoring cards:
- some hardware acceleration (filtering, header
classification, simple packet statistics)
- a large packet buffer and block DMA transfers
– the key difference
10 Gb/s cards that we tested
x8 PCI-Express NICs:
• Myricom Myri-10G
• Neterion Xframe II
64-bit/133 MHz PCI-X NICs:
• Intel PRO/10GbE
• Neterion Xframe
Monitoring card:
• DAG 8.2X (PCI-E)
Theoretical bus throughput:
- 20 Gb/s for x8 PCI-E (8 lanes × 2.5 Gb/s)
- 8 Gb/s for 64-bit/133 MHz PCI-X
Test setup
• RFC2544 – Benchmarking Methodology for Network
Interconnect Devices
• Frame sizes 1518, 1280, 1024, 512, 256, 128 and 64 bytes
• DUT – Device Under Test, difficult to isolate in the case of a PC card:
• tested card
• PC hardware
• NAPI driver for NICs
• Linux 2.6 with standard IP stack
• MAPI middleware
• test application (header filter and packet counter;
see the sketch below)
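For illustration, a minimal sketch of such a header-filter-and-packet-counter test application follows. It uses libpcap only to keep the example self-contained; the actual measurements used the MAPI middleware on top of the card drivers, and the interface name and filter expression here are placeholders.

/* Minimal stand-in for the test application (header filter + packet
 * counter). The real application was built on MAPI; libpcap is used here
 * only to keep the sketch self-contained. Interface name and filter
 * expression are placeholders. */
#include <pcap.h>
#include <stdio.h>

static unsigned long long matched = 0;

static void on_packet(u_char *user, const struct pcap_pkthdr *h,
                      const u_char *bytes)
{
    (void)user; (void)h; (void)bytes;
    matched++;                       /* count packets that passed the filter */
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *p = pcap_open_live("eth1", 96 /* headers only */, 1, 1000, errbuf);
    if (!p) { fprintf(stderr, "pcap_open_live: %s\n", errbuf); return 1; }

    struct bpf_program prog;
    if (pcap_compile(p, &prog, "tcp port 80", 1, 0) == 0)
        pcap_setfilter(p, &prog);    /* header filtering step */

    pcap_loop(p, 1000000, on_packet, NULL);   /* process 1M packets */
    printf("matched packets: %llu\n", matched);
    pcap_close(p);
    return 0;
}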
Processing throughput
Maximum load at zero loss [Gb/s]

Packet size [B] | Packets/s at 10 Gb/s | Myricom | Intel | Xframe | Xframe II | DAG (for comparison)
1518            |      812744          |   8.5   |  7.5  |  7.5   |   6.0     |  10
1280            |      961538          |   7.0   |  6.5  |  6.5   |   5.5     |  10
1024            |     1197318          |   6.0   |  5.5  |  5.0   |   5.5     |  10
512             |     2349624          |   3.5   |  3.0  |  3.0   |   3.0     |  10
256             |     4528986          |   2.0   |  1.0  |  1.5   |   1.5     |  10
128             |     8445946          |   1.0   |  0.5  |  1.0   |   1.0     |  10
64              |    14880952          |   0.5   |  0.1  |  0.5   |   0.5     |  10
• Maximum IP-layer throughput in Gb/s with zero-loss processing
• Myricom was the best among the NICs, but only marginally
• The DAG card sustained 100% of line rate
Processing frame rates
[Charts: processed frame rates for Myricom Myri-10G (PCI-E),
Neterion Xframe II (PCI-E), Intel PRO/10GbE (PCI-X)
and Neterion Xframe (PCI-X)]
CPU load
• CPU load at maximum zero-loss throughput
• For none of the cards was the CPU the bottleneck
• A DAG anomaly for larger frames is being investigated
Traffic processing in a PC
Example of a modern mainboard (Supermicro X7DB8 with
Intel 5000P chipset):
In a modern PC, the bandwidth
of the PCI bus, memory and FSB
is sufficient for sustained
processing of 10 Gb/s of data;
the bottlenecks are the CPUs
and the NICs.
In our case the NICs were most
likely the bottleneck, with a limit
of ~1.3 million packets/s.
Cycles per packet
10 Gb/s in 64-byte packets = 14.8 × 10^6 packets / second
3 GHz CPUs:
4 cores – 806 cycles / packet
8 cores – 1612 cycles / packet
16 cores – 3224 cycles / packet
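The per-core budgets above are simply (number of cores × clock rate) / packet rate; a tiny C sketch of the arithmetic, with the values from this slide:

/* Cycle budget per packet = (cores × clock rate) / packet rate.
 * 14.88 Mpps is the 64-byte packet rate of a fully loaded 10 Gb/s line
 * (see the throughput table). The slide rounds the 4-core figure to 806
 * and multiplies it, hence the 1612 and 3224 shown there. */
#include <stdio.h>

int main(void)
{
    const double pps      = 14.88e6;  /* 64-byte packets per second at 10 Gb/s */
    const double clock_hz = 3.0e9;    /* 3 GHz CPUs */
    const int cores[]     = {4, 8, 16};

    for (int i = 0; i < 3; i++)
        printf("%2d cores: %.0f cycles per packet\n",
               cores[i], cores[i] * clock_hz / pps);
    return 0;
}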
Packet sizes in live traffic
Example: GN2 – CESNET link:
~ 40% packets near 64 bytes
~ 40% packets near 1518 bytes
~ 20% packets in between
(~3% near 600 bytes)
average packet size 790 bytes
Example application: ABW
• Traffic classification
into application-layer
protocols (~3.6 Gb/s)
• Based on MAPI
and the trackflib library
• Each protocol requires a
combination of header
filtering and payload
searching (see the sketch below)
• 2x dual-core 3 GHz Xeon: zero-loss monitoring of ~3.5 Gb/s
of live traffic (=> 4x quad-core: ~14 Gb/s)
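As an illustration of the "header filtering plus payload searching" combination, a minimal stand-alone C sketch follows. It is not trackflib code; the port number, the payload signature and the helper name are made up for the example.

#define _GNU_SOURCE   /* for memmem() */
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical classifier: a packet "looks like HTTP" if it satisfies a
 * header condition (TCP destination port 80) and its payload contains a
 * protocol signature ("HTTP/1."). */
static int looks_like_http(uint16_t dst_port,
                           const uint8_t *payload, size_t len)
{
    if (dst_port != 80)                                 /* header filtering */
        return 0;
    return memmem(payload, len, "HTTP/1.", 7) != NULL;  /* payload search  */
}

int main(void)
{
    const uint8_t pkt[] = "GET / HTTP/1.1\r\nHost: example.org\r\n\r\n";
    printf("classified as HTTP: %d\n",
           looks_like_http(80, pkt, sizeof pkt - 1));
    return 0;
}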
Many-core processing
• Tilera – TILExpress-64 and TILExpress-20G cards:
64 cores, 1 or 2 XAUI connectors (Infiniband-style)
• Other many-core cards exist, but without a high-speed
network interface (e.g., the 128-core NVIDIA Tesla C870
GPU processing board)
Distribution into multiple cores
1. In hardware:
Some monitoring cards have firmware that copies packets
into multiple memory buffers based on user-defined load
balancing (DSM – Data Stream Management in DAG cards,
but more than two buffers are available only in NinjaBoxes)
2. In software:
One core runs a packet scheduler that creates virtual buffers
(packets are not copied) without splitting flows across buffers;
the other cores serve the virtual buffers – in development …
(see the sketch below)
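A minimal sketch of this flow-preserving software distribution is shown below: the scheduler core hashes the flow 5-tuple to choose a per-core virtual buffer and enqueues only a descriptor (pointer and length), so packet data is never copied and all packets of a flow stay on one core. All structures and names are hypothetical, not the in-development code.

/* Illustrative flow-preserving packet scheduler: hash the 5-tuple to pick
 * a per-core virtual buffer and enqueue only a descriptor (pointer + length),
 * never copying packet data. Names and structures are hypothetical. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_CORES 4
#define RING_SIZE 4096

struct pkt_desc {                 /* descriptor into the shared packet buffer */
    const uint8_t *data;
    uint32_t       len;
};

struct virtual_buffer {           /* single-producer ring served by one core */
    struct pkt_desc ring[RING_SIZE];
    volatile uint32_t head, tail;
};

static struct virtual_buffer vbuf[NUM_CORES];

/* Simple 5-tuple hash; any hash works as long as it is stable per flow. */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport, uint8_t proto)
{
    uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport) ^ proto;
    h ^= h >> 16;
    h *= 0x45d9f3b;               /* mixing constant */
    h ^= h >> 16;
    return h;
}

/* Scheduler core: place the packet descriptor into the virtual buffer of
 * the core responsible for this flow. Returns 0 on success, -1 if full. */
static int schedule_packet(const uint8_t *data, uint32_t len,
                           uint32_t saddr, uint32_t daddr,
                           uint16_t sport, uint16_t dport, uint8_t proto)
{
    uint32_t core = flow_hash(saddr, daddr, sport, dport, proto) % NUM_CORES;
    struct virtual_buffer *vb = &vbuf[core];
    uint32_t next = (vb->head + 1) % RING_SIZE;

    if (next == vb->tail)         /* ring full: worker core has fallen behind */
        return -1;

    vb->ring[vb->head].data = data;   /* store a pointer, not a copy */
    vb->ring[vb->head].len  = len;
    vb->head = next;
    return 0;
}

int main(void)
{
    static const uint8_t dummy[64] = {0};
    int rc = schedule_packet(dummy, sizeof dummy,
                             0x0a000001, 0x0a000002, 1234, 80, 6);
    printf("scheduled packet, rc=%d\n", rc);
    return 0;
}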
Conclusion
Complex zero-loss processing of a 10 Gb/s packet stream is
possible in a modern PC when two conditions are satisfied:
• Packets are copied from the network to the PC's memory
efficiently (the CPU must not be loaded by this task);
this is currently not possible with NICs, but it is possible
with monitoring cards
• Packets are distributed among multiple cores
Thank you for your attention
Questions?
[email protected]