Intel Pro/1000 Small Packet Performance

This is a summary of what I've learned about the Intel Pro/1000 gigabit ethernet card's performance with small (60-byte) packets. The board can receive 680,000 packets/second, and send at 500,000 or 840,000 p/s, depending on model. However, the Intel Linux drivers can drive it at only about half those rates.

Small-packet performance matters if you wish to build routers, or router-like boxes such as NATs. In such situations the average packet size is likely to be about 200 bytes. Many gigabit ethernet board designs (and marketing) seem to focus more on 1500- or 9000-byte packets.

Test Configuration

The test results here actually involve two versions of the Pro/1000. The receiver is model PWLA8490, which has a 33-MHz 64-bit PCI interface. The sender is model PWLA8490SX, often called the "Pro/1000 F Server Adapter". It has a 66-MHz 64-bit PCI interface. You probably want to buy the PWLA8490SX. It seems to be able to send almost twice as fast as the PWLA8490, but doesn't receive any faster.

The test machines are PCs with SuperMicro 370DL3 motherboards, 800-MHz Pentium III CPUs, a 133-MHz front-side bus, and 256 MB of PC133 memory. This motherboard has the ServerWorks ServerSet LE chipset and 64-bit PCI slots. The machines have two CPUs but are running a Linux kernel with SMP support turned off.

The machines are running Linux 2.2.16. The networking code, however, is the Click software router toolkit. Depending on the precise configuration, Click replaces some or all of the Linux kernel networking code. The point of using Click is that it can send and receive packets much faster than any user-level program, because it runs in the kernel. The send software for these experiments sends UDP packets at a controlled rate. The receive software just counts and discards packets. Each packet is 60 bytes long, including the 14-byte ethernet header.

The Pro/1000 driver is based on Intel's version 2.5.11, available on the web here.

The two machines involved are directly connected with a fiber cable. Link-level flow control is disabled for all the tests.

Transmit Performance

The Intel driver can send up to 260,000 packets/second with the PWLA8490, and 340,000 p/s with the PWLA8490SX. The limiting factor seems to be that the board uses "delayed" transmit-complete interrupts. It probably interrupts only once per fixed time period. Since the transmit queue is limited to 80 packets, this means the board can send at most 80 packets per interrupt period. The details are hard to pin down since Intel doesn't make board documentation available.

After fixing the driver to explicitly ask the board for a transmit-complete interrupt every 60 packets, the PWLA8490 is able to send 523,000 p/s. The PWLA8490SX can send about 840,000 p/s. The detailed fix was to leave the E1000_TXD_CMD_IDE (interrupt delay enable) bit clear in every 60th transmit descriptor, so that descriptor's transmit-complete interrupt is delivered immediately.
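
In outline, the change looks something like the sketch below. This is not the actual driver source: only the E1000_TXD_CMD_IDE name comes from the driver, and the descriptor layout, the other command bits, and their values are simplified guesses at how the board works.

    /* Sketch of the transmit-path change, not the literal driver code.
     * Only the E1000_TXD_CMD_IDE name comes from the driver; the descriptor
     * layout, the other command bits, and their values are guesses. */

    #define E1000_TXD_CMD_EOP  0x01000000u  /* end of packet */
    #define E1000_TXD_CMD_RS   0x08000000u  /* ask the board to report completion */
    #define E1000_TXD_CMD_IDE  0x80000000u  /* delay the completion interrupt */
    #define TX_INT_BATCH       60           /* force an interrupt every 60 packets */

    struct tx_desc {
        unsigned long long buffer_addr;     /* DMA address of the packet data */
        unsigned int       length;
        unsigned int       cmd;
    };

    static unsigned int tx_since_irq;

    void fill_tx_desc(struct tx_desc *d, unsigned long long dma_addr, int len)
    {
        unsigned int cmd = E1000_TXD_CMD_EOP | E1000_TXD_CMD_RS;

        if (++tx_since_irq < TX_INT_BATCH)
            cmd |= E1000_TXD_CMD_IDE;   /* usual case: let the board batch interrupts */
        else
            tx_since_irq = 0;           /* every 60th packet: IDE left clear, so the
                                           transmit-complete interrupt arrives at once */

        d->buffer_addr = dma_addr;
        d->length      = len;
        d->cmd         = cmd;
    }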

Receive Performance

This graph shows the number of packets per second delivered to the receiving software as a function of the rate at which packets are sent to a PWLA8490 card:

[Graph: packets/second delivered vs. packets/second sent, with one curve each for the Original, Tuned, and Polling drivers described below.]

The Original line corresponds to the unmodified Intel driver. It can receive about 300,000 p/s. At higher input rates it seems to experience interrupt livelock -- the card interrupts for every received packet, and the cost of the interrupt handling prevents the CPU from performing any other processing.
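
One way to think about that ceiling: if every received packet costs the CPU some fixed amount of interrupt and driver work, the CPU is completely consumed once packets arrive faster than it can pay that cost, and any further input only makes things worse. Here is a crude model of that arithmetic; the per-packet cost is an assumed parameter, chosen so the result matches the 300,000 p/s figure.

    /* Crude model of interrupt livelock.  The per-packet cost is an assumed
     * parameter, not a measurement; roughly 3.3 microseconds of interrupt,
     * driver, and processing work per packet would account for the
     * ~300,000 p/s ceiling of the unmodified driver. */
    #include <stdio.h>

    int main(void)
    {
        double per_packet_us = 3.3;                    /* assumed CPU cost per packet */
        double saturation_pps = 1e6 / per_packet_us;   /* input rate that consumes the CPU */

        printf("CPU saturated at about %.0f packets/second\n", saturation_pps);
        return 0;
    }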

The driver source includes code to ask the card to delay interrupts, but that code isn't turned on. The relevant variable is e1000_rxint_delay. It appears to be the number of microseconds between interrupts. The receive DMA queue length is set by MAX_RFD, so the maximum receive rate should be about MAX_RFD packets per delay period. Unfortunately these parameters probably have to be tuned for each specific workload. The delay period should be long enough that the CPU can completely process MAX_RFD packets per delay, including user-level processing if appropriate. If the delay is too low, the CPU will experience livelock and get no work done. If the delay is too high, the card will discard packets even though the CPU is idle.

I found that leaving MAX_RFD at 80 packets and setting the receive interrupt delay to 128 (the same as the transmit delay) worked well. This allows about 1.6 microseconds of processing time per packet, which is enough for my receive software to count and discard a packet. The resulting behavior is shown by the Tuned line in the graph above. Note that the receive rate goes up to 450,000 p/s, but then descends. I wasn't able to find MAX_RFD and delay values that prevented the decline. This is too bad -- part of the point of delayed interrupts is to prevent livelock, but it doesn't seem to work.
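
For concreteness, here is the arithmetic behind those settings, assuming as above that the delay is measured in microseconds. The 625,000 p/s number is the model's theoretical ceiling, not a measurement; the measured peak was the 450,000 p/s just mentioned.

    /* The receive-tuning arithmetic, assuming e1000_rxint_delay is in
     * microseconds and that the card can buffer at most MAX_RFD packets
     * between interrupts.  Formulas only, not measurements. */
    #include <stdio.h>

    int main(void)
    {
        double max_rfd  = 80;    /* MAX_RFD: receive DMA queue length */
        double delay_us = 128;   /* e1000_rxint_delay: microseconds between interrupts */

        /* Per-packet CPU budget: the CPU must finish each packet this fast,
         * on average, or it falls behind and livelocks. */
        double budget_us = delay_us / max_rfd;            /* 1.6 us/packet */

        /* Loss-free ceiling: no more than MAX_RFD packets can arrive per
         * delay period without the card running out of descriptors. */
        double ceiling_pps = max_rfd / (delay_us * 1e-6); /* 625,000 p/s */

        printf("budget %.1f us/packet, ceiling %.0f packets/second\n",
               budget_us, ceiling_pps);
        return 0;
    }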

The Polling line in the graph describes a setup in which the card doesn't interrupt at all. Instead, the Click software polls the card for new packets, fully processes them, and only then polls for more packets. This prevents livelock and avoids interrupt overhead, so the driver can receive (and process) 680,000 p/s even when overloaded with input.
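
In outline, the polling arrangement looks something like the loop below. This is a sketch of the idea rather than Click's actual code; the card is simulated so the example stands alone, and rx_poll, rx_refill, and process_packet are stand-ins for the real driver hooks and Click elements.

    /* Sketch of polled receive, not Click's actual source.  The card is
     * simulated so the example is self-contained; in the real system
     * rx_poll() and rx_refill() would operate on the Pro/1000's receive
     * DMA ring. */
    #include <stdio.h>

    struct packet { int len; };

    static struct packet ring[4] = { {60}, {60}, {60}, {60} };
    static int  ring_head, npending = 4;
    static long received;

    /* Stand-in for the driver's poll hook: next received packet, or NULL. */
    static struct packet *rx_poll(void)
    {
        if (npending == 0)
            return NULL;
        npending--;
        return &ring[ring_head++ % 4];
    }

    /* Stand-in for handing spent receive descriptors back to the card. */
    static void rx_refill(void)
    {
    }

    /* Stand-in for the receive software: just count and discard. */
    static void process_packet(struct packet *p)
    {
        (void) p;
        received++;
    }

    int main(void)
    {
        /* The essential structure: no interrupts at all; each packet is
         * fully processed before the next one is taken from the card, so
         * input is accepted only when the CPU has nothing else to do. */
        for (;;) {
            struct packet *p = rx_poll();
            if (!p)
                break;          /* a real router would do other work here instead */
            process_packet(p);
            rx_refill();
        }
        printf("received %ld packets\n", received);
        return 0;
    }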

Conclusions

The Pro/1000 hardware can receive 680,000 packets/second, and send 500,000 or 840,000 p/s, depending on model. On the one hand, these numbers are far from saturating a gigabit link (about 1.4 million 60-byte packets/second). On the other hand, they're a lot better than the two other gigabit cards I've used. The Alteon Tigon-II seems to be limited to sending about 100,000 packets per second, possibly because the firmware takes about 5 microseconds to load each DMA descriptor (see page 70, section 4.4, of the Host/Nic Software Interface Definition). I'm able to send 250,000 p/s with the SysKonnect SK-9843.

It's too bad the Intel Linux driver can only achieve about half the board's potential. It's also disappointing that the delayed interrupt mechanism seems to require manual tuning, and that it doesn't prevent livelock.

You can find my modified version of the Intel 2.5.11 driver here. My modifications support Click's polling and simplify the code to help get rid of some locking and increase concurrency. I could easily have introduced bugs, so don't use my driver if you can't tolerate problems.

You can find a more up-to-date Pro/1000 driver as part of the Click distribution.

Note that since I don't have a manual for the Intel board, I may be misunderstanding its behavior. And it could easily be the case that the Intel, Alteon, and SysKonnect hardware could perform better than I've suggested here with better drivers or with a different test strategy. So take my results and explanations with a grain of salt.

Robert Morris, rtm@lcs.mit.edu, November 2000.