Friday, December 3, 2010

Netperf and SMP - Oddness (Part 1 of 2)

Netperf still seems to be a fairly standard network performance tool. I see iperf out there as well and, interestingly enough, it generates very different numbers from netperf.

I've decided to go with netperf for my benchmarking needs, and have started running some simple tests with it to become more familiar with its operation.

The first oddness I notice is with SMP. I get wildly different results depending on whether multiple CPUs are used.

Let's start with what I'm running on: FreeBSD 8.1 AMD64 on a Dell PowerEdge 1850, with 2 Xeon 5340 CPUs. These machines have 2 Intel 1000MT NICs built into the board, each of which I have given its own IP address. They run the em0 driver. I've connected them together with a cross-over cable.

FreeBSD is set up as stock: installed from the CD, no changes.

I'm using cpuset to drive the applications to one CPU or the other. Here's what I'm executing:

cpuset -c -l 0 netserver -L -n 2

cpuset -c -l 0 netperf -H -L -t TCP_STREAM -l 300

By specifying the different IPs, I'm forcing the data to move from one NIC to the other. It's running IPv4, not IPv6.

This combination drives both the receive server (netserver) and the test program (netperf) from the same CPU. If I want to make them run on different CPUs, I'd change one of the -l 0 arguments to -l 1. If I want to leave it up to the kernel to schedule, I leave out the cpuset command entirely.
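The three scenarios described above can be laid out side by side with a small dry-run script. This is only a sketch: it prints the command lines rather than running them, and SERVER_IP and CLIENT_IP are hypothetical placeholders, since the actual addresses are not shown in this post. The pin and plan helper names are likewise made up for illustration.

```shell
#!/bin/sh
# Hypothetical placeholders: the post does not show the two NIC addresses.
SERVER_IP="${SERVER_IP:-SERVER_IP_HERE}"   # address netserver listens on
CLIENT_IP="${CLIENT_IP:-CLIENT_IP_HERE}"   # address netperf sends from

# Print the cpuset prefix for a CPU id; an empty id means no pinning.
pin() {
    [ -n "$1" ] && printf 'cpuset -c -l %s ' "$1"
    return 0
}

# Print the netserver and netperf command lines for one scenario.
# $1 = CPU for netserver, $2 = CPU for netperf ("" leaves it to the scheduler).
plan() {
    printf '%snetserver -L %s -n 2\n' "$(pin "$1")" "$SERVER_IP"
    printf '%snetperf -H %s -L %s -t TCP_STREAM -l 300\n' \
        "$(pin "$2")" "$SERVER_IP" "$CLIENT_IP"
}

echo "# same CPU";       plan 0 0
echo "# different CPUs"; plan 0 1
echo "# no preference";  plan "" ""
```

Printing first makes it easy to check each command line before committing to a 300-second run.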

All hyperthreading is turned off. These are two standalone CPUs.

Here's what I'm getting, expressed in gigabytes per second:

Same CPU: 1.05 GB/sec
Different CPUs: 0.47 GB/sec
No Preference: 0.88 GB/sec

Very interesting..

We're looking at a little under half speed when we run it on different CPUs. When we don't set a preference for the CPU, the processes flip-flop between the two: sometimes both on one CPU, other times separate. The no-preference speed falls between the single-CPU and dual-CPU speeds, somewhat above their 0.76 GB/sec midpoint.
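As a rough check on how the three figures relate, the arithmetic can be done with awk. The sketch below assumes the no-preference result is a simple time-weighted mix of the same-CPU and different-CPU rates; that model is my assumption, not something the test measured. Under it, the different-CPU run comes out at about 0.45 of the same-CPU rate, the midpoint of the two is 0.76 GB/sec, and 0.88 GB/sec would correspond to the two processes sharing a CPU about 71% of the time.

```shell
#!/bin/sh
# Throughput figures from the post, in GB/sec.
SAME_CPU=1.05
SPLIT_CPU=0.47
NO_PREF=0.88

# Percentage of time spent in the same-CPU case under a linear-mix assumption:
#   no_pref = f * same + (1 - f) * split  =>  f = (no_pref - split) / (same - split)
same_cpu_share() {
    awk -v s="$1" -v d="$2" -v n="$3" \
        'BEGIN { printf "%.0f", 100 * (n - d) / (s - d) }'
}

awk -v s="$SAME_CPU" -v d="$SPLIT_CPU" 'BEGIN {
    printf "split/same ratio: %.2f\n", d / s
    printf "midpoint: %.2f GB/sec\n", (s + d) / 2
}'
echo "implied same-CPU share: $(same_cpu_share "$SAME_CPU" "$SPLIT_CPU" "$NO_PREF")%"
```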

I've researched this online, and found a few other people mentioning similar issues, but the threads never come to a conclusion.

I can think of two reasons for this wide spread between single-CPU and dual-CPU speeds:

1) Because all the work is happening on one CPU, there is some sort of cache/memory/buffer combining that allows for a faster transfer of data on the PCI bus. Maybe the data isn't transferring - but I do see the little link lights blinking away furiously when I run the tests.

2) There is significant overhead between the processors for SMP.

I do have a second identical PowerEdge 1850 that I plan on bringing into this equation shortly to try to figure out where this is coming from. By sending to a separate machine, I'm going to eliminate the possibility that the CPU is combining something.

However, if you're looking to make a firewall run quickly, it looks from this first small test as if a single-CPU firewall will outperform a dual. That's an early conclusion, and I'll post more shortly when I know more.

If anyone has more info on this, that would be great.

Continued here..


  1. While it will not catch all the memory allocations (e.g. the stack), one can request that netperf/netserver bind to specific CPUs via the global -T option:

    -T N # bind netperf and netserver to CPU N on their respective systems
    -T N, # just bind netperf to CPU N and leave netserver floating
    -T ,M # bind netserver to CPU M, float netperf
    -T N,M # you get the idea

  2. Oh, and as for measuring the added overhead, enable the -c and -C global options to get netperf to measure CPU utilization and report it along with service demand (CPU consumed per unit of work).

  3. Actually that didn't work.

    I tried the -C and -c and kept getting -1.00 for CPU utilization.

    Separately trying the -T options also didn't work, as I confirmed with top -P: it still bounced from CPU to CPU.

    The only way I could get a lock to one CPU was with cpuset.

    I never could get a proper CPU utilization figure; I just eyeballed it from top -P and didn't really report it.

    I love it when the tools work, but I'm finding some of the older perf tools have some very interesting oddities in them. My research found threads blaming the amd64 SMP years ago, and then went cold.
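For reference, the flags suggested in the first two replies could be combined on one command line roughly as sketched below. The script only prints the commands. SERVER_IP is a hypothetical placeholder because the address is not shown in this post, and bind_arg and plan_measured are made-up helper names. As the last reply notes, on this FreeBSD 8.1 setup neither the binding nor the CPU measurement behaved as expected, so treat this as an illustration of the syntax only.

```shell
#!/bin/sh
# Hypothetical placeholder: the post does not show the netserver address.
SERVER_IP="${SERVER_IP:-SERVER_IP_HERE}"

# Build the -T argument: $1 = CPU for netperf, $2 = CPU for netserver.
# Either may be empty to leave that side floating.
bind_arg() {
    if [ -z "$1" ] && [ -z "$2" ]; then
        return 0                      # no binding requested, emit nothing
    fi
    printf '%s %s,%s ' '-T' "$1" "$2"
}

# Print (not run) a netperf command with binding plus CPU measurement.
plan_measured() {
    printf 'netperf %s-c -C -H %s -t TCP_STREAM -l 300\n' \
        "$(bind_arg "$1" "$2")" "$SERVER_IP"
}

plan_measured 0 1     # netperf on CPU 0, netserver on CPU 1
plan_measured 0 ""    # bind netperf only
plan_measured "" 1    # bind netserver only
```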