Monday, June 27, 2011

Speeding up FreeBSD's NFS on ZFS for ESX clients

My life revolves around four 3-letter acronyms: BSD, ESX, ZFS, and NFS.

However, these four do not get along well, at least not on FreeBSD.

The problem is between ESX's NFSv3 client and ZFS's ZIL.

You can read a bit about this from one of the ZFS programmers here, although I don't agree that it's as much of a non-issue as the writer found.

ESX uses an NFSv3 client, and when it connects to the server, it always asks for a sync connection. It doesn't matter what you set your server to; ESX's O_SYNC requests will force all writes to sync.

By itself, this isn't a bad thing, but when you add ZFS to the equation, we now have unnecessary NFS syncs on top of ZFS's ZIL. It's best to leave ZFS alone and let it write to disk when it's ready, instead of forcing it to flush the ZIL all the time. Once ZFS has the data, you can forget about it (assuming you haven't turned off the ZIL).

Even if your ZIL is on hardware RAM drives, you're going to notice a slowdown. The effect is magnified with an HD-based ZIL (which is what you have if you don't have a separate log device on SSD/RAM). For my tests, I was using a hardware RAM device for my ZIL.
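For reference, a separate log device is added at the pool level; a rough sketch, with pool and device names purely illustrative:

```shell
# Attach a fast device as a dedicated ZIL (slog) for pool "tank"
# (pool and device names are illustrative)
zpool add tank log /dev/ada4

# Or, for safety, mirror the log device instead:
# zpool add tank log mirror /dev/ada4 /dev/ada5

# Confirm the pool layout
zpool status tank
```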

Some ZFS implementations let you disable the ZIL. We can't do that on FreeBSD if you're running ZFS v28.

Here are two quick iometer tests to show the difference between a standard FreeBSD NFS server and my modified FreeBSD NFS server.

Test setup: Running iometer 1.1 devel on a Windows 7 SP1 machine, connected to the test drives via NFS. iometer used 128 workers, full random, 4k size, 50% write / 50% read, 100% sequential access, an 8GB file, and a 15 run-time. Rebooted after each test, and ran each test twice to make sure we were receiving sane results. Using FreeBSD 9-CURRENT as of 2011.

Standard NFS
Test 1    1086 IOPS    4.45 MB/s    117 avg I/O (ms)
Test 2    1020 IOPS    4.18 MB/s    125 avg I/O (ms)

Modified NFS
Test 3    2309 IOPS    9.45 MB/s    55 avg I/O (ms)
Test 4    2243 IOPS    9.19 MB/s    57 avg I/O (ms)

I feel the results speak for themselves, but in case they don't: with the modified NFS server code we're looking at an increase in IOPS and MB/s, and a decrease in the time to access the information. For this particular test, performance nearly doubles. Other tests show closer to a 10% increase in speed, but that's still a welcome gain.

These improvements apply whether you're using the old NFS server (v2 and v3 only) or the new NFS server (v2/v3/v4) that became the default in FreeBSD 9 as of a month ago.

I've used this hack for over 6 months now on my SANs without any issue or corruption, on both 8.1 and various 9-Current builds, so I believe it's fairly safe to use.

I'm too lazy to make a proper patch, but manually editing the source is very easy:

- The file is /usr/src/sys/fs/nfsserver/nfs_nfsdport.c
- Go to line 704, where you'll see code like this:
(Edit: now line 727 in FreeBSD 9.0-RC3)

if (stable == NFSWRITE_UNSTABLE)
  ioflags = IO_NODELOCKED;
else
  ioflags = (IO_SYNC | IO_NODELOCKED);
uiop->uio_resid = retlen;
uiop->uio_rw = UIO_WRITE;

- Change the code to look like this below. We're commenting out the logic that decides whether this should be an IO_SYNC write.

// if (stable == NFSWRITE_UNSTABLE)
ioflags = IO_NODELOCKED;
// else
// ioflags = (IO_SYNC | IO_NODELOCKED);
uiop->uio_resid = retlen;
uiop->uio_rw = UIO_WRITE;

- Recompile your kernel, install it, and reboot. You're now free from NFS O_SYNCs under ESX.
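For anyone who hasn't rebuilt a kernel before, the standard FreeBSD procedure looks roughly like this (KERNCONF is whatever kernel config you normally build; GENERIC is shown as an example):

```shell
# Rebuild and install the kernel after editing nfs_nfsdport.c
cd /usr/src
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now
```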

If you are running the older NFS server (the default for 8.2 and earlier), the file to modify is /usr/src/sys/nfsserver/nfs_serv.c. Go to line 1162 and comment out the lines as shown in this example:

// if (stable == NFSV3WRITE_UNSTABLE)
ioflags = IO_NODELOCKED;
// else if (stable == NFSV3WRITE_DATASYNC)
//   ioflags = (IO_SYNC | IO_NODELOCKED);
// else
//   ioflags = (IO_METASYNC | IO_SYNC | IO_NODELOCKED);
If you try this, let me know what kind of before-and-after speed results you get.


  1. Hmm, seems very interesting, but did you try to reach 2Gbit or more between your NFS storage and ESX server?
    I'm currently testing round-robin Linux bonding mode; I've already reached 1.8Gbit with 2 NICs (Broadcom's 4-port server adapter), but I haven't tested this with NFS or iSCSI yet, just iperf.

  2. In other tests, I have reached higher than 2Gbit on my 10Gbit network cards. I usually try to leave network bonding out of my tests if possible, as it can add overhead and complications to the test.

  3. Great job. I applied your "patch" and speed increased dramatically - scary dramatically. Thanks; really useful post.

  4. > Some ZFS instances can disable the ZIL. We can't in FreeBSD if you're running ZFS v28.

    Are you sure of that?
    The following worked perfectly for me:
    zfs set sync=disabled

  5. Glad to hear it helped you out Bane.

  6. Anon: Thank you, I completely missed this setting. Setting sync=disabled would accomplish much the same thing. I'll have to do some speed tests now to see whether it's still worthwhile to disable the NFS O_SYNCs.

  7. Christopher; Have you had the chance to make that final test to see if you get the same results with `zfs set sync=disabled`? If not, can you? Thanks!

  8. A reference to this post and your modifications has been made at:

    Would it be possible to make a binary version of your modified NFSD available for download?

    Thank you.

  9. sysctl vfs.nfsrv.async=1

    On RELENG_8 anyways

  10. I've done some layman's testing today on FreeBSD 9.0_RC1, and I get the same performance when doing `zfs set sync=disabled tank/vm` as I do when applying the code modifications you wrote in this article. So to me, it seems that they effectively do the same thing in the end. I would love it if someone could verify this though.

  11. No, I'm waiting for 9.0 Final, before I dive into doing any real testing. Work will have me building 2 fairly heavy SAN machines based on FreeBSD in the next two months, so I'll be able to do some A/B comparisons on two physically different machines with the same hardware.

    For now, if you compile anyway, add the source change - It's quite easy. If you don't compile, set sync=disabled - At this point we're just being nit-picky. :-)
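    For reference, the no-compile route sets the property per dataset; the dataset name below is purely illustrative:

    ```shell
    # Disable synchronous write semantics on the dataset backing the VMs
    # (dataset name tank/vm is illustrative)
    zfs set sync=disabled tank/vm

    # Verify the setting
    zfs get sync tank/vm
    ```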

  12. Regarding a binary: That's too much work for me unfortunately, especially since we think zfs set sync=disabled will do the same thing. :-)

  13. Regarding sysctl vfs.nfsrv.async=1

    That's not the same thing. Try some speed tests and you'll see. ESX opens the NFS share with an O_SYNC request, and NFS has to obey it.

  14. Hi,

    what kind of NVRAM ZIL devices do you use, BTW?

  15. Why is nobody commenting about corruption...?

    If I set sync=disabled and hard reboot while the ESXi guest OS is writing to the disk, I not only end up with corrupt files, but a corrupt file system.

    I get 68 MB/s using an SSD ZIL with a Linux NFS sync client. With ESXi, I only get 7.5 MB/s. Obviously ESXi is doing something stupid, since it is the only client known to have this problem. My guess is that it does a flush after every write (every burst of packets sent) rather than only sync writes (such as a file flush).

    My best workaround so far is to just leave the root file system as the slow virtual disk, and manually NFS mount the rest in fstab, using the guest system's NFS client, with sync enabled.

    But the question I have is: will your change disable sync completely (which is a terrible idea), or just the ridiculously stupid parts of what ESXi is doing? I think the answer is that it disables it completely, so get ready for corruption on every power failure, disk failure, panic, or other IO system service interruption.

  16. Peter:

    Good point on the worries about corruption with sync=disabled.

    My patch only affects the NFS communication for ESX - When ESX says "sync this data", my NFS patch makes NFS lie and say "yeah, yeah, we did, get on with it". I normally don't turn off sync on my pools, I just patch the NFS server. This means that the ZIL and all other write-protection schemes of ZFS are still fully in place to protect your data.

    I'm still holding off on FreeBSD 9.0-RELEASE (should only be a few days away) before I start running some test suites to compare all of this. I think Work will give me a window of a few weeks with some nice hardware to test these theories on.

  17. Rainer: I use an ACARD RAM drive at the moment, but I'm probably going to switch to modern SSDs, possibly striped, possibly with battery-backed cache - I'm going to run some tests soon to see where the price/performance point is for me. The speed you need in your ZIL depends on how often you're using it and how fast your pool is. Generally, you want it faster than your pool; otherwise you're slowing down your pool for ZIL-involved writes. With the new batch of 6Gbps SATA drives out with big caches, I'm starting to think that 45 of these drives in a mirror pool will need more speed than my current RAM drives can provide.

  18. When you write a large file to the ESXi-mounted disk with zfs sync=standard and check gstat output, do the log devices show all of the writes (i.e., does your test speed match the gstat-reported write speed), proving that it is really writing synchronously?

    Linux NFS client:

    # dd if=/dev/zero of=/nfsmount/testfile bs=128k count=6000
    ... 66.5 MB/s


    ... 69324 11.8 30.4| gpt/log0
    ... 69324 19.2 47.2| gpt/log1

    Linux VM using its virtual disk, ESXi NFS client:

    # dd if=/dev/zero of=/testfile bs=128k count=2000
    ... 7,0 MB/s

    ... 6608 0.1 75.5| gpt/log0
    ... 6608 0.1 75.5| gpt/log1

  19. Christopher, have you been able to test using FreeBSD 9.0-RELEASE yet?

    1. Mark: Yes, I've started my tests with 9.0-RELEASE. It's still too early to report anything back.

      I'm trying to learn the Phoronix test suite, as I'm always looking for a good benchmark suite. Unfortunately, not all of the tests are available in FreeBSD - iobench for instance.

      However, early tests have shown something interesting: a VM of FreeBSD 9.0-RELEASE has faster disk speed with an LSI2008 controller than a bare-metal machine of the same specs. The largest difference between the VM and bare metal is that I had to turn off MSI and MSI-X in the /boot/loader.conf file.

      I hope to have results to share in the next while.

    2. ...iozone, not iobench, BTW. I'm getting confused with tiobench.

  20. >I had to turn off MSI and MSI-X in the /boot/loader.conf file.
    What's the point to do so?

  21. Without the entry, it won't boot. :-) Under ESXi 5 there are limited MSI-X interrupts available (vectors). I think FreeBSD is asking for more resources than ESXi can provide.

  22. Hi Christopher
    Previously my NFS with sync=standard only got 3MB/s; after applying your patch it increased to 60-70MB/s. The performance seems equivalent to sync=disabled.

    But when I ran zpool iostat 1, I found it's really doing async writes at the disk level too. With sync=standard, while doing dd from ESXi I got 2-3 seconds without any write I/O, then a sudden surge of 200MB in one second; no compression is turned on.

  23. I'm having the exact same problem, but I can't seem to find the nfs_nfsdport.c file. I'm running FreeNAS 8 with ESXi 5.1. Can anyone tell me where to find this file in my setup? Thanks

  24. Any idea if this patch can be applied to FreeNAS ?

  25. Just wondering:
    Wouldn't it be better to patch the NFS code so that it ignores the NFS client's sync requests only when writing to ZFS? That way you'd be safe in the case of a mixed-filesystem FreeBSD server. Or have a separate setting to accomplish this? (Or both?)

  26. Anon:

    I'd only think that my patch is useful for ZFS targets, not UFS/other.

    Really, there are other optimizations in the NFS code that I would like to try, and would make more sense than my (possibly risky) patch.

  27. How do you recompile your kernel and install it on FreeNas 9.x ?

  28. FreeNAS is just FreeBSD, so I'd say it's the same process.

  29. Is this hack no longer applicable in FreeBSD version 10.1? I can't seem to find the location for it.

    1. Yes, applies to 10.1 and 10.2 - The NFS code isn't changing much in those versions.

  30. Thanks a lot for the tip! We recompiled the FreeNAS 9.10 kernel and we are getting 90+ MB/s with ESXi.