My life revolves around four 3-letter acronyms: BSD, ESX, ZFS, and NFS.
However, these four do not get along well, at least not on FreeBSD.
The problem is between ESX's NFSv3 client and ZFS's ZIL.
You can read a bit about this from one of the ZFS programmers here, although I don't agree that it's as much of a non-issue as that writer makes it out to be.
ESX uses an NFSv3 client, and when it connects to the server it always asks for sync writes. It doesn't matter what you set your server to; the O_SYNC requests from ESX will force it to sync every write.
By itself, this isn't a bad thing, but when you add ZFS to the equation, the NFS sync becomes unnecessary because ZFS already has the ZIL. It's best to leave ZFS alone and let it write to disk when it's ready, instead of forcing it to flush the ZIL on every write. Once ZFS has the data, you can forget about it (assuming you haven't turned off the ZIL).
Even if your ZIL is on a hardware RAM drive, you're going to notice a slowdown. The effect is magnified with an HD-based ZIL (which is what you have if you don't have a separate log device on SSD or RAM). For my tests, I was using a hardware RAM device for my ZIL.
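If you want a rough, hands-on feel for what a forced sync costs, here's a small userland sketch that times buffered versus O_SYNC 4k writes to the same file. This is just an illustration of the general effect, not the benchmark behind the numbers below (that was iometer), and the file name, block size, and block count are arbitrary; point it at a file on an NFS mount or a ZFS dataset and compare the two timings.

/*
 * sync_cost.c - rough illustration of the cost of O_SYNC writes.
 * Not the iometer benchmark from this post; just a quick comparison
 * of buffered vs. synchronous 4k writes to one file.
 * Build: cc -o sync_cost sync_cost.c
 * Run:   ./sync_cost /path/on/nfs/or/zfs/testfile
 */
#include <sys/time.h>

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCKS  2048            /* 2048 x 4k = 8MB of writes */
#define BSIZE   4096

static double
timed_writes(const char *path, int extra_flags)
{
        char buf[BSIZE];
        struct timeval t0, t1;
        int fd, i;

        memset(buf, 0xab, sizeof(buf));
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | extra_flags, 0644);
        if (fd < 0) {
                perror("open");
                exit(1);
        }
        gettimeofday(&t0, NULL);
        for (i = 0; i < BLOCKS; i++) {
                if (write(fd, buf, BSIZE) != BSIZE) {
                        perror("write");
                        exit(1);
                }
        }
        gettimeofday(&t1, NULL);
        close(fd);
        return ((t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
}

int
main(int argc, char **argv)
{
        const char *path = (argc > 1) ? argv[1] : "testfile";

        printf("buffered: %.3f sec\n", timed_writes(path, 0));
        printf("O_SYNC:   %.3f sec\n", timed_writes(path, O_SYNC));
        return (0);
}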
Some ZFS implementations can disable the ZIL. On FreeBSD, you can't if you're running ZFS v28.
Here are two quick iometer tests to show the difference between a standard FreeBSD NFS server and my modified FreeBSD NFS server.
Test setup: iometer 1.1-devel running on a Windows 7 SP1 machine, connected to the test drives via NFS. iometer was configured with 128 workers, full random data, 4k transfer size, 50% write / 50% read, 100% sequential access, an 8GB test file, and a run time of 15. I rebooted after each test and ran each test twice to make sure the results were sane. The server was FreeBSD 9-CURRENT as of 2011.05.28.15.00.00.
Standard NFS
Test 1: 1086 IOPS, 4.45 MB/s, 117 ms avg I/O
Test 2: 1020 IOPS, 4.18 MB/s, 125 ms avg I/O
Modified NFS
Test 3: 2309 IOPS, 9.45 MB/s, 55 ms avg I/O
Test 4: 2243 IOPS, 9.19 MB/s, 57 ms avg I/O
I feel the results speak for themselves, but in case they don't: with the modified NFS server code we see an increase in IOPS and MB/s, and a decrease in average I/O latency. For this particular test, performance nearly doubles. Other tests show closer to a 10% increase in speed, but that's still a welcome improvement.
You'll see these results whether you're using the old NFS server (v2 and v3 only) or the new NFS server (v2, v3, and v4) that became the default in FreeBSD 9 about a month ago.
I've used this hack for over six months now on my SANs without any issues or corruption, on both 8.1 and various 9-CURRENT builds, so I believe it's fairly safe to use.
I'm too lazy to make a proper patch (there's a rough diff sketch after the walkthrough if you'd rather patch), but manually editing the source is very easy:
- The file is /usr/src/sys/fs/nfsserver/nfs_nfsdport.c
- Go to line 704 and you'll see code like this (Edit: now line 727 in FreeBSD 9.0-RC3):
if (stable == NFSWRITE_UNSTABLE)
        ioflags = IO_NODELOCKED;
else
        ioflags = (IO_SYNC | IO_NODELOCKED);
uiop->uio_resid = retlen;
uiop->uio_rw = UIO_WRITE;
- Change the code to look like the example below. We're commenting out the logic that decides whether this should be an IO_SYNC write:
// if (stable == NFSWRITE_UNSTABLE)
ioflags = IO_NODELOCKED;
// else
//         ioflags = (IO_SYNC | IO_NODELOCKED);
uiop->uio_resid = retlen;
uiop->uio_rw = UIO_WRITE;
- Recompile your kernel, install it, and reboot. You're now free from forced NFS O_SYNC writes under ESX.
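If you'd rather apply a patch than edit by hand, here's the same change written out as a rough unified diff. Treat it as a sketch built from the snippet above rather than an official patch: it removes the dead branches outright instead of commenting them out (same net effect), and the line numbers will drift between FreeBSD versions, so you may need to apply it with fuzz or fall back to the manual edit.

--- sys/fs/nfsserver/nfs_nfsdport.c.orig
+++ sys/fs/nfsserver/nfs_nfsdport.c
@@ -704,6 +704,4 @@
-        if (stable == NFSWRITE_UNSTABLE)
-                ioflags = IO_NODELOCKED;
-        else
-                ioflags = (IO_SYNC | IO_NODELOCKED);
+        /* Ignore the client's stable flag; never force IO_SYNC. */
+        ioflags = IO_NODELOCKED;
         uiop->uio_resid = retlen;
         uiop->uio_rw = UIO_WRITE;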
If you are running the older NFS server (the default on 8.2 and earlier), the file to modify is /usr/src/sys/nfsserver/nfs_serv.c. Go to line 1162 and comment out the sync logic as shown in this example:
// if (stable == NFSV3WRITE_UNSTABLE)
ioflags = IO_NODELOCKED;
// else if (stable == NFSV3WRITE_DATASYNC)
//         ioflags = (IO_SYNC | IO_NODELOCKED);
// else
//         ioflags = (IO_METASYNC | IO_SYNC | IO_NODELOCKED);
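And the equivalent rough diff for the old server, again just a sketch of the hand edit above, with the same caveat about line numbers drifting between releases:

--- sys/nfsserver/nfs_serv.c.orig
+++ sys/nfsserver/nfs_serv.c
@@ -1162,6 +1162,2 @@
-        if (stable == NFSV3WRITE_UNSTABLE)
-                ioflags = IO_NODELOCKED;
-        else if (stable == NFSV3WRITE_DATASYNC)
-                ioflags = (IO_SYNC | IO_NODELOCKED);
-        else
-                ioflags = (IO_METASYNC | IO_SYNC | IO_NODELOCKED);
+        /* Ignore the client's stable flag; never force IO_SYNC. */
+        ioflags = IO_NODELOCKED;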
If you try this, let me know what kind of before-and-after speed results you see.