Thursday, January 13, 2011

Testing FreeBSD ZFS v28

I've been testing the v28 patch code for a month now, and I've yet to hit any real issues other than those mentioned below.

I'll detail some of the things I've tested; hopefully the stability of v28 in FreeBSD will convince others to give it a try, so the final release of v28 will be as solid as possible.

I've been using FreeBSD 9.0-CURRENT as of Dec 12th and 8.2-PRE as of Dec 16th.

What's worked well:

- I've made and destroyed small raidz pools (3-5 disks), a large 26-disk raid-10, and a large 20-disk raid-50.
- I've upgraded pools from v15 (ZFS filesystem v4) with no issues on the different arrays noted above.
- I've confirmed that a v15 or v28 pool will import into Solaris 11 Express, and vice versa, with the exception of the dual log/cache device issue noted below.
- I've run many TB of data through the ZFS storage, from benchmarks on my VMs connected via NFS, to simple copies inside the same pool, to copies from one pool to another.
- I've tested pretty much every compression level, changing them as I tweak my setup and try to find the best blend.
- I've added and removed many a log and cache device, some in failed states from hot-removals, and the pools always stayed intact.
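For reference, pool layouts like the ones above can be created with commands along these lines. This is just a sketch; the pool name and da* device names are placeholders, not the ones I actually used:

```shell
# 5-disk raidz
zpool create tank raidz da0 da1 da2 da3 da4

# raid-10: striped mirror pairs (repeat "mirror daX daY" out to 26 disks)
zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5

# 20-disk raid-50: four 5-disk raidz vdevs striped together
zpool create tank raidz da0  da1  da2  da3  da4 \
                  raidz da5  da6  da7  da8  da9 \
                  raidz da10 da11 da12 da13 da14 \
                  raidz da15 da16 da17 da18 da19

# upgrading an old pool and filesystem version in place
zpool upgrade tank
zfs upgrade tank
```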


What hasn't worked well:

- Import of pools with multiple cache or log devices. (May be a very minor point)

A v28 pool created in Solaris 11 Express with 2 or more log devices, or 2 or more cache devices, won't import in FreeBSD 9. This also applies to a pool created in FreeBSD, imported in Solaris to have the 2 log devices added there, then exported and imported back into FreeBSD. There are no errors; zpool import just hangs forever. If I reboot into Solaris, import the pool, remove the extra devices, then reboot into FreeBSD, I can import the pool without issue. A single cache or log device imports just fine. Unfortunately I deleted my witness-enabled FreeBSD 9 drive, so I can't easily fire it back up to give more debug info. I'm hoping some kind soul will attempt this type of transaction and report more detail to the list.

Note - I just tried adding 2 cache devices to a raidz pool in FreeBSD, exporting, then importing, all without rebooting. That seems to work. BUT - as soon as you reboot FreeBSD with this pool still active, it hangs on boot. Booting into Solaris, removing the 2 cache devices, then booting back into FreeBSD works. Something kept in memory between the export and the import allows this to work.
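The workaround boils down to this sequence. Again a sketch - pool and device names are placeholders:

```shell
# FreeBSD hangs importing a pool that carries 2+ log or cache devices.
# Workaround: boot into Solaris, drop down to a single device, re-export.

# (Solaris side)
zpool import tank
zpool remove tank da8    # remove the second log/cache device
zpool export tank

# (FreeBSD side - imports fine with a single log/cache device)
zpool import tank
```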

- Speed. (More of an issue, but what do we do?)

Wow, it's much slower than Solaris 11 Express for transactions. I do understand that Solaris will have a slight advantage over any port of ZFS. All of my speed tests are made with a kernel without debug, and yes, these are -CURRENT and -PRE releases, but the speed difference is very large.

At first, I thought it might be more of an issue with the ix0/Intel X520-DA2 10GbE drivers that I'm using, since the bulk of my tests are over NFS (I'm going to use this as a SAN via NFS, so I test in that environment).

But - I did a raw cp of several TB from one pool to another, executing the same command under FreeBSD as under Solaris 11 Express. In FreeBSD, the copy took 36 hours. With a fresh destination pool with the same settings/compression/etc. under Solaris, the copy took 7.5 hours.

Here's a quick breakdown of the difference in speed I'm seeing between Solaris 11 Express and FreeBSD. The test is Performance Test 6.1 on a Windows 2003 server, connected via NFS to the FreeBSD or Solaris box.  More details are here

Solaris 11 Express snv_151a

903 MB/s - Fileserver
466 MB/s - Webserver
53 MB/s - Workstation
201 MB/s - Database

FreeBSD-9.0 Current @ Dec 12th 2010 w/v28 Patch, all Debug off

95 MB/s - Fileserver
60 MB/s - Webserver
30 MB/s - Workstation
32 MB/s - Database

Massive difference, as you can see. Same machine, different boot drives. That's a real 903 MB/s on the Solaris side as well - no cache devices or ZIL in place, just a basic 5-disk raidz pool. I've tried many a tweak to get these speeds up higher. The old v15 could hit the mid-400s on the Fileserver test with zil_disable on, but that's no longer an option for v28 pools. I should compile my various test results into a summary and make a separate blog entry for those who care, as I also fiddled with vfs.nfsrv.async with little luck. I took great care to make sure the ZFS details were the same across the tests.
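For anyone wanting to reproduce the tweaks mentioned above, they look roughly like this on the FreeBSD side. Note that "tank" is a placeholder pool name, and setting sync=disabled (the v28-era stand-in for the old zil_disable) is unsafe for an NFS-backed SAN - it's shown only because zil_disable is what made v15 fast:

```shell
# NFS server async mode - little effect in my tests
sysctl vfs.nfsrv.async=1

# v15-era knob, gone with the v28 patch:
#   sysctl vfs.zfs.zil_disable=1
# v28-era rough equivalent (per dataset, loses sync-write guarantees!):
zfs set sync=disabled tank
```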

9 is faster than 8.2 by a small amount. Between v28 pools and v15 pools there is some speed degradation on both 8.2 and 9, but nothing as big as the difference between Solaris and FreeBSD.

I haven't benchmarked OpenSolaris or any type of Solaris older than 11, so I'm not sure if this is a recent speed boost from the Solaris camp, or if it's always been there.

As always, I'm delighted with the work that PJD and others have poured into ZFS on FreeBSD. I'm not knocking this implementation, just pointing out some details that may not be as well known.

Wednesday, January 12, 2011

Areca 1880ix-12 Can Overheat in a Dell T710 Chassis

While the Areca 1880ix-12 doesn't come with any fans mounted on its heatsinks, you may need them.

Areca 1880ix-12

I just overheated one of my Areca 1880ix-12 cards the other night during a rather long 26 disk ZFS copy.

I had the card installed in my Dell T710 4U server, which is located in an air-conditioned data centre, properly rack-mounted, covers on, etc., so the case is able to breathe properly, and the ambient temp is maintained at 21 Deg C. The card is mounted with a free slot behind and in front of it, so it's got room for air to move around it.

Dell T610's and 710's use shrouds to direct air over the memory and CPU, but leave the PCIe slots to fend for themselves. As I'm learning, it may be an issue for high performance cards installed in the PCIe slots.

Dell T710 Chassis, with shroud

Normally, I add additional cooling to any heatsink that is uncomfortable to the touch, a trick that has kept hardware failures to a minimum in my servers. Since this was still a fairly new build that was being worked on frequently, I hadn't bothered hooking anything up.

The failure happened overnight when I let my 26-disk ZFS array copy 6-7 TB of data from one location in ZFS space to another.

When I returned in the morning, I found the ZFS copy hung; Solaris was cranky because it didn't think it had any disks, but it was still running.

I had left the Areca 1880's Raid Storage Manager web page up during the copy, luckily on the Hardware Monitor page, which auto-refreshes to keep the stats up-to-date. The displayed temperature was 79 Deg C for the controller CPU, the highest I've seen yet.

I couldn't manoeuvre in the web console, and was forced to reboot. Upon reboot, the card wouldn't detect any drives. I powered down and started a physical check. The heatsinks were extremely hot to the touch.

After letting it cool properly, I was able to boot up without issue, detect drives, and check my ZFS array. Of course, it was fine - ZFS is hard to kill, even with a controller failure of this magnitude.

I installed a quick fix - an 80 mm case fan blowing directly on the Areca's heatsinks. I then powered back up and started the same copy again. This time, I watched it closely - the controller CPU temp never rose above 51 Deg C, and the copy completed in 7.5 hours.

I've contacted Areca Support, and they confirm that the Areca CPU has a max temp of 90 Deg C, so I should be okay under this temp.

It's quite possible that my Areca CPU hit a much higher temp than 79 Deg C - that's the last value displayed on my web console, but there is no guarantee that the web page didn't time out or fail to update before the final lockup.

Video card coolers are probably the best fix for this. I'll either use one that takes up an expansion slot and draws its air from the rear of the chassis, blowing it on the card, or mount some smaller slim video card coolers on the heatsinks.

Sunday, January 9, 2011

SSD Fade. It's real, and why you may not want SSDs for your ZIL

SSDs fade - and they fade quicker than you'd expect.

When I say "fade" I mean that their performance drops quickly with write use.

The owner of ddrdrive is rather active on the forums, and has been posting some very interesting numbers about how quickly an SSD can drop in performance. He sells a RAM-based hardware device that is actually very well suited to a ZIL; it's just very expensive.

Since my ZFS implementation makes use of 8 OCZ Vertex 2 SSD drives (4x60GB ZIL, 4x120GB L2ARC), I thought I should check into this.

First off, when you test SSDs you have to use one of the beta versions of iometer, so you can enable "Full Random" on the test data stream. OCZ's SSDs (like all others, I believe) perform compression/dedup to limit their writes, resulting in inflated numbers if you use easily compressed data.

How inflated? Try write speeds of 4700 IOPS / 2 MB/s for random data versus 7000 IOPS / 3.6 MB/s for repeating bytes. Nearly double the performance for easy-to-compress data.
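The effect is easy to demonstrate outside iometer: random bytes barely compress, while repeating bytes collapse to almost nothing, so a drive that compresses in-flight data gets a free ride on the latter. A quick sketch, with plain gzip standing in for the drive's compressing controller:

```shell
# 1 MiB of random data vs 1 MiB of repeating (zero) bytes
dd if=/dev/urandom of=/tmp/random.bin bs=1M count=1 2>/dev/null
dd if=/dev/zero    of=/tmp/zeros.bin  bs=1M count=1 2>/dev/null

# compress both and compare the on-disk sizes
gzip -c /tmp/random.bin > /tmp/random.bin.gz
gzip -c /tmp/zeros.bin  > /tmp/zeros.bin.gz

ls -l /tmp/random.bin.gz /tmp/zeros.bin.gz
# random.bin.gz stays around 1 MiB; zeros.bin.gz shrinks to roughly 1 KiB
```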

With iometer generating random data, I tested one of my OCZ Vertex 2 60 GB SSDs that had been in use for 4 months as a ZIL device. That's not exactly an ideal use of SSDs, and you'll see why shortly.

My test was iometer 1.1 on a Windows 7 32-bit box, driving 100% random, 100% write to the SSD, with a queue depth of 128. The writes were not 4k aligned, which does make a difference in speed, but for purposes of illustrating SSD fade it doesn't matter, as a 512b or 4k write is affected equally. Currently ZFS uses 512b as the drive sector size, so you're getting this type of speed unless you've tweaked your ashift value to 12 with a tool like ZFSguru.
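For the curious, you can check what ashift a pool ended up with, and (on FreeBSD) force a 4k-sector pool with the gnop(8) shim trick from before ashift was directly settable. A sketch - "tank" and the da* names are placeholders:

```shell
# Inspect a pool's ashift: 9 = 512-byte sectors, 12 = 4 KiB sectors
zdb -C tank | grep ashift

# Force ashift=12 at creation time by fronting one disk with a fake
# 4096-byte-sector provider; ZFS sizes the whole vdev from it.
gnop create -S 4096 /dev/da0
zpool create tank raidz da0.nop da1 da2 da3 da4
```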

Back to the tests:

Fresh from my active ZFS system, I was getting 4300 IOPS / 17.5 MB/s.

After a secure erase with the OCZ Toolbox 2.22, I was getting 6000 IOPS / 24.5 MB/s - that's a 40% increase in IOPS just from a secure erase. My SSD's performance had degraded that much in 4 months.

Curious to see how quickly the SSD would fade, I set iometer to run the random write test for 8 hours. When I came back and ran the test again, I was shocked to see speed down to 910 IOPS / 3.75 MB/s. Wow - we're getting close to the territory of a 15k 2.5" SAS drive (around 677 IOPS / 2.77 MB/s).

I then did a secure erase of the drive and tested again. Speed was back up, but not to where it originally was: I was now getting 4000 IOPS / 16 MB/s - worse than when I first started. Eight hours of hard writes had used up my SSD a bit more.
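I used the OCZ Toolbox under Windows for the secure erases; the same ATA security-erase can be issued from a Linux box with hdparm. This is a hypothetical sketch, not what I ran - sdX is a placeholder, and this wipes the entire drive:

```shell
# Confirm the drive supports security-erase and is "not frozen"
hdparm -I /dev/sdX | grep -A8 Security

# Set a temporary password, then issue the erase (destroys all data!)
hdparm --user-master u --security-set-pass p /dev/sdX
hdparm --user-master u --security-erase   p /dev/sdX
```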

Why did 8 hours of writing break down my SSD faster than 4 months of life as a ZIL? Well, I was using 4 of these 60 GB SSDs as one big ZIL device, and the ZIL doesn't consume much space, so the # of GB written across those 4 SSDs over 4 months was nowhere near my 8-hour test on one device.

Interestingly, I also have 4 120 GB Vertex 2 drives for my L2ARC cache. When I pulled and tested these drives, they were far less degraded - the L2ARC is designed to be very conservative with writes, and is mostly read from. An MLC SSD works well here.

Should you use SSDs for ZIL? You can, but understand how they will degrade.

Your best bet is a RAM based device like a ddrdrive, or similar. There are other slower, cheaper RAM drives out there you could use. These will not degrade with use, and the ZIL gets a lot of writes - Nearly everything written to your pool is written to the ZIL.

If you don't have one of these, your other option is to use lots of devices like I did, to balance the wear/fade. Pull them every 6 months for 15 minutes of maintenance, and you'll maintain decent speeds.
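The maintenance cycle I have in mind looks roughly like this. Pool and device names are placeholders:

```shell
# Drop one worn log device out of the pool (supported for log/cache
# vdevs since pool v19)
zpool remove tank da8

# ...pull the SSD, secure-erase it offline, reinstall it...

# Return the freshened device to ZIL duty and verify
zpool add tank log da8
zpool status tank
```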

For now, my 4 60 gig SSDs are going back into ZIL duty until I can decide if a ddrdrive (or a bunch of cheaper ddr devices) is in my future.

Solaris 11 Express - Faster Speed for SAN-NFS-VM than FreeBSD

I love FreeBSD. It's been my go-to OS since I first switched to it from BSDi many many years ago. However, I hate to stick with something just because I've always used it. Technology changes too quickly to make lifelong bonds. While I prefer FreeBSD, I ultimately will choose what gets the job done the best.

ZFS under FreeBSD has always progressed nicely, but it is obviously behind Solaris, seeing as Sun/Oracle are the creators of ZFS. The ZFS v28 patch is currently being tested for FreeBSD 8.2 and 9, but not yet released.

While I realize these are pre-releases, I wanted to test the performance I could expect from FreeBSD vs Solaris for my SAN that I'm building. Once I put this SAN into regular use, I won't easily be able to switch to another OS, or do a lot of tweaking with it.

This is not just a test of ZFS implementations - It tests the hard disk subsystem, memory, NFS, ethernet drivers, and ZFS to give me a combined test result, which is what I really care about - How fast can my SAN accept and serve up data to my ESXi boxes.

I've separately been benchmarking FreeBSD 8.2-PRE and 9.0-CURRENT with both the v28 ZFS patch and their native v15 ZFS implementations. I'll make that a separate blog entry a bit later on, but for now I'm choosing the fastest performer from the v28 portion of those tests, which was FreeBSD 9.0-CURRENT.

My test environment is as follows:

ESX Box: Dell PowerEdge T710 w/96Gig
SAN Box: Dell PowerEdge T710 w/24Gig
Network: Intel X520-DA2 10GbE Ethernet in each box, direct attached
SAS Card: Areca 1880ix w/4G Cache
Enclosure: SuperMicro SC847-JBOD2 (Dual 6Gbps Backplanes)
Drives: Seagate 1.5 TB SATA, 5 disk raidz, no log or cache
Test Software: Performance Test 6.1 run on a Windows 2003 VM

Solaris 11 Express snv_151a

903 MB/s - Fileserver
466 MB/s - Webserver
53 MB/s - Workstation
201 MB/s - Database

FreeBSD-9.0 Current @ Dec 12th 2010 w/v28 Patch, all Debug off

95 MB/s - Fileserver
60 MB/s - Webserver
30 MB/s - Workstation
32 MB/s - Database

Wow. Notice the difference? I didn't believe it myself at first, but I tried it again and again. I watched the packet traffic across the Intel card with iftop, and then ran separate benchmarks with iometer just to make sure it wasn't something silly in Performance Test. I always received similar results - Solaris 11 Express can move some serious data in comparison to FreeBSD 9.0.

At this stage, I don't know if it's a problem with the Intel driver holding back FreeBSD or not. You'll notice that the Workstation test is the only one that is comparable between FreeBSD and Solaris; it's made up of a lot of sync random reads/writes, which bring any ZFS implementation to its knees.

The thing is: I don't care if it's the Intel driver, the memory sub-system, or whatever - In the end, I need the best performance delivered to my ESXi boxes, and Solaris 11 Express can deliver where FreeBSD isn't ready just yet.

I do fully expect FreeBSD to catch up, but for now, I'm strongly considering spending the $$ on a Solaris 11 Express license so I can run it as my SAN OS. Solaris' easy CIFS implementation is another bonus for my SAN - it was delivering 78 MB/s when my FreeBSD/Samba implementation was at 35 MB/s.

Thursday, January 6, 2011

Areca 1880 Firmware v1.49 improves speed in Web Management, Solaris 11

I just applied the 1.49 firmware update for my Areca 1880, and the speed difference for user interface interaction is noticeable.

You'll have to download it from their Taiwan FTP site, which is listed on their support page.

The web console is now quite snappy, where before it was painful. In Solaris 11 Express, I can now list my 26+ drives without delay, where before a full drive inventory took 5-10 seconds, which was leading to some serious delays in zpool status commands.

When applying the update via the web console, make sure to apply all 4 files individually. The updater isn't smart - it won't ask for the other files, but it does need them all if you want to be fully updated across the board.

I ran a set of iometer tests before and after my upgrade, and I can't detect any change in drive speed, just user interface speed.

Tuesday, January 4, 2011

Partition Alignment with ESX and Windows 2003. Just do it.

You really want to make sure your partitions are aligned.

Partition alignment, in a nutshell, is making sure your partitions start on sector 2048 (a 1 MiB boundary) instead of the legacy sector 63.
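In concrete terms: sector 63 puts the partition at byte offset 63 x 512 = 32256, which straddles every 4 KiB boundary underneath it, while sector 2048 lands exactly on 1 MiB. A quick sanity check you can run anywhere:

```shell
# Byte offsets of the classic vs aligned partition start (512-byte sectors)
echo $(( 63   * 512 ))    # 32256   - not a multiple of 4096
echo $(( 2048 * 512 ))    # 1048576 - exactly 1 MiB

# A start sector is aligned if its byte offset divides evenly by 4096
echo $(( (63   * 512) % 4096 ))    # 3584 -> misaligned
echo $(( (2048 * 512) % 4096 ))    # 0    -> aligned
```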

There are some great blog entries that define the issue, so I won't get into it here.

While you don't need to do this for 2008 partitions (they start properly aligned), I've always known I should align my older 2003 partitions, but I never bothered. Today I decided to run a simple test to see the cost of running unaligned.

Test Details:
OS: Windows 2003 32bit
Test Package: Performance Test 6.1, running Webserver disk role benchmark
VM: ESXi 4.1
Drives: NFS share to a Solaris 11 ZFS box.

I chose the webserver role because it's a heavy reader (writes are harder to benchmark on ZFS), and because it's my worst performer. The other roles hit 900-1200 MB/sec, which is close to the max I can push through my 10GbE card.

Aligned Partition: 609 MB/sec
Non-Aligned Partition: 436 MB/sec

That's nearly a 40% increase in read speed - or, put the other way, almost 30% of your speed thrown away for not aligning a partition.

Here's a link to a quick little tool that will let you know within Windows if your partition is aligned:

Of course, it can be a pain in the ass to align a partition. You can use a gparted livecd, or create a new disk, partition it with Windows 2008, and copy the data to the new partition from a separate VM.

Some partition tools now include the ability to quickly and easily specify an aligned partition start, so hopefully this will be an easy task in the near future. If you know of any easy-booting GUI partition tools that can move a Windows partition to sector 2048 without much fuss, please let me know and I'll post it here.

I personally do it with gparted live, but I'm also not afraid of the command line. Something easier would be nice.