Tuesday, July 12, 2011

ZFS Dedup Performance - Real World Experience After One Year

I've been running ZFS with deduplication on OpenSolaris or FreeBSD for about a year now in a production environment.  I manage ESX VMs, mostly running various flavours of Windows Server.

In that year, my thoughts on where to use dedup have changed.

My big ZFS pool is 28.5 TB in a basic mirror, currently showing 11.1 TB used, and running a dedup ratio of 1.25x (the same data would take 25% more space without dedup), a savings of around 2.75 TB. This machine runs 24 Gig of RAM, and has an L2ARC comprised of four 120 Gig SSDs, in addition to 2 ZIL RAM devices.
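Here's a quick sketch of the arithmetic behind that savings figure. It assumes that "used" means the space actually allocated after dedup, and that the dedup ratio is logical size divided by allocated size; the function names are made up for illustration.

```python
# Sketch of the dedup savings arithmetic.
# Assumption: used_tb is the space allocated on disk AFTER dedup,
# and dedup_ratio = logical size / allocated size.

def logical_tb(used_tb: float, dedup_ratio: float) -> float:
    """Size the same data would occupy with dedup turned off."""
    return used_tb * dedup_ratio


def saved_tb(used_tb: float, dedup_ratio: float) -> float:
    """Space that dedup avoided allocating."""
    return logical_tb(used_tb, dedup_ratio) - used_tb


def saved_fraction(dedup_ratio: float) -> float:
    """Share of the logical data that did not need its own copy on disk."""
    return 1.0 - 1.0 / dedup_ratio


if __name__ == "__main__":
    print(f"saved: {saved_tb(11.1, 1.25):.3f} TB")
    print(f"share of logical data saved: {saved_fraction(1.25):.1%}")
```

Under those assumptions, a 1.25x ratio on 11.1 TB works out to a bit under 2.8 TB saved, which is one fifth of what the data would occupy without dedup.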

I have two smaller production units, holding 20.4 TB in a raidz with 12 Gig of RAM, and 16.4 TB in a raidz with 24 Gig of RAM.  These secondary systems may or may not have hardware RAM ZILs and L2ARC SSD caches, as I continually move components around in my secondary equipment to find a sweet-spot between $$ for hardware and performance.

Back to the thought at hand..

Dedup obviously has to have some sort of performance penalty. It simply takes more time to check if you're writing a unique block compared to writing without the check.

Checking the uniqueness of the block involves looking at every block's hash in the pool (even if the fs inside the pool isn't participating in dedup) - A task that increases write execution time as your pool increases in size.

In a perfect world, your system will have enough RAM that the Dedup Table (DDT) will reside entirely in RAM. This means that it only requires traversing the DDT in RAM to find the uniqueness of the block being written.

In the real world, this is nearly impossible. The way the ARC/L2ARC is designed, there isn't a preference for the DDT data - it's simply thought of as metadata, and that competes with user data in the cache system of ZFS. You may think that setting primarycache and/or secondarycache to metadata will fix the problem, but once you switch the primary cache to metadata, you also doom the secondary cache to only hold metadata - Nothing can get into the L2ARC unless it exists in the ARC.

In my experience with a busy pool, enough of your DDT is eroded away from user data writes that dedup performance starts to become a serious issue. These issues don't show up on a fresh system without a lot of data in pool. You need an older, loaded, very fragmented pool to really notice the drop in performance.

How bad? Here's a quick and dirty iometer check on a fs with dedup turned on, and one with it off. This is on my big pool, with plenty of L2ARC and RAM.

Dedup On:  197 IOPS, 0.81 MB/s, 647 ms AvgIO
Dedup Off: 4851 IOPS, 19.87 MB/s, 28 ms AvgIO.

See the difference? This is not a 2x or 4x slowdown of my writes - by IOPS, writing with dedup on is roughly 25x slower than with it off.
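For anyone who wants to check the size of that gap, here is a small sketch that turns the two iometer lines into slowdown factors. The numbers are copied from the results above; the helper name is just for illustration.

```python
# Slowdown factors from the dedup on/off iometer results.

def factor(larger: float, smaller: float) -> float:
    """How many times bigger one measurement is than the other."""
    return larger / smaller


dedup_on = {"iops": 197, "mb_s": 0.81, "avg_io_ms": 647}
dedup_off = {"iops": 4851, "mb_s": 19.87, "avg_io_ms": 28}

iops_slowdown = factor(dedup_off["iops"], dedup_on["iops"])
throughput_slowdown = factor(dedup_off["mb_s"], dedup_on["mb_s"])
latency_increase = factor(dedup_on["avg_io_ms"], dedup_off["avg_io_ms"])

if __name__ == "__main__":
    print(f"IOPS slowdown:       {iops_slowdown:.1f}x")
    print(f"throughput slowdown: {throughput_slowdown:.1f}x")
    print(f"latency increase:    {latency_increase:.1f}x")
```

All three measures land between 23x and 25x, so the gap isn't an artifact of which column you look at.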

So we know it's going to be hard to keep the DDT in RAM, so there will be a performance hit. What about the large savings to disk writes, (you don't write the block if it already exists) and thus the performance increase there?

It's simply not there in my systems, and probably not in yours.

Additionally the idea that these dedup blocks could reside in memory cache, allowing multiple systems to take advantage of the same blocks is a good one, but it looks like there is still enough erosion of the caches that this doesn't make a difference in my systems.

If I had a higher dedup ratio, then maybe this would start working out better for me - But even with 40+ VM's of mostly the same stuff, I'm only running a 1.25x dedup ratio on my big pool - I have enough dissimilar data to make dedup impractical on this system.

I'm now going to turn dedup off for nearly all of my production boxes. 

However, turning dedup off doesn't de-dedup your data - it stays deduplicated until fresh data overwrites it.

This provides a 'poor-man's offline dedup' trick. Leave dedup off during the day when you need performance, then enable via a script, and copy all of the data in place to *.tmp and then rename to drop the *.tmp when done - You just deduplicated all your data, without the performance hit during the writes in the daytime when I assume your systems would be the busiest.
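A minimal sketch of the copy-and-rename step, assuming the dataset is mounted at a normal path and nothing has the files open while it runs (so not live VM images). The function name is hypothetical. Turning dedup on and off around the run is left to the standard `zfs set dedup=on <dataset>` and `zfs set dedup=off <dataset>` commands, with the dataset name being yours to fill in.

```python
# Sketch: rewrite every file in place so its blocks pass through the
# write path again (and therefore through dedup, if it is enabled).
# Assumptions: files are not in use during the run, and there is
# enough free space for a temporary copy of the largest file.
import os
import shutil
from pathlib import Path


def rewrite_in_place(root: str) -> int:
    """Copy each regular file to <name>.tmp, then rename it over the original.

    Returns the number of files rewritten.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            src = Path(dirpath) / name
            # Skip symlinks and leftovers from an interrupted earlier run.
            if src.is_symlink() or src.suffix == ".tmp":
                continue
            tmp = src.with_name(src.name + ".tmp")
            shutil.copy2(src, tmp)  # fresh write of the data, keeps timestamps
            os.replace(tmp, src)    # rename over the original
            count += 1
    return count
```

You'd call it as `rewrite_in_place("/path/to/dataset/mountpoint")` from a nightly job. One thing to keep in mind: if snapshots still reference the old blocks, those blocks stay allocated, so the pool can grow rather than shrink until the snapshots are gone.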

I'm going to leave dedup enabled on my storage and backup archives - places where I have a lot of duplication, and not a lot of need for speed. Everywhere else, I'm just going to run compression (I find compression=gzip to be a good blend of storage and speed on my systems). I'm averaging close to 2x on my compression alone.

For dedup to work well in a production environment, you're going to need an extreme amount of RAM and L2ARC compared to your drive storage size. I don't have any figures for you, but it's probably cheaper to provision more storage than it would be to try and dedup the data.

What ZFS needs is an offline dedup ability, priority settings for the DDT in ARC/L2ARC, and some general tuning of the ARC/L2ARC to allow for situations like "DDT and Metadata only in RAM, everything else allowed in L2ARC".

..at least for it to be viable in my environment, which is the problem - We all have different environments with different needs.

Monday, June 27, 2011

Speeding up FreeBSD's NFS on ZFS for ESX clients

My life revolves around four 3-letter acronyms: BSD, ESX, ZFS, and NFS.

However, these four do not get along well, at least not on FreeBSD.

The problem is between ESX's NFSv3 client and ZFS's ZIL.

You can read a bit about this from one of the ZFS programmers here - Although I don't agree that it's as much of a non-issue as this writer found.

ESX uses an NFSv3 client, and when it connects to the server, it always asks for a sync connection. It doesn't matter what you set your server to, it will be forced by the O_SYNC command from ESX to sync all writes.

By itself, this isn't a bad thing, but when you add ZFS to the equation, we now have an unnecessary NFS sync due to ZFS's ZIL. It's best to leave ZFS alone, and let it write to disk when it's ready, instead of instructing it to flush the ZIL all the time. Once ZFS has it, you can forget about it (assuming you haven't turned off the ZIL).

Even if your ZIL is on hardware RAM-drives, you're going to notice a slow-down. The effect is magnified on a HD based ZIL (which is what you have if you don't have a separate log device on SSD/RAM). For my tests, I was using a hardware RAM device for my ZIL.

Some ZFS instances can disable the ZIL. We can't in FreeBSD if you're running ZFS v28.

Here's two quick iometer tests to show the difference between a standard FreeBSD NFS server, and my modified FreeBSD NFS server.

Test Setup: Running iometer 1.1 devel, on a Windows 7 SP1 machine, connected to the test drives via NFS. iometer has 128 writers, full random data, 4k size, 50% write 50% read, 100% sequential access, 8GB file, 15 run-time. Reboot after each test, ran each test twice to make sure we were receiving a sane result. Using FreeBSD 9-CURRENT as of 2011.

Standard NFS
Test 1   1086 IOPS    4.45 MBs    117 AvgIO (ms)
Test 2   1020 IOPS    4.18 MBs    125 AvgIO (ms)

Modified NFS
Test 3   2309 IOPS    9.45 MBs    55 AvgIO (ms)
Test 4   2243 IOPS    9.19 MBs    57 AvgIO (ms)

I feel the results speak for themselves, but in case they don't - We're looking at an increase in IOPS and MB/s, and a decrease in the time to access the information when we use the modified NFS server code. For this particular test, performance roughly doubles. Other tests are closer to a 10% increase in speed, but that's still a wanted increase.
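If you'd rather see the comparison as numbers, this sketch averages the two runs of each configuration and works out the ratios. The figures are taken straight from the four test lines above; the variable names are only for illustration.

```python
# Average the paired runs and compare standard vs modified NFS server.
from statistics import mean

standard = {"iops": [1086, 1020], "avg_io_ms": [117, 125]}
modified = {"iops": [2309, 2243], "avg_io_ms": [55, 57]}

# Ratio above 1.0 means the modified server handled more operations.
iops_gain = mean(modified["iops"]) / mean(standard["iops"])

# Ratio below 1.0 means each IO completed faster on the modified server.
latency_ratio = mean(modified["avg_io_ms"]) / mean(standard["avg_io_ms"])

if __name__ == "__main__":
    print(f"IOPS gain:     {iops_gain:.2f}x")
    print(f"latency ratio: {latency_ratio:.2f}")
```

Averaged over both runs, IOPS a little more than double and the average IO time falls to a bit under half.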

You'll see these results whether you're using the old NFS server (v2 and v3 only) or the new NFS server (v2-3-4) that became the default in FreeBSD 9 about a month ago.

I've used this hack for over 6 months now on my SANs without any issue or corruption, on both 8.1 and various 9-Current builds, so I believe it's fairly safe to use.

I'm too lazy to make a proper patch, but manually editing the source is very easy:

- The file is /usr/src/sys/fs/nfsserver/nfs_nfsdport.c
- Go to line 704, you'll see code like this;
(Edit: Now line 727 in FreeBSD-9.0-RC3)

if (stable == NFSWRITE_UNSTABLE)
  ioflags = IO_NODELOCKED;
else
  ioflags = (IO_SYNC | IO_NODELOCKED);
uiop->uio_resid = retlen;
uiop->uio_rw = UIO_WRITE;

- Change the code to look like this below. We're commenting out the logic that decides to allow this to be an IO_SYNC write.

// if (stable == NFSWRITE_UNSTABLE)
ioflags = IO_NODELOCKED;
// else
// ioflags = (IO_SYNC | IO_NODELOCKED);
uiop->uio_resid = retlen;
uiop->uio_rw = UIO_WRITE;

- Recompile your kernel, install it, and reboot - You're now free from NFS O_SYNC's under ESX.

If you are running the older NFS server (which is the default for 8.2 or older), the file to modify is /usr/src/sys/nfsserver/nfs_serv.c  - Go to line 1162 and comment out these lines as shown in this example:

// if (stable == NFSV3WRITE_UNSTABLE)
ioflags = IO_NODELOCKED;
// else if (stable == NFSV3WRITE_DATASYNC)
// ioflags = (IO_SYNC | IO_NODELOCKED);
// else

If you try this, let me know what type of before and after speed results you're receiving.

Wednesday, June 22, 2011

FreeBSD 9.0 and EKOPath (Path64) Compiler

You may be aware that Pathscale announced they are making their EKOPath compiler suite open-source. This is great news, even if it's a GPLv3 license. I'd be happier with a BSD license so we could bundle it more easily with FreeBSD, but regardless, having free access to this excellent compiler is just what we need to increase FreeBSD's performance, and start catching up with other OSes.

Phoronix released info about the story here

Compilers do make a difference in the speed of our OS, don't think for a second that you're getting full performance in FreeBSD from the old gcc 4.2.2 on modern hardware.

I've recently done some tests with clang/llvm vs gcc with various optimization switches, and there is a definite increase from the default generic builds of FreeBSD for my standard environment of NFS-ZFS-ESX. I'll post about that shortly when I've finished my tests.

For now, Martin Matuska has some info about compiling FreeBSD with a newer version of gcc, and also links to some statistically valid data showing the speed increase from doing this. You can start here, and find his performance data link at the bottom.  

clang/llvm is a great step forward. I wish it would beat gcc for compiled binary run speed, but it can't. In my tests, the only time clang/llvm was faster than gcc was when I was doing compression based tests, and I'm assuming this is because the newer clang/llvm can take advantage of more modern processor extensions than the older gcc 4.2.2.

FreeBSD is committed to clang/llvm, as it's under a BSD license, and we need that to make the entire FreeBSD distribution GPL-free.  It will get better, but FreeBSD isn't the focus of this project.

Anyway, I digress: This is about Path64. Start here:

Sources to download Path64 compiler:
You'll need to include two libraries for this to work;

pkg_add -r libdwarf
pkg_add -r cmake

Follow the instructions in the readme, or check Marcello's page for more info.

I followed his instructions. The only difference is that I'm on FreeBSD-9-CURRENT, a build from 2011, and I used this cmake command:

set MYLIBPATH=/usr/lib

cmake ~/work/path64 \
-DPSC_DYNAMIC_LINKER_x86_64=/libexec/ld-elf.so.1 \

There is discussion about this on the FreeBSD mailing lists here.

That's as far as I've made it - The tests with the compiler start now. I dream of a buildworld with this, but I know that's not going to be easy. For now, I'm going to start with some "Hello World" type programs.


Sunday, June 19, 2011

Windows 7 SP1 brings an increase in Realtek RTL8102/8103 networking performance

One of the items on my To-Do list has been bringing the lab's file transfer speed up to snuff. Samba has never been a very willing component of a fast network, seemingly requiring different tweaking for every different system to get its speed close to Windows' SMB performance. Adding ZFS to the mixture adds difficulty, as anyone who's tried to benchmark ZFS knows.

I did find out a combination that makes Samba fly in my environment, I'll share that shortly in a separate post.

Whilst performing iometer tests across the network, I noticed one of our ASUS-based systems was slower than the others, giving 80 MBs transfers compared to the 100-114 MBs transfers that I was receiving from identical machines. These boards use a Realtek 8102/8103 PCIe on-board network card.

That's when I noticed this machine hadn't applied Windows 7 SP1 yet.

After installation, I'm now receiving the same speed as the rest of the systems in the lab. I'm not sure if it was an updated driver, changes to the network subsystem, or something else - Unfortunately I don't have time to investigate.

However, if you're not receiving full network speed from your Win 7 machine, and you haven't applied SP1 yet, try that first.

Here's my quick test results:

Using iometer 1.1-devel, 128 writers, and the 4k sequential read test. 5 min ramp-up time, and the test runs for 15 minutes. I used a small 4 Meg test-file to remove ZFS from the equation (this small file is quickly cached in RAM, removing the server's disk speed from the test).

Before SP1
19617 IOPS  80 MBs 6.5 ms Avg IO

After SP1
25177 IOPS 103 MBs 5.1 ms Avg IO

(I'm receiving nearly the same speeds for write as well)
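As a sanity check on those two result lines, this sketch converts IOPS at a 4k transfer size into MB/s and works out the gain. It assumes iometer's "MBs" means decimal megabytes (10^6 bytes), which is what makes the reported numbers line up; the function name is just for illustration.

```python
# Cross-check: IOPS at a fixed transfer size should match the MB/s column.
# Assumption: "MBs" in the results is decimal megabytes per second.

BLOCK_BYTES = 4 * 1024  # the 4k transfer size used in the test


def mb_per_s(iops: float, block_bytes: int = BLOCK_BYTES) -> float:
    """Throughput implied by an IOPS figure at a fixed transfer size."""
    return iops * block_bytes / 1_000_000


before_sp1 = mb_per_s(19617)
after_sp1 = mb_per_s(25177)
gain_percent = (after_sp1 / before_sp1 - 1) * 100

if __name__ == "__main__":
    print(f"before SP1: {before_sp1:.1f} MB/s")
    print(f"after SP1:  {after_sp1:.1f} MB/s")
    print(f"gain:       {gain_percent:.0f}%")
```

Both computed throughputs agree with the measured columns, and the improvement comes out to a bit over a quarter.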

That's quite the difference. With some further tweaking, I'm getting my Win 7 SP1 machines to saturate the network link to 99% utilization using a FreeBSD9/Samba35/ZFS28 SAN.

Since we move a lot of data across this network each day (system images, data recovery, etc) it makes for happier technicians.

Friday, May 20, 2011

Followup: ZFS Data Gone

In February I blogged about a nasty data loss event with ZFS.


I've been quite busy since then and haven't followed up on the results. As a few people have been asking if I was able to get the data back, here is my answer:

Yes, I did get most of it back, thanks to a lot of effort from George Wilson (great guy, and I'm very indebted to him). However, any data that was in play at the time of the fault was irreversibly damaged and couldn't be restored. Any data that wasn't active at the time of the crash was perfectly fine, it just needed to be copied out of the pool into a new pool. George had to mount my pool for me, as it was beyond non-ZFS-programmer skills to mount. Unfortunately Solaris would dump after about 24 hours, requiring a second mounting by George. It was also slower than cold molasses to copy anything in its faulted state. If I was getting 1 Meg/Sec, I was lucky. You can imagine that creates an issue when you're trying to evacuate a few TB of data through a slow pipe.

After it dumped again, I didn’t bother George for a third remounting (or I tried very half-heartedly, the guy was already into this for a lot of time, and we all have our day jobs), and abandoned the data that was still stranded on the faulted pool. I copied my most wanted data first, so what I abandoned was a personal collection of movies that I could always re-rip.

I was still experimenting with ZFS at the time, so I wasn't using snapshots for backup, just conventional image backups of the VM's that were running. Snapshots would have had a good chance of protecting my data from the fault that I ran into.

I was originally blaming my Areca 1880 card, as I was working with Areca tech support on a more stable driver for Solaris, and was on the 3rd revision of a driver with them. However, in the end it wasn't the Areca, as I was very familiar with its tricks - The Areca would hang (about once every day or two), but it wouldn't take out the pool. After removing the Areca and going with just LSI 2008 based controllers, I had one final fault about 3 weeks later that corrupted another pool (luckily it was just a backup pool). At that point, the swearing in the server room reached a peak, I booted back into FreeBSD, and haven't looked back.

Originally when I used the Areca controller with FreeBSD, I didn't have any problems with it during the 2 month trial period.

I've had only small FreeBSD issues since then, nothing else has changed on my hardware. So the only claim I can make is that in my environment, on my hardware, I've had better stability with FreeBSD than I did with Solaris.

Interesting Note: One of the speed slow-downs with FreeBSD compared to Solaris from my tests was the O_SYNC method that ESX uses to mount an NFS store. I edited the FreeBSD NFS source to always do an async write, regardless of the O_SYNC from the client, and that perked FreeBSD up a lot for speed, making it fairly close to what I was getting on Solaris.

I'm not sure why this makes such a difference, as I'm sure Solaris is also obeying the O_SYNC command. I do know that the NFS 3 code from FreeBSD is very old, and a bit cluttered - It could just be issues there, and the async hack gives it back speed it loses in other areas.

FreeBSD is now using an NFSv4-capable NFS server by default as of the last month, and I'm just starting my stability tests with a new FreeBSD-9 build to see if I can run newer code. I'll do speed tests again, and will probably make the same hack to the new NFS code to force async writes. I'll post an update when I get this far.

I do like Solaris - After some initial discomfort about the different way things were being done, I do see the overall design and idea, and I now have a wish list of features I'd like to see ported to FreeBSD. I think I'll have a Solaris based box set up again for testing. We'll see what time allows.

Monday, May 9, 2011

Captchas for everyone

I sometimes wonder if there is something wrong with me.

A captcha is a quick little test to make sure that a human, not a bot or script is requesting information from a website.

reCAPTCHA seems to be a very popular captcha that is supported by Google. It's on a lot of the sites that I use, and I can almost never complete the damn thing without multiple tries.

Do we need such cryptic letters that even humans have problems reading them? Can't we use other tricks that require a bit of reasoning instead of very cryptic letters that could be a c, e, a, or o? Maybe two words that are related instead of "word-like" prompts?

Audio really isn't much better. I'm at 5 attempts from the audio prompts, and I was just asked to spell out "Their". Or was it "There"? I don't know, must we use homophones in an audio captcha? Isn't that just toying with the poor user who's trying to decipher these crazy voices? I think this is what insanity must be like.. all these strange sounds swirling around my head, punctuated by occasional nonsense words that don't make sense.

In my mind, the deductive questions I see on some sites are the best. They ask questions like "If Tom has two apples and George has five apples, how many apples does Tom have?" Questions like this are easy to assemble via random strings, and are just as secure against computer login as a crazy jumbled set of letters in a proto-word that no one recognizes.

The current captcha is probably decent for non-English speakers, but then again only if they recognize the Latin alphabet. If they have trouble there, being presented with English audio cues won't help them at all.

I just want to use the site. Can someone come up with a better method?

Sunday, February 6, 2011

ZFS: Data gone, data gone, the love is gone...

Holy %*^&$!

My entire 21 TB array is now "FAULTED" and basically dead to the world.

It's listing the worst type of corrupt data - pool metadata.

How did I do this? I was just rebooting the box, as always. I was working on integrating Solaris 11 Express with my active directory so CIFS would allow for seamless browsing from my Windows boxes. I was partially through it, when I decided to reboot the box and try the commands again - When I did, my pool was no longer intact.

I was resilvering a drive, but I've rebooted many times in that situation, it shouldn't matter. I also was starting to change the hostname from a generic "solaris" to something that made sense in my network, and it was being cranky about that, but nothing odd was going on.

The official response from the ZFS troubleshooting guides is to destroy the pool, make a new one, and restore your data from backup. Sounds like Microsoft's advice for when your Exchange store tanks.

Of course, you know all of my data isn't backed up. Let's not even go there. The most critical stuff is, things that I'd get my ass kicked for - but other data on my second SAN is months old, and then there's the gigs of junk that I've collected over the years that I don't bother backing up because I don't need it, but I want it. Most importantly, I did some pretty decent work this week, and it's all in that SAN ZFS pool, and I really don't want to recreate it.

I KNOW my data is all there. At last count, I had around 12-16 TB of data on these drives, that can't disappear that quickly.

The question is: Which path is faster? Recreating my data, or trying to recover this beast?

Well, you know I'm going to have to try and recover it.

My first day was spent in a light daze (shock maybe) as I researched more about the inner workings of ZFS. Luckily I had already spent a few weekends messing around with trying to port PJD's v28 ZFS patch on FreeBSD to a newer build of FreeBSD, so I had some idea of the code, the vdevs, uberblocks, etc.

My research is leading me to think that my most recent uberblocks were somehow corrupted, leading ZFS to think that I have a destroyed pool.

This is a big issue for ZFS - when those uberblocks don't make sense, the pool won't auto-recover, you've got to dig into things and really mess with the internals of ZFS to get anywhere.

Currently, I'm working with this command;

zpool import -d /mytempdev -fFX -o ro -o failmode=continue -R /mnt 13666181038508963033
..which takes 21 minutes to complete, so it gives you some idea of the slowness of recovering this type of data.

I'll post back if/when I get this recovered.

(May 20th 2011 Followup: http://christopher-technicalmusings.blogspot.com/2011/05/followup-zfs-data-gone.html )

Thursday, January 13, 2011

Testing FreeBSD ZFS v28

I've been testing out the v28 patch code for a month now, and I've yet to report any real issues other than what is mentioned below.

I'll detail some of the things I've tested, hopefully the stability of v28 in FreeBSD will convince others to give it a try so the final release of v28 will be as solid as possible.

I've been using FreeBSD 9.0-CURRENT as of Dec 12th, and 8.2PRE as of Dec 16th

What's worked well:

- I've made and destroyed small raidz's (3-5 disks), large 26 disk raid-10's, and a large 20 disk raid-50.
- I've upgraded from v15, zfs 4, no issues on the different arrays noted above
- I've confirmed that a v15 or v28 pool will import into Solaris 11 Express, and vice versa, with the exception of the dual log or cache device issue noted below.
- I've run many TB of data through the ZFS storage via benchmarks from my VM's connected via NFS, to simple copies inside the same pool, or copies from one pool to another.
- I've tested pretty much every compression level, changing them as I tweak my setup and try to find the best blend.
- I've added and subtracted many a log and cache device, some in failed states from hot-removals, and the pools always stayed intact.

What hasn't worked as well:


- Import of pools with multiple cache or log devices. (May be a very minor point)

A v28 pool created in Solaris 11 Express with 2 or more log devices, or 2 or more cache devices won't import in FreeBSD 9. This also applies to a pool that is created in FreeBSD, is imported in Solaris to have the 2 log devices added there, then exported and attempted to be imported back in FreeBSD. No errors, zpool import just hangs forever. If I reboot into Solaris, import the pool, remove the dual devices, then reboot into FreeBSD, I can then import the pool without issue. A single cache, or log device will import just fine. Unfortunately I deleted my witness-enabled FreeBSD-9 drive, so I can't easily fire it back up to give more debug info. I'm hoping some kind soul will attempt this type of transaction and report more detail to the list.

Note - I just decided to try adding 2 cache devices to a raidz pool in FreeBSD, export, and then importing, all without rebooting. That seems to work. BUT - As soon as you try to reboot FreeBSD with this pool staying active, it hangs on boot. Booting into Solaris, removing the 2 cache devices, then booting back into FreeBSD then works. Something is kept in memory between exporting then importing that allows this to work.  

- Speed. (More of an issue, but what do we do?)

Wow, it's much slower than Solaris 11 Express for transactions. I do understand that Solaris will have a slight advantage over any port of ZFS. All of my speed tests are made with a kernel without debug, and yes, these are -CURRENT and -PRE releases, but the speed difference is very large.

At first, I thought it may be more of an issue with the ix0/Intel X520DA2 10Gbe drivers that I'm using, since the bulk of my tests are over NFS (I'm going to use this as a SAN via NFS, so I test in that environment).

But - I did a raw cp command from one pool to another of several TB. I executed the same command under FreeBSD as I did under Solaris 11 Express. When executed in FreeBSD, the copy took 36 hours. With a fresh destination pool of the same settings/compression/etc under Solaris, the copy took 7.5 hours.

Here's a quick breakdown of the difference in speed I'm seeing between Solaris 11 Express and FreeBSD. The test is Performance Test 6.1 on a Windows 2003 server, connected via NFS to the FreeBSD or Solaris box.  More details are here

Solaris 11 Express snv_151a

903 MBs - Fileserver
466 MBs - Webserver
53 MBs - Workstation
201 MBs - Database

FreeBSD-9.0 Current @ Dec 12th 2010 w/v28 Patch, all Debug off

95 MBs - Fileserver
60 MBs - Webserver
30 MBs - Workstation
32 MBs - Database

Massive difference as you can see. Same machine, different boot drives. That's a real 903 MBs on the Solaris side as well - No cache devices or ZIL in place, just a basic raidz 5 disk pool. I've tried many a tweak to get these speeds up higher. The old v15 could hit mid 400's for the Fileserver test with zil_disable on, but that's no longer an option for v28 pools. I should compile my various test results into a summary and make a separate blog entry for those who care, as I also fiddled with vfs.nfsrv.async with little luck. I took great care to make sure the ZFS details were the same across the tests.
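To put a number on "massive", here is a small sketch that divides the Solaris result by the FreeBSD result for each workload, plus the same ratio for the pool-to-pool copy times mentioned earlier. The figures are the ones listed above; the dictionary names are only for illustration.

```python
# Per-workload speed ratio, Solaris 11 Express vs FreeBSD 9 with v28.

solaris_mb_s = {"Fileserver": 903, "Webserver": 466, "Workstation": 53, "Database": 201}
freebsd_mb_s = {"Fileserver": 95, "Webserver": 60, "Workstation": 30, "Database": 32}

# Ratio above 1.0 means Solaris was faster for that workload.
ratios = {name: solaris_mb_s[name] / freebsd_mb_s[name] for name in solaris_mb_s}

# Pool-to-pool copy: 36 hours on FreeBSD vs 7.5 hours on Solaris.
copy_ratio = 36 / 7.5

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name:12s} {ratio:4.1f}x")
    print(f"{'raw cp':12s} {copy_ratio:4.1f}x")
```

The gap is widest on the Fileserver test and narrowest on Workstation, and the local copy shows that a large part of it remains even with NFS and the network out of the picture.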

9 is faster than 8.2 for speed by a small amount. Between v28 pools and v15 pools there is speed degradation on both 8.2 and 9, but nothing as big as the difference between Solaris and FreeBSD.

I haven't benchmarked OpenSolaris or any type of Solaris older than 11, so I'm not sure if this is a recent speed boost from the Solaris camp, or if it's always been there.

As always, I'm delighted about the work that PJD and others have poured into ZFS on FreeBSD. I'm not knocking this implementation, just pointing out some points that may not be as well known.

Wednesday, January 12, 2011

Areca 1880ix-12 Can Overheat in a Dell T710 Chassis

While the Areca 1880ix-12 doesn't come with any fans mounted on its heatsinks, you may need them.

Areca 1880ix-12

I just overheated one of my Areca 1880ix-12 cards the other night during a rather long 26 disk ZFS copy.

I had the card installed in my Dell T710 4U server, which is located in an air-conditioned data centre, properly rack-mounted, covers on, etc. so the case is able to breathe properly, and ambient temp is maintained at 21 Deg C. The card is mounted with a free slot behind and in front of it, so it's got room for air to move around it.

Dell T610's and 710's use shrouds to direct air over the memory and CPU, but leave the PCIe slots to fend for themselves. As I'm learning, it may be an issue for high performance cards installed in the PCIe slots.

Dell T710 Chassis, with shroud

Normally, I add additional cooling to any heatsink that is uncomfortable to the touch, a trick that has kept hardware failures to a minimum in my servers. Since this was still a fairly new build that was being worked on frequently, I hadn't bothered hooking anything up.

The failure happened overnight when I let my 26 disk ZFS array copy 6-7 TB of data from one location in ZFS space to another.

When I returned in the morning, I found the ZFS copy was hung, Solaris was cranky because it didn't think it had any disks, but was running.

I had left the Areca 1880's Raid Storage Manager web page up during the copy, luckily on the Hardware Monitor page, which auto-refreshes to keep the stats up-to-date. The displayed temperature was 79 Deg C for the CPU controller, the highest I've seen yet.

I couldn't manoeuvre in the web console, and was forced to reboot. Upon reboot, the card wouldn't detect any drives. I powered down, and started a physical check. The heatsinks were extremely hot to the touch.

After letting it cool properly, I was able to boot up without issue, detect drives, and check my ZFS array. Of course, it was fine - ZFS is hard to kill, even with a controller failure of this magnitude.

I installed a quick-fix - an 80 mm case fan blowing directly on the heatsinks of the Areca. I then powered back up and started the same copy again. This time, I watched it closely - My CPU temp never rose above 51 Deg C, and the copy completed in 7.5 hours.

I've contacted Areca Support, and they confirm that the Areca CPU has a max temp of 90 Deg C, so I should be okay under this temp.

It's quite possible that my Areca CPU hit a much higher temp than 79 Deg C - That's the last value displayed on my web console, but there is no guarantee that the web page didn't time out or fail to update before the final lockup.

Video card coolers are probably the best fix for this. I'll either use one that takes up an expansion slot and draws its air from the rear of the chassis, blowing it on the card, or mount some smaller slim video card coolers on the heatsinks.

Sunday, January 9, 2011

SSD Fade. It's real, and why you may not want SSDs for your ZIL

SSD's fade - and they fade quicker than you'd expect.

When I say "fade" I mean that their performance drops quickly with write use.

The owner of ddrdrive is rather active on the forums, and was posting some very interesting numbers about how quickly a SSD can drop in performance. He sells a RAM-drive based hardware device that is actually very suitable for a ZIL, it's just very expensive.

Since my ZFS implementation makes use of 8 OCZ Vertex 2 SSD drives (4x60GB ZIL, 4x120GB L2ARC), I thought I should check into this.

First off, when you test SSDs you have to use one of the beta versions of iometer, so you can enable "Full Random" on the test data stream. OCZ's SSDs (like all others I believe) perform compression/dedup to limit their writes, resulting in inflated numbers if you use easily compressed data.

How inflated? Try write speeds of 4700 IOPS / 2 MBs for random data, then 7000 IOPS / 3.6 MBs for repeating bytes. Nearly double the performance for easy-to-compress data.

With iometer generating random data, I tested out one of my OCZ Vertex 2 60 Gig SSDs that was in use for 4 months as a ZIL device. That's not exactly an ideal use of SSD's, and you'll see why shortly.

My test was iometer 1.1 on a Windows 7 32 bit box, driving 100% random, 100% write to the SSD, with a queue depth of 128. The writes were not 4k aligned, which does make a difference in speed, but for purposes of illustrating SSD fade, it doesn't matter as a 512b or 4k write is affected equally. Currently ZFS uses 512b as a drive sector size, so you're getting this type of speed unless you've tweaked your ashift value to 12 with a tool like ZFSguru.
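In case the ashift numbers look arbitrary: ashift is the base-2 exponent of the sector size ZFS uses for a device, so 9 means 512 bytes and 12 means 4k. A tiny sketch of the conversion (the helper names are made up for illustration):

```python
# ashift is log2 of the sector size ZFS assumes for a vdev.

def sector_bytes(ashift: int) -> int:
    """Sector size in bytes for a given ashift value."""
    return 1 << ashift


def ashift_for(sector: int) -> int:
    """ashift value for a power-of-two sector size in bytes."""
    if sector <= 0 or sector & (sector - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector.bit_length() - 1


if __name__ == "__main__":
    for shift in (9, 12):
        print(f"ashift={shift} -> {sector_bytes(shift)} byte sectors")
```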

Back to the tests:

Fresh from my active ZFS system, I was getting 4300 IOPS / 17.5 MBs

After a secure erase with the OCZ Toolbox 2.22, I was getting 6000 IOPS / 24.5 MBs - That's a 40% increase in IOPS just with a secure erase. Put the other way, my SSD's performance had dropped about 28% in 4 months.
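The same two IOPS figures give different percentages depending on which one is the baseline, which is easy to mix up. A small sketch of the arithmetic, using only the 4300 and 6000 IOPS numbers from the test (the helper names are made up for this example):

```python
def pct_increase(before: float, after: float) -> float:
    """Gain going from 'before' to 'after', relative to 'before'."""
    return (after - before) / before * 100


def pct_drop(fresh: float, worn: float) -> float:
    """Loss going from 'fresh' to 'worn', relative to 'fresh'."""
    return (fresh - worn) / fresh * 100


worn_iops = 4300   # straight out of the ZFS box after 4 months
fresh_iops = 6000  # after the secure erase

print(f"gain from secure erase: {pct_increase(worn_iops, fresh_iops):.1f}%")
print(f"fade over 4 months:     {pct_drop(fresh_iops, worn_iops):.1f}%")
```

The gain is measured against the worn figure and the fade against the fresh one, so the gain comes out near 40% while the fade is closer to 28%.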

Curious to see how quickly the SSD will fade, I set iometer up to run the random write for 8 hours. When I came back and ran the test again, I was shocked to see speed down to 910 IOPS / 3.75 MBs. Wow, we're getting close to the territory of a 15k 2.5" SAS drive (around 677 IOPS / 2.77 MBs).
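The MBs figures in this run of tests line up with the IOPS counts if each I/O moves 4 KiB and MB means 10^6 bytes. The transfer size isn't stated in the test description, so treat the 4096 below as an assumption inferred from the numbers, not a documented setting.

```python
TRANSFER_BYTES = 4096  # assumption: 4 KiB per I/O, inferred from the figures


def throughput_mb_s(iops: float, transfer_bytes: int = TRANSFER_BYTES) -> float:
    """Convert an IOPS figure to MB/s (decimal megabytes)."""
    return iops * transfer_bytes / 1_000_000


results = [
    ("fresh from ZFS", 4300),
    ("after secure erase", 6000),
    ("after 8 hours of writes", 910),
    ("15k SAS reference", 677),
]

for label, iops in results:
    print(f"{label:>24}: {iops:5d} IOPS ~ {throughput_mb_s(iops):.2f} MB/s")
```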

I then did a secure erase of the drive and tested again - Speed was back up, but not as good as what I originally had. I was now obtaining 4000 IOPS / 16 MBs - Worse than when I first started. 8 hours of hard writes used up my SSD a bit more.

Why did 8 hours of writing break down my SSD faster than 4 months of life as a ZIL? Well, I was using 4 of these 60 gig SSDs as one big ZIL device, and the ZIL doesn't consume much space, so the # of GB written over these 4 SSDs in the course of 4 months wasn't nearly as bad as my 8 hour test on one device.
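To put a rough number on the 8 hour test: the write rate during the run wasn't logged, but it started around 24.5 MBs and ended around 3.75 MBs, so those two figures can serve as loose upper and lower bounds. The sketch below assumes the rate stayed between them the whole time and that the drive holds 60 GB (decimal); both are assumptions made for illustration.

```python
HOURS = 8
DRIVE_GB = 60  # assumption: nominal capacity in decimal gigabytes


def gb_written(mb_per_s: float, hours: float) -> float:
    """Total data written at a constant rate, in decimal gigabytes."""
    return mb_per_s * 3600 * hours / 1000


low = gb_written(3.75, HOURS)   # if the drive ran at its faded speed throughout
high = gb_written(24.5, HOURS)  # if it somehow held its fresh speed throughout

print(f"written: between {low:.0f} and {high:.0f} GB")
print(f"full-drive overwrites: between {low / DRIVE_GB:.1f} and {high / DRIVE_GB:.1f}")
```

Even the low bound is more than one full pass over a single 60 gig drive in a working day, which helps explain why the short test hurt more than months of ZIL traffic spread over four devices.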

Interestingly, I also have 4 120 Gig Vertex 2 drives for my L2ARC cache. When I pulled and tested these drives, they were far less degraded - The L2ARC is designed to be very conservative on writes, and is mostly for reads. An MLC SSD here works well.

Should you use SSDs for ZIL? You can, but understand how they will degrade.

Your best bet is a RAM based device like a ddrdrive, or similar. There are other slower, cheaper RAM drives out there you could use. These will not degrade with use, and the ZIL gets a lot of writes - Nearly everything written to your pool is written to the ZIL.

If you don't have one of these, your other option is to use lots of devices like I did to balance the wear/fade. Pull them every 6 months for 15 min of maintenance, and you'll maintain decent speeds.

For now, my 4 60 gig SSDs are going back into ZIL duty until I can decide if a ddrdrive (or a bunch of cheaper ddr devices) is in my future.

Solaris 11 Express - Faster Speed for SAN-NFS-VM than FreeBSD

I love FreeBSD. It's been my go-to OS since I first switched to it from BSDi many many years ago. However, I hate to stick with something just because I've always used it. Technology changes too quickly to make lifelong bonds. While I prefer FreeBSD, I ultimately will choose what gets the job done the best.

ZFS under FreeBSD has always progressed nicely, but it is obviously behind Solaris, seeing how Sun/Oracle are the creators of ZFS. The ZFS v28 patch is currently being tested for FreeBSD-8.2 and 9, but not released yet.

While I realize these are pre-releases, I wanted to test the performance I could expect from FreeBSD vs Solaris for my SAN that I'm building. Once I put this SAN into regular use, I won't easily be able to switch to another OS, or do a lot of tweaking with it.

This is not just a test of ZFS implementations - It tests the hard disk subsystem, memory, NFS, ethernet drivers, and ZFS to give me a combined test result, which is what I really care about - How fast can my SAN accept and serve up data to my ESXi boxes.

I've separately been benchmarking FreeBSD 8.2-PRE and 9.0-CURRENT with both the v28 ZFS patch, and with their native v15 ZFS implementations. I'll make that a separate blog entry a bit later on, but for now, I'm choosing the fastest performer from the v28 portion of those tests, which was FreeBSD 9.0-CURRENT.

My test environment is as follows:

ESX Box: Dell PowerEdge T710 w/96Gig
SAN Box: Dell PowerEdge Y710 w/24Gig
Network: Intel X510DA2 10Gbe Ethernet in each box, direct attached
SAS Card: Areca 1880ix w/4G Cache
Enclosure: SuperMicro SC847-JBOD2 (Dual 6Gbps Backplanes)
Drives: Seagate 1.5 TB SATA, 5 disk raidz, no log or cache
Test Software: Performance Test 6.1 run on a Windows 2003 VM

Solaris 11 Express snv_151a
903 MBs - Fileserver
466 MBs - Webserver
53 MBs - Workstation
201 MBs - Database

FreeBSD-9.0 Current @ Dec 12th 2010 w/v28 Patch, all Debug off
95 MBs - Fileserver
60 MBs - Webserver
30 MBs - Workstation
32 MBs - Database

Wow. Notice the difference? I didn't believe it myself at first, but I tried it again and again, I watched the packet traffic across the Intel card with iftop, and then I ran separate benchmarks with iometer just to make sure it wasn't something silly in Performance Test. I always received similar results - Solaris 11 Express can move some serious data in comparison to FreeBSD 9.0.

At this stage, I don't know if it's a problem with the Intel driver holding back FreeBSD or not. You'll notice that the Workstation test is the only one that is comparable between FreeBSD and Solaris. It's made up of a lot of sync random reads/writes, which bring any ZFS implementation to its knees.
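To make the gap concrete, here are the per-test ratios worked out from the two result lists above. This is just arithmetic on the published MBs figures; the dictionary and variable names are made up for the example.

```python
# MBs results from the Performance Test runs listed above
solaris = {"Fileserver": 903, "Webserver": 466, "Workstation": 53, "Database": 201}
freebsd = {"Fileserver": 95, "Webserver": 60, "Workstation": 30, "Database": 32}

# Ratio of Solaris throughput to FreeBSD throughput for each test role
speedups = {test: solaris[test] / freebsd[test] for test in solaris}

for test, ratio in speedups.items():
    print(f"{test:<12} {ratio:4.1f}x")

closest = min(speedups, key=speedups.get)
print(f"closest result: {closest}")
```

Workstation comes out as the closest pairing at under 2x, while the other three roles land somewhere between roughly 6x and 10x.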

The thing is: I don't care if it's the Intel driver, the memory sub-system, or whatever - In the end, I need the best performance delivered to my ESXi boxes, and Solaris 11 Express can deliver where FreeBSD isn't ready just yet.

I do fully expect FreeBSD to catch up, but for now, I'm strongly considering spending the $$ on a Solaris 11 Express license so I can run it as my SAN OS. Solaris' easy CIFS implementation is another bonus for my SAN - it was delivering 78 MBs speeds when my FreeBSD/samba implementation was at 35 MBs.

Thursday, January 6, 2011

Areca 1880 Firmware v1.49 improves speed in Web Management, Solaris 11

I just applied the 1.49 firmware update for my Areca 1880, and the speed difference for user interface interaction is noticeable.

You'll have to download it from their Taiwan FTP site, which is listed on their support page.

The web console is now quite snappy, where before it was painful. In Solaris 11 Express, I'm able to list my 26+ drives without delay, where before it would take 5-10 seconds to give me a full drive inventory, which was leading to some serious delays in zpool status commands.

When applying your update via the web console, make sure to apply all 4 files individually. It's not smart, it won't ask for the other files, but it does need them if you want to be fully updated across the board.

I ran a set of iometer tests before and after my upgrade, and I can't detect any change in drive speed, just user interface speed.

Tuesday, January 4, 2011

Partition Alignment with ESX and Windows 2003. Just do it.

You really want to make sure your partitions are aligned.

Partition Alignment, in a nutshell, is making sure your partitions start on Sector # 2048 (or some multiple of 512) instead of the standard Sector # 63.

There are some great blog entries that define the issue, so I won't get into it here.
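As a minimal illustration of the sector arithmetic only: a partition's starting byte offset is its start sector times the sector size, and the partition is aligned when that offset falls on a block boundary. The 4 KiB boundary below is an assumption chosen for the example; the boundary that really matters depends on the storage underneath.

```python
SECTOR_BYTES = 512     # classic disk sector size
BOUNDARY_BYTES = 4096  # assumption: align to 4 KiB blocks


def is_aligned(start_sector: int,
               sector_bytes: int = SECTOR_BYTES,
               boundary_bytes: int = BOUNDARY_BYTES) -> bool:
    """True if the partition's starting byte offset sits on the boundary."""
    return (start_sector * sector_bytes) % boundary_bytes == 0


for sector in (63, 2048):
    offset = sector * SECTOR_BYTES
    print(f"start sector {sector:4d} -> byte offset {offset:7d}, aligned: {is_aligned(sector)}")
```

Sector 63 starts at byte 32256, which is not a multiple of 4096, so each 4 KiB block in the guest straddles two blocks underneath. Sector 2048 starts at exactly 1 MiB and lines up cleanly.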

While you don't need to do this for 2008 partitions (they start properly aligned), I've always known I should align my older 2003 partitions, but I haven't bothered. Today I decided to do a simple test to see the results of running un-aligned.

Test Details:
OS: Windows 2003 32bit
Test Package: Performance Test 6.1, running Webserver disk role benchmark
VM: ESXi 4.1
Drives: NFS share to a Solaris 11 ZFS box.

I chose the webserver role because it's a heavy reader (writes are harder to benchmark on ZFS), and because it's my worst performer. The other roles are hitting 900-1200 MB/sec, which is close to the max I can drive through my 10Gbe card. 

Aligned Partition: 609 MB/sec
Non-Aligned Partition: 436 MB/sec

That's a 40% increase in read speed from aligning (or, put the other way, about 28% thrown away by not aligning). Wow, that's a lot of speed to lose for not aligning a partition.

Here's a link to a quick little tool that will let you know within Windows if your partition is aligned: http://ctxadmtools.musumeci.com.ar/VMCheckAlign/VMCheckAlignment10.html

Of course, it can be a pain in the ass to align a partition. You can use a livecd of gparted, or create a new disk and partition it with Windows 2008, copying the data to the new partition from a separate VM.

Some partition tools are now including the ability to quickly and easily specify an aligned-partition start, so hopefully this will be an easy task in the near future. If you know the name of any easy booting GUI type partition tools that can move a Windows partition to sector 2048 without much fuss, please let me know and I'll post it here.

I personally do it with gparted live, but I'm also not afraid of the command line. Something easier would be nice.