Sunday, January 9, 2011

SSD Fade. It's real, and why you may not want SSDs for your ZIL

SSDs fade - and they fade quicker than you'd expect.

When I say "fade" I mean that their performance drops quickly with write use.

The owner of ddrdrive is rather active on the forums, and has been posting some very interesting numbers about how quickly an SSD can drop in performance. He sells a RAM-based hardware device that is very well suited to ZIL duty; it's just very expensive.

Since my ZFS implementation makes use of 8 OCZ Vertex 2 SSD drives (4x60GB for the ZIL, 4x120GB for the L2ARC), I thought I should check into this.
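For reference, attaching a layout like that only takes two zpool commands. This is just a sketch - "tank" and the device names below are placeholders for my actual pool and controller paths:

  zpool add tank log c1t0d0 c1t1d0 c1t2d0 c1t3d0     # 4x 60GB SSDs, striped, as the ZIL
  zpool add tank cache c2t0d0 c2t1d0 c2t2d0 c2t3d0   # 4x 120GB SSDs as L2ARC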

First off, when you test SSDs you have to use one of the beta versions of iometer so you can enable "Full Random" on the test data stream. OCZ's SSDs (like all others, I believe) perform compression/dedup to limit their writes, resulting in inflated numbers if you feed them easily compressed data.

How inflated? Try write speeds of 4700 IOPS / 2 MB/s for random data versus 7000 IOPS / 3.6 MB/s for repeating bytes - nearly double the throughput for easy-to-compress data.

With iometer generating random data, I tested one of my OCZ Vertex 2 60 GB SSDs that had been in use for 4 months as a ZIL device. That's not exactly an ideal use of SSDs, and you'll see why shortly.

My test was iometer 1.1 on a Windows 7 32-bit box, driving 100% random, 100% write traffic at the SSD with a queue depth of 128. The writes were not 4k aligned, which does make a difference in speed, but for purposes of illustrating SSD fade it doesn't matter, as 512b and 4k writes are affected equally. Currently ZFS uses 512b as the drive sector size, so this is the kind of speed you're getting unless you've tweaked your ashift value to 12 with a tool like ZFSguru.
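If you're curious what your pool is actually using, zdb will report the ashift of each vdev - on most systems, running it with no arguments dumps the cached pool config, and an ashift of 9 means 512b sectors while 12 means 4k:

  zdb | grep ashift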

Back to the tests:

Fresh from my active ZFS system, I was getting 4300 IOPS / 17.5 MB/s.

After a secure erase with OCZ Toolbox 2.22, I was getting 6000 IOPS / 24.5 MB/s - a 40% increase in IOPS just from a secure erase. Put another way, 4 months of ZIL duty had cost the drive close to 30% of its freshly-erased performance.

Curious to see how quickly the SSD would fade, I set iometer up to run the random write workload for 8 hours. When I came back and ran the test again, I was shocked to see speed down to 910 IOPS / 3.75 MB/s. Wow - we're getting close to the territory of a 15k 2.5" SAS drive (around 677 IOPS / 2.77 MB/s).

I then did a secure erase of the drive and tested again. Speed was back up, but not as good as what I originally had: I was now getting 4000 IOPS / 16 MB/s - worse even than the 4300 IOPS the worn drive managed before the first erase. Eight hours of hard writes had used up my SSD a bit more.

Why did 8 hours of writing break down my SSD faster than 4 months of life as a ZIL? Well, I was using 4 of these 60 GB SSDs as one big ZIL device, and the ZIL doesn't consume much space, so the number of gigabytes written to each of these 4 SSDs over the course of 4 months wasn't nearly as large as what my 8-hour test pushed through one device.
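(Back-of-envelope: even if the average over those 8 hours was only 10 MB/s - it started near 24 MB/s and ended under 4 MB/s - that's still roughly 280 GB of writes, or four to five complete passes over a 60 GB drive, all landing on a single device.)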

Interestingly, I also have 4 120 GB Vertex 2 drives for my L2ARC cache. When I pulled and tested these drives, they were far less degraded - the L2ARC is designed to be very conservative with writes and is mostly used for reads. An MLC SSD works well here.

Should you use SSDs for ZIL? You can, but understand how they will degrade.

Your best bet is a RAM-based device like a ddrdrive, or similar. There are other slower, cheaper RAM drives out there you could use. These will not degrade with use, and the ZIL gets a lot of writes - nearly everything written to your pool is written to the ZIL.

If you don't have one of these, your other option is to use lots of devices like I did to balance the wear/fade. Pull them every 6 months for 15 minutes of maintenance, and you'll maintain decent speeds.
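In ZFS terms that maintenance cycle is straightforward. Here's a sketch, assuming a pool named "tank", a placeholder device name, and a pool version of 19 or later (when log device removal arrived):

  zpool remove tank c1t0d0    # pull one log SSD out of the pool
  # secure erase it with the vendor's tool (OCZ Toolbox in my case)
  zpool add tank log c1t0d0   # put it back on ZIL duty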

For now, my 4 60 GB SSDs are going back into ZIL duty until I can decide whether a ddrdrive (or a bunch of cheaper DDR devices) is in my future.

5 comments:

  1. "Nearly everything written to your pool is written to the ZIL." -- from what I understand of ZFS, as well as based on my observations, this statement is wrong.

    As far as I am aware, the ZIL only matters for sync writes, few of which should normally take place on a system. Ordinary async writes bypass the ZIL completely.

    Of course, with your VMs you may get many sync writes (I don't know), but IMO you should then qualify this statement.

  2. I will have to post on this again shortly, because there seem to be a lot of misconceptions about what the ZIL does.

    The statement I made about the ZIL is true for my pool, perhaps I shouldn't have said "your pool" but instead "my pool".

    However, I believe the only thing that bypasses the ZIL is a large sync write - that is sent directly to the disks to keep performance high. Otherwise, it all goes through the ZIL, including random writes, because these are normally waiting to be written in a transaction group, and we can't lose them if we lose power.

  3. The way I understand it, at most the metadata updates related to random async writes go through the zil (and I'm not even certain about that).

    Async writes can definitely be lost when the system loses power. Just think about it: the async write operation returns before the data has been committed to non-volatile storage. It logically follows that there is a time window for it to be lost before it does get committed.

    (Note that sync vs. async is orthogonal to sequential vs. random -- your reply seems to imply some confusion in that regard.)

    See e.g. here: http://constantin.glez.de/blog/2010/07/solaris-zfs-synchronous-writes-and-zil-explained

  4. Hi Christopher,
    Would you tell me how you "did a secure erase of the drive"?
    I tried formatting the SSD for ZIL and re-enabling the ZIL, but there seemed to be no obvious improvement.

    Replies
    1. Hi subdragon,

      Since I use OCZ SSDs, I use the OCZ Toolbox to perform my secure erase of the drives.
