Tuesday, July 12, 2011

ZFS Dedup Performance - Real World Experience After One Year

I've been running ZFS with deduplication on OpenSolaris or FreeBSD for about a year now in a production environment. I manage ESX VMs, mostly running various flavours of Windows Server.

In that year, my thoughts on where to use dedup have changed.

My big ZFS pool is 28.5 TB in a basic mirror, currently showing 11.1 TB used and running a dedup ratio of 1.25x (the data would take about 25% more space without dedup), a savings of around 2.75 TB. This machine runs 24 Gig of RAM and has an L2ARC made up of four 120 Gig SSDs, in addition to two RAM-based ZIL devices.
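
As a quick sanity check on those numbers, here is a back-of-envelope calculation. It assumes the "used" figure is the space allocated on disk after dedup, which is my reading of the figures rather than something the pool output spells out. Note that a 1.25x ratio means the logical data is 25% larger than what is on disk, which works out to about 20% of the logical data being stored for free.

```python
# Back-of-envelope check of the pool figures quoted above.
# Assumption: "used" is the space allocated on disk after dedup.
allocated_tb = 11.1   # space actually used on disk
dedup_ratio = 1.25    # ratio reported for the pool

logical_tb = allocated_tb * dedup_ratio   # size the data would be without dedup
saved_tb = logical_tb - allocated_tb      # space that dedup is saving
saved_fraction = 1 - 1 / dedup_ratio      # share of the logical data not stored

print(f"logical size : {logical_tb:.1f} TB")
print(f"space saved  : {saved_tb:.1f} TB")
print(f"saved share  : {saved_fraction:.0%} of logical data")
```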

I have two smaller production units, holding 20.4 TB in a raidz with 12 Gig of RAM, and 16.4 TB in a raidz with 24 Gig of RAM. These secondary systems may or may not have hardware RAM ZILs and L2ARC SSD caches, as I continually move components around in my secondary equipment to find a sweet spot between $$ for hardware and performance.

Back to the thought at hand..

Dedup obviously has to have some sort of performance penalty. It simply takes more time to check if you're writing a unique block compared to writing without the check.

Checking the uniqueness of the block involves looking at every block's hash in the pool (even if the fs inside the pool isn't participating in dedup) - A task that increases write execution time as your pool increases in size.

In a perfect world, your system will have enough RAM that the Dedup Table (DDT) will reside entirely in RAM. This means that it only requires traversing the DDT in RAM to find the uniqueness of the block being written.
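
To make the lookup cost concrete, here is a toy model of a dedup write path. It is only an illustration of the idea, not how ZFS is implemented internally: the class name, the use of SHA-256, and the Python dict standing in for the DDT are all assumptions made for the sketch. The point is that every single write has to consult the table, so a table that lives in RAM is cheap to check, while a table that has spilled to disk turns each check into extra I/O.

```python
import hashlib


class ToyDedupStore:
    """Toy model of a dedup write path (hypothetical, not ZFS internals)."""

    def __init__(self):
        self.ddt = {}            # checksum -> reference count; stands in for the DDT
        self.blocks_written = 0  # blocks that actually had to go to "disk"

    def write_block(self, data: bytes) -> bool:
        """Return True if the block was written, False if it was a duplicate."""
        key = hashlib.sha256(data).digest()
        if key in self.ddt:      # the lookup every write has to pay for
            self.ddt[key] += 1   # duplicate: just bump the reference count
            return False
        self.ddt[key] = 1        # unique: record it and write the block
        self.blocks_written += 1
        return True


store = ToyDedupStore()
for block in (b"a" * 4096, b"b" * 4096, b"a" * 4096):
    store.write_block(block)

print(f"{store.blocks_written} blocks written, {len(store.ddt)} table entries")
```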

In the real world, this is nearly impossible. The way the ARC/L2ARC is designed, there isn't a preference for the DDT data - it's simply thought of as metadata, and that competes with user data in the cache system of ZFS. You may think that setting primarycache and/or secondarycache to metadata will fix the problem, but once you switch the primary cache to metadata, you also doom the secondary cache to only hold metadata - Nothing can get into the L2ARC unless it exists in the ARC.

In my experience with a busy pool, enough of your DDT is eroded away by user data writes that dedup performance starts to become a serious issue. These issues don't show up on a fresh system without a lot of data in the pool. You need an older, loaded, very fragmented pool to really notice the drop in performance.

How bad? Here's a quick and dirty iometer check on a fs with dedup turned on, and one with it off. This is on my big pool, with plenty of L2ARC and RAM.

Dedup On:  197 IOPS, 0.81 MB/s, 647 ms AvgIO
Dedup Off: 4851 IOPS, 19.87 MB/s, 28 ms AvgIO

See the difference? This is not a 2x or 4x slowdown of my writes; with dedup on, we're in the 20x-slower category compared to running with it off.
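
For anyone who wants the exact ratios, the arithmetic on the two iometer results above is below. The dictionary keys are just labels chosen for this sketch. All three measures land between 23x and 25x.

```python
# Ratios between the two iometer runs quoted above.
dedup_on = {"iops": 197, "mb_per_s": 0.81, "avg_ms": 647}
dedup_off = {"iops": 4851, "mb_per_s": 19.87, "avg_ms": 28}

iops_slowdown = dedup_off["iops"] / dedup_on["iops"]
throughput_slowdown = dedup_off["mb_per_s"] / dedup_on["mb_per_s"]
latency_increase = dedup_on["avg_ms"] / dedup_off["avg_ms"]

print(f"IOPS       : {iops_slowdown:.1f}x lower with dedup on")
print(f"Throughput : {throughput_slowdown:.1f}x lower with dedup on")
print(f"Latency    : {latency_increase:.1f}x higher with dedup on")
```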

So we know it's going to be hard to keep the DDT in RAM, which means there will be a performance hit. What about the large savings in disk writes (you don't write the block if it already exists), and the performance increase that should come with them?

It's simply not there in my systems, and probably not in yours.

Additionally the idea that these dedup blocks could reside in memory cache, allowing multiple systems to take advantage of the same blocks is a good one, but it looks like there is still enough erosion of the caches that this doesn't make a difference in my systems.

If I had a higher dedup ratio, then maybe this would start working out better for me - But even with 40+ VM's of mostly the same stuff, I'm only running a 1.25x dedup ratio on my big pool - I have enough dissimilar data to make dedup impractical on this system.

I'm now going to turn dedup off for nearly all of my production boxes. 

However, turning dedup off doesn't de-dedup your data - it stays deduplicated until fresh data overwrites it.

This provides a 'poor-man's offline dedup' trick. Leave dedup off during the day when you need performance, then enable via a script, and copy all of the data in place to *.tmp and then rename to drop the *.tmp when done - You just deduplicated all your data, without the performance hit during the writes in the daytime when I assume your systems would be the busiest.
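
A minimal sketch of the copy-and-rename step, assuming the files are not in use while it runs (so not a live VM disk), that there is enough free space for a second copy of the largest file, and that dedup has already been switched on for the dataset before the script starts. The function name and the dataset name in the comments are made up for illustration. Keep in mind that snapshots will continue to hold the old blocks, and that hard links are not preserved by this approach.

```python
import os
import shutil


def rewrite_in_place(root: str) -> int:
    """Copy each regular file under root to <name>.tmp, then rename it back.

    Rewriting the data forces it through the write path again, which is
    where dedup is applied. Returns the number of files rewritten.
    """
    rewritten = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".tmp"):
                continue  # skip leftovers from an interrupted earlier run
            src = os.path.join(dirpath, name)
            if os.path.islink(src) or not os.path.isfile(src):
                continue  # only rewrite regular files
            tmp = src + ".tmp"
            shutil.copy2(src, tmp)  # new blocks are written here
            os.replace(tmp, src)    # rename over the original
            rewritten += 1
    return rewritten


# Intended use from a nightly job (dataset name is hypothetical):
#   zfs set dedup=on tank/vmstore
#   ...run rewrite_in_place("/tank/vmstore")...
#   zfs set dedup=off tank/vmstore
```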

I'm going to leave dedup enabled on my storage and backup archives - places where I have a lot of duplication, and not a lot of need for speed. Everywhere else, I'm just going to run compression (I find compression=gzip to be a good blend of storage and speed on my systems). I'm averaging close to 2x on my compression alone.

For dedup to work well in a production environment, you're going to need an extreme amount of RAM and L2ARC compared to your drive storage size. I don't have any figures for you, but it's probably cheaper to provision more storage than it would be to try and dedup the data.
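
For a very rough feel of the scale involved, the sketch below estimates how much RAM the DDT would need. It leans on a commonly quoted rule of thumb of about 320 bytes of memory per DDT entry, and it assumes every block in the pool is unique and has the same size. Neither assumption was measured on my systems, so treat the output as an order-of-magnitude guide only; the function name and its parameters are made up for this example.

```python
def ddt_ram_gib(data_tib: float, block_kib: float, bytes_per_entry: int = 320) -> float:
    """Estimate in-memory DDT size in GiB.

    Assumptions: one DDT entry per block, all blocks unique and the same
    size, and roughly bytes_per_entry bytes of RAM per entry (rule of thumb).
    """
    blocks = data_tib * 2**40 / (block_kib * 2**10)
    return blocks * bytes_per_entry / 2**30


# 11.1 TB of data at a few plausible average block sizes.
for block_kib in (128, 64, 8):
    estimate = ddt_ram_gib(11.1, block_kib)
    print(f"{block_kib:>3} KiB blocks: about {estimate:.0f} GiB of DDT")
```

Smaller blocks give dedup more chances to match, but the table grows in direct proportion to the block count, which is why a pool full of 8K zvol blocks is so much harder to keep in RAM than one full of 128K records.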

What ZFS needs is an offline dedup ability, priority settings for the DDT in ARC/L2ARC, and some general tuning of the ARC/L2ARC to allow for situations like "DDT and Metadata only in RAM, everything else allowed in L2ARC".

..at least for it to be viable in my environment, which is the problem - We all have different environments with different needs.


  1. Hi, i got the same problem and i'm gonna try this http://ifreedom.livejournal.com/84129.html
    Hope it could help

  2. I'm about to decide between Solaris 11 Express and FreeBSD 9.0 (when it will be a full release). What do you recommend? It will be used for home data and home lab with VMware/XEN/etc.

    I'm really thinking about what you mentioned a few months back about the performance difference between the two OSes.

    By the way, great blog!!! I'm always keeping an eye on your blog...


  3. I'd recommend FreeBSD over Solaris, mostly because you're more likely to have supported hardware and forum support with it than with Solaris.

  4. ...so FreeBSD will be! ;-)
    I was leaning towards FreeBSD as well because I've liked it for years, but I wanted to hear your opinion too.

    Thanks a mill!!!

  5. Dunno, FreeBSD hardware support isn't that expansive either, so either system you choose (Solaris/FreeBSD), you still have to choose your hardware carefully. So then I'd say it might be best to simply take Solaris.

    My home server uses 4 Intel 1000 Desktop NICs, 2 IBM 1015 controllers with IT firmware (which makes them LSI 9211 ITs), 2 HP SAS Expanders (they don't really count since they are only visible to a system through the SAS bus), 8GB of RAM, an Intel Q6600 and a metric ton of 320GB SATA drives. All supported by Solaris (and FreeBSD).

    As for FreeBSD's ZFS support, the z-version will always be way more advanced and bugfixed better on Solaris compared to FreeBSD, not forgetting the chances FreeBSD being able to keep up with versions beyond 28 is not looking very bright (even the opensolaris branches seem stuck at 28).

    And for this article, the author tells us that he is having rather crappy dedup ratio's with 40+ VM's, I'm wondering what block size he has in use and if the VM's were created from the same master image or random installs.

    On one of Oracles blogs it states:

    Dedup works on a ZFS block or record level. For a iSCSI or FC LUN, i.e. objects backed by ZVOL datasets, the default blocksize is 8K. For filesystems (NFS, CIFS or Direct Attach ZFS), object smaller than 128K (the default recordsize) are stored as a single ZFS block while objects bigger than the default recordsize are stored as multiple records Each record is the unit which can end up deduplicated in the DDT. Whole Files which are duplicated in many filesystems instances are expected to dedup perfectly. For example, whole DB copied from a master file are expected to fall in this category. Similarly for LUNS, virtual desktop users which were created from the same virtual desktop master image are also expected to dedup perfectly.

    From this I read that if you start 1000 virtual desktops from copies of the same master image, the data from the master image will be completely deduped. But if you were to install 1000 VMs from scratch, good deduplication wouldn't happen.

    Besides that, I'd also guess that deduping works better with smaller block sizes, since there is more chance that a file split into 10000 4K blocks has dedupable blocks than a file split into 313 128K blocks.

    I'm currently testing with various block sizes on a storage system and will report back with my results. The first thing I found is that with a small 32GB ZIL, 8GB of RAM and 32GB of L2ARC, and with super random data (some OS images and mostly DVD/BD rips as test data), ZFS is already complaining that my DDT is growing too large after copying only 200GB. So yes, the dedup table will of course grow with smaller block sizes: 32 times more blocks will mean 32 times more 32 Byte records.

    I'm guessing I'm on the hunt for a sweetspot here. Where the growth in DDT size doesn't outweigh the better deduplication in both IO and size.

  6. Hi Aspergeek:

    Few things:

    1) ZFS on Oracle will always be the best, because they are designing the closed-source product. FreeBSD v28 is nearly as good as OpenIndiana, but all of the free ZFS implementations are currently stalled while they try and figure out where to go post-Oracle.

    I think they are wasting their time waiting for Oracle to do anything. Oracle is currently suing Google for using Java in their Android platform. Oracle isn't a friendly place to open-source at all.

    2) Dedup and Clones. Yup, doing a proper VM or ZFS clone is the best way to achieve excellent storage consolidation, but it's not something that works well in our environment. We have 3 large storage devices on our SAN, and a VM may have to be moved from one device to another in the SAN for various reasons. As soon as we move outside of one ZFS pool, ZFS cloning is of no use to us. There's more luck with linked clones in ESX, but that's not a perfect fit either.

  7. I have exactly the same performance metrics (~20MB/s with dedup off and <1MB/s with it on).

    I am not sure what the bottleneck is though. When writing to a dedup volume, the CPU and IO wait is not high at all when looking on top.

    I guess I need to use a better tool(s) to find out what the bottleneck is. From what I understand, it would be the disks due to the DDT not fitting in RAM and thus on disk.


  8. Unknown:

    Probably lack of RAM.

    How much RAM do you have in comparison to the size of data in the pool right now? The DDT is quite large.

    I'd love to see a cache tweak where the DDT takes precedence for memory cache, but I just don't have the time, and others are working on other ZFS features.

    However, if it is lack of RAM for DDT, then you should see good read speed, and bad write speed. You only need to traverse the DDT when you write. Then again, the read cache may be non-existent due to the large DDT pushing it out of memory.

  9. I have 8GB of ECC DDR2 memory in this machine.

    free -m
                        total   used   free shared buffers cached
    Mem:                 7981   7930     50      0      41    926
    -/+ buffers/cache:          6962   1018
    Swap:                2303     25   2278

    When you ask about data in the pool, do you mean used space across all volumes (dedup'd or not)?

    My dedup'd volumes are as follows:

    Filesystem       Size  Used  Avail  Use%  Mounted on
    zfs/ISO           50G  7.0G    44G   14%  /zfs/ISO
    zfs/VZ-backups   250G  197G    54G   79%  /zfs/VZ-backups
    zfs/both         2.1T  691M   2.1T    1%  /zfs/both
    zfs/dedup        2.2T  3.9G   2.1T    1%  /zfs/dedup
    zfs/homes        460G  389G    72G   85%  /zfs/homes
    zfs/vz           199G   26G   173G   13%  /zfs/vz

    "dedup" and "both" were 2 test volumes I created for testing.

    I added a 30GB SSD as a cache drive hoping it would hold the DDT but it didn't make a difference to the performance. I am wondering if I need to play with the primarycache and secondarycache settings?

    Thanks for the response Christopher.

    I'll check out some read speed testing.

    1. Yes, when I ask about space in the pool, it's for all volumes deduped or not. It all adds up the same.

      8Gig isn't enough for Dedup at the size you have - You're going to suffer from the DDT not existing in memory, which means each write needs to access the disk to load the rest of the DDT into memory before it can confirm that the block you're about to write is unique or a duplicate.

      If you want Dedup and speed, you need RAM - Lots of it. I gave up on Dedup on a machine with 96 Gigs of RAM - However, it was running a pool that was 20 TB in size..

  10. I tried reading a 2.5GB file from a dedup volume to /dev/null
    After getting 500MB /s the first time, I tried another file and got:

    2.53GB 0:00:21 [ 120MB/s]

    So yes, my read speeds are quite good.
    I think I need more tweaking of the cache SSD drive.

    1. Hmm, 120MB/sec is about the speed of a single, decent SATA3 drive. Your pool should be able to do better than that.

  11. Thanks Christopher.

                        total   used   free shared buffers cached
    Mem:                 7981   7692    289      0     129    404
    -/+ buffers/cache:          7157    823
    Swap:                2303     74   2229

    zfs 4.06T 1.50T 2.56T 37% 1.77x ONLINE -

    Looking at "zpool status -D zfs" I have 951MB of DDT in memory and 1815MB of DDT on disk (L2ARC?). I have a 30GB SSD as the pool's log drive. I thought that would speed things up, as I would think 30GB would be enough to hold the DDT.

    Any thoughts appreciated.

    1. The problem with trying to cache the DDT is that zfs's cache system doesn't place the DDT at a higher priority than the standard read cache. This means that regular file access can push the DDT out of memory on to disk.

      I've proposed a switch in the past that forces the DDT to stay in memory above everything else, as performance will suffer the instant any part of the DDT is on disk - However, I just don't have the time/funding to write it myself (as much as I would like to), and others are busy with other ZFS issues.

      Which means: If you want DDT, you need a LOT of RAM. L2ARC SSDs won't be good enough, unless you're happy with slow write speeds.

      It's like loud-speaker design (anyone try and make the perfect loud speaker?) You can have some of the requested items, but not all of them.

  12. Thanks very much for your responses Christopher. Appreciate it. It has helped me better understand dedup and its flaws in ZFS.