In that year, my thoughts on where to use dedup have changed.
My big ZFS pool is 28.5 TB in a basic mirror, currently showing 11.1 TB used and running a dedup ratio of 1.25x (25% of my data is deduplicated), a savings of around 2.75 TB. This machine runs 24 GB of RAM and has an L2ARC made up of four 120 GB SSDs, in addition to two ZIL RAM devices.
I have two smaller production units: one holding 20.4 TB in a raidz with 12 GB of RAM, and one holding 16.4 TB in a raidz with 24 GB of RAM. These secondary systems may or may not have hardware RAM ZILs and L2ARC SSD caches at any given moment, as I continually move components around in my secondary equipment to find a sweet spot between $$ for hardware and performance.
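(If you want to see where your own pools stand, the dedup ratio is reported by zpool list, and zdb will dump the dedup table statistics. The pool name 'tank' below is just a placeholder for your own:)

    # Pool-wide overview; the DEDUP column is the dedup ratio
    zpool list tank

    # Dedup table (DDT) statistics: entry counts plus on-disk and in-core sizes
    zdb -DD tank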
Back to the thought at hand...
Dedup obviously has to have some sort of performance penalty. It simply takes more time to check if you're writing a unique block compared to writing without the check.
Checking the uniqueness of a block means looking its hash up against the hash of every deduplicated block in the pool - the dedup table is pool-wide, shared by every filesystem in it - and that lookup takes longer as your pool increases in size.
In a perfect world, your system has enough RAM that the dedup table (DDT) resides entirely in memory, so checking the uniqueness of a block being written only requires traversing the DDT in RAM.
In the real world, this is nearly impossible. The way the ARC/L2ARC is designed, there is no preference given to DDT data - it's treated simply as metadata, and metadata competes with user data in ZFS's caching system. You might think that setting primarycache and/or secondarycache to metadata will fix the problem, but once you switch the primary cache to metadata, you also doom the secondary cache to holding only metadata - nothing can get into the L2ARC unless it has first existed in the ARC.
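For reference, these are the per-dataset properties in question (the dataset name is hypothetical), with the trap described above spelled out in the comments:

    # Tell the ARC to cache only metadata (which includes the DDT) for this dataset
    zfs set primarycache=metadata tank/vmstore

    # The L2ARC is still nominally allowed to hold everything...
    zfs set secondarycache=all tank/vmstore

    # ...but the L2ARC is fed only by blocks being evicted from the ARC,
    # and with primarycache=metadata no user data ever enters the ARC,
    # so in practice the L2ARC ends up holding only metadata as well.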
In my experience with a busy pool, enough of the DDT gets evicted by user-data traffic that dedup performance becomes a serious issue. These issues don't show up on a fresh system without much data in the pool; you need an older, loaded, heavily fragmented pool to really notice the drop in performance.
How bad? Here's a quick and dirty iometer check on a fs with dedup turned on, and one with it off. This is on my big pool, with plenty of L2ARC and RAM.
Dedup On: 197 IOPS, 0.81 MB/s, 647 ms AvgIO
Dedup Off: 4851 IOPS, 19.87 MB/s, 28 ms AvgIO
See the difference? This isn't a 2x or 4x slowdown of my writes - with dedup on, these writes are more than 20x slower than with it off.
So we know it's going to be hard to keep the DDT in RAM, which means there will be a performance hit. What about the big savings in disk writes (you don't write a block that already exists), and the performance increase that should come with it?
It's simply not there in my systems, and probably not in yours.
Additionally, the idea that deduplicated blocks could reside in the memory cache, letting multiple systems take advantage of the same blocks, is a good one - but it looks like there is still enough erosion of the caches that this doesn't make a difference on my systems.
If I had a higher dedup ratio, maybe this would start working out better for me - but even with 40+ VMs of mostly the same stuff, I'm only running a 1.25x dedup ratio on my big pool. I have enough dissimilar data to make dedup impractical on this system.
I'm now going to turn dedup off for nearly all of my production boxes.
However, turning dedup off doesn't de-dedup your data - existing data stays deduplicated until fresh data overwrites it.
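The switch itself is just a per-dataset property change (the dataset name here is a placeholder), and the pool's reported ratio confirms the old data is left alone:

    # New writes to this dataset skip the DDT lookup from here on
    zfs set dedup=off tank/vmstore

    # The pool-wide ratio still reflects the existing deduplicated data
    zpool get dedupratio tank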
This provides a 'poor man's offline dedup' trick: leave dedup off during the day when you need performance, then enable it after hours via a script that copies all of the data in place to *.tmp files and renames them back when done. You've just deduplicated all of your data without taking the write penalty during the day, when your systems are presumably at their busiest.
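A minimal sketch of that after-hours job, assuming a dataset called tank/archive mounted at /tank/archive that can safely be rewritten (check free space and open files first - this is an illustration, not a hardened script):

    #!/bin/sh
    # Poor man's offline dedup: rewrite files while dedup is temporarily on.
    FS=tank/archive
    MNT=/tank/archive

    zfs set dedup=on "$FS"

    # Copy each file in place so the new blocks pass through the dedup code path,
    # then rename over the original to free the old, non-deduplicated blocks.
    find "$MNT" -type f ! -name '*.tmp' | while IFS= read -r f; do
        cp -p "$f" "$f.tmp" && mv "$f.tmp" "$f"
    done

    zfs set dedup=off "$FS"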
I'm going to leave dedup enabled on my storage and backup archives - places where I have a lot of duplication and not a lot of need for speed. Everywhere else, I'm just going to run compression (I find compression=gzip to be a good blend of storage and speed on my systems). I'm averaging close to 2x on compression alone.
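Since both dedup and compression are per-dataset properties, that split is easy to express (dataset names are placeholders), and compressratio reports what you're actually getting:

    zfs set dedup=on tank/backup            # archives: lots of duplication, speed not critical
    zfs set dedup=off tank/vmstore          # busy datasets: skip the DDT entirely
    zfs set compression=gzip tank/vmstore   # rely on compression for the space savings
    zfs get compressratio tank/vmstore      # see what compression alone is achieving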
For dedup to work well in a production environment, you're going to need an extreme amount of RAM and L2ARC relative to the size of your storage. I don't have hard figures for you, but it's probably cheaper to provision more storage than it would be to try to dedup the data.
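If you want a rough number for your own pool before enabling anything, zdb can simulate dedup against the existing data and report the projected DDT size and ratio; from the entry count you can ballpark how much RAM the table would want (bearing in mind this is an estimate, and it can take a long while to run on a large pool):

    # Simulate dedup on an existing pool without changing it; prints a DDT
    # histogram and the dedup ratio the pool would achieve.
    zdb -S tank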
What ZFS needs is an offline dedup ability, priority settings for the DDT in the ARC/L2ARC, and some general tuning of the ARC/L2ARC to allow for setups like "DDT and metadata only in RAM, everything else allowed in L2ARC".
...at least for it to be viable in my environment - which is the problem: we all have different environments with different needs.