Sunday, February 6, 2011

ZFS: Data gone, data gone, the love is gone...

Holy %*^&$!

My entire 21 TB array is now "FAULTED" and basically dead to the world.

It's listing the worst type of corrupt data - pool metadata.

How did I do this? I was just rebooting the box, as always. I was working on integrating Solaris 11 Express with my Active Directory so CIFS would allow for seamless browsing from my Windows boxes. I was partway through it when I decided to reboot the box and try the commands again. When I did, my pool was no longer intact.

I was resilvering a drive, but I've rebooted many times in that situation, so it shouldn't matter. I was also starting to change the hostname from a generic "solaris" to something that made sense in my network, and it was being cranky about that, but nothing odd was going on.

The official response from the ZFS troubleshooting guides is to destroy the pool, make a new one, and restore your data from backup. Sounds like Microsoft's advice for when your Exchange store tanks.

Of course, you know not all of my data is backed up. Let's not even go there. The most critical stuff is, the things that I'd get my ass kicked for, but other data on my second SAN is months old, and then there's the gigs of junk that I've collected over the years that I don't bother backing up because I don't need it, but I want it. Most importantly, I did some pretty decent work this week, and it's all in that SAN ZFS pool, and I really don't want to recreate it.

I KNOW my data is all there. At last count, I had around 12-16 TB of data on these drives, and that can't disappear that quickly.

The question is: Which path is faster? Recreating my data, or trying to recover this beast?

Well, you know I'm going to have to try and recover it.

My first day was spent in a light daze (shock maybe) as I researched more about the inner workings of ZFS. Luckily I had already spent a few weekends messing around with trying to port PJD's v28 ZFS patch on FreeBSD to a newer build of FreeBSD, so I had some idea of the code, the vdevs, uberblocks, etc.

My research is leading me to think that my most recent uberblocks were somehow corrupted, leading ZFS to think that I have a destroyed pool.

This is a big issue for ZFS: when those uberblocks don't make sense, the pool won't auto-recover, and you've got to dig in and really mess with the internals of ZFS to get anywhere.

Currently, I'm working with this command:

zpool import -d /mytempdev -fFX -o ro -o failmode=continue -R /mnt 13666181038508963033
...which takes 21 minutes to complete, so that gives you some idea of how slow recovering this type of data is.
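For anyone following along, here is my reading of what each flag in that command does, written as a dry-run sketch that only prints the command and never touches a pool. The helper name build_import_cmd is made up for illustration. The flag descriptions follow the zpool documentation as I understand it; -X in particular is sparsely documented, so treat that comment as an assumption rather than gospel.

```shell
#!/bin/sh
# Dry-run sketch: prints the import command instead of running it.
# build_import_cmd is a hypothetical helper, not part of ZFS.

DEV_DIR=/mytempdev              # directory holding the device nodes/links to search
POOL_ID=13666181038508963033    # numeric pool ID from the post

build_import_cmd() {
    # -d DIR               search DIR for devices instead of the default location
    # -f                   force the import even if the pool looks in use elsewhere
    # -F                   recovery mode: try discarding the last few transactions
    # -X                   used with -F: rewind further back to find a usable txg
    #                      (assumption: behaves as in later zpool documentation)
    # -o ro                mount option: mount the datasets read-only
    # -o failmode=continue pool property: return errors on failure instead of hanging
    # -R /mnt              alternate root, so nothing mounts over the live system
    echo "zpool import -d $1 -fFX -o ro -o failmode=continue -R /mnt $2"
}

build_import_cmd "$DEV_DIR" "$POOL_ID"
```

Printing the command first is a cheap way to double-check the pool ID and device directory before committing to another 21-minute wait.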

I'll post back if/when I get this recovered.

(May 20th 2011 Followup: )