# How to recover a ZFS RAIDZ2 pool after OS-drive failure?



## WiiGame (Oct 21, 2013)

(This is a mix of crash recovery, reinstall & upgrade, and ZFS storage ... I don't know which part went wrong, but my core issue is a ZFS pool, so I'll start here in the Storage forum.)

I am in a bit of a panic. I did look around but found nothing close enough to my situation to act on; please just point me elsewhere if there is such a post. I am trying to keep this initial post short, so I will start with a ...

SUMMARY:
A 9.1-RELEASE system running the OS on a USB flash drive (UFS) with a ZFS mirror gets upgraded to 9-STABLE and, somewhere in there, gets a ZFS RAIDZ2 pool added (and the two pools report slightly different status messages, beyond the obvious). Everything is fine for a while, until the OS flash drive stops working. After bringing the system back up from an old 9.1-RELEASE backup and re-upgrading toward 9-STABLE (but somehow, unintentionally, ending up on 9.1-STABLE), the ZFS mirror is fine but the RAIDZ2 pool is *not*; instead, `zpool status` reports pools that were live at the time of the old OS backup but are long defunct.

Q: How can I oh-so-carefully bring this RAIDZ2 pool back up and running the way it was?

At this point, I am afraid to make any moves without advice, because I do not want to screw up something that might be presently salvageable. And, based on what I've seen so far, I do still think full recovery is possible.

I know no one can help me based only on that, so details will immediately follow in the next post (just to keep them separate).


----------



## WiiGame (Oct 21, 2013)

*The Next Level of Details*

The basic setup was 9-STABLE on a relatively fast, low-profile USB2 flash drive supporting two ZFS pools: a 2-drive mirror and a 6-drive RAIDZ2. Both pools were shared via Samba to the internal home network.

When the shares stopped responding, the console was found unresponsive, as well, so I had no choice but to press the case's reset button. The OS restarted with lots of complaining ... basically unusable, if memory serves. That was days ago, and if I made a note of the errors, I can't find it right now; I could try harder if the exact issue matters. But the clear impression to me was the issues were all about the USB OS flash drive and nothing to do with the ZFS drives. [And, I am still open to being wrong.]

When I found the backup dump of the OS drive, it turned out to be from the previous version: 9.1-RELEASE, from back when only the ZFS mirror existed (pre-RAIDZ2). After that backup, I think I built the RAIDZ2 under 9.1-RELEASE, upgraded to 9-STABLE, and then did a `zpool upgrade` only on that pool, but I could also have built it after the upgrade ... I can't be sure right now.

IMPORTANT: After the original 9.1-RELEASE to 9-STABLE upgrade, ZFS complained about a legacy on-disk format, because 9-STABLE supports feature flags that 9.1-RELEASE did not. Per above, I somehow enabled the new format on the RAIDZ2 pool but never bothered with the mirrored pool, which means the mirror suggested an upgrade every time I ran `zpool status`. Fortunately, I found what the mirror reported pre-crash:


```
status: The pool is formatted using a legacy on-disk format.  The pool can still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the pool will no longer be accessible on software that does not support feature flags.
```
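For reference, the upgrade applies per pool unless you pass `-a`, which is how I ended up with one pool on feature flags and one still on the legacy format. Roughly the commands involved (reconstructed from memory, so treat this as a sketch; `mir1` is just my mirror's name):

```shell
# With no arguments, list pools still on a legacy version
# (reads state only; changes nothing)
zpool upgrade

# Upgrade a single named pool; this is one-way, and older ZFS code
# can no longer import the pool afterward
zpool upgrade mir1

# Or upgrade every pool at once:
# zpool upgrade -a
```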

That is the background ... next is what I did post-crash. My goal was simply to get things back to the way they were before.

I used the same, original 9.1-RELEASE install drive (from which I had installed onto the original, now-failed 8GB flash drive), and I installed FreeBSD to a new (and much slower) PNY 16GB flash drive I had lying around. Then I did a full restore from the 9.1-RELEASE dump backup via instructions I found here (where "random1" is the source drive containing the dump file):

```
# Recreate a fresh UFS file system on the new flash drive's root partition
newfs -U /dev/da2p2
mount /dev/da2p2 /mnt/PNY
# Mount the drive holding the dump file
mount /dev/da4 /mnt/random1
# Restore the full dump into the new root
cd /mnt/PNY
restore -rvf /mnt/random1/allroot.dump
```

Naturally, this got me back to a running system under 9.1-RELEASE, the way it was back in February. A `zpool status` showed the mirror was fine, but it also reported two old, obsolete pools that I had originally built to learn and play with ZFS. It showed those pools as UNAVAIL and FAULTED, and nothing at all about the RAIDZ2 pool. (The messages are further below.)

I figured this was because the RAIDZ2, having been upgraded with a feature only in 9-STABLE, could not be read under 9.1-RELEASE. And since I never upgraded the mirror with that feature, the mirror was fine.

So, still trying just to get back to where I was, I upgraded the OS using a backup of /usr/src kept from my 9-STABLE upgrade. Somehow the process gave me an unexpected result, because I am now at 9.1-STABLE, unintentionally.

```
# uname -a
FreeBSD raid.quinns.int 9.1-STABLE FreeBSD 9.1-STABLE #0 r246798: Sun Oct 20 08:39:50 EDT 2013     dude@raid.mine.int:/usr/obj/usr/src/sys/GENERIC  amd64
```

But that is better, right? So I figured `zpool status` should give back my RAIDZ2 pool now. Instead, I am getting the same info I was getting just before this last upgrade (i.e., under 9.1-RELEASE):


```
# zpool status
  pool: mir1
state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support feature
        flags.
  scan: scrub repaired 0 in 4h32m with 0 errors on Sun Sep  8 05:52:54 2013
config:

        NAME               STATE     READ WRITE CKSUM
        mir1               ONLINE       0     0     0
          mirror-0         ONLINE       0     0     0
            gpt/mir1disk1  ONLINE       0     0     0
            gpt/mir1disk2  ONLINE       0     0     0

errors: No known data errors

  pool: tempmir1
state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-3C
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        tempmir1                  UNAVAIL      0     0     0
          mirror-0                UNAVAIL      0     0     0
            16317492388348038357  UNAVAIL      0     0     0  was /dev/gpt/tempmir1disk1
            13969314449511946543  UNAVAIL      0     0     0  was /dev/gpt/tempmir1disk2

  pool: tempmir1backpool1
state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from
        a backup source.
   see: http://illumos.org/msg/ZFS-8000-72
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tempmir1backpool1  FAULTED      0     0     1
          ada7      ONLINE       0     0     6
```

When the system crashed, pools tempmir1 and tempmir1backpool1 were long gone. That obviously means this pool info somehow came back with my restore; I didn't expect that, but I don't know exactly how this all works. However, my novice work with ZFS so far suggests that these pools "should" act fairly independently of the host system if treated correctly, so I believe (and hope) that the true RAIDZ2 configuration is still on the RAIDZ2 drives themselves and, since I haven't written anything to them, the pool should somehow be retrievable to its pre-crash state.
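From what I've read, each member disk carries redundant copies of the pool's configuration in its on-disk ZFS labels, and `zdb` can dump those labels without importing the pool or writing anything. Something like this might confirm the RAIDZ2 config is still on the disks (the device name is just a guess at my GPT labeling; I have not run this yet):

```shell
# Dump the ZFS labels on one member disk; shows the pool name, GUIDs,
# and vdev layout the disk believes it belongs to (read-only)
zdb -l /dev/gpt/raid6disk1
```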

By the way, I strongly prefer recovering this pool to rebuilding it, if at all possible. And I have been reading and doing this long enough to know what you are all thinking about backups. I know, I know, so I don't need that lecture. In fact, the mirrored pool is backed up and off-site. But this pool holds entertainment only: rips of my discs, renamed individual TV episodes, recordings off my old VHS tapes and my DVR, and the like. It took a lot of time and work, but I took a calculated risk because I simply cannot afford another 10+ TB of drives to back it all up (and I had planned, but not yet gotten around to, backing up certain pieces that are harder to replace). So this is not "the end of the world" or the loss of something completely irreplaceable if I have to start from scratch, but it would be a long road of re-doing things.

Thanks in advance for your consideration and assistance with recovering this. I hold out hope that it is something simple I am missing because I am a ZFS admin novice.


----------



## WiiGame (Oct 25, 2013)

*Just in case I could get a little advice through a few distilled questions ...*

Per the end of the last code block above, /dev/ada7 is (was?) most certainly in my RAIDZ2. But the original/restored config had it in pool tempmir1backpool1. So ...

Should I export, or somehow otherwise release, these two obsolete pools? Or would that write to the disks and then truly destroy my RAID6, making it unrecoverable?

Or should I maybe offline the disks (e.g., via disabling that controller in the BIOS) and then somehow release the pools (if they would even still be there at that point)?

Or would a `zpool import` work as-is, even though the OS thinks at least one (if not three) of the six disks is actually part of an old pool?

Or is there some way to "hack" into some ZFS configuration file on the OS that will clear the dead wood and allow me to import a non-exported pool?

Or maybe I just need to save off a few .conf files, re-install the OS from scratch, do not restore anything, and then a `zpool status` might show the true, pre-crash RAIDZ2 pool from there? (That would take days on this slow, temporary OS flash drive.)
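Of those options, the one that seems least risky from my reading is letting `zpool import` scan the disks themselves: with no arguments it only reads the device labels and lists importable pools, and a read-only import should avoid any writes. Roughly this (pool name illustrative; I have not tried it yet):

```shell
# List pools found by scanning device labels; makes no changes
zpool import

# If the RAIDZ2 pool appears, import it read-only; -f overrides the
# "pool was last used by another system" safety check
zpool import -f -o readonly=on raid6pool
```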

Any ideas/advice/direction? Thanks!


----------



## WiiGame (Jun 13, 2014)

Just to put a bow on this for future generations ...

I ultimately solved my problem on my own in Dec 2013 by installing the latest STABLE (9.2 at the time) from scratch onto a brand-new flash drive and not using the OS backup at all. Since my OS drive/file system is completely separate from my ZFS pools, this was probably easier than it would have been with the OS on ZFS, or so I think. And, yes, I had a bit of OS reconfiguring to do to get it close to the previous state.

In any case, the ZFS pools turned out to be just fine. Apparently they were never corrupted, despite how it looked. And, by extension, the OS apparently keeps some configuration data about the ZFS pools outside the pools, somewhere in the OS. Restoring the old OS (which evidently included that stale config data) caused the havoc, while installing fresh and letting the system rebuild that config data from the pools as they actually are made everything OK.

So everyone can stand down now. I knew you were worried about me.


----------



## kpa (Jun 13, 2014)

The only piece of ZFS configuration held outside the pools themselves is the /boot/zfs/zpool.cache file, which contains a cache of the pools known to the system.  It's possible that the file was lost (or went stale) for some reason and the pools were no longer recognized automatically.
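If a stale cache is the culprit, the usual remedy is to move it aside and re-import from the on-disk labels, roughly like this (pool name illustrative):

```shell
# Set the stale cache aside so it cannot contradict the disks
mv /boot/zfs/zpool.cache /boot/zfs/zpool.cache.old

# Re-import by scanning the device labels; a successful import
# rewrites the cache entry for the pool
zpool import raid6pool
```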


----------



## WiiGame (Jun 13, 2014)

Oh, hey, thanks for that!  More than likely, in my case, I was restoring a zpool.cache from a prior backup, and the actual pool configuration had changed in between. So when the cache didn't match reality, it looked scary and I didn't want to lift a finger for fear of wrecking it. In any case, that is very, very good to know.


----------

