# ZFS hangs when mounting a particular filesystem, yet others are ok



## stuart (Dec 30, 2011)

Hi all,

I have a FreeBSD backup system with several disks in a ZFS "mirrored stripe" (striped mirrors). I was causing a lot of disk IO on one of the filesystems, which seemed to cause the system to hang (by "a lot", I mean about 150 rm's running concurrently in different parts of the tree, which it's coped with before). By "hang", I mean the system responded to pings but I couldn't ssh to it, nor gain access via KVM.

When rebooting, it reached the filesystem mounts and then just hung there; ^T showed it stuck on zfs->io_cv most of the time. I rebooted into single-user mode and it was OK, and - of the 3 filesystems present on the ZFS pool - mounting 2 of them manually was fine. Mounting the one which previously had all of the IO just causes it to hang again. I've left it for a few hours and it's still just sitting there. When I set the mountpoint of that filesystem to 'legacy' and booted as normal, the system is fine (but obviously I can't access that filesystem).

I rebooted the system and ran the legacy mount in a screen to try to debug it, but I'm not entirely sure what to do, or how to use zdb. I scrubbed the entire pool recently, which completed without errors, but the same thing happens. When trying to mount this particular fs, `zpool iostat -v` shows about 700K/sec total, with the system appearing to read from all of the disks, and no other processes requiring disk IO. `zfs get all` works against the filesystem (noatime, noexec, nosuid, fletcher4, dedup=off, copies=1) - literally the only thing I can't seem to do is mount it.

Does anyone have any suggestions as to what this might be? I'm not familiar enough with ZFS or zdb to tell - could it be the ZFS intent log (ZIL) or something? Is there any way to find out? The mount process appears to be unkillable.
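In case it helps anyone suggest something, here's roughly what I've been poking at so far with zdb - these are read-only inspection commands, and the dataset/device names below are just placeholders for my own:

```shell
# Pool history - every administrative command ever run against the pool
zdb -h vault

# Summarize the datasets and their objects (placeholder dataset name)
zdb -d vault/data

# Dump the on-disk vdev labels of one member device (placeholder path)
zdb -l /dev/da0.eli
```

None of these should touch the pool, but I'd welcome pointers on which output would actually be useful here.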

Various bits of system info:


```
[root@vault ~]# zpool get version vault
NAME   PROPERTY  VALUE    SOURCE
vault  version   28       default
```


```
[root@vault ~]# zpool status
  pool: vault
 state: ONLINE
 scan: resilvered 6.75T in 35h40m with 0 errors on Fri Dec 30 07:27:25 2011
config:

        NAME                STATE     READ WRITE CKSUM
        vault               ONLINE       0     0     0
          mirror-0          ONLINE       0     0     0
            da8.eli         ONLINE       0     0     0
            da0.eli         ONLINE       0     0     0
            label/bZp0.eli  ONLINE       0     0     0
          mirror-1          ONLINE       0     0     0
            da1.eli         ONLINE       0     0     0
            label/bBp2.eli  ONLINE       0     0     0
            ada1.eli        ONLINE       0     0     0
          mirror-2          ONLINE       0     0     0
            label/bZp2.eli  ONLINE       0     0     0
            da10.eli        ONLINE       0     0     0
            da2.eli         ONLINE       0     0     0
          mirror-3          ONLINE       0     0     0
            label/bZp3.eli  ONLINE       0     0     0
            da3.eli         ONLINE       0     0     0
            da11.eli        ONLINE       0     0     0
```


```
[root@vault ~]# uname -a
FreeBSD vault 9.0-PRERELEASE FreeBSD 9.0-PRERELEASE #16: Wed Dec 28 17:35:39 GMT 2011     root@:/usr/obj/usr/src/sys/vault  amd64
```


----------



## stuart (Dec 31, 2011)

[Solved]

Eventually managed to fix this by leaving the rogue filesystem unmounted, creating a snapshot of it, cloning the snapshot, promoting the clone, and destroying the original.

Would still be interested to know why this happened, as it's never happened before and will cause problems if it happens again, but that seems to be one way of fixing it if anyone else runs into this issue.
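For anyone searching later, the rough sequence was as follows ('vault/data' is a placeholder for the affected filesystem, which stayed unmounted throughout):

```shell
zfs snapshot vault/data@rescue            # snapshot the unmountable dataset
zfs clone vault/data@rescue vault/data2   # clone from that snapshot
zfs promote vault/data2                   # flip the clone/origin relationship
zfs destroy vault/data                    # drop the original dataset
zfs rename vault/data2 vault/data         # optional: reclaim the old name
```

After the promote, the snapshot belongs to the clone rather than the original, which is what lets the original be destroyed. The new dataset mounted fine.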


----------



## stuart (Jan 2, 2012)

*Recurring problem with ZFS and heavy disk IO*

[Not quite solved]

This problem has occurred again; the symptoms are that under very heavy disk IO, the system just seems to hang. By 'heavy IO', I mean resilvering 3 disks while performing about 90 rm's in different directories, all within the same filesystem.

Any ideas how to debug this? ZFS has been completely solid up until now, but for some reason heavy disk use seems to hang the server and corrupt the filesystem (the zfs snapshot/clone trick fixes it, though).
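If it helps, this is the sort of thing I'm planning to capture next time it wedges, before the box goes completely unresponsive - just a sketch, and the pool name is my own:

```shell
# Kernel stack traces for every thread - should show what the stuck
# mount/rm processes are actually blocked on (e.g. zio->io_cv)
procstat -kk -a > /var/tmp/stacks.txt

# Per-vdev IO every 5 seconds, to see whether the pool is still
# making progress or has gone completely idle
zpool iostat -v vault 5
```

Is there anything else worth grabbing, assuming I still have a shell when it starts to go?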

Any thoughts anyone?


----------

