# zpool hangs/stall



## reneg (Feb 20, 2013)

Hi guys,

I have a SUN FIRE X4540 running 

```
uname -a
FreeBSD <hostname> 9.1-RELEASE FreeBSD 9.1-RELEASE #0 r243825: Tue Dec  4 09:23:10 UTC 2012     root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
```

The problem: the zpool "data" stops responding after a while and stalls completely. The system itself still works, but any command accessing the pool never finishes: 

```
ps ax
4644  0  D+      0:00.00 ls -GF /data
```
The second pool "rpool" works just fine. 
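
A hedged next step (assuming a stock FreeBSD 9.1 install, and using PID 4644 from the ps output above) would be to look at where the hung process is blocked in the kernel; frames mentioning `zio_wait`/`txg` point at ZFS itself, while CAM routines point at the driver or a disk:

```
# Dump the kernel stack of the hung process (D state means it is
# sleeping uninterruptibly in the kernel, so this shows where).
procstat -kk 4644
```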

*zpool iostat* also shows no I/O activity on the pool after the first iteration:

```
zpool iostat 5
              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        71.9T  8.75T  1.15K     33  79.9M   220K
rpool       7.38G  7.50G      5      0   143K  4.30K
----------  -----  -----  -----  -----  -----  -----
data        71.9T  8.75T      0      0      0      0
rpool       7.38G  7.50G      0      9      0  36.4K
----------  -----  -----  -----  -----  -----  -----
data        71.9T  8.75T      0      0      0      0
rpool       7.38G  7.50G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
data        71.9T  8.75T      0      0      0      0
rpool       7.38G  7.50G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
data        71.9T  8.75T      0      0      0      0
rpool       7.38G  7.50G      0      0      0      0
```

I attached zpool_status.txt, where you can see it even hangs while doing a scrub. 

While examining the problem further, I discovered that disk da73 is in state "CORRUPT" according to *gpart list*. Note that in the zpool status output da73 still shows "ONLINE"; it has not been taken offline, nor is the affected raidz vdev marked "DEGRADED", as should normally happen when a disk fails. Executing *camcontrol inquiry da73* also never completes:

```
ps ax
4868  1  DL+     0:00.00 camcontrol inquiry da73
```
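
Since camcontrol itself hangs on da73, the disk or its controller path is suspect. As a hedged suggestion (smartctl comes from the sysutils/smartmontools port, not the base system, and may also hang if the disk is truly dead), the kernel log and the drive's SMART data might confirm a failing disk:

```
dmesg | tail            # look for CAM timeout/error messages on da73
smartctl -a /dev/da73   # requires smartmontools; check reallocated/pending sectors
```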

Any ideas on dealing with this problem?


----------



## reneg (Feb 20, 2013)

Update:

I decided to replace the corrupt disk with a spare. Initially this resulted in a stall as well, so I had to reboot the whole system. 
After that, replacing the disk went without problems:
```
zpool offline data label/disk73
17946570595209808127  OFFLINE      0     0     0  was /dev/label/disk73
```
```
zpool replace data label/disk73 label/disk19
  raidz2-8                  DEGRADED     0     0     0
    label/disk72            ONLINE       0     0     0
    spare-1                 OFFLINE      0     0     0
      17946570595209808127  OFFLINE      0     0     0  was /dev/label/disk73
      label/disk19          ONLINE       0     0     0  (resilvering)
    label/disk74            ONLINE       0     0     0
    label/disk75            ONLINE       0     0     0
    label/disk77            ONLINE       0     0     0
    label/disk78            ONLINE       0     0     0
    label/disk79            ONLINE       0     0     0
    label/disk81            ONLINE       0     0     0
    label/disk82            ONLINE       0     0     0
    label/disk83            ONLINE       0     0     0
    label/disk84            ONLINE       0     0     0
```

Resilvering now seems fine, with no stalls so far. It is also running much faster than it ever did before; it never went above 270M/s previously. 
Now:
```
zpool status
  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Feb 20 10:16:32 2013
        6.87T scanned out of 71.9T at 668M/s, 28h21m to go
        292M resilvered, 9.55% done
```
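
As a sanity check on that ETA (an illustrative calculation, assuming zpool's T and M mean TiB and MiB/s): the remaining 71.9T - 6.87T at 668M/s works out to the quoted 28h21m:

```shell
awk 'BEGIN {
  remaining_mib = (71.9 - 6.87) * 1024 * 1024   # TiB remaining -> MiB
  secs = remaining_mib / 668                    # at 668 MiB/s
  printf "%dh%02dm\n", secs / 3600, (secs % 3600) / 60
}'
# prints 28h21m
```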

Once the resilvering has finished successfully, I'll start a scrub on the pool. I'll keep you posted.

Still, the question remains why ZFS did not offline the corrupt disk. A bug?


----------

