# Major ZFS Problems



## the_sadman (Aug 5, 2010)

Something horrible has happened.  I have 5 drives as follows:

- ad0 and ad1 are on gm0 -- mirrored root/OS devices (UFS)
- ad6 -- a single drive on ZFS (loner)
- ad4 and ad8 -- a mirrored ZFS pool (share)

After I moved across town, the first boot showed me that ad0 had failed. Easy enough: I just unplugged it and made ad1 the new ad0. I should note that ad0 and ad1 are on the motherboard's IDE controller, while all the other drives are on the motherboard's SATA controller.

A couple of weeks later the ZFS drives finally started to get some real I/O again. A couple of days after that, the share zpool had permanent (checksum) errors. From that point on, ANY access to the filesystems would hang the user session. So I rebooted and started a scrub. The scrub has stalled at ~4% and has been like that for 8 hours. On top of that, the loner pool is now dead with I/O errors. So thus far all my ZFS drives have failed and I can't seem to do anything.

I've included some output with my tears. Tonight I hope to have a PCI SATA controller and good news. If the controller doesn't fix it I don't know what to do; I really need that data. I find it very unlikely that ALL my drives would fail from a move. I can't do much debugging remotely because even things like "reboot" and "sudo" cause the user session to hang. I did a 'zpool clear' on the mirror, so I guess that's why it shows no errors.


```
[root@hermes ~]# zpool status
  pool: loner
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	loner       UNAVAIL      1     7     0  insufficient replicas
	  ad6       UNAVAIL      8     0     0  experienced I/O failures

errors: 4 data errors, use '-v' for a list

  pool: share
 state: ONLINE
 scrub: scrub in progress for 8h27m, 4.02% done, 201h51m to go
config:

	NAME        STATE     READ WRITE CKSUM
	share       ONLINE       0     0     0
	  mirror    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad8     ONLINE       0     0     0

errors: No known data errors


[root@hermes ~]# tail -n25 /var/log/messages
Aug  5 08:25:06 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:25:26 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:25:42 hermes su: otheruser to root on /dev/pts/3
Aug  5 08:25:46 hermes kernel: ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Aug  5 08:26:06 hermes kernel: ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Aug  5 08:26:26 hermes kernel: ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug  5 08:26:26 hermes kernel: ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728069120
Aug  5 08:26:46 hermes kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:27:06 hermes kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:27:26 hermes kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Aug  5 08:27:46 hermes kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Aug  5 08:28:06 hermes kernel: ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug  5 08:28:06 hermes kernel: ad8: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728068864
Aug  5 08:28:26 hermes kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:28:46 hermes kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:29:06 hermes kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Aug  5 08:29:26 hermes kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Aug  5 08:29:46 hermes kernel: ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug  5 08:29:46 hermes kernel: ad6: FAILURE - READ_DMA timed out LBA=0
Aug  5 08:30:06 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:30:27 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:30:46 hermes kernel: ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Aug  5 08:31:07 hermes kernel: ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Aug  5 08:31:26 hermes kernel: ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug  5 08:31:26 hermes kernel: ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728069376

[root@hermes /var/log]# zpool get all share 
NAME   PROPERTY       VALUE       SOURCE
share  size           928G        -
share  used           832G        -
share  available      96.1G       -
share  capacity       89%         -
share  altroot        -           default
share  health         ONLINE      -
share  guid           10098610899381386190  -
share  version        13          default
share  bootfs         -           default
share  delegation     on          default
share  autoreplace    off         default
share  cachefile      -           default
share  failmode       wait        default
share  listsnapshots  off         default
```

The messages started last night when I noticed the failure and haven't stopped since. I checked the logs from a month before the move up to now, and I don't see errors like this anywhere else. Any insight would be much appreciated.
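As a rough sanity check on the scrub line in the `zpool status` output above, the remaining time can be extrapolated from the elapsed time and the percentage done. The sketch below is only an illustration in Python: the helper names are made up for this example, linear progress is an assumption (not necessarily how `zpool` computes its own estimate), and the scan-rate figure further assumes that the percentage is measured against the pool's 832G of used space.

```python
import re

# Copied from the `zpool status` output above.
SCRUB_LINE = "scrub: scrub in progress for 8h27m, 4.02% done, 201h51m to go"
# "used" value from `zpool get all share`, treated here as GiB (assumption).
USED_GIB = 832


def to_minutes(duration):
    """Convert a duration such as '8h27m' to whole minutes."""
    match = re.fullmatch(r"(\d+)h(\d+)m", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


def parse_scrub_line(line):
    """Return (elapsed_minutes, percent_done, reported_remaining_minutes)."""
    match = re.search(r"for (\d+h\d+m), ([\d.]+)% done, (\d+h\d+m) to go", line)
    if match is None:
        raise ValueError("not a scrub progress line")
    elapsed, percent, remaining = match.groups()
    return to_minutes(elapsed), float(percent), to_minutes(remaining)


def linear_remaining_minutes(elapsed_minutes, percent_done):
    """Extrapolate the remaining time, assuming progress stays linear."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    total = elapsed_minutes * 100.0 / percent_done
    return total - elapsed_minutes


def scan_rate_mib_per_s(used_gib, percent_done, elapsed_minutes):
    """Average scan rate, ASSUMING the percentage is relative to used space."""
    scanned_mib = used_gib * 1024 * percent_done / 100.0
    return scanned_mib / (elapsed_minutes * 60)


elapsed, percent, reported = parse_scrub_line(SCRUB_LINE)
estimate = linear_remaining_minutes(elapsed, percent)
rate = scan_rate_mib_per_s(USED_GIB, percent, elapsed)
print(f"reported remaining: {reported} min, linear estimate: {estimate:.0f} min")
print(f"average scan rate: {rate:.2f} MiB/s")
```

Under those assumptions the linear estimate lands within a few minutes of the reported 201h51m, and the average scan rate works out to roughly 1 MiB/s, which fits a scrub that is mostly waiting on timed-out I/O rather than reading data.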


----------



## jem (Aug 6, 2010)

Was your PC knocked around during your move?  If so, maybe you really do have multiple disk failures.

When I transport my computer, I make sure it's strapped into a car seat so that it has some cushioning against bumps and jolts.  I don't put it on the floor or in the boot/trunk.

Alternatively, you may be lucky and it's just that something has shaken loose during transit.  Try reseating and reconnecting all the disks and controllers inside the PC and see if that helps.


----------



## the_sadman (Aug 6, 2010)

Last night I tried more debugging. I plugged the drives in one by one, rebooting in between, and moved them all to different SATA ports on the MB. ALL the SATA drives worked fine this time (odd). So I decided to run a quick scrub, and somewhere around 60% I got the same hangs as before, with the same messages in /var/log/messages. Since I never get these messages for the drive on IDE (never mind that one of my IDE drives went boom a couple of weeks back), I am thinking that this is a controller problem (maybe after a while it heats up too much and starts bugging out) or possibly a software bug (not likely, I guess). Tonight I plan to use a PCI SATA card to coerce as much data as possible off a degraded mirror; maybe I'll get lucky and can grab two cards from work.

Typically when unsure, do you replace everything (HDs, Controller, etc) ... ?


----------



## gkontos (Aug 6, 2010)

the_sadman said:

> Typically when unsure, do you replace everything (HDs, Controller, etc) ... ?


It appears that the problems with your drives are related to that specific controller. In the log you posted, all of the SATA drives (ad4, ad6 and ad8) show timeouts, which points at the one thing they have in common. So I would start by replacing the controller.

George
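One way to check the "all of them showed errors" observation is to tally the kernel's ATA messages per device. The following is a minimal sketch in Python, assuming the message format matches the `/var/log/messages` excerpt posted earlier in the thread. The function name and the regular expression are illustrative only and are not part of any FreeBSD tool; the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=..."
# The pattern is an assumption based on the log excerpt in this thread.
ATA_MESSAGE = re.compile(r"kernel: (ad\d+): (WARNING|TIMEOUT|FAILURE) - ")

# A few lines copied verbatim from the posted log.
SAMPLE_LOG = """\
Aug  5 08:25:42 hermes su: otheruser to root on /dev/pts/3
Aug  5 08:26:26 hermes kernel: ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug  5 08:26:26 hermes kernel: ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728069120
Aug  5 08:28:06 hermes kernel: ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug  5 08:28:06 hermes kernel: ad8: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728068864
Aug  5 08:29:46 hermes kernel: ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug  5 08:29:46 hermes kernel: ad6: FAILURE - READ_DMA timed out LBA=0
"""


def tally_ata_messages(lines):
    """Count ATA kernel messages per device and per severity keyword."""
    counts = {}
    for line in lines:
        match = ATA_MESSAGE.search(line)
        if match is None:
            continue  # not an ATA driver message (e.g. the su line)
        device, severity = match.groups()
        counts.setdefault(device, Counter())[severity] += 1
    return counts


counts = tally_ata_messages(SAMPLE_LOG.splitlines())
for device in sorted(counts):
    summary = ", ".join(f"{sev}={n}" for sev, n in sorted(counts[device].items()))
    print(f"{device}: {summary}")
```

If every device behind one controller shows up in the tally while devices on another controller (here, the IDE drive) never do, that is a hint that the shared controller or its cabling deserves attention before the disks themselves. To run it against a real log, pass the lines of the open file to `tally_ata_messages` instead of the sample.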


----------



## the_sadman (Aug 8, 2010)

I replaced the motherboard (and thus the SATA controller) and haven't had a single error since.  Scrubs completed successfully too.  Hurray!


----------

