# ZFS Panic Galore



## hackish (May 3, 2012)

I'm running a 32-bit version of FreeBSD with ZFS. There is 4 GB of RAM in this machine.

Lately it has started crashing every month or so. I had more RAM added and upgraded to 8.3. I tried the PAE kernel, but it would crash after an hour or so.

So I backed down to the regular GENERIC kernel, but even then it only stays up for an hour if I'm lucky.

Any insight into what might be wrong?


----------



## Beeblebrox (May 4, 2012)

Looks like an HDD problem to me....
Are you sure that the system is not having HDD time-outs, i.e. losing the connection with the HDD? Look through all the messages in the system's logs for any hint of what's going on.

Install and run sysutils/smartmontools.
`# smartctl -a /dev/ada0`
will show how many and what type of errors the drive has logged.
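For the log check, something like the following can turn up ATA/CAM trouble quickly (a sketch; the exact log paths and match patterns depend on your setup):

```
# grep -iE 'timeout|error|retry' /var/log/messages
# dmesg | grep -iE 'ata|ada|cam'
```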


----------



## hackish (May 4, 2012)

There are no hints in the system logs.

Here is some of the output:

```
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA
Device Model:     WDC WD5000AAKS-00V1A0
Serial Number:    WD-WCAWF3101822
LU WWN Device Id: 5 0014ee 1027d88ad
Firmware Version: 05.01D05
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri May  4 14:36:31 2012 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

snip...

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART overall-health self-assessment test result: PASSED
```

To me it would appear that the drive is in good health.


----------



## hackish (May 4, 2012)

Here is another dump. 

Hardware problem? I just had the RAM replaced with fresh stuff and it happened again.


----------



## kpa (May 4, 2012)

Have you done any tuning of the vm.kmem_size* tunables? On i386 the recommendation is to raise both vm.kmem_size and vm.kmem_size_max to at least 512M for stable operation of ZFS:

http://wiki.freebsd.org/ZFSTuningGuide#i386

As the page states, if you need more than 512 MB of kmem you'll have to compile your own custom kernel with an increased KVA_PAGES setting.
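If you do go the custom-kernel route, the KVA_PAGES bump is a single kernel config option. A minimal config might look like this (a sketch based on the tuning guide; the config name is arbitrary, and 512 pages corresponds to a 2 GB kernel address space on i386):

```
# custom i386 kernel config (name is arbitrary)
include GENERIC
ident ZFSKERNEL
options KVA_PAGES=512
```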


----------



## hackish (May 4, 2012)

The kernel is a GENERIC 8.3-RELEASE-p1 as of May 4th 2012.

The only startup options are

```
vm.kmem_size="512M"
vm.kmem_size_max="512M"
vfs.zfs.arc_max="160M"
```

These were taken from the ZFSTuningGuide. I also tried recompiling the kernel with the KVA_PAGES option, but it would panic immediately. My assumption is that a GENERIC kernel has the best chance of being consistent with the rest of the world.

I am having serious second thoughts about continuing to run ZFS on a production machine.


----------



## kpa (May 4, 2012)

You would probably have better luck with the amd64 version of FreeBSD; all of this tuning would be mostly unnecessary, except for the vfs.zfs.arc_max setting.

I would run both short and long self-tests on the disk drive with smartctl(8) to make sure the drive is ok, sometimes clean SMART stats do not tell the whole story about the drive's health.
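The self-tests can be started and checked like this (assuming the drive is /dev/ada0, as earlier in the thread; the long test can take hours):

```
# smartctl -t short /dev/ada0     # quick electrical/mechanical test, a few minutes
# smartctl -t long /dev/ada0      # full surface scan, can take hours
# smartctl -l selftest /dev/ada0  # view the results once a test has finished
```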


----------



## Beeblebrox (May 4, 2012)

Those messages look way too much like a hardware error, and I propose you first eliminate any such possibility from your system.
If you are sure that the HDD is fine, try memtest, preferably from a Linux CD, or else sysutils/memtest.


----------



## hackish (May 4, 2012)

All of the smart tests passed. I have arranged to have a fresh hard drive installed with a fresh copy of 64 bit FreeBSD on it. I'll see if I can use that to read the ZFS partitions.


----------



## Beeblebrox (May 4, 2012)

Hardware isn't "just the HDD". You need to take a complete approach, including those trivial cables.
Have a look at this also: http://www.inquisitor.ru


----------



## hackish (May 5, 2012)

We completely replaced the hardware. Same issue. I suspect it may be a bug in ZFS where it's unable to deal with some sort of filesystem corruption. There are now 2 drives in the system: one with 64-bit FreeBSD, and the original. Any suggestions on how I might try to mount this filesystem from within the 64-bit FreeBSD system? Of course the fresh install doesn't know how to find the drive with the old ZFS pools.


----------



## Beeblebrox (May 5, 2012)

Sorry to have wasted your time on the hardware side; better to be safe than sorry though, I think...

Can you tell us where your swap is? Is swap part of the ZFS pool or on its own proper slice?



> Any suggestions on how I might try to mount this filesystem from with the FreeBSD 64 bit system?


`# zpool import -f -R /media/rescue <poolname>`
*-f* to force the import, *-R* to specify where you want it mounted (altroot).
You can also pass -o canmount=noauto in the above command to prevent automatic mounting of datasets, then mount datasets by hand using
`# mount -t zfs pool/dataset <mountpoint>`


----------



## hackish (May 5, 2012)

The swap was on a slice. There was only 1 ZFS partition.


```
FreeBSD cl-t153-284cl 8.3-RELEASE FreeBSD 8.3-RELEASE #0: Mon Apr  9 21:23:18 UTC 2012     
root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
```


```
zpool import -f -R /dev/ad16s1g email
Unexpected XML: name=stripesize data="0"
Unexpected XML: name=stripeoffset data="977305088"
---SNIP---
Unexpected XML: name=stripesize data="0"
Unexpected XML: name=stripeoffset data="32256"
cannot import 'email': pool is formatted using a newer ZFS version
```

The 32-bit version of the OS was cvsup'd within minutes of the 64-bit one. No idea why they show different versions. Is it possible that the on-disk format has changed with the 64-bit version? I understood this was one of the strengths of ZFS: a uniform format across all platforms.


----------



## hackish (May 5, 2012)

Hmmm... Correction: it looks like I still need to build/install to get the version up to date.


----------



## Beeblebrox (May 5, 2012)

`# zpool upgrade -v`
will give you the version you are running. For details, see zpool(8). The mountpoint in import needs to be a folder name, not a device:
`# zpool import -f -R /media email`


----------



## hackish (May 5, 2012)

It would appear that I'm running version 14. I thought I read somewhere that 8.3 was supposed to be version 28. I was planning to upgrade it to 9 anyway, so I'll try that and see if it's able to read the 32-bit filesystem that way.

Just for kicks I tried to rebuild the kernel (dual-booted to 32-bit) so maybe I could get it to run long enough to query what ZFS version it is. If I load the zfs module, I get about 5-10 seconds before a kernel panic...


----------



## hackish (May 6, 2012)

I'm still having problems with this ZFS thing. I upgraded to 9.0-RELEASE.

The replacement filesystem has a ZFS pool called email. I need to find a way to access the email pool from the old disk, which is installed in the system. It's on /dev/ad16s1g, whereas the "good" ZFS system is on /dev/ad6s1g and is up and running properly.

The closest I've gotten is this:

```
zpool import email ad16s1g
cannot import 'email': pool may be in use from other system, it was last accessed by mail (hostid: 0xb486f72a) on Sat May  5 22:14:07 2012
use '-f' to import anyway
```

I'm a little afraid at this moment that it's going to do bad things to my existing pool that has the same name. How will I know which pool is which? I want to mount it, copy the data off, and unmount it.


----------



## hackish (May 6, 2012)

Ok, bit the bullet. Backed up my /email and tried it. Looks like there must be a bug in the kernel.

`# zpool import -f email ad16s1g`


----------



## kpa (May 6, 2012)

Leave out the ad16s1g from the import command; the pool is detected by on-disk metadata and you cannot give device names as part of the import command like that.


----------



## hackish (May 6, 2012)

How do you tell zfs to look on that disk for the partition? If I do a *zfs list* here is what I get:

```
zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
email   118M   356G   118M  /email
```

That's my new email zfs filesystem living on /dev/ad0s1g.

As you can see from my feeble attempts above I really want to work with the one on /dev/ad16s1g so I can copy the data to this one.


----------



## kpa (May 6, 2012)

The detection is based solely on on-disk metadata; you cannot tell zpool(8) to use a specific device for import.

If you run just this as root it should list all pools that are available for import:

`# zpool import`


----------



## hackish (May 6, 2012)

Running import lists only 1 pool. 

```
pool: email
    id: 10433152746165646153
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        email       ONLINE
          ada1s1g   ONLINE
```

I'm not sure anymore what pool this is...


----------



## hackish (May 6, 2012)

Would you suggest I run `# zpool import -f 10433152746165646153`?

Nearly every command I've tried so far on this thing has resulted in a kernel panic. So far I've tried 8.3 32bit and 8.3 64bit and now 9.0 64 bit with very consistent panics.


----------



## kpa (May 6, 2012)

Try this; it will import the pool forcibly and mount it under the temporary mount point /altroot, so you can take a look at what the pool contains (and to avoid a situation where the pool gets mounted over the system directories):

`# zpool import -f -R /altroot 10433152746165646153`


----------



## hackish (May 6, 2012)

```
zpool import -f -R /mnt 10433152746165646153
cannot import 'email': pool already exists
```


```
zpool import -f -R /altroot 10433152746165646153
cannot import 'email': pool already exists
```

Wasn't sure if the altroot was an option or a path...


```
zpool import -f -R /altroot 10433152746165646153 olddata
```

Did some googling... As soon as I ran this I got the all-too-familiar kernel panic.

I'll see if I can bring the image home with me.


```
dd if=/dev/ad16s1g > zfsimage.dat
```

I don't think I have time to go to bsdcon but if this problem persists I might have to go over there and see if any of the kernel experts can help.


----------



## kpa (May 6, 2012)

It's starting to look like something in the pool is corrupted pretty badly; I cannot provide much more help, unfortunately. You could try your luck on the freebsd-fs mailing list.


----------



## hackish (May 6, 2012)

Thanks for your help. I'll take it up with them. In the meantime I'm dumping the filesystem to a file to test it like that. If it still croaks I'll sign up for FreeBSDCon and take it down there on an external drive.


----------



## Beeblebrox (May 6, 2012)

> cannot import 'email': pool already exists


1. Backup your pool somewhere
2. connect the HDD with the pool to the amd64 9 version you have installed
`# zpool list`
3. Is email listed? Does it say ONLINE or FAULTED?
4. If ONLINE or FAULTED: the 1st command gives general info, the 2nd partition and zpool error info, the 3rd runs the repair steps.
`# zpool get all email`
`# zpool status -v email`
`# zpool scrub email`
5. If you get a kernel panic: the 2nd command above showed the info for the slice. Why was the pool online or visible without an import to begin with? REBOOT and (assuming pool email has no sub-datasets):
`# zfs get all email`
`# zfs set canmount=noauto email`
`# zfs unmount email`
`# zpool export email`
Now you have more control over the pool, since you can mount/unmount as you like. Now to start recovery. Some reading first though, so as not to make any mistakes during the process:
- Read the section "Repairing ZFS Storage Pool-Wide Damage" in http://docs.oracle.com/cd/E19963-01/html/821-1448/gbbwl.html
- This is also good: http://docs.oracle.com/cd/E19082-01/817-2271/gavwg/index.html. It suggests running
`# zpool history email`
to see where and how exactly the error messages start showing up. I strongly urge you to do a full read of the second link before beginning the procedure.



> Wasn't sure if the altroot was an option or path...

From post #15:

> The mountpoint in import needs to be folder name, not device.


----------



## hackish (May 6, 2012)

Beeblebrox said:
> 1. Backup your pool somewhere



In progress:
`# dd if=/dev/ad16s1g > zfsimage.dat`
Hope I'm doing it right...



> 2. connect the HDD with the pool to the amd64 9 version you have installed
> `# zpool list`
> 3. Is email listed? Does it say ONLINE or FAULTED?


2 is done, albeit a bit complicated since I have 2 pools named email.
`# zpool list`
Does not show the other pool, but 
`# zpool import` 
shows

```
pool: email
    id: 10433152746165646153
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        email       ONLINE
          ada1s1g   ONLINE
```



> 4. If online or faulted, 1st=general info 2nd=partition and zpool error info 3rd=run the repair steps.
> `# zpool get all email`



Given that it doesn't find the pool on that disk, step 4 fails.


----------



## Beeblebrox (May 6, 2012)

> 2 is done albeit a bit complicated since I have 2 pools named email.





> cannot import 'email': pool already exists


Apologies, but at this point you deserve a "facepalm", buddy!
How can you expect to mount 2 ZFS pools with the same name at the same time? Let's rename the pool you want to restore / rescue:
`# zpool import -f -R <folder_name> email <newname>`
folder_name is a folder path in root (/); newname can be anything you desire. To make it easy, you can make both the same, so that next time you import it mounts automatically to <newname>.


----------



## hackish (May 6, 2012)

Please see post #25

`# zpool import -f -R /altroot 10433152746165646153 olddata`

The only reason there are 2 pools named email is that one is the original that I'm trying to get the data off.


----------



## Beeblebrox (May 6, 2012)

Ah! Missed that one...


> reason there are 2 pools named email is..


Yeah, I figured that out.

Post #25 also states that when you import the pool that way, the system crashes. The 1st link I posted has several different methods worth trying in order to get the pool imported. One is:
`# zpool import -f -o readonly=on -R /newname email newname`

Hope you get it sorted out...


----------



## t1066 (May 6, 2012)

Would you try a minimalistic approach? Remove the new disk. Boot your system into single-user mode, preferably using a 9.0-RELEASE CDROM. Run

`# zpool import`

to see if your pool is recognized. If so, run

`# zpool import email`

or run

`# zpool import -f email`

if the above command fails. If either one of the above imports works, you should then scrub the pool.


----------



## hackish (May 6, 2012)

t1066:
I have tried this a number of times.
As soon as ZFS starts up, the kernel panics. Looking at the backtrace, it seems to happen as soon as the system tries to auto-scrub. I was looking at `# zpool scrub -s`, but I think it will be a race. As soon as the filesystem copy is done I'll take a few more cracks at it. With the filesystem dumped to a file via dd, I think it will be easier to "play" with.
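Once the dd image exists, one relatively safe way to "play" with it is to attach it as a memory disk and import from that, read-only (a sketch; the image path is a placeholder, and the numeric id is the one from the `zpool import` listing earlier in the thread):

```
# mdconfig -a -t vnode -f /path/to/zfsimage.dat    # prints the md unit, e.g. md0
# zpool import -d /dev -f -o readonly=on -R /altroot 10433152746165646153 olddata
# ...inspect and copy data off...
# zpool export olddata
# mdconfig -d -u 0                                 # detach the memory disk
```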


----------



## Beeblebrox (May 6, 2012)

> it seems to happen as soon as the system tries to auto-scrub.


That's why I suggested the read-only mount (or any other method which will prevent the auto-scrub from running). Then you can hopefully (maybe) copy the data off from the pool.
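Putting that together, a possible rescue sequence might look like this (a sketch; the destination directory is an assumption, and the numeric id is from the earlier `zpool import` listing):

```
# zpool import -f -o readonly=on -R /altroot 10433152746165646153 olddata
# tar -C /altroot -cf - . | tar -C /email/restore -xf -   # copy everything off
# zpool export olddata
```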


----------



## hackish (May 6, 2012)

Beeblebrox said:

> That's why I suggested the read-only mount (or any other method which will prevent the auto-scrub from running). Then you can hopefully (maybe) copy the data off from the pool.



Yes, good point, I didn't think of that. As soon as the dd has completed its 450 GB dump I'll try it out.


----------



## hackish (May 7, 2012)

Beeblebrox, thanks for your help. Mounting the filesystem read-only allowed me to read all the data from the volume. 100% of it was recovered and no files were corrupted. I've kept an image of the damaged filesystem so that on my own time I can try to find and fix the bug in the kernel.


----------



## Beeblebrox (May 7, 2012)

Glad to hear!
Quick note I reserved for the end: you were using an MBR partition structure on your first HDD; I hope you are using a GPT structure on your second / new HDD.

ZFS on MBR results in a logical volume inside a logical partition (ada1s1g) and, though I'm not 100% sure on this, such a setup may have contributed to the initial problem. ZFS should play much nicer with an allocated partition named ada1p<n>, which you'll get under GPT. Good luck...
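For reference, a GPT layout for a dedicated ZFS disk can be set up roughly like this (a sketch; the disk name, GPT label, and pool name are placeholders):

```
# gpart create -s gpt ada1
# gpart add -t freebsd-zfs -l zfsdisk0 ada1
# zpool create <poolname> gpt/zfsdisk0
```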


----------

