# ZFS kernel panic



## mcgee (Aug 28, 2011)

I'm having serious ZFS kernel panics, and I don't know how to fix them.

I bought a used server to set up my "big box" home file server, intending to run a pretty basic setup with 8.2R and using ZFS to manage the storage pool. I'm new to ZFS (I didn't get much exposure to FBSD 5 thru 7), so I wanted to take time to familiarize myself: set up some dummy file-backed pools and poke them with a stick until I understood what I'd need to do in the event of a real disk failure.

So I made some empty backing files of 100Mb each using dd, and created a raidz pool from them (this is from memory, because the history evaporates when the box panics, but I've done it a few times by now):

```
cd /usr/z
dd if=/dev/zero of=disk0 bs=512 count=204800
dd if=/dev/zero of=disk1 bs=512 count=204800
dd if=/dev/zero of=disk2 bs=512 count=204800
dd if=/dev/zero of=disk3 bs=512 count=204800
zpool create tank raidz /usr/z/disk0 /usr/z/disk1 /usr/z/disk2
echo hellohowareyou > /tank/hello
zpool add tank spare /usr/z/disk3
zpool status tank
```
So far, so good: `zpool status` shows me nothing unexpected. Now I simulate a disk failure and try to make ZFS aware of it:

```
dd if=/dev/zero of=disk1 bs=512 count=204800
zpool scrub tank
```
And that's when the kernel panic happens. After this, I can reboot over and over, and any time I bring ZFS online with a zpool command (it's *not* enabled in rc.conf) to list my pools, show their status, or whatever... boom, another kernel panic.

This happens on at least two FreeBSD builds. I started with 8.2R-i386 and fought to make it work (my box can only run 32-bit VMs on top of ESXi, which is a separate challenge, but for this test I was running on the metal), then gave up and tried 8.2R-amd64, and got the same kernel panic under the same circumstances. It is always a fatal trap 12: page fault, supervisor read, page not present. The instruction, stack, and frame pointer values are the same from crash to crash.

As a bonus, my box seems incapable of doing a crash dump. It looks like it starts, but then it locks up and doesn't dump, nor automatically reboot; after I bounce it, crashinfo says there's nothing there.
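For reference, the usual way to arrange crash dumps on 8.x looks something like this (a sketch; the device name is an example, adjust it to the local swap partition):

```shell
# In /etc/rc.conf, let rc pick the configured swap device as the dump device:
#   dumpdev="AUTO"

# Or point dumpon(8) at a swap partition manually for the current boot:
dumpon /dev/ad4s1b

# After the next panic and reboot, savecore(8) pulls the dump off swap
# into /var/crash, and crashinfo(8) summarizes it:
savecore /var/crash /dev/ad4s1b
crashinfo
```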

My hardware is a Dell PowerEdge 1800, dual 3GHz Xeon (Nocona, I think; HTT yes, VT no), 2Gb RAM. It also has a Dell CERC SATA 1.5/6ch RAID controller, but for these exercises I wasn't using it; I used a separate disk on the onboard SATA port instead.

I'm not sure where to go from here. I ran Memtest86+ for a while, and got a passing grade. I (basically) tried two different OSes, on different HDDs, and got the same failure. That points to hardware, but the failure mode points to software. What else can I try?


----------



## Crest (Aug 28, 2011)

Use mdconfig(8) to create file-backed md devices, and use those md devices to create your pool. Do yourself a favor and don't use ZFS on i386, or on amd64 with less than 4GiB of RAM. (2Gb = 2 gigabits = 256MiB?)
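Something along these lines, reusing the backing files from your earlier example (md unit numbers may differ on your box):

```shell
cd /usr/z
# Attach each backing file as a vnode-backed memory disk;
# mdconfig prints the name of the device it created (md0, md1, ...)
mdconfig -a -t vnode -f /usr/z/disk0
mdconfig -a -t vnode -f /usr/z/disk1
mdconfig -a -t vnode -f /usr/z/disk2
mdconfig -a -t vnode -f /usr/z/disk3

# Build the pool from real block devices instead of plain files
zpool create tank raidz md0 md1 md2
zpool add tank spare md3

# Later, to tear down:
#   zpool destroy tank
#   mdconfig -d -u 0   (and so on for md1..md3)
```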


----------



## mcgee (Aug 29, 2011)

Thank you Crest, using md devices worked well. I'm still curious about the kernel panics when using files directly, as this is explicitly supported by ZFS -- I wonder if it's worth making a bug report?


----------



## Terry_Kennedy (Aug 29, 2011)

mcgee said:
> I'm still curious about the kernel panics when using files directly, as this is explicitly supported by ZFS -- I wonder if it's worth making a bug report?


If it isn't too much trouble, you might want to try it on 8-STABLE - either by updating your tree and re-building, or from a recent snapshot. There have been a lot of changes in the ZFS code since 8.2-RELEASE - in particular, ZFS v28 was MFC'd. Needless to say, don't upgrade any pools to v28 if you want to be able to use them on older releases.
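To see what versions are in play before touching anything (both forms are read-only until you name a pool):

```shell
# Show the ZFS pool versions the running kernel supports
zpool upgrade -v

# With no arguments, list pools whose on-disk format is older than
# what the system supports -- this does NOT upgrade anything
zpool upgrade
```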


----------



## phoenix (Aug 29, 2011)

Yes, ZFS supports using files as the backing store for a vdev. However, it rarely works well in practice. It's really only meant for quick testing and prototyping, and shouldn't be the basis for any real decisions.

Use real block devices (like md(4)-backed files) if you want to do real testing with real failure modes.


----------



## mcgee (Aug 30, 2011)

phoenix, that's exactly what I thought I was doing: setting up a sandbox where I could play with ZFS and learn the ropes before trying to use it in production. If file-backed vdevs were just a bad idea, worked poorly, or didn't work at all, that would be one thing... but consistent, full-bore, show-stopping kernel panics are something else, and IMO something that must be fixed. Per Terry's suggestion, I'm spinning up some VMs to try to elicit the same bug on -STABLE, both i386 and amd64... in between other tasks, so it may take a few days.


----------



## mcgee (Sep 18, 2011)

Terry_Kennedy said:
> If it isn't too much trouble, you might want to try it on 8-STABLE - either by updating your tree and re-building, or from a recent snapshot. There have been a lot of changes in the ZFS code since 8.2-RELEASE - in particular, ZFS v28 was MFC'd. Needless to say, don't upgrade any pools to v28 if you want to be able to use them on older releases.



A bit delayed, but no matter. I spun up a new ESXi VM with 8 CPUs and 12Gb RAM, installed 8.2-STABLE amd64 from source, and repeated my file-backed experiment on ZFS v28. I got the same kernel panic.

This time I caught the crash dump, though I don't really know how to read it. One possibly useful thing I spotted, in the segment after `Loaded symbols for /boot/kernel/opensolaris.ko`, is this:


```
#5  0xffffffff808c6d4f in trap (frame=0xffffff83642386f0)
    at /usr/src/sys/amd64/amd64/trap.c:477
#6  0xffffffff808aec24 in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:228
#7  0xffffffff8107d6ed in vdev_file_io_start (zio=0xffffff010af24a50)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c:157
#8  0xffffffff81098137 in zio_vdev_io_start (zio=0xffffff010af24a50)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:2374
#9  0xffffffff81097d63 in zio_execute (zio=0xffffff010af24a50)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1196
```

What I think I see is that everything was fine until vdev_file_io_start, which generated the page fault that cascaded into the panic. So if I were debugging, I guess I'd start there. But I'm not a kernel hacker.
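For the record, the backtrace above came out of kgdb along these lines (the kernel and vmcore paths are the usual defaults; yours may differ):

```shell
# Open the crash dump against the matching kernel
kgdb /boot/kernel/kernel /var/crash/vmcore.0

# Then, inside kgdb:
#   bt          # full backtrace (the frames quoted above)
#   frame 7     # select the vdev_file_io_start frame
#   list        # show the surrounding source lines
```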


----------

