# zfs snapshot NFS hang



## peetaur (Feb 8, 2012)

So, I have an issue where, after some point (an arbitrary number of snapshots? a specific snapshot? gremlins?), exporting a directory that contains a .zfs directory and then listing that directory from the NFS client causes a total hang of the dataset and of some commands such as "zdb -d poolname". This happens whether or not snapdir=hidden is set. I don't know the root cause, so I don't know how to reproduce it or file a PR.

e.g.
`# echo /tank/dataset -maproot=root 10.10.10.10 >> /etc/exports`
``# kill -HUP `cat /var/run/mountd.pid` ``
`# tail /var/log/messages`

```
Feb  8 15:47:54 bcnas1 mountd[46760]: can't delete exports for /tank/dataset/.zfs/snapshot/replication-20120204134001: Invalid argument
Feb  8 15:47:54 bcnas1 mountd[46760]: can't delete exports for /tank/dataset/.zfs/snapshot/replication-20120208140000: Invalid argument
...
```
(I would think the above shows that the NFS server is not really compatible with this situation / buggy)
`# ssh 10.10.10.10 "mount bcnas1:/tank/dataset /mountpoint ; ls /mountpoint/.zfs/snapshot"`
(hang on this command, and anything else after this point using the same dataset)
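
To see how many snapshots mountd complains about, the error lines can be reduced to bare paths. This is only a sketch: the helper name `failed_exports` is made up for illustration, and it assumes every error line has the same shape as the two shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the path from each mountd
# "can't delete exports for <path>: Invalid argument" line in a log file.
# Lines that do not match that shape are ignored.
failed_exports() {
    sed -n "s/.*mountd\[[0-9]*]: can't delete exports for \(.*\): Invalid argument\$/\1/p" "$1"
}

# Example (on the server, using the log file from this post):
#   failed_exports /var/log/messages | sort -u | wc -l
```

Comparing that list with the snapshots that look wrong on the client might show whether the two sets really overlap.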

At one point this would not hang. Instead, it showed many snapshot directories correctly, but many other snapshots (all of the ones listed in the /var/log/messages errors, and more) appeared as strange binary files or as directories with the wrong files in them (it would show me a subdirectory of the snapshot instead of its root).

All of these problems happen whether or not I set snapdir=hidden or snapdir=visible.

Today an identical problem happened, and I rebooted to fix it. I don't know if it was the same cause, but I would like to find out.

Can someone give me ideas on how to track down the problem, or tell me which source files in /usr/src I should open up, hack apart, or add debugging to, so that I can either:

- find the root cause of the problem, or
- prevent NFS from exporting any .zfs directories?

Or does someone know if this has been fixed in the latest 8-STABLE or 9?

I mentioned this problem briefly before in my old thread here.

The only 'perfect' workaround I can think of is reorganizing all my datasets so the root directory is empty except one directory, and then share only that subdirectory which does not contain a .zfs directory. But ideally, nfs clients should be able to view snapshots to recover files.


----------



## SirDice (Feb 9, 2012)

Why don't you use sharenfs?

`# zfs set sharenfs="maproot=0" tank/dataset`

This only exports .zfs if it's not hidden.


----------



## peetaur (Feb 9, 2012)

Everything I read about FreeBSD's ZFS NFS says that it is not like Solaris, but just the same NFS server as the normal FreeBSD one, and acts exactly the same. Therefore, I would conclude that "sharenfs" only edits the /etc/zfs/exports file, and the normal daemon does the work including both that file and /etc/exports.

See output from `# ps aux | grep mount`

```
root  51407  0.0  0.0 19208  4556  ??  Is    7:59PM   0:00.01 /usr/sbin/mountd -r -p 876 /etc/exports /etc/zfs/exports
```

Are you sure of what you are saying? Test it by sharing a directory and testing to see if .zfs exists. Do not simply do `# ls -a`, but actually use the directory as if it existed: `# cd .zfs/snapshot; ls`.
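
The test described above can be sketched as a small function. The name `snapdir_reachable` is hypothetical; it only separates "does .zfs show up in a listing" from "can .zfs/snapshot actually be entered", which is the distinction that matters here.

```shell
#!/bin/sh
# Hypothetical helper: report whether .zfs is listed in <dir> and, separately,
# whether <dir>/.zfs/snapshot can be entered even if it is not listed.
snapdir_reachable() {
    dir=$1
    if ls -a "$dir" | grep -qx '\.zfs'; then
        echo "listed: yes"
    else
        echo "listed: no"
    fi
    # cd runs in a subshell so the caller's working directory is untouched.
    if (cd "$dir/.zfs/snapshot" 2>/dev/null); then
        echo "reachable: yes"
    else
        echo "reachable: no"
    fi
}

# Example (on the NFS client):
#   snapdir_reachable /mountpoint
```

With snapdir=hidden, "listed: no" together with "reachable: yes" would mean the directory is still exported.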


----------



## SirDice (Feb 9, 2012)

peetaur said:

> Are you sure of what you are saying? Test it by sharing a directory and testing to see if .zfs exists. Do not simply do `# ls -a`, but actually use the directory as if it existed: `# cd .zfs/snapshot; ls`.


Yep, you're right. 


```
dice@molly:~>zfs list tank/FreeBSD/ports
NAME                 USED  AVAIL  REFER  MOUNTPOINT
tank/FreeBSD/ports  19.7G  1.29T   362M  /usr/ports
dice@molly:~>cd /usr/ports/.zfs/snapshot
dice@molly:/usr/ports/.zfs/snapshot>ls
20111225 20120112 20120123 20120131 20120204
```

But, it's not an issue on my side.


```
dice@williscorto:~>mount | grep /usr/ports
molly:/usr/ports on /usr/ports (nfs, read-only)
molly:/usr/ports/distfiles on /usr/ports/distfiles (nfs, read-only)
molly:/usr/ports/packages on /usr/ports/packages (nfs, read-only)
dice@williscorto:~>cd /usr/ports/.zfs/snapshot
dice@williscorto:/usr/ports/.zfs/snapshot>ls
20111225 20120112 20120123 20120131 20120204
```


----------



## peetaur (Feb 9, 2012)

Maybe `# ls -l` or `# find . -maxdepth 2` would be more likely to crash it.

But also, it looks like you don't have as many snapshots as I do.


```
# ls -1 /tank/.zfs/snapshot/ | wc -l
     751
```
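
If the snapshot count matters, a per-dataset breakdown may help narrow things down. The following is a rough sketch: `count_per_dataset` is a made-up name, and it assumes its input is one `dataset@snapshot` name per line, as printed by `zfs list -H -t snapshot -o name`.

```shell
#!/bin/sh
# Hypothetical helper: read snapshot names (dataset@snapshot, one per line)
# on stdin and print "<count> <dataset>", largest count first.
count_per_dataset() {
    awk -F@ 'NF == 2 { n[$1]++ } END { for (d in n) print n[d], d }' | sort -rn
}

# Example (on the server):
#   zfs list -H -t snapshot -o name | count_per_dataset
```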

And thanks very much for the try. I really wish I had help fixing this.


----------



## SirDice (Feb 9, 2012)

peetaur said:

> Maybe `# ls -l` or `# find . -maxdepth 1` would be more likely to crash it.


I tried a `# find /usr/ports/.zfs/snapshot -type f`, which produced a humongous list, but it didn't crash the server.

FWIW I'm running:

```
dice@williscorto:/usr/ports/.zfs/snapshot>uname -a
FreeBSD williscorto.dicelan.home 9.0-STABLE FreeBSD 9.0-STABLE #0: Wed Jan 25 13:03:03 CET 2012     root@molly.dicelan.home:/usr/obj/usr/src/sys/CORTO  amd64
```
Same version on the server.


----------



## peetaur (Feb 9, 2012)

I suggested using maxdepth so you would scan inside all snapshots instead of just the first before you hit CTRL+c.

And unfortunately, this bug is hard to reproduce... I only have the one server where it happens. I think even on the replicated server, the problem does not exist. Let's test that...

`# mount bcnas1bak:/tank/bcnasvm1 bcnasvm1 -o ro`
`# cd bcnasvm1/.zfs/snapshot`
`# ls -l`

No hang. But some results are strange and interesting.

Normal looking snapshot:

```
drwxr-xr-x 27 root    root          27 2012-01-12 17:45 daily-2012-01-13T00:00:09
```

Several strange ones (files instead of directories; also wrong owner):

```
-rw-r--r--  1 root    root           0 2012-01-03 15:33 daily-2012-01-18T00:00:00
-rw-r--r--  1 root    root        2674 2011-03-15 02:20 daily-2012-01-29T00:00:00
-rwxr-xr-x  1 openvpn sambashare  8355 2011-09-20 12:33 daily-2012-02-03T00:00:00
-rw-rw-r--  1 root    root        8689 2010-12-30 13:04 daily-2012-02-08T00:00:00
-rw-r--r--  1 root    sambashare 23778 2011-04-02 19:20 replication-20120130002001
```

One of those weird things where it is not the correct directory (also wrong owner):


```
drwxr-xr-x  2 openvpn users          3 2011-10-11 14:16 replication-20120202195044

root@peter:/mnt/bcnasvm1/.zfs/snapshot# ls -l replication-20120202195044
total 3
-r--r--r-- 1 openvpn users 704 2011-10-11 14:16 schema.m
```

(note there is no openvpn user on the nas, it just has the same uid as something on my workstation)

Some snapshots that seem to think they are symlinks:

```
lrwxrwxrwx  1 root    root          25 2011-05-17 11:40 replication-20120203042000 -> applications-graphics.png
lrwxrwxrwx  1 root    root          42 2011-08-05 08:43 replication-20120203044001 -> ../../C/figures/naut_sampleemblem_icon.png
```

And of course these files look fine without NFS:

`# ls -ld /tank/.zfs/snapshot/replication-20120203044001`

```
drwxr-xr-x  13 root  wheel  13 Jan 12 12:47 /tank/.zfs/snapshot/replication-20120203044001
```

I also ran `# find . -maxdepth 2` which didn't hang.

I wish it would hang... but of course, nothing ever hangs on the backup server... only on the most important one. (Must be some variation on Murphy's Law)
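
To find every affected snapshot without reading the whole listing by eye, something like the following could be run on the client. It is only a sketch under assumptions: `odd_snapshots` is a hypothetical name, and it only catches entries that show up as files, symlinks, or other non-directories. It cannot detect the case where a snapshot appears as a directory with the wrong contents, and it would hang like any other command if the mount itself hangs.

```shell
#!/bin/sh
# Hypothetical helper: for every entry in a .zfs/snapshot directory, print
# the ones the client does not see as a plain directory.
odd_snapshots() {
    snapdir=$1
    for entry in "$snapdir"/*; do
        # Skip the unexpanded pattern when the directory is empty.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        name=${entry##*/}
        if [ -L "$entry" ]; then
            echo "symlink $name"
        elif [ -d "$entry" ]; then
            :   # looks like a normal snapshot directory
        elif [ -f "$entry" ]; then
            echo "file $name"
        else
            echo "other $name"
        fi
    done
}

# Example (on the NFS client, path as in the listing above):
#   odd_snapshots /mnt/bcnasvm1/.zfs/snapshot
```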


----------



## SirDice (Feb 9, 2012)

peetaur said:

> I suggested using maxdepth so you would scan inside all snapshots instead of just the first before you hit CTRL+c.


I left it running until it finished. Maxdepth didn't do much but that's probably because I don't have a lot of snapshots.



> And unfortunately, this bug is hard to reproduce... I only have the one server where it happens.


The trickiest bugs are the ones that can't easily be reproduced.

I don't have any other ideas, except perhaps posting your problem to the freebsd-fs mailing list. Perhaps one of the developers can help out.


----------



## peetaur (Feb 9, 2012)

I just tested it with a FreeBSD NFS client, and it all looks correct. So I guess only the Linux client triggers this strange behavior. But the /var/log/messages error messages clearly show the server is messed up, not (only?) the client.


----------



## peetaur (Feb 15, 2012)

Last week I updated to the csup'd version I downloaded and built on Feb 4th, but it shows the same issues (other than the unreproducible hang).

However, it does seem to fix another hang I had (which I posted in PR #161968). This is why I decided to upgrade the running system now.


----------

