# Reading file over NFS results in corruption



## whoops (Sep 2, 2014)

I have a FreeBSD 10 server with a ZFS filesystem which our XenServer hosts backup to, and I discovered some rather disturbing behavior when I attempted to use to it distribute a patch to the XenServers, namely that when the client reads a file from the NFS share, the file is corrupted. An example using Pride and Prejudice can be found in this repository: https://github.com/Whoops/crazy-nfs

It looks like in the corrupted version, every so often in the middle of the file there will be a binary string, which is followed by the start of the file. Writing over NFS results in a correct file on disk, but the same client reading it back will see a corrupted version. Each client appears to get a different corrupted version. I've replicated this on several Xenserver 6.2 (kernel 2.6.32) clients as well as an Ubuntu 10.04  (3.11.0), however when I tested on our only other FreeBSD based machine (PFSense 2.1.4) it did *not* see the issue, so I can't rule this out as a Linux bug. I (thankfully) do not see this issue on anything connected to our main storage server, which runs Ubuntu 10.04.

This occurs with both NFS3 and 4. My exports file is pretty straightforward:


```
/backup/vm      -maproot=root -alldirs
V4: /backup/vm -sec sys
```

And I'm not using an special options on the client

```
10.15.10.251:/ on /mnt/backup type nfs4 (rw,addr=10.15.10.251)
```

I'm not sure where to go from here in troubleshooting this. I've been unable to find anyone with similar problems on the web, nor have I been able to find an NFS bug that sounds related.


----------



## junovitch@ (Sep 3, 2014)

Can't say I've seen something like that using Linux clients talking NFS to a FreeBSD NFS server.  Wierd.  Is it consistent as in does the same file get corrupted every time or does the file get corrupted in the same place?

My instinct would be to try to rule out any kind of corruption in transit.  Some ideas I would try:
Force a TCP mounting to add an extra layer of error checking.
SCP some files back and forth, take SHA hashes of the files before and after.


----------



## whoops (Sep 3, 2014)

All sufficiently large files seem to get corrupted, though I'm not sure what sufficiently large means. A file with only "Hello World" for example did not appear corrupted, while Alice in Wonderland appeared corrupted on most, but not all clients. Every one of our backups is corrupted when read from NFS, but fine when accessed over CIFS. I believe NFSv3 chooses TCP by default, while NSFv4 only uses TCP, but I did actually try explicitly including TCP in the option list. There are some more details here: http://serverfault.com/questions/625255 ... nfs-server, where I compared two downloads of Alice in Wonderland across 3 different mounts.

Interestingly, while each client gets a different picture, it appears to get a consistent picture, as in if a client checksums the file multiple times, the checksum will be the same each time. I don't know if NFS has any caching that comes into play on that. Tomorrow I should try rebooting, or maybe just reading other large files to clear the cache, and see if I get a different version. With the way the start of the file gets reinserted at various points, I can't help but think the binary portion is probably part of the NFS protocol, but I don't have nearly enough knowledge to test that theory.


----------



## Sebulon (Sep 3, 2014)

Hi @whoops!

Have you tried mounting explicitly with "-o sync" on the linux clients? I know that a difference between FreeBSD- and linux clients is that BSD's mount as sync by default, while linux mount with async by default.

/Sebulon


----------



## SirDice (Sep 3, 2014)

Since you mentioned ZFS, are you using a ZIL with it? Try turning that off. The corruption may also be caused by some bad L2ARC drive but I would expect the file to be corrupt locally too. That doesn't seem to be the case?

https://wiki.freebsd.org/ZFSTuningGuide#NFS_tuning


----------



## kpa (Sep 3, 2014)

SirDice said:
			
		

> Since you mentioned ZFS, are you using a ZIL with it? Try turning that off. The corruption may also be caused by some bad L2ARC drive but I would expect the file to be corrupt locally too. That doesn't seem to be the case?
> 
> https://wiki.freebsd.org/ZFSTuningGuide#NFS_tuning



ZIL should have no effect since it's only written to under normal operating conditions. The only time when the contents of the ZIL are read is when the pool is imported after a system crash that has left the pool in a non-consistent state and the ZIL needs to be replayed.


----------



## whoops (Sep 3, 2014)

Sebulon said:
			
		

> Hi @whoops!
> Have you tried mounting explicitly with "-o sync" on the Linux clients?



I didn't get the chance to do most of the testing I wanted to today, but I did test this, and no, async vs sync didn't seem to change anything. Also it appears that the client will get the same version for the duration of the mount, but if I unmount and remount the share, it gets a, still corrupted, but different version of the file. For the time being I have all real backups and restores going over CIFS, but I'd still like to get this figured out.

At this point I'm confident enough that it's a bug in the server that I think maybe it's time to figure out how to file a bug report.


----------



## Sebulon (Sep 4, 2014)

whoops said:
			
		

> I didn't get the chance to do most of the testing I wanted to today, but I did test this, and no, async vs sync didn't seem to change anything. Also it appears the the client will get the same version for the duration of the mount, but if I unmount and remount the share, it gets a, still corrupted, but different version of the file. For the time being I have all real backups and restores going over CIFS, but I'd still like to get this figured out.
> 
> At this point I'm confident enough that it's a bug in the server that I think maybe it's time to figure out how to file a bug report.



Shame it just couldn´t be something simple like that  Xenserver 6.2 is new, while your Ubuntu is quite old except, as it happens on both of them, that shouldn´t be the issue either... Please show output of:
`# zpool status`
`# gpart show`
`# zdb | grep ashift`
`# cat /boot/loader.conf`
`# echo "$(grep 'avail memory' /var/run/dmesg.boot | awk '{print $4}')/1024/1024/1024" | bc`

Just to get a clearer picture.

/Sebulon


----------



## junovitch@ (Sep 4, 2014)

By any chance, are you using PF with any scrub rules somewhere where PF's fragment reassembly can interfere with NFS's fragments?


----------



## SirDice (Sep 4, 2014)

If possible I'd also try to serve the same file from an UFS partition. Just to rule out any weird ZFS-NFS interactions.


----------



## whoops (Sep 4, 2014)

junovitch said:
			
		

> By any chance, are you using PF with any scrub rules somewhere where PF's fragment reassembly can interfere with NFS's fragments?


No firewall actually. Good thought though, got my hopes up for a minute, couldn't remember if that system had one.



			
				SirDice said:
			
		

> If possible I'd also try to serve the same file from an UFS partition. Just to rule out any weird ZFS-NFS interactions.


Good thought too! I really should have tried that without prompting.

From the NFS server (filtered to relevant mounts):

```
[root@barmaid /backup/vm]# mount
tank/root on / (zfs, NFS exported, local, nfsv4acls)
backup/vm on /backup/vm (zfs, NFS exported, local, nfsv4acls)
/dev/ada9p1 on /mnt/ssd (ufs, NFS exported, local, soft-updates)

[root@barmaid /backup/vm]# cat /etc/exports
/backup/vm      -maproot=root -alldirs
V4: /backup/vm -sec sys
/export -maproot=root -alldirs
/mnt/ssd -maproot=root -alldirs

[root@barmaid /backup/vm]# md5 pg1342.txt ISI.PFSense#2014-Aug-27-14-15.xva
MD5 (pg1342.txt) = fb2a25e4343052b1f7e49290ea9a33ba
MD5 (ISI.PFSense#2014-Aug-27-14-15.xva) = 4e88f82926bc3de7b3188d39804f3343
```

From the XenServer:

```
10.15.10.251:/backup/vm on /mnt/test type nfs (rw,addr=10.15.10.251)
10.15.10.251:/export on /mnt/tank type nfs (rw,addr=10.15.10.251)
10.15.10.251:/mnt/ssd on /mnt/ufs type nfs (rw,addr=10.15.10.251)

[root@SuperMicro1 /]# md5sum /mnt/test/pg1342.txt /mnt/test/ISI.PFSense#2014-Aug-27-14-15.xva
dbd955855006c1ac1385286e48c495cf  /mnt/test/pg1342.txt
245b168a537e08f64c77565630954d7c  /mnt/test/ISI.PFSense#2014-Aug-27-14-15.xva

[root@SuperMicro1 /]# md5sum /mnt/tank/*
b96613521cee59e0b2c08d0ac4f0833b  /mnt/tank/ISI.PFSense#2014-Aug-27-14-15.xva
687e9807d8399d1802beab50ebd48f2d  /mnt/tank/pg1342.txt

[root@SuperMicro1 /]# md5sum /mnt/ufs/*
ef6a9f48ea40fbecc23b9b5fb2306c60  /mnt/ufs/ISI.PFSense#2014-Aug-27-14-15.xva
a4d23c1287597fb2ccdca9d2f0c01718  /mnt/ufs/pg1342.txt
```

From Ubuntu:

```
root@grunt:~# md5sum /mnt/ufs/*
1936c8314cc05d32dbd889c1dbda1f6e  /mnt/ufs/ISI.PFSense#2014-Aug-27-14-15.xva
6685069d60b1c68c87cc7adb5e06da64  /mnt/ufs/pg1342.txt
```

Damn, I was really, really hoping it'd be specific to ZFS.



			
				Sebulon said:
			
		

> Shame it just couldn´t be something simple like that  Xenserver 6.2 is new, while your Ubuntu is quite old except, as it happens on both of them, that shouldn´t be the issue either... Please show output of:
> `# zpool status`
> `# gpart show`
> `# zdb | grep ashift`
> ...




```
[root@barmaid /backup/vm]# zpool status
  pool: backup
 state: ONLINE
  scan: scrub repaired 0 in 5h14m with 0 errors on Sat Jun 28 00:26:06 2014
config:

	NAME           STATE     READ WRITE CKSUM
	backup         ONLINE       0     0     0
	  raidz2-0     ONLINE       0     0     0
	    ada0       ONLINE       0     0     0
	    raid/r128  ONLINE       0     0     0
	    ada2       ONLINE       0     0     0
	    raid/r0    ONLINE       0     0     0
	    ada4       ONLINE       0     0     0
	    ada6       ONLINE       0     0     0
	    raid/r1    ONLINE       0     0     0
	    raid/r2    ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: none requested
config:

	NAME                      STATE     READ WRITE CKSUM
	tank                      ONLINE       0     0     0
	  diskid/DISK-L2173FXHp3  ONLINE       0     0     0

errors: No known data errors

[root@barmaid /backup/vm]# gpart show
=>       34  195371501  diskid/DISK-L2173FXH  GPT  (93G)
         34        128                     1  freebsd-boot  (64K)
        162   16777216                     2  freebsd-swap  (8.0G)
   16777378  178594157                     3  freebsd-zfs  (85G)

=>       34  468862061  ada9  GPT  (224G)
         34          6        - free -  (3.0K)
         40  468862048     1  freebsd-ufs  (224G)
  468862088          7        - free -  (3.5K)

[root@barmaid /backup/vm]# zdb | grep ashift
            ashift: 12
            ashift: 9

[root@barmaid /backup/vm]# cat /boot/loader.conf
zfs_load="YES"
vfs.root.mountfrom="zfs:tank/root"
vfs.zfs.arc_max="2G"
vfs.zfs.zil_disable="1"

[root@barmaid /backup/vm]# echo "$(grep 'avail memory' /var/run/dmesg.boot | awk '{print $4}')/1024/1024/1024" | bc
7
```

I'll try putting together a second FreeBSD 10 server to test with and see if I see the same behavior. With a problem like this I feel like there's got to be an edge case I'm not seeing somewhere.


----------



## junovitch@ (Sep 4, 2014)

Next thing I can think of would be to just get a sanity check by posting the output of `sysctl vfs.nfsd` and counters shown with `nfsstat -e`.  Also, is this a 10.0-RELEASE system or 10.0-STABLE?  If STABLE, is this before or after the NFSv4.1 support was merged into the tree?


----------



## whoops (Sep 4, 2014)

It's FreeBSD 10.0-RELEASE-p7


```
[administrator@barmaid ~]$ sysctl vfs.nfsd
vfs.nfsd.disable_checkutf8: 0
vfs.nfsd.nfs_privport: 0
vfs.nfsd.server_min_nfsvers: 2
vfs.nfsd.server_max_nfsvers: 4
vfs.nfsd.async: 0
vfs.nfsd.mirrormnt: 1
vfs.nfsd.commit_blks: 0
vfs.nfsd.commit_miss: 0
vfs.nfsd.issue_delegations: 0
vfs.nfsd.enable_locallocks: 0
vfs.nfsd.tcphighwater: 0
vfs.nfsd.udphighwater: 500
vfs.nfsd.tcpcachetimeo: 43200
vfs.nfsd.cachetcp: 1
vfs.nfsd.fha.enable: 1
vfs.nfsd.fha.bin_shift: 22
vfs.nfsd.fha.max_nfsds_per_fh: 8
vfs.nfsd.fha.max_reqs_per_nfsd: 0
vfs.nfsd.fha.fhe_stats: No file handle entries.
vfs.nfsd.minthreads: 4
vfs.nfsd.maxthreads: 4
vfs.nfsd.threads: 4
vfs.nfsd.request_space_used: 0
vfs.nfsd.request_space_used_highest: 1138116
vfs.nfsd.request_space_high: 47185920
vfs.nfsd.request_space_low: 31457280
vfs.nfsd.request_space_throttled: 0
vfs.nfsd.request_space_throttle_count: 0
```

I don't know what pretty much any of that means, so I'll assume it's sane   .


```
[administrator@barmaid ~]$ nfsstat -e
Client Info:
Rpc Counts:
  Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
       39         0         6         0       577         0         0         3
   Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
        0         0         0         0         0         4         0         5
    Mknod    Fsstat    Fsinfo  PathConf    Commit   SetClId SetClIdCf      Lock
        0         1         1         0         0         2         2         0
    LockT     LockU      Open   OpenCfr
        0         0         1         1
OpenOwner     Opens LockOwner     Locks    Delegs  LocalOwn LocalOpen LocalLOwn
        0         0         0         0         0         0         0         0
LocalLock
        0
Rpc Info:
 TimedOut   Invalid X Replies   Retries  Requests
        0         0         0         0       657
Cache Info:
Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
    36983         9        11         6     36338       577         0         0
BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses
        0         0        12         4         8         0

Server Info:
  Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
   761777        13       377         0    993981   2363321         9        14
   Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
       12         0         0         0         0        24        48      6673
    Mknod    Fsstat    Fsinfo  PathConf    Commit   LookupP   SetClId SetClIdCf
        0       118        15         2     18396         4        10        10
     Open  OpenAttr OpenDwnGr  OpenCfrm DelePurge   DeleRet     GetFH      Lock
       15         0         0        15         0         0       131         0
    LockT     LockU     Close    Verify   NVerify     PutFH  PutPubFH PutRootFH
        0         0        15         0         0    759637         0        20
    Renew RestoreFH    SaveFH   Secinfo RelLckOwn  V4Create
      208        15        15         0         0         0
Server:
Retfailed    Faults   Clients
        0         0         1
OpenOwner     Opens LockOwner     Locks    Delegs
        0         0         0         0         0
Server Cache Stats:
   Inprog      Idem  Non-idem    Misses CacheSize   TCPPeak
        0         0         0   3402965         1        25
```


----------



## junovitch@ (Sep 4, 2014)

Alright.  Your results look pretty much like mine.  Normally at this point I would start gathering data on what's going over the wire and doing a bit of compare and contrast work.  I've debugged some pretty dumb issues this way just comparing the difference between some traffic where things work and some traffic where it doesn't work.  If you want to do that and post the results I can do the same on an Ubuntu machine and FreeBSD NFS server I have and compare the difference.  No guarantee anything will come out it but it's worth a try.

Something along the lines of this should do the trick:

1. Start with a client and server where the client has not mounted the server.
2. Start a TCPdump on the server side, NFSv4 should be fine and probably easier because we should only need one port.  I don't think the RPC stuff will be needed so just the following for NFSv4 should be good:
`tcpdump -i em0 -w /tmp/nfs-capture.pcap port 2049 and host 10.15.10.XX`
3. On the client, mount the NFS server
4. Run some commands to do some different operations and bump up different counters in the `nfsstat -e` output.  Something like this should be good.

```
md5 somefile
touch anotherfile
rm anotherfile
mkdir somedir
rmdir somedir
```
5. Unmount on the client side
6. Stop the capture and post it to a download site somewhere. I'll pull it up in Wireshark and compare it with some known good results that I can do with my systems.


----------



## whoops (Sep 5, 2014)

Sounds like a plan, thanks for all your help in this. I have an intern starting tomorrow, which always eats huge chunks of time, so I probably won't be able to get to running it till Monday.


----------



## Sebulon (Sep 5, 2014)

whoops said:
			
		

> ```
> [root@barmaid /backup/vm]# zpool status
> pool: backup
> state: ONLINE
> ...



So the disks have been added raw, which may or may not work out OK, depending on the drive I guess, and ashift 12 is good to see. On the negative side, you haven´t scrubbed the pool since late June, so you may have problems you just don´t know about yet, especially since you disabled the ZIL. Please remove that line from loader.conf and reboot the server promptly. Now, that is not the answer to your problems, since serving from UFS exhibits the same behaviour but trust me, if you ever have a black-out, you´ll be sorry when ZFS doesn´t have anything to rewind to to get back to a consistent state.

/Sebulon


----------



## whoops (Sep 8, 2014)

OK, here we go. /backup/vm/test has two files in it, the FreeBSD iso found at: ftp://ftp.freebsd.org/pub/FreeBSD/relea ... 4-dvd1.iso and a text version of Pride and Prejudice found here: http://www.gutenberg.org/cache/epub/1342/pg1342.txt

Capture command run on server: `tcpdump -i bxe0 -w /tmp/nfs-capture.pcap port 2049 and host 10.15.4.5`

The capture is of the following interaction:


```
administrator@grunt:~$ sudo mount -t nfs4 10.15.10.251:/test /mnt/test
administrator@grunt:~$ mount
/dev/xvda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /proc/xen type xenfs (rw)
rpc_pipefs on /run/rpc_pipefs type rpc_pipefs (rw)
10.15.10.251:/test on /mnt/test type nfs4 (rw,addr=10.15.10.251,clientaddr=10.15.4.5)
administrator@grunt:~$ cd /mnt/test/
administrator@grunt:/mnt/test$ ls
FreeBSD-10.0-RELEASE-amd64-dvd1.iso  pg1342.txt
administrator@grunt:/mnt/test$ md5sum *
f289f5841d37124acaefbbb741f7d5bb  FreeBSD-10.0-RELEASE-amd64-dvd1.iso
fb2a25e4343052b1f7e49290ea9a33ba  pg1342.txt
administrator@grunt:/mnt/test$ sudo touch foo
administrator@grunt:/mnt/test$ sudo mkdir test
administrator@grunt:/mnt/test$ rm -rf foo test
rm: cannot remove `foo': Permission denied
rm: cannot remove `test': Permission denied
administrator@grunt:/mnt/test$ sudo !!
sudo rm -rf foo test
administrator@grunt:/mnt/test$ cd ..
```

The capture is unfortunately quite large (2.1G), as it contains the entire FreeBSD DVD, but I'm not sure how to get around that. In the interaction above, the checksum for Pride and Prejudice is correct, but the one for FreeBSD is wrong. The larger the file (unsurprisingly) the more likely the read is corrupted.

The full capture can be found here: https://pinnpointebackups.blob.core.win ... pture.pcap


----------



## junovitch@ (Sep 9, 2014)

It caught my eye that there are a massive amount of Duplicate ACKs, Out of Orders, and fragment reassembly errors.  I didn't even have to look too hard. Then I noticed you have a bxe card.  Does it sound like this affects you from the 10.0-RELEASE errata?  Since NFS uses a relatively big protocol data unit size (rsize/wsize) compared to CIFS and other protocols, fragmenting it would get offloaded to the NIC and this seems to explain the behavior we have here.

https://www.freebsd.org/releases/10.0R/errata.html


> The bxe(4) driver can cause packet corruption when TSO (TCP Segmentation Offload) feature is enabled. This feature is enabled by default and can be disabled by using a -tso parameter of ifconfig(8). It can be specified in rc.conf(5) like the following:
> 
> ifconfig_bxe0="DHCP -tso"
> 
> This bug has been fixed on FreeBSD 10.0-STABLE.


----------



## whoops (Sep 9, 2014)

junovitch said:
			
		

> It caught my eye that there are a massive amount of Duplicate ACKs, Out of Orders, and fragment reassembly errors.  I didn't even have to look too hard. Then I noticed you have a bxe card.  Does it sound like this affects you from the 10.0-RELEASE errata?  Since NFS uses a relatively big protocol data unit size (rsize/wsize) compared to CIFS and other protocols, fragmenting it would get offloaded to the NIC and this seems to explain the behavior we have here.
> 
> https://www.freebsd.org/releases/10.0R/errata.html
> 
> ...



And there it is! Sure enough adding the -tso that interface appears to have solved the problem. I'll run through some of the larger backups to be safe, but after adding that flag I've been unable to reproduce the issue with any of the files I had been using for testing. The worst part is, is that I had considered packet loss/corruption as a possible cause, and so the first thing I tried was explicitly setting the TCP option when mounting on the client, not realizing that since NFSv3+ uses TCP by default.

Thanks a million. That seemed like a severe enough bug in NFS I was starting to doubt my sanity.

Note to self: when encountering problems read the Errata.


----------



## junovitch@ (Sep 9, 2014)

Absolutely, the errata, release notes, security advisories, and /usr/{ports,src}/UPDATING are all your friends.  All those backups being written lets the network fragmentation work happen on the client sides and it was only those reads from the client where NFS was getting fragmented poorly by the bxe driver.  Add in all the other protocols that'll comfortably use 1500 byte packets and you have an interesting problem to solve and a lot of lost time.  Keep that workaround in place until 10.1-RELEASE and you'll be good to go.  Oh well, on to to the next fire to put out.


----------

