# NFS+ZFS: 2-minute stall (no I/O) when copying files larger than 5.6GB



## progman32 (Sep 6, 2020)

This is an odd one. I have a FreeBSD 11.4 server acting as a NAS for my small network. It has a 10G NIC connected to a 10G switch, to which a Linux client with a 10G NIC is also connected.

The ZFS pool is 4 HDDs arranged as two 2-disk mirror vdevs:

```
[root@343-guilty-spark /share/tmp]# zpool status -v
  pool: zroot
state: ONLINE
  scan: resilvered 2.25T in 0 days 14:32:10 with 0 errors on Thu Sep  3 01:02:39 2020
config:

        NAME            STATE     READ WRITE CKSUM
        zroot           ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            ada1p3.eli  ONLINE       0     0     0
            ada0p3.eli  ONLINE       0     0     0
          mirror-1      ONLINE       0     0     0
            ada2p1.eli  ONLINE       0     0     0
            ada3p1.eli  ONLINE       0     0     0
```

The ZFS filesystem /share is exported to the network and mounted by the Linux machine using NFSv3:

```
192.168.1.15:/share on /share type nfs (rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.15,mountvers=3,mountport=855,mountproto=udp,local_lock=none,addr=192.168.1.15,_netdev)
```

The problem is as follows: if I initiate a large transfer that fsync()s before closing the file, I get a multi-minute I/O stall on the FreeBSD box, but ONLY if the file is above ~5.6GB. It is extremely reproducible. On the Linux machine, this dd command demonstrates the issue:

```
user@linux:~/share/tmp$ dd conv=fsync if=/dev/zero of=test2 bs=4k count=1370000
1370000+0 records in
1370000+0 records out
5611520000 bytes (5.6 GB, 5.2 GiB) copied, 8.76319 s, 640 MB/s
```
As you can see, this worked fine (aside: block size doesn't seem to matter). I ran this command at least a dozen times without incident.
Adding a hundred megs or so triggers the broken behavior:

```
user@linux:~/share/tmp$ dd conv=fsync if=/dev/zero of=test2 bs=4k count=1400000
1400000+0 records in
1400000+0 records out
5734400000 bytes (5.7 GB, 5.3 GiB) copied, 191.214 s, 30.0 MB/s
```
What I experience during the transfer is:
1. Initially, the transfer proceeds apace. I have individual activity LEDs for each HDD on the FreeBSD machine; all show high activity.
2. When the transfer is almost complete, all disk activity stops dead. Not a single HDD LED flicker. All attempts to access files stall, including directly from a shell on the FreeBSD machine. Network activity on the client drops to zero; the FreeBSD server still seems to have high network activity, but no packets reach the client.
3. About two minutes pass.
4. Network and disk activity resume, the remainder of the data is transferred, and the operation completes successfully.

If I remove conv=fsync, the transfer completes nice and quick every time:

```
user@linux:~/share/tmp$ dd if=/dev/zero of=test2 bs=4k count=1400000
1400000+0 records in
1400000+0 records out
5734400000 bytes (5.7 GB, 5.3 GiB) copied, 8.95905 s, 640 MB/s
```
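For anyone wanting to catch the stall in the act, here's roughly what I've been watching on the FreeBSD side during the transfer (device names are from my setup; adjust to taste):

```shell
# Per-disk I/O on the FreeBSD server; all four disks going idle
# at once marks the start of the stall.
gstat -f 'ada[0-3]'

# In another terminal: server-side NFS RPC counters, refreshed every
# second. The Commit count is what conv=fsync drives at the end.
nfsstat -s -e -w 1
```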

Checking the Linux machine's dmesg, I see many lines like this:

```
[271233.699379] call_decode: 15859 callbacks suppressed
[271233.699751] nfs: server 192.168.1.15 not responding, still trying
[271234.533861] nfs: server 192.168.1.15 OK
[271532.292673] nfs: RPC call returned error 13
```

If I run the same DD command on the FreeBSD box directly (via the coreutils port), everything is fine:

```
[user@343-guilty-spark /share/tmp]# gdd conv=fsync if=/dev/zero of=test2 bs=4k count=1400000
1400000+0 records in
1400000+0 records out
5734400000 bytes (5.7 GB, 5.3 GiB) copied, 16.432 s, 349 MB/s
```

Any clues? Could this have something to do with an NFS ID rollover, given the fast interconnect?

Aside: The resilver is probably not relevant; I replaced a disk that I didn't trust. I have no reason to believe any of the other disks are failing.


----------



## progman32 (Sep 6, 2020)

The mystery deepens. I rebooted and the problem seems gone for now, though I've encountered it on previous reboots. Yesterday I continued having the issue even after restarting nfsd.


----------



## Peter Eriksson (Sep 6, 2020)

Many different problems can cause this. A few things to try:

Run your 10G Ethernet cards with TSO and LRO disabled. We've had issues with that on Intel X710 (ixl) cards on FreeBSD 11.3 causing hangs/freezes. It *might* be solved these days - I haven't seen it for a while now (though whether that's due to the driver being patched or to the firmware updates we've pushed to the 10G cards, I don't know).
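On FreeBSD that looks something like this (ixl0 here stands in for whatever your card's device name is - yours will differ with a Chelsio):

```shell
# Disable TSO and LRO on the 10G interface; takes effect immediately.
ifconfig ixl0 -tso -lro

# To make it persistent across reboots, carry the flags into
# /etc/rc.conf, e.g.:
#   ifconfig_ixl0="up -tso -lro"
```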

There have been similar problems with Linux NFS clients writing a lot of data (check the linux-nfs mailing list archives) that might have been fixed in later bleeding-edge kernels (sorry, I don't remember the subject lines, so you'll have to do some creative searching).

Have you tried similar writes from non-Linux clients (FreeBSD, MacOS, Solaris)? Have you tried NFSv4.1?
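For the NFSv4.1 test, the Linux side would be roughly (server IP and paths taken from the mount output earlier in the thread):

```shell
# Unmount the v3 mount and remount the same export as NFSv4.1.
umount /share
mount -t nfs -o vers=4.1,hard,proto=tcp 192.168.1.15:/share /share

# Confirm the negotiated version.
mount | grep /share
```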


----------



## progman32 (Sep 6, 2020)

Thanks for the ideas, Peter. My particular card is a Chelsio S310E-CR 10G. Now that you've jogged my memory, I recall that a couple of times my open SSH connections also stalled at the same time (no echo at an idle sh prompt until the problem self-cleared). That suggests the issue lies outside NFS and ZFS entirely, perhaps in the network stack or driver.

I thought I had a stable reproduction of the issue, but apparently not. I'm going to focus on getting a reliable way to reproduce the issue, then start playing with LRO/TSO as you suggest. I suspect it'll be a little while till I have anything to report. My card is a "working pull" so I wouldn't be too surprised if it goes unstable sometimes.

All this being said, I'm a little suspicious of how sensitive the issue is to fsync. I don't know NFS well enough to say what effect fsync has at the TCP level, but I'll see what I can see.
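My rough plan for seeing what fsync does on the wire: as I understand it, dd's final fsync() should turn into NFS COMMIT RPCs, so I'll watch the client-side counters and capture the NFS stream during a stall (eth0 below is a placeholder for the client's 10G NIC):

```shell
# On the Linux client: per-procedure RPC counters. The "commit"
# column should jump when dd's final fsync() fires.
nfsstat -c

# Capture NFS traffic during the stall to see whether the server
# goes quiet mid-COMMIT or the TCP stream itself wedges.
tcpdump -i eth0 -n 'port 2049'
```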

Edit: Sorry, forgot to answer all your questions. I have not tried other client OSes or NFSv4 of any kind. I'll check them out once I get a repro.


----------

