# Need assistance with sluggish ZFS host



## darecki (Apr 12, 2018)

Hi Everyone,

I am trying to come back to FreeBSD after switching to Linux in 1998. I am building a host with ZFS and several MySQL instances in jails. It mostly works, but I am hitting some weird behavior that I can't explain.

First, a bit about my hardware. Yes, I remember the lesson: craft the hardware to match FreeBSD, not the other way around. Here it is:

- 1 TB of RAM
- 2x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
- 16x TOSHIBA PX05SVB096Y SAS3 SSD
- Dell HBA330+ controller (it is a Dell box)
- Intel(R) Ethernet Connection XL710
- 4 NUMA domains

FreeBSD 11.1-RELEASE-p9

There are two ZFS pools:
- system root on 2x NVMe M.2 drives in a mirror
- data pool on the 16x Toshiba drives in mirrors (8 vdevs)
- compression enabled (lz4), no dedup

My plan was to use most of the RAM as ARC and run the MySQL instances with direct flushing or small buffer pools. I started testing by running some well-known SQL statements, and it worked very well. However, not always.

Let's rewind to the very first day.

I had to install the MySQL data from xtrabackup files. I had them as tar+lzop, 800G in the file, around 1.8T unpacked. I ran `lzop -d file | tar xf -` and bang: after unpacking around half of the data, the system stopped responding properly. I repeated this a few times with exactly the same result:

- Whatever was already open, say, a file in vim, still worked
- Any other action, such as ls, vi, or top, hung frozen

After a few tries I started top first and kept watching. Nice, I got a clue: the stuck processes were waiting on another thread that was trying to free ARC. Left for hours, it made no progress, so warm reboot.

The next step was to limit ARC to 640G and set the free-memory target and minimum to 128G/64G. It worked; I was able to unpack the files. Then I ran MySQL with some tests with no issues. However, I noticed Wired memory growing fast and far beyond ARC. Disabling UMA helped a bit but made things slow, so I enabled UMA again.
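
For reference, a sketch of how those caps are typically expressed. The values are the ones from the post; the exact `vm.v_free_*` sysctl names are my assumption of what "mem target and minimum" maps to:

```
# /boot/loader.conf -- the ARC cap is a boot-time tunable on 11.x:
vfs.zfs.arc_max="640G"

# Runtime sysctls, in 4K pages: 128G and 64G respectively.
sysctl vm.v_free_target=33554432
sysctl vm.v_free_min=16777216
```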

Then further testing with unpacking. After two test runs over MySQL, with ARC sitting around 350G and Wired around 500G, I started `pbzip2 -d file > deleteme`. The .bz2 file was 400G, around 1.8T uncompressed.

The whole process ran at 15% of the drives' speed until around 400G was unpacked. Then it slowed down to next to nothing. ARC was full (640G), Wired left around 63G free (less than the minimum I set) and was still growing slowly. The system became terribly slow: I was able to launch top, but it took minutes to start; there was no way to open a file, and ls sometimes returned only after minutes. Killing pbzip2 didn't help.

Once I tried to delete the "deleteme" file while the system was dying; actually I had two files, one left from a previous test. I ran `rm file1; rm file2`, then ls. The files were gone, but the system deteriorated further. Warm reboot after around 30 minutes of nothing but top still running and reporting 59G of free RAM. After the reboot, surprise: the files were back. Conclusion: it was unable to flush the transaction.

After that I limited the number of blocks deleted in a single transaction to 200000 and then to 50000, but that had no influence.
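
The per-transaction delete cap is presumably this FreeBSD 11 sysctl (my assumption; 50000 matches the last value I tried):

```
sysctl vfs.zfs.free_max_blocks=50000
```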

I switched timers to HPET when I allowed the C3 state. Disallowing C3 (with C1/C2 still allowed and powerd running) made the SQL test faster but nothing else better.

So where am I now?
I am able to kill the machine every time I run unpacking for long enough. I tested with compression disabled, with no difference. While dead it deteriorates slowly even when left alone, showing a load around 5 while only top runs, and only a reboot fixes it. BTW, it also doesn't reboot cleanly, stopping after the famous "uptime" message; disabling the USB wait doesn't help.

What should I look at or read about to fix it?

Cheers,


----------



## Bobi B. (Apr 12, 2018)

Have you tried monitoring your system to see where the bottleneck is? Meaning, run all of top(1) (`top -SH` and `top -SH -mio -ototal`), gstat(8) (`gstat -p`), systat(1) (`systat -vmstat`, `systat -zarc`, `systat -iostat`, `systat -pigs`), and zpool(8) (`zpool iostat -Td`, `zpool status -v -Td 5`) simultaneously in ssh sessions. Also tell syslogd(8) to forward events to a remote host, so you can monitor what's happening. That way you won't be blind if the host gets unresponsive again.
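
Forwarding syslog to a second box is one line in syslog.conf; the address below is a placeholder, and the receiving syslogd must be configured to accept remote messages:

```
# /etc/syslog.conf on the troubled host; 192.0.2.10 is a placeholder
*.*        @192.0.2.10
```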

Do you have a swap partition (`swapctl -l`)? Can you turn it off (`swapoff -a`)?


----------



## darecki (Apr 12, 2018)

- gstat with no options shows little activity on the drives, and when the system is dead, next to nothing

```
[root@testmother ~]# swapctl -l
Device:       1024-blocks     Used:
/dev/ada0p2     2097152         0
/dev/ada1p2     2097152         0
```

and I have set:

```
vm.swap_enabled="0"
```

I ran some of the others and didn't spot anything. I will run them again tomorrow and post the results.

Cheers,


----------



## SirDice (Apr 12, 2018)

I've had similar issues on a few VMs I created. Turned out to be memory related. There was too much trying to grab RAM and when it ran out it eventually stalled everything.


----------



## darecki (Apr 12, 2018)

SirDice said:


> I've had similar issues on a few VMs I created. Turned out to be memory related. There was too much trying to grab RAM and when it ran out it eventually stalled everything.



Any tips how to limit it?


----------



## SirDice (Apr 12, 2018)

For the VMs it was easy, I just gave them more memory. But in your case that's probably not useful. For your situation I would probably limit ARC (since it interferes with MySQL's own caching) and verify all MySQL instances are not configured to use too much memory. Make sure ARC plus all maxed memory usages of all the MySQL instances add up to less than the amount of RAM the machine has. You can also find a couple of specific settings for each ZFS filesystem that has a MySQL database on them: https://wiki.freebsd.org/ZFSTuningGuide#MySQL
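
That budgeting rule can be sanity-checked with simple arithmetic. A minimal sketch; the instance count, per-instance figure, and headroom are made-up placeholders, not numbers from this thread:

```
#!/bin/sh
# Memory budget check: ARC cap + MySQL instances + headroom vs. RAM.
# All sizes in GiB; every number here is a placeholder for illustration.
ram=1024          # physical RAM
arc_max=640       # intended vfs.zfs.arc_max
instances=10      # number of jailed MySQL instances (assumed)
per_instance=3    # buffer pool + connection overhead each (assumed)
headroom=64       # reserve for kernel, network buffers, etc. (assumed)

budget=$(( arc_max + instances * per_instance + headroom ))
if [ "$budget" -le "$ram" ]; then
    echo "OK: ${budget}G of ${ram}G committed"
else
    echo "OVERCOMMIT: ${budget}G exceeds ${ram}G"
fi
```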


----------



## darecki (Apr 12, 2018)

SirDice said:


> For the VMs it was easy, I just gave them more memory. But in your case that's probably not useful. For your situation I would probably limit ARC (since it interferes with MySQL's own caching) and verify all MySQL instances are not configured to use too much memory. Make sure ARC plus all maxed memory usages of all the MySQL instances add up to less than the amount of RAM the machine has. You can also find a couple of specific settings for each ZFS filesystem that has a MySQL database on them: https://wiki.freebsd.org/ZFSTuningGuide#MySQL



Yep, adding more would be difficult when it already has 1TB.

I set the MySQL buffer to only 2G, so the whole process takes 3G. I am aware of the double-caching issue; I tried `innodb_flush_method = O_DIRECT`, but SQL alone is not the problem, it dies when I copy files. I started thinking it could be too much RAM for the CPU's computing power. It is easy to scan 64G of RAM to find and evict 5G, but not as easy to scan 700G of RAM to find and evict 250G.
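
For the record, a hedged sketch of the double-cache mitigation along the lines of the ZFSTuningGuide link above; the dataset name is a placeholder:

```
# my.cnf fragment per instance (values from this thread):
[mysqld]
innodb_flush_method = O_DIRECT
innodb_buffer_pool_size = 2G

# and keep InnoDB data out of ARC's data cache on the dataset
# holding it ("tank/mysql" is a placeholder):
#   zfs set primarycache=metadata tank/mysql
```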

I noticed there is a ZFS setting for the number of background workers doing compression, defaulting to 3/4 of the available threads. Some people say setting it to 200% helps. Is there anything similar working on ARC or Wired, i.e. on scans and evictions? Maybe I need more CPU power there, since the RAM is giant.

How can I find out what else stays in Wired? It grows to 1.4x the ARC size.
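
One way to attribute Wired memory is UMA zone accounting. A hedged sketch: parse `vmstat -z`-style lines (zone name, then comma-separated SIZE, LIMIT, USED, FREE, ...) and rank zones by approximate bytes in use. The sample lines below are fabricated for illustration; on the real host, pipe in `vmstat -z` instead of the here-doc:

```
#!/bin/sh
# Rank UMA zones by approximate bytes in use (SIZE * USED).
rank_zones() {
    awk -F'[:,]' 'NF >= 5 {
        gsub(/ /, "", $2); gsub(/ /, "", $4);   # strip padding from SIZE, USED
        printf "%d %s\n", $2 * $4, $1           # bytes first so sort is numeric
    }' | sort -rn
}

# Fabricated sample in the shape of `vmstat -z` output:
rank_zones <<'EOF'
zio_buf_131072:      131072,      0,    4100,      12,  900000,   0,   0
mbuf_cluster:          2048,      0,   25600,     512, 1200000,   0,   0
UMA Kegs:               280,      0,     346,       2,     346,   0,   0
EOF
```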


----------



## darecki (Apr 12, 2018)

I tried to rsync the files out, both to defragment and to fix the recordsize, which I got wrong (128k instead of 16k under InnoDB). I started rsync and:

- It ran at 50-100MB/s while reading from ARC.
- It slowed down 10x once it ran beyond the cached data. I tested this by re-rsyncing a large file, which again ran fast up to the same point (around 50%) and then slowed down the same way.
- Left for hours, it deteriorated, and now there is no transfer although rsync is still there. It seems to be the same issue I had before, and I am still able to launch things, so hopefully I can collect the output of the commands suggested above.
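
Worth noting for the recordsize fix: the property only affects blocks written after the change, so existing files have to be rewritten, which is what the rsync accomplishes. A sketch with placeholder dataset names:

```
zfs create -o recordsize=16k tank/mysql-new
rsync -a /tank/mysql/ /tank/mysql-new/
```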

First `top`:


```
last pid: 39994;  load averages:  3.08,  3.16,  3.91                                                                                                                                                                 up 0+21:04:14  07:16:55
39 processes:  4 running, 34 sleeping, 1 stopped
CPU:  0.0% user,  0.0% nice,  5.7% system,  0.0% interrupt, 94.3% idle
Mem: 22M Active, 55M Inact, 25M Laundry, 692G Wired, 306G Free
ARC: 642G Total, 622G MFU, 13G MRU, 428K Anon, 7368M Header, 821M Other
     634G Compressed, 874G Uncompressed, 1.38:1 Ratio
Swap: 4096M Total, 4096M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
  870 root          1  52    0 46444K  3420K CPU38  38   0:01  49.83% sudo
36306 zabbix        1  52    0 34204K  3820K piperd 35  73:27   8.46% zabbix_agentd
32268 zabbix        1  52    0 13180K  2728K CPU16  16   0:00   8.40% sh
98222 zabbix        1  52    0  8356K  2148K piperd 49   0:00   7.55% sed
96024 root          1  52    0 46444K  3428K CPU15  15   0:00   7.18% sudo
41294 zabbix        1  52    0 34204K  3824K select 40  73:15   0.78% zabbix_agentd
28420 root          1  20    0 20164K  3588K CPU56  56   0:02   0.05% top
10987 zabbix        1  20    0 35788K  9780K nanslp 70   0:43   0.05% zabbix_agentd
46093 root          1  20    0 10496K  2612K select 65   1:51   0.01% syslogd
13449 root          1  20    0 10464K  1552K select 66   0:10   0.01% powerd
69986 root          1  20    0 85232K  7828K select 64   0:03   0.01% sshd
90137 root          1  20    0 20584K 12492K select 57   0:12   0.00% ntpd
94177 root          1  52    0 11864K  2920K arc_re 31 522:08   0.00% rsync
80547 root          1  52    0 13912K  3676K arc_re 32 437:28   0.00% rsync
18407 zabbix        1  52    0 34204K  3808K piperd 58  73:35   0.00% zabbix_agentd
44610 zabbix        1  52    0 35788K  4744K piperd 65  13:11   0.00% zabbix_agentd
93587 root          1  52    0 28892K  3996K nanslp 61   2:48   0.00% smartd
69409 root          1  52    0 46652K 15584K STOP   79   0:48   0.00% dtrace
14650 root          1  20    0 85232K  7660K select  6   0:07   0.00% sshd
```

Stuck on *arc_reclaim_thread*!


----------



## darecki (Apr 12, 2018)

`top -SH`


```
last pid: 52449;  load averages:  2.64,  2.73,  3.30                                                                                                                                                                 up 0+21:12:44  07:25:25
1557 processes:83 running, 1346 sleeping, 1 stopped, 127 waiting
CPU:  0.0% user,  0.0% nice,  2.5% system,  0.0% interrupt, 97.5% idle
Mem: 23M Active, 55M Inact, 25M Laundry, 691G Wired, 307G Free
ARC: 642G Total, 622G MFU, 13G MRU, 491K Anon, 7368M Header, 821M Other 634G Compressed, 874G Uncompressed, 1.38:1 Ratio
Swap: 4096M Total, 4096M Free

  PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   11 root       155 ki31     0K  1280K CPU0    0  20.6H 100.00% idle{idle: cpu0}
   11 root       155 ki31     0K  1280K CPU6    6  20.6H 100.00% idle{idle: cpu6}
   11 root       155 ki31     0K  1280K CPU2    2  20.6H 100.00% idle{idle: cpu2}
   11 root       155 ki31     0K  1280K CPU7    7  20.6H 100.00% idle{idle: cpu7}
   11 root       155 ki31     0K  1280K CPU14  14  20.4H 100.00% idle{idle: cpu14}
   11 root       155 ki31     0K  1280K CPU56  56  20.3H 100.00% idle{idle: cpu56}
   11 root       155 ki31     0K  1280K CPU23  23  20.1H 100.00% idle{idle: cpu23}
   11 root       155 ki31     0K  1280K CPU17  17  20.1H 100.00% idle{idle: cpu17}
   11 root       155 ki31     0K  1280K CPU20  20  20.0H 100.00% idle{idle: cpu20}
   11 root       155 ki31     0K  1280K CPU35  35  19.9H 100.00% idle{idle: cpu35}
   11 root       155 ki31     0K  1280K CPU21  21  19.9H 100.00% idle{idle: cpu21}
   11 root       155 ki31     0K  1280K CPU59  59  19.8H 100.00% idle{idle: cpu59}
   11 root       155 ki31     0K  1280K CPU58  58  19.7H 100.00% idle{idle: cpu58}
   11 root       155 ki31     0K  1280K RUN    39  18.9H 100.00% idle{idle: cpu39}
43739 root        52    0 46444K  3428K select 25   0:05 100.00% sudo
   11 root       155 ki31     0K  1280K RUN    48  20.2H  99.77% idle{idle: cpu48}
   11 root       155 ki31     0K  1280K CPU22  22  20.3H  99.73% idle{idle: cpu22}
   11 root       155 ki31     0K  1280K CPU18  18  19.6H  99.55% idle{idle: cpu18}
   11 root       155 ki31     0K  1280K CPU60  60  18.9H  99.25% idle{idle: cpu60}
   11 root       155 ki31     0K  1280K CPU38  38  19.8H  98.55% idle{idle: cpu38}
   11 root       155 ki31     0K  1280K CPU31  31  19.9H  96.95% idle{idle: cpu31}
   11 root       155 ki31     0K  1280K CPU34  34  20.0H  93.34% idle{idle: cpu34}
   11 root       155 ki31     0K  1280K CPU1    1  20.7H  49.62% idle{idle: cpu1}
31100 root        28    0 24260K  6752K CPU48  48   0:01   0.21% top
   12 root       -60    -     0K  2032K WAIT    0   2:27   0.12% intr{swi4: clock (0)}
   27 root       -16    -     0K    16K -      23   1:31   0.10% racctd
10987 zabbix      20    0 35788K  9780K nanslp 23   0:44   0.04% zabbix_agentd
13449 root        20    0 10464K  1552K select 14   0:11   0.01% powerd
   16 root       -16    -     0K    16K -      31   0:37   0.01% rand_harvestq
46093 root        20    0 10496K  2612K select 56   1:51   0.01% syslogd
69986 root        20    0 85232K  7828K select 18   0:03   0.01% sshd
   21 root       -16    -     0K    16K psleep 38   0:04   0.00% vmdaemon
   12 root       -92    -     0K  2032K WAIT    0   2:42   0.00% intr{irq287: ixl0:q0}
90137 root        20    0 20584K 12492K select 60   0:12   0.00% ntpd{ntpd}
   12 root       -92    -     0K  2032K WAIT    0   0:01   0.00% intr{irq296: ixl1:q0}
    5 root        -8    -     0K   192K spa->s  6   4:09   0.00% zfskern{trim Raid10}
   12 root       -92    -     0K  2032K WAIT    1   0:33   0.00% intr{irq288: ixl0:q1}
    0 root       -92    -     0K 20112K -      56   0:01   0.00% kernel{ixl0 aq}
    5 root        -8    -     0K   192K spa->s 20   0:20   0.00% zfskern{trim zroot}
```


`top -SH -mio -ototal`


```
last pid: 90557;  load averages:  2.45,  2.62,  3.20                                                                                                                                                                 up 0+21:14:06  07:26:47
1561 processes:84 running, 1349 sleeping, 1 stopped, 127 waiting
CPU:  0.0% user,  0.0% nice,  7.3% system,  0.0% interrupt, 92.7% idle
Mem: 25M Active, 55M Inact, 25M Laundry, 691G Wired, 307G Free
ARC: 642G Total, 622G MFU, 13G MRU, 495K Anon, 7368M Header, 821M Other 634G Compressed, 874G Uncompressed, 1.38:1 Ratio
Swap: 4096M Total, 4096M Free

  PID USERNAME     VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
    5 root           10      0      0     15      0     15  48.39% zfskern{txg_thread_enter}
87961 root            1      1     10      0      0     10  32.26% zfs
46093 root            4      0      0      2      0      2   6.45% syslogd
57285 root            3      2      2      0      0      2   6.45% sudo
52853 root            3      2      2      0      0      2   6.45% sudo
79214 root            0      0      0      0      0      0   0.00% ftcleanup
69409 root            0      0      0      0      0      0   0.00% dtrace
94177 root            0      0      0      0      0      0   0.00% rsync
38103 root            0      0      0      0      0      0   0.00% bash
69986 root            2      0      0      0      0      0   0.00% sshd
57386 root            0      0      0      0      0      0   0.00% bash
84516 root            0      0      0      0      0      0   0.00% sshd
15202 mailnull        0      0      0      0      0      0   0.00% exim
20801 root            0      0      0      0      0      0   0.00% bash
14650 root            0      0      0      0      0      0   0.00% sshd
37796 root            0      0      0      0      0      0   0.00% getty
32316 root            0      0      0      0      0      0   0.00% getty
25008 root            0      0      0      0      0      0   0.00% getty
23383 root            0      0      0      0      0      0   0.00% getty
19741 root            0      0      0      0      0      0   0.00% getty
14635 root            0      0      0      0      0      0   0.00% getty
13728 root            0      0      0      0      0      0   0.00% getty
11065 root            0      0      0      0      0      0   0.00% getty
89155 root            0      0      0      0      0      0   0.00% cron
18335 root            0      0      0      0      0      0   0.00% sshd
93587 root            0      0      0      0      0      0   0.00% smartd
13449 root            7      0      0      0      0      0   0.00% powerd
90137 root            2      0      0      0      0      0   0.00% ntpd{ntpd}
44610 zabbix          0      0      0      0      0      0   0.00% zabbix_agentd
41294 zabbix          5      1      0      0      0      0   0.00% zabbix_agentd
36306 zabbix          3      0      0      0      0      0   0.00% zabbix_agentd
18407 zabbix          0      0      0      0      0      0   0.00% zabbix_agentd
10987 zabbix          2      0      0      0      0      0   0.00% zabbix_agentd
93779 zabbix          0      0      0      0      0      0   0.00% zabbix_agentd
86550 root            0      0      0      0      0      0   0.00% devd
98361 root            0      0      0      0      0      0   0.00% moused
13727 root            0      0      0      0      0      0   0.00% adjkerntz
   27 root            2      0      0      0      0      0   0.00% racctd
   26 root            3      0      0      0      0      0   0.00% syncer
```


`gstat -p`


```
dT: 1.012s  w: 1.000s
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| ada0
    0      0      0      0    0.0      0      0    0.0    0.0| ada1
    0      0      0      0    0.0      0      0    0.0    0.0| da0
    0      0      0      0    0.0      0      0    0.0    0.0| da1
    0      0      0      0    0.0      0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0    0.0| da3
    0      0      0      0    0.0      0      0    0.0    0.0| da4
    0      0      0      0    0.0      0      0    0.0    0.0| da5
    0      0      0      0    0.0      0      0    0.0    0.0| da6
    0      0      0      0    0.0      0      0    0.0    0.0| da7
    0      0      0      0    0.0      0      0    0.0    0.0| da8
    0      0      0      0    0.0      0      0    0.0    0.0| da9
    0      0      0      0    0.0      0      0    0.0    0.0| da10
    0      0      0      0    0.0      0      0    0.0    0.0| da11
    0      0      0      0    0.0      0      0    0.0    0.0| da12
    0      0      0      0    0.0      0      0    0.0    0.0| da13
    0      0      0      0    0.0      0      0    0.0    0.0| da14
    0      0      0      0    0.0      0      0    0.0    0.0| da15
```


----------



## darecki (Apr 12, 2018)

`systat -vmstat`


```
3 users    Load  2.63  2.66  3.10                  Apr 13 07:29
   Mem usage:  69%Phy 64%Kmem
Mem: KB    REAL            VIRTUAL                      VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act  110724    9104   812164    13160 322617k  count
All  111268    9628   822448    19056          pages
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        ioflt  1133 total
  1          31       349  125  494 1135   38  117     26 cow         cpu0:timer
                                                       69 zfod        cpu1:timer
2.6%Sys   0.0%Intr  0.0%User  0.0%Nice 97.4%Idle         ozfod       cpu41:time
|    |    |    |    |    |    |    |    |    |           %ozfod       cpu40:time
=                                                         daefr       cpu5:timer
                                           dtbuf      115 prcfr       cpu48:time
Namei     Name-cache   Dir-cache   8388608 desvn      846 totfr       cpu27:time
   Calls    hits   %    hits   %    662440 numvn          react       cpu78:time
     711     711 100                607811 frevn          pdwak       cpu54:time
                                                          pdpgs       cpu36:time
Disks  ada0  ada1   da0   da1   da2   da3   da4           intrn       cpu66:time
KB/t  14.67 15.46  0.00  0.00  0.00  0.00  0.00   723924k wire        cpu26:time
tps       8     7     0     0     0     0     0     22032 act         cpu11:time
MB/s   0.11  0.11  0.00  0.00  0.00  0.00  0.00     56216 inact       cpu43:time
%busy     0     0     0     0     0     0     0     25928 laund       cpu10:time
                                                  322617k free        cpu4:timer
                                                          buf         cpu8:timer
                                                                      cpu64:time
                                                                      cpu42:time
                                                                      cpu55:time
                                                                      cpu24:time
                                                                      cpu17:time
                                                                      cpu45:time
                                                                      cpu65:time
                                                                      cpu16:time
                                                                      cpu79:time
                                                                      cpu76:time
                                                                      cpu49:time
                                                                      cpu22:time
                                                                      cpu44:time
                                                                      cpu32:time
                                                                      cpu68:time
                                                                      cpu39:time
                                                                      cpu69:time
                                                                      cpu74:time
                                                                      cpu34:time
                                                                      cpu61:time
                                                                      cpu20:time
                                                                      cpu12:time

```


----------



## darecki (Apr 12, 2018)

`systat -zarc`

```
Total     MFU     MRU    Anon     Hdr   L2Hdr   Other
     ZFS ARC            642G    621G     12G    428K   7367M      0K    820M
                                rate    hits  misses   total hits total misses
     arcstats                  :  0%       0       0     51354648     39471192
     arcstats.demand_data      :  0%       0       0     46362087      4523687
     arcstats.demand_metadata  :  0%       0       0      2670818      4992774
     arcstats.prefetch_data    :  0%       0       0      2174883     29882852
     arcstats.prefetch_metadata:  0%       0       0       146860        71879
     zfetchstats               :  0%       0       0      4492247      9591496
     arcstats.l2               :  0%       0       0            0            0
     vdev_cache_stats          :  0%       0       0         8650        19624
```
`systat -iostat`

```
/0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
     Load Average   >>>>>>>>>>>
          /0%  /10  /20  /30  /40  /50  /60  /70  /80  /90  /100
cpu  user|
     nice|
   system|XX
interrupt|
     idle|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

          /0%  /10  /20  /30  /40  /50  /60  /70  /80  /90  /100
ada0  MB/s
      tps|
ada1  MB/s
      tps|
da0   MB/s
      tps|
da1   MB/s
      tps|
da2   MB/s
      tps|
da3   MB/s
      tps|
da4   MB/s
      tps|
da5   MB/s
      tps|
da6   MB/s
      tps|
da7   MB/s
      tps|
da8   MB/s
      tps|
da9   MB/s
      tps|
da10  MB/s
      tps|
da11  MB/s
      tps|
da12  MB/s
      tps|
da13  MB/s
      tps|
da14  MB/s
      tps|
da15  MB/s
      tps|
pass0 MB/s
      tps|
pass1 MB/s
      tps|
pass2 MB/s
      tps|
pass3 MB/s
      tps|
pass4 MB/s
      tps|
pass5 MB/s
      tps|
pass6 MB/s
      tps|
```

`systat -pigs`


```
/0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
     Load Average   >>>>>>>>>>>>>

                    /0%  /10  /20  /30  /40  /50  /60  /70  /80  /90  /100
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root        zfskern XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root           idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```


----------



## darecki (Apr 12, 2018)

`zpool status -v -Td 5`

```
Fri Apr 13 07:34:05 AEST 2018
  pool: Raid10
state: ONLINE
  scan: scrub repaired 0 in 0h23m with 0 errors on Mon Apr  9 14:28:25 2018
config:

        NAME        STATE     READ WRITE CKSUM
        Raid10      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da0p1   ONLINE       0     0     0
            da8p1   ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da1p1   ONLINE       0     0     0
            da9p1   ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            da2p1   ONLINE       0     0     0
            da10p1  ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            da3p1   ONLINE       0     0     0
            da11p1  ONLINE       0     0     0
          mirror-4  ONLINE       0     0     0
            da4p1   ONLINE       0     0     0
            da12p1  ONLINE       0     0     0
          mirror-5  ONLINE       0     0     0
            da5p1   ONLINE       0     0     0
            da13p1  ONLINE       0     0     0
          mirror-6  ONLINE       0     0     0
            da6p1   ONLINE       0     0     0
            da14p1  ONLINE       0     0     0
          mirror-7  ONLINE       0     0     0
            da7p1   ONLINE       0     0     0
            da15p1  ONLINE       0     0     0

errors: No known data errors

  pool: zroot
state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Thu Apr  5 21:04:54 2018
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
```


----------



## darecki (Apr 12, 2018)

So my conclusion is: the whole system is idling while waiting for arc_reclaim_thread, yet that thread doesn't even take CPU. It stays stuck for many hours. I believe the slow reading was caused by slow reclaim too.

Similar to this?
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224080

Or would I be able to tune it out somehow?

Cheers,


----------



## darecki (Apr 12, 2018)

My ARC tunes:

```
vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 0
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_write_max: 8388608
vfs.zfs.arc_meta_limit: 171798691840
vfs.zfs.arc_free_target: 1813966
vfs.zfs.compressed_arc_enabled: 1
vfs.zfs.arc_shrink_shift: 7
vfs.zfs.arc_average_blocksize: 8192
vfs.zfs.arc_min: 137438953472
vfs.zfs.arc_max: 687194767360
```
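For readability, the big byte values above decode to round binary-GiB figures; a quick sanity check:

```python
GiB = 2**30  # binary gigabyte

# The three large byte-valued tunables from the sysctl list above
tunables = {
    "vfs.zfs.arc_min": 137438953472,
    "vfs.zfs.arc_max": 687194767360,
    "vfs.zfs.arc_meta_limit": 171798691840,
}

for name, value in tunables.items():
    print(f"{name} = {value / GiB:g} GiB")
```

That is 128G minimum, 640G maximum, and a 160G metadata limit.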


----------



## hotaronohanako (Apr 13, 2018)

darecki said:


> Hi Everyone,
> 
> I am trying to come back to FreeBSD after turning to Linux in 1998. What I am building is a host with ZFS and several MySQL instances in jails. Well, it works but I am hitting some weird behavior which I can't explain.
> 
> ...


My advice... Don't go back to FreeBSD. Reason? Bloated as hell.


----------



## darecki (Apr 13, 2018)

hotaronohanako said:


> My advice... Don't go back to FreeBSD. Reason? Bloated as hell.



There aren't many systems with native ZFS, and Solaris is dead.


----------



## t1066 (Apr 13, 2018)

I would suggest the following two tests:
1. Disable TRIM.
2. Drastically limit ARC to, say, 16G or 32G.


----------



## darecki (Apr 13, 2018)

t1066 said:


> I would suggest the following two tests.
> 1. disable trim
> 2. drastically limit arc to say, 16G or 32G.



Thanks,

I will try with TRIM disabled in a moment.


----------



## darecki (Apr 13, 2018)

Rebooted to apply the fixes: kern.hz=100 and TRIM disabled.

After 6h of running outbound rsync with read-only activity on ZFS:
- The first change is that Wired now sits close to ARC: 477G ARC, 496G Wired.
- Still dying though; ready to reboot now. `^C` on a running top takes ages.

Reducing ARC to 64G now.


----------



## darecki (Apr 13, 2018)

What is `vfs.zfs.arc_shrink_shift`?


----------



## darecki (Apr 13, 2018)

Here is some fun...
After a reboot with loader.conf set to a 64G ARC limit, I ran 2x outbound rsync, and then:

- While ARC was climbing: 2x 30MB/s.
- After hitting 64G in ARC: 2x 50MB/s.
- Expanded ARC to 128G: 2x 30MB/s while climbing, 2x 40MB/s once ARC was full.
- Decreased ARC to 32G: 2x 60MB/s.

In general, load on the drives reflects the transfer rate. ARC is 100% MFU.
Now I am scratching my head: why is transfer slower with a 128G ARC?!


WCPU on the rsyncs is 25% while ARC is full (transfer 2x 60MB/s), and 90% while ARC is climbing (transfer 2x 30MB/s).


----------



## gkontos (Apr 13, 2018)

It may be a long shot, but have you tested your memory with memtest?

Also, what is the output of `zdb | grep ashift`? It might be irrelevant in this case, but it never hurts to know whether your pool is properly aligned.


----------



## darecki (Apr 13, 2018)

gkontos said:


> Long shot maybe but have you tested your memory with memtest?
> 
> Also, what is the output of `#zdb | grep ashift` ? Might be irrelevant in this case but it never hurts to know if your pool is properly aligned.


ashift is 12. Memory is tested; this is a new server, and Dell runs memtest plus SMART tests (short and long) on the drives before delivery.


----------



## darecki (Apr 13, 2018)

Since it doesn't want to die with a reduced ARC, I played a bit.

- There is a general rule: while ARC is climbing, no matter from what value or up to what limit, I get 2x 30MB/s.
- I bumped arc_shrink_shift from 7 to 11 and am now building ARC up to 256G. Seems better so far.

Next reboot: re-enabling TRIM and kern.hz=1000.
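A quick back-of-the-envelope on the shift change, assuming arc_shrink_shift is log2 of the fraction of ARC freed per reclaim pass (my reading of the ZFS source, so treat that as an assumption):

```python
# Rough reclaim step per pass: arc_size >> arc_shrink_shift
# (assumption: the shift is log2 of the ARC fraction freed per pass)
GiB = 2**30

def shrink_step(arc_size_bytes, shift):
    """Bytes the ARC would try to free in one reclaim pass."""
    return arc_size_bytes >> shift

arc = 256 * GiB
for shift in (7, 11, 12):
    print(f"shift {shift:2d}: {shrink_step(arc, shift) / GiB:g} GiB per pass")
```

So with shift 11 each pass frees 128 MiB of a 256G ARC instead of 2 GiB, which would explain why reclaim stalls the box much less.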


----------



## darecki (Apr 13, 2018)

There is more that I don't understand. I am rsyncing out, with no writing. After two days of READING, fragmentation dropped from 32% to 12%. Does ZFS heal fragmentation by copying on READ when it sees it?


----------



## darecki (Apr 14, 2018)

Alright, after some more diagnostics:

- Running with 256G ARC is smooth after adding

```
vfs.zfs.arc_shrink_shift: 11
```
Before adding it, I was unable to complete an 800G rsync if ARC was over 64G.
Thinking about setting it to 12 and raising ARC to 512G.

- Re-enabling TRIM killed it again on untarring. I disabled it and am retesting now to confirm.

Is there any manual for these tunables?

```
vfs.zfs.trim.max_interval: 1
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.txg_delay: 32
vfs.zfs.trim.enabled: 0
vfs.zfs.vdev.trim_max_pending: 10000
vfs.zfs.vdev.trim_max_active: 64
vfs.zfs.vdev.trim_min_active: 1
```

I believe TRIM might be tuned back to sane values, as these SAS3 drives don't lock up on TRIM the way SATA drives do.


----------



## t1066 (Apr 14, 2018)

Since you are using SSDs, you may try disabling prefetch.

`#sysctl vfs.zfs.prefetch_disable=1`
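To keep it across reboots, it could also go into /etc/sysctl.conf (assuming the tunable is runtime-settable on 11.1, as the sysctl command above suggests; otherwise it belongs in /boot/loader.conf):

```shell
# /etc/sysctl.conf -- disable ZFS file-level prefetch
# (assumption: vfs.zfs.prefetch_disable is writable at runtime on 11.1)
vfs.zfs.prefetch_disable=1
```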


----------



## darecki (Apr 15, 2018)

After a few more repeated tests, my conclusions are:
- with ARC up to 64G it does not die
- the next value tested is 128G, and there I can kill it every time
- the shrink_shift change fixed the issue on reading; outbound rsync works even with up to a 256G ARC (never tested further)
- the same rsync inbound, or an untar of exactly the same InnoDB files to a second location, kills ZFS
- in the end everything waits for arc_reclaim, which does not reclaim; ARC is also slightly over its limit
- there is an interesting side effect: rsync from a remote machine and untar from a local file both run at half speed until ARC hits its limit (no matter what that limit is)
- with a 128G ARC there are around 5k evict skips since reboot; with a 64G ARC, after the untar/rsync completes, around 13k skips remain

I have found a Linux case (https://github.com/zfsonlinux/zfs/issues/4726) which seems half relevant: it says eviction stops at some point, even though that is more related to the size of the metadata cache. However, I can't find FreeBSD equivalents of the tunables they used to work around the issue.
I run a few FreeNAS setups and none has this issue, but all have 64G of memory; I can rsync from them many times with a limited number of skips, etc. They use a custom kernel.

What would you recommend as a next step? Opening a bug? Some kernel debugging?
Does anyone use a large ARC (256G and over)?


----------



## t1066 (Apr 16, 2018)

Definitely file a bug report.


----------

