# swapinfo avail and swap partition size mismatch by a factor of 4.



## cfs (Mar 30, 2021)

Hello,

TL;DR:

My problem:  I allocated 256G for swap during installation of 12.2 (ZFS install option, single disk).  Real memory is 64G.  The partition was created at the correct size; however, the available swap reported by both swapinfo and top is only 64G.

The last line in dmesg output is telling:


```
WARNING: reducing swap size to maximum of 65536MB per unit
```

but googling I have not been able to find a reference to this limit or to how to change it.

The system is a fresh 12.2 install plus an update from source to 12-STABLE.

Longer version:


```
root@caleuche 00:04:46 /usr/home/cfs
# swapinfo -h
Device          512-blocks     Used    Avail Capacity
/dev/nvd1p3      134217728       0B      64G     0%

root@caleuche 00:08:10 /usr/home/cfs
# diskinfo -t nvd1p3
nvd1p3
        4096            # sectorsize
        274877906944    # mediasize in bytes (256G)
        67108864        # mediasize in sectors
        0               # stripesize
        210763776       # stripeoffset
        4177            # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        SHGP31-1000GM-2 # Disk descr.
        AD0AN99691060B214       # Disk ident.
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM

Seek times:
        Full stroke:      250 iter in   0.008899 sec =    0.036 msec
        Half stroke:      250 iter in   0.008701 sec =    0.035 msec
        Quarter stroke:   500 iter in   0.018557 sec =    0.037 msec
        Short forward:    400 iter in   0.015540 sec =    0.039 msec
        Short backward:   400 iter in   0.015513 sec =    0.039 msec
        Seq outer:       2048 iter in   0.063381 sec =    0.031 msec
        Seq inner:       2048 iter in   0.063434 sec =    0.031 msec

Transfer rates:
        outside:       102400 kbytes in   0.065011 sec =  1575118 kbytes/sec
        middle:        102400 kbytes in   0.064456 sec =  1588681 kbytes/sec
        inside:        102400 kbytes in   0.064432 sec =  1589272 kbytes/sec


root@caleuche 00:14:46 /usr/home/cfs
# gpart list nvd1
Geom name: nvd1
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 244190640
first: 6
entries: 128
scheme: GPT
Providers:
1. Name: nvd1p1
   Mediasize: 209715200 (200M)
   Sectorsize: 4096
   Stripesize: 0
   Stripeoffset: 24576
   Mode: r0w0e0
   efimedia: HD(1,GPT,6d0640a8-900d-11eb-8ce6-a0369f3e8c80,0x6,0xc800)
   rawuuid: 6d0640a8-900d-11eb-8ce6-a0369f3e8c80
   rawtype: c12a7328-f81f-11d2-ba4b-00a0c93ec93b
   label: efiboot0
   length: 209715200
   offset: 24576
   type: efi
   index: 1
   end: 51205
   start: 6
2. Name: nvd1p2
   Mediasize: 524288 (512K)
   Sectorsize: 4096
   Stripesize: 0
   Stripeoffset: 209739776
   Mode: r0w0e0
   efimedia: HD(2,GPT,6d0d0cfa-900d-11eb-8ce6-a0369f3e8c80,0xc806,0x80)
   rawuuid: 6d0d0cfa-900d-11eb-8ce6-a0369f3e8c80
   rawtype: 83bd6b9d-7f41-11dc-be0b-001560b84f0f
   label: gptboot0
   length: 524288
   offset: 209739776
   type: freebsd-boot
   index: 2
   end: 51333
   start: 51206
3. Name: nvd1p3
   Mediasize: 274877906944 (256G)
   Sectorsize: 4096
   Stripesize: 0
   Stripeoffset: 210763776
   Mode: r1w1e0
   efimedia: HD(3,GPT,6d12c01c-900d-11eb-8ce6-a0369f3e8c80,0xc900,0x4000000)
   rawuuid: 6d12c01c-900d-11eb-8ce6-a0369f3e8c80
   rawtype: 516e7cb5-6ecf-11d6-8ff8-00022d09712b
   label: swap0
   length: 274877906944
   offset: 210763776
   type: freebsd-swap
   index: 3
   end: 67160319
   start: 51456
4. Name: nvd1p4
   Mediasize: 725115469824 (675G)
   Sectorsize: 4096
   Stripesize: 0
   Stripeoffset: 210763776
   Mode: r1w1e1
   efimedia: HD(4,GPT,6d161641-900d-11eb-8ce6-a0369f3e8c80,0x400c900,0xa8d4400)
   rawuuid: 6d161641-900d-11eb-8ce6-a0369f3e8c80
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: zfs0
   length: 725115469824
   offset: 275088670720
   type: freebsd-zfs
   index: 4
   end: 244190463
   start: 67160320
Consumers:
1. Name: nvd1
   Mediasize: 1000204886016 (932G)
   Sectorsize: 4096
   Mode: r2w2e3


root@caleuche 00:15:36 /usr/home/cfs
# nvmecontrol identify nvme1ns1
Size:                        244190646 blocks
Capacity:                    244190646 blocks
Utilization:                 244190646 blocks
Thin Provisioning:           Not Supported
Number of LBA Formats:       2
Current LBA Format:          LBA Format #01
Data Protection Caps:        Not Supported
Data Protection Settings:    Not Enabled
Multi-Path I/O Capabilities: Not Supported
Reservation Capabilities:    Not Supported
Format Progress Indicator:   Not Supported
Deallocate Logical Block:    Read Not Reported
Optimal I/O Boundary:        0 blocks
NVM Capacity:                0 bytes
Globally Unique Identifier:  00000000000000000000000000000000
IEEE EUI64:                  ffffffffffffffff
LBA Format #00: Data Size:   512  Metadata Size:     0  Performance: Best
LBA Format #01: Data Size:  4096  Metadata Size:     0  Performance: Best

root@caleuche 00:15:56 /usr/home/cfs
# uname -a
FreeBSD caleuche 12.2-STABLE FreeBSD 12.2-STABLE r369525 GENERIC  amd64
```


```
root@caleuche 00:29:44 /usr/src
# dmesg | tail
ums0 on uhub0
ums0: <KYE Optical Mouse, class 0/0, rev 1.10/0.00, addr 1> on usbus0
ums0: 3 buttons and [XYZ] coordinates ID=0
uhid0 on uhub0
uhid0: <NOVATEK USB Keyboard, class 0/0, rev 1.10/1.12, addr 2> on usbus0
Security policy loaded: MAC/ntpd (mac_ntpd)
WARNING: autofs_trigger_one: cv_wait_sig for /n/ failed with error 4
Accounting enabled
Accounting disabled
WARNING: reducing swap size to maximum of 65536MB per unit
```


```
root@caleuche 00:29:48 /usr/src
# dmesg | head -n 60
---<<BOOT>>---
Copyright (c) 1992-2021 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 12.2-STABLE r369525 GENERIC amd64
FreeBSD clang version 10.0.1 (git@github.com:llvm/llvm-project.git llvmorg-10.0.1-0-gef32c611aa2)
VT(efifb): resolution 1024x768
CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz (4200.20-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x906e9  Family=0x6  Model=0x9e  Stepping=9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffafbbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0x29c6fbf<FSGSBASE,TSCADJ,SGX,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,NFPUSG,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PROCTRACE>
  Structured Extended Features3=0xc000000<IBPB,STIBP>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 68719476736 (65536 MB)
avail memory = 66711527424 (63621 MB)
[...]
```

Thanks in advance,

-Cristian


----------



## SirDice (Mar 30, 2021)

cfs said:


> I allocated 256G for swap


That's a ridiculous amount, completely overkill. 


cfs said:


> Real memory is 64G.


I have 16GB of swap for a machine with 96GB of memory. That's more than enough.


----------



## George (Mar 30, 2021)

Or this? I didn't calculate..

```
vm.swap_maxpages = Maximum amount of swap supported
```


----------



## zirias@ (Mar 30, 2021)

cfs said:


> The last line in dmesg output is telling:
> 
> 
> ```
> ...


I can't find that documented right now, but it's pretty straight forward: the size of a single swap space (you can have multiple) is limited to 64GB. It probably doesn't need much documentation because even for _total_ swap, this would be ridiculously large for almost any machine.

A sane swap size for 64GB of RAM is somewhere in the range 8GB to 16GB. A reason for this is: Swap is extremely slow when compared to RAM. So, you NEVER want to have things in swap that are regularly accessed, your machine will be busy swapping in and out and not being able any more to achieve anything else. Swap is only for storing memory pages that _aren't_ accessed for a reasonable period of time, so the RAM is free for something else.

If you _ever_ run into a situation where 64GB + 16GB swap isn't enough, you _want_ the OOM killer, that way, ONE process is sacrificed, but the machine keeps working. If it can just swap more instead, it will do so, and become completely unusable.


----------



## zirias@ (Mar 30, 2021)

George said:


> Or this? I didn't calculate..


Hm, this seems to be the maximum for _total_ swap. On my 8GB RAM desktop, it's set to 15656752, which is a bit short of 60GB, on my server with 64GB RAM, the value is 130478656, which would be nearly 498 GB.

I calculated with the standard page size of 4k here; I guess that's what's meant.
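As a quick sanity check of those numbers (just a sketch; `swap_max_bytes` is a helper name I made up, and the 4 KiB page size is the assumption stated above):

```
# Convert a vm.swap_maxpages value (in pages) to bytes,
# assuming the standard 4 KiB page size.
PAGE_SIZE = 4096

def swap_max_bytes(max_pages: int) -> int:
    return max_pages * PAGE_SIZE

# The two values quoted above:
print(swap_max_bytes(15656752) / 2**30)   # 8GB-RAM desktop -> ~59.7 GB
print(swap_max_bytes(130478656) / 2**30)  # 64GB-RAM server -> ~497.7 GB
```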

Still, just don't try to use silly amounts of swap


----------



## Snurg (Mar 30, 2021)

IIRC the maximum supported swap partition size is 64G on 12.2.
In 13 it has been increased to either 128G or 256G.

If you want more than this, you need multiple swap partitions.
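A sketch of what that could look like in /etc/fstab (the device and partition names here are hypothetical, not taken from the OP's layout):

```
# /etc/fstab -- four 64G freebsd-swap partitions used together
/dev/nvd1p3   none   swap   sw   0   0
/dev/nvd1p4   none   swap   sw   0   0
/dev/nvd2p1   none   swap   sw   0   0
/dev/nvd2p2   none   swap   sw   0   0
```

After activating them with `swapon -a` (or rebooting), swapinfo(8) should list each device plus a Total line.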

Edit:
The maximum swap size was increased recently because of the well-known issue that FreeBSD by default has very high swappiness and likes to swap out literally the whole RAM in some scenarios. For this reason it can be bad if the swap is smaller than RAM.


----------



## Mjölnir (Mar 30, 2021)

Zirias said:


> I can't find that documented right now, but it's pretty straight forward: the size of a single swap space (you can have multiple) is limited to 64GB. It probably doesn't need much documentation because even for _total_ swap, this would be ridiculously large for almost any machine.


Please RTFM tuning(7).  Notable exceptions to _"almost"_ below.


Zirias said:


> A sane swap size for 64GB of RAM is somewhere in the range 8GB to 16GB. A reason for this is: Swap is extremely slow when compared to RAM.


This gap has narrowed significantly with NVMe devices, which the OP has.


Zirias said:


> So, you NEVER want to have things in swap that are regularly accessed, your machine will be busy swapping in and out and not being able any more to achieve anything else. Swap is only for storing memory pages that _aren't_ accessed for a reasonable period of time, so the RAM is free for something else.


Not correct.  There are two notable use cases that need a lot of swap, even on systems with huge RAM, and the system will not end up _"not being able any more to achieve anything else"_:

1. When a tmpfs(5) is used heavily in terms of size.  tmpfs(5) uses swap space when RAM is exhausted.
2. Systems where many users with a huge accumulated working set in RAM log in & out frequently _"and lots of idle processes"_, e.g. a _number cruncher_ server of an R&D department or a large build(7) server.  Then the admin may want to tune the VM system according to tuning(7), i.e. tune `vm.swap_idle_enabled` & the two `vm.swap_idle_threshold{1,2}` to swap out processes very early.



Zirias said:


> If you _ever_ run into a situation where 64GB + 16GB swap isn't enough,


WhenEVER you're tempted to use total terms like _any, (n)ever, all, none_, etc., please carefully question yourself & search for notable, reasonable exceptions.


Zirias said:


> you _want_ the OOM killer, that way, ONE process is sacrificed, but the machine keeps working. If it can just swap more instead, it will do so, and become completely unusable.


Not correct.  See above.
To the genuine Q: all I found is that you may succeed by increasing `kern.maxswzone`.  FMLU, the number of swap devices is limited (hardwired) to 4.
EDIT corrected the statement about tmpfs(5).  I didn't initially write what I thought... typical mismatch between thoughts & words.


----------



## SirDice (Mar 30, 2021)

Mjölnir said:


> When a tmpfs(5) is used heavily in terms of size. tmpfs(5) resides in swapspace.


No, tmpfs(5) uses memory; it can use the total amount of memory, which includes swap. It will use RAM first if that's available and will only move things to swap if there are more pressing needs for the memory. If you need to literally store gigabytes of data then you shouldn't be using tmpfs(5) for that, as it directly competes with things like the file and process caches stored in memory. As disks are fast enough, it makes more sense to just create a "real" filesystem for it in that case.


----------



## zirias@ (Mar 30, 2021)

Mjölnir said:


> This gap reduced significantly with NVMe devices, which the author OP has.


No. The difference in _access time_ is still several orders of magnitude. Add to that that pages in swap can't be physically addressed by the CPU, so you always have the copying effort.


Mjölnir said:


> When a tmpfs(5) is used heavily in terms of size. tmpfs(5) resides in swapspace.


Flat-out wrong. It can swap out actual file contents, but not metadata. And then, for _accessing_ those contents, they must be swapped back in.


Mjölnir said:


> Systems where many users with a huge accumulated working set in RAM log in & out frequently _"and lots of idle processes"_, e.g. a _number cruncher_ server of an R&D department or a large build(7) server. Then the admin may want to tune the VM system according to tuning(7), i.e. tune `vm.swap_idle_enabled` & the two `vm.swap_idle_threshold{1,2}` to swap out processes very early.


If you have huge amounts of RAM actually accessed frequently, nothing will help. The machine will just constantly swap in/out.


Mjölnir said:


> WhenEVER () you're tempted to use total terms like _any, (n)ever, all, none_, etc., please carefully question yourself & search for notable, reasonable exceptions.


Please know what you claim before expressing strong opinions.


----------



## Mjölnir (Mar 30, 2021)

Zirias said:


> If you have huge amounts of RAM actually accessed frequently, nothing will help. The machine will just constantly swap in/out.


Please file in a bug report on the manual page tuning(7)?  The important factor is the statement _"lots of idle processes"_, which should better read _"significant portions of the working set in RAM not accessed for a long time"_, where _long time_ is FMLU on the order of at least seconds.  If you have e.g. a bunch of engineers doing fluid dynamics calculations with a huge working set in RAM but high locality (only a small data subset is accessed for a few seconds, then the next subset, etc.; another example I could imagine would be weather forecasting), the system will benefit from huge swap space and not constantly swap in & out.

A similar effect occurs e.g. when the use case generally makes "normal" use of a tmpfs(5), but occasional peaks of size usage occur.  Then _"swap space is the saving grace of UNIX and even if you do not normally use much swap, it can give you more time to recover from a runaway program before being forced to reboot"_.  This statement from tuning(7) remains true.  It recommends swap = 2*RAM if RAM <= 4GB, else swap = RAM.


Zirias said:


> Please know what you claim before expressing strong opinions.


Please try to understand the implications & demands of the two use cases I gave as exceptions.


----------



## mark_j (Mar 30, 2021)

Zirias said:


> I can't find that documented right now, but it's pretty straight forward: the size of a single swap space (you can have multiple) is limited to 64GB. It probably doesn't need much documentation because even for _total_ swap, this would be ridiculously large for almost any machine.


It was 64GB per instance/unit (maximum 4?), so you can have 256GB, just make it 4 partitions on 4 disks (or 4 files).

But, as you and SirDice said, 256GB is just wasted space. Even if you save kernel dumps 64GB is more than enough.


----------



## zirias@ (Mar 30, 2021)

Mjölnir said:


> Please file in a bug report on the manual page tuning(7)?


Well, I personally don't care what ancient stuff is in there. The current text block about swap dates back to 2014, and is only a slight rewrite of the earlier block from 2009, when it was first changed from unconditionally recommending 2x RAM size. The 1x RAM size recommendation is still there, but pretty much obsolete with systems with 64GB (and more) RAM.


Mjölnir said:


> "swap space is the saving grace of UNIX and even if you do not normally use much swap, it can give you more time to recover from a runaway program before being forced to reboot"


THIS statement remained unchanged since the introduction of tuning(7) with FreeBSD 5, in 2003. It's not wrong now, but desperately needs a warning that you can indeed "overdo" it.


Mjölnir said:


> Please try to understand the implications & demands of the two use cases I gave as exceptions.


You don't help the OP by making up usecases that you *think* would profit from ridiculous amounts of swap; they won't. Of course, to understand why, you must understand how virtual memory and swap work in general.


----------



## Snurg (Mar 30, 2021)

To the use cases Mjölnir described I want to add the typical desktop usage.
It is just insanely bad user experience if one has to wait seconds to minutes if one switches to another Firefox window, and sees that in top constantly _swread_ing, while the rest of the computer acts sluggish-to-unusable, despite almost all memory being free.

There are many users who would prefer the traditional simple Unix swap method of _only swapping when actually needed_ (e.g. swappiness = 0, while default FreeBSD behaves like swappiness = 99).

*Why is the idea of introducing a swappiness option being rejected so much by some people?*


----------



## Mjölnir (Mar 30, 2021)

Zirias said:


> Well, I personally don't care what ancient stuff is in there. The current text block about swap dates back to 2014, and is only a slight rewrite of the earlier block from 2009, when it was first changed from unconditionally recommending 2x RAM size. The 1x RAM size recommendation is still there, but pretty much obsolete with systems with 64GB (and more) RAM.


When such a large system is used by a large #of users concurrently or runs a reasonable large #of processes with a lot of idle processes, the statements in tuning(7) remain true.  They are not obsolete.


Zirias said:


> THIS statement remained unchanged since the introduction of tuning(7) with FreeBSD 5, in 2003. It's not wrong now, but desperately needs a warning that you can indeed "overdo" it.


Please file in a bug report on tuning(7) and/or talk to the VM wizards on <freebsd-hackers>.


Zirias said:


> You don't help the OP by making up usecases that you *think* would profit from ridiculous amounts of swap; they won't. Of course, to understand why, you must understand how virtual memory and swap work in general.


Please enlighten me or point me to the right direction (links into the _Weltnetz_).


----------



## zirias@ (Mar 30, 2021)

Snurg said:


> To the use cases Mjölnir described I want to add the typical desktop usage.
> It is just insanely bad user experience if one has to wait seconds to minutes if one switches to another Firefox window, and sees that in top constantly _swread_ing, while the rest of the computer acts sluggish-to-unusable, despite almost all memory being free.
> 
> There are many users who would prefer the traditional simple Unix swap method of _only swapping when actually needed_? (e.g. swappiness = 0, while default FreeBSD behaves like swappiness=99)
> ...


There seem to be at least two misunderstandings.

1. This is *exactly* the problem that will be *more* likely if you just add more swap. You should avoid it and limit swap to a sane size.
2. FreeBSD doesn't swap "just because". It does so only when it's necessary. You don't happen to use ZFS on 12.x or earlier? ARC is always wired (think about it, what good would be a cache for a disk stored on a disk), and the old implementation was very reluctant to give back memory, so the system swaps other things instead. If you're not on 13 yet, limit the amount of ARC to prevent that. If that doesn't help, your system simply has too little RAM for your workload. Swap is *never* a replacement for RAM.

Mjölnir: just no. I already explained in depth why such amounts of swap make no sense at all and do harm. If you don't *want* to understand, that isn't my problem.

Side note: at least the installer does a somewhat sane thing and never automatically suggests a swap partition larger than 4GB:





part_wizard.c « partedit « bsdinstall « usr.sbin - src - FreeBSD source tree (cgit.freebsd.org)

It's far from a "perfect" implementation though; it doesn't seem to look at the actual RAM size at all.


----------



## Snurg (Mar 30, 2021)

Zirias said:


> FreeBSD doesn't swap "just because". It does so only when it's necessary. You don't happen to use ZFS on 12.x or earlier? ARC is always wired (think about it, what good would be a cache for a disk stored on a disk), and the old implementation was very reluctant to give back memory, so the system swaps other things instead. If you're not on 13 yet, limit the amount of ARC to prevent that. If that doesn't help, your system simply has too little RAM for your workload. Swap is *never* a replacement for RAM.


Your reply is a typical example of the denial of the fact that FreeBSD actually swaps "just because".

With ARC set to a sensible maximum, say, 1GB, there is still a lot of swapping going on even when, say, only 20% of memory is actually used (i.e. not "free memory"). It is just insane that FreeBSD is able to fill a 2GB swap partition when only ~8GB of 48GB RAM has ever been used since boot.
This insane behaviour only stops when one deactivates swapping and just takes care that sufficient RAM stays free.

I haven't tried FreeBSD 13, so I cannot yet tell whether this insane swappiness has become better.


----------



## zirias@ (Mar 30, 2021)

I don't know how you get your system to do *anything* like that. My desktop with 8GB RAM, ARC limited to 3G (and nothing else "tuned"), chromium with a dozen tabs running, as well as several other applications, leaving right now less than 1G of free RAM, not a single byte of swap is used.

Yes, this is an observed fact, not "denial" of facts. And no, this wasn't different on 11 and 12. My HDD is slow, I notice every bit of having to wait for swap immediately.


----------



## Snurg (Mar 30, 2021)

Just leave a few big memory eaters like LibreOffice, several tab-filled Firefox windows and the like idle for a long time, say, overnight.

Then you can experience the disruptive feeling when your system suddenly starts to swap in gigabytes that the friendly VM swapped out through the night, and you have no idea whether it will become responsive again in seconds, or whether it is better to go smoke one or make a tea, as this swap-in can sometimes take quite a while.

I repeat, this is _very disruptive and annoying_.

And this is a commonly observed experience if you use a PC with _plenty_ of RAM, and not one with only a puny amount of RAM like the majority of "desktop" users.


----------



## zirias@ (Mar 30, 2021)

Snurg said:


> Just leave a few big memory eaters like LibreOffice, several tab-filled Firefox windows and the like idle for a long time, say, overnight.


I do that regularly, and had exactly this problem UNTIL I restricted ARC to a sane maximum, back on 11.x. Since then, it never happened again. It was *always* ARC that forced other things to get swapped out. It probably happens during periodic(8) jobs in the middle of the night. Swapping back in will only happen once you request the data in the swapped pages.

Therefore, if such a restriction doesn't solve that problem on your machine immediately, something else must be configured in an unusual way.

Finally, it's _possible_ this is not needed any more on 13. OpenZFS' ARC very quickly returns memory. I have to try whether this is good enough to not run into a heavy swapping situation, even without restricting ARC.
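For reference, the ARC cap I mean is a loader tunable; a sketch (the 3G value is just what I use on this 8GB box, pick something sensible for your own RAM):

```
# /boot/loader.conf -- cap the ZFS ARC (example value)
vfs.zfs.arc_max="3G"
```

On 13 with OpenZFS the sysctl is spelled `vfs.zfs.arc.max`, though IIRC the old name is still accepted as a compatibility alias.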


----------



## Mjölnir (Mar 30, 2021)

Just a sidenote from someone who's got no clue about the VM system & what all that swap is about (I wonder how I managed to pass the univ tests...): a multitasking OS is well capable of performing other tasks while waiting for some memory pages to be swapped in, esp. when it runs on multi-core HW.


----------



## zirias@ (Mar 30, 2021)

Oh, wow. Then please, provoke a memory congestion with lots of swap activity. Observe. Then ask yourself, how many of your processes get anywhere _without_ the need for disk I/O?

Spoiler: very few.


----------



## Mjölnir (Mar 30, 2021)

Zirias said:


> Oh, wow. Then please, provoke a memory congestion with lots of swap activity.


1st I have to look up what that is...  So it's not possible that the tasks who access only pages that are present in RAM get CPU runtime slices while some pages are swapped in or out?


----------



## SirDice (Mar 30, 2021)

Mjölnir said:


> So it's not possible that the tasks who access only pages that are present in RAM get CPU runtime slices while some pages are swapped in or out?


Yes, but then you don't have memory congestion.


----------



## cfs (Mar 30, 2021)

I want to thank Snurg and Mjölnir for actually addressing my question.  I have to say, coming back to the forums after several years and getting my use case called "ridiculous", in the first sentence of the first answer, by a moderator, was not a great experience.  With no other content in the message.  I imagine you guys have to tell newbies "don't do that!" a lot, and this was my first post from a new account.  But still, not great.  Poul-Henning Kamp wrote "You're Doing It Wrong" more than 10 years ago.

Snurg, an idea in case it's useful: these days you can get a 16 GB Intel Optane drive (3D Xpoint memory in NVMe/M.2 form factor) on eBay for around the price of a large Starbucks coffee.  If you have a free PCIe x1 slot you can buy an adapter from M.2 to PCIe x1 for the cost of a second large coffee.  The sustained random read latency of 3D Xpoint today is around the speed RAM had in the mid 90s.  I am guessing swapping to that would make your life happier.

-Cristian


----------



## zirias@ (Mar 30, 2021)

Mjölnir said:


> 1st I have to look up what that is...  So it's not possible that the tasks who access only pages that are present in RAM get CPU runtime slices while some pages are swapped in or out?


That's not the point. Sure this is possible. Just find some process doing anything useful that _doesn't_ need disk I/O. Also remember that every execve(2) needs disk I/O _and_ memory pages to actually load the binary into. There's just no way a machine can do anything useful unencumbered while heavily swapping. If you don't believe it, test it.


----------



## cfs (Mar 30, 2021)

Zirias said:


> [...]There's just no way a machine can do anything useful unencumbered while heavily swapping.



I guess the reference is too obscure.   Here:  https://queue.acm.org/detail.cfm?id=1814327


----------



## Snurg (Mar 30, 2021)

Zirias said:


> Therefore, if such a restriction doesn't solve that problem on your machine immediately, _something else must be configured in an unusual way_.


Yes, and this "unusual configuration" is simply having >= 32GB RAM to make this issue very obvious.
Very few desktop users have 32GB or more RAM installed.
Many users ridiculed me, _just because they did not experience this issue due to their low-RAM configurations and apparently were unable to imagine such an absurd system behaviour_.

I investigated that issue years ago in depth here in the forums, forgot whether it was on 10 or 11.
ZFS ARC was always cut down to a defined maximum, so this should not be a factor.

Back then, this behaviour was confirmed by several users who also had >= 32GB installed.
The _only way I know of to make the system work snappy_, without the annoying lags when it swapped in stuff that it swapped out "just because", was _to entirely disable swap_.
This is no good practice, but it is manageable when you take care that some amount of memory, say ~5-10GB, always stays free.

cfs: Very good suggestion! This would also help avoid the extreme disk thrashing while unswapping stuff.
But I would definitely prefer a sysctl like swappy='off' to restore the original UNIX swapping behaviour.


----------



## zirias@ (Mar 30, 2021)

cfs said:


> getting my use case called "ridiculous"


You didn't describe a "use case", as there's no description how this ridiculous amount of swap should ever be used in a sane way. And yes, it's impossible.


cfs said:


> The sustained random read latency of 3D Xpoint today is around the speed RAM had in the mid 90s.


We're talking about access times of around a few µs at least, as compared to modern RAM which is in the single-digit ns range here, so that's a factor of roughly 1000. Sure, _if_ your swap device is as fast as possible, this helps a bit in situations when swapping is unavoidable. Still, heavy constant swapping kills the performance of any machine with any workload.

As mentioned above, no matter how quickly you can access a page in swap (where 1000 times slower than a memory cell is a current optimum), it can't be _used_ from there. It must be copied to RAM first, and another one must be copied to swap to make room for that.


----------



## zirias@ (Mar 30, 2021)

Snurg said:


> Yes, and this "unusual configuration" is simply having >= 32GB RAM to make this issue very obvious.


That's very unlikely. My server with 64GB, many virtual machines and jails, also used for desktop stuff remotely, also doesn't use any swap unless there's only very little free RAM (around 1GB). I monitor that closely cause I sometimes run large builds on it and want to make sure it can stand the load.


----------



## Snurg (Mar 30, 2021)

You and many others seem simply not to understand that this sick swap behaviour _is connected with low system load/activity._
Servers like yours are typically not quasi "idling" like a desktop.


----------



## zirias@ (Mar 30, 2021)

Sorry, that's nonsense as well. My server is idling most of the time, as it only "powers" a private house with two parties.

It's not that I doubt what you observe. But very much the conclusions you present.


----------



## cfs (Mar 30, 2021)

Zirias said:


> You didn't describe a "use case", as there's no description how this ridiculous amount of swap should ever be used in a sane way. And yes, it's impossible.



It's hard for me to see the return on investment of your strategy of digging in hard on that.  But OK, let's play.

"The really short version of the story is that Varnish knows it is not running on the bare metal but under an operating system that provides a virtual-memory-based abstract machine. For example, Varnish does not ignore the fact that memory is virtual; it actively exploits it. A 300-GB backing store, memory mapped on a machine with no more than 16 GB of RAM, is quite typical. The user paid for 64 bits of address space, and I am not afraid to use it."

Quoted from "You're Doing It Wrong" by Poul-Henning Kamp, 10 years ago.

The article is not a 1-to-1 mapping to our discussion, but that is the VM subsystem, and it is swapping.  And in any case, if you are willing to make hard absolute statements, I imagine you are willing to put in the effort of doing some research for a rebuttal, so I won't do that work for you ;-).

-Cristian


----------



## Snurg (Mar 30, 2021)

*Sigh*
Seems it's of no use to talk about this topic with people who either are not developers who know the details of how FreeBSD swapping works, or have never had the experience of using a 32+GB desktop.

Whatever, when I got enough time and brain free and am in the mood, I'll look at the memory/swap management code and try to find out what needs to be patched to implement a swappiness sysctl, or maybe a build option for zero swappiness (probably easier).


----------



## Mjölnir (Mar 30, 2021)

cfs said:


> I want to thank Snurg and Mjölnir for actually addressing my question.  I have to say, coming back to the forums after several years and getting my use case called "ridiculous", in the first sentence of the first answer, by a moderator, was not a great experience.


He didn't call your use case _"ridiculous"_, but the configuration with that unusually large amount of swap.


cfs said:


> With no other content in the message.


You could have provided that.  Or do it now; we're curious...


----------



## zirias@ (Mar 30, 2021)

I am a developer; I studied operating systems at university and had to write Linux kernel code back then. No, I haven't looked deeply into the FreeBSD kernel code so far, but I _do_ know how virtual memory management works.
I do use a "32+GB desktop". This server has a full install of KDE and several applications, and that's used regularly. You don't want to tell me that using this remotely changes swap behavior…
All in all, your conclusions just make no sense. Something else must be wrong in your setup.


----------



## Mjölnir (Mar 30, 2021)

Snurg said:


> Whatever, when I got enough time and brain free and am in the mood, I'll look at the memory/swap management code and try to find out what needs to be patched to implement a swappiness sysctl, or maybe a build option for zero swappiness (probably easier).


`sysctl -d vm.swap_idle_{enabled,threshold{1,2}}`

```
vm.swap_idle_enabled: Allow swapout on idle criteria
vm.swap_idle_threshold1: Guaranteed swapped in time for a process
vm.swap_idle_threshold2: Time before a process will be swapped out
```
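For reference, these are runtime tunables; a hypothetical `/etc/sysctl.conf` fragment to turn idle swapout on might look like this (the threshold values shown are FreeBSD's defaults, given only as placeholders, not recommendations):

```
# Enable swapout of idle processes (off by default)
vm.swap_idle_enabled=1
# Seconds a process is guaranteed to stay swapped in
vm.swap_idle_threshold1=2
# Seconds of idling before a process may be swapped out
vm.swap_idle_threshold2=10
```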
EDIT: To work around your issue, you could also reduce your RAM & send me the RAM modules that you pull out.  According to some info I stumbled upon in the _Weltnetz_, I can put a 16 GB RAM module into the memory slot of my ThinkPad, but I only have 4 GB soldered on the MB + 8 GB in the slot...  Thx in advance!
EDIT2: And send more money fast!  I need a new M.2 LTE modem card!


----------



## zirias@ (Mar 30, 2021)

cfs said:


> "The really short version of the story is that Varnish knows it is not running on the bare metal but under an operating system that provides a virtual-memory-based abstract machine. For example, Varnish does not ignore the fact that memory is virtual; it actively exploits it. A 300-GB backing store, memory mapped on a machine with no more than 16 GB of RAM, is quite typical. The user paid for 64 bits of address space, and I am not afraid to use it."


Given the context you put this quote in, do you understand the article? Because, yes, it is possible to write a (special-purpose!) piece of software actively "exploiting" virtual memory. Doing so, it has to make sure swapping in/out is reduced to a minimum. The article describes (among other things) how that is achieved.

Now, this special-purpose software is a high-performance cache for potentially huge amounts of data. Do you want to run _that_? If not, the conclusion that you could in any way benefit from such a huge swap space is badly flawed.


----------



## cfs (Mar 30, 2021)

Zirias said:


> Given the context you put this quote in, do you understand the article? Because, yes, it is possible to write a (special-purpose!) software actively "exploting" virtual memory. Doing so, it has to make sure swapping in/out is reduced to a minimum. The article describes (among other things) how that is achieved.
> 
> Now, this special-purpose software is a high-performance cache for potentially huge amounts of data. Do you want to run _that_? If not, the conclusion that you could in any way benefit from such a huge swap space is badly flawed.



Ok, so we moved away from "ridiculous" and "impossible" to "huge" and "if not"?   Progress!  ;-).

I think my use case is as old as the hills.  I have a big data set and a program that works by loading it, doing processing for a long time, and then writing the result.  The problem is the opposite of "embarrassingly parallel", meaning it can't work by processing pieces of the input at a time.   We could rewrite the program to use mmap, but that is not a trivial change, and it would be more expensive than paying $134 for 1 TB of NVMe.

-Cristian


----------



## Snurg (Mar 30, 2021)

Mjölnir said:


> `sysctl -d vm.swap_idle_{enabled,threshold{1,2}}`
> EDIT to workaround your issue, you could also reduce your RAM


The idea might be good: disable idle swapping and set the threshold to a few days...
But if I remember correctly, it was the normal pager that I was unable to convince to stop swapping out.
I'll try that 

But, honestly, I don't want to lose memory 
These are old 4GB PC3-10600R modules, a dozen of which I bought cheaply years ago for <2 euros/GB to fill up my PC.
Non-parity RAM is way more expensive... ouch


----------



## zirias@ (Mar 30, 2021)

Snurg said:


> The idea might be good, disable idle swapping and setting threshold to a few days...


The thing is, it's disabled by default. And enabling it _should_ only lower the priority of idle processes' pages faster, so they can be swapped out earlier. It shouldn't change anything about not swapping out pages unless there's a (foreseeable) need…


----------



## Mjölnir (Mar 30, 2021)

Zirias, OT, but IIUC from the conversations on the <freebsd-hackers> list, someone is currently reworking the OOM killer.  Plus, IMHO, with these new low-latency NVRAM storage technologies (_Optane_ & NVMe), _the BeaSD_'s VM swap implementation could be enhanced to honour a swap device _priority_, to stagger swap devices.  And now with that _review incident_...  IMHO you're the perfect candidate to either help with coding or writing tests, or review that stuff.
EDIT: When a _noob_ like me can get an account on _Phabricator_ quickly, you as a proven ports(7) author (& maintainer?) should get one even quicker.


----------



## zirias@ (Mar 30, 2021)

Mjölnir said:


> Plus IMHO with these new low-latency NVRAM storage technologies _Optane_ & NVMe, _the BeaSD_'s VM swap implementation could be enhanced to honour a swap device priority to stagger swap devices.


Nothing against making swap perform better. But it's impossible to solve a few underlying problems, so (yes, in the absence of a special-purpose program carefully choosing its data structures in a way that minimizes page faults) heavy swapping will always be a performance killer. In fact, the mentioned program's design actually avoids heavy swapping while still allocating huge amounts of virtual memory.


Mjölnir said:


> IMHO you're the perfect candidate to either help with coding or writing tests or review that stuff.


Probably not, because knowing the theory and having touched _some_ kernel code at some point is _far_ from enough to be qualified for such reviews. I'd need to invest a lot of time first to get familiar with the FreeBSD kernel.


----------



## PMc (Mar 30, 2021)

SirDice said:


> That's a ridiculous amount, completely overkill.


I disagree. It depends on what the application wants to do.

There are scenarios where it is a lot cheaper to pull in a precomputed working set from swap than to build it anew (and the application may not be designed to write it to a file). When there are multiple such working sets, used only occasionally and not in parallel, you have the use case.



SirDice said:


> I have 16GB of swap for a machine with 96GB of memory. That's more than enough.



On smaller machines I always had more swap in active use than memory installed. There is obviously a delay when switching to another application, but that is faster than starting each application only when it is used.

Sadly, with Rel.12 this doesn't work well anymore, because the knobs to tune swapping behaviour have disappeared (they are now in the NUMA code, individual to each NUMA domain, and not accessible at runtime).


Snurg said:


> Edit:
> The maximum swap size had been increased recently because of the well-known issue that FreeBSD by default has very high swappiness, likes to swap out literally the whole RAM in some scenarios. And for this reason it can be bad if the swap is smaller than RAM.


That's by design of the VM system (and not only FreeBSD's): every memory page is logically mapped to some disk space.
But I have never seen FreeBSD actually make use of it. It always shrinks the ARC first - which is not what I want, because when the ARC is gone, all disk access gets slow, whereas with paging out one application and pulling in another, things would, after that delay, continue to work normally (if the apps aren't used in parallel).
Up to Rel.11 one could set different thresholds for ARC shrinking and for pageout, but - as said above - no longer on Rel.12.



Snurg said:


> There are many users who would prefer the traditional simple Unix swap method of _only swapping when actually needed_ (e.g. swappiness = 0, while default FreeBSD behaves like swappiness = 99)



How do you make that happen? Mine won't swap unless needed.



Snurg said:


> Your reply is a typical example of the denial of the fact that FreeBSD actually swaps "just because".
> With ARC set to a sensible maximum, say, 1GB, there is still a lot of swapping going on even when, say, only 20% of memory is actually used (i.e. not "free memory"). It is just insane that FreeBSD is able to fill a 2GB swap partition when only ~8GB of 48GB RAM has ever been used since boot and is not "free memory".


What are you doing? My desktop has only 8G, no ARC limit, and no swapping - except when building two llvm instances in parallel, and then only a few MB.
Only when running for days are, over time, a few things moved to swap.



Zirias said:


> You don't help the OP by making up usecases that you *think* would profit from ridiculous amounts of swap; they won't. Of course, to understand why, you must understand how virtual memory and swap work in general.


On a smaller scale I had these use cases, and they did work.
2G mem installed and ~5G swap in use. Certainly much slower than with all-RAM available, but still usable. Now, given current solid-state storage, it may well work at bigger scales too. And the price difference between 64G and 256G of RAM is significant.


----------



## Mjölnir (Mar 30, 2021)

Zirias said:


> Probably not, cause knowing the theory and having touched _some_ kernel code some time is


enough to @least write or outline tests & constraints.  The underlying theory hasn't changed for decades, IIUC.


Zirias said:


> _far_ from enough to be qualified for such reviews.


_Pppffhhh..._  You would be surprised how many beginner's bugs even a _wizard guru_ is able to commit @4:00 A.M. (local time).... 


Zirias said:


> I'd need to invest a lot of time first to get familiar with the FreeBSD kernel


Not true.  VM has been VM for decades, and being a _noob_ can even be advantageous, because that _noob_ guy asks nasty questions that the _wizard_ might forget when s/he's _in the flow_.


----------



## Snurg (Mar 30, 2021)

PMc said:


> What are you doing? My desktop has only 8G, no ARC limit, and no swapping. Except when building two llvm in parallel, and then only a few MB.
> Only when running for days, over time a few things are moved to swap.


I hate rebuilding my desktop session, so I just sleep the PC instead of starting it up/shutting it down every day.
So there are quite long uptimes, during which swap used and inactive/laundry memory grow to insane amounts.
I usually reboot only after updates.


----------



## PMc (Mar 30, 2021)

Snurg said:


> Just leave a few big memory eaters like LibreOffice, several tab-filled Firefox windows and the like idle for a long time, say, overnight.
> 
> Then you can experience the disruptive feeling when your system suddenly starts to swap in gigabytes that the friendly VM swapped out through the night, and you have no idea whether it will become responsive again in seconds, or whether it is better to go smoke one or make a tea, as this swap-in can sometimes take quite a while.


Ah, now it becomes clearer. (I don't leave the desktop on over night anymore.)

Yes, it does that - it moves things out to swap after a very long time (hours/days), so this is not related to vm.swap_idle* (which is about a few seconds).
I don't know where this is controlled or whether it can be tuned, but I would agree that it should be fixed or made tunable: if the machine can run with this working set today, it should also be able to run with it tomorrow in the same fashion.

Anyway, I can confirm the effect from a different angle: I used to tune my server (contrary to the desktop) to page out ASAP and not shrink the ARC. Then with R.12 the respective knob was gone. Consequently, on the first day after boot things were not to my liking (under load the ARC was too small to fill up the L2ARC), and I considered putting in more RAM. But after 2-3 days it had levelled out; swap usage was nominal as before, and so was the behaviour.



Snurg said:


> Whatever, when I got enough time and brain free and am in the mood, I'll look at the memory/swap management code and try to find out what needs to be patched to implement a swappiness sysctl, or maybe a build option for zero swappiness (probably easier).


Wow, have fun with that. (If You make it tuneable in both directions, I'm interested.)


----------



## Mjölnir (Mar 30, 2021)

Snurg said:


> I hate to rebuild my desktop, and so I just sleep the PC instead of starting it up/shutting it down every day.
> So there are quite long uptimes in which swap used and inactive/laundry memory grows to insane amounts.
> I usually reboot only after updates.


Aha.  That whole suspend/resume stuff is not 100% sound (very likely also due to broken ACPI BIOSes).  I have numerous issues; unfortunately you guys keep posting interesting stuff here & I don't find the time to write half-way qualified bug reports (I also have the pride to @least _try_ to find a fix or workaround)...  So would you agree to just reboot once a week?


----------



## Snurg (Mar 30, 2021)

Mjölnir said:


> Aha.  That whole suspend/resume stuff is not 100% sound (very likely also due to broken ACPI BIOSes).  I have numerous issues, unfortunately you guys keep posting interesting stuff here & I don't find the time to write 1/2-way qualified bug reports (I also have the pride to @least _try_ to find a fix or workaround)...


Yes, it's sometimes hard to find all the suspend/resume breakers.
But it's sort of a challenge, too.



Mjölnir said:


> So would you agree to just reboot once a week?


Uh-uh. Question: how often should one update for a reasonably safe machine?

```
% uptime
7:45PM  up 35 days,  5:54, 8 users, load averages: 0.36, 0.66, 0.64
%
```
I think you are right, I should update and reboot more often... last reboot was when uptime >60 days.

Edit: Actual uptime is way less, as I sleep the PC multiple times every day.


----------



## Crivens (Mar 30, 2021)

cfs said:


> Ok, so we moved away from "ridiculous" and "impossible" to "huge" and "if not"?   Progress!  ;-).
> 
> I think my use case is as old as the hills.  I have a big data set and a program that works by loading it, doing processing for a long time, and then writing the result.


I smell FORTRAN...


cfs said:


> We could rewrite the program to use mmap, but that is not a trivial change, and it would be more expensive than paying $134 for 1 TB of NVMe.
> 
> -Cristian


Do you read it all at once? If yes, mmap will not change much. If you re-read it multiple times, a tmpfs may help.
FreeBSD is good at caching file content; ZFS too. We need more details on what you are doing.
Until then, swapping on multiple NVMe drives is your second best option. The best one is much more RAM.

And there is a difference in VM details between FreeBSD and Linux. I read both codebases, some time ago.


----------



## zirias@ (Mar 30, 2021)

cfs said:


> Ok, so we moved away from "ridiculous" and "impossible" to "huge" and "if not"?   Progress!  ;-).


No. The fact that you can have a single application working in a huge virtual address space perform acceptably *if* it is written in a way that is aware of the problem (simplest possible case: "sequential processing") just doesn't invalidate the generic reasoning. Add some other memory-hungry processes to the picture and you're in the "heavy swapping" scenario again. And the use cases made up in this thread still just won't work.

So, instead of acting offended by people telling you how bad that idea is, you could just have explained quickly that this is for a special-purpose host/application.



cfs said:


> I think my use case is as old as the hills.  I have a big data set and a program that works by loading it, doing processing for a long time, and then writing the result.  The problem is the opposite of "embarrassingly parallel", meaning, it can't work by processing pieces of input at a time.   We could re-write the program to use mmap, but that is not a trivial change that will be more expensive than paying $134 for 1 Tb of nvme.


TBH, reading this article about the design of that web cache, mmap(2) was the very first thing I had in mind as a means to make something like this (taking advantage of the 64-bit address space) work without a very weird system configuration. Just _maybe_, there _might_ be a small performance penalty from having to go through the filesystem (although I don't think it would be relevant; you typically have to test such things to be sure…)

In any case, if you can't make sure there is some "access pattern" (to the memory pages) in your processing that isn't _completely_ random, the resulting performance will be just as bad as your common memory-pressure situation.


----------



## zirias@ (Mar 30, 2021)

PMc said:


> Yes it does that - it does move out things to swap after a very long time (hours/days), so this is not related to vm.swap_idle* (which talk about a few seconds).
> I don't know where this is controlled or if it can be tuned, but I would agree that it should be fixed or made tuneable: if the machine can run with this working set today, it should also be able to run with it tomorrow in the same fashion.


I still think what you see here is the effect of other things, one candidate being the "periodic" jobs. They are I/O-heavy, so the ARC (and other caches too) will want RAM. Maybe some of them also need RAM directly. Then of course, if free memory runs short, the pages of all these long-idling processes are swapped out first.

This shouldn't even be a problem _if_ whatever needs the RAM would reliably give it back when needed – then you'd of course notice when using those idle applications, but only briefly. But that's what I described earlier: the ARC (up to 12.x) was very reluctant to really give back memory. _Maybe_ it isn't the only example.


----------



## Mjölnir (Mar 30, 2021)

Zirias said:


> And the use-cases made up in this thread still just won't work


I think you're referring to what I came up with?  Please elaborate, and keep in mind that in @least one place I explicitly noted _locality_ (i.e. in terms of memory access).  As for the other use cases -- well, these are explicitly mentioned in tuning(7), and FMLU the mathematical facts underlying CS didn't change in the last decades...


----------



## Snurg (Mar 30, 2021)

Caveats regarding mmap and ZFS.


----------



## zirias@ (Mar 30, 2021)

Sorry Mjölnir, the two cases I've seen from you were:

- tmpfs, which doesn't make sense, because if at all, it can only swap out file _contents_, and if that were something happening regularly, using tmpfs would be moot, because a regular on-disk filesystem would perform better.
- heavy multi-user systems: well, partially, because adding a lot of swap there only serves to prevent the OOM killer, so the system keeps running (somehow), but performance will be bad. The tunables you mention are a partial mitigation, because if you swap out idle processes more aggressively, the chance that actively used processes get stalled by heavy swapping is slightly lower.

Did I overlook something else?


----------



## zirias@ (Mar 30, 2021)

Snurg said:


> Caveats regarding mmap and ZFS.


Oh well, that the ARC has to play nice by actually knowing about mmapped files is obvious, but are you saying that _this_ issue is still unresolved? Hard to believe; mmap(2) isn't exactly an "exotic" thing…


----------



## richardtoohey2 (Mar 30, 2021)

Not sure if it's an example of what you are talking about, but MySQL imports and exports can cause MySQL to eat up swap (and then get killed by the OOM killer) on machines with 32G RAM, with most of that memory in the inactive state.  And I see this on machines without ZFS too, so I don't think it's related to ZFS/ARC.  This is on 12.x, sometimes 11.x.  In my brief testing so far, 13.x seems to handle it a lot better - swap doesn't get touched.


----------



## Mjölnir (Mar 30, 2021)

I also mentioned:

- running applications with a large RAM footprint & high RAM locality, which perform computations on chunks of subsets of their data (@least ~seconds/chunk).
- On the "heavy multi-user systems": the manpage tuning(7) & I explicitly noted _lots of idle processes_.  You might have overlooked that small but important detail.
- On tmpfs(5), I refined it to _occasional spikes_ of large usage, which is not unreasonable IMHO.


----------



## cfs (Mar 31, 2021)

Zirias said:


> In any case, if you can't make sure there is some "access pattern" (to the memory pages) in your processing that isn't _completely_ random, the resulting performance will be just as bad as your common memory-pressure situation.



I honestly think you are wrong.  I believe you are underestimating what these NVMe technologies can do today.  They can give you 800 MB/s of sustained random reads, today, at commodity hardware prices.   This is literally swapping from RAM to slower RAM.   The latency response at high queue depths does not resemble what you are used to from storage at all.  They behave like RAM.

Earlier SSDs, from 10 to 5 years ago, did not behave like that.  At low queue depths they had great latency response, but that dropped very quickly as queue depths increased.  They were constrained by their controllers and by the SATA bus and protocols.  Until 3 years ago or so, Intel Optane / 3D XPoint was the only really different one in terms of sustained random read behaviour resembling RAM when used over NVMe/PCIe.  They are still number one, but some of the higher-end commodity drives are getting close.  And by the way, these drives are not good at everything; 3D XPoint is slower at sequential writes, so if your use case is accelerating a database redo log you want something else.   But yay, sustained random reads.

-Cristian


----------



## zirias@ (Mar 31, 2021)

cfs said:


> I believe you are underestimating what these nvme technologies can do today. They can give you 800 Mb/s of sustained random reads, today, at commodity hardware prices. This is literally swapping from RAM to slower RAM. The latency response at high queue depths does not resemble what you are used to for storage at all. They behave like RAM.


The limiting factor is not _only_ the transfer rate but also the access time. Looking for numbers, I found these are in the single-digit µs range for those modern drives, which is awesome, but still a factor of 1000 away from modern RAM (single-digit ns range). So the performance loss from swapping is still substantial when compared to just accessing physical RAM.

The _other_ question would be whether this could still be "acceptable" in practice with this kind of modern hardware. A definitive answer to this is only possible by testing/benchmarking, but let's do the maths for a hypothetical scenario anyways:

Let's assume this (uncommon) scenario of a single application working in a huge allocation and let's assume it needs 1GB of pages currently swapped out. In this scenario, we'll probably have many "superpages" of size 2M (on amd64), but as FreeBSD splits up superpages into regular 4k pages again under memory pressure, we'll also have some of them.

A lot of assumptions are needed here; I'll further assume 128MB in 32768 regular pages and 896MB in 448 superpages (and a similar assumption for the pages that must be swapped out to make room). With this, we must transfer a total of 66432 pages to/from swap in order to swap in this 1GB of memory. With an access time of 5µs, we'd end up with roughly 0.33s in sum just waiting for transfers to start. Assuming some of the pages can be arranged contiguously on the swap device, say it's 0.25s. Add the transfer itself, 2GB at 4GB/s (0.5s), and this makes for a total of ~0.75s.

Now, if our application has 300GB to process and can organize its accesses so that every piece of memory has to be swapped only once, the additional processing time due to swapping will be roughly 4 minutes. With random and repeated accesses causing every piece to be swapped, say, 5 times, you're already at ~19 minutes…
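Spelling out the arithmetic in a few lines, under the same assumptions (the page counts, 5µs per transfer, 4GB/s throughput; the contiguity discount is ignored here):

```python
# Back-of-envelope estimate: swapping in 1 GB (and swapping out 1 GB
# to make room) under the assumptions stated above.
PAGE = 4 * 1024                 # regular page size
SUPERPAGE = 2 * 1024 * 1024     # amd64 superpage size
ACCESS_TIME = 5e-6              # assumed per-transfer latency: 5 µs
THROUGHPUT = 4 * 2**30          # assumed sequential rate: 4 GB/s

n_regular = (128 * 2**20) // PAGE       # 32768 regular pages
n_super = (896 * 2**20) // SUPERPAGE    # 448 superpages

transfers = 2 * (n_regular + n_super)   # out + in: 66432 transfers
latency = transfers * ACCESS_TIME       # pure access time
streaming = (2 * 2**30) / THROUGHPUT    # 2 GB moved at 4 GB/s
total = latency + streaming

print(transfers, round(latency, 2), round(total, 2))
# prints: 66432 0.33 0.83
```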

Yes, this is nowhere near the catastrophic figures with old drives. But it's still substantial, although it _does_ look "acceptable", depending on the usecase.

Side note: your "typical" memory pressure scenario caused by just lots of processes will still be a lot worse, because you can assume to have _much_ more regular 4k pages and most of the time the access time for each and every one of them.

—
If you achieved one thing, it's giving me a wish for new hardware, although I don't need it for my private desktop. I rest my case that swap can't "replace" RAM; still, these speeds are awesome, of course.


----------



## crypt47 (Feb 15, 2022)

Snurg said:


> Edit:
> The maximum swap size had been increased recently because of the well-known issue that FreeBSD by default has very high swappiness, likes to swap out literally the whole RAM in some scenarios. And for this reason it can be bad if the swap is smaller than RAM.


I would be happy to read about it if somebody points me to this well-known issue.


----------



## PMc (Feb 15, 2022)

crypt47 said:


> I would be happy to read about it if somebody points me to this well-known issue.


I have explained it a couple of times here already. It is not a FreeBSD-specific issue, but rather a behaviour of (a certain kind of) VMM design. I learned about it when I was working with AIX some 25 years ago. At that time it was simply impossible to run AIX with less swap than RAM.

The native behaviour of the VMM is that it expects every memory location to be backed by some file location, and that it can, at any time, re-fetch the memory contents from that file. For this to be possible, all modified (aka "dirty") memory locations need a place in swap where they can be put (and later re-fetched). Which basically translates to: you need at least as much swap as you have memory.

Over time this behaviour was gradually remedied on public demand: with installed memory getting bigger and bigger, and swap not even intended to be used because of its slowness, people did not want to reserve huge swap spaces for no practical benefit. So the VMM was somehow tuned to cope with not having (enough) swap space. But this is only an add-on, and it doesn't work well under all conditions.

AFAIK Linux has a different VMM design (not derived from the BSD lineage) that does not show this behaviour.


----------



## crypt47 (Feb 16, 2022)

PMc said:


> I explained it a couple of times here already. It is not a specific FreeBSD issue, rather a behaviour of (certain kind of) VMM designs.


Sorry, didn't read that far.

> The native behaviour of the VMM is that it expects every memory location to be backed by some file location

Yes, I've read about it in the old Design and Implementation of the FreeBSD Operating System. I put swap on an SSD, so in case this scenario is on by default, it's ok. I currently have 47GB swapped out with 32GB of RAM, and the browser works just fine. ^)

> AFAIK linux does have a different VMM design (not derived from the BSD lineage) that does not have this behaviour.

Yes, there are some tweaks to affect pressure and VM overcommit. What puzzled me is that FreeBSD has different pagers (algorithms or services to move pages) for different types of VM pages, while Linux AFAIK declares just one. Can't get my head around it, but ok... The problem I'm currently trying to solve, for fun and profit, is calculating how much swap is used per process. Procstat gives me a lot of VM kernel mappings via the kinfo_vmentry struct, but it seems that's not what I want: the numbers don't correspond to the swapinfo output. The FreeBSD design book doesn't mention kinfo_vmentry at all. My guess is that if I want to get the usage of physical memory on the SSD, the VM structures won't help me much.


----------



## PMc (Feb 16, 2022)

crypt47 said:


> Sorry, didn't read that far.
> 
> > The native behaviour of the VMM is that it expects every memory location to be backed by some file location
> 
> Yes, I've read about it in old design-and-implementation-of-the-freebsd-operating-system. I put SSD for swap so in case this scenario is on by default it's ok. Currently have 47Gb  swaped out with 32Gb of RAM, the browser works just fine.^)


Oops. I won't ask what You're doing.  (I never used more than 4 gig for a browser)

But then, if we run a VM with, say, 15 gig of memory, the guest will access these memory pages in a random fashion. When they are not used again quickly, the host will see them as dirty+idle and will, after some time, write them to swap.
So if we have 32 gig installed and start two VMs with 15 gig each, it should fit perfectly. But instead, after a day or so, we have 30 gig in swap.



crypt47 said:


> Yes, there are some tweaks to affect pressure and vm overcommit. What puzzled me is that FreeBSD has different pagers (algorithms or services to move pages) for different types of vm pages, but Linux afaik declares just one. Can't get my head around, but ok...


Yes, there are various threads and each of them tries to manage a specific item. It's like a corporation with multiple intelligences each doing their certain job, and the final outcome should be the thing we want...

The effect is that on Linux (I haven't used it since 1995, so my knowledge is not current) things run fine as long as memory suffices, but when it starts to move things out, there is a noticeable performance hit. On Berkeley-derived systems the transition is smooth - it starts to page out really early and then gradually increases, so you don't notice the precise point where physical memory is full.



crypt47 said:


> The problem I'm currently tring to solve for fun and profit is to calculate how much swap is used per process.


Oh yeah. That's a nice one. 
It's quite difficult. I never bothered to get all the way through with it.


crypt47 said:


> Procstat gives me a lot of VM kernel mapping via kinfo_vmentry struct, but it's seems that it's not what I want. The numbers doesn't correspond to swapinfo output.


No, at that high level the figures won't match. Swapinfo shows what has actually been written out - and the pager decides on that at its own discretion. Then there is vm.swap_reserved, which is what has been counted as potentially needing to be written out. It should be possible to sum up the pages that make up this figure, and then step by step get deeper into the mesh.


----------

