# How to make "impossible" memory allocations fail in a sane way?



## zirias@ (Jan 3, 2022)

Background: I'm working on a service written in C that uses a BTREE database as provided by dbopen(3). This seems to use in-memory caching extensively, so unless you force a `sync()` after every change, you have to make sure the database is properly closed on exit if you don't want to lose data.

As there's no sane way to _recover_ from OOM, I'm using the `xmalloc()` paradigm: wrap malloc(3) in a function that just exits on error. Normally you'd use abort(3) for that, but then no cleanup code is executed. You could attempt to clean up from a `SIGABRT` signal handler, but that's fragile and cumbersome. So I came up with a different idea: my own "panic" function using longjmp(3) to throw away most of the calling stack, but still execute the final cleanup.

This works perfectly fine when just simulating an allocation error. But trying to get a real one, I had to learn malloc(3) *just won't fail*, even when trying to allocate more than your physical RAM + swap. Instead, the OOM killer will wreak havoc randomly killing large processes when you attempt to _use_ that memory that doesn't really exist. 

So, I came across the `vm.overcommit` sysctl. tuning(7) has the following to say about it:

```
Setting bit 0 of the vm.overcommit sysctl causes the virtual memory
     system to return failure to the process when allocation of memory causes
     vm.swap_reserved to exceed vm.swap_total.  Bit 1 of the sysctl enforces
     RLIMIT_SWAP limit (see getrlimit(2)).  Root is exempt from this limit.
     Bit 2 allows to count most of the physical memory as allocatable, except
     wired and free reserved pages (accounted by vm.stats.vm.v_free_target and
     vm.stats.vm.v_wire_count sysctls, respectively).
```

Therefore, I tried `sysctl vm.overcommit=1`. A second later, my kernel (13.0-RELEASE-p4) panicked:

```
kernel:
syslogd: last message repeated 1 times
kernel: Fatal trap 12: page fault while in kernel mode
kernel: cpuid = 2; apic id = 02
kernel: fault virtual address     = 0x18
kernel: fault code                = supervisor write data, page not present
kernel: instruction pointer       = 0x20:0xffffffff80ca2596
kernel: stack pointer             = 0x28:0xfffffe00deaccb20
kernel: frame pointer             = 0x28:0xfffffe00deaccb80
kernel: code segment              = base 0x0, limit 0xfffff, type 0x1b
kernel:                   = DPL 0, pres 1, long 1, def32 0, gran 1
kernel: processor eflags  = interrupt enabled, resume, IOPL = 0
kernel: current process           = 2317 (chrome)      
kernel: trap number               = 12
kernel: panic: page fault
kernel: cpuid = 2
kernel: time = 1641225063
kernel: KDB: stack backtrace:
kernel: #0 0xffffffff80c58a85 at kdb_backtrace+0x65
kernel: #1 0xffffffff80c0b461 at vpanic+0x181
kernel: #2 0xffffffff80c0b2d3 at panic+0x43
kernel: #3 0xffffffff8108c1b7 at trap_fatal+0x387
kernel: #4 0xffffffff8108c20f at trap_pfault+0x4f
kernel: #5 0xffffffff8108b86d at trap+0x27d
kernel: #6 0xffffffff81062f18 at calltrap+0x8
kernel: #7 0xffffffff80ca138b at shm_truncate+0x5b
kernel: #8 0xffffffff80c77fa1 at kern_ftruncate+0xa1
kernel: #9 0xffffffff8108cabc at amd64_syscall+0x10c
```
(yes, there are two empty log lines from the kernel ...)

What's happening here? And is there a sane way to tell FreeBSD to _fail_ on memory allocations that can obviously never be fulfilled?


----------



## zirias@ (Jan 3, 2022)

Not sure this move makes sense. Although I came across this problem while doing "userland programming", it's actually a question about the base system (the kernel). Can we have a cross-post?


----------



## SirDice (Jan 3, 2022)

I still consider it to be a userland programming question. It may have some overlap with FreeBSD Development (Kernel development, writing drivers, coding, and questions regarding FreeBSD internals). Certainly not a base OS "General" question.


----------



## zirias@ (Jan 3, 2022)

So, what if I just want some other daemon, not written by me, to fail when allocating too much memory instead of triggering the OOM killer some time later? 

My usecase is my own programming here, yes, but the question is about the `vm.overcommit` sysctl (and why setting it to 1 results in a kernel panic)...


----------



## covacat (Jan 3, 2022)

setting it to 1 works for me, but i tried on a mostly idle system
i speculate that a bug in the shm stuff caused it to panic when changing 0 -> 1
i have a 1GB box with zfs where i send snapshots for offsite backup
sometimes zfs receive bombs with out of memory, so i tried to create a quick tool to pressure the memory allocator into shrinking arc / other wired memory
to my surprise i could allocate 8GB without problems (even with calloc)
then i found out about vm.overcommit
setting it to 1 had some effect (i could reclaim some memory) but never panicked


----------



## zirias@ (Jan 3, 2022)

Thanks covacat, this at least confirms this panic is "unexpected". Maybe worth a PR? It _looks_ like it can hit a process while in a syscall as well (and did so quite promptly here), and I kind of doubt this is intended...


----------



## covacat (Jan 3, 2022)

I think it's worth a PR
can you try to see if it bombs if you set it to 1 in sysctl.conf?


----------



## shkhln (Jan 3, 2022)

Zirias said:


> Therefore, I tried `sysctl vm.overcommit=1`. A second later, my kernel (13.0-RELEASE-p4) panicked:


This exactly matches my experience, you don't want to touch that at all.


----------



## obsigna (Jan 3, 2022)

Do atexit(3) handlers not work in case the daemon is stopped by your wrapped malloc?

In my daemons, I use the atexit mechanism exactly for this purpose. One of my daemons installs 6 atexit handlers, and all of these get called on normal return from main() as well as on any exit() on error, and among those error paths are many OOM conditions.


```
...
if (initDAQ())
{
   atexit(freeDAQ);
   ...
   ...
   if (SSL_thread_setup())
   {
      atexit(SSL_thread_cleanup);
      ...
      ...
      if (initDatabases() && initPotentiostat())
      {
         atexit(resetPotentiostat);
         ...
         ...
         /* starting the continuous measurement threads */
         atexit(stopMeasurements);
         ...
         ...
         /* open urandom */
         atexit(urandom_close);
         ...
         ...
         /* instantiate the inline calculator */
         atexit(calculator_release);
```


----------



## zirias@ (Jan 3, 2022)

Ok, thanks, so I'm not the only one.

I think I'll test covacat's suggestion first, as soon as I find the time to reboot into a potentially unstable system, because this was my thought as well: maybe there's just a bug that makes it dangerous to change it in-flight.


----------



## mark_j (Jan 4, 2022)

The (hybrid) demand paging used by FreeBSD always allows overcommits to memory. This is, by its nature, what virtual memory is.

Overcommitment of memory presumes, though, that there is sufficient backing store to provide the virtual address with a physical address to get memory should it actually need be used (from RAM or indirectly by swapping out something to disk).

When you set vm.overcommit to 0, you're telling the vm_map(9) subsystem to not worry about reserving swap space for this and other processes. (This is something inherited from Mach, if I recall correctly.)
When you set vm.overcommit to 1:
If there isn't enough backing store to cover the allocation of virtual memory, then no more processes can be created. Eventually the system is just going to panic.

Funnily (sadistically), mmap(2) used to have a MAP_NORESERVE option to achieve this. (In fact a quick check shows Linux still does: https://man7.org/linux/man-pages/man2/mmap.2.html . Go figure!)

Edit: typing on a tablet means I miss lines; such a small window to work with. Apologies.


----------



## zirias@ (Jan 4, 2022)

mark_j said:


> The (hybrid) demand paging used by FreeBSD always allows overcommits to memory. This is, by its nature, what virtual memory is.


If you call it overcommit to allow more pages than would fit in physical memory at the same time, then yes. I'm talking about allowing more than could ever be backed, including swap...


mark_j said:


> When you set vm.overcommit to 1, you're telling the vm_map(9) subsystem to not worry about reserving swap space for this and other processes. (This is something inherited from Mach, if I recall correctly).


This doesn't sound quite right. I understood it the other way around from the manpage. But maybe the manpage isn't correct or I don't understand it correctly? (I quoted the relevant text in my first post...) – with this sysctl set to 0 (the default), you can successfully allocate an amount of memory larger than your physical RAM and swap together...

What I want is `malloc()` (which, IIRC, uses `mmap()` internally) to fail when requesting an amount that couldn't be backed. Is there a way to have that?


----------



## covacat (Jan 4, 2022)

it should be 1, you understood correctly


----------



## Ambert (Jan 4, 2022)

I am new to FreeBSD, but here is how I would do it.

To detect an attempt to allocate more memory with `malloc()` than the size of physical RAM + swap, I would compute that size myself by requesting the relevant information from the operating system, and then I would monitor the size occupied by my process (excluding shared libraries) to see if it grows beyond the sum of physical RAM and swap.

However, if I was in your shoes, I would also monitor the amount of free memory still available, and adapt the frequency of my calls to `sync()` accordingly. For instance, if the amount of free memory is greater than 2 GiB, I would call `sync()` every 10 minutes. Otherwise, if the amount of free memory is between 0.5 and 2 GiB, I would call `sync()` every 3 minutes. And if the amount of free memory is less than 0.5 GiB, I would call `sync()` every minute and every time I call `malloc()`.


----------



## shkhln (Jan 4, 2022)

There is MADV_PROTECT, but it doesn't seem to be appropriate here; its purpose is mostly to keep a few key daemons (like init) alive.

Anyway, you can never depend on your exit code being called. There are too many ways a process could go down: the operating system could crash altogether, PSU fail, etc. This should be taken into account, which likely means calling sync on a fixed (configurable) interval. It should _not_ depend on the amount of free memory or anything like that. It's probably a good idea to put some kind of limit on the database size, though.


----------



## zirias@ (Jan 4, 2022)

Well, I'm not interested in workarounds _as of now_. Of course, you can never 100% make sure your daemon (or the system it's running on) doesn't crash, therefore my plan is to add explicit `sync()` calls on any "sensitive" change (while just normal content won't trigger a sync).

But: OOM should be a condition that can be handled at least with a clean exit. And this would work if I'd ever get an error (null returned) from `malloc()` et al. So, _for now_, I'm back to this sysctl supposed to control "overcommit" behavior in FreeBSD


----------



## shkhln (Jan 4, 2022)

Zirias said:


> But: OOM should be a condition that can be handled at least with a clean exit. And this would work if I'd ever get an error (null returned) from `malloc()` et al. So, _for now_, I'm back to this sysctl supposed to control "overcommit" behavior in FreeBSD


The thing is, if some process consumes an amount of memory you failed to predict, this is already a problem. OOMs should never happen in normal operation.


----------



## covacat (Jan 4, 2022)

you get null from malloc with overcommit=1
didn't try the other bits but 1 works


----------



## ralphbsz (Jan 5, 2022)

There is another aspect to this that has not been discussed explicitly. When you call malloc() or any of its friends (like sbrk() and mmap()), you don't actually get any memory. All that really happens is that your address space is adjusted, so you can actually start using more memory in those new address ranges. What does "using" in the above sentence mean? When you first touch an address in that new address range, a page fault will occur, and the page fault handler will actually give you a physical memory page. To do that, it either has to find a free memory page, evict something else from memory and give you that page (typically that's an already-written file system buffer), or it has to take some other process's address space, write it out to swap, page-protect it (so that other process can't use it for a while), and give you the page.

And that's one of the wonderful contradictions of malloc(): It doesn't actually have to fail. It can just pretend to give you memory, under the (very reasonable) assumptions that most programs that allocate memory will never actually use it. Linux is quite famous for malloc() hardly ever failing (except for ulimit-style settings). Instead, it gives you the illusion of memory, and when you try to use it, you'll get a segfault at a random place in your code, where you can't put error handling. Sure, you could set up a signal handler for SIGSEGV or SIGBUS, but what useful action can that signal handler take? In particular since most seg faults are not caused by running out of memory, but by coding bugs? And in particular since the signal handler can't actually do anything productive (like create more memory out of nothing, or check and repair all data structures the program has in memory).

I subscribe to the philosophy: don't bother handling malloc errors. Instead, think about the memory usage of your program, think about what type of computer it is installed on (how much physical RAM + swap is available), and control memory usage yourself. One of the reasons for this attitude is this: there is another source of segfaults, which is the stack. And while the stack today can get very big (in userspace, not in the kernel), there is no mechanism like malloc to manage stack space. So instead of trying to handle errors, write your code to have fewer errors in the first place. And then, when errors happen ... which they will ...



shkhln said:


> Anyway, you can never depend on your exit code being called. There are too many ways a process could go down: the operating system could crash altogether, PSU fail, etc. This should be taken into account, ...


Your code will occasionally crash. You can minimize the number of crashes by good engineering, but not eliminate them. To quote an old colleague: In a sufficiently large system, the unlikely will happen all the time, and the impossible will happen occasionally. I once saw a process crash due to a CPU fault (which was correctly reported and logged, the system continued running on the three surviving CPUs). So prepare for your code to crash. As shkhln said, there are standard techniques for that: Write checkpoints, sync your state to permanent (persistent) storage, automatically restart, use deadman timers or liveness checks or deadlock preventers to crash the system if things are wedged. It can even be a good practice to automatically reboot your computer at random times (once a day for example), to make sure your crash-handling code is well exercised. Overall long-term reliability doesn't come from just one aspect (such as malloc), but from taking a whole-system view.

With good automated recovery and handling all other forms of crashes, malloc() problems are just one of the many things


----------



## zirias@ (Jan 5, 2022)

ralphbsz I wasn't looking for lectures about "good programming" here (one of the reasons I don't think this question belongs in the programming section), but instead for some insight into the configurable behavior of FreeBSD's virtual memory management and why reconfiguring it leads to a kernel panic. Still, a nice intro to the general workings of virtual memory, but one thing sticks out:


ralphbsz said:


> under the (very reasonable) assumptions that most programs that allocate memory will never actually use it.


How is *that* ever "reasonable"? If you said "rarely", ok, that's why swapping out pages makes sense. But not use it at all? Why should you ever reserve memory if you'll never write to it? I'd call that ill program design...

All the countless reasons your program _could_ crash aside (you _can_ eliminate intrinsic reasons in theory, but not environmental ones): running out of memory is a condition that allows at least a "graceful" exit, *if* your program learns about it the moment it tries to reserve memory. `vm.overcommit` _should_ allow configuring that (as covacat confirmed). And at least in my definition, "overcommitting" here means allowing more reservations than there is total backing store (physical RAM + swap) for all the pages required.

The bad thing about that practice is: Once the system learns it can't map all the currently needed pages to physical RAM any more, the only resort is the OOM killer, randomly killing some large process (so, _any_ process in the system can be affected). A broken program just reserving insane amounts of memory will be able to bring down other processes on the same machine. That's something virtual memory was originally designed to avoid.



ralphbsz said:


> I subscribe to the philsophy: Don't bother handling malloc errors.


That sounds like a consequence of the behavior of today's systems. Could lead to a vicious cycle: If no application software ever bothers handling the problem, there's no use signalling it. And if there's indeed a lot of software reserving memory it will never use, hoping for that is a somewhat appropriate strategy. But I wouldn't call that "sane", at least it isn't robust.

*edit*: about stack space, I don't really see a problem with that. As long as you use neither VLAs, stuff like `alloca()` or recursion, you can guarantee an upper bound for stack usage of your program (and any algorithm can be implemented without these).


----------



## shkhln (Jan 5, 2022)

Zirias said:


> Why should you ever reserve memory if you'll never write to it? I'd call that ill program design...


Mmaping a large file is one such use case. Are you sure dbopen doesn't do this?


----------



## covacat (Jan 5, 2022)

but files are their own backing store like private swap


----------



## shkhln (Jan 5, 2022)

I'm not quite sure how it all works, especially with overcommit disabled, however it definitely fits "reserve first, decide what to read/write later" pattern. From the userspace application point of view, of course.


----------



## Ambert (Jan 5, 2022)

Zirias said:


> ralphbsz said:
> 
> 
> > under the (very reasonable) assumptions that most programs that allocate memory will never actually use it.
> ...



Handling dynamic memory allocations with `malloc()` makes your program vulnerable to memory fragmentation. If this is not acceptable, a solution is to use `mmap()` instead, to have better control over the layout of the virtual address space of your process. To avoid fragmentation of the virtual address space, I reserve a huge range of it with `mmap()`, and then I effectively use only a small part of it by writing on some of the pages. If I didn't reserve a huge part of the virtual address space, some functions from another library could surround my contiguous data structure by memory mappings, and it would prevent me from increasing the size of that contiguous data structure when the program needs to allocate more memory. I know that the flag `MAP_GUARD` exists for that purpose in FreeBSD, but sadly this flag is not portable to other Unix systems. To reserve a range of memory addresses, I mmap them with the flag `PROT_NONE`. I don't know if such a reserved range is a problem when `vm.overcommit` is set to `1`.

I had the same frustration as you when I discovered that `malloc()` does not fail because of the overcommit thing, so now I use `mmap()` instead. It gives you much better control over the memory. `mmap()` is somewhat portable to other BSDs, to Linux and to macOS, but it is not portable to Windows (unless WSL becomes a thing).


----------



## zirias@ (Jan 5, 2022)

Ambert said:


> Handling dynamic memory allocations with `malloc()` makes your program vulnerable to memory fragmentation. If this is not acceptable, a solution is to use `mmap()` instead, to have better control over the layout of the virtual address space of your process. To avoid fragmentation of the virtual address space, I reserve a huge range of it with `mmap()`, and then I effectively use only a small part of it by writing on some of the pages. If I didn't reserve a huge part of the virtual address space, some functions from another library could surround my contiguous data structure by memory mappings


This actually makes sense. Of course, `realloc()` would still work, but potentially copy large areas of memory...

Still, it feels like a workaround for a flawed design. In a perfect world, reserving (virtual) address space could be clearly separated from reserving actual memory, so the application has a sane way to react when a memory request can't be fulfilled...

BTW, the service I'm currently building has mostly smaller and transient "allocated objects", so using `malloc()` isn't a problem for me (`realloc()` is rarely needed and only for not too large objects as well). But I understand your usecase.


----------



## obsigna (Jan 5, 2022)

Zirias said:


> But: OOM should be a condition that can be handled at least with a clean exit. And this would work if I'd ever get an error (null returned) from `malloc()` et al. So, _for now_, I'm back to this sysctl supposed to control "overcommit" behavior in FreeBSD


Am I missing something?

I use atexit(3) handlers for this. It works perfectly also in OOM conditions. Your malloc wrapper must not call abort(3) but exit(3). The difference is that abort simply pulls the plug on the executable, while exit calls all registered atexit handlers in reverse order before stopping the program. I would not even be very surprised if the btree stuff installed an atexit handler, which on abort won't get called. Anyway, for me atexit handlers together with a malloc-exit-wrapper instead of a malloc-abort-wrapper are the sane OOM solution, and here it works without a hiccup.


----------



## zirias@ (Jan 5, 2022)

obsigna said:


> Am I missing something?


Yes, the simple fact that `malloc()` just (almost?) never fails to begin with, at least with the default setting of `vm.overcommit=0`. You can't handle what you don't know.


obsigna said:


> I use atexit(3) handlers for this. Works perfectly also in OOM conditions.


That's a way to do it. I decided instead for a `longjmp()` back to where the crucial cleanup is executed anyways*. But neither of this helps if `malloc()` just succeeds and the problem only arises when _using_ the memory, forcing the kernel to engage the OOM killer...

----
*) using this generic "panic" function:

```
void Service_panic(const char *msg)
{
    if (running) for (int i = 0; i < numPanicHandlers; ++i)
    {
        panicHandlers[i](msg);
    }
    logsetasync(0);
    logmsg(L_FATAL, msg);
    if (running) longjmp(panicjmp, -1);
    else abort();
}
```
together with the following around the service main loop:

```
running = 1;
    if (setjmp(panicjmp) < 0) goto shutdown;

    // [...] main event loop

shutdown:
    running = 0;
    // [...] cleanup
```
(And btw, `atexit()` wouldn't work for my service cause it has worker threads. The threadpool registers a "panic handler" that checks whether it's called on a worker thread, in that case it `longjmp()`s out of the thread job first and the handler for a finished thread job on the main thread then calls `Service_panic()` again....)


----------



## obsigna (Jan 5, 2022)

Excerpt from malloc(3):


> RETURN VALUES
> Standard API
> The malloc() and calloc() functions return a pointer to the allocated
> memory if successful; otherwise a NULL pointer is returned and errno is
> ...



Are you telling me that malloc always returns non-NULL pointers, even in cases where it cannot provide the requested memory?

I have to admit that I only check whether the returned pointer is NULL and in that case do the OOM handling. If the actual behaviour of malloc is different from what's written in the man page above, then you would need to file a PR against malloc.

I see also that many things changed with this jemalloc sophistication, perhaps not everything is really useful.


----------



## zirias@ (Jan 5, 2022)

obsigna said:


> Are you telling that malloc alway returns non-NULL pointers, even in cases it cannot provide the requested memory?


Yes.


obsigna said:


> I have to admit, that I only check whether the returned pointer is NULL and in this case do the OOM handling. If the actual behaviour of malloc is different than written in the man page above, then you would need to file a PR against malloc.


It's most likely not `malloc()` causing that (which, IIRC, uses `mmap()` internally ... traditionally, `sbrk()` was used, but I guess that's a thing from the past). The problem is that the OS gives you anonymous mappings, even if they can't be backed. If you read this whole thread, it's actually "by design" (and the sysctl `vm.overcommit` is meant to give some control over this overcommit behavior).


----------



## obsigna (Jan 5, 2022)

Zirias said:


> (And btw, `atexit()` wouldn't work for my service cause it has worker threads. The threadpool registers a "panic handler" that checks whether it's called on a worker thread, in that case it `longjmp()`s out of the thread job first and the handler for a finished thread job on the main thread then calls `Service_panic()` again....)


While I believe you that atexit(3) does not work in your case, I just checked it (again) with one of my heavily threaded daemons, and here the atexit handlers get executed regardless of which thread exit(3) is called from. I restrict myself to the standard pthread(2) API, though.



Zirias said:


> It's most likely not `malloc()` causing that (which, IIRC, uses `mmap()` internally ... traditionally, `sbrk()` was used, but I guess that's a thing from the past). The problem is that the OS gives you anonymous mappings, even if they can't be backed. If you read this whole thread, it's actually "by design" (and the sysctl `vm.overcommit` is meant to give some control over this overcommit behavior).



Well this is almost unbelievable. So, malloc(3) gives me memory which cannot be used????? This would be one of the buggiest bugs that I ever heard of.

EDIT: I never touched vm.overcommit, here it is 0.


----------



## zirias@ (Jan 5, 2022)

obsigna said:


> While I believe you that atexit(3) does not work in your case, I just checked it (again) with one of my heavily threaded daemons, and here the atexit handlers get executed regardless of which thread exit(3) is called from. I restrict myself to the standard pthread(2) API, though.


The problem is not that it wouldn't be called. The problem is that the cleanup would have to access memory that was last modified by a different thread without any memory barrier (e.g. pthread_mutex) in-between. That's one of the things that "just work fine" in 99% of the cases, but can go horribly wrong.


obsigna said:


> Well this is almost unbelievable. So malloc(3) gives me memory which cannot be used????? This would be one of the buggiest bugs that I ever heard of.


Actually, not really. Ambert outlined a somewhat "sane" usecase for that (in a nutshell, ensuring an unfragmented virtual address space for large and potentially growing objects). Still it's unfortunate and destroys all hope to react to OOM with a graceful exit...


----------



## obsigna (Jan 5, 2022)

Now, I just checked it:


```
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

int main(int argc, char *const argv[])
{
   if (argc < 2) return 1;
   long size = strtol(argv[1], NULL, 10)*1024*1024*1024;
   void *p = malloc(size);
   printf("%ld, 0x%016zX\n", size, (size_t)p);

   return 0;
}
```


```
CyStat-210:~ root# ./oomcheck 10
10737418240,      0x0000000801200700
CyStat-210:~ root# ./oomcheck 100
107374182400,     0x0000000801200700
CyStat-210:~ root# ./oomcheck 1000
1073741824000,    0x0000000801200700
CyStat-210:~ root# ./oomcheck 10000
10737418240000,   0x0000000801200700
CyStat-210:~ root# ./oomcheck 100000
107374182400000,  0x0000000801200700
CyStat-210:~ root# ./oomcheck 1000000
1073741824000000, 0x0000000000000000
```

What the fuck?!?

malloc returns a valid pointer for 100000 GB; who could have expected this. Well, at least for an allocation of 1 petabyte (10¹⁵ bytes) it doesn't want to give a guarantee anymore. Was this ridiculous behaviour introduced by jemalloc?


----------



## zirias@ (Jan 5, 2022)

obsigna said:


> Was this ridiculous behaviour introduced by jemalloc?


No. As I said before, I'm _pretty_ sure the user-space allocator has nothing to do with that. It will just pass through the error (if any) given by the kernel.

*edit:* Try to write that memory (e.g. with `memset()`), you'll see your system go into heavy swapping and finally the OOM killer randomly killing large processes.


----------



## zirias@ (Jan 5, 2022)

Resuming this thread in a nutshell so far:

- setting `vm.overcommit` to 1 should indeed prohibit allocations exceeding the available backing store (physical RAM + swap)
- the kernel panic I've seen when changing that sysctl was seen by others as well and is most likely a bug
- there's a usecase for reserving a "ridiculous" amount of memory from your application: ensuring contiguous virtual address space
- I'd conclude the traditional APIs are lacking. There _should_ be a way to reserve just address space, without reserving actual memory


----------



## covacat (Jan 5, 2022)

surprise, on my mac i just allocated 1tb

```
macmini:~ mac$ ./mm 1
1073741824, 0x00000001189F3000
macmini:~ mac$ ./mm 1000
1073741824000, 0x0000000108E1F000
```


----------



## zirias@ (Jan 5, 2022)

covacat said:


> surprise, on my mac i just allocated 1tb


Yes, this behavior seems pretty wide-spread. I know Linux is doing the same as well. And now that I learned a "sane" usecase for it, I think it can't really be fixed without changed/improved APIs. Having contiguous address space for a potentially growing object, that's a valid requirement...


----------



## shkhln (Jan 5, 2022)

Zirias said:


> Yes, the simple fact that `malloc()` just (almost?) never fails to begin with, at least with the default setting of `vm.overcommit=0`.


I think it should work with resource limits. At least I don't see why it wouldn't. (Just a nitpick.)


----------



## Ambert (Jan 5, 2022)

Zirias said:


> I'd conclude the traditional APIs are lacking. There _should_ be a way to just reserve address space, without reserving actual memory



Maybe the `MAP_GUARD` flag is specifically designed for that purpose (cf. mmap(2)). And I don't know if memory pages reserved with the `PROT_NONE` protection are really a problem when `vm.overcommit` is set to `1`.

Zirias, when you perform the test suggested by covacat (setting `vm.overcommit` to `1` in sysctl.conf), I suggest you try it with several sizes of swap installed on your computer. Maybe increasing the size of the swap will make things better (cf. the definition of `vm.overcommit` in tuning(7)).

Also, I don't know much about signal handling, but maybe it is possible to catch the signal sent by the OOM killer to your process, and do some cleaning before termination.


----------



## mark_j (Jan 5, 2022)

Zirias said:


> Resuming this thread in a nutshell so far:
> 
> - setting `vm.overcommit` to 1 should indeed prohibit allocations exceeding the available backing store (physical RAM + swap)


No, physical RAM has nothing to do with it. It's the swap reserved when a process begins versus the physical swap space available.



Zirias said:


> the kernel panic I've seen changing that sysctl was seen by others as well and is most likely a bug
> there's a usecase for reserving a "ridiculous" amount of memory from your application: ensure contiguous virtual address space
> I'd conclude the traditional APIs are lacking. There _should_ be a way to just reserve address space, without reserving actual memory


----------



## unitrunker (Jan 5, 2022)

shkhln said:


> The thing is, if some process consumes an amount of memory you failed to predict, this is already a problem. OOMs should never happen in normal operation.


This comment right here trumps the whole thread.


----------



## ralphbsz (Jan 6, 2022)

(About programs malloc'ing memory, then not using it)


Zirias said:


> How is *that* ever "reasonable"? If you said "rarely", ok, that's why swapping out pages makes sense. But not use it at all? Why should you ever reserve memory if you'll never write to it? I'd call that ill program design...


You are right, it's not very common. But it happens. Example: I create a vector that is supposed to hold up to 1 million entries (because that's a reasonable upper limit for how many things my program has to deal with). This run, there are only 10K entries. Maybe I'm using a data structure that's deliberately sparse for great insert/remove performance, and I just don't care that I'm wasting a few dozen megabytes, because memory is cheap. It particularly happens with server code that does internal caching with good cache management: If the workload the server has to handle doesn't need much caching (great locality), some memory will go unused. That's fine, the malloc() calls were nearly free, and don't use many resources. That's exactly the idea behind overcommit.



> All the countless reasons your program _could_ crash aside (you _can_ eliminate intrinsic reasons in theory, but not environmental reasons): Running out of memory is a condition that allows at least a "graceful" exit, *if* your program would learn about it the moment it tries to reserve memory.


In practice, recovering from malloc failure (even if such a thing commonly happened) is harder than it seems, for several reasons. One is that the "graceful exit" code probably needs memory allocation too. One technique I've seen used is to send all such emergency exit code through one common routine, and at startup time reserve one memory buffer (maybe 1MB) that is never otherwise used; the first thing the emergency exit code does is free that buffer, so the exit code can function with a few mallocs.

But even that fails today, because most big modern programs are multi-threaded (and have to be, to take advantage of multi-core CPUs and to overlap network and IO latencies). So one thread runs out of memory, and longjmp's to the common exit routine. But that exit routine cannot synchronously stop all other threads from malloc'ing (any synchronous locking mechanism would be too slow), and even if it frees an emergency reserve pool, that will be immediately consumed by the other threads. I've tried writing such "out of memory recovery" code, and after weeks of messing with it, gave up.

You're better off looking at the problem you're trying to solve, estimating how much memory is available (you know the machine, you know what other software is running), and planning accordingly. And if (god forbid) someone runs a giant memory hog (like emacs'ing a 1GB log file) on the machine, it's game over. Doctor, it hurts when I do that. Well, then stop doing it.



> The bad thing about that practice is: Once the system learns it can't map all the currently needed pages to physical RAM any more, the only resort is the OOM killer, randomly killing some large process (so, _any_ process in the system can be affected).


Or your program catches a segfault. Just as painful and unpleasant.



> A broken program just reserving insane amounts of memory will be able to bring down other processes on the same machine. That's something virtual memory was originally designed to avoid.


There are many things that traditional operating systems were designed to avoid. For example isolation between users ... and we've given up on that, we instead move competing users into VMs, containers, or jails. I mean, we even do things like running a simple and harmless piece of software (the DNS server) in a jail, just "because". I think what you're saying is that OS design has not reached its goal. I agree, but that's the world we live in. Writing reliable and performant software in an imperfect world can require gritting your teeth and accepting reality.



> about stack space, I don't really see a problem with that. As long as you use neither VLAs, stuff like `alloca()` or recursion, you can guarantee an upper bound for stack usage of your program (and any algorithm can be implemented without these).


Agree. With good coding practices, running out of stack space should be rare. You just have to make sure all programmers on the project understand that.

(About shkhln's comment: "if some process consumes an amount of memory you failed to predict, this is already a problem.")


unitrunker said:


> This comment right here trumps the whole thread.


Sadly true. If you want to build reliable production systems, look at all the processes on the machine.


----------



## covacat (Jan 6, 2022)

you probably don't run your service as root, but just in case: vm.overcommit is not enforced for root


----------



## zirias@ (Jan 6, 2022)

mark_j said:


> No, physical RAM has nothing to do with it. It's the swap reserved when a process begins versus the physical swap space available.


That wouldn't make much sense (and doesn't correspond to the wording in the manpage; why only when a process begins? It only talks about reserved swap vs. available swap). The consequence would be that with `vm.overcommit=1`, userspace could _never_ get any memory if there wasn't any swap at all.

BUT I guess I found out why changing the setting fails so badly. Looking at the machine (8GB RAM, 8GB swap) where I tried that while running normally, I see `vm.swap_reserved` at > 600 GB – that's just insane... I still think the kernel shouldn't panic, but of course any userspace allocation will immediately fail.


ralphbsz said:


> Example: [...] That's exactly the idea behind overcommit.


That's pretty similar to the use case Ambert already described. It boils down to this: what you _actually_ want is to reserve _address space_. I think it could be solved by allowing address space and the memory backing it to be reserved separately...


ralphbsz said:


> One is that the "graceful exit" code probably needs memory allocation too.


Depends on what it's doing. Persisting something to disk is probably pretty common; if _that_ needs dynamic allocations, then yes. I don't use an "emergency buffer"; the quick path out (taken with the `longjmp()`) frees a few objects anyway, so there's hope this will be enough.


ralphbsz said:


> So one thread runs out of memory, and longjmp's to the common exit routine. But that exit routine can not synchronously stop all other threads from malloc'ing (since any synchronous locking mechanism would be too slow), and even if it free's an emergency reserve pool, that will be immediately consumed by the other threads.


In my design, the thread exiting for OOM (or any other "panic" reason) would tell the main thread (via some shared/locked memory allocated by the main thread) that there was a panic, which causes the main thread to signal all other threads to exit... _IF_ `malloc()` would reliably return NULL, this should be enough.


ralphbsz said:


> You're better off looking at the problem you're trying to solve, estimate how much memory is available


There is no concrete problem with my code, I thought that was obvious from my initial post. OOM is a condition I can't predict (it depends on what else is running on the machine) and I'd like to react with a graceful exit _IF_ it ever happens (best effort), that's all. Seems overcommit makes that impossible in practice.


----------



## zirias@ (Jan 6, 2022)

covacat said:


> you probably don't run you service as root, but just in case vm.overcommit is not enforced for root


It's _started_ as root, but drops privileges early on (just after setting up its listening sockets). uid/gid are configurable, I currently use nobody/nogroup.

But the thing is: I didn't even run it with `vm.overcommit=1` so far, because when trying to test it on my desktop/dev machine, I immediately got this kernel panic.


----------



## Ambert (Jan 6, 2022)

Zirias said:


> OOM is a condition I can't predict (it depends on what else is running on the machine) and I'd like to react with a graceful exit _IF_ it ever happens (best effort), that's all.



Since the OOM condition is handled by the operating system, and it solves it by sending termination signals to some guilty-looking processes, maybe the "best effort" you can do is write a function that terminates your application nicely when it receives the `SIGTERM` signal. According to the handbook:



			
Handbook said:

> Two signals can be used to stop a process: `SIGTERM` and `SIGKILL`. `SIGTERM` is the polite way to kill a process as the process can read the signal, close any log files it may have open, and attempt to finish what it is doing before shutting down. In some cases, a process may ignore `SIGTERM` if it is in the middle of some task that cannot be interrupted.
> 
> `SIGKILL` cannot be ignored by a process. Sending a `SIGKILL` to a process will usually stop that process there and then. [1]



That way, you would offer the operating system a choice: either reclaim your memory gently and let you clean up, or reclaim your memory abruptly and make a mess.

Handling the `SIGTERM` signal is useful for other terminating conditions, not only OOM. For instance, the handbook says that a shutdown will:



			
Handbook said:

> send all processes the TERM signal, and subsequently the KILL signal to any that do not terminate in a timely manner



----------------------
Edit: I found a similar thread on the freebsd-hackers mailing list: Why kernel kills processes that run out of memory instead of just failing memory allocation system calls? -- Basically, they say that there is no easy way to handle an OOM condition in a program's code, due to the design of the operating system (overcommitment).
-----------------------
Edit2: For your particular case, I think I have found a solution: 1) Create a child process. 2) Do your main computation in the child. 3) Keep a small summary of the changes made to your database since the last `sync()` in the parent process (a "diff"). 4) The child is a much more interesting target for the OOM killer, so it is killed first. 5) When the parent finds out the child process has been killed, it writes the diff to a log file and terminates. 6) When the program restarts, it reads the log file and applies the changes to the database.


----------



## zirias@ (Jan 6, 2022)

Ambert, handling SIGTERM is a must for a well-behaved daemon, but the OOM killer doesn't send a catchable signal; it just terminates your process forcefully (`SIGKILL`).


----------



## zirias@ (Jan 6, 2022)

Ambert said:


> I found a similar thread on the freebsd-hackers mailing list: Why kernel kills processes that run out of memory instead of just failing memory allocation system calls? -- Basically, they say that there is no easy way to handle an OOM condition in a program's code, due to the design of the operating system (overcommitment).


There is no "easy" way, but there certainly is a "best effort" way... But this *is* an interesting response (quoting from there):


> The fork() should give the child a private "copy" of the 1 GB buffer, by
> setting it to copy-on-write.


That's yet another API shortcoming, `fork()` shouldn't be the _only_ way to start a new process... (edit: yes, I remember there was `vfork()`, e.g. on Linux, trying to solve exactly this problem, but it was an ill-defined disaster -- and Windows only offers `CreateProcess()` and no `fork()`, which is just as bad...)


> The disadvantage, of
> course, is that if someone calls the bluff, then we kill random processes.


That's exactly the issue 


> although programs can in theory handle failed allocations and respond
> accordingly, in practice they don't do so and just quit anyway.


and *this* sounds to me like a chicken/egg problem. Programs don't bother handling it because they know it's moot anyway, since the OS doesn't tell them about the error at a time they _could_ react 
(edit2: adding your usecase of having contiguous address space to the picture, I think the solution would really be to provide a way to reserve _just_ address space and still require the program to reserve memory backing it before it can be used, so there's a chance to get and handle that error when it happens)


Ambert said:


> For your particular case, I think I have found a solution: 1) Create a child process. 2) Do your main computation in the child. 3) You keep a small summary of the changes made to your database since the last `sync()` in the parent process (a "diff"). 4) The child is a much more interesting target for the OOM killer, so it is killed first. 5) When the parent find out the child process has been killed, the parent process writes the diff to a log file, and terminates. 6) When the program restarts, it reads the log file and apply the changes to the database.


MUCH too complex. As I said above, there is no problem with my code needing an unusual amount of memory. And then, I guess just adding a `sync()` after sensitive changes is probably much less overhead than this inter-process communication with extra buffers etc...


----------



## ralphbsz (Jan 6, 2022)

There is another solution, but it's nasty: Give up on the idea that a computer can be shared between multiple users or multiple processes. Go back to the 1950s, and use a dedicated computer for the task you want to run. Except that today you don't use a real physical computer, but instead get yourself a virtual one.

So if your task can calculate how much memory it will use at maximum, and can do internal tracking of memory usage, just instantiate a VM of that memory size, and run your code as a single-purpose VM.

The reason this is nasty is: It means that the whole _raison d'etre_ of operating systems (which is resource control, management and virtualization) has failed at the OS layer, and is instead being done at the VM layer. And you need to be sure your VM layer does the memory allocation the way you like it (namely guaranteed), which isn't always the case. I vaguely remember that kubernetes will overcommit memory too. Oops ...


----------



## zirias@ (Jan 6, 2022)

ralphbsz said:


> It means that the whole _raison d'etre_ of operating systems (which is resource control, management and virtualization) has failed at the OS layer, and is instead being done at the VM layer.


Even worse, it's just the same thing taken to a different level (memory ballooning).

The simple truth is: memory is a limited resource, and the concept of virtual memory was once invented to make sure no application can bring down _other_ applications (or even the OS itself). The resource "memory" should be given out by a strict "first come, first served" policy, with optional resource limits configurable by the system's administrator...


----------



## grahamperrin@ (Jan 6, 2022)

Ambert said:


> … I found a similar thread on the freebsd-hackers mailing list: Why kernel kills processes that run out of memory instead of just failing memory allocation system calls? …



Also, more recent, if you haven't already seen it:

The out-of-swap killer makes poor choices | <https://markmail.org/message/siorx6pswhpncluf>


----------



## obsigna (Jan 6, 2022)

Do we know since when calls to malloc(3) never fail, and who implemented this OOM killer? I am curious what stuff this developer is smoking, wherever he lives. This also looks like a big security flaw. I would not be surprised if this could easily be exploited, at least for some denial of service attacks.

For example, I experienced a major hassle in October/November 2020 with a tiny AWS-EC2 instance, which had for years been perfectly running an Apache/PHP/MySQL web service. All of a sudden, Apache kept being killed because, for some reason, MySQL ran out of swap space. I found it already stupid to kill Apache and not MySQL.

Does this sound familiar?

I was never able to find out what request actually triggered this; I am not even sure whether there was a trigger at all, perhaps the memory hog MySQL is simply leaking. Finally, I added a second volume to the EC2 instance, 1 GB for swap only, and that solved the problem.

For my daemons, I use a malloc wrapper. It already has some introspection facilities, like total allocation counts, and I will add a configurable limit which hopefully keeps my daemons below the radar of the OOM killer. In case one gets killed anyway, a watchdog will restart FreeBSD, because killing the OOM killer is the only clean solution.


----------



## shkhln (Jan 6, 2022)

obsigna said:


> Do we know since when calls to malloc(3) never fail, and who implemented this OOM Killer?


Been there for a while…



obsigna said:


> This looks also like a big security flaw. I would not be surprised, if this could be easily exploited at least for some denial of service attacks.


Not at all, OOM kills can only be triggered by exhausting available memory, which would be a DoS situation even without overcommit.


----------



## obsigna (Jan 6, 2022)

shkhln said:


> Been there for a while…


OK, the linked commit is from 1994, and the commit message fits its age: keyword _floppy situation_.


> Various changes to allow operation without any swapspace configured.
> Note that this is intended for use only in floppy situations and is done at
> the sacrifice of performance in that case (in ther words, this is not the
> best solution, but works okay for this exceptional situation).



Now, the question remains when somebody decided to make this the general, i.e. not-only-floppy, behaviour. And I cannot see anything in this commit that lets malloc never fail, which was the actual question.



shkhln said:


> Not at all, OOM kills can only be triggered by exhausting available memory, which would be a DoS situation even without overcommit.


Well, I still feel uncomfortable. The delayed OOM makes me nervous. And why is there a sysctl setting vm.overcommit, which defaults to 0 (I assume this means inactive), while the system happily overcommits up to hundreds of thousands of gigabytes? Fortunately, trying to allocate a petabyte makes the kernel feel uncomfortable as well, and it eventually bails out.


----------



## _martin (Jan 6, 2022)

To the panic itself: please do open a PR. A panic like the one you shared is not expected to happen under any memory pressure you create from userspace. Your panic on virtual address 0x18 is clearly a bogus kernel address.
I tried to provoke the bug several ways while `while true; do sysctl vm.overcommit=1 ; sysctl vm.overcommit=0; done` was running in the background on 13.0p4, but was not able to trigger anything.

Are you able to reproduce this crash?


----------



## shkhln (Jan 6, 2022)

obsigna said:


> And I cannot see anything in this commit which let's malloc never fail, which was the actual question.


No, the question this answers is "who implemented this OOM Killer?". (Not exactly the same thing as overcommit, by the way — it's possible to handle OOM situations by simply crashing, for example.) I'm not going to study jemalloc's sources.


----------



## covacat (Jan 6, 2022)

_martin said:


> To the panic itself: please do open a PR. A panic like the one you shared is not expected to happen under any memory pressure you create from userspace. Your panic on virtual address 0x18 is clearly a bogus kernel address.
> I tried to provoke the bug several ways while `while true; do sysctl vm.overcommit=1 ; sysctl vm.overcommit=0; done` was running in the background on 13.0p4, but was not able to trigger anything.
> 
> Are you able to reproduce this crash?


try to run some large process that uses shm
it never panicked for me on a 1GB system and i changed it lots of times


----------



## obsigna (Jan 6, 2022)

obsigna said:


> ...
> And why is there a sysctl setting vm.overcommit, which defaults to 0 (I assume this means inactive), while the system happily overcommits up to hundreds of thousands of gigabytes? Fortunately, trying to allocate a petabyte makes the kernel feel uncomfortable as well, and it eventually bails out.


Answering myself: reading tuning(7) helps:


> The vm.overcommit sysctl defines the overcommit behaviour of the vm
> subsystem.  The virtual memory system always does accounting of the swap
> space reservation, both total for system and per-user.  Corresponding
> values are available through sysctl vm.swap_total, that gives the total
> ...


So, vm.overcommit = 0 means that it is completely active. In order to enforce the actual limits, it must be set to b0|b1|b2 = 7, and we must not allocate memory as root.

So I tried:
`# sysctl vm.overcommit=7`

Now I check the setting with my test program from this post: https://forums.freebsd.org/threads/...ocations-fail-in-a-sane-way.83582/post-549632

`sudo -u rolf ./oomcheck 1` (trying to allocate 1 GB, works as expected):
`1073741824, 0x0000000801200700`

`sudo -u rolf ./oomcheck 10` (trying to allocate 10 GB, does not work, which is also the expected behaviour, since this system does not have 10 GB):
`10737418240, 0x0000000000000000`

The user root can still allocate any amount of memory below one petabyte.


----------



## zirias@ (Jan 7, 2022)

obsigna you keep talking about malloc/jemalloc although it's pretty obvious this has nothing to do with it: malloc is just a userspace allocator and memory manager built on top of whatever the kernel provides to get page mappings (nowadays most likely just mmap). If and only if malloc needs to get more memory from the kernel and _that_ fails will it return NULL.

Also, I'm not sure you understand all the implications? After discovering that on my desktop machine an astounding 600 GB of swap is "reserved" (with just 8 GB actually available), I'm very reluctant to disable overcommit. I _suspect_ in my case it's chromium reserving all that memory, but then, what does it help? As discussed in this thread, having contiguous virtual address space is a valid requirement, and the only way to get that with the existing APIs, unfortunately, is to reserve a huge chunk of memory *). Obviously there are programs doing just that, probably well aware systems _will_ overcommit, so "it isn't a problem". Then of course, you can't disable overcommit without breaking these programs (or at least having lots of RAM go to waste). A vicious circle... and a huge pile of suck.



_martin said:


> Are you able to reproduce this crash?


I didn't try just yet. It happened on my desktop machine I use for my personal dev stuff, but I really need this machine operational, especially in these times of remote working .... will probably do more tests some weekend 

-----
*) oh well, just thinking about another workaround, you might just mmap a sufficiently large file instead. But that sucks as well...


----------



## obsigna (Jan 7, 2022)

Zirias, you don't understand how it helps not to mix up the different domains. I keep talking about malloc(3) because that's the API for us user space developers, and I expect the API to work as advertised in the respective man page. *And specifically* that means a NULL pointer is returned when a memory allocation cannot be fulfilled, and I mean memory that can really be used, not phantom memory that leads to a crash when the program starts writing to it.

OK, here I learned that funny things may happen in kernel space, to say the least. Be assured that I understand all the implications very well, and be assured, that I will find out the pros and cons of setting vm.overcommit to 7 for the systems which I am running FreeBSD on. Beyond this, I won’t turn myself into a kernel hacker, I got better things to do. In case it turns out that vm.overcommit leads into a dead-end road, I got already plan B.

So, good luck everybody with whatever your plans A, B or 0 are.


----------



## zirias@ (Jan 7, 2022)

obsigna said:


> I keep on talking about malloc(3) because that’s the API for us user space developers, and I expect that the API works as advertised in the respective man page.


Wait, that's what I was doing initially  But you brought jemalloc (a concrete implementation) to the table, suggesting it would be at fault here, which it isn't: any malloc implementation will just use the memory it gets from the kernel 


obsigna said:


> In case it turns out that vm.overcommit leads into a dead-end road, I got already plan B.


It probably does, at least if you want to use programs that reserve address space by reserving huge chunks of memory. I don't have a "plan B"... well, other than accepting the situation as it is. "Fixing" it would require additional kernel APIs and userspace applications using them correctly...


----------



## obsigna (Jan 7, 2022)

Well, I got actually a working and already implemented plan C, although for other reasons, than the one here.

A group of my daemons are electrochemical measurement controllers, and the potential/current measurements are done with high speed PCIe-DAQ boards from National Instruments. For the high speed mode, quite big amounts of system RAM need to be provided for DMA. For this, I wrote a kernel module which allocates 256 MB of RAM for AI-DMA and 48 MB of RAM for AO-DMA at boot. This reserved memory is no longer visible to other processes; however, by way of ioctls in my kernel module, the whole chunks may be mapped into user space, and this way my program gets access to the waveforms and measurement data belonging to the DMA channels of the DAQ board.

Of course, this would also work without any DAQ board. We let a kernel module reserve the memory, and map it into *our-only* user space by an ioctl. From my experience other programs don't touch it; otherwise, the HS measurements would give strange results, like arbitrary spikes, voids and punches in the curves - they don't. In this case, we need to write our own allocator, which takes the memory out of said our-only pool.

This way, our daemons could keep themselves below the radar of the OOM killer. Even if we use preallocated system RAM of, let's say, 1 GB, the OOM killer won't know this, and if we don't allocate much regular space, then there would always be processes which allocated much more, and are therefore subject to being killed before our daemon.


----------



## zirias@ (Jan 7, 2022)

obsigna I get that you're talking about "wired" memory. That's nice and all, but it doesn't solve the generic problem: you can't attach a kernel module to every daemon.


----------



## obsigna (Jan 7, 2022)

Zirias said:


> obsigna I get you're talking about "wired" memory. That's nice and all, but it doesn't solve the generic problem: You can't attach a kernel module to every daemon


Of course, I don't want to attach it to every daemon, only to mine. If every daemon used it, what would be the benefit for me? That's my plan C after all, for sidestepping questionable system behaviour.


----------



## Ambert (Jan 7, 2022)

Zirias said:


> As discussed in this thread, having contiguous virtual address space is a valid requirement, and the only way to get that with the existing APIs, unfortunately, is to reserve a huge chunk of memory *). Obviously there are programs doing just that, probably well aware systems _will_ overcommit, so "it isn't a problem". Then of course, you can't disable overcommit without breaking these programs (or at least have lots of RAM go to waste).



What do you mean by "existing APIs"? Do you include `mmap()` in those APIs? Or just `malloc()`? Because I think `mmap()` *does* allow the reservation of a range of virtual addresses that does not count as allocated memory when the kernel enforces a strict no overcommit policy. Each virtual memory page of (typically) 4 KiB has a status regarding its access (read, write, execute), and I think only memory pages having the WRITE attribute count as allocated memory under a no overcommit policy. You can temporarily turn off the WRITE attribute of the memory pages that you are not currently using.

You can check that out by running the following test program. It takes a size (in GiB) as argument, and calls `mmap()` to reserve a range of virtual addresses of that size (first with MAP_GUARD, then with PROT_NONE, then with PROT_READ, then with PROT_WRITE), allowing you to check what kind of reservation has an impact on the amount of reserved swap on your computer.


```
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <sys/mman.h> // mmap
#include <unistd.h> // getpid

void print_address_space(void) {
    fflush(stdout);
    char command[100];
    snprintf(command, 100, "procstat vm %ld", (long)getpid());
    system(command);
}

void * mmap_wrap(void *addr, size_t len, int prot, int flags, char *taskname) {
    void *addr2 = mmap(addr, len, prot, flags, -1, 0);
    if (addr2 == MAP_FAILED) {
        printf("mmap() failure.\n");
        exit(EXIT_FAILURE);
    }
    printf("%s done (at %p). Map of the virtual address space:\n", taskname, addr2);
    print_address_space();
    printf("Check swap reservation now. Then press Enter to continue.\n");
    getchar();
    return addr2;
}

int main(int argc, char *const argv[]) {
    if (argc < 2) {
        printf("Please provide a size (in GiB) as argument.\n");
        exit(EXIT_FAILURE);
    }
    unsigned long size = strtoul(argv[1], NULL, 10)*1024*1024*1024;
    /* Per mmap(2), MAP_GUARD may not be combined with MAP_ANON or
     * MAP_PRIVATE (and requires PROT_NONE, fd -1, offset 0). */
    void *ptr = mmap_wrap(NULL, size, PROT_NONE,
        MAP_GUARD, "mmap(MAP_GUARD)");
    ptr = mmap_wrap(ptr, size, PROT_NONE,
        MAP_PRIVATE | MAP_ANON | MAP_FIXED, "mmap(PROT_NONE)");
    ptr = mmap_wrap(ptr, size, PROT_READ,
        MAP_PRIVATE | MAP_ANON | MAP_FIXED, "mmap(PROT_READ)");
    ptr = mmap_wrap(ptr, size, PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_PRIVATE | MAP_ANON | MAP_FIXED, "mmap(PROT_WRITE)");
    return EXIT_SUCCESS;
}
```

Unfortunately, the simple interface provided by `malloc()` does not allow it to be implemented in a way that suits your needs: all the reserved pages have to be set as writable immediately, because the user can write at any address at any time (and an attempt to write a byte in a virtual memory page defined as read-only triggers a segmentation fault, it does not make the memory page writable). And since `malloc()` is standard and easy to use, most people just use it instead of `mmap()`.

That being said, this thread of the freebsd-hackers mailing list mentions other valid use-cases for overcommitment. So the simple interface provided by `malloc()` is not the only culprit.

----------------------------

obsigna -- I think there is an easy way to make a process disappear from the radar of the OOM killer: make your process call the function madvise with `MADV_PROTECT` as argument (with root privileges).


----------



## zirias@ (Jan 7, 2022)

Ambert said:


> What do you mean by "existing APIs"? Do you include `mmap()` in those APIs? Or just `malloc()`? Because I think `mmap()` *does* allow the reservation of a range of virtual address space that does not count as allocated memory when the kernel enforces a strict no overcommit policy. [...] You can temporarily turn off the WRITE attribute of the memory pages that you are not currently using.


I'm not sure this helps with the problem at hand: You want to reserve a (potentially huge) chunk of contiguous address space, and then you want to reserve actual memory backing it page-by-page, as needed. I think with `mmap()`, you can only change the access mode for the whole chunk? Correct me if I'm wrong...



Ambert said:


> _*obsigna*_ -- I think there is an easy way to make a process disappear from the radar of the OOM killer: make your process call the function madvise with the `MADV_PROTECT` behaviour (with root privileges).


Ah, the API that goes with protect(1). Interesting idea! But, of course, _still_ a workaround.


----------



## obsigna (Jan 7, 2022)

Ambert said:


> ...
> 
> obsigna -- I think there is an easy way to make a process disappear from the radar of the OOM killer: make your process call the function madvise with the `MADV_PROTECT` behaviour (with root privileges).


For me this is the solution to prevent my daemon from being inadvertently killed, which would be problematic, because the DAQ board(s) might be left in an inappropriate, non-idle state with nothing supervising them -> this may (remotely) result in the destruction of the electrochemical cell, depending on which method was running.

So, I will implement this together with my plan B, i.e. using my malloc wrapper to impose a reasonable limit on how much memory my daemon may allocate. If this limit is exceeded for some reason (usually a bug), it would quit gracefully and could leave a respective message.

By the way, I never understood why the Linux people like mmap'ing anything so much. Mapping memory is computationally expensive; I know this because I have already done it step by step in my kernel module. So mmap is not a toll-free bridge to all kinds of storage; the toll is quite expensive.

For example, I already did the first experiments with vm.overcommit = 7. Once I increased the swap partition to 16 GB, I could run the GNOME 3 desktop and most of its applications without problems. But when I started Firefox, its first tab crashed, because 5 times 2.5 GB of swap space allocation exceeded the limit. And now comes the best of all: this idiotic OOM killer did not kill Firefox, but reproducibly killed something of the ssh/bash/tty session by which I was logged in from another machine in order to be able to monitor the system. Can this be more stupid?


----------



## zirias@ (Jan 7, 2022)

obsigna said:


> By the way, I never understood why the Linux people like mmap to anything so much. Mapping memory is computational expensive, I know this, because I did this already step by step in my kernel module. So mmap is not a toll free bridge to all kind of storage, the toll is quite expensive.


I would be surprised if any `malloc()` implementation still used `sbrk()` (and even then, mapping pages has to happen for `sbrk()` as well). When you use `malloc()`, you already use `mmap()`.


----------



## Ambert (Jan 7, 2022)

Zirias said:


> I'm not sure this helps with the problem at hand: You want to reserve a (potentially huge) chunk of contiguous address space, and then you want to reserve actual memory backing it page-by-page, as needed. I think with `mmap()`, you can only change the access mode for the whole chunk? Correct me if I'm wrong...



I think you are wrong. I think we can change the access protection on a page-by-page basis (although it is better to keep pages with the same access protection compacted together, to save kernel memory). And everything I said in my previous post does not help you with your issue, but you repeated several times something I think is wrong, so I took the time to write an explanation (you are the OP).

obsigna -- Next time there is a call for Foundation-supported project ideas, you can suggest to give the admin the ability to rank processes according to their importance, so that the OOM killer will target the low ranking processes first (currently, I think there are only two ranks: untouchable and fair game).


----------



## eternal_noob (Jan 7, 2022)

Ambert said:


> you can suggest to give the admin the ability to rank processes according to their importance, so that the OOM killer will target the low ranking processes first (currently, there are only two ranks: untouchable and fair game).


Since the OOM killer only kicks in if something is really wrong, I think it's better if you notice it as soon as possible. Killing lower-ranked processes first will introduce a delay until you notice something's wrong.

I think the current two ranks are the best solution here.


----------



## zirias@ (Jan 7, 2022)

Ambert said:


> I think you are wrong. I think we can change the access protection on a page-by-page basis


If that's indeed possible, could you give an example _how_? Because then I think "overcommit" would be unnecessary if all programs were well-behaved... (well, disregarding this `fork()` issue for now)


----------



## obsigna (Jan 7, 2022)

Ambert said:


> _*obsigna*_-- Next time there is a call for Foundation-supported project ideas, you can suggest to give the admin the ability to rank processes according to their importance, so that the OOM killer will target the low ranking processes first (currently, I think there are only two ranks: untouchable and fair game).


Basically I am happy with the two „ranks“, which, as eternal_noob said, should be sufficient. I am unhappy with the poor choices of the OOM killer within the fair-game rank. It should kill the process which actually caused the OOM, not unrelated processes. Does it make any sense to anyone that Firefox (the culprit) continues running, while one (or all) of sshd/bash/tty gets killed?

[Edit]: The same with my other incident in October/November 2020 (s. #50). Why did it kill Apache and not the culprit, MySQL? A stupid and cumbersome decision.


----------



## zirias@ (Jan 7, 2022)

obsigna said:


> Does this make any sense to anyone that Firefox (the culprit) continues running, while one of (or all) sshd/bash/tty becomes killed?


Yes, the process currently needing RAM is doing some work, so chances are it's more "important" than some other, idle, process that holds a sufficient amount of RAM. Yes, that's a very imperfect heuristic. The OOM killer is a last resort and you don't want to ever need it...


----------



## Ambert (Jan 7, 2022)

I am sorry I started it, but we should avoid talking about improving FreeBSD, or this useful thread might be locked. The call for project ideas was a special temporary exception.



			
DutchDaemon said:

> As of today, FreeBSD Forums staff will actively close down (and eventually remove) topics that serve no other purpose than to complain that "FreeBSD is not (like) Linux" (or Windows, or MacOS, or any other operating system), or that "FreeBSD does not use systemd", or that "FreeBSD has no default GUI", or that "FreeBSD does not encrypt gremlins", etc. This also includes topics that devolve into that kind of debate.
> 
> 
> Note that this is a general user and administrator forum, where the community aims to assist those who want to install, run, or upgrade _*FreeBSD as-is*_. Discussions about what FreeBSD _needs to be_, or _needs to add_, or _needs to lose_, are pointless on the forums. We do not maintain the operating system here.



Now, to get back to how to use FreeBSD as-is, here is a test program showing how to change the access protection of the virtual address space, on a page-by-page basis:


```
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h> // mmap
#include <unistd.h> // getpid sysconf

void print_address_space(void) {
    fflush(stdout);
    char command[100];
    snprintf(command, 100, "procstat vm %ld", (long)getpid());
    system(command);
}

int main(void) {
    unsigned long page_size = sysconf(_SC_PAGESIZE);
    unsigned long range_size = 100*page_size;
    printf("Page size: %lu bytes.\n\n", page_size);

    printf("Initial address space:\n");
    print_address_space();

    void *ptr = mmap(NULL, range_size, PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    printf("\nAddress space after a range of 100 contiguous pages have been\n"
           "allocated with write+exec protection at %p:\n", ptr);
    print_address_space();

    void *first_page_address = ptr;
    mmap(first_page_address, page_size, PROT_EXEC,
         MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
    printf("\nThe first page of the range has been deleted and replaced with\n"
           "a new exec-only page (at %p).\n", first_page_address);

    void *second_page_address = ((char*)first_page_address) + page_size;
    mmap(second_page_address, page_size, PROT_NONE,
         MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
    printf("\nThe second page of the range has been deleted and replaced with\n"
           "a new page (PROT_NONE) (at %p).\n", second_page_address);

    void *third_page_address = ((char*)second_page_address) + page_size;
    mmap(third_page_address, page_size, PROT_READ | PROT_WRITE | PROT_EXEC,
         MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
    printf("\nThe third page of the range has been deleted and replaced with\n"
           "a new page (with read+write+exec access) (at %p).\n",
           third_page_address);

    printf("\nFinal address space:\n");
    print_address_space();

    return 0;
}
```

Edit: I have tested this program with QEMU on FreeBSD-13.0-RELEASE-amd64.qcow2, and it works as expected.

There is a way to change the access protection of a range of virtual memory addresses without removing the data, but the granularity of the change is not guaranteed to be a single page.


----------



## zirias@ (Jan 7, 2022)

Ambert Uh-oh ... this code doesn't look "pretty" for sure, but I take your word it works, so it actually fulfills the requirement. Maybe someone should write a better allocator than standard `malloc()` using this technique behind the scenes then. Used consistently, it would eliminate at least _one_ important reason for overcommit!

In fact, I wasn't aware you can "re-mmap" just parts of what you mapped before, that's a (pleasant!) surprise.

BTW, I think you misunderstand this "forum rule" a bit. Nobody here has a problem with general OS design discussions, as long as they are not plain requests to change FreeBSD (which would of course make no sense here, and typically come from e.g. systemd fanboys)


----------



## Ambert (Jan 7, 2022)

You should not take my word for it. For the moment I can't have access to a FreeBSD installation (I am stuck on a Linux install), so the code was not tested properly. That's why most of my sentences begin with "I think that". But I think you can execute this code on any FreeBSD install and you will understand how it works just by looking at the output of the program. I could have made that program prettier, but it would not have been a good introduction to `mmap()`. _Edit: I have tested the code on FreeBSD over QEMU and it works as expected._



Zirias said:


> Used consequently, it would eliminate at least _one_ important reason for overcommit!



See one of my previous posts:



Ambert said:


> Unfortunately, the simple interface provided by `malloc()` does not allow it to be implemented in a way that suits your needs: all the reserved pages have to be set as writable immediately, because the user can write at any address at any time (and an attempt to write a byte in a virtual memory page defined as read-only triggers a segmentation fault, it does not make the memory page writable). And since `malloc()` is standard and easy to use, most people just use it instead of `mmap()`.



In addition, if you want to write portable code, `malloc()` is a good bet. Otherwise you have to abstract the memory handling in some kind of library, and use `mmap()` for Unix systems and `VirtualAlloc()` for Windows systems (and potentially other low-level memory functions for other operating systems). And `VirtualAlloc()` cannot do everything `mmap()` can.

In particular, `mmap()` can reserve the entire virtual address space without memory overhead (as long as all the pages have the same access protection). But with `VirtualAlloc()`, reserving a range of virtual addresses has a cost of 1 bit per 64 KiB, and if you write over that memory, the cost becomes 8 bytes per 4 KiB (which is 0.2%), even if you release the written pages by setting their access protection to none. You have to completely release the entire reserved area to get almost all of your memory back. I say "almost" because you can never get back the reservation cost of 1 bit per 64 KiB, unless you terminate the process. That's true for Windows 8 and 10 (x86_64). I don't know about Windows 11.


----------



## grahamperrin@ (Jan 8, 2022)

Ambert said:


> … @obsigna -- Next time there is a call for Foundation-supported project ideas, …



The call for _proposals_ is open. 









FreeBSD Foundation Soliciting Project Proposals (forums.freebsd.org): "Hello everyone, The call for proposals has been sent out. -- Joe (with Foundation hat on)"

FreeBSD Foundation 2022 Call for Proposals : freebsd


----------



## zirias@ (Jan 8, 2022)

Ambert said:


> You should not take my word for it. For the moment I can't have access to a FreeBSD installation (I am stuck on a Linux install), so the code was not tested properly. That's why most of my sentences begin with "I think that". But I think you can execute this code on any FreeBSD install and you will understand how it works just by looking at the output of the program. I could have made that program prettier, but it would not have been a good introduction to `mmap()`.


I don't think you can make it "prettier" in the way I meant. Although `mmap()` is specified in POSIX, anonymous mappings aren't specified there, so it's also unclear whether this page-wise "re-mapping" would _really_ work.

But if this code works on FreeBSD, it would at least mean you can write a program on FreeBSD reserving a large contiguous address space without having to rely on overcommit, which is already kind of cool.


Ambert said:


> In addition, if you want to write portable code, `malloc()` is a good bet. Otherwise you have to abstract the memory handling in some kind of library, and use `mmap()` for Unix systems and `VirtualAlloc()` for Windows systems (and potentially other low-level memory functions for other operating systems). And `VirtualAlloc()` cannot do everything `mmap()` can.


That's one reason I think some extended (and portable) interface would be nice *). Of course, supporting Windows in addition to `mmap()` as a backend would be more work for an implementation...

Probably it's much too late for all my thoughts here, because overcommit is just an "accepted fact" and people wouldn't start adopting better interfaces in different OSes _and_ applications. So you can just add the "OOM killer" to the list of unforeseeable environmental events that can always bring down your application/service in an unexpected way...

-----
*) *edit*: For simple usecases like a potentially growing array, a simple extension like this could be enough:

```
void *mreserve(size_t size);
void *mallocfrom(void *pool, size_t size);
void munreserve(void *pool);
```
Of course, this won't suffice if what you need is some sparse data structure...


----------



## _martin (Jan 8, 2022)

covacat said:


> try to run some large process that uses shm


I don't have anything useful at hand. stress-ng had issues running. It's been over 20 years since I coded anything shm-related, and I didn't have time to write anything myself. But I think it's worth exploring further, or maybe even checking current PRs.

Zirias As you mentioned somewhere above, protect(1) is the FreeBSD way to achieve what you need (assuming I understood that you want to exclude your process from OOM). You can also call procctl(2) within your process (if run as root).


----------



## grahamperrin@ (Jan 8, 2022)

covacat said:


> try to run some large process that uses shm …





_martin said:


> I don't have anything useful at hand. stress-ng had issues running it. …



Is stress2 of any relevance? (Just curious.)









freebsd-src/tools/test/stress2 at main · freebsd/freebsd-src (github.com): FreeBSD src tree (read-only mirror).


----------



## covacat (Jan 8, 2022)

_martin said:


> I don't have anything useful at hand. stress-ng had issues running it


you can probably hack /usr/src/tests/sys/posixshm


----------



## zirias@ (Jan 8, 2022)

_martin said:


> Zirias As you mentioned somewhere above protect(1) is a FreeBSD way to achieve what you need (assuming I understood you want to exlcude your process from OOM). You can also call procctl(2) within your process (if run as root).


Yes, this would enable me to rule out the OOM killer as _one_ possible "crash reason" for my process, and I think it would even be feasible _here_: it mostly doesn't need much RAM. Just occasionally, it uses a hash function from security/libargon2 which _is_ very memory-hungry, but that memory is quickly released again.

Still, I'm thinking about ways to conceptually improve the situation (although, sure, this will most probably not lead anywhere). It would make so much more sense if an application could learn when it can't get the RAM it needs (as in the _actual_ RAM, not just an address-space reservation) and react to it. Most of the time a (somewhat) clean exit would be the best it could do, but that's better than "crashing". And thinking about my usage of libargon2: it's just ONE function of the service, so in cases like this you could even think about just giving the client a temporary error when the function currently can't get the RAM it needs. Well, one can at least dream of better design...


----------



## _martin (Jan 8, 2022)

Well, if you have a coredump available, you could debug the reason for the crash of your program. At least that's where I'd start. There's also the question of what "crash" means in your case.

That's the beauty of the virtual address space: you as a program don't know if an address is RAM, swap, or anything in between. Also, strictly speaking, "address space reservation" doesn't make much sense; memory reservation does. Userspace (64-bit) address space is way larger than any available RAM (at least for now, excluding some special computers).

As memory is lazily allocated, you could malloc and write to it. If it's a chunk bigger than PAGE_SIZE, writing one byte to each page is enough (technically one bit, but the granularity is one byte). You're wasting system memory this way, but those pages will be allocated to your process. They could still get swapped out in case of memory pressure, but they will be there. malloc (using this term loosely to mean the memory allocator) will most likely hold on to these pages even when freed, as it does its thing with them. munmap() would release those pages immediately.


----------



## zirias@ (Jan 8, 2022)

_martin said:


> Well if you have coredump available you could debug the reason for the crash of your program.


I really wonder where this repeated misunderstanding comes from, maybe from the fact this was moved to the "programming" section. There is no problem with my code, and it doesn't crash either. I just added "sane" handling of an OOM situation, and when trying to test it more realistically, I noticed it's impossible with the default setting of `vm.overcommit`. So, yes, it's basically a question about the overcommit behavior of FreeBSD and the corresponding sysctl.



_martin said:


> "address space reservation" doesn't make much sense


I beg to differ, the usecase for it was first mentioned by Ambert in this thread: You want some growing memory to be contiguous in virtual address space. With `malloc()`, this isn't possible in the presence of multiple parts of the program (or even shared libs) using it. So what programs do is reserve just a huge chunk of memory, relying on overcommit behavior...


----------



## _martin (Jan 8, 2022)

Zirias said:


> this would enable me to rule out the OOM killer as _one_ possible "crash reason" for my process


Judging from this. 

Virtual address space is flat. With some small exceptions, you have the full 128 TiB of virtual address space available to you.

malloc (the allocator) will always provide you a flat, contiguous space of the requested size (i.e. an allocated chunk is always contiguous). It allocates space within an already contiguous mmapped chunk (a big chunk that is used as the starting ground for the allocator). As a userspace programmer you should not care whether two separately allocated chunks are next to each other (e.g. malloc(32), malloc(32)). But if, for example, you need to extend a dynamically allocated buffer (hence the need for a larger contiguous chunk), you can use realloc; the allocator will take care of that. And no, the text/data/bss/stack/vdso/shared-libs segments, or any other custom segment mapped into the address space, are not a problem.

You can also write your own allocator, or avoid one by using mmap. If you know what you're doing, you can use MAP_FIXED to keep chunks at known addresses. I can't think of a reason why you'd need to, but there's a way (the only time I need that is when writing exploits and I need to know where I am, e.g. jumping to a custom exploit page).

Important note though: contiguous chunk in virtual address does not mean contiguous chunk in physical memory.


----------



## zirias@ (Jan 8, 2022)

_martin said:


> But, for example, if you need to extend the dynamically allocated buffer (hence a need for larger contiguous chunk) you can use realloc. Allocator will take care of that.


You should be aware that in the presence of heap fragmentation, `realloc()` often needs to copy all the contents to a new location. This is not acceptable for _some_ usecases (we've had all of this in this thread before...). It is acceptable for my service here, so I'm using it.


_martin said:


> You can also write your own allocator or avoid it by using mmap. If you know what you're doing you can use MAP_FIXED to keep chunks at known addresses.


This has been discussed here before as well. Sure it works, but it also requires you to actually request the memory upfront (the OS will only try to really provide it at a page fault, but by then the program isn't in control any more and can't handle the case where the request can't be fulfilled). Also, this is not really portable, but that's just a side note.


_martin said:


> Important note though: contiguous chunk in virtual address does not mean contiguous chunk in physical memory.


This is less relevant for most usecases.


----------



## obsigna (Jan 8, 2022)

Here comes *my preliminary* summary on what I understood about the problem and how I started to mitigate this.

*My understanding of the problem:*
Actually the title of this thread tells it: _„How to make "impossible" memory allocations fail in a sane way?“_. My first reaction was: what is the problem? Just check for OOM in your program, then call exit(3) and let the atexit(3) handlers do the cleaning up. Then I learned about two unbelievably ridiculous circumstances which prohibit this _„sane“_ way of OOM handling within our programs:

malloc(3) never fails, unless our program asks it to allocate 1 petabyte of memory. Therefore our program would never learn about an OOM situation early enough to do anything sane about it.


The FreeBSD system implements an OOM killer, i.e. a cave man of the stone age who, in case of a problem, smashes the head of the next person around with a big club - it could at least signal a TERM before a KILL (bronze age), couldn't it?
*My mitigation:*

I already had an implementation of a malloc wrapper with introspection facilities, like the total allocation and the count of allocated chunks. This was already very handy when hunting memory leaks, since on program exit I require both to be 0. With that in place, it was easy to impose a limit on the total allocation (512 MB default), which can be changed on the command line. In case the total allocation would exceed the defined limit, the malloc wrapper returns NULL, and my program does whatever is appropriate in the given situation.


The wrapper actually calls `a = malloc(s)` and now in the same breath calls `madvise(a, s, MADV_WILLNEED|MADV_PROTECT)` (thanks to one of the suggestions of Ambert).
According to madvise(2), MADV_PROTECT:


> Informs the VM system this process should not be killed when the swap space is exhausted. The process must have superuser privileges. This should be used judiciously in processes that must remain running for the system to properly function.


I need to check somehow whether this is really the case, though.

*My wish:*
A modern-age OOM supervisor (not killer), which implements a new signal, OOM, and sends it to all userland processes first. I would happily let my process finish in a _„sane“_ way when it received a signal like this. In case the OOM signal does not free enough memory, the supervisor could send a TERM to the memory hogs (bronze age), only then eventually falling back to the stone-age (head-smashing) behaviour.


----------



## shkhln (Jan 8, 2022)

obsigna said:


> unbelievably ridiculous


Considering that every other shared resource I can think of (ISP bandwidth, road capacity, hospital beds, whatever) is provisioned under assumption that it won't be simultaneously needed by 100% of potential users, "ridiculous" is not the word I'll use. More like "inevitable".



obsigna said:


> The wrapper actually calls a = malloc(s) and now in the same breath it calls madvise(a, s, MADV_WILLNEED|MADV_PROTECT)


Do you enjoy deadlocks?


----------



## obsigna (Jan 8, 2022)

shkhln said:


> Do you enjoy deadlocks?


Deadlock of what into what?


----------



## shkhln (Jan 8, 2022)

obsigna said:


> Deadlock of what in what?


At least with overcommit enabled, FreeBSD most likely still won't return any nulls to you, and since FreeBSD can't kill your process it would rather pause it until enough memory is available. Then you'll be like this guy: https://forums.freebsd.org/threads/out-of-memory.69755/.


----------



## obsigna (Jan 8, 2022)

shkhln said:


> Considering that every other shared resource I can think of (ISP bandwidth, road capacity, hospital beds, whatever) is provisioned under assumption that it won't be simultaneously needed by 100% of potential users, "ridiculous" is not the word I'll use. More like "inevitable".


In developed countries, doctors do not smash the heads of patients when they run out of beds, only to make space for new occupants. This would be _unbelievably ridiculous_, wouldn’t it?


----------



## shkhln (Jan 8, 2022)

obsigna said:


> In developed countries, doctors do not smash the heads of patients when they ran out of beds, only to make space for new occupants. This would be _unbelievably ridiculous_, wouldn’t it?


They do triaging.


----------



## obsigna (Jan 8, 2022)

shkhln said:


> They do triaging.


If it comes to that, it is done according to well-thought-out rules. And there are procedures, like a signal OOB (out of beds) before a TERM before a KILL. See my wish.


----------



## _martin (Jan 8, 2022)

Zirias said:


> but it also requires to actually request the memory upfront (although the OS will only try to really provide it at a page fault


That's why I mentioned writing to each page (preallocation) and unmap it when not needed. 

Still relevant is the trigger of the panic you were able to achieve. Can you confirm you have the p4 GENERIC kernel running on your system? I'll try to massage the bug a bit.


----------



## obsigna (Jan 8, 2022)

shkhln said:


> At least with overcommit enabled, FreeBSD most likely still won't return any nulls to you, and since FreeBSD can't kill your process it would rather pause it until enough memory is available. Then you'll be like this guy: https://forums.freebsd.org/threads/out-of-memory.69755/.


This is a different situation. I ship measurement controllers operated by FreeBSD, and when the measurement daemon gets killed only because somebody wants to run Firefox or Chrome or the like on it, then it is pointless for a measurement controller without a measurement daemon to continue operating. In this case, I prefer that the user holds down the power button for a few seconds. Some users are clever enough to simply not run Firefox/Chrome etc. any more; others will need to go through this a few more times, but eventually they will understand as well.


----------



## zirias@ (Jan 9, 2022)

_martin said:


> That's why I mentioned writing to each page (preallocation) and unmap it when not needed.


That's not really a solution either if your usecase is some dynamic, potentially large and growing array. You want simple indices to work, so it needs to be contiguous in memory. But once you unmap what you don't need currently, other code-paths might put a mapping exactly in this location of virtual address space...

As there _are_ programs allocating huge chunks of memory they will most likely never really use, I assume they have usecases like for example this. That's why I think decoupling the reservation of address space from that of actual pages backing it _could_ improve the situation, you could at least meet these needs without overcommit then.


_martin said:


> Still relevant was the trigger of the panic you were able to achieve. Can you confirm you have p4 generic kernel running on your system ? I'll try to massage the bug a bit.


It's almost GENERIC, except for `device sg` (Linux-compatible raw SCSI devices). I'd be very surprised if _that_ made a difference. What's probably relevant is that my system was already heavily overcommitting (I checked with roughly the same programs running later, `vm.swap_reserved` was at >600GB on this machine with 8GB RAM / 8GB swap).


----------

