# sys/malloc and WAIT_OK option



## DavidMarec (May 13, 2016)

While investigating about a  freeze when loading pf, PR 209475, Olivier has suspected the hash table allocation to be responsible of the loading freeze.

Reading the code, I noticed that the allocation process within pf_initialize() function sets the M_WAITOK flag. This way, the intialisation process might hang to wait for the requested resources.

Moreover, it sounds like requesting more than the total amount of ram causes a panic.

What should happen if the request *can't* be performed?

Reading malloc(9), malloc never returns NULL with this flag set.
So, the process will wait indefinitely?


----------



## fnoyanisi (May 15, 2016)

You can verify Olivier's theory by applying a patch, similar to one given below, to update sys/netpfil/pf/pf.c in a way that malloc()(9) calls will return NULL in case of the request cannot be fullfilled. Of course, you will have to recompile the source in order to test....


```
malloc(...., ..., M_NOWAIT|M_ZERO);
```

Once you are confident that this is the cause of the issue, you can always insert extra `if()..else...` checks before allocating memory for the hash table with malloc(9).

From malloc(9)

```
M_NOWAIT
         Causes malloc(), realloc(), and reallocf()    to return NULL if the
         request cannot be immediately fulfilled due to resource shortage.
         Note that M_NOWAIT    is required when running in an interrupt con-
         text
```


----------



## DavidMarec (May 16, 2016)

beyond the PR, my question regards  to an overall topic:

What is the expected malloc(9)'s behavior  in case of an impossible request ?

Digging into the code, I have found that panic(9) may be called; but the fact that *malloc will never return NULL* is no so clear. ( KASSERT(9) is also called to check this out)
So, if one modifies  sysctl variables that handle size of memory tables, one must keep in mind that this may issue  panic(9), especially when using system with low memory.

On the other hand, developers should be encouraged to do proper bound checking before calling malloc(9) when M_WAITOK has been set ( the way NetBSD does  ?) ?

I only  wonder how senior developers  manage the allocation in kernel space and let me known their thoughts about this behavior.


----------



## fnoyanisi (May 16, 2016)

DavidMarec said:


> What is the expected malloc(9)'s behavior in case of an impossible request ?
> 
> Digging into the code, I have found that panic(9) may be called; but the fact that *malloc will never return NULL* is no so clear. ( KASSERT(9) is also called to check this out)


I think this is more an Operating System design matter than being a "case specific issue"....



DavidMarec said:


> On the other hand, developers should be encouraged to do proper bound checking before calling malloc(9) when M_WAITOK has been set


Yes, they should indeed....Have you considered submitting a patch to Olivier's PR? I added myself to the CC list of the PR, I also am curious about the solution to this bug....


----------



## DavidMarec (May 25, 2016)

fnoyanisi said:


> Have you considered submitting a patch to Olivier's PR? I added myself to the CC list of the PR, I also am curious about the solution to this bug....



As far as I see, there are lots of stuff that are allocated this way by the pf(4) module.
As  I am not  involved enough in this project, I must consider that there should be a good reason for this.

On the other hand, I quickly wrote a simple module(9) that only allocates(9) huge amount of RAM, until more than the physical memory contains, at startup.

Despite what Oliver noticed,  never this module  - or pf -  caused the sytem to panic(9);  but it remained stuck.


----------



## fnoyanisi (May 25, 2016)

Aha... I wrote this last night, and I was thinking to submit a patch soon 

(the code below is not the patch itself, it just shows how to calculate "the usable memory" while doing stuff in the kernel space.)

The idea is to add an extra check before allocating the memory for the `pf_hashsize`. If the requested amount via sysctl(8) is too much to handle, just allocate `PF_HASHSIZ` (which is 32768) amount of memory and set the value of sysctl tunable to reflect the allocated amount.

I also need to recompile the source with the updated pf(4) source.


```
#include <sys/types.h>
#include <sys/sysctl.h>  /* SYSCTL_X family macros */
#include <sys/vmmeter.h>  /* vm structs */
#include <sys/module.h>
#include <sys/systm.h>  /* uprintf */
#include <sys/errno.h>
#include <sys/param.h>  /* defines used in kernel.h */
#include <sys/kernel.h>  /* types used in module initialization */

#include <vm/vm_param.h>  /* VM_TOTAL */

#include <sys/proc.h>

/*
* Load handler that deals with the loading and unloading of a KLD.
*/

static void
print_free_mem(){
   int err;
   static u_int inact_pages, free_pages, page_size;
   static uint64_t free_mem;
   size_t sz;
  
   sz = sizeof(u_int);
  
   uprintf("memstat module loaded.\n");
   err = kernel_sysctlbyname(&thread0,"vm.stats.vm.v_free_count",
   &free_pages, &sz, 0, 0, 0, 0);
   err = kernel_sysctlbyname(&thread0,"vm.stats.vm.v_inactive_count",
   &inact_pages, &sz, 0, 0, 0, 0);
   err = kernel_sysctlbyname(&thread0,"vm.stats.vm.v_page_size",
   &page_size, &sz, 0, 0, 0, 0);
  
   free_mem = (uint64_t)(free_pages + inact_pages)*page_size;
  
   uprintf("free_pages = %d\ninact_pages = %d\npage_size = %d\nfree_mem = %ld\n",
   free_pages, inact_pages, page_size, free_mem);
}

static int
memstat_loader(struct module *m, int what, void *arg)
{
   int err = 0;
  
   switch (what) {
   case MOD_LOAD:  /* kldload */
   print_free_mem();
   break;
   case MOD_UNLOAD:
   uprintf("memstat unloaded.\n");
   break;
   default:
   err = EOPNOTSUPP;
   break;
  }
   return(err);
}

/* Declare this module to the rest of the kernel */
static moduledata_t memstat_mod = {
   "memstat",
   memstat_loader,
   NULL
};

DECLARE_MODULE(memstat, memstat_mod, SI_SUB_KLD, SI_ORDER_ANY);
```


----------



## DavidMarec (May 25, 2016)

fnoyanisi said:


> Aha... I wrote this last night, and I was thinking to submit a patch soon
> (the code below is not the patch itself, it just shows how to calculate "the usable memory" while doing stuff in the kernel space.)



I had much the same idea, regarding the bunch of tables and pools of different sizes that pf needs to malloc.

If I am not mistaken, malloc(9) allocate only physical memory.

And,  I wonder  if malloc(9)requires allocated blocks to be contiguous ?
- If so, computing the available memory won't be enough. -

`sysctl vm.phys_free`


----------



## fnoyanisi (May 25, 2016)

DavidMarec said:


> I had much the same idea, regarding the bunch of tables and pools of different sizes that pf needs to malloc.
> 
> If I am not mistaken, malloc(9) allocate only physical memory.
> 
> ...



Not aware of the malloc(9) internals to be honest, but `vm.phys_free` (or `vm.physmem`) could be looked into... `vm.phys_free` seem to be a better option than `vm.stats` sysctl tunables (less `kernel_sysctl()` calls to be made)

I will try to get the patch ready this weekend, and see how it goes...Will update this thread accordingly...


----------



## fnoyanisi (May 27, 2016)

DavidMarec said:


> I had much the same idea, regarding the bunch of tables and pools of different sizes that pf needs to malloc.
> 
> If I am not mistaken, malloc(9) allocate only physical memory.
> 
> ...



Design and Implementation of 4.4 BSD Section 2.5.2 gives answer to your questions.

- Kernel allocates physical memory only, and is not paged.

And, one need to consider the *memory available to kernel *I think?


----------



## fnoyanisi (May 29, 2016)

I set up to VMs to reproduce the case and test the patch I have written. VM1 has the default pf(4) that comes with 11.0-CURRENT, wheres VM2 has a patched version of the pf(4).

Both VMs have 512MB of RAM each, and bhyve(8) is used as the hypervisor to run VMs.

I simply added an extra check for the requested amount of memory against `hw.physmem`.

_*VM1*_

```
vm1 # uname -a
FreeBSD test-pf 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r297692: Fri Apr  8 03:07:13 UTC 2016  root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
vm1 # cat /boot/loader.conf
net.pf.states_hashsize: 2147483648
vm1 # sysctl hw.realmem
hw.realmem: 536870912
vm1 # kldstat
Id Refs Address  Size  Name
1  3 0xffffffff80200000 1ee2bc0  kernel
vm1 # service pf onestart
Enabling pfKernel page fault with the following non-sleepable locks held:
exclusive rw pf rulesets (pf rulesets) r = 0 (0xffffffff822466e0) locked @ /usr/src/sys/modules/pf/../../netpfil/pf/pf_ioctl.c:2901
stack backtrace:
#0 0xffffffff80a91a90 at witness_debugger+0x70
#1 0xffffffff80a92d77 at witness_warn+0x3d7
#2 0xffffffff80e92817 at trap_pfault+0x57
#3 0xffffffff80e91ea4 at trap+0x284
#4 0xffffffff80e71ea7 at calltrap+0x8
#5 0xffffffff8090b646 at devfs_ioctl_f+0x156
#6 0xffffffff80a97106 at kern_ioctl+0x246
#7 0xffffffff80a96e51 at sys_ioctl+0x171
#8 0xffffffff80e92f6b at amd64_syscall+0x2db
#9 0xffffffff80e7218b at Xfast_syscall+0xfb


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x0
fault code     = supervisor read data, page not present
instruction pointer   = 0x20:0xffffffff8223001f
stack pointer    = 0x28:0xfffffe002b724310
frame pointer    = 0x28:0xfffffe002b724800
code segment     = base 0x0, limit 0xfffff, type 0x1b
       = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags   = interrupt enabled, resume, IOPL = 0
current process     = 635 (pfctl)
[ thread pid 635 tid 100074 ]
Stopped at  pfioctl+0x4ef:  movq  (%rdi),%rsi
db>
```
_*VM2*_

```
vm2 # uname -a
FreeBSD test-pf 11.0-CURRENT FreeBSD 11.0-CURRENT #2: Sun May 29 12:06:57 NZST 2016  test@test-pf:/usr/obj/usr/src/sys/GENERIC  amd64
vm2 # cat /boot/loader.conf
net.pf.states_hashsize="2147483648"
vm2 # sysctl hw.realmem
hw.realmem: 536870912
vm2 # kldstat
Id Refs Address  Size  Name
1  1 0xffffffff80200000 1ee2bc0  kernel
vm2 # service pf onestart
Enabling pf.
vm2 # kldstat
Id Refs Address  Size  Name
1  3 0xffffffff80200000 1ee2bc0  kernel
2  1 0xffffffff82219000 34c30  pf.ko
```

And this is the code snippet I added into the pf.c (there is also addition in terms of header files to be included)

```
/* Per-vnet data storage structures initialization. */
void
pf_initialize()
{
   struct pf_keyhash   *kh;
   struct pf_idhash   *ih;
   struct pf_srchash   *sh;
  u_long hw_physmem;
   u_int i;
  size_t size;

   /* Get the available hardware memory */
  size=sizeof(hw_physmem);
  if (kernel_sysctlbyname(&thread0, "hw.physmem", &hw_physmem,   \
   &size, 0, 0, 0, 0) != 0)
   /* make the check against hw.physmem irrelevant */
     hw_physmem = ULONG_MAX;

   if (pf_hashsize == 0 || !powerof2(pf_hashsize) ||      \
   (pf_hashsize * sizeof(struct pf_keyhash)) > hw_physmem ||    \
   (pf_hashsize * sizeof(struct pf_idhash)) > hw_physmem)
     pf_hashsize = PF_HASHSIZ;
   if (pf_srchashsize == 0 || !powerof2(pf_srchashsize) ||    \
   (pf_srchashsize * sizeof(struct pf_srchash)) > hw_physmem )
     pf_srchashsize = PF_HASHSIZ / 4;
```

I will also add a comment to the PR in the bugzilla and see how it goes...

The diff (patch) file (`diff -u pf.c p.c.fni > pf.c.dif` , where pf.c.fni is the patched version) 

```
--- pf.c   2016-05-29 20:31:10.980290000 +1200
+++ pf.c.fni   2016-05-29 20:32:24.194178000 +1200
@@ -59,6 +59,7 @@
 #include <sys/sysctl.h>
 #include <sys/taskqueue.h>
 #include <sys/ucred.h>
+#include <sys/proc.h>
 #include <net/if.h>
 #include <net/if_var.h>
@@ -776,11 +777,23 @@
   struct pf_keyhash   *kh;
   struct pf_idhash   *ih;
   struct pf_srchash   *sh;
+  u_long hw_physmem;
   u_int i;
-
-   if (pf_hashsize == 0 || !powerof2(pf_hashsize))
+  size_t size;
+   
+   /* Get the available hardware memory */
+  size=sizeof(hw_physmem);
+  if (kernel_sysctlbyname(&thread0, "hw.physmem", &hw_physmem,   \
+   &size, 0, 0, 0, 0) != 0)
+   /* make the check against hw.physmem irrelevant */
+     hw_physmem = ULONG_MAX;
+
+   if (pf_hashsize == 0 || !powerof2(pf_hashsize) ||      \
+   (pf_hashsize * sizeof(struct pf_keyhash)) > hw_physmem ||    \
+   (pf_hashsize * sizeof(struct pf_idhash)) > hw_physmem)
     pf_hashsize = PF_HASHSIZ;
-   if (pf_srchashsize == 0 || !powerof2(pf_srchashsize))
+   if (pf_srchashsize == 0 || !powerof2(pf_srchashsize) ||    \
+   (pf_srchashsize * sizeof(struct pf_srchash)) > hw_physmem )
     pf_srchashsize = PF_HASHSIZ / 4;
   V_pf_hashseed = arc4random();
```


----------

