# idprio -- make sure a task will *never* take time slices when a normal prio task could run?



## zirias@ (Apr 20, 2020)

This is a question about scheduling. Scenario: I have a single machine doing everything I need at home, this includes several (bhyve) virtual machines, one of them acting as my router/firewall (with PCI passthrough of the physical NICs), and also several VNET jails, one of them is my build host for the base system and poudriere for ports.

The "builder" should ideally only do something when a CPU core would be idle otherwise. So far I started that jail with `idprio`, which is automatically inherited by all child processes. Poudriere fires up [ncpu] build tasks (8 in my case), and this solution worked very well -- it never affected the responsiveness of other things the same hardware is doing. Sometimes, this leads to a situation where a huge port builds using a single task, and all other ports in the queue depend on this one -- so I started allowing parallel building for _some_ ports that typically suffer from that problem. Of course, this can lead to poudriere using 15 or even 23 tasks at the same time, and suddenly I noticed my router was affected by this. To restore normal network performance, I had to put my router vm on `rtprio`. This doesn't feel like a good solution, especially as other vms (like the Windows server I need for remote work) are affected as well.

My assumption now is that even idle-priority tasks get a minimum share of time slices, probably to prevent starvation. Is this correct? If so, is there some knob (e.g. a sysctl) I can use to configure something like a threshold to further limit the amount of CPU time an idle-priority task can "steal"?


----------



## Ordoban (Apr 20, 2020)

Dumb question: are you sure it is CPU load that is affecting the router vm? Several other things can be the bottleneck. Maybe HDD I/O? Or the RAM bus? Maybe a cache issue?


----------



## zirias@ (Apr 20, 2020)

I thought about that, but abandoned the thought based on the fact that `rtprio` for the router vm did help. Is this conclusion flawed?

edit: of course, many parallel build jobs will increase IO load as well, so I see where you're coming from ....


----------



## PMc (Apr 20, 2020)

Yes, idprio makes certain the task is only run when nothing else wants to run. And, as Ordoban already mentioned, this only concerns CPU cycles. Other things that get dispatched (e.g. the whole bunch of zfs threads in the kernel, or specifically the txg.timeout) are not affected.
Then, I think it is generally a good idea to run networking at rtprio. You only have to make certain that the threads concerned cannot go into an endless loop -- because if all CPUs are occupied, you're effectively locked out -- so I would put only the networking tasks concerned on rtprio, and/or use a watchdog (like asterisk does).

Looking into it in more detail with `ps axlH`, we see that the zfs stuff runs at prio -16 or -8 (default kernel prios), while rtprio gets us something between -21 and -52. Interrupt handling and in-kernel networking should sit still higher, somewhere around -76 to -92.
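For example (the column keywords are from ps(1); the exact output will of course differ per system):

```shell
# Per-thread listing -- the PRI column shows the kernel priority
ps axlH | head

# A narrower view with selected columns, sorted by priority
ps -axH -o pid,pri,ni,comm | sort -n -k 2 | head
```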


----------



## zirias@ (Apr 20, 2020)

Thanks a lot PMc -- I think I have a much better idea now of what's going on, and if I understand you correctly, there's not much I can do regarding the scheduler. Just to double-check that I understand it correctly: the userspace build tasks might be blocked, but the in-kernel ZFS code has work to do, which is scheduled with higher priority, therefore taking precedence over other tasks on the machine, correct? And as builds do a lot of harddisk I/O, this will happen very frequently?


PMc said:


> Then, I think it is generally a good idea to run networking at rtprio. You only have to make certain that the threads concerned cannot go into an endless loop -- because if all CPUs are occupied, you're effectively locked out -- so I would put only the networking tasks concerned on rtprio, and/or use a watchdog (like asterisk does).


This also makes sense. As I'm talking about a bhyve vm here that does all my routing and filtering between subnets and the internet, and which is only configured for 2 vCPUs, there's no risk of it hogging the whole machine, so I'll probably leave it on rtprio, just to be sure nothing happening on the machine will ever kill my network performance.
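For the record, moving an already-running vm to rtprio can be done like this (the pgrep pattern is only an illustration -- match whatever your bhyve process is actually named):

```shell
# rtprio(1): 'rtprio <priority> -<pid>' changes an existing process;
# priority 0 is the most urgent realtime priority, 31 the least.
rtprio 5 -"$(pgrep -f 'bhyve: router')"
```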

Then I guess I have no way to ensure my builds will never affect *other* services and virtual machines on the same host, other than reverting my poudriere config to disallow parallel builds for any package (as "only" 8 build tasks never had an observable impact), or maybe patching poudriere so it pauses other jobs as long as a package is built in parallel ...


----------



## PMc (Apr 20, 2020)

Zirias said:


> Thanks a lot PMc -- I think I have a much better idea now of what's going on, and if I understand you correctly, there's not much I can do regarding the scheduler. Just to double-check that I understand it correctly: the userspace build tasks might be blocked, but the in-kernel ZFS code has work to do, which is scheduled with higher priority, therefore taking precedence over other tasks on the machine, correct?



Yes -- when the user process is on idprio, it cannot run while other processes or kernel tasks want to run. But if it finds a gap to run for a while, it will schedule some disk i/o, which may then actually run some seconds later (txg.timeout), when the situation is different again.
So, using idprio is still the right thing, only it is not 100%.



> And as builds do a lot of harddisk I/O, this will happen very frequently?



That depends on how over-powered your CPU is. On my ivybridge quadcore w/ ssd and zfs, builds are mostly CPU-bound, and I am quite certain idprio will do a good job if I happen to need it.

Nevertheless, you have two other options in stock: 1) you can dedicate a fixed set of CPUs to the build jail, to make certain the other cores are kept free (`cpuset`), and/or 2) you can limit the builds to fractional parts of a CPU and/or bounded amounts of disk i/o (`rctl`). These two are not flexible enough to grab spare resources on their own, but they could be dynamically modified by some controlling process according to whatever metrics are available (source code example on request).
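A rough sketch of both options (the jail name `builder` and all numbers are only examples; `pcpu` is in percent, so 400 means four cores' worth):

```shell
# 1) Pin the build jail to cores 4-7, keeping 0-3 free for the vms
cpuset -l 4-7 -j "$(jls -j builder jid)"

# 2) Cap the jail via rctl(8): at most 400% CPU time, plus throttled
#    disk bandwidth (readbps/writebps need a reasonably new FreeBSD)
rctl -a jail:builder:pcpu:deny=400
rctl -a jail:builder:readbps:throttle=50M
rctl -a jail:builder:writebps:throttle=50M
```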



> This also makes sense. As I'm talking about a bhyve vm here that does all my routing and filtering between subnets and the internet, and which is only configured for 2 vCPUs, there's no risk of it hogging the whole machine, so I'll probably leave it on rtprio, just to be sure nothing happening on the machine will ever kill my network performance.



Again, putting the network on rtprio does a good job, but it seems not to be 100%. With that I got the jitter to the internet down from some 5+ ms to near 300 µs, and then VoIP started to work without interruptions. And that is a very, very old machine, and it is usually heavily loaded, because it also collects my backups and has to sort them into a 40-million-record database.
But now, after moving that machine to VIMAGE jails, I observe strange random ping delays between the units -- which should be impossible, because the internal network should run entirely at network priority (faster than rtprio) and depends neither on disk nor on RAM. Currently I have no idea why that happens, as I don't see a clear relation to other activity. It may have to do with some mutex locks between kernel threads (there are a lot of them, and I won't track them down).

Then, if you want "network performance", that can mean (at least) three different things: avoiding saturation stalls, maximizing bandwidth, minimizing delays. And each of these is a science of its own ...



> Then I guess I have no way to ensure my builds will never affect *other* services and virtual machines on the same host, other than reverting my poudriere config to disallow parallel builds for any package (as "only" 8 build tasks never had an observable impact),



Well, you have the usual option of engineering: find out precisely where and why the actual bottleneck happens, and then it is often easy to fix. *biggrin*

Another thing that just comes to mind: parallel builds are notoriously memory-hungry. You may want to have a look at what your pagedaemon and vmdaemon are doing.


----------



## bds (Apr 21, 2020)

The rtprio/idprio man page mentions:


> Under FreeBSD system calls are currently never preempted, therefore non-realtime processes can starve realtime processes, or idletime processes can starve normal priority processes.


I'm usually wary of using idle scheduling for anything that interacts heavily with file systems and paging; an idle process that acquires a shared resource may then be unable to run for a while. For builds, `nice` works well enough for me. Note though that all processes can have pages swapped out and if that happens, latency issues are inevitable.
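As a sketch, that looks like this (the jail and list names are placeholders):

```shell
# Lowest timeshare priority instead of the idle class: the build still
# yields to normal work, but cannot be starved indefinitely while
# holding a shared resource.
nice -n 20 poudriere bulk -j builder -f /usr/local/etc/poudriere.d/pkglist
```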


----------



## PMc (Apr 21, 2020)

bds said:


> The rtprio/idprio man page mentions:
> 
> 
> > Under FreeBSD system calls are currently never preempted, therefore non-realtime processes can starve realtime processes, or idletime processes can starve normal priority processes.



Yes, that may be the case. Let's think it through (correct me if I'm wrong): the classical scheduler waited until a task reported back (either by making a system call or by being interrupted by a clock or other interrupt) and would then run some other task according to priority. This is no problem on single-CPU machines, because everything that happens arrives as an interrupt anyway, and so things get re-shuffled regardless.
It is different on multi-core, and so we got the preemptive scheduler, which can stop processes on its own initiative. And this quote seems to say that, while it can stop processes, it will never interrupt an ongoing system call?

That is interesting input, thank you! I never thought about that side; I only noticed that if we have PREEMPTION enabled (which we always do nowadays), then most task switches happen via preemption and no longer in the classical way.


----------

