# How to kill a running process?



## PMc (Mar 1, 2022)

Traditionally, with `kill -9` we could unconditionally kill a running process. This doesn't work anymore:


```
# ps ax | grep 91293
91293  -  RJ     1121:57.70 pg_dump -U postgres -p 5432 -bF c -f /var/db/pg-i
# kill -9 91293
# ps ax | grep 91293
91293  -  RJ     1122:25.71 pg_dump -U postgres -p 5432 -bF c -f /var/db/pg-i
# kill -9 91293
# ps ax | grep 91293
91293  -  RJ     1122:31.02 pg_dump -U postgres -p 5432 -bF c -f /var/db/pg-i
```

That thing is in an endless loop, continuously eating 100% of one CPU. How to get rid of it?


----------



## gpw928 (Mar 1, 2022)

From ps(1):

```
J       Marks a process which is in jail(2)
```
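A quick way to see those state flags for a single process (a sketch; 91293 is the PID from the output above, and the `state`/`wchan` keywords are standard `ps` columns):

```shell
# Show state and wait channel for one PID.  On FreeBSD an "R" means
# runnable, "J" jailed, and a "D" would mean uninterruptible disk wait.
ps -o pid,state,wchan,comm -p 91293
```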


----------



## Phishfry (Mar 1, 2022)

`pkill pg_dump` or `pkill 91293`


----------



## PMc (Mar 1, 2022)

Phishfry said:


> `pkill pg_dump` or `pkill 91293`


Why?



> pgrep, pkill – find or signal processes by name
> HISTORY
> The pkill and pgrep utilities first appeared in NetBSD 1.6.  They are
> modelled after utilities of the same name that appeared in Sun Solaris 7.
> They made their first appearance in FreeBSD 5.3.



Do you think `pkill` is by any means "better" than old-fashioned `kill`?
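As far as I can tell, `pgrep`/`pkill` match on the command *name* (a pattern), not on a PID - so `pkill 91293` above would only hit a process whose name contains "91293". A quick check with a throwaway process:

```shell
sleep 300 &            # throwaway process to experiment on
pid=$!
pgrep -l sleep         # pgrep/pkill match by name (pattern) ...
kill -9 "$pid"         # ... while plain kill takes the PID directly
```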



gpw928 said:


> J       Marks a process which is in jail(2)


Yes. I tried to kill it with `kill -9` from within the jail, and from the host. Neither had any effect, and nothing was logged.
So does that mean processes running in a jail cannot be killed at all?


And I found the, eh, "root cause": that process was writing to a file on a UFS filesystem. The disk had a bad block that could not be read:


```
# dd if=/dev/da6 of=/dev/null bs=64k
dd: /dev/da6: Input/output error
4267+0 records in
4267+0 records out
# dd if=/dev/zero of=/dev/da6 seek=4267 count=1 bs=64k
1+0 records in
1+0 records out
65536 bytes transferred in 1.092641 secs (59979 bytes/sec)
# dd if=/dev/da6 of=/dev/null bs=64k
# echo $?
0
```

Apparently all fine now, back in business. What I *don't like* is that it needed a reboot to get rid of these stuck processes.


----------



## Erichans (Mar 1, 2022)

Out of curiosity, when trying to kill things from within the jail, did you get into the "jail root" by:
`# jexec -l <jailname> login -f root`


----------



## PMc (Mar 1, 2022)

Erichans said:


> Out of curiosity, when trying to kill things from within the jail, did you get into the "jail root" by:
> `# jexec -l <jailname> login -f root`


No.


----------



## mark_j (Mar 1, 2022)

I've seen processes in jails defy killing, BUT, they've always had the state of *D* for *uninterruptible*. Eventually, though, they stop being uninterruptible and die as they should. Sometimes that can take minutes. Your process shows no *D*, though.

Your situation seems very odd and possibly is a bug. Even if the file system is junk, the driver should time out and allow the process to be killed.


----------



## Jose (Mar 2, 2022)

I've seen this happen to processes that were accessing an NFS filesystem. It's been a long while, though. Here's something similar:








> **Unkillable process: vi of NFS mounted file in 7.0-STABLE** (forums.freebsd.org)
> I have a virtual machine called quark running 7.0-STABLE and serving as both an NFS server and NFS client. It mounts a directory with NFS from another virtual machine called hestia running 8.2-PRERELEASE. Attempting to open a file within the mounted directory on quark with vi causes the...


----------



## PMc (Mar 2, 2022)

mark_j said:


> I've seen processes in jails defy killing, BUT, they've always had the state of *D* for *uninterruptible*. Eventually, though, they stop being uninterruptible and die as they should. Sometimes that can take minutes. Your process shows no *D*, though.


Yepp. And I verified this one with top, and there was always some CPU shown with 100%.  (In D state it would not eat CPU cycles.)



mark_j said:


> Your situation seems very odd and possibly is a bug. Even if the file system is junk, the driver should time out and allow the process to be killed.


There was no damage on the filesystem. The process was trying to create a new file; this call did not return, and after the reboot fsck did find an inconsistency - as it should.
Then I found and zeroed the bad block, let the disk do whatever housekeeping it might require, and after that fsck had no further errors. Maybe one of the files does now contain wrong data - but the filesystem itself never had a flaw.

The logfiles show device errors, ending in `Error 5, Retries exhausted`.
After these, about 100 errors of the following kind appear:

```
kernel: GEOM_ELI: g_eli_write_done() failed (error=5) da6p1.eli[WRITE(offset=279674880, length=131072)]
kernel:
kernel: g_vfs_done():da6p1.eli[WRITE(offset=279674880, length=131072)]error = 5
```

So it seems that Geli might be part of the problem, and does not bother much about device errors.

Anyway, after not being able to stop the process (or unmount), I unplugged the disk. It was all correctly detached (it even made it to syslog), and right after that the machine panicked at some "softdep_" stuff - and was not in the mood to write a coredump.

So there is no bug-hunting from the obtained data. :/

I think the primary cause is that Geli reduces the fault tolerance from the device. But another question is, how can a process receive a hard kill and still continue to obtain CPU time slices (on various cores)? I don't think I've ever seen that before...


----------



## mark_j (Mar 2, 2022)

PMc said:


> I think the primary cause is that Geli reduces the fault tolerance from the device. But another question is, how can a process receive a hard kill and still continue to obtain CPU time slices (on various cores)? I don't think I've ever seen that before...



The thing is, though, the kernel shouldn't give it an option.

I wonder if this is a shell built-in command that's running and somehow testing for X before attempting a true kill?
You know something like "If process is alive, kill it otherwise wait for it".
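For what it's worth, the builtin-vs-binary question is easy to check (a sketch; `command -V` is the POSIX sh spelling, csh users would use `which`):

```shell
command -V kill      # in a POSIX shell: "kill is a shell builtin"
test -x /bin/kill && echo "external /bin/kill exists too"
/bin/kill -l         # full path bypasses the builtin; -l lists signal names
```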


----------



## mark_j (Mar 2, 2022)

Jose said:


> I've seen this happen to processes that were accessing an NFS filesystem. It's been a long while, though. Here's something similar:


In that case it would/should show uninterruptible (D) as a process state.


----------



## grahamperrin@ (Mar 2, 2022)

PMc said:


> …bad block … `Input/output error` …



Any number of unwanted or troublesome behaviours may ensue.



mark_j said:


> … possibly is a bug. Even if the file system is junk, the driver should time out and allow the process to be killed. …



Depending on the context of an error, it's not unusual for an operating system halt to fail in response to `shutdown -p now`.

shutdown(8)

Ideally: things should be more graceful.

Realistically: it's sometimes necessary to force off the power.


----------



## PMc (Mar 2, 2022)

grahamperrin said:


> Any number of unwanted or troublesome behaviours may ensue.


In the old times this was the case: a device error would likely require a reboot.
But nowadays we have all these modular designs, pluggable and hot-pluggable components - and that should be more robust - but obviously it is not perfect/bug-free.
Also, nowadays we have much bigger installations: hundreds of subsystems, dozens of jails, lots of guests on a single physical instance - it's not always fun to crash such a piece in midflight.



mark_j said:


> I wonder if this is a shell built-in command that's running and somehow testing for X before attempting a true kill?
> You know something like "If process is alive, kill it otherwise wait for it".


It is a shell builtin, in the default root's /bin/csh. That points to /usr/src/contrib/tcsh - and I'm currently not in the mood to read all that.


----------



## grahamperrin@ (Mar 2, 2022)

PMc said:


> … not always fun …



Indeed. 

Anecdotally, from personal experience over the past two or three years, I sense that integral zpool(8) nowadays is far more likely to remain usable in problem situations. Presumably thanks to OpenZFS. 

<https://www.freebsd.org/cgi/man.cgi?query=zpool&sektion=8&manpath=FreeBSD>

<https://openzfs.github.io/openzfs-docs/man/8/zpool.8.html>


----------



## PMc (Mar 8, 2022)

What does happen when a process runs an endless loop in kernel code, within a syscall?


----------



## Jose (Mar 8, 2022)

PMc said:


> What does happen when a process runs an endless loop in kernel code, within a syscall?


I'm not sure what you mean. If the endless loop is in the kernel code, the process won't be killable. If the endless loop is in user space, it will be killable (with at most an infinitesimally small window where the kill cannot land).
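The user-space half of that is easy to demonstrate (a minimal sketch; 137 = 128 + SIGKILL(9) is the shell convention for a child killed by signal 9):

```shell
sh -c 'while :; do :; done' &   # endless loop purely in user space
pid=$!
kill -9 "$pid"
wait "$pid"                     # reap the child
echo $?                         # prints 137: the SIGKILL landed immediately
```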


----------



## PMc (Mar 8, 2022)

Jose said:


> I'm not sure what you mean. If the endless loop is in the kernel code, the process won't be killable.


I thought so, and yes, this is exactly what I mean.

As it seems, such processes will also ignore rctl/racct. Normally that would periodically stop the process to limit CPU consumption, if utilized that way.
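(For reference, a rule of the kind meant here would look roughly like this - a sketch following the `subject:subject-id:resource:action=amount` rule syntax from rctl(8); it requires `kern.racct.enable=1` at boot, and like a signal it can only take effect while the process runs in user mode:)

```shell
# FreeBSD sketch: throttle one process via the resource-limits framework.
rctl -a process:91293:pcpu:deny=50   # cap the process at 50% of one CPU
rctl -r process:91293                # remove the rule again
```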


----------

