# System OOM killing my shell but top almost empty



## DaveQB (Apr 25, 2020)

Hi all,

I've been using Unix systems for 15 years now and can generally find my answers by searching, but my FreeBSD experience is only a few years, and I cannot figure this one out. I could reboot the server to solve this, but I want to learn more. I will increase the swap space once I get this sorted, but first I want to learn how to troubleshoot this better.
FreeBSD 12.1

The system has killed off most processes, as you can see. When I try to log in, it is a race to run something (like top) before my shell is killed with an "out of swap" message. Even my login process can be killed between entering the username and the password. I cannot see in top what is hogging all of the memory.

I want to run something like this: `ps -eo pmem,pid,pcpu,rss,vsz,time,args | sort -k 1 -r`, but I am unable to type it fast enough.
I did just manage to get this info:

Any suggestions appreciated.

Thank you.


----------



## T-Daemon (Apr 25, 2020)

DaveQB said:


> I want to run something like this: `ps -eo pmem,pid,pcpu,rss,vsz,time,args | sort -k 1 -r`, but I am unable to type it fast enough.


You could write an sh(1) script which dumps the output to a file (or, if a suitable mail system is available, sends a mail), then scp(1) it to a remote system (images #1 and #2 look like an ssh session). Keep the names of the script and the dump file short (one character should suffice), to shorten the time needed to type the command that runs the script and, if you cannot scp the dump to the ssh client system, to read it in place.
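A minimal sketch of such a script (file names and paths are illustrative, not from this thread; the ps field list is the one DaveQB wanted to run):

```sh
#!/bin/sh
# Dump a memory-sorted process list to a file. Keep this script's
# name short (e.g. /root/m) so it can be typed before the shell
# is killed; the output path below is illustrative.
OUT=/var/tmp/m.out
{
    date
    # processes sorted by %MEM, highest first
    ps axo pmem,pid,pcpu,rss,vsz,time,args | sort -rn -k1
} >> "$OUT"
```

Appending (`>>`) rather than truncating means repeated runs accumulate samples, which helps if the shell only survives long enough to run it once per login.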


----------



## George (Apr 25, 2020)

Or type `swapinfo`, or use an alias.

Or add `ps... >/var/log/mylog.txt` to your ~/.login file.


----------



## gpw928 (Apr 26, 2020)

The solution depends on exactly when the problem is happening.

Assuming it occurs only after you login, then create and test a script to gather the evidence you need on another system if you can.  That will shorten the debug cycle on the problem system.

Then boot single user mode, and mount the file systems you need writable.  

Then install the script in your $HOME/.profile (or whatever profile your shell uses) so that it's executed automatically every time you login.  

Reboot multi-user.


----------



## shkhln (Apr 26, 2020)

DaveQB said:


> I cannot see, in top, what is hogging all of the memory.



You mean "1783M Wired" is not clear enough?


----------



## DaveQB (Apr 26, 2020)

T-Daemon said:


> You could write an sh(1) script which dumps the output to a file (or, if a suitable mail system is available, sends a mail), then scp(1) it to a remote system (images #1 and #2 look like an ssh session). Keep the names of the script and the dump file short (one character should suffice), to shorten the time needed to type the command that runs the script and, if you cannot scp the dump to the ssh client system, to read it in place.


Thanks. I did think of this, but didn't even bother trying, as I only have seconds to run something, which is not enough time to scp. Like I said, the login is being killed between entering a username and a password.

Nah, sshd is killed and won't start. It is an HP iLO console I am using to access it.

Thanks.


----------



## DaveQB (Apr 26, 2020)

gpw928 said:


> The solution depends on exactly when the problem is happening.
> 
> Assuming it occurs only after you login, then create and test a script to gather the evidence you need on another system if you can.  That will shorten the debug cycle on the problem system.
> 
> ...



Memory is full, full time, so even the login prompt is having its process killed, as seen in one of the screenshots. OK, I'll try that, but I doubt an scp command would complete before being killed. What I am more interested in is what to run to debug this (applications in memory-consumption order), even though the output of top and ps says there are only about 10-13 processes running, with minimal memory usage.

Once I reboot, the issue will be no more. Like I said in my post "I can reboot the server to solve this, but I want to learn more."

Thanks.


----------



## DaveQB (Apr 26, 2020)

shkhln said:


> You mean "1783M Wired" is not clear enough?



No. "1783M Wired" doesn't tell me which process(es) are hogging the memory. Could you please explain further? There aren't enough characters in "1783M Wired" to list all of the processes, as far as I can see, unless it is some sort of steganography I am not understanding.

Thanks though.


----------



## AngryChris (Apr 27, 2020)

DaveQB said:


> No. "1783M Wired" doesn't tell me which process(es) are hogging the memory. Could you please explain further? There aren't enough characters in "1783M Wired" to list all of the processes, as far as I can see, unless it is some sort of steganography I am not understanding.
> 
> Thanks though.


FreeBSD Wiki - Memory - while wired memory can represent mlock'ed application memory, it's pretty much "what the kernel is consuming". pyret is probably correct if you're using ZFS.


----------



## gpw928 (Apr 27, 2020)

If your system is this badly compromised, all you can do is plan what to do next time and reboot.

If you are lucky, there might be something informative in /var/log/messages.

If the system is running out of memory as soon as it boots, then you can still install diagnostics in single user mode to be run when you boot.

So install this script as /usr/local/etc/rc.d/memwatch.  Make it executable and adjust the diagnostic commands to your needs:

```
#!/bin/sh

# PROVIDE: memwatch
# REQUIRE: DAEMON
# BEFORE:  LOGIN
# KEYWORD: shutdown

PASSES=600      # run for ~60 seconds (600 passes x 0.1 s sleep) from early boot

n=0
while [ $n -lt $PASSES ]
do
    (
        date
        ps mauxwwwwwwww
        top -b -H -w 10
        echo ====
    ) >>/var/tmp/mylog
    sync
    sleep 0.1
    n=$((n+1))
done &
```

Leave the log on the local disk.  Boot single user.  Add swap space.  Reboot.  Examine the log.  Delete/disable the diagnostic.
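For the "add swap space" step, a handbook-style sketch of adding a swap file (the 1 GB size and the /usr/swap0 path are illustrative, not from this thread):

```
# Create and protect the backing file first (as root):
#   dd if=/dev/zero of=/usr/swap0 bs=1m count=1024
#   chmod 0600 /usr/swap0
# Then add this line to /etc/fstab and run `swapon -aL`:
md99	none	swap	sw,file=/usr/swap0,late	0	0
```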


----------



## gpw928 (Apr 27, 2020)

When you can't rely on network applications being available when you need them, I would format a file system on a USB stick, mount it somewhere suitable from /etc/fstab, and write your diagnostics to that.

If the situation warranted, you could even mount /var/log on the USB stick, and collect all the system logs.


----------



## DaveQB (Apr 27, 2020)

pyret said:


> Contents of the ARC and the buffer cache are wired.  Do you have zfs stats installed?


I do. Good idea. I'll look at that.


----------



## DaveQB (Apr 27, 2020)

AngryChris said:


> FreeBSD Wiki - Memory - while wired memory can represent mlock'ed application memory, it's pretty much "what the kernel is consuming". pyret is probably correct if you're using ZFS.



Right. So ZFS memory consumption won't show in top? That might well explain it. This being a backup server, it does a lot of ZFS work (send/recv/syncoid).

Thanks for the link. I'll take a read.


----------



## DaveQB (Apr 27, 2020)

gpw928 said:


> If your system is compromised all you can do is plan what to do next time and reboot.
> 
> If you are lucky, there might be something informative in /var/log/messages.
> 
> ...



Brilliant script idea. Thanks for that. A great tool to add to the sysadmin toolbox.


----------



## DaveQB (Apr 27, 2020)

gpw928 said:


> When you can't rely on network applications to be working when you need them, I would format a file system on a USB stick, mount it somewhere suitable form /etc/fstab, and write your diagnostics to that.
> 
> If the situation warranted, you could even mount /var/log on the USB stick, and collect all the system logs.



Sure. Could do.

I am more interested in:

- How to list processes by memory usage
- Why everything so far shows little memory use by the few remaining processes
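For the first point, a hedged sketch: a ps one-liner that works on both FreeBSD and Linux, listing processes by resident set size, largest first (on FreeBSD, `ps axm -o ...` would also sort by memory usage):

```sh
# List processes by resident set size (RSS, in KB), largest first.
ps axo rss,pid,comm | sort -rn | head -15
```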


----------



## DaveQB (Apr 27, 2020)

AngryChris said:


> FreeBSD Wiki - Memory



Oh yes. I remember reading this FreeBSD Wiki page a few years ago. Good to refresh the memory though.
Thanks.


----------



## gpw928 (Apr 28, 2020)

It looks to me like you have 2GB real memory and 767MB swap.  That's pretty skinny for a ZFS server.

Are you using de-duplication (it's a memory hog)?

There are lots of ZFS threads on the forums.


----------



## Alain De Vos (Apr 28, 2020)

A shell getting killed is frequently a sign of running out of memory.
Try specifying a sensible value for vfs.zfs.arc_max.


----------



## rootbert (Apr 28, 2020)

DaveQB said:


> Sure. Could do.
> 
> I am more interested in:
> 
> ...


`top -o res` should provide info, and of course the ARC info in the upper part of top's display. As mentioned before, wired memory includes ZFS's ARC cache (but not only that). You could limit the ARC to 512MB like this: `echo "vfs.zfs.arc_max=536870912" >> /boot/loader.conf` and reboot.


----------



## PMc (Apr 29, 2020)

1140M ARC lines up well with 1780M wired, but it does not line up with 12M free or 35M compressed. That ARC is unable to shrink. Could be something similar to the max-vnode issue.


----------



## DaveQB (Apr 29, 2020)

gpw928 said:


> It looks to me like you have 2GB real memory and 767MB swap.  That's pretty skinny for a ZFS server.
> 
> Are you using de-duplication (it's a memory hog)?
> 
> There's lots of ZFS threads on the list.



It is a backup server: syncoid sends some ZFS snapshots over to this box, which then uploads to Backblaze. No dedup; I have learnt "never use it". I wish there were a file-level dedup; it could have been more memory friendly.



Alain De Vos said:


> A shell getting killed is frequently a sign of running out of memory.
> Try specifying a sensible value for vfs.zfs.arc_max.



Thanks.



rootbert said:


> `top -o res` should provide info, and of course the ARC info in the upper part of top's display. As mentioned before, wired memory includes ZFS's ARC cache (but not only that). You could limit the ARC to 512MB like this: `echo "vfs.zfs.arc_max=536870912" >> /boot/loader.conf` and reboot.



Thanks, I think I will. I did manually sort top by RES, and there is not much memory in use, which is why I came here to discuss it. It must be the ARC, but as noted, the ARC is only using 1140MB.



PMc said:


> 1140M ARC lines up well with 1780M wired, but it does not line up with 12M free or 35M compressed. That ARC is unable to shrink. Could be something similar to the max-vnode issue.



Yes. This is my confusion.


----------



## PMc (May 1, 2020)

DaveQB said:


> Yes. This is my confusion.


You could give it a try: find out what `sysctl kern.maxvnodes` is. If it's around 100000, reduce it to 40000 (and kern.minvnodes to about a quarter of that). And anyway, `sysctl kstat.zfs | grep evictable` might be interesting.


----------



## DaveQB (May 3, 2020)

Thanks everyone for your input. I am rebooting it now, and I'll reduce the ARC; that should prevent this from happening again. It's still odd that the ARC wasn't what was filling up all of the RAM (1140M ARC on a 2GB RAM system).


----------

