# Difficulty determining cause of crash / shutdown



## pkc (Sep 5, 2019)

The server is a VPS and I am unable to capture the console to record what is occurring.  


```
FreeBSD s12 12.0-RELEASE-p8 FreeBSD 12.0-RELEASE-p8 r350477 GENERIC  amd64
```

What I observe is that the server shuts off unexpectedly, and does not restart. This happens a few times a week. I restart it using the VPS-related tools.

I have a default syslog configuration, and there does not appear to be any important information in /var/log/messages. However, `last` yields several lines that indicate a crash, such as:

```
root       pts/5    tmux(36290).%7         Tue Sep  3 22:16 - crash (1+04:06)
```

I have tried to set up a dumpdev a few times with no success so far. Prior to the most recent shutdown, I performed the following actions:

Set up a 10G partition as swap using `swapctl`


```
# swapon /dev/gpt/crash
# swapctl -l
Device: 1024-blocks Used:
/dev/vtbd0p4 10485760 0
/dev/gpt/crash 10485760 0
```


Set the volume up for use with dumpdev


```
# dumpon /dev/gpt/crash
# dumpon -l
gpt/crash
```


Additionally, I have these options set in rc.conf.

```
dumpdev="/dev/gpt/crash"
dumpdir="/var/crash"
```

Then checking with savecore, I have the following

```
# savecore -vC /dev/gpt/crash
checking for kernel dump on device /dev/gpt/crash
mediasize = 10737418240 bytes
sectorsize = 512 bytes
magic mismatch on last dump header on /dev/gpt/crash
No dump exists
```

Am I missing something? Should I pursue other options for investigating this?

I've noticed some similarities among the messages in `/var/log/messages` that immediately precede the boot messages (from when I boot manually). Here are a few examples. I'm not sure if any of it is relevant: the ssh authentication attempts are more or less constant, and from what I can read online the `conftest` is nothing to be worried about. I'm not sure what's causing the `syslogd: sendto: Host is down` messages, since I have no remote logging configured on the host. And there are "crashes" that do not have any of those three elements immediately present in the logs. 


```
Aug  7 01:58:40 server sshd[96240]: error: maximum authentication attempts exceeded for invalid user admin from 168.232.130.140 port 37428 ssh2 [preauth]
Aug  7 01:58:40 server syslogd: sendto: Host is down
Aug 7 01:58:49 server syslogd: last message repeated 47 times
Aug 7 01:58:49 server sshd[89736]: error: maximum authentication attempts exceeded for invalid user admin from 168.232.130.140 port 37433 ssh2 [preauth]
Aug 7 01:58:49 server syslogd: sendto: Host is down
Aug 7 01:59:04 server syslogd: last message repeated 62 times
Aug 7 01:59:04 server sshd[24121]: error: maximum authentication attempts exceeded for invalid user oracle from 168.232.130.140 port 37440 ssh2 [preauth]
Aug 7 01:59:04 server syslogd: sendto: Host is down
Aug 7 01:59:13 server syslogd: last message repeated 47 times
Aug 7 01:59:13 server sshd[75440]: error: maximum authentication attempts exceeded for invalid user oracle from 168.232.130.140 port 37450 ssh2 [preauth]
Aug 7 01:59:13 server syslogd: sendto: Host is down
Aug 7 01:59:28 server syslogd: last message repeated 62 times
Aug 7 01:59:28 server sshd[92032]: error: maximum authentication attempts exceeded for invalid user usuario from 168.232.130.140 port 37457 ssh2 [preauth]
Aug 7 01:59:28 server syslogd: sendto: Host is down
Aug 7 01:59:37 server syslogd: last message repeated 47 times
Aug 7 01:59:37 server sshd[81908]: error: maximum authentication attempts exceeded for invalid user usuario from 168.232.130.140 port 37464 ssh2 [preauth]
Aug 7 01:59:37 server syslogd: sendto: Host is down
Aug 7 01:59:54 server syslogd: last message repeated 62 times
Aug 7 01:59:54 server sshd[99613]: error: maximum authentication attempts exceeded for invalid user test from 168.232.130.140 port 37474 ssh2 [preauth]
Aug 7 01:59:54 server syslogd: sendto: Host is down
Aug 7 02:00:03 server syslogd: last message repeated 50 times
Aug 7 02:00:03 server sshd[4132]: error: maximum authentication attempts exceeded for invalid user test from 168.232.130.140 port 37481 ssh2 [preauth]
Aug 7 02:00:03 server syslogd: sendto: Host is down
Aug 7 02:00:19 server syslogd: last message repeated 62 times
Aug 7 02:00:19 server sshd[16286]: error: maximum authentication attempts exceeded for invalid user user from 168.232.130.140 port 37491 ssh2 [preauth]
Aug 7 02:00:19 server syslogd: sendto: Host is down
Aug 7 02:00:28 server syslogd: last message repeated 47 times
Aug 7 02:00:28 server sshd[45985]: error: maximum authentication attempts exceeded for invalid user user from 168.232.130.140 port 37495 ssh2 [preauth]
Aug 7 02:00:28 server syslogd: sendto: Host is down
Aug 7 02:00:44 server syslogd: last message repeated 62 times
Aug 7 02:00:44 server sshd[248]: error: maximum authentication attempts exceeded for invalid user ftpuser from 168.232.130.140 port 37507 ssh2 [preauth]
Aug 7 02:00:44 server syslogd: sendto: Host is down
Aug 7 02:00:54 server syslogd: last message repeated 47 times
Aug 7 02:00:54 server sshd[6622]: error: maximum authentication attempts exceeded for invalid user ftpuser from 168.232.130.140 port 37512 ssh2 [preauth]
Aug 7 02:00:54 server syslogd: sendto: Host is down
Aug 7 02:01:31 server syslogd: last message repeated 77 times
Aug 7 02:01:48 server syslogd: last message repeated 72 times
Aug 7 02:13:26 server syslogd: last message repeated 15 times
Aug 7 02:20:03 server syslogd: last message repeated 154 times
Aug 7 02:20:04 server kernel: pid 50991 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 7 02:20:04 server syslogd: sendto: Host is down
Aug 7 02:20:29 server syslogd: last message repeated 24 times
Aug 7 02:22:26 server syslogd: last message repeated 89 times
Aug 7 02:22:41 server syslogd: last message repeated 8 times
Aug  7 02:33:10 server syslogd: last message repeated 7 times
Aug  7 03:01:01 server syslogd: last message repeated 1 times
Aug 7 03:06:55 server syslogd: last message repeated 20 times
Aug 7 04:30:11 server syslogd: last message repeated 1 times
Aug 7 04:30:11 server syslogd: last message repeated 3 times
Aug 7 05:31:06 server syslogd: last message repeated 1 times
Aug 7 05:40:09 server syslogd: last message repeated 53 times
Aug 7 05:41:34 server kernel: pid 18858 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 7 06:00:53 server syslogd: sendto: Host is down
Aug 7 06:00:53 server syslogd: last message repeated 7 times
Aug 7 11:40:55 server sshd[89269]: error: maximum authentication attempts exceeded for root from 27.44.204.211 port 49093 ssh2 [preauth]
Aug 7 12:20:58 server syslogd: sendto: Host is down
Aug 7 12:21:26 server syslogd: last message repeated 37 times
Aug 7 12:21:29 server syslogd: last message repeated 1 times
Aug 7 12:42:23 server kernel: pid 58381 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 7 13:03:41 server syslogd: sendto: Host is down
Aug 7 13:03:41 server syslogd: last message repeated 7 times
Aug 7 13:05:32 server syslogd: last message repeated 3 times
Aug 7 13:15:24 server syslogd: last message repeated 28 times
Aug 7 13:26:27 server sshd[20435]: error: maximum authentication attempts exceeded for root from 121.231.155.36 port 45422 ssh2 [preauth]
Aug 7 13:40:43 server sshd[55975]: error: PAM: Authentication error for illegal user admin from 194.61.26.4
Aug 7 14:07:34 server syslogd: sendto: Host is down
Aug 7 14:07:34 server syslogd: last message repeated 7 times
Aug 7 14:08:51 server syslogd: last message repeated 12 times
Aug 7 14:17:52 server syslogd: last message repeated 58 times
Aug 7 14:30:28 server syslogd: last message repeated 40 times
Aug 7 14:37:09 server syslogd: last message repeated 127 times
Aug 7 14:37:14 server kernel: pid 32959 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 7 14:37:14 server syslogd: sendto: Host is down
Aug 7 14:37:21 server syslogd: last message repeated 8 times
Aug 7 14:41:12 server syslogd: last message repeated 1 times
Aug 7 14:49:03 server syslogd: last message repeated 212 times
Aug 7 15:05:08 server kernel: pid 24387 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 7 15:06:25 server syslogd: sendto: Host is down
Aug 7 15:06:55 server syslogd: last message repeated 33 times
Aug 7 15:08:59 server syslogd: last message repeated 84 times
Aug 7 15:16:52 server syslogd: last message repeated 44 times
Aug 7 15:22:43 server syslogd: last message repeated 29 times
Aug 7 15:23:27 server kernel: pid 63898 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 7 15:25:53 server syslogd: sendto: Host is down
Aug 7 15:26:34 server syslogd: last message repeated 44 times
Aug  7 15:28:02 server syslogd: last message repeated 100 times
Aug  7 15:35:04 server kernel: pid 88155 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 7 15:35:19 server syslogd: sendto: Host is down
Aug 7 15:35:51 server syslogd: last message repeated 23 times
Aug 7 15:36:07 server syslogd: last message repeated 20 times
Aug 7 15:38:43 server kernel: pid 66981 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 7 15:40:05 server syslogd: sendto: Host is down
Aug 7 15:40:24 server syslogd: last message repeated 17 times
Aug 7 15:40:55 server syslogd: last message repeated 17 times
Aug 7 15:41:08 server kernel: pid 43862 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 7 15:41:08 server syslogd: sendto: Host is down
Aug 7 15:41:14 server syslogd: last message repeated 5 times
```


```
Aug  7 15:51:34 server sshd[28347]: error: maximum authentication attempts exceeded for root from 1.58.12.38 port 58865 ssh2 [preauth]
Aug  7 15:51:35 server kernel: lo0: link state changed to UP
Aug 7 15:59:02 server kernel: pid 55391 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 7 19:33:27 server sshd[92057]: error: maximum authentication attempts exceeded for invalid user usuario from 182.113.205.136 port 52233 ssh2 [preauth]
Aug 8 00:02:17 server syslogd: sendto: Host is down
Aug 8 00:02:18 server syslogd: last message repeated 9 times
Aug 8 00:02:18 server sshd[10475]: error: maximum authentication attempts exceeded for invalid user mother from 221.231.95.221 port 63495 ssh2 [preauth]
Aug 8 00:02:18 server syslogd: sendto: Host is down
Aug 8 00:02:18 server syslogd: last message repeated 8 times
Aug 8 00:27:44 server sshd[75278]: error: maximum authentication attempts exceeded for root from 183.157.171.110 port 33791 ssh2 [preauth]
Aug 8 02:54:33 server sshd[56578]: error: maximum authentication attempts exceeded for root from 80.208.148.190 port 56265 ssh2 [preauth]
Aug 8 03:47:44 server syslogd: sendto: Host is down
Aug 8 03:48:23 server syslogd: last message repeated 13 times
Aug 8 03:48:23 server syslogd: last message repeated 8 times
Aug 8 05:17:57 server sshd[56686]: error: maximum authentication attempts exceeded for invalid user admin from 208.59.69.99 port 48072 ssh2 [preauth]
Aug 8 05:34:51 server kernel: pid 32329 (conftest), uid 65534: exited on signal 11 (core dumped)
```


```
Aug  8 14:14:05 server kernel: pid 42327 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug  8 15:29:37 server kernel: pid 60431 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 8 16:53:18 server kernel: pid 63627 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 8 16:53:33 server kernel: pid 60248 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 8 17:28:58 server kernel: pid 42987 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 8 17:48:37 server kernel: pid 74233 (cc), uid 65534: exited on signal 11 (core dumped)
Aug 8 17:59:27 server kernel: pid 72386 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 8 18:17:13 server sshd[5107]: error: maximum authentication attempts exceeded for root from 115.196.0.221 port 49504 ssh2 [preauth]
Aug 8 18:25:58 server sshd[40199]: error: PAM: Authentication error for illegal user paxmms from 194.61.26.34
Aug 8 22:20:02 server kernel: pid 95812 (conftest), uid 65534: exited on signal 11 (core dumped)
Aug 8 22:47:14 server kernel: pid 30038 (conftest), uid 65534: exited on signal 11 (core dumped)
```


----------



## ralphbsz (Sep 5, 2019)

I would begin by figuring out what conftest is, and why it crashes all the time. Signal 11 is segmentation fault, so that program either has a bug, or it is running out of memory. If it's the latter, maybe your whole machine is running out of memory.

By the way, I see at least one case where cc crashes (or runs out of memory). That is the normal C compiler. This is very weird. Are you sure your swap and memory configuration is reasonable?

And I would figure out what is wrong with syslog. That message makes me suspect that there are things that should have been logged that we can't see. Don't ignore that until you know that there is NO skeleton in that closet.


----------



## pkc (Sep 6, 2019)

True. I had not noticed the cc one. I often compile large volumes of software on this machine using jails (poudriere) but I haven't yet associated the shutdown with these jobs consistently, and they usually complete without incident. It's a machine with limited memory but I also have 10G swap configured. I can try adding some monitoring for the memory situation to see how it turns out immediately preceding a crash.

It's true that this is a machine with 2GB physical memory only and a few hundred GB zfs volumes, though I've set the max arc to around 250MB (somehow it is consistently above that figure, right now it's around 550MB). It's not meant to perform well, just make progress, but I hadn't considered that maybe it could introduce noticeable problems. I'll be sure to monitor the memory. If the system shut down because of memory issues would there be any evidence aside from the core dumps in the log? Is this usually noted as 'crash' by `last`?
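To make the "ARC is above the limit" observation easier to track, here is a minimal sketch that compares the current ARC size with the configured maximum. It assumes the FreeBSD 12 sysctl names `kstat.zfs.misc.arcstats.size` and `vfs.zfs.arc_max`; the helper name `arc_report` is made up for illustration.

```sh
#!/bin/sh
# arc_report SIZE_BYTES MAX_BYTES
# Prints both values in MiB and says whether the size exceeds the limit.
# (arc_report is a hypothetical helper, not a system tool.)
arc_report() {
    size_mib=$(( $1 / 1048576 ))
    max_mib=$(( $2 / 1048576 ))
    if [ "$1" -gt "$2" ]; then
        echo "ARC ${size_mib} MiB exceeds limit ${max_mib} MiB"
    else
        echo "ARC ${size_mib} MiB within limit ${max_mib} MiB"
    fi
}

# On the FreeBSD host itself (assumed sysctl names, see above):
#   arc_report "$(sysctl -n kstat.zfs.misc.arcstats.size)" \
#              "$(sysctl -n vfs.zfs.arc_max)"
```

Run from cron or a loop, this would show whether the overshoot is constant or only appears during builds.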


----------



## ralphbsz (Sep 6, 2019)

2GB with ZFS, with heavy workload like compiles? Syslog misconfigured? Ignoring the fact that tasks like conftest are already crashing? You're not just playing with fire, you're pouring gasoline on an already burning fire. I don't know how much adding swap helps, given that ZFS likes to pin physical memory.

I think you need to rethink your configuration and use.


----------



## pkc (Sep 6, 2019)

Well, the only misconfiguration for syslog is a few jails that have an erroneous file in syslog.d that needs to be fixed. Other than that the host and all the jails have the default configuration.
I mentioned I was not taking into consideration the conftest crashes because apparently that is exactly what the tool is supposed to do (see for example https://lists.freebsd.org/pipermail/freebsd-questions/2007-July/154070.html)

Sure the memory configuration given the workload and ZFS is a little silly, but I did not anticipate it would be anything but a performance issue. I will put in place some monitoring to see what happens with the memory.

Though I am still perplexed about the fact that there does not seem to be a core dump.


----------



## olli@ (Sep 6, 2019)

The conftest thing is harmless. When building ports, conftest may provoke SIGSEGV, so this is expected behavior.

As far as syslogd(8) is concerned: If you don't need remote logging at all, put `syslogd_flags="-ss"` in your /etc/rc.conf. It'll prevent syslogd(8) from opening any network sockets. Also make sure that you don't have any remote host entries in /etc/syslog.conf or /etc/syslog.d/*, i.e. no hostnames preceded by `@`, and no lines beginning with `+`, `-`, `#+` or `#-` (the latter are _not_ treated as comments!).
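The check described above can be roughly automated. This is only a sketch: the function name `find_remote_entries` is made up, and the pattern simply flags the line shapes listed in the previous paragraph, so anything it prints still needs a human look.

```sh
#!/bin/sh
# find_remote_entries FILE...
# Prints (as file:line:text) every line that either
#   - starts with +, -, #+ or #- (host specification lines), or
#   - is not a comment and contains an @ right after whitespace
#     (an action that sends messages to a remote host).
# Exit status is non-zero when nothing suspicious is found.
find_remote_entries() {
    grep -HnE '^#?[+-]|^[^#]*[[:space:]]@' "$@"
}
```

For example: `find_remote_entries /etc/syslog.conf /etc/syslog.d/*`, run on the host and inside each jail.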

Note that the llvm/clang compiler (cc) can use quite a huge amount of memory when compiling complex source code at high optimization levels. That _might_ become a problem if your ZFS’ ARC is also quite big, given that 2 GB RAM isn't terribly much. You might want to tune the ARC size down a little bit. (Update: I just read you've already done that. Ok.) However, running out of RAM should _not_ cause the system to crash and/or reboot. It shouldn't even cause the compiler to receive SIGSEGV (rather, it would be killed by the kernel, and this fact would be logged appropriately).


----------



## ralphbsz (Sep 6, 2019)

Interesting, didn't know about conftest deliberately segfaulting. Makes sense, although it creates scary messages.

Other than that: As Olli said, this shouldn't be happening. With 10gig of swap, compiles pretty much have to succeed (unless you are running many of them in parallel). They might start swapping and get very slow, but they should not segfault. And the way cc is failing also makes no sense. And under no circumstances should the machine crash or reboot because user processes use too much memory.

Given that we suspect that syslog isn't working right (there should be messages right before the crash), and we don't know how to fix it, here is a hacky workaround suggestion: Write a little shell script that runs vmstat every second, and saves the result in a file. If possible, put "fsync <file>" into the script too. Keep that running, and the next time the machine crashes, see whether the file is preserved, and what the memory status is right before the crash. That alone might not be good enough; it might just tell you that something is wrong, not who is causing it; if that happens, try running top in that loop (with the appropriate switches to send the output to a file, and display the things you want to see, for example sorting processes by RSS).
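A minimal sketch of the loop described above, assuming a POSIX sh. The function names `snapshot` and `monitor` and the log path in the usage note are made up for illustration. fsync(1) is used when it is available, with a fallback to sync(8) otherwise.

```sh
#!/bin/sh
# snapshot LOGFILE COMMAND [ARGS...]
# Appends one timestamped record of COMMAND's output to LOGFILE,
# then asks the OS to flush it to disk so it survives a sudden crash.
snapshot() {
    logfile=$1
    shift
    {
        date '+=== %Y-%m-%d %H:%M:%S ==='
        "$@" 2>&1
    } >> "$logfile"
    # fsync(1) ships with FreeBSD; fall back to sync(8) where it is absent.
    if command -v fsync >/dev/null 2>&1; then
        fsync "$logfile"
    else
        sync
    fi
}

# monitor LOGFILE INTERVAL COMMAND [ARGS...]
# Calls snapshot forever, sleeping INTERVAL seconds between records.
# Stop it with Ctrl-C or kill.
monitor() {
    logfile=$1
    interval=$2
    shift 2
    while :; do
        snapshot "$logfile" "$@"
        sleep "$interval"
    done
}
```

Something like `monitor /var/tmp/memwatch.log 1 vmstat` would give the once-per-second record. For the per-process variant, `monitor /var/tmp/topwatch.log 5 top -b -o res 10` should work, assuming FreeBSD's top(1) batch mode and its `res` sort key. The log grows without bound, so rotate or truncate it now and then.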


----------



## pkc (Sep 6, 2019)

I will be unable to tend to this server or discuss the issue for several days, and I will see if it lasts without any large build jobs. I will also install some monitoring, and I will report back with my findings. Thanks


----------



## pkc (Oct 1, 2019)

Hello again. I did perform some monitoring and as we expected there does not seem to be a memory issue just before the crash.

What I continue to be puzzled by, however, is that I cannot recover any crash dump (see first post). It's the first time I've done this so I'm not sure if I did anything incorrectly, but I believe it's all correct. Is there a scenario where `last` may show "crash", but where this crash is not an event that produces a crash dump?


----------



## SirDice (Oct 1, 2019)

pkc said:


> Is there a scenario where `last` may show "crash", but where this crash is not an event that produces a crash dump?


There are several scenarios I can think of. The most obvious one is the (disk) controller driver crashing. But in that case nothing would be logged at all (unless it's remote) as it simply cannot access the disks any more.


----------

