# Post the conditions under which your kernel crashes and why?



## Alain De Vos (Nov 6, 2022)

Sometimes a kernel can crash.
It is interesting to know when & why. Maybe there is a line to find?


----------



## zirias@ (Nov 6, 2022)

My (*-RELEASE) kernel doesn't crash 

The closest I've seen to a crash was a deadlock in the vfs which forced me to reboot the machine...


----------



## cmoerz (Nov 6, 2022)

I recently happened to involuntarily "reproduce" this crash on 13.1-p2 - https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263505
Admittedly, not anything one happens to do on a daily basis. I just happened to try adding a vlan to wlan0 and got a complete system freeze on the first try and a reboot on the second.

The handbook even has a full chapter on the topic of debug dumps (https://docs.freebsd.org/en/books/developers-handbook/kerneldebug/).

In my instance, I didn't investigate much further, since by accident I had seen that bug on Bugzilla a couple of days before. I suppose crashinfo(8) could have helped troubleshoot, though, and might be a pointer for Alain De Vos?
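For anyone wanting to follow that handbook chapter, the basic setup boils down to a couple of rc.conf knobs. A sketch with the default paths; see dumpon(8), savecore(8) and crashinfo(8) for the details:

```sh
# /etc/rc.conf -- enable kernel crash dumps
dumpdev="AUTO"          # dump to the configured swap device on panic
dumpdir="/var/crash"    # where savecore(8) writes vmcore.N after reboot

# After the next panic and reboot, summarize the newest dump:
#   crashinfo -d /var/crash    # produces core.txt.N with a kernel backtrace
```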


----------



## eternal_noob (Nov 6, 2022)

Alain De Vos said:


> Maybe there is a line to find ?


What line do you expect to find? Something like

```
if (user == "alain" && isFullMoon())
    crash(E_CRASH_TYPE::HARD);
```
?


----------



## Alain De Vos (Nov 6, 2022)

_View: https://www.youtube.com/watch?v=5IaYm3NjJnM_


----------



## jbo (Nov 6, 2022)

Alain De Vos said:


> Sometimes a kernel can crash.
> It interesting to know when & why ? Maybe there is a line to find ?


I have only ever experienced a crash once in my life. FreeBSD is a rock-solid beast.
This was the incident, still "unsolved": https://forums.freebsd.org/threads/debugging-crash.82415/


----------



## Alain De Vos (Nov 6, 2022)

I'm able to crash the kernel under the following conditions:
- Run Wayland
- Build ports simultaneously with poudriere
- Play a YouTube video with the falkon browser
Sometimes the kernel crashes ...


----------



## ProphetOfDoom (Nov 6, 2022)

Apparently the myth that wolves hunting humans is a myth is a myth.


----------



## ProphetOfDoom (Nov 6, 2022)

Perhaps more interesting is the Miscellaneous Beasts:

> The Wolves of Paris
>
> I have always thought that France was fairly unlucky as a country to have been ravaged over the centuries by various Beasts, the majority of which nobody has been able to identify with 100% certain…
>
> (johnknifton.com)


----------



## ProphetOfDoom (Nov 6, 2022)

This one looks rather jolly.


----------



## drhowarddrfine (Nov 6, 2022)

Just want to ditto what others have said. In 20 years I have never had a kernel crash. 
Next weird question, please.


----------



## rootbert (Nov 6, 2022)

Using a ggated device as a zpool vdev crashes my host under moderate load, but I haven't tried this config since 13.0.


----------



## cmoerz (Nov 7, 2022)

Since this is starting to drift off topic, shouldn't we also add a "nice try, NSA, but start looking elsewhere for zero day entry points?"


----------



## larshenrikoern (Nov 7, 2022)

Alain De Vos said:


> I'm able to crash the kernel under the following conditions:
> - Run Wayland
> - Build ports simultaneously with poudriere
> - Play a youtube video with falkon browser
> Sometimes the kernel crashes ...


Thank you. I have many crashes, especially when building ports. No one has ever said before that building in parallel causes this. I will try with just one build at a time   

For all those of you who have never seen a crash: try some new hardware. My i7-12700K with 64GB ram on 4 sticks and an ASUS board. Lots of problems = a lot to learn and understand.


----------



## _martin (Nov 7, 2022)

eternal_noob said:


> if (user == "alain" && isFullMoon()) crash(E_CRASH_TYPE::HARD);



We had a customer who had cascaded NFS mounts in his environment. Dealing with issues there was a nightmare, especially on prod, where you really didn't want to lose access to those shares. We had to call my colleague, who fixed it, as for some reason "his commands" worked. We did the same thing he did but it didn't work.
We were laughing that those commands and kernel modules have a uid check.


----------



## drhowarddrfine (Nov 7, 2022)

larshenrikoern said:


> For all those of you who has never seen a crash: Try some new hardware.


In 20 years I have never had a kernel crash on new hardware.


----------



## jbo (Nov 7, 2022)

larshenrikoern said:


> For all those of you who has never seen a crash: Try some new hardware. My i7-12700K with 64GB ram on 4 sticks and an ASUS board. Lot of problems = a lot to learn and understand.


Within the context of the FreeBSD community known to me, I tend to be the guy with the "latest hardware" more often than not. In fact, I usually get a lot of comments along the lines of "just wait a year".

But nevertheless: not a single FreeBSD kernel crash. Not once. Not on desktop hardware, not on server hardware, not on laptops...

That being said, the Alder Lake architecture does seem to need some work to make it work as intended on FreeBSD. However, looking at how different Alder Lake is from "everything we've seen before" (in recent years), I wouldn't exactly use that as an argument in this particular conversation.


----------



## larshenrikoern (Nov 7, 2022)

So Alain De Vos and I are almost the only ones experiencing kernel crashes.


----------



## ralphbsz (Nov 8, 2022)

I looked at my logs for all system administration; they are online since 2012:

20120602: Running smartmon on the Seagate disk causes the kernel to hang for one minute, but then comes back to life. Repeatable, ended up making sure that the Seagate disk is never touched by SMART.
20121227: System spontaneously crashed at 2am, when "zpool scrub" starts on a disk connected by USB-3. Cause unknown. No disk or USB error messages on the console. Did not reboot, system remained down until we came back from Christmas vacation and I power cycled it.
20130105: Crashed by pulling out a PCI card without shutting down the system first. Oops. Don't do that.
20131207: I had been plugging in SD cards in mass production to read photos from old cameras; after a few hours of that, the system disk (!) caused a crash with error message "ata3: timeout waiting to issue command" and "ata3: error issuing READ_DMA48 command" and "(ada0:ata3:0:0:0): lost device". Without the boot/root disk ada0, the system stayed up (kernel running), but was completely unusable.
20140203: System hung with "ata0: already running", at 3am. No crash messages, but dead as a doornail. Power-cycled in the morning.
20140212: Around 1PM, machine crashed or hung.  Would not reboot.  Some disk making strange noises. Eventually, tracked it down to a Seagate disk that when plugged in causes absolutely nothing to work (not even the BIOS), while making noises like a grinder. Threw the disk in the trash and bought a Hitachi.
20140223: Somewhere between 2am and 5am, machine crashed hard. Symptoms: No network, no reaction from the keyboard (not even scroll lock, caps lock, or ctrl-alt-del), disk light on solid.  Rebooted, works now.
20140319: System crashed, bottom of screen said "rebooting", then came back, worked fine. Unknown cause.
20140331: System crashed around 2:40am. Managed to destroy the BIOS settings (!), which I had to manually restore. Now the printer doesn't work any more (it's connected via parallel port).
20161013: System crashed this morning exactly at 7AM.  No idea why; /var/log/messages says "Fatal trap 12: page fault while in kernel mode" and nothing else. Probably caused by backup trying to run.
20170106: Inadvertent crash due to messing up power cords while replacing UPS battery.
20170110: Running a user-mode program that uses POSIX aio...() calls by the thousand causes a kernel crash. Since that is an unrealistic use case, ignored it.
20171215: I deleted /var/crash/vmcore.0 (was using too much disk space), but then the next reboot crashed immediately and created a new core file. The next reboot succeeded. Not reproducible.
20190720: Computer became unresponsive; running services were still running, but no login possible, and no new processes could be started. Eventually ended up rebooting by cutting power. Afterwards, found that I had forgotten to set kern.kstack_pages=4 in loader.conf (which is needed for ZFS on low-memory machines), so probably this was a self-inflicted injury.
20200316: Kernel crash, message on console "ada1 already running". After reboot, one of the disks is missing. Had to open the machine and reseat power cables. Found an enormous amount of dust in the machine.
To explain things: This machine used to have two Seagate Barracudas, which are crap and ticking timebombs. At night, starting at 1am, I run maintenance jobs, such as "zpool scrub", big backups, and log file moving, so the disks are busiest in the middle of the night. The system is protected against power outages with a UPS, so uptime is usually months.

What do I conclude from the above? Nearly all problems are caused by disk interfaces, which are perfectly capable of bringing the system down, both in hardware and in kernel space. Having built large disk systems, this is not inherently a question of bugs in the kernel, but typically lower-level things (firmware in the HBAs, miscommunication between disk/HBA/kernel developers) that blow everything up. Another handful are user errors, including really dumb things like pulling the power cord when you didn't intend to. But buried in the above ~15 crashes in 10 years are at least two real kernel bugs that caused crashes, under extreme workload. I think of all of those, only one (abuse of aio... calls) was even reproducible.

In addition, there were about 50-100 crashes of vitally important user space programs, such as my backup, my equipment monitoring, apache, and Berkeley DB. So I can gratuitously round as follows: Of the ~100 outages in ~10 years, 80% are caused by userspace programs, 18% by the disk IO system below the kernel or the wetware doing something dumb, and 2% by the kernel itself.


----------



## cy@ (Nov 8, 2022)

zirias@ said:


> My (*-RELEASE) kernel doesn't crash
> 
> The closest I've seen to a crash was a deadlock in the vfs which forced me to reboot the machine...


Deadlocks are worse than panics. At least with a panic the box reboots without having to go downstairs to punch the reset button. Panics also provide register dumps, which are much easier to debug because we're walking the stack through a backtrace. A deadlock (or, as we in the IBM mainframe world called it, a deadly embrace) leaves you with little or no information, and its source could have occurred seconds, minutes or even hours previously.

When I was an MVS systems programmer, a subsidiary of ours did development on a production mainframe -- the company made that decision because they didn't want to spend the millions on another mainframe. The developers had a small bug in their kernel code -- they were writing software to support a hardware vendor's to-be-announced tape robot.

The IBM mainframe has no stack, so the standard was to dynamically allocate (GETMAIN, aka malloc()) a 72-byte save area to save registers. Just prior to return() the function would restore the caller's registers and free (using FREEMAIN, aka free()) the 72-byte save area. Save areas form a linked list, giving you something like a stack without a stack (no such thing as a stack exploit on the IBM mainframe), and save areas are stored on what we in the UNIX world call the heap.

Well, with this bug their code freed 144 bytes, two save areas' worth of memory, while in kernel state. The problem didn't bite us until the next morning when people started to log in. The kernel would GETMAIN (malloc()) free memory which was in fact still in use, but considered freed by memory management in the kernel because 144 bytes were freed instead of the intended 72. This caused a deadlock because the freed memory held linked lists used to manage page tables (DAT -- google dynamic address translation). The affected control blocks maintained a lock which would be taken using a spin loop (a compare-and-swap instruction -- we have those same instructions on Intel and ARM). To make a long story short, the CPU went into a tight compare-and-swap spin loop waiting for a resource that no longer existed, because it had been overwritten by something else: memory management believed the memory was free because it had been erroneously freed (144 bytes instead of 72) many (up to 8) hours before by the kernel modules the developers were testing.

In UNIX, when we issue free() all we need to pass is an address, because malloc(3) in userspace and malloc(9) in the kernel store the length along with the address; when free() frees the memory, the address alone is enough and memory management does all the rest. On the mainframe the programmer passes to FREEMAIN not only the address of the memory to be freed but also the length. This is significantly more dangerous, because one little mistake like the above can lead to a deadlock. The programmer must keep all this minutiae in mind while programming (in assembler), especially when doing work in the kernel.
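The hazard can be sketched with a toy allocator in C (purely illustrative, nothing like real MVS internals) where, as with FREEMAIN, the caller supplies the length to free. Freeing 144 bytes instead of 72 puts the neighbouring, still-live save area on the free list, and a later allocation silently aliases it:

```c
#include <stddef.h>
#include <string.h>

#define ARENA_SIZE 1024
static unsigned char arena[ARENA_SIZE];
static size_t bump = 0;

/* One-slot "free list": offset and length of the most recently freed block. */
static size_t free_off, free_len;
static int have_free = 0;

/* getmain: allocate len bytes, preferring freed storage (returns an offset). */
static size_t getmain(size_t len) {
    if (have_free && free_len >= len) {
        size_t off = free_off;
        free_off += len;
        free_len -= len;
        if (free_len == 0)
            have_free = 0;
        return off;
    }
    size_t off = bump;
    bump += len;
    return off;
}

/* freemain: like FREEMAIN, the CALLER supplies the length to free. */
static void freemain(size_t off, size_t len) {
    free_off = off;
    free_len = len;
    have_free = 1;
}

/* Returns 1 if the buggy free let a new allocation clobber live data. */
int demo(void) {
    size_t a = getmain(72);
    size_t b = getmain(72);          /* b sits right after a */
    memset(arena + b, 0xBB, 72);     /* b holds live data */
    freemain(a, 144);                /* BUG: frees a AND b (72 bytes too many) */
    size_t c = getmain(72);          /* reuses a's storage -- fine */
    size_t d = getmain(72);          /* reuses b's storage, which is still live */
    (void)c;
    memset(arena + d, 0x00, 72);     /* the "new" block wipes b */
    return arena[b] == 0x00;
}
```

A real kernel hits the same aliasing, except the clobbered bytes are lock words and page-table lists -- hence the spin-loop hang instead of a clean crash.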

This deadlock took weeks of sifting through kernel dumps to find the root cause. Fortunately the subsidiary's code included some eye-catchers in the data it wrote to memory. Eventually we had to simply ask. They looked through their code and, embarrassed, fixed the bug. But only after the company had lost a significant amount of money due to performance penalties.

Deadlocks are hard to diagnose, debug, and fix. I'd rather have a panic with a register dump telling me, "there's the problem, go fix it."

Ever since my ports bit was upgraded to include src, I run -CURRENT everywhere. Each of my machines has an alternate boot partition with a FreeBSD-CURRENT I can fall back to in case anything goes horribly wrong, plus an external USB boot disk for the extremely rare circumstance when nothing else will work. -CURRENT has been stable for me, and I've never needed the alternate boot partitions for recovery except when I've shot myself in the foot -- my own doing and my own fault. The last time my -CURRENT panicked it was, again, my fault, playing around with some half-baked thing.

My uptimes are generally low since I installworld/installkernel quite regularly.

FreeBSD is simply rock solid and stable. Even -CURRENT.


----------



## zirias@ (Nov 8, 2022)

cy@ said:


> Deadlocks are worse than panics.


Of course they are; I didn't qualify anything here. Just saying my (RELEASE) kernel never crashed in many years. 11-CURRENT panicked a few times on me; newer -CURRENT versions indeed only when I messed up something myself.



cy@ said:


> At least with a panic the box reboots without having to go downstairs to punch the reset button.


Not for me, as I use GELI on my private server and need to type the passphrase on the serial console  – but:



cy@ said:


> Panics also provide register dumps which are much easier to debug, because we're walking the stack through a backtrace. The source of the deadlock could have occurred seconds, minutes or even hours previously. While with a deadlock (or as we in the IBM mainframe world called them deadly embrace) leave you with little or no information.


This is certainly true. I didn't even bother trying to find out what happened. Somehow, the vfs subsystem seemed to be in a deadlocked state, so disk I/O was dead; network socket I/O still worked (which is useless when you need the disk for any command typed over SSH).

Fortunately, it only happened twice, while my server was running 13.0. It seems to be gone in 13.1.


----------



## Crivens (Nov 8, 2022)

Ignoring the crashes when writing device drivers, I have not seen real trouble since the 7.x days, when USB support was pretty sketchy: unplugging a mounted stick would crash the kernel, and unmounting XFS drives would cause a crash.


----------



## covacat (Nov 8, 2022)

11.x or 10.x bombed when ppp disconnected (when removing tunX routes); moved to mpd5.


----------



## larshenrikoern (Nov 8, 2022)

The point is not whether FreeBSD is stable or not. It for sure is, when it is. It is also about configuration and hardware, and how prone FreeBSD is to those things. Right now my machine has become much more stable. Partly because of Alain De Vos's mention of poudriere crashing the system when building multiple packages at a time. THANK YOU sir. Since setting it to one build at a time, no crashes. Before that it happened around every hour or two. And I have now set it to build everything in RAM, so it is actually also faster.

And configuring the BIOS on my system with 4 sticks of DDR4 RAM: setting the speed manually did make the system stable. With XMP, or left unset (so auto 2133), it was unstable; setting it explicitly is needed. So I have now enabled the P-cores on my i7-12700K, and I have been compiling and using the system for 12 hours without a crash. That is a record for this setup. I will update my post about Alder Lake (in the Install forum thread) if it is still running stable in a couple of days.


----------



## drhowarddrfine (Nov 9, 2022)

If poudriere is making the kernel crash, this sounds like someone needs to file a PR. But no one has, as far as I know, which makes me suspicious as to where the problem actually lies.

Fiddling with speed settings to make a system stable tells me the system isn't stable, which has nothing to do with FreeBSD. What were the settings beforehand? Were they changed and then the system started crashing? And now one is blaming FreeBSD's kernel?

Inquiring minds want to know. Or not. Fiddling with clock timing requires informed minds. When I designed motherboards, a lot of thought and testing was put into what would work reliably, and you didn't touch what I put in there. Any changes were just guesswork on the part of the user (not that you could, because it wasn't an option).


----------



## ralphbsz (Nov 9, 2022)

About the poudriere crash issue: My educated guess is that this is not a kernel or software problem, but a hardware problem. For example, the RAM in the machine might be marginal, and it is not ECC. Under normal light usage, it works fine. Under extreme load, it overheats, or it uses too much power and voltages drop, or errors that are normally rare start piling up. Eventually, memory errors corrupt kernel data structures, and the kernel can't do anything other than crash. Why are kernel data structures such a good target? Because all free memory is usually used for file system buffer cache, and the data structures that describe that buffer cache are the biggest target for memory corruption.

In the mid 90s, lots of people built cheap i386 and i486 systems and ran Linux on them. There was a whole cottage industry of white-box computer assemblers. Many of these systems were cheaply built, with little quality control. Most ran just fine under Windows 3.1 or DOS, but had problems under the more intense workloads that Linux could put on them. Memory tests typically didn't find the problem (they were too simple-minded, and didn't stress the rest of the system like disks, which use power too), so the best stress test for the system was doing Linux kernel compiles. I used to run them overnight on my system, and if it survived from midnight to 7am without a kernel crash, it was good enough to use in production.

This is one of the reasons I swear by ECC memory ... except this is a case of "do as I say, not as I do": My little server at home does not have ECC. Shame on me.


----------



## larshenrikoern (Nov 9, 2022)

drhowarddrfine said:


> If poudriere is making the kernel crash, this sounds like someone needs to file a PR. But no one has, as far as I know, which makes me suspicious as to where the problem actually lies.
> 
> Fiddling with speed settings to make a system stable tells me the system isn't stable which has nothing to do with FreeBSD.  What were the settings beforehand? Were they changed and then the system started crashing? And now one is blaming FreeBSD's kernel?
> 
> Inquiring minds want to know. Or not. Fiddling with clock timing requires informed minds. When I designed motherboards, a lot of thought and testing was put into what would work reliably and you didn't touch what I put in there. Any changes were just guesswork on the part of the user (not that you could cause it wasn't an option).


I used the stock settings (without XMP enabled) and the system kept crashing. Disabled the E-cores (Alder Lake). Still crashing. ASUS by default enables "Intel adaptive boost", which is not supported by my CPU. Disabled that. XMP was unstable even with 2T enabled. Most of the time this was without using any sort of FreeBSD power-saving features (if enabled, it crashed again). But with the RAM speed set manually and 2T enabled it seems much more stable. Only time will tell. The main issue is that there is more than one problem at a time, and part of that is FreeBSD's incomplete support of Alder Lake at this time.


----------



## larshenrikoern (Nov 9, 2022)

Hi again

I did check the RAM modules. It seems the firm where I ordered the computer took the chance to use two separate RAM kits, total four slots, and not one kit with four modules together. I looked at the manufacturer's homepage. I found what looks like the same kit with four modules running at 2666 MHz and 1.2 V, instead of the 3200 MHz 1.35 V kit.

So I have the choice of running with only two modules, or four modules at lower speed. I am trying out the latter first, using manual settings. Let's see how it goes.


----------



## ralphbsz (Nov 9, 2022)

larshenrikoern said:


> I found what looks like the same kit with four modules running at 2666 MHz and 1.2 V, instead of the 3200 MHz 1.35 V kit.


And this is the kind of real-world problem that causes crashes. Much more common than software bugs in the kernel.

Anecdote: My first "IBM PC" at home (meaning x86 architecture machine) was in about 1992, and was an AMD 386-40. So a full 32-bit machine, with 4 M of RAM, and running very fast: At the time Intel only had CPUs up to 33 MHz, and by going with AMD instead, I got a significant speed boost at no additional cost. A while later (see below), I needed to add an FPU, both because I was doing floating-point intensive physics calculations, and because I was using X Windows, which does font rendering using floating point. The problem was that getting FPUs in 40 MHz was really hard; there was only one manufacturer (Cyrix), and they were rarely used, since most people using computers in that price range (a few thousand $) didn't need numeric performance.

This was before the internet (instead, local stores had phonebook-sized catalogs), and I spent a few afternoons going from computer store to computer store, and nobody had the Cyrix 387-40 in stock. Eventually I ended up at a small Chinese-owned computer store in a back alley of Palo Alto. When I asked about that FPU, the owner (and only employee) opened his desk drawer, and pulled out a chip: No antistatic protection, pins a little bit bent, but it did say that it was a Cyrix 387-40 on it. I asked how much it would be, and he didn't really know, so I handed him $20 in cash, and he was happy. My thinking was: This chip is very unlikely to work, given that it has been stored in a desk drawer, but it's cheap.

Turned out it worked perfectly. That machine continued functioning for about 6 or 7 years, and was my main workhorse at home. After a while, I added a second computer (a 486-25), so that both my wife and I could use X at the same time (before that, one of us had to log in via a serial-connected VT200 terminal and work in text mode).

The other funny anecdote was the OS I ran on it. Initially, I had wanted to run BSD, but it hadn't been ported to 386 yet (Bill Jolitz' 386BSD wasn't available yet, it came a year later). But BSDi was selling a version of BSD that ran on 386 at the time, for about $1000. The only problem was: I really needed graphics (for data analysis). And while BSDi was working on porting X Windows to their OS, they only supported one video card (the Tseng ET4000) which was de facto unavailable; and their X Windows port was known to be so broken that you could de facto not run it. I didn't feel like spending several hundred $ on a rare video card plus $1000 on the OS, just to end up with a train wreck. But then, this crazy toy OS named "Linux" became available, and while it had no graphics yet, at least CLI mode was rumored to work, and it had a working C compiler (Fortran I could do myself, by using AT&T's f2c). So I downloaded the ~30 floppies of the SLS or Slackware distribution, and got it to work. When I bought the second computer, I initially connected them via PLIP (special cable to go printer port to printer port), since ISA bus Ethernet cards were still very expensive. And a few months later, the X Windows port on Linux started working, and by 1994 Linux was widely supported in the science community, and the rest is history. I only went back to the *BSDs in the early 2000s, when I decided that Linux was just too sloppy and insecure for a dedicated firewall machine.


----------



## drhowarddrfine (Nov 10, 2022)

ralphbsz said:


> My thinking was: This chip is very unlikely to work, given that it has been stored in a desk drawer



Back in the day, a technician fingered some of those new-fangled CMOS chips sitting on my bench.
Technician: "Are those some of those new-fangled CMOS chips I've been hearing about?"
Me: "They were!"


----------



## unitrunker (Nov 10, 2022)

I had crashes after installing an experimental SD card reader driver. After a few iterations it stopped crashing, and the driver eventually became part of 13.0-RELEASE.


----------



## msplsh (Nov 10, 2022)

Crashes when I plug in my Areca ARC-1320 with the cables plugged in on only one end.


----------



## drhowarddrfine (Nov 10, 2022)

It crashes for me when I open the cabinet and wiggle the plugin cards, power cables, and things like disk cables.


----------



## yaslam (Dec 8, 2022)

My trash laptop with a Celeron and 4GB of RAM ran out of memory when compiling stuff and froze; I had to restart it.


----------



## cmoerz (Dec 9, 2022)

larshenrikoern said:


> complete support of alderlake


Can second that. For all the stable years before it, switching to Alder Lake has brought with it a long-forgotten instability and constant fear of crashes. I have the same experience on OpenBSD, though, so I don't think it's purely a FreeBSD thing.
A lot of the crashes I'm seeing recently are due to inteldrm, so I'm not holding my breath that this is going to be a smooth ride any time soon on FreeBSD either.

Intel's recent introduction of performance and efficiency cores has brought a whole new bag of computer science problems which will take a few years to get settled -- i.e. what load to run on which cores under which circumstances, and how to switch between them whenever factors shift.

So I expect things are going to get worse before they get any better.


----------



## jbo (Dec 9, 2022)

cmoerz said:


> So I expect things are going to get worse before they get any better.


Exactly the reason why I decided to purchase a new 11th-gen i7 laptop last year. This way I still got a reasonable hardware upgrade but also don't have to deal with all this pain until somebody else fixes it for us.


----------

