# How does one investigate failure to boot?



## zeRusski (Mar 26, 2021)

I'll start with a brief outline of my adventure installing FreeBSD on a Dell R730 server that led to this thread. Here's how it went today:
- got a new (refurb really) PowerEdge R730 server,
- tweak some obvious settings in BIOS,
- update every firmware in the system to Dell's latest and greatest via Dell's Lifecycle Controller,
- that included its PERC H730 RAID controller (uses LSI SAS3108 chip really),
- all updates went smoothly,
- plug in USB stick with BSD image,
- connect to remote console via iDRAC - essentially KVM over wire IIUC, so everything is done via HTML5 console as if I had a monitor and keyboard plugged into that server,
- all goes well until it doesn't - stuff flashes past - good luck reading that - until something bad enough happens that simply sends the system into an immediate reboot without any sort of warning or grace and you're staring at the BIOS startup screen etc.

Q1:
First thought. WTF just happened and how do I even begin investigating this? I can't possibly read this fast. So here's question number 1. In this scenario is there a way to recover `dmesg` or some kind of print out? Is it being written to that USB flash drive or anywhere at all? Is there a way to write it somewhere I can later review?

I thought of nothing better to do than to try and screen-capture something while it flashes past.
I got lucky. Twice. Here's that screencap:





First time lucky simply because I managed to capture that. Second - because I spotted that `mfi` mentioned there and I knew that was the RAID controller driver from my past adventures with R720 and its PERC H710 controller. Which leads us to ...

Q2:
Being lucky like this sucks. I'll take it this time, but it may not happen tomorrow. How would you go about (a) retracing your steps and reproducing what happened - that's essentially my Q1 above (b) go about solution when you've no clue what any of that stuff on the screen means?

Well. I knew roughly what `mfi` was about, so naturally that RAID controller was our prime suspect. Some `man` reading and googling later I arrive at mrsas(4). Turns out my chip and that card should in fact be using that driver, but for whatever reason mfi(4) takes priority unless you override that. So, at boot I drop into the loader prompt:

```
OK set mrsas_load="YES"
OK set hw.mfi.mrsas_enable="1"
OK boot
```
Bingo, we hit the jackpot and we boot - all is well with the world.

We install the system, drop into a shell and write `mrsas_load="YES"` to /boot/loader.conf and `hw.mfi.mrsas_enable="1"` to /boot/device.hints, reboot and confirm everything works.

Q3:
That behavior of ... just wiping state and rebooting seems incredibly abnormal. IMO it should only ever be allowed to happen when hardware actually fails. So, this driver somehow faulted ... but surely it shouldn't flat out abort mission and leave me staring at the BIOS load screen? Is this normal, expected behavior or are we looking at a genuine bug?

Dear oldhats. What do you do when stuff like this happens to you?
Thank you


----------



## PMc (Mar 26, 2021)

Ideally, your machine should have a serial interface, COM1 or such.
Ideally, you would switch console to serial where you can log the whole story.
You can also introduce the kernel debugger etc. (there are manpages on all this stuff). The kernel debugger would then engage at your panic, and you can look into all the CPU registers etc.

In fact, I never did that. I always did it the way you do it here: see what we've got, apply some logical thinking, and solve the problem.

Another common practice is called "minimal config": exclude all of the optional hardware until we have only the minimum necessary, and so try to figure out which component introduces the failure.


----------



## zeRusski (Mar 26, 2021)

PMc said:


> Ideally, your machine should have a serial interface, COM1 or such.
> Ideally, you would switch console to serial where you can log the whole story.


I've seen that mentioned in various places but every single time the writer assumed the reader would totally know what they were talking about and I still have no idea what that serial console is and what I can do with it.


----------



## zeRusski (Mar 26, 2021)

PMc said:


> You can also introduce the kernel debugger etc. (there are manpages on all this stuff). The kernel debugger would then engage at your panic, and you can look into all the CPU registers etc.


That was going to be a follow-up question, but I didn't want to distract and derail the conversation. I'd totally want to know how to attach the debugger (e.g. when you are not on the same machine but remote like that) and investigate for reals. If anyone could teach me, that'd be rad. Scientific approach and step debugging trump luck ... or at least they should.


----------



## PMc (Mar 26, 2021)

zeRusski said:


> I've seen that mentioned in various places but every single time the writer assumed the reader would totally know what they were talking about and I still have no idea what that serial console is and what I can do with it.


Well, a serial console is a serial console. Question: what knowledge are you missing? Do you know what a modem is? What a nullmodem is? What a terminal is? That traditionally a Unix machine would allow login not only on the console (and via network), but also on a number of terminals attached to the serial ports (directly wired or with modems and telephone dialup)?

The whole point of a serial console is that you move the boot console away from keyboard+monitor and onto one of the serial interfaces. Obviously you therefore need another end to connect to, either a terminal (not very common anymore today) or another computer/laptop/whatever.
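As a rough illustration of what "moving the console to serial" can look like in practice, here is a minimal sketch. The loader variables are the ones commonly shown in the FreeBSD Handbook's serial console section; the speed (115200), the device name `/dev/cuaU0` (a USB serial adapter on a second FreeBSD machine) and the log file name are assumptions to adjust, and `serial_console_conf` and `capture_console` are hypothetical helper names, not existing tools.

```shell
# Lines for /boot/loader.conf on the server: send the boot console to the
# first serial port in addition to the video console.
# Assumption: 115200 baud; match whatever the BIOS/iDRAC side is set to.
serial_console_conf() {
    cat <<'EOF'
boot_multicons="YES"
boot_serial="YES"
comconsole_speed="115200"
console="comconsole,vidconsole"
EOF
}

# On the second machine (the "other end" of the null-modem cable): attach to
# the serial line with cu(1) and keep a copy of everything that scrolls past.
# Device, speed and log file are placeholders; CU can be overridden for testing.
capture_console() {
    dev=${1:-/dev/cuaU0}
    speed=${2:-115200}
    log=${3:-boot.log}
    "${CU:-cu}" -l "$dev" -s "$speed" | tee -a "$log"
}

# Example (run by hand, as root where needed):
#   serial_console_conf >> /boot/loader.conf     # on the server
#   capture_console /dev/cuaU0 115200 boot.log   # on the laptop; leave cu with ~.
```

The log file then holds the complete boot output, including whatever was printed just before a panic, so nothing has to be read off a flickering screen.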

BTW: do you have physical access to that Dell machine, or is it in some compute center somewhere else in the world (which would make things a bit more difficult)?

And another question out of personal curiosity: on which OS do you run the browser that would give you an iDRAC console? (I didn't get that working with current Firefox on FreeBSD - I can replay the boot logs, but cannot get it to work interactively.)


----------



## zeRusski (Mar 26, 2021)

PMc said:


> there are manpages on all this stuff


do you remember which mans, please?



PMc said:


> Well, a serial console is a serial console. Question: what knowledge are you missing? Do you know what a modem is? What a nullmodem is? What a terminal is? That traditionally a Unix machine would allow login not only on the console (and via network), but also on a number of terminals attached to the serial ports (directly wired or with modems and telephone dialup)?


I realize I may not even know what a console is anymore. To me it's always been what I described: user interaction between me and the machine - a "chess"-like turn-based game. Would a "not serial" console even be useful? Like, a "concurrent" console - now that would be a mess. Isn't every console serial? A modem is this box that sends bips and bops over a telephone wire. I don't recall what a nullmodem is exactly but I seem to remember you could enter those modem AT codes by hand - is that what you mean? I've no idea what a terminal is. Seriously, it means many things in many fields. Now, with the serial port - we are getting somewhere. That I think is an actual port I can see in the back of my server. So IIUC there may be a way to redirect boot output to that port or something that stands for it. How do I do this? Remotely? Do I communicate Telnet-style or something? So, yeah, essentially how do I capture that output? What are the steps to take? I have more questions than when we started.


----------



## Mjölnir (Mar 26, 2021)

Guys, not so fast please. 1st of your issues:

The BeaSD keeps the boot messages in /var/run/dmesg.boot.  So you can always `less /var/run/dmesg.boot`.
The periodic(8) stuff keeps `/var/log/dmesg.{to,yester}day`
Now I'll read on where I left off in your 1st post, hopefully I have more nonsense to comment on.  I like that, it's my beloved hobby...


----------



## George (Mar 26, 2021)

zeRusski said:


> but surely it shouldn't flat out abort mission and leave me staring at BIOS load screen?


That screenshot looks like the kernel debugger, kdb, not BIOS. It was invoked due to a kernel panic. kdb then caused the reboot. You could try the following sysctls:

```
debug.debugger_on_panic: 0
kern.panic_reboot_wait_time: -1
```
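A minimal sketch of how these two settings might be applied, assuming both sysctls are also accepted as loader tunables on your release (worth checking before relying on it). `panic_hold_tunables` is a hypothetical helper name; the loader commands in the comments are meant to be typed by hand at the boot loader prompt, which is the only option when the panic happens while booting the installer.

```shell
# Print loader.conf-style lines that ask a panicked kernel to stay put
# instead of rebooting: skip the debugger and wait for a key press.
# Assumption: both names are honoured as loader tunables on your release.
panic_hold_tunables() {
    cat <<'EOF'
debug.debugger_on_panic="0"
kern.panic_reboot_wait_time="-1"
EOF
}

# One-off, at the loader prompt before the kernel starts (typed by hand):
#   set debug.debugger_on_panic=0
#   set kern.panic_reboot_wait_time=-1
#   boot
#
# Persistent, on an installed system (as root):
#   panic_hold_tunables >> /boot/loader.conf
#
# At runtime, on a system that is already up (as root):
#   sysctl debug.debugger_on_panic=0
#   sysctl kern.panic_reboot_wait_time=-1
```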

There should be a crash dump according to the Handbook, Chapter 10, Kernel Debugging:


> After rebooting, your system should save a dump in /var/crash along with a matching summary from crashinfo(8).


It's unfortunate that this happened.


----------



## SirDice (Mar 26, 2021)

Yep, before you start hooking up remote kernel debuggers (which are definitely cool, don't get me wrong), have it stop when it encounters a panic(9), not reboot it instantly. That alone will give you more time to actually look at the dump on screen. As you already found out you can tell quite a lot from it (you recognized mfi(4) and prior experience already provided you with a possible solution). If that doesn't provide enough clues you can take a look at the crash dump. But it requires quite a bit of kernel knowledge to make sense of it. That crash dump can certainly be useful in the hands of a seasoned kernel developer. 
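To see whether a previous boot or panic left anything behind to read, a small sketch along these lines can help. It only looks in the locations mentioned in this thread (/var/run/dmesg.boot and /var/crash); `list_boot_evidence` is a hypothetical helper name, and whether a dump gets written at all depends on a dump device being configured, which will not be the case when the panic happens while booting the installer.

```shell
# List the boot log and any crash dump artifacts that exist.
# The optional argument is an alternate root (e.g. a mounted disk);
# empty means the running system.
list_boot_evidence() {
    root=${1:-}
    for f in "$root/var/run/dmesg.boot" \
             "$root"/var/crash/info.* \
             "$root"/var/crash/core.txt.* \
             "$root"/var/crash/vmcore.*; do
        # Unmatched globs stay literal and are skipped by the existence check.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
    return 0
}

# Example:
#   list_boot_evidence          # running system
#   list_boot_evidence /mnt     # disk mounted under /mnt
```

The `core.txt.*` files are the crashinfo(8) summaries and are the easiest place to start, since they are plain text.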




zeRusski said:


> drop into shell and write `mrsas_load="YES"` to /boot/loader.conf


No need to load it; mrsas(4) is built into the GENERIC kernel.


----------



## Mjölnir (Mar 26, 2021)

/boot/device.hints gets overwritten on updates (device.hints(5)). Thus, IIUC, put any overrides in loader.conf(5), or in /boot/loader.conf.local if you manage a bunch of machines where the shared settings go into *.conf and the machine-specific ones into *.local.
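A minimal sketch of keeping such an override in loader.conf instead, assuming the tunable from the first post (`hw.mfi.mrsas_enable`) is honoured there the same way as in device.hints. `set_loader_var` is a hypothetical helper, written so that running it twice does not leave duplicate lines.

```shell
# Set KEY="VALUE" in a loader.conf-style file, replacing any earlier
# assignment of the same key so the file never carries duplicates.
set_loader_var() {
    file=$1
    key=$2
    value=$3
    tmp="${file}.tmp.$$"
    if [ -f "$file" ]; then
        # Keep every line that does not start with "KEY=".
        awk -v k="${key}=" 'index($0, k) != 1' "$file" > "$tmp"
    else
        : > "$tmp"
    fi
    printf '%s="%s"\n' "$key" "$value" >> "$tmp"
    mv "$tmp" "$file"
}

# Example (as root):
#   set_loader_var /boot/loader.conf hw.mfi.mrsas_enable 1
# or, for machine-specific settings:
#   set_loader_var /boot/loader.conf.local hw.mfi.mrsas_enable 1
```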


----------



## zirias@ (Mar 26, 2021)

Mjölnir said:


> /boot/device.hints gets overwritten on updates


Oh, really? mergemaster(8) offers to merge it, doesn't freebsd-update(8) have a similar mechanism?


----------



## Mjölnir (Mar 26, 2021)

Zirias said:


> Oh, really? mergemaster(8) offers to merge it, doesn't freebsd-update(8) have a similar mechanism?


Hmm.  Good question.  By default, yes: `fgrep device.hints /usr/src/usr.sbin/freebsd-update/freebsd-update.conf`

```
MergeChanges /etc/ /boot/device.hints
```
More or less, it "belongs" to the kernel; i.e. it is installed with the kernel when you build your own kernel version.  IIRC it gets overwritten by `make installkernel`.  That's non-trivial to verify, because the Makefiles are deeply nested _ad ultimo_.


----------



## zirias@ (Mar 27, 2021)

Mjölnir said:


> Hmm.  Good question.  By default, yes: `fgrep device.hints /usr/src/usr.sbin/freebsd-update/freebsd-update.conf`


But that's the answer, freebsd-update(8) should *not* just overwrite it (full snippet):

```
# When upgrading to a new FreeBSD release, files which match MergeChanges
# will have any local changes merged into the version from the new release.
MergeChanges /etc/ /boot/device.hints
```
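For checking other files the same way, here is a small sketch that reads the `MergeChanges` line(s) from a freebsd-update.conf and reports whether a given path falls under one of the listed entries. It treats each entry as a plain path prefix, which is a simplification of how freebsd-update(8) matches them, and `merge_covers` is a hypothetical helper name.

```shell
# Succeed if PATH starts with one of the entries on a MergeChanges line
# in CONF. Assumption: entries are treated as literal path prefixes.
merge_covers() {
    conf=$1
    path=$2
    for prefix in $(awk '$1 == "MergeChanges" { for (i = 2; i <= NF; i++) print $i }' "$conf"); do
        case $path in
            "$prefix"*) return 0 ;;
        esac
    done
    return 1
}

# Example:
#   merge_covers /etc/freebsd-update.conf /boot/device.hints && echo "merged, not overwritten"
```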



Mjölnir said:


> More or less, it "belongs" to the kernel; i.e. it is installed with the kernel when you build your own kernel version.  IIRC it gets overwritten by `make installkernel`.  That's non-trivial to verify, because the Makefiles are deeply nested until _ultimo_.


No, it was never touched here by either `installworld` or `installkernel` targets. I need a change to it on my server to have the console on COM2, as this is strangely the one on my board that's actually wired – so I would notice immediately.

I _assume_ it would be written by the `distribution` target.


----------



## PMc (Mar 27, 2021)

zeRusski said:


> do you remember which mans, please?


try `apropos console` for a start.


----------



## PMc (Mar 27, 2021)

Mjölnir said:


> Guys, not so fast please. 1st of your issues:
> 
> The BeaSD keeps the boot messages in /var/run/dmesg.boot.  So you can always `less /var/run/dmesg.boot`.


Nope, that appears only after successfully going multiuser (done in /etc/rc.d/dmesg).



Mjölnir said:


> The periodic(8) stuff keeps `/var/log/dmesg.{to,yester}day`


never seen that one (but I have a dmidecode.{to,yester}day).


----------



## neilulrich (Mar 28, 2021)

Only up for 1 second? Check power, maybe reduce load?


----------



## tingo (Mar 28, 2021)

Also, in this specific case, the use of a systems management console (iDRAC) _might_ give you access to: system management logs, console output logs, virtual serial console, virtual media (usb, cd-rom, ??) and so on. Time to learn how to use what you have access to.
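One way the virtual serial console might be captured remotely is IPMI Serial-over-LAN with ipmitool. This is only a sketch under several assumptions: IPMI over LAN is enabled on the iDRAC, serial redirection is switched on in the BIOS, FreeBSD's console is actually going to the redirected serial port, and ipmitool is installed on the machine you work from. The host, user and log file are placeholders, and `sol_capture` is a hypothetical helper name.

```shell
# Attach to the BMC's Serial-over-LAN session and log everything seen.
# -I lanplus selects the IPMI v2 interface, -a prompts for the password.
# IPMITOOL can be overridden for testing.
sol_capture() {
    host=$1
    user=$2
    log=${3:-sol.log}
    "${IPMITOOL:-ipmitool}" -I lanplus -H "$host" -U "$user" -a sol activate | tee -a "$log"
}

# Example (placeholders for address and account):
#   sol_capture idrac.example.net root r730-boot.log
# Leave the session with ~. when done.
```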


----------

