# Machine rebooting because of bad disk in array?



## maketo (Dec 18, 2021)

My main dev machine runs FreeBSD 13, with an SSD as the main drive (OS install etc.) and 4x3TB Western Digital SATA drives in a RAID10-like backup array (using gmirror/gstripe, instructions here: https://forums.freebsd.org/threads/...e-of-two-raid1-mirrors-on-freebsd-10-1.51277/).

Yesterday I was working on something and the machine just rebooted out of nowhere. The boot then failed, complaining about one of the drives in the RAID array, so I unplugged them all and rebooted. I started plugging them back in one at a time, and when I reached the last one, the machine rebooted again. I can reproduce the reboot every time I plug that drive in.

All other things aside, I would expect a robust, 21st-century OS, in almost 2022, NOT to reboot when a non-critical drive fails. Has anyone experienced this?

The kernel crash dump is useless since the system complains about a "gdb inconsistency", so I cannot even figure out what actually caused the crash. What can I do to debug this? Thanks!

Attached is some gmirror output (datastore01 is missing the bad drive, obviously):

```
> gmirror list
Geom name: datastore01
State: DEGRADED
Components: 2
Balance: load
Slice: 4096
Flags: NONE
GenID: 0
SyncID: 4
ID: 897743119
Type: AUTOMATIC
Providers:
1. Name: mirror/datastore01
   Mediasize: 3000592940544 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r0w0e0
Consumers:
1. Name: ada1p1
   Mediasize: 3000592941056 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   State: ACTIVE
   Priority: 0
   Flags: (null)
   GenID: 0
   SyncID: 4
   ID: 1580905567

Geom name: datastore02
State: DEGRADED
Components: 2
Balance: load
Slice: 4096
Flags: NONE
GenID: 1
SyncID: 10
ID: 2286728413
Type: AUTOMATIC
Providers:
1. Name: mirror/datastore02
   Mediasize: 3000592940544 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w0e0
Consumers:
1. Name: ada2p1
   Mediasize: 3000592941056 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   State: SYNCHRONIZING
   Priority: 0
   Flags: DIRTY, SYNCHRONIZING
   GenID: 1
   SyncID: 10
   BytesSynced: 641964441600
   Synchronized: 21%
   ID: 3498563448
2. Name: ada3p1
   Mediasize: 3000592941056 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   State: ACTIVE
   Priority: 1
   Flags: (null)
   GenID: 1
   SyncID: 10
   ID: 2269059054
```


----------



## ralphbsz (Dec 18, 2021)

Part 1: The problem is the OS itself. In that case, you have found a bug, since a failure of a non-root disk should not cause a reboot (failure of the root disk may cause a halt / crash / reboot, since without a root disk the system cannot make forward progress). Debug the reboot, and open a PR.

Part 2: The problem is not the OS. It could easily be a problem with the motherboard or power supply, in which case the best OS in the world can't help. As an example, a few years ago one of the data disks on my home server failed. The configuration was similar: the server uses a small SSD as a boot/root disk and several large spinning drives as data disks, all connected by SATA. The disk had failed so thoroughly that with it plugged into a SATA port the system would not boot (it would not even get to the BIOS screens), and plugging it in after the system had booted would cause everything to stop immediately.

So my suggestion would be: Figure out the root cause of the problem. If it is hardware (disk, motherboard, memory, power supply, etc.), fix it. If hardware has been excluded, then create a crash dump and either diagnose or open a PR.
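
If no usable dump is being written yet, a minimal sketch of the relevant `/etc/rc.conf` settings follows. Both variables are described in rc.conf(5); the values are the usual ones and are an assumption here, not something read off this machine. The swap device also has to be large enough to hold the dump.

```shell
# /etc/rc.conf fragment (rc.conf uses sh variable syntax)
dumpdev="AUTO"        # let the system pick a suitable swap device for kernel dumps
dumpdir="/var/crash"  # directory where savecore(8) stores vmcore.N on the next boot
```

After the next panic, savecore(8) should copy the dump from swap into that directory during boot.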


----------



## _martin (Dec 18, 2021)

Even if the kernel dump is inconsistent, maybe core.txt.$n can shed some more light. Can you check its contents? FYI: you need to use kgdb(1) when examining the dumps, not gdb(1).
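
To make that concrete, here is a small sketch for locating the newest crash summary and opening the matching dump. It assumes the default dump directory `/var/crash` and the stock kernel path `/boot/kernel/kernel`; `latest_core_txt` is a helper name made up for this example, not a FreeBSD tool.

```shell
# Print the core.txt.N file with the highest N in a crash directory.
# Hypothetical helper; the directory defaults to /var/crash.
latest_core_txt() {
    dir=${1:-/var/crash}
    best=""
    best_n=-1
    for f in "$dir"/core.txt.*; do
        [ -e "$f" ] || continue                      # glob matched nothing
        n=${f##*.}                                   # text after the last dot
        case $n in ''|*[!0-9]*) continue ;; esac     # skip non-numeric suffixes
        if [ "$n" -gt "$best_n" ]; then
            best_n=$n
            best=$f
        fi
    done
    [ -n "$best" ] || return 1
    printf '%s\n' "$best"
}

# Typical use on the affected machine (run as root; shown as comments
# because the paths only exist after a crash):
#   less "$(latest_core_txt)"
#   kgdb /boot/kernel/kernel /var/crash/vmcore.0
# On recent releases kgdb may have to be installed from the devel/gdb package.
```

The panic message and backtrace near the top of core.txt.N are usually the most useful part to post here.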


----------



## grahamperrin@ (Dec 23, 2021)

maketo said:


> SSD for a main drive (OS install etc.)



`tunefs -p /`
`pkg -vv | grep -e url -e enabled`
`freebsd-version -kru`
`uname -aKU`


----------



## mer (Dec 23, 2021)

In datastore01/02, the devices all report 2.7T but have different byte sizes. I've always preferred to gpart things to the same size so you don't run into manufacturers' differing "sizes".
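
For reference, the only size difference visible in the posted listing is between the consumers and the mirror providers, and it can be checked with plain arithmetic. The numbers below are copied from the `gmirror list` output above. Reading the gap as gmirror's own metadata is my interpretation, based on gmirror(8) stating that metadata is kept in the provider's last sector.

```shell
# Values copied from the gmirror list output earlier in the thread.
consumer_bytes=3000592941056   # Mediasize of ada1p1, ada2p1 and ada3p1
provider_bytes=3000592940544   # Mediasize of mirror/datastore01 and mirror/datastore02
sector_bytes=512               # Sectorsize reported for every entry

diff_bytes=$((consumer_bytes - provider_bytes))
echo "difference: $diff_bytes bytes = $((diff_bytes / sector_bytes)) sector(s)"
```

So the three listed partitions are the same size as each other, and each mirror is one sector smaller than its components.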

That said, I agree with ralphbsz: check the power.
What if you plug the drives in in a different order? Does the reboot always follow the last drive plugged in, or does it really follow a specific drive? If it truly follows a specific drive, then suspect the cables and the drive. Try a different cable, try a different power connector.
Check the temperatures, and blow all the dust out.
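
One way to tell "last port plugged in" apart from "one specific drive" is to note each drive's serial number before moving cables around. The sketch below pulls the device name, serial (the `ident` field) and model (`descr`) out of `geom disk list` output. `disk_serials` is a made-up helper name, and the field layout is assumed to match what geom(8) prints on a stock system.

```shell
# Read `geom disk list` output on stdin and print one tab-separated line
# per disk: device name, serial number (ident), model (descr).
# Hypothetical helper for this thread, not a FreeBSD tool.
disk_serials() {
    awk '
        function emit() {
            if (name != "") printf "%s\t%s\t%s\n", name, ident, descr
        }
        /^Geom name:/   { emit(); name = $3; ident = "(none)"; descr = ""; next }
        $1 == "descr:"  { sub(/^[ \t]*descr:[ \t]*/, ""); descr = $0; next }
        $1 == "ident:"  { ident = $2; next }
        END             { emit() }
    '
}

# Typical use on the affected machine:
#   geom disk list | disk_serials
```

Running it once with the suspect drive unplugged and comparing the list against the labels on the drives shows which serial number is the one that triggers the reboot, whatever port or cable it ends up on.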


----------

