# ECC memory errors always begin at 16 mins past boot



## rowan194 (Aug 25, 2020)

Xeon E5-2630L v2 with 4 x 32GB LRDIMM. There's some kind of incompatibility (buggy BIOS perhaps?) which causes memory errors to start spewing out whenever the 32GB sticks are used. 16GB sticks work fine.

Here's an example of the error:


```
Aug 25 11:25:44 test kernel: MCA: Bank 7, Status 0xcc0009c000010090
Aug 25 11:25:44 test kernel: MCA: Global Cap 0x0000000001000c17, Status 0x0000000000000000
Aug 25 11:25:44 test kernel: MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 0
Aug 25 11:25:44 test kernel: MCA: CPU 0 COR (39) OVER RD channel 0 memory error
Aug 25 11:25:44 test kernel: MCA: Address 0xba909040
Aug 25 11:25:44 test kernel: MCA: Misc 0x1400e8e86
```
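As a rough cross-check, the raw Status value can be decoded by hand and compared with the kernel's one-line summary ("COR (39) OVER RD channel 0 memory error"). The sketch below assumes the architectural IA32_MCi_STATUS bit layout and the memory-controller error code format as described in the Intel SDM. The function name `decode_mca_status` is made up for illustration and is not part of FreeBSD.

```python
# Assumed layout of IA32_MCi_STATUS (Intel SDM, machine-check architecture):
#   bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC
#   bits 52:38 corrected error count, bits 15:0 MCA error code
# Assumed memory-controller error code format: 0000 0000 1MMM CCCC
REQUESTS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_mca_status(status: int) -> dict:
    """Decode a few fields of a machine-check bank status register."""
    code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_error_code": code,
    }
    # Memory-controller errors have bit 7 set and bits 15:8 clear.
    if (code & 0xFF80) == 0x0080:
        fields["request"] = REQUESTS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        fields["channel"] = None if channel == 0xF else channel
    return fields


decoded = decode_mca_status(0xCC0009C000010090)
for name, value in decoded.items():
    print(f"{name}: {value}")
```

For the value in the log this gives a corrected (not uncorrected) read error on channel 0 with the overflow bit set and a corrected error count of 39, which agrees with the kernel's summary line.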

The odd thing is that the errors begin at almost exactly the same point after boot. I measured from the last log entry at boot to the first log entry reporting a memory error, and repeated this several times. With the exception of the first run, all failures begin within a few seconds of each other... all just after the 16 minute mark. The machine is idle.

1. 16:25 (first memory error reported 16 min 25 sec after boot)
2. 16:08
3. 16:12
4. 16:10
5. 16:12

(In hindsight, measuring from the *first* log entry at boot may have been less variable.)
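To put a number on how tightly the runs cluster, the five delays listed above can be converted to seconds. This is only arithmetic on the figures already given; the helper name `to_seconds` is illustrative.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'MM:SS' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Delays between the last boot log entry and the first MCA error, as listed above.
delays = ["16:25", "16:08", "16:12", "16:10", "16:12"]
secs = [to_seconds(d) for d in delays]

# The first run is the outlier, so summarize the remaining four separately.
later = secs[1:]
spread_all = max(secs) - min(secs)
spread_later = max(later) - min(later)
mean_later = sum(later) / len(later)

print(f"all runs: {secs}, spread {spread_all} s")
print(f"runs 2-5: spread {spread_later} s, mean {mean_later} s")
```

Runs 2 to 5 fall within 4 seconds of each other, at a mean of 970.5 seconds after the last boot log entry.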

The same thing happens when I boot to single user. Although I don't have a precise timestamp from written logs, I watched the console near the 16 minute point, and shortly after that the MCA errors started.

By comparison, I booted a Linux kernel from USB, and the memory errors started immediately.

So... what happens for the first 16 minutes after FreeBSD boot that "prevents" (or possibly ignores) memory errors?!

FreeBSD 12.1-RELEASE, fresh install.


----------



## ralphbsz (Aug 25, 2020)

Plausible explanation: The way that physical memory is used is different in different OSes. Here's a plausible but hypothetical scenario: You get ECC errors when addressing memory in the range of physical addresses 0xABCD0000...0xABCDFFFF. Linux uses physical memory in a scattershot fashion, a little bit here, a little bit there, so it is very likely that this range will be hit soon. FreeBSD uses physical memory starting from the bottom going up, and it takes about 16 minutes under your normal workload to get there. And yes, I know that the physical memory map is much more complex than just these hypothetical examples, but you get the idea.
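The scenario can be illustrated with a toy model. This is purely a sketch of the hypothesis above, not of how either kernel actually allocates memory: the page counts, the position of the bad range, and the function name `first_hit` are all invented for illustration.

```python
import random


def first_hit(touch_order, bad_pages) -> int:
    """Return how many pages are touched before (and including) the first bad one."""
    for count, page in enumerate(touch_order, start=1):
        if page in bad_pages:
            return count
    raise ValueError("bad range never touched")


TOTAL_PAGES = 1000            # hypothetical amount of physical memory, in pages
BAD_PAGES = range(600, 610)   # hypothetical faulty range

# "Bottom going up": pages are touched strictly in address order.
bottom_up = first_hit(range(TOTAL_PAGES), BAD_PAGES)

# "Scattershot": pages are touched in a random order (seeded for repeatability).
rng = random.Random(0)
scattered_order = list(range(TOTAL_PAGES))
rng.shuffle(scattered_order)
scattered = first_hit(scattered_order, BAD_PAGES)

print(f"bottom-up hits the bad range after {bottom_up} touches")
print(f"scattered hits the bad range after {scattered} touches")
```

In this model the bottom-up order always reaches the bad range after the same number of touches, which would give a repeatable delay only if the workload touched memory at a steady rate, while the random order finds it after a variable and usually much smaller number.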


----------



## rowan194 (Aug 25, 2020)

ralphbsz said:


> FreeBSD uses physical memory starting from the bottom going up, and it takes about 16 minutes under your normal workload to get there.



I wish it was that simple. The tests above were done with an idle machine - at most, a remote login and some viewing of the logs.

I did a fresh boot with some attempts at giving the memory a bit of a workout (*make buildworld* and *find / -type f -exec cat {} > /dev/null ;*), and the errors started right on cue, at 16 minutes and 11 seconds...

EDIT: To be clear, there's obviously some serious issue with this hardware, and it's not related to FreeBSD. I'm just really curious why there's the 16 or so minute delay with FreeBSD.


----------



## mark_j (Aug 25, 2020)

Does this server have IPMI? If so, ipmitool might help. Does each failure show the same address?


----------



## rowan194 (Aug 27, 2020)

No IPMI support that I can see.

It gets stranger...

Based on the "*Memory Subsystem Bandwidth*" section of https://frankdenneman.nl/2015/02/19/memory-deep-dive-memory-subsystem-bandwidth/ , I think a fully populated board with large capacity DIMMs requires downclocking of the memory speed. (I would expect the BIOS to handle this, but based on the 2012 copyright date it's possible it was last updated before 32GB DDR3 DIMMs were widely available.) If I manually lock the memory at DDR3-1066 (normal speed 1600), then it works without any errors.
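For a sense of what locking the memory at DDR3-1066 costs, the theoretical peak bandwidth per channel is the transfer rate multiplied by the bus width. The sketch below assumes the standard 64-bit (8-byte) DDR3 data bus and uses the nominal figures 1066 and 1600 MT/s from the post; real-world throughput will be lower.

```python
BUS_BYTES = 8  # assumed: standard 64-bit DDR3 data bus per channel


def peak_mb_per_s(mt_per_s: int) -> int:
    """Theoretical peak bandwidth of one memory channel, in MB/s."""
    return mt_per_s * BUS_BYTES


fast = peak_mb_per_s(1600)
slow = peak_mb_per_s(1066)
loss = 1 - slow / fast

print(f"DDR3-1600: {fast} MB/s per channel")
print(f"DDR3-1066: {slow} MB/s per channel")
print(f"peak bandwidth lost by downclocking: {loss:.0%}")
```

So the DDR3-1066 workaround gives up roughly a third of the theoretical per-channel bandwidth compared with running at 1600.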

But after removing a couple of DIMMs, and noting that 2x32GB works fine at the normal 1600 speed, now that I'm back to 4x32GB, everything is working perfectly at 1600?! I have the full 128GB available, at the maximum speed the CPU supports (confirmed 1600 MT/s with *dmidecode -t 17*), and after more than 12 hours of continuous kernel compiles it's still chugging away without a single error. Cautiously optimistic that there was some incompatible setting stored in the BIOS NVRAM (there was a different brand of 32GB DIMMs on this board previously, which also repeatedly reported ECC errors), and moving to 2 DIMMs triggered an update of those settings.
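The *dmidecode -t 17* check can be scripted to list the size and speed of every populated slot. The sketch below parses the `Key: Value` lines found under each "Memory Device" heading in dmidecode's text output. The sample text is a made-up fragment for illustration, not output from this machine, and field names or units may differ between dmidecode versions.

```python
def parse_memory_devices(text: str) -> list[dict]:
    """Parse `dmidecode -t 17` text output into one dict per populated DIMM slot."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped:
            current = None  # a blank line ends the current record
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    # Empty slots report "No Module Installed" as their size.
    return [d for d in devices if d.get("Size") not in (None, "No Module Installed")]


# Hypothetical sample input (illustrative only).
SAMPLE = """\
Handle 0x0040, DMI type 17, 34 bytes
Memory Device
\tSize: 32 GB
\tLocator: DIMM_A1
\tSpeed: 1600 MT/s

Handle 0x0041, DMI type 17, 34 bytes
Memory Device
\tSize: No Module Installed
\tLocator: DIMM_A2
\tSpeed: Unknown
"""

for dimm in parse_memory_devices(SAMPLE):
    print(dimm["Locator"], dimm["Size"], dimm["Speed"])
```

On a real system the text would come from running *dmidecode -t 17* as root and passing its output to `parse_memory_devices`.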


----------



## ralphbsz (Aug 27, 2020)

You removed and replaced DIMMs? That makes it sound like it could have been a contact problem. Except it's really hard to imagine a contact problem that shows up after exactly 16 minutes.


----------



## rowan194 (Aug 27, 2020)

ralphbsz said:


> You removed and replaced DIMMs? That makes it sound like it could have been a contact problem. Except it's really hard to imagine a contact problem that shows up after exactly 16 minutes.



The same issue happened in another machine with a similar mainboard (I didn't specifically notice the 16 minute thing at that time), but with a different brand of 32GB DIMMs. I had a spare mainboard so I set up a temporary test machine.

I effectively replaced all hardware (including the DIMMs), so it has to be some kind of quirk or incompatibility, rather than a particular item of hardware being electrically faulty.

Kingston RAM + mainboard #1
Kingston RAM + mainboard #2
Hynix RAM + mainboard #2


----------

