# MCA errors



## spork (Jul 22, 2021)

I'm going to bring this up with my vendor, but I'm trying to figure out if the logs are telling me anything more than "bad DIMM".

So I had a box running 12.2-RELEASE p7 lock up. On logging into the IP-KVM, I had a mess of MCA logs on the console, but the host was unresponsive: no keyboard response, software shutdown did nothing, had to reboot. This was 5 days ago, and the box hasn't done anything odd since. 

Board is a Supermicro X11SPW-TF.

What seems odd is the sheer volume of messages. They were all spewed out in a little over a minute and a half (first one at 17:46:05, last one at 17:47:44), and there are quite a few:


```
[root@clweb2 /home/spork]# bzgrep 'MCA:' /var/log/messages.0.bz2 | wc -l
  539977
```

In digging around on how to match this to a particular DIMM, I saw people noting that the address of the error is helpful, and that it is usually a single address. Not so much here: 81,184 lines contain an address (and I assume 81,184 is also the number of MCA errors logged), and if I sort them for unique addresses, I have 77,524 of them. This seems odd based on what I'm reading about MCA errors.


```
[root@clweb2 /home/spork]# bzgrep "MCA: Address" /var/log/messages.0.bz2 | wc -l
81184
[root@clweb2 /home/spork]# bzgrep "MCA: Address" /var/log/messages.0.bz2 |sort -u | wc -l
77524
```

If I pipe all these logs through mcelog, I get variations on this, with one line always being "CPU 0 BANK 7" or "CPU 0 BANK 13":


```
Hardware event. This is not a software error.
CPU 0 BANK 7
MISC 200000c000001086 ADDR 165629c00
MCG status:
M2M: MscodDataRdErr
STATUS dc001f8001010090 MCGSTATUS 0
MCGCAP 7000c14 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 85 Step 4

Hardware event. This is not a software error.
CPU 0 BANK 13
MISC 9000081c3830086 ADDR f348f88c0
MCG status:
MemCtrl: Corrected patrol scrub error
STATUS cc000200000800c0 MCGSTATUS 0
MCGCAP 7000c14 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 85 Step 4
```
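For anyone wanting to read a bit more out of the `STATUS` values than the one-line summary mcelog prints, here is a minimal Python sketch that splits the two values above into fields. The bit positions are an assumption based on my reading of the `IA32_MCi_STATUS` layout in the Intel SDM, and `decode_mci_status` is a made-up helper name, so check it against the manual before trusting it.

```python
# Assumed IA32_MCi_STATUS flag positions (per my reading of the Intel SDM;
# verify against the manual for your CPU before relying on them).
FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an error was logged while a previous one was still pending
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into named fields (hypothetical helper)."""
    fields = {name: (status >> bit) & 1 for name, bit in FLAG_BITS.items()}
    # Bits 52:38 are assumed to hold the corrected error count.
    fields["corrected_count"] = (status >> 38) & 0x7FFF
    # Bits 31:16 model-specific error code, bits 15:0 MCA error code.
    fields["mscod"] = (status >> 16) & 0xFFFF
    fields["mcacod"] = status & 0xFFFF
    return fields


if __name__ == "__main__":
    # The two STATUS values from the mcelog output above.
    for raw in ("dc001f8001010090", "cc000200000800c0"):
        print(raw, decode_mci_status(int(raw, 16)))
```

Under that assumed layout, both sample records come out with `UC` and `PCC` clear and `OVER` set, i.e. corrected errors arriving faster than they were being read out, which would fit the "Corrected patrol scrub error" wording in the second record.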

Not shown in mcelog output are messages like this that indicate something ran out of space for recording these errors:


```
MCA: Unable to allocate space for an event.
```

In the IPMI SEL logs I have a few interesting things:


```
  11 | 07/02/2021 | 20:14:31 | Fan #0x42 | Lower Critical going low  | Asserted
  12 | 07/02/2021 | 20:14:31 | Fan #0x42 | Lower Non-recoverable going low  | Asserted
  13 | 07/02/2021 | 20:14:37 | Fan #0x42 | Lower Non-recoverable going low  | Deasserted
  14 | 07/02/2021 | 20:14:37 | Fan #0x42 | Lower Critical going low  | Deasserted
  15 | 07/02/2021 | 20:16:16 | Processor #0xff | IERR () | Asserted
  16 | 07/14/2021 | 00:38:50 | Power Supply #0xc9 | Failure detected () | Asserted
  17 | 07/14/2021 | 00:39:05 | Power Supply #0xc9 | Failure detected () | Deasserted
  18 | 07/14/2021 | 18:09:12 | Power Supply #0xc8 | Failure detected () | Asserted
  19 | 07/14/2021 | 18:09:30 | Power Supply #0xc8 | Failure detected () | Deasserted
  1a | 07/16/2021 | 17:48:06 | Memory | Uncorrectable ECC (@DIMMA1(CPU1)) | Asserted
```

The first four logs on 7/2/21 were me taking the lid off to examine why a LAN port was not latching and then putting the case back on. The "Processor #0xff | IERR () | Asserted" entry is a few minutes after, but of note the box just kept chugging along. The four PSU warnings were me verifying I could query PSU status with ipmitool.

The log entry on 7/16 (1a | 07/16/2021 | 17:48:06 | Memory | Uncorrectable ECC (@DIMMA1(CPU1)) | Asserted) seems to point to a bad DIMM, but then again that brings me back to the big question - is it hardware or not? How trustworthy is this MCA logging?

Any suggestions on other things to look at?


----------



## SirDice (Jul 22, 2021)

One way to find out if it's the DIMM or the mainboard is to record the address that's giving errors now. Then move every DIMM to another slot, swap everything around. If the address moves with the DIMM it's the DIMM that's faulty.



spork said:


> is it hardware or not? How trustworthy is this MCA logging?


They are always hardware errors.





Machine Check Architecture - Wikipedia (en.wikipedia.org)


----------



## spork (Jul 22, 2021)

Yeah, I saw some talk about using the address line in other posts, but considering I have 77K+ addresses in that log (in a period of under 2 minutes), and I can't make the box do this on demand, it's hard to try that. I noticed other people with this issue seem to have only a handful of memory addresses - that's part of what has me confused - I didn't see anyone with something similar happening.


----------



## SirDice (Jul 22, 2021)

spork said:


> I noticed other people with this issue seem to have only a handful of memory addresses


It's going to depend on the "density" of the DIMM. Bigger DIMMs with fewer chips on them have a higher density of bits (more bits need to be crammed into fewer chips). If one of those chips fails then a higher density DIMM would have more memory errors than a lower density one.


----------



## Phishfry (Jul 22, 2021)

I had memory errors on a board recently. All it took was to reseat the memory chip.
Have you tried to re-insert your problem module?


----------



## spork (Jul 22, 2021)

SirDice said:


> It's going to depend on the "density" of the DIMM. Bigger DIMMs with fewer chips on them have a higher density of bits (more bits need to be crammed into fewer chips). If one of those chips fails then a higher density DIMM would have more memory errors than a lower density one.


This is where I get stuck. My brain doesn't do hex, so I can't really figure out a way to do something with all these addresses to sort of verify they fall within a range. It would certainly be interesting to do this and then see if they all fall within one DIMM. I mean, the server event log flags a single DIMM. It would also be nice if I could figure out a way to trigger this.
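One way to sidestep reading hex by eye is to let a script turn the addresses into numbers and group them. The following is only a rough Python sketch: it assumes the log lines look like the `MCA: Address 0x...` lines grepped earlier, and the helper names (`extract_addresses`, `summarize`) and the 1 GiB bucket size are arbitrary choices of mine, not anything FreeBSD provides.

```python
import re
from collections import Counter

# Assumed line format: "... kernel: MCA: Address 0x106eac7800"
ADDR_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")
GIB = 1 << 30  # bytes per GiB


def extract_addresses(lines):
    """Yield every address found on an 'MCA: Address' line as an integer."""
    for line in lines:
        match = ADDR_RE.search(line)
        if match:
            yield int(match.group(1), 16)


def summarize(addresses, bucket_size=GIB):
    """Return counts, min/max and a per-bucket histogram, or None if empty."""
    addrs = list(addresses)
    if not addrs:
        return None
    buckets = Counter(addr // bucket_size for addr in addrs)
    return {
        "count": len(addrs),
        "unique": len(set(addrs)),
        "min": min(addrs),
        "max": max(addrs),
        "buckets": dict(sorted(buckets.items())),
    }


if __name__ == "__main__":
    sample = [
        "Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x106eac7800",
        "Jul 16 17:46:05 clweb2 kernel: MCA: Unable to allocate space for an event.",
    ]
    print(summarize(extract_addresses(sample)))
```

For the real log, the decompressed lines could be fed in instead of `sample`, for example by opening the file with Python's `bz2.open(path, "rt")`. If nearly all addresses land in a small number of buckets, that is at least a hint that they cluster in one region of physical memory.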

For all I know, my vendor might just ship me another board populated with the same config.


----------



## spork (Jul 22, 2021)

Phishfry said:


> I had memory errors on a board recently. All it took was to reseat the memory chip.
> Have you tried to re-insert your problem module?


The box is currently doing redundant stuff, so I absolutely plan on pulling it out of the rack and giving it a once-over next time I'm there.


----------



## SirDice (Jul 22, 2021)

spork said:


> My brain doesn't do hex, so I can't really figure out a way to do something with all these addresses to sort of verify they fall within a range.


That's all you need anyway. Look at the output from sysutils/dmidecode, it will tell you which range is used by what DIMM. That will allow you to track down the failing DIMM.


----------



## spork (Jul 22, 2021)

SirDice said:


> That's all you need anyway. Look at the output from sysutils/dmidecode, it will tell you which range is used by what DIMM. That will allow you to track down the failing DIMM.


OK, so at this point I'm just doing this as an exercise in figuring out the relationship between the logs and the dmidecode output. The vendor is looking at this stuff too, and I think they're just going to ship a new DIMM, and we both agree on which DIMM it is based on the server event log in the IPMI device:

Memory | Uncorrectable ECC (@*DIMMA1*(CPU1)) | Asserted

The manual shows a slot labelled DIMMA1, so I think we're good.

Anyhow, for reference, here's some info out of dmidecode. First each DIMM along with the object one level up from that (which is some grouping of DIMMs I guess - it shows a size of 32GB, and each DIMM is 16GB). I think the "Handle" for each DIMM is what I cross-reference with the next block.



> 1st Group/Bank(?)
> 
> Handle 0x0021, DMI type 19, 31 bytes
> Memory Array Mapped Address
> ...



Elsewhere in the dmidecode output I have this, which is I think laying out address ranges for each DIMM. The "physical device handle" here matches with the DIMMs in the prior snippet.



> Handle 0x0031, DMI type 20, 35 bytes
> Memory Device Mapped Address
> *Starting Address: 0x00000000000*
> *Ending Address: 0x003FFFFFFFF*
> ...




So I think I can posit that *DIMMA1* with a handle of *0x0022* matches the "Memory Device Mapped Address" range of *Starting Address: 0x00000000000* and *Ending Address: 0x003FFFFFFFF*.

But then I look at just a small snippet of my MCA logs and...



> (first log line)
> Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x106eac7800
> Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x107ffe80c0
> Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x107ffe94c0
> ...



Now those don't appear to be in any of the ranges above, so that's where I'm scratching my head about this (or again, I don't do hex, and therefore I'm not reading any of that right).
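As a sanity check on the hex reading, here is a small Python sketch that compares the three logged addresses against the one dmidecode range quoted above. The other ranges are elided in this post, so they are not checked, and `in_range` and `range_size_gib` are just illustrative helper names.

```python
GIB = 1 << 30  # bytes per GiB

# Values copied from the snippets above. Only the one range that is
# actually shown is used; the elided ranges are deliberately left out.
RANGE_START = int("0x00000000000", 16)
RANGE_END = int("0x003FFFFFFFF", 16)
LOGGED = [int(a, 16) for a in ("0x106eac7800", "0x107ffe80c0", "0x107ffe94c0")]


def in_range(addr, start, end):
    """True if addr lies inside [start, end]; the ending address is inclusive."""
    return start <= addr <= end


def range_size_gib(start, end):
    """Size of an inclusive address range in GiB."""
    return (end - start + 1) / GIB


if __name__ == "__main__":
    print(f"range size: {range_size_gib(RANGE_START, RANGE_END):.0f} GiB")
    for addr in LOGGED:
        inside = in_range(addr, RANGE_START, RANGE_END)
        print(f"{addr:#x} is at {addr / GIB:.2f} GiB, inside range: {inside}")
```

With these values the range works out to 16 GiB, which matches the DIMM size, while all three logged addresses sit between 65 and 66 GiB. So the reading is right: they are outside that particular range.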


----------



## Phishfry (Jul 22, 2021)

spork said:


> *DIMMA1*


If this is a Supermicro board this is all you need. The memory sockets are labeled thus.
The socket closest to the CPU on the power connector side.


----------

