# MCA Errors



## abishai (Aug 18, 2022)

Hello,
I've noticed several suspicious messages in the system log:

```
<2>1 2022-08-15T22:52:42.537378+00:00 alpha kernel - - - MCA: Bank 8, Status 0x88000040000200cf
<2>1 2022-08-15T22:52:42.538644+00:00 alpha kernel - - - MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
<2>1 2022-08-15T22:52:42.538705+00:00 alpha kernel - - - MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 49
<2>1 2022-08-15T22:52:42.538744+00:00 alpha kernel - - - MCA: CPU 19 COR (1) MS channel ?? memory error
<2>1 2022-08-15T22:52:42.538781+00:00 alpha kernel - - - MCA: Misc 0x1020408000057100
<2>1 2022-08-16T22:46:18.037014+00:00 alpha kernel - - - MCA: Bank 8, Status 0x88000040000200cf
<2>1 2022-08-16T22:46:18.038387+00:00 alpha kernel - - - MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
<2>1 2022-08-16T22:46:18.038450+00:00 alpha kernel - - - MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 49
<2>1 2022-08-16T22:46:18.038490+00:00 alpha kernel - - - MCA: CPU 19 COR (1) MS channel ?? memory error
<2>1 2022-08-16T22:46:18.038529+00:00 alpha kernel - - - MCA: Misc 0x1020408000056c00
```

If these are memory errors, were they corrected? Also, there seems to be no information here that would identify the exact module location.
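
For reference, the Status value itself can be decoded by hand. The following is a minimal sketch, assuming the architectural `IA32_MCi_STATUS` layout from the Intel SDM (flag bits 63 to 57, corrected-error count in bits 52:38 on CPUs that report CMCI support, MCA error code in bits 15:0). The function and dictionary names are made up for illustration; the bits 31:16 value is model-specific and is left undecoded.

```python
# Architectural IA32_MCi_STATUS flag bits (per the Intel SDM layout assumed above).
FLAGS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # the MISC register holds extra information
    "ADDRV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into its architectural fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    # Bits 52:38: corrected-error count (only meaningful with CMCI support).
    fields["corrected_count"] = (status >> 38) & 0x7FFF
    # Bits 31:16: model-specific error code, not interpreted here.
    fields["model_specific_code"] = (status >> 16) & 0xFFFF
    # Bits 15:0: architectural MCA error code.
    fields["mca_code"] = status & 0xFFFF
    return fields


def describe_memory_code(mca_code: int):
    """Decode the memory-controller form 0000_0000_1MMM_CCCC, else return None."""
    if mca_code & 0xFF80 != 0x0080:
        return None  # some other error class
    kinds = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}
    kind = kinds.get((mca_code >> 4) & 0x7, "reserved")
    channel = mca_code & 0xF
    # Channel value 0xF means "channel not specified".
    return kind, (None if channel == 0xF else channel)


info = decode_mci_status(0x88000040000200CF)
print(info)
print(describe_memory_code(info["mca_code"]))
```

Under these assumptions, the value from the log has UC clear (the error was corrected), a corrected count of 1, ADDRV clear (no address was recorded), and a memory scrub error code with an unspecified channel, which lines up with the kernel's own "COR (1) MS channel ??" summary.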


----------



## richardtoohey2 (Aug 18, 2022)

This might help:

> MCA errors (forums.freebsd.org)
>
> I'm going to bring this up with my vendor, but I'm trying to figure out if the logs are telling me anything more than "bad DIMM".  So I had a box running 12.2-RELEASE p7 lock up. On logging into the IP-KVM, I had a mess of MCA logs on the console, but the host was unresponsive: no keyboard...


----------



## abishai (Aug 18, 2022)

Not really, I don't have an address entry here. The channel is not detected either.
Also, `ipmitool sel elist` shows no memory-related messages.

Could it be something else, if not memory?


----------



## richardtoohey2 (Aug 18, 2022)

If you search the internet for that message it seems to be saying an issue with bank 8 of your RAM.

What is your RAM set-up?  What is the motherboard?


----------



## VladiBG (Aug 19, 2022)

Did you try to swap the memory modules and check if the message bank is different?


----------



## abishai (Aug 19, 2022)

richardtoohey2 said:


> If you search the internet for that message it seems to be saying an issue with bank 8 of your RAM.
> 
> What is your RAM set-up?  What is the motherboard?


Asus Z8NR-D12 with 2 Xeon E56xx CPUs.
There is no BANK 8 in the `dmidecode` output:

```
abishai@alpha:~ % doas dmidecode -t memory
# dmidecode 3.4
Scanning /dev/mem for entry point.
SMBIOS 2.5 present.

Handle 0x0036, DMI type 16, 15 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: Multi-bit ECC
    Maximum Capacity: 96 GB
    Error Information Handle: Not Provided
    Number Of Devices: 12

Handle 0x0038, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_A1
    Bank Locator: BANK0
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 2FBC69FC
    Asset Tag: AssetTagNum0
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x003A, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_A2
    Bank Locator: BANK1
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 39FF2843
    Asset Tag: AssetTagNum1
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x003C, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_B1
    Bank Locator: BANK2
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 2BBC69FC
    Asset Tag: AssetTagNum2
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x003E, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_B2
    Bank Locator: BANK3
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: B7C57AE0
    Asset Tag: AssetTagNum3
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x0040, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_C1
    Bank Locator: BANK0
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: E26F71DF
    Asset Tag: AssetTagNum4
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x0042, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_C2
    Bank Locator: BANK1
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 4ABB69FC
    Asset Tag: AssetTagNum5
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x0044, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_D1
    Bank Locator: BANK2
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 26BC69FC
    Asset Tag: AssetTagNum6
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x0046, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_D2
    Bank Locator: BANK3
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: F33D69E7
    Asset Tag: AssetTagNum7
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x0048, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_E1
    Bank Locator: BANK0
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 3ABC69FC
    Asset Tag: AssetTagNum8
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x004A, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_E2
    Bank Locator: BANK1
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 1E6F71DF
    Asset Tag: AssetTagNum9
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x004C, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_F1
    Bank Locator: BANK2
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 31BC69FC
    Asset Tag: AssetTagNum10
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x004E, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_F2
    Bank Locator: BANK3
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: C96F71DF
    Asset Tag: AssetTagNum11
    Part Number: 36JSZF1G72PZ-1G4D1
```



VladiBG said:


> Did you try to swap the memory modules and check if the message bank is different?


Well, since there were only 2 messages so far, I hope it was just a random aberration. Without knowing the module location, all of the modules would have to be moved. Also, I don't understand why address information is not available; other people usually get an address and can pinpoint the bank by address range.

Can we narrow the search by assuming that, if CPU 19 is on the second die, BANK8 must be connected to it?


----------



## VladiBG (Aug 19, 2022)

BANK8 is not a memory bank. It's an mc_bank error class, which differs between Intel and AMD and depends on the processor family.


> Intel The P6 processor has the five banks:
> 0. Data Cache
> 1. Instruction Cache
> 2. Bus Unit
> ...



src: https://lists.xenproject.org/archives/html/xen-devel/2009-02/pdf47CGiwzq4V.pdf
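
The "Global Cap" line in the original log is consistent with this. A small sketch, assuming the kernel prints the raw `IA32_MCG_CAP` register there and using the architectural Count field (bits 7:0, the number of MCA reporting banks); the function name is hypothetical:

```python
def mcg_cap_bank_count(mcg_cap: int) -> int:
    """Return the Count field (bits 7:0) of IA32_MCG_CAP: the number of MCA banks."""
    return mcg_cap & 0xFF


# "Global Cap" value taken from the log in the first post.
banks = mcg_cap_bank_count(0x0000000000001C09)
print(f"{banks} MCA banks, numbered 0..{banks - 1}")
```

With that reading, the CPU exposes 9 banks, so "Bank 8" is simply the last MCA reporting bank of that processor and has no direct relation to the `Bank Locator` strings in `dmidecode`.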


----------



## cartesius23 (Sep 1, 2022)

Are MCA errors the same thing as MCE exceptions? We've been pursuing that issue on Linux (AMD Ryzen machines) for >= 2 years. Random reboots due to mostly fake errors. It was all firmware related and the latest firmware upgrades by e.g. ASUS resolved many of them. The thing is that those errors also showed some issues with memory banks but all of them were fake. The errors were related to higher CPU C-states.


----------



## SirDice (Sep 1, 2022)

cartesius23 said:


> Are MCA errors the same thing as MCE exceptions?


They're not the same but are related. Both would signal the OS about hardware issues.

Machine Check Architecture - Wikipedia (en.wikipedia.org)

Machine-check exception - Wikipedia (en.wikipedia.org)


----------



## VladiBG (Sep 1, 2022)

cartesius23 said:


> Are MCA errors the same thing as MCE exceptions?


https://www.intel.com/content/dam/develop/external/us/en/documents/emca2-integration-validation-guide-556978.pdf


----------



## abishai (Sep 3, 2022)

Three days ago I received a hundred of these messages within a second, and then silence. I rebooted the server and found these messages under the IPMI/BMC BIOS entry.

Still, I can't figure out the root of the problem. It looks like it triggered 2 sensors (Memory and OEM Memory), but there is still no module location.
Maybe someone has seen these messages before and knows how to read them?

For now, I have swapped the CPU1 and CPU2 memory modules and booted the server. The sporadic nature of the errors gives me hope that this is an electrical contact issue and that moving the modules will help.
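
To see whether the errors follow the swapped modules, it may help to tally the reports over time rather than eyeball the log. A rough sketch, assuming the syslog line format shown in the first post; the regex and function name are illustrative, and the sample lines are copied from that log:

```python
import re
from collections import Counter

# Matches the "Bank N, Status 0x..." line that opens each MCA report.
BANK_LINE = re.compile(
    r"^<\d+>1 (?P<ts>\S+) \S+ kernel .*"
    r"MCA: Bank (?P<bank>\d+), Status (?P<status>0x[0-9a-fA-F]+)"
)


def tally_mca_events(lines):
    """Count MCA reports per (date, bank, status) from syslog-style lines."""
    counts = Counter()
    for line in lines:
        m = BANK_LINE.match(line)
        if m:
            day = m.group("ts")[:10]  # YYYY-MM-DD part of the timestamp
            counts[(day, int(m.group("bank")), m.group("status"))] += 1
    return counts


sample = [
    "<2>1 2022-08-15T22:52:42.537378+00:00 alpha kernel - - - MCA: Bank 8, Status 0x88000040000200cf",
    "<2>1 2022-08-15T22:52:42.538744+00:00 alpha kernel - - - MCA: CPU 19 COR (1) MS channel ?? memory error",
    "<2>1 2022-08-16T22:46:18.037014+00:00 alpha kernel - - - MCA: Bank 8, Status 0x88000040000200cf",
]
for key, n in sorted(tally_mca_events(sample).items()):
    print(key, n)
```

Feeding the real log file to `tally_mca_events` would show whether the burst was a one-off and whether the reporting CPU or status changes after the module swap.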


----------

