# Mem Error since 11.2-RELENG



## kira12 (Jul 30, 2018)

Hello Guys,

I updated to release 11.2, and two of my HP servers now report memory errors:

```
DL380G7:
MCA: Bank 8, Status 0x88000040000200cf
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 33
MCA: CPU 13 COR (1) MS channel ?? memory error
MCA: Misc 0x204081000016000
```

On the DL580G7 there are too many messages to read, and the machine crashes. What happened with 11.2?

best regards ré


----------



## nihr43 (Jul 30, 2018)

Sounds like you've just got bad memory. Those are pretty old now, aren't they? Look at the built-in management if you have it.


----------



## SirDice (Jul 31, 2018)

nihr43 said:


> Sounds like you've just got bad memory.


Yep, that would be my guess too.


----------



## kira12 (Jul 31, 2018)

Hello,

But it's on two HP servers after the update to 11.2.

regards ré


----------



## _martin (Jul 31, 2018)

kira12 said:


> But it's on two HP servers after the update to 11.2.


Well, bad memory can happen at any time. What does iLO say? Does it report any issues? I'd boot Memtest86+ or something similar and run a full memory check, just to be sure.


----------



## SirDice (Jul 31, 2018)

Note that this is ECC memory and the error was corrected (COR). So a memory test is unlikely to find failures, though the test may trigger more MCA messages.

The error may have been there before the upgrade too, and nobody noticed it. Now that you've upgraded the system you're looking at the logging more closely. So it "appears" these errors are due to the upgrade.
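
To make the "corrected" reading concrete, here is a minimal Python sketch that pulls the main flag bits out of the status value from the first post. The bit positions are the architectural `IA32_MCi_STATUS` layout as documented in the Intel SDM; the function name and the returned dictionary are made up for illustration, and this is not how the FreeBSD kernel itself formats the message.

```python
# Illustrative sketch only: bit positions follow the architectural
# IA32_MCi_STATUS layout; decode_status_flags is a hypothetical helper.
STATUS = 0x88000040000200CF  # "Bank 8" status from the first post


def decode_status_flags(status: int) -> dict:
    """Split an MCi_STATUS value into its main flag bits."""
    return {
        "valid": bool((status >> 63) & 1),            # VAL: register holds an error
        "overflow": bool((status >> 62) & 1),         # OVER: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),      # UC: 0 means the hardware corrected it
        "misc_valid": bool((status >> 59) & 1),       # MISCV: the Misc register is meaningful
        "addr_valid": bool((status >> 58) & 1),       # ADDRV: the Address register is meaningful
        "context_corrupt": bool((status >> 57) & 1),  # PCC: processor context corrupted
        # Bits 52:38 carry the corrected-error count on CPUs that report
        # CMCI support in the global capability register.
        "corrected_count": (status >> 38) & 0x7FFF,
    }


if __name__ == "__main__":
    for name, value in decode_status_flags(STATUS).items():
        print(f"{name}: {value}")
```

For the posted value the UC bit is clear and the count field is 1, which lines up with the `COR (1)` in the kernel message.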


----------



## Chris_H (Jul 31, 2018)

FWIW I see these periodically: about once every 7-8 months on one of my servers. I recently replaced the memory, doubling both the capacity and the speed of the RAM, and then saw one a week later (on the new memory). It might be worth noting that the L1/L2/L3 cache is _also_ memory and can throw these errors too. I suspect that, in my case, the CPU is working exceptionally hard at the time (heavy load) and the cache memory isn't performing correctly while the CPU is so hot.
Unless they become a frequent occurrence, you can probably disregard the message(s).
As for the FreeBSD upgrade, it may well be that it introduces something that makes these (tempfails) more evident.

HTH

--Chris
EDIT
I should also note that a failing PSU can cause this too (poor-quality power, e.g. failing diodes). RAM and cache are especially sensitive to the _quality_ of the power.


----------



## SirDice (Jul 31, 2018)

In the past I've used sysutils/mcelog to "decode" those MCA messages. They can be quite cryptic.
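
For a rough idea of what such decoding involves, the low 16 bits of the status value are the MCA error code. A minimal Python sketch follows, assuming the Intel SDM's compound encoding for memory controller errors (`0000 0000 1MMM CCCC`, with bit 12 ignored as a filter flag). The helper name and labels are hypothetical, and this is not mcelog's implementation.

```python
# Illustrative sketch only: decodes the "memory controller error" form of
# the MCA error code; decode_memory_error is a hypothetical helper.
MEMORY_OPS = {
    0: "generic",
    1: "read",
    2: "write",
    3: "address/command",
    4: "scrub",  # printed as "MS" in the kernel message
}


def decode_memory_error(status: int):
    """Return the memory operation and channel, or None for other error types."""
    code = status & 0xFFFF               # MCA error code field
    if (code & 0xEF80) != 0x0080:        # not a memory controller error
        return None
    operation = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
    channel = code & 0xF                 # 0xF means "channel not specified"
    return {
        "operation": operation,
        "channel": None if channel == 0xF else channel,
    }


if __name__ == "__main__":
    print(decode_memory_error(0x88000040000200CF))
```

For the status in the first post this gives a scrub error with no channel specified, matching the `MS channel ??` part of the message.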


----------



## Maelstorm (Jul 31, 2018)

MCA errors are thrown when the CPU detects a possible hardware problem. I've seen these for the L1 cache on my machine. They are usually informational, but they do indicate that there may be a problem. Since CPUs now contain the memory controller, it looks like the CPU registered a memory error, which was also corrected. I wouldn't worry too much about it unless it happens frequently, which would indicate a hardware fault or flaky hardware (memory, CPU, mainboard, PSU, etc.).
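
One way to judge "frequently" is to tally the `MCA: Bank` lines per bank and split them by the uncorrected (UC, bit 61) flag in the status value. The Python sketch below does that over an iterable of log lines; the function name is hypothetical, and the sample input is simply the two relevant lines from the first post.

```python
# Illustrative sketch only: tally_mca is a hypothetical helper that counts
# MCA reports per bank, separating corrected from uncorrected errors.
import re
from collections import Counter

BANK_LINE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")


def tally_mca(lines):
    """Return (corrected, uncorrected) Counters keyed by bank number."""
    corrected, uncorrected = Counter(), Counter()
    for line in lines:
        match = BANK_LINE.search(line)
        if not match:
            continue                     # not a bank/status line
        bank = int(match.group(1))
        status = int(match.group(2), 16)
        if (status >> 61) & 1:           # UC bit set: error was not corrected
            uncorrected[bank] += 1
        else:
            corrected[bank] += 1
    return corrected, uncorrected


if __name__ == "__main__":
    sample = [
        "MCA: Bank 8, Status 0x88000040000200cf",
        "MCA: CPU 13 COR (1) MS channel ?? memory error",
    ]
    corrected, uncorrected = tally_mca(sample)
    print("corrected:", dict(corrected))
    print("uncorrected:", dict(uncorrected))
```

On a real system the lines could come from an open log file instead, for example `/var/log/messages`, assuming that is where kernel messages end up on your machine. A count that keeps growing on the same bank, or any uncorrected entries, would be the signal to look at the hardware.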


----------



## Crivens (Jul 31, 2018)

These days, memory cells are small enough to be corrupted by radiation. How is the sun storm activity? And yes, I _am_ serious.


----------



## Maelstorm (Aug 2, 2018)

Crivens said:


> These days, memory cells are small enough to be corrupted by radiation. How is the sun storm activity? And yes, I _am_ serious.



I know you are, as I am well aware of that myself.  A cosmic ray has enough energy to flip the bit in a memory cell.  So yes, space weather is becoming very important to admins as well.

Space weather, not just for power grids, satellites, and communications operators/providers any more.


----------



## Chris_H (Aug 2, 2018)

Maelstorm, you forgot the NSA and Intel.

--Chris


----------

