# Find faulted Memory!



## shahzaib (Jan 18, 2021)

Hello,

Recently one of our SuperMicro nodes crashed. On checking the logs we found ECC memory errors, although these errors should be auto-corrected. We tried to find the faulty DIMM using the mcelog utility, but we need help identifying the physical memory bank location using dmidecode.

Here is the mcelog output: https://pastebin.com/Jwent8YN

Here is the dmidecode output: https://pastebin.com/NjBNc9hG

We need to find the exact memory module to replace. I understand there is a method that identifies the module by mapping the memory address range; please let me know how I can do that mapping and spot the module.


----------



## SirDice (Jan 18, 2021)

Look at the memory address where the error happened; the actual message is in /var/log/messages. Then trace that memory location back in the dmidecode output.

For example, if the error happened somewhere in the address range 0x00E00000000 to 0x00FFFFFFFFF:

```
Handle 0x0027, DMI type 20, 19 bytes
Memory Device Mapped Address
        Starting Address: 0x00E00000000
        Ending Address: 0x00FFFFFFFFF
        Range Size: 8 GB
        Physical Device Handle: 0x0026
        Memory Array Mapped Address Handle: 0x0023
        Partition Row Position: 1
        Interleave Position: Unknown
        Interleaved Data Depth: 2
```
Look at the Physical Device Handle; it refers to another entry in the dmidecode output:

```
Handle 0x0026, DMI type 17, 28 bytes
Memory Device
        Array Handle: 0x0022
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: P2-1B
        Bank Locator: BANK7
        Type: DDR3
        Type Detail: Other
        Speed: 1066 MHz
        Manufacturer: Hyundai       
        Serial Number: CA90C20C
        Asset Tag: AssetTagNum7
        Part Number: HMT31GR7BFR4C-H9  
        Rank: Unknown
```


----------



## shahzaib (Jan 18, 2021)

Hello,

That was the exact error:


```
Jan 16 00:36:49 cw025 kernel: MCA: Bank 8, Status 0xcc0001400001009f
Jan 16 00:36:49 cw025 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Jan 16 00:36:49 cw025 kernel: MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 0
Jan 16 00:36:49 cw025 kernel: MCA: CPU 0 COR (5) OVER RD channel ?? memory error
Jan 16 00:36:49 cw025 kernel: MCA: Address 0x2c03a2800
Jan 16 00:36:49 cw025 kernel: MCA: Misc 0xd58f2bb600050180
Jan 16 00:36:49 cw025 kernel: MCA: Bank 8, Status 0xcc0000800001009f
Jan 16 00:36:49 cw025 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Jan 16 00:36:49 cw025 kernel: MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 0
Jan 16 00:36:49 cw025 kernel: MCA: CPU 0 COR (2) OVER RD channel ?? memory error
Jan 16 00:36:49 cw025 kernel: MCA: Address 0x1c42d80
Jan 16 00:36:49 cw025 kernel: MCA: Misc 0x30ee40da00055840
Jan 16 00:36:49 cw025 kernel: MCA: Bank 8, Status 0xcc0002000001009f
Jan 16 00:36:49 cw025 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Jan 16 00:36:49 cw025 kernel: MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 0
Jan 16 00:36:49 cw025 kernel: MCA: CPU 0 COR (8) OVER RD channel ?? memory error
Jan 16 00:36:49 cw025 kernel: MCA: Address 0x149d91480
Jan 16 00:36:49 cw025 kernel: MCA: Misc 0x598f303400050380
Jan 16 00:36:49 cw025 kernel: MCA: Bank 8, Status 0xcc0001400001009f
Jan 16 00:36:49 cw025 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Jan 16 00:36:49 cw025 kernel: MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 0
Jan 16 00:36:49 cw025 kernel: MCA: CPU 0 COR (5) OVER RD channel ?? memory error
Jan 16 00:36:49 cw025 kernel: MCA: Address 0x149d01fc0
Jan 16 00:36:49 cw025 kernel: MCA: Misc 0x562eeacc00051287
```


----------



## SirDice (Jan 18, 2021)

Look at the physical addresses where the error happened:

```
MCA: Address 0x2c03a2800
```
That appears to fall in this range:

```
Handle 0x0019, DMI type 20, 19 bytes
Memory Device Mapped Address
        Starting Address: 0x00200000000
        Ending Address: 0x003FFFFFFFF
        Range Size: 8 GB
        Physical Device Handle: 0x0018
        Memory Array Mapped Address Handle: 0x0015
        Partition Row Position: 1
        Interleave Position: Unknown
        Interleaved Data Depth: 2
```
Which refers to Physical Device Handle: 0x0018:

```
Handle 0x0018, DMI type 17, 28 bytes
Memory Device
        Array Handle: 0x0014
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-1B
        Bank Locator: BANK1
        Type: DDR3
        Type Detail: Other
        Speed: 1066 MHz
        Manufacturer: Hyundai       
        Serial Number: 038BC50D
        Asset Tag: AssetTagNum1
        Part Number: HMT31GR7BFR4C-H9  
        Rank: Unknown
```

So, it's P1-1B that's probably broken. You can typically find that same marking on the mainboard (or in the manual). Do the same for the other addresses you find.
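
The lookup described above can be sketched in a few lines of Python. This is only a minimal illustration: the table holds just the two DMI type 20 ranges quoted so far in this thread (the rest would have to be copied from the full dmidecode output), and the names `MAPPED_RANGES` and `locate` are made up for the example.

```python
# Ranges copied from the DMI type 20 entries quoted in this thread.
# Each tuple: (start, end, physical device handle, locator of the type 17 entry).
# Assumption: the full dmidecode output contains more ranges than these two.
MAPPED_RANGES = [
    (0x00200000000, 0x003FFFFFFFF, 0x0018, "P1-1B"),
    (0x00E00000000, 0x00FFFFFFFFF, 0x0026, "P2-1B"),
]


def locate(address, ranges=MAPPED_RANGES):
    """Return (handle, locator) for the range containing address, else None."""
    for start, end, handle, locator in ranges:
        if start <= address <= end:
            return handle, locator
    return None


# Addresses taken from the MCA log lines posted earlier.
for text in ("0x2c03a2800", "0x1c42d80", "0x149d91480", "0x149d01fc0"):
    hit = locate(int(text, 16))
    if hit is None:
        print(f"{text}: not in the ranges listed here, check the full dmidecode output")
    else:
        print(f"{text}: handle 0x{hit[0]:04X}, locator {hit[1]}")
```

Addresses that come back as `None` simply fall outside the two ranges copied here; adding the remaining type 20 entries to the table would cover them.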


----------



## shahzaib (Jan 21, 2021)

SirDice, thanks for the quick response and the help. You addressed only the first one, MCA: Address 0x2c03a2800, but there are other MCA addresses in the log too. Should I be concerned about these as well:


```
Jan 16 00:36:49 cw025 kernel: MCA: Address 0x1c42d80
Jan 16 00:36:49 cw025 kernel: MCA: Address 0x149d91480
Jan 16 00:36:49 cw025 kernel: MCA: Address 0x149d01fc0
```


----------



## shahzaib (Jan 21, 2021)

Hello SirDice, the system went down again, and this time these were the MCA addresses:



```
Jan 19 02:56:57 cw025 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Jan 19 02:56:57 cw025 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Jan 19 02:56:57 cw025 kernel: MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 0
Jan 19 02:56:57 cw025 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Jan 19 02:56:57 cw025 kernel: MCA: Address 0x776a9b80
Jan 19 02:56:57 cw025 kernel: MCA: Misc 0x40a33bc300054303
Jan 19 03:33:49 cw025 kernel: MCA: Bank 8, Status 0xcc0001c00001009f
Jan 19 03:33:49 cw025 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Jan 19 03:33:49 cw025 kernel: MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 0
Jan 19 03:33:49 cw025 kernel: MCA: CPU 0 COR (7) OVER RD channel ?? memory error
Jan 19 03:33:49 cw025 kernel: MCA: Address 0x10bbbab00
Jan 19 03:33:49 cw025 kernel: MCA: Misc 0x9bbbdd5e00051487
Jan 19 03:33:49 cw025 kernel: MCA: Bank 8, Status 0xcc0001800001009f
Jan 19 03:33:49 cw025 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Jan 19 03:33:49 cw025 kernel: MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 0
Jan 19 03:33:49 cw025 kernel: MCA: CPU 0 COR (6) OVER RD channel ?? memory error
Jan 19 03:33:49 cw025 kernel: MCA: Address 0xa52a79c00
Jan 19 03:33:49 cw025 kernel: MCA: Misc 0x68b16a7600050880
Jan 19 03:33:53 cw025 kernel: MCA: Bank 8, Status 0xcc0000800001009f
Jan 19 03:33:53 cw025 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Jan 19 03:33:53 cw025 kernel: MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 0
Jan 19 03:33:53 cw025 kernel: MCA: CPU 0 COR (2) OVER RD channel ?? memory error
Jan 19 03:33:53 cw025 kernel: MCA: Address 0x1fec02ec0
Jan 19 03:33:53 cw025 kernel: MCA: Misc 0x7a5bd32300055b43
Jan 19 03:57:07 cw025 kernel: MCA: Bank 8, Status 0xcc0000c00001009f
Jan 19 03:57:07 cw025 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Jan 19 03:57:07 cw025 kernel: MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 0
Jan 19 03:57:07 cw025 kernel: MCA: CPU 0 COR (3) OVER RD channel ?? memory error
Jan 19 03:57:07 cw025 kernel: MCA: Address 0x401e09240
Jan 19 03:57:07 cw025 kernel: MCA: Misc 0x9f97700300055b43
```


If I check the dmidecode output again, the address 0x776a9b80 falls in this range:



```
Handle 0x001D, DMI type 20, 19 bytes
Memory Device Mapped Address
        Starting Address: 0x00600000000
        Ending Address: 0x007FFFFFFFF
        Range Size: 8 GB
        Physical Device Handle: 0x001C
        Memory Array Mapped Address Handle: 0x0015
        Partition Row Position: 1
        Interleave Position: Unknown
        Interleaved Data Depth: 2
```


Which refers to Physical Device Handle 0x001C:


```
Handle 0x001C, DMI type 17, 28 bytes
Memory Device
        Array Handle: 0x0014
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-2B
        Bank Locator: BANK3
        Type: DDR3
        Type Detail: Other
        Speed: 1066 MHz
        Manufacturer: Hyundai
        Serial Number: 049AC422
        Asset Tag: AssetTagNum3
        Part Number: HMT31GR7BFR4C-H9
        Rank: Unknown
```

Does that mean P1-2B (BANK3) is faulty as well?
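
Before pulling a module, it may be worth double-checking the range comparison, since hex addresses of different lengths are easy to misread. The snippet below is only a sanity check on the values quoted in this post, and `in_range` is a hypothetical helper name. Read literally, 0x776a9b80 has eight hex digits and sits a little under 2 GiB, while the quoted range starts at 0x00600000000 (24 GiB), so the address would seem to belong to a lower range than the one shown.

```python
def in_range(address, start, end):
    """True if address lies within the inclusive range [start, end]."""
    return start <= address <= end


# Values copied from the MCA log and the DMI type 20 entry quoted above.
address = 0x776a9b80
start, end = 0x00600000000, 0x007FFFFFFFF

print("inside quoted range:", in_range(address, start, end))
print(f"address is at {address / 2**30:.2f} GiB, range starts at {start / 2**30:.0f} GiB")
```

The same check can be repeated against each type 20 entry in the full dmidecode output to find the range that actually contains the address.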


----------



## Snurg (Jan 21, 2021)

It can also mean that all modules are OK, and just the connection (socket) is borderline.
I have an HP Z400 whose RAM modules come "loose" if I move it anything less than absolutely carefully.
I then have to take them all out and reseat them.
This is particularly bad if modules and sockets sit at opposite ends of their tolerances and are thus "bad matches".
In that case, swapping modules between different computers can solve the problem permanently.

Or it is a killer mobo. I once had an FSC workstation that seemed to fry memory modules.


----------



## SirDice (Jan 21, 2021)

It's not uncommon to find more than one faulty module. But yes, it's also possible they're just not seated properly or there's an issue with the mainboard. Note the modules and the memory locations, then swap a few 'faulty' modules around. If the failing addresses move with the module (which is now in a different socket), then you know the module is faulty. If the failing addresses stay the same, it's the socket or mainboard.
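
The swap test boils down to a small decision rule, sketched below. This is only an illustration of the reasoning: the function name `diagnose_swap` and its parameters are hypothetical, and it assumes the errors have already been mapped to slot locators via dmidecode and that the address map stays tied to the slots (as it would when swapping identically sized modules).

```python
def diagnose_swap(slot_before, slot_after, moved_to):
    """Interpret a swap test.

    slot_before: locator where errors were reported before the swap
    slot_after:  locator where errors are reported after the swap
    moved_to:    locator the suspect module was moved into
    """
    if moved_to == slot_before:
        raise ValueError("the suspect module was not moved")
    if slot_after == moved_to:
        return "module"               # errors followed the module
    if slot_after == slot_before:
        return "socket or mainboard"  # errors stayed with the slot
    return "inconclusive"             # errors showed up somewhere else


# Suspect module moved from P1-1B to P1-2B:
print(diagnose_swap("P1-1B", "P1-2B", moved_to="P1-2B"))
print(diagnose_swap("P1-1B", "P1-1B", moved_to="P1-2B"))
```

An "inconclusive" result would fit the other causes mentioned in this thread, such as poor seating.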


----------



## tingo (Jan 22, 2021)

Or even one of the two power supplies. This is server-class hardware, after all.


----------

