# FreeBSD intermittent crash!



## shahzaib (Dec 16, 2015)

Hi,

We're new to FreeBSD as well as this forum, so please pardon any mistakes here.

We switched to FreeBSD recently because of its improved ARC caching and asynchronous performance, but so far our experience with it has not been very good. It crashes every 2-3 days and we're unable to track down the problem. The server specs are pretty high:


Supermicro X5690 (12 cores, 24 threads - 2u)
96GB RAM
12x3TB RAID-10 (HBA-LSI9211)
X8DT3 Board
Supermicro PS- 902-1R 900W

Here is a screenshot of the most recent crash:

http://prntscr.com/9er3pk

One thing worth mentioning: before it goes down there is no load on the server, and free RAM is usually around 12GB.

Please help us resolve this issue. It's really killing us.


----------



## SirDice (Dec 16, 2015)

Which version of FreeBSD?


----------



## shahzaib (Dec 16, 2015)

Thanks for the quick response. Here it is:

```
FreeBSD 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
```


----------



## PacketMan (Dec 16, 2015)

I'll let the more experienced folks suggest what could be the cause of the crash, but it looks like you are running a base, unpatched version of 10.1-RELEASE. Consider applying the patches. One of my machines, for example, is running 10.1-RELEASE-p24, and it's likely there is an even higher patch level available. It's possible the cause of your issue has already been addressed, and keeping the system patched and up to date is good practice.

Start reading this whole document as time permits; patching and updating are covered in Section 23.2.

http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/


----------



## shahzaib (Dec 16, 2015)

Thanks a lot. I am about to apply patches. I'll update here soon. 

Meanwhile looking forward to more expert advice.


----------



## SirDice (Dec 16, 2015)

You may want to check for hardware errors, the mca_intr call refers to Machine Check Architecture. But I'm not sure if it's the cause of the panic(9).

You said it crashes every 2-3 days, but the picture shows it has been up almost 12 days.


----------



## shahzaib (Dec 16, 2015)

Right, thanks. Regarding updating, I've found tons of patches and am about to update now, but one point is very important before the upgrade takes place: is there any chance of zpool corruption after the upgrade? We have around 16TB of data in the zpool. Sorry for the newbie question, but I am new to FreeBSD.


----------



## SirDice (Dec 16, 2015)

There shouldn't be a risk to your existing pools, but it's always a good idea to have proper backups of course.


----------



## ondra_knezour (Dec 16, 2015)

You may encounter a minor shock if:

- the system boots from the given pool
- ZFS was upgraded between the version you had before and the one you upgraded to
- you run `# zpool upgrade` and forget to also upgrade the boot code

This may render your system unbootable, because the boot code would not be able to read the ZFS filesystem from which the system has to boot. However, data will not be lost, and you can fix it by booting a live system new enough to contain the same version of ZFS as your new pool and running `# gpart bootcode <required params>`.


----------



## shahzaib (Dec 16, 2015)

I am trying to update the system using freebsd-update(8) install, but the output is really insane:


```
Installing updates...install: ///usr/src/contrib/file/magic/Magdir/kerberos: No such file or directory
install: ///usr/src/contrib/file/magic/Magdir/meteorological: No such file or directory
install: ///usr/src/contrib/file/magic/Magdir/qt: No such file or directory
install: ///usr/src/contrib/ntp/html/drivers/driver40-ja.html: No such file or directory
install: ///usr/src/contrib/ntp/html/drivers/driver45.html: No such file or directory
install: ///usr/src/contrib/ntp/html/drivers/driver46.html: No such file or directory
install: ///usr/src/contrib/ntp/html/drivers/mx4200data.html: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/accopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/audio.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/authopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/clockopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/command.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/config.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/confopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/external.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/hand.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/install.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/manual.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/misc.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/miscopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/monopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/refclock.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/special.txt: No such file or directory
install: ///usr/src/contrib/ntp/include/declcond.h: No such file or directory
install: ///usr/src/contrib/ntp/include/intreswork.h: No such file or directory
install: ///usr/src/contrib/ntp/include/lib_strbuf.h: No such file or directory
install: ///usr/src/contrib/ntp/include/libntp.h: No such file or directory
install: ///usr/src/contrib/ntp/include/ntp_assert.h: No such file or directory
```


----------



## da1 (Dec 16, 2015)

As a workaround, just create the missing directories and run the fetch and install commands again.


----------



## ANOKNUSA (Dec 16, 2015)

da1 said:


> As a workaround, just create the missing directories and run the fetch and install commands again.



Or, since all the errors are in the source code distribution that the OP isn't using anyway, just configure freebsd-update(8) to exclude the source code. See freebsd-update.conf(5) for info on that.

EDIT: Unless the source code _is_ being used for something. I shouldn't presume too much, sorry.


----------



## feld (Dec 16, 2015)

ondra_knezour said:


> You may encounter minor shock if
> - system boots from given pool
> - ZFS was upgraded between version you had before and you upgraded into
> - you run `# zpool upgrade`
> ...



I haven't seen a zpool upgrade happen within security/errata patches. If he upgraded from 10.1-RELEASE to 10.2-RELEASE this might be the case.

All of my research on MCA panics reported to the mailing lists so far seems to indicate a hardware issue -- usually the processor.



da1 said:


> As a workaround, just create the missing directories and run the fetch and install commands again.



You can just remove src from Components in /etc/freebsd-update.conf to make those messages go away with future updates.


----------



## shahzaib (Dec 17, 2015)

Thanks, guys, for the workaround. I created the missing directories, updated, and rebooted the OS.


```
[root@cw001 ~]# uname -a
FreeBSD 10.1-RELEASE-p24 FreeBSD 10.1-RELEASE-p24 #0: Mon Nov  2 12:17:28 UTC 2015     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
```


----------



## shahzaib (Dec 17, 2015)

feld said:


> I haven't seen a zpool upgrade happen within security/errata patches. If he upgraded from 10.1-RELEASE to 10.2-RELEASE this might be the case.
> 
> All of my research on MCA panics reported to the mailing lists so far seem to indicate a hardware issue -- usually processor.
> 
> ...



Thanks for the tips. I'll monitor this server to see if it crashes again.


----------



## PacketMan (Dec 17, 2015)

shahzaib said:


> Thanks for the tips. I'll monitor this server to see if it crashes again.



Of course, and don't forget that SirDice asked you to check for hardware errors. Regardless of whether you have crashes or not, updating to the latest patch release is good practice.


----------



## _martin (Dec 17, 2015)

You're missing the important information about the crash in the picture: the panic message. Only the backtrace is shown.
Do you have crash dumps configured? If so, you can find the text info in /var/crash/core.txt.$N by default, where N is the number of the last crash.

If you don't have it set, look at *dumpdev* in the /etc/rc.conf configuration.

Does it crash regularly (though a 12-day uptime doesn't fit the "crashes every two days" description)?
Is some heavy job scheduled to run during that period? You said no, but were you logged in and monitoring just before it crashed?

When it comes to FreeBSD, I'd push for version 10.2, as the developers improve performance with every release.

As SirDice mentioned already, do check for hardware issues, especially with non-ECC RAM. Running Memtest86+ for a few hours, etc., could reveal possible memory issues.


----------



## shahzaib (Dec 17, 2015)

matoatlantis said:


> You're missing the important information about the crash in the picture: the panic message. Only the backtrace is shown.
> Do you have crash dumps configured? If so, you can find the text info in /var/crash/core.txt.$N by default, where N is the number of the last crash.
> 
> If you don't have it set, look at *dumpdev* in the /etc/rc.conf configuration.
> ...



Thanks for the detailed answer. Yes, dump is configured and I can find a big core.txt.0 text file. I don't know how to debug it to find the cause of the crash, so I am attaching it here.


----------



## _martin (Dec 17, 2015)

I was looking for the panic string only. The information in core.txt is confidential to some extent. Nowadays I'd be more paranoid than not.
Remove the attachment.

Interesting part is:


```
panic: Unrecoverable machine check exception

Unread portion of the kernel message buffer:
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 5
MCA: CPU 5 UNCOR PCC internal timer error
MCA: Address 0x802bf6e59
MCA: Misc 0x0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 4
MCA: CPU 4 UNCOR PCC internal timer error
MCA: Address 0x802bf6e59
MCA: Misc 0x0
panic: Unrecoverable machine check exception
cpuid = 7
```

You can actually search the forums here for this MCA string. At first look I would focus on the CPU and memory.

Now, is it a false alarm or is it really a hardware problem? I don't know right now; I would need to google around too. Some searches suggest an issue with the firmware (BIOS) on the motherboard. I'd check that too (compare the firmware/BIOS version of the board to the vendor's latest update, etc.).
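
To pull just the MCA records out of a large core.txt, something along these lines might help. This is only a minimal sketch in Python: the function name `parse_mca_records` and the regular expressions are my own assumptions, based purely on the line format shown above, not on any FreeBSD tool.

```python
import re

# Patterns follow the kernel message format quoted above. This is an
# assumption: only the lines shown in this thread were checked.
BANK_RE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")
CPU_RE = re.compile(r"MCA: CPU (\d+) (.+)")


def parse_mca_records(text):
    """Return one dict per 'MCA: Bank ...' record found in text."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        bank = BANK_RE.match(line)
        if bank:
            # A 'Bank' line starts a new record.
            current = {"bank": int(bank.group(1)), "status": int(bank.group(2), 16)}
            records.append(current)
            continue
        cpu = CPU_RE.match(line)
        if cpu and current is not None:
            # The 'CPU' line carries the CPU number and the short error summary.
            current["cpu"] = int(cpu.group(1))
            current["summary"] = cpu.group(2)
    return records


SAMPLE = """\
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 5
MCA: CPU 5 UNCOR PCC internal timer error
MCA: Address 0x802bf6e59
MCA: Misc 0x0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 4
MCA: CPU 4 UNCOR PCC internal timer error
MCA: Address 0x802bf6e59
MCA: Misc 0x0
"""

for rec in parse_mca_records(SAMPLE):
    print(rec["cpu"], rec["bank"], hex(rec["status"]), rec["summary"])
```

Collecting the records from each crash this way makes it easier to see whether the same bank and error type keep showing up.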


----------



## kpa (Dec 17, 2015)

My first assumption would be that there really is something wrong with the hardware, and I would take the issue to the manufacturer of the motherboard and look at their documentation and web support.


----------



## shahzaib (Dec 23, 2015)

Guys, the same server rebooted on its own again, and the zpool didn't even mount itself, though it is enabled in rc.conf and loaded in loader.conf. Here is the panic log:


```
panic: Unrecoverable machine check exception

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
MCA: CPU 3 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
MCA: CPU 2 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
panic: Unrecoverable machine check exception
cpuid = 8
KDB: stack backtrace:
#0 0xffffffff80962d90 at kdb_backtrace+0x60
#1 0xffffffff80927eb5 at panic+0x155
#2 0xffffffff80e3bfeb at mca_intr+0x6b
#3 0xffffffff80d24c09 at trap+0x99
#4 0xffffffff80d0aec2 at calltrap+0x8
#5 0xffffffff80361eea at acpi_cpu_idle+0x13a
#6 0xffffffff80d0f89f at cpu_idle_acpi+0x3f
#7 0xffffffff80d0f940 at cpu_idle+0x90
#8 0xffffffff80953585 at sched_idletd+0x1d5
#9 0xffffffff808f88fa at fork_exit+0x9a
#10 0xffffffff80d0b3fe at fork_trampoline+0xe
```
----------------------

Where should I look? Some people are suggesting disabling the MCA panic by adding hw.mca.enabled=0 to the file /boot/loader.conf.

Please help


----------



## SirDice (Dec 23, 2015)

Don't disable MCA, it's reporting hardware errors. You can use sysutils/mcelog to translate those MCA messages:

```
dice@test:~ % mcelog --ascii --no-syslog
mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
MCA: CPU 3 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
MCA: CPU 2 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5
MISC 0 ADDR 802bf6a69
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
```
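
For reference, the status flags mcelog prints can be read straight off the Status value. The sketch below assumes the bit layout of the IA32_MCi_STATUS register as documented in the Intel Software Developer's Manual (bits 63-57 are flags, bits 31:16 a model-specific code, bits 15:0 the MCA error code); the function name `decode_mci_status` is just for illustration.

```python
# Flag bits in the upper part of IA32_MCi_STATUS (layout assumed from the
# Intel SDM chapter on the Machine-Check Architecture).
STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Simple MCA error code that the SDM lists as "internal timer error".
INTERNAL_TIMER_ERROR = 0x0400


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its flags and error codes."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


decoded = decode_mci_status(0xBE00000000800400)
for flag in decoded["flags"]:
    print(flag)
print("MCA error code:", hex(decoded["mca_error_code"]))
print("internal timer error:", decoded["mca_error_code"] == INTERNAL_TIMER_ERROR)
```

The set flags line up with the "Uncorrected error", "Error enabled", "register valid" and "Processor context corrupt" lines in the mcelog output.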


----------



## shahzaib (Dec 23, 2015)

Thanks, here is the output:


```
[root@cw001 /var/crash]# mcelog --no-dmi --ascii --file core.txt.1 
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
```


----------



## SirDice (Dec 23, 2015)

You have hardware errors. No amount of fiddling with software settings is going to change that fact.


----------



## shahzaib (Dec 23, 2015)

Thanks SirDice, is there a way I can find out the specific hardware component which is causing this panic?


----------



## shahzaib (Dec 26, 2015)

We've encountered the panic again, on another Supermicro X5690 server. It crashed and rebooted automatically:


```
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 17 BANK 5
MISC 0 ADDR 802bf6c60
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 25 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 16 BANK 5
MISC 0 ADDR 802bf6c60
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 24 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 17 BANK 5
MISC 0 ADDR 802bf6c60
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 25 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 16 BANK 5
MISC 0 ADDR 802bf6c60
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 24 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
```

We think it may be related to a power supply issue, because the X5690 is a heavy CPU that requires more power when encoding videos with ffmpeg, but we're not sure. Is there any way to find the specific hardware component that is the culprit for the panic?

CPU 17 BANK 5 -- what does it mean?


----------



## shahzaib (Dec 31, 2015)

I showed those hardware errors to the vendor from whom we purchased the Supermicro servers. This is what he had to say:

-----------------------------------
Why do you not made one test environment with CentOS or one other Linux that you know to use, and see if you have same errors ??? if not than you know that the errors come from OS not from hardware. ( CentOS, RedHead….work diferend like FreeBSD – work direct on hardware if you don’t have the right kernel settings can the server crashed. CentOS , RedHead…. don’t work direct on hardware and distribute the resource load better and you have better control and you can better debug one situation)
-----------------------------------

According to the vendor, the issue could be related to software. What should we do now? The servers are crashing a lot with the same crash dump. Today we saw slightly different behaviour: a server crashed but no crash dump was created. We're really stuck on this.

One more point worth mentioning: we installed these servers by restoring a FreeBSD backup image onto them with Clonezilla.

Now I am going to disable MCA permanently. What would be the disadvantages of disabling it?

hw.mca.enabled="0"


----------



## Terry_Kennedy (Dec 31, 2015)

shahzaib said:


> We've encountered the panic again, on another Supermicro X5690 server. It crashed and rebooted automatically:
> 
> 
> ```
> ...


It is possible it is power-related, but (unless you have an easily swappable power supply), I'd suspect the CPU chip, CPU cooler, motherboard first and then chassis cooling / memory / power supply if I didn't find the issue in the first set of parts I replaced.



> I showed those hardware errors to the vendor from whom we purchased the Supermicro servers. This is what he had to say:
> 
> -----------------------------------
> Why do you not made one test environment with CentOS or one other Linux that you know to use, and see if you have same errors ??? if not than you know that the errors come from OS not from hardware. ( CentOS, RedHead….work diferend like FreeBSD – work direct on hardware if you don’t have the right kernel settings can the server crashed. CentOS , RedHead…. don’t work direct on hardware and distribute the resource load better and you have better control and you can better debug one situation)
> -----------------------------------


That sounds like the vendor being discussed in this post. They refused to take responsibility even after they demonstrated the problem on CentOS. The customer in that case ended up returning the system for a partial refund (just to get away from that vendor) and buying one from a vendor who is specifically FreeBSD-friendly (iXsystems) and things went fine except for being fumbled repeatedly by the "remote hands" in the data center they were using.

Supermicro lists FreeBSD as one of their supported operating systems here, by the way. Products without a checkmark for FreeBSD were probably not tested, but are compatible. In some cases, FreeBSD may not support some particular onboard device - for example, integrated WiFi - and that is why there is no checkmark. FreeBSD would boot on such a board and operate properly, but would be less useful due to not having drivers for some of the onboard hardware.

More below.



> Now I am going to disable MCA permanently. What would be the disadvantages of disabling it?
> 
> hw.mca.enabled="0"


If you successfully disable MCA, when the system has a fault that would normally be handled by MCA, the system will continue to run (possibly with corrupt data) until it either locks up or the fault manifests at a higher level (like a panic from a corrupted instruction stream). In either case it will be *much* harder to isolate the actual cause.

Intel publishes a list of instructions their CPUs support and the requirements for each instruction. It doesn't matter what operating system you use, or if you write a program on "the bare metal" - everybody has to play by the same set of rules or the CPU will reject it. There are occasionally errata published by Intel which describe a particular instruction sequence that either a) should be permissible, but which has unexpected results / side effects (example - FDIV bug), or b) should be rejected, but is incorrectly accepted (example - f00f bug). It may be that some operating systems are affected to a greater or lesser extent due to their particular code paths, but the problem is still in the hardware. The vast majority of code bugs will be visible *much* higher up, with a trap due to referencing unallocated memory, etc. MCA happens at a much lower level inside the CPU. Data paths are checked, and currently-inactive units may run self-tests. If a machine check happens, it is pretty much certain that the problem is within the tightly-coupled area of the CPU and memory controller (integrated on the CPU for some time now). I'd go on, but hopefully I've made my point.


----------



## shahzaib (Jan 7, 2016)

Hi,

I showed the crash logs to our hosting team, which is very good with hardware, and this is what they replied:

-------------------------------------------------
I've set bios to optimized defaults, (this changed the multiplier back to 26) and disabled the following OC / powersaving features:


- Intel Turbo Boost
- Intel C-STATE Tech
- Intel EIST tech
- C1E Support


I have to mention that FreeBSD is not meant to work flawless on any hardware, we had situations in the past were our client had to switch from FreeBSD to other OS as the hardware - software combination was causing unexpected crashes.

-------------------------------------------------------


----------



## _martin (Jan 7, 2016)

Did you overclock the hardware? If so, I'm sorry, but FreeBSD has nothing to do with it (the MCA is related to the hardware here, not to a faulty OS).
_Internal timer error_ indicates problem with CPU/clock and "related".


----------



## shahzaib (Jan 7, 2016)

I don't know if it was overclocked or not, because I didn't do anything except reduce the CMOS multiplier ratio from the maximum (26) to 18. Is there some other option in the BIOS for overclocking? Is Intel Turbo Boost what overclocking is?

Sorry, I am new to this overclocking stuff.


----------



## _martin (Jan 8, 2016)

You didn't mention what board you have. 

The last 3 are basically power saving settings. If you put them into Google you'll find good explanations of them.
Turbo boost - http://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html. 

As I mentioned earlier - I would check for BIOS updates or known problems with these technologies on the board you have. Interesting that power saving options are off in opt settings. Maybe manufacturer knows why.  

Intel does have CPU diagnostic tools on their site. Some are for Windows only, but still worth putting a Windows-installed disk there for that purpose temporarily.


----------



## shahzaib (Jan 8, 2016)

The Supermicro motherboard is an "X8DT3" and is already using the latest BIOS.



matoatlantis said:


> ...
> Interesting that power saving options are off in opt settings. Maybe manufacturer knows why.


The power saving options were not off; our support team has only now disabled them.


----------



## shahzaib (Jan 8, 2016)

Unfortunately the server went down again with the following error on screen:

http://prntscr.com/9nlzvr

and it didn't generate any crash dump under the /var/crash directory. Would you please let me know why the server failed to generate a crash dump?
On further checking the OC / power saving options in the BIOS, I found that the Intel EIST option had not been disabled by our support guy, even though he mentioned in his reply that he had disabled it (maybe he forgot). So I've disabled the EIST option now and rebooted the server to keep monitoring it.

http://prntscr.com/9nm1yp


----------



## shahzaib (Jan 9, 2016)

As a test, we have been running FreeBSD with hw.mca.enabled: 0 on one of the identical Supermicro servers. Here is its uptime since then:


```
[root@cw002 /var/crash]# uptime
1:52PM  up 8 days,  2:36, 1 user, load averages: 2.62, 0.73, 0.50
```

So ignoring MCA hasn't caused any downtime so far. Should we disable MCA on all servers?


----------



## RedShift1 (Jan 9, 2016)

shahzaib said:


> As a test, we have been running FreeBSD with hw.mca.enabled: 0 on one of the identical Supermicro servers. Here is its uptime since then:
> 
> 
> ```
> ...



By flipping hw.mca.enabled to 0, the kernel is deliberately ignoring a signal from the processor that it detected an error in its operation. That signal is something serious, the root cause of it must be determined.

I would start with the basics: do a memtest86 (http://www.memtest.org/). Because you experience the problem only very intermittently (multiple days pass by), don't trust the result for just one pass, do multiple passes. Also do the bitfade test.

Furthermore, you suspected the power supplies before but can you post full system specs so the power requirements can be assessed and determined if inadequate?

Keep collecting as much information about the crashes as you can (screenshot of the on-screen messages, output of mcelog, etc...), so we can see if there's a pattern.

So do the memtest and let's take it from there.


----------



## leebrown66 (Jan 9, 2016)

shahzaib said:


> Should we disable MCA on all servers?


You indicate you have more than 1 server:

Is this problem happening on all of them?
Do they all have the same CPU, motherboard and chassis?
Do they all have identical BIOS settings?

Are they all running the same version of FreeBSD?


----------



## shahzaib (Jan 10, 2016)

RedShift1 said:


> By flipping hw.mca.enabled to 0, the kernel is deliberately ignoring a signal from the processor that it detected an error in its operation. That signal is something serious, the root cause of it must be determined.
> 
> I would start with the basics: do a memtest86 (http://www.memtest.org/). Because you experience the problem only very intermittently (multiple days pass by), don't trust the result for just one pass, do multiple passes. Also do the bitfade test.
> 
> ...



Hi, actually we have a total of 5 Supermicro servers with the same specs, all performing the same job of video encoding and serving over port 80. The reason I don't suspect memory is that it's highly unlikely that all 5 servers have faulty memory, yet all of them go down intermittently. Still, I guess I'll try memtest on one of them. Could you please give some guidance on how to test with memtest? Do I need to boot the server from the memtest ISO and let it run for days, or is there another way that avoids the downtime?



RedShift1 said:


> ...
> Furthermore, you suspected the power supplies before but can you post full system specs so the power requirements can be assessed and determined if inadequate?


All servers are built upon following components :

2 x Intel X5690 @ 3.47GHz (12 cores, 24 threads)
12 x 3TB SATA 7200rpm
12 x 8GB DIMM (96GB Memory)
2 x 800W Redundant PS
X8DT3 Board
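
As a very rough starting point for the power question, the budget for the components listed above could be sketched like this. Only the 130 W TDP per X5690 is a published Intel figure; the per-drive, per-DIMM and board/fan wattages are placeholder assumptions and should be replaced with measured or datasheet values.

```python
# Rough steady-state power budget for one server.
# Drive spin-up peaks are NOT modelled here.
CPU_TDP_W = 130          # Intel's published TDP for the Xeon X5690
DRIVE_W = 10             # ASSUMPTION: per 3.5" 7200rpm SATA drive under load
DIMM_W = 5               # ASSUMPTION: per 8GB DIMM
BOARD_FANS_HBA_W = 100   # ASSUMPTION: motherboard, fans and HBA combined
PSU_W = 800              # one supply of the redundant pair must carry the full load


def estimate_load_watts(cpus=2, drives=12, dimms=12):
    """Sum the assumed per-component draws for the listed configuration."""
    return cpus * CPU_TDP_W + drives * DRIVE_W + dimms * DIMM_W + BOARD_FANS_HBA_W


load = estimate_load_watts()
print(f"estimated load: {load} W, headroom on one {PSU_W} W supply: {PSU_W - load} W")
```

A figure like this only says whether the supplies are plausibly sized on paper; it cannot rule out a failing supply or a marginal voltage rail.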



RedShift1 said:


> ...
> Keep collecting as much information about the crashes as you can (screenshot of the on-screen messages, output of mcelog, etc...), so we can see if there's a pattern.


Here is the most recent crash, which occurred last night when the load average on the server was 0.2:

http://pastebin.com/vCxypF2z


----------



## shahzaib (Jan 10, 2016)

leebrown66 said:


> You indicate you have more than 1 server:
> 
> Is this problem happening on all of them?
> Do they all have the same CPU, motherboard and chassis?
> ...



Yes, we have 5 servers with the same specs doing the identical job of encoding videos and storing them; you can call it a cluster of servers.

Yes, the problem is happening on all of them: same CPU, motherboard, chassis, and identical BIOS settings.


leebrown66 said:


> ...
> Are they all running the same version of FreeBSD?


Yes, we restored the same FreeBSD image from Clonezilla on all 5 servers, so the configs are the same.


----------



## leebrown66 (Jan 10, 2016)

Considering they all exhibit the same problem, I'd put the BIOS to default settings first.  Then if you still experience problems, do exactly what the vendor suggests, throw a different OS on there and run the same code.  If it still crashes, it's obviously not OS related therefore must be hardware (per the MCA message, although mis-configured BIOS can cause hardware problems, hence setting it to defaults).

Was this a self-build or did you purchase them pre-built?  If self-build, the vendor is going to be your best source of support.  If purchased, you have a good case for returning them as it indicates they did something wrong.


----------



## shahzaib (Jan 10, 2016)

One point worth mentioning is the model of the power supply, which is "Supermicro PS- 902-1R 900W". I've now edited it into my first post.


----------



## shahzaib (Jan 18, 2016)

After lots of testing, we've decided to install Debian on one of these Supermicro servers. If it stops crashing after that, we'll conclude that FreeBSD has issues supporting this Supermicro hardware. If that is the case, then we'll consider buying a Dell R510 server with an LSI-9211 controller and installing FreeBSD on it. Could anyone please let us know whether Dell supports FreeBSD? In the following thread, a user reports being unable to run FreeBSD smoothly on a Dell due to some issue with the LSI-9211 controller, although it worked fine with other distros, e.g. Red Hat:

http://hardforum.com/showthread.php?t=1681334

If you guys can give some advice on whether we should go with Dell for FreeBSD, it would be really appreciated. Our Dell configuration would be the following:

Dell R510
2 x x5675 (12 cores, 24 threads)
RAM 64GB
12 x 3TB SATA Raid-10 (HBA LSI-9211)


----------



## Terry_Kennedy (Jan 19, 2016)

shahzaib said:


> Could anyone please let us know whether Dell supports FreeBSD? In the following thread, a user reports being unable to run FreeBSD smoothly on a Dell due to some issue with the LSI-9211 controller, although it worked fine with other distros, e.g. Red Hat:
> 
> http://hardforum.com/showthread.php?t=1681334


I'm running FreeBSD on an R710 and an NX3100 (which is based on something else in the Rx10 series - possibly even the R510). If you have 12 drives in that R510, there is a SAS expander in there (on the drive backplane). SATA drives behind a SAS expander are normally a bad idea. It is too long to go into here, though. The Dell H700 controller for that R510 is a nice controller and supported in FreeBSD by the mfi(4) driver. Since it is a Dell controller, it will either warn or flat-out refuse to work with non-Dell-certified drives (newer controller firmware just warns).

I also have a number of Supermicro systems running FreeBSD without any problems. They're all X8-series motherboards.


----------



## shahzaib (Jan 19, 2016)

Terry, thanks for the detailed reply. I'll go through it, but first here is a summary of our recent diagnosis, though it's not confirmed yet.

So we went with Debian 8 on one of the Supermicro servers. Things were pretty stable for the first two days, though we saw lots of CPU heating messages in the kernel log. As long as the server didn't crash, we didn't bother Googling them. Eventually, this morning around 6:00am, the server crashed and rebooted automatically. Reading the kernel logs, we found the same heating messages that had been occurring on Debian 8 from the beginning, right before the crash. The logs boil down to:

```
Jan 19 05:16:29 cws004 kernel: [362404.826424] mce: [Hardware Error]: Machine check events logged
Jan 19 05:16:29 cws004 mcelog: Processor 17 heated above trip temperature. Throttling enabled.
Jan 19 05:16:29 cws004 mcelog: Please check your system cooling. Performance will be impacted
Jan 19 05:16:29 cws004 mcelog: Processor 5 heated above trip temperature. Throttling enabled.
Jan 19 05:16:29 cws004 mcelog: Please check your system cooling. Performance will be impacted
Jan 19 05:16:29 cws004 mcelog: Processor 5 below trip temperature. Throttling disabled
Jan 19 05:16:29 cws004 mcelog: Processor 17 below trip temperature. Throttling disabled
```
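To see whether the same logical processors keep tripping, the throttle messages can be tallied per processor. This is only a rough sketch: the function name and the sample text are made up for illustration, and it assumes the mcelog lines look exactly like the excerpt above. On a real box you would feed it the contents of whatever file your syslog writes these lines to.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt above.
SAMPLE_LOG = """\
Jan 19 05:16:29 cws004 mcelog: Processor 17 heated above trip temperature. Throttling enabled.
Jan 19 05:16:29 cws004 mcelog: Please check your system cooling. Performance will be impacted
Jan 19 05:16:29 cws004 mcelog: Processor 5 heated above trip temperature. Throttling enabled.
Jan 19 05:16:29 cws004 mcelog: Processor 5 below trip temperature. Throttling disabled
"""

# Matches only the "throttling enabled" events, not the recovery messages.
THROTTLE_RE = re.compile(r"mcelog: Processor (\d+) heated above trip temperature")


def count_throttle_events(log_text):
    """Return a Counter mapping logical processor number -> throttle events."""
    counts = Counter()
    for line in log_text.splitlines():
        match = THROTTLE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    for cpu, hits in sorted(count_throttle_events(SAMPLE_LOG).items()):
        print(f"processor {cpu}: {hits} throttle event(s)")
```

If one or two processor numbers dominate the tally, that would point at a cooling problem local to one socket rather than the whole chassis.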

-----------------------------------------------------------

On reporting this issue to the DC, this is what they did to fix the heating issue:

The airflow for CPU1 was obstructed by a "feature" of the plastic cover that should have been removed in a 2 CPU scenario.  We removed it, and now both CPUs are cooled correctly.
------------

Now we're not sure whether FreeBSD was also crashing due to that heating issue, as the logs were different on FreeBSD (MCA: internal timer error), while Debian's logging was much more explicit and let us debug the issue on the spot. We'll monitor the servers for a few days and update the thread about the situation.


----------



## fossette (Jan 21, 2016)

Sounds like a problem I encountered when I started with FreeBSD.  Demanding too much of the CPU may freak out its heat sensors.
https://forums.freebsd.org/threads/one-solution-to-kernel-panic-computer-reboot.51941/


----------



## shahzaib (Feb 19, 2016)

Hi, came back after a long time. So yes, the issue is not solved and soon we'll replace the hardware, though I noticed some more errors when the server came back online after the crash (no crash log though). Here is the log, something related to the LSI SAS controller I guess:

http://pastebin.com/piDcMszC

Could that be the cause of the crash? It looks like the errors were generated during boot rather than before the crash, though. Still not sure, and I need to learn how to read these logs.

Thanks to all for your help !!


----------



## leebrown66 (Feb 20, 2016)

Stating the obvious, you have a disk problem.  Command 12 is (as the log states) an INQUIRY command (reference here), which is failing.  The SCSI command reference can be found via a link on the bottom of that page.  The driver is issuing an INQUIRY command in order to fetch data about the disk(s) and there's some kind of problem.  I've only seen this type of fault with disks that have read faults due to age, but I wouldn't rule out a bad card or cables either.  Try swapping cables/moving the disks to different ports to see if the error moves, which would indicate a faulty disk.  If the error sticks on the same port I'd guess a bad card.  Try another card if you have one....


----------



## shahzaib (Feb 20, 2016)

leebrown66 said:


> Stating the obvious, you have a disk problem.  Command 12 is (as the log states) an INQUIRY command (reference here), which is failing.  The SCSI command reference can be found via a link on the bottom of that page.  The driver is issuing an INQUIRY command in order to fetch data about the disk(s) and there's some kind of problem.  I've only seen this type of fault with disks that have read faults due to age, but I wouldn't rule out a bad card or cables either.  Try swapping cables/moving the disks to different ports to see if the error moves, which would indicate a faulty disk.  If the error sticks on the same port I'd guess a bad card.  Try another card if you have one....



Thanks for the reply. We're using 12 x 3TB HDDs, striping+mirroring, behind an LSI-9211 HBA controller. Could you kindly answer some of my newbie questions?

- When you said moving disks to different ports, does that mean moving disks to different ports of the controller?
- It looks like these logs were generated at boot time rather than at the crash, so do you think that could be the cause of the crash? If they were the cause, they should have been generated just before the crash event, though I still need clarity on that.


----------



## leebrown66 (Feb 20, 2016)

> ...moving disks to different ports, does that mean moving disks to different ports of the controller?


Yes, either the controller or the expander backplane.  You are really trying to see if the mps0:0:3:0 and mps0:0:3:1 entries change.  I am not familiar with the mps driver, so I don't know how it indexes the disks, but if you can identify which disk is 0:3:0, swap it with, for example, 0:3:2.  If you still get an error on 0:3:0, it looks like the card; if the error moves to 0:3:2, it must be the disk or cable (or expander, I suppose).



> ...do you think that could be the cause of the crash


That's impossible to say really.  For example, it's possible a disk has been silently corrupted during a write operation (imagine faulty controller on the disk itself).  If it's a mirrored RAID you may have had a read from a good disk with no crash and at another time a read from a bad disk causing a crash, making the crash random.  That's also assuming it's code that got corrupted.  I would have thought you'd get a different error from the RAID controller though, not an INQUIRY.
The LSI's BIOS should have a check disk operation you could try if you can isolate a single disk.  As you are using ZFS, try scrubbing.


----------



## Terry_Kennedy (Feb 21, 2016)

leebrown66 said:


> Stating the obvious, you have a disk problem.  Command 12 is (as the log states) an INQUIRY command (reference here) , which is failing.


No, it is a regression in FreeBSD. See the FreeBSD-stable@ thread I started here. It is harmless, it just delays the boot process for a while during the retries.


----------



## leebrown66 (Feb 21, 2016)

Terry_Kennedy said:


> No, it is a regression in FreeBSD. See the FreeBSD-stable@ thread I started here. It is harmless, it just delays the boot process for a while during the retries.


I saw that, but I didn't think 10.1-RELENG had that problem, but now I realize it's been introduced sometime after 8.4.  Sorry for the misdirection.


----------



## shahzaib (Feb 21, 2016)

Terry_Kennedy said:


> No, it is a regression in FreeBSD. See the FreeBSD-stable@ thread I started here. It is harmless, it just delays the boot process for a while during the retries.


Hi, so you're saying that shouldn't be the cause of the crash?


----------



## Terry_Kennedy (Feb 22, 2016)

shahzaib said:


> Hi, so you're saying that shouldn't be the cause of the crash?


Correct. It is a harmless (other than the delay) set of messages during boot, and also when a rescan of the bus is requested.


----------



## shahzaib (Apr 18, 2016)

Hi again, got back after a long time. So yes, we've moved to new Dell R510 hardware now. Here are the specs:

DELL R510
2 x L5520
64GB RAM
12x3TB RAID striping+mirroring (HBA LSI-9211, fw version 19.00)
FreeBSD cw009.tunefiles.com 10.2-RELEASE-p14 FreeBSD 10.2-RELEASE-p14 #0: Wed Mar 16 20:46:12 UTC 2016     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

After 9 days of uptime, the server crashed again with the following error in the crash log:

http://pastebin.com/baShWuMP

I am so depressed now; there's a lot of pressure on me from my company. Please help us resolve this crash issue.


----------



## shahzaib (Apr 19, 2016)

Is FreeBSD-10.2 a stable release? Do I need to upgrade to 10.3 using the freebsd-update utility?


----------



## SirDice (Apr 19, 2016)

FreeBSD 10.2 is supported until the end of this year. So yes, it's a stable release.

https://www.freebsd.org/security/security.html#sup


----------



## shahzaib (Apr 19, 2016)

SirDice, we really need help now. We've discarded Supermicro and moved to a Dell R510, but the errors look to be the same 'Internal Timer error'. I am so depressed and have no idea where to go from here.


----------



## SirDice (Apr 19, 2016)

The errors look like hardware errors. Apparently you're not that lucky when it comes to hardware.

As a side note, one of my clients has around 25 SuperMicro servers (old, new, various sizes) and they all run FreeBSD (mostly 9.3, some 10.1) without any issues.


----------



## shahzaib (Apr 19, 2016)

SirDice, please don't demoralize me more, I am already very depressed. I don't know what to do. Someone on another FreeBSD mailing list suggested updating the CPU microcode, but it looks like Intel doesn't provide a microcode file for FreeBSD:

- Install microcode updates and hope it will fix it

Intel offers a microcode update for many CPUs.
https://downloadcenter.intel.com/download/25512/Linux-Processor-Microcode-Data-File?v=t


----------



## shahzaib (Apr 19, 2016)

Could a machine check exception be caused by software/application?  Maybe we're on the wrong side of troubleshooting and the culprit is our application, not the hardware? The servers run the following programs:

- NGINX + PHP_FPM (Uploading videos and streaming them to end users)
- FFMPEG (Encode the uploaded videos to make it ready for streaming)


----------



## CurlyTheStooge (Apr 19, 2016)

Very depressing thread, not hard to feel sympathy for the OP. But hard to believe that they got faulty hardware back to back, I can't buy this.

Shahzaib, what's the response from Dell on this though? 
Also, instead of Debian 8, I'd suggest CentOS 7.1/7.2 for a test on one of the servers.

Regards.


----------



## shahzaib (Apr 20, 2016)

CurlyTheStooge, thanks for showing sympathy. Well, that's all Dell has to say for now:

http://en.community.dell.com/support-forums/servers/f/956/p/19682411/20901711#20901711


----------



## _martin (Apr 20, 2016)

When you said you moved to the new HW, does that mean brand new? Meaning no parts were moved from server to server (RAM, e.g.).
Also, who installed those servers? Was proper procedure followed?

I too can imagine how stressful it must be to deal with HW behaving like this. It seems way too unlikely to get so much faulty HW.
It might be worth moving parts around in order to single out the faulty item. Remove all but one CPU, and keep it running with the lowest number of memory modules possible. And do real stress tests: put quite a load on those servers.


----------



## CurlyTheStooge (Apr 20, 2016)

shahzaib said:


> CurlyTheStooge, thanks for showing sympathy. Well, that's all Dell has to say for now:
> 
> http://en.community.dell.com/support-forums/servers/f/956/p/19682411/20901711#20901711



I can understand your situation, been there.
Well, Dell also has stated they didn't validate Debian on these servers. Might be good to try CentOS 7.1 / 7.2 I think?


Cheers from neighborhood.
Regards.


----------



## shahzaib (May 8, 2016)

Well, after disabling logical cores on the servers, the situation got much more stable. However, there was a recent crash of FreeBSD-10.2 on the Dell with a different error, panic: page fault. The following guide suggested looking up the value of the "instruction pointer", but the value was not found, even after omitting digits:

https://www.freebsd.org/doc/faq/advanced.html

Here is the crash dump :

http://prntscr.com/b1mgj3


----------



## RedShift1 (May 8, 2016)

matoatlantis said:


> When you said you moved to the new HW, does that mean brand new? Meaning no parts were moved from server to server (RAM, e.g.).


----------



## shahzaib (May 9, 2016)

RedShift1, exactly!! No parts were moved.


----------



## PacketMan (May 9, 2016)

shahzaib said:


> Could a machine check exception be caused by software/application?  Maybe we're on the wrong side of troubleshooting and the culprit is our application, not the hardware?



Since you said you are now running on 100% new hardware, that the new server has none of the hardware from the old server then that makes me suspicious that somehow a software issue appears to be a hardware issue.  So stop using the application software you were using on one of the other servers, install something else on it, and let it run. Put a game server port on it, you need to de-stress a bit anyway.   See if it still crashes.


----------



## shahzaib (May 9, 2016)

Have you checked my last post? With HT disabled, the error changed:

Here is the crash dump :

http://prntscr.com/b1mgj3


----------



## sizigee (May 9, 2016)

shahzaib said:


> Have you checked my last post? With HT disabled, the error changed:
> 
> Here is the crash dump :
> 
> http://prntscr.com/b1mgj3



Panic: page fault... hmmm... try removing some DIMMs out of it and work your way up.  Might be a memory module that is faulty.  But I can be completely wrong too...

I read the whole thread and I feel for you.


----------



## RedShift1 (May 10, 2016)

PacketMan said:


> Since you said you are now running on 100% new hardware, that the new server has none of the hardware from the old server then that makes me suspicious that somehow a software issue appears to be a hardware issue.  So stop using the application software you were using on one of the other servers, install something else on it, and let it run. Put a game server port on it, you need to de-stress a bit anyway.   See if it still crashes.


This might have some merit to it. A long time ago I had to manage a Windows application on a bunch of servers and the application itself had to run as Administrator. The reason for that is because it did some funky memory stuff and it made Windows crash every now and then, regardless of what hardware it was running on. So maybe the business application shahzaib is running might be doing the same...


----------



## sizigee (May 10, 2016)

shahzaib said:


> Have you checked my last post? With HT disabled, the error changed:
> 
> Here is the crash dump :
> 
> http://prntscr.com/b1mgj3





RedShift1 said:


> This might have some merit to it. A long time ago I had to manage a Windows application on a bunch of servers and the application itself had to run as Administrator. The reason for that is because it did some funky memory stuff and it made Windows crash every now and then, regardless of what hardware it was running on. So maybe the business application shahzaib is running might be doing the same...



I experienced the same (M$ background here).  The BSoD I kept getting was PAGE_FAULT_IN_NONPAGED_AREA.  I replaced all the DIMMs and it stopped blue screening on me.  Maybe it's the same in this case.

*EDIT* I re-read what you said.  It could be the application causing a memory leak, but I would rather blame the hardware, as that resolves the issue a lot quicker than asking devs to look into it.


----------



## sizigee (Jun 23, 2016)

shahzaib, how are things going? Managed to resolve the issue?


----------



## shahzaib (Jun 23, 2016)

Hi,

After disabling HT, things have improved. Now it crashes roughly once a month instead of every week, but the crash is not fixed. The log of the most recent crash is attached. Could you please read it and help diagnose the problem?

Thanks for following up.


----------



## SirDice (Jun 23, 2016)

```
Unread portion of the kernel message buffer:
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 16
MCA: CPU 9 UNCOR PCC internal timer error
MCA: Address 0x8018cfa2c
MCA: Misc 0x0
panic: Unrecoverable machine check exception
```
This is a hardware error.
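For anyone who wants to check that reading by hand, the Status value can be decoded bit by bit. A minimal sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 down to 57, MCA error code in bits 15:0, model-specific code in bits 31:16). The function and dictionary names are made up for illustration; this is not how the FreeBSD kernel does it.

```python
# Flag bits of IA32_MCi_STATUS (architectural layout per the Intel SDM).
FLAG_BITS = {
    "VAL": 63,    # register contains valid data
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}


def decode_mca_status(status):
    """Split a 64-bit MCA status value into flags and error codes."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # Status value taken from the panic message above.
    decoded = decode_mca_status(0xBE00000000800400)
    set_flags = [name for name, on in decoded["flags"].items() if on]
    print("flags set:", ", ".join(set_flags))
    print("MCA error code: 0x%04x" % decoded["mca_error_code"])
    print("model-specific code: 0x%04x" % decoded["model_specific_code"])
```

With the value from the log, UC and PCC come out set and the MCA error code is 0x0400, which the SDM lists as "internal timer error". That matches the "UNCOR PCC internal timer error" line the kernel printed.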


----------



## Murph (Jun 23, 2016)

shahzaib said:


> SirDice, please don't demoralize me more, I am already very depressed. I don't know what to do. Someone on another FreeBSD mailing list suggested updating the CPU microcode, but it looks like Intel doesn't provide a microcode file for FreeBSD:
> 
> - Install microcode updates and hope, it will fix it
> 
> ...



I'm not saying this is your fix, but the Intel microcode for Linux is available as the sysutils/devcpu-data port.


----------



## sizigee (Jun 27, 2016)

SirDice said:


> ```
> Unread portion of the kernel message buffer:
> MCA: Bank 5, Status 0xbe00000000800400
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
> ...


Faulty core, maybe?


----------



## User23 (Jun 27, 2016)

Different machines with hardware errors on different OS's and temperature related throttling. Did you log the temperature of the hottest core on each cpu?
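If nothing is logging temperatures yet, something along these lines could be run from cron. It is a sketch under assumptions: it expects FreeBSD with the coretemp(4) module loaded, so that `sysctl dev.cpu` prints lines such as `dev.cpu.0.temperature: 45.0C`. The sample text and function names are hypothetical, and it reports the hottest core overall rather than per package, since the core-to-package mapping is not known here.

```python
import re
import subprocess

# Hypothetical sample of `sysctl dev.cpu` output; real values will differ.
SAMPLE_OUTPUT = """\
dev.cpu.0.temperature: 45.0C
dev.cpu.1.temperature: 61.5C
dev.cpu.1.freq: 3459
dev.cpu.2.temperature: 58.0C
"""

TEMP_RE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*([0-9.]+)C$")


def parse_temperatures(sysctl_output):
    """Map core number -> degrees Celsius, ignoring unrelated sysctl lines."""
    temps = {}
    for line in sysctl_output.splitlines():
        match = TEMP_RE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps


def hottest_core(temps):
    """Return (core, temperature) for the hottest core in the mapping."""
    core = max(temps, key=temps.get)
    return core, temps[core]


def read_sysctl():
    """Read live values; assumes FreeBSD with coretemp(4) loaded."""
    result = subprocess.run(["sysctl", "dev.cpu"], capture_output=True,
                            text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Uses the sample text; swap in read_sysctl() on a real server.
    core, temp = hottest_core(parse_temperatures(SAMPLE_OUTPUT))
    print(f"hottest core: {core} at {temp:.1f}C")
```

Appending a timestamped line to a file every minute or so should be enough to see whether temperatures climb before a crash.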


----------



## shahzaib (Jul 2, 2016)

Well, yes, I've already installed the microcode. We have Dell & Supermicro servers crashing with the same errors, and both have the same kind of components:

- Dell PERC H200 controller flashed to IT mode as an LSI-9211
- X5600 series CPUs, namely X5690 & X5675

Now I've started to suspect that these X series CPUs are somehow the culprit. Maybe they draw more power during FFmpeg encoding and overheat, but that's just my assumption. Soon we'll be adding 3-5 more HP DL180 G6 servers with low-power L5639 CPUs to see how they fare compared to the X series CPUs.

Thanks to all of you for helping us this far. I guess it's the longest thread of this forum. Well, the ugliest one too.


----------



## sizigee (Jul 5, 2016)

Ugly? No... There was no flame war, so that makes it good.

Have you tried using an NVIDIA graphics card for the encoding?  It'll save you buying another server or 5...  Also, have you tried Linux, just to see if it'll make a difference?  I'm just curious.


----------

