# Crashing Server - Temp/Voltage Monitoring



## APseudoUtopia (Apr 16, 2009)

I have a 7.1-RELEASE-p3 server running in a remote data center. Starting about 3 days ago, the server has been rebooting at random, about once per day. Nothing about the problem shows up in /var/log/messages (only the normal boot-up messages), but the `last` command does show that the server did in fact crash.

I'm trying to figure out what is causing this. I suspect a hardware problem, and I'd like to install a tool that lets me monitor the temperature and voltage of the processors and system. I tried Healthd, but after playing with it for a short time I realized that it wasn't detecting anything (it reported 0 temp, 0 volts, etc.).

I was wondering if anyone can help me figure out how I can monitor the hardware of the server. Here's some info on the hardware:

Intel Xeon CPU 2.40GHz

PCI Devices:

- ATI Technologies Inc - Rage XL PCI
- Intel Corporation - 82540EM Gigabit Ethernet Controller
- Intel Corporation - 82801 Family (ICH2/3/4/4/5/5/6/7/8/9,63xxESB) Hub Interface to PCI Bridge
- Intel Corporation - 82801CA (ICH3) UltraATA/100 EIDE Controller
- Intel Corporation - 82801CA/CAM (ICH3-S/ICH3-M) LPC Interface
- Intel Corporation - 82801CA/CAM (ICH3-S/ICH3-M) SMBus Controller
- (2x) Intel Corporation - 82801CA/CAM (ICH3-S/ICH3-M) USB Controller
- Intel Corporation - E7500 System Controller (MCH, Hub Interface A) Error Reporter
- Intel Corporation - E7501 Host Controller

IDE Devices:

- ad0: WDC WD1600AAJB-00WRA0 58.01H58

I was reading up on various software I can use for the monitoring, such as lmmon, mbmon, healthd, and ipmitool. From what I can tell, I will need to add a few options to my kernel configuration and recompile it in order to get support for /dev/smb or something similar.

Also, will I need to enable ACPI? Right now it is disabled.

Thanks.


----------



## SirDice (Apr 16, 2009)

You will most likely need ACPI. I'm not sure what else is required, but /dev/smb isn't always needed (it isn't on my board, and temperature monitoring works with mbmon).
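If /dev/smb does turn out to be needed, a kernel rebuild may not be necessary. The fragment below is a minimal sketch, assuming the stock `smbus`, `smb`, and `ichsmb` kernel modules are present under /boot/kernel and that `ichsmb` attaches to the ICH3 SMBus controller listed above; neither assumption has been confirmed on this machine.

```sh
# /boot/loader.conf (sketch; assumes the stock modules are installed)
smbus_load="YES"    # generic SMBus framework
smb_load="YES"      # provides the /dev/smb* device nodes
ichsmb_load="YES"   # Intel ICH SMBus controller driver (assumed to match the ICH3)
```

The same modules can be tried on a running system with `kldload` before committing them to loader.conf.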

Also consider installing smartmontools if you have S.M.A.R.T. enabled drives.
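As a rough illustration of what smartmontools can report, here is a small sketch. The `smart_temp` helper is hypothetical (it is not part of smartmontools), and it assumes the drive exposes attribute 194 (Temperature_Celsius) with the raw value in the tenth column of `smartctl -A` output, which not every drive does.

```sh
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# raw value of SMART attribute 194 (Temperature_Celsius), if present.
smart_temp() {
    awk '$1 == 194 { print $10 }'
}

# Intended use on the server (ad0 is the disk from the original post):
#   smartctl -H /dev/ad0                # overall health self-assessment
#   smartctl -l error /dev/ad0          # drive error log
#   smartctl -A /dev/ad0 | smart_temp   # drive temperature, if reported
```

If the helper prints nothing, the drive most likely does not report that attribute, and the full `smartctl -A` output is worth reading directly.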


----------



## User23 (Apr 20, 2009)

If you don't get the thermal sensors working, you should load the cpufreq module:

`kldload /boot/kernel/cpufreq.ko`

After that, you can set a lower frequency through the sysctl variable `dev.cpu.0.freq` (use `sysctl -a | grep freq` to find the related variables).

I know this is only a poor workaround, but if you don't have access to the server it may help.

Btw, don't run powerd during that test.
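The workaround above could be scripted roughly as follows. This is only a sketch: `lowest_freq` is a hypothetical helper, and it assumes `dev.cpu.0.freq_levels` returns space-separated `MHz/power` pairs and that the cpufreq driver actually attaches to this Xeon.

```sh
#!/bin/sh
# Hypothetical helper: given a dev.cpu.0.freq_levels string of
# "MHz/power" pairs, print the lowest frequency in MHz.
lowest_freq() {
    printf '%s\n' "$1" | tr ' ' '\n' |
        awk -F/ 'NF && $1 + 0 > 0 { print $1 }' |
        sort -n | head -n 1
}

# Intended use (as root, with powerd stopped):
#   kldload cpufreq
#   levels=$(sysctl -n dev.cpu.0.freq_levels)
#   sysctl dev.cpu.0.freq="$(lowest_freq "$levels")"
```

If the reboots stop at the lowest setting, that would point toward heat or power delivery rather than software, though it would not prove it.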


----------

