# ESXi 4.1 - mpt0 errors



## throAU (Jan 16, 2013)

Ladies/Gents, just recently I have noticed the following errors in two of my FreeBSD 8.3 VMs (a name server and an MX server - both running as ESXi 4.1 guests).  I'll post output from the MX server as it appears more often (and obviously does more I/O).

Example:


```
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: request 0xffffff80002378b0:9223 timed out for ccb 0xffffff000198b000 (req->ccb 0xffffff000198b000)
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: attempting to abort req 0xffffff80002378b0:9223 function 0
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: completing timedout/aborted req 0xffffff80002378b0:9223
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: abort of req 0xffffff80002378b0:0 completed
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: request 0xffffff80002324e0:9224 timed out for ccb 0xffffff003eaca000 (req->ccb 0xffffff003eaca000)
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: attempting to abort req 0xffffff80002324e0:9224 function 0
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: completing timedout/aborted req 0xffffff80002324e0:9224
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: abort of req 0xffffff80002324e0:0 completed
```

Any ideas what may be causing this?

I am running GENERIC kernel, and have VMware tools installed.  System is up to date using freebsd-update.

Physical hardware is a Cisco B200 blade in an 8-slot UCS chassis; the physical storage is a NetApp FAS2240 connected via NFS over 10 Gb fibre through a Cisco 4507.

The virtual storage is just VMware-provided virtual disks, using LSI Logic parallel SCSI emulation.

The NetApp is not running anywhere near flat out in terms of I/O, so I'm pretty sure it shouldn't be timing out due to I/O throttling. All our user ports on the 4507 run at 100 Mb PoE (plugged into old phones which are 100 Mb limited), with only 8 x 10 Gb ports in use and roughly 36 ports running at 1 Gb. It has dual Sup 7s with SSO, so there should be no problem there either.

I'm not seeing storage errors on anything else.


Any idea where to start looking to track this down?  The machine had 188 days of uptime at that point and, stupidly, I rebooted it.

However, my name server (also exhibiting the issue, to a lesser extent, due to lower I/O) has not yet been rebooted (it also has 188 days of uptime) - if there are any diagnostics worth running prior to reboot, I can run them on that.
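For reference, the sort of read-only checks I'm thinking of running on the still-up name server first (a sketch - the `ns1` hostname is made up, and it assumes the same single-da0 layout as the MX box):

```
ns1# vmstat -i                 # see whether mpt0 shares an interrupt line
ns1# camcontrol tags da0       # current device openings / queue state
ns1# camcontrol devlist        # confirm the virtual disk layout
ns1# dmesg | grep mpt          # any earlier controller messages
```

All of these only read state, so they should be safe on a box I don't want to disturb.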


Cheers


----------



## throAU (Jan 16, 2013)

Oh and one more thing - the filesystems are all UFS.


----------



## cpm@ (Jan 16, 2013)

This was probably caused by a firmware issue, but I can't confirm it.

Please show the output of these commands:

```
# camcontrol tags [device id]
# camcontrol devlist
```


----------



## throAU (Jan 16, 2013)

I gather you mean the device ID for the system (only) virtual disk (da0)?


```
mx2# camcontrol tags da0
(pass0:mpt0:0:0:0): device openings: 127
mx2# camcontrol devlist
<VMware Virtual disk 1.0>          at scbus0 target 0 lun 0 (pass0,da0)
mx2#
```


----------



## throAU (Jan 16, 2013)

Also - I have confirmed that I wasn't seeing these errors prior to last night, so perhaps it is an uptime-related thing.  I'll leave the name server running and monitor the MX server for problems post-reboot.

Both machines have virtually identical virtual hardware, run on the same physical infrastructure at the same patch level, etc., and both started generating errors at roughly the same time.

I did not make any changes to the host environment in the last 2 weeks or so (last change would have been VMware ESXi updates on the hosts).


----------



## cpm@ (Jan 16, 2013)

throAU said:

> I gather you mean the device ID for the system (only) virtual disk (da0)?


Correct.

Keep watching for any irregularity and report back so it can be analyzed calmly. Use SMART to check the disks attached to the controller; it could be useful if you post the results.


----------



## throAU (Jan 16, 2013)

SMART data?  Isn't that only valid for the actual physical drives?  They are several levels of abstraction away and in RAID-DP (i.e., NetApp's variant of RAID 6).  I have zero failed drives in the NetApp and no alarms.

Even if the NetApp had several failed disks, ESXi shouldn't even know, as it is connected via NFS?


----------



## cpm@ (Jan 16, 2013)

throAU said:

> SMART data?  Isn't that only valid for the actual physical drives?  They are several levels of abstraction away and in RAID-DP (i.e., NetApp's variant of RAID 6).  I have zero failed drives in the NetApp and no alarms.
> 
> Even if the NetApp had several failed disks, ESXi shouldn't even know, as it is connected via NFS?



Sure, the guest can only see the virtual hardware, not the physical hardware. Check the physical drives. I'd recommend reading the VirtualBox troubleshooting documentation.


----------



## throAU (Jan 16, 2013)

The physical drives are shared with many other VMs that are not showing errors - the NetApp is an enterprise-grade SAN with 48 drives, none of which are showing errors.

The NetApp will proactively fail disks that have SMART errors (it does a nightly scrub); as far as the clients are concerned, they won't see any change in service.

I actually had a drive failure in the FAS some months ago, and did not get these errors in my log then.


----------



## joel@ (Jan 16, 2013)

I have plenty of 8.X VMs running in my ESX 4.1 environment, also using NetApp for storage. My ESX machines are all running the latest patches from VMware. I do not see these errors.

Uptime varies a lot, but several VMs have around 200-300 days of uptime.

I suggest you send a mail to stable@freebsd.org


----------



## throAU (Jan 16, 2013)

joel@ said:

> I have plenty of 8.X VMs running in my ESX 4.1 environment, also using NetApp for storage. My ESX machines are all running the latest patches from VMware. I do not see these errors.
> 
> Uptime varies a lot, but several VMs have around 200-300 days of uptime.
> 
> I suggest you send a mail to stable@freebsd.org




Cheers,

Will monitor both VMs over the next couple of days to see if the problem persists, and email stable@ as appropriate.

I've been running FreeBSD on ESX for about 6 years now (these exact VMs for 2-3 years) and have never seen this before either, so it is quite peculiar.


----------



## cpm@ (Jan 16, 2013)

throAU said:

> SMART data?  Isn't that only valid for the actual physical drives?  They are several levels of abstraction away and in RAID-DP (i.e., NetApp's variant of RAID 6).  I have zero failed drives in the NetApp and no alarms.
> 
> Even if the NetApp had several failed disks, ESXi shouldn't even know, as it is connected via NFS?



Some Dell controllers with the SAS 6/iR BIOS allow using SMART to check virtual disks and get raw values. Linux exposes the physical disks attached to the virtual controller as /dev/sgX devices, so it is possible to use smartmontools on them.

Tested on Linux.

FreeBSD should implement this option too.


----------



## throAU (Jan 17, 2013)

cpm@ said:

> Some Dell controllers with the SAS 6/iR BIOS allow using SMART to check virtual disks and get raw values. Linux exposes the physical disks attached to the virtual controller as /dev/sgX devices, so it is possible to use smartmontools on them.
> 
> Tested on Linux.
> 
> FreeBSD should implement this option too.





The disks are not in the same physical machine... the ESXi host is running VMs off a datastore served via NFS from the NetApp.

This setup is based on the fully VMware-supported Cisco/NetApp/VMware FlexPod architecture.


----------



## cpm@ (Jan 17, 2013)

You are right, but I suppose one could look at how to implement this, although it would not be easy on a different architecture. Some limited pass-through functionality must exist before code can be written for it.

From the smartmontools FAQ:


> Do smartctl and smartd run on a virtual machine guest OS?
> 
> Yes and no. Smartctl and smartd run on a virtual machine guest OS without problems. But this isn't very useful because the virtual disks do not support SMART. If a guest OS disk is configured as a raw disk, this only means that its sectors are mapped transparently to the underlying physical disk. This does not imply the ATA or SCSI pass-through access required to access the SMART info of the physical disk. Even the disk's identity is typically not exposed to the guest OS.


----------



## throAU (Feb 1, 2013)

Update:  the errors have not recurred after a reboot of only one of the VMs.

Weird.


----------



## cpm@ (Feb 1, 2013)

throAU said:

> Update:  the errors have not recurred after a reboot of only one of the VMs.
> 
> Weird.



Even though the mpt0 errors haven't occurred again, I'd recommend reporting the detailed I/O symptoms that disappeared from your MX server, to avoid future headaches.


----------



## mk96 (Mar 5, 2013)

Same symptoms here with a Samba file server on FreeBSD 8.3-p3 on VMware ESXi 4.1/348481 - a server with a 3ware 9690SA and a RAID 5 of 4 x 1 TB Samsung Spinpoint F1 (HE103UJ) drives. So it is a certified controller with certified HDDs. Neither the CPU nor the disks show 100% utilization, and incoming traffic was about 15 MB/s, matching the disk throughput in MB/s. Any idea what is behind this problem? Yesterday I rebooted the VM, checked the virtual disks in the VM and the physical disks in the RAID - everything was OK.


----------



## mk96 (Mar 5, 2013)

I read something here. So I am going to try it.


----------



## mk96 (Mar 5, 2013)

I saw that FreeBSD 8.3 is not supported on ESXi 4.1U3; it is supported on 5.1. So I upgraded the hypervisor to 5.1 (keeping the e1000 and e1000e drivers from 5.0U1 - the older ones are without the bug), upgraded vmtools-freebsd in the VM and added the following to /boot/loader.conf:

```
hw.pci.enable_msi="0"
hw.pci.enable_msix="0"
```
which addresses an issue that occurs when IRQs are shared among mpt0/em0/emX.

The source is here. We will see how it performs tomorrow and over the rest of the week; I will post my experience here.



			
throAU said:

> Ladies/Gents, just recently I have noticed the following errors in two of my FreeBSD 8.3 VMs (a name server and an MX server - both running as ESXi 4.1 guests).  I'll post output from the MX server as it appears more often (and obviously does more I/O).
> 
> Example:
> 
> ...


----------



## throAU (Mar 7, 2013)

My name server now has 238 days of uptime and has not seen this issue recur.

However, one thing I have just remembered - NetApp recommend installing some "guest OS" tools that modify some I/O timeout settings for Windows and Linux; this covers the case where the FAS does a controller failover (firmware upgrade, hardware failure, etc.).

I'll see if I can dig up more info on what those timeouts are and how they could be set on FreeBSD.

As far as I am aware - I've never had a NetApp HA failover.  But maybe this happened and caused the issue...
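If it does turn out to be the failover timeout, I'd guess the FreeBSD knob is the CAM da(4) command timeout. A sketch of what I'd expect the setting to look like (the `kern.cam.da.default_timeout` name and the 190-second value are my assumptions, based on what the Linux guest tools reportedly set - verify against NetApp's own docs before using):

```
# /etc/sysctl.conf (sketch - verify the tunable name on your release)
# Raise the SCSI disk command timeout so a NetApp controller
# failover doesn't cause the guest to time out and abort requests.
kern.cam.da.default_timeout=190
```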


----------



## mk96 (Mar 8, 2013)

Mine is running its second day without issue. FreeBSD 8.3 is unsupported on ESXi 4.1 - in my case I strongly suspect that was the problem; ESXi 4.1U3 even crashed under high I/O (just a simple background fsck_ufs on a 1 TB /home). It does not happen anymore on ESXi 5.1, so I think they probably fixed that.



			
throAU said:

> My name server now has 238 days of uptime and has not seen this issue recur.
> 
> However, one thing I have just remembered - NetApp recommend installing some "guest OS" tools that modify some IO time-out settings for Windows and Linux - this is for when the FAS has a controller fail-over happen (in case of firmware upgrade, hardware failure, etc).
> 
> ...


----------



## throAU (Mar 8, 2013)

Hmm.

I've never had any crashes of ESXi or otherwise.  I also have not performed the sysctl tuning above yet.

What do those sysctls actually do?


----------



## mk96 (Mar 10, 2013)

These sysctls (set as loader tunables in /boot/loader.conf) turn MSI - message-signaled interrupts - on or off.
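To confirm the tunables took effect after the reboot, something like this should show it (a sketch - the `mx#` prompt is made up; with MSI disabled, mpt0 should appear on a legacy irq line in vmstat -i):

```
mx# sysctl hw.pci.enable_msi hw.pci.enable_msix
mx# vmstat -i | egrep 'mpt|em'
```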



			
throAU said:

> Hmm.
> 
> I've never had any crashes of ESXi or otherwise.  I also have not performed the sysctl tuning above yet.
> 
> What do those sysctls actually do?


----------

