# SAS5ira raid controller issue



## pvanulden (Apr 28, 2011)

Hello there!

We are running FreeBSD 8.1 on a Dell PowerEdge 860 that has an SAS5ira RAID controller with two 1.5 TB drives attached. We've started to notice messages from the mpt driver in the logs, usually when writing larger amounts of data to the server. I've been using mptutil to try to diagnose the issue, but I haven't been able to get any useful information out of it.


```
server# mptutil show adapter
mpt0 Adapter:
       Board Name: SAS5ira
   Board Assembly: 
        Chip Name: C1068
    Chip Revision: UNUSED
      RAID Levels: RAID0, RAID1, RAID1E
    RAID0 Stripes: 64K
   RAID1E Stripes: 64K
 RAID0 Drives/Vol: 2-8
 RAID1 Drives/Vol: 2
RAID1E Drives/Vol: 3-8
```


```
server# mptutil volume status 0
Volume 0 status:
    state: OPTIMAL
    flags: ENABLED
```


```
server# mptutil show drives
mpt0 Physical Drives:
   0 ( 1397G) ONLINE <WDC WD15EADS-00P 0A01> SATA bus 0 id 1
   1 ( 1397G) ONLINE <WDC WD15EADS-00P 0A01> SATA bus 0 id 32
```


```
server# grep mpt /var/log/messages
Apr 28 10:22:09 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:22:09 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:24:17 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:24:17 server kernel: mpt0: mpt_cam_event: 0x12
Apr 28 10:24:17 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:24:25 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:24:25 server kernel: mpt0: mpt_cam_event: 0x12
Apr 28 10:24:25 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:28:08 server kernel: mpt0: request 0xc56c9b90:57649 timed out for ccb 0xc5d29800 (req->ccb 0xc5d29800)
Apr 28 10:28:08 server kernel: mpt0: attempting to abort req 0xc56c9b90:57649 function 0
Apr 28 10:28:08 server kernel: mpt0: request 0xc56ce140:57650 timed out for ccb 0xc6611000 (req->ccb 0xc6611000)
Apr 28 10:28:08 server kernel: mpt0: request 0xc56ceaf0:57651 timed out for ccb 0xc5d1a000 (req->ccb 0xc5d1a000)
Apr 28 10:28:08 server kernel: mpt0: mpt_recover_commands: IOC Status 0x4a. Resetting controller.
Apr 28 10:28:08 server kernel: mpt0: mpt_cam_event: 0x80
Apr 28 10:28:08 server kernel: mpt0: completing timedout/aborted req 0xc56c9b90:57649
Apr 28 10:28:08 server kernel: mpt0: completing timedout/aborted req 0xc56ce140:57650
Apr 28 10:28:08 server kernel: mpt0: completing timedout/aborted req 0xc56ceaf0:57651
Apr 28 10:28:08 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:28:08 server kernel: mpt0: mpt_cam_event: 0x12
Apr 28 10:28:08 server kernel: mpt0: mpt_cam_event: 0x12
Apr 28 10:28:08 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:28:08 server kernel: mpt0: mpt_cam_event: 0x21
Apr 28 10:28:08 server kernel: mpt0: mpt_cam_event: 0x21
Apr 28 10:28:08 server kernel: mpt0:vol0(mpt0:0:0): Volume Status Changed
Apr 28 10:28:08 server kernel: mpt0: mpt_wait_req(4) timed out
Apr 28 10:28:08 server kernel: mpt0: read_cfg_page(1) timed out
Apr 28 10:28:08 server kernel: mpt0: mpt_refresh_raid_data: Failed to read IOC Page 2
Apr 28 10:28:08 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:28:47 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:34:38 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:34:38 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:39:53 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:42:16 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:42:16 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:43:24 server kernel: mpt0:vol0(mpt0:0:0): RAID-1 - Optimal
Apr 28 10:43:24 server kernel: mpt0:vol0(mpt0:0:0): Status ( Enabled )
Apr 28 10:43:24 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:43:54 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:48:12 server kernel: mpt0: mpt_cam_event: 0x16
Apr 28 10:48:12 server kernel: mpt0: mpt_cam_event: 0x16
```


If either of the drives is failing, is there a way to determine which one? Any help in tracking down the issue would be appreciated.

Thanks in advance!

Cheers,
Phil


----------



## SirDice (Apr 28, 2011)

You can try updating to 8.2. There have been a lot of changes in that driver between 8.1 and 8.2.


----------



## pvanulden (Apr 28, 2011)

Thanks SirDice, I will try upgrading.  Here is some additional information from the boot log:


```
Apr 28 13:58:09 server kernel: mpt0: <LSILogic SAS/SATA Adapter> port 0xec00-0xecff mem 0xfe9fc000-0xfe9fffff,0xfe9e0000-0xfe9effff irq 16 at device 8.0 on pci2
Apr 28 13:58:09 server kernel: mpt0: [ITHREAD]
Apr 28 13:58:09 server kernel: mpt0: MPI Version=1.5.13.0
Apr 28 13:58:09 server kernel: mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
Apr 28 13:58:09 server kernel: mpt0: 1 Active Volume (2 Max)
Apr 28 13:58:09 server kernel: mpt0: 2 Hidden Drive Members (10 Max)
Apr 28 13:58:09 server kernel: mpt0:vol0(mpt0:0:0): Settings ( Member-WCE Hot-Plug-Spares High-Priority-ReSync )
Apr 28 13:58:09 server kernel: mpt0:vol0(mpt0:0:0): Using Spare Pool: 0
Apr 28 13:58:09 server kernel: mpt0:vol0(mpt0:0:0): 2 Members:
Apr 28 13:58:09 server kernel: (mpt0:1:32:0): Primary Online
Apr 28 13:58:09 server kernel: (mpt0:1:1:0): Secondary Online
Apr 28 13:58:09 server kernel: mpt0:vol0(mpt0:0:0): RAID-1 - Optimal
Apr 28 13:58:09 server kernel: mpt0:vol0(mpt0:0:0): Status ( Enabled )
Apr 28 13:58:09 server kernel: (mpt0:vol0:1): Physical (mpt0:0:1:0), Pass-thru (mpt0:1:0:0)
Apr 28 13:58:09 server kernel: (mpt0:vol0:1): Online
Apr 28 13:58:09 server kernel: (mpt0:vol0:0): Physical (mpt0:0:32:0), Pass-thru (mpt0:1:1:0)
Apr 28 13:58:09 server kernel: (mpt0:vol0:0): Online
Apr 28 13:58:09 server kernel: (xpt0:mpt0:1:-1:-1): rescan already queued
Apr 28 13:58:09 server kernel: pass1 at mpt0 bus 1 scbus1 target 0 lun 0
Apr 28 13:58:09 server kernel: da0 at mpt0 bus 0 scbus0 target 0 lun 0
```


----------



## Terry_Kennedy (Apr 29, 2011)

pvanulden said:

> We are running FreeBSD 8.1 on a Dell PowerEdge 860 that has an SAS5ira RAID controller with two 1.5 TB drives attached. We've started to notice messages from the mpt driver in the logs, usually when writing larger amounts of data to the server. I've been using mptutil to try to diagnose the issue, but I haven't been able to get any useful information out of it.


The first thing to do is to see if there is new Dell firmware available for either the SAS 5 controller or the drives, and install it. You might have to search for things like "Dell SAS Hard Drive Firmware Utility", as sometimes the newer versions don't appear as an update for older Dell models.

If you're using generic non-Dell drives, whether they work well or not seems to be up to whatever firmware they come with. I'm using a pair of WD3000HLFS on a Dell SAS 5/iR and the only issue is that the driver will spit out a "mpt0: mpt_cam_event: 0x21" message every few days.

You can see the binary event log with mptutil:


```
(1:15) new-gate:/sysprog/terry# mptutil show events
 ID     Time   Type Log Data
  378 1385863s 8001 22 00 00 00 02 00 01 01 01 00 01 01 2a 03 |"...........*.|
                    14 01 00 00 00 00 00 00 00 00 00 00 00 00 |..............|
  379 1406217s 8001 22 00 00 00 02 00 01 01 09 00 01 00 2a 03 |"...........*.|
                    14 01 00 00 00 00 00 00 00 00 00 00 00 00 |..............|
  380 2096744s 8001 22 00 00 00 02 00 01 01 01 00 01 01 2a 03 |"...........*.|
                    14 01 00 00 00 00 00 00 00 00 00 00 00 00 |..............|
  381      36s 8001 01 00 00 00 20 00 10 00 00 10 58 00 28 10 |.... .....X.(.|
                    0e 1f 00 00 00 00 00 00 00 00 00 00 00 00 |..............|
  382      42s 8001 0b 00 00 00 01 00 15 00 00 00 00 00 02 01 |..............|
                    00 00 00 00 00 00 00 00 00 00 00 00 00 00 |..............|
```

Unfortunately, mpt events are an opaque blob to the utilities available on FreeBSD. That's why you receive messages like:


```
mpt0: mpt_cam_event: 0x21
```

Only when a problem is reported outside of the mpt driver will you see a detailed error message, and as your log entries show, they're of the form "I told the mpt device to do something, and it didn't."
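One rough way to see the pattern in those log entries is to tally them. The sketch below is only an illustration: `summarize_mpt_log` and the two regular expressions are hypothetical names, and the patterns are based solely on the log lines quoted earlier in this thread, so other driver versions may word things differently. It counts each `mpt_cam_event` code and the number of "request ... timed out" lines; it does not explain what any code means.

```python
import re
from collections import Counter

# Patterns modelled on the lines quoted in this thread (an assumption, not
# a documented log format), e.g.
#   "... kernel: mpt0: mpt_cam_event: 0x16"
#   "... kernel: mpt0: request 0xc56c9b90:57649 timed out for ccb ..."
EVENT_RE = re.compile(r"kernel: mpt\d+: mpt_cam_event: (0x[0-9a-fA-F]+)")
TIMEOUT_RE = re.compile(r"kernel: mpt\d+: request \S+ timed out")


def summarize_mpt_log(lines):
    """Return (event_counts, timeout_count) for an iterable of syslog lines."""
    events = Counter()
    timeouts = 0
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            # Normalise the case so 0x1A and 0x1a are counted together.
            events[match.group(1).lower()] += 1
        elif TIMEOUT_RE.search(line):
            timeouts += 1
    return events, timeouts
```

To run it against a real log, open the file (for example `/var/log/messages`) and pass the file object straight in, since the function accepts any iterable of lines. A burst of timeouts next to a change in the event mix at least tells you when the controller was struggling, even if it cannot tell you which drive was responsible.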

I looked into what would be needed to add useful event decoding to mptutil, but there didn't seem to be enough info available, and what little there was, was nested a half-dozen layers deep in some #include files.
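Short of real decoding, the `mptutil show events` dump can at least be turned into structured records so that entries can be compared or diffed over time. This is a minimal sketch, assuming the column layout shown in the output above (ID, Time, Type, Log Data, with continuation lines that carry only hex bytes). `MptEvent` and `parse_show_events` are hypothetical names, and treating the trailing `s` in the Time column as seconds is an assumption. The payload bytes are kept raw; nothing here interprets them.

```python
from dataclasses import dataclass


@dataclass
class MptEvent:
    event_id: int      # "ID" column
    time_s: int        # "Time" column with the trailing "s" stripped (assumed seconds)
    type_code: str     # "Type" column, kept as text because its meaning is not decoded
    data: bytes = b""  # "Log Data" bytes, including continuation lines


def _hex_bytes(tokens):
    """Return the tokens as bytes, or None unless every token is one hex byte."""
    if not tokens or any(len(tok) != 2 for tok in tokens):
        return None
    try:
        return bytes.fromhex("".join(tokens))
    except ValueError:
        return None


def parse_show_events(text):
    """Parse text shaped like the `mptutil show events` output into MptEvent records."""
    events = []
    for line in text.splitlines():
        # Drop the ASCII rendering on the right; the hex columns never contain "|".
        tokens = line.split("|", 1)[0].split()
        is_record = (
            len(tokens) >= 3
            and tokens[0].isdigit()
            and tokens[1].endswith("s")
            and tokens[1][:-1].isdigit()
        )
        if is_record:
            payload = _hex_bytes(tokens[3:]) or b""
            events.append(MptEvent(int(tokens[0]), int(tokens[1][:-1]), tokens[2], payload))
        elif events:
            # Continuation line: append its bytes to the most recent record.
            payload = _hex_bytes(tokens)
            if payload:
                events[-1].data += payload
    return events
```

Header lines, shell prompts, and anything else that does not look like a record or a run of hex bytes are skipped rather than raising an error, which makes it safe to paste in a whole terminal capture.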


----------



## pvanulden (Apr 29, 2011)

Thanks Terry.  

I upgraded to FreeBSD 8.2 and the problem still exists. We were running FreeBSD 6 on there up until a few months ago, when we got new hard drives for it. Before I installed FreeBSD 8 on the new drives, I tried flashing the BIOS on the RAID controller card using a USB boot drive, but it complained about not being able to find the adapter. I'm going to try it again to see if I can get it to work.

Is a USB drive the best (only?) way to go about flashing that RAID controller? There is no floppy drive in the server, and even if there were, I have no idea where I would even get a floppy disk. Also, is there any risk of losing the RAID array when flashing the BIOS on those adapters?

These mpt errors just started happening in the last week, so I presume I'm dealing with either a bad drive or, less likely, something wrong with the adapter. If I knew which drive was causing the problem, I could just try swapping that drive to see if the errors go away. Regarding swapping the drive, is there anything I need to do before attempting that? Can I simply shut down the server, change out the drive, and let the RAID rebuild itself automatically when it boots up?

Thanks everyone for the feedback.


----------



## SirDice (Apr 29, 2011)

Since most of those flash utilities require Windows, you might want to take a look at http://www.bootdisk.com. They have a few bootable CD options too.

I have no idea what will happen to your array when flashing the controller's BIOS. That will probably be mentioned in the READMEs that came with the tool/flash image. But it's always best to be on the safe side and back up first, just to be sure.


----------



## Terry_Kennedy (May 1, 2011)

pvanulden said:

> Is a USB drive the best (only?) way to go about flashing that RAID controller? There is no floppy drive in the server, and even if there were, I have no idea where I would even get a floppy disk. Also, is there any risk of losing the RAID array when flashing the BIOS on those adapters?


I use 3 methods to create bootable media for firmware flashing and similar tasks:

- USB flash drive
- CD
- USB floppy drive

Which one I use depends on how the update was packaged and the particular media supported by the target system. The most annoying updater I've seen was a Windows-only executable. Fortunately, I discovered that the "repair system / command prompt" option when booting a Windows 7 CD provides a suitable environment.

Normally, flashing the controller firmware should not affect the RAID set. However, it is always a good idea to have a working backup handy, just in case. As always, read the firmware release notes for any cautions specific to that firmware.



> These mpt errors just started happening in the last week, so I presume I'm dealing with either a bad drive or, less likely, something wrong with the adapter. If I knew which drive was causing the problem, I could just try swapping that drive to see if the errors go away. Regarding swapping the drive, is there anything I need to do before attempting that? Can I simply shut down the server, change out the drive, and let the RAID rebuild itself automatically when it boots up?


I assume you have a mirror set and not a stripe set? You may need to tell the controller that it is OK to initialize the replacement drive. As above, it is important to have a good backup before starting. In the worst case, you guess wrong and wind up swapping the good drive, the bad drive fails during the copy due to the higher I/O load, and the controller won't accept the drive you removed because it is an older instance of the RAID set. I'm not saying that's going to happen, but it is best to be prepared.
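If you end up swapping drives one at a time, it helps to confirm that the volume is back to `OPTIMAL` before touching the second mirror member. Here is a minimal sketch of that check, assuming the `mptutil volume status` output looks like the sample in the first post (indented `key: value` lines). `parse_volume_status` and `volume_is_optimal` are hypothetical helper names, and anything other than `OPTIMAL` is simply treated as "not yet".

```python
def parse_volume_status(text):
    """Collect the 'key: value' lines of `mptutil volume status` output into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines with no colon (prompts) or no value ("Volume 0 status:").
        if sep and value.strip():
            status[key.strip()] = value.strip()
    return status


def volume_is_optimal(text):
    """True only when the reported state is exactly OPTIMAL."""
    return parse_volume_status(text).get("state") == "OPTIMAL"
```

The text could come from something like `subprocess.run(["mptutil", "volume", "status", "0"], capture_output=True, text=True).stdout`. Keep in mind this only guards against pulling a drive in the middle of a resync. It says nothing about whether either drive is actually healthy.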


----------



## pvanulden (May 13, 2011)

Thanks for the help, folks. We started out by flashing the RAID controller, and that didn't solve the problem. We then swapped out disk1 and were still getting the same messages in the log from the mpt driver. After the RAID finished re-syncing, I swapped out disk0 and now everything is working great. It would be nice if there were some means to determine which drive was causing the problem. Even the BIOS on the RAID controller itself said that both drives were fine, and I guess if the controller thinks things are okay, the driver probably isn't going to be able to tell either.

Anyway, glad to have this sorted out.

Cheers!


----------

