# How to verify / check hardware?



## boris_net (Jun 2, 2013)

Hi all,

I have a FreeBSD 9.1 VM on ESXi5.1 with vmware-tools with:

- 8 x 3 TB drives, plus 12 x 1 TB drives added recently
- 3 x IBM M1015 flashed with 2118IT.bin, set up as PCI device passthrough under ESXi (used exclusively for the ZFS arrays, since the system itself is on an SSD used by ESXi)
- the 15.00.00.00 driver on all the controllers, on FreeBSD 9.1:


```
dev.mpslsi.0.firmware_version: 15.00.00.00
dev.mpslsi.0.driver_version: 15.00.00.00
dev.mpslsi.1.firmware_version: 15.00.00.00
dev.mpslsi.1.driver_version: 15.00.00.00
dev.mpslsi.2.firmware_version: 15.00.00.00
dev.mpslsi.2.driver_version: 15.00.00.00
```

ZFS:

- 1 x ZFS raidz1 of 5 x 3 TB drives (I want to expand to a raidz2 with 8 drives, or whatever recommendation I get from the other topic I started here)
- 2 x ZFS raidz1 of 6 x 1 TB drives, to hold a copy of the data from the 5 x 3 TB array (not fully used today) so I can destroy it and recreate a larger pool with more 3 TB drives and more resilience

Drives are:

- 7 x 3 TB WD RED (0.60 A at +5 V / 0.45 A at +12 V)
- 1 x 3 TB Seagate (no idea of the power requirements, but I will eventually check)
- 12 x 1 TB Seagate 2.5 inch (0.80 A at +5 V / 0.20 A at +12 V)

PSU is:

- Seasonic X Series 750 W (currently using only 2 of the 6-pin-to-Molex rails at the back for the HDDs; I think this is enough, but I have started to doubt everything)

Motherboard/CPU/Memory:

- Supermicro X9SRL-F / E5-2665 / 128 GB ECC

My question is: how do I verify what is failing, and why? It could be:

- one of the 3 controllers
- one line in one of the 6 SFF-8087 cables
- one of the 6 backplanes in the NORCO case (RPC-4224)
- the FreeBSD mpslsi driver v15.00.00.00
- the power lines to the respective backplanes of the NORCO case

I have the following messages at boot time:

```
(probe230:mpslsi2:0:232:0): INQUIRY. CDB: 12 0 0 0 24 0
(probe230:mpslsi2:0:232:0): CAM status: Invalid Target ID
(probe230:mpslsi2:0:232:0): Error 22, Unretryable error
```

Then checking across the controllers:

```
[root@softimage ~]# grep probe /var/run/dmesg.boot | grep mpslsi2 | wc -l
     924
[root@softimage ~]# grep probe /var/run/dmesg.boot | grep mpslsi0 | wc -l
       0
[root@softimage ~]# grep probe /var/run/dmesg.boot | grep mpslsi1 | wc -l
       0
```
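Rather than repeating the grep for each controller, a small loop tallies all three at once. A sketch, run against an inline sample so it is self-contained; on the real system point `LOG` at `/var/run/dmesg.boot`:

```shell
# Tally CAM probe errors per mpslsi instance.
# LOG is an inline sample here; use /var/run/dmesg.boot on the real system.
LOG=probe_sample.txt
cat > "$LOG" <<'EOF'
(probe230:mpslsi2:0:232:0): CAM status: Invalid Target ID
(probe231:mpslsi2:0:233:0): CAM status: Invalid Target ID
(da3:mpslsi0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0
EOF
for c in 0 1 2; do
  printf 'mpslsi%s: %s probe errors\n' "$c" \
    "$(grep probe "$LOG" | grep -c "mpslsi$c")"
done > probe_counts.txt
cat probe_counts.txt
```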

The following messages during disk activity:

```
grep ATTENTION /var/log/messages
Jun  2 10:19:48 softimage kernel: (da3:mpslsi0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:19:56 softimage kernel: (da3:mpslsi0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:19:57 softimage kernel: (da2:mpslsi0:0:22:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:20:08 softimage kernel: (da3:mpslsi0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:20:08 softimage kernel: (da17:mpslsi2:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:20:08 softimage kernel: (da18:mpslsi2:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:20:09 softimage kernel: (da5:mpslsi0:0:27:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:20:11 softimage kernel: (da1:mpslsi0:0:21:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:20:11 softimage kernel: (da6:mpslsi0:0:28:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:20:31 softimage kernel: (da3:mpslsi0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:20:42 softimage kernel: (da4:mpslsi0:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:20:43 softimage kernel: (da2:mpslsi0:0:22:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:21:51 softimage kernel: (da3:mpslsi0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:21:53 softimage kernel: (da1:mpslsi0:0:21:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:21:55 softimage kernel: (da4:mpslsi0:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:22:12 softimage kernel: (da5:mpslsi0:0:27:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 10:22:29 softimage kernel: (da5:mpslsi0:0:27:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 11:23:53 softimage kernel: (da12:mpslsi1:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 11:23:58 softimage kernel: (da12:mpslsi1:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 11:25:21 softimage kernel: (da13:mpslsi1:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 11:25:22 softimage kernel: (da13:mpslsi1:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 11:25:23 softimage kernel: (da13:mpslsi1:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 11:25:25 softimage kernel: (da13:mpslsi1:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 11:25:25 softimage kernel: (da13:mpslsi1:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 11:26:03 softimage kernel: (da20:mpslsi2:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
```

And quite a lot of this:

```
Jun  2 11:33:48 softimage kernel: (da12:mpslsi1:0:25:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
[root@softimage ~]# grep iuCRC /var/log/messages | wc -l
    2302
```
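To see whether those errors cluster on particular disks (and therefore particular cables/backplanes), an awk one-liner can group them by device and controller. Again a sketch against an inline sample; on the real box feed it `/var/log/messages` instead:

```shell
# Group iuCRC aborts by disk:controller; a strong skew toward one
# controller, or toward one backplane's worth of disks, narrows the suspects.
cat > iucrc_sample.txt <<'EOF'
Jun  2 11:33:48 softimage kernel: (da12:mpslsi1:0:25:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Jun  2 11:33:49 softimage kernel: (da12:mpslsi1:0:25:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Jun  2 16:21:02 softimage kernel: (da1:mpslsi0:0:9:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
EOF
# Splitting on '(', ')' and ':' puts the device in $5 and controller in $6.
awk -F'[():]' '/iuCRC/ { n[$5 ":" $6]++ }
  END { for (k in n) print k, n[k] }' iucrc_sample.txt | sort > iucrc_counts.txt
cat iucrc_counts.txt
```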

Last but not least, this type of message across 9 drives...

```
Jun  2 10:20:03 softimage kernel: (da3:mpslsi0:0:25:0): WRITE(10). CDB: 2a 0 0 76 d9 0 0 0 d8 0 length 110592 SMID 553 terminated ioc 804b scsi 0 state c xfer 110592
Jun  2 10:20:03 softimage kernel: (da2:mpslsi0:0:22:0): WRITE(10). CDB: 2a 0 0 76 e3 d8 0 0 d8 0 length 110592 SMID 599 terminated ioc 804b scsi 0 state c xfer 110592
Jun  2 10:20:04 softimage kernel: (da1:mpslsi0:0:21:0): WRITE(10). CDB: 2a 0 0 79 99 48 0 0 d8 0 length 110592 SMID 289 terminated ioc 804b scsi 0 state c xfer 110592
Jun  2 10:20:05 softimage kernel: (da17:mpslsi2:0:5:0): WRITE(10). CDB: 2a 0 0 5b 55 60 0 0 d8 0 length 110592 SMID 496 terminated ioc 804b scsi 0 state c xfer 110592
Jun  2 10:20:05 softimage kernel: (da18:mpslsi2:0:6:0): WRITE(10). CDB: 2a 0 0 5b cf 20 0 0 d8 0 length 110592 SMID 151 terminated ioc 804b scsi 0 state c xfer 110592
Jun  2 10:20:05 softimage kernel: (da4:mpslsi0:0:26:0): WRITE(10). CDB: 2a 0 0 7a 2b f8 0 0 d0 0 length 106496 SMID 653 terminated ioc 804b scsi 0 state c xfer 106496
Jun  2 10:20:05 softimage kernel: (da6:mpslsi0:0:28:0): WRITE(10). CDB: 2a 0 0 7a 26 f0 0 0 d8 0 length 110592 SMID 805 terminated ioc 804b scsi 0 state c xfer 110592
Jun  2 10:20:05 softimage kernel: (da5:mpslsi0:0:27:0): WRITE(10). CDB: 2a 0 0 7a b7 f8 0 0 d8 0 length 110592 SMID 817 terminated ioc 804b scsi 0 state c xfer 110592
Jun  2 10:20:08 softimage kernel: (da3:mpslsi0:0:25:0): WRITE(10). CDB: 2a 0 0 77 bc 80 0 0 d8 0
```

Either I am cursed or there is something going on, but I have to admit I have no idea where to start and certainly no confidence I am actually going to isolate any issue...

Current action:
I have removed all the 1 TB drives and kept only the 8 x 3 TB distributed across the 3 controllers. In other words, that is no more than 2 drives per backplane (6 in total on the RPC4224). I have then launched a scrub on my 5-drive raidz1 and towards the end:

```
zpool status
  pool: zstuff
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jun  2 11:57:30 2013
        3.74T scanned out of 5.86T at 978M/s, 0h37m to go
        141M resilvered, 63.75% done
config:

	NAME           STATE     READ WRITE CKSUM
	zstuff         ONLINE       0     0     0
	  raidz1-0     ONLINE       0     0     0
	    gpt/disk1  ONLINE       0     0     2  (resilvering)
	    gpt/disk2  ONLINE       0     0     0
	    gpt/disk3  ONLINE       0     0     0  (resilvering)
	    gpt/disk4  ONLINE       0     0     0
	    gpt/disk5  ONLINE       0     0     0

errors: No known data errors
```

then


```
zpool status -v
  pool: zstuff
 state: ONLINE
  scan: resilvered 141M in 0h19m with 0 errors on Sun Jun  2 12:17:09 2013
config:

	NAME           STATE     READ WRITE CKSUM
	zstuff         ONLINE       0     0     0
	  raidz1-0     ONLINE       0     0     0
	    gpt/disk1  ONLINE       0     0     0
	    gpt/disk2  ONLINE       0     0     0
	    gpt/disk3  ONLINE       0     0     0
	    gpt/disk4  ONLINE       0     0     0
	    gpt/disk5  ONLINE       0     0     0
```

I did a second scrub and got a few like this:

```
Jun  2 16:19:57 softimage kernel: (da5:mpslsi1:0:25:0): READ(10). CDB: 28 0 4 89 0 a0 0 0 c0 0
Jun  2 16:19:57 softimage kernel: (da5:mpslsi1:0:25:0): CAM status: SCSI Status Error
Jun  2 16:19:57 softimage kernel: (da5:mpslsi1:0:25:0): SCSI status: Check Condition
Jun  2 16:19:57 softimage kernel: (da5:mpslsi1:0:25:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Jun  2 16:19:57 softimage kernel: (da5:mpslsi1:0:25:0): Retrying command (per sense data)
```

then


```
Jun  2 16:21:02 softimage kernel: (da1:mpslsi0:0:9:0): READ(10). CDB: 28 0 5 75 d1 60 0 0 40 0
Jun  2 16:21:02 softimage kernel: (da1:mpslsi0:0:9:0): CAM status: SCSI Status Error
Jun  2 16:21:02 softimage kernel: (da1:mpslsi0:0:9:0): SCSI status: Check Condition
Jun  2 16:21:02 softimage kernel: (da1:mpslsi0:0:9:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Jun  2 16:21:02 softimage kernel: (da1:mpslsi0:0:9:0): Retrying command (per sense data)
```

Now, the interesting bit: last time (3 weeks ago) I ran a SMART long test, I got a clean report for all the 3 TB drives. Before that test, one of them was reporting issues whenever a lot of drives were present in the chassis. Going back to 5 drives (at that time), I would get an error-free SMART long test report on each drive.

If you are still on this thread (thanks a lot for reading): at one point I had 6 x 10k RPM 2.5-inch 72 GB SAS drives in this NORCO RPC-4224 chassis, and they were failing (the famous click-click noise not long after startup). I put them back in the DL380 they came from, and they were all fine. I am thinking more and more of a power-related issue, since I use only two of the 5 rails dedicated to 'Peripheral IDE/SATA' on the back of my PSU.

If anybody has an idea of the best place to start to get a reliable setup, let me know.

Thanks.


----------



## wblock@ (Jun 2, 2013)

Swapping the power supply for a different one would be a relatively easy test.  That said, there's a lot going on there besides what you listed.  The VM, the PCI passthrough...

The earlier drives having trouble in that case but working in a different one is interesting.  To eliminate the case backplane as a problem, you could temporarily connect the drives directly to the power supply and controller.


----------



## boris_net (Jun 2, 2013)

Thanks @wblock@ for the ideas. Going from the controller directly to the HDDs means I need a different cable, since it's currently SFF-8087 to SFF-8087 (on the backplane); I would need a fan-out cable. I think I will buy one anyway, plus some spare SFF-8087 to SFF-8087 cables, so I can isolate an issue to a cable/backplane.

With the components involved, I will rebuild the whole thing from scratch, validating every single component. I just need to get my data copied to a different computer while I get this sorted.

Could the VM and passthrough be responsible for this as well? I will see if I can install FreeBSD directly and see if it makes any difference.

So many options  

Thanks!


----------



## phoenix (Jun 2, 2013)

You may also want to downgrade the firmware to 14, and use the included driver from FreeBSD (which is also 14). You need to match the driver version to the firmware version to get the best results.
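Whether they match is easy to check from sysctl. A sketch that compares `firmware_version` against `driver_version` per instance; it parses an inline sample here (the mpslsi.1 mismatch is made up to show the failure case), but on a real box you would feed it the output of `sysctl dev.mpslsi`:

```shell
# Flag any controller whose firmware and driver versions differ.
# The instance-1 mismatch below is fabricated for illustration.
cat > versions.txt <<'EOF'
dev.mpslsi.0.firmware_version: 15.00.00.00
dev.mpslsi.0.driver_version: 15.00.00.00
dev.mpslsi.1.firmware_version: 14.00.00.00
dev.mpslsi.1.driver_version: 15.00.00.00
EOF
awk '/firmware_version/ { split($1, a, "."); fw[a[3]] = $2 }
     /driver_version/   { split($1, a, "."); dr[a[3]] = $2 }
     END { for (i in fw)
             print "mpslsi" i ":", ((fw[i] == dr[i]) ? "match" : "MISMATCH") }' \
  versions.txt | sort > version_check.txt
cat version_check.txt
```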


----------



## boris_net (Jun 2, 2013)

Thanks @phoenix,

I have aligned on 15.00.00.00 as described here: Post

I will put this option of downgrading to v14 down in the long list of things to check ;-)


----------



## boris_net (Jun 3, 2013)

Not that I am using this exact setup, but my suspicion about the power delivered through the backplane may be backed up by this post


----------



## Terry_Kennedy (Jun 3, 2013)

boris_net said:

> Either I am cursed or there is something going on, but I have to admit I have no idea where to start and certainly no confidence I am actually going to isolate any issue...
> 
> I have the following messages at boot time:
> 
> ...


This is a red herring. The various drivers in this LSI family are overly "chatty" during probes (as shown by the "probe230" in the above log) - the driver flails about looking for things and logs every time the controller gets annoyed and fails the command. I believe this is due to an aggressive setting in (for example) the dev.mps.0.debug_level tunable.



> The following messages during disk activity:
> 
> ```
> grep ATTENTION /var/log/messages
> ...


This is happening on each of your 3 controllers. So we should be able to rule out a bad controller, bad cable, or bad drive. Look for things in common - software / firmware, power supplies, backplane cabling, and so on. My bet is on insufficient power supply capacity, worsened by inadequate connections from the power supply to your drive backplane(s). I'm running an 820 W supply for 16 WD2003-FYYS 2 TB RE4 drives.
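For what it's worth, here is a back-of-the-envelope steady-state budget from the amp figures quoted earlier in the thread (treating the unknown 3 TB Seagate as drawing the same as the WD REDs, which is an assumption). It comes out well under 750 W; the real stress case is spin-up, where a 3.5-inch drive can briefly pull roughly 1.5-2 A at 12 V, so simultaneous spin-up plus thin cabling to the backplanes can still cause brownouts:

```shell
# Steady-state power estimate; spin-up surge is the real stress case.
awk 'BEGIN {
  # 8 x 3.5" 3 TB (WD RED figures assumed for all 8) + 12 x 2.5" 1 TB
  p12 = 8 * 0.45 * 12 + 12 * 0.20 * 12
  p5  = 8 * 0.60 * 5  + 12 * 0.80 * 5
  printf "12V: %.1f W  5V: %.1f W  total: %.1f W\n", p12, p5, p12 + p5
}' > power_budget.txt
cat power_budget.txt
```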



> Last but not the least, this type of message across 9 drives...
> 
> ```
> Jun  2 10:20:03 softimage kernel: (da3:mpslsi0:0:25:0): WRITE(10). CDB: 2a 0 0 76 d9 0 0 0 d8 0 length 110592 SMID 553 terminated ioc 804b scsi 0 state c xfer 110592
> ...



This is also happening on multiple controllers (2, not all 3). Again, look for commonalities.


----------



## boris_net (Jun 3, 2013)

Another question: is running a SMART long test enough to stress the whole chain of components? I have installed FreeBSD 9.1 directly on the hardware to rule ESXi out of the equation. I ran a long test on the 8 x 3 TB drives overnight. Only one drive went missing, with the following message:


```
Jun  3 00:18:31 nativeBSD kernel: mps1: mpssas_alloc_tm freezing simq
Jun  3 00:18:31 nativeBSD kernel: mps1: IOCStatus = 0x4b while resetting device 0xa
Jun  3 00:18:31 nativeBSD kernel: mps1: mpssas_free_tm releasing simq
Jun  3 00:18:31 nativeBSD kernel: (da5:(pass5:mps1:0:mps1:0:11:11:0): lost device - 0 outstanding, 1 refs
Jun  3 00:18:31 nativeBSD kernel: 0): passdevgonecb: devfs entry is gone
Jun  3 00:18:31 nativeBSD kernel: (da5:mps1:0:11:0): removing device entry
```

Interestingly enough, the drive number has changed, since I am running FreeBSD directly on the hardware and not through ESXi, and this drive had never reported any issue before.


```
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-9YN166
Serial Number:    Z1F1AKYY
LU WWN Device Id: 5 000c50 04e66ea6c
Firmware Version: CC4H
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jun  3 15:37:14 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
```

I am running a second SMART long test, but would welcome confirmation on what else to run/do to make sure I have something stable if the test is successful. I will try to swap my PSU, or use more rails from the PSU to the backplanes of the NORCO.
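For the record, this is roughly how I kick the long tests off (a sketch: `DRY_RUN` just prints the commands so nothing here needs real disks; clear it on the actual box, and check the results afterwards with `smartctl -l selftest /dev/daN`):

```shell
# Start a SMART long self-test on each da(4) disk.
# The da0..da2 list is a stand-in; on FreeBSD use e.g.
#   for d in $(sysctl -n kern.disks); do ...
DRY_RUN=1
for d in da0 da1 da2; do
  cmd="smartctl -t long /dev/$d"
  if [ -n "$DRY_RUN" ]; then
    echo "would run: $cmd"
  else
    $cmd
  fi
done > smart_cmds.txt
cat smart_cmds.txt
```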

Thanks.


----------



## wblock@ (Jun 3, 2013)

boris_net said:

> Another question: Is running a smartmon long test enough to stress the whole chain of components?



No, I don't think so.  The tests run locally on the hard drive, so there is no stress on the controller or connections.


----------



## boris_net (Jun 4, 2013)

Thanks @wblock@.

Shall I just fill my ZFS pool with multiple `dd` instances running at the same time?

Thanks.


----------



## wblock@ (Jun 4, 2013)

Depends on whether you stop the test at a certain time or when it finishes a certain task.  Lots of head contention means it will take a long time to do a given size of transfer.  But yes, that should put real load on it.
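Something like this sketch, several writers at once so all the controllers, cables, and backplanes see traffic at the same time (the directory, file count, and size default are placeholders; point it at a dataset on the pool, raise COUNT to a few thousand for a real run, then scrub and check `zpool status -v`):

```shell
# Parallel sequential writers to load the whole I/O chain at once.
# POOLDIR and COUNT are placeholders: use e.g. /zstuff/stress and
# COUNT=4096 (4 GiB per file) on the real pool.
POOLDIR=${POOLDIR:-./zfs-stress}
COUNT=${COUNT:-8}            # MiB per file; tiny default for the sketch
mkdir -p "$POOLDIR"
for i in 1 2 3 4; do
  dd if=/dev/urandom of="$POOLDIR/stress.$i" bs=1048576 count="$COUNT" 2>/dev/null &
done
wait
ls -l "$POOLDIR"
# afterwards: zpool scrub zstuff && zpool status -v zstuff
```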


----------

