# Performance issues with PERC 730 RAID controller after upgrading to FreeBSD 13.1



## elno (Sep 20, 2022)

We have a Dell PowerEdge R530 with a PERC H730P Mini RAID controller (RAID-5 with three disks, UFS file system). The firmware is up to date.
After upgrading from FreeBSD 12.3 to 13.1-p2, the system sometimes seems to hang for a while: the load goes up, and top shows almost 100% system CPU time.
This can be reproduced, for example, by repeatedly running

`$  /bin/rm -f -r /tmp/usr/ports/sysutils/webmin/work`

which deletes 117086 files and directories. It usually takes several tries to trigger the problem.
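
For anyone who wants to try this without a ports work directory at hand, here is a minimal sketch of a stand-in test. The helper names `make_tree` and `timed_delete` are made up for illustration, and the tree consists of empty files only, so it exercises directory and inode updates rather than data blocks. It is not the exact workload from the webmin build.

```shell
#!/bin/sh
# Hypothetical helper: build a throwaway tree of empty files under $1.
# $2 = number of directories, $3 = files per directory.
make_tree() {
    root=$1; dirs=$2; files=$3
    d=1
    while [ "$d" -le "$dirs" ]; do
        mkdir -p "$root/d$d" || return 1
        f=1
        while [ "$f" -le "$files" ]; do
            : > "$root/d$d/f$f" || return 1
            f=$((f + 1))
        done
        d=$((d + 1))
    done
}

# Hypothetical helper: delete the tree the same way as above and print
# the elapsed wall-clock time in whole seconds.
timed_delete() {
    start=$(date +%s)
    /bin/rm -f -r "$1" || return 1
    end=$(date +%s)
    echo $((end - start))
}
```

Something like `make_tree /tmp/rmtest 1000 100; timed_delete /tmp/rmtest`, repeated a few times, gives a tree of roughly the same size (100000 files) as the one mentioned above. The path and the counts are only examples.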


```
$ top -a
last pid: 87619;  load averages:  1.53,  0.96,  0.76                                                                                                     
968 processes: 9 running, 959 sleeping
CPU:  0.5% user,  0.0% nice, 99.4% system,  0.0% interrupt,  0.0% idle
Mem: 2558M Active, 47G Inact, 2583M Laundry, 3341M Wired, 508M Buf, 6881M Free
Swap: 16G Total, 144M Used, 16G Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 9822     88       41  20    0  4324M   762M select   0  34:13 246.28% /usr/local/libexec/mysqld --defaults-extra-file=/usr/local/etc/mysql/my.cnf --basedir=/usr/local --datadi
86419 www           1  85    0   190M    65M CPU6     6   0:09  83.21% /usr/local/sbin/httpd -DNOHTTPACCEPT
86438 www           1  40    0   330M   129M select   1   0:03  72.91% /usr/local/sbin/httpd
87619 root          1  52    0    15M  5020K RUN      0   0:07  71.87% /bin/rm -f -r /tmp/usr/ports/sysutils/webmin/work
85062 www           1  39    0   225M    48M range    4   0:02  56.22% /usr/local/sbin/httpd -DNOHTTPACCEPT
86084 www           1  30    0   330M   134M CPU4     4   0:02  41.69% /usr/local/sbin/httpd
87334 www           1  33    0   231M    63M CPU3     3   0:02  39.70% /usr/local/sbin/httpd -DNOHTTPACCEPT

$ grep ^da0 /var/run/dmesg.boot
da0 at mrsas0 bus 0 scbus0 target 0 lun 0
da0: <DELL PERC H730P Mini 4.30> Fixed Direct Access SPC-3 SCSI device
da0: Serial Number 001036e80c651ebc1f0084897fa06d86
da0: 150.000MB/s transfers
da0: 102400MB (209715200 512 byte sectors)


$ pciconf -lv | grep -A 4  mrsas
mrsas0@pci0:1:0:0:      class=0x010400 rev=0x02 hdr=0x00 vendor=0x1000 device=0x005d subvendor=0x1028 subdevice=0x1f47
    vendor     = 'Broadcom / LSI'
    device     = 'MegaRAID SAS-3 3108 [Invader]'
    class      = mass storage
    subclass   = RAID
```
gstat does not show significant disk load, and `hw.mfi.mrsas_enable="1"` is set in loader.conf.

I have already searched through several forums but could not really find a clue.


----------



## VladiBG (Sep 20, 2022)

Some steps that you can try to diagnose such a problem are:

- Reseat all connectors/cables.
- Check the hard disk SMART status.
- Enable the mrsas driver debug options and look for I/O timeouts or an online controller reset.
- During high load, check `dev.mrsas.X.fw_outstanding`.
- Check whether there are any changes to the mrsas driver in FreeBSD CURRENT (PR).
- Check whether there is new firmware for the DELL PERC H730P Mini 4.30 and see what it fixes by comparing all changes between your currently installed firmware and the latest one.
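
To watch the outstanding-command counter while the load is high, a small sampling loop may help. This is only a sketch: `sample_max` is a made-up helper name, and it simply runs whatever command it is given and expects a plain integer back.

```shell
#!/bin/sh
# sample_max COUNT INTERVAL CMD [ARGS...]
# Hypothetical helper: run CMD COUNT times, INTERVAL seconds apart,
# print each numeric reading and finally the largest value seen.
sample_max() {
    count=$1; interval=$2; shift 2
    max=0
    i=0
    while [ "$i" -lt "$count" ]; do
        val=$("$@") || return 1
        echo "sample $i: $val"
        if [ "$val" -gt "$max" ]; then
            max=$val
        fi
        i=$((i + 1))
        # no need to wait after the last sample
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    echo "max: $max"
}
```

On the machine in question this would be called as `sample_max 30 1 sysctl -n dev.mrsas.0.fw_outstanding`, assuming the controller is unit 0 (it shows up as mrsas0 in the dmesg output above).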


----------



## SirDice (Sep 20, 2022)

Keep in mind that 12 used an imported and modified ZFS while 13.x switched to OpenZFS. Not sure if it's going to do much performance wise but have you upgraded the pool yet? Make sure you update the bootloader(s) _before_ upgrading your pool though. Pre-13.0 bootloaders are not able to boot from OpenZFS pools. The upgrade process (freebsd-update(8) or build(7)) won't do this, you have to do this yourself.


----------



## covacat (Sep 20, 2022)

SirDice said:


> Keep in mind that 12 used an imported and modified ZFS while 13.x switched to OpenZFS. Not sure if it's going to do much performance wise but have you upgraded the pool yet? Make sure you update the bootloader(s) _before_ upgrading your pool though. Pre-13.0 bootloaders are not able to boot from OpenZFS pools. The upgrade process (freebsd-update(8) or build(7)) won't do this, you have to do this yourself.


he uses UFS


----------



## richardtoohey2 (Sep 20, 2022)

I’ve got a few Dells (R430s and R640s) with H730s, upgraded to 13.1, UFS, RAID5, and have not seen anything like this (so far). On some machines (a few years ago) I could get processes into the “suspfs” state, which seemed to be the RAID controller getting too far behind in writes; it eventually caught up. It doesn’t look like you’ve got a process in that state.

Have you tried installing MegaCli to see if there are any clues in there? You can use it to check the controller logs as well.


----------



## elno (Sep 21, 2022)

Thanks for all the feedback so far.

dev.mrsas.X.fw_outstanding is 0 or very low during high load. Enabling mrsas driver debugging does not work for me:



> $ sysctl hw.mrsas.0.debug_level
> sysctl: unknown oid 'hw.mrsas.0.debug_level'



After doing some more tests I also see that some processes stay in the getblk state for as long as the system hangs:



> 9822    23     88       41  20    0  4629M  1098M select   7  96:04  82.46% /usr/local/libexec/mysqld --defaults-extra-file=/u
> 69493    13 www           1  52    0   198M    68M RUN      7   0:04  55.43% /usr/local/sbin/httpd -DNOHTTPACCEPT
> 70260    23 www           1  52    0   192M    65M getblk   7   0:08  53.67% /usr/local/sbin/httpd -DNOHTTPACCEPT
> 68674     7 www           1  33    0   204M    27M lockf    3   0:02  52.55% /usr/local/sbin/httpd -DNOHTTPACCEPT
> ...



MegaCli does not show any problems, but after installing smartmontools I see one disk with a rather high non-medium error count: 18916.
This value does not rise even after several further periods of high load, so I doubt that's the reason.
Since the problem started immediately after upgrading to 13.1, I do not really suspect a hardware issue anyway.


----------



## VladiBG (Sep 21, 2022)

Do you have a spare disk?


----------



## elno (Sep 21, 2022)

I have. But this is a RAID5 with 4TB NL-SAS disks, and I would rather not break it because of the risks involved. The machine itself is due to be replaced in 2023. I was just worried that the load problem might be a general one with FreeBSD 13.1 and this particular controller, but as richardtoohey2 writes, his machines are running fine.


----------



## covacat (Sep 21, 2022)

It might be a problem with UFS too.
It is hard to tell in which part of the kernel the CPU load is created.
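
One rough way to narrow it down is to dump kernel stacks with `procstat -kk` during a hang and count which functions show up most often. The filter below is only a sketch: `count_kstack_frames` is a made-up name, and it assumes the stack frames are printed as `function+0xoffset` tokens. It is a crude sample, not a profile, so a frequent frame is a hint at best.

```shell
#!/bin/sh
# Hypothetical helper: read `procstat -kk`-style text on stdin and print
# how often each kernel function name appears, most frequent first.
count_kstack_frames() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # keep only tokens shaped like function+0xoffset
            if ($i ~ /^[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-fA-F]+$/) {
                sub(/\+0x[0-9a-fA-F]+$/, "", $i)
                n[$i]++
            }
        }
    }
    END { for (f in n) print n[f], f }' | sort -k1,1nr -k2,2
}
```

Run as root while the system is hanging, for example `procstat -kk -a | count_kstack_frames | head`, and repeat it a few times to see whether the same functions stay on top.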


----------



## richardtoohey2 (Sep 29, 2022)

Sample machine - I've NOT yet tried deleting hundreds of thousands of files - it's a production machine, but I've got a T330 somewhere that I can have a look at.


```
# grep ^da0 /var/run/dmesg.boot
da0 at mrsas0 bus 0 scbus0 target 0 lun 0
da0: <DELL PERC H730 Mini 4.27> Fixed Direct Access SPC-3 SCSI device
da0: Serial Number 009e735e1eb80c2628009e29b8a06d86
da0: 150.000MB/s transfers
da0: 7628800MB (15623782400 512 byte sectors)
da0 at mrsas0 bus 0 scbus0 target 0 lun 0
da0: <DELL PERC H730 Mini 4.27> Fixed Direct Access SPC-3 SCSI device
da0: Serial Number 009e735e1eb80c2628009e29b8a06d86
da0: 150.000MB/s transfers
da0: 7628800MB (15623782400 512 byte sectors)

# pciconf -lv | grep -A 4  mrsas
mrsas0@pci0:1:0:0:    class=0x010400 rev=0x02 hdr=0x00 vendor=0x1000 device=0x005d subvendor=0x1028 subdevice=0x1f49
    vendor     = 'Broadcom / LSI'
    device     = 'MegaRAID SAS-3 3108 [Invader]'
    class      = mass storage
    subclass   = RAID

# freebsd-version -ruk
13.1-RELEASE-p2
13.1-RELEASE-p2
13.1-RELEASE-p2
                                    
# df -h
Filesystem    Size    Used   Avail Capacity  Mounted on
/dev/da0p2    7.0T    2.5T    4.0T    38%    /
devfs         1.0K    1.0K      0B   100%    /dev

Sample MegaCli output:

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :System
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 7.275 TB
Sector Size         : 512
Is VD emulated      : No
Parity Size         : 1.818 TB
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 5
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Enabled
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No



Exit Code: 0x00
```


----------



## elno (Sep 29, 2022)

Thanks a lot. There is a similar machine that has to be upgraded. I have decided to upgrade that one to 12.4 once it is released and see what happens.


----------

