# ZFS freezes system during extended activity



## Pawtuxet (Apr 5, 2014)

Hi,

I recently put together a new home server. Its primary role is storage, but it also runs other services. It ran for a couple of weeks before the HBA card (a SuperMicro AOC-USAS2-L8e) arrived, time I spent configuring it to match the machine it's replacing. Once the card arrived, I installed it and connected it to eight 4 TB WD40EFRX drives using two SFF-8087 breakout cables.

So far, so good. I created a ZFS storage pool (raidz2) across all eight drives using only default options, with no deduplication or compression.

```
# camcontrol devlist
<ATA WDC WD40EFRX-68W 0A80>        at scbus0 target 0 lun 0 (pass0,da0)
<ATA WDC WD40EFRX-68W 0A80>        at scbus0 target 1 lun 0 (pass1,da1)
<ATA WDC WD40EFRX-68W 0A80>        at scbus0 target 2 lun 0 (pass2,da2)
<ATA WDC WD40EFRX-68W 0A80>        at scbus0 target 3 lun 0 (pass3,da3)
<ATA WDC WD40EFRX-68W 0A80>        at scbus0 target 4 lun 0 (pass4,da4)
<ATA WDC WD40EFRX-68W 0A80>        at scbus0 target 5 lun 0 (pass5,da5)
<ATA WDC WD40EFRX-68W 0A80>        at scbus0 target 6 lun 0 (pass6,da6)
<ATA WDC WD40EFRX-68W 0A80>        at scbus0 target 7 lun 0 (pass7,da7)
<Samsung SSD 840 PRO Series DXM05B0Q>  at scbus3 target 0 lun 0 (ada0,pass8)

# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 3h24m with 0 errors on Mon Mar 24 01:19:59 2014
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0

errors: No known data errors
```

Everything so far has gone smoothly, no incompatibilities or unexpected behavior. None that are apparent, anyway.

The problem is that whenever I start any extended activity on the pool, there's a chance the entire system locks up. All connections to the server are terminated and the console is frozen. No messages are written to the console, nor to any log files I have found. It can happen within minutes of starting the activity, or after running for hours. It has happened while scrubbing, while writing to the pool and while simply reading from it.

I should mention that in all the instances where it froze shortly after the activity started, the system only had 8 GB of memory. As of Friday, it has 32 GB, and both freezes since then came after at least half an hour of work. It may be coincidence, but it seems to take longer now. Also, none of the initial RAM is part of the current set, so that probably rules out faulty memory as the cause.

I also suspected overheating as a possible cause, since both the Intel H77 chip and the LSI SAS2008 became pretty hot during operation, but after dedicating an 80 mm fan to each chip, they now both measure around 33 °C under load - at least on the surface. Besides, if overheating were the culprit, I would expect more consistency between lockups.

I've searched extensively for a solution or at least an explanation, but have so far come up mostly empty. I seem to be affected by the issue in kern/187594, but I'm unsure whether it can result in anything as drastic as a complete system halt. I was watching `top` during the last two freezes, and they did occur after free memory hit 25%.
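Since nothing reaches the console or the logs before a freeze, one way to capture the memory state leading up to it is to append a sample to a file every few seconds and force it to disk. Below is a minimal sketch, assuming Python is installed and that the sysctls `vm.stats.vm.v_free_count`, `vm.stats.vm.v_page_count` and `kstat.zfs.misc.arcstats.size` are available (they normally are on FreeBSD with ZFS loaded). The function names and the log path in the usage note are made up for illustration, and the percentage only counts free pages, which should roughly match the "Free" figure in `top`.

```python
import os
import subprocess
import time


def read_sysctl(name):
    """Return the integer value of a numeric sysctl via `sysctl -n`."""
    out = subprocess.check_output(["sysctl", "-n", name])
    return int(out.decode().strip())


def free_percent(free_pages, total_pages):
    """Free memory as a percentage of all pages, rounded to one decimal."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return round(100.0 * free_pages / total_pages, 1)


def format_sample(timestamp, free_pages, total_pages, arc_bytes):
    """Build one log line: time, free memory percentage, ARC size in MiB."""
    return "{} free={}% arc={}MiB".format(
        timestamp, free_percent(free_pages, total_pages), arc_bytes // (1024 * 1024)
    )


def log_samples(path, interval=5, count=None):
    """Append a sample every `interval` seconds and fsync after each one.

    `count=None` means keep going until interrupted.
    """
    taken = 0
    with open(path, "a") as log:
        while count is None or taken < count:
            line = format_sample(
                time.strftime("%Y-%m-%d %H:%M:%S"),
                read_sysctl("vm.stats.vm.v_free_count"),
                read_sysctl("vm.stats.vm.v_page_count"),
                read_sysctl("kstat.zfs.misc.arcstats.size"),
            )
            log.write(line + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before the next sleep
            taken += 1
            time.sleep(interval)
```

Calling something like `log_samples("/var/log/memwatch.log")` (a hypothetical path) before starting a scrub would leave the last few samples on disk after a lockup, provided the system drive itself is still accepting writes at that point.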

Does anyone have an idea as to what's happening?
Also, please let me know if you require the contents of system variables or configuration files.


 PSU: Corsair RM450
 Motherboard: ASUS P8H77-M Pro
 CPU: Intel Core i3-3250
 RAM: Corsair XMS3
 HBA: SuperMicro AOC-USAS2-L8e


```
# uname -a
FreeBSD dingo.pawtuxet.dk 10.0-RELEASE FreeBSD 10.0-RELEASE #0 r260789: Thu Jan 16 22:34:59 UTC 2014     root@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
```


```
# dmesg
Copyright (c) 1992-2014 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 10.0-RELEASE #0 r260789: Thu Jan 16 22:34:59 UTC 2014
    root@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610
CPU: Intel(R) Core(TM) i3-3250 CPU @ 3.50GHz (3500.08-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x306a9  Family = 0x6  Model = 0x3a  Stepping = 9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x3d9ae3bf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,POPCNT,TSCDLT,XSAVE,OSXSAVE,AVX,F16C>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  Standard Extended Features=0x281<GSFSBASE,SMEP,ENHMOVSB>
  TSC: P-state invariant, performance statistics
real memory  = 34359738368 (32768 MB)
avail memory = 32979804160 (31451 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <ALASKA A M I>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s) x 2 SMT threads
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  2
 cpu3 (AP): APIC ID:  3
ioapic0 <Version 2.0> irqs 0-23 on motherboard
Cuse4BSD v0.1.30 @ /dev/cuse
kbd1 at kbdmux0
random: <Software, Yarrow> initialized
acpi0: <ALASKA A M I> on motherboard
acpi0: Power Button (fixed)
acpi0: reservation of 67, 1 (4) failed
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 550
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
Event timer "HPET3" frequency 14318180 Hz quality 440
Event timer "HPET4" frequency 14318180 Hz quality 440
atrtc0: <AT realtime clock> port 0x70-0x77 irq 8 on acpi0
atrtc0: Warning: Couldn't map I/O.
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
vgapci0: <VGA-compatible display> port 0xf000-0xf03f mem 0xf7800000-0xf7bfffff,0xe0000000-0xefffffff irq 16 at device 2.0 on pci0
agp0: <IvyBridge desktop GT1 IG> on vgapci0
agp0: aperture size is 256M, detected 262140k stolen memory
vgapci0: Boot video device
xhci0: <Intel Panther Point USB 3.0 controller> mem 0xf7e00000-0xf7e0ffff irq 16 at device 20.0 on pci0
xhci0: 32 byte context size.
xhci0: Port routing mask set to 0xffffffff
usbus0 on xhci0
pci0: <simple comms> at device 22.0 (no driver attached)
ehci0: <Intel Panther Point USB 2.0 controller> mem 0xf7e17000-0xf7e173ff irq 23 at device 26.0 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci0
hdac0: <Intel Panther Point HDA Controller> mem 0xf7e10000-0xf7e13fff irq 22 at device 27.0 on pci0
pcib2: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci2: <ACPI PCI bus> on pcib2
mps0: <LSI SAS2008> port 0xe000-0xe0ff mem 0xf7dc0000-0xf7dc3fff,0xf7d80000-0xf7dbffff irq 16 at device 0.0 on pci2
mps0: Firmware: 16.00.01.00, Driver: 16.00.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
pcib3: <ACPI PCI-PCI bridge> irq 16 at device 28.4 on pci0
pci3: <ACPI PCI bus> on pcib3
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xd000-0xd0ff mem 0xf0004000-0xf0004fff,0xf0000000-0xf0003fff irq 16 at device 0.0 on pci3
re0: Using 1 MSI-X message
re0: Chip rev. 0x48000000
re0: MAC rev. 0x00000000
miibus0: <MII bus> on re0
rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Ethernet address: d8:50:e6:41:59:24
pcib4: <ACPI PCI-PCI bridge> irq 18 at device 28.6 on pci0
pci4: <ACPI PCI bus> on pcib4
atapci0: <Marvell ATA controller> port 0xc040-0xc047,0xc030-0xc033,0xc020-0xc027,0xc010-0xc013,0xc000-0xc00f mem 0xf7c10000-0xf7c101ff irq 18 at device 0.0 on pci4
ata2: <ATA channel> at channel 0 on atapci0
ata3: <ATA channel> at channel 1 on atapci0
ehci1: <Intel Panther Point USB 2.0 controller> mem 0xf7e16000-0xf7e163ff irq 23 at device 29.0 on pci0
usbus2: EHCI version 1.0
usbus2 on ehci1
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci1: <Intel Panther Point SATA300 controller> port 0xf110-0xf117,0xf100-0xf103,0xf0f0-0xf0f7,0xf0e0-0xf0e3,0xf0d0-0xf0df,0xf0c0-0xf0cf irq 19 at device 31.2 on pci0
ata4: <ATA channel> at channel 0 on atapci1
ata5: <ATA channel> at channel 1 on atapci1
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
atapci2: <Intel Panther Point SATA300 controller> port 0xf0b0-0xf0b7,0xf0a0-0xf0a3,0xf090-0xf097,0xf080-0xf083,0xf070-0xf07f,0xf060-0xf06f irq 19 at device 31.5 on pci0
ata6: <ATA channel> at channel 0 on atapci2
ata7: <ATA channel> at channel 1 on atapci2
acpi_button0: <Power Button> on acpi0
acpi_tz0: <Thermal Zone> on acpi0
acpi_tz1: <Thermal Zone> on acpi0
ppc1: <Parallel port> port 0x378-0x37f irq 5 on acpi0
ppc1: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc1
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
orm0: <ISA Option ROM> at iomem 0xc0000-0xce7ff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
ppc0: cannot reserve I/O port range
est0: <Enhanced SpeedStep Frequency Control> on cpu0
p4tcc0: <CPU Frequency Thermal Control> on cpu0
est1: <Enhanced SpeedStep Frequency Control> on cpu1
p4tcc1: <CPU Frequency Thermal Control> on cpu1
est2: <Enhanced SpeedStep Frequency Control> on cpu2
p4tcc2: <CPU Frequency Thermal Control> on cpu2
est3: <Enhanced SpeedStep Frequency Control> on cpu3
p4tcc3: <CPU Frequency Thermal Control> on cpu3
Timecounters tick every 1.000 msec
hdacc0: <Realtek ALC892 HDA CODEC> at cad 0 on hdac0
hdaa0: <Realtek ALC892 Audio Function Group> at nid 1 on hdacc0
pcm0: <Realtek ALC892 (Rear Analog 7.1/2.0)> at nid 20,22,21,23 and 24,26 on hdaa0
pcm1: <Realtek ALC892 (Front Analog)> at nid 27 and 25 on hdaa0
pcm2: <Realtek ALC892 (Rear Digital)> at nid 30 on hdaa0
pcm3: <Realtek ALC892 (Onboard Digital)> at nid 17 on hdaa0
hdacc1: <Intel Panther Point HDA CODEC> at cad 3 on hdac0
hdaa1: <Intel Panther Point Audio Function Group> at nid 1 on hdacc1
pcm4: <Intel Panther Point (HDMI/DP 8ch)> at nid 5 on hdaa1
pcm5: <Intel Panther Point (HDMI/DP 8ch)> at nid 7 on hdaa1
random: unblocking device.
usbus0: 5.0Gbps Super Speed USB v3.0
usbus1: 480Mbps High Speed USB v2.0
usbus2: 480Mbps High Speed USB v2.0
ugen0.1: <0x8086> at usbus0
uhub0: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
ugen2.1: <Intel> at usbus2
uhub1: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus2
ugen1.1: <Intel> at usbus1
uhub2: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
uhub0: 8 ports with 8 removable, self powered
uhub1: 2 ports with 2 removable, self powered
uhub2: 2 ports with 2 removable, self powered
ugen2.2: <vendor 0x8087> at usbus2
uhub3: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus2
ugen1.2: <vendor 0x8087> at usbus1
uhub4: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus1
uhub4: 6 ports with 6 removable, self powered
uhub3: 8 ports with 8 removable, self powered
ugen2.3: <Logitech> at usbus2
uhub5: <Logitech Logitech BT Mini-Receiver, class 9/0, rev 2.00/49.00, addr 3> on usbus2
uhub5: 3 ports with 1 removable, bus powered
ugen2.4: <Logitech> at usbus2
ukbd0: <Logitech Logitech BT Mini-Receiver, class 0/0, rev 2.00/49.00, addr 4> on usbus2
kbd2 at ukbd0
ugen2.5: <Logitech> at usbus2
ugen2.6: <Pulse-Eight> at usbus2
da0 at mps0 bus 0 scbus0 target 0 lun 0
da0: <ATA WDC WD40EFRX-68W 0A80> Fixed Direct Access SCSI-6 device
da0: Serial Number      WD-WCC4E0889559
da0: 600.000MB/s transfers
da0: Command Queueing enabled
da0: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da0: quirks=0x8<4K>
da1 at mps0 bus 0 scbus0 target 1 lun 0
da1: <ATA WDC WD40EFRX-68W 0A80> Fixed Direct Access SCSI-6 device
da1: Serial Number      WD-WCC4E0879162
da1: 600.000MB/s transfers
da1: Command Queueing enabled
da1: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da1: quirks=0x8<4K>
da2 at mps0 bus 0 scbus0 target 2 lun 0
da2: <ATA WDC WD40EFRX-68W 0A80> Fixed Direct Access SCSI-6 device
da2: Serial Number      WD-WCC4E0899225
da2: 600.000MB/s transfers
da2: Command Queueing enabled
da2: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da2: quirks=0x8<4K>
da3 at mps0 bus 0 scbus0 target 3 lun 0
da3: <ATA WDC WD40EFRX-68W 0A80> Fixed Direct Access SCSI-6 device
da3: Serial Number      WD-WCC4E0889556
da3: 600.000MB/s transfers
da3: Command Queueing enabled
da3: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da3: quirks=0x8<4K>
da4 at mps0 bus 0 scbus0 target 4 lun 0
da4: <ATA WDC WD40EFRX-68W 0A80> Fixed Direct Access SCSI-6 device
da4: Serial Number      WD-WCC4E0909983
da4: 600.000MB/s transfers
da4: Command Queueing enabled
da4: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da4: quirks=0x8<4K>
da5 at mps0 bus 0 scbus0 target 5 lun 0
da5: <ATA WDC WD40EFRX-68W 0A80> Fixed Direct Access SCSI-6 device
da5: Serial Number      WD-WCC4E0889528
da5: 600.000MB/s transfers
da5: Command Queueing enabled
da5: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da5: quirks=0x8<4K>
da6 at mps0 bus 0 scbus0 target 6 lun 0
da6: <ATA WDC WD40EFRX-68W 0A80> Fixed Direct Access SCSI-6 device
da6: Serial Number      WD-WCC4E0899947
da6: 600.000MB/s transfers
da6: Command Queueing enabled
da6: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da6: quirks=0x8<4K>
da7 at mps0 bus 0 scbus0 target 7 lun 0
da7: <ATA WDC WD40EFRX-68W 0A80> Fixed Direct Access SCSI-6 device
da7: Serial Number      WD-WCC4E0906234
da7: 600.000MB/s transfers
da7: Command Queueing enabled
da7: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da7: quirks=0x8<4K>
ada0 at ata4 bus 0 scbus3 target 0 lun 0
ada0: <Samsung SSD 840 PRO Series DXM05B0Q> ATA-9 SATA 3.x device
ada0: Serial Number S12RNEAD401503J
ada0: 600.000MB/s transfers (SATA 3.x, UDMA5, PIO 8192bytes)
ada0: 244198MB (500118192 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad8
Netvsc initializing... SMP: AP CPU #1 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #3 Launched!
Timecounter "TSC-low" frequency 1750041194 Hz quality 1000
Trying to mount root from ufs:/dev/ada0p2 [rw]...
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
ums0: <Logitech Logitech BT Mini-Receiver, class 0/0, rev 2.00/49.00, addr 5> on usbus2
ums0: 14 buttons and [XYZT] coordinates ID=2
ums0: 8 buttons and [XYZT] coordinates ID=5
umodem0: <Pulse-Eight USB-CEC Adapter, class 0/0, rev 1.10/10.00, addr 6> on usbus2
umodem0: data interface 1, has CM over data, has break
ums1: <Pulse-Eight USB-CEC Adapter, class 0/0, rev 1.10/10.00, addr 6> on usbus2
ums1: 3 buttons and [XY] coordinates ID=0
pid 1032 (xfsettingsd), uid 1001: exited on signal 11 (core dumped)
info: [drm] Initialized drm 1.1.0 20060810
drmn0: <Intel IvyBridge> on vgapci0
info: [drm] MSI enabled 1 message(s)
info: [drm] AGP at 0xe0000000 256MB
iicbus0: <Philips I2C bus> on iicbb0 addr 0xff
iic0: <I2C generic I/O> on iicbus0
iic1: <I2C generic I/O> on iicbus1
iicbus2: <Philips I2C bus> on iicbb1 addr 0xff
iic2: <I2C generic I/O> on iicbus2
iic3: <I2C generic I/O> on iicbus3
iicbus4: <Philips I2C bus> on iicbb2 addr 0xff
iic4: <I2C generic I/O> on iicbus4
iic5: <I2C generic I/O> on iicbus5
iicbus6: <Philips I2C bus> on iicbb3 addr 0xff
iic6: <I2C generic I/O> on iicbus6
iic7: <I2C generic I/O> on iicbus7
iicbus8: <Philips I2C bus> on iicbb4 addr 0xff
iic8: <I2C generic I/O> on iicbus8
iic9: <I2C generic I/O> on iicbus9
iicbus10: <Philips I2C bus> on iicbb5 addr 0xff
iic10: <I2C generic I/O> on iicbus10
iic11: <I2C generic I/O> on iicbus11
iicbus12: <Philips I2C bus> on iicbb6 addr 0xff
iic12: <I2C generic I/O> on iicbus12
iic13: <I2C generic I/O> on iicbus13
iicbus14: <Philips I2C bus> on iicbb7 addr 0xff
iic14: <I2C generic I/O> on iicbus14
iic15: <I2C generic I/O> on iicbus15
info: [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
info: [drm] Driver supports precise vblank timestamp query.
drmn0: taking over the fictitious range 0xe0000000-0xf0000000
info: [drm] GMBUS timed out, falling back to bit banging on pin 7 [gmbus bus dpd]
info: [drm] Initialized i915 1.6.0 20080730
```


----------



## Pawtuxet (Apr 10, 2014)

Hi again,

Since my original post, I've had to revise my impression of the cause of this issue, because I've now had three lockups that involved no ZFS activity whatsoever.
However, I'm cautiously optimistic that I may have found the actual cause, though it's too soon to tell. I'd like to see a few days of stable operation before breaking out the celebratory last-can-of-imported-root-beer that I'd been hanging on to.

The system drive is a 256 GB Samsung 840 Pro SSD which I scavenged from the old server, where it had been running for a year and a half without incident, so I had entirely discounted it as a possible avenue of failure. That is, until a friend reminded me of the trouble I had on my Windows machine with an earlier OCZ Agility 3 drive, which kept disconnecting itself until it was finally stabilized by its 3rd or 4th firmware update. There was an update available for the Samsung drive, and although the changelog didn't contain any obvious smoking-gun entries, it does seem to have made a positive difference.

I know I should confirm the result of each change before implementing another, but I also updated the system to 10.0-RELEASE-p1 shortly before the SSD firmware update, so that may also be the reason. My money's on the firmware, though. That is, once I'm certain the issue's resolved, which I'm not, although it looks promising.


----------



## SirDice (Apr 10, 2014)

Pawtuxet said:

> I know I should confirm the result of each change, before implementing another, but I also updated the system to 10.0-RELEASE-p1 shortly before the SSD firmware update, so that may also be the reason.


I very much doubt the patch was involved. Only the OpenSSL library was updated, nothing else. With normal OpenSSL traffic the bug wouldn't manifest; only in some extreme cases could it lead to an information leak. Nothing in the kernel uses SSL anyway, so I doubt this had any impact at all.



> My money's on the firmware, though.


Mine too. Speaking of firmware, also make sure the machine and the LSI controller themselves are using the latest UEFI/BIOS/Firmware as that could also have an impact.

If all else fails, there's always the option to upgrade to 10.0-STABLE. There have been quite a number of commits since the release. Those bug fixes won't find their way to 10.0; only security issues get patched on the release versions.


----------



## Pawtuxet (Apr 11, 2014)

Thanks for the reply, @SirDice.
The motherboard is running its latest firmware, and now so is the SSD. I think there's something newer for the LSI controller than the 16.00.01.00 version it's running now, but I haven't tried updating it, since I've no idea how to upgrade its driver accordingly - I read somewhere that they absolutely have to match. I'm not sure whether anything else has updateable firmware.


```
mps0: <LSI SAS2008> port 0xe000-0xe0ff mem 0xf7dc0000-0xf7dc3fff,0xf7d80000-0xf7dbffff irq 16 at device 0.0 on pci2
mps0: Firmware: 16.00.01.00, Driver: 16.00.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
```
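For comparing the two versions before and after any update, the `mps0` version line can be picked out of `dmesg` output programmatically. This is only a small sketch: `parse_mps_versions` and `same_phase` are hypothetical helper names, and treating the first dotted field as the part that needs to match is an assumption based on the advice mentioned above, not something taken from LSI or FreeBSD documentation.

```python
import re

# Matches lines such as "mps0: Firmware: 16.00.01.00, Driver: 16.00.00.00-fbsd"
MPS_LINE = re.compile(r"^(mps\d+): Firmware: (\S+), Driver: (\S+)$")


def parse_mps_versions(dmesg_text):
    """Return {device: (firmware, driver)} for every mps version line found."""
    versions = {}
    for line in dmesg_text.splitlines():
        match = MPS_LINE.match(line.strip())
        if match:
            device, firmware, driver = match.groups()
            versions[device] = (firmware, driver)
    return versions


def same_phase(firmware, driver):
    """Compare only the first dotted field (assumed to be the release phase)."""
    return firmware.split(".")[0] == driver.split(".")[0]


sample = "mps0: Firmware: 16.00.01.00, Driver: 16.00.00.00-fbsd"
firmware, driver = parse_mps_versions(sample)["mps0"]
print(firmware, driver, same_phase(firmware, driver))
```

With the line shown above, both versions start with `16`, so under that assumption the current firmware and driver would count as matching.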

Sadly, a couple of hours after my last post, the machine froze up again. This was about 60%, or 2.5 hours, into a scrub that was estimated to take around 4.5 hours.
I've noticed a pattern of sorts - namely, that once the first crash occurs, it's very likely to happen again soon after restarting, if I let the activity resume. In this case, scrubbing continued from where it was interrupted and within 3 minutes, it froze again. This happened three times, before I stopped the scrub.

I'd already discounted "overheating" as unlikely, but I may have to have another look - this really feels like a heat issue. The ambient temperature rises slightly throughout the day, and it may be enough to nudge it into fail-territory. However, the only FreeBSD port I've found that's given me any kind of hardware info at all is sysutils/healthd, and given that it doesn't recognize the board, I don't really know if it's reliable.


```
# healthd -d
************************
* Hardware Information *
************************
WinBond Chip: (unknown)
************************

Temp.= 31.0, 32.5,  0.0; Rot.=    0,    0,    0
 Vcore = 1.46, 2.00; Volt. = 3.28, 5.51,  7.72, -14.16, -4.58
```

This is when the machine is idle, but even under load I haven't seen it rise above 34 degrees.
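To see whether the readings creep up before a freeze rather than relying on occasional glances, the `Temp.=` line can be parsed and logged over time. A rough sketch follows, with the line format taken from the `healthd -d` output above (other healthd versions may print it differently). The helper names are hypothetical, and treating a reading of `0.0` as "no sensor connected" is an assumption.

```python
import re

# Captures everything between "Temp.=" and the first semicolon.
TEMP_LINE = re.compile(r"^Temp\.=\s*([^;]+);")


def parse_temps(healthd_output):
    """Return the temperature readings from the first `Temp.=` line, or []."""
    for line in healthd_output.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            return [float(value) for value in match.group(1).split(",")]
    return []


def hottest(readings):
    """Highest non-zero reading; 0.0 is assumed to mean 'no sensor'."""
    live = [value for value in readings if value > 0.0]
    return max(live) if live else None


sample = "Temp.= 31.0, 32.5,  0.0; Rot.=    0,    0,    0"
print(parse_temps(sample), hottest(parse_temps(sample)))
```

Feeding this the output of `healthd -d` at a fixed interval and appending the result to a file would show whether the last reading before a lockup differs from the idle values, keeping in mind that the chip is reported as unknown, so the numbers themselves may not be trustworthy.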

I'm considering installing Windows on a spare drive, to see if I can provoke a similar response and maybe get a reading from the internal sensors. I'm sure there are more than two in there. The LSI controller should also have one.

If that checks out, I'll go ahead and give 10.0-STABLE a try.


----------

