# Had a few crashes



## azathoth (Oct 26, 2017)

I think it's because my second internal 500 GB drive, formatted with UFS, is the storage for transmission-qt...

The box would just go down hard.

I ran fsck and it says the secondary GPT table is corrupt, which is weird because I did a gpart destroy before formatting with newfs -U.

I think maybe fsck wrote some stuff as root that my user g couldn't read, causing a soft-updates panic?

I also thought it might be a bad RAM chip, but it has been stable so far after I ran fsck twice and ran chown -R g: /mnt/a, where I mount the drive.

transmission-qt is happy now.

It was just weird because I originally ran fsck when it faulted, but it went down hard again.

I think I may have fixed it with the chown; fingers crossed, no more crashes.

FreeBSD 11.1, amd64


----------



## aragats (Oct 27, 2017)

I would advise installing sysutils/smartmontools and checking its report, in particular:

```
# smartctl -a /dev/ada0 | grep Reallocated_Sector
```
If the last number (the raw value) is not zero, the disk has already remapped bad sectors; if that count keeps growing, the disk is degrading and may fail soon. You can also run a test:

```
# smartctl -t short /dev/ada0
# smartctl -l selftest /dev/ada0
```
(Replace /dev/ada0 with your device name)
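
To pull out just that number for scripting, here is a minimal sketch. It assumes the ten-column attribute table that `smartctl -A` prints for ATA drives; `realloc_raw` is a made-up helper name, not part of smartmontools.

```sh
#!/bin/sh
# realloc_raw: hypothetical helper. Reads smartctl attribute output on stdin
# and prints the RAW_VALUE (10th column) of the Reallocated_Sector_Ct row.
realloc_raw() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Example on a live system (as root; replace ada0 with your device):
#   smartctl -A /dev/ada0 | realloc_raw
```

Running it periodically and comparing the values shows whether the count is growing, which matters more than a single small non-zero reading.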


----------



## azathoth (Oct 27, 2017)

So far so good, running solid since yesterday.

```
root@nofapp:~ # smartctl -a /dev/ada0 | grep Reallocated_Sector
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
root@nofapp:~ # smartctl -a /dev/ada1 | grep Reallocated_Sector
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
```

Oh wow, ada0 might fail soon?

```
root@nofapp:~ # smartctl -a /dev/ada0 > test
root@nofapp:~ # less test
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-RELEASE-p1 amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HITACHI HDS725050KLA360
Serial Number:    ZBHVW3NH
LU WWN Device Id: 5 000cca 20eda4ef3
Firmware Version: K2AOAB0A
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA/ATAPI-7 T13/1532D revision 1
Local Time is:    Fri Oct 27 18:36:06 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (10419) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 174) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   159   159   050    Pre-fail  Offline      -       207
  3 Spin_Up_Time            0x0007   110   110   024    Pre-fail  Always       -       591 (Average 677)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1130
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   136   136   020    Pre-fail  Offline      -       31
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       27549
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1119
192 Power-Off_Retract_Count 0x0032   099   099   050    Old_age   Always       -       1485
193 Load_Cycle_Count        0x0012   099   099   050    Old_age   Always       -       1485
194 Temperature_Celsius     0x0002   141   141   000    Old_age   Always       -       39 (Min/Max 15/57)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```


----------



## azathoth (Oct 28, 2017)

Crap, just had 3 crashes!
Got back up after going single-user and running fsck -y on /dev/ada1 and /dev/ada0.
Then, after doing it twice, I couldn't reboot.
Had some weird failure.
Rebooted and got 1 more crash.
Then rebooted, it came up, and I ran chown -R g: on /mnt/a and /usr/home/g.
Now top as root shows some fsck thing happening...
Phew.
Jeez.
I wonder if ada0 is getting flaky?


----------



## azathoth (Oct 28, 2017)

Can someone give a bit of guidance as to what to do here? Should I dump the data and reinstall to ada1 as the OS drive?


----------



## azathoth (Oct 28, 2017)

```
root@nofapp:~ # smartctl -l selftest /dev/ada0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-RELEASE-p1 amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     27550
```


----------



## azathoth (Oct 28, 2017)

Moving files, gonna reinstall to ada1 as the OS disk. When I do gpart destroy, that cleans all the stuff off the disk, right?
I did that before doing newfs -U on the disk, per the handbook, but I still get this message about the secondary GPT being corrupt, so I'm not sure how to sanitize the disk completely before doing the fresh install.

Me oh my!
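
The mount output posted later in this thread shows the file system sitting directly on /dev/ada1, with no partition table. One commonly used alternative is to write a fresh GPT and put UFS on a partition; the sketch below shows it as an assumption, not as a confirmed fix for the GPT message. `wipe_plan` is a hypothetical helper that only prints the commands, since every one of them destroys data.

```sh
#!/bin/sh
# wipe_plan DISK: hypothetical helper. Prints (does NOT run) the commands
# for repartitioning DISK, so they can be reviewed before being typed in.
wipe_plan() {
    disk=$1
    echo "gpart destroy -F $disk"                # drop the old table, even if not empty
    echo "gpart create -s gpt $disk"             # write a fresh GPT
    echo "gpart add -t freebsd-ufs -a 1m $disk"  # one aligned UFS partition (p1)
    echo "newfs -U /dev/${disk}p1"               # UFS with soft updates on that partition
}

wipe_plan ada1
```

Check the device name twice before running any of the printed commands; pointing them at the wrong disk wipes it.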


----------



## azathoth (Oct 28, 2017)

I dunno how to move this thread to Storage...


----------



## azathoth (Oct 28, 2017)

Is a full install onto /dev/ada1 the best way to switch the OS and the boot loader to the second disk? Or can I copy stuff over and install a bootloader from where I am now?


----------



## _martin (Oct 28, 2017)

Do you have a crash dump from that crash? Or does "crash hard" mean it just powered off?
It could all be hardware related, so it's hard to say. If you can afford having the machine down (or, even better, if you have easy access to the box), run a battery of tests on that machine and try to isolate the issue.

I suggest (not including SMART, as it was already suggested):

a) test RAM - memtest86+ or similar (i.e. boot straight into memtest86+)
b) test CPU - some math, e.g. finding MD5 collisions, spread across all CPUs/threads
c) use a different disk and test again
d) PSU - if you have another power supply at hand, swap it with the current one and test again

I would rather boot manually from the other disk, especially if you are just testing. But you can install a bootloader on the secondary disk, remove the first one, and let it boot.
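
As a rough illustration of item b), the sketch below runs the same deterministic computation in several background workers and checks that they all agree. It is a light load and no substitute for a dedicated stress tool; `cpu_consistency` and the input size of 200000 are arbitrary choices made for this example.

```sh
#!/bin/sh
# cpu_consistency WORKERS ROUNDS: hypothetical helper. Each background worker
# checksums the same generated input ROUNDS times. On healthy hardware all
# workers must produce byte-identical output.
cpu_consistency() {
    workers=$1
    rounds=$2
    tmp=$(mktemp -d "${TMPDIR:-/tmp}/cpucheck.XXXXXX") || return 1
    w=1
    while [ "$w" -le "$workers" ]; do
        (
            r=0
            while [ "$r" -lt "$rounds" ]; do
                seq 1 200000 | cksum
                r=$((r + 1))
            done > "$tmp/out.$w"
        ) &
        w=$((w + 1))
    done
    wait    # block until every worker has finished

    status=0
    for f in "$tmp"/out.*; do
        cmp -s "$tmp/out.1" "$f" || status=1
    done
    rm -rf "$tmp"
    [ "$status" -eq 0 ] && echo "all $workers workers agree" || echo "MISMATCH between workers"
    return "$status"
}

# Example on FreeBSD, one worker per CPU:
#   cpu_consistency "$(sysctl -n hw.ncpu)" 500
```

A mismatch here would point at CPU or RAM rather than at the disk, since the data never leaves memory except for the small result files.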


----------



## azathoth (Oct 28, 2017)

Where would I find the crash dump?
How do I test the CPU?
I think I remember memtest is an ISO image I can burn to a USB key and boot from?


----------



## azathoth (Oct 28, 2017)

```
root@nofapp:~ # ls /var/crash/
minfree

root@nofapp:~ # less /var/crash/minfree
2048
```


----------



## _martin (Oct 28, 2017)

Yes, /var/crash is the default path for crash dumps. Check whether you have dumps enabled:

`# sysctl kern.coredump
kern.coredump: 1`

Also check if you have dumpdev set in /etc/rc.conf, at least to "AUTO":

`# grep dumpdev /etc/rc.conf
dumpdev="AUTO"`

This option assumes you have at least one swap device defined where the system can dump.
To activate it right away, specify the device with dumpon(8); if you are not comfortable doing that, you can reboot the machine instead.

You can test the CPU by running a stress test on it, as I mentioned: some cracking, computing something, etc. The ports tree may have some tools under benchmarks too, but I have never used them.
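
A minimal sketch of the dump setup described above. `rc_conf_dumpdev` is a hypothetical helper for reading the setting back, and `/dev/ada0p3` is an assumption taken from the fstab posted later in the thread; substitute your own swap partition.

```sh
#!/bin/sh
# rc_conf_dumpdev [FILE]: hypothetical helper. Prints the dumpdev value from
# an rc.conf-style file; the last assignment wins, as it does when rc.conf
# is sourced at boot.
rc_conf_dumpdev() {
    file=${1:-/etc/rc.conf}
    sed -n 's/^[[:space:]]*dumpdev=//p' "$file" | tail -n 1 | tr -d "\"'"
}

# On the FreeBSD box itself (as root), the steps would be roughly:
#   sysrc dumpdev="AUTO"    # persist the setting in /etc/rc.conf
#   dumpon /dev/ada0p3      # assumption: your swap partition; no reboot needed
#   dumpon -l               # should now print the dump device
#   df -h /var/crash        # confirm there is room for a vmcore
```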


----------



## azathoth (Oct 28, 2017)

_martin said:


> Yes, /var/crash is the default path for crash dumps. Check whether you have dumps enabled:
> 
> `# sysctl kern.coredump
> kern.coredump: 1`
> ...







```
root@nofapp:~ # sysctl kern.coredump
kern.coredump: 1
```


----------



## azathoth (Oct 28, 2017)

Ah, dumpdev was set to NO. OK, I have now set it to AUTO.


----------



## _martin (Oct 28, 2017)

OK, so after this, if your system panics you should be able to get the dump (if a panic actually happened in your case).
You can verify that you have set it up properly with:
`dumpon -l`
You need to see some device displayed. Also verify that /var/crash has enough free space left, at least a few gigabytes in your case.

And I still recommend doing those tests I mentioned earlier. Bad RAM and bad disks are usually the issue, sometimes the CPU and rarely the PSU. Unless it's something different, which may also be the case.


----------



## azathoth (Oct 28, 2017)

_martin said:


> OK, so after this, if your system panics you should be able to get the dump (if a panic actually happened in your case).
> You can verify that you have set it up properly with:
> `dumpon -l`
> You need to see some device displayed. Also verify that /var/crash has enough free space left, at least a few gigabytes in your case.
> ...



OK, just got a crash!!

Looking up how to analyze it!!


----------



## azathoth (Oct 28, 2017)

```
root@nofapp:/var/crash # ls
bounds          info.0          info.last       minfree         vmcore.0        vmcore.last
root@nofapp:/var/crash # kgdb kernel.debug vmcore.0
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...kernel.debug: No such file or directory.

Can't open a vmcore without a kernel
(kgdb)
```
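
The error above comes from `kernel.debug` being given as a relative path that does not exist in /var/crash. A minimal sketch of picking a kernel image that does exist, assuming the stock locations `/boot/kernel/kernel` and, if the kernel debug symbols were installed, `/usr/lib/debug/boot/kernel/kernel.debug`; `first_existing` is a hypothetical helper.

```sh
#!/bin/sh
# first_existing PATH...: hypothetical helper. Prints the first argument that
# exists on disk and returns 0, or returns 1 if none of them exist.
first_existing() {
    for p in "$@"; do
        if [ -e "$p" ]; then
            printf '%s\n' "$p"
            return 0
        fi
    done
    return 1
}

# On the FreeBSD box:
#   kernel=$(first_existing /usr/lib/debug/boot/kernel/kernel.debug /boot/kernel/kernel) &&
#       kgdb "$kernel" /var/crash/vmcore.0
#   cat /var/crash/info.0    # short summary from savecore, including the panic string
```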


----------



## azathoth (Oct 28, 2017)

```
Oct 28 17:19:39 nofapp savecore: reboot after panic: ufs_dirbad: /mnt/a: bad dir ino 61074817 at offset 0: mangled entry
Oct 28 17:19:39 nofapp savecore: writing core to /var/crash/vmcore.0
Oct 28 17:20:59 nofapp kernel: info: [drm] Initialized drm 1.1.0 20060810
```

Why would the whole box go down just because of a problem with the second internal drive?

```
root@nofapp:/var/crash # mount
/dev/ada0p2 on / (ufs, local, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
/dev/ada1 on /mnt/a (ufs, local, soft-updates)
root@nofapp:/var/crash # cat /etc/fstab
# Device        Mountpoint      FStype  Options Dump    Pass#
/dev/ada0p2     /               ufs     rw      1       1
/dev/ada0p3     none            swap    sw      0       0
/dev/ada1       /mnt/a          ufs     rw      2       2
```


----------



## azathoth (Oct 28, 2017)




----------



## azathoth (Oct 28, 2017)

Ran fsck -y on /dev/ada1 two times.

I did this before, though.


----------



## azathoth (Oct 28, 2017)




----------



## azathoth (Oct 28, 2017)

```
SALVAGE? yes

SUMMARY INFORMATION BAD
SALVAGE? yes

BLK(S) MISSING IN BIT MAPS
SALVAGE? yes

14212 files, 103268609 used, 14993626 free (3330 frags, 1873787 blocks, 0.0% fragmentation)

***** FILE SYSTEM STILL DIRTY *****

***** FILE SYSTEM WAS MODIFIED *****

***** PLEASE RERUN FSCK *****
root@nofapp:/var/crash # fsck -y /dev/ada1
** /dev/ada1
** Last Mounted on /mnt/a
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
14212 files, 103268609 used, 14993626 free (3330 frags, 1873787 blocks, 0.0% fragmentation)

***** FILE SYSTEM MARKED CLEAN *****

root@nofapp:/var/crash # df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
/dev/ada0p2    447G    181G    231G    44%    /
devfs          1.0K    1.0K      0B   100%    /dev
/dev/ada1      451G    394G     21G    95%    /mnt/a
root@nofapp:/var/crash # chown -R g: /mnt/a
```

Ran this last time too, though.


----------



## _martin (Oct 29, 2017)

Meh, I think you should take those pictures down. They are really distracting, and I'm not sure they are in line with the rules.

Since you can see the crash and the panic string, you can rule out the PSU as the issue. A filesystem inconsistency caused the panic. So the main suspects are RAM and disk (though you still can't rule out the CPU).

You didn't run the kernel debugging command properly, but for the time being it doesn't matter. Do that RAM check with memtest86+ and test with another disk (or stop using the suspicious one).

You can also run a check where you leave the big disk out and start writing data to /tmp, for example (assuming sh):

`while true; do for i in $(seq 8); do echo creating blob$i; dd if=/dev/zero of=/tmp/blob${i} bs=1024M count=16; done; done`

This creates eight 16 GB files in /tmp and keeps rewriting them until you interrupt it, to stress writes a bit.
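
The loop above only exercises writes. A variation that also reads the data back and compares checksums is sketched here; `write_verify` and its parameters are hypothetical names for this example. Note that for small files the read-back may be served from the buffer cache rather than from the platters, so use sizes well above the amount of RAM if the goal is to test the disk.

```sh
#!/bin/sh
# write_verify DIR COUNT SIZE_MB: hypothetical helper. Writes COUNT files of
# random data into DIR, checksumming the stream while it is written, then
# re-reads each file and compares the two checksums.
write_verify() {
    dir=$1
    count=$2
    size_mb=$3
    i=1
    while [ "$i" -le "$count" ]; do
        f="$dir/blob$i"
        # tee writes the file while cksum sees exactly the same bytes
        written=$(dd if=/dev/urandom bs=1048576 count="$size_mb" 2>/dev/null | tee "$f" | cksum)
        readback=$(cksum < "$f")
        if [ "$written" != "$readback" ]; then
            echo "MISMATCH in $f"
            return 1
        fi
        i=$((i + 1))
    done
    echo "verified $count files"
}

# Example: eight 16 GB files in /tmp, matching the sizes used above
#   write_verify /tmp 8 16384
```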


----------



## azathoth (Nov 4, 2017)

_martin said:


> Meh, I think you should take those pictures down. They are really distracting, and I'm not sure they are in line with the rules.
> 
> Since you can see the crash and the panic string, you can rule out the PSU as the issue. A filesystem inconsistency caused the panic. So the main suspects are RAM and disk (though you still can't rule out the CPU).
> 
> ...



The pics were for Halloween.

Hmm, everything has been stable for the last week or so. I wonder if soft updates on the second (non-problem) drive caused some problem when combined with transmission-qt, which does some kind of recheck after a fault.


----------



## ralphbsz (Nov 4, 2017)

The smartctl data you show up there looks fine; the disk seems healthy.  One sector has been reallocated (raw value 1), but the pending and uncorrectable sector counts are zero and no errors are logged.

Now, the fact that the disk itself (spinning platters and moving heads) is healthy doesn't always mean that the disk interface is also healthy.  But there is no indication of a bad interface in your case.  In theory, one could have a bad SATA cable, and given the state of hardware drivers, it can cause FreeBSD on commodity hardware to come to a sudden stop (no writing of crash dumps, no reboot, machine completely frozen).  Been there, done that, had to pull all disk cables out one at a time until it started working again.  But that's not your problem: (a) You are getting a crash dump, and it's telling you what the reason is, which is not hardware related.  (b) The machine reboots fine.

As already discussed by others above: The reason is a file system inconsistency, meaning a data structure on disk, in a directory, has become garbage.  It's the kind of inconsistency which causes the kernel to panic deliberately.  This could theoretically be caused by the disk hardware (an undetected read error, causing corruption in a data structure).  But more likely is a bug in the FreeBSD software.  There is no way a normal user-mode application (like transmission-qt) can legally cause this.  The only way to cause this without assuming an OS bug would be if root deliberately overwrites a disk, which is highly unlikely (unless someone has been playing with the dd command, which hopefully doesn't happen).

What should you do?  Save your data, and then reformat the problem file system.  If you have a spare disk drive, here's what I would do: Get the spare drive, format it with the desired file system, copy the surviving data onto the spare (hope the copy process survives without a crash), put the spare into the running system, take the problem disk out, and reformat it with the file system, and you have another spare again.
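
The copy step could be sketched like this, assuming both file systems are already mounted. `copy_and_verify` is a hypothetical helper and `/mnt/spare` is a made-up mount point for the spare disk; `/mnt/a` is the mount point from the thread.

```sh
#!/bin/sh
# copy_and_verify SRC DST: hypothetical helper. Copies the tree under SRC
# into DST, preserving ownership and timestamps where possible, then
# compares the two trees. Returns 0 only if the copy and the compare succeed.
copy_and_verify() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    cp -Rp "$src"/. "$dst"/ || return 1
    diff -r "$src" "$dst" > /dev/null
}

# Example (as root):
#   copy_and_verify /mnt/a /mnt/spare && echo "copy verified"
```

If the kernel panics again during the copy, the directory named in the panic message is a good candidate to skip on the next attempt.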


----------

