# Bizarre Disk failure / "Periph destroyed" | Lost



## jaymax (May 10, 2016)

`# uname -a`

```
FreeBSD MACH1 10.2-RELEASE FreeBSD 10.2-RELEASE #0 r286666: Wed Aug 12 19:31:38 UTC 2015  root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  i386
```

All disks on the system are new (less than a couple of months old), except one (am1...);
all are labelled.

`# df`
==>

```
Filesystem  Capacity  Mounted on
/dev/gpt/gptrootfs  4%  /
devfs  100%  /dev
/dev/gpt/am1usrfs  0%  /disk_01
/dev/gpt/am2usrfs  79%  /disk_02
/dev/gpt/am6usrfs  0%  /disk_06
fdescfs  100%  /dev/fd
```

`# gpart show -l`
provided the following '/dev/ada' assignments

```
ada0   am2usrfs   i.e. /disk_02
ada1   am6usrfs   i.e. /disk_06
ada2   am1usrfs   i.e. /disk_01
ada3   gptrootfs   i.e. /
```

After a few minutes, I received several messages on the console and in
/var/log/messages:

```
May  5 22:00:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Currently unreadable (pending) sectors
May  5 22:00:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Offline uncorrectable sectors
May  5 22:01:00 MACH1 smartd[586]: Device: /dev/ada3, FAILED SMART self-check. BACK UP DATA NOW!
May  5 22:01:00 MACH1 smartd[586]: Device: /dev/ada3, 1891 Currently unreadable (pending) sectors
May  5 22:01:00 MACH1 smartd[586]: Device: /dev/ada3, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Currently unreadable (pending) sectors
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Offline uncorrectable sectors
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada3, FAILED SMART self-check. BACK UP DATA NOW!
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada3, 1891 Currently unreadable (pending) sectors
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada3, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
...
...
May  6 08:00:59 MACH1 smartd[586]: Device: /dev/ada3, FAILED SMART self-check. BACK UP DATA NOW!
May  6 08:00:59 MACH1 smartd[586]: Device: /dev/ada3, 1891 Currently unreadable (pending) sectors
May  6 08:00:59 MACH1 smartd[586]: Device: /dev/ada3, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
```


This would indicate that failure of am1usrfs and gptrootfs could be imminent, i.e. /disk_01 and / (the system disk).
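To make that mapping explicit, here is a minimal Python sketch that groups the smartd messages by device and pairs each device with its GPT label. The `LABELS` table is copied from the `gpart show -l` listing above and the sample lines from the log excerpt; the names `summarize` and `SMARTD_LINE` are made up for illustration, and the regular expression assumes the log format is exactly as shown.

```python
import re
from collections import defaultdict

# Device-to-label map copied from the `gpart show -l` listing in this post.
LABELS = {
    "ada0": "am2usrfs",
    "ada1": "am6usrfs",
    "ada2": "am1usrfs",
    "ada3": "gptrootfs",
}

# Matches lines of the form "... smartd[PID]: Device: /dev/adaN, <message>".
SMARTD_LINE = re.compile(r"smartd\[\d+\]: Device: /dev/(\w+), (.+)$")


def summarize(lines):
    """Return {device: set of distinct smartd messages} for the given log lines."""
    seen = defaultdict(set)
    for line in lines:
        match = SMARTD_LINE.search(line.strip())
        if match:
            seen[match.group(1)].add(match.group(2))
    return dict(seen)


# A few lines taken from the excerpt above.
sample = [
    "May  5 22:00:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Currently unreadable (pending) sectors",
    "May  5 22:00:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Offline uncorrectable sectors",
    "May  5 22:01:00 MACH1 smartd[586]: Device: /dev/ada3, FAILED SMART self-check. BACK UP DATA NOW!",
    "May  5 22:01:00 MACH1 smartd[586]: Device: /dev/ada3, 1891 Currently unreadable (pending) sectors",
    "May  5 22:01:00 MACH1 smartd[586]: Device: /dev/ada3, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.",
]

summary = summarize(sample)
for dev in sorted(LABELS):
    msgs = summary.get(dev)
    status = "; ".join(sorted(msgs)) if msgs else "no smartd messages"
    print(f"{dev} ({LABELS[dev]}): {status}")
```

Run against the full /var/log/messages, the same function would show at a glance which devices smartd never complains about, which here includes ada1.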

Oddly enough, it is ada1, am6usrfs (/disk_06), that functionally disappears from the system; a shutdown and restart brings it back up again. The disk remains visible in `df -k`, but `ls` reports a "not configured" state. At this stage dmesg shows a "periph destroyed" message.

The whole sequence seems bizarre: ada3 reports failure messages (yet continues to function), but it is ada1 that gets dropped or destroyed instead, with an apparent clicking sound too.

Can anyone explain?

Thanks!


----------



## k.jacker (May 10, 2016)

I have received those messages (but without the "FAILED SMART self-check" message) earlier when I accidentally hit my PC while making a backup with `rsync`.
The messages only lasted a second or so and then disappeared. No data corruption appeared, as I checked and ran `rsync` once more after the first run finished.
It's just a small chance, but you may have too much vibration from the disks or something else. I'm running a simple ZFS mirror with 2 disks, so there is not a lot of vibration, unless I hit my PC.
I thought that if the vibration had lasted a little longer and the read errors had continued, a disk (or both) might have gone offline...


----------



## SirDice (May 10, 2016)

```
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Currently unreadable (pending) sectors
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Offline uncorrectable sectors
```
These are not temporary errors. And judging by the count, the disk should have been replaced a long time ago. Same for ada3. Your posted logs don't show anything for ada1; did it have similar errors? And are these SSDs or plain old HDDs? I've noticed SSDs tend to go offline suddenly after a number of uncorrectable errors.


----------



## jaymax (May 10, 2016)

k.jacker said:


> I have received those messages (but without the "FAILED SMART self-check" message) earlier when I accidentally hit my PC while making a backup with `rsync`.
> The messages only lasted a second or so and then disappeared. No data corruption appeared, as I checked and ran `rsync` once more after the first run finished.
> It's just a small chance, but you may have too much vibration from the disks or something else. I'm running a simple ZFS mirror with 2 disks, so there is not a lot of vibration, unless I hit my PC.
> I thought that if the vibration had lasted a little longer and the read errors had continued, a disk (or both) might have gone offline...



No mechanical injury, vibration etc!


----------



## jaymax (May 11, 2016)

SirDice said:


> ```
> May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Currently unreadable (pending) sectors
> May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Offline uncorrectable sectors
> ```
> These are not temporary errors. And judging by the count, the disk should have been replaced a long time ago. Same for ada3. Your posted logs don't show anything for ada1; did it have similar errors? And are these SSDs or plain old HDDs? I've noticed SSDs tend to go offline suddenly after a number of uncorrectable errors.



All are plain old HDDs.
Yes! ada1 just disappears. It's a new disk, just a week or two old, and produces no error messages.
The other disks, ada2 and ada3, are relatively new too (about 2 and 4 months old respectively), yet both produce error messages.
ada3 is the system disk with a clean installation.
Ironically, ada0 is old, really old (several years), with no error messages and apparently stable as hell.


----------



## SirDice (May 11, 2016)

I'd return ada1, 2 and 3. At least 2 and 3 have offline uncorrectable errors and 1 is pretty dodgy. If they're this new they should never have been sold. Maybe it's a bad batch, maybe it's refurbished. In any case this should be covered by warranty.


----------



## wblock@ (May 11, 2016)

Could be failing power supply, bad SATA cables, other hardware issues.


----------



## exeter (Jul 14, 2016)

Always suspect the cables (Pournelle's  law).


----------



## Murph (Jul 14, 2016)

exeter said:


> Always suspect the cables (Pournelle's  law).


A variation on that, which I rather like: Always wiggle the wires first.

I agree with previous comments that there's no way you should be seeing that sort of error count on drives less than 6 months old, unless they have been physically abused.  I'd certainly try swapping out the cables, possibly check the voltage being supplied to the drives with a meter or scope (only do this if you are competent to safely work on live electronics).  It certainly seems like the drives are just plain bad and should be replaced ASAP, but I wouldn't want to entirely rule out a false positive due to bad cables/power/cooling/etc.


----------



## jaymax (Jul 14, 2016)

Wiggling I've certainly done, more like a samba flounce. I don't have access to a scope, but my meter gives +12V and +5V respectively and nothing on the +3.3V lead (which I think is normal on many SATA setups). My SATA cables are the longer type, longer than the 18" cables, and I have wondered whether there could be a significant voltage drop. Similarly, I have wondered about the SATA channels/ports on the controller; I have switched these around, but to no avail.
What is disturbing: would a MOBO, cable or controller shortcoming result in a permanent rather than a temporary disk failure, i.e. loss of partitions and an inability to lay down a new file system?
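For the meter readings mentioned above, a rough sanity check is to compare each measured rail against its nominal value with a tolerance band. The sketch below assumes the commonly cited ATX figure of ±5% for the +12V and +5V rails; the measured values are hypothetical placeholders, not readings from this machine, and `rail_ok` is a name made up for illustration.

```python
# Assumed tolerance: +/-5% of nominal, the figure usually quoted for ATX +12V and +5V rails.
TOLERANCE = 0.05


def rail_ok(nominal, measured, tolerance=TOLERANCE):
    """Return True if `measured` lies within nominal +/- (nominal * tolerance)."""
    return abs(measured - nominal) <= nominal * tolerance


# Hypothetical meter readings, keyed by nominal rail voltage.
readings = {12.0: 11.92, 5.0: 4.71}

for nominal, measured in readings.items():
    low = nominal * (1 - TOLERANCE)
    high = nominal * (1 + TOLERANCE)
    verdict = "within tolerance" if rail_ok(nominal, measured) else "OUT OF TOLERANCE"
    print(f"+{nominal:g}V rail: measured {measured:.2f}V, "
          f"allowed {low:.2f}V to {high:.2f}V, {verdict}")
```

A steady meter reading inside the band does not rule out ripple or brief dips under load, which is the reason a scope was suggested earlier in the thread.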


----------



## Murph (Jul 14, 2016)

jaymax said:


> What is disturbing: would a MOBO, cable or controller shortcoming result in a permanent rather than a temporary disk failure, i.e. loss of partitions and an inability to lay down a new file system?



Well, a problem outside the drive might just be generating a false positive from the drive's SMART stuff, maybe.  If you have another system you could test in, trying the drives one at a time in that system might be worth a shot.  If you get the same general problems and SMART error counts on a known good system+controller+port+cable+etc, then it's probably the drives themselves that are bad.  Honestly, I think the problem is the drives themselves, either a bad manufacturing batch or something bad happened to them at some point between the factory and now, it just never hurts to rule out the rest.

One last resort thing that could be tried is a low level format with camcontrol(8) (total data loss, obviously).  I can't say I've ever LL formatted a SATA drive (have done it plenty of times on SCSI drives), so actually not sure if it's expected to work.  If something bad happened to the drives in terms of a magnetic blip corrupting the formatting (e.g. they sat next to a giant electric motor for a while), instead of a physical type of defect, that might cure it.  At your own risk.  I recommend being quite careful if it seems to fix them, in case there's still an underlying issue, so extended testing post-format is strongly recommended.


----------



## RedShift1 (Jul 23, 2016)

Murph said:


> ...
> One last resort thing that could be tried is a low level format with camcontrol(8) (total data loss, obviously).  I can't say I've ever LL formatted a SATA drive (have done it plenty of times on SCSI drives), so actually not sure if it's expected to work.  If something bad happened to the drives in terms of a magnetic blip corrupting the formatting (e.g. they sat next to a giant electric motor for a while), instead of a physical type of defect, that might cure it.  At your own risk.  I recommend being quite careful if it seems to fix them, in case there's still an underlying issue, so extended testing post-format is strongly recommended.


You can't low-level format a modern hard drive. That's done at the factory and its geometry is fixed from then on. The best you can do is overwrite the entire drive with zeros (dd if=/dev/zero of=/dev/...) and rely on the disk's bad sector reallocation to take out the bad sectors.


----------



## wblock@ (Jul 23, 2016)

RedShift1 said:


> You can't low-level format a modern hard drive.


True, but you can send the command and the drive returns like it has done it really, really quickly.  So a lot of people think it still does something.


----------



## Murph (Jul 23, 2016)

RedShift1 said:


> You can't low-level format a modern hard drive. That's done at the factory and its geometry is fixed from then on. The best you can do is overwrite the entire drive with zeros (dd if=/dev/zero of=/dev/...) and rely on the disk's bad sector reallocation to take out the bad sectors.



Yeah, that is what I was uncertain about.  I've done real low-level formats on fast+wide SCSI and older many times, even on some LVD Ultra-SCSI, where it certainly took long enough (i.e. hours) to give the appearance of laying down new format markers.  I've just never had the need or inclination to do it in more recent years, so uncertain about expected behaviour on current SATA and SAS.


----------



## kpa (Jul 24, 2016)

Even the very old IDE drives from the early '90s were the same: you couldn't low-level format them. The "last Mohicans" were indeed those SCSI drives; on all drives since then, the low-level format is completely hidden from the user after factory initialization.


----------

