# How can I scrub a geom_mirror RAID1?



## LateNight (Aug 29, 2013)

Is there a standard way to scrub a geom_mirror RAID 1?

If not, would the following be effective, without undesirable side-effects?
`cp /dev/ad* /dev/zero`
(Assuming that everything matched by /dev/ad* is a RAID component that I want to scrub.)


----------



## wblock@ (Aug 29, 2013)

"Scrub" as ZFS does it is really more of a filesystem operation.  fsck(8) is closer to that, but does not test the data, just the filesystem.  To check the drive media for problems, install sysutils/smartmontools and run `smartctl -tlong /dev/ada0` on each drive.

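As a rough sketch of running that on every drive, the helper below loops over a list of disk names. The function name `start_long_tests` and the `DRY_RUN` switch are made up for illustration; the only real command involved is `smartctl -t long`. By default it just prints what it would run.

```shell
#!/bin/sh
# Start a SMART long self-test on each named disk.
# DRY_RUN defaults to 1, which only prints the commands;
# set DRY_RUN=0 (and run as root) to actually start the tests.
start_long_tests() {
    for disk in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "smartctl -t long /dev/${disk}"
        else
            smartctl -t long "/dev/${disk}"
        fi
    done
}
```

For example, `DRY_RUN=0 start_long_tests ada0 ada1` would start the test on both mirror components (the disk names are assumptions). The test runs inside the drive in the background; once it has finished, `smartctl -l selftest /dev/ada0` shows the self-test log.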

----------



## LateNight (Aug 30, 2013)

Thanks very much; smartmontools will do just what I was looking for.

In playing around with them, I noticed that one of my drives has a non-zero Reallocated_Event_Count and a couple of errors in its error log. I'm going to keep an eye on all my drives, using some smartctl commands in a cron script.

Just to satisfy my curiosity, is there any disadvantage to just doing something like `dd bs=10M if=/dev/adX of=/dev/null` vs. `smartctl -tlong /dev/adX` (besides taking longer and using some CPU and memory)? I assume both would result in checking all the checksums on the drive and would result in the same handling for any errors detected.


----------



## bthomson (Aug 31, 2013)

Wikipedia says the SMART tests do a bit more than just read from the drive. Whether those extra things are worth doing is debatable. I suppose the controller could fail in such a way that it thinks it's reading the entire disk correctly when really it isn't, and the SMART test might somehow detect that, but the odds seem slim.

In recent memory I've had three drive failures and all of them would have been detected with your dd command (possibly combined with a timeout, since the drives didn't have Time Limited Error Recovery).


----------



## wblock@ (Aug 31, 2013)

LateNight said:

> Thanks very much; smartmontools will do just what I was looking for.
> 
> In playing around with them, I noticed that one of my drives has a non-zero Reallocated_Event_Count and a couple of errors in its error log. I'm going to keep an eye on all my drives, using some smartctl commands in a cron script.



Even better, the port already has that function.  It is not enabled by default; you have to edit the config file in /usr/local/etc and set the notification email address.
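A minimal sketch of what such config entries might look like, assuming the function in question is the smartd daemon and its config file is /usr/local/etc/smartd.conf. The helper name `smartd_conf_lines`, the disk names, and the email address are placeholders; `-a` (the default set of checks) and `-m` (mail address for warnings) are smartd.conf directives.

```shell
#!/bin/sh
# Print one smartd.conf entry per disk.
# $1 is the notification address; the remaining arguments are disk names.
# -a turns on smartd's default set of checks, -m sets where warnings are mailed.
smartd_conf_lines() {
    email=$1
    shift
    for disk in "$@"; do
        printf '/dev/%s -a -m %s\n' "$disk" "$email"
    done
}
```

Something like `smartd_conf_lines you@example.com ada0 ada1 >> /usr/local/etc/smartd.conf` would append the entries. Note that if the file starts with an active `DEVICESCAN` line, smartd ignores everything after it, so that line would need to be commented out. The daemon also has to be enabled, typically with `smartd_enable="YES"` in /etc/rc.conf.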



> Just to satisfy my curiosity, is there any disadvantage to just doing something like `dd bs=10M if=/dev/adX of=/dev/null` vs. `smartctl -tlong /dev/adX` (besides taking longer and using some CPU and memory)? I assume both would result in checking all the checksums on the drive and would result in the same handling for any errors detected.



They are a bit different.  Reading every block is not much of a test.  It won't detect data corruption, but might detect block errors on the drive.

Writing to every block with dd(1) might hit errors, but if the drive corrects them by remapping to spare sectors, the only way to tell is to check the SMART numbers afterwards.  The SMART long test reports its results when it finishes, and it takes about the same amount of time as dd(1) given a buffer of at least 64k: `dd if=/dev/ada0 of=/dev/null bs=64k`.  The SMART short and long tests are non-destructive.  I have not really investigated how they work, but the results are consistent with manual tests.

But again, it's important to note that these are drive tests, not data tests.  ZFS can detect data corruption.  These tests check to see if the drive faithfully writes data.  If some sector has become corrupted, it will not be noticed.  The drive has no way of knowing what that data should have been.
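To illustrate the "check the SMART numbers afterwards" step, here is a hedged sketch that records one attribute before and after a full read with dd(1). The function names are invented for this example, and `Reallocated_Sector_Ct` is only an assumption about which attribute the drive exposes; attribute names vary by vendor, so check the `smartctl -A` output for your drive first.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print the RAW_VALUE
# (10th column) of the attribute named in $1.
smart_attr_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Read the whole disk, then show the reallocation count before and after.
# Not called here; run as root, e.g. `read_and_compare ada0`.
read_and_compare() {
    dev="/dev/$1"
    before=$(smartctl -A "$dev" | smart_attr_raw Reallocated_Sector_Ct)
    dd if="$dev" of=/dev/null bs=64k
    after=$(smartctl -A "$dev" | smart_attr_raw Reallocated_Sector_Ct)
    echo "Reallocated_Sector_Ct: before=${before} after=${after}"
}
```

A changed count only says the drive remapped something during the read. As noted above, it says nothing about whether the data in the other sectors is what was originally written.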


----------



## LateNight (Aug 31, 2013)

wblock@ said:

> ... these are drive tests, not data tests.  ...  If some sector has become corrupted, it will not be noticed.  The drive has no way of knowing what that data should have been.



My understanding is that drives actually can tell if data has been corrupted and can sometimes correct it. (A brief elaboration of this can be found here.) That is why I believe that just reading the whole drive, with either of the methods discussed above, would be enough to uncover any corruption and result in either (1) the drive silently correcting and remapping the sector, which would show up in `smartctl -A`, or (2) the drive reporting an uncorrectable error, which would (*I hope*) lead the RAID subsystem to re-sync or remove the drive that had the error.
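Whether gmirror actually reacts that way to a read error is not confirmed anywhere in this thread, but the mirror's state can at least be checked after a full read. The sketch below parses `gmirror status` output; the function name `degraded_mirrors` is hypothetical, and it assumes the usual three-column layout (Name, Status, Components) with `COMPLETE` as the healthy status.

```shell
#!/bin/sh
# Read `gmirror status` output on stdin and print the name of every
# mirror whose Status column is not COMPLETE.
# Exit status is 1 if any such mirror was found, 0 otherwise.
degraded_mirrors() {
    awk '$1 ~ /^mirror\// && $2 != "COMPLETE" { print $1; bad = 1 }
         END { exit (bad ? 1 : 0) }'
}
```

In a cron script this could be used as `gmirror status | degraded_mirrors || echo "mirror needs attention"`.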


----------



## kpa (Sep 1, 2013)

No, a drive cannot recover data that has become corrupted while stored on the disk for some time. There is no built-in redundancy to do that. That's why you need RAID to have any kind of redundancy.

What a drive can do, when it detects an error during a write operation, is mark the faulty block as unusable, select a new block from the list of spare blocks, and store the data there. This works because the data is still available in the drive's buffer after the first unsuccessful write and is not discarded until the operation completes. This is invisible to the operating system, since the spare block simply replaces the faulty one while the LBA numbering remains the same. Reading from the disk cannot trigger this reallocation because, as I said above, there is no good copy of the data available.


----------

