# ZFS auto check/repair



## 6502 (Mar 31, 2019)

Hi. I am new to FreeBSD and plan to set up a small server (postfix/web/samba). My first task is to choose the file system. It looks like ZFS is better than UFS in every way. I expected that, with all its features, it would recover automatically and that there would be little risk of corrupting single files or the whole file system on power loss. But I read that a manual repair with *fsck* in single-user mode may be required. Is this true, or can the check and repair be automated so that the system continues working once power is back, without user intervention? For example, in most cases NTFS runs without problems after a power loss, or at least Windows can boot and file repairs can be deferred. I will add a UPS, but I want some kind of automatic recovery as well.


----------



## kpa (Mar 31, 2019)

There is no fsck(8) equivalent for ZFS; this is by design. Self-healing after a crash happens automatically, without user intervention, when the pool is imported, and it is now extremely rare for the filesystem to be left in an inconsistent state.

You can run a `zpool scrub` on a pool that has redundancy (mirror or some form of raidz) to check whether there are problems with the data replication. However, a scrub can't be used until the system has booted to at least single-user mode with the pool successfully imported, so it is no help in the very rare case where the self-healing failed to repair the damage caused by the crash.


----------



## SirDice (Apr 1, 2019)

Keep in mind with ZFS that a single disk or a striped set does not have error _correction_, only error _detection_.


----------



## xtaz (Apr 1, 2019)

SirDice said:


> Keep in mind with ZFS that a single disk or a striped set does not have error _correction_, only error _detection_.



Although you can do poor man's error correction if you set copies=2 (or 3) in the dataset properties. But this only works if the other copies of the data are still readable on the disk. If the disk has failed completely, you lose your data. Also, there is no guarantee that the copies will be in different areas of the disk; they could all be next to each other.

The other issue with this is that it affects performance: writing data takes 2 or 3 times longer, as it doesn't do the writes in parallel or in the background. Still, for some use cases this could work, which is why I mention it.
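To make the detection vs. correction distinction concrete, here is a minimal sketch in Python. It is a toy model, not how ZFS is implemented: the `Block` class, the `copies` argument and the use of SHA-256 are all assumptions made for illustration. The idea it shows is the one discussed above: a checksum stored apart from the data can always tell you a copy is bad, but you can only repair it if another intact copy exists.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when no stored copy matches the recorded checksum."""


def checksum(data: bytes) -> str:
    # SHA-256 is used here only for illustration (an assumption of this sketch).
    return hashlib.sha256(data).hexdigest()


class Block:
    """Toy logical block: N physical copies plus one checksum kept separately."""

    def __init__(self, data: bytes, copies: int = 1):
        self.expected = checksum(data)
        self.copies = [bytearray(data) for _ in range(copies)]

    def read(self) -> bytes:
        good = None
        bad = []
        # Verify every copy against the recorded checksum.
        for index, copy in enumerate(self.copies):
            if checksum(bytes(copy)) == self.expected:
                if good is None:
                    good = bytes(copy)
            else:
                bad.append(index)
        if good is None:
            # Detection only: we know the data is wrong, but cannot fix it.
            raise ChecksumError("corruption detected, no intact copy left")
        # Correction: overwrite each damaged copy with the intact one.
        for index in bad:
            self.copies[index] = bytearray(good)
        return good


# One copy: a flipped byte is detected but cannot be repaired.
single = Block(b"important data", copies=1)
single.copies[0][0] ^= 0xFF
try:
    single.read()
except ChecksumError as error:
    print("single copy:", error)

# Two copies: the same damage is detected and repaired from the other copy.
double = Block(b"important data", copies=2)
double.copies[0][0] ^= 0xFF
print("two copies:", double.read())
```

Note that the second copy only helps while it is still readable; if both copies sit on a disk that has died, this model (like the real thing) has nothing left to repair from.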


----------



## 6502 (Apr 1, 2019)

SirDice said:


> set does not have error _correction_, only error _detection_.


I guess the error correction is for wrongly read bytes (e.g. errors from the HDD/RAM), not for a file system logically broken by power loss.

In fact for the moment I do not need most of the features of ZFS. Need stable file system with journaling - something like NTFS/ext4 but it seems the only options are UFS and ZFS.


----------



## xtaz (Apr 1, 2019)

There will never be a broken filesystem from power loss. ZFS is copy-on-write. Whenever there is a write to the disk it makes a copy of the data first, confirms that it's correctly written, updates any pointers to this data, and then deletes the older data. Using this method the filesystem is never inconsistent. At worst you lose a few seconds of data, but it will always come up cleanly.
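A rough sketch of the copy-on-write idea in Python, for readers who prefer code. This is a toy model and not ZFS code: `CowStore`, its dictionary "disk" and the `crash_before_commit` flag are hypothetical names invented for the example. It shows why an interrupted write leaves the old data intact: new data goes to a fresh location, and only a single pointer update makes it live.

```python
class CowStore:
    """Toy copy-on-write store with one live block and one root pointer."""

    def __init__(self):
        self.blocks = {}     # address -> bytes, stands in for the disk
        self.next_addr = 0   # next never-used address
        self.root = None     # address of the live block (the "pointer")

    def _allocate(self, data: bytes) -> int:
        # New data always lands at a fresh address, never over old data.
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        return addr

    def write(self, data: bytes, crash_before_commit: bool = False) -> None:
        new_addr = self._allocate(data)   # step 1: write to a new location
        if crash_before_commit:
            # Simulated power loss: the root still points at the old block,
            # and the new block is simply never referenced.
            return
        old_addr = self.root
        self.root = new_addr              # step 2: one pointer update commits
        if old_addr is not None:
            del self.blocks[old_addr]     # step 3: only now free the old block

    def read(self):
        return None if self.root is None else self.blocks[self.root]


store = CowStore()
store.write(b"version 1")
store.write(b"version 2", crash_before_commit=True)  # interrupted write
print(store.read())   # still the last committed version
store.write(b"version 3")
print(store.read())
```

The interrupted write is lost, which matches the "at worst you lose a few seconds of data" part, but what you read back is always a complete, previously committed version.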


----------



## SirDice (Apr 1, 2019)

xtaz said:


> There will never be a broken filesystem from power loss.


_Never_ is probably a bit of an exaggeration. Although ZFS takes a lot of precautions and can deal with a lot of errors, it's certainly not bullet-proof.


----------



## Ordoban (Apr 1, 2019)

SirDice said:


> _Never_ is probably a bit of an exaggeration. Although ZFS takes a lot of precautions and can deal with a lot of errors, it's certainly not bullet-proof.


That's right. I had such an error, and I started to write my own ZFS data recovery tool. Described here.
I finally found out what happened to this pool: one blockpointer table was overwritten with bad data. This bad blockpointer table was correctly compressed and CRC'ed, so the corruption must have happened in the kernel before writing. Before that I had done some experiments with the PCI passthrough feature of bhyve, with many kernel crashes, and I think this was the trigger for the data corruption.
I suspended the work on the ZFS data recovery tool due to lack of public interest.


----------



## gpw928 (Apr 6, 2019)

xtaz said:


> There will never be a broken filesystem from power loss. ZFS is copy-on-write. Whenever there is a write to the disk it makes a copy of the data first, confirms that it's correctly written, updates any pointers to this data, and then deletes the older data. Using this method the filesystem is never inconsistent. At worst you lose a few seconds of data, but it will always come up cleanly.


The software process described is impeccable.  But budget hardware (consumer grade SSDs and non-ECC memory) might let the team down...

SSD controllers routinely acknowledge data written to volatile cache.  That's one of the reasons that they are so fast.  The controller updates the page in the background.

Expensive SSDs, usually tagged "enterprise class", have "power loss data protection" (usually via capacitors) which save the volatile cache at the time of unexpected power loss.

It's extremely rare for "consumer class" SSDs to have "power loss data protection".  Intel made some a while back (the 730), but seemed reluctant to admit it.

If you don't have "power loss data protection", data in the volatile cache can be lost with sudden power loss.
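As a rough illustration of that failure mode, here is a toy model in Python. Everything in it is an assumption made for the example (the `Drive` class, its `power_loss_protection` flag, the dictionary "flash"); real firmware is far more complicated. It only shows the gap between "the drive acknowledged the write" and "the data is on persistent media".

```python
class Drive:
    """Toy drive with a volatile write cache in front of persistent storage."""

    def __init__(self, power_loss_protection: bool = False):
        self.cache = {}   # volatile: contents vanish when power is lost
        self.flash = {}   # persistent: survives power loss
        self.power_loss_protection = power_loss_protection

    def write(self, lba: int, data: bytes) -> bool:
        # The write is acknowledged as soon as it is in the volatile cache.
        self.cache[lba] = data
        return True

    def flush(self) -> None:
        # Move everything from the cache to persistent storage.
        self.flash.update(self.cache)
        self.cache.clear()

    def power_loss(self) -> None:
        if self.power_loss_protection:
            # Capacitors give the controller time to save the cache.
            self.flush()
        else:
            # Acknowledged but unflushed data is gone.
            self.cache.clear()

    def read(self, lba: int):
        if lba in self.cache:
            return self.cache[lba]
        return self.flash.get(lba)


consumer = Drive(power_loss_protection=False)
consumer.write(0, b"acknowledged")
consumer.power_loss()
print("consumer:", consumer.read(0))

enterprise = Drive(power_loss_protection=True)
enterprise.write(0, b"acknowledged")
enterprise.power_loss()
print("enterprise:", enterprise.read(0))
```

In this model a drive without protection is still safe for any data that was flushed before the power went out; the exposure is limited to writes that were acknowledged but not yet flushed.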

Enterprise class SSDs also generally have "end-to-end data protection" (parity or CRC on all data paths), but the risk area covered by that feature is smaller.

I don't use SSDs on my ZFS server. But if I did, they would be "enterprise class" (NVMe for the ZIL might be a hoot).  I also use ECC memory for data path protection (Google: "DRAM errors in the wild").

In practical terms, people use consumer grade SSDs and non-ECC memory all the time, mostly without significant issues.  But the design assumptions underlying ZFS do rely on "enterprise class" hardware.

By the way, these data loss maladies are just as likely to hit any type of file system.  ZFS is generally better equipped to detect corruption (and maybe fix it, if you have redundant data), via its end-to-end checksums.

Caveat emptor.


----------



## Himalaya (Apr 7, 2019)

Just felt this link may be relevant to above discussion


----------



## ralphbsz (Apr 7, 2019)

xtaz said:


> Although you can do poor man's error correction if you set copies=2 (or 3) on the dataset properties. ... The other issue with this is that it affects performance, writing data takes 2 or 3 times longer as it doesn't do the writes in parallel or in the background.


It can easily be much slower than 2x or 3x: it can take a big sequential write of many GB, and turn it into a random hop and skip.  But not always.  Whether the performance penalty (which is large and unknown) is worth the reliability gain (which is small but non-zero) depends on the needs and wants of the user.



6502 said:


> In fact for the moment I do not need most of the features of ZFS. Need stable file system with journaling ...


The one feature of ZFS that you (and everyone else) do need is checksum verification on data read from disks.  Disks and data volumes have become so big, while error rates have not improved, that we are now in a world where undetected / uncorrected read errors are reality.  Not using checksums on big modern systems is beginning to be reckless.  Some other file systems are beginning to recognize this, and also have checksums, at least on some metadata structures.  But in the free/open software realm, ZFS is miles ahead of everyone else in this respect.

The other thing is: You don't need journaling.  Nobody needs journaling.  Journaling is a technique that tries to solve a particular problem: Make file systems consistent and reduce loss of unacknowledged data in case of a system stop (power loss, crash).  There are many other ways to solve the same problem.  What you do need is a file system that doesn't get corrupted after a system stop.  Typical ingredients for solving this include journaling, CoW, logging or log-structured file systems, write buffering, and so on.  Because journaling is the technique used by the most popular Linux file system (which has a huge market share), people tend to think that journaling is the one and only answer.

Saying "I need journaling" is like saying "I need a Ford to get from home to work".  Sorry, wrong.  What you really need is reliable, inexpensive, and safe transportation.  There are many options there, including Chevrolet, Honda, and Volkswagen; for particular situations there are also horses, rollerskates, and the subway.  Ford is just one possible solution, not always the best.



gpw928 said:


> SSD controllers routinely acknowledge data written to volatile cache.  That's one of the reasons that they are so fast.  The controller updates the page in the background.


True.  Particularly common among consumer hardware.  By the way, hard disks do the same thing.



> Expensive SSDs, usually tagged "enterprise class", have "power loss data protection" (usually via capacitors) which save the volatile cache at the time of unexpected power loss.


Also true.  Although I've seen plenty of cases where expensive enterprise class SSDs also lost data during a power shutdown or reset.  The firmware installed on SSDs is of frighteningly bad quality (spinning hard disks are leagues better), and even some expensive enterprise SSDs are far from perfect.



> In practical terms, people use consumer grade SSDs and non-ECC memory all the time, mostly without significant issues.


It's a cost/benefit/risk tradeoff.  I happen to have an enterprise SSD from a reputable vendor at home, but non-ECC RAM.  Given the pricing at the time, this was the least bad option.



> But the design assumptions underlying ZFS do rely on "enterprise class" hardware.


Careful: while your statement is not outright wrong, it can be mis-interpreted.  With enterprise class hardware (in particular ECC RAM, and using redundant storage), ZFS can reach levels of data durability that are common in the enterprise/cloud market, and way better than the amateur/consumer/discount market.  Even without ECC RAM, and with non-redundant storage, ZFS is still better than most other file systems as far as data durability and error detection are concerned ... but it is not as good as it could be.  On the other hand, if you take a piece of crap file system (say a FAT implementation written by a second-year college student who was drunk most of the time), it will still suck, even on million dollar hardware.

I think the right way to express it is something like this: ZFS is so good, that it exposes the reliability and data durability bottlenecks in the rest of the system.  To get the best value out of ZFS, you should also make the rest of the system stronger, which will cost more money.



> Caveat emptor.


Hallelujah.  Exactly that. 



Himalaya said:


> Just felt this link may be relevant to above discussion


Well, it is slightly relevant, but also contains lots of wrong and obsolete information, and is terribly Linux-FS centric.  There is a world outside of extN, XFS and BtrFS, but many people are gleefully unaware of that.


----------



## gpw928 (Apr 7, 2019)

ralphbsz said:


> Careful: while your statement ['design assumptions underlying ZFS do rely on "enterprise class" hardware'] is not outright wrong, it can be mis-interpreted.  With enterprise class hardware (in particular ECC RAM, and using redundant storage), ZFS can reach levels of data durability that are common in the enterprise/cloud market ...


I don't think that we are disagreeing, but my statement was meant to reflect that ZFS was designed by engineers at Sun Microsystems for use with Solaris on Sun hardware.  At the time ZFS was designed, and in that context, there was an underlying assumption of "enterprise" class hardware, because that's what they had to work with.


----------

