# God I love ZFS!



## olav (Dec 3, 2010)

During my weekly scrub I got this e-mail from my server.


```
pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 1h55m with 0 errors on Fri Dec  3 03:55:15 2010
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada1    ONLINE       0     0     1  32K repaired
            ada3    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada4    ONLINE       0     0     0

errors: No known data errors
```
Good to know that my data is safe and has been healed.
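For anyone wanting to reproduce the weekly check, here's a sketch: a cron entry for the scrub plus a small filter that pulls checksum-error devices out of `zpool status` output. The schedule is just an example; the pool name `tank` is from the output above.

```shell
# /etc/crontab entry to run the weekly scrub (example schedule:
# 02:00 every Friday, pool "tank"):
#   0  2  *  *  5  root  /sbin/zpool scrub tank

# Filter that lists devices reporting checksum errors: reads
# `zpool status` text on stdin and prints the NAME column for
# ONLINE vdevs whose CKSUM column is nonzero.
cksum_errors() {
  awk 'NF >= 5 && $2 == "ONLINE" && $5 + 0 > 0 { print $1 }'
}

# Usage: zpool status tank | cksum_errors
```

Fed the status output above, the filter prints `ada1`.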


----------



## AndyUKG (Dec 3, 2010)

It's not always so great. I recently had an issue where the disks in a pool lost power while the server was running. This resulted in corruption in two files; unfortunately they were metadata files which could not be deleted or restored from backup, so I had to destroy the pool and recreate it. Pretty crappy!


----------



## olav (Dec 3, 2010)

Didn't you have a backup of the /boot/zfs/zpool.cache file?


----------



## hopla (Dec 3, 2010)

olav said:

> Didn't you have a backup of the /boot/zfs/zpool.cache file?



Interesting, what is the importance of that file? How would it have helped him in his situation?


----------



## AndyUKG (Dec 3, 2010)

hopla said:

> Interesting, what is the importance of that file? How would it have helped him in his situation?



I don't think it would have helped at all, is the answer. This seems to be a cache of info about the pool, such as disk device names etc. The issue I had was corruption of metadata on the disks, i.e. metadata of actual data, not metadata of the disk pool (if that makes sense).

Andy.


----------



## Alt (Dec 3, 2010)

Which RAID type (raidz1/mirror/stripe) did you have?


----------



## AndyUKG (Dec 3, 2010)

Alt said:

> Which RAID type (raidz1/mirror/stripe) did you have?



mirror


----------



## carlton_draught (Dec 3, 2010)

AndyUKG said:

> mirror


Did you actually lose any data? Or was it just something that righted itself when you destroyed and recreated the pool?


----------



## AndyUKG (Dec 4, 2010)

carlton_draught said:

> Did you actually lose any data? Or was it just something that righted itself when you destroyed and recreated the pool?



Well, isn't destroying the pool losing all my data? If I hadn't had a backup copy I would have lost data, and it would have been very hard to identify exactly what data was corrupt, as insufficient info was provided by the ZFS error.

FYI, the error was ZFS-8000-8A (corrupted data); the description of this is:



> Damaged files may or may not be able to be removed depending on the type of corruption. If the corruption is within the plain data, the file should be removable. If the corruption is in the file metadata, then the file cannot be removed, though it can be moved to an alternate location. In either case, the data should be restored from a backup source. It is also possible for the corruption to be within pool-wide metadata, resulting in entire datasets being unavailable. If this is the case, the only option is to destroy the pool and re-create the datasets from backup.


----------



## Galactic_Dominator (Dec 5, 2010)

AndyUKG said:

> Well, isn't destroying the pool losing all my data?


An unreadable pool would be considered loss of data.  Sometimes ZFS corruption simply results in a pool being unable to write new data.  Ultimately this type of situation requires a new pool, but wouldn't be considered loss of data for obvious reasons.  Since you gave very few details as to the nature of your reported corruption, the question you replied to seems appropriate and your supercilious response does not.


----------



## AndyUKG (Dec 5, 2010)

If the terms "corruption" (which I used in my first post) and "destroying" don't imply loss of data, I don't know what does...


----------



## AndyUKG (Dec 5, 2010)

Galactic_Dominator said:

> An unreadable pool would be considered loss of data.  Sometimes ZFS corruption simply results in a pool being unable to write new data.  Ultimately this type of situation requires a new pool, but wouldn't be considered loss of data for obvious reasons.  Since you gave very few details as to the nature of your reported corruption, the question you replied to seems appropriate and your supercilious response does not.



The reality is that, as I had a backup copy of the data, I didn't do any detailed analysis of what might have been affected data-wise; it would have been an interesting exercise but impractical, as the pool contained millions of files. If I hadn't had a backup copy to restore from, or even to compare the data against, I would have felt extremely nervous about the integrity of the data (i.e. if I'd had to copy the data off, destroy the pool and recreate it). This seems to me a pretty piss-poor result from a simple power outage on a supposedly advanced fault-tolerant file system.

Is it really necessary to come onto a thread and label people with insulting names when they are trying to share experiences and knowledge?


----------



## aragon (Dec 5, 2010)

AndyUKG said:

> If I hadn't had a backup copy to restore from, or even to compare the data against, I would have felt extremely nervous about the integrity of the data (i.e. if I'd had to copy the data off, destroy the pool and recreate it). This seems to me a pretty piss-poor result from a simple power outage on a supposedly advanced fault-tolerant file system.



Busy building an 8 TB NAS at the moment, and this is quite a big worry for me.  It's not all that easy to back up 8 TB of data either, so... *cringe*


----------



## DutchDaemon (Dec 6, 2010)

Take the exit at Semantics Junction, guys. Thanks.


----------



## fronclynne (Dec 6, 2010)

aragon said:

> Busy building an 8 TB NAS at the moment, and this is quite a big worry for me.  It's not all that easy to back up 8 TB of data either, so... *cringe*



Well, I've had UFS(2) fail well enough to hose data three or four times, FAT[12|16|32] more times than I can count, NTFS is as fault-tolerant as the 880 (warning! California joke, sorry), and a wayward sand particle made a backup on CD rather . . . unreadable.

I'm going to patent/copyright/trademark ZenFS, for when your data are unimportant as individual bits.  The fact that you are tied to the idea of discrete information retention is why you fail at enlightenment, my child.  When all of your partitions are copies of /dev/urandom you will know true freedom.


----------



## Galactic_Dominator (Dec 6, 2010)

AndyUKG said:

> The reality is that, as I had a backup copy of the data, I didn't do any detailed analysis of what might have been affected data-wise; it would have been an interesting exercise but impractical, as the pool contained millions of files. If I hadn't had a backup copy to restore from, or even to compare the data against, I would have felt extremely nervous about the integrity of the data (i.e. if I'd had to copy the data off, destroy the pool and recreate it). This seems to me a pretty piss-poor result from a simple power outage on a supposedly advanced fault-tolerant file system.



It's both well documented and common knowledge that certain types of hardware failures can cause corruption regardless of your filesystem.  Specifically in ZFS's case, these failures tend to cause errors exactly as you have stated.  See here for some information:

http://docs.sun.com/app/docs/doc/819-5461/gavwg?a=view

Essentially, these issues come down to a few hardware causes:

- Cable cross-talk
- Faulty hardware, e.g. RAM
- Sudden power failure

Since you report sudden power failure, I'll explain exactly why it happened and why it's your fault and not ZFS's.

ZFS, as with any FS, trusts a flush request.  Because of ZFS's COW abilities and transaction grouping this is generally not a problem, but there is one generally rare situation that results in corruption even on ZFS.  Hard drives have a write-caching feature which greatly increases performance at the risk of possible corruption.

ZFS guarantees "good" data is not moved until the entire write is complete, but that guarantee comes with a caveat some do not realize.  If the hard drive "lies" to ZFS that one portion of the write is complete, ZFS will go ahead with committing the transaction group and updating the uberblock.  Say you lose power at this point, and the disk completes the writes but issues them out of order.  You're stuck with new COW data but the wrong uberblock, so the COW differentials are unable to track changes to specific files, and you end up with your exact scenario.  This is simplified a bit, but you should get the idea.  Remember that in ZFS, redundancy and consistency are different things, and one doesn't always guarantee the other.  The consistency portion is what went wrong for you, so the mirror doesn't help.

You can find plenty more of these, but this link shows my explanation in the real world:

http://mail.opensolaris.org/pipermail/zfs-discuss/2010-January/035740.html

Every reliability document worth its weight advises you to disable drive write-caching.  Here's just one example:

http://wiki.postgresql.org/wiki/SCSI_vs._IDE/SATA_Disks

There are ZFS and hardware methods you can use to reduce the performance impact of disabling this, but for a lot of setups it's going to be more effective to simply keep full backups, as this is generally a rare issue.
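One such ZFS-side method, sketched here under the assumption of a spare fast disk (`ada5`/`ada6` are hypothetical device names), is a dedicated intent-log (SLOG) device: it takes synchronous writes off the data disks. Note it only mitigates the performance cost; it does not by itself make a lying write cache safe.

```shell
# Add a dedicated log (SLOG) device to the pool; synchronous
# writes land on ada5 instead of the main vdevs.
zpool add tank log ada5

# If the log shouldn't be a single point of failure, mirror it:
#   zpool add tank log mirror ada5 ada6

zpool status tank   # a "logs" section should now list ada5
```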

Because you didn't take adequate measures to ensure ZFS could operate reliably, you are at fault here.  You compounded the issue by blaming ZFS and spreading FUD.  I don't think it was intentional on your part, but it is nevertheless potentially harmful.  Please don't take this as another personal assault, as I'm sure you're a fine person.  You also seem like an intelligent person and a decent sysadmin with some room for improvement.  I believe you're a decent sysadmin since you had backups.

This explanation was brought to you by your more detailed problem description.  HIH.



			
AndyUKG said:

> Is it really necessary to come onto a thread and label people with insulting names when they are trying to share experiences and knowledge?


Maybe you're confusing me with someone else, as I'm quite sure I never called anyone here an insulting name.  When I do ask questions, I really dislike getting misleading responses, FUD, or answers that do nothing but serve to inflate the responder's post count.  So what I was pointing out to you is that there are more details in ZFS than are dreamt of in your philosophy, and you need not get snippy when someone asks for a clarification.

I'm all for you sharing your experiences, though, as we're all in this ZFS boat together, and hopefully reports like yours (the detailed version, not the original) can help both awareness and resolution for everyone.


----------



## olav (Dec 6, 2010)

Aha, I thought ZFS was designed to be safe with write cache enabled. From here: http://www.postgresql.org/docs/current/static/wal-reliability.html


> The Solaris ZFS file system is safe with disk write-cache enabled because it issues its own disk cache flush commands.


I guess that's not quite true then?

Anyway, I tested with write cache disabled here. For sequential data transfer, the speed on my ZFS pool is pretty much the same as with write cache enabled. However, the OS disk is now 5-10x slower. The only thing I did was add this to /boot/loader.conf:

```
hw.ata.wc=0
```


----------



## olav (Dec 6, 2010)

Ack, the `hw.ata.wc=0` tunable doesn't work with AHCI drives, only with regular IDE.


```
[olav@zpool ~]$ sudo camcontrol identify ada0
pass0: <SAMSUNG HD203WI 1AN10003> ATA-8 SATA 2.x device
pass0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)

protocol              ATA/ATAPI-8 SATA 2.x
device model          SAMSUNG HD203WI
firmware revision     1AN10003
serial number         S1UYJDWZ725504
WWN                   50024e903c88923
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 512, offset 0
LBA supported         268435455 sectors
LBA48 supported       3907029168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6 
media RPM             5400

Feature                      Support  Enable    Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      no      0/0x00
automatic acoustic management  yes      no      0/0x00  254/0xFE
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no      0/0x0
unload                         yes      yes
free-fall                      no       no
data set management (TRIM)     no
```

No wonder my ZFS pool didn't show a speed decrease. How can I disable the write cache for AHCI-attached devices?


----------



## aragon (Dec 6, 2010)

fronclynne said:

> Well, I've had UFS(2) fail well enough to hose data three or four times, FAT[12|16|32] more times than I can count, NTFS is as fault-tolerant as the 880 (warning! California joke, sorry), and a wayward sand particle made a backup on CD rather . . . unreadable.


I've lost some data on UFS too, but never an entire file system in the more than 10 years I've been using FreeBSD (ok, barring the odd total disk failure).  With ZFS though, I get the impression it's relatively easy to lose an entire pool from the slightest mishap in setup or environment.

Won't stop me trying it though.  Too many awesome features to pass up.


----------



## AndyUKG (Dec 6, 2010)

aragon said:

> Busy building an 8 TB NAS at the moment, and this is quite a big worry for me.  It's not all that easy to back up 8 TB of data either, so... *cringe*



If you don't already have some enterprise backup with an LTO library or something similar, the easiest way is probably going to be a duplicate copy of your pool which you replicate via zfs send/receive. I too have an 8 TB pool and am using this method. Of course, you could still be vulnerable to some ZFS bug that renders both pools useless!
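That replication scheme can be sketched like this (the pool name `tank`, target pool `backup`, and host `backuphost` are hypothetical; a real setup would script the snapshot rotation):

```shell
# One-time full replication of the whole pool to a second box.
zfs snapshot -r tank@rep1
zfs send -R tank@rep1 | ssh backuphost zfs receive -Fd backup

# Subsequent runs send only the blocks changed since the last
# snapshot, so they stay cheap even on an 8 TB pool.
zfs snapshot -r tank@rep2
zfs send -R -i tank@rep1 tank@rep2 | ssh backuphost zfs receive -Fd backup
```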


----------



## AndyUKG (Dec 6, 2010)

olav said:

> Aha, I thought ZFS was designed to be safe with write cache enabled. From here: http://www.postgresql.org/docs/current/static/wal-reliability.html
> 
> I guess that's not quite true then?
> 
> ...



Just googled this; it would seem that ZFS is safe to use with write cache enabled from all the info I've seen, BUT this obviously depends on the disks you are using behaving as they should (honouring flush requests from ZFS), with cheap consumer-grade drives the most likely to misbehave and expensive SAS disks the most likely to be good. This could explain my issue...
All in all, it seems to leave us in a bit of a lottery where it's impossible to know if the drive you have will work reliably, unless someone has compiled a list of good drives somewhere. On all my systems I am using cheap SATA drives for ZFS.


----------



## AndyUKG (Dec 6, 2010)

aragon said:

> With ZFS though, I get the impression it's relatively easy to lose an entire pool from the slightest mishap in setup or environment.



To clarify what happened in my case: ZFS reported unrecoverable corruption in two metadata files and gave as the corrective action "destroy the pool and recreate from backup". However, the pool was still mounted and the data readable. So it wasn't a case where all data would have been lost if I hadn't had a backup.


----------



## aragon (Dec 6, 2010)

AndyUKG said:

> To clarify what happened in my case: ZFS reported unrecoverable corruption in two metadata files and gave as the corrective action "destroy the pool and recreate from backup". However, the pool was still mounted and the data readable. So it wasn't a case where all data would have been lost if I hadn't had a backup.


Good to know, thanks!


----------



## Galactic_Dominator (Dec 6, 2010)

aragon said:

> I've lost some data on UFS too, but never an entire file system in the more than 10 years I've been using FreeBSD (ok, barring the odd total disk failure).  With ZFS though, I get the impression it's relatively easy to lose an entire pool from the slightest mishap in setup or environment.


One detail I forgot to mention earlier is that ZFS reported the corruption; in a different FS this type of thing would have been silent.  So, because of ZFS's design, the recovery process sucks, but at least you were aware of the issue and could take action to resolve it.


----------



## phoenix (Dec 6, 2010)

olav said:

> Aha, I thought ZFS was designed to be safe with write cache enabled. From here: http://www.postgresql.org/docs/current/static/wal-reliability.html
> 
> I guess that's not quite true then?



It is true.  ZFS sends cache flush commands as needed.  *However*, not all hard drives obey the command.  Some will respond with "flush complete" even though they have only written the data to the cache and not to the platters.  As far as ZFS is concerned, the data is on the platters (the disk told it so), so it continues on with the next transaction.

It's always a trade-off between "pure speed" and "total data security".  So long as you have a good, working UPS properly configured to issue an ordered shutdown of the box, have good hard drives that don't lie about "flush complete", and don't mind the slim chance of a drive dying with data in its cache, then you can run with the disk caches enabled.  If you are absolutely paranoid about data safety and don't mind sacrificing a lot of write throughput, then run with all caches (including controller caches) disabled.

It's up to you to decide what's important, and configure the system to match.


----------



## carlton_draught (Dec 6, 2010)

AndyUKG said:

> To clarify what happened in my case: ZFS reported unrecoverable corruption in two metadata files and gave as the corrective action "destroy the pool and recreate from backup". However, the pool was still mounted and the data readable. So it wasn't a case where all data would have been lost if I hadn't had a backup.


This is what I was getting at. You mentioned that you had two metadata files that were corrupted; potentially the rest of your files may have been OK. You could have done a recursive diff and seen what had changed between your backup and the hosed but readable pool. And also copied all data to a working pool.
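A minimal version of that recursive compare looks like this; the real mountpoints are hypothetical, so the runnable part below demonstrates on two throwaway directories:

```shell
# Against the real trees it would be (hypothetical mountpoints):
#   diff -rq /backup/data /tank/data
# which reports only files that differ or exist on one side only.

# Self-contained demonstration with two throwaway trees:
a=$(mktemp -d); b=$(mktemp -d)
echo same > "$a/intact";  echo same > "$b/intact"
echo old  > "$a/changed"; echo new  > "$b/changed"
diff -rq "$a" "$b" || true   # only the "changed" file is reported
```

Note that `diff` exits nonzero when it finds differences, so in a script you check its output rather than its status.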


----------



## AndyUKG (Dec 7, 2010)

phoenix said:

> So long as you have a good, working UPS properly configured to issue an ordered shutdown of the box



Even in companies with the best UPS kit, generators, etc., there is always the possibility of an unexpected power failure for one reason or another. I think you have to plan for the worst case, which is that your systems should be resilient to a power failure: the goal should be that they can reboot afterwards, and if there are file system errors, the FS can at least be repaired to a point where it is marked clean. As an example, high-availability clusters are built on this assumption (at least the clean-FS part): that an unexpected power failure isn't going to irreparably damage data on disk.



			
phoenix said:


> If you are absolutely paranoid about data safety and don't mind sacrificing a lot of write throughput, then run with all caches (including controller caches) disabled.



So far on this thread no one has been able to identify how you disable SATA disk write cache when using the AHCI driver (or other similar drivers, e.g. SIIS). Any idea if this is currently possible?

thanks Andy.


----------



## AndyUKG (Dec 7, 2010)

carlton_draught said:

> This is what I was getting at. You mentioned that you had two metadata files that were corrupted; potentially the rest of your files may have been OK. You could have done a recursive diff and seen what had changed between your backup and the hosed but readable pool. And also copied all data to a working pool.



Hi, yes, I could have done a diff against a good copy of the data; on my system I think this would have taken a good 24 hours or so due to the volume of files, and then you still have to analyse each change to decide if it's a valid change or corruption. In my case it was faster and easier to recreate the pool from another copy (luckily for me, the pool that died was a DR copy), which allowed me to recover the corrupt pool in just a few hours.
I suppose for me the important point isn't how and whether I could have recovered my data assuming I had a backup; it was the fact that part of the solution was having to destroy the pool. As someone else commented, this becomes more and more problematic as your pool gets very large: even with the fastest disks and best setup, restoring several terabytes is going to cause a pretty long service outage.
And all this just from a power failure. Maybe in my case this SATA write-flush issue was the culprit. One of the advantages of buying a system like this from Sun/Oracle is that they test and qualify all of the hardware together, great if you can afford it!

thanks Andy.


----------



## danbi (Dec 7, 2010)

You should be able to change the write cache setting via camcontrol.
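For what it's worth, on SCSI/SAS disks the knob is the WCE bit in the caching mode page. Whether this reaches a SATA disk behind ahci(4) depends on the controller and SAT translation, so treat this as a sketch to try rather than a confirmed answer (`da0` is a placeholder device name):

```shell
# Display the caching mode page (0x08); look for the WCE
# (write cache enable) bit.
camcontrol modepage da0 -m 0x08

# Edit the page interactively (opens in $EDITOR); set WCE to 0
# to disable the drive's write cache:
#   camcontrol modepage da0 -m 0x08 -e
```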


----------



## chrcol (Dec 9, 2010)

These discussions about write cache come up occasionally.  The conclusion I always come to is that it just isn't feasible to have it disabled; the performance loss is way too severe.

The risk of data loss as a result of having it enabled is almost nonexistent.  Bad HDDs may lose data regardless of the setting, so it's not insurance against HDD failure.  It's more protection against unexpected shutdowns such as power cuts.

At home I have had one power cut in 9 years; in remote server locations I have suffered 2 power cuts in 8 years.  In both cases I lost no data.  Or rather, no noticeable data.


----------

