# Chances of drive failure on a 4-drive ZFS raidz1



## overmind (Oct 27, 2010)

What are the chances of 2 drives failing at the same time in a 4-drive raidz1?

I've read that the odds are about 1 in 10 drives failing over a period of a year or more.

What is the best approach to getting more space while still having redundancy?
3 drives in raidz1?
6 drives in raidz2?

I would like to use a pool of 4 x 1.5 TB drives for a mini backup server.

I've read some posts about ZFS here on the forum and on http://www.solarisinternals.com/, but I'd like your opinion: from your experience, what are the chances of hard drive failure?


----------



## gkontos (Oct 27, 2010)

Once I had a failure of 2 drives in a 3-disk RAID 5 :\
The system was not using ZFS; in fact it was using ext3 on CentOS (can't remember the version). It turned out that a voltage issue caused both drives to fail at the same time. So, honestly, the more redundancy the better. And of course never forget that a good RAID remains good only as long as it is backed up regularly.

George


----------



## shitson (Oct 27, 2010)

I'm sorry to be the bearer of bad news, but anything would be a massive guess. Two drives failing at the same moment in time is possible, e.g. a lightning storm, faulty firmware, a faulty batch... but you can mitigate the chances by running your gear on a UPS that is both surge and spike protected. You're really dealing with consumer-grade hardware, though.

There is always a chance, so never leave your data solely in the hands of the hardware gods.


----------



## overmind (Oct 27, 2010)

I will use a server motherboard and a UPS to reduce some of the risks.


----------



## phoenix (Oct 28, 2010)

The chance of two drives failing at the same time is pretty low.  HOWEVER, the chance of a second drive dying while going through the stress of a resilver is much higher.  And if that second drive dies while the first drive is rebuilding ...

A 3-drive raidz1 is no better than a mirror (same amount of disk space), will be slower than a mirror, and will "waste" an extra drive of disk space compared to a mirror.

A 4-drive raidz1 will be slower than a pair of mirrors, and will have less redundancy (raidz1 can only lose 1 drive; a pair of mirrors can lose two drives if they are the right two).

A 4-drive raidz2 is no better than a mirror (same amount of disk space), but will be slower than a mirror, and will "waste" two drives of disk space compared to a mirror.  A pair of mirrors has the same amount of redundancy, but better speed.

A 6-drive raidz2 is the "sweet" spot where raidz overtakes mirrors in terms of disk space and redundancy.

If you are absolutely paranoid about redundancy, and can't stand to lose a shred of data, then use 3-disk mirrors.    Or, use OpenSolaris with 8-drive raidz3 vdevs.

Or, go with whatever you are most comfortable with.    There's no "right" answer.
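phoenix's point about the resilver window can be put into rough numbers. Below is a back-of-the-envelope sketch; the function name is made up for illustration, and the 3% annual failure rate and 24-hour resilver are assumptions, not measurements (real failures are also not independent, as other posters note):

```python
def p_second_failure(afr: float, resilver_hours: float, survivors: int) -> float:
    """Probability that at least one surviving drive fails during the
    resilver window, assuming independent failures at a constant rate."""
    window_years = resilver_hours / (365.0 * 24.0)
    p_one = 1 - (1 - afr) ** window_years      # one drive fails in the window
    return 1 - (1 - p_one) ** survivors        # at least one of the survivors

# 4-drive raidz1: one drive already dead, 3 survivors under resilver stress
print(f"{p_second_failure(afr=0.03, resilver_hours=24, survivors=3):.4%}")
```

Even with these optimistic assumptions the per-incident risk is small, which matches the observation elsewhere in this thread that the real danger is correlated failures (same batch, shared PSU, voltage spikes) rather than independent bad luck.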


----------



## danbi (Oct 28, 2010)

phoenix, wouldn't a 3-drive raidz1 have the space of two drives striped? A mirror will have the space of only one drive. Then a 4-drive raidz1 will have the space of three drives, etc.
raidz is, however, slower than mirrors for random writes and writing in general.

In any case, off-line backup is what saves your data. The redundancy of a mirror or raidz{1,2,3} is there to let your system keep running while there is a disk failure. Some systems must run at all times, no matter what; others can tolerate extended downtime (to restore from backup).

Not making backups and hoping your data will be safe is... well, hoping your data will be safe.
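danbi's space accounting can be sketched in a few lines (a toy helper with a made-up name; capacities are in whole-drive units and ignore ZFS metadata overhead):

```python
def usable_drives(layout: str, n: int) -> int:
    """Usable space, in whole-drive units, for n equal drives.
    raidz1/2/3 lose 1/2/3 drives to parity; 2-way mirrors lose half."""
    parity = {"raidz1": 1, "raidz2": 2, "raidz3": 3}
    if layout in parity:
        return n - parity[layout]
    if layout == "mirror":
        return n // 2
    raise ValueError(f"unknown layout: {layout}")

for layout, n in [("mirror", 2), ("raidz1", 3), ("raidz1", 4), ("raidz2", 6)]:
    print(f"{n}-drive {layout}: {usable_drives(layout, n)} drive(s) usable")
```

So a 3-drive raidz1 does give two drives of space versus one for a 2-way mirror, as stated above.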


----------



## aragon (Oct 28, 2010)

Prepare for the worst.  Run smartd (sysutils/smartmontools) so that you're notified early of potential drive failure, and purchase a cold spare that you keep in a safe place.
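For example, a minimal smartd.conf entry might look like this (the device name and mail address are placeholders; the path assumes the sysutils/smartmontools port's default config):

```
# /usr/local/etc/smartd.conf
# Monitor all SMART attributes on ada0, mail on trouble,
# and run a short self-test every day at 02:00.
/dev/ada0 -a -m admin@example.com -s S/../.././02
```

Remember to enable the daemon (`smartd_enable="YES"` in /etc/rc.conf) for it to start at boot.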


----------



## User23 (Oct 28, 2010)

You can never be fully prepared for the worst.

If all your drives were produced on the same date, there could be a production problem, so all the drives could fail at nearly the same time.

Smartmontools, for example, won't help you if your power supply kills your drives with too much voltage.

Even if you have a backup, if it is in the same room or house it could burn in a fire.

http://en.wikipedia.org/wiki/Murphy's_law


----------



## overmind (Oct 28, 2010)

danbi said:

> Not making backups and hoping your data will be safe is... well, hoping your data will be safe



The idea is that the machine I am talking about will be a backup server. Should I then back up the backup server?

Until now I've used gmirror on all my servers, and it has worked fine; I've had no data loss. All servers are connected to 1000VA UPSes.


----------



## phoenix (Oct 28, 2010)

danbi said:

> phoenix, wouldn't 3 drive raidz1 have the space of two drives stripped? A mirror will have the space of only one drive. Then 4 drive raidz1 will have the space of three drives etc.



Hee hee, oops.  You're right.  My math is off.


----------



## fgordon (Oct 29, 2010)

At the moment I'm using a 12-drive ZFS raidz2, and as a backup system a Linux software RAID-6 solution, also with 12 drives (Samsung 2 TB, 5400 rpm; they stay cool while running).

Former systems had up to 24 drives in a single array (RAID-6), and I have never had to use my backup system so far (~10 years now), even with PATA and 24-drive configurations (no server drives).

I think normal drives are a lot better than one might expect. I do have a backup server, of course.

I'm not using a UPS, but electricity is very stable here. I think using a really good PSU is enough (at least for home use).


----------



## wonslung (Nov 1, 2010)

I've got a few servers in production using 20 1TB or 2TB drives in 10-drive-wide raidz2 vdevs. They work fantastically for backup servers and/or streaming servers.

As long as you don't make the mistake of using Western Digital Green drives, I see no problem with somewhat wide raidz2 stripes (10-12 drives). The main thing to remember is that random I/O will be very limited due to the variable block size (with raidz1/2/3 a block spans the full stripe, so you get the I/O of a single drive; sequential access is fine, though, and can make good use of ZFS prefetching).

The biggest issue with wide stripes, other than random I/O, is resilver times, but I get between 200-600 MB/s on resilvers and scrubs with this layout. I was using 3-4 vdevs, but for backup servers and servers doing mostly sequential access this turned out to be a waste.

The bottom line is, you should test it for your workload, but I see no real issues.


----------



## overmind (Nov 1, 2010)

@wonslung
I have WD Green drives with gmirror for backup (not ZFS) and I've had no problems. Tell me more about WD Green; what could be the problem with them?
(I used the green ones for their low power consumption.)

So, for a backup server a 10-drive raidz2 is OK from a disk space point of view, right?
And using two stripes of 5-drive raidz1 would be a little bit faster, but with less space?


----------



## fgordon (Nov 1, 2010)

WD Green drives are sold with a VERY short idle timeout (~8 seconds) before parking the heads. If you don't change that (with wdidle3.exe), the load/unload cycle count will grow really fast, and eventually this leads to a SMART failure, as the number of load/unload cycles is limited.


----------



## jalla (Nov 1, 2010)

overmind said:

> @wonslung
> So, for a backup server is ok a 10 drives raidz2, from a disk space point right?
> And using two stripes of 5 drives raidz1 would be a little bit faster but with less space?



I'd say not faster, and identical in space.
The big difference is in safety. One 10-disk raidz2 is *much* more robust than two raidz1 vdevs of 5 disks combined.


----------



## phoenix (Nov 1, 2010)

overmind said:

> @wonslung
> I have WD Green drives with gmirror, for backup (not zfs) and I had no problem. Tell me more about WD Green, what could be the problem with them?
> (I used green ones for low power consumption).



Search the forums and the freebsd-stable/-current/-fs mailing lists.  There are *lots* of posts about just how horrible the WD Green-series, WD GP-series, and WD "Advanced Format" versions really are.  Just avoid them all.  The *only* place they are useful is if you need a super-low-power and very quiet drive for putting into an HTPC or similar.  However, do not use them in RAID, do not use them in a server, and do not use more than 1 per system.  Just ... don't.  It's not worth the hassle.


----------



## Christopher (Nov 2, 2010)

phoenix said:

> A 4-drive raidz2 is no better than a mirror (same amount of disk space), but will be slower than a mirror, and will "waste" two drives of disk space compared to a mirror.  A pair of mirrors has the same amount of redundancy, but better speed.



Yes, but in the event of a single drive failure, a raidz2 retains redundant copies of all the information, unlike a pair of mirrors. This is useful during the stressful resilvering process. On a mirror, a single I/O error can then result in data loss, but that's not true of a raidz2.


----------



## mix_room (Nov 2, 2010)

phoenix said:

> A 4-drive raidz2 is no better than a mirror (same amount of disk space), but will be slower than a mirror, and will "waste" two drives of disk space compared to a mirror.  A pair of mirrors has the same amount of redundancy, but better speed.



That is not strictly true. A 4-drive raidz2 has slightly better redundancy than a mirror. (Assuming that by mirror you mean some form of Raid-1+0) With the raidz2 you can lose any 2 disks, while in a mirror you can only lose a specific combination of the disks.


----------



## wonslung (Nov 3, 2010)

overmind said:

> @wonslung
> I have WD Green drives with gmirror, for backup (not zfs) and I had no problem. Tell me more about WD Green, what could be the problem with them?
> (I used green ones for low power consumption).
> 
> ...






There are a lot of problems with them, but the biggest is the so-called "Advanced Format".

The sector size is physically 4k, but the drive reports 512b, and for raidz that is just about the worst possible situation.

Raidz uses a variable block size. ZFS tries to turn random writes into sequential writes by buffering them in RAM and then flushing them to disk every so often, and with raidz it will try to write blocks as wide as the raidz group (so it will break one block into 5 parts for a vdev with 5 drives, including parity).

Because the drives report a 512b sector size, ZFS will sometimes spread these "blocks" in 512b "pieces" across the drives. This forces the drive to read and write over and over for what should be normal disk operations.
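The read-modify-write penalty described above can be modeled in a few lines (a toy model with a made-up function name; it only counts physical sector operations and ignores caching and firmware tricks):

```python
def rmw_ops(write_offset: int, write_len: int, physical: int = 4096) -> int:
    """Count physical-sector operations for one logical write.
    A write that fully covers a physical sector costs 1 op; a write that
    only partially covers one forces read-modify-write (2 ops)."""
    first = write_offset // physical
    last = (write_offset + write_len - 1) // physical
    ops = 0
    for s in range(first, last + 1):
        start = max(write_offset, s * physical)
        end = min(write_offset + write_len, (s + 1) * physical)
        full = (end - start) == physical
        ops += 1 if full else 2
    return ops

print(rmw_ops(0, 4096))   # aligned 4k write: one sector written once
print(rmw_ops(512, 512))  # misaligned 512b write: read + write back
```

An aligned 4k write touches one physical sector once, while a 512b write forces the drive to read the surrounding 4k sector, patch it, and write it back.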


----------



## wonslung (Nov 3, 2010)

jalla said:

> I'd say not faster, and identical in space.
> The big difference is in safety. One 10 disk radz2 is *much* more robust than two raidz vdevs of 5 disks combined.




Yes, I agree (and thought I conveyed this), but it's often given as advice not to use wide stripes. It DOES depend on your use, though. Two raidz1 vdevs of 5 drives will be better in some situations than a single raidz2 vdev of 10 drives.


----------



## overmind (Nov 4, 2010)

wonslung said:

> Theres a lot of problems with them, but the biggest problem is the so called "advanced formating"
> 
> the sector size is physcially 4k but reports that it is 512b so for raidz this is just absolutely the worst possibkle situation.
> 
> ...



Then, which hard drives (1TB and 1.5TB) are OK?
- The Western Digital Caviar Black, 1.5 TB, 7200rpm, 64MB, SATA2 is kind of expensive.
- I've noticed Samsung has green drives: the Samsung F2 EcoGreen, 1.5 TB, 5400 rpm, 32MB, SATA2. I wonder if they have the same issue as the WD Green ones.
- There's also the Samsung SpinPoint, 1.5 TB, 5400rpm, 32MB, SATA2, which I think is not a green drive.

If you use some of these and they work OK, please advise.

I would not choose Seagate; I've had lots of problems with them in the past (high failure rates). Also, I would prefer drives that run cooler.

Thank you and best regards!


----------



## aragon (Nov 4, 2010)

overmind said:

> Then, which hard drives (1TB and 1.5TB) are ok?


Everyone seems to rate the Hitachi highly:

http://www.newegg.com/Product/Product.aspx?Item=N82E16822145369

Although it looks like Newegg won't be selling them in 2TB anymore....

overmind said:

> - I've noticed Samsung has green drives: HDD Samsung F2 Eco Green Series 1.5 TB, 5400 rpm, 32MB, SATA2, I wonder if is the same issue with them as with WD green ones


There's also the F3 series if you can find them for sale somewhere.  They didn't have sector emulation like the F4 does.

overmind said:

> Also I would prefer ones that runs cooler.


Notebook hard drives are an option too.

(I'm still on the fence regarding the severity of 4k sector emulation when we have gnop at our disposal)
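For what it's worth, the gnop workaround goes roughly like this (device and pool names are placeholders; a sketch, not a tested recipe): put a fake 4k-sector provider on top of one disk, create the pool through it so the vdev is built with 4k alignment (ashift=12), then remove the nop device:

```sh
# Fake a 4k-sector device on top of one disk
gnop create -S 4096 /dev/ada0

# Create the vdev through the .nop device; ZFS picks up the 4k sector size
zpool create tank raidz1 ada0.nop ada1 ada2 ada3

# The alignment is recorded in the vdev label, so the nop can go away
zpool export tank
gnop destroy /dev/ada0.nop
zpool import tank
```

Note this only fixes vdev alignment at creation time; it does nothing about the minimum block size issue phoenix describes later in the thread.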


----------



## overmind (Nov 4, 2010)

Yes, notebook hard drives are interesting if we can get 1TB at a low price, because of their low power consumption and high density. Check this out: http://www.chenbro.com/corporatesite/products_detail.php?sku=117 (there's also a smaller, 24-drive version).


----------



## fgordon (Nov 5, 2010)


At the moment I'm using 24 Samsung F3 2 TB drives... and have had no failures so far (~1 year).

They stay really cool(!)  I log the temperature of every drive, and they are normally only 10 degrees (C) warmer than room temperature, even under heavy load for hours (scrub).

Hmmm, I thought the only problem was 4k drives that report they are using 512 bytes? Does the F4 have this "emulation"?  If a drive reports itself as 4k, ZFS should not have a problem with it... as long as one partitions it correctly, or uses it without any partition...?


----------



## phoenix (Nov 5, 2010)

The problem with ZFS is that it uses a variable block size and will gladly use 0.5 KB, 1 KB, or 2 KB blocks, which plays havoc with disk alignment after a bit of use.

Yes, one can force a single block size onto a ZFS filesystem via the *recordsize* property.  However, that eliminates a lot of the potential performance gains of using a variable block size, as *recordsize* acts as both a minimum and a maximum for all blocks.

What one needs to do is recompile all the ZFS code to set the *minimum* block size to 4 KB, without affecting the maximum.  There's still a lot of debate over where in the code this needs to be set (FreeBSD patches set it in one place, OSol patches set it in a different place, and nobody knows if either is enough).

Aligning the first partition is not enough for ZFS to stay aligned.

Until the OSol and/or FreeBSD devs come up with a way to either autotune this based on what the drive reports, set it at runtime, or even add a knob to set it at compile time, one really should avoid all 4K drives when using ZFS.  Or suffer through sub-par performance.


----------



## fgordon (Nov 6, 2010)

ZFS is designed as a 128-bit filesystem, so it will be usable for 10 or even 20 years. But already the "current" generation of hard disks makes major changes in the source code(!!) necessary.

Even 1-terabyte sectors would make sense with a 128-bit filesystem!

Hmmm, how strange is that?


----------



## wonslung (Nov 6, 2010)

fgordon said:

> ZFS is designed as 128 bit   so it will be usable for 10 or even 20 years - but already the "current" generation of hard disks makes major changes in the source code(!!) necessary.
> 
> Even 1 Terabyte Sectors make sense with a 128 bit Filesystem!
> 
> Hmmm how strange is that?





No, hard drive manufacturers need to give a shit about their customers and stop "lying" about the sector size.

ZFS's code is fine, except for one thing: it assumes the hard drive isn't lying about the sector size. Go figure.


----------



## fgordon (Nov 7, 2010)

No it's not; just read what phoenix wrote....

So the problem with ZFS is not disks "lying" about sector size. The real problem is that ZFS has to be recompiled to work efficiently with any media with a sector size > 512 bytes, because 512 bytes is the "hard-coded" sector size ZFS assumes for ANY drive, and it cannot be changed without changing the source code and recompiling. And it seems the necessary changes are not that easy; if the drives reported 4k, ZFS would not be any better off...

Hehe, ZFS can address trillions of terabytes... but is designed to do this on 512-byte-sector media only *g* I find that a bit strange. Of course one can patch ZFS when new sector sizes come along, but this will make it very difficult to have 512, 4k, and 8k sector media in parallel use... like a big server plus external drives...

I still think ZFS is great; I just find this really strange for such an advanced filesystem design.


----------



## aragon (Nov 7, 2010)

fgordon said:

> So the problem of ZFS is not disks "lying" about sector size  the real problem ist ZFS has to be recompiled to work efficiently with any media > 512 byte sector size - as 512 byte is the "hard-coded" sector-size assumed by ZFS for ANY drive and cannot be changed without changing the source code and recompile it - and it seems the necessary changes are not that easy - if the drives would report 4k ZFS would not be any better...


If that's true, how are people using gnop as a workaround?


----------



## danbi (Nov 8, 2010)

fgordon said:

> No it's not, just read what Phoenix wrote....
> 
> So the problem of ZFS is not disks "lying" about sector size  the real problem ist ZFS has to be recompiled to work efficiently with any media > 512 byte sector size - as 512 byte is the "hard-coded" sector-size assumed by ZFS for ANY drive and cannot be changed without changing the source code and recompile it - and it seems the necessary changes are not that easy - if the drives would report 4k ZFS would not be any better...



You are, of course, wrong. ZFS is designed to use any sector size the underlying media says it has. And you need to understand that ZFS does not deal only with 'spinning disk media' but with all sorts of other storage devices, including network storage. Therefore, there is no "512 byte sector" code compiled in.

The problem with those poor disks is WD's attitude of restricting their usage to low-end desktop storage. The 'lie' is programmed into their microcode, and WD is refusing to provide a fix for those operating systems that need to know the real geometry of the drive. The fact that it works with a primitive OS like DOS/Windows does not make such a lie acceptable.
In future versions of Windows this might fail as miserably as it does with ZFS.

The truth is that these drives do work with ZFS. They just don't perform to user expectations, which are based on the published spec. Thing is, this spec is only valid if you align writes to the assumed, but not reported, 4k sector size.

There is nothing in ZFS to blame for this. Other file systems fail to utilize these drives' 'performance' as well.

Or, let me ask it in a different way: everything with these WD drives works as designed. So where is the problem? If you need different behavior, buy different drives. There are so many on the market.


----------



## phoenix (Nov 8, 2010)

danbi said:

> You are, of course wrong. ZFS is designed to use any sector size the underlying media says it has. And, you need to understand, that ZFS does not deal with only 'spinning disk media' but will all sorts of other storage devices, including network storage. Therefore, there is not any "512 byte sector" code compiled in.



Yes, there is.    The "minimum block size" is a compile-time option, and is currently set to 512 B.  This means, that ZFS will use variable-sized blocks for all writes, with the smallest block size being 512 B.

All of these "Advanced Format" drives advertise their block size as "512 B".  Thus, ZFS will happily write 0.5, 1, 2, and 4 KB blocks, nicely destroying any manual partition alignment you've done.  All it takes is writing out 1 little text file under 4 KB in size, to screw things up.

If the drive manufacturers fix their firmware to report 4 KB sectors, then things may work correctly.

Until then, you need to recompile all the ZFS tools to set the minimum block size to 4 KB.



> The truth is, that these drives do work with ZFS. They just don't performs to the user expectations, that are based on published spec. Thing is, this spec is only valid if you align writes to the assumed, but not reported 4k sector size.



Which requires you to recompile all the ZFS tools to set the minimum block size to 4 KB.  Without that, any small writes will be done using 0.5, 1, or 2 KB blocks, destroying your nicely aligned partitions.



> There is nothing in ZFS to blame about this. Other file systems fail to utilize those drives 'performance' as well.



Never said ZFS was to blame, just that ZFS will not work with these 4 KB drives without either manually recompiling the ZFS tools to set the minimum block size to 4 KB, or manufacturers fixing their firmware to report 4 KB physical and logical sectors.



> Or, let me ask it in a different way: everything with these WD drives works as designed.



They don't follow the ATA spec, which lists separate logical sector size and physical sector size parameters that the OS can query.  These drives list 512 B for both.


----------



## wonslung (Nov 9, 2010)

fgordon said:

> No it's not, just read what Phoenix wrote....
> 
> So the problem of ZFS is not disks "lying" about sector size  the real problem ist ZFS has to be recompiled to work efficiently with any media > 512 byte sector size - as 512 byte is the "hard-coded" sector-size assumed by ZFS for ANY drive and cannot be changed without changing the source code and recompile it - and it seems the necessary changes are not that easy - if the drives would report 4k ZFS would not be any better...
> 
> ...



You just don't know what you're talking about.

Next time someone explains the issue, before you go and make the same comment again, try a Google search.

A simple search for "ZFS 4k problem" or "4k drive lie" would have netted you a WEALTH of information on the subject.

The "patch" for FreeBSD isn't to fix any inherent flaw in ZFS; it's to patch around the stupid 512b lie, and even then it's not even CLOSE to ideal.

Having firmware that is "honest" is what we need.  This is why I do not buy, or advise buying, any WD drives (nor have I for the past few years).

The ridiculousness of your comments is hilarious.  I've seen ZFS work amazingly well on 4k drives that had 4k firmware (they DO exist).


----------



## aragon (Nov 9, 2010)

wonslung said:

> drives which had 4k firmware (they DO exist).


Where/how?!


----------



## danbi (Nov 13, 2010)

phoenix said:

> Yes, there is.    The "minimum block size" is a compile-time option, and is currently set to 512 B.  This means, that ZFS will use variable-sized blocks for all writes, with the smallest block size being 512 B.
> 
> All of these "Advanced Format" drives advertise their block size as "512 B".  Thus, ZFS will happily write 0.5, 1, 2, and 4 KB blocks, nicely destroying any manual partition alignment you've done.  All it takes is writing out 1 little text file under 4 KB in size, to screw things up.



Of course ZFS has a hard-coded "minimum block size", but this is not really the "sector size". The minimum is there to ensure that ZFS writes at least 512 bytes, even if the underlying media claims to support 128-byte sectors, for example. With blocks smaller than 512 bytes, the filesystem would become too fragmented and metadata would occupy too much of the storage.

I wonder if this "Advanced Format" phenomenon will disappear soon and manufacturers will stop making such lying disks.


----------

