# ZFS RAID-Z3 performance impacts?



## Zanthra (Dec 25, 2013)

I keep reading that using RAID-Z3 has performance impacts vs RAID-Z2 or RAID-Z, but I cannot figure out why.  With five data disks and three parity, it seems the ZFS software would calculate the parity almost instantly and then write it all out at once.  Reads should not take much longer either.  How would this have any performance impact vs the other RAID-Z levels?

My current plan for a high-availability storage system is to start with one RAID-Z3 of eight drives (two on each of four relatively cheap controllers).  When that fills, add a second RAID-Z3 of eight drives.  From there, if expansion is necessary and significantly larger hard drives are available, I can replace and resilver drives in the smaller RAID-Z array to bring it up to higher capacity (and, with RAID-Z3 and proper backups in case anything goes wrong, resilver two at a time in four steps rather than eight).

Any suggestions or thoughts, or explanations of the performance changes with RAID-Z3?

Thanks,

Zanthra


----------



## ralphbsz (Dec 30, 2013)

Zanthra said:

> I keep reading that using RAID-Z3 has performance impacts vs RAID-Z2 or RAID-Z, but I cannot figure out why.



RAID-Z3, or any RAID that writes extra redundancy copies to deal with extra failures, has a performance impact, simply because a larger fraction of the drives are needed for redundancy.

Let's work this through in a simplified example.  You say you have 8 drives, and want to run a 3-fault-tolerant RAID code on them.  My (hypothetical) RAID implementation will take each block of data you write, cut it into 5 data slices, compute 3 parity slices, and write all 8 slices simultaneously.  For large reads, it will simultaneously read 5 disks (the other 3 are idle) to reassemble the block, and return it to you.  For small reads, it reads from whichever of the 5 disks happens to contain the data (not parity) of the particular slice you are interested in.

For further simplification, let's start by studying sequential writing workloads, and let's assume that the RAID implementation keeps sequential files also sequential on disk.  Say every disk is capable of reading or writing 100 MB/s, and let's also assume that the disks are the only bottleneck (this is either reality, or at least a design goal).  In this case, a RAID-Z3 implementation would be able to write 500 MB/s of user data, while actually pushing 800 MB/s (the hardware limit) onto the disks.  You can easily see that a non-redundant implementation would be able to write 800 MB/s, a 1-fault tolerant RAID code would be able to write 700 MB/s, and a 2-fault tolerant code 600 MB/s.  Right there is your performance impact!  You paid for 8 disks, and you only got 5 disks worth of write bandwidth.
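The arithmetic above can be jotted down in a few lines of shell; the per-disk speed and disk counts are the hypothetical figures from this example, not measurements:

```shell
#!/bin/sh
# Hypothetical example: 8 disks at 100 MB/s each; vary the parity count.
disks=8
per_disk_mbs=100
for parity in 0 1 2 3; do
    data=$((disks - parity))
    echo "parity=$parity: $((data * per_disk_mbs)) MB/s user data of $((disks * per_disk_mbs)) MB/s raw"
done
# parity=3 prints: 500 MB/s user data of 800 MB/s raw
```

The per-disk number cancels out of the ratio, so the 5/8 penalty holds whatever the drives' actual speed is.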

Now, for a read workload it gets more tricky, and one has to be very careful in how one defines performance.  Let's first look at your workload being a single-threaded sequential read.  In that case, at any given moment only 5 disks are busy, since we don't have to read the parity disks, so you will only get 500 MB/s again (just like the write case).  Again, the same 3/8 performance penalty.  But then, you could say that you have lots of applications running, all simultaneously doing sequential reads, and you could argue that if you average over hundreds of read streams, all disks are busy.  That argument is actually not completely correct: Because 3/8 of all data on disk is parity that never needs to be read, the workload on disk will not be completely sequential (the disks need to skip over parity slices on disk), so the performance will be slightly lower; but perhaps, that correction is a small effect. But if your disk array is actually only performance limited, not capacity limited, your argument is pretty good.  On the other hand, if you bought this many disks because you needed the capacity, then with a RAID-Z3 code you had to buy 3 extra disks, again a 3/8 cost overhead.

The situation with small and non-sequential reads and writes is even more complicated, and depends on RAID implementation and workload.  But overall, the higher the fault tolerance, the higher the capacity impact and write performance impact is going to be.  On the other hand, if you are not capacity limited and your workload consists of reads (meaning you are only worried about read performance), then RAID typically has little or no performance impact.

For the average home server, or the typical small commercial system, this typically doesn't matter anyhow.  It's quite unusual to find workloads that can actually saturate the performance of many disk drives for extended periods.  And if the occasional file copy takes 60% longer (800 vs. 500 MB/s), that probably has few real-world effects on users.

By the way, please don't interpret what I wrote to mean that you should use less redundancy, just because doing a highly redundant RAID (like ZFS's RAID-Z3) will have a cost and performance impact.  For most users, the data is more valuable than the hardware it is stored on, and I very much endorse RAID in general, and in particular RAID codes that can handle multiple faults.  In the style of the credit card commercial: Getting your data back even after multiple disk failures - priceless.


----------



## throAU (Dec 31, 2013)

The big thing to keep in mind is that in terms of IOPS (i.e. the number of requests per second that can be serviced), the performance of a ZFS system is, for writes, hobbled by the number of VDEVs you have.  IOPS are usually what matter on a busy multi-tasking/multi-user storage system.

Write IOPs per VDEV is limited to the IOPs capacity of a single disk.

What are the implications of this?  RAIDZ3 requires quite a lot of disks per VDEV (a minimum of 5, if I'm not mistaken?) - disks that could otherwise be used with other RAID levels to create a larger number of VDEVs.

So, let's say you have 10 disks.  You could put them in a 2x RAIDZ3 configuration.  But you could also do 3x RAIDZ VDEVs with a hot spare, or 5 mirror VDEVs.
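For concreteness, those three 10-disk layouts would be created along these lines; the pool name `tank` and device names `da0`..`da9` are placeholders, and the commands are a sketch, not a tested recipe:

```shell
# 2x RAIDZ3 vdevs (5 disks each): 4 disks' worth of usable space.
zpool create tank \
    raidz3 da0 da1 da2 da3 da4 \
    raidz3 da5 da6 da7 da8 da9

# 3x RAIDZ vdevs (3 disks each) plus a hot spare: 6 disks' worth of space.
zpool create tank \
    raidz da0 da1 da2  raidz da3 da4 da5  raidz da6 da7 da8 \
    spare da9

# 5x 2-way mirrors: 5 disks' worth of space, the most write IOPS.
zpool create tank \
    mirror da0 da1  mirror da2 da3  mirror da4 da5 \
    mirror da6 da7  mirror da8 da9
```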

More VDEVs = more write IOPs.

Of course RAIDZ3 has better fault tolerance, but that doesn't come for free.


----------



## ralphbsz (Dec 31, 2013)

throAU said:

> Of course RAIDZ3 has better fault tolerance, but that doesn't come for free.



Exactly.  Old electrical engineering saying: Good, fast, cheap, pick any two.  

An interesting question is the following: for the common multi-user / multi-application workload (excluding video editing), do long sequential runs of writes really matter?  It may be that performance (or the lack thereof) is actually driven by small writes (write operations that are smaller than a RAID stripe), including those that ZFS has to temporarily write mirrored instead of parity-based.

In that case, one should try out using an ultra-fast disk (SSD, or perhaps even a memory-backed volume on a RAID controller) as a ZIL disk.  It might greatly improve performance in such a case.  It might also be a great waste of money, or it might be really unreliable (as the data in the ZIL is probably not written redundantly, in particular if the user can only afford one of the expensive SSDs).  I also don't know whether ZFS writes only file system metadata into the ZIL; the greatest performance gain would come from the ZIL also being used for small data (file content) writes.  I don't know anything about ZFS internals, so my only advice is: If performance is inadequate, carefully study the workload, and benchmark proposed system configurations before putting them into production.
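For anyone who wants to experiment with this, a separate log device is attached to an existing pool with `zpool add`; the pool and device names below are placeholders:

```shell
# Single SSD as a dedicated ZFS intent log device
# (fast, but the log itself is then a single point of failure):
zpool add tank log ada1

# Mirrored log devices, if you can afford two SSDs:
zpool add tank log mirror ada1 ada2
```

As far as I know, the ZIL only absorbs synchronous writes, so a separate log device helps sync-heavy workloads (databases, NFS) and does little for ordinary buffered writes - which is exactly why benchmarking the real workload first is good advice.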

Another philosophical remark: a RAID system that is 3-fault tolerant (like RAIDZ3) is de facto invulnerable to normal disk failures.  If you go through the numbers, assuming only uncorrectable read errors and whole-disk failures exist, and that there are no correlated disk failures, you find that a normal-size file system that uses a 3-fault-tolerant code will lose data so slowly that a file system written by Neanderthals would still be OK today.  This really points out that with such a system, other sources of data loss become dominant.  Examples include the wetware typing "rm -Rf", or a power supply failure killing 4 disks at once, or a small fire in the server room melting everything, or the NSA flipping all the 1 bits back to zero out of spite.  And we haven't even discussed the effect of software bugs.  If a user really needs the resilience of a 3-fault-tolerant RAID, they also need to invest an enormous amount of effort into snapshots, backups, off-site storage, disaster recovery, security audits, and such.


----------



## Zanthra (Jan 1, 2014)

Thank you for the replies.  That all makes sense to me.

The majority of the workload here will be photos, video, and other large files, shared over a Gigabit Ethernet connection.  I do hope that this storage system will survive to see 10GBASE-T on the network here, so I am trying to think ahead some while planning.

The big thing that worries me about using something less than RAID-Z3 is the fixed geometry of the ZFS pool.  The moment I enter the command to create this pool, that VDEV is going to have that geometry for the rest of the pool's lifetime.  I would rather put capacity and performance on the line for redundancy now than regret my decision and face a difficult process of rebuilding the pool to change it.  If ZFS allows restructuring a pool's geometry someday, I may rethink it then.

We do have a procedure for regular backups on our current system, and when this system is operational, the current system will move to a remote location to provide a backup as well.

This is still nearly a year away from being built, but knowing what my options are, and their impacts is something I am trying to figure out now.  I appreciate the explanations.  As Eisenhower said, "Plans are worthless, but planning is everything."


----------



## throAU (Jan 14, 2014)

Me?  I just paid the capacity cost and went with mirrors.

I currently have 2x 2 way mirrors, and can withstand a failure in each mirror (50% of drives can fail, so long as they're the right drives).

I can expand the storage by adding one 2-drive mirror VDEV at a time (or by replacing the drives in a mirror VDEV with a bigger pair), and performance will scale accordingly for newly written data.  To reiterate: you can expand a pool by ADDING VDEVs, or by replacing each drive in a VDEV.  VDEVs themselves are fixed configuration, but the pool structure itself isn't set in stone - you can add (but not remove) VDEVs.
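The two expansion paths above, sketched as commands (pool and device names are placeholders):

```shell
# Path 1: grow the pool by adding a new 2-disk mirror vdev.
zpool add tank mirror da4 da5

# Path 2: grow an existing mirror vdev in place - replace one disk
# with a bigger one, wait for the resilver, then do its partner.
zpool replace tank da0 da6
zpool replace tank da1 da7
# With autoexpand enabled, the extra capacity appears once the last
# disk in the vdev has been replaced:
zpool set autoexpand=on tank
```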

If you go for parity RAID... you need to replace all the drives in a VDEV to get any more space.  With RAIDZ3 that's 5+ drives at once.  In practice more, because 5 drives with 3 for parity wastes a heap of space.  You'd be better off with, say, 7- or 9-drive VDEVs, and vs. say 4x 2-drive mirrors, performance will SUCK (but it's all relative - for a small number of users behind 1 Gigabit Ethernet it is likely quite fine).  You'd be better off buying bigger drives and going for, say, 3x 3-way mirrors (if you need the reliability) or 4x 2-way mirrors in that example, IMHO.

Depends how paranoid you are - but really, the way I see it, RAIDZ2 and RAIDZ3 are intended for much bigger pools with many more drives.

You can still expand storage with RAIDZ3 - you're just looking at replacing a heap more drives at once - either by going larger, or adding another RAIDZ3 VDEV.


Also worth noting is that on READ, ZFS will load-balance between all drives in a mirror.


----------



## JanJurkus (Jan 16, 2014)

throAU said:

> Also worth noting is that on READ, ZFS will load-balance between all drives in a mirror.



Yes, so a 4-way mirror will also provide more read IOPS.  Well, in theory; I have not tried it out yet.  At least it seems like a better idea to me.  What is the resilver time on a failed 2 TB disk these days?
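As a rough lower bound: a mirror resilver is essentially a sequential copy of the used data, so a completely full 2 TB disk at an assumed 100 MB/s sustained works out to:

```shell
#!/bin/sh
# Assumed sustained rate - real resilvers on a busy or fragmented
# pool can run far slower than a clean sequential copy.
bytes=$((2 * 1000 * 1000 * 1000 * 1000))  # 2 TB
rate=$((100 * 1000 * 1000))               # 100 MB/s
secs=$((bytes / rate))
echo "$secs seconds, roughly $((secs / 3600)) hours"
# prints: 20000 seconds, roughly 5 hours
```

A raidz resilver of the same disk has to read all the surviving disks in the vdev to reconstruct it, so it tends to take longer than the mirror case.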



			
ralphbsz said:

> In that case, one should try out using an ultra-fast disk (SSD, or perhaps even a memory-backed volume on a RAID controller) as a ZIL disk.  It might greatly improve performance in such a case.  It might also be a great waste of money, or it might be really unreliable (as the data in the ZIL is probably not written redundantly, in particular if the user can only afford one of the expensive SSDs).  I also don't know whether ZFS writes only file system metadata into the ZIL; the greatest performance gain would come from the ZIL also being used for small data (file content) writes.  I don't know anything about ZFS internals, so my only advice is: If performance is inadequate, carefully study the workload, and benchmark proposed system configurations before putting them into production.



You probably do not want an expensive SSD.  With a separate ZIL, not much data is written to the device; a 4 GB SLC drive is fine, I think.  And SLC, because it will mainly only receive writes.  For really critical stuff you might want to mirror the ZIL.  A failure of a separate ZIL can be survived - but not if it coincides with a dirty shutdown of the entire server.  You might want to read a little bit about this subject and not trust my (non-ECC) memory.


----------



## fnj (Jan 21, 2014)

Large video files are exactly the predominant profile of my own ZFS use.  I made multiple 6-hard-drive (4+2) RAIDZ2 pools.  Performance is highly satisfactory for me: I can write 500 MB/s to and read 700 MB/s from a pool, using 66 GB test files.  My total RAM is 16 GB, so caching is not obscuring the true to/from-disk-surface performance.

If I had the $$ and bays for more drives, I would have done RAIDZ3. I doubt very much that the performance of RAIDZ3 would be any problem at all in this kind of use.


----------



## belon_cfy (Nov 12, 2015)

throAU said:


> Me?  I just paid the capacity cost and went with mirrors.
> 
> I currently have 2x 2 way mirrors, and can withstand a failure in each mirror (50% of drives can fail, so long as they're the right drives).
> 
> ...



Recently I lost some of my data because of a mirror: a bad sector was detected on the other disk during rebuild.  Fortunately I did snapshot send and receive regularly.

If you use mirrors, ensure you back up your data properly; a mirror is not safer than multiple raidz2 VDEVs.


----------



## Galactic_Dominator (Dec 6, 2017)

I realize this is an old thread, this comment is for future consumers of it.



throAU said:


> Write IOPs per VDEV is limited to the IOPs capacity of a single disk.



I see this behavior cited frequently, but this is not true. Take this snippet from `zpool iostat`:

```
pool        alloc   free   read  write
----------  -----  -----  -----  -----
poolname     18.6T   101T  2.85K  7.62K
```

This pool is a single raidz2 vdev with a very large number of disks, none of which is even close to capable of writing that many IOPS.

While the quoted statement is untrue as stated, the point behind it remains valid.  I do not recommend structuring a raidz2 as I have described, especially if performance is needed.  Multiple 6-wide raidz2 vdevs would be the canonical structure for performance, and I see no reason that has changed, especially when factoring in resilvering and scrub times.
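For future readers, the layout recommended above would be built along these lines (pool and device names are placeholders):

```shell
# Pool of two 6-wide raidz2 vdevs; add further 6-disk raidz2 vdevs
# to grow capacity and write IOPS together.
zpool create tank \
    raidz2 da0 da1 da2 da3 da4 da5 \
    raidz2 da6 da7 da8 da9 da10 da11
```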


----------

