# Is ZFS really better than...



## MorgothV8 (Oct 11, 2012)

I've bought a new computer for my company.
It's a micro company, but that's not important:
I have two disks, 2 TB each, and I don't know which scenario to use:

ZFS option: encrypt both ada0 and ada1 with geli and then pass them to ZFS as
`# zpool create mirr0 mirror ada0.eli ada1.eli`

UFS option: use ada0 and ada1 as a gmirror:
`# gmirror label mirr0 ada0 ada1`
then geli on mirr0 --> mirr0.eli
then gjournal on mirr0.eli --> mirr0.eli.journal
and finally `# newfs -J /dev/mirror/mirr0.eli.journal`, mounted with async,noatime

ZFS looks fully capable of handling all of this, *but* I've been using ZFS for about 2+ years now... and it is SLOW.
I'm thinking of a setup without ZFS. Am I right?

I need a data store that is (in that order): encrypted, reliable, fast... what should I choose?


----------



## wblock@ (Oct 11, 2012)

For fast encryption, get a processor with AES-NI: Core i5 or higher, or AMD Bulldozer.  Actually, a few of the i5 processors do not have AES-NI, so verify it on the Intel web site first.
http://svnweb.freebsd.org/base/head/sys/crypto/aesni/aesni_wrap.c?view=log&pathrev=226837
http://svnweb.freebsd.org/base/head/sys/geom/eli/g_eli.c?view=log&pathrev=226840

Remember that 2 TB disks will have 4K sectors, so alignment is critical to performance.
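A minimal sketch of both checks on FreeBSD, assuming a GPT layout and the usual device names (ada0 here is an example disk, not from the thread):

```shell
# Check whether the CPU advertises AES-NI and make sure the driver is loaded:
grep -o 'AESNI' /var/run/dmesg.boot   # feature flag appears in the Features2 line
kldload aesni                         # no-op if already loaded or compiled in

# Align partitions to 4K boundaries before layering geli/gmirror/ZFS on top:
gpart create -s gpt ada0
gpart add -t freebsd-zfs -a 4k -l disk0 ada0
```

The `-a 4k` flag makes gpart round the partition start to a 4K boundary, which is what matters for advanced-format drives.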

Since you already have the hardware, why not set it up one way and test the performance?  If it's good enough, leave it.  Otherwise, set it up the other way and test.  Or do both tests so you'll know for sure.  And please post the results of bonnie++ here.
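For the suggested testing, a rough benchmark pass might look like this (the mount point and sizes are examples; bonnie++ is in ports as benchmarks/bonnie++):

```shell
# -d: directory on the filesystem under test, -s: file size (use ~2x RAM
# so the ARC/buffer cache can't hide the disks), -u: user to run as from root:
bonnie++ -d /mnt/test -s 32g -u nobody

# A simpler sequential sanity check with dd (file larger than RAM):
dd if=/dev/zero of=/mnt/test/bigfile bs=1m count=32768
```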


----------



## waksmundzki (Oct 11, 2012)

Create the mirror first, then run geli on that (so you only encrypt once, not twice). Use UFS journaling, not gjournal (for speed). Mount with sync for additional safety.
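A sketch of this stacking order, assuming raw whole disks ada0/ada1 and UFS soft-updates journaling (device names follow the original post; the mount point is hypothetical):

```shell
gmirror label -v mirr0 ada0 ada1          # mirror the raw disks first
geli init -s 4096 /dev/mirror/mirr0      # encrypt the mirror once
geli attach /dev/mirror/mirr0            # -> /dev/mirror/mirr0.eli
newfs -U -j /dev/mirror/mirr0.eli        # UFS with soft updates journaling (SU+J)
mount -o sync,noatime /dev/mirror/mirr0.eli /data
```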


----------



## lockdoc (Oct 19, 2012)

waksmundzki said:

> Create the mirror first, then geli that (so you only encrypt once, not twice).


But this is not going to work in a ZFS environment, is it?


----------



## vermaden (Oct 20, 2012)

lockdoc said:

> But this is not going to work in a ZFS environment, is it?



For ZFS you put GELI on top of each device and then you do a ZFS mirror or raidz.
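A minimal sketch of GELI-per-disk under ZFS, matching the pool name from the first post (key length and sector size are example choices, not from the thread):

```shell
geli init -s 4096 -l 256 /dev/ada0    # 4K GELI sectors also help alignment
geli init -s 4096 -l 256 /dev/ada1
geli attach /dev/ada0                 # -> /dev/ada0.eli
geli attach /dev/ada1                 # -> /dev/ada1.eli
zpool create mirr0 mirror ada0.eli ada1.eli
```

Note that each disk is encrypted separately, so every block written to the mirror is encrypted once per disk.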


----------



## m6tt (Oct 21, 2012)

I really recommend the ZFS option for business. It's very reliable, and you can make a 3-way "mirror" and drop one disk out each week as an offsite backup. The checksums are huge too...many companies have silently failing raid5 arrays that frankly scare me.
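One way to run this rotating-offsite idea, assuming a hypothetical pool named tank and three disks (detach/attach shown here is the simple approach; the detached disk is resilvered from scratch when it returns):

```shell
zpool create tank mirror ada0 ada1 ada2   # 3-way mirror

# Each week, detach one disk and take it offsite...
zpool detach tank ada2
# ...then attach the returned disk back against an existing member;
# ZFS resilvers it automatically:
zpool attach tank ada0 ada2
zpool status tank                         # watch resilver progress
```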

I'm not sure I would recommend adding an encryption layer unless you feel it is really necessary. Basic ATA drive security would solve most problems without the I/O hit, and physical security should prevent even that being necessary, but you know your situation better than we do.


----------



## zer0sig (Oct 24, 2012)

Generally agree with m6tt here. Encryption is cool and useful for certain kinds of critical information, but it can bog down a system in a hurry.

The answer to the title of this post is almost universally yes. ZFS is simply better, particularly for more drives, SAN setups, etc. It handles logical partitions locally and remotely with an elegance and efficacy that other filesystems really don't offer on their own. It is also rock solid in general (not that UFS sucks; ZFS is simply a more advanced FS). You can also really boost speed by setting up an SSD as a cache for ZFS - it is a built-in feature, and FreeBSD does all of this effectively once you get it set up correctly. FreeBSD also thankfully supports a newer version of ZFS than any OS I am aware of aside from Solaris itself. Read up on ZFS - there are some fairly accurate and useful pages both on ZFS itself and contrasting ZFS with other filesystem types.
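The built-in SSD caching mentioned above is a one-liner; tank and the SSD device names here are hypothetical:

```shell
# Add an SSD as an L2ARC read cache:
zpool add tank cache ada3

# A separate SSD log device (SLOG) for synchronous writes works the same way:
zpool add tank log ada4

zpool status tank   # cache and log devices appear in their own sections
```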


----------



## m6tt (Oct 25, 2012)

The reporting for ZFS is really good too. The zpool command can show disk array history, current status, background scrub progress, options, etc. Between zfs and zpool, everything is very tunable to a given dataset and its requirements.

Refrain from tuning too much with sysctls etc.; it's usually not necessary anymore on a 64-bit system. Once you really identify things that are subpar because of tunables, tune those. Most of the docs I've found are either old or tuned for different datasets (mysql, nfs, etc. - different!).

For goodness sakes use mirrors. If you want raidz5 or whatever, make a raidz5 of mirrors. The reason is that performance of a mirror is better in most configurations, and they are extremely reliable. If you must do raidz, do it like the big solaris installs and make it up out of many double or triple mirrors. Space is cheap, reliability is not.

You can easily get sequential read speeds in the many hundreds of MB/s with just 7200 RPM SATA disks and mirrors.
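A striped pool of mirrors (the RAID10-style layout argued for above) and the reporting commands mentioned might look like this; pool and disk names are examples:

```shell
# Two mirror vdevs; ZFS stripes writes across them automatically:
zpool create tank mirror ada0 ada1 mirror ada2 ada3

zpool history tank   # full command history for the pool
zpool status tank    # health, scrub/resilver state
zfs get all tank     # per-dataset tunables and properties
```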


----------



## Sebulon (Oct 25, 2012)

Here are some benchmarks I've made with AES-NI, GELI and ZFS in different configurations. I was rather surprised myself that it operated without any performance penalty whatsoever. The results match a similar machine without GELI'd disks. I'm sure there has to be some breaking point in the number of disks, but I never hit it.

GELI Benchmarks

/Sebulon


----------



## MorgothV8 (Oct 27, 2012)

I'm using geli on 10-CURRENT - processor Core i5 3450, geli in hardware.
I have a mirror of two identical disks (1 TB each): `zpool create zmirr mirror ada1.eli ada2.eli`
Also sync=disabled, atime=off.
RAM: 16 GB, system 10.0-CURRENT from one week ago.
Geli: AES-CBC 128-bit, aesni.ko loaded.

Write/read speed is in practice about the same, 150-180 MB/s:
`dd bs=16M if=./some_8GB_file of=/dev/null` gives 175 MB/s
`dd bs=16M if=/dev/zero of=./some_8GB_file count=512` gives 158 MB/s

A second dd from the same file gives about 2.5 GB/s (but it is fetched from the ZFS ARC then).


----------



## throAU (Nov 5, 2012)

Presumably these are server machines?  In a server room?


Could someone fill me in as to why geli is considered to be worth it?  If someone can break into your server room, they can steal your entire machine anyway.  To me it just looks like a big performance and complexity overhead for... what?

Or is this to do with ease of disposal of disks as they are retired?  i.e. an encrypted disk doesn't need to be wiped.


Not trying to troll here, I'm genuinely curious to know why people are doing this.


Cheers


----------



## lockdoc (Nov 5, 2012)

throAU said:

> Presumably these are server machines?  In a server room?
> 
> Could someone fill me in as to why Geli is considered to be worth it?  If someone can break into your server room they can steal your entire machine?  To me it just looks like a big performance and complexity overhead for... ?
> 
> ...



You brought up a good example yourself for server machines:
if someone steals your server physically, they still can't do anything with it, because it is encrypted and your secret company data is protected.


----------



## Crivens (Nov 5, 2012)

lockdoc said:

> You brought a good example yourself for server machines:
> If someone is stealing your server physically, they still can't  do anything with it, because it is encrypted and your secret company data is protected.



Indeed. Also keep in mind that when the server is at some site you rent, you cannot easily control access to the server room. Even with a server in your own basement, there are ways to steal things (parts, racks, whatever) when the value is high enough.

And being able to simply dump a bad disk, without taking the precaution to shred it and issuing paperwork about securely destroyed data, can be a bonus.


----------



## Sfynx (Nov 5, 2012)

m6tt said:

> For goodness sakes use mirrors. If you want raidz5 or whatever, make a raidz5 of mirrors. The reason is that performance of a mirror is better in most configurations, and they are extremely reliable. If you must do raidz, do it like the big solaris installs and make it up out of many double or triple mirrors. Space is cheap, reliability is not.



AFAIK you cannot nest vdevs; do you mean using gmirror to make mirrors and then making a raidz of those?
I always had the impression that it is best to give ZFS the raw drives whenever possible, so that it has full control over the redundancy and healing process.

Personally I always use pools with raidz2 vdevs for redundancy whenever possible, with some SSD-based L2ARC to speed up random read workloads.


----------



## throAU (Nov 6, 2012)

lockdoc said:

> You brought a good example yourself for server machines:
> If someone is stealing your server physically, they still can't  do anything with it, because it is encrypted and your secret company data is protected.



So to boot the thing, you need to enter a key or something every reboot?


----------



## m6tt (Nov 6, 2012)

I haven't ever done this (intentionally) on FreeBSD, but see this link. You can definitely stripe mirrors etc.; I'm not sure about raidz exactly.
Beware: Solaris-like typing detected:

http://www.zfsbuild.com/2010/06/03/howto-create-striped-mirror-vdev-pool/

A mirror is a type of vdev, so you can `zpool add` them together to make a pool, which would be a stripe - essentially a RAID10 config. I think I saw raidz and thought stripe for some reason (add more caffeine).

Here are some interesting thoughts:
http://constantin.glez.de/blog/2010/06/closer-look-zfs-vdevs-and-performance

One very, very important point: anyone using ZFS who can't tell the difference between `zpool add` and `zpool attach` off the top of their head needs to stop and go read the zpool(8) manpage. It *WILL* ruin your day/evening/job and potentially your data if you confuse these at the wrong time!
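The distinction, sketched with hypothetical pool and disk names:

```shell
# attach: adds a disk to an EXISTING vdev, turning a single disk or a mirror
# into a (deeper) mirror -- usually what you want for redundancy:
zpool attach tank ada0 ada2

# add: adds a NEW top-level vdev, striped with the existing ones --
# and top-level vdevs cannot be removed again afterwards:
zpool add tank mirror ada2 ada3

# -n dry-runs the command and prints the resulting layout without committing:
zpool add -n tank mirror ada2 ada3
```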


----------



## Sfynx (Nov 6, 2012)

m6tt said:

> One very, very important point is that everyone using ZFS that can't off the top of their head tell the difference between "zpool add" and "zpool attach" needs to stop and go read the zpool manpage. It *WILL* ruin your day/evening/job and potentially your data if you confuse these at the wrong time!



IMO the inability to remove a vdev is the main Achilles' heel of ZFS and not good for its image as a flexible, administrator-friendly file system. I almost ruined the configuration of a big pool this way. Finally getting that block-pointer-rewrite functionality into the code should be top priority, even if it means the pool format becomes incompatible with the Oracle implementation (ZFS doesn't interest them anymore anyway, AFAICS).


----------



## SirDice (Nov 6, 2012)

throAU said:

> So to boot the thing, you need to enter a key or something every reboot?



Yes.
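For non-boot providers this can be automated via rc.conf so the passphrase is asked for once during startup; the device names and keyfile path below are hypothetical, and the exact flags depend on how the providers were initialized:

```shell
# /boot/loader.conf -- load GELI early so boot-critical providers
# can prompt for their passphrase on the console:
geom_eli_load="YES"

# /etc/rc.conf -- attach remaining providers during rc startup:
geli_devices="ada1 ada2"
geli_ada1_flags="-k /root/ada1.key"   # example attach flags per device
```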


----------



## usdmatt (Nov 6, 2012)

@m6tt: This is the comment Sfynx was referring to, which doesn't make a lot of sense:



> For goodness sakes use mirrors. If you want raidz5 or whatever, make a raidz5 of mirrors. The reason is that performance of a mirror is better in most configurations, and they are extremely reliable. If you must do raidz, do it like the big solaris installs and make it up out of many double or triple mirrors. Space is cheap, reliability is not.



A zpool is made up of one or more vdevs. If you have multiple vdevs, ZFS stripes data across them automatically, effectively giving RAID10/50/60. You don't have nested vdevs inside a stripe vdev though, there is no 'stripe vdev'. You just have a collection of mirror vdevs, with ZFS writing data across all of them. It's not even a 'real' stripe in the traditional sense - there's no guarantee data will be striped. If you fill a single vdev pool, then add another vdev, most writes will end up on the new vdev and you'll never get the same performance you would if you'd started with both vdevs. It doesn't rebalance the data after adding the new vdev. This is why you can add vdevs without block pointer rewrite. 

Nesting vdevs is not possible. You cannot create a raidz out of mirrors.

Incidentally, I've never seen a big Solaris install mentioned anywhere that had raidz made up out of mirrors. They all either use one big RAID10-style setup or a matched set of RAIDZ{1,2,3} vdevs.


----------



## usdmatt (Nov 6, 2012)

Just going back to the original posts, there's always the following option:

Create the zpool as a mirror on the raw devices
Install the OS (if you are doing root-on-ZFS)
Create a ZVOL, run GELI on that, then format it as UFS for your *secure* data

I'm not certain it's a particularly good solution, but it means the whole system is on ZFS (benefiting from checksums, snapshots, scrub, easy backup with zfs send, etc.) and you have an encrypted file system just for the stuff that needs encryption (with encryption only happening once, rather than once per device as with ZPOOL on GELI).
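A sketch of those steps, assuming a hypothetical pool named tank and a 100 GB secure volume:

```shell
zfs create -V 100G tank/secure           # block device at /dev/zvol/tank/secure
geli init -s 4096 /dev/zvol/tank/secure
geli attach /dev/zvol/tank/secure        # -> /dev/zvol/tank/secure.eli
newfs -U -j /dev/zvol/tank/secure.eli    # UFS with soft updates journaling
mount /dev/zvol/tank/secure.eli /secure
```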


----------



## m6tt (Nov 7, 2012)

My initial post was wrong. I don't use raid5/raidz because I don't trust either in production, hence my unfamiliarity.
I did clarify in my second post that I was referring to a pool made of mirrors. I never said a pool is a vdev, nor did I ever say they could be nested. I would recommend a RAID10 configuration over RAID5 in any situation where storage is cheap and performance and reliability are desired. I assumed that raidz was a type of pool, not a vdev as it actually is.

Referring to your second post, putting geli on top means that you lose most of the benefits of ZFS, because it can't protect reads with checksums anymore, and it can't guarantee transactions are committed (from application layer to disk). ZFS should always be the top layer for best reliability. It's not what I would choose to do.


----------



## usdmatt (Nov 7, 2012)

It was just an idea to get encryption on ZFS without having to encrypt the entire pool, which is a bit over the top and multiplies the encryption processing by the number of disks. I came up with it because someone mentioned creating the mirror first and then running geli, which others said couldn't be done with ZFS. (Hopefully we'll get dataset-level encryption as a ZFS feature in the future.)

Reads from disk into memory will still be checksummed, so you should be protected from disk read errors just as you would be with normal datasets. If an app requests a sync write, then the ZFS transaction should be committed and the UFS write will not succeed until it's done. Normal writes will go into the current transaction just like any other ZFS write. You may have a slightly higher chance of corruption in memory than with a normal ZFS file system (due to it going through UFS <-> GELI <-> ZFS) - and I mean slightly - but ECC memory would fix that, and memory corruption can screw up a ZFS file system as well. (In fact, you could at least fsck the UFS file system; ZFS will just give file errors that can only be fixed by removing the files, and quite often it can't even tell you the filename.)

I said I'm not sure it's a great solution, but it does have benefits, and it's not losing as much of the ZFS protection as you make out. You basically have a UFS filesystem with all actual disk read/write activity protected by ZFS. Going back to the actual initial question, I would say it offers more protection than just running gmirror.


----------



## MorgothV8 (Nov 8, 2012)

Somebody asked whether this is a server machine in a server room, etc.
No, it is not.
It's just an ordinary PC in my own home; I'm working remotely for the USA from this computer.
Everything is geli-encrypted first, and on top of geli there is a zpool with a mirror of two entire disks: ada1.eli and ada2.eli.
Root is on a small unencrypted UFS+SUJ (1.5 GB) ada0p1.
Swap is geli-encrypted ada0p2.eli.
The rest of the first disk is ZFS on geli on the 3rd partition (a single partition without any raid or mirror, ashifted, about 640 GB in size: ada0p3.eli).
That's all.
Until now (2+ months) there has not been a single panic, reset, resilver, etc. The system is 10-CURRENT, updated about once per month - manually building world, etc.

Critical data is on the mirror pool, some less important stuff is on the single-partition ZFS, and on UFS there is only the initial system root, even without usr, var, tmp - just /bin, /etc, /boot, and I think that's all.


----------



## m6tt (Nov 8, 2012)

That sounds like a good setup...

There are ways to eliminate UFS entirely, but depending on which one you use, they can make upgrading ZFS versions dangerous. So a UFS boot is preferable if you need to use MBR, at least, and don't want to dd bootcode onto your disks periodically. GPT may be different, but neither of my systems likes it due to BIOS issues.

This may no longer be true, but I thought ashift was unnecessary on top of geli, because geli somehow "fixed" advanced-format disks unintentionally?


----------



## MorgothV8 (Nov 9, 2012)

Even if ashift is unnecessary, it doesn't hurt, and I'm losing maybe a few kilobytes of space...


----------



## chrcol (Nov 14, 2012)

ZFS is great for features, but for real-world production use on an I/O-heavy server I think it's bad news.

Plus, if you do ZFS on root, which I initially thought was the right way to go, you can't add devices to the pool later, and you're not able to use a separate device for the log.

There's a huge, ever-growing metadata cache that can be saturated simply by listing large dirs with 100k or so files inside. The auto-tuning of 'vfs.zfs.arc_meta_limit' goes to a very low value, but even when I tuned it to several gigs it was saturated within 24 hours.

On the upside, though: even with server crashes, even ZFS-related crashes (many ZFS crashes), I have yet to find any corrupt data. Never needed something like fsck either.

However, I remember trying FreeBSD 9's default UFS+SUJ setup and getting corrupted database files. A slow but stable filesystem, even a really slow one, is always superior to a filesystem with inconsistent data. So ZFS in my view is a superior choice to the default in FreeBSD 9.


----------



## Sfynx (Nov 14, 2012)

chrcol said:

> zfs great for features but for real world production use on a i/o heavy server I think its bad news.
> 
> Plus if you do zfs on root which i initially thought was the right way to go you cant add devices later to the pool as well as not able to use a seperate device for log.



If you're referring to the notice you get when trying to add a vdev or log device to the root pool, then you probably forgot to temporarily clear the bootfs property. I believe this is a Solarism, because the FreeBSD boot process does not really care how the root pool is laid out.
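The bootfs workaround sketched out, with hypothetical pool/dataset/device names:

```shell
# Temporarily clear bootfs so the root pool accepts the new vdev/log device:
zpool set bootfs= zroot
zpool add zroot log ada4
# Restore it afterwards so booting still works:
zpool set bootfs=zroot/ROOT/default zroot
```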



			
chrcol said:

> A huge ever growing metadata cache that can be saturated simply by listing large dirs with 100k or so of files inside, the auto tuning of 'vfs.zfs.arc_meta_limit' goes to a very low value, but even when I tuned it to several gigs it was saturated within 24 hours.



Isn't a cache saturated with the most-often-used data a good thing? What are the hit/miss statistics after a decent period of operation?
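Those counters can be inspected directly; the sysctl names below are the ones FreeBSD exposed around this era, and zfs-stats is the sysutils/zfs-stats port:

```shell
# Raw ARC hit/miss counters and metadata limit vs. usage:
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
sysctl vfs.zfs.arc_meta_used vfs.zfs.arc_meta_limit

# Or a formatted ARC summary via the zfs-stats port:
zfs-stats -A
```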


----------



## MorgothV8 (Dec 16, 2012)

Still using this setup - no single problem at all. Works like a charm.


----------

