# Future of filesystems



## Mage (Jun 11, 2012)

Chris Mason left Oracle. Today he starts working at a funny company. I checked their website. Funny is the nicest word I could use.

[FLAME]

I would say that this is the end of BTRFS; however, everything must have a beginning before having an end. After reading random emails from the BTRFS development lists, anyone who has ever seen an IT project should know that BTRFS will never work.

[END OF FLAME]

ZFS is very nice; however, we don't have the "block pointer rewrite" thing. If you have ever filled a ZFS pool more than 80% or added new devices to a large pool, you know that the lack of that feature hurts. Also, Oracle is very good at destroying software.

Hammer is Dragonfly-only.

I wonder what the future will bring to us. I wish we had BPR.


----------



## UNIXgod (Jun 11, 2012)

Mage said:

> Hammer is Dragonfly-only.
> 
> I wonder what the future will bring to us. I wish we had BPR.



What's BPR?


----------



## cra1g321 (Jun 11, 2012)

UNIXgod said:

> What's BPR?



I'm guessing it's the block pointer rewrite he mentioned.


----------



## UNIXgod (Jun 11, 2012)

cra1g321 said:

> I'm guessing it's the block pointer rewrite, he mentioned.



Funny. I need to read more carefully. I actually did a Google search on the acronym and the first result was "Business process reengineering".


----------



## vermaden (Jun 11, 2012)

Mage said:

> ZFS is very nice, however we don't have the "block pointer rewrite" thing. If you ever filled up a ZFS pool more than 80% or added new devices to a large pool you should know that the lack of that feature hurts.



The ZFS _Feature Flags_ have been imported [1] into HEAD, making ZFS v28 into ZFS v5000 + features.

This means that ANY feature (including the "block pointer rewrite" one) can now be added to ZFS as a separate feature.

[1] http://freshbsd.org/commit/freebsd/r236884


----------



## xibo (Jun 11, 2012)

Can't we put ZFS on gvinum RAIDs and extend the volumes at the GEOM level instead of the ZFS level?

But yes, I would like to get BPR, too.


----------



## tingo (Jun 12, 2012)

Any good engineer / sysadmin / whatever makes use of the tools he or she has, with whatever features those tools have or lack. Today's kids seem to want to have everything fixed for them, not fixing anything themselves. I fail to see how that can be a good life.


----------



## phoenix (Jun 12, 2012)

Mage said:

> Chris Mason left Oracle. Today he starts working at a funny company. I checked their website. Funny is the nicest world I could tell.
> 
> I would say that this is the end of BTRFS, however everything must have a beginning before having an end. After reading random emails from BTRFS development lists anyone who ever seen an IT project should know that BTRFS will never work.



About half of the Btrfs development comes from RedHat, so not sure why you think "losing" one Oracle employee will begin the death spiral of a fs that's part of the mainline Linux kernel.



> ZFS is very nice, however we don't have the "block pointer rewrite" thing. If you ever filled up a ZFS pool more than 80% or added new devices to a large pool you should know that the lack of that feature hurts. Also Oracle is very good at destroying software.



So long as you keep adding vdevs to a pool, or replacing the drives in a vdev with larger ones, there's no problem.

Not really sure what you're so afraid of. UFS isn't going anywhere, is getting new/improved features over time, and is great for small-ish filesystems. ZFS isn't going anywhere, is getting new/improved features over time, and is great for medium-to-huge storage setups. What more do we need?


----------



## graudeejs (Jun 12, 2012)

phoenix said:

> Not really sure what you're so afraid of. UFS isn't going anywhere, is getting new/improved features over time, and is great for small-ish filesystems. ZFS isn't going anywhere, is getting new/improved features over time, and is great for medium-to-huge storage setups. What more do we need?



A portable filesystem that could be used between the BSDs, GNU/Linux, and other Unixes. FAT doesn't count, because it doesn't preserve Unix file attributes.


----------



## Mage (Jun 12, 2012)

phoenix said:

> About half of the Btrfs development comes from RedHat, so not sure why you think "losing" one Oracle employee will begin the death spiral of a fs that's part of the mainline Linux kernel.



It is not the beginning of a death spiral. It is a sign of something we have known for at least two years. A main developer said ages ago that BTRFS is broken by design, and design is something that doesn't change. BTRFS was never alive, so it can't die. Do you read the btrfs-devel list? Every thread is like "I cannot mount, please help", "Please recover my data", "One week ago I deleted some lines of the source but I put them back yesterday and wrote a 15-line comment for myself to avoid deleting those lines again."



			
phoenix said:

> So long as you keep adding vdevs to a pool, or replacing drives in vdev with larger ones, then there's no problem.



If you fill your pool above 80%, it gets 3-10 times slower from fragmentation. Please don't tell me it doesn't need defrag, because there are plenty of examples proving it does. Also, please don't tell me I shouldn't fill it above 80%, because the top is at 100% and sooner or later every drive in the world reaches that percentage.

If you change anything like checksum, compression, copies, or dedup, the existing data is not rewritten; you would need BPR for that, if you had it. The only fix is send | receive two times (to another pool and back). That means downtime.

As far as I remember, I read from one of the developers who was working on BPR that it will never be finished.

I am not bashing ZFS. It is my favourite filesystem. I would just like to see a bright future.


----------



## Mage (Jun 12, 2012)

tingo said:

> Today's kids seem to want to have everything fixed for them, not fixing anything themselves. I fail to see how that can be a good life.



I wonder how many people are on this Earth who could properly implement block pointer rewrite.


----------



## Mage (Jun 12, 2012)

graudeejs said:

> Portable filesystem, that could be used between BSDs, GNU/Linux, and other Unixes. FAT doesn't count, because it doesn't preserve Unix file attributes.



I use a shared pool between FreeBSD and Ubuntu. Ubuntu runs in a VM under Windows. The only annoying thing is that I have to force the import from time to time after I change OS.

However, I lost about 40GB of data when I imported a Gentoo MBR-partitioned SSD drive. The pool was created with ZFS on Linux. It had several vdevs, cache, and log on the same MBR drive. The first time I imported it under FreeBSD, my data said goodbye.

This never happened again; however, my current shared pool uses raw disks and GPT.


----------



## Crivens (Jun 13, 2012)

Just an idea, so feel free to say if it is stupid.
To defrag only some files on Windows, I used a trick with dd: searching a build directory for all libraries and then making a copy of each file under a new name, deleting the old one, and renaming the new one. This should also redistribute a file over a new set of vdevs, would it not?
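The trick can be sketched in a few lines of Python; the path and the `.defrag` suffix are made up for illustration. The idea is just that rewriting a file forces the filesystem to allocate fresh blocks, which on a pool that recently gained vdevs would land on the new devices:

```python
import os
import shutil
import tempfile

def rewrite_file(path):
    """Copy a file out and rename it back, forcing fresh block allocation."""
    tmp = path + ".defrag"        # hypothetical temporary suffix
    shutil.copy2(path, tmp)       # the copy is written to newly allocated blocks
    os.replace(tmp, path)         # atomic rename swaps in the rewritten copy

# Demonstrate on a throwaway file standing in for a library in a build tree.
d = tempfile.mkdtemp()
lib = os.path.join(d, "libfoo.so")
with open(lib, "wb") as f:
    f.write(b"library contents")

rewrite_file(lib)

with open(lib, "rb") as f:
    print(f.read())               # content unchanged; only the blocks moved
```

Note that on ZFS snapshots or dedup can keep the old blocks referenced, so the space and layout may not actually improve.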


----------



## kpedersen (Jun 13, 2012)

graudeejs said:

> Portable filesystem, that could be used between BSDs, GNU/Linux, and other Unixes. FAT doesn't count, because it doesn't preserve Unix file attributes.



We do: UFS. Now we just need the rest to support it.

How come the Linux'ii do not support UFS? Surely it cannot be too hard to implement compared to NTFS, and the license can't be the issue.

The Linux folk spend time porting Dreamcast filesystems to the kernel, why not a useful one?


----------



## wblock@ (Jun 13, 2012)

http://ghantoos.org/2009/04/04/mounting-ufs-in-readwrite-under-linux/

UFS write support might be the default now, I don't know.

UFS read/write support for Windows is more interesting to me:

http://www.crossmeta.org/crossmeta.html

I have not tried either of these.


----------



## graudeejs (Jun 13, 2012)

kpedersen said:

> We do: UFS. Now we just need the rest to support it.
> 
> How come the Linux'ii do not support UFS? Surely it cannot be too hard to implement compared to NTFS, and the licence can't be the issue.
> 
> The Linux folk spend time porting Dreamcast filesystems to the kernel, why not a useful one?



No, we don't. Unless it's UFS1, and even then I doubt (feel free to correct me if I'm wrong) that it will work under OpenBSD.


----------



## Mage (Jun 14, 2012)

Crivens said:

> Just an idea, so feel free to say if it is stupid.
> To defrag only some files on Windows, I used a trick with dd: searching a build directory for all libraries and then making a copy of each file under a new name, deleting the old one, and renaming the new one. This should also redistribute a file over a new set of vdevs, would it not?



The copy, move, rename, etc. methods don't really work as defrag on ZFS if you have, for example, dedup turned on or if the pool has snapshots.

As far as I know, zfs send | zfs receive does most of the things BPR should do. I am not sure it is a 100% perfect solution. I mean beyond the fact that it is offline and needs double the space.


----------



## Crivens (Jun 14, 2012)

Mage said:

> The copy, move, rename, etc. methods don't really work as defrag on ZFS if you have, for example, dedup turned on or if the pool has snapshots.


Snapshots are a problem, sure, and dedup would also screw such things up.
Basically, this balance step should be part of a scrub operation. Maybe I should spend at least some minutes browsing the source for this.


----------



## Mage (Jun 21, 2012)

> Basically, this balance step should be part of a scrub operation. Maybe I should spend at least some minutes browsing the source for this.



It should be there, but it isn't. I read an email written by one of the original ZFS developers in which he said he had spent some time on BPR. However, it is too hard to implement and even harder to maintain when you add new features.

Also, Josef Bacik, another main developer of BTRFS, left RedHat some days ago. It seems that ZFS will be the only available FS with checksumming, dedup, and pool management. I hope that will bring improvements soon.


----------



## olav (Jul 23, 2012)

Block pointer rewrite is absolutely needed for ZFS. I really like ZFS, but it doesn't scale big very well. For example, if you have a 40x4TB-disk raidz3 setup which is almost 100% full and you lose one disk, you have to scan almost all 160TB of data to rebuild a new disk. It would take months...
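As a rough sanity check on the "months" figure, here is the arithmetic under an assumed seek-bound effective scan rate; the 20 MB/s number is purely an assumption (resilver throughput varies wildly with fragmentation and load), not a measurement:

```python
TB = 10**12

surviving_data = 39 * 4 * TB      # bytes to scan from the 39 remaining disks
scan_rate = 20 * 10**6            # assumed seek-bound effective rate: 20 MB/s

seconds = surviving_data / scan_rate
print(round(seconds / 86400))     # about 90 days, i.e. roughly three months
```

At sequential speeds the number would be far smaller, but resilvers on a full, fragmented pool tend toward random I/O, which is what makes the pessimistic figure plausible.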

In my opinion, the metadata should be stored in a separate database, and if one disk fails the fs should easily figure out what data is missing and quickly build a new disk to replace the faulted one.

Since I really love ZFS with its extraordinary features, I've started researching developing my own distributed filesystem. I've been thinking about the Cassandra database for storing all the metadata. I just wish I had more time so I could create something real and not just prototype a proof of concept.


----------



## vermaden (Jul 23, 2012)

@olav

Now that the ZFS 'Feature Flags' are merged, it's probably just a matter of time before such a 'feature' is added.

ZFS is being worked on all the time; for example, here are benchmarks of the LZ4 algorithm compared to others, mostly LZJB:
http://thread.gmane.org/gmane.os.illumos.devel/8701/focus=8731


----------



## Crivens (Jul 23, 2012)

@olav: Somewhere there is a paper/article concerning the best disk size for RAID systems. I failed to find it again after having read it prior to setting up my home server. It seems that one sweet spot for SATA is about 500GB per disk, so a resilver does not catch you with 'your pants down' for longer than necessary. Resilvering 4TB disks, as you have, would likely thrash the remaining disks for about 15 to 20 hours, during which the remaining disks (probably from the same batch) are running flat out and are thus likely to fail in that time as well.

Also, it would be SOP to divide the disks into several vdevs, and only the rest of that vdev needs to be read while the resilver is running. This leaves the other vdevs idle, and hopefully your really important data, which is on "copies=3", is secure on the other vdevs until the resilver is done.


----------



## olav (Jul 23, 2012)

vermaden said:

> @olav
> 
> Now that the ZFS 'Feature Flags' are merged, it's probably just a matter of time before such a 'feature' is added.
> 
> ...



That's cool! Though adding a compression algorithm is a walk in the park compared to adding block pointer rewrite.
I hope someone will add LZMA support soon! Yeah, I know the compression speed is überslow, but it compresses data amazingly well. Decompression speed is usable, though.


----------



## phoenix (Jul 27, 2012)

olav said:

> Block pointer rewrite is absolutely needed for ZFS. I really like ZFS, but it doesn't scale big very well. For example, if you have a 40x4TB-disk raidz3 setup which is almost 100% full and you lose one disk, you have to scan almost all 160TB of data to rebuild a new disk. It would take months...



Hence why every single ZFS howto, best-practices guide, and tuning guide says to never use more than 10 disks in a single vdev. *Especially* when using raidz.

The random write IOps for a raidz vdev is limited to the IOps of the slowest drive. In order to increase the IOps of the pool, you add more vdevs.


----------



## olav (Jul 27, 2012)

And that's my point: it's actually completely unnecessary if the metadata were stored somewhere else and carried more information. That way it would be possible to resilver a new drive without scanning through all the data you have.

Because the metadata database would know exactly what data was stored on the defective drive.
As it is now, ZFS simply does not "scale out".


----------



## usdmatt (Jul 27, 2012)

I don't think you're understanding exactly how raidz (or parity based raid) works.

If you lose a disk, you have literally lost that data. Having the 'metadata' makes no sense. The only way to get the data back is by reading the rest of each block from the remaining disks, along with the parity, and recalculating it. This is why you would end up reading 160TB if you had a full array of 40x4TB disks, rather than just 4TB. The data on that 4TB disk no longer exists; it has to be recalculated.
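The single-parity case shows the principle in a few lines of Python (raidz2/raidz3 use more elaborate parity math, but the logic is the same): with parity defined as the XOR of the data blocks, a lost block can only be recovered by reading every surviving block plus the parity.

```python
def xor_blocks(blocks):
    """XOR equal-sized byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

disks = [b"AAAA", b"BBBB", b"CCCC"]       # data blocks, one per disk
parity = xor_blocks(disks)                # stored on a fourth disk

lost = disks.pop(1)                       # disk 1 fails; its data is gone
rebuilt = xor_blocks(disks + [parity])    # must read ALL survivors + parity
print(rebuilt == lost)                    # True
```

No side table of "what was on the dead disk" can shortcut this: the lost bytes exist only as the XOR relationship among the survivors, so reconstruction inherently touches every disk in the stripe.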

As far as I'm aware this is the same for RAID5/6, and it's one of the main reasons why it's *strongly* discouraged to have more than 8-10 disks in one vdev. It's probably the biggest mistake anyone can make when choosing a pool configuration.

As for not 'scaling out', I don't see the problem. A pool with 1024 disks in 128 8-disk vdevs should rebuild no slower than a pool with one 8-disk vdev. If you lose any disk, it only has to read from the other 7 disks in the same vdev. In fact, ZFS only needs to read the actual data; many RAID arrays will rebuild the entire disk, including the empty space.


----------



## Slurp (Jul 27, 2012)

olav said:

> That's cool! Though adding a compression algorithm is a walk in the park compared to adding block pointer rewrite
> I hope someone will add LZMA support soon! Yeah, I know the compression speed is überslow, but it compresses data amazingly well. Decompression speed is usable, though.


A large part of LZMA's strength is its support for big dictionaries. Try it with a 128 KB dictionary and no solid mode and the strength will drop by a lot. It would still be stronger than deflate, but not by much, and the speed would be hugely lower.
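The dictionary effect is easy to demonstrate with Python's `lzma` module; this is a sketch of the mechanism, not a ZFS benchmark, and the data sizes are arbitrary choices:

```python
import hashlib
import lzma

# 256 KB of incompressible-looking data, then the same chunk repeated:
# the only redundancy sits 256 KB apart.
chunk = b"".join(hashlib.sha256(i.to_bytes(2, "big")).digest()
                 for i in range(8192))
data = chunk * 2

# LZMA2 limited to a 128 KB dictionary cannot reach back to the repeat...
small = lzma.compress(data, format=lzma.FORMAT_XZ,
                      filters=[{"id": lzma.FILTER_LZMA2,
                                "dict_size": 128 * 1024}])

# ...while the default preset's multi-megabyte dictionary can.
big = lzma.compress(data)

print(len(small) > 1.5 * len(big))   # True: the small dictionary pays dearly
```

A filesystem compressing in small independent blocks is effectively stuck in the small-dictionary regime, which is Slurp's point.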


----------



## olav (Jul 28, 2012)

usdmatt, recalculating is exactly what I mean should be avoided. Why can't this be stored along with the rest of the metadata?

If you have a 1024 4TB disk setup with 128 8 disk vdevs, you lose 512TB of data to parity.

EDIT:
Oh, I see it now; I've been doing the math wrong. Only now do I understand how XORing works. Hmmm, I still believe it's possible to improve this, but I have to think more about it.


----------



## Mage (Jul 28, 2012)

I always do mirrored setups, not raidz. However, once they get fragmented, ZFS becomes hellishly slow. Also, adding more vdevs doesn't rebalance your data. Only send | receive helps, which is usually not possible because of the downtime and the double space requirement.

Hammer2 seems to be superior (based on its promises). Don't get me wrong, I like FreeBSD, I already have it on all my servers and desktops, and it is really the best OS I have ever touched, but I think that Hammer2 would be better for us than ZFS. Maybe if the evil Oracle hadn't bought Sun, ZFS would have had some development beyond fixes; however, one of the original ZFS developers said that BPR is a pain to implement. Face the fact: we will never have it. As far as I know, Hammer already has something like BPR (rebalancing data). I also read that Hammer is hard to port to FreeBSD. I wish we somehow had Hammer2 next year.


----------



## UNIXgod (Jul 28, 2012)

Mage said:

> Hammer2 seems to be superior (based on its promises). Don't get me wrong, I like FreeBSD, I already have it on all my servers and desktops, and it is really the best OS I have ever touched, but I think that Hammer2 would be better for us than ZFS. Maybe if the evil Oracle hadn't bought Sun, ZFS would have had some development beyond fixes; however, one of the original ZFS developers said that BPR is a pain to implement. Face the fact: we will never have it. As far as I know, Hammer already has something like BPR (rebalancing data). I also read that Hammer is hard to port to FreeBSD. I wish we somehow had Hammer2 next year.



Yes. It was suggested as a Google Summer of Code project and Matt Dillon stated:



> Personally I think it might be too much for a GSOC project.



http://wiki.freebsd.org/PortingHAMMERFS

Also, the FreeBSD developers wiki claims it to be simple enough for a Summer of Code project:
http://wiki.freebsd.org/IdeasPage#Port_DragonflyBSD.27s_HAMMER_file_system_to_FreeBSD



> Port DragonflyBSD's HAMMER file system to FreeBSD
> 
> Suggested Summer of Code 2012 project idea
> Contact Info
> ...



I'm with you, Mage. I feel it would be nice to have more choices when it comes to BSD filesystems. Having UFS, ZFS, and HAMMER in FreeBSD would round out this server OS nicely.


----------



## phoenix (Jul 28, 2012)

Well, hop to it, then. Get coding.


----------



## Mage (Oct 15, 2012)

I read that we won't have zpool v31 because Oracle didn't release the Solaris 11 source code.

I am wondering whether this is the source code we need or not: http://www.oracle.com/technetwork/opensource/systems-solaris-1562786.html


----------



## phoenix (Oct 15, 2012)

There is no CDDL release of ZFS sources beyond v28.  Any "public" sources are not actually public, and anyone reading them could be considered "tainted" and should not touch the open-source ZFS code.

The open-source ZFS devs (Illumos, Delphix, FreeBSD, NetBSD, Spectra Logic, various others) have moved ZFS beyond Oracle's versioning using feature flags, and have added several features that Oracle ZFS doesn't have.


----------



## UNIXgod (Oct 16, 2012)

phoenix said:

> (Illumos, Delphix, FreeBSD, NetBSD, Spectra Logic, various others) have moved ZFS beyond Oracle's versioning using feature flags, and have added several features that Oracle ZFS doesn't have.



Do you have a link to the new open source features?


----------



## throAU (Oct 16, 2012)

olav said:

> Block pointer rewrite is absolutely needed for ZFS. I really like ZFS, but it really doesn't scale big very well. For example if you have a 40x4TB disk raidz3 setup which is almost 100% full and lose one disk you have to scan almost all the 160TB with data to rebuild a new disk. I would take months...



Sounds like whoever created that pool was "doing it wrong"...


ZFS scales just fine, if you build your pools properly. And yes, I'd like to see Hammer on FreeBSD as an additional choice.


----------



## Crivens (Oct 16, 2012)

throAU said:

> And yes I'd like to see hammer on FreeBSD as an additional choice.


+1 from me.


----------



## kpa (Oct 16, 2012)

throAU said:

> Sounds like whoever created that pool was "doing it wrong"...



http://forums.freebsd.org/showpost.php?p=27624&postcount=3

A 24-disk raidz2 vdev was already a big mistake; a 40-disk vdev is total lunacy...


----------



## phoenix (Oct 16, 2012)

UNIXgod said:

> Do you have a link to the new open source features?



There's discussion of the features here:  http://zfsday.com

Here's a patch for 9-STABLE that enables the feature flags:  http://lists.freebsd.org/pipermail/freebsd-fs/2012-October/015290.html

Here's the commit to 10-CURRENT:  http://lists.freebsd.org/pipermail/svn-src-head/2012-June/037825.html

There's probably more available on the illumos website.


----------



## Mage (Oct 18, 2012)

I lack the knowledge to port Hammer2; however, I would donate some money if I knew they were porting it. (I am not talking about hundreds of dollars, but about doing my part as a member of the community.)


----------

