# Copy-on-write file system



## Nilesh (Apr 9, 2014)

Hello,

For performance reasons, I am looking for a copy-on-write file system in which, when I copy file f1 to a new file f2, the FS only creates a new inode for f2. It should share all the data blocks that were created by f1.

After the copy, f1 and/or f2 can be modified. Only when a block is modified (written) should it be replicated, with the write performed on the new copy of the block.

The system I work on needs to copy a file of approximately 500 MB, so the copy is slow. After the copy, only a few blocks are modified (in either the source or the destination file), so I believe copy-on-write would be very useful for my use case.

I know ZFS does this, but I am running FreeBSD 6.x with less than 1 GB of memory.

Is there a lightweight copy-on-write FS available? If not, do you think modifying UFS is an option to achieve this?

Thanks.


----------



## usdmatt (Apr 9, 2014)

I don't like being the one who offers no help and just says upgrade, but seriously, 6.x is far too old to still be in use.

I'm not aware of any way of doing this without a COW file system, even on recent versions of FreeBSD. It isn't actually that easy to do on ZFS either. You'd have to put the file on its own dataset and clone the entire dataset each time you wanted a copy (or use dedupe, which is a non-starter with 1GB of RAM). This is one of the features Linux guys are always championing BTRFS for: it can create a COW copy of a file that references the same blocks, purely using the `cp` command.


----------



## Chris_H (Apr 9, 2014)

This might not be the answer you expected, but wouldn't reads and writes on a file this size be better served by putting the data in a database? That way you get pretty much exactly what you're looking for: cheap reads and writes.

--Chris


----------



## Nilesh (Apr 9, 2014)

usdmatt said:

> I don't like being the one who offers no help and just says upgrade, but seriously, 6.x is far too old to still be in use.
> 
> I'm not aware of any way of doing this without a COW file system, even on recent versions of FreeBSD. It isn't actually that easy to do on ZFS either. You'd have to put the file on its own dataset and clone the entire dataset each time you wanted a copy (or use dedupe, which is a non-starter with 1GB of RAM). This is one of the features Linux guys are always championing BTRFS for: it can create a COW copy of a file that references the same blocks, purely using the `cp` command.



Thanks for your reply. So you are saying that even if I upgrade to FreeBSD 10.x and add more RAM, ZFS would not give me COW copies of files? If so, my understanding of ZFS wasn't correct. In any case, upgrading is not an option for me at the moment.

Thanks for sharing the BTRFS links. Yes, `cp` with reflink is exactly what I am looking for. I am not a kernel developer, but I have started exploring the FFS/UFS code to see what can be done. At a high level, I am thinking of the following.

Instead of cp, create a clone command and add a new system call that creates a new inode referring to the existing data blocks. Extra metadata would be needed to record how many inodes refer to each data block.

For example, f1 has 10 data blocks, so initially each data block's refcount is 1. After cloning f1 to f2, all 10 data blocks have a refcount of 2. Now suppose f2 writes to data block #5. Since the refcount is 2, it knows it is not the only user of that block, so it allocates a new data block, copies the data into it and then performs the modification. It also updates its di_db or di_ib entry to point to the newly allocated block, and decrements the refcount of data block #5 by 1, because now only f1 is using it.
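To make the proposed scheme concrete, here is a toy in-memory sketch in C of the clone and copy-before-write steps. All names (`struct toy_inode`, `clone_inode`, `cow_write`, the tiny fixed-size `disk` array) are hypothetical; the inode has only ten direct pointers in the spirit of `di_db`, and there are no indirect blocks, no locking and no crash consistency, so this is not UFS code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: a fixed array stands in for the disk. Hypothetical names. */
#define NBLOCKS 64   /* blocks on the toy "disk" */
#define BSIZE   16   /* bytes per toy block */
#define NDIRECT 10   /* direct block pointers per inode, like di_db */

static char     disk[NBLOCKS][BSIZE];
static uint16_t refcnt[NBLOCKS];       /* 0 means the block is free */

struct toy_inode {
    int db[NDIRECT];                   /* block addresses, -1 = hole */
};

static int alloc_block(void)
{
    for (int b = 0; b < NBLOCKS; b++)
        if (refcnt[b] == 0) {
            refcnt[b] = 1;
            return b;
        }
    return -1;                         /* toy disk is full */
}

/* Clone: the new inode points at every data block of src; no data is copied. */
static void clone_inode(const struct toy_inode *src, struct toy_inode *dst)
{
    for (int i = 0; i < NDIRECT; i++) {
        dst->db[i] = src->db[i];
        if (src->db[i] >= 0)
            refcnt[src->db[i]]++;
    }
}

/* Write len bytes at offset off inside logical block lbn of ip.
 * A shared block (refcount > 1) is copied first; an unshared one is
 * overwritten in place, as a non-COW file system would do. */
static int cow_write(struct toy_inode *ip, int lbn, size_t off,
                     const char *data, size_t len)
{
    if (lbn < 0 || lbn >= NDIRECT || off > BSIZE || len > BSIZE - off)
        return -1;
    int b = ip->db[lbn];
    if (b < 0) {                       /* hole: allocate a zeroed block */
        if ((b = alloc_block()) < 0)
            return -1;
        memset(disk[b], 0, BSIZE);
    } else if (refcnt[b] > 1) {        /* shared: copy before writing */
        int nb = alloc_block();
        if (nb < 0)
            return -1;
        memcpy(disk[nb], disk[b], BSIZE);
        refcnt[b]--;                   /* the other inode(s) keep the original */
        b = nb;
    }
    ip->db[lbn] = b;                   /* point this inode at its own block */
    memcpy(disk[b] + off, data, len);
    return 0;
}
```

Once a block's refcount has dropped back to 1, writes go in place again, so only blocks that are actually shared pay for a copy.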

The metadata representing the refcounts would be a data structure (I don't know which one yet) indexed by data block address.
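One possible shape for that structure, offered only as an assumption: a flat array of 16-bit counters indexed by block number, similar in spirit to the per-cylinder-group free-block bitmap UFS already keeps, but storing a count instead of a single free/used bit. `struct reftab` and its functions are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical refcount table: one 16-bit counter per fs block,
 * indexed by block address. cnt[bno] == 0 means unreferenced. */
struct reftab {
    uint64_t  nblocks;
    uint16_t *cnt;
};

static int reftab_init(struct reftab *rt, uint64_t nblocks)
{
    rt->nblocks = nblocks;
    rt->cnt = calloc(nblocks, sizeof *rt->cnt);   /* all counts start at 0 */
    return rt->cnt ? 0 : -1;
}

/* Add a reference; refuse rather than wrap if the counter is saturated. */
static int reftab_ref(struct reftab *rt, uint64_t bno)
{
    if (bno >= rt->nblocks || rt->cnt[bno] == UINT16_MAX)
        return -1;
    rt->cnt[bno]++;
    return 0;
}

/* Drop a reference. Returns 1 if the block is now unreferenced (the caller
 * would hand it back to the allocator), 0 if still in use, -1 on misuse. */
static int reftab_unref(struct reftab *rt, uint64_t bno)
{
    if (bno >= rt->nblocks || rt->cnt[bno] == 0)
        return -1;
    return --rt->cnt[bno] == 0;
}

/* Bytes of counters needed to cover fs_bytes with bsize-byte blocks. */
static uint64_t reftab_bytes(uint64_t fs_bytes, uint64_t bsize)
{
    return (fs_bytes + bsize - 1) / bsize * sizeof(uint16_t);
}

static void reftab_free(struct reftab *rt)
{
    free(rt->cnt);
    rt->cnt = NULL;
}
```

A quick size check, assuming 16 KiB blocks: 500 MiB of data needs 32,000 counters (64,000 bytes), but covering a whole 1 TiB file system needs 128 MiB of counters, so in practice such a table would probably have to live on disk next to each cylinder group and be cached, rather than sit entirely in RAM.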


----------



## usdmatt (Apr 9, 2014)

ZFS is completely copy-on-write. COW just means that whenever an existing block is updated, a new copy is written somewhere else and, once that succeeds, the old block is freed (unless it's referenced by something else, such as a snapshot). Copying a file on a COW filesystem works no differently from copying a file on a non-COW filesystem: it creates a second copy of the data somewhere else on the disk. BTRFS has added the ability to create clones at the file level, but that is a feature they had to add. It was made possible, and was probably fairly straightforward, because BTRFS is COW, but it's not an inherent feature of COW filesystems.
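The update rule described above can be sketched in C: new data always goes to a freshly allocated block first, the pointer is switched only afterwards, and the old block is released unless a snapshot still references it. This is a simplified toy (a boolean `snap_ref` flag instead of real reference tracking, no on-disk write ordering or crash recovery); `cow_update`, `balloc` and the arrays are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define NBLK 16   /* toy disk size in blocks */
#define BSZ  8    /* bytes per toy block */

static char disk[NBLK][BSZ];
static bool in_use[NBLK];
static bool snap_ref[NBLK];   /* simplification: "a snapshot still needs this block" */

static int balloc(void)
{
    for (int b = 0; b < NBLK; b++)
        if (!in_use[b]) {
            in_use[b] = true;
            return b;
        }
    return -1;
}

/* Copy-on-write update of the block that *ptr points to: never overwrite in place. */
static int cow_update(int *ptr, const char *data)
{
    int nb = balloc();
    if (nb < 0)
        return -1;
    memcpy(disk[nb], data, BSZ);       /* 1. write the new version elsewhere */
    int old = *ptr;
    *ptr = nb;                         /* 2. only then switch the pointer */
    if (old >= 0 && !snap_ref[old])    /* 3. free the old block unless a snapshot holds it */
        in_use[old] = false;
    return 0;
}
```

Nothing in this routine makes two files share blocks: a plain `cp` onto a COW file system still allocates and writes a complete second set of blocks.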

It may be a small enough job to make a clone command that points two inodes at the same data, similar to a hard link. I can see the difficulty being making sure the right thing always happens when you try to make changes to those files, and all the testing to make sure there aren't edge cases or dark corners of the UFS code which will happily screw up your copies because they aren't aware of your new features. Seems a lot of effort to go through to develop a feature that will probably only be usable on one old unsupported version of FreeBSD. You're welcome to try it if you're brave enough and have the time to waste though.


----------



## Nilesh (Apr 9, 2014)

usdmatt said:

> ZFS is completely copy-on-write. COW just means that whenever an existing block is updated, a new copy is written somewhere else and, once that succeeds, the old block is freed (unless it's referenced by something else, such as a snapshot). Copying a file on a COW filesystem works no differently from copying a file on a non-COW filesystem: it creates a second copy of the data somewhere else on the disk. BTRFS has added the ability to create clones at the file level, but that is a feature they had to add. It was made possible, and was probably fairly straightforward, because BTRFS is COW, but it's not an inherent feature of COW filesystems.



Thanks for your explanation. So would implementing reflink-style COW copies on FFS/UFS be reasonably simple? I am planning to give it a try.

Thanks.


----------



## ralphbsz (Apr 10, 2014)

Nilesh said:

> For performance reasons, I am looking for a copy-on-write file system in which, when I copy file f1 to a new file f2, the FS only creates a new inode for f2.



OK, stop right there.

You say you copy the file.  How do you do that?  You can for example do that with `cp a b`, or `dd if=a of=b`.  Both programs do fundamentally the same thing.  They open file a for reading, and create empty file b for writing.  Then they read from a (a byte, a sector, a VM page, or a larger block at a time), and write to file b.  Iterate, until the end of file is reached.
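For illustration, the loop at the heart of `cp` or `dd` looks roughly like this in C. It is a simplified sketch, not the actual FreeBSD `cp` source, and `copy_file` is a hypothetical name; the only calls that reach the file system are `open`, `read`, `write` and `close`.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Copy src to dst chunk by chunk, the way cp/dd fundamentally work. */
static int copy_file(const char *src, const char *dst)
{
    char buf[65536];
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }
    int rc = 0;
    for (;;) {
        ssize_t n = read(in, buf, sizeof buf);
        if (n == 0)
            break;                          /* end of file reached */
        if (n < 0) {
            if (errno == EINTR)
                continue;                   /* interrupted: retry the read */
            rc = -1;
            break;
        }
        for (ssize_t off = 0; off < n; ) {  /* write() may be short: loop */
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                rc = -1;
                break;
            }
            off += w;
        }
        if (rc < 0)
            break;
    }
    if (close(out) < 0)
        rc = -1;
    close(in);
    return rc;
}
```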

Now put yourself into the mind of the file system.  It noticed that a process opened file a, read it, and then closed it.  Fabulous.  It also noticed that a process created file b, wrote to it, and then closed it.  Also fabulous.  Unfortunately, the file system has no idea that these two sets of events are correlated at all.  It doesn't even have any idea that the bytes that were written to file b happen to be the same as the bytes that were read from a.

For a perverse example, look at the difference between the `cp` program and `rot13` (a tiny utility program that "encrypts" text by rotating all alphabetic characters by 13 positions, turning A into N, B into O, M into Z, N into A, and so on).  From the file system point of view, the series of system calls is exactly the same from both programs!  So, how will the file system even figure out that file b is a copy of file a?
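A minimal C sketch of that comparison: one read/write loop (`filter_fd`, a hypothetical helper) serves both programs, and the only difference is whether `rot13_buf` transforms the buffer between `read()` and `write()`. It assumes ASCII text and, for brevity, does not retry on `EINTR`.

```c
#include <stddef.h>
#include <unistd.h>

typedef void (*xform_fn)(char *, size_t);

/* rot13 in place; assumes ASCII, where 'a'..'z' and 'A'..'Z' are contiguous. */
static void rot13_buf(char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        char c = buf[i];
        if (c >= 'a' && c <= 'z')
            buf[i] = (char)('a' + (c - 'a' + 13) % 26);
        else if (c >= 'A' && c <= 'Z')
            buf[i] = (char)('A' + (c - 'A' + 13) % 26);
    }
}

/* The same read/write sequence for both programs; xform == NULL acts like cp. */
static int filter_fd(int in, int out, xform_fn xform)
{
    char buf[4096];
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (xform)
            xform(buf, (size_t)n);   /* invisible to the file system */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0)
                return -1;
            off += w;
        }
    }
    return n < 0 ? -1 : 0;
}
```

Passing `NULL` as the transform gives `cp`-like behavior and passing `rot13_buf` gives `rot13`; the system-call trace is the same either way.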

I know of two solutions.  One is for the file system to use de-duplication.  In that technology, the file system checks all the data (at the file, block, or arbitrary-length-sequence level), and looks for duplicates.  It then notices that files a and b happen to be the same (or perhaps, depending on implementation, that they happen to contain all the same blocks), and removes the duplicates.  Obviously, when writing to a de-duplicated file, we need some CoW technology to create two modified copies.  ZFS in theory has deduplication, but I don't know whether that feature is included in the FreeBSD build, nor whether it works well or efficiently.
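A toy sketch of block-level deduplication in C: each incoming block is hashed (64-bit FNV-1a here, chosen only for brevity; ZFS dedup relies on a strong checksum such as SHA-256), matched against blocks already stored, verified byte for byte, and shared by bumping a reference count. The names and the linear lookup are hypothetical simplifications; a real dedup table is an indexed structure, and keeping it in memory is a large part of why dedup is expensive on small-RAM machines.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BS   8    /* toy block size in bytes */
#define MAXB 64   /* toy number of physical blocks */

static char     store[MAXB][BS];
static uint64_t hashes[MAXB];
static uint32_t refs[MAXB];
static int      nstored;

/* 64-bit FNV-1a hash over n bytes. */
static uint64_t fnv1a(const char *p, size_t n)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Store one BS-byte block and return its physical index, reusing an
 * identical block if one exists (hash match confirmed with memcmp). */
static int dedup_write(const char *data)
{
    uint64_t h = fnv1a(data, BS);
    for (int b = 0; b < nstored; b++)
        if (hashes[b] == h && memcmp(store[b], data, BS) == 0) {
            refs[b]++;                 /* duplicate: share the existing block */
            return b;
        }
    if (nstored == MAXB)
        return -1;                     /* toy store is full */
    memcpy(store[nstored], data, BS);
    hashes[nstored] = h;
    refs[nstored] = 1;
    return nstored++;
}
```

This sketch deduplicates as each block is written; the other approach, scanning data already on disk, would run the same matching step in a background pass.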

The second option is to not actually copy the file in the first place, but to tell the file system explicitly that the second file shall be a snapshot or clone of the first file (those terms are often used to describe this type of operation).  Unfortunately, there is no standard or common (for example POSIX) system call for this operation, so this will be a file-system specific hack.  And to my knowledge, ZFS doesn't have a per-file clone feature.  Certainly, adding that (to ZFS, UFS, or any other file system you care about) is at least theoretically possible.

Here is a suggestion: ZFS has a snapshot feature, which allows cloning a whole directory tree or file system.  And at least some ZFS versions support writable snapshots (those are called clones).  Have you checked whether this would fulfill your requirements?

Taking an existing file system (with very complex source code) and adding per-file clones would be a major undertaking.  And unless one is an experienced file system developer, this should only be attempted on a file system that already has the infrastructure for CoW built in (which has to be mostly present the moment a file system implements snapshots).


----------



## SirDice (Apr 10, 2014)

Nilesh said:

> I know ZFS does this, but I am running FreeBSD 6.x with less than 1 GB of memory.


Stop using an outdated version, please. The last 6.x release went end-of-life in November 2010, almost 4 years ago. It's no longer supported and is now a huge security risk, because security bugs do not get fixed. And there have been plenty since November 2010.

You may not care about _your_ security but putting ancient stuff like this on the internet is a threat to _my_ security.

Topics about unsupported FreeBSD versions


----------



## jalla (Apr 10, 2014)

I think UFS snapshots were introduced in FreeBSD 5, so they should already be available on 6.x.
They are more limited and cumbersome to work with than snapshots in ZFS, but in principle the functionality is there.


----------

