# Atomically swapping two directories



## Nasrudin (Apr 15, 2020)

Linux has this `renameat2()` system call which ostensibly performs an atomic swap of two directories. Is there anything equivalent in FreeBSD? If not, why not?

I do know symlinks can be used to do this. For various reasons, I cannot use that strategy.


----------



## ralphbsz (Apr 16, 2020)

There is not, as far as I know. I don't think every Linux file system has it either; if I remember right, it was being faked (either in user space or in the VFS) for some file systems in Linux, which makes all of the atomicity guarantees into a joke. It's actually hard to implement. The regular rename() system call is already difficult, because it needs to be atomic, and side-effect free in the case of failure; renameat2() is even harder, and it is very rarely used.

Can you explain what you are trying to accomplish? Maybe there is a better solution? Really, the only atomic multi-object operation in a file system is rename(), but often one can work around all this by using atomic create with lock files instead.


----------



## unitrunker (Apr 16, 2020)

This opens up horrific consequences for a chroot jail.


----------



## PMc (Apr 16, 2020)

I have a usecase where I need to rename a directory and the destination of a symlink pointing to it, preferably at the same time. 
I didn't find a solution for that, because `mv -h` works only with files, not with directories (otherwise the scheme would be: create another symlink targeting the directory, change the first symlink to point to the other symlink, then mv the directory to replace that other symlink).
This syscall seems to be able to do exactly that.


----------



## Nasrudin (Apr 21, 2020)

ralphbsz said:


> There is not, as far as I know. I don't think every Linux file system has it either; if I remember right, it was being faked (either in user space or in the VFS) for some file systems in Linux, which makes all of the atomicity guarantees into a joke. It's actually hard to implement. The regular rename() system call is already difficult, because it needs to be atomic, and side-effect free in the case of failure; renameat2() is even harder, and it is very rarely used.



Isn't rename just changing a directory entry to be a different name? It seems to me to be trivial except for possibly the failure handling. Of course, my filesystem knowledge is very ancient and I've no idea what you have to do to play nice with filesystems like ZFS...so I'm merely curious here and I'm looking for just a bit more detail.




ralphbsz said:


> Can you explain what you are trying to accomplish? Maybe there is a better solution? Really, the only atomic multi-object operation in a file system is rename(), but often one can work around all this by using atomic create with lock files instead.



Not on a live webserver. If a high traffic webserver is serving multiple sites, and you want to alter just one of these site directories, the (apparently naive) way to do that would be:


```
mv sitedir tmp
mv newsitedir sitedir
```

It turns out that in the window of time between these two moves, the site appears broken. If your traffic is high enough, this is a lot of people who think the site is down or broken. So I was looking for a better way. I do know you can get fancier and try to block the site while you are updating it, or use a load balancer and do rolling one-at-a-time upgrades, however I'm pretty sure there are other use cases which would 



unitrunker said:


> This opens up horrific consequences for a chroot jail.



This mysterious comment has me curious, ostensibly the intended effect. What exactly do you mean?


----------



## ralphbsz (Apr 21, 2020)

Nasrudin said:


> Isn't rename just changing a directory entry to be a different name? It seems to me to be trivial except for possibly the failure handling.


Rename affects the following entities:

- (Slightly) The object being renamed. If you think of this as an inode or a file system object, the effect is actually minimal, since neither its content nor its attributes (metadata such as mtime) change, only the way it is reached.
- The old parent directory, which loses one entry (if the new directory is different, or the rename crashes in the middle).
- The new directory, which gains an entry.
- The object being replaced, if there was one already at the new name; it gets unlinked.

The problem is all in the atomicity and the error handling: After you're done, either the object is in the old directory at the old name, and the object being replaced is still at the new name. Or the object being replaced is unlinked (and perhaps deleted, if that was the last link and it is not open), and the object is in the new directory at the new name. If something goes wrong, you can end up in a vast variety of wrong combinations. And since rename is often used as an atomic backup operation, this would have disastrous consequences.

As an example, a typical use of rename is the way editors write their output files. Since writing a file can not be atomic, and is error-prone, they typically write a temporary output file. After that has succeeded, they rename the temporary file to atomically replace the old file; and they know exactly (from the error code that the rename() call returned) whether the old file or the new file is now in place or readable, tertium non datur. If you write a file this way, then at any point there is a correct and readable file in place, perhaps the old or the new, but never both or a mix or none.
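
The write-then-rename pattern described above can be sketched in a few lines of shell. This is only an illustration under assumptions: the file name `config.txt` and the scratch directory are made up, and it relies on `mv` doing a plain rename(2) when source and target sit in the same directory.

```sh
#!/bin/sh
# Sketch: replace a file so that readers see either the old or the new
# version, never a partial one. All names here are hypothetical.
set -eu

workdir=$(mktemp -d)            # scratch directory for the demo
target="$workdir/config.txt"    # the file being "edited"

printf 'old content\n' > "$target"

# 1. Write the new version to a temporary file in the SAME directory,
#    so the final rename stays within one file system.
tmp=$(mktemp "$workdir/config.txt.XXXXXX")
printf 'new content\n' > "$tmp"

# 2. Replace the old file in one step. The exit status says which
#    version is in place afterwards.
if mv "$tmp" "$target"; then
    echo "new version is in place"
else
    echo "rename failed, old version is still in place" >&2
    rm -f "$tmp"
fi

cat "$target"
```

Keeping the temporary file next to the target matters: if it lived on another file system, `mv` would fall back to copy-and-delete and the replacement would no longer be a single rename.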



> Not on a live webserver. If a high traffic webserver is serving multiple sites, and you want to alter just one of these site directories, the (apparently naive) way to do that would be: ...


How about this, would this work? I'm assuming the server will follow soft-links, and wants to read content from sitedir:

```
ln -s old_sitedir link_to_old_sitedir  # old_sitedir is a real directory, contains the old content
ln -s new_sitedir link_to_new_sitedir  # new_sitedir is also a real directory, also has content
ln -s link_to_old_sitedir sitedir
# Start the server, it will work and read from sitedir -> old
# Now, when you want to switch to the new one:
# -h (FreeBSD mv) keeps mv from following sitedir into the directory it points to
mv -h link_to_new_sitedir sitedir
This atomically overwrites the link, but the underlying directories remain accessible.



> It turns out that in the window of time between these two moves, the site appears broken. If your traffic is high enough, this is a lot of people who think the site is down or broken. So I was looking for a better way. I do know you can get fancier and try to block the site while you are updating it, or use a load balancer and do rolling one-at-a-time upgrades, however I'm pretty sure there are other use cases which would


Well, there are a whole lot of cases that are not related to the rename problem which also need to be thought through. For example: Someone loaded index.html from the old site. While they're staring at it, the sitedir is replaced with the new version. Then they click on a link on index.html. You have to make sure that following that old link does something sensible. I don't know what that sensible thing is, that depends on context.


----------



## Nasrudin (Apr 22, 2020)

ralphbsz said:


> The problem is all in the atomicity and the error handling: After you're done, either the object is in the old directory at the old name, and the object being replaced is still at the new name.
> ...
> After that has succeeded, they rename the temporary file to atomically replace the old file; and they know exactly (from the error code that the rename() call returned) whether the old file or the new file is now in place or readable, tertium non datur. If you write a file this way, then at any point there is a correct and readable file in place, perhaps the old or the new, but never both or a mix or none.



It would appear you are claiming that file renames are more atomic than directory renames. Is this correct?



> How about this, would this work? I'm assuming the server will follow soft-links, and wants to read content from sitedir:
> 
> ```
> ln -s old_sitedir link_to_old_sitedir  # old_sitedir is a real directory, contains the old content
> ...



For various reasons, I've sadly had to reject this idea. It's probably the best idea in terms of an atomic switch, but replacing directories with symlinks is not on the table at this time. It's also not clear that...



> Well, there are a whole lot of cases that are not related to the rename problem which also need to be thought through. For example: Someone loaded index.html from the old site. While they're staring at it, the sitedir is replaced with the new version. Then they click on a link on index.html. You have to make sure that following that old link does something sensible. I don't know what that sensible thing is, that depends on context.



...the webserver atomicity issues you indicate will be solved by a symlink; I can't easily prove that it will be.


----------



## unitrunker (Apr 22, 2020)

For the OP, your webserver config points to the www document root. Prepare a new directory. Edit the config to point to the new directory. Signal the www daemon to reload its config.

As an aside, it seems dir swapping would be easier than dir moving since both directories exist. I don't know how UFS or ZFS handle locking of the parent directories. Edit: no you still need to update the '..' entries (probably why hard links aren't allowed).

My concern related to chroot is that moving a directory is a common jailbreak technique. Adding a new way to move directories is a new attack vector.


----------



## PMc (Apr 22, 2020)

unitrunker said:


> My concern related to chroot is that moving a directory is a common jailbreak technique. Adding a new way to move directories is a new attack vector.



That may be true, but it also may be a little too far generalized.

As I found, the problem described by the OP is technically no different from the problem I described above.
And the solution as proposed by ralphbsz does work, but it does work only a single time: you have an "old" entry and a "new" entry, and instead of swapping them, you switch the processing entirely to the new entry. Fine, problem solved - but what do we do the next time we need this? Create a "new2" entry, and then "new3", and so on?
So, this works when doing it manually; it works one time, maybe a second and a third. But it is not code-able, it is not useful for automated deploy.
With automated deploy it does not matter if they do an update once a year or twenty times a day, it just has to work. (It also does not matter if their index-urls still point to the right thing afterwards - that is an editor's problem, and they may or may not care for it.)

So I come to the conclusion that there is no fully satisfying solution, and those people who came up with renameat2() had a valid point.


----------



## unitrunker (Apr 22, 2020)

You can toggle between A and B.
You can have two deployments - an A and a B. You must of course remember which to use. That can also be solved in a script.
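
A minimal sketch of such a toggle script follows. It rests on assumptions not stated in the thread: the slot directories are literally named `A` and `B`, the live slot is remembered by a symlink called `current` (a state file or the web server config would serve equally well), and `mv -T` (GNU) or `mv -h` (FreeBSD) is available to rename one symlink over another.

```sh
#!/bin/sh
# Sketch: toggle between two deployment slots, A and B.
# Directory names and the "current" link are hypothetical.
set -eu

root=$(mktemp -d)
mkdir "$root/A" "$root/B"
echo "version 1" > "$root/A/index.html"
ln -s A "$root/current"          # A starts out as the live slot

# Rename a symlink over another symlink without following the target.
# GNU mv spells this -T, FreeBSD mv spells it -h.
replace_link() {
    mv -T "$1" "$2" 2>/dev/null || mv -h "$1" "$2"
}

toggle() {
    # The symlink itself records which slot is live.
    live=$(readlink "$root/current")
    case "$live" in
        A) idle=B ;;
        *) idle=A ;;
    esac

    # Prepare the idle slot; a real deploy would copy a whole tree here.
    rm -rf "$root/$idle"
    mkdir "$root/$idle"
    echo "$1" > "$root/$idle/index.html"

    # Build the new link under a temporary name, then rename it into place.
    ln -s "$idle" "$root/current.new"
    replace_link "$root/current.new" "$root/current"
}

toggle "version 2"
cat "$root/current/index.html"   # prints "version 2"
toggle "version 3"
readlink "$root/current"         # prints "A"
```

After each toggle the previously live slot is left untouched, so it doubles as the one-step rollback copy.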


----------



## unitrunker (Apr 22, 2020)

The key here is the OP doesn't actually need atomic swap but atomic move. Once A is replaced by B - A is no longer needed.


----------



## ralphbsz (Apr 22, 2020)

Nasrudin said:


> It would appear you are claiming that file renames are more atomic than directory renames. Is this correct?


No, the rename algorithm doesn't care at all whether the object it is renaming is a file, directory, soft-link, or purple flying elephants. Even if you take a file and rename it so it replaces (unlinks) a directory. In a correctly written file system, a rename will either atomically succeed, or completely fail. (I know there is one small exception, during the rename there may be a period where the object is visible at *both* old and new location, and there may be another exception that I forgot about).

The joke about elephants is actually an in-joke. At some point, I had to write a protocol design document to be shared with other companies, about something involving disks, enclosures, files, and directories. And because I have a nasty sense of humor, and didn't want to disclose company-internal details, I wrote the examples in the document involving a zoo that stores elephants, hippos and rhinos. And I went into considerable detail about the proposed upgrades to the protocol, working through the example of the next version having to deal with flying elephants.



> ...the webserver atomicity issues you indicate will be solved by a symlink; I can't easily prove that it will be.


Yes, that problem is really gnarly. People can have arbitrarily old documents on the screen, and then they click on a link. What do you do? If the parent document is old, and they click on a link to a file of which a newer version exists? I think the answer is "it depends", and can't be solved by a computer person without input from a web designer. For example, the old page may be for a widget, and the link is to buy a mounting bracket for the widget. The web page is being replaced because there is a new model of widget, which is smaller but even stronger. If they are looking at the old web page, then the link needs to lead them to the correct mounting bracket, because clearly the new bracket won't fit the old widget. On the other hand, a link that shows "today's stock price" should probably always go to the most recent version of the stock price page. It's complicated.


----------



## PMc (Apr 22, 2020)

unitrunker said:


> You can toggle between A and B.
> You can have two deployments - an A and a B. You must of course remember which to use. That can also be solved in a script.



That's correct. That would be a possible approach. It only fails when you want to potentially revert an arbitrary number of versions.


----------



## unitrunker (Apr 22, 2020)

PMc said:


> That's correct. That would be a possible approach. It only fails when you want to potentially revert an arbitrary number of versions.


Deployment strategy is a whole other animal. I don't believe in keeping old deployments around for very long on the target machine. A local copy is nice for a fast rollback. If you need to rollback multiple versions, your certification process is broken. You archive the files elsewhere so a rollback is just a (re)deployment of an older version. Since they're offline, you can compare versions side by side to see what changed (or is about to change). This can be done by junior staff without granting them direct production access.

If the target machine goes belly up, you can quickly recover. For cloud like environments, there is no rollback. You deploy a new instance and destroy the old.

Really, a load balancer is the way to go as it lets you upgrade not just a few files but new www version, new SQL schema, or new hardware / different VM.

Yeah we're wandering off the reservation now.


----------



## AngryChris (Apr 22, 2020)

PMc said:


> And the solution as proposed by ralphbsz does work, but it does work only a single time: you have an "old" entry and a "new" entry, and instead of swapping them, you switch the processing entirely to the new entry. Fine, problem solved - but what do we do the next time we need this? Create a "new2" entry, and then "new3", and so on?
> So, this works when doing it manually; it works one time, maybe a second and a third. But it is not code-able, it is not useful for automated deploy.


Yes, you do use a "new2" and then a "new3" only they're not called that.  This is where you use a timestamp.  And those can be used in a programmatic, automated fashion.  Each new directory to which you cut over processing will have the timestamp of its creation in its name.  But the symlink won't.  The application will always use the same symlink to access the data, but the buckets of data being accessed are segregated by creation time.
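
Roughly, an automated version of this could look like the sketch below. The layout (`releases/` plus a `current` symlink) and the `deploy` function are assumptions for illustration, not something from the thread; the link is swapped with `mv -T` (GNU) or `mv -h` (FreeBSD), whichever the local `mv` accepts.

```sh
#!/bin/sh
# Sketch: timestamped release directories behind one stable symlink.
# Layout and names are hypothetical.
set -eu

base=$(mktemp -d)
mkdir "$base/releases"

# Rename a symlink over another symlink without following the target.
replace_link() {
    mv -T "$1" "$2" 2>/dev/null || mv -h "$1" "$2"
}

deploy() {
    stamp=$(date +%Y%m%d%H%M%S)
    release="releases/$stamp"

    # Two deploys within the same second get a numeric suffix.
    n=0
    while [ -e "$base/$release" ]; do
        n=$((n + 1))
        release="releases/$stamp.$n"
    done

    mkdir "$base/$release"
    echo "$1" > "$base/$release/index.html"   # stand-in for the real content

    # The application only ever looks at "current".
    ln -s "$release" "$base/current.tmp"
    replace_link "$base/current.tmp" "$base/current"
}

deploy "first release"
deploy "second release"

cat "$base/current/index.html"   # prints "second release"
ls "$base/releases"              # both release directories are still there
```

Rolling back is then just another link swap to an older directory name, and pruning old releases is an independent cleanup step.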


----------



## PMc (Apr 22, 2020)

unitrunker said:


> Deployment strategy is a whole other animal. I don't believe in keeping old deployments around for very long on the target machine. A local copy is nice for a fast rollback. If you need to rollback multiple versions, your certification process is broken. You archive the files elsewhere so a rollback is just a (re)deployment of an older version. Since they're offline, you can compare versions side by side to see what changed (or is about to change). This can be done by junior staff without granting them direct production access.


All right. I didn't even think that far - I would have got that one free of cost, if the move were possible. 
I came across the problem because I was annoyed about the ugliness of these modern web applications. They create their own directory tree, have their own /var, their own /tmp, their own /log, their own /spool, obviously also their own prerequisites and libraries, and they throw the whole crap into some user's homedir and call that a deploy. 
So I came to think if it is really necessary to do it that way - because once upon a time we had a nice paper about the unix directory tree, and which parts of the tree can be shared among machines, which parts need to be writeable, etc. I know, nobody cares about such anymore, they just want to have their app up and running - but that doesn't mean I need to forget about all this.
So I started to sketch a deploy that would honor the original tree, and found that it isn't so difficult - and it might even have a few benefits, e.g. that the whole application, or even that whole filesystem, need no longer be writeable for the application user and that ilk.
But then to handle the stuff under /var, one needs a few symlinks. And there the problem came up, because some parts of /var should stay and some parts must switch. Now, capistrano, for instance, allows a configurable number of old installations to be kept. And with my original approach, the elements in /var would just match up with that number. But, as I found out, that approach would need the swap of the paths.
That's basically the story.


----------



## PMc (Apr 22, 2020)

AngryChris said:


> Yes, you do use a "new2" and then a "new3" only they're not called that.  This is where you use a timestamp.  And those can be used in a programmatic, automated fashion.  Each new directory to which you cut over processing will have the timestamp of its creation in its name.  But the symlink won't.  The application will always use the same symlink to access the data, but the buckets of data being accessed are segregated by creation time.



Either this is the essential point that I failed to recognise, or it does not work. I have to look into that on occasion, because in my case I happen to have that timestamp already, from capistrano.


----------



## mark_j (Apr 28, 2020)

Nasrudin said:


> Linux has this `renameat2()` system call ...


Why not? It's a linux-ism. FreeBSD prefers to stick to POSIX rather than every thought bubble out of GNU/Linux...
I think if you can't accomplish this hack in FreeBSD then you're not thinking about the problem correctly, whether that be use of symbolic links or whatever. Approach the problem differently?

Good luck getting it to work with a non-local file system.


----------



## Nasrudin (May 6, 2020)

mark_j said:


> Why not? It's a linux-ism. FreeBSD prefers to stick to POSIX rather than every thought bubble out of GNU/Linux...
> I think if you can't accomplish this hack in FreeBSD then you're not thinking about the problem correctly, whether that be use of symbolic links or whatever. Approach the problem differently?
> 
> Good luck getting it to work with a non-local file system.



Sometimes the cry of "approach the problem differently" ignores the effort someone actually _has_ spent in looking at the problem.  You should know this problem has already been "approached differently" even as the original post was being typed. I have a workaround, it's not 100% perfect, but it mostly works for our use cases.

I do not agree that the use case of atomically swapping two directories is specific to linux, nor do I want to give linux so much power over FreeBSD by declaring "linux-isms are bad" in the way you appeared to do. This is a general problem that I believe is useful to solve, regardless of the operating system you are using. 

As expected, my beliefs do not drive development priorities.


----------



## PMc (May 6, 2020)

Nasrudin said:


> Sometimes the cry of "approach the problem differently" ignores the effort someone actually _has_ spent in looking at the problem.  You should know this problem has already been "approached differently" even as the original post was being typed. I have a workaround, it's not 100% perfect, but it mostly works for our use cases.



In my case, the offered idea to use timestamps does actually work, and makes things even simpler. And I had spent quite an effort in making it work in the first place, but was obsessed by the idea that it must work in an old-current-new cycle, so I didn't get the step to look at it differently.
So, this thread has already proven valuable. 

Nevertheless...



> I do not agree that the use case of atomically swapping two directories is specific to linux, nor do I want to give linux so much power over FreeBSD by declaring "linux-isms are bad" in the way you appeared to do. This is a general problem that I believe is useful to solve, regardless of the operating system you are using.



... I agree with that.



> As expected, my beliefs do not drive development priorities.



Get used to it. Another thing that I would much appreciate is open/NOATIME, and I also have little hope that it might appear - it would allow removing stuff that wasn't used for some time while still having backups work normally (my backup tool supports it), but it's probably also a linuxism.


----------



## mark_j (May 7, 2020)

Nasrudin said:


> Sometimes the cry of "approach the problem differently" ignores the effort someone actually _has_ spent in looking at the problem.  You should know this problem has already been "approached differently" even as the original post was being typed. I have a workaround, it's not 100% perfect, but it mostly works for our use cases.
> 
> I do not agree that the use case of atomically swapping two directories is specific to linux, nor do I want to give linux so much power over FreeBSD by declaring "linux-isms are bad" in the way you appeared to do. This is a general problem that I believe is useful to solve, regardless of the operating system you are using.
> 
> As expected, my beliefs do not drive development priorities.


Fair enough, but listen to this:
The Linux (kernel) community has literally thousands of people working on it. Hundreds of corporations putting money into it, offering developer time to the project. They can afford to solve a problem of cracking a nut by using a sledge-hammer.

If you can give a case, a STRONG case, for such an implementation (and even provide the code for such a library function), then by all means offer it up to the arch mailing list, but I'm willing to bet they'll say "use the tools you have: symlink(2), rename(2), and so on."

So, while I agree with your sentiment, I don't agree with your approach. Adding more and more complexity is a Linux-ism.


----------



## mark_j (May 7, 2020)

PMc said:


> Get used to it. Another thing that I would much appreciate is open/NOATIME, and I also have little hope that it might appear - it would allow removing stuff that wasn't used for some time while still having backups work normally (my backup tool supports it), but it's probably also a linuxism.



Why not use utimes(2) or the more modern utimensat(2) to reset it?


----------



## PMc (May 7, 2020)

mark_j said:


> Why not use utimes(2) or the more modern utimensat(2) to reset it?


Simple answer: because the backup software shall kick in on an updated ctime - and rightfully so, to catch moves - so this gets either no moves saved or an endless loop of full backups.  (The matter is nicely discussed here and below.)
Actually this is not a thing that really hurts me anymore - I don't do much with these timestamps nowadays. In the old times I was very delighted to find that unix has three times on the file, and started to script all kinds of systems maintenance, automated cleanups (diskspace wasn't there in abundance), to have the machines run practically self-maintained (DRY in systems-management). But nowadays we upgrade more often, clean up more often, and have lots more space.
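
To make the ctime point above concrete, here is a small sketch, assuming `touch -a -r` is used as the shell-level stand-in for the utimes family of calls. The file names are made up, and the two `stat` helpers simply try the GNU option syntax first and fall back to the BSD one.

```sh
#!/bin/sh
# Sketch: restoring atime after reading a file also bumps its ctime.
# File names are hypothetical.
set -eu

dir=$(mktemp -d)
f="$dir/data.txt"
ref="$dir/ref"
echo "payload" > "$f"

# Timestamps in epoch seconds: GNU stat first, BSD stat as fallback.
atime_of() { stat -c %X "$1" 2>/dev/null || stat -f %a "$1"; }
ctime_of() { stat -c %Z "$1" 2>/dev/null || stat -f %c "$1"; }

# Give the file an old atime/mtime so that any change is easy to see.
touch -t 202001010000 "$f"
atime_before=$(atime_of "$f")
ctime_before=$(ctime_of "$f")

touch -r "$f" "$ref"      # remember the file's times in a reference file
sleep 1                   # make a ctime change visible at 1 s resolution
cat "$f" > /dev/null      # the "backup" reads the file (may update atime)
touch -a -r "$ref" "$f"   # put the old atime back

atime_after=$(atime_of "$f")
ctime_after=$(ctime_of "$f")

[ "$atime_before" = "$atime_after" ] && echo "atime restored"
[ "$ctime_before" != "$ctime_after" ] && echo "ctime changed"
```

A backup tool that selects files by ctime would therefore pick the file up again on its next run, which is the loop described above.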


----------



## mark_j (May 7, 2020)

I've got to say that documentation you linked to is rather confusing. Why is backup software using atime as a trigger? That's insane. Then it says you'll avoid a race condition if 2 or more pieces of software are accessing file X and one's not modifying atime. In other words, this is a problem of GNU/Linux's making by having noatime flag on open and allowing programs to use it.

Most backup software I've used in the enterprise realm has used metadata, checksums and their own backup times to do incremental backups. Then again, it's been over 10 years since I had to worry about such things (StorEdge).

It also points to what I consider a big failing of UNIX: historically no backup time flag in the file structure metadata. But, nowadays, there's snapshots et al to perform backups so who cares; disk space is cheap, cheap, cheap...


----------



## PMc (May 7, 2020)

mark_j said:


> I've got to say that documentation you linked to is rather confusing. Why is backup software using atime as a trigger?



It doesn't. It uses ctime+mtime. It just offers to keep atime at its former state when reading the file. Either via utimes() as you suggested, or via O_NOATIME if available.



> Then it says you'll avoid a race condition if 2 or more pieces of software are accessing file X and one's not modifying atime. In other words, this is a problem of GNU/Linux's making by having noatime flag on open and allowing programs to use it.



This was also brought up when discussing the matter on the mailing list. But I don't see a race condition there, or maybe I don't understand the problem. The other point that was brought up there is that it is quite laborious to track the open() call through all the layers of geom, and somebody would have to do that work. That makes sense to me.


----------

