Things Unix can do atomically (2010) (rcrowley.org)

Posted by onurkanbkrc 17 hours ago

239 points | 89 comments

0xbadcafebee 16 hours ago|

You can use `ln` atomicity for a simple, portable(ish) locking system: https://gist.github.com/pwillis-els/b01b22f1b967a228c31db3cf...
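For readers who want to see the shape of the idea, here is a minimal Python sketch of a hard-link lock. It is not the code from the linked gist; the helper names `acquire_lock` and `release_lock` are made up for illustration, and it assumes the lock file and the temporary file live on the same filesystem.

```python
import os
import tempfile


def acquire_lock(lock_path):
    """Try to take the lock; return True on success, False if it is already held."""
    lock_dir = os.path.dirname(os.path.abspath(lock_path))
    # Create a uniquely named file next to the lock so link() stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(prefix=".lock-", dir=lock_dir)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        try:
            # link(2) fails with EEXIST if lock_path already exists,
            # so at most one caller can succeed.
            os.link(tmp_path, lock_path)
            return True
        except FileExistsError:
            return False
    finally:
        # The temporary name is no longer needed either way.
        os.unlink(tmp_path)


def release_lock(lock_path):
    """Drop the lock by removing the lock file."""
    os.unlink(lock_path)
```

A stale lock left behind by a crashed process is not handled here; a real locking scheme needs some policy for that.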

akoboldfrying 13 hours ago|

Really nice explanation of a useful pattern. I was surprised to discover that even the famously broken NFS honours atomicity of hardlink creation.

amstan 14 hours ago||

Missing (probably because of the date of the article): `mv --exchange` aka renameat2+RENAME_EXCHANGE. It atomically swaps 2 file paths.

rustybolt 13 hours ago||

I tried using this a while back and found it was not widely available. You need coreutils version 9.1 or later for this, many distros do not ship this.

I made https://github.com/rubenvannieuwpoort/atomic-exchange for my use case.

oguz-ismail2 14 hours ago||

Title says Unix, renameat2 is Linux-only.

jasode 13 hours ago||

>Title says Unix,

You're misinterpreting the title. The author didn't intend "Unix" to literally mean only the official AT&T/TheOpenGroup UNIX® System to the exclusion of Linux.

The first sentence, with its "UNIX-like", makes that clear: >This is a catalog of things UNIX-like/POSIX-compliant operating systems can do atomically,

Further down, he then mentions some Linux specifics : >fcntl(fd, F_GETLK, &lock), fcntl(fd, F_SETLK, &lock), and fcntl(fd, F_SETLKW, &lock) . [...] There is a “mandatory locking” mode but Linux’s implementation is unreliable as it’s subject to a race condition.

shawn_w 11 hours ago|||

Bit rot alert: Linux doesn't even have mandatory file locks these days.

Linux-specific open file description locks could be brought up in a modern version of TFA though.

bee_rider 8 hours ago||||

They aren’t misinterpreting the title, the title is incorrect.

jasode 2 hours ago||

>, the title is incorrect.

Differing philosophies of how to interpret titles. Prescriptive vs Descriptive language.[0]

There can be different usages of the word "Unix":

#1: Unix is a UNIX(tm) System V descendant. More emphasis that the kernel needs to be UNIX. In this strict definition, you get the common reminder that "Linux is not a Unix!"

#2: "Unix" as a loose generic term for a family of o/s that looks/feels like Unix. This perspective includes using an o/s that has userland Unix utilities like cat/grep/awk. Sometimes deliberately styled as asterisk "*nix" or a suffix-qualifier "Unix-like" but often just written as a naked "Unix".

A Prescriptivist says the author's title is "incorrect". On the other hand, a Descriptivist looks at the whole content of the article -- notices the text has a lot of Linux specific info such as fcntl(,F_GETLEASE/F_SETLEASE), and every hyperlink to a man page url points to https://linux.die.net/man/ , etc -- and thus determines that the author is using "Unix"(#2) in the looser way that can include some Linux idiosyncrasies.

"Unix" instead of "*nix" as a generic term for Linux is not uncommon. Another example article where the authors use the so-called incorrect "Unix" in the title even though it's mostly discussing Linux CUPS instead of Solaris : https://www.evilsocket.net/2024/09/26/Attacking-UNIX-systems...

[0] https://en.wikipedia.org/wiki/Linguistic_prescription

monibious 11 hours ago||||

But I also don't think the author meant Things you can do in Linux but not Unix

jasode 10 hours ago|||

>But I also don't think the author meant Things you can do in Linux but not Unix

I wasn't claiming that. I just thought the ggp had a useful comment about renameat2() which led to gp's "correction" which wasn't 100% accurate.

IBM z/OS UNIX also has renameat2(). It doesn't have the Linux specific flag RENAME_EXCHANGE.

https://www.ibm.com/docs/en/zos/3.1.0?topic=functions-rename...

skissane 4 hours ago|||

In recent versions, z/OS has been copying lots of Linux-specific APIs (e.g. unshare [0]) in order to support the z/OS port of Kubernetes.

If Kubernetes starts using renameat2(RENAME_EXCHANGE), they could very plausibly add it.

[0] https://www.ibm.com/docs/en/zos/3.2.0?topic=csd-unshare-bpx1...

mghackerlady 6 hours ago|||

pedantic but z/OS isn't a unix, it can just pretend to be one enough for the open group to call it one. IBM has a unix still, AIX.

pjmlp 8 hours ago||||

Except POSIX doesn't specify some of them as happening atomically.

Many people write UNIX/POSIX without ever reading what it says.

stephenr 10 hours ago|||

Sounds like the key term then is probably this:

> POSIX-compliant

Which, FWIW, doesn't mean Linux. AFAIK there is no Linux distro that's fully compliant, even before you worry about the specifics of whether it's certified as compliant.

jasode 8 hours ago|||

>POSIX-compliant Which, FWIW, doesn't mean Linux. AFAIK there is no Linux distro that's fully compliant

I read the author's use of "POSIX-compliant" as a loose and fuzzy family category rather than an exhaustive and authoritative reference on 100% strict compliance. Therefore, the author mentioning non-100%-compliant Linux is ok.

There seem to be two different expectations and interpretations of what the article is about.

- (1) article is attempting to be a strict intersection of all Unix-like systems that conform to official UNIX POSIX API. I didn't think this was a reasonable interpretation since we can't be sure the author actually verified/tested other POSIX-like systems such as FreeBSD, HP-UX, IBM AIX, etc.

- (2) article is a looser union of operating systems and can also include idiosyncrasies of certain systems like Linux that the author is familiar with that don't apply to all other UNIX systems. I think some readers don't realize that all the author's citations to man pages point to Linux-specific URLs at: https://linux.die.net/man/

The ggp's (amstan) additional comment about renameat2(,,,,RENAME_EXCHANGE) is useful info and is consistent with interpretation (2).

If the author really didn't want Linux to be lumped in with "POSIX-like", it seems he would avoid linux.die.net and instead point to something more of a UNIX standard such as: https://unix.org/apis.html

[0] Intersection vs Union: https://en.wikipedia.org/wiki/Set_(mathematics)#Intersection

dietr1ch 8 hours ago||||

AFAIK you don't even want to be POSIX-compliant unless having a sticker means more to you than being reasonable. Most projects knowingly steer away from compliance (and certifying compliance is probably also expensive)

mionhe 6 hours ago||||

The slash is read as "OR" in this case.

As in: Unix-like OR POSIX-compliant

In that light, it's probably fine to not nitpick over certifications here.

rascul 9 hours ago|||

EulerOS was certified UNIX some years ago.

stephenr 9 hours ago||

Huh, TIL. Thanks.

ncruces 12 hours ago||

I use several of these to implement alternative SQLite locking protocols.

POSIX file locking semantics really are broken beyond repair: https://news.ycombinator.com/item?id=46542247

pjmlp 8 hours ago||

Unless they can be guaranteed by the POSIX specification, they are implementation specific and should not be relied upon for portable code.

kccqzy 5 hours ago|

Which of these are not guaranteed by the POSIX specification? It’s been a while since I studied it, but if I recall correctly the ones mentioned in the article are guaranteed.

nialv7 5 hours ago||

The mmap/msync one is incorrect I believe? (Correct me if I am wrong).

msync() sync content in memory back to _disk_. But multiple processes mapping the same file always see the same content (barring memory consistency, caching, etc.) already. Unless the file is mapped with MAP_PRIVATE.
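A small Python sketch of that point, with the caveat that it uses two mappings inside one process as a stand-in for two processes (an assumption made to keep the example short). The function name `shared_mapping_demo` is made up for illustration.

```python
import mmap
import os
import tempfile


def shared_mapping_demo():
    """Write through one MAP_SHARED mapping and read through another, without msync."""
    fd, path = tempfile.mkstemp()
    try:
        # The file must be at least as long as the mapping.
        os.ftruncate(fd, mmap.PAGESIZE)
        # Two independent shared mappings of the same file.
        writer = mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED)
        reader = mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED)
        try:
            writer[0:5] = b"hello"
            # No flush()/msync() has been called at this point.
            return bytes(reader[0:5])
        finally:
            writer.close()
            reader.close()
    finally:
        os.close(fd)
        os.unlink(path)
```

Swapping `MAP_SHARED` for `MAP_PRIVATE` on the writer would make its change invisible to the reader, which is the exception noted above.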

DSMan195276 4 hours ago||

Yeah I agree that one isn't very clear, perhaps the idea is to use `msync()` as a barrier to achieve consistent ordering of the writes without having to handle that yourself with more complex primitives. But then, they do mention some of those primitives at the bottom of the article, so it's hard to say what exactly the idea is.

icedchai 4 hours ago||

mmap/msync behavior is also very platform specific. On some systems (like AIX, at least older versions), even without msync, memory mapped data is synced back to disk periodically.

I worked on a code base that was portable between Linux, AIX, and some other Unix flavors. mmap/msync was a source of bugs. Just imagine your system running for days, never syncing any data to disk... then someone pulls the plug. Where'd my data go? Even worse, it happened "in production" at a beta site. Fortunately we had a way to recover data from a log.

Igrom 12 hours ago||

>fcntl(fd, F_GETLK, &lock), fcntl(fd, F_SETLK, &lock), and fcntl(fd, F_SETLKW, &lock)

There's also `flock`, the CLI utility in util-linux, that allows using flocks in shell scripts.
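The CLI wraps the flock(2) system call. As a rough sketch of the underlying behaviour (using Python's `fcntl.flock` rather than the CLI itself, with made-up helper names `try_flock` and `flock_demo`):

```python
import fcntl
import os
import tempfile


def try_flock(fd):
    """Attempt a non-blocking exclusive flock(2); return True if it was acquired."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False


def flock_demo():
    """Show two open file descriptions contending for the same lock."""
    path = os.path.join(tempfile.mkdtemp(), "demo.lock")
    # Two separate open() calls produce two open file descriptions,
    # which is the unit that flock locks attach to.
    fd1 = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    fd2 = os.open(path, os.O_RDWR)
    try:
        first = try_flock(fd1)           # lock is free, so this succeeds
        second = try_flock(fd2)          # held via fd1, so this is refused
        fcntl.flock(fd1, fcntl.LOCK_UN)  # release the lock
        third = try_flock(fd2)           # now it can be taken
        return first, second, third
    finally:
        os.close(fd1)
        os.close(fd2)
```

This assumes a local filesystem; behaviour on network filesystems can differ.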

cachius 12 hours ago||

What are flocks in this context? Surely not a number of sheep...

gbacon 5 hours ago|||

https://man.openbsd.org/flock.2

https://man7.org/linux/man-pages/man2/flock.2.html

ncruces 12 hours ago|||

File locks.

pjmlp 8 hours ago|||

In UNIX/POSIX, file locks are advisory, not enforced; they only work if all processes play ball.

zbentley 6 hours ago||

Sure, but the discussion is around whether they’re atomic, not whether they’re advisory.

zbentley 6 hours ago||

Aren’t flock and POSIX locks backed by totally different systems?

KevinChasse 5 hours ago||

Nice catalog. One subtle thing I’ve found in building deterministic, stateless systems is that atomic filesystem and memory operations are the only way to safely compute or persist secrets without locks. Combining rename/link/O_EXCL patterns with ephemeral in-memory buffers ensures that sensitive data is never partially written to disk, which reduces race conditions and side-channel exposure in multi-process workflows.

sega_sai 15 hours ago||

rename() is certainly the easiest to use for any sort of file-system based synchronization.
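The usual pattern is to write a temporary file in the same directory and then rename it over the destination. A minimal Python sketch, where `atomic_write` is a made-up helper name and `os.replace` is Python's wrapper around rename(2) on Unix:

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the contents of `path` with `data` so readers see old or new, never a mix."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem as the destination.
    fd, tmp_path = tempfile.mkstemp(prefix=".tmp-", dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before publishing the name
        os.replace(tmp_path, path)  # atomic swap of the directory entry
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The rename makes the switch atomic for concurrent readers; surviving a crash additionally depends on syncing the containing directory, which this sketch leaves out.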

compressedgas 2 hours ago|

As long as you don't run into path races. If you want freedom from those, you need the missing:

  frenameat2(srcdirfd, srcfd, srcname, dstdirfd, dstfd, dstname)

zzo38computer 14 hours ago|

Even though it can do some things atomically, it only does so with one file at a time, and race conditions are still possible because it only does one operation at a time (even if you only need one file). Some of these are helpful anyway, such as O_EXCL, but it is still only one thing at a time, which can cause problems in some cases.

What else it does not do is a transaction with multiple objects. That is why I would design an operating system in which you can do a transaction with multiple objects.
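For reference, the single-file O_EXCL primitive mentioned above looks roughly like this in Python. The helper name `create_exclusively` is made up for illustration, and the sketch covers exactly one file, which is the limitation being described.

```python
import os


def create_exclusively(path):
    """Create `path` only if it does not exist; return an fd, or None if it already exists."""
    try:
        # O_CREAT | O_EXCL makes "check that it is absent" and "create it" one atomic step.
        return os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return None
```

Two processes racing on the same path cannot both get a descriptor back, but nothing ties this creation to operations on any other file.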

ptx 14 hours ago||

Windows had APIs for this sort of thing added in Vista, but they're now deprecating it "due to its complexity and various nuances which developers need to consider":

https://learn.microsoft.com/en-us/windows/win32/fileio/about...

Orphis 8 hours ago|||

In some cases, you can start by using the "at" functions (openat...) to work on a directory tree. If you have your logical "locking" done at the top-level of the tree, it might be a fine option.

In some other cases, I've used a pattern where I used a symlink to folders. The symlink is created, resolved or updated atomically, and all I need is eventual consistency.

That last case was to manage several APT repository indices. The indices were constantly updated to publish new testing or unstable releases of software and machines in the fleet were regularly fetching the repository index. The APT protocol and structure being a bit "dumb" (for better or worse) requires you to fetch files (many of them) in the reverse order they are created, which leads to obvious issues like the signature is updated only after the list of files is updated, or the list of files is created only after the list of packages is created.

Long story short, each update would create a new, internally consistent folder, and a symlink points to the most recently created folder (to atomically replace the folder, as it was not possible to swap folders directly). A small HTTP server would initiate a server-side session when the first file is fetched and only return files from the same index list. Everything is eventually consistent, and we never get APT complaining about signature or hash mismatches. The pivotal component was indeed the atomicity of the symlink, as the Java implementation didn't have access to a more modern "openat" syscall, relative to a specific folder.
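A sketch of the symlink switch in Python (the original was Java, and this is not that implementation). The helper name `point_symlink_at` and the temporary naming scheme are assumptions for illustration: a new symlink is created under a temporary name and then renamed over the live one.

```python
import os


def point_symlink_at(link_path, target):
    """Atomically make `link_path` a symlink to `target`, replacing any existing link."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # Hypothetical temporary name in the same directory as the live link.
    tmp_link = os.path.join(
        link_dir, ".%s.tmp-%d" % (os.path.basename(link_path), os.getpid())
    )
    os.symlink(target, tmp_link)
    try:
        # rename(2) replaces the old symlink itself; it does not follow it.
        os.replace(tmp_link, link_path)
    except BaseException:
        os.unlink(tmp_link)
        raise
```

Readers resolving `link_path` get either the old folder or the new one, never a missing link in between.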

akoboldfrying 13 hours ago||

I don't follow, sorry. Are you saying that if we run:

    mv a b
    mv c d

We could observe a state where a and d exist? I would find such "out of order execution" shocking.

If that's not what you're saying, could you give an example of something you want to be able to do but can't?

zbentley 6 hours ago|||

Depending on metadata cache behavior configuration, if the system is powered off immediately after the first command, then that could indeed happen I think.

As to whether it’s technically possible for it to happen on a system that stays on, I’m not sure, but it’s certainly vanishingly rare and likely requires very specific circumstances—not just a random race condition.

LgWoodenBadger 5 hours ago||

Uhh, if the system powers off immediately after the first command (mv a b), the second command (mv c d) would never run. So where would d come from if the command that created it never executed?

zbentley 4 hours ago||

Er, sorry: I meant: if the first command runs, the plug is pulled, system starts again, second command runs.

lpribis 30 minutes ago||

Sure, but splitting "atomic" operations across a reboot is an interesting design choice. Surely upon reboot you would re-try the first `mv a b` before doing other things.

jstimpfle 13 hours ago||||

I don't think that's happening in practice, but 1) it may not be specified and 2) What you say could well be the persisted state after a machine crash or power loss. In particular if those files live in different directories.

You can remedy 2) by doing fsync() on the parent directory in between. I just asked ChatGPT which directory you need to fsync. It says it's both, the source and the target directory. Which "makes sense" and simplifies implementations, but it means the rename operation is atomic only at runtime, not if there's a crash in between. I think you might end up with 0 or 2 entries after a crash if you're unlucky.

If that's true, then for safety maybe one should never rename across directories, but instead do a coordinated link(source, target), fsync(target_dir), unlink(source), fsync(source_dir)
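Spelled out as a Python sketch, that sequence would look something like the following. The helper names `fsync_dir` and `durable_move` are made up, and whether this ordering really gives the crash behaviour speculated about above depends on the filesystem; the sketch only shows the order of calls.

```python
import os


def fsync_dir(dir_path):
    """fsync a directory so that changes to its entries reach stable storage."""
    fd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_move(source, target):
    """Move a file across directories as link + fsync + unlink + fsync."""
    os.link(source, target)                               # new name appears first
    fsync_dir(os.path.dirname(os.path.abspath(target)))   # persist the new entry
    os.unlink(source)                                     # then drop the old name
    fsync_dir(os.path.dirname(os.path.abspath(source)))   # persist the removal
```

Unlike rename(), link() fails if the target already exists, and for a moment both names are visible, so this trades "never zero entries" for "briefly two entries".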

jstimpfle 8 hours ago||

why is this being downvoted? If there's something wrong, explain?

duped 8 hours ago||||

All you need for this to occur is for the windows in which the two renames occur to overlap. A system polling to check if a, b, c, and d exist while the renames are happening might find all four of them.

jstimpfle 8 hours ago||

Assuming that the two `mv` commands are run in sequence, there shouldn't be any possibility for a and d to be observed "at once" (i.e. first d and then afterwards still a, by a single process).

devnonymous 11 hours ago|||

I'm almost certain what the OP meant was if the commands were run concurrently (ie: from 2 different shells or as `mv a b &; mv c d`) yes there is a possibility that a and d exist (eg: On a busy system where neither of the 2 commands can be immediately scheduled and eventually the second one ends up being scheduled before the first)

Or to go a level deeper, if you have 2 occurrences of rename(2) from the stdlibc ...

rename('a', 'b'); rename('c', 'd');

...and the compiler decides on out of order execution or optimizing by scheduling on different cpus, you can get a and d existing at the same time.

The reason it won't happen in the example you posted is the shell ensures the atomicity (by not forking the second mv until the wait() on the first returns)

isodude 10 hours ago||

nitpick, it should be `touch a c & mv a b & mv c d` as `&;` returns `bash: syntax error near unexpected token `;'`. I always find this oddly weird, but that would not be the first pattern in BASH that is.

`inotifywait` actually sees them in order, but nothing ensures that it's that way.

  $ inotifywait -m /tmp
  /tmp/ MOVED_FROM a
  /tmp/ MOVED_TO b
  /tmp/ MOVED_FROM c
  /tmp/ MOVED_TO d

`stat` tells us that the timestamps are equal as well.

  $ stat b d | grep '^Change'
  Change: 2026-02-06 12:22:55.394932841 +0100
  Change: 2026-02-06 12:22:55.394932841 +0100

However, speeding things up changes it a bit.

Given

  $ (
    set -eo pipefail
    for i in {1..10000}
    do
      printf '%d ' "$i"
      touch a c
      mv a b &
      mv c d &
      wait
      rm b d
    done
  )
  1 2 3 4 5 6 .....

And with `inotifywait` I saw this when running it for a while.

  $ inotifywait -m -e MOVED_FROM,MOVED_TO /tmp > /tmp/output
  $ cat /tmp/output | xargs -l4 | sort | uniq -c
  9104 /tmp/ MOVED_FROM a /tmp/ MOVED_TO b /tmp/ MOVED_FROM c /tmp/ MOVED_TO d
  896 /tmp/ MOVED_FROM c /tmp/ MOVED_TO d /tmp/ MOVED_FROM a /tmp/ MOVED_TO b
