
Posted by malgamves 13 hours ago

Filesystems are having a moment (madalitso.me)
160 points | 94 comments | page 2
JoeAltmaier 4 hours ago|
Digression: a file system is a terrible abstraction. The ceremonial file tree, where branches are directories and you have to hang your file on a particular branch like a Christmas ornament.

Relational is better. Hell, any kind of unique identifier would be nice. So many better ways to organize data stores.

zarzavat 4 hours ago||
Filesystems have a property that changes preserve locality. A change made to one branch of the tree doesn't affect other branches (except for links). Databases lack this property: any UPDATE or DELETE can potentially affect any row depending on the condition. This makes them powerful but also scary. I don't want that every time I delete a file it potentially does a rm -rf / if I mistype the query.

The best compromise is what modern OSs have: a tree-like structure to store files but a database index on top for queries.
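
The locality argument can be sketched in a few lines of Python (the tree and rows below are invented examples, not any real OS or DB API): a subtree delete is scoped to one branch by construction, while a predicate delete can match every row.

```python
# A tree delete cannot touch siblings of the deleted branch.
tree = {
    "home": {"alice": {"notes.txt": "..."}, "bob": {"todo.txt": "..."}},
    "etc": {"hosts": "..."},
}

def rm_subtree(root, path):
    """Delete one branch; other branches are unreachable from here."""
    *parents, leaf = path
    node = root
    for p in parents:
        node = node[p]
    del node[leaf]

# A predicate delete over flat rows: one bad condition matches everything.
rows = [
    {"path": "/home/alice/notes.txt"},
    {"path": "/home/bob/todo.txt"},
    {"path": "/etc/hosts"},
]

def delete_where(rows, predicate):
    return [r for r in rows if not predicate(r)]

rm_subtree(tree, ["home", "alice"])             # only /home/alice is gone
survivors = delete_where(rows, lambda r: True)  # a mistyped WHERE wipes all rows
```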

JoeAltmaier 3 hours ago||
You can create the tree structure from a relation; it's not a primitive data-store operation at all. Just add a "parent directory" attribute and voila.

So often we want to look up 'the last file I printed' or 'that message I got from Bob'. Instead of just creating that lookup, we have to go spelunking.

Hell, every major app creates its own abstractions because the OS/filesystem doesn't have anything useful. Email systems organize messages and tags; document editors have collections of document aspects they store in a structured blob. Instead of asking the OS to do that.
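
As a minimal sketch of the tree-from-a-relation idea (table and column names here are invented, not from any real filesystem), a directory hierarchy is just rows with a parent attribute, and full paths fall out of a recursive query:

```python
# Sketch: a "filesystem" as one relation with a parent column.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE files (
    id INTEGER PRIMARY KEY, name TEXT, parent INTEGER REFERENCES files(id))""")
con.executemany("INSERT INTO files VALUES (?, ?, ?)", [
    (1, "home", None),        # root-level entry
    (2, "alice", 1),          # "directory" = a row other rows point at
    (3, "notes.txt", 2),
])

# Reconstruct the familiar tree paths from the flat relation.
paths = con.execute("""
    WITH RECURSIVE tree(id, path) AS (
        SELECT id, '/' || name FROM files WHERE parent IS NULL
        UNION ALL
        SELECT f.id, t.path || '/' || f.name
        FROM files f JOIN tree t ON f.parent = t.id
    )
    SELECT path FROM tree ORDER BY path
""").fetchall()
# paths: [('/home',), ('/home/alice',), ('/home/alice/notes.txt',)]
```

Any other lookup ("the last file I printed") would then just be another indexed column on the same table.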

p_ing 4 hours ago|||
NTFS has a database, the MFT. It can index attributes, such as file names, which are stored in a B+ tree. A file's $DATA is also placed into the MFT, unless it doesn't fit, in which case NTFS allocates virtual cluster numbers (more MFT attributes) which point to the on-disk data structure of the file.

All files are represented in a table with rows and columns. "Directories" simply have a special "directory = true" attribute in a row (simplified).

The hierarchy is for you, the human.

Like many file systems, NTFS also contains a log for recoverability/rollback purposes.

It's not quite relational, but it doesn't make sense to be relational. Why would you need more than one 'table' to contain everything you need to know about a file? Microsoft experimented with WinFS, which wasn't a traditional file system (it was an MSSQL database with BLOB storage which sat on top of a regular NTFS volume). Performance was bad, and SkyDrive replaced the need for it (in the view of MSFT).
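
A toy model of the resident/non-resident $DATA idea described above (this is illustrative only, not the real on-disk layout; the size threshold and cluster allocator are made up):

```python
# Every file is a record of attributes; small data lives inside the record,
# larger data is swapped for cluster runs pointing elsewhere on disk.
from dataclasses import dataclass, field
from typing import Optional

RESIDENT_LIMIT = 700  # the real NTFS threshold depends on record layout

@dataclass
class MftRecord:
    name: str
    is_directory: bool = False
    resident_data: Optional[bytes] = None  # $DATA stored inside the record
    cluster_runs: list = field(default_factory=list)  # (start_cluster, count)

def store(name: str, data: bytes) -> MftRecord:
    if len(data) <= RESIDENT_LIMIT:
        return MftRecord(name, resident_data=data)
    clusters = -(-len(data) // 4096)  # ceil-divide into 4 KiB clusters
    # Pretend the allocator handed us one contiguous run starting at 1000.
    return MftRecord(name, cluster_runs=[(1000, clusters)])

small = store("note.txt", b"x" * 100)     # fits in the record itself
big = store("video.mp4", b"x" * 10_000)   # spills out to cluster runs
```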

dist-epoch 3 hours ago||
The newest Microsoft filesystem, ReFS, removed the MFT, because it created a lot of problems.
p_ing 3 hours ago||
> Because it created a lot of problems.

Please elaborate.

NTFS is still the better choice for common desktop usage. ReFS's goals are centered around data integrity, but that comes at the cost of performance.

packetlost 4 hours ago|||
Files in most file systems are uniquely identified by inode and can be referenced by multiple paths. Why does everyone forget links?
JoeAltmaier 2 hours ago||
A dataset can persist across multiple file systems. A UUID is a way to know that one dataset is equivalent (identical) to another. Now you can cache, store-and-forward, archive and retrieve and know what you have.
packetlost 2 hours ago||
UUIDs aren't very good for this use case; a sufficiently large CRC or cryptographic hash is better, because it's intrinsically tied to the data's value while UUIDs are not.
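
The distinction can be shown in a few lines (SHA-256 stands in for whatever hash you'd actually pick): a content hash identifies the bytes themselves, so two independent copies of the same data agree, while random UUIDs never do.

```python
# Content hash vs. UUID as a dataset identity.
import hashlib
import uuid

data = b"the same bytes on two machines"

# Hashing is deterministic: identity follows from the data itself.
hash_a = hashlib.sha256(data).hexdigest()
hash_b = hashlib.sha256(data).hexdigest()

# Random UUIDs carry no relationship to the content.
uuid_a = uuid.uuid4()
uuid_b = uuid.uuid4()

assert hash_a == hash_b   # same data, same identifier
assert uuid_a != uuid_b   # same data, unrelated identifiers
```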
mieubrisse 4 hours ago||
I've been wondering this too: for us, UUIDs are super opaque, but for an agent, two UUIDs are as distinct as day and night. Is the best filesystem just S3-style blob storage with good indexes, and a bit of context on where everything lives?

One thing directories solve: they're great grouping mechanisms. "All the Q3 stuff lives in this directory"

I bet we move towards a world where files are just UUIDs, then directory structures get created on demand, like tags.

para_parolu 4 hours ago|||
A filepath is just a unique name that a model can identify easily and understand grouping from. A UUID solves nothing but requires another mapping from file to short description.
JoeAltmaier 3 hours ago||
UUIDs solve oh so very, very much.

You can have several versions of the same set of data objects at once - an entire source set for a build, all the names duplicated but tagged with 'revision' so they can be distinguished.

Hard to do that without a UUID at the root, to use for unique identification of the particular 'particle' of the particular data set.

JoeAltmaier 3 hours ago|||
Or, have a "Q" attribute and ask the file store for "Q=3".

All good.

zmmmmm 2 hours ago||
I don't think there's a lot magical about files beyond (a) they are native for LLMs and coding, because both process text, and (b) when things are rapidly in flux, unstructured formats prosper because flexibility is king. Literally any fixed format you try to describe becomes rapidly outdated and fails to serve the purpose. For example, it feels like MCP is already ageing like milk.

Which is mainly to say: trust me, this is a temporary state; the god of complexity is coming. It is utterly inevitable. The people who created React, Kubernetes, all those Java frameworks you hated, etc. didn't go away. They are right now thinking about how amazing it would be if you stacked ten different tools together with brand new structured file formats and databases. We already have "beads" and "gastown", where this is starting. Enjoy these times, because a couple of years from now it will already be the end of the "fun" part, I think.

leonflexo 6 hours ago||
I wonder how much of a lost-in-the-middle effect there is, and whether there could be (or already are) tools that specifically optimize post-compaction "seeding". One problem I've run into with open spec is that after a compaction, or when kicking off a new session, it is easy to start out already ~50k tokens in, and I assume somewhat more vulnerable to lost-in-the-middle effects before any actual coding has taken place.
ramoz 7 hours ago||
I think the real impact behind the scenes here is Bash(). Filesystem relevance is a bit coincidental to placing an agent on an operating system and giving it full capability over it.
stephbook 2 hours ago||
I'm not too deep into agentic coding, but I haven't understood why people write `SOUL.md` files like there's no tomorrow. Does anyone think these will be called the same three years from now?

If you've got a coding convention, enforce it using a linter. Have the LLM write the rules and integrate them into the local build and CI tooling.

Has no one ever thought about how – gasp – a future human collaborator would be onboarded?

0xbadcafebee 6 hours ago||
Can we bring back Plan9 architecture now? It had what was essentially MCP. You make a custom device driver, and anything really can be a file. Not only that, but you can network them, so a file on local disk could be a display on a remote host (or whatever). Just tell the agent to read/write files and it doesn't need to figure out either MCP or tool calls.
bnjms 4 hours ago|
This seems like the place to ask. What other big ideas have there been since everything-is-a-file? I’m not aware of any. And it seems like we want another layer of permissions on device & data access that we haven't had before.
jmclnx 9 hours ago||
Funny, decades ago (mid-80s), I had to write a one-time fix on what would now be a very low-memory system; the data in question had a unique key of 8 7-bit ASCII characters.

Instead of reading multi-megabyte data into memory to determine what to do, I used the file system: the program would store data related to the key in subdirectories instead. The older people saw what I did and thought it was interesting. With development time factored in, doing it this way ended up being much faster and avoided memory issues that would otherwise have occurred.
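
A rough reconstruction of the trick (the directory layout and file names here are guesses, not the original program): spell the key out as a directory path so the filesystem does the lookup instead of an in-memory structure.

```python
# Use the key itself as a directory path; each key character is one level.
import os
import tempfile

def key_path(root: str, key: str) -> str:
    return os.path.join(root, *key)  # "ABCD" -> root/A/B/C/D

def put(root: str, key: str, value: bytes) -> None:
    path = key_path(root, key)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "data"), "wb") as f:
        f.write(value)

def get(root: str, key: str) -> bytes:
    with open(os.path.join(key_path(root, key), "data"), "rb") as f:
        return f.read()

root = tempfile.mkdtemp()
put(root, "ABCD1234", b"record")
assert get(root, "ABCD1234") == b"record"
```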

So with AI, back to the old ways I guess :)

bsenftner 7 hours ago|
Reminds me of early data-driven approaches. Early CD-based game consoles had memory constraints, which I sidestepped by writing the most ridiculously simple game engine: the game loop was all data driven, and "going somewhere new" in the game simply triggered a disc read given a raw sector offset and a number of sectors. The data read was a repeated series of records: the first 4 bytes gave the memory address to write to, the next 4 bytes how many bytes to copy, followed by the bytes themselves. That simple mechanism, paired with a data organizer for creating the disc images, enabled some well-known successful games to have "huge worlds" with an executable under 100K, leaving the rest of the console's memory for content assets, animations, whatever.
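
A guess at what that record format looks like in practice (the exact layout and endianness of the original are unknown; little-endian and a bytearray "memory" are assumptions here): a loader just replays (address, count, payload) records into memory.

```python
# Replay (4-byte address, 4-byte count, payload) records into a memory image.
import struct

def pack_record(addr: int, payload: bytes) -> bytes:
    return struct.pack("<II", addr, len(payload)) + payload

def replay(stream: bytes, memory: bytearray) -> None:
    off = 0
    while off < len(stream):
        addr, count = struct.unpack_from("<II", stream, off)
        off += 8
        memory[addr:addr + count] = stream[off:off + count]
        off += count

memory = bytearray(64)  # stand-in for the console's RAM
stream = pack_record(0, b"HELLO") + pack_record(32, b"WORLD")
replay(stream, memory)
assert bytes(memory[0:5]) == b"HELLO" and bytes(memory[32:37]) == b"WORLD"
```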
alexjplant 6 hours ago||
Which games were these out of interest? I enjoy reading about game dev from the nascent era of 3D on home consoles (on the Saturn in particular) and would love to hear more.
bsenftner 6 hours ago||
Tiger Woods Golf PSX was one, RoadRash3D0 another. Dozens that were never popular too.
TacticalCoder 8 hours ago|
As TFA basically says: files on a filesystem are a DB. Just a very crude one. There aren't nice indexes for a variety of things. "Views" are not really there (arguably you can create different views with links, but it's, once again, very crude). But it's definitely a DB, represented as a tree indeed, as TFA mentions.

My life's data, including all the official stuff (bank statements, notary acts, statements made to the police [witness, etc.], insurance, property titles), all my coding projects, all the family pictures (not just the ones I took) and all the stuff I forgot, is in files, not in a dedicated DB. But these files are definitely a database.

And because I don't want to deal with data corruption, and even less want to deal with syncing now-corrupted data, many of my files contain, in their filename, a partial cryptographic checksum. E.g. "dsc239879879.jpg" becomes "dsc239879879-b3-6f338201b7.jpg" (meaning the Blake3 hash of that file has to begin with 6f338201b7 or the file is corrupted).
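
A sketch of that scheme, using BLAKE2 from Python's standard library as a stand-in (BLAKE3 needs a third-party package, so the tag below is "-b2-" rather than "-b3-", and the tag length is arbitrary):

```python
# Embed a hash prefix in the filename; verify the bytes against it on read.
import hashlib
import re

def tagged_name(name: str, data: bytes, digits: int = 10) -> str:
    stem, dot, ext = name.rpartition(".")
    tag = hashlib.blake2b(data).hexdigest()[:digits]
    return f"{stem}-b2-{tag}{dot}{ext}"

def verify(name: str, data: bytes) -> bool:
    m = re.search(r"-b2-([0-9a-f]+)\.", name)
    return bool(m) and hashlib.blake2b(data).hexdigest().startswith(m.group(1))

data = b"fake jpeg bytes"
name = tagged_name("dsc239879879.jpg", data)  # dsc239879879-b2-<hash>.jpg
assert verify(name, data)
assert not verify(name, data + b"corruption")
```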

At any time, if I want to, I can import these into "real" dedicated DBs. For example, I can pass my pictures read-only to "I'm Mich" (Immich) and then query them: "Find me all the pictures of Eliza" or "Find me all the pictures taken in 2016 on the French Riviera".

But the real database of my whole life is, and shall always be, files on a filesystem.

With a "real" database, a backup can be as simple as a dump. With files backuping involve... Making sure you keep a proper version of all your files.

I'd say files are even more important than the filesystem: a backup on a Blu-ray disc, an ext4-formatted SSD, an exFAT-formatted SSD, or a tape... doesn't matter: the files are the data.

A filesystem is the first "database" with these data: a crude one, with only simple queries. But a filesystem is definitely a database.

The main advantage of this very simple database is that as long as the data are accessible, you know your data is safe and can always use them to populate more advanced databases if needed.

euroderf 5 hours ago||
It's not "crude" if you get hierarchical organization without having to screw around with RECURSIVE, or "closure this" and "closure that". It just works.
rzerowan 6 hours ago|||
Were it more portable, BeOS/Haiku's BeFS would have been a perfect fit in this instance, seeing that it is a filesystem that has database properties via extended attributes [1] and indexing.

Were Haiku more mature/stable, it would have been a nice fit as the OS for the LLM/AI personal use cases.

[1] https://arstechnica.com/information-technology/2018/07/the-b...

ciupicri 6 hours ago|||
Why Blake3 and not say XXH3 64/128 bits (https://xxhash.com/)?
heavyset_go 6 hours ago||
You can get views by using namespaces/cgroups