Posted by ajayvk 10/27/2024

Using SQLite as storage for web server static content (clace.io)
273 points | 126 comments
Szpadel 10/27/2024|
I don't get the argument; swapping a "current" symlink to point at another version has worked for years as an atomic way to swap two site versions.

Using SQLite for static content might have other benefits, but I don't think this is the main one.
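
For reference, a minimal sketch of that pattern (paths are illustrative; -n and -T are GNU coreutils flags):

    # build the new version into its own directory, e.g. /srv/site-v42,
    # then repoint "current" via a fresh symlink plus rename;
    # rename(2) is atomic, so no request ever sees a half-swapped site
    ln -sfn /srv/site-v42 /srv/current.tmp
    mv -T /srv/current.tmp /srv/current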

ajayvk 10/27/2024|
Swapping symlinks is possible. Using a database (SQLite specifically) has other benefits: deduplication is possible, backups are easier, compressed content can be stored, content hashes can be stored, etc.
webstrand 10/27/2024|||
SQLite isn't necessarily easier to back up than a filesystem. I've got a fairly large (~60GB) SQLite database that I'm somewhat eager to get off of. If I'd stuck with a pure filesystem, backing up only the changeset would be trivial, but with SQLite I have to store a new copy of the database.

I've tried various solutions like litestream and even xdelta3 (which generates patches in excess of the actual changeset size), but I haven't found a solution I'm confident in other than backing up complete snapshots.

fanf2 10/27/2024|||
You might like the new sqlite3_rsync command: https://www.sqlite.org/rsync.html
simonw 10/27/2024||
Yeah, that looks ideal for this exact problem, because it lets you stream a snapshot backup of a SQLite database over SSH without needing to first create a duplicate copy using .backup or vacuum. My notes here: https://til.simonwillison.net/sqlite/compile-sqlite3-rsync
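
Usage is a single command; a sketch, with illustrative host and paths:

    # push a consistent snapshot of a live database to a remote replica;
    # sqlite3_rsync must be installed on both ends, and the docs say the
    # databases need to be in WAL mode
    sqlite3_rsync /data/site.db backup-host:/backups/site.db
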
webstrand 10/27/2024||
Maybe that tool just doesn't fit my use-case, but I'm not sure how you'd use it to do incremental backups? I store all of my backups in S3 Glacier for the cheap storage, so there's nothing for me to rsync onto.

I can see how you'd use it for replication though.

simonw 10/27/2024||
If you want incremental backups to S3 I recommend Litestream.
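
For reference, Litestream's CLI form looks roughly like this (bucket name is illustrative; it also supports a config file):

    # continuously stream WAL frames to S3 as they are written
    litestream replicate /data/site.db s3://backup-bucket/site.db

    # recover the latest replicated state to a local file
    litestream restore -o /data/site.db s3://backup-bucket/site.db
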
hedgehog 10/27/2024|||
What do you do about compaction?
webstrand 10/28/2024||
zstd, though that only shaves off a few GB total.
hedgehog 10/28/2024||
Oh, I mean like with the vacuum command. As the databases get larger it can become unwieldy.
webstrand 10/29/2024||
I just tried it; it only recovered a few MB.
hedgehog 10/29/2024||
Aha, so not much need. I've always avoided SQLite for larger databases due to the extra space needed to allow compaction; maybe that's not a real problem in most applications though.
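
For what it's worth, `VACUUM INTO` (SQLite 3.27+) sidesteps the in-place space concern by writing the compacted copy to a new file; a sketch with illustrative paths:

    # classic in-place compaction; needs temporary space while it runs
    sqlite3 site.db 'VACUUM;'

    # write a compacted copy to a separate file instead
    sqlite3 site.db "VACUUM INTO '/backups/site-compact.db';"
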
theturtle32 10/27/2024||||
You could also employ a different filesystem like ZFS or btrfs in tandem with the symlink-swapping strategy to achieve things like deduplication. Or, once you have deduplication at the filesystem level, just construct a new complete duplicate of the folder to represent the new version and use renaming to swap the old for the new, and poof -- atomic changes and versioning with de-duplication, all while continuing to be able to use standard filesystem paradigms and tools.
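
A sketch of the filesystem-level dedup idea (btrfs shown; paths and pool names are illustrative):

    # reflink copy on btrfs (or XFS): the new tree shares extents with the
    # old one, so it only costs the blocks that later diverge
    cp -r --reflink=always /srv/site-v41 /srv/site-v42

    # on ZFS, a snapshot gives a cheap read-only version instead
    zfs snapshot tank/www@v41
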
borsecplata 10/27/2024||||
Deduplication can be achieved the same way as in SQLite, by keeping files indexed by SHA-256. There are also filesystems that provide transparent compression.

Seeing as you need some kind of layer between the web server and SQLite, you might as well keep a layer between the web server and the FS that nets you most or all of the benefits.
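
A sketch of that hash-indexed layout (paths are illustrative):

    # store each blob once, named by its content hash...
    h=$(sha256sum logo.png | cut -d' ' -f1)
    mv logo.png "/srv/blobs/$h"
    # ...and reference it from the served tree; identical files across
    # versions resolve to the same blob
    ln -s "/srv/blobs/$h" /srv/site-v42/images/logo.png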

warble 10/27/2024|||
All of this is easily done on a filesystem too. I would assume this is a performance tradeoff rather than a feature question?
theturtle32 10/27/2024||
Genuinely, I'm curious to hear from OP whether you explored filesystems that provide de-duplication, snapshots, versioning, and compression, and compared that to the SQLite approach. It would be interesting to get your take on how SQLite compares with a more advanced filesystem like ZFS or btrfs. Once you have a filesystem that can de-dup, atomic changes are more or less a matter of building up a directory with the new version first and then swapping one directory for another with a rename. Though I assume there may be a small race condition in between the two directory rename calls. I don't know if there's a standard call to tell the filesystem "mv current old && mv new current" in one shot.
bhaney 10/27/2024||
> I don't know if there's a standard call to tell the filesystem "mv current old && mv new current" in one shot.

Used to be hackily done by overwriting symlinks, but now there's https://manpages.debian.org/testing/manpages-dev/renameat2.2...
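
Recent GNU coreutils (9.4+, I believe) exposes that call directly; a sketch with illustrative paths:

    # RENAME_EXCHANGE via renameat2(2): swap the two directories in one
    # atomic step, with no window where "current" is missing (Linux only)
    mv --exchange /srv/new /srv/current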

ajayvk 10/27/2024||
Clace is meant to be portable and easy to operationalize. It runs on Linux/macOS/Windows. The only external dependency is Docker/Podman for starting containers. Depending on particular filesystem features would make it much more difficult to set up, so I did not explore that. There would be use cases where using filesystem features would be more appropriate.
withinboredom 10/27/2024|||
If you are using Docker, then you know the target system is Linux (unless you are building Windows containers). There is no reason not to rely on filesystem features.
ajayvk 10/28/2024||
Clace runs natively on Windows and macOS; it does not have to run in a container. It can then start containers, either Linux or Windows (the specs at https://clace.io/docs/container/overview/#app-specs are Linux by default).
theturtle32 10/28/2024|||
Makes sense! Portability as a core feature is a good reason not to use platform-specific tech.
dogaar 10/28/2024||
This article seems to suggest that atomic application updates on the server side are by themselves a solution to application versioning, which they aren't. As long as client-side asset caching is enabled, which it is by default for all web applications, the neatly separated application versions will get all mixed up in the browser. One solution would be to serve each application version under a different path prefix.
ajayvk 10/27/2024||
I have been building https://github.com/claceio/clace, a project which makes it easier to manage web apps for internal use (locally and across a team). SQLite is used to store static files. This post talks about the reasoning behind that.
MathMonkeyMan 10/27/2024||
This gives me an idea.

Use git to manage versions of the static files.

Use [git worktree][1] to have multiple separate working trees in different directories.

Use [openat()][2] in your app whenever serving a request for a static file. On each request, open the directory, i.e. `/var/www/mysite.com`, and then use `openat()` with the directory's file descriptor when looking up any files under that directory.

`/var/www/mysite.com` is a symbolic link to some working tree.

When modifying the site, make modifications to the git repository. Then check out the modified working tree into a new directory.

Create a symbolic link to the new directory and `mv` (rename) it to `/var/www/mysite.com`.

Then some time later, `rm -r` the old working tree.

[1]: https://git-scm.com/docs/git-worktree

[2]: https://pubs.opengroup.org/onlinepubs/9799919799/functions/o...
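
A sketch of the worktree half of this (repo path, tag, and directories are illustrative; the openat() part lives in the server code):

    # materialize the new version in its own directory
    git -C /srv/site.git worktree add /var/www/releases/v42 v42

    # atomically repoint the symlink, as described above
    ln -sfn /var/www/releases/v42 /var/www/mysite.com.tmp
    mv -T /var/www/mysite.com.tmp /var/www/mysite.com

    # some time later, retire the old tree
    git -C /srv/site.git worktree remove /var/www/releases/v41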

mistrial9 10/27/2024|
> Use git to manage versions of the static files

bad match -- git stores every version in the history forever. Do you really need every revision of a binary file, completely duplicated? Big files make it even worse.

MathMonkeyMan 10/27/2024||
Hm, good point. I suppose you could use [more git commands][1] to enforce a "nothing older than N commits" policy. But now the solution lacks that dead simple allure.

[1]: https://www.reddit.com/r/git/comments/wk2kqy/delete_files_un...

earthboundkid 10/27/2024||
I'm continually going through this thought process:

1. I should make blog engine using flat files. That way I can diff the plain text.

2. But flat files are hard to read/write structured data from, so I should use SQLite instead.

3. But SQLite is hard to diff. GOTO 1.

crazygringo 10/27/2024||
In every project of mine with databases, I use a tool to export the schema (without data) as a text SQL file on every commit (as a pre-commit hook), to record any changes.

There's no reason you can't do the same but including the content too.

That way you're always committing a text friendly version of the database rather than the database binary encoding. And your diffs will work great.
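
A sketch of that hook, extended to include content (database and file names are illustrative):

    #!/bin/sh
    # pre-commit hook: dump schema plus data as SQL text so diffs are readable
    sqlite3 site.db .dump > site.sql
    git add site.sql
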

sureglymop 10/27/2024|||
But with SQLite, would it be that hard? You could use sqldiff, or even something like this:

diff <(sqlite3 db 'select text from posts where id = 1') <(sqlite3 db 'select text from posts where id = 2')
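
And sqldiff, which ships with the SQLite tools, compares two database files directly:

    # emits the SQL statements that would transform one database into the other
    sqldiff old.db new.db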

ajayvk 10/28/2024||
There is a `version files` command to list the files for a version, including the SHA. I also plan to add an export command to make diffing easier.
redleader55 10/27/2024||
This is interesting as a toy and I'm sure the authors are having a lot of fun implementing their idea.

On the other hand, the state of the art when it comes to performance is zero-copy from disk to NIC using io_uring and DMA, combined with NICs that support TLS & checksum offloading and separate header and payload buffers that can be filled independently by the OS and user space.

I wonder if the authors of these projects ask themselves: what reasons are there to not do it like this? This thread has a few answers.

ajayvk 10/27/2024|
I am the author (of the blog post and of the Clace project). Clace is built for one specific use case: internal web apps, to run locally or on a shared server, for use by a team. Clace is not a general purpose web server. It is an application server which uses containers and sandboxing (when running Starlark code) to allow for multiple apps to be managed easily. Clace just happens to implement some web server features for operational simplicity, to avoid requiring an additional component to be managed.

I am not claiming a SQLite database is always the right approach for serving static files. In my particular use case for Clace, I found a database to work better.

fsiefken 10/27/2024||
I read "There is no equivalent implementation using the file system to compare against, so a direct benchmark test is not done." But Btrfs and ZFS have file versioning built-in, not just snapshots, but also individual files - this can be used to make a similar setup, without using binary blobs, but just the filesystem references in sqlite or hard links - which might even perform faster.
ajayvk 10/27/2024|
I meant there is no Clace implementation which uses the file system instead of the database. That would be required for a direct benchmark. For dev mode apps, Clace loads files from disk, but that dev mode has many differences from the prod mode.
superkuh 10/27/2024||
So the entire argument here is that using SQLite to store files prevents visitors from seeing partial/mid-way changes when updating a website (a process that would take seconds)? Is that really a problem?

Otherwise the file system is far superior for every "benefit" they list. And far easier to work with other software and more robust over time (nothing to change/break).

throwawayie6 10/27/2024||
I made something similar once, but simply wrote the output to a temporary file and then renamed it to the final name when it was done.

From what I remember, this was an atomic operation, and if there was a download in progress it would continue using the old file, because the data was still on disk and the filename was simply a pointer to the inode.

This may behave differently on other file systems. This was done on an old Linux server with ext3

Seems like a simpler solution than using a db.
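
Roughly this, as a sketch (the generator command is hypothetical):

    # write to a temp file on the same filesystem, then rename into place;
    # rename(2) is atomic, and in-flight downloads keep reading the old
    # inode until they close it
    tmp=$(mktemp /var/www/site/.index.html.XXXXXX)
    generate-page > "$tmp"
    mv "$tmp" /var/www/site/index.html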

okl 10/27/2024||
One benefit they list is storing associated metadata in the database (specifically, different types of hashes are mentioned), which is not so easy with a file system.

I think the bigger benefit, though, is the increased read performance on many small files (saving system call overhead). To what extent that applies to static files that a smart server might keep in cache, I don't know.

superkuh 10/27/2024||
I'm not sure what associated metadata is in this context, but www/path/to/filename.jpg and www/path/to/filename.jpg.json would work and be very file-y. I take their/your point about it not being directly integrated though.
jazzyjackson 10/27/2024|
Isn't there a bottleneck at your egress? I thought SQLite was fast at reads but only handles one read at a time, so if someone is downloading a gigabyte binary out of your db, every other connection just has to wait. Am I wrong?

Something I want to try is using SQLite as my filesystem but storing only content hashes that point to an S3-compatible object store, so you get the atomicity and rollbackability and all, but you also get the massive parallelization and multi-region availability of an object store.

Edit: I googled it again and found that multiple processes can read SQLite concurrently, so it shouldn't be a problem.

https://stackoverflow.com/questions/4060772/sqlite-concurren...
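
A hypothetical sketch of that split (bucket name and the files table schema are made up):

    # blob goes to object storage, keyed by content hash
    h=$(sha256sum app.js | cut -d' ' -f1)
    aws s3 cp app.js "s3://assets-bucket/blobs/$h"

    # sqlite keeps only the path -> hash mapping, so version swaps and
    # rollbacks stay transactional while the bytes live in S3
    sqlite3 site.db "INSERT INTO files(path, sha256) VALUES ('/app.js', '$h');"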

catlifeonmars 10/27/2024|
There’s still a bottleneck in that you have to host it on the same machine and egress from the same network interface. Classic tradeoff between consistency and availability though.