Using SQLite as storage for web server static content

Posted by ajayvk 10/27/2024

Using SQLite as storage for web server static content(clace.io)

273 points | 126 comments

simonw 10/27/2024|

I did some experiments around this idea a few years ago, partly inspired by the "35% Faster Than The Filesystem" article: https://www.sqlite.org/fasterthanfs.html

Here are the notes I made at the time: https://simonwillison.net/2020/Jul/30/fun-binary-data-and-sq...

I built https://datasette.io/plugins/datasette-media as a plugin for serving static files from SQLite via Datasette and it works fine, but honestly I've not used it much since I built it.

A related concept is using SQLite to serve map tiles - this plugin https://datasette.io/plugins/datasette-tiles does that, using the MBTiles format which it turns out is a SQLite database full of PNGs.

If you want to experiment with SQLite to serve files you may find my "sqlite-utils insert-files" CLI tool useful for bootstrapping the database: https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...

ajayvk 10/27/2024||

In terms of performance, the content hash based file name approach is easier to implement with SQLite storage. The content hash has to be generated just once during file upload, not every time the web server is restarted (or using a build step which does actual file renames). It is possible to do it dynamically for file system files (see https://github.com/benbjohnson/hashfs for a embedFS implementation), but the database makes it a bit easier.

azornathogron 10/27/2024||

This is pretty cool, but if you're rewriting resource names server side then you need some machinery to rewrite references to those resources. I see the example you linked shows the root HTML being generated server-side from a template so content hash names can be injected into the template. Looks reasonable. Not so great for dealing with things like `import` calls in javascript (for deferred loading of script modules), if those need to be rewritten as well.

Your JS bundler (if you use one!) might be effectively rewriting imports anyway so it can be convenient to do the content-hash-named-files rewrites in the same place. But not everyone wants to use a bundler for that.

ajayvk 10/27/2024||

Clace can be used to build hypermedia driven UI's, using Go templates and HTMX. https://clace.io/docs/develop/templates/#static-function gives the content hashed file name, the templates have to use that function. See https://github.com/claceio/apps/blob/f5566cea6061ec85ea59495... for an example, that game is running live at https://cowbull.co/

Clace can also run containers, where the UI is served by the container. For example, Streamlit/Gradio based apps. In that case, Clace acts like an app server and reverse proxy to the container, no file name rewrites are done.

Clace includes esbuild for being able to import npm packages as ES modules https://clace.io/docs/develop/javascript/. There is no support for any JS bundling.

westurner 10/28/2024|||

Is there a way to sendfile from a SQLite database like xsendfile?

requests-cache caches requests in SQLite by (date,URI) IIRC. https://github.com/requests-cache/requests-cache/blob/main/r...

/? pyfilesystem SQLite: https://www.google.com/search?q=pyfilesystem+sqlite

/? sendfile mmap SQLite: https://www.google.com/search?q=sendfile+mmap+sqlite

https://github.com/adamobeng/wddbfs :

> webdavfs provider which can read the contents of sqlite databases

There's probably a good way to implement a filesystem with Unix file permissions and xattrs extended file attribute permissions atop SQLite?

Would SQLite be faster or more convenient than e.g. ngx_http_memcached_module.c ? Does SQLite have per-cell ACLs either?

rnewme 10/28/2024||

Not all you asked but Sqlite archive can be mounted using fuse, https://sqlite.org/sqlar/doc/trunk/README.md

westurner 10/31/2024||

FUSE does local filesystems, and there are also HTTP remote VFS for sqlite.

sqlite-wasm-http is based on phiresky/sql.js-httpvfs, "a read-only HTTP-Range-request based virtual file system for SQLite. It allows hosting an SQLite database on a static file hoster and querying that database from the browser without fully downloading it." "but uses the new official SQLite WASM distribution." [1][2]

[2] sql.js-httpvfs: https://github.com/phiresky/sql.js-httpvfs

[1] sqlite-wasm-http: https://www.npmjs.com/package/sqlite-wasm-http

[3] sqlite3vfshttp: https://github.com/psanford/sqlite3vfshttp

HTTP Range requests: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requ...

snarg 10/27/2024|||

Using sqlite keeps a persistent open database connection, so you only have to send requests for content that sqlite has probably cached. When reading static files, you open, read, and close a file with every request, meaning more context switches, even though the filesystem layer will have cached the file content. If you want to speed this up, the appropriate solution is to add a caching front-end, not to turn everything into a database. It will be faster than using sqlite and easier to maintain and troubleshoot.

eknkc 10/28/2024||

Well you can store metadata in SQLite. You can update metadata and files in a transaction. It already has a caching layer built in (I think 8mb by default, that might need some tuning for this use case). It is thread safe (I mean it is a giant rw mutex but whatever) Does not sound that bad to just use it instead of building a caching layer that might be marginally faster than sqlite. And if you restart your caching frontend thing, it will need to rebuild the cache using tons of syscalls. SQLite cache will fill up easier as it has less context switches.

One major downside I see would be backing up the database. You can do granular backups on a filesytem. Even rsync would work fine. You’d need to basically snapshot the entire thing here.

mharig 10/28/2024||

sqlite-rsync is on the road.

pdimitar 10/29/2024||

I have no idea why this comment was dead and I vouched for it.

Just recently the `sqlite3-rsync` tool was added to SQLite and it does point-in-time backup, effectively.

mharig 10/30/2024||

[dead]

nabla9 10/27/2024||

Isn't anything that avoids too many syscalls and copying between userspace and kernel 35% faster? Including filesystems that run completely on user space (not FUSE, FUSE calls go trough kernel)

ori_b 10/28/2024||

In some ways, it seems surprising that avoiding open/read/write syscalls and caching in user space is only 35% faster.

manmal 10/28/2024||

My hypothesis is that the scenarios the SQLite team has setup don’t particularly exploit the fact that the SQLite metrics become more favorable the more files are accessed.

ori_b 10/31/2024||

I took a look at the code, and the file based code is basically the worst case for the file system, accessing about as many files as it can.

Additionally, it's doing a lot of unnecessary work. It's using buffered I/O to do a full file read, and throwing away the buffer immediately, which is an expensive way to add an unnecessary memcpy. It's opening by full path instead of openat, which incurs a path lookup penalty.

I think the file based code can be quite a bit faster, if you put a bit of effort in.

I think in serious use, the gap would narrow, not grow.

azornathogron 10/27/2024||

> Transactional Updates : This is the main benefit. Updating multiple files can be done in one transaction. Isolation ensures that there are no broken webapps during the update.

Your server on its own, whether it uses SQLite or the filesystem, cannot save you from having broken webapps during an update. Each page in the browser is a _tree_ of resources, which are fetched with separate HTTP requests and therefore not subject to your server-side transaction/atomic update system. You can change all resources transactionally on the server side but the browser can still observe a combination of old and new resources.

The common solution here is to ensure that all sub-resources of a page (javascript bundle(s), stylesheet(s), media, etc) are named (ie, in the URL) using a content hash or version. All resources (from the root HTML document down) need to refer to the specific content-hash or versioned name, so that if the browser loads version X of the root HTML document then it will load the _corresponding version_ of all sub-resources. Not only that, but if you're updating a page from version X to version Y, all the version-X sub-resources need to remain available after you start serving page version Y, until you're confident that no browser could reasonably be still loading version X of the page, otherwise you can break page version X while someone is still loading it.

This means you actually specifically don't want to put sub-resources for a page into one atomically-switched-out bundle along with the root HTML document, because if you do that you'll be removing the previous versions of sub-resources while they may still be referenced.

Also of course in some cases there might be some sub-resources (e.g., perhaps some media files) that you want to be versioned "separately" from the HTML document that contains them, so that you can update them without busting caches for all of the page/app structural elements like your javascript blobs and stylesheets and so on, and you _might_ need to take that into account in your page build system as well.

withinboredom 10/27/2024||

> until you're confident that no browser could reasonably be still loading version X of the page

During experiments of this stuff at a big company (which saw a large portion of the web during this time):

- we saw most (>80%) of users staying on a web app ~2-3 days (most likely skewed from people leaving tabs open over the weekend).

- 95% was ~2 weeks

- 100% was about 600 day (yes, apparently we had users with a tab open for nearly 2 years)

If you are seeking 100%, you are going to be waiting a while.

: this is 100% from memory, I don't work there anymore.

azornathogron 10/27/2024|||

Presumably at some point you trigger a forced client side reload to get people off old client versions?

I'm impressed you kept the measurements going long enough to track the 600-day users though!

withinboredom 10/27/2024||

This is what pointed out a reason for implementing that feature!

yawaramin 10/28/2024||

Nowadays we get this for free with Chrome

keeganpoppen 10/28/2024|||

that 100%ile sounds like me...

ajayvk 10/27/2024|||

For Hypermedia driven web apps, interactions within a page return small HTML fragments. For larger UI changes, a full page update is recommended (the Multi Page Application MPA style). The scenario of a user staying on a page for long and getting broken links is more of a SPA issue.

I agree generally with your comment. Transactional updates prevent only one category of update related issues. There can be other issues at the app level which can cause a broken experience. Continuing to serve older versions of static content when referenced by content hash is possible, it is not currently implemented by Clace.

hinkley 10/27/2024||

This helps with front end caches and CDNs as well.

The main trick is upload all of the non-html changes before the html changes, so that no file is referenced before it exists. If you want to make the app as complex as possible you do a depth first search for upload. But if you value your sanity you relax the problem and have assets-first in your app.

SAHChandler 10/27/2024||

Back in 2011/2012 I worked for a small game development company and at my recommendation we moved all our assets under 100kb into an sqlite3 db, and then made a "pak file" and stored offsets to said files inside the sqlite3 db (and this was done because there was a post-mortem talk given by Richard Hipp where he said in hindsight he'd wish blobs had been treated more like an inode, where they were an offset further into the database and blobs were just appended to the file)

It was hella fast how quickly we could load our assets (this was for a mobile game so only a handful of assets were not in the db). It's been neat seeing people adopt it further.

One aspect someone might also not realize is you can store near infinite metadata alongside your content, so you can always just query the database to find "similar" files. We threw so much metadata into the DB and I think in the end the pak file was 200MB, and the database was something like 20MB. Again, this was a mobile game. I think the worst thing we had happen on the client side was a single double inner join that we couldn't figure out how to reduce due to some server side complexity (we weren't able to implement the server, which was frustrating as the people we worked with were very bad™ at software development so our builds would break out of nowhere when they would change the entire backend spec without alerting us )

We also used a separate sqlite3 database for game replays so you could (post match completion) replay an entire game and see what each opponent was doing during said game. This was super nice for automated testing as well.

samuelstros 10/27/2024||

We also came around to dumping files into SQLite instead of dealing with the filesystem (and git) for the lix change control system. This article touches on problems we encountered: https://opral.substack.com/i/150054233/breaking-git-compatib....

- File locking, concurrency, etc., are problems that SQLite solves

- Using SQLite allows drumroll querying files with SQL instead of platform-dependent fs APIs

- Using SQL queries is typesafe with Kysely https://kysely.dev/ (without the need for an ORM)

egeozcan 10/27/2024||

> Using SQL queries is typesafe with Kysely https://kysely.dev/ (without the need for an ORM)

Wow, this is even better than what I've seen people do with F# type providers. Cool cool cool.

samuelstros 10/27/2024||

the types kysely implemented are crazy. even custom where clause like this one [0] are typesafe :O

[0] https://github.com/opral/monorepo/blob/99356e577f558f4442a95...

surrealize 10/27/2024||

Totally random, but I had a guess about the ts error. I hadn't seen kysely before, very cool!

  diff --git a/packages/lix-sdk/src/query-utilities/is-in-simulated-branch.ts b/packages/lix-sdk/src/query-utilities/is-in-simulated-branch.ts
  index 7d677477e..39502f245 100644
  --- a/packages/lix-sdk/src/query-utilities/is-in-simulated-branch.ts
  +++ b/packages/lix-sdk/src/query-utilities/is-in-simulated-branch.ts
  @@ -21,10 +21,9 @@ export function isInSimulatedCurrentBranch(
                // change is  not in a conflict
                eb("change.id", "not in", (subquery) =>
                        subquery.selectFrom("conflict").select("conflict.change_id").unionAll(
  -                             // @ts-expect-error - no idea why
                                subquery
                                        .selectFrom("conflict")
  -                                     .select("conflict.conflicting_change_id"),
  +                                     .select("conflict.conflicting_change_id as change_id"),
                        ),
                ),
                // change is in a conflict that has not been resolved

samuelstros 10/27/2024||

You are now forever in our git history https://github.com/opral/monorepo/commit/58734e11e51d8e20092.... the ts-expect-error was indeed fixed by your suggestion

samuelstros 10/27/2024|||

Today i learned that HN has a nesting limit. @surrealize i fixed the username https://github.com/opral/monorepo/commit/7dc1f3c806bd89c6d68...

aspenmayer 10/28/2024|||

The conversation in some aspect may have tripped the flamewar detection, which removes the reply button iirc. You can manually reply to a specific comment by clicking/tapping on on its timestamp.

surrealize 10/27/2024|||

Haha, thanks!

surrealize 10/27/2024||||

Cool! I'm surrealize though, sureglymop wrote a sibling comment! I should have just made a PR, lol.

ec109685 10/27/2024|||

Isn’t kind just overriding the type checker? Should that as be necessary?

surrealize 10/27/2024|||

In the context of the union query, I think it makes sense. The query takes two different columns (with two different names) from the same table, and combines them together into one column. The TS error happened because the column names didn't match between the two union inputs. But they would never match without the "as".

samuelstros 10/27/2024|||

Yes. The runtime was not affected. Hence, the ts-expect-error. Still nice to have the ts-expect-error fixed :)

mmerickel 10/28/2024|||

I'm a huge proponent of using SQLite as an abstraction over a filesystem. One warning I will note though, is be aware that a SQLite database does not shrink unless you vacuum it (basically copies the data into a separate file and deletes the original). This is a manual operation you have to do at points where it makes sense within your application. So be careful with disk usage when just writing binary data and then deleting it.

samuelstros 10/28/2024||

That's a good tip. I was wondering how SQLite avoids page fragmentation. The answer is "it depends," but VACUUM is to the rescue.

What are your reasons for advocating for SQLite as a filesystem abstraction?

hinkley 10/27/2024|||

You can add as much metadata to the files as your little heart desires.

sureglymop 10/27/2024||

Regarding your lix project, have you looked at Fossil? It seems like it could maybe do what you're trying to do with some small changes.

samuelstros 10/27/2024||

Yes. We looked at all version control systems (fossil, pijul, jj, git, darc, sapling). None provide what we need:

- target the browser

- supports any file format, not just text files

- understands changes, not just versioning of files

- suited to build web apps on top

- ultimately enable 1000's of artists, designers, lawyers, civil engineers, etc. to collaborate

we are 2 years into the journey. dec 16, we'll release a public preview

sureglymop 10/27/2024|||

Interesting! Definitely looking forward to that and I hope you succeed.

p4bl0 10/27/2024||

Amusingly, my static site generator CMS does precisely the reverse of what's done here. During development/updating of the website, all pages and posts are entries in an SQLite database that is manipulated through a web interface that present an editable version of the website. And then the website is dumped as static pages to the file system to be deployed directly and/or downloaded as a zip and uploaded anywhere else for deployment (including entirely static hosting services).

Macha 10/27/2024|

I use Zola for now, but for a variety of reasons I'm contemplating building my own to let me tweak stuff a bit more flexibly, and this kind of approach has been on my mind. How has it been working for you?

p4bl0 10/27/2024||

It works great. I actually build it for a few friends initially and now I have about a hundred people using it, mostly for personal web pages (which it was designed for), but not only. I also use it to manage the websites of my uni department and research group for example, at https://informatique.up8.edu/ [website is in french].

Source code is available here if you wanna take a look: https://code.up8.edu/pablo/cigala

A warning: the code is not written to be pretty, it's written to be simple and effective. The initial goal being to have the whole software in a single PHP file that could even be used offline to produce a fully static website.

thisislife2 10/27/2024||

From Appropriate Uses For SQLite - https://www.sqlite.org/whentouse.html :

> The amount of web traffic that SQLite can handle depends on how heavily the website uses its database. Generally speaking, any site that gets fewer than 100K hits/day should work fine with SQLite. The 100K hits/day figure is a conservative estimate, not a hard upper bound. SQLite has been demonstrated to work with 10 times that amount of traffic ... The SQLite website (https://www.sqlite.org/) uses SQLite itself, of course, and as of this writing (2015) it handles about 400K to 500K HTTP requests per day, about 15-20% of which are dynamic pages touching the database. Dynamic content uses about 200 SQL statements per webpage. This setup runs on a single VM that shares a physical server with 23 others and yet still keeps the load average below 0.1 most of the time. See also: https://news.ycombinator.com/item?id=33975635

ajayvk 10/27/2024|

100K/day is less than 2/second. I think those numbers need to be updated on the SQLite page. For a read heavy workload, like serving static files, SQLite can do much more. Since the content caching headers are set, the browser will cache content, causing request to the server to be required for new clients only. I do not expect SQLite to be the bottleneck for most use cases.

therealdrag0 10/28/2024||

Drives me nuts when people use rates of any unit besides sec or min. Anything longer is not useful information, since traffic varies over time and what you care about is peak rates.

Lockal 10/28/2024||

The idea of using SQLite to serve static content based solely on the "35% Faster Than The Filesystem" page from 2017 sounds half-baked, to say the least.

Modern web servers like Nginx employ optimal strategies for handling static files, starting with sendfile and extending to io_uring and splice operations, all within well-designed thread pools based on epoll, kqueue, or eventport (whichever suits your needs).

Meanwhile, the best SQLite can offer (at least natively) is its memory-mapped I/O support (https://www.sqlite.org/mmap.html). This might work well for single-client services, like locally hosted web apps (see also https://github.com/electron/asar). However, for large websites, as mentioned in other comments, it attempts to solve a problem that doesn't exist.

UniverseHacker 10/27/2024||

I do a lot of high performance scientific computing and find the most flexible and high performance way to access data- especially in parallel- is often a read only sqlite database on a ramdisk. It feels super hacky, but it is easier to setup and faster than anything else I’ve found.

antognini 10/27/2024||

Honestly it doesn't really feel like a hack to me. Fast efficient parallel reads are something that SQLite was specifically designed to do.

A friend of mine in astronomy once observed that a lot of people in the sciences should get comfortable working with databases because otherwise they just end up inadvertently spending a huge amount of effort rolling their own really crappy database.

edweis 10/27/2024|||

We would love a write up about your findings.

UniverseHacker 10/27/2024||

Probably not, I’m not a CS person and just found a hacky thing that was surprisingly fast enough for my use. I didn’t do a systematic comparison with other methods.

jfkfif 10/27/2024||

More performant than using HDF5?

UniverseHacker 10/27/2024||

I doubt it- I didn’t do a systematic comparison, but stuck with this because it was fast enough that it wasn’t the bottleneck

deskr 10/27/2024||

> Why this approach is not more common?

Because file system excels at handling files. Need an atomic update? Checkout into a new directory and switch a symlink.

I've seen various versions of using a database as a filesystem. They have a nice side to them and then a nightmare side to them when the shit hits the fan.

eddd-ddde 10/28/2024|

Exactly what I was thinking, what happened to just making v2/index.html.

You can then also use something like btrfs to dedup at the filesystem layer as well.

zzzeek 10/27/2024|

> When updating an app, since there could be lots of files being updated, using a database would allow all changes to be done atomically in a transaction. This would prevent broken web pages from being served during a version change.

but....the SQLite file is locked against reads while writing in order to achieve serializable isolation. which sort of indicates you're better off doing your database work in an offline file, then just swapping the new file for the old one that's in production. which sort of indicates....just use a tar file, or a separate directory that you swap in to update the new content.

it is so much easier to serve static files statically rather than from some program that's trying to manage live SQLite connections and achieve some weird "update concurrency" magic when this problem is not at all that hard to solve. It's fine to manage your CMS in a sqlite database but when you serve the content live and it's static, use static files.

okl 10/27/2024|

> but....the SQLite file is locked against reads while writing in order to achieve serializable isolation.

Not with WAL: https://www.sqlite.org/wal.html

simonw 10/27/2024||

And even without WAL (which you should absolutely be using if you're serving web content with SQLite) the lock for most writes lasts for a tiny fraction of a second.

zzzeek 10/27/2024|||

small writes, which is still a dramatically larger pause than simply copying a few files to a directory and not pausing anything. if the website update is hundreds of large files, then the SQLite write is going to be large also. it then comes down to, "is it faster to copy 200M of files to a filesystem or write 200M of new data to BLOBs in a single monolithic SQLite file?" I'd bet the former in that race

okl 10/27/2024||

YMMV depending on the exact setup but SQLite is usually faster for many small BLOBs than the filesystem, https://www.sqlite.org/fasterthanfs.html

resoluteteeth 10/27/2024|||

I might be misremembering, but if you're using a transaction like in the article but using the rollback journal mode rather than WAL, won't sqlite actually hold the lock on the database for the entire time until the transaction is committed, which might actually be a substantial amount of time if you're writing lots of blobs like in the article even if each individual blob doesn't take that long?

More comments...