Posted by theblazehen 9 hours ago

Gwtar: A static efficient single-file HTML format (gwern.net)
154 points | 52 comments
isr 2 hours ago||
Hmm, so this is essentially the appimage concept applied to web pages, namely:

- an executable header

- which then fuse mounts an embedded read-only heavily compressed filesystem

- whose contents are delivered when requested (the entire dwarf/squashfs image isn't decompressed all at once)

- allowing you to pack as many of the dependencies as you wish to carry in your archive (so, just like an appimage, any dependency which isn't packed can be found "live")

- and doesn't require any additional, custom infrastructure to run/serve

Neat!

spankalee 6 hours ago||
I really don't understand why a zip file isn't a good solution here. Just because it requires "special" zip software on the server?
gwern 3 hours ago||
> Just because it requires "special" zip software on the server?

Yes. A web browser can't just read a .zip file as a web page. (Even if a web browser decided to try to download, and decompress, and open a GUI file browser, you still just get a list of files to click.) Therefore, far from satisfying the trilemma, it just doesn't work.

And even if you fix that, you still generally have to choose between staying single-file and being efficient. (You can serve a split-up HTML from a single ZIP file with some server-side software, which gets you efficiency, but now it's no longer single-file; and vice versa. Because if it's just a ZIP, how does the browser stop downloading and fetch only the parts it needs?)

spankalee 2 hours ago||
We're talking about servers here - the article specifically said that one of the requirements was no special _server_ software, and a web server almost certainly has zip (or tar) installed. Apparently these gwtar files don't work without a server either.
gwern 1 hour ago||
I'm not following your point here. Yes, a web server (probably) has access to zip/tar utilities, but so what? That doesn't automagically make a random .zip jump through hoops to achieve anything beyond 'download like a normal binary asset'. That's all a ZIP file does. Meanwhile, Gwtar works with any server out of the box: it is just an HTML file using pre-existing standardized HTTP functionality, and it works even if the server declines to support range requests for some wacky reason like undocumented Cloudflare bugs and downgrades Range to GET. (It just loses efficiency, but it still works, you know, in the way that a random .zip file doesn't work at all as a web page.) You can upload a Gwtar to any HTTP server or similar thing like an AWS bucket and it will at least work, with zero configuration or plugins or additional executables or scripting.
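
To make that degradation concrete, a client-side loader could look something like this (a minimal sketch, not the actual Gwtar code; the function and offsets here are hypothetical):

    // Hypothetical sketch: fetch one asset by byte range, falling back
    // gracefully if the server ignores Range and returns the whole file.
    async function fetchAsset(url: string, offset: number, size: number): Promise<Uint8Array> {
      const resp = await fetch(url, {
        headers: { Range: `bytes=${offset}-${offset + size - 1}` },
      });
      const body = new Uint8Array(await resp.arrayBuffer());
      if (resp.status === 206) {
        return body; // server honored the range: exactly the bytes we asked for
      }
      // Status 200: the server downgraded Range to GET and sent the whole
      // file. Less efficient, but still correct: slice locally.
      return body.slice(offset, offset + size);
    }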

Now, maybe you mean something like, 'a web server could additionally run some special CGI software or a plugin or do some fancy Lua scripting in order to munge a ZIP and split it up on the fly so as to do something like serve it to clients as a regular efficient multi-file HTML page'. Sure. I already cover that in the writeup, as we seriously considered this and got as far as writing a Lua nginx script to support special range requests. But then... it's not single-file. It's multi-file - whatever the additional special config file, script, plugin, or executable is.

newzino 5 hours ago||
Zip stores its central directory at the end of the file. To find what's inside and where each entry starts, you need to read the tail first. That rules out issuing a single Range request to grab one specific asset.

Tar is sequential. Each entry header sits right before its data. If the JSON manifest in the Gwtar preamble says an asset lives at byte offset N with size M, the browser fires one Range request and gets exactly those bytes.

The other problem is decompression. Zip entries are individually deflate-compressed, so you'd need a JS inflate library in the self-extracting header. Tar entries are raw bytes, so the header script just slices at known offsets. Needing no decompression code keeps the preamble small.
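
Sketched as browser code (illustrative only; the manifest field names here are invented, not Gwtar's actual format):

    interface ManifestEntry { offset: number; size: number }

    // Look up an asset in the (hypothetical) preamble manifest and fetch
    // exactly its bytes out of the tar with a single Range request.
    async function loadAsset(
      archiveUrl: string,
      manifest: Record<string, ManifestEntry>,
      path: string,
    ): Promise<Blob> {
      const entry = manifest[path];
      if (!entry) throw new Error(`no such asset: ${path}`);
      // Tar stores the entry's bytes verbatim right after its header, so
      // the response body is the asset itself; no decompression needed.
      const resp = await fetch(archiveUrl, {
        headers: { Range: `bytes=${entry.offset}-${entry.offset + entry.size - 1}` },
      });
      return resp.blob();
    }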

fluidcruft 4 hours ago||
You can also read a zip sequentially like a tar file. Some info lives only in the central directory, but just for getting file data you can read the local file records sequentially. There are caveats about files that appear multiple times, but those caveats also apply to processing tar streams.
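
A rough sketch of that sequential walk (illustrative; it ignores the data-descriptor case, where sizes are recorded after the data - one of those caveats):

    // Walk a zip's local file headers sequentially, tar-style. Skips
    // entries using data descriptors (flag bit 3), whose sizes only
    // appear after the data.
    function listZipEntries(buf: Uint8Array): { name: string; dataOffset: number; size: number }[] {
      const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
      const entries: { name: string; dataOffset: number; size: number }[] = [];
      let pos = 0;
      // 0x04034b50 is the local-file-header signature "PK\x03\x04" read little-endian.
      while (pos + 30 <= buf.length && view.getUint32(pos, true) === 0x04034b50) {
        const size = view.getUint32(pos + 18, true); // compressed size
        const nameLen = view.getUint16(pos + 26, true);
        const extraLen = view.getUint16(pos + 28, true);
        const name = new TextDecoder().decode(buf.subarray(pos + 30, pos + 30 + nameLen));
        const dataOffset = pos + 30 + nameLen + extraLen;
        entries.push({ name, dataOffset, size });
        pos = dataOffset + size; // jump straight to the next header
      }
      return entries;
    }
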
O1111OOO 6 hours ago||
I gave up a long time ago and started using "Save as..." in browsers again. At the end of the day, I am interested in the actual content and not the look/feel of the page.

I find it easier to just mass delete assets I don't want from the "pageTitle_files/" directory (js, images, google-analytics.js, etc).

mikae1 5 hours ago||
Have you tried https://addons.mozilla.org/firefox/addon/single-file/?

If you really just want the text content, you could save markdown using something like https://addons.mozilla.org/firefox/addon/llmfeeder/.

ninalanyon 2 hours ago||
On the subject of SingleFile there is also WebScrapBook: https://github.com/danny0838/webscrapbook

I prefer it because it can save without packing the assets into one HTML file. Then it's easy to delete or hardlink common assets.

venusenvy47 1 hour ago||
I see that it gives three choices for saving the assets: single file, zip or folder. Is the zip version just zipping the folder?
gwern 3 hours ago|||
I find that 'save as' horribly breaks a lot of web pages. There's no choice these days but to load pages with JS and serialize out the final quiescent DOM. I also spend a lot of time with uBlock Origin and AlwaysKillSticky and NoScript wrangling my archive snapshots into readability.
TiredOfLife 4 hours ago||
Save as doesn't work on sites that lazy load.
westurner 5 hours ago||
Does this verify and/or rewrite the SRI integrity hashes when it inlines resources?

Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?

WICG/webpackage: https://github.com/WICG/webpackage#packaging-tools

"Use Cases and Requirements for Web Packages" https://datatracker.ietf.org/doc/html/draft-yasskin-wpack-us...

gwern 3 hours ago|
> Does this verify and/or rewrite the SRI integrity hashes when it inlines resources?

As far as I know, we do not have any hash verification beyond that built into TCP/IP or HTTPS etc. I included SHA hashes just to be safe and forward compatible, but they are not checked.

There's something of a question here of what hashes are buying you and what the threat model is. In terms of archiving, we're often dealing with half-broken web pages (any of whose contents may themselves be broken) which may have gone through a chain of a dozen owners, where we have no possible web of trust to the original creator, assuming there is even one in any meaningful sense, and where our major failure modes tend to be total file loss or partial corruption somewhere during storage. A random JPG flipping a bit during the HTTPS range request download from the most recent server is in many ways the least of our problems in terms of availability and integrity.

This is why I spent a lot more time thinking about how to build FEC in, like with appending PAR2. I'm vastly more concerned about files being corrupted during storage or the chain of transmission or damaged by a server rewriting stuff, and how to recover from that instead of simply saying 'at least one bit changed somewhere along the way; good luck!'. If your connection is flaky and a JPEG doesn't look right, refresh the page. If the only Gwtar of a page that disappeared 20 years ago is missing half a file because a disk sector went bad in a hobbyist's PC 3 mirrors ago, you're SOL without FEC. (And even if you can find another good mirror... Where's your hash for that?)
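
(For what it's worth, checking would be cheap with WebCrypto if a client ever wanted to. A sketch, assuming the stored hash is hex-encoded SHA-256, which may not be exactly what Gwtar records:)

    // Sketch: verify a downloaded slice against a stored SHA-256 hash.
    // Assumes hex encoding; Gwtar's actual hash format may differ.
    async function verifySha256(bytes: Uint8Array, expectedHex: string): Promise<boolean> {
      const digest = await crypto.subtle.digest("SHA-256", bytes);
      const actual = Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
      return actual === expectedHex;
    }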

> Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?

No idea. It sounds like you know more about them than I do. What threat do they protect against, exactly?

karel-3d 3 hours ago||
The example link doesn't work for me at all in iOS Safari?

https://gwern.net/doc/philosophy/religion/2010-02-brianmoria...

I will try on Chrome tomorrow.

woodruffw 3 hours ago|
It also doesn't work on desktop Safari 26.2 (or perhaps it does, but not to the extent intended -- it appears to be trying to download the entire response before any kind of content painting.)
renewiltord 6 hours ago||
Hmm, I’m interested in this, especially since it applies no compression: delta encoding might be feasible for daily scans of the data. But for whatever reason, Brave mobile on iOS displays a blank page for the example page. Hmm, perhaps it’s a mobile rendering issue, because Chrome and Safari on iOS can’t display it either: https://gwern.net/doc/philosophy/religion/2010-02-brianmoria...
wetpaws 6 hours ago||
[dead]
nullsanity 6 hours ago|
Gwtar seems like a good solution to a problem nobody seemed to want to fix. However, this website is... something else. It's full of inflated self-importance and overly bountiful prose, and feels like someone never learned to put in the time to write a shorter essay. Even the about page contains a description of the about page.

I don't know if anyone else gets "unemployed megalomaniacal lunatic" vibes, but I sure do.

3rodents 6 hours ago||
gwern is a legendary blogger (although “blogger” feels like underselling it… “publisher”?) and has earned the right to self-aggrandize about solving a problem he has a vested interest in. Maybe he’s a megalomaniac and/or unemployed and/or writing too many words, but after contributing so much, he has earned it.
TimorousBestie 5 hours ago||
I was more willing to accept gwern’s eccentricities in the past, but as we learn more about MIRI and its questionable funding sources, one wonders how much he’s tied up in it.

The Lighthaven retreat in particular was exceptionally shady, possibly even scam-adjacent; I was shocked that he participated in it.

k33n 4 hours ago||
What does any of that have to do with the value of what’s presented in the article?
fluidcruft 6 hours ago|||
What's up with the non-stop knee-jerk bullshit ad hom on HN lately?
Krutonium 6 hours ago|||
We're tired, chief.
esseph 6 hours ago||||
The earth is falling out from under a lot of people, and they're trying to justify their position on the trash heap as the water level continues to rise around it. It's a scary time.
TimorousBestie 5 hours ago|||
Technically it’s only an ad hominem when you’re using the insult as a component in a fallacious argument; the parent comment is merely stating an aesthetic opinion with more force than is typically acceptable here.
isr 2 hours ago||
I read your BRILLIANT synopsis in the tone of Sir Humphrey (the civil servant) from "Yes Minister". Fits perfectly. Take a bow, good sir ...
isr 2 hours ago||
Wow, that's one hell of a reaction to someone's blog post introducing their new project.

It's almost as if someone charged you $$ for the privilege of reading it, and you now feel scammed, or something?

Perhaps you can request a refund. Would that help?