Top
Best
New

Posted by theblazehen 7 hours ago

Gwtar: A static efficient single-file HTML format(gwern.net)
154 points | 52 comments
simonw 5 hours ago|
TIL about window.stop() - the key to this entire thing working, it's causes the browser to stop loading any more assets: https://developer.mozilla.org/en-US/docs/Web/API/Window/stop

Apparently every important browser has supported it for well over a decade: https://caniuse.com/mdn-api_window_stop

Here's a screenshot illustrating how window.stop() is used - https://gist.github.com/simonw/7bf5912f3520a1a9ad294cd747b85... - everything after <!-- GWTAR END is tar compressed data.

Posted some more notes on my blog: https://simonwillison.net/2026/Feb/15/gwtar/

moritzwarhier 4 hours ago||
Not the inverse, but for any SPA (not framework or library) developers seeing this, it's probably worth noting that this is not better than using document.write, window.open and simular APIs.

But could be very interesting for use cases where the main logic lives on the server and people try to manually implement some download- and/or lazy-loading logic.

Still probably bad unless you're explicitly working on init and redirect scripts.

8n4vidtmkvmk 5 hours ago||
Neat! I didn't know about this either.

Php has a similar feature called __halt_compiler() which I've used for a similar purpose. Or sometimes just to put documentation at the end of a file without needing a comment block.

tym0 4 hours ago||
I was on board until I saw that those can't easily be opened from a local file. Seems like local access is one of the main use case for archival formats.
avaer 4 hours ago|
Agreed, I was thinking it's like asm.js where it can "backdoor pilot" [1] an interesting use case into the browser by making it already supported by default.

But not being able to "just" load the file into a browser locally seems to defeat a lot of the point.

[1] https://en.wikipedia.org/wiki/Television_pilot#Backdoor_pilo...

gildas 1 hour ago||
I would like to know why ZIP/HTML polyglot format produced by SingleFile [1] and mentioned in the article "achieve static, single, but not efficiency". What's not efficient compared to the gwtar format?

[1] https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG

gwern 1 hour ago|
'efficiency' is downloading only the assets needed to render the current view. How does it implement range requests and avoid downloading the entire SingleFileZ when a web browser requests the URL?
gildas 1 hour ago||
I haven't looked closely, but I get the impression that this is an implementation detail which is not really related to the format. In this case, a polyglot zip/html file could also interrupt page loading via a window.stop() call and rely on range requests (zip.js supports them) to unzip and display the page. This could also be transparent for the user, depending on whether the file is served via HTTP or not. However, I admit that I haven't implemented this mechanism yet.
gwern 1 hour ago||
> that this is an implementation detail which is not really related to the format. In this case, a polyglot zip/html file could also interrupt page loading via a window.stop() call...However, I admit that I haven't implemented this mechanism yet.

Well, yes. That's why we created Gwtar and I didn't just use SingleFileZ. We would have preferred to not go to all this trouble and use someone else's maintained tool, but if it's not implemented, then I can't use it.

(Also, if it had been obvious to you how to do this window.stop+range-request trick beforehand, and you just hadn't gotten around to implementing it, it would have been nice if you had written it up somewhere more prominent; I was unable to find any prior art or discussion.)

gildas 1 hour ago||
The reason I did not implement the innovative mechanism you describe is because, in my case, all the technical effort was/is focused on reading the archive from the filesystem. No one has suggested it either.

Edit: Actually, SingleFile already calls window.stop() when displaying a zip/html file from HTTP, see https://github.com/gildas-lormeau/single-file-core/blob/22fc...

gwern 37 minutes ago||
What does that do?
gildas 27 minutes ago||
The call to window.stop() stops HTML parsing/rendering, which is unnecessary since the script has downloaded the page via HTTP and will decompress it as-is as a binary file (zip.js supports concatenated payloads before and after the zip data). However, in my case, the call to window.stop() is executed asynchronously once the binary has been downloaded, and therefore may be too late. This is probably less effective than in your case with gtwar.

I implemented this in the simplest way possible because if the zip file is read from the filesystem, window.stop() must not be called immediately because the file must be parsed entirely. In my case, it would require slightly more complex logic to call window.stop() as early as possible.

Edit: maybe it's totally useless though, as documented here [1]: "Because of how scripts are executed, this method cannot interrupt its parent document's loading, but it will stop its images, new windows, and other still-loading objects." (you mentioned it in the article)

[1] https://developer.mozilla.org/en-US/docs/Web/API/Window/stop

zetanor 5 hours ago||
The author dismisses WARC, but I don't see why. To me, Gwtar seems more complicated than a WARC, while being less flexible and while also being yet another new format thrown onto the pile.
simonw 5 hours ago||
I don't think you can provide a URL to a WARC that can be clicked to view its content directly in your browser.
zetanor 5 hours ago||
At the very least, WARC could have been used as the container ("tar") format after the preamble of Gwtar. But even there, given that this format doesn't work without a web server (unlike SingleFile, mentioned in the article), I feel like there's a lot to gain by separating the "viewer" (Gwtar's javascript) from the content, such that the viewer can be updated over time without changing the archives.

I certainly could be missing something (I've thought about this problem for all of a few minutes here), but surely you could host "warcviewer.html" and "warcviewer.js" next to "mycoolwarc.warc" "mycoolwrc.cdx" with little to no loss of convenience, and call it a day?

gwern 1 hour ago||
You could potentially use WARC instead of Tar as the appended container, sure, but that's a lot of complexity, WARC doesn't serialize the rendered page (so what is the greater 'fidelity' actually getting you?) and SingleFile doesn't support WARC, and I don't see a specific advantage that a Gwtar using WARC would have. The page rendered what it rendered.

And if you choose to require separate files and break single-file, then you have many options.

> surely you could host "warcviewer.html" and "warcviewer.js" next to "mycoolwarc.warc" "mycoolwrc.cdx"

I'm not familiar with warcviewer.js and Googling isn't showing it. Are you thinking of https://github.com/webrecorder/wabac.js ?

obscurette 5 hours ago||
WARC is mentioned with very specific reason not being good enough: "WARCs/WACZs achieve static and efficient, but not single (because while the WARC is a single file, it relies on a complex software installation like WebRecorder/Replay Webpage to display)."
iainmerrick 1 hour ago||
I agree with the motivation and I really like the idea of a transparent format, but the first example link doesn’t work at all for me in Safari.
calebm 2 hours ago||
Very cool idea. I think single-file HTML web apps are the most durable form of computer software. A few examples of Single-File Web Apps that I wrote are: https://fuzzygraph.com and https://hypervault.github.io/.
mr_mitm 3 hours ago||
Pretty cool. I made something similar (much more hacky) a while ago: https://github.com/AdrianVollmer/Zundler

Works locally, but it does need to decompress everything first thing.

gwern 2 hours ago|
So this is like SingleFileZ in that it's a single static inefficient HTML archive, but it can easily be viewed locally as well?

How does it bypass the security restrictions which break SingleFileZ/Gwtar in local viewing mode? It's complex enough I'm not following where the trick is and you only mention single-origin with regard to a minor detail (forms).

mr_mitm 1 hour ago||
The content is in an iframe, my code is outside of it, and the two frames are passing messages back and forth. Also I'm monkey patching `fetch` and a few other things.
gwern 1 hour ago||
OK, but how does that get you 'efficiency' if you're doing this weird thing where you serialize the entire page into some JSON blob and pass it in to an iframe or whatever? That would seem to destroy the 'efficiency' property of the trilemma. How do you get the full set of single-file, static, and efficient, while still working locally?
Retr0id 3 hours ago|
It's fairly common for archivers (including archive.org) to inject some extra scripts/headers into archived pages or otherwise modify the content slightly (e.g. fixing up relative links). If this happens, will it mess up the offsets used for range requests?
gwern 2 hours ago|
The range requests are to offsets in the original file, so I would think that most cases of 'live' injection do not necessarily break it. If you download the page and the server injects a bunch of JS into the 'header' on the fly and the header is now 10,000 bytes longer, then it doesn't matter, since all of the ranges and offsets in the original file remain valid: the first JPG is still located starting at offset byte #123,456 in $URL, the second one is located starting at byte #456,789 etc, no matter how much spam got injected into it.

Beyond that, depending on how badly the server is tampering with stuff, of course it could break the Gwtar, but then, that is true of any web page whatsoever (never mind archiving), and why they should be very careful when doing so, and generally shouldn't.

Now you might wonder about 're-archiving': if the IA serves a Gwtar (perhaps archived from Gwern.net), and it injects its header with the metadata and timeline snapshot etc, is this IA Gwtar now broken? If you use a SingleFile-like approach to load it, properly force all references to be static and loaded, and serialize out the final quiescent DOM, then it should not be broken and it should look like you simply archived a normal IA-archived web page. (And then you might turn it back into a Gwtar, just now with a bunch of little additional IA-related snippets.) Also, note that the IA, specifically, does provide endpoints which do not include the wrapper, like APIs or, IIRC, the 'if_/' fragment. (Besides getting a clean copy to mirror, it's useful if you'd like to pop up an IA snapshot in an iframe without the header taking up a lot of space.)

More comments...