Posted by tamnd 6 hours ago

Show HN: Kage – Shadow any website to a single binary for offline viewing (github.com)
309 points | 64 comments
sanqui 6 hours ago|
Cool concept. I would like to see this combined with mitmproxy for archive-grade fidelity. You could save exactly the data served and, at the same time, a representation rendered by a contemporary browser with all JS having run. This combination would be my perfect replacement for the WARC format.
tamnd 5 hours ago||
I'm working on WARC too, using the format from Common Crawl!

By converting it to Markdown, we save a lot of space, but it is for a different purpose and a different project: https://github.com/tamnd/ccrawl-cli

sanqui 5 hours ago||
That's neat! In my opinion, the WARC format is quite tricky and underspecified, especially since HTTP/2 introduced new semantics. It encodes too much in-band and requires rewriting the server data. A mitmproxy capture is higher fidelity and supports capturing modern features such as WebSockets. I think if we could wrap Kage's crawler interactions in it and store its capture (the intercepted traffic), we could make a potentially nice new archival format.
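
To make the "in-band" point concrete, here is a minimal Python sketch of how a single WARC `response` record is laid out: a block of named header fields, a blank line, the captured HTTP message as the record block, then two CRLFs. The function name `warc_response_record` and the sample response are hypothetical, and the comment about HTTP/2 reflects my understanding of how archivers usually cope rather than anything Kage or a specific tool does.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(url: str, http_message: bytes) -> bytes:
    """Serialize one WARC 'response' record: headers, blank line, block, two CRLFs."""
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {url}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_message)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_message + b"\r\n\r\n"


# The record block is a textual HTTP/1.x-style message stored in-band.
# Assumption: an HTTP/2 response has no textual status line, so an archiver
# would have to synthesize one like this instead of storing the wire bytes.
body = b"<h1>hi</h1>"
http_message = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n"
    b"\r\n" + body
)

record = warc_response_record("https://example.com/", http_message)
print(record.decode("utf-8"))
```

A proxy capture, by contrast, would store the intercepted traffic as-is and keep this kind of metadata out of the payload.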
tamnd 5 hours ago||
I tried to follow well-known formats first, such as WARC and ZIM from Kiwix, so we could benefit from existing tooling support.

For my own custom data format, I have a lot of private code that I plan to release soon. It is optimized for compression, fast lookups, and more. I have been working on it for two years. This is part of a larger, ambitious umbrella project: I am building Google from scratch (all open source), something that anyone can host, including the crawler, indexer, storage, and serving layers. Stay tuned!

sanqui 5 hours ago|||
I'm a fan of compatibility with established formats!

Sounds awesome. There is a lot of untapped potential with respect to efficiently archiving and indexing websites. I saw the impressive things Marginalia Search is doing in this area (the blog is great when it gets technical). There are also a lot of very complete archives of websites out there which are not being indexed at all, and I would love to make them available for researchers. In any case, I'm interested in your project!

threecheese 1 hour ago||||
OK, sounds fascinating; following! (your GH)
tamnd 54 minutes ago||
Thanks ;)
Prime_Axiom 4 hours ago|||
Looking forward to the next project! I love these kinds of archiving tools.
Dhavidh 5 hours ago||
Sounds interesting.
calrizien 2 hours ago||
Does this work for the Apple Docs website? Really tricky to get those offline.
tamnd 50 minutes ago|
Making docs available offline was one of my main motivations for building this tool. I will try Apple Docs too.

I previously downloaded the Snowflake docs, and it was something like tens or even hundreds of thousands of pages, I do not remember exactly. The output ended up being very large.

By the way, I forgot to add zstd compression support to my ZIM reader/writer. I will implement that in the next version.

Igor_Wiwi 5 hours ago||
This is quite a useful tool, especially for cases where internet access is limited (flights, for example). I implemented it as a separate feature in mdview.io: you can export a document as an HTML file for offline use, with all the presentation features, like rich tables and Mermaid, built in. Example: https://mdview.io/s/why-markdown-became-default-format-for-a... then try Export - Export HTML
lolpython 5 hours ago||
This is cool. I could see myself downloading the articles behind the first couple of pages of Hacker News with this, for viewing on a flight or long-distance train ride with spotty internet.
nitotm 2 hours ago||
I was looking for something like this the other day; it can be very helpful.
latexr 4 hours ago||
For those with an eReader, one thing that works really well is using pandoc to download a webpage and convert it to an EPUB that you can then load onto your reader.

  pandoc --from html --to epub --output /PATH/TO/FILE.epub https://example.com
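
If you want to do this for several pages at once, the pandoc command above can be wrapped in a small script. This is only a sketch: the helper names (`epub_name`, `pandoc_command`, `convert_all`) and the file-naming scheme are my own assumptions, and it expects pandoc to be on your PATH.

```python
import subprocess
from pathlib import Path
from urllib.parse import urlparse


def epub_name(url: str) -> str:
    """Derive a file name such as 'example.com-docs-intro.epub' from a URL."""
    parts = urlparse(url)
    slug = (parts.netloc + parts.path).strip("/").replace("/", "-")
    return (slug or "page") + ".epub"


def pandoc_command(url: str, out_dir: Path) -> list[str]:
    """Build the argument list for the same pandoc call as the one-liner."""
    return [
        "pandoc",
        "--from", "html",
        "--to", "epub",
        "--output", str(out_dir / epub_name(url)),
        url,
    ]


def convert_all(urls: list[str], out_dir: Path) -> None:
    """Run pandoc once per URL; needs pandoc installed and network access."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for url in urls:
        subprocess.run(pandoc_command(url, out_dir), check=True)
```

Calling `convert_all(["https://example.com"], Path("books"))` would then write `books/example.com.epub`.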
arikrahman 3 hours ago|
Thanks, will try this out on the Kobo later.
rahimnathwani 6 hours ago||
So this is like using wget --mirror except that it works on pages that require javascript, right?
tamnd 6 hours ago|
Yeah, it is. For example, openai.com is rendered with Next.js, so I will try to mirror it tomorrow.
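
A rough illustration of why a static mirror misses such pages: a crawler that only parses the HTML it receives never sees links that a script adds later. The sketch below uses Python's standard `html.parser`; the page content and the names `LinkCollector` and `static_links` are made up for the example, and this is not how wget or Kage is actually implemented.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags present in the raw HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def static_links(html: str) -> list[str]:
    """Return the links a non-JS crawler could discover in this markup."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: one link is in the markup, one is injected by a script.
PAGE = """
<html><body>
  <a href="/about">About</a>
  <div id="app"></div>
  <script>
    document.getElementById("app").innerHTML = '<a href="/pricing">Pricing</a>';
  </script>
</body></html>
"""

print(static_links(PAGE))  # prints ['/about']
```

The `/pricing` link only exists after the script runs, so finding it takes a real browser engine.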
KellyCriterion 3 hours ago||
Sounds like .MCH-files re-invented? (-: