The surprisingly complex journey to text-selectable client-side generated PDFs

Posted by FailMore 1 day ago

The surprisingly complex journey to text-selectable client-side generated PDFs (sdocs.dev)

27 points | 12 comments

alansaber 1 minute ago

You don't know the hell of trawling through PDF XML and HTML construction until you've done it

Worf 27 minutes ago

PDFs should be only for printing or maybe for keeping scanned versions of things. For anything else they're just not the right tool for the job. Not for things meant to be accessed on a computer like books, scientific papers or, for some weird reason, catalogs and price lists from websites.

We have responsive and open standards like HTML and EPUB (zipped XHTML) and they work great. arXiv has HTML papers, and Libgen and Anna's Archive often have EPUB versions of books. The issue for me with EPUB is the lack of good readers now.

gf000 4 minutes ago

I don't know, I really love well-typeset books/papers, especially when they feature figures that are deliberately placed close to the relevant section in the text. It's just not something we can replicate with HTML, which can barely do proper justified text.

Sure, I would like that beautifully designed page to magically become a single column beautiful document on my phone, but I will take the former over a badly designed text extract where the relevant figure is 10 pages away.

EPUB (=HTML) is good for novels, but there is nothing replacing PDF for science papers. If anything, the LaTeX (or ideally Typst) source would come the closest, if properly written (no absolute offsets). That could be used to produce versions for different page sizes.

FailMore 22 minutes ago

Interesting point. How do you feel about the "business world"'s heavy use of PDFs? There is something to be said for the file format being trusted and so dominant now... probably some random sequence of events led to this happening... but it's perhaps hard to shift.

gobdovan 43 minutes ago

Thanks, this puts into perspective why copy-paste from PDFs is so bad.

I'm months into building a pasteboard transform library that normalises provider-specific data from VS Code, Google Docs, PDFs and a bunch of Chromium apps, so I can start pasting everything everywhere exactly how I want it. It's much, much messier than I expected.

Apps put different UTTypes on the pasteboard that are not really compatible with each other. Usually there's a plain text fallback, then rich text/HTML, then provider-specific data. You show how much insane work is needed just to make text selectable with glyph mappings, layout, links, code blocks, rendered styles, etc. But once you copy from that PDF, most viewers still only expose raw text, and often broken raw text at that...
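A rough sketch of that fallback order, not the actual library: the pasteboard is modelled as a plain dict of type identifier to payload, and the consumer takes the richest type it finds. The `public.*` identifiers are standard Apple UTIs; `com.example.editor-data`, `PREFERENCE` and `pick_representation` are made up for illustration.

```python
# Illustrative model only: a pasteboard as {type identifier: payload}.
PLAIN_TEXT = "public.utf8-plain-text"
HTML = "public.html"
RTF = "public.rtf"
PROVIDER_SPECIFIC = "com.example.editor-data"  # hypothetical provider type

# Richest representation first, plain text as the last resort.
PREFERENCE = [PROVIDER_SPECIFIC, HTML, RTF, PLAIN_TEXT]


def pick_representation(pasteboard, preference=PREFERENCE):
    """Return (type, payload) for the richest type present, or None."""
    for type_id in preference:
        payload = pasteboard.get(type_id)
        if payload:  # skip types that are missing or empty
            return type_id, payload
    return None


# A code editor typically offers several representations at once...
from_editor = {
    PLAIN_TEXT: "x = 1",
    HTML: "<pre>x = 1</pre>",
    PROVIDER_SPECIFIC: '{"mode": "python"}',
}
# ...while a PDF viewer often exposes only raw text.
from_pdf_viewer = {PLAIN_TEXT: "x = 1"}
```

With these inputs, `pick_representation(from_editor)` picks the provider-specific payload, while `pick_representation(from_pdf_viewer)` can only fall back to plain text, which is where the formatting gets lost.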

FailMore 37 minutes ago

Yep, it is a very interesting space for improvements imo. Broadly speaking, copy and paste is so central to working smoothly with a computer that it should probably have more power / quality built into it (e.g. not having to install some random plug-in to get clipboard history, etc.)

josefrichter 1 hour ago

It’s not that surprising. It’s one of those well-known Pandora’s boxes of web development: email templates, PDFs, printing, …

FailMore 1 hour ago

Ah, I didn't know that. It's not something I had worked on before, and the file format is highly prevalent (so I assumed things would be easy), so it was surprising to me

SirHumphrey 59 minutes ago

Nothing about PDF is easy. Similar to what Tom Scott once said about time zones, every time I must deal with PDFs I pray that PDF.js can be hacked into doing it instead, otherwise I just don’t bother.

It’s one of the few cases where converting it into a picture and chucking it into a multimodal LLM is a more sensible solution than trying to parse it.

ashishb 44 minutes ago

Software engineers drastically underestimate GUI work. Web layouts, mobile app layouts, and even PDF layouts are non-trivial pieces of work to get right in all circumstances.

freedomben 23 minutes ago

Nobody who has actually worked on those things thinks that. You might want to qualify that if you're only talking about people who have never worked in this area.

In my experience it's the NON-software engineers who tend to underestimate the complexity.

FailMore 38 minutes ago

Yep, they (can) rarely enter your domain... so it's easy to assume it's going to be trivial (maybe because things like .md or .txt files are trivial, so it's easy to think there's not much of a delta)