What's up with all those equals signs anyway?

Posted by todsacerdoti 12 hours ago

What's up with all those equals signs anyway?(lars.ingebrigtsen.no)

567 points | 171 comments

kstrauser 6 hours ago|

For context, this is the Lars Ingebrigtsen who wrote the manual for Gnus[0], a common Emacs package for reading email and Usenet. It’s clever, funny, and wildly informative. Lars has probably forgotten more about email parsing than 99% of us here will ever have learned.

The manual itself says[1]:

> Often when I read the manual, I think that we should take a collection up to have Lars psycho-analysed.

0: https://www.gnu.org/software/emacs/manual/html_mono/gnus.htm...

1: https://www.gnus.org/manual.html

sovande 4 hours ago|

Not only the manual, but Gnus itself. I remember this guy from the university (UiO) when he started working on Gnus. He was a small celebrity among us informatics students, and we all used Emacs and Gnus, of course.

rurban 1 hour ago|||

Also gmane. The once popular mailing lists search site.

jd3 6 minutes ago||

I discovered X-Face[0] through gmane! Such a blast from the past.

[0]: https://en.wikipedia.org/wiki/X-Face

kstrauser 2 hours ago|||

I'd forgotten that! Yeah, I believe Lars also wrote a huge chunk of the current Gnus. I stopped using it a while back and maybe someone else came along and rewrote it again, replacing all his code, but I don't think that's the case.

Gnus was absolutely delightful back in the day. I moved on around the time I had to start writing non-plaintext emails for work reasons. It's also handy to be using the same general email apps and systems as 99.99% of the rest of the world. I still have a soft spot in my heart for it.

PS: Also, I have no idea whatsoever why someone would downvote you for that. Weird.

ruhith 10 hours ago||

The real punchline is that this is a perfect example of "just enough knowledge to be dangerous." Whoever processed these emails knew enough to know emails aren't plain text, but not enough to know that quoted-printable decoding isn't something you hand-roll with find-and-replace. It's the same class of bug as manually parsing HTML with regex, it works right up until it doesn't, and then you get congressional evidence full of mystery equals signs.

lvncelot 9 hours ago||

> It's the same class of bug as manually parsing HTML with regex, it works right up until it doesn't

I'm sure you already know this one, but for anyone else reading this I can share my favourite StackOverflow answer of all time: https://stackoverflow.com/a/1732454

josefx 9 hours ago|||

I prefer the question about CPU pipelines that gets explained using a railroad switch as example. That one does a decent job of answering the question instead of going of on a, how to best put it, mentally deranged one page rant about regexes with the lazy throw away line at the end being the only thing that makes it qualify as an answer at all.

kapep 9 hours ago|||

The regex answer is from the very old days of Stackoverflow, before fun was banned. I agree it barely qualifies as answer, but considering that the question has over 4 million page views (which almost puts it in the top 100 most viewed questions all-time), it has reached a lot people. The answer probably had much more influence than any serious answer on that topic. So I'd say the author did a good job.

bobince 8 hours ago|||

Of all the things I wrote on SO, including many actually-useful detailed explanations, it was this drunken rant that stuck, for some reason.

falcor84 7 hours ago|||

And for that I applaud you.

I know it's a hassle for a platform to moderate good rants from bad ones, and I decry SO from pushing too hard against these. I truly believe that our industry would benefit from more drunken technical rants.

scott_s 6 hours ago|||

I think of, and look up, this drunken rant at least once a year.

DangitBobby 8 hours ago|||

People have shared it here and on reddit a bunch of times because it's funny. I always found the pragmatic counter-answer about using regex and the comments about how brittle it is to parse XML properly assuming a specific structure to be much more useful.

MrGilbert 8 hours ago||||

For anyone wondering about the railroad switch post: https://stackoverflow.com/questions/11227809/why-is-processi...

operator-name 5 hours ago||

This is new to me, and a wonderful dive that I wish I was aware of during my OS course. Thanks!

bityard 6 hours ago|||

But--and this is crucial--the one about regexes is hilarious.

It also comes from a time in Internet culture when humor was appreciated instead of aggressively downvoted.

encom 4 hours ago||

It's because the author put effort into it. Most (online) humour is lazy, low effort, regurgitated meme spam. See: Reddit. It should be downvoted and ideally never posted at all.

This is also the reason why I consider the lack of images in IRC a feature.

perching_aix 5 hours ago||||

It took me years to notice, but did you catch that the answer actually subtly misinterprets what the question is asking for?

Guy (in my reading) appears to talk about matching an entire HTML document with regex. Indeed, that is not possible due to the grammars involved. But that is not what was being asked.

What was being asked is whether the individual HTML tags can be parsed via regex. And to my understanding those are very much workable, and there's no grammar capability mismatch either.

somat 3 hours ago|||

The thing is, even when parsing html "correctly" (whatever that is) regexes will still be used. Sure, There will be a bunch of additional structures and mechanisms involved, but you will be identifying tokens via a bunch of regexes.

So yes, while it is an inspired comidic genius of a rant, and sort of informative in that it opens your eyes to the limitations of regexes, it sort of brushes under the rug all the places that those poor maligned regular expressions will be used when parsing html.

tiagod 5 hours ago|||

I think even for single opening tags like asked there are impossible edge cases.

For example, this is perfectly valid XHTML:

    <a href="/" title="<a /> />"></a>

chungy 2 hours ago|||

No, that is not valid. The "<" and ">" characters in string values must always be escaped with < and >. The correct form would be:

    <a href="/" title="&lt;a /&gt; /&gt;"></a>

comex 4 hours ago|||

If you already know where the start of the opening tag is, then I think a regex is capable of finding the end of that same opening tag, even in cases like yours. In that sense, it’s possible to use a regex to parse a single tag. What’s not possible is finding opening tags within a larger fragment of HTML.

kstrauser 2 hours ago||

For any given regex, an opponent can craft a string which is valid HTML but that the regex cannot parse. There are a million edge cases like:

  <!—- Don't count <hr> this! -—> but do count <hr> this -->

and

  <!-- <!-- Ignore <ht> this --> but do count <hr> this —->

Now your regex has to include balanced comment markers. Solve that

You need a context-free grammar to correctly parse HTML with its quoting rules, and escaping, and embedded scripts and CDATA, etc. etc. etc. I don't think any common regex libraries are as powerful as CFGs.

Basically, you can get pretty far with regexes, but it's provably (like in a rigorous compsci kinda way) impossible to correctly parse all valid HTML with only regular expressions.

marcosdumay 1 hour ago||

HTML comments do not nest. The obvious tokenizer you can create with regular expressions is the correct one.

kstrauser 44 minutes ago||

If you're talking about tokenizers, then you're no longer parsing HTML with a regex. You're tokenizing it with a regex and processing it with an actual parser.

bayesnet 9 hours ago||||

I know this is grumpy but this I’ve never liked this answer. It is a perfect encapsulation of the elitism in the SO community—if you’re new, your questions are closed and your answers are edited and downvoted. Meanwhile this is tolerated only because it’s posted by a member with high rep and username recognition.

1718627440 8 hours ago|||

I think this answer was tolerated when SO wasn't as bad as it is now, and wouldn't be tolerated now from anyone.

bombcar 6 hours ago||

It's because SO at the time was a small high-trust society where "everyone knew each other" and so things flew back then that wouldn't fly now.

throwaway_61235 8 hours ago|||

As someone who used to write custom crawlers 20 years ago, I can confirm that regular expressions worked great. All my crawlers were custom designed for a page and the sites were mostly generated by some CMS and had consistent HTML. I don't remember having to do much bug fixes that were related to regular expression issues.

I don't suggest writing generic HTML parsers that works with any site, but for custom crawlers they work great.

Not to say that the tools available are the same now as 20 years ago. Today I would probably use puppeteer or some similar tool and query the DOM instead.

vbezhenar 5 hours ago|||

An interesting thing is that most webpages are generated using text templates. There's some text processing like escaping special characters, but it's mostly text that happened to be (somewhat) valid HTML.

So extracting information from this text with regexps often makes perfect sense.

wat10000 6 hours ago|||

I would distinguish between parsing and scraping. Parsing really needs a, well, parser. Otherwise you’ll get things wrong on perfectly well formed input and your program will be brittle and weird.

A scraper is already resigned to being brittle and weird. You’re relying not only on the syntax of the data, but an implicit structure beyond that. This structure is unspecified and may change without notice, so whatever robustness you can achieve will come from being loose with what you accept and trying to guess what changes might be made on the other end. Regex is a decent tool for that.

Cthulhu_ 9 hours ago||||

HE COMES

umanwizard 6 hours ago|||

Funny how differently people can perceive things. That's my least favorite SO answer of all time, and I cringe every time I see it.

It's a very bad answer. First of all, processing HTML with regex can be perfectly acceptable depending on what you're trying to do. Yes, this doesn't include full-blown "parsing" of arbitrary HTML, but there are plenty of ways in which you might want to process or transform HTML that either don't require producing a parse tree, don't require perfect accuracy, or are operating on HTML whose structure is constrained and known in advance. Second, it doesn't even attempt to explain to OP why parsing arbitrary HTML with regex is impossible or poorly-advised.

The OP didn't want his post to be taken over by someone hamming it up with an attempt at creative writing. He wanted a useful answer. Yes, this answer is "quirky" and "whimsical" and "fun" but I read those as euphemisms for "trying to conscript unwilling victims into your personal sense of nerd-humor".

chucksmash 6 hours ago|||

There's nothing that brings joy into this world quite like the guy waiting around to tell people he doesn't like the thing they like.

philistine 6 hours ago|||

The whole argument hinges on one word in your post: arbitrary.

I parse my own HTML I produce directly in a context where I fully control the output. It works fine, but parsing other people’s HTML is a lesson in humility. I’ve also done that, but I did it as a one time thing. I parsed a specific point in time, refusing to change that at any point.

umanwizard 6 hours ago||

It also hinges on another word: parsing. There are things other than parsing that you might want to do. For example, if you want to count the number of `<hr>` tags in an HTML document, that doesn't require parsing it, and can indeed be done with regex.

kstrauser 5 hours ago||

No you can’t. You can have an unescaped <hr> inside a script tag, for example. The best you can do is a simple string search for “<hr>” and hope it’s returning what you think it might be returning. Regexps are not powerful enough to determine whether any particular instance of “<hr>” is actually an HTML tag.

Like, it’s not a matter of cleverness, either. You can’t code around it. It’s simply not possible.

ErigmolCt 5 hours ago|||

And because the output still looks mostly readable, nobody questions it until years later when it's suddenly evidence in front of Congress

V__ 10 hours ago||

They have top men working on it right now.

tiborsaas 11 hours ago||

> We see that that’s a quite a long line. Mail servers don’t like that

Why do mail server care about how long a line is? Why don't they just let the client reading the mail worry about wrapping the lines?

direwolf20 10 hours ago||

SMTP is a line–based protocol, including the part that transfers the message body

The server needs to parse the message headers, so it can't be an opaque blob. If the client uses IMAP, the server needs to fully parse the message. The only alternative is POP3, where the client downloads all messages as blobs and you can only read your email from one location, which made sense in the year 2000 but not now when everyone has several devices.

fluoridation 9 hours ago||

Hey, POP3 still makes sense. Having a local copy of your emails is useful.

direwolf20 8 hours ago|||

If you want it to be the only copy and not sync with anything

POP3 is line–based too, anyway. Maybe you can rsync your maildir?

fluoridation 8 hours ago||

I just read it mainly in one place and through the web interface when I have to.

dylan604 6 hours ago||

If your "in one place" reader is still open and downloading messages then there will be no messages to view in the web interface when you have to.

fluoridation 5 hours ago|||

There will, because my client doesn't delete the messages from the server when it downloads them.

skydhash 5 hours ago|||

POP3 is more for reading and acting on your email in one place (taking notes, plan actions, discard and delete,…). No need to consume them on other devices as you’ve already extracted the important bits.

I use imap on my mobile device, but that’s mostly for recent emails until I get to my computer. Then it’s downloaded and deleted from the server.

Jaxan 4 hours ago||||

Isn’t the only difference between pop and imap that pop removes the mail from the server? I only use imap, and all my email is available offline.

direwolf20 2 hours ago|||

POP is a simple mail transfer protocol (hehe...). It supports three things: get number of mails, download mail by number, delete mail by number. This is what you need to move mails in bulk from one point to another. POP3 mail clients are local maildir clients that use POP3 to get new mail from the server. It's like SMTP if it were based on polling.

IMAP is an interactive protocol that is closer to the interaction between Gmail frontend and backend. It does many things. The client implements a local view of a central source of truth.

fluoridation 4 hours ago||||

No, the difference is that IMAP doesn't store anything other than headers on the client (at least, not until the user tries to read a message), while POP3 eagerly downloads messages whenever they're available. A POP3 client can be configured with various remote retention policies, or even to never delete downloaded messages.

I don't have an IMAP account available to check, but AFAIK, you should not have locally the content of any message you've never read before. The whole point of IMAP is that it doesn't download messages, but instead acts like a window into the server.

mmooss 3 hours ago||

Also, IMAP syncs the other way. If you locally tag a message locally or move it to another folder, it also happens on the server.

gsich 33 minutes ago|||

Depending on what you configured. It can also keep the mail on the server.

ahoka 6 hours ago||||

But it's more akin to consuming a message queue. You have fetched it, it's gone.

foresto 2 hours ago||

This is incorrect. POP3 does not require fetched messages to be deleted from the server.

encom 4 hours ago|||

Nothing stops you from locally archiving your email with IMAP.

mmooss 3 hours ago||

How do you do that, by default? Can you tell an IMAP client to work like POP3 and download everything?

masfuerte 2 hours ago|||

In Thunderbird you can "Select this folder for offline use".

direwolf20 2 hours ago|||

Some you can

layer8 10 hours ago|||

Mails are (or used to be) processed line-by-line, typically using fixed-length buffers. This avoids dynamic memory allocation and having to write a streaming parser. RFC 821 finally limited the line length to at most 1000 bytes.

Given a mechanism for soft line breaks, breaking already at below 80 characters would increase compatibility with older mail software and be more convenient when listing the raw email in a terminal.

This is also why MIME Base64 typically inserts line breaks after 76 characters.

SoftTalker 8 hours ago||

In early days, many/most people also read their email on terminals (or printers) with 80-column lines, so breaking lines at 72-ish was considered good email etiquette (to allow for later quoting prefix ">" without exceeding 80 characters).

bjourne 5 hours ago||

One of the technical marvels of the day were mail and usenet clients that could properly render quoted text from infinite, never ending flame wars!

GMoromisato 5 hours ago|||

I don't think kids today realize how little memory we had when SMTP was designed.

For example, the PDP-11 (early 1970s), which was shared among dozens of concurrent users, had 512 kilobytes of RAM. The VAX-11 (late 1970s) might have as much as 2 megabytes.

Programmers were literally counting bytes to write programs.

NetMageSCW 1 hour ago||

I assure you we were not, at least it wasn’t really necessary. Virtual Memory is a powerful drug.

liveoneggs 10 hours ago|||

This is how email work(ed) over smtp. When each command was sent it would get a '200'-class message (success) or 400/500-class message (failure). Sound familiar?

telnet smtp.mailserver.com 25

HELO

MAIL FROM: me@foo.com

RCPT TO: you@bar.com

DATA

blah blah blah

how's it going?

talk to you later!

QUIT

1718627440 8 hours ago|||

For anyone who wants to try this against a modern server:

    openssl s_client -connect smtp.mailserver.com:smtps -crlf
    220 smtp.mailserver.com ESMTP Postfix (Debian/GNU)
    EHLO example.com
    250-smtp.mailserver.com
    250-PIPELINING
    250-SIZE 10240000
    250-VRFY
    250-ETRN
    250-AUTH PLAIN LOGIN
    250-ENHANCEDSTATUSCODES
    250-8BITMIME
    250-DSN
    250-SMTPUTF8
    250 CHUNKING

    MAIL FROM:me@example.com
    250 2.1.0 Ok

    RCPT TO:postmaster
    250 2.1.5 Ok

    DATA
    354 End data with <CR><LF>.<CR><LF>

    Hi
    .
    250 2.0.0 Ok: queued as BADA579CCB

    QUIT
    221 2.0.0 Bye

Telemakhos 9 hours ago||||

This brings back some fun memories from the 1990s when this was exactly how we would send prank emails.

kstrauser 6 hours ago|||

Yep! And also, if you included a blank line and then the headers for a new email in the bottom of your message, you could tell the server, hey, here comes another email for you to process!

If you were typing into a feedback form powered by something from Matt’s Script Archive, there was about a 95% chance you could trivially get it to send out multiple emails to other parties for every one email sent to the site’s owner.

fix4fun 7 hours ago|||

That was nice part of 1990s - many systems allow for funny things ;)

xg15 8 hours ago|||

I like how SMTP was at least honest in calling it the "receipt to" address and not the "sender" address.

Edit: wrong.

1718627440 8 hours ago||

RCPT TO specifies the destination (recipient) address, the "sender" is what is written in MAIL FROM.

However what most mail programs show as sender and recipient is neither, they rather show the headers contained in the message.

xg15 8 hours ago||

Ah, sorry. You're right.

jcynix 8 hours ago|||

"BITNET was a co-operative university computer network in the United States founded in 1981 by Ira Fuchs at the City University of New York (CUNY) and Greydon Freeman at Yale University."

https://en.wikipedia.org/wiki/BITNET

BITNET connected mainframes, had gateways to the Unix world and was still active in the 90s. And limited line lengths … some may remember SYSIN DD DATA … oh my goodness …

https://www.ibm.com/docs/en/zos/2.1.0?topic=execution-systsi...

citrin_ru 10 hours ago|||

Back in 80s-90s it was common to use static buffers to simplify implementation - you allocate a fixed size buffer and reject a message if it has a line longer than the buffer size. SMTP RFC specifies 1000 symbols limit (including \r\n) but it's common to wrap around 87 symbols so it is easy to examine source (on a small screen).

thephyber 11 hours ago|||

The simplest reason: Mail servers have long had features which will send the mail client a substring of the text content without transferring the entire thing. Like the GMail inbox view, before you open any one message.

I suspect this is relevant because Quoted Printable was only a useful encoding for MIME types like text and HTML (the human readable email body), not binary (eg. Attachments, images, videos). Mail servers (if they want) can effectively treat the binary types as an opaque blob, while the text types can be read for more efficient transfer of message listings to the client.

Pinus 11 hours ago|||

As far as I can remember, most mail servers were fairly sane about that sort of thing, even back in the 90’s when this stuff was introduced. However, there were always these more or less motivated fears about some server somewhere running on some ancient IBM hardware using EBCDIC encoding and truncating everything to 72 characters because its model of the world was based on punched cards. So standards were written to handle all those bizarre systems. And I am sure that there is someone on HN who actually used one of those servers...

jcynix 7 hours ago|||

EBCDIC wasn't the problem, this was (part of) the problem:

https://www.ibm.com/docs/en/zos/2.1.0?topic=execution-systsi...

And BITNET …

kstrauser 6 hours ago||

> EBCDIC wasn't the problem

Wake up, everyone! Brand new sentence just dropped!

tiborsaas 10 hours ago|||

Thanks, I really expected a tale from the 70's, but did not see punch cards coming :)

jibal 10 hours ago||

The influence of 80 column punch cards remains pervasive.

NetMageSCW 1 hour ago|||

IBM has a lot to make up for.

josefx 10 hours ago|||

RFC822 explicitly says it is for readability on systems with simple display software. Given that the protocol is from 1982 and systems back then had between 4 and 16kb RAM in total it might have made sense to give the lower end thin client systems of the day something preprocessed.

sumtechguy 9 hours ago|||

Also it is an easy way to stop a denial of service attack. If you let an infinite amount in that field. I can remotely overflow your system memory. The mail system can just error out and hang up on the person trying the attack instead of crashing out.

fluoridation 8 hours ago||

Surely you don't need the message to be broken up into lines just for that. Just read until a threshold is reached and then close the connection.

badc0ffee 4 hours ago|||

You could expect a lot more (512kB, 1MB, 2MB) in an internet-connected machine running Unix or VMS.

codingdave 10 hours ago||

Keep in mind that in ye olden days, email was not a worldwide communication method. It was more typical for it to be an internal-only mail system, running on whatever legacy mainframe your org had, and working within whatever constraints that forced. So in the 90s when the internet began to expand, and email to external organizations became a bigger thing, you were just as concerned with compatibility with all those legacy terminal-based mail programs, which led to different choices when engineering the systems.

liveoneggs 10 hours ago||

This is incorrect

kstrauser 6 hours ago||

Are you certain? Not OP, but a huge chunk of early RFCs was about how to let giant IBM systems talk to everyone else, specifying everything from character sets (nearly universally “7-bit ASCII”) to end of line/message characters. Otherwise, IBM would’ve tried to make EBCDIC the default for everything.

For instance, consider FTP’s text mode, which was primarily a way to accidentally corrupt your download when you forgot to type “bin” first, but was also handy for getting human readable files from one incompatible system to another.

liveoneggs 4 hours ago||

I had a pre-'@' email address and it was able to communicate all over the world.

kstrauser 3 hours ago||

My first reading was that you were disagreeing with the bits about email worrying about compatibility, and that part seemed reasonably true to me.

As to the other bits, I think even in the uucp era, email was mostly internal, by volume of mail sent, even though you could clearly talk to remote sites if everything was set up correctly. It was capable of being a worldwide communication system. I bet the local admins responsible for monitoring the telephone bill preferred to keep that in check, though.

heikkilevanto 10 hours ago||

I thought the article would be about the various meanings of operators like = == === .=. <== ==> <<== ==>> (==) => =~=

direwolf20 10 hours ago||

What is this, a Haskell for ants?

dkga 10 hours ago|||

It has to be at least… three times bigger than this

fix4fun 7 hours ago|||

My fist association was brainf..k (*.bf) programming language

ErigmolCt 5 hours ago||

This ended up being way more interesting

TazeTSchnitzel 7 hours ago||

The most interesting thing to me wasn't the equals signs, which I knew are from quoted-printable, but the fact that when an equals sign appears, a letter that should have been preceding or following it is missing. It's as if an off-by-one error has occurred, where instead of getting rid of the equals sign, it's gotten rid of part of the actual text. Perhaps the CRLF/LF thing is part of it.

btown 6 hours ago||

The article goes into exactly why this happens!

ErigmolCt 5 hours ago||

That's exactly how you end up with mystery missing characters in something that's supposed to be evidence

xg15 10 hours ago||

I'm just wondering why this problem shows up now. Why do lots of people suddenly post their old emails with a defective QP decoder?

> For some reason or other, people have been posting a lot of excerpts from old emails on Twitter over the last few days.

On the risk of having missed the latest meme or social media drama, but does anyone know what this "some reason or other" is?

Edit: Question answered.

SCdF 10 hours ago||

Presumably the Epstein files, but I'm not on twitter so not sure

grimgrin 8 hours ago|||

you presume cor=rect

https://www.jmail.world/thread/EFTA02512824?view=person

xg15 8 hours ago||

Huh, Noam Chomsky, nice one!

xg15 9 hours ago||||

Ooh, that reason. Sorry for having been dense. Thanks!

avemg 8 hours ago|||

Jeff Epstein? The New York financier?

ropp 10 hours ago|||

the DOJ published another bunch of Epstein emails

jychang 9 hours ago||

[flagged]

sd9 8 hours ago||

Of course the Epstein files are serious.

But not everybody has every single global development / news event IVed into their veins. Many of us just don’t keep updated on global news such that we may not be aware of an event that happened in the last 3 days.

Important news tends to get to me eventually. And there is usually nothing I can do about something personally anyway (at least within a short time horizon), so there is really very little value in trying to stay informed of the absolute latest developments. The signal to noise ratio is far too low, and it also induces a bunch of unnecessary anxiety and stress.

So yes, believe it or not very many people are unaware of this.

cachius 44 minutes ago||

I'd like a good .eml viewer that undoes the quoted printable transformation for the contained plain and html text. useful for mails downloaded from Outlook.

thedanbob 10 hours ago||

I wrote my own email archiving software. The hardest part was dealing with all the weird edge cases in my 20+ year collection of .eml files. For being so simple conceptually, email is surprisingly complicated.

jandrese 2 hours ago||

Email is one of those cursed standards where the committee wasn't building a protocol from scratch, but rather trying to build a universal standard by gluing together all of the independently developed existing systems in some way that might allow them to interoperate. Verifying that a string a user has typed is a valid email address is close to impossible short of just throwing up your hands and allowing anything with a @ somewhere in it.

stevekemp 7 hours ago||

I wrote a console-based mail client, which was 25% C++ and 75% Lua for defining the UI and the processing.

It never got too popular, but I had users for a few years and I can honestly say MIME was the bane of my life for most of those years.

thedanbob 4 hours ago||

Indeed. A big chunk of my email parser deals with missing or incorrect content headers. Most of the rest attempts to sensibly interpret the infinite combinations of parts found in multipart (and single-part!) emails.

beejiu 11 hours ago||

> So what’s happened here? Well, whoever collected these emails first converted from CRLF (i.e., “Windows” line ending coding) to “NL” (i.e., “Unix” line ending coding). This is pretty normal if you want to deal with email. But you then have one byte fewer:

I think there is a second possible conclusion, which is that the transformation happened historically. Everyone assumes these emails are an exact dump from Gmail, but isn't it possible that Epstein was syncing emails from Gmail to a third party mail server?

Since the Stackoverflow post details the exact situation in 2011, I think we should be open to the idea that we're seeing data collected from a secondary mail server, not Gmail directly.

Do we have anything to discount this?

(If I'm not mistaken, I think you can also see the "=" issue simply by applying the Quoted-Printable encoding twice, not just by mishandling the line-endings, which also makes me think two mail servers. It also explains why the "=" symbol is retained.)

topspin 11 minutes ago||

What happened here is what always happens with all printed and digital material that goes through some evidentiary process.

The shot-callers demand the material, which is a task fobbed off onto some nobody intern who doesn't matter (deliberately, because the lawyers don't want any "officer of the court" to put eyes on things they might need to deny knowing about later). They have the intern use only the most primitive, mechanical method possible, with little to no discretion. The collected mass of mangled junk is then shipped to whoever has made the demands, either in boxes or on CD-ROM/DVD (yes, still) or something. Then, the reverse process is done, equally badly, again by low-level staff, also with zero discretion and little to no technical knowledge or ability, for exactly the same reasons, to get the material into some form suitable for filing or whatever.

Through all of this, the subtle details of data formats and encodings are utterly lost, and the legal archive fills with mangled garbage like raw quoted-printable emails.

TazeTSchnitzel 7 hours ago|||

In one of the email PDFs I saw an XML plist with some metadata that looked like it was from Apple's Mail.app, so these might be extracted from whatever internal format that uses.

flomo 3 hours ago|||

When they process these emails, it's fairly common to import everything into a MS Outlook PST file (using whatever buggy tool). That's probably why these look like Outlook printouts even though its Yahoo mail or etc.

ErigmolCt 5 hours ago|||

Yeah, I wouldn't bet on this being a single bad Gmail export; it smells much more like the accumulated scars of multiple mail systems doing "helpful" things to the same messages over time

MoltenMan 11 hours ago||

This seems like the most likely reason to me!

maartin0 9 hours ago|

Fun how the archive.today article near the top has this exact issue

https://pastes.io/correspond

https://news.ycombinator.com/item?id=46843805

More comments...