Posted by ComputerGuru 1 day ago

Recreating Epstein PDFs from raw encoded attachments (neosmart.net)
208 points | 45 comments
dperfect 48 minutes ago|
Nerdsnipe confirmed :)

Claude Opus came up with this script:

https://pastebin.com/ntE50PkZ

It produces a somewhat-readable PDF (first page at least) with this text output:

https://pastebin.com/SADsJZHd

(I used the cleaned output at https://pastebin.com/UXRAJdKJ mentioned in a comment by Joe on the blog page)

chrisjj 2 hours ago||
> it’s safe to say that Pam Bondi’s DoJ did not put its best and brightest on this

Or worse. She did.

eek2121 2 hours ago|
I mean, the internet is finding all her mistakes for her. She is actually doing alright with this. Crowdsource everything, fix the mistakes. lol.
helterskelter 58 minutes ago|||
I wonder if this could be intentional. If the datasets are contaminated with CSAM, anybody with a copy is liable to be arrested for possession.

More likely it's just an oversight, but it could also be CYA for dragging their feet, like "you rushed us, and look at these victims you've retraumatized". There are software solutions to find nudity and they're quite effective.

TSiege 1 hour ago||||
This would be funnier if it wasn’t child porn being unredacted by our government
dagi3d 1 hour ago||||
The issue is that these mistakes can't really be fixed: once they are discovered, it doesn't matter if they are eventually redacted.
chrisjj 2 hours ago||||
Let's see her sued for leaking PII. Here in Europe, she'd be mincemeat.
ISL 55 minutes ago||
The US administration is, at present, regularly violating the law and ignoring court orders. Indeed, these very releases are patently in violation of multiple federal laws -- they're simultaneously insufficiently-responsive to meet the requirements of the law requiring the release of the files and fall afoul of CSAM laws by being incompletely redacted.

The challenge, as we're all experiencing together, is that the law is not inherently self-enforcing.

typeofhuman 36 minutes ago||
Can you provide a couple examples of the laws they're violating?
ISL 20 minutes ago|||
As noted above:

https://www.govinfo.gov/content/pkg/PLAW-119publ38/pdf/PLAW-... : the Attorney General was to have produced the entirety of the Epstein files, with very narrowly-enumerated redactions, in December. She has not done so.

Furthermore, there are numerous allegations that the documents that have been released contain CSAM, which (referencing the PDF above) may fall afoul of 18 U.S.C. 2252–2252A.

In addition, one need only glance at the action in US courts to see egregious violations of the Constitution and valid court orders playing out daily.

https://www.documentcloud.org/documents/26513988-trorder0128...

https://storage.courtlistener.com/recap/gov.uscourts.mnd.230...

mschuster91 28 minutes ago|||
There are more than enough credible reports of CSAM in the Epstein Files dump - more than enough for me to not go and download even a single file of them myself, simply because German law does not care about why you are in possession of CSAM, even if you took the picture yourself.

The legal situation regarding CSAM is very strict no matter which country, and I'd better hope no one here will actually be dumb enough to provide actual links.

rockskon 1 hour ago|||
Yeah - they'll take these lessons learned for future batches of releases.
bawolff 2 hours ago||
Tesseract supports being trained for specific fonts; that would probably be a good starting point.

https://pretius.com/blog/ocr-tesseract-training-data

pyrolistical 2 hours ago||
It decodes to binary pdf and there are only so many valid encodings. So this is how I would solve it.

1. Get an open source pdf decoder

2. Decode bytes up to first ambiguous char

3. See if the next bits are valid with a 1; if not, it’s an l

4. Might need to backtrack if both 1 and l were valid

By being able to quickly try each char in the middle of the decoding process, you avoid restarting the decoder for every guess. This makes it feasible to test all permutations automatically and linearly.
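
The steps above can be sketched as a small backtracking search. This is only an illustration under loud assumptions: `looks_valid` is a hypothetical stand-in that accepts printable ASCII, where a real harness would ask a PDF or Flate decoder whether the bytes so far are plausible, and `AMBIGUOUS` covers only the 1/l pair.

```python
import base64

# Glyph pairs the OCR cannot tell apart (assumption: only 1 and l).
AMBIGUOUS = {"1": "1l", "l": "1l"}


def looks_valid(data):
    """Toy stand-in for a real decoder check: bytes must be printable ASCII."""
    return all(b in (9, 10, 13) or 32 <= b < 127 for b in data)


def recover(text):
    """Depth-first search over 1/l choices, pruning after each 4-char group."""
    chars = list(text)

    def search(i):
        if i == len(chars):
            return True  # every completed group already passed the check
        original = chars[i]
        for candidate in AMBIGUOUS.get(original, original):
            chars[i] = candidate
            if (i + 1) % 4 == 0:  # a base64 group was just completed
                group = "".join(chars[i - 3 : i + 1])
                if not looks_valid(base64.b64decode(group)):
                    continue  # this choice yields garbage: try the other glyph
            if search(i + 1):
                return True
        chars[i] = original  # both choices failed: backtrack to the caller
        return False

    return "".join(chars) if search(0) else None
```

Because each 4-character base64 group decodes independently, a bad guess is rejected as soon as its group is complete. The recursion is one level per character, so a real input would need an iterative version, and a weak validity check like this one may accept more than one answer.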

bawolff 2 hours ago|
Sounds like a job for afl
ChocMontePy 27 minutes ago||
You can use the justice.gov search box to find several different copies of that same email.

The copy linked in the post:

https://www.justice.gov/epstein/files/DataSet%209/EFTA004004...

Three more copies:

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02153...

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...

Perhaps having several different versions might make it easier.
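
One way several copies could help, sketched under the assumption that the OCR errors differ from copy to copy and that the transcriptions have already been aligned to the same length (neither is confirmed in the thread), is a per-character majority vote. The function name `consensus` is made up for this example.

```python
from collections import Counter


def consensus(copies):
    """Per-position majority vote across equally long OCR transcriptions."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("copies must be aligned to the same length")
    merged = []
    for column in zip(*copies):
        # most_common(1) returns the glyph seen most often at this position;
        # ties go to whichever glyph appeared first.
        winner, _count = Counter(column).most_common(1)[0]
        merged.append(winner)
    return "".join(merged)


print(consensus(["abl1", "ab11", "abl1"]))  # prints "abl1"
```

Positions where the copies disagree are also the natural places to focus a search, since every unanimous position can be treated as settled.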

percentcer 2 hours ago||
This is one of those things that seems like a nerd snipe but would be more easily accomplished through brute forcing it. Just get 76 people to manually type out one page each, you'd be done before the blog post was written.
jjwiseman 1 hour ago||
Or one person types 76 pages. This is a thing people used to do, not all that infrequently. Or maybe you have one friend who will help–cool, you just cut the time in half.
WolfeReader 2 hours ago|||
You think compelling 76 people to honestly and accurately transcribe files is something that's easy and quick to accomplish.
fragmede 2 hours ago||
> Just get 76 people

I consider myself fairly normal in this regard, but I don't have 76 friends to ask to do this, so I don't know how I'd go about doing this. Post an ad on craigslist? Fiverr? Seems like a lot to manage.

Krutonium 2 hours ago||
Amazon Mechanical Turk?
pimlottc 3 hours ago||
Why not just try every permutation of (1,l)? Let’s see, 76 pages, approx 69 lines per page, say there’s one instance of [1l] per line, that’s only… uh… 2^5244 possibilities…

Hmm. Anyone got some spare CPU time?
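
For scale, the estimate above can be checked with a few lines of arithmetic. The figure of one ambiguous character per line is the comment's own rough guess, not a measured value.

```python
pages = 76
lines_per_page = 69
ambiguous_per_line = 1  # rough guess taken from the comment

# Each ambiguous character is an independent binary choice (1 or l).
choices = pages * lines_per_page * ambiguous_per_line
decimal_digits = len(str(2 ** choices))

print(choices)         # 5244
print(decimal_digits)  # 1579
```

So 2^5244 is a number with 1579 decimal digits, which is why testing every combination blindly is out of the question.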

wahern 2 hours ago||
It should be much easier than that. You should be able to serially test if each edit decodes to a sane PDF structure, reducing the cost similar to how you can crack passwords when the server doesn't use a constant-time memcmp. Are PDFs typically compressed by default? If so that makes it even easier given built-in checksums. But it's just not something you can do by throwing data at existing tools. You'll need to build a testing harness with instrumentation deep in the bowels of the decoders. This kind of work is the polar opposite of what AI code generators or naive scripting can accomplish.
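
A minimal sketch of the checksum idea, assuming the stream in question is zlib-compressed (as PDF FlateDecode streams are) and has already been isolated as its own base64 fragment. A zlib stream ends with an Adler-32 checksum, so a wrong guess almost always makes `zlib.decompress` raise. The helper names are made up for this example, and it brute-forces one stream at a time rather than instrumenting a decoder as the comment suggests.

```python
import base64
import itertools
import zlib


def candidates(text):
    """Yield every 1/l assignment of the ambiguous positions in text."""
    slots = [i for i, c in enumerate(text) if c in "1l"]
    for combo in itertools.product("1l", repeat=len(slots)):
        chars = list(text)
        for i, c in zip(slots, combo):
            chars[i] = c
        yield "".join(chars)


def surviving(text):
    """Keep candidates whose bytes form a complete, valid zlib stream."""
    ok = []
    for cand in candidates(text):
        try:
            zlib.decompress(base64.b64decode(cand))
        except zlib.error:
            continue  # corrupt deflate data or Adler-32 mismatch
        ok.append(cand)
    return ok
```

The cost here is exponential only in the number of ambiguous characters inside a single stream, so the whole-file search splits into many small independent ones.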
pimlottc 34 minutes ago|||
I wonder if you could leverage some of the fuzzing frameworks that tools like Jepsen rely on. I’m sure there’s got to be one for PDF generation.
cluckindan 2 hours ago|||
On the contrary, that kind of one-off tooling seems a great fit for AI. Just specify the desired inputs, outputs and behavior as accurately as possible.
kevin_thibedeau 2 hours ago||
pdftoppm and Ghostscript (invoked via ImageMagick) re-rasterize full pages to generate their output. That's why it was slow. It's even worse with a Q16 build of ImageMagick. Better to extract the scanned page images directly with pdfimages or mutool.

Followup: pdfimages is 13x faster than pdftoppm
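
As a rough sketch of the direct-extraction route, the wrapper below shells out to Poppler's `pdfimages`. The function names, the `page` output prefix, and the directory layout are made up for this example, and it assumes poppler-utils is installed.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf, out_prefix):
    """Build the Poppler command that dumps embedded images without re-rendering."""
    # -all keeps JPEG, JPEG2000, JBIG2 and CCITT data in its native format
    # and writes other images as PNG (CMYK as TIFF).
    return ["pdfimages", "-all", str(pdf), str(out_prefix)]


def extract_images(pdf, out_dir):
    """Run pdfimages and return the files it wrote, sorted by name."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages (poppler-utils) is not installed")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf, out_dir / "page"), check=True)
    return sorted(out_dir.iterdir())
```

Since the scanned pages are already stored as images inside the PDF, this copies them out instead of rasterizing each page again, which is where the speedup comes from.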

velaia 2 hours ago|
Bummer that it's not December - the https://www.reddit.com/r/adventofcode/ crowd would love this puzzle